Table of Contents
- cs.CL [Total: 30]
- cs.CV [Total: 54]
- cs.AI [Total: 3]
- cs.MM [Total: 3]
- cs.GR [Total: 2]
- cs.RO [Total: 1]
- cs.LG [Total: 4]
- eess.IV [Total: 2]
- cs.SD [Total: 1]
cs.CL
[1] MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model
K. Sahit Reddy,N. Ragavenderan,Vasanth K.,Ganesh N. Naik,Vishalakshi Prabhu,Nagaraja G. S
Main category: cs.CL
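TL;DR: MedicalBERT is a BERT-based pretrained model built for the biomedical domain. With a domain-specific vocabulary and further optimization, it significantly improves performance on biomedical NLP tasks.
Details
Motivation: General-purpose BERT models perform poorly on biomedical text because they lack an understanding of domain-specific terminology. The authors therefore propose MedicalBERT, which is optimized for biomedical text.
Contribution: Proposes MedicalBERT, which improves biomedical NLP performance through a domain-specific vocabulary and pretraining optimizations, and outperforms existing BERT models on multiple tasks.
Method: Built on the BERT architecture, pretrained on a large biomedical dataset, equipped with a domain-specific vocabulary, and fine-tuned for tasks including named entity recognition and relation extraction.
Result: MedicalBERT outperforms BioBERT, SciBERT, and similar models on most benchmarks as measured by F1-score, accuracy, and related metrics, and improves on the general-purpose BERT model by 5.67% on average.
Insight: Combining a pretrained BERT model with domain-specific optimization can significantly improve biomedical NLP performance, which illustrates the potential of transfer learning in specialized domains.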
Abstract: Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can’t fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.
[2] Assessing the Capabilities and Limitations of FinGPT Model in Financial NLP Applications
Prudence Djagba,Chimezie A. Odinakachukwu
Main category: cs.CL
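TL;DR: An evaluation of FinGPT on six financial NLP tasks finds that it performs strongly on classification tasks such as sentiment analysis, often comparable to GPT-4, but markedly worse on reasoning and generation tasks such as financial question answering and summarization.
Details
Motivation: Assess the capabilities and limitations of FinGPT, a financial domain-specific language model, in real-world financial applications.
Contribution: Provides a benchmark of FinGPT across six tasks, with comparisons against GPT-4 and human benchmarks.
Method: Evaluates FinGPT on finance-specific datasets covering sentiment analysis, text classification, named entity recognition, financial question answering, text summarization, and stock movement prediction.
Result: FinGPT is strong on sentiment analysis and headline categorization, but shows notable gaps against GPT-4 and human benchmarks in numerical accuracy and complex reasoning.
Insight: FinGPT is effective for certain structured financial tasks but is not yet a comprehensive solution; financial language models need architectural improvements and domain-specific optimization.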
Abstract: This work evaluates FinGPT, a financial domain-specific language model, across six key natural language processing (NLP) tasks: Sentiment Analysis, Text Classification, Named Entity Recognition, Financial Question Answering, Text Summarization, and Stock Movement Prediction. The evaluation uses finance-specific datasets to assess FinGPT’s capabilities and limitations in real-world financial applications. The results show that FinGPT performs strongly in classification tasks such as sentiment analysis and headline categorization, often achieving results comparable to GPT-4. However, its performance is significantly lower in tasks that involve reasoning and generation, such as financial question answering and summarization. Comparisons with GPT-4 and human benchmarks highlight notable performance gaps, particularly in numerical accuracy and complex reasoning. Overall, the findings indicate that while FinGPT is effective for certain structured financial tasks, it is not yet a comprehensive solution. This research provides a useful benchmark for future research and underscores the need for architectural improvements and domain-specific optimization in financial language models.
[3] Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis
Li Li,Yongliang Wu,Jingze Zhu,Jiawei Peng,Jianfei Cai,Xu Yang
Main category: cs.CL
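TL;DR: Through external and internal analysis, this paper identifies effective in-context learning (ICL) configurations for large models on image captioning. The external experiments explore demonstration configuration strategies; the internal analysis examines model attention characteristics and proposes new metrics to quantify model behavior.
Details
Motivation: Inspired by the success of large language models (LLMs), researchers have built large multimodal models (LMMs) with ICL capabilities, but demonstration configuration for multimodal ICL remains underexplored. The controllability of in-context examples also offers an efficient way to analyze model characteristics.
Contribution: (1) External experiments that systematically explore demonstration configuration strategies; (2) an internal analysis that proposes attention-based metrics to quantify model behavior; (3) an exploration of attention-driven model acceleration and compression; (4) evidence of how pretraining data features affect model performance.
Method: (1) External: explore configuration strategies along three dimensions: shot number, image retrieval, and caption assignment. (2) Internal: analyze typical LMM attention patterns and develop attention-based metrics. (3) Auxiliary experiments test attention-driven acceleration and compression.
Result: The dual-perspective analysis reveals how demonstration configuration strategies relate to model performance and quantifies attention characteristics. The auxiliary experiments indicate that attention-driven optimization is feasible.
Insight: Combining external and internal analysis is a research method in its own right; the attention metrics apply to broader research areas; and pretraining data features significantly affect model performance.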
Abstract: The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angles of pre-training data features. Our study reveals both how ICEs configuration strategies impact model performance through external experiments and characteristic typical patterns through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.
[4] “Amazing, They All Lean Left” – Analyzing the Political Temperaments of Current LLMs
W. Russell Neuman,Chad Coleman,Ali Dasdan,Safinah Ali,Manan Shah,Kund Meghani
Main category: cs.CL
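TL;DR: This paper systematically studies the political leanings of seven prominent large language models (LLMs), finds that they generally favor liberal values such as care and fairness, and analyzes the influence of training data, RLHF, academic ethical frameworks, and safety fine-tuning.
Details
Motivation: Studies have found that commercial LLMs show a liberal orientation in their ethical and political responses, but the causes and implications remain unclear. This paper sets out to analyze the phenomenon systematically.
Contribution: Proposes a multi-pronged analysis framework (Moral Foundations Theory, political ideology scales, and more), identifies the causes of the liberal lean in LLMs, and distinguishes "bias" from legitimate epistemic differences.
Method: Tests seven LLMs along multiple dimensions (Moral Foundations Theory, political ideology scales, and more), compares base models with fine-tuned models, and quantifies the liberal lean and its sources.
Result: LLMs generally lean toward liberal values, and fine-tuning strengthens this lean. The root causes are factors such as training data and RLHF, not the personal preferences of developers.
Insight: The liberal lean of LLMs may reflect democratic rights-focused discourse, echoing Rawls' "veil of ignorance" idea, and offers a new lens for studying collective reasoning.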
Abstract: Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically investigates the political temperament of seven prominent LLMs – OpenAI’s GPT-4o, Anthropic’s Claude Sonnet 4, Perplexity (Sonar Large), Google’s Gemini 2.5 Flash, Meta AI’s Llama 4, Mistral 7b Le Chat and High-Flyer’s DeepSeek R1 – using a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales and a new index of current political controversies. We find strong and consistent prioritization of liberal-leaning values, particularly care and fairness, across most models. Further analysis attributes this trend to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse and safety-driven fine-tuning practices. We also distinguish between political “bias” and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine-tuned model pairs reveals that fine-tuning generally increases liberal lean, an effect confirmed through both self-report and empirical testing. We argue that this “liberal tilt” is not a programming error or the personal preference of programmers but an emergent property of training on democratic rights-focused discourse. Finally, we propose that LLMs may indirectly echo John Rawls’ famous veil-of-ignorance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective reasoning.
[5] Better Together: Quantifying the Benefits of AI-Assisted Recruitment
Ada Aka,Emil Palikot,Ali Ansari,Nima Yazdani
Main category: cs.CL
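TL;DR: Comparing traditional recruitment with AI-assisted recruitment, this study finds that AI assistance significantly raises the share of candidates who pass the final interview (54% vs 34%) and increases their subsequent employment rate (23% vs 18%). It also finds that the AI tends to select younger, less experienced candidates.
Details
Motivation: Examine how well AI works in recruitment and fill the gap in empirical evidence on how AI affects hiring efficiency and candidate selection.
Contribution: Provides quantitative evidence of the effect of AI-assisted recruitment on efficiency and candidate quality, and reveals the AI's selection preferences and their potential implications.
Method: Randomly assigns 37,000 applicants to either a traditional recruitment process or an AI-assisted pipeline (AI video interview followed by human evaluation). The final interview is conducted uniformly by interviewers blind to the selection method, and employment outcomes are analyzed from later LinkedIn data.
Result: The final-interview pass rate in the AI-assisted pipeline is 20 percentage points higher (54% vs 34%), and the subsequent employment rate is 5.9 percentage points higher than in the traditional group (23% vs 18%). The AI tends to select younger, less experienced candidates.
Insight: AI can significantly improve recruitment efficiency but may introduce age and experience bias, so its potential social and ethical implications need attention.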
Abstract: Artificial intelligence (AI) is increasingly used in recruitment, yet empirical evidence quantifying its impact on hiring efficiency and candidate selection remains limited. We randomly assign 37,000 applicants for a junior-developer position to either a traditional recruitment process (resume screening followed by human selection) or an AI-assisted recruitment pipeline incorporating an initial AI-driven structured video interview before human evaluation. Candidates advancing from either track faced the same final-stage human interview, with interviewers blind to the earlier selection method. In the AI-assisted pipeline, 54% of candidates passed the final interview compared with 34% from the traditional pipeline, yielding an average treatment effect of 20 percentage points (SE 12 pp.). Five months later, we collected LinkedIn profiles of top applicants from both groups and found that 18% (SE 1.1%) of applicants from the traditional track found new jobs compared with 23% (SE 2.3%) from the AI group, resulting in a 5.9 pp. (SE 2.6 pp.) difference in the probability of finding new employment between groups. The AI system tended to select younger applicants with less experience and fewer advanced credentials. We analyze AI-generated interview transcripts to examine the selection criteria and conversational dynamics. Our findings contribute to understanding how AI technologies affect decision making in recruitment and talent acquisition while highlighting some of their potential implications.
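The standard error reported for the employment gap can be roughly sanity-checked from the per-group figures in the abstract. The sketch below assumes the two group estimates are independent, so their standard errors combine in quadrature; the abstract does not describe the authors' actual estimation procedure, and the helper name `se_of_difference` is introduced here for illustration only.

```python
import math


def se_of_difference(se_a: float, se_b: float) -> float:
    """Standard error of a difference between two independent estimates."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


# Rounded figures quoted in the abstract, in percentage points.
traditional_rate, traditional_se = 18.0, 1.1
ai_rate, ai_se = 23.0, 2.3

diff = ai_rate - traditional_rate
se_diff = se_of_difference(traditional_se, ai_se)
print(f"difference = {diff:.1f} pp, SE = {se_diff:.2f} pp")
```

With the rounded inputs this gives a combined standard error of about 2.55 pp, close to the reported 2.6 pp. The rounded rates differ by 5.0 pp rather than the reported 5.9 pp; the abstract does not say whether that gap comes from rounding or from the estimation method.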
[6] A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models
Sonali Sharma,Ahmed M. Alaa,Roxana Daneshjou
Main category: cs.CL
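TL;DR: This paper systematically analyzes the decline of medical safety messaging in generative AI models (LLMs and VLMs) and finds that, from 2022 to 2025, the share of model outputs containing a medical disclaimer fell sharply.
Details
Motivation: Generative AI models are increasingly used in medicine, but their outputs are often inaccurate. Medical disclaimers are an important reminder that these outputs have not been professionally vetted. The study evaluates how the presence of such disclaimers in model outputs has changed over time.
Contribution: Using a large set of medical images and questions, the paper quantifies the decline of medical disclaimers in generative AI outputs and stresses the need for better safeguards.
Method: The study uses 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions to analyze how disclaimers in LLM and VLM outputs changed between 2022 and 2025.
Result: Medical disclaimers in LLM outputs fell from 26.3% in 2022 to 0.97% in 2025, and in VLM outputs from 19.6% in 2023 to 1.05% in 2025. By 2025, the majority of models displayed no disclaimers.
Insight: As models become more capable and authoritative, missing disclaimers may increase medical risk; safeguards should adapt to the clinical context of each output.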
Abstract: Generative AI models, including large language models (LLMs) and vision-language models (VLMs), are increasingly used to interpret medical images and answer clinical questions. Their responses often include inaccuracies; therefore, safety measures like medical disclaimers are critical to remind users that AI outputs are not professionally vetted or a substitute for medical advice. This study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025. Using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions, outputs were screened for disclaimer phrases. Medical disclaimer presence in LLM and VLM outputs dropped from 26.3% in 2022 to 0.97% in 2025, and from 19.6% in 2023 to 1.05% in 2025, respectively. By 2025, the majority of models displayed no disclaimers. As public models become more capable and authoritative, disclaimers must be implemented as a safeguard adapting to the clinical context of each output.
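The screening step ("outputs were screened for disclaimer phrases") can be pictured as simple pattern matching over model outputs. The following is a minimal sketch under that assumption; the phrase list is hypothetical, since the abstract does not give the study's actual phrase set or matching rules.

```python
import re
from typing import Sequence

# Hypothetical phrase list for illustration; not the study's actual criteria.
DISCLAIMER_PATTERNS = [
    r"\bnot (a )?medical advice\b",
    r"\bconsult (a|your) (doctor|physician|healthcare professional)\b",
    r"\bI am not a (doctor|medical professional)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in DISCLAIMER_PATTERNS]


def has_disclaimer(text: str) -> bool:
    """Return True if any disclaimer pattern occurs in the text."""
    return any(p.search(text) for p in _COMPILED)


def disclaimer_rate(outputs: Sequence[str]) -> float:
    """Fraction of outputs that contain at least one disclaimer phrase."""
    if not outputs:
        return 0.0
    return sum(has_disclaimer(o) for o in outputs) / len(outputs)


sample_outputs = [
    "This is not medical advice; please consult a doctor.",
    "The image shows a nodule in the left lung.",
]
print(disclaimer_rate(sample_outputs))
```

A rate computed this way per model and per year would yield percentages of the kind the abstract reports, although a real study would need a validated phrase list and probably manual review of borderline cases.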
[7] Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding
Hong Jia,Shiya Fu,Vassilis Kostakos,Feng Xia,Ting Dang
Main category: cs.CL
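TL;DR: This paper studies how small language models (SLMs) perform on mental health understanding tasks. Despite their small parameter counts, SLMs come close to large language models (LLMs) on binary classification, and few-shot learning improves them substantially.
Details
Motivation: With SLMs emerging as privacy-preserving alternatives, it matters whether their understanding in sensitive domains such as mental health is good enough to rival LLMs.
Contribution: (1) A systematic evaluation of SLMs on mental health tasks, showing performance close to LLMs; (2) evidence that few-shot learning significantly improves SLMs; (3) the case that SLMs can serve as practical privacy-preserving tools.
Method: Uses zero-shot and few-shot learning paradigms to compare five SLMs against three LLMs on six mental health tasks, measured by F1 score and related metrics.
Result: On binary classification tasks, mean SLM performance is within 2% of LLMs, and few-shot learning improves SLM performance by up to 14.6%.
Insight: Model scale is not the only key to mental health understanding; the ability of SLMs to adapt quickly makes them suitable for privacy-sensitive clinical applications.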
Abstract: The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data. In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.
[8] Integrating External Tools with Large Language Models to Improve Accuracy
Nripesh Niketan,Hadj Batatia
Main category: cs.CL
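TL;DR: The paper proposes a framework that integrates external tools with large language models (LLMs) to improve accuracy on educational and scientific reasoning tasks. By calling external APIs and computational tools, the framework significantly outperforms existing baselines.
Details
Motivation: LLMs produce low-quality answers or hallucinate when they lack context. The paper proposes integrating external tools to supply up-to-date data and computational capabilities and so improve model performance.
Contribution: Proposes the Athena framework, which gives LLMs dynamic information support through integrated external tools (such as APIs and calculators) and significantly improves accuracy on mathematical and scientific reasoning tasks.
Method: Develops a framework that lets LLMs call external APIs for additional information and integrates computational tools. Evaluated on datasets from the MMLU collection (mathematical and scientific reasoning questions).
Result: Athena reaches 83% accuracy on mathematical reasoning and 88% on scientific reasoning, well above the baselines (for example, 67% and 79% for LLaMA-Large).
Insight: Dynamically integrating external tools lets LLMs support complex tasks more naturally and points toward computing ecosystems built around LLMs.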
Abstract: This paper deals with improving querying large language models (LLMs). It is well-known that without relevant contextual information, LLMs can provide poor quality responses or tend to hallucinate. Several initiatives have proposed integrating LLMs with external tools to provide them with up-to-date data to improve accuracy. In this paper, we propose a framework to integrate external tools to enhance the capabilities of LLMs in answering queries in educational settings. Precisely, we develop a framework that allows accessing external APIs to request additional relevant information. Integrated tools can also provide computational capabilities such as calculators or calendars. The proposed framework has been evaluated using datasets from the Multi-Modal Language Understanding (MMLU) collection. The data consists of questions on mathematical and scientific reasoning. Results compared to state-of-the-art language models show that the proposed approach significantly improves performance. Our Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5, with the best baseline model (LLaMA-Large) achieving only 67% and 79% respectively. These promising results open the way to creating complex computing ecosystems around LLMs to make their use more natural to support various tasks and activities.
[9] Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians’ Insights
Deepali Mishra,Chaklam Silpasuwanchai,Ashutosh Modi,Madhumita Sushil,Sorayouth Chumnanvej
Main category: cs.CL
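TL;DR: Through a literature review and a clinician survey, this paper systematically analyzes the current state and challenges of medical visual question answering (MedVQA) in radiology workflows, exposes its disconnect from clinical needs, and proposes directions for improvement.
Details
Motivation: Although MedVQA shows promise for automated medical image interpretation, its practical use in clinical workflows remains limited. The paper examines MedVQA's utility and the barriers it faces through a systematic review and a clinician survey.
Contribution: (1) A systematic review of 68 publications from 2018 to 2024 that maps the state of MedVQA research and its gaps; (2) a survey of 50 clinicians that captures their views on MedVQA's clinical utility and their needs.
Method: Follows the Arksey and O’Malley scoping review framework with two components: (1) a literature review that extracts key concepts, advancements, and research gaps; (2) a clinician survey that assesses MedVQA's clinical relevance and needs.
Result: Nearly 60% of QA pairs lack clinical diagnostic value. Existing datasets and models fall short in supporting multi-view and multi-resolution imaging, electronic health record (EHR) integration, and domain knowledge. Only 29.8% of clinicians consider MedVQA systems highly useful.
Insight: Clinical integration of MedVQA is held back by limited multimodal analysis and a lack of patient history and domain knowledge. Directions for improvement include dialogue-based interactive systems, multi-view image support, models focused on specific anatomical regions, and evaluation metrics better aligned with clinical needs.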
Abstract: Medical Visual Question Answering (MedVQA) is a promising tool to assist radiologists by automating medical image interpretation through question answering. Despite advances in models and datasets, MedVQA’s integration into clinical workflows remains limited. This study systematically reviews 68 publications (2018-2024) and surveys 50 clinicians from India and Thailand to examine MedVQA’s practical utility, challenges, and gaps. Following the Arksey and O’Malley scoping review framework, we used a two-pronged approach: (1) reviewing studies to identify key concepts, advancements, and research gaps in radiology workflows, and (2) surveying clinicians to capture their perspectives on MedVQA’s clinical relevance. Our review reveals that nearly 60% of QA pairs are non-diagnostic and lack clinical relevance. Most datasets and models do not support multi-view, multi-resolution imaging, EHR integration, or domain knowledge, features essential for clinical diagnosis. Furthermore, there is a clear mismatch between current evaluation metrics and clinical needs. The clinician survey confirms this disconnect: only 29.8% consider MedVQA systems highly useful. Key concerns include the absence of patient history or domain knowledge (87.2%), preference for manually curated datasets (51.1%), and the need for multi-view image support (78.7%). Additionally, 66% favor models focused on specific anatomical regions, and 89.4% prefer dialogue-based interactive systems. While MedVQA shows strong potential, challenges such as limited multimodal analysis, lack of patient context, and misaligned evaluation approaches must be addressed for effective clinical integration.
[10] CRISP: Complex Reasoning with Interpretable Step-based Plans
Matan Vetzler,Koren Lazar,Guy Uziel,Eran Hirsch,Ateret Anaby-Tavor,Leshem Choshen
Main category: cs.CL
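TL;DR: CRISP is a multi-domain dataset designed to improve the complex reasoning of large language models (LLMs) through explicit high-level plan generation. Its plans are automatically generated and rigorously validated, and a small model fine-tuned on it significantly outperforms few-shot prompting and Chain-of-Thought reasoning.
Details
Motivation: Chain-of-Thought reasoning falls short on complex problems, and explicit high-level planning approaches usually assume that LLMs can produce effective plans through few-shot prompting alone, an assumption that has not been validated.
Contribution: (1) The CRISP dataset, with high-quality plans for mathematical reasoning and code generation; (2) automatic generation with two-fold validation (intrinsic and extrinsic) to ensure plan quality; (3) evidence that a fine-tuned small model is superior at plan generation and generalizes across domains.
Method: (1) Automatically generate high-level plans; (2) use an LLM as an intrinsic judge and downstream task performance as extrinsic validation; (3) fine-tune a small model to generate high-quality plans.
Result: The fine-tuned small model generates higher-quality plans than much larger few-shot models and significantly outperforms Chain-of-Thought reasoning. Out-of-domain evaluation shows that fine-tuning generalizes.
Insight: Combining explicit plan generation with fine-tuning can significantly improve complex reasoning, and the generality of the learned plans supports cross-domain transfer.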
Abstract: Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated–both intrinsically, using an LLM as a judge, and extrinsically, by evaluating their impact on downstream task performance. We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.
[11] AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Talor Abramovich,Gal Chechik
Main category: cs.CL
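TL;DR: AblationBench is a benchmark suite for evaluating automated planning of ablation experiments. It comprises two tasks, AuthorAblation and ReviewerAblation, and includes an automatic evaluation framework based on language models. The best current language model system identifies only 29% of the original ablations, which shows how challenging the tasks are.
Details
Motivation: Autonomous agents and language models are increasingly used in scientific research, but ablation planning, a key part of empirical AI research, has not been adequately evaluated. This paper aims to fill that gap.
Contribution: (1) The AblationBench benchmark suite, with two tasks; (2) an automatic evaluation framework built on LM-based judges; (3) an analysis of the limitations of current language models on ablation planning.
Method: (1) Build datasets for the AuthorAblation and ReviewerAblation tasks; (2) design an LM-based automatic evaluation framework; (3) test frontier language models and compare Chain-of-Thought prompting with the existing agent-based approach.
Result: The best current language model system identifies only 29% of the original ablations on average. Chain-of-Thought prompting outperforms the existing agent-based approach.
Insight: Ablation planning remains challenging for language models, and methods need further work, particularly on model reasoning.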
Abstract: Autonomous agents built on language models (LMs) are showing increasing popularity in many fields, including scientific research. AI co-scientists aim to support or automate parts of the research process using these agents. A key component of empirical AI research is the design of ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 29% of the original ablations on average. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms the currently existing agent-based approach.
[12] Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Junyi Wen,Junyuan Liang,Zicong Hong,Wuhui Chen,Zibin Zheng
Main category: cs.CL
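TL;DR: Krul is an efficient state restoration system for multi-turn conversations. Dynamic cross-layer KV sharing reduces the storage and compute overhead of the KV cache and significantly improves generation efficiency.
Details
Motivation: In multi-turn conversations, recomputing or loading the KV cache is expensive. Existing compression methods are static and uniform, ignore differences in attention patterns between conversations, and lose accuracy as a result.
Contribution: (1) A preemptive compression strategy selector; (2) a token-wise heterogeneous attention similarity estimator; (3) a bubble-free restoration scheduler.
Method: Krul selects a compression strategy dynamically per conversation and optimizes KV cache restoration based on attention similarity, using the preemptive strategy selector, the similarity estimator, and the scheduler.
Result: Experiments show that Krul reduces time-to-first-token (TTFT) by 1.5x-2.68x and KV cache storage by 1.33x-2.35x while preserving generation quality.
Insight: Adapting dynamically to each conversation works better than static compression, and cross-layer attention similarity is the key signal for optimizing the KV cache.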
Abstract: Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector to preserve critical context for future conversation turns and selects a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator to mitigate the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler to reduce potential bubbles brought by the imbalance of recomputing and loading stream due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
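To make conversation-specific layer-pair selection concrete, the sketch below picks adjacent layer pairs whose attention patterns are similar enough to share a KV cache. It is an illustrative toy and not Krul's algorithm: the cosine measure, the 0.9 threshold, the greedy non-overlapping pairing, and the function names are all assumptions introduced here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_shared_pairs(
    layer_attn: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Greedily pick adjacent layer pairs with similar attention patterns.

    layer_attn[i] is a (hypothetical) attention summary for layer i in one
    conversation. Each layer joins at most one pair.
    """
    pairs: List[Tuple[int, int]] = []
    i = 0
    while i < len(layer_attn) - 1:
        if cosine(layer_attn[i], layer_attn[i + 1]) >= threshold:
            pairs.append((i, i + 1))
            i += 2  # skip the partner layer so pairs never overlap
        else:
            i += 1
    return pairs


# Two toy conversations with different attention patterns per layer.
conv_a = [[0.7, 0.2, 0.1], [0.68, 0.22, 0.10], [0.1, 0.1, 0.8], [0.12, 0.08, 0.8]]
conv_b = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.12, 0.08, 0.8], [0.3, 0.4, 0.3]]
print(select_shared_pairs(conv_a))
print(select_shared_pairs(conv_b))
```

The two toy conversations end up with different pairings, which is the point the abstract makes against a fixed scheme that compresses the same layer pairs for every conversation.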
[13] GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs
Sebastian Walter,Hannah Bast
Main category: cs.CL
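TL;DR: GRASP is a fine-tuning-free approach that uses a large language model (LLM) to generate SPARQL queries from natural language questions or keyword queries, and it performs well across a variety of knowledge graphs and benchmarks.
Details
Motivation: Existing methods usually require fine-tuning for a specific knowledge graph. GRASP aims for generic zero-shot or few-shot SPARQL generation with less dependence on graph-specific data.
Contribution: (1) A generic approach that requires no fine-tuning; (2) exploration of the knowledge graph by strategically executing SPARQL queries; (3) validation across multiple knowledge graphs and models.
Method: Uses a large language model to generate SPARQL queries and explores the knowledge graph dynamically by executing queries to retrieve relevant IRIs and literals.
Result: State-of-the-art results on Wikidata, close to the best few-shot methods on Freebase, and good performance on other knowledge graphs.
Insight: Language models can explore knowledge graphs strategically, which reduces dependence on data-specific fine-tuning and applies to knowledge graphs of many sizes and kinds.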
Abstract: We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.
[14] TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs
Duygu Nur Yaldiz,Yavuz Faruk Bakman,Sungmin Kang,Alperen Öziş,Hayrettin Eren Yildiz,Mitash Ashish Shah,Zhiqi Huang,Anoop Kumar,Alfy Samuel,Daben Liu,Sai Praneeth Karimireddy,Salman Avestimehr
Main category: cs.CL
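TL;DR: TruthTorchLM is an open-source Python library with more than 30 truthfulness prediction methods that span different computational costs, access levels, and supervision types. It works with HuggingFace and LiteLLM models and is evaluated on several datasets.
Details
Motivation: The truthfulness of large language model (LLM) outputs is hard to guarantee, and existing toolkits such as Guardrails and LM-Polygraph are limited in scope, so a more comprehensive and flexible truthfulness prediction tool is needed.
Contribution: Introduces TruthTorchLM, an open-source, comprehensive Python library with more than 30 truthfulness prediction methods and support for multiple models and interfaces, covering ground that existing toolkits leave open.
Method: Integrates more than 30 truthfulness prediction methods spanning tradeoffs in computational cost, access level, and supervision type, behind a unified interface for generation, evaluation, and calibration.
Result: Representative methods are evaluated on TriviaQA, GSM8K, and FactScore-Bio, demonstrating the library's practicality and flexibility.
Insight: With its range of methods and flexible interface, TruthTorchLM makes truthfulness prediction research more efficient and accessible and supports high-stakes applications.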
Abstract: Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at https://github.com/Ybakman/TruthTorchLM
[15] Simple Mechanistic Explanations for Out-Of-Context Reasoning
Atticus Wang,Joshua Engels,Oliver Clive-Griffin
Main category: cs.CL
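TL;DR: This paper offers a simple mechanistic explanation for the out-of-context reasoning (OOCR) that LLMs show after fine-tuning: LoRA fine-tuning essentially adds a constant steering vector, which produces generalization across many related tasks.
Details
Motivation: Investigate the mechanism behind OOCR and why fine-tuned LLMs generalize to out-of-distribution tasks, a question that matters for the safe and reliable deployment of LLMs.
Contribution: Identifies a simple mechanism behind OOCR, namely that LoRA fine-tuning adds a constant steering vector, and shows that the mechanism holds even for a task that appears to require conditional behavior (model backdoors).
Method: Analyzes the constant steering vector introduced by LoRA fine-tuning and its effect on task generalization, and trains steering vectors directly to confirm that they suffice.
Result: A constant steering vector explains OOCR and yields generalization across many concept-related tasks, including the conditional-behavior task.
Insight: OOCR's generalization may stem from a simple mechanism instead of complex conditional reasoning, which offers a new perspective on LLM transparency and controllability.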
Abstract: Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors); it turns out that unconditionally adding a steering vector is sufficient. Overall, our work presents one explanation of what gets learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant to their safe and reliable deployment.
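The phrase "constant steering vector" can be made concrete with a small sketch: the same vector is added to the hidden state at every token position, regardless of the input. The code below uses plain Python lists so that it runs without dependencies; the function name, the `scale` parameter, and the toy dimensions are assumptions for illustration, and the paper's actual layer choice and training setup are not specified in the abstract.

```python
from typing import List

Vector = List[float]


def add_steering(
    hidden_states: List[Vector], steering: Vector, scale: float = 1.0
) -> List[Vector]:
    """Add the same (input-independent) vector to every token's hidden state."""
    return [
        [h + scale * s for h, s in zip(token_state, steering)]
        for token_state in hidden_states
    ]


# Toy example: two token positions, hidden size 2.
hidden = [[0.0, 1.0], [2.0, 3.0]]
steering_vector = [0.5, -0.5]
steered = add_steering(hidden, steering_vector)
print(steered)
```

Because the added vector does not depend on the token or the prompt, the intervention is unconditional, which is why the backdoor result in the abstract is notable: a task that looks conditional is still induced by an unconditional shift.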
[16] Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension?
KV Aditya Srivatsa,Kaushal Kumar Maurya,Ekaterina Kochmar
Main category: cs.CL
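TL;DR: The paper asks whether large language models (LLMs) can reliably simulate real students' abilities in mathematics and reading comprehension. It finds that, without guidance, strong general-purpose models consistently outperform the average student, and that alignment depends heavily on the combination of model and prompt.
Details
Motivation: Assess how reliable LLMs are as proxy students in Intelligent Tutoring Systems (ITSs) and in piloting test questions, to determine whether they accurately emulate real student behavior.
Contribution: Uses an IRT model to quantify how 11 LLMs align with real student abilities and provides guidelines for selecting proxies based on the findings.
Method: Collects a dataset of 489 NAEP items, applies an IRT model to place LLMs on the same ability scale as real student populations, and tests the effect of different prompts.
Result: Strong general-purpose LLMs consistently exceed the average student, but the behavior of model-prompt combinations is highly inconsistent, and no pair fits across subjects and grades.
Insight: Current LLMs cannot reliably simulate real students' ability levels; new training and evaluation strategies are needed.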
Abstract: Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models’ performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
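Placing a model on a student ability scale with Item Response Theory can be illustrated with a small sketch. The abstract does not say which IRT variant the authors fit, so the code below assumes the common two-parameter logistic (2PL) model, P(correct) = 1 / (1 + exp(-a(θ - b))), with hypothetical item parameters and a simple grid-search maximum-likelihood estimate of the ability θ.

```python
import math
from typing import List, Sequence, Tuple

Item = Tuple[float, float]  # (discrimination a, difficulty b)


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a test-taker with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta: float, responses: Sequence[int], items: Sequence[Item]) -> float:
    """Log-likelihood of a 0/1 response pattern given ability theta."""
    total = 0.0
    for r, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total


def estimate_ability(responses: Sequence[int], items: Sequence[Item]) -> float:
    """Grid-search maximum-likelihood estimate of theta on [-4, 4]."""
    grid: List[float] = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Hypothetical item bank: easy, medium, and hard items with equal discrimination.
ITEMS: List[Item] = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

strong = estimate_ability([1, 1, 0], ITEMS)  # misses only the hard item
weak = estimate_ability([1, 0, 0], ITEMS)    # solves only the easy item
print(strong, weak)
```

Once item parameters are calibrated on real student responses, an LLM's answers to the same items give it a θ on the same scale, which is what allows the comparison with the average grade-level student described in the abstract.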
[17] Exploring Gender Differences in Chronic Pain Discussions on Reddit
Ancita Maria Andrade,Tanvi Banerjee,Ramakrishna Mundugar
Main category: cs.CL
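TL;DR: This paper uses NLP (HAM-CNN) to analyze gender differences in chronic pain discussions on Reddit. It finds that posts by women are more emotionally focused and reveals gender differences in specific conditions and in responses to medication.
Details
Motivation: Earlier studies often overlooked the role of gender in pain experiences; this study uses NLP to explore gender differences in depth.
Contribution: Classifies posts by gender with HAM-CNN (F1 score of 0.86) and reveals linguistic differences as well as gender differences in specific conditions and medication responses.
Method: Applies the HAM-CNN model to classify Reddit posts by gender, then compares male and female posts through linguistic analysis.
Result: Female posts are more emotionally focused; migraine and sinusitis are more prevalent among women; the effects of pain medication also differ by gender.
Insight: Gender is an important factor in pain experience; future research and treatment should account for gender differences to be better targeted.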
Abstract: Pain is an inherent part of human existence, manifesting as both physical and emotional experiences, and can be categorized as either acute or chronic. Over the years, extensive research has been conducted to understand the causes of pain and explore potential treatments, with contributions from various scientific disciplines. However, earlier studies often overlooked the role of gender in pain experiences. In this study, we utilized Natural Language Processing (NLP) to analyze and gain deeper insights into individuals’ pain experiences, with a particular focus on gender differences. We successfully classified posts into male and female corpora using the Hidden Attribute Model-Convolutional Neural Network (HAM-CNN), achieving an F1 score of 0.86 by aggregating posts based on usernames. Our analysis revealed linguistic differences between genders, with female posts tending to be more emotionally focused. Additionally, the study highlighted that conditions such as migraine and sinusitis are more prevalent among females and explored how pain medication affects individuals differently based on gender.
[18] KAT-V1: Kwai-AutoThink Technical Report
Zizheng Zhan,Ken Deng,Huaixi Tang,Wen Xiang,Kun Wu,Weihao Li,Wenqiang Zhu,Jingxuan Xu,Lecheng Huang,Zongxian Feng,Shaojie Wang,Shangpeng Yan,Jiaheng Liu,Zhongyuan Peng,Zuchen Gao,Haoyang Huang,Ziqi Zhan,Yanan Wu,Yuanxing Zhang,Jian Yang,Guang Chen,Haotian Zhang,Bin Chen,Bing Yu
Main category: cs.CL
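TL;DR: Introduces KAT-V1, which uses an automatic-thinking training paradigm to switch dynamically between reasoning and non-reasoning modes. Combined with Multi-Token Prediction (MTP)-enhanced knowledge distillation and the reinforcement learning algorithm Step-SRPO, it improves reasoning efficiency and reduces token usage by up to approximately 30%.
Details
Motivation: To address overthinking in reasoning-intensive tasks and improve the model’s reasoning efficiency and accuracy.
Contribution: 1. An automatic-thinking training paradigm that dynamically switches reasoning modes; 2. A dual-regime dataset combined with MTP-enhanced knowledge distillation; 3. The Step-SRPO reinforcement learning algorithm; 4. Validation of efficiency and scalability in real-world use.
Method: 1. Builds the dual-regime dataset with a multi-agent synthesis strategy; 2. Applies MTP-enhanced knowledge distillation; 3. Uses a cold-start initialization strategy and Step-SRPO to optimize reasoning-mode selection and response accuracy.
Result: Performs strongly on multiple benchmarks while reducing token usage by up to approximately 30%, and has been deployed in a real-world application.
Insight: Combining dynamic reasoning-mode switching with multi-token prediction improves reasoning efficiency, and intermediate supervision in reinforcement learning further improves performance.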
Abstract: We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.
[19] Improving MLLM’s Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency
Yupu Liang,Yaping Zhang,Zhiyang Zhang,Zhiyuan Chen,Yang Zhao,Lu Xiang,Chengqing Zong,Yu Zhou
Main category: cs.CL
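TL;DR: This paper proposes a new fine-tuning paradigm called Synchronously Self-Reviewing (SSR). By having the model generate OCR text before translating, it mitigates catastrophic forgetting in the DIMT task while preserving OCR ability.
Details
Motivation: Existing MLLMs perform well on multimodal tasks but face both cross-modal and cross-lingual challenges in Document Image Machine Translation (DIMT). Conventional fine-tuning causes the model to forget its monolingual abilities, such as OCR.
Contribution: Proposes the SSR fine-tuning paradigm, inspired by the concept of “Bilingual Cognitive Advantage”, in which the model synchronously self-reviews its OCR proficiency so that translation draws on its existing OCR ability and catastrophic forgetting is avoided.
Method: Trains the model with SFT to produce both OCR text and translation text in one pass; SSR generates them step by step, OCR first and translation second.
Result: Experiments show that SSR helps mitigate catastrophic forgetting and improves generalization on both OCR and DIMT tasks.
Insight: Explicitly using a model’s monolingual ability (such as OCR) to support a multimodal task (DIMT) can improve performance and reduce forgetting.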
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model’s existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept “Bilingual Cognitive Advantage”. Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.
[20] What Factors Affect LLMs and RLLMs in Financial Question Answering?
Peng Wang,Xuesi Hu,Jiageng Wu,Yuntao Zou,Qiancheng Zhang,Dagang Li
Main category: cs.CL
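TL;DR: This paper studies how prompting methods, agentic frameworks, and multilingual alignment methods affect LLMs and RLLMs on financial question answering. These methods improve LLM performance by simulating Long Chain-of-Thought (Long CoT), but they add little for RLLMs.
Details
Motivation: Few works systematically explore how to fully unlock the performance of LLMs and RLLMs in the financial domain; this study aims to fill that gap.
Contribution: Experimentally evaluates the impact of three kinds of methods (prompting methods, agentic frameworks, and multilingual alignment) on LLMs and RLLMs, and summarizes their strengths, weaknesses, and scope of applicability.
Method: Uses five LLMs and three RLLMs to compare the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks.
Result: (1) Prompting methods and agentic frameworks improve LLM performance by simulating Long CoT; (2) RLLMs already have inherent Long CoT capabilities, so conventional methods yield limited further gains; (3) Multilingual alignment methods improve the multilingual performance of LLMs mainly by extending reasoning length.
Insight: Improving LLMs and RLLMs in finance requires methods matched to each model type; conventional methods do little for RLLMs, which call for targeted optimization.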
Abstract: Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.
[21] Exploring Design of Multi-Agent LLM Dialogues for Research Ideation
Keisuke Ueda,Wataru Hirota,Takuto Asakura,Takahiro Omi,Kosuke Takahashi,Kosuke Arima,Tatsuya Ishigaki
Main category: cs.CL
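TL;DR: This paper studies how to design multi-agent LLM dialogues for research ideation. It compares how different configurations affect the novelty and feasibility of ideas and finds that more agents, deeper interaction, and more diverse agent personas enrich the diversity of generated ideas.
Details
Motivation: Prior work shows that structured dialogues between LLMs can improve the quality of generated ideas, but the optimal design of such dialogues remains unclear. This paper aims to fill that gap by examining how multi-agent interactions should be configured.
Contribution: Systematically analyzes how agent roles, number of agents, and dialogue depth affect idea generation, and offers design guidelines for improving novelty and feasibility.
Method: Designs multi-agent dialogue experiments built around an ideation-critique-revision loop and compares configurations that vary the number of agents, persona diversity, and interaction depth.
Result: Increasing the number of agents, the interaction depth, and persona diversity each enriches the diversity of generated ideas; increasing critic-side diversity further improves the feasibility of the final proposals.
Insight: Designing multi-agent LLM systems requires jointly considering agent count, interaction depth, and role assignment; diversity among critics is especially important for feasibility.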
Abstract: Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at https://github.com/g6000/MultiAgent-Research-Ideator.
[22] The Curious Case of Factuality Finetuning: Models’ Internal Beliefs Can Improve Factuality
Benjamin Newman,Abhilasha Ravichander,Jaehun Jung,Rui Xin,Hamish Ivison,Yegor Kuznetsov,Pang Wei Koh,Yejin Choi
Main category: cs.CL
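TL;DR: This paper studies how to fine-tune language models to reduce hallucination (the generation of factually incorrect text). Although conventional approaches rely on high-quality factual data, fine-tuning on self-generated data that the model itself judges to be factual improves factuality more effectively.
Details
Motivation: Language models often hallucinate. Fine-tuning on high-quality factual data may reduce hallucination, but such data is expensive to obtain, and training on correct but unfamiliar data may lead to even more downstream hallucination. The study seeks the fine-tuning data strategy that best mitigates hallucination.
Contribution: Shows that fine-tuning on model-generated data that the model believes to be factual helps more than fine-tuning on factual gold data, and that data filtered by the model’s own internal judgments often outperforms other strategies.
Method: Compares fine-tuning data strategies, including factual gold data, model-generated data, and data filtered by the model’s internal judgments, and evaluates them on long-form generation tasks.
Result: Fine-tuning on model-generated data filtered by the model’s own internal judgments often gives the best overall factuality, and the improvements transfer across the three domains studied.
Insight: A model’s own beliefs can provide a strong signal for factuality, offering a new direction for research on reducing hallucination.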
Abstract: Language models are prone to hallucination, that is, generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain: obtaining factual gold data can be expensive, and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models’ own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models’ judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a model’s own beliefs can provide a powerful signal for factuality.
[23] ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains
Zilu Dong,Xiangqing Shen,Zinong Yang,Rui Xia
Main category: cs.CL
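TL;DR: ChainEdit combines logical rules derived from knowledge graphs with the reasoning capabilities of LLMs to update chains of related knowledge systematically, improving logical generalization by more than 30% over baselines.
Details
Motivation: Existing LLM knowledge editing methods struggle to maintain logical consistency when propagating ripple effects to associated facts, so a systematic update method is needed.
Contribution: Proposes the ChainEdit framework, which combines knowledge-graph logical rules with LLM reasoning to dynamically generate and edit logically connected knowledge clusters.
Method: Automatically extracts logical patterns from structured knowledge bases, aligns them with the LLM’s internal logic, and updates knowledge along rule-guided chains.
Result: Experiments show an improvement of more than 30% in logical generalization while preserving editing reliability and specificity.
Insight: Knowledge-aware protocols address evaluation biases in existing benchmarks, and the work sets a new state of the art on ripple effects while keeping internal logic consistent after editing.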
Abstract: Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs’ internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.
[24] Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study
Marina Luketina,Andrea Benkel,Christoph G. Schuetz
Main category: cs.CL
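TL;DR: This paper experimentally evaluates the ability of large language models (LLMs) to support legal decision-making in Austrian value-added tax (VAT) law. LLMs show potential for supporting tax consultants and automating routine tasks, but hallucination and the sensitivity of the legal domain remain challenges.
Details
Motivation: In tax consulting practice, clients usually describe cases in natural language, which makes LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. However, the tendency of LLMs to hallucinate is a challenge for legal analysis and decisions.
Contribution: 1. An experimental evaluation of LLM capability in VAT legal decision-making; 2. A comparison of fine-tuning and retrieval-augmented generation (RAG); 3. An account of the potential and limitations of LLMs for supporting tax consultants.
Method: Applies fine-tuning and RAG to textbook cases and to real-world cases from a tax consulting firm, evaluating the LLMs’ legal-reasoning capabilities and the effect of different configurations.
Result: Properly configured LLMs can effectively support VAT tasks and provide legally grounded justifications, but current prototypes are not ready for full automation, and handling of implicit client knowledge and context-specific documentation remains limited.
Insight: LLMs have potential in the legal domain, but structured background information needs to be integrated to overcome current limitations.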
Abstract: This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied on both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. The findings indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.
[25] ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition
Qingliang Meng,Hao Wu,Wei Liang,Wei Xu,Qing Zhao
Main category: cs.CL
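TL;DR: This paper proposes a training paradigm, Iterative LoRA Training (ILT), combined with an iterative pseudo-labeling strategy. It addresses the overfitting of Low-Rank Adaptation (LoRA) during supervised fine-tuning (SFT) and raises the upper bound of model performance.
Details
Motivation: Deep integration of large language models with automatic speech recognition systems is a valuable research direction, but overfitting is common when LoRA is used in the SFT stage and needs a solution.
Contribution: Proposes the ILT training paradigm combined with an iterative pseudo-labeling strategy, raising the upper bound of model performance, and validates it through systematic experiments.
Method: Uses a three-stage training process (Focus Training, Feed Back Training, and Fix Training) based on the Whisper-large-v3 and Qwen2-Audio models.
Result: Experiments show the method is effective; in the Interspeech 2025 MLC-SLM challenge it placed 4th in Track 1 and 1st in Track 2.
Insight: Through iterative training and pseudo-labeling, ILT not only addresses overfitting but also shows strong application potential for multilingual speech recognition.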
Abstract: The deep integration of large language models and automatic speech recognition systems has become a promising research direction with high practical value. To address the overfitting issue commonly observed in Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage, this work proposes an innovative training paradigm, Iterative LoRA Training (ILT), in combination with an Iterative Pseudo Labeling strategy, effectively enhancing the theoretical upper bound of model performance. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. Experimental results demonstrate the effectiveness of the proposed method. Furthermore, the MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th place in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task), showcasing the practical feasibility and strong application potential of our approach.
[26] LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning
Shibo Sun,Xue Li,Donglin Di,Mingjie Wei,Lanshun Nie,Wei-Nan Zhang,Dechen Zhan,Yang Song,Lei Fan
Main category: cs.CL
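TL;DR: LLaPa is a multimodal procedural planning framework built on vision-language models. A Task-Environment Reranker and a Counterfactual Activities Retriever improve plan quality, and the framework performs strongly on multiple benchmarks.
Details
Motivation: Existing large language models still fall short in handling multimodal inputs and counterfactual reasoning; LLaPa aims to fill this gap with vision-language models.
Contribution: Proposes the LLaPa framework and introduces two auxiliary modules, the Task-Environment Reranker (TER) and the Counterfactual Activities Retriever (CAR), which improve multimodal procedural planning.
Method: Uses vision-language models to generate executable action sequences from text and images; TER builds a task-sensitive feature space, and CAR strengthens counterfactual reasoning.
Result: On the ActPlan-1K and ALFRED benchmarks, LLaPa produces plans of higher quality and correctness than advanced models.
Insight: Combining visual and language abilities with counterfactual reasoning can improve multimodal procedural planning.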
Abstract: While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model’s reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at https://github.com/sunshibo1234/LLaPa.
[27] The AI Language Proficiency Monitor – Tracking the Progress of LLMs on Multilingual Benchmarks
David Pomerenke,Jonas Nothnagel,Simon Ostermann
Main category: cs.CL
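TL;DR: This paper presents the AI Language Proficiency Monitor, a multilingual benchmark that evaluates large language models (LLMs) across up to 200 languages, with a particular focus on low-resource languages.
Details
Motivation: Ensuring that the benefits of large language models reach speakers of all the world’s languages equitably requires a comprehensive evaluation of their multilingual capabilities.
Contribution: 1. A comprehensive multilingual benchmark; 2. An open-source, auto-updating leaderboard and dashboard; 3. Descriptive insights through features such as a global proficiency map.
Method: Aggregates tasks including translation, question answering, math, and reasoning, evaluated with datasets such as FLORES+, MMLU, and GSM8K.
Result: The system helps researchers, developers, and policymakers identify strengths and gaps in model performance, supporting transparency and inclusivity in multilingual AI.
Insight: Multilingual benchmarks and transparency tools can help advance models for low-resource languages and promote equitable access to AI worldwide.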
Abstract: To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world’s languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at https://huggingface.co/spaces/fair-forward/evals-for-every-language.
[28] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1
Marcin Pietroń,Rafał Olszowski,Jakub Gomułka,Filip Gampel,Andrzej Tomski
Main category: cs.CL
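TL;DR: This paper studies argument classification with large language models (LLMs), comparing GPT, Llama, and DeepSeek models on several datasets. ChatGPT-4o and Deepseek-R1 perform best, but the study also exposes their common errors and the limitations of prompt algorithms.
Details
Motivation: Argument mining (AM) is an interdisciplinary research field, yet there is little analysis of how LLMs perform on publicly available argument classification datasets. This paper aims to fill that gap.
Contribution: Presents what the authors describe as the first broader analysis of multiple LLMs on public argument classification datasets, and identifies weaknesses of existing prompt algorithms along with directions for improvement.
Method: Tests a range of LLMs (GPT, Llama, DeepSeek), including variants using the Chain-of-Thoughts algorithm, on argument classification with datasets such as Args.me and UKP.
Result: ChatGPT-4o performs best on the argument classification benchmarks, and Deepseek-R1 leads among reasoning-enhanced models; both still make errors, which the paper analyzes in detail.
Insight: LLMs perform well on argument classification, but prompt algorithms need improvement to reduce errors, and shortcomings in the public datasets also need to be addressed.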
Abstract: Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among models with built-in reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
[29] Multilingual Multimodal Software Developer for Code Generation
Linzheng Chai,Jian Yang,Shukai Liu,Wei Zhang,Liran Wang,Ke Jin,Tao Sun,Congnan Liu,Chenchen Zhang,Hualei Zhu,Jiaheng Liu,Xianjie Wu,Ge Zhang,Tianyu Liu,Zhoujun Li
Main category: cs.CL
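TL;DR: This paper proposes MM-Coder, a multilingual multimodal code generation model that combines visual design inputs (such as UML diagrams and flowcharts) with textual instructions, and addresses the limitations of text-only models with the MMc-Instruct dataset and the MMEval benchmark.
Details
Motivation: Existing LLM code generation models are mostly text-only and neglect the visual aids commonly used in software development (such as UML diagrams and flowcharts), so generated code can fail to match the design.
Contribution: 1) MM-Coder, a multimodal software developer; 2) MMc-Instruct, a multimodal instruction-tuning dataset; 3) MMEval, a benchmark for evaluating multimodal code generation.
Method: MM-Coder integrates visual design inputs (Visual Workflow) with textual instructions to generate more accurate code, and is instruction-tuned on the MMc-Instruct dataset.
Result: Evaluations show that models still face challenges in precise visual information capture, instruction following, and advanced programming knowledge.
Insight: Multimodal input (text plus visuals) can improve the accuracy and design alignment of generated code, opening new possibilities for industrial programming.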
Abstract: The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
[30] KV Cache Steering for Inducing Reasoning in Small Language Models
Max Belitsky,Dawid J. Kopiczko,Michael Dorkenwald,M. Jehanzeb Mirza,Cees G. M. Snoek,Yuki M. Asano
Main category: cs.CL
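TL;DR: This paper proposes a lightweight KV cache steering method that uses a one-shot intervention to steer small language models toward multi-step reasoning, without fine-tuning or prompt modifications.
Details
Motivation: Existing activation steering techniques usually require continuous intervention, which is inefficient and sensitive to hyperparameters. This paper aims for a simpler and more efficient steering method that improves the reasoning of small language models.
Contribution: Proposes KV cache steering, which steers model behavior implicitly through a one-shot intervention and improves both performance and efficiency on reasoning tasks.
Method: Builds steering vectors from reasoning traces generated by GPT-4o and applies them by modifying the KV cache, shifting the model toward chain-of-thought reasoning.
Result: Experiments show improved performance on diverse reasoning tasks, along with better hyperparameter stability and inference-time efficiency.
Insight: KV cache steering is a lightweight, practical technique for quickly adjusting model behavior, and it works well for small, resource-constrained models.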
Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
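To make the idea of a one-shot cache intervention concrete, here is a minimal sketch. It assumes the steering vector is the difference between the mean cached vector from prompts that include a reasoning trace and the mean from prompts that do not; the abstract does not spell out the construction, so that rule, the `strength` parameter, and all function names are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(with_reasoning, without_reasoning):
    """Assumed construction: mean difference between the two sets of vectors."""
    positive = mean_vector(with_reasoning)
    negative = mean_vector(without_reasoning)
    return [p - q for p, q in zip(positive, negative)]

def apply_once(cached, steer, strength=1.0):
    """Shift one cached key or value vector a single time, before decoding."""
    return [c + strength * s for c, s in zip(cached, steer)]

# Toy 2-dimensional vectors standing in for cached keys or values.
steer = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
shifted = apply_once([0.5, 0.5], steer, strength=2.0)
```

The point of the sketch is the contrast with continuous activation steering: the cached vector is modified once, and subsequent decoding steps simply attend to the shifted cache.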
cs.CV [Back]
[31] CuriosAI Submission to the EgoExo4D Proficiency Estimation Challenge 2025
Hayato Tanoue,Hiroki Nishihara,Yuma Suzuki,Takayuki Hori,Hiroki Takushima,Aiswariya Manojkumar,Yuki Shibata,Mitsuru Takeda,Fumika Beppu,Zhao Hengwei,Yuto Kanda,Daichi Yamaga
Main category: cs.CV
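TL;DR: The CuriosAI team proposes two multi-view skill assessment methods for the EgoExo4D Proficiency Estimation Challenge at CVPR 2025, showing the effectiveness of scenario-conditioned modeling.
Details
Motivation: To assess skill accurately from multi-view data and offer a new approach to proficiency estimation.
Contribution: Proposes two methods: (1) a multi-task learning framework based on Sapiens-2B that jointly predicts proficiency and scenario labels; (2) a two-stage pipeline that combines zero-shot scenario recognition with view-specific VideoMAE classifiers.
Method: The first method uses a multi-task learning framework; the second uses a two-stage pipeline (zero-shot scenario recognition followed by VideoMAE classifiers).
Result: The multi-task framework reaches 43.6% accuracy and the two-stage pipeline reaches 47.8%, so the latter performs better.
Insight: Scenario-conditioned modeling benefits proficiency estimation, as shown by the stronger result of the two-stage pipeline.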
Abstract: This report presents the CuriosAI team’s submission to the EgoExo4D Proficiency Estimation Challenge at CVPR 2025. We propose two methods for multi-view skill assessment: (1) a multi-task learning framework using Sapiens-2B that jointly predicts proficiency and scenario labels (43.6% accuracy), and (2) a two-stage pipeline combining zero-shot scenario recognition with view-specific VideoMAE classifiers (47.8% accuracy). The superior performance of the two-stage approach demonstrates the effectiveness of scenario-conditioned modeling for proficiency estimation.
[32] Self-Consistency in Vision-Language Models for Precision Agriculture: Multi-Response Consensus for Crop Disease Management
Mihir Gupta,Abhay Mangla,Ross Greer,Pratik Desai
Main category: cs.CV
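TL;DR: This paper proposes a domain-aware framework that combines prompt-based expert evaluation with a self-consistency mechanism, improving the performance of vision-language models on crop disease identification in precision agriculture.
Details
Motivation: Precision agriculture depends on accurate image analysis, but existing vision-language models underperform in agricultural domains and need improvement.
Contribution: 1. A prompt-based expert evaluation protocol; 2. A cosine-consistency self-voting mechanism for semantically selecting among multiple candidate responses.
Method: Combines prompt-based expert evaluation with cosine-consistency self-voting in the agricultural image-processing pipeline, using domain-adapted embeddings to select the most coherent diagnosis.
Result: On maize leaf disease identification, diagnostic accuracy rises from 82.2% to 87.8%, and symptom analysis and treatment recommendation also improve markedly.
Insight: The method improves performance while staying compact enough for deployment on resource-constrained mobile devices, supporting real-time agricultural decision-making.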
Abstract: Precision agriculture relies heavily on accurate image analysis for crop disease identification and treatment recommendation, yet existing vision-language models (VLMs) often underperform in specialized agricultural domains. This work presents a domain-aware framework for agricultural image processing that combines prompt-based expert evaluation with self-consistency mechanisms to enhance VLM reliability in precision agriculture applications. We introduce two key innovations: (1) a prompt-based evaluation protocol that configures a language model as an expert plant pathologist for scalable assessment of image analysis outputs, and (2) a cosine-consistency self-voting mechanism that generates multiple candidate responses from agricultural images and selects the most semantically coherent diagnosis using domain-adapted embeddings. Applied to maize leaf disease identification from field images using a fine-tuned PaliGemma model, our approach improves diagnostic accuracy from 82.2% to 87.8%, symptom analysis from 38.9% to 52.2%, and treatment recommendation from 27.8% to 43.3% compared to standard greedy decoding. The system remains compact enough for deployment on mobile devices, supporting real-time agricultural decision-making in resource-constrained environments. These results demonstrate significant potential for AI-driven precision agriculture tools that can operate reliably in diverse field conditions.
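The self-voting step can be sketched as follows. This is a minimal illustration that assumes the selected response is the one whose embedding has the highest mean cosine similarity to the other candidates; the abstract does not give the exact scoring rule, and the embeddings here are toy vectors rather than outputs of a domain-adapted embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_by_consensus(embeddings):
    """Return the index of the candidate most similar, on average, to the rest."""
    n = len(embeddings)
    if n == 0:
        raise ValueError("need at least one candidate")
    if n == 1:
        return 0
    best_index, best_score = 0, -math.inf
    for i in range(n):
        score = sum(
            cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i
        ) / (n - 1)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Toy embeddings for three sampled responses; the third is an outlier.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chosen = select_by_consensus(candidates)
```

In a full pipeline the candidates would come from sampling the VLM several times on the same image, and the chosen index would pick which generated diagnosis to return.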
[33] Development of a Canada-Wide Morphology Map for the ITU-R P. 1411 Propagation Model
Jennifer P. T. Nguyen
Main category: cs.CV
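TL;DR: This paper develops a Canada-wide morphology map following the ITU-R P.1411-12 propagation model, using machine learning to automate the classification of residential, urban low-rise, and urban high-rise environments for more accurate path loss estimation.
Details
Motivation: The environment-type descriptors in Recommendation ITU-R P.1411 are qualitative and hard to apply directly, so an automated method is needed to classify morphology types.
Contribution: Uses machine learning to build a nationwide morphology map that supports more accurate path loss estimates for outdoor short-range propagation at frequencies from 300 MHz to 100 GHz.
Method: Applies a machine learning approach to automate the classification of morphology types, with extensive experimentation to optimize classification accuracy.
Result: Produces a morphology map covering Canada that enables more accurate path loss estimation.
Insight: Machine learning can resolve the ambiguity of qualitative descriptors and give propagation models a more precise morphology classification.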
Abstract: This paper outlines the development of a Canada-wide morphology map classifying regions into residential, urban low-rise, and urban high-rise environments, following the ITU-R P.1411-12 propagation model guidelines. To address the qualitative nature of the environment-type descriptors found in the Recommendation, a machine learning approach is employed to automate the classification process. Extensive experimentation optimized classification accuracy, resulting in a Canada-wide morphology map that ensures more accurate path loss estimations for outdoor short-range propagation at frequencies ranging from 300 MHz to 100 GHz.
[34] Towards Evaluating Robustness of Prompt Adherence in Text to Image Models
Sujith Vemishetty,Advitiya Arora,Anupama Sharma
Main category: cs.CV
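TL;DR: This paper proposes a framework for evaluating how robustly Text-to-Image models adhere to prompts, and creates a new dataset to test whether generated images follow the factors of variation specified in the input text. The models are found to struggle even with simple images.
Details
Motivation: Multimodal LLMs and Text-to-Image models have only recently gained prominence, and research on their reliability is insufficient; the authors aim to fill this gap with a comprehensive evaluation framework.
Contribution: 1. A framework for evaluating prompt adherence in Text-to-Image models. 2. A new dataset for testing model robustness. 3. An evaluation of several Stable Diffusion and Janus models.
Method: 1. Uses gpt-4o to generate text descriptions of the ground-truth images. 2. Passes these descriptions to the Text-to-Image models to generate images. 3. Passes the generated images through gpt-4o again with the same system prompt and compares the two descriptions.
Result: The models struggle to create simple binary images with only two factors of variation (a simple geometric shape and its location), and they fail to follow the input dataset distribution.
Insight: Current Text-to-Image models remain limited even on simple tasks; the robustness of prompt understanding needs further improvement.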
Abstract: The advancements in the domain of LLMs in recent years have surprised many, showcasing their remarkable capabilities and diverse applications. Their potential applications in various real-world scenarios have led to significant research on their reliability and effectiveness. On the other hand, multimodal LLMs and Text-to-Image models have only recently gained prominence, especially when compared to text-only LLMs. Their reliability remains constrained due to insufficient research on assessing their performance and robustness. This paper aims to establish a comprehensive evaluation framework for Text-to-Image models, concentrating particularly on their adherence to prompts. We created a novel dataset that aimed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our evaluation studies present findings on three variants of Stable Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro 1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions generated by the gpt-4o model for our ground-truth images, which are then used to generate artificial images by passing these descriptions to the Text-to-Image models. We then pass these generated images again through gpt-4o using the same system prompt and compare the variation between the two descriptions. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location. We also show, using pre-trained VAEs on our dataset, that they fail to generate images that follow our input dataset distribution.
[35] ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints
Debasmit Das,Hyoungwoo Park,Munawar Hayat,Seokeon Choi,Sungrack Yun,Fatih Porikli
Main category: cs.CV
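TL;DR: The paper proposes CNTLoRA, a data-driven weight initialization method for low-rank adapters (LoRA) that uses constraints relating pre-training and fine-tuning activation vectors to initialize weights without training, improving convergence speed and performance.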
Details
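Motivation: Existing LoRA methods typically initialize weights randomly at a fixed rank, which can hurt convergence and final performance. The paper addresses this with a data-driven initialization that exploits constraints between pre-training and fine-tuning activation vectors. Contribution: Proposes CNTLoRA, which casts LoRA weight initialization as a domain shift problem, solves for the weights in closed form from the constraints with no training during initialization, and supports variable-rank decomposition, improving downstream task performance.
Method: LoRA initialization is reformulated as a domain shift problem. Constraints relating pre-training and fine-tuning activation vectors yield a closed-form weight estimate, which is decomposed into up and down matrices of variable rank for data-driven initialization.
Result: On image generation, classification, and understanding tasks, CNTLoRA outperforms standard and data-driven initialization methods in both performance and convergence speed. Experiments and ablation studies support the design.
Insight: Initializing weights from constraints between pre-training and fine-tuning data can markedly improve the efficiency and performance of LoRA, offering a new direction for parameter-efficient fine-tuning.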
Abstract: Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning, using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain shift problem where we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of LoRA weights that depends on pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices with proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.
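The decomposition step described above can be illustrated with a minimal sketch. The abstract does not say which factorization CNTLoRA uses, so the truncated SVD, the energy-based rank rule, and the names `init_lora_from_estimate` and `pick_rank` are all assumptions made for illustration; the weight estimate itself is a random stand-in, not the paper's closed form.

```python
import numpy as np

def pick_rank(singular_values, energy=0.95):
    """Hypothetical rule: smallest rank keeping `energy` of the squared spectrum."""
    cum = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return int(np.searchsorted(cum, energy) + 1)

def init_lora_from_estimate(delta_w, rank):
    """Split a weight-update estimate into LoRA up/down factors via truncated SVD."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    up = u[:, :rank] * sqrt_s            # shape (d_out, rank)
    down = sqrt_s[:, None] * vt[:rank]   # shape (rank, d_in)
    return up, down

rng = np.random.default_rng(0)
# Stand-in for a closed-form estimate: an exactly rank-2 matrix of shape (6, 4).
delta_w = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))
singular_values = np.linalg.svd(delta_w, compute_uv=False)
chosen_rank = pick_rank(singular_values)
up, down = init_lora_from_estimate(delta_w, rank=2)
```

Because the rank is chosen per matrix, different attachment points could receive different ranks, which is one plausible reading of the "variable ranks" mentioned in the abstract.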
[36] A Hybrid Multilayer Extreme Learning Machine for Image Classification with an Application to Quadcopters
Rolando A. Hernandez-Hernandez,Adrian Rubio-Solis
Main category: cs.CV
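TL;DR: The paper proposes a Hybrid Multilayer Extreme Learning Machine (HML-ELM) based on ELM autoencoders and Interval Type-2 fuzzy logic theory for image classification, and applies it to unmanned aerial vehicles (UAVs). The method outperforms comparable approaches in feature extraction and classification.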
Details
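Motivation: The Multilayer Extreme Learning Machine (ML-ELM) and its variants perform well at classifying natural signals, but their use in complex tasks such as image classification still needs improvement. The authors therefore propose a hybrid method that combines ELM with fuzzy logic theory to improve classification efficiency and accuracy. Contribution: 1) Proposes HML-ELM, based on ELM autoencoders and Interval Type-2 fuzzy logic theory; 2) designs a fast output reduction layer (an improved variant based on the SC algorithm); 3) validates the method on benchmarks and in UAV experiments.
Method: The method has two phases: 1) unsupervised multilayer feature extraction by stacking ELM-AEs; 2) supervised feature classification with a Simplified Interval Type-2 Fuzzy ELM (SIT2-FELM) that uses a fast output reduction layer based on the SC algorithm.
Result: Experiments show that HML-ELM outperforms ML-ELM, ML-FELM, and ELM on image classification and performs efficiently in the UAV experiments.
Insight: Combining ELM with fuzzy logic theory can markedly improve classification performance on complex tasks, particularly within a hierarchical framework that separates feature extraction from classification.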
Abstract: Multilayer Extreme Learning Machine (ML-ELM) and its variants have proven to be an effective technique for the classification of different natural signals such as audio, video, acoustic and images. In this paper, a Hybrid Multilayer Extreme Learning Machine (HML-ELM) based on the ELM-based autoencoder (ELM-AE) and Interval Type-2 fuzzy logic theory is proposed for active image classification and applied to Unmanned Aerial Vehicles (UAVs). The proposed methodology is a hierarchical ELM learning framework that consists of two main phases: 1) self-taught feature extraction and 2) supervised feature classification. First, unsupervised multilayer feature encoding is achieved by stacking a number of ELM-AEs, in which input data is projected into a number of high-level representations. In the second phase, the final features are classified using a novel Simplified Interval Type-2 Fuzzy ELM (SIT2-FELM) with a fast output reduction layer based on the SC algorithm, an improved version of the Center of Sets Type Reducer without Sorting Requirement (COSTRWSR) algorithm. To validate the efficiency of the HML-ELM, two types of image classification experiments are presented. First, the HML-ELM is applied to a number of benchmark problems for image classification. Second, a number of real experiments on the active classification and transport of four different objects between two predefined locations using a UAV are carried out. Experiments demonstrate that the proposed HML-ELM delivers superior efficiency compared to other similar methodologies such as ML-ELM, Multilayer Fuzzy Extreme Learning Machine (ML-FELM) and ELM.
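The first phase (stacked ELM-AEs) can be sketched as follows. This is a generic ELM autoencoder as commonly described in the ELM literature, not the authors' code: the `tanh` activation, the ridge term `reg`, the layer sizes, and the function names are assumptions, and the fuzzy SIT2-FELM classifier of the second phase is not shown.

```python
import numpy as np

def elm_ae_layer(x, n_hidden, reg=1e-2, seed=0):
    """Fit one ELM autoencoder and return the encoded features."""
    rng = np.random.default_rng(seed)
    # Random, untrained input weights and biases (the defining ELM trait).
    w = rng.normal(size=(x.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    h = np.tanh(x @ w + b)
    # Closed-form ridge regression: output weights that reconstruct x from h.
    beta = np.linalg.solve(h.T @ h + reg * np.eye(n_hidden), h.T @ x)
    # Common ELM-AE convention: reuse beta^T as the encoder for the next layer.
    return x @ beta.T

def stack_elm_ae(x, layer_sizes):
    """Apply several ELM-AE layers in sequence (unsupervised feature encoding)."""
    for i, n_hidden in enumerate(layer_sizes):
        x = elm_ae_layer(x, n_hidden, seed=i)
    return x

data = np.random.default_rng(42).normal(size=(50, 8))  # toy inputs
features = stack_elm_ae(data, layer_sizes=[6, 4])
```

No gradient descent is involved: each layer is fitted with a single linear solve, which is the main reason ELM-style training is fast.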
[37] The relative importance of being Gaussian
F. Alberto Grünbaum,Tondgi Xu
Main category: cs.CV
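TL;DR: This note examines how denoising algorithms designed for Gaussian noise behave under non-Gaussian noise, and finds that they may still work even when the noise distribution departs substantially from the assumption.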
Details
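Motivation: To study how important the Gaussian noise assumption is for denoising algorithms and to explore their robustness under other noise distributions. Contribution: Experimentally examines the performance of the denoising algorithm under non-Gaussian noise (such as uniform and Beta distributions) and finds that it is robust, to some degree, to changes in the noise distribution.
Method: Without modifying the algorithm, Gaussian noise is replaced by other distributions (uniform, Beta, or a mixture of Gaussians), and experiments are run on small images.
Result: Experiments indicate that the algorithm retains some denoising ability even when the noise distribution differs considerably from the Gaussian assumption.
Insight: The Gaussian assumption provides the theoretical foundation, but in practice the algorithm may be less sensitive to the noise distribution than the theory suggests.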
Abstract: The remarkable results for denoising in computer vision using diffusion models given in \cite{SDWMG,HJA,HHG} yield a robust mathematical justification for algorithms based on crucial properties of a sequence of Gaussian independent $N(0,1)$ random variables. In particular the derivations use the fact that a Gaussian distribution is determined by its mean and variance and that the sum of two Gaussians is another Gaussian. The issue raised in this short note is the following: suppose we use the algorithm without any changes but replace the nature of the noise and use, for instance, uniformly distributed noise or noise with a Beta distribution, or noise which is a random superposition of two Gaussians with very different variances. One could, of course, try to modify the algorithm keeping in mind the nature of the noise, but this is not what we do. Instead we study the performance of the algorithm when used with noise that is very far in nature from the Gaussian case, where it is designed to work well. Usually these algorithms are implemented on very powerful computers. Our experiments are all carried out on a small laptop and for the smallest possible image size. Exploring how our observations are confirmed or changed in different situations remains an interesting challenge.
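A minimal sketch of the noise substitutions named in the abstract is given below. The abstract does not say how the replacement noise is scaled, so standardizing every source to zero mean and unit variance is an assumption here, as are the Beta(2, 5) parameters, the mixture standard deviations, and the function name `standardized_noise`.

```python
import math
import random
import statistics

def standardized_noise(kind, n, rng):
    """Draw n samples with mean 0 and variance 1 from the chosen distribution."""
    if kind == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if kind == "uniform":
        # Uniform on [-sqrt(3), sqrt(3)] has mean 0 and variance 1.
        a = math.sqrt(3.0)
        return [rng.uniform(-a, a) for _ in range(n)]
    if kind == "beta":
        # Beta(a, b): mean a/(a+b), variance ab / ((a+b)^2 (a+b+1)).
        a, b = 2.0, 5.0
        mean = a / (a + b)
        std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        return [(rng.betavariate(a, b) - mean) / std for _ in range(n)]
    if kind == "mixture":
        # Equal-weight mix of N(0, s1^2) and N(0, s2^2); variance 0.5 (s1^2 + s2^2).
        s1, s2 = 0.1, 10.0
        scale = math.sqrt(0.5 * (s1**2 + s2**2))
        return [rng.gauss(0.0, s1 if rng.random() < 0.5 else s2) / scale
                for _ in range(n)]
    raise ValueError(f"unknown noise kind: {kind}")

rng = random.Random(0)
summary = {}
for kind in ("gaussian", "uniform", "beta", "mixture"):
    samples = standardized_noise(kind, 20000, rng)
    summary[kind] = (statistics.fmean(samples), statistics.pvariance(samples))
```

Matching the first two moments isolates the effect of the distribution's shape (bounded support, skew, heavy tails) from a simple change in noise level.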
[38] Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction
Hyungjun Doh,Dong In Lee,Seunggeun Chi,Pin-Hao Huang,Kwonjoon Lee,Sangpil Kim,Karthik Ramani
Main category: cs.CV
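TL;DR: The paper proposes a new framework that reconstructs dynamic human-object interactions using amodal completion and temporal consistency, substantially improving 3D reconstruction accuracy and stability in occluded scenes.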
Details
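Motivation: Traditional 3D reconstruction methods assume static objects or full visibility and struggle with occlusions and temporal inconsistencies in dynamic scenes. Contribution: Proposes a new template-free framework based on amodal completion and temporal context that infers the complete structure of occluded regions.
Method: Combines 3D Gaussian Splatting with temporal consistency constraints, refining and stabilizing the reconstruction incrementally across video frames.
Result: Validated on monocular videos, the method outperforms existing techniques, particularly in handling occlusions and maintaining temporal stability.
Insight: Combining amodal completion with temporal context is an effective way to improve the accuracy of dynamic 3D reconstruction.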
Abstract: We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated, particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.
[39] Adaptive Diffusion Denoised Smoothing : Certified Robustness via Randomized Smoothing with Differentially Private Guided Denoising Diffusion
Frederick Shpilevskiy,Saiyue Lyu,Krishnamurthy Dj Dvijotham,Mathias Lécuyer,Pierre-André Noël
Main category: cs.CV
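TL;DR: Proposes Adaptive Diffusion Denoised Smoothing, which reinterprets a guided denoising diffusion model as a sequence of adaptive Gaussian Differentially Private mechanisms to certify a vision model's predictions against adversarial examples, and shows improvements in both certified and standard accuracy on ImageNet.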
Details
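Motivation: Adversarial examples are a pervasive problem for deep learning models. Methods such as randomized smoothing provide some certified robustness but lack adaptivity and efficiency. This paper combines diffusion models with differential privacy mechanisms to provide an adaptive and efficient certification method. Contribution: 1. Reinterprets a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private mechanisms; 2. proposes Adaptive Diffusion Denoised Smoothing, extending the adaptive randomized smoothing analysis; 3. validates the method on ImageNet.
Method: A guided denoising diffusion model progressively refines a pure noise sample into an image, and a GDP privacy filter is used to analyze end-to-end robustness, yielding adaptive certified robustness.
Result: Under an ℓ2 threat model on ImageNet, the design improves both certified accuracy and standard accuracy.
Insight: Combining diffusion models with differential privacy mechanisms opens a new research direction for certified robustness against adversarial examples and shows the potential of adaptive mechanisms in adversarial defense.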
Abstract: We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
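For context, the baseline that this work extends can be sketched in a few lines. The code below computes the standard, non-adaptive randomized smoothing certificate for Gaussian noise, where the certified $\ell_2$ radius is $\sigma\,\Phi^{-1}(p_A)$ for a lower bound $p_A > 0.5$ on the top-class probability. It is not the paper's adaptive GDP-filter analysis, and the function name is an assumption.

```python
from statistics import NormalDist

def certified_l2_radius(p_lower, sigma):
    """Standard randomized-smoothing radius: sigma * Phi^{-1}(p_lower).

    p_lower is a lower confidence bound on the probability that the base
    classifier returns the top class under N(0, sigma^2 I) input noise.
    """
    if not 0.0 < p_lower < 1.0:
        raise ValueError("p_lower must lie strictly between 0 and 1")
    if p_lower <= 0.5:
        return 0.0  # no certificate: the top class is not a clear majority
    return sigma * NormalDist().inv_cdf(p_lower)

radius = certified_l2_radius(p_lower=0.9, sigma=0.5)
```

The radius grows with both the noise level and the classifier's confidence under noise, which is why a stronger denoiser in front of the classifier tends to help certification.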
[40] An Embedded Real-time Object Alert System for Visually Impaired: A Monocular Depth Estimation based Approach through Computer Vision
Jareen Anjom,Rashik Iram Chowdhury,Tarbia Hasan,Md. Ishan Arefin Hossain
Main category: cs.CV
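TL;DR: The paper proposes a real-time object alert system based on monocular depth estimation to help visually impaired people move safely through the busy streets of Bangladesh. By combining depth estimation and object detection models and optimizing them with quantization, the system is lightweight and efficient.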
Details
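Motivation: Visually impaired people face obstacles and a high risk of road accidents in cities, so an assistive system that warns of nearby objects in real time is urgently needed. Contribution: Proposes a novel alert system that combines depth estimation and object detection models and uses quantization to make the models lightweight and efficient to deploy.
Method: Depth estimation and object detection models are trained with transfer learning, their outputs are combined, and the models are optimized with quantization to suit embedded systems.
Result: The system achieves lightweight real-time depth estimation and object detection with an mAP50 of 0.801.
Insight: Combining depth estimation with object detection and applying quantization enables efficient real-time object alerts on embedded systems.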
Abstract: Visually impaired people face significant challenges in their day-to-day commutes in the urban cities of Bangladesh due to the vast number of obstructions on every path. With many injuries taking place through road accidents on a daily basis, it is paramount for a system to be developed that can alert the visually impaired of objects at close distance beforehand. To overcome this issue, a novel alert system is proposed in this research to assist the visually impaired in commuting through these busy streets without colliding with any objects. The proposed system can alert the individual to objects that are present at a close distance. It utilizes transfer learning to train models for depth estimation and object detection, and combines both models to introduce a novel system. The models are optimized through the utilization of quantization techniques to make them lightweight and efficient, allowing them to be easily deployed on embedded systems. The proposed solution achieved a lightweight real-time depth estimation and object detection model with an mAP50 of 0.801.
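One way the two model outputs might be combined is sketched below. The abstract does not describe the fusion rule, so the median-depth-per-box criterion, the 2 m threshold, the assumption that depth is in meters (monocular depth is often only relative), and the name `close_object_alerts` are all hypothetical.

```python
from statistics import median

def close_object_alerts(detections, depth_map, max_distance_m=2.0):
    """Return (label, distance) for detected objects closer than the threshold.

    detections: list of (label, (x0, y0, x1, y1)) boxes in pixel coordinates,
                with x1 and y1 exclusive.
    depth_map:  2D list of per-pixel depth values, assumed to be in meters.
    """
    alerts = []
    for label, (x0, y0, x1, y1) in detections:
        patch = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        if not patch:
            continue
        distance = median(patch)  # median is robust to background pixels in the box
        if distance <= max_distance_m:
            alerts.append((label, distance))
    return alerts

# Toy 4x4 depth map: the top-right corner is 1 m away, everything else 5 m.
depth_map = [
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
]
detections = [("car", (0, 0, 2, 2)), ("pole", (2, 0, 4, 2))]
alerts = close_object_alerts(detections, depth_map)
```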
[41] Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification
Jason Kahei Tam,Murilo Gustineli,Anthony Miyaguchi
Main category: cs.CV
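TL;DR: The paper presents an approach to fine-grained few-shot fungi classification that combines transfer learning and Mixup, outperforming the baseline models in the FungiCLEF 2025 competition.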
Details
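Motivation: Accurate identification of fungi species is challenging in computer vision because of fine-grained inter-species variation and high intra-species variation, which motivated the team to explore more effective classification methods. Contribution: Tests multiple vision transformer models, combines data augmentation with weighted sampling, and explores the incorporation of textual information; also documents the limitations of generative AI models for zero-shot classification.
Method: Uses transfer learning and Mixup together with domain-specific pretraining and balanced sampling strategies, and evaluates multimodal learning approaches.
Result: The final model ranked 35/74 in the FungiCLEF 2025 competition, indicating room for improvement in metadata selection and domain-adapted multimodal learning.
Insight: Domain-specific pretraining and balanced sampling are important for fine-grained few-shot classification, while generative AI models perform poorly on this task.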
Abstract: Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-completion evaluation, which suggests that additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at https://github.com/dsgt-arc/fungiclef-2025.
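Mixup, named in the title, forms a convex combination of two training examples and of their labels. The sketch below shows the basic operation on plain lists; the `alpha` value and the helper names are assumptions, and the team's actual implementation is in the repository linked above.

```python
import random

def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard Mixup."""
    return rng.betavariate(alpha, alpha)

def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two examples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

rng = random.Random(0)
lam = sample_lambda(alpha=0.4, rng=rng)

# Fixed coefficient for a reproducible illustration.
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

With few images per species, the soft labels produced by Mixup act as a regularizer that discourages the classifier from memorizing individual training images.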
[42] Portable Biomechanics Laboratory: Clinically Accessible Movement Analysis from a Handheld Smartphone
J. D. Peiffer,Kunal Shah,Irina Djuraskovic,Shawana Anarwala,Kayan Abdou,Rujvee Patel,Prakash Jayabalan,Brenton Pennicooke,R. James Cotton
Main category: cs.CV
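TL;DR: The paper presents the Portable Biomechanics Laboratory (PBL), a handheld smartphone system for objectively measuring movement in clinical settings. Validation and real-world clinical testing demonstrate its accuracy, ease of use, and sensitivity to clinical differences.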
Details
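Motivation: Movement is an important indicator of neurological and musculoskeletal health, yet clinical practice lacks objective, easy-to-use ways to measure it. The authors aim to fill this gap with a low-cost, scalable tool. Contribution: Presents the first clinically validated method for whole-body movement analysis from smartphone video, including a cloud-enabled data collection app and a new algorithm.
Method: Develops the smartphone-based PBL system, which collects data through a cloud-enabled app and fits biomechanical models with a new algorithm. Accuracy was validated across several clinical populations (joint angle errors within 3 degrees).
Result: The system shows high reliability and sensitivity to clinical differences in neurosurgery and sports medicine clinics. For example, its gait metrics correlate with mJOA scores and are more responsive to surgical intervention.
Insight: Smartphone video can serve as a low-burden, scalable tool for clinical biomechanical measurement and may enable widespread monitoring of mobility impairments.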
Abstract: The way a person moves is a direct reflection of their neurological and musculoskeletal health, yet it remains one of the most underutilized vital signs in clinical practice. Although clinicians visually observe movement impairments, they lack accessible and validated methods to objectively measure movement in routine care. This gap prevents wider use of biomechanical measurements in practice, which could enable more sensitive outcome measures or earlier identification of impairment. We present our Portable Biomechanics Laboratory (PBL), which includes a secure, cloud-enabled smartphone app for data collection and a novel algorithm for fitting biomechanical models to this data. We extensively validated PBL’s biomechanical measures using a large, clinically representative dataset. Next, we tested the usability and utility of our system in neurosurgery and sports medicine clinics. We found joint angle errors within 3 degrees across participants with neurological injury, lower-limb prosthesis users, pediatric inpatients, and controls. In addition to being easy to use, gait metrics computed from the PBL showed high reliability and were sensitive to clinical differences. For example, in individuals undergoing decompression surgery for cervical myelopathy, the mJOA score is a common patient-reported outcome measure; we found that PBL gait metrics correlated with mJOA scores and demonstrated greater responsiveness to surgical intervention than the patient-reported outcomes. These findings support the use of handheld smartphone video as a scalable, low-burden tool for capturing clinically meaningful biomechanical data, offering a promising path toward accessible monitoring of mobility impairments. We release the first clinically validated method for measuring whole-body kinematics from handheld smartphone video at https://intelligentsensingandrehabilitation.github.io/MonocularBiomechanics/ .
[43] Cross-Resolution SAR Target Detection Using Structural Hierarchy Adaptation and Reliable Adjacency Alignment
Jiang Qin,Bin Zou,Haolin Li,Lamei Zhang
Main category: cs.CV
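TL;DR: The paper proposes CR-Net, a new method that addresses the problems caused by resolution differences in SAR target detection. By combining structure priors with evidential learning theory, CR-Net achieves reliable cross-resolution domain adaptation and substantially improves detection performance.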
Details
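Motivation: As SAR resolution increases, discrepancies in scattering characteristics grow, reducing the generalization ability of target detection models. Existing domain adaptation techniques often suffer from blind feature adaptation and unreliable semantic propagation because of resolution differences, which limits performance. Contribution: 1. Proposes CR-Net, which combines structure priors with evidential learning theory; 2. designs the SHFA module to establish structural correlations between targets and achieve structure-aware feature adaptation; 3. proposes the RSAA module, which improves the reliability of semantic alignment through a secure adjacency set.
Method: 1. SHFA module: structure-induced hierarchical feature adaptation, which makes the feature adaptation process more interpretable; 2. RSAA module: uses the secure adjacency set to transfer valuable discriminative knowledge from the source domain to the target domain.
Result: Experiments on datasets of different resolutions show that CR-Net achieves SOTA performance in cross-resolution SAR target detection.
Insight: Combining structure priors with evidential learning theory improves domain adaptation performance and also strengthens the model's interpretability and discriminability.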
Abstract: In recent years, continuous improvements in SAR resolution have significantly benefited applications such as urban monitoring and target detection. However, the improvement in resolution leads to increased discrepancies in scattering characteristics, posing challenges to the generalization ability of target detection models. While domain adaptation technology is a potential solution, the inevitable discrepancies caused by resolution differences often lead to blind feature adaptation and unreliable semantic propagation, ultimately degrading the domain adaptation performance. To address these challenges, this paper proposes a novel SAR target detection method (termed CR-Net) that incorporates structure priors and evidential learning theory into the detection model, enabling reliable domain adaptation for cross-resolution detection. Specifically, CR-Net integrates Structure-induced Hierarchical Feature Adaptation (SHFA) and Reliable Structural Adjacency Alignment (RSAA). The SHFA module is introduced to establish structural correlations between targets and achieve structure-aware feature adaptation, thereby enhancing the interpretability of the feature adaptation process. The RSAA module is then proposed to enhance reliable semantic alignment by leveraging the secure adjacency set to transfer valuable discriminative knowledge from the source domain to the target domain. This further improves the discriminability of the detection model in the target domain. Based on experimental results from different-resolution datasets, the proposed CR-Net significantly enhances cross-resolution adaptation by preserving intra-domain structures and improving discriminability. It achieves state-of-the-art (SOTA) performance in cross-resolution SAR target detection.
[44] M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation
Kui Jiang,Shiyu Liu,Junjun Jiang,Xin Yang,Hongxun Yang,Xiaopeng Fan
Main category: cs.CV
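TL;DR: The paper proposes the M2DAO-Talker framework, which improves audio-driven talking-head generation through multi-granular motion decoupling and alternating optimization. It addresses motion blur, temporal jitter, and related artifacts in existing methods and delivers notable gains in generation quality and speed.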
Details
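Motivation: Existing 3D methods for audio-driven talking-head generation suffer from rendering artifacts such as motion blur and temporal jitter, so a more stable, finer-grained motion representation is needed. Contribution: 1. Proposes a unified three-step framework (video preprocessing, motion representation, rendering reconstruction) for talking-head generation; 2. designs a multi-granular motion decoupling strategy; 3. introduces alternating optimization to improve generation; 4. demonstrates advantages in generation quality and speed experimentally.
Method: 1. A 2D portrait preprocessing pipeline extracts motion control conditions; 2. multi-granular motion decoupling models non-rigid and rigid motions separately; 3. a motion consistency constraint is applied; 4. an alternating optimization strategy iteratively refines the parameters.
Result: Experiments show that M2DAO-Talker improves generation quality by 2.43 dB PSNR, raises user-rated video realness by 0.64, and runs at 150 FPS inference speed.
Insight: 1. Motion decoupling and alternating optimization are key to improving talking-head generation; 2. the motion consistency constraint mitigates local penetration artifacts; 3. the framework design provides a reference for future work.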
Abstract: Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks, and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction accuracy. Meanwhile, a motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness versus TalkingGaussian, while running at 150 FPS inference speed. Our project homepage is https://m2dao-talker.github.io/M2DAO-Talk.github.io
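To put the reported 2.43 dB gain in perspective, the sketch below computes PSNR from mean squared error and converts a dB gain into the implied MSE ratio. The function names and the 8-bit peak value of 255 are assumptions; the abstract does not state the evaluation protocol.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value**2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE(new) / MSE(old) implied by a PSNR gain at the same peak value."""
    return 10.0 ** (-gain_db / 10.0)

worst_case = psnr([0.0, 0.0], [255.0, 255.0])  # maximal error at every pixel
ratio = mse_ratio_for_gain(2.43)
```

Under these assumptions, a 2.43 dB improvement corresponds to reducing the mean squared error to roughly 57% of the baseline's.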
[45] Cross-Domain Identity Representation for Skull to Face Matching with Benchmark DataSet
Ravi Shankar Prasad,Dinesh Singh
Main category: cs.CV
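TL;DR: The paper proposes a cross-domain identity representation framework based on convolutional Siamese networks that matches skull X-ray images to the corresponding face images, supporting craniofacial reconstruction in forensic science, and builds a dataset of 40 volunteers for validation.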
Details
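Motivation: In forensic science, craniofacial reconstruction is crucial for identifying victims of crimes and disasters. Traditional methods are inefficient and insufficiently accurate. The paper uses deep learning (Siamese networks) to improve the accuracy and efficiency of cross-domain identity matching. Contribution: 1. Proposes a cross-domain identity representation framework based on convolutional Siamese networks; 2. builds a cross-domain dataset of 40 volunteers (skull X-ray and face images); 3. validates the framework on skull-to-face matching experimentally.
Method: A convolutional Siamese network learns a cross-domain identity representation. It minimizes the Euclidean distance between similar pairs and maximizes it between dissimilar pairs, so that matching skulls and faces are grouped together in a shared feature space.
Result: Experiments on the collected dataset show that the framework can identify the person corresponding to a given skull with satisfactory matching results.
Insight: 1. Cross-domain matching can be addressed with deep feature learning; 2. Siamese networks are an effective option when data is scarce; 3. the collected dataset provides a new benchmark for cross-modal research.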
Abstract: Craniofacial reconstruction in forensic science is crucial for the identification of the victims of crimes and disasters. The objective is to map a given skull to its corresponding face in a corpus of faces with known identities using recent advancements in computer vision, such as deep learning. In this paper, we present a framework for the identification of a person given the X-ray image of a skull using convolutional Siamese networks for cross-domain identity representation. Siamese networks are twin networks that share the same architecture and can be trained to discover a feature space where similar observations are grouped together and dissimilar observations are moved apart. To do this, the network is exposed to pairs of similar and dissimilar data. The Euclidean distance is then minimized between similar pairs and maximized between dissimilar ones. Since pairs of skull and face images are difficult to obtain, we prepared our own dataset of 40 volunteers whose front and side skull X-ray images and optical face images were collected. Experiments were conducted on the collected cross-domain dataset to train and validate the Siamese networks. The experiments yield satisfactory results on the identification of a person from the given skull.
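The training objective described above (pull similar pairs together, push dissimilar pairs apart in Euclidean distance) is commonly implemented as a margin-based contrastive loss. The sketch below uses that standard form as an assumption; the abstract does not give the authors' exact loss, margin, or embedding size, and the embeddings here are toy vectors.

```python
import math

def contrastive_loss(embedding_a, embedding_b, same_identity, margin=1.0):
    """Margin-based contrastive loss on a pair of embeddings.

    Similar pairs are penalized by squared distance; dissimilar pairs are
    penalized only while they are closer than `margin`.
    """
    distance = math.dist(embedding_a, embedding_b)
    if same_identity:
        return distance**2
    return max(0.0, margin - distance) ** 2

# Toy skull and face embeddings, as the twin networks might produce.
matching = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=True)
far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=False)
too_close = contrastive_loss([0.0, 0.0], [0.6, 0.8], same_identity=False, margin=2.0)
```

At test time, identification reduces to a nearest-neighbor search: the skull embedding is compared against the embeddings of all faces with known identities.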
[46] Single-Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement
Jia-Xuan Jiang,Jiashuai Liu,Hongtao Wu,Yifeng Wu,Zhong Wang,Qi Bi,Yefeng Zheng
Main category: cs.CV
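TL;DR: The paper proposes a new task, Cross-Cancer Single Domain Generalization, which evaluates whether models trained on a single cancer type generalize to unseen cancers, and introduces two modules (SDIR and CADE) to address modality imbalance and distribution differences.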
Details
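Motivation: Current multimodal survival prediction models mainly target a single cancer type and overlook the challenge of cross-cancer generalization. The study finds that multimodal models often generalize worse than unimodal ones in cross-cancer scenarios, which is an important problem for clinical practice. Contribution: 1) Proposes the cross-cancer single domain generalization task; 2) reveals the weak generalization of multimodal models; 3) proposes two modules (SDIR and CADE) to strengthen generalization.
Method: SDIR strengthens weaker-modality signals through Bernoulli-based sparsification and Dirac-inspired stabilization; CADE synthesizes the target domain distribution by fusing local morphological cues and global gene expression in latent space.
Result: The approach shows superior generalization on a benchmark of four cancer types, laying a foundation for cross-cancer multimodal prognosis.
Insight: The paper exposes the shortcomings of multimodal models in cross-cancer generalization and offers an effective remedy, pointing to a new research direction for practical clinical use.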
Abstract: Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
[47] MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion
Jihao Gu,Fei Wang,Kun Li,Yanyan Wei,Zhiliang Wu,Dan Guo
Main category: cs.CV
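TL;DR: MM-Gesture is a multimodal fusion framework that combines several modalities (joint, limb, RGB video, and others) using PoseConv3D and Video Swin Transformer architectures with a weighted ensemble strategy, substantially improving micro-gesture recognition and reaching 73.213% top-1 accuracy.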
Details
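Motivation: Micro-gestures (MGs) are hard to recognize because they are brief and subtle. Existing methods do not fully exploit the complementarity of multimodal data, and MM-Gesture aims to improve recognition through multimodal fusion. Contribution: 1. Proposes the MM-Gesture multimodal fusion framework, which integrates multiple modalities; 2. introduces a modality-weighted ensemble strategy; 3. achieves a notable improvement on the iMiGUE benchmark.
Method: 1. Combines joint, limb, RGB video, and other modalities; 2. uses PoseConv3D and Video Swin Transformer architectures; 3. fuses the modalities with a weighted ensemble strategy.
Result: On the iMiGUE benchmark, the method reaches a top-1 accuracy of 73.213%, outperforming previous methods.
Insight: Multimodal fusion is key to improving micro-gesture recognition, and a weighted ensemble can balance the contributions of different modalities.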
Abstract: In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%.
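A modality-weighted ensemble can be as simple as a weighted average of per-modality class probabilities. The abstract does not specify how MM-Gesture sets or applies its weights, so the averaging rule, the weights, and the scores below are hypothetical.

```python
def weighted_ensemble(scores_by_modality, weights):
    """Fuse per-modality class probabilities with a normalized weighted average.

    scores_by_modality: dict mapping modality name to a list of class probabilities.
    weights:            dict mapping modality name to a non-negative weight.
    Returns (fused probabilities, index of the predicted class).
    """
    total_weight = sum(weights[m] for m in scores_by_modality)
    n_classes = len(next(iter(scores_by_modality.values())))
    fused = [
        sum(weights[m] * scores[c] for m, scores in scores_by_modality.items())
        / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return fused, predicted

# Hypothetical two-class scores from two modalities; the joint stream is trusted more.
scores = {"rgb": [0.6, 0.4], "joint": [0.2, 0.8]}
weights = {"rgb": 1.0, "joint": 3.0}
fused, predicted = weighted_ensemble(scores, weights)
```

Here the RGB stream alone would predict class 0, but the more heavily weighted joint stream shifts the fused decision to class 1.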
[48] Cycle Context Verification for In-Context Medical Image Segmentation
Shishuai Hu,Zehui Liao,Liangli Zhen,Huazhu Fu,Yong Xia
Main category: cs.CV
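TL;DR: The paper proposes Cycle Context Verification (CCV), a new framework that improves medical image segmentation based on in-context learning (ICL). By self-verifying predictions and improving contextual alignment, CCV performs well across multiple datasets.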
Details
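Motivation: In medical image segmentation, ICL performance depends heavily on the alignment between the query image and the in-context image-mask pairs. Because annotated data is scarce and fine-tuning is computationally costly, it is hard to select optimal in-context pairs or fine-tune the foundation model. Contribution: Proposes the CCV framework, which verifies predictions cyclically and improves contextual alignment, substantially improving ICL performance in medical image segmentation.
Method: A cyclic pipeline: the model first generates a segmentation mask for the query image, then the roles of the query and an in-context pair are swapped to verify the prediction, and a query-specific prompt is updated to improve alignment.
Result: Experiments on seven medical image segmentation datasets show that CCV outperforms existing methods.
Insight: Through self-verification and dynamic improvement of contextual alignment, CCV provides a robust solution for universal medical image segmentation.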
Abstract: In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation, where a variety of objects of interest across imaging modalities can be segmented using a single model. Nevertheless, its performance is highly sensitive to the alignment between the query image and in-context image-mask pairs. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs, and fine-tuning foundation ICL models on contextual data is infeasible due to computational costs and the risk of catastrophic forgetting. To address this challenge, we propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation by enabling self-verification of predictions and accordingly enhancing contextual alignment. Specifically, CCV employs a cyclic pipeline in which the model initially generates a segmentation mask for the query image. Subsequently, the roles of the query and an in-context pair are swapped, allowing the model to validate its prediction by predicting the mask of the original in-context image. The accuracy of this secondary prediction serves as an implicit measure of the initial query segmentation. A query-specific prompt is introduced to alter the query image and updated to improve the measure, thereby enhancing the alignment between the query and in-context pairs. We evaluated CCV on seven medical image segmentation datasets using two ICL foundation models, demonstrating its superiority over existing methods. Our results highlight CCV’s ability to enhance ICL-based segmentation, making it a robust solution for universal medical image segmentation. The code will be available at https://github.com/ShishuaiHu/CCV.
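The role swap at the core of the cyclic pipeline can be sketched with a stand-in model. Everything concrete here is an assumption: the Dice coefficient as the accuracy measure, the `model(image, context_image, context_mask)` calling convention, and the toy thresholding "model" that ignores its context. The query-specific prompt update is not shown.

```python
def dice(pred, target):
    """Dice coefficient between two flat binary masks (lists of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

def cycle_score(model, query_img, ctx_img, ctx_mask):
    """Segment the query, then swap roles and score the context prediction."""
    # Forward pass: use the in-context pair to segment the query.
    query_mask = model(query_img, ctx_img, ctx_mask)
    # Backward pass: treat (query image, predicted mask) as the context
    # and try to recover the known mask of the original in-context image.
    ctx_pred = model(ctx_img, query_img, query_mask)
    return query_mask, dice(ctx_pred, ctx_mask)

def toy_model(image, context_image, context_mask):
    """Stand-in segmenter: thresholds intensities and ignores the context."""
    return [int(v > 0.5) for v in image]

query_mask, score = cycle_score(
    toy_model,
    query_img=[0.9, 0.1, 0.7],
    ctx_img=[0.2, 0.7, 0.9],
    ctx_mask=[0, 1, 1],
)
```

Since the ground-truth mask of the in-context image is known, the backward score needs no label for the query, which is what lets it serve as an implicit quality signal.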
[49] Understanding Driving Risks using Large Language Models: Toward Elderly Driver Assessment
Yuki Yoshihara,Linjing Jiang,Nihan Karatas,Hitoshi Kanamori,Asuka Harada,Takahiro Tanaka
Main category: cs.CV
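TL;DR: The study explores the ability of a multimodal large language model (ChatGPT-4o) to assess traffic scenes from static dashcam images, focusing on three tasks relevant to elderly driver assessment: traffic density, intersection visibility, and stop sign recognition. Prompting strategy strongly affects performance, and future work should use larger datasets and newer model architectures.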
Details
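Motivation: The study aims to use the contextual reasoning of large language models (LLMs), rather than simple object detection, to support risk assessment tasks for elderly drivers and to make scene understanding more interpretable and practical. Contribution: 1) Proposes using an LLM for multi-task assessment of traffic scenes; 2) verifies the effect of prompt design on performance; 3) analyzes the model's output stability and interpretability.
Method: Zero-shot, few-shot, and multi-shot prompting strategies are evaluated against human annotations as the reference standard, measuring precision, recall, and F1-score on traffic density, intersection visibility, and stop sign recognition.
Result: Multi-shot prompting substantially improves performance (for example, recall for intersection visibility rises from 21.7% to 57.0%). In stop-sign detection the model shows high precision (up to 86.3%) but lower recall (approximately 76.7%). The model's explanatory texts are consistent with its predictions, which improves interpretability.
Insight: Prompt design is a key factor in LLM performance; the model's understanding of ambiguous scenes still needs improvement; LLMs are promising as supportive tools for driving risk assessment but need larger datasets and more advanced architectures.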
Abstract: This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. These tasks require contextual reasoning rather than simple object detection. Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but a lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that humans and the model faced difficulties interpreting structurally ambiguous scenes. However, the model’s explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.
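The three metrics are related by a simple formula, sketched below. Pairing the reported stop-sign precision (up to 86.3%) with the reported recall (approximately 76.7%) is an assumption made only for illustration, since the abstract does not say that both figures come from the same prompting configuration; the resulting F1 is therefore indicative, not a reported result.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from confusion counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Illustrative pairing of the two stop-sign figures quoted in the abstract.
illustrative_f1 = f1_score(0.863, 0.767)

# Toy counts: 8 correct detections, 2 false alarms, 4 missed signs.
toy_precision, toy_recall = precision_recall(8, 2, 4)
```

A model with higher precision than recall, as here, misses some stop signs but rarely reports one that is not there, which matches the "conservative response tendency" noted in the abstract.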
[50] Unsupervised Methods for Video Quality Improvement: A Survey of Restoration and Enhancement Techniques
Alexandra Malyugina,Yini Li,Joanne Lin,Nantheera Anantrasirichai
Main category: cs.CV
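TL;DR: This survey reviews unsupervised methods for video quality improvement, focusing on restoration and enhancement techniques. It covers common video degradations, conventional and deep learning methods, and approaches based on domain translation, self-supervision signal design, and noise.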
Details
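Motivation: Video restoration and enhancement matter for visual quality and also serve as important pre-processing steps for downstream computer vision tasks. Unsupervised methods have become a research focus because of their applicability and flexibility. Contribution: Provides a comprehensive survey of unsupervised video restoration and enhancement techniques, categorizes the approaches (domain translation, self-supervision signal design, and others), and reviews the roles of loss functions and synthetic datasets.
Method: Reviews conventional and deep learning methods, with emphasis on unsupervised techniques including domain translation, self-supervision signal design, and blind-spot or noise-based methods.
Result: By summarizing existing techniques, the survey identifies the strengths and limitations of unsupervised methods and outlines directions for future research.
Insight: Unsupervised methods hold great potential for video restoration and enhancement, but designing more effective self-supervision signals and improving domain translation remain key challenges.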
Abstract: Video restoration and enhancement are critical not only for improving visual quality, but also as essential pre-processing steps to boost the performance of a wide range of downstream computer vision tasks. This survey presents a comprehensive review of video restoration and enhancement techniques with a particular focus on unsupervised approaches. We begin by outlining the most common video degradations and their underlying causes, followed by a review of early conventional and deep learning-based methods, highlighting their strengths and limitations. We then present an in-depth overview of unsupervised methods, categorised by their fundamental approaches, including domain translation, self-supervision signal design and blind spot or noise-based methods. We also provide a categorization of loss functions employed in unsupervised video restoration and enhancement, and discuss the role of paired synthetic datasets in enabling objective evaluation. Finally, we identify key challenges and outline promising directions for future research in this field.
[51] From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning
Sen Wang,Shao Zeng,Tianjun Gu,Zhizhong Zhang,Ruixin Zhang,Shouhong Ding,Jingyun Zhang,Jun Wang,Xin Tan,Yuan Xie,Lizhuang Ma
Main category: cs.CV
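TL;DR: The paper proposes a generalized bridge, GEFU, that connects low-light enhancement with low-light understanding, and improves generalization and scalability through Semantically Consistent Unsupervised Fine-tuning (SCUF).
Details
Motivation: Conventional approaches treat low-light enhancement and low-light understanding separately: the former relies on physical or geometric priors, and the latter is constrained by scarce labeled data. The resulting lack of generalization and scalability is the core problem.
Contribution: 1. Proposes the GEFU paradigm, which connects low-light enhancement and understanding; 2. Uses pretrained diffusion models to optimize images, achieving zero-shot generalization; 3. Introduces SCUF, including an illumination-aware image prompt and a cycle-attention adapter.
Method: 1. Optimizes images with a pretrained diffusion model; 2. Designs an illumination-aware prompt and a cycle-attention adapter; 3. Learns semantic and spatial features through caption and reflectance consistency.
Result: Experiments show the method outperforms existing techniques in image quality and on downstream tasks (classification, detection, and segmentation).
Insight: Semantically consistent unsupervised fine-tuning enables more robust visual understanding under low-light conditions and offers a new route to cross-task generalization.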
Abstract: Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
[52] Smelly, dense, and spreaded: The Object Detection for Olfactory References (ODOR) dataset
Mathias Zinnen,Prathmesh Madhu,Inger Leemans,Peter Bell,Azhar Hussian,Hang Tran,Ali Hürriyetoğlu,Andreas Maier,Vincent Christlein
Main category: cs.CV
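TL;DR: The paper presents the ODOR dataset for object detection in artworks, with 38,116 annotations across 139 fine-grained categories. It challenges existing models on scenes with dense, overlapping objects spread across the whole image.
Details
Motivation: Existing artwork object detection datasets are biased towards the image centre and offer a limited set of categories, so they cannot meet the humanities' need for robust algorithms.
Contribution: The ODOR dataset fills this gap with a large number of annotations and fine-grained categories, and demonstrates its difficulty through statistical analysis and baseline experiments.
Method: Collects 4712 images and annotates 38,116 object instances covering 139 categories; performs a statistical analysis and evaluates several object detection models.
Result: The ODOR dataset features dense, overlapping objects distributed over the whole image canvas, and it proves challenging for existing models.
Insight: The dataset advances research on artwork object detection and also invites exploration of the intersection between visual and olfactory perception.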
Abstract: Real-world applications of computer vision in the humanities require algorithms to be robust against artistic abstraction, peripheral objects, and subtle differences between fine-grained target classes. Existing datasets provide instance-level annotations on artworks but are generally biased towards the image centre and limited with regard to detailed object classes. The proposed ODOR dataset fills this gap, offering 38,116 object-level annotations across 4712 images, spanning an extensive set of 139 fine-grained categories. Conducting a statistical analysis, we showcase challenging dataset properties, such as a detailed set of categories, dense and overlapping objects, and spatial distribution over the whole image canvas. Furthermore, we provide an extensive baseline analysis for object detection models and highlight the challenging properties of the dataset through a set of secondary studies. Inspiring further research on artwork object detection and broader visual cultural heritage studies, the dataset challenges researchers to explore the intersection of object recognition and smell perception.
[53] PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models
Yongjian Zhang,Longguang Wang,Kunhong Li,Ye Zhang,Yun Wang,Liang Lin,Yulan Guo
Main category: cs.CV
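TL;DR: PanMatch is a unified matching model that exploits the generalization ability of Large Vision Models. It handles multiple tasks within a 2D displacement estimation framework, needs no task-specific design, and performs well in cross-domain and zero-shot settings.
Details
Motivation: Existing matching methods usually require task-specific architectures and fine-tuning, which limits their generality and generalization. PanMatch aims to solve multi-task matching within a unified framework and to improve cross-domain and zero-shot performance.
Contribution: 1) Proposes PanMatch, a unified matching framework that integrates multiple tasks through displacement estimation; 2) Leverages the all-purpose feature extraction of Large Vision Models; 3) Assembles a cross-domain dataset to pretrain the model; 4) Demonstrates zero-shot capability in abnormal scenarios.
Method: 1) Unifies multiple matching tasks within a 2D displacement estimation framework; 2) Introduces an all-purpose feature extractor from Large Vision Models; 3) Designs a feature transformation pipeline; 4) Trains on a cross-domain dataset to improve generalization.
Result: PanMatch outperforms UniMatch and Flow-Anything in cross-task evaluations, is comparable to most state-of-the-art methods on task-specific benchmarks, and shows zero-shot capability in abnormal scenarios such as rainy days and satellite imagery.
Insight: A unified displacement estimation framework could replace task-specific designs; the feature extraction ability of Large Vision Models has broad potential in multi-task matching.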
Abstract: This work presents PanMatch, a versatile foundation model for robust correspondence matching. Unlike previous methods that rely on task-specific architectures and domain-specific fine-tuning to support tasks like stereo matching, optical flow or feature matching, our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework using the same model weights. Such a formulation eliminates the need for designing specialized unified architectures or task-specific ensemble models. Instead, it achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities. To this end, we highlight the importance of a robust feature extractor applicable across multiple domains and tasks, and propose a feature transformation pipeline that leverages all-purpose features from Large Vision Models to endow matching baselines with zero-shot cross-view matching capabilities. Furthermore, we assemble a cross-domain dataset with nearly 1.8 million samples from stereo matching, optical flow, and feature matching domains to pretrain PanMatch. We demonstrate the versatility of PanMatch across a wide range of domains and downstream tasks using the same model weights. Our model outperforms UniMatch and Flow-Anything on cross-task evaluations, and achieves comparable performance to most state-of-the-art task-specific algorithms on task-oriented benchmarks. Additionally, PanMatch presents unprecedented zero-shot performance in abnormal scenarios, such as rainy days and satellite imagery, where most existing robust algorithms fail to yield meaningful results.
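To make the "2D displacement estimation" framing concrete, the sketch below shows how a stereo disparity map can be rewritten as a 2D displacement field, the same form an optical flow field takes. This is not PanMatch's code: the function names are hypothetical, and it assumes a rectified stereo pair with disparity measured from the left image, so that a left pixel at (x, y) matches the right pixel at (x - d, y).

```python
def disparity_to_displacement(disparity):
    """Turn a left-view disparity map (rows of floats) into per-pixel (dx, dy).

    Assumes a rectified pair: matches lie on the same row, shifted left by d.
    """
    return [[(-d, 0.0) for d in row] for row in disparity]


def matched_coordinates(displacement):
    """Apply the displacement field to get each pixel's matching coordinate."""
    return [
        [(x + dx, y + dy) for x, (dx, dy) in enumerate(row)]
        for y, row in enumerate(displacement)
    ]


disparity = [[0.0, 1.0, 2.0]]           # one row, three pixels
field = disparity_to_displacement(disparity)
print(matched_coordinates(field))        # where each left pixel lands in the right view
```

Under this view, stereo matching is a displacement field whose vertical component is zero, while optical flow and feature matching allow both components to vary, which is why a single displacement estimator can in principle serve all three.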
[54] Deep Hashing with Semantic Hash Centers for Image Retrieval
Li Chen,Rui Liu,Yuxiang Zhou,Xudong Ma,Yong Chen,Dell Zhang
Main category: cs.CV
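TL;DR: The paper proposes SHC, a deep hashing method built on semantic hash centers. It generates the centers from a data-dependent similarity calculation to improve image retrieval performance.
Details
Motivation: Existing methods improve retrieval by pre-assigning hash centers but ignore the semantic relationships between classes, which can hurt retrieval. SHC aims to address this with semantic hash centers that preserve semantic structure.
Contribution: Introduces the concept of semantic hash centers and a three-stage framework (SHC) comprising semantic similarity calculation, hash center optimization, and deep hashing network training, which significantly improves retrieval performance.
Method: 1. Uses a classification network to compute inter-class semantic similarities with a data-dependent method;
2. Runs an optimization algorithm to generate semantic hash centers while enforcing a minimum distance between them;
3. Trains a deep hashing network to produce binary hash codes.
Result: Across several public datasets, SHC achieves average improvements of 7.26%, 7.62%, and 11.71% in MAP@100, MAP@1000, and MAP@ALL, respectively.
Insight: Hash center design should take semantic relationships into account; a data-dependent similarity calculation adapts better to varying data distributions and improves both the discriminability of hash codes and retrieval performance.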
Abstract: Deep hashing is an effective approach for large-scale image retrieval. Current methods are typically classified by their supervision types: point-wise, pair-wise, and list-wise. Recent point-wise techniques (e.g., CSQ, MDS) have improved retrieval performance by pre-assigning a hash center to each class, enhancing the discriminability of hash codes across various datasets. However, these methods rely on data-independent algorithms to generate hash centers, which neglect the semantic relationships between classes and may degrade retrieval performance. This paper introduces the concept of semantic hash centers, building on the idea of traditional hash centers. We hypothesize that hash centers of semantically related classes should have closer Hamming distances, while those of unrelated classes should be more distant. To this end, we propose a three-stage framework, SHC, to generate hash codes that preserve semantic structure. First, we develop a classification network to identify semantic similarities between classes using a data-dependent similarity calculation that adapts to varying data distributions. Second, we introduce an optimization algorithm to generate semantic hash centers, preserving semantic relatedness while enforcing a minimum distance between centers to avoid excessively similar hash codes. Finally, a deep hashing network is trained using these semantic centers to convert images into binary hash codes. Experimental results on large-scale retrieval tasks across several public datasets show that SHC significantly improves retrieval performance. Specifically, SHC achieves average improvements of +7.26%, +7.62%, and +11.71% in MAP@100, MAP@1000, and MAP@ALL metrics, respectively, over state-of-the-art methods.
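The hypothesis in the abstract (related classes get closer hash centers, with a minimum distance enforced between all centers) can be illustrated with a few hand-written codes. The sketch below only checks these two properties on toy 8-bit centers; the class names, codes, and threshold are made up for the example, and it does not reproduce the paper's optimization algorithm.

```python
from itertools import combinations


def hamming(a, b):
    """Number of bit positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(centers):
    """Smallest Hamming distance between any two hash centers."""
    return min(hamming(a, b) for a, b in combinations(centers.values(), 2))


# Hypothetical 8-bit hash centers for three classes.
centers = {
    "cat":   (1, 1, 1, 1, 0, 0, 0, 0),
    "dog":   (1, 1, 1, 0, 1, 0, 0, 0),
    "truck": (0, 0, 0, 0, 1, 1, 1, 1),
}

MIN_DISTANCE = 2  # assumed threshold that keeps centers from collapsing together

print(hamming(centers["cat"], centers["dog"]))    # related classes: small distance
print(hamming(centers["cat"], centers["truck"]))  # unrelated classes: large distance
print(min_pairwise_distance(centers) >= MIN_DISTANCE)
```

A data-independent scheme would typically push all three centers equally far apart; the semantic variant deliberately lets "cat" and "dog" sit closer while still respecting the minimum distance.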
[55] Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models
Shijun Yang,Xiang Zhang,Wanqing Zhao,Hangzai Luo,Sheng Zhong,Jinye Peng,Jianping Fan
Main category: cs.CV
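TL;DR: The paper proposes MuGCP, a new multi-modal conditional prompt learning framework. It uses Multi-modal Large Language Models to generate semantic conditional prompts and combines an Attention Mutual-Guidance module with a prompt fusion mechanism to improve Vision-Language Models.
Details
Motivation: Conventional prompt learning generalizes poorly to unseen classes, and cross-modal alignment is usually confined to the output layer of the encoders, which limits performance. MuGCP aims to address both problems.
Contribution: 1. Proposes MuGCP, a multi-modal mutual-guidance conditional prompt learning framework; 2. Introduces the Attention Mutual-Guidance (AMG) module and the Multi-Prompt Fusion (MPF) mechanism; 3. Demonstrates superior performance on 14 datasets.
Method: MuGCP uses Multi-modal Large Language Models to generate Semantic Conditional Prompts (SCP), generates Visual Conditional Prompts (VCP) through the AMG module, and finally integrates the different prompts through the MPF mechanism.
Result: Outperforms existing state-of-the-art methods on 14 datasets, supporting the effectiveness of MuGCP.
Insight: Conditional prompt learning combined with cross-modal interaction can markedly improve Vision-Language Models on unseen classes and multi-modal tasks.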
Abstract: Prompt learning facilitates the efficient adaptation of Vision-Language Models (VLMs) to various downstream tasks. However, it faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders, which fundamentally limits their capacity to preserve topological consistency with pre-trained multi-modal embedding spaces. To this end, we introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that incorporate rich, fine-grained high-level semantic knowledge for image instances. To ensure effective alignment and interaction across the multi-modal space of Vision-Language Models (VLMs), we introduce the Attention Mutual-Guidance (AMG) module, which facilitates interactions between visual and semantic information. Through mutual guidance, the AMG module generates Visual Conditional Prompts (VCP), enhancing the model’s performance in multi-modal tasks. Additionally, we present a Multi-Prompt Fusion (MPF) mechanism that integrates SCP and VCP with contextual prompts, ensuring seamless coordination among the different prompts and enhancing the modeling of class embeddings and instance-specific knowledge. Our MuGCP outperforms existing state-of-the-art methods on 14 different datasets. The code will be made available after publication.
[56] InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
Zesong Yang,Bangbang Yang,Wenqi Dong,Chenxuan Cao,Liyuan Cui,Yuewen Ma,Zhaopeng Cui,Hujun Bao
Main category: cs.CV
Abstract: Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robots remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
[57] Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
Wongi Jeong,Kyungryeol Lee,Hoigi Seo,Se Young Chun
Main category: cs.CV
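TL;DR: RALU is a training-free framework that accelerates diffusion transformer inference through mixed-resolution sampling, substantially reducing computation while preserving image quality.
Details
Motivation: Diffusion transformers excel at high-fidelity image and video generation, but their heavy computation hinders real-world deployment. Existing acceleration methods mainly exploit the temporal dimension and overlook the spatial dimension.
Contribution: Proposes the RALU framework, which accelerates diffusion transformer inference along the spatial dimension via region-adaptive latent upsampling, achieves up to 7× speed-up, and is complementary to existing temporal acceleration methods.
Method: RALU uses three-stage mixed-resolution sampling (low-resolution denoising, region-adaptive upsampling, and full-resolution refinement), combined with noise-timestep rescheduling to stabilize generation quality.
Result: Achieves up to 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation in image quality.
Insight: RALU shows the potential of accelerating diffusion models along the spatial dimension and suggests a new direction for designing efficient generative models.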
Abstract: Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full resolution, and 3) all latent upsampling at full resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and can thus be seamlessly integrated to further reduce inference latency without compromising generation quality.
[58] Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
Anlin Zheng,Xin Wen,Xuanyang Zhang,Chuofan Ma,Tiancai Wang,Gang Yu,Xiangyu Zhang,Xiaojuan Qi
Main category: cs.CV
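TL;DR: The paper proposes VFMTok, a new image tokenizer that uses a pretrained vision foundation model as its encoder. With a region-adaptive quantization framework and a semantic reconstruction objective, it substantially improves image reconstruction and generation quality, raises token efficiency, and performs strongly in autoregressive generation.
Details
Motivation: Vision foundation models have traditionally been used for visual understanding, and their use in image generation remains underexplored. The paper aims to exploit their strong representations to build an efficient image tokenizer that improves generation.
Contribution: Proposes the VFMTok image tokenizer, which combines a region-adaptive quantization framework with a semantic reconstruction objective to improve reconstruction and generation quality; achieves a gFID of 2.07 in autoregressive generation and accelerates model convergence.
Method: Uses a frozen vision foundation model as the encoder, introduces a region-adaptive quantization framework to reduce redundant features, and designs a semantic reconstruction objective to preserve semantic fidelity.
Result: On ImageNet benchmarks, VFMTok reaches a gFID of 2.07, accelerates model convergence by three times, and enables high-fidelity class-conditional synthesis without classifier-free guidance (CFG).
Insight: The strong representations of pretrained vision foundation models make it possible to build efficient tokenizers for image generation while maintaining semantic consistency and generation quality.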
Abstract: Leveraging the powerful representations of pre-trained vision foundation models – traditionally used for visual comprehension – we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer’s outputs with the foundation model’s representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation – achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
[59] Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT
Wei Zhang,Yihang Wu,Songhua Li,Wenjie Ma,Xin Ma,Qiang Li,Qi Wang
Main category: cs.CV
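TL;DR: This survey systematically reviews feed-forward 3D reconstruction, tracing progress from DUSt3R to VGGT, comparing traditional and learning-based methods, and discussing future challenges.
Details
Motivation: Traditional 3D reconstruction methods such as SfM and MVS rely on iterative optimization, which is computationally expensive and not robust in challenging scenes. Deep learning has driven the development of feed-forward 3D reconstruction, improving efficiency and applicability.
Contribution: 1. Systematically surveys feed-forward 3D reconstruction; 2. Analyzes Transformer-based correspondence modeling and joint pose and geometry regression mechanisms; 3. Contrasts traditional and learning-based methods and discusses future directions.
Method: Feed-forward 3D reconstruction models such as DUSt3R use a unified deep network to infer camera poses and dense geometry directly from an unconstrained set of images, avoiding traditional iterative optimization.
Result: Feed-forward methods greatly simplify the 3D reconstruction pipeline, improve efficiency, and outperform traditional methods in some scenarios. Model accuracy and the handling of dynamic scenes still need improvement.
Insight: Feed-forward 3D reconstruction marks a paradigm shift; future work should focus on model accuracy, scalability, and adaptation to dynamic scenes.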
Abstract: 3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an unconstrained set of images in a single forward pass. This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology’s broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.
[60] A document is worth a structured record: Principled inductive bias design for document recognition
Benjamin Meyer,Lukas Tuggener,Sascha Hänzi,Daniel Schmid,Erdal Ayfer,Benjamin F. Grewe,Ahmed Abdulkadir,Thilo Stadelmann
Main category: cs.CV
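TL;DR: The paper proposes a new perspective that treats document recognition as a transcription task from a document to a structured record. By designing structure-specific inductive biases and a base transformer architecture, it addresses the neglect of intrinsic document structure in conventional methods. Experiments support its effectiveness on complex documents such as engineering drawings.
Details
Motivation: Existing document recognition methods usually treat the problem as a pure computer vision task and neglect document-type-specific intrinsic structure, which makes them dependent on heuristic post-processing and unable to handle complex or less frequent document types.
Contribution: 1. Proposes a new perspective that frames document recognition as transcription to a structured record; 2. Designs inductive biases for different document structures and a base transformer architecture; 3. Trains the first end-to-end model to transcribe engineering drawings.
Method: Designs structure-specific inductive biases and combines them with a base transformer architecture that is adapted to different document structures (sheet music, shape drawings, and engineering drawings).
Result: Experiments show the method is effective on complex documents including sheet music, shape drawings, and engineering drawings, and in particular achieves the first end-to-end transcription of engineering drawings.
Insight: Document recognition should incorporate the intrinsic structure of each document type; this can guide the design of future document foundation models and improve recognition of complex documents.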
Abstract: Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.
[61] F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement
Seyedeh Sahar Taheri Otaghsara,Reza Rahmanzadeh
Main category: cs.CV
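TL;DR: F3-Net is a foundation model for full abnormality segmentation of medical images. It accepts flexible input modalities and handles missing modalities through synthetic modality training and a zero-image strategy.
Details
Motivation: Clinical medical image segmentation suffers from reliance on complete multimodal inputs, limited generalizability, and narrow task specificity. F3-Net aims to address these problems.
Contribution: 1. Proposes a flexible input scheme that tolerates missing modalities; 2. Adapts to multiple pathology segmentation tasks without retraining; 3. Outperforms CNN-based and transformer-based models on several datasets.
Method: Uses synthetic modality training and a zero-image strategy, avoiding explicit synthesis networks and maintaining robust performance when input modalities are missing.
Result: Achieves average Dice Similarity Coefficients of 0.94 on BraTS-GLI 2024, 0.82 on BraTS-MET 2024, 0.94 on BraTS 2021, and 0.79 on ISLES 2022.
Insight: F3-Net's flexibility and generalization offer a practical route to clinical deployment of medical image segmentation.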
Abstract: F3-Net is a foundation model designed to overcome persistent challenges in clinical medical image segmentation, including reliance on complete multimodal inputs, limited generalizability, and narrow task specificity. Through flexible synthetic modality training, F3-Net maintains robust performance even in the presence of missing MRI sequences, leveraging a zero-image strategy to substitute absent modalities without relying on explicit synthesis networks, thereby enhancing real-world applicability. Its unified architecture supports multi-pathology segmentation across glioma, metastasis, stroke, and white matter lesions without retraining, outperforming CNN-based and transformer-based models that typically require disease-specific fine-tuning. Evaluated on diverse datasets such as BraTS 2021, BraTS 2024, and ISLES 2022, F3-Net demonstrates strong resilience to domain shifts and clinical heterogeneity. On the whole pathology dataset, F3-Net achieves average Dice Similarity Coefficients (DSCs) of 0.94 for BraTS-GLI 2024, 0.82 for BraTS-MET 2024, 0.94 for BraTS 2021, and 0.79 for ISLES 2022. This positions it as a versatile, scalable solution bridging the gap between deep learning research and practical clinical deployment.
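Two ideas from the abstract lend themselves to a short illustration: the Dice Similarity Coefficient used for evaluation, and substituting absent modalities with zero images. The sketch below is a minimal pure-Python version under stated assumptions: masks and scans are flattened lists, the modality names (T1, T1ce, T2, FLAIR) are the usual BraTS sequences rather than something the abstract specifies, and the helper names are hypothetical. It is not F3-Net's implementation.

```python
def dice(pred, target):
    """Dice Similarity Coefficient for two flattened binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


def fill_missing_modalities(scans, expected, num_voxels):
    """Return one input per expected modality, using an all-zero image if absent."""
    return {name: scans.get(name, [0.0] * num_voxels) for name in expected}


EXPECTED = ("T1", "T1ce", "T2", "FLAIR")  # assumed modality set

# A study where only two of the four sequences were acquired.
available = {"T1": [0.5, 0.2, 0.9], "FLAIR": [0.1, 0.7, 0.3]}
model_input = fill_missing_modalities(available, EXPECTED, num_voxels=3)

print(sorted(model_input))                   # all four channels are present
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))      # overlap of 1 voxel out of 3 labeled
```

Keeping the channel layout fixed and zero-filling the gaps lets one network accept any subset of sequences, which is the practical appeal of avoiding a separate synthesis network.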
[62] Dual Dimensions Geometric Representation Learning Based Document Dewarping
Heng Li,Qingcai Chen,Xiangping Wu
Main category: cs.CV
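TL;DR: The paper proposes D2Dewarp, a fine-grained deformation perception model based on dual dimensions (horizontal and vertical lines) for document image dewarping. It designs a fusion module based on X and Y coordinates to combine features from the two dimensions and proposes an automatic fine-grained annotation method to build a large-scale training dataset. It outperforms existing methods on public benchmarks.
Details
Motivation: Current document dewarping methods focus mainly on the single horizontal dimension and ignore vertical information, which limits further gains. The paper addresses this with a dual-dimension perception model.
Contribution: 1. Proposes the D2Dewarp model, which adds deformation perception along the vertical dimension. 2. Designs a fusion module based on X and Y coordinates so that features from the two dimensions complement each other. 3. Proposes an automatic annotation method to build a large-scale training dataset.
Method: Uses a dual-dimension (horizontal and vertical line) deformation perception model to capture distortion trends across document details; combines the two dimensions' features through the fusion module; generates an annotated dataset with an automatic rendering engine.
Result: On public Chinese and English benchmarks, both quantitative and qualitative results are better than those of existing methods.
Insight: Dual-dimension perception captures document distortion more completely and improves dewarping; the automatic annotation method offers a way forward for research that lacks annotated data.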
Abstract: Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose D2Dewarp, a fine-grained deformation perception model that focuses on the Dual Dimensions of document horizontal and vertical lines to improve document Dewarping. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinates is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and an automatic rendering engine to build a new large-scale distortion training dataset. The code and dataset will be publicly released. On public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods. The dataset will be publicly available at https://github.com/xiaomore/DocDewarpHV
[63] Unified People Tracking with Graph Neural Networks
Martin Engilberge,Ivan Vrkic,Friedrich Wilke Grosche,Julien Pilet,Engin Turetken,Pascal Fua
Main category: cs.CV
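TL;DR: The paper proposes a unified, fully differentiable multi-people tracking model based on graph neural networks (GNNs) that associates detections directly into trajectories without pre-computed tracklets. It builds a dynamic spatiotemporal graph to aggregate information and achieves state-of-the-art performance on public benchmarks and a new dataset.
Details
Motivation: Conventional multi-people tracking relies on pre-computed tracklets or hand-crafted features and struggles with occlusions and viewpoint changes. This work pursues more robust tracking through a unified end-to-end learning framework.
Contribution: 1) Proposes a unified, end-to-end differentiable model that needs no pre-computed tracklets; 2) Builds a dynamic spatiotemporal graph that aggregates spatial, temporal, contextual, and scene information; 3) Releases a new large-scale multi-view dataset to support research on occlusion and viewpoint diversity.
Method: The model builds a spatiotemporal graph with a dynamic graph neural network, where nodes are detections and edges aggregate spatiotemporal and contextual features. The graph structure is adjusted dynamically, supports information propagation across frames, and incorporates scene priors to handle occlusions.
Result: Achieves state-of-the-art performance on public benchmarks and the new dataset, and performs particularly well in scenes with occlusions and diverse viewpoints.
Insight: A dynamic graph structure combined with end-to-end learning captures complex spatiotemporal associations effectively, and scene priors are important for handling occlusions; the new dataset fills a gap in multi-view tracking research.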
Abstract: This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.
[64] Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation for Occluded Person Re-Identification
Yufei Zheng,Wenjun Wang,Wenjun Gan,Jiawei Liu
Main category: cs.CV
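TL;DR: The paper proposes OGFR, an occlusion-guided feature purification learning method that uses reinforced knowledge distillation to tackle diverse occlusions and feature contamination in occluded person re-identification.
Details
Motivation: Existing methods fall short when facing occlusion scenarios not seen during training and when holistic images introduce feature contamination.
Contribution: Proposes the OGFR framework, which incorporates diverse occlusion patterns into feature representation and distills purified knowledge, improving robustness under occlusion.
Method: Adopts a teacher-student distillation architecture with an Occlusion-Aware Vision Transformer and a Feature Erasing and Purification Module, in which a reinforcement learning agent identifies noisy patch tokens and replaces them with learnable embeddings.
Result: The student branch absorbs the purified knowledge, which significantly improves occluded person re-identification performance.
Insight: Explicitly modeling occlusion patterns and purifying contaminated features makes it possible to extract identity-related discriminative clues more effectively.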
Abstract: Occluded person re-identification aims to retrieve holistic images based on occluded ones. Existing methods often rely on aligning visible body parts, applying occlusion augmentation, or complementing missing semantics using holistic images. However, they face challenges in handling diverse occlusion scenarios not seen during training and the issue of feature contamination from holistic images. To address these limitations, we propose Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR), which simultaneously mitigates these challenges. OGFR adopts a teacher-student distillation architecture that effectively incorporates diverse occlusion patterns into feature representation while transferring the purified discriminative holistic knowledge from the holistic to the occluded branch through reinforced knowledge distillation. Specifically, an Occlusion-Aware Vision Transformer is designed to leverage learnable occlusion pattern embeddings to explicitly model such diverse occlusion types, thereby guiding occlusion-aware robust feature representation. Moreover, we devise a Feature Erasing and Purification Module within the holistic branch, in which an agent is employed to identify low-quality patch tokens of holistic images that contain noisy negative information via deep reinforcement learning, and substitute these patch tokens with learnable embedding tokens to avoid feature contamination and further excavate identity-related discriminative clues. Afterward, with the assistance of knowledge distillation, the student branch effectively absorbs the purified holistic knowledge to precisely learn robust representation regardless of the interference of occlusions.
[65] RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features
Inye Na,Nejung Rue,Jiwon Chung,Hyunjin Park
Main category: cs.CV
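TL;DR: The paper proposes RadiomicsRetrieval, a retrieval framework for 3D medical images that combines handcrafted radiomics features with deep learning to enable flexible and efficient medical image retrieval.
Details
Motivation: Existing medical image retrieval methods mainly target 2D images and require fully annotated queries, which limits flexibility in clinical use.
Contribution: 1. Proposes a 3D content-based retrieval framework that uses radiomics features and deep learning embeddings; 2. Derives tumor-specific embeddings with a promptable segmentation model (e.g., SAM); 3. Introduces anatomical positional embedding (APE) to add global context; 4. Enables flexible queries based on shape, location, or partial feature sets.
Method: 1. Obtains tumor embeddings with a promptable segmentation model; 2. Aligns radiomics features and deep embeddings through contrastive learning; 3. Adds anatomical positional embedding (APE) to provide global context.
Result: Experiments on public lung CT and brain MRI datasets show that radiomics features improve retrieval specificity and that APE is essential for location-based searches. The framework needs only minimal user prompts, such as a single point.
Insight: The framework combines the strengths of handcrafted features and deep learning; it improves retrieval performance, supports flexible clinical queries, and suggests new ways to use large-scale medical imaging repositories.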
Abstract: Medical image retrieval is a valuable field for supporting clinical decision-making, yet current methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. To address this, we propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging handcrafted radiomics descriptors with deep learning-based embeddings at the tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits volumetric data to leverage richer spatial context in medical images. We employ a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, which are aligned with radiomics features extracted from the same tumor via contrastive learning. These representations are further enriched by anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables flexible querying based on shape, location, or partial feature sets. Extensive experiments on both lung CT and brain MRI public datasets demonstrate that radiomics features significantly enhance retrieval specificity, while APE provides global anatomical context essential for location-based searches. Notably, our framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead and supporting diverse clinical scenarios. The capability to query using either image embeddings or selected radiomics attributes highlights its adaptability, potentially benefiting diagnosis, treatment planning, and research on large-scale medical imaging repositories. Our code is available at https://github.com/nainye/RadiomicsRetrieval.
[66] SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2
Alen Adamyan,Tomáš Čížek,Matej Straka,Klara Janouskova,Martin Schmid
Main category: cs.CV
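TL;DR: The paper proposes a reinforcement learning approach to memory control in SAM 2 that substantially improves visual object tracking performance.
Details
Motivation: Memory updates in SAM 2 currently depend on hand-crafted rules, which limits performance in difficult scenes with occlusions and distractors. The authors therefore use reinforcement learning to optimize the memory update policy.
Contribution: Introduces reinforcement learning to the memory control problem in SAM 2 and obtains a relative improvement over SAM 2 that is more than three times the gain of existing hand-crafted heuristics.
Method: Frames memory control as a sequential decision-making problem and trains a separate reinforcement learning agent per video to optimize the memory update policy.
Result: In an overfitting setup, the reinforcement learning method's relative improvement over SAM 2 exceeds the gains of existing heuristics by more than three times.
Insight: Reinforcement learning is an effective way to optimize memory control; it reveals untapped potential in the SAM 2 memory bank and points to a new direction for visual tracking algorithms.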
Abstract: Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks and has become the state-of-the-art for visual object tracking. The model stores information from previous frames in a memory bank, enabling temporal consistency across video sequences. Recent methods augment SAM 2 with hand-crafted update rules to better handle distractors, occlusions, and object motion. We propose a fundamentally different approach using reinforcement learning for optimizing memory updates in SAM 2 by framing memory control as a sequential decision-making problem. In an overfitting setup with a separate agent per video, our method achieves a relative improvement over SAM 2 that exceeds by more than three times the gains of existing heuristics. These results reveal the untapped potential of the memory bank and highlight reinforcement learning as a powerful alternative to hand-crafted update rules for memory control in visual object tracking.
[67] Image Translation with Kernel Prediction Networks for Semantic Segmentation
Cristina Mata,Michael S. Ryoo,Henrik Turbell
Main category: cs.CV
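TL;DR: The paper proposes the Domain Adversarial Kernel Prediction Network (DA-KPN), a new image translation method that predicts pixel-wise input transformation parameters so that the synthetic label semantically matches the translated image; it outperforms existing GAN-based methods on semantic segmentation.
Details
Motivation: Because real-world data is hard to annotate, semantic segmentation models are often trained on synthetic datasets, but translations produced by existing GAN-based methods do not guarantee semantic matching, which degrades segmentation performance.
Contribution: Proposes DA-KPN, which achieves semantic matching by predicting pixel-wise transformation parameters, uses multi-scale discriminators to make translations more realistic, and clearly improves semantic segmentation in low-data regimes.
Method: 1. A lightweight translation function whose pixel-wise transformation parameters are predicted; 2. multi-scale discriminators that enforce realistic translations; 3. adversarial training to keep the translated images realistic while the semantic match is preserved.
Result: Outperforms GAN-based methods on synthetic-to-real semantic segmentation benchmarks and performs comparably on face parsing.
Insight: Modeling pixel-wise transformations directly, rather than generating pixel values, preserves semantic information more effectively, while adversarial training keeps the translations realistic.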
Abstract: Semantic segmentation relies on many dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.
[68] Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
Enyu Liu,En Yu,Sijia Chen,Wenbing Tao
Main category: cs.CV
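TL;DR: The paper proposes DISC, a new dual-stream paradigm that optimizes instance and scene contexts separately to improve the fine-grained performance of 3D semantic scene completion.
Details
Motivation: Existing methods treat voxels as the basic interaction unit, which limits the use of class-level information and hurts the granularity of completion results.
Contribution: Proposes the DISC dual-stream paradigm, introduces queries that carry class-specific geometric and semantic priors, and designs specialized decoding modules to improve class-level information flow.
Method: Voxel queries are replaced with class queries, and specialized decoding modules are designed for instance and scene categories so that the two are optimized separately.
Result: Achieves SOTA performance on SemanticKITTI and SSCBench-KITTI-360; with single-frame input, it surpasses multi-frame methods in instance mIoU.
Insight: Separating instance and scene context learning markedly improves the use of class-level information and the granularity of completion.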
Abstract: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
[69] A Multi-Modal Fusion Framework for Brain Tumor Segmentation Based on 3D Spatial-Language-Vision Integration and Bidirectional Interactive Attention Mechanism
Mingda Zhang,Kaiwen Pan
Main category: cs.CV
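TL;DR: The paper proposes a multi-modal fusion framework based on 3D spatial-language-vision integration and a bidirectional interactive attention mechanism to improve the accuracy and boundary delineation of brain tumor segmentation.
Details
Motivation: Brain tumor segmentation is critical in medical image analysis, but existing methods often neglect the fusion of multi-modal information such as MRI and clinical text. The paper aims to improve segmentation by integrating spatial, language, and vision information.
Contribution: Proposes the Multi-modal Semantic Fusion Adapter (MSFA) and Bidirectional Interactive Visual-semantic Attention (BIVA), which fuse 3D MRI data with clinical text descriptions and enable bidirectional information exchange between modalities.
Method: MSFA performs hierarchical semantic decoupling, and BIVA enables iterative information exchange between modalities. The framework is evaluated on the BraTS 2020 dataset.
Result: Achieves an average Dice coefficient of 0.8505 and a 95% Hausdorff distance of 2.8256 mm, outperforming existing methods including SCAU-Net, CA-Net, and 3D U-Net.
Insight: Combining multi-modal semantic fusion with bidirectional attention significantly improves segmentation performance and offers a new paradigm for bringing clinical knowledge into medical image analysis.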
Abstract: This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Methods: We propose two core components: Multi-modal Semantic Fusion Adapter (MSFA) integrating 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA) enabling iterative information exchange between modalities. The framework was evaluated on BraTS 2020 dataset comprising 369 multi-institutional MRI scans. Results: The proposed method achieved average Dice coefficient of 0.8505 and 95% Hausdorff distance of 2.8256mm across enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed critical contributions of semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis.
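The Dice coefficient reported for this entry can be illustrated with a minimal sketch. It assumes flattened binary masks and treats two empty masks as a perfect match; both are conventions chosen here rather than details taken from the paper, and `dice_coefficient` is a hypothetical helper name.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A and B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks are counted as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (made up for illustration): 3 voxels each, 2 of them shared.
pred_mask = [0, 1, 1, 1, 0, 0]
target_mask = [0, 0, 1, 1, 1, 0]
dice = dice_coefficient(pred_mask, target_mask)  # 2 * 2 / (3 + 3)
print(f"Dice = {dice:.4f}")
```

In a multi-region setting such as enhancing tumor, tumor core, and whole tumor, the score would typically be computed per region and then averaged.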
[70] BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis
Shuang Cui,Jinglin Xu,Yi Li,Xiongxin Tang,Jiangmeng Li,Jiahuan Zhou,Fanjiang Xu,Fuchun Sun,Hui Xiong
Main category: cs.CV
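TL;DR: BayesTTA is a Bayesian continual-temporal test-time adaptation framework for vision-language models that uses Gaussian discriminant analysis to handle temporally evolving distribution shifts, significantly improving performance and stability.
Details
Motivation: In real-world scenarios, vision-language models degrade significantly under gradually evolving distribution shifts (e.g., illumination or seasonal changes). Existing methods typically target sudden shifts and neglect temporal continuity, leading to limited memory, unreliable entropy-based confidence, and misaligned visual representations.
Contribution: Formalizes the CT-TTA problem and designs the BayesTTA framework, which includes incremental distribution estimation based on Gaussian discriminant analysis, covariance structures selected by statistical hypothesis testing, and adaptive adjustment of normalization layers.
Method: Class-conditional distributions are estimated incrementally with Gaussian mixture models, covariance structures are selected by statistical tests, GDA is used for calibrated inference, and the calibrated predictions supervise the adaptive alignment of normalization layers.
Result: BayesTTA significantly outperforms existing methods on four temporally evolving datasets and ten standard TTA datasets while remaining efficient.
Insight: The importance of temporal continuity in distribution shift has been overlooked; BayesTTA's dynamic distribution estimation and adaptive alignment offer a new way to address it.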
Abstract: Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under temporally evolving distribution shifts common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as Continual-Temporal Test-Time Adaptation (CT-TTA), where test distributions evolve gradually over time. To address it, we propose BayesTTA, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at https://github.com/cuishuang99/BayesTTA.
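The idea of estimating class-conditional Gaussians incrementally without storing raw data, then classifying with Gaussian discriminant analysis, can be sketched as follows. This is a simplified illustration and not the BayesTTA implementation: it assumes one diagonal-covariance Gaussian per class (the paper describes mixtures and hypothesis-tested covariance structures), and `RunningGaussian` and `gda_posterior` are hypothetical names.

```python
import math
from typing import Dict, List


class RunningGaussian:
    """Running mean and per-dimension variance via Welford's algorithm.

    Only sufficient statistics are kept, so no raw samples are stored.
    """

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, x: List[float]) -> None:
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def variance(self, eps: float = 1e-6) -> List[float]:
        # Population variance plus a small floor (eps is an assumption).
        return [m / max(self.n, 1) + eps for m in self.m2]


def gda_posterior(x: List[float], stats: Dict[str, RunningGaussian]) -> Dict[str, float]:
    """p(class | x) under diagonal Gaussians with priors from sample counts."""
    total = sum(s.n for s in stats.values())
    log_scores: Dict[str, float] = {}
    for label, s in stats.items():
        if s.n == 0:
            continue  # skip classes that have not been observed yet
        log_lik = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, s.mean, s.variance())
        )
        log_scores[label] = math.log(s.n / total) + log_lik
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}


# Toy 2-D features (made up) arriving one at a time for two classes.
stats = {"cat": RunningGaussian(2), "dog": RunningGaussian(2)}
for feat in [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]:
    stats["cat"].update(feat)
for feat in [[2.0, 2.1], [1.8, 1.9], [2.2, 2.0]]:
    stats["dog"].update(feat)

posterior = gda_posterior([0.1, 0.05], stats)
print(posterior)
```

Because each update touches only the count, mean, and squared-deviation sums, memory use stays constant as the test stream grows, which is the property the abstract emphasizes.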
[71] DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images
Haoran Sun,Haoyu Bian,Shaoning Zeng,Yunbo Rao,Xu Xu,Lin Mei,Jianping Gou
Main category: cs.CV
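TL;DR: The paper proposes DatasetAgent, a multi-agent system that automatically constructs datasets from real-world images, avoiding the inefficiency of traditional manual methods.
Details
Motivation: Traditional dataset construction relies on time-consuming manual collection and annotation, while generated data, though fast to produce, is less valuable than real data. The authors therefore use a multi-agent system to construct real-world datasets automatically.
Contribution: Proposes DatasetAgent, which coordinates four different agents equipped with multi-modal large language models to construct high-quality datasets automatically according to user-specified requirements.
Method: DatasetAgent is a multi-agent collaborative system in which agents equipped with multi-modal large language models (MLLMs), together with a tool package for image optimization, jointly carry out dataset construction.
Result: Experiments show that the system can expand existing datasets or create new ones from scratch, and that the resulting datasets can be used to train image classification, object detection, and image segmentation models.
Insight: Combining agent collaboration with MLLMs enables efficient, high-quality automatic dataset construction and offers computer vision a new source of data.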
Abstract: Common knowledge indicates that the process of constructing image datasets usually depends on the time-intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are more valuable than artificial-intelligence-generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto-constructing datasets from real-world images by a multi-agent collaborative system, named DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted, including expanding existing datasets and creating new ones from scratch, on a variety of open-source datasets. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.
[72] Generalizable 7T T1-map Synthesis from 1.5T and 3T T1 MRI with an Efficient Transformer Model
Zach Eidex,Mojtaba Safari,Tonghe Wang,Vanessa Wildman,David S. Yu,Hui Mao,Erik Middlebrooks,Aparna Kesewala,Xiaofeng Yang
Main category: cs.CV
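TL;DR: This paper proposes an efficient Transformer-based model (7T-Restormer) that synthesizes 7T-quality T1 maps from routine 1.5T or 3T T1-weighted images, improving image quality while reducing computational requirements.
Details
Motivation: Ultra-high-field 7T MRI offers better resolution and contrast than standard clinical field strengths (1.5T, 3T), but 7T scanners are costly, scarce, and prone to artifacts. This paper aims to bring the benefits of 7T MRI into routine clinical workflows through synthesis.
Contribution: 1. Proposes the 7T-Restormer model for efficient synthesis of 7T T1 maps from 1.5T and 3T MRI;
2. Outperforms existing models (ResViT and ResShift) with fewer parameters;
3. Demonstrates the advantage of training on mixed 1.5T and 3T data.
Method: 1. Uses a Transformer architecture (7T-Restormer) as the generative model;
2. Validates the model on 35 1.5T and 108 3T T1W MRI scans paired with 7T T1 maps;
3. Evaluation metrics include PSNR, SSIM, and NMSE.
Result: The model achieves PSNR 26.0 ±4.6 dB, SSIM 0.861 ±0.072, and NMSE 0.019 ±0.011 for 1.5T inputs, and PSNR 25.9 ±4.9 dB and SSIM 0.866 ±0.077 for 3T inputs. With only 10.5M parameters, it substantially reduces computational cost.
Insight: 1. Training on mixed 1.5T and 3T data outperforms single-field-strength training;
2. Synthetic 7T T1 maps can help compensate for the scarcity of 7T scanners and give clinicians access to higher-resolution imaging.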
Abstract: Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1-maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1W MRI paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices) where (X; Y) denotes the patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and a PSNR of 25.9 +/- 4.9 dB and SSIM of 0.866 +/- 0.077 for 3T inputs. Using 10.5M parameters, our model reduced NMSE by 64% relative to the 56.7M-parameter ResShift (0.019 vs 0.052, p < .001) and by 41% relative to the 70.4M-parameter ResViT (0.019 vs 0.032, p < .001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5T + 3T corpus was superior to single-field strategies. Restricting the model to 1.5T increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3) while training solely on 3T resulted in lower performance on input 1.5T T1W MRI. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.
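Two of the metrics named in this entry, PSNR and NMSE, can be written out in a few lines. The sketch below uses common textbook definitions (PSNR from the mean squared error and a stated data range, NMSE as squared error normalized by the reference energy); the abstract does not specify the exact normalization the authors used, so these definitions and the helper names are assumptions.

```python
import math
from typing import Sequence


def mse(ref: Sequence[float], est: Sequence[float]) -> float:
    """Mean squared error between a reference and an estimate."""
    return sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)


def psnr(ref: Sequence[float], est: Sequence[float], data_range: float) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(range^2 / MSE)."""
    err = mse(ref, est)
    return math.inf if err == 0 else 10.0 * math.log10(data_range ** 2 / err)


def nmse(ref: Sequence[float], est: Sequence[float]) -> float:
    """Squared error normalized by the energy of the reference signal."""
    num = sum((r - e) ** 2 for r, e in zip(ref, est))
    den = sum(r ** 2 for r in ref)
    return num / den


# Toy intensities in [0, 1] (made up for illustration).
reference = [0.0, 0.5, 1.0, 0.5]
synthetic = [0.1, 0.5, 0.9, 0.5]

print(f"PSNR = {psnr(reference, synthetic, data_range=1.0):.2f} dB")
print(f"NMSE = {nmse(reference, synthetic):.4f}")
```

Higher PSNR and lower NMSE both indicate a synthetic map closer to the reference, which is the direction of the comparisons quoted in the abstract.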
[73] ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
Rajarshi Roy,Devleena Das,Ankesh Banerjee,Arjya Bhattacharjee,Kousik Dasgupta,Subarna Tripathi
Main category: cs.CV
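TL;DR: ByDeWay is a training-free framework that uses Layered-Depth-Based Prompting (LDP) to enhance multimodal large language models, improving spatial reasoning and grounding.
Details
Motivation: Current multimodal large language models are weak at spatial reasoning and grounding and are prone to hallucinated responses, so a lightweight, training-free remedy is needed.
Contribution: 1. Proposes Layered-Depth-Based Prompting (LDP), which uses monocular depth estimation to generate structured captions; 2. Improves model performance consistently in a zero-training setting.
Method: 1. Segments the scene into closest, mid-range, and farthest layers using monocular depth estimation; 2. Generates a caption for each layer with a grounded vision-language model; 3. Appends the captions to the image-question prompt to enrich its spatial context.
Result: Consistent improvements on the POPE and GQA benchmarks validate the effectiveness of depth-aware prompting.
Insight: Structured prompting can improve response quality without modifying model parameters, showing the potential of guiding models with external context.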
Abstract: We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
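The layering and prompt-augmentation steps described in this entry can be sketched as below. This is an illustration under stated assumptions, not the ByDeWay code: the depth map is taken to hold metric depth (smaller means closer), layers are split at depth tertiles, and the captions are placeholder strings standing in for the output of a grounded vision-language model. `split_depth_layers` and `build_prompt` are hypothetical names.

```python
from typing import Dict, List

LAYER_ORDER = ("closest", "mid-range", "farthest")


def split_depth_layers(depth: List[List[float]]) -> List[List[str]]:
    """Label every pixel with one of three layers using depth tertiles.

    Assumption: smaller depth values are closer to the camera. Many monocular
    estimators output inverse depth, in which case the order would be flipped.
    """
    values = sorted(v for row in depth for v in row)
    n = len(values)
    t1 = values[n // 3]        # lower tertile threshold
    t2 = values[(2 * n) // 3]  # upper tertile threshold

    def label(v: float) -> str:
        if v < t1:
            return "closest"
        if v < t2:
            return "mid-range"
        return "farthest"

    return [[label(v) for v in row] for row in depth]


def build_prompt(question: str, captions: Dict[str, str]) -> str:
    """Append one caption per depth layer to the original question."""
    lines = [f"{name} layer: {captions[name]}" for name in LAYER_ORDER if name in captions]
    return question + "\nDepth-aware context:\n" + "\n".join(lines)


# Toy 3x3 depth map in meters (made up for illustration).
depth_map = [
    [0.5, 0.6, 3.0],
    [0.7, 2.5, 9.0],
    [2.8, 8.0, 10.0],
]
layers = split_depth_layers(depth_map)

# Placeholder captions; in the paper these come from a captioning model.
captions = {
    "closest": "a red mug on a desk",
    "mid-range": "an office chair",
    "farthest": "a window with blinds",
}
question = "Is there a chair behind the mug?"
prompt = build_prompt(question, captions)
print(prompt)
```

Since only the text prompt changes, this kind of augmentation works with black-box models that expose no parameters, which matches the compatibility claim in the abstract.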
[74] MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing
Debashis Gupta,Aditi Golder,Rongkhun Zhu,Kangning Cui,Wei Tang,Fan Yang,Ovidiu Csillik,Sarra Alaqahtani,V. Paul Pauca
Main category: cs.CV
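TL;DR: MoSAiC is a multi-modal, multi-label supervision-aware contrastive learning framework for multi-modal satellite imagery in Earth System Observation tasks; it addresses the limitations of conventional contrastive learning in multi-label settings.
Details
Motivation: Multi-modal satellite imagery in Earth System Observation presents challenges such as high inter-class similarity, scene clutter, and ambiguous boundaries, and conventional contrastive learning methods cannot handle multi-label alignment and semantic precision effectively.
Contribution: Proposes the MoSAiC framework, which incorporates a multi-label supervised contrastive loss to improve representation learning for multi-modal imagery, semantic disentanglement, and generalization.
Method: MoSAiC jointly optimizes intra- and inter-modality contrastive learning together with a multi-label supervised contrastive loss to learn representations of multi-modal satellite imagery.
Result: On the BigEarthNet V2.0 and Sent12MS datasets, MoSAiC outperforms fully supervised and self-supervised baselines in low-label and high-class-overlap scenarios.
Insight: Combining multi-modal and multi-label supervision markedly improves contrastive learning for Earth System Observation tasks, particularly in semantic disentanglement and generalization.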
Abstract: Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without the reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning – especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.
[75] An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan
Mengyuan Liu,Jeongkyu Lee
Main category: cs.CV
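TL;DR: The paper proposes a training-free muscle segmentation method based on keypoint tracking with Lucas-Kanade optical flow, which lowers computational cost while achieving accuracy comparable to CNN-based methods.
Details
Motivation: Existing CNN-based muscle segmentation methods are computationally expensive, generalize poorly, and lack interpretability, so an efficient and interpretable alternative is needed.
Contribution: Proposes a training-free segmentation framework that uses keypoint tracking and optical flow for efficient, interpretable muscle segmentation.
Method: Combines keypoint selection with Lucas-Kanade optical flow, avoiding CNN training and reducing computational complexity.
Result: Achieves a mean Dice similarity coefficient of 0.6 to 0.7, which the authors report as comparable to state-of-the-art CNN-based methods, with substantially lower computational demands.
Insight: Training-free methods have an advantage when data are scarce, offering an efficient and interpretable approach to muscle segmentation.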
Abstract: Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
[76] L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training
Li Li,Yingzhe Peng,Xu Yang,Ruoxi Cheng,Haiyang Xu,Ming Yan,Fei Huang
Main category: cs.CV
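TL;DR: The paper proposes L-CLIPScore, a lightweight embedding-based captioning metric obtained by compressing and distilling CLIP, for efficiently evaluating caption quality and training captioning models.
Details
Motivation: Existing caption evaluation methods are often computationally expensive and hard to apply efficiently to large-scale datasets or real-time tasks.
Contribution: Proposes L-CLIPScore, computed from a lightweight CLIP variant (L-CLIP) that is compressed with weight multiplexing and matrix decomposition and distilled with a novel multi-modal Similarity Regulator (SR) loss.
Method: Weight multiplexing and matrix decomposition compress the model parameters, and the SR loss amplifies the similarity of matched pairs while diminishing that of non-matched pairs.
Result: L-CLIP retains multi-modal alignment ability comparable to the original CLIP while requiring fewer computational resources and less running time.
Insight: Training a captioning model with L-CLIPScore alone fails; it should be mixed with an n-gram-based metric.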
Abstract: We propose a novel embedding-based captioning metric termed L-CLIPScore that can be used for efficiently evaluating caption quality and training captioning models. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), which is a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two techniques, weight multiplexing and matrix decomposition, to reduce the parameters of the encoders and the word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss to transfer more vision-language alignment knowledge. Specifically, SR loss amplifies the multi-modal embedding similarity if the given image-text pair is matched and diminishes the similarity if the pair is non-matched. By compressing and distilling with this novel SR loss, our L-CLIP achieves multi-modal alignment ability comparable to the original CLIP while requiring fewer computation resources and less running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore when using it as the judge to evaluate caption quality. We also find that when L-CLIPScore is used as the supervisor to train a captioning model, it should be mixed with an n-gram-based metric, and we analyze why training with L-CLIPScore alone fails.
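An embedding-based caption score and the parameter saving from factorizing an embedding matrix can both be shown in a short sketch. The scoring function follows the original CLIPScore form, a rescaled cosine similarity clipped at zero with weight 2.5; whether L-CLIPScore keeps exactly this form is an assumption, as are the vocabulary size, embedding width, and rank used in the parameter count.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_style_score(image_emb: Sequence[float], text_emb: Sequence[float], w: float = 2.5) -> float:
    """CLIPScore-style metric: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)


def embedding_params(vocab: int, dim: int, rank: int = 0) -> int:
    """Parameter count of a vocab x dim embedding, optionally factorized.

    With rank r > 0 the matrix is replaced by (vocab x r) and (r x dim) factors.
    """
    return vocab * dim if rank == 0 else vocab * rank + rank * dim


# Toy 2-D embeddings (made up); real CLIP embeddings have hundreds of dimensions.
matched = clip_style_score([1.0, 0.0], [1.0, 1.0])
mismatched = clip_style_score([1.0, 0.0], [-1.0, 0.0])
print(f"matched = {matched:.4f}, mismatched = {mismatched:.4f}")

# Hypothetical sizes chosen only to show the effect of factorization.
full = embedding_params(vocab=50_000, dim=512)
factored = embedding_params(vocab=50_000, dim=512, rank=64)
print(f"full = {full}, factored = {factored}")
```

Clipping at zero means unrelated or contradictory captions all receive the lowest score, so the metric only ranks captions by how strongly they align with the image.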
[77] Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine
Kongwu Huang,Shiyi Mu,Jun Jiang,Yuan Gao,Shugong Xu
Main category: cs.CV
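TL;DR: The paper proposes Great-X, a single-engine multimodal data simulation platform built on Unreal Engine that efficiently and synchronously generates CSI, RGB, radar, and LiDAR data, and uses it to build Great-MSD, an open-source, large-scale, low-altitude UAV multimodal dataset.
Details
Motivation: To explore the potential of scaling laws in ISAC research, address the inefficiency and poor synchronization of multimodal data simulation, and provide foundational data for low-altitude UAV applications.
Contribution: 1. Proposes Great-X, a single-engine multimodal data simulation platform built on Unreal Engine; 2. Builds Great-MSD, an open-source, large-scale, low-altitude UAV multimodal dataset; 3. Proposes a baseline CSI-based UAV 3D localization algorithm and demonstrates its generalizability across CSI simulation engines.
Method: Sionna's ray-tracing computation is reconstructed within Unreal Engine and deeply integrated with autonomous driving tools, enabling efficient, synchronized simulation of multimodal data.
Result: Demonstrates the feasibility of the CSI-based UAV 3D localization algorithm and its generalizability across different CSI simulation engines.
Insight: A single-engine multimodal simulation platform gives ISAC research efficient, synchronized data generation, and the open-source dataset and algorithm support research on low-altitude UAV applications.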
Abstract: Scaling laws have achieved success in LLM and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset are publicly available at: https://github.com/hkw-xg/Great-MCD.
[78] RoundaboutHD: High-Resolution Real-World Urban Environment Benchmark for Multi-Camera Vehicle Tracking
Yuqiang Lin,Sam Lockyer,Mingxuan Sui,Li Gan,Florian Stanek,Markus Zarbock,Wenbin Li,Adrian Evans,Nic Zhang
Main category: cs.CV
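TL;DR: RoundaboutHD addresses the shortcomings of existing multi-camera vehicle tracking (MCVT) datasets by providing a high-resolution, multi-camera, annotated dataset of real-world roundabout scenes that supports object detection, single-camera tracking, vehicle re-identification, and other tasks.
Details
Motivation: Existing MCVT datasets feature overly simple scenarios, low resolution, and limited diversity, and so fall short of real-world requirements; a more realistic, high-quality dataset is needed.
Contribution: 1. Introduces the RoundaboutHD dataset: 40 minutes of 4K, 15 fps video with multi-camera annotations covering 512 vehicle identities; 2. Provides subsets for several tasks (detection, tracking, ReID) and auxiliary information (vehicle model and camera geometry data); 3. Presents harder challenges such as occlusions and nonlinear motion.
Method: Four non-overlapping high-resolution cameras capture video of a real roundabout, which is annotated in detail with vehicle identities and trajectories; camera geometry information is provided to support cross-camera association.
Result: Baseline results are provided for vehicle detection, single-camera tracking, vehicle re-identification, and multi-camera tracking, demonstrating the dataset's usefulness and difficulty.
Insight: High-resolution, real-world datasets are essential for advancing MCVT research; RoundaboutHD offers the field a valuable resource through its complex scenes and rich annotations.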
Abstract: The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenario. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K resolution, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporal consistency video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/ geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: https://github.com/siri-rouser/RoundaboutHD.git
[79] Ensemble of Weak Spectral Total Variation Learners: a PET-CT Case Study
Anna Rosenberg,John Kennedy,Zohar Keidar,Yehoshua Y. Zeevi,Guy Gilboa
Main category: cs.CV
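TL;DR: The paper proposes an ensemble of weak learners based on spectral total-variation (STV) features and applies it to PET-CT medical image analysis, where it outperforms deep learning and conventional Radiomics features.
Details
Motivation: To address the lack of sufficient training data in computer vision tasks, especially in medical image analysis, the paper proposes an ensemble learning method based on STV features.
Contribution: Designs an ensemble of weak learners based on STV features to cope with scarce medical imaging data and validates it on a real PET-CT dataset.
Method: Builds an ensemble of weak learners on STV features, which are related to nonlinear eigenfunctions of the total-variation subgradient, characterize textures at multiple scales, and are empirically weakly correlated in two dimensions. The approach is compared with deep learning methods and Radiomics features.
Result: The STV ensemble achieves the best AUC (0.87), ahead of neural networks (0.75) and Radiomics (0.79); fine-scale STV features in CT are especially indicative of high uptake in PET.
Insight: STV features perform well in medical image analysis, especially when data are scarce; their multi-scale, weakly correlated nature makes them a natural basis for ensemble learning.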
Abstract: Solving computer vision problems through machine learning, one often encounters a lack of sufficient training data. To mitigate this we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa 2014). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize textures well at various scales. It was shown (Burger et al. 2016) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically weakly correlated. Ensemble learning theory advocates the use of weakly correlated weak learners. We thus propose here to design ensembles using learners based on STV features. To show the effectiveness of this paradigm we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) for patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared to deep-learning methods and to Radiomics features, showing STV learners perform best (AUC=0.87), compared to neural nets (AUC=0.75) and Radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative of the presence of high uptake in PET.
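The argument that weakly correlated weak learners combine well, and the AUC metric used to compare methods, can be illustrated with a toy example. The scores below are invented for illustration and are unrelated to the paper's data; the ensemble rule (a plain average of learner scores) is also an assumption, since the abstract does not state how the learners are combined.

```python
from typing import List, Sequence


def ensemble_scores(learner_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-sample scores of several weak learners."""
    return [sum(col) / len(col) for col in zip(*learner_scores)]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two hypothetical weak learners that make different mistakes.
labels = [1, 1, 0, 0]
learner_a = [0.9, 0.4, 0.5, 0.1]   # misranks sample 2 against sample 3
learner_b = [0.6, 0.7, 0.2, 0.65]  # misranks sample 1 against sample 4
combined = ensemble_scores([learner_a, learner_b])

print(auc(learner_a, labels), auc(learner_b, labels), auc(combined, labels))
```

Each learner alone misorders one pair, but because their errors fall on different samples, averaging cancels them out in this toy case; that is the intuition behind preferring weakly correlated learners.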
[80] HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer
Tianlong Ai,Tianzhu Liu,Haochen Jiang,Yanfeng Gu
Main category: cs.CV
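TL;DR: The paper proposes HieraRS for multi-granularity hierarchical land cover and land use (LCLU) classification of remote sensing imagery, with support for transfer to cross-domain tasks with heterogeneous hierarchies. A Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM) and the cross-domain transfer framework TransLU improve flexibility and generalization.
Details
Motivation: Existing deep learning models for hierarchical LCLU classification mainly adopt a flat classification paradigm and cannot produce multi-granularity predictions aligned with tree-structured hierarchies; in addition, cross-domain research has paid little attention to transfer between heterogeneous hierarchies.
Contribution: 1. Proposes HieraRS, which supports multi-granularity hierarchical classification; 2. Introduces BHCCM to improve semantic consistency and classification accuracy; 3. Designs the TransLU framework for cross-domain transfer between heterogeneous hierarchies; 4. Constructs MM-5B, a large-scale multi-modal dataset.
Method: 1. BHCCM is integrated into mainstream flat classification models to generate hierarchical predictions; 2. TransLU comprises Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA) branches and supports dynamic category expansion and adaptation to heterogeneous hierarchies.
Result: HieraRS performs well in generating multi-granularity predictions and in cross-domain transfer tasks, improving classification accuracy and flexibility.
Insight: Combining hierarchical prediction with cross-domain transfer gives remote sensing image interpretation a more flexible tool; the ability to adapt to tasks with heterogeneous hierarchies is of particular practical value.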
Abstract: Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: https://github.com/AI-Tianlong/HieraRS.
[81] Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection
Rei Tamaru,Pei Li,Bin Ran
Main category: cs.CV
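TL;DR: Geo-ORBIT is a federated digital twin framework that combines real-time lane detection, digital twin synchronization, and federated meta-learning to address scalability and privacy in dynamic roadway geometry sensing for traffic management. Its core components are GeoLane, a lightweight lane detection model; Meta-GeoLane, which learns to personalize detection parameters for local entities; and FedMeta-GeoLane, a federated learning strategy. Experiments show that it outperforms baseline methods.
Details
Motivation: Dynamic roadway geometry sensing is a key component of digital twins for traffic management, but existing approaches rely on static maps or costly sensors, and collecting data from multiple sources raises privacy and efficiency challenges.
Contribution: Proposes the Geo-ORBIT framework, which integrates a lightweight lane detection model, localized parameter learning, and a federated learning strategy for scalable, privacy-preserving dynamic roadway sensing.
Method: Combines GeoLane, Meta-GeoLane (localized learning), and FedMeta-GeoLane (federated learning strategy), and integrates with CARLA and SUMO to create a high-fidelity digital twin.
Result: FedMeta-GeoLane achieves lower geometric error and stronger generalization across diverse urban scenes while greatly reducing communication overhead.
Insight: Combining federated learning with meta-learning gives digital twins an efficient, privacy-preserving way to model dynamic data.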
Abstract: Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at https://github.com/raynbowy23/FedMeta-GeoLane.git.
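The privacy-preserving aspect of a federated strategy comes from sharing model parameters rather than raw camera data. The sketch below shows generic federated averaging (FedAvg), in which a server combines per-site parameters weighted by sample count. It is not the FedMeta-GeoLane aggregation rule, which the abstract does not specify; the parameter name `lane_offset` and the site values are hypothetical.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Weighted average of client parameters (generic FedAvg).

    Each update is (parameters, number of local samples). Only parameters
    leave a site; the raw observations stay local.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one client must report samples")
    first_params = updates[0][0]
    averaged: Params = {}
    for name, values in first_params.items():
        averaged[name] = [
            sum(params[name][i] * n for params, n in updates) / total
            for i in range(len(values))
        ]
    return averaged


# Two hypothetical roadside deployments with different amounts of local data.
site_a = ({"lane_offset": [1.0, 2.0]}, 100)
site_b = ({"lane_offset": [3.0, 6.0]}, 300)

global_params = federated_average([site_a, site_b])
print(global_params)
```

Weighting by sample count lets sites with more observations pull the shared model further, while each site can still fine-tune its own copy locally, which is the role the abstract assigns to the meta-learning component.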
[82] A Hybrid Multi-Well Hopfield-CNN with Feature Extraction and K-Means for MNIST Classification
Ahmed Farooq
Main category: cs.CV
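TL;DR: The paper proposes a hybrid model combining a CNN with a multi-well Hopfield network for MNIST handwritten digit classification, using feature extraction and K-means clustering to reach high accuracy (99.2%).
Details
Motivation: To handle the variability of handwritten digits in MNIST and provide an interpretable classification framework.
Contribution: 1. A hybrid model combining a CNN with a multi-well Hopfield network; 2. class-specific prototypes generated by K-means clustering; 3. classification through an energy function.
Method: 1. A CNN extracts high-dimensional features; 2. K-means clusters the features into prototypes; 3. in the multi-well energy landscape, the Hopfield network classifies by minimizing an energy function.
Result: The model achieves 99.2% accuracy on 10,000 MNIST test images.
Insight: Deep feature extraction and sufficient prototype coverage are critical to high performance, and the energy function makes the classification interpretable.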
Abstract: This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class assignment. The model’s design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
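Classification by energy minimization over class-specific prototypes can be sketched as follows. The energy used here, a negative log-sum-exp over squared distances to a class's prototypes, is one common way to build a multi-well landscape and is an assumption: the abstract does not give the paper's energy function. The prototypes are toy 2-D points standing in for k-means centroids of CNN features.

```python
import math
from typing import Dict, List, Sequence


def well_energy(x: Sequence[float], prototypes: Sequence[Sequence[float]], beta: float = 1.0) -> float:
    """Energy of x with respect to one class's wells.

    E(x) = -(1 / beta) * log(sum_k exp(-beta * ||x - p_k||^2)), so the energy
    is low whenever x is close to at least one prototype p_k.
    """
    neg_dists = [-beta * sum((a - b) ** 2 for a, b in zip(x, p)) for p in prototypes]
    top = max(neg_dists)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(v - top) for v in neg_dists))
    return -log_sum / beta


def classify(x: Sequence[float], class_prototypes: Dict[str, List[List[float]]], beta: float = 1.0) -> str:
    """Return the class whose wells give the lowest energy for x."""
    return min(class_prototypes, key=lambda c: well_energy(x, class_prototypes[c], beta))


# Toy prototypes (made up); each class has two wells to cover style variation.
prototypes = {
    "0": [[0.0, 0.0], [0.2, 0.1]],
    "1": [[1.0, 1.0], [0.9, 1.2]],
}

print(classify([0.1, 0.0], prototypes))
print(classify([1.0, 1.1], prototypes))
```

Using several prototypes per class is what lets one class cover distinct handwriting styles, and the per-class energies can be inspected directly, which is the interpretability argument made in the abstract.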
[83] CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
Zhengqing Wang,Yuefan Wu,Jiacheng Chen,Fuyang Zhang,Yasutaka Furukawa
Main category: cs.CV
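TL;DR: The paper proposes compressed light-field tokens (CLiFTs) for compute-efficient, adaptive neural rendering, allowing compute to be adjusted flexibly while keeping rendering quality high.
Details
Motivation: Existing neural rendering methods fall short in compute efficiency and data compression, and struggle to adapt to different compute budgets while maintaining high rendering quality.
Contribution: Proposes CLiFTs, which compress and cluster a light-field representation for compute-efficient, adaptive neural rendering and support changing the number of tokens to fit different compute budgets.
Method: 1. A multi-view encoder converts images and camera poses into tokens; 2. Latent-space K-means selects rays as cluster centroids; 3. A multi-view condenser compresses the information into CLiFTs; 4. An adaptive renderer synthesizes novel views given a target view and a compute budget.
Result: Experiments on RealEstate10K and DL3DV validate the efficiency of the method: it reduces data size while preserving rendering quality and performs best on several metrics.
Insight: By compressing and dynamically adjusting the representation, CLiFTs offer a scalable approach to neural rendering that balances data size, rendering quality, and speed.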
Abstract: This paper proposes a neural rendering approach that represents a scene as “compressed light-field tokens (CLiFTs)”, retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view “condenser” compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs of data size, rendering quality, and rendering speed.
[84] Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective
Hangjie Yuan,Weihua Chen,Jun Cen,Hu Yu,Jingyun Liang,Shuning Chang,Zhihui Lin,Tao Feng,Pengwei Liu,Jiazheng Xing,Hao Luo,Jiasheng Tang,Fan Wang,Yi Yang
Main category: cs.CV
TL;DR: Lumos-1 is an autoregressive video generator that retains the standard LLM architecture with minimal modifications, addressing challenges like spatiotemporal correlations and frame-wise loss imbalance through MM-RoPE and AR-DF techniques.
Details
Motivation: The success of autoregressive LLMs in unifying language tasks inspired extensions to video generation, but existing methods suffer from architectural divergence, bulky encoders, or high latency.
Contribution: Lumos-1 introduces MM-RoPE for spatiotemporal modeling and AR-DF for balanced training, achieving competitive performance with efficient training.
Method: The model uses 3D RoPE and MM-RoPE for spatiotemporal correlations, a token dependency strategy, and AR-DF with temporal tube masking for training.
Result: Lumos-1 matches state-of-the-art models on benchmarks like GenEval, VBench-I2V, and VBench-T2V, trained efficiently on 48 GPUs.
Insight: Maintaining a unified LLM architecture while addressing video-specific challenges (spatiotemporal correlations, redundancy) is key to efficient and effective autoregressive video generation.
Abstract: Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
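The token dependency strategy described in the abstract (bidirectional within a frame, causal across frames) can be pictured as an attention mask. The sketch below is only an illustration under simplifying assumptions: it ignores text tokens and temporal tube masking, and the function name `frame_causal_mask` is hypothetical rather than taken from the Lumos-1 code.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Build a mask where mask[q][k] is True if query token q may attend to key token k.

    Tokens in the same frame see each other (intra-frame bidirectionality);
    tokens never see later frames (inter-frame temporal causality).
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask

# Two frames of two tokens each: frame 0 cannot see frame 1, frame 1 sees everything.
mask = frame_causal_mask(num_frames=2, tokens_per_frame=2)
```

Compared with a standard token-level causal mask, the only change is that the comparison is made on frame indices instead of token indices.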
cs.AI
[85] M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
Inclusion AI,Fudong Wang,Jiajia Liu,Jingdong Chen,Jun Zhou,Kaixiang Ji,Lixiang Ru,Qingpei Guo,Ruobing Zheng,Tianqi Li,Yi Yuan,Yifan Mao,Yuting Xiao,Ziping Ma
Main category: cs.AI
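TL;DR: The paper presents M2-Reasoning-7B, which uses a new data generation pipeline and a dynamic multi-task training strategy to substantially improve multimodal large language models on general and spatial reasoning tasks.
Details
Motivation: Existing MLLMs are weak at dynamic spatial interaction, which limits their use in real-world scenarios. The paper aims to close this gap.
Contribution: 1) A data pipeline producing 294.2K high-quality samples; 2) A dynamic multi-task training strategy that mitigates conflicts between tasks and introduces task-specific rewards.
Method: Combines logically coherent data generation with dynamic multi-task training, improving model performance through step-wise optimization.
Result: M2-Reasoning-7B reaches SOTA on 8 benchmarks, performing strongly on both general and spatial reasoning tasks.
Insight: High-quality data and targeted training strategies are key to improving the reasoning ability of MLLMs.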
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.
[86] A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis
Mingda Zhang,Na Zhao,Jianglong Qin,Guoyu Ye,Ruixiang Tang
Main category: cs.AI
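TL;DR: The paper proposes a framework that combines multi-granularity sparse activation with a hierarchical knowledge graph for rare-disease diagnosis, improving diagnostic accuracy and information quality.
Details
Motivation: Rare-disease diagnosis remains challenging because of insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning.
Contribution: 1. A framework fusing multi-granularity concept sparse activation with a hierarchical knowledge graph; 2. Four complementary matching algorithms with diversity control; 3. A five-level fallback strategy and a three-layer knowledge graph structure.
Method: 1. Multi-granularity sparse activation of medical concepts; 2. A hierarchical knowledge graph (taxonomy, clinical features, instances); 3. Matching algorithms and fallback strategies combined to refine the diagnostic process.
Result: On the BioASQ rare-disease QA dataset, BLEU improves by 0.09, ROUGE by 0.05, and accuracy by 0.12, with peak accuracy of 0.89 approaching the 0.90 clinical threshold.
Insight: By strengthening concept activation and knowledge fusion, the framework may shorten the diagnostic journey for rare-disease patients.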
Abstract: Despite advances from medical large language models in healthcare, rare-disease diagnosis remains hampered by insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning. We propose a framework that couples multi-granularity sparse activation of medical concepts with a hierarchical knowledge graph. Four complementary matching algorithms, diversity control, and a five-level fallback strategy enable precise concept activation, while a three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare-disease QA set show BLEU gains of 0.09, ROUGE gains of 0.05, and accuracy gains of 0.12, with peak accuracy of 0.89 approaching the 0.90 clinical threshold. Expert evaluation confirms improvements in information quality, reasoning, and professional expression, suggesting our approach shortens the “diagnostic odyssey” for rare-disease patients.
[87] Large Multi-modal Model Cartographic Map Comprehension for Textual Locality Georeferencing
Kalana Wijegunarathna,Kristin Stock,Christopher B. Jones
Main category: cs.AI
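TL;DR: The paper proposes a zero-shot method that uses large multi-modal models (LMMs) to comprehend maps for georeferencing complex biological sample records; experiments show it outperforms uni-modal methods and existing tools.
Details
Motivation: Millions of biological sample records in natural history collections are hard to use because they lack georeferences, and existing automated methods do not exploit maps, a key tool for the task.
Contribution: A novel multi-modal method that combines textual descriptions with visual map information and achieves high accuracy on zero-shot georeferencing (average error of about 1 km).
Method: A grid-based approach lets auto-regressive multi-modal models relate textual descriptions to spatial relations on the map, with no training data required.
Result: Experiments on a small annotated dataset show the method clearly outperforms uni-modal language models and existing tools.
Insight: Large multi-modal models can comprehend fine-grained spatial relations on maps, offering a new approach to complex georeferencing tasks.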
Abstract: Millions of biological sample records collected in the last few centuries archived in natural history collections are un-georeferenced. Georeferencing complex locality descriptions associated with these collection samples is a highly labour-intensive task collection agencies struggle with. None of the existing automated methods exploit maps that are an essential tool for georeferencing complex relations. We present preliminary experiments and results of a novel method that exploits multi-modal capabilities of recent Large Multi-Modal Models (LMM). This method enables the model to visually contextualize spatial relations it reads in the locality description. We use a grid-based approach to adapt these auto-regressive models for this task in a zero-shot setting. Our experiments conducted on a small manually annotated dataset show impressive results for our approach ($\sim$1 km Average distance error) compared to uni-modal georeferencing with Large Language Models and existing georeferencing tools. The paper also discusses the findings of the experiments in light of an LMM’s ability to comprehend fine-grained maps. Motivated by these results, a practical framework is proposed to integrate this method into a georeferencing workflow.
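An average distance error like the one quoted above is typically computed as a great-circle distance between predicted and reference coordinates. The abstract does not say which distance formula the authors use, so the sketch below assumes the haversine formula on a spherical Earth; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_distance_error_km(predicted, reference):
    """Average haversine distance over paired (lat, lon) predictions and references."""
    distances = [haversine_km(*p, *r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)

# One degree of longitude along the equator, roughly 111.2 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```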
cs.MM
[88] VideoConviction: A Multimodal Benchmark for Human Conviction and Stock Market Recommendations
Michael Galarnyk,Veer Kejriwal,Agam Shah,Yash Bhardwaj,Nicholas Meyer,Anand Krishnan,Sudheer Chava
Main category: cs.MM
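TL;DR: VideoConviction is a multimodal benchmark dataset for evaluating human conviction and stock market recommendations; the study finds that multimodal signals help information extraction, but models still struggle to distinguish investment actions and conviction strength.
Details
Motivation: Financial influencers (finfluencers) spread stock recommendations on social media, and their influence depends not only on textual content but also on multimodal signals such as tone and delivery style.
Contribution: Introduces the VideoConviction dataset with 6,000+ expert annotations, supporting evaluation of multimodal and text-based large language models in finance; characterizes the performance and limits of high-conviction recommendations.
Method: Combines multimodal video signals (such as tone and facial expressions) with textual content, evaluates models against expert-annotated data, and designs an inverse strategy to test how the recommendations actually perform.
Result: High-conviction recommendations outperform low-conviction ones but still trail an S&P 500 index fund; the inverse strategy beats the S&P 500 by 6.8% in annual returns but carries higher risk.
Insight: Multimodal signals add complementary value to financial analysis, but models need improvement to identify conviction strength accurately; the high return of the inverse strategy may reflect the limits of finfluencer recommendations.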
Abstract: Social media has amplified the reach of financial influencers known as “finfluencers,” who share stock recommendations on platforms like YouTube. Understanding their influence requires analyzing multimodal signals like tone, delivery style, and facial expressions, which extend beyond text-based financial analysis. We introduce VideoConviction, a multimodal dataset with 6,000+ expert annotations, produced through 457 hours of human effort, to benchmark multimodal large language models (MLLMs) and text-based large language models (LLMs) in financial discourse. Our results show that while multimodal inputs improve stock ticker extraction (e.g., extracting Apple’s ticker AAPL), both MLLMs and LLMs struggle to distinguish investment actions and conviction–the strength of belief conveyed through confident delivery and detailed reasoning–often misclassifying general commentary as definitive recommendations. While high-conviction recommendations perform better than low-conviction ones, they still underperform the popular S&P 500 index fund. An inverse strategy–betting against finfluencer recommendations–outperforms the S&P 500 by 6.8% in annual returns but carries greater risk (Sharpe ratio of 0.41 vs. 0.65). Our benchmark enables a diverse evaluation of multimodal tasks, comparing model performance on both full video and segmented video inputs. This enables deeper advancements in multimodal financial research. Our code, dataset, and evaluation leaderboard are available under the CC BY-NC 4.0 license.
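The risk comparison above relies on the Sharpe ratio, which divides mean excess return by the standard deviation of returns. The sketch below is a minimal per-period version; the abstract does not state the return frequency or annualization used, so the sample returns and the zero risk-free rate are assumptions for illustration only.

```python
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Mean excess return divided by the sample standard deviation of excess returns."""
    excess = [r - risk_free_rate for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

# Hypothetical per-period returns; mean 0.02 and sample stdev 0.01.
example = sharpe_ratio([0.01, 0.03, 0.02])
```

A strategy can therefore post higher raw returns and still score a lower Sharpe ratio if its returns are more volatile, which is the pattern the abstract reports for the inverse strategy.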
[89] PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
Yibo Lyu,Rui Shao,Gongwei Chen,Yijie Zhu,Weili Guan,Liqiang Nie
Main category: cs.MM
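TL;DR: PUMA is a layer-pruned multimodal language model that improves the efficiency and performance of unified multimodal retrieval through layer-pruned self-distillation at the structural level and a modality-adaptive contrastive learning loss at the learning level.
Details
Motivation: As multimedia content expands, demand for efficient unified multimodal retrieval (UMR) grows. Existing multimodal large language models (MLLMs) have large parameter counts, leading to high training costs and low inference efficiency.
Contribution: 1. Layer-Pruned Self-Distillation, which reduces parameters while preserving representation capability. 2. A Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which improves learning efficiency.
Method: 1. Structure: prune the deep layers of the MLLM, keep only the shallow layers, and use features from the dropped deep layers as teacher signals for self-distillation. 2. Learning: split in-batch negatives into intra-modality and inter-modality groups and apply different temperature strategies to each.
Result: Experiments show that PUMA significantly reduces resource usage while maintaining strong performance.
Insight: Combining structural pruning with a modality-adaptive learning strategy makes efficient multimodal retrieval feasible under resource constraints.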
Abstract: As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
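To make the idea of group-specific temperatures concrete, here is a rough sketch of an InfoNCE-style loss in which intra-modality and inter-modality negatives are scaled by different temperatures. This is not the paper's MAC-Loss, whose exact form the abstract does not give; the function name, the temperature values, and the choice of temperature for the positive pair are all assumptions.

```python
import math

def temperature_split_info_nce(pos_sim, intra_neg_sims, inter_neg_sims,
                               tau_pos=0.05, tau_intra=0.05, tau_inter=0.1):
    """InfoNCE variant with separate temperatures per negative group (illustrative only).

    pos_sim: similarity between the query and its positive candidate.
    intra_neg_sims: similarities to negatives in the target modality (assumed harder).
    inter_neg_sims: similarities to negatives in other modalities (assumed easier).
    """
    pos = math.exp(pos_sim / tau_pos)
    intra = sum(math.exp(s / tau_intra) for s in intra_neg_sims)
    inter = sum(math.exp(s / tau_inter) for s in inter_neg_sims)
    return -math.log(pos / (pos + intra + inter))

# Cosine-style similarities in [-1, 1]; values are made up for the example.
loss = temperature_split_info_nce(0.8, [0.6, 0.5], [0.3, 0.1])
```

A smaller temperature sharpens the contribution of a group, so in this sketch the intra-modality negatives weigh more heavily than inter-modality ones at equal similarity.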
[90] Visual Semantic Description Generation with MLLMs for Image-Text Matching
Junyu Chen,Yihua Gao,Mingyong Li
Main category: cs.MM
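TL;DR: The paper proposes an image-text matching framework that uses multimodal large language models (MLLMs) to generate Visual Semantic Descriptions (VSD), improving cross-modal matching through instance-level and prototype-level alignment and validating the approach on several benchmarks.
Details
Motivation: Images and text differ inherently in representation (continuous, high-dimensional image features vs. discrete, structured text), and existing methods struggle to align the two modalities. The paper uses MLLM-generated, semantically rich VSD as a bridge across this gap.
Contribution: 1) A new framework that generates VSD with MLLMs; 2) Instance-level and prototype-level alignment modules that strengthen cross-modal matching; 3) Performance gains on Flickr30K and MSCOCO, plus zero-shot generalization to cross-domain tasks.
Method: 1) Generate VSD with MLLMs to enhance the linguistic expressiveness of image representations; 2) Fuse visual features with VSD for instance-level alignment; 3) Cluster VSD for prototype-level alignment. These modules can be integrated into existing ITM models.
Result: Experiments on Flickr30K and MSCOCO show substantial improvements, and the method generalizes zero-shot to news and remote sensing image-text matching.
Insight: High-quality semantic descriptions generated by MLLMs can bridge the gap between visual and textual modalities and improve cross-modal alignment.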
Abstract: Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features vs. discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) Instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) Prototype-level alignment through VSD clustering to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at https://github.com/Image-Text-Matching/VSD.
cs.GR
[91] FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields
Gwanhyeong Koo,Sunjae Yoon,Younghwan Lee,Ji Woo Hong,Chang D. Yoo
Main category: cs.GR
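TL;DR: FlowDrag is a 3D-mesh-based drag editing method that incorporates geometric information to fix the inconsistent edits caused by ignoring global geometry, and it introduces VFD, a benchmark dataset with ground-truth annotations.
Details
Motivation: Existing drag-based editing methods focus only on matching user-defined points and ignore global geometry, leading to inconsistent edits or artifacts. FlowDrag aims to improve accuracy and consistency by introducing 3D geometric information.
Contribution: 1. FlowDrag, which uses mesh-guided deformation vector flow fields for more accurate and consistent image editing; 2. VFD, a benchmark dataset with ground truth that addresses the lack of an objective evaluation standard for drag-based editing.
Method: Builds an energy function on a 3D mesh, deforms the mesh under the guidance of user-defined drag points, then projects the deformation into 2D and incorporates it into the UNet denoising process, achieving precise point alignment while preserving structure.
Result: FlowDrag outperforms existing methods on both VFD Bench and DragBench, showing advantages in editing accuracy and consistency.
Insight: Introducing geometric information such as a 3D mesh can make image editing more stable; future work could explore combining other geometric representations with deep learning.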
Abstract: Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present the VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.
[92] Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
Liu He,Xiao Zeng,Yizhi Song,Albert Y. C. Chen,Lu Xia,Shashwat Verma,Sankalp Dayal,Min Sun,Cheng-Hao Kuo,Daniel Aliaga
Main category: cs.GR
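TL;DR: The paper proposes a method for generating large-scale 3D visual instruction datasets to address the weakness of multimodal large language models (MLLMs) at recognizing camera-object relations, and its new dataset, Ultimate3D, substantially improves model performance.
Details
Motivation: Existing MLLMs struggle to capture camera-object relations (such as object orientation, camera viewpoint, and camera shots) in image-text alignment tasks, because the training data lacks diverse camera-object relations and corresponding textual descriptions.
Contribution: 1) A synthetic pipeline for generating large-scale 3D visual instruction datasets; 2) Photorealistic image generation using 3D assets, rendering, and diffusion models; 3) The Ultimate3D dataset and benchmark, which substantially improve MLLM performance.
Method: 1) Generate images from 3D assets with rendering; 2) Use diffusion models to make the images more realistic; 3) Use LLMs to generate text prompts that guide visual instruction tuning; 4) Build the Ultimate3D dataset of 240K VQA pairs.
Result: MLLMs fine-tuned on Ultimate3D outperform commercial models, with an average accuracy gain of 33.4% on camera-object relation recognition tasks.
Insight: Synthetic data generated with precise control over camera-object relations can substantially improve MLLM performance on vision-language tasks, suggesting a new direction for multimodal research.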
Abstract: Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with a limited diversity of camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
cs.RO
[93] CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations
Wenbo Cui,Chengyang Zhao,Yuhui Chen,Haoran Li,Zhizheng Zhang,Dongbin Zhao,He Wang
Main category: cs.RO
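TL;DR: CL3R is a new 3D pre-training framework that combines spatial awareness and semantic understanding, using a point cloud masked autoencoder and contrastive learning to improve the perception of robotic manipulation policies and to address generalization across viewpoints.
Details
Motivation: Existing robotic perception modules rely on pre-trained 2D foundation models but lack 3D spatial information and the ability to generalize across viewpoints, which hurts policy performance on fine-grained manipulation tasks.
Contribution: Proposes the CL3R framework, which fuses 3D reconstruction with contrastive learning, unifies coordinate systems, and introduces random fusion of multi-view point clouds, making perception for robotic manipulation more robust.
Method: A point cloud masked autoencoder learns 3D representations, while contrastive learning against pre-trained 2D models transfers semantic knowledge; a unified coordinate system and multi-view fusion address camera view ambiguity.
Result: In simulation and real-world experiments, CL3R shows strong perception performance and policy learning on robotic manipulation tasks.
Insight: Jointly learning 3D spatial information and 2D semantics is key, and fusing multi-view data markedly improves adaptation to novel viewpoints.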
Abstract: Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and generalize across diverse camera viewpoints. These limitations hinder the policy’s effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, we propose a 3D visual representation pre-training framework for robotic tasks. By unifying coordinate systems across datasets and introducing random fusion of multi-view point clouds, we mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.
cs.LG
[94] Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training
Aleksei Ilin,Gor Matevosyan,Xueying Ma,Vladimir Eremin,Suhaa Dada,Muqun Li,Riyaaz Shaik,Haluk Noyan Tokgozoglu
Main category: cs.LG
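TL;DR: The paper proposes a lightweight safety guardrail framework that uses synthetic data generation and reinforcement-learning-guided adversarial training so that small language models perform strongly on content moderation, even surpassing large models.
Details
Motivation: AI content moderation typically relies on large language models, which are computationally expensive and have limited defenses against adversarial attacks. The authors aim to improve small models with lightweight methods, lowering compute cost and increasing robustness.
Contribution: (1) A high-fidelity synthetic data generation method; (2) An RL-based adversarial training strategy; (3) Evidence that small models can serve as efficient safety guardrails and outperform large models.
Method: Two parts: (1) generate diverse synthetic data from human-curated seed data; (2) use reinforcement learning to guide the generation of adversarial examples, which are used to fine-tune the safety classifier.
Result: Experiments show that the framework markedly improves the content moderation ability of small models, with better compute efficiency than large models and greater robustness to adversarial attacks.
Insight: Quality control and diversity of the synthetic data are key; combining reinforcement learning with adversarial training effectively improves model performance; with the right training methods, small models can surpass large ones on specific tasks.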
Abstract: We introduce a lightweight yet highly effective safety guardrail framework for language models, demonstrating that small-scale language models can achieve, and even surpass, the performance of larger counterparts in content moderation tasks. This is accomplished through high-fidelity synthetic data generation and adversarial training. The synthetic data generation process begins with human-curated seed data, which undergoes query augmentation and paraphrasing to create diverse and contextually rich examples. This augmented data is then subjected to multiple rounds of curation, ensuring high fidelity and relevance. Inspired by recent advances in the Generative Adversarial Network (GAN) architecture, our adversarial training employs reinforcement learning to guide a generator that produces challenging synthetic examples. These examples are used to fine-tune the safety classifier, enhancing its ability to detect and mitigate harmful content. Additionally, we incorporate strategies from recent research on efficient LLM training, leveraging the capabilities of smaller models to improve the performance of larger generative models. With iterative adversarial training and the generation of diverse, high-quality synthetic data, our framework enables small language models (SLMs) to serve as robust safety guardrails. This approach not only reduces computational overhead but also enhances resilience against adversarial attacks, offering a scalable and efficient solution for content moderation in AI systems.
[95] Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)
Vincenzo Dentamaro
Main category: cs.LG
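TL;DR: WERSA is a linear-time attention mechanism that combines wavelet transforms with random spectral features, sharply reducing the compute cost of long sequences while maintaining high accuracy.
Details
Motivation: Standard Transformers are expensive on long sequences because attention has quadratic time complexity, which limits scalability. WERSA targets this problem to enable efficient long-sequence processing.
Contribution: Proposes WERSA, a linear-time attention mechanism combining wavelet transforms and random spectral features that sharply reduces compute cost and outperforms other methods on several benchmarks.
Method: WERSA combines content-adaptive random spectral features with multi-resolution Haar wavelets and uses learnable parameters to attend selectively to informative scales of the data while preserving linear efficiency.
Result: On a single GPU, WERSA performs best across tasks; for example, on ArXiv classification it improves accuracy by 1.2% while cutting training time by 81% and FLOPS by 73.4%.
Insight: With its efficient linear-complexity mechanism, WERSA makes long-sequence processing practical on low-resource hardware, supporting more sustainable AI development.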
Abstract: Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism of linear $O(n)$ time complexity that is pivotal to enable successful long-sequence processing without the performance trade-off. WERSA merges content-adaptive random spectral features together with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of data while preserving linear efficiency. Large-scale comparisons on a single GPU, across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (like Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages of WERSA. It achieves best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2% (86.2% vs 85.0%) while cutting training time by 81% (296s vs 1554s) and FLOPS by 73.4% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla and FlashAttention-2 fail: on ArXiv-128k’s extremely lengthy sequences, it achieves best accuracy (79.1%) and AUC (0.979) among viable methods, operating on data that gives Out-Of-Memory errors to quadratic methods while being twice as fast as Waveformer, its next-best competitor. By significantly reducing computational loads without compromising accuracy, WERSA makes possible more practical, more affordable, long-context models, in particular on low-resource hardware, for more sustainable and more scalable AI development.
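One ingredient named in the abstract, the multi-resolution Haar wavelet, can be computed in linear time, which is part of why it fits a linear-complexity design. The sketch below shows only a plain orthonormal Haar decomposition of a 1-D signal; it is not WERSA itself, it omits the random spectral features and learnable scale weights, and the function name `haar_decompose` is hypothetical.

```python
import math

def haar_decompose(signal):
    """Orthonormal multi-level Haar transform of a power-of-two-length sequence.

    Returns (coarsest approximation coefficient, list of detail levels from fine to coarse).
    Total work is n/2 + n/4 + ... + 1 pair operations, i.e. O(n).
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    scale = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / scale for a, b in pairs])  # local differences
        approx = [(a + b) / scale for a, b in pairs]         # local averages
    return approx[0], details

# A constant signal has no detail at any scale.
coarse, details = haar_decompose([4.0, 4.0, 4.0, 4.0])
```

Each detail level captures variation at one scale, so a model that weights the levels differently can emphasize coarse or fine structure.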
[96] One Token to Fool LLM-as-a-Judge
Yulai Zhao,Haolin Liu,Dian Yu,S. Y. Kung,Haitao Mi,Dong Yu
Main category: cs.LG
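TL;DR: The paper reveals that generative reward models (LLM-as-a-judge) are vulnerable to superficial manipulations and proposes a data augmentation strategy to make them more robust.
Details
Motivation: Generative reward models are often preferred over rule-based metrics for evaluating complex reasoning tasks, but the study finds they are easily misled by simple symbols or reasoning openers, producing false reward signals. This threatens the algorithmic paradigms that rely on these models.
Contribution: 1. Exposes the vulnerability of generative reward models; 2. Proposes a data augmentation strategy and trains a more robust reward model; 3. Releases the robust model and its training data.
Method: Experiments demonstrate the sensitivity of the models to non-word symbols and reasoning openers; a data augmentation method is then designed and used to train a more robust model.
Result: The new model is substantially more robust to superficial manipulations.
Insight: LLM-based evaluation can be unreliable, and more robust designs and data augmentation strategies are urgently needed.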
Abstract: Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
[97] Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data
Parag Dutta,Ambedkar Dukkipati
Main category: cs.LG
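TL;DR: The paper proposes LoGIC, a multi-agent reinforcement learning method that improves image captioning through a communication game without additional labeled data. Experiments show that LoGIC, with either pre-trained models or lightweight components, clearly outperforms existing methods in the unsupervised setting.
Details
Motivation: Image captioning requires large volumes of annotated data, but existing labeled datasets have already been used for training. How to improve performance without supervision becomes a key question.
Contribution: Proposes LoGIC, which improves image captioning through a multi-agent reinforcement learning game without additional data.
Method: Two agents (a 'speaker' and a 'listener') cooperate to learn a strategy for communicating in natural language, trained with the GRPO algorithm.
Result: With pre-trained models, BLEU improves by 2 points; with lightweight components, BLEU is 10 points higher than existing unsupervised methods.
Insight: Communication games between agents can improve image captioning without supervision, offering a new direction for unsupervised learning.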
Abstract: Image captioning is an important problem in developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training the large Vision Language Models (VLMs), it becomes challenging to improve the performance of the same. Considering this, it is essential to consider the unsupervised image captioning performance, which remains relatively under-explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a Multi-agent Reinforcement Learning game. The proposed method consists of two agents, a ‘speaker’ and a ‘listener’, with the objective of learning a strategy for communicating in natural language. We train agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. We show that using pre-trained VLMs as the ‘speaker’ and a Large Language Model (LLM) for language understanding in the ‘listener’, we achieved a 46 BLEU score after fine-tuning using LoGIC without additional labels, a 2-point absolute advantage over the 44 BLEU score of the vanilla VLM. Additionally, we replace the VLM from the ‘speaker’ with lightweight components: (i) a ViT for image perception and (ii) a GPT2 for language generation, and train them from scratch using LoGIC, obtaining a 31 BLEU score in the unsupervised setting, a 10-point advantage over existing unsupervised image-captioning methods.
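GRPO, the training algorithm mentioned above, scores each sampled output relative to the other outputs in its group instead of using a learned value function. The sketch below shows only that group-relative advantage step; the use of the population standard deviation, the `eps` term, and the example rewards (imagined as listener success for four captions of one image) are assumptions, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four captions sampled for the same image.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.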
eess.IV
[98] 3D forest semantic segmentation using multispectral LiDAR and 3D deep learning
Narges Takhtkeshha,Lauris Bocaux,Lassi Ruoppa,Fabio Remondino,Gottfried Mandlburger,Antero Kukko,Juha Hyyppä
Main category: eess.IV
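TL;DR: The paper explores the potential of multispectral LiDAR data and 3D deep learning models for forest semantic segmentation; experiments show that KPConv performs best and markedly improves segmentation accuracy for forest components.
Details
Motivation: Traditional forest inventory is labor-intensive and time-consuming; multispectral LiDAR acquires spatial and spectral information simultaneously, opening a new route to precise forest segmentation.
Contribution: Validates multispectral LiDAR data for forest semantic segmentation and compares several 3D deep learning and machine learning models experimentally, finding that KPConv performs best.
Method: Using high-density multispectral point clouds captured by the HeliALS system, three point-wise 3D deep learning models (KPConv, Superpoint Transformer, Point Transformer V3) and one machine learning model (random forest) segment forest components, and the effect of different geometric and spectral feature vectors is examined.
Result: KPConv performs best on the multispectral LiDAR data; using the spectral features of all three wavelengths (1550 nm, 905 nm, 532 nm) improves mIoU and mAcc by 33.73% and 32.35%, respectively.
Insight: Multispectral LiDAR has great potential for forest semantic segmentation; deep learning methods can exploit spatial and spectral information for high-accuracy segmentation, supporting automated forest inventory.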
Abstract: Conservation and decision-making regarding forest resources necessitate regular forest inventory. Light detection and ranging (LiDAR) in laser scanning systems has gained significant attention over the past two decades as a remote and non-destructive solution to streamline the labor-intensive and time-consuming procedure of forest inventory. Advanced multispectral (MS) LiDAR systems simultaneously acquire three-dimensional (3D) spatial and spectral information across multiple wavelengths of the electromagnetic spectrum. Consequently, MS-LiDAR technology enables the estimation of both the biochemical and biophysical characteristics of forests. Forest component segmentation is crucial for forest inventory. The synergistic use of spatial and spectral laser information has proven to be beneficial for achieving precise forest semantic segmentation. Thus, this study aims to investigate the potential of MS-LiDAR data, captured by the HeliALS system, providing high-density multispectral point clouds to segment forests into six components: ground, low vegetation, trunks, branches, foliage, and woody debris. Three point-wise 3D deep learning models and one machine learning model, including kernel point convolution, superpoint transformer, point transformer V3, and random forest, are implemented. Our experiments confirm the superior accuracy of the KPConv model. Additionally, various geometric and spectral feature vector scenarios are examined. The highest accuracy is achieved by feeding all three wavelengths (1550 nm, 905 nm, and 532 nm) as the initial features into the deep learning model, resulting in improvements of 33.73% and 32.35% in mean intersection over union (mIoU) and in mean accuracy (mAcc), respectively. This study highlights the excellent potential of multispectral LiDAR for improving the accuracy in fully automated forest component segmentation.
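The two metrics reported above, mIoU and mAcc, are commonly derived from a per-class confusion matrix. The sketch below uses the usual definitions (IoU = TP / (TP + FP + FN), per-class accuracy = TP / (TP + FN), averaged over classes); the abstract does not spell out the exact evaluation protocol, so treat the definitions and the toy two-class matrix as assumptions.

```python
def mean_iou_and_accuracy(confusion):
    """Compute (mIoU, mAcc) from confusion[true_class][predicted_class] counts.

    Classes with an empty denominator are skipped instead of counted as zero.
    """
    num_classes = len(confusion)
    ious, accuracies = [], []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
        if tp + fn > 0:
            accuracies.append(tp / (tp + fn))
    return sum(ious) / len(ious), sum(accuracies) / len(accuracies)

# Toy matrix: class 0 has 2 correct points; class 1 has 1 correct and 1 mislabelled as class 0.
miou, macc = mean_iou_and_accuracy([[2, 0], [1, 1]])
```

Averaging per class, instead of over all points, keeps small classes such as woody debris from being swamped by dominant ones such as foliage.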
[99] Depth-Sequence Transformer (DST) for Segment-Specific ICA Calcification Mapping on Non-Contrast CT
Xiangjian Hou,Ebru Yaman Akcicek,Xin Wang,Kazem Hashemizadeh,Scott Mcnally,Chun Yuan,Xiaodong Ma
Main category: eess.IV
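TL;DR: The paper proposes the Depth-Sequence Transformer (DST) for segment-specific localization of intracranial carotid artery calcification (ICAC) on non-contrast CT. By recasting the 3D problem as a parallel probabilistic landmark localization task along the 1D axial dimension, DST preserves global context while achieving high accuracy and robustness.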
Details
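Motivation: Existing ICAC analysis considers only total volume as a stroke biomarker and ignores the critical influence of calcification location on prognosis and treatment. Conventional 3D methods cannot perform high-resolution global analysis because of computational constraints, which leads to anatomical ambiguity and unreliable landmark localization. Contribution: 1. Proposes the DST framework, which recasts the 3D problem as a 1D sequence task to enable segment-specific ICAC localization; 2. Achieves a mean absolute error of 0.1 slices on clinical data, with 96% of predictions within ±1 slice; 3. Achieves the best result on the Clean-CC-CCII benchmark.
Method: 1. Processes the full-resolution CT volume as a sequence of 2D slices; 2. Uses a parallel probabilistic landmark localization task to predict six independent probability distributions, one per key anatomical landmark; 3. Uses a Transformer architecture to capture dependencies between slices.
Result: On a 100-patient clinical cohort with 5-fold cross-validation, the MAE is 0.1 slices and 96% of predictions fall within ±1 slice. The DST backbone also achieves the best result on the Clean-CC-CCII classification benchmark.
Insight: 1. Recasting the 3D problem as a 1D sequence task reduces computational complexity while preserving global context; 2. Segment-specific analysis provides finer-grained biomarkers for diagnosis and prognosis; 3. Transformer architectures show strong potential for medical image analysis.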
Abstract: While total intracranial carotid artery calcification (ICAC) volume is an established stroke biomarker, growing evidence shows this aggregate metric ignores the critical influence of plaque location, since calcification in different segments carries distinct prognostic and procedural risks. However, a finer-grained, segment-specific quantification has remained technically infeasible. Conventional 3D models are forced to process downsampled volumes or isolated patches, sacrificing the global context required to resolve anatomical ambiguity and render reliable landmark localization. To overcome this, we reformulate the 3D challenge as a Parallel Probabilistic Landmark Localization task along the 1D axial dimension. We propose the Depth-Sequence Transformer (DST), a framework that processes full-resolution CT volumes as sequences of 2D slices, learning to predict N=6 independent probability distributions that pinpoint key anatomical landmarks. Our DST framework demonstrates exceptional accuracy and robustness. Evaluated on a 100-patient clinical cohort with rigorous 5-fold cross-validation, it achieves a Mean Absolute Error (MAE) of 0.1 slices, with 96% of predictions falling within a ±1 slice tolerance. Furthermore, to validate its architectural power, the DST backbone establishes the best result on the public Clean-CC-CCII classification benchmark under an end-to-end evaluation protocol. Our work delivers the first practical tool for automated segment-specific ICAC analysis. The proposed framework provides a foundation for further studies on the role of location-specific biomarkers in diagnosis, prognosis, and procedural planning. Our code will be made publicly available.
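To make the evaluation in this entry concrete, here is a minimal sketch of how a per-landmark probability distribution over axial slices can be turned into a slice index, and how MAE in slices and the fraction within a ±1 slice tolerance can be computed. Everything here is an illustrative assumption: the abstract does not say whether DST decodes landmarks by argmax or by expectation, and the helper names and toy numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalise per-slice scores for one landmark into a distribution over depth."""
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_slice(probs: Sequence[float]) -> int:
    """Decode the landmark as the most probable slice index."""
    return max(range(len(probs)), key=probs.__getitem__)


def expected_slice(probs: Sequence[float]) -> float:
    """Decode the landmark as the probability-weighted mean slice index."""
    return sum(i * p for i, p in enumerate(probs))


def mae_and_tolerance(pred: Sequence[float], true: Sequence[float],
                      tol: float = 1.0) -> Tuple[float, float]:
    """Return (MAE in slices, fraction of predictions with |error| <= tol)."""
    errors = [abs(p - t) for p, t in zip(pred, true)]
    mae = sum(errors) / len(errors)
    within = sum(e <= tol for e in errors) / len(errors)
    return mae, within


# Toy example: scores for one landmark over a 6-slice volume (made-up values).
probs = softmax([0.1, 0.3, 4.0, 1.2, 0.2, 0.0])
print("argmax slice:", argmax_slice(probs))
print("expected slice:", round(expected_slice(probs), 2))

# Toy evaluation over three hypothetical landmark predictions.
mae, within = mae_and_tolerance([2.0, 5.0, 7.5], [2, 4, 10])
print(f"MAE={mae:.2f} slices, within ±1 slice: {within:.0%}")
```

Expectation-based decoding yields sub-slice estimates, which is one way an average error below one slice can arise; argmax decoding always returns whole slice indices.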
cs.SD [Back]
[100] Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Arushi Goel,Sreyan Ghosh,Jaehyeon Kim,Sonal Kumar,Zhifeng Kong,Sang-gil Lee,Chao-Han Huck Yang,Ramani Duraiswami,Dinesh Manocha,Rafael Valle,Bryan Catanzaro
Main category: cs.SD
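TL;DR: The paper introduces Audio Flamingo 3 (AF3), a fully open large audio-language model that achieves advanced audio understanding and reasoning through joint learning across three modalities: speech, sound, and music.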
Details
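Motivation: Current audio intelligence models are limited in joint cross-modal learning and in long audio understanding; AF3 aims to address these limitations. Contribution: 1. Proposes AF-Whisper, a unified audio encoder; 2. Introduces chain-of-thought reasoning; 3. Supports multi-turn, multi-audio chat; 4. Enables understanding of audio up to 10 minutes long; 5. Develops large-scale training datasets and a five-stage training strategy.
Method: Trains AF-Whisper with joint representation learning, proposes a five-stage curriculum-based training strategy, and uses new datasets such as AudioSkills-XL to train the model.
Result: AF3 achieves SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Insight: Joint multimodal learning and curriculum-based training are critical to improving the generality and performance of audio models.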
Abstract: We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.