cs.CL

[1] AI-based Clinical Decision Support for Primary Care: A Real-World Study

Robert Korom,Sarah Kiptinness,Najib Adan,Kassim Said,Catherine Ithuli,Oliver Rotich,Boniface Kimani,Irene King’ori,Stellah Kamau,Elizabeth Atemba,Muna Aden,Preston Bowman,Michael Sharman,Rebecca Soskin Hicks,Rebecca Distler,Johannes Heidecke,Rahul K. Arora,Karan Singhal

Main category: cs.CL

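TL;DR: This study evaluates AI Consult, an LLM-based clinical decision support tool, in live primary care. Integrated into clinician workflows, the tool was associated with fewer diagnostic and treatment errors and received positive clinician feedback, showing the potential of AI to reduce clinical errors.

Motivation: Clinical errors are a significant problem in primary care. The study asks whether an AI tool can reduce errors in clinical decision-making and assesses its feasibility and effect in a real-world setting.

Contribution: Evaluates an LLM-based clinical decision support tool in live care, shows that it reduces diagnostic and treatment errors, and offers a practical framework for responsible AI adoption.

Method: In partnership with Penda Health in Nairobi, Kenya, the study covered 39,849 patient visits across 15 clinics and compared clinical error rates for clinicians with and without access to AI Consult. Independent physicians rated the visits, and clinicians using the tool were surveyed.

Result: Clinicians with AI Consult made 16% fewer diagnostic errors and 13% fewer treatment errors in relative terms. At Penda alone, this would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually. All surveyed clinicians said the tool improved the quality of care, and 75% called the effect substantial.

Insight: The results depended on a workflow-aligned implementation and active deployment to encourage clinician uptake, which points to how AI can improve the quality and safety of primary care.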

Abstract: We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was “substantial”. These results required a clinical workflow-aligned AI Consult implementation and active deployment to encourage clinician uptake. We hope this study demonstrates the potential for LLM-based clinical decision support tools to reduce errors in real-world settings and provides a practical framework for advancing responsible adoption.

[2] Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs

Shuyuan Lin,Lei Duan,Philip Hughes,Yuxuan Sheng

Main category: cs.CL

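TL;DR: The paper proposes SALU, which uses multi-task learning and RLHF to build unanswerability detection directly into the LLM's generative process, substantially reducing hallucinated content and improving reliability.

Motivation: Conversational Information Retrieval (CIR) systems struggle with unanswerable questions, and the external classifiers traditionally used to detect them can be inconsistent with the core generative LLM, which leads to misleading or hallucinated content.

Contribution: Proposes SALU (Self-Aware LLM for Unanswerability), which integrates unanswerability detection into the LLM itself through multi-task learning and RLHF, improving reliability and accuracy.

Method: (1) A multi-task learning framework trains the model on standard QA and on explicit abstention for unanswerable queries. (2) A confidence-score-guided RLHF phase penalizes hallucinated responses and rewards appropriate abstentions.

Result: On the custom-built C-IR_Answerability dataset, SALU outperforms the baselines, including hybrid LLM-classifier systems. Human evaluation also confirms high reliability and a much lower hallucination rate.

Insight: Integrating unanswerability detection into the generative process and combining it with RLHF makes the model more aware of the boundaries of its own knowledge.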

Abstract: Conversational Information Retrieval (CIR) systems, while offering intuitive access to information, face a significant challenge: reliably handling unanswerable questions to prevent the generation of misleading or hallucinated content. Traditional approaches often rely on external classifiers, which can introduce inconsistencies with the core generative Large Language Models (LLMs). This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that deeply integrates unanswerability detection directly within the LLM’s generative process. SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries. Crucially, it incorporates a confidence-score-guided reinforcement learning with human feedback (RLHF) phase, which explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. Through extensive experiments on our custom-built C-IR_Answerability dataset, SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions. Human evaluation further confirms SALU’s superior reliability, achieving high scores in factuality, appropriate abstention, and, most importantly, a dramatic reduction in hallucination, demonstrating its ability to robustly “know when to say ‘I don’t know’.”
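The reward shaping described in the abstract (penalize hallucinations, reward appropriate abstentions, guided by a confidence score) can be sketched as follows. This is only an illustration: the `Outcome` fields, the function name, and every numeric constant are assumptions, since the abstract does not give the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    answerable: bool   # ground truth: can the question be answered at all?
    abstained: bool    # did the model decline to answer?
    correct: bool      # if it answered, was the answer right?
    confidence: float  # model confidence in [0, 1] (assumed to be available)


def abstention_reward(o: Outcome) -> float:
    """Hypothetical reward; the constants are illustrative, not from the paper."""
    if o.abstained:
        # Abstaining is appropriate only when the question is unanswerable.
        return 1.0 if not o.answerable else -0.5
    if not o.answerable or not o.correct:
        # Hallucinated or wrong answer: the penalty grows with confidence.
        return -1.0 - o.confidence
    # Correct answer to an answerable question: reward grows with confidence.
    return 0.5 + 0.5 * o.confidence
```

Under these assumed constants, a confident hallucination is the worst outcome and a correct abstention is the best one for an unanswerable query, which is the ordering the abstract describes.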

[3] Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning

Aleksandr Perevalov,Andreas Both

Main category: cs.CL

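TL;DR: The paper proposes mKGQAgent, a framework that converts multilingual natural-language questions into SPARQL queries through modular, interpretable subtasks. It took first place in the Text2SPARQL challenge 2025.

Motivation: Accessing knowledge through multilingual natural-language interfaces is an emerging challenge in information retrieval. Prior approaches mostly combine components (rule-based or neural) that solve downstream tasks and only produce an answer at the end.

Contribution: Proposes mKGQAgent, a human-inspired framework that uses a coordinated LLM agent workflow (planning, entity linking, query refinement) together with in-context learning to handle multilingual knowledge graph question answering efficiently.

Method: The task is decomposed into modular steps such as planning, entity linking, and query refinement, and an experience pool supplies examples for in-context learning, so that multilingual KGQA is solved step by step.

Result: mKGQAgent took first place on the DBpedia-based and Corporate-based KGQA benchmarks of the Text2SPARQL challenge 2025.

Insight: Imitating the modular way humans reason, combined with in-context learning, can improve both capability and interpretability in multilingual semantic parsing.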

Abstract: Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in information retrieval and related fields. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement, guided by an experience pool for in-context learning, mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.

[4] CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards

Cheng Liu,Yifei Lu,Fanghua Ye,Jian Li,Xingyu Chen,Feiliang Ren,Zhaopeng Tu,Xiaolong Li

Main category: cs.CL

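TL;DR: The paper proposes CogDual, a Role-Playing Language Agent (RPLA) that strengthens the cognitive abilities of LLMs through reinforcement learning. Its novelty lies in jointly modeling external situational awareness and internal self-awareness, then optimizing with reinforcement learning and implicit rule-based rewards. Experiments show strong performance across several tasks.

Motivation: Existing RPLAs mainly rely on prompt engineering or supervised fine-tuning and neglect the cognitive mechanisms behind character behavior. Drawing on cognitive psychology, the authors imitate human cognition to improve role-play consistency.

Contribution: (1) Proposes CogDual, which adopts a cognize-then-respond reasoning paradigm; (2) jointly models external situational awareness and internal self-awareness; (3) designs two general-purpose reward schemes and uses reinforcement learning to optimize performance.

Method: The model first jointly models external situational awareness and internal self-awareness and is then optimized with reinforcement learning. The two reward schemes are general-purpose and designed for open-domain text generation.

Result: On the CoSER, Cross-MR, and LifeChoice benchmarks, CogDual consistently outperforms existing baselines and generalizes well across diverse role-playing tasks.

Insight: The key to role-playing agents is modeling human cognitive mechanisms rather than only imitating behavior. Combining reinforcement learning with implicit rule-based rewards is an effective way to improve open-domain performance.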

Abstract: Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.

[5] CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings

Kyeongkyu Lee,Seonghwan Yoon,Hongki Lim

Main category: cs.CL

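TL;DR: CLARIFID is a framework that optimizes the diagnostic correctness of radiology reports by mirroring the expert workflow. It combines multi-view chest X-rays with reinforcement learning and clearly improves the clinical efficacy of the generated reports.

Motivation: Current radiology report generation methods focus on fluent text and neglect the factual correctness of the diagnosis. They also often rely on single-view images, which limits diagnostic comprehensiveness.

Contribution: (1) Two-step learning that follows the expert workflow, from Findings to Impression; (2) PPO-based reinforcement learning to improve the accuracy of the Impression section; (3) a reasoning-aware decoding strategy; (4) multi-view fusion based on a Vision Transformer.

Method: Combines a multi-view encoder, PPO fine-tuning, reasoning-aware decoding, and report-level re-ranking so that the model first generates a comprehensive Findings section and only then synthesizes the Impression.

Result: On the MIMIC-CXR dataset, CLARIFID outperforms existing baselines on both standard NLG metrics and clinically aware scores.

Insight: Mirroring the expert workflow and fusing multiple views improve the clinical reliability of generated radiology reports, and reasoning-aware decoding keeps the report logically coherent.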

Abstract: Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) enforces reasoning-aware decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a reasoning-aware next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive Findings section before synthesizing the Impression and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.
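The reward in step (2) is an F1 score over clinical labels extracted from the Impression section. A minimal sketch of that scoring step is shown below. It assumes the positive finding labels have already been extracted from the generated and reference Impressions by some labeler; the function name and the label strings are illustrative, and the CheXbert labeler itself is not invoked here.

```python
def label_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between two sets of positive finding labels (hypothetical helper)."""
    if not predicted and not reference:
        # Neither report asserts a finding: treated here as full agreement.
        return 1.0
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


# Example: the generated Impression adds one finding the reference lacks.
reward = label_f1({"Edema", "Pneumonia"}, {"Edema"})
```

In a PPO loop, a scalar like `reward` would be attached to each sampled report. How the paper aggregates the score across labels or batches is not stated in the abstract.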

[6] Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge

Miaomiao Gao,Xiaoxiao Xiang,Yiwen Guo

Main category: cs.CL

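TL;DR: The paper presents Triple X, a multilingual speech recognition system built on an encoder-adapter-LLM architecture with a multi-stage training strategy. It took second place in the INTERSPEECH2025 MLC-SLM Challenge.

Motivation: Improve speech recognition accuracy in multilingual conversational scenarios.

Contribution: (1) An encoder-adapter-LLM architecture; (2) a carefully designed multi-stage training strategy; (3) validation of the system on multilingual datasets.

Method: The encoder-adapter-LLM architecture draws on the reasoning capabilities of text-based large language models and adds domain-specific adaptations. Multi-stage training on extensive multilingual audio data further improves performance.

Result: The system achieves competitive Word Error Rate (WER) on both the dev and test sets of the challenge and ranks second.

Insight: Combining a large language model with adaptive training on multilingual data can markedly improve multilingual speech recognition.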

Abstract: This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.
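For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a plain implementation of that definition; it assumes whitespace tokenization and is not the challenge's official scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.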

[7] Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents

Zhili Shen,Chenxin Diao,Pascual Merita,Pavlos Vougiouklis,Jeff Z. Pan

Main category: cs.CL

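TL;DR: The paper examines whether graph-based retrieval-augmented generation (RAG) can be scaled to large document collections, studying how an existing method performs, and where it falls short, on the SIGIR 2025 LiveRAG Challenge.

Motivation: Current graph-based RAG methods are mostly designed for specific tasks such as multi-hop question answering, so there is little evidence of how they scale or generalize to large, general-purpose datasets.

Contribution: Adapts GeAR to collections of millions of documents and examines its behavior at that scale on the SIGIR 2025 LiveRAG Challenge.

Method: Building on the GeAR framework, the approach uses entities and their relations extracted from documents as graph structure and adapts retrieval to support a large document collection.

Result: The study reports both the performance and the limitations of GeAR on the large-scale task. The abstract gives no specific figures.

Insight: Graph structure can enhance retrieval, but the complexity and diversity of very large document collections place higher demands on method design.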

Abstract: Recent studies have explored graph-based approaches to retrieval-augmented generation, leveraging structured or semi-structured information (such as entities and their relations extracted from documents) to enhance retrieval. However, these methods are typically designed to address specific tasks, such as multi-hop question answering and query-focused summarisation, and therefore, there is limited evidence of their general applicability across broader datasets. In this paper, we aim to adapt a state-of-the-art graph-based RAG solution, GeAR, and explore its performance and limitations on the SIGIR 2025 LiveRAG Challenge.

[8] MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Alexander R. Fabbri,Diego Mares,Jorge Flores,Meher Mankikar,Ernesto Hernandez,Dean Lee,Bing Liu,Chen Xing

Main category: cs.CL

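TL;DR: The paper introduces MultiNRC, a benchmark for evaluating how well large language models (LLMs) reason in multilingual and culturally grounded settings. Results show that current LLMs still perform poorly on native multilingual reasoning.

Motivation: Existing multilingual evaluations are mostly translations of English benchmarks and do not test reasoning grounded in native languages and cultures, so a more comprehensive multilingual reasoning benchmark is needed.

Contribution: Proposes the MultiNRC benchmark, with more than 1,000 native questions written by native speakers of French, Spanish, and Chinese. It covers linguistic, cultural, and math reasoning categories and provides English equivalent translations for two of them.

Method: Native speakers wrote the questions and manually translated part of them into English. Fourteen leading LLMs were then evaluated on MultiNRC and its English equivalent set.

Result: No model scores above 50% on MultiNRC. In math reasoning, most models perform substantially better in English than in the original languages (+10%).

Insight: LLMs show distinct strengths and weaknesses across linguistic, cultural, and logical reasoning tasks, and culturally grounded knowledge remains a weak point.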

Abstract: Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs' multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems whose context is rooted in English language and culture. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistically and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, manually translated by native speakers fluent in English. This set of English equivalents can provide a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate 14 current leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) most models perform substantially better in math reasoning in English compared to in original languages (+10%), indicating persistent challenges with culturally grounded knowledge.

[9] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Shanbo Cheng,Yu Bao,Zhichao Huang,Yu Lu,Ningxin Peng,Lu Xu,Runsheng Yu,Rong Cao,Ting Han,Zeyang Li,Sitong Liu,Shengtao Ma,Shiguang Pan,Jiongchen Xiao,Nuo Xu,Meng Yang,Rong Ye,Yiming Yu,Ruofei Zhang,Wanyi Zhang,Wenhao Zhu,Liehao Zou,Lu Lu,Yuxuan Wang,Yonghui Wu

Main category: cs.CL

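TL;DR: Seed-LiveInterpret 2.0 is an end-to-end simultaneous interpretation model. Its duplex speech-to-speech understanding-generating framework addresses poor transcription and translation quality and the lack of real-time speech generation, and it clearly improves both translation accuracy and latency.

Motivation: The work targets the core challenges of simultaneous interpretation (SI): subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation in long-form discourse.

Contribution: Proposes Seed-LiveInterpret 2.0, an end-to-end SI model with voice cloning that clearly improves translation quality and latency.

Method: Uses a duplex speech-to-speech understanding-generating framework, combined with large-scale pretraining and reinforcement learning, to balance translation accuracy against latency.

Result: Human interpreters validated that the model exceeds 70% correctness in complex scenarios. The average latency of cloned speech drops from nearly 10 seconds to about 3 seconds, and translation quality is well ahead of commercial SI solutions.

Insight: Large-scale pretraining and reinforcement learning are key to high-quality, low-latency speech-to-speech translation, and the duplex framework addresses the bottlenecks of conventional SI systems.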

Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of nearly 70% that drastically enhances practical usability.

[10] Megrez2 Technical Report

Boxun Li,Yadong Li,Zhiyuan Li,Congyi Liu,Weilin Liu,Guowei Niu,Zheyue Tan,Haiyang Xu,Zhuyu Yao,Tao Yuan,Dong Zhou,Yueqing Zhuang,Bo Zhao,Guohao Dai,Yu Wang

Main category: cs.CL

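TL;DR: Megrez2 is a lightweight, efficient language model architecture for on-device deployment. Cross-layer expert sharing and pre-gated routing reduce the parameter count and improve inference efficiency.

Motivation: Provide a language model architecture that can be deployed efficiently on resource-constrained devices while balancing performance and efficiency.

Contribution: (1) A cross-layer expert sharing mechanism that reduces the parameter count; (2) pre-gated routing that improves memory efficiency and inference speed; (3) the Megrez2-Preview model, which is competitive with or better than larger models.

Method: Combines cross-layer expert sharing with pre-gated routing. The model is pre-trained on a 5-trillion-token corpus and further improved through supervised fine-tuning and reinforcement learning with verifiable rewards.

Result: With 3B activated and 7.5B stored parameters, Megrez2-Preview performs strongly on language understanding, instruction following, mathematical reasoning, and code generation.

Insight: A lightweight design can retain performance while using fewer resources, which makes it suitable for practical deployment.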

Abstract: We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device-native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model's capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture in balancing accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.

[11] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Linbo Cao,Jinman Zhao

Main category: cs.CL

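TL;DR: The paper proposes a debate-driven evaluation paradigm for question answering. Converting existing QA datasets into adversarial debate tasks makes evaluation substantially harder and reduces the impact of data contamination and memorization.

Motivation: As frontier language models saturate standard QA benchmarks, data contamination, memorization, and the cost of creating new datasets become pressing problems. The paper aims for a sustainable evaluation method that measures advanced reasoning more faithfully.

Contribution: (1) An evaluation pipeline that systematically converts QA tasks into debate-based assessments; (2) a public benchmark on a subset of MMLU-Pro questions that demonstrates the paradigm, with standardized protocols and reference models.

Method: Each QA item becomes a structured adversarial debate. One model defends the official answer, another constructs and defends an alternative answer, and a judge model that does not know the correct answer decides the outcome. Multi-round argumentation raises difficulty and penalizes shallow memorization.

Result: The method is robust to data contamination: a Llama 3.1 model fine-tuned on test questions improved its QA accuracy from 50% to 82% but performed worse in debates. Even weaker judges can reliably tell stronger debaters apart, and the cost is a fraction of that of building a new benchmark.

Insight: Debate-based evaluation reuses existing QA items instead of requiring new datasets and measures genuine reasoning ability more effectively, offering a sustainable path for evaluating more capable future systems.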

Abstract: As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates, where one model is given the official answer to defend and another constructs and defends an alternative answer, adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination: a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that "pretraining on the test set is no longer all you need," offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
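The debate protocol can be sketched as a small harness. Everything below is an illustrative assumption rather than the authors' pipeline: the `Debater` and `Judge` callables stand in for model calls, the alternative answer is passed in instead of being constructed by the opposing model, and side "A" always holds the official answer (a real harness would presumably randomize sides so the judge cannot exploit position).

```python
from typing import Callable, List

# (question, assigned answer, transcript so far) -> next argument
Debater = Callable[[str, str, List[str]], str]
# (question, full transcript) -> winning side, "A" or "B"
Judge = Callable[[str, List[str]], str]


def run_debate(question: str, official_answer: str, alternative_answer: str,
               pro: Debater, con: Debater, judge: Judge, rounds: int = 2) -> str:
    """Run a multi-round debate and return the side the judge picks."""
    transcript: List[str] = []
    for _ in range(rounds):
        # Side A defends the official answer, side B the alternative.
        transcript.append("A: " + pro(question, official_answer, transcript))
        transcript.append("B: " + con(question, alternative_answer, transcript))
    # The judge sees only the question and the arguments, never the answer key.
    return judge(question, transcript)
```

A benchmark score could then be the fraction of items on which a model wins as side A against a fixed opponent. The exact scoring rule is not given in the abstract.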

cs.CV

[12] Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery

Yi-Shan Chu,Hsuan-Cheng Wei

Main category: cs.CV

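TL;DR: Proposes a ViT-based deep learning framework that refines disaster-affected area segmentation from remote sensing imagery, in support of the EVAP product developed by the Taiwan Space Agency (TASA). Weakly supervised training and several decoder variants improve performance.

Motivation: Existing disaster-affected area segmentation methods perform poorly when accurate labels are missing. Smoother and more reliable segmentation is needed to support emergency response.

Contribution: (1) A ViT-based EVAP model for disaster area segmentation; (2) weakly supervised label expansion using PCA and a confidence index; (3) support for multiple decoders and multi-stage loss strategies to improve performance under limited supervision.

Method: (1) A small set of manually annotated regions is expanded using PCA-based feature space analysis and a confidence index. (2) ViT-based encoder-decoder models are trained on multi-band Sentinel-2 and Formosat-5 imagery. (3) Multiple decoder variants and multi-stage loss strategies are combined.

Result: In case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire, the framework improves the smoothness and reliability of the segmentation results.

Insight: Combining ViT with weakly supervised learning gives reliable disaster area segmentation when accurate ground truth is unavailable, offering a scalable approach to disaster mapping.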

Abstract: We propose a vision transformer (ViT)-based deep learning framework to refine disaster-affected area segmentation from remote sensing imagery, aiming to support and enhance the Emergent Value Added Product (EVAP) developed by the Taiwan Space Agency (TASA). The process starts with a small set of manually annotated regions. We then apply principal component analysis (PCA)-based feature space analysis and construct a confidence index (CI) to expand these labels, producing a weakly supervised training set. These expanded labels are then used to train ViT-based encoder-decoder models with multi-band inputs from Sentinel-2 and Formosat-5 imagery. Our architecture supports multiple decoder variants and multi-stage loss strategies to improve performance under limited supervision. During the evaluation, model predictions are compared with higher-resolution EVAP output to assess spatial coherence and segmentation consistency. Case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire demonstrate that our framework improves the smoothness and reliability of segmentation results, offering a scalable approach for disaster mapping when accurate ground truth is unavailable.

[13] Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors

Mohamed Adjel

Main category: cs.CV

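TL;DR: The paper proposes a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting, using camera intrinsics and anatomical priors to enable fast, personalized monocular 3D human pose estimation.

Motivation: Monocular 3D human pose estimation remains a challenging, ill-posed problem in real-time settings and unconstrained environments, and direct image-to-3D approaches need large annotated datasets and heavy models. The paper aims to improve accuracy and interpretability by combining data-driven learning with model-based priors.

Contribution: (1) A framework combining real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting; (2) explicit use of known camera intrinsics and subject-specific anatomical priors; (3) generation of large-scale 2D-3D training pairs from MoCap and synthetic data.

Method: Builds on self-calibration and biomechanically constrained inverse kinematics to generate large-scale 2D-3D training pairs, then relies on geometric priors and a lightweight model for efficient pose estimation.

Result: This is a proposal paper. It discusses how these ingredients can enable fast, accurate 3D pose estimation without specialized hardware, including on edge devices, and the abstract reports no experimental results.

Insight: Bridging data-driven learning and model-based priors is presented as a way to improve the accuracy, interpretability, and deployability of monocular 3D pose estimation.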

Abstract: Monocular 3D human pose estimation remains a challenging and ill-posed problem, particularly in real-time settings and unconstrained environments. While direct image-to-3D approaches require large annotated datasets and heavy models, 2D-to-3D lifting offers a more lightweight and flexible alternative, especially when enhanced with prior knowledge. In this work, we propose a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting, explicitly leveraging known camera intrinsics and subject-specific anatomical priors. Our approach builds on recent advances in self-calibration and biomechanically-constrained inverse kinematics to generate large-scale, plausible 2D-3D training pairs from MoCap and synthetic datasets. We discuss how these ingredients can enable fast, personalized, and accurate 3D pose estimation from monocular images without requiring specialized hardware. This proposal aims to foster discussion on bridging data-driven learning and model-based priors to improve accuracy, interpretability, and deployability of 3D human motion capture on edge devices in the wild.
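To make the role of camera intrinsics concrete, the sketch below back-projects a 2D keypoint into 3D with the standard pinhole camera model and measures a bone length that an anatomical prior could constrain. It is a generic geometric illustration, not the paper's lifting network: the depth is assumed to be given, and the function names and numbers are hypothetical.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole model: pixel (u, v) at a known depth -> camera-frame 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def bone_length(joint_a: Point3D, joint_b: Point3D) -> float:
    """Euclidean distance between two joints, e.g. to check a limb-length prior."""
    return math.dist(joint_a, joint_b)


# Hypothetical intrinsics for a 1280x720 camera and two keypoints 2 m away.
elbow = back_project(740, 360, 2.0, fx=1000, fy=1000, cx=640, cy=360)
wrist = back_project(740, 500, 2.0, fx=1000, fy=1000, cx=640, cy=360)
forearm = bone_length(elbow, wrist)
```

A lifting method has to estimate the depth that this sketch takes as input, and a subject-specific limb length gives one constraint that narrows the plausible depths.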

[14] Coarse-to-fine crack cue for robust crack detection

Zelong Liu,Yuliang Gu,Zhichao Sun,Huachao Zhu,Xin Xiao,Bo Du,Laurent Najman,Yongchao Xu

Main category: cs.CV

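TL;DR: The paper proposes CrackCue, a method based on coarse-to-fine crack cue generation that exploits the thin structure of cracks to make crack detection more robust and better at generalizing.

Motivation: Current deep learning methods for crack detection generalize poorly to unseen domains and overlook the thin structure property of cracks, so a more robust method is needed.

Contribution: Proposes CrackCue, which uses coarse-to-fine crack cue generation to embed prior information about cracks into detection networks, significantly improving generalization and robustness.

Method: Max-pooling followed by upsampling produces a coarse crack-free background, a reconstruction network refines it into a fine crack-free background, and the difference between the original image and that background yields the fine crack cue.

Result: Experiments show that CrackCue significantly improves the baseline methods and that the cue is unaffected by complex backgrounds, shadows, and varied lighting.

Insight: The thin structure of cracks is the key to more robust detection, and coarse-to-fine cue generation is an effective way to bring that property into the detection task.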

Abstract: Crack detection is an important task in computer vision. Despite impressive in-dataset performance, deep learning-based methods still struggle to generalize to unseen domains. The thin structure property of cracks is usually overlooked by previous methods. In this work, we introduce CrackCue, a novel method for robust crack detection based on coarse-to-fine crack cue generation. The core concept lies in leveraging the thin structure property to generate a robust crack cue that guides crack detection. Specifically, we first employ a simple max-pooling and upsampling operation on the crack image. This results in a coarse crack-free background, based on which a fine crack-free background can be obtained via a reconstruction network. The difference between the original image and the fine crack-free background provides a fine crack cue. This fine cue embeds robust crack prior information which is unaffected by complex backgrounds, shadows, and varied lighting. As a plug-and-play method, we incorporate the proposed CrackCue into three advanced crack detection networks. Extensive experimental results demonstrate that the proposed CrackCue significantly improves the generalization ability and robustness of the baseline methods. The source code will be publicly available.
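The coarse stage described above can be illustrated in a few lines. The sketch assumes a grayscale image in which cracks are darker than their surroundings, non-overlapping max-pooling windows, and nearest-neighbour upsampling. The function names are hypothetical, and the paper's reconstruction network (the fine stage) is omitted, so the result is only a coarse cue.

```python
from typing import List

Image = List[List[float]]  # grayscale intensities, rows of equal length


def coarse_background(image: Image, k: int) -> Image:
    """Max-pool with a k x k window, then upsample back to the input size."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, k):
        for left in range(0, w, k):
            rows = range(top, min(top + k, h))
            cols = range(left, min(left + k, w))
            # A thin dark crack cannot fill a whole window, so the maximum
            # comes from the brighter surrounding background.
            block_max = max(image[r][c] for r in rows for c in cols)
            for r in rows:
                for c in cols:
                    out[r][c] = block_max
    return out


def crack_cue(image: Image, k: int) -> Image:
    """Background minus image: large where a thin dark structure was removed."""
    background = coarse_background(image, k)
    return [[b - p for b, p in zip(b_row, p_row)]
            for b_row, p_row in zip(background, image)]


# A 4x4 patch with a one-pixel-wide dark vertical crack in column 1.
patch = [[0.8, 0.1, 0.8, 0.8] for _ in range(4)]
cue = crack_cue(patch, k=2)
```

The window size `k` has to exceed the crack width. Otherwise a window can lie entirely inside the crack and the crack survives into the background.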

[15] CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis

Xiaoqiang He

Main category: cs.CV

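TL;DR: CLAMP is an end-to-end contrastive learning framework for multimodal aspect-based sentiment analysis. Progressive attention fusion, multi-task contrastive learning, and adaptive multi-loss aggregation address cross-modal alignment noise and inconsistency in fine-grained representations.

Motivation: Existing methods for multimodal aspect-based sentiment analysis suffer from cross-modal alignment noise and insufficient consistency in fine-grained representations. In particular, global modality alignment overlooks the link between aspect terms and their local visual regions.

Contribution: Proposes the CLAMP framework, made up of three modules: a Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation.

Method: (1) The Progressive Attention Fusion network strengthens fine-grained alignment through hierarchical, multi-stage cross-modal interaction; (2) Multi-task Contrastive Learning combines global modal contrast with local granularity alignment; (3) Adaptive Multi-loss Aggregation adjusts loss weights dynamically according to each task's uncertainty.

Result: On standard public benchmarks, CLAMP outperforms the vast majority of existing state-of-the-art methods.

Insight: Handling alignment noise and representation consistency is central to multimodal sentiment analysis, and dynamic loss calibration together with progressive fusion improves model performance.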

Abstract: Multimodal aspect-based sentiment analysis (MABSA) seeks to identify aspect terms within paired image-text data and determine their fine-grained sentiment polarities, representing a fundamental task for improving the effectiveness of applications such as product review systems and public opinion monitoring. Existing methods face challenges such as cross-modal alignment noise and insufficient consistency in fine-grained representations. While global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, bridging the representation gap between text and images remains a challenge. To address these limitations, this paper introduces an end-to-end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion (CLAMP). The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. The Progressive Attention Fusion network enhances fine-grained alignment between textual features and image regions via hierarchical, multi-stage cross-modal interactions, effectively suppressing irrelevant visual noise. Multi-task Contrastive Learning combines global modal contrast and local granularity alignment to enhance cross-modal representation consistency. Adaptive Multi-loss Aggregation employs a dynamic uncertainty-based weighting mechanism to calibrate loss contributions according to each task's uncertainty, thereby mitigating gradient interference. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state-of-the-art methods.
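One widely used form of uncertainty-based loss weighting gives each task a learnable log-variance s_i and combines the losses as the sum of exp(-s_i) * L_i + s_i. The sketch below implements that form as an illustration only. Whether CLAMP uses exactly this formula is not stated in the abstract, and the function name is hypothetical.

```python
import math
from typing import Sequence


def uncertainty_weighted_loss(losses: Sequence[float],
                              log_vars: Sequence[float]) -> float:
    """Combine task losses as sum(exp(-s_i) * L_i + s_i).

    A larger log-variance s_i down-weights task i, while the additive s_i
    term keeps the model from inflating every s_i to ignore all tasks.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Two tasks with equal uncertainty reduce to a plain sum of the losses.
total = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

In training, the `log_vars` would be parameters updated by the optimizer alongside the network weights. Here they are plain floats to keep the example self-contained.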

[16] SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

Youngjin Na,Sangheon Jeong,Youngwan Lee

Main category: cs.CV

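TL;DR: SIA is a training-free prompt engineering framework that improves the safety of vision-language models (VLMs) through intent awareness, proactively detecting and mitigating potential risks in multimodal inputs.

Motivation: Existing approaches rely on post hoc filtering or static refusal prompts and struggle to detect latent harmful intent in multimodal inputs, especially when the harm arises only from the combination of inputs.

Contribution: Proposes the SIA framework, which dynamically infers the implicit intent behind an input to improve model safety and alignment.

Method: SIA reasons in three stages: visual abstraction via captioning, intent inference through few-shot chain-of-thought prompting, and intent-conditioned response refinement.

Result: On several safety-critical benchmarks (SIUO, MM-SafetyBench, HoliSafe), SIA achieves substantial safety improvements and outperforms prior methods.

Insight: Intent-aware reasoning improves VLM safety at the cost of a minor reduction in general reasoning accuracy on MMStar, and the safety gains are substantial.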

Abstract: As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.
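The three-stage process can be sketched as a small pipeline. The `captioner` and `llm` callables stand in for whatever models are used, and the prompt wording is invented for illustration. The paper's actual prompts and few-shot examples are not reproduced here.

```python
from typing import Callable


def sia_respond(image: bytes, user_text: str,
                captioner: Callable[[bytes], str],
                llm: Callable[[str], str]) -> str:
    """Hypothetical three-stage pipeline: caption -> intent -> response."""
    # Stage 1: visual abstraction via captioning.
    caption = captioner(image)

    # Stage 2: intent inference. A real prompt would prepend few-shot
    # chain-of-thought examples; a placeholder marks where they would go.
    intent = llm(
        "[few-shot examples would go here]\n"
        f"Image caption: {caption}\n"
        f"User request: {user_text}\n"
        "Reason step by step about what the user intends, then state the intent."
    )

    # Stage 3: intent-conditioned response refinement.
    return llm(
        f"Image caption: {caption}\n"
        f"User request: {user_text}\n"
        f"Inferred intent: {intent}\n"
        "Answer helpfully if the intent is benign; otherwise decline and explain."
    )
```

Because the intent is inferred from the caption and the text together, a request that looks harmless in either modality alone can still be flagged in stage 2.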

[17] Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Xiang Li

Main category: cs.CV

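TL;DR: The paper proposes using 2D priors to pre-align LiDAR and camera features before fusion. Local and global alignment make 3D detection more robust, and the method reaches state-of-the-art performance on nuScenes.

Motivation: Misalignment between LiDAR and camera features degrades 3D detection. The paper uses 2D object priors to address it.

Contribution: Proposes three modules: Prior Guided Depth Calibration (PGDC) for local misalignment, Discontinuity Aware Geometric Fusion (DAGF) for global misalignment, and Structural Guidance Depth Modulator (SGDM) for fusing the aligned features.

Method: (1) PGDC uses 2D priors to correct local misalignment; (2) DAGF processes the calibrated results, suppressing noise and sharpening transitions at object-background boundaries; (3) SGDM uses a gated attention mechanism to fuse the aligned depth and image features efficiently.

Result: On the nuScenes validation set, the method reaches 71.5% mAP and 73.6% NDS, which is state-of-the-art performance.

Insight: Projection errors concentrate at object-background boundaries, which 2D detectors identify readily, so 2D boundary information can improve LiDAR-camera alignment and the robustness of 3D detection.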

Abstract: Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing the 3D perception capabilities of autonomous vehicles. However, current methods are often affected by misalignment between camera and LiDAR features. This misalignment leads to inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from minor extrinsic calibration inaccuracies and the rolling shutter effect of LiDAR during vehicle motion. In this work, our key insight is that these projection errors are predominantly concentrated at object-background boundaries, which are readily identified by 2D detectors. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to correct local misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to process calibrated results from PGDC, suppressing noise and explicitly enhancing sharp transitions at object-background boundaries. To effectively utilize these transition-aware depth representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our proposed method achieves state-of-the-art performance on the nuScenes validation set, with its mAP and NDS reaching 71.5% and 73.6% respectively.

[18] Pixels, Patterns, but No Poetry: To See The World like Humans

Hongcheng Gao,Zihao Huang,Lin Xu,Jingyi Tang,Xinhao Li,Yue Liu,Haoyang Li,Taihang Hu,Minhua Lin,Xinlong Yang,Ge Wu,Balong Bi,Hongyu Chen,Wentao Zhang

Main category: cs.CV

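TL;DR: This paper introduces the Turing Eye Test (TET), a benchmark focused on evaluating the perceptual abilities of multimodal large language models (MLLMs), and reveals major failures of current MLLMs on tasks that humans solve intuitively.

Details Motivation: Current MLLM research focuses mainly on reasoning and neglects perception. The paper asks whether MLLMs can truly perceive the world as humans do.

Contribution: Proposes the TET benchmark, comprising four diagnostic tasks that evaluate MLLM perception on synthetic images.

Method: Builds TET tasks from synthetic images, tests MLLMs on them, and attempts to improve performance through in-context learning, language backbone training, and vision tower fine-tuning.

Result: Existing MLLMs perform extremely poorly on TET tasks. In-context learning and language backbone training do not help, whereas fine-tuning the vision tower improves performance markedly, indicating that visual generalization is the main gap between current MLLMs and human perception.

Insight: Weak visual generalization is the key reason for the poor perception of MLLMs, so future work should pay more attention to improving the vision tower.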

Abstract: Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone, which are effective for previous benchmarks, fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.

[19] HIPPO-Video: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting

Jeongeun Lee,Youngjae Yu,Dongha Lee

Main category: cs.CV

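TL;DR: HIPPO-Video uses large language models to generate personalized watch-history data and proposes HiPHer, a method that uses this data to predict saliency scores for the video segments a user would prefer, outperforming existing methods.

Details Motivation: The explosive growth of video content makes personalized video highlighting an important task, but existing datasets lack personalization and fail to capture the complexity of user behavior.

Contribution: 1) Introduces the HIPPO-Video dataset, which simulates user watch histories with an LLM to reflect diverse preferences; 2) Proposes HiPHer, which uses watch histories to predict saliency scores and outperforms existing methods.

Method: Uses an LLM-based user simulator to generate realistic watch histories, and proposes HiPHer to predict segment-wise saliency scores conditioned on those histories.

Result: Experiments show that HiPHer outperforms generic and query-based methods, validating its effectiveness for personalized video highlighting.

Insight: LLMs can simulate complex user behavior to build realistic datasets, and personalized history data is crucial for video highlighting.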

Abstract: The exponential growth of video content has made personalized video highlighting an essential task, as user preferences are highly variable and complex. Existing video datasets, however, often lack personalization, relying on isolated videos or simple text queries that fail to capture the intricacies of user behavior. In this work, we introduce HIPPO-Video, a novel dataset for personalized video highlighting, created using an LLM-based user simulator to generate realistic watch histories reflecting diverse user preferences. The dataset includes 2,040 (watch history, saliency score) pairs, covering 20,400 videos across 170 semantic categories. To validate our dataset, we propose HiPHer, a method that leverages these personalized watch histories to predict preference-conditioned segment-wise saliency scores. Through extensive experiments, we demonstrate that our method outperforms existing generic and query-based approaches, showcasing its potential for highly user-centric video highlighting in real-world scenarios.

[20] ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension

Yizhi Hu,Zezhao Tian,Xingqun Qi,Chen Su,Bingkun Yang,Junhui Yin,Muyi Sun,Man Zhang,Zhenan Sun

Main category: cs.CV

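TL;DR: This paper proposes ReMeREC, a new framework for multi-entity referring expression comprehension (REC). By constructing the relation-aware dataset ReMeX and the auxiliary dataset EntityText, and by combining a Text-adaptive Multi-entity Perceptron (TMP) with an Entity Inter-relationship Reasoner (EIR), it substantially improves the accuracy of multi-entity localization and relational reasoning.

Details Motivation: Existing REC methods focus on single-entity localization and overlook the complex inter-entity relationships in multi-entity scenes, and high-quality datasets with relation annotations are lacking. This limits model accuracy and hinders further research.

Contribution: 1. Constructs ReMeX, a relation-aware multi-entity REC dataset with fine-grained relationship and textual annotations.
2. Proposes ReMeREC, a new framework that combines the TMP and EIR modules to jointly model multi-entity localization and relational reasoning.
3. Shows experimentally that ReMeREC achieves superior performance on multi-entity grounding and relation prediction.

Method: 1. Text-adaptive Multi-entity Perceptron (TMP): dynamically infers the number of entities and their spans, producing distinctive representations.
2. Entity Inter-relationship Reasoner (EIR): strengthens relational reasoning and global scene understanding.
3. Builds the auxiliary dataset EntityText with large language models to improve fine-grained language comprehension.

Result: Experiments on four benchmark datasets show that ReMeREC outperforms existing methods by a large margin on multi-entity grounding and relation prediction.

Insight: 1. Multi-entity REC requires attention to both entity localization and relational reasoning.
2. Fine-grained textual and relationship annotations are critical to model performance.
3. Combining dynamic inference with relation modeling effectively improves referring expression comprehension in multi-entity scenes.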

Abstract: Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.

[21] CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos

Xuchen Li,Xuzhao Li,Shiyu Hu,Kaiqi Huang,Wentao Zhang

Main category: cs.CV

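TL;DR: CausalStep is a benchmark designed for explicit stepwise causal reasoning in videos. By segmenting videos into causal units and enforcing a strict stepwise question-answering protocol, it evaluates model capabilities comprehensively and reveals the gap between current models and human reasoning.

Details Motivation: Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, so they fail to rigorously evaluate true causal and stepwise reasoning. CausalStep was developed to address this.

Contribution: 1. Designs the CausalStep benchmark for evaluating stepwise causal reasoning; 2. Introduces a strict stepwise QA protocol and distractors built from an error type taxonomy; 3. Provides seven diagnostic metrics for comprehensive evaluation.

Method: 1. Segments videos into causally linked units; 2. Uses a stepwise QA protocol that requires sequential answers; 3. Designs distractors based on an error type taxonomy; 4. Includes 100 videos and 1,852 multiple-choice QA pairs.

Result: Experiments show a significant gap between leading proprietary and open-source models and the human baseline in stepwise reasoning.

Insight: CausalStep provides a rigorous evaluation standard for video reasoning and highlights that models need stronger stepwise and causal reasoning to achieve more robust and interpretable video reasoning.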

Abstract: Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.
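The abstract says the protocol requires sequential answers and prevents shortcut solutions, but it does not spell out the scoring rules or name the seven metrics. The sketch below shows one plausible reading: questions in a causal chain are scored in order, and scoring stops at the first wrong answer. The function `stepwise_score` and its two return values are hypothetical illustrations, not the benchmark's official metrics.

```python
def stepwise_score(predictions, answers):
    """Score one causal chain under a strict stepwise reading.

    predictions, answers: option labels (e.g. "A", "B") in chain order.
    Returns (correct_prefix, chain_success), where correct_prefix is the
    number of consecutive correct answers counted from the first question.
    """
    correct_prefix = 0
    for pred, gold in zip(predictions, answers):
        if pred != gold:
            break  # a later correct answer cannot rescue a broken chain
        correct_prefix += 1
    chain_success = correct_prefix == len(answers)
    return correct_prefix, chain_success
```

Under this reading, a model that answers the first question correctly, misses the second, and gets the third right is credited with a prefix of one, which is what separates stepwise scoring from plain per-question accuracy.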

[22] AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation

Nima Fathi,Amar Kumar,Tal Arbel

Main category: cs.CV

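TL;DR: AURA is a multimodal medical agent that advances the transparency and adaptability of medical image analysis through visual-linguistic explainability, dynamic interaction, and hypothesis testing.

Details Motivation: The application of large language models (LLMs) to medical imaging is still at an early stage, and AI systems that are more transparent, adaptable, and clinically aligned are needed. AURA aims to fill this gap.

Contribution: Proposes AURA, the first visual-linguistic explainability agent for medical images, with dynamic interaction, contextual explanation, and hypothesis testing capabilities.

Method: Built on Qwen-32B, AURA integrates a modular toolbox comprising a segmentation suite, a counterfactual image-generation module, and evaluation tools.

Result: AURA enables comprehensive analysis and explanation of medical images, improving the transparency and clinical alignment of AI systems.

Insight: Agentic AI shows promise for medical image analysis and can drive a shift from static prediction to interactive decision support.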

Abstract: Recent advancements in Large Language Models (LLMs) have catalyzed a paradigm shift from static prediction systems to agentic AI agents capable of reasoning, interacting with tools, and adapting to complex tasks. While LLM-based agentic systems have shown promise across many domains, their application to medical imaging remains in its infancy. In this work, we introduce AURA, the first visual linguistic explainability agent designed specifically for comprehensive analysis, explanation, and evaluation of medical images. By enabling dynamic interactions, contextual explanations, and hypothesis testing, AURA represents a significant advancement toward more transparent, adaptable, and clinically aligned AI systems. We highlight the promise of agentic AI in transforming medical image analysis from static predictions to interactive decision support. Leveraging Qwen-32B, an LLM-based architecture, AURA integrates a modular toolbox comprising: (i) a segmentation suite with phase grounding, pathology segmentation, and anatomy segmentation to localize clinically meaningful regions; (ii) a counterfactual image-generation module that supports reasoning through image-level explanations; and (iii) a set of evaluation tools including pixel-wise difference-map analysis, classification, and advanced state-of-the-art components to assess diagnostic relevance and visual interpretability.

[23] Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts

Chiao-An Yang,Kuan-Chuan Peng,Raymond A. Yeh

Main category: cs.CV

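TL;DR: This paper studies the new task of long-tailed online anomaly detection (LTOAD), finds that existing offline long-tailed anomaly detection methods cannot be applied directly in the online setting, and proposes a class-agnostic framework that is then adapted to online learning.

Details Motivation: Real-world anomaly detection typically lacks abnormal samples, has long-tailed data distributions, and requires online learning. Existing offline methods rely on class labels and therefore cannot be applied directly.

Contribution: Proposes the first class-agnostic framework for long-tailed online anomaly detection, which outperforms existing methods in most offline settings and in the online setting.

Method: Designs a class-agnostic framework and adapts it to online learning, removing the dependence on class labels that are unavailable online.

Result: In offline experiments in the industrial manufacturing and medical domains, the method outperforms baselines (for example, +4.63% image-AUROC on MVTec), and it also performs better in the online setting (+0.53% image-AUROC).

Insight: The class-agnostic design and its adaptation to online learning are the key ingredients, offering a new direction for anomaly detection under long-tailed distributions.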

Abstract: Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. In this work, we expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD). We first identified that the offline state-of-the-art LTAD methods cannot be directly applied to the online setting. Specifically, LTAD is class-aware, requiring class labels that are not available in the online setting. To address this challenge, we propose a class-agnostic framework for LTAD and then adapt it to our online learning setting. Our method outperforms the SOTA baselines in most offline LTAD settings, including both the industrial manufacturing and the medical domain. In particular, we observe +4.63% image-AUROC on MVTec even compared to methods that have access to class labels and the number of classes. In the most challenging long-tailed online setting, we achieve +0.53% image-AUROC compared to baselines. Our LTOAD benchmark is released here: https://doi.org/10.5281/zenodo.16283852 .
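The reported gains are measured in image-AUROC, which can be read as the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal image. The sketch below computes it by direct pairwise comparison. The function name `image_auroc` and the toy inputs are illustrative assumptions, and real evaluations would typically use a library routine rather than this quadratic loop.

```python
def image_auroc(scores, labels):
    """AUROC from per-image anomaly scores and binary labels (1 = anomalous).

    Counts, over all (anomalous, normal) pairs, how often the anomalous
    image is scored higher; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfect detector scores 1.0 and a random one about 0.5, so a change of a few points near the top of the range can still be meaningful.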

[24] Divisive Decisions: Improving Salience-Based Training for Generalization in Binary Classification Tasks

Jacob Piland,Chris Sweet,Adam Czajka

Main category: cs.CV

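TL;DR: This paper proposes improved saliency-based training methods that use the class activation maps (CAMs) of both the true class and the false class to improve the generalization of deep learning models on binary classification tasks.

Details Motivation: Existing saliency-guided training methods use only the true-class CAM and ignore the false-class CAM. The paper hypothesizes that in binary classification the true-class and false-class CAMs should diverge on important features, and uses this divergence to improve the training strategy.

Contribution: (1) Proposes three new saliency-guided training methods that use both true-class and false-class CAMs; (2) Designs a new post-hoc tool for identifying important features; (3) Validates the methods on several binary classification tasks.

Method: The core idea is to contrast the true-class and false-class CAMs and design new loss functions that guide training with saliency maps. Three strategies are described: divergence maximization, target-region alignment, and joint optimization.

Result: On synthetic face detection, biometric presentation attack detection, and chest X-ray anomaly classification, the new methods outperform traditional training that uses only the true-class CAM and improve generalization.

Insight: False-class CAM information matters for training too. Explicitly exploiting its divergence from the true-class CAM better guides the model toward discriminative features.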

Abstract: Existing saliency-guided training approaches improve model generalization by incorporating a loss term that compares the model’s class activation map (CAM) for a sample’s true class (i.e., correct-label class) against a human reference saliency map. However, prior work has ignored the false-class CAM(s), that is, the model’s saliency obtained for the incorrect-label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency-guided training methods incorporating both the true- and false-class CAMs of the model into the training strategy and a novel post-hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary closed-set and open-set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X-ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true-class CAM only) saliency-guided training approaches. We offer source codes and model weights (GitHub repository link removed to preserve anonymity) to support reproducible research.

[25] Transformer Based Building Boundary Reconstruction using Attraction Field Maps

Muhammad Kamran,Mohammad Moein Sheikholeslami,Andreas Wichmann,Gunho Sohn

Main category: cs.CV

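TL;DR: The paper proposes a new method based on Graph Convolutional Networks (GCNs) that automatically reconstructs building boundaries from a single satellite image by incorporating geometric regularity and Attraction Field Maps, with clear performance gains.

Details Motivation: Satellite imagery provides rich visual data, but traditional spatial map generation relies on inefficient manual work. Automatically and accurately reconstructing building boundaries from a single satellite image is an important and challenging task.

Contribution: 1. Proposes the Decoupled-PolyGCN model, which combines geometric regularity with multi-scale features; 2. Introduces Attraction Field Maps to improve boundary reconstruction accuracy; 3. Outperforms existing methods by 6% in AP and 10% in AR.

Method: Uses a Graph Convolutional Network (GCN) that integrates Attraction Field Maps and multi-resolution features to enforce geometric regularity in building boundaries.

Result: The model performs well across diverse and challenging scenes, improving AP by 6% and AR by 10%, which supports its accuracy and regularization ability.

Insight: Combining Attraction Field Maps with multi-scale features is key to building boundary reconstruction, and introducing geometric regularity markedly improves performance.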

Abstract: In recent years, the number of remote satellites orbiting the Earth has grown significantly, streaming vast amounts of high-resolution visual data to support diverse applications across civil, public, and military domains. Among these applications, the generation and updating of spatial maps of the built environment have become critical due to the extensive coverage and detailed imagery provided by satellites. However, reconstructing spatial maps from satellite imagery is a complex computer vision task, requiring the creation of high-level object representations, such as primitives, to accurately capture the built environment. While the past decade has witnessed remarkable advancements in object detection and representation using visual data, primitives-based object representation remains a persistent challenge in computer vision. Consequently, high-quality spatial maps often rely on labor-intensive and manual processes. This paper introduces a novel deep learning methodology leveraging Graph Convolutional Networks (GCNs) to address these challenges in building footprint reconstruction. The proposed approach enhances performance by incorporating geometric regularity into building boundaries, integrating multi-scale and multi-resolution features, and embedding Attraction Field Maps into the network. These innovations provide a scalable and precise solution for automated building footprint extraction from a single satellite image, paving the way for impactful applications in urban planning, disaster management, and large-scale spatial analysis. Our model, Decoupled-PolyGCN, outperforms existing methods by 6% in AP and 10% in AR, demonstrating its ability to deliver accurate and regularized building footprints across diverse and challenging scenarios.

[26] Controllable Hybrid Captioner for Improved Long-form Video Understanding

Kuleen Sasse,Efsun Sarioglu Kayi,Arun Reddy

Main category: cs.CV

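TL;DR: The paper proposes a controllable hybrid captioner that combines dynamic action descriptions with static scene descriptions to improve the textual representation used for long-form video understanding, and uses an LLM to answer complex queries.

Details Motivation: Long-form video is dense and high-dimensional. Traditional video captioners focus only on dynamic actions and ignore static scene information, which limits the ability to answer complex questions.

Contribution: 1. Proposes a controllable hybrid captioner that alternates between action captions and static scene captions; 2. Combines video segmentation with an LLM to build a more accurate text memory; 3. Folds the capabilities of LaViLa and LLaVA into a single fine-tuned captioner to improve captioning efficiency.

Method: 1. Partitions the video into segments to improve caption structure; 2. Combines LaViLa (dynamic actions) and LLaVA (static scenes) to produce hybrid captions; 3. Switches caption type through special input tokens.

Result: The model produces a more complete caption log, widens the range of answerable questions, and significantly improves captioning efficiency.

Insight: Hybrid captioning that combines dynamic and static information is key to better long-form video understanding, and controlling caption type with input tokens adapts efficiently to changes in video content.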

Abstract: Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To build such representations, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log composed solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with an LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using the LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, the controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signal scene changes detected in the video.

[27] Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models

Tz-Ying Wu,Tahani Trigui,Sharath Nittur Sridhar,Anand Bodas,Subarna Tripathi

Main category: cs.CV

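TL;DR: The paper proposes VideoNarrator, a training-free approach that uses off-the-shelf multimodal large language models (MLLMs) and visual-language models (VLMs) to generate dense video captions, significantly improving temporal alignment and caption quality.

Details Motivation: MLLMs used for video understanding often suffer from inaccurate temporal alignment and hallucination, especially in unfamiliar scenarios. The paper aims to address these problems and make video narration more accurate and useful.

Contribution: 1. Proposes VideoNarrator, a training-free pipeline; 2. Improves temporal alignment and reduces hallucination through the synergy of off-the-shelf models; 3. Supports downstream tasks such as video summarization and question answering.

Method: Uses off-the-shelf MLLMs and VLMs as caption generators, context providers, and caption verifiers, and relies on their interaction to improve narration quality and temporal alignment.

Result: Experiments show that the approach significantly improves narration accuracy and temporal alignment, reduces hallucination, and suits a range of downstream tasks.

Insight: A training-free approach can make efficient use of off-the-shelf models to improve video narration, demonstrating the potential of combining multimodal models.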

Abstract: In this paper, we introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions that offer a structured snapshot of video content. These captions offer detailed narrations with precise timestamps, capturing the nuances present in each segment of the video. Despite advancements in multimodal large language models (MLLMs) for video comprehension, these models often struggle with temporally aligned narrations and tend to hallucinate, particularly in unfamiliar scenarios. VideoNarrator addresses these challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models (VLMs) can function as caption generators, context providers, or caption verifiers. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations, effectively reducing hallucinations and improving temporal alignment. This structured approach not only enhances video understanding but also facilitates downstream tasks such as video summarization and video question answering, and can be potentially extended for advertising and marketing applications.

[28] Few-Shot Learning in Video and 3D Object Detection: A Survey

Md Meftahul Ferdaus,Kendall N. Niles,Joe Tom,Mahdi Abdelguerfi,Elias Ioup

Main category: cs.CV

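TL;DR: This survey reviews recent advances in few-shot learning (FSL) for video and 3D object detection, shows how novel classes can be detected from a few annotated examples to reduce manual labeling cost, and analyzes the challenges and solutions that arise with spatiotemporal structure and point cloud data.

Details Motivation: Video and 3D object detection require large amounts of annotated data, and annotation is expensive. Few-shot learning can substantially reduce annotation needs and make these methods more practical.

Contribution: 1. Comprehensively surveys FSL methods for video and 3D object detection; 2. Analyzes the distinct challenges of cross-frame propagation and point cloud processing; 3. Identifies open issues for future research, such as balancing generalization and overfitting.

Method: For video, tube proposals and temporal matching networks; for 3D, specialized point cloud networks combined with losses tailored for class imbalance.

Result: FSL shows promise in video and 3D detection, markedly reducing annotation needs and supporting practical deployments such as autonomous driving.

Insight: By exploiting spatiotemporal structure and the properties of each data modality, FSL can further reduce supervision needs in video, 3D, and other domains and enable broader applications.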

Abstract: Few-shot learning (FSL) enables object detection models to recognize novel classes given only a few annotated examples, thereby reducing expensive manual data labeling. This survey examines recent FSL advances for video and 3D object detection. For video, FSL is especially valuable since annotating objects across frames is more laborious than for static images. By propagating information across frames, techniques like tube proposals and temporal matching networks can detect new classes from a couple examples, efficiently leveraging spatiotemporal structure. FSL for 3D detection from LiDAR or depth data faces challenges like sparsity and lack of texture. Solutions integrate FSL with specialized point cloud networks and losses tailored for class imbalance. Few-shot 3D detection enables practical autonomous driving deployment by minimizing costly 3D annotation needs. Core issues in both domains include balancing generalization and overfitting, integrating prototype matching, and handling data modality properties. In summary, FSL shows promise for reducing annotation requirements and enabling real-world video, 3D, and other applications by efficiently leveraging information across feature, temporal, and data modalities. By comprehensively surveying recent advancements, this paper illuminates FSL’s potential to minimize supervision needs and enable deployment across video, 3D, and other real-world applications.

[29] SDGOCC: Semantic and Depth-Guided Bird’s-Eye View Transformation for 3D Multimodal Occupancy Prediction

Zaipeng Duan,Chenxu Dang,Xuzhong Hu,Pei An,Junfeng Ding,Jie Zhan,Yunbiao Xu,Jie Ma

Main category: cs.CV

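TL;DR: SDG-OCC is a novel multimodal 3D occupancy prediction network. It combines a semantic and depth-guided view transformation with a fusion-to-occupancy-driven active distillation to address the inaccurate depth estimation and underused geometric information of existing methods.

Details Motivation: Most existing methods are single-modality: camera-based methods lack depth information, and LiDAR-based methods struggle with occlusions. Lightweight methods built on Lift-Splat-Shoot (LSS) are limited by inaccurate depth estimation and by underuse of geometric and semantic information.

Contribution: Proposes SDG-OCC, which builds more accurate depth distributions and richer semantic information through a joint semantic and depth-guided view transformation and a fusion-to-occupancy-driven active distillation.

Method: 1. Integrates pixel semantics and co-point depth through diffusion and bilinear discretization; 2. Extracts semantic information from multimodal data and selectively distills it into image features; 3. Offers two variants, SDG-Fusion (fusion only) and SDG-KL (fusion plus distillation).

Result: Achieves state-of-the-art performance with real-time processing on the Occ3D-nuScenes dataset and comparable performance on the SurroundOcc-nuScenes dataset.

Insight: Multimodal methods that combine semantic and depth information markedly improve the accuracy and robustness of 3D occupancy prediction, and the lightweight design suits real-time use.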

Abstract: Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.

[30] FedVLM: Scalable Personalized Vision-Language Models through Federated Learning

Arkajyoti Mitra,Afia Anjum,Paul Agbaje,Mert Pesé,Habeeb Olufowobi

Main category: cs.CV

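TL;DR: FedVLM is a federated learning framework for scalable, personalized fine-tuning of vision-language models (VLMs). With personalized LoRA (pLoRA), it significantly improves model performance on non-independent and identically distributed (non-iid) data.

Details Motivation: Existing parameter-efficient tuning methods such as LoRA struggle with data heterogeneity in federated settings, which leads to poor generalization. FedVLM aims to solve this and enable efficient personalized tuning in distributed environments.

Contribution: 1. Proposes the FedVLM framework for federated fine-tuning of VLMs; 2. Introduces pLoRA, which dynamically adapts to each client's data distribution and significantly improves local adaptation; 3. Shows on the RLAIF-V dataset that pLoRA outperforms standard LoRA by 24.5%.

Method: 1. A federated LoRA fine-tuning framework (FedVLM); 2. Dynamic adaptation of pLoRA parameters to client data; 3. Global model aggregation combined with local personalized tuning.

Result: pLoRA improves client-specific performance by 24.5% over standard LoRA in non-iid settings, supporting the scalability and efficiency of FedVLM.

Insight: Combining federated learning with a personalized tuning strategy such as pLoRA handles data heterogeneity effectively and advances personalized models in distributed learning.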

Abstract: Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client’s unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.
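The abstract does not specify how pLoRA splits parameters between global aggregation and local adaptation. As a minimal sketch of the general pattern, assuming each client holds some LoRA parameters that are averaged by the server and some that stay local, the code below runs one aggregation round weighted by client sample counts. The `Client` class, its `shared` and `personal` fields, and `run_round` are hypothetical names; this is plain federated averaging with client-only parameters, not the authors' algorithm.

```python
from dataclasses import dataclass


@dataclass
class Client:
    num_samples: int  # size of the client's local dataset
    shared: list      # adapter parameters uploaded to the server
    personal: list    # hypothetical client-only parameters, never uploaded


def aggregate_shared(clients):
    """Sample-weighted average of the shared parameters (FedAvg-style)."""
    total = sum(c.num_samples for c in clients)
    dim = len(clients[0].shared)
    return [
        sum(c.num_samples * c.shared[k] for c in clients) / total
        for k in range(dim)
    ]


def run_round(clients):
    """One communication round: average shared params, leave personal ones alone."""
    global_shared = aggregate_shared(clients)
    for c in clients:
        c.shared = list(global_shared)  # each client receives its own copy
    return global_shared
```

Keeping `personal` out of the average is what lets each client retain an adaptation to its own data distribution while still benefiting from the globally aggregated part.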

[31] IONext: Unlocking the Next Era of Inertial Odometry

Shanshan Zhang,Siyue Wang,Tianshui Wen,Qi Zhang,Ziheng Zhou,Lingxiang Zheng,Yu Yang

Main category: cs.CV

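TL;DR: IONext is a new CNN-based inertial odometry framework. Through its DADM module and STGU unit, it combines global motion patterns with local, fine-grained motion features and significantly improves localization accuracy and generalization.

Details Motivation: Transformer models for inertial odometry can model long-range dependencies, but their limited sensitivity to local, fine-grained motion variations and their lack of inductive biases limit localization accuracy and generalization. IONext therefore proposes a CNN-based solution.

Contribution: 1. Proposes the Dual-wing Adaptive Dynamic Mixer (DADM), which dynamically captures multi-scale motion features; 2. Introduces the Spatio-Temporal Gating Unit (STGU) to improve temporal modeling; 3. Builds a new CNN backbone, IONext, that surpasses the state of the art on multiple datasets.

Method: 1. The DADM module dynamically generates weights to aggregate multi-scale features adaptively; 2. The STGU selectively extracts task-relevant motion features in the temporal domain; 3. The two are combined into the IONext backbone.

Result: Experiments on six public datasets show that IONext outperforms existing Transformer- and CNN-based methods. For example, on the RNIN dataset it reduces the average ATE by 10% and the average RTE by 12%.

Insight: With dynamic weights and selective gating units, CNNs can capture motion features in inertial odometry more effectively and compensate for the weaknesses of Transformers.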

Abstract: Researchers have increasingly adopted Transformer-based models for inertial odometry. While Transformers excel at modeling long-range dependencies, their limited sensitivity to local, fine-grained motion variations and lack of inherent inductive biases often hinder localization accuracy and generalization. Recent studies have shown that incorporating large-kernel convolutions and Transformer-inspired architectural designs into CNN can effectively expand the receptive field, thereby improving global motion perception. Motivated by these insights, we propose a novel CNN-based module called the Dual-wing Adaptive Dynamic Mixer (DADM), which adaptively captures both global motion patterns and local, fine-grained motion features from dynamic inputs. This module dynamically generates selective weights based on the input, enabling efficient multi-scale feature aggregation. To further improve temporal modeling, we introduce the Spatio-Temporal Gating Unit (STGU), which selectively extracts representative and task-relevant motion features in the temporal domain. This unit addresses the limitations of temporal modeling observed in existing CNN approaches. Built upon DADM and STGU, we present a new CNN-based inertial odometry backbone, named Next Era of Inertial Odometry (IONext). Extensive experiments on six public datasets demonstrate that IONext consistently outperforms state-of-the-art (SOTA) Transformer- and CNN-based methods. For instance, on the RNIN dataset, IONext reduces the average ATE by 10% and the average RTE by 12% compared to the representative model iMOT.
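The abstract reports ATE and RTE without defining them. In inertial odometry these commonly denote absolute and relative trajectory error, though exact definitions (alignment, window length) vary between papers. The sketch below uses one common form on 2D positions: ATE as the RMSE of position errors, and RTE as the RMSE of displacement errors over a fixed window. The function names and the `window` parameter are illustrative assumptions, not the paper's evaluation code.

```python
import math


def ate(pred, gt):
    """Absolute trajectory error: RMSE of per-step position error."""
    sq = [(px - gx) ** 2 + (py - gy) ** 2 for (px, py), (gx, gy) in zip(pred, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rte(pred, gt, window):
    """Relative trajectory error: RMSE of displacement error over `window` steps."""
    sq = []
    for t in range(len(pred) - window):
        dpx = pred[t + window][0] - pred[t][0]
        dpy = pred[t + window][1] - pred[t][1]
        dgx = gt[t + window][0] - gt[t][0]
        dgy = gt[t + window][1] - gt[t][1]
        sq.append((dpx - dgx) ** 2 + (dpy - dgy) ** 2)
    return math.sqrt(sum(sq) / len(sq))
```

A trajectory that is shifted by a constant offset has a nonzero ATE but a zero RTE, which is why the two metrics are usually reported together: ATE reflects global drift, while RTE reflects local consistency.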

[32] Robust Five-Class and binary Diabetic Retinopathy Classification Using Transfer Learning and Data Augmentation

Faisal Ahmed,Mohammad Alfrad Nobel Bhuiyan

Main category: cs.CV

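TL;DR: The paper proposes a deep learning framework based on transfer learning and data augmentation for binary and five-class diabetic retinopathy (DR) classification, and reports strong performance on the APTOS 2019 dataset.

Details Motivation: To address class imbalance and limited training data in diabetic retinopathy diagnosis, and to improve the accuracy and practicality of automated diagnosis.

Contribution: 1. Proposes a deep learning framework that combines transfer learning with extensive data augmentation; 2. Reports state-of-the-art accuracy on the binary task and competitive results on the five-class task; 3. Shows that EfficientNet-B0 and ResNet34 offer the best trade-off between accuracy and computational efficiency.

Method: 1. Uses several pretrained CNN architectures, including ResNet and EfficientNet variants; 2. Applies data augmentation to mitigate class imbalance; 3. Evaluates on the APTOS 2019 dataset.

Result: 1. Binary classification: 98.9% accuracy and 99.4% AUC; 2. Five-class classification: 84.6% accuracy and 94.1% AUC, outperforming several existing approaches.

Insight: Combining class-balanced augmentation with transfer learning substantially improves DR classification and provides a scalable, efficient path toward clinical deployment.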

Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, and early diagnosis through automated retinal image analysis can significantly reduce the risk of blindness. This paper presents a robust deep learning framework for both binary and five-class DR classification, leveraging transfer learning and extensive data augmentation to address the challenges of class imbalance and limited training data. We evaluate a range of pretrained convolutional neural network architectures, including variants of ResNet and EfficientNet, on the APTOS 2019 dataset. For binary classification, our proposed model achieves a state-of-the-art accuracy of 98.9%, with a precision of 98.6%, recall of 99.3%, F1-score of 98.9%, and an AUC of 99.4%. In the more challenging five-class severity classification task, our model obtains a competitive accuracy of 84.6% and an AUC of 94.1%, outperforming several existing approaches. Our findings also demonstrate that EfficientNet-B0 and ResNet34 offer optimal trade-offs between accuracy and computational efficiency across both tasks. These results underscore the effectiveness of combining class-balanced augmentation with transfer learning for high-performance DR diagnosis. The proposed framework provides a scalable and accurate solution for DR screening, with potential for deployment in real-world clinical environments.
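The binary-classification figures in the abstract can be cross-checked against each other: F1 is the harmonic mean of precision and recall, so 98.6% precision and 99.3% recall imply an F1 of about 98.9%, which agrees with the reported value. The sketch below computes these metrics; the function names and the toy confusion counts are illustrative and do not come from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check on the values reported in the abstract.
reported_f1 = f1_score(0.986, 0.993)
```

Rounded to three decimals, `reported_f1` is 0.989, so the reported precision, recall, and F1 are mutually consistent.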

[33] ScSAM: Debiasing Morphology and Distributional Variability in Subcellular Semantic Segmentation

Bo Fang,Jianan Fan,Dongnan Liu,Hang Chang,Gerald J. Shami,Filip Braet,Weidong Cai

Main category: cs.CV

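TL;DR: ScSAM combines a pre-trained SAM with an MAE and uses a feature alignment and fusion module to make subcellular segmentation more robust, addressing the training bias caused by data imbalance and morphological variability.

Details Motivation: The marked variability of subcellular components in morphology and distribution makes learning-based segmentation models prone to biased feature learning. Existing methods usually overlook feature diversity, and although SAM provides rich feature representations, in subcellular settings it faces gaps in the label space and tends to ignore fine-grained details.

Contribution: 1. Proposes ScSAM, which combines the prior knowledge of a pre-trained SAM and an MAE to alleviate training bias caused by data imbalance. 2. Designs a feature alignment and fusion module that unifies the feature spaces of the different representations. 3. Proposes a class prompt encoder based on a cosine similarity matrix to activate class-specific features.

Method: 1. Fuses features from the pre-trained SAM and MAE to improve robustness. 2. Aligns and fuses features into a shared feature space. 3. Uses a cosine similarity-based class prompt encoder to activate class-specific features.

Result: Experiments on several subcellular image datasets show that ScSAM outperforms existing methods.

Insight: Combining global contextual understanding with fine-grained spatial detail can markedly improve the accuracy and robustness of subcellular segmentation, especially under imbalanced data distributions and variable morphology.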

Abstract: The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre-trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder to activate class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.

[34] VBCD: A Voxel-Based Framework for Personalized Dental Crown Design

Linda Wei,Chang Liu,Wenran Zhang,Zengji Zhang,Shaoting Zhang,Hongsheng Li

Main category: cs.CV

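TL;DR: VBCD is a voxel-based framework for automated dental crown design. It improves accuracy through a coarse-to-fine pipeline with distance-aware supervision, uses a curvature and margin line penalty loss to improve margin alignment, and adds a tooth-number prompt for further gains.

Details Motivation: Conventional dental crown design is manual, time-consuming, and labor-intensive. VBCD aims to reduce the workload of dental technicians with an automated framework.

Contribution: 1) Proposes VBCD, a voxel-based dental crown design framework; 2) introduces the Curvature and Margin line Penalty Loss (CMPL) to improve margin alignment; 3) adds a tooth-number prompt to improve design accuracy.

Method: 1) Generates a coarse crown from voxelized intraoral scans; 2) refines the design with a distance-aware refinement module; 3) trains with the CMPL loss; 4) uses the FDI tooth number as a positional prompt.

Result: On a large-scale intraoral scan dataset, VBCD outperforms existing methods and produces personalized crown designs efficiently and with high quality.

Insight: Combining automation with domain knowledge (such as tooth numbering) can improve both the accuracy and the efficiency of dental crown design.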

Abstract: The design of restorative dental crowns from intraoral scans is labor-intensive for dental technicians. To address this challenge, we propose a novel voxel-based framework for automated dental crown design (VBCD). The VBCD framework generates an initial coarse dental crown from voxelized intraoral scans, followed by a fine-grained refiner incorporating distance-aware supervision to improve accuracy and quality. During the training stage, we employ the Curvature and Margin line Penalty Loss (CMPL) to enhance the alignment of the generated crown with the margin line. Additionally, a positional prompt based on the FDI tooth numbering system is introduced to further improve the accuracy of the generated dental crowns. Evaluation on a large-scale dataset of intraoral scans demonstrated that our approach outperforms existing methods, providing a robust solution for personalized dental crown design.

[35] PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models

Jiansong Wan,Chengming Zhou,Jinkua Liu,Xiangge Huang,Xiaoyu Chen,Xiaohan Yi,Qisen Yang,Baiting Zhu,Xin-Qiang Cai,Lixing Liu,Rushuai Yang,Chuheng Zhang,Sherif Abdelfattah,Hayong Shin,Pushi Zhang,Li Zhao,Jiang Bian

Main category: cs.CV

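TL;DR: PIG-Nav is a pretrained-model approach to visual navigation. It improves performance with an early-fusion network structure and auxiliary tasks, augments training data with gameplay videos, and achieves substantial gains in both zero-shot and fine-tuning settings.

Details Motivation: The work aims to improve the generality and transferability of visual navigation models, especially zero-shot performance in unseen environments.

Contribution: 1. Proposes an early-fusion network structure and auxiliary tasks for pretrained navigation models; 2. designs an efficient data preprocessing pipeline that uses gameplay video data to augment the training set.

Method: 1. Uses a pretrained Vision Transformer (ViT) encoder with an early-fusion network; 2. introduces auxiliary tasks to strengthen global navigation representation learning; 3. augments the dataset with gameplay videos.

Result: The model improves on existing methods by an average of 22.6% in zero-shot settings and 37.5% in fine-tuning settings, and performs well in a real-world environment.

Insight: Pretraining strategy and dataset diversity are critical to navigation model performance, and the model stays competitive with little fine-tuning data.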

Abstract: Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure to combine visual observations and goal images via appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state-of-the-art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.

[36] MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training

Lei Zhu,Jun Zhou,Rick Siow Mong Goh,Yong Liu

Main category: cs.CV

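TL;DR: The paper proposes MaskedCLIP, a framework that combines masked image modeling with contrastive language-image pre-training for semi-supervised vision-language pre-training, so that both paired and unpaired image data are used to learn more generalizable medical image features.

Details Motivation: In medical image analysis, foundation models are usually trained on either paired image-text data or unpaired image data alone, which limits their ability to capture richer and more comprehensive image features. The paper aims to combine both kinds of data through semi-supervised learning.

Contribution: 1) Proposes the task of semi-supervised vision-language pre-training; 2) designs the MaskedCLIP framework, which bridges the masked feature space and the CLIP feature space so the two data types complement each other; 3) proposes a masked knowledge distillation loss to further strengthen semantic feature learning.

Method: A bridge transformer connects the masked feature space with the CLIP feature space, and a masked knowledge distillation loss lets CLIP's semantic features assist masked feature learning, and vice versa.

Result: Experiments on retinal image analysis tasks show that MaskedCLIP uses data more efficiently and improves downstream task performance.

Insight: Bridging different feature spaces and adding a distillation loss makes it possible to exploit the complementarity of paired and unpaired data and to learn more generalizable, semantically rich image features.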

Abstract: Foundation models have recently gained tremendous popularity in medical image analysis. State-of-the-art methods leverage either paired image-text data via vision-language pre-training or unpaired image data via self-supervised pre-training to learn foundation models with generalizable image features to boost downstream task performance. However, learning foundation models exclusively on either paired or unpaired image data limits their ability to learn richer and more comprehensive image features. In this paper, we investigate a novel task termed semi-supervised vision-language pre-training, aiming to fully harness the potential of both paired and unpaired image data for foundation model learning. To this end, we propose MaskedCLIP, a synergistic masked image modeling and contrastive language-image pre-training framework for semi-supervised vision-language pre-training. The key challenge in combining paired and unpaired image data for learning a foundation model lies in the incompatible feature spaces derived from these two types of data. To address this issue, we propose to connect the masked feature space with the CLIP feature space with a bridge transformer. In this way, the more semantic specific CLIP features can benefit from the more general masked features for semantic feature extraction. We further propose a masked knowledge distillation loss to distill semantic knowledge of original image features in CLIP feature space back to the predicted masked image features in masked feature space. With this mutually interactive design, our framework effectively leverages both paired and unpaired image data to learn more generalizable image features for downstream tasks. Extensive experiments on retinal image analysis demonstrate the effectiveness and data efficiency of our method.

[37] Perceptual Classifiers: Detecting Generative Images using Perceptual Features

Krishna Srikar Durbha,Asvin Kumar Venkataramanan,Rajesh Sureddi,Alan C. Bovik

Main category: cs.CV

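TL;DR: The paper proposes perceptual classifiers built on the feature space of Image Quality Assessment (IQA) models to distinguish real images from AI-generated ones. A small network trained on these features shows strong generalization and robustness.

Details Motivation: With the rapid development of generative AI, large amounts of AI-generated content are appearing on the internet, and an effective detection method is needed. Existing IQA models capture the statistical properties of real images, so they can be used to separate real from generated images.

Contribution: The main contribution is a lightweight two-layer network, built on the feature space of IQA models, that detects fake images from different generative models with strong performance while remaining robust to image degradations.

Method: Image features are extracted with IQA models, and a two-layer network is trained on these features to classify images as real or generated. Experiments evaluate generalization and robustness to image degradations.

Result: Experiments show state-of-the-art performance in detecting fake images across generative models, with stable behavior under image degradations.

Insight: The feature space of IQA models can separate real from generated images, which suggests a lightweight and efficient route to fake image detection.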

Abstract: Image Quality Assessment (IQA) models are employed in many practical image and video processing pipelines to reduce storage, minimize transmission costs, and improve the Quality of Experience (QoE) of millions of viewers. These models are sensitive to a diverse range of image distortions and can accurately predict image quality as judged by human viewers. Recent advancements in generative models have resulted in a significant influx of “GenAI” content on the internet. Existing methods for detecting GenAI content have progressed significantly with improved generalization performance on images from unseen generative models. Here, we leverage the capabilities of existing IQA models, which effectively capture the manifold of real images within a bandpass statistical space, to distinguish between real and AI-generated images. We investigate the generalization ability of these perceptual classifiers to the task of GenAI image detection and evaluate their robustness against various image degradations. Our results show that a two-layer network trained on the feature space of IQA models demonstrates state-of-the-art performance in detecting fake images across generative models, while maintaining significant robustness against image degradations.

[38] TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

Guangzhu Xu,Zhi Ke,Pengcheng Zuo,Bangjun Lei

Main category: cs.CV

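TL;DR: The paper proposes TransLPRNet, a lightweight vision-language network for single-line and double-line Chinese license plate recognition, and improves recognition accuracy and practicality with a pre-training framework and a perspective correction network.

Details Motivation: Existing CNN and CRNN methods struggle with the diversity of license plate types and imaging conditions, and double-line license plate datasets are scarce, so a unified and efficient solution is needed.

Contribution: 1. Proposes TransLPRNet, a lightweight vision-language network; 2. builds a synthetic dataset to address the scarcity of double-line license plate data; 3. introduces a perspective correction network (PTN) to improve stability and accuracy.

Method: 1. Combines a lightweight visual encoder with a text decoder; 2. builds the dataset by synthesizing images, applying texture mapping onto real scenes, and blending with authentic license plate images; 3. supervises the PTN with license plate view classification while regressing corner coordinates.

Result: On the CCPD test set, accuracy is 99.34% under coarse localization disturbance and rises to 99.58% under fine localization disturbance. On the double-line license plate test set, accuracy is 98.70%, with speeds of up to 167 FPS.

Insight: Synthetic data and a perspective correction network can address data scarcity and imaging diversity, and the lightweight design suits practical deployment.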

Abstract: License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system’s recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.

[39] Unsupervised Exposure Correction

Ruodai Cui,Li Niu,Guosheng Hu

Main category: cs.CV

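TL;DR: The paper proposes Unsupervised Exposure Correction (UEC), a method that needs no manual annotation. It generates paired data with an emulated ISP pipeline, improves generalization, and performs well on low-level vision tasks.

Details Motivation: Existing exposure correction methods require large amounts of manually annotated paired data, generalize poorly, and degrade performance on low-level vision tasks. The paper therefore proposes an annotation-free solution.

Contribution: 1. Proposes UEC, an unsupervised exposure correction method; 2. creates a large-scale Radiometry Correction Dataset; 3. designs a transformation function that needs very few parameters and outperforms supervised methods; 4. verifies the importance of exposure correction for downstream tasks such as edge detection.

Method: The model is trained on paired data generated by an emulated ISP pipeline, which avoids manual annotation. A transformation function that preserves image details is proposed.

Result: UEC outperforms supervised methods on exposure correction while using only 0.01% of their parameters. It also performs well on downstream tasks such as edge detection.

Insight: Unsupervised learning can remove the data annotation problem in exposure correction and improve generalization. The performance of low-level vision tasks is closely tied to exposure quality.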

Abstract: Current exposure correction methods face three challenges: labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods [12], while utilizing only 0.01% of their parameters. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset are publicly available at https://github.com/BeyondHeaven/uec_code.

[40] VisionTrap: Unanswerable Questions On Visual Data

Asir Saadat,Syem Aziz,Shahriar Mahmud,Abdullah Ibne Masud Mahi,Sabbir Ahmed

Main category: cs.CV

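TL;DR: The VisionTrap dataset evaluates whether VQA models recognize the limits of their knowledge when faced with unanswerable questions, rather than generating incorrect answers.

Details Motivation: Current VQA research focuses mainly on answerable questions. There is little evaluation of how models behave on unanswerable questions, and in particular of whether they know when to abstain.

Contribution: Proposes the VisionTrap dataset, which contains three categories of unanswerable questions and tests whether models correctly recognize the limits of their knowledge.

Method: Builds three categories of unrealistic or fictional images with corresponding questions, and evaluates model behavior on questions that are logically structured but unanswerable.

Result: The study indicates that VQA models tend to give an answer rather than acknowledge their limitations, which highlights the importance of including unanswerable questions in evaluation.

Insight: Future VQA benchmarks should include unanswerable questions to evaluate model robustness and awareness of knowledge boundaries more fully.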

Abstract: Visual Question Answering (VQA) has been a widely studied topic, with extensive research focusing on how VLMs respond to answerable questions based on real-world images. However, there has been limited exploration of how these models handle unanswerable questions, particularly in cases where they should abstain from providing a response. This research investigates VQA performance on unrealistically generated images and on unanswerable questions, assessing whether models recognize the limitations of their knowledge or attempt to generate incorrect answers. We introduce a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types: (1) hybrid entities that fuse objects and animals, (2) objects depicted in unconventional or impossible scenarios, and (3) fictional or non-existent figures. The questions posed are logically structured yet inherently unanswerable, testing whether models can correctly recognize their limitations. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models tend to answer, even when they should abstain.

[41] URPO: A Unified Reward & Policy Optimization Framework for Large Language Models

Songshuo Lu,Hua Wang,Zhi Chen,Yaohua Tang

Main category: cs.CV

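TL;DR: URPO is a unified reward and policy optimization framework that combines instruction following and reward modeling in a single model, improving performance while simplifying the training pipeline.

Details Motivation: Conventional alignment pipelines need a separate reward model, which is complex and resource-intensive, and their performance is capped by a static reward signal. URPO aims to address these problems with a unified framework.

Contribution: URPO unifies instruction following and reward modeling in a single model, improves performance through a single GRPO optimization loop, and generates its own internal rewards, removing the need for a separate reward model.

Method: URPO recasts alignment data (such as preference pairs and open-ended instructions) into a unified generative format and optimizes the model with a single GRPO loop, so that reward and policy are learned together.

Result: On Qwen2.5-7B, URPO raises the instruction-following score from 42.24 to 44.84 and the reasoning average from 32.66 to 35.66, and reaches a RewardBench score of 85.15.

Insight: Unifying reward and policy optimization simplifies the pipeline and, through a co-evolutionary dynamic between generation and evaluation, improves model performance, offering a more efficient path to language model alignment.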

Abstract: Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) within a single model and a single training phase. Our method recasts all alignment data-including preference pairs, verifiable reasoning, and open-ended instructions-into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO’s superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
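
The group-relative idea behind GRPO can be sketched as follows. This is a minimal illustration of the commonly described normalization, in which each sampled response's reward is standardized against the other responses for the same prompt; it is not URPO's training code. The function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    # Population standard deviation is an assumption; eps avoids division by zero
    # when every response in the group receives the same reward.
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean get positive advantages and the rest get negative ones, so no separate value model is needed to provide a baseline.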

[42] Dual-branch Prompting for Multimodal Machine Translation

Jie Wang,Zhendong Yang,Liansong Zong,Xiaobo Zhang,Dexian Wang,Ji Zhang

Main category: cs.CV

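TL;DR: The paper proposes D2P-MMT, a framework that uses a diffusion model to generate reconstructed images and combines them with a dual-branch prompting strategy to improve the robustness and performance of multimodal machine translation.

Details Motivation: Current multimodal machine translation methods depend on paired image-text inputs and are sensitive to irrelevant visual noise, which limits their practical use.

Contribution: Proposes the D2P-MMT framework, which filters noise with diffusion-reconstructed images and introduces a dual-branch prompting strategy and a distributional alignment loss.

Method: A pre-trained diffusion model generates reconstructed images, a dual-branch prompting strategy encourages cross-modal interaction, and a distributional alignment loss bridges the modality gap.

Result: On the Multi30K dataset, D2P-MMT outperforms existing methods.

Insight: Diffusion models can filter out visual noise, and the dual-branch strategy strengthens cross-modal alignment and model robustness.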

Abstract: Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
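
One way to enforce consistency between two branches' output distributions is a symmetric KL divergence, sketched below. The abstract does not state the exact form of the distributional alignment loss, so the choice of symmetric KL, the function names, and the `eps` smoothing are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(logits_authentic, logits_reconstructed):
    """Symmetric KL between the two branches' output distributions (assumed form)."""
    p = softmax(logits_authentic)
    q = softmax(logits_reconstructed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical logits over a three-token vocabulary from each branch.
same = alignment_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = alignment_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The loss is zero when both branches predict the same distribution and grows as they diverge, which is the behavior a consistency term needs.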

[43] CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance

Peiqi Chen,Lei Yu,Yi Wan,Yingying Pei,Xinyi Liu,Yongxiang Yao,Yingying Zhang,Lixiang Ru,Liheng Zhong,Jingdong Chen,Ming Yang,Yongjun Zhang

Main category: cs.CV

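TL;DR: The paper proposes CasP, a semi-dense feature matching pipeline guided by cascaded correspondence priors, which improves both matching accuracy and efficiency.

Details Motivation: Existing semi-dense feature matching methods rely on a global search, which limits further gains in accuracy and efficiency.

Contribution: Proposes the CasP pipeline, which uses cascaded correspondence priors and a region-based selective cross-attention mechanism to improve matching efficiency and accuracy.

Method: The matching stage is split into two phases: the first identifies one-to-many prior areas, and the second searches for one-to-one matches within those areas. High-level features are incorporated to reduce computation.

Result: At a resolution of 1152, the CasP lite model is about 2.2 times faster than ELoFTR, and it performs well in geometric estimation and cross-domain generalization.

Insight: Cascaded priors and a staged search strategy improve matching efficiency, which suits latency-sensitive, high-robustness applications such as SLAM and UAV systems.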

Abstract: Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.

[44] CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits

Chao He,Jianqiang Ren,Jianjing Xiang,Xiejie Shen

Main category: cs.CV

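TL;DR: The paper proposes CartoonAlive, a method that generates a high-quality Live2D digital human model from a single portrait image, addressing the lack of interactive 2D cartoon-style digital humans.

Details Motivation: As digital human technology develops, 3D models are complex to build and 2D video-based solutions lack flexibility, while 2D cartoon-style Live2D models offer an efficient and expressive alternative.

Contribution: The main contribution is a method that quickly generates a Live2D model from a single portrait image, borrowing the shape basis concept from 3D face modeling to produce high-quality results suitable for real-time interaction.

Method: The shape basis concept from 3D face modeling is used to build facial blendshapes suitable for Live2D, and the corresponding blendshape weights are inferred from facial keypoints detected in the input image.

Result: A Live2D model that closely resembles the input portrait is generated in less than half a minute, with high expressiveness and visual accuracy.

Insight: Live2D simulates 3D-like motion through layered segmentation, avoiding complex modeling and high rendering costs, and provides a scalable solution for interactive 2D cartoon characters.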

Abstract: With the rapid advancement of large foundation models, AIGC, cloud rendering, and real-time motion capture technologies, digital humans are now capable of achieving synchronized facial expressions and body movements, engaging in intelligent dialogues driven by natural language, and enabling the fast creation of personalized avatars. While current mainstream approaches to digital humans primarily focus on 3D models and 2D video-based representations, interactive 2D cartoon-style digital humans have received relatively less attention. Compared to 3D digital humans that require complex modeling and high rendering costs, and 2D video-based solutions that lack flexibility and real-time interactivity, 2D cartoon-style Live2D models offer a more efficient and expressive alternative. By simulating 3D-like motion through layered segmentation without the need for traditional 3D modeling, Live2D enables dynamic and real-time manipulation. In this technical report, we present CartoonAlive, an innovative method for generating high-quality Live2D digital humans from a single input portrait image. CartoonAlive leverages the shape basis concept commonly used in 3D face modeling to construct facial blendshapes suitable for Live2D. It then infers the corresponding blendshape weights based on facial keypoints detected from the input image. This approach allows for the rapid generation of a highly expressive and visually accurate Live2D model that closely resembles the input portrait, within less than half a minute. Our work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation. The project homepage is https://human3daigc.github.io/CartoonAlive_webpage/.

[45] Temporal Point-Supervised Signal Reconstruction: A Human-Annotation-Free Framework for Weak Moving Target Detection

Weihua Gao,Chunxu Ren,Wenlong Niu,Xiaodong Peng

Main category: cs.CV

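TL;DR: The paper proposes a Temporal Point-Supervised (TPS) framework for weak moving target detection that needs no manual annotation. By reconstructing transient signals with a dynamic multi-scale attention module, it performs well on a low-SNR dataset and runs in real time.

Details Motivation: In low-altitude surveillance and early warning systems, weak moving target detection is difficult because of low signal energy, small spatial extent, and complex backgrounds. Existing methods are limited by the lack of reliable annotations and robust feature extraction.

Contribution: 1) Proposes the TPS framework, which needs no manual annotation; 2) develops TSRNet, which integrates a dynamic multi-scale attention module; 3) introduces a graph-based trajectory mining strategy to improve temporal consistency.

Method: The task is reformulated as a pixel-wise temporal signal modeling problem. TSRNet (an encoder-decoder network) and the DMSAttention module are designed for it, and trajectory mining reduces false alarms.

Result: The framework outperforms existing methods on a low-SNR dataset, with strong detection performance and speeds above 1000 FPS.

Insight: Replacing conventional frame-based detection with temporal signal modeling removes the annotation dependency of weak target detection and is efficient enough for real-time use.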

Abstract: In low-altitude surveillance and early warning systems, detecting weak moving targets remains a significant challenge due to low signal energy, small spatial extent, and complex background clutter. Existing methods struggle with extracting robust features and suffer from the lack of reliable annotations. To address these limitations, we propose a novel Temporal Point-Supervised (TPS) framework that enables high-performance detection of weak targets without any manual annotations. Instead of conventional frame-based detection, our framework reformulates the task as a pixel-wise temporal signal modeling problem, where weak targets manifest as short-duration pulse-like responses. A Temporal Signal Reconstruction Network (TSRNet) is developed under the TPS paradigm to reconstruct these transient signals. TSRNet adopts an encoder-decoder architecture and integrates a Dynamic Multi-Scale Attention (DMSAttention) module to enhance its sensitivity to diverse temporal patterns. Additionally, a graph-based trajectory mining strategy is employed to suppress false alarms and ensure temporal consistency. Extensive experiments on a purpose-built low-SNR dataset demonstrate that our framework outperforms state-of-the-art methods while requiring no human annotations. It achieves strong detection performance and operates at over 1000 FPS, underscoring its potential for real-time deployment in practical scenarios.

[46] Principled Multimodal Representation Learning

Xiaohao Liu,Xiaobo Xia,See-Kiong Ng,Tat-Seng Chua

Main category: cs.CV

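TL;DR: The paper proposes Principled Multimodal Representation Learning (PMRL), a framework for anchor-free multimodal alignment that addresses the fixed-anchor limitation of conventional contrastive learning and the instability of optimization.

Details Motivation: Conventional multimodal representation learning depends on a predefined anchor modality, which prevents full alignment across all modalities, and optimization can be unstable.

Contribution: Proposes the PMRL framework, which aligns multiple modalities stably without an anchor by optimizing the singular values of the representation matrix, together with a softmax-based loss function and a contrastive regularization method.

Method: Grounded in the rank of the Gram matrix, PMRL optimizes the dominant singular value to align modalities, and uses a softmax-based loss with instance-wise contrastive regularization to prevent representation collapse.

Result: In experiments across multiple tasks, PMRL outperforms baseline methods and learns better multimodal representations.

Insight: Full alignment of modalities corresponds to a rank-1 Gram matrix, and PMRL offers a stable, anchor-free way to reach it by optimizing singular values.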

Abstract: Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. The source code will be publicly available.
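
The softmax-based loss that treats singular values as logits can be sketched as follows. This is an illustration of the idea stated in the abstract, not the authors' implementation: the singular values are taken as given inputs (a real pipeline would compute them from the stacked modality representations), and the function name and `temperature` parameter are assumptions.

```python
import math


def leading_singular_value_loss(singular_values, temperature=1.0):
    """Negative log-softmax of the largest singular value, with singular values as logits."""
    logits = [s / temperature for s in singular_values]
    peak = max(logits)
    # log-sum-exp computed stably by subtracting the largest logit first.
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - peak


# Hypothetical spectra: nearly rank-1 (well aligned) versus spread out (poorly aligned).
aligned = leading_singular_value_loss([3.0, 0.01, 0.01])
spread = leading_singular_value_loss([1.2, 1.0, 0.9])
```

Minimizing this loss pushes the largest singular value to dominate the others, which moves the representation matrix toward the rank-1 structure that the abstract associates with full alignment.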

[47] Exploring Active Learning for Label-Efficient Training of Semantic Neural Radiance Field

Yuzhe Zhu,Lile Cai,Kangkang Lu,Fayao Liu,Xulei Yang

Main category: cs.CV

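TL;DR: The study explores how active learning can reduce the annotation cost of training semantically-aware Neural Radiance Fields (NeRF). It proposes a sample selection strategy that incorporates 3D geometric constraints, and experiments show a reduction in annotation cost of more than 2X.

Details Motivation: Semantically-aware NeRF requires large amounts of pixel-level annotation, which is expensive. To address this, the authors explore active learning as a way to reduce the amount of annotation.

Contribution: 1. Studies design choices for active learning in semantically-aware NeRF (such as selection granularity and selection strategy); 2. proposes an active learning strategy that incorporates 3D geometric constraints.

Method: 1. Analyzes the effect of different selection granularities and strategies; 2. designs a sample selection strategy based on 3D geometric constraints to improve annotation efficiency.

Result: Experiments show that active learning reduces annotation cost by more than 2X compared to random sampling.

Insight: An active learning strategy that uses 3D geometric information selects the samples most valuable for training more efficiently, which reduces the annotation burden.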

Abstract: Neural Radiance Field (NeRF) models are implicit neural scene representation methods that offer unprecedented capabilities in novel view synthesis. Semantically-aware NeRFs not only capture the shape and radiance of a scene, but also encode semantic information of the scene. The training of semantically-aware NeRFs typically requires pixel-level class labels, which can be prohibitively expensive to collect. In this work, we explore active learning as a potential solution to alleviate the annotation burden. We investigate various design choices for active learning of semantically-aware NeRF, including selection granularity and selection strategies. We further propose a novel active learning strategy that takes into account 3D geometric constraints in sample selection. Our experiments demonstrate that active learning can effectively reduce the annotation cost of training semantically-aware NeRF, achieving more than 2X reduction in annotation cost compared to random sampling.

[48] Exploring Spatial Diversity for Region-based Active Learning

Lile Cai,Xun Xu,Lining Zhang,Chuan-Sheng Foo

Main category: cs.CV

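TL;DR: The paper proposes a region-based active learning method that enforces spatial diversity. By combining local spatial diversity with conventional uncertainty criteria, it reduces the annotation cost of semantic segmentation while maintaining high performance.

Details Motivation: Semantic segmentation needs large amounts of pixel-level annotation, which is costly. Region-based methods reduce the amount of annotation, but existing methods usually ignore the effect of local spatial diversity on model performance. The authors therefore introduce spatial diversity into active learning.

Contribution: 1. Shows the importance of local spatial diversity for region-based active learning; 2. designs a unified optimization framework that combines spatial diversity with conventional selection criteria (such as uncertainty and feature diversity); 3. demonstrates the method on the Cityscapes and PASCAL VOC datasets.

Method: A joint optimization framework considers sample uncertainty and local spatial diversity together to select informative regions for labeling, including a trade-off between diversity and uncertainty and an efficient region selection algorithm.

Result: Experiments show that labeling only 5-9% of the pixels reaches 95% of the performance of fully supervised methods, outperforming existing region-based active learning methods.

Insight: Local spatial diversity is important in region-based active learning, and combining it with conventional criteria further improves efficiency. The idea can extend to other tasks that need dense annotation.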

Abstract: State-of-the-art methods for semantic segmentation are based on deep neural networks trained on large-scale labeled datasets. Acquiring such datasets would incur large annotation costs, especially for dense pixel-level prediction tasks like semantic segmentation. We consider region-based active learning as a strategy to reduce annotation costs while maintaining high performance. In this setting, batches of informative image regions instead of entire images are selected for labeling. Importantly, we propose that enforcing local spatial diversity is beneficial for active learning in this case, and we incorporate spatial diversity along with the traditional active selection criterion, e.g., data sample uncertainty, in a unified optimization framework for region-based active learning. We apply this framework to the Cityscapes and PASCAL VOC datasets and demonstrate that the inclusion of spatial diversity effectively improves the performance of uncertainty-based and feature diversity-based active learning methods. Our framework achieves 95% performance of fully supervised methods with only 5-9% of the labeled pixels, outperforming all state-of-the-art region-based active learning methods for semantic segmentation.
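
The interplay between uncertainty and local spatial diversity can be illustrated with a simple greedy selector. The paper describes a unified optimization framework; the sketch below swaps in a much simpler greedy rule (pick the most uncertain region, then skip candidates that are too close to regions already chosen), so it shows the intuition only. The function name, the Chebyshev distance rule, and the toy scores are assumptions.

```python
def select_regions(uncertainty, budget, min_distance=2):
    """Greedily pick up to `budget` regions by uncertainty, keeping them spatially spread.

    `uncertainty` is a 2D list of per-region scores for one image. A candidate is
    skipped if it lies within `min_distance` (Chebyshev distance) of a chosen region.
    """
    candidates = sorted(
        ((score, r, c) for r, row in enumerate(uncertainty) for c, score in enumerate(row)),
        key=lambda item: -item[0],
    )
    selected = []
    for _score, r, c in candidates:
        if len(selected) == budget:
            break
        if all(max(abs(r - sr), abs(c - sc)) >= min_distance for sr, sc in selected):
            selected.append((r, c))
    return selected


# Hypothetical 3x3 grid of region uncertainties; the top-left corner is most uncertain.
grid = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.2],
    [0.1, 0.3, 0.5],
]
picked = select_regions(grid, budget=2)
```

A pure uncertainty ranking would pick two adjacent regions in the top-left corner, whereas the distance constraint sends the second pick to the opposite corner, which is the kind of redundancy the spatial diversity term is meant to avoid.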

[49] A Conditional Probability Framework for Compositional Zero-shot Learning

Peng Wu,Qiuxia Lai,Hao Fang,Guo-Sen Xie,Yilong Yin,Xiankai Lu,Wenguan Wang

Main category: cs.CV

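TL;DR: The paper proposes a Conditional Probability Framework (CPF) that explicitly models the dependency between attributes and objects, addressing semantic constraints and contextual dependencies in Compositional Zero-Shot Learning (CZSL).

Details Motivation: Traditional methods usually treat attributes and objects as independent entities and overlook the semantic constraints and contextual dependencies between them. The paper therefore models this dependency explicitly with a conditional probability framework.

Contribution: 1. Proposes the Conditional Probability Framework (CPF), which explicitly models attribute-object dependencies;
2. uses textual descriptors to enhance object feature learning;
3. achieves contextual alignment through a cross-attention mechanism.

Method: 1. Decomposes the probability of a composition into an object likelihood and a conditional attribute likelihood;
2. enhances object features with textual descriptors;
3. guides attribute learning through a cross-attention mechanism.

Result: The method achieves superior performance on multiple CZSL benchmarks, supporting its effectiveness.

Insight: Explicitly modeling the dependency between attributes and objects is important for compositional zero-shot learning, and a conditional probability framework is an effective way to do it.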

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies inside a composition. For example, certain attributes naturally pair with specific objects (e.g., “striped” applies to “zebra” or “shirts” but not “sky” or “water”), while the same attribute can manifest differently depending on context (e.g., “young” in “young tree” vs. “young dog”). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. Code is available here.
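
The decomposition described in the abstract is the chain rule p(a, o | x) = p(o | x) · p(a | o, x). The sketch below scores attribute-object compositions this way from two hand-written probability tables; the table values and the function name are hypothetical and stand in for what a trained model would predict for an image x.

```python
def composition_scores(p_object, p_attr_given_object):
    """Score each (attribute, object) pair as p(o | x) * p(a | o, x)."""
    scores = {}
    for obj, p_o in p_object.items():
        for attr, p_a in p_attr_given_object[obj].items():
            scores[(attr, obj)] = p_o * p_a
    return scores


# Hypothetical predictions for one image.
p_object = {"zebra": 0.7, "sky": 0.3}
p_attr_given_object = {
    "zebra": {"striped": 0.9, "blue": 0.1},
    "sky": {"striped": 0.05, "blue": 0.95},
}
scores = composition_scores(p_object, p_attr_given_object)
best = max(scores, key=scores.get)
```

Because the attribute distribution is conditioned on the object, an implausible pair such as ("striped", "sky") gets a low score even though "striped" is a likely attribute overall, which an independence assumption could not express.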

[50] EndoGen: Conditional Autoregressive Endoscopic Video Generation

Xinyu Liu,Hengyu Liu,Cheng Wang,Tianming Liu,Yixuan Yuan

Main category: cs.CV

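TL;DR: EndoGen is a conditional autoregressive framework for endoscopic video generation. Using a Spatiotemporal Grid-Frame Patterning (SGP) strategy and a Semantic-Aware Token Masking (SAT) mechanism, it generates high-quality, conditionally guided endoscopic content.

Details Motivation: Existing methods are limited to static images or to unconditional video generation. They lack dynamic context and clinically meaningful references, so they fall short of practical needs.

Contribution: 1. Proposes EndoGen, the first conditional endoscopic video generation framework; 2. Designs the SGP strategy and the SAT mechanism to improve generation quality and semantic diversity.

Method: 1. An autoregressive model with the SGP strategy recasts multi-frame generation as a grid-image generation task; 2. The SAT mechanism selectively focuses on semantically meaningful regions to increase the diversity of the generated content.

Result: Experiments show that EndoGen generates high-quality conditional videos and improves performance on the downstream task of polyp segmentation.

Insight: Combining conditional generation with an autoregressive architecture works well for endoscopic video tasks and offers a new direction for medical imaging.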

Abstract: Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the learning of generating multiple frames as a grid-based image generation pattern, which effectively capitalizes on the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model’s ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and show that it improves the performance of the downstream task of polyp segmentation. Code released at https://www.github.com/CUHK-AIM-Group/EndoGen.
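
The idea of recasting multi-frame generation as grid-image generation can be sketched as a pair of reshaping helpers. The row-major tiling order and the function names are assumptions; the abstract does not specify the layout EndoGen actually uses.

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Tile T = rows * cols frames of shape (H, W, C) into one grid image.

    Frames are placed in row-major order (an assumed layout).
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError("number of frames must equal rows * cols")
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)

def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: split a grid image back into frames."""
    big_h, big_w, c = grid.shape
    h, w = big_h // rows, big_w // cols
    frames = grid.reshape(rows, h, cols, w, c)
    frames = frames.transpose(0, 2, 1, 3, 4)      # (rows, cols, H, W, C)
    return frames.reshape(rows * cols, h, w, c)
```

With such a mapping, an image generator that models global dependencies across the whole grid implicitly models dependencies between frames, and the generated grid can be split back into a clip.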

[51] HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs

Zhaolin Cai,Fan Li,Ziwei Zheng,Yanjun Qin

Main category: cs.CV

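TL;DR: HiProbe-VAD is a video anomaly detection framework that uses the intermediate hidden states of pre-trained Multimodal Large Language Models (MLLMs) to detect anomalies without fine-tuning, outperforming existing training-free methods and most traditional ones.

Details Motivation: Traditional video anomaly detection methods are computationally expensive and depend on large labeled datasets, which limits their practical use. HiProbe-VAD aims to exploit pre-trained MLLMs to address these problems without fine-tuning.

Contribution: 1. Finds that the intermediate hidden states of MLLMs are more sensitive to anomalies and more linearly separable than the output layer. 2. Proposes a Dynamic Layer Saliency Probing (DLSP) mechanism that extracts hidden states from the optimal intermediate layer. 3. Designs a lightweight anomaly scorer and a temporal localization module.

Method: DLSP identifies and extracts the most informative hidden states from the intermediate layers of the MLLM; the anomaly scorer and localization module then detect anomalies efficiently and generate explanations.

Result: On the UCF-Crime and XD-Violence datasets, the method outperforms training-free approaches and most traditional approaches, and it generalizes across different MLLMs.

Insight: The intermediate hidden states of pre-trained MLLMs are information-rich representations that support efficient anomaly detection, offering a scalable solution for practical use.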

Abstract: Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during MLLM reasoning. A lightweight anomaly scorer and temporal localization module then efficiently detect anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities across different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
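
To make "pick the layer whose hidden states separate anomalies best" concrete, the sketch below ranks layers with a Fisher-style separability ratio. This is a stand-in, not the DLSP mechanism itself, whose saliency measure is not given in the abstract; the function names and the score are assumptions.

```python
import numpy as np

def separability_score(normal, anomalous):
    """Fisher-style ratio along the mean-difference direction.

    normal, anomalous: arrays of shape (num_samples, hidden_dim) holding
    hidden states from one layer for the two classes.
    """
    direction = anomalous.mean(axis=0) - normal.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    direction = direction / norm
    proj_n = normal @ direction
    proj_a = anomalous @ direction
    between = (proj_a.mean() - proj_n.mean()) ** 2
    within = proj_a.var() + proj_n.var() + 1e-12
    return float(between / within)

def select_layer(hidden_normal, hidden_anomalous):
    """Return (index of the most separable layer, per-layer scores).

    Both arguments are lists with one (num_samples, hidden_dim) array per layer.
    """
    scores = [separability_score(n, a)
              for n, a in zip(hidden_normal, hidden_anomalous)]
    return int(np.argmax(scores)), scores
```

A light linear scorer trained only on the selected layer would keep the MLLM frozen, which matches the tuning-free setting described above.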

[52] HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning

Li Jun,Wang Jinpeng,Tan Chaolei,Lian Niu,Chen Long,Zhang Min,Wang Yaowei,Xia Shu-Tao,Chen Bin

Main category: cs.CV

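TL;DR: HLFormer is a hyperbolic learning framework that combines Lorentz and Euclidean attention blocks to strengthen hierarchical modeling in Partially Relevant Video Retrieval (PRVR), and introduces a Partial Order Preservation Loss to improve cross-modal matching.

Details Motivation: Existing methods suffer from geometric distortion in Euclidean space and cannot adequately model the hierarchical semantics of videos, which limits performance on PRVR.

Contribution: 1. HLFormer, the first hyperbolic modeling framework for PRVR; 2. Hybrid-space encoding with dynamic feature fusion; 3. A Partial Order Preservation Loss that improves cross-modal matching.

Method: 1. Combines Lorentz and Euclidean attention blocks; 2. Fuses features dynamically with the Mean-Guided Adaptive Interaction Module; 3. Enforces the "text < video" hierarchy through Lorentzian cone constraints.

Result: Experiments show that HLFormer outperforms existing methods on PRVR.

Insight: Hyperbolic space is better suited to modeling the hierarchical structure of videos, and hybrid-space encoding with dynamic fusion effectively improves partially relevant retrieval.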

Abstract: Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce “text < video” hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
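
For readers unfamiliar with the Lorentz (hyperboloid) model that the Lorentz Attention Block and cone constraints build on, here is a minimal sketch of its basic operations at curvature -1. These are standard textbook formulas, not code from HLFormer, and the function names are chosen here for illustration.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def exp_map_origin(v):
    """Lift a Euclidean (tangent) vector v onto the hyperboloid <x, x>_L = -1."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    safe = np.maximum(norm, 1e-12)            # avoid division by zero
    time = np.cosh(norm)
    space = np.sinh(norm) * v / safe
    return np.concatenate([time, space], axis=-1)

def lorentz_distance(x, y):
    """Geodesic distance d(x, y) = arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))
```

Distances grow exponentially with radius in this space, which is why tree-like hierarchies embed with less distortion than in Euclidean space.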

[53] Physics-based Human Pose Estimation from a Single Moving RGB Camera

Ayce Idil Aytekin,Chuqiao Li,Diogo Luvizon,Rishabh Dabral,Martin Oswald,Marc Habermann,Christian Theobalt

Main category: cs.CV

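TL;DR: The paper introduces the MoviCam dataset and the PhysDynPose method for human pose estimation from a dynamically moving monocular RGB camera, targeting the challenges posed by non-flat scenes and camera motion.

Details Motivation: Current monocular and physics-based human pose tracking methods produce artifacts when the ground is not flat or the camera is moving, and they lack real-world ground-truth data for evaluation.

Contribution: 1) MoviCam, the first non-synthetic dataset with ground-truth camera trajectories, scene geometry, and 3D human motion; 2) PhysDynPose, which incorporates scene geometry and physical constraints to refine pose estimates.

Method: Combines a kinematics estimator with a SLAM method to recover the human pose in the world frame, then refines the result with a scene-aware physics optimizer.

Result: Experiments show that existing methods struggle in this challenging setting, while PhysDynPose robustly estimates both human and camera poses in world coordinates.

Insight: The difficulty of moving cameras and non-flat scenes exposes the limits of existing methods; robustness requires combining scene geometry with physical constraints.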

Abstract: Most monocular and physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real world videos without ground-truth data or on synthetic datasets, which fail to model the real world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking in case of camera motion and non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. From our new benchmark, we found that even state-of-the-art methods struggle with this inherently challenging setting, i.e. a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.

[54] CAPRI-CT: Causal Analysis and Predictive Reasoning for Image Quality Optimization in Computed Tomography

Sneha George Gnanakalavathy,Hairil Abdul Razak,Robert Meertens,Jonathan E. Fieldsend,Xujiong Ye,Mohammed M. Abdelsamea

Main category: cs.CV

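TL;DR: The paper proposes CAPRI-CT, a causal-aware deep learning framework for optimizing CT image quality. It integrates image data with acquisition metadata, uses Variational Autoencoders (VAEs) to extract features and model causal relationships, and supports prediction and counterfactual inference to inform CT protocol design.

Details Motivation: Balancing image quality against radiation dose is a key challenge in CT imaging. Existing methods lack a causal analysis of the factors that affect image quality, so they offer limited support for decision optimization.

Contribution: Proposes CAPRI-CT, a novel framework that brings causal analysis to CT image quality optimization and supports both prediction and counterfactual reasoning. VAEs extract features and model causal relationships, improving predictive performance and interpretability.

Method: Integrates CT images with acquisition parameters and uses VAEs to extract features and produce causal representations. The model is trained with an ensemble learning approach to predict the Signal-to-Noise Ratio (SNR) and to perform counterfactual inference in support of parameter optimization.

Result: CAPRI-CT shows strong predictive performance and, through counterfactual reasoning, provides actionable suggestions that reduce the need for repeated physical scans.

Insight: Causal analysis can reveal the underlying relationships between CT acquisition parameters and image quality, offering a data-driven route to protocol design.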

Abstract: In computed tomography (CT), achieving high image quality while minimizing radiation exposure remains a key clinical challenge. This paper presents CAPRI-CT, a novel causal-aware deep learning framework for Causal Analysis and Predictive Reasoning for Image Quality Optimization in CT imaging. CAPRI-CT integrates image data with acquisition metadata (such as tube voltage, tube current, and contrast agent types) to model the underlying causal relationships that influence image quality. An ensemble of Variational Autoencoders (VAEs) is employed to extract meaningful features and generate causal representations from observational data, including CT images and associated imaging parameters. These input features are fused to predict the Signal-to-Noise Ratio (SNR) and support counterfactual inference, enabling what-if simulations, such as changes in contrast agents (types and concentrations) or scan parameters. CAPRI-CT is trained and validated using an ensemble learning approach, achieving strong predictive performance. By facilitating both prediction and interpretability, CAPRI-CT provides actionable insights that could help radiologists and technicians design more efficient CT protocols without repeated physical scans. The source code and dataset are publicly available at https://github.com/SnehaGeorge22/capri-ct.

[55] Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

Yehao Lu,Minghe Weng,Zekang Xiao,Rui Jiang,Wei Su,Guangcong Zheng,Ping Lu,Xi Li

Main category: cs.CV

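TL;DR: Dynamic-DINO is a Mixture of Experts (MoE)-based dynamic inference framework for real-time open-vocabulary object detection. Through fine-grained expert tuning and a pre-trained weight allocation strategy, it outperforms Grounding DINO 1.5 Edge.

Details Motivation: MoE architectures excel in Large Vision-Language Models (LVLMs), but their potential for real-time open-vocabulary object detection remains unexplored.

Contribution: 1. Proposes the Dynamic-DINO framework; 2. Designs a granularity decomposition mechanism and a pre-trained weight allocation strategy; 3. Shows how expert behavior differs between shallow and deep layers.

Method: 1. Converts the dense model into a dynamic inference framework via an MoE-Tuning strategy; 2. Decomposes the FFN of the base model into multiple smaller expert networks; 3. Uses a specific router initialization to prevent performance degradation at the start of fine-tuning.

Result: Pretrained on only 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, which was pretrained on the private Grounding20M dataset.

Insight: Experts in shallow layers tend to cooperate with diverse peers to expand the search space, whereas experts in deeper layers form fixed collaborative structures that specialize in specific patterns.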

Abstract: The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but smaller models, remains unexplored. This work investigates this domain, revealing intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space. In the deeper layers, by contrast, fixed collaborative structures emerge, where each expert maintains 2-3 fixed partners and distinct expert combinations are specialized in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of the base model into multiple smaller expert networks, expanding the subnet search space. To prevent performance degradation at the start of fine-tuning, we further propose a pre-trained weight allocation strategy for the experts, coupled with a specific router initialization. During inference, only the input-relevant experts are activated to form a compact subnet. Experiments show that, pretrained with merely 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, pretrained on the private Grounding20M dataset.
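
One way to see why decomposing a pretrained FFN need not hurt performance at the start of fine-tuning: if the hidden units of a two-layer ReLU FFN are partitioned into groups, the sum of the group outputs equals the dense output exactly. The sketch below demonstrates that identity. It is an assumed illustration, not Dynamic-DINO's actual allocation strategy or router, which the abstract does not detail.

```python
import numpy as np

def dense_ffn(x, w1, b1, w2, b2):
    """Two-layer FFN with ReLU: relu(x @ w1 + b1) @ w2 + b2."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def split_into_experts(w1, b1, w2, num_experts):
    """Partition the hidden units into equal contiguous slices, one per expert."""
    hidden = w1.shape[1]
    if hidden % num_experts:
        raise ValueError("hidden size must be divisible by num_experts")
    slices = np.split(np.arange(hidden), num_experts)
    return [(w1[:, s], b1[s], w2[s, :]) for s in slices]

def moe_ffn(x, experts, b2, active):
    """Sum the outputs of the active experts (a hypothetical hard routing)."""
    out = np.zeros((x.shape[0], b2.shape[0])) + b2
    for e in active:
        w1_e, b1_e, w2_e = experts[e]
        out = out + np.maximum(x @ w1_e + b1_e, 0.0) @ w2_e
    return out
```

Activating every expert reproduces the dense model; activating a subset yields the kind of compact subnet the abstract describes for inference.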

[56] VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization

Sania Waheed,Na Min An,Michael Milford,Sarvapali D. Ramchurn,Shoaib Ehsan

Main category: cs.CV

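TL;DR: The paper proposes a hybrid geo-localization framework that combines vision-language models (VLMs) with retrieval-based visual place recognition (VPR). A VLM-generated prior guides retrieval, improving the accuracy and robustness of geo-localization.

Details Motivation: Traditional retrieval methods struggle with scalability and perceptual aliasing, while classification methods generalize poorly and need extensive training data. VLMs are strong at contextual understanding and reasoning, but they are prone to hallucination and lack interpretability, which makes them unsuitable as standalone solutions. The study therefore combines the strengths of both to address geo-localization at planet scale.

Contribution: Proposes a novel hybrid geo-localization framework that combines the prior-generation ability of VLMs with the retrieval mechanism of VPR, improving accuracy by up to 4.51% at street level and up to 13.52% at city level.

Method: 1. Uses a VLM to generate a geographic prior that narrows the retrieval search space; 2. Runs a retrieval step; 3. Applies a re-ranking mechanism to select the most geographically plausible matches.

Result: Outperforms existing methods on multiple geo-localization benchmarks, with the largest gains in street-level and city-level accuracy.

Insight: VLM-generated priors guide retrieval effectively, and the hybrid framework mitigates VLM hallucination while retaining the efficiency and scalability of retrieval methods.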

Abstract: Geo-localization from a single image at planet scale (essentially an advanced or extreme version of the kidnapped robot problem) is a fundamental and challenging task in applications such as navigation, autonomous driving and disaster response due to the vast diversity of locations, environmental conditions, and scene variations. Traditional retrieval-based methods for geo-localization struggle with scalability and perceptual aliasing, while classification-based approaches lack generalization and require extensive training data. Recent advances in vision-language models (VLMs) offer a promising alternative by leveraging contextual understanding and reasoning. However, while VLMs achieve high accuracy, they are often prone to hallucinations and lack interpretability, making them unreliable as standalone solutions. In this work, we propose a novel hybrid geo-localization framework that combines the strengths of VLMs with retrieval-based visual place recognition (VPR) methods. Our approach first leverages a VLM to generate a prior, effectively guiding and constraining the retrieval search space. We then employ a retrieval step, followed by a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates. We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods, particularly at street (up to 4.51%) and city level (up to 13.52%). Our results demonstrate that VLM-generated geographic priors in combination with VPR lead to scalable, robust, and accurate geo-localization systems.
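
The re-ranking step, which weighs feature similarity against proximity to the VLM's estimated coordinates, might look like the sketch below. The scoring formula (a weighted sum with exponential distance decay), the parameter names `alpha` and `scale_km`, and the candidate tuple layout are all assumptions for illustration; the abstract does not give the authors' formula.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    radius = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def rerank(candidates, prior, alpha=0.5, scale_km=50.0):
    """Sort retrieval candidates by similarity blended with proximity to the prior.

    candidates: list of (identifier, similarity in [0, 1], lat, lon)
    prior:      (lat, lon) estimated by the VLM
    alpha:      hypothetical weight on feature similarity
    scale_km:   hypothetical decay length for the proximity term
    """
    def score(candidate):
        _, similarity, lat, lon = candidate
        distance = haversine_km(lat, lon, prior[0], prior[1])
        proximity = math.exp(-distance / scale_km)
        return alpha * similarity + (1 - alpha) * proximity
    return sorted(candidates, key=score, reverse=True)
```

In this toy form, a visually similar match on another continent is demoted below a slightly less similar match near the prior, which is how a geographic prior can counter perceptual aliasing.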

[57] Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection

Francesco Tonini,Lorenzo Vaquero,Alessandro Conti,Cigdem Beyan,Elisa Ricci

Main category: cs.CV

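TL;DR: DYSCO is a training-free framework for Human-Object Interaction (HOI) detection. Using dynamic scoring with enhanced semantics and a multimodal registry, it combines textual and visual interaction representations and improves the understanding of rare interactions.

Details Motivation: Traditional HOI methods rely on large amounts of manually annotated data, which is labor-intensive to produce and hard to extend to new domains and rare interactions. The authors propose exploiting the potential of Vision-Language Models (VLMs) in a training-free solution.

Contribution: 1. Proposes the DYSCO framework, which combines a multimodal registry with a dynamic scoring mechanism; 2. Improves semantic alignment, strengthening generalization to rare interactions; 3. Introduces a multi-head attention mechanism that adaptively weights visual and textual features.

Method: DYSCO stores visual cues and interaction signatures in a multimodal registry and identifies interactions through dynamic scoring and semantic alignment. A multi-head attention mechanism adaptively integrates visual and textual features.

Result: DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly on rare interactions.

Insight: Drawing on the semantic capabilities of VLMs can markedly improve the generalization of HOI detection, especially for rare interactions. Training-free frameworks have potential practical value.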

Abstract: Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.

[58] ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents

Chang Nie,Guangming Wang,Zhe Lie,Hesheng Wang

Main category: cs.CV

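TL;DR: ERMV is a data augmentation framework for editing 4D robotic multi-view sequential images, aimed at the scarcity of high-quality data in robot imitation learning. With an EMA-Attn mechanism, a sparse spatio-temporal module, and a feedback intervention mechanism, ERMV edits data efficiently and improves the robustness and generalization of Vision-Language-Action (VLA) models.

Details Motivation: Robot imitation learning relies on 4D multi-view sequential images, but high-quality data is costly to collect and scarce, which limits the generalization and application of models such as VLA models. Data augmentation is a key way to address this, yet editing techniques for 4D multi-view sequential images are currently lacking.

Contribution: 1. Proposes the ERMV framework, which efficiently edits an entire 4D multi-view image sequence based on single-frame editing and robot state conditions.
2. Designs the EMA-Attn mechanism, the sparse spatio-temporal module, and the feedback intervention mechanism to address the core challenges of geometric consistency, computational cost, and semantic integrity.

Method: 1. EMA-Attn mechanism: learns the pixel shift caused by motion to keep motion blur spatio-temporally consistent.
2. Sparse Spatio-Temporal (STT) module: decouples the temporal and spatial views and reduces computation through sparse sampling.
3. Feedback intervention mechanism: uses a Multimodal Large Language Model (MLLM) to check for editing inconsistencies and requests expert guidance only when necessary.

Result: Experiments show that ERMV-augmented data significantly improves the robustness and generalization of VLA models in both simulated and real-world environments.

Insight: ERMV offers a new approach to data augmentation for robot imitation learning; its modular design and efficiency give it broad potential in 4D data editing.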

Abstract: Robot imitation learning relies on 4D multi-view sequential images. However, the high cost of data collection and the scarcity of high-quality data severely constrain the generalization and application of embodied intelligence policies like Vision-Language-Action (VLA) models. Data augmentation is a powerful strategy to overcome data scarcity, but methods for editing 4D multi-view sequential images for manipulation tasks are currently lacking. Thus, we propose ERMV (Editing Robotic Multi-View 4D data), a novel data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. This task presents three core challenges: (1) maintaining geometric and appearance consistency across dynamic views and long time horizons; (2) expanding the working window with low computational costs; and (3) ensuring the semantic integrity of critical objects like the robot arm. ERMV addresses these challenges through a series of innovations. First, to ensure spatio-temporal consistency in motion blur, we introduce a novel Epipolar Motion-Aware Attention (EMA-Attn) mechanism that learns pixel shift caused by movement before applying geometric constraints. Second, to maximize the editing working window, ERMV pioneers a Sparse Spatio-Temporal (STT) module, which decouples the temporal and spatial views and remodels a single-frame multi-view problem through sparse sampling of the views to reduce computational demands. Third, to alleviate error accumulation, we incorporate a feedback intervention Mechanism, which uses a Multimodal Large Language Model (MLLM) to check editing inconsistencies and request targeted expert guidance only when necessary. Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.

[59] Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls

Elena Pitta,Tom Kouwenhoven,Tessa Verhoef

Main category: cs.CV

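TL;DR: The study examines the promise and limits of the Visual Entailment (VE) task as a reliable diagnostic of vision-language understanding in multimodal language models. Three-shot inference outperforms the zero-shot baselines, but additional examples introduce noise, and label order affects predictions. Fine-tuning performs best, yet when visual information is withheld the model falls back on linguistic priors, which calls the visual grounding of the task into question.

Details Motivation: To assess whether the VE task can effectively diagnose vision-language understanding in multimodal models, and to reveal its practical potential and limitations.

Contribution: The main contributions are: 1) evaluating VE performance under different settings; 2) analyzing how the number of examples, label order, and visual information affect model behavior; 3) probing the model's reasoning with explanation-based evaluation; 4) a fine-tuned model that reaches 83.3% accuracy on the e-SNLI-VE dataset, outperforming the state-of-the-art OFA-X model.

Method: 1) experiments in zero-shot, few-shot, and fine-tuning settings; 2) exploring the effect of prompt design and of the number and order of in-context examples; 3) analyzing model reasoning with explanation-based evaluation (e.g., BERTScore); 4) comparing results with and without visual information.

Result: 1) three-shot inference works best; 2) label order significantly influences predictions; 3) without visual information the model tends to hallucinate; 4) the fine-tuned model performs strongly (83.3% accuracy), but its visual grounding is in doubt because BERTScore results are comparable with limited vision.

Insight: VE is useful as a diagnostic task but has limitations; multimodal evaluation methods need refinement to reduce reliance on linguistic priors and to strengthen visual grounding.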

Abstract: This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we used explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the labels in the prompt is a critical factor that influences the predictions. In the absence of visual information, the model has a strong tendency to hallucinate and imagine content, raising questions about the model’s over-reliance on linguistic priors. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Additionally, the explanation evaluation demonstrates that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1-score of 89.2%. We do, however, find comparable BERTScore results in experiments with limited vision, questioning the visual grounding of this task. Overall, our results highlight both the utility and limitations of VE as a diagnostic task for vision-language understanding and point to directions for refining multimodal evaluation methods.

[60] Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimer’s disease

Hugues Roy,Reuben Dorent,Ninon Burgos

Main category: cs.CV

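TL;DR: The paper proposes AnoBFN, an unsupervised anomaly detection method based on Bayesian flow networks (BFNs), and applies it to brain FDG PET images in the context of Alzheimer's disease, where it outperforms existing methods in detection performance and false positive rate.

Details Motivation: Unsupervised anomaly detection is crucial in neuroimaging for identifying deviations from healthy data, yet Bayesian flow networks, a recent class of generative models, had not been applied to medical imaging or anomaly detection.

Contribution: Applies BFNs to medical imaging and anomaly detection for the first time. Proposes AnoBFN, which builds on the way BFNs combine diffusion frameworks with Bayesian inference to perform conditional generation under high noise while preserving subject specificity.

Method: AnoBFN preserves the specificity of the input image through recursive feedback and performs conditional image generation under high levels of spatially correlated noise for anomaly detection.

Result: On Alzheimer's disease-related anomaly detection in FDG PET images, AnoBFN outperforms existing methods based on VAEs, GANs, and diffusion models.

Insight: By combining the strengths of diffusion and Bayesian inference, BFNs provide an effective new tool for anomaly detection in medical imaging.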

Abstract: Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models, which have not yet been applied to medical imaging or anomaly detection. BFNs combine the strength of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating a recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer’s disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on VAEs (beta-VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.

[61] Illicit object detection in X-ray imaging using deep learning techniques: A comparative evaluation

Jorgen Cani,Christos Diou,Spyridon Evangelatos,Vasileios Argyriou,Panagiotis Radoglou-Grammatikis,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos

Main category: cs.CV

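TL;DR: The paper presents a systematic comparative evaluation of deep learning methods for illicit object detection in X-ray imaging, built on a comprehensive evaluation framework that spans multiple datasets and models, and releases the code and model weights.

Details Motivation: Automated X-ray inspection matters for public security, but object occlusion, variation in the physical properties of items, the diversity of X-ray scanning devices, and limited training data still hinder accurate and reliable detection. Reported experimental evaluations are often incomplete and give conflicting results, so a systematic comparative study is needed.

Contribution: Develops a comprehensive evaluation framework comprising six large-scale public datasets and ten state-of-the-art object detection methods, and analyzes both detection performance and time/computational complexity.

Method: Evaluates ten object detection methods (CNN, transformer, and hybrid architectures) on six X-ray illicit item detection datasets, using detection metrics (mAP50 and mAP50:95) and complexity metrics (inference time, parameter size, and computational load).

Result: The detailed analysis yields key observations on the overall behavior of the detection methods, object-level detection performance, dataset-specific behavior, and time efficiency and computational complexity.

Insight: The study highlights how differently detection methods perform across datasets and provides a benchmark and direction for future research; the released code and model weights support reproducibility.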

Abstract: Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at https://github.com/jgenc/xray-comparative-evaluation.
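
The mAP50 and mAP50:95 metrics rest on Intersection over Union (IoU) between a predicted and a ground-truth box. The sketch below shows the IoU computation and the threshold sets, assuming the COCO-style convention in which mAP50:95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05; whether the paper's evaluation code follows exactly this convention is an assumption, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def counts_as_hit(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold

# Thresholds averaged by a COCO-style mAP50:95 (assumed convention).
MAP50_95_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

A detector can score well at mAP50 yet drop at mAP50:95 if its boxes are roughly placed but loosely fitted, which is why the two metrics are reported together.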

[62] Accelerating Parallel Diffusion Model Serving with Residual Compression

Jiajun Luo,Yicheng Xiao,Jianru Xu,Yangxiu You,Rongwei Lu,Chen Tang,Jingyan Jiang,Zhi Wang

Main category: cs.CV

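TL;DR: CompactFusion uses residual compression to cut the communication overhead of parallel diffusion model inference, improving both efficiency and generation quality.

Details Motivation: Diffusion models require substantial computation, and multi-accelerator parallel inference introduces high communication overhead that hinders real-time deployment.

Contribution: Proposes the CompactFusion framework, which uses residual compression to reduce communication volume while preserving generation quality.

Method: Transmits compressed residuals (step-wise activation differences) and adds lightweight error feedback to prevent error accumulation.

Result: Achieves a 3.0x speedup on 4xL20, and a 6.7x speedup over the prior overlap-based method when using communication-heavy strategies such as sequence parallelism on slow networks.

Insight: Diffusion activations are temporally redundant, so residual compression captures the essential information efficiently.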

Abstract: Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4xL20, it achieves 3.0x speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over the prior overlap-based method. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. A portable implementation demonstrated on xDiT is publicly available at https://github.com/Cobalt-27/CompactFusion.
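
Residual compression with error feedback can be sketched with a sender and a receiver that exchange only a compressed step-wise difference. Top-k sparsification stands in for the compressor here, and the class names are hypothetical; the abstract does not state which compressor CompactFusion uses.

```python
import numpy as np

def top_k_sparsify(x, k):
    """Keep the k largest-magnitude entries of a 1-D array, zero the rest."""
    out = np.zeros_like(x)
    if k <= 0:
        return out
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

class ResidualSender:
    def __init__(self, first_activation, k):
        self.prev = first_activation.copy()           # last activation seen
        self.error = np.zeros_like(first_activation)  # error-feedback buffer
        self.k = k

    def encode(self, activation):
        # Step-wise difference plus whatever earlier compression dropped.
        residual = activation - self.prev + self.error
        sent = top_k_sparsify(residual, self.k)
        self.error = residual - sent                  # carried to the next step
        self.prev = activation.copy()
        return sent

class ResidualReceiver:
    def __init__(self, first_activation):
        self.recon = first_activation.copy()

    def decode(self, sent):
        self.recon = self.recon + sent
        return self.recon
```

With this bookkeeping the receiver's reconstruction always equals the true activation minus the sender's current error buffer, so compression error stays bounded by one step's residual instead of accumulating across steps.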

[63] Multi-modal Multi-task Pre-training for Improved Point Cloud Understanding

Liwen Liu,Weidong Yang,Lipeng Ma,Ben Fei

Main category: cs.CV

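TL;DR: The paper proposes MMPT, a multi-modal multi-task pre-training framework that strengthens point cloud understanding through three pre-training tasks (TLR, PLR, and MCL), requires no 3D annotations, and performs well on downstream tasks.

Details Motivation: Existing multi-modal pre-training methods rely on a single task and cannot fully exploit the information in multi-modal data, which limits model performance on complex downstream tasks.

Contribution: Proposes the MMPT framework, which combines three pre-training tasks (TLR, PLR, and MCL) to improve point cloud understanding without requiring 3D annotations.

Method: Designs three pre-training tasks: token-level reconstruction (TLR), point-level reconstruction (PLR), and multi-modal contrastive learning (MCL), drawing on multi-modal features from 3D point clouds and 2D images.

Result: Across a range of discriminative and generative applications, MMPT outperforms existing methods, demonstrating its effectiveness.

Insight: Multi-task pre-training makes fuller use of the information in multi-modal data and improves downstream performance.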

Abstract: Recent advances in multi-modal pre-training methods have shown promising effectiveness in learning 3D representations by aligning multi-modal features between 3D shapes and their corresponding 2D counterparts. However, existing multi-modal pre-training frameworks primarily rely on a single pre-training task to gather multi-modal data in 3D applications. This limitation prevents the models from obtaining the abundant information provided by other relevant tasks, which can hinder their performance in downstream tasks, particularly in complex and diverse domains. In order to tackle this issue, we propose MMPT, a Multi-modal Multi-task Pre-training framework designed to enhance point cloud understanding. Specifically, three pre-training tasks are devised: (i) Token-level reconstruction (TLR) aims to recover masked point tokens, endowing the model with representative learning abilities. (ii) Point-level reconstruction (PLR) is integrated to predict the masked point positions directly, and the reconstructed point cloud can be considered as a transformed point cloud used in the subsequent task. (iii) Multi-modal contrastive learning (MCL) combines feature correspondences within and across modalities, thus assembling a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised manner. Moreover, this framework operates without requiring any 3D annotations, making it scalable for use with large datasets. The trained encoder can be effectively transferred to various downstream tasks. To demonstrate its effectiveness, we evaluated its performance compared to state-of-the-art methods in various discriminant and generative applications under widely-used benchmarks.

[64] Boosting Ray Search Procedure of Hard-label Attacks with Transfer-based Priors

Chen Ma,Xinjie Xu,Shuyu Cheng,Qi Xuan

Main category: cs.CV

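TL;DR: The paper improves the efficiency of ray search in hard-label attacks by introducing transfer-based priors into gradient estimation, significantly improving query efficiency.

Details Motivation: Hard-label attacks are among the most challenging black-box attacks. Gradient estimation in existing ray-search methods is inefficient, especially given the high query cost. The authors therefore introduce prior knowledge to improve the quality and efficiency of gradient estimation.

Contribution: Proposes a gradient estimation method based on transfer-based priors and shows, through theoretical analysis and experiments, that it significantly improves the efficiency and effectiveness of ray search in hard-label attacks.

Method: Proposes a prior-guided ray search that combines transfer-based priors from surrogate models with random directions to improve gradient estimation. The paper derives the expected cosine similarity between the gradient estimators and the true gradient and designs a query-efficient estimation procedure.

Result: Experiments on the ImageNet and CIFAR-10 datasets show that the method significantly outperforms 11 state-of-the-art methods in query efficiency.

Insight: Prior knowledge can substantially improve the accuracy and efficiency of gradient estimation; in black-box attacks in particular, priors transferred from surrogate models are an effective source of information about the search direction.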

Abstract: One of the most practical and challenging types of black-box adversarial attacks is the hard-label attack, where only the top-1 predicted label is available. One effective approach is to search for the optimal ray direction from the benign image that minimizes the $\ell_p$-norm distance to the adversarial region. The unique advantage of this approach is that it transforms the hard-label attack into a continuous optimization problem. The objective function value is the ray’s radius, which can be obtained via binary search at a high query cost. Existing methods use a “sign trick” in gradient estimation to reduce the number of queries. In this paper, we theoretically analyze the quality of this gradient estimation and propose a novel prior-guided approach to improve ray search efficiency both theoretically and empirically. Specifically, we utilize the transfer-based priors from surrogate models, and our gradient estimators appropriately integrate them by approximating the projection of the true gradient onto the subspace spanned by these priors and random directions, in a query-efficient manner. We theoretically derive the expected cosine similarities between the obtained gradient estimators and the true gradient, and demonstrate the improvement achieved by incorporating priors. Extensive experiments on the ImageNet and CIFAR-10 datasets show that our approach significantly outperforms 11 state-of-the-art methods in terms of query efficiency.
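
The objective described above, the ray's radius obtained by binary search with only top-1 label access, can be sketched as follows. The oracle interface, the `r_max` bound, and the monotonicity assumption (every point beyond the boundary along the ray stays adversarial) are illustrative assumptions, not details from the paper.

```python
import numpy as np

def ray_radius(is_adversarial, x, direction, r_max=10.0, tol=1e-4):
    """Estimate the distance from x to the adversarial region along a ray.

    is_adversarial: hard-label oracle; returns True when the model's top-1
                    label for the queried point differs from the original.
    Returns (radius, number_of_queries). The radius is inf when even r_max
    does not reach the adversarial region.
    """
    d = direction / np.linalg.norm(direction)
    queries = 1
    if not is_adversarial(x + r_max * d):
        return float("inf"), queries
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        queries += 1
        if is_adversarial(x + mid * d):
            hi = mid      # boundary is at or before mid
        else:
            lo = mid      # boundary is beyond mid
    return hi, queries
```

Each evaluation of the objective costs on the order of log2(r_max / tol) queries, which is why query-efficient gradient estimates of this radius function are the focus of the paper.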

[65] RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction

Yuqing Lan,Chenyang Zhu,Shuaifeng Zhi,Jiazhao Zhang,Zhoufeng Wang,Renjiao Yi,Yijie Wang,Kai Xu

Main category: cs.CV

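TL;DR: RemixFusion proposes a residual-based mixed representation for large-scale online RGB-D reconstruction, combining an explicit TSDF grid with an implicit neural module to achieve detail-rich, efficient reconstruction.

Details Motivation: Neural implicit representations lose detail and are slow to learn in online dense reconstruction, while explicit representations such as TSDF fall short in mapping completeness and memory efficiency. RemixFusion aims to address both with a mixed representation.

Contribution: 1. Proposes a residual-based mixed representation that combines an explicit TSDF grid with an implicit neural module; 2. Extends it to multi-frame joint pose optimization and proposes an adaptive gradient amplification technique; 3. Adopts a divide-and-conquer strategy with a local moving volume for efficient online learning.

Method: Combines an explicit TSDF grid (coarse reconstruction) with an implicit neural module (fine-detail residuals), adding the residuals to the coarse grid for high-quality reconstruction. It optimizes pose changes with adaptive gradient amplification and uses a divide-and-conquer strategy with a local moving volume.

Result: On large-scale scenes, RemixFusion outperforms existing methods, both explicit and implicit, in mapping and camera tracking accuracy.

Insight: The mixed representation combines the strengths of explicit and implicit methods, retaining rich detail while improving computational efficiency. The new pose optimization approach improves global convergence.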

Abstract: The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly-smoothed results by the purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art ones, including those based either on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
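The residual-based map described in the abstract (a coarse explicit TSDF value plus a fine residual from an implicit module) can be illustrated in one dimension. This is only a sketch under stated assumptions: the grid is 1D, interpolation is linear, and the `residual` function is a hypothetical stand-in for the neural module, not the authors' network.

```python
import math

def coarse_tsdf(grid, voxel_size, x):
    """Linearly interpolate a 1D coarse TSDF grid at position x (clamped to the grid)."""
    u = min(max(x / voxel_size, 0.0), len(grid) - 1.0)
    i = min(int(u), len(grid) - 2)      # left voxel index
    t = u - i                           # interpolation weight in [0, 1]
    return (1.0 - t) * grid[i] + t * grid[i + 1]

def mixed_tsdf(grid, voxel_size, residual, x):
    """Mixed value = coarse explicit value + fine residual from the implicit module."""
    return coarse_tsdf(grid, voxel_size, x) + residual(x)

# Hypothetical stand-in for the neural module: a small high-frequency correction.
residual = lambda x: 0.05 * math.sin(20.0 * x)
grid = [1.0, 0.5, 0.0, -0.5]            # coarse signed distances at x = 0, 0.5, 1.0, 1.5
value = mixed_tsdf(grid, 0.5, residual, 0.75)
```

Because the coarse grid already carries the bulk of the signal, the learned part only has to represent a small correction, which is one plausible reading of why the abstract reports bounded time and memory.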

[66] PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Maciej K. Wozniak,Lianhang Liu,Yixi Cai,Patric Jensfelt

Main category: cs.CV

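TL;DR: PRIX is an end-to-end autonomous driving architecture that uses only camera data, avoiding expensive LiDAR and BEV representations. It predicts trajectories directly with a visual feature extractor and a generative planning head. Its core module, CaRT, enhances multi-level visual features for more robust planning, and the model achieves state-of-the-art performance on the NavSim and nuScenes benchmarks.

Details Motivation: Current end-to-end autonomous driving models rely on LiDAR and computationally intensive BEV representations, which limits their deployment on mass-market vehicles equipped only with cameras.

Contribution: Proposes the PRIX architecture, which plans efficiently from camera data alone. The core innovation is the Context-aware Recalibration Transformer (CaRT) module.

Method: Couples a visual feature extractor with a generative planning head. The CaRT module enhances multi-level visual features, and trajectories are predicted directly from raw pixels.

Result: Performs strongly on the NavSim and nuScenes benchmarks and is significantly more efficient than larger multimodal diffusion planners.

Insight: Removing the dependence on LiDAR and BEV makes autonomous driving models more practical and scalable.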

Abstract: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.

[67] Vision Transformer attention alignment with human visual perception in aesthetic object evaluation

Miguel Carrasco,César González-Martín,José Aranda,Luis Oliveros

Main category: cs.CV

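TL;DR: This paper studies how well Vision Transformer (ViT) attention aligns with human visual perception in aesthetic object evaluation. Using an eye-tracking experiment and attention map analysis, it finds that specific attention heads correlate strongly with human attention patterns.

Details Motivation: To examine the correspondence between ViT attention mechanisms and human visual attention, particularly in aesthetic evaluation, filling a gap in existing research.

Contribution: Reveals the correlation between ViT attention and human visual attention, in particular that certain attention heads (such as #12) align closely with human patterns, providing a basis for applying AI in aesthetic design.

Method: Combines an eye-tracking experiment (recording human gaze) with attention maps from a ViT (DINO pre-trained model), compares the two attention distributions using KL divergence, and assesses the correlation with statistical tests.

Result: Correlation is best at sigma=2.4. Attention head #12 is closest to human patterns, while heads #7 and #9 differ significantly, indicating fundamental differences between the global attention of ViTs and the focal attention of humans.

Insight: Some ViT attention heads can approximate human visual behavior, particularly for specific object features (such as buckles on basketry bags), but the overall attention strategy still differs from that of humans, which points to directions for improving AI models.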

Abstract: Visual attention mechanisms play a crucial role in human perception and aesthetic evaluation. Recent advances in Vision Transformers (ViTs) have demonstrated remarkable capabilities in computer vision tasks, yet their alignment with human visual attention patterns remains underexplored, particularly in aesthetic contexts. This study investigates the correlation between human visual attention and ViT attention mechanisms when evaluating handcrafted objects. We conducted an eye-tracking experiment with 30 participants (9 female, 21 male, mean age 24.6 years) who viewed 20 artisanal objects comprising basketry bags and ginger jars. Using a Pupil Labs eye-tracker, we recorded gaze patterns and generated heat maps representing human visual attention. Simultaneously, we analyzed the same objects using a pre-trained ViT model with DINO (Self-DIstillation with NO Labels), extracting attention maps from each of the 12 attention heads. We compared human and ViT attention distributions using Kullback-Leibler divergence across varying Gaussian parameters (sigma=0.1 to 3.0). Statistical analysis revealed optimal correlation at sigma=2.4 ± 0.03, with attention head #12 showing the strongest alignment with human visual patterns. Significant differences were found between attention heads, with heads #7 and #9 demonstrating the greatest divergence from human attention (p < 0.05, Tukey HSD test). Results indicate that while ViTs exhibit more global attention patterns compared to human focal attention, certain attention heads can approximate human visual behavior, particularly for specific object features like buckles in basketry items. These findings suggest potential applications of ViT attention mechanisms in product design and aesthetic evaluation, while highlighting fundamental differences in attention strategies between human perception and current AI models.
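The comparison described in the abstract, Kullback-Leibler divergence between a human gaze heat map and per-head attention maps, can be sketched as below. This is an illustration, not the study's pipeline: the 4x4 maps, the single fixation, and the use of sigma as the width of a Gaussian placed at each fixation are all assumptions, and the helper names are hypothetical.

```python
import math

def gaze_heatmap(fixations, height, width, sigma):
    """Sum an isotropic Gaussian of width sigma (in cells) at each fixation."""
    return [[sum(math.exp(-((r - fr) ** 2 + (c - fc) ** 2) / (2 * sigma ** 2))
                 for fr, fc in fixations)
             for c in range(width)] for r in range(height)]

def normalize(grid, eps=1e-12):
    """Flatten a non-negative map into a probability distribution."""
    flat = [v + eps for row in grid for v in row]   # eps avoids log(0)
    total = sum(flat)
    return [v / total for v in flat]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same cells."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical example: one fixation in the top-left corner of a 4x4 map.
human = normalize(gaze_heatmap([(0, 0)], 4, 4, sigma=1.0))
focused_head = normalize([[8, 4, 1, 1], [4, 2, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]])
global_head = normalize([[1] * 4 for _ in range(4)])
scores = {"focused": kl_divergence(human, focused_head),
          "global": kl_divergence(human, global_head)}
best = min(scores, key=scores.get)   # lower divergence means closer to human gaze
```

In this toy setup the head whose attention concentrates near the fixation scores a lower divergence than the uniform, global one, mirroring the abstract's contrast between focal human attention and more global ViT attention.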

[68] Reusing Attention for One-stage Lane Topology Understanding

Yang Li,Zongzheng Zhang,Xuchong Qiu,Xinrun Li,Ziming Liu,Leichen Wang,Ruikai Li,Zhenxin Zhu,Huan-ang Gao,Xiaojian Lin,Zhiyong Cui,Hang Zhao,Hao Zhao

Main category: cs.CV

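TL;DR: This paper proposes a one-stage architecture that reuses attention resources within transformer decoders to simultaneously predict traffic elements, lane centerlines, and topology relationships, improving both the accuracy and the inference speed of lane topology understanding.

Details Motivation: Existing two-stage methods suffer from error propagation and high computational overhead, which makes lane topology understanding inefficient. This paper aims to address these problems.

Contribution: 1. A one-stage architecture that simultaneously predicts traffic elements, lanes, and topology relationships; 2. Reduced computational overhead through attention reuse; 3. The first demonstration of knowledge distillation from models that use standard-definition (SD) maps to models that do not.

Method: Reuses attention resources within transformer decoders, avoiding additional graph network computation, and uses knowledge distillation to improve performance when SD maps are unavailable.

Result: On the OpenLane-V2 dataset, the method achieves better results than baseline methods in lane detection, traffic element identification, and topology reasoning.

Insight: Attention reuse and knowledge distillation are effective means of achieving efficient lane topology understanding, and they also reduce the model's reliance on SD maps.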

Abstract: Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overhead. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines, and topology relationships, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.

[69] CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts

Olaf Dünkel,Artur Jesslen,Jiahao Xie,Christian Theobalt,Christian Rupprecht,Adam Kortylewski

Main category: cs.CV

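TL;DR: CNS-Bench is a new benchmark for evaluating the robustness of image classifiers under continuous, realistic nuisance shifts. It generates continuous nuisance shifts using LoRA adapters and a filtering mechanism, enabling a more comprehensive evaluation of model performance in OOD scenarios.

Details Motivation: Existing methods for evaluating OOD robustness mostly rely on simple synthetic corruptions or binary nuisance shifts, which fail to capture the continuous nuisance shifts of the real world and limit comprehensive robustness evaluation.

Contribution: 1. Proposes CNS-Bench, a benchmark that supports continuous nuisance shifts; 2. Introduces LoRA adapters and a filtering mechanism, improving the reliability and diversity of the generated shifts; 3. Conducts a large-scale robustness evaluation of more than 40 classifiers, finding that model rankings change with the type and severity of the shift.

Method: 1. Generates images with continuous nuisance shifts by applying LoRA adapters to diffusion models; 2. Proposes a filtering mechanism to remove failed generations; 3. Designs experiments that evaluate models under different shift types and severities.

Result: Experiments show that CNS-Bench evaluates model robustness more comprehensively and that model rankings change as the shift varies. Evaluating on a continuous scale also identifies model failure points.

Insight: Continuous nuisance shifts reflect real-world conditions better than binary ones. Evaluating model robustness requires finer-grained shift design and analysis.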

Abstract: An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.

[70] See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering

Junjie Wang,Yunhan Tang,Yijie Wang,Zhihao Yuan,Huan Wang,Yangfan He,Bin Li

Main category: cs.CV

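TL;DR: This paper proposes the Synergos-VQA framework, which fuses three complementary evidence streams (holistic, structural, and causal) to significantly improve performance on knowledge-based visual question answering, setting a new state of the art on multiple benchmarks.

Details Motivation: Existing multimodal large language models (MLLMs) rely on uni-dimensional evidence in knowledge-based visual question answering (KBVQA), which limits their reasoning. The paper aims to achieve more comprehensive and robust reasoning by fusing evidence from multiple perspectives.

Contribution: Proposes the Synergos-VQA framework, which synergistically fuses three complementary kinds of evidence (holistic, structural, and causal) and significantly improves KBVQA performance. The framework is also plug-and-play and can boost other MLLMs.

Method: The framework has three modules: 1) a holistic evidence module that perceives the entire scene, 2) a structural evidence module that identifies key objects with a prototype-driven approach, and 3) a causal evidence module that uses a counterfactual probe to ensure the reasoning is robustly grounded. The three evidence streams are generated concurrently and fused at inference time.

Result: Synergos-VQA achieves state-of-the-art performance on three benchmarks, including OK-VQA and A-OKVQA, and significantly boosts various open-source MLLMs.

Insight: The study shows that synergistic fusion of multi-perspective evidence can improve reasoning more than simply increasing model scale. Introducing structural and causal reasoning also helps strengthen interpretability and robustness.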

Abstract: Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This “seeing only the trees, but not the forest” approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the “forest”), (2) Structural Evidence from a prototype-driven module to identify key objects (the “trees”), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.

[71] Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau

Main category: cs.CV

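TL;DR: The paper introduces Talk2Event, the first large-scale benchmark for language-driven object grounding with event cameras, and proposes the EventRefer framework, which dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE), improving language understanding in event-camera scenes.

Details Motivation: Event cameras offer microsecond-level latency and robustness to motion blur, making them well suited to perceiving dynamic environments, but connecting their asynchronous data streams to human language remains challenging.

Contribution: 1) Introduces Talk2Event, the first large-scale benchmark for language-driven object grounding with event cameras; 2) Develops the EventRefer framework, which dynamically fuses multi-attribute features through MoEE to improve grounding performance.

Method: The EventRefer framework uses a Mixture of Event-Attribute Experts (MoEE) to dynamically fuse multi-attribute representations such as appearance, status, and relation to the viewer, adapting to different modalities and scene dynamics.

Result: EventRefer consistently outperforms existing methods in event-only, frame-only, and event-frame fusion settings.

Insight: Dynamic fusion of multi-attribute representations effectively improves language-driven perception in event-camera scenes, laying a foundation for multimodal real-time perception in robotics and autonomous driving.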

Abstract: Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes – appearance, status, relation to viewer, and relation to other objects – bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.

[72] BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems

Malsha Ashani Mahawatta Dona,Beatriz Cabrero-Daniel,Yinan Yu,Christian Berger

Main category: cs.CV

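TL;DR: The BetterCheck paper proposes a method for detecting and guarding against VLM hallucinations in automotive perception systems, with the goal of improving their safety.

Details Motivation: VLMs excel at understanding complex traffic scenes, but their hallucinations could lead an automated driving system to make dangerous decisions, so a mechanism is needed to detect and guard against them.

Contribution: Proposes BetterCheck, systematically assesses three state-of-the-art VLMs on diverse traffic situations, and designs a hallucination detection strategy.

Method: Assesses three VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset and proposes BetterCheck as a hallucination detection strategy.

Result: VLMs show remarkable image understanding but still hallucinate, which motivates hallucination detection strategies such as BetterCheck.

Insight: VLMs are powerful, but hallucinations limit their use in automated driving. Approaches such as BetterCheck are needed to check and validate their outputs.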

Abstract: Large language models (LLMs) are increasingly extended to process multimodal data such as text and video simultaneously. Their remarkable performance in understanding what is shown in images is surpassing specialized neural networks (NNs) such as YOLO, which support only a well-formed but very limited vocabulary, i.e., the objects that they are able to detect. When unrestricted, LLMs, and in particular state-of-the-art vision language models (VLMs), show impressive performance in describing even complex traffic situations. This makes them potentially suitable components for automotive perception systems to support the understanding of complex traffic situations or edge cases. However, LLMs and VLMs are prone to hallucination, which means either not seeing traffic agents, such as vulnerable road users, who are present in a situation, or seeing traffic agents who are not there in reality. While the latter is unwanted because it makes an ADAS or autonomous driving system (ADS) slow down unnecessarily, the former could lead to disastrous decisions by an ADS. In our work, we systematically assess the performance of 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to support safety guardrails for capturing such hallucinations in VLM-supported perception systems. We observe that both proprietary and open VLMs exhibit remarkable image understanding capabilities, even paying thorough attention to fine details that are sometimes difficult for humans to spot. However, they are still prone to making up elements in their descriptions, which to date requires hallucination detection strategies such as BetterCheck, which we propose in our work.

[73] Yume: An Interactive World Generation Model

Xiaofeng Mao,Shaoheng Lin,Zhen Li,Chuanhao Li,Wenshuo Peng,Tong He,Jiangmiao Pang,Mingmin Chi,Yu Qiao,Kaipeng Zhang

Main category: cs.CV

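TL;DR: Yume is an interactive world generation model that aims to create dynamic worlds from images, text, or videos and to support exploration and control via keyboard or neural signals. The preview version achieves high-quality interactive video generation through camera motion quantization, an improved video generation architecture, and an improved sampler.

Details Motivation: To build a model that turns inputs (images, text, or videos) into interactive, dynamic worlds that users can explore and control in multiple ways.

Contribution: 1. Proposes a complete framework comprising camera motion quantization, a video generation architecture, an advanced sampler, and model acceleration. 2. Introduces the Masked Video Diffusion Transformer (MVDT) and a training-free Anti-Artifact Mechanism (AAM). 3. Proposes Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) and model acceleration optimizations.

Method: The framework includes: 1. camera motion quantization for stable training; 2. MVDT for infinite autoregressive video generation; 3. AAM and TTS-SDE for better sampling quality; 4. adversarial distillation and caching mechanisms for model acceleration.

Result: The model is trained on the high-quality Sekai dataset and performs well in diverse scenes. The code, data, and model weights are all open source.

Insight: Camera motion quantization and training-free sampling mechanisms offer new ideas for interactive world generation, and the open-source release should help the community.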

Abstract: Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.

eess.AS [Back]

[74] Towards Robust Speech Recognition for Jamaican Patois Music Transcription

Jordan Madden,Matthew Stone,Dimitri Johnson,Daniel Geddez

Main category: eess.AS

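TL;DR: This paper addresses speech recognition for Jamaican Patois music with a data-centric approach: it curates more than 40 hours of manually transcribed audio, fine-tunes state-of-the-art ASR models on it, and develops scaling laws for the performance of Whisper models.

Details Motivation: Current speech recognition systems perform poorly on Jamaican Patois music, limiting accessibility and downstream applications, so improvement is needed.

Contribution: Curates more than 40 hours of manually transcribed Jamaican Patois music, fine-tunes ASR models on it, and develops scaling laws for Whisper model performance.

Method: Takes a data-centric approach: manually transcribes a dataset and uses it to fine-tune existing ASR models such as Whisper.

Result: Improves speech recognition performance on Jamaican Patois music and derives scaling laws for Whisper model performance.

Insight: Data quality and scale are critical to speech recognition performance for low-resource languages, and Whisper models show promise on this task.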

Abstract: Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance of Whisper models on Jamaican Patois audio. We hope that this work will have a positive impact on the accessibility of Jamaican Patois music and the future of Jamaican Patois language modeling.

[75] Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Nima Yazdani,Ali Ansari,Aruj Mahajan,Amirhossein Afsharrad,Seyed Shahabeddin Mousavi

Main category: eess.AS

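TL;DR: The paper evaluates combinations of speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) components in AI interview systems through a large-scale empirical comparison. It finds that Google STT paired with GPT-4.1 performs best and that objective quality metrics correlate only weakly with user satisfaction.

Details Motivation: Voice-based conversational AI systems typically use a cascaded STT, LLM, and TTS architecture, but systematic evaluation of different component combinations in production settings remains scarce. The paper aims to fill this gap and provide practical guidance.

Contribution: The main contributions are: (1) an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities; (2) an analysis of data from over 300,000 AI-conducted job interviews, finding that Google STT paired with GPT-4.1 performs best; (3) evidence that objective quality metrics correlate weakly with user satisfaction.

Method: The paper compares four production configurations, using the LLM-as-a-Judge framework to automatically assess conversational quality and technical accuracy. The data come from over 300,000 AI-conducted job interviews.

Result: Google STT paired with GPT-4.1 significantly outperforms the other configurations in conversational and technical quality, but objective quality metrics correlate weakly with user satisfaction.

Insight: User experience in voice-based AI systems may depend on factors beyond technical performance, possibly including the naturalness or emotional resonance of the conversation. This points to important directions for future research and system design.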

Abstract: Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.

[76] Segmentation-free Goodness of Pronunciation

Xinwei Cao,Zijian Fan,Torbjørn Svendsen,Giampiero Salvi

Main category: eess.AS

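TL;DR: This paper proposes self-alignment GOP (GOP-SA) and alignment-free GOP (GOP-AF), pronunciation assessment methods that require no pre-segmentation, overcoming the limitations of traditional methods and achieving state-of-the-art results.

Details Motivation: Traditional pronunciation assessment methods require pre-segmented speech, which limits accuracy and prevents the use of CTC-trained acoustic models. This paper aims to solve that problem.

Contribution: Proposes GOP-SA and GOP-AF, enabling pronunciation assessment without pre-segmentation, and gives a theoretical account of GOP-AF along with an implementation that addresses numerical issues and a proper normalization.

Method: Uses self-alignment and alignment-free formulations with CTC-trained acoustic models, avoiding pre-segmentation.

Result: Validated on the CMU Kids and Speechocean762 datasets, achieving state-of-the-art results on phoneme-level pronunciation assessment.

Insight: Removing the pre-segmentation requirement can significantly improve the flexibility and accuracy of pronunciation assessment, especially when combined with modern CTC acoustic models.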

Abstract: Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues as well as a proper normalization which makes the method applicable with acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.

cs.SI [Back]

[77] Disaster Informatics after the COVID-19 Pandemic: Bibliometric and Topic Analysis based on Large-scale Academic Literature

Ngan Tran,Haihua Chen,Ana Cleveland,Yuhan Zhou

Main category: cs.SI

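TL;DR: Using bibliometric and topic analysis, this study examines research dynamics in disaster informatics between January 2020 and September 2022. It finds that the COVID-19 pandemic significantly shifted research priorities, and it reveals collaboration patterns among countries, institutions, and authors as well as emergent topics.

Details Motivation: The COVID-19 pandemic highlighted the global need for disaster informatics and shifted research interests. Analyzing large-scale academic literature reveals research trends and priority areas, offering strategic insights for policymakers, practitioners, and scholars.

Contribution: 1. Shows how the COVID-19 pandemic influenced research priorities in disaster informatics; 2. Identifies collaboration patterns and differing interests among countries, institutions, and authors; 3. Shows an emerging trend toward multidimensional resilience strategies and cross-sectoral data sharing; 4. Provides analysis methods based on pre-trained language models and generative AI.

Method: Applies bibliometric and topic analysis, combined with pre-trained language models (such as LLMs) and generative AI, to a large-scale corpus of disaster informatics literature published between January 2020 and September 2022.

Result: 1. Countries hit hardest by the pandemic were among the most active in research; 2. Countries and institutions in the same region or sharing a language are more likely to collaborate; 3. Authors tend to specialize in one or two topics, while institutions have broader interests; 4. Research focus has shifted toward public health and multidimensional resilience strategies.

Insight: Disaster informatics is moving toward interdisciplinary work, data sharing, and global collaboration, reflecting growing awareness of global vulnerability and interdependency. The methods and tools can be applied to comparable datasets or similar analytic problems.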

Abstract: This study presents a comprehensive bibliometric and topic analysis of the disaster informatics literature published between January 2020 and September 2022. Leveraging a large-scale corpus and advanced techniques such as pre-trained language models and generative AI, we identify the most active countries, institutions, authors, collaboration networks, emergent topics, patterns among the most significant topics, and shifts in research priorities spurred by the COVID-19 pandemic. Our findings highlight that (1) countries that were most impacted by the COVID-19 pandemic were also among the most active, with each country having specific research interests, (2) countries and institutions that are within the same region or share a common language tend to collaborate, (3) top active authors tend to form close partnerships with one or two key partners, (4) authors typically specialized in one or two specific topics, while institutions had more diverse interests across several topics, and (5) the COVID-19 pandemic has influenced research priorities in disaster informatics, placing greater emphasis on public health. We further demonstrate that the field is converging on multidimensional resilience strategies and cross-sectoral data-sharing collaborations or projects, reflecting a heightened awareness of global vulnerability and interdependency. The data collection and quality assurance strategies, data analytic practices, LLM-based topic extraction and summarization approaches, and result visualization tools can be applied to comparable datasets or used to solve similar analytic problems. By mapping out the trends in disaster informatics, our analysis offers strategic insights for policymakers, practitioners, and scholars aiming to enhance disaster informatics capacities in an increasingly uncertain and complex risk landscape.

Apoorva Gulati,Rajesh Kumar,Vinti Agarwal,Aditya Sharma

Main category: cs.SI

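TL;DR: This paper studies how large language models (LLMs) make fake LinkedIn profiles more realistic and evaluates the robustness of existing fake profile detectors. It finds that existing detectors fail to identify GPT-generated profiles and proposes GPT-assisted adversarial training, which substantially lowers the False Accept Rate. Experiments show that detectors trained on combined numerical and textual embeddings are the most robust.

Details Motivation: Advances in LLMs make it easier to generate highly realistic fake profiles, posing a new challenge for fake profile detection on platforms such as LinkedIn. The study aims to assess the limitations of existing detectors and propose a more robust solution.

Contribution: 1. Shows that existing fake profile detectors perform poorly on LLM-generated profiles.
2. Proposes GPT-assisted adversarial training, which significantly improves detector robustness, bringing the False Accept Rate down to 1-7%.
3. Shows through ablation studies that detectors using combined numerical and textual embeddings outperform those using a single embedding type.

Method: 1. Evaluates existing detectors on LLM-generated fake profiles.
2. Proposes GPT-assisted adversarial training, generating adversarial samples to strengthen the detector.
3. Improves the detector by combining numerical and textual embeddings.

Result: Existing detectors have a False Accept Rate of 42-52% on GPT-generated profiles. After GPT-assisted adversarial training it falls to 1-7%, while the False Reject Rate stays low (0.5-2%). Ablation studies show that detectors combining numerical and textual embeddings perform best.

Insight: As LLMs spread, the ability to generate fake profiles has grown substantially and traditional detection methods no longer suffice. Adversarial training and combined numerical and textual embeddings are effective ways to improve detector robustness. Misuse of LLM technology needs continued attention, along with more advanced detection tools.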

Abstract: Large Language Models (LLMs) have made it easier to create realistic fake profiles on platforms like LinkedIn. This poses a significant risk for text-based fake profile detectors. In this study, we evaluate the robustness of existing detectors against LLM-generated profiles. While highly effective in detecting manually created fake profiles (False Accept Rate: 6-7%), the existing detectors fail to identify GPT-generated profiles (False Accept Rate: 42-52%). We propose GPT-assisted adversarial training as a countermeasure, restoring the False Accept Rate to between 1-7% without impacting the False Reject Rates (0.5-2%). Ablation studies revealed that detectors trained on combined numerical and textual embeddings exhibit the highest robustness, followed by those using numerical-only embeddings, and lastly those using textual-only embeddings. Complementary analysis on the ability of prompt-based GPT-4Turbo and human evaluators affirms the need for robust automated detectors such as the one proposed in this study.
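The abstract reports results as False Accept and False Reject Rates. A minimal sketch of how these two rates are commonly computed follows, assuming "accept" means a fake profile is classified as real and "reject" means a real profile is classified as fake. The function name and the toy labels are hypothetical and are not data from the study.

```python
def accept_reject_rates(labels, predictions):
    """Return (FAR, FRR) for a fake-profile detector.

    FAR: fraction of fake profiles that the detector accepts as real.
    FRR: fraction of real profiles that the detector rejects as fake.
    """
    fakes = [p for l, p in zip(labels, predictions) if l == "fake"]
    reals = [p for l, p in zip(labels, predictions) if l == "real"]
    far = sum(p == "real" for p in fakes) / len(fakes)
    frr = sum(p == "fake" for p in reals) / len(reals)
    return far, frr

# Toy data (illustrative only): 4 fake and 4 real profiles.
labels = ["fake"] * 4 + ["real"] * 4
predictions = ["fake", "real", "real", "fake", "real", "real", "real", "fake"]
far, frr = accept_reject_rates(labels, predictions)
```

Reporting both rates matters because a detector can trivially lower one by raising the other. The abstract's claim is that adversarial training lowers the False Accept Rate without raising the False Reject Rate.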

cs.GR [Back]

[79] Controllable Video Generation: A Survey

Yue Ma,Kunyu Feng,Zhongyuan Hu,Xinyu Wang,Yucheng Wang,Mingzhe Zheng,Xuanhua He,Chenyang Zhu,Hongyu Liu,Yingqing He,Zeyu Wang,Zhifeng Li,Xiu Li,Wei Liu,Dan Xu,Linfeng Zhang,Qifeng Chen

Main category: cs.GR

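TL;DR: This survey systematically reviews the theoretical foundations and recent advances of controllable video generation, focusing on how non-textual conditions (such as camera motion and depth maps) extend pretrained video generation models to reflect user intent more accurately.

Details Motivation: With the rapid development of AI-generated content (AIGC), video generation has become one of its most impactful subfields. However, existing text-to-video models fall short in expressing complex, multi-modal, and fine-grained user requirements, so more flexible control mechanisms are needed.

Contribution: 1. A systematic review of the theory and methods of controllable video generation; 2. An analysis of control mechanisms in video diffusion models and of how conditions (such as camera motion and depth maps) guide generation; 3. A categorization of existing methods into single-condition, multi-condition, and universal controllable generation.

Method: The survey analyzes control mechanisms in video diffusion models. Pretrained models are extended with additional non-textual conditions (such as camera motion and depth maps), and the survey examines how these conditions are incorporated into the denoising process to guide generation.

Result: Summarizes the current state of research on controllable video generation, proposes a categorization, and maintains a curated repository of the literature.

Insight: Future research could further explore dynamic fusion of multi-modal conditions and more universal frameworks for controllable video generation.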

Abstract: With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.

[80] StreamME: Simplify 3D Gaussian Avatar within Live Stream

Luchuan Song,Yang Zhou,Zhan Xu,Yi Zhou,Deepali Aneja,Chenliang Xu

Main category: cs.GR

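TL;DR: StreamME proposes a fast 3D avatar reconstruction method for live video streams that needs no pre-cached data. It uses an on-the-fly training strategy and a simplified point-cloud distribution to improve efficiency and protect privacy.

Details Motivation: Existing 3D avatar reconstruction methods typically require pre-cached data or rely on complex neural networks (e.g., MLPs), which makes them slow and hard to adapt to live video streams. StreamME aims to address these problems.

Contribution: 1. Proposes an on-the-fly training strategy that enables real-time reconstruction; 2. Simplifies the geometric representation on top of 3D Gaussian Splatting (3DGS) to improve adaptability; 3. Introduces a sparse point-cloud distribution strategy to improve efficiency.

Method: StreamME builds on 3D Gaussian Splatting (3DGS), drops the MLP dependency of deformable 3DGS, and relies only on geometry to adapt quickly to changes in facial expression. A simplification strategy based on primary points reduces the number of points and speeds up training.

Result: The method substantially speeds up avatar reconstruction, works on live video streams, protects user privacy, and reduces communication bandwidth in VR or online conferencing.

Insight: A simplified geometric representation and on-the-fly training are key to real-time 3D avatar reconstruction; the method offers a new direction for real-time applications such as virtual meetings and animation.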

Abstract: We propose StreamME, a method that focuses on fast 3D avatar reconstruction. StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expressions. To further ensure high efficiency in on-the-fly training, we introduce a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, optimizing the number of points while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects facial privacy and reduces communication bandwidth in VR systems or online conferences. Additionally, it can be directly applied to downstream applications such as animation, toonification, and relighting. Please refer to our project page for more details: https://songluchuan.github.io/StreamME/.

cs.SD [Back]

[81] BoSS: Beyond-Semantic Speech

Qing Wang,Zehan Li,Hang Lv,Hongjie Chen,Yaodong Song,Jian Kang,Jie Lian,Jie Li,Yongxiang Li,Zhongjiang He,Xuelong Li

Main category: cs.SD

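TL;DR: The paper proposes the concept of Beyond-Semantic Speech (BoSS) and introduces a hierarchical framework (L1-L5) for assessing the capabilities of spoken interaction systems, highlighting that current speech models fall short in capturing implicit signals such as emotion and context.

Details Motivation: Modern speech technologies (e.g., ASR and TTS) fail to adequately capture the implicit signals in human communication, such as emotion and context, which prevents more natural human-machine interaction.

Contribution: Proposes the BoSS concept, defines a framework for speech information that goes beyond explicit semantics, and introduces Spoken Interaction System Capability Levels (L1-L5) to assess progressively more advanced capabilities of speech systems.

Method: Combines cognitive relevance theories with machine learning models to analyze the temporal and contextual dynamics of speech, and evaluates BoSS-related attributes across five dimensions.

Result: Current spoken language models struggle to fully interpret BoSS signals, which indicates that further research is needed to improve context awareness.

Insight: BoSS research opens a new direction for human-machine interaction and underlines the importance of affective and contextual signals; future speech technology needs to pay more attention to modeling multidimensional features.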

Abstract: Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrating the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions and contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. We evaluate BoSS-related attributes across five different dimensions, revealing that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.

[82] Audio-Vision Contrastive Learning for Phonological Class Recognition

Daiqi Liu,Tomás Arias-Vergara,Jana Hutter,Andreas Maier,Paula Andrea Pérez-Toro

Main category: cs.SD

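TL;DR: The paper proposes a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. Using contrastive learning, the framework achieves state-of-the-art performance on the USC-TIMIT dataset, with an average F1-score of 0.81.

Details Motivation: Accurate classification of articulatory-phonological features is essential for understanding human speech production and building robust speech technologies, especially in clinical settings, where targeted phonemic analysis and therapy can improve diagnostic accuracy and personalized rehabilitation.

Contribution: The main contribution is a contrastive learning-based multimodal fusion method that substantially improves performance on articulatory classification tasks.

Method: The paper evaluates four configurations: (1) unimodal rtMRI; (2) unimodal speech signals; (3) multimodal middle fusion; (4) contrastive learning-based audio-vision fusion. The contrastive approach jointly optimizes the representations of the two modalities and performs best.

Result: On the USC-TIMIT dataset, the contrastive learning-based approach reaches an average F1-score of 0.81, an absolute improvement of 0.23 over the unimodal baseline.

Insight: Contrastive learning offers a clear advantage for multimodal representation learning: it combines information from different modalities effectively and improves performance on speech analysis tasks.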

Abstract: Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
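
Sketch: the abstract does not spell out the contrastive objective, so the following is only a generic illustration of contrastive audio-vision alignment, assuming a symmetric InfoNCE loss over paired embeddings. The function name `info_nce_loss`, the temperature value, and the embedding shapes are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def info_nce_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (audio, rtMRI) embeddings.

    Assumption: row i of audio_emb and row i of video_emb come from the same
    speech segment; every other row in the batch serves as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # shape (B, B); the diagonal holds matched pairs

    def diagonal_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Minimizing such a loss pulls matched audio and image embeddings together and pushes mismatched pairs apart, which is the general idea behind joint optimization of cross-modal representations.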

cs.RO [Back]

[83] InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

Shuai Yang,Hao Li,Yilun Chen,Bin Wang,Yang Tian,Tai Wang,Hanqing Wang,Feng Zhao,Yiyi Liao,Jiangmiao Pang

Main category: cs.RO

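TL;DR: InstructVLA is an end-to-end vision-language-action model. With a new training paradigm, VLA-IT, it achieves leading performance in both reasoning and action generation while preserving the flexibility of large vision-language models.

Details Motivation: Existing vision-language-action models sacrifice either reasoning or action capability, are confined to task-specific data, and forget their pretrained capabilities. The paper addresses these problems.

Contribution: Proposes InstructVLA and the VLA-IT training paradigm, which jointly optimize textual reasoning and action generation and yield substantial gains across multiple tasks.

Method: Uses multimodal training with mixture-of-experts adaptation, jointly optimizing on standard VLM corpora and a curated 650K-sample VLA-IT dataset.

Result: Improves over SpatialVLA by 30.5% on in-domain SimplerEnv tasks, and outperforms a fine-tuned OpenVLA by 92% on the SimplerEnv-Instruct benchmark.

Insight: Textual reasoning can boost manipulation performance, which points toward intuitive, steerable human-robot interaction combined with efficient policy learning.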

Abstract: To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA’s potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.

cs.LG [Back]

[84] SiLQ: Simple Large Language Model Quantization-Aware Training

Steven K. Esser,Jeffrey L. McKinstry,Deepika Bablani,Rathinakumar Appuswamy,Dharmendra S. Modha

Main category: cs.LG

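TL;DR: SiLQ proposes a simple quantization-aware training method for large language models. With a very small additional training cost (<0.1% of the total training budget), it outperforms existing quantization methods by large margins on several benchmarks, without introducing any operations beyond the quantization itself.

Details Motivation: Quantizing large language models reduces inference latency, model size, and energy consumption, but doing so efficiently, with minimal accuracy loss and in a way that remains compatible with specialized inference accelerators, is still a major challenge.

Contribution: The core contribution is an end-to-end quantization-aware training method that needs only a very small additional training cost (<0.1%) and no additional operations, yet substantially improves the performance of quantized models.

Method: A simple end-to-end quantization-aware training approach that applies to activations, cache, and weights, and generalizes across model architectures.

Result: Experiments show that SiLQ outperforms existing quantization methods by large margins on several modern benchmarks, for both base and instruct model variants.

Insight: Efficient quantization-aware training can be achieved with a minimal design, without complex mechanisms or extra operations, which provides a low-cost route to model deployment.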

Abstract: Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
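
Sketch: the abstract does not describe the quantizer, so the following only illustrates the standard building block of quantization-aware training in general, assuming a symmetric per-tensor uniform quantizer and a straight-through estimator for the backward pass. The names `fake_quantize` and `ste_backward` and the max-magnitude scale rule are hypothetical choices for illustration, not SiLQ's implementation.

```python
import numpy as np

def fake_quantize(x, num_bits=8, scale=None):
    """Quantize then dequantize x on a symmetric signed integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -(2 ** (num_bits - 1))
    if scale is None:
        # Assumption: per-tensor scale taken from the largest magnitude.
        max_abs = np.max(np.abs(x))
        scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), qmin, qmax)  # integer codes
    return q * scale  # same dtype and shape as x, but restricted to the grid

def ste_backward(x, upstream_grad, scale, num_bits=8):
    """Straight-through estimator: treat rounding as identity in the backward
    pass and zero the gradient where x was clipped."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -(2 ** (num_bits - 1))
    inside = (x / scale >= qmin) & (x / scale <= qmax)
    return upstream_grad * inside
```

During training, the forward pass would use `fake_quantize` on weights, activations, or cache entries, so the model learns under the same rounding error it will see at inference time.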

[85] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal,Anthony Wang,Elaine Lau,Vaskar Nath,Bing Liu,Sean Hendryx

Main category: cs.LG

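TL;DR: The paper proposes Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals, addressing the difficulty of defining rewards in reinforcement learning. RaR outperforms simple Likert-based scoring on HealthBench-1k and matches or surpasses reward signals derived from expert-written references.

Details Motivation: Many real-world tasks lack a clear reward signal for reinforcement learning, especially when they involve subjective evaluation criteria. Traditional preference-based methods rely on opaque reward functions that are prone to spurious correlations. An interpretable and robust way to generate reward signals is therefore needed.

Contribution: Proposes the RaR framework, which uses structured rubrics as reward signals and improves the interpretability and robustness of rewards. Experiments show that RaR enables smaller judge models to align better with human preferences and sustains robust performance across model scales.

Method: RaR uses checklist-style rubrics as reward signals for on-policy training with GRPO, a reinforcement learning optimization method. This avoids the opacity and spurious-correlation problems of traditional reward signals.

Result: On HealthBench-1k, RaR yields up to a 28% relative improvement over simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references.

Insight: Structured rubrics can serve as effective reward signals: they make rewards more interpretable and allow smaller judge models to align better with human preferences, which opens a new path for applying reinforcement learning to complex tasks.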

Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a $28\%$ relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
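
Sketch: one way a checklist-style rubric could be turned into a scalar reward and fed to GRPO. The weighted-fraction aggregation in `rubric_reward` is an assumption made for illustration (the abstract does not say how rubric items are combined), and the per-item judgments would in practice come from a judge model. The group-standardized advantage in `group_relative_advantages` follows the usual GRPO formulation.

```python
import numpy as np

def rubric_reward(judgments, weights):
    """Weighted fraction of rubric items a response satisfies, in [0, 1].

    judgments: 1 if the (hypothetical) judge marks the criterion as met, else 0.
    weights:   importance of each criterion; the weighting scheme is an assumption.
    """
    judgments = np.asarray(judgments, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * judgments) / np.sum(weights))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within the group of
    responses sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Usage: three sampled responses scored against the same three-item rubric.
weights = [3, 1, 1]
rewards = [rubric_reward(j, weights) for j in ([1, 0, 1], [1, 1, 1], [0, 0, 1])]
advantages = group_relative_advantages(rewards)
```

Because each reward decomposes into named criteria, a low score can be traced back to the specific rubric items that were missed, which is the interpretability benefit the abstract emphasizes.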

[86] Dataset Distillation as Data Compression: A Rate-Utility Perspective

Youneng Bao,Yiping Liu,Zhuo Chen,Yongsheng Liang,Mu Li,Kede Ma

Main category: cs.LG

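TL;DR: The paper proposes a joint rate-utility optimization method for dataset distillation. It compresses a dataset into a small set of synthetic samples, represented as quantized latent codes decoded by lightweight networks, to balance storage cost and performance.

Details Motivation: Modern machine learning's demand for ever-larger datasets and models drives computational and storage costs sharply upward. Dataset distillation mitigates this by compressing the original dataset into a small set of synthetic samples, but existing methods do not jointly optimize storage efficiency and performance.

Contribution: 1. Proposes a joint rate-utility optimization method for dataset distillation. 2. Introduces bits per class (bpc) as a precise storage metric. 3. Achieves substantially higher compression on multiple datasets (e.g., up to 170× on CIFAR-10).

Method: 1. Parameterizes synthetic samples as optimizable latent codes decoded by lightweight networks. 2. Uses the entropy of the quantized latents as the rate measure and a distillation loss as the utility measure. 3. Trades off rate and utility with a Lagrange multiplier.

Result: On CIFAR-10, CIFAR-100, and ImageNet-128, the method achieves up to 170× greater compression than standard distillation at comparable accuracy.

Insight: Jointly optimizing storage efficiency and performance is key to dataset distillation. bpc provides a unified metric for cross-method comparison, and optimizing latent codes with lightweight decoders is an effective route to efficient compression.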

Abstract: Driven by the “scale-is-everything” paradigm, modern machine learning increasingly demands ever-larger datasets and models, yielding prohibitive computational and storage requirements. Dataset distillation mitigates this by compressing an original dataset into a small set of synthetic samples, while preserving its full utility. Yet, existing methods either maximize performance under fixed storage budgets or pursue suitable synthetic data representations for redundancy removal, without jointly optimizing both objectives. In this work, we propose a joint rate-utility optimization method for dataset distillation. We parameterize synthetic samples as optimizable latent codes decoded by extremely lightweight networks. We estimate the Shannon entropy of quantized latents as the rate measure and plug any existing distillation loss as the utility measure, trading them off via a Lagrange multiplier. To enable fair, cross-method comparisons, we introduce bits per class (bpc), a precise storage metric that accounts for sample, label, and decoder parameter costs. On CIFAR-10, CIFAR-100, and ImageNet-128, our method achieves up to $170\times$ greater compression than standard distillation at comparable accuracy. Across diverse bpc budgets, distillation losses, and backbone architectures, our approach consistently establishes better rate-utility trade-offs.
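
Sketch: a rate-utility objective of the kind the abstract describes, written as a distillation loss plus a Lagrange multiplier times a rate term. Two assumptions are made for illustration: the rate is computed from the empirical histogram of rounded latents (the paper's entropy estimator is not specified in the abstract), and the multiplier `lam` is placed on the rate term. The function names are hypothetical.

```python
import numpy as np

def empirical_entropy_bits(latents, step=1.0):
    """Total bits needed to code the quantized latents under their own
    empirical symbol distribution (a simple stand-in for a learned entropy model)."""
    symbols = np.round(np.asarray(latents, dtype=float) / step).astype(int).ravel()
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    bits_per_symbol = -np.sum(p * np.log2(p))
    return float(bits_per_symbol * symbols.size)

def rate_utility_objective(distill_loss, latents, lam, step=1.0):
    """Utility term (any distillation loss value) plus lam times the rate in bits."""
    return distill_loss + lam * empirical_entropy_bits(latents, step)
```

A larger `lam` favors fewer bits at the expense of the distillation loss; sweeping it would trace out a rate-utility curve that can be compared across storage budgets.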

[87] On the Interaction of Compressibility and Adversarial Robustness

Melih Barsbey,Antônio H. Ribeiro,Umut Şimşekli,Tolga Birdal

Main category: cs.LG

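TL;DR: The paper studies the interaction between the compressibility of neural networks and their adversarial robustness. It shows that compressibility (such as neuron-level sparsity and spectral compressibility) induces a small number of sensitive directions that make models vulnerable to adversarial attacks.

Details Motivation: Modern neural networks must satisfy several requirements at once: fitting the training data, generalization, parameter and computational efficiency, and adversarial robustness. How compressibility and robustness interact remains unclear, and the paper aims to fill this gap.

Contribution: Develops a theoretical framework for analyzing how compressibility (neuron-level sparsity and spectral compressibility) affects adversarial robustness, and shows that compression induces sensitive directions that leave models open to attack.

Method: Derives a robustness bound through theoretical analysis, examining how neuron and spectral compressibility affect the representation space. The theoretical predictions are validated with experiments on synthetic and realistic tasks.

Result: Compressibility makes adversarial attacks more effective, and this vulnerability persists under adversarial training and transfer learning. Compressibility also contributes to the emergence of universal adversarial perturbations (UAPs).

Insight: There is a fundamental tension between structured compressibility and robustness; the findings suggest new pathways for designing models that are both efficient and secure.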

Abstract: Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact $L_\infty$ and $L_2$ robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.

[88] Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility

Melih Barsbey,Lucas Prieto,Stefanos Zafeiriou,Tolga Birdal

Main category: cs.LG

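TL;DR: The paper investigates how large learning rates simultaneously achieve robustness to spurious correlations and model compressibility. It finds that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity.

Details Motivation: Modern machine learning models need to be both robust and resource-efficient, but achieving the two together remains a challenge. The paper studies how large learning rates can satisfy both requirements at once.

Contribution: The main contribution is showing that large learning rates improve robustness to spurious correlations and compressibility at the same time, and compare favorably to other hyperparameters and regularization methods.

Method: Trains models with large learning rates and studies their behavior across diverse spurious-correlation datasets, models, and optimizers, analyzing feature utilization and classification performance.

Result: Large learning rates perform well on both robustness to spurious correlations and compressibility. Their previously documented success in standard classification tasks is likely due to their effect on hidden or rare spurious correlations in the training data.

Insight: A large learning rate is more than a training strategy: it implicitly addresses spurious correlations in the data, which offers a new perspective on model design and training.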

Abstract: Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we position high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Importantly, our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is likely due to their effect on addressing hidden/rare spurious correlations in the training dataset.

cs.HC [Back]

[89] Assessing Medical Training Skills via Eye and Head Movements

Kayhan Latifzadeh,Luis A. Leiva,Klen Čopič Pucihar,Matjaž Kljun,Iztok Devetak,Lili Steblovnik

Main category: cs.HC

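TL;DR: The study analyzes eye and head movements to assess clinical skill development. The results show that eye and head tracking can effectively differentiate trained from untrained practitioners, offering a new basis for skill assessment with computational models.

Details Motivation: Traditional clinical skill assessment relies on subjective scores. The authors aim to provide a more reliable assessment method based on objective eye and head movement data.

Contribution: Shows that eye and head tracking data can be used for skill assessment, particularly in simulated baby delivery tasks.

Method: Uses eye and head movement data from 24 practitioners in simulated baby delivery training sessions, and computes metrics such as pupillary response rate, fixation duration, and angular velocity.

Result: Head-related features (F1=0.85, AUC=0.86) perform better than pupil-related features (F1=0.77, AUC=0.85).

Insight: Eye and head tracking can serve as a complementary tool that supplies objective data for clinical skill assessment.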

Abstract: We examined eye and head movements to gain insights into skill development in clinical settings. A total of 24 practitioners participated in simulated baby delivery training sessions. We calculated key metrics, including pupillary response rate, fixation duration, and angular velocity. Our findings indicate that eye and head tracking can effectively differentiate between trained and untrained practitioners, particularly during labor tasks. For example, head-related features achieved an F1 score of 0.85 and an AUC of 0.86, whereas pupil-related features achieved an F1 score of 0.77 and an AUC of 0.85. The results lay the groundwork for computational models that support implicit skill assessment and training in clinical settings by using commodity eye-tracking glasses as a complementary device to more traditional evaluation methods such as subjective scores.

[90] Explainable AI for Collaborative Assessment of 2D/3D Registration Quality

Sue Min Cho,Alexander Do,Russell H. Taylor,Mathias Unberath

Main category: cs.HC

TL;DR: The paper introduces an explainable AI (XAI) framework for verifying 2D/3D registration quality in surgery, aiming to improve human operators’ ability to detect misalignments, though explainability features only modestly enhance trust and performance.

Details Motivation: Current visualization-based methods are insufficient for reliably detecting 2D/3D registration errors in surgery, which can lead to serious consequences like revision surgeries. There's a need for robust quality assurance tools.

Contribution: The first AI framework specifically trained for 2D/3D registration quality verification, enhanced with explainability features to clarify model decisions and support human operators.

Method: The proposed XAI approach includes a model trained for registration quality assessment and explainability features. Evaluations compare AI-only, human-only, human-AI, and human-XAI conditions.

Result: Explainability features slightly improve user trust and willingness to correct AI errors but do not outperform standalone AI in overall performance.

Insight: While XAI aids human decision-making, further improvements in algorithmic design and human-AI collaboration are needed for more reliable quality assurance in surgical settings.

Abstract: As surgery embraces digital transformation–integrating sophisticated imaging, advanced algorithms, and robotics to support and automate complex sub-tasks–human judgment of system correctness remains a vital safeguard for patient safety. This shift introduces new “operator-type” roles tasked with verifying complex algorithmic outputs, particularly at critical junctures of the procedure, such as the intermediary check before drilling or implant placement. A prime example is 2D/3D registration, a key enabler of image-based surgical navigation that aligns intraoperative 2D images with preoperative 3D data. Although registration algorithms have advanced significantly, they occasionally yield inaccurate results. Because even small misalignments can lead to revision surgery or irreversible surgical errors, there is a critical need for robust quality assurance. Current visualization-based strategies alone have been found insufficient to enable humans to reliably detect 2D/3D registration misalignments. In response, we propose the first artificial intelligence (AI) framework trained specifically for 2D/3D registration quality verification, augmented by explainability features that clarify the model’s decision-making. Our explainable AI (XAI) approach aims to enhance informed decision-making for human operators by providing a second opinion together with a rationale behind it. Through algorithm-centric and human-centered evaluations, we systematically compare four conditions: AI-only, human-only, human-AI, and human-XAI. Our findings reveal that while explainability features modestly improve user trust and willingness to override AI errors, they do not exceed the standalone AI in aggregate performance. Nevertheless, future work extending both the algorithmic design and the human-XAI collaboration elements holds promise for more robust quality assurance of 2D/3D registration.

cs.AI [Back]

[91] Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning

Xinyao Liu,Diping Song

Main category: cs.AI

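TL;DR: The paper proposes FundusExpert, an ophthalmology-specific multimodal large language model (MLLM) that couples positioning and diagnosis through clinical cognitive chain reasoning. The authors also build the FundusGen dataset and the intelligent Fundus-Engine system, which substantially improve performance on ophthalmic question answering and report generation.

Details Motivation: In specialized domains such as ophthalmology, current MLLMs suffer from fragmented annotation granularity and inconsistent clinical reasoning logic, which leads to imprecise cross-modal understanding.

Contribution: 1. Proposes the FundusExpert model, which integrates positioning and diagnostic reasoning; 2. Builds the FundusGen dataset and the Fundus-Engine system, which automate localization and semantic expansion; 3. Reveals a scaling law between data quality and model capability.

Method: 1. Develops the Fundus-Engine system, which combines global disease classification, local object detection, and fine-grained feature analysis; 2. Constructs a clinically aligned cognitive chain that produces interpretable reasoning paths; 3. Fine-tunes the model on instruction data from FundusGen.

Result: 1. On ophthalmic question-answering tasks, exceeds the average accuracy of the 40B MedRegA by 26.6%; 2. On zero-shot report generation, reaches a clinical consistency of 77.0%, significantly outperforming GPT-4o's 47.6%; 3. Identifies a scaling law between data quality and model capability ($L \propto N^{0.068}$).

Insight: 1. Combining region-level localization with diagnostic reasoning chains improves the clinical alignment of MLLMs; 2. Cognitive alignment annotations make data utilization more efficient; 3. FundusExpert explores a pathway toward bridging the visual-language gap in domain-specific MLLMs.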

Abstract: Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o’s 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.

eess.IV [Back]

[92] Harmonization in Magnetic Resonance Imaging: A Survey of Acquisition, Image-level, and Feature-level Methods

Qinqin Yang,Firoozeh Shomal-Zadeh,Ali Gholipour

Main category: eess.IV

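TL;DR: This survey provides a comprehensive overview of image harmonization in medical imaging, especially MRI. It focuses on acquisition, image-level, and feature-level methods and discusses future research directions.

Details Motivation: Medical imaging data are heterogeneous across scanners, protocols, and sites (so-called batch effects). This non-biological variability can obscure true biological signals and impair the generalizability of learning-based models. Image harmonization aims to remove these biases.

Contribution: 1. Systematically categorizes harmonization approaches into prospective acquisition strategies, retrospective image-level and feature-level methods, and traveling-subject-based techniques; 2. Places particular emphasis on deep learning-based approaches; 3. Summarizes current challenges and future directions.

Method: The paper analyzes three broad classes of harmonization methods: prospective strategies at the acquisition stage, image-level post-processing methods (e.g., generative adversarial networks), and feature-level statistical adjustment.

Result: The survey collects representative methods and datasets and highlights the potential of deep learning, while also pointing out the limitations of harmonization techniques.

Insight: The core challenge of image harmonization is balancing the removal of site effects against the preservation of biological information. Future work may need to combine multimodal data or develop more adaptive algorithms.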

Abstract: Modern medical imaging technologies have greatly advanced neuroscience research and clinical diagnostics. However, imaging data collected across different scanners, acquisition protocols, or imaging sites often exhibit substantial heterogeneity, known as “batch effects” or “site effects”. These non-biological sources of variability can obscure true biological signals, reduce reproducibility and statistical power, and severely impair the generalizability of learning-based models across datasets. Image harmonization aims to eliminate or mitigate such site-related biases while preserving meaningful biological information, thereby improving data comparability and consistency. This review provides a comprehensive overview of key concepts, methodological advances, publicly available datasets, current challenges, and future directions in the field of medical image harmonization, with a focus on magnetic resonance imaging (MRI). We systematically cover the full imaging pipeline, and categorize harmonization approaches into prospective acquisition and reconstruction strategies, retrospective image-level and feature-level methods, and traveling-subject-based techniques. Rather than providing an exhaustive survey, we focus on representative methods, with particular emphasis on deep learning-based approaches. Finally, we summarize the major challenges that remain and outline promising avenues for future research.
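
Sketch: a deliberately simplified example of feature-level harmonization, assuming a per-site location-scale adjustment that maps each site's feature mean and standard deviation onto the pooled values. This is not a method taken from the survey; established approaches additionally model biological covariates so that true biological differences between sites are not removed. The function name and data layout are hypothetical.

```python
import numpy as np

def harmonize_location_scale(features, site_ids):
    """Align each site's per-feature mean and standard deviation to the pooled ones.

    features: array of shape (n_subjects, n_features)
    site_ids: array of shape (n_subjects,) with one site label per subject
    """
    features = np.asarray(features, dtype=float)
    site_ids = np.asarray(site_ids)
    pooled_mean = features.mean(axis=0)
    pooled_std = features.std(axis=0)
    harmonized = np.empty_like(features)
    for site in np.unique(site_ids):
        mask = site_ids == site
        site_mean = features[mask].mean(axis=0)
        site_std = features[mask].std(axis=0)
        site_std = np.where(site_std == 0, 1.0, site_std)  # guard constant features
        # Standardize within the site, then rescale to the pooled statistics.
        harmonized[mask] = (features[mask] - site_mean) / site_std * pooled_std + pooled_mean
    return harmonized
```

The sketch also shows the trade-off the abstract mentions: if the sites differ in their true biology (for example, different age ranges), this adjustment would erase that difference along with the scanner effect.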

[93] A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model

Zhe Xu,Ziyi Liu,Junlin Hou,Jiabo Ma,Cheng Jin,Yihui Wang,Zhixuan Chen,Zhengyu Zhang,Zhengrui Guo,Fengtao Zhou,Yingxue Xu,Xi Wang,Ronald Cheong Kin Chan,Li Liang,Hao Chen

Main category: eess.IV

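TL;DR: The paper proposes SmartPath-R1, a multimodal large language model that handles both ROI-level and WSI-level pathology tasks. It uses reinforcement fine-tuning and a mixture-of-experts mechanism for dynamic multitask processing and demonstrates strong pathological reasoning capability.

Details Motivation: Current multimodal large language models in pathology have constrained reasoning capabilities, mainly because they rely on expensive chain-of-thought annotations. They also support only simple VQA tasks and cannot meet the multitask needs of clinical practice.

Contribution: Proposes SmartPath-R1, which combines scale-dependent supervised fine-tuning with task-aware reinforcement fine-tuning, removes the need for chain-of-thought supervision, and performs multiscale, multitask analysis through a mixture-of-experts mechanism.

Method: Uses scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, combined with a mixture-of-experts mechanism that processes diverse tasks dynamically.

Result: Experiments across 72 tasks validate the effectiveness and superiority of the model and show its potential for pathology analysis.

Insight: Leveraging the intrinsic knowledge of the MLLM makes it possible to bypass chain-of-thought annotation while still supporting multitask, multiscale analysis, which points toward versatile AI systems for precision pathology.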

Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation by pathologists. However, current MLLM approaches in pathology demonstrate significantly constrained reasoning capabilities, primarily due to their reliance on expensive chain-of-thought annotations. Additionally, existing methods remain limited to a single application, visual question answering (VQA) at the region-of-interest (ROI) level, failing to address the full spectrum of diagnostic needs in clinical practice, such as ROI classification, detection, segmentation, whole-slide-image (WSI) classification, and VQA. In this study, we present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks while demonstrating robust pathological reasoning capability. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision by leveraging the intrinsic knowledge within the MLLM. Furthermore, SmartPath-R1 integrates multiscale and multitask analysis through a mixture-of-experts mechanism, enabling dynamic processing for diverse tasks. We curate a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples for training and evaluation. Extensive experiments across 72 tasks validate the effectiveness and superiority of the proposed approach. This work represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology.

[94] Mammo-Mamba: A Hybrid State-Space and Transformer Architecture with Sequential Mixture of Experts for Multi-View Mammography

Farnoush Bayatmakou,Reza Taleei,Nicole Simone,Arash Mohammadi

Main category: eess.IV

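TL;DR: Mammo-Mamba proposes a new architecture for multi-view mammogram classification that combines selective state-space models (SSMs), Transformer attention, and expert-driven feature refinement. It addresses the high computational complexity of conventional Transformers and performs well in both classification accuracy and computational efficiency.

Details Motivation: Multi-view mammogram classification is essential for early breast cancer diagnosis, but existing Transformer-based models are computationally expensive, so more efficient alternatives are needed.

Contribution: 1. Proposes the Mammo-Mamba architecture, which integrates SSMs, Transformer attention, and the Sequential Mixture of Experts (SeqMoE) mechanism; 2. Introduces a customized SecMamba block that enhances representation learning on high-resolution images; 3. Achieves both strong classification performance and computational efficiency on the CBIS-DDSM dataset.

Method: Builds on the MambaVision backbone and uses the SeqMoE mechanism and SecMamba blocks for content-adaptive feature refinement, dynamically adjusting feature emphasis through expert gating.

Result: On the CBIS-DDSM dataset, Mammo-Mamba achieves superior classification performance across all key metrics while remaining computationally efficient.

Insight: Combining state-space models with attention balances model performance against computational cost, which suits high-resolution medical imaging tasks.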

Abstract: Breast cancer (BC) remains one of the leading causes of cancer-related mortality among women, despite recent advances in Computer-Aided Diagnosis (CAD) systems. Accurate and efficient interpretation of multi-view mammograms is essential for early detection, driving a surge of interest in Artificial Intelligence (AI)-powered CAD models. While state-of-the-art multi-view mammogram classification models are largely based on Transformer architectures, their computational complexity scales quadratically with the number of image patches, highlighting the need for more efficient alternatives. To address this challenge, we propose Mammo-Mamba, a novel framework that integrates Selective State-Space Models (SSMs), transformer-based attention, and expert-driven feature refinement into a unified architecture. Mammo-Mamba extends the MambaVision backbone by introducing the Sequential Mixture of Experts (SeqMoE) mechanism through its customized SecMamba block. SecMamba is a modified MambaVision block that enhances representation learning in high-resolution mammographic images by enabling content-adaptive feature refinement. These blocks are integrated into the deeper stages of MambaVision, allowing the model to progressively adjust feature emphasis through dynamic expert gating, effectively mitigating the limitations of traditional Transformer models. Evaluated on the CBIS-DDSM benchmark dataset, Mammo-Mamba achieves superior classification performance across all key metrics while maintaining computational efficiency.

cs.IR [Back]

[95] A Query-Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval-Augmented Generation in Large Language Models

Qikai Wei,Huansheng Ning,Chunlong Han,Jianguo Ding

Main category: cs.IR

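TL;DR: The paper proposes QMKGF, a query-aware multi-path knowledge graph fusion method that builds and refines a knowledge graph to enhance retrieval-augmented generation (RAG), significantly improving the generation quality of large language models.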

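Details Motivation: Existing retrieval-augmented generation (RAG) methods rely mainly on similarity-based segment retrieval and overlook the intrinsic connections between segments, which limits performance. QMKGF addresses this through knowledge graph construction and multi-path subgraph optimization.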

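Contribution: 1. Proposes QMKGF, which enhances RAG tasks with a knowledge graph and a multi-path subgraph strategy; 2. Designs a query-aware attention reward model to optimize subgraph selection; 3. Validates the method's effectiveness and superiority on multiple datasets.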

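Method: 1. Uses LLMs with prompt templates to extract entities and relations and generate a knowledge graph; 2. Proposes a multi-path subgraph strategy (one-hop, multi-hop, and importance-based relations); 3. Designs a query-aware attention model to filter and refine subgraphs; 4. Uses the knowledge graph to expand the query and improve generation quality.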

Result: On the HotpotQA dataset, QMKGF achieves a ROUGE-1 score of 64.98%, which is 9.72 percentage points higher than BGE-Rerank, demonstrating its superiority.

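Insight: A knowledge graph combined with a multi-path subgraph strategy captures the semantic relevance of a query more comprehensively and significantly improves RAG performance.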

Abstract: Retrieval Augmented Generation (RAG) has gradually emerged as a promising paradigm for enhancing the accuracy and factual consistency of content generated by large language models (LLMs). However, existing RAG studies primarily focus on retrieving isolated segments using similarity-based matching methods, while overlooking the intrinsic connections between them. This limitation hampers performance in RAG tasks. To address this, we propose QMKGF, a Query-Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval Augmented Generation. First, we design prompt templates and employ general-purpose LLMs to extract entities and relations, thereby generating a knowledge graph (KG) efficiently. Based on the constructed KG, we introduce a multi-path subgraph construction strategy that incorporates one-hop relations, multi-hop relations, and importance-based relations, aiming to improve the semantic relevance between the retrieved documents and the user query. Subsequently, we design a query-aware attention reward model that scores subgraph triples based on their semantic relevance to the query. Then, we select the highest-scoring subgraph and enrich it with additional triples from other subgraphs that are highly semantically relevant to the query. Finally, the entities, relations, and triples within the updated subgraph are utilised to expand the original query, thereby enhancing its semantic representation and improving the quality of LLMs’ generation. We evaluate QMKGF on the SQuAD, IIRC, Culture, HotpotQA, and MuSiQue datasets. On the HotpotQA dataset, our method achieves a ROUGE-1 score of 64.98%, surpassing the BGE-Rerank approach by 9.72 percentage points (from 55.26% to 64.98%). Experimental results demonstrate the effectiveness and superiority of the QMKGF approach.
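The pipeline described in the abstract (multi-path subgraphs, triple scoring, fusion, query expansion) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy triples, the degree-based notion of importance, the word-overlap `score` function (standing in for the learned query-aware attention reward model), and the `threshold` parameter are all hypothetical.

```python
from collections import defaultdict, deque

# Toy (head, relation, tail) triples; the paper extracts these with an LLM.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "won", "Nobel Prize"),
    ("Nobel Prize", "awarded_in", "Stockholm"),
    ("Poland", "member_of", "European Union"),
]


def one_hop(triples, seeds):
    """Triples that directly touch a seed entity."""
    return [t for t in triples if t[0] in seeds or t[2] in seeds]


def multi_hop(triples, seeds, max_hops=2):
    """Triples reachable from the seeds within max_hops (breadth-first)."""
    adj = defaultdict(list)
    for t in triples:
        adj[t[0]].append((t[2], t))
        adj[t[2]].append((t[0], t))
    seen, found = set(seeds), []
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt, triple in adj[node]:
            if triple not in found:
                found.append(triple)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return found


def by_importance(triples, top_k=2):
    """Triples touching the top_k highest-degree entities (assumed proxy)."""
    degree = defaultdict(int)
    for head, _, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    hubs = set(sorted(degree, key=lambda e: (-degree[e], e))[:top_k])
    return [t for t in triples if t[0] in hubs or t[2] in hubs]


def score(query, triple):
    """Word-overlap relevance; a stand-in for the learned reward model."""
    q = set(query.lower().replace("?", "").split())
    words = set(" ".join(triple).lower().replace("_", " ").split())
    return len(q & words) / len(words)


def fuse(query, subgraphs, threshold=0.3):
    """Pick the best-scoring subgraph, then add relevant triples from the rest."""
    def mean_score(sg):
        return sum(score(query, t) for t in sg) / len(sg) if sg else 0.0

    fused = list(max(subgraphs, key=mean_score))
    for sg in subgraphs:
        for t in sg:
            if t not in fused and score(query, t) >= threshold:
                fused.append(t)
    return fused


def expand_query(query, triples):
    """Append the fused triples to the query as plain-text context."""
    facts = "; ".join(f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples)
    return f"{query} [context: {facts}]"


query = "Where was Marie Curie born"
seeds = {"Marie Curie"}
paths = [one_hop(TRIPLES, seeds), multi_hop(TRIPLES, seeds), by_importance(TRIPLES)]
fused = fuse(query, paths)
expanded = expand_query(query, fused)
print(expanded)
```

In a real system the seed entities would come from entity linking on the query and the scorer would be a trained model; the sketch only shows how the three subgraph paths feed a single fused context.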

[96] VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

Ramin Giahi,Kehui Yao,Sriram Kollipara,Kai Zhao,Vahid Mirjalili,Jianpeng Xu,Topojoy Biswas,Evren Korpeoglu,Kannan Achan

Main category: cs.IR

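TL;DR: VL-CLIP improves multimodal recommendation through visual grounding and LLM-augmented CLIP embeddings, addressing the fine-grained alignment, textual ambiguity, and domain adaptation problems that existing vision-language models face in e-commerce recommendation systems.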

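Details Motivation: Existing vision-language models such as CLIP suffer from weak fine-grained alignment, ambiguous text descriptions, and poor domain adaptation in e-commerce recommendation systems, which hurts retrieval and recommendation performance.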

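Contribution: Proposes the VL-CLIP framework, which uses visual grounding to strengthen image representations and an LLM agent to refine text embeddings, significantly improving multimodal recommendation.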

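Method: Uses visual grounding to extract fine-grained image features and an LLM agent to generate enriched, unambiguous text embeddings, improving on CLIP's original embedding representations.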

Result: On a large U.S. e-commerce platform, VL-CLIP significantly increased CTR (18.6%), ATC (15.5%), and GMV (4.0%), and outperformed existing vision-language models.

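Insight: Combining object-aware visual grounding with LLM-enhanced text representations effectively improves both the performance and the semantic alignment of multimodal recommendation systems.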

Abstract: Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.
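The object-level alignment problem from the abstract can be illustrated with a toy cosine-similarity comparison, the measure typically used to match CLIP-style image and text embeddings. Everything below is hypothetical: the three-dimensional vectors are hand-picked stand-ins for real embeddings, and the scenario (a dress photographed on a sofa) is invented for the example rather than taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy axes: [dress-ness, sofa-ness, red-ness].
text_query = [0.8, 0.0, 0.6]       # stands in for the text "red dress"
global_image = [0.5, 0.8, 0.3]     # whole photo: background sofa dominates
grounded_crop = [0.9, 0.1, 0.4]    # crop localized on the product

sim_global = cosine(text_query, global_image)
sim_grounded = cosine(text_query, grounded_crop)
print(f"global image similarity:  {sim_global:.3f}")
print(f"grounded crop similarity: {sim_grounded:.3f}")
```

With these toy numbers the localized crop matches the text query more closely than the global image does, which is the intuition behind refining image representations by localizing the key product.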