Table of Contents

cs.CL

[1] GreenTEA: Gradient Descent with Topic-modeling and Evolutionary Auto-prompting

Zheng Dong,Luming Shang,Gabriela Olinto

Main category: cs.CL

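TL;DR: GreenTEA is an agent-based automatic prompt optimization method that balances exploration and exploitation by combining topic modeling with a genetic algorithm framework, significantly improving prompt quality.

Motivation: Designing prompts by hand is time-consuming and labor-intensive, and existing automatic optimization methods fall short in either efficiency or effectiveness, so exploration and exploitation need to be balanced.

Contribution: Proposes the GreenTEA framework, which combines topic modeling with a genetic algorithm and iteratively optimizes prompts through collaborating agents.

Method: A team of agents analyzes error samples (via topic modeling) and generates new prompts, which are iteratively optimized through crossover and mutation within a genetic algorithm framework.

Result: Outperforms human-engineered prompts and existing automatic optimization methods on multiple benchmarks covering logical reasoning, commonsense, and ethical decision-making tasks.

Insight: Combining collaborative agents with evolutionary algorithms can efficiently optimize a complex prompt space, and topic modeling helps focus on the key deficiencies.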

Abstract: High-quality prompts are crucial for Large Language Models (LLMs) to achieve exceptional performance. However, manually crafting effective prompts is labor-intensive and demands significant domain expertise, limiting its scalability. Existing automatic prompt optimization methods either extensively explore new prompt candidates, incurring high computational costs due to inefficient searches within a large solution space, or overly exploit feedback on existing prompts, risking suboptimal optimization because of the complex prompt landscape. To address these challenges, we introduce GreenTEA, an agentic LLM workflow for automatic prompt optimization that balances candidate exploration and knowledge exploitation. It leverages a collaborative team of agents to iteratively refine prompts based on feedback from error samples. An analyzing agent identifies common error patterns resulting from the current prompt via topic modeling, and a generation agent revises the prompt to directly address these key deficiencies. This refinement process is guided by a genetic algorithm framework, which simulates natural selection by evolving candidate prompts through operations such as crossover and mutation to progressively optimize model performance. Extensive numerical experiments conducted on public benchmark datasets suggest the superior performance of GreenTEA against human-engineered prompts and existing state-of-the-art methods for automatic prompt optimization, covering logical and quantitative reasoning, commonsense, and ethical decision-making.
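
The crossover-and-mutation loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `crossover`, `mutate`, and `evolve` are hypothetical, the scoring function is supplied by the caller, and mutation simply appends a fixed hint, standing in for the analyzing and generation agents that GreenTEA uses.

```python
import random


def crossover(parent_a: str, parent_b: str, rng: random.Random) -> str:
    """Sentence-level crossover: head of one prompt plus tail of the other."""
    a = [s.strip() for s in parent_a.split(".") if s.strip()]
    b = [s.strip() for s in parent_b.split(".") if s.strip()]
    child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    if not child:  # both cuts produced nothing; fall back to a parent
        child = a or b
    return ". ".join(child) + "."


def mutate(prompt: str, error_hints: list, rng: random.Random) -> str:
    """Hypothetical mutation: append a hint that targets an observed error topic."""
    hint = rng.choice(error_hints)
    return prompt if hint in prompt else f"{prompt} {hint}"


def evolve(population, score, error_hints, generations=5, seed=0):
    """Keep the top half as parents (elitism) and refill with mutated children."""
    rng = random.Random(seed)
    pop = list(population)
    size = len(pop)
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[: max(2, size // 2)]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), error_hints, rng))
        pop = parents + children
    return max(pop, key=score)
```

Because the parents survive each generation unchanged, the best score in the population never decreases; in a real system `score` would be accuracy on a held-out set, which is the expensive part.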

[2] Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow

Y. Du,C. Guo,W. Wang,G. Tang

Main category: cs.CL

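TL;DR: This paper proposes Cognitive Decision Routing (CDR), a dynamic reasoning framework that selects a fast, intuitive or a slow, deliberate reasoning strategy for a large language model (LLM) based on query characteristics, significantly improving performance and computational efficiency.

Motivation: A core challenge for LLMs is deciding when to use fast, intuitive reasoning and when to use slow, deliberate reasoning. Inspired by Kahneman's dual-process theory, the paper addresses the problem that models either apply a uniform reasoning depth or use costly methods for every query.

Contribution: Proposes the CDR framework, which dynamically selects a reasoning strategy; analyzing query characteristics along multiple dimensions (such as correlation strength, domain boundaries, and uncertainty) improves performance and saves compute.

Method: Introduces a meta-cognitive layer that analyzes a query along several dimensions (correlation strength, domain boundary crossings, stakeholder multiplicity, uncertainty) and dynamically selects an intuitive or a deliberate reasoning strategy.

Result: Experiments show CDR achieves superior performance on diverse tasks while cutting computational cost by 34% compared with uniform deep reasoning; on professional judgment tasks, consistency improves by 23% and accuracy by 18%.

Insight: Combining cognitive science principles with AI system design gives LLMs an adaptive reasoning methodology, particularly suited to settings that must trade off speed against depth.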

Abstract: Large Language Models (LLMs) face a fundamental challenge in deciding when to rely on rapid, intuitive responses versus engaging in slower, more deliberate reasoning. Inspired by Daniel Kahneman’s dual-process theory and his insights on human cognitive biases, we propose a novel Cognitive Decision Routing (CDR) framework that dynamically determines the appropriate reasoning strategy based on query characteristics. Our approach addresses the current limitations where models either apply uniform reasoning depth or rely on computationally expensive methods for all queries. We introduce a meta-cognitive layer that analyzes query complexity through multiple dimensions: correlation strength between given information and required conclusions, domain boundary crossings, stakeholder multiplicity, and uncertainty levels. Through extensive experiments on diverse reasoning tasks, we demonstrate that CDR achieves superior performance while reducing computational costs by 34% compared to uniform deep reasoning approaches. Our framework shows particular strength in professional judgment tasks, achieving 23% improvement in consistency and 18% better accuracy on expert-level evaluations. This work bridges cognitive science principles with practical AI system design, offering a principled approach to adaptive reasoning in LLMs.
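
A minimal sketch of what such a routing layer might look like, assuming the four dimensions have already been scored. The `QueryFeatures` fields, the weights, and the threshold are all illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class QueryFeatures:
    correlation_strength: float  # 0..1; high means the answer follows directly
    domain_crossings: int        # number of domain boundaries the query spans
    stakeholders: int            # number of parties whose interests matter
    uncertainty: float           # 0..1


def complexity(f: QueryFeatures) -> float:
    """Hypothetical complexity score; the weights are arbitrary placeholders."""
    return (
        (1.0 - f.correlation_strength)
        + 0.5 * f.domain_crossings
        + 0.25 * max(0, f.stakeholders - 1)
        + f.uncertainty
    )


def route(f: QueryFeatures, threshold: float = 1.0) -> str:
    """Return 'slow' (deliberate) for complex queries, 'fast' otherwise."""
    return "slow" if complexity(f) >= threshold else "fast"
```

With these placeholder weights, a direct single-domain query such as `QueryFeatures(0.9, 0, 1, 0.1)` routes to the fast path, while a cross-domain, multi-stakeholder one routes to the slow path.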

[3] Trust but Verify! A Survey on Verification Design for Test-time Scaling

V Venktesh,Mandeep Rathee,Avishek Anand

Main category: cs.CL

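TL;DR: This survey systematically reviews the diverse approaches to verifier design in test-time scaling (TTS), including training mechanisms, verifier types, and their role in improving the performance of large language models.

Motivation: Test-time scaling improves LLM performance by spending more compute at inference time, but verifier design approaches lack a systematic categorization and discussion.

Contribution: Provides a unified view that categorizes and discusses verifier training methods and types in detail, filling a gap in the literature.

Method: Through a literature review, analyzes verifier design approaches, including prompt-based verifiers and fine-tuned discriminative or generative models, and their use in TTS.

Result: Verifier-based search is presented as an effective, parameter-free approach to test-time scaling that yields high performance gains.

Insight: Verifier design is a key component of TTS; systematic verification makes it possible to explore the decoding space more efficiently and select the best output.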

Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm commonly termed has emerged as a superior approach owing to parameter free scaling at inference time and high performance gains. The verifiers could be prompt-based, fine-tuned as a discriminative or generative model to verify process paths, outcomes or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
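
The simplest use of a verifier at test time is best-of-N selection: sample several candidates and keep the one the verifier scores highest. The sketch below assumes a verifier is any function from a candidate string to a score; `toy_arithmetic_verifier` is a hypothetical stand-in for a learned or prompt-based verifier.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def toy_arithmetic_verifier(candidate: str) -> float:
    """Score 1.0 if a string of the form 'a * b = c' is arithmetically correct."""
    lhs, _, rhs = candidate.partition("=")
    try:
        a, b = (int(t) for t in lhs.split("*"))
        return 1.0 if a * b == int(rhs) else 0.0
    except ValueError:  # malformed candidate
        return 0.0
```

An outcome verifier like this scores only the final answer; a process verifier would instead score each intermediate step.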

[4] Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?

Siddhant Bhambri,Upasana Biswas,Subbarao Kambhampati

Main category: cs.CL

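TL;DR: The paper examines whether Chain-of-Thought (CoT) reasoning traces in large language models need to be interpretable to improve task performance, and finds that the best-performing traces tend to be the least interpretable.

Motivation: Current work assumes that reasoning traces should be semantically interpretable, but this lacks empirical support; the authors investigate the actual effect of interpretability on performance.

Contribution: Experimentally demonstrates a mismatch between high-performing reasoning traces and interpretability, and argues that the two can be decoupled.

Method: Supervised fine-tuning of LLaMA and Qwen models on four types of reasoning traces, combined with a human-subject study that rates interpretability.

Result: Fine-tuning on DeepSeek R1 traces yields the strongest performance, yet human raters judged these traces the least interpretable.

Insight: Performance gains from intermediate reasoning traces do not depend on semantic interpretability, which offers a new perspective for model design.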

Abstract: Recent progress in reasoning-oriented Large Language Models (LLMs) has been driven by introducing Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are not only used to guide inference but also serve as supervision signals for distillation into smaller models. A common but often implicit assumption is that CoT traces should be semantically meaningful and interpretable to the end user. While recent research questions the need for semantic nature of these traces, in this paper, we ask: "Must CoT reasoning traces be interpretable to enhance LLM task performance?" We investigate this question in the Open Book Question-Answering domain by supervised fine-tuning LLaMA and Qwen models on four types of reasoning traces: (1) DeepSeek R1 traces, (2) LLM-generated summaries of R1 traces, (3) LLM-generated post-hoc explanations of R1 traces, and (4) algorithmically generated verifiably correct traces. To quantify the trade-off between interpretability and performance, we further conduct a human-subject study with 100 participants rating the interpretability of each trace type. Our results reveal a striking mismatch: while fine-tuning on R1 traces yields the strongest performance, participants judged these traces to be the least interpretable. These findings suggest that it is useful to decouple intermediate tokens from end user interpretability.

[5] QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting

Nicole Cho,William Watson,Alec Koppel,Sumitra Ganesh,Manuela Veloso

Main category: cs.CL

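TL;DR: QueryBandits uses a multi-armed bandit framework to rewrite queries dynamically, exploiting semantic features to proactively reduce LLM hallucination, and significantly outperforms static rewriting strategies and a no-rewrite baseline.

Motivation: Most existing work focuses on after-the-fact filtering of LLM outputs rather than shaping the input queries at the source to avoid hallucination. QueryBandits aims to reduce hallucination through the intervention of dynamic rewriting.

Contribution: Proposes the QueryBandits framework, the first to use semantic features to dynamically optimize the query rewriting strategy, significantly lowering the probability of LLM hallucination.

Method: Using a reward model built on 17 semantic features, applies bandit algorithms such as Thompson Sampling to dynamically select the best query rewriting strategy, avoiding the regret that static rewriting accumulates.

Result: Across 13 QA benchmarks, QueryBandits achieves an 87.5% win rate over the no-rewrite baseline and outperforms static prompting ("paraphrase" and "expand") by 42.6% and 60.3%, respectively.

Insight: Static rewriting strategies can worsen hallucination, whereas dynamically adapted rewriting based on semantic features is more effective; moreover, different queries call for different rewrite strategies, and no single strategy is globally optimal.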

Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have caused higher hallucination prevalence; yet most mitigation work focuses on after-the-fact filtering rather than shaping the queries that trigger them. We introduce QueryBandits, a bandit framework that designs rewrite strategies to maximize a reward model that encapsulates hallucination propensity based upon the sensitivities of 17 linguistic features of the input query, and therefore proactively steers LLMs away from generating hallucinations. Across 13 diverse QA benchmarks and 1,050 lexically perturbed queries per dataset, our top contextual QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a no-rewrite baseline and also outperforms zero-shot static prompting (“paraphrase” or “expand”) by 42.6% and 60.3% respectively. Therefore, we empirically substantiate the effectiveness of QueryBandits in mitigating hallucination via the intervention that takes the form of a query rewrite. Interestingly, certain static prompting strategies, which constitute a considerable share of the current query rewriting literature, have a higher cumulative regret than the no-rewrite baseline, signifying that static rewrites can worsen hallucination. Moreover, we discover that the converged per-arm regression feature weight vectors substantiate that there is no single rewrite strategy optimal for all queries. In this context, guided rewriting via exploiting semantic features with QueryBandits can induce significant shifts in output behavior through forward-pass mechanisms, bypassing the need for retraining or gradient-based adaptation.
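
To make the bandit idea concrete, here is a minimal Beta-Bernoulli Thompson Sampling sketch over rewrite strategies. It is a deliberate simplification: the paper's top performer is a contextual bandit driven by 17 query features, whereas this version ignores context entirely. The class name and the binary reward (1 for a non-hallucinated answer) are assumptions for illustration.

```python
import random


class BetaThompsonSampler:
    """Non-contextual Thompson Sampling with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, seed=0):
        self.arms = list(arms)
        self.alpha = {a: 1.0 for a in self.arms}  # 1 + observed successes
        self.beta = {a: 1.0 for a in self.arms}   # 1 + observed failures
        self.rng = random.Random(seed)

    def select(self) -> str:
        """Sample a success rate per arm and play the arm with the largest draw."""
        draws = {
            a: self.rng.betavariate(self.alpha[a], self.beta[a])
            for a in self.arms
        }
        return max(draws, key=draws.get)

    def update(self, arm: str, reward: int) -> None:
        """Update the posterior with a binary reward (1 = success, 0 = failure)."""
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

In use, each query would trigger `select()`, the chosen rewrite would be applied, and the observed outcome would be fed back through `update()`, so arms that rarely help are sampled less often over time.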

Rui A. Pimenta,Tim Schlippe,Kristina Schaaff

Main category: cs.CL

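TL;DR: The paper evaluates consciousness-like behaviors of Large Language Models (LLMs) with the Maze Test and finds that reasoning-capable models perform better but lack persistent self-awareness.

Motivation: The study explores whether LLMs exhibit consciousness-associated behaviors such as spatial awareness and goal-directed behavior, in order to assess their consciousness-like capabilities.

Contribution: Proposes a framework that distills consciousness theories into 13 essential characteristics, and applies the Maze Test for the first time to evaluate consciousness-like behaviors across multiple LLMs.

Method: Uses the Maze Test, which measures a model's ability to navigate from a first-person perspective, across zero-shot, one-shot, and few-shot settings.

Result: Reasoning-capable LLMs perform better, but the gap between Complete Path Accuracy and Partial Path Accuracy indicates that models struggle to maintain a coherent self-model.

Insight: LLMs make progress on consciousness-related behaviors through reasoning mechanisms, but still lack the integrated, persistent self-awareness characteristic of consciousness.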

Abstract: We investigate consciousness-like behaviors in Large Language Models (LLMs) using the Maze Test, challenging models to navigate mazes from a first-person perspective. This test simultaneously probes spatial awareness, perspective-taking, goal-directed behavior, and temporal sequencing, which are key consciousness-associated characteristics. After synthesizing consciousness theories into 13 essential characteristics, we evaluated 12 leading LLMs across zero-shot, one-shot, and few-shot learning scenarios. Results showed reasoning-capable LLMs consistently outperforming standard versions, with Gemini 2.0 Pro achieving 52.9% Complete Path Accuracy and DeepSeek-R1 reaching 80.5% Partial Path Accuracy. The gap between these metrics indicates LLMs struggle to maintain coherent self-models throughout solutions, a fundamental aspect of consciousness. While LLMs show progress in consciousness-related behaviors through reasoning mechanisms, they lack the integrated, persistent self-awareness characteristic of consciousness.

[7] Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval

Jonghyun Song,Youngjune Lee,Gyu-Hwung Cho,Ilhyeon Song,Saehun Kim,Yohan Jo

Main category: cs.CL

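TL;DR: The paper proposes a joint sparse-dense optimization framework that uses self-knowledge distillation for bi-directional learning between sparse and dense representations, improving text-image retrieval performance.

Motivation: Existing multimodal sparse retrieval methods typically rely on computationally expensive contrastive pre-training or on distillation from a frozen dense model, which limits mutual enhancement between sparse and dense models. The authors propose a bi-directional learning framework to address this.

Contribution: 1. A simple, effective framework for bi-directional learning between sparse and dense representations via self-knowledge distillation. 2. An integrated similarity score used as a shared teacher signal for both representations. 3. For efficiency, only the final layer of the dense encoder and the sparse projection head are fine-tuned, which eases adaptation of existing VLP models.

Method: 1. An integrated score, the weighted sum of dense and sparse similarities, serves as the teacher signal. 2. Bi-directional self-knowledge distillation optimizes both the sparse and the dense representations. 3. Only part of the model is fine-tuned, for efficiency.

Result: Experiments on MSCOCO and Flickr30k show that the sparse retriever outperforms existing sparse baselines and matches or even surpasses dense models, while retaining the efficiency benefits of sparse models.

Insight: Jointly optimizing sparse and dense representations lets them reinforce each other, and sharing a teacher signal in knowledge distillation is an effective way to learn in both directions.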

Abstract: Vision-Language Pretrained (VLP) models have achieved impressive performance on multimodal tasks, including text-image retrieval, based on dense representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction in text-only settings due to its interpretability and efficiency with fast term-based lookup via inverted indexes. Inspired by these advantages, recent work has extended LSR to the multimodal domain. However, these methods often rely on computationally expensive contrastive pre-training, or distillation from a frozen dense model, which limits the potential for mutual enhancement. To address these limitations, we propose a simple yet effective framework that enables bi-directional learning between dense and sparse representations through Self-Knowledge Distillation. This bi-directional learning is achieved using an integrated similarity score (a weighted sum of dense and sparse similarities), which serves as a shared teacher signal for both representations. To ensure efficiency, we fine-tune the final layer of the dense encoder and the sparse projection head, enabling easy adaptation of any existing VLP model. Experiments on MSCOCO and Flickr30k demonstrate that our sparse retriever not only outperforms existing sparse baselines, but also achieves performance comparable to, or even surpassing, its dense counterparts, while retaining the benefits of sparse models.
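
The shared-teacher idea can be sketched for a single query scored against a few candidates. The integrated score is the weighted sum the abstract describes; using softmax-normalized scores and a KL divergence for the distillation term is an assumption made here for illustration, as are all function names and the default weight.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def integrated_scores(dense, sparse, weight=0.5):
    """Teacher signal: weighted sum of dense and sparse similarities."""
    return [weight * d + (1 - weight) * s for d, s in zip(dense, sparse)]


def self_distillation_loss(dense, sparse, weight=0.5):
    """Pull both students' score distributions toward the shared teacher."""
    teacher = softmax(integrated_scores(dense, sparse, weight))
    return (
        kl_divergence(teacher, softmax(dense))
        + kl_divergence(teacher, softmax(sparse))
    )
```

When the dense and sparse scores already agree, the loss is zero; it grows as the two rankings diverge, so each representation is nudged toward what the other captures.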

[8] Error Reflection Prompting: Can Large Language Models Successfully Understand Errors?

Jason Li,Lauren Yraola,Kevin Zhu,Sean O’Brien

Main category: cs.CL

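TL;DR: The paper proposes Error Reflection Prompting (ERP), a new method built on Chain-of-Thought (CoT) that strengthens language model reasoning by recognizing and correcting errors.

Motivation: Existing CoT methods lack reflection and error correction, which can lead a model to perpetuate mistakes. Inspired by the human capacity for reflection, the authors propose ERP.

Contribution: The main contribution is ERP with automated generation, which not only recognizes and corrects errors but also improves the model's reasoning ability and interpretability.

Method: ERP consists of three parts: an incorrect answer, error recognition, and a correct answer. An automated pipeline lets the model identify error types and avoid repeating them.

Result: Experiments show that ERP effectively strengthens model reasoning and improves error recognition and correction.

Insight: Integrating error recognition and correction improves robustness and also makes the model's reasoning process more interpretable.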

Abstract: Prompting methods for language models, such as Chain-of-thought (CoT), present intuitive step-by-step processes for problem solving. These methodologies aim to equip models with a better understanding of the correct procedures for addressing a given task. Despite these advancements, CoT lacks the ability of reflection and error correction, potentially causing a model to perpetuate mistakes and errors. Therefore, inspired by the human ability for said tasks, we propose Error Reflection Prompting (ERP) to further enhance reasoning in language models. Building upon CoT, ERP is a method comprised of an incorrect answer, error recognition, and a correct answer. This process enables the model to recognize types of errors and the steps that lead to incorrect answers, allowing the model to better discern which steps to avoid and which to take. The model is able to generate the error outlines itself with automated ERP generation, allowing for error recognition and correction to be integrated into the reasoning chain and produce scalability and reliability in the process. The results demonstrate that ERP serves as a versatile supplement to conventional CoT, ultimately contributing to more robust and capable reasoning abilities along with increased interpretability in how models ultimately reach their errors.
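
As an illustration of the three-part structure, the sketch below assembles one ERP-style exemplar. The field labels and the function name `build_erp_exemplar` are assumptions; the paper's actual prompt wording is not given in this digest.

```python
def build_erp_exemplar(
    question: str,
    incorrect_answer: str,
    error_recognition: str,
    correct_answer: str,
) -> str:
    """Lay out an exemplar as: incorrect answer, error recognition, correct answer."""
    return "\n".join(
        [
            f"Question: {question}",
            f"Incorrect answer: {incorrect_answer}",
            f"Error recognition: {error_recognition}",
            f"Correct answer: {correct_answer}",
        ]
    )
```

Such an exemplar would be prepended to a new question in the same way as a few-shot CoT demonstration, so the model sees which step went wrong before it sees the fix.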

[9] GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs

Nitin Gupta,Pallav Koppisetti,Kausik Lakkaraju,Biplav Srivastava

Main category: cs.CL

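TL;DR: GAICo is an open-source Python library that standardizes and streamlines the evaluation of generative AI outputs, supports comparison across modalities and diverse data types, and demonstrates its practical value through a case study.

Motivation: The rapid spread of generative AI into diverse, high-stakes domains requires robust, reproducible evaluation, but standardized tooling is lacking, which makes evaluation inefficient and slows system development.

Contribution: GAICo provides a unified, extensible framework that supports evaluation of multiple modalities (text, images, audio) and structured data, and simplifies end-to-end analysis through a high-level API.

Method: GAICo combines reference-based metrics with a high-level API, supporting the full workflow from multi-model comparison to visualization, while also offering direct metric access for fine-grained control.

Result: GAICo proved useful in a real-world case study and had been downloaded over 13K times by August 2025, indicating growing community interest.

Insight: GAICo fills a gap in generative AI evaluation tooling, improving evaluation efficiency and supporting the development of more trustworthy AI systems.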

Abstract: The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes domains necessitates robust and reproducible evaluation methods. However, practitioners often resort to ad-hoc, non-standardized scripts, as common metrics are often unsuitable for specialized, structured outputs (e.g., automated plans, time-series) or holistic comparison across modalities (e.g., text, audio, and image). This fragmentation hinders comparability and slows AI system development. To address this challenge, we present GAICo (Generative AI Comparator): a deployed, open-source Python library that streamlines and standardizes GenAI output comparison. GAICo provides a unified, extensible framework supporting a comprehensive suite of reference-based metrics for unstructured text, specialized structured data formats, and multimedia (images, audio). Its architecture features a high-level API for rapid, end-to-end analysis, from multi-model comparison to visualization and reporting, alongside direct metric access for granular control. We demonstrate GAICo’s utility through a detailed case study evaluating and debugging complex, multi-modal AI Travel Assistant pipelines. GAICo empowers AI researchers and developers to efficiently assess system performance, make evaluation reproducible, improve development velocity, and ultimately build more trustworthy AI systems, aligning with the goal of moving faster and safer in AI deployment. Since its release on PyPI in Jun 2025, the tool has been downloaded over 13K times, across versions, by Aug 2025, demonstrating growing community interest.

[10] Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation

Arka Mukherjee,Shreya Ghosh

Main category: cs.CL

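TL;DR: The paper presents the first comprehensive evaluation of the cultural adaptation of vision-language models (VLMs) on multimodal story generation, revealing both the promise and the limitations of their cultural competence.

Motivation: As VLMs are widely deployed, ensuring their cultural competence across diverse cultural contexts becomes critical for responsible AI systems. Prior work focused on text-only models or on VLM object recognition tasks and lacked a systematic evaluation of cultural adaptation in multimodal generative tasks.

Contribution: Proposes a novel multimodal framework that perturbs cultural identity to evaluate the cultural adaptation of 5 contemporary VLMs, and publicly releases code and data.

Method: Uses a multimodal story generation task with cultural cues in both visual and textual inputs to evaluate how model outputs adapt, and performs cross-modal analysis with visual-semantic similarity.

Result: Models differ markedly in cultural adaptation, some exhibit inverse cultural alignment, and automated metrics contradict human assessments.

Insight: Although VLMs can generate rich culturally specific vocabulary, their cultural understanding remains limited, and cross-modal visual-cultural understanding is especially weak.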

Abstract: As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI. We publicly release our codebase and data: https://github.com/ArkaMukherjee0/mmCultural

[11] Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities

Bhagesh Gaur,Karan Gupta,Aseem Srivastava,Manish Gupta,Md Shad Akhtar

Main category: cs.CL

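TL;DR: The paper proposes MH-COPILOT, a generative reinforcement learning framework for improving user engagement in online mental health communities (OMHCs). It identifies missing support attributes and generates prompts that help users enrich their posts.

Motivation: Many OMHC posts go unanswered because they lack key support attributes, leaving users' needs unmet. The paper aims to improve this by dynamically identifying the gaps and generating prompts.

Contribution: 1) Introduces REDDME, a new dataset annotated with posts' support attributes; 2) designs CueTaxo, a hierarchical taxonomy for controlled question generation; 3) develops MH-COPILOT, a system that integrates several tasks to improve user engagement.

Method: Combines contextual attribute-span identification, support-attribute intensity classification, taxonomy-guided controlled question generation, and a verifier for reward modeling, using reinforcement learning to assess posts and generate prompts dynamically.

Result: Evaluated across four language models, the approach shows significant improvements in attribute elicitation and user engagement; a human evaluation further validates its practical usefulness.

Insight: Dynamically generating prompts that fill in missing support attributes can effectively improve interaction in OMHCs.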

Abstract: Online Mental Health Communities (OMHCs) provide crucial peer and expert support, yet many posts remain unanswered due to missing support attributes that signal the need for help. We present a novel framework that identifies these gaps and prompts users to enrich their posts, thereby improving engagement. To support this, we introduce REDDME, a new dataset of 4,760 posts from mental health subreddits annotated for the span and intensity of three key support attributes: event (what happened?), effect (what did the user experience?), and requirement (what support do they need?). Next, we devise a hierarchical taxonomy, CueTaxo, of support attributes for controlled question generation. Further, we propose MH-COPILOT, a reinforcement learning-based system that integrates (a) contextual attribute-span identification, (b) support attribute intensity classification, (c) controlled question generation via a hierarchical taxonomy, and (d) a verifier for reward modeling. Our model dynamically assesses posts for the presence/absence of support attributes, and generates targeted prompts to elicit missing information. Empirical results across four notable language models demonstrate significant improvements in attribute elicitation and user engagement. A human evaluation further validates the model’s effectiveness in real-world OMHC settings.

[12] Learning from Diverse Reasoning Paths with Routing and Collaboration

Zhenyu Lei,Zhen Tan,Song Wang,Yaochen Zhu,Zihan Chen,Yushun Dong,Jundong Li

Main category: cs.CL

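TL;DR: The paper proposes QR-Distill, which uses quality filtering, conditional routing, and cooperative peer teaching to address the diversity and uneven quality of reasoning paths in knowledge distillation for large language models (LLMs), significantly improving reasoning performance in resource-constrained settings.

Motivation: Although LLM reasoning has improved markedly, deployment in resource-constrained settings remains limited. Conventional knowledge distillation struggles to capture the teacher's full reasoning ability, especially when reasoning paths are diverse and vary in quality.

Contribution: 1) Proposes the QR-Distill framework, combining quality filtering, conditional routing, and cooperative peer teaching; 2) dynamically assigns reasoning paths to suit each student's learning state; 3) uses cooperative peer teaching to address knowledge bias and insufficient coverage.

Method: QR-Distill has three steps: 1) quality filtering retains correct reasoning paths; 2) conditional routing dynamically assigns paths; 3) cooperative peer teaching lets students distill from one another to close knowledge gaps.

Result: Experiments show QR-Distill outperforms conventional single-path and multi-path distillation; ablation studies confirm the importance of each component.

Insight: Quality filtering and dynamic routing are key, and peer teaching helps offset bias toward a single reasoning style, improving adaptability and diversity.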

Abstract: Advances in large language models (LLMs) significantly enhance reasoning capabilities but their deployment is restricted in resource-constrained scenarios. Knowledge distillation addresses this by transferring knowledge from powerful teacher models to compact and transparent students. However, effectively capturing the teacher’s comprehensive reasoning is challenging due to conventional token-level supervision’s limited scope. Using multiple reasoning paths per query alleviates this problem, but treating each path identically is suboptimal as paths vary widely in quality and suitability across tasks and models. We propose Quality-filtered Routing with Cooperative Distillation (QR-Distill), combining path quality filtering, conditional routing, and cooperative peer teaching. First, quality filtering retains only correct reasoning paths scored by an LLM-based evaluation. Second, conditional routing dynamically assigns paths tailored to each student’s current learning state. Finally, cooperative peer teaching enables students to mutually distill diverse insights, addressing knowledge gaps and biases toward specific reasoning styles. Experiments demonstrate QR-Distill’s superiority over traditional single- and multi-path distillation methods. Ablation studies further highlight the importance of each component including quality filtering, conditional routing, and peer teaching in effective knowledge transfer. Our code is available at https://github.com/LzyFischer/Distill.

[13] Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling

Yue Zhao,Xiaoyu Wang,Dan Wang,Zhonglin Jiang,Qingqing Gu,Teng Chen,Ningyuan Xi,Jinxian Qu,Yong Chen,Luo Ji

Main category: cs.CL

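TL;DR: The paper proposes DreamCUB, a model-based reinforcement learning framework for dialogue systems. It builds a dialogue world model that predicts the user's emotion, sentiment, and intention, uses the information bottleneck to solve the underlying POMDP, improves dialogue quality and the exploration-exploitation balance, and performs well in out-of-domain scenarios such as empathetic dialogue.

Motivation: World models have seen limited use in natural language tasks. The paper builds a dialogue world model that predicts multiple dimensions of user state (emotion, sentiment, intention, and so on) to improve the interaction quality of dialogue systems.

Contribution: 1. Proposes a dialogue world model that captures multi-dimensional user state; 2. combines a POMDP with the information bottleneck for user belief modeling; 3. proposes the DreamCUB framework, which improves dialogue quality by jointly training the policy, critic, and world model; 4. achieves state-of-the-art results on emotion classification and sentiment identification, with cross-domain transfer.

Method: 1. Defines a POMDP in which user emotion, sentiment, and intention are modeled as the belief state; 2. uses the information bottleneck to optimize the belief space; 3. jointly trains the policy, critic, and world model within a model-based reinforcement learning framework (DreamCUB).

Result: Experiments show the dialogue world model reaches state-of-the-art performance on emotion classification and sentiment identification, and dialogue quality improves. The framework maintains a reasonable exploration-exploitation balance and transfers well to out-of-domain scenarios such as empathetic dialogue.

Insight: 1. World models hold promise for natural language tasks; 2. multi-dimensional modeling of user belief can significantly improve dialogue quality; 3. model-based reinforcement learning offers a new optimization direction for dialogue systems.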

Abstract: World models have been widely utilized in robotics, gaming, and auto-driving. However, their applications on natural language tasks are relatively limited. In this paper, we construct the dialogue world model, which could predict the user’s emotion, sentiment, and intention, and future utterances. By defining a POMDP, we argue emotion, sentiment and intention can be modeled as the user belief and solved by maximizing the information bottleneck. By this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state-of-the-art performances on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this manner holds a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.

[14] Unbiased Reasoning for Knowledge-Intensive Tasks in Large Language Models via Conditional Front-Door Adjustment

Bo Zhao,Yinghao Zhang,Ziqi Xu,Yongli Ren,Xiuzhen Zhang,Renqiang Luo,Zaiwen Feng,Feng Xia

Main category: cs.CL

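TL;DR: Proposes the Conditional Front-Door Prompting (CFD-Prompting) framework, which uses causal inference and counterfactual external knowledge to reduce the internal bias of large language models (LLMs) on knowledge-intensive tasks, significantly improving accuracy and robustness.

Motivation: LLMs perform poorly on knowledge-intensive tasks, in particular because internal bias leads to incorrect answers. Existing methods such as RAG and CoT do not fully solve this, so a new method is needed to remove bias and improve reasoning.

Contribution: 1. Proposes CFD-Prompting, which achieves unbiased causal effect estimation through conditional front-door adjustment; 2. uses counterfactual external knowledge to simulate how a query behaves under different contexts; 3. validates the method on multiple LLMs and datasets.

Method: 1. Constructs counterfactual external knowledge to simulate varied contexts for the query; 2. applies conditional front-door adjustment for causal inference, reducing internal model bias; 3. incorporates external knowledge into the reasoning process.

Result: Experiments across multiple LLMs and benchmark datasets show that CFD-Prompting significantly outperforms existing baselines in accuracy and robustness.

Insight: Conditional front-door adjustment rests on weaker assumptions than standard front-door adjustment, which improves the robustness and generalisability of the reasoning process and provides an effective unbiased reasoning framework for knowledge-intensive tasks.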

Abstract: Large Language Models (LLMs) have shown impressive capabilities in natural language processing but still struggle to perform well on knowledge-intensive tasks that require deep reasoning and the integration of external knowledge. Although methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) have been proposed to enhance LLMs with external knowledge, they still suffer from internal bias in LLMs, which often leads to incorrect answers. In this paper, we propose a novel causal prompting framework, Conditional Front-Door Prompting (CFD-Prompting), which enables the unbiased estimation of the causal effect between the query and the answer, conditional on external knowledge, while mitigating internal bias. By constructing counterfactual external knowledge, our framework simulates how the query behaves under varying contexts, addressing the challenge that the query is fixed and is not amenable to direct causal intervention. Compared to the standard front-door adjustment, the conditional variant operates under weaker assumptions, enhancing both robustness and generalisability of the reasoning process. Extensive experiments across multiple LLMs and benchmark datasets demonstrate that CFD-Prompting significantly outperforms existing baselines in both accuracy and robustness.
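
For reference, the standard front-door adjustment that the abstract contrasts with the conditional variant can be written as follows. Reading X as the query, Y as the answer, and M as a mediator between them is an assumption about how the symbols map onto the paper's setup; the conditional variant's exact estimand is not stated in this digest and is not reproduced here.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The inner sum averages over alternative values x' of the treatment, which is what removes the influence of an unobserved confounder acting on both X and Y.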

[15] Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs

Sewon Kim,Jiwon Kim,Seungwoo Shin,Hyejin Chung,Daeun Moon,Yejin Kwon,Hyunsoo Yoon

Main category: cs.CL

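TL;DR: The paper defines the risk of "affective hallucination" in emotionally sensitive interactions with large language models (LLMs), introduces the benchmark AHaBench and the dataset AHaPairs for diagnosing and mitigating it, and shows experimentally that Direct Preference Optimization (DPO) reduces affective hallucination while preserving core model performance.

Motivation: The simulated empathy that LLMs show in emotionally sensitive settings can give users an illusory sense of social presence (affective hallucination); because the model lacks genuine affective capacity, this may pose psychological safety risks.

Contribution: 1. Introduces the concept of affective hallucination and a systematic way to diagnose it; 2. releases the AHaBench benchmark and the AHaPairs preference dataset; 3. verifies that DPO fine-tuning reduces affective hallucination without harming model performance.

Method: 1. Builds AHaBench, a benchmark of 500 mental health-related prompts with expert-informed reference responses; 2. evaluates affective hallucination along three dimensions (Emotional Enmeshment, Illusion of Presence, Fostering Overdependence); 3. uses the 5K-instance AHaPairs dataset for DPO alignment.

Result: DPO fine-tuning substantially reduces affective hallucination while preserving reasoning and knowledge performance. Human-model agreement analyses validate AHaBench.

Insight: Affective hallucination is a distinct safety concern for LLMs, and development must account for both factual reliability and psychological safety. The benchmark and dataset give future research practical tools.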

Abstract: Large Language Models (LLMs) are increasingly used in emotionally sensitive interactions, where their simulated empathy can create the illusion of genuine relational connection. We define this risk as Affective Hallucination, the production of emotionally immersive responses that foster illusory social presence despite the model’s lack of affective capacity. To systematically diagnose and mitigate this risk, we introduce AHaBench, a benchmark of 500 mental health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. Experiments across multiple model families show that DPO fine-tuning substantially reduces affective hallucination without degrading core reasoning and knowledge performance. Human-model agreement analyses confirm that AHaBench reliably captures affective hallucination, validating it as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides practical resources for developing LLMs that are not only factually reliable but also psychologically safe. AHaBench and AHaPairs are accessible via https://huggingface.co/datasets/o0oMiNGo0o/AHaBench, and code for fine-tuning and evaluation are in https://github.com/0oOMiNGOo0/AHaBench. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
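
For readers unfamiliar with DPO, the standard per-pair objective is the negative log-sigmoid of a scaled log-probability margin between the chosen and rejected responses, each measured relative to a frozen reference model. The sketch below computes that loss for one preference pair; the function name, argument names, and default `beta` are illustrative, and the inputs are assumed to be summed token log-probabilities.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for a single (chosen, rejected) response pair."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.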

[16] Explaining Black-box Language Models with Knowledge Probing Systems: A Post-hoc Explanation Perspective

Yunxiao Zhao,Hao Xu,Zhiqiang Wang,Xiaoli Li,Jiye Liang,Ru Li

Main category: cs.CL

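TL;DR: This paper proposes KnowProb, a post-hoc explanation method that uses knowledge-guided probing to examine whether pre-trained language models (PLMs) understand the implicit knowledge behind a text rather than only its surface content. Experiments show that current PLMs struggle to capture hidden knowledge and validate the method's effectiveness.

Motivation: The black-box nature of PLMs challenges their trustworthiness; the authors use post-hoc explanation to reveal whether models truly understand implicit knowledge.

Contribution: Proposes KnowProb, which probes PLMs' understanding of implicit knowledge through six potential explanations (three for knowledge-based understanding and three for association-based reasoning), offering a new tool for interpretability research.

Method: Uses knowledge-guided probing (KnowProb), with six explanations designed along two dimensions, knowledge-based understanding and association-based reasoning, to test PLMs' ability to capture implicit knowledge.

Result: Experiments show that current PLMs learn only a single distribution of representation and struggle to capture hidden knowledge, while KnowProb effectively identifies the limitations of black-box models.

Insight: The method reveals the limitations of PLMs and also opens a new direction for interpretability research, underscoring that understanding implicit knowledge matters for model trustworthiness.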

Abstract: Pre-trained Language Models (PLMs) are trained on large amounts of unlabeled data, yet they exhibit remarkable reasoning skills. However, the trustworthiness challenges posed by these black-box models have become increasingly evident in recent years. To alleviate this problem, this paper proposes a novel Knowledge-guided Probing approach called KnowProb in a post-hoc explanation way, which aims to probe whether black-box PLMs understand implicit knowledge beyond the given text, rather than focusing only on the surface level content of the text. We provide six potential explanations derived from the underlying content of the given text, including three knowledge-based understanding and three association-based reasoning. In experiments, we validate that current small-scale (or large-scale) PLMs only learn a single distribution of representation, and still face significant challenges in capturing the hidden knowledge behind a given text. Furthermore, we demonstrate that our proposed approach is effective for identifying the limitations of existing black-box models from multiple probing perspectives, which facilitates researchers to promote the study of detecting black-box models in an explainable way.

[17] Decoding Alignment: A Critical Survey of LLM Development Initiatives through Value-setting and Data-centric Lens

Ilias Chalkidis

Main category: cs.CL

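TL;DR: From a value-setting and data-centric perspective, this paper surveys six large language model (LLM) development initiatives to reveal how AI alignment is understood and applied in practice. It focuses on value selection and data handling and points out the technical and societal challenges of current alignment research.

Details Motivation: AI alignment (for example RLHF) is a core part of LLM development, but existing research concentrates on the technical side and neglects the role of value setting and data handling in the alignment process. This paper aims to fill that gap and reveal the multidisciplinary challenges of alignment in practice.

Contribution: 1. Systematically surveys the public documentation of six LLM development initiatives; 2. Analyzes alignment practice from a value-setting and data-centric angle; 3. Summarizes the technical and societal challenges of current alignment research.

Method: Conducts a documentation survey (audit) of public materials from six LLM initiatives by five leading organizations, covering both proprietary and open-weight models, with a focus on value setting and data handling.

Result: Documents in detail how alignment is practiced in each initiative and summarizes overall trends from the value and data perspectives, revealing the multidisciplinary complexity of alignment.

Insight: 1. Alignment is not only a technical problem but also involves values and social norms; 2. Proprietary and open-weight models differ in their alignment practices; 3. More interdisciplinary collaboration is needed to address the socio-technical challenges of alignment.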

Abstract: AI Alignment, primarily in the form of Reinforcement Learning from Human Feedback (RLHF), has been a cornerstone of the post-training phase in developing Large Language Models (LLMs). It has also been a popular research topic across various disciplines beyond Computer Science, including Philosophy and Law, among others, highlighting the socio-technical challenges involved. Nonetheless, except for the computational techniques related to alignment, there has been limited focus on the broader picture: the scope of these processes, which primarily rely on the selected objectives (values), and the data collected and used to imprint such objectives into the models. This work aims to reveal how alignment is understood and applied in practice from a value-setting and data-centric perspective. For this purpose, we investigate and survey ('audit') publicly available documentation released by 6 LLM development initiatives by 5 leading organizations shaping this technology, focusing on proprietary (OpenAI's GPT, Anthropic's Claude, Google's Gemini) and open-weight (Meta's Llama, Google's Gemma, and Alibaba's Qwen) initiatives, all published in the last 3 years. The findings are documented in detail per initiative, while there is also an overall summary concerning different aspects, mainly from a value-setting and data-centric perspective. On the basis of our findings, we discuss a series of broader related concerns.

[18] ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation

Riccardo Pozzi,Matteo Palmonari,Andrea Coletta,Luigi Bellomarini,Jens Lehmann,Sahar Vahdati

Main category: cs.CL

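TL;DR: ReFactX proposes a scalable method that combines constrained generation with a prefix-tree index so that large language models (LLMs) can access external knowledge directly, without relying on retrievers or auxiliary models, addressing knowledge gaps and hallucinations.

Details Motivation: LLMs produce unreliable responses when they lack the necessary information. Existing approaches such as RAG or tool use try to address this, but they depend on additional models or services, which adds complexity and potential error propagation. ReFactX aims for a simpler, more efficient solution.

Contribution: 1. Proposes a scalable method based on constrained generation and a prefix-tree index that lets LLMs access external knowledge directly. 2. Supports large knowledge bases (800 million facts) and domain-specific data with minimal generation overhead. 3. Validates its effectiveness on question answering.

Method: 1. Verbalizes knowledge-graph triples as textual facts and tokenizes them. 2. Indexes these facts in a prefix tree (trie) for efficient access. 3. At inference time, constrained generation restricts the model to token sequences that form an indexed fact, so that responses are grounded in reliable external knowledge.

Result: Experiments show that ReFactX scales to large knowledge bases, adapts to domain-specific data, and achieves effective results on question answering with minimal generation-time overhead.

Insight: 1. Constrained generation gives LLMs a lightweight way to access structured knowledge directly. 2. Prefix-tree indexing keeps knowledge access efficient at large scale. 3. The method offers a new angle on reducing LLM hallucinations.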

Abstract: Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized in textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at https://github.com/rpo19/ReFactX.
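The prefix-tree constraint described above can be illustrated with a small sketch. It is not the ReFactX implementation: the whitespace tokenizer, the `FactTrie` class, the `toy_score` function (a stand-in for LLM next-token scores), and the three facts are all assumptions made for illustration. It only shows the core idea that, at each step, the decoder may pick solely among tokens that continue an indexed fact.

```python
class FactTrie:
    """Prefix tree over tokenized facts (hypothetical helper, not the ReFactX API)."""

    def __init__(self):
        self.root = {}

    def add(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = {}  # None marks the end of a complete fact

    def allowed_next(self, prefix):
        """Return the tokens that may follow `prefix`, in a deterministic order."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node, key=lambda t: (t is None, t or ""))


def constrained_greedy(trie, score, max_len=16):
    """Greedy decoding where only trie-approved tokens can be chosen."""
    prefix = []
    for _ in range(max_len):
        allowed = trie.allowed_next(prefix)
        if not allowed:
            break
        best = max(allowed, key=lambda t: score(prefix, t))
        if best is None:  # the model chose to end the fact
            break
        prefix.append(best)
    return prefix


# Toy data: facts are tokenized by whitespace (an assumption; a real system
# would use the LLM's own tokenizer).
trie = FactTrie()
for fact in ["Rome is the capital of Italy",
             "Rome is in Lazio",
             "Paris is the capital of France"]:
    trie.add(fact.split())

PREFERRED = {"Paris", "capital"}  # stand-in for what the LLM "wants" to say


def toy_score(prefix, token):
    return 1.0 if token in PREFERRED else 0.0


print(constrained_greedy(trie, toy_score))
```

Because every step is filtered through `allowed_next`, the decoder cannot emit a token sequence that is not a stored fact, whatever the scoring function prefers.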

[19] GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation

Jeongsoo Lee,Daeyong Kwon,Kyohoon Jin

Main category: cs.CL

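TL;DR: GRADE proposes an evaluation framework that models task difficulty along two dimensions, multi-hop reasoning depth and retrieval difficulty, providing a fine-grained analysis tool for evaluating multi-step reasoning in RAG systems.

Details Motivation: Current evaluations of RAG systems overlook the structural complexity of tasks and their multi-step reasoning requirements, and in particular the interaction between retrieval difficulty and reasoning depth.

Contribution: Proposes the GRADE framework, which quantifies task difficulty with a 2D difficulty matrix (reasoning depth and semantic distance), and builds a synthetic multi-hop QA dataset that supports difficulty-controlled evaluation.

Method: Extracts knowledge graphs from factual news articles, recovers missing links through semantic clustering to generate diverse multi-hop queries, and builds the evaluation around combined generator-side and retriever-side difficulty.

Result: Experiments show that error rates correlate strongly with the proposed difficulty measures, validating their diagnostic utility. GRADE supports fine-grained analysis of RAG performance across multiple domains and models.

Insight: GRADE provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications and underlines the importance of modeling task difficulty in evaluation.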

Abstract: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.
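A 2D difficulty matrix of this kind can be sketched as a table of error rates indexed by hop count and a semantic-distance bucket. The abstract does not say how semantic distance is bucketed, so the `distance_edges` thresholds, the function name, and the three records below are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict


def difficulty_matrix(records, distance_edges=(0.33, 0.66)):
    """Error rate per (hops, distance_bucket) cell.

    records: iterable of (hops, semantic_distance, is_correct).
    distance_edges: hypothetical bucket thresholds, not taken from the paper.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for hops, distance, is_correct in records:
        bucket = sum(distance > edge for edge in distance_edges)
        totals[(hops, bucket)] += 1
        if not is_correct:
            errors[(hops, bucket)] += 1
    return {cell: errors[cell] / totals[cell] for cell in totals}


# Made-up evaluation records: (hops, semantic distance, answered correctly?)
records = [(2, 0.1, True), (2, 0.1, False), (3, 0.9, False)]
matrix = difficulty_matrix(records)
print(matrix)
```

Reading the matrix row by row (more hops) or column by column (larger semantic distance) is one way to check whether errors grow along each difficulty axis.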

[20] DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation

Abdelrahman Abdallah,Jamshid Mozafari,Bhawna Piryani,Adam Jatowt

Main category: cs.CL

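TL;DR: The paper proposes DeAR, a framework that decouples document reranking into two stages and combines LLM distillation with multiple loss functions to achieve high accuracy and interpretability.

Details Motivation: Large language models struggle to balance fine-grained relevance scoring with holistic cross-document analysis in document reranking.

Contribution: Proposes the DeAR framework, whose dual-stage approach (pointwise scoring followed by listwise reasoning) improves both reranking performance and interpretability.

Method: 1. Distills token-level relevance signals from a frozen 13B LLaMA teacher into a smaller student model; 2. Attaches a second LoRA adapter and fine-tunes it to enable listwise reasoning.

Result: Evaluated on TREC-DL19/20, BEIR, and NovelEval, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and reaches 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09.

Insight: Dual-loss distillation and a multi-stage design improve model calibration while also strengthening interpretability.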

Abstract: Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact 3B or 8B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems. Dataset and code are available at https://github.com/DataScienceUIBK/DeAR-Reranking.
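To make the Stage 1 objective more tangible, here is a sketch of a hybrid loss built from textbook forms of cross-entropy, RankNet, and KL divergence over per-document relevance scores. The abstract does not give the exact definitions or weights DeAR uses, so the function names, the unit weights, and the toy scores are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_scores, target_index):
    """Negative log-probability of the relevant document under the student."""
    return -math.log(softmax(student_scores)[target_index])


def ranknet(student_scores, labels):
    """Mean pairwise logistic loss over pairs where label_i > label_j."""
    loss, pairs = 0.0, 0
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if label_i > label_j:
                diff = student_scores[i] - student_scores[j]
                loss += math.log1p(math.exp(-diff))
                pairs += 1
    return loss / pairs if pairs else 0.0


def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) between softmax-normalized score distributions."""
    p, q = softmax(teacher_scores), softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_loss(student, teacher, labels, target_index, weights=(1.0, 1.0, 1.0)):
    # The weights are hypothetical; the paper's weighting is not stated here.
    return (weights[0] * cross_entropy(student, target_index)
            + weights[1] * ranknet(student, labels)
            + weights[2] * kl_divergence(teacher, student))


teacher = [2.0, 0.0]   # teacher prefers document 0
labels = [1, 0]        # document 0 is the relevant one
good = hybrid_loss([2.0, 0.0], teacher, labels, target_index=0)
bad = hybrid_loss([0.0, 2.0], teacher, labels, target_index=0)
print(good, bad)
```

A student that ranks the relevant document first and matches the teacher's distribution receives a lower loss than one that inverts the order.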

[21] KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

Jason R Brown,Lennie Wells,Edward James Young,Sergio Bacallado

Main category: cs.CL

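TL;DR: This paper proposes KLQ, a new action-value reinforcement learning method for Language Model Reinforcement Learning from Human Feedback (LM-RLHF), and shows that it is equivalent to a version of PPO in a specific sense. Experiments show that KLQ matches PPO at optimising the LM-RLHF objective and wins more often in LLM-as-a-judge evaluations.

Details Motivation: PPO performs well empirically, but its motivation is heuristic and it handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner, so a more principled method is needed.

Contribution: Proposes KLQ, which offers a new action-value perspective on LM-RLHF, and proves its equivalence to PPO in a specific sense.

Method: Develops KL-regularised Q-Learning (KLQ), which tackles LM-RLHF by optimising an action-value function.

Result: On summarisation and single-turn dialogue, KLQ performs on par with PPO and achieves a higher win rate in LLM-as-a-judge evaluations.

Insight: KLQ provides a more principled and well-performing alternative for LM-RLHF and exposes a latent connection between PPO and Q-learning methods.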

Abstract: Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks – summarisation and single-turn dialogue. We demonstrate that KLQ performs on-par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
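As background for the action-value view, the following sketch shows the standard closed form for a single state in KL-regularised control: maximising E_pi[Q] - beta * KL(pi || pi_ref) gives pi*(a) proportional to pi_ref(a) * exp(Q(a) / beta), with optimal value beta * log sum_a pi_ref(a) * exp(Q(a) / beta). This is a general identity, not a description of the KLQ algorithm itself; the function name, `beta`, and the toy numbers are assumptions.

```python
import math


def kl_regularised_policy(q_values, ref_probs, beta):
    """Optimal policy and value for max_pi E_pi[Q] - beta * KL(pi || pi_ref).

    ref_probs must be strictly positive. Names and inputs are illustrative.
    """
    logits = [math.log(p) + q / beta for q, p in zip(q_values, ref_probs)]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    policy = [math.exp(l - log_z) for l in logits]
    value = beta * log_z
    return policy, value


# Toy token-level example: two candidate next tokens.
q_values = [1.0, 0.0]
ref_probs = [0.5, 0.5]
beta = 1.0
policy, value = kl_regularised_policy(q_values, ref_probs, beta)

# Check the identity numerically: the regularised objective evaluated at the
# optimal policy should equal the closed-form value.
objective = sum(p * q for p, q in zip(policy, q_values)) - beta * sum(
    p * math.log(p / r) for p, r in zip(policy, ref_probs))
print(policy, value, objective)
```

A larger `beta` keeps the policy closer to the reference distribution, while a smaller `beta` lets the action values dominate.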

Thi-Nhung Nguyen,Hoang Ngo,Dinh Phung,Thuy-Trang Vu,Dat Quoc Nguyen

Main category: cs.CL

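TL;DR: This paper proposes a table-understanding method based on entity-oriented search and LLMs. It improves the handling of semantic similarity and contextual relationships, reduces reliance on preprocessing and keyword matching, and achieves state-of-the-art results on standard benchmarks.

Details Motivation: Existing table-understanding methods struggle with unpredictable table content and insufficient contextual information, which leads to complex preprocessing and reliance on keyword matching and limits the reasoning of LLMs. To address these problems, the paper proposes an entity-oriented search method.

Contribution: 1. Proposes an entity-oriented search method that exploits the semantic similarity between questions and table data and the implicit relationships between table cells; 2. Applies a graph query language to table understanding for the first time, opening a new research direction.

Method: Uses entity-oriented search to strengthen the LLM's contextual understanding of tables while reducing preprocessing and keyword matching, and introduces a graph query language for semantic modeling of and reasoning over tables.

Result: Achieves new state-of-the-art performance on the standard benchmarks WikiTableQuestions and TabFact.

Insight: Combining entity-oriented search with a graph query language offers a new approach to table understanding and highlights the importance of tight semantic binding and contextual modeling.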

Abstract: Our work addresses the challenges of understanding tables. Existing methods often struggle with the unpredictable nature of table content, leading to a reliance on preprocessing and keyword matching. They also face limitations due to the lack of contextual information, which complicates the reasoning processes of large language models (LLMs). To overcome these challenges, we introduce an entity-oriented search method to improve table understanding with LLMs. This approach effectively leverages the semantic similarities between questions and table data, as well as the implicit relationships between table cells, minimizing the need for data preprocessing and keyword matching. Additionally, it focuses on table entities, ensuring that table cells are semantically tightly bound, thereby enhancing contextual clarity. Furthermore, we pioneer the use of a graph query language for table understanding, establishing a new research direction. Experiments show that our approach achieves new state-of-the-art performances on standard benchmarks WikiTableQuestions and TabFact.

[23] GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection

Melissa Kazemi Rad,Alberto Purpura,Himanshu Kumar,Emily Chen,Mohammad Shahed Sorower

Main category: cs.CL

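TL;DR: GRAID is a pipeline that generates synthetic data for harmful content detection using geometric constraints and multi-agentic reflection, significantly improving downstream model performance.

Details Motivation: To address data scarcity in harmful text classification and provide high-quality training data for guardrailing applications.

Contribution: Proposes the GRAID pipeline, which generates diverse synthetic data through geometric constraints and multi-agentic reflection and covers edge cases.

Method: Two stages: geometrically controlled generation and augmentation through multi-agentic reflection, using LLMs to produce diverse, high-quality samples.

Result: On two benchmark datasets, augmenting with GRAID significantly improves the performance of harmful text classification models.

Insight: Combining geometric constraints with a reflective process covers the input space more thoroughly and improves the detection of harmful content.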

Abstract: We address the problem of data scarcity in harmful text classification for guardrailing applications and introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process that promotes stylistic diversity and uncovers edge cases. This combination enables both reliable coverage of the input space and nuanced exploration of harmful content. Using two benchmark data sets, we demonstrate that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.

[24] Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages

Yuemei Xu,Kexin Xu,Jian Zhou,Ling Hu,Lin Gui

Main category: cs.CL

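TL;DR: The paper proposes BridgeX-ICL, a method that improves zero-shot cross-lingual in-context learning for low-resource languages by exploring shared neurons, without costly fine-tuning. Experiments confirm its effectiveness.

Details Motivation: Current large language models (LLMs) perform poorly on low-resource languages, and efficient methods that avoid costly fine-tuning are urgently needed.

Contribution: Proposes BridgeX-ICL, which uses language overlap neurons to improve cross-lingual in-context learning, and defines an HSIC-based metric to quantify the internal linguistic spectrum of LLMs.

Method: Builds neuron probe data from bilingual dictionaries, defines a subset of language overlap neurons, and proposes an HSIC-based metric that guides optimal bridge selection.

Result: Validates the effectiveness of BridgeX-ICL on 2 cross-lingual tasks and 15 language pairs and sheds light on the multilingual mechanisms of LLMs.

Insight: Shared neurons can substantially improve cross-lingual performance, providing empirical evidence about the multilingual mechanisms of LLMs.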

Abstract: The current Large Language Models (LLMs) face significant challenges in improving performance on low-resource languages and urgently need data-efficient methods without costly fine-tuning. From the perspective of language-bridge, we propose BridgeX-ICL, a simple yet effective method to improve zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs or not. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and define a subset of language overlap neurons accordingly, to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs’ internal linguistic spectrum based on overlap neurons, which guides optimal bridge selection. The experiments conducted on 2 cross-lingual tasks and 15 language pairs from 7 diverse families (covering both high-low and moderate-low pairs) validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs.
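The abstract names an HSIC-based metric without specifying the estimator or kernel. As a reference point, the sketch below computes the textbook biased empirical HSIC, tr(KHLH) / (n - 1)^2, with a linear kernel. How BridgeX-ICL applies HSIC to overlap-neuron activations is not described here, so the inputs (lists of made-up activation vectors) and function names are assumptions.

```python
def centered_gram(vectors):
    """Linear-kernel Gram matrix, double-centered (equals H K H)."""
    n = len(vectors)
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    row_mean = [sum(row) / n for row in gram]
    grand_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys):
    """Biased empirical HSIC with linear kernels: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc, lc = centered_gram(xs), centered_gram(ys)
    # For symmetric matrices, tr(A B) is the sum of elementwise products.
    trace = sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Made-up one-dimensional "activations" for three probe items.
lang_a = [[0.0], [1.0], [2.0]]
constant = [[5.0], [5.0], [5.0]]
print(hsic(lang_a, lang_a), hsic(lang_a, constant))
```

Higher values indicate stronger statistical dependence between the two sets of representations; a constant signal yields zero.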

[25] Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models

Tharindu Madusanka,Ian Pratt-Hartmann,Riza Batista-Navarro

Main category: cs.CL

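TL;DR: The paper examines the distribution of natural language satisfiability problems and how it affects the reasoning ability of transformer-based language models (TLMs). It studies how different computational complexity classes and grammatical constructs influence TLMs' ability to learn rules of inference, supported by an empirical study.

Details Motivation: Satisfiability is the most fundamental task in natural language reasoning, but prior work has not adequately discussed how different computational complexity classes and grammatical constructs affect the learning ability of TLMs.

Contribution: 1. Studies the distribution of natural language satisfiability problem instances; 2. Analyzes how computational complexity classes and grammatical constructs affect the reasoning ability of TLMs; 3. Proposes an empirical approach for faithfully evaluating TLMs.

Method: 1. Classifies natural language satisfiability instances into computational complexity classes; 2. Designs experiments that evaluate TLMs across grammatical constructs and complexity levels; 3. Analyzes the problem distribution and its effect on TLMs.

Result: TLM performance is significantly affected by the computational complexity and grammatical constructs of the problem instances.

Insight: The complexity of natural language satisfiability challenges the learning and reasoning abilities of TLMs; future work should pay closer attention to how problem distribution and complexity affect models.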

Abstract: Efforts to apply transformer-based language models (TLMs) to the problem of reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area to which nearly all others can be reduced is that of determining satisfiability. However, from a logical point of view, satisfiability problems vary along various dimensions, which may affect TLMs’ ability to learn how to solve them. The problem instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored the problem of natural language satisfiability, the above-mentioned point has not been discussed adequately. Hence, we investigate how problem instances from varying computational complexity classes and having different grammatical constructs impact TLMs’ ability to learn rules of inference. Furthermore, to faithfully evaluate TLMs, we conduct an empirical study to explore the distribution of satisfiability problems.

[26] SPORTSQL: An Interactive System for Real-Time Sports Reasoning and Visualization

Sebastian Martinez,Naman Ahuja,Fenil Bardoliya,Chris Bryan,Vivek Gupta

Main category: cs.CL

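TL;DR: SPORTSQL is an interactive system that lets users query real-time sports data in natural language (using the English Premier League as the example), translates the questions into executable SQL, and returns tabular and visual outputs.

Details Motivation: Real-time analysis and visualization of sports data usually require specialist skills such as SQL. SPORTSQL aims to lower that barrier with a natural language interface so that non-expert users can explore dynamic sports statistics.

Contribution: 1. Presents the SPORTSQL system, which supports natural language querying and visualization; 2. Introduces the Dynamic Sport Question Answering benchmark (DSQABENCH), with 1,700+ annotated queries; 3. Uses the symbolic reasoning capabilities of LLMs for query parsing and visualization selection.

Method: 1. Builds a database from real-time FPL data; 2. Uses LLMs to translate natural language questions into SQL; 3. Supports tabular and visual outputs.

Result: System performance is evaluated on DSQABENCH, and the demo shows how non-expert users can explore sports data through a natural language interface.

Insight: Leveraging the symbolic reasoning capabilities of LLMs can make natural language query systems considerably more practical, especially for dynamic data.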

Abstract: We present a modular, interactive system, SPORTSQL, for natural language querying and visualization of dynamic sports data, with a focus on the English Premier League (EPL). The system translates user questions into executable SQL over a live, temporally indexed database constructed from real-time Fantasy Premier League (FPL) data. It supports both tabular and visual outputs, leveraging the symbolic reasoning capabilities of Large Language Models (LLMs) for query parsing, schema linking, and visualization selection. To evaluate system performance, we introduce the Dynamic Sport Question Answering benchmark (DSQABENCH), comprising 1,700+ queries annotated with SQL programs, gold answers, and database snapshots. Our demo highlights how non-expert users can seamlessly explore evolving sports statistics through a natural, conversational interface.

[27] Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models

Xudong Han,Junjie Yang,Tianyang Wang,Ziqian Bi,Junfeng Hao,Junhao Song

Main category: cs.CL

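TL;DR: This survey covers the full pipeline by which instruction tuning aligns large language models with human intentions, including data collection, fine-tuning strategies, and evaluation, and looks ahead to directions such as automated data generation and adaptive optimization.

Details Motivation: Instruction tuning is a key technique for aligning large language models with human intentions, but its data collection, fine-tuning strategies, and evaluation methods have lacked a systematic summary.

Contribution: Provides an end-to-end survey of instruction tuning, categorizes data construction into three major paradigms, summarizes progress in lightweight fine-tuning techniques, and identifies challenges in domain-specific evaluation and in multilingual and multimodal scenarios.

Method: Analyzes three data collection paradigms (expert annotation, distillation, and self-improvement) and lightweight fine-tuning techniques such as LoRA and prefix tuning, and examines the evaluation of faithfulness, utility, and safety.

Result: Summarizes the trade-offs among quality, scalability, and resource cost in instruction tuning and proposes future research directions in automated data generation and adaptive optimization.

Insight: Instruction tuning needs a closer integration of data, algorithms, and human feedback to produce models that are aligned with human intentions more efficiently and reliably.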

Abstract: Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline, encompassing (i) data collection methodologies, (ii) full-parameter and parameter-efficient fine-tuning strategies, and (iii) evaluation protocols. We categorized data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms, each offering distinct trade-offs between quality, scalability, and resource cost. Fine-tuning techniques range from conventional supervised training to lightweight approaches, such as low-rank adaptation (LoRA) and prefix tuning, with a focus on computational efficiency and model reusability. We further examine the challenges of evaluating faithfulness, utility, and safety across multilingual and multimodal scenarios, highlighting the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Finally, we discuss promising directions for automated data generation, adaptive optimization, and robust evaluation frameworks, arguing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. This survey aims to serve as a practical reference for researchers and practitioners seeking to design LLMs that are both effective and reliably aligned with human intentions.

[28] SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation

Xiaqiang Tang,Yi Wang,Keyu Hu,Rui Xu,Chuang Li,Weigao Sun,Jian Li,Sihong Xie

Main category: cs.CL

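TL;DR: The paper proposes SSFO, a self-supervised optimization method that contrasts the model's outputs generated with and without context and uses Direct Preference Optimization (DPO) to improve the faithfulness of Retrieval-Augmented Generation (RAG) systems, with no extra labeling cost or inference burden.

Details Motivation: Current RAG systems still hallucinate when they should generate answers faithful to the retrieved context, and existing remedies usually require costly supervised training or add inference burden.

Contribution: Proposes SSFO, the first self-supervised alignment approach for RAG faithfulness. By constructing preference pairs and using a modified DPO loss, it substantially improves faithfulness and generalizes well to cross-lingual tasks.

Method: SSFO builds preference pairs by contrasting the model's outputs with and without context, optimizes faithfulness with DPO, and proposes a modified DPO loss that encourages likelihood displacement.

Result: SSFO achieves state-of-the-art faithfulness on multiple context-based question-answering datasets, shows cross-lingual generalization, and preserves general instruction-following capabilities.

Insight: SSFO relies on a benign form of likelihood displacement that shifts probability mass from parametric-based tokens to context-aligned tokens, which improves generation faithfulness.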

Abstract: Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of likelihood displacement, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at the anonymous link: https://github.com/chkwy/SSFO
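The pair construction and the preference objective can be sketched as follows. The loss shown is the standard DPO loss, not the modified loss the paper proposes (its form is not given in the abstract). The `generate` callable, the prompt layout, `beta`, and the toy log-probabilities are assumptions for illustration only.

```python
import math


def build_preference_pair(question, context, generate):
    """Self-supervised pair: answer with context is 'chosen', without is 'rejected'.

    `generate(question, context_or_None)` is a hypothetical stand-in for the model.
    """
    return {
        "prompt": f"Context: {context}\nQuestion: {question}",
        "chosen": generate(question, context),
        "rejected": generate(question, None),
    }


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def fake_generate(question, context):
    # Toy model: repeats the context if given, otherwise answers from "memory".
    return context if context is not None else "parametric guess"


pair = build_preference_pair("Who wrote the report?", "Alice wrote the report.",
                             fake_generate)
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)
print(pair, neutral, improved)
```

No human labels are involved: the preference signal comes entirely from whether the context was available at generation time.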

Siying Zhou,Yiquan Wu,Hui Chen,Xavier Hu,Kun Kuang,Adam Jatowt,Ming Hu,Chunyan Zheng,Fei Wu

Main category: cs.CL

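TL;DR: The paper builds ClaimGen-CN, the first dataset for Chinese legal claim generation, proposes an evaluation metric covering the factuality and clarity of generated claims, and evaluates the limitations of current state-of-the-art models.

Details Motivation: Existing research mostly aims to improve the efficiency of legal professionals and neglects help for non-professionals such as plaintiffs. The paper fills the gap in generating legal claims from case facts.

Contribution: 1. Builds ClaimGen-CN, the first dataset for Chinese legal claim generation; 2. Designs an evaluation metric covering factuality and clarity; 3. Evaluates the limitations of existing large language models on this task.

Method: Constructs the dataset from real-world legal disputes, designs an evaluation metric for factuality and clarity, and runs a zero-shot evaluation of general and legal-domain large language models.

Result: Current models fall short in factual precision and expressive clarity and need targeted improvement.

Insight: Legal claim generation must attend to factuality and clarity, and existing models still have considerable room to improve in this domain.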

Abstract: Legal claims refer to the plaintiff’s demands in a case and are essential to guiding judicial reasoning and case resolution. While many works have focused on improving the efficiency of legal professionals, the research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper explores the problem of legal claim generation based on the given case’s facts. First, we construct ClaimGen-CN, the first dataset for Chinese legal claim generation task, from various real-world legal disputes. Additionally, we design an evaluation metric tailored for assessing the generated claims, which encompasses two essential dimensions: factuality and clarity. Building on this, we conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Our findings highlight the limitations of the current models in factual precision and expressive clarity, pointing to the need for more targeted development in this domain. To encourage further exploration of this important task, we will make the dataset publicly available.

[30] Routing Distilled Knowledge via Mixture of LoRA Experts for Large Language Model based Bundle Generation

Kaidong Feng,Zhu Sun,Hui Fang,Jie Yang,Wenyuan Liu,Yew-Soon Ong

Main category: cs.CL

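TL;DR: RouteDK proposes a framework that routes distilled knowledge through a mixture of LoRA experts architecture to resolve knowledge conflict when large language models are used for bundle generation, while remaining computationally efficient.

Details Motivation: Large language models show potential in bundle generation but are computationally expensive, and naively integrating diverse types of distilled knowledge leads to knowledge conflict that hurts performance.

Contribution: 1. Proposes the RouteDK framework, which routes distilled knowledge through a dynamic fusion module and a LoRA expert architecture; 2. Designs an inference-time enhancement module to reduce variance and mitigate suboptimal reasoning.

Method: 1. Distills high-level and fine-grained knowledge from the teacher LLM; 2. Trains knowledge-specific LoRA experts together with a base LoRA expert; 3. Uses a dynamic fusion module whose input-aware router balances expert contributions.

Result: On three public datasets, RouteDK achieves accuracy comparable to or better than the teacher LLM while remaining computationally efficient, and it outperforms existing bundle generation methods.

Insight: Dynamic routing combined with a mixture-of-experts design can mitigate knowledge conflict, improving performance while preserving efficiency.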

Abstract: Large Language Models (LLMs) have shown potential in automatic bundle generation but suffer from prohibitive computational costs. Although knowledge distillation offers a pathway to more efficient student models, our preliminary study reveals that naively integrating diverse types of distilled knowledge from teacher LLMs into student LLMs leads to knowledge conflict, negatively impacting the performance of bundle generation. To address this, we propose RouteDK, a framework for routing distilled knowledge through a mixture of LoRA expert architecture. Specifically, we first distill knowledge from the teacher LLM for bundle generation in two complementary types: high-level knowledge (generalizable rules) and fine-grained knowledge (session-specific reasoning). We then train knowledge-specific LoRA experts for each type of knowledge together with a base LoRA expert. For effective integration, we propose a dynamic fusion module, featuring an input-aware router, where the router balances expert contributions by dynamically determining optimal weights based on input, thereby effectively mitigating knowledge conflicts. To further improve inference reliability, we design an inference-time enhancement module to reduce variance and mitigate suboptimal reasoning. Experiments on three public datasets show that our RouteDK achieves accuracy comparable to or even better than the teacher LLM, while maintaining strong computational efficiency. In addition, it outperforms state-of-the-art approaches for bundle generation.
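The input-aware router can be pictured as a small gating function that turns the input into softmax weights over the experts and then blends their outputs. The sketch below is a generic mixture-of-experts gate, not RouteDK's actual module: the linear gate, the three toy "experts" (stand-ins for the base, high-level, and fine-grained LoRA experts), and all numbers are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(x, gate_weights, experts):
    """Blend expert outputs with input-dependent softmax weights.

    gate_weights: one row of gate parameters per expert (hypothetical linear gate).
    experts: callables mapping the input vector to an output vector.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    weights = softmax(logits)
    outputs = [expert(x) for expert in experts]
    fused = [sum(w * out[d] for w, out in zip(weights, outputs))
             for d in range(len(outputs[0]))]
    return fused, weights


# Toy stand-ins for the base, high-level, and fine-grained experts.
experts = [
    lambda x: [v for v in x],
    lambda x: [3.0 * v for v in x],
    lambda x: [0.0 for _ in x],
]

x = [1.0, 2.0]
uniform_fused, uniform_weights = route(x, [[0.0, 0.0]] * 3, experts)
skewed_fused, skewed_weights = route(x, [[0.0, 0.0], [5.0, 5.0], [0.0, 0.0]], experts)
print(uniform_fused, uniform_weights)
print(skewed_fused, skewed_weights)
```

Because the weights depend on the input, different sessions can lean on different kinds of distilled knowledge instead of averaging them uniformly.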

[31] Are You Sure You’re Positive? Consolidating Chain-of-Thought Agents with Uncertainty Quantification for Aspect-Category Sentiment Analysis

Filippos Ventirozos,Peter Appleby,Matthew Shardlow

Main category: cs.CL

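TL;DR: The paper proposes techniques that combine multiple chain-of-thought agents using large language models' token-level uncertainty scores for zero-shot aspect-category sentiment analysis, addressing data scarcity and annotation bias.

Details Motivation: Supervised methods transfer poorly across domains, and annotated data is scarce and expensive. The authors use zero-shot learning and uncertainty quantification to avoid these problems.

Contribution: Proposes new techniques that combine multiple chain-of-thought agents with token-level uncertainty scores, enabling effective sentiment analysis in the zero-shot setting.

Method: Uses chain-of-thought agents built on large language models (such as Llama and Qwen) together with token-level uncertainty scores to consolidate the reasoning of multiple agents.

Result: Experiments with the 3B and 70B+ parameter variants show that the approach can meet practical needs, demonstrating its potential for real-world use.

Insight: Uncertainty quantification can make models more robust when data is scarce and suggests new ways to gauge zero-shot accuracy.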

Abstract: Aspect-category sentiment analysis provides granular insights by identifying specific themes within product reviews that are associated with particular opinions. Supervised learning approaches dominate the field. However, data is scarce and expensive to annotate for new domains. We argue that leveraging large language models in a zero-shot setting is beneficial where the time and resources required for dataset annotation are limited. Furthermore, annotation bias may lead to strong results using supervised methods but transfer poorly to new domains in contexts that lack annotations and demand reproducibility. In our work, we propose novel techniques that combine multiple chain-of-thought agents by leveraging large language models’ token-level uncertainty scores. We experiment with the 3B and 70B+ parameter size variants of Llama and Qwen models, demonstrating how these approaches can fulfil practical needs and opening a discussion on how to gauge accuracy in label-scarce conditions.
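One simple way to consolidate several chain-of-thought agents by token-level uncertainty is a confidence-weighted vote, sketched below. The abstract does not describe the paper's actual aggregation rule, so the confidence definition (exponential of the mean token log-probability of the answer), the function names, and the toy agent outputs are assumptions.

```python
import math
from collections import defaultdict


def agent_confidence(token_logprobs):
    """Geometric-mean token probability of the answer span (illustrative choice)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consolidate(agent_outputs):
    """Confidence-weighted vote over (label, answer_token_logprobs) pairs."""
    votes = defaultdict(float)
    for label, token_logprobs in agent_outputs:
        votes[label] += agent_confidence(token_logprobs)
    winner = max(votes, key=votes.get)
    return winner, dict(votes)


# Made-up outputs from three agents for one (review, aspect category) pair.
agent_outputs = [
    ("positive", [-0.1, -0.1]),   # confident agent
    ("negative", [-2.0, -2.0]),   # uncertain agent
    ("negative", [-2.0, -2.0]),   # uncertain agent
]
label, votes = consolidate(agent_outputs)
print(label, votes)
```

In this toy case a plain majority vote would return "negative", while the uncertainty-weighted vote favors the single confident agent.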

[32] From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users

Sadia Sultana Chowa,Riasad Alvi,Subhey Sadi Rahman,Md Abdur Rahman,Mohaimenul Azam Khan Raiaan,Md Rafiqul Islam,Mukhtar Hussain,Sami Azam

Main category: cs.CL

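TL;DR: This review analyzes research published between 2023 and 2025 on large language models (LLMs) as autonomous agents and tool users, focusing on architectural design, cognitive mechanisms, and performance evaluation, and proposes future research directions.

Details Motivation: To explore how LLMs can be elevated into intelligent agents with decision-making ability and autonomy, advancing the pursuit of human-level AI.

Contribution: 1. Systematically categorizes the architectural design of LLMs as autonomous agents (single-agent vs. multi-agent); 2. Examines cognitive mechanisms (reasoning, planning, memory) and strategies for integrating external tools; 3. Evaluates 68 publicly available datasets and existing benchmarks; 4. Proposes ten future research directions.

Method: Through structured analysis, divides LLM agent architectures into single-agent and multi-agent systems and examines how prompting methods and fine-tuning affect performance.

Result: Summarizes findings on the verifiable reasoning of LLMs, their capacity for self-improvement, and the personalization of LLM-based agents, and identifies limitations of existing research.

Insight: The capabilities of LLM agents depend strongly on their cognitive mechanisms and tool-integration strategies; future work should focus on improving interpretability and adaptability.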

Abstract: The pursuit of human-level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely utilized as decision-making agents for their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and comprises seven research questions. We only used the papers published between 2023 and 2025 in conferences of the A* and A rank and Q1 journals. A structured analysis of the LLM agents’ architectural design principles, dividing their applications into single-agent and multi-agent systems, and strategies for integrating external tools is presented. In addition, the cognitive mechanisms of LLM, including reasoning, planning, and memory, and the impact of prompting methods and fine-tuning procedures on agent performance are also investigated. Furthermore, we evaluated current benchmarks and assessment protocols and have provided an analysis of 68 publicly available datasets to assess the performance of LLM-based agents in various tasks. In conducting this review, we have identified critical findings on verifiable reasoning of LLMs, the capacity for self-improvement, and the personalization of LLM-based agents. Finally, we have discussed ten future research directions to overcome these gaps.

[33] Handling Students Dropouts in an LLM-driven Interactive Online Course Using Language Models

Yuanchun Wang,Yiyang Fu,Jifan Yu,Daniel Zhang-Li,Zheyuan Zhang,Joy Lim Jia Yin,Yucheng Wang,Peng Zhou,Jing Zhang,Huiqin Liu

Main category: cs.CL

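TL;DR: This paper studies how to handle student dropouts in LLM-driven interactive online courses and proposes an adaptive dropout prediction framework and a personalized recall agent.

Details Motivation: As interactive online learning environments develop, reducing student dropout has become a key problem. The paper aims to improve the learning experience by analyzing dropout factors and by predicting and preventing dropouts.

Contribution: The main contributions are a course-progress-adaptive dropout prediction framework (CPADP) and a personalized email recall agent for identifying and re-engaging at-risk students.

Method: Analyzes interaction logs to define dropout behavior and identify contributing factors, then proposes the CPADP framework for prediction, and finally designs an email-based recall agent for intervention.

Result: CPADP predicts dropouts with up to 95.4% accuracy, and deployment with over 3,000 students validates the feasibility and effectiveness of the approach.

Insight: Dropout behavior is closely linked to textual interaction patterns, and language models enable personalized interventions to reduce dropout.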

Abstract: Interactive online learning environments, represented by Massive AI-empowered Courses (MAIC), leverage LLM-driven multi-agent systems to transform passive MOOCs into dynamic, text-based platforms, enhancing interactivity through LLMs. This paper conducts an empirical study on a specific MAIC course to explore three research questions about dropouts in these interactive online courses: (1) What factors might lead to dropouts? (2) Can we predict dropouts? (3) Can we reduce dropouts? We analyze interaction logs to define dropouts and identify contributing factors. Our findings reveal strong links between dropout behaviors and textual interaction patterns. We then propose a course-progress-adaptive dropout prediction framework (CPADP) to predict dropouts with at most 95.4% accuracy. Based on this, we design a personalized email recall agent to re-engage at-risk students. Applied in the deployed MAIC system with over 3,000 students, the feasibility and effectiveness of our approach have been validated on students with diverse backgrounds.

[34] Omne-R1: Learning to Reason with Memory for Multi-hop Question Answering

Boyuan Liu,Feng Ji,Jiayan Nan,Han Zhao,Weiling Chen,Shihao Xu,Xing Zhou

Main category: cs.CL

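TL;DR: Omne-R1 proposes a multi-stage training framework that combines reinforcement learning and supervised fine-tuning to improve multi-hop question answering on schema-free knowledge graphs. It addresses data scarcity by constructing domain-independent knowledge graphs and auto-generating QA pairs, and it significantly improves performance on complex multi-hop questions.

Details Motivation: Existing methods are limited in multi-hop question answering on schema-free knowledge graphs, mainly because suitable knowledge graphs and QA data are lacking. Omne-R1 aims to solve this with a new training framework and automatically generated data.

Contribution: 1. Proposes Omne-R1, a multi-stage training method that combines reinforcement learning and supervised fine-tuning. 2. Constructs domain-independent knowledge graphs and auto-generates QA pairs to address data scarcity. 3. Shows significant improvements on complex multi-hop questions.

Method: 1. A multi-stage training workflow with two reinforcement learning phases and one supervised fine-tuning phase. 2. Construction of domain-independent knowledge graphs and auto-generation of QA pairs. 3. Integration of reasoning models to strengthen multi-hop question answering.

Result: Experiments show that Omne-R1 performs strongly on complex multi-hop questions, especially those with 3+ hops, and generalizes well across knowledge domains.

Insight: 1. Combining reinforcement learning with supervised fine-tuning effectively improves multi-hop question answering. 2. Auto-generated data is a viable answer to data scarcity. 3. Domain-independent knowledge graphs help models generalize.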

Abstract: This paper introduces Omne-R1, a novel approach designed to enhance multi-hop question answering capabilities on schema-free knowledge graphs by integrating advanced reasoning models. Our method employs a multi-stage training workflow, including two reinforcement learning phases and one supervised fine-tuning phase. We address the challenge of limited suitable knowledge graphs and QA data by constructing domain-independent knowledge graphs and auto-generating QA pairs. Experimental results show significant improvements in answering multi-hop questions, with notable performance gains on more complex 3+ hop questions. Our proposed training framework demonstrates strong generalization abilities across diverse knowledge domains.

[35] DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Haojie Zhang

Main category: cs.CL

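TL;DR: DropLoRA is a novel parameter-efficient fine-tuning method that inserts a pruning module between LoRA's low-rank matrices to achieve dynamic subspace learning, significantly improving performance.

Details Motivation: Conventional LoRA learns within a static subspace, which limits fine-tuning performance. DropLoRA aims to overcome this bottleneck through dynamic subspace learning.

Contribution: Proposes DropLoRA, which combines pruning with LoRA to achieve efficient dynamic low-rank subspace learning and improve downstream task performance.

Method: Inserts a pruning module between LoRA's two low-rank matrices to adjust the learning subspace dynamically, at no additional training or inference cost.

Result: DropLoRA significantly outperforms conventional LoRA on a range of large language model tasks (e.g., commonsense reasoning, mathematical reasoning, code generation).

Insight: Dynamic subspace learning is key to improving parameter-efficient fine-tuning, and a pruning mechanism can effectively simulate it.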

Abstract: LoRA-based large model parameter-efficient fine-tuning (PEFT) methods use low-rank decomposition to approximate updates to model parameters. However, compared to full-parameter fine-tuning, low-rank updates often lead to a performance gap in downstream tasks. To address this, we introduce DropLoRA, a novel pruning-based approach that focuses on pruning the rank dimension. Unlike conventional methods that attempt to overcome the low-rank bottleneck, DropLoRA innovatively integrates a pruning module between the two low-rank matrices in LoRA to simulate dynamic subspace learning. This dynamic low-rank subspace learning allows DropLoRA to overcome the limitations of traditional LoRA, which operates within a static subspace. By continuously adapting the learning subspace, DropLoRA significantly boosts performance without incurring additional training or inference costs. Our experimental results demonstrate that DropLoRA consistently outperforms LoRA in fine-tuning the LLaMA series across a wide range of large language model generation tasks, including commonsense reasoning, mathematical reasoning, code generation, and instruction-following. Our code is available at https://github.com/TayeeChang/DropLoRA.
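
The idea of pruning the rank dimension between the two LoRA matrices can be illustrated with a small sketch. It assumes that the pruning module is a random 0/1 mask over the rank dimension, resampled during training; the abstract does not specify the mask rule, so `sample_rank_mask`, `keep_prob`, and the toy matrices are hypothetical and this is not the authors' implementation (the linked repository is the reference for that).

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_delta(x, A, B, mask=None):
    """Low-rank update B @ (mask * (A @ x)).

    A has shape (r, d_in) and B has shape (d_out, r). `mask` is a
    hypothetical 0/1 vector of length r that prunes rank dimensions.
    """
    hidden = matvec(A, x)  # project the input into the rank-r subspace
    if mask is not None:
        hidden = [m * h for m, h in zip(mask, hidden)]  # zero out pruned ranks
    return matvec(B, hidden)  # project back to the output dimension

def sample_rank_mask(rank, keep_prob, rng):
    """Assumed pruning rule: keep each rank dimension independently."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(rank)]

A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # r = 2, d_in = 3
B = [[1.0, 1.0], [0.0, 2.0]]            # d_out = 2, r = 2
x = [1.0, 2.0, 3.0]

full = lora_delta(x, A, B)                     # static subspace (plain LoRA)
pruned = lora_delta(x, A, B, mask=[1.0, 0.0])  # second rank dimension dropped
step_mask = sample_rank_mask(2, keep_prob=0.5, rng=random.Random(0))
```

Because a different mask selects a different subset of rank directions at each step, the update moves through varying subspaces of the same rank-r factorization. Under this assumption no mask would be applied at inference, which is one way the "no additional inference cost" claim could hold.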

Ryoma Kondo,Riona Matsuoka,Takahiro Yoshida,Kazuyuki Yamasawa,Ryohei Hisano

Main category: cs.CL

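TL;DR: The paper proposes a knowledge-graph-based method that extracts legal reasoning paths from 648 Japanese administrative court decisions. It makes implicit legal reasoning explicit and normalized, and retrieves relevant legal provisions more accurately than existing large language model baselines.

Details Motivation: Existing automated approaches, including large language models, struggle to capture the legal context and the links between facts and legal norms, which limits our ability to understand how courts actually apply the law.

Contribution: By constructing a legal knowledge graph, the paper makes the structure of legal reasoning explicit and enables more accurate retrieval of legal provisions.

Method: The method uses prompt-based large language models to extract components of legal reasoning, normalizes references to legal provisions, and links facts, norms, and legal applications through an ontology of legal inference.

Result: Evaluation on expert-annotated data shows that the method retrieves legal provisions more accurately than large language model baselines and retrieval-augmented methods.

Insight: Legal knowledge graphs can compensate for the weaknesses of large language models in structured legal reasoning, and making reasoning paths explicit helps machines understand and apply legal reasoning.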

Abstract: Court judgments reveal how legal rules have been interpreted and applied to facts, providing a foundation for understanding structured legal reasoning. However, existing automated approaches for capturing legal reasoning, including large language models, often fail to identify the relevant legal context, do not accurately trace how facts relate to legal norms, and may misrepresent the layered structure of judicial reasoning. These limitations hinder the ability to capture how courts apply the law to facts in practice. In this paper, we address these challenges by constructing a legal knowledge graph from 648 Japanese administrative court decisions. Our method extracts components of legal reasoning using prompt-based large language models, normalizes references to legal provisions, and links facts, norms, and legal applications through an ontology of legal inference. The resulting graph captures the full structure of legal reasoning as it appears in real court decisions, making implicit reasoning explicit and machine-readable. We evaluate our system using expert annotated data, and find that it achieves more accurate retrieval of relevant legal provisions from facts than large language model baselines and retrieval-augmented methods.

[37] The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness

Sanad Shaban,Nizar Habash

Main category: cs.CL

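TL;DR: The paper proposes the Arabic Generality Score (AGS), a complementary measure of Arabic dialectness that quantifies lexical generality using word alignment, etymology-aware edit distance, and smoothing.

Details Motivation: Existing models of Arabic dialectness (such as ALDi) treat dialectness as a single continuous dimension and overlook lexical generality. AGS adds this dimension in order to model dialectal complexity more fully.

Contribution: 1. Proposes AGS, which quantifies how widely a word is used across dialects; 2. Develops an annotation pipeline that combines word alignment, etymology-aware edit distance, and smoothing; 3. Trains a regression model to predict AGS in context.

Method: 1. Annotates a parallel corpus using word alignment and etymology-aware edit distance; 2. Refines the annotations with smoothing; 3. Trains a regression model to predict AGS in context.

Result: AGS outperforms existing methods, including state-of-the-art dialect identification systems, on a multi-dialect benchmark, supporting its effectiveness and scalability.

Insight: AGS adds a new dimension to the modeling of Arabic dialects: by quantifying lexical generality, it enriches continuous representations of dialectness.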

Abstract: Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness.

[38] UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

Omer Nacar

Main category: cs.CL

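TL;DR: The paper presents a UI-level evaluation of the Arabic-centric large language model ALLaM-34B, showing strong performance in generation, code-switching, Modern Standard Arabic (MSA) handling, reasoning, and dialect fidelity.

Details Motivation: Existing large language models (LLMs) are trained primarily on English corpora and struggle to capture the linguistic and cultural nuances of Arabic, so Arabic-focused models are needed.

Contribution: Through an extensive UI-level evaluation, the paper confirms ALLaM-34B's strong performance on Arabic tasks, covering generation, code-switching, dialect handling, and safety.

Method: Using a prompt pack that spans multiple Arabic tasks (MSA, dialects, code-switching, and others), the author collected 115 outputs and scored them with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4).

Result: ALLaM-34B scores highest on generation (4.92/5) and code-switching (4.92/5), with strong results on MSA (4.74/5), reasoning (4.64/5), and dialect fidelity (4.21/5).

Insight: ALLaM-34B is a practical Arabic LLM that adapts well culturally and linguistically and is suitable for real-world deployment.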

Abstract: Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning modern standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts times 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
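
A category-level mean with a 95% confidence interval can be computed as in the sketch below. The abstract does not say which interval method was used, so the normal approximation (z = 1.96) is an assumption, and the scores and category names are made-up placeholders rather than data from the paper.

```python
import math
import statistics

def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, mean, mean  # no spread can be estimated from one score
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * sem, mean + z * sem

# Hypothetical judge scores on a 1-5 scale, grouped by category.
scores_by_category = {
    "generation": [5, 5, 5, 4, 5],
    "dialect": [4, 5, 3, 4, 4],
}
summary = {name: mean_with_ci(vals) for name, vals in scores_by_category.items()}
```

With only a handful of scores per category, a t-based or bootstrap interval would be wider and arguably more appropriate; the normal approximation is used here only to keep the sketch short.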

[39] DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards

Aaryaman Kartha,Ahmed Masry,Mohammed Saidul Islam,Thinh Lang,Shadikur Rahman,Ridwan Mahbub,Mizanur Rahman,Mahir Ahmed,Md Rizwan Parvez,Enamul Hoque,Shafiq Joty

Main category: cs.CL

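TL;DR: The paper introduces DashboardQA, the first benchmark designed specifically to evaluate how vision-language GUI agents understand and interact with real-world dashboards, covering multiple question types and interaction scenarios.

Details Motivation: Existing question-answering benchmarks for data visualizations focus on static charts and overlook dashboard interactivity, so they cannot adequately evaluate modern multimodal agents on GUI-based reasoning.

Contribution: Introduces the DashboardQA benchmark, comprising 112 interactive dashboards and 405 question-answer pairs, which fills a gap left by existing benchmarks.

Method: Evaluates a range of leading closed- and open-source GUI agents and analyzes their performance in grounding dashboard elements, planning interaction trajectories, and reasoning.

Result: Even the best-performing agent (based on Gemini-Pro-2.5) reaches only 38.69% accuracy, showing that interactive dashboard reasoning is a highly challenging task.

Insight: Interactive dashboard reasoning remains a major challenge for current multimodal agents, particularly in element grounding and complex interaction planning.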

Abstract: Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 112 interactive dashboards from Tableau Public and 405 question-answer pairs with interactive dashboards spanning five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task overall for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark’s significant difficulty. We release DashboardQA at https://github.com/vis-nlp/DashboardQA

[40] Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

Hyeong Kyu Choi,Xiaojin Zhu,Yixuan Li

Main category: cs.CL

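TL;DR: The paper examines how much majority voting and debate each contribute to decision quality in Multi-Agent Debate (MAD) with large language models. It finds that majority voting is the main source of the performance gains and proposes a theoretical framework proving that debate alone does not improve expected correctness.

Details Motivation: Multi-Agent Debate (MAD) is widely used to improve model performance, but its core mechanism remains unclear. The paper aims to separate the effects of majority voting and debate in order to identify where MAD's effectiveness comes from.

Contribution: Decomposes MAD into majority voting and debate and verifies the dominant role of majority voting; proposes a theoretical framework proving that debate does not improve correctness; designs interventions that make debate more effective.

Method: Compares the contributions of majority voting and debate across seven NLP benchmarks, proposes a stochastic-process framework for analyzing the effect of debate, and designs interventions that bias belief updates toward correction.

Result: Majority voting accounts for most of MAD's performance gains; debate alone does not improve expected correctness; targeted interventions can enhance debate effectiveness.

Insight: In practice, simple ensembling methods such as majority voting may be more reliable than complex debate; debate needs targeted design to improve performance.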

Abstract: Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components, Majority Voting and inter-agent Debate, and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at https://github.com/deeplearning-wisc/debate-or-vote.
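
The Majority Voting component amounts to sampling an answer from each agent independently and returning the most frequent one. A minimal sketch follows; the function name and the example answers are illustrative assumptions and are not taken from the released code.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the agents' final answers.

    Ties are broken in favor of the answer that appeared first, which is
    how Counter.most_common orders equal counts. Raises IndexError when
    `answers` is empty.
    """
    counts = Counter(answers)
    top_answer, _ = counts.most_common(1)[0]
    return top_answer

# Hypothetical final answers from five independently sampled agents.
agent_answers = ["B", "A", "B", "C", "B"]
final = majority_vote(agent_answers)
```

No inter-agent communication is involved here, which is what makes this baseline cheap compared with running several debate rounds.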

[41] Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design

Yunze Xiao,Lynnette Hui Xian Ng,Jiarui Liu,Mona T. Diab

Main category: cs.CL

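TL;DR: The paper argues that anthropomorphism in large language models (LLMs) should be treated as a design concept rather than viewed only in terms of risk, and proposes a multi-dimensional framework to guide design.

Details Motivation: Current research focuses mainly on the risks of anthropomorphism (such as over-trust and deception) and offers little systematic guidance on using it as a design tool. The paper aims to fill this gap.

Contribution: Proposes a multi-dimensional framework (perceptive, linguistic, behavioral, cognitive) for classifying and evaluating anthropomorphic design, and emphasizes function-oriented evaluation.

Method: Takes an interdisciplinary perspective to analyze the four dimensions of anthropomorphic cues and the designer-interpreter interaction around them, providing actionable design levers.

Result: Provides practitioners with a unified taxonomy of anthropomorphism and practical guidance, supporting user goals more effectively.

Insight: Anthropomorphic design should focus on matching function to user goals rather than simply pursuing or avoiding human-like features.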

Abstract: Large Language Models (LLMs) increasingly exhibit anthropomorphism characteristics: human-like qualities portrayed across their outlook, language, behavior, and reasoning functions. Such characteristics enable more intuitive and engaging human-AI interactions. However, current research on anthropomorphism remains predominantly risk-focused, emphasizing over-trust and user deception while offering limited design guidance. We argue that anthropomorphism should instead be treated as a concept of design that can be intentionally tuned to support user goals. Drawing from multiple disciplines, we propose that the anthropomorphism of an LLM-based artifact should reflect the interaction between artifact designers and interpreters. This interaction is facilitated by cues embedded in the artifact by the designers and the (cognitive) responses of the interpreters to the cues. Cues are categorized into four dimensions: perceptive, linguistic, behavioral, and cognitive. By analyzing the manifestation and effectiveness of each cue, we provide a unified taxonomy with actionable levers for practitioners. Consequently, we advocate for function-oriented evaluations of anthropomorphic design.

[42] UQ: Assessing Language Models on Unsolved Questions

Fan Nie,Ken Ziyu Liu,Zihao Wang,Rui Sun,Wei Liu,Weijia Shi,Huaxiu Yao,Linjun Zhang,Andrew Y. Ng,James Zou,Sanmi Koyejo,Yejin Choi,Percy Liang,Niklas Muennighoff

Main category: cs.CL

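TL;DR: UQ proposes a new paradigm for evaluating language models: testing them on unsolved questions. It combines rule-based filters, LLM judges, and human review to build a challenging and realistic testbed, and pairs it with validation strategies and community collaboration to evaluate frontier models.

Details Motivation: Current benchmarks face a tension between difficulty and realism: exam-style benchmarks are hard but have limited real-world value, while benchmarks based on user interaction skew toward easy problems. By evaluating models on real unsolved questions, UQ aims to resolve this tension and measure models on genuine challenges.

Contribution: 1. Builds UQ-Dataset and its collection pipeline, which ensures question quality; 2. Proposes UQ-Validators, compound validation strategies that exploit the generator-validator gap to provide evaluation signals; 3. Develops UQ-Platform, an open platform that supports collaborative verification by an expert community.

Method: Builds the dataset by combining rule-based filters, LLM judges, and human review; designs validation strategies that exploit the generator-validator gap; and enables community verification through an open platform.

Result: The best-performing model passes validation on only 15% of the questions, but preliminary human verification has already identified some correct answers among them.

Insight: By evaluating models on unsolved questions, UQ both pushes model capabilities and delivers direct value at the frontier of human knowledge, suggesting a new direction for future evaluation paradigms.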

Abstract: Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.

[43] EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems

Jingwen Liu,Kan Jen Cheng,Jiachen Lian,Akshay Anand,Rishi Jain,Faith Qiao,Robin Netzorg,Huang-Cheng Chou,Tingle Li,Guan-Ting Lin,Gopala Anumanchipalli

Main category: cs.CL

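TL;DR: The paper introduces EMO-Reasoning, a benchmark for evaluating emotional reasoning in spoken dialogue systems, addressing the lack of emotional-coherence evaluation in existing systems. By generating diverse emotional speech data and introducing the Cross-turn Emotion Reasoning Score, it detects emotional inconsistencies across seven dialogue systems.

Details Motivation: Current spoken dialogue systems lack a systematic evaluation of emotional reasoning, which limits progress toward natural interaction. The authors aim to fill this gap with the EMO-Reasoning benchmark.

Contribution: 1. Proposes the EMO-Reasoning benchmark for evaluating emotional coherence in dialogue systems. 2. Generates diverse emotional data via text-to-speech, addressing data scarcity. 3. Designs the Cross-turn Emotion Reasoning Score to assess emotion transitions in multi-turn dialogues.

Method: 1. Generates an emotional speech dataset using text-to-speech. 2. Proposes the Cross-turn Emotion Reasoning Score. 3. Evaluates the emotional coherence of seven dialogue systems with continuous, categorical, and perceptual metrics.

Result: Experiments show that the EMO-Reasoning framework effectively detects emotional inconsistencies and offers insights for improving existing systems.

Insight: Emotional coherence is a key challenge for multi-turn dialogue systems, and systematic evaluation tools help advance emotion-aware dialogue models.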

Abstract: Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.

[44] Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit

Zihao Wei,Liang Pang,Jiahao Liu,Jingcheng Deng,Shicheng Xu,Zenghao Duan,Jingang Wang,Fei Sun,Xunliang Cai,Huawei Shen,Xueqi Cheng

Main category: cs.CL

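TL;DR: The paper proposes a pattern-mining method for exiting the reasoning process early, which mitigates overthinking in large language models (LLMs), saving resources and improving efficiency.

Details Motivation: Existing LLMs tend to overthink on complex reasoning tasks, which wastes compute and can also degrade performance. The authors observe distinct stages in the reasoning process and argue that detecting the Reasoning Completion Point (RCP) allows reasoning to be terminated at the right time.

Contribution: 1. Divides LLM reasoning into three stages and introduces the Reasoning Completion Point (RCP); 2. Proposes a lightweight thresholding strategy based on heuristic rules for detecting the RCP; 3. Shows experimentally that the method reduces token consumption while preserving or improving reasoning accuracy.

Method: 1. Divides reasoning into three stages by analyzing patterns in thinking length and content; 2. Considers RCP detection strategies, including sentence-by-sentence querying and monitoring the probability of an end-of-thinking token; 3. Develops a lightweight thresholding method based on heuristic rules to improve RCP detection.

Result: On benchmarks including AIME24, AIME25, and GPQA-D, the method significantly reduces token consumption while reasoning accuracy is preserved or improved.

Insight: Longer reasoning is not always better: terminating at the right point avoids wasted resources and performance degradation, and pattern mining combined with thresholding can identify the reasoning completion point efficiently and precisely.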

Abstract: Large language models (LLMs) enhance complex reasoning tasks by scaling the individual thinking process. However, prior work shows that overthinking can degrade overall performance. Motivated by observed patterns in thinking length and content length, we categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage. Typically, LLMs produce correct answers in the compensatory reasoning stage, whereas reasoning convergence often triggers overthinking, causing increased resource usage or even infinite loops. Therefore, mitigating overthinking hinges on detecting the end of the compensatory reasoning stage, defined as the Reasoning Completion Point (RCP). RCP typically appears at the end of the first complete reasoning cycle and can be identified by querying the LLM sentence by sentence or monitoring the probability of an end-of-thinking token, though these methods lack an efficient and precise balance. To improve this, we mine more sensitive and consistent RCP patterns and develop a lightweight thresholding strategy based on heuristic rules. Experimental evaluations on benchmarks (AIME24, AIME25, GPQA-D) demonstrate that the proposed method reduces token consumption while preserving or enhancing reasoning accuracy.
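
The baseline idea of monitoring the probability of an end-of-thinking token can be sketched as a simple threshold rule. This illustrates only the generic monitoring approach mentioned in the abstract, not the paper's mined patterns or heuristic rules; the function name, the `threshold` and `patience` parameters, and the example probabilities are all hypothetical.

```python
def find_exit_step(end_token_probs, threshold=0.5, patience=1):
    """Return the first step at which reasoning could be stopped.

    `end_token_probs[i]` is the model's probability of emitting the
    end-of-thinking token after reasoning step i. The rule fires once the
    probability has stayed at or above `threshold` for `patience`
    consecutive steps; it returns None if that never happens.
    """
    streak = 0
    for step, prob in enumerate(end_token_probs):
        streak = streak + 1 if prob >= threshold else 0
        if streak >= patience:
            return step
    return None

# Hypothetical per-step probabilities read off during decoding.
probs = [0.01, 0.02, 0.60, 0.70, 0.10]
exit_eager = find_exit_step(probs, threshold=0.5, patience=1)
exit_cautious = find_exit_step(probs, threshold=0.5, patience=2)
```

A low `patience` exits sooner and saves more tokens but risks cutting off reasoning that has not yet converged, which is the efficiency-versus-precision trade-off the abstract refers to.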

[45] Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks

Danny Wang,Ruihong Qiu,Guangdong Bai,Zi Huang

Main category: cs.CL

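TL;DR: The paper proposes the TextTopoOOD framework and the TNT-OOD method to address out-of-distribution (OOD) detection in text-rich networks, particularly in complex settings where textual features and topological structure are intertwined.

Details Motivation: Existing OOD detection methods focus mainly on label shifts or simple domain-based splits and do not account for textual-structural diversity. In settings such as social networks, OOD samples may arise from complex interactions between text and topology, for example the different language patterns of bots and normal users.

Contribution: 1. Proposes the TextTopoOOD framework, which supports evaluation across diverse OOD scenarios; 2. Proposes TNT-OOD, which fuses textual and topological features through a cross-attention module and a HyperNetwork.

Method: 1. Creates attribute-level shifts via text augmentations and embedding perturbations; 2. Simulates structural shifts through edge rewiring and semantic connections; 3. Aligns semantic and topological features using a cross-attention module and a HyperNetwork.

Result: Experiments on 11 datasets across four OOD scenarios confirm that TextTopoOOD is challenging and demonstrate the effectiveness of TNT-OOD.

Insight: The interaction between text and topology is central to OOD detection, and methods that fuse the two are better at recognizing complex distribution shifts.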

Abstract: Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.

[46] EMPOWER: Evolutionary Medical Prompt Optimization With Reinforcement Learning

Yinda Chen,Yangfan He,Jing Yang,Dapeng Zhang,Zhenlong Yuan,Muhammad Attique Khan,Jamel Baili,Por Lip Yee

Main category: cs.CL

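TL;DR: EMPOWER is a reinforcement-learning-based evolutionary framework that optimizes medical prompt engineering through specialized medical representation learning and an evaluation architecture, significantly improving the reliability and clinical utility of large language models in medical applications.

Details Motivation: Current approaches to optimizing medical prompts do not adequately integrate domain-specific medical knowledge and safety requirements, so a more effective method is needed to improve the quality and clinical applicability of medical prompts.

Contribution: Introduces the EMPOWER framework, comprising a medical terminology attention mechanism, a multi-dimensional evaluation architecture, a component-level evolutionary algorithm, and a semantic verification module, which together significantly improve medical prompt quality.

Method: 1. A medical terminology attention mechanism; 2. A multi-dimensional evaluation architecture; 3. A component-level evolutionary algorithm; 4. A semantic verification module.

Result: Across diagnostic, therapeutic, and educational tasks, factually incorrect content falls by 24.7%, domain specificity rises by 19.6%, and clinician preference is 15.3% higher.

Insight: By combining evolutionary algorithms with reinforcement learning, EMPOWER addresses key challenges in medical prompt engineering and supports the responsible use of LLMs in healthcare.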

Abstract: Prompt engineering significantly influences the reliability and clinical utility of Large Language Models (LLMs) in medical applications. Current optimization approaches inadequately address domain-specific medical knowledge and safety requirements. This paper introduces EMPOWER, a novel evolutionary framework that enhances medical prompt quality through specialized representation learning, multi-dimensional evaluation, and structure-preserving algorithms. Our methodology incorporates: (1) a medical terminology attention mechanism, (2) a comprehensive assessment architecture evaluating clarity, specificity, clinical relevance, and factual accuracy, (3) a component-level evolutionary algorithm preserving clinical reasoning integrity, and (4) a semantic verification module ensuring adherence to medical knowledge. Evaluation across diagnostic, therapeutic, and educational tasks demonstrates significant improvements: 24.7% reduction in factually incorrect content, 19.6% enhancement in domain specificity, and 15.3% higher clinician preference in blinded evaluations. The framework addresses critical challenges in developing clinically appropriate prompts, facilitating more responsible integration of LLMs into healthcare settings.

[47] Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models

Wataru Ikeda,Kazuki Yano,Ryosuke Takahashi,Jaesung Lee,Keigo Shibata,Jun Suzuki

Main category: cs.CL

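TL;DR: The paper studies the layerwise importance of feed-forward networks (FFNs) in Transformer language models during pretraining. By redistributing FFNs across layers and training models from scratch, it finds that concentrating FFNs in 70% of the layers, taken as a consecutive middle block, improves downstream task performance.

Details Motivation: Prior studies typically use publicly available pretrained models and overlook how FFN importance varies by layer during pretraining. The paper examines how FFN importance differs across layers and how this affects model performance.

Contribution: Proposes an experimental approach that redistributes FFNs within the model and trains models from scratch to test their importance; finds that concentrating FFNs in the middle layers improves downstream performance consistently across model sizes.

Method: Increases the FFN dimensions in some layers and completely removes FFNs from the others while keeping the total parameter count fixed; validates on models with 285M, 570M, and 1.2B parameters.

Result: Configurations that concentrate FFNs in 70% of the consecutive middle layers outperform standard configurations on multiple downstream tasks.

Insight: FFN importance depends on layer position, with FFNs in the middle layers contributing more to performance; this kind of redistribution can inform model design.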

Abstract: This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
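
The parameter-preserving reallocation can be made concrete with a little arithmetic. The sketch below assumes that FFN parameters scale linearly with the FFN dimension, that the retained layers form a block centered in the stack, and that the number of retained layers is obtained by rounding; none of these details are given in the abstract, and the baseline dimension of 3072 is a hypothetical value.

```python
def concentrate_ffn(num_layers, d_ff, fraction=0.7):
    """Pick a centered block of layers that keep FFNs and widen those FFNs.

    Returns (ffn_layers, new_d_ff). Layers outside `ffn_layers` would have
    no FFN. The new width is chosen so that the number of FFN layers times
    the new width approximately equals num_layers * d_ff, the FFN budget
    of the standard configuration (integer division may round down).
    """
    num_ffn_layers = max(1, round(num_layers * fraction))
    start = (num_layers - num_ffn_layers) // 2  # center the retained block
    ffn_layers = list(range(start, start + num_ffn_layers))
    new_d_ff = d_ff * num_layers // num_ffn_layers
    return ffn_layers, new_d_ff

# 12-layer model with a hypothetical baseline FFN dimension of 3072.
layers, widened = concentrate_ffn(num_layers=12, d_ff=3072, fraction=0.7)
```

For 12 layers this keeps FFNs in 8 middle layers (indices 2 to 9) and widens each to 4608, so 8 x 4608 matches the original 12 x 3072 budget exactly; other depths may need a small adjustment to hit the target parameter count.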

[48] DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

Kaiwen Yan,Xuanqing Shi,Hongcheng Guo,Wenxuan Wang,Zhuosheng Zhang,Chengwei Qin

Main category: cs.CL

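TL;DR: The paper proposes Dynamic Reasoning Quota Allocation (DRQA), which builds preference data from batch processing and uses reinforcement learning to allocate reasoning resources dynamically. This mitigates overthinking in reasoning large language models, significantly reducing token usage while maintaining or improving accuracy.

Details Motivation: Reasoning large language models are prone to overthinking, that is, producing lengthy reasoning chains even for simple questions, which wastes compute and tokens. The goal is to improve efficiency by adjusting reasoning length dynamically.

Contribution: Proposes DRQA, which exploits the resource competition observed in batch processing and uses reinforcement learning to train the model to allocate reasoning resources dynamically, producing answers that are both accurate and concise.

Method: Combines batch-generated preference data with reinforcement learning to train the model to allocate its reasoning quota adaptively, favoring answers that are both accurate and concise.

Result: Experiments show that DRQA significantly reduces token usage on multiple mathematical and scientific reasoning benchmarks while maintaining or improving answer accuracy.

Insight: Through dynamic resource allocation, DRQA offers a new direction for improving the reasoning efficiency of large language models and motivates research on fine-grained control of reasoning behavior.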

Abstract: Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.

[49] Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning

Haijiang Liu,Qiyuan Li,Chao Gao,Yong Cao,Xiangyu Xu,Xun Wu,Daniel Hershcovich,Jinguang Gu

Main category: cs.CL

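TL;DR: The paper proposes MARK, a multi-stage reasoning framework that improves the accuracy, steerability, and interpretability of large language models when simulating cultural value surveys. Its core is a multi-stage reasoning method grounded in MBTI personality theory.

Details Motivation: Existing methods for simulating cultural value surveys usually rely on demographic information and do not model individual psychological traits such as personality in depth, which limits the accuracy and interpretability of the simulated results.

Contribution: The main contribution is the MARK framework, which combines multi-stage reasoning (life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation) with MBTI personality theory and significantly improves simulation performance.

Method: MARK reasons in multiple stages: it first analyzes life-situational stress, then predicts group-level personality traits, and finally generates responses through self-weighted cognitive imitation. Experiments use the World Values Survey dataset.

Result: Experiments show that MARK improves accuracy by 10% over existing baselines, significantly reduces the divergence between model predictions and human preferences, and improves zero-shot personalization.

Insight: Introducing personality theory and multi-stage reasoning can significantly improve the realism and interpretability of cultural value survey simulation, which is valuable for social science research and for developing personalized models.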

Abstract: We introduce MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It predicts and uses human demographic information for simulation through life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% in accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.

[50] Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

Dingdong Wang,Junan Li,Mingyu Cui,Dongchao Yang,Xueyuan Chen,Helen Meng

Main category: cs.CL

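TL;DR: The paper presents a fair comparison of discrete tokens and continuous features in Speech Large Language Models (SpeechLLMs), finds that continuous features perform better on most tasks, and analyzes the characteristics and learning patterns of both approaches.

Details Motivation: With the rise of SpeechLLMs, discrete tokens and continuous features have become the two dominant approaches to speech processing, but the performance gap between them has not been thoroughly studied.

Contribution: Provides a fair comparison of discrete and continuous features under the same experimental settings, finds that continuous features are generally better, and analyzes the characteristics of both approaches in depth.

Method: Extracts discrete and continuous features using self-supervised learning (SSL) and evaluates six spoken language understanding tasks on language models of different sizes (Qwen1.5-0.5B and Llama3.1-8B).

Result: Continuous features outperform discrete tokens on most tasks, and the two approaches exhibit different patterns in how they learn speech information.

Insight: Continuous features have the advantage in speech processing tasks, but discrete tokens have their own suitable use cases; future work could combine the strengths of both to improve performance further.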

Abstract: With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.

[51] Pandora: Leveraging Code-driven Knowledge Transfer for Unified Structured Knowledge Reasoning

Yongrui Chen,Junhao He,Linbo Fu,Shenyu Zhang,Rihui Jin,Xinbang Dai,Jiaqi Li,Dehai Min,Nan Hu,Yuxin Zhang,Guilin Qi,Yi Huang,Tongtong Wu

Main category: cs.CL

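TL;DR: Pandora proposes a unified framework based on code-based knowledge representation and knowledge transfer to address cross-task challenges in structured knowledge reasoning.

Details Motivation: Existing unified structured knowledge reasoning methods rely on task-specific strategies or bespoke representations, which makes it hard to break down barriers between tasks and limits cross-task performance.

Contribution: 1. Proposes a code-based unified knowledge representation using the Pandas API; 2. Uses knowledge transfer and code execution feedback to strengthen the cross-task reasoning of LLMs.

Method: Implements the unified knowledge representation with Python's Pandas API and improves reasoning through knowledge transfer and feedback from code execution.

Result: Outperforms existing unified reasoning frameworks on six benchmarks and is competitive with task-specific methods.

Insight: Code-driven knowledge representation and transfer can effectively improve the generality and performance of LLMs on structured knowledge reasoning tasks.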

Abstract: Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on task-specific strategies or bespoke representations, which hinder their ability to dismantle barriers between different SKR tasks, thereby constraining their overall performance in cross-task scenarios. In this paper, we introduce Pandora, a novel USKR framework that addresses the limitations of existing methods by leveraging two key innovations. First, we propose a code-based unified knowledge representation using Python's Pandas API, which aligns seamlessly with the pre-training of LLMs. This representation facilitates a cohesive approach to handling different structured knowledge sources. Building on this foundation, we employ knowledge transfer to bolster the unified reasoning process of LLMs by automatically building cross-task memory. By adaptively correcting reasoning using feedback from code execution, Pandora showcases impressive unified reasoning capabilities. Extensive experiments on six widely used benchmarks across three SKR tasks demonstrate that Pandora outperforms existing unified reasoning frameworks and competes effectively with task-specific methods.

[52] How Quantization Shapes Bias in Large Language Models

Federico Marcuzzi,Xuefei Ning,Roy Schwartz,Iryna Gurevych

Main category: cs.CL

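TL;DR: Quantization affects bias in large language models in several ways: it can reduce toxicity, but it slightly increases stereotypes and unfairness, especially under aggressive compression.

Details Motivation: Studies how quantization affects model bias, with particular attention to individual demographic subgroups, to inform the balance between efficiency and ethics in practice.

Contribution: A comprehensive evaluation of how quantization affects multiple bias types, covering weight and activation quantization strategies and models that differ in architecture family and reasoning ability.

Method: Applies probability-based and generated-text metrics across nine benchmarks to measure how different quantization strategies affect bias, including stereotypes, toxicity, sentiment, and fairness.

Result: Quantization can reduce toxicity and does not significantly affect sentiment, but it slightly increases stereotypes and unfairness in generative tasks, especially under aggressive compression; the magnitude depends on the specific setting.

Insight: Quantization involves a trade-off between efficiency and ethics; aggressive compression can amplify some biases, so the quantization strategy should be chosen carefully for each setting.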

Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
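To make "aggressive compression" concrete, the following is a minimal sketch of symmetric, per-tensor, round-to-nearest weight quantization. It is a generic textbook scheme and not necessarily any of the strategies the paper evaluates; the weight values and the helper names `quantize_dequantize` and `mean_abs_error` are illustrative assumptions. It only shows that fewer bits leave a larger reconstruction error, which is the kind of perturbation whose effect on bias the paper measures.

```python
def quantize_dequantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization.

    Generic illustration only; not a scheme taken from the paper.
    """
    qmax = 2 ** (bits - 1) - 1           # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax               # width of one quantization step
    restored = []
    for w in weights:
        q = round(w / scale)             # nearest integer level
        q = max(-qmax, min(qmax, q))     # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical weight values, chosen only for illustration.
weights = [0.91, -0.42, 0.07, -0.88, 0.33, 0.005, -0.61, 0.27]
err_by_bits = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4)
}
```

With these toy values the 4-bit error is several times the 8-bit error, since the step size grows from `max_abs / 127` to `max_abs / 7`.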

[53] Detecting and Characterizing Planning in Language Models

Jatin Nainani,Sankaran Vaidyanathan,Connor Watts,Andre N. Assis,Alice Rigg

Main category: cs.CL

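TL;DR: The paper presents a method for detecting and characterizing planning in language models, and shows experimentally that planning behavior differs across models and tasks.

Details Motivation: Modern large language models (LLMs) perform well on multi-step reasoning tasks, but whether they actually plan (select a target in advance and generate intermediate steps toward it) has not been systematically verified. The paper aims to distinguish planning from improvisation using causal criteria and a semi-automated annotation pipeline.

Contribution: 1. Formal, causally grounded criteria for detecting planning; 2. A semi-automated annotation pipeline; 3. Evidence, from comparing Gemma-2-2B with Claude 3.5 Haiku, that planning is not universal and depends on the task.

Method: 1. Define causal criteria for detecting planning; 2. Build a semi-automated annotation pipeline; 3. Apply it to MBPP code generation and a poem generation task and analyze model behavior.

Result: Gemma-2-2B improvises on the poem task and switches between planning and improvisation on MBPP; instruction tuning refines planning behaviors already present in the base model rather than creating them from scratch.

Insight: Planning is not a universal capability of LLMs; it varies with model and task. Instruction tuning appears to refine rather than create planning behavior, and the underlying mechanism needs further study.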

Abstract: Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.

[54] SentiMM: A Multimodal Multi-Agent Framework for Sentiment Analysis in Social Media

Xilai Xu,Zilin Zhao,Chengye Song,Zining Wang,Jinhe Qiang,Jiongrui Yan,Yuhuai Lin

Main category: cs.CL

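TL;DR: SentiMM is a multi-agent framework for multimodal sentiment analysis on social media; it combines cross-modal feature fusion with external knowledge integration to improve multi-label sentiment recognition.

Details Motivation: The growth of multimodal content on social media challenges sentiment analysis; existing methods fall short on cross-modal fusion and knowledge integration, so a more systematic solution is needed.

Contribution: 1. The SentiMM framework, which divides multimodal processing among specialized agents; 2. SentiMMD, a large-scale multimodal dataset; 3. Experiments showing that it outperforms state-of-the-art baselines.

Method: 1. Specialized agents process text and visual inputs separately; 2. Multimodal features are fused; 3. Knowledge retrieval enriches the context; 4. Results are aggregated for the final sentiment classification.

Result: SentiMM outperforms state-of-the-art baselines in experiments, supporting the effectiveness of its structured approach.

Insight: Multi-agent collaboration and knowledge integration are key to improving multimodal sentiment analysis.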

Abstract: With the increasing prevalence of multimodal content on social media, sentiment analysis faces significant challenges in effectively processing heterogeneous data and recognizing multi-label emotions. Existing methods often lack effective cross-modal fusion and external knowledge integration. We propose SentiMM, a novel multi-agent framework designed to systematically address these challenges. SentiMM processes text and visual inputs through specialized agents, fuses multimodal features, enriches context via knowledge retrieval, and aggregates results for final sentiment classification. We also introduce SentiMMD, a large-scale multimodal dataset with seven fine-grained sentiment categories. Extensive experiments demonstrate that SentiMM achieves superior performance compared to state-of-the-art baselines, validating the effectiveness of our structured approach.

[55] Why Synthetic Isn’t Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation

Rishikesh Devanathan,Varun Nathan,Ayush Kumar

Main category: cs.CL

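TL;DR: The paper examines the challenges of synthetic dialogue generation for contact centers, proposes a diagnostic framework of 18 metrics, evaluates four generation strategies, and shows where current techniques fall short on behavioral realism and linguistic complexity.

Details Motivation: Contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, which conventional synthetic dialogue generation struggles to capture. Privacy constraints and data scarcity compound the problem.

Contribution: 1) A diagnostic framework of 18 metrics for comparing real and synthetic transcripts; 2) Use of supervision signals derived from contact center calls (such as Intent Summaries) to guide generation; 3) A benchmark of four generation strategies that exposes their remaining weaknesses.

Method: 1) Design a multi-metric diagnostic framework; 2) Benchmark four language-agnostic generation strategies, from simple prompting to multi-stage approaches; 3) Use derived call attributes (such as Topic Flow) as guidance for generation.

Result: Current methods perform poorly on linguistic complexity (such as disfluencies) and behavioral realism (such as sentiment), and no method is best on every metric.

Insight: 1) Contact center dialogue is distinctive enough to need targeted generation methods; 2) Multi-stage, characteristic-aware strategies may be more effective; 3) Diagnostic tools help identify deficiencies in generation models and guide improvement.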

Abstract: Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. Unlike prior synthetic dialogue generation work on open-domain or medical dialogues, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, featuring disfluencies, ASR noise, and compliance-driven agent actions. In deployments where transcripts are unavailable, standard pipelines still yield derived call attributes such as Intent Summaries, Topic Flow, and QA Evaluation Forms. We leverage these as supervision signals to guide generation. To assess the quality of such outputs, we introduce a diagnostic framework of 18 linguistically and behaviorally grounded metrics for comparing real and synthetic transcripts. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches, alongside reference-free baselines. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Our diagnostic tool exposes these gaps, enabling fine-grained evaluation and stress testing of synthetic dialogue across languages.

[56] Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries

Meiling Ning,Zhongbao Zhang,Junda Ye,Jiabao Guo,Qingyuan Guan

Main category: cs.CL

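TL;DR: The paper proposes ESFP-RM, an LM-based reward modeling method that combines natural language inference (NLI) with contextual explanations to improve reward model performance and generalization.

Details Motivation: LM-based judging reward modeling (for example, generative reward models) has made reinforcement learning from AI feedback (RLAIF) efficient and scalable, but its performance can still improve. The paper revisits the formal consistency between this task and natural language inference (NLI) to identify a direction for improvement.

Contribution: 1. Identifies the formal consistency between reward modeling and NLI; 2. Proposes improving reward models by scaling the model's comprehension boundaries; 3. Designs the two-stage ESFP-RM framework, which combines contextual explanations with slot prediction and outperforms generative reward models.

Method: ESFP-RM is a two-stage LM-based judging reward model built on an explanation-based slot framework for prediction, which leverages masked language models (MLMs). Exploratory NLI experiments show that slot-prediction MLMs with contextual explanations significantly outperform mainstream autoregressive models.

Result: In both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, ESFP-RM delivers more stable and generalizable reward signals than generative reward models.

Insight: The paper reveals a deep connection between reward modeling and NLI, suggesting that scaling a model's comprehension boundaries, guided by this task consistency, is a key path to better reward models.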

Abstract: The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model’s comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that the slot prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance compared to mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals compared to generative reward models.

[57] MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols

Yuhao Du,Qianwei Huang,Guo Zhu,Zhanchen Dai,Sunian Chen,Qiming Zhu,Yuhao Zhang,Li Zhou,Benyou Wang

Main category: cs.CL

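TL;DR: MTalk-Bench is a benchmark for evaluating speech-to-speech models in multi-turn dialogue. It assesses three core dimensions (semantic information, paralinguistic information, and ambient sound) using Arena-style (relative) and Rubrics-based (absolute) evaluation. Experiments show that current models excel at semantic processing but are weaker on paralinguistic information and ambient sound, and that they trade efficiency for coherence.

Details Motivation: Speech-to-speech (S2S) large language models (LLMs) are advancing rapidly, but existing evaluation frameworks are inadequate for complex multi-turn dialogue, so a more comprehensive benchmark is needed.

Contribution: The MTalk-Bench benchmark, which covers the semantic, paralinguistic, and ambient-sound dimensions and combines Arena-style and Rubrics-based methods to provide both relative and absolute assessment.

Method: A dual-method evaluation framework (Arena-style pairwise comparison and Rubrics-based absolute scoring) covering multiple realistic scenarios and tasks; the benchmark includes both model and human outputs, judged by human evaluators and LLMs.

Result: Models excel at semantic processing but underperform on paralinguistic information and ambient sound; Arena and Rubrics evaluations yield consistent, complementary rankings, but LLM judges show position and length biases.

Insight: Current S2S evaluation remains limited; more robust, speech-aware assessment frameworks are needed.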

Abstract: The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has significantly improved real-time spoken interaction. However, current evaluation frameworks remain inadequate for assessing performance in complex, multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Each dimension includes nine realistic scenarios, along with targeted tasks to assess specific capabilities such as reasoning. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. The benchmark includes both model and human outputs, evaluated by human evaluators and LLMs. Experimental results reveal two sets of findings. Overall performance of S2S LLMs: (1) models excel at semantic information processing yet underperform on paralinguistic information and ambient sounds perception; (2) models typically regain coherence by increasing response length, sacrificing efficiency in multi-turn dialogues; (3) modality-aware, task-specific designs outperform brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics yield consistent, complementary rankings, but reliable distinctions emerge only when performance gaps are large; (2) LLM-as-a-judge aligns with humans when gaps are clear or criteria explicit, but exhibits position and length biases and is reliable on nonverbal evaluation only with text annotations. These results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.

[58] MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains

Kaiwen Wei,Rui Shan,Dongsheng Zou,Jianzhong Yang,Bi Zhao,Junnan Zhu,Jiang Zhong

Main category: cs.CL

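TL;DR: MIRAGE scales test-time inference with parallel graph-retrieval-augmented reasoning chains, targeting medical QA tasks that demand high accuracy and traceability.

Details Motivation: Current retrieval-augmented generation (RAG) reasoning methods rely on a single linear reasoning chain and incorporate unstructured text in a flat, context-agnostic way, which leads to error accumulation. Medical QA requires both accuracy and traceability, so these methods are of limited effectiveness there.

Contribution: 1) The MIRAGE framework, which supports dynamic multi-chain inference; 2) Improved performance through entity-grounded sub-question decomposition, parallel inference chains, adaptive retrieval, and cross-chain verification; 3) Better results than existing methods on medical QA, with improved interpretability.

Method: 1) Decompose complex queries into entity-grounded sub-questions; 2) Execute inference chains in parallel; 3) Retrieve evidence adaptively via neighbor expansion and multi-hop traversal; 4) Resolve contradictions with cross-chain verification.

Result: On three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE), MIRAGE outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations.

Insight: Dynamic multi-chain inference over structured knowledge graphs can improve both accuracy and interpretability in medical QA and offers a new approach for complex reasoning scenarios.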

Abstract: Large reasoning models (LRMs) have shown significant progress in test-time scaling through chain-of-thought prompting. Current approaches like search-o1 integrate retrieval augmented generation (RAG) into multi-step reasoning processes but rely on a single, linear reasoning chain while incorporating unstructured textual information in a flat, context-agnostic manner. As a result, these approaches can lead to error accumulation throughout the reasoning chain, which significantly limits its effectiveness in medical question-answering (QA) tasks where both accuracy and traceability are critical requirements. To address these challenges, we propose MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration), a novel test-time scalable reasoning framework that performs dynamic multi-chain inference over structured medical knowledge graphs. Specifically, MIRAGE 1) decomposes complex queries into entity-grounded sub-questions, 2) executes parallel inference chains, 3) retrieves evidence adaptively via neighbor expansion and multi-hop traversal, and 4) integrates answers using cross-chain verification to resolve contradictions. Experiments on three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations. Additionally, MIRAGE improves interpretability by generating explicit reasoning chains that trace each factual claim to concrete chains within the knowledge graph, making it well-suited for complex medical reasoning scenarios. The code will be available for further research.
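The four steps listed in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the tiny `GRAPH`, the function names `run_chain` and `answer`, the use of a thread pool, and majority voting as a stand-in for cross-chain verification. In MIRAGE an LLM performs the decomposition and reasoning; here the grounded entities are supplied by hand.

```python
from collections import Counter, deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
GRAPH = {
    "fever": [("symptom_of", "influenza"), ("symptom_of", "malaria")],
    "cough": [("symptom_of", "influenza"), ("symptom_of", "bronchitis")],
    "body_aches": [("symptom_of", "influenza")],
    "influenza": [("treated_by", "oseltamivir")],
    "malaria": [("treated_by", "artemisinin")],
    "bronchitis": [("treated_by", "rest")],
}


def run_chain(entity, target_relation, max_hops=2):
    """One chain: breadth-first neighbor expansion from a grounded entity.

    Returns (answer, path) pairs; each path lists the traversed edges,
    so every answer can be traced back through the graph.
    """
    results = []
    queue = deque([(entity, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_hops:
            continue                      # hop budget exhausted for this branch
        for relation, neighbor in GRAPH.get(node, []):
            new_path = path + [(node, relation, neighbor)]
            if relation == target_relation:
                results.append((neighbor, new_path))
            queue.append((neighbor, new_path))
    return results


def answer(entities, target_relation):
    """Run one chain per entity in parallel, then keep the best-supported answer."""
    with ThreadPoolExecutor() as pool:
        chains = list(pool.map(lambda e: run_chain(e, target_relation), entities))
    votes, evidence = Counter(), {}
    for chain in chains:
        for candidate in {a for a, _ in chain}:
            votes[candidate] += 1         # one vote per chain that reached it
        for candidate, path in chain:
            evidence.setdefault(candidate, path)
    if not votes:
        return None, 0, []
    best, support = votes.most_common(1)[0]
    return best, support, evidence[best]


best, support, path = answer(["fever", "cough", "body_aches"], "treated_by")
```

Here all three chains reach `oseltamivir` through `influenza`, while `artemisinin` and `rest` are each supported by a single chain, so voting discards them. The returned `path` plays the role of the traceable evidence the abstract describes.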

cs.CV [Back]

[59] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Anindya Mondal,Ayan Banerjee,Sauradip Nag,Josep Lladós,Xiatian Zhu,Anjan Dutta

Main category: cs.CV

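TL;DR: CountLoop is a training-free framework that uses iterative feedback to give diffusion models precise control over the number of object instances, improving generation quality in complex scenes.

Details Motivation: Existing diffusion models are unreliable at generating a specific number of object instances in complex scenes and lack precise control.

Contribution: Through iterative feedback and multimodal agent evaluation, CountLoop gives diffusion models accurate control over instance counts, spatial layout, and attributes.

Method: Alternates between image generation and multimodal agent evaluation with a language-guided planner and critic, and applies instance-driven attention masking and compositional generation techniques.

Result: Across several benchmarks, CountLoop reaches counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming the baselines.

Insight: Iterative feedback and language-guided planning substantially improve the controllability of diffusion models in complex scenes.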

Abstract: Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97.

[60] Do VLMs Have Bad Eyes? Diagnosing Compositional Failures via Mechanistic Interpretability

Ashwath Vaithinathan Aravindan,Abha Jha,Mihir Kulkarni

Main category: cs.CV

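TL;DR: The paper investigates why vision-language models (VLMs) fail at compositional generalization and object binding, and uses mechanistic interpretability techniques to show that “superposition” of neurons in MLP layers is a major obstacle.

Details Motivation: VLMs perform well on tasks such as image captioning and visual question answering, but struggle with novel combinations of objects and their attributes. The paper explores the mechanistic roots of these compositional failures.

Contribution: Shows evidence of “superposition” in the MLP layers of CLIP's vision encoder: individual neurons represent multiple features, which directly hinders compositional feature representation.

Method: Applies mechanistic interpretability techniques to the MLP layers of CLIP's vision encoder, focusing on neurons that represent multiple features.

Result: The evidence indicates that neuron superposition hinders compositional feature representation, which in turn limits compositional reasoning and object binding.

Insight: The finding offers a mechanistic perspective on compositional failures in VLMs and points to directions for improving model design.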

Abstract: Vision-Language Models (VLMs) have shown remarkable performance in integrating visual and textual information for tasks such as image captioning and visual question answering. However, these models struggle with compositional generalization and object binding, which limit their ability to handle novel combinations of objects and their attributes. Our work explores the root causes of these failures using mechanistic interpretability techniques. We show evidence that individual neurons in the MLP layers of CLIP’s vision encoder represent multiple features, and this “superposition” directly hinders its compositional feature representation, which consequently affects compositional reasoning and object binding capabilities. We hope this study will serve as an initial step toward uncovering the mechanistic roots of compositional failures in VLMs. The code and supporting results can be found at https://github.com/Mystic-Slice/Do-VLMs-Have-Bad-Eyes .

[61] MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

Chenghao Liu,Zhimu Zhou,Jiachen Zhang,Minghao Zhang,Songfang Huang,Huiling Duan

Main category: cs.CV

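TL;DR: MSNav is a new vision-and-language navigation framework in which three modules (dynamic memory, spatial reasoning, and LLM-based path planning) work together to address the weaknesses of black-box approaches and improve performance and robustness on long-horizon tasks.

Details Motivation: Current vision-and-language navigation methods based on large language models (LLMs) suffer from poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks; MSNav addresses these issues systematically through a modular design.

Contribution: The MSNav framework, comprising a Memory Module, a Spatial Module, and a Decision Module, together with the I-O-S dataset and the Qwen-Spatial model, which improve navigation performance.

Method: The Memory Module uses selective node pruning to support long-range exploration, the Spatial Module improves endpoint recognition, and the Decision Module uses an LLM for path planning. Qwen-Spatial, fine-tuned from Qwen3-4B, improves object list extraction.

Result: Achieves state-of-the-art performance on the R2R and REVERIE datasets, with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).

Insight: A modular design suits complex navigation tasks better than a single black-box LLM, particularly where long-horizon exploration and spatial reasoning are required.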

Abstract: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Current approaches often adopt a “black-box” paradigm, where a single Large Language Model (LLM) makes end-to-end decisions. However, this paradigm is plagued by critical vulnerabilities, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks. To systematically address these issues, we propose Memory Spatial Navigation (MSNav), a framework that fuses three modules into a synergistic architecture, which transforms fragile inference into a robust, integrated intelligence. MSNav integrates three modules: Memory Module, a dynamic map memory module that tackles memory overload through selective node pruning, enhancing long-range exploration; Spatial Module, a module for spatial reasoning and object relationship inference that improves endpoint recognition; and Decision Module, a module using LLM-based path planning to execute robust actions. Powering Spatial Module, we also introduce an Instruction-Object-Space (I-O-S) dataset and fine-tune the Qwen3-4B model into Qwen-Spatial (Qwen-Sp), which outperforms leading commercial LLMs in object list extraction, achieving higher F1 and NDCG scores on the I-O-S test set. Extensive experiments on the Room-to-Room (R2R) and REVERIE datasets demonstrate MSNav’s state-of-the-art performance with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).

[62] QA-VLM: Providing human-interpretable quality assessment for wire-feed laser additive manufacturing parts with Vision Language Models

Qiaojie Zheng,Jiucai Zhang,Joy Gockel,Michael B. Wakin,Craig Brice,Xiaoli Zhang

Main category: cs.CV

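TL;DR: The paper proposes QA-VLM, a vision-language model (VLM) framework that provides interpretable quality assessment for wire-feed laser additive manufacturing parts, addressing the limitations of conventional black-box models.

Details Motivation: Quality assessment in additive manufacturing usually depends on expert experience, and existing machine learning methods lack interpretability, which limits their adoption. QA-VLM combines the attention mechanisms of VLMs with domain knowledge to provide trustworthy, interpretable assessments.

Contribution: The QA-VLM framework, which combines VLMs with domain knowledge to generate interpretable quality assessments; its validity and consistency are evaluated on laser wire direct energy deposition (DED-LW) samples.

Method: Uses the attention mechanisms and reasoning capabilities of VLMs, enriched with domain knowledge distilled from peer-reviewed journal articles, to generate interpretable assessments; the framework is evaluated on 24 single-bead samples.

Result: QA-VLM shows higher validity and consistency in explanation quality than off-the-shelf VLMs, indicating its potential for trustworthy quality assessment in additive manufacturing.

Insight: Combining VLMs with domain knowledge can improve the interpretability and trustworthiness of quality assessment and offers a new approach to the black-box problem in industrial applications.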

Abstract: Image-based quality assessment (QA) in additive manufacturing (AM) often relies heavily on the expertise and constant attention of skilled human operators. While machine learning and deep learning methods have been introduced to assist in this task, they typically provide black-box outputs without interpretable justifications, limiting their trust and adoption in real-world settings. In this work, we introduce a novel QA-VLM framework that leverages the attention mechanisms and reasoning capabilities of vision-language models (VLMs), enriched with application-specific knowledge distilled from peer-reviewed journal articles, to generate human-interpretable quality assessments. Evaluated on 24 single-bead samples produced by laser wire direct energy deposition (DED-LW), our framework demonstrates higher validity and consistency in explanation quality than off-the-shelf VLMs. These results highlight the potential of our approach to enable trustworthy, interpretable quality assessment in AM applications.

[63] The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

Naren Sengodan

Main category: cs.CV

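TL;DR: The paper presents The Loupe, a lightweight, plug-and-play attention module that amplifies discriminative features in Vision Transformers. It focuses on key local features without explicit annotations, improving fine-grained visual classification while providing interpretability.

Details Motivation: Fine-grained visual classification (FGVC) requires identifying subtle, localized visual cues. Vision Transformers achieve strong performance but lack interpretability. The Loupe uses an attention mechanism to improve both interpretability and performance.

Contribution: The Loupe, a lightweight, plug-and-play attention module that sharpens a pre-trained model's focus on discriminative features, learns local features without part-level annotations, and improves both classification performance and interpretability.

Method: The Loupe is inserted into a pre-trained backbone (such as the Swin Transformer) and trained end-to-end with a composite loss function that implicitly guides the model to focus on discriminative regions, without explicit part-level annotations.

Result: On CUB-200-2011, The Loupe raises the accuracy of a Swin-Base model from 85.40% to 88.06% (+2.66 percentage points) and produces interpretable attention maps.

Insight: A simple attention mechanism can act as a powerful regularizer that improves performance and interpretability, making it suitable for domains requiring high precision and trust, such as medical diagnostics and biodiversity monitoring.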

Abstract: Fine-Grained Visual Classification (FGVC) is a critical and challenging area within computer vision, demanding the identification of highly subtle, localized visual cues. The importance of FGVC extends to critical applications such as biodiversity monitoring and medical diagnostics, where precision is paramount. While large-scale Vision Transformers have achieved state-of-the-art performance, their decision-making processes often lack the interpretability required for trust and verification in such domains. In this paper, we introduce The Loupe, a novel, lightweight, and plug-and-play attention module designed to be inserted into pre-trained backbones like the Swin Transformer. The Loupe is trained end-to-end with a composite loss function that implicitly guides the model to focus on the most discriminative object parts without requiring explicit part-level annotations. Our unique contribution lies in demonstrating that a simple, intrinsic attention mechanism can act as a powerful regularizer, significantly boosting performance while simultaneously providing clear visual explanations. Our experimental evaluation on the challenging CUB-200-2011 dataset shows that The Loupe improves the accuracy of a Swin-Base model from 85.40% to 88.06%, a significant gain of 2.66%. Crucially, our qualitative analysis of the learned attention maps reveals that The Loupe effectively localizes semantically meaningful features, providing a valuable tool for understanding and trusting the model’s decision-making process.

[64] MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation

Fangxin Shang,Yuan Xia,Dalu Yang,Yahui Wang,Binglin Yang

Main category: cs.CV

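TL;DR: The paper introduces MedRepBench, a comprehensive benchmark for evaluating structured understanding of medical reports, built from 1,900 real-world Chinese medical reports; it supports evaluation of both vision-language models and text-only methods.

Details Motivation: Medical report interpretation is important in clinical practice, but there is no standardized benchmark for assessing structured interpretation quality.

Contribution: The MedRepBench benchmark, two complementary evaluation protocols (objective and subjective), and a reward-function-based model optimization method.

Method: Builds the benchmark from 1,900 real medical reports and supports both VLMs and an OCR+LLM pipeline; applies GRPO with a reward function based on the objective metric; uses an LLM as a scoring agent for subjective evaluation.

Result: GRPO improves a mid-scale VLM's recall by up to 6%; the OCR+LLM pipeline performs strongly but suffers from layout-blindness and latency.

Insight: Fully vision-based report understanding is the direction to pursue; the current OCR+LLM approach still needs better layout handling.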

Abstract: Medical report interpretation plays a crucial role in healthcare, enabling both patient-facing explanations and effective information flow across clinical systems. While recent vision-language models (VLMs) and large language models (LLMs) have demonstrated general document understanding capabilities, there remains a lack of standardized benchmarks to assess structured interpretation quality in medical reports. We introduce MedRepBench, a comprehensive benchmark built from 1,900 de-identified real-world Chinese medical reports spanning diverse departments, patient demographics, and acquisition formats. The benchmark is designed primarily to evaluate end-to-end VLMs for structured medical report understanding. To enable controlled comparisons, we also include a text-only evaluation setting using high-quality OCR outputs combined with LLMs, allowing us to estimate the upper-bound performance when character recognition errors are minimized. Our evaluation framework supports two complementary protocols: (1) an objective evaluation measuring field-level recall of structured clinical items, and (2) an automated subjective evaluation using a powerful LLM as a scoring agent to assess factuality, interpretability, and reasoning quality. Based on the objective metric, we further design a reward function and apply Group Relative Policy Optimization (GRPO) to improve a mid-scale VLM, achieving up to 6% recall gain. We also observe that the OCR+LLM pipeline, despite strong performance, suffers from layout-blindness and latency issues, motivating further progress toward robust, fully vision-based report understanding.
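The objective protocol measures field-level recall of structured clinical items. A minimal sketch of such a metric follows; the matching rule (case-insensitive, whitespace-normalized exact match), the function name `field_level_recall`, and the lab values are assumptions for illustration, since the abstract does not specify how the benchmark matches fields.

```python
def field_level_recall(gold, predicted):
    """Fraction of gold (field, value) items that the prediction reproduces.

    Assumed matching rule: the field must be present and its value must
    match after lowercasing and collapsing whitespace.
    """
    if not gold:
        return 0.0

    def normalize(value):
        return " ".join(str(value).lower().split())

    hits = sum(
        1
        for field, value in gold.items()
        if field in predicted and normalize(predicted[field]) == normalize(value)
    )
    return hits / len(gold)


# Hypothetical structured items from one report.
gold = {
    "hemoglobin": "132 g/L",
    "wbc": "6.1 x10^9/L",
    "platelets": "210 x10^9/L",
    "glucose": "5.4 mmol/L",
}
# Hypothetical model output: one wrong value, one missing field.
predicted = {
    "hemoglobin": "132 g/l",
    "wbc": "6.1 x10^9/L",
    "platelets": "201 x10^9/L",
}
recall = field_level_recall(gold, predicted)
```

In this example two of four gold items are recovered, so `recall` is 0.5. A scalar of this kind is also the sort of signal that could serve as a reward for GRPO, although the paper's actual reward function is not described in the abstract.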

[65] Two-Stage Framework for Efficient UAV-Based Wildfire Video Analysis with Adaptive Compression and Fire Source Detection

Yanbing Bai,Rui-Yang Ju,Lemeng Zhao,Junjie Hu,Jianchao Bi,Erick Mas,Shunichi Koshimura

Main category: cs.CV

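TL;DR: The paper proposes a lightweight two-stage framework for real-time wildfire video analysis and fire source detection on UAVs. Adaptive compression reduces computational cost, and an improved YOLOv8 model raises detection accuracy.

Details Motivation: UAVs have limited computational resources and cannot run large models for real-time analysis, so real-time wildfire monitoring needs an efficient, lightweight method.

Contribution: 1. A two-stage framework combining adaptive compression and fire source detection; 2. A station point mechanism that improves prediction accuracy; 3. An improved YOLOv8 model for better detection performance.

Method: 1. Stage 1: a policy network identifies and discards redundant video clips, reducing computational cost; 2. Stage 2: an improved YOLOv8 model localizes the fire source.

Result: Stage 1 significantly reduces computational cost while maintaining classification accuracy; Stage 2 achieves higher detection accuracy at similar inference time.

Insight: Combining lightweight design with efficient detection can substantially improve UAV capabilities in real-time disaster response.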

Abstract: Unmanned Aerial Vehicles (UAVs) have become increasingly important in disaster emergency response by enabling real-time aerial video analysis. Due to the limited computational resources available on UAVs, large models cannot be run independently for real-time analysis. To overcome this challenge, we propose a lightweight and efficient two-stage framework for real-time wildfire monitoring and fire source detection on UAV platforms. Specifically, in Stage 1, we utilize a policy network to identify and discard redundant video clips using frame compression techniques, thereby reducing computational costs. In addition, we introduce a station point mechanism that leverages future frame information within the sequential policy network to improve prediction accuracy. In Stage 2, once the frame is classified as “fire”, we employ the improved YOLOv8 model to localize the fire source. We evaluate the Stage 1 method using the FLAME and HMDB51 datasets, and the Stage 2 method using the Fire & Smoke dataset. Experimental results show that our method significantly reduces computational costs while maintaining classification accuracy in Stage 1, and achieves higher detection accuracy with similar inference time in Stage 2 compared to baseline methods.

[66] CellEcoNet: Decoding the Cellular Language of Pathology with Deep Learning for Invasive Lung Adenocarcinoma Recurrence Prediction

Abdul Rehman Akbar,Usama Sajjad,Ziyu Su,Wencheng Li,Fei Xing,Jimmy Ruiz,Wei Chen,Muhammad Khalid Khan Niazi

Main category: cs.CV

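TL;DR: CellEcoNet is a new spatially aware deep learning framework that models whole slide images through a natural language analogy to predict recurrence risk in invasive lung adenocarcinoma, outperforming existing methods.

Details Motivation: Despite surgical resection, about 70% of invasive lung adenocarcinoma patients recur within five years, and current tools cannot identify those who need adjuvant therapy, leaving an unmet clinical need.

Contribution: CellEcoNet, which defines a “language of pathology” and automatically learns how cells and their spatial interactions affect recurrence risk, with superior predictive performance.

Method: Treats cells as words, cellular neighborhoods as phrases, and tissue architecture as sentences, and uses deep learning to model context-dependent meanings, capturing subtle variations and spatial interactions.

Result: On 456 H&E-stained whole slide images, CellEcoNet achieves an AUC of 77.8% and a hazard ratio of 9.54, outperforming existing grading systems and computational methods.

Insight: Beyond serving as a prognostic tool, CellEcoNet decodes the cellular “language” of the tumor microenvironment, revealing how subtle cell variations encode recurrence risk and offering a new perspective for pathology.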

Abstract: Despite surgical resection, ~70% of invasive lung adenocarcinoma (ILA) patients recur within five years, and current tools fail to identify those needing adjuvant therapy. To address this unmet clinical need, we introduce CellEcoNet, a novel spatially aware deep learning framework that models whole slide images (WSIs) through a natural language analogy, defining a “language of pathology,” where cells act as words, cellular neighborhoods become phrases, and tissue architecture forms sentences. CellEcoNet learns these context-dependent meanings automatically, capturing how subtle variations and spatial interactions drive recurrence risk. On a dataset of 456 H&E-stained WSIs, CellEcoNet achieved superior predictive performance (AUC: 77.8%, HR: 9.54), outperforming the IASLC grading system (AUC: 71.4%, HR: 2.36), AJCC Stage (AUC: 64.0%, HR: 1.17), and state-of-the-art computational methods (AUCs: 62.2-67.4%). CellEcoNet demonstrated fairness and consistent performance across diverse demographic and clinical subgroups. Beyond prognosis, CellEcoNet marks a paradigm shift by decoding the tumor microenvironment’s cellular “language” to reveal how subtle cell variations encode recurrence risk.

[67] A Framework for Benchmarking Fairness-Utility Trade-offs in Text-to-Image Models via Pareto Frontiers

Marco N. Bochernitsan,Rodrigo C. Barros,Lucas S. Kupssinskü

Main category: cs.CV

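TL;DR: The paper proposes a framework that uses Pareto frontiers to evaluate the trade-off between fairness and utility in text-to-image models, addressing the reliance of existing methods on subjective judgment and their poor reproducibility.

Details Motivation: Current fairness evaluation for text-to-image models relies on qualitative judgment or narrow comparisons, cannot assess fairness and utility together, and makes the effects of debiasing methods hard to reproduce.

Contribution: A method based on Pareto-optimal frontiers that quantitatively compares fairness-utility trade-offs across text-to-image models and identifies the best hyperparameter configurations.

Method: Uses Normalized Shannon Entropy and ClipScore to evaluate fairness and utility, respectively, and builds Pareto frontiers across hyperparameter configurations of debiasing methods.

Result: Most default hyperparameter configurations are dominated solutions in the fairness-utility space, and the method makes it straightforward to find better ones.

Insight: The method provides a reproducible, quantitative framework for evaluating the fairness-utility trade-off, exposes the weakness of current default configurations, and shows the room for optimization.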

Abstract: Achieving fairness in text-to-image generation demands mitigating social biases without compromising visual fidelity, a challenge critical to responsible AI. Current fairness evaluation procedures for text-to-image models rely on qualitative judgment or narrow comparisons, which limit the capacity to assess both fairness and utility in these models and prevent reproducible assessment of debiasing methods. Existing approaches typically employ ad-hoc, human-centered visual inspections that are both error-prone and difficult to replicate. We propose a method for evaluating fairness and utility in text-to-image models using Pareto-optimal frontiers across hyperparametrization of debiasing methods. Our method allows for comparison between distinct text-to-image models, outlining all configurations that optimize fairness for a given utility and vice-versa. To illustrate our evaluation method, we use Normalized Shannon Entropy and ClipScore for fairness and utility evaluation, respectively. We assess fairness and utility in Stable Diffusion, Fair Diffusion, SDXL, DeCoDi, and FLUX text-to-image models. Our method shows that most default hyperparameterizations of the text-to-image model are dominated solutions in the fairness-utility space, and it is straightforward to find better hyperparameters.
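A short sketch may help make "dominated solutions" and the fairness metric concrete. It computes normalized Shannon entropy over group counts and extracts the Pareto front when both fairness and utility are maximized. The configuration names and scores are invented for illustration and are not results from the paper; the utility column merely stands in for a ClipScore-like value.

```python
import math


def normalized_shannon_entropy(counts):
    """Entropy of a group distribution divided by its maximum, log(k).

    1.0 means generated images are spread evenly over the k groups;
    0.0 means they all fall into a single group.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(k)


def pareto_front(points):
    """Names of configurations that no other configuration dominates.

    Each point is (name, fairness, utility); both are to be maximized.
    A point is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, fairness, utility in points:
        dominated = any(
            f2 >= fairness and u2 >= utility and (f2 > fairness or u2 > utility)
            for _, f2, u2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (fairness, utility) scores for one model's configurations.
configs = [
    ("default", 0.40, 0.30),
    ("debias_weak", 0.55, 0.31),
    ("debias_mid", 0.75, 0.29),
    ("debias_strong", 0.90, 0.22),
    ("debias_bad", 0.60, 0.25),
]
front = pareto_front(configs)
balanced = normalized_shannon_entropy([25, 25, 25, 25])
collapsed = normalized_shannon_entropy([100, 0, 0, 0])
```

In this toy data `default` is dominated by `debias_weak`, which is better on both axes, mirroring the paper's observation that default hyperparameters are often dominated. The front keeps `debias_weak`, `debias_mid`, and `debias_strong`, each a different trade-off between fairness and utility.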

[68] WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Rabiul Awal,Mahsa Massoud,Aarash Feizi,Zichao Li,Suyuchen Wang,Christopher Pal,Aishwarya Agrawal,David Vazquez,Siva Reddy,Juan A. Rodriguez,Perouz Taslakian,Spandana Gella,Sai Rajeswar

Main category: cs.CV

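TL;DR: WebMMU is a multilingual benchmark that evaluates website visual question answering, code editing, and mockup-to-code generation, revealing the limitations of current multimodal large language models in complex reasoning and in preserving functionality.

Details Motivation: Existing multimodal benchmarks usually treat these tasks in isolation and do not assess complex reasoning or multilingual ability; WebMMU fills this gap and provides a reference point for future web automation agents.

Contribution: WebMMU, a unified multilingual, multimodal benchmark for website understanding and code generation, covering complex reasoning, precise element grounding, and functional code generation.

Method: Built on expert-annotated, real-world web data, it unifies visual question answering, code editing, and mockup-to-code generation to evaluate multimodal large language models across languages and task types.

Result: Current multimodal large language models perform well on basic information extraction but struggle with complex reasoning and grounding, functionality-preserving code editing, and multilingual design-to-code generation.

Insight: WebMMU exposes key limitations of multimodal large language models on cross-lingual, multi-task web work and underscores the need for better multimodal and cross-lingual reasoning.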

Abstract: We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models’ abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

[69] Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data

Stefania L. Moroianu,Christian Bluethgen,Pierre Chambon,Mehdi Cherti,Jean-Benoit Delbrouck,Magdalini Paschali,Brandon Price,Judy Gichoya,Jenia Jitsev,Curtis P. Langlotz,Akshay S. Chaudhari

Main category: cs.CV

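TL;DR: The paper proposes generating finely controllable synthetic chest radiographs with a text-to-image diffusion model (RoentGen-v2) to improve the performance, robustness, and fairness of AI models, and shows experimentally that pretraining on synthetic data is effective.

Details Motivation: In diagnostic medical imaging, building robust and fair deep learning models remains difficult, especially when dataset scale and diversity are limited. Synthetic data offers a potential way to address this.

Contribution: 1. Proposes RoentGen-v2, the first model to offer fine-grained control over both radiographic findings and patient demographic attributes (such as sex, age, and race/ethnicity). 2. Proposes a new strategy of supervised pretraining on synthetic data followed by fine-tuning on real data. 3. Open-sources the synthetic dataset, code, and trained models.

Method: 1. Generates synthetic data with a text-to-image diffusion model. 2. Designs a pipeline of synthetic-data pretraining followed by real-data fine-tuning. 3. Evaluates model performance, generalization, and fairness on multiple datasets.

Result: 1. Synthetic pretraining raised the accuracy of downstream classification models by 6.5%, compared with 2.7% when real and synthetic data were simply mixed. 2. The underdiagnosis fairness gap was reduced by 19.3%.

Insight: Synthetic data generation can substantially improve the performance, generalization, and fairness of medical AI models, especially when data is limited. The open-sourced resources may support broader medical AI research.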

Abstract: Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open source our code, trained models, and synthetic dataset at https://github.com/StanfordMIMI/RoentGen-v2 .

[70] Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes

Xinhao Xiang,Kuan-Chuan Peng,Suhas Lohit,Michael J. Jones,Jiawei Zhang

Main category: cs.CV

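TL;DR: The paper proposes OVODA, a framework for open-vocabulary multimodal 3D object detection. It combines foundation models with attribute detection, does not need the anchor sizes of novel classes, and performs well on the nuScenes and Argoverse 2 datasets.

Details Motivation: Existing 3D object detection methods are limited by closed-set assumptions and struggle to recognize novel objects and their attributes in real-world scenes, so an open-vocabulary solution is needed.

Contribution: 1. Proposes the OVODA framework, which supports open-vocabulary 3D object and attribute detection. 2. Releases the OVAD dataset, which supplements existing 3D detection benchmarks with attribute annotations. 3. Introduces techniques such as foundation model feature concatenation and prompt tuning.

Method: OVODA uses foundation models to bridge the semantic gap between 3D features and text while jointly detecting attributes. Key techniques include foundation model feature concatenation, prompt tuning strategies, and, for attribute detection, perspective-specified prompts and horizontal flip augmentation.

Result: On nuScenes and Argoverse 2, with no anchor sizes given for novel classes, OVODA outperforms existing open-vocabulary 3D object detection methods and successfully recognizes object attributes.

Insight: Open-vocabulary 3D object detection can be achieved by combining multimodal foundation models with joint attribute detection, and attribute annotations in datasets are essential for this new research direction.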

Abstract: 3D object detection plays a crucial role in autonomous systems, yet existing methods are limited by closed-set assumptions and struggle to recognize novel objects and their attributes in real-world scenarios. We propose OVODA, a novel framework enabling both open-vocabulary 3D object and attribute detection with no need to know the novel class anchor size. OVODA uses foundation models to bridge the semantic gap between 3D features and texts while jointly detecting attributes, e.g., spatial relationships, motion states, etc. To facilitate such research direction, we propose OVAD, a new dataset that supplements existing 3D object detection benchmarks with comprehensive attribute annotations. OVODA incorporates several key innovations, including foundation model feature concatenation, prompt tuning strategies, and specialized techniques for attribute detection, including perspective-specified prompts and horizontal flip augmentation. Our results on both the nuScenes and Argoverse 2 datasets show that under the condition of no given anchor sizes of novel classes, OVODA outperforms the state-of-the-art methods in open-vocabulary 3D object detection while successfully recognizing object attributes. Our OVAD dataset is released here: https://doi.org/10.5281/zenodo.16904069 .

[71] AIM 2025 Low-light RAW Video Denoising Challenge: Dataset, Methods and Results

Alexander Yakovenko,George Chakvetadze,Ilya Khrapov,Maksim Zhelezov,Dmitry Vatolin,Radu Timofte,Youngjin Oh,Junhyeong Kwon,Junyoung Park,Nam Ik Cho,Senyan Xu,Ruixuan Jiang,Long Peng,Xueyang Fu,Zheng-Jun Zha,Xiaoping Peng,Hansen Feng,Zhanyi Tie,Ziming Xia,Lizhi Wang

Main category: cs.CV

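TL;DR: The paper reviews the AIM 2025 Low-Light RAW Video Denoising Challenge and describes the dataset, the challenge protocol, and the submitted methods. The goal is to denoise low-light RAW video under exposure-time limits while preserving the Bayer pattern.

Details Motivation: RAW video captured in low light is heavily corrupted by noise, especially when exposure time is limited. Designing efficient denoising methods that exploit temporal redundancy and sensor-specific, signal-dependent noise characteristics is therefore important.

Contribution: 1. Provides a new benchmark of 756 ten-frame sequences covering 14 smartphone sensors and a range of illumination and exposure conditions. 2. Designs the challenge protocol, with PSNR and SSIM as evaluation metrics.

Method: Submitted methods process linear RAW sequences and output the denoised 10th frame with the Bayer pattern preserved, exploiting temporal redundancy and sensor noise characteristics.

Result: Challenge results are computed on a private test set with full-reference PSNR and SSIM, and the final ranking is given by the mean of the per-metric ranks.

Insight: The challenge indicates that existing methods still have room to improve in low-light RAW video denoising, particularly in exploiting temporal redundancy and modeling sensor noise.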

Abstract: This paper reviews the AIM 2025 (Advances in Image Manipulation) Low-Light RAW Video Denoising Challenge. The task is to develop methods that denoise low-light RAW video by exploiting temporal redundancy while operating under exposure-time limits imposed by frame rate and adapting to sensor-specific, signal-dependent noise. We introduce a new benchmark of 756 ten-frame sequences captured with 14 smartphone camera sensors across nine conditions (illumination: 1/5/10 lx; exposure: 1/24, 1/60, 1/120 s), with high-SNR references obtained via burst averaging. Participants process linear RAW sequences and output the denoised 10th frame while preserving the Bayer pattern. Submissions are evaluated on a private test set using full-reference PSNR and SSIM, with final ranking given by the mean of per-metric ranks. This report describes the dataset, challenge protocol, and submitted approaches.
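
The ranking rule described above (final ranking by the mean of per-metric ranks) can be sketched as follows. This is an illustrative sketch only: the team names and scores are hypothetical, the tie handling (tied teams share the better rank, and ties in mean rank are broken alphabetically) is an assumption, and the `psnr` helper uses the standard definition on flat lists of values normalized to a peak of 1.0 rather than the organizers' evaluation code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Standard full-reference PSNR in dB for two equally long signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def rank_descending(values: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest value; tied teams share the better rank (an assumption)."""
    ordered = sorted(values.values(), reverse=True)
    return {team: ordered.index(value) + 1 for team, value in values.items()}


def leaderboard(scores: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Order teams by the mean of their PSNR rank and SSIM rank (lower is better)."""
    metrics = ["psnr", "ssim"]
    per_metric = [
        rank_descending({team: s[metric] for team, s in scores.items()})
        for metric in metrics
    ]
    mean_rank = {
        team: sum(ranks[team] for ranks in per_metric) / len(metrics)
        for team in scores
    }
    # Alphabetical tie-break is an assumption made only for determinism.
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


# Hypothetical submissions.
scores = {
    "team_a": {"psnr": 40.0, "ssim": 0.96},
    "team_b": {"psnr": 41.0, "ssim": 0.93},
    "team_c": {"psnr": 39.0, "ssim": 0.95},
}
print(leaderboard(scores))
```

Here `team_b` has the best PSNR but the worst SSIM, so `team_a` ends up first under the mean-rank rule.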

[72] Transformer-Based Neural Network for Transient Detection without Image Subtraction

Adi Inada,Masao Sako,Tatiana Acero-Cuellar,Federica Bianco

Main category: cs.CV

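TL;DR: The paper proposes a transformer-based neural network that accurately classifies real and bogus transient detections in astronomical images without image subtraction, improving both detection efficiency and accuracy.

Details Motivation: Conventional transient detection relies on computationally expensive image subtraction, and convolutional neural networks (CNNs) are limited when it comes to pixel-by-pixel comparison. The authors aim to improve on existing methods with a transformer architecture.

Contribution: Proposes the first transformer-based network that analyzes only the search and template images directly, avoiding image subtraction while retaining high accuracy and efficiency.

Method: Uses a transformer architecture for pixel-by-pixel comparison, processing the search and template images directly without producing a difference image. The model is trained and evaluated on the autoScan dataset from the Dark Energy Survey (DES).

Result: Classification accuracy on the DES dataset reaches 97.4%, and the performance gain from the difference image shrinks as the training set grows. The network also performs well when the target is not centered.

Insight: The transformer architecture outperforms CNNs on pixel-by-pixel comparison tasks, can simplify the transient detection pipeline and improve efficiency, and suits large-scale astronomical surveys.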

Abstract: We introduce a transformer-based neural network for the accurate classification of real and bogus transient detections in astronomical images. This network advances beyond the conventional convolutional neural network (CNN) methods, widely used in image processing tasks, by adopting an architecture better suited for detailed pixel-by-pixel comparison. The architecture enables efficient analysis of search and template images only, thus removing the need for computationally expensive difference imaging, while maintaining high performance. Our primary evaluation was conducted using the autoScan dataset from the Dark Energy Survey (DES), where the network achieved a classification accuracy of 97.4% and the performance benefit of including the difference image diminished as the size of the training set grew. Further experiments with DES data confirmed that the network can operate at a similar level even when the input images are not centered on the supernova candidate. These findings highlight the network’s effectiveness in enhancing both accuracy and efficiency of supernova detection in large-scale astronomical surveys.

[73] NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows

Denis Tarasov,Alexander Nikulin,Ilya Zisman,Albina Klepach,Nikita Lyubaykin,Andrei Polubarov,Alexander Derevyagin,Vladislav Kurenkov

Main category: cs.CV

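TL;DR: The paper proposes NinA (Normalizing Flows in Action), which replaces the diffusion action decoder in VLA models with a normalizing flow (NF). This enables one-shot sampling and substantially reduces inference time while maintaining performance.

Details Motivation: Diffusion models used as action decoders need multiple iterative denoising steps, which limits their practicality in real-world settings that require high-frequency control. A faster yet expressive alternative is needed.

Contribution: Proposes NinA, a normalizing-flow-based action decoder that provides fast inference in VLA models with performance on par with the diffusion decoder.

Method: Integrates a normalizing flow as the action decoder in the FLOWER VLA architecture and fine-tunes it on the LIBERO benchmark.

Result: Under the same training regime, NinA matches the performance of the diffusion model while inference is substantially faster.

Insight: Normalizing flows offer a viable route to efficient, high-frequency VLA control without sacrificing performance.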

Abstract: Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
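
To make "one-shot sampling through an invertible transformation" concrete, here is a deliberately tiny sketch of a conditional affine flow over a one-dimensional action. It is not NinA's architecture: the class name, the linear dependence of the shift and log-scale on a scalar context, and all parameter values are assumptions chosen for illustration. The point is that a sample needs a single forward pass, and the same map can be inverted to obtain an exact log-likelihood.

```python
import math
import random


class ConditionalAffineFlow:
    """Toy 1-D flow: action = shift(c) + exp(log_scale(c)) * z, with z ~ N(0, 1)."""

    def __init__(self, w_shift: float, w_log_scale: float) -> None:
        # Hypothetical parameters; a real decoder would predict these with a network.
        self.w_shift = w_shift
        self.w_log_scale = w_log_scale

    def _params(self, context: float):
        return self.w_shift * context, self.w_log_scale * context

    def forward(self, z: float, context: float) -> float:
        """Map base noise to an action in a single step."""
        shift, log_scale = self._params(context)
        return shift + math.exp(log_scale) * z

    def inverse(self, action: float, context: float) -> float:
        """Map an action back to the base noise that produces it."""
        shift, log_scale = self._params(context)
        return (action - shift) * math.exp(-log_scale)

    def log_prob(self, action: float, context: float) -> float:
        """Exact log-density via the change-of-variables formula."""
        _, log_scale = self._params(context)
        z = self.inverse(action, context)
        base_log_prob = -0.5 * (z * z + math.log(2.0 * math.pi))
        return base_log_prob - log_scale

    def sample(self, context: float, rng: random.Random) -> float:
        """One-shot sampling: draw z once and apply the forward map once."""
        return self.forward(rng.gauss(0.0, 1.0), context)


flow = ConditionalAffineFlow(w_shift=2.0, w_log_scale=0.5)
action = flow.sample(context=1.0, rng=random.Random(0))
print(action, flow.log_prob(action, context=1.0))
```

A diffusion decoder would instead loop over several denoising steps to produce one action, which is the cost the abstract says the flow avoids.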

[74] RF-PGS: Fully-structured Spatial Wireless Channel Representation with Planar Gaussian Splatting

Lihao Zhang,Zongtan Li,Haijian Sun

Main category: cs.CV

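TL;DR: The paper proposes RF-PGS, a framework that uses Planar Gaussians with RF-specific optimizations to reconstruct high-fidelity radio propagation paths from sparse path loss spectra, improving reconstruction accuracy and lowering training cost.

Details Motivation: In the 6G era there is a pressing need for large-scale antenna arrays and accurate spatial channel state information (Spatial-CSI), and traditional channel modeling approaches fall short in spatial resolution, efficiency, and scalability.

Contribution: Proposes the RF-PGS framework, which uses Planar Gaussians as geometry primitives together with RF-specific optimizations to model wireless channels efficiently and accurately.

Method: 1. Stage one, geometry training: reconstructs the scene with Planar Gaussians. 2. Stage two, RF training: combines the structured radio radiance with a tailored multi-view loss to model radio propagation behavior accurately.

Result: Compared with existing methods, RF-PGS significantly improves reconstruction accuracy and reduces training cost.

Insight: By combining geometric modeling with RF modeling, RF-PGS offers a scalable solution for 6G Spatial-CSI modeling.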

Abstract: In the 6G era, the demand for higher system throughput and the implementation of emerging 6G technologies require large-scale antenna arrays and accurate spatial channel state information (Spatial-CSI). Traditional channel modeling approaches, such as empirical models, ray tracing, and measurement-based methods, face challenges in spatial resolution, efficiency, and scalability. Radiance field-based methods have emerged as promising alternatives but still suffer from geometric inaccuracy and costly supervision. This paper proposes RF-PGS, a novel framework that reconstructs high-fidelity radio propagation paths from only sparse path loss spectra. By introducing Planar Gaussians as geometry primitives with certain RF-specific optimizations, RF-PGS achieves dense, surface-aligned scene reconstruction in the first geometry training stage. In the subsequent Radio Frequency (RF) training stage, the proposed fully-structured radio radiance, combined with a tailored multi-view loss, accurately models radio propagation behavior. Compared to prior radiance field methods, RF-PGS significantly improves reconstruction accuracy, reduces training costs, and enables efficient representation of wireless channels, offering a practical solution for scalable 6G Spatial-CSI modeling.

[75] Beyond Emotion Recognition: A Multi-Turn Multimodal Emotion Understanding and Reasoning Benchmark

Jinpeng Hu,Hongchang Shi,Chongyuan Dai,Zhuo Li,Peipei Song,Meng Wang

Main category: cs.CV

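TL;DR: The paper proposes a multi-turn multimodal emotion understanding and reasoning (MTMEUR) benchmark that goes beyond emotion recognition to probe emotion reasoning, and introduces a multi-agent framework to improve reasoning ability.

Details Motivation: Applications of multimodal large language models in psychology currently concentrate on emotion recognition and overlook the role of emotion reasoning in making human-machine interaction more natural and effective.

Contribution: 1. Proposes the MTMEUR benchmark, with 1,451 videos from real-life scenarios and 5,101 progressive questions. 2. Proposes a multi-agent framework in which agents focus on different aspects of reasoning.

Method: Uses a multi-agent framework in which each agent specializes in one aspect (such as background context, character dynamics, or event details) to improve emotion reasoning systematically.

Result: Experiments show that existing MLLMs face significant challenges on the MTMEUR benchmark, while the proposed multi-agent method performs better.

Insight: Emotion reasoning is a core capability for human-machine interaction, and multi-agent collaboration can improve model performance on complex reasoning tasks.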

Abstract: Multimodal large language models (MLLMs) have been widely applied across various fields due to their powerful perceptual and reasoning capabilities. In the realm of psychology, these models hold promise for a deeper understanding of human emotions and behaviors. However, recent research primarily focuses on enhancing their emotion recognition abilities, leaving the substantial potential of emotion reasoning, which is crucial for improving the naturalness and effectiveness of human-machine interactions, largely unexplored. Therefore, in this paper, we introduce a multi-turn multimodal emotion understanding and reasoning (MTMEUR) benchmark, which encompasses 1,451 videos from real-life scenarios, along with 5,101 progressive questions. These questions cover various aspects, including emotion recognition, potential causes of emotions, and future action prediction. In addition, we propose a multi-agent framework, where each agent specializes in a specific aspect, such as background context, character dynamics, and event details, to improve the system’s reasoning capabilities. Furthermore, we conduct experiments with existing MLLMs and our agent-based method on the proposed benchmark, revealing that most models face significant challenges with this task.

[76] Do Multimodal LLMs See Sentiment?

Neemias B. da Silva,John Harrison,Rodrigo Minetto,Myriam R. Delgado,Bogdan T. Nassu,Thiago H. Silva

Main category: cs.CV

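TL;DR: The paper proposes a framework called MLLMsent to study the sentiment reasoning ability of multimodal large language models (MLLMs), evaluates it through three approaches, and reports state-of-the-art results.

Details Motivation: Sentiment analysis of visual content is essential on social media. The problem remains challenging because sentiment perception is closely tied to complex, scene-level semantics. The paper aims to explore the potential of MLLMs for sentiment reasoning.

Contribution: Proposes the MLLMsent framework and, through three approaches (direct classification, pairing with pre-trained LLMs, and fine-tuning LLMs), shows the advantage of MLLMs for sentiment analysis with substantial performance gains.

Method: 1. Uses MLLMs directly for image sentiment classification. 2. Combines them with pre-trained LLMs to analyze the sentiment of automatically generated image descriptions. 3. Fine-tunes LLMs on sentiment-labeled image descriptions.

Result: The fine-tuned approach performs best in the benchmark evaluation, outperforming lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively. In a cross-dataset test, without any training on the new data, it still outperforms the best runner-up by up to 8.26%.

Insight: MLLMs have considerable potential for sentiment reasoning, fine-tuning markedly improves performance, and the results set a reference point for new research in affective computing.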

Abstract: Understanding how visual content communicates sentiment is critical in an era where online interaction is increasingly dominated by this kind of media on social platforms. However, this remains a challenging problem, as sentiment perception is closely tied to complex, scene-level semantics. In this paper, we propose an original framework, MLLMsent, to investigate the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) through three perspectives: (1) using those MLLMs for direct sentiment classification from images; (2) associating them with pre-trained LLMs for sentiment analysis on automatically generated image descriptions; and (3) fine-tuning the LLMs on sentiment-labeled image descriptions. Experiments on a recent and established benchmark demonstrate that our proposal, particularly the fine-tuned approach, achieves state-of-the-art results outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively, across different levels of evaluators’ agreement and sentiment polarity categories. Remarkably, in a cross-dataset test, without any training on these new data, our model still outperforms, by up to 8.26%, the best runner-up, which has been trained directly on them. These results highlight the potential of the proposed visual reasoning scheme for advancing affective computing, while also establishing new benchmarks for future research.

[77] AWM-Fuse: Multi-Modality Image Fusion for Adverse Weather via Global and Local Text Perception

Xilai Li,Huichun Liu,Xiaosong Li,Tao Ye,Zhenyu Kuang,Huafeng Li

Main category: cs.CV

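TL;DR: AWM-Fuse is a new multi-modality image fusion method that handles image degradation in adverse weather through global and local text perception, and uses textual descriptions to constrain the generation of the fused image.

Details Motivation: Adverse weather causes loss of visual information. Existing methods that use textual information to improve semantic perception fall short because they do not effectively categorize and analyze the textual content.

Contribution: 1) Proposes a unified, shared-weight architecture that combines global and local text perception. 2) Uses textual descriptions generated by BLIP and ChatGPT to strengthen semantic understanding. 3) Uses textual constraints to guide network learning and improve feature alignment.

Method: 1) The global module uses BLIP-generated captions to extract scene features and the primary degradation type. 2) The local module uses detailed ChatGPT descriptions to capture specific degradation details. 3) Textual descriptions constrain the generation of the fused image.

Result: AWM-Fuse outperforms existing methods in complex weather conditions and on downstream tasks.

Insight: Textual information can improve the semantic perception of image fusion, and combining global with local text perception handles weather degradation more comprehensively.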

Abstract: Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although a few studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared-weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels, thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse.

[78] A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism

Yi Zhang,Lingxiao Wei,Bowei Zhang,Ziwei Liu,Kai Yi,Shu Hu

Main category: cs.CV

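TL;DR: The paper proposes SAEViT, a lightweight model that combines ViT with convolution and balances computational efficiency and performance through a sparse attention mechanism and a multi-scale structure.

Details Motivation: ViT performs well on computer vision tasks, but its large model size, high computational cost, and weak local feature modeling limit its practical use.

Contribution: 1. Proposes the SAA module, which reduces the computational complexity of attention through sparse sampling and deconvolution. 2. Designs CIFFN to strengthen inter-channel information exchange. 3. Introduces a hierarchical pyramid structure with DWSConv blocks to strengthen convolutional features.

Method: Combines Sparsely Aggregated Attention (SAA), a Channel-Interactive Feed-Forward Network (CIFFN), and depth-wise separable convolution (DWSConv) to build a lightweight ViT model.

Result: Achieves Top-1 accuracies of 76.3% and 79.6% on ImageNet-1K with only 0.8 and 1.3 GFLOPs, respectively, demonstrating its efficiency.

Insight: Combining sparse attention with convolution can balance the computational efficiency and performance of ViT, and suggests a direction for lightweight vision models.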

Abstract: Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability. However, its large model size, high computational cost, and weak local feature modeling ability hinder its application in real scenarios. To balance computational efficiency and performance, we propose SAEViT (Sparse-Attention-Efficient-ViT), a lightweight ViT-based model with convolution blocks, for efficient downstream vision tasks. Specifically, SAEViT introduces a Sparsely Aggregated Attention (SAA) module that performs adaptive sparse sampling based on image redundancy and recovers the feature map via a deconvolution operation, which significantly reduces the computational complexity of attention operations. In addition, a Channel-Interactive Feed-Forward Network (CIFFN) layer is developed to enhance inter-channel information exchange through feature decomposition and redistribution, mitigating redundancy in traditional feed-forward networks (FFN). Finally, a hierarchical pyramid structure with embedded depth-wise separable convolutional blocks (DWSConv) is devised to further strengthen convolutional features. Extensive experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task with only 0.8 GFLOPs and 1.3 GFLOPs, respectively, demonstrating a lightweight solution for various fundamental vision tasks.
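
Why depth-wise separable convolution helps keep such a model light can be shown with a short cost calculation. The sketch below counts weights and multiply-accumulate operations for a standard convolution and for a depth-wise plus point-wise pair, assuming stride 1, "same" padding, and no bias. The layer sizes are arbitrary illustrative values and do not describe the actual SAEViT blocks.

```python
from typing import Tuple


def standard_conv_cost(kernel: int, c_in: int, c_out: int, height: int, width: int) -> Tuple[int, int]:
    """(parameters, multiply-accumulates) of a standard k x k convolution."""
    params = kernel * kernel * c_in * c_out
    macs = params * height * width  # every weight is used once per output position
    return params, macs


def depthwise_separable_cost(kernel: int, c_in: int, c_out: int, height: int, width: int) -> Tuple[int, int]:
    """(parameters, multiply-accumulates) of a depth-wise k x k conv followed by a 1 x 1 conv."""
    depthwise_params = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise_params = c_in * c_out            # 1 x 1 conv that mixes channels
    params = depthwise_params + pointwise_params
    macs = params * height * width
    return params, macs


# Arbitrary example layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
std_params, std_macs = standard_conv_cost(3, 64, 128, 56, 56)
dws_params, dws_macs = depthwise_separable_cost(3, 64, 128, 56, 56)
print(std_params, dws_params, round(std_macs / dws_macs, 2))
```

For this example layer the separable version needs roughly one eighth of the weights and operations of the standard convolution.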

[79] MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration

Shunyu Yao,Ming Liu,Zhilu Zhang,Zhaolin Wan,Zhilong Ji,Jinfeng Bai,Wangmeng Zuo

Main category: cs.CV

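TL;DR: The paper proposes a multi-dimensional image quality assessment framework (MDIQA) that evaluates image quality along several technical and aesthetic dimensions and can be applied flexibly to image restoration by adjusting dimension weights.

Details Motivation: Most existing image quality assessment methods focus only on an overall score and ignore the fact that humans judge image quality along several dimensions. The authors propose a multi-dimensional assessment to match human visual perception more closely.

Contribution: Proposes the MDIQA framework, which models technical and aesthetic dimensions (nine in total) in a unified way for the first time and produces the final score through branch-wise training and feature fusion. The method can also be applied flexibly to image restoration, with weights adjusted to match user preferences.

Method: Separate branches model the technical and aesthetic dimensions, each trained on a specific dimension, and the branch features are then fused to produce an overall score. The framework also supports adjusting dimension weights to guide the training of image restoration models.

Result: Experiments show that MDIQA performs well on image quality assessment and improves the output quality of image restoration models so that results better match user needs.

Insight: Multi-dimensional assessment is closer to human visual perception, and separating dimensions and adjusting their weights dynamically gives more flexible, personalized solutions for image quality assessment and restoration.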

Abstract: Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perceptions. However, most existing methods focus narrowly on fitting the overall score, neglecting the fact that humans typically evaluate image quality from different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of our MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then amalgamated to generate the final IQA score. Additionally, once the MDIQA model is trained, we can deploy it for flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that our MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available at https://github.com/YaoShunyu19/MDIQA.
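
One simple way to picture "adjusting perceptual dimension weights" is a weighted average of per-dimension quality scores turned into a training penalty. The sketch below is only an illustration under that assumption: the dimension names are hypothetical, the scores are assumed to lie in [0, 1], and the abstract does not say that MDIQA combines dimensions this way (it fuses branch features for its own overall score).

```python
from typing import Dict


def weighted_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores; dimensions without a weight count as 0."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    return sum(weights.get(name, 0.0) * value for name, value in scores.items()) / total_weight


def restoration_penalty(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """A loss-like quantity: lower is better when scores lie in [0, 1]."""
    return 1.0 - weighted_quality(scores, weights)


# Hypothetical per-dimension scores for one restored image.
scores = {"sharpness": 0.8, "noise": 0.6, "color": 0.4}
balanced = {"sharpness": 1.0, "noise": 1.0, "color": 1.0}
sharpness_first = {"sharpness": 2.0, "noise": 1.0, "color": 1.0}

print(weighted_quality(scores, balanced))         # about 0.6
print(weighted_quality(scores, sharpness_first))  # about 0.65
```

Raising the weight of one dimension changes which restored images score well, which is the mechanism by which a user preference could steer restoration training.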

[80] Structural Energy-Guided Sampling for View-Consistent Text-to-3D

Qing Zhang,Jinguang Tong,Jie Hong,Jing Zhang,Xuesong Li

Main category: cs.CV

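TL;DR: The paper proposes SEGS, a training-free framework that enforces multi-view consistency at sampling time to mitigate the Janus problem in text-to-3D generation, and it markedly improves geometric alignment and viewpoint consistency.

Details Motivation: To address the Janus problem in text-to-3D generation, where an object looks correct from the front but shows distorted or duplicated geometry from other angles. The authors attribute the problem to viewpoint bias in 2D diffusion priors.

Contribution: Proposes the SEGS framework, which defines a structural energy in a PCA subspace and injects its gradients into the denoising trajectory, improving multi-view consistency without any training.

Method: Defines a structural energy in a PCA subspace of intermediate U-Net features, uses its gradients to steer the geometry, and integrates seamlessly into SDS/VSD pipelines.

Result: SEGS significantly reduces Janus artifacts and improves geometric alignment and viewpoint consistency, with no retraining or weight modification.

Insight: Viewpoint consistency problems can be addressed by introducing a structural energy at sampling time, without additional training, which makes the approach general and efficient.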

Abstract: Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification.

[81] Align 3D Representation and Text Embedding for 3D Content Personalization

Qi Song,Ziyuan Luo,Ka Chun Cheung,Simon See,Renjie Wan

Main category: cs.CV

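TL;DR: The paper proposes Invert3D, a framework that aligns 3D content with the text embedding space to personalize 3D content efficiently, avoiding the high computational cost of traditional knowledge-distillation-based methods.

Details Motivation: Although NeRF and 3DGS have greatly improved the efficiency and quality of 3D content synthesis, efficient personalization of 3D content remains a challenge. Existing methods rely on computationally expensive knowledge distillation, so a more efficient solution is needed.

Contribution: Proposes the Invert3D framework, which establishes a mapping between 3D representations and text embeddings and personalizes 3D content without retraining.

Method: Develops a camera-conditioned 3D-to-text inverse mechanism that projects 3D content into a 3D embedding space aligned with text embeddings, so that personalization can be driven by natural language prompts.

Result: Experiments show that Invert3D personalizes 3D content effectively while avoiding high computational cost.

Insight: Aligning 3D representations with text embeddings allows 3D content to be manipulated directly with natural language, giving the 3D generation field a more flexible personalization tool.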

Abstract: Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose Invert3D, a novel framework for convenient 3D content personalization. Vision-language models such as CLIP now enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D content into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally expensive retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content. Our work is available at: https://github.com/qsong2001/Invert3D.

[82] Addressing Annotation Scarcity in Hyperspectral Brain Image Segmentation with Unsupervised Domain Adaptation

Tim Mach,Daniel Rueckert,Alex Berger,Laurin Lux,Ivan Ezhov

Main category: cs.CV

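TL;DR: The paper proposes an unsupervised domain adaptation method for the annotation scarcity problem in hyperspectral brain image segmentation, and it significantly outperforms existing techniques.

Details Motivation: Hyperspectral brain image segmentation suffers from scarce annotations, which makes conventional supervised learning hard to apply.

Contribution: Proposes a new unsupervised domain adaptation framework that combines a small amount of expert-annotated data with a large amount of unlabeled data to address annotation scarcity.

Method: Uses unsupervised domain adaptation, with a small expert-annotated set as the adaptation target and unlabeled data to improve model performance.

Result: Quantitative and qualitative evaluations show that the method significantly outperforms existing techniques, supporting the effectiveness of domain adaptation for label-scarce biomedical imaging tasks.

Insight: Unsupervised domain adaptation is a promising direction for annotation scarcity, especially in biomedical image analysis.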

Abstract: This work presents a novel deep learning framework for segmenting cerebral vasculature in hyperspectral brain images. We address the critical challenge of severe label scarcity, which impedes conventional supervised training. Our approach utilizes a novel unsupervised domain adaptation methodology, using a small, expert-annotated ground truth alongside unlabeled data. Quantitative and qualitative evaluations confirm that our method significantly outperforms existing state-of-the-art approaches, demonstrating the efficacy of domain adaptation for label-scarce biomedical imaging tasks.

[83] NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability

Krishna Kanth Nakka,Alexandre Alahi

Main category: cs.CV

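TL;DR: The paper proposes NAT, which improves the transferability of adversarial examples by attacking specific neurons in a neural network. Experiments show that it outperforms existing methods in cross-model and cross-domain settings.

Details Motivation: Existing methods for generating adversarial examples usually optimize over a single mid-layer embedding, so a few neurons receive most of the attention while others are barely affected. The authors aim to disrupt the network more thoroughly by targeting neurons, and thereby improve transferability.

Contribution: 1. Proposes NAT, which takes neurons as the attack target. 2. Demonstrates experimentally that NAT is superior in cross-model and cross-domain settings. 3. Shows that high attack success rates can be reached with only a few queries.

Method: NAT trains generators to attack specific neurons within the embedding layer rather than the whole embedding. It optimizes neuron-level perturbations and so interferes more effectively with the network's underlying mechanisms.

Result: In experiments on 41 ImageNet models and 9 fine-grained models, NAT's fooling rates exceed the baselines by over 14% in the cross-model setting and by 4% in the cross-domain setting. High fooling rates are reached within just 10 queries.

Insight: Targeting neurons disrupts a network's underlying mechanisms more effectively and provides a common basis for adversarial transferability. It also suggests that the vulnerability of neural networks is concentrated in specific neurons.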

Abstract: The generation of transferable adversarial perturbations typically involves training a generator to maximize embedding separation between clean and adversarial images at a single mid-layer of a source model. In this work, we build on this approach and introduce Neuron Attack for Transferability (NAT), a method designed to target specific neurons within the embedding. Our approach is motivated by the observation that previous layer-level optimizations often disproportionately focus on a few neurons representing similar concepts, leaving other neurons within the attacked layer minimally affected. NAT shifts the focus from embedding-level separation to a more fundamental, neuron-specific approach. We find that targeting individual neurons effectively disrupts the core units of the neural network, providing a common basis for transferability across different models. Through extensive experiments on 41 diverse ImageNet models and 9 fine-grained models, NAT achieves fooling rates that surpass existing baselines by over 14% in cross-model and 4% in cross-domain settings. Furthermore, by leveraging the complementary attacking capabilities of the trained generators, we achieve impressive fooling rates within just 10 queries. Our code is available at: https://krishnakanthnakka.github.io/NAT/

[84] HieroAction: Hierarchically Guided VLM for Fine-Grained Action Analysis

Junhao Wu,Xiuer Gu,Zhiying Li,Yeying Jin,Yunfeng Diao,Zhiyu Li,Zhenbo Song,Xiaomei Zhang,Zhaoxin Fan

Main category: cs.CV

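TL;DR: HieroAction is a vision-language model that uses stepwise action reasoning and hierarchical policy learning to provide fine-grained analysis and interpretable scoring of human actions.

Details Motivation: Existing methods usually give only a final score without detailed analysis, which limits their usefulness in areas such as sports, healthcare, and robotics. HieroAction aims to address this.

Contribution: Proposes Stepwise Action Reasoning and Hierarchical Policy Learning, which together improve the interpretability of action assessment and the precision of scoring.

Method: 1. Stepwise Action Reasoning: proceeds step by step from overall recognition through sub-action analysis to final scoring. 2. Hierarchical Policy Learning: uses reinforcement learning to align sub-action dynamics with high-level action quality.

Result: Performs strongly on multiple benchmark datasets, supporting its accuracy and interpretability.

Insight: Combining structured reasoning with reinforcement learning makes action assessment more transparent and precise, which suits scenarios that need detailed feedback.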

Abstract: Evaluating human actions with clear and detailed feedback is important in areas such as sports, healthcare, and robotics, where decisions rely not only on final outcomes but also on interpretable reasoning. However, most existing methods provide only a final score without explanation or detailed analysis, limiting their practical applicability. To address this, we introduce HieroAction, a vision-language model that delivers accurate and structured assessments of human actions. HieroAction builds on two key ideas: (1) Stepwise Action Reasoning, a tailored chain of thought process designed specifically for action assessment, which guides the model to evaluate actions step by step, from overall recognition through sub action analysis to final scoring, thus enhancing interpretability and structured understanding; and (2) Hierarchical Policy Learning, a reinforcement learning strategy that enables the model to learn fine grained sub action dynamics and align them with high level action quality, thereby improving scoring precision. The reasoning pathway structures the evaluation process, while policy learning refines each stage through reward based optimization. Their integration ensures accurate and interpretable assessments, as demonstrated by superior performance across multiple benchmark datasets. Code will be released upon acceptance.

[85] RPD-Diff: Region-Adaptive Physics-Guided Diffusion Model for Visibility Enhancement under Dense and Non-Uniform Haze

Ruicheng Zhang,Puxin Yan,Zeyu Zhang,Yicheng Chang,Hongyi Chen,Zhi Jin

Main category: cs.CV

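TL;DR: RPD-Diff is a region-adaptive, physics-guided diffusion model for image dehazing under dense and non-uniform haze. It uses physics-guided intermediate state targeting and a haze-aware denoising timestep predictor to improve dehazing quality markedly.

Details Motivation: Existing diffusion-based dehazing methods perform poorly under dense, non-uniform haze because of insufficient generation conditioning and a lack of adaptability to spatially varying haze distributions. The paper aims to address these problems.

Contribution: 1. Proposes the Physics-guided Intermediate State Targeting (PIST) strategy, which uses physical priors to improve the diffusion process. 2. Designs the Haze-Aware Denoising Timestep Predictor (HADTP), which dynamically adjusts denoising timesteps to handle non-uniform haze distributions.

Method: 1. PIST reformulates the diffusion Markov chain using physical priors. 2. HADTP adjusts denoising timesteps dynamically through a transmission map cross-attention mechanism.

Result: Experiments on four real-world datasets show that RPD-Diff achieves state-of-the-art performance in dense and non-uniform haze scenarios and produces high-quality haze-free images.

Insight: 1. Combining physical priors with diffusion models can substantially improve dehazing. 2. Dynamically adjusting denoising timesteps is essential for handling non-uniform haze distributions.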

Abstract: Single-image dehazing under dense and non-uniform haze conditions remains challenging due to severe information degradation and spatial heterogeneity. Traditional diffusion-based dehazing methods struggle with insufficient generation conditioning and lack of adaptability to spatially varying haze distributions, which leads to suboptimal restoration. To address these limitations, we propose RPD-Diff, a Region-adaptive Physics-guided Dehazing Diffusion Model for robust visibility enhancement in complex haze scenarios. RPD-Diff introduces a Physics-guided Intermediate State Targeting (PIST) strategy, which leverages physical priors to reformulate the diffusion Markov chain by generation target transitions, mitigating the issue of insufficient conditioning in dense haze scenarios. Additionally, the Haze-Aware Denoising Timestep Predictor (HADTP) dynamically adjusts patch-specific denoising timesteps employing a transmission map cross-attention mechanism, adeptly managing non-uniform haze distributions. Extensive experiments across four real-world datasets demonstrate that RPD-Diff achieves state-of-the-art performance in challenging dense and non-uniform haze scenarios, delivering high-quality, haze-free images with superior detail clarity and color fidelity.

[86] Robust Diagram Reasoning: A Framework for Enhancing LVLM Performance on Visually Perturbed Scientific Diagrams

Minghao Zhou,Rafael Souza,Yaqian Hu,Luming Che

Main category: cs.CV

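TL;DR: This paper proposes the RDR framework to make large vision-language models (LVLMs) more robust when reasoning over visually perturbed scientific diagrams. It introduces the AMCV mechanism, two new evaluation metrics (PRS and VDC), and the SciDiagram-Robust dataset.

Details Motivation: Existing LVLMs lack robustness to visual perturbations (such as noise, blur, and occlusion) in scientific diagram tasks, and current benchmarks pay little attention to this problem.

Contribution: 1. The RDR framework and the AMCV mechanism; 2. The PRS and VDC metrics; 3. The SciDiagram-Robust dataset.

Method: Generate multiple perturbed views of a diagram, run inference on them in parallel, and apply a consistency-based self-correction loop (the AMCV mechanism) to improve robustness.

Result: Experiments show that even advanced models such as GPT-4V degrade significantly on perturbed inputs (Clean Accuracy 85.2% vs. PRS 72.1%).

Insight: Visual perturbations significantly affect multimodal model performance and call for targeted optimization.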

Abstract: Large Language Models (LLMs) and their multimodal variants (LVLMs) hold immense promise for scientific and engineering applications, particularly in processing visual information like scientific diagrams. However, their practical deployment is hindered by a critical lack of robustness to common visual perturbations such as noise, blur, and occlusions, which are prevalent in real-world scientific documents. Existing evaluation benchmarks largely overlook this challenge, leaving the robust reasoning capabilities of LVLMs on visually degraded scientific diagrams underexplored. To address this, we introduce the Robust Diagram Reasoning (RDR) framework, a novel approach designed to enhance and rigorously evaluate LVLMs’ performance under such conditions. At its core, RDR employs an Adaptive Multi-View & Consistency Verification (AMCV) mechanism, which involves generating multiple perturbed versions of a diagram, performing parallel inference, and then applying a consistency-based self-correction loop. We also propose two new metrics, Perturbation Robustness Score (PRS) and Visual Degradation Consistency (VDC), to quantify robustness. Furthermore, we construct SciDiagram-Robust, the first large-scale scientific diagram question-answering dataset specifically augmented with diverse, programmatically generated visual perturbations. Our extensive experiments demonstrate that even state-of-the-art closed-source LVLMs like GPT-4V exhibit significant performance degradation when faced with perturbed inputs (Clean Accuracy 85.2% vs. PRS 72.1%).
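
The AMCV steps (perturb, infer in parallel, check consistency, self-correct) can be sketched as a majority vote with a retry loop. This is a minimal illustration, not the paper's implementation: `model` and the entries of `perturbations` are hypothetical callables, and the agreement threshold, round limit, and hint format are assumptions.

```python
from collections import Counter


def consistency_vote(answers):
    """Return (majority answer, fraction of views that agree with it)."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def answer_with_verification(diagram, question, model, perturbations,
                             threshold=0.6, max_rounds=2):
    # One view per perturbation (hypothetical callables: diagram -> diagram).
    views = [perturb(diagram) for perturb in perturbations]
    answers = [model(view, question) for view in views]
    answer, agreement = consistency_vote(answers)
    for _ in range(max_rounds):
        if agreement >= threshold:
            break
        # Assumed self-correction step: re-ask with the current majority as a hint.
        hinted = f"{question} (previous majority answer: {answer})"
        answers = [model(view, hinted) for view in views]
        answer, agreement = consistency_vote(answers)
    return answer, agreement
```

The agreement fraction doubles as a rough confidence signal; the paper's PRS and VDC metrics are not defined in the abstract, so they are not reproduced here.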

[87] Balanced Sharpness-Aware Minimization for Imbalanced Regression

Yahao Liu,Qin Wang,Lixin Duan,Wen Li

Main category: cs.CV

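TL;DR: This paper proposes BSAM (Balanced Sharpness-Aware Minimization), which tackles imbalanced regression and improves generalization on imbalanced data.

Details Motivation: Real-world regression data are often imbalanced, so models perform poorly on rarely observed target values. The paper reframes imbalanced regression as an imbalanced generalization problem and studies generalization across the observation space.

Contribution: Proposes BSAM, which combines sharpness-aware minimization with a targeted reweighting strategy to enforce uniform generalization across the entire observation space.

Method: Starting from traditional sharpness-aware minimization, BSAM adds a targeted reweighting strategy that balances generalization ability across target observations.

Result: On tasks including age estimation and depth estimation, BSAM consistently outperforms existing methods.

Insight: Balancing generalization ability can improve performance on imbalanced data and offers a new optimization perspective for regression.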

Abstract: Regression is fundamental in computer vision and is widely used in various tasks including age estimation, depth estimation, target localization, etc. However, real-world data often exhibits an imbalanced distribution, making regression models perform poorly, especially for target values with rare observations (known as the imbalanced regression problem). In this paper, we reframe imbalanced regression as an imbalanced generalization problem. To tackle that, we look into the loss sharpness property for measuring the generalization ability of regression models in the observation space. Namely, given a certain perturbation on the model parameters, we check how model performance changes according to the loss values of different target observations. We propose a simple yet effective approach called Balanced Sharpness-Aware Minimization (BSAM) to enforce the uniform generalization ability of regression models for the entire observation space. In particular, we start from the traditional sharpness-aware minimization and then introduce a novel targeted reweighting strategy to homogenize the generalization ability across the observation space, which guarantees a theoretical generalization bound. Extensive experiments on multiple vision regression tasks, including age and depth estimation, demonstrate that our BSAM method consistently outperforms existing approaches. The code is available at https://github.com/manmanjun/BSAM_for_Imbalanced_Regression.
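
To make "sharpness-aware minimization plus reweighting" concrete, here is a minimal sketch on a one-parameter regression y = w*x. The SAM step (perturb the weights along the gradient by radius rho, then descend using the gradient at the perturbed point) is the standard one. The inverse-frequency weights are a stand-in assumption, since the abstract does not specify BSAM's targeted reweighting rule.

```python
from collections import Counter


def inverse_frequency_weights(targets):
    """Assumed reweighting: rarer target values get larger weights."""
    counts = Counter(targets)
    return [1.0 / counts[t] for t in targets]


def weighted_grad(w, xs, ys, weights):
    """Gradient of the weighted mean squared error of y = w*x with respect to w."""
    total = sum(weights)
    return sum(2.0 * a * (w * x - y) * x for x, y, a in zip(xs, ys, weights)) / total


def sam_step(w, xs, ys, weights, lr=0.05, rho=0.05):
    g = weighted_grad(w, xs, ys, weights)
    eps = rho * g / (abs(g) + 1e-12)              # step of size rho toward higher loss
    g_sharp = weighted_grad(w + eps, xs, ys, weights)
    return w - lr * g_sharp                        # descend with the perturbed gradient


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
weights = inverse_frequency_weights(ys)
w = 0.0
for _ in range(100):
    w = sam_step(w, xs, ys, weights)
```

Because the update uses the gradient at the perturbed point, w settles in a small neighborhood of the minimizer (w = 2 here) rather than exactly on it.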

[88] Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding

Leilei Guo,Antonio Carlos Rivera,Peiyu Tang,Haoxuan Ren,Zheyu Song

Main category: cs.CV

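TL;DR: HCG-LVLM is a hierarchical model that mimics human coarse-to-fine cognitive processing to improve the robustness and precision of fine-grained visual-language understanding.

Details Motivation: Existing large vision-language models (LVLMs) are prone to hallucination and reasoning errors in complex scenes, especially on tasks that require precise image region localization and fine-grained visual reasoning.

Contribution: Proposes HCG-LVLM, a hierarchical design with a Global Contextual Perception layer and a Fine-grained Local Grounding layer, combined with a Local Detail Enhancement Module and a Semantic Consistency Validator, improving precision and robustness.

Method: The model has two layers: a Global Contextual Perception layer for initial coarse understanding, and a Fine-grained Local Grounding layer that extracts high-resolution features with the Local Detail Enhancement Module and uses the Semantic Consistency Validator to ensure accurate visual-language alignment.

Result: Experiments on GQA, A-OKVQA, and RefCOCO/+/g show that HCG-LVLM outperforms state-of-the-art models (such as Flamingo, BLIP-2, and MiniGPT-4) in accuracy and hallucination reduction.

Insight: Mimicking the human hierarchical cognitive process can improve vision-language models on fine-grained tasks while reducing hallucination.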

Abstract: Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) have achieved remarkable progress in natural language processing and multimodal understanding. Despite their impressive generalization capabilities, current LVLMs often exhibit insufficient robustness, proneness to hallucination, and reasoning errors in complex real-world scenarios, particularly when precise image region localization and fine-grained visual reasoning are required. To address these limitations, we propose the Hierarchical Contextual Grounding LVLM (HCG-LVLM), a novel architecture that mimics human coarse-to-fine cognitive processing. HCG-LVLM employs a two-layered approach: a Global Contextual Perception layer for initial broad understanding and a Fine-grained Local Grounding layer. The latter incorporates a Local Detail Enhancement Module to extract high-resolution features and a Semantic Consistency Validator to ensure accurate, hallucination-free visual-language alignment. Through an adaptive fusion mechanism, information from both layers is integrated for robust and precise outputs. Extensive experiments on challenging datasets, including GQA, A-OKVQA for fine-grained VQA, and RefCOCO/+/g for Referring Expression Comprehension, demonstrate that HCG-LVLM consistently outperforms state-of-the-art models such as Flamingo, BLIP-2, and MiniGPT-4. Our model achieves superior accuracy and significantly reduces hallucination, validating the effectiveness of its hierarchical design in enhancing fine-grained visual-language understanding and precise grounding capabilities.

[89] Combating Digitally Altered Images: Deepfake Detection

Saksham Kumar,Rhythm Narang

Main category: cs.CV

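TL;DR: The paper proposes a Deepfake detection model based on a modified Vision Transformer (ViT). It uses data augmentation, handles class imbalance with oversampling and a stratified split, and reports state-of-the-art results on the test dataset.

Details Motivation: As Deepfake technology becomes better at generating hyper-realistic manipulated images and videos, it poses a significant challenge to the public and relevant authorities, so effective detection methods are urgently needed.

Contribution: The main contribution is a Deepfake detection model based on a modified ViT, made more robust through data augmentation and stratified sampling.

Method: A modified ViT model is trained with multiple data augmentation techniques; class imbalance is handled by oversampling and a stratified train-validation split.

Result: The model performs well on the test dataset and detects Deepfake images accurately, reaching state-of-the-art results.

Insight: A modified ViT combined with data augmentation and class-balancing strategies can be effective for Deepfake detection.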

Abstract: The rise of Deepfake technology to generate hyper-realistic manipulated images and videos poses a significant challenge to the public and relevant authorities. This study presents a robust Deepfake detection method based on a modified Vision Transformer (ViT) model, trained to distinguish between real and Deepfake images. The model has been trained on a subset of the OpenForensics Dataset with multiple augmentation techniques to increase robustness to diverse image manipulations. Class imbalance is handled by oversampling and by splitting the dataset into training and validation sets in a stratified manner. Performance is evaluated using the accuracy metric on the training and testing datasets, followed by a prediction score on a random image of people, irrespective of their realness. The model demonstrates state-of-the-art results on the test dataset in detecting Deepfake images.
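
The two imbalance-handling steps named in the abstract (a stratified train-validation split and oversampling) can be sketched with the standard library alone. This is an illustrative sketch, not the authors' pipeline: the 80/20 class ratio, the validation fraction, and the choice to oversample only the training split are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val


def oversample(train_indices, labels, seed=0):
    """Duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_indices:
        by_class[labels[idx]].append(idx)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


labels = ["real"] * 80 + ["fake"] * 20      # hypothetical imbalanced dataset
train_idx, val_idx = stratified_split(labels)
balanced_idx = oversample(train_idx, labels)
```

Oversampling after the split keeps duplicated images out of the validation set, which would otherwise inflate validation accuracy.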

[90] Preserving Domain Generalization in Fine-Tuning via Joint Parameter Selection

Bin Pan,Shiyu Shen,Zongbin Wang,Zhenwei Shi,Xia Xu

Main category: cs.CV

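TL;DR: The paper proposes Joint Parameter Selection (JPS), which fine-tunes only a selected subset of parameters to balance task adaptation against the generalization ability of the pre-trained model.

Details Motivation: Full fine-tuning of pre-trained models can harm their generalization ability, which motivates parameter-efficient adaptation strategies that preserve it.

Contribution: Proposes JPS, which sparsely fine-tunes a parameter subset; establishes a generalization error bound; and designs a gradient-based selection mechanism.

Method: JPS uses dual operators to select parameters whose gradients are consistent and significant across source domains, updates only those, and leaves the rest unchanged.

Result: JPS outperforms existing domain generalization methods on multiple benchmarks, supporting its efficiency and effectiveness.

Insight: Sparse parameter updates help preserve the generalization ability of pre-trained models while still allowing task adaptation.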

Abstract: Domain generalization seeks to develop models trained on a limited set of source domains that are capable of generalizing effectively to unseen target domains. While the predominant approach leverages large-scale pre-trained vision models as initialization, recent studies have highlighted that full fine-tuning can compromise the intrinsic generalization capabilities of these models. To address this limitation, parameter-efficient adaptation strategies have emerged, wherein only a subset of model parameters is selectively fine-tuned, thereby balancing task adaptation with the preservation of generalization. Motivated by this paradigm, we introduce Joint Parameter Selection (JPS), a novel method that restricts updates to a small, sparse subset of parameters, thereby retaining and harnessing the generalization strength of pre-trained models. Theoretically, we establish a generalization error bound that explicitly accounts for the sparsity of parameter updates, thereby providing a principled justification for selective fine-tuning. Practically, we design a selection mechanism employing dual operators to identify and update parameters exhibiting consistent and significant gradients across all source domains. Extensive benchmark experiments demonstrate that JPS achieves superior performance compared to state-of-the-art domain generalization methods, substantiating both the efficiency and efficacy of the proposed approach.
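
One plausible reading of "consistent and significant gradients across all source domains" is sketched below: a parameter is eligible only if its gradient has the same sign in every domain, and among eligible parameters the largest mean magnitudes are kept. The abstract does not define the dual operators, so both criteria, the `keep_fraction` parameter, and the function names are assumptions.

```python
def select_parameters(domain_grads, keep_fraction=0.1):
    """domain_grads[d][j] is the gradient of parameter j on source domain d."""
    n_params = len(domain_grads[0])
    scores = []
    for j in range(n_params):
        column = [grads[j] for grads in domain_grads]
        # Consistency: same sign in every domain. Significance: mean magnitude.
        consistent = all(v > 0 for v in column) or all(v < 0 for v in column)
        scores.append(abs(sum(column)) / len(column) if consistent else 0.0)
    k = max(1, int(n_params * keep_fraction))
    ranked = sorted(range(n_params), key=lambda j: scores[j], reverse=True)[:k]
    return [j in ranked and scores[j] > 0 for j in range(n_params)]


def sparse_update(params, grads, mask, lr=0.01):
    """Gradient step on selected parameters only; the rest stay at pre-trained values."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


domain_grads = [[0.9, -0.5, 0.1, 0.4],
                [0.8, 0.6, 0.1, 0.5]]
mask = select_parameters(domain_grads, keep_fraction=0.5)
```

In this toy input, parameter 1 is dropped because its gradient flips sign between domains, and parameter 2 is dropped because its gradient is consistent but small.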

[91] HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching

Liang Feng,Shikang Zheng,Jiacheng Liu,Yuqi Lin,Qinming Zhou,Peiliang Cai,Xinyu Wang,Junjie Chen,Chang Zou,Yue Ma,Linfeng Zhang

Main category: cs.CV

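TL;DR: HiCache is a training-free acceleration framework that speeds up diffusion models through Hermite polynomial-based feature caching, addressing the quality loss that existing methods suffer because feature evolution dynamics are complex.

Details Motivation: Diffusion models excel at content generation but are computationally expensive. Existing feature caching methods struggle to model the complex dynamics of feature evolution accurately, which degrades generation quality.

Contribution: Proposes the HiCache framework, which builds on Hermite polynomials and a dual-scaling mechanism to significantly accelerate diffusion model inference while maintaining generation quality.

Method: Models feature derivative approximations with Hermite polynomials and uses a dual-scaling mechanism to balance numerical stability and predictive accuracy, giving training-free feature caching.

Result: Achieves a 6.24x speedup on FLUX.1-dev with quality exceeding the baseline, and performs well on text-to-image, video generation, and other tasks.

Insight: The multivariate Gaussian characteristics of feature derivative approximations in diffusion models motivate the use of Hermite polynomials and give theoretical support for accelerated inference.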

Abstract: Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods accelerate inference through temporal extrapolation, they still suffer from severe quality loss because they fail to model the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache, a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials, the potentially theoretically optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy. Extensive experiments demonstrate HiCache’s superiority: achieving 6.24x speedup on FLUX.1-dev while exceeding baseline quality, maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Core implementation is provided in the appendix, with complete code to be released upon acceptance.
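
The idea of swapping the extrapolation basis can be illustrated as follows. The sketch evaluates probabilists' Hermite polynomials with the three-term recurrence He_{n+1}(x) = x*He_n(x) - n*He_{n-1}(x) and plugs either a monomial or a scaled Hermite basis into a Taylor-style forecast built from cached finite differences. It is illustrative only: the choice of Hermite family, the `sigma` scale, and the forecast formula are assumptions, not HiCache's actual predictor.

```python
import math


def hermite_e(n, x):
    """Probabilists' Hermite polynomial He_n(x) via the three-term recurrence."""
    prev, curr = 1.0, x
    if n == 0:
        return prev
    for k in range(1, n):
        prev, curr = curr, x * curr - k * prev
    return curr


def monomial_basis(order, step):
    return step ** order


def scaled_hermite_basis(order, step, sigma=0.5):
    # sigma is a hypothetical contraction factor that keeps high orders bounded.
    return hermite_e(order, sigma * step)


def forecast(diffs, step, basis):
    """diffs[i] approximates the i-th feature derivative at the last full step."""
    return sum(d * basis(i, step) / math.factorial(i) for i, d in enumerate(diffs))


cached = [1.0, 0.5]   # cached feature value and first finite difference
taylor_prediction = forecast(cached, 2, monomial_basis)
hermite_prediction = forecast(cached, 2, scaled_hermite_basis)
```

With the monomial basis this reduces to an ordinary Taylor extrapolation, which makes the two bases easy to compare on the same cached differences.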

[92] Probabilistic Temporal Masked Attention for Cross-view Online Action Detection

Liping Xie,Yang Tan,Shicheng Jing,Huimin Lu,Kanjian Zhang

Main category: cs.CV

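TL;DR: The paper proposes a Probabilistic Temporal Masked Attention (PTMA) model for cross-view online action detection, using probabilistic modeling and a GRU-based masked attention mechanism to improve performance in cross-view settings.

Details Motivation: Mainstream online action detection (OAD) models generalize poorly across video viewpoints; PTMA is proposed to address this.

Contribution: PTMA generates compressed representations of video frames through probabilistic modeling and GRU-based masked attention, and integrates multi-view information to extract view-invariant features, improving cross-view OAD performance.

Method: Uses a GRU-based temporal masked attention (TMA) cell together with probabilistic modeling to produce latent representations, and strengthens feature extraction with multi-view information.

Result: On the DAHLIA, IKEA ASM, and Breakfast datasets, PTMA achieves state-of-the-art performance under three evaluation protocols: cross-subject, cross-view, and cross-subject-view.

Insight: Probabilistic modeling and multi-view information fusion are key to cross-view OAD performance, and the GRU-based masked attention mechanism effectively enhances frame-level information interaction.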

Abstract: As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols: cross-subject (cs), cross-view (cv), and cross-subject-view (csv) show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.

[93] A Novel Local Focusing Mechanism for Deepfake Detection Generalization

Mingliang Li,Lin Yuanbo Wu,Changhong Liu,Hanxi Li

Main category: cs.CV

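TL;DR: The paper proposes a Local Focus Mechanism (LFM) that attends to local forgery features to improve the generalization of deepfake detection; it outperforms existing methods and performs strongly in cross-domain detection.

Details Motivation: The rapid development of deepfake techniques demands detection methods that generalize better. Existing reconstruction-learning methods perform poorly across object categories and generation domains, mainly because of intrinsic limitations of deep convolutional networks.

Contribution: Proposes LFM, which combines a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module and introduces two regularization techniques (RBLD and RKS), improving cross-domain detection performance.

Method: LFM selects key local features through the SNet and TKP modules and uses RBLD and RKS regularization to mitigate overfitting. Experiments show it outperforms existing methods such as NPR.

Result: LFM improves accuracy by 3.7 and average precision by 2.8 over NPR and runs at 1789 FPS.

Insight: Local forgery features are key to generalization; explicitly attending to local patterns improves cross-domain detection, and regularization mitigates the overfitting that Top-K pooling can introduce.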

Abstract: The rapid advancement of deepfake generation techniques has intensified the need for robust and generalizable detection methods. Existing approaches based on reconstruction learning typically leverage deep convolutional networks to extract differential features. However, these methods show poor generalization across object categories (e.g., from faces to cars) and generation domains (e.g., from GANs to Stable Diffusion), due to intrinsic limitations of deep CNNs. First, models trained on a specific category tend to overfit to semantic feature distributions, making them less transferable to other categories, especially as network depth increases. Second, Global Average Pooling (GAP) compresses critical local forgery cues into a single vector, thus discarding discriminative patterns vital for real-fake classification. To address these issues, we propose a novel Local Focus Mechanism (LFM) that explicitly attends to discriminative local features for differentiating fake from real images. LFM integrates a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module to select the K most informative local patterns. To mitigate potential overfitting introduced by Top-K pooling, we introduce two regularization techniques: Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), which enhance the model’s robustness. LFM achieves a 3.7 improvement in accuracy and a 2.8 increase in average precision over the state-of-the-art Neighboring Pixel Relationships (NPR) method, while maintaining exceptional efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Our approach sets a new benchmark for cross-domain deepfake detection. The source code is available at https://github.com/lmlpy/LFM.git
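
The contrast between Global Average Pooling and Top-K pooling is easy to see on a toy salience map. The sketch below treats the map as a flat list of per-patch scores and averages the K largest; the scores and K are made-up values, and the paper's TKP module operates on learned features rather than raw scores.

```python
def global_average_pool(scores):
    """GAP: every patch contributes equally, so sparse forgery cues are diluted."""
    return sum(scores) / len(scores)


def top_k_pool(scores, k):
    """Top-K pooling: average only the K most salient local responses."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical salience map: two patches carry strong forgery evidence.
patch_scores = [0.1, 0.0, 0.9, 0.2, 0.8]
gap_value = global_average_pool(patch_scores)
tkp_value = top_k_pool(patch_scores, k=2)
```

Here GAP gives about 0.4 while Top-K pooling gives about 0.85, showing how a few strong local cues survive the second pooling but are averaged away by the first.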

[94] F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search

Raghul Asokan

Main category: cs.CV

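TL;DR: F4-ITS is a training-free, vision-language model (VLM)-guided framework that improves food image-text search through fine-grained feature fusion, using a multi-modal fusion strategy and a feature-based re-ranking mechanism.

Details Motivation: The growth of digital food content calls for more accurate visual understanding and retrieval systems; food image-to-text matching in particular is critical for applications such as dietary monitoring and smart kitchens.

Contribution: 1. A uni-directional (and bi-directional) multi-modal fusion strategy that makes queries more expressive; 2. A feature-based re-ranking mechanism that uses predicted food ingredients to refine retrieval results.

Method: Combines VLM-generated textual descriptions with image embeddings and re-ranks retrieval results using ingredient predictions.

Result: F4-ITS improves top-1 retrieval by about 10% and 7.7% under dense and sparse caption scenarios, respectively, and top-k ingredient-level retrieval by about 28.6%. Smaller models using the method can match or outperform larger ones.

Insight: Fine-grained feature fusion and re-ranking improve retrieval performance; in resource-constrained settings especially, small models with textual fusion can match large models.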

Abstract: The proliferation of digital food content has intensified the need for robust and accurate systems capable of fine-grained visual understanding and retrieval. In this work, we address the challenging task of food image-to-text matching, a critical component in applications such as dietary monitoring, smart kitchens, and restaurant automation. We propose F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search, a training-free, vision-language model (VLM)-guided framework that significantly improves retrieval performance through enhanced multi-modal feature representations. Our approach introduces two key contributions: (1) a uni-directional (and bi-directional) multi-modal fusion strategy that combines image embeddings with VLM-generated textual descriptions to improve query expressiveness, and (2) a novel feature-based re-ranking mechanism for top-k retrieval, leveraging predicted food ingredients to refine results and boost precision. Leveraging open-source image-text encoders, we demonstrate substantial gains over standard baselines, achieving ~10% and ~7.7% improvements in top-1 retrieval under dense and sparse caption scenarios, and a ~28.6% gain in top-k ingredient-level retrieval. Additionally, we show that smaller models (e.g., ViT-B/32) can match or outperform larger counterparts (e.g., ViT-H, ViT-G, ViT-bigG) when augmented with textual fusion, highlighting the effectiveness of our method in resource-constrained settings. Code and test datasets will be made publicly available at: https://github.com/mailcorahul/f4-its
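
The two contributions can be sketched as a fused query followed by an ingredient-overlap re-sort. This is a minimal illustration under stated assumptions: the embeddings are toy 2-D vectors rather than encoder outputs, the weighted-sum fusion with `alpha` and the overlap-count re-ranking score are guesses at reasonable defaults, and all function names are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def fuse(image_emb, text_emb, alpha=0.5):
    """Weighted sum of unit-normalized image and VLM-description embeddings."""
    a, b = normalize(image_emb), normalize(text_emb)
    return normalize([alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)])


def top_k(query, caption_embs, k):
    """Rank caption ids by cosine similarity to the (fused) query."""
    q = normalize(query)
    scores = {cid: sum(x * y for x, y in zip(q, normalize(emb)))
              for cid, emb in caption_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def rerank(candidates, predicted_ingredients, caption_ingredients):
    """Stable re-sort of the top-k by overlap with the predicted ingredients."""
    return sorted(candidates,
                  key=lambda cid: -len(predicted_ingredients & caption_ingredients[cid]))


query = fuse([1.0, 0.0], [0.0, 1.0])
hits = top_k(query, {"pizza": [1.0, 1.0], "salad": [1.0, -1.0]}, k=2)
```

Because `sorted` is stable, candidates with equal ingredient overlap keep their similarity order, so the re-ranking only moves items when the ingredient evidence differs.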

[95] M3DMap: Object-aware Multimodal 3D Mapping for Dynamic Environments

Dmitry Yudin

Main category: cs.CV

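TL;DR: M3DMap is an object-aware multimodal 3D mapping method for dynamic environments. The paper classifies existing methods, offers a modular solution, and combines multimodal data with foundation models to improve 3D mapping.

Details Motivation: 3D mapping in dynamic environments lacks a universal representation for multimodal data. M3DMap aims to address this with a flexible multimodal 3D mapping solution for both static and dynamic scenes.

Contribution: 1. A taxonomy of multimodal 3D mapping methods; 2. The modular M3DMap method, with modules for segmentation and tracking, odometry estimation, map construction, and data retrieval; 3. Theoretical propositions showing the positive effect of multimodal data and foundation models on 3D mapping.

Method: M3DMap comprises a multimodal object segmentation and tracking module, an odometry estimation module with trainable algorithms, a map construction module (supporting several scene representations), and a multimodal data retrieval module.

Result: M3DMap is reported to perform well on tasks such as 3D object grounding and mobile manipulation.

Insight: Combining multimodal data with modern foundation models can improve 3D mapping capability and applications in dynamic environments.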

Abstract: 3D mapping in dynamic environments poses a challenge for modern researchers in robotics and autonomous transportation. There are no universal representations for dynamic 3D scenes that incorporate multimodal data such as images, point clouds, and text. This article takes a step toward solving this problem. It proposes a taxonomy of methods for constructing multimodal 3D maps, classifying contemporary approaches based on scene types and representations, learning methods, and practical applications. Using this taxonomy, a brief structured analysis of recent methods is provided. The article also describes an original modular method called M3DMap, designed for object-aware construction of multimodal 3D maps for both static and dynamic scenes. It consists of several interconnected components: a neural multimodal object segmentation and tracking module; an odometry estimation module, including trainable algorithms; a module for 3D map construction and updating with various implementations depending on the desired scene representation; and a multimodal data retrieval module. The article highlights original implementations of these modules and their advantages in solving various practical tasks, from 3D object grounding to mobile manipulation. Additionally, it presents theoretical propositions demonstrating the positive effect of using multimodal data and modern foundational models in 3D mapping methods. Details of the taxonomy and method implementation are available at https://yuddim.github.io/M3DMap.

[96] PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models

Xianjing Cheng,Lintai Wu,Zuowen Wang,Junhui Hou,Jie Wen,Yong Xu

Main category: cs.CV

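TL;DR: PVNet is a diffusion-based point-voxel interaction framework for LiDAR point cloud upsampling that needs no dense supervision and is, according to the authors, the first to support arbitrary upsampling rates.

Details Motivation: LiDAR scans are extremely sparse, and existing upsampling methods focus on individual objects and generalize poorly to complex outdoor scenes.

Contribution: 1. The first scene-level point cloud upsampling method supporting arbitrary upsampling rates; 2. A framework based on diffusion models and point-voxel interaction; 3. A voxel completion module and a point-voxel interaction module.

Method: Uses classifier-free guidance-based DDPMs with the sparse point cloud as the condition and point clouds synthesized from nearby frames as the input; the voxel completion and point-voxel interaction modules refine the feature representation.

Result: Achieves state-of-the-art performance on multiple benchmarks.

Insight: Combining point-voxel interaction with diffusion models improves point cloud upsampling, especially in complex scenes.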

Abstract: Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at https://github.com/chengxianjing/PVNet.
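
Classifier-free guidance, which the abstract names as the guidance scheme, combines a conditional and an unconditional noise prediction at each denoising step. The sketch below uses one common convention, eps = eps_uncond + s * (eps_cond - eps_uncond); the paper may use a different parameterization, and the flat lists stand in for per-point noise tensors.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 ignores the condition, 1 reproduces the conditional
    prediction, and values above 1 push further toward the condition
    (here, the sparse LiDAR scan).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # hypothetical prediction without the sparse-scan condition
eps_cond = [1.0, 1.0]     # hypothetical prediction with the condition
blended = guided_noise(eps_uncond, eps_cond, guidance_scale=2.0)
```

Components where the two predictions agree are left unchanged, so guidance only amplifies what the condition actually contributes.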

[97] DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method

Qingwen Zhang,Xiaomeng Zhu,Yushan Zhang,Yixi Cai,Olov Andersson,Patric Jensfelt

Main category: cs.CV

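TL;DR: DeltaFlow is a lightweight multi-frame scene flow estimation method that extracts temporal features efficiently through a Δ scheme and adds a Category-Balanced Loss and an Instance Consistency Loss, outperforming existing methods in both computational efficiency and accuracy.

Details Motivation: Existing scene flow methods mostly take two frames as input and neglect temporal information, while multi-frame methods are computationally expensive. DeltaFlow aims to use temporal information efficiently while addressing imbalanced class distributions and motion inconsistency.

Contribution: 1. The DeltaFlow framework, which extracts multi-frame temporal features at low cost via the Δ scheme; 2. A Category-Balanced Loss and an Instance Consistency Loss that improve performance.

Method: 1. Δ scheme: a lightweight 3D framework that captures motion cues efficiently; 2. Category-Balanced Loss: addresses imbalanced class distributions; 3. Instance Consistency Loss: enforces coherent object motion.

Result: Achieves state-of-the-art performance on the Argoverse 2 and Waymo datasets, with up to 22% lower error and 2x faster inference than the next-best multi-frame supervised method, plus strong cross-domain generalization.

Insight: Efficient temporal feature extraction and loss design can improve multi-frame scene flow estimation while keeping computational cost low.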

Abstract: Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($\Delta$Flow), a lightweight 3D framework that captures motion cues via a $\Delta$ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2 and Waymo datasets show that $\Delta$Flow achieves state-of-the-art performance with up to 22% lower error and $2\times$ faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.

[98] REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework

Stefanos Pasios,Nikos Nikolaidis

Main category: cs.CV

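TL;DR: REGEN uses a dual-stage generative adversarial network framework to enhance the photorealism of game frames in real time, addressing the tradeoff between visual quality and performance.

Details Motivation: Photorealism matters for the player experience in modern games, but achieving high-quality photorealism in dynamic environments in real time remains challenging.

Contribution: Proposes the REGEN framework, which reduces the problem to a paired image-to-image translation task, enabling lightweight training and real-time inference while preserving visual quality.

Method: A dual-stage framework: a robust unpaired image-to-image translation model first produces semantically consistent photorealistic frames, which turn the problem into a simpler paired image-to-image translation task that a lightweight model can learn.

Result: Validated on Grand Theft Auto V: visual results are comparable to those of the robust unpaired Im2Im method, with inference 32.14 times faster.

Insight: Directly training a lightweight unpaired method gives worse results than the dual-stage framework, which suggests that decomposing the task matters.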

Abstract: Photorealism is an important aspect of modern video games since it can shape the player experience and simultaneously impact the immersion, narrative engagement, and visual fidelity. Although recent hardware technological breakthroughs, along with state-of-the-art rendering technologies, have significantly improved the visual realism of video games, achieving true photorealism in dynamic environments at real-time frame rates remains a major challenge due to the tradeoff between visual quality and performance. In this short paper, we present a novel approach for enhancing the photorealism of rendered game frames using generative adversarial networks. To this end, we propose Real-time photorealism Enhancement in Games via a dual-stage gEnerative Network framework (REGEN), which employs a robust unpaired image-to-image translation model to produce semantically consistent photorealistic frames that transform the problem into a simpler paired image-to-image translation task. This enables training with a lightweight method that can achieve real-time inference without compromising visual quality. We demonstrate the effectiveness of our framework on Grand Theft Auto V, showing that the approach achieves visual results comparable to the ones produced by the robust unpaired Im2Im method while improving inference speed by 32.14 times. Our findings also indicate that the results outperform the photorealism-enhanced frames produced by directly training a lightweight unpaired Im2Im translation method to translate the video game frames towards the visual characteristics of real-world images. Code, pre-trained models, and demos for this work are available at: https://github.com/stefanos50/REGEN.

[99] SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

Peng Hu,Yu Gu,Liang Luo,Fuji Ren

Main category: cs.CV

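TL;DR: SSG-DiT is a spatial-signal-guided video generation framework whose decoupled two-stage design and dual-branch attention mechanism improve the semantic consistency and spatial-relationship control of generated videos.

Details Motivation: Existing controllable video generation models struggle with semantic consistency and often fail to capture the details specified in prompts.

Contribution: Proposes the SSG-DiT framework, introducing Spatial Signal Prompting and the SSG-Adapter for high-fidelity controllable video generation.

Method: A decoupled two-stage design: Spatial Signal Prompting generates a visual prompt, which the lightweight SSG-Adapter then injects into a frozen video DiT backbone.

Result: On the VBench benchmark, SSG-DiT outperforms existing models, particularly in spatial relationship control and overall consistency.

Insight: Combining spatial signals with a dual-branch attention mechanism can improve the semantic consistency and control precision of generated videos.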

Abstract: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.

[100] Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry

Haoyu Yun,Hamid Krim

Main category: cs.CV

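TL;DR: The paper proposes a framework that combines the Vision Transformer (ViT) with proximal tools, using a two-stage manifold-geometry optimization to improve feature representation and classification performance.

Details Motivation: Although ViT performs well on computer vision tasks, its optimization is confined to modeling local relationships within individual images and does not exploit global geometric relationships between data points.

Contribution: A novel framework that integrates ViT with proximal tools, achieving global feature alignment and optimization through two-stage manifold-geometry optimization.

Method: ViT constructs the tangent bundle of the manifold through its self-attention mechanism; proximal iterations then define sections within the tangent bundle and project data from the tangent spaces onto the base space.

Result: Experiments show the method outperforms traditional ViT in classification accuracy and data distribution.

Insight: Global optimization based on manifold geometry can improve ViT's feature representation and classification performance.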

Abstract: The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT’s optimization remains confined to modeling local relationships within individual images, limiting its ability to capture the global geometric relationships between data points. To address this limitation, this paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach to enhance feature representation and classification performance. In this framework, ViT constructs the tangent bundle of the manifold through its self-attention mechanism, where each attention head corresponds to a tangent space, offering geometric representations from diverse local perspectives. Proximal iterations are then introduced to define sections within the tangent bundle and project data from tangent spaces onto the base space, achieving global feature alignment and optimization. Experimental results confirm that the proposed method outperforms traditional ViT in terms of classification accuracy and data distribution.
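
The abstract does not say which proximal tools are used. For orientation, the standard definition of the proximal operator of a function $f$ with step size $\lambda > 0$ is given below, together with its well-known closed form for the $\ell_1$ norm (soft-thresholding). Whether the paper uses this particular $f$ is not stated in the abstract, so the second formula is only an example.

```latex
\operatorname{prox}_{\lambda f}(v)
  = \arg\min_{x} \Big( f(x) + \frac{1}{2\lambda} \lVert x - v \rVert_2^2 \Big),
\qquad
\big[\operatorname{prox}_{\lambda \lVert\cdot\rVert_1}(v)\big]_i
  = \operatorname{sign}(v_i)\,\max\big(\lvert v_i\rvert - \lambda,\; 0\big).
```

A proximal iteration repeatedly applies such an operator, usually after a gradient step, which is the sense in which it can "project" an intermediate point back toward a structured set.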

[101] PD-Loss: Proxy-Decidability for Efficient Metric Learning

Pedro Silva,Guilherme A. L. Silva,Pablo Coelho,Vander Freitas,Gladston Moreira,David Menotii,Eduardo Luz

Main category: cs.CV

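TL;DR: PD-Loss is a deep metric learning objective that combines learnable proxies with a statistical separability framework to optimize embedding spaces efficiently, offering both computational efficiency and distribution awareness.

Details Motivation: Among existing deep metric learning methods, pairwise losses need complex sampling and converge slowly, while proxy-based methods scale well but struggle to optimize global distribution properties. The Decidability-based Loss (D-Loss) improves separability by targeting the decidability index, but its reliance on large mini-batches is computationally costly.

Contribution: Proposes Proxy-Decidability Loss (PD-Loss), which combines learnable proxies with the statistical framework of d’, inheriting the computational efficiency of proxy-based methods while retaining the distribution-level optimization of D-Loss.

Method: PD-Loss estimates the genuine and impostor distributions through proxies and uses the decidability index (d’) to optimize the separability of the embedding space, giving efficient, distribution-aware metric learning.

Result: On tasks such as fine-grained classification and face verification, PD-Loss performs comparably to state-of-the-art methods while offering a new perspective on embedding optimization.

Insight: PD-Loss shows the potential of combining proxies with a statistical framework, suggesting a route to efficient, distribution-aware embedding learning.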

Abstract: Deep Metric Learning (DML) aims to learn embedding functions that map semantically similar inputs to proximate points in a metric space while separating dissimilar ones. Existing methods, such as pairwise losses, are hindered by complex sampling requirements and slow convergence. In contrast, proxy-based losses, despite their improved scalability, often fail to optimize global distribution properties. The Decidability-based Loss (D-Loss) addresses this by targeting the decidability index (d’) to enhance distribution separability, but its reliance on large mini-batches imposes significant computational constraints. We introduce Proxy-Decidability Loss (PD-Loss), a novel objective that integrates learnable proxies with the statistical framework of d’ to optimize embedding spaces efficiently. By estimating genuine and impostor distributions through proxies, PD-Loss combines the computational efficiency of proxy-based methods with the principled separability of D-Loss, offering a scalable approach to distribution-aware DML. Experiments across various tasks, including fine-grained classification and face verification, demonstrate that PD-Loss achieves performance comparable to that of state-of-the-art methods while introducing a new perspective on embedding optimization, with potential for broader applications.
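
The decidability index has a standard closed form, d’ = |mean_g - mean_i| / sqrt((var_g + var_i) / 2), where g and i denote genuine and impostor score distributions. The sketch below computes it from embedding-to-proxy cosine similarities; using the negative index as the loss and cosine similarity as the score are assumptions, since the abstract does not give the exact objective.

```python
import math
from statistics import mean, pvariance


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def decidability(genuine, impostor):
    """d' = |mu_g - mu_i| / sqrt((var_g + var_i) / 2); larger means better separated."""
    spread = math.sqrt((pvariance(genuine) + pvariance(impostor)) / 2.0)
    return abs(mean(genuine) - mean(impostor)) / (spread + 1e-12)


def proxy_scores(embeddings, labels, proxies):
    """Genuine scores: sample vs. its own class proxy. Impostor: vs. other proxies."""
    genuine, impostor = [], []
    for emb, label in zip(embeddings, labels):
        for cls, proxy in proxies.items():
            (genuine if cls == label else impostor).append(cosine(emb, proxy))
    return genuine, impostor


def pd_loss(embeddings, labels, proxies):
    # Assumed objective: maximize d' by minimizing its negative.
    return -decidability(*proxy_scores(embeddings, labels, proxies))


d_prime = decidability([0.9, 1.0], [0.0, 0.1])
```

Each sample is compared with one proxy per class, so the number of scores grows with batch size times class count rather than with the square of the batch size, which is where the efficiency over pairwise d’ estimation comes from.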

[102] GRASP: Geospatial pixel Reasoning viA Structured Policy learning

Chengjie Jiang,Yunqi Zhou,Jiafeng Yan,Jing Li

Main category: cs.CV

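TL;DR: GRASP performs geospatial pixel reasoning through structured policy learning: a multimodal large language model emits bounding boxes and positive points, a pre-trained segmentation model turns them into masks, and the system is optimized with reinforcement learning alone, without mask supervision, achieving state-of-the-art results.

Details Motivation: Prevailing methods require dense pixel supervision, which is expensive and generalizes poorly. GRASP aims to reduce annotation cost and improve generalization through structured policy learning and reinforcement learning.

Contribution: 1. Proposes the GRASP framework, which prompts a pre-trained segmentation model with bounding boxes and positive points; 2. Optimizes the model with reinforcement learning (GRPO) only; 3. Builds the GRASP-1k dataset, containing reasoning-intensive queries and fine-grained annotations.

Method: 1. A multimodal large language model generates bounding boxes and positive points; 2. A pre-trained segmentation model consumes them as prompts to generate the final mask; 3. The system is optimized with GRPO reinforcement learning, requiring only format rewards and accuracy rewards.

Result: Achieves state-of-the-art results on both in-domain and out-of-domain test sets: about 4% improvement in-domain and up to 54% out-of-domain.

Insight: Complex geospatial segmentation behaviors can be learned from weak spatial cues (such as bounding boxes and points) via reinforcement learning, without dense annotation.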

Abstract: Geospatial pixel reasoning is a nascent remote-sensing task that aims to generate segmentation masks directly from natural-language instructions. Prevailing MLLM-based systems co-train a language model and a mask decoder with dense pixel supervision, which is expensive and often weak on out-of-domain (OOD) data. We introduce GRASP, a structured policy-learning framework. In our design, a multimodal large language model first emits task-relevant bounding boxes and positive points from a vision-language instruction. These outputs are then passed to a pre-trained segmentation model, which consumes them as prompts to generate the final mask. Instead of supervised fine-tuning, we optimize the system purely with reinforcement learning: the model is trained solely with GRPO, guided by format rewards and accuracy rewards computed on boxes and points (no mask supervision). This leverages strong priors in foundation models, minimizes trainable parameters, and enables learning from inexpensive annotations. We additionally curate GRASP-1k, which contains reasoning-intensive queries, detailed reasoning traces, and fine-grained segmentation annotations. Evaluations on both in-domain and out-of-domain test sets show state-of-the-art results: about 4% improvement in-domain and up to 54% on OOD benchmarks. The experiment results evidence our model’s robust generalization and demonstrate that complex geospatial segmentation behaviors can be learned via RL from weak spatial cues. Code and the dataset will be released open-source.
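
To make the reward design concrete, here is a minimal sketch of a format reward and an accuracy reward computed on boxes and points only. The JSON output schema (`bbox`, `points`), the equal weighting, and the point-inside-box check are all assumptions made for illustration; the abstract does not specify how GRASP defines these rewards.

```python
import json


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def _is_numbers(values, length):
    return (isinstance(values, list) and len(values) == length
            and all(isinstance(v, (int, float)) for v in values))


def format_reward(text):
    """1.0 if the model output parses into the assumed schema, else 0.0."""
    try:
        out = json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(out, dict):
        return 0.0
    points = out.get("points")
    ok = (_is_numbers(out.get("bbox"), 4)
          and isinstance(points, list) and len(points) > 0
          and all(_is_numbers(p, 2) for p in points))
    return 1.0 if ok else 0.0


def accuracy_reward(text, gt_box):
    """Hypothetical accuracy reward: box IoU plus fraction of points inside gt_box."""
    if format_reward(text) == 0.0:
        return 0.0
    out = json.loads(text)
    iou = box_iou(out["bbox"], gt_box)
    x1, y1, x2, y2 = gt_box
    inside = sum(1 for px, py in out["points"] if x1 <= px <= x2 and y1 <= py <= y2)
    return 0.5 * iou + 0.5 * inside / len(out["points"])


good = '{"bbox": [0, 0, 10, 10], "points": [[5, 5]]}'
off_target = '{"bbox": [0, 0, 10, 10], "points": [[50, 50]]}'
```

Because both rewards need only boxes and points, no ground-truth mask is touched during policy optimization, which is the property the abstract emphasizes.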

[103] SugarcaneShuffleNet: A Very Fast, Lightweight Convolutional Neural Network for Diagnosis of 15 Sugarcane Leaf Diseases

Shifat E. Arman,Hasan Muhammad Abdullah,Syed Nazmus Sakib,RM Saiem,Shamima Nasrin Asha,Md Mehedi Hasan,Shahrear Bin Amin,S M Mahin Abrar

Main category: cs.CV

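TL;DR: The paper presents SugarcaneShuffleNet, a lightweight convolutional neural network for diagnosing sugarcane leaf diseases, and releases the SugarcaneLD-BD leaf-disease dataset and the SugarcaneAI application for practical use.

Details Motivation: Addresses the need for sugarcane leaf-disease diagnosis in resource-constrained regions, where existing deep learning models are hard to run efficiently.

Contribution: 1) Introduces the SugarcaneLD-BD dataset; 2) develops the lightweight model SugarcaneShuffleNet; 3) releases the SugarcaneAI application.

Method: Combines multiple datasets to increase data diversity, optimizes a lightweight model, and compares it against other lightweight models such as MnasNet and EdgeNeXt.

Result: SugarcaneShuffleNet is only 9.26 MB, reaches 98.02% accuracy, and has an inference time of 4.14 ms per image.

Insight: A lightweight model can match the accuracy of heavier alternatives while needing far fewer resources, which makes it better suited to resource-constrained settings; Grad-CAM adds interpretability.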

Abstract: Despite progress in AI-based plant diagnostics, sugarcane farmers in low-resource regions remain vulnerable to leaf diseases due to the lack of scalable, efficient, and interpretable tools. Many deep learning models fail to generalize under real-world conditions and require substantial computational resources, limiting their use in resource-constrained regions. In this paper, we present SugarcaneLD-BD, a curated dataset for sugarcane leaf-disease classification; SugarcaneShuffleNet, an optimized lightweight model for rapid on-device diagnosis; and SugarcaneAI, a Progressive Web Application for field deployment. SugarcaneLD-BD contains 638 curated images across five classes, including four major sugarcane diseases, collected in Bangladesh under diverse field conditions and verified by expert pathologists. To enhance diversity, we combined SugarcaneLD-BD with two additional datasets, yielding a larger and more representative corpus. Our optimized model, SugarcaneShuffleNet, offers the best trade-off between speed and accuracy for real-time, on-device diagnosis. This 9.26 MB model achieved 98.02% accuracy, an F1-score of 0.98, and an average inference time of 4.14 ms per image. For comparison, we fine-tuned five other lightweight convolutional neural networks: MnasNet, EdgeNeXt, EfficientNet-Lite, MobileNet, and SqueezeNet via transfer learning and Bayesian optimization. MnasNet and EdgeNeXt achieved comparable accuracy to SugarcaneShuffleNet, but required significantly more parameters, memory, and computation, limiting their suitability for low-resource deployment. We integrate SugarcaneShuffleNet into SugarcaneAI, delivering Grad-CAM-based explanations in the field. Together, these contributions offer a diverse benchmark, efficient models for low-resource environments, and a practical tool for sugarcane disease classification. It spans varied lighting, backgrounds, and devices used on-farm.

[104] PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science

Syed Nazmus Sakib,Nafiul Haque,Mohammad Zabed Hossain,Shifat E. Arman

Main category: cs.CV

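TL;DR: PlantVillageVQA is a large-scale visual question answering dataset for plant science, designed to evaluate and advance vision-language models for agricultural decision-making and analysis. It contains about 193,000 QA pairs covering 55,448 images, 14 crop species, and 38 disease conditions. Questions are organized into 3 levels of cognitive complexity and 9 categories and have been verified by experts. The dataset will be released on an open-source platform.

Details Motivation: Agriculture lacks high-quality datasets for visual question answering, which limits the use of vision-language models in plant disease diagnosis. PlantVillageVQA aims to fill this gap with a standardized, expert-verified dataset.

Contribution: 1) Releases PlantVillageVQA, a large-scale plant-science VQA dataset with diverse QA pairs and images; 2) designs a question structure organized by complexity level and category; 3) ensures scientific accuracy through expert verification.

Method: QA pairs are generated with a two-stage pipeline: 1) template-based synthesis from image metadata; 2) multi-stage linguistic re-engineering. Questions are iteratively reviewed by experts, and the final dataset is evaluated with state-of-the-art models for quality assessment.

Result: The dataset contains about 193,000 QA pairs covering 55,448 images, 14 crop species, and 38 disease conditions. Questions are classified by cognitive complexity and category and are expert-verified.

Insight: PlantVillageVQA provides a standardized benchmark for visual question answering in agriculture and may help improve plant disease diagnostic accuracy and advance related research.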

Abstract: PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.

[105] CE-RS-SBCIT A Novel Channel Enhanced Hybrid CNN Transformer with Residual, Spatial, and Boundary-Aware Learning for Brain Tumor MRI Analysis

Mirza Mumtaz Zahoor,Saddam Hussain Khan

Main category: cs.CV

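TL;DR: This paper proposes CE-RS-SBCIT, a new hybrid CNN-Transformer framework that combines residual, spatial, and boundary-aware learning for brain tumor MRI analysis and reports strong classification performance.

Details Motivation: Early detection and accurate classification of brain tumors are critical for diagnosis and treatment, but existing deep learning methods face challenges in computational cost, sensitivity to minor contrast variations, and the heterogeneity of MRI data.

Contribution: Proposes a hybrid framework combining CNNs and Transformers with four core innovations: a smoothing and boundary-based CNN-integrated Transformer, residual and spatial learning CNNs, a channel enhancement strategy, and a spatial attention mechanism.

Method: The SBCIT module combines local fine-grained and global contextual features; residual and spatial CNNs enrich the feature representation; channel enhancement reduces redundancy; and the spatial attention mechanism emphasizes contrast and texture variations.

Result: Achieves 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision on MRI datasets from Kaggle and Figshare.

Insight: Combining the local feature extraction of CNNs with the global modeling of Transformers, and refining the feature representation through channel enhancement and attention, can improve brain tumor classification accuracy.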

Abstract: Brain tumors remain among the most lethal human diseases, where early detection and accurate classification are critical for effective diagnosis and treatment planning. Although deep learning-based computer-aided diagnostic (CADx) systems have shown remarkable progress, conventional convolutional neural networks (CNNs) and Transformers face persistent challenges, including high computational cost, sensitivity to minor contrast variations, structural heterogeneity, and texture inconsistencies in MRI data. Therefore, a novel hybrid framework, CE-RS-SBCIT, is introduced, integrating residual and spatial learning-based CNNs with transformer-driven modules. The proposed framework exploits local fine-grained and global contextual cues through four core innovations: (i) a smoothing and boundary-based CNN-integrated Transformer (SBCIT), (ii) tailored residual and spatial learning CNNs, (iii) a channel enhancement (CE) strategy, and (iv) a novel spatial attention mechanism. The developed SBCIT employs stem convolution and contextual interaction transformer blocks with systematic smoothing and boundary operations, enabling efficient global feature modeling. Moreover, residual and spatial CNNs, enhanced by auxiliary transfer-learned feature maps, enrich the representation space, while the CE module amplifies discriminative channels and mitigates redundancy. Furthermore, the spatial attention mechanism selectively emphasizes subtle contrast and textural variations across tumor classes. Extensive evaluation on challenging MRI datasets from Kaggle and Figshare, encompassing glioma, meningioma, pituitary tumors, and healthy controls, demonstrates superior performance, achieving 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision.

[106] Structural Damage Detection Using AI Super Resolution and Visual Language Model

Catherine Hoier,Khandaker Mamun Ahmed

Main category: cs.CV

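TL;DR: The paper proposes a structural damage detection framework that combines AI super-resolution with a visual language model, reaching 84.5% classification accuracy on drone footage and satellite data.

Details Motivation: After natural disasters, traditional damage assessment is time-consuming, labor-intensive, and hazardous, so a fast and accurate automated solution is needed.

Contribution: 1. Proposes a new framework that combines a video super-resolution model (VRT) with a visual language model (Gemma3:27b); 2. Classifies structural damage with 84.5% accuracy; 3. Provides an accessible disaster assessment tool for non-technical users.

Method: 1. Uses the VRT model to enhance low-resolution drone footage; 2. Uses Gemma3:27b to identify structural damage and classify it into four categories; 3. Validates on data from the 2023 Turkey earthquakes and the 2013 Moore Tornado (xBD dataset).

Result: Achieved 84.5% classification accuracy on disaster imagery, supporting the effectiveness of the framework.

Insight: Combining super-resolution with a visual language model can raise the level of automation in disaster assessment and give non-specialists a tool for rapid response.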

Abstract: Natural disasters pose significant challenges to timely and accurate damage assessment due to their sudden onset and the extensive areas they affect. Traditional assessment methods are often labor-intensive, costly, and hazardous to personnel, making them impractical for rapid response, especially in resource-limited settings. This study proposes a novel, cost-effective framework that leverages aerial drone footage, an advanced AI-based video super-resolution model, Video Restoration Transformer (VRT), and Gemma3:27b, a 27 billion parameter Visual Language Model (VLM). This integrated system is designed to improve low-resolution disaster footage, identify structural damage, and classify buildings into four damage categories, ranging from no/slight damage to total destruction, along with associated risk levels. The methodology was validated using pre- and post-event drone imagery from the 2023 Turkey earthquakes (courtesy of The Guardian) and satellite data from the 2013 Moore Tornado (xBD dataset). The framework achieved a classification accuracy of 84.5%, demonstrating its ability to provide highly accurate results. Furthermore, the system’s accessibility allows non-technical users to perform preliminary analyses, thereby improving the responsiveness and efficiency of disaster management efforts.

[107] Beyond Play and Pause: Turning GPT-4o Spatial Weakness into a Strength for In-Depth Interactive Video Learning

Sajad Goudarzi,Samaneh Zamanifard

Main category: cs.CV

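TL;DR: The paper introduces Untwist, a system that combines GPT APIs with computer vision techniques to work around GPT-4o's spatial weakness and enable region-based interactive learning from video.

Details Motivation: Traditional video learning is passive and lacks dynamic interaction; existing AI tools only offer summarization and transcription and cannot support real-time, region-specific interaction.

Contribution: Proposes the Untwist system, which uses annotated frames instead of raw coordinate data to improve the accuracy of localizing and interpreting video content, enabling multimodal interactive learning.

Method: Combines GPT APIs with computer vision techniques to pre-process video content and support real-time interaction; users can ask questions about a region via a bounding box and receive context-aware answers.

Result: Untwist turns passive video learning into an interactive experience, with the potential to improve user engagement and comprehension.

Insight: Using annotated frames to compensate for GPT-4o's weak spatial handling suggests a new direction for AI-driven interactive video learning.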

Abstract: Traditional video-based learning remains passive, offering limited opportunities for users to engage dynamically with content. While current AI-powered tools offer transcription and summarization, they lack real-time, region-specific interaction capabilities. This paper introduces Untwist, an AI-driven system that enables interactive video learning by allowing users to ask questions about the entire video or specific regions using a bounding box, receiving context-aware, multimodal responses. By integrating GPT APIs with Computer Vision techniques, Untwist extracts, processes, and structures video content to enhance comprehension. Our approach addresses GPT-4o spatial weakness by leveraging annotated frames instead of raw coordinate data, significantly improving accuracy in localizing and interpreting video content. This paper describes the system architecture, including video pre-processing and real-time interaction, and outlines how Untwist can transform passive video consumption into an interactive, AI-driven learning experience with the potential to enhance engagement and comprehension.

[108] Development of an isotropic segmentation model for medial temporal lobe subregions on anisotropic MRI atlas using implicit neural representation

Yue Li,Pulkit Khandelwal,Rohit Jena,Long Xie,Michael Duong,Amanda E. Denning,Christopher A. Brown,Laura E. M. Wisse,Sandhitsu R. Das,David A. Wolk,Paul A. Yushkevich

Main category: cs.CV

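TL;DR: The study uses an implicit neural representation method to combine the resolution advantages of T1- and T2-weighted MRI, upsampling an MTL subregion atlas set from anisotropic to isotropic space to build a multi-modality, high-resolution atlas set, on which an isotropic MTL subregion segmentation model is developed. The approach shows higher significance in distinguishing participants with mild cognitive impairment from cognitively unimpaired participants and greater stability in longitudinal analysis.

Details Motivation: Early hallmarks of Alzheimer's disease (AD) typically appear in the medial temporal lobe (MTL), but the resolution of existing T2-weighted MRI is anisotropic, which hinders accurate extraction of cortical thickness in MTL subregions and limits the accuracy of AD imaging biomarkers. A method is therefore needed that increases resolution without adding annotation work.

Contribution: 1. Proposes an implicit neural representation method that combines the advantages of T1- and T2-weighted MRI to build a multi-modality, high-resolution MTL subregion atlas set. 2. Develops an isotropic MTL subregion segmentation model that improves how well the biomarkers distinguish participants with mild cognitive impairment from cognitively unimpaired participants. 3. Shows in longitudinal analysis that biomarkers extracted with the isotropic method are more stable.

Method: 1. Uses an implicit neural representation method to fuse the resolution advantages of T1- and T2-weighted MRI. 2. Upsamples the anisotropic MTL subregion atlas set to isotropic space. 3. Develops an isotropic segmentation model based on the new atlas set and extracts biomarkers for evaluation.

Result: Cortical subregion thickness extracted with the isotropic model shows higher significance in distinguishing participants with mild cognitive impairment from cognitively unimpaired participants. In longitudinal analysis, biomarkers extracted with the isotropic method show greater stability in cognitively unimpaired participants.

Insight: 1. Combining the strengths of multiple MRI modalities can yield a high-resolution isotropic representation and thereby improve biomarker accuracy. 2. The implicit neural representation method builds a higher-resolution atlas without additional annotation work, giving AD research a more precise tool.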

Abstract: Imaging biomarkers in magnetic resonance imaging (MRI) are important tools for diagnosing and tracking Alzheimer’s disease (AD). As medial temporal lobe (MTL) is the earliest region to show AD-related hallmarks, brain atrophy caused by AD can first be observed in the MTL. Accurate segmentation of MTL subregions and extraction of imaging biomarkers from them are important. However, due to imaging limitations, the resolution of T2-weighted (T2w) MRI is anisotropic, which makes it difficult to accurately extract the thickness of cortical subregions in the MTL. In this study, we used an implicit neural representation method to combine the resolution advantages of T1-weighted and T2w MRI to accurately upsample an MTL subregion atlas set from anisotropic space to isotropic space, establishing a multi-modality, high-resolution atlas set. Based on this atlas, we developed an isotropic MTL subregion segmentation model. In an independent test set, the cortical subregion thickness extracted using this isotropic model showed higher significance than an anisotropic method in distinguishing between participants with mild cognitive impairment and cognitively unimpaired (CU) participants. In longitudinal analysis, the biomarkers extracted using isotropic method showed greater stability in CU participants. This study improved the accuracy of AD imaging biomarkers without increasing the amount of atlas annotation work, which may help to more accurately quantify the relationship between AD and brain atrophy and provide more accurate measures for disease tracking.

[109] VROOM - Visual Reconstruction over Onboard Multiview

Yajat Yadav,Varun Bharadwaj,Jathin Korrapati,Tanish Baranwal

Main category: cs.CV

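TL;DR: VROOM is a system that reconstructs 3D models of Formula 1 circuits from onboard multi-view camera footage. It tackles the challenges of high-speed motion and sharp cuts between camera frames, combining several SLAM methods and preprocessing techniques to partially reconstruct track and vehicle trajectories in complex environments.

Details Motivation: The work aims to use onboard video for scalable 4D reconstruction and to explore its practicality in complex real-world scenes, especially under high-speed motion and changing viewpoints.

Contribution: Proposes the VROOM system, which combines several SLAM methods (such as DROID-SLAM and AnyCam) with preprocessing (such as masking and temporal chunking) to partially reconstruct track and vehicle trajectories, indicating that onboard video is a feasible input.

Method: Applies several SLAM methods (such as DROID-SLAM, AnyCam, and Monst3r) together with preprocessing techniques (masking, temporal chunking, resolution scaling) to cope with high-speed motion and computational constraints.

Result: Partially reconstructs track and vehicle trajectories from video of the 2023 Monaco Grand Prix, showing the potential of onboard video for 4D reconstruction.

Insight: Onboard video is a feasible input in complex scenes, but it requires a combination of techniques to handle dynamic motion and computational limits; this points toward scalable 4D reconstruction in the future.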

Abstract: We introduce VROOM, a system for reconstructing 3D models of Formula 1 circuits using only onboard camera footage from racecars. Leveraging video data from the 2023 Monaco Grand Prix, we address video challenges such as high-speed motion and sharp cuts in camera frames. Our pipeline analyzes different methods such as DROID-SLAM, AnyCam, and Monst3r and combines preprocessing techniques such as different methods of masking, temporal chunking, and resolution scaling to account for dynamic motion and computational constraints. We show that VROOM is able to partially recover track and vehicle trajectories in complex environments. These findings indicate the feasibility of using onboard video for scalable 4D reconstruction in real-world settings. The project page can be found at https://varun-bharadwaj.github.io/vroom, and our code is available at https://github.com/yajatyadav/vroom.

[110] Advancing Weakly-Supervised Change Detection in Satellite Images via Adversarial Class Prompting

Zhenghui Zhao,Chen Wu,Di Wang,Hongruixuan Chen,Cuiqun Chen,Zhuo Zheng,Bo Du,Liangpei Zhang

Main category: cs.CV

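TL;DR: This paper proposes Adversarial Class Prompting (AdvCP) to address the misclassification of background variations in weakly-supervised change detection, improving performance through adversarial prompt mining and adversarial sample rectification.

Details Motivation: Weakly-supervised change detection (WSCD) relies only on image-level labels, and existing methods often misclassify background variations as object changes. The authors propose a new method to address this problem.

Contribution: Proposes Adversarial Class Prompting (AdvCP), which consists of two phases, adversarial prompt mining and adversarial sample rectification, and improves WSCD performance without adding inference cost.

Method: AdvCP uses adversarial prompting perturbations to expose erroneous feature mappings, then integrates the resulting samples into training through an online global prototype; it applies to a range of baseline models.

Result: Experiments show that AdvCP yields significant improvements on ConvNet, Transformer, and SAM-based baselines and generalizes to other settings.

Insight: Through its adversarial prompting mechanism, AdvCP helps separate background variations from object changes, offering a new direction for weakly-supervised dense prediction tasks.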

Abstract: Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem, including two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Our AdvCP can be seamlessly integrated into current WSCD methods without adding additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements. Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
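
The online global prototype described in phase (b) is built from an exponentially weighted moving average over batches. The sketch below shows only that update rule, assuming the prototype is a plain feature vector and the per-batch statistic is the mean of the mined pixel features; the function name and the momentum value are hypothetical, and how AdvCP then uses the prototype for rectification is not covered.

```python
def update_prototype(prototype, batch_features, momentum=0.9):
    """Exponentially weighted moving average of per-batch mean features.

    prototype: current prototype vector, or None before the first batch.
    batch_features: list of feature vectors mined from the current batch.
    """
    dim = len(batch_features[0])
    batch_mean = [
        sum(feat[d] for feat in batch_features) / len(batch_features)
        for d in range(dim)
    ]
    if prototype is None:
        # First batch: initialise the prototype with the batch mean.
        return batch_mean
    # Older batches decay geometrically, so all history contributes.
    return [momentum * p + (1.0 - momentum) * m
            for p, m in zip(prototype, batch_mean)]


proto = update_prototype(None, [[1.0, 3.0], [3.0, 5.0]])
proto_next = update_prototype(proto, [[0.0, 0.0]], momentum=0.5)
```

A higher momentum makes the prototype change more slowly, weighting historical training data more heavily than the current batch.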

[111] MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

Hyeyeon Kim,Sungwoo Han,Jingun Kwon,Hidetaka Kamigaito,Manabu Okumura

Main category: cs.CV

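TL;DR: This paper proposes a multimodal pseudo-labeling method for generating a cover image and a summary for text-only documents. It builds a high-quality dataset by selecting the best-matching image-caption pair from each document's candidate images and captions. Experiments show that the method outperforms unimodal pseudo-labeling.

Details Motivation: No dataset currently exists for the task of generating a summary and a corresponding cover image from a text-only document, and building such datasets is costly. The paper aims to construct a high-quality dataset at low cost to support this task.

Contribution: 1. Introduces a new multimodal cover image generation task; 2. Proposes a multimodal pseudo-labeling method that builds high-quality datasets at low cost; 3. Shows experimentally that multimodal pseudo-labeling outperforms unimodal methods.

Method: 1. Collects documents with multiple captioned images and their summaries, excluding factually inconsistent instances; 2. Ranks images and captions independently and assigns a pseudo-label when an image and its caption both rank first; 3. Removes documents whose text directly references images.

Result: Experiments show that the dataset built with multimodal pseudo-labeling is more precise and that the generated images are of higher quality than with text-only or image-only pseudo-labeling.

Insight: Jointly labeling with multimodal data (image plus caption) is more effective than unimodal labeling and helps improve generation quality.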

Abstract: In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple images with their captions, and their summaries by excluding factually inconsistent instances. Our approach selects one image from the multiple images accompanying the documents. Using the gold summary, we independently rank both the images and their captions. Then, we annotate a pseudo-label for an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within texts. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text- and image-only pseudo-labeling methods, which consider captions and images separately. We release our code at: https://github.com/HyeyeeonKim/MMCIG
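
The agreement rule in the abstract (label an image only when both the image and its caption rank first) can be sketched as follows. The relevance scores against the gold summary are taken as given inputs here; how the paper computes them is not stated in the abstract, and the function name is hypothetical.

```python
def select_pseudo_label(image_scores, caption_scores):
    """Return the index of the image to pseudo-label, or None.

    image_scores[i] and caption_scores[i] are relevance scores of the i-th
    image and of its caption against the gold summary (scoring models are
    assumed to exist elsewhere). A label is assigned only when the same
    index ranks first in both independent rankings.
    """
    if not image_scores or len(image_scores) != len(caption_scores):
        return None
    best_image = max(range(len(image_scores)), key=image_scores.__getitem__)
    best_caption = max(range(len(caption_scores)), key=caption_scores.__getitem__)
    return best_image if best_image == best_caption else None


agree = select_pseudo_label([0.2, 0.9, 0.4], [0.1, 0.8, 0.3])
disagree = select_pseudo_label([0.2, 0.9], [0.8, 0.1])
```

Requiring agreement discards documents where the two modalities disagree, which trades dataset size for precision.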

[112] Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding

Yunxiang Yang,Ningning Xu,Jidong J. Yang

Main category: cs.CV

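TL;DR: The paper proposes a multi-agent framework built around a mixture-of-experts strategy for comprehensive highway scene understanding. A generic vision-language model (VLM) contextualized with domain knowledge generates task-specific chain-of-thought (CoT) prompts, which guide a smaller, efficient model through multi-task reasoning; performance is validated on datasets that include multimodal data.

Details Motivation: Existing highway scene understanding systems tend to focus on a single task and struggle to cover complex, varying traffic and environmental conditions. A solution is needed that handles multiple tasks at once while balancing accuracy and efficiency.

Contribution: 1. Proposes a multi-agent vision-language reasoning framework that pairs a large generic model with an efficient small model; 2. Curates specialized datasets aligned with the tasks (including multimodal data); 3. Shows consistent performance across diverse traffic and environmental conditions and supports practical deployment.

Method: Following a mixture-of-experts strategy, the large model generates task-specific CoT prompts that guide the small model in multi-task reasoning (weather classification, pavement wetness assessment, and traffic congestion detection), while integrating multimodal data such as video and sensor data.

Result: Experiments show that the framework performs robustly across diverse conditions, suits resource-constrained environments, and can be integrated with existing traffic camera systems.

Insight: Having a large model guide a small one preserves generality while improving task specificity; multimodal data (as in the pavement wetness dataset) makes the reasoning more robust.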

Abstract: This paper introduces a multi-agent framework for comprehensive highway scene understanding, designed around a mixture-of-experts strategy. In this framework, a large generic vision-language model (VLM), such as GPT-4o, is contextualized with domain knowledge to generate task-specific chain-of-thought (CoT) prompts. These fine-grained prompts are then used to guide a smaller, efficient VLM (e.g., Qwen2.5-VL-7B) in reasoning over short videos, along with complementary modalities as applicable. The framework simultaneously addresses multiple critical perception tasks, including weather classification, pavement wetness assessment, and traffic congestion detection, achieving robust multi-task reasoning while balancing accuracy and computational efficiency. To support empirical validation, we curated three specialized datasets aligned with these tasks. Notably, the pavement wetness dataset is multimodal, combining video streams with road weather sensor data, highlighting the benefits of multimodal reasoning. Experimental results demonstrate consistently strong performance across diverse traffic and environmental conditions. From a deployment perspective, the framework can be readily integrated with existing traffic camera systems and strategically applied to high-risk rural locations, such as sharp curves, flood-prone lowlands, or icy bridges. By continuously monitoring the targeted sites, the system enhances situational awareness and delivers timely alerts, even in resource-constrained environments.

[113] Multi-modal Knowledge Decomposition based Online Distillation for Biomarker Prediction in Breast Cancer Histopathology

Qibin Zhang,Xinyu Hao,Qiao Chen,Rui Xu,Fengyu Cong,Cheng Lu,Hongming Xu

Main category: cs.CV

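TL;DR: The paper proposes an online distillation method based on Multi-modal Knowledge Decomposition (MKD) to improve IHC biomarker prediction from H&E-stained histopathology images, addressing the difficulty of acquiring multi-modal data. Using teacher and student models together with SKD and CLOD, it achieves strong prediction from a single modality.

Details Motivation: Simultaneously acquired multi-modal data (such as genomic and pathological information) is valuable for IHC biomarker prediction but is often hard to obtain because of cost or technical limitations. The paper uses online distillation to address this and improve prediction from a single modality (pathology images).

Contribution: 1. Proposes an MKD-based online distillation method that supports single-modality inference; 2. Combines SKD and CLOD to preserve structural relationships between samples and encourage complementary learning between teacher and student models; 3. Validates the method on public and in-house datasets.

Method: 1. Uses teacher and student models (two teachers and one student) to extract modality-specific and modality-general features; 2. Optimizes the models by minimizing the MKD loss; 3. Applies SKD to preserve sample similarity; 4. Uses CLOD for mutual learning between teacher and student models.

Result: On the TCGA-BRCA and QHSU datasets, the method achieves superior IHC biomarker prediction using a single modality (pathology images).

Insight: Decomposing multi-modal knowledge and learning collaboratively can improve single-modality prediction, which is especially useful in medical settings where data acquisition is constrained.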

Abstract: Immunohistochemical (IHC) biomarker prediction benefits from multi-modal data fusion analysis. However, the simultaneous acquisition of multi-modal data, such as genomic and pathological information, is often challenging due to cost or technical limitations. To address this challenge, we propose an online distillation approach based on Multi-modal Knowledge Decomposition (MKD) to enhance IHC biomarker prediction in haematoxylin and eosin (H&E) stained histopathology images. This method leverages paired genomic-pathology data during training while enabling inference using either pathology slides alone or both modalities. Two teacher and one student models are developed to extract modality-specific and modality-general features by minimizing the MKD loss. To maintain the internal structural relationships between samples, Similarity-preserving Knowledge Distillation (SKD) is applied. Additionally, Collaborative Learning for Online Distillation (CLOD) facilitates mutual learning between teacher and student models, encouraging diverse and complementary learning dynamics. Experiments on the TCGA-BRCA and in-house QHSU datasets demonstrate that our approach achieves superior performance in IHC biomarker prediction using uni-modal data. Our code is available at https://github.com/qiyuanzz/MICCAI2025_MKD.
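
Similarity-preserving distillation, in its commonly used form, compares row-normalized batch similarity (Gram) matrices of teacher and student features, so that samples the teacher finds similar remain similar for the student. The sketch below implements that standard form in plain Python; whether this paper's SKD term matches it exactly is an assumption, and the function names are illustrative.

```python
import math


def gram_row_normalized(features):
    """Batch similarity matrix F @ F.T with each row scaled to unit L2 norm."""
    gram = [[sum(x * y for x, y in zip(a, b)) for b in features] for a in features]
    normalized = []
    for row in gram:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid divide-by-zero
        normalized.append([v / norm for v in row])
    return normalized


def skd_loss(teacher_features, student_features):
    """Mean squared difference between teacher and student similarity matrices.

    Both inputs hold one feature vector per sample in the same batch order;
    teacher and student feature dimensions may differ.
    """
    g_teacher = gram_row_normalized(teacher_features)
    g_student = gram_row_normalized(student_features)
    batch = len(teacher_features)
    total = sum(
        (t - s) ** 2
        for row_t, row_s in zip(g_teacher, g_student)
        for t, s in zip(row_t, row_s)
    )
    return total / (batch * batch)


teacher = [[1.0, 0.0], [0.0, 1.0]]
same_structure = skd_loss(teacher, [[2.0, 0.0], [0.0, 2.0]])
collapsed_student = skd_loss(teacher, [[1.0, 0.0], [1.0, 0.0]])
```

Because only pairwise structure is compared, a student whose features are a rescaled copy of the teacher's incurs no penalty, while a student that collapses distinct samples together does.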

[114] Deep Learning with Self-Attention and Enhanced Preprocessing for Precise Diagnosis of Acute Lymphoblastic Leukemia from Bone Marrow Smears in Hemato-Oncology

Md. Maruf,Md. Mahbubul Haque,Bishowjit Paul

Main category: cs.CV

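TL;DR: Proposes a deep learning method that combines convolutional neural networks (CNNs) with multi-head self-attention (MHSA) for automated diagnosis of acute lymphoblastic leukemia (ALL) from bone marrow smear images, improving diagnostic accuracy through enhanced preprocessing and Focal Loss training.

Details Motivation: Early ALL diagnosis and accurate subtyping are essential for treatment, but conventional workflows are complex and error-prone. An automated, efficient diagnostic tool is needed.

Contribution: 1. Combines a CNN with an MHSA block to model long-range dependencies and contextual relationships among cellular features. 2. Addresses class imbalance through preprocessing and Focal Loss training. 3. Reaches 99.25% accuracy with the VGG19+MHSA architecture.

Method: 1. Uses an enhanced preprocessing pipeline to standardize image quality. 2. Inserts an MHSA block into VGG19. 3. Trains with Focal Loss to mitigate class imbalance. 4. Compares against baselines such as ResNet101.

Result: The VGG19+MHSA model reaches 99.25% accuracy, ahead of ResNet101 at 98.62%, indicating more discriminative features.

Insight: Combining self-attention with CNNs can effectively model the global context of medical images, offering a new direction for automated diagnosis.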

Abstract: Acute lymphoblastic leukemia (ALL) is a prevalent hematological malignancy in both pediatric and adult populations. Early and accurate detection with precise subtyping is essential for guiding therapy. Conventional workflows are complex, time-consuming, and prone to human error. We present a deep learning framework for automated ALL diagnosis from bone marrow smear images. The method combines a robust preprocessing pipeline with convolutional neural networks (CNNs) to standardize image quality and improve inference efficiency. As a key design, we insert a multi-head self-attention (MHSA) block into a VGG19 backbone to model long-range dependencies and contextual relationships among cellular features. To mitigate class imbalance, we train with Focal Loss. Across evaluated architectures, the enhanced VGG19+MHSA trained with Focal Loss achieves 99.25% accuracy, surpassing a strong ResNet101 baseline (98.62%). These results indicate that attention-augmented CNNs, coupled with targeted loss optimization and preprocessing, yield more discriminative representations of leukemic cell morphology. Our approach offers a highly accurate and computationally efficient tool for automated ALL recognition and subtyping, with potential to accelerate diagnostic workflows and support reliable decision-making in clinical settings.
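
Focal Loss, used above to mitigate class imbalance, scales the cross-entropy of each sample by (1 - p_t)^gamma, where p_t is the predicted probability of the true class, so that easy, well-classified samples contribute less. The following minimal sketch computes it for a single sample; the default gamma and alpha values are illustrative assumptions, not the paper's settings.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs: predicted class probabilities (already softmax-normalized).
    target: index of the true class.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp for numerical safety
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)   # confident and correct
hard = focal_loss([0.1, 0.9], target=0)   # confident and wrong
plain_ce = focal_loss([0.5, 0.5], target=0, gamma=0.0)  # reduces to cross-entropy
```

With gamma set to 0 the expression reduces to ordinary cross-entropy; larger gamma shifts more of the training signal toward hard samples, which are often those of under-represented classes.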

[115] 4D Visual Pre-training for Robot Learning

Chengkai Hou,Yanjie Ze,Yankai Fu,Zeyu Gao,Songbo Hu,Yue Yu,Shanghang Zhang,Huazhe Xu

Main category: cs.CV

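TL;DR: The paper proposes FVP, a 4D visual pre-training framework that uses a diffusion model to predict point clouds, improving how 3D representations perform on robot tasks and substantially raising the success rate of 3D Diffusion Policy.

Details Motivation: Existing visual pre-training is mostly based on 2D images and neglects the 3D nature of the world; because large-scale 3D data is scarce, a general pre-training framework that can enhance 3D representations is needed.

Contribution: 1. Proposes the FVP framework, which pre-trains a diffusion model on a point cloud prediction task; 2. Raises the average success rate of 3D Diffusion Policy by 28%; 3. Shows that FVP carries over to different point cloud encoders and datasets.

Method: Frames the visual pre-training objective as a point cloud prediction problem, uses a diffusion model as the predictor, and pre-trains on public datasets.

Result: Across 12 real-world manipulation tasks, FVP raises the average success rate of 3D Diffusion Policy by 28% and achieves state-of-the-art performance among imitation learning methods.

Insight: With 4D pre-training (time plus 3D), FVP both makes 3D representations more general and improves the performance of multimodal robot models.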

Abstract: General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model on the larger public datasets directly. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP adapts across various point cloud encoders and datasets. Finally, we apply FVP to the RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/.

[116] PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

Xiaoyang Hao,Han Li

Main category: cs.CV

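TL;DR: PersPose proposes a new 3D human pose estimation method that introduces Perspective Encoding (PE) and Perspective Rotation (PR) to address the inaccurate depth estimation that arises when cropped images lack camera intrinsics, and achieves state-of-the-art performance on several datasets.

Details Motivation: Existing 3D pose estimation methods take only the cropped image as input and ignore how camera intrinsics shape the perspective relationship, which makes depth estimation inaccurate. In addition, the subject's offset from the image center causes perspective distortion, which makes model fitting harder.

Contribution: 1. Proposes Perspective Encoding (PE) to encode camera intrinsics; 2. Proposes Perspective Rotation (PR) to reduce perspective distortion; 3. Builds a new 3D pose estimation framework, PersPose.

Method: 1. Encodes camera intrinsics with PE; 2. Uses PR to center the human subject and reduce perspective distortion; 3. Combines PE and PR for 3D pose estimation.

Result: Achieves state-of-the-art performance on the 3DPW, MPIINF-3DHP, and Human3.6M datasets; for example, on 3DPW the MPJPE is 60.1 mm, 7.54% lower than the previous best method.

Insight: Camera intrinsics and perspective distortion are key factors in 3D pose estimation, and explicitly encoding the former and correcting the latter can noticeably improve performance.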

Abstract: Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPIINF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
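
One plausible reading of a transformation that "centers the human subject" is a virtual camera rotation that aligns the viewing ray through the subject with the optical axis. The sketch below builds such a rotation for a pinhole camera and re-projects pixels through it. This is an illustrative construction under that assumption, not necessarily the paper's PR definition; the function names and the intrinsics values (`fx`, `fy`, `cx`, `cy`) are hypothetical.

```python
import math


def rotation_to_optical_axis(u, v, fx, fy, cx, cy):
    """Rotation matrix taking the viewing ray through pixel (u, v) to the +z axis.

    Uses R = I + [w]x + [w]x^2 / (1 + c), with w = d x z and c = d . z,
    where d is the unit ray direction. Valid for rays in front of the camera.
    """
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    dx, dy, dz = x / norm, y / norm, z / norm
    k = 1.0 / (1.0 + dz)
    return [
        [1.0 - k * dx * dx, -k * dx * dy, -dx],
        [-k * dx * dy, 1.0 - k * dy * dy, -dy],
        [dx, dy, 1.0 - k * (dx * dx + dy * dy)],
    ]


def warp_pixel(u, v, rot, fx, fy, cx, cy):
    """Map a pixel through the homography K @ R @ inv(K) of the rotated camera."""
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]
    r = [sum(rot[i][j] * ray[j] for j in range(3)) for i in range(3)]
    return fx * r[0] / r[2] + cx, fy * r[1] / r[2] + cy


# Hypothetical intrinsics and subject centre, for illustration only.
fx = fy = 1000.0
cx, cy = 640.0, 360.0
rot = rotation_to_optical_axis(900.0, 200.0, fx, fy, cx, cy)
centred = warp_pixel(900.0, 200.0, rot, fx, fy, cx, cy)
```

After the warp, the subject's centre lands on the principal point, so the crop is taken as if the camera had been pointed straight at the person.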

[117] CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

Zicong Tang,Ziyang Ma,Suqing Wang,Zuchao Li,Lefei Zhang,Hai Zhao,Yun Li,Qianren Wang

Main category: cs.CV

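TL;DR: CoViPAL is a layer-wise contextualized visual token pruning method that uses a lightweight Plug-and-Play Pruning Module (PPM) to prune redundant vision tokens in large vision-language models, improving inference efficiency without sacrificing accuracy.

Details Motivation: Large vision-language models (LVLMs) incur high compute and memory costs because they process large numbers of vision tokens, and existing pruning methods perform poorly in shallow layers, where contextual information is insufficient.

Contribution: Proposes CoViPAL, which uses the PPM to predict and prune redundant vision tokens with layer-wise context, giving lightweight, model-agnostic, efficient inference.

Method: A Plug-and-Play Pruning Module (PPM) predicts and removes redundant vision tokens before the LVLM processes them; the module is independent of the model architecture and applies to a variety of LVLMs.

Result: Across multiple benchmarks, CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision.

Insight: Vision tokens are already redundant in shallow layers and can be pruned safely given appropriate contextual signals; the lightweight PPM design offers a scalable route to efficient inference.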

Abstract: Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.
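Pruning under a fixed token budget can be illustrated with a short sketch. It assumes a pruning module has already produced one importance score per vision token; how the PPM computes those scores is not described here, so the scores, the function name, and the keep-ratio parameter are all hypothetical.

```python
def prune_vision_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens within a budget, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("one score per token is required")
    budget = max(1, int(len(tokens) * keep_ratio))  # token budget, at least one token
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget])                  # restore positional order
    return [tokens[i] for i in kept], kept
```

Restoring positional order after the top-k selection is a design choice in this sketch: it keeps the surviving tokens in the sequence order the language model would otherwise have seen.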

[118] A biological vision inspired framework for machine perception of abutting grating illusory contours

Xiao Zhang,Kai-Fu Yang,Xian-Shi Zhang,Hong-Zhi You,Hong-Mei Yan,Yong-Jie Li

Main category: cs.CV

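TL;DR: The paper proposes ICPNet, a deep learning framework inspired by biological vision, to address the mismatch between deep neural networks (DNNs) and human perception of illusory contours such as the abutting grating. Multi-scale feature projection, a feature interaction attention module, and an edge detection task make the model markedly more sensitive to, and accurate on, illusory contours.

Details Motivation: DNNs perform extremely well on many tasks, but their perception of illusory contours does not match human perception patterns, which limits their alignment with human intelligence. To address this, the paper draws on biological vision mechanisms and proposes a new perception framework.

Contribution: 1. Proposes the ICPNet framework, comprising multi-scale feature projection (MFP), a feature interaction attention module (FIAM), and an edge fusion module (EFM). 2. Constructs the AG-Fashion-MNIST test sets for evaluating perception of illusory contours. 3. Experiments show that ICPNet clearly outperforms existing models on abutting grating tasks.

Method: 1. The MFP module extracts multi-scale features; 2. The FIAM strengthens the interaction between feedforward and feedback features; 3. The EFM injects shape constraints through an edge detection task, guiding the network to focus on the foreground.

Result: On the AG-MNIST and AG-Fashion-MNIST test sets, ICPNet is significantly more sensitive to illusory contours than SOTA models, with notable gains in top-1 accuracy.

Insight: Introducing biological vision mechanisms such as shape bias and multi-scale feature processing can markedly improve DNN performance on tasks like illusory contour perception, pointing to a new route toward human-level intelligence for DNNs.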

Abstract: Higher levels of machine intelligence demand alignment with human perception and cognition. Machine intelligence dominated by deep neural networks (DNNs) has demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours like the abutting grating, a discrepancy with human perception patterns. Departing from previous works, we propose a novel deep network called illusory contour perception network (ICPNet) inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed in this work. Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to make a step towards human-level intelligence for DNN-based models.

[119] SEER-VAR: Semantic Egocentric Environment Reasoner for Vehicle Augmented Reality

Yuzhi Lai,Shenghai Yuan,Peizheng Li,Jun Lou,Andreas Zell

Main category: cs.CV

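TL;DR: SEER-VAR is a new vehicle augmented reality (AR) framework that combines semantic decomposition, Context-Aware SLAM Branches (CASB), and LLM-based recommendation to separate dynamic scenes and render AR overlays, improving driving scene understanding and user experience.

Details Motivation: Existing systems mostly assume static or single-view settings and cannot handle the complex, dynamic scenes encountered while driving. SEER-VAR takes a multimodal approach to this problem to make AR more effective in driving scenarios.

Contribution: 1. Proposes an AR framework that dynamically separates the driving scene into cabin and road contexts; 2. Introduces Context-Aware SLAM Branches (CASB) and an LLM-driven recommendation module; 3. Introduces EgoSLAM-Drive, a real-world driving dataset with multimodal data.

Method: 1. Separate the scene dynamically with depth-guided vision-language grounding; 2. Track egocentric motion in the cabin and on the road with two separate SLAM branches; 3. Generate context-aware AR overlays, such as dashboard cues and hazard alerts, with a GPT-based module.

Result: Experiments show robust spatial alignment and perceptually coherent AR rendering across varied environments, and user studies indicate improved perceived scene understanding, overlay relevance, and driver ease.

Insight: The paper is one of the first to explore LLM-driven AR recommendation in driving scenarios, offers a new approach to AR in dynamic environments, and underlines the importance of multimodal data.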

Abstract: We present SEER-VAR, a novel framework for egocentric vehicle-based augmented reality (AR) that unifies semantic decomposition, Context-Aware SLAM Branches (CASB), and LLM-driven recommendation. Unlike existing systems that assume static or single-view settings, SEER-VAR dynamically separates cabin and road scenes via depth-guided vision-language grounding. Two SLAM branches track egocentric motion in each context, while a GPT-based module generates context-aware overlays such as dashboard cues and hazard alerts. To support evaluation, we introduce EgoSLAM-Drive, a real-world dataset featuring synchronized egocentric views, 6DoF ground-truth poses, and AR annotations across diverse driving scenarios. Experiments demonstrate that SEER-VAR achieves robust spatial alignment and perceptually coherent AR rendering across varied environments. As one of the first to explore LLM-based AR recommendation in egocentric driving, we address the lack of comparable systems through structured prompting and detailed user studies. Results show that SEER-VAR enhances perceived scene understanding, overlay relevance, and driver ease, providing an effective foundation for future research in this direction. Code and dataset will be made open source.

Sumedha Arya,Nirmal Gaud

Main category: cs.CV

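TL;DR: ResLink is a new deep learning architecture that combines area attention mechanisms with residual connections for brain tumor classification, reaching 95% accuracy and showing strong generalization.

Details Motivation: Brain tumors can severely affect neurological function, and early, accurate diagnosis is crucial for treatment. Existing deep learning methods still leave room for improvement on brain tumor classification.

Contribution: 1. Proposes the ResLink architecture, which integrates area attention mechanisms with residual connections. 2. Designs a multi-stage convolutional pipeline that incorporates dropout and regularization. 3. Achieves high classification accuracy on a balanced dataset.

Method: 1. Extract features with a multi-stage convolutional pipeline. 2. Add area attention mechanisms to strengthen spatial understanding. 3. Use residual connections to improve feature learning. 4. Classify after a final attention-based refinement.

Result: ResLink reaches 95% accuracy on brain tumor classification from CT scan images and generalizes well.

Insight: Combining area attention with residual connections can effectively improve medical image classification, and the ResLink architecture may also be applicable to other medical imaging tasks.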

Abstract: Brain tumors pose significant health challenges due to their potential to impair critical neurological functions. Early and accurate diagnosis is crucial for effective treatment. In this research, we propose ResLink, a novel deep learning architecture for brain tumor classification using CT scan images. ResLink integrates novel area attention mechanisms with residual connections to enhance feature learning and spatial understanding for spatially rich image classification tasks. The model employs a multi-stage convolutional pipeline, incorporating dropout, regularization, and downsampling, followed by a final attention-based refinement for classification. Trained on a balanced dataset, ResLink achieves a high accuracy of 95% and demonstrates strong generalizability. This research demonstrates the potential of ResLink in improving brain tumor classification, offering a robust and efficient technique for medical imaging applications.

[121] CLIFF: Continual Learning for Incremental Flake Features in 2D Material Identification

Sankalp Pandey,Xuan Bac Nguyen,Nicholas Borys,Hugh Churchill,Khoa Luu

Main category: cs.CV

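TL;DR: The paper proposes CLIFF, a continual learning method that addresses the performance drop caused by appearance differences between materials in 2D material flake classification. It freezes the backbone and base head, learns a material-specific prompt, embedding, and delta head, and adds memory replay with knowledge distillation, significantly reducing forgetting.

Details Motivation: In optical microscopy images of 2D materials, flake appearance varies widely between materials, so automated classifiers struggle to adapt to new materials. Existing methods usually perform poorly on new materials and tend to forget what they learned earlier. The paper therefore tackles the problem with a continual learning framework.

Contribution: 1. Presents what the authors describe as the first systematic study of continual learning for 2D material flake classification; 2. Proposes a framework that freezes the backbone and learns a material-specific prompt, embedding, and delta head; 3. Combines a prompt pool, cosine-similarity gating, and memory replay to significantly reduce forgetting.

Method: 1. Freeze the backbone and base head trained on a reference material; 2. Learn a specific prompt, embedding, and delta head for each new material; 3. Modulate features with a prompt pool and a cosine-similarity gate; 4. Add memory replay with knowledge distillation.

Result: CLIFF achieves competitive classification accuracy with significantly lower forgetting than naive fine-tuning and a prompt-based baseline.

Insight: 1. Continual learning can address the adaptation challenge in 2D material classification; 2. Freezing the core components and adding material-specific adjustments is an efficient strategy; 3. Memory replay and knowledge distillation further improve stability.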

Abstract: Identifying quantum flakes is crucial for scalable quantum hardware; however, automated layer classification from optical microscopy remains challenging due to substantial appearance shifts across different materials. In this paper, we propose a new Continual-Learning Framework for Flake Layer Classification (CLIFF). To our knowledge, this is the first systematic study of continual learning in the domain of two-dimensional (2D) materials. Our method enables the model to differentiate between materials and their physical and optical properties by freezing a backbone and base head trained on a reference material. For each new material, it learns a material-specific prompt, embedding, and a delta head. A prompt pool and a cosine-similarity gate modulate features and compute material-specific corrections. Additionally, we incorporate memory replay with knowledge distillation. CLIFF achieves competitive accuracy with significantly lower forgetting than naive fine-tuning and a prompt-based baseline.
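One way to picture a cosine-similarity gate combined with a frozen base head and a per-material delta head is sketched below. This is an illustrative reading of the abstract, not the authors' architecture: the gating rule (clamped similarity to the best-matching prompt key), the additive correction, and every name are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def gated_logits(feature, base_logits, prompt_keys, delta_logits):
    """Add a material-specific correction to frozen base logits, scaled by a similarity gate.

    prompt_keys:  {material: key vector}  (hypothetical prompt-pool keys)
    delta_logits: {material: correction produced by that material's delta head}
    """
    sims = {name: cosine(feature, key) for name, key in prompt_keys.items()}
    best = max(sims, key=sims.get)   # best-matching material in the pool
    gate = max(0.0, sims[best])      # assumed gate: similarity clamped to [0, 1]
    corrected = [b + gate * d for b, d in zip(base_logits, delta_logits[best])]
    return best, gate, corrected
```

Because the base logits are never overwritten, a feature that matches no prompt key (gate near zero) falls back to the reference-material prediction, which is one intuition for why such a design can limit forgetting.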

[122] AdaGAT: Adaptive Guidance Adversarial Training for the Robustness of Deep Neural Networks

Zhenyu Liu,Huizhi Liang,Xinrun Li,Vaclav Snasel,Varun Ojha

Main category: cs.CV

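TL;DR: The paper proposes AdaGAT, an adaptive guidance adversarial training method that dynamically adjusts the training state of the guide model to improve the robustness of the target model.

Details Motivation: Adversarial distillation (AD) transfers robustness from teacher deep neural networks to lightweight student models, but in existing methods a learnable guide model is hard to keep in its optimal state during training, which limits knowledge transfer.

Contribution: Proposes AdaGAT, which uses two separate loss functions to adjust the guide model's state dynamically so that it participates more actively in backpropagation, yielding better robustness transfer.

Method: Uses a dynamically adjusted training strategy for the guide model, built on two separate loss functions, and validates it with extensive experiments on three datasets with WideResNet-34-10 as the target model.

Result: Experiments on CIFAR-10, CIFAR-100, and TinyImageNet show that AdaGAT improves the target model's robustness against various adversarial attacks compared with a variety of baseline models.

Insight: The guide model's training state needs to be adjusted dynamically; keeping it within an appropriate accuracy range improves the target model's robustness and suggests a new direction for optimizing adversarial training.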

Abstract: Adversarial distillation (AD) is a knowledge distillation technique that facilitates the transfer of robustness from teacher deep neural network (DNN) models to lightweight target (student) DNN models, enabling the target models to perform better than when the student model is trained independently. Some previous works focus on using a small, learnable teacher (guide) model to improve the robustness of a student model. Since a learnable guide model starts learning from scratch, maintaining its optimal state for effective knowledge transfer during co-training is challenging. Therefore, we propose a novel Adaptive Guidance Adversarial Training (AdaGAT) method. Our method, AdaGAT, dynamically adjusts the training state of the guide model to instill robustness in the target model. Specifically, we develop two separate loss functions as part of the AdaGAT method, allowing the guide model to participate more actively in backpropagation to achieve its optimal state. We evaluated our approach via extensive experiments on three datasets: CIFAR-10, CIFAR-100, and TinyImageNet, using the WideResNet-34-10 model as the target model. Our observations reveal that appropriately adjusting the guide model within a certain accuracy range enhances the target model’s robustness across various adversarial attacks compared to a variety of baseline models.

[123] Spatial-Temporal Human-Object Interaction Detection

Xu Sun,Yunqing He,Tongwei Ren,Gangshan Wu

Main category: cs.CV

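TL;DR: The paper proposes ST-HOID, a new instance-level human-object interaction detection task on videos that aims to distinguish fine-grained human-object interactions (HOIs) together with the trajectories of subjects and objects. The method consists of an object trajectory detection module and an interaction reasoning module and is validated on the newly constructed VidOR-HOID dataset.

Details Motivation: HOI is crucial for human-centric video content understanding. Existing methods focus mainly on HOI detection in static images, whereas spatial-temporal interaction detection in videos remains under-explored.

Contribution: 1. Proposes ST-HOID, a new spatial-temporal human-object interaction detection task on videos; 2. Designs a method consisting of an object trajectory detection module and an interaction reasoning module; 3. Constructs VidOR-HOID, the first dataset for ST-HOID evaluation, containing 10,831 spatial-temporal HOI instances.

Method: The method has two parts: 1. An object trajectory detection module captures the motion trajectories of subjects and objects; 2. An interaction reasoning module classifies fine-grained HOIs using spatial-temporal context.

Result: On VidOR-HOID, the method outperforms baselines built from state-of-the-art image HOI detection, video visual relation detection, and video HOI recognition methods.

Insight: Spatial-temporal dynamics in video carry additional interaction cues that help HOI detection; the new task opens a new direction for video content understanding.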

Abstract: In this paper, we propose a new instance-level human-object interaction detection task on videos called ST-HOID, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset named VidOR-HOID for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms the baselines generated by the state-of-the-art methods of image human-object interaction detection, video visual relation detection and video human-object interaction recognition.

[124] MTNet: Learning modality-aware representation with transformer for RGBT tracking

Ruichao Hou,Boyue Xu,Tongwei Ren,Gangshan Wu

Main category: cs.CV

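TL;DR: MTNet is a transformer-based multi-modal tracker that learns modality-aware representations to improve RGBT tracking. A modality-aware network and a transformer fusion network address the limits on feature interaction, and a dynamic template update strategy keeps tracking efficient.

Details Motivation: Current RGBT trackers are limited by their regular fusion paradigm and fixed tracking template, which restrict feature interaction and tracking performance. MTNet aims to improve robustness and accuracy by learning modality-aware representations and global dependencies.

Contribution: 1. Proposes a modality-aware network comprising a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM); 2. Designs a transformer fusion network to capture global dependencies; 3. Introduces a trident prediction head and a dynamic update strategy.

Method: 1. The modality-aware network (CADM and SSPM) extracts modality-specific cues; 2. The transformer fusion network strengthens global cross-modal interaction; 3. The trident prediction head and dynamic template update strategy improve localization accuracy.

Result: Achieves results competitive with state-of-the-art trackers on three RGBT benchmarks while running at real-time speed.

Insight: Modality-aware representations and global dependencies are key to multi-modal tracking performance, and a dynamic template update strategy helps with scale variation and deformation.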

Abstract: The ability to learn robust multi-modality representation has played a critical role in the development of RGBT tracking. However, the regular fusion paradigm and the invariable tracking template still restrict feature interaction. In this paper, we propose a modality-aware tracker based on a transformer, termed MTNet. Specifically, a modality-aware network is presented to explore modality-specific cues, which contains both a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM). A transformer fusion network is then applied to capture global dependencies to reinforce instance representations. To estimate the precise location and tackle challenges such as scale variation and deformation, we design a trident prediction head and a dynamic update strategy, which jointly maintain a reliable template for facilitating inter-frame communication. Extensive experiments validate that the proposed method achieves satisfactory results compared with the state-of-the-art competitors on three RGBT benchmarks while reaching real-time speed.

[125] Quickly Tuning Foundation Models for Image Segmentation

Breenda Das,Lennart Purucker,Timur Carstensen,Frank Hutter

Main category: cs.CV

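TL;DR: QTT-SEG is a meta-learning-based approach for quickly fine-tuning SAM (Segment Anything Model) on specific image segmentation tasks, substantially reducing the need for manual effort and domain expertise.

Details Motivation: Foundation models such as SAM show strong zero-shot segmentation performance but often fall short on domain-specific tasks. Conventional fine-tuning requires significant manual effort and domain knowledge, so an automated, efficient way to optimize fine-tuning is needed.

Contribution: Proposes QTT-SEG, a meta-learning-based fast fine-tuning approach that predicts high-performing configurations, greatly reducing manual tuning and outperforming zero-shot SAM and an AutoML baseline within a short time budget.

Method: Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG uses meta-learned cost and performance models to search a space of more than 200 million configurations efficiently.

Result: On eight binary and five multiclass segmentation datasets, QTT-SEG consistently improves on SAM's zero-shot performance and surpasses AutoGluon Multimodal on most binary tasks within three minutes; it also delivers consistent gains on multiclass datasets.

Insight: Meta-learning shows strong promise for automating model adaptation, reducing manual effort and adapting quickly to specialized segmentation tasks.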

Abstract: Foundation models like SAM (Segment Anything Model) exhibit strong zero-shot image segmentation performance, but often fall short on domain-specific tasks. Fine-tuning these models typically requires significant manual effort and domain expertise. In this work, we introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG predicts high-performing configurations using meta-learned cost and performance models, efficiently navigating a search space of over 200 million possibilities. We evaluate QTT-SEG on eight binary and five multiclass segmentation datasets under tight time constraints. Our results show that QTT-SEG consistently improves upon SAM’s zero-shot performance and surpasses AutoGluon Multimodal, a strong AutoML baseline, on most binary tasks within three minutes. On multiclass datasets, QTT-SEG delivers consistent gains as well. These findings highlight the promise of meta-learning in automating model adaptation for specialized segmentation tasks. Code available at: https://github.com/ds-brx/QTT-SEG/
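The role of separate cost and performance predictors under a tight time limit can be sketched as follows. This is a simplified illustration, not the Quick-Tune algorithm: the two predictor callables stand in for the meta-learned models, and the function name and dictionary fields are hypothetical.

```python
def pick_config(candidates, predict_score, predict_cost, budget_seconds):
    """Return the candidate with the best predicted score whose predicted cost fits the budget."""
    feasible = [c for c in candidates if predict_cost(c) <= budget_seconds]
    return max(feasible, key=predict_score) if feasible else None

# Usage with stand-in predictors that simply read precomputed fields.
candidates = [
    {"name": "small", "score": 0.70, "cost": 60},
    {"name": "medium", "score": 0.80, "cost": 150},
    {"name": "large", "score": 0.90, "cost": 600},
]
best = pick_config(candidates, lambda c: c["score"], lambda c: c["cost"], 180)
```

With a three-minute (180-second) budget, the sketch skips the highest-scoring candidate because its predicted cost exceeds the budget, which is the trade-off a cost model makes explicit.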

[126] Explain Before You Answer: A Survey on Compositional Visual Reasoning

Fucai Ke,Joy Hsu,Zhixi Cai,Zixian Ma,Xin Zheng,Xindi Wu,Sukai Huang,Weiqing Wang,Pari Delir Haghighi,Gholamreza Haffari,Ranjay Krishna,Jiajun Wu,Hamid Rezatofighi

Main category: cs.CV

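TL;DR: This paper is a comprehensive survey of compositional visual reasoning, covering 260+ papers from 2023 to 2025 and systematically reviewing the field's development, core definitions, technical paradigms, benchmarks, and future directions.

Details Motivation: Compositional visual reasoning is a key research frontier in multimodal AI, aiming to give machines the human-like ability to decompose visual scenes, ground concepts, and reason logically. Earlier surveys focus on monolithic vision-language models or general multimodal reasoning, and a dedicated synthesis of this literature is still missing.

Contribution: Provides a dedicated, systematic survey of the compositional visual reasoning literature, with a unified taxonomy, a five-stage paradigm shift, a catalog of 60+ benchmarks, and a set of key insights and open challenges.

Method: Through a literature review, the paper traces a five-stage paradigm shift (from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to chain-of-thought reasoning and unified agentic VLMs) and analyzes the architectural designs, strengths, and limitations of each stage.

Result: The survey finds that compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency, but face challenges including hallucination, scalable supervision, and benchmark limitations.

Insight: Future directions include world-model integration, human-AI collaborative reasoning, and richer evaluation protocols; the paper stresses the value of a unified framework for the field.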

Abstract: Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.

[127] FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising

Zhihao Chen,Qi Gao,Zilong Li,Junping Zhang,Yi Zhang,Jun Zhao,Hongming Shan

Main category: cs.CV

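TL;DR: FoundDiff is a foundational diffusion model for generalizable low-dose CT denoising that handles diverse noise characteristics across dose levels and anatomical regions.

Details Motivation: Existing deep learning methods are typically trained on a specific dose level and anatomical region and struggle with the diverse noise and anatomical heterogeneity of varied scanning conditions, which limits their generalizability.

Contribution: Proposes FoundDiff, which achieves generalizable low-dose CT denoising with a two-stage strategy (dose-anatomy perception and adaptive denoising) that combines contrastive learning with a diffusion model.

Method: DA-CLIP provides dose and anatomy perception, and the DA-Diff model performs adaptive denoising by integrating the learned dose and anatomy embeddings into the diffusion process through a Mamba-based dose and anatomy conditional block (DACB).

Result: Experiments on two public datasets show that FoundDiff outperforms existing methods in denoising performance and generalization.

Insight: Combining contrastive learning with a diffusion model gives FoundDiff strong generalization in low-dose CT denoising, including to unseen dose levels.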

Abstract: Low-dose computed tomography (CT) denoising is crucial for reducing radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity under varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DA-CLIP into the diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate superior denoising performance of FoundDiff over existing state-of-the-art methods and remarkable generalization to unseen dose levels. The codes and models are available at https://github.com/hao1635/FoundDiff.

[128] PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing

Peilin Xiong,Junwen Chen,Honghui Yuan,Keiji Yanai

Main category: cs.CV

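TL;DR: PosBridge is a training-free, scalable image editing framework that uses positional embedding transplant and a multi-view layout to guide a diffusion model toward identity-consistent image editing.

Details Motivation: As generative models scale, training becomes increasingly costly, creating a need for training-free and scalable editing frameworks.

Contribution: 1) Proposes positional embedding transplant, which guides the model to faithfully replicate the structural characteristics of reference objects; 2) Designs the Corner Centered Layout, which uses multi-view inputs to guide the diffusion model toward identity-consistent content.

Method: Combines positional embedding transplant with the Corner Centered Layout; during denoising, the noise distribution in the target region is guided toward that of the reference object.

Result: Experiments show that PosBridge outperforms baselines in structural consistency, appearance fidelity, and computational efficiency.

Insight: Positional embedding transplant is the key to identity-aware editing, and the multi-view layout improves the accuracy of the generated content.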

Abstract: Localized subject-driven image editing aims to seamlessly integrate user-specified objects into target scenes. As generative models continue to scale, training becomes increasingly costly in terms of memory and computation, highlighting the need for training-free and scalable editing frameworks. To this end, we propose PosBridge, an efficient and flexible framework for inserting custom objects. A key component of our method is positional embedding transplant, which guides the diffusion model to faithfully replicate the structural characteristics of reference objects. Meanwhile, we introduce the Corner Centered Layout, which concatenates reference images and the background image as input to the FLUX.1-Fill model. During progressive denoising, positional embedding transplant is applied to guide the noise distribution in the target region toward that of the reference object. In this way, Corner Centered Layout effectively directs the FLUX.1-Fill model to synthesize identity-consistent content at the desired location. Extensive experiments demonstrate that PosBridge outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency, showcasing its practical value and potential for broad adoption.

[129] First Place Solution to the MLCAS 2025 GWFSS Challenge: The Devil is in the Detail and Minority

Songliang Cao,Tianqi Hu,Hao Lu

Main category: cs.CV

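TL;DR: This report presents the winning solution to the MLCAS 2025 GWFSS Challenge, which focuses on fine-grained segmentation of wheat stems and improves performance with a dynamic upsampler, semi-supervised distillation, and a test-time scaling strategy.

Details Motivation: Existing semantic segmentation methods already work well for wheat organ segmentation, but stems, with their fine structure and few pixels, suffer from fragile predictions and class imbalance and are the key bottleneck.

Contribution: Three improvements tailored to wheat stems: the dynamic upsampler SAPA to enhance detail, semi-supervised guided distillation to exploit unlabeled data, and a test-time scaling strategy.

Method: 1) Use the dynamic upsampler SAPA; 2) Apply semi-supervised guided distillation with stem-aware sample selection; 3) Apply test-time scaling (zooming in and segmenting the image twice).

Result: The solution took first place by a clear margin; code and models are to be released.

Insight: In segmentation, methods tailored to the nature of the specific problem (here, stems) are more effective than generic tricks.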

Abstract: In this report, we present our solution to the MLCAS 2025 GWFSS Challenge. This challenge hosts a semantic segmentation competition specific to wheat plants, which requires segmenting three wheat organs (head, leaf, and stem) plus a background class. In 2025, participating in a segmentation competition is significantly different from previous years, when many tricks could play important roles. Nowadays most segmentation tricks have been well integrated into existing codebases, such that our naive ViT-Adapter baseline already achieves sufficiently good performance. Hence, we believe the key to standing out among competitors is to focus on the nature of the wheat problem itself. By probing visualizations, we identify the key: the stem matters. In contrast to heads and leaves, stems exhibit fine structure and occupy only a few pixels, and therefore suffer from fragile predictions and class imbalance. Building on our baseline, we present three technical improvements tailored to stems: i) incorporating a dynamic upsampler, SAPA, to enhance detail delineation; ii) leveraging semi-supervised guided distillation with stem-aware sample selection to mine the treasure beneath unlabeled data; and iii) applying a test-time scaling strategy to zoom in and segment the image twice. Despite being simple, the three improvements bring us to first place in the competition, outperforming the second place by clear margins. Code and models will be released at https://github.com/tiny-smart/gwfss25.
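The "zoom in and segment twice" idea can be sketched as a two-pass inference routine. The report does not state how the two passes are fused, so averaging the native-resolution and zoomed probability maps is an assumption here, as are the function names; `predict` is a stand-in for any model that returns a per-pixel probability map of the same size as its input.

```python
def upscale_nearest(image, factor):
    """Nearest-neighbour zoom: repeat every pixel factor x factor times."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]

def downscale_mean(prob, factor):
    """Average each factor x factor block to return to the native resolution."""
    h, w = len(prob) // factor, len(prob[0]) // factor
    return [[sum(prob[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]

def segment_twice(image, predict, factor=2):
    """Run the model at native and zoomed resolution, then average the two maps."""
    native = predict(image)
    zoomed = downscale_mean(predict(upscale_nearest(image, factor)), factor)
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(native, zoomed)]
```

The intuition is that thin structures such as stems cover more pixels in the zoomed pass, so a model with a fixed receptive field gets a better look at them.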

[130] Defending Deepfake via Texture Feature Perturbation

Xiao Zhang,Changfang Chen,Tianyi Wang

Main category: cs.CV

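TL;DR: The paper proposes a proactive Deepfake defense based on facial texture features: it inserts invisible perturbations in texture regions to disrupt Deepfake generation while minimizing the visual impact on non-textured areas.

Details Motivation: The rapid development of Deepfake technology threatens social trust and information security. Most existing detection methods rely on passive analysis and struggle with high-quality Deepfake content, so proactive defense strategies are needed.

Contribution: 1. Proposes a texture-guided proactive defense framework; 2. Designs a dual-model attention strategy to optimize texture perturbations; 3. Inserts perturbations in texture regions with low perceptual saliency, balancing defense strength and visual quality.

Method: 1. Extract texture features with Local Binary Patterns (LBP); 2. Generate and optimize texture perturbations with a dual-model attention strategy; 3. Apply perturbations only to key texture regions.

Result: Experiments on CelebA-HQ and LFW show that the method distorts Deepfake generation and produces obvious visual defects under multiple attack models.

Insight: Human eyes are more sensitive to perturbations in smooth regions, so low-saliency texture regions are a good place to insert perturbations; the method offers an efficient and scalable approach to proactive defense.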

Abstract: The rapid development of Deepfake technology poses severe challenges to social trust and information security. While most existing detection methods rely primarily on passive analyses, which struggle with high-quality Deepfake content, proactive defense has recently emerged, inserting invisible signals before image editing takes place. In this paper, we introduce a proactive Deepfake detection approach based on facial texture features. Since human eyes are more sensitive to perturbations in smooth regions, we invisibly insert perturbations within texture regions that have low perceptual saliency, applying localized perturbations to key texture regions while minimizing unwanted noise in non-textured areas. Our texture-guided perturbation framework first extracts preliminary texture features via Local Binary Patterns (LBP), and then introduces a dual-model attention strategy to generate and optimize texture perturbations. Experiments on CelebA-HQ and LFW datasets demonstrate the promising performance of our method in distorting Deepfake generation and producing obvious visual defects under multiple attack models, providing an efficient and scalable solution for proactive Deepfake detection.
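For readers unfamiliar with Local Binary Patterns, the basic 8-neighbour operator is sketched below, together with a crude texture mask. Only the LBP operator itself is standard; the mask rule (treating anything other than a flat patch or an isolated peak as textured) and the function names are assumptions for illustration, not the paper's feature extractor.

```python
# Clockwise 8-neighbourhood, starting at the top-left neighbour.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(gray):
    """Basic 3x3 LBP: bit i is set when neighbour i is >= the centre pixel.

    Border pixels have no full neighbourhood and are left at code 0.
    """
    h, w = len(gray), len(gray[0])
    codes = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            centre = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(OFFSETS):
                if gray[y + dy][x + dx] >= centre:
                    code |= 1 << bit
            codes[y][x] = code
    return codes

def texture_mask(codes):
    """Assumed rule: code 255 is a flat patch, code 0 an isolated peak or border; the rest is texture."""
    return [[c not in (0, 255) for c in row] for row in codes]
```

A mask like this could then restrict where a perturbation is applied, in the spirit of perturbing only texture regions; the paper's actual perturbation generation relies on its dual-model attention strategy, which this sketch does not attempt to reproduce.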

[131] SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation

Zhenyu Jin,Wenjie Li,Zhanyu Ma,Heng Guo

Main category: cs.CV

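TL;DR: The paper proposes SpecGen, which uses a Spectral-Spatial Tri-plane Aggregation (SSTA) network to generate spectral BRDFs from a single RGB image of a sphere, addressing the scarcity of spectral BRDF data and substantially improving hyperspectral image reconstruction.

Details Motivation: Conventional spectral uplifting methods convert RGB images into spectral images but cannot generate spectral BRDFs. Measured spectral BRDF data is scarce and hard to acquire, which limits high-quality spectral rendering, so a method is needed that generates spectral BRDFs efficiently from limited RGB data.

Contribution: Proposes SpecGen, which generates spectral BRDFs from a single RGB image and enables spectral image rendering under arbitrary illuminations and shapes. Designs the SSTA network, which leverages abundant RGB BRDF data to improve spectral BRDF generation.

Method: The SSTA network models reflectance responses across wavelengths and incident-outgoing directions, so that training can leverage abundant RGB BRDF data for spectral BRDF generation. The architecture aggregates spectral-spatial tri-planes to fuse information across these dimensions.

Result: Experiments show that SpecGen accurately reconstructs spectral BRDFs from limited spectral data and improves PSNR by 8 dB in hyperspectral image reconstruction over state-of-the-art methods.

Insight: Leveraging RGB BRDF data during training eases the scarcity of spectral data and offers an efficient new approach to spectral rendering.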

Abstract: Synthesizing spectral images across different wavelengths is essential for photorealistic rendering. Unlike conventional spectral uplifting methods that convert RGB images into spectral ones, we introduce SpecGen, a novel method that generates spectral bidirectional reflectance distribution functions (BRDFs) from a single RGB image of a sphere. This enables spectral image rendering under arbitrary illuminations and shapes covered by the corresponding material. A key challenge in spectral BRDF generation is the scarcity of measured spectral BRDF data. To address this, we propose the Spectral-Spatial Tri-plane Aggregation (SSTA) network, which models reflectance responses across wavelengths and incident-outgoing directions, allowing the training strategy to leverage abundant RGB BRDF data to enhance spectral BRDF generation. Experiments show that our method accurately reconstructs spectral BRDFs from limited spectral data and surpasses state-of-the-art methods in hyperspectral image reconstruction, achieving an improvement of 8 dB in PSNR. Codes and data will be released upon acceptance.
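To put the reported 8 dB PSNR improvement in perspective, the short calculation below uses the standard definition PSNR = 10 log10(peak^2 / MSE). It only converts a dB gain into the equivalent reduction in mean squared error; the function names are illustrative and the numbers are not taken from the paper's tables.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (gain_db / 10.0)

ratio = mse_ratio_for_gain(8.0)  # about 6.31
```

In other words, at a fixed peak value an 8 dB gain corresponds to a mean squared error roughly 6.3 times smaller.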

[132] Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Somraj Gautam,Abhirama Subramanyam Penamakuri,Abhishek Bhandari,Gaurav Harit

Main category: cs.CV

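TL;DR: The paper introduces the MMCRICBENCH-3K benchmark for evaluating large vision-language models (LVLMs) on complex numerical and cross-lingual tasks, exposing the limits of current models in structured data understanding and cross-lingual generalization.

Details Motivation: Although existing LVLMs perform well on multimodal tasks, they still fall short on numerical reasoning and cross-lingual tasks over semi-structured tabular images such as cricket scorecards. A dedicated benchmark is therefore needed to evaluate them systematically.

Contribution: The main contribution is MMCRICBENCH-3K, a dataset of synthetically generated English and Hindi cricket scorecard images with question-answer pairs, for evaluating the numerical reasoning and cross-lingual abilities of LVLMs.

Method: MMCRICBENCH-3K is built from 1,463 synthetically generated scorecard images (ODI, T20, and Test formats) with 1,500 English QA pairs. It has two subsets, English scorecards (MMCRICBENCH-E-1.5K) and Hindi scorecards (MMCRICBENCH-H-1.5K), with all questions and answers in English to enable controlled cross-script evaluation.

Result: Even state-of-the-art LVLMs such as GPT-4o and Qwen2.5VL struggle on the English subset and drop further on the Hindi subset, revealing limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization.

Insight: The study identifies key challenges for LVLMs on semi-structured data and cross-lingual tasks and points to directions for improving them.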

Abstract: We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.

[133] No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection

Lianrui Mu,Zou Xingze,Jianhong Bai,Jiaqi Hu,Wenjie Zheng,Jiangnan Ye,Jiedong Zhuang,Mudassar Ali,Jing Wang,Haoji Hu

Main category: cs.CV

TL;DR: HiDA-Net is a novel framework for detecting high-resolution AI-generated images by preserving pixel-level details and addressing information loss from resizing or cropping. It introduces a Feature Aggregation Module (FAM), Token-wise Forgery Localization (TFL), and JPEG Quality Factor Estimation (QFE) for robust detection, improving accuracy by over 13% on Chameleon and 10% on HiRes-50K.

Details Motivation: Existing detection methods struggle with high-resolution AI-generated images due to resizing or cropping, leading to information loss. HiDA-Net aims to address this by preserving all pixel details.

Contribution: 1. HiDA-Net framework for high-resolution image detection. 2. Feature Aggregation Module (FAM) to fuse local and global features. 3. Token-wise Forgery Localization (TFL) and JPEG Quality Factor Estimation (QFE) for robustness. 4. HiRes-50K benchmark for future research.

Method: HiDA-Net uses FAM to combine full-resolution local tiles with a down-sampled global view. TFL provides fine-grained spatial sensitivity, and QFE separates generative artifacts from compression noise.

Result: Achieves state-of-the-art performance, improving accuracy by over 13% on Chameleon and 10% on HiRes-50K.

Insight: Preserving pixel-level details is crucial for detecting high-resolution AI-generated images, and explicit handling of compression noise enhances robustness.

Abstract: The rapid growth of high-resolution, meticulously crafted AI-generated images poses a significant challenge to existing detection methods, which are often trained and evaluated on low-resolution, automatically generated datasets that do not align with the complexities of high-resolution scenarios. A common practice is to resize or center-crop high-resolution images to fit standard network inputs. However, without full coverage of all pixels, such strategies risk either obscuring subtle, high-frequency artifacts or discarding information from uncovered regions, leading to input information loss. In this paper, we introduce the High-Resolution Detail-Aggregation Network (HiDA-Net), a novel framework that ensures no pixel is left behind. We use the Feature Aggregation Module (FAM), which fuses features from multiple full-resolution local tiles with a down-sampled global view of the image. These local features are aggregated and fused with global representations for final prediction, ensuring that native-resolution details are preserved and utilized for detection. To enhance robustness against challenges such as localized AI manipulations and compression, we introduce Token-wise Forgery Localization (TFL) module for fine-grained spatial sensitivity and JPEG Quality Factor Estimation (QFE) module to disentangle generative artifacts from compression noise explicitly. Furthermore, to facilitate future research, we introduce HiRes-50K, a new challenging benchmark consisting of 50,568 images with up to 64 megapixels. Extensive experiments show that HiDA-Net achieves state-of-the-art, increasing accuracy by over 13% on the challenging Chameleon dataset and 10% on our HiRes-50K.
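The "no pixel left behind" idea of covering an image with full-resolution tiles plus a down-sampled global view can be illustrated with a minimal sketch. The function names (`tile_starts`, `tile_boxes`, `global_view_size`) and the choice to shift the last tile inward rather than pad are assumptions made for illustration; the abstract does not specify how HiDA-Net lays out its tiles.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def tile_starts(length: int, tile: int) -> List[int]:
    """Start offsets of tiles of size `tile` that together cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, tile))
    starts.append(length - tile)  # shift the last tile inward instead of padding
    return starts


def tile_boxes(height: int, width: int, tile: int) -> List[Box]:
    """Full-resolution tile boxes whose union covers every pixel of the image."""
    boxes = []
    for top in tile_starts(height, tile):
        for left in tile_starts(width, tile):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes


def global_view_size(height: int, width: int, target: int) -> Tuple[int, int]:
    """Size of a down-sampled global view whose longer side equals `target`."""
    scale = target / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale))


if __name__ == "__main__":
    boxes = tile_boxes(10, 9, 4)
    print(len(boxes), "tiles; global view:", global_view_size(1000, 2000, 500))
```

Each box would be cropped at native resolution and encoded as a local tile, while the global view supplies context; how the two feature sets are fused is the job of the FAM and is not modeled here.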

[134] DiCache: Let Diffusion Model Determine Its Own Cache

Jiazi Bu,Pengyang Ling,Yujie Zhou,Yibin Wang,Yuhang Zang,Tong Wu,Dahua Lin,Jiaqi Wang

Main category: cs.CV

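TL;DR: DiCache is a training-free caching strategy driven by the diffusion model's own dynamics. It adaptively decides when to cache and how to use the cache, improving both generation efficiency and visual quality.

Details Motivation: Existing acceleration techniques for diffusion models usually rely on predefined rules or prior knowledge, which adapt poorly to the dynamic diffusion process and limit generalizability. DiCache addresses this by letting the model's own features determine the caching strategy dynamically.

Contribution: 1. Reveals a strong correlation between shallow-layer feature differences in the diffusion model and changes in the final output. 2. Proposes DiCache, a training-free adaptive caching framework that answers both when and how to cache in a unified way. 3. Validates experimentally that DiCache improves efficiency and visual quality.

Method: 1. Online Probe Profiling Scheme: obtains a prior for the caching error in real time so that the model can decide the caching schedule autonomously. 2. Dynamic Cache Trajectory Alignment: combines multi-step caches to better approximate the current feature, improving visual quality.

Result: On leading diffusion models including WAN 2.1, HunyuanVideo, and Flux, DiCache outperforms existing methods in both efficiency and visual quality.

Insight: The dynamics of shallow-layer features can serve as an adaptive signal for the caching strategy, avoiding hand-crafted rules and improving generalizability.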

Abstract: Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: “When to cache” and “How to use cache”, typically relying on predefined empirical laws or dataset-level priors to determine the timing of caching and utilizing handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, a strong correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of final model outputs. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain a stable prior for the caching error in real time, enabling the model to autonomously determine caching schedules. (2) Dynamic Cache Trajectory Alignment combines multi-step caches based on shallow-layer probe feature trajectory to better approximate the current feature, facilitating higher visual quality. Extensive experiments validate DiCache’s capability in achieving higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models including WAN 2.1, HunyuanVideo for video generation, and Flux for image generation.
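A minimal sketch of the "when to cache" decision follows, under loudly stated assumptions: the relative L1 change of a shallow probe feature is used as a proxy for caching error, the change is accumulated across skipped steps, and the full model is re-run once a hypothetical `threshold` is reached. This accumulate-and-threshold rule, and the names `relative_change` and `caching_schedule`, are illustrative choices and not DiCache's published procedure.

```python
from typing import List, Sequence


def relative_change(prev: Sequence[float], cur: Sequence[float]) -> float:
    """Relative L1 change between two probe feature vectors."""
    num = sum(abs(c - p) for p, c in zip(prev, cur))
    den = sum(abs(p) for p in prev)
    return num / den if den > 0 else float("inf")


def caching_schedule(probe_feats: List[Sequence[float]], threshold: float) -> List[bool]:
    """Per step, True means 'run the full model', False means 'reuse the cache'.

    probe_feats[t] is the (cheap) shallow-layer probe feature at step t.
    Assumed rule: accumulate probe changes since the last full run and
    recompute once the accumulated change reaches `threshold`.
    """
    if not probe_feats:
        return []
    schedule = [True]  # the first step has nothing cached, so it always runs
    accumulated = 0.0
    for prev, cur in zip(probe_feats, probe_feats[1:]):
        accumulated += relative_change(prev, cur)
        if accumulated >= threshold:
            schedule.append(True)
            accumulated = 0.0  # cache is refreshed, error estimate resets
        else:
            schedule.append(False)
    return schedule


if __name__ == "__main__":
    feats = [[1.0, 1.0], [1.0, 1.1], [1.0, 1.2], [2.0, 2.0]]
    print(caching_schedule(feats, threshold=0.3))
```

Because the signal is computed per sample at runtime, the schedule differs between inputs, which is the property the abstract contrasts with dataset-level priors.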

[135] Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

Guoqing Zhang,Xingtong Ge,Lu Shi,Xin Zhang,Muqing Xue,Wanru Xu,Yigang Cen

Main category: cs.CV

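TL;DR: The paper proposes UniGen, a unified image generation framework that uses the Condition Modulated Expert (CoMoE) module and the dynamic connection mechanism WeaveNet to address parameter redundancy and computational inefficiency in multi-condition image generation, reaching SOTA performance on multiple tasks.

Details Motivation: Existing image generation methods usually train a separate control branch for each type of condition, which leads to redundant model structures and inefficient use of computational resources. This paper aims to design a unified framework that supports diverse conditional inputs while improving generation efficiency and expressiveness.

Contribution: 1. Proposes the UniGen framework, which supports diverse conditional inputs; 2. Designs the CoMoE module, which assigns features to expert modules and models them independently, reducing feature entanglement and redundant computation; 3. Proposes WeaveNet, a dynamic connection mechanism that strengthens information exchange between global and fine-grained control.

Method: 1. The CoMoE module aggregates semantically similar image patch features and assigns them to dedicated expert modules for visual representation and conditional modeling; 2. WeaveNet uses a dynamic, snake-like connection mechanism to enable effective interaction between the backbone and the conditional branches.

Result: Experiments on the Subjects-200K and MultiGen-20M datasets show that UniGen performs strongly on multi-condition image generation tasks, reaching SOTA level.

Insight: Modular design combined with a dynamic interaction mechanism can substantially improve the performance and efficiency of multi-condition image generation while reducing redundant computation and feature entanglement.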

Abstract: The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.

[136] Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for Medical Diagnosis

Nir Mazor,Tom Hope

Main category: cs.CV

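TL;DR: The paper proposes a lightweight joint optimization method that combines a general-purpose large vision-language model (LVLM) with a retriever for medical diagnosis, achieving competitive results without medical pre-training.

Details Motivation: Clinical decision-making often requires interpreting images (e.g., radiology) to reach a diagnosis, and retrieving relevant visual information from medical literature and hospital records could improve diagnostic accuracy. The paper aims to improve medical diagnosis by jointly optimizing the LVLM and the retriever.

Contribution: 1. Proposes a lightweight joint optimization method that combines an LVLM with a retriever; 2. Uses general-purpose backbones that need only lightweight fine-tuning to compete with medically pre-trained models; 3. Analyzes how the diversity of retrieved images affects diagnosis.

Method: A multimodal retriever and an LVLM are optimized jointly so that the LVLM's error signal is propagated down to the retriever. Experiments cover clinical multi-label classification and visual question answering tasks.

Result: The model achieves results competitive with medically pre-trained models across multiple tasks. Joint optimization significantly improves on standard RAG for the challenging cases.

Insight: Cases in which different top retrieved images lead to different predictions are challenging for diagnosis. The correct diagnosis is frequently achievable using one of the top retrieved images, but a large gap remains between actual and oracle performance.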

Abstract: Clinical decision-making often involves interpreting images (e.g., radiology) for making diagnoses. Retrieving relevant visual information from medical literature and hospital records could enhance diagnostic accuracy. In this paper, we develop a model in which a multimodal retriever is jointly optimized with an LVLM for medical diagnosis, unlike standard RAG where LVLM error signal is not propagated down to the retriever. We show that using only general-purpose backbones, with only lightweight fine-tuning, our model is able to achieve competitive results with medically-pretrained models across clinical multi-label classification and visual question answering tasks. In a novel analysis, we additionally find that in many cases different top retrieved images each lead to different predictions for a given target, and that these cases are empirically challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these challenging cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap – leaving ample room for improvement by future methods. Code will be made publicly available.

[137] Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches

Aoqi Li,Yanghui Song,Jichao Dao,Chengfu Yang

Main category: cs.CV

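TL;DR: This paper proposes a deep learning-based image enhancement method for complex underwater scenes. It integrates the strengths of VGG19 and ResNet50 for multi-scale, multi-level feature analysis and evaluates the method with quantitative metrics.

Details Motivation: To address poor image quality in complex underwater scenes and make the images more useful for vision tasks.

Contribution: Proposes a unified deep learning model that combines the feature extraction capabilities of VGG19 and ResNet50 to achieve more comprehensive underwater image enhancement.

Method: Uses VGG19 and ResNet50 for multi-scale feature analysis, builds a unified model, and evaluates the results with metrics such as PSNR, UCIQE, and UIQM.

Result: The model performs well, achieving notable image enhancement across different scenarios.

Insight: Multi-model fusion and hardware selection are important for improving the practicality and stability of underwater vision systems.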

Abstract: This paper addresses the challenging problem of image enhancement in complex underwater scenes by proposing a solution based on deep learning. The proposed method skillfully integrates two deep convolutional neural network models, VGG19 and ResNet50, leveraging their powerful feature extraction capabilities to perform multi-scale and multi-level deep feature analysis of underwater images. By constructing a unified model, the complementary advantages of the two models are effectively integrated, achieving a more comprehensive and accurate image enhancement effect. To objectively evaluate the enhancement effect, this paper introduces image quality assessment metrics such as PSNR, UCIQE, and UIQM to quantitatively compare images before and after enhancement and deeply analyzes the performance of different models in different scenarios. Furthermore, to improve the practicality and stability of the underwater visual enhancement system, this paper also provides practical suggestions from aspects such as model optimization, multi-model fusion, and hardware selection, aiming to provide strong technical support for visual enhancement tasks in complex underwater environments.

[138] MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling

Haoyu Wang,Hao Tang,Donglin Di,Zhilu Zhang,Wangmeng Zuo,Feng Gao,Siwei Ma,Shiliang Zhang

Main category: cs.CV

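TL;DR: MoCo generates motion-consistent human videos by decoupling structure from appearance, addressing the weaknesses of existing methods on whole-body and long-range motion.

Details Motivation: Existing video generation models focus too heavily on appearance fidelity, which leads to unrealistic or physically implausible human motion with poor structural coherence. In addition, most datasets mainly contain facial or upper-body motion, which limits the generation of complex movements. MoCo aims to solve these problems by decoupling structure generation from appearance generation.

Contribution: 1. Proposes MoCo, which achieves motion-consistent human video generation by decoupling structure and appearance generation; 2. Introduces Human-Aware Dynamic Control modules and dense tracking constraints to improve fine-grained control over sparse structures; 3. Constructs a large-scale whole-body human motion video dataset.

Method: 1. An efficient 3D structure generator produces a human motion sequence from a text prompt; 2. The video appearance is synthesized under the guidance of the generated structural sequence; 3. Human-Aware Dynamic Control modules and dense tracking constraints are used to improve training.

Result: Experiments show that MoCo outperforms existing methods in generating realistic and structurally coherent human videos.

Insight: Decoupling structure and appearance generation is an effective route to generating complex human motion, and high-quality datasets are important for model performance.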

Abstract: Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.

[139] Data Leakage in Visual Datasets

Patrick Ramos,Ryan Ramos,Noa Garcia

Main category: cs.CV

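TL;DR: The paper analyzes data leakage in visual datasets, shows that image overlap between training and evaluation data undermines fair model evaluation, and proposes a taxonomy of leakage types together with a detection approach.

Details Motivation: Large-scale visual datasets are usually sourced from the internet, where many computer vision benchmarks are also publicly available. This can cause overlap between training and evaluation data (data leakage) and reduce the reliability of model evaluation.

Contribution: The paper systematically analyzes data leakage in visual datasets, categorizes it by modality, coverage, and degree, and uses image retrieval techniques to show that every analyzed dataset exhibits leakage.

Method: The paper applies image retrieval techniques to identify leakage in datasets and classifies the leakage according to its modality, coverage, and degree.

Result: All analyzed datasets exhibit some form of data leakage, and all types of leakage, from severe to subtle, compromise the reliability of model evaluation in downstream tasks.

Insight: Data leakage is widespread in visual datasets, and even subtle leakage harms model evaluation, so more care is needed when building datasets and evaluating models.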

Abstract: We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Given that large-scale datasets are often sourced from the internet, where many computer vision benchmarks are publicly available, we focus on identifying and studying this phenomenon. We categorize visual leakage into different types according to its modality, coverage, and degree. By applying image retrieval techniques, we unequivocally show that all the analyzed datasets present some form of leakage, and that all types of leakage, from severe instances to more subtle cases, compromise the reliability of model evaluation in downstream tasks.
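Retrieval-based leakage detection can be sketched as a nearest-neighbor search: each evaluation image is matched against the training set in some embedding space and flagged when the best match is very similar. The sketch below assumes image embeddings are already available as plain lists of floats; the `find_leaks` name, the cosine metric, and the 0.95 threshold are illustrative assumptions rather than the paper's settings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def find_leaks(
    train_feats: List[Sequence[float]],
    eval_feats: List[Sequence[float]],
    threshold: float = 0.95,  # assumed value, to be tuned per embedding model
) -> List[Tuple[int, int, float]]:
    """Return (eval_index, train_index, similarity) for suspected leaked images."""
    leaks = []
    for i, e in enumerate(eval_feats):
        best_j, best_sim = -1, -1.0
        for j, t in enumerate(train_feats):
            s = cosine(e, t)
            if s > best_sim:
                best_j, best_sim = j, s
        if best_j >= 0 and best_sim >= threshold:
            leaks.append((i, best_j, best_sim))
    return leaks


if __name__ == "__main__":
    train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    evaluation = [[0.99, 0.05, 0.0], [0.0, 0.0, 1.0]]
    print(find_leaks(train, evaluation))
```

The brute-force loop is quadratic in dataset size; at web scale an approximate nearest-neighbor index would replace it, and lower thresholds would surface the subtler leakage the abstract mentions at the cost of more false positives.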

[140] Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models

Xiaojie Yin,Qilong Wang,Qinghua Hu

Main category: cs.CV

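TL;DR: The paper proposes a constrained prompt enhancement (CPE) method that improves the zero-shot generalization of vision-language models by constructing comprehensive textual prompts and compact visual prompts from a semantic perspective.

Details Motivation: Existing vision-language models suffer from poor semantic alignment in zero-shot generalization because of the domain gap between pre-training and downstream tasks. Existing approaches mainly rely on text prompting or visual-text alignment, but they still suffer from incomplete textual prompts and noisy visual prompts.

Contribution: Proposes two key components, TGSSG (Topology-Guided Synonymous Semantic Generation) and CADRS (Category-Agnostic Discriminative Region Selection), which address the shortcomings of textual and visual prompts respectively, and achieves visual-textual alignment through set-to-set matching strategies.

Method: 1. TGSSG generates a synonymous semantic set and constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis; 2. CADRS uses the activation maps of a pre-trained vision model to select discriminative regions and generate compact visual prompts; 3. Set-to-set matching is performed with test-time adaptation (TTA) and optimal transport (OT).

Result: The proposed method improves visual-textual alignment and raises the zero-shot generalization performance of vision-language models.

Insight: Optimizing prompt generation from a semantic perspective, combined with discriminative region selection, offers a novel and effective approach to zero-shot tasks for vision-language models.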

Abstract: Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the issue of incomplete semantic expression in textual prompts, our TGSSG first generates synonymous semantic set for each category via large language models, and constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis. Visually, to mitigate the irrelevant visual noise introduced by random cropping, our CADRS identifies discriminative regions with activation maps outputted by a pre-trained vision model, effectively filtering out noisy regions and generating compact visual prompts. Given the comprehensive set of textual prompts and compact set of visual prompts, we introduce two set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) to achieve effective visual-textual alignment, and so improve zero-shot generalization of VLMs.
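Set-to-set matching with optimal transport can be illustrated with entropy-regularized OT solved by Sinkhorn iterations. This is a generic sketch, not the paper's formulation: the cost matrix (for example, one minus the cosine similarity between each visual prompt and each textual prompt), the uniform marginals, and the `eps` and `n_iter` values are all assumptions chosen for illustration.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], a: List[float], b: List[float],
             eps: float = 0.1, n_iter: int = 200) -> List[List[float]]:
    """Entropy-regularized transport plan between marginals a (rows) and b (cols)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def set_distance(cost: List[List[float]], a: List[float], b: List[float],
                 eps: float = 0.1, n_iter: int = 200) -> float:
    """Transport cost between two prompt sets; lower means better aligned."""
    plan = sinkhorn(cost, a, b, eps, n_iter)
    return sum(plan[i][j] * cost[i][j] for i in range(len(a)) for j in range(len(b)))


if __name__ == "__main__":
    # Two visual prompts vs. two textual prompts; matching pairs have zero cost.
    cost = [[0.0, 1.0], [1.0, 0.0]]
    print(set_distance(cost, [0.5, 0.5], [0.5, 0.5]))
```

For zero-shot classification, one such distance would be computed per class between the image's visual prompt set and that class's textual prompt set, and the class with the smallest distance would be predicted.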

Zhao Zheng,Jingfan Fan,Long Shao,Hong Song,Danni Ai,Tianyu Fu,Deqiang Xiao,Yongtian Wang,Jian Yang

Main category: cs.CV

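TL;DR: The paper proposes a geometric maximum overlapping registration framework based on rotation-only branch-and-bound (BnB) search, which improves the accuracy and efficiency of point cloud registration over existing SOTA methods.

Details Motivation: Current point cloud registration methods based on spatial compatibility graphs or multi-stage BnB search handle high outlier ratios well, but the former require at least quadratic space and time for graph construction, while the latter lose accuracy because of local optima between decomposed stages.

Contribution: Proposes a geometric maximum overlapping registration framework that decomposes the problem into a rotation-axis search and a 1D rotation-angle search, giving polynomial time complexity and space complexity that grows linearly with the number of points even in the worst case.

Method: Chasles' theorem is used to decompose the rigid transformation into a translation along the rotation axis and a 2D rigid transformation. The rotation axis and angle are searched via BnB, and the geometric-overlap maximization is solved efficiently with a sweep line algorithm and a segment tree.

Result: On the 3DMatch, 3DLoMatch, and KITTI datasets, the method outperforms existing SOTA methods in both accuracy and efficiency.

Insight: Rotation search guided by geometric overlap, combined with efficient BnB, allows the point cloud registration problem to be solved in polynomial time while avoiding the local optima of multi-stage decomposition.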

Abstract: Point cloud registration based on correspondences computes the rigid transformation that maximizes the number of inliers constrained within the noise threshold. Current state-of-the-art (SOTA) methods employing spatial compatibility graphs or branch-and-bound (BnB) search mainly focus on registration under high outlier ratios. However, graph-based methods require at least quadratic space and time complexity for graph construction, while multi-stage BnB search methods often suffer from inaccuracy due to local optima between decomposed stages. This paper proposes a geometric maximum overlapping registration framework via rotation-only BnB search. The rigid transformation is decomposed using Chasles’ theorem into a translation along rotation axis and a 2D rigid transformation. The optimal rotation axis and angle are searched via BnB, with residual parameters formulated as range maximum query (RMQ) problems. Firstly, the top-k candidate rotation axes are searched within a hemisphere parameterized by cube mapping, and the translation along each axis is estimated through interval stabbing of the correspondences projected onto that axis. Secondly, the 2D registration is relaxed to 1D rotation angle search with 2D RMQ of geometric overlapping for axis-aligned rectangles, which is solved deterministically in polynomial time using sweep line algorithm with segment tree. Experimental results on 3DMatch, 3DLoMatch, and KITTI datasets demonstrate superior accuracy and efficiency over SOTA methods, while the time complexity is polynomial and the space complexity increases linearly with the number of points, even in the worst case.
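The interval stabbing step can be sketched as a one-dimensional sweep. The sketch assumes the following reading of the abstract: a rotation about a fixed axis does not change a point's coordinate along that axis, so for each correspondence (p, q) the admissible axial translation lies in an interval of half-width `eps` around (q - p)·axis, and the best translation is a value contained in the most intervals. The function names and this exact inlier model are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple


def max_stabbing(intervals: List[Tuple[float, float]]) -> Tuple[Optional[float], int]:
    """Return (point, count): a point lying in the largest number of closed intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))  # 0 = open; sorts before a close at the same coordinate
        events.append((hi, 1))  # 1 = close
    events.sort()
    best_point, best_count, active = None, 0, 0
    for x, kind in events:
        if kind == 0:
            active += 1
            if active > best_count:
                best_point, best_count = x, active
        else:
            active -= 1
    return best_point, best_count


def axis_translation(src: List[Sequence[float]], dst: List[Sequence[float]],
                     axis: Sequence[float], eps: float) -> Tuple[Optional[float], int]:
    """Estimate the translation along a unit `axis` that fits the most correspondences."""
    intervals = []
    for p, q in zip(src, dst):
        d = sum((qk - pk) * ak for pk, qk, ak in zip(p, q, axis))
        intervals.append((d - eps, d + eps))
    return max_stabbing(intervals)


if __name__ == "__main__":
    src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    dst = [(0.0, 0.0, 2.0), (0.0, 1.0, 2.05), (5.0, 5.0, 9.0)]  # last pair is an outlier
    print(axis_translation(src, dst, axis=(0.0, 0.0, 1.0), eps=0.1))
```

Sorting dominates the cost, so each candidate axis is scored in O(n log n) time and O(n) space, which is consistent with the linear space behavior claimed in the abstract.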

[142] FedKLPR: Personalized Federated Learning for Person Re-Identification with Adaptive Pruning

Po-Hsien Yu,Yu-Syuan Tseng,Shao-Yi Chien

Main category: cs.CV

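TL;DR: FedKLPR is a lightweight federated learning framework for person re-identification. It uses KL-divergence regularization, weighted aggregation, sparse activation skipping, and a cross-round recovery mechanism to address the statistical heterogeneity caused by non-IID data and the communication overhead of federated training.

Details Motivation: Person re-identification under federated learning faces statistical heterogeneity and high communication overhead. FedKLPR aims to improve model performance through communication-efficient, personalized learning while preserving privacy.

Contribution: 1. Proposes the KL-Divergence Regularization Loss (KLL) to reduce the impact of non-IID data; 2. Designs KL-Divergence-Prune Weighted Aggregation (KLPWA) to lower communication overhead; 3. Introduces Sparse Activation Skipping (SAS) and Cross-Round Recovery (CRR) to improve model aggregation and compression.

Method: 1. KLL constrains the divergence between local models and the global feature distribution; 2. KLPWA weights the aggregation by KL divergence and pruning ratio; 3. SAS excludes zero-valued weights from the update; 4. CRR dynamically controls pruning depth.

Result: On eight benchmark datasets, FedKLPR substantially reduces communication cost (by 33%-38% on ResNet-50 and by 20%-40% on ResNet-34) while keeping accuracy degradation within 1%.

Insight: KL-divergence regularization and dynamic pruning can balance data heterogeneity against communication efficiency in federated learning, and may be useful for other privacy-sensitive tasks.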

Abstract: Person re-identification (Re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) offers a privacy-preserving solution by enabling collaborative model training without centralized data collection. However, applying FL to real-world Re-ID systems faces two major challenges: statistical heterogeneity across clients due to non-IID data distributions, and substantial communication overhead caused by frequent transmission of large-scale models. To address these issues, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-identification. FedKLPR introduces four key components. First, the KL-Divergence Regularization Loss (KLL) constrains local models by minimizing the divergence from the global feature distribution, effectively mitigating the effects of statistical heterogeneity and improving convergence stability under non-IID conditions. Second, KL-Divergence-Prune Weighted Aggregation (KLPWA) integrates pruning ratio and distributional similarity into the aggregation process, thereby improving the robustness of the global model while significantly reducing communication overhead. Third, Sparse Activation Skipping (SAS) mitigates the dilution of critical parameters during the aggregation of pruned client models by excluding zero-valued weights from the update process. Finally, Cross-Round Recovery (CRR) introduces a dynamic pruning control mechanism that halts pruning when necessary, enabling deeper compression while maintaining model accuracy. Experimental results on eight benchmark datasets demonstrate that FedKLPR achieves significant communication reduction. Compared with the state of the art, FedKLPR reduces communication cost by 33%-38% on ResNet-50 and by 20%-40% on ResNet-34, while maintaining model accuracy within 1% degradation.
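Two of the ideas in this abstract can be illustrated with a small sketch: a KL divergence between discrete distributions (the quantity KLL and KLPWA build on) and an aggregation that excludes zero-valued weights so that pruned entries do not dilute the average (the idea behind SAS). The function names, the flat-list parameter layout, and the per-client coefficients are hypothetical; the abstract does not give the actual loss or aggregation formulas.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], tiny: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; `tiny` guards against log(0)."""
    return sum(pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q) if pi > 0)


def aggregate_skip_zeros(client_weights: List[Sequence[float]],
                         client_coeffs: Sequence[float]) -> List[float]:
    """Per-parameter weighted average that ignores clients whose value is exactly zero.

    client_weights[c][k] is parameter k of client c (zero means pruned);
    client_coeffs[c] is an assumed aggregation coefficient for client c.
    """
    n_params = len(client_weights[0])
    merged = []
    for k in range(n_params):
        num = den = 0.0
        for weights, coeff in zip(client_weights, client_coeffs):
            if weights[k] != 0.0:  # skip pruned entries instead of averaging them in
                num += coeff * weights[k]
                den += coeff
        merged.append(num / den if den > 0 else 0.0)
    return merged


if __name__ == "__main__":
    clients = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
    print(aggregate_skip_zeros(clients, [1.0, 1.0]))
    print(kl_divergence([1.0, 0.0], [0.5, 0.5]))
```

In the example, the third parameter keeps the value 2.0 held by the only client that retained it, whereas a plain average over both clients would halve it to 1.0.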

[143] TinySR: Pruning Diffusion for Real-World Image Super-Resolution

Linwei Dong,Qingnan Fan,Yuhang Yu,Qi Zhang,Jinwei Chen,Yawei Luo,Changqing Zou

Main category: cs.CV

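TL;DR: TinySR is a lightweight diffusion model that uses Dynamic Inter-block Activation and an Expansion-Corrosion Strategy for effective pruning, substantially reducing computational cost and model size while preserving super-resolution quality.

Details Motivation: Existing diffusion models are computationally expensive for real-world image super-resolution. One-step distillation methods offer faster inference but remain constrained by large model architectures. TinySR aims to solve this problem.

Contribution: 1. Dynamic Inter-block Activation and an Expansion-Corrosion Strategy for effective pruning; 2. VAE compression through channel pruning, attention removal, and lightweight SepConv; 3. Removal of time- and prompt-related modules, plus pre-caching techniques to accelerate the model.

Method: 1. Dynamic Inter-block Activation and the Expansion-Corrosion Strategy guide pruning decisions; 2. Channel pruning, attention removal, and lightweight SepConv compress the model; 3. Redundant modules are removed and pre-caching is applied.

Result: TinySR maintains high quality while achieving up to a 5.68x speedup and an 83% parameter reduction relative to its teacher model TSD-SR.

Insight: A lightweight diffusion model design can balance computational efficiency and generation quality, which is promising for real-time applications.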

Abstract: Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs that suffer from complex degradations like noise, blur, and compression. Recently, diffusion models (DMs) have shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhead, posing challenges for real-time applications. Although one-step distillation methods, such as OSEDiff and TSD-SR, offer faster inference, they remain fundamentally constrained by their large, over-parameterized model architectures. In this work, we present TinySR, a compact yet effective diffusion model specifically designed for Real-ISR that achieves real-time performance while maintaining perceptual quality. We introduce a Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to facilitate more effective decision-making in depth pruning. We achieve VAE compression through channel pruning, attention removal and lightweight SepConv. We eliminate time- and prompt-related modules and perform pre-caching techniques to further speed up the model. TinySR significantly reduces computational cost and model size, achieving up to 5.68x speedup and 83% parameter reduction compared to its teacher TSD-SR, while still providing high quality results.

[144] An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Zihan Liang,Jiahao Sun,Haoran Ma

Main category: cs.CV

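TL;DR: The paper proposes RefineEdit-Agent, a new intelligent agent framework that combines the capabilities of LLMs and LVLMs to perform fine-grained, iterative image editing, addressing the weaknesses of existing methods in instruction understanding, context preservation, and feedback mechanisms.

Details Motivation: Existing text-to-image generation models perform poorly on fine-grained, iterative image editing, mainly because of challenges in instruction understanding, context preservation, and feedback mechanisms.

Contribution: Proposes the RefineEdit-Agent framework, which combines LLM and LVLM capabilities for fine-grained, iterative image editing, and provides a new benchmark dataset, LongBench-T2I-Edit.

Method: The framework comprises an LVLM-driven instruction parsing and scene understanding module, an LLM-driven editing planner, an iterative image editing module, and an LVLM-driven feedback and evaluation loop.

Result: On the LongBench-T2I-Edit benchmark, RefineEdit-Agent significantly outperforms existing baselines, reaching an average score of 3.67.

Insight: Combining the strengths of LLMs and LVLMs can substantially improve image editing, with context preservation and feedback mechanisms being especially important for fine-grained, iterative editing.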

Abstract: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.

[145] Disentangled Geometry and Appearance for Efficient Multi-View Surface Reconstruction and Rendering

Qitong Zhang,Jieqing Feng

Main category: cs.CV

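TL;DR: This paper proposes an efficient multi-view surface reconstruction and rendering method that disentangles geometry from appearance. It avoids the extra mesh extraction step required by traditional methods while significantly improving reconstruction quality and versatility.

Details Motivation: Traditional neural rendering-based multi-view surface reconstruction methods require an additional mesh extraction step, which is inconvenient and yields poor-quality meshes. This paper combines an explicit mesh representation with a differentiable rasterization framework to achieve efficient, high-quality reconstruction.

Contribution: 1. Proposes a disentangled geometry and appearance model that does not rely on deep networks, improving learning and applicability; 2. Introduces a neural deformation field to incorporate global geometric context; 3. Proposes a new regularization that constrains geometric features and improves shading accuracy; 4. Separates a view-invariant diffuse term to improve rendering efficiency.

Method: 1. Uses an explicit mesh representation and a differentiable rasterization framework; 2. Combines a neural deformation field with regularization to improve geometry learning; 3. Separates the view-invariant diffuse term and bakes it into the mesh vertices.

Result: Experiments show that the method achieves state-of-the-art training (4.84 minutes) and rendering (0.023 seconds) speeds, with reconstruction quality comparable to top-performing methods, and supports applications such as mesh and texture editing.

Insight: Disentangling geometry and appearance improves efficiency and applicability. The neural deformation field and regularization improve geometry learning and shading accuracy, and separating the view-invariant diffuse term further improves rendering efficiency.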

Abstract: This paper addresses the limitations of neural rendering-based multi-view surface reconstruction methods, which require an additional mesh extraction step that is inconvenient and would produce poor-quality surfaces with mesh aliasing, restricting downstream applications. Building on the explicit mesh representation and differentiable rasterization framework, this work proposes an efficient solution that preserves the high efficiency of this framework while significantly improving reconstruction quality and versatility. Specifically, we introduce a disentangled geometry and appearance model that does not rely on deep networks, enhancing learning and broadening applicability. A neural deformation field is constructed to incorporate global geometric context, enhancing geometry learning, while a novel regularization constrains geometric features passed to a neural shader to ensure its accuracy and boost shading. For appearance, a view-invariant diffuse term is separated and baked into mesh vertices, further improving rendering efficiency. Experimental results demonstrate that the proposed method achieves state-of-the-art training (4.84 minutes) and rendering (0.023 seconds) speeds, with reconstruction quality that is competitive with top-performing methods. Moreover, the method enables practical applications such as mesh and texture editing, showcasing its versatility and application potential. This combination of efficiency, competitive quality, and broad applicability makes our approach a valuable contribution to multi-view surface reconstruction and rendering.

[146] Multi-Level LVLM Guidance for Untrimmed Video Action Recognition

Liyang Peng,Sihan Zhu,Yunjie Guo

Main category: cs.CV

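TL;DR: The paper proposes ECVT, a model that incorporates multi-granularity semantic descriptions from an LVLM to address action recognition and localization in untrimmed videos, substantially improving performance.

Details Motivation: Action recognition and localization in untrimmed videos must capture fine-grained actions, long-term temporal dependencies, and high-level semantic information, and existing methods fall short.

Contribution: Proposes the ECVT architecture, which exploits the semantic understanding of LVLMs in a dual-branch design and strengthens the video encoder's learning with multi-granularity textual cues (Global Event Prompting and Temporal Sub-event Prompting).

Method: ECVT uses a dual-branch design consisting of a Video Encoding Branch (spatio-temporal feature extraction) and a Cross-Modal Guidance Branch (semantic description generation). Multimodal information is integrated through adaptive gating, cross-modal attention, and an event graph module.

Result: ECVT achieves state-of-the-art performance on ActivityNet v1.3 and THUMOS14, with an average mAP of 40.5% on ActivityNet v1.3 and an mAP@0.5 of 67.1% on THUMOS14.

Insight: Multimodal fusion and semantic prompting are important for understanding the temporal structure and event logic of complex videos.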

Abstract: Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual cues are integrated into the video encoder’s learning process through sophisticated mechanisms such as adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a comprehensive loss function incorporating semantic consistency and temporal calibration terms, ECVT significantly enhances the model’s ability to understand video temporal structures and event logic. Extensive experiments on ActivityNet v1.3 and THUMOS14 datasets demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.

[147] A Synthetic Dataset for Manometry Recognition in Robotic Applications

Pedro Antonio Rabelo Saraiva,Enzo Ferreira de Souza,Joao Manoel Herrera Pinheiro,Thiago H. Segreto,Ricardo V. Godoy,Marcelo Becker

Main category: cs.CV

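TL;DR: This paper proposes a hybrid data synthesis method that combines procedural rendering with AI-driven video generation to address data scarcity and high acquisition costs in complex industrial settings, improving the performance of object detection models.

Details Motivation: In complex industrial environments such as offshore oil platforms, collecting real data is costly and hazardous, which hampers the development of autonomous inspection systems. The paper proposes synthetic data as a solution.

Contribution: Proposes a hybrid data synthesis pipeline that combines BlenderProc with the NVIDIA Cosmos-Predict2 model to generate high-quality, diverse synthetic data, and validates experimentally that synthetic data improves detection performance.

Method: BlenderProc generates photorealistic images with precise annotations, and NVIDIA's Cosmos-Predict2 model synthesizes physically plausible video sequences. A YOLO-based detection network is trained on a mixture of real and synthetic data.

Result: Experiments show that a model trained on a 1:1 mixture of real and synthetic data outperforms the baseline trained only on real data.

Insight: Synthetic data can serve as an efficient, cost-effective, and safe way to develop reliable perception systems for safety-critical and resource-constrained industrial applications.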

Abstract: This work addresses the challenges of data scarcity and high acquisition costs for training robust object detection models in complex industrial environments, such as offshore oil platforms. The practical and economic barriers to collecting real-world data in these hazardous settings often hamper the development of autonomous inspection systems. To overcome this, in this work we propose and validate a hybrid data synthesis pipeline that combines procedural rendering with AI-driven video generation. Our methodology leverages BlenderProc to create photorealistic images with precise annotations and controlled domain randomization, and integrates NVIDIA’s Cosmos-Predict2 world-foundation model to synthesize physically plausible video sequences with temporal diversity, capturing rare viewpoints and adverse conditions. We demonstrate that a YOLO-based detection network trained on a composite dataset, blending real images with our synthetic data, achieves superior performance compared to models trained exclusively on real-world data. Notably, a 1:1 mixture of real and synthetic data yielded the highest accuracy, surpassing the real-only baseline. These findings highlight the viability of a synthetic-first approach as an efficient, cost-effective, and safe alternative for developing reliable perception systems in safety-critical and resource-constrained industrial applications.
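The 1:1 mixing of real and synthetic data can be sketched as a small sampling helper. The function name `mix_datasets`, the `synth_per_real` parameter, and the use of file paths as dataset items are assumptions for illustration; the abstract does not describe how the composite dataset is assembled.

```python
import random
from typing import List, Sequence


def mix_datasets(real: Sequence[str], synthetic: Sequence[str],
                 synth_per_real: float = 1.0, seed: int = 0) -> List[str]:
    """Build a training list with `synth_per_real` synthetic items per real item.

    synth_per_real=1.0 gives the 1:1 mixture; the synthetic subset is sampled
    without replacement and capped at the number of synthetic items available.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    n_synth = min(len(synthetic), round(len(real) * synth_per_real))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real = [f"real_{i}.png" for i in range(4)]
    synthetic = [f"synth_{i}.png" for i in range(10)]
    print(mix_datasets(real, synthetic))
```

Sweeping `synth_per_real` over a few values (for example 0, 0.5, 1, 2) and retraining the detector at each setting is one way to reproduce the kind of ratio comparison the abstract reports.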

[148] T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

Kaiyue Sun,Rongyao Fang,Chengqi Duan,Xian Liu,Xihui Liu

Main category: cs.CV

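TL;DR: The paper proposes T2I-ReasonBench, a benchmark for evaluating the reasoning capabilities of text-to-image (T2I) models across four dimensions, together with a two-stage evaluation protocol.

Details Motivation: The reasoning capabilities of existing T2I models have not been evaluated systematically, so a dedicated benchmark is needed to fill this gap.

Contribution: Introduces the T2I-ReasonBench benchmark and a two-stage evaluation protocol for systematically evaluating the reasoning capabilities of T2I models.

Method: Defines four reasoning dimensions (Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning) and uses a two-stage evaluation protocol that assesses reasoning accuracy and image quality.

Result: Benchmarks a range of T2I generation models and analyzes their performance on reasoning and image quality.

Insight: T2I models differ markedly on reasoning tasks; some may produce high-quality images yet reason poorly.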

Abstract: We propose T2I-ReasonBench, a benchmark evaluating reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess the reasoning accuracy and image quality. We benchmark various T2I generation models, and provide comprehensive analysis on their performances.

[149] GraphMMP: A Graph Neural Network Model with Mutual Information and Global Fusion for Multimodal Medical Prognosis

Xuhao Shan,Ruiquan Ge,Jikui Liu,Linglong Wu,Chi Zhang,Siqi Liu,Wenjian Qin,Wenwen Min,Ahmed Elazab,Changmiao Wang

Main category: cs.CV

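TL;DR: GraphMMP is a two-stage multimodal medical prognosis model based on graph neural networks. It uses mutual information and a global fusion module to improve performance, and outperforms existing methods on liver prognosis and METABRIC datasets.

Details Motivation: The key challenges in multimodal medical data analysis are effectively modeling the complex interactions between heterogeneous data modalities and capturing both local and global dependencies across modalities. This work aims to address them.

Contribution: Proposes GraphMMP, a two-stage model based on graph neural networks that constructs feature graphs using mutual information and boosts prognosis performance with a global fusion module built on Mamba.

Method: The two-stage model constructs feature graphs using mutual information and applies a global fusion module built on Mamba for multimodal fusion, using graph neural networks to model cross-modal dependencies.

Result: On datasets for liver prognosis and the METABRIC study, GraphMMP surpasses existing methods, demonstrating its effectiveness in multimodal medical prognosis tasks.

Insight: Capturing hidden relationships between modalities with mutual information, combined with a global fusion module, models cross-modal dependencies better and offers an effective framework for multimodal medical prognosis.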

Abstract: In the field of multimodal medical data analysis, leveraging diverse types of data and understanding their hidden relationships continues to be a research focus. The main challenges lie in effectively modeling the complex interactions between heterogeneous data modalities with distinct characteristics while capturing both local and global dependencies across modalities. To address these challenges, this paper presents a two-stage multimodal prognosis model, GraphMMP, which is based on graph neural networks. The proposed model constructs feature graphs using mutual information and features a global fusion module built on Mamba, which significantly boosts prognosis performance. Empirical results show that GraphMMP surpasses existing methods on datasets related to liver prognosis and the METABRIC study, demonstrating its effectiveness in multimodal medical prognosis tasks.
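As a rough illustration of building a feature graph from mutual information, the sketch below links pairs of discretized features whose mutual information reaches a threshold. The feature names, the discretization, and the thresholding rule are all assumptions made for the example; the abstract does not state how GraphMMP estimates mutual information or selects edges.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def build_feature_graph(features, threshold):
    """Connect feature pairs whose mutual information reaches the threshold."""
    names = sorted(features)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            mi = mutual_information(features[a], features[b])
            if mi >= threshold:
                edges.append((a, b, mi))
    return edges

# Hypothetical, already discretized features for four patients.
features = {
    "lesion_size_bin": [0, 0, 1, 1],
    "marker_level_bin": [0, 0, 1, 1],
    "age_bin": [0, 1, 0, 1],
}
edges = build_feature_graph(features, threshold=0.1)
```

In this toy data only the two perfectly dependent features are linked; the independent one stays isolated.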

[150] Optimizing Multi-Modal Trackers via Sensitivity-aware Regularized Tuning

Zhiwen Chen,Jinjian Wu,Zhiyu Zhu,Yifan Zhang,Guangming Shi,Junhui Hou

Main category: cs.CV

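TL;DR: The paper proposes a sensitivity-aware regularized tuning framework that optimizes multi-modal trackers by analyzing the parameter sensitivities of pre-trained models.

Details Motivation: Existing fine-tuning paradigms oscillate between excessive freedom and over-restriction, both leading to a suboptimal plasticity-stability trade-off.

Contribution: Proposes a sensitivity-aware regularized tuning framework that incorporates parameter sensitivities to improve transferability across modalities.

Method: Analyzes the tangent space of pre-trained weights to measure prior sensitivities, explores transfer sensitivities during the tuning phase, and incorporates both as regularization terms, emphasizing adaptability and stability.

Result: Experiments show that the method surpasses current state-of-the-art techniques in multi-modal tracking.

Insight: Parameter sensitivity analysis is key to cross-modal transfer, and regularization based on it can improve the plasticity-stability trade-off.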

Abstract: This paper tackles the critical challenge of optimizing multi-modal trackers by effectively adapting the pre-trained models for RGB data. Existing fine-tuning paradigms oscillate between excessive freedom and over-restriction, both leading to a suboptimal plasticity-stability trade-off. To mitigate this dilemma, we propose a novel sensitivity-aware regularized tuning framework, which delicately refines the learning process by incorporating intrinsic parameter sensitivities. Through a comprehensive investigation from pre-trained to multi-modal contexts, we identify that parameters sensitive to pivotal foundational patterns and cross-domain shifts are primary drivers of this issue. Specifically, we first analyze the tangent space of pre-trained weights to measure and orient prior sensitivities, dedicated to preserving generalization. Then, we further explore transfer sensitivities during the tuning phase, emphasizing adaptability and stability. By incorporating these sensitivities as regularization terms, our method significantly enhances the transferability across modalities. Extensive experiments showcase the superior performance of the proposed method, surpassing current state-of-the-art techniques across various multi-modal tracking. The source code and models will be publicly available at https://github.com/zhiwen-xdu/SRTrack.

[151] Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Hugo Bohy,Minh Tran,Kevin El Haddad,Thierry Dutoit,Mohammad Soleymani

Main category: cs.CV

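TL;DR: Social-MAE is a Transformer-based multimodal autoencoder for face and voice data. Self-supervised pre-training on social interaction data improves downstream tasks such as multimodal emotion recognition.

Details Motivation: Human social behaviors are inherently multimodal, which calls for powerful audiovisual models. Existing methods are not well targeted at the social domain, and the authors aim to improve performance through self-supervised pre-training.

Contribution: Proposes Social-MAE, which extends CAV-MAE to accept a larger number of input frames and is pre-trained in a self-supervised manner on VoxCeleb2.

Method: Modifies the CAV-MAE framework to accept more input frames, pre-trains it on VoxCeleb2 in a self-supervised manner, and fine-tunes it on downstream tasks such as multimodal emotion recognition.

Result: Achieves state-of-the-art results on emotion recognition and laughter detection, and competitive results on apparent personality estimation.

Insight: In-domain self-supervised pre-training is effective for multimodal social tasks.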

Abstract: Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating the model on different social and affective downstream tasks, namely, emotion recognition, laughter detection and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.

[152] DinoTwins: Combining DINO and Barlow Twins for Robust, Label-Efficient Vision Transformers

Michael Podsiadly,Brendon K Lay

Main category: cs.CV

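TL;DR: The paper proposes DinoTwins, a self-supervised learning method that combines DINO's teacher-student learning with Barlow Twins' redundancy reduction, aiming to improve Vision Transformers with fewer labels and less compute.

Details Motivation: Training AI models to understand images without costly labeled data remains a challenge. DINO and Barlow Twins each have limitations when used alone: DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large for consumer hardware. By combining the two, the paper aims to leverage their complementary strengths and improve robustness and label efficiency.

Contribution: 1. Proposes DinoTwins, which combines DINO and Barlow Twins; 2. Trains a hybrid model on MS COCO and uses only 10% of the labeled data for linear probing, demonstrating label efficiency; 3. Shows attention visualizations that suggest improved semantic segmentation capability alongside strong feature representations.

Method: 1. Combines DINO's self-distillation strategy with Barlow Twins' redundancy-reduction objective; 2. Trains a ViT and evaluates it with linear probing; 3. Analyzes the model through attention visualizations.

Result: Preliminary results show that DinoTwins achieves loss and classification accuracy comparable to DINO while maintaining strong feature representations; attention visualizations suggest improved semantic segmentation capability.

Insight: Combining complementary self-supervised methods can improve generalization and label efficiency, offering a workable option for training ViTs in resource-constrained environments.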

Abstract: Training AI models to understand images without costly labeled data remains a challenge. We combine two techniques–DINO (teacher-student learning) and Barlow Twins (redundancy reduction)–to create a model that learns better with fewer labels and less compute. While both DINO and Barlow Twins have independently demonstrated strong performance in self-supervised learning, each comes with limitations–DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large to fit on consumer hardware. By combining the redundancy-reduction objective of Barlow Twins with the self-distillation strategy of DINO, we aim to leverage their complementary strengths. We train a hybrid model on the MS COCO dataset using only 10% of labeled data for linear probing, and evaluate its performance against standalone DINO and Barlow Twins implementations. Preliminary results show that the combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. This combined method offers a scalable, label-efficient alternative for training ViTs in resource-constrained environments.
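For readers unfamiliar with the redundancy-reduction objective, here is a minimal pure-Python sketch of the Barlow Twins term on its own: it pushes the cross-correlation matrix of two views' embeddings toward the identity. The weight `lam=0.005` is an illustrative value, and how DinoTwins weights or combines this term with the DINO self-distillation loss is not stated in the abstract.

```python
import math

def standardize(batch):
    """Zero-mean, unit-variance normalization of each embedding dimension over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0  # guard constant columns
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out

def redundancy_reduction_loss(z_a, z_b, lam=0.005):
    """Sum of (1 - C_ii)^2 on the diagonal plus lam * C_ij^2 off the diagonal."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss

# Two identical views with decorrelated dimensions give zero loss.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
# Two identical views whose dimensions are copies of each other are penalized.
redundant = [[1.0, 1.0], [-1.0, -1.0]]
loss_good = redundancy_reduction_loss(decorrelated, decorrelated)
loss_bad = redundancy_reduction_loss(redundant, redundant)
```

The nested loops are for readability only; a practical implementation would compute the correlation matrix with a tensor library.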

[153] OmniMRI: A Unified Vision–Language Foundation Model for Generalist MRI Interpretation

Xingxin He,Aurora Rofena,Ruimin Feng,Haozhe Liao,Zhaoye Zhou,Albert Jang,Fang Liu

Main category: cs.CV

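TL;DR: OmniMRI is a unified vision-language foundation model designed to handle the entire MRI workflow in a single architecture, including reconstruction, segmentation, detection, diagnosis, and report generation. It is trained on large-scale multimodal data and is designed to generalize across tasks.

Details Motivation: Clinical MRI workflows are fragmented and multi-stage, and existing deep learning approaches are mostly anatomy- or application-specific and generalize poorly. In addition, current pipelines rarely integrate imaging data with the language information that radiologists rely on in routine practice.

Contribution: Proposes OmniMRI, a unified foundation model covering multiple tasks in the MRI workflow. It is trained on a large-scale, multi-source corpus of imaging and language data, with multi-stage training that progressively builds visual representations, cross-modal reasoning, and instruction following.

Method: Four training stages: 1) self-supervised vision pretraining; 2) vision-language alignment; 3) multimodal pretraining; 4) multi-task instruction tuning. The training corpus combines image-only, paired vision-text, and instruction-response data, supporting tasks from reconstruction to report generation.

Result: Qualitative results show OmniMRI performing diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and report generation.

Insight: OmniMRI illustrates the potential of foundation models in medical imaging: unifying vision and language could enable end-to-end MRI interpretation and improve clinical efficiency.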

Abstract: Magnetic Resonance Imaging (MRI) is indispensable in clinical practice but remains constrained by fragmented, multi-stage workflows encompassing acquisition, reconstruction, segmentation, detection, diagnosis, and reporting. While deep learning has achieved progress in individual tasks, existing approaches are often anatomy- or application-specific and lack generalizability across diverse clinical settings. Moreover, current pipelines rarely integrate imaging data with complementary language information that radiologists rely on in routine practice. Here, we introduce OmniMRI, a unified vision-language foundation model designed to generalize across the entire MRI workflow. OmniMRI is trained on a large-scale, heterogeneous corpus curated from 60 public datasets, over 220,000 MRI volumes and 19 million MRI slices, incorporating image-only data, paired vision-text data, and instruction-response data. Its multi-stage training paradigm, comprising self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning, progressively equips the model with transferable visual representations, cross-modal reasoning, and robust instruction-following capabilities. Qualitative results demonstrate OmniMRI’s ability to perform diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. These findings highlight OmniMRI’s potential to consolidate fragmented pipelines into a scalable, generalist framework, paving the way toward foundation models that unify imaging and clinical language for comprehensive, end-to-end MRI interpretation.

[154] Minimal Solvers for Full DoF Motion Estimation from Asynchronous Tracks

Petr Hruby,Marc Pollefeys

Main category: cs.CV

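TL;DR: The paper proposes a polynomial approximation for estimating a camera's translational and angular velocity from asynchronous point tracks, and develops minimal solvers relevant to rolling shutter and event cameras.

Details Motivation: Estimating full-DoF motion from asynchronous point tracks, as arises with rolling shutter and event cameras, is a non-polynomial problem.

Contribution: 1. Proposes a polynomial approximation; 2. Classifies the resulting minimal problems and determines their algebraic degrees; 3. Develops minimal solvers for several low-degree problems.

Method: Approximates the original non-polynomial problem with a polynomial one, classifies the minimal problems, and develops the corresponding solvers.

Result: The solvers are evaluated on synthetic and real datasets; the code will be made publicly available.

Insight: Polynomial approximation and algebraic-degree classification offer a new approach to complex motion estimation problems.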

Abstract: We address the problem of estimating both translational and angular velocity of a camera from asynchronous point tracks, a formulation relevant to rolling shutter and event cameras. Since the original problem is non-polynomial, we propose a polynomial approximation, classify the resulting minimal problems, and determine their algebraic degrees. Furthermore, we develop minimal solvers for several problems with low degrees and evaluate them on synthetic and real datasets. The code will be made publicly available.

[155] Towards Optimal Convolutional Transfer Learning Architectures for Breast Lesion Classification and ACL Tear Detection

Daniel Frees,Moritz Bolling,Aditri Bhagirath

Main category: cs.CV

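TL;DR: Through experiments and statistical analysis, the paper investigates optimal CNN architectures for breast lesion malignancy classification and ACL tear detection and compares RadImageNet with ImageNet pre-training. The best configuration uses ResNet50 pre-trained backbones, skip connections, 1-dimensional convolutional classifiers, and partial backbone unfreezing; no evidence was found that RadImageNet pre-training is superior.

Details Motivation: Medical imaging data is scarce, which limits models trained from scratch. Transfer learning is a solution, but the optimal architecture and pre-training data source remain to be determined.

Contribution: 1. Identifies optimal CNN architectures for breast lesion malignancy detection and ACL tear detection; 2. Statistically compares the effect of RadImageNet and ImageNet pre-training.

Method: Experiments use ResNet50 pre-trained backbones, 1-dimensional convolutional classifiers, skip connections, and partial backbone unfreezing, and compare the effect of the pre-training data (RadImageNet vs. ImageNet) on downstream performance.

Result: The best models achieve AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, competitive with Mei et al. (2022) and surpassing other previous works. No evidence was found that RadImageNet pre-training gives superior downstream performance.

Insight: For these medical imaging tasks, architecture design and pre-training strategy appear to matter more than the pre-training data source (RadImageNet vs. ImageNet).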

Abstract: Modern computer vision models have proven to be highly useful for medical imaging classification and segmentation tasks, but the scarcity of medical imaging data often limits the efficacy of models trained from scratch. Transfer learning has emerged as a pivotal solution to this, enabling the fine-tuning of high-performance models on small data. Mei et al. (2022) found that pre-training CNNs on a large dataset of radiologist-labeled images (RadImageNet) enhanced model performance on downstream tasks compared to ImageNet pretraining. The present work extends Mei et al. (2022) by conducting a comprehensive investigation to determine optimal CNN architectures for breast lesion malignancy detection and ACL tear detection, as well as performing statistical analysis to compare the effect of RadImageNet and ImageNet pre-training on downstream model performance. Our findings suggest that 1-dimensional convolutional classifiers with skip connections, ResNet50 pre-trained backbones, and partial backbone unfreezing yield optimal downstream medical classification performance. Our best models achieve AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, competitive with the results reported by Mei et al. (2022) and surpassing other previous works. We do not find evidence confirming RadImageNet pre-training to provide superior downstream performance for ACL tear and breast lesion classification tasks.

[156] MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation

Liane Makatura,Benjamin Jones,Siyuan Bian,Wojciech Matusik

Main category: cs.CV

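TL;DR: The paper presents the MetaGen framework, comprising MetaDSL (a domain-specific language), MetaDB (a database), and MetaBench (a benchmark), for vision-language-model-assisted metamaterial generation. Case studies show the framework is a strong first step toward integrated design and understanding of metamaterials.

Details Motivation: Metamaterial design is difficult because of geometric complexity and a non-trivial mapping from architecture to behaviour. The paper aims to ease design and understanding with a language, a database, and benchmarks.

Contribution: (i) MetaDSL: a compact, semantically rich domain-specific language; (ii) MetaDB: a repository of more than 150,000 parameterized MetaDSL programs; (iii) MetaBench: benchmark suites that test core capabilities of vision-language metamaterial assistants.

Method: Combines MetaDSL, MetaDB, and MetaBench; establishes baselines by fine-tuning state-of-the-art vision-language models and deploys an omni-model within an interactive, CAD-like interface.

Result: Case studies show that the framework provides a strong first step toward integrated design and understanding of structure-representation-property relationships; the benchmark covers structure reconstruction, property-driven inverse design, and performance prediction.

Insight: Integrated design tools and databases can simplify the design workflow for complex materials and provide a standardized test bed for a model's multi-task capabilities.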

Abstract: Metamaterials are micro-architected structures whose geometry imparts highly tunable, often counter-intuitive, bulk properties. Yet their design is difficult because of geometric complexity and a non-trivial mapping from architecture to behaviour. We address these challenges with three complementary contributions. (i) MetaDSL: a compact, semantically rich domain-specific language that captures diverse metamaterial designs in a form that is both human-readable and machine-parsable. (ii) MetaDB: a curated repository of more than 150,000 parameterized MetaDSL programs together with their derivatives: three-dimensional geometry, multi-view renderings, and simulated elastic properties. (iii) MetaBench: benchmark suites that test three core capabilities of vision-language metamaterial assistants: structure reconstruction, property-driven inverse design, and performance prediction. We establish baselines by fine-tuning state-of-the-art vision-language models and deploy an omni-model within an interactive, CAD-like interface. Case studies show that our framework provides a strong first step toward integrated design and understanding of structure-representation-property relationships.

[157] HERO: Hierarchical Extrapolation and Refresh for Efficient World Models

Quanjian Song,Xinyu Wang,Donghao Zhou,Jingyu Lin,Cunjian Chen,Yue Ma,Xiu Li

Main category: cs.CV

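TL;DR: HERO is a training-free hierarchical acceleration framework for efficient world models. It speeds up inference with a patch-wise refresh mechanism in shallow layers and a linear extrapolation scheme in deeper layers.

Details Motivation: Generation-driven world models suffer slow inference because of the iterative nature of diffusion models, and directly applying existing diffusion acceleration techniques degrades quality. HERO aims to address both problems.

Contribution: Proposes a hierarchical inference acceleration strategy (patch-wise refresh in shallow layers, linear extrapolation in deeper layers) and identifies a feature coupling phenomenon in world models.

Method: In shallow layers, a patch-wise refresh mechanism selects tokens for recomputation; in deeper layers, a linear extrapolation scheme estimates intermediate features directly, bypassing the attention modules and feed-forward networks.

Result: Experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.

Insight: Shallow layers show high temporal variability while deeper layers yield more stable features, which motivates the layer-wise acceleration strategy.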

Abstract: Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
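The deeper-layer idea can be sketched as follows: rather than running a layer, estimate its output from the two most recent cached outputs by continuing their linear trend. The function name, the caching of exactly two earlier steps, and the equal step spacing are assumptions for illustration; the abstract does not specify which features HERO caches or how the extrapolation is parameterized.

```python
def extrapolate_features(prev, curr, step_ratio=1.0):
    """Linearly extrapolate a feature vector from two earlier cached values.

    step_ratio is the size of the next step relative to the previous one;
    1.0 assumes equally spaced denoising steps.
    """
    return [c + step_ratio * (c - p) for p, c in zip(prev, curr)]

# Cached deep-layer features from two earlier steps (toy values).
feat_t2 = [1.0, 2.0]
feat_t1 = [2.0, 2.5]
# Estimated features for the next step, obtained without running the layer.
feat_t0 = extrapolate_features(feat_t2, feat_t1)
```

The estimate is only as good as the assumption that deep features change smoothly between steps, which is the stability property the abstract attributes to deeper layers.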

[158] TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Vinh-Thuan Ly,Hoang M. Truong,Xuan-Huong Nguyen

Main category: cs.CV

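TL;DR: TinyGiantVLM is a lightweight vision-language architecture for spatial reasoning under resource constraints. It uses a two-stage framework with a Mixture-of-Experts (MoE) fusion module over multimodal inputs and performs strongly in industrial scenes.

Details Motivation: Existing vision-language models (VLMs) struggle to comprehend 3D layouts and spatial relationships in industrial settings, so a lightweight, efficient solution is needed.

Contribution: Proposes TinyGiantVLM, a lightweight two-stage framework with an MoE module that dynamically fuses multimodal features to improve spatial reasoning.

Method: Pretrained visual backbones encode RGB and depth inputs, an MoE module fuses the features, and a two-phase training strategy improves reasoning.

Result: The 64M-parameter base model placed 5th on Track 3 of the AI City Challenge 2025 with a score of 66.8861; an 80M-parameter variant with expanded MoE capacity performs better on spatial reasoning tasks.

Insight: Lightweight design and dynamic feature fusion are key to spatial reasoning in industrial scenes.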

Abstract: Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings. In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning, distinguishing itself from traditional geographic reasoning in complex logistics scenes. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training is conducted in a two-phase strategy: the first phase focuses on generating free-form answers to enhance spatial reasoning ability, while the second phase uses normalized answers for evaluation. Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We further present an 80M-parameter variant with expanded MoE capacity, which demonstrates improved performance on spatial reasoning tasks.
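A Mixture-of-Experts fusion step, in its most generic form, weights several expert outputs by a softmax over gate scores and sums them. The sketch below shows only that generic pattern with made-up numbers; the gating network, expert design, and any sparsity used in TinyGiantVLM are not described in the abstract and are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(expert_outputs, gate_scores):
    """Combine expert output vectors using softmax-normalized gate weights."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, expert_outputs)) for i in range(dim)]

# Two hypothetical experts (for example, one per modality) with equal gate scores.
rgb_expert = [1.0, 0.0]
depth_expert = [0.0, 1.0]
fused = moe_fuse([rgb_expert, depth_expert], gate_scores=[0.0, 0.0])
```

With equal gate scores the result is a plain average; in a trained model the gate scores would depend on the input, so the mix changes per sample.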

[159] HotSpotter - Patterned Species Instance Recognition

Jonathan P. Crall,Charles V. Stewart,Tanya Y. Berger-Wolf,Daniel I. Rubenstein,Siva R. Sundaresan

Main category: cs.CV

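TL;DR: HotSpotter is a fast, accurate algorithm for identifying individual animals against a labeled database. It describes two approaches: one scores each database image sequentially against the query, and the other uses a fast nearest neighbor search with a competitive scoring mechanism.

Details Motivation: Existing methods for identifying individual animals are usually species-specific, lack generality, and are relatively slow. HotSpotter aims to provide a species-independent, fast, and accurate solution.

Contribution: 1. Proposes a general algorithm for identifying individual animals. 2. Develops two keypoint-matching approaches that improve identification speed and accuracy.

Method: 1. The first approach matches the query image sequentially against each database image, scoring each in isolation. 2. The second uses a fast nearest neighbor search and a competitive scoring mechanism derived from the Local Naive Bayes Nearest Neighbor algorithm.

Result: On databases of more than 1000 images, HotSpotter produces more accurate matches than published methods and matches each query image in just a few seconds.

Insight: Keypoint matching combined with competitive scoring can improve the accuracy and efficiency of individual identification across species.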

Abstract: We present HotSpotter, a fast, accurate algorithm for identifying individual animals against a labeled database. It is not species specific and has been applied to Grevy’s and plains zebras, giraffes, leopards, and lionfish. We describe two approaches, both based on extracting and matching keypoints or “hotspots”. The first tests each new query image sequentially against each database image, generating a score for each database image in isolation, and ranking the results. The second, building on recent techniques for instance recognition, matches the query image against the database using a fast nearest neighbor search. It uses a competitive scoring mechanism derived from the Local Naive Bayes Nearest Neighbor algorithm recently proposed for category recognition. We demonstrate results on databases of more than 1000 images, producing more accurate matches than published methods and matching each query image in just a few seconds.

[160] A Weighted Vision Transformer-Based Multi-Task Learning Framework for Predicting ADAS-Cog Scores

Nur Amirah Abd Hamid,Mohd Ibrahim Shapiai,Daphne Teck Ching Lai

Main category: cs.CV

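TL;DR: The paper proposes a weighted Vision Transformer (ViT)-based multi-task learning framework that predicts the ADAS-Cog global score and its 13 sub-scores from MRI scans, aiming to improve the accuracy and interpretability of AD prognosis models.

Details Motivation: Existing AD prognosis models usually focus only on the ADAS-Cog global score and overlook the predictive value of its 13 sub-scores. These sub-scores reflect distinct cognitive domains, and some may contribute more to the global score.

Contribution: Proposes a weighted ViT-based multi-task learning framework that jointly predicts the global score and sub-scores from MRI data, and systematically studies how loss weighting affects model performance.

Method: Uses ViT as the feature extractor within multi-task learning and assigns different loss weights to different sub-scores so that the model focuses on the more relevant cognitive domains.

Result: The best weighting strategy is group-dependent: strong weighting works better for MCI subjects with more heterogeneous MRI patterns, while moderate weighting is more effective for CN subjects with lower variability.

Insight: Uniform weighting underutilizes key sub-scores and limits generalization; flexible weighting can improve performance and interpretability.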

Abstract: Prognostic modeling is essential for forecasting future clinical scores and enabling early detection of Alzheimer's disease (AD). While most existing methods focus on predicting the ADAS-Cog global score, they often overlook the predictive value of its 13 sub-scores, which reflect distinct cognitive domains. Some sub-scores may exert greater influence on determining global scores. Assigning higher loss weights to these clinically meaningful sub-scores can guide the model to focus on more relevant cognitive domains, enhancing both predictive accuracy and interpretability. In this study, we propose a weighted Vision Transformer (ViT)-based multi-task learning (MTL) framework to jointly predict the ADAS-Cog global score and its 13 sub-scores at Month 24 using baseline MRI scans. Our framework integrates ViT as a feature extractor and systematically investigates the impact of sub-score-specific loss weighting on model performance. Results show that our proposed weighting strategies are group-dependent: strong weighting improves performance for MCI subjects with more heterogeneous MRI patterns, while moderate weighting is more effective for CN subjects with lower variability. Our findings suggest that uniform weighting underutilizes key sub-scores and limits generalization. The proposed framework offers a flexible, interpretable approach to AD prognosis using end-to-end MRI-based learning. (Github repo link will be provided after review)
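Sub-score-specific loss weighting can be sketched as a weighted sum of squared errors over the global score and each sub-score. The squared-error form, the function name, and the example weights are assumptions; the abstract does not give the exact loss or the weight values the authors used.

```python
def weighted_mtl_loss(pred_global, true_global, pred_subs, true_subs,
                      sub_weights, global_weight=1.0):
    """Weighted squared-error loss over one subject's global score and sub-scores."""
    loss = global_weight * (pred_global - true_global) ** 2
    for p, t, w in zip(pred_subs, true_subs, sub_weights):
        loss += w * (p - t) ** 2
    return loss

# Toy example with two sub-scores; the second is weighted twice as heavily.
loss = weighted_mtl_loss(
    pred_global=10.0, true_global=12.0,
    pred_subs=[1.0, 2.0], true_subs=[1.0, 4.0],
    sub_weights=[1.0, 2.0],
)
```

Raising a sub-score's weight makes its error count for more in the gradient, which is the mechanism by which weighting steers the model toward selected cognitive domains.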

[161] JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on

Aowen Wang,Wei Li,Hao Luo,Mengxing Ao,Chenyu Zhu,Xinyang Li,Fan Wang

Main category: cs.CV

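TL;DR: Proposes JCo-MVTON, a mask-free virtual try-on framework built on a diffusion model. It addresses the limitations of earlier methods through multi-modal conditional fusion and a self-supervised data construction strategy.

Details Motivation: Traditional virtual try-on systems rely heavily on human body masks, offer limited control over garment attributes, and generalize poorly to real-world scenarios. JCo-MVTON aims to address these problems.

Contribution: 1. A mask-free virtual try-on framework; 2. A Multi-Modal Diffusion Transformer backbone; 3. A bidirectional generation strategy for dataset construction.

Method: Built on a Multi-Modal Diffusion Transformer (MM-DiT), the approach fuses multi-modal conditions into the image generation process and constructs its dataset with a bidirectional generation strategy.

Result: Achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods, and generalizes well to real-world applications.

Insight: Mask-free design, multi-modal fusion, and self-supervised data construction are key to better virtual try-on.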

Abstract: Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals – such as the reference person image and the target garment image – into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and improved garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric “Try-Off” model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments demonstrate that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. Moreover, it shows strong generalization in real-world applications, surpassing commercial systems.

[162] Improving Interpretability in Alzheimer’s Prediction via Joint Learning of ADAS-Cog Scores

Nur Amirah Abd Hamid,Mohd Shahrizal Rusli,Muhammad Thaqif Iman Mohd Taufek,Mohd Ibrahim Shapiai,Daphne Teck Ching Lai

Main category: cs.CV

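TL;DR: The paper proposes a multi-task learning framework that jointly predicts the ADAS-Cog global score and its 13 sub-scores, combining baseline MRI with longitudinal clinical data to improve Alzheimer's disease prediction.

Details Motivation: Existing methods focus mainly on predicting the ADAS-Cog global score and overlook its sub-scores, which capture domain-specific cognitive decline. The study examines how sub-scores, particularly those associated with MRI features, contribute to the prediction of the global score.

Contribution: 1. Proposes a multi-task learning framework that jointly predicts the global score and sub-scores; 2. Analyzes the contribution of sub-scores to global score prediction; 3. Identifies an imbalance between clinical and MRI features in the model.

Method: Vision Transformer and Swin Transformer extract MRI features, which are fused with longitudinal clinical data to model cognitive progression. Multi-task learning jointly optimizes the global score and sub-scores.

Result: Incorporating sub-score learning improves global score prediction. A small subset of sub-scores, especially Q1, Q4, and Q8, dominates the predicted global score, yet some of these influential sub-scores show high prediction errors, which further analysis attributes to clinical feature dominance.

Insight: The study highlights the need for improved multimodal fusion and adaptive loss weighting, pointing toward more robust AD prediction models.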

Abstract: Accurate prediction of clinical scores is critical for early detection and prognosis of Alzheimer's disease (AD). While existing approaches primarily focus on forecasting the ADAS-Cog global score, they often overlook the predictive value of its sub-scores (13 items), which capture domain-specific cognitive decline. In this study, we propose a multi-task learning (MTL) framework that jointly predicts the global ADAS-Cog score and its sub-scores (13 items) at Month 24 using baseline MRI and longitudinal clinical scores from baseline and Month 6. The main goal is to examine how each sub-score, particularly those associated with MRI features, contributes to the prediction of the global score, an aspect largely neglected in prior MTL studies. We employ Vision Transformer (ViT) and Swin Transformer architectures to extract imaging features, which are fused with longitudinal clinical inputs to model cognitive progression. Our results show that incorporating sub-score learning improves global score prediction. Sub-score-level analysis reveals that a small subset, especially Q1 (Word Recall), Q4 (Delayed Recall), and Q8 (Word Recognition), consistently dominates the predicted global score. However, some of these influential sub-scores exhibit high prediction errors, pointing to model instability. Further analysis suggests that this is caused by clinical feature dominance, where the model prioritizes easily predictable clinical scores over more complex MRI-derived features. These findings emphasize the need for improved multimodal fusion and adaptive loss weighting to achieve more balanced learning. Our study demonstrates the value of sub-score-informed modeling and provides insights into building more interpretable and clinically robust AD prediction frameworks. (Github repo provided)

[163] Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes

Ryan Faulkner,Ian Reid,Simon Ratcliffe,Tat-Jun Chin

Main category: cs.CV

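TL;DR: The paper proposes a reconstruction-based approach, combined with the Mamba architecture, for open-set segmentation and anomaly detection in large-scale point cloud scenes.

Details Motivation: In outdoor LiDAR scans, outlier objects from outside the training data will inevitably appear, and they need to be detected and segmented.

Contribution: 1. A reconstruction-based open-set segmentation approach; 2. Use of the Mamba architecture for its long-range dependency modeling and scalability; 3. Improved performance on open-set segmentation.

Method: A reconstruction-based approach that uses the Mamba architecture to process large-scale point clouds and draws on findings from object defect-detection research.

Result: The approach improves performance both for the authors' own open-set segmentation method and when applied to existing methods; the Mamba-based architecture is competitive with voxel-convolution-based methods on challenging large-scale point clouds.

Insight: Combining long-range dependency modeling with point cloud reconstruction can improve open-set segmentation and anomaly detection.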

Abstract: LiDAR scanning in outdoor scenes acquires accurate distance measurements over wide areas, producing large-scale point clouds. Application examples for this data include robotics, automotive vehicles, and land surveillance. During such applications, outlier objects from outside the training data will inevitably appear. Our research contributes a novel approach to open-set segmentation, leveraging the learnings of object defect-detection research. We also draw on the Mamba architecture’s strong performance in utilising long-range dependencies and scalability to large data. Combining both, we create a reconstruction-based approach for the task of outdoor scene open-set segmentation. We show that our approach improves performance not only when applied to our own open-set segmentation method, but also when applied to existing methods. Furthermore, we contribute a Mamba-based architecture which is competitive with existing voxel-convolution-based methods on challenging, large-scale point clouds.

[164] Wound3DAssist: A Practical Framework for 3D Wound Assessment

Remi Chierchia,Rodrigo Santa Cruz,Léo Lebrat,Yulia Arzhaeva,Mohammad Ali Armin,Jeremy Oorloff,Chuong Nguyen,Olivier Salvado,Clinton Fookes,David Ahmedt-Aristizabal

Main category: cs.CV

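TL;DR: Wound3DAssist is a 3D wound assessment framework that uses monocular consumer-grade video. It generates accurate 3D models from short handheld smartphone recordings and supports non-contact, automatic measurement in demanding clinical settings.

Details Motivation: Clinical assessment of chronic wounds relies on subjective, time-consuming manual documentation, and existing 2D digital videometry frameworks struggle with perspective distortion, a limited field of view, and an inability to capture wound depth.

Contribution: Presents Wound3DAssist, a complete 3D wound assessment framework that integrates 3D reconstruction, wound segmentation, tissue classification, and periwound analysis, supporting accurate non-contact measurement.

Method: Builds 3D models from monocular consumer-grade video and combines segmentation and classification for automated analysis in a modular workflow.

Result: The evaluation shows millimeter-level accuracy, high-quality wound bed visualization, and reliable tissue composition analysis; full assessments complete in under 20 minutes, demonstrating feasibility for clinical use.

Insight: 3D methods offer clear advantages for wound assessment, especially in anatomically complex regions, and a smartphone-based implementation makes adoption easier.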

Abstract: Managing chronic wounds remains a major healthcare challenge, with clinical assessment often relying on subjective and time-consuming manual documentation methods. Although 2D digital videometry frameworks aided the measurement process, these approaches struggle with perspective distortion, a limited field of view, and an inability to capture wound depth, especially in anatomically complex or curved regions. To overcome these limitations, we present Wound3DAssist, a practical framework for 3D wound assessment using monocular consumer-grade videos. Our framework generates accurate 3D models from short handheld smartphone video recordings, enabling non-contact, automatic measurements that are view-independent and robust to camera motion. We integrate 3D reconstruction, wound segmentation, tissue classification, and periwound analysis into a modular workflow. We evaluate Wound3DAssist across digital models with known geometry, silicone phantoms, and real patients. Results show that the framework supports high-quality wound bed visualization, millimeter-level accuracy, and reliable tissue composition analysis. Full assessments are completed in under 20 minutes, demonstrating feasibility for real-world clinical use.

[165] Few-Shot Pattern Detection via Template Matching and Regression

Eunchan Jo,Dahyun Kang,Sanghyun Kim,Yunseon Choi,Minsu Cho

Main category: cs.CV

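TL;DR: The paper proposes TMR, a few-shot pattern detection method based on template matching and regression. It addresses the failure of previous methods to localize non-object patterns and performs well on the new RPINE dataset.

Details Motivation: Existing few-shot object counting and detection methods typically represent targets as spatially collapsed prototypes, losing structural information, and are limited to object categories. The paper aims to overcome these limitations and extend to broader pattern detection.

Contribution: 1. TMR, which preserves spatial layout through template matching and regression; 2. RPINE, a new dataset covering a wider range of patterns; 3. Results that outperform state-of-the-art methods on three benchmarks.

Method: TMR combines classic template matching with regression, using only a small number of learnable convolutional or projection layers on top of a frozen backbone.

Result: TMR outperforms state-of-the-art methods on RPINE, FSCD-147, and FSCD-LVIS, and generalizes well in cross-dataset evaluation.

Insight: A simple combination of template matching and regression can capture spatial information effectively in the few-shot setting, and dataset diversity matters for evaluation.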

Abstract: We address the problem of few-shot pattern detection, which aims to detect all instances of a given pattern, typically represented by a few exemplars, from an input image. Although similar problems have been studied in few-shot object counting and detection (FSCD), previous methods and their benchmarks have narrowed patterns of interest to object categories and often fail to localize non-object patterns. In this work, we propose a simple yet effective detector based on template matching and regression, dubbed TMR. While previous FSCD methods typically represent target exemplars as spatially collapsed prototypes and lose structural information, we revisit classic template matching and regression. It effectively preserves and leverages the spatial layout of exemplars through a minimalistic structure with a small number of learnable convolutional or projection layers on top of a frozen backbone. We also introduce a new dataset, dubbed RPINE, which covers a wider range of patterns than existing object-centric datasets. Our method outperforms the state-of-the-art methods on the three benchmarks, RPINE, FSCD-147, and FSCD-LVIS, and demonstrates strong generalization in cross-dataset evaluation.
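To make the template matching idea concrete, here is a minimal sketch in plain Python. It is not the TMR architecture: TMR matches exemplar features from a frozen backbone and adds learnable layers plus a regression step, none of which appear here. The sketch only shows why sliding a spatially intact template keeps layout information that a pooled prototype would lose. The function names `match_template` and `detect` and the 0.99 threshold are assumptions chosen for illustration.

```python
import math


def match_template(image, template):
    """Slide `template` over `image`; return a map of cosine similarities.

    Both arguments are 2D lists of numbers (one channel). The template keeps
    its spatial layout, so a patch only scores high if values match
    position by position.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    tpl_flat = [v for row in template for v in row]
    tpl_norm = math.sqrt(sum(v * v for v in tpl_flat))
    scores = []
    for y in range(img_h - tpl_h + 1):
        row_scores = []
        for x in range(img_w - tpl_w + 1):
            patch = [image[y + dy][x + dx]
                     for dy in range(tpl_h) for dx in range(tpl_w)]
            patch_norm = math.sqrt(sum(v * v for v in patch))
            dot = sum(p * t for p, t in zip(patch, tpl_flat))
            denom = patch_norm * tpl_norm
            row_scores.append(dot / denom if denom > 0 else 0.0)
        scores.append(row_scores)
    return scores


def detect(scores, threshold=0.99):
    """Return (row, col) positions whose similarity reaches the threshold."""
    return [(y, x)
            for y, row in enumerate(scores)
            for x, s in enumerate(row)
            if s >= threshold]


# Toy example: the 2x2 pattern appears twice in the image.
template = [[1, 2],
            [3, 4]]
image = [[1, 2, 0, 0, 0, 0],
         [3, 4, 0, 0, 0, 0],
         [0, 0, 0, 1, 2, 0],
         [0, 0, 0, 3, 4, 0]]
print(detect(match_template(image, template)))
```

A learned detector would replace the fixed threshold with a regression head that refines each match into a box.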

[166] Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning

Xinyu Wei,Guoli Yang,Jialu Zhou,Mingyue Yang,Leqian Li,Kedi Zhang,Chunping Qiu

Main category: cs.CV

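TL;DR: DEHVF is an efficient vision-language fine-tuning method that dynamically embeds and fuses hierarchical visual features, avoiding any expansion of the input sequence while aligning cross-modal information precisely.

Details Motivation: Existing vision-language models usually concatenate visual features with text tokens into a single input, which lengthens the sequence and raises computational cost. Earlier methods fuse visual information into intermediate layers of the language model, but they neglect the hierarchical semantic representations of the visual encoder and the fine-grained visual information in its shallower layers.

Contribution: Proposes DEHVF, which exploits the hierarchical representations of both the visual encoder and the language model. A lightweight hierarchical visual fuser dynamically selects and fuses hierarchical features and embeds them into the corresponding layers of the language model.

Method: Hierarchical visual features are dynamically selected and fused, then projected, aligned, and embedded directly into the Feed-Forward Network of the language model. This avoids sequence expansion and fuses multi-layer visual information dynamically.

Result: On benchmarks including ScienceQA and COCO Captions, DEHVF achieves higher accuracy than existing parameter-efficient fine-tuning methods while keeping training and inference efficient.

Insight: Exploiting the hierarchical structure of the visual encoder and the language model can address cross-modal alignment without adding computational overhead.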

Abstract: Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects visual features and then concatenates them with text tokens to form a unified sequence input for Large Language Models (LLMs). However, this paradigm leads to a significant increase in the length of the input sequence, resulting in substantial computational overhead. Existing methods attempt to fuse visual information into the intermediate layers of LLMs, which alleviate the sequence length issue but often neglect the hierarchical semantic representations within the model and the fine-grained visual information available in the shallower visual encoding layers. To address this limitation, we propose DEHVF, an efficient vision-language fine-tuning method based on dynamic embedding and fusion of hierarchical visual features. Its core lies in leveraging the inherent hierarchical representation characteristics of visual encoders and language models. Through a lightweight hierarchical visual fuser, it dynamically selects and fuses hierarchical features corresponding to semantic granularity based on the internal representations of each layer in LLMs. The fused layer-related visual features are then projected and aligned before being directly embedded into the Feed-Forward Network (FFN) of the corresponding layer in LLMs. This approach not only avoids sequence expansion but also dynamically fuses multi-layer visual information. By fine-tuning only a small number of parameters, DEHVF achieves precise alignment and complementarity of cross-modal information at the same semantic granularity. We conducted experiments across various VL benchmarks, including visual question answering on ScienceQA and image captioning on COCO Captions. The results demonstrate that DEHVF achieves higher accuracy than existing parameter-efficient fine-tuning (PEFT) baselines while maintaining efficient training and inference.

[167] FloraSyntropy-Net: Scalable Deep Learning with Novel FloraSyntropy Archive for Large-Scale Plant Disease Diagnosis

Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel

Main category: cs.CV

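TL;DR: The paper proposes FloraSyntropy-Net, a deep learning framework for large-scale plant disease diagnosis that combines federated learning, a Memetic Algorithm, and a Deep Block. It reaches 96.38% accuracy on the FloraSyntropy dataset and generalizes well to the unrelated Pest dataset (99.84%).

Details Motivation: Existing AI models for plant disease diagnosis generalize poorly and are restricted to specific species. To address this, the authors introduce a large-scale dataset and a new framework intended to support diverse agricultural applications.

Contribution: 1) Introduces the FloraSyntropy Archive dataset (178,922 images, 35 plant species, 97 disease classes); 2) Proposes the FloraSyntropy-Net framework, which combines federated learning, a Memetic Algorithm, and a Deep Block; 3) Achieves strong performance on both the FloraSyntropy and Pest datasets.

Method: 1) Uses a Memetic Algorithm to select the base model (DenseNet201); 2) Designs a new Deep Block to strengthen feature representation; 3) Adopts a client-cloning strategy for scalable, privacy-preserving training.

Result: FloraSyntropy-Net reaches 96.38% accuracy on the FloraSyntropy dataset and 99.84% on the Pest dataset, indicating strong generalization.

Insight: By combining federated learning with an optimization algorithm, the work provides a large-scale dataset, moves agricultural AI closer to practical use, and underlines the importance of model generalization.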

Abstract: Early diagnosis of plant diseases is critical for global food safety, yet most AI solutions lack the generalization required for real-world agricultural diversity. These models are typically constrained to specific species, failing to perform accurately across the broad spectrum of cultivated plants. To address this gap, we first introduce the FloraSyntropy Archive, a large-scale dataset of 178,922 images across 35 plant species, annotated with 97 distinct disease classes. We establish a benchmark by evaluating numerous existing models on this archive, revealing a significant performance gap. We then propose FloraSyntropy-Net, a novel federated learning framework (FL) that integrates a Memetic Algorithm (MAO) for optimal base model selection (DenseNet201), a novel Deep Block for enhanced feature representation, and a client-cloning strategy for scalable, privacy-preserving training. FloraSyntropy-Net achieves a state-of-the-art accuracy of 96.38% on the FloraSyntropy benchmark. Crucially, to validate its generalization capability, we test the model on the unrelated multiclass Pest dataset, where it demonstrates exceptional adaptability, achieving 99.84% accuracy. This work provides not only a valuable new resource but also a robust and highly generalizable framework that advances the field towards practical, large-scale agricultural AI applications.

[168] M^3-GloDets: Multi-Region and Multi-Scale Analysis of Fine-Grained Diseased Glomerular Detection

Tianyu Shi,Xinzi He,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng

Main category: cs.CV

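TL;DR: M^3-GloDets is a systematic framework for evaluating diseased glomerular detection models across multiple regions, scales, and classes. It finds that intermediate patch sizes and moderate magnifications give the best detection performance.

Details Motivation: Existing research focuses mostly on normal glomeruli or global sclerosis and leaves diseased glomerular subtypes understudied. There is also no consensus on imaging magnification or field-of-view size, so a systematic evaluation is needed.

Contribution: Proposes the M^3-GloDets framework, systematically evaluates detection models across regions, scales, and classes, and identifies intermediate patch sizes with moderate magnification as the preferred combination.

Method: Compares long-standing and recent detection models in systematic experiments over different region sizes and imaging resolutions, and analyzes their performance on multi-class diseased glomeruli.

Result: Intermediate patch sizes give the best balance between context and computational efficiency, and moderate magnifications reduce overfitting and improve generalization.

Insight: The study reveals how model performance varies with region and scale, and offers practical guidance for automated detection strategies and clinical workflows in digital pathology.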

Abstract: Accurate detection of diseased glomeruli is fundamental to progress in renal pathology and underpins the delivery of reliable clinical diagnoses. Although recent advances in computer vision have produced increasingly sophisticated detection algorithms, the majority of research efforts have focused on normal glomeruli or instances of global sclerosis, leaving the wider spectrum of diseased glomerular subtypes comparatively understudied. This disparity is not without consequence; the nuanced and highly variable morphological characteristics that define these disease variants frequently elude even the most advanced computational models. Moreover, ongoing debate surrounds the choice of optimal imaging magnifications and region-of-view dimensions for fine-grained glomerular analysis, adding further complexity to the pursuit of accurate classification and robust segmentation. To bridge these gaps, we present M^3-GloDet, a systematic framework designed to enable thorough evaluation of detection models across a broad continuum of regions, scales, and classes. Within this framework, we evaluate both long-standing benchmark architectures and recently introduced state-of-the-art models that have achieved notable performance, using an experimental design that reflects the diversity of region-of-interest sizes and imaging resolutions encountered in routine digital renal pathology. We found that intermediate patch sizes offered the best balance between context and efficiency. Additionally, moderate magnifications enhanced generalization by reducing overfitting. Through systematic comparison of these approaches on a multi-class diseased glomerular dataset, our aim is to advance the understanding of model strengths and limitations, and to offer actionable insights for the refinement of automated detection strategies and clinical workflows in the digital pathology domain.

[169] Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection

Runhe Lai,Xinhua Lu,Kanghao Chen,Qichao Chen,Wei-Shi Zheng,Ruixuan Wang

Main category: cs.CV

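TL;DR: The paper proposes a medical out-of-distribution (OOD) detection framework based on vision-language models (VLMs). Cross-scale visual fusion and a hard pseudo-OOD sample generation strategy improve recognition of unknown diseases.

Details Motivation: In trustworthy medical diagnosis systems, OOD detection is essential for identifying unknown diseases and avoiding misdiagnosis. Existing methods perform poorly on unknown diseases that resemble known ones, so a more effective way of integrating visual information is needed.

Contribution: 1. Proposes an OOD detection framework based on vision-language models; 2. Enriches the detailed representation of medical images through a cross-scale visual fusion strategy; 3. Designs a cross-scale hard pseudo-OOD sample generation strategy to improve detection performance.

Method: Uses vision-language models together with cross-scale visual fusion and hard pseudo-OOD sample generation to improve recognition of unknown diseases.

Result: Experiments on three public medical datasets show that the method outperforms existing methods at OOD detection.

Insight: Cross-scale visual information and hard pseudo-sample generation can improve the recognition of OOD samples in medical images and point to a way of making diagnosis systems more reliable.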

Abstract: In trustworthy medical diagnosis systems, integrating out-of-distribution (OOD) detection aims to identify unknown diseases in samples, thereby mitigating the risk of misdiagnosis. In this study, we propose a novel OOD detection framework based on vision-language models (VLMs), which integrates hierarchical visual information to cope with challenging unknown diseases that resemble known diseases. Specifically, a cross-scale visual fusion strategy is proposed to couple visual embeddings from multiple scales. This enriches the detailed representation of medical images and thus improves the discrimination of unknown diseases. Moreover, a cross-scale hard pseudo-OOD sample generation strategy is proposed to benefit OOD detection maximally. Experimental evaluations on three public medical datasets support that the proposed framework achieves superior OOD detection performance compared to existing methods. The source code is available at https://openi.pcl.ac.cn/OpenMedIA/HVL.

[170] Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing

Yogesh Kumar

Main category: cs.CV

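TL;DR: Proposes Language-Guided Temporal Token Pruning (LGTTP), which uses temporal cues in the query to prune video tokens adaptively, reducing computation while preserving contextual continuity.

Details Motivation: Existing vision language models (VLMs) are computationally inefficient on long videos because attention has quadratic complexity.

Contribution: Proposes the LGTTP framework, whose language-guided token pruning reduces computation by 65% while preserving 97-99% of the original performance.

Method: Video tokens are pruned adaptively according to temporal cues in the query, keeping a higher token density in temporally relevant segments. The framework is compatible with TimeChat and LLaVA-Video.

Result: HIT@1 improves by 9.5% on QVHighlights, and 99.6% of R@1 is retained on Charades-STA. The method works for queries with explicit temporal markers and for general video understanding tasks.

Insight: Language-guided token pruning balances computational cost against model performance and is particularly well suited to time-sensitive long-video processing.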

Abstract: Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
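The idea of keeping a higher token density in temporally relevant segments can be sketched as a budget allocation problem. The code below is an illustration under stated assumptions, not LGTTP itself: the per-segment relevance weights are taken as given (the abstract does not say how they are derived from the query), and the names `allocate_token_budget` and `prune_segment`, the per-segment floor, and the proportional split are all hypothetical choices.

```python
def allocate_token_budget(relevance, tokens_per_segment, budget, floor):
    """Split a total token budget across video segments.

    Every segment keeps at least `floor` tokens so that context is never
    dropped entirely; the spare budget is shared in proportion to each
    segment's relevance to the query. Shares are rounded down, so the
    result may use slightly fewer tokens than `budget`.
    """
    n = len(relevance)
    if floor * n > budget:
        raise ValueError("budget is too small for the requested floor")
    spare = budget - floor * n
    total = sum(relevance)
    keep = []
    for r in relevance:
        share = spare * r / total if total > 0 else spare / n
        keep.append(min(tokens_per_segment, floor + int(share)))
    return keep


def prune_segment(token_scores, k):
    """Keep the k highest-scoring tokens of one segment, in original order."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:k])


# Four segments of 100 tokens; the query points at the third segment.
print(allocate_token_budget([1, 1, 7, 1], tokens_per_segment=100,
                            budget=140, floor=10))
print(prune_segment([0.2, 0.9, 0.1, 0.5], k=2))
```

Uniform pruning corresponds to equal relevance weights, and keyframe selection corresponds to a floor of zero with all weight on a few segments; the proportional split sits between the two.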

[171] Benchmarking Class Activation Map Methods for Explainable Brain Hemorrhage Classification on Hemorica Dataset

Z. Rafati,M. Hoseyni,J. Khoramdel,A. Nikoofard

Main category: cs.CV

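TL;DR: The study evaluates explainability in brain hemorrhage classification using several Class Activation Mapping (CAM) techniques, quantitatively compares nine CAM algorithms on the Hemorica dataset, and establishes a reproducible benchmark.

Details Motivation: In medical imaging research, explainable artificial intelligence (XAI) is essential for the transparency of deep learning models and for clinical trust, yet CAM methods have not been compared systematically for brain hemorrhage detection.

Contribution: 1. Proposes a pipeline for quantitatively evaluating the explainability of CAM methods; 2. Compares nine CAM algorithms on the Hemorica dataset; 3. Identifies stage 5 of EfficientNetV2S as the stage with the best localization performance; 4. Establishes a reproducible XAI benchmark for brain hemorrhage classification.

Method: A pipeline extracts pixel-level segmentation and detection annotations from classification models by applying nine CAM algorithms at different network stages, and evaluates them quantitatively with metrics including Dice, IoU, and pixel-wise overlap.

Result: The best localization performance occurs at stage 5 of EfficientNetV2S. HiResCAM gives the best bounding-box alignment, and AblationCAM gives the best pixel-level Dice (0.57) and IoU (0.40). The results indicate that classification training alone can yield strong localization.

Insight: 1. CAM methods have potential clinical value in medical imaging tasks; 2. Classification models can produce meaningful localization even without segmentation supervision; 3. Deeper network stages may be better suited to localization.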

Abstract: Explainable Artificial Intelligence (XAI) has become an essential component of medical imaging research, aiming to increase transparency and clinical trust in deep learning models. This study investigates brain hemorrhage diagnosis with a focus on explainability through Class Activation Mapping (CAM) techniques. A pipeline was developed to extract pixel-level segmentation and detection annotations from classification models using nine state-of-the-art CAM algorithms, applied across multiple network stages, and quantitatively evaluated on the Hemorica dataset, which uniquely provides both slice-level labels and high-quality segmentation masks. Metrics including Dice, IoU, and pixel-wise overlap were employed to benchmark CAM variants. Results show that the strongest localization performance occurred at stage 5 of EfficientNetV2S, with HiResCAM yielding the highest bounding-box alignment and AblationCAM achieving the best pixel-level Dice (0.57) and IoU (0.40), representing strong accuracy given that models were trained solely for classification without segmentation supervision. To the best of current knowledge, this is among the first works to quantitatively compare CAM methods for brain hemorrhage detection, establishing a reproducible benchmark and underscoring the potential of XAI-driven pipelines for clinically meaningful AI-assisted diagnosis.
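The Dice and IoU metrics used to score a CAM against a segmentation mask can be sketched as follows. This is a generic illustration, not the paper's pipeline: the 0.5 binarization threshold and the helper names `binarize` and `dice_and_iou` are assumptions, and the paper may post-process heatmaps differently.

```python
def binarize(cam, threshold=0.5):
    """Turn a heatmap with values in [0, 1] into a binary mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in cam]


def dice_and_iou(pred, target):
    """Compute Dice and IoU between two binary masks of the same shape."""
    inter = sum(p & t
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    pred_sum = sum(sum(row) for row in pred)
    target_sum = sum(sum(row) for row in target)
    union = pred_sum + target_sum - inter
    # Two empty masks agree perfectly, so both metrics are defined as 1.0.
    dice = 2 * inter / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


cam = [[0.9, 0.8, 0.1],
       [0.7, 0.2, 0.1]]
mask = [[1, 1, 0],
        [0, 0, 0]]
dice, iou = dice_and_iou(binarize(cam), mask)
print(round(dice, 3), round(iou, 3))
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Plugging in IoU = 0.40 gives about 0.57, which matches the reported pair of values, although averages over many images need not satisfy the identity exactly.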

[172] NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction

Soham Dasgupta,Shanthika Naik,Preet Savalia,Sujay Kumar Ingle,Avinash Sharma

Main category: cs.CV

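TL;DR: NGD is a neural gradient-based deformation method for reconstructing dynamically changing garments from monocular video. It combines an adaptive remeshing strategy with dynamic texture learning to improve reconstruction quality.

Details Motivation: Among current dynamic garment reconstruction methods, implicit representations miss high-frequency detail, while explicit template methods produce artifacts because they deform by vertex displacement. A new approach is needed to address both problems.

Contribution: 1. Proposes NGD, which deforms via neural gradients and avoids vertex-displacement artifacts; 2. Proposes an adaptive remeshing strategy for modelling dynamic surfaces such as wrinkles and pleats; 3. Learns dynamic textures that capture lighting and shadow.

Method: 1. Drives deformation with a neural gradient field rather than vertex displacement; 2. Adapts the mesh dynamically to follow garment motion; 3. Jointly optimizes dynamic texture and geometry.

Result: Experiments show that NGD clearly outperforms existing methods both qualitatively and quantitatively and produces high-quality garment reconstructions.

Insight: Neural gradient deformation combined with adaptive mesh optimization offers a new direction for dynamic garment reconstruction, and dynamic texture learning adds realism.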

Abstract: Dynamic garment reconstruction from monocular video is an important yet challenging task due to the complex dynamics and unconstrained nature of the garments. Recent advancements in neural rendering have enabled high-quality geometric reconstruction with image/video supervision. However, implicit representation methods that use volume rendering often provide smooth geometry and fail to model high-frequency details. While template reconstruction methods model explicit geometry, they use vertex displacement for deformation, which results in artifacts. Addressing these limitations, we propose NGD, a Neural Gradient-based Deformation method to reconstruct dynamically evolving textured garments from monocular videos. Additionally, we propose a novel adaptive remeshing strategy for modelling dynamically evolving surfaces like wrinkles and pleats of the skirt, leading to high-quality reconstruction. Finally, we learn dynamic texture maps to capture per-frame lighting and shadow effects. We provide extensive qualitative and quantitative evaluations to demonstrate significant improvements over existing SOTA methods and provide high-quality garment reconstructions.

[173] F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Hanbo Bi,Zhiqiang Yuan,Zexi Jia,Jiapei Zhang,Chongyang Li,Peixiang Luo,Ying Deng,Xiaoyue Duan,Jinchao Zhang

Main category: cs.CV

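TL;DR: The paper defines the Fine-grained Fragment Retrieval (FFR) task for multimodal long-form dialogue and builds the MLDR dataset and a WeChat-based test set. The proposed F2RVLM model improves retrieval through two-stage training and difficulty-aware curriculum sampling.

Details Motivation: Traditional dialogue retrieval does not meet users' need to retrieve semantically related fragments from long conversations, so a method that can locate multimodal fragments is needed.

Contribution: 1. Defines the FFR task and builds the MLDR dataset and the WeChat test set; 2. Proposes F2RVLM, which improves performance through two-stage training and curriculum sampling.

Method: 1. Supervised fine-tuning to inject fragment-level retrieval knowledge; 2. GRPO-based reinforcement learning with multi-objective rewards; 3. Difficulty-aware curriculum sampling.

Result: F2RVLM outperforms existing vision-language models in both in-domain and real-world settings.

Insight: Explicit supervision and curriculum learning can improve a model's semantic coherence and retrieval ability in long dialogues.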

Abstract: Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, they often fail to meet users’ actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
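Difficulty-aware curriculum sampling, as described above, ranks training instances by predicted difficulty and exposes harder ones gradually. The sketch below shows one simple way to do that. The linear growth schedule, the `start_frac` parameter, and the name `curriculum_pool` are assumptions for illustration; the abstract does not specify the schedule F2RVLM uses.

```python
import math


def curriculum_pool(difficulties, step, total_steps, start_frac=0.25):
    """Return indices of the training instances available at `step`.

    Instances are ranked from easiest to hardest. The pool starts with the
    easiest `start_frac` of the data and grows linearly until the last step
    exposes everything.
    """
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    progress = step / max(1, total_steps - 1)
    frac = min(1.0, start_frac + (1.0 - start_frac) * progress)
    size = max(1, math.ceil(frac * len(order)))
    return order[:size]


# Difficulty scores as a model might predict them (hypothetical values).
difficulties = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
for step in range(4):
    print(step, curriculum_pool(difficulties, step, total_steps=4))
```

Batches would then be sampled from the returned pool at each step, so early updates see only easy fragments and later updates see the full range.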

[174] Instant Preference Alignment for Text-to-Image Diffusion Models

Yang Li,Songlin Yang,Xiaoxuan Han,Wei Wang,Jing Dong,Yueming Lyu,Ziyu Xue

Main category: cs.CV

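TL;DR: The paper proposes an instant preference alignment framework for text-to-image (T2I) generation, built on multimodal large language models (MLLMs), that aligns to changing preferences without additional training.

Details Motivation: Existing text-to-image methods usually rely on static preferences or fine-tuning, which makes it hard to adapt to evolving and fine-grained user intent. The paper addresses the challenge of instant preference alignment.

Contribution: 1. Proposes a training-free framework for instant preference alignment; 2. Extracts global preference signals automatically with an MLLM; 3. Combines global keyword-based control with local region-aware modulation for precise alignment.

Method: 1. An MLLM-based preference understanding module extracts and structures user preferences; 2. Keywords and local cross-attention modulation steer the diffusion model toward preference-aligned images.

Result: Experiments on the Viper dataset and a benchmark collected by the authors show that the method outperforms existing methods in both quantitative metrics and human evaluation.

Insight: Combining multimodal large language models with diffusion models offers a new route to instant preference alignment and supports multi-round interactive refinement and context-aware generation.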

Abstract: Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing methods. For preference-guided generation, we integrate global keyword-based control and local region-aware cross-attention modulation to steer the diffusion model without additional training, enabling precise alignment across both global attributes and local elements. The entire framework supports multi-round interactive refinement, facilitating real-time and context-aware image generation. Extensive experiments on the Viper dataset and our collected benchmark demonstrate that our method outperforms prior approaches in both quantitative metrics and human evaluations, and opens up new possibilities for dialog-based generation and MLLM-diffusion integration.

[175] Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework

Koichiro Kamide,Shunsuke Sakai,Shun Maeda,Chunzhi Gu,Chao Zhang

Main category: cs.CV

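TL;DR: The paper proposes a unified contrastive learning framework for few-shot human action anomaly detection. A generative motion augmentation strategy improves generalization, and the method performs strongly in experiments.

Details Motivation: Existing human action anomaly detection methods usually train a separate model for each action category and depend on many normal samples, which makes them hard to apply when data is scarce or new categories appear frequently.

Contribution: 1. Proposes a unified contrastive learning framework suited to few-shot settings; 2. Introduces a generative motion augmentation strategy based on a diffusion model; 3. Is the first to use such a strategy to strengthen contrastive learning for action anomaly detection.

Method: 1. Builds a category-agnostic representation space through contrastive learning; 2. Uses a diffusion-based foundation model to generate diverse and realistic training samples; 3. Detects anomalies by comparison with a support set.

Result: On the HumanAct12 dataset, the method achieves state-of-the-art performance in the few-shot setting for both seen and unseen categories.

Insight: Augmentation with a generative model can improve contrastive learning for few-shot anomaly detection and strengthens both generalization and robustness.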

Abstract: Human Action Anomaly Detection (HAAD) aims to identify anomalous actions given only normal action data during training. Existing methods typically follow a one-model-per-category paradigm, requiring separate training for each action category and a large number of normal samples. These constraints hinder scalability and limit applicability in real-world scenarios, where data is often scarce or novel categories frequently appear. To address these limitations, we propose a unified framework for HAAD that is compatible with few-shot scenarios. Our method constructs a category-agnostic representation space via contrastive learning, enabling AD by comparing test samples with a given small set of normal examples (referred to as the support set). To improve inter-category generalization and intra-category robustness, we introduce a generative motion augmentation strategy harnessing a diffusion-based foundation model for creating diverse and realistic training samples. Notably, to the best of our knowledge, our work is the first to introduce such a strategy specifically tailored to enhance contrastive learning for action AD. Extensive experiments on the HumanAct12 dataset demonstrate the state-of-the-art effectiveness of our approach under both seen and unseen category settings, regarding training efficiency and model scalability for few-shot HAAD.
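Comparing a test sample with a small support set of normal examples can be sketched as a nearest-neighbor check in embedding space. The code below assumes the embeddings have already been produced by some encoder; the scoring rule (one minus the highest cosine similarity to any support example) and the names `cosine` and `anomaly_score` are illustrative assumptions, not the scoring function from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def anomaly_score(query, support_set):
    """Score a query embedding against a few normal support embeddings.

    The score is low when the query is close to at least one normal
    example and high when it is far from all of them.
    """
    return 1.0 - max(cosine(query, s) for s in support_set)


# Toy 2D embeddings standing in for encoder outputs (hypothetical values).
support = [[1.0, 0.0], [0.9, 0.1]]
print(anomaly_score([1.0, 0.05], support))  # close to the support set
print(anomaly_score([0.0, 1.0], support))   # far from the support set
```

Because the score depends only on the support set supplied at test time, the same encoder can serve a new action category by swapping in a few normal examples of that category.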

[176] Segmentation and Classification of Pap Smear Images for Cervical Cancer Detection Using Deep Learning

Nisreen Albzour,Sarah S. Lam

Main category: cs.CV

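TL;DR: The study proposes a framework that combines U-Net segmentation with a deep classification model for Pap smear image analysis. Experiments show that segmentation improves classification only marginally.

Details Motivation: Cervical cancer is a major threat to women's health worldwide. Manual screening is time-consuming and error-prone, so automated tools are needed to support early diagnosis.

Contribution: Proposes a deep learning framework that integrates U-Net segmentation with a classification model for cervical cancer detection, and evaluates how segmentation affects classification performance.

Method: U-Net is trained on the Herlev dataset for image segmentation, a classification model is then trained, and performance on segmented and non-segmented images is compared.

Result: Classification on segmented images is slightly better (precision about 0.41 percent higher, F1-score about 1.30 percent higher), but the overall effect is limited.

Insight: Segmentation helps feature extraction but contributes little to classification performance, which suggests that direct classification, or segmentation combined with other optimizations, may be more efficient.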

Abstract: Cervical cancer remains a significant global health concern and a leading cause of cancer-related deaths among women. Early detection through Pap smear tests is essential to reduce mortality rates; however, the manual examination is time consuming and prone to human error. This study proposes a deep learning framework that integrates U-Net for segmentation and a classification model to enhance diagnostic performance. The Herlev Pap Smear Dataset, a publicly available cervical cell dataset, was utilized for training and evaluation. The impact of segmentation on classification performance was evaluated by comparing the model trained on segmented images and another trained on non-segmented images. Experimental results showed that the use of segmented images marginally improved the model performance on precision (about 0.41 percent higher) and F1-score (about 1.30 percent higher), which suggests a slightly more balanced classification performance. While segmentation helps in feature extraction, the results showed that its impact on classification performance appears to be limited. The proposed framework offers a supplemental tool for clinical applications, which may aid pathologists in early diagnosis.

[177] Designing Practical Models for Isolated Word Visual Speech Recognition

Iason Ioannis Panagos,Giorgos Sfikas,Christophoros Nikou

Main category: cs.CV

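TL;DR: The paper proposes lightweight visual speech recognition (VSR) systems that reduce computational cost and suit resource-constrained, real-world settings.

Details Motivation: Conventional VSR systems rely on deep neural networks with high computational cost, which limits their use in resource-constrained environments.

Contribution: Develops lightweight end-to-end architectures that combine efficient feature extraction with a temporal convolution network design, yielding unified models with low resource requirements and strong recognition performance.

Method: Follows the two-network design paradigm, adopts efficient models from the image classification literature for feature extraction, and uses a temporal convolution network as the sequence backbone.

Result: Experiments on the largest public database of English words show that the models retain strong recognition performance at low resource cost.

Insight: Lightweight design can substantially lower the computational cost of VSR systems and makes wider use in settings such as medical assistance and human-machine interaction feasible.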

Abstract: Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. A VSR system is typically employed in a complementary role in cases where the audio is corrupt or not available. In order to accurately predict the spoken words, these architectures often rely on deep neural networks in order to extract meaningful representations from the input sequence. While deep architectures achieve impressive recognition performance, relying on such models incurs significant computation costs which translates into increased resource demands in terms of hardware requirements and results in limited applicability in real-world scenarios where resources might be constrained. This factor prevents wider adoption and deployment of speech recognition systems in more practical applications. In this work, we aim to alleviate this issue by developing architectures for VSR that have low hardware costs. Following the standard two-network design paradigm, where one network handles visual feature extraction and another one utilizes the extracted features to classify the entire sequence, we develop lightweight end-to-end architectures by first benchmarking efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone. We create several unified models with low resource requirements but strong recognition performance. Experiments on the largest public database for English words demonstrate the effectiveness and practicality of our developed models. Code and trained models will be made publicly available.

[178] Robust Anomaly Detection in Industrial Environments via Meta-Learning

Muhammad Aqeel,Shakiba Sharifi,Marco Cristani,Francesco Setti

Main category: cs.CV

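TL;DR: The paper proposes RAD, a robust anomaly detection framework that combines Normalizing Flows with Model-Agnostic Meta-Learning (MAML) to handle label noise in industrial training data. A bi-level optimization strategy uses meta-learning to adapt quickly to noise conditions, and uncertainty quantification drives adaptive L2 regularization. RAD performs well under both clean and noisy conditions.

Details Motivation: Anomaly detection is critical to quality and efficiency in industrial environments, but real data often contains mislabeled samples that conventional methods handle poorly. A robust method is needed for noisy data.

Contribution: Proposes the RAD framework, which combines Normalizing Flows and meta-learning to address label noise; designs a bi-level optimization strategy for rapid noise adaptation and model stability; improves detection through multiscale features and uncertainty quantification.

Method: Uses Normalizing Flows for precise likelihood estimation together with MAML for meta-learning, applies bi-level optimization with adaptive L2 regularization, and processes multiscale features from pretrained feature extractors.

Result: On the MVTec-AD and KSDD2 datasets, I-AUROC reaches 95.4% and 94.6% respectively under clean conditions and stays above 86.8% and 92.1% with 50% label noise.

Insight: RAD remains robust under noisy conditions, offers a practical solution for imperfect data in industrial settings, and widens the range of real-world uses for anomaly detection.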

Abstract: Anomaly detection is fundamental for ensuring quality control and operational efficiency in industrial environments, yet conventional approaches face significant challenges when training data contains mislabeled samples, a common occurrence in real-world scenarios. This paper presents RAD, a robust anomaly detection framework that integrates Normalizing Flows with Model-Agnostic Meta-Learning to address the critical challenge of label noise in industrial settings. Our approach employs a bi-level optimization strategy where meta-learning enables rapid adaptation to varying noise conditions, while uncertainty quantification guides adaptive L2 regularization to maintain model stability. The framework incorporates multiscale feature processing through pretrained feature extractors and leverages the precise likelihood estimation capabilities of Normalizing Flows for robust anomaly scoring. Comprehensive evaluation on MVTec-AD and KSDD2 datasets demonstrates superior performance, achieving I-AUROC scores of 95.4% and 94.6% respectively under clean conditions, while maintaining robust detection capabilities above 86.8% and 92.1% even when 50% of training samples are mislabeled. The results highlight RAD’s exceptional resilience to noisy training conditions and its ability to detect subtle anomalies across diverse industrial scenarios, making it a practical solution for real-world anomaly detection applications where perfect data curation is challenging.

[179] PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models

Kai Zhao,Wubang Yuan,Alex Lingyu Hung,Dan Zeng

Main category: cs.CV

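TL;DR: The paper proposes PoRe, a simple and effective position reweighting method that mitigates recency bias in visual token pruning for vision language models and thereby improves pruning quality.

Details Motivation: In vision-language models (VLMs), visual tokens usually far outnumber text tokens and are highly redundant. Existing visual token pruning methods rely on text-visual attention scores, but the recency bias of sequence models causes pruning to retain too many tokens from the bottom of the image, which hurts performance.

Contribution: Proposes a position reweighting mechanism (PoRe) that adjusts the attention scores of visual tokens to mitigate recency bias, with no change to the model architecture and no extra training.

Method: The attention scores of visual tokens are reweighted according to their spatial position in the image, which counteracts the recency bias of the sequence model and improves pruning.

Result: Experiments show that the method improves visual token pruning performance with minimal computational overhead.

Insight: Recency bias is an important factor in visual token pruning, and a simple reweighting by spatial position is enough to improve pruning results.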

Abstract: Vision-Language Models (VLMs) typically process a significantly larger number of visual tokens compared to text tokens due to the inherent redundancy in visual signals. Visual token pruning is a promising direction to reduce the computational cost of VLMs by eliminating redundant visual tokens. The text-visual attention score is a widely adopted criterion for visual token pruning as it reflects the relevance of visual tokens to the text input. However, many sequence models exhibit a recency bias, where tokens appearing later in the sequence exert a disproportionately large influence on the model’s output. In VLMs, this bias manifests as inflated attention scores for tokens corresponding to the lower regions of the image, leading to suboptimal pruning that disproportionately retains tokens from the image bottom. In this paper, we present an extremely simple yet effective approach to alleviate the recency bias in visual token pruning. We propose a straightforward reweighting mechanism that adjusts the attention scores of visual tokens according to their spatial positions in the image. Our method, termed Position-reweighted Visual Token Pruning, is a plug-and-play solution that can be seamlessly incorporated into existing visual token pruning frameworks without any changes to the model architecture or extra training. Extensive experiments on LVLMs demonstrate that our method improves the performance of visual token pruning with minimal computational overhead.
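The effect of position reweighting on attention-based pruning can be shown with a small sketch. The abstract does not give the reweighting function, so the linear down-weighting of later tokens used here, the `strength` parameter, and the names `reweight` and `top_k` are assumptions chosen only to illustrate the mechanism.

```python
def reweight(scores, strength=0.5):
    """Scale attention scores down for tokens later in the sequence.

    Position 0 keeps its full score and the last token is scaled by
    (1 - strength). In raster order, later tokens correspond to lower
    image rows, which is where recency bias inflates attention.
    """
    n = len(scores)
    out = []
    for i, s in enumerate(scores):
        pos = i / (n - 1) if n > 1 else 0.0
        out.append(s * (1.0 - strength * pos))
    return out


def top_k(scores, k):
    """Indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Raw text-visual attention: the last two tokens are slightly inflated.
raw = [0.30, 0.10, 0.10, 0.10, 0.31, 0.32]
print(top_k(raw, 2))            # keeps only late tokens
print(top_k(reweight(raw), 2))  # keeps an early token as well
```

Since the adjustment touches only the scores fed into the top-k selection, it can sit in front of an existing pruning criterion without retraining, which is the plug-and-play property the abstract describes.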

[180] UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization

Xingyu Ai,Shaoyu Wang,Zhiyuan Jia,Ao Xu,Hongming Shan,Jianhua Ma,Qiegen Liu

Main category: cs.CV

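TL;DR: UniSino is a physics-driven foundation model for universal CT sinogram standardization. It processes data directly in the projection domain, which improves generalization across a range of undersampling scenarios.

Details Motivation: Conventional CT sinogram correction relies on hand-designed algorithms or fixed empirical parameters and generalizes poorly across heterogeneous artifact types. UniSino addresses this with physics-driven learning that standardizes data directly in the projection domain.

Contribution: Proposes UniSino, a foundation model that standardizes CT sinograms directly in the projection domain and addresses the limited generalization of conventional methods.

Method: UniSino's training framework incorporates the physical characteristics of sinograms, covers multiple subtasks spanning four benchmark datasets, and operates directly in the projection domain to improve generalization.

Result: Experiments show that UniSino achieves superior reconstruction quality in both single and mixed undersampling scenarios, with strong robustness and generalization.

Insight: Processing data directly in the projection domain captures the underlying physics better and leads to stronger performance across multiple tasks and complex scenarios.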

Abstract: During raw-data acquisition in CT imaging, diverse factors can degrade the collected sinograms, with undersampling and noise leading to severe artifacts and noise in reconstructed images and compromising diagnostic accuracy. Conventional correction methods rely on manually designed algorithms or fixed empirical parameters, but these approaches often lack generalizability across heterogeneous artifact types. To address these limitations, we propose UniSino, a foundation model for universal CT sinogram standardization. Unlike existing foundational models that operate in the image domain, UniSino directly standardizes data in the projection domain, which enables stronger generalization across diverse undersampling scenarios. Its training framework incorporates the physical characteristics of sinograms, enhancing generalization and enabling robust performance across multiple subtasks spanning four benchmark datasets. Experimental results demonstrate that UniSino achieves superior reconstruction quality in both single and mixed undersampling cases, demonstrating exceptional robustness and generalization in sinogram enhancement for CT imaging. The code is available at: https://github.com/yqx7150/UniSino.

[181] TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration

Meiqi Gong,Hao Zhang,Xunpeng Yi,Linfeng Tang,Jiayi Ma

Main category: cs.CV

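TL;DR: The paper proposes TemCoCo, the first video fusion framework to combine temporal modeling with visual-semantic collaboration, ensuring visual fidelity, semantic accuracy, and temporal consistency.

Details Motivation: Existing video fusion methods ignore temporal dependencies, producing inconsistent results across frames. TemCoCo targets this problem to improve video fusion quality.

Contribution: 1. Introduces a visual-semantic interaction module that enhances both visual and semantic representations; 2. Integrates video degradation enhancement into the fusion pipeline; 3. Proposes a temporal enhancement mechanism and a temporal loss; 4. Designs new evaluation metrics.

Method: 1. Uses Dinov2 and VGG19 for targeted distillation in the visual-semantic interaction module; 2. Builds a temporal cooperative module; 3. Embeds a temporal enhancement mechanism and a temporal loss.

Result: Experiments on public video datasets demonstrate the superiority of the method.

Insight: Temporal consistency and visual-semantic collaboration are essential for video fusion, and dynamic modeling can significantly improve result quality.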

Abstract: Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and leading to inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with Dinov2 and VGG19 employed for targeted distillation, allowing simultaneous enhancement of both the visual and semantic representations. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline by constructing a temporal cooperative module, which leverages temporal dependencies to facilitate weak information recovery. Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative evaluation metrics tailored for video fusion, aimed at assessing the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets demonstrate the superiority of our method. Our code is released at https://github.com/Meiqi-Gong/TemCoCo.

[182] A Contrastive Learning-Guided Confident Meta-learning for Zero Shot Anomaly Detection

Muhammad Aqeel,Danijel Skocaj,Marco Cristani,Francesco Setti

Main category: cs.CV

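TL;DR: The paper proposes CoZAD, a zero-shot anomaly detection framework that combines contrastive learning with meta-learning. Using confidence weighting and improved feature representations, it achieves leading performance on multiple industrial and medical datasets.

Details Motivation: Industrial and medical anomaly detection suffers from data scarcity and high annotation costs, especially in emerging domains. Traditional confident learning discards uncertain samples and so loses boundary information, while existing methods depend on vision-language alignment or model ensembles and are resource-intensive.

Contribution: 1. Proposes CoZAD, a novel zero-shot anomaly detection framework; 2. Combines confident learning with contrastive learning to preserve boundary information and refine the feature space; 3. Achieves leading performance on multiple datasets without relying on vision-language alignment or model ensembles.

Method: 1. Quantifies data uncertainty with IQR-based thresholding; 2. Measures model uncertainty via covariance-based regularization; 3. Uses Model-Agnostic Meta-Learning (MAML) for rapid domain adaptation; 4. Applies contrastive learning to build a discriminative feature space.

Result: Evaluated on 10 datasets, with the best results on 6 of 7 industrial benchmarks (e.g., 99.2% I-AUROC on DTD-Synthetic and 97.2% on BTAD) and strong pixel-level localization (96.3% P-AUROC on MVTec-AD).

Insight: 1. Preserving the boundary information carried by uncertain samples improves model performance; 2. Compact clusters in the contrastive feature space help anomaly detection; 3. The framework is lightweight and suited to resource-constrained settings.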

Abstract: Industrial and medical anomaly detection faces critical challenges from data scarcity and prohibitive annotation costs, particularly in evolving manufacturing and healthcare settings. To address this, we propose CoZAD, a novel zero-shot anomaly detection framework that integrates soft confident learning with meta-learning and contrastive feature representation. Unlike traditional confident learning that discards uncertain samples, our method assigns confidence-based weights to all training data, preserving boundary information while emphasizing prototypical normal patterns. The framework quantifies data uncertainty through IQR-based thresholding and model uncertainty via covariance-based regularization within a Model-Agnostic Meta-Learning (MAML) framework. Contrastive learning creates discriminative feature spaces where normal patterns form compact clusters, enabling rapid domain adaptation. Comprehensive evaluation across 10 datasets spanning industrial and medical domains demonstrates state-of-the-art performance, outperforming existing methods on 6 out of 7 industrial benchmarks with notable improvements on texture-rich datasets (99.2% I-AUROC on DTD-Synthetic, 97.2% on BTAD) and pixel-level localization (96.3% P-AUROC on MVTec-AD). The framework eliminates dependence on vision-language alignments or model ensembles, making it valuable for resource-constrained environments requiring rapid deployment.
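
To make the contrast between discarding and soft-weighting concrete, here is a minimal sketch of IQR-based confidence weighting. It assumes each training sample has a scalar uncertainty score (for example a per-sample loss); the 1.5 x IQR fence, the decay formula, and the `floor` parameter are illustrative assumptions, not the weighting used by CoZAD.

```python
import statistics


def soft_confidence_weights(scores, k=1.5, floor=0.1):
    """Assign a weight in [floor, 1] to every sample instead of discarding outliers.

    scores: per-sample uncertainty scores (higher = less confident); needs >= 2 values.
    k, floor: hypothetical hyperparameters chosen for illustration.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    upper = q3 + k * iqr  # Tukey-style upper fence
    weights = []
    for s in scores:
        if s <= upper or iqr == 0:
            weights.append(1.0)  # confident sample: full weight
        else:
            # Uncertain sample: the weight decays with distance beyond the fence,
            # but never reaches zero, so boundary samples still contribute.
            excess = (s - upper) / iqr
            weights.append(max(floor, 1.0 / (1.0 + excess)))
    return weights
```

A hard confident-learning filter would instead drop every sample above the fence; the soft variant keeps them with reduced influence.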

[183] SCOUT: Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection

Weiqi Yan,Lvhai Chen,Shengchuan Zhang,Yan Zhang,Liujuan Cao

Main category: cs.CV

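TL;DR: This paper proposes SCOUT, a semi-supervised camouflaged object detection method. With an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM), it makes effective use of unlabeled data and achieves state-of-the-art performance on the new RefTextCOD dataset.

Details Motivation: Pixel-level annotation for camouflaged object detection (COD) is costly, and existing semi-supervised methods leave room for improvement in how they use unlabeled data.

Contribution: 1. Designs the Adaptive Data Augment and Selection (ADAS) module and the Text Fusion Module (TFM); 2. Builds a new dataset, RefTextCOD.

Method: The ADAS module selects valuable data through an adversarial augmentation and sampling strategy; the TFM further exploits the selected data by combining camouflage-related knowledge with text-visual interaction.

Result: Surpasses existing semi-supervised methods on the RefTextCOD dataset, achieving state-of-the-art performance.

Insight: Text information and adaptive data selection can significantly improve semi-supervised camouflaged object detection.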

Abstract: The difficulty of pixel-level annotation has significantly hindered the development of the Camouflaged Object Detection (COD) field. To save on annotation costs, previous works leverage the semi-supervised COD framework that relies on a small number of labeled data and a large volume of unlabeled data. We argue that there is still significant room for improvement in the effective utilization of unlabeled data. To this end, we introduce SCOUT (Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection). It includes an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM). The ADAS module selects valuable data for annotation through an adversarial augmentation and sampling strategy. The TFM module further leverages the selected valuable data by combining camouflage-related knowledge and text-visual interaction. To support this work, we build a new dataset, namely RefTextCOD. Extensive experiments show that the proposed method surpasses previous semi-supervised methods in the COD field and achieves state-of-the-art performance. Our code will be released at https://github.com/Heartfirey/SCOUT.

[184] Alternating Training-based Label Smoothing Enhances Prompt Generalization

Yang Chen,Yanbin Wei,Ke Jin,Yi Kong,James Kwok,Yu Zhang

Main category: cs.CV

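TL;DR: The paper proposes Alternating Training-based Label Smoothing (ATLaS), which combines class-wise and instance-wise soft labels to improve the generalization of prompt tuning.

Details Motivation: Prompt tuning is a parameter-efficient fine-tuning method, but its generalization is limited. Label smoothing improves model generalization, yet works poorly when combined directly with prompt tuning.

Contribution: 1. Proposes ATLaS, which alternates between hard and soft labels during training; 2. Designs Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL); 3. Validates the generalization gains of ATLaS and its compatibility with existing methods.

Method: ATLaS alternately trains the prompt with hard labels and soft labels; CSL and ISL supply inter-class and instance-class relationships.

Result: Experiments show that ATLaS consistently improves the generalization of prompt tuning and is compatible with existing methods.

Insight: Combining label smoothing with prompt tuning requires dynamic adjustment rather than direct application; alternating training is an effective way to do this.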

Abstract: Recent advances in pre-trained vision-language models have demonstrated remarkable zero-shot generalization capabilities. To further enhance these models’ adaptability to various downstream tasks, prompt tuning has emerged as a parameter-efficient fine-tuning method. However, despite its efficiency, the generalization ability of the learned prompts remains limited. In contrast, label smoothing (LS) has been widely recognized as an effective regularization technique that prevents models from becoming over-confident and improves their generalization. This inspires us to explore the integration of LS with prompt tuning. However, we have observed that vanilla LS even weakens the generalization ability of prompt tuning. To address this issue, we propose the Alternating Training-based Label Smoothing (ATLaS) method, which alternately trains with standard one-hot labels and soft labels generated by LS to supervise the prompt tuning. Moreover, we introduce two types of efficient offline soft labels, Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), to provide inter-class or instance-class relationships for prompt tuning. The theoretical properties of the proposed ATLaS method are analyzed. Extensive experiments demonstrate that the proposed ATLaS method, combined with CSL and ISL, consistently enhances the generalization performance of prompt tuning. Moreover, the proposed ATLaS method exhibits high compatibility with prevalent prompt tuning methods, enabling seamless integration into existing methods.
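
The alternation between one-hot and smoothed targets can be sketched as follows. The smoothing formula is standard uniform label smoothing; the step-parity schedule and the function names are assumptions for illustration, since the abstract does not say at what granularity ATLaS alternates, and the CSL/ISL soft labels are not modeled here.

```python
def one_hot(label, num_classes):
    """Hard target: 1.0 at the true class, 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def smooth_label(label, num_classes, eps=0.1):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    return [(1.0 - eps) * v + eps / num_classes for v in one_hot(label, num_classes)]


def target_for_step(step, label, num_classes, eps=0.1):
    """Assumed schedule: even steps use hard labels, odd steps use soft labels."""
    if step % 2 == 0:
        return one_hot(label, num_classes)
    return smooth_label(label, num_classes, eps)
```

In a training loop, `target_for_step(step, y, K)` would replace the fixed target passed to the cross-entropy loss, so that the prompt is supervised by hard and soft labels in turn.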

[185] VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference

Pengfei Jiang,Hanjun Li,Linglan Zhao,Fei Chao,Ke Yan,Shouhong Ding,Rongrong Ji

Main category: cs.CV

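TL;DR: The paper proposes VISA, a method that uses graph summarization to select and aggregate visual tokens group by group in multimodal large language models (MLLMs), improving inference efficiency.

Details Motivation: Existing visual token pruning methods tend to lose information, which hurts model performance. VISA aims to substantially reduce the number of visual tokens while preserving more visual information through smarter token selection and aggregation.

Contribution: 1. Proposes a graph-based visual token aggregation (VTA) module that merges information from removed tokens into kept tokens according to semantic similarity; 2. Introduces a group-wise token selection (GTS) strategy in which text tokens guide the grouping and aggregation of visual tokens.

Method: 1. The VTA module treats visual tokens as graph nodes, builds a graph from their semantic similarity, and aggregates information over it; 2. The GTS strategy splits visual tokens into kept and removed sets and aggregates information progressively, group by group, which improves stability.

Result: With LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks, VISA consistently outperforms previous methods and achieves a better trade-off between model performance and inference speed.

Insight: Combining graph-based aggregation with group-wise selection gives an efficient, information-preserving way to handle visual tokens and suggests a new direction for optimizing multimodal models.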

Abstract: In this study, we introduce a novel method called group-wise VIsual token Selection and Aggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method can preserve more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens based on this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) to divide visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed. The code is available at https://github.com/mobiushy/VISA.
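
A simplified sketch of the aggregation step: every removed token is merged into the kept token it is most similar to, so its information is folded in rather than discarded. Cosine similarity, nearest-neighbour assignment, and mean pooling are assumptions chosen for brevity; the abstract only states that a similarity graph is used, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def aggregate_tokens(tokens, keep_idx):
    """Merge each removed token into its most similar kept token by averaging.

    tokens: list of feature vectors; keep_idx: indices of tokens to keep.
    Returns one merged vector per kept token, in keep_idx order.
    """
    keep_idx = list(keep_idx)
    keep_set = set(keep_idx)
    groups = {i: [tokens[i]] for i in keep_idx}
    for j, tok in enumerate(tokens):
        if j in keep_set:
            continue
        # Edge of the (implicit) similarity graph: removed token -> nearest kept token.
        nearest = max(keep_idx, key=lambda i: cosine(tokens[i], tok))
        groups[nearest].append(tok)
    merged = []
    for i in keep_idx:
        members = groups[i]
        dim = len(tokens[i])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged
```

The output has as many tokens as `keep_idx`, which is what makes the representation more compact than the input while still reflecting the removed tokens.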

[186] AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

Kang Zeng,Guojin Zhong,Jintao Cheng,Jin Yuan,Zhiyong Li

Main category: cs.CV

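TL;DR: The paper proposes AVAM, a universal adaptive visual anchoring strategy that addresses visual redundancy in multi-image question answering, together with a collaborative decoding mechanism that improves MLLM performance.

Details Motivation: In multi-image visual question answering (MVQA), more images introduce substantial visual redundancy that is irrelevant to the question, hurting both accuracy and efficiency. Existing methods lack flexibility and tend to produce discrete visual fragments, which limits the holistic image understanding of MLLMs.

Contribution: Proposes a training-free, universal Adaptive Visual Anchoring strategy (AVAM) that integrates seamlessly into existing MLLMs, plus a collaborative decoding mechanism that balances the results from global and compressed visual inputs.

Method: AVAM reduces visual redundancy through adaptive compression and introduces a collaborative decoding mechanism to balance the outputs obtained from global and compressed visual inputs.

Result: Experiments show consistent performance improvements across various MLLMs.

Insight: Combining adaptive visual compression with collaborative decoding effectively addresses visual redundancy in MVQA, improving accuracy and efficiency.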

Abstract: The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single to Multi Image VQA (MVQA). However, the increased number of images in MVQA inevitably introduces substantial visual redundancy that is irrelevant to question answering, negatively impacting both accuracy and efficiency. Existing methods that address this issue lack flexibility in controlling the number of compressed visual tokens and tend to produce discrete visual fragments, which hinder MLLMs’ ability to comprehend images holistically. In this paper, we propose a straightforward yet universal Adaptive Visual Anchoring strategy, which can be seamlessly integrated into existing MLLMs, offering significant accuracy improvements through adaptive compression. Meanwhile, to balance the results derived from both global and compressed visual input, we further introduce a novel collaborative decoding mechanism, enabling optimal performance. Extensive experiments validate the effectiveness of our method, demonstrating consistent performance improvements across various MLLMs. The code will be publicly available.

[187] Camera Pose Refinement via 3D Gaussian Splatting

Lulu Hao,Lipu Zhou,Zhenzhong Wei,Xu Wang

Main category: cs.CV

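TL;DR: Proposes GS-SMC, a camera pose refinement framework based on 3D Gaussian Splatting (3DGS). By rendering multiple views and applying geometric constraints, it significantly improves pose accuracy without additional training or fine-tuning.

Details Motivation: Existing camera pose refinement methods rely on specific descriptors or dedicated networks, which requires reconstructing the scene or retraining the model, while methods without geometric constraints are less accurate. The widespread use of 3DGS makes a lightweight, general refinement method possible.

Contribution: Proposes GS-SMC, which renders multiple views from an existing 3DGS model and iteratively refines the camera pose under geometric constraints. It allows a flexible choice of feature extractors and matchers and needs no additional training.

Method: Renders multiple views with a 3DGS model and applies an iterative optimization based on epipolar geometric constraints, established by feature matching between the query image and the rendered images.

Result: Median translation and rotation errors are reduced by 53.3% and 56.9% on 7-Scenes and by 40.7% and 53.2% on Cambridge Landmarks, outperforming existing methods.

Insight: The generality and rendering capability of 3DGS, combined with geometric constraints, provide an efficient and flexible approach to camera pose refinement that applies to diverse scenes.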

Abstract: Camera pose refinement aims at improving the accuracy of initial pose estimation for applications in 3D computer vision. Most refinement approaches rely on 2D-3D correspondences with specific descriptors or dedicated networks, requiring reconstructing the scene again for a different descriptor or fully retraining the network for each scene. Some recent methods instead infer pose from feature similarity, but their lack of geometry constraints results in less accuracy. To overcome these limitations, we propose a novel camera pose refinement framework leveraging 3D Gaussian Splatting (3DGS), referred to as GS-SMC. Given the widespread usage of 3DGS, our method can employ an existing 3DGS model to render novel views, providing a lightweight solution that can be directly applied to diverse scenes without additional training or fine-tuning. Specifically, we introduce an iterative optimization approach, which refines the camera pose using epipolar geometric constraints among the query and multiple rendered images. Our method allows flexibly choosing feature extractors and matchers to establish these constraints. Extensive empirical evaluations on the 7-Scenes and the Cambridge Landmarks datasets demonstrate that our method outperforms state-of-the-art camera pose refinement approaches, achieving 53.3% and 56.9% reductions in median translation and rotation errors on 7-Scenes, and 40.7% and 53.2% on Cambridge.
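
For readers unfamiliar with epipolar constraints, the sketch below shows the textbook residual that such a refinement could minimize: for a relative pose (R, t) with X2 = R X1 + t, matched normalized image points satisfy x2^T E x1 = 0, where E = [t]_x R. This is the standard essential-matrix formulation, not necessarily the exact cost used by GS-SMC, and the function names are illustrative.

```python
def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def essential_matrix(R, t):
    """E = [t]_x R for the relative pose X2 = R X1 + t."""
    return matmul3(skew(t), R)


def epipolar_residual(E, x1, x2):
    """Algebraic residual x2^T E x1 for homogeneous normalized points (x, y, 1)."""
    Ex1 = [sum(E[i][k] * x1[k] for k in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))
```

A pose refinement loop would sum squared residuals over all matches between the query image and each rendered view, then adjust the query pose to reduce that sum.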

[188] Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection

Dabbrata Das,Mahshar Yahan,Md Tareq Zaman,Md Rishadul Bayesh

Main category: cs.CV

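TL;DR: Proposes a hybrid framework that combines an edge-enhancement module with a Vision Transformer to detect AI-generated images efficiently, with strong results.

Details Motivation: Rapid progress in generative models has made AI-generated images highly realistic, posing challenges for digital forensics and content authentication. Conventional methods rely on global features, overlook subtle structural inconsistencies, and are computationally expensive.

Contribution: 1. Proposes a hybrid framework combining a fine-tuned Vision Transformer with a novel edge-based processing module. 2. The edge module computes the variance of edge-difference maps taken before and after smoothing, capturing the texture and noise characteristics of AI-generated images. 3. The method performs well on multiple datasets and is computationally efficient.

Method: 1. A fine-tuned ViT extracts global features. 2. An edge module analyzes the variance of edge-difference maps before and after smoothing to focus on structural inconsistencies. 3. The edge module is applied as a post-processing step on ViT predictions to increase sensitivity to fine-grained structure.

Result: Evaluated on the CIFAKE, Artistic, and Custom Curated datasets; on CIFAKE it reaches 97.75% accuracy and a 97.77% F1-score, outperforming existing methods.

Insight: AI-generated images typically have smoother textures, weaker edges, and less noise; combining global and local features improves detection.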

Abstract: The rapid advancement of generative models has led to a growing prevalence of highly realistic AI-generated images, posing significant challenges for digital forensics and content authentication. Conventional detection methods mainly rely on deep learning models that extract global features, which often overlook subtle structural inconsistencies and demand substantial computational resources. To address these limitations, we propose a hybrid detection framework that combines a fine-tuned Vision Transformer (ViT) with a novel edge-based image processing module. The edge-based module computes variance from edge-difference maps generated before and after smoothing, exploiting the observation that AI-generated images typically exhibit smoother textures, weaker edges, and reduced noise compared to real images. When applied as a post-processing step on ViT predictions, this module enhances sensitivity to fine-grained structural cues while maintaining computational efficiency. Extensive experiments on the CIFAKE, Artistic, and Custom Curated datasets demonstrate that the proposed framework achieves superior detection performance across all benchmarks, attaining 97.75% accuracy and a 97.77% F1-score on CIFAKE, surpassing widely adopted state-of-the-art models. These results establish the proposed method as a lightweight, interpretable, and effective solution for both still images and video frames, making it highly suitable for real-world applications in automated content verification and digital forensics.
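
The edge-based statistic can be sketched on a grayscale image stored as a list of rows. The finite-difference edge detector and the 3x3 box blur below are stand-ins chosen for brevity; the abstract does not say which edge operator or smoothing filter the authors use, and the function names are hypothetical.

```python
import statistics


def box_blur(img):
    """3x3 mean filter with the window clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def edge_map(img):
    """Edge strength |gx| + |gy| from differences of neighbouring pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            out[y][x] = abs(gx) + abs(gy)
    return out


def edge_difference_variance(img):
    """Variance of (edges before smoothing) - (edges after smoothing)."""
    before = edge_map(img)
    after = edge_map(box_blur(img))
    diffs = [b - a for rb, ra in zip(before, after) for b, a in zip(rb, ra)]
    return statistics.pvariance(diffs)
```

Under the abstract's observation that generated images are smoother and less noisy, smoothing should change their edges less, giving a lower variance than for real photographs; how that score is combined with the ViT prediction is not specified in the abstract.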

[189] UniAPO: Unified Multimodal Automated Prompt Optimization

Qipeng Zhu,Yanzhe Chen,Huasong Zhong,Yan Li,Jie Chen,Zhixin Zhang,Junping Zhang,Zhenheng Yang

Main category: cs.CV

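TL;DR: UniAPO is a unified multimodal automatic prompt optimization framework. By decoupling feedback modeling from prompt refinement, it addresses visual token inflation and the lack of process-level supervision, and performs well on multimodal tasks.

Details Motivation: Existing automatic prompt optimization methods mainly target text-only tasks. Multimodal tasks such as video-language generation bring visual token inflation and a lack of process-level supervision, calling for a more efficient, unified optimization framework.

Contribution: Proposes UniAPO, the first automatic prompt optimization framework tailored to multimodal tasks, which uses an EM-inspired optimization process and a historical memory mechanism to address visual token inflation and insufficient process-level supervision.

Method: Adopts an EM-inspired optimization process that decouples feedback modeling from prompt refinement, and introduces a short-long term memory mechanism that mitigates context limitations and provides directional guidance.

Result: UniAPO achieves consistent gains on text, image, and video tasks, establishing a unified framework for efficient and transferable prompt optimization.

Insight: Decoupling feedback from optimization and adding historical memory are effective directions for multimodal prompt optimization, giving more stable and goal-driven optimization for complex tasks.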

Abstract: Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks, such as video-language generation introduces two core challenges: (i) visual token inflation, where long visual token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.

[190] Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation

Konstantin Egorov,Stepan Botman,Pavel Blinov,Galina Zubkova,Anton Ivaschenko,Alexander Kolsanov,Andrey Savchenko

Main category: cs.CV

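TL;DR: The paper introduces a new large-scale multi-view video dataset for remote photoplethysmography (rPPG) and health biomarker estimation, aimed at the small size, privacy concerns, and limited diversity of existing datasets.

Details Motivation: Existing rPPG datasets are small and lack diversity, which limits progress. The paper introduces a large multi-view video dataset to address this and to advance AI medical assistants.

Contribution: Presents a large-scale dataset of 3600 synchronized video recordings from 600 subjects, paired with multiple health metrics (such as PPG signals, electrocardiogram, and blood pressure), and trains an efficient rPPG model.

Method: Videos are captured with multiple consumer-grade cameras at different angles, synchronized with a 100 Hz PPG signal and other health metrics. An rPPG model is trained on this data and compared with existing approaches in cross-dataset scenarios.

Result: The proposed model performs well in cross-dataset tests, supporting the usefulness of the dataset and the generalization of the model.

Insight: Diverse data and multimodal physiological signals help improve the robustness and accuracy of rPPG models and provide an important foundation for AI medical applications.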

Abstract: Progress in remote PhotoPlethysmoGraphy (rPPG) is limited by the critical issues of existing publicly available datasets: small size, privacy concerns with facial videos, and lack of diversity in conditions. The paper introduces a novel comprehensive large-scale multi-view video dataset for rPPG and health biomarkers estimation. Our dataset comprises 3600 synchronized video recordings from 600 subjects, captured under varied conditions (resting and post-exercise) using multiple consumer-grade cameras at different angles. To enable multimodal analysis of physiological states, each recording is paired with a 100 Hz PPG signal and extended health metrics, such as electrocardiogram, arterial blood pressure, biomarkers, temperature, oxygen saturation, respiratory rate, and stress level. Using this data, we train an efficient rPPG model and compare its quality with existing approaches in cross-dataset scenarios. The public release of our dataset and model should significantly speed up the progress in the development of AI medical assistants.

[191] See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops

Zixuan Dong,Baoyun Peng,Yufei Wang,Lin Liu,Xinxin Dong,Yunlong Cao,Xiaodong Wang

Main category: cs.CV

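TL;DR: CAVIA is a training-free video understanding framework that dynamically coordinates reasoning and perception to extract query-adaptive visual information, significantly improving video question answering performance.

Details Motivation: Current long-form video question answering systems typically use rigid pipelines that decouple reasoning from perception, causing information loss or computational inefficiency. CAVIA aims to improve video understanding by coordinating the two dynamically.

Contribution: Proposes CAVIA, a closed-loop reasoning-perception system with three innovations: hierarchical reasoning-guided localization, cross-modal semantic bridging, and confidence-driven iterative synthesis.

Method: Reasoning guides perception: CAVIA localizes key frames hierarchically, bridges semantics across modalities for targeted extraction, and iteratively refines the final answer based on confidence.

Result: State-of-the-art results on EgoSchema (65.7%, +5.3%), NExT-QA (76.1%, +2.6%), and IntentQA (73.8%, +6.9%).

Insight: Dynamic coordination of reasoning and perception is key to efficient video understanding, and query-adaptive visual extraction can significantly improve task performance.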

Abstract: Human video comprehension demonstrates dynamic coordination between reasoning and visual attention, adaptively focusing on query-relevant details. However, current long-form video question answering systems employ rigid pipelines that decouple reasoning from perception, leading to either information loss through premature visual abstraction or computational inefficiency through exhaustive processing. The core limitation lies in the inability to adapt visual extraction to specific reasoning requirements: different queries demand fundamentally different visual evidence from the same video content. In this work, we present CAVIA, a training-free framework that revolutionizes video understanding through reasoning-perception coordination. Unlike conventional approaches where visual processing operates independently of reasoning, CAVIA creates a closed-loop system where reasoning continuously guides visual extraction based on identified information gaps. CAVIA introduces three innovations: (1) hierarchical reasoning-guided localization to precise frames; (2) cross-modal semantic bridging for targeted extraction; (3) confidence-driven iterative synthesis. CAVIA achieves state-of-the-art performance on challenging benchmarks: EgoSchema (65.7%, +5.3%), NExT-QA (76.1%, +2.6%), and IntentQA (73.8%, +6.9%), demonstrating that dynamic reasoning-perception coordination provides a scalable paradigm for video understanding.

[192] SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization

Junyuan Deng,Heng Li,Tao Xie,Weiqiang Ren,Qian Zhang,Ping Tan,Xiaoyang Guo

Main category: cs.CV

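TL;DR: SAIL-Recon is a Transformer-based method for large-scale SfM that augments scene regression with visual localization, substantially improving the ability to handle large numbers of input images.

Details Motivation: Scene regression methods such as VGGT handle images with extreme viewpoint changes well but struggle with large numbers of input images. SAIL-Recon aims to solve this problem.

Contribution: SAIL-Recon augments a scene regression network with visual localization capabilities, scaling efficiently to large-scale SfM and reaching state-of-the-art results on camera pose estimation and novel view synthesis.

Method: A neural scene representation is first computed from a subset of anchor images; the regression network is then fine-tuned to reconstruct all input images conditioned on this representation. The core idea is to combine a feed-forward Transformer with visual localization.

Result: State-of-the-art results on benchmarks including TUM-RGBD, CO3Dv2, and Tanks & Temples.

Insight: Combining scene regression with visual localization is an effective route to large-scale SfM and shows the potential of Transformers in this area.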

Abstract: Scene regression methods, such as VGGT, solve the Structure-from-Motion (SfM) problem by directly regressing camera poses and 3D scene structures from input images. They demonstrate impressive performance in handling images under extreme viewpoint changes. However, these methods struggle to handle a large number of input images. To address this problem, we introduce SAIL-Recon, a feed-forward Transformer for large-scale SfM, by augmenting the scene regression network with visual localization capabilities. Specifically, our method first computes a neural scene representation from a subset of anchor images. The regression network is then fine-tuned to reconstruct all input images conditioned on this neural scene representation. Comprehensive experiments show that our method not only scales efficiently to large-scale scenes, but also achieves state-of-the-art results on both camera pose estimation and novel view synthesis benchmarks, including TUM-RGBD, CO3Dv2, and Tanks & Temples. Code and models are publicly available at: https://hkust-sail.github.io/sail-recon/.

[193] Enhanced Drift-Aware Computer Vision Architecture for Autonomous Driving

Md Shahi Amran Hossain,Abu Shad Ahammed,Sayeri Mukherjee,Roman Obermaisser

Main category: cs.CV

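TL;DR: The paper proposes a dual-mode computer vision architecture for autonomous driving that combines YOLOv8 with a five-layer CNN to improve object detection accuracy under data drift; the reported accuracy improvement exceeds 90%.

Details Motivation: Computer vision systems for autonomous driving are prone to data drift in challenging scenarios such as adverse weather or low lighting, which degrades model performance and creates safety risks. The ISO 8800 norm provides a relevant framework, but practical applications still need improvement.

Contribution: Proposes a hybrid architecture that combines YOLOv8 (fast detection) with a five-layer CNN (verification), uses synthetic data to improve robustness, and substantially improves detection accuracy under data drift.

Method: A dual-mode framework runs YOLOv8 for fast detection and then a five-layer CNN for verification. The training data includes thousands of synthetic road-environment images to cope with data drift.

Result: On drift-augmented road images, detection accuracy improved by more than 90%, supporting the effectiveness of the hybrid architecture.

Insight: A hybrid architecture can markedly improve the robustness and safety of autonomous driving systems in complex environments, particularly under data drift.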

Abstract: Computer vision for automotive applications is an active research area in which safety and security are primary concerns. In particular, for autonomous driving, preventing road accidents requires highly accurate object detection under diverse conditions. To address this issue, the International Organization for Standardization (ISO) recently released the 8800 norm, which provides a structured framework for managing the associated AI-related risks. However, challenging scenarios such as adverse weather or low lighting often introduce data drift, leading to degraded model performance and potential safety violations. In this work, we present a novel hybrid computer vision architecture trained with thousands of synthetic images of the road environment to improve robustness in unseen drifted environments. Our dual-mode framework uses YOLO version 8 for swift detection and incorporates a five-layer CNN for verification. The two stages run in sequence and improved the detection accuracy by more than 90% when tested with drift-augmented road images. The aim is to demonstrate how such a hybrid structure can improve road safety.
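
The sequential detect-then-verify flow can be sketched independently of any particular model. Here `detector` and `verifier` are hypothetical callables standing in for the YOLOv8 stage and the five-layer CNN; their interfaces, the dictionary layout, and the `accept` threshold are assumptions for illustration, not the authors' API.

```python
def detect_and_verify(image, detector, verifier, accept=0.5):
    """Run a fast detector, then keep only detections the verifier confirms.

    detector(image) -> iterable of dicts with at least a "box" entry (assumed format).
    verifier(image, box) -> confidence in [0, 1] that the box holds a real object.
    """
    verified = []
    for det in detector(image):
        score = verifier(image, det["box"])
        if score >= accept:
            # Keep the original detection fields and record the verification score.
            verified.append({**det, "verify_score": score})
    return verified
```

Running the verifier only on the detector's proposals keeps the second stage cheap, since it never scans the full image.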

[194] Propose and Rectify: A Forensics-Driven MLLM Framework for Image Manipulation Localization

Keyang Zhang,Chenqi Kong,Hui Liu,Bo Ding,Xinghao Jiang,Haoliang Li

Main category: cs.CV

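TL;DR: The paper proposes Propose-Rectify, a framework that combines a multimodal large language model (MLLM) with forensic analysis for image manipulation detection and localization. A proposal stage and a rectification stage fuse semantic reasoning with forensic feature analysis, significantly improving detection and localization of manipulated regions.

Details Motivation: Image manipulation techniques are increasingly sophisticated. Existing MLLMs can use semantic understanding for manipulation detection but perceive low-level forensic artifacts poorly, leading to inaccurate localization. A method that combines semantic reasoning with forensic analysis is therefore needed.

Contribution: 1) Proposes the Propose-Rectify framework, which combines the semantic reasoning of a LLaVA model with forensic feature analysis to improve manipulation detection and localization. 2) Designs a Forensics Rectification Module that validates and refines initial proposals through multi-scale feature analysis. 3) Proposes an Enhanced Segmentation Module that injects forensic cues into the SAM model to overcome semantic bias and delineate manipulated regions precisely.

Method: 1) Proposal stage: a forensic-adapted LLaVA model produces an initial manipulation analysis and preliminary localization of suspicious regions. 2) Rectification stage: the Forensics Rectification Module validates and refines the initial proposals using multi-scale forensic features. 3) Enhanced Segmentation Module: forensic cues are incorporated into the SAM model for precise segmentation of manipulated regions.

Result: Experiments show strong performance across multiple datasets, with high robustness and generalization, outperforming existing methods.

Insight: The novelty of the framework lies in combining multimodal semantic understanding with forensic feature analysis, compensating for the weak perception of low-level features in MLLMs and offering a more complete solution for image manipulation detection.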

Abstract: The increasing sophistication of image manipulation techniques demands robust forensic solutions that can both reliably detect alterations and precisely localize tampered regions. Recent Multimodal Large Language Models (MLLMs) show promise by leveraging world knowledge and semantic understanding for context-aware detection, yet they struggle with perceiving subtle, low-level forensic artifacts crucial for accurate manipulation localization. This paper presents a novel Propose-Rectify framework that effectively bridges semantic reasoning with forensic-specific analysis. In the proposal stage, our approach utilizes a forensic-adapted LLaVA model to generate initial manipulation analysis and preliminary localization of suspicious regions based on semantic understanding and contextual reasoning. In the rectification stage, we introduce a Forensics Rectification Module that systematically validates and refines these initial proposals through multi-scale forensic feature analysis, integrating technical evidence from several specialized filters. Additionally, we present an Enhanced Segmentation Module that incorporates critical forensic cues into SAM’s encoded image embeddings, thereby overcoming inherent semantic biases to achieve precise delineation of manipulated regions. By synergistically combining advanced multimodal reasoning with established forensic methodologies, our framework ensures that initial semantic proposals are systematically validated and enhanced through concrete technical evidence, resulting in comprehensive detection accuracy and localization precision. Extensive experimental validation demonstrates state-of-the-art performance across diverse datasets with exceptional robustness and generalization capabilities.

[195] Development of a Neural Network Model for Currency Detection to aid visually impaired people in Nigeria

Sochukwuma Nwokoye,Desmond Moru

Main category: cs.CV

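TL;DR: The study develops a neural network model for detecting Nigerian banknotes to help visually impaired people identify currency. Using a custom dataset of 3,468 images to train an SSD neural network model, the system achieves a mean Average Precision (mAP) above 90%.

Details Motivation: Visually impaired people have difficulty identifying currency in commercial transactions, and effective solutions are lacking. The study applies neural networks to fill this gap.

Contribution: 1. Builds a custom dataset for Nigerian banknote detection; 2. Proposes a currency recognition system based on an SSD neural network; 3. Reports high accuracy (mAP > 90%), supporting its practical use.

Method: 1. Collects and annotates 3,468 images of Nigerian banknotes; 2. Trains an SSD (Single Shot MultiBox Detector) neural network; 3. Evaluates the mean Average Precision (mAP) of the model.

Result: The system achieves an mAP above 90% on the currency recognition task, indicating high accuracy and practical value.

Insight: The pattern recognition ability of neural networks applies well to assistive technology and offers practical solutions for visually impaired people. The study illustrates the potential of AI to improve quality of life.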

Abstract: Neural networks in assistive technology for the visually impaired leverage artificial intelligence’s capacity to recognize patterns in complex data. They are used for converting visual data into auditory or tactile representations, helping the visually impaired understand their surroundings. The primary aim of this research is to explore the potential of artificial neural networks to facilitate the differentiation of various forms of cash for individuals with visual impairments. In this study, we built a custom dataset of 3,468 images, which was subsequently used to train an SSD neural network model. The proposed system can accurately identify Nigerian cash, thereby streamlining commercial transactions. The performance of the system in terms of accuracy was assessed, and the Mean Average Precision score was over 90%. We believe that our system has the potential to make a substantial contribution to the field of assistive technology while also improving the quality of life of visually challenged persons in Nigeria and beyond.

[196] FCR: Investigating Generative AI models for Forensic Craniofacial Reconstruction

Ravi Shankar Prasad,Dinesh Singh

Main category: cs.CV

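TL;DR: The paper proposes a craniofacial reconstruction approach based on generative adversarial networks (GANs). It is the first to use 2D X-ray images as input, fine-tuning the generator and discriminator to generate realistic images across two domains (skull and face), offering an efficient tool for forensic science.

Details Motivation: Traditional craniofacial reconstruction methods such as clay modeling require expert knowledge and are time-consuming, while existing probabilistic generative models fail to capture cross-domain attributes of the skull and face. The authors therefore propose an automated, efficient generative framework.

Contribution: 1. First use of 2D X-ray images for craniofacial reconstruction with generative models; 2. A cross-domain generation framework that fine-tunes the generator and discriminator; 3. A retrieval system based on the generated images, used to assess practical value for forensic science.

Method: Uses generative models such as CycleGAN and cGAN, fine-tuning the generator and discriminator to generate realistic face images from 2D X-ray images, and evaluates generation quality with FID, IS, and SSIM scores.

Result: Experiments show good results in generated image quality (FID, IS, SSIM) and retrieval performance, supporting the effectiveness of the method for forensic science.

Insight: Generative models can substantially increase the automation and efficiency of craniofacial reconstruction and may find broad use in forensics, medicine, and related fields.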

Abstract: Craniofacial reconstruction in forensics is one of the processes used to identify victims of crime and natural disasters. Identifying an individual from their remains plays a crucial role when all other identification methods fail. Traditional methods for this task, such as clay-based craniofacial reconstruction, require expert domain knowledge and are time-consuming. At the same time, other probabilistic generative models like the statistical shape model or the Basel face model fail to capture the skull and face cross-domain attributes. Given these limitations, we propose a generic framework for craniofacial reconstruction from 2D X-ray images. We use various generative models (e.g., CycleGANs and cGANs) and fine-tune the generator and discriminator to generate more realistic images in two distinct domains, the skull and the face of an individual. This is the first time that 2D X-rays have been used as a representation of the skull by generative models for craniofacial reconstruction. We evaluate the quality of the generated faces using FID, IS, and SSIM scores. Finally, we propose a retrieval framework in which the query is the generated face image and the gallery is a database of real faces. Our experimental results indicate that this can be an effective tool for forensic science.

[197] Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Yaqi Li,Peng Chen,Mingyang Han,Bu Pi,Haoxiang Shi,Runzhou Zhao,Yang Yao,Xuan Zhang,Jun Song

Main category: cs.CV

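TL;DR: The paper proposes Visual-CoG, a paradigm that combines multi-stage reinforcement learning with stage-aware reward signals to address the limited ability of existing text-to-image (T2I) models to handle multi-attribute and ambiguous prompts.

Details Motivation: Existing autoregressive models handle multi-attribute and ambiguous prompts poorly in T2I generation, and reinforcement learning rewards are given only at the end of the generation stage, which makes intermediate steps hard to optimize.

Contribution: Proposes the Visual-CoG paradigm with three stages (semantic reasoning, process refining, and outcome evaluation), introduces stage-aware reward signals, and constructs the visual cognition benchmark VisCog-Bench.

Method: A multi-stage reinforcement learning framework in which every stage receives an immediate reward signal that guides the generation process. VisCog-Bench, with four subtasks, is designed to evaluate semantic reasoning.

Result: Improvements of 15%, 5%, and 19% on GenEval, T2I-CompBench, and VisCog-Bench, respectively.

Insight: Stage-aware rewards make it possible to optimize each stage of generation more precisely, which improves overall performance and output quality.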

Abstract: Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.

[198] ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation

Jianwen Tan,Huiyao Zhang,Rui Xiong,Han Zhou,Hongfei Wang,Ye Li

Main category: cs.CV

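TL;DR: The paper proposes ArgusCogito, a zero-shot framework that tackles camouflaged object segmentation (COS) through cross-modal synergy and omnidirectional reasoning, and reaches SOTA performance on multiple benchmarks.

Details Motivation: COS is highly challenging because targets closely resemble their backgrounds. Existing methods underperform because of shallow feature representations, inadequate reasoning mechanisms, and weak cross-modal integration.

Contribution: 1. Proposes ArgusCogito, a zero-shot framework built on cognitively inspired strategies; 2. Achieves cross-modal synergy and omnidirectional reasoning through three cognitive stages (Conjecture, Focus, Sculpting); 3. Achieves SOTA results on COS and Medical Image Segmentation (MIS) benchmarks.

Method: 1. Conjecture: builds a global cognitive prior through cross-modal fusion (RGB, depth, semantic maps); 2. Focus: performs attention-driven target localization guided by the semantic priors; 3. Sculpting: generates dense positive/negative point prompts and iteratively refines the segmentation mask.

Result: SOTA performance on four COS benchmarks and three MIS benchmarks, validating the framework's efficacy, generalization, and robustness.

Insight: By imitating human cognitive strategies (global observation, focusing, fine refinement), ArgusCogito offers a new approach to cross-modal tasks and is particularly suited to segmentation in complex scenes.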

Abstract: Camouflaged Object Segmentation (COS) poses a significant challenge due to the intrinsic high similarity between targets and backgrounds, demanding models capable of profound holistic understanding beyond superficial cues. Prevailing methods, often limited by shallow feature representation, inadequate reasoning mechanisms, and weak cross-modal integration, struggle to achieve this depth of cognition, resulting in prevalent issues like incomplete target separation and imprecise segmentation. Inspired by the perceptual strategy of the Hundred-eyed Giant, which emphasizes holistic observation, omnidirectional focus, and intensive scrutiny, we introduce ArgusCogito, a novel zero-shot, chain-of-thought framework underpinned by cross-modal synergy and omnidirectional reasoning within Vision-Language Models (VLMs). ArgusCogito orchestrates three cognitively-inspired stages: (1) Conjecture: Constructs a strong cognitive prior through global reasoning with cross-modal fusion (RGB, depth, semantic maps), enabling holistic scene understanding and enhanced target-background disambiguation. (2) Focus: Performs omnidirectional, attention-driven scanning and focused reasoning, guided by semantic priors from Conjecture, enabling precise target localization and region-of-interest refinement. (3) Sculpting: Progressively sculpts high-fidelity segmentation masks by integrating cross-modal information and iteratively generating dense positive/negative point prompts within focused regions, emulating Argus’ intensive scrutiny. Extensive evaluations on four challenging COS benchmarks and three Medical Image Segmentation (MIS) benchmarks demonstrate that ArgusCogito achieves state-of-the-art (SOTA) performance, validating the framework’s exceptional efficacy, superior generalization capability, and robustness.

[199] Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images

Kaiyu Li,Xiangyong Cao,Ruixun Liu,Shihong Wang,Zixuan Jiang,Zhi Wang,Deyu Meng

Main category: cs.CV

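TL;DR: The paper proposes SegEarth-OV, the first annotation-free open-vocabulary segmentation framework for remote-sensing images. It addresses the shortcomings of existing methods with SimFeatUp and a Global Bias Alleviation operation, and extends to SAR images through AlignEarth.

Details Motivation: Semantic segmentation of remote-sensing images is essential for Earth observation, but the need to interpret new object categories and the high cost of manual annotation pose challenges. Existing open-vocabulary segmentation frameworks, designed for natural images, cope poorly with the scale variation and fine-grained detail of remote-sensing data, so an efficient annotation-free solution is needed.

Contribution: 1. Proposes SegEarth-OV, the first annotation-free open-vocabulary segmentation framework for remote-sensing images;
2. Designs SimFeatUp, an upsampler that restores high-resolution spatial detail;
3. Proposes the Global Bias Alleviation operation to improve local semantic fidelity;
4. Proposes AlignEarth, which extends the framework to SAR images without training a foundation model from scratch.

Method: 1. Upsamples coarse features with SimFeatUp to restore high-resolution detail;
2. Applies Global Bias Alleviation, which subtracts the global context from patch features;
3. Uses AlignEarth to distill knowledge from a pre-trained optical VLM encoder into a SAR encoder, enabling cross-modal segmentation.

Result: Experiments on optical and SAR datasets show that SegEarth-OV substantially outperforms existing methods, providing an efficient, annotation-free solution for open-world Earth observation.

Insight: 1. Annotation-free open-vocabulary segmentation has great potential in remote sensing;
2. Cross-modal knowledge distillation avoids the high cost of training foundation models;
3. Global Bias Alleviation effectively improves local detail in semantic segmentation.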

Abstract: Semantic segmentation of remote sensing (RS) images is pivotal for comprehensive Earth observation, but the demand for interpreting new object categories, coupled with the high expense of manual annotation, poses significant challenges. Although open-vocabulary semantic segmentation (OVSS) offers a promising solution, existing frameworks designed for natural images are insufficient for the unique complexities of RS data. They struggle with vast scale variations and fine-grained details, and their adaptation often relies on extensive, costly annotations. To address this critical gap, this paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. Specifically, we propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features, correcting distorted target shapes without any task-specific post-training. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features, significantly enhancing local semantic fidelity. These components empower SegEarth-OV to effectively harness the rich semantics of pre-trained VLMs, making OVSS possible in optical RS contexts. Furthermore, to extend the framework’s universality to other challenging RS modalities like SAR images, where large-scale VLMs are unavailable and expensive to create, we introduce AlignEarth, which is a distillation-based strategy and can efficiently transfer semantic knowledge from an optical VLM encoder to an SAR encoder, bypassing the need to build SAR foundation models from scratch and enabling universal OVSS across diverse sensor types. Extensive experiments on both optical and SAR datasets validate that SegEarth-OV can achieve dramatic improvements over the SOTA methods, establishing a robust foundation for annotation-free and open-world Earth observation.
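The abstract describes Global Bias Alleviation as subtracting the global context from patch features. A minimal sketch of that idea follows. It assumes the global context is available as a single vector (for example a pooled or class-token feature) and that it is scaled by a coefficient `lam`. Both the function name and `lam` are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def alleviate_global_bias(patches: List[List[float]],
                          global_feat: List[float],
                          lam: float = 0.3) -> List[List[float]]:
    """Subtract lam * global_feat from every patch feature vector.

    `lam` is a hypothetical strength parameter; the paper's exact
    formulation may differ.
    """
    return [[p - lam * g for p, g in zip(patch, global_feat)]
            for patch in patches]


# Two toy patch features and one toy global feature.
patches = [[1.0, 2.0], [0.5, 0.5]]
global_feat = [1.0, 1.0]
debiased = alleviate_global_bias(patches, global_feat, lam=0.5)
```

With these toy values the second patch, which carried nothing beyond the scaled global vector, collapses to zeros, while the first keeps its distinctive local component.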

[200] EventTracer: Fast Path Tracing-based Event Stream Rendering

Zhenyang Li,Xiaoyang Bai,Jinfan Lu,Pengfei Shen,Edmund Y. Lam,Yifan Peng

Main category: cs.CV

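TL;DR: EventTracer is a fast, path tracing-based event stream rendering method. It combines low-sample path tracing with a lightweight event spiking network to generate high-fidelity event sequences efficiently, substantially raising the temporal resolution of simulated event data.

Details Motivation: Existing event stream simulators rely on noiseless RGB frames that are costly to render, so they only reach a temporal resolution equivalent to 100-300 FPS, far below that of real event data.

Contribution: Proposes EventTracer, an efficient path tracing rendering pipeline that uses low sample-per-pixel (SPP) path tracing and an event spiking network to generate high-fidelity event sequences quickly.

Method: 1. Speeds up rendering with low-SPP path tracing; 2. Trains a lightweight event spiking network, equipped with a BiLIF spiking unit and trained with a bidirectional EMD loss, to denoise the resulting RGB videos into event sequences.

Result: EventTracer runs at about 4 minutes per second of 720p video, retains the spatiotemporal modeling accuracy of path tracing, and captures more scene detail and greater similarity to real event data than other event simulators.

Insight: Through efficient rendering and physics-aware network design, EventTracer makes low-cost generation of large-scale event-RGB datasets feasible and narrows the sim-to-real gap, with applications in robotics, autonomous driving, and VR/AR.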

Abstract: Simulating event streams from 3D scenes has become a common practice in event-based vision research, as it meets the demand for large-scale, high temporal frequency data without setting up expensive hardware devices or undertaking extensive data collections. Yet existing methods in this direction typically work with noiseless RGB frames that are costly to render, and therefore they can only achieve a temporal resolution equivalent to 100-300 FPS, far lower than that of real-world event data. In this work, we propose EventTracer, a path tracing-based rendering pipeline that simulates high-fidelity event sequences from complex 3D scenes in an efficient and physics-aware manner. Specifically, we speed up the rendering process via low sample-per-pixel (SPP) path tracing, and train a lightweight event spiking network to denoise the resulting RGB videos into realistic event sequences. To capture the physical properties of event streams, the network is equipped with a bipolar leaky integrate-and-fired (BiLIF) spiking unit and trained with a bidirectional earth mover distance (EMD) loss. Our EventTracer pipeline runs at a speed of about 4 minutes per second of 720p video, and it inherits the merit of accurate spatiotemporal modeling from its path tracing backbone. We show in two downstream tasks that EventTracer captures better scene details and demonstrates a greater similarity to real-world event data than other event simulators, which establishes it as a promising tool for creating large-scale event-RGB datasets at a low cost, narrowing the sim-to-real gap in event-based vision, and boosting various application scenarios such as robotics, autonomous driving, and VR/AR.
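To give a feel for what a bipolar leaky integrate-and-fire unit does, the sketch below implements a generic textbook-style neuron that emits +1 or -1 spikes, mirroring the two polarities of an event camera. The update rule, the soft reset, and the parameter names are assumptions made for illustration. The abstract does not specify the BiLIF unit at this level of detail.

```python
from typing import List


def bilif(inputs: List[float],
          threshold: float = 1.0,
          leak: float = 0.9) -> List[int]:
    """Generic bipolar leaky integrate-and-fire neuron (illustrative only).

    The membrane potential decays by `leak`, integrates the input, and emits
    +1 (or -1) when it crosses +threshold (or -threshold), after which the
    threshold is subtracted (or added) as a soft reset.
    """
    v = 0.0
    spikes: List[int] = []
    for x in inputs:
        v = leak * v + x
        if v >= threshold:
            spikes.append(1)
            v -= threshold
        elif v <= -threshold:
            spikes.append(-1)
            v += threshold
        else:
            spikes.append(0)
    return spikes


# With leak=1.0 (no decay) the arithmetic is easy to follow by hand.
demo = bilif([0.5, 0.5, -2.0, 0.0], threshold=1.0, leak=1.0)
```

In the demo the potential reaches +1.0 on the second step (positive spike), then drops to -2.0 (negative spike, reset to -1.0), which is still at the negative threshold on the last step and triggers a second negative spike.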

[201] Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem

Zhicong Tang,Tiankai Hang,Shuyang Gu,Dong Chen,Baining Guo

Main category: cs.CV

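TL;DR: The paper proposes three reparameterization techniques (IPMM, IPTM, IPFM) that unify Score-based Generative Models (SGMs, also known as diffusion models) with the Schrödinger Bridge (SB) problem, and initializes SB models from pre-trained diffusion models, which accelerates and stabilizes training while improving both model families.

Details Motivation: Unify SGMs and the SB problem in order to address the slow and unstable training of SB-based models, and use pre-trained SGMs to improve SB model performance.

Contribution: 1. Three reparameterization techniques (IPMM, IPTM, IPFM) that accelerate and stabilize SB model training; 2. Initialization strategies that start SB models from pre-trained diffusion models; 3. Experimental validation of the methods.

Method: 1. Uses IPMM, IPTM, and IPFM to unify SGMs with the SB problem; 2. Initializes SB models from pre-trained SGMs; 3. Evaluates training efficiency and performance gains experimentally.

Result: According to the experiments, the proposed methods significantly accelerate and stabilize SB model training and further improve the performance of SGMs.

Insight: Combining pre-trained diffusion models with the SB problem yields more efficient training and better performance in generative modeling, and opens a direction for future research.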

Abstract: This paper aims to unify Score-based Generative Models (SGMs), also known as Diffusion models, and the Schrödinger Bridge (SB) problem through three reparameterization techniques: Iterative Proportional Mean-Matching (IPMM), Iterative Proportional Terminus-Matching (IPTM), and Iterative Proportional Flow-Matching (IPFM). These techniques significantly accelerate and stabilize the training of SB-based models. Furthermore, the paper introduces novel initialization strategies that use pre-trained SGMs to effectively train SB-based models. By using SGMs as initialization, we leverage the advantages of both SB-based models and SGMs, ensuring efficient training of SB-based models and further improving the performance of SGMs. Extensive experiments demonstrate the significant effectiveness and improvements of the proposed methods. We believe this work contributes to and paves the way for future research on generative models.

[202] Assessing the Noise Robustness of Class Activation Maps: A Framework for Reliable Model Interpretability

Syamantak Sarkar,Revoti P. Bora,Bhupender Kaushal,Sudhish N George,Kiran Raja

Main category: cs.CV

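TL;DR: The paper evaluates how different kinds of noise affect the robustness of Class Activation Maps (CAMs) and proposes a new robustness metric for CAMs built on two properties: consistency and responsiveness.

Details Motivation: CAMs are an important tool for visualizing deep learning models, but their robustness to noise has not been studied thoroughly and needs to be assessed.

Contribution: The main contribution is a new CAM robustness metric that measures both the stability and the sensitivity of CAMs under noise perturbations.

Method: Analyzes the effect of different noise types on CAM explanations, proposes the robustness metric (consistency and responsiveness), and validates it empirically across multiple models, datasets, and perturbations, together with complementary statistical tests.

Result: Noise sensitivity varies considerably across CAM methods, and the proposed metric is shown to be applicable for assessing CAM robustness.

Insight: Dataset characteristics and noise type both have a strong influence on the stability of CAM explanations, which offers guidance for designing more reliable model visualization tools.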

Abstract: Class Activation Maps (CAMs) are one of the important methods for visualizing regions used by deep learning models. Yet their robustness to different noise remains underexplored. In this work, we evaluate and report the resilience of various CAM methods for different noise perturbations across multiple architectures and datasets. By analyzing the influence of different noise types on CAM explanations, we assess the susceptibility to noise and the extent to which dataset characteristics may impact explanation stability. The findings highlight considerable variability in noise sensitivity for various CAMs. We propose a robustness metric for CAMs that captures two key properties: consistency and responsiveness. Consistency reflects the ability of CAMs to remain stable under input perturbations that do not alter the predicted class, while responsiveness measures the sensitivity of CAMs to changes in the prediction caused by such perturbations. The metric is evaluated empirically across models, different perturbations, and datasets along with complementary statistical tests to exemplify the applicability of our proposed approach.
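The two properties named in the abstract can be made concrete with a toy computation. The sketch below is not the paper's metric. It assumes, purely for illustration, that CAM similarity is measured by cosine similarity of flattened maps, that consistency is the mean similarity over perturbations that leave the prediction unchanged, and that responsiveness is the mean dissimilarity over perturbations that change it.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened CAMs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cam_robustness(
    samples: List[Tuple[Sequence[float], Sequence[float], bool]]
) -> Tuple[float, float]:
    """Return (consistency, responsiveness) under the assumptions above.

    Each sample is (clean_cam, noisy_cam, prediction_unchanged).
    """
    same = [cosine(c, n) for c, n, unchanged in samples if unchanged]
    diff = [1.0 - cosine(c, n) for c, n, unchanged in samples if not unchanged]
    consistency = sum(same) / len(same) if same else float("nan")
    responsiveness = sum(diff) / len(diff) if diff else float("nan")
    return consistency, responsiveness


# Toy case: one perturbation keeps the class and the map, one flips both.
samples = [
    ([1.0, 0.0], [1.0, 0.0], True),   # prediction unchanged, CAM unchanged
    ([1.0, 0.0], [0.0, 1.0], False),  # prediction changed, CAM moved
]
consistency, responsiveness = cam_robustness(samples)
```

An ideal explanation method under this reading scores close to 1 on both values: its maps stay put when the prediction does, and move when the prediction moves.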

[203] Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

Xiangxiang Wang,Xuanyu Wang,YiJia Luo,Yongbin Yu,Manping Fan,Jingtao Zhang,Liyong Ren

Main category: cs.CV

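TL;DR: The paper proposes two technical innovations: a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system. Together they cut memory requirements from 38GB to 16GB while maintaining performance, and the system performs well in scene perception and multimodal interaction with low response latency.

Details Motivation: Provide real-time, efficient environmental perception assistance for visually impaired users, addressing the high memory requirements and slow responses of conventional approaches.

Contribution: 1. A cross-modal differentiated quantization framework that reduces memory requirements; 2. A scene-aware vectorized memory multi-agent system that supports persistent storage and efficient retrieval of scene memories; 3. Strong system performance with substantially faster responses than non-streaming methods.

Method: 1. Differentiated quantization of VLMs; 2. A multi-agent framework combining scene classification, vectorized memory, and multimodal interaction; 3. A perception-memory-reasoning workflow that draws on historical memories.

Result: The quantized 19B-parameter model loses only 2.05% on MMBench and reaches 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory requirements such as the Molmo-7B series. Response latency from scene analysis to initial speech output is 2.83-3.52 seconds.

Insight: Differentiated quantization and vectorized memory point to a new direction for designing efficient multimodal systems, particularly for real-time assistive applications.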

Abstract: This study proposes the dual technological innovation framework, including a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system for visually impaired assistance. The modular framework was developed implementing differentiated processing strategies, effectively reducing memory requirements from 38GB to 16GB while maintaining model performance. The multi-agent architecture combines scene classification, vectorized memory, and multimodal interaction, enabling persistent storage and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the system provides environmental information beyond the current view using historical memories. Experiments show the quantized 19B-parameter model only experiences a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory requirements like the Molmo-7B series. The system maintains response latency between 2.83-3.52 seconds from scene analysis to initial speech output, substantially faster than non-streaming methods. This research advances computational efficiency and assistive technology, offering visually impaired users comprehensive real-time assistance in scene perception, text recognition, and navigation.
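The 38GB starting point in the abstract is consistent with simple weight-size arithmetic: 19 billion parameters stored at 16 bits each occupy 38 GB (decimal, weights only). The sketch below computes that figure and then shows how mixed bit-widths can land near 16 GB. The per-module split in the second example is entirely hypothetical and is not the allocation used in the paper, and the 16-bit baseline is itself an assumption.

```python
from typing import List, Tuple


def weight_memory_gb(modules: List[Tuple[int, int]]) -> float:
    """Weights-only footprint in decimal GB.

    `modules` is a list of (parameter_count, bits_per_parameter) pairs.
    Activations, KV cache, and quantization metadata are ignored.
    """
    total_bits = sum(count * bits for count, bits in modules)
    return total_bits / 8 / 1e9


# Baseline: all 19B parameters at 16 bits (assumed fp16/bf16 storage).
baseline_gb = weight_memory_gb([(19_000_000_000, 16)])

# HYPOTHETICAL split, for illustration only: 4B parameters kept at 16 bits
# and 15B parameters quantized to 4 bits.
mixed_gb = weight_memory_gb([(4_000_000_000, 16), (15_000_000_000, 4)])
```

Uniform 8-bit storage would give 19 GB and uniform 4-bit storage 9.5 GB, so a 16GB footprint is the kind of number one would expect from treating different modules with different precision.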

[204] BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding

Xuan-Bac Nguyen,Thanh-Dat Truong,Pawan Sinha,Khoa Luu

Main category: cs.CV

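TL;DR: The paper proposes BRAIN, a new approach that addresses inconsistency in recorded brain signals and its negative effect on vision-brain understanding models. Using De-bias Contrastive Learning and Angular-based Forgetting Mitigation, BRAIN achieves strong performance in a continual learning setting.

Details Motivation: Memory decay makes it harder for the human brain to recognize visual objects, so recorded brain signals grow weaker and more uncertain over time, which degrades vision-brain understanding models. The paper sets out to address this problem.

Contribution: 1. Demonstrates that brain signal inconsistency exists and harms the model; 2. Proposes BRAIN, which combines De-bias Contrastive Learning with Angular-based Forgetting Mitigation to reduce bias and prevent catastrophic forgetting; 3. Shows experimentally that BRAIN reaches SOTA performance on several benchmarks.

Method: 1. Verifies brain signal inconsistency statistically and experimentally; 2. Introduces the De-bias Contrastive Learning loss to reduce bias; 3. Introduces Angular-based Forgetting Mitigation to keep the model from losing previously learned knowledge.

Result: BRAIN outperforms prior methods and non-continual learning methods, reaching SOTA performance across various benchmarks.

Insight: Brain signal inconsistency is a key factor behind performance degradation in vision-brain understanding models, and combining continual learning with bias mitigation effectively counters it.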

Abstract: Memory decay makes it harder for the human brain to recognize visual objects and retain details. Consequently, recorded brain signals become weaker, uncertain, and contain poor visual context over time. This paper presents one of the first vision-learning approaches to address this problem. First, we statistically and experimentally demonstrate the existence of inconsistency in brain signals and its impact on the Vision-Brain Understanding (VBU) model. Our findings show that brain signal representations shift over recording sessions, leading to compounding bias, which poses challenges for model learning and degrades performance. Then, we propose a new Bias-Mitigation Continual Learning (BRAIN) approach to address these limitations. In this approach, the model is trained in a continual learning setup and mitigates the growing bias from each learning step. A new loss function named De-bias Contrastive Learning is also introduced to address the bias problem. In addition, to prevent catastrophic forgetting, where the model loses knowledge from previous sessions, the new Angular-based Forgetting Mitigation approach is introduced to preserve learned knowledge in the model. Finally, the empirical experiments demonstrate that our approach achieves State-of-the-Art (SOTA) performance across various benchmarks, surpassing prior and non-continual learning methods.

[205] Explain and Monitor Deep Learning Models for Computer Vision using Obz AI

Neo Christopher Chung,Jakub Binda

Main category: cs.CV

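TL;DR: The paper introduces Obz AI, a software ecosystem that provides explainability and monitoring for deep learning models in computer vision, filling a gap in practical deployments.

Details Motivation: Current computer vision models such as CNNs and ViTs are often regarded as "black boxes" that lack transparency and explainability, which makes them hard to monitor and trust in real deployments.

Contribution: Develops Obz AI, a comprehensive software ecosystem that integrates explainable AI (XAI) techniques, knowledge management, and real-time monitoring.

Method: Provides a seamless integration pipeline from a Python client library to a full-stack analytics dashboard, supporting XAI methods, feature analysis for outlier detection, and model monitoring.

Result: Obz AI makes the decision-making of deep learning models interpretable and improves the observability and responsible deployment of computer vision systems.

Insight: Connecting XAI techniques to real deployments through a software toolchain can address the model transparency problem and promote trustworthy use of AI systems.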

Abstract: Deep learning has transformed computer vision (CV), achieving outstanding performance in classification, segmentation, and related tasks. Such AI-based CV systems are becoming prevalent, with applications spanning from medical imaging to surveillance. State of the art models such as convolutional neural networks (CNNs) and vision transformers (ViTs) are often regarded as "black boxes," offering limited transparency into their decision-making processes. Despite a recent advancement in explainable AI (XAI), explainability remains underutilized in practical CV deployments. A primary obstacle is the absence of integrated software solutions that connect XAI techniques with robust knowledge management and monitoring frameworks. To close this gap, we have developed Obz AI, a comprehensive software ecosystem designed to facilitate state-of-the-art explainability and observability for vision AI systems. Obz AI provides a seamless integration pipeline, from a Python client library to a full-stack analytics dashboard. With Obz AI, a machine learning engineer can easily incorporate advanced XAI methodologies, extract and analyze features for outlier detection, and continuously monitor AI models in real time. By making the decision-making mechanisms of deep models interpretable, Obz AI promotes observability and responsible deployment of computer vision systems.

[206] Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance

Ayce Idil Aytekin,Helge Rhodin,Rishabh Dabral,Christian Theobalt

Main category: cs.CV

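TL;DR: Proposes a new diffusion-based framework that uses geometric guidance to reconstruct the 3D geometry of hand-held objects from monocular RGB images while keeping the hand-object interaction plausible.

Details Motivation: Existing methods either rely on extensive post-processing or produce low-quality reconstructions. This work aims to generate high-quality object geometry directly during the diffusion process while also optimizing the hand-object interaction.

Contribution: 1) A hand-held object reconstruction framework that combines a diffusion model with geometric guidance; 2) An optimization-in-the-loop design inside the diffusion process, supervised by multi-modal geometric cues; 3) Physically plausible hand-object interaction.

Method: 1) Conditions a latent diffusion model on an inpainted object appearance; 2) At inference time, jointly optimizes the object reconstruction and the transformations of the hand and object through the optimization-in-the-loop design; 3) Uses geometric cues such as normal and depth alignment, silhouette consistency, and 2D keypoint reprojection, together with signed distance field (SDF) supervision and contact and non-intersection constraints.

Result: The method produces accurate, robust, and coherent reconstructions under occlusion and generalizes well to in-the-wild scenes.

Insight: Combining diffusion models with geometric guidance and optimization in the loop offers a new approach to 3D reconstruction, especially for complex hand-object interaction scenes.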

Abstract: We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.
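A non-intersection constraint of the kind mentioned in the abstract is often expressed as a penalty on points that fall inside an object's signed distance field (SDF), where the SDF is negative. The sketch below illustrates that general idea with an analytic sphere standing in for the object. The sphere, the function names, and the mean-penetration form of the penalty are illustrative assumptions, not the paper's loss.

```python
import math
from typing import Callable, List, Sequence


def sphere_sdf(point: Sequence[float],
               center: Sequence[float],
               radius: float) -> float:
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius


def penetration_penalty(points: List[Sequence[float]],
                        sdf: Callable[[Sequence[float]], float]) -> float:
    """Mean penetration depth of `points` into the object described by `sdf`."""
    return sum(max(0.0, -sdf(p)) for p in points) / len(points)


# Toy "hand" points: one at the sphere's center (inside), one well outside.
hand_points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
unit_sphere = lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)
penalty = penetration_penalty(hand_points, unit_sphere)
```

Only the interior point contributes (depth 1.0), so the mean penalty over two points is 0.5. Points outside the surface add nothing, which is what lets such a term discourage interpenetration without pulling the hand away from legitimate contact.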

[207] GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models

Lianming Huang,Haibo Hu,Qiao Li,Xin He,Nan Guan,Chun Jason Xue

Main category: cs.CV

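TL;DR: GM-Skip uses a greedy, metric-guided strategy to skip Transformer blocks, accelerating vision-language model inference while preserving task performance.

Details Motivation: Transformer-based vision-language models are computationally expensive in latency-sensitive applications such as autonomous driving, so efficient acceleration methods are needed.

Contribution: Proposes the GM-Skip framework, which uses metric feedback and a flexible block-skipping mechanism to speed up inference while preserving output quality.

Method: A greedy, metric-guided block selection strategy combined with a reverse-order deletion mechanism that preserves early foundational blocks, plus a score-sparsity balance objective that provides a tunable trade-off for different deployment needs.

Result: Across tasks and datasets including COCO and CODA, GM-Skip improves inference speed while maintaining task performance. On COCO, single-object classification accuracy on the Person category rises from 19.1% to 87.3% while more than 40% of Transformer blocks are skipped. In a real-world deployment on an autonomous vehicle running Autoware.Universe, it reduces single-object detection latency by up to 45.4%.

Insight: By skipping redundant blocks and keeping the critical ones, GM-Skip balances performance and efficiency and demonstrates practical deployment value.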

Abstract: Transformer-based Vision-Language Models (VLMs) have achieved impressive performance on tasks such as image captioning, object recognition, and visual reasoning, but their high computational cost hinders deployment in latency-sensitive applications like autonomous driving. We introduce GM-Skip, a flexible and metric-adaptive framework for Transformer block skipping that accelerates VLM inference while preserving output quality. GM-Skip features a greedy, metric-guided block selection strategy that uses metric feedback (e.g., accuracy, CIDEr) to identify redundant layers, along with a reverse-order deletion mechanism that preserves early foundational blocks to avoid performance collapse. To support diverse deployment needs, it incorporates a tunable trade-off between sparsity and performance via a score-sparsity balance objective. Experiments across multiple tasks and datasets, including COCO and CODA, show that GM-Skip consistently improves inference speed while maintaining task performance. On the COCO dataset, GM-Skip improves single-object classification accuracy on the Person category from 19.1 percent to 87.3 percent while skipping more than 40 percent of Transformer blocks. In real-world deployment, it achieves up to 45.4 percent latency reduction on single-object detection when integrated into an autonomous vehicle running Autoware.Universe, validating the effectiveness of its skip configurations and confirming its practical value in accelerating real-world inference.
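The selection procedure described in the abstract can be sketched as a greedy loop that walks the blocks from last to first, tentatively skips each one, and keeps the skip only if a combined score-sparsity objective does not get worse. Everything specific below is an assumption made for illustration: the additive objective `score + alpha * sparsity`, the `protect` parameter for early blocks, and the toy `evaluate` function that stands in for running the model and measuring a metric such as accuracy or CIDEr.

```python
from typing import Callable, FrozenSet, List, Set


def greedy_skip(n_blocks: int,
                evaluate: Callable[[FrozenSet[int]], float],
                alpha: float = 0.1,
                protect: int = 2) -> List[int]:
    """Greedily choose blocks to skip, visiting them in reverse order.

    `evaluate(skipped)` returns the task metric with those blocks skipped.
    The first `protect` blocks are never considered (hypothetical parameter).
    """
    def objective(skipped: Set[int]) -> float:
        sparsity = len(skipped) / n_blocks
        return evaluate(frozenset(skipped)) + alpha * sparsity

    skipped: Set[int] = set()
    best = objective(skipped)
    for block in range(n_blocks - 1, protect - 1, -1):
        trial = skipped | {block}
        value = objective(trial)
        if value >= best:  # keep the skip only if the objective holds up
            skipped, best = trial, value
    return sorted(skipped)


# Toy metric: each block has a made-up "importance" that is lost when skipped.
importance = [0.5, 0.3, 0.0, 0.2, 0.0, 0.01]


def toy_evaluate(skipped: FrozenSet[int]) -> float:
    return 1.0 - sum(importance[b] for b in skipped)


chosen = greedy_skip(len(importance), toy_evaluate, alpha=0.1, protect=2)
```

In this toy run blocks 5, 4, and 2 are skipped because they cost little or nothing, block 3 is kept because dropping it hurts the metric more than the sparsity bonus gains, and blocks 0 and 1 are never candidates. Raising `alpha` would trade more accuracy for more skipped blocks.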

[208] Interpretable Evaluation of AI-Generated Content with Language-Grounded Sparse Encoders

Yiming Tang,Arash Lagzian,Srinivas Anumasa,Qiran Zou,Trang Nguyen,Ehsan Adeli,Ching-Yu Cheng,Yilun Du,Dianbo Liu

Main category: cs.CV

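TL;DR: The paper proposes LanSE, a new architecture that provides fine-grained evaluation of AI-generated content by identifying interpretable visual patterns and describing them in natural language.

Details Motivation: Current evaluation metrics for AI-generated content are too coarse to support model selection and development, which limits both the scientific understanding and the commercial deployment of generative models.

Contribution: Proposes Language-Grounded Sparse Encoders (LanSE), which enable interpretable evaluation and quantify four key dimensions of generation quality: prompt match, visual realism, physical plausibility, and content diversity.

Method: Identifies visual patterns and describes them automatically in natural language, validated through large-scale human evaluation (more than 11,000 annotations) and analysis based on large multimodal models.

Result: LanSE detects interpretable visual patterns with more than 93% accuracy and reveals model differences that existing metrics miss, for instance FLUX's superior physical plausibility and SDXL-medium's strong content diversity.

Insight: Through interpretable evaluation, LanSE improves the transparency and reliability of generative models and provides a useful tool for quality control of AI-generated content and for model improvement.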

Abstract: While the quality of AI-generated contents, such as synthetic images, has become remarkably high, current evaluation metrics provide only coarse-grained assessments, failing to identify specific strengths and weaknesses that researchers and practitioners need for model selection and development, further limiting the scientific understanding and commercial deployment of these generative models. To address this, we introduce Language-Grounded Sparse Encoders (LanSE), a novel architecture that creates interpretable evaluation metrics by identifying interpretable visual patterns and automatically describing them in natural language. Through large-scale human evaluation (more than 11,000 annotations) and large multimodal model (LMM) based analysis, LanSE demonstrates reliable capabilities to detect interpretable visual patterns in synthetic images with more than 93% accuracy in natural images. LanSE further provides a fine-grained evaluation framework that quantifies four key dimensions of generation quality, prompt match, visual realism, physical plausibility, and content diversity. LanSE reveals nuanced model differences invisible to existing metrics, for instance, FLUX’s superior physical plausibility and SDXL-medium’s strong content diversity, while aligning with human judgments. By bridging interpretability with practical evaluation needs, LanSE offers all users of generative AI models a powerful tool for model selection, quality control of synthetic content, and model improvement. These capabilities directly address the need for public confidence and safety in AI-generated content, both critical for the future of generative AI applications.

[209] PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors

Mohamed Adjel,Vincent Bonnet

Main category: cs.CV

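TL;DR: The paper proposes PriorFormer, a lightweight Transformer for monocular 3D human pose estimation. It improves accuracy by taking geometric priors (segment lengths and camera intrinsics) as input and performs well in both calibrated and uncalibrated settings.

Details Motivation: Monocular 3D human pose estimation is limited outside the lab by missing calibration information, and existing methods are computationally expensive. PriorFormer addresses both issues with a lightweight, adaptable model.

Contribution: 1. Proposes PriorFormer, a lightweight Transformer that accepts geometric priors; 2. Introduces a masking mechanism to handle missing priors; 3. Improves on existing methods in both accuracy and speed.

Method: 1. The input consists of short 2D joint position sequences plus geometric priors (segment lengths, camera intrinsics); 2. A masking mechanism lets the model ignore missing priors during training and inference; 3. Training uses 3D keypoints from the AMASS dataset with synthetic 2D data generated by sampling random camera poses and intrinsics.

Result: Average 3D joint position error is 36 mm, improving on the state of the art by 0.5 cm. Inference takes 380 μs on GPU and 1800 μs on CPU, which suits embedded devices.

Insight: Geometric priors clearly improve performance, and the masking-based versatility keeps accuracy high even when priors are missing. The lightweight design makes real-time use feasible.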

Abstract: This paper proposes a new lightweight Transformer-based lifter that maps short sequences of human 2D joint positions to 3D poses using a single camera. The proposed model takes as input geometric priors including segment lengths and camera intrinsics and is designed to operate in both calibrated and uncalibrated settings. To this end, a masking mechanism enables the model to ignore missing priors during training and inference. This yields a single versatile network that can adapt to different deployment scenarios, from fully calibrated lab environments to in-the-wild monocular videos without calibration. The model was trained using 3D keypoints from the AMASS dataset with corresponding 2D synthetic data generated by sampling random camera poses and intrinsics. It was then compared to an expert model trained only on complete priors, and the validation was done by conducting an ablation study. Results show that both camera and segment length priors improve performance and that the versatile model outperforms the expert, even when all priors are available, and maintains high accuracy when priors are missing. Overall, the average 3D joint center position estimation error was as low as 36 mm, improving on the state of the art by half a centimeter and at a much lower computational cost. Indeed, the proposed model runs in 380 μs on GPU and 1800 μs on CPU, making it suitable for deployment on embedded platforms and low-power devices.
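One common way to let a network ignore missing inputs is to zero-fill the absent values and pass a binary mask alongside them. The sketch below shows that packing step for the two priors named in the abstract. It is a guess at the general pattern, not the paper's implementation: the function name, the layout of the vector, and the choice of four intrinsics values are all hypothetical, and the abstract does not say how the mask enters the Transformer.

```python
from typing import List, Optional, Tuple


def pack_priors(segment_lengths: Optional[List[float]],
                intrinsics: Optional[List[float]],
                n_segments: int,
                n_intrinsics: int = 4) -> Tuple[List[float], List[int]]:
    """Pack optional priors into a fixed-size vector plus a validity mask.

    Missing priors are zero-filled and flagged with 0 in the mask so that a
    downstream model can learn to disregard them.
    """
    values: List[float] = []
    mask: List[int] = []
    for prior, size in ((segment_lengths, n_segments),
                        (intrinsics, n_intrinsics)):
        if prior is None:
            values += [0.0] * size
            mask += [0] * size
        else:
            if len(prior) != size:
                raise ValueError(f"expected {size} values, got {len(prior)}")
            values += list(prior)
            mask += [1] * size
    return values, mask


# Uncalibrated case: segment lengths known, camera intrinsics missing.
values, mask = pack_priors([0.40, 0.45], None, n_segments=2)
```

Randomly dropping priors with this mask during training is what would allow a single network to serve both the calibrated and the uncalibrated settings described above.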

[210] MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Sixun Dong,Juhua Hu,Mian Zhang,Ming Yin,Yanjie Fu,Qi Qian

Main category: cs.CV

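TL;DR: MMTok proposes a multimodal coverage maximization approach that uses both vision and text tokens to select informative vision tokens, improving the inference efficiency of vision-language models.

Details Motivation: Vision-language models (VLMs) are inefficient at inference time mainly because of redundant vision tokens. Most existing pruning methods use only one modality (vision or text), ignore the multimodal nature of the task, and lack a generic criterion applicable across modalities.

Contribution: Proposes MMTok, a coverage-based multimodal vision token selection method that uses vision and text tokens together and optimizes the selected subset for maximum coverage.

Method: Formulates subset selection as a maximum coverage problem, optimizes a subset of vision tokens to cover both the text tokens and the original vision tokens, and optionally uses a VLM agent to improve the quality of the text tokens that guide pruning.

Result: Benchmarks confirm that vision and text information are complementary. On the POPE dataset the method achieves a 1.87x speedup while keeping 98.7% of the original performance on LLaVA-NeXT-13B, and with only four vision tokens it retains 87.7% of the original performance on LLaVA-1.5-7B.

Insight: Coverage is an effective criterion for multimodal token selection, and combining vision and text information clearly outperforms unimodal baselines.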

Abstract: Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens results in the degenerated inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, it lacks a generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. Finally, a VLM agent can be adopted to further improve the quality of text tokens for guiding vision pruning. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
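A maximum coverage formulation is typically attacked with a greedy algorithm: repeatedly add the candidate that most increases how well the targets are covered. The sketch below applies that standard greedy scheme to a small similarity matrix. The soft coverage objective (each target is covered to the degree of its best match in the selected set) and the toy similarities are assumptions for illustration. The paper's exact objective and optimizer are not given in the abstract.

```python
from typing import List


def greedy_cover(sim: List[List[float]], k: int) -> List[int]:
    """Pick k candidate tokens that best cover all target tokens.

    `sim[t][c]` is a non-negative similarity between target t (a text token
    or an original vision token) and candidate vision token c. A target's
    coverage is its highest similarity to any selected candidate.
    """
    n_targets, n_candidates = len(sim), len(sim[0])
    covered = [0.0] * n_targets
    chosen: List[int] = []
    for _ in range(min(k, n_candidates)):
        best_c, best_gain = -1, -1.0
        for c in range(n_candidates):
            if c in chosen:
                continue
            # Marginal gain: how much coverage improves if c is added.
            gain = sum(max(sim[t][c] - covered[t], 0.0)
                       for t in range(n_targets))
            if gain > best_gain:
                best_c, best_gain = c, gain
        chosen.append(best_c)
        covered = [max(covered[t], sim[t][best_c]) for t in range(n_targets)]
    return chosen


# Three targets, three candidate vision tokens (toy similarities).
sim = [
    [0.9, 0.7, 0.1],
    [0.8, 0.9, 0.1],
    [0.1, 0.1, 0.9],
]
selected = greedy_cover(sim, k=2)
```

Candidate 0 is picked first because it covers the first two targets well. Candidate 1 is then nearly redundant, so the second pick is candidate 2, the only one that covers the third target. This preference for complementary tokens over individually strong but overlapping ones is the point of a coverage criterion.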

[211] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang,Zhangwei Gao,Lixin Gu,Hengjun Pu,Long Cui,Xingguang Wei,Zhaoyang Liu,Linglin Jing,Shenglong Ye,Jie Shao,Zhaokai Wang,Zhe Chen,Hongjie Zhang,Ganlin Yang,Haomin Wang,Qi Wei,Jinhui Yin,Wenhao Li,Erfei Cui,Guanzhou Chen,Zichen Ding,Changyao Tian,Zhenyu Wu,Jingjing Xie,Zehao Li,Bowen Yang,Yuchen Duan,Xuehui Wang,Songze Li,Xiangyu Zhao,Haodong Duan,Nianchen Deng,Bin Fu,Yinan He,Yi Wang,Conghui He,Botian Shi,Junjun He,Yingtong Xiong,Han Lv,Lijun Wu,Wenqi Shao,Kaipeng Zhang,Huipeng Deng,Biqing Qi,Jiaye Ge,Qipeng Guo,Wenwei Zhang,Wanli Ouyang,Limin Wang,Min Dou,Xizhou Zhu,Tong Lu,Dahua Lin,Jifeng Dai,Bowen Zhou,Weijie Su,Kai Chen,Yu Qiao,Wenhai Wang,Gen Luo

Main category: cs.CV

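TL;DR: The paper presents InternVL3.5, a family of open-source multimodal models that uses the Cascade RL framework and a Visual Resolution Router (ViR), among other techniques, to substantially improve reasoning capability and inference efficiency, and that supports new interactive capabilities.

Details Motivation: Existing open-source multimodal models fall short in reasoning, efficiency, and versatility. InternVL3.5 aims to narrow the performance gap with commercial models.

Contribution: 1. The Cascade RL framework (offline RL followed by online RL) to improve reasoning; 2. ViR, which dynamically adjusts the resolution of visual tokens; 3. The Decoupled Vision-Language Deployment (DvD) strategy, which separates the vision and language computation loads; 4. SOTA results among open-source models on multiple task categories.

Method: 1. Cascade RL improves reasoning with two-stage reinforcement learning; 2. ViR dynamically adjusts visual token resolution; 3. DvD places the vision encoder and the language model on different GPUs; 4. The models support new capabilities such as GUI interaction and embodied agency.

Result: Up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup over InternVL3. The largest model, InternVL3.5-241B-A28B, reaches SOTA among open-source MLLMs on general multimodal, reasoning, text, and agentic tasks.

Insight: 1. Staged RL training provides stable convergence first and refined alignment second; 2. Dynamic resolution and decoupled deployment substantially improve efficiency; 3. Open-source models can approach the performance of commercial models.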

Abstract: We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

[212] ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models

Haitang Feng,Jie Liu,Jie Tang,Gangshan Wu,Beiqi Chen,Jianhuang Lai,Guangcong Wang

Main category: cs.CV

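TL;DR: ObjFiller-3D is a novel 3D inpainting method that uses video diffusion models to achieve multi-view-consistent 3D completion and editing, addressing the cross-view inconsistency of conventional 2D inpainting approaches.

Details Motivation: Conventional 3D inpainting relies on multi-view 2D image inpainting, but inconsistencies among the inpainted views cause blurred textures, spatial discontinuities, and visual artifacts, which harm the realism and structural coherence of 3D objects.

Contribution: Proposes ObjFiller-3D, which uses a video editing model to fill the masked regions of 3D objects; analyzes the representation gap between 3D and video; and proposes a reference-based 3D inpainting method.

Method: Adapts a video inpainting model to produce multi-view-consistent 3D inpainting, and adds a reference-based inpainting method to improve reconstruction quality.

Result: Outperforms prior methods (PSNR 26.6 vs. 15.9 for NeRFiller; LPIPS 0.19 vs. 0.25 for Instant3dit) and shows potential for real-world 3D editing applications.

Insight: Video editing models can effectively address multi-view consistency in 3D inpainting, offering a new route to high-quality 3D reconstruction.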

Abstract: 3D inpainting often relies on multi-view 2D image inpainting, where the inherent inconsistencies across different inpainted views can result in blurred textures, spatial discontinuities, and distracting visual artifacts. These inconsistencies pose significant challenges when striving for accurate and realistic 3D object completion, particularly in applications that demand high fidelity and structural coherence. To overcome these limitations, we propose ObjFiller-3D, a novel method designed for the completion and editing of high-quality and consistent 3D objects. Instead of employing a conventional 2D image inpainting model, our approach leverages a curated selection of state-of-the-art video editing models to fill in the masked regions of 3D objects. We analyze the representation gap between 3D and videos, and propose an adaptation of a video inpainting model for 3D scene inpainting. In addition, we introduce a reference-based 3D inpainting method to further enhance the quality of reconstruction. Experiments across diverse datasets show that compared to previous methods, ObjFiller-3D produces more faithful and fine-grained reconstructions (PSNR of 26.6 vs. NeRFiller (15.9) and LPIPS of 0.19 vs. Instant3dit (0.25)). Moreover, it demonstrates strong potential for practical deployment in real-world 3D editing applications. Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D
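
To make the PSNR figures above easier to read, here is a minimal sketch of the standard PSNR definition. It is not taken from the ObjFiller-3D code; the function names `psnr` and `mse_for_psnr` and the assumption that pixel values lie in [0, 1] are illustrative.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_for_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula: the mean squared error implied by a PSNR value."""
    return max_value ** 2 / (10 ** (psnr_db / 10.0))
```

Under this definition, a gap of 26.6 dB versus 15.9 dB corresponds to roughly 11.7 times lower mean squared error, assuming both scores use the same peak value. Higher PSNR is better, while lower LPIPS is better.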

cs.IR

[213] HLLM-Creator: Hierarchical LLM-based Personalized Creative Generation

Junyi Chen,Lu Chi,Siliang Xu,Shiwei Ran,Bingyue Peng,Zehuan Yuan

Main category: cs.IR

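TL;DR: The paper proposes HLLM-Creator, a hierarchical LLM framework for personalized creative generation that tackles user-interest modeling and efficient generation, and validates it through online advertising experiments.

Details Motivation: Current AIGC systems rely heavily on creators' inspiration and rarely generate truly user-personalized content. In scenarios such as advertising, personalized generation matters greatly for the user experience.

Contribution: Proposes the HLLM-Creator framework, which combines user clustering with a pruning strategy based on user-ad matching prediction to generate personalized content efficiently; designs a chain-of-thought-based data construction pipeline to address data scarcity.

Method: A hierarchical LLM framework that applies user clustering and user-ad-matching-prediction-based pruning at inference time, together with a chain-of-thought-based data construction pipeline that produces high-quality personalized data.

Result: Experiments on personalized title generation for Douyin Search Ads show the method's effectiveness; an online A/B test shows a 0.476% increase on the Adss metric.

Insight: User clustering and efficient generation strategies can address the efficiency problem of large-scale personalized generation; chain-of-thought reasoning is an effective way to mitigate data scarcity.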

Abstract: AI-generated content technologies are widely used in content creation. However, current AIGC systems rely heavily on creators’ inspiration, rarely generating truly user-personalized content. In real-world applications such as online advertising, a single product may have multiple selling points, with different users focusing on different features. This underscores the significant value of personalized, user-centric creative generation. Effective personalized content generation faces two main challenges: (1) accurately modeling user interests and integrating them into the content generation process while adhering to factual constraints, and (2) ensuring high efficiency and scalability to handle the massive user base in industrial scenarios. Additionally, the scarcity of personalized creative data in practice complicates model training, making data construction another key hurdle. We propose HLLM-Creator, a hierarchical LLM framework for efficient user interest modeling and personalized content generation. During inference, a combination of user clustering and a user-ad-matching-prediction based pruning strategy is employed to significantly enhance generation efficiency and reduce computational overhead, making the approach suitable for large-scale deployment. Moreover, we design a data construction pipeline based on chain-of-thought reasoning, which generates high-quality, user-specific creative titles and ensures factual consistency despite limited personalized data. This pipeline serves as a critical foundation for the effectiveness of our model. Extensive experiments on personalized title generation for Douyin Search Ads show the effectiveness of HLLM-Creator. An online A/B test shows a 0.476% increase on Adss, paving the way for more effective and efficient personalized generation in industrial scenarios. Code for the academic dataset is available at https://github.com/bytedance/HLLM.

cs.AI

[214] Revisiting Rule-Based Stuttering Detection: A Comprehensive Analysis of Interpretable Models for Clinical Applications

Eric Zhang

Main category: cs.AI

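TL;DR: The paper presents a comprehensive analysis of rule-based stuttering detection and proposes an enhanced rule-based framework that combines speaking-rate normalization with multi-level acoustic feature analysis, reaching performance competitive with deep learning while preserving clinical interpretability.

Details Motivation: Stuttering affects approximately 1% of the global population, and clinical use requires interpretable, transparent detection methods. Although deep learning performs well on stuttering detection, rule-based methods remain important in clinical applications because of their interpretability.

Contribution: Proposes an enhanced rule-based framework that combines speaking-rate normalization and multi-level acoustic feature analysis, and shows its distinctive advantages in clinical contexts.

Method: Draws on multiple corpora (UCLASS, FluencyBank, and SEP-28k) to build a rule-based system with hierarchical decision structures, and analyzes its high accuracy on prolongation detection.

Result: The rule-based approach reaches 97-99% accuracy on prolongation detection and performs stably across varying speaking rates.

Insight: Rule-based systems may be marginally less accurate than deep learning in unconstrained settings, but their decision auditability, patient-specific tuning, and real-time feedback give them distinctive advantages in clinical applications.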

Abstract: Stuttering affects approximately 1% of the global population, impacting communication and quality of life. While recent advances in deep learning have pushed the boundaries of automatic speech dysfluency detection, rule-based approaches remain crucial for clinical applications where interpretability and transparency are paramount. This paper presents a comprehensive analysis of rule-based stuttering detection systems, synthesizing insights from multiple corpora including UCLASS, FluencyBank, and SEP-28k. We propose an enhanced rule-based framework that incorporates speaking-rate normalization, multi-level acoustic feature analysis, and hierarchical decision structures. Our approach achieves competitive performance while maintaining complete interpretability, which is critical for clinical adoption. We demonstrate that rule-based systems excel particularly in prolongation detection (97-99% accuracy) and provide stable performance across varying speaking rates. Furthermore, we show how these interpretable models can be integrated with modern machine learning pipelines as proposal generators or constraint modules, bridging the gap between traditional speech pathology practices and contemporary AI systems. Our analysis reveals that while neural approaches may achieve marginally higher accuracy in unconstrained settings, rule-based methods offer unique advantages in clinical contexts where decision auditability, patient-specific tuning, and real-time feedback are essential.

[215] Quantifying Sycophancy as Deviations from Bayesian Rationality in LLMs

Katherine Atwell,Pedram Heydari,Anthony Sicilia,Malihe Alikhani

Main category: cs.AI

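TL;DR: The paper proposes a Bayesian framework for quantifying sycophancy in large language models (LLMs), defining it as a deviation from rational behavior once a user's perspective is introduced, and shows that LLMs are not Bayesian rational.

Details Motivation: Existing work mostly measures LLM sycophancy through behavioral shifts or accuracy, but neither metric characterizes shifts in rationality, and accuracy cannot be used where there is no known ground truth. The paper aims to quantify sycophancy more completely with a Bayesian framework.

Contribution: Proposes a way to quantify sycophancy against a Bayesian-rationality baseline, which makes it possible to analyze tasks with inherent uncertainty or no ground truth, and exposes irrational belief updates in LLMs.

Method: Uses a Bayesian framework to compare LLMs' predicted posteriors before and after a user perspective is introduced. The study covers three tasks, a mix of open-source and closed LLMs, two methods for probing sycophancy, and multiple methods for eliciting probability judgments.

Result: 1) LLMs are not Bayesian rational; 2) probing for sycophancy significantly shifts the predicted posterior toward the steered outcome; 3) sycophancy sometimes increases Bayesian error and in a small number of cases decreases it; 4) changes in Bayesian error due to sycophancy are not strongly correlated with Brier score.

Insight: Looking only at the effect of sycophancy on ground truth does not fully capture the reasoning errors it causes; a Bayesian framework gives a more complete picture of model behavior.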

Abstract: Sycophancy, or overly agreeable or flattering behavior, is a documented issue in large language models (LLMs), and is critical to understand in the context of human/AI collaboration. Prior works typically quantify sycophancy by measuring shifts in behavior or impacts on accuracy, but neither metric characterizes shifts in rationality, and accuracy measures can only be used in scenarios with a known ground truth. In this work, we utilize a Bayesian framework to quantify sycophancy as deviations from rational behavior when presented with user perspectives, thus distinguishing between rational and irrational updates based on the introduction of user perspectives. In comparison to other methods, this approach allows us to characterize excessive behavioral shifts, even for tasks that involve inherent uncertainty or do not have a ground truth. We study sycophancy for 3 different tasks, a combination of open-source and closed LLMs, and two different methods for probing sycophancy. We also experiment with multiple methods for eliciting probability judgments from LLMs. We hypothesize that probing LLMs for sycophancy will cause deviations in LLMs’ predicted posteriors that will lead to increased Bayesian error. Our findings indicate that: 1) LLMs are not Bayesian rational, 2) probing for sycophancy results in significant increases to the predicted posterior in favor of the steered outcome, 3) sycophancy sometimes results in increased Bayesian error, and in a small number of cases actually decreases error, and 4) changes in Bayesian error due to sycophancy are not strongly correlated with Brier score, suggesting that studying the impact of sycophancy on ground truth alone does not fully capture errors in reasoning due to sycophancy.
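
The contrast between a Bayesian-rationality measure and a ground-truth measure can be illustrated with a small sketch. The abstract does not give the paper's exact error definition, so `bayesian_error` below (the absolute gap between a reported posterior and the Bayes-rule posterior) is an assumption made for illustration, as are all function names.

```python
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """P(H | E) for a binary hypothesis H, computed with Bayes' rule."""
    evidence = prior * lik_if_true + (1.0 - prior) * lik_if_false
    if evidence == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return prior * lik_if_true / evidence


def bayesian_error(reported_posterior, prior, lik_if_true, lik_if_false):
    """Assumed measure: distance between the reported and the rational posterior."""
    return abs(reported_posterior - bayes_posterior(prior, lik_if_true, lik_if_false))


def brier_score(probability, outcome):
    """Squared error against a known binary outcome (0 or 1)."""
    return (probability - outcome) ** 2
```

For example, with a prior of 0.4 and a user opinion that is only slightly more likely when the hypothesis is true (0.6 versus 0.5), the rational posterior is 0.24 / 0.54, about 0.44. A model that jumps to 0.8 has over-updated regardless of the eventual outcome, and only `brier_score` needs that outcome to be known.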

[216] Large Language Models as Universal Predictors? An Empirical Study on Small Tabular Datasets

Nikolaos Pavlidis,Vasilis Perifanis,Symeon Symeonidis,Pavlos S. Efraimidis

Main category: cs.AI

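TL;DR: The paper examines the predictive ability of large language models (LLMs) on small structured datasets. LLMs perform strongly on classification but poorly on regression and clustering; the paper also analyzes the influence of context size and prompt structure.

Details Motivation: To study how well LLMs generalize to structured data, especially small datasets, and to explore their potential for classification, regression, and clustering.

Contribution: An empirical study of LLMs on small structured datasets, compared against conventional machine learning methods, that identifies the strengths, weaknesses, and suitable use cases of LLMs.

Method: Few-shot prompting experiments with recent LLMs (GPT-5, GPT-4o, GPT-o3, Gemini-2.5-Flash, DeepSeek-R1), compared against linear models, ensemble methods, and tabular foundation models (TFMs).

Result: LLMs perform strongly on classification and can serve as zero-training baselines; they perform poorly on regression and clustering, which points to limitations on those tasks.

Insight: LLMs can act as general-purpose predictive engines and are best suited to classification, but should be used with caution for regression and clustering. Prompt structure and context size noticeably affect performance.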

Abstract: Large Language Models (LLMs), originally developed for natural language processing (NLP), have demonstrated the potential to generalize across modalities and domains. With their in-context learning (ICL) capabilities, LLMs can perform predictive tasks over structured inputs without explicit fine-tuning on downstream tasks. In this work, we investigate the empirical function approximation capability of LLMs on small-scale structured datasets for classification, regression and clustering tasks. We evaluate the performance of state-of-the-art LLMs (GPT-5, GPT-4o, GPT-o3, Gemini-2.5-Flash, DeepSeek-R1) under few-shot prompting and compare them against established machine learning (ML) baselines, including linear models, ensemble methods and tabular foundation models (TFMs). Our results show that LLMs achieve strong performance in classification tasks under limited data availability, establishing practical zero-training baselines. In contrast, the performance in regression with continuous-valued outputs is poor compared to ML models, likely because regression demands outputs in a large (often infinite) space, and clustering results are similarly limited, which we attribute to the absence of genuine ICL in this setting. Nonetheless, this approach enables rapid, low-overhead data exploration and offers a viable alternative to traditional ML pipelines in business intelligence and exploratory analytics contexts. We further analyze the influence of context size and prompt structure on approximation quality, identifying trade-offs that affect predictive performance. Our findings suggest that LLMs can serve as general-purpose predictive engines for structured data, with clear strengths in classification and significant limitations in regression and clustering.
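
As a rough illustration of few-shot prompting over tabular rows, the sketch below serializes labeled examples and one unlabeled query into a single prompt string. The paper's actual prompt template is not given in the abstract; the `name=value` format, the instruction sentence, and the function names are all hypothetical.

```python
def serialize_row(row, feature_names):
    """Render one table row as 'name=value' pairs in a fixed column order."""
    return ", ".join(f"{name}={row[name]}" for name in feature_names)


def build_few_shot_prompt(examples, query, feature_names, label_name):
    """Build a prompt with labeled example rows followed by the unlabeled query row."""
    lines = ["Predict the label for the final row."]
    for example in examples:
        lines.append(f"{serialize_row(example, feature_names)} -> {example[label_name]}")
    # The query ends with an open arrow so the model completes the label.
    lines.append(f"{serialize_row(query, feature_names)} ->")
    return "\n".join(lines)
```

The number of example rows is the context size that the abstract refers to, so varying `examples` is one simple way to probe the trade-offs it mentions.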

[217] LLM-based Agentic Reasoning Frameworks: A Survey from Methods to Scenarios

Bingxi Zhao,Lin Geng Foo,Ping Hu,Christian Theobalt,Hossein Rahmani,Jun Liu

Main category: cs.AI

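TL;DR: The paper is a systematic survey of LLM-based agentic reasoning frameworks. It proposes a taxonomy and analyzes how different frameworks are applied across scenarios and what characterizes them.

Details Motivation: As the intrinsic reasoning capabilities of large language models have improved, many LLM-based agent systems have appeared that show near-human performance on automated tasks. Their reasoning frameworks differ, however, and a unified classification and analysis has been missing.

Contribution: 1. A systematic taxonomy that divides agentic reasoning frameworks into single-agent, tool-based, and multi-agent methods; 2. an analysis of their applications in scientific discovery, healthcare, software engineering, social simulation, and economics; 3. a summary of each framework's characteristics and of evaluation strategies.

Method: Uses a unified formal language to classify agentic reasoning systems, and compares applications across scenarios to analyze differences in framework-level reasoning.

Result: Provides a panoramic view of agentic reasoning frameworks that helps readers understand their strengths, suitable scenarios, and evaluation practices.

Insight: Different reasoning frameworks suit different scenarios; for example, multi-agent methods fit complex collaborative tasks, while single-agent methods fit independent decision-making. Evaluation strategies should also be tailored to the framework.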

Abstract: Recent advances in the intrinsic reasoning capabilities of large language models (LLMs) have given rise to LLM-based agent systems that exhibit near-human performance on a variety of automated tasks. However, although these systems share similarities in terms of their use of LLMs, different reasoning frameworks of the agent system steer and organize the reasoning process in different ways. In this survey, we propose a systematic taxonomy that decomposes agentic reasoning frameworks and analyze how these frameworks dominate framework-level reasoning by comparing their applications across different scenarios. Specifically, we propose a unified formal language to further classify agentic reasoning systems into single-agent methods, tool-based methods, and multi-agent methods. After that, we provide a comprehensive review of their key application scenarios in scientific discovery, healthcare, software engineering, social simulation, and economics. We also analyze the characteristic features of each framework and summarize different evaluation strategies. Our survey aims to provide the research community with a panoramic view to facilitate understanding of the strengths, suitable scenarios, and evaluation practices of different agentic reasoning frameworks.

[218] Unraveling the cognitive patterns of Large Language Models through module communities

Kushal Raj Bhandari,Pin-Yu Chen,Jianxi Gao

Main category: cs.AI

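TL;DR: The paper uses a network-based approach to analyze the cognitive patterns of LLMs, reveals how skills are distributed across module communities, highlights differences from biological systems, and argues for fine-tuning strategies that leverage distributed learning.

Details Motivation: Although LLMs are widely used in science and engineering, their internal mechanisms and cognitive processes remain hard to understand. The paper aims to uncover LLM cognitive patterns by combining cognitive science with machine learning.

Contribution: Proposes a network-based framework that links cognitive skills, LLM architectures, and datasets, offering a new perspective on LLM cognitive patterns.

Method: Applies network analysis of module communities to study the distribution of skills in LLMs and compares it with biological systems.

Result: The skill distribution in LLMs forms distinctive module communities, and dynamic cross-regional interactions are important for skill acquisition.

Insight: Effective fine-tuning strategies should leverage distributed learning dynamics rather than rigid modular interventions; the framework also offers new insights into LLM interpretability.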

Abstract: Large Language Models (LLMs) have reshaped our world with significant advancements in science, engineering, and society through applications ranging from scientific discoveries and medical diagnostics to chatbots. Despite their ubiquity and utility, the underlying mechanisms of LLMs remain concealed within billions of parameters and complex structures, making their inner architecture and cognitive processes challenging to comprehend. We address this gap by adopting approaches to understanding emerging cognition in biology and developing a network-based framework that links cognitive skills, LLM architectures, and datasets, ushering in a paradigm shift in foundation model analysis. The skill distribution in the module communities demonstrates that while LLMs do not strictly parallel the focalized specialization observed in specific biological systems, they exhibit unique communities of modules whose emergent skill patterns partially mirror the distributed yet interconnected cognitive organization seen in avian and small mammalian brains. Our numerical results highlight a key divergence from biological systems to LLMs, where skill acquisition benefits substantially from dynamic, cross-regional interactions and neural plasticity. By integrating cognitive science principles with machine learning, our framework provides new insights into LLM interpretability and suggests that effective fine-tuning strategies should leverage distributed learning dynamics rather than rigid modular interventions.

[219] WebSight: A Vision-First Architecture for Robust Web Agents

Tanvir Bhathal,Asanshay Gupta

Main category: cs.AI

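TL;DR: WebSight is a vision-based autonomous web agent that interacts with web environments purely through visual perception, without HTML or DOM input. Its core model, WebSight-7B, outperforms several larger generalist models on Showdown Clicks at lower latency, and the full agent surpasses other systems on the WebVoyager benchmark with high precision.

Details Motivation: Existing web agents usually depend on HTML or DOM input, which limits their robustness and interpretability. WebSight addresses this with a vision-first architecture.

Contribution: 1. The WebSight-7B model, optimized for UI element interaction; 2. a modular multi-agent architecture that combines planning, reasoning, vision-action, and verification agents; 3. benchmark results that surpass several existing systems.

Method: 1. Fine-tunes WebSight-7B with LoRA on a web-focused subset of the Wave-UI-25K dataset; 2. coordinates the agents through a multi-agent architecture with an episodic memory mechanism.

Result: 1. WebSight-7B reaches 58.84% top-1 accuracy on Showdown Clicks; 2. WebSight achieves a 68.0% success rate on WebVoyager and answers correctly on 97.14% of the tasks it completes.

Insight: A vision-first architecture can markedly improve the robustness and interpretability of web agents while reducing dependence on structured input.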

Abstract: We introduce WebSight, a vision-based autonomous web agent, designed to interact with web environments purely through visual perception, eliminating dependence on HTML or DOM-based inputs. Central to our approach is our new model, WebSight-7B, a fine-tuned vision-language model optimized for UI element interaction, trained using LoRA on a web-focused subset of the Wave-UI-25K dataset. WebSight integrates this model into a modular multi-agent architecture, comprising planning, reasoning, vision-action, and verification agents, coordinated through an episodic memory mechanism. WebSight-7B achieves a top-1 accuracy of 58.84% on the Showdown Clicks benchmark, outperforming several larger generalist models while maintaining lower latency. The full WebSight agent achieves a 68.0% success rate on the WebVoyager benchmark, surpassing systems from labs such as OpenAI (61.0%) and HCompany (Runner H, 67.0%). Among tasks completed, WebSight answers correctly 97.14% of the time, indicating high precision. Together, WebSight and WebSight-7B establish a new standard for interpretable, robust, and efficient visual web navigation.

[220] MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

Nilay Pande,Sahiti Yerramilli,Jayant Sravan Tamarapalli,Rynaa Grover

Main category: cs.AI

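TL;DR: MaRVL-QA is a new benchmark for evaluating the mathematical reasoning of multimodal large language models (MLLMs) over images. It comprises two tasks, Topological Counting and Transformation Recognition, and shows that current MLLMs perform poorly, tending to rely on superficial heuristics rather than robust spatial reasoning.

Details Motivation: Existing MLLMs are weak at mathematical and spatial reasoning, especially on complex visual mathematical scenes. MaRVL-QA is designed as a rigorous benchmark to drive progress in deep reasoning.

Contribution: Proposes the MaRVL-QA benchmark with two new tasks (Topological Counting and Transformation Recognition), giving the research community a tool for quantitatively evaluating the mathematical reasoning of MLLMs.

Method: Generates test data from a curated library of mathematical functions and applies rigorous ambiguity filtering so that the tasks are challenging and standardized.

Result: State-of-the-art MLLMs struggle on MaRVL-QA and tend to use superficial heuristics rather than robust spatial reasoning, exposing the models' limitations.

Insight: MaRVL-QA is both an evaluation tool and a guide for future MLLM development, underscoring the need for stronger mathematical and spatial reasoning.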

Abstract: A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. The benchmark is generated from a curated library of functions with rigorous ambiguity filtering. Our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities.
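
To give a sense of what a ground-truth answer for Topological Counting might involve, the sketch below counts strict local maxima of a function sampled on a regular grid. This is not the benchmark's generation code; the function name `count_local_maxima`, the 8-neighbor rule, and the default grid size are assumptions for illustration.

```python
import math


def count_local_maxima(f, x_range, y_range, steps=61):
    """Count interior grid points strictly higher than all 8 of their neighbors."""
    (x0, x1), (y0, y1) = x_range, y_range
    xs = [x0 + (x1 - x0) * i / (steps - 1) for i in range(steps)]
    ys = [y0 + (y1 - y0) * j / (steps - 1) for j in range(steps)]
    grid = [[f(x, y) for y in ys] for x in xs]
    count = 0
    # Boundary points are skipped because they lack a full neighborhood.
    for i in range(1, steps - 1):
        for j in range(1, steps - 1):
            center = grid[i][j]
            neighbors = [
                grid[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
            ]
            if all(center > value for value in neighbors):
                count += 1
    return count
```

For instance, sin(x)·sin(y) on [0, 2π] × [0, 2π] has two interior peaks, at (π/2, π/2) and (3π/2, 3π/2), and the default 61-point grid places a sample on each. A grid-based count like this is sensitive to resolution and to flat or nearly tied regions, which is one plausible reason a benchmark would need ambiguity filtering.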

[221] SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Zhenwei Tang,Difan Jiao,Blair Yang,Ashton Anderson

Main category: cs.AI

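TL;DR: SEAM is a benchmark for evaluating whether vision-language models (VLMs) reason consistently across modalities, built by pairing standardized textual and visual notations in four domains. Across 21 models it finds a systematic modality imbalance, with vision frequently lagging language.

Details Motivation: In existing evaluations, comparisons across modalities are usually confounded by task differences and asymmetric information, which makes it hard to measure how consistently a VLM reasons across modalities.

Contribution: Proposes the SEAM benchmark, which pairs standardized textual and visual notations to give a rigorous assessment of the textual-symbolic and visual-spatial reasoning of VLMs.

Method: Pairs semantically equivalent inputs using existing standardized textual and visual notations in four domains, in contrast to OCR-based image-text pairing. Compares 21 models and analyzes modality imbalance and cross-modal agreement.

Result: Vision frequently lags language in VLMs, and cross-modal agreement is relatively low. Error analysis points to two main drivers: textual perception failures (from tokenization of domain notation) and visual perception failures (which induce hallucinations).

Insight: SEAM provides a controlled setting for measuring and improving modality-agnostic reasoning, and exposes the current shortcomings of VLMs in cross-modal consistency.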

Abstract: Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.

cs.LG

[222] Recall-Extend Dynamics: Enhancing Small Language Models through Controlled Exploration and Refined Offline Integration

Zhong Guan,Likang Wu,Hongke Zhao,Jiahui Wang,Le Wu

Main category: cs.LG

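TL;DR: The paper proposes the RED framework, which combines offline distillation data with online reinforcement learning to strengthen the reasoning of small language models, addressing insufficient exploration space and redundancy in the distillation process.

Details Motivation: Reinforcement learning has markedly improved the reasoning of large language models (LLMs), but comparable work on small language models (SLMs) is limited. Combining distilled data with online reinforcement learning is a natural approach, but it faces several challenges.

Contribution: Proposes the RED framework, which dynamically adjusts the weighting between offline data and online learning, addresses the insertion problem within offline data, and improves the reasoning of SLMs.

Method: Designs an offline-online weighting strategy based on the ratio of entropy changes, together with a sample-accuracy-based policy shift mechanism that balances imitating offline data against learning from the model's own policy.

Result: Addresses insufficient exploration space and distillation redundancy in SLMs, and improves the models' reasoning.

Insight: By dynamically trading off offline and online learning, RED offers a new approach to improving small models, particularly in resource-constrained settings.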

Abstract: Many existing studies have achieved significant improvements in the reasoning capabilities of large language models (LLMs) through reinforcement learning with verifiable rewards (RLVR), while the enhancement of reasoning abilities in small language models (SLMs) has not yet been sufficiently explored. Combining distilled data from larger models with RLVR on small models themselves is a natural approach, but it still faces various challenges and issues. Therefore, we propose Recall-Extend Dynamics (RED): Enhancing Small Language Models through Controlled Exploration and Refined Offline Integration. In this paper, we explore the perspective of varying exploration spaces, balancing offline distillation with online reinforcement learning. Simultaneously, we specifically design and optimize for the insertion problem within offline data. By monitoring the ratio of entropy changes in the model concerning offline and online data, we regulate the weight of offline-SFT, thereby addressing the issues of insufficient exploration space in small models and the redundancy and complexity during the distillation process. Furthermore, to tackle the distribution discrepancies between offline data and the current policy, we design a sample-accuracy-based policy shift mechanism that dynamically chooses between imitating offline distilled data and learning from its own policy.

[223] Hyperbolic Multimodal Representation Learning for Biological Taxonomies

ZeMing Gong,Chuanqi Tang,Xiaoliang Huo,Nicholas Pellegrino,Austin T. Wang,Graham W. Taylor,Angel X. Chang,Scott C. Lowe,Joakim Bruslund Haurum

Main category: cs.LG

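TL;DR: The paper studies multimodal representation learning in hyperbolic space for modelling the hierarchical structure of biological taxonomies, combining a contrastive loss with a novel stacked entailment-based objective over multimodal inputs.

Details Motivation: Taxonomic classification in biodiversity research organizes specimens into hierarchies, and conventional Euclidean space may not capture such hierarchical relations well. The authors therefore investigate whether hyperbolic space offers a better embedding.

Contribution: 1) A multimodal hyperbolic embedding framework; 2) a novel stacked entailment-based objective; 3) an evaluation on BIOSCAN-1M in which hyperbolic embeddings are competitive with Euclidean baselines and strongest on unseen-species classification using DNA barcodes.

Method: Combines a contrastive loss with the stacked entailment-based objective to embed multimodal inputs (such as images and DNA barcodes) into a shared hyperbolic space that captures hierarchical structure.

Result: Hyperbolic embedding performs competitively with Euclidean baselines and outperforms all other models on unseen-species classification using DNA barcodes, but fine-grained classification and open-world generalization remain challenging.

Insight: Hyperbolic space provides a structure-aware foundation for biodiversity modelling, with potential applications in species discovery, ecological monitoring, and conservation.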

Abstract: Taxonomic classification in biodiversity research involves organizing biological specimens into structured hierarchies based on evidence, which can come from multiple modalities such as images and genetic information. We investigate whether hyperbolic networks can provide a better embedding space for such hierarchical models. Our method embeds multimodal inputs into a shared hyperbolic space using a contrastive objective and a novel stacked entailment-based objective. Experiments on the BIOSCAN-1M dataset show that hyperbolic embedding achieves competitive performance with Euclidean baselines, and outperforms all other models on unseen species classification using DNA barcodes. However, fine-grained classification and open-world generalization remain challenging. Our framework offers a structure-aware foundation for biodiversity modelling, with potential applications to species discovery, ecological monitoring, and conservation efforts.
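
As background on why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball with curvature -1, one common model of hyperbolic space. The abstract does not say which model or curvature the paper uses, so this choice and the function name `poincare_distance` are assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    norm_u = sum(x * x for x in u)
    norm_v = sum(x * x for x in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - norm_u) * (1.0 - norm_v)))
```

Distances grow rapidly near the boundary of the ball, so general concepts can sit near the origin while many fine-grained leaves spread out near the edge and stay far apart from one another, which mirrors the branching of a taxonomy.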

[224] TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Yizhi Li,Qingshui Gu,Zhoufutu Wen,Ziniu Li,Tianshun Xing,Shuyue Guo,Tianyu Zheng,Xin Zhou,Xingwei Qu,Wangchunshu Zhou,Zheng Zhang,Wei Shen,Qian Liu,Chenghua Lin,Jian Yang,Ge Zhang,Wenhao Huang

Main category: cs.LG

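TL;DR: TreePO proposes a tree-based policy optimization method that uses dynamic tree sampling and fixed-length segment decoding to reduce the cost of rollouts while preserving diversity and exploration.

Details Motivation: Existing reinforcement learning methods for large language models are computationally expensive and explore a limited set of reasoning paths. TreePO aims to address both problems.

Contribution: 1. A segment-wise sampling algorithm that eases the KV cache burden; 2. a tree-based segment-level advantage estimation that considers both global and local optimization; 3. an analysis of probability- and quality-driven dynamic divergence and fallback strategies.

Method: Treats sequence generation as a tree-structured search, combining dynamic tree sampling with segment decoding, using local uncertainty to spawn branches and pruning low-value paths early.

Result: GPU-hour savings of 22% to 43% from the sampling design for the trained models, and up to 40% (trajectory-level) and 35% (token-level) reductions in sampling compute for existing models.

Insight: TreePO offers a practical path toward scaling RL-based post-training with fewer samples and less compute.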

Abstract: Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and the efficiency saving of GPU hours from 22% up to 43% of the sampling design for the trained models, meanwhile showing up to 40% reduction at trajectory-level and 35% at token-level sampling compute for the existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The home page is at https://m-a-p.ai/TreePO.
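
The idea of amortizing computation across common prefixes can be illustrated with a toy count. The sketch below is not TreePO's sampling algorithm; it only compares how many tokens would be decoded if each trajectory were generated independently with how many distinct prefix-tree nodes the same trajectories occupy. The function name and the use of a plain dictionary trie are assumptions.

```python
def count_decoded_tokens(trajectories):
    """Return (independent, shared): token counts without and with prefix sharing."""
    independent = sum(len(tokens) for tokens in trajectories)
    trie = {}
    shared = 0
    for tokens in trajectories:
        node = trie
        for token in tokens:
            if token not in node:
                node[token] = {}
                shared += 1  # a new trie node stands for one newly decoded token
            node = node[token]
    return independent, shared
```

With trajectories `a b c`, `a b d`, and `a e`, independent decoding produces 8 tokens while the shared tree holds only 5 nodes. The real savings depend on where branches are spawned and pruned, which this toy count does not model.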

[225] Characterizing the Behavior of Training Mamba-based State Space Models on GPUs

Trinayan Baruah,Kaustubh Shivdikar,Sara Prescott,David Kaeli

Main category: cs.LG

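TL;DR: The paper characterizes the behavior of training Mamba-based state space models (SSMs) on GPUs, builds a representative workload suite, and discusses potential optimizations for GPU microarchitecture design.

Details Motivation: Conventional Transformer models are hard to scale because self-attention has quadratic computational complexity, and Mamba-based SSMs offer an alternative. The paper studies how these models behave during GPU training and what that implies for hardware design.

Contribution: Builds an SSM workload suite that spans different model architectures and analyzes its performance behavior on GPUs, providing insights for future GPU optimization.

Method: Constructs a diverse suite of Mamba-based SSM models and evaluates their training behavior on GPUs, with a focus on microarchitecture-level performance characteristics.

Result: The analysis of Mamba-based SSMs on GPUs points to potential directions for hardware optimization.

Insight: These models' computation patterns differ from those of Transformers, which opens new optimization opportunities for GPU microarchitecture design, especially for long-sequence workloads.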

Abstract: Mamba-based State Space Models (SSM) have emerged as a promising alternative to the ubiquitous transformers. Despite the expressive power of transformers, the quadratic complexity of computing attention is a major impediment to scaling performance as we increase the sequence length. SSMs provide an alternative path that addresses this problem, reducing the computational complexity requirements of self-attention with novel model architectures for different domains and fields such as video, text generation and graphs. Thus, it is important to characterize the behavior of these emerging workloads on GPUs and understand their requirements during GPU microarchitectural design. In this work we evaluate Mamba-based SSMs and characterize their behavior during training on GPUs. We construct a workload suite that offers representative models that span different model architectures. We then use this suite to analyze the architectural implications of running Mamba-based SSMs on GPUs. Our work sheds new light on potential optimizations to continue scaling the performance for such models.
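
To show where the complexity difference comes from, here is a minimal sketch of a discrete linear state space recurrence with a diagonal state matrix. It is a simplification chosen for illustration: the parameters here are fixed, and the sketch is not the paper's workload or a Mamba implementation. The names `ssm_scan`, `a`, `b`, and `c` are assumptions.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t and y_t = sum_i c[i] * h_t[i]."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # One state update per step: cost is proportional to the state size only.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys
```

The loop touches each input once, so the cost grows linearly with sequence length, whereas self-attention compares every pair of positions. The sequential dependence between steps is also why such workloads can behave differently from attention on a GPU.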

[226] Proximal Supervised Fine-Tuning

Wenhong Zhu,Ruobing Xie,Rui Wang,Xingwu Sun,Di Wang,Pengfei Liu

Main category: cs.LG

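TL;DR: The paper proposes Proximal Supervised Fine-Tuning (PSFT), which borrows trust-region ideas from reinforcement learning to counter the loss of generalization caused by supervised fine-tuning (SFT). PSFT outperforms standard SFT on out-of-domain generalization across multiple tasks and domains.

Details Motivation: Standard supervised fine-tuning (SFT) of foundation models tends to hurt generalization: the model loses prior capabilities as it adapts to new tasks or domains. Inspired by TRPO and PPO in reinforcement learning, the authors introduce a trust region to constrain policy drift and thereby improve generalization.

Contribution: Proposes PSFT, a supervised fine-tuning objective that incorporates the benefits of a trust region. PSFT stabilizes optimization, improves generalization, and provides a stronger basis for later training stages.

Method: PSFT treats SFT as a special case of policy gradient methods (with constant positive advantages) and adds a trust-region constraint on policy change. This limits policy drift while keeping tuning competitive. Experiments cover mathematical and human-value domains.

Result: PSFT matches standard SFT in-domain and outperforms it on out-of-domain generalization; training remains stable without entropy collapse and gives a better starting point for subsequent optimization.

Insight: PSFT shows the potential of transferring reinforcement learning techniques such as trust regions to supervised learning, offering a new approach to the generalization loss caused by fine-tuning.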

Abstract: Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
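
The abstract does not spell out the PSFT objective, so the sketch below only shows the standard PPO clipped surrogate evaluated with a constant advantage of 1, which is one plausible reading of "policy gradient with constant positive advantages". The function name, the `eps=0.2` default, and the per-token framing are assumptions.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage=1.0, eps=0.2):
    """Standard PPO clipped surrogate for one token; higher is better."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes the incentive to move the ratio past the clip range.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage the term is capped at `1 + eps`: once the new policy has raised a target token's probability by more than that factor relative to the old policy, the surrogate is flat and pushes no further. That cap is one way a trust region can limit policy drift during fine-tuning.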

[227] Curvature Learning for Generalization of Hyperbolic Neural Networks

Xiaomeng Fan,Yuwei Wu,Zhi Gao,Mehrtash Harandi,Yunde Jia

Main category: cs.LG

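TL;DR: The paper derives a PAC-Bayesian generalization bound for hyperbolic neural networks (HNNs) and proposes a sharpness-aware curvature learning method that smooths the loss landscape and improves the generalization of HNNs.

Details Motivation: HNNs perform well on real-world data with hierarchical structure, and the choice of curvature is crucial to their optimization, yet the effect of curvature has lacked a theoretical foundation. The paper aims to fill this gap and to improve HNN generalization through curvature learning.

Contribution: 1. A PAC-Bayesian generalization bound for HNNs that shows how curvature affects the smoothness of the loss landscape; 2. a sharpness-aware curvature learning method that minimizes a sharpness measure for curvatures through bi-level optimization; 3. an implicit differentiation algorithm that solves the bi-level problem efficiently.

Method: 1. Designs a scope sharpness measure for curvatures and minimizes it through bi-level optimization; 2. introduces an implicit differentiation algorithm that approximates the gradients of curvatures, with an upper-bounded approximation error and a convergence analysis.

Result: Experiments in four settings (classification, learning from long-tailed data, learning from noisy data, and few-shot learning) show that the method improves the performance of HNNs.

Insight: Theoretical analysis and learned curvature both matter for the generalization of HNNs, and sharpness-aware methods offer a new approach to optimizing hyperbolic geometry.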

Abstract: Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.

[228] ShaLa: Multimodal Shared Latent Space Modelling

Jiali Cui,Yan-Ying Chen,Yanxia Zhang,Matthew Klenk

Main category: cs.LG

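TL;DR: ShaLa is a new framework for modelling a shared latent space across modalities. By improving the inference model and introducing a diffusion prior, it significantly improves multimodal synthesis quality and scales to more modalities.

Details Motivation: Existing multimodal VAEs struggle to design expressive joint posteriors for the shared latent representation and suffer from low-quality synthesis, and they fall short as the number of modalities, and with it the complexity of the shared latent space, increases.

Contribution: 1. Proposes a new framework for multimodal shared latent space modelling; 2. Introduces an improved inference model and a diffusion prior that improve both the inference of shared representations and synthesis quality; 3. Demonstrates superior performance and scalability on multiple benchmarks.

Method: Combines a novel inference model architecture with a second-stage diffusion prior to improve the inference of the shared latent representation and the quality of multimodal synthesis.

Result: Experiments show that ShaLa outperforms existing multimodal VAEs in synthesis quality and coherence, and that it scales to settings with many more modalities.

Insight: In shared latent space modelling, a more capable inference model and a diffusion prior are key to better performance on multimodal tasks.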

Abstract: This paper presents a novel generative framework for learning shared latent representations across multimodal data. Many advanced multimodal methods focus on capturing all combinations of modality-specific details across inputs, which can inadvertently obscure the high-level semantic concepts that are shared across modalities. Notably, Multimodal VAEs with low-dimensional latent variables are designed to capture shared representations, enabling various tasks such as joint multimodal synthesis and cross-modal inference. However, multimodal VAEs often struggle to design expressive joint variational posteriors and suffer from low-quality synthesis. In this work, ShaLa addresses these challenges by integrating a novel architectural inference model and a second-stage expressive diffusion prior, which not only facilitates effective inference of shared latent representation but also significantly improves the quality of downstream multimodal synthesis. We validate ShaLa extensively across multiple benchmarks, demonstrating superior coherence and synthesis quality compared to state-of-the-art multimodal VAEs. Furthermore, ShaLa scales to many more modalities while prior multimodal VAEs have fallen short in capturing the increasing complexity of the shared latent space.

[229] Robustness Feature Adapter for Efficient Adversarial Training

Quanwei Wu,Jun Guo,Wei Wang,Yi Wang

Main category: cs.LG

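TL;DR: The paper proposes an adapter-based approach that performs adversarial training directly in the feature space, improving computational efficiency and model robustness while addressing robust overfitting.

Details Motivation: Adversarial training (AT) is computationally expensive on large backbone models and suffers from robust overfitting, so an approach that is both efficient and robust is needed.

Contribution: Proposes an adapter-based adversarial training method in the feature space that significantly increases computational efficiency, eliminates robust overfitting, and generalizes adversarial robustness to unseen attacks.

Method: Adapter modules are introduced so that adversarial training runs in the feature space, which improves inner-loop convergence quality and reduces computational overhead.

Result: The approach is shown to be effective across different backbone architectures and in AT at scale, improving both model accuracy and adversarial robustness.

Insight: Adversarial training directly in the feature space via adapters is an efficient and robust option for large-scale models.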

Abstract: Adversarial training (AT) with projected gradient descent is the most popular method to improve model robustness under adversarial attacks. However, computational overheads become prohibitively large when AT is applied to large backbone models. AT is also known to have the issue of robust overfitting. This paper contributes to solving both problems simultaneously towards building more trustworthy foundation models. In particular, we propose a new adapter-based approach for efficient AT directly in the feature space. We show that the proposed adapter-based approach can improve the inner-loop convergence quality by eliminating robust overfitting. As a result, it significantly increases computational efficiency and improves model accuracy by generalizing adversarial robustness to unseen attacks. We demonstrate the effectiveness of the new adapter-based approach in different backbone architectures and in AT at scale.
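
As background for the inner loop mentioned above, here is a minimal sketch of a projected gradient descent (PGD) attack against a toy logistic-regression model. It is not the paper's adapter method; the model, the step sizes, and the function name are assumptions, and the vector x can be read as either an input or a feature vector.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD against a logistic-regression model (label y in {0, 1})."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))   # predicted P(y = 1)
        grad = (p - y) * w                           # gradient of cross-entropy w.r.t. x_adv
        x_adv = x_adv + alpha * np.sign(grad)        # step that increases the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project back into the eps-ball around x
    return x_adv

w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.5, 0.1]), 1
x_adv = pgd_attack(x, y, w, b)
```

Adversarial training then minimizes the loss on x_adv instead of x, so the cost of this loop is paid at every training step; that is the overhead the paper aims to reduce.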

[230] Learning to Detect Label Errors by Making Them: A Method for Segmentation and Object Detection Datasets

Sarina Penquitt,Tobias Riedlinger,Timo Heller,Markus Reischl,Matthias Rottmann

Main category: cs.LG

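TL;DR: The paper proposes a unified method for detecting label errors in object detection, semantic segmentation, and instance segmentation datasets. It injects simulated label errors into the ground truth and frames their detection as an instance segmentation problem.

Details Motivation: Current label error detection methods usually target a single task, and therefore a specific type of dataset, and they are not learning-based. This work aims to close that gap with a general, learning-based method.

Contribution: A unified method that detects label errors across several tasks (object detection, semantic segmentation, instance segmentation) by modelling detection as an instance segmentation problem.

Method: Different kinds of label errors are injected into the ground truth, and label error detection is then cast as an instance segmentation task on a composite input.

Result: The method is evaluated across multiple tasks, datasets, and base models, and the authors release 459 real label errors identified in the Cityscapes dataset.

Insight: Learning to detect label errors by making them adapts well to multiple tasks and label types and helps improve dataset quality.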

Abstract: Recently, detection of label errors and improvement of label quality in datasets for supervised learning tasks has become an increasingly important goal in both research and industry. The consequences of incorrectly annotated data include reduced model performance, biased benchmark results, and lower overall accuracy. Current state-of-the-art label error detection methods often focus on a single computer vision task and, consequently, a specific type of dataset, containing, for example, either bounding boxes or pixel-wise annotations. Furthermore, previous methods are not learning-based. In this work, we overcome this research gap. We present a unified method for detecting label errors in object detection, semantic segmentation, and instance segmentation datasets. In a nutshell, our approach - learning to detect label errors by making them - works as follows: we inject different kinds of label errors into the ground truth. Then, the detection of label errors, across all mentioned primary tasks, is framed as an instance segmentation problem based on a composite input. In our experiments, we compare the label error detection performance of our method with various baselines and state-of-the-art approaches of each task’s domain on simulated label errors across multiple tasks, datasets, and base models. This is complemented by a generalization study on real-world label errors. Additionally, we release 459 real label errors identified in the Cityscapes dataset and provide a benchmark for real label error detection in Cityscapes.
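
The idea of injecting label errors into the ground truth can be sketched for bounding-box annotations as follows. The three error types (drop, flip, shift), the probabilities, and the function name are illustrative assumptions, not the paper's exact error taxonomy.

```python
import random

def inject_label_errors(boxes, num_classes, p=0.2, seed=0):
    """Return a corrupted copy of `boxes` plus a record of the injected errors.

    Each box is (x, y, w, h, class_id); num_classes must be at least 2.
    """
    rng = random.Random(seed)
    noisy, errors = [], []
    for i, (x, y, w, h, cls) in enumerate(boxes):
        if rng.random() >= p:
            noisy.append((x, y, w, h, cls))          # keep the label untouched
            continue
        kind = rng.choice(["drop", "flip", "shift"])
        errors.append((i, kind))
        if kind == "drop":                           # simulate a missing annotation
            continue
        if kind == "flip":                           # simulate a wrong class label
            cls = rng.choice([c for c in range(num_classes) if c != cls])
        else:                                        # simulate a misplaced box
            x, y = x + 0.25 * w, y + 0.25 * h
        noisy.append((x, y, w, h, cls))
    return noisy, errors

clean = [(0.0, 0.0, 10.0, 10.0, 0)] * 20
noisy, errors = inject_label_errors(clean, num_classes=3, p=0.5, seed=1)
```

The `errors` record is what makes the setup learnable: it provides the targets a detector can be trained on before it is applied to real, unlabelled errors.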

[231] Topology Aware Neural Interpolation of Scalar Fields

Mohamed Kissi,Keanu Sisouk,Joshua A. Levine,Julien Tierny

Main category: cs.LG

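TL;DR: The paper proposes a neural, topology-aware method for interpolating time-varying scalar fields. Given keyframes and a time-varying sequence of persistence diagrams, it estimates the missing non-keyframe data.

Details Motivation: Interpolating time-varying scalar fields matters in areas such as scientific visualization, but conventional methods can ignore topological structure. The paper aims to improve the geometrical and topological accuracy of interpolation by exploiting persistence diagrams.

Contribution: A neural architecture combined with topological losses that interpolates non-keyframe data consistently in both geometry and topology.

Method: A neural network learns the relation from a time value to the corresponding scalar field, topological losses improve the interpolation, and at query time a single pass of the input time value through the network produces the result.

Result: Experiments on 2D and 3D datasets show that the method outperforms reference interpolation schemes in both data fitting and topological fitting.

Insight: Topological information can act as a strong constraint for neural interpolation and improves the geometrical and topological accuracy of the results.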

Abstract: This paper presents a neural scheme for the topology-aware interpolation of time-varying scalar fields. Given a time-varying sequence of persistence diagrams, along with a sparse temporal sampling of the corresponding scalar fields, denoted as keyframes, our interpolation approach aims at “inverting” the non-keyframe diagrams to produce plausible estimations of the corresponding, missing data. For this, we rely on a neural architecture which learns the relation from a time value to the corresponding scalar field, based on the keyframe examples, and reliably extends this relation to the non-keyframe time steps. We show how augmenting this architecture with specific topological losses exploiting the input diagrams improves both the geometrical and topological reconstruction of the non-keyframe time steps. At query time, given an input time value for which an interpolation is desired, our approach instantaneously produces an output, via a single propagation of the time input through the network. Experiments interpolating 2D and 3D time-varying datasets show our approach’s superiority, both in terms of data and topological fitting, with regard to reference interpolation schemes.

eess.AS

[232] Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters

Alessio Falai,Ziyao Zhang,Akos Gangoly

Main category: eess.AS

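TL;DR: The paper studies adapter-based cross-lingual text-to-speech (TTS) synthesis for lightweight systems, comparing unseen speaker adaptation with unseen language adaptation, and shows that adapters are effective at learning and retaining language and speaker information.

Details Motivation: To explore how adapters can adapt a pre-trained model to unseen speakers or new languages without forgetting what the model already knows, thereby extending lightweight TTS systems.

Contribution: Demonstrates the effectiveness of adapters for cross-lingual TTS and proposes an objective metric, inspired by mispronunciation detection for second-language (L2) learners, that measures how native the accent of the generated speech is.

Method: Adapter modules learn language-specific and speaker-specific information without modifying the original model parameters; experiments examine the impact of adapter configuration and placement.

Result: Adapters learn new languages and speakers effectively while avoiding catastrophic forgetting; the proposed objective metric is validated as a measure of accent nativeness.

Insight: Adapter placement and configuration, as well as the number of speakers used, affect model performance.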

Abstract: In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model’s speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides insights into the impact of adapter placement, configuration and the number of speakers used.

[233] HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan,Qiulin Li,Yutao Cui,Miles Yang,Yuehai Wang,Qun Yang,Jin Zhou,Zhao Zhong

Main category: eess.AS

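TL;DR: HunyuanVideo-Foley is an end-to-end text-video-to-audio generation framework. With a new data pipeline, a representation alignment strategy, and a multimodal diffusion transformer, it tackles the key challenge of generating high-fidelity audio synchronized with video.

Details Motivation: Video generation has advanced in visual realism, but the absence of synchronized audio limits immersion. HunyuanVideo-Foley targets multimodal data scarcity, modality imbalance, and the limited audio quality of existing methods.

Contribution: Three core innovations: (1) a scalable data pipeline that curates 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy based on self-supervised audio features; (3) a multimodal diffusion transformer that fuses audio and video and injects textual semantics.

Method: An automated data pipeline, a representation alignment strategy that uses self-supervised audio features to guide latent diffusion training, and a multimodal diffusion transformer that fuses audio and video through joint attention and injects text semantics through cross-attention.

Result: Evaluations show that HunyuanVideo-Foley achieves state-of-the-art performance in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.

Insight: Multimodal fusion and representation alignment are key to better synchronized audio generation, and the automated data pipeline makes training at scale feasible.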

Abstract: Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.

cs.MM

[234] VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev,Thaddäus Wiedemer,Ameya Prabhu,Matthias Bethge,Wieland Brendel,A. Sophia Koepke

Main category: cs.MM

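TL;DR: VGGSounder is a re-annotated, multi-label test set that addresses limitations of the VGGSound dataset for evaluating audio-visual foundation models, including incomplete labelling, partially overlapping classes, and misaligned modalities. It provides detailed modality annotations and a new modality confusion metric.

Details Motivation: VGGSound has limitations as a benchmark for multimodal foundation models, such as incomplete labelling, overlapping classes, and misaligned modalities, which distort evaluation results.

Contribution: Introduces VGGSounder, a re-annotated multi-label test set that supports detailed modality-specific performance analysis, together with a new modality confusion metric.

Method: Re-annotates and extends VGGSound into a multi-label test set and develops a modality confusion metric for analysing model performance.

Result: VGGSounder enables more precise evaluation of the modality-specific performance of audio-visual foundation models and reveals performance degradation when another input modality is added.

Insight: Modality alignment and multi-label annotation are essential for evaluating multimodal foundation models, and the new metric helps expose model limitations in more depth.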

Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
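
The abstract does not define the modality confusion metric, so the following is only one plausible reading, written as a hypothetical single-label sketch: the share of samples that a model classifies correctly from one modality but misclassifies once a second modality is added. The function name and the toy predictions are assumptions.

```python
def modality_confusion(labels, pred_single, pred_multi):
    """Share of samples that the single-modality input gets right
    but the two-modality input gets wrong (hypothetical definition)."""
    if not labels:
        return 0.0
    flipped = sum(
        1 for y, s, m in zip(labels, pred_single, pred_multi)
        if s == y and m != y
    )
    return flipped / len(labels)

labels      = ["dog", "cat", "car", "rain"]
audio_only  = ["dog", "cat", "bus", "rain"]
audio_video = ["dog", "bird", "car", "wind"]
score = modality_confusion(labels, audio_only, audio_video)
```

A multi-label test set such as VGGSounder would need a per-class version of this count; the single-label form is kept here only for brevity.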

cs.RO

[235] Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation

Dilermando Almeida,Guilherme Lazzarini,Juliano Negri,Thiago H. Segreto,Ricardo V. Godoy,Marcelo Becker

Main category: cs.RO

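TL;DR: The paper proposes a deep learning approach that uses a sim-to-real strategy to improve grasping in quadruped robots, reducing reliance on physical data collection and improving grasp precision and adaptability.

Details Motivation: Quadruped robots excel on complex, unstructured terrain, but precise and adaptable grasping with an added manipulator arm in dynamic scenarios remains a challenge. Existing methods depend on extensive real-world calibration and pre-programmed grasp configurations, which limits practical use.

Contribution: A deep learning framework trained on a synthetic, simulation-generated dataset, which reduces the need for real-world data and improves the grasping capabilities of quadruped robots.

Method: A synthetic grasp dataset with pixel-wise annotated grasp-quality maps is generated in the Genesis simulation environment, and a CNN with a U-Net-like architecture is trained on multi-modal input (RGB images, depth maps, segmentation masks, and surface normal maps) to output a grasp-quality heatmap.

Result: In validation on a four-legged robot, the system completed the full task of autonomous navigation, object perception, grasp pose prediction, and precise grasping, demonstrating the framework's effectiveness and scalability.

Insight: Simulated training combined with multi-modal sensor data is a viable route to better dynamic grasping, with potential applications in industrial automation and search-and-rescue missions.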

Abstract: Quadruped robots have emerged as highly efficient and versatile platforms, excelling in navigating complex and unstructured terrains where traditional wheeled robots might fail. Equipping these robots with manipulator arms unlocks the advanced capability of loco-manipulation to perform complex physical interaction tasks in areas ranging from industrial automation to search-and-rescue missions. However, achieving precise and adaptable grasping in such dynamic scenarios remains a significant challenge, often hindered by the need for extensive real-world calibration and pre-programmed grasp configurations. This paper introduces a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, focusing on improved precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap to identify the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work proves that leveraging simulated training with advanced sensing offers a scalable and effective solution for object handling.
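
The last step of the perception pipeline, turning a grasp-quality heatmap into a grasp point, can be sketched as a masked argmax. This is an assumed post-processing step for illustration; the abstract does not say how the optimal point is extracted, and the function name and optional mask are hypothetical.

```python
import numpy as np

def best_grasp_pixel(heatmap, mask=None):
    """Return (row, col, score) of the highest grasp-quality pixel."""
    scores = np.asarray(heatmap, dtype=float)
    if mask is not None:                       # optionally restrict to the target object
        scores = np.where(mask, scores, -np.inf)
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return int(row), int(col), float(scores[row, col])

heatmap = np.zeros((4, 5))
heatmap[0, 0] = 0.95                           # high score on the background
heatmap[2, 3] = 0.90                           # best score on the object
object_mask = np.zeros((4, 5), dtype=bool)
object_mask[2, :] = True
```

Passing a segmentation mask keeps a high-scoring background pixel from being chosen over a slightly lower-scoring pixel on the target object.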

[236] GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

Guanxing Lu,Baoxiong Jia,Puhao Li,Yixin Chen,Ziwei Wang,Yansong Tang,Siyuan Huang

Main category: cs.RO

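TL;DR: Proposes a Gaussian World Model (GWM) that predicts future states of a 3D scene by propagating Gaussian primitives with a diffusion transformer, supporting both imitation learning and reinforcement learning.

Details Motivation: Existing image-based world models lack robust geometric information and struggle to meet the 3D spatial requirements of robotic manipulation.

Contribution: Proposes GWM, which combines Gaussian primitives with a diffusion transformer for fine-grained 3D scene prediction.

Method: A latent Diffusion Transformer combined with a 3D variational autoencoder predicts future states, which are reconstructed with Gaussian Splatting.

Result: GWM performs well in both simulated and real-world experiments and outperforms existing methods.

Insight: 3D world models show potential for data scaling and for robotic manipulation.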

Abstract: Training robot policies within a learned world model is trending due to the inefficiency of real-world interactions. The established image-based world models and policies have shown prior success, but lack robust geometric information that requires consistent spatial and physical understanding of the three-dimensional world, even when pre-trained on internet-scale video sources. To this end, we propose a novel branch of world model named Gaussian World Model (GWM) for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine-grained scene-level future state reconstruction with Gaussian Splatting. GWM can not only enhance the visual representation for imitation learning agents by self-supervised future prediction training, but can also serve as a neural simulator that supports model-based reinforcement learning. Both simulated and real-world experiments show that GWM can precisely predict future scenes conditioned on diverse robot actions, and can be further utilized to train policies that outperform the state-of-the-art by impressive margins, showcasing the initial data scaling potential of 3D world models.

[237] SEBVS: Synthetic Event-based Visual Servoing for Robot Navigation and Manipulation

Krishna Vinod,Prithvi Jai Ramesh,Pavan Kumar B N,Bharatesh Chakravarthi

Main category: cs.RO

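TL;DR: The paper presents an open-source ROS package for Gazebo simulation that generates event streams from RGB camera feeds, and shows the advantages of event-based robotic policies in navigation and grasping tasks.

Details Motivation: Event cameras offer clear advantages for real-time robotic perception, such as low latency and high dynamic range, but mainstream robotics simulators lack support for them, which hinders the evaluation and adoption of event-driven approaches.

Contribution: An open-source, user-friendly v2e ROS package that generates event streams from RGB data in Gazebo simulation, together with an evaluation of event-based robotic policies on real-time tasks.

Method: Transformer-based event-based robotic policies (ERPs) are trained by behavior cloning and compared with RGB-based counterparts on navigation and grasping tasks.

Result: Experiments show that event-guided policies deliver competitive advantages across various operating conditions, supporting their potential for real-time robotic tasks.

Insight: The work lays a foundation for broader integration of event cameras into robotic policy learning and shows that synthetic event data are practical in simulated environments.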

Abstract: Event cameras offer microsecond latency, high dynamic range, and low power consumption, making them ideal for real-time robotic perception under challenging conditions such as motion blur, occlusion, and illumination changes. However, despite their advantages, synthetic event-based vision remains largely unexplored in mainstream robotics simulators. This lack of simulation setup hinders the evaluation of event-driven approaches for robotic manipulation and navigation tasks. This work presents an open-source, user-friendly v2e Robot Operating System (ROS) package for Gazebo simulation that enables seamless event stream generation from RGB camera feeds. The package is used to investigate event-based robotic policies (ERP) for real-time navigation and manipulation. Two representative scenarios are evaluated: (1) object following with a mobile robot and (2) object detection and grasping with a robotic manipulator. Transformer-based ERPs are trained by behavior cloning and compared to RGB-based counterparts under various operating conditions. Experimental results show that event-guided policies consistently deliver competitive advantages. The results highlight the potential of event-driven perception to improve real-time robotic navigation and manipulation, providing a foundation for broader integration of event cameras into robotic policy learning. The GitHub repo for the dataset and code: https://eventbasedvision.github.io/SEBVS/
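
The principle behind generating events from camera frames can be sketched with a simplified model: a pixel emits an event when its log intensity changes by more than a contrast threshold between two frames. This is not the package's implementation; the threshold value, the function name, and the grayscale frames in [0, 1] are assumptions, and a real simulator would also assign timestamps and model sensor noise.

```python
import numpy as np

def frames_to_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Emit (row, col, polarity) events where log intensity changes enough."""
    log_prev = np.log(np.asarray(prev_frame, dtype=float) + eps)
    log_next = np.log(np.asarray(next_frame, dtype=float) + eps)
    delta = log_next - log_prev
    on = np.argwhere(delta >= threshold)       # brightness increased
    off = np.argwhere(delta <= -threshold)     # brightness decreased
    events = [(int(r), int(c), 1) for r, c in on]
    events += [(int(r), int(c), -1) for r, c in off]
    return events

frame_a = np.full((2, 2), 0.5)
frame_b = frame_a.copy()
frame_b[0, 1] = 1.0                            # one pixel gets brighter
frame_b[1, 0] = 0.1                            # one pixel gets darker
events = frames_to_events(frame_a, frame_b)
```

Unchanged pixels produce no output at all, which is why event streams are sparse compared with full RGB frames.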

[238] Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

Bokai Ji,Jie Gu,Xiaokang Ma,Chu Tang,Jingmin Chen,Guangxia Li

Main category: cs.RO

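TL;DR: The paper argues that affordance prediction should depend on the task or instruction, and achieves instruction-oriented affordance prediction by building a new dataset and using large multimodal models.

Details Motivation: Existing affordance research often overlooks the influence of the task or instruction, even though different instructions can lead to different manipulation regions and directions for the same object.

Contribution: 1. A new dataset of fifteen thousand object-instruction-affordance triplets; 2. An iterative affordance prediction method based on large multimodal models.

Method: A "search against verifiers" pipeline in which the model predicts affordances progressively and refines them iteratively through self-verification.

Result: Experiments show that the method unlocks new instruction-oriented affordance prediction capabilities and performs strongly.

Insight: Affordance should be tied closely to the task or instruction, and large multimodal models can improve their predictions through progressive reasoning.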

Abstract: Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a “search against verifiers” pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves outstanding performance broadly.

[239] Scene-Agnostic Traversability Labeling and Estimation via a Multimodal Self-supervised Framework

Zipeng Fang,Yanbo Wang,Lei Zhao,Weidong Chen

Main category: cs.RO

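TL;DR: The paper proposes a multimodal self-supervised framework for scene-agnostic traversability labeling and estimation, combining multimodal data with sparse LiDAR-based supervision to improve performance.

Details Motivation: Existing self-supervised methods often fail to capture the characteristics of non-traversable regions, and most rely on a single modality, overlooking the complementary strengths of multiple modalities.

Contribution: 1. A multimodal self-supervised framework whose labels are generated from footprint, LiDAR, and camera data; 2. A dual-stream network that learns from different modalities in a decoupled manner; 3. Sparse LiDAR-based supervision that mitigates the noise introduced by pseudo labels.

Method: 1. Generate traversability labels with a vision foundation model prompted by footprint, LiDAR, and camera data; 2. Train a dual-stream network that learns multimodal features in a decoupled manner; 3. Add sparse LiDAR-based supervision.

Result: The automatic labeling method achieves around 88% IoU across diverse datasets; the multimodal network improves IoU by 1.6-3.5% over existing self-supervised state-of-the-art methods.

Insight: Integrating multimodal data and sparse supervision makes traversability estimation more robust.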

Abstract: Traversability estimation is critical for enabling robots to navigate across diverse terrains and environments. While recent self-supervised learning methods achieve promising results, they often fail to capture the characteristics of non-traversable regions. Moreover, most prior works concentrate on a single modality, overlooking the complementary strengths offered by integrating heterogeneous sensory modalities for more robust traversability estimation. To address these limitations, we propose a multimodal self-supervised framework for traversability labeling and estimation. First, our annotation pipeline integrates footprint, LiDAR, and camera data as prompts for a vision foundation model, generating traversability labels that account for both semantic and geometric cues. Then, leveraging these labels, we train a dual-stream network that jointly learns from different modalities in a decoupled manner, enhancing its capacity to recognize diverse traversability patterns. In addition, we incorporate sparse LiDAR-based supervision to mitigate the noise introduced by pseudo labels. Finally, extensive experiments conducted across urban, off-road, and campus environments demonstrate the effectiveness of our approach. The proposed automatic labeling method consistently achieves around 88% IoU across diverse datasets. Compared to existing self-supervised state-of-the-art methods, our multimodal traversability estimation network yields consistently higher IoU, improving by 1.6-3.5% on all evaluated datasets.
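
The IoU figures quoted above compare a predicted traversability mask with a reference mask. A minimal sketch of binary IoU follows; the convention of returning 1.0 when both masks are empty is an assumption, and the paper may average IoU over classes or datasets differently.

```python
import numpy as np

def binary_iou(pred, target):
    """Intersection over union of two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:                 # both masks empty: treat as a perfect match
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

pred_mask = [[1, 1], [0, 0]]       # predicted traversable cells
true_mask = [[1, 0], [1, 0]]       # reference traversable cells
iou = binary_iou(pred_mask, true_mask)
```

Here one cell is shared and three cells are covered by either mask, so the IoU is one third.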

cs.HC

[240] Humans Perceive Wrong Narratives from AI Reasoning Texts

Mosh Levy,Zohar Elyoseph,Yoav Goldberg

Main category: cs.HC

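TL;DR: The study finds a significant discrepancy between how humans understand the step-by-step reasoning text generated by AI models and the models' actual computation, challenging the usefulness of such text as an interpretability tool.

Details Motivation: To examine whether humans can accurately understand the reasoning text generated by AI models, and whether that text truly reflects the model's internal computation.

Contribution: Reveals a fundamental gap between how humans interpret reasoning texts and how models use them, calling into question their value as a tool for transparency and interpretability.

Method: Questions built from counterfactual measurements test whether humans can identify which steps in a reasoning text causally influence later steps.

Result: Participant accuracy was only 29.3%, barely above chance (25%), and remained low (42%) even for the majority vote on questions with high agreement.

Insight: Reasoning texts should not be taken at face value as a reflection of the model's internal process but treated as an artifact to be investigated; understanding how these models use language is a critical research direction.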

Abstract: A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly relied upon for transparency and interpretability. However, it is unclear whether human understanding of this text matches the model’s actual computational process. In this paper, we investigate a necessary condition for correspondence: the ability of humans to identify which steps in a reasoning text causally influence later steps. We evaluated humans on this ability by composing questions based on counterfactual measurements and found a significant discrepancy: participant accuracy was only 29.3%, barely above chance (25%), and remained low (42%) even when evaluating the majority vote on questions with high agreement. Our results reveal a fundamental gap between how humans interpret reasoning texts and how models use them, challenging their utility as a simple interpretability tool. We argue that reasoning texts should be treated as an artifact to be investigated, not taken at face value, and that understanding the non-human ways these models use language is a critical research direction.
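
The majority-vote figure can be read as follows: for each question, take the most common participant answer and score it, keeping only questions where agreement is high. The sketch below illustrates that computation under assumptions; the agreement threshold, the data layout, and the function name are hypothetical, and the abstract does not state how high agreement was defined.

```python
from collections import Counter

def majority_vote_accuracy(responses, answers, min_agreement=0.0):
    """Accuracy of the per-question majority answer.

    responses: {question_id: [choice, ...]}, answers: {question_id: choice}.
    Questions whose top choice has a vote share below `min_agreement` are skipped.
    """
    correct = kept = 0
    for qid, votes in responses.items():
        choice, count = Counter(votes).most_common(1)[0]
        if count / len(votes) < min_agreement:
            continue
        kept += 1
        correct += choice == answers[qid]
    return correct / kept if kept else float("nan")

responses = {
    "q1": ["A", "A", "B"],
    "q2": ["C", "D", "D", "D"],
    "q3": ["A", "A", "C", "D"],
}
answers = {"q1": "A", "q2": "C", "q3": "B"}
```

A chance level of 25% corresponds to four equally likely options per question, which is why the toy responses use the choices A to D.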

[241] Negative Shanshui: Real-time Interactive Ink Painting Synthesis

Aven-Le Zhou

Main category: cs.HC

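TL;DR: The paper presents Negative Shanshui, a real-time interactive AI synthesis approach that reinterprets classical Chinese landscape ink painting (shanshui) to engage with ecological crises in the Anthropocene. It builds on a fine-tuned Stable Diffusion model optimized for real-time inference, combined with gaze-driven inpainting and frame interpolation, to produce dynamic morphing animations and an interactive VR experience.

Details Motivation: To reinterpret classical Chinese ink landscape painting by combining art and technology, drawing attention to ecological crises.

Contribution: 1) Optimizes a Stable Diffusion model for real-time inference; 2) Combines gaze-driven inpainting with frame interpolation for a dynamic interactive experience; 3) Presents a multimodal deployment at an art festival.

Method: 1) Fine-tune a Stable Diffusion model; 2) Apply gaze-driven inpainting; 3) Generate morphing animations through frame interpolation; 4) Deploy the work as an interactive VR experience.

Result: At the public exhibition, participants engaged with the work in varied ways, through empathy, ambivalence, and critical reflection.

Insight: Combining art and technology can be an effective medium for raising public awareness of ecological issues.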

Abstract: This paper presents Negative Shanshui, a real-time interactive AI synthesis approach that reinterprets classical Chinese landscape ink painting, i.e., shanshui, to engage with ecological crises in the Anthropocene. Negative Shanshui optimizes a fine-tuned Stable Diffusion model for real-time inference and integrates it with gaze-driven inpainting and frame interpolation; it enables dynamic morphing animations in response to the viewer’s gaze and is presented as an interactive virtual reality (VR) experience. The paper describes the complete technical pipeline, covering the system framework, optimization strategies, gaze-based interaction, and multimodal deployment in an art festival. Further analysis of audience feedback collected during its public exhibition highlights how participants variously engaged with the work through empathy, ambivalence, and critical reflection.

q-bio.NC

[242] BrainPath: Generating Subject-Specific Brain Aging Trajectories

Yifan Li,Javad Sohankar,Ji Luo,Jing Li,Yi Su

Main category: q-bio.NC

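TL;DR: BrainPath is a 3D generative framework that predicts MRIs at arbitrary timepoints from a single baseline scan, forecasting subject-specific brain aging trajectories while preserving subtle, biologically meaningful variations.

Details Motivation: Most current approaches either predict chronological age or generate synthetic MRIs, and fail to capture subject-specific aging trajectories. BrainPath aims to fill this gap with personalized mapping of brain aging.

Contribution: Proposes the BrainPath framework, which combines an age calibration loss, a swap learning strategy, and an age perceptual loss to generate anatomically faithful MRIs.

Method: The framework learns longitudinal brain aging dynamics during training through its loss functions (age calibration loss, age perceptual loss) and a swap learning strategy.

Result: On held-out ADNI data and an independent NACC dataset, BrainPath outperforms reference models in SSIM, MSE, PSNR, and MRI age-difference accuracy.

Insight: BrainPath supports personalized modelling of brain aging and offers a new tool for research into neurodegenerative disease.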

Abstract: Quantifying and forecasting individual brain aging trajectories is critical for understanding neurodegenerative disease and the heterogeneity of aging, yet current approaches remain limited. Most models predict chronological age, an imperfect surrogate for biological aging, or generate synthetic MRIs that enhance data diversity but fail to capture subject-specific trajectories. Here, we present BrainPath, a 3D generative framework that learns longitudinal brain aging dynamics during training and, at inference, predicts anatomically faithful MRIs at arbitrary timepoints from a single baseline scan. BrainPath integrates an age calibration loss, a swap learning strategy, and an age perceptual loss to preserve subtle, biologically meaningful variations. Across held-out ADNI and an independent NACC dataset, BrainPath outperforms state-of-the-art reference models in structural similarity (SSIM), mean squared error (MSE), peak signal-to-noise ratio (PSNR), and MRI age-difference accuracy, while capturing realistic and temporally consistent aging patterns. Beyond methodological innovation, BrainPath enables personalized mapping of brain aging, synthetic follow-up scan prediction, and trajectory-based analyses, providing a foundation for precision modeling of brain aging and supporting research into neurodegeneration and aging interventions.
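
Two of the image-quality metrics named above are simple enough to state directly. The sketch below computes MSE and PSNR for images scaled to a known intensity range; the function names and the default range of 1.0 are assumptions, and SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    err = mse(a, b)
    if err == 0.0:                 # identical images
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

reference = np.zeros((2, 2))
predicted = np.full((2, 2), 0.1)   # uniform error of 0.1 per voxel
```

A uniform error of 0.1 on a unit range gives an MSE of 0.01 and therefore a PSNR of 20 dB; lower MSE and higher PSNR both indicate a closer match to the real follow-up scan.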

eess.IV

[243] Predicting brain tumour enhancement from non-contrast MR imaging with artificial intelligence

James K Ruffle,Samia Mohinta,Guilherme Pombo,Asthik Biswas,Alan Campbell,Indran Davagnanam,David Doig,Ahmed Hamman,Harpreet Hyare,Farrah Jabeen,Emma Lim,Dermot Mallon,Stephanie Owen,Sophie Wilkinson,Sebastian Brandner,Parashkev Nachev

Main category: eess.IV

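TL;DR: The paper develops a deep learning model that predicts brain tumour enhancement from non-contrast MRI sequences alone, which may reduce dependence on gadolinium contrast agents, and validates it on multi-centre data.

Details Motivation: Gadolinium administration is not always desirable in brain tumour MRI, for example in frequent follow-up, renal impairment, allergy, or paediatric patients, so a method that predicts enhancement from non-contrast MRI is needed.

Contribution: 1. Deep learning models (nnU-Net and others) that predict enhancing tumour from non-contrast T1-, T2-, and T2/FLAIR-weighted images only; 2. Development on 11089 brain MRI studies from 10 international datasets, with evaluation on 1109 held-out test patients showing that the best model outperforms expert radiologists.

Method: nnU-Net, SegResNet, and SwinUNETR models take non-contrast T1-, T2-, and T2/FLAIR-weighted images as input and predict and segment the enhancing tumour.

Result: nnU-Net performed best, detecting enhancing tumour with 83% balanced accuracy, 91.5% sensitivity, and 74.4% specificity. Predicted enhancement volumes correlated strongly with ground truth (R² = 0.859), and the model outperformed expert radiologists (69.8% accuracy).

Insight: Deep learning can predict brain tumour enhancement from non-contrast MRI with clinically relevant performance and shows promise as a screening tool that may reduce gadolinium use; future work should evaluate its clinical utility alongside radiology experts.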

Abstract: Brain tumour imaging assessment typically requires both pre- and post-contrast MRI, but gadolinium administration is not always desirable, such as in frequent follow-up, renal impairment, allergy, or paediatric patients. We aimed to develop and validate a deep learning model capable of predicting brain tumour contrast enhancement from non-contrast MRI sequences alone. We assembled 11089 brain MRI studies from 10 international datasets spanning adult and paediatric populations with various neuro-oncological states, including glioma, meningioma, metastases, and post-resection appearances. Deep learning models (nnU-Net, SegResNet, SwinUNETR) were trained to predict and segment enhancing tumour using only non-contrast T1-, T2-, and T2/FLAIR-weighted images. Performance was evaluated on 1109 held-out test patients using patient-level detection metrics and voxel-level segmentation accuracy. Model predictions were compared against 11 expert radiologists who each reviewed 100 randomly selected patients. The best-performing nnU-Net achieved 83% balanced accuracy, 91.5% sensitivity, and 74.4% specificity in detecting enhancing tumour. Enhancement volume predictions strongly correlated with ground truth (R² = 0.859). The model outperformed expert radiologists, who achieved 69.8% accuracy, 75.9% sensitivity, and 64.7% specificity. 76.8% of test patients had Dice over 0.3 (acceptable detection), 67.5% had Dice over 0.5 (good detection), and 50.2% had Dice over 0.7 (excellent detection). Deep learning can identify contrast-enhancing brain tumours from non-contrast MRI with clinically relevant performance. These models show promise as screening tools and may reduce gadolinium dependence in neuro-oncology imaging. Future work should evaluate clinical utility alongside radiology experts.
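
The detection and segmentation metrics quoted above follow standard definitions, sketched here as a consistency check on the reported figures. The function names are assumptions; balanced accuracy is the mean of sensitivity and specificity, and Dice is computed from voxel counts.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity (both given as fractions)."""
    return 0.5 * (sensitivity + specificity)

def dice(tp, fp, fn):
    """Dice coefficient from voxel counts; defined as 1.0 when all are zero."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom

# Sensitivity and specificity reported in the abstract for the best nnU-Net model.
model_ba = balanced_accuracy(0.915, 0.744)   # about 0.83, matching the reported 83%
```

The radiologists' 69.8% is reported as accuracy rather than balanced accuracy, so it is not recomputed here from their sensitivity and specificity.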

[244] Analysis of Transferability Estimation Metrics for Surgical Phase Recognition

Prabhant Singh,Yiping Li,Yasmina Al Khalil

Main category: eess.IV

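TL;DR: The paper studies transferability estimation metrics for surgical phase recognition. It formalizes source-independent transferability estimation (SITE) for this task and benchmarks three representative metrics (LogME, H-Score, and TransRate). LogME, particularly when aggregated by the minimum per-subset score, aligns most closely with fine-tuning accuracy, while the other metrics perform poorly.

Details Motivation: In surgical video analysis, expert annotation is costly and time-consuming, so choosing a suitable pre-trained model is critical. Fine-tuning every candidate is inefficient, which motivates SITE as a way to predict how well a model will fine-tune on target data.

Contribution: Formalizes SITE for surgical phase recognition, provides the first comprehensive benchmark of three transferability estimation metrics for this task, and offers practical guidelines for model selection.

Method: LogME, H-Score, and TransRate are evaluated on two datasets, RAMIE and AutoLaparo, with per-subset scoring and ablation studies.

Result: LogME, especially with minimum per-subset aggregation, aligns most closely with fine-tuning accuracy; H-Score has only weak predictive power; TransRate often inverts the true model rankings. When candidate models perform similarly, transferability estimates lose discriminative power.

Insight: Model diversity matters for transferability estimation; future work should develop domain-specific metrics and theoretical foundations to improve model selection.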

Abstract: Fine-tuning pre-trained models has become a cornerstone of modern machine learning, allowing practitioners to achieve high performance with limited labeled data. In surgical video analysis, where expert annotations are especially time-consuming and costly, identifying the most suitable pre-trained model for a downstream task is both critical and challenging. Source-independent transferability estimation (SITE) offers a solution by predicting how well a model will fine-tune on target data using only its embeddings or outputs, without requiring full retraining. In this work, we formalize SITE for surgical phase recognition and provide the first comprehensive benchmark of three representative metrics, LogME, H-Score, and TransRate, on two diverse datasets (RAMIE and AutoLaparo). Our results show that LogME, particularly when aggregated by the minimum per-subset score, aligns most closely with fine-tuning accuracy; H-Score yields only weak predictive power; and TransRate often inverts true model rankings. Ablation studies show that when candidate models have similar performances, transferability estimates lose discriminative power, emphasizing the importance of maintaining model diversity or using additional validation. We conclude with practical guidelines for model selection and outline future directions toward domain-specific metrics, theoretical foundations, and interactive benchmarking tools.
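
How well a transferability metric "aligns" with fine-tuning accuracy is usually judged by rank correlation across candidate models. The sketch below aggregates per-subset scores by their minimum and compares the resulting ranking with accuracies using Kendall's tau. The choice of Kendall's tau, the toy numbers, and the function names are assumptions; the abstract does not state which correlation measure the authors use.

```python
from itertools import combinations

def kendall_tau(scores, accuracies):
    """Kendall rank correlation (tau-a); needs at least two models."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        sign = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        concordant += sign > 0                 # pair ranked the same way by both
        discordant += sign < 0                 # pair ranked in opposite order
    pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / pairs

def aggregate_min(per_subset_scores):
    """Collapse each model's per-subset scores to its worst-case (minimum) score."""
    return [min(s) for s in per_subset_scores]

# Hypothetical metric scores for three models on two data subsets each.
per_subset = [[0.2, 0.5], [0.6, 0.7], [0.1, 0.9]]
finetune_acc = [0.70, 0.80, 0.65]
tau = kendall_tau(aggregate_min(per_subset), finetune_acc)
```

A tau near 1 means the metric ranks models as fine-tuning does, while a negative tau corresponds to the ranking inversion described for TransRate.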

[245] Multimodal Medical Endoscopic Image Analysis via Progressive Disentangle-aware Contrastive Learning

Junhao Wu,Yun Li,Junhao Li,Jingliang Bian,Xiaomao Fan,Wenbin Lei,Ruxin Wang

Main category: eess.IV

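TL;DR: This paper proposes a multimodal medical endoscopic image analysis framework based on an 'Align-Disentangle-Fusion' mechanism. It combines multi-scale distribution alignment and progressive feature disentanglement with contrastive learning to improve laryngo-pharyngeal tumor segmentation accuracy.

Details Motivation: Traditional single-modality imaging struggles to capture the complex anatomical and pathological features of laryngo-pharyngeal tumors, so a multimodal representation learning framework is needed to improve segmentation accuracy.

Contribution: A multimodal representation learning framework comprising multi-scale distribution alignment, a progressive feature disentanglement strategy, and disentangle-aware contrastive learning, which integrates White Light Imaging (WLI) and Narrow Band Imaging (NBI) data.

Method: The 'Align-Disentangle-Fusion' mechanism first reduces modality discrepancies through multi-scale distribution alignment. It then applies preliminary disentanglement followed by disentangle-aware contrastive learning to progressively separate modality-specific features from shared features, and finally performs semantic fusion.

Result: Experiments on multiple datasets show that the method outperforms state-of-the-art approaches across diverse clinical scenarios, with higher segmentation accuracy.

Insight: Combining progressive feature disentanglement with multimodal contrastive learning is the key: it preserves complementary information across modalities while suppressing redundant features, which improves segmentation performance.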

Abstract: Accurate segmentation of laryngo-pharyngeal tumors is crucial for precise diagnosis and effective treatment planning. However, traditional single-modality imaging methods often fall short of capturing the complex anatomical and pathological features of these tumors. In this study, we present an innovative multi-modality representation learning framework based on the 'Align-Disentangle-Fusion' mechanism that seamlessly integrates 2D White Light Imaging (WLI) and Narrow Band Imaging (NBI) pairs to enhance segmentation performance. A cornerstone of our approach is multi-scale distribution alignment, which mitigates modality discrepancies by aligning features across multiple transformer layers. Furthermore, a progressive feature disentanglement strategy is developed with the designed preliminary disentanglement and disentangle-aware contrastive learning to effectively separate modality-specific and shared features, enabling robust multimodal contrastive learning and efficient semantic fusion. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art approaches, achieving superior accuracy across diverse real clinical scenarios.

[246] TuningIQA: Fine-Grained Blind Image Quality Assessment for Livestreaming Camera Tuning

Xiangfei Sheng,Zhichao Duan,Xiaofeng Pan,Yipo Huang,Zhichao Yang,Pengfei Chen,Leida Li

Main category: eess.IV

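TL;DR: The paper proposes TuningIQA, a fine-grained blind image quality assessment method for livestreaming camera tuning. It builds a new dataset, FGLive-10K, and develops a model that integrates human-aware feature extraction with graph-based camera parameter fusion, significantly outperforming existing methods.

Details Motivation: Existing blind image quality assessment (BIQA) models typically predict only a coarse overall quality score, which cannot provide fine-grained perceptual guidance for tuning livestreaming camera parameters, so a finer-grained assessment method is needed.

Contribution: 1. FGLive-10K, a dataset covering diverse livestreaming scenarios and camera parameter configurations; 2. TuningIQA, which combines human-aware feature extraction with graph-based camera parameter fusion for fine-grained quality assessment.

Method: 1. Build FGLive-10K with multi-attribute quality annotations and fine-grained pairwise preference annotations; 2. Develop the TuningIQA model, integrating human-aware feature extraction and graph-based camera parameter fusion.

Result: TuningIQA significantly outperforms existing methods in both score regression and fine-grained quality ranking, and performs well when deployed for livestreaming camera tuning.

Insight: Pairing fine-grained quality assessment with human-aware features matters for livestreaming camera parameter tuning, and graph-based parameter fusion improves model performance.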

Abstract: Livestreaming has become increasingly prevalent in modern visual communication, where automatic camera quality tuning is essential for delivering superior user Quality of Experience (QoE). Such tuning requires accurate blind image quality assessment (BIQA) to guide parameter optimization decisions. Unfortunately, the existing BIQA models typically only predict an overall coarse-grained quality score, which cannot provide fine-grained perceptual guidance for precise camera parameter tuning. To bridge this gap, we first establish FGLive-10K, a comprehensive fine-grained BIQA database containing 10,185 high-resolution images captured under varying camera parameter configurations across diverse livestreaming scenarios. The dataset features 50,925 multi-attribute quality annotations and 19,234 fine-grained pairwise preference annotations. Based on FGLive-10K, we further develop TuningIQA, a fine-grained BIQA metric for livestreaming camera tuning, which integrates human-aware feature extraction and graph-based camera parameter fusion. Extensive experiments and comparisons demonstrate that TuningIQA significantly outperforms state-of-the-art BIQA methods in both score regression and fine-grained quality ranking, achieving superior performance when deployed for livestreaming camera tuning.

cs.DL [Back]

[247] Named Entity Recognition of Historical Text via Large Language Model

Shibingfeng Zhang,Giovanni Colavizza

Main category: cs.DL

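TL;DR: This paper explores applying large language models (LLMs) to named entity recognition (NER) in historical texts using zero-shot and few-shot prompting, and evaluates feasibility on the HIPE-2022 dataset.

Details Motivation: Annotated data for historical texts is scarce and the language is highly variable, which makes traditional supervised learning hard to apply. The authors therefore propose using LLMs for NER in this low-resource setting.

Contribution: The main contribution is showing that LLMs are effective for NER on historical texts, making zero-shot and few-shot prompting a practical option, especially when annotated data is insufficient.

Method: Zero-shot and few-shot prompting strategies are used to have an LLM identify named entities directly in historical text, without relying on large amounts of annotated data. Experiments are run on the HIPE-2022 dataset.

Result: LLM performance falls short of fully supervised models but is reasonably strong in this low-resource setting, which suggests LLMs are a viable alternative for historical-text NER.

Insight: LLMs show potential for low-resource or historical-text tasks, especially where annotated data is lacking, and can complement or replace traditional supervised methods.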

Abstract: Large language models have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Traditionally, NER is addressed using supervised machine learning approaches, which require large amounts of annotated training data. However, historical texts present a unique challenge, as the annotated datasets are often scarce or nonexistent, due to the high cost and expertise required for manual labeling. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments, conducted on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora, where traditional supervised methods are infeasible.
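As an illustration of the zero-shot versus few-shot distinction, the sketch below builds a plain-text NER prompt and parses a line-based answer. The prompt wording, the `entity | label` output format, the label set, and both function names are hypothetical choices for this example; they are not the prompts or the tag set used in the paper, and no model call is shown.

```python
def build_ner_prompt(sentence, examples=(), labels=("PER", "LOC", "ORG")):
    """Build a zero-shot (no examples) or few-shot NER prompt as plain text.

    `examples` is a sequence of (text, [(surface, label), ...]) pairs.
    """
    lines = [
        "Extract the named entities from the historical text below.",
        "Allowed labels: " + ", ".join(labels) + ".",
        "Answer with one 'entity | label' pair per line, or 'NONE'.",
        "",
    ]
    for text, entities in examples:
        lines.append("Text: " + text)
        lines.append("Entities:")
        if entities:
            lines.extend(f"{surface} | {label}" for surface, label in entities)
        else:
            lines.append("NONE")
        lines.append("")
    lines.append("Text: " + sentence)
    lines.append("Entities:")
    return "\n".join(lines)


def parse_ner_response(response, labels=("PER", "LOC", "ORG")):
    """Parse 'entity | label' lines, skipping malformed or unknown-label lines."""
    entities = []
    for line in response.splitlines():
        if "|" not in line:
            continue
        surface, label = (part.strip() for part in line.rsplit("|", 1))
        if surface and label in labels:
            entities.append((surface, label))
    return entities


# Zero-shot: no demonstrations. Few-shot: one invented demonstration.
zero_shot = build_ner_prompt("Mr. Smith sailed from Liverpool.")
few_shot = build_ner_prompt(
    "Mr. Smith sailed from Liverpool.",
    examples=[("The Queen arrived at Windsor.", [("Windsor", "LOC")])],
)
print(few_shot)
```

Keeping the output format line-based makes the response easy to validate, which matters when spelling variation in historical text leads the model to return unexpected strings.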

cs.GR [Back]

[248] MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

Prerit Gupta,Jason Alexander Fotso-Puepi,Zhengyuan Li,Jay Mehta,Aniket Bera

Main category: cs.GR

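TL;DR: MDD is a multimodal benchmark dataset for text-controlled and music-conditioned 3D duet dance motion generation, with high-quality motion capture data, synchronized music, and detailed natural language descriptions.

Details Motivation: There is a lack of multimodal datasets that combine text, music, and duet dance motion, which limits research on related tasks. MDD fills this gap.

Contribution: 1. MDD, the first multimodal dataset to combine text, music, and duet dance motion. 2. Two new tasks: Text-to-Duet and Text-to-Dance Accompaniment.

Method: The dataset contains 620 minutes of motion capture data synchronized with music, paired with over 10K fine-grained text descriptions. Baseline evaluations are provided for both text- and music-conditioned tasks.

Result: MDD provides a foundation for duet dance generation and includes baseline evaluations to support future research.

Insight: Multimodal datasets give dance generation richer conditioning signals and support research on text- and music-driven dance motion generation.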

Abstract: We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and detailed with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where given music and a textual prompt, both the leader's and the follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.

[249] A Survey of Deep Learning-based Point Cloud Denoising

Jinxi Wang,Ben Fei,Dasith de Silva Edirimuni,Zheng Liu,Ying He,Xuequan Lu

Main category: cs.GR

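TL;DR: This survey reviews deep learning-based point cloud denoising methods up to August 2025. It organizes the literature by supervision level and modeling perspective, establishes a unified benchmark, evaluates methods on denoising quality, computational efficiency, and other criteria, and discusses future research directions.

Details Motivation: Point clouds acquired in real-world environments are often corrupted by noise from factors such as sensors and lighting, which reduces geometric fidelity and degrades downstream performance. Classical optimization-based methods struggle with complex noise patterns, while deep learning methods that learn feature representations have made notable progress, so a systematic summary and evaluation is needed.

Contribution: 1. A systematic categorization of point cloud denoising methods by supervision level and modeling perspective; 2. A functional taxonomy that unifies methods by their denoising principles; 3. A unified benchmark evaluating denoising quality, computational efficiency, and more; 4. A discussion of future research directions.

Method: 1. Categorize methods by supervision level (supervised vs. unsupervised) and modeling perspective; 2. Propose a functional taxonomy that unifies different methods; 3. Benchmark under consistent training settings; 4. Analyze architectural trends and evaluation metrics (denoising quality, surface fidelity, point distribution, and computational efficiency).

Result: Deep learning methods show strong results, particularly on complex and large-scale point clouds, where classical methods struggle. The unified benchmark exposes the relative strengths and weaknesses of different methods.

Insight: 1. Deep learning performs well for point cloud denoising, but open challenges remain; 2. A unified taxonomy and benchmark help drive future research; 3. Future directions may include more efficient architectures and better generalization.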

Abstract: Accurate 3D geometry acquisition is essential for a wide range of applications, such as computer graphics, autonomous driving, robotics, and augmented reality. However, raw point clouds acquired in real-world environments are often corrupted with noise due to various factors such as sensor, lighting, material, and environment, which reduces geometric fidelity and degrades downstream performance. Point cloud denoising is a fundamental problem, aiming to recover clean point sets while preserving underlying structures. Classical optimization-based methods, guided by hand-crafted filters or geometric priors, have been extensively studied but struggle to handle diverse and complex noise patterns. Recent deep learning approaches leverage neural network architectures to learn distinctive representations and demonstrate strong outcomes, particularly on complex and large-scale point clouds. Given these significant advances, this survey provides a comprehensive and up-to-date review of deep learning-based point cloud denoising methods up to August 2025. We organize the literature from two perspectives: (1) supervision level (supervised vs. unsupervised), and (2) modeling perspective, proposing a functional taxonomy that unifies diverse approaches by their denoising principles. We further analyze architectural trends both structurally and chronologically, establish a unified benchmark with consistent training settings, and evaluate methods in terms of denoising quality, surface fidelity, point distribution, and computational efficiency. Finally, we discuss open challenges and outline directions for future research in this rapidly evolving field.

[250] DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

Hengyuan Zhang,Zhe Li,Xingqun Qi,Mengze Li,Muyi Sun,Man Zhang,Sirui Han

Main category: cs.GR

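TL;DR: This paper proposes DanceEditor, a framework for iterative, editable music-driven dance generation, and builds the large-scale DanceRemix dataset. By unifying multi-modal conditions, it supports both an initial prediction of dance motion and subsequent edits.

Details Motivation: Existing dance generation methods can synthesize dances directly but offer no way to edit the motion, which falls short of real choreography needs. The lack of high-quality editable dance datasets also limits progress.

Contribution: 1. DanceRemix, a large-scale multi-turn editable dance dataset; 2. DanceEditor, an iterative editable dance generation framework built on a prediction-then-editing paradigm that unifies music and text prompts as multi-modal conditions.

Method: A two-stage prediction-then-editing framework: 1. The initial stage models dance motion directly from the music signal; 2. The editing stage adds text descriptions as conditioning, and a Cross-modality Editing Module (CEM) adaptively integrates the initial prediction, music, and text prompts to produce semantically aligned dance sequences.

Result: Experiments show that DanceEditor outperforms existing methods on the newly built DanceRemix dataset, generating dance motion that follows the musical rhythm and matches the text semantics.

Insight: Unified modeling of multi-modal signals (music plus text) is key to dance generation and editing, and the large-scale editable dataset provides a basis for follow-up work.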

Abstract: Generating coherent and diverse human dances from music signals has made tremendous progress in animating virtual avatars. While existing methods support direct dance synthesis, they fail to recognize that enabling users to edit dance movements is far more practical in real-world choreography scenarios. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits progress on this challenge. To address this, we first construct DanceRemix, a large-scale multi-turn editable dance dataset featuring over 25.3M dance frames and 84.5K pairs. In addition, we propose a novel framework for iterative and editable dance generation coherently aligned with given music signals, namely DanceEditor. Considering that the dance motion should be both musically rhythmic and iteratively editable through user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the authority of generated results by directly modeling dance movements from tailored, aligned music. Moreover, at the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to draw the editable results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. Thereby, the results display music harmonics while preserving fine-grained semantic alignment with text descriptions. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected DanceRemix dataset. Code is available at https://lzvsdy.github.io/DanceEditor/.

[251] MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting

Hanzhi Chang,Ruijie Zhu,Wenjie Chang,Mulin Yu,Yanzhe Liang,Jiahao Lu,Zhuoyuan Li,Tianzhu Zhang

Main category: cs.GR

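TL;DR: MeshSplat proposes a generalizable sparse-view surface reconstruction framework based on Gaussian Splatting. It uses 2D Gaussian Splatting (2DGS) as a bridge between novel view synthesis and learned geometric priors, addressing the difficulty of recovering geometry from sparse inputs.

Details Motivation: Existing surface reconstruction methods struggle to recover accurate geometry when the input views are extremely sparse. MeshSplat targets this problem.

Contribution: 1. A generalizable sparse-view surface reconstruction framework; 2. 2DGS as the link between novel view synthesis and geometric priors; 3. A Weighted Chamfer Distance Loss and a normal prediction network to improve accuracy.

Method: 1. A feed-forward network predicts per-view pixel-aligned 2DGS; 2. A Weighted Chamfer Distance Loss regularizes the depth maps; 3. A normal prediction network aligns the orientation of the 2DGS with normals from a monocular normal estimator.

Result: Experiments show that MeshSplat achieves state-of-the-art performance on generalizable sparse-view mesh reconstruction.

Insight: Using 2DGS as an intermediate representation enables high-quality sparse-view reconstruction without direct 3D ground-truth supervision, offering a new approach to geometry recovery.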

Abstract: Surface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvements, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks. Project Page: https://hanzhichang.github.io/meshsplat_web
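The abstract does not spell out how the Weighted Chamfer Distance Loss is weighted, so the following is only a generic sketch of a symmetric Chamfer distance with per-point weights. The function names and the idea of passing the weights in directly (for example, larger weights for points in overlapping regions) are assumptions for illustration, not MeshSplat's formulation.

```python
def _nearest_sq_dist(point, others):
    """Squared Euclidean distance from `point` to its nearest neighbor in `others`."""
    return min(sum((a - b) ** 2 for a, b in zip(point, q)) for q in others)


def weighted_chamfer(points_a, points_b, weights_a=None, weights_b=None):
    """Symmetric Chamfer distance with per-point weights.

    Each direction is a weighted mean of squared nearest-neighbor distances.
    With weights left as None, this reduces to the usual unweighted form.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    if weights_a is None:
        weights_a = [1.0] * len(points_a)
    if weights_b is None:
        weights_b = [1.0] * len(points_b)

    def one_sided(src, dst, weights):
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return sum(w * _nearest_sq_dist(p, dst) for p, w in zip(src, weights)) / total

    return one_sided(points_a, points_b, weights_a) + one_sided(
        points_b, points_a, weights_b
    )


# Two tiny, made-up point sets that differ in one point.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

unweighted = weighted_chamfer(cloud_a, cloud_b)
# Hypothetical weights that emphasize the second point of each set.
emphasized = weighted_chamfer(cloud_a, cloud_b, [1.0, 3.0], [1.0, 3.0])
print(unweighted, emphasized)
```

This brute-force version is quadratic in the number of points and is meant only to show the structure of the loss; a practical implementation would use a batched tensor library and a nearest-neighbor index.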