Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 53]
- cs.SE [Total: 1]
- cs.SD [Total: 1]
- cs.CR [Total: 1]
- cs.IR [Total: 2]
- eess.IV [Total: 4]
- eess.SP [Total: 1]
- cs.AI [Total: 2]
- cs.LG [Total: 5]
cs.CL
[1] Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
Kimberly Le Truong,Riccardo Fogliato,Hoda Heidari,Zhiwei Steven Wu
Main category: cs.CL
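TL;DR: The paper uses persona-based LLM prompting to evaluate how the performance of large language models (LLMs) changes across writing styles, and finds that writing style has a significant effect on measured performance.
Details
Motivation: Existing LLM benchmarks show little writing-style diversity, which may mask brittle behavior on “non-standard” input, so a more comprehensive evaluation method is needed.
Contribution: A low-cost method (persona-based prompting) for generating diverse writing styles, with evidence of its effect on LLM performance, improving the external validity of evaluations.
Method: Rewrite evaluation prompts with persona-based prompting to emulate diverse writing styles, then test how style and prompt formatting affect LLM performance.
Result: Writing style and prompt formatting significantly affect estimated LLM performance. Certain styles consistently trigger low or high performance, irrespective of model family, size, and recency.
Insight: LLM evaluation should account for writing-style diversity. Persona-based prompting is a scalable way to augment existing benchmarks and improve their external validity.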
Abstract: Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.
[2] The role of media memorability in facilitating startups’ access to venture capital funding
L. Toschi,S. Torrisi,A. Fronzetti Colladon
Main category: cs.CL
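TL;DR: The study introduces the concept of “media memorability” and shows how it affects startups' access to venture capital, addressing the narrow focus of prior work on general media exposure. Using data from 197 UK startups, it finds that media memorability significantly influences investment outcomes, in particular through a startup's distinctiveness and its connectivity within news semantic networks.
Details
Motivation: To examine the role of more nuanced features of media content in venture capital decisions, correcting the existing literature's over-emphasis on general media exposure.
Contribution: Introduces the new concept of media memorability and shows its influence on startups' ability to attract venture capital, contributing to research on entrepreneurial finance and media legitimation.
Method: Using data from 197 UK startups in the micro and nanotechnology sector (funded between 1995 and 2004), the study measures distinctiveness and connectivity within news semantic networks to quantify media memorability and test its effect on investment outcomes.
Result: Media memorability significantly influences investment outcomes. Startups should strengthen brand memorability by highlighting their uniqueness and industry relevance rather than relying on frequent media mentions alone.
Insight: The depth and quality of media content matter for entrepreneurial finance, which gives startups a more targeted direction for media strategy.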
Abstract: Media reputation plays an important role in attracting venture capital investment. However, prior research has focused too narrowly on general media exposure, limiting our understanding of how media truly influences funding decisions. As informed decision-makers, venture capitalists respond to more nuanced aspects of media content. We introduce the concept of media memorability - the media’s ability to imprint a startup’s name in the memory of relevant investors. Using data from 197 UK startups in the micro and nanotechnology sector (funded between 1995 and 2004), we show that media memorability significantly influences investment outcomes. Our findings suggest that venture capitalists rely on detailed cues such as a startup’s distinctiveness and connectivity within news semantic networks. This contributes to research on entrepreneurial finance and media legitimation. In practice, startups should go beyond frequent media mentions to strengthen brand memorability through more targeted, meaningful coverage highlighting their uniqueness and relevance within the broader industry conversation.
[3] RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Dongyub Jude Lee,Zhenyi Ye,Pengcheng He
Main category: cs.CL
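TL;DR: RLfR is a reinforcement learning framework for machine translation driven by teacher-model feedback. The actor gradually imitates the teacher's refined outputs to improve translation quality, removing the dependence on static triplet data.
Details
Motivation: Preference-learning methods for machine translation (such as DPO) depend on large, carefully curated triplet datasets and struggle to generalize to new domains. RLfR addresses this with dynamic feedback from a teacher model.
Contribution: Proposes the RLfR framework, which uses continuous feedback from a teacher model (GPT-4o) to guide the actor toward better translations, combining negative edit distance (lexical and structural fidelity) with COMET score (semantic adequacy).
Method: Each step has three parts: (1) the actor generates a translation hypothesis, (2) the teacher model refines it, and (3) the actor is rewarded, via negative edit distance and COMET score, for how closely it aligns with the teacher's refinement.
Result: On the FLORES-200 benchmark, RLfR significantly improves COMET (semantic adequacy) and M-ETA (entity preservation) scores and outperforms MT-SFT and preference-based baselines.
Insight: Incremental learning from dynamic feedback mirrors a human learning process and offers a more flexible route to optimizing machine translation.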
Abstract: Preference-learning methods for machine translation (MT)–such as Direct Preference Optimization (DPO)–have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher’s refinement. Guided by two complementary signals–(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy–the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
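The two reward signals described above can be sketched as a single scalar reward. This is a minimal illustration, not the authors' implementation: the character-level Levenshtein distance, the length normalization, the mixing weight `alpha`, and the function names are all assumptions, and the COMET score is passed in as a precomputed float rather than computed here.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rlfr_reward(hypothesis: str, refinement: str,
                comet_score: float, alpha: float = 0.5) -> float:
    """Combine negative edit distance with a COMET score.

    Hypothetical weighting: `alpha` and the length normalization are
    illustrative choices, not taken from the paper.
    """
    max_len = max(len(hypothesis), len(refinement), 1)
    neg_edit = -edit_distance(hypothesis, refinement) / max_len
    return alpha * neg_edit + (1 - alpha) * comet_score
```

Under these assumptions, a hypothesis identical to the teacher's refinement incurs no edit penalty, so the reward reduces to the weighted COMET term.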
[4] A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers
Roxana Petcu,Samarth Bhargav,Maarten de Rijke,Evangelos Kanoulas
Main category: cs.CL
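TL;DR: The paper proposes a taxonomy of negation derived from philosophical, linguistic, and logical definitions, generates two benchmark datasets for evaluating neural information retrieval models, and proposes a logic-based classification mechanism for analyzing retrieval models on existing datasets.
Details
Motivation: Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. The paper studies negation in both traditional neural information retrieval and LLM-based models.
Contribution: 1. A taxonomy of negation; 2. Two benchmark datasets; 3. A logic-based classification mechanism for analyzing retrieval model performance.
Method: Derive the negation taxonomy from multiple definitions, construct datasets from it, and design a logic-based classification mechanism to evaluate model performance.
Result: The taxonomy yields a balanced data distribution over negation types, which leads to faster convergence on the NevIR dataset. The classification schema reveals the coverage of negation types in existing datasets.
Insight: The distribution of negation types may affect how well fine-tuned models generalize, and the paper provides a theoretical basis and tools for improving performance on negated queries.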
Abstract: Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.
[5] Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors
Jia Li,Yichao He,Jiacheng Xu,Tianhao Luo,Zhenzhen Hu,Richang Hong,Meng Wang
Main category: cs.CL
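TL;DR: The paper proposes Traits Run Deep, a personality assessment framework that combines psychology-guided LLM representations with multimodal apparent behaviors and substantially improves the accuracy of personality trait assessment.
Details
Motivation: Personality assessment matters for emotional intelligence, mental health diagnostics, and personalized education, but traditional methods struggle to capture stable traits and asynchronous patterns across modalities.
Contribution: 1. To the authors' knowledge, the first use of personality-specific prompts to guide LLMs in extracting personality-aware semantics; 2. A Text-Centric Trait Fusion Network that aligns multimodal signals; 3. First place in the Personality Assessment track of the AVI Challenge 2025.
Method: 1. Generate high-level personality semantic representations with psychology-informed prompts; 2. Align asynchronous signals with a multimodal fusion module (Chunk-Wise Projector, Cross-Modal Connector, Text Feature Enhancer); 3. Use an ensemble regression head to improve generalization.
Result: MSE drops by approximately 45% on the AVI validation set, and the method ranks first on the AVI Challenge 2025 test set.
Insight: Psychology-guided LLM prompting and multimodal alignment can substantially improve the accuracy and robustness of personality assessment, especially when data is scarce.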
Abstract: Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. It is hard to model personality semantics with traditional superficial features, and effective cross-modal understanding has seemed out of reach. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. Besides, it devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. To be specific, this fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion, and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method’s superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.
[6] NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models
Hyeonseok Moon,Heuiseok Lim
Main category: cs.CL
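TL;DR: The paper proposes NeedleChain, a new benchmark for evaluating the long-context understanding of large language models (LLMs) more thoroughly, and shows that existing tests such as NIAH may overestimate LLMs' real capability. It also proposes ROPE Contraction, a simple strategy for improving long-context understanding.
Details
Motivation: The NIAH benchmark is widely used to evaluate long-context understanding, but it may overestimate real performance: even advanced models such as GPT-4o struggle to fully incorporate a context made up only of query-relevant sentences. A more thorough evaluation is needed.
Contribution: 1. The NeedleChain benchmark, which requires the LLM to understand the entire relevant context to answer correctly; 2. The ROPE Contraction strategy for improving long-context understanding.
Method: 1. Design NeedleChain so that the context consists entirely of query-relevant information, with flexible context length and reasoning order; 2. Propose the ROPE Contraction strategy to improve long-context understanding.
Result: Experiments with advanced LLMs, including GPT-4o, reveal a notable gap between their ability to process long contexts and their ability to fully understand them. ROPE Contraction is proposed as a simple remedy.
Insight: Evaluating long-context understanding needs stricter tests, and a simple strategy such as ROPE Contraction may improve how well models actually use their context.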
Abstract: The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models’ (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to fully incorporate a given context made up of only ten query-relevant sentences. In response, we introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
[7] AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini
Jill Walker Rettberg,Hermann Wigers
Main category: cs.CL
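TL;DR: The paper studies 11,800 stories generated by GPT-4o-mini and finds that, although they include surface-level national symbols and themes, their plot structure is highly homogeneous: they emphasize stability and tradition and downplay conflict and change, revealing a new kind of bias in AI narratives.
Details
Motivation: To test whether a language model trained largely on Anglo-American texts can generate stories that are culturally relevant to other nationalities.
Contribution: Documents narrative homogenisation in AI-generated stories and identifies a distinct form of AI bias, narrative standardisation, with relevance to literary studies, critical AI studies, and NLP research.
Method: Send the prompt “Write a 1500 word potential {demonym} story” to GPT-4o-mini, generating 50 stories for each of 236 countries (11,800 in total), then analyze their narrative structure and themes.
Result: The plots are highly homogeneous: a protagonist lives in or returns to a small town and resolves a minor conflict through tradition and community events. Real-world conflicts are sanitised, romance is almost absent, and nostalgia and reconciliation are emphasized.
Insight: AI-generated narratives favour stability and tradition over change and growth. This narrative standardisation is an implicit cultural bias and a challenge for the cultural alignment and diversity of generative AI.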
Abstract: Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories - 50 for each of 236 countries - by sending the prompt “Write a 1500 word potential {demonym} story” to OpenAI’s model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.
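The data collection described above amounts to filling one prompt template per country and repeating it 50 times. The sketch below only builds the request list; the function name and the example demonyms are illustrative assumptions, and no API call is made.

```python
# Prompt template quoted in the abstract; {demonym} is filled per country.
PROMPT_TEMPLATE = "Write a 1500 word potential {demonym} story"
STORIES_PER_COUNTRY = 50


def build_requests(demonyms, n=STORIES_PER_COUNTRY):
    """Return one prompt string per story to be generated."""
    return [PROMPT_TEMPLATE.format(demonym=d)
            for d in demonyms
            for _ in range(n)]


# Two example demonyms (illustrative); the study covers 236 countries.
requests = build_requests(["Norwegian", "Japanese"])
```

With 236 demonyms and 50 repetitions each, the same construction yields the 11,800 prompts reported in the abstract.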
[8] Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
Jingwei Zuo,Maksim Velikanov,Ilyas Chahed,Younes Belkada,Dhia Eddine Rhayem,Guillaume Kunsch,Hakim Hacid,Hamza Yous,Brahim Farhat,Ibrahim Khadraoui,Mugariya Farooq,Giulia Campesan,Ruxandra Cojocaru,Yasser Djilali,Shi Hu,Iheb Chaabane,Puneesh Khanna,Mohamed El Amine Seddik,Ngoc Dung Huynh,Phuc Le Khac,Leen AlQadi,Billel Mokeddem,Mohamed Chami,Abdalgader Abubaker,Mikhail Lubinets,Kacper Piskorski,Slim Frikha
Main category: cs.CL
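TL;DR: The Falcon-H1 family of language models combines Transformer attention with State Space Models (SSMs) in a hybrid design, targeting both efficiency and performance. It supports multilingual and long-context tasks and matches or outperforms larger existing models.
Details
Motivation: Earlier Falcon models were built solely on Transformer or Mamba architectures. Falcon-H1 adopts a hybrid design to gain the long-context memory and computational efficiency associated with SSMs while improving performance.
Contribution: 1. A parallel hybrid architecture combining Transformer attention and SSMs; 2. A systematic revisit of model design, data strategy, and training dynamics; 3. Released models at multiple parameter scales, including quantized versions, with multilingual and long-context support.
Method: A parallel hybrid of Transformer-based attention and SSMs, combined with a revised training strategy and data selection.
Result: Falcon-H1-34B matches or outperforms models up to 70B scale (such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B). Among the smaller models, Falcon-H1-1.5B-Deep rivals current leading 7B-10B models.
Insight: Hybrid architectures show clear advantages for language models, particularly on long-context and multilingual tasks, and the gains in parameter efficiency widen the range of practical applications.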
Abstract: In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.
[9] What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models
Tian Yun,Chen Sun,Ellie Pavlick
Main category: cs.CL
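TL;DR: By revisiting earlier experiments and arguments, the paper examines whether large language models (LLMs) are “abstract reasoners”. It finds that fine-tuning can greatly improve performance, but the improvement does not necessarily transfer across datasets.
Details
Motivation: Recent work argues that LLMs are not “abstract reasoners” because of their poor zero-shot performance. The paper revisits those experiments to add nuance to the claim.
Contribution: Shows that although LLMs perform poorly zero-shot, tuning a small subset of parameters for input encoding can enable near-perfect performance, and identifies the limit of this gain: it does not necessarily transfer across datasets.
Method: Measure zero-shot LLM performance on the tasks, then tune a small subset of parameters and compare the resulting performance gains and their transferability.
Result: Fine-tuning substantially improves performance, but the improvement does not necessarily generalize across datasets.
Insight: The results invite a rethink of what “abstract reasoner” means; LLM capability here may depend on task-specific adaptation rather than purely abstract reasoning.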
Abstract: Recent work has argued that large language models (LLMs) are not “abstract reasoners”, citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an “abstract reasoner”, and why it matters whether LLMs fit the bill.
[10] SLM-SQL: An Exploration of Small Language Models for Text-to-SQL
Lei Sheng,Shuai-Shuai Xu
Main category: cs.CL
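TL;DR: The paper explores the potential of small language models (SLMs) for Text-to-SQL. Post-training techniques significantly improve their performance, and they retain advantages in inference speed and edge deployment.
Details
Motivation: Large language models (LLMs) perform well on Text-to-SQL, while SLMs underperform because of limited logical reasoning ability. SLMs nonetheless have inherent advantages in inference speed and edge deployment, so improving them is worthwhile.
Contribution: A post-training recipe (supervised fine-tuning and reinforcement learning) for improving SLMs on Text-to-SQL, plus two new datasets (SynSQL-Think-916K and SynSQL-Merge-Think-310K).
Method: Build two derived datasets from SynSQL-2.5M, apply supervised fine-tuning and reinforcement learning-based post-training to the SLMs, and run inference with a corrective self-consistency approach.
Result: On the BIRD development set, the five evaluated models improve by 31.4 points on average. The 0.5B model reaches 56.87% execution accuracy (EX) and the 1.5B model reaches 67.08% EX.
Insight: Despite their small size, SLMs can become much more competitive on a specific task such as Text-to-SQL through targeted training and dataset construction, while keeping their inference-speed and deployment advantages.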
Abstract: Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX. We will release our dataset, model, and code to github: https://github.com/CycloneBoy/slm_sql.
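The abstract does not spell out how the corrective self-consistency step works. As a rough illustration of the underlying idea only, the sketch below implements plain execution-based self-consistency: run each candidate query, discard those that fail, and keep a query whose result set is the most common. The function names are hypothetical, and the "corrective" merge-revision stage is not modeled.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return an order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def self_consistent_sql(conn, candidates):
    """Pick the first candidate whose execution result wins a majority vote."""
    results = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, res) for sql, res in results if res is not None]
    if not valid:
        return None  # every candidate failed to execute
    winner, _ = Counter(res for _, res in valid).most_common(1)[0]
    return next(sql for sql, res in valid if res == winner)
```

Voting on execution results rather than on query text lets syntactically different but equivalent queries support each other.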
[11] ControlMed: Adding Reasoning Control to Medical Language Model
Sung-Min Lee,Siyoon Lee,Juyeon Kim,Kyungmin Roh
Main category: cs.CL
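TL;DR: ControlMed is a medical language model that lets users actively control the length of the reasoning process at inference time through fine-grained control markers, addressing the computational overhead and response latency of existing reasoning models.
Details
Motivation: Clinical decision-making is life-critical. Existing reasoning LLMs are accurate and explainable, but their unnecessarily lengthy reasoning causes computational overhead and latency that hinder practical deployment.
Contribution: Proposes ControlMed, which lets users control reasoning length dynamically and balances performance and efficiency through a three-stage training pipeline (pre-training, supervised fine-tuning, reinforcement learning).
Method: 1) Pre-training on a large-scale synthetic medical instruction dataset; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; 3) reinforcement learning with model-based reward signals.
Result: On a variety of English and Korean medical benchmarks, the model performs similarly to or better than state-of-the-art models, and users can flexibly trade off reasoning accuracy against computational efficiency.
Insight: ControlMed shows that dynamic control over reasoning length matters for efficiency and practicality in clinical question answering and medical information analysis.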
Abstract: Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.
[12] Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Xikang Yang,Biyu Zhou,Xuehai Tang,Jizhong Han,Songlin Hu
Main category: cs.CL
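TL;DR: The paper proposes CognitiveAttack, a red-teaming framework that systematically exploits interactions among multiple cognitive biases to bypass LLM safety mechanisms. It reaches a 60.1% attack success rate, substantially higher than existing methods.
Details
Motivation: LLM safety mechanisms remain susceptible to attacks that exploit cognitive biases, and prior jailbreaking work has focused on prompt engineering or algorithmic manipulation. The paper examines the synergy of multiple cognitive biases, an underexplored attack vector.
Contribution: Proposes the CognitiveAttack framework, which systematically leverages both individual and combined cognitive biases against LLM safeguards, and demonstrates its effectiveness.
Method: Integrate supervised fine-tuning and reinforcement learning to generate prompts that embed optimized bias combinations, then use them to attack LLMs.
Result: Across 30 LLMs, CognitiveAttack achieves a 60.1% attack success rate versus 31.6% for the SOTA black-box method PAP, with open-source models particularly vulnerable.
Insight: Multi-bias interactions are a powerful new attack vector against LLM safety, and bridging cognitive science and LLM safety offers a new perspective for building more robust, human-aligned AI systems.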
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases – systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
[13] Unveiling the Influence of Amplifying Language-Specific Neurons
Inaya Rahmanisa,Lyzander Marciano Andrylie,Krisna Mahardika Ihsani,Alfan Farizki Wicaksono,Haryo Akbarianto Wibowo,Alham Fikri Aji
Main category: cs.CL
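TL;DR: The paper studies neurons in large language models that correlate strongly with individual languages. Intervention experiments that amplify these neurons across 18 languages show that amplification effectively steers output toward the target language but offers limited benefit on cross-lingual tasks.
Details
Motivation: The role of language-specific neurons has been studied mainly through deactivation; how amplifying them affects multilingual behavior remains underexplored.
Contribution: An amplification-based intervention on language-specific neurons, with experiments showing that it effectively steers output toward the target language while exposing its limits on cross-lingual tasks.
Method: Amplify language-specific neurons through interventions, compare amplification factors with the proposed Language Steering Shift (LSS) score, and evaluate on several downstream tasks.
Result: Amplification effectively steers output toward nearly all tested languages. On downstream tasks it improves self-language performance in some cases but generally degrades cross-language results.
Insight: Amplifying language-specific neurons can be beneficial, especially for low-resource languages, but provides limited advantage for cross-lingual transfer.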
Abstract: Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
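The amplification intervention can be pictured as scaling the activations of a chosen set of neurons by a constant factor. The sketch below operates on a plain list of floats for clarity; in a real model this would typically be applied to a layer's activations during the forward pass. The function name, the neuron indices, and the factor are illustrative assumptions, not values from the paper.

```python
def amplify(activations, neuron_ids, factor):
    """Scale the selected neurons' activations by `factor`; leave others as-is."""
    selected = set(neuron_ids)
    return [a * factor if i in selected else a
            for i, a in enumerate(activations)]


# Hypothetical layer output with neuron 1 treated as language-specific.
steered = amplify([1.0, 2.0, 3.0], neuron_ids=[1], factor=2.0)
```

A factor of 0 in this sketch corresponds to the deactivation setting studied in earlier work, while factors above 1 correspond to amplification.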
[14] BALSAM: A Platform for Benchmarking Arabic Large Language Models
Rawan Al-Matham,Kareem Darwish,Raghad Al-Rasheed,Waad Alshammari,Muneera Alhoshan,Amal Almazrua,Asma Al Wazrah,Mais Alheraki,Firoj Alam,Preslav Nakov,Norah Alzahrani,Eman alBilali,Nizar Habash,Abdelrahman El-Sheikh,Muhammad Elmallah,Haonan Li,Hamdy Mubarak,Mohamed Anwar,Zaid Alyafeai,Ahmed Abdelali,Nora Altwairesh,Maram Hasanain,Abdulmohsen Al Thubaity,Shady Shehata,Bashar Alhafni,Injy Hamed,Go Inoue,Khalid Elmadani,Ossama Obeid,Fatima Haouari,Tamer Elsayed,Emad Alghamdi,Khalid Almubarak,Saied Alshahrani,Ola Aljarrah,Safa Alajlan,Areej Alshaqarawi,Maryam Alshihri,Sultana Alghurabi,Atikah Alzeghayer,Afrah Altamimi,Abdullah Alfaifi,Abdulrahman AlOsaimy
Main category: cs.CL
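TL;DR: BALSAM is a comprehensive, community-driven benchmarking platform that aims to advance the development and evaluation of Arabic large language models (LLMs), filling gaps in existing Arabic benchmarks.
Details
Motivation: Arabic LLM performance lags behind English because of data scarcity, the linguistic diversity of Arabic and its dialects, and morphological complexity. Existing Arabic benchmarks typically rely on static data, lack comprehensive task coverage, or lack dedicated platforms, which makes real progress hard to measure.
Contribution: BALSAM includes 78 NLP tasks from 14 categories with 52K examples (37K test and 15K development), and provides a centralized, transparent platform for blind evaluation.
Method: Build a comprehensive Arabic benchmark platform through a community-driven effort, with broad task coverage and blind test sets to mitigate data contamination.
Result: BALSAM offers a unified evaluation standard for Arabic LLMs and promotes collaborative research.
Insight: BALSAM fills a gap in Arabic LLM evaluation and provides tooling and support for future Arabic NLP work.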
Abstract: The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
[15] Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Daniil Gurgurov,Katharina Trinley,Yusser Al Ghussin,Tanja Baeumel,Josef van Genabith,Simon Ostermann
Main category: cs.CL
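TL;DR: The paper proposes language arithmetics, a way to systematically identify and manipulate language-specific neurons in language models, and shows its effectiveness on multilingual tasks.
Details
Motivation: Large language models (LLMs) show strong multilingual abilities, but the neural mechanisms behind language-specific processing remain unclear. The paper aims to expose these mechanisms and provide a method for precise control.
Contribution: 1. Uses the Language Activation Probability Entropy (LAPE) method to identify neurons that control language behavior; 2. Shows that these neurons cluster in deeper layers; 3. Uses language arithmetics to steer the model's language behavior.
Method: Analyze language-specific neurons with LAPE, then manipulate them through language arithmetics (systematic activation addition and multiplication) to control model behavior across language tasks.
Result: On five multilingual tasks (language forcing, translation, QA, comprehension, and NLI), the method outperforms simpler replacement approaches. Manipulation works better for high-resource languages, and typological similarity improves effectiveness.
Insight: Language-specific neurons concentrate in deeper layers, and related languages share overlapping neurons, reflecting linguistic proximity. Cross-lingual neuron steering can improve downstream performance and reveals internal “fallback” mechanisms for language selection.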
Abstract: Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
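The abstract names LAPE without giving its formula. As the method is commonly described in earlier work on language-specific neurons, each neuron's activation probabilities (one per language) are normalized into a distribution, and the entropy of that distribution is taken; low entropy means the neuron fires mostly for one or a few languages. The sketch below follows that description, with the function names and the example probabilities as assumptions.

```python
import math


def lape(activation_probs):
    """Entropy of a neuron's normalized per-language activation probabilities."""
    total = sum(activation_probs)
    if total == 0:
        return float("inf")  # never active: treat as not language-specific
    dist = [p / total for p in activation_probs]
    return -sum(p * math.log(p) for p in dist if p > 0)


def most_language_specific(neuron_probs, k):
    """Return the k neuron names with the lowest LAPE score."""
    ranked = sorted(neuron_probs.items(), key=lambda kv: lape(kv[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical activation probabilities over three languages.
neurons = {"n0": [0.3, 0.3, 0.3], "n1": [0.9, 0.0, 0.0], "n2": [0.0, 0.0, 0.0]}
top = most_language_specific(neurons, k=1)
```

Neurons selected this way would then be the targets for the activation addition and multiplication that the paper calls language arithmetics.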
[16] Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment
Jia Li,Yang Wang,Wenhao Qian,Zhenzhen Hu,Richang Hong,Meng Wang
Main category: cs.CL
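TL;DR: The paper proposes a comprehensive multimodal (video, audio, text) framework for interview performance assessment. It fuses extracted features with a multilayer perceptron and applies a two-level ensemble learning strategy to produce holistic, unbiased assessments.
Details
Motivation: Interview performance assessment is essential for professional selection. Traditional methods may miss implicit multimodal cues, so a more comprehensive and fair framework is needed.
Contribution: A comprehensive framework called “365” that integrates three modalities, six responses per candidate, and five evaluation dimensions, with efficient cross-modal feature fusion and robust prediction.
Method: 1. Encode the data with modality-specific feature extractors; 2. Fuse the features with a Shared Compression Multilayer Perceptron; 3. Apply a two-level ensemble strategy (independent regression heads per response, then mean pooling across responses) to produce final scores.
Result: The framework achieves a multi-dimensional average MSE of 0.1824 and took first place in the AVI Challenge 2025.
Insight: By capturing both explicit and implicit multimodal cues, the framework makes interview assessment more comprehensive and fair and offers a direction for automated multimodal analysis.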
Abstract: Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores “365” aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
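The second level of the ensemble, mean pooling across a candidate's responses, and the MSE metric can be sketched in a few lines. The per-response predictions are assumed to come from the regression heads and are hard-coded here; the function names and the example numbers are illustrative only.

```python
def final_scores(per_response_preds):
    """Mean-pool per-response predictions into one score per dimension.

    `per_response_preds` is a list of rows, one per response, each holding
    one predicted score per evaluation dimension.
    """
    n = len(per_response_preds)
    dims = len(per_response_preds[0])
    return [sum(row[d] for row in per_response_preds) / n
            for d in range(dims)]


def mse(predicted, target):
    """Mean squared error between two equal-length score lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Six responses, five dimensions (made-up predictions).
preds = [[1, 2, 3, 4, 5]] * 6
pooled = final_scores(preds)
```

Averaging over the six responses smooths out a single weak or noisy answer before the five dimension scores are compared with the labels.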
[17] From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs
Jie He,Victor Gutierrez Basulto,Jeff Z. Pan
Main category: cs.CL
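TL;DR: The paper proposes TIRESRAG-R1, a new framework that improves the retrieval-augmented reasoning of LLMs through a multi-dimensional reward system, addressing the information insufficiency, faulty reasoning, and answer-reasoning inconsistency seen in existing RAG methods.
Details
Motivation: Existing reinforcement-learning-based retrieval-augmented generation (RAG) methods rely only on final-answer rewards and overlook intermediate reasoning quality, which leads to failure modes such as information insufficiency, faulty reasoning, and answer-reasoning inconsistency. Contribution: The TIRESRAG-R1 framework, which introduces a multi-dimensional reward system (sufficiency, reasoning quality, and reflection rewards) together with a difficulty-aware reweighting strategy and training sample filtering.
Method: A think-retrieve-reflect process is trained with the multi-dimensional reward system (sufficiency, reasoning quality, and reflection rewards) and a difficulty-aware reweighting strategy to improve reasoning.
Result: Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks.
Insight: A multi-dimensional reward system combined with a difficulty-aware strategy can improve the stability and accuracy of reasoning, particularly on complex tasks.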
Abstract: Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.
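As a rough illustration of how several reward signals and a difficulty weight might be combined into one scalar, consider the sketch below. The abstract does not give the combination formula, so the weighted sum, the default weights, and the multiplicative `difficulty` factor are all assumptions made for this example rather than the TIRESRAG-R1 definition.

```python
def combined_reward(answer, sufficiency, reasoning, reflection,
                    weights=(1.0, 0.5, 0.5, 0.5), difficulty=1.0):
    """Hypothetical weighted sum of four reward components.

    answer, sufficiency, reasoning, reflection: component rewards in [0, 1].
    weights: illustrative coefficients (assumed, not from the paper).
    difficulty: assumed multiplicative weight that is larger for harder samples.
    """
    w_answer, w_suff, w_reason, w_reflect = weights
    base = (w_answer * answer
            + w_suff * sufficiency
            + w_reason * reasoning
            + w_reflect * reflection)
    return difficulty * base


# A correct answer with sufficient retrieval and a useful reflection,
# but a flawed reasoning chain, on a sample weighted as harder than average.
reward = combined_reward(1.0, 1.0, 0.0, 1.0, difficulty=1.5)
```

The point of the sketch is only that intermediate signals contribute to the training reward alongside the final answer, so two rollouts with the same answer can receive different rewards.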
[18] Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index
Praveenkumar Katwe,Rakesh Chandra,Balabantaray Kali,Prasad Vittala
Main category: cs.CL
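TL;DR: The paper proposes using reinforcement learning to optimize the Entity Hallucination Index (EHI) in order to reduce hallucinations in summarization.
Details
Motivation: Hallucinations in abstractive summarization limit the use of language models in real-world settings, especially where the accuracy and grounding of entity information matter. Contribution: Introduces the Entity Hallucination Index (EHI) as a quantitative metric and optimizes the model for it within a reinforcement learning framework to reduce hallucinations.
Method: Generate baseline summaries with a pre-trained language model, compute EHI scores, then fine-tune the model with reinforcement learning using EHI as the reward signal.
Result: Experiments show consistent improvements in EHI and a significant reduction in entity-level hallucinations, with no degradation in fluency or informativeness.
Insight: The approach offers a lightweight, scalable fine-tuning method for reducing hallucinations that does not rely on human-written factuality annotations.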
Abstract: Reducing hallucinations in abstractive summarization remains a critical challenge for deploying language models (LMs) in real-world settings. In this work, we introduce a reward-driven fine-tuning framework that explicitly optimizes for Entity Hallucination Index (EHI), a metric designed to quantify the presence, correctness, and grounding of named entities in generated summaries. Given a corpus of meeting transcripts, we first generate baseline summaries using a pre-trained LM and compute EHI scores via automatic entity extraction and matching. We then apply reinforcement learning to fine-tune the model parameters, using EHI as a reward signal to bias generation toward entity-faithful outputs. Our approach does not rely on human-written factuality annotations, enabling scalable fine-tuning. Experiments demonstrate consistent improvements in EHI across datasets, with qualitative analysis revealing a significant reduction in entity-level hallucinations without degradation in fluency or informativeness. We release a reproducible Colab pipeline, facilitating further research on hallucination-aware model fine-tuning using lightweight hallucination metrics like EHI.
[19] Opportunities and Challenges of LLMs in Education: An NLP Perspective
Sowmya Vajjala,Bashar Alhafni,Stefano Bannò,Kaushal Kumar Maurya,Ekaterina Kochmar
Main category: cs.CL
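TL;DR: The paper examines the opportunities and challenges of large language models (LLMs) in education, focusing on their potential applications and problems in teaching, learning, and assessment.
Details
Motivation: As LLMs become widespread, their potential use in education is drawing growing attention. The paper analyzes, from an NLP perspective, how LLMs are changing language education and identifies future research directions. Contribution: (1) Frames LLMs in education around two application scenarios (assistance and assessment) and analyzes them along four dimensions (reading, writing, speaking, and tutoring). (2) Identifies the new directions LLMs enable and the key challenges that remain.
Method: A literature review and scenario analysis of the application potential of LLMs in education from an NLP perspective.
Result: LLMs show considerable potential in education but also face challenges such as fairness, effectiveness, and adaptability.
Insight: LLMs open new possibilities for language education, but practical deployment requires resolving technical and ethical issues, which future research should address.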
Abstract: Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment. In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios, assistance and assessment, grounding them along four dimensions: reading, writing, speaking, and tutoring. We then present the new directions enabled by LLMs, and the key challenges to address. We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future.
[20] Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization
Weijia Zhang,Songgaojun Deng,Evangelos Kanoulas
Main category: cs.CL
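TL;DR: The paper proposes a structured plan representation, TaSoF, and a framework, SPaGe, to address the ambiguity and lack of structure of natural language plans in query-focused table summarization, improving reliability and scalability.
Details
Motivation: Natural language (NL) plans in query-focused table summarization are ambiguous and unstructured, which limits their conversion into executable programs such as SQL and hurts performance on multi-table tasks in particular. Contribution: (1) Introduces the structured plan TaSoF; (2) proposes the SPaGe framework, which handles the reasoning process in three phases; (3) shows experimentally that it outperforms existing methods in both single- and multi-table settings.
Method: (1) Structured Planning generates TaSoF from a query; (2) Graph-based Execution converts plan steps into SQL and models their dependencies via a directed cyclic graph; (3) Summary Generation produces the query-focused summary.
Result: On three public benchmarks, SPaGe outperforms prior models in both single- and multi-table settings.
Insight: Structured representations improve the capture of complex dependencies and the scalability of the task.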
Abstract: Query-focused table summarization requires complex reasoning, often approached through step-by-step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi-table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph-based Execution to convert plan steps into SQL and model dependencies via a directed cyclic graph for parallel execution, and 3) Summary Generation to produce query-focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings, demonstrating the advantages of structured representations for robust and scalable summarization.
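The idea of running plan steps in parallel once their dependencies are satisfied can be sketched with the Python standard library. This is not SPaGe's implementation: the step names are hypothetical, and the sketch assumes, for its own purposes, that the dependencies between steps contain no cycles, since batch scheduling of this kind requires it (`prepare()` raises `graphlib.CycleError` otherwise).

```python
from graphlib import TopologicalSorter  # Python 3.9+


def parallel_batches(dependencies):
    """Group steps into batches whose members can run concurrently.

    dependencies: maps each step name to the set of steps it depends on.
    Returns a list of batches; every step in a batch has all of its
    prerequisites in earlier batches.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: two independent filters feed a join, then an aggregate.
plan = {
    "filter_a": set(),
    "filter_b": set(),
    "join": {"filter_a", "filter_b"},
    "aggregate": {"join"},
}
batches = parallel_batches(plan)
```

Here the two filter steps land in the same batch, so their SQL could be issued concurrently, while the join waits for both.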
[21] Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
Kwesi Cobbina,Tianyi Zhou
Main category: cs.CL
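TL;DR: The paper identifies a new positional bias in the in-context learning (ICL) of large language models (LLMs), the DPP bias: the position of demonstrations (demos) within the prompt significantly affects model predictions and accuracy.
Details
Motivation: ICL performance is known to be sensitive to the choice and order of demos, but the effect of where the demos sit in the prompt has not been studied systematically, so the paper examines how this bias affects model performance. Contribution: First to define and quantify the DPP bias; designs an evaluation pipeline and two metrics (ACCURACY-CHANGE and PREDICTION-CHANGE) and verifies the effect across multiple tasks and models.
Method: Systematic experiments on the effect of demo position across classification, question answering, summarization, and reasoning tasks, using ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE).
Result: Placing demos at the start of the prompt yields the most stable and accurate outputs (gains of up to +6 points), whereas placing them at the end of the user message flips over 30% of predictions without improving correctness on QA tasks. Smaller models are affected most, though large models remain marginally affected on complex tasks.
Insight: Demo position is a key factor in ICL performance and should be tuned during prompt design in practice.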
Abstract: In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL’s performance can be sensitive to the choices of demos and their order. This paper investigates an unexplored new positional bias of ICL for the first time: we observe that the predictions and accuracy can drift drastically when the positions of demos, the system prompt, and the user message in LLM input are varied. We refer to this bias as DEMOS’ POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, question answering, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by changes in the demos’ position. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30% of predictions without improving correctness on QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.
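The two metrics can be made concrete with a small sketch. The abstract describes ACCURACY-CHANGE as capturing net gains and PREDICTION-CHANGE as capturing output volatility; the exact formulas below (difference in accuracy, and fraction of examples whose prediction differs from the baseline position) are one plausible reading and should be treated as assumptions, as are the function names.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def accuracy_change(baseline_preds, moved_preds, gold):
    """Net accuracy gain after moving the demos (assumed definition)."""
    return accuracy(moved_preds, gold) - accuracy(baseline_preds, gold)


def prediction_change(baseline_preds, moved_preds):
    """Share of examples whose prediction flipped (assumed definition)."""
    if len(baseline_preds) != len(moved_preds) or not baseline_preds:
        raise ValueError("inputs must be non-empty and of equal length")
    flips = sum(b != m for b, m in zip(baseline_preds, moved_preds))
    return flips / len(baseline_preds)


# Toy data: moving the demos flips half the answers but accuracy is unchanged.
gold = ["a", "b", "a", "b"]
baseline = ["a", "a", "a", "b"]
moved = ["a", "b", "b", "b"]
delta_acc = accuracy_change(baseline, moved, gold)
volatility = prediction_change(baseline, moved)
```

The toy data shows why both numbers are needed: a position change can leave accuracy flat while still altering many individual predictions.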
cs.CV
[22] Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go?
Shuqing Li,Qiang Chen,Xiaoxue Ren,Michael R. Lyu
Main category: cs.CV
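TL;DR: The paper presents the first large-scale empirical study of physics failures in physics-engine-based software systems. It proposes a taxonomy of physics failure manifestations, evaluates several detection methods, and reports developers' practical experience.
Details
Motivation: Physics engines (PEs) are widely used in both entertainment and safety-critical systems, but physics failures can undermine reliability and degrade user experience. Current testing approaches typically require white-box access and focus on crashes, so they cannot handle semantically complex physics failures. Contribution: (1) A taxonomy of physics failure manifestations; (2) a comprehensive evaluation of detection methods, including deep learning, prompt-based techniques, and large multimodal models; (3) actionable recommendations for improving detection, drawn from developer experience.
Method: An empirical study that characterizes the manifestations of physics failures and compares the effectiveness of several detection techniques, including deep learning, prompt-based techniques, and large multimodal models.
Result: The study finds that current detection methods have limitations and need further improvement. The taxonomy and evaluation results provide a basis for future research.
Insight: The complexity of physics failures calls for more flexible, semantics-aware detection methods, and developer feedback offers useful guidance on where to improve.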
Abstract: Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. Despite their importance, PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles or medical robotics. Current testing approaches for PE-based software are inadequate, typically requiring white-box access and focusing on crash detection rather than semantically complex physics failures. This paper presents the first large-scale empirical study characterizing physics failures in PE-based software. We investigate three research questions addressing the manifestations of physics failures, the effectiveness of detection techniques, and developer perceptions of current detection practices. Our contributions include: (1) a taxonomy of physics failure manifestations; (2) a comprehensive evaluation of detection methods including deep learning, prompt-based techniques, and large multimodal models; and (3) actionable insights from developer experiences for improving detection approaches. To support future research, we release PhysiXFails, code, and other materials at https://sites.google.com/view/physics-failure-detection.
[23] Trade-offs in Image Generation: How Do Different Dimensions Interact?
Sicheng Zhang,Binzhu Xie,Zhonghao Yan,Yuli Zhang,Donghao Zhou,Xiaofei Chen,Shi Qiu,Jiaqi Liu,Guoyang Xie,Zhichao Lu
Main category: cs.CV
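TL;DR: TRIG-Bench is a new benchmark for quantifying the trade-offs among multiple performance dimensions of image generation models (such as realism and diversity). The paper also proposes the TRIGScore metric and the Dimension Trade-off Map (DTM) visualization. Experiments show that DTM helps explain trade-offs in model performance and that fine-tuning can improve the model accordingly.
Details
Motivation: Evaluation of image generation models is usually confined to a single dimension, and the trade-offs among dimensions such as quality and alignment have not been studied systematically. TRIG-Bench and TRIGScore are proposed to fill this gap. Contribution: (1) TRIG-Bench, with 40,200 samples spanning 10 dimensions; (2) the TRIGScore metric; (3) the Relation Recognition System and the DTM visualization; (4) experimental validation of DTM.
Method: (1) Build the TRIG-Bench dataset; (2) design the TRIGScore metric; (3) generate DTM with the Relation Recognition System; (4) evaluate 14 models on T2I and I2I tasks.
Result: Experiments show that DTM clearly exposes the trade-offs between dimensions for each model and that fine-tuning guided by it further improves performance.
Insight: The performance dimensions of image generation models trade off against each other in complex ways. DTM gives visual guidance for model optimization and helps improve overall performance.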
Abstract: Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models’ complex trade-offs among these dimensions have rarely been explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To bridge this gap, we introduce TRIG-Bench (Trade-offs in Image Generation), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity, and Bias), contains 40,200 samples, and covers 132 pairwise dimensional subsets. Furthermore, we develop TRIGScore, a VLM-as-judge metric that automatically adapts to various dimensions. Based on TRIG-Bench and TRIGScore, we evaluate 14 models across T2I and I2I tasks. In addition, we propose the Relation Recognition System to generate the Dimension Trade-off Map (DTM) that visualizes the trade-offs among model-specific capabilities. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, we show that the model’s dimension-specific weaknesses can be mitigated through fine-tuning on DTM to enhance overall performance. Code is available at: https://github.com/fesvhtr/TRIG
[24] AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock
Umair Nawaz,Muhammad Zaigham Zaheer,Fahad Shahbaz Khan,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer
Main category: cs.CV
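TL;DR: The paper is a systematic survey of deep learning techniques in agriculture, covering more than 200 studies and focusing on AI applications in crops, fisheries, and livestock, such as disease detection, health management, and species monitoring. It also discusses challenges including data variability, datasets, and performance evaluation, and proposes future research directions.
Details
Motivation: Global food production faces challenges such as climate change and resource limitations, and needs efficient, accurate, and scalable AI solutions to support sustainable management. Contribution: (1) A systematic summary of deep learning and foundation model applications in agriculture; (2) an analysis of experimental challenges such as data variability and evaluation metrics; (3) future research directions such as multimodal data integration and edge-device deployment.
Method: A systematic review of more than 200 studies, categorizing the use of conventional machine learning, deep learning (e.g., vision transformers), and vision-language foundation models (e.g., CLIP) in agriculture.
Result: The paper summarizes recent progress on tasks such as crop disease detection, livestock health management, and aquatic species monitoring, and identifies data and experimental challenges.
Insight: Future research should focus on multimodal data integration, efficient edge-device deployment, and AI models that adapt to diverse farming environments.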
Abstract: Crops, fisheries and livestock form the backbone of global food production, essential to feed the ever-growing global population. However, these sectors face considerable challenges, including climate variability, resource limitations, and the need for sustainable management. Addressing these issues requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey presents a systematic and thorough review of more than 200 research works covering conventional machine learning approaches, advanced deep learning techniques (e.g., vision transformers), and recent vision-language foundation models (e.g., CLIP) in the agriculture domain, focusing on diverse tasks such as crop disease detection, livestock health management, and aquatic species monitoring. We further cover major implementation challenges such as data variability and experimental aspects: datasets, performance evaluation metrics, and geographical focus. We finish the survey by discussing potential open research directions emphasizing the need for multimodal data integration, efficient edge-device deployment, and domain-adaptable AI models for diverse farming environments. Rapid growth of evolving developments in this field can be actively tracked on our project page: https://github.com/umair1221/AI-in-Agriculture
[25] Color as the Impetus: Transforming Few-Shot Learner
Chaofei Qi,Zhitai Liu,Jianbin Qiu
Main category: cs.CV
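TL;DR: The paper proposes ColorSense Learner, a few-shot learning framework that simulates human color perception and uses inter-channel feature extraction and interactive learning to improve few-shot classification performance.
Details
Motivation: Humans have innate meta-learning capabilities, partly attributable to their exceptional color perception. Conventional few-shot learning methods, however, neglect color as an intuitive visual feature and focus on abstract feature differentiation across categories. The paper aims to close this gap by simulating human color perception mechanisms. Contribution: (1) The ColorSense Learner framework, which uses color perception mechanisms to improve few-shot classification; (2) ColorSense Distiller, which strengthens the student network's meta-learning capacity through knowledge distillation; (3) validation of generalization, robustness, and transferability on 11 few-shot benchmarks.
Method: (1) Simulate human color perception with inter-channel feature extraction and interactive learning strategies that emphasize the color information in different channels; (2) use ColorSense Distiller to incorporate the teacher network's prior knowledge through knowledge distillation and improve the student network.
Result: On multiple few-shot benchmarks, ColorSense Learner and ColorSense Distiller show strong generalization, robustness, and transferability, and outperform conventional methods.
Insight: Color information is a neglected but promising feature for few-shot learning. Simulating human color perception can improve model performance, particularly on fine-grained recognition and cross-domain tasks.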
Abstract: Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we pioneer an innovative viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network’s meta-learning capacity. We’ve conducted comprehensive coarse/fine-grained and cross-domain experiments on eleven few-shot benchmarks for validation. Numerous experiments reveal that our methods have extremely strong generalization ability, robustness, and transferability, and effortlessly handle few-shot classification from the perspective of color perception.
[26] Enhancing efficiency in paediatric brain tumour segmentation using a pathologically diverse single-center clinical dataset
A. Piffer,J. A. Buchner,A. G. Gennari,P. Grehten,S. Sirin,E. Ross,I. Ezhov,M. Rosier,J. C. Peeken,M. Piraud,B. Menze,A. Guerreiro Stücklin,A. Jakab,F. Kofler
Main category: cs.CV
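TL;DR: Using a single-centre clinical dataset, the paper studies the feasibility of deep learning for paediatric brain tumour segmentation. Performance varies across tumour subtypes and MRI protocols, and is best for T2-hyperintensity and whole-tumour segmentation.
Details
Motivation: The heterogeneity and diversity of paediatric brain tumours complicate diagnosis and treatment. Deep learning segmentation is a promising tool, but its performance across subtypes and MRI protocols is unclear. Contribution: Uses a pathologically diverse single-centre dataset to evaluate the segmentation performance of a 3D nnU-Net across paediatric brain tumour subtypes and MRI sequences, and shows the potential for protocol simplification.
Method: A 3D nnU-Net model is trained on 121 patients and tested on 53, evaluated on four tumour subregions, and compared against human annotator variability.
Result: The model performs well on whole tumour and T2-hyperintensity (mean DSC: 0.85), close to human annotator level. Enhancing tumour segmentation is moderately accurate (mean DSC: 0.75), and cystic component performance is poor.
Insight: Using only T1, T1-C, and T2 gives results close to the full protocol, suggesting that the MRI protocol can be simplified. Deep learning is promising for paediatric brain tumour segmentation but needs refinement for specific subregions.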
Abstract: Background: Brain tumours are the most common solid malignancies in children, encompassing diverse histological, molecular subtypes and imaging features and outcomes. Paediatric brain tumours (PBTs), including high- and low-grade gliomas (HGG, LGG), medulloblastomas (MB), ependymomas, and rarer forms, pose diagnostic and therapeutic challenges. Deep learning (DL)-based segmentation offers promising tools for tumour delineation, yet its performance across heterogeneous PBT subtypes and MRI protocols remains uncertain. Methods: A retrospective single-centre cohort of 174 paediatric patients with HGG, LGG, medulloblastomas (MB), ependymomas, and other rarer subtypes was used. MRI sequences included T1, T1 post-contrast (T1-C), T2, and FLAIR. Manual annotations were provided for four tumour subregions: whole tumour (WT), T2-hyperintensity (T2H), enhancing tumour (ET), and cystic component (CC). A 3D nnU-Net model was trained and tested (121/53 split), with segmentation performance assessed using the Dice similarity coefficient (DSC) and compared against intra- and inter-rater variability. Results: The model achieved robust performance for WT and T2H (mean DSC: 0.85), comparable to human annotator variability (mean DSC: 0.86). ET segmentation was moderately accurate (mean DSC: 0.75), while CC performance was poor. Segmentation accuracy varied by tumour type, MRI sequence combination, and location. Notably, T1, T1-C, and T2 alone produced results nearly equivalent to the full protocol. Conclusions: DL is feasible for PBTs, particularly for T2H and WT. Challenges remain for ET and CC segmentation, highlighting the need for further refinement. These findings support the potential for protocol simplification and automation to enhance volumetric assessment and streamline paediatric neuro-oncology workflows.
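For readers unfamiliar with the metric, the Dice similarity coefficient for a predicted mask A and a reference mask B is DSC = 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened binary masks; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example, not something stated in the abstract.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity coefficient for two flattened binary masks.

    predicted, reference: equal-length sequences of 0/1 (or False/True)
    voxel labels. Returns a value between 0.0 and 1.0.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        # Both masks are empty: treated as perfect agreement by convention.
        return 1.0
    return 2.0 * overlap / total


# Each mask marks two voxels, and they agree on one of them.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

Because the denominator is the total foreground size, small structures are penalized heavily for a few misplaced voxels, which is one common reason small subregions tend to score lower than whole-tumour masks.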
[27] Temporally Consistent Unsupervised Segmentation for Mobile Robot Perception
Christian Ellis,Maggie Wigness,Craig Lennon,Lance Fiondella
Main category: cs.CV
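TL;DR: The paper proposes Frontier-Seg, a method for temporally consistent unsupervised segmentation of mobile robot video streams, addressing the lack of temporal consistency in existing unsupervised segmentation methods.
Details
Motivation: Supervised semantic segmentation depends on costly labeled data, while zero-shot unsupervised segmentation methods usually lack temporal consistency, a property that is critical for robot perception in unstructured environments. Contribution: Frontier-Seg, which clusters superpixel-level features and enforces temporal consistency to segment persistent terrain boundaries without human supervision.
Method: Extract superpixel-level features from foundation model backbones such as DINOv2, and segment under a temporal consistency constraint across frames.
Result: The method is evaluated on diverse benchmark datasets, including RUGD and RELLIS-3D.
Insight: Temporal consistency is a key property for robot perception, particularly for unsupervised segmentation in unstructured environments.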
Abstract: Rapid progress in terrain-aware autonomous ground navigation has been driven by advances in supervised semantic segmentation. However, these methods rely on costly data collection and labor-intensive ground truth labeling to train deep models. Furthermore, autonomous systems are increasingly deployed in unrehearsed, unstructured environments where no labeled data exists and semantic categories may be ambiguous or domain-specific. Recent zero-shot approaches to unsupervised segmentation have shown promise in such settings but typically operate on individual frames, lacking temporal consistency, a critical property for robust perception in unstructured environments. To address this gap we introduce Frontier-Seg, a method for temporally consistent unsupervised segmentation of terrain from mobile robot video streams. Frontier-Seg clusters superpixel-level features extracted from foundation model backbones (specifically DINOv2) and enforces temporal consistency across frames to identify persistent terrain boundaries or frontiers without human supervision. We evaluate Frontier-Seg on a diverse set of benchmark datasets, including RUGD and RELLIS-3D, demonstrating its ability to perform unsupervised segmentation across unstructured off-road environments.
[28] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Shaoan Xie,Lingjing Kong,Yujia Zheng,Yu Yao,Zeyu Tang,Eric P. Xing,Guangyi Chen,Kun Zhang
Main category: cs.CV
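TL;DR: SmartCLIP is a modular vision-language alignment method that addresses information misalignment and entangled representations when CLIP is trained on image-text datasets, using theoretical conditions and a modular design to achieve fine-grained alignment.
Details
Motivation: CLIP faces information misalignment and entangled representations in image-text alignment, which limits its generalization on certain downstream tasks. The paper addresses these problems with theory and a modular method. Contribution: (1) Theoretical conditions that support vision-language alignment at multiple levels of granularity; (2) the modular method SmartCLIP, which aligns fine-grained information and disentangles representations; (3) experiments confirming the performance gains.
Method: Building on the theoretical conditions, SmartCLIP identifies and aligns the most relevant visual and textual representations in a modular manner, achieving fine-grained alignment and disentanglement.
Result: SmartCLIP outperforms existing methods across various tasks, supporting its ability to handle information misalignment and entangled representations.
Insight: Modular, fine-grained alignment is key to improving the generalization of vision-language models, and theory-guided method design can address challenges that arise in practice.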
Abstract: Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts, ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only preserve cross-modal semantic information in its entirety but also disentangle visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at https://github.com/Mid-Push/SmartCLIP.
[29] HOG-CNN: Integrating Histogram of Oriented Gradients with Convolutional Neural Networks for Retinal Image Classification
Faisal Ahmed
Main category: cs.CV
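TL;DR: The paper proposes HOG-CNN, a hybrid feature extraction model that combines handcrafted HOG features with deep CNN features for retinal image classification, and reports strong performance on several public datasets.
Details
Motivation: Conventional retinal disease diagnosis depends on manual interpretation, which is time- and resource-intensive. The authors aim to improve diagnostic efficiency and accuracy through automation while keeping the model interpretable and lightweight. Contribution: The HOG-CNN model, which fuses the local texture features of HOG with the high-level semantic features of a CNN to improve retinal image classification.
Method: The model combines HOG and CNN features, drawing on the strengths of both to extract more complete information from retinal fundus images.
Result: HOG-CNN performs well on the APTOS 2019, ORIGA, and IC-AMD datasets. For example, it reaches 98.5% accuracy and 99.2 AUC for binary DR classification on APTOS 2019.
Insight: Handcrafted and deep features are complementary, and combining them offers a more robust approach to medical image analysis.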
Abstract: The analysis of fundus images is critical for the early detection and diagnosis of retinal diseases such as Diabetic Retinopathy (DR), Glaucoma, and Age-related Macular Degeneration (AMD). Traditional diagnostic workflows, however, often depend on manual interpretation and are both time- and resource-intensive. To address these limitations, we propose an automated and interpretable clinical decision support framework based on a hybrid feature extraction model called HOG-CNN. Our key contribution lies in the integration of handcrafted Histogram of Oriented Gradients (HOG) features with deep convolutional neural network (CNN) representations. This fusion enables our model to capture both local texture patterns and high-level semantic features from retinal fundus images. We evaluated our model on three public benchmark datasets: APTOS 2019 (for binary and multiclass DR classification), ORIGA (for Glaucoma detection), and IC-AMD (for AMD diagnosis); HOG-CNN demonstrates consistently high performance. It achieves 98.5% accuracy and 99.2 AUC for binary DR classification, and 94.2 AUC for five-class DR classification. On the IC-AMD dataset, it attains 92.8% accuracy, 94.8% precision, and 94.5 AUC, outperforming several state-of-the-art models. For Glaucoma detection on ORIGA, our model achieves 83.9% accuracy and 87.2 AUC, showing competitive performance despite dataset limitations. We show, through comprehensive appendix studies, the complementary strength of combining HOG and CNN features. The model’s lightweight and interpretable design makes it particularly suitable for deployment in resource-constrained clinical environments. These results position HOG-CNN as a robust and scalable tool for automated retinal disease screening.
[30] AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data
Christopher F. Brown,Michal R. Kazmierski,Valerie J. Pasquarella,William J. Rucklidge,Masha Samsikova,Chenhui Zhang,Evan Shelhamer,Estefania Lahera,Olivia Wiles,Simon Ilyushchenko,Noel Gorelick,Lihui Lydia Zhang,Sophia Alj,Emily Schechter,Sean Askay,Oliver Guinan,Rebecca Moore,Alexis Boukouvalas,Pushmeet Kohli
Main category: cs.CV
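TL;DR: AlphaEarth Foundations is an embedding field model that efficiently produces global maps from sparse label data and outperforms all previous featurization approaches tested.
Details
Motivation: Earth observation data is abundant, but high-quality labels are scarce, so a general method for turning sparse labels into accurate maps is needed. Contribution: The AlphaEarth Foundations model, which uses an embedding field to assimilate spatial, temporal, and measurement information from multiple sources and performs best across diverse tasks without re-training.
Method: An embedding field model fuses spatial, temporal, and measurement contexts into a unified representation, supporting efficient map production from local to global scales.
Result: The embeddings consistently outperform all previous featurization approaches tested on a diverse set of mapping evaluations. The authors plan to release a dataset of global annual embedding field layers from 2017 through 2024.
Insight: Embedding field models provide a general and efficient representation for Earth observation data and lower the barrier to producing accurate maps.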
Abstract: Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general, geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only to consistently outperform all previous featurization approaches tested on a diverse set of mapping evaluations without re-training. We will release a dataset of global, annual, analysis-ready embedding field layers from 2017 through 2024.
[31] Learning from Heterogeneous Structural MRI via Collaborative Domain Adaptation for Late-Life Depression Assessment
Yuzhen Gao,Qianqian Wang,Yongheng Sun,Cui Wang,Yongquan Liang,Mingxia Liu
Main category: cs.CV
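TL;DR: The paper proposes a Collaborative Domain Adaptation (CDA) framework for detecting late-life depression (LLD) from structural MRI. By combining a Vision Transformer with a CNN, it addresses small sample sizes and cross-domain heterogeneity and improves generalization.
Details
Motivation: Early diagnosis of late-life depression (LLD) is important for disease management, but existing MRI-based methods are limited by small samples and cross-domain heterogeneity (such as differences in imaging protocols and scanner hardware), which leads to poor generalization. Contribution: A Collaborative Domain Adaptation (CDA) framework that combines a Vision Transformer (ViT) and a CNN and uses three-stage training (supervised training, self-supervised target feature adaptation, and collaborative training) to handle cross-domain heterogeneity and improve LLD detection.
Method: (1) The ViT captures global anatomical context and the CNN extracts local structural features. (2) Three-stage training: supervised training on labeled source data; self-supervised target feature adaptation (minimizing the discrepancy between the two branches' classifier outputs); collaborative training on pseudo-labeled and augmented target-domain data.
Result: Experiments on multi-site T1-weighted MRI data show that CDA consistently outperforms state-of-the-art unsupervised domain adaptation methods.
Insight: Combining complementary global and local features with self-supervised and collaborative training strategies can improve the robustness and generalization of cross-domain medical image analysis.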
Abstract: Accurate identification of late-life depression (LLD) using structural brain MRI is essential for monitoring disease progression and facilitating timely intervention. However, existing learning-based approaches for LLD detection are often constrained by limited sample sizes (e.g., tens), which poses significant challenges for reliable model training and generalization. Although incorporating auxiliary datasets can expand the training set, substantial domain heterogeneity, such as differences in imaging protocols, scanner hardware, and population demographics, often undermines cross-domain transferability. To address this issue, we propose a Collaborative Domain Adaptation (CDA) framework for LLD detection using T1-weighted MRIs. The CDA leverages a Vision Transformer (ViT) to capture global anatomical context and a Convolutional Neural Network (CNN) to extract local structural features, with each branch comprising an encoder and a classifier. The CDA framework consists of three stages: (a) supervised training on labeled source data, (b) self-supervised target feature adaptation and (c) collaborative training on unlabeled target data. We first train ViT and CNN on source data, followed by self-supervised target feature adaptation by minimizing the discrepancy between classifier outputs from two branches to make the categorical boundary clearer. The collaborative training stage employs pseudo-labeled and augmented target-domain MRIs, enforcing prediction consistency under strong and weak augmentation to enhance domain robustness and generalization. Extensive experiments conducted on multi-site T1-weighted MRI data demonstrate that the CDA consistently outperforms state-of-the-art unsupervised domain adaptation methods.
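The second stage compares the classifier outputs of the ViT and CNN branches on unlabeled target data. The abstract does not specify the discrepancy measure, so the sketch below uses the mean absolute difference between the two branches' class probabilities, a common choice in classifier-discrepancy methods and an assumption here. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert one sample's logits into class probabilities."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(probs_a, probs_b):
    """Mean absolute difference between two branches' probabilities.

    probs_a, probs_b: one probability vector per sample from each branch
    (assumed measure; the paper's exact loss is not given in the abstract).
    """
    if len(probs_a) != len(probs_b) or not probs_a:
        raise ValueError("need the same non-zero number of samples")
    total, count = 0.0, 0
    for pa, pb in zip(probs_a, probs_b):
        if len(pa) != len(pb):
            raise ValueError("class count mismatch")
        total += sum(abs(a - b) for a, b in zip(pa, pb))
        count += len(pa)
    return total / count


# One target sample: one branch is undecided, the other is confident.
undecided = [softmax([0.0, 0.0])]
confident = [[1.0, 0.0]]
gap = classifier_discrepancy(undecided, confident)
```

A value of 0 means the two branches agree exactly on every target sample; larger values indicate target samples on which the branches' predictions diverge.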
[32] DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception
Pei Deng,Wenqian Zhou,Hanlin Wu
Main category: cs.CV
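TL;DR: DeltaVLM is an interactive model for remote sensing image change analysis that uses instruction-guided difference perception to support multi-turn, query-driven analysis.
Details
Motivation: Change analysis of remote sensing images usually yields only one-shot change masks or static captions, with little interactivity or flexibility. DeltaVLM combines the strengths of change detection and visual question answering to enable multi-turn, instruction-driven analysis. Contribution: (1) A new task, RSICA (remote sensing image change analysis), and the accompanying dataset ChangeChat-105k; (2) the DeltaVLM architecture, comprising a bi-temporal vision encoder, a visual difference perception module, and an instruction-guided Q-former; (3) state-of-the-art performance on single-turn captioning and multi-turn interactive tasks.
Method: (1) A fine-tuned bi-temporal vision encoder captures temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism interprets changes; (3) an instruction-guided Q-former extracts query-relevant difference information.
Result: DeltaVLM outperforms existing multimodal large language models and remote sensing vision-language models on both single-turn and multi-turn tasks.
Insight: Combining instruction guidance with multimodal interaction can improve dynamic analysis of remote sensing scenes and suggests a direction for practical applications.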
Abstract: Accurate interpretation of land-cover changes in multi-temporal satellite imagery is critical for real-world scenarios. However, existing methods typically provide only one-shot change masks or static captions, limiting their ability to support interactive, query-driven analysis. In this work, we introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering to enable multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images. To support this task, we construct ChangeChat-105k, a large-scale instruction-following dataset, generated through a hybrid rule-based and GPT-assisted process, covering six interaction types: change captioning, classification, quantification, localization, open-ended question answering, and multi-turn dialogues. Building on this dataset, we propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information from visual changes, aligning them with textual instructions. We train DeltaVLM on ChangeChat-105k using a frozen large language model, adapting only the vision and alignment modules to optimize efficiency. Extensive experiments and ablation studies demonstrate that DeltaVLM achieves state-of-the-art performance on both single-turn captioning and multi-turn interactive change analysis, outperforming existing multimodal large language models and remote sensing vision-language models. Code, dataset and pre-trained weights are available at https://github.com/hanlinwu/DeltaVLM.
[33] FaceGCD: Generalized Face Discovery via Dynamic Prefix Generation
Yunseok Oh,Dong-Wan Choi
Main category: cs.CV
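TL;DR: The paper introduces the generalized face discovery (GFD) task and proposes FaceGCD, a method based on dynamic prefix generation, to address the shortcomings of existing methods in high-cardinality, fine-grained face recognition.
Details
Motivation: Conventional face recognition methods cannot effectively discover new identities (IDs) in open-world settings, and generalized category discovery (GCD) is ill-suited to high-cardinality, fine-grained face recognition. The paper aims to unify the two tasks with a more flexible, dynamic method.
Contribution: 1. Introduces the GFD task, which unifies traditional face identification with GCD; 2. Proposes FaceGCD, which extracts instance-specific features through dynamic prefix generation and significantly improves performance.
Method: FaceGCD uses a HyperNetwork to dynamically generate layer-wise prefixes, building a lightweight feature extractor for each input image and avoiding reliance on a static high-capacity model.
Result: Experiments show that FaceGCD outperforms existing GCD methods and ArcFace on the GFD task, achieving state-of-the-art results.
Insight: Dynamic prefix generation is an effective lightweight way to handle high-cardinality, fine-grained recognition and offers a new direction for open-world face recognition.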
Abstract: Recognizing and differentiating among both familiar and unfamiliar faces is a critical capability for face recognition systems and a key step toward artificial general intelligence (AGI). Motivated by this ability, this paper introduces generalized face discovery (GFD), a novel open-world face recognition task that unifies traditional face identification with generalized category discovery (GCD). GFD requires recognizing both labeled and unlabeled known identities (IDs) while simultaneously discovering new, previously unseen IDs. Unlike typical GCD settings, GFD poses unique challenges due to the high cardinality and fine-grained nature of face IDs, rendering existing GCD approaches ineffective. To tackle this problem, we propose FaceGCD, a method that dynamically constructs instance-specific feature extractors using lightweight, layer-wise prefixes. These prefixes are generated on the fly by a HyperNetwork, which adaptively outputs a set of prefix generators conditioned on each input image. This dynamic design enables FaceGCD to capture subtle identity-specific cues without relying on high-capacity static models. Extensive experiments demonstrate that FaceGCD significantly outperforms existing GCD methods and a strong face recognition baseline, ArcFace, achieving state-of-the-art results on the GFD task and advancing toward open-world face recognition.
[34] GVD: Guiding Video Diffusion Model for Scalable Video Distillation
Kunyang Li,Jeffrey A Chan Santiago,Sarinda Dhanesh Samarasinghe,Gaowen Liu,Mubarak Shah
Main category: cs.CV
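TL;DR: GVD is the first diffusion-based video distillation method. By jointly distilling spatial and temporal features, it significantly reduces computation and storage requirements and performs strongly on the MiniUCF and HMDB51 datasets.
Details
Motivation: Video datasets impose heavy storage and computation requirements, and existing methods struggle to distill spatio-temporal information efficiently.
Contribution: Proposes GVD, the first diffusion-based video distillation method, which efficiently generates high-quality videos while preserving action diversity.
Method: Jointly distills spatial and temporal features to ensure high-fidelity video generation while capturing motion information.
Result: Reaches 78.29% of the original dataset's performance with only 1.98% of the frames on MiniUCF, and 73.83% with 3.30% of the frames on HMDB51.
Insight: Beyond strong performance, GVD can generate higher-resolution videos without significantly increasing computational cost.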
Abstract: To address the larger computation and storage requirements associated with large video datasets, video dataset distillation aims to capture spatial and temporal information in a significantly smaller dataset, such that training on the distilled data has comparable performance to training on all of the data. We propose GVD: Guiding Video Diffusion, the first diffusion-based video distillation method. GVD jointly distills spatial and temporal features, ensuring high-fidelity video generation across diverse actions while capturing essential motion information. Our method’s diverse yet representative distillations significantly outperform previous state-of-the-art approaches on the MiniUCF and HMDB51 datasets across 5, 10, and 20 Instances Per Class (IPC). Specifically, our method achieves 78.29 percent of the original dataset’s performance using only 1.98 percent of the total number of frames in MiniUCF. Additionally, it reaches 73.83 percent of the performance with just 3.30 percent of the frames in HMDB51. Experimental results across benchmark video datasets demonstrate that GVD not only achieves state-of-the-art performance but can also generate higher resolution videos and higher IPC without significantly increasing computational cost.
[35] Object Recognition Datasets and Challenges: A Review
Aria Salari,Abtin Djavadifar,Xiangrui Liu,Homayoun Najjaran
Main category: cs.CV
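TL;DR: The paper surveys datasets and challenges for object recognition in computer vision, analyzing more than 160 datasets and presenting common evaluation benchmarks and competitions.
Details
Motivation: Object recognition is a fundamental computer vision task, and the size and quality of datasets are critical to the progress of deep learning techniques. The paper aims to give researchers a detailed analysis of datasets and evaluation benchmarks in support of data-driven and machine learning research.
Contribution: (1) a systematic analysis, with statistics, of more than 160 commonly used public datasets; (2) an overview of prominent benchmarks and competitions in object recognition; (3) a summary of the evaluation metrics widely adopted in computer vision.
Method: The paper categorizes and compares datasets through statistical and descriptive analysis and reviews the related competitions and evaluation metrics.
Result: The paper presents the trends and characteristics of the datasets, giving researchers a reference for dataset selection.
Insight: Dataset size and annotation quality directly affect model performance; benchmark competitions have driven progress in object recognition.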
Abstract: Object recognition is among the fundamental tasks in the computer vision applications, paving the path for all other image understanding operations. In every stage of progress in object recognition research, efforts have been made to collect and annotate new datasets to match the capacity of the state-of-the-art algorithms. In recent years, the importance of the size and quality of datasets has been intensified as the utility of the emerging deep network techniques heavily relies on training data. Furthermore, datasets lay a fair benchmarking means for competitions and have proved instrumental to the advancements of object recognition research by providing quantifiable benchmarks for the developed models. Taking a closer look at the characteristics of commonly-used public datasets seems to be an important first step for data-driven and machine learning researchers. In this survey, we provide a detailed analysis of datasets in the highly investigated object recognition areas. More than 160 datasets have been scrutinized through statistics and descriptions. Additionally, we present an overview of the prominent object recognition benchmarks and competitions, along with a description of the metrics widely adopted for evaluation purposes in the computer vision community. All introduced datasets and challenges can be found online at github.com/AbtinDjavadifar/ORDC.
[36] Exploring the Application of Visual Question Answering (VQA) for Classroom Activity Monitoring
Sinh Trong Vu,Hieu Trung Pham,Dung Manh Nguyen,Hieu Minh Hoang,Nhu Hoang Le,Thu Ha Pham,Tai Tan Mai
Main category: cs.CV
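TL;DR: This study explores the use of visual question answering (VQA) models for classroom behavior monitoring. It evaluates the open-source models LLaMA2, LLaMA3, QWEN3, and NVILA and introduces the BAV-Classroom-VQA dataset. The experimental results show the potential of these models for classroom behavior analysis.
Details
Motivation: Classroom behavior monitoring matters for educational research and learning outcomes, but traditional methods are inefficient. Advances in VQA models offer new tools for automatically analyzing classroom videos.
Contribution: Introduces the BAV-Classroom-VQA dataset and evaluates several open-source VQA models on classroom behavior analysis, laying groundwork for future educational intervention systems.
Method: Evaluates LLaMA2, LLaMA3, QWEN3, and NVILA on a dataset built from real classroom video recordings.
Result: All four models achieve promising performance on behavior-related visual question answering, indicating their potential for educational analytics.
Insight: VQA models can automate classroom behavior analysis efficiently and may eventually be integrated into educational systems to support real-time intervention and improve learning outcomes.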
Abstract: Classroom behavior monitoring is a critical aspect of educational research, with significant implications for student engagement and learning outcomes. Recent advancements in Visual Question Answering (VQA) models offer promising tools for automatically analyzing complex classroom interactions from video recordings. In this paper, we investigate the applicability of several state-of-the-art open-source VQA models, including LLaMA2, LLaMA3, QWEN3, and NVILA, in the context of classroom behavior analysis. To facilitate rigorous evaluation, we introduce our BAV-Classroom-VQA dataset derived from real-world classroom video recordings at the Banking Academy of Vietnam. We present the methodology for data collection, annotation, and benchmark the performance of the selected VQA models on this dataset. Our initial experimental results demonstrate that all four models achieve promising performance levels in answering behavior-related visual questions, showcasing their potential in future classroom analytics and intervention systems.
[37] Gems: Group Emotion Profiling Through Multimodal Situational Understanding
Anubhav Kataria,Surbhi Madan,Shreya Ghosh,Tom Gedeon,Abhinav Dhall
Main category: cs.CV
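TL;DR: GEMS is a system for group emotion analysis through multimodal situational understanding. Using an architecture based on Swin-Transformer and S3Attention, it predicts emotions ranging from fine-grained individual emotions to coarse-grained group-level and event-level emotions.
Details
Motivation: Existing multi-person emotion benchmarks focus mainly on atomic interactions over time and at the group level, and lack a fine-grained, holistic analysis of individual, group, and event-level emotions.
Contribution: 1. Proposes the GEMS framework, which integrates multimodal inputs (scene, group members, and context information) for joint prediction; 2. Extends the VGAF dataset into the VGAF-GEMS benchmark, providing finer-grained and more comprehensive analysis; 3. Unifies the prediction of discrete and continuous emotions (such as valence and arousal) with individual, group, and event-level emotions.
Method: A multimodal architecture based on Swin-Transformer and S3Attention processes the scene, group members, and context information to generate joint predictions.
Result: On the VGAF-GEMS benchmark, GEMS outperforms existing methods in both quantitative and qualitative comparisons.
Insight: By modeling individual, group, and situational emotional responses jointly, GEMS offers a more holistic approach to emotion analysis.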
Abstract: Understanding individual, group and event level emotions along with contextual information is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting fine-grained individual emotion to coarse-grained group and event level emotion. We introduce GEMS that leverages a multimodal swin-transformer and S3Attention based architecture, which processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion related benchmarks mainly focus on atomic interactions primarily based on emotion perception over time and group level. To this end, we extend and propose VGAF-GEMS to provide more fine-grained and holistic analysis on top of the existing group level annotation of the VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group and event level perceived emotions. Our benchmarking effort links individual, group and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of the GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way for further research. The code and data are available at: https://github.com/katariaak579/GEMS
[38] On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
Jordan Vice,Naveed Akhtar,Yansong Gao,Richard Hartley,Ajmal Mian
Main category: cs.CV
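TL;DR: The paper exposes the vulnerability of vision-language models (VLMs) to adversarial frequency-domain perturbations, showing experimentally that these models are highly sensitive to imperceptible frequency perturbations in image captioning and authenticity detection tasks.
Details
Motivation: VLMs are widely used for visual content reasoning, but their reliability under adversarial frequency-domain perturbations has not been studied thoroughly. The authors aim to expose this hidden vulnerability in order to improve model robustness.
Contribution: Designs a frequency-domain perturbation method, shows that it generalizes across multiple VLMs, and reveals the sensitivity of model outputs to frequency perturbations, challenging the reliability of existing models.
Method: The authors design targeted image transformations in the frequency domain that systematically adjust VLM outputs, in order to test sensitivity to frequency perturbations. The experiments cover five state-of-the-art VLMs and ten real and generated image datasets.
Result: VLM outputs are highly sensitive to frequency perturbations, which markedly affect performance on image captioning and DeepFake detection; the vulnerability is common across models.
Insight: VLM judgments may not rest entirely on semantic content and are sensitive to frequency-domain features, an important lesson for building more robust multimodal perception systems.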
Abstract: Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs when exposed to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations, operating in the frequency domain to systematically adjust VLM outputs when exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs, including Qwen2/2.5 and BLIP models of different parameter counts. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.
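To make "visually imperceptible spatial-frequency transformation" concrete, the sketch below shifts a single DFT coefficient of a tiny grayscale image and inverts the transform. It is only an illustration of the general idea, not the authors' perturbation method, which is optimized against VLM outputs. The function names, the chosen frequency, and the amplitude `eps` are all assumptions, and a naive DFT is used so the example needs nothing beyond the standard library.

```python
import cmath

def dft2(img):
    """Naive 2D discrete Fourier transform of a list-of-lists image."""
    h, w = len(img), len(img[0])
    return [[sum(img[x][y] * cmath.exp(-2j * cmath.pi * (u * x / h + v * y / w))
                 for x in range(h) for y in range(w))
             for v in range(w)] for u in range(h)]

def idft2(spec):
    """Inverse of dft2; returns complex pixel values."""
    h, w = len(spec), len(spec[0])
    return [[sum(spec[u][v] * cmath.exp(2j * cmath.pi * (u * x / h + v * y / w))
                 for u in range(h) for v in range(w)) / (h * w)
             for y in range(w)] for x in range(h)]

def perturb_frequency(img, u, v, eps):
    """Add a sinusoid of peak pixel amplitude eps at frequency (u, v).

    The coefficient at (u, v) and its mirror at (-u, -v) are shifted by the
    same real amount, so the perturbed image stays real-valued.
    (Hypothetical helper, not an API from the paper.)
    """
    h, w = len(img), len(img[0])
    spec = dft2(img)
    delta = eps * h * w / 2.0
    spec[u][v] += delta
    spec[(-u) % h][(-v) % w] += delta
    return [[px.real for px in row] for row in idft2(spec)]
```

With pixel values in [0, 1] and `eps` around 0.01, no pixel changes by more than 1% of the dynamic range, which is the sense in which such a perturbation can be hard to see while still altering the spectrum a model may rely on.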
[39] UAVScenes: A Multi-Modal Dataset for UAVs
Sijie Wang,Siqi Li,Yawei Zhang,Shangshu Yu,Shenghai Yuan,Rui She,Quanjiang Guo,JinXuan Zheng,Ong Kang Howe,Leonrich Chandra,Shrivarshann Srijeyan,Aditya Sivadas,Toshan Aggarwal,Heyuan Liu,Hongming Zhang,Chujie Chen,Junyu Jiang,Lihua Xie,Wee Peng Tay
Main category: cs.CV
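TL;DR: The paper presents UAVScenes, a multi-modal UAV dataset that extends the existing MARS-LVIG dataset with frame-wise semantic annotations for images and LiDAR point clouds, supporting a variety of high-level perception tasks.
Details
Motivation: Existing multi-modal UAV datasets are biased toward localization and 3D reconstruction and lack frame-wise annotations, which makes them unsuitable for high-level scene understanding. UAVScenes is introduced to close this gap.
Contribution: Builds UAVScenes, a large-scale multi-modal UAV dataset with frame-wise semantic annotations for images and LiDAR point clouds, supporting segmentation, depth estimation, 6-DoF localization, and other tasks.
Method: Building on MARS-LVIG, the authors add manually labeled semantic annotations together with accurate 6-DoF poses, broadening the dataset's applicability.
Result: UAVScenes serves as a benchmark dataset for multi-modal perception and is applicable to a range of high-level tasks.
Insight: Adding frame-wise annotations lets an existing dataset support more high-level perception tasks, providing a new resource for multi-modal UAV research.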
Abstract: Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs’ surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes
[40] Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching
Phi Van Nguyen,Ngoc Huynh Trinh,Duy Minh Lam Nguyen,Phu Loc Nguyen,Quoc Long Tran
Main category: cs.CV
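TL;DR: The paper proposes a conditional flow matching method for quantifying aleatoric uncertainty in medical image segmentation. It significantly outperforms conventional generative models and diffusion models and captures segmentation uncertainty more accurately.
Details
Motivation: Aleatoric uncertainty in medical image segmentation reflects the natural variability among expert annotators. Conventional generative models have limited expressive power, and diffusion models are limited in capturing uncertainty by their stochastic sampling and their inability to model exact densities.
Contribution: Proposes a generative model based on conditional flow matching, a simulation-free flow-based approach that learns an exact density, to generate high-quality segmentation samples whose pixel-wise variance reliably reflects the data distribution.
Method: Conditions the flow matching model on the input image and draws multiple samples, synthesizing segmentation samples that capture uncertainty in regions with ambiguous boundaries.
Result: Experiments show that the method achieves competitive segmentation accuracy and also produces uncertainty maps that reflect the reliability of the segmentation results.
Insight: Conditional flow matching captures uncertainty in medical image segmentation better than conventional generative models and diffusion models, giving deeper insight into the reliability of segmentation results.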
Abstract: Quantifying aleatoric uncertainty in medical image segmentation is critical since it is a reflection of the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution using the generative model, but current methods limit the expression ability of generative models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at https://github.com/huynhspm/Data-Uncertainty
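The uncertainty map described in the abstract is the pixel-wise variance across several sampled segmentations. A minimal sketch of that final step follows, assuming the samples are already available as binary masks; the flow matching sampler itself is out of scope here, and `uncertainty_map` and the toy masks are hypothetical.

```python
from statistics import pvariance

def uncertainty_map(samples):
    """Pixel-wise variance across sampled binary masks.

    samples: list of masks, each a list of rows of 0/1 values, all the same
    shape. A value of 0 means every sample agrees on that pixel.
    """
    h, w = len(samples[0]), len(samples[0][0])
    return [[pvariance([s[i][j] for s in samples]) for j in range(w)]
            for i in range(h)]

# Four hypothetical samples of a 2x3 mask that disagree only near the boundary.
samples = [
    [[1, 1, 0], [1, 0, 0]],
    [[1, 1, 0], [1, 1, 0]],
    [[1, 0, 0], [1, 1, 0]],
    [[1, 1, 0], [1, 0, 0]],
]
umap = uncertainty_map(samples)
```

For a binary pixel that is foreground in a fraction p of the samples, the variance is p(1 - p), so it peaks at 0.25 where the samples split evenly, as they would along an ambiguous boundary.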
[41] Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking
Shahla John
Main category: cs.CV
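TL;DR: The paper proposes an efficient unified framework for real-time video analysis that combines action recognition and object tracking, using advanced spatial-temporal modeling and a novel hierarchical attention mechanism to balance accuracy and speed.
Details
Motivation: Real-time video analysis is an important but challenging problem in computer vision. Existing methods often struggle to balance accuracy and speed, especially in resource-constrained environments, and the paper aims to address this.
Contribution: 1) a unified framework that supports action recognition and object tracking simultaneously; 2) a novel hierarchical attention mechanism that adaptively focuses on relevant regions across spatio-temporal sequences; 3) improved performance at real-time inference speeds on standard datasets.
Method: The method builds on parallel sequence modeling combined with a hierarchical attention mechanism to process spatial and temporal information efficiently, including adaptive modeling of spatio-temporal data and an optimized inference process.
Result: Experiments on UCF-101, HMDB-51, and MOT17 show a 3.2% improvement in action recognition accuracy, a 2.8% improvement in tracking precision, and 40% faster inference than existing methods.
Insight: Combining advanced spatio-temporal modeling with attention can significantly improve task accuracy while keeping real-time performance, which is valuable in resource-constrained real-world applications.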
Abstract: Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance accuracy and speed, particularly in resource-constrained environments. In this work, we present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our approach builds upon recent advances in parallel sequence modeling and introduces a novel hierarchical attention mechanism that adaptively focuses on relevant spatial regions across temporal sequences. We demonstrate that our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds. Extensive experiments on UCF-101, HMDB-51, and MOT17 datasets show improvements of 3.2% in action recognition accuracy and 2.8% in tracking precision compared to existing methods, with 40% faster inference time.
[42] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
Zhixiang Wei,Guangting Wang,Xiaoxiao Ma,Ke Mei,Huaian Chen,Yi Jin,Fengyun Rao
Main category: cs.CV
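TL;DR: The paper proposes using large vision-language models (LVLMs) to refine the quality of image-text data and trains the high-performing HQ-CLIP model on the result. It generates VLM-150M, a dataset with multi-grained annotations, and applies contrastive learning that incorporates negative descriptions and short tags, yielding strong results across multiple tasks.
Details
Motivation: To address noise in large-scale image-text data and to explore a self-reinforcing cycle in which LVLMs improve data quality.
Contribution: 1. An LVLM-driven data refinement pipeline; 2. VLM-150M, a dataset with multi-grained annotations; 3. A contrastive learning method that incorporates negative descriptions and short tags.
Method: LVLMs generate four kinds of text annotation (long and short, positive and negative); HQ-CLIP is trained on these annotations with an extended contrastive learning framework.
Result: At a comparable training data scale, HQ-CLIP reaches state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding; on retrieval benchmarks it even surpasses CLIP models trained on the 10× larger DFN-2B dataset.
Insight: LVLMs can not only improve data quality but also keep improving model performance through a self-reinforcing cycle.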
Abstract: Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10× more training data than ours. All code, data, and models are available at https://zxwei.site/hqclip.
[43] From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
Youngho Kim,Hoonhee Cho,Kuk-Jin Yoon
Main category: cs.CV
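TL;DR: The paper proposes an event-camera-based unsupervised domain adaptation method for 2D human pose estimation under extreme motion blur. Using event-based augmentation and a student-teacher framework, it narrows the domain gap between sharp and blurred images and outperforms conventional methods in the unlabeled target domain.
Details
Motivation: Under rapid motion and low light, motion blur significantly degrades human pose estimation. Existing methods are trained mainly on sharp images and transfer poorly to blurred ones, so an approach that adapts to the blurred domain is needed.
Contribution: 1) a domain adaptation method that uses event cameras to generate motion-aware blurred images; 2) a student-teacher framework that refines pseudo-labels with mutual uncertainty masking; 3) experiments showing that the method outperforms conventional methods in the blurred domain without target-domain annotations.
Method: 1) generate blurred images through event-based data augmentation; 2) iteratively refine pseudo-labels in a student-teacher framework, filtering incorrect labels with mutual uncertainty masking; 3) exploit the high temporal resolution of event cameras to improve robustness to blur.
Result: The method outperforms conventional domain adaptation methods for pose estimation under blur and needs no target-domain annotations. Using event cameras significantly improves robustness.
Insight: The high temporal resolution of event cameras is an effective tool against motion blur, and unsupervised domain adaptation can extend to other vision tasks, reducing dependence on labeled data.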
Abstract: Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at https://github.com/kmax2001/EvSharp2Blur.
[44] TopoLiDM: Topology-Aware LiDAR Diffusion Models for Interpretable and Realistic LiDAR Point Cloud Generation
Jiuming Liu,Zheng Huang,Mengmeng Liu,Tianchen Deng,Francesco Nex,Hao Cheng,Hesheng Wang
Main category: cs.CV
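TL;DR: TopoLiDM is a framework that combines graph neural networks with diffusion models and generates high-fidelity LiDAR point clouds under topological regularization, addressing the shortcomings of existing methods in geometric realism and topological consistency.
Details
Motivation: Existing LiDAR generation methods struggle to capture geometric realism and global topological consistency.
Contribution: Proposes TopoLiDM, which combines graph neural networks with diffusion models and introduces 0-dimensional persistent homology constraints to ensure topological consistency.
Method: A topology-preserving VAE extracts latent graph representations, a latent diffusion model generates new topological graphs, and persistent homology constraints are applied.
Result: Outperforms the state of the art on KITTI-360, with 22.6% lower FRID and 9.2% lower MMD, at a generation speed of 1.68 samples/s.
Insight: Topological constraints significantly improve the geometric realism and topological consistency of generated LiDAR point clouds, making the approach suitable for practical use.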
Abstract: LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their interpretable ability to model detailed geometric structures and preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topological-preserving VAE to extract latent graph representations by graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through the latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM’s superiority over state-of-the-art methods, achieving improvements of 22.6% lower Frechet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation speed with an average inference time of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related codes at https://github.com/IRMVLab/TopoLiDM.
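For readers unfamiliar with 0-dimensional persistent homology, the sketch below computes it for a small point cloud. It assumes a Vietoris-Rips filtration in which an edge appears at the pairwise distance (some conventions use half that value), and it illustrates only the invariant itself. How TopoLiDM turns it into a training constraint is not described in the abstract, and `zero_dim_persistence` is a hypothetical name.

```python
import math
from itertools import combinations

def zero_dim_persistence(points):
    """Death scales of 0-dimensional classes in a Vietoris-Rips filtration.

    Every point is born at scale 0; a connected component dies when an edge
    first merges it into another one. Processing edges by increasing length
    with union-find (Kruskal's algorithm) yields exactly those merge scales.
    The one component that never dies is omitted.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return deaths

# Two tight clusters far apart: two short-lived classes and one long-lived one.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
deaths = zero_dim_persistence(cloud)
```

The long-lived class (death scale 9.0 in this toy cloud) signals two well-separated components, the kind of global structure a 0-dimensional constraint can compare between generated and real scenes.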
[45] Exploiting Diffusion Prior for Task-driven Image Restoration
Jaeha Kim,Junghun Oh,Kyoung Mu Lee
Main category: cs.CV
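TL;DR: The paper proposes EDTR, a method that exploits the diffusion prior for task-driven image restoration (TDIR). By extracting key clues from low-quality images and limiting the number of denoising steps, it improves both task performance and visual quality.
Details
Motivation: Existing TDIR methods perform poorly on images with complex degradations and struggle to restore details that are critical to the task. The diffusion prior is promising, but applying it directly rarely produces task-relevant content, so a more effective approach is needed.
Contribution: Proposes EDTR, which directly uses clues from low-quality images: it applies pixel-error-based pre-restoration, adds mild noise, and runs a small number of denoising steps, improving task-driven image restoration.
Method: EDTR draws useful clues directly from the low-quality image in the diffusion process: it generates a pre-restored image, adds mild noise, and uses few denoising steps to avoid generating redundant details, thereby preserving task-critical information.
Result: Experiments show that EDTR significantly improves task performance and visual quality under multiple complex degradations.
Insight: Using the diffusion prior with restraint, for example by limiting denoising steps, avoids generating redundant information and keeps the focus on task-relevant details, offering a new approach to restoration under complex degradations.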
Abstract: Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This motivates us to leverage the diffusion prior, one of the most powerful natural image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with recent TDIR methods. To address this, we propose EDTR, which effectively harnesses the power of diffusion prior to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pixel-error-based pre-restored LQ images with mild noise added. Moreover, we employ a small number of denoising steps to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes diffusion prior for TDIR, significantly enhancing task performance and visual quality across diverse tasks with multiple complex degradations.
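The "mild noise" step can be pictured with the standard forward-diffusion formula x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, evaluated at a small timestep t. The sketch below assumes a DDPM-style linear beta schedule with 1000 steps. The abstract does not state the schedule, the timestep, or the sampler EDTR uses, so every constant and function name here is an illustrative assumption.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps of an
    assumed linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_mild_noise(pixels, t, rng):
    """Sample x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps with a = alpha_bar(t).

    pixels: flat list of pixel values from a pre-restored image (x_0).
    rng: a random.Random instance supplying the Gaussian noise eps.
    """
    a = alpha_bar(t)
    return [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for p in pixels]

# Starting from a small t keeps most of the pre-restored signal, so only a
# few reverse (denoising) steps are needed to get back to a clean image.
noisy = add_mild_noise([0.2, 0.5, 0.8], t=50, rng=random.Random(0))
```

Under this assumed schedule, `alpha_bar(50)` stays above 0.95 while `alpha_bar(500)` falls below 0.2, which shows why starting from a small t preserves the clues already present in the pre-restored image.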
[46] Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
Zheng Xiangyu,He Songcheng,Li Wanyun,Li Xiaoqiang,Zhang Wei
Main category: cs.CV
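TL;DR: The paper proposes a novel hierarchical memory architecture (HMHI-Net) that combines shallow- and high-level features with a heterogeneous interaction mechanism, addressing the over-reliance on high-level semantic features in unsupervised video object segmentation (UVOS) and significantly improving performance.
Details
Motivation: Existing UVOS methods rely too heavily on high-level semantic features and neglect shallow-level detail, which limits segmentation precision. The study aims to improve precision by combining shallow- and high-level features with a new interaction mechanism.
Contribution: Proposes the hierarchical memory architecture HMHI-Net, which fuses shallow- and high-level features; designs a heterogeneous interaction mechanism (the PLAM and SGIM modules) that balances the use of pixel and semantic information.
Method: The Pixel-guided Local Alignment Module (PLAM) performs pixel-guided local alignment and the Semantic-guided Global Integration Module (SGIM) performs semantic-guided global integration, combining the complementary strengths of shallow- and high-level features.
Result: HMHI-Net achieves state-of-the-art performance on all UVOS and video saliency detection benchmarks and remains robust across different backbones.
Insight: Combining shallow features (such as pixel detail) with high-level semantic features is essential for precise UVOS, and a heterogeneous interaction mechanism can balance the use of both.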
Abstract: Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently suffers from the deficiency of lacking fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory design relying solely on high-level features, which predominantly capture abstract semantic cues, is insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture to incorporate both shallow- and high-level features for memory, which leverages the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of Pixel-guided Local Alignment Module (PLAM) and Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI-Net.
[47] Visual Language Models as Zero-Shot Deepfake Detectors
Viacheslav Pirogov
Main category: cs.CV
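TL;DR: The paper proposes a zero-shot deepfake detection method based on vision-language models (VLMs), validates it on a high-quality dataset, and reports better performance than traditional classifiers.
Details
Motivation: Deepfake techniques (based on GANs or diffusion models) threaten digital media, identity verification, and other systems, while traditional detection methods rely on specialized classifiers and lack auxiliary tasks that could improve robustness.
Contribution: Proposes a zero-shot VLM approach that uses information from outside the image domain to improve detection robustness, and validates its advantage on a high-quality dataset.
Method: Uses vision-language models (such as InstructBLIP) for zero-shot classification and in-domain fine-tuning, and compares them with traditional methods on the DFDC-P dataset.
Result: Zero-shot VLMs outperform traditional methods at deepfake detection, with the gain most pronounced on the high-quality dataset.
Insight: The zero-shot capability of vision-language models extends effectively to deepfake detection and suggests new directions for multimodal tasks.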
Abstract: The contemporary phenomenon of deepfakes, utilizing GAN or diffusion models for face swapping, presents a substantial and evolving threat in digital media, identity verification, and a multitude of other systems. The majority of existing methods for detecting deepfakes rely on training specialized classifiers to distinguish between genuine and manipulated images, focusing only on the image domain without incorporating any auxiliary tasks that could enhance robustness. In this paper, inspired by the zero-shot capabilities of Vision Language Models, we propose a novel VLM-based approach to image classification and then evaluate it for deepfake detection. Specifically, we utilize a new high-quality deepfake dataset comprising 60,000 images, on which our zero-shot models demonstrate superior performance to almost all existing methods. Subsequently, we compare the performance of the best-performing architecture, InstructBLIP, on the popular deepfake dataset DFDC-P against traditional methods in two scenarios: zero-shot and in-domain fine-tuning. Our results demonstrate the superiority of VLMs over traditional classifiers.
[48] LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks
Hui Liu,Chen Jia,Fan Shi,Xu Cheng,Mengfei Shi,Xia Xie,Shengyong Chen
Main category: cs.CV
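TL;DR: LIDAR is a Lightweight Adaptive Cue-Aware Vision Mamba network for multimodal crack segmentation. Through adaptive perception and efficient interactive fusion of cross-modal features, it significantly improves segmentation performance.
Details
Motivation: Current multimodal crack segmentation methods lack adaptive perception and effective cross-modal feature fusion, leading to high computational cost and suboptimal results.
Contribution: 1. Proposes the Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and the Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF).
2. Designs the mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS) and the Adaptive Frequency Domain Perceptron (AFDP).
3. Introduces the Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK), which significantly reduces computational overhead.
Method: 1. LacaVSS adaptively models crack cues through the EDG-SS strategy.
2. LD3CF combines AFDP with a dual-pooling fusion strategy to capture spatial and frequency-domain cues across modalities.
3. LDMK replaces most conventional convolutions and perceives complex morphological structures.
Result: Outperforms SOTA methods on three datasets; on the light-field depth dataset it reaches an F1 score of 0.8204 and an mIoU of 0.8465 with only 5.35M parameters.
Insight: 1. Adaptive perception and dynamic fusion are key to multimodal segmentation.
2. Lightweight, low-cost designs offer a clear advantage in practical applications.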
Abstract: Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.
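As a reminder of what the reported F1 and mIoU figures measure, here is a minimal sketch for binary crack masks. It assumes mIoU is the mean of the crack-class and background-class IoU, which is a common convention but is not confirmed by the abstract. The helper names and the toy masks are hypothetical.

```python
def confusion(pred, target):
    """Count true/false positives and negatives over two binary masks."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def f1_score(pred, target):
    """F1 = 2TP / (2TP + FP + FN) for the crack (positive) class."""
    tp, fp, fn, _ = confusion(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def mean_iou(pred, target):
    """Mean of crack IoU and background IoU (assumed two-class average)."""
    tp, fp, fn, tn = confusion(pred, target)
    ious = [inter / union
            for inter, union in ((tp, tp + fp + fn), (tn, tn + fp + fn))
            if union]
    return sum(ious) / len(ious) if ious else 1.0

pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 0, 0], [1, 0, 0]]
f1 = f1_score(pred, target)       # TP=1, FP=1, FN=1
miou = mean_iou(pred, target)     # crack IoU 1/3, background IoU 3/5
```

Under this convention, the background class usually scores a high IoU because cracks occupy few pixels, so a two-class mIoU can sit close to or above the crack-class F1.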
[49] Estimating 2D Camera Motion with Hybrid Motion Basis
Haipeng Li,Tianhao Zhou,Zhanglei Yang,Yi Wu,Yan Chen,Zijing Mao,Shen Cheng,Bing Zeng,Shuaicheng Liu
Main category: cs.CV
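TL;DR: The paper proposes CamFlow, a framework that represents camera motion with hybrid motion bases (physical and stochastic) and uses a Laplace-distribution-based probabilistic loss to improve training robustness. The method outperforms existing techniques across diverse scenarios.
Details
Motivation: Existing methods (such as homography-based or meshflow approaches) perform poorly on complex non-linear transformations or pure camera motion, so a more flexible and robust framework is needed. Contribution: 1. Proposes the CamFlow framework, which represents camera motion by combining physical and stochastic motion bases; 2. Designs a hybrid probabilistic loss function based on the Laplace distribution; 3. Creates a new benchmark dataset that masks dynamic objects to extract pure camera motion.
Method: 1. Hybrid motion bases: physical bases (derived from camera geometry) and stochastic bases (for complex scenarios); 2. A Laplace-distribution probabilistic loss that improves training robustness; 3. A new benchmark dataset that isolates interference from dynamic objects.
Result: Experiments show that CamFlow outperforms existing methods across diverse scenarios, with stronger robustness and generalization, particularly in zero-shot settings.
Insight: Hybrid motion bases can capture motion patterns that a single homography cannot represent, and the Laplace loss improves robustness to outliers. The new benchmark provides a cleaner evaluation setting for pure camera motion research.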
Abstract: Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. A key insight of our work is that combining flow fields from different homographies creates motion patterns that cannot be represented by any single homography. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: https://lhaippp.github.io/CamFlow/.
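The abstract mentions a probabilistic loss based on the Laplace distribution without giving its form. As a rough illustration only, the sketch below shows the standard Laplace negative log-likelihood for a scalar residual, log(2b) + |x - mu| / b. The function names and the per-element scale parameter are assumptions, and this is not claimed to be CamFlow's hybrid loss.

```python
import math


def laplace_nll(pred, target, scale):
    """Negative log-likelihood of `target` under Laplace(loc=pred, scale)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.log(2.0 * scale) + abs(pred - target) / scale


def mean_laplace_nll(preds, targets, scales):
    """Average the per-element Laplace NLL over a flow field flattened to lists."""
    losses = [laplace_nll(p, t, s) for p, t, s in zip(preds, targets, scales)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # The penalty grows linearly with the residual, unlike a squared error.
    for residual in (0.0, 1.0, 10.0):
        print(residual, laplace_nll(residual, 0.0, 1.0), residual ** 2)
```

The linear growth in the residual is what makes a Laplace-type loss less sensitive to outliers than a Gaussian (squared-error) one, which matches the robustness motivation given in the summary.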
[50] Recognizing Actions from Robotic View for Natural Human-Robot Interaction
Ziyi Wang,Peiming Li,Hong Liu,Zhichao Deng,Can Wang,Jun Liu,Junsong Yuan,Mengyuan Liu
Main category: cs.CV
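TL;DR: The paper presents the ACTIVE dataset and the ACTIVE-PC method for human action recognition from the viewpoint of mobile service robots, addressing the limitations of conventional benchmarks for natural human-robot interaction.
Details
Motivation: Natural human-robot interaction requires robots to recognize human actions from dynamic viewpoints, but existing datasets cannot meet this need because of limited data, modalities, and diversity. Contribution: 1. Introduces ACTIVE, a large-scale dataset covering a variety of actions, environments, and distances; 2. Proposes ACTIVE-PC, which improves long-distance action recognition through techniques such as Multilevel Neighborhood Sampling.
Method: ACTIVE-PC uses Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and decoupling of kinematic interference from human actions.
Result: Experiments demonstrate the effectiveness of ACTIVE-PC for long-distance action recognition.
Insight: Action recognition from a robot's viewpoint must account for a moving camera, diverse environments, and varying distances; the ACTIVE dataset and ACTIVE-PC provide a new benchmark and solution for this setting.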
Abstract: Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: https://github.com/wangzy01/ACTIVE-Action-from-Robotic-View.
[51] HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors
Xincheng Yao,Yijun Yang,Kangwei Guo,Ruiqiang Xiao,Haipeng Zhou,Haisu Tao,Jian Yang,Lei Zhu
Main category: cs.CV
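TL;DR: The paper proposes HRVVS, a high-resolution video vasculature segmentation network based on hierarchical autoregressive residual priors, and releases a high-quality annotated dataset of hepatic vasculature surgical videos.
Details
Motivation: Hepatic vasculature segmentation has substantial clinical significance in hepatectomy, but the lack of a suitable dataset and the complexity of the task have resulted in little related research. Contribution: 1) Releases a high-resolution, frame-by-frame annotated hepatic vasculature surgical video dataset; 2) Proposes the HRVVS network, which improves segmentation performance through hierarchical autoregressive residual priors and a dynamic memory decoder.
Method: 1) Embeds a pretrained visual autoregressive (VAR) model in the encoder as prior information; 2) Designs a dynamic memory decoder that reduces the transmission of redundant information while preserving inter-frame details.
Result: Experiments show that HRVVS significantly outperforms existing methods.
Insight: Hierarchical autoregressive priors and a dynamic memory mechanism can effectively address information degradation and redundancy in video vasculature segmentation.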
Abstract: The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the dearth of an appropriate dataset and the inherently complex task characteristics, few studies have been reported in this domain. To address this issue, we first introduce a high-quality frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS. We innovatively embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we design a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at https://github.com/scott-yjyang/HRVVS.
[52] RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning
Kiseong Hong,Gyeong-hyeon Kim,Eunwoo Kim
Main category: cs.CV
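TL;DR: RainbowPrompt proposes a diversity-enhanced prompt-evolving mechanism that adaptively aggregates task-specific prompts to improve continual learning, avoiding the representational limits of fixed prompts or a task-shared space. It significantly outperforms existing methods on image classification and video action recognition.
Details
Motivation: Existing prompt-based continual learning methods suffer from insufficient representational diversity: they either rely on fixed prompts or generate prompts from a task-shared space, and so cannot effectively integrate task-specific knowledge. Contribution: 1. Proposes a diversity-enhanced prompt-evolving mechanism; 2. Designs a learnable probabilistic gate that adaptively activates layers; 3. Achieves significant performance gains on multiple tasks.
Method: 1. Evolves knowledge by transforming and aligning task-specific prompts; 2. Uses a probabilistic gate to control which layers are activated during evolution; 3. Supports efficient integration of old and new task knowledge in class-incremental learning.
Result: Achieves average gains of 9.07% and 7.40% on image classification and video action recognition, respectively.
Insight: Maintaining prompt diversity and the ability to evolve adaptively is key to continual learning; the evolving mechanism makes efficient use of accumulated knowledge when learning new tasks.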
Abstract: Prompt-based continual learning provides a rehearsal-free solution by tuning small sets of parameters while keeping pre-trained models frozen. To meet the complex demands of sequential tasks, it is crucial to integrate task-specific knowledge within prompts effectively. However, existing works rely on either fixed learned prompts (i.e., prompts whose representations remain unchanged during new task learning) or on prompts generated from an entangled task-shared space, limiting the representational diversity of the integrated prompt. To address this issue, we propose a novel prompt-evolving mechanism to adaptively aggregate base prompts (i.e., task-specific prompts) into a unified prompt while ensuring diversity. By transforming and aligning base prompts, both previously learned and newly introduced, our approach continuously evolves accumulated knowledge to facilitate learning new tasks. We further introduce a learnable probabilistic gate that adaptively determines which layers to activate during the evolution process. We validate our method on image classification and video action recognition tasks in class-incremental learning, achieving average gains of 9.07% and 7.40% over existing methods across all scenarios.
[53] Subtyping Breast Lesions via Generative Augmentation based Long-tailed Recognition in Ultrasound
Shijing Chen,Xinrui Zhou,Yuhao Wang,Yuhao Huang,Ao Chang,Dong Ni,Ruobing Huang
Main category: cs.CV
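TL;DR: The paper proposes a dual-phase framework that addresses long-tailed classification in breast ultrasound images through generative augmentation and a reinforcement learning-driven adaptive sampler, achieving strong classification results.
Details
Motivation: Accurate identification of breast lesion subtypes is critical for personalized treatment, but the incidence of different subtypes follows a long-tailed distribution, which makes automated recognition challenging. Generative augmentation offers a possible solution. Contribution: 1) Proposes a dual-phase framework that mitigates distributional bias through high-fidelity data synthesis; 2) Uses a reinforcement learning-driven adaptive sampler to dynamically balance the ratio of synthetic to real data; 3) Designs a class-controllable synthesis network based on anatomical priors that preserves class features.
Method: 1) A dual-phase framework with generative augmentation; 2) A reinforcement learning-driven adaptive sampler that calibrates data ratios; 3) A class-controllable synthesis network with a sketch-grounded perception branch.
Result: Performs favorably against existing methods on an in-house long-tailed dataset and a public imbalanced breast ultrasound dataset.
Insight: Generative augmentation combined with an adaptive data synthesis strategy can effectively mitigate long-tailed distributions while avoiding the performance degradation caused by overusing synthetic data.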
Abstract: Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long-tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify data distribution. Inspired by this, we propose a dual-phase framework for long-tailed classification that mitigates distributional bias through high-fidelity data synthesis while avoiding overuse that corrupts holistic performance. The framework incorporates a reinforcement learning-driven adaptive sampler, dynamically calibrating synthetic-real data ratios by training a strategic multi-agent to compensate for scarcities of real data while ensuring stable discriminative capability. Furthermore, our class-controllable synthetic network integrates a sketch-grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation-free inference. Extensive experiments on an in-house long-tailed dataset and a public imbalanced breast US dataset demonstrate that our method achieves promising performance compared to state-of-the-art approaches. More synthetic images can be found at https://github.com/Stinalalala/Breast-LT-GenAug.
[54] COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP
Galadrielle Humblot-Renaux,Gianni Franchi,Sergio Escalera,Thomas B. Moeslund
Main category: cs.CV
TL;DR: COOkeD proposes a heterogeneous ensemble method for OOD detection, combining closed-world classifier, zero-shot CLIP classifier, and linear probe classifier, achieving SOTA results.
Details
Motivation: OOD detection performance is limited by single classifiers' capabilities on ID data. The paper aims to leverage diverse classifiers' strengths for robust OOD detection. Contribution: Introduces COOkeD, a modular, post-hoc ensemble of diverse classifiers (closed-world, zero-shot CLIP, linear probe) for improved OOD detection.
Method: Combines predictions from a closed-world end-to-end trained classifier, zero-shot CLIP classifier, and linear probe on CLIP features.
Result: Achieves state-of-the-art OOD detection performance on CIFAR100 and ImageNet, with robustness to label noise and covariate shift.
Insight: Heterogeneous ensembles leverage complementary strengths of different classifiers, enhancing OOD detection beyond single-model approaches.
Abstract: Out-of-distribution (OOD) detection is an important building block in trustworthy image recognition systems as unknown classes may arise at test-time. OOD detection methods typically revolve around a single classifier, leading to a split in the research field between the classical supervised setting (e.g. ResNet18 classifier trained on CIFAR100) vs. the zero-shot setting (class names fed as prompts to CLIP). In both cases, an overarching challenge is that the OOD detection performance is implicitly constrained by the classifier’s capabilities on in-distribution (ID) data. In this work, we show that given a little open-mindedness from both ends, remarkable OOD detection can be achieved by instead creating a heterogeneous ensemble - COOkeD combines the predictions of a closed-world classifier trained end-to-end on a specific dataset, a zero-shot CLIP classifier, and a linear probe classifier trained on CLIP image features. While bulky at first sight, this approach is modular, post-hoc and leverages the availability of pre-trained VLMs, thus introduces little overhead compared to training a single standard classifier. We evaluate COOkeD on popular CIFAR100 and ImageNet benchmarks, but also consider more challenging, realistic settings ranging from training-time label noise, to test-time covariate shift, to zero-shot shift which has been previously overlooked. Despite its simplicity, COOkeD achieves state-of-the-art performance and greater robustness compared to both classical and CLIP-based OOD detection methods. Code is available at https://github.com/glhr/COOkeD
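To make the idea of a post-hoc heterogeneous ensemble concrete, here is a minimal sketch under stated assumptions. The abstract only says that COOkeD "combines the predictions" of three classifiers; averaging their softmax probabilities and scoring with the maximum averaged probability are assumptions made for illustration, and all function names are hypothetical rather than taken from the COOkeD repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(prob_lists):
    """Average class probabilities from several classifiers (assumed combination rule)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


def msp_score(probs):
    """Maximum probability: lower values suggest an out-of-distribution input."""
    return max(probs)


if __name__ == "__main__":
    # Placeholder logits standing in for the closed-world, zero-shot CLIP,
    # and linear-probe classifiers.
    agree = [softmax([4.0, 0.0, 0.0]), softmax([3.0, 0.5, 0.0]), softmax([5.0, 0.0, 1.0])]
    disagree = [softmax([4.0, 0.0, 0.0]), softmax([0.0, 4.0, 0.0]), softmax([0.0, 0.0, 4.0])]
    print(msp_score(ensemble_probs(agree)), msp_score(ensemble_probs(disagree)))
```

When the three classifiers agree, the averaged distribution stays peaked; when they disagree, averaging flattens it and the score drops, which is the intuition behind using complementary models for OOD detection.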
[55] Robust Deepfake Detection for Electronic Know Your Customer Systems Using Registered Images
Takuma Amada,Kazuya Kakizaki,Taiki Miyagawa,Akinori F. Ebihara,Kaede Shiohara,Toshihiko Yamasaki
Main category: cs.CV
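TL;DR: The paper proposes a deepfake detection algorithm for electronic Know Your Customer (eKYC) systems. It detects temporal inconsistencies in identity vectors extracted from video, uses a registered image to improve detection accuracy, and relies on a feature extractor trained on a large dataset for robustness.
Details
Motivation: eKYC systems are vulnerable to deepfake attacks (such as face swapping and face reenactment) during identity verification, so a robust detection algorithm is needed to ensure their reliability. Contribution: 1) Comprehensive detection by identifying temporal inconsistencies in identity vectors; 2) Significantly improved detection accuracy by using a registered image; 3) Better detection performance and robustness from a feature extractor trained on a larger dataset.
Method: Extracts identity vectors from the video and detects their temporal inconsistencies, computes the identity discrepancy between the input video and the registered (assumed genuine) image, and uses a feature extractor trained on a large-scale dataset to strengthen detection.
Result: Experiments show that the method accurately detects both face swapping and face reenactment and is robust to various forms of unseen image degradation.
Insight: Using a registered image together with temporal-inconsistency detection is an effective strategy for improving deepfake detection, particularly under image degradation.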
Abstract: In this paper, we present a deepfake detection algorithm specifically designed for electronic Know Your Customer (eKYC) systems. To ensure the reliability of eKYC systems against deepfake attacks, it is essential to develop a robust deepfake detector capable of identifying both face swapping and face reenactment, while also being robust to image degradation. We address these challenges through three key contributions: (1) Our approach evaluates the video's authenticity by detecting temporal inconsistencies in identity vectors extracted by face recognition models, leading to comprehensive detection of both face swapping and face reenactment. (2) In addition to processing video input, the algorithm utilizes a registered image (assumed to be genuine) to calculate identity discrepancies between the input video and the registered image, significantly improving detection accuracy. (3) We find that employing a face feature extractor trained on a larger dataset enhances both detection performance and robustness against image degradation. Our experimental results show that our proposed method accurately and comprehensively detects both face swapping and face reenactment and is robust against various forms of unseen image degradation. Our source code is publicly available at https://github.com/TaikiMiyagawa/DeepfakeDetection4eKYC.
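The two cues described above, temporal inconsistency of identity vectors and discrepancy from a registered image, can be sketched with plain cosine similarity. This is an illustrative sketch only: the identity vectors would come from a face recognition model that is not shown, and the specific scoring rules (largest frame-to-frame cosine distance, mean distance to the registered vector) and function names are assumptions rather than the authors' algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_inconsistency(frame_vecs):
    """Largest cosine distance between consecutive frames' identity vectors."""
    distances = [
        1.0 - cosine_similarity(prev, cur)
        for prev, cur in zip(frame_vecs, frame_vecs[1:])
    ]
    return max(distances) if distances else 0.0


def registered_discrepancy(frame_vecs, registered_vec):
    """Mean cosine distance between each frame and the registered image's vector."""
    distances = [1.0 - cosine_similarity(v, registered_vec) for v in frame_vecs]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    registered = [1.0, 0.0]
    video = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # identity shifts in the last frame
    print(temporal_inconsistency(video), registered_discrepancy(video, registered))
```

A detector along these lines would flag a video when either score exceeds a threshold chosen on validation data; the second score is only available because eKYC provides a registered reference image.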
[56] ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning
Xiefan Guo,Miaomiao Cui,Liefeng Bo,Di Huang
Main category: cs.CV
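TL;DR: The paper proposes ShortFT, a shortcut-based fine-tuning strategy that uses a shorter denoising chain to address the long gradient computation time and explosion risk of conventional backpropagation methods in diffusion models, significantly improving alignment performance and efficiency.
Details
Motivation: Conventional backpropagation-based methods incur high computational cost and a high risk of gradient explosion because of the lengthy denoising chain in diffusion models; gradient backpropagation is therefore incomplete, and optimal model alignment is hard to achieve. Contribution: Proposes ShortFT, which uses a trajectory-preserving few-step diffusion model to construct a shorter denoising chain, significantly improving fine-tuning efficiency and alignment quality; it applies to a variety of reward functions.
Method: Uses a trajectory-preserving few-step diffusion model to construct a shortcut-based denoising chain and fine-tunes efficiently by optimizing over this shorter chain.
Result: The method performs well across various reward functions, surpasses existing techniques, and significantly improves alignment performance.
Insight: A shorter denoising chain not only lowers computational cost but also helps avoid gradient explosion, offering a new direction for efficient optimization of diffusion models.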
Abstract: Backpropagation-based approaches aim to align diffusion models with reward functions through end-to-end backpropagation of the reward gradient within the denoising chain, offering a promising perspective. However, due to the computational costs and the risk of gradient explosion associated with the lengthy denoising chain, existing approaches struggle to achieve complete gradient backpropagation, leading to suboptimal results. In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. More specifically, we employ the recently researched trajectory-preserving few-step diffusion model, which enables a shortcut over the original denoising chain, and construct a shortcut-based denoising chain of shorter length. The optimization on this chain notably enhances the efficiency and effectiveness of fine-tuning the foundational model. Our method has been rigorously tested and can be effectively applied to various reward functions, significantly improving alignment performance and surpassing state-of-the-art alternatives.
[57] VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Ruifeng Yuan,Chenghao Xiao,Sicong Leng,Jianyu Wang,Long Li,Weiwen Xu,Hou Pong Chan,Deli Zhao,Tingyang Xu,Zhongyu Wei,Hao Zhang,Yu Rong
Main category: cs.CV
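TL;DR: VL-Cogito is a multimodal reasoning model trained with Progressive Curriculum Reinforcement Learning (PCuRL). By dynamically adjusting training difficulty and adaptively regulating reasoning path length, it significantly improves reasoning on multimodal tasks.
Details
Motivation: Existing models perform unstably on multimodal tasks and struggle with their diversity and complexity. The PCuRL framework is proposed to systematically improve reasoning ability. Contribution: 1. Proposes the PCuRL framework, which improves multimodal reasoning through progressive curriculum learning; 2. Introduces an online difficulty soft weighting mechanism and a dynamic length reward mechanism to optimize training.
Method: PCuRL uses multi-stage training, combining dynamic difficulty adjustment with adaptive regulation of reasoning path length, to improve the model's capabilities step by step.
Result: Experiments show that VL-Cogito matches or surpasses existing models on mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding.
Insight: Progressive curriculum learning and dynamic reward mechanisms can effectively improve performance and stability on multimodal reasoning tasks.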
Abstract: Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
[58] MergeSAM: Unsupervised change detection of remote sensing images based on the Segment Anything Model
Meiqi Hu,Lingzhi Lu,Chengxi Han,Xiaoping Liu
Main category: cs.CV
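TL;DR: MergeSAM is an unsupervised change detection method for high-resolution remote sensing images based on the Segment Anything Model (SAM). It uses MaskMatching and MaskSplitting strategies to handle complex real-world changes such as object splitting and merging.
Details
Motivation: Large foundation models excel at feature extraction and general feature representation, which has advanced unsupervised change detection. This paper aims to use SAM's strong object segmentation capability to make change detection in remote sensing images more practical. Contribution: Proposes MergeSAM and designs the MaskMatching and MaskSplitting strategies, using SAM to build multitemporal masks that capture complex changes and embedding spatial structure into the change detection process.
Method: An unsupervised SAM-based method that handles complex changes such as object splitting and merging through MaskMatching and MaskSplitting, and constructs multitemporal masks.
Result: Demonstrates the effectiveness of MergeSAM in complex real-world scenes, improving the accuracy and practicality of change detection.
Insight: Large general-purpose models such as SAM have the potential to significantly improve task performance in remote sensing through unsupervised methods, especially when handling complex changes.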
Abstract: Recently, large foundation models trained on vast datasets have demonstrated exceptional capabilities in feature extraction and general feature representation. The ongoing advancements in deep learning-driven large models have shown great promise in accelerating unsupervised change detection methods, thereby enhancing the practical applicability of change detection technologies. Building on this progress, this paper introduces MergeSAM, an innovative unsupervised change detection method for high-resolution remote sensing imagery, based on the Segment Anything Model (SAM). Two novel strategies, MaskMatching and MaskSplitting, are designed to address real-world complexities such as object splitting, merging, and other intricate changes. The proposed method fully leverages SAM’s object segmentation capabilities to construct multitemporal masks that capture complex changes, embedding the spatial structure of land cover into the change detection process.
[59] Hydra-Bench: A Benchmark for Multi-Modal Leaf Wetness Sensing
Yimeng Liu,Maolin Gan,Yidong Ren,Gen Li,Jingkai Lin,Younsuk Dong,Zhichao Cao
Main category: cs.CV
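TL;DR: The paper presents Hydra-Bench, a multi-modal dataset for evaluating and improving machine learning algorithms for leaf wetness detection. The dataset contains data from multiple sensors and comes with detailed benchmarks.
Details
Motivation: Existing leaf wetness sensing systems lack robustness, accuracy, and environmental resilience on natural leaves under dynamic real-world conditions; new datasets and algorithms are needed to improve performance. Contribution: 1. Introduces Hydra-Bench, a new multi-modal leaf wetness detection dataset; 2. Provides synchronized mmWave raw data, SAR images, and RGB images; 3. Uses the Hydra model to run a range of benchmarks and compare fusion strategies.
Method: The dataset comprises mmWave data, SAR images, and RGB images collected from five plant species in different environments. The Hydra model is used to systematically evaluate single-modality baselines and multi-modal fusion strategies.
Result: The Hydra-Bench dataset supports performance gains for multi-modal methods and shows the advantages of fusion strategies across different scan distances and environments.
Insight: Fusing multi-modal sensor data can significantly improve the robustness and accuracy of leaf wetness detection, especially in complex environments.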
Abstract: Leaf wetness detection is a crucial task in agricultural monitoring, as it directly impacts the prediction and protection of plant diseases. However, existing sensing systems suffer from limitations in robustness, accuracy, and environmental resilience when applied to natural leaves under dynamic real-world conditions. To address these challenges, we introduce a new multi-modal dataset specifically designed for evaluating and advancing machine learning algorithms in leaf wetness detection. Our dataset comprises synchronized mmWave raw data, Synthetic Aperture Radar (SAR) images, and RGB images collected over six months from five diverse plant species in both controlled and outdoor field environments. We provide detailed benchmarks using the Hydra model, including comparisons against single modality baselines and multiple fusion strategies, as well as performance under varying scan distances. Additionally, our dataset can serve as a benchmark for future SAR imaging algorithm optimization, enabling a systematic evaluation of detection accuracy under diverse conditions.
[60] Zero-Shot Image Anomaly Detection Using Generative Foundation Models
Lemar Abdi,Amaan Valiuddin,Francisco Caetano,Christiaan Viviers,Fons van der Sommen
Main category: cs.CV
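TL;DR: The paper proposes a diffusion-model-based zero-shot image anomaly detection method that uses score errors along the denoising process together with the SSIM metric, enabling general anomaly detection without re-training on each target dataset.
Details
Motivation: Deploying safe vision systems in open-world environments requires detecting out-of-distribution (OOD) inputs. Conventional methods usually require re-training on each target dataset; this method aims for general anomaly detection using a generative foundation model. Contribution: The main contribution is a zero-shot anomaly detection method that uses Denoising Diffusion Models (DDMs) as universal perceptual templates; analyzing Stein score errors in combination with SSIM significantly improves performance.
Method: The method builds on the denoising trajectories of diffusion models, using the texture and semantic information they provide together with Stein score errors and SSIM, to form a general anomaly detection framework that needs no re-training on each target dataset.
Result: Experiments show strong results on multiple benchmarks, near-perfect on some, and a single model trained on CelebA outperforms conventional methods.
Insight: Generative foundation models such as diffusion models have great potential for anomaly detection, especially in the zero-shot setting, where their rich semantic and texture information enables effective general-purpose detection.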
Abstract: Detecting out-of-distribution (OOD) inputs is pivotal for deploying safe vision systems in open-world environments. We revisit diffusion models, not as generators, but as universal perceptual templates for OOD detection. This research explores the use of score-based generative models as foundational tools for semantic anomaly detection across unseen datasets. Specifically, we leverage the denoising trajectories of Denoising Diffusion Models (DDMs) as a rich source of texture and semantic information. By analyzing Stein score errors, amplified through the Structural Similarity Index Metric (SSIM), we introduce a novel method for identifying anomalous samples without requiring re-training on each target dataset. Our approach improves over state-of-the-art and relies on training a single model on one dataset – CelebA – which we find to be an effective base distribution, even outperforming more commonly used datasets like ImageNet in several settings. Experimental results show near-perfect performance on some benchmarks, with notable headroom on others, highlighting both the strength and future potential of generative foundation models in anomaly detection.
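As a reminder of what SSIM measures, the sketch below computes a simplified single-window SSIM over two flattened images, using the usual constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for data range L. Standard SSIM averages this quantity over local (often Gaussian-weighted) windows, so this global version is a simplification, and how the paper applies SSIM to Stein score errors is not described in the abstract. The function name is hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized flat lists of pixel values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    image = [0.1, 0.5, 0.9, 0.3]
    inverted = [1.0 - v for v in image]
    print(global_ssim(image, image), global_ssim(image, inverted))
```

Identical inputs give a value of 1, while structurally opposite inputs give a negative value, so a low SSIM between two signals is a natural indicator of structural mismatch.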
[61] Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
Thuy Tran,Ruochen Chen,Shaifali Parashar
Main category: cs.CV
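TL;DR: The paper proposes an unsupervised Shape-from-Template method that combines image features with a mesh inextensibility constraint, reconstructing 3D shapes 400 times faster than the best-performing unsupervised method and performing well under severe occlusion and in generating fine details.
Details
Motivation: Traditional SfT methods rely on point correspondences and degrade under severe occlusion, while modern deep learning methods require large amounts of supervised data. The goal is efficient and robust 3D reconstruction in an unsupervised manner by combining image features with physical constraints. Contribution: Proposes an unsupervised SfT method that needs only image observations (color, gradients, and silhouettes) plus a mesh inextensibility constraint to achieve fast, accurate 3D reconstruction.
Method: Combines image features (color, gradients, silhouettes) with a mesh inextensibility constraint and deforms the 3D template to match the input image through optimization.
Result: 400 times faster than the best-performing existing unsupervised SfT method, and significantly better than existing methods at generating fine details and under severe occlusion.
Insight: By combining physical constraints (such as mesh inextensibility) with image features, unsupervised methods can achieve efficient 3D reconstruction without supervised data.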
Abstract: Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance severely degrades when faced with severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform a 3D template to match input images. In this paper, we propose an unsupervised SfT method that uses only image observations: color features, gradients and silhouettes along with a mesh inextensibility constraint to reconstruct at a $400\times$ faster pace than (best-performing) unsupervised SfT. Moreover, when it comes to generating finer details and handling severe occlusions, our method outperforms the existing methodologies by a large margin. Code is available at https://github.com/dvttran/nsft.
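A mesh inextensibility constraint says that the surface may bend but not stretch. One common discrete proxy, shown in the sketch below, penalizes changes in mesh edge length between the template and the deformed mesh. The abstract does not give the paper's exact formulation, so the mean squared edge-length change and the function names here are assumptions for illustration.

```python
import math


def edge_length(vertices, edge):
    """Euclidean length of the edge (i, j) given a list of 3D vertex tuples."""
    i, j = edge
    return math.dist(vertices[i], vertices[j])


def inextensibility_penalty(template_vertices, deformed_vertices, edges):
    """Mean squared change in edge length between the template and deformed mesh."""
    total = 0.0
    for edge in edges:
        rest = edge_length(template_vertices, edge)
        current = edge_length(deformed_vertices, edge)
        total += (current - rest) ** 2
    return total / len(edges)


if __name__ == "__main__":
    template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    edges = [(0, 1), (1, 2), (0, 2)]
    moved = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in template]  # rigid translation
    stretched = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in template]  # uniform scaling
    print(inextensibility_penalty(template, moved, edges))
    print(inextensibility_penalty(template, stretched, edges))
```

Rigid motions leave the penalty at zero while stretching increases it, so adding such a term to an image-matching objective discourages physically implausible deformations without requiring any 3D supervision.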
[62] A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
Hang Su,Yunlong Feng,Daniel Gehrig,Panfeng Jiang,Ling Gao,Xavier Lagorce,Laurent Kneip
Main category: cs.CV
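TL;DR: The paper presents a new linear N-point solver for estimating structure and motion from asynchronous tracks, applicable to a range of sensors including global shutter, rolling shutter, and event cameras.
Details
Motivation: The classical 5-point and 8-point algorithms only handle point correspondences from synchronized views and cannot process asynchronous data (such as from rolling shutter or event cameras). This paper aims to address that limitation. Contribution: Proposes a unified approach based on a linear point incidence relation that efficiently recovers linear velocity and 3D points, applicable to multiple sensors and asynchronous data.
Method: Using first-order dynamics and a constant velocity motion model, derives a new linear point incidence relation that solves structure and motion estimation from asynchronous data.
Result: Validated on simulated and real-world data, with consistent improvement across all modalities compared to recent approaches.
Insight: The method opens a new path to efficient structure and motion estimation from asynchronous data and is broadly applicable.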
Abstract: Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
[63] HOLA: Enhancing Audio-visual Deepfake Detection via Hierarchical Contextual Aggregations and Efficient Pre-training
Xuecheng Wu,Danlei Huang,Heli Sun,Xinyi Yin,Yifan Wang,Hao Wang,Jia Zhang,Fei Wang,Peihao Guo,Suyu Xing,Junxiao Xue,Liang He
Main category: cs.CV
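TL;DR: HOLA proposes hierarchical contextual aggregation and efficient pre-training to improve audio-visual deepfake detection; a two-stage framework, a cross-modal learning module, and a pseudo-supervised signal injection strategy significantly improve detection.
Details
Motivation: Current video-level deepfake detection techniques have limitations, and advances in generative AI make detection more challenging. HOLA aims to address this through large-scale pre-training and multimodal learning. Contribution: HOLA's contributions include: (1) a unified two-stage framework, (2) an iterative-aware cross-modal learning module, (3) hierarchical contextual modeling with gated aggregation, (4) a pyramid-like refiner for cross-grained semantic enhancement, and (5) a pseudo-supervised signal injection strategy.
Method: HOLA uses audio-visual self-supervised pre-training and combines a cross-modal learning module, hierarchical contextual modeling, and a pyramid-like refiner with pseudo-supervised signal injection.
Result: HOLA ranked 1st in the 2025 1M-Deepfakes Detection Challenge, exceeding the runner-up by 0.0476 AUC on the TestA set; experiments confirm its effectiveness.
Insight: Hierarchical contextual modeling and cross-modal interaction are key, and pseudo-supervised signals further improve model performance, offering a new direction for video-level deepfake detection.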
Abstract: Advances in Generative AI have made video-level deepfake detection increasingly challenging, exposing the limitations of current detection techniques. In this paper, we present HOLA, our solution to the Video-Level Deepfake Detection track of 2025 1M-Deepfakes Detection Challenge. Inspired by the success of large-scale pre-training in the general domain, we first scale audio-visual self-supervised pre-training in the multimodal video-level deepfake detection, which leverages our self-built dataset of 1.81M samples, thereby leading to a unified two-stage framework. To be specific, HOLA features an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations under the local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancements. Moreover, we propose the pseudo supervised signal injection strategy to further boost model performance. Extensive experiments across expert models and MLLMs impressively demonstrate the effectiveness of our proposed HOLA. We also conduct a series of ablation studies to explore the crucial design factors of our introduced components. Remarkably, our HOLA ranks 1st, outperforming the second by 0.0476 AUC on the TestA set.
[64] Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques
Weide Liu,Wei Zhou,Jun Liu,Ping Hu,Jun Cheng,Jungong Han,Weisi Lin
Main category: cs.CV
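TL;DR: A comprehensive survey of single-modality and cross-modality feature matching techniques, covering the comparison and application of traditional handcrafted methods and deep learning methods.
Details
Motivation: Feature matching is a core task in computer vision, but it is challenging across diverse modalities, and traditional methods perform poorly when modality gaps are large. This survey organizes and compares how different techniques perform on various modalities. Contribution: Comprehensively summarizes the development of single- and cross-modality feature matching and analyzes the applications and limitations of traditional and deep learning methods across modalities (such as RGB images, depth images, point clouds, and LiDAR).
Method: Discusses the strengths and weaknesses of traditional handcrafted methods (such as Harris, SIFT, and ORB) and deep learning techniques (such as SuperPoint and LoFTR), and highlights modality-specific advances (such as geometric descriptors and attention-enhanced networks).
Result: Deep learning substantially improves the robustness and adaptability of cross-modality feature matching, notably in complex tasks such as 3D point cloud matching and medical image processing.
Insight: Cross-modality feature matching needs methods designed for the characteristics of each data type; the flexibility and expressive power of deep learning have brought important advances to the field.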
Abstract: Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.
[65] Segment Anything for Video: A Comprehensive Review of Video Object Segmentation and Tracking from Past to Future
Guoping Xu,Jayaram K. Udupa,Yajun Yu,Hua-Chieh Shao,Songlin Zhao,Wei Liu,You Zhang
Main category: cs.CV
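TL;DR: This paper comprehensively reviews Video Object Segmentation and Tracking (VOST) methods based on the Segment Anything Model (SAM/SAM2), organizes the field's development along three temporal dimensions (past, present, and future), and discusses current challenges and future research directions.
Details
Motivation: VOST is an important challenge in computer vision, and traditional methods are limited in domain generalization, temporal consistency, and computational efficiency. Foundation models such as SAM/SAM2 offer a new paradigm for these problems, so a systematic review of their application is warranted. Contribution: 1. Proposes a survey framework for SAM/SAM2-based VOST methods, analyzed along three dimensions: past (historical information), present (current-frame features), and future (motion prediction); 2. Summarizes the technical evolution from traditional methods to SAM2; 3. Identifies current challenges and proposes future research directions.
Method: The survey is organized along the temporal dimension: (1) past: strategies for retaining and updating historical information; (2) present: approaches for extracting and optimizing discriminative features from the current frame; (3) future: motion prediction and trajectory estimation mechanisms. It also discusses recent techniques such as motion-aware memory selection and trajectory-guided prompting.
Result: SAM/SAM2 significantly improve the generalization and real-time performance of VOST, but challenges remain, including memory redundancy, error accumulation, and prompt inefficiency.
Insight: 1. Foundation models such as SAM/SAM2 have brought a paradigm shift to VOST; 2. The temporal analysis framework gives VOST research a systematic perspective; 3. Future work needs to address issues such as memory efficiency and prompt optimization.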
Abstract: Video Object Segmentation and Tracking (VOST) presents a complex yet critical challenge in computer vision, requiring robust integration of segmentation and tracking across temporally dynamic frames. Traditional methods have struggled with domain generalization, temporal consistency, and computational efficiency. The emergence of foundation models like the Segment Anything Model (SAM) and its successor, SAM2, has introduced a paradigm shift, enabling prompt-driven segmentation with strong generalization capabilities. Building upon these advances, this survey provides a comprehensive review of SAM/SAM2-based methods for VOST, structured along three temporal dimensions: past, present, and future. We examine strategies for retaining and updating historical information (past), approaches for extracting and optimizing discriminative features from the current frame (present), and motion prediction and trajectory estimation mechanisms for anticipating object dynamics in subsequent frames (future). In doing so, we highlight the evolution from early memory-based architectures to the streaming memory and real-time segmentation capabilities of SAM2. We also discuss recent innovations such as motion-aware memory selection and trajectory-guided prompting, which aim to enhance both accuracy and efficiency. Finally, we identify remaining challenges including memory redundancy, error accumulation, and prompt inefficiency, and suggest promising directions for future research. This survey offers a timely and structured overview of the field, aiming to guide researchers and practitioners in advancing the state of VOST through the lens of foundation models.
[66] Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings
Dongli He,Hu Wang,Mohammad Yaqub
Main category: cs.CV
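TL;DR: The paper proposes a low-cost, FetalCLIP-based method for fetal ultrasound image quality assessment, using parameter-efficient fine-tuning (LoRA) to improve prenatal care in resource-limited settings.
Details
Motivation: Low-income countries lack trained sonographers, which leads to inconsistent fetal ultrasound image quality and affects prenatal care. Contribution: 1. Proposes FetalCLIP$_{CLS}$, adapted with LoRA for efficient fine-tuning; 2. Outperforms the CNN and Transformer baselines on the ACOUSLIC-AI dataset; 3. Shows the potential of repurposing a segmentation model for classification.
Method: 1. Starts from the pretrained FetalCLIP (trained on over 210,000 image-caption pairs); 2. Applies LoRA for parameter-efficient fine-tuning; 3. Repurposes an adapted segmentation model to improve classification performance.
Result: FetalCLIP$_{CLS}$ achieves an F1 score of 0.757; the adapted segmentation model raises the F1 score to 0.771.
Insight: Low-rank adaptation (LoRA) can efficiently fine-tune foundation models in resource-constrained settings, and the transferability of segmentation models suggests a new approach to classification tasks.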
Abstract: Accurate fetal biometric measurements, such as abdominal circumference, play a vital role in prenatal care. However, obtaining high-quality ultrasound images for these measurements heavily depends on the expertise of sonographers, posing a significant challenge in low-income countries due to the scarcity of trained personnel. To address this issue, we leverage FetalCLIP, a vision-language model pretrained on a curated dataset of over 210,000 fetal ultrasound image-caption pairs, to perform automated fetal ultrasound image quality assessment (IQA) on blind-sweep ultrasound data. We introduce FetalCLIP$_{CLS}$, an IQA model adapted from FetalCLIP using Low-Rank Adaptation (LoRA), and evaluate it on the ACOUSLIC-AI dataset against six CNN and Transformer baselines. FetalCLIP$_{CLS}$ achieves the highest F1 score of 0.757. Moreover, we show that an adapted segmentation model, when repurposed for classification, further improves performance, achieving an F1 score of 0.771. Our work demonstrates how parameter-efficient fine-tuning of fetal ultrasound foundation models can enable task-specific adaptations, advancing prenatal care in resource-limited settings. The experimental code is available at: https://github.com/donglihe-hub/FetalCLIP-IQA.
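To make the Low-Rank Adaptation idea concrete, here is a minimal sketch of how a LoRA adapter modifies a frozen weight matrix: the effective weight is W + (alpha / r) * B A, where only the small matrices A and B are trained. The matrix sizes, the alpha value, and the helper names are assumptions for illustration; this is not the FetalCLIP$_{CLS}$ implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with W frozen and only A, B trainable."""
    r = len(a)                      # rank = number of rows of A
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Frozen 2x3 weight and a rank-1 adapter (toy numbers).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]               # r x d_in
B_init = [[0.0], [0.0]]             # d_out x r, zero-initialized: training starts exactly at W
B_trained = [[0.5], [-1.0]]         # hypothetical values after some training

W_start = lora_weight(W, A, B_init, alpha=2.0)
W_eff = lora_weight(W, A, B_trained, alpha=2.0)
```

With rank r much smaller than the layer dimensions, the adapter stores r * (d_in + d_out) values instead of d_in * d_out, which is what makes this style of fine-tuning attractive in resource-limited settings.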
[67] MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
Yuqi Pang,Bowen Yang,Yun Cao,Fan Rong,Xiaoyu Li,Chen He
Main category: cs.CV
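TL;DR: MoCHA is a new vision-language model framework that integrates multiple vision backbones with a Mixture of Experts Connectors (MoECs) module for dynamic expert selection and a Hierarchical Group Attention (HGA) module, significantly improving performance on vision-language tasks.
Details
Motivation: Current vision large language models (VLLMs) face high costs and difficulty bridging modalities when handling complex visual information; MoCHA aims to address these issues. Contribution: 1. Integrates multiple vision backbones to extract complementary visual features; 2. Proposes a sparse MoECs module that dynamically selects experts; 3. Designs an HGA module to refine the encoded visual features.
Method: Extracts visual features with CLIP, SigLIP, DINOv2, and ConvNeXt, dynamically selects experts through MoECs, and refines feature encoding with the HGA module.
Result: MoCHA performs strongly on multiple benchmarks; for example, compared to CuMo (Mistral-7B), MoCHA (Phi2-2.7B) improves by 3.25% on POPE and by 153 points on MME.
Insight: The MoECs and HGA modules substantially improve robustness and performance, indicating that dynamic integration and refinement of multimodal features is key.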
Abstract: Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.
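The sparse expert selection described above can be sketched with a generic top-k gating routine: a router scores every expert, only the k highest-scoring experts run, and their outputs are mixed with renormalized weights. The router logits, the toy experts, and k = 2 are assumptions for illustration; the abstract does not specify how the MoECs router is implemented.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest router logits, renormalize them with softmax, zero the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    gate = [0.0] * len(logits)
    for i, w in zip(top, weights):
        gate[i] = w
    return gate

def moe_output(x, experts, logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    gate = top_k_gate(logits, k)
    out = [0.0] * len(x)            # assumes each expert preserves the feature dimension
    for g, expert in zip(gate, experts):
        if g > 0.0:
            y = expert(x)
            out = [o + g * yi for o, yi in zip(out, y)]
    return out

# Three toy "connector" experts (hypothetical stand-ins for learned projections).
experts = [
    lambda v: [2.0 * t for t in v],
    lambda v: [t + 1.0 for t in v],
    lambda v: [-t for t in v],
]
gate = top_k_gate([1.0, 1.0, -5.0], k=2)
y = moe_output([1.0, 2.0], experts, logits=[1.0, 1.0, -5.0], k=2)
```

Because unselected experts are skipped entirely, compute grows with k rather than with the total number of experts.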
[68] Wall Shear Stress Estimation in Abdominal Aortic Aneurysms: Towards Generalisable Neural Surrogate Models
Patryk Rygiel,Julian Suk,Christoph Brune,Kak Khee Yeung,Jelmer M. Wolterink
Main category: cs.CV
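TL;DR: The paper proposes a geometric deep learning model for estimating hemodynamic parameters in patients with abdominal aortic aneurysms (AAAs), avoiding the high computational cost of traditional computational fluid dynamics (CFD) simulation, and shows good generalization under real-world variation such as geometry remodelling and changes in boundary conditions.
Details
Motivation: AAA research typically relies on CFD simulation to obtain hemodynamic parameters, which is computationally expensive. Geometric deep learning methods can estimate these parameters quickly, but existing methods generalize poorly under real-world sources of variation. Contribution: 1. An E(3)-equivariant geometric deep learning model; 2. New robust geometric descriptors combined with projective geometric algebra; 3. Demonstrated generalization across geometry remodelling, changes in boundary conditions, and different artery tree topologies.
Method: Lumen geometries are extracted from CT scans of 100 AAA patients, and reference hemodynamic parameters are obtained from CFD simulations. The model uses an E(3)-equivariant architecture and is trained with geometric descriptors and projective geometric algebra.
Result: The model performs well on both the in-distribution and external test sets, accurately estimates hemodynamics under geometry remodelling and changes in boundary conditions, and is largely robust to mesh resolution.
Insight: Geometric deep learning shows promise for hemodynamic parameter estimation; it adapts to the complex variation seen in clinical practice and offers an efficient tool for personalized AAA risk assessment.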
Abstract: Abdominal aortic aneurysms (AAAs) are pathologic dilatations of the abdominal aorta posing a high fatality risk upon rupture. Studying AAA progression and rupture risk often involves in-silico blood flow modelling with computational fluid dynamics (CFD) and extraction of hemodynamic factors like time-averaged wall shear stress (TAWSS) or oscillatory shear index (OSI). However, CFD simulations are known to be computationally demanding. Hence, in recent years, geometric deep learning methods, operating directly on 3D shapes, have been proposed as compelling surrogates, estimating hemodynamic parameters in just a few seconds. In this work, we propose a geometric deep learning approach to estimating hemodynamics in AAA patients, and study its generalisability to common factors of real-world variation. We propose an E(3)-equivariant deep learning model utilising novel robust geometrical descriptors and projective geometric algebra. Our model is trained to estimate transient WSS using a dataset of CT scans of 100 AAA patients, from which lumen geometries are extracted and reference CFD simulations with varying boundary conditions are obtained. Results show that the model generalizes well within the distribution, as well as to the external test set. Moreover, the model can accurately estimate hemodynamics across geometry remodelling and changes in boundary conditions. Furthermore, we find that a trained model can be applied to different artery tree topologies, where new and unseen branches are added during inference. Finally, we find that the model is to a large extent agnostic to mesh resolution. These results show the accuracy and generalisation of the proposed model, and highlight its potential to contribute to hemodynamic parameter estimation in clinical practice.
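The abstract names TAWSS and OSI without defining them. For reference, the definitions commonly used in the hemodynamics literature are given below, where $\boldsymbol{\tau}_w(t)$ is the wall shear stress vector at a surface point and $T$ is the length of one cardiac cycle. The notation is an assumption; the paper itself may use different symbols.

```latex
\mathrm{TAWSS} = \frac{1}{T}\int_0^T \lVert \boldsymbol{\tau}_w(t) \rVert \, dt,
\qquad
\mathrm{OSI} = \frac{1}{2}\left(1 - \frac{\left\lVert \int_0^T \boldsymbol{\tau}_w(t)\, dt \right\rVert}{\int_0^T \lVert \boldsymbol{\tau}_w(t) \rVert \, dt}\right)
```

Under these definitions OSI ranges from 0 (shear stress keeps a constant direction) to 0.5 (fully oscillatory flow with zero net shear), and both quantities follow directly from the transient WSS that the model is trained to estimate.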
[69] Bi-Level Optimization for Self-Supervised AI-Generated Face Detection
Mian Zou,Nan Zhong,Baosheng Yu,Yibing Zhan,Kede Ma
Main category: cs.CV
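TL;DR: Proposes a self-supervised method based on bi-level optimization for AI-generated face detection: the inner loop pretrains the encoder and the outer loop optimizes the pretext-task weights, significantly improving detection performance.
Details
Motivation: Traditional supervised methods depend on images synthesized by specific generators and generalize poorly to emerging generative techniques, so a more general self-supervised method is needed. Contribution: A self-supervised framework based on bi-level optimization that tunes the weights of multiple pretext tasks so that pretraining aligns more closely with the final goal of AI-generated face detection.
Method: The inner loop pretrains a vision encoder using only photographic face images; the outer loop optimizes the pretext-task weights. At detection time, a Gaussian mixture model or a lightweight classifier is used.
Result: Significantly outperforms existing methods in both one-class and binary classification settings and generalizes strongly to unseen generators.
Insight: By optimizing task weights, self-supervised learning can be better adapted to a specific downstream task without relying on synthetic data from particular generators.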
Abstract: AI-generated face detectors trained via supervised learning typically rely on synthesized images from specific generators, limiting their generalization to emerging generative techniques. To overcome this limitation, we introduce a self-supervised method based on bi-level optimization. In the inner loop, we pretrain a vision encoder only on photographic face images using a set of linearly weighted pretext tasks: classification of categorical exchangeable image file format (EXIF) tags, ranking of ordinal EXIF tags, and detection of artificial face manipulations. The outer loop then optimizes the relative weights of these pretext tasks to enhance the coarse-grained detection of manipulated faces, serving as a proxy task for identifying AI-generated faces. In doing so, it aligns self-supervised learning more closely with the ultimate goal of AI-generated face detection. Once pretrained, the encoder remains fixed, and AI-generated faces are detected either as anomalies under a Gaussian mixture model fitted to photographic face features or by a lightweight two-layer perceptron serving as a binary classifier. Extensive experiments demonstrate that our detectors significantly outperform existing approaches in both one-class and binary classification settings, exhibiting strong generalization to unseen generators.
[70] ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Yilei Jiang,Yaozhi Zheng,Yuxuan Wan,Jiaming Han,Qunzhong Wang,Michael R. Lyu,Xiangyu Yue
Main category: cs.CV
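TL;DR: ScreenCoder proposes a modular multimodal agent framework that converts UI designs into front-end code in stages (grounding, planning, generation), significantly improving the robustness and accuracy of code generation.
Details
Motivation: LLMs driven purely by text prompts struggle to capture spatial layout and visual design intent in UI-to-code conversion, whereas real UI development usually starts from visual sketches; a multimodal approach is therefore needed. Contribution: 1) A modular multi-agent framework that completes the task in stages; 2) A scalable data engine that generates large-scale image-code pairs; 3) Fine-tuning of an open-source VLM to improve UI understanding and code quality.
Method: The framework has three stages: 1) a grounding agent (a VLM detects UI components); 2) a planning agent (builds a hierarchical layout); 3) a generation agent (produces HTML/CSS code from prompts).
Result: Achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
Insight: A modular, multi-stage design is more robust and interpretable than end-to-end black-box methods, and the data engine helps address the shortage of supervised data.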
Abstract: Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
[71] CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models
Kedong Xiu,Saiqian Zhang
Main category: cs.CV
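TL;DR: CapRecover is a cross-modality feature inversion attack framework that recovers high-level semantic content (such as labels or captions) directly from the intermediate features of vision-language models, avoiding the blurry results of traditional image reconstruction.
Details
Motivation: Deploying vision-language models in split-DNN configurations can create privacy risks; images reconstructed by traditional methods are blurry and semantically ambiguous, so CapRecover targets semantic leakage directly. Contribution: Proposes the CapRecover framework, which recovers semantic content directly from intermediate features, and a noise-based protection method to prevent leakage.
Method: Recovers semantic content through a cross-modality inversion framework; the protection mechanism adds random noise to the intermediate features at each layer and removes it in the next layer.
Result: Achieves up to 92.71% Top-1 label accuracy on CIFAR-10, and captions generated on COCO2017 reach ROUGE-L scores of up to 0.52.
Insight: Deeper convolutional layers encode more semantic information than shallow layers, and adding noise is a simple and effective protection method.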
Abstract: As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations–with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud–there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs.
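The protection idea in the abstract (add random noise to intermediate features, then remove it in the next layer) can be illustrated with a small sketch. It assumes both sides can regenerate the same noise from a shared seed; that sharing mechanism, the Gaussian noise, and all names below are assumptions made for illustration and are not described in the abstract.

```python
import random

def make_noise(seed, size, scale=1.0):
    """Deterministic Gaussian noise vector derived from a seed (hypothetical scheme)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(size)]

def add_noise(features, seed, scale=1.0):
    """Mask intermediate features before they leave the current layer."""
    noise = make_noise(seed, len(features), scale)
    return [f + n for f, n in zip(features, noise)]

def remove_noise(noisy, seed, scale=1.0):
    """Regenerate the same noise in the next layer and subtract it."""
    noise = make_noise(seed, len(noisy), scale)
    return [f - n for f, n in zip(noisy, noise)]

features = [0.25, -1.5, 3.0]
protected = add_noise(features, seed=1234)     # what an observer of the features would see
restored = remove_noise(protected, seed=1234)  # what the next layer actually computes on
```

Because the noise is removed exactly (up to floating-point round-off), the model's computation is unchanged, which is consistent with the abstract's claim that the defense needs no additional training.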
[72] TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
Siqi Luo,Haoran Yang,Yi Xin,Mingyang Yi,Guangyang Wu,Guangtao Zhai,Xiaohong Liu
Main category: cs.CV
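TL;DR: TR-PTS proposes a task-driven parameter and token selection framework that improves the tuning efficiency and performance of large pre-trained models by selecting task-relevant parameters and dynamically merging redundant tokens.
Details
Motivation: Large pre-trained models perform well on vision tasks, but full fine-tuning demands too much computation and storage. Most existing efficient tuning methods are task-agnostic and do not exploit task-specific information, which leads to suboptimal efficiency and performance. Contribution: Proposes the TR-PTS framework, which combines task-relevant parameter selection with task-relevant token selection to improve both computational efficiency and accuracy.
Method: Uses the Fisher Information Matrix (FIM) to select task-relevant parameters layer by layer for fine-tuning, while dynamically preserving the most informative tokens and merging redundant ones.
Result: On the FGVC and VTAB-1k benchmarks, TR-PTS surpasses full fine-tuning by 3.40% and 10.35%, respectively, achieving state-of-the-art performance.
Insight: Task-driven parameter and token selection can markedly improve tuning efficiency while matching or exceeding the performance of full fine-tuning.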
Abstract: Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmarks including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively. The code is available at https://github.com/synbol/TR-PTS.
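A rough sketch of layer-wise, Fisher-based parameter selection follows. It uses the common empirical diagonal approximation of the Fisher Information Matrix (the mean of squared per-sample gradients) and keeps a fixed number of parameters per layer. The approximation, the fixed per-layer budget, the toy gradients, and the function names are assumptions; TR-PTS may estimate and allocate differently.

```python
def fisher_diagonal(per_sample_grads):
    """Empirical diagonal Fisher estimate: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def select_per_layer(grads_by_layer, keep):
    """For each layer, return the sorted indices of the `keep` highest-Fisher parameters.
    Only these indices would be fine-tuned; all other parameters stay frozen."""
    selected = {}
    for name, grads in grads_by_layer.items():
        fisher = fisher_diagonal(grads)
        ranked = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)
        selected[name] = sorted(ranked[:keep])
    return selected

# Toy per-sample gradients for two layers with four parameters each (made-up numbers).
grads_by_layer = {
    "block1": [[0.1, -2.0, 0.0, 0.5], [0.2, 1.0, 0.0, -0.5]],
    "block2": [[3.0, 0.1, 0.1, 0.0], [-3.0, 0.2, 0.4, 0.0]],
}
mask = select_per_layer(grads_by_layer, keep=2)
```

Selecting within each layer, rather than globally, guarantees that every layer keeps some trainable parameters even when gradient magnitudes differ widely across depth.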
[73] Viser: Imperative, Web-based 3D Visualization in Python
Brent Yi,Chung Min Kim,Justin Kerr,Gina Wu,Rebecca Feng,Anthony Zhang,Jonas Kulhanek,Hongsuk Choi,Yi Ma,Matthew Tancik,Angjoo Kanazawa
Main category: cs.CV
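TL;DR: Viser is a 3D visualization library for computer vision and robotics that provides easy-to-use, extensible Python tools for building 2D GUIs and 3D scenes.
Details
Motivation: Existing 3D visualization tools in the Python ecosystem lack ease of use and extensibility; Viser aims to fill this gap with a modern programming experience. Contribution: Viser provides a comprehensive set of 2D GUI and 3D scene primitives with an imperative-style API and a web-based viewer, which can be used with minimal setup or composed into specialized interfaces.
Method: Adopts an imperative-style API and a web-based viewer, emphasizing compatibility with modern programming patterns and workflows.
Result: Viser supports flexible construction and extension of visualizations for computer vision, robotics, and related fields.
Insight: Combining a web-based viewer with an imperative API can improve development efficiency and user experience.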
Abstract: We present Viser, a 3D visualization library for computer vision and robotics. Viser aims to bring easy and extensible 3D visualization to Python: we provide a comprehensive set of 3D scene and 2D GUI primitives, which can be used independently with minimal setup or composed to build specialized interfaces. This technical report describes Viser’s features, interface, and implementation. Key design choices include an imperative-style API and a web-based viewer, which improve compatibility with modern programming patterns and workflows.
[74] Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Kaining Ying,Henghui Ding,Guanquan Jie,Yu-Gang Jiang
Main category: cs.CV
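TL;DR: This paper introduces the OmniAVS dataset and the OISA model for referring audio-visual segmentation, incorporating diverse multimodal expressions and complex reasoning, and reports improved performance over existing methods.
Details
Motivation: Existing referring audio-visual segmentation (RAVS) falls short in integrating multimodal information and in deep understanding of audiovisual content. The paper aims to extend the boundaries of RAVS and facilitate future research. Contribution: 1. The OmniAVS dataset, containing 2,098 videos and 59,458 multimodal referring expressions; 2. Eight types of multimodal expressions that combine text, speech, sound, and visual cues; 3. The OISA model, which uses an MLLM for multimodal reasoning and fine-grained understanding.
Method: 1. Builds the OmniAVS dataset with multiple types of multimodal expressions; 2. Develops the OISA model, which uses an MLLM to interpret complex cues and perform reasoning-based segmentation.
Result: Experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
Insight: Combining multimodal expressions with complex reasoning enables more effective understanding and segmentation of audiovisual content and opens new directions for multimodal learning research.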
Abstract: Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting their presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce Omnimodal Instructed Segmentation Assistant (OISA), to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
cs.SE [Back]
[75] CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback
Qiushi Sun,Jinyang Gong,Lei Li,Qipeng Guo,Fei Yuan
Main category: cs.SE
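TL;DR: CodeEvo is a framework that synthesizes code data through iterative interaction between two LLM agents (a Coder and a Reviewer), combining compiler determinism with the generative flexibility of agents; models fine-tuned on its data significantly outperform baselines on code generation benchmarks.
Details
Motivation: High-quality instruction-code pairs are essential for training LLMs for code generation, but manual annotation is expensive and limited in scale. Existing methods lack rigorous data validation, which results in low-quality synthetic data. Contribution: Proposes the CodeEvo framework, which synthesizes code data through iterative Coder-Reviewer interaction, and introduces a hybrid feedback mechanism to ensure data quality.
Method: Two LLM agents interact iteratively: the Coder generates candidate code and test cases, and the Reviewer produces new instructions and feedback. Quality control combines compiler determinism with generative flexibility.
Result: Experiments show that models fine-tuned on CodeEvo data significantly outperform baseline methods across multiple code generation benchmarks.
Insight: Iterative interaction and a hybrid feedback mechanism are key to synthesizing high-quality code data.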
Abstract: Acquiring high-quality instruction-code pairs is essential for training Large Language Models (LLMs) for code generation. Manually curated data is expensive and inherently limited in scale, motivating the development of code-centric synthesis methods. Yet, current approaches either focus on augmenting existing code or rely on predefined heuristics, both lacking rigorous data validation, which results in synthetic data that is ungrounded, repetitive, or overly simplistic. Inspired by collaborative programming practices, we propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents: a Coder, which generates candidate code and test cases based on given instructions, and a Reviewer, which guides the synthesis process by producing new instructions and feedback. We further introduce a hybrid feedback mechanism that combines compiler determinism with the generative flexibility of agents, enabling automatic quality control throughout synthesis. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks with various difficulties. In-depth analyses further provide insights from multiple perspectives into effective code-centric data synthesis.
cs.SD [Back]
[76] Next Tokens Denoising for Speech Synthesis
Yanqing Liu,Ruiqing Xue,Chong Zhang,Yufei Liu,Gang Wang,Bohan Li,Yao Qian,Lei He,Shujie Liu,Sheng Zhao
Main category: cs.SD
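TL;DR: The paper proposes Dragon-FM, a text-to-speech (TTS) model that combines autoregressive (AR) modeling with flow matching; chunk-wise processing with parallel flow matching addresses the slow generation of AR models and the KV-caching difficulty of diffusion models, enabling efficient generation of high-quality audio.
Details
Motivation: Existing AR models cannot exploit future context and generate slowly, while diffusion models struggle with KV caching; a solution is needed that offers both global coherence and fast iterative denoising. Contribution: Proposes Dragon-FM, which unifies AR and flow matching, supports chunk-wise processing with parallel denoising, and demonstrates that continuous AR flow matching can predict discrete tokens.
Method: Processes audio codec tokens in chunks at 12.5 tokens per second; AR modeling across chunks provides global coherence, and parallel flow matching within each chunk provides fast iterative denoising, while using the KV cache across chunks and future context within each chunk.
Result: Experiments show that the model efficiently generates high-quality 48 kHz audio and is suited to zero-shot podcast generation.
Insight: Bridging continuous and discrete feature modeling offers a new approach to efficient audio generation, and the chunk-based design is particularly effective for long-form content.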
Abstract: While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact 12.5 tokens per second rate. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can utilize KV-cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also makes the proposed model particularly effective for generating extended content. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
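Two quantities in the abstract lend themselves to a short sketch: the token budget implied by a 12.5 tokens-per-second codec, and the way a finite scalar quantizer maps continuous predictions to discrete codes. The quantizer below follows one common formulation (bound each coordinate with tanh, then round to an integer level); the level counts and function names are assumptions, and Dragon-FM's actual codec may differ.

```python
import math

def tokens_for_duration(seconds, tokens_per_second=12.5):
    """Number of codec tokens needed to cover a clip at the stated token rate."""
    return math.ceil(seconds * tokens_per_second)

def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer values.
    Assumes odd level counts so the codes are symmetric around zero."""
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half   # squash into the open interval (-half, half)
        codes.append(int(round(bounded)))
    return codes

# A 10-minute podcast segment at 12.5 tokens per second.
n_tokens = tokens_for_duration(600)

# A continuous 3-dimensional prediction mapped to discrete codes (hypothetical level counts).
codes = fsq_quantize([0.0, 2.0, -1.5], levels=[5, 5, 3])
```

The low token rate is what keeps long-form generation tractable: ten minutes of audio corresponds to a few thousand AR positions rather than tens of thousands.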
cs.CR [Back]
[77] Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
Yiting Qu,Ziqing Yang,Yihan Ma,Michael Backes,Savvas Zannettou,Yang Zhang
Main category: cs.CR
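TL;DR: The paper examines the risk of using text-to-image diffusion models to generate hateful illusions and their ability to bypass content moderation, and proposes preliminary mitigation measures.
Details
Motivation: Advances in text-to-image diffusion models make it possible to generate hateful illusions (images that embed hidden hate messages), which current content moderation models struggle to detect. Contribution: A first step toward systematically investigating the risks of hateful illusion generation; the paper builds a dataset of 1,571 hateful illusions and reveals serious vulnerabilities in existing moderation models.
Method: Generates 1,860 optical illusions with Stable Diffusion and ControlNet, conditioned on 62 hate messages, and evaluates the detection ability of six moderation classifiers and nine vision language models (VLMs).
Result: Detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs, exposing the inability of their vision encoders to capture hidden messages.
Insight: Vision encoders focus mainly on surface-level image details and overlook hidden messages; detection methods and training strategies need to improve to address this risk.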
Abstract: Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions–visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address this risk, we explore preliminary mitigation measures and identify the most effective approaches from the perspectives of image transformations and training-level strategies.
cs.IR [Back]
[78] GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis
Ethan Frakes,Yinghui Wu,Roger H. French,Mengjie Li
Main category: cs.IR
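TL;DR: GeoOutageKG is a multimodal geospatiotemporal knowledge graph that integrates nighttime light satellite imagery, high-resolution power outage maps, and county-level outage reports to improve power outage detection, analysis, and prediction.
Details
Motivation: Existing outage data are mostly county-level reports with limited spatial resolution, while satellite imagery offers high spatial resolution but limited temporal granularity. Integrating these sources enables better analysis of outage patterns. Contribution: Proposes GeoOutageKG, which integrates multiple data sources into a modular, reusable semantic resource that supports multiresolution outage analysis.
Method: Aligns the source data with a purpose-built ontology, GeoOutageOnto, integrating over 10.6 million outage records, 300,000 nighttime light images, and 15,000 outage maps.
Result: The resulting knowledge graph covers outage records from 2014 to 2024 and nighttime light images from 2012 to 2024; its utility is demonstrated through multiresolution analysis.
Insight: Integrating multimodal data can improve the spatiotemporal resolution and accuracy of outage analysis and provides a new tool for disaster risk management.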
Abstract: Detecting, analyzing, and predicting power outages is crucial for grid risk assessment and disaster mitigation. Numerous outages occur each year, exacerbated by extreme weather events such as hurricanes. Existing outage data are typically reported at the county level, limiting their spatial resolution and making it difficult to capture localized patterns. However, they offer excellent temporal granularity. In contrast, nighttime light satellite image data provide significantly higher spatial resolution and enable a more comprehensive spatial depiction of outages, enhancing the accuracy of assessing the geographic extent and severity of power loss after disaster events. However, these satellite data are only available on a daily basis. Integrating spatiotemporal visual and time-series data sources into a unified knowledge representation can substantially improve power outage detection, analysis, and predictive reasoning. In this paper, we propose GeoOutageKG, a multimodal knowledge graph that integrates diverse data sources, including nighttime light satellite image data, high-resolution spatiotemporal power outage maps, and county-level time-series outage reports in the U.S. We describe our method for constructing GeoOutageKG by aligning source data with a developed ontology, GeoOutageOnto. Currently, GeoOutageKG includes over 10.6 million individual outage records spanning from 2014 to 2024, 300,000 NTL images spanning from 2012 to 2024, and 15,000 outage maps. GeoOutageKG is a novel, modular and reusable semantic resource that enables robust multimodal data integration. We demonstrate its use through multiresolution analysis of geospatiotemporal power outages.
[79] RecGPT Technical Report
Chao Yi,Dian Chen,Gaoyang Guo,Jiakai Tang,Jian Wu,Jing Yu,Sunhao Dai,Wen Chen,Wenjun Yang,Yuning Jiang,Zhujin Gao,Bo Zheng,Chi Li,Dimin Wang,Dixuan Wang,Fan Li,Fan Zhang,Haibin Chen,Haozhuang Liu,Jialin Zhu,Jiamang Wang,Jiawei Wu,Jin Cui,Ju Huang,Kai Zhang,Kan Liu,Lang Tian,Liang Rao,Longbin Li,Lulu Zhao,Mao Zhang,Na He,Peiyang Wang,Qiqi Huang,Tao Luo,Wenbo Su,Xiaoxiao He,Xin Tong,Xu Chen,Xunke Xi,Yang Li,Yaxuan Wu,Yeqiu Yang,Yi Hu,Yinnan Song,Yuchen Li,Yujie Luo,Yujin Yuan,Yuliang Yan,Zhengyang Wang,Zhibo Xiao,Zhixin Ma,Zile Zhou
Main category: cs.IR
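TL;DR: RecGPT introduces an intent-driven recommender system framework built on large language models (LLMs); a multi-stage training paradigm strengthens user intent modeling, and the system is fully deployed on the Taobao App, where it improves diversity, satisfaction, and conversions.
Details
Motivation: Current recommender systems rely heavily on historical co-occurrence patterns and log-fitting objectives without explicitly modeling user intent, which leads to overfitting to narrow historical preferences and reinforces filter bubbles and long-tail phenomena. Contribution: Proposes RecGPT, which integrates LLMs into the key stages of user interest mining, item retrieval, and explanation generation to make recommendation intent-centric.
Method: A multi-stage training paradigm (reasoning-enhanced pre-alignment and self-training evolution), guided by a Human-LLM cooperative judge system.
Result: Online experiments on the Taobao App show that RecGPT increases content diversity and user satisfaction and yields greater exposure and conversions for merchants and the platform.
Insight: LLM-driven, intent-centric recommendation can foster a more sustainable and mutually beneficial recommendation ecosystem.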
Abstract: Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users’ evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, while merchants and the platform gain greater exposure and conversions. These comprehensive improvements across all stakeholders validate that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.
eess.IV [Back]
[80] A Segmentation Framework for Accurate Diagnosis of Amyloid Positivity without Structural Images
Penghan Zhu,Shurui Mei,Shushan Chen,Xiaobo Chu,Shanbo He,Ziyi Liu
Main category: eess.IV
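TL;DR: The paper proposes a deep learning framework that uses PET images alone (no structural MRI or CT) for automated brain region segmentation and amyloid positivity classification. Built on a 3D U-Net and evaluated on 200 amyloid-PET scans, it reports accurate segmentation and classification.
Details
Motivation: Conventional diagnosis of amyloid positivity relies on coregistration with structural MRI or CT and on manual delineation, which is costly and time-consuming. The study aims to develop an automated PET-only method that reduces dependence on structural images. Contribution: A PET-only deep learning framework that achieves accurate brain region segmentation and amyloid positivity classification without structural images.
Method: A 3D U-Net with four layers of depth, trained and validated on 200 F18-florbetapir amyloid-PET scans (130/20/50 train/validation/test split). Segmentation is evaluated with the Dice similarity coefficient; classification is evaluated with accuracy and the area under the ROC curve (AUC).
Result: Dice coefficients range from 0.45 to 0.88 across 30 brain regions; the normalized root mean square error of PET uptake in clinically relevant regions is as low as 0.0011; amyloid positivity classification reaches an accuracy of 0.98 and an AUC of 0.99.
Insight: The method shows potential for efficient diagnosis when structural images are unavailable and provides a scalable, reliable automated analysis tool for clinical and research use. Future work may extend it to other PET tracers.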
Abstract: This study proposes a deep learning-based framework for automated segmentation of brain regions and classification of amyloid positivity using positron emission tomography (PET) images alone, without the need for structural MRI or CT. A 3D U-Net architecture with four layers of depth was trained and validated on a dataset of 200 F18-florbetapir amyloid-PET scans, with a 130/20/50 train/validation/test split. Segmentation performance was evaluated using Dice similarity coefficients across 30 brain regions, with scores ranging from 0.45 to 0.88, demonstrating high anatomical accuracy, particularly in subcortical structures. Quantitative fidelity of PET uptake within clinically relevant regions (precuneus, prefrontal cortex, gyrus rectus, and lateral temporal cortex) was assessed using normalized root mean square error, achieving values as low as 0.0011. Furthermore, the model achieved a classification accuracy of 0.98 for amyloid positivity based on regional uptake quantification, with an area under the ROC curve (AUC) of 0.99. These results highlight the model’s potential for integration into PET-only diagnostic pipelines, particularly in settings where structural imaging is not available. This approach reduces dependence on coregistration and manual delineation, enabling scalable, reliable, and reproducible analysis in clinical and research applications. Future work will focus on clinical validation and extension to diverse PET tracers including C11 PiB and other F18 labeled compounds.
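For readers unfamiliar with the segmentation metric reported above, the Dice similarity coefficient between a predicted and a reference mask is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flat binary masks; the toy masks and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details from the paper.

```python
def dice(pred, target):
    """Dice similarity coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumed here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy example: 3 predicted voxels, 3 reference voxels, 2 in common.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice(pred, target)
```

A score of 1.0 means perfect overlap and 0.0 means none, so the reported range of 0.45 to 0.88 indicates that overlap quality varies considerably across the 30 regions.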
[81] Whole-brain Transferable Representations from Large-Scale fMRI Data Improve Task-Evoked Brain Activity Decoding
Yueh-Po Peng,Vincent K. M. Cheung,Li Su
Main category: eess.IV
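TL;DR: STDA-SwiFT is a transformer-based model that learns transferable representations from large-scale fMRI data via spatial-temporal divided attention and self-supervised contrastive learning, substantially improving decoding of task-evoked brain activity.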
Details
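Motivation: The high dimensionality, low signal-to-noise ratio, and limited within-subject data of fMRI make decoding mental states from task-evoked activity challenging. The study aims to improve decoding by exploiting large-scale datasets and recent advances in computer vision. Contribution: 1. Proposes STDA-SwiFT, which combines spatial-temporal divided attention with contrastive learning; 2. Shows that pretrained representations substantially improve downstream decoding; 3. Demonstrates the contribution of a memory-efficient attention mechanism and of functionally relevant pretraining data.
Method: A transformer with spatial-temporal divided attention is pretrained with self-supervised contrastive learning on data from 995 subjects in the Human Connectome Project (HCP) to obtain transferable representations.
Result: The model substantially improves decoding of task-evoked activity across multiple sensory and cognitive domains, even with minimal data preprocessing.
Insight: Large-scale pretraining and a memory-efficient attention mechanism that allows larger receptive fields drive the decoding gains, and the functional relevance of the pretraining data matters when fine-tuning on small samples.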
Abstract: A fundamental challenge in neuroscience is to decode mental states from brain activity. While functional magnetic resonance imaging (fMRI) offers a non-invasive approach to capture brain-wide neural dynamics with high spatial precision, decoding from fMRI data, particularly from task-evoked activity, remains challenging due to its high dimensionality, low signal-to-noise ratio, and limited within-subject data. Here, we leverage recent advances in computer vision and propose STDA-SwiFT, a transformer-based model that learns transferable representations from large-scale fMRI datasets via spatial-temporal divided attention and self-supervised contrastive learning. Using pretrained voxel-wise representations from 995 subjects in the Human Connectome Project (HCP), we show that our model substantially improves downstream decoding performance of task-evoked activity across multiple sensory and cognitive domains, even with minimal data preprocessing. We demonstrate performance gains from larger receptive fields afforded by our memory-efficient attention mechanism, as well as the impact of functional relevance in pretraining data when fine-tuning on small samples. Our work showcases transfer learning as a viable approach to harness large-scale datasets to overcome challenges in decoding brain activity from fMRI data.
[82] Towards Blind Bitstream-corrupted Video Recovery via a Visual Foundation Model-driven Framework
Tianyi Liu,Kejun Wu,Chen Cai,Yi Wang,Kim-Hui Yap,Lap-Pui Chau
Main category: eess.IV
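TL;DR: Proposes a blind bitstream-corrupted video recovery framework driven by visual foundation models. A Detect Any Corruption (DAC) model and a Corruption-aware Feature Completion (CFC) module improve recovery quality without manually annotated corruption regions.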
Details
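Motivation: Bitstream corruption causes significant pixel-domain degradation. Existing methods require manual annotation of corrupted regions and still struggle to recover high-quality content, which motivates an annotation-free, adaptive recovery framework. Contribution: 1. The first blind bitstream-corrupted video recovery framework; 2. The DAC model and CFC module, which use a visual foundation model to improve corruption localization and recovery; 3. VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, which suppresses artifacts and enhances informative residuals.
Method: 1. A framework driven by visual foundation models; 2. The DAC model localizes corruption by combining foundation-model priors with bitstream and corruption knowledge; 3. The CFC module adaptively processes residual contributions based on high-level corruption understanding; 4. The MoRE structure provides hierarchical feature augmentation and high-level coordination.
Result: The method achieves strong performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence.
Insight: Priors from visual foundation models benefit video recovery; adaptive handling of residuals and high-level corruption understanding are key.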
Abstract: Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult as part of the local residual information in corrupted frames may mislead feature completion and successive content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.
[83] trAIce3D: A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images
MohammadAmin Alamalhoda,Arsalan Firoozi,Alessandro Venturino,Sandra Siegert
Main category: eess.IV
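TL;DR: trAIce3D is a prompt-driven deep learning architecture combining a 3D U-Net with vision transformers for precise segmentation of microglial somas and branches in large-scale 3D microscopy images.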
Details
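Motivation: Cell morphology carries essential information about cell function, but existing methods struggle to segment microglia (immune-associated cells involved in neurodegenerative diseases): they focus on cell bodies, have difficulty with overlapping structures, perform poorly on noisy images, and require per-dataset hyperparameter tuning or tedious semi-automated work. Contribution: A two-stage deep learning architecture, trAIce3D, that combines a 3D U-Net, vision transformers, and prompt-based segmentation to improve the accuracy and generalization of microglia segmentation.
Method: 1. Stage one detects somas with a 3D U-Net whose encoder contains vision transformers, applied with a sliding window over the whole image; 2. Stage two adds cross-attention blocks in the skip connections and refines each soma and its branches, using the soma coordinates as a prompt and a 3D window around the target cell as input. Training has two phases: self-supervised soma segmentation, followed by prompt-based branch segmentation initialized from the first phase's weights.
Result: Trained and evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly improves segmentation accuracy and generalization, enabling scalable analysis of complex cellular morphologies.
Insight: Although optimized for microglia, the architecture can extend to other intricate cell types such as neurons and astrocytes, broadening its use in neurobiological research.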
Abstract: The shape of a cell contains essential information about its function within the biological system. Segmenting these structures from large-scale 3D microscopy images is challenging, limiting clinical insights especially for microglia, immune-associated cells involved in neurodegenerative diseases. Existing segmentation methods mainly focus on cell bodies, struggle with overlapping structures, perform poorly on noisy images, require hyperparameter tuning for each new dataset, or rely on tedious semi-automated approaches. We introduce trAIce3D, a deep-learning architecture designed for precise microglia segmentation, capturing both somas and branches. It employs a two-stage approach: first, a 3D U-Net with vision transformers in the encoder detects somas using a sliding-window technique to cover the entire image. Then, the same architecture, enhanced with cross-attention blocks in skip connections, refines each soma and its branches by using soma coordinates as a prompt and a 3D window around the target cell as input. Training occurs in two phases: self-supervised Soma Segmentation, followed by prompt-based Branch Segmentation, leveraging pre-trained weights from the first phase. Trained and evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly improves segmentation accuracy and generalization, enabling scalable analysis of complex cellular morphologies. While optimized for microglia, its architecture can extend to other intricate cell types, such as neurons and astrocytes, broadening its impact on neurobiological research.
eess.SP [Back]
[84] Exploration of Low-Cost but Accurate Radar-Based Human Motion Direction Determination
Weicheng Gao
Main category: eess.SP
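TL;DR: The paper explores a low-cost but accurate radar-based method for determining human motion direction, combining feature augmentation with a lightweight Vision Transformer-CNN hybrid model.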
Details
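Motivation: The motion direction angle affects the width of the micro-Doppler spectrum, so knowing it provides useful prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and direction determination simultaneously. Contribution: A low-cost but accurate radar-based human motion direction determination (HMDD) method that combines feature augmentation with a lightweight and fast Vision Transformer-CNN hybrid model.
Method: Radar-based human gait DTMs are generated first, feature augmentation is then performed with a feature linking model, and direction is determined by a lightweight Vision Transformer-CNN hybrid model.
Result: The method's effectiveness is verified on an open-source dataset, and the code is publicly released.
Insight: A hybrid of Vision Transformer and CNN can be lightweight and fast on radar data, and the low-cost design supports practical use.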
Abstract: This work is completed on a whim after discussions with my junior colleague. The motion direction angle affects the micro-Doppler spectrum width; thus, determining the human motion direction can provide important prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and motion determination simultaneously. In response, a low-cost but accurate radar-based human motion direction determination (HMDD) method is explored in this paper. In detail, the radar-based human gait DTMs are first generated, and then feature augmentation is achieved using a feature linking model. Subsequently, the HMDD is implemented through a lightweight and fast Vision Transformer-Convolutional Neural Network hybrid model structure. The effectiveness of the proposed method is verified on an open-source dataset. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Low-Cost-Accurate-Radar-Based-Human-Motion-Direction-Determination.
cs.AI [Back]
[85] CoEx – Co-evolving World-model and Exploration
Minsoo Kim,Seung-won Hwang
Main category: cs.AI
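TL;DR: CoEx is a hierarchical agent architecture in which hierarchical state abstraction lets LLM planning co-evolve with a dynamically updated world model, addressing the planning errors that arise when agents rely on a static world model.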
Details
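Motivation: Existing LLM agents rely on a static world model acquired during pretraining and fail to assimilate new observations into dynamic updates of that model, so plans drift away from the true state of the world. Contribution: The CoEx architecture, which uses hierarchical state abstraction and a neurosymbolic belief state so that LLM planning and the world model co-evolve.
Method: A hierarchical agent uses LLM reasoning to orchestrate dynamic plans made of subgoals, and continuously folds subgoal experiences into a persistent world model represented as a neurosymbolic belief state (textual inferences plus code-based symbolic memory).
Result: Across agent scenarios with rich environments and complex tasks, including ALFWorld, PDDL, and Jericho, CoEx outperforms existing agent paradigms in planning and exploration.
Insight: A dynamically updated world model matters for an agent's long-horizon planning and adaptability, and combining textual and symbolic memory makes the belief state more expressive and reliable.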
Abstract: Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.
[86] The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
Han Jiang,Pengda Wang,Xiaoyuan Yi,Xing Xie,Ziang Xiao
Main category: cs.AI
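TL;DR: The paper analyzes the interdisciplinary engagement between AI research and psychology, finds that psychology is cited in AI research but sometimes misapplied, and offers guidance for deeper collaboration.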
Details
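Motivation: Psychology offers rich theories and insights for designing and understanding AI systems, yet interdisciplinary integration in current research is incomplete, and psychology theories are sometimes misapplied. Contribution: By analyzing 1,006 LLM-related AI papers and the 2,544 psychology publications they cite, the paper builds a comprehensive map of AI-psychology engagement and offers guidance for more effective incorporation.
Method: The study analyzes LLM-related papers published in premier AI venues between 2023 and 2025, together with the psychology publications they cite, to identify citation patterns, misapplications, and underexplored areas.
Result: The paper identifies common types of misapplication of psychology theories in AI research and highlights psychology domains that remain underexplored, providing guidance for more effective integration.
Insight: Deeper collaboration between AI research and psychology has considerable potential, but it requires understanding and applying psychology theories more systematically to avoid misapplication.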
Abstract: Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored. We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems.
cs.LG [Back]
[87] CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs
Yangshu Yuan,Heng Chen,Xinyi Jiang,Christian Ng,Kexin Qiu
Main category: cs.LG
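TL;DR: CIMR is a contextualized iterative multimodal reasoning framework that improves how large vision-language models (LVLMs) handle complex multimodal instructions, using iterative self-correction and multimodal feedback for more robust task execution.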
Details
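Motivation: Current LLMs and LVLMs struggle with complex, multi-step multimodal instructions that require logical reasoning, dynamic feedback integration, and iterative self-correction. Contribution: The CIMR framework, which adds a context-aware iterative reasoning and self-correction module and dynamically fuses multimodal features.
Method: CIMR works in two stages: initial reasoning and response generation, followed by iterative refinement using parsed multimodal feedback. A dynamic fusion module integrates textual, visual, and contextual features at each step.
Result: On the Multi-modal Action Planning (MAP) dataset, CIMR reaches 91.5% accuracy, ahead of GPT-4V (89.2%), LLaVA-1.5 (78.5%), MiniGPT-4 (75.3%), and InstructBLIP (72.8%).
Insight: Iterative self-correction and dynamic multimodal fusion are key to performance on complex tasks, and the approach could extend to broader multimodal reasoning scenarios.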
Abstract: The rapid advancement of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) has enhanced our ability to process and generate human language and visual information. However, these models often struggle with complex, multi-step multi-modal instructions that require logical reasoning, dynamic feedback integration, and iterative self-correction. To address this, we propose CIMR: Contextualized Iterative Multimodal Reasoning, a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR operates in two stages: initial reasoning and response generation, followed by iterative refinement using parsed multi-modal feedback. A dynamic fusion module deeply integrates textual, visual, and contextual features at each step. We fine-tune LLaVA-1.5-7B on the Visual Instruction Tuning (VIT) dataset and evaluate CIMR on the newly introduced Multi-modal Action Planning (MAP) dataset. CIMR achieves 91.5% accuracy, outperforming state-of-the-art models such as GPT-4V (89.2%), LLaVA-1.5 (78.5%), MiniGPT-4 (75.3%), and InstructBLIP (72.8%), demonstrating the efficacy of its iterative reasoning and self-correction capabilities in complex tasks.
[88] Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning
Afshin Khadangi,Amir Sartipi,Igor Tchappi,Ramin Bahmani,Gilbert Fridgen
Main category: cs.LG
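TL;DR: RLDP uses reinforcement learning to adapt gradient clipping and noise injection during differentially private optimization, improving model utility and the efficiency of privacy-budget use.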
Details
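Motivation: Existing differentially private optimizers such as DP-SGD use hard-coded, global control parameters, which forces practitioners either to over-spend privacy budget or to accept weaker models; the parameters should adapt to the optimization process. Contribution: The RLDP framework, the first to cast DP optimization as a closed-loop control problem solved with deep reinforcement learning, allocating the privacy budget dynamically.
Method: A soft actor-critic (SAC) hyper-policy is trained online during fine-tuning and selects per-parameter gradient-clipping thresholds and the magnitude of injected Gaussian noise.
Result: Across GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP reduces perplexity by 1.3-30.5% (mean 5.4%), improves downstream utility by 5.6% on average, and reaches each baseline's final utility after only 13-43% of the gradient-update budget (mean speed-up 71%).
Insight: Reinforcement learning can allocate the privacy budget dynamically and balance privacy against utility, under the same ($\epsilon$, $\delta$)-DP guarantee and with equal or lower susceptibility to membership-inference and canary-extraction attacks.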
Abstract: The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora including healthcare. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per-parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline’s final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same ($\epsilon$, $\delta$)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
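The two control knobs the abstract refers to, the clipping threshold and the noise magnitude, can be seen in a minimal sketch of one standard DP-SGD gradient aggregation. This is the generic baseline mechanism with a single global threshold, not RLDP's learned per-parameter policy; the function name `dp_sgd_gradient` and its arguments are illustrative assumptions.

```python
import numpy as np

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation: clip each sample's gradient, sum, add noise, average.

    per_sample_grads: array of shape (batch, dim), one gradient row per sample.
    clip_norm:        L2 clipping threshold (a fixed global value in this sketch).
    noise_multiplier: Gaussian noise std expressed as a multiple of clip_norm.
    """
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down only the rows whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_sample_grads.shape[0]

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # norm 5.0, clipped to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng))
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

With `noise_multiplier=0.0` the result is just the mean of the clipped gradients, which makes the clipping effect easy to inspect; a method such as the one described would instead choose `clip_norm` and the noise scale adaptively during training.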
[89] Theoretical Analysis of Relative Errors in Gradient Computations for Adversarial Attacks with CE Loss
Yunrui Yu,Hang Su,Cheng-zhong Xu,Zhizhong Su,Jun Zhu
Main category: cs.LG
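TL;DR: The paper gives a theoretical analysis of relative errors in gradient computation for cross-entropy-based adversarial attacks and proposes the T-MIFPE loss to reduce the impact of floating-point errors.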
Details
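Motivation: Relative errors in gradient computation induced by floating-point arithmetic cause CE-based adversarial attacks to suffer from overestimation, and a systematic theoretical analysis of the problem has been missing. Contribution: The first comprehensive analysis of floating-point computation errors across four attack scenarios, and the T-MIFPE loss designed to minimize their impact.
Method: Theoretically characterizes the behavior of relative numerical errors under different attack conditions and proposes the T-MIFPE loss with an optimal scaling factor $T = t^*$.
Result: On MNIST, CIFAR-10, and CIFAR-100, T-MIFPE outperforms existing loss functions, including CE, C&W, DLR, and MIFPE, in attack potency and robustness evaluation accuracy.
Insight: Floating-point underflow and rounding are key contributors to unstable gradient computation, and a loss designed around them improves attack strength.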
Abstract: Gradient-based adversarial attacks using the Cross-Entropy (CE) loss often suffer from overestimation due to relative errors in gradient computation induced by floating-point arithmetic. This paper provides a rigorous theoretical analysis of these errors, conducting the first comprehensive study of floating-point computation errors in gradient-based attacks across four distinct scenarios: (i) unsuccessful untargeted attacks, (ii) successful untargeted attacks, (iii) unsuccessful targeted attacks, and (iv) successful targeted attacks. We establish theoretical foundations characterizing the behavior of relative numerical errors under different attack conditions, revealing previously unknown patterns in gradient computation instability, and identify floating-point underflow and rounding as key contributors. Building on this insight, we propose the Theoretical MIFPE (T-MIFPE) loss function, which incorporates an optimal scaling factor $T = t^*$ to minimize the impact of floating-point errors, thereby enhancing the accuracy of gradient computation in adversarial attacks. Extensive experiments on the MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that T-MIFPE outperforms existing loss functions, including CE, C&W, DLR, and MIFPE, in terms of attack potency and robustness evaluation accuracy.
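The underflow effect named in the abstract can be reproduced with a small float32 sketch. It assumes the usual softmax cross-entropy gradient with respect to the logits, softmax(z) minus the one-hot label, and uses an arbitrary logit scale of 10 purely for illustration; this is not the T-MIFPE loss and the scale is not the optimal $t^*$ from the paper.

```python
import numpy as np

def ce_logit_gradient(logits, label):
    """Gradient of softmax cross-entropy w.r.t. the logits: softmax(z) - onehot(label)."""
    shifted = logits - logits.max()   # standard max-shift for numerical stability
    exp = np.exp(shifted)             # stays float32 when the input is float32
    probs = exp / exp.sum()
    onehot = np.zeros_like(probs)
    onehot[label] = 1.0
    return probs - onehot

# A confidently classified sample: large logit gap, everything in float32.
z = np.array([200.0, 0.0, 0.0], dtype=np.float32)

g_raw = ce_logit_gradient(z, label=0)
# exp(-200) underflows to 0 in float32, so every gradient entry is exactly 0.
print(g_raw)

T = np.float32(10.0)  # illustrative scale only, chosen by hand
g_scaled = ce_logit_gradient(z / T, label=0)
# exp(-20) is representable, so the non-label entries carry a usable signal.
print(g_scaled)
```

The first gradient gives an attack no direction to follow, which is one way a model can look more robust than it is; rescaling the logits restores a nonzero gradient for the non-label classes.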
[90] FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression
Kuan-Ting Tu,Po-Hsien Yu,Yu-Syuan Tseng,Shao-Yi Chien
Main category: cs.LG
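TL;DR: The paper proposes FGFP, a network compression framework built on fractional Gaussian filters (FGFs) and Adaptive Unstructured Pruning (AUP), which substantially reduces model size while maintaining high accuracy.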
Details
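Motivation: DNN workloads are heavy for edge devices, so the paper aims to cut model parameters through compression while preserving performance, enabling efficient deployment on resource-constrained hardware. Contribution: 1. Fractional Gaussian filters (FGFs), which combine fractional-order differential calculus with the Gaussian function and need only seven parameters per kernel. 2. Grünwald-Letnikov fractional derivatives to approximate the fractional-order differential equation and reduce computational complexity. 3. Adaptive Unstructured Pruning (AUP) for higher compression ratios.
Method: 1. Construct FGFs by combining fractional-order differentiation with the Gaussian function. 2. Approximate the fractional-order operations with Grünwald-Letnikov fractional derivatives. 3. Apply AUP to prune the model and raise the compression ratio further.
Result: On CIFAR-10, ResNet-20 loses only 1.52% accuracy while the model size shrinks by 85.2%; on ImageNet2012, ResNet-50 loses only 1.63% accuracy while the model size shrinks by 69.1%.
Insight: Fractional Gaussian filters work well for network compression, and combining them with pruning sharply reduces parameters while keeping accuracy high, which suits deployment on edge devices.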
Abstract: Network compression techniques have become increasingly important in recent years because the loads of Deep Neural Networks (DNNs) are heavy for edge devices in real-world applications. While many methods compress neural network parameters, deploying these models on edge devices remains challenging. To address this, we propose the fractional Gaussian filter and pruning (FGFP) framework, which integrates fractional-order differential calculus and Gaussian function to construct fractional Gaussian filters (FGFs). To reduce the computational complexity of fractional-order differential operations, we introduce Grünwald-Letnikov fractional derivatives to approximate the fractional-order differential equation. The number of parameters for each kernel in FGF is minimized to only seven. Beyond the architecture of Fractional Gaussian Filters, our FGFP framework also incorporates Adaptive Unstructured Pruning (AUP) to achieve higher compression ratios. Experiments on various architectures and benchmarks show that our FGFP framework outperforms recent methods in accuracy and compression. On CIFAR-10, ResNet-20 achieves only a 1.52% drop in accuracy while reducing the model size by 85.2%. On ImageNet2012, ResNet-50 achieves only a 1.63% drop in accuracy while reducing the model size by 69.1%.
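For readers unfamiliar with the Grünwald-Letnikov approximation mentioned above, here is a minimal sketch of the generic formula, which approximates the order-$\alpha$ derivative at $x$ by $h^{-\alpha} \sum_k (-1)^k \binom{\alpha}{k} f(x - kh)$. How FGFP applies it to Gaussian kernels is not described in the abstract, so the function names and the test signal here are illustrative assumptions only.

```python
import numpy as np

def gl_weights(alpha, n_terms):
    """Grünwald-Letnikov weights w_k = (-1)^k * binom(alpha, k), built by recursion."""
    w = np.empty(n_terms)
    w[0] = 1.0
    for k in range(1, n_terms):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

def gl_derivative(f_samples, alpha, h):
    """Approximate the order-alpha derivative at the last sample.

    f_samples[i] is assumed to hold f(i * h) on a uniform grid starting at 0.
    """
    w = gl_weights(alpha, len(f_samples))
    # Pair w_k with f(x - k*h), i.e. walk the samples backwards from the last one.
    return (w * f_samples[::-1]).sum() / h ** alpha

h = 0.01
x = np.arange(101) * h        # grid on [0, 1]
f = x ** 2

print(gl_weights(0.5, 3))         # [ 1.    -0.5   -0.125]
print(gl_derivative(f, 1.0, h))   # backward difference of x^2 at x = 1, about 1.99
print(gl_derivative(f, 0.5, h))   # half-order derivative of x^2 at x = 1
```

For $\alpha = 1$ the weights reduce to $[1, -1, 0, \dots]$, so the formula collapses to the familiar backward difference, which is a quick sanity check on the recursion.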
[91] Tapping into the Black Box: Uncovering Aligned Representations in Pretrained Neural Networks
Maciej Satkiewicz
Main category: cs.LG
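TL;DR: The paper argues that ReLU networks learn an implicit linear model and shows that a simple modification to the backward pass reveals high-resolution, input- and target-specific features, suggesting that neural networks rely on interpretable learned patterns.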
Details
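Motivation: Neural networks are usually treated as black boxes that lack interpretability. The author aims to expose the implicit linear model they learn in order to improve interpretability and dependability. Contribution: A simple modification to the backward pass that yields gradients called excitation pullbacks, which approximately recover the implicit linear model and show high resolution and interpretability on vision models.
Method: The backward pass is modified so that the decision boundary of the implicit linear model in a ReLU network is approximately pulled back to the input space; the resulting gradients are the excitation pullbacks.
Result: On a number of popular ImageNet-pretrained architectures, the method produces high-resolution, input- and target-specific features with strong perceptual alignment, suggesting that the networks rely on learned interpretable patterns.
Insight: The patterns a network learns implicitly can be recovered after training and are interpretable, which may inform knowledge discovery and the development of dependable AI systems.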
Abstract: In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into. We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with a certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment on a number of popular ImageNet-pretrained deep architectures. This strongly suggests that neural networks do, in fact, rely on learned interpretable patterns that can be recovered after training. Thus, our findings may have profound implications for knowledge discovery and the development of dependable artificial systems.