Table of Contents
- cs.CL [Total: 25]
- cs.CV [Total: 45]
- cs.AI [Total: 1]
- cs.MM [Total: 1]
- cs.SI [Total: 1]
- cs.CY [Total: 1]
- cs.CR [Total: 2]
- cs.IR [Total: 2]
- eess.IV [Total: 1]
- cs.GR [Total: 1]
- cs.SD [Total: 1]
- cs.RO [Total: 1]
- cs.LG [Total: 2]
- cs.HC [Total: 1]
cs.CL
[1] Evaluating LLMs’ Reasoning Over Ordered Procedural Steps
Adrita Anika,Md Messal Monem Miah
Main category: cs.CL
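TL;DR: The paper studies how well large language models (LLMs) reason over tasks with ordered steps. It evaluates them on reconstructing the order of shuffled recipe steps and finds that performance drops as sequences get longer and the steps are more heavily shuffled.
Details
Motivation: To study LLMs’ ability on ordered-step reasoning, especially in domains that require strict sequencing (such as recipes), because the limitations of current models there remain underexplored.
Contribution: An evaluation framework that combines several ranking and sequence-alignment metrics (Kendall’s Tau, NLCS, and NED) to assess how well LLMs reconstruct ordered steps.
Method: Evaluates several LLMs on a recipe dataset under zero-shot and few-shot settings, quantifying performance with ranking-quality and sequence-alignment metrics.
Result: Model performance declines as sequence length increases and as the input steps are more severely shuffled, exposing the limits of LLMs on long and highly disordered inputs.
Insight: Current LLMs fall short on complex ordered tasks, especially with long sequences and heavily shuffled inputs, so their sequential reasoning needs further improvement.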
Abstract: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall’s Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
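The three metrics named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes steps are distinct identifiers, uses the tau-a form of Kendall’s Tau, normalizes LCS by the gold length, and normalizes edit distance by the longer sequence. The paper may normalize differently.

```python
from itertools import combinations


def kendall_tau(pred, gold):
    """Kendall's Tau (tau-a) between two orderings of the same distinct steps."""
    pos = {step: i for i, step in enumerate(pred)}
    pairs = list(combinations(gold, 2))  # each pair (a, b) has a before b in gold
    if not pairs:
        return 1.0
    concordant = sum(1 for a, b in pairs if pos[a] < pos[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(pred, gold):
    """LCS length normalized by the gold length (assumed normalization)."""
    return lcs_length(pred, gold) / max(len(gold), 1)


def edit_distance(a, b):
    """Levenshtein distance over step identifiers."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ned(pred, gold):
    """Edit distance normalized by the longer sequence (assumed normalization)."""
    return edit_distance(pred, gold) / max(len(pred), len(gold), 1)
```

For a four-step recipe where the model swaps the first two steps, `kendall_tau([1, 0, 2, 3], [0, 1, 2, 3])` is 2/3, `nlcs` is 0.75, and `ned` is 0.5. The three numbers penalize the same single swap differently, which is why the abstract describes them as complementary.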
[2] SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection
Jingqing Wang,Jiaxing Shang,Rong Xu,Fei Hao,Tianjin Huang,Geyong Min
Main category: cs.CL
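TL;DR: The paper proposes SARC, a framework that uses sentiment-augmented deep clustering to identify the different roles users play, and reports improved fake news detection performance.
Details
Motivation: Existing methods typically treat sentiment features as auxiliary signals and overlook role differences among users in the spread of fake news, which makes subtle patterns hard to capture.
Contribution: Proposes the SARC framework, which combines sentiment-augmented deep clustering with fake news detection, and designs a joint optimization objective covering both role clustering and fake news detection.
Method: (1) Generate user features with a BiGRU and an attention mechanism, together with sentiment encoding; (2) automatically categorize user roles with a differentiable deep clustering module; (3) train with a joint optimization objective that integrates role clustering and fake news detection.
Result: On two benchmark datasets (RumourEval-19 and Weibo-comp), SARC outperforms the baseline models on all metrics.
Insight: Combining sentiment information with user roles reveals fake news propagation patterns more effectively, and the joint objective further improves model performance.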
Abstract: Fake news detection has been a long-standing research focus in social networks. Recent studies suggest that incorporating sentiment information from both news content and user comments can enhance detection performance. However, existing approaches typically treat sentiment features as auxiliary signals, overlooking role differentiation, that is, the same sentiment polarity may originate from users with distinct roles, thereby limiting their ability to capture nuanced patterns for effective detection. To address this issue, we propose SARC, a Sentiment-Augmented Role Clustering framework which utilizes sentiment-enhanced deep clustering to identify user roles for improved fake news detection. The framework first generates user features through joint comment text representation (with BiGRU and Attention mechanism) and sentiment encoding. It then constructs a differentiable deep clustering module to automatically categorize user roles. Finally, unlike existing approaches which take fake news label as the unique supervision signal, we propose a joint optimization objective integrating role clustering and fake news detection to further improve the model performance. Experimental results on two benchmark datasets, RumourEval-19 and Weibo-comp, demonstrate that SARC achieves superior performance across all metrics compared to baseline models. The code is available at: https://github.com/jxshang/SARC.
[3] Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng,Vidhisha Balachandran,Chan Young Park,Faeze Brahman,Sachin Kumar
Main category: cs.CL
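TL;DR: The paper tackles instruction prioritization in large language models (LLMs) by framing it as a reasoning task, with the aim of making models more reliable and controllable.
Details
Motivation: As LLMs take on a growing role in high-stakes decision-making, reconciling competing instructions from different sources (such as developers and users) becomes critical.
Contribution: Introduces the VerIH dataset for training models to resolve instruction priority, and shows that lightweight reinforcement learning transfers general reasoning ability to instruction prioritization.
Method: Reframes instruction hierarchy resolution as a reasoning task and trains models on VerIH to prioritize system instructions over user instructions.
Result: The fine-tuned models improve consistently on instruction-following and instruction-hierarchy benchmarks and generalize to safety-critical settings, such as resisting jailbreak and prompt injection attacks.
Insight: Reasoning over instruction hierarchies offers a practical path to more reliable and controllable LLMs, and model behavior can be adjusted in a controlled way by updating the system prompt.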
Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first “think” about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.
[4] POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios
Tingyue Yang,Junchi Yao,Yuhui Guo,Chang Liu
Main category: cs.CL
TL;DR: POLIS-Bench is presented as the first systematic evaluation suite for LLMs in bilingual policy scenarios. It includes an up-to-date bilingual corpus, scenario-grounded tasks, and a dual-metric evaluation method. The evaluation shows an advantage for reasoning models, and lightweight fine-tuning achieves strong performance at low cost.
Details
Motivation: Existing benchmarks do not meet the needs of governmental bilingual policy scenarios. A systematic, up-to-date evaluation tool is needed to assess how LLMs understand and apply policy.
Contribution: Proposes POLIS-Bench, comprising an up-to-date bilingual corpus, three scenario-grounded tasks, and a dual-metric evaluation framework, and uses it to show the advantage of reasoning models.
Method: Builds a bilingual policy corpus, designs three tasks (Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment), and evaluates with two metrics: semantic similarity and accuracy rate.
Result: Reasoning models perform best. The lightweight fine-tuned POLIS series models match or surpass proprietary models on multiple policy subtasks at low cost.
Insight: Compliance tasks are the most challenging, and lightweight models can be applied efficiently in policy scenarios through targeted fine-tuning.
Abstract: We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks (Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment) to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.
[5] GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models
Hari Mohan Pandey,Anshul Gupta,Subham Sarkar,Minakshi Tomer,Schneider Johannes,Yan Gong
Main category: cs.CL
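TL;DR: GEMMA-SQL is a lightweight text-to-SQL model built on the Gemma 2B architecture. With efficient fine-tuning and multiple prompting strategies it performs well on the SPIDER benchmark, outperforming several existing methods.
Details
Motivation: To remove the need for specialized programming knowledge when querying structured databases, with a lightweight and efficient solution.
Contribution: Proposes GEMMA-SQL and its instruction-tuned variant, GEMMA-SQL Instruct, balancing small model size against performance.
Method: Combines resource-efficient iterative fine-tuning with multiple prompting strategies (including few-shot learning), trained and evaluated on the SPIDER benchmark.
Result: On SPIDER, GEMMA-SQL Instruct achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming baselines such as IRNet.
Insight: Effective prompt design and instruction tuning can significantly boost performance while keeping the model scalable and adaptable.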
Abstract: Text-to-SQL systems enable users to interact with structured databases using natural language, eliminating the need for specialized programming knowledge. In this work, we introduce GEMMA-SQL, a lightweight and efficient text-to-SQL model built upon the open-source Gemma 2B architecture. Unlike many large language models (LLMs), GEMMA-SQL is fine-tuned in a resource-efficient, iterative manner and can be deployed on low-cost hardware. Leveraging the SPIDER benchmark for training and evaluation, GEMMA-SQL combines multiple prompting strategies, including few-shot learning, to enhance SQL query generation accuracy. The instruction-tuned variant, GEMMA-SQL Instruct, achieves 66.8% Test-Suite accuracy and 63.3% Exact Set Match accuracy, outperforming several state-of-the-art baselines such as IRNet, RYANSQL, and CodeXDavinci. The proposed approach demonstrates that effective prompt design and targeted instruction tuning can significantly boost performance while maintaining high scalability and adaptability. These results position GEMMA-SQL as a practical, open-source alternative for robust and accessible text-to-SQL systems.
[6] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation
Dmytro Vitel,Anshuman Chhabra
Main category: cs.CL
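TL;DR: The paper challenges the earlier view that the first (embedding) layers of a language model are best suited for estimating training-sample influence. It shows that middle attention layers are more effective, and proposes new cross-layer aggregation methods and a new evaluation metric, NDR.
Details
Motivation: Current influence estimation methods for large language models (LLMs) are often restricted to a subset of layers to limit computation. Prior work concluded that the first layers are the most suitable, but that conclusion lacks reliable theoretical support. The paper tests the hypothesis and proposes improvements.
Contribution: (1) Shows that the "cancellation effect" is unreliable and that middle attention layers are better suited to influence estimation; (2) proposes new cross-layer aggregation methods (such as ranking and voting); (3) designs a new evaluation metric, the Noise Detection Rate (NDR).
Method: (1) Theoretical and empirical tests of the cancellation effect; (2) a comparison of different layers; (3) several cross-layer aggregation strategies; (4) the NDR metric for evaluating influence-score efficacy.
Result: Experiments show that middle attention layers outperform the first layers, that the new aggregation methods significantly improve performance, and that NDR has strong predictive capability.
Insight: Influence estimation should not be limited to a model's first layers; middle layers may carry more important information.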
Abstract: Identifying how training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we propose theoretical and empirical evidence demonstrating how the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.
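To make the aggregation question concrete, the sketch below contrasts standard averaging with a rank-based and a vote-based alternative. It is an illustration under assumptions, not the paper's method: `layer_scores` is a hypothetical list of per-layer influence scores (one score per training sample), the rank rule is a simple average of per-layer ranks, and the vote rule counts top-k appearances.

```python
def mean_aggregate(layer_scores):
    """Standard averaging of raw influence scores across layers."""
    n = len(layer_scores[0])
    return [sum(layer[i] for layer in layer_scores) / len(layer_scores) for i in range(n)]


def rank_aggregate(layer_scores):
    """Average per-layer rank (0 = most influential in a layer); lower is more influential."""
    n = len(layer_scores[0])
    totals = [0.0] * n
    for layer in layer_scores:
        order = sorted(range(n), key=lambda j: -layer[j])
        for rank, i in enumerate(order):
            totals[i] += rank
    return [t / len(layer_scores) for t in totals]


def vote_aggregate(layer_scores, k):
    """Number of layers in which each sample lands in that layer's top-k."""
    n = len(layer_scores[0])
    votes = [0] * n
    for layer in layer_scores:
        for i in sorted(range(n), key=lambda j: -layer[j])[:k]:
            votes[i] += 1
    return votes
```

With `layer_scores = [[0.1, 0.2, 100.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]`, averaging selects sample 2 because one layer has a much larger score scale, while both the rank and vote rules select sample 0, which two of the three layers agree on. Scale sensitivity of this kind is one plausible reason rank and vote rules can behave differently from averaging.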
[7] Learning to reason about rare diseases through retrieval-augmented agents
Ha Young Kim,Jun Li,Ana Beatriz Solana,Carolin M. Pirkl,Benedikt Wiestler,Julia A. Schnabel,Cosmin I. Bercea
Main category: cs.CL
TL;DR: The paper proposes RADAR (Retrieval Augmented Diagnostic Reasoning Agents), a system that addresses weak AI performance on rare disease diagnosis caused by data scarcity. It retrieves from external medical knowledge to strengthen diagnostic reasoning.
Details
Motivation: Rare diseases form the long tail of medical imaging data, and conventional AI models perform poorly on them because training data is scarce. Inspired by how radiologists consult case reports and literature, the authors aim to improve diagnosis through retrieval augmentation.
Contribution: (1) Proposes RADAR, which strengthens diagnostic reasoning by retrieving from an external medical knowledge base (case reports and literature) without additional training. (2) Designs it as a model-agnostic reasoning module that integrates with diverse large language models. (3) Achieves up to a 10.2% performance gain on the NOVA dataset.
Method: (1) Embed case reports and literature with sentence transformers. (2) Index them with FAISS for efficient similarity search. (3) The AI agent retrieves relevant clinical evidence to guide diagnostic decisions.
Result: On the NOVA dataset of 280 rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements for open-source models such as DeepSeek.
Insight: Retrieval-augmented reasoning improves rare disease diagnosis accuracy and also provides literature-grounded, interpretable explanations, showing its potential for low-prevalence conditions in medical imaging.
Abstract: Rare diseases represent the long tail of medical imaging, where AI models often fail due to the scarcity of representative training data. In clinical workflows, radiologists frequently consult case reports and literature when confronted with unfamiliar findings. Following this line of reasoning, we introduce RADAR, Retrieval Augmented Diagnostic Reasoning Agents, an agentic system for rare disease detection in brain MRI. Our approach uses AI agents with access to external medical knowledge by embedding both case reports and literature using sentence transformers and indexing them with FAISS to enable efficient similarity search. The agent retrieves clinically relevant evidence to guide diagnostic decision making on unseen diseases, without the need for additional training. Designed as a model-agnostic reasoning module, RADAR can be seamlessly integrated with diverse large language models, consistently improving their rare pathology recognition and interpretability. On the NOVA dataset comprising 280 distinct rare diseases, RADAR achieves up to a 10.2% performance gain, with the strongest improvements observed for open-source models such as DeepSeek. Beyond accuracy, the retrieved examples provide interpretable, literature-grounded explanations, highlighting retrieval-augmented reasoning as a powerful paradigm for low-prevalence conditions in medical imaging.
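The retrieval step can be pictured as nearest-neighbor search over document embeddings. The sketch below is a pure-Python stand-in, not RADAR's code: it replaces the sentence-transformer encoder with hand-written toy vectors and replaces the FAISS index with a brute-force cosine-similarity scan. The document ids and vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, corpus, k):
    """Brute-force cosine-similarity search over (doc_id, embedding) pairs."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), doc_id)
        for doc_id, vec in corpus
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Toy 2-d "embeddings"; a real system would encode text with an embedding model.
corpus = [
    ("case_report_1", [1.0, 0.0]),
    ("review_2", [0.0, 1.0]),
    ("case_report_3", [0.9, 0.1]),
]
evidence = top_k([1.0, 0.05], corpus, k=2)  # ids of the two closest documents
```

Here `evidence` is `["case_report_1", "case_report_3"]`. In an agentic setup such as the one described, the retrieved texts would then be placed in the LLM's context to ground the diagnosis; a vector index serves the same purpose as the scan above but scales to large corpora.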
[8] Surprisal reveals diversity gaps in image captioning and different scorers change the story
Nikolai Ilinykh,Simon Dobnik
Main category: cs.CL
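TL;DR: The paper proposes a new metric based on surprisal variance to quantify diversity in image captioning. Human captions show higher diversity under one scorer, but the conclusion reverses when the scorer is changed.
Details
Motivation: Existing image captioning work focuses mostly on the accuracy of generated content and neglects diversity. The authors introduce surprisal variance to evaluate caption diversity more fully.
Contribution: (1) A surprisal-variance-based diversity measure; (2) evidence that the choice of scorer strongly affects conclusions about diversity; (3) the recommendation that diversity evaluation report results under several scorers.
Method: Computes surprisal variance for human and model-generated captions with a caption-trained n-gram language model and with a general-language model, and compares diversity under each scorer.
Result: Under the n-gram scorer, human captions show roughly twice the surprisal variance of model captions; under the general-language model, the pattern reverses.
Insight: The choice of scorer can completely invert the outcome of a diversity evaluation, so future work should validate with multiple scorers.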
Abstract: We quantify linguistic diversity in image captioning with surprisal variance - the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions, thus, robust diversity evaluation must report surprisal under several scorers.
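Surprisal variance as described is simple to compute once a scorer supplies token probabilities. The sketch below is a minimal illustration under assumptions: the scorer is a toy add-one-smoothed unigram model rather than the paper's n-gram or general-language model, surprisal is measured in bits, and the variance is the population variance pooled over all tokens in the caption set.

```python
import math
from collections import Counter
from statistics import pvariance


def unigram_scorer(training_captions):
    """Toy add-one-smoothed unigram LM; stands in for a real scorer."""
    counts = Counter(tok for cap in training_captions for tok in cap.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def surprisal_variance(captions, prob):
    """Variance of token-level surprisal (-log2 p) pooled over a caption set."""
    surprisals = [-math.log2(prob(tok)) for cap in captions for tok in cap.split()]
    return pvariance(surprisals)
```

Swapping `prob` for a different scorer changes every surprisal value, so the same captions can rank differently in variance. That dependence on the scorer is the effect the abstract reports.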
[9] Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models
Chenxi Liu,Junjie Liang,Yuqi Jia,Bochuan Cao,Yang Bai,Heng Huang,Xun Chen
Main category: cs.CL
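TL;DR: The ERPO framework reactivates training signals by encouraging exploration on residual prompts, improving the reasoning ability of language models and outperforming strong baselines.
Details
Motivation: As language models train longer and scale larger, many training prompts become residual prompts (those with zero-variance rewards) that provide no training signal, which reduces diversity and hurts effectiveness.
Contribution: Proposes ERPO, which adaptively adjusts the sampling temperature to encourage more diverse reasoning traces on residual prompts and so reactivates their training signals.
Method: ERPO maintains a history tracker for each prompt and adaptively raises the sampling temperature for residual prompts that previously produced all-correct responses, which introduces incorrect responses that revive the training signal.
Result: Experiments on the Qwen2.5 series show that ERPO outperforms baselines on multiple mathematical reasoning benchmarks.
Insight: Exploring residual prompts for more diverse responses can improve training effectiveness and reasoning ability.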
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
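The sketch below illustrates why residual prompts carry no signal and how a per-prompt tracker could raise the temperature. It is a rough reading of the abstract, not ERPO itself: the standardized group advantage is the usual GRPO-style form, while the `TemperatureTracker` class, its linear schedule, its cap, and its reset rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class TemperatureTracker:
    """Hypothetical per-prompt history that raises temperature after all-correct groups."""

    def __init__(self, base=1.0, step=0.1, max_temp=1.5):
        self.base, self.step, self.max_temp = base, step, max_temp
        self.all_correct_count = defaultdict(int)

    def temperature(self, prompt_id):
        raised = self.base + self.step * self.all_correct_count[prompt_id]
        return min(raised, self.max_temp)

    def update(self, prompt_id, rewards):
        if all(r == 1 for r in rewards):
            self.all_correct_count[prompt_id] += 1
        else:
            # Assumption: reset once the prompt yields a training signal again.
            self.all_correct_count[prompt_id] = 0
```

A group with rewards `[1, 1, 1, 1]` gives advantages of exactly zero, so the prompt contributes no gradient. After such a group, the tracker returns a higher temperature for that prompt, and a later group like `[1, 0, 1, 1]` has nonzero advantages again.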
[10] Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs
Preetum Nakkiran,Arwen Bradley,Adam Goliński,Eugene Ndiaye,Michael Kirchhof,Sinead Williamson
Main category: cs.CL
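TL;DR: The paper studies calibration of large language models (LLMs) at the semantic level and finds that base LLMs are naturally semantically calibrated on open-domain question answering, even though they were never explicitly trained for it.
Details
Motivation: LLMs often lack meaningful confidence estimates for their outputs, and existing work has focused mainly on token-level calibration. The paper asks whether LLMs can assess confidence at the semantic level.
Contribution: A theoretical framework explaining how semantic calibration emerges as a byproduct of next-token prediction, experimental validation of semantic calibration in base LLMs, and the finding that RL fine-tuning and chain-of-thought reasoning break this calibration.
Method: Introduces a general definition of calibration, “B-calibration”, parameterized by a choice of equivalence classes, and validates semantic calibration through theoretical analysis and experiments.
Result: Experiments show that (1) base LLMs are semantically calibrated on question-answering tasks; (2) RL instruction-tuning breaks this calibration; and (3) chain-of-thought reasoning also breaks calibration.
Insight: Semantic calibration emerges naturally from next-token prediction, but some training and inference methods (such as RL fine-tuning) can weaken it.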
Abstract: Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of “B-calibration,” which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
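One way to picture a sampling-based notion of semantic confidence: sample several answers, collapse them into semantic equivalence classes, and take the share of the most common class as the confidence. The sketch below follows that reading and pairs it with the standard binned expected calibration error. It is an assumption-laden illustration, not the paper's exact definition; in particular, `to_class` is a hypothetical stand-in for whatever semantic-equivalence function is used.

```python
from collections import Counter


def semantic_confidence(samples, to_class):
    """Return (modal semantic class, fraction of samples that fall in it)."""
    classes = Counter(to_class(s) for s in samples)
    label, count = classes.most_common(1)[0]
    return label, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            acc = sum(1 for _, ok in items if ok) / len(items)
            ece += len(items) / total * abs(acc - avg_conf)
    return ece
```

For example, with `to_class = lambda s: s.strip().strip(".").lower()`, the samples `["Paris", "paris.", "Lyon", "Paris "]` collapse to the class `"paris"` with confidence 0.75. A model is well calibrated in this sense when answers given with confidence 0.75 are correct about 75% of the time.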
[11] Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs
Matthew Bozoukov,Matthew Nguyen,Shubkarman Singh,Bart Bussmann,Patrick Leask
Main category: cs.CL
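TL;DR: The paper investigates the minimal conditions and mechanisms of behavioral self-awareness in large language models (LLMs). With low-rank adapter (LoRA) fine-tuning, a single rank-1 adapter is enough to induce self-awareness, and the behavior can be captured by a single steering vector in activation space.
Details
Motivation: Behavioral self-awareness in LLMs may pose safety risks (for example, concealing true abilities), so understanding its minimal conditions and mechanisms matters for understanding and controlling model behavior.
Contribution: Identifies minimal conditions for behavioral self-awareness, characterizes it as a domain-specific linear feature, and shows that it can be easily induced and modulated.
Method: Controlled experiments that fine-tune instruction-tuned LLMs with LoRA to study how self-awareness is induced and represented.
Result: Self-awareness can be induced with a single rank-1 LoRA adapter, the behavioral effect is largely captured by a single steering vector in activation space, and the representations are independent across tasks.
Insight: Behavioral self-awareness is a localized, domain-specific property that can be produced by a simple linear intervention, which suggests new approaches to model safety and behavior control.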
Abstract: Recent studies have revealed that LLMs can exhibit behavioral self-awareness: the ability to accurately describe or predict their own learned behaviors without explicit supervision. This capability raises safety concerns as it may, for example, allow models to better conceal their true abilities during evaluation. We attempt to characterize the minimal conditions under which such self-awareness emerges, and the mechanistic processes through which it manifests. Through controlled finetuning experiments on instruction-tuned LLMs with low-rank adapters (LoRA), we find: (1) that self-awareness can be reliably induced using a single rank-1 LoRA adapter; (2) that the learned self-aware behavior can be largely captured by a single steering vector in activation space, recovering nearly all of the fine-tune’s behavioral effect; and (3) that self-awareness is non-universal and domain-localized, with independent representations across tasks. Together, these findings suggest that behavioral self-awareness emerges as a domain-specific, linear feature that can be easily induced and modulated.
[12] SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents
Jaehoon Lee,Sohyun Kim,Wanggeun Park,Geon Lee,Seungkyung Kim,Minyoung Lee
Main category: cs.CL
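TL;DR: The paper introduces SDS KoPub VDR, the first large-scale visual document retrieval benchmark for Korean public documents, covering a range of complex visual elements and cross-modal reasoning tasks.
Details
Motivation: Existing visual document retrieval benchmarks largely overlook non-English languages and the structural complexity of official publications, and this gap needs filling.
Contribution: Proposes SDS KoPub VDR, the first visual document retrieval benchmark for Korean public documents, with real-world documents and a rigorously verified evaluation set of query-page-answer triples.
Method: The dataset is built on 361 real-world documents (40,781 pages) with 600 human-verified query-page-answer triples, and is evaluated on text-only and multimodal retrieval tasks.
Result: Experiments show substantial performance gaps for state-of-the-art models in multimodal scenarios, especially those requiring cross-modal reasoning.
Insight: The dataset gives a clear direction for advancing multimodal AI in complex document intelligence and underlines the importance of cross-modal reasoning.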
Abstract: Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this critical gap, we introduce SDS KoPub VDR, the first large-scale, publicly available benchmark for retrieving and understanding Korean public documents. The benchmark is built upon a corpus of 361 real-world documents (40,781 pages), including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a challenging and reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent a rigorous human verification and refinement process to ensure factual accuracy and contextual relevance. The queries span six major public domains and are systematically categorized by the reasoning modality required: text-based, visual-based (e.g., chart interpretation), and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks that reflect distinct retrieval paradigms: (1) text-only retrieval, which measures a model’s ability to locate relevant document pages based solely on textual signals, and (2) multimodal retrieval, which assesses retrieval performance when visual features (e.g., tables, charts, and layouts) are jointly leveraged alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models. As a foundational resource, SDS KoPub VDR not only enables rigorous and fine-grained evaluation across textual and multimodal retrieval tasks but also provides a clear roadmap for advancing multimodal AI in complex, real-world document intelligence.
[13] BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models
Chandra Vamsi Krishna Alla,Harish Naidu Gaddam,Manohar Kommi
Main category: cs.CL
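TL;DR: BudgetMem proposes a selective memory architecture that uses learned memory policies and feature-based salience scoring to reduce the high cost of long-context processing in language models, cutting memory requirements substantially while largely preserving performance.
Details
Motivation: Large language models face high computational and memory costs when processing long contexts, which limits deployment in resource-constrained environments. Existing methods extend the context window, but the cost remains prohibitive.
Contribution: Proposes BudgetMem, a selective memory architecture that uses learned memory policies and salience scoring (entity density, TF-IDF, and others) to choose what to store under a strict budget, saving substantial memory while preserving performance.
Method: Combines selective memory policies with feature-based salience scoring (entity density, TF-IDF, discourse markers, position bias), using learned gating mechanisms and BM25 sparse retrieval for efficient information access.
Result: In experiments on 700 question-answer pairs, BudgetMem loses only 1.0% F1 on long documents while saving 72.4% memory relative to baseline RAG. Its benefits grow with document length.
Insight: Through selective memory and efficient retrieval, BudgetMem offers a practical approach to long-context processing in resource-constrained environments, lowering the barrier to deploying advanced language understanding.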
Abstract: Large Language Models (LLMs) face significant computational and memory constraints when processing long contexts, despite growing demand for applications requiring reasoning over extensive documents, multi-session dialogues, and book length texts. While recent advances have extended context windows to 100K-1M tokens, such approaches incur prohibitive costs for resource constrained deployments. We propose BudgetMem, a novel memory augmented architecture that learns what to remember rather than remembering everything. Our system combines selective memory policies with feature based salience scoring (entity density, TF-IDF, discourse markers, position bias) to decide which information merits storage under strict budget constraints. Unlike existing retrieval augmented generation (RAG) systems that store all chunks, BudgetMem employs learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access. Through comprehensive experiments on 700 question answer pairs across short (237 tokens) and long (5K-10K tokens) documents with Llama-3.2-3B-Instruct, we demonstrate that BudgetMem achieves remarkable results on long documents: only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG. We validate our approach through budget sensitivity analysis (testing 7 budget ratios), naive baseline comparisons, and document length analysis, showing that BudgetMem’s benefits increase with document length. Our work provides a practical pathway for deploying capable long context systems on modest hardware, democratizing access to advanced language understanding capabilities.
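The idea of scoring chunks for salience and storing only those that fit a budget can be sketched as follows. This is a simplified illustration and not BudgetMem: it swaps the learned gating mechanism for fixed, made-up weights, approximates entity density by the share of capitalized words, uses an illustrative list of discourse markers, and assumes every chunk is non-empty.

```python
import math
from collections import Counter

DISCOURSE_MARKERS = {"however", "therefore", "because", "finally"}  # illustrative list


def salience(chunks):
    """Score each chunk with four hand-weighted features (weights are made up)."""
    tokenized = [c.lower().split() for c in chunks]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(chunks)
    scores = []
    for i, (chunk, toks) in enumerate(zip(chunks, tokenized)):
        raw = chunk.split()
        entity_density = sum(w[0].isupper() for w in raw) / len(raw)  # crude proxy
        tf = Counter(toks)
        tfidf = sum(tf[t] / len(toks) * math.log(n / doc_freq[t]) for t in tf)
        markers = sum(t.strip(",.") in DISCOURSE_MARKERS for t in toks) / len(toks)
        position = 1.0 if i in (0, n - 1) else 0.0  # favor first and last chunk
        scores.append(0.4 * entity_density + 0.3 * tfidf + 0.2 * markers + 0.1 * position)
    return scores


def select_under_budget(chunks, budget_ratio):
    """Keep the highest-salience chunks up to the budget, preserving document order."""
    keep = max(1, int(len(chunks) * budget_ratio))
    scores = salience(chunks)
    top = sorted(range(len(chunks)), key=lambda i: -scores[i])[:keep]
    return [chunks[i] for i in sorted(top)]
```

With a budget ratio of 0.5, half of the chunks are stored and the rest are dropped before retrieval, which is where the memory saving comes from. A retriever such as BM25 would then search only the stored chunks.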
[14] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent
Yu Li,Lehui Li,Qingmin Liao,Fengli Xu,Yong Li
Main category: cs.CL
TL;DR: The paper proposes AgentExpt, an LLM-based framework for automating experiment design. By incorporating the collective perception embedded in dataset and baseline citations, it improves the reliability and interpretability of experiment design.
Details
Motivation: Existing approaches to automated experiment design suffer from limited data coverage and over-reliance on content similarity, overlooking experimental suitability. The paper uses collective perception to address these problems.
Contribution: (1) An automated data-collection pipeline that links a large number of papers to the datasets and baselines they used; (2) a collective-perception-enhanced retriever and a reasoning-augmented reranker.
Method: (1) Automatically collect relations among papers, datasets, and baselines; (2) represent datasets and baselines by combining self-descriptions with citation contexts; (3) fine-tune an embedding model and an LLM to improve retrieval and reranking.
Result: The method improves over the strongest prior baseline by 5.85% in Recall@20 and 8.30% in HitRate@5. The curated dataset covers 85% of the datasets and baselines used at top AI conferences over the past five years.
Insight: Using collective perception can improve automated experiment design and make it more interpretable. LLMs serve as a useful automation tool in scientific research.
Abstract: Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to a surge of research interest in developing LLM agents to facilitate scientific discovery. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases models toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
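For readers unfamiliar with the two reported metrics, the sketch below gives their common definitions for a single query. The paper's exact variants are not stated in the abstract, so treat these as the usual textbook forms; the item names in the usage note are made up.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0
```

With `ranked = ["a", "b", "c", "d"]` and `relevant = {"b", "d", "z"}`, `recall_at_k(ranked, relevant, 2)` is 1/3 and `hit_rate_at_k(ranked, relevant, 2)` is 1.0. Reported scores are typically these per-query values averaged over all test queries.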
[15] UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian
Mykyta Syromiatnikov,Victoria Ruvinskaya
Main category: cs.CL
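TL;DR: The paper introduces UA-Code-Bench, a new open-source benchmark for evaluating code generation and competitive programming problem solving in Ukrainian. It covers 500 problems of varying difficulty and evaluates 13 leading models on code correctness, performance, and related aspects.
Details
Motivation: Most current benchmarks focus on English or on simple language tasks, and there is no systematic evaluation of code generation in low-resource languages such as Ukrainian.
Contribution: Proposes UA-Code-Bench, with 500 programming problems in Ukrainian, together with a comprehensive evaluation and analysis of models.
Method: Evaluates 13 models on 500 problems from the Eolymp platform using a one-shot prompt, verifies correctness against hidden tests, and analyzes performance across difficulty levels, solution uniqueness, and efficiency.
Result: Even top models (such as OpenAI o3 and GPT-5) solve only half of the problems, highlighting the challenge of code generation in a low-resource language.
Insight: Competitive programming benchmarks are an effective way to evaluate LLM code generation, especially for low-resource languages and multilingual research.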
Abstract: Evaluating the real capabilities of large language models in low-resource languages still represents a challenge, as many existing benchmarks focus on widespread tasks translated from English or evaluate only simple language understanding. This paper introduces UA-Code-Bench, a new open-source benchmark established for a thorough evaluation of language models’ code generation and competitive programming problem-solving abilities in Ukrainian. The benchmark comprises 500 problems from the Eolymp platform, evenly distributed across five complexity levels from very easy to very hard. A diverse set of 13 leading proprietary and open-source models, generating Python solutions based on a one-shot prompt, was evaluated via the dedicated Eolymp environment against hidden tests, ensuring code correctness. The obtained results reveal that even top-performing models, such as OpenAI o3 and GPT-5, solve only half of the problems, highlighting the challenge of code generation in low-resource natural language. Furthermore, this research presents a comprehensive analysis of performance across various difficulty levels, as well as an assessment of solution uniqueness and computational efficiency, measured by both elapsed time and memory consumption of the generated solutions. In conclusion, this work demonstrates the value of competitive programming benchmarks in evaluating large language models, especially in underrepresented languages. It also paves the way for future research on multilingual code generation and reasoning-enhanced models. The benchmark, data parsing, preparation, code generation, and evaluation scripts are available at https://huggingface.co/datasets/NLPForUA/ua-code-bench.
[16] Reasoning-Guided Claim Normalization for Noisy Multilingual Social Media Posts
Manan Sharma,Arya Suneesh,Manish Jain,Pawan Kumar Rajpoot,Prasanna Devadiga,Bharatdeep Hazarika,Ashish Shrivastava,Kishan Gurumurthy,Anshuman B Suresh,Aditya U Baliga
Main category: cs.CL
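TL;DR: The paper proposes a reasoning-guided normalization method that turns claims in multilingual social media posts into clear, verifiable statements, and shows how cross-lingual transfer is achieved with training on English data only.
Details
Motivation: Detecting multilingual misinformation on social media is challenging because posts are noisy and multilingual, so they need to be normalized before they can be verified.
Contribution: The main contribution is robust cross-lingual transfer through systematic decomposition of posts (Who, What, Where, When, Why, How), with substantial gains across several languages.
Method: Fine-tunes Qwen3-14B with LoRA, combined with intra-post deduplication, token-level recall filtering for semantic alignment, and retrieval-augmented few-shot learning at inference time.
Result: METEOR scores range from 41.16 for English to 15.21 for Marathi, a 41.3% relative improvement over baseline configurations, with third place on the English leaderboard and fourth place for Dutch and Punjabi.
Insight: The approach generalizes well to Romance and Germanic languages while maintaining semantic coherence across diverse linguistic structures.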
Abstract: We address claim normalization for multilingual misinformation detection - transforming noisy social media posts into clear, verifiable statements across 20 languages. The key contribution demonstrates how systematic decomposition of posts using Who, What, Where, When, Why and How questions enables robust cross-lingual transfer despite training exclusively on English data. Our methodology incorporates finetuning Qwen3-14B using LoRA with the provided dataset after intra-post deduplication, token-level recall filtering for semantic alignment and retrieval-augmented few-shot learning with contextual examples during inference. Our system achieves METEOR scores ranging from 41.16 (English) to 15.21 (Marathi), securing third rank on the English leaderboard and fourth rank for Dutch and Punjabi. The approach shows 41.3% relative improvement in METEOR over baseline configurations and substantial gains over existing methods. Results demonstrate effective cross-lingual generalization for Romance and Germanic languages while maintaining semantic coherence across diverse linguistic structures.
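The abstract mentions token-level recall filtering but does not define it. The following is a minimal sketch of one plausible reading, assuming the filter keeps a training pair only when enough of the normalized claim's tokens also occur in the source post. The function names, the regex tokenizer, and the `min_recall` threshold are all hypothetical, not taken from the paper.

```python
import re


def token_recall(claim: str, post: str) -> float:
    """Fraction of distinct claim tokens that also occur in the post."""
    def tokenize(text: str) -> set:
        # Lowercased word tokens; a stand-in for whatever tokenizer the authors used.
        return set(re.findall(r"\w+", text.lower()))

    claim_tokens = tokenize(claim)
    post_tokens = tokenize(post)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & post_tokens) / len(claim_tokens)


def filter_pairs(pairs, min_recall=0.5):
    """Keep (post, claim) pairs whose claim is sufficiently grounded in the post.

    min_recall is a hypothetical threshold chosen for illustration.
    """
    return [(post, claim) for post, claim in pairs
            if token_recall(claim, post) >= min_recall]
```

A pair whose claim introduces many words absent from the post would be dropped under this reading, which is one way to keep the training targets semantically aligned with their inputs.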
[17] On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class
P. Bilha Githinji,Aikaterini Meilliou,Peiwu Qin
Main category: cs.CL
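TL;DR: The paper examines large language models on medical text simplification, comparing the instruction-tuned Mistral 24B with the reasoning-augmented QWen2.5 32B, and finds that Mistral strikes a better balance between readability and fidelity.
Details
Motivation: As the public increasingly consumes medical information digitally, scalable automatic text simplification is needed, but existing methods struggle to balance readability against fidelity.
Contribution: 1. Compares two classes of LLMs, instruction-tuned and reasoning-augmented; 2. Finds that Mistral balances readability and fidelity better; 3. Provides heuristics for selecting metrics when benchmarking simplification.
Method: 1. Uses the Mistral 24B and QWen2.5 32B models; 2. Evaluates readability and fidelity with metrics including SARI and BERTScore; 3. Runs a correlation analysis over 21 metrics.
Result: Mistral reaches a mean SARI of 42.46 and a BERTScore of 0.91, while QWen's BERTScore is statistically significantly lower at 0.89, suggesting that the instruction-tuned model is better suited to text simplification.
Insight: Instruction-tuned models show an advantage in medical text simplification; five readability indices are strongly redundant; lexical support is a primary domain-adaptation issue.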
Abstract: The increasing health-seeking behavior and digital consumption of biomedical information by the general public necessitate scalable solutions for automatically adapting complex scientific and technical documents into plain language. Automatic text simplification solutions, including advanced large language models, however, continue to face challenges in reliably arbitrating the tension between optimizing readability performance and ensuring preservation of discourse fidelity. This report empirically assesses the performance of two major classes of general-purpose LLMs, demonstrating their linguistic capabilities and foundational readiness for the task compared to a human benchmark. Using a comparative analysis of the instruction-tuned Mistral 24B and the reasoning-augmented QWen2.5 32B, we identify a potential architectural advantage in the instruction-tuned LLM. Mistral exhibits a tempered lexical simplification strategy that enhances readability across a suite of metrics and the simplification-specific formula SARI (mean 42.46), while preserving human-level discourse with a BERTScore of 0.91. QWen also attains enhanced readability performance, but its operational strategy shows a disconnect in balancing between readability and accuracy, reaching a statistically significantly lower BERTScore of 0.89. Additionally, a comprehensive correlation analysis of 21 metrics spanning readability, discourse fidelity, content safety, and underlying distributional measures for mechanistic insights, confirms strong functional redundancies among five readability indices. This empirical evidence tracks baseline performance of the evolving LLMs for the task of text simplification, identifies the instruction-tuned Mistral 24B for simplification, provides necessary heuristics for metric selection, and points to lexical support as a primary domain-adaptation issue for simplification.
[18] Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models
Cong-Thanh Do,Rama Doddipatla,Kate Knill
Main category: cs.CL
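TL;DR: Examines how effective Chain-of-Thought (CoT) data is in knowledge distillation for improving the reasoning capability of smaller models, with experiments supporting its role.
Details
Motivation: CoT prompting is widely used to improve the reasoning of large language models (LLMs), but its effect when distilling into smaller models is not yet clear, so the role of CoT in knowledge distillation needs to be verified.
Contribution: Experimentally verifies the role of CoT in white-box knowledge distillation and shows that it improves smaller models' natural language reasoning and understanding.
Method: Runs white-box knowledge distillation with LLMs from the Qwen and Llama2 families, using CoT data from the CoT-Collection dataset; the distilled models are evaluated on the BBH benchmark.
Result: CoT improves the effectiveness of white-box KD: the distilled models achieve better average performance on BBH natural language reasoning and understanding tasks.
Insight: CoT is useful not only directly at inference time but also in knowledge distillation, where it helps transfer reasoning capability from larger LLMs to smaller ones, which is relevant to model compression.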
Abstract: Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analysing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.
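White-box KD means the student is trained against the teacher's output distributions rather than only its sampled text. The abstract does not state which divergence the authors use, so the sketch below assumes the common forward KL divergence, KL(teacher || student), at a single token position; `kd_kl_loss` and the `temperature` parameter are illustrative names, not the paper's.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position.

    Assumption: forward KL is one common white-box KD objective;
    the paper may use a different divergence.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a CoT-based setup of this kind, the loss would be averaged over the tokens of the rationale and the answer, so the student is pushed to match the teacher's distribution along the whole reasoning chain.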
[19] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese
Zilong Li,Jie Cao
Main category: cs.CL
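TL;DR: This paper studies the translation of classical Chinese into Japanese, abstracts it as sequence tagging tasks, and addresses the low-resource problem with an LLM-based annotation pipeline. Auxiliary Chinese NLP tasks improve sequence tagging, while LLMs score highly in direct machine translation but struggle with annotation.
Details
Motivation: To study how classical Chinese is translated into Japanese, address the low-resource challenge, and connect the traditional annotation practice with modern language technologies.
Contribution: 1. Proposes an LLM-based annotation pipeline; 2. Constructs a new translation dataset; 3. Shows that auxiliary Chinese NLP tasks help sequence tagging; 4. Analyzes how LLMs differ between direct translation and annotation.
Method: Models translation as sequence tagging tasks, introduces an LLM annotation pipeline, builds the dataset from digitized open-source translation data, and adds auxiliary Chinese NLP tasks during training.
Result: LLMs achieve high scores in direct translation but perform poorly when asked to annotate characters; auxiliary tasks improve sequence tagging in the low-resource setting.
Insight: The annotation-based method can serve as a supplement to LLMs: LLMs show promise for low-resource translation but benefit from being combined with the tagging approach.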
Abstract: Ancient people translated classical Chinese into Japanese by annotating around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. Research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that under the low-resource setting, introducing auxiliary Chinese NLP tasks has a promoting effect on the training of sequence tagging tasks. We also evaluate the performance of large language models. They achieve high scores in direct machine translation, but they are confused when asked to annotate characters. Our method could work as a supplement to LLMs.
[20] Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models
Teqi Hao,Xioayu Tan,Shaojie Shi,Yinghui Xu,Xihe Qiu
Main category: cs.CL
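TL;DR: The paper proposes RPO, a post-hoc rewriting framework that decouples content generation from personalization alignment and significantly improves personalization for black-box large language models.
Details
Motivation: Existing methods usually personalize through context injection, which forces the model to generate accurate content and match the user's style at the same time, at a cost to output quality. RPO aims to resolve this trade-off.
Contribution: 1. Proposes the RPO framework, which splits personalization into two stages, content generation and alignment; 2. Introduces an external reflection module trained with supervised fine-tuning followed by reinforcement learning.
Method: 1. A base model generates a generic response; 2. An external reflection module, trained with supervised fine-tuning and reinforcement learning, rewrites it to align with the user's preferences.
Result: On the LaMP benchmark, RPO significantly outperforms state-of-the-art baselines, supporting explicit response shaping over implicit context injection.
Insight: RPO provides an efficient, model-agnostic personalization layer that can be integrated with any base model.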
Abstract: The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user’s preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.
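The two-stage inference flow described in the abstract can be sketched as a small function. This is only an illustration of the decoupling: `base_model` and `reflection_module` are hypothetical callables standing in for the black-box LLM and the trained rewriter, and nothing here reflects the authors' actual interfaces or training procedure.

```python
from typing import Callable, List


def personalize(
    query: str,
    user_history: List[str],
    base_model: Callable[[str], str],
    reflection_module: Callable[[str, List[str]], str],
) -> str:
    """Two-stage personalization: generate generically, then rewrite.

    Stage 1 never sees the user history, so content generation is
    decoupled from stylistic alignment. Stage 2 sees both the generic
    draft and the history and is responsible only for alignment.
    """
    generic_response = base_model(query)
    return reflection_module(generic_response, user_history)
```

The contrast with context injection is that the history is passed to the rewriter rather than being embedded in the prompt to the base model.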
[21] A multimodal multiplex of the mental lexicon for multilingual individuals
Maria Huynh,Wilder C. Rodrigues
Main category: cs.CL
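TL;DR: This research proposal examines how the mental lexicon of multilingual individuals can be modeled as a multilayer network, adding visual input to study its effect on language acquisition.
Details
Motivation: The work is motivated by interest in the cognitive advantages of multilingualism, and in particular by whether visual input improves language acquisition in multilingual learners.
Contribution: Extends the multiplex model of the mental lexicon to the multimodal setting in order to study the role of visual input, offering a new perspective on multilingual learning.
Method: Builds on the explosive-learning model of Stella et al. (2018) and the BIA+ framework of Dijkstra and van Heuven (2002); adds a visual-input layer to the multilayer network model and designs a translation-task experiment.
Result: No results are reported yet; the expected outcome is to determine whether visual input influences multilingual participants' accuracy and proficiency in a translation task compared with text-only conditions.
Insight: The multilingual mental lexicon is dynamic, and cross-modal input (visual plus linguistic) may improve multilingual learning and processing.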
Abstract: Historically, bilingualism was often perceived as an additional cognitive load that could hinder linguistic and intellectual development. However, over the last three decades, this view has changed considerably. Numerous studies have aimed to model and understand the architecture of the bilingual word recognition system (Dijkstra and van Heuven, 2002), investigating how parallel activation operates in the brain and how one language influences another (Kroll et al., 2015). Increasingly, evidence suggests that multilinguals, individuals who speak three or more languages, can perform better than monolinguals in various linguistic and cognitive tasks, such as learning an additional language (Abu-Rabia and Sanitsky, 2010). This research proposal focuses on the study of the mental lexicon and how it may be structured in individuals who speak multiple languages. Building on the work of Stella et al. (2018), who investigated explosive learning in humans using a multiplex model of the mental lexicon, and the Bilingual Interactive Activation (BIA+) framework proposed by Dijkstra and van Heuven (2002), the present study applies the same multilayer network principles introduced by Kivela et al. (2014). Our experimental design extends previous research by incorporating multimodality into the multiplex model, introducing an additional layer that connects visual inputs to their corresponding lexical representations across the multilingual layers of the mental lexicon. In this research, we aim to explore how a heritage language influences the acquisition of another language. Specifically, we ask: Does the presence of visual input in a translation task influence participants’ proficiency and accuracy compared to text-only conditions?
[22] Large Language Models for Explainable Threat Intelligence
Tiago Dinis,Miguel Correia,Roger Tavares
Main category: cs.CL
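TL;DR: The paper presents RAGRecon, which combines large language models (LLMs) with retrieval-augmented generation (RAG) to make cyber threat intelligence more explainable and transparent, and uses knowledge graph visualization to aid user understanding. In experiments, responses from the best combinations match the reference responses more than 91% of the time.
Details
Motivation: As cyber threats grow more complex, traditional security mechanisms struggle to keep up. LLMs' text processing and generation capabilities open new possibilities for cybersecurity, but transparency and explainability remain challenges.
Contribution: Proposes RAGRecon, a system that combines an LLM with RAG, merging real-time information retrieval with domain-specific data to produce threat intelligence, and presents a knowledge graph to make the model's reasoning more interpretable.
Method: Uses an LLM with RAG to retrieve information and answer questions about cybersecurity threats. A knowledge graph is generated and shown to the user for every reply, making the reasoning more transparent.
Result: In experiments with two datasets and seven different LLMs, responses matched the reference responses more than 91% of the time for the best combinations.
Insight: Combining LLMs with RAG supports threat intelligence question answering, and the per-reply knowledge graph helps analysts understand the connections the system makes from the retrieved context.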
Abstract: As cyber threats continue to grow in complexity, traditional security mechanisms struggle to keep up. Large language models (LLMs) offer significant potential in cybersecurity due to their advanced capabilities in text processing and generation. This paper explores the use of LLMs with retrieval-augmented generation (RAG) to obtain threat intelligence by combining real-time information retrieval with domain-specific data. The proposed system, RAGRecon, uses a LLM with RAG to answer questions about cybersecurity threats. Moreover, it makes this form of Artificial Intelligence (AI) explainable by generating and visually presenting to the user a knowledge graph for every reply. This increases the transparency and interpretability of the reasoning of the model, allowing analysts to better understand the connections made by the system based on the context recovered by the RAG system. We evaluated RAGRecon experimentally with two datasets and seven different LLMs and the responses matched the reference responses more than 91% of the time for the best combinations.
[23] Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning
Yahui Fu,Zi Haur Pang,Tatsuya Kawahara
Main category: cs.CL
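TL;DR: The paper proposes a unified framework that combines personalized reasoning chains, unsupervised clustering, and reinforcement learning to improve user satisfaction estimation in dialogue systems, with particular attention to minority user groups.
Details
Motivation: Existing dialogue systems typically use one-size-fits-all satisfaction models that overlook differences in individual and minority-group preferences, which makes satisfaction estimates less accurate.
Contribution: 1. Proposes Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences; 2. Designs Majority-Minority Preference-Aware Clustering (M2PC), an unsupervised clustering method; 3. Develops a preference-adaptive reinforcement learning framework (PAda-PPO).
Method: Combines CoPeR with M2PC, which discovers user groups with an expectation-maximization algorithm, and integrates both into a reinforcement learning framework that jointly optimizes alignment with individual and group preferences.
Result: Experiments on the Emotional Support Conversation dataset show consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.
Insight: Jointly modeling individual and group preferences captures user satisfaction more fully and is more sensitive to minority groups.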
Abstract: User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M2PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.
[24] Steering Language Models with Weight Arithmetic
Constanza Fierro,Fabien Roger
Main category: cs.CL
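TL;DR: The paper proposes contrastive weight steering, a simple post-training method that edits model parameters with weight arithmetic to make better use of narrow-distribution data. It isolates a behavior direction by subtraction in weight space and uses it to modify model behavior, often generalizing further than activation steering.
Details
Motivation: Providing high-quality feedback to large language models on a diverse distribution is difficult and expensive, while feedback on a narrow distribution can lead to unintended generalization. A method that makes better use of narrow training data is therefore needed.
Contribution: Proposes contrastive weight steering, which adjusts model behavior via weight arithmetic, achieves stronger out-of-distribution behavioral control, and shows potential for mitigating behavioral drift during task-specific fine-tuning.
Method: Subtracts the weight deltas of two small fine-tunes (one inducing the desired behavior, one inducing its opposite) to isolate a behavior direction in weight space, then adds or removes that direction to modify the model.
Result: Weight steering often generalizes further than activation steering and can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains.
Insight: Weight steering can partially mitigate undesired drift from fine-tuning, and preliminary evidence suggests that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, which may reveal rare misaligned behaviors that never appear during training or evaluation.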
Abstract: Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes – one that induces the desired behavior and another that induces its opposite – and then add or remove this direction to modify the model’s weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an “evil” weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
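The weight arithmetic described in the abstract can be illustrated on toy parameters. This is a minimal sketch, assuming each model is represented as a dictionary mapping parameter names to flat lists of floats; the function names and the scaling factor `alpha` are illustrative and not taken from the paper.

```python
def behavior_direction(base, positive, negative):
    """Difference of the two fine-tuning deltas, per parameter.

    direction = (positive - base) - (negative - base)
    where `positive` induces the desired behavior and `negative` its opposite.
    """
    return {
        name: [(p - b) - (n - b)
               for b, p, n in zip(base[name], positive[name], negative[name])]
        for name in base
    }


def steer(weights, direction, alpha):
    """Add (alpha > 0) or remove (alpha < 0) the behavior direction."""
    return {
        name: [w + alpha * d for w, d in zip(weights[name], direction[name])]
        for name in weights
    }
```

For example, with `base = {"w": [1.0, 2.0]}`, `positive = {"w": [1.5, 2.0]}` and `negative = {"w": [0.5, 2.0]}`, the direction is `{"w": [1.0, 0.0]}`, and `steer(base, direction, 0.5)` moves only the first coordinate. A negative `alpha` would push the model away from the behavior instead.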
[25] MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis
Yuexin Wu,Shiqi Wang,Vasile Rus
Main category: cs.CL
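TL;DR: Introduces MIMIC-SR-ICD11, a diagnosis dataset based on self-report narratives, and proposes LL-Rank, a likelihood-based re-ranking framework that outperforms the generation-plus-mapping baseline GenMap.
Details
Motivation: Templated diagnostic documentation in electronic health records (EHR) often attenuates or omits clinically salient signals, especially subtle but consequential details, whereas self-reports preserve them.
Contribution: Releases the MIMIC-SR-ICD11 dataset and proposes LL-Rank, which re-ranks diagnostic labels by likelihood to focus on semantic compatibility.
Method: LL-Rank computes the length-normalized joint likelihood of each label given the clinical report context and subtracts the label's report-free prior likelihood.
Result: LL-Rank consistently outperforms the GenMap baseline across seven model backbones; ablations attribute the gains mainly to its PMI-based scoring.
Insight: Likelihood-based re-ranking separates semantic compatibility from label frequency bias.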
Abstract: Disease diagnosis is a central pillar of modern healthcare, enabling early detection and timely intervention for acute conditions while guiding lifestyle adjustments and medication regimens to prevent or slow chronic disease. Self-reports preserve clinically salient signals that templated electronic health record (EHR) documentation often attenuates or omits, especially subtle but consequential details. To operationalize this shift, we introduce MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge notes and natively aligned to WHO ICD-11 terminology. We further present LL-Rank, a likelihood-based re-ranking framework that computes a length-normalized joint likelihood of each label given the clinical report context and subtracts the corresponding report-free prior likelihood for that label. Across seven model backbones, LL-Rank consistently outperforms a strong generation-plus-mapping baseline (GenMap). Ablation experiments show that LL-Rank’s gains primarily stem from its PMI-based scoring, which isolates semantic compatibility from label frequency bias.
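The scoring rule described in the abstract can be sketched from per-token log-probabilities. This sketch assumes the log-probabilities have already been obtained from a language model, once with the report in context and once without it, and that both terms are normalized by the label's token count; the abstract only states that the conditional term is length-normalized, so normalizing the prior is an assumption. The function names are hypothetical.

```python
def ll_rank_score(cond_logprobs, prior_logprobs):
    """PMI-style score for one candidate label.

    cond_logprobs:  per-token log p(token | report, previous label tokens)
    prior_logprobs: per-token log p(token | previous label tokens), no report
    """
    n = len(cond_logprobs)
    if n == 0 or n != len(prior_logprobs):
        raise ValueError("both sequences must be non-empty and equally long")
    return sum(cond_logprobs) / n - sum(prior_logprobs) / n


def rank_labels(candidates):
    """Sort labels by descending score.

    candidates maps each label to a (cond_logprobs, prior_logprobs) pair.
    """
    return sorted(candidates,
                  key=lambda label: ll_rank_score(*candidates[label]),
                  reverse=True)
```

Subtracting the prior means a frequent label with a high unconditional likelihood does not win by default: a rare label that the report makes much more likely can rank above it.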
cs.CV
[26] IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Ali Faraz,Akash,Shaharukh Khan,Raja Kolla,Akshat Patidar,Suranjan Goswami,Abhinav Ravi,Chandra Khatri,Shubham Agarwal
Main category: cs.CV
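TL;DR: The paper introduces IndicVisionBench, the first large-scale vision-language model (VLM) benchmark centered on the Indian subcontinent, covering English and 10 Indian languages and 3 multimodal tasks, and reveals shortcomings of current VLMs in culturally diverse settings.
Details
Motivation: Existing VLM benchmarks are mostly Western-centric and leave open how models perform in culturally diverse and multilingual settings.
Contribution: Proposes the IndicVisionBench benchmark, covering English and 10 Indian languages and 3 multimodal tasks, and releases a parallel annotation corpus across 10 Indic languages as a resource for analyzing cultural and linguistic biases in VLMs.
Method: Builds a dataset of about 5K images and more than 37K QA pairs spanning OCR, MMT, and VQA tasks, and evaluates 8 VLMs.
Result: Experiments show substantial performance gaps for current VLMs in culturally diverse and multilingual settings, underscoring their limitations.
Insight: By centering cultural diversity and multilinguality, IndicVisionBench lays the groundwork for more inclusive multimodal research.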
Abstract: Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
[27] Knowledge-based anomaly detection for identifying network-induced shape artifacts
Rucha Deshpande,Tahsin Rahman,Miguel Lago,Adarsh Subbaswamy,Jana G. Delfino,Ghada Zamzmi,Elim Thompson,Aldo Badano,Seyed Kahaki
Main category: cs.CV
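TL;DR: This paper proposes a knowledge-based anomaly detection method for identifying network-induced shape artifacts in synthetic images. A two-stage framework (feature extraction followed by isolation-forest detection) achieves AUC values of 0.97 and 0.91 on two synthetic mammography datasets.
Details
Motivation: Synthetic data can address data scarcity when training machine learning models, but without quality assessment it may introduce artifacts and distortions that compromise model performance and clinical utility.
Contribution: Proposes a two-stage knowledge-based anomaly detection method targeting network-induced shape artifacts and demonstrates its effectiveness on synthetic mammography datasets.
Method: Two stages: 1) a feature extractor that builds a specialized feature space from the per-image distribution of angle gradients along anatomical boundaries; 2) an isolation forest anomaly detector.
Result: AUC values are 0.97 (CSAW-syn) and 0.91 (VMLO-syn); in a reader study, human rankings agreed reasonably with the algorithm's (Kendall-Tau correlations of 0.45 and 0.43).
Insight: The method supports responsible use of synthetic data by letting developers pinpoint specific issues and improve dataset quality.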
Abstract: Synthetic data provides a promising approach to address data scarcity for training machine learning models; however, adoption without proper quality assessments may introduce artifacts, distortions, and unrealistic features that compromise model performance and clinical utility. This work introduces a novel knowledge-based anomaly detection method for detecting network-induced shape artifacts in synthetic images. The introduced method utilizes a two-stage framework comprising (i) a novel feature extractor that constructs a specialized feature space by analyzing the per-image distribution of angle gradients along anatomical boundaries, and (ii) an isolation forest-based anomaly detector. We demonstrate the effectiveness of the method for identifying network-induced shape artifacts in two synthetic mammography datasets from models trained on CSAW-M and VinDr-Mammo patient datasets respectively. Quantitative evaluation shows that the method successfully concentrates artifacts in the most anomalous partition (1st percentile), with AUC values of 0.97 (CSAW-syn) and 0.91 (VMLO-syn). In addition, a reader study involving three imaging scientists confirmed that images identified by the method as containing network-induced shape artifacts were also flagged by human readers with mean agreement rates of 66% (CSAW-syn) and 68% (VMLO-syn) for the most anomalous partition, approximately 1.5-2 times higher than the least anomalous partition. Kendall-Tau correlations between algorithmic and human rankings were 0.45 and 0.43 for the two datasets, indicating reasonable agreement despite the challenging nature of subtle artifact detection. This method is a step forward in the responsible use of synthetic data, as it allows developers to evaluate synthetic images for known anatomic constraints and pinpoint and address specific issues to improve the overall quality of a synthetic dataset.
[28] CPO: Condition Preference Optimization for Controllable Image Generation
Zonglin Lyu,Ming Li,Xinxin Liu,Chen Chen
Main category: cs.CV
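TL;DR: The paper proposes Condition Preference Optimization (CPO), which performs preference learning over control signals rather than generated images and significantly improves controllability in text-to-image generation.
Details
Motivation: ControlNet++ optimizes only low-noise timesteps, ignoring high-noise timesteps and introducing approximation errors; with Direct Preference Optimization (DPO), uncertainty in generative models makes it hard to ensure that image pairs differ only in controllability. CPO removes these confounding factors by optimizing over control signals instead of images.
Contribution: 1. Proposes CPO, which performs preference learning directly over control signals and lowers the variance of the training objective; 2. Shows theoretically that CPO's contrastive loss variance is lower than DPO's; 3. Shows empirically that CPO significantly outperforms ControlNet++ across multiple control types.
Method: 1. Constructs winning and losing control signal pairs (c^w, c^l); 2. Trains the model to prefer c^w, avoiding the uncertainty of optimizing over generated images; 3. A theoretical analysis shows that the CPO loss has lower variance.
Result: Error rate reductions of over 10% in segmentation, 70-80% in human pose, and a consistent 2-5% in edge and depth maps.
Insight: Preference learning over control signals is more stable, improves the controllability of generative models, and requires less computation and storage for dataset curation.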
Abstract: To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images ($I^{w}$) over less controllable ones ($I^{l}$). However, due to uncertainty in generative models, it is difficult to ensure that win–lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, $\mathbf{c}^{w}$ and $\mathbf{c}^{l}$, and train the model to prefer $\mathbf{c}^{w}$. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over $10\%$ error rate reduction in segmentation, $70$–$80\%$ in human pose, and consistent $2$–$5\%$ reductions in edge and depth maps.
[29] DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation
Dhenenjay Yadav,Rohan Sawai
Main category: cs.CV
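TL;DR: The paper proposes DARN (Dynamic Adaptive Regularization Networks) for adapting foundation models to geospatial analysis; by dynamically adjusting regularization and gating capacity, it improves performance, robustness, and efficiency.
Details
Motivation: Existing foundation model adaptation methods handle the heterogeneity of satellite imagery poorly, because fixed regularization strategies cannot adapt to varying sample complexity; a dynamic mechanism is needed.
Contribution: 1. Introduces a Task Complexity Predictor (TCP) that estimates per-sample difficulty; 2. Proposes Adaptive Dropout Modulation (ADM), which adjusts dropout rates dynamically; 3. Designs a Dynamic Capacity Gating (DCG) module that modulates channel activation.
Method: DARN combines three components (TCP for complexity prediction, ADM for dynamic dropout rates, DCG for channel gating) and provides theoretical justification linking its optimization to stationary point convergence.
Result: 86.66% mIoU on GeoBench (+5.56 pp over prior SOTA), 90.5% mIoU on Sen1Floods11, +9.5 pp mIoU out-of-distribution (OOD) generalization on AI4SmallFarms, and a 17% relative reduction in corruption error.
Insight: DARN's dynamic adaptive mechanism offers an efficient and robust approach to adapting foundation models to heterogeneous data, with notable gains in OOD generalization and on minority classes.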
Abstract: Foundation models (FMs) offer powerful representations for geospatial analysis, but adapting them effectively remains challenging. Standard adaptation methods, whether full fine-tuning or efficient frozen-backbone approaches, typically employ decoders with fixed regularization strategies, failing to account for the significant heterogeneity in satellite imagery. We introduce Dynamic Adaptive Regularization Networks (DARN), a novel decoder architecture designed to address this limitation. DARN integrates three key innovations: (1) a lightweight Task Complexity Predictor (TCP) that estimates per-sample difficulty, (2) Adaptive Dropout Modulation (ADM), dynamically adjusting dropout rates (from 0.1 to 0.5) based on predicted complexity, and (3) Dynamic Capacity Gating (DCG) that modulates channel activation. We provide theoretical justifications linking DARN’s optimization to stationary point convergence and its mechanism to adaptive information bottlenecks. Empirically, DARN demonstrates exceptional performance across both major adaptation paradigms. In full fine-tuning (unfrozen backbone), DARN achieves a new state-of-the-art on the multi-task GeoBench benchmark (86.66% mIoU, +5.56 pp over prior SOTA). In efficient adaptation (frozen backbone), DARN achieves SOTA-competitive accuracy (90.5% mIoU on Sen1Floods11) while delivering substantial advantages crucial for real-world deployment: superior out-of-distribution (OOD) generalization (+9.5 pp mIoU on AI4SmallFarms), enhanced robustness (17% relative reduction in corruption error), and improved performance on minority classes. DARN offers a more intelligent, robust, and efficient approach to leveraging FMs in critical geospatial applications.
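The abstract gives the dropout range used by Adaptive Dropout Modulation (0.1 to 0.5) but not the mapping from predicted complexity to rate. The sketch below assumes a complexity score in [0, 1] and a linear mapping in which harder samples receive more dropout; both the direction and the linearity are assumptions made for illustration, and the function names are hypothetical.

```python
import random


def adaptive_dropout_rate(complexity, low=0.1, high=0.5):
    """Map a predicted complexity score to a dropout rate in [low, high].

    Assumption: complexity lies in [0, 1] and the mapping is linear.
    """
    clipped = min(max(complexity, 0.0), 1.0)
    return low + (high - low) * clipped


def apply_dropout(values, rate, rng):
    """Inverted dropout on a list of activations using the given RNG."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Surviving activations are rescaled so the expected value is unchanged.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]
```

A usage pattern would be `apply_dropout(features, adaptive_dropout_rate(score), random.Random(0))`, where `score` is the per-sample output of a complexity predictor.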
[30] Global 3D Reconstruction of Clouds & Tropical Cyclones
Shirin Ermis,Cesar Aybar,Lilli Freischem,Stella Girtsou,Kyriaki-Margarita Bintsi,Emiliano Diaz Salas-Porras,Michael Eisinger,William Jones,Anna Jungbluth,Benoit Tremblay
Main category: cs.CV
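TL;DR: The paper proposes a framework based on a pre-training and fine-tuning pipeline that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps. It creates global instantaneous 3D cloud maps for the first time and accurately reconstructs the 3D structure of intense storms.
Details
Motivation: Accurate forecasting of tropical cyclones (TCs) is challenging because satellite observations are limited and cloud properties are hard to resolve. Existing 3D cloud reconstruction methods have been restricted to regions where TCs are uncommon and are poorly validated for intense storms.
Contribution: 1. Proposes a framework based on a pre-training and fine-tuning pipeline; 2. Produces global instantaneous 3D cloud maps for the first time; 3. Accurately reconstructs the 3D cloud structure of intense storms.
Method: Uses a pre-training and fine-tuning pipeline on multi-satellite data with global coverage to translate 2D satellite imagery into 3D cloud maps, and evaluates performance on a custom-built TC dataset.
Result: The model generates global 3D cloud maps, accurately reconstructs the 3D structure of intense storms, and provides estimates where observations are missing.
Insight: The method extends available satellite observations and supports a better understanding of TC intensification and improved forecasts.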
Abstract: Accurate forecasting of tropical cyclones (TCs) remains challenging due to limited satellite observations probing TC structure and difficulties in resolving cloud properties involved in TC intensification. Recent research has demonstrated the capabilities of machine learning methods for 3D cloud reconstruction from satellite observations. However, existing approaches have been restricted to regions where TCs are uncommon, and are poorly validated for intense storms. We introduce a new framework, based on a pre-training–fine-tuning pipeline, that learns from multiple satellites with global coverage to translate 2D satellite imagery into 3D cloud maps of relevant cloud properties. We apply our model to a custom-built TC dataset to evaluate performance in the most challenging and relevant conditions. We show that we can - for the first time - create global instantaneous 3D cloud maps and accurately reconstruct the 3D structure of intense storms. Our model not only extends available satellite observations but also provides estimates when observations are missing entirely. This is crucial for advancing our understanding of TC intensification and improving forecasts.
[31] EETnet: a CNN for Gaze Detection and Tracking for Smart-Eyewear
Andrea Aspesi,Andrea Simpsi,Aaron Tognoli,Simone Mentasti,Luca Merigo,Matteo Matteucci
Main category: cs.CV
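TL;DR: EETnet is a CNN designed for event-based eye tracking on low-power embedded devices. It can run on microcontrollers and comes in two variants, classification and regression.
Details
Motivation: Although event cameras are increasingly popular for eye tracking, many existing solutions are validated only on powerful GPUs and are not deployed on embedded devices. EETnet aims to address this.
Contribution: 1. Proposes EETnet, a CNN optimized for embedded devices; 2. Outlines a methodology to train, evaluate, and quantize the network; 3. Designs two architectures, one for classification and one for regression.
Method: 1. Trains on purely event-based data from a public dataset; 2. Proposes a classification model (detects the pupil on a grid superimposed on the image) and a regression model (operates at the pixel level); 3. Quantizes the network to fit low-power devices.
Result: EETnet is capable of running on microcontrollers with limited resources, supporting the feasibility of event-based eye tracking on embedded devices.
Insight: The sparse, asynchronous nature of event data suits low-power applications, but algorithms must be designed with embedded devices in mind.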
Abstract: Event-based cameras are becoming a popular solution for efficient, low-power eye tracking. Due to the sparse and asynchronous nature of event data, they require less processing power and offer latencies in the microsecond range. However, many existing solutions are limited to validation on powerful GPUs, with no deployment on real embedded devices. In this paper, we present EETnet, a convolutional neural network designed for eye tracking using purely event-based data, capable of running on microcontrollers with limited resources. Additionally, we outline a methodology to train, evaluate, and quantize the network using a public dataset. Finally, we propose two versions of the architecture: a classification model that detects the pupil on a grid superimposed on the original image, and a regression model that operates at the pixel level.
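The difference between the two output formulations can be made concrete with the coordinate bookkeeping a grid classifier needs. This is a sketch under assumed conventions (row-major cell indexing, cell centers as decoded positions); the grid size and function names are hypothetical and not taken from the paper.

```python
def pixel_to_cell(x, y, width, height, cols, rows):
    """Class index (row-major) of the grid cell containing pixel (x, y)."""
    col = min(int(x * cols / width), cols - 1)
    row = min(int(y * rows / height), rows - 1)
    return row * cols + col


def cell_to_pixel(cell, width, height, cols, rows):
    """Center of a grid cell in pixel coordinates, used to decode a prediction."""
    row, col = divmod(cell, cols)
    return ((col + 0.5) * width / cols, (row + 0.5) * height / rows)
```

On a 64x64 frame with an 8x8 grid, a pupil at pixel (20, 10) falls in cell 10, which decodes back to (20.0, 12.0). The classification variant is therefore accurate only up to the cell size, whereas a regression head predicts (x, y) directly at pixel resolution.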
[32] Validating Vision Transformers for Otoscopy: Performance and Data-Leakage Effects
James Ndubuisi,Fernando Auat,Marta Vallejo
Main category: cs.CV
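TL;DR: The paper studies vision transformers (Swin Transformer) for otoscopy and exposes the effect of data leakage on model performance. Initial results appeared near-perfect for all models, but accuracy dropped sharply once the leakage was corrected.
Details
Motivation: Ear diseases have a reported misdiagnosis rate of 27% among specialist otolaryngologists, so better diagnostic accuracy is needed. The study aims to validate the potential of vision transformers for otoscopy.
Contribution: 1. Compares Swin Transformers with a traditional CNN; 2. Uncovers a data leakage issue and its effect on model performance; 3. Highlights the need to balance advanced model architectures with effective data preprocessing.
Method: Uses Swin v1 and Swin v2 models; otoscopic video frames are selected with Laplacian and Shannon entropy thresholds and blank frames are removed. Performance is compared against a ResNet model.
Result: Initial accuracies were 100% (Swin v1), 99.1% (Swin v2), and 99.5% (ResNet); after the data leakage was mitigated, they fell to 83% for both Swin models and 82% for ResNet.
Insight: Even advanced models require rigorous data handling. Data leakage can make results misleading, which matters especially in medical AI applications.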
Abstract: This study evaluates the efficacy of vision transformer models, specifically Swin transformers, in enhancing the diagnostic accuracy of ear diseases compared to traditional convolutional neural networks. With a reported 27% misdiagnosis rate among specialist otolaryngologists, improving diagnostic accuracy is crucial. The research utilised a real-world dataset from the Department of Otolaryngology at the Clinical Hospital of the Universidad de Chile, comprising otoscopic videos of ear examinations depicting various middle and external ear conditions. Frames were selected based on the Laplacian and Shannon entropy thresholds, with blank frames removed. Initially, Swin v1 and Swin v2 transformer models achieved accuracies of 100% and 99.1%, respectively, marginally outperforming the ResNet model (99.5%). These results surpassed metrics reported in related studies. However, the evaluation uncovered a critical data leakage issue in the preprocessing step, affecting both this study and related research using the same raw dataset. After mitigating the data leakage, model performance decreased significantly. Corrected accuracies were 83% for both Swin v1 and Swin v2, and 82% for the ResNet model. This finding highlights the importance of rigorous data handling in machine learning studies, especially in medical applications. The findings indicate that while vision transformers show promise, it is essential to find an optimal balance between the benefits of advanced model architectures and those derived from effective data preprocessing. This balance is key to developing a reliable machine learning model for diagnosing ear diseases.
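The abstract does not say how the leakage arose in preprocessing. One common cause with video data is that frames from the same recording end up in both the training and test sets. Assuming that scenario, the sketch below shows a group-wise split that keeps every video on one side only. The function name and the toy frame identifiers are hypothetical.

```python
import random
from collections import defaultdict


def split_by_group(frame_ids, group_of, test_fraction=0.2, seed=0):
    """Split frames so that no group (e.g. one video) spans both sets."""
    groups = defaultdict(list)
    for fid in frame_ids:
        groups[group_of[fid]].append(fid)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    # Whole groups go to one side, never individual frames.
    train = [f for k in keys if k not in test_keys for f in groups[k]]
    test = [f for k in keys if k in test_keys for f in groups[k]]
    return train, test


# Hypothetical data: 5 videos with 4 frames each.
frames = [f"v{v}_f{i}" for v in range(5) for i in range(4)]
video_of = {f: f.split("_")[0] for f in frames}
train_frames, test_frames = split_by_group(frames, video_of)
```

Splitting by patient instead of by video is stricter still when one patient contributes several recordings; the same function works if `group_of` maps frames to patient identifiers.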
[33] A benchmark multimodal oro-dental dataset for large vision-language models
Haoxin Lv,Ijazul Haq,Jin Du,Jiaxin Ma,Binnian Zhu,Xiaobing Dang,Chaoan Liang,Ruxu Du,Yingjie Zhang,Muhammad Saqib
Main category: cs.CV
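TL;DR: The paper presents a large-scale multimodal oro-dental dataset for training and evaluating vision-language models, showing the potential of AI in oral healthcare.
Details
Motivation: Oral healthcare lacks large-scale multimodal datasets, which limits the use of AI in clinical practice.
Contribution: A dataset with diverse data (images, text, and more), validated by fine-tuning state-of-the-art models.
Method: Multimodal data from 8775 dental checkups were collected and annotated. Qwen-VL models were fine-tuned and evaluated on classification and diagnostic report generation.
Result: The fine-tuned models outperform the baselines (their base counterparts and GPT-4o) on oro-dental anomaly classification and diagnostic report generation.
Insight: Large-scale multimodal datasets are a key resource for advancing AI in oral healthcare.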
Abstract: The advancement of artificial intelligence in oral healthcare relies on the availability of large-scale multimodal datasets that capture the complexity of clinical practice. In this paper, we present a comprehensive multimodal dataset, comprising 8775 dental checkups from 4800 patients collected over eight years (2018-2025), with patients ranging from 10 to 90 years of age. The dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual records, including diagnoses, treatment plans, and follow-up notes. The data were collected under standard ethical guidelines and annotated for benchmarking. To demonstrate its utility, we fine-tuned state-of-the-art large vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks: classification of six oro-dental anomalies and generation of complete diagnostic reports from multimodal inputs. We compared the fine-tuned models with their base counterparts and GPT-4o. The fine-tuned models achieved substantial gains over these baselines, validating the dataset and underscoring its effectiveness in advancing AI-driven oro-dental healthcare solutions. The dataset is publicly available, providing an essential resource for future research in AI dentistry.
[34] DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning
Tharindu Fernando,Clinton Fookes,Sridha Sridharan
Main category: cs.CV
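TL;DR: DeepForgeSeal is a semi-fragile watermarking method for deepfake detection that operates in latent space and is trained with multi-agent adversarial reinforcement learning, improving generalization and robustness.
Details
Motivation: Rapid progress in generative AI makes deepfakes increasingly realistic, and passive detectors that rely on specific forgery artifacts fail to generalize to new types. Proactive watermark-based detection must balance robustness against benign distortions with fragility under malicious tampering.
Contribution: A learnable watermark embedder that operates in latent space, combined with Multi-Agent Adversarial Reinforcement Learning (MAARL), gives precise control over message encoding and extraction and optimizes the balance between robustness and fragility.
Method: Within the MAARL framework, the watermarking agent interacts with an adversarial attacker agent that simulates benign and malicious image manipulations, and thereby learns an adaptive watermarking strategy.
Result: On CelebA and CelebA-HQ, the method outperforms state-of-the-art approaches by over 4.5% and more than 5.3%, respectively, under challenging manipulation scenarios.
Insight: Watermarking in a latent space of high-level semantic representations is better suited to diverse forged media, and MAARL offers a new way to trade off watermark robustness against sensitivity.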
Abstract: Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.
[35] CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting
Hexu Zhao,Xiwen Min,Xiaoteng Liu,Moonjun Gong,Yiming Li,Ang Li,Saining Xie,Jinyang Li,Aurojit Panda
Main category: cs.CV
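TL;DR: CLM is a system that offloads Gaussians to CPU memory and loads them into GPU memory only when needed. This removes the GPU memory bottleneck of 3D Gaussian Splatting (3DGS) on large scenes and enables rendering on a consumer-grade GPU.
Details
Motivation: 3DGS has attracted wide attention for its fast rendering and high-quality output, but its large memory requirement limits it on large scenes. CLM aims to make such scenes fit on a single consumer-grade GPU.
Contribution: A novel offloading strategy that exploits 3DGS’s memory access pattern to reduce communication and performance overheads.
Method: (1) Offload Gaussians to CPU memory and load them on demand; (2) pipeline the work using access-pattern observations to overlap GPU-CPU communication with computation; (3) reduce communication volume.
Result: CLM renders a large scene requiring 100 million Gaussians on a single RTX4090 while achieving state-of-the-art reconstruction quality.
Insight: Careful memory management and communication scheduling can deliver high-performance 3D rendering on limited hardware, a practical route to large-scene Gaussian splatting.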
Abstract: 3D Gaussian Splatting (3DGS) is an increasingly popular novel view synthesis approach due to its fast rendering time and high-quality output. However, scaling 3DGS to large (or intricate) scenes is challenging due to its large memory requirement, which exceeds most GPUs’ memory capacity. In this paper, we describe CLM, a system that allows 3DGS to render large scenes using a single consumer-grade GPU, e.g., RTX4090. It does so by offloading Gaussians to CPU memory and loading them into GPU memory only when necessary. To reduce performance and communication overheads, CLM uses a novel offloading strategy that exploits observations about 3DGS’s memory access pattern for pipelining, and thus overlaps GPU-to-CPU communication, GPU computation, and CPU computation. Furthermore, we also exploit observations about the access pattern to reduce communication volume. Our evaluation shows that the resulting implementation can render a large scene that requires 100 million Gaussians on a single RTX4090 and achieve state-of-the-art reconstruction quality.
[36] Pattern-Aware Diffusion Synthesis of fMRI/dMRI with Tissue and Microstructural Refinement
Xiongri Shen,Jiaqi Wang,Yi Zhong,Zhenxi Song,Leilei Zhao,Yichen Wei,Lingyan Liang,Shuqiang Wang,Baiying Lei,Demao Deng,Zhiguo Zhang
Main category: cs.CV
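TL;DR: The paper proposes PDS, a pattern-aware diffusion synthesis method for cross-modality synthesis between functional MRI (fMRI) and diffusion MRI (dMRI). It combines a dual-modal 3D diffusion framework with tissue and microstructure refinement to address the limits of existing methods.
Details
Motivation: fMRI and dMRI signals differ substantially along the time/gradient axis, and existing methods do not adequately integrate disease-related neuroanatomical patterns, which limits cross-modality synthesis.
Contribution: 1. A pattern-aware dual-modal 3D diffusion framework; 2. a tissue refinement network combined with efficient microstructure refinement to preserve structural fidelity and fine detail.
Method: 1. Cross-modality learning with the dual-modal 3D diffusion framework; 2. integration of the tissue refinement network and the microstructure refinement module.
Result: On OASIS-3, ADNI, and in-house datasets, PSNR/SSIM reach 29.83 dB/90.84% (fMRI) and 30.00 dB/77.55% (dMRI), above the baselines. In clinical validation, hybrid real-synthetic experiments reach 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD).
Insight: Pattern awareness combined with structural refinement improves cross-modality MRI synthesis and shows potential for clinical use in neurodegenerative disease research.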
Abstract: Magnetic resonance imaging (MRI), especially functional MRI (fMRI) and diffusion MRI (dMRI), is essential for studying neurodegenerative diseases. However, missing modalities pose a major barrier to their clinical use. Although GAN- and diffusion model-based approaches have shown some promise in modality completion, they remain limited in fMRI-dMRI synthesis due to (1) significant BOLD vs. diffusion-weighted signal differences between fMRI and dMRI in time/gradient axis, and (2) inadequate integration of disease-related neuroanatomical patterns during generation. To address these challenges, we propose PDS, introducing two key innovations: (1) a pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) a tissue refinement network integrated with an efficient microstructure refinement to maintain structural fidelity and fine details. Evaluated on OASIS-3, ADNI, and in-house datasets, our method achieves state-of-the-art results, with PSNR/SSIM scores of 29.83 dB/90.84% for fMRI synthesis (+1.54 dB/+4.12% over baselines) and 30.00 dB/77.55% for dMRI synthesis (+1.02 dB/+2.2%). In clinical validation, the synthesized data show strong diagnostic performance, achieving 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic experiments. Code is available at https://github.com/SXR3015/PDS.
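For readers who want to interpret the PSNR figures, the sketch below computes PSNR under its standard definition, 10 * log10(MAX^2 / MSE). It is a generic illustration on flat lists of intensities, not the evaluation code of the paper; the function name and the assumption that intensities are scaled to [0, 1] are mine.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


# Toy voxel intensities, assumed to lie in [0, 1].
score = psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```

Because the scale is logarithmic, a gain of 1.54 dB corresponds under this definition to roughly a 30% lower mean squared error (a factor of 10^0.154, about 1.43).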
[37] Learning Fourier shapes to probe the geometric world of deep neural networks
Jian Wang,Yixing Yong,Haixia Bi,Lijun He,Fan Li
Main category: cs.CV
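TL;DR: The paper uses optimized shapes as semantic carriers. Arbitrary shapes are parameterized with Fourier series and combined with a winding-number mapping and signal energy constraints, yielding a new tool for probing how deep neural networks understand geometry.
Details
Motivation: Shape and texture both matter for visual recognition, yet research has focused mainly on texture and left the geometric understanding of DNNs largely unexplored. The paper aims to fill this gap.
Contribution: 1) Shows that optimized shapes can act as high-confidence semantic carriers; 2) provides a high-fidelity interpretability tool; 3) proposes a generalizable adversarial paradigm.
Method: An end-to-end differentiable framework that combines Fourier-series shape parameterization, a winding number-based mapping to the pixel grid, and signal energy constraints that keep the shapes physically plausible.
Result: The generated shapes deceive downstream visual tasks and reveal aspects of how DNNs understand geometry.
Insight: Shape can serve as an independent carrier of semantic information, giving a new angle for understanding and challenging machine perception.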
Abstract: While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model’s salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.
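The two geometric ingredients named in the abstract can be sketched in a few lines: a closed contour defined by a truncated Fourier series, and a winding-number test that turns the contour into a pixel mask. This is a simplified, non-differentiable illustration; the paper's mapping is differentiable and its exact parameterization is not given in the abstract. The coefficient layout `(ax, bx, ay, by)` and all function names are assumptions.

```python
import math


def fourier_contour(coeffs, n_points=200):
    """Sample a closed curve from a truncated Fourier series.

    coeffs holds one (ax, bx, ay, by) tuple per harmonic k = 1, 2, ...
    """
    points = []
    for i in range(n_points):
        t = 2 * math.pi * i / n_points
        x = sum(ax * math.cos(k * t) + bx * math.sin(k * t)
                for k, (ax, bx, _, _) in enumerate(coeffs, start=1))
        y = sum(ay * math.cos(k * t) + by * math.sin(k * t)
                for k, (_, _, ay, by) in enumerate(coeffs, start=1))
        points.append((x, y))
    return points


def winding_number(point, polygon):
    """Sum the signed angles subtended by each edge, divided by 2*pi."""
    px, py = point
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i][0] - px, polygon[i][1] - py
        x2, y2 = polygon[(i + 1) % n][0] - px, polygon[(i + 1) % n][1] - py
        total += math.atan2(x1 * y2 - y1 * x2, x1 * x2 + y1 * y2)
    return round(total / (2 * math.pi))


def rasterize(polygon, size=16, extent=1.5):
    """Binary mask: 1 where the pixel center has non-zero winding number."""
    step = 2 * extent / size
    mask = []
    for row in range(size):
        y = -extent + (row + 0.5) * step
        mask.append([
            1 if winding_number((-extent + (col + 0.5) * step, y), polygon) != 0 else 0
            for col in range(size)
        ])
    return mask


# A unit circle (first harmonic) deformed by a small second harmonic.
contour = fourier_contour([(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.2)])
mask = rasterize(contour)
```

Adding harmonics with small coefficients deforms the circle smoothly, which is what makes the Fourier coefficients a convenient set of parameters to optimize.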
[38] Challenges in 3D Data Synthesis for Training Neural Networks on Topological Features
Dylan Peek,Matthew P. Skerritt,Siddharth Pritam,Stephan Chalup
Main category: cs.CV
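TL;DR: The paper generates topologically labeled 3D datasets with the Repulsive Surface algorithm for training neural networks on topological feature analysis.
Details
Motivation: Traditional TDA methods such as persistent homology are computationally expensive. Neural networks can cut computational overhead and inference time, but labeled 3D data suitable for supervised learning is lacking.
Contribution: A systematic method for generating labeled 3D data, demonstrated by training a topological feature estimation network.
Method: Generate 3D datasets with the Repulsive Surface algorithm, controlling topological invariants (such as hole count) for diversity, and train a model with a 3D convolutional transformer architecture.
Result: The dataset fills a gap in labeled 3D data. Model accuracy drops as deformations increase, indicating that geometric complexity as well as topological complexity affects generalization.
Insight: Neural estimation of topological features is affected by geometric complexity as well as topological complexity, so both must be considered in data generation and training.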
Abstract: Topological Data Analysis (TDA) involves techniques of analyzing the underlying structure and connectivity of data. However, traditional methods like persistent homology can be computationally demanding, motivating the development of neural network-based estimators capable of reducing computational overhead and inference time. A key barrier to advancing these methods is the lack of labeled 3D data with class distributions and diversity tailored specifically for supervised learning in TDA tasks. To address this, we introduce a novel approach for systematically generating labeled 3D datasets using the Repulsive Surface algorithm, allowing control over topological invariants, such as hole count. The resulting dataset offers varied geometry with topological labeling, making it suitable for training and benchmarking neural network estimators. This paper uses a synthetic 3D dataset to train a genus estimator network, created using a 3D convolutional transformer architecture. An observed decrease in accuracy as deformations increase highlights the role of not just topological complexity, but also geometric complexity, when training generalized estimators. This dataset fills a gap in labeled 3D datasets and generation for training and evaluating models and techniques for TDA.
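The label that the genus estimator predicts can be computed exactly for a closed, orientable triangle mesh from the Euler characteristic, using V - E + F = 2 - 2g. The sketch below shows this classical check on two hand-built meshes. It is not the paper's data pipeline; the Repulsive Surface generation step is not reproduced, and the function names are hypothetical.

```python
def genus_from_faces(faces):
    """Genus of a closed orientable triangle mesh via V - E + F = 2 - 2g."""
    vertices = {v for face in faces for v in face}
    edges = {frozenset((face[i], face[(i + 1) % 3])) for face in faces for i in range(3)}
    chi = len(vertices) - len(edges) + len(faces)
    if chi % 2 != 0:
        raise ValueError("mesh is not a closed orientable surface")
    return (2 - chi) // 2


def torus_faces(n=4, m=4):
    """Triangulated n x m grid with both directions wrapped around (a torus)."""
    faces = []
    for i in range(n):
        for j in range(m):
            a = i * m + j
            b = ((i + 1) % n) * m + j
            c = i * m + (j + 1) % m
            d = ((i + 1) % n) * m + (j + 1) % m
            faces += [(a, b, d), (a, d, c)]
    return faces


tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
sphere_genus = genus_from_faces(tetrahedron)
torus_genus = genus_from_faces(torus_faces())
```

The count depends only on connectivity, so deforming the vertex positions leaves the label unchanged. That is the property that lets geometry vary while the topological label stays fixed.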
[39] GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder
Heng Er Metilda Chee,Jiayin Wang,Zhiqiang Guo,Weizhi Ma,Min Zhang
Main category: cs.CV
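TL;DR: The paper introduces Triple-S, the first benchmark for Sticker Semantic Similarity, and develops the General Sticker Encoder (GSE) to address the difficulty pretrained models have in capturing nuanced sticker semantics.
Details
Motivation: Stickers are a popular form of visual communication, but their diverse and symbolic content makes semantic relationships hard to model. The paper aims to fill this gap and provide standardized tools.
Contribution: 1. Defines the Sticker Semantic Similarity task; 2. releases the Triple-S benchmark; 3. proposes GSE, a lightweight and versatile model that learns robust sticker embeddings.
Method: GSE learns sticker embeddings from Triple-S and additional datasets and is validated on downstream tasks (emotion classification and sticker-to-sticker retrieval).
Result: GSE performs strongly on unseen stickers and on downstream tasks.
Insight: Sticker semantics call for purpose-built models; general-purpose models struggle to capture their diverse, symbolic content.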
Abstract: Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers, and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released.
[40] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
Aakriti Agrawal,Gouthaman KV,Rohith Aralikatti,Gauri Jagatap,Jiaxin Yuan,Vijay Kamarshi,Andrea Fanelli,Furong Huang
Main category: cs.CV
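TL;DR: Large vision-language models (LVLMs) hallucinate partly because of an inherent bias toward the language modality. The paper proposes a simple, effective fix: refine the textual embeddings with average-pooled visual features, which reduces hallucinations and improves visual grounding.
Details
Motivation: Prevailing LVLM architectures are biased toward language, largely because visual embeddings are simply appended to the input text sequence, and this contributes to hallucinations. The paper addresses the problem by refining textual embeddings with visual information.
Contribution: A simple method that refines textual embeddings with average-pooled visual features, reducing hallucinations and improving visual grounding, along with an analysis of the impact of modality imbalance.
Method: Average-pool the visual features and integrate them into the textual embeddings to rebalance the language and visual representations.
Result: The method reduces hallucinations and improves visual grounding on established benchmarks.
Insight: Modality imbalance is an important cause of LVLM hallucinations, and refining textual embeddings with visual information mitigates it. More sophisticated fusion strategies are left to future work.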
Abstract: In this work, we identify an inherent bias in prevailing LVLM architectures toward the language modality, largely resulting from the common practice of simply appending visual embeddings to the input text sequence. To address this, we propose a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. Our approach demonstrably improves visual grounding and significantly reduces hallucinations on established benchmarks. While average pooling offers a straightforward, robust, and efficient means of incorporating visual information, we believe that more sophisticated fusion methods could further enhance visual grounding and cross-modal alignment. Given that the primary focus of this work is to highlight the modality imbalance and its impact on hallucinations – and to show that refining textual embeddings with visual information mitigates this issue – we leave exploration of advanced fusion strategies for future work.
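The core operation can be sketched on plain lists of floats. The abstract says the pooled visual feature is integrated into the textual embeddings but does not say how. The sketch assumes the simplest option, a scaled addition to every text token, and the mixing weight `alpha` is a hypothetical parameter, not one named by the authors.

```python
def average_pool(visual_tokens):
    """Mean of the visual token embeddings, one value per dimension."""
    count = len(visual_tokens)
    dim = len(visual_tokens[0])
    return [sum(token[d] for token in visual_tokens) / count for d in range(dim)]


def refine_text_embeddings(text_tokens, visual_tokens, alpha=0.5):
    """Add the pooled visual vector, scaled by alpha, to every text embedding.

    Assumption: additive fusion; the paper's actual integration may differ.
    """
    pooled = average_pool(visual_tokens)
    return [[t + alpha * p for t, p in zip(token, pooled)] for token in text_tokens]


# Toy 2-dimensional embeddings.
visual = [[1.0, 3.0], [3.0, 5.0]]
text = [[0.0, 0.0], [1.0, 1.0]]
refined = refine_text_embeddings(text, visual)
```

In this form the text and visual embeddings must share the same dimensionality; a real model would typically need a learned projection when they differ.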
[41] Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation
Jing Jin,Xu Liu,Te Gao,Zhihong Shi,Yixiong Liang,Ruiqing Zheng,Hulin Kuang,Min Zeng,Shichao Kan
Main category: cs.CV
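TL;DR: The paper proposes dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end whole slide image (WSI) representation, working around the GPU memory limit that prevents processing all tiles at once.
Details
Motivation: WSI representation is critical for cancer subtyping, cancer recognition, and mutation prediction, but a single WSI contains tens of thousands of tiles, so GPU memory cannot hold the gradients of all tiles in one mini-batch.
Contribution: DRE-SLCL, which combines dynamic residual encoding and slide-level contrastive learning to train WSI representations end to end within memory limits.
Method: A memory bank stores tile features for all WSIs. During training, a subset of tiles is randomly sampled per WSI and encoded, additional tile features of the same WSI are retrieved from the memory bank, residual encoding produces the WSI representation, and a slide-level contrastive loss over the representations and histopathology reports optimizes the model.
Result: Experiments on cancer subtyping, cancer recognition, and mutation prediction support the effectiveness of DRE-SLCL.
Insight: Combining dynamic residual encoding with contrastive learning makes efficient use of limited compute for WSI representation learning.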
Abstract: Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments conducted over cancer subtyping, cancer recognition, and mutation prediction tasks proved the effectiveness of the proposed DRE-SLCL method.
[42] Medical Referring Image Segmentation via Next-Token Mask Prediction
Xinyu Chen,Yiran Wang,Gaoyang Pang,Jiafu Hao,Chentao Yue,Luping Zhou,Yonghui Li
Main category: cs.CV
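TL;DR: NTP-MRISeg recasts medical referring image segmentation (MRIS) as autoregressive next-token prediction, which simplifies model design. It proposes three strategies against prediction errors and long-tail distributions and reaches state-of-the-art performance.
Details
Motivation: Existing MRIS methods usually rely on complex multimodal fusion or multi-stage decoders. The paper aims to streamline the pipeline with unified sequence modeling.
Contribution: 1. The NTP-MRISeg framework, which recasts MRIS as a unified sequence prediction problem; 2. three strategies (NkTP, TCL, and HET) that address the challenges of this formulation.
Method: 1. Represent image, text, and mask as one tokenized sequence; 2. Next-k Token Prediction (NkTP) to reduce cumulative errors; 3. Token-level Contrastive Learning (TCL) to improve boundary sensitivity and mitigate long-tail effects; 4. memory-based Hard Error Token (HET) optimization to emphasize difficult tokens.
Result: New state-of-the-art performance on the QaTa-COV19 and MosMedData+ datasets.
Insight: Sequence modeling with autoregressive prediction can simplify the design of multimodal tasks, with targeted strategies handling the typical problems of sequence prediction.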
Abstract: Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. While achieving promising results, recent approaches usually involve complex design of multimodal fusion or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and supports a unified architecture for end-to-end training. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address challenges under this formulation, such as exposure bias, long-tail token distributions, and fine-grained lesion edges, we propose three novel strategies: (1) a Next-k Token Prediction (NkTP) scheme to reduce cumulative prediction errors, (2) Token-level Contrastive Learning (TCL) to enhance boundary sensitivity and mitigate long-tail distribution effects, and (3) a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. Extensive experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that NTP-MRISeg achieves new state-of-the-art performance, offering a streamlined and effective alternative to traditional MRIS pipelines.
[43] No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation
Mingyu Sung,Hyeonmin Choe,Il-Min Kim,Sangseok Yun,Jae Mo Kang
Main category: cs.CV
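TL;DR: PITTA is a test-time adaptation (TTA) framework for monocular depth estimation (MDE) that needs no camera pose information and masks dynamic objects at the instance level. It outperforms prior methods on DrivingStereo and Waymo.
Details
Motivation: MDE models are often deployed under conditions that differ from those seen in training. Existing TTA methods for MDE, including self-supervised ones, remain ineffective in diverse and dynamic environments.
Contribution: (i) A pose-agnostic TTA paradigm for MDE; (ii) instance-aware image masking; in addition, a simple edge extraction method for the input image and the depth map.
Method: Adapt a pretrained MDE network at test time without any camera pose information. Extract instance-wise masks for dynamic objects (e.g., vehicles, pedestrians) from the output of a pretrained panoptic segmentation network by removing static objects and background. Extract edges from the single input image and from the depth map.
Result: On the DrivingStereo and Waymo datasets under varying environmental conditions, PITTA surpasses existing state-of-the-art techniques.
Insight: The results suggest that test-time adaptation for MDE does not have to depend on camera pose, and that treating dynamic objects explicitly helps in changing environments.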
Abstract: Monocular depth estimation (MDE), inferring pixel-level depths in single RGB images from a monocular camera, plays a pivotal role in a variety of AI applications demanding a three-dimensional (3D) topographical scene. In real-world scenarios, MDE models often need to be deployed in environments with different conditions from those for training. Test-time (domain) adaptation (TTA) is one of the compelling and practical approaches to address the issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods are still ineffective and problematic when applied to diverse and dynamic environments. To break through this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA enables highly effective TTA on a pretrained MDE network in a pose-agnostic manner without resorting to any camera pose information. In addition, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles, pedestrians, etc.) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction methodology for the input image (i.e., a single monocular image) and depth map. Extensive experimental evaluations on DrivingStereo and Waymo datasets with varying environmental conditions demonstrate that our proposed framework, PITTA, surpasses the existing state-of-the-art techniques with remarkable performance improvements in MDE during TTA.
[44] Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach
Yuanxiang Huangfu,Chaochao Wang,Weilei Wang
Main category: cs.CV
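TL;DR: Role-SynthCLIP is a role-play-driven approach to diverse synthetic data. Multi-perspective prompts generate semantically rich image-text pairs that improve CLIP training.
Details
Motivation: Existing synthetic data methods emphasize volume over semantic diversity, which produces redundant or shallow captions. Role-SynthCLIP uses role-playing prompts to increase semantic diversity and fine-grained alignment.
Contribution: 1. The Role-SynthCLIP framework, which uses multi-role prompts to generate semantically diverse image-text pairs; 2. improved CLIP performance without increasing the number of image-text pairs.
Method: 1. Multimodal Large Language Models (MLLMs) generate diverse captions under role-playing prompts (e.g., a compositional analyst, an interpreter of image context); 2. this strengthens fine-grained image-text alignment and semantic diversity.
Result: A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs reaches a Recall@1 of 64.1% on the MS COCO validation set, 2.8 percentage points above the best existing synthetic data baseline (trained on 5 million pairs).
Insight: The results suggest that semantic diversity can matter more than data volume for CLIP training, and that role-playing prompts are an effective way to generate diverse data.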
Abstract: The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
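The Recall@1 figure quoted above is a retrieval metric. The sketch below computes Recall@K from a similarity matrix under a simplifying assumption: query i has exactly one correct match, candidate i. MS COCO actually pairs each image with several captions, so this is an illustration of the metric and not the paper's evaluation protocol. The function name and toy scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct match (same index) is in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy image-to-text similarity scores: rows are images, columns are captions.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

In the toy matrix the second image ranks the wrong caption first, so it counts as a miss at K=1 but as a hit at K=2.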
[45] SurgiATM: A Physics-Guided Plug-and-Play Model for Deep Learning-Based Smoke Removal in Laparoscopic Surgery
Mingyu Sheng,Jianan Fan,Dongnan Liu,Guoyan Zheng,Ron Kikinis,Weidong Cai
Main category: cs.CV
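TL;DR: SurgiATM is a lightweight plug-and-play module for smoke removal in laparoscopic surgery. It combines the strengths of a physics-based atmospheric model and data-driven deep learning models, improving the accuracy and generalizability of existing models without additional trainable weights.
Details
Motivation: Smoke in laparoscopic surgery degrades the visual quality of endoscopic frames, raising the risk of surgical errors and hindering clinical decision-making and computer-assisted analysis, so removing it matters for safety and efficiency.
Contribution: 1. SurgiATM, which combines the generalizability of a physics model with the accuracy of deep learning; 2. a lightweight plug-and-play design that integrates into diverse existing architectures, introducing only two hyperparameters and no additional trainable weights.
Method: Statistically bridges the physics-based atmospheric model and data-driven models while leaving the network architecture unchanged, with minimal computational and modification overhead.
Result: Experiments on three public surgical datasets with ten desmoking methods show that SurgiATM commonly reduces the restoration error of existing models and improves their generalizability.
Insight: SurgiATM illustrates the potential of combining physics guidance with data-driven models, offering a low-cost, effective approach to surgical smoke removal.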
Abstract: During laparoscopic surgery, smoke generated by tissue cauterization can significantly degrade the visual quality of endoscopic frames, increasing the risk of surgical errors and hindering both clinical decision-making and computer-assisted visual analysis. Consequently, removing surgical smoke is critical to ensuring patient safety and maintaining operative efficiency. In this study, we propose the Surgical Atmospheric Model (SurgiATM) for surgical smoke removal. SurgiATM statistically bridges a physics-based atmospheric model and data-driven deep learning models, combining the superior generalizability of the former with the high accuracy of the latter. Furthermore, SurgiATM is designed as a lightweight, plug-and-play module that can be seamlessly integrated into diverse surgical desmoking architectures to enhance their accuracy and stability, better meeting clinical requirements. It introduces only two hyperparameters and no additional trainable weights, preserving the original network architecture with minimal computational and modification overhead. We conduct extensive experiments on three public surgical datasets with ten desmoking methods, involving multiple network architectures and covering diverse procedures, including cholecystectomy, partial nephrectomy, and diaphragm dissection. The results demonstrate that incorporating SurgiATM commonly reduces the restoration errors of existing models and relatively enhances their generalizability, without adding any trainable layers or weights. This highlights the convenience, low cost, effectiveness, and generalizability of the proposed method. The code for SurgiATM is released at https://github.com/MingyuShengSMY/SurgiATM.
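The abstract does not spell out SurgiATM's formulation or its two hyperparameters. As background, the sketch below shows the classical atmospheric scattering model that physics-based dehazing usually starts from, I = J * t + A * (1 - t), together with its inversion. Whether SurgiATM uses exactly this form is an assumption, and the names `airlight` and `t_min` are illustrative.

```python
def add_smoke(clean, transmission, airlight=1.0):
    """Forward model: observed = clean * t + airlight * (1 - t), per pixel."""
    return [j * t + airlight * (1 - t) for j, t in zip(clean, transmission)]


def remove_smoke(smoky, transmission, airlight=1.0, t_min=0.1):
    """Invert the model; t is clamped to t_min to avoid dividing by ~0."""
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(smoky, transmission)]


# Toy single-channel intensities in [0, 1] and per-pixel transmission values.
clean_pixels = [0.2, 0.5, 0.8]
transmission_map = [0.9, 0.5, 0.25]
smoky_pixels = add_smoke(clean_pixels, transmission_map)
restored_pixels = remove_smoke(smoky_pixels, transmission_map)
```

In practice the transmission map is unknown and has to be estimated, which is where a learned network and a physics model can complement each other.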
[46] Deep learning models are vulnerable, but adversarial examples are even more vulnerable
Jun Li,Yanwei Xu,Keran Li,Xiaoli Zhang
Main category: cs.CV
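TL;DR: The paper finds that adversarial examples are more sensitive to occlusion than clean samples and proposes a sliding-window-mask detection method (SWM-AED) that performs well at detecting them.
Details
Motivation: Understanding the intrinsic differences between adversarial examples and clean samples helps improve DNN robustness and the detection of adversarial attacks.
Contribution: 1. Shows empirically that adversarial examples are more sensitive to occlusion than clean samples; 2. introduces Sliding Mask Confidence Entropy (SMCE), a measure of confidence fluctuation under occlusion; 3. designs a new detection method (SWM-AED).
Method: 1. Generate adversarial examples on CIFAR-10 with nine canonical attacks; 2. quantify confidence fluctuation under occlusion with SMCE; 3. build the SWM-AED detector on SMCE.
Result: SWM-AED is robust across classifiers and attacks, with accuracy above 62% in most cases and up to 96.5%.
Insight: Sensitivity to occlusion is a usable detection feature, and detecting with it avoids the catastrophic overfitting of conventional adversarial training.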
Abstract: Understanding intrinsic differences between adversarial examples and clean samples is key to enhancing DNN robustness and detection against adversarial attacks. This study first empirically finds that image-based adversarial examples are notably sensitive to occlusion. Controlled experiments on CIFAR-10 used nine canonical attacks (e.g., FGSM, PGD) to generate adversarial examples, paired with original samples for evaluation. We introduce Sliding Mask Confidence Entropy (SMCE) to quantify model confidence fluctuation under occlusion. Using 1800+ test images, SMCE calculations supported by Mask Entropy Field Maps and statistical distributions show adversarial examples have significantly higher confidence volatility under occlusion than originals. Based on this, we propose Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids catastrophic overfitting of conventional adversarial training. Evaluations across classifiers and attacks on CIFAR-10 demonstrate robust performance, with accuracy over 62% in most cases and up to 96.5%.
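The abstract does not define SMCE. One plausible reading, used here purely as an assumption, is the average Shannon entropy of the classifier's output while a mask slides over the image. The sketch implements that reading with a hypothetical two-class toy classifier standing in for a real network; none of these names come from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sliding_mask_entropy(image, classify, window=2, stride=2, fill=0.0):
    """Average output entropy over all positions of a square occluding mask."""
    height, width = len(image), len(image[0])
    scores = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            masked = [row[:] for row in image]
            for r in range(top, top + window):
                for c in range(left, left + window):
                    masked[r][c] = fill
            scores.append(entropy(classify(masked)))
    return sum(scores) / len(scores)


def toy_classifier(image):
    """Hypothetical model: P('bright') is a logistic of the mean intensity."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    p_bright = 1 / (1 + math.exp(-20 * (mean - 0.5)))
    return [p_bright, 1 - p_bright]


confident_image = [[1.0] * 4 for _ in range(4)]   # far from the decision boundary
borderline_image = [[0.6] * 4 for _ in range(4)]  # close to the decision boundary
confident_score = sliding_mask_entropy(confident_image, toy_classifier)
borderline_score = sliding_mask_entropy(borderline_image, toy_classifier)
```

Under this reading, an input that sits close to a decision boundary yields a higher score because occlusion easily changes the model's confidence. A detector could then threshold the score, though the threshold and the exact statistic used by SWM-AED are not given in the abstract.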
[47] Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start
Fuyang Liu,Jiaqi Xu,Xiaowei Hu
Main category: cs.CV
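TL;DR: The paper builds HFLS-Weather, a high-fidelity simulated weather dataset, and pairs it with a dual-level reinforcement learning framework to restore real-world images degraded by adverse weather, addressing the weak generalization of existing methods to complex degradations.
Details
Motivation: Adverse weather severely impairs visual perception, and models trained on synthetic data with fixed parameters cannot adapt to complex degradations, so a solution that adapts continuously to real-world conditions is needed.
Contribution: 1) The high-fidelity HFLS-Weather dataset; 2) a dual-level reinforcement learning framework (local optimization plus global control) that supports reward-based learning without paired supervision; 3) adaptation to and restoration of diverse adverse weather scenes.
Method: 1) Build HFLS-Weather with physics-driven simulation; 2) at the local level, refine weather-specific restoration models through perturbation-driven image quality optimization, and at the global level, use a meta-controller to schedule model selection and execution order dynamically.
Result: State-of-the-art performance across a range of adverse weather scenarios, with continuous adaptation to real-world conditions.
Insight: Dual-level reinforcement learning with a high-quality cold start improves generalization to complex degradations, with dynamic scheduling as the key mechanism.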
Abstract: Adverse weather severely impairs real-world visual perception, while existing vision models trained on synthetic data with fixed parameters struggle to generalize to complex degradations. To address this, we first construct HFLS-Weather, a physics-driven, high-fidelity dataset that simulates diverse weather phenomena, and then design a dual-level reinforcement learning framework initialized with HFLS-Weather for cold-start training. Within this framework, at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, enabling reward-based learning without paired supervision; at the global level, a meta-controller dynamically orchestrates model selection and execution order according to scene degradation. This framework enables continuous adaptation to real-world conditions and achieves state-of-the-art performance across a wide range of adverse weather scenarios. Code is available at https://github.com/xxclfy/AgentRL-Real-Weather
[48] Early Alzheimer’s Disease Detection from Retinal OCT Images: A UK Biobank Study
Yasemin Turkan,F. Boray Tek,M. Serdar Nazlı,Öykü Eren
Main category: cs.CV
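TL;DR: The study applies deep learning to raw OCT B-scan images for early detection of Alzheimer’s disease (AD), which the authors believe is a first. It fine-tunes several pretrained models with tailored augmentation, and explainability analyses show structural differences in the central macular region.
Details
Motivation: Changes in retinal layer thickness are associated with neurodegenerative diseases such as AD, but earlier work relied on segmented thickness measurements. This study explores classifying OCT B-scans directly for early AD detection.
Contribution: 1. To the authors’ knowledge, the first application of deep learning to raw OCT B-scans for AD prediction; 2. fine-tuning of multiple pretrained models with OCT-specific augmentation and a year-weighted loss; 3. explainability analyses confirming structural differences between the AD and control groups in the central macular subfield.
Method: 1. Fine-tune ImageNet-pretrained networks and the OCT-specific RETFound transformer on UK Biobank data; 2. apply standard and OCT-specific augmentation to reduce overfitting; 3. use a year-weighted loss that prioritizes cases diagnosed within four years of imaging.
Result: ResNet-34 gave the most stable results, with an AUC of 0.62 in the 4-year cohort. This is below the threshold for clinical use, but explainability analyses revealed structural differences in the central macular subfield between the AD and control groups.
Insight: The study sets a baseline for OCT-based AD prediction, shows how hard it is to detect subtle retinal biomarkers years before diagnosis, and points to the need for larger datasets and multimodal approaches.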
Abstract: Alterations in retinal layer thickness, measurable using Optical Coherence Tomography (OCT), have been associated with neurodegenerative diseases such as Alzheimer’s disease (AD). While previous studies have mainly focused on segmented layer thickness measurements, this study explored the direct classification of OCT B-scan images for the early detection of AD. To our knowledge, this is the first application of deep learning to raw OCT B-scans for AD prediction in the literature. Unlike conventional medical image classification tasks, early detection is more challenging than diagnosis because imaging precedes clinical diagnosis by several years. We fine-tuned and evaluated multiple pretrained models, including ImageNet-based networks and the OCT-specific RETFound transformer, using subject-level cross-validation datasets matched for age, sex, and imaging instances from the UK Biobank cohort. To reduce overfitting in this small, high-dimensional dataset, both standard and OCT-specific augmentation techniques were applied, along with a year-weighted loss function that prioritized cases diagnosed within four years of imaging. ResNet-34 produced the most stable results, achieving an AUC of 0.62 in the 4-year cohort. Although below the threshold for clinical application, our explainability analyses confirmed localized structural differences in the central macular subfield between the AD and control groups. These findings provide a baseline for OCT-based AD prediction, highlight the challenges of detecting subtle retinal biomarkers years before AD diagnosis, and point to the need for larger datasets and multimodal approaches.
[49] SnowyLane: Robust Lane Detection on Snow-covered Rural Roads Using Infrastructural Elements
Jörg Gamerdinger,Benedict Wetzel,Patrick Schulz,Sven Teufel,Oliver Bringmann
Main category: cs.CV
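TL;DR: The paper presents a lane detection approach for snow-covered roads, where lane markings are often occluded. It detects vertical roadside posts as indirect lane indicators and fits the lane trajectory with a parameterized Bezier curve model. It also introduces SnowyLane, a new synthetic dataset with 80,000 annotated frames.
Details
Motivation: In snowy conditions, lane markings are often covered or missing, so conventional lane detection fails. A robust method that does not depend on lane markings is needed.
Contribution: 1. A lane detection method based on roadside features (vertical posts); 2. The SnowyLane dataset; 3. Improved robustness in adverse weather.
Method: 1. Detect vertical roadside posts; 2. Fit the lane trajectory with a Bezier curve model; 3. Train and evaluate on the SnowyLane dataset.
Result: Compared with existing methods, the approach is more robust under heavy snow cover while remaining real-time capable.
Insight: Detecting indirect cues such as roadside posts, rather than the lane markings themselves, can improve lane detection in adverse weather.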
Abstract: Lane detection for autonomous driving in snow-covered environments remains a major challenge due to the frequent absence or occlusion of lane markings. In this paper, we present a novel, robust and real-time capable approach that bypasses the reliance on traditional lane markings by detecting roadside features, specifically vertical roadside posts called delineators, as indirect lane indicators. Our method first perceives these posts, then fits a smooth lane trajectory using a parameterized Bezier curve model, leveraging spatial consistency and road geometry. To support training and evaluation in these challenging scenarios, we introduce SnowyLane, a new synthetic dataset containing 80,000 annotated frames that capture winter driving conditions with varying snow coverage and lighting conditions. Compared to state-of-the-art lane detection systems, our approach demonstrates significantly improved robustness in adverse weather, particularly in cases with heavy snow occlusion. This work establishes a strong foundation for reliable lane detection in winter scenarios and contributes a valuable resource for future research in all-weather autonomous driving. The dataset is available at https://ekut-es.github.io/snowy-lane
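Illustration for the Bezier lane model above: a minimal sketch of how a lane trajectory could be sampled once control points are known, using de Casteljau's algorithm. It covers only curve evaluation, not the fitting step or the post detector, and the control points are made-up values rather than data from the paper.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        # Linearly interpolate between each pair of neighbouring points.
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]

def sample_lane(control_points, num_samples=5):
    """Sample evenly spaced parameter values along the curve."""
    return [
        bezier_point(control_points, i / (num_samples - 1))
        for i in range(num_samples)
    ]

# Hypothetical quadratic curve: start, one shaping point, end.
control = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
lane = sample_lane(control)
```

In a full pipeline the control points would be estimated from the detected post positions, for example by least squares; that step is omitted here.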
[50] From Linear Probing to Joint-Weighted Token Hierarchy: A Foundation Model Bridging Global and Cellular Representations in Biomarker Detection
Jingsong Liu,Han Li,Nassir Navab,Peter J. Schüffler
Main category: cs.CV
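TL;DR: JWTH (Joint-Weighted Token Hierarchy) is a foundation model that combines global and cell-level representations. With self-supervised pretraining and cell-centric post-tuning, it improves biomarker detection in digital pathology.
Details
Motivation: Existing pathology foundation models (PFMs) rely mainly on global patch-level embeddings and overlook cell-level morphology, which limits both the performance and the interpretability of biomarker detection.
Contribution: The JWTH model, which integrates self-supervised pretraining with cell-centric post-tuning and fuses local and global tokens through attention pooling, improving detection accuracy.
Method: Large-scale self-supervised pretraining, followed by cell-level post-tuning and an attention pooling mechanism that fuses global and local tokens hierarchically.
Result: Across four biomarker tasks and eight cohorts, JWTH improves balanced accuracy over prior PFMs by 1.2% on average and by up to 8.3%.
Insight: Global representations enriched with cell-level information improve biomarker detection and give digital pathology more robust and interpretable AI tools.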
Abstract: AI-based biomarkers can infer molecular features directly from hematoxylin & eosin (H&E) slides, yet most pathology foundation models (PFMs) rely on global patch-level embeddings and overlook cell-level morphology. We present a PFM model, JWTH (Joint-Weighted Token Hierarchy), which integrates large-scale self-supervised pretraining with cell-centric post-tuning and attention pooling to fuse local and global tokens. Across four tasks involving four biomarkers and eight cohorts, JWTH achieves up to 8.3% higher balanced accuracy and 1.2% average improvement over prior PFMs, advancing interpretable and robust AI-based biomarker detection in digital pathology.
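Illustration for the attention pooling mentioned above: a generic sketch of how a set of tokens can be fused into one vector with softmax weights. The dot-product scoring and the `query` vector are assumptions for illustration; the abstract does not specify JWTH's actual pooling layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(tokens, query):
    """Weighted average of tokens; weights come from dot products with `query`."""
    scores = [sum(t_i * q_i for t_i, q_i in zip(tok, query)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Two toy tokens, e.g. one global and one cell-level embedding.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With a zero query every token gets the same weight, so the result is a plain mean; a learned query shifts the weight toward the tokens it scores highly.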
[51] Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges
Adrian Azzarelli,Nantheera Anantrasirichai,David R Bull
Main category: cs.CV
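TL;DR: The paper proposes a dynamic Gaussian Splatting method for sparse multi-view video. It splits the Gaussian representation and the deformation field into foreground and background components to handle dynamic 3D reconstruction under the sparse camera setups common in filmmaking.
Details
Motivation: The sparse camera configurations common in filmmaking limit existing dynamic Gaussian Splatting methods. The authors address this by separating the foreground and background Gaussian representations and by adapting the training strategy.
Contribution: 1. A split of the Gaussian representation and the deformation field into foreground and background; 2. Separate training strategies and loss functions for each part; 3. Segmented dynamic reconstruction without dense mask supervision.
Method: A sparse set of masks splits the Gaussian representation and the deformation field into foreground and background components, which are trained with different loss functions during canonical pre-training and dynamic training. The foreground learns changes in color, position, and rotation; the background learns only changes in position.
Result: On 3D and 2.5D entertainment datasets, the method achieves up to 3 dB higher PSNR than existing methods with half the model size, and it produces segmented dynamic reconstructions that include transparent and dynamic textures.
Insight: Separating foreground and background Gaussians and training them with different strategies improves dynamic 3D reconstruction under sparse camera configurations.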
Abstract: Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limits state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is separately trained on different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results; up to 3 PSNR higher with half the model size on 3-D scenes. Unlike the SotA and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions including transparent and dynamic textures. Code and video comparisons are available online: https://interims-git.github.io/
[52] Another BRIXEL in the Wall: Towards Cheaper Dense Features
Alexander Lappe,Martin A. Giese
Main category: cs.CV
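TL;DR: BRIXEL is a simple knowledge distillation method in which the student learns to reproduce its own feature maps at higher resolution. It cuts the compute cost of DINOv3 models and performs better on downstream tasks.
Details
Motivation: Vision foundation models such as DINOv3 are expensive to run when producing dense feature maps at high resolution, which limits practical use.
Contribution: BRIXEL, a knowledge distillation method that improves performance at a fixed resolution and produces teacher-like feature maps at a fraction of the compute cost.
Method: Through knowledge distillation, the student learns to produce high-resolution feature maps from low-resolution input, mimicking the teacher's output.
Result: BRIXEL outperforms the baseline DINOv3 models at a fixed resolution and produces feature maps similar to the teacher's at lower compute cost.
Insight: Knowledge distillation works well for dense feature generation and lowers compute cost without sacrificing performance.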
Abstract: Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the squared complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it is able to produce feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.
[53] 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos
Mengqi Guo,Bo Xu,Yanyan Li,Gim Hee Lee
Main category: cs.CV
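TL;DR: 4D3R is a dynamic neural rendering framework that separates static and dynamic content with a two-stage approach and uses motion-aware techniques to refine camera poses and geometry, improving results on dynamic scenes.
Details
Motivation: Novel view synthesis for dynamic scenes from monocular video remains an open problem. Existing methods such as NeRF and 3DGS usually depend on pre-computed camera poses and struggle with dynamic content. 4D3R targets pose-free reconstruction and rendering of dynamic scenes.
Contribution: (1) A motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 segmentation to make camera pose optimization more robust; (2) An efficient Motion-Aware Gaussian Splatting (MA-GS) representation that lowers computational cost.
Method: Two stages: 3D foundation models first estimate initial poses and geometry, which are then refined by motion-aware modules. MA-BA refines camera poses; MA-GS models dynamic motion with control points and a deformation field MLP.
Result: On real-world dynamic datasets, 4D3R improves PSNR by up to 1.8 dB over existing methods while reducing computational requirements by 5x.
Insight: 4D3R shows the benefit of combining learned priors with classical geometric optimization, especially for scenes with large dynamic objects.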
Abstract: Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
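Illustration for the linear blend skinning (LBS) step named in the abstract above: each point is moved by a weighted mix of the rigid transforms attached to nearby control points. The sketch is 2-D for brevity; the transform parameterization (angle plus translation) and the weights are illustrative assumptions, not 4D3R's implementation.

```python
import math

def rigid_transform(point, angle, translation):
    """Rotate a 2-D point by `angle` (radians), then translate it."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])

def linear_blend_skinning(point, transforms, weights):
    """Blend the point's position under each control-point transform."""
    total = sum(weights)  # normalize so the weights sum to 1
    out_x = out_y = 0.0
    for (angle, translation), w in zip(transforms, weights):
        px, py = rigid_transform(point, angle, translation)
        out_x += (w / total) * px
        out_y += (w / total) * py
    return (out_x, out_y)

# Two control points: one stays still, one shifts right by 2 units.
moved = linear_blend_skinning(
    (1.0, 0.0),
    [(0.0, (0.0, 0.0)), (0.0, (2.0, 0.0))],
    [1.0, 1.0],
)
```

Because only a small set of control points carries transforms, the deformation of many Gaussians can be driven by far fewer parameters, which is consistent with the efficiency motivation stated in the abstract.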
[54] ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining
Xincheng Yao,Yan Luo,Zefeng Qian,Chongyang Zhang
Main category: cs.CV
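TL;DR: ADPretrain is a pretraining framework designed for industrial anomaly detection. Its angle- and norm-oriented contrastive losses maximize the discrepancy between normal and abnormal features and avoid the distribution shift that comes with ImageNet pretraining.
Details
Motivation: Existing anomaly detection methods rely on feature networks pretrained on ImageNet, but those pretraining objectives do not match the anomaly detection task (telling normal from abnormal), and natural images differ in distribution from industrial images.
Contribution: 1) A pretraining framework designed for anomaly detection; 2) Angle- and norm-oriented contrastive losses; 3) Pretraining on the large-scale industrial anomaly dataset RealIAD; 4) Class-generalizable representations learned from residual features.
Method: Angle- and norm-oriented contrastive losses maximize the difference between normal and abnormal features; pretraining runs on the large-scale industrial anomaly dataset RealIAD; residual features mitigate distribution shift.
Result: Across five datasets and five backbones, the features pretrained with ADPretrain consistently outperform the original features.
Insight: Task-specific pretraining improves performance, and residual features help mitigate shifts in data distribution.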
Abstract: The current mainstream and state-of-the-art anomaly detection (AD) methods are largely built on feature networks pretrained on ImageNet. However, whether the pretraining is supervised or self-supervised, the pretraining process on ImageNet does not match the goal of anomaly detection (i.e., pretraining on natural images does not aim to distinguish between normal and abnormal). Moreover, natural images and industrial image data in AD scenarios typically exhibit a distribution shift. The two issues can cause ImageNet-pretrained features to be suboptimal for AD tasks. To further promote the development of the AD field, pretrained representations designed specifically for AD tasks are much needed and highly valuable. To this end, we propose a novel AD representation learning framework specially designed for learning robust and discriminative pretrained representations for industrial anomaly detection. Specifically, staying close to the goal of anomaly detection (i.e., focusing on discrepancies between normals and anomalies), we propose angle- and norm-oriented contrastive losses to maximize the angle size and norm difference between normal and abnormal features simultaneously. To avoid the distribution shift from natural images to AD images, our pretraining is performed on a large-scale AD dataset, RealIAD. To further alleviate the potential shift between pretraining data and downstream AD datasets, we learn the pretrained AD representations based on the class-generalizable representation, residual features. For evaluation, based on five embedding-based AD methods, we simply replace their original features with our pretrained representations. Extensive experiments on five AD datasets and five backbones consistently show the superiority of our pretrained features. The code is available at https://github.com/xcyao00/ADPretrain.
[55] DeepEyesV2: Toward Agentic Multimodal Model
Jack Hong,Chenxiao Zhao,ChengLin Zhu,Weiheng Lu,Guohai Xu,Xing Yu
Main category: cs.CV
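TL;DR: DeepEyesV2 uses a two-stage training pipeline, a cold start followed by reinforcement learning, to build a multimodal model that can invoke tools, and validates it on RealX-Bench and other benchmarks.
Details
Motivation: Current multimodal models lack the ability to actively invoke tools and integrate them into their reasoning. DeepEyesV2 aims to close this gap.
Contribution: 1) A two-stage training pipeline; 2) The RealX-Bench benchmark; 3) Evidence of task-adaptive tool invocation.
Method: A cold-start stage establishes tool-use patterns, and a reinforcement learning stage refines tool invocation.
Result: The model performs well on RealX-Bench and other tasks and invokes tools adaptively.
Insight: Reinforcement learning needs a preceding cold-start stage to induce robust tool-use behavior.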
Abstract: Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
[56] Cross-domain EEG-based Emotion Recognition with Contrastive Learning
Rui Yan,Yibo Li,Han Ding,Fei Wang
Main category: cs.CV
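TL;DR: EmotionCLIP reformulates EEG emotion recognition as an EEG-text matching task within the CLIP framework. Combined with the SST-LegoViT backbone, which captures multi-scale features, it improves results on the SEED and SEED-IV datasets.
Details
Motivation: EEG-based emotion recognition faces challenges in feature utilization and cross-domain generalization, and needs a more robust approach.
Contribution: The EmotionCLIP framework, which brings EEG-text matching into CLIP, and the SST-LegoViT backbone, which together improve cross-domain emotion recognition.
Method: 1. Perform EEG-text matching within the CLIP framework; 2. Capture spatial, spectral, and temporal features with SST-LegoViT (multi-scale convolution plus Transformer modules); 3. Use contrastive learning to improve feature generalization.
Result: On SEED and SEED-IV, cross-subject accuracy reaches 88.69% and 73.50%, and cross-time accuracy reaches 88.46% and 77.54%, outperforming existing models.
Insight: Multimodal contrastive learning makes EEG emotion recognition more robust, and EEG-text matching offers a way to bring in semantic information.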
Abstract: Electroencephalogram (EEG)-based emotion recognition is vital for affective computing but faces challenges in feature utilization and cross-domain generalization. This work introduces EmotionCLIP, which reformulates recognition as an EEG-text matching task within the CLIP framework. A tailored backbone, SST-LegoViT, captures spatial, spectral, and temporal features using multi-scale convolution and Transformer modules. Experiments on SEED and SEED-IV datasets show superior cross-subject accuracies of 88.69% and 73.50%, and cross-time accuracies of 88.46% and 77.54%, outperforming existing models. Results demonstrate the effectiveness of multimodal contrastive learning for robust EEG emotion recognition.
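Illustration for the EEG-text matching idea above: at inference time, a CLIP-style classifier picks the label whose text embedding is closest to the signal embedding. The sketch shows only that matching step with toy three-dimensional vectors; the encoders, the prompt wording, and the contrastive training are omitted, and all values are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(eeg_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the EEG embedding."""
    scores = {
        label: cosine(eeg_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical text embeddings for three emotion prompts.
prompts = {
    "positive": [1.0, 0.0, 0.0],
    "neutral": [0.0, 1.0, 0.0],
    "negative": [0.0, 0.0, 1.0],
}
label, scores = classify([0.1, 0.9, 0.2], prompts)
```

Framing classification as matching means new label sets can be tried by changing the prompts rather than retraining a classification head, which is one plausible reason for the reformulation.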
[57] LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Zhenyu Yang,Kairui Zhang,Yuhang Hu,Bing Wang,Shengsheng Qian,Bin Wen,Fan Yang,Tingting Gao,Weiming Dong,Changsheng Xu
Main category: cs.CV
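TL;DR: LiveStar is a live streaming assistant that delivers always-on proactive responses through adaptive streaming decoding, addressing real-time responsiveness and narrative coherence in online video understanding.
Details
Motivation: Existing online Video Large Language Models (Video-LLMs) struggle to process continuous frame-by-frame input while choosing when to respond, which hurts real-time responsiveness and narrative coherence.
Contribution: 1) A training strategy for incremental video-language alignment; 2) A response-silence decoding framework that uses single-forward-pass verification; 3) Memory-aware acceleration that makes inference 1.53x faster.
Method: Combines adaptive streaming decoding, an incremental training strategy, and memory compression to improve the efficiency and responsiveness of online video understanding.
Result: State-of-the-art results on three benchmarks: semantic correctness improves by 19.5% on average, timing difference drops by 18.1%, and FPS rises by 12.0%.
Insight: Through its technical design and the accompanying dataset, LiveStar improves both the performance and the practicality of online video understanding.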
Abstract: Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar’s state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
[58] $\mathbf{S^2LM}$: Towards Semantic Steganography via Large Language Models
Huanqi Wu,Huangbiao Xu,Runfeng Xie,Jiaxin Cai,Kaixin Zhang,Xiao Ke
Main category: cs.CV
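TL;DR: The paper proposes $\mathrm{S^2LM}$, a semantic steganography method based on large language models (LLMs) that embeds semantically rich, sentence-level information into images through a newly designed pipeline.
Details
Motivation: Traditional steganography struggles to embed semantically rich, sentence-level information, a capability that matters more than ever in the era of AIGC. The paper aims to enable semantically rich steganography.
Contribution: 1. The sentence-to-image semantic steganography task; 2. The Invisible Text (IVT) evaluation benchmark; 3. The $\mathrm{S^2LM}$ model, which uses LLMs to hide high-level semantic information.
Method: A newly designed pipeline in which the LLM is involved throughout the steganographic process, supporting the embedding of sentence-level or paragraph-level information into images.
Result: Quantitative and qualitative experiments show that $\mathrm{S^2LM}$ achieves semantic steganography effectively.
Insight: LLMs can lift steganography from the bit level to the semantic level, opening a new direction for the field.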
Abstract: Although steganography has made significant advancements in recent years, it still struggles to embed semantically rich, sentence-level information into carriers. However, in the era of AIGC, the capacity of steganography is more critical than ever. In this work, we present Sentence-to-Image Steganography, an instance of Semantic Steganography, a novel task that enables the hiding of arbitrary sentence-level messages within a cover image. Furthermore, we establish a benchmark named Invisible Text (IVT), comprising a diverse set of sentence-level texts as secret messages for evaluation. Finally, we present $\mathbf{S^2LM}$: Semantic Steganographic Language Model, which utilizes large language models (LLMs) to embed high-level textual information, such as sentences or even paragraphs, into images. Unlike traditional bit-level counterparts, $\mathrm{S^2LM}$ enables the integration of semantically rich content through a newly designed pipeline in which the LLM is involved throughout the entire process. Both quantitative and qualitative experiments demonstrate that our method effectively unlocks new semantic steganographic capabilities for LLMs. The source code will be released soon.
[59] Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects
Manuel Gomes,Bogdan Raducanu,Miguel Oliveira
Main category: cs.CV
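TL;DR: The paper introduces the Artic4D dataset and the CanonSeg4D framework for 4D panoptic segmentation. A canonical space representation combined with temporal modeling improves segmentation consistency for articulated objects.
Details
Motivation: Existing methods ignore the temporal dynamics of articulated objects, and the field lacks suitable datasets and algorithms.
Contribution: 1. The Artic4D dataset; 2. The CanonSeg4D framework, which uses a canonical space representation to make segmentation of dynamic objects more consistent.
Method: Estimate per-frame offsets that map observed object parts to a canonical space, and combine this with temporal modeling to segment and align object parts.
Result: Outperforms existing methods on Artic4D, especially in more complex scenarios.
Insight: Temporal modeling and canonical alignment are central to understanding dynamic objects and lay groundwork for future research on 4D articulated object perception.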
Abstract: Articulated object perception presents significant challenges in computer vision, particularly because most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. The use of 4D temporal data has not been thoroughly explored in articulated object perception and remains unexamined for panoptic segmentation. The lack of a benchmark dataset further hinders progress in this field. To this end, we introduce Artic4D as a new dataset derived from PartNet Mobility and augmented with synthetic sensor data, featuring 4D panoptic annotations and articulation parameters. Building on this dataset, we propose CanonSeg4D, a novel 4D panoptic segmentation framework. This approach explicitly estimates per-frame offsets mapping observed object parts to a learned canonical space, thereby enhancing part-level segmentation. The framework employs this canonical representation to achieve consistent alignment of object parts across sequential frames. Comprehensive experiments on Artic4D demonstrate that the proposed CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy in more complex scenarios. These findings highlight the effectiveness of temporal modeling and canonical alignment in dynamic object understanding, and pave the way for future advances in 4D articulated object perception.
[60] Dense Motion Captioning
Shiyao Xu,Benedetta Liberatori,Gül Varol,Paolo Rota
Main category: cs.CV
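TL;DR: The paper introduces the Dense Motion Captioning task, which temporally localizes and captions actions within 3D human motion sequences, and releases CompMo, the first large-scale dataset for it, together with a new model, DEMO.
Details
Motivation: Research on combining 3D human motion with language has focused on text-to-motion generation, while motion understanding remains relatively unexplored. Existing datasets lack detailed temporal annotations and consist mostly of short sequences.
Contribution: 1. The Dense Motion Captioning task; 2. CompMo, the first large-scale dataset of complex motion; 3. The DEMO model, which combines a large language model with a motion adapter.
Method: DEMO couples a large language model with a simple motion adapter to generate dense, temporally grounded captions. A carefully designed data generation pipeline ensures the diversity and annotation quality of CompMo.
Result: DEMO substantially outperforms existing methods on CompMo and on adapted benchmarks, establishing a strong baseline for future research in 3D motion understanding and captioning.
Insight: Dense Motion Captioning fills a gap in 3D motion understanding, and the complexity and scale of CompMo give future research a solid resource.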
Abstract: Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of multiple actions ranging from at least two to ten, accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.
[61] PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization
Zehui Feng,Tian Qiu,Tong Wu,Junxuan Li,Huayuan Xu,Ting Han
Main category: cs.CV
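TL;DR: PreResQ-R1 is a Preference-Response Disentangled reinforcement learning framework that jointly optimizes absolute score regression and relative ranking consistency. It improves fine-grained reasoning and cross-domain generalization in visual quality assessment and reaches state-of-the-art results on multiple benchmarks.
Details
Motivation: Existing visual quality assessment methods rely mainly on supervised fine-tuning or rank-only objectives, which leads to shallow reasoning, poor score calibration, and limited cross-domain generalization. A framework that combines absolute scores with relative ranking is needed to make reasoning more stable and interpretable.
Contribution: 1. The PreResQ-R1 framework, which combines absolute score regression with relative ranking consistency;
2. A dual-branch reward that separately models intra-sample response coherence and inter-sample preference alignment;
3. Clear gains on both static image and dynamic video quality assessment.
Method: 1. Use a Preference-Response Disentangled RL framework;
2. Optimize with Group Relative Policy Optimization (GRPO);
3. Add a global-temporal and local-spatial data flow strategy for video.
Result: State-of-the-art results across 10 IQA and 5 VQA benchmarks, with SRCC and PLCC gains of 5.30% and 2.15%, respectively, on the IQA task.
Insight: Beyond quantitative gains, PreResQ-R1 produces reasoning traces that reveal the perceptual cues behind quality judgments, making the model more interpretable and better aligned with human judgment.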
Abstract: Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing prior methods by margins of 5.30% and 2.15% on the IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
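Illustration for the two metrics cited above: SRCC (Spearman rank correlation) measures whether predicted scores order the images like human ratings do, while PLCC (Pearson linear correlation) measures how linearly the two agree. The sketch below is a plain standard-library version with made-up scores, not the paper's evaluation code.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: predictions are monotone in the ratings but not linear.
predicted = [1.0, 2.0, 3.0, 4.0]
human = [1.0, 4.0, 9.0, 16.0]
srcc, plcc = spearman(predicted, human), pearson(predicted, human)
```

In this toy case the ordering is perfect, so SRCC is 1, while PLCC falls below 1 because the relationship is curved. Reporting both metrics separates ranking quality from score calibration, which matches the paper's rank-and-score framing.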
[62] AI Assisted AR Assembly: Object Recognition and Computer Vision for Augmented Reality Assisted Assembly
Alexander Htet Kyaw,Haotian Ma,Sasa Zivkovic,Jenny Sabin
Main category: cs.CV
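TL;DR: The paper presents an AI-assisted AR assembly workflow that uses deep-learning-based object recognition to identify assembly components and display step-by-step instructions, reducing manual work.
Details
Motivation: Conventional assembly requires manually searching for, sorting, and labeling components, which is inefficient and error-prone. Combining AI with AR can make assembly guidance more efficient and accurate.
Contribution: An assembly system that combines object recognition with AR and dynamically displays component bounding boxes and placement positions, simplifying the assembly process.
Method: A deep learning model recognizes assembly components, and AR displays guidance such as bounding boxes and placement positions in real time. A case study on assembling LEGO sculptures demonstrates feasibility.
Result: The system identifies components and provides assembly guidance dynamically, removing the need to manually search for, sort, or label components before each assembly.
Insight: Combining AI and AR shows promise for assembly workflows, with the potential to improve efficiency and reduce human error.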
Abstract: We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition to identify different assembly components and display step-by-step instructions. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed. By connecting assembly instructions with the real-time location of relevant components, the system eliminates the need for manual searching, sorting, or labeling of different components before each assembly. To demonstrate the feasibility of using object recognition for AR-assisted assembly, we highlight a case study involving the assembly of LEGO sculptures.
[63] PALM: A Dataset and Baseline for Learning Multi-subject Hand Prior
Zicong Fan,Edoardo Remelli,David Dimond,Fadime Sener,Liuhao Ge,Bugra Tekin,Cem Keskin,Shreyas Hampali
Main category: cs.CV
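TL;DR: PALM is a large-scale dataset with 13k high-quality hand scans and 90k multi-view images from 263 subjects, covering diverse hand geometry and appearance. PALM-Net is a baseline that learns a multi-subject hand prior through physically based inverse rendering and supports single-image hand avatar personalization.
Details
Motivation: Creating high-quality personalized hand avatars from images is hard because of complex geometry, appearance, and articulation, and public datasets that combine a diverse subject population with high-quality data are lacking.
Contribution: 1. The PALM dataset, which covers wide variation in hand geometry and appearance; 2. The PALM-Net baseline, which supports single-image hand avatar personalization.
Method: PALM-Net learns a multi-subject prior over hand geometry and material properties through physically based inverse rendering, producing realistic, relightable hand avatars from a single image.
Result: PALM provides a rich resource for hand modeling research, and PALM-Net demonstrates single-image personalized hand modeling built on it.
Insight: The scale and diversity of PALM fill a gap in public data for hand modeling and give future work a common benchmark.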
Abstract: The ability to grasp objects, signal with gestures, and share emotion through touch all stem from the unique capabilities of human hands. Yet creating high-quality personalized hand avatars from images remains challenging due to complex geometry, appearance, and articulation, particularly under unconstrained lighting and limited views. Progress has also been limited by the lack of datasets that jointly provide accurate 3D geometry, high-resolution multiview imagery, and a diverse population of subjects. To address this, we present PALM, a large-scale dataset comprising 13k high-quality hand scans from 263 subjects and 90k multi-view images, capturing rich variation in skin tone, age, and geometry. To show its utility, we present a baseline PALM-Net, a multi-subject prior over hand geometry and material properties learned via physically based inverse rendering, enabling realistic, relightable single-image hand avatar personalization. PALM’s scale and diversity make it a valuable real-world resource for hand modeling and related research.
[64] Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments
Laura Alejandra Encinar Gonzalez,John Folkesson,Rudolph Triebel,Riccardo Giubilato
Main category: cs.CV
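TL;DR: The paper presents MPRF, a multimodal loop closure detection method that uses transformer-based foundation models (DINOv2 and SONATA) for the vision and LiDAR modalities, and adds two-stage retrieval and 6-DoF pose estimation.
Details
Motivation: In GNSS-denied environments such as planetary exploration, visual place recognition fails because of aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. A robust multimodal solution is needed.
Contribution: MPRF combines visual and LiDAR features from foundation models such as DINOv2 and SONATA, and proposes a two-stage retrieval strategy with 6-DoF pose estimation, improving the precision and robustness of loop closure detection.
Method: 1) Screen candidates with DINOv2 visual features and SALAD aggregation; 2) Verify geometry with SONATA-based LiDAR descriptors; 3) Combine the two-stage retrieval with 6-DoF pose estimation for efficient loop closure detection.
Result: On the S3LI and S3LI Vulcano datasets, MPRF outperforms existing retrieval methods in precision and makes pose estimation more robust in low-texture regions, while remaining efficient and reliable.
Insight: Combining visual and LiDAR features is promising: foundation models can unify place recognition and pose estimation and provide interpretable correspondences for SLAM back-ends.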
Abstract: Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.
[65] Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis
Dogucan Yaman,Seymanur Akti,Fevziye Irem Eyiokur,Alexander Waibel
Main category: cs.CV
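TL;DR: The paper proposes a text-to-audio-visual synthesis framework built on HierSpeech++ that generates speech and facial animation jointly from a shared latent representation, with two-stage training to handle feature distribution shift.
Details
Motivation: Conventional text-to-audio-visual synthesis uses cascaded pipelines, which leave speech and facial animation poorly synchronized and less realistic. The paper aims for tighter audio-visual alignment through a shared latent representation.
Contribution: 1) A shared latent representation framework based on HierSpeech++; 2) A Text-to-Vec module that generates speech features; 3) Two-stage training that handles the distribution shift of TTS-predicted features.
Method: 1) The Text-to-Vec module generates Wav2Vec2 embeddings; 2) Speech and face generation share the latent representation; 3) Two-stage training (pretraining and fine-tuning) handles the feature distribution shift.
Result: Conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
Insight: A shared latent representation reduces the error accumulation of cascaded systems and improves the overall quality and consistency of audio-visual synthesis.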
Abstract: We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
[66] How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?
Tuan Anh Tran,Duy M. H. Nguyen,Hoai-Chau Tran,Michael Barz,Khoa D. Doan,Roger Wattenhofer,Ngo Anh Vien,Mathias Niepert,Daniel Sonntag,Paul Swoboda
Main category: cs.CV
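TL;DR: The paper proposes gitmerge3D, a globally informed graph token merging method that reduces token count by up to 90-95% while maintaining performance, exposing the token redundancy in existing 3D point cloud transformers.
Details
Motivation: Current 3D point cloud transformers rely on dense token representations, which makes computation and memory costly. The authors find that tokens are highly redundant and aim to improve efficiency by removing that redundancy.
Contribution: The gitmerge3D method, the first assessment of token redundancy in large-scale 3D transformer models, and a demonstration that merging tokens improves efficiency substantially.
Method: Globally informed graph token merging (gitmerge3D) merges redundant tokens to reduce the model's compute and memory overhead.
Result: Validated on multiple 3D vision tasks: token count drops by up to 90-95% while performance stays competitive.
Insight: More tokens do not necessarily yield better performance. Many current models are over-tokenized and under-optimized for scalability, so efficient architecture design should address token redundancy.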
Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io
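Illustration for the token merging idea above: the abstract does not describe gitmerge3D's globally informed graph procedure, so the sketch below substitutes a much simpler greedy scheme that folds each token into the first existing group whose centroid it resembles. The `threshold` value and the toy tokens are assumptions; this shows the general notion of merging redundant tokens, not the paper's algorithm.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_tokens(tokens, threshold=0.95):
    """Greedily fold each token into the first group it resembles."""
    groups = []  # each group is [sum_vector, count]
    for tok in tokens:
        for group in groups:
            centroid = [v / group[1] for v in group[0]]
            if cosine(tok, centroid) >= threshold:
                group[0] = [s + t for s, t in zip(group[0], tok)]
                group[1] += 1
                break
        else:
            # No similar group found: start a new one.
            groups.append([list(tok), 1])
    # Represent each group by the mean of its members.
    return [[v / count for v in total] for total, count in groups]

tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.02, 1.0]]
merged = merge_tokens(tokens)
```

Four tokens collapse to two here because each pair is nearly parallel. Transformer cost grows with token count, so reductions of the size reported in the abstract translate directly into compute and memory savings.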
[67] Photo Dating by Facial Age Aggregation
Jakub Paplham,Vojtech Franc
Main category: cs.CV
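TL;DR: This paper proposes a new method that estimates the year a photo was taken from the ages of the faces in it, and releases the CSFD-1.6M dataset. The method is a probabilistic framework that combines face recognition, age estimation, and career-based temporal priors, and it significantly outperforms scene-based baselines.
Details
Motivation: Existing photo dating methods rely mainly on scene information and overlook the faces of the people in the photo. The paper aims to improve accuracy by jointly exploiting facial age information from multiple people.
Contribution: 1. Proposes the first probabilistic framework that combines face recognition and age estimation for photo dating. 2. Releases the CSFD-1.6M dataset, containing over 1.6 million annotated faces. 3. Demonstrates the clear benefit of aggregating information from multiple faces.
Method: A probabilistic framework combines visual evidence from modern face recognition and age estimation models with career-based temporal priors to infer the year the photo was taken.
Result: Experiments show that aggregating evidence from multiple faces significantly improves performance and outperforms scene-based baselines, particularly on images containing several identifiable individuals.
Insight: The faces in a photo are an important cue for estimating when it was taken, and aggregating several faces significantly improves robustness and accuracy.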
Abstract: We introduce a novel method for Photo Dating which estimates the year a photograph was taken by leveraging information from the faces of people present in the image. To facilitate this research, we publicly release CSFD-1.6M, a new dataset containing over 1.6 million annotated faces, primarily from movie stills, with identity and birth year annotations. Uniquely, our dataset provides annotations for multiple individuals within a single image, enabling the study of multi-face information aggregation. We propose a probabilistic framework that formally combines visual evidence from modern face recognition and age estimation models, and career-based temporal priors to infer the photo capture year. Our experiments demonstrate that aggregating evidence from multiple faces consistently improves the performance and the approach significantly outperforms strong, scene-based baselines, particularly for images containing several identifiable individuals.
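The aggregation idea can be made concrete with a minimal Python sketch. It assumes each recognized face contributes a known birth year plus an age estimate with Gaussian error, and that faces are independent given the capture year. The Gaussian noise model, the flat default prior, and all names (`photo_year_posterior`, `faces`, `sigma`) are illustrative assumptions, not the paper's actual formulation.

```python
import math

def photo_year_posterior(faces, years, prior=None, sigma=4.0):
    """Posterior over capture year from several (birth_year, estimated_age) faces.

    Assumption: each age estimate has Gaussian error with std `sigma` (in years),
    and faces are conditionally independent given the capture year.
    """
    log_post = []
    for y in years:
        lp = math.log(prior[y]) if prior else 0.0  # flat prior when none is given
        for birth_year, est_age in faces:
            implied_age = y - birth_year
            lp += -0.5 * ((implied_age - est_age) / sigma) ** 2
        log_post.append(lp)
    # Normalize in a numerically stable way.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return {y: w / total for y, w in zip(years, weights)}

# Hypothetical example: two identified people in one movie still.
faces = [(1950, 30.0), (1960, 22.0)]  # taken alone, they suggest 1980 and 1982
posterior = photo_year_posterior(faces, list(range(1970, 1991)))
best_year = max(posterior, key=posterior.get)
```

A career-based prior could be supplied through `prior` (a dict mapping each candidate year to a probability) to down-weight years outside a person's active period.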
[68] Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection
Xian-Hong Huang,Hui-Kai Su,Chi-Chia Sun,Jun-Wei Hsieh
Main category: cs.CV
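TL;DR: The paper proposes a cross-modal interaction method for tiny object detection that combines semantic-guided natural language processing with visual recognition. By fusing BERT with PRB-FPN-Net and using optimized backbones (such as ELAN, MSP, and CSP), it significantly improves detection accuracy and efficiency.
Details
Motivation: Tiny object detection is difficult in complex scenes, and existing methods fall short in cross-modal interaction and multi-scale feature fusion. The paper aims to improve detection by combining semantic-guided natural language processing with efficient visual backbones.
Contribution: 1. Proposes a cross-modal interaction framework that combines BERT and PRB-FPN-Net. 2. Incorporates backbones such as ELAN, MSP, and CSP to optimize feature extraction and fusion. 3. Achieves 52.6% AP on COCO, significantly outperforming YOLO-World with only half the parameters of Transformer-based models.
Method: 1. Fuses cross-modal features using BERT and PRB-FPN-Net. 2. Incorporates the ELAN, MSP, and CSP backbones to optimize multi-scale feature extraction. 3. Aligns textual semantics with visual features through lemmatization and fine-tuning.
Result: Reaches 52.6% AP on the COCO2017 validation set, outperforming YOLO-World with half the parameters of Transformer-based models such as GLIP. The variety of backbones further improves detection of multi-scale objects.
Insight: 1. Fusing semantic-guided natural language with visual features can significantly improve detection accuracy for tiny objects. 2. Backbone optimization is critical for multi-scale object detection. 3. Cross-modal interaction offers a new approach to efficient detection in resource-constrained environments.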
Abstract: This paper introduces a cutting-edge approach to cross-modal interaction for tiny object detection by combining semantic-guided natural language processing with advanced visual recognition backbones. The proposed method integrates the BERT language model with the CNN-based Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net), incorporating innovative backbone architectures such as ELAN, MSP, and CSP to optimize feature extraction and fusion. By employing lemmatization and fine-tuning techniques, the system aligns semantic cues from textual inputs with visual features, enhancing detection precision for small and complex objects. Experimental validation using the COCO and Objects365 datasets demonstrates that the model achieves superior performance. On the COCO2017 validation set, it attains a 52.6% average precision (AP), outperforming YOLO-World significantly while maintaining half the parameter consumption of Transformer-based models like GLIP. Several tests on different backbones such as ELAN, MSP, and CSP further demonstrate efficient handling of multi-scale objects, ensuring scalability and robustness in resource-constrained environments. This study underscores the potential of integrating natural language understanding with advanced backbone architectures, setting new benchmarks in object detection accuracy, efficiency, and adaptability to real-world challenges.
[69] TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Junwen Pan,Qizhe Zhang,Rui Zhang,Ming Lu,Xin Wan,Yuan Zhang,Chang Liu,Qi She
Main category: cs.CV
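TL;DR: The paper proposes TimeSearch-R, which optimizes temporal search for long-form video understanding through reinforcement learning with a self-verification mechanism (GRPO-CSV), significantly improving search performance and the completeness of video understanding.
Details
Motivation: Existing temporal search methods rely on hand-crafted search processes and lack end-to-end optimized search strategies, which leads to insufficient search and inconsistent logic. The paper sets out to address this.
Contribution: 1. Proposes the TimeSearch-R framework, which combines temporal search with interleaved text-video thinking. 2. Introduces GRPO-CSV, which improves search completeness through self-verification. 3. Constructs dedicated datasets that raise task difficulty.
Method: 1. Formulates temporal search as a reinforcement learning task. 2. Uses the GRPO-CSV self-verification mechanism to improve the completeness and consistency of search decisions. 3. Builds datasets filtered to keep samples with strong temporal dependencies.
Result: Delivers significant improvements on benchmarks such as Haystack-LVBench and Haystack-Ego4D, and on LongVideoBench surpasses both the base model Qwen2.5-VL and Video-R1.
Insight: With self-verification and multi-stage dataset optimization, reinforcement learning can significantly improve the logical consistency and completeness of temporal search in long videos.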
Abstract: Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
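The group-relative part of GRPO can be illustrated with a short Python sketch: several rollouts are sampled for one query, and each rollout's reward is normalized by the mean and standard deviation of its own group. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are illustrative assumptions. The completeness self-verification reward of GRPO-CSV is not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts answering the same video question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.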
[70] Visual Spatial Tuning
Rui Yang,Ziyu Zhu,Yanwei Li,Jingjia Huang,Shen Yan,Siyuan Zhou,Zhe Liu,Xiangtai Li,Shuangye Li,Wenqian Wang,Yi Lin,Hengshuang Zhao
Main category: cs.CV
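TL;DR: The paper proposes the Visual Spatial Tuning (VST) framework, which builds large-scale datasets and a progressive training pipeline to strengthen the spatial perception and reasoning of vision-language models (VLMs) without harming their general capabilities.
Details
Motivation: Existing VLMs fall short in spatial perception and reasoning. Adding extra expert encoders improves spatial ability but increases overhead and harms general capabilities, so a method that enhances spatial ability without sacrificing generality is needed.
Contribution: Proposes the VST framework, constructs the VST-P (spatial perception) and VST-R (spatial reasoning) datasets in stages, and designs a progressive training pipeline (supervised fine-tuning followed by reinforcement learning) that significantly improves spatial ability without harming general capabilities.
Method: 1. Constructs the VST-P (4.1M samples) and VST-R (135K samples) datasets. 2. Applies a progressive pipeline of supervised fine-tuning followed by reinforcement learning. 3. Evaluates on multiple spatial benchmarks.
Result: Achieves state-of-the-art results on benchmarks such as MMSI-Bench and VSIBench (34.8% and 61.2%, respectively) without harming general capabilities.
Insight: Progressive training and a data-driven approach can significantly improve the spatial ability of VLMs without extra architectural overhead, paving the way for more physically grounded AI.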
Abstract: Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
cs.AI [Back]
[71] ORCHID: Orchestrated Retrieval-Augmented Classification with Human-in-the-Loop Intelligent Decision-Making for High-Risk Property
Maria Mahbub,Vanessa Lama,Sanjay Das,Brian Starks,Christopher Polchek,Saffell Silvers,Lauren Deck,Prasanna Balaprakash,Tirthankar Ghosal
Main category: cs.AI
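TL;DR: ORCHID is a modular agentic system that combines retrieval-augmented generation (RAG) with human oversight for High-Risk Property (HRP) classification, improving transparency and auditability.
Details
Motivation: Traditional HRP classification workflows depend on experts, are inefficient, and struggle to adapt to changing regulatory requirements. A more efficient and auditable solution is needed.
Contribution: Proposes the ORCHID system, which coordinates small agents with human oversight and retrieval-augmented generation to deliver efficient, transparent, and auditable HRP classification.
Method: The system uses modular agents (retriever, classifier, validator, and others) that communicate through agent-to-agent messaging and invoke tools in a model-agnostic way via the Model Context Protocol (MCP), combined with human feedback and audit trails.
Result: Preliminary tests on real HRP cases show that ORCHID improves classification accuracy and traceability, while uncertain items are handled through feedback from Subject Matter Experts (SMEs).
Insight: Shows the practical potential of trustworthy large language model (LLM) assistance in sensitive compliance workflows and underscores the importance of transparency and auditing.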
Abstract: High-Risk Property (HRP) classification is critical at U.S. Department of Energy (DOE) sites, where inventories include sensitive and often dual-use equipment. Compliance must track evolving rules designated by various export control policies to make transparent and auditable decisions. Traditional expert-only workflows are time-consuming, backlog-prone, and struggle to keep pace with shifting regulatory boundaries. We demo ORCHID, a modular agentic system for HRP classification that pairs retrieval-augmented generation (RAG) with human oversight to produce policy-based outputs that can be audited. Small cooperating agents, retrieval, description refiner, classifier, validator, and feedback logger, coordinate via agent-to-agent messaging and invoke tools through the Model Context Protocol (MCP) for model-agnostic on-premise operation. The interface follows an Item to Evidence to Decision loop with step-by-step reasoning, on-policy citations, and append-only audit bundles (run-cards, prompts, evidence). In preliminary tests on real HRP cases, ORCHID improves accuracy and traceability over a non-agentic baseline while deferring uncertain items to Subject Matter Experts (SMEs). The demonstration shows single item submission, grounded citations, SME feedback capture, and exportable audit artifacts, illustrating a practical path to trustworthy LLM assistance in sensitive DOE compliance workflows.
cs.MM [Back]
[72] Automatización de Informes Geotécnicos para Macizos Rocosos con IA
Christofer Valencia,Alexis Llumigusín,Silvia Alvarez,Abrahan Arias,Christian Mejia-Escobar
Main category: cs.MM
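TL;DR: This paper proposes using AI (a multimodal large language model) to generate geotechnical reports automatically from images and field data, replacing the traditional manual process. The evaluation suggests that the generated reports are comparable to expert-written ones.
Details
Motivation: Traditional geotechnical reports are prepared by hand, which is slow and error-prone. The authors propose AI-based automation to improve efficiency and accuracy.
Contribution: The main contribution is a tool based on a multimodal large language model (MLLM) that automatically generates structured geotechnical reports, using iteratively refined prompt engineering to avoid the cost of fine-tuning a large model.
Method: Collect photographs of rock outcrops and hand-sample data, define the report outline, and iteratively refine prompts to optimize the MLLM output.
Result: The system scores 0.455 on BLEU and 0.653 on ROUGE-L, suggesting that the generated reports approach expert quality.
Insight: Iterative prompt refinement can be an effective substitute for expensive fine-tuning of large language models, and incorporating multimodal data such as images improves the model's usefulness and accuracy.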
Abstract: Geotechnical reports are crucial for assessing the stability of rock formations and ensuring safety in modern engineering. Traditionally, these reports are prepared manually based on field observations using compasses, magnifying glasses, and notebooks. This method is slow, prone to errors, and subjective in its interpretations. To overcome these limitations, the use of artificial intelligence techniques is proposed for the automatic generation of reports through the processing of images and field data. The methodology was based on the collection of photographs of rock outcrops and manual samples with their respective descriptions, as well as on the reports prepared during the Geotechnical Studies course. These resources were used to define the report outline, prompt engineering, and validate the responses of a multimodal large language model (MLLM). The iterative refinement of prompts until structured and specific instructions were obtained for each section of the report proved to be an effective alternative to the costly process of fine-tuning the MLLM. The system evaluation establishes values of 0.455 and 0.653 for the BLEU and ROUGE-L metrics, respectively, suggesting that automatic descriptions are comparable to those made by experts. This tool, accessible via the web, with an intuitive interface and the ability to export to standardized formats, represents an innovation and an important contribution for professionals and students of field geology.
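For readers unfamiliar with the ROUGE-L figure quoted above, here is a minimal Python sketch of the metric's core: the longest common subsequence (LCS) between a generated text and a reference, turned into a balanced F-score. Whitespace tokenization, the F1 weighting, and the two example sentences are assumptions for illustration. The paper's exact evaluation settings are not stated in the abstract.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical generated vs. expert description of a rock outcrop.
score = rouge_l_f1(
    "the rock mass is heavily fractured",
    "the rock mass is slightly fractured",
)
```

Because the LCS only requires tokens to appear in the same order, not contiguously, a single substituted word lowers the score without zeroing it.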
cs.SI [Back]
[73] Simulating Misinformation Vulnerabilities With Agent Personas
David Farr,Lynnette Hui Xian Ng,Stephen Prochaska,Iain J. Cruickshank,Jevin West
Main category: cs.SI
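TL;DR: Using agent simulation based on large language models (LLMs), the paper studies how different professions and mental schemas respond to misinformation and validates LLM agents for use in information networks.
Details
Motivation: Misinformation can distort public perception and undermine social stability, but real-world experiments are difficult and ethically problematic. The authors therefore use LLM agent simulation to study how different populations respond to information.
Contribution: 1) Develops an LLM-based agent simulation framework. 2) Validates that LLM agents respond similarly to humans. 3) Reveals the key influence of mental schemas on how information is interpreted.
Method: The authors construct agent personas spanning five professions and three mental schemas, analyze their reactions to news headlines, and validate the agents' accuracy against ground-truth labels and human predictions.
Result: The agents' responses align closely with ground-truth labels and human predictions, and mental schemas explain the agents' reactions to misinformation better than professional background does.
Insight: LLM agents can serve as a tool for studying information networks, and mental schemas are an important factor in how information is interpreted.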
Abstract: Disinformation campaigns can distort public perception and destabilize institutions. Understanding how different populations respond to information is crucial for designing effective interventions, yet real-world experimentation is impractical and ethically challenging. To address this, we develop an agent-based simulation using Large Language Models (LLMs) to model responses to misinformation. We construct agent personas spanning five professions and three mental schemas, and evaluate their reactions to news headlines. Our findings show that LLM-generated agents align closely with ground-truth labels and human predictions, supporting their use as proxies for studying information responses. We also find that mental schemas, more than professional background, influence how agents interpret misinformation. This work provides a validation of LLMs to be used as agents in an agent-based model of an information network for analyzing trust, polarization, and susceptibility to deceptive content in complex social systems.
cs.CY [Back]
[74] Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
Zahida Kausar,Seemab Latif,Raja Khurrum Shahzad,Mehwish Fatima
Main category: cs.CY
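TL;DR: This paper introduces the G-TRACE framework for quantifying the carbon emissions of generative AI (GenAI) and pairs it with the AI Sustainability Pyramid, a governance model that turns emission metrics into actionable policy guidance.
Details
Motivation: The energy demand and carbon emissions of generative AI are an emerging climate risk and need to be quantified to enable sustainable deployment.
Contribution: Proposes the G-TRACE framework and the AI Sustainability Pyramid, quantifying GenAI carbon emissions and providing a governance model.
Method: The cross-modal, region-aware G-TRACE framework combines real-world analytics with microscopic simulation to quantify emissions from both training and inference.
Result: Using Ghibli-style image generation as a case study, the paper estimates 4,309 MWh of energy consumption and 2,068 tCO2 of emissions, illustrating the system-level impact of decentralized inference.
Insight: GenAI carbon emissions should be included in climate-risk frameworks, and data-driven methods help align technological innovation with global decarbonization goals.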
Abstract: Generative Artificial Intelligence (GenAI) represents a rapidly expanding digital infrastructure whose energy demand and associated CO2 emissions are emerging as a new category of climate risk. This study introduces G-TRACE (GenAI Transformative Carbon Estimator), a cross-modal, region-aware framework that quantifies training- and inference-related emissions across modalities and deployment geographies. Using real-world analytics and microscopic simulation, G-TRACE measures energy use and carbon intensity per output type (text, image, video) and reveals how decentralized inference amplifies small per-query energy costs into system-level impacts. Through the Ghibli-style image generation trend (2024-2025), we estimate 4,309 MWh of energy consumption and 2,068 tCO2 emissions, illustrating how viral participation inflates individual digital actions into tonne-scale consequences. Building on these findings, we propose the AI Sustainability Pyramid, a seven-level governance model linking carbon accounting metrics (L1-L7) with operational readiness, optimization, and stewardship. This framework translates quantitative emission metrics into actionable policy guidance for sustainable AI deployment. The study contributes to the quantitative assessment of emerging digital infrastructures as a novel category of climate risk, supporting adaptive governance for sustainable technology deployment. By situating GenAI within climate-risk frameworks, the work advances data-driven methods for aligning technological innovation with global decarbonization and resilience objectives.
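The two headline figures above are linked by the usual accounting identity, emissions = energy × grid carbon intensity. The short Python sketch below works through that arithmetic: dividing the reported totals gives an implied average intensity of roughly 0.48 tCO2 per MWh (about 480 gCO2 per kWh). The function `emissions_tco2` and the single aggregate region are illustrative assumptions. G-TRACE itself is region-aware, and its per-region intensities are not given in the abstract.

```python
def emissions_tco2(energy_mwh_by_region, intensity_by_region):
    """Sum energy (MWh) times grid carbon intensity (tCO2/MWh) over regions."""
    return sum(
        energy_mwh_by_region[region] * intensity_by_region[region]
        for region in energy_mwh_by_region
    )

# Totals reported in the abstract for the Ghibli-style image trend.
REPORTED_ENERGY_MWH = 4309.0
REPORTED_EMISSIONS_TCO2 = 2068.0

# Average intensity implied by those totals (tCO2 per MWh).
implied_intensity = REPORTED_EMISSIONS_TCO2 / REPORTED_ENERGY_MWH

# Sanity check: one aggregate "region" at the implied intensity reproduces the total.
total = emissions_tco2({"all": REPORTED_ENERGY_MWH}, {"all": implied_intensity})
```

With real per-region energy shares and grid intensities, the same function would show how shifting inference to cleaner grids changes the total.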
cs.CR [Back]
[75] Jailbreaking in the Haystack
Rishi Rajesh Shah,Chen Henry Wu,Shashwat Saxena,Ziqian Zhong,Alexander Robey,Aditi Raghunathan
Main category: cs.CR
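TL;DR: This paper proposes NINJA, which jailbreaks aligned language models by appending benign, model-generated content after a harmful user goal. It finds that the position of the harmful goal is critical to safety.
Details
Motivation: As long-context language models become more capable, their safety risks remain understudied, in particular whether extended contexts introduce new vulnerabilities. The paper aims to fill this gap.
Contribution: 1. Proposes NINJA, which jailbreaks language models through careful positioning of text. 2. Reveals the key role that the position of a harmful goal plays in model safety. 3. Shows that, under a fixed compute budget, increasing context length can be more efficient than increasing the number of trials.
Method: NINJA appends benign, model-generated content after a harmful user goal, exploiting long contexts to attack aligned models. Experiments run on the HarmBench benchmark and cover several open and proprietary models.
Result: NINJA significantly raises attack success rates and is more compute-efficient. The study finds that long contexts may constitute a fundamental vulnerability in modern language models.
Insight: Even a benign long context can become a security vulnerability when crafted with careful goal positioning. This calls for stricter safety evaluation, especially in long-context settings.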
Abstract: Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet, the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of harmful goals plays an important role in safety. Experiments on the standard safety benchmark HarmBench show that NINJA significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-optimal – under a fixed compute budget, increasing context length can outperform increasing the number of trials in best-of-N jailbreak. These findings reveal that even benign long contexts – when crafted with careful goal positioning – introduce fundamental vulnerabilities in modern LMs.
[76] Quantifying the Risk of Transferred Black Box Attacks
Disesdi Susanna Cox,Niklas Bunzel
Main category: cs.CR
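TL;DR: The paper studies how to quantify the risk of adversarial attacks in the black-box transfer setting. It proposes a surrogate model selection strategy based on CKA similarity, combined with regression-based estimators for risk quantification.
Details
Motivation: As neural networks spread through security-related products, assessing the risk of adversarial attacks has become a key issue. Black-box transfer attacks, given their high transferability and practical relevance, urgently need a reliable quantification method.
Contribution: Proposes a targeted resilience testing framework based on CKA similarity that selects surrogate models with high and low similarity to the target in order to optimize coverage of adversarial subspaces, and provides regression-based estimators for risk quantification.
Method: Surrogate models are selected by CKA similarity to cover both high- and low-similarity scenarios, and regression-based estimators quantify the risk.
Result: The method gives organizations a practical, actionable tool for quantifying the risk of black-box transfer attacks.
Insight: CKA similarity is an effective criterion for optimizing coverage of adversarial subspaces, and regression-based estimators are practical for real-world risk quantification.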
Abstract: Neural networks have become pervasive across various applications, including security-related products. However, their widespread adoption has heightened concerns regarding vulnerability to adversarial attacks. With emerging regulations and standards emphasizing security, organizations must reliably quantify risks associated with these attacks, particularly regarding transferred adversarial attacks, which remain challenging to evaluate accurately. This paper investigates the complexities involved in resilience testing against transferred adversarial attacks. Our analysis specifically addresses black-box evasion attacks, highlighting transfer-based attacks due to their practical significance and typically high transferability between neural network models. We underline the computational infeasibility of exhaustively exploring high-dimensional input spaces to achieve complete test coverage. As a result, comprehensive adversarial risk mapping is deemed impractical. To mitigate this limitation, we propose a targeted resilience testing framework that employs surrogate models strategically selected based on Centered Kernel Alignment (CKA) similarity. By leveraging surrogate models exhibiting both high and low CKA similarities relative to the target model, the proposed approach seeks to optimize coverage of adversarial subspaces. Risk estimation is conducted using regression-based estimators, providing organizations with realistic and actionable risk quantification.
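Since surrogate selection hinges on Centered Kernel Alignment, a small Python sketch of the commonly used linear CKA formula may help: for column-centered activation matrices X and Y with one row per input, CKA = ||YᵀX||_F² / (||XᵀX||_F · ||YᵀY||_F). Using the linear variant is an assumption, since the abstract does not say which kernel the authors use, and the toy matrices below are made up.

```python
import math

def center_columns(m):
    """Subtract each column's mean (rows are inputs, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices computed on the same inputs."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator

# Hypothetical activations of a target and two candidate surrogates on 4 inputs.
target = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 0.5]]
scaled_copy = [[3.0 * v for v in row] for row in target]
other = [[0.5, 1.0], [1.5, -2.0], [0.0, 3.0], [2.0, 2.0]]

cka_scaled = linear_cka(target, scaled_copy)
cka_other = linear_cka(target, other)
```

Linear CKA is invariant to isotropic scaling, so the scaled copy scores 1.0. A testing framework like the one described could rank candidate surrogates by this score and keep both high- and low-similarity ones.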
cs.IR [Back]
[77] Association via Entropy Reduction
Anthony Gamst,Lawrence Wilson
Main category: cs.IR
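TL;DR: The paper proposes a new score, aver, for identifying associations between documents. Compared with traditional tf-idf, aver has several advantages, notably on large datasets and in having a natural threshold.
Details
Motivation: Although tf-idf was long regarded as the gold standard for assessing association, it performs poorly in some settings. The authors aim to provide a more natural association measure through an entropy-based statistical model (aver).
Contribution: 1. Proposes the aver score. 2. Shows experimentally that aver identifies associations better than tf-idf. 3. Shows that aver offers a natural threshold, distinguishes between top-scoring document pairs, and applies to larger collections of documents.
Method: The aver score is derived from entropy under a statistical model and applies to document pairs or larger sets of documents.
Result: On a real dataset, aver identifies associated document pairs better than tf-idf and shows advantages in thresholding and scalability.
Insight: aver offers a more natural and scalable way to assess association, and may be especially promising for large graph data or other settings where neural networks are not an obvious fit.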
Abstract: Prior to recent successes using neural networks, term frequency-inverse document frequency (tf-idf) was clearly regarded as the best choice for identifying documents related to a query. We provide a different score, aver, and observe, on a dataset with ground truth marking for association, that aver does do better at finding associated pairs than tf-idf. This example involves finding associated vertices in a large graph and that may be an area where neural networks are not currently an obvious best choice. Beyond this one anecdote, we observe that (1) aver has a natural threshold for declaring pairs as unassociated while tf-idf does not, (2) aver can distinguish between pairs of documents for which tf-idf gives a score of 1.0, (3) aver can be applied to larger collections of documents than pairs while tf-idf cannot, and (4) that aver is derived from entropy under a simple statistical model while tf-idf is a construction designed to achieve a certain goal and hence aver may be more “natural.” To be fair, we also observe that (1) writing down and computing the aver score for a pair is more complex than for tf-idf and (2) that the fact that the aver score is naturally scale-free makes it more complicated to interpret aver scores.
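Observation (2) above is easy to reproduce for the tf-idf side. The Python sketch below builds tf-idf vectors with a smoothed idf and compares documents by cosine similarity: two documents with the same term proportions score 1.0 even though one is twice as long. The smoothing formula, the toy corpus, and the function names are assumptions for illustration. The aver score itself is not defined in the abstract and is not implemented here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors using raw term counts and a smoothed idf."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "graph vertex edge",
    "graph vertex edge graph vertex edge",  # same mix of terms, twice as long
    "entropy model",
]
vectors = tfidf_vectors(docs)
score_same_mix = cosine(vectors[0], vectors[1])
score_unrelated = cosine(vectors[0], vectors[2])
```

The cosine ignores vector length, so any pair with proportional term counts ties at the maximum score. That is one way the ties mentioned in the abstract can arise.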
[78] QUESTER: Query Specification for Generative Retrieval
Arthur Satouf,Yuxuan Zong,Habiboulaye Amadou-Boubacar,Pablo Piantanida,Benjamin Piwowarski
Main category: cs.IR
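TL;DR: QUESTER proposes a new generative retrieval approach that turns the query into a simple keyword query (handled by BM25) and trains a small LLM with reinforcement learning to do so, achieving efficient and effective retrieval.
Details
Motivation: Generative retrieval (GR) generates document identifiers directly, but it often struggles to generalize and is costly to scale. QUESTER aims to solve these problems by reframing GR as query specification generation.
Contribution: 1) Proposes QUESTER, which reframes GR as query specification generation (here, a keyword query). 2) Trains a small LLM with a reinforcement learning method (GRPO).
Method: A small LLM rewrites the query as a keyword query that BM25 can process, and the policy is trained with reinforcement learning (GRPO).
Result: Across in-domain and out-of-domain evaluations, QUESTER is more effective than BM25 and competitive with neural IR models, while remaining efficient.
Insight: Breaking the complex generative retrieval task down into a simpler query specification step, combined with reinforcement learning, can significantly improve retrieval performance and scalability.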
Abstract: Generative Retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and directly generating document identifiers. However, GR often struggles to generalize and is costly to scale. We introduce QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation - in this work, a simple keyword query handled by BM25 - using a (small) LLM. The policy is trained using reinforcement learning techniques (GRPO). Across in- and out-of-domain evaluations, we show that our model is more effective than BM25, and competitive with neural IR models, while maintaining good efficiency.
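The retrieval half of this pipeline is standard BM25, which the following self-contained Python sketch scores for a keyword query. The idf variant (with +1 inside the logarithm so it stays non-negative), the defaults `k1=1.2` and `b=0.75`, the toy corpus, and the hand-written keyword query are all assumptions. In QUESTER the keywords would come from the trained LLM policy, which is not modeled here.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of every document for a list of query keywords."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "flow matching kl divergence bound",
    "tiny object detection with bert",
    "bm25 keyword retrieval baseline",
]
# Hypothetical keyword query, standing in for the LLM's rewritten query.
scores = bm25_scores(["keyword", "retrieval"], docs)
best = scores.index(max(scores))
```

Because the index and scorer stay fixed, only the query-writing policy needs training, which is what makes a retrieval-based reward cheap to compute during reinforcement learning.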
eess.IV [Back]
[79] UHDRes: Ultra-High-Definition Image Restoration via Dual-Domain Decoupled Spectral Modulation
S. Zhao,W. Lu,B. Wang,T. Wang,K. Zhang,H. Zhao
Main category: eess.IV
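TL;DR: This paper proposes UHDRes, a lightweight dual-domain decoupled spectral modulation framework for ultra-high-definition (UHD) image restoration. By explicitly modulating amplitude in the frequency domain and implicitly restoring phase in the spatial domain, it significantly reduces computational complexity and memory use while achieving state-of-the-art performance.
Details
Motivation: UHD image restoration is challenged by high resolution and computational cost, and existing methods struggle to deliver both performance and efficiency. The paper aims to provide an efficient, lightweight solution.
Contribution: 1. Proposes UHDRes, a dual-domain decoupled spectral modulation framework. 2. Fuses spatial and frequency-domain features efficiently through a multi-scale context aggregator and a shared gated feed-forward network. 3. Achieves state-of-the-art performance on five public benchmarks with only 400K parameters.
Method: 1. Extracts spatial features with a multi-scale context aggregator. 2. Explicitly enhances amplitude features in the frequency domain and implicitly restores phase information through spatial refinement. 3. Designs a shared gated feed-forward network to promote feature interaction.
Result: UHDRes achieves state-of-the-art restoration performance on multiple UHD benchmarks while significantly reducing inference latency and memory use.
Insight: Decoupling frequency-domain and spatial-domain modulation makes it possible to restore image quality efficiently with a lightweight model, offering a new direction for high-resolution image processing.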
Abstract: Ultra-high-definition (UHD) images often suffer from severe degradations such as blur, haze, rain, or low-light conditions, which pose significant challenges for image restoration due to their high resolution and computational demands. In this paper, we propose UHDRes, a novel lightweight dual-domain decoupled spectral modulation framework for UHD image restoration. It explicitly models the amplitude spectrum via lightweight spectrum-domain modulation, while restoring phase implicitly through spatial-domain refinement. We introduce the spatio-spectral fusion mechanism, which first employs a multi-scale context aggregator to extract local and global spatial features, and then performs spectral modulation in a decoupled manner. It explicitly enhances amplitude features in the frequency domain while implicitly restoring phase information through spatial refinement. Additionally, a shared gated feed-forward network is designed to efficiently promote feature interaction through shared-parameter convolutions and adaptive gating mechanisms. Extensive experimental comparisons on five public UHD benchmarks demonstrate that our UHDRes achieves the state-of-the-art restoration performance with only 400K parameters, while significantly reducing inference latency and memory usage. The codes and models are available at https://github.com/Zhao0100/UHDRes.
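The amplitude/phase split at the heart of this design can be shown on a toy 1D signal. The Python sketch below takes a discrete Fourier transform, scales only the amplitude spectrum by a per-frequency gain, leaves the phase untouched, and transforms back. The 1D setting, the fixed `gain` vector standing in for a learned modulation, and the function names are simplifying assumptions. The paper operates on image features with learned modules.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n for t in range(n)]

def modulate_amplitude(signal, gain):
    """Scale the amplitude spectrum by `gain` while keeping the phase spectrum fixed."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    modulated = [g * a * cmath.exp(1j * p) for g, a, p in zip(gain, amplitude, phase)]
    return [value.real for value in idft(modulated)]

signal = [0.0, 1.0, 0.5, -0.25, 0.75, 0.1, -0.6, 0.3]
unchanged = modulate_amplitude(signal, [1.0] * len(signal))  # identity gain
doubled = modulate_amplitude(signal, [2.0] * len(signal))    # uniform gain of 2
```

A gain that differs across frequencies (kept symmetric so the output stays real) would boost or suppress selected bands, which is the kind of adjustment an amplitude branch can learn while a spatial branch handles structure carried by the phase.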
cs.GR [Back]
[80] Neural Image Abstraction Using Long Smoothing B-Splines
Daniel Berio,Michael Stroh,Sylvain Calinon,Frederic Fol Leymarie,Oliver Deussen,Ariel Shamir
Main category: cs.GR
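TL;DR: This paper presents a way to integrate smoothing B-splines into a differentiable vector graphics (DiffVG) pipeline through a linear mapping, producing smooth paths of arbitrary length for use in image-based deep learning.
Details
Motivation: Existing approaches are limited in generating smooth paths and in controlling the trade-off between simplicity and fidelity. The paper addresses these limits by combining B-splines with improved control mechanisms.
Contribution: The main contribution is integrating smoothing B-splines into the DiffVG pipeline with support for derivative-based smoothing costs, enabling smooth path generation and flexible control over fidelity versus simplicity.
Method: B-splines are introduced into the DiffVG pipeline through a linear mapping, and derivative-based smoothing costs are used to balance path smoothness and fidelity.
Result: The method's versatility is shown in four applications: stylized space-filling path generation, stroke-based image abstraction, closed-area image abstraction, and stylized text generation.
Insight: Introducing B-splines not only improves path smoothness but also gives deep learning systems more flexible control over geometry and image stylization.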
Abstract: We integrate smoothing B-splines into a standard differentiable vector graphics (DiffVG) pipeline through linear mapping, and show how this can be used to generate smooth and arbitrarily long paths within image-based deep learning systems. We take advantage of derivative-based smoothing costs for parametric control of fidelity vs. simplicity tradeoffs, while also enabling stylization control in geometric and image spaces. The proposed pipeline is compatible with recent vector graphics generation and vectorization methods. We demonstrate the versatility of our approach with four applications aimed at the generation of stylized vector graphics: stylized space-filling path generation, stroke-based image abstraction, closed-area image abstraction, and stylized text generation.
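The phrase "through linear mapping" can be illustrated with a plain uniform cubic B-spline: sampled curve points are a fixed matrix times the control points, so gradients flow straight back to the controls. The Python sketch below builds that matrix and adds a discrete second-difference penalty as a stand-in for a derivative-based smoothing cost. The uniform cubic basis, the penalty, and all function names are assumptions. The paper's smoothing B-splines and the DiffVG integration are not reproduced here.

```python
def cubic_bspline_weights(u):
    """Uniform cubic B-spline blending weights for local parameter u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )

def basis_matrix(n_ctrl, samples_per_segment):
    """Rows map n_ctrl control points to sampled curve points (a fixed linear map)."""
    rows = []
    for seg in range(n_ctrl - 3):
        for s in range(samples_per_segment):
            row = [0.0] * n_ctrl
            for offset, w in enumerate(cubic_bspline_weights(s / samples_per_segment)):
                row[seg + offset] = w
            rows.append(row)
    return rows

def apply_map(matrix, control_points):
    """Multiply the basis matrix by 2D control points."""
    return [
        tuple(sum(w * p[d] for w, p in zip(row, control_points)) for d in range(2))
        for row in matrix
    ]

def smoothing_cost(curve):
    """Sum of squared second differences, a discrete stand-in for a derivative penalty."""
    return sum(
        (a[d] - 2 * b[d] + c[d]) ** 2
        for a, b, c in zip(curve, curve[1:], curve[2:])
        for d in range(2)
    )

straight = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
zigzag = [(0.0, 0.0), (1.0, 2.0), (2.0, -2.0), (3.0, 2.0), (4.0, 0.0)]
B = basis_matrix(len(straight), samples_per_segment=4)
straight_curve = apply_map(B, straight)
zigzag_curve = apply_map(B, zigzag)
```

Adding more control points extends the path without changing the per-segment weights, which is one reason a B-spline parameterization suits arbitrarily long paths.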
cs.SD [Back]
[81] A Penny for Your Thoughts: Decoding Speech from Inexpensive Brain Signals
Quentin Auster,Kateryna Shapovalenko,Chuang Ma,Demaio Sun
Main category: cs.SD
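TL;DR: The paper explores decoding EEG signals into speech with neural networks. A model is trained with a contrastive loss to align EEG embeddings with embeddings from a pre-trained speech model, and three architectural modifications are proposed, two of which improve performance, showing potential for brain-computer interfaces.
Details
Motivation: The study tests whether neural networks can decode low-cost EEG signals into speech, opening new possibilities for brain-computer interface applications.
Contribution: Proposes three architectural modifications to an EEG decoder (subject-specific attention layers, personalized spatial attention, and a dual-path RNN), two of which improve performance.
Method: A model is trained with a contrastive CLIP loss to map EEG signals into the embedding space of a pre-trained speech model, combined with three new architectural designs.
Result: Two of the modifications (subject-specific attention layers and personalized spatial attention) yield WER improvements of 0.15% and 0.45%, respectively, supporting the value of personalized architectures.
Insight: Personalized architectures show promise for EEG decoding and may play an important role in future applications such as brain-computer interfaces.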
Abstract: We explore whether neural networks can decode brain activity into speech by mapping EEG recordings to audio representations. Using EEG data recorded as subjects listened to natural speech, we train a model with a contrastive CLIP loss to align EEG-derived embeddings with embeddings from a pre-trained transformer-based speech model. Building on the state-of-the-art EEG decoder from Meta, we introduce three architectural modifications: (i) subject-specific attention layers (+0.15% WER improvement), (ii) personalized spatial attention (+0.45%), and (iii) a dual-path RNN with attention (-1.87%). Two of the three modifications improved performance, highlighting the promise of personalized architectures for brain-to-speech decoding and applications in brain-computer interfaces.
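A contrastive CLIP loss of the kind mentioned above can be written in a few lines. The Python sketch below L2-normalizes paired EEG and speech embeddings, forms a temperature-scaled similarity matrix, and averages the cross-entropy in both directions so that each EEG segment should pick out its own speech segment and vice versa. The temperature of 0.07, the 2D toy embeddings, and the function names are assumptions, not values from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(eeg_embeddings, speech_embeddings, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    eeg = [normalize(v) for v in eeg_embeddings]
    speech = [normalize(v) for v in speech_embeddings]
    logits = [[sum(a * b for a, b in zip(e, s)) / temperature for s in speech] for e in eeg]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch: pair i of EEG and speech embeddings should match.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
misaligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

The loss is near zero when each EEG embedding points in the same direction as its paired speech embedding and large when the pairing is swapped, which is the signal that pulls the two embedding spaces together during training.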
cs.RO [Back]
[82] EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation
Samarth Chopra,Alex McMoil,Ben Carnovale,Evan Sokolson,Rajkumar Kubendran,Samuel Dickerson
Main category: cs.RO
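TL;DR: EveryDayVLA is a low-cost, high-performing vision-language-action (VLA) model for robotic manipulation that uses adaptive planning for safe, reliable operation and performs strongly on LIBERO and in real-world tests.
Details
Motivation: Current VLA models typically depend on expensive hardware and perform poorly in cluttered or novel scenes. The paper aims to make robotic manipulation more affordable and reliable through low-cost hardware and efficient algorithms.
Contribution: 1. Designs a 6-DOF manipulator that costs under $300. 2. Proposes a unified model that outputs both discrete and continuous actions. 3. Introduces an adaptive-horizon ensemble that adjusts planning dynamically.
Method: 1. Builds a low-cost hardware platform. 2. Develops a VLA model with joint discrete and continuous outputs. 3. Uses an adaptive-horizon ensemble to monitor motion uncertainty and trigger re-planning.
Result: Matches state-of-the-art performance on the LIBERO benchmark and, in real-world tests, outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution.
Insight: Combining low-cost hardware with efficient algorithms can significantly improve both the performance and the accessibility of robotic manipulation, offering a practical option for homes and research labs.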
Abstract: While Vision-Language-Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for under $300, capable of modest payloads and workspace. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensemble monitors motion uncertainty to trigger on-the-fly re-planning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model and paves the way for economical use in homes and research labs alike. Experiment videos and details: https://everydayvla.github.io/
cs.LG [Back]
[83] Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
Jiwoo Shin,Byeonghu Na,Mina Kang,Wonhyeok Choi,Il-chul Moon
Main category: cs.LG
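TL;DR: This paper finds that, for defending text-to-image models against harmful content, combining fine-tuned (unlearned) models with training-free guidance works poorly. It proposes a simple fix, obtaining implicit negative embeddings through concept inversion, which improves the defense success rate.
Details
Motivation: Text-to-image models can be induced to generate harmful content by malicious prompts, and existing defenses do not combine well. A compatible and efficient solution is needed.
Contribution: Proposes obtaining implicit negative embeddings through concept inversion, which improves defense without modifying existing frameworks.
Method: The explicit negative prompts used in training-free guidance are replaced with implicit negative embeddings obtained through concept inversion, making the approach compatible with both the fine-tuning and the guidance paradigm.
Result: On nudity and violence benchmarks, the method consistently improves the defense success rate while preserving the core semantics of the input prompts.
Insight: An underlying incompatibility between fine-tuning and training-free guidance is the key reason their combination underperforms, and implicit negative embeddings from concept inversion capture harmful concepts more effectively.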
Abstract: Recent advances in text-to-image generative models have raised concerns about their potential to produce harmful content when provided with malicious input text prompts. To address this issue, two main approaches have emerged: (1) fine-tuning the model to unlearn harmful concepts and (2) training-free guidance methods that leverage negative prompts. However, we observe that combining these two orthogonal approaches often leads to marginal or even degraded defense performance. This observation indicates a critical incompatibility between two paradigms, which hinders their combined effectiveness. In this work, we address this issue by proposing a conceptually simple yet experimentally robust method: replacing the negative prompts used in training-free methods with implicit negative embeddings obtained through concept inversion. Our method requires no modification to either approach and can be easily integrated into existing pipelines. We experimentally validate its effectiveness on nudity and violence benchmarks, demonstrating consistent improvements in defense success rate while preserving the core semantics of input prompts.
[84] On Flow Matching KL Divergence
Maojiang Su,Jerry Yao-Chieh Hu,Sophia Pi,Han Liu
Main category: cs.LG
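TL;DR: The paper derives a deterministic, non-asymptotic upper bound on the Kullback-Leibler (KL) divergence of the flow matching distribution approximation, and shows that the bound implies statistical convergence under the Total Variation (TV) distance with nearly minimax-optimal efficiency.
Details
Motivation: To study the statistical properties of flow matching for distribution approximation, in particular theoretical guarantees in terms of KL divergence, and to close the theoretical gap in statistical efficiency between flow matching and diffusion models.
Contribution: Derives a non-asymptotic upper bound on the KL divergence for flow matching and establishes statistical convergence under the TV distance, showing that flow matching is nearly minimax-optimal for estimating smooth distributions.
Method: Theoretical analysis yields an explicit relationship between the flow matching loss and the KL divergence, and numerical studies corroborate the theory.
Result: When the flow matching loss is bounded by ε², the KL divergence is bounded above by A₁ε + A₂ε², and flow matching converges statistically under the TV distance.
Insight: Flow matching is comparable to diffusion models in statistical efficiency, which gives theoretical support for practical use, especially where smooth distributions must be estimated efficiently.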
Abstract: We derive a deterministic, non-asymptotic upper bound on the Kullback-Leibler (KL) divergence of the flow-matching distribution approximation. In particular, if the $L_2$ flow-matching loss is bounded by $\epsilon^2 > 0$, then the KL divergence between the true data distribution and the estimated distribution is bounded by $A_1 \epsilon + A_2 \epsilon^2$. Here, the constants $A_1$ and $A_2$ depend only on the regularities of the data and velocity fields. Consequently, this bound implies statistical convergence rates of Flow Matching Transformers under the Total Variation (TV) distance. We show that flow matching achieves nearly minimax-optimal efficiency in estimating smooth distributions. Our results make the statistical efficiency of flow matching comparable to that of diffusion models under the TV distance. Numerical studies on synthetic and learned velocities corroborate our theory.
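One standard way to see how a KL bound of this form controls the TV distance is Pinsker's inequality. The worked step below is a sketch under that assumption; the abstract does not say which argument the authors use. Here $p$ is the true data distribution and $\hat{p}$ the estimated one (both symbols are introduced for illustration).

```latex
% Pinsker's inequality: TV(p, \hat{p}) \le \sqrt{ KL(p \| \hat{p}) / 2 }.
% Combined with the stated bound KL(p \| \hat{p}) \le A_1 \epsilon + A_2 \epsilon^2:
\[
  \mathrm{TV}(p, \hat{p})
  \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p \,\|\, \hat{p})}
  \;\le\; \sqrt{\tfrac{1}{2}\left(A_1 \epsilon + A_2 \epsilon^2\right)} .
\]
% For small \epsilon the linear term dominates, so the right-hand side
% scales like \sqrt{A_1 \epsilon / 2}.
```

Under this route, a flow-matching loss of order $\epsilon^2$ gives a TV distance of order $\epsilon^{1/2}$, so any rate at which the loss shrinks carries over to a TV rate.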
cs.HC
[85] Enhancing Public Speaking Skills in Engineering Students Through AI
Amol Harsh, Brainerd Prince, Siddharth Siddharth, Deepan Raj Prabakar Muthirayan, Kabir S Bhalla, Esraaj Sarkar Gupta, Siddharth Sahu
Main category: cs.HC
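TL;DR: The paper proposes a multi-modal AI system for improving engineering students' public speaking skills. It combines speech analysis, computer vision, and sentiment detection to give integrated feedback on verbal and non-verbal communication.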
Details
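Motivation: Engineering students need effective communication skills, but traditional teaching cannot provide sustained, personalized training. Giving comprehensive feedback on both verbal and non-verbal aspects is especially time-intensive.
Contribution: A multi-modal AI assessment model that integrates verbal delivery, non-verbal behavior, and "expressive coherence" (a novel measure of alignment between speech and body language) to provide personalized public speaking feedback.
Method: Combines speech analysis (pitch, loudness, pacing, intonation), computer vision (facial expressions, gestures, posture), and sentiment detection into a multi-modal AI assessment system.
Result: In preliminary testing, the AI-generated feedback was moderately aligned with expert evaluations. Among the LLMs evaluated, Gemini Pro performed best, showing the strongest agreement with human annotators.
Insight: Fusing multiple modalities allows a more comprehensive assessment of public speaking performance, and AI-driven feedback tools can support the repeated practice students need to improve their communication skills.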
Abstract: This research-to-practice full paper was inspired by the persistent challenge in effective communication among engineering students. Public speaking is a necessary skill for future engineers as they have to communicate technical knowledge with diverse stakeholders. While universities offer courses or workshops, they are unable to offer sustained and personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, making consistent and individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integration ensuring alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing demonstrated that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro emerged as the best-performing, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, crucial for impactful and professional communication.
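The abstract reports that AI feedback was "moderately aligned" with expert evaluations but does not name the agreement measure. As one plausible way to quantify such alignment, the sketch below computes a Pearson correlation between AI and expert scores. The choice of metric, the function name `pearson`, and the score lists are all assumptions for illustration; the numbers are made up and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative scores for five hypothetical speeches (NOT paper data).
ai_scores = [3.0, 4.0, 2.5, 4.5, 3.5]
expert_scores = [3.5, 4.0, 2.0, 4.0, 4.0]

agreement = pearson(ai_scores, expert_scores)
print(f"AI vs. expert agreement (Pearson r): {agreement:.2f}")
```

A rank-based measure such as Spearman correlation, or a chance-corrected one such as Cohen's kappa for categorical ratings, would be reasonable alternatives depending on how the rubric is scored.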