Table of Contents
- cs.CL [Total: 46]
- cs.CV [Total: 79]
- cs.AI [Total: 6]
- cs.MM [Total: 2]
- quant-ph [Total: 1]
- cs.RO [Total: 2]
- cs.IR [Total: 3]
- cs.LG [Total: 15]
- cs.HC [Total: 3]
- cs.GR [Total: 1]
- eess.SP [Total: 1]
- cs.CR [Total: 1]
- cs.SD [Total: 2]
- cs.CY [Total: 4]
- cs.NI [Total: 1]
cs.CL
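[1] Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle cs.CL | cs.AI | cs.LG
Zihan Wang, Cheng Tang, Lei Gong, Cheng Li, Chao Wang
TL;DR: This paper proposes Crystal-KV, an efficient KV cache management framework designed for chain-of-thought (CoT) reasoning. Its core is the answer-first principle: by mapping answer preferences onto the think-stage attention map, it separates CrystalKV, which genuinely contributes to the final answer, from SlipKV, which may introduce misleading context. It then applies an attention-based Least Recently Frequently Used algorithm and an adaptive cache budget allocation algorithm to optimize cache management.
Motivation: CoT reasoning improves LLM accuracy on complex tasks, but its long think-stage sequences make the KV cache consume excessive memory. Conventional KV compression strategies treat all tokens alike, whereas CoT places more weight on the final answer, so these strategies perform poorly in CoT settings.
Result: Experiments show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and delivers faster response times, while maintaining or even improving answer accuracy in CoT reasoning.
Insight: The novelty lies in the answer-first principle for ranking the importance of KV cache entries, together with eviction and budget allocation algorithms built around it. Viewed objectively, the work tightly couples task semantics (answer importance) with system optimization (cache management), offering a new approach to efficient deployment of specific reasoning paradigms such as CoT.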
Abstract: Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache. Unlike traditional generation tasks where all tokens are uniformly important, CoT emphasizes the final answer, rendering conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences into the think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used algorithm. It precisely identifies when a SlipKV entry’s utility expires and evicts it, retaining CrystalKV without disrupting reasoning flow. Finally, we introduce an adaptive cache budget allocation algorithm. Based on the dynamic proportion of CrystalKV, it estimates the importance of each layer/head and adjusts the KV cache budget during inference, amplifying critical components to improve budget utilization. Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and enables faster response times, while maintaining, or even improving, answer accuracy for CoT reasoning.
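To make the eviction idea in the Crystal-KV abstract more concrete, here is a minimal sketch of a generic LRFU-style policy in which each cached position accumulates the attention it receives and older attention decays exponentially. It is an illustration under stated assumptions, not the paper's algorithm: the class `AttentionLRFUCache`, the `decay` parameter, and the scoring rule are hypothetical, and the sketch does not model the CrystalKV/SlipKV distinction or per-layer budget allocation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    score: float      # decayed sum of attention this KV entry has received
    last_step: int    # decoding step at which `score` was last updated


class AttentionLRFUCache:
    """Toy KV-cache index (hypothetical): evicts the lowest decayed-attention entry."""

    def __init__(self, budget: int, decay: float = 0.5):
        self.budget = budget          # maximum number of cached positions
        self.decay = decay            # per-step decay applied to old attention
        self.entries: Dict[int, Entry] = {}

    def _current(self, entry: Entry, step: int) -> float:
        # Score as seen at `step`: older attention counts exponentially less.
        return entry.score * self.decay ** (step - entry.last_step)

    def observe(self, step: int, attention: Dict[int, float]) -> List[int]:
        """Record attention weights for this step and return evicted positions."""
        for pos, weight in attention.items():
            entry = self.entries.get(pos)
            if entry is None:
                self.entries[pos] = Entry(score=weight, last_step=step)
            else:
                entry.score = self._current(entry, step) + weight
                entry.last_step = step
        evicted = []
        while len(self.entries) > self.budget:
            victim = min(self.entries,
                         key=lambda p: self._current(self.entries[p], step))
            del self.entries[victim]
            evicted.append(victim)
        return evicted


cache = AttentionLRFUCache(budget=2, decay=0.5)
cache.observe(0, {0: 0.9, 1: 0.1})
print(cache.observe(1, {1: 0.2, 2: 0.8}))  # position 1 has the lowest score
```

Combining recency and frequency in one decayed score is the defining trait of LRFU-style policies; a real system would apply it per layer and head over attention tensors rather than a Python dictionary.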
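[2] RAM-SD: Retrieval-Augmented Multi-agent framework for Sarcasm Detection cs.CL
Ziyang Zhou, Ziqi Liu, Yan Wang, Yiming Lin, Yangbin Chen
TL;DR: This paper proposes RAM-SD, a retrieval-augmented multi-agent framework for sarcasm detection. It operates in four stages: contextual retrieval, a meta-planner that classifies the sarcasm type and selects a reasoning plan, specialized agents that perform multi-view analysis, and an integrator that produces a final, interpretable judgment. The framework achieves state-of-the-art performance on four standard benchmarks.
Motivation: Sarcasm detection depends on nuanced contextual understanding, world knowledge, and multi-faceted linguistic cues; existing methods apply a uniform reasoning strategy and struggle with its diverse analytical demands.
Result: On four standard benchmarks, RAM-SD achieves a Macro-F1 of 77.74%, a state-of-the-art result that exceeds the strong GPT-4o+CoC baseline by 7.01 percentage points.
Insight: The main novelty is decomposing sarcasm detection into type classification and collaborative multi-agent analysis, combined with retrieval augmentation and interpretable natural-language reasoning, which yields a transparent decision path.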
Abstract: Sarcasm detection remains a significant challenge due to its reliance on nuanced contextual understanding, world knowledge, and multi-faceted linguistic cues that vary substantially across different sarcastic expressions. Existing approaches, from fine-tuned transformers to large language models, apply a uniform reasoning strategy to all inputs, struggling to address the diverse analytical demands of sarcasm. These demands range from modeling contextual expectation violations to requiring external knowledge grounding or recognizing specific rhetorical patterns. To address this limitation, we introduce RAM-SD, a Retrieval-Augmented Multi-Agent framework for Sarcasm Detection. The framework operates through four stages: (1) contextual retrieval grounds the query in both sarcastic and non-sarcastic exemplars; (2) a meta-planner classifies the sarcasm type and selects an optimal reasoning plan from a predefined set; (3) an ensemble of specialized agents performs complementary, multi-view analysis; and (4) an integrator synthesizes these analyses into a final, interpretable judgment with a natural language explanation. Evaluated on four standard benchmarks, RAM-SD achieves a state-of-the-art Macro-F1 of 77.74%, outperforming the strong GPT-4o+CoC baseline by 7.01 points. Our framework not only sets a new performance benchmark but also provides transparent and interpretable reasoning traces, illuminating the cognitive processes behind sarcasm comprehension.
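[3] Dynamic Role Assignment for Multi-Agent Debate cs.CL | cs.AI
Miao Zhang, Junsik Kim, Siyuan Xiang, Jian Gao, Cheng Cao
TL;DR: This paper proposes a dynamic role assignment framework for multi-agent large language model (LLM) and vision-language model (VLM) debate systems. Through a two-stage meta-debate consisting of proposal and peer review, it selects the most suitable agent for each role before the actual debate, based on model capability and role fit, thereby improving complex problem solving.
Motivation: Existing LLM/VLM multi-agent debate systems assign specialized agents to different roles but do not use each model's own strengths to decide which model best fits which role, so role assignment may be suboptimal.
Result: On LLM problem-solving benchmarks, applying the method on top of existing debate systems consistently outperforms uniform assignment (the same model in every role) and random assignment, by up to 74.8% and 29.7% respectively, depending on the task and the specific assignment.
Insight: The core novelty is the meta-debate mechanism: role-tailored proposals plus peer review against data and role-specific criteria make agent capabilities visible and enable dynamic role matching, moving multi-agent system design from static deployment toward dynamic, capability-aware selection.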
Abstract: Multi-agent large language model (LLM) and vision-language model (VLM) debate systems employ specialized roles for complex problem-solving, yet model specializations are not leveraged to decide which model should fill which role. We propose dynamic role assignment, a framework that runs a Meta-Debate to select suitable agents before the actual debate. The meta-debate has two stages: (1) proposal, where candidates provide role-tailored arguments, and (2) peer review, where proposals are scored with data and role-specific criteria to choose the best agent for each position. We evaluate our method on LLM problem solving benchmarks. Applied on top of existing debate systems, our approach consistently outperforms uniform assignments (filling all roles with the same model) by up to 74.8% and random assignments (assigning models to roles without considering their suitability) by up to 29.7%, depending on the task and the specific assignment. This work establishes a new paradigm for multi-agent system design, shifting from static agent deployment to dynamic and capability-aware selection.
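The selection step of the meta-debate described above can be sketched as follows. This is a minimal illustration, assuming the peer-review stage has already produced numeric scores; the function `assign_roles`, the data layout, and the rule of ignoring self-reviews are assumptions made for the example, not details confirmed by the abstract.

```python
from statistics import mean
from typing import Dict

# reviews[role][candidate][reviewer] = score `reviewer` gave `candidate`'s proposal
Reviews = Dict[str, Dict[str, Dict[str, float]]]


def assign_roles(reviews: Reviews) -> Dict[str, str]:
    """Pick, for each role, the candidate with the highest mean peer score."""
    assignment = {}
    for role, proposals in reviews.items():
        def peer_mean(candidate: str) -> float:
            # Assumption: a model's score for its own proposal is ignored.
            scores = [s for reviewer, s in proposals[candidate].items()
                      if reviewer != candidate]
            return mean(scores) if scores else 0.0
        assignment[role] = max(proposals, key=peer_mean)
    return assignment


# Hypothetical scores for two roles and three models.
reviews = {
    "solver": {
        "model_a": {"model_a": 10, "model_b": 6, "model_c": 7},
        "model_b": {"model_a": 8, "model_b": 10, "model_c": 7},
    },
    "critic": {
        "model_a": {"model_b": 9, "model_c": 8},
        "model_b": {"model_a": 5, "model_c": 6},
    },
}
print(assign_roles(reviews))
```

In a full system the scores would come from LLM reviewers applying role-specific criteria; the sketch only shows how such scores could be aggregated into an assignment.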
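[4] Beyond Factual QA: Mentorship-Oriented Question Answering over Long-Form Multilingual Content cs.CL | cs.AI
Parth Bhalerao, Diola Dsouza, Ruiwen Guan, Oana Ignat
TL;DR: This paper introduces MentorQA, the first multilingual dataset and evaluation framework focused on mentorship-oriented question answering, which evaluates how well AI systems provide reflection and guidance rather than factual accuracy alone. The dataset contains nearly 9,000 QA pairs drawn from 180 hours of video content in four languages. The paper compares several QA architectures and finds that multi-agent pipelines produce the best mentorship responses, especially on complex topics and lower-resource languages.
Motivation: Existing QA systems are evaluated mainly on factual correctness, but real-world applications such as education and career guidance call for mentor-style responses that provide reflection and guidance. Existing benchmarks rarely capture this distinction, particularly in multilingual and long-form settings.
Result: On MentorQA, multi-agent QA architectures consistently produce higher-quality mentorship responses than single-agent, dual-agent, and RAG architectures, with especially strong gains on complex topics and lower-resource languages. The paper also analyzes the reliability of automated LLM-based evaluation and finds that its agreement with human judgments varies substantially.
Insight: The main novelty is framing mentorship-focused QA as a distinct research problem and building the first multilingual, long-form benchmark for it. Viewed objectively, the evaluation dimensions beyond factual accuracy (clarity, alignment, learning value) and the systematic comparison across agent architectures give educational AI new perspectives and tools for evaluation design and agent architecture research.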
Abstract: Question answering systems are typically evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long-form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single-Agent, Dual-Agent, RAG, and Multi-Agent QA architectures under controlled conditions. Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains for complex topics and lower-resource languages. We further analyze the reliability of automated LLM-based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM-SCU/MentorQA.
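[5] Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding cs.CL | cs.LG
Seyyed Saeid Cheshmi, Hahnemann Ortiz, James Mooney, Dongyeop Kang
TL;DR: This paper proposes a three-step framework for building efficient vision-language models (VLMs) that understand multimodal figurative language such as sarcasm, humor, and metaphor. The framework aims to interpret multimodal figurative language, provide transparent reasoning traces, and generalize across figurative styles. Experiments show that reasoning traces substantially improve multimodal figurative understanding, that learned reasoning transfers between related styles, and that joint training yields a generalized reasoning VLM that outperforms much larger open- and closed-source models.
Motivation: Existing VLMs perform well on literal multimodal tasks but still struggle with figurative language (sarcasm, humor, metaphor), which conveys intent and emotion through subtle incongruities and in which accompanying images can amplify or invert the textual meaning. This requires models that reason across modalities and account for subjectivity.
Result: Experiments across four figurative styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding; (2) reasoning learned in one style (e.g. sarcasm) transfers to others (e.g. humor), especially between related styles; and (3) joint training across styles yields a generalized reasoning VLM that outperforms much larger open- and closed-source models.
Insight: The novelty is a three-step framework focused on multimodal figurative language understanding that uses verifiable reasoning traces to improve interpretability and generalization. The key insight is that lightweight VLMs equipped with structured, transferable reasoning can stay efficient while generalizing robustly across styles on complex multimodal tasks involving subjectivity and subtle incongruity, which points toward more transparent and more general multimodal reasoning models.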
Abstract: Vision-language models (VLMs) have demonstrated strong reasoning abilities in literal multimodal tasks such as visual mathematics and science question answering. However, figurative language, such as sarcasm, humor, and metaphor, remains a significant challenge, as it conveys intent and emotion through subtle incongruities between expressed and intended meanings. In multimodal settings, accompanying images can amplify or invert textual meaning, demanding models that reason across modalities and account for subjectivity. We propose a three-step framework for developing efficient multimodal reasoning models that can (i) interpret multimodal figurative language, (ii) provide transparent reasoning traces, and (iii) generalize across multiple figurative styles. Experiments across four styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding, (2) reasoning learned in one style can transfer to others, especially between related styles like sarcasm and humor, and (3) training jointly across styles yields a generalized reasoning VLM that outperforms much larger open- and closed-source models. Our findings show that lightweight VLMs with verifiable reasoning achieve robust cross-style generalization while providing inspectable reasoning traces for multimodal tasks. The code and implementation are available at https://github.com/scheshmi/CrossStyle-MMR.
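[6] DF-RAG: Query-Aware Diversity for Retrieval-Augmented Generation cs.CL
Saadat Hasan Khan, Spencer Hong, Jingyu Wu, Kevin Lybarger, Youbing Yin
TL;DR: This paper proposes DF-RAG, which adds a query-aware diversity mechanism to retrieval-augmented generation (RAG) to address the reduced information recall that conventional RAG suffers on reasoning-intensive QA when it retrieves redundant content. Built on the Maximal Marginal Relevance framework, it dynamically selects chunks that are relevant to the query and dissimilar from each other, improving complex QA performance without additional fine-tuning.
Motivation: Common retrieval methods such as cosine similarity tend to retrieve highly relevant but redundant content on reasoning-intensive QA, which lowers information recall and degrades answer quality. The paper addresses this by optimizing retrieval diversity.
Result: On reasoning-intensive QA benchmarks, DF-RAG improves F1 by 4-10% over vanilla cosine-similarity RAG and also outperforms other established baselines. The estimated Oracle ceiling is up to 18% absolute F1 gain over vanilla RAG, of which DF-RAG captures up to 91.3%.
Insight: The novelty is integrating query-aware diversity into the retrieval step dynamically, using the Maximal Marginal Relevance framework to optimize at test time without fine-tuning. Viewed objectively, reducing redundancy and widening information coverage gives complex reasoning tasks a more efficient retrieval strategy, and the idea may extend to other generation tasks that need diverse context.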
Abstract: Retrieval-augmented generation (RAG) is a common technique for grounding language model outputs in domain-specific information. However, RAG is often challenged by reasoning-intensive question-answering (QA), since common retrieval methods like cosine similarity maximize relevance at the cost of introducing redundant content, which can reduce information recall. To address this, we introduce Diversity-Focused Retrieval-Augmented Generation (DF-RAG), which systematically incorporates diversity into the retrieval step to improve performance on complex, reasoning-intensive QA benchmarks. DF-RAG builds upon the Maximal Marginal Relevance framework to select information chunks that are both relevant to the query and maximally dissimilar from each other. A key innovation of DF-RAG is its ability to optimize the level of diversity for each query dynamically at test time without requiring any additional fine-tuning or prior information. We show that DF-RAG improves F1 performance on reasoning-intensive QA benchmarks by 4-10 percent over vanilla RAG using cosine similarity and also outperforms other established baselines. Furthermore, we estimate an Oracle ceiling of up to 18 percent absolute F1 gains over vanilla RAG, of which DF-RAG captures up to 91.3 percent.
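The Maximal Marginal Relevance (MMR) selection that DF-RAG builds on can be sketched in a few lines. This is the standard greedy MMR procedure over embedding vectors, offered as a minimal illustration: the function names and the fixed trade-off parameter `lam` are assumptions for the example, and the sketch does not include DF-RAG's per-query choice of the diversity level at test time.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query: Sequence[float], chunks: List[Sequence[float]],
               k: int, lam: float) -> List[int]:
    """Greedy MMR: lam=1.0 is pure relevance; lower lam penalizes redundancy."""
    selected: List[int] = []
    candidates = list(range(len(chunks)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, chunks[i])
            # Redundancy = similarity to the closest already-selected chunk.
            redundancy = max((cosine(chunks[i], chunks[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D embeddings: chunks 0 and 1 are near-duplicates, chunk 2 is different.
query = [1.0, 1.0]
chunks = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr_select(query, chunks, k=2, lam=1.0))  # relevance only
print(mmr_select(query, chunks, k=2, lam=0.5))  # trades relevance for diversity
```

With `lam=1.0` the two near-duplicate chunks are returned; with `lam=0.5` the second pick switches to the dissimilar chunk, which is the redundancy effect the abstract describes.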
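[7] Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning cs.CL | cs.AI
Massimiliano Pronesti, Anya Belz, Yufang Hou
TL;DR: This paper proposes Verifiable Process Reward Models (VPRMs), a reinforcement learning framework for improving LLM performance on structured reasoning tasks. The framework uses deterministic, rule-based verifiers to check intermediate reasoning steps and is validated on risk-of-bias assessment for medical evidence synthesis.
Motivation: Existing process supervision methods rely on neural judges to score chain-of-thought steps, which exposes them to opacity, bias, and reward hacking. The paper closes this gap by supervising intermediate reasoning with programmatically verifiable rules.
Result: Across multiple datasets, VPRMs adhere closely to domain rules and achieve substantially higher coherence between step-level decisions and final labels. They reach up to 20% higher F1 than state-of-the-art models and 6.5% higher than verifiable outcome rewards, with substantial gains in evidence grounding and logical coherence.
Insight: The novelty is integrating deterministic, rule-based verifiers into a reinforcement learning framework so that intermediate reasoning steps receive verifiable supervision, improving transparency and reliability in structured reasoning domains with well-defined guidelines.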
Abstract: Recent work on reinforcement learning with verifiable rewards (RLVR) has shown that large language models (LLMs) can be substantially improved using outcome-level verification signals, such as unit tests for code or exact-match checks for mathematics. In parallel, process supervision has long been explored as a way to shape the intermediate reasoning behaviour of LLMs, but existing approaches rely on neural judges to score chain-of-thought steps, leaving them vulnerable to opacity, bias, and reward hacking. To address this gap, we introduce Verifiable Process Reward Models (VPRMs), a reinforcement-learning framework in which intermediate reasoning steps are checked by deterministic, rule-based verifiers. We apply VPRMs to risk-of-bias assessment for medical evidence synthesis, a domain where guideline-defined criteria and rule-based decision paths enable programmatic verification of reasoning traces. Across multiple datasets, we find that VPRMs generate reasoning that adheres closely to domain rules and achieve substantially higher coherence between step-level decisions and final labels. Results show that VPRMs achieve up to 20% higher F1 than state-of-the-art models and 6.5% higher than verifiable outcome rewards, with substantial gains in evidence grounding and logical coherence.
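A deterministic process reward of the kind described above might look like the following sketch. Everything specific here is hypothetical: the answer vocabulary, the decision path in `expected_label`, the evidence check, and the 50/50 weighting are invented for illustration and are not the paper's rules or any actual risk-of-bias guideline.

```python
from typing import Dict, List

ALLOWED_ANSWERS = {"yes", "no", "unclear"}  # hypothetical answer vocabulary


def expected_label(answers: List[str]) -> str:
    """Hypothetical rule-based decision path from step answers to a final label."""
    if "no" in answers:
        return "high"
    if "unclear" in answers:
        return "some concerns"
    return "low"


def process_reward(steps: List[Dict[str, str]], final_label: str) -> float:
    """Reward in [0, 1]: half for verifiable steps, half for step/label coherence."""
    # A step passes if its answer is well-formed and it cites some evidence.
    step_ok = [s.get("answer") in ALLOWED_ANSWERS and bool(s.get("evidence"))
               for s in steps]
    step_score = sum(step_ok) / len(steps) if steps else 0.0
    coherent = final_label == expected_label([s.get("answer", "") for s in steps])
    return 0.5 * step_score + 0.5 * float(coherent)


trace = [
    {"answer": "yes", "evidence": "Allocation used a computer-generated sequence."},
    {"answer": "unclear", "evidence": ""},  # no supporting quote: step fails
]
print(process_reward(trace, "some concerns"))
```

Because every check is a plain rule, the reward is reproducible and inspectable, which is the property that distinguishes this style of supervision from a neural judge.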
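[8] Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Generation cs.CL | cs.AI
David Y. Liu, Xanthe Muston, Aditya Joshi, Sebastian Sequoiah-Grayson
TL;DR: This paper proposes d-RLAIF, a reinforcement-learning-based post-training method for automatic story generation (ASG). It uses Todorov's Theory of Narrative Equilibrium to define principles of story quality and employs LLM-as-judge models to provide reward signals, as an alternative to supervised fine-tuning (SFT). Experiments show that d-RLAIF produces stories that are more diverse and better aligned with human narrative conventions.
Motivation: Storytelling is subjective, yet prior ASG work has relied on limited ground truths for training and evaluation. The paper explores reinforcement learning as a post-training alternative to improve story quality and diversity.
Result: With Gemini-3-Flash evaluating model outputs against human-written stories from the TimeTravel dataset, d-RLAIF outperforms SFT in story diversity and alignment with human narrative conventions.
Insight: The novelty is formalizing a narrative theory (Todorov's) into principles and building them into an LLM-based reward model, giving reinforcement learning post-training a theoretical grounding; this offers a new, interpretable RL framework for subjective tasks such as story generation.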
Abstract: Despite the subjective nature of storytelling, past works on automatic story generation (ASG) have relied on limited ground truths for training and evaluation. In this work, we explore reinforcement learning (d-RLAIF) as a post-training alternative to supervised fine-tuning (SFT). We first apply Todorov’s Theory of Narrative Equilibrium to establish principles that define desirable ASG qualities. We prompt 7B and 14B LLM-as-judge models with our principles to test alignment with human annotators and provide reward signals during d-RLAIF. We use Gemini-3-Flash to evaluate the output of our post-trained models and compare them to human-written stories from the TimeTravel dataset. We show that d-RLAIF offers a viable alternative to SFT, producing stories that are more diverse and aligned with human narrative conventions. Our paper demonstrates the promise of reinforcement learning for linguistically grounded post-training for subjective tasks such as ASG.
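[9] Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data cs.CL
Jacob Devasier, Akshith Putta, Qing Wang, Alankrit Moses, Chengkai Li
TL;DR: This paper proposes a frame-guided synthetic claim generation method for automatic fact-checking against high-volume tabular data, and builds a large multilingual dataset of 78,503 synthetic claims grounded in 434 complex OECD tables that average over 500K rows each.
Motivation: Existing automated fact-checking benchmarks largely ignore the challenge of verifying claims against real-world, high-volume structured data and focus on small, curated tables; the paper fills this gap.
Result: Knowledge-probing experiments show that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning. The baseline SQL-generation system shows that the benchmark is highly challenging, and the analysis identifies evidence retrieval as the primary bottleneck.
Insight: The novelty is a frame-guided method that programmatically selects significant data points and generates multilingual claims from six semantic frames, along with a large, multilingual, high-complexity tabular dataset that forces models to rely on retrieval rather than parametric knowledge, a key resource for this unsolved real-world problem.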
Abstract: Automated fact-checking benchmarks have largely ignored the challenge of verifying claims against real-world, high-volume structured data, instead focusing on small, curated tables. We introduce a new large-scale, multilingual dataset to address this critical gap. It contains 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. We propose a novel, frame-guided methodology where algorithms programmatically select significant data points based on six semantic frames to generate realistic claims in English, Chinese, Spanish, and Hindi. Crucially, we demonstrate through knowledge-probing experiments that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning rather than relying on parameterized knowledge. We provide a baseline SQL-generation system and show that our benchmark is highly challenging. Our analysis identifies evidence retrieval as the primary bottleneck, with models struggling to find the correct data in massive tables. This dataset provides a critical new resource for advancing research on this unsolved, real-world problem.
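[10] Parameter Efficient Fine Tuning Llama 3.1 for Answering Arabic Legal Questions: A Case Study on Jordanian Laws cs.CL | cs.AI
Mohammed Fasha, Bassam Hammo, Bilal Sowan, Husam Barham, Esam Nsour
TL;DR: Using Jordanian law as a case study, this work explores fine-tuning Llama-3.1 for Arabic legal question answering with parameter-efficient fine-tuning (PEFT), specifically LoRA adapters combined with 4-bit quantization. The authors build a custom dataset of 6,000 legal question-answer pairs and use the Unsloth framework for accelerated, resource-efficient training. Evaluation with BLEU and ROUGE indicates improved legal reasoning and accuracy for the fine-tuned models, with resource efficiency achieved through quantization and optimized fine-tuning.
Motivation: The paper addresses how to adapt large language models such as Llama-3.1 to the resource-constrained Arabic legal domain (with Jordanian law as the example) for question answering, seeking efficient, resource-saving fine-tuning that preserves performance.
Result: On the custom Jordanian law QA dataset, the fine-tuned Llama-3.1-8B-bnb-4bit and Llama-3.1-8B-Instruct-bnb-4bit models improve over their base versions on BLEU and ROUGE, indicating better legal reasoning and accuracy. The paper does not explicitly report a comparison with the state of the art, but emphasizes the resource efficiency gained from quantization and PEFT.
Insight: The novelty is combining PEFT (LoRA in particular) with 4-bit quantization and the Unsloth framework for resource-efficient adaptation to a specific domain (Arabic law). It provides a workable technical path and case study for fine-tuning large models for low-resource languages or specialized domains under limited compute.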
Abstract: This study uses Jordanian law as a case study to explore the fine-tuning of the Llama-3.1 large language model for Arabic question-answering. Two versions of the model - Llama-3.1-8B-bnb-4bit and Llama-3.1-8B-Instruct-bnb-4bit - were fine-tuned using parameter-efficient fine-tuning (PEFT) with LoRA adapters and 4-bit quantized models, leveraging the Unsloth framework for accelerated and resource-efficient training. A custom dataset of 6000 legal question-answer pairs was curated from Jordanian laws and formatted into structured prompts. Performance was evaluated using the BLEU and the ROUGE metrics to compare the fine-tuned models to their respective base versions. Results demonstrated improved legal reasoning and accuracy while achieving resource efficiency through quantization and optimized fine-tuning strategies. This work underscores the potential of adapting large language models for Arabic legal domains and highlights effective techniques for fine-tuning domain-specific tasks.
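[11] Oops, Wait: Token-Level Signals as a Lens into LLM Reasoning cs.CL
Jaehui Hwang, Dongyoon Han, Sangdoo Yun, Byeongho Heo
TL;DR: By analyzing the probability signals of discourse-like tokens such as "wait" and "therefore" in large language models (LLMs), this paper systematically studies how these tokens reflect the reasoning process. Specific tokens correlate strongly with reasoning correctness; the correlation varies with training strategy but remains stable across model scales. In particular, analysis of the "wait" token shows that models fine-tuned on small-scale datasets acquire reasoning ability through such signals but exploit them only partially.
Motivation: Systematic analyses of how discourse-like tokens such as "wait" and "therefore" vary across training strategies and model scales are lacking; the paper fills this gap to reveal the dynamics of LLM reasoning.
Result: Analyzing token probabilities across multiple models, the study finds that specific tokens (such as "wait") correlate strongly with reasoning correctness, and that this correlation varies across training strategies but is stable across model scales; models fine-tuned on small-scale datasets exploit these signals only partially.
Insight: The novelty is a systematic lens based on token-level signals (such as probabilities) for observing and understanding LLM reasoning dynamics, showing how training strategy affects the use of reasoning signals and offering a new approach to model interpretability.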
Abstract: The emergence of discourse-like tokens such as “wait” and “therefore” in large language models (LLMs) has offered a unique window into their reasoning processes. However, systematic analyses of how such signals vary across training strategies and model scales remain lacking. In this paper, we analyze token-level signals through token probabilities across various models. We find that specific tokens strongly correlate with reasoning correctness, varying with training strategies while remaining stable across model scales. A closer look at the “wait” token in relation to answer probability demonstrates that models fine-tuned on small-scale datasets acquire reasoning ability through such signals but exploit them only partially. This work provides a systematic lens to observe and understand the dynamics of LLM reasoning.
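[12] Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes cs.CL
Gautam Siddharth Kashyap, Harsh Joshi, Niharika Jain, Ebad Shabbir, Jiechao Gao
TL;DR: This paper proposes ConLLM, a hybrid framework for robust multimodal deepfake detection. It uses a two-stage architecture: stage 1 extracts modality-specific embeddings with pre-trained models; stage 2 aligns these embeddings through contrastive learning to mitigate modality fragmentation and applies LLM-based reasoning to capture semantic inconsistencies, addressing shallow inter-modal reasoning.
Motivation: Existing deepfake detection methods have two core limitations: modality fragmentation, which causes poor generalization across diverse and adversarial deepfake modalities, and shallow inter-modal reasoning, which limits detection of fine-grained semantic inconsistencies.
Result: ConLLM performs strongly across audio, video, and audio-visual modalities: it reduces audio deepfake equal error rate (EER) by up to 50%, improves video accuracy by up to 8%, and gains approximately 9% accuracy on audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute consistent 9%-10% improvements across modalities.
Insight: The novelty is combining contrastive learning with large language models in a two-stage architecture that tackles modality fragmentation and shallow inter-modal reasoning together, capturing semantic inconsistencies in multimodal deepfakes and improving detection robustness and accuracy.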
Abstract: The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%-10% consistent improvements across modalities.
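The abstract does not state which contrastive objective stage 2 uses. As one common possibility, the sketch below computes an InfoNCE-style loss that pulls matching audio and video embeddings together and pushes mismatched pairs apart. The function name `info_nce`, the `temperature` value, and the toy embeddings are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def info_nce(audio: List[Sequence[float]], video: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss where audio[i] and video[i] form the positive pair."""
    total = 0.0
    for i, a in enumerate(audio):
        logits = [_cosine(a, v) / temperature for v in video]
        # Numerically stable log-sum-exp over all candidate video embeddings.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audio)


audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]
video_shuffled = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(audio, video_aligned) < info_nce(audio, video_shuffled))
```

The loss is small when each audio embedding is closest to its own video embedding and large when the pairing is broken, which is the alignment pressure a contrastive stage would apply across modalities.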
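[13] From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs cs.CL
Tianjun Zhong, Linyang He, Nima Mesgarani
TL;DR: This paper introduces Reasoning DAG Probing, a framework that asks whether LLM hidden states linearly encode the directed acyclic graph (DAG) structure of a reasoning process. Lightweight probes trained to predict node depth and pairwise node distance show that intermediate layers do encode the geometry of the reasoning DAG, with recoverability varying systematically with node depth and model scale.
Motivation: Much prior work treats reasoning as a linear chain, but many reasoning problems are more naturally represented as DAGs in which intermediate conclusions may depend on multiple premises, branch into parallel sub-derivations, and later merge or be reused. Whether model internals reflect such graph-structured reasoning remains an open question.
Result: The experiments provide evidence that reasoning DAG geometry is meaningfully encoded in intermediate layers, with recoverability varying systematically by node depth and model scale, suggesting that LLM reasoning is not only sequential but has measurable internal graph structure.
Insight: The novelty is modeling reasoning as a DAG rather than a linear chain and designing a probing framework to test whether models encode that graph structure internally. Viewed objectively, the approach offers a new angle on complex LLM reasoning and exposes a correspondence between internal representations and the logical structure of a problem.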
Abstract: Recent progress in large language models has renewed interest in mechanistically characterizing how multi-step reasoning is represented and computed. While much prior work treats reasoning as a linear chain of steps, many reasoning problems are more naturally structured as directed acyclic graphs (DAGs), where intermediate conclusions may depend on multiple premises, branch into parallel sub-derivations, and later merge or be reused. Understanding whether such graph-structured reasoning is reflected in model internals remains an open question. In this work, we introduce Reasoning DAG Probing, a framework that directly asks whether LLM hidden states encode the geometry of a reasoning DAG in a linearly accessible form, and where this structure emerges across layers. Within this framework, we associate each reasoning node with a textual realization and train lightweight probes to predict two graph-theoretic properties from hidden states: node depth and pairwise node distance. We use these probes to analyze the layerwise emergence of DAG structure and evaluate controls that disrupt reasoning-relevant structure while preserving superficial textual properties. Our results provide evidence that reasoning DAG geometry is meaningfully encoded in intermediate layers, with recoverability varying systematically by node depth and model scale, suggesting that LLM reasoning is not only sequential but exhibits measurable internal graph structure.
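The two probe targets named above, node depth and pairwise node distance, can be computed from a reasoning DAG as in this sketch. The definitions used are assumptions, since the abstract does not spell them out: depth is taken as the longest path from any premise (a node with no parents), and distance as the shortest path when edge direction is ignored. The example graph and function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def node_depths(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Depth = longest path from a premise (assumed definition)."""
    depth: Dict[str, int] = {}

    def visit(node: str) -> int:
        if node not in depth:
            preds = parents[node]
            depth[node] = 0 if not preds else 1 + max(visit(p) for p in preds)
        return depth[node]

    for node in parents:
        visit(node)
    return depth


def pairwise_distance(parents: Dict[str, List[str]], a: str, b: str) -> int:
    """Shortest path between a and b ignoring edge direction; -1 if disconnected."""
    neighbors = {node: set(preds) for node, preds in parents.items()}
    for node, preds in parents.items():
        for p in preds:
            neighbors[p].add(node)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1


# Hypothetical DAG: c1 follows from p1 and p2; c2 follows from c1 and p3.
dag = {"p1": [], "p2": [], "p3": [], "c1": ["p1", "p2"], "c2": ["c1", "p3"]}
print(node_depths(dag))
print(pairwise_distance(dag, "p1", "p3"))
```

A probing study would then fit a lightweight model from hidden states to these values; the sketch covers only the construction of the labels.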
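[14] Learning to Ideate for Machine Learning Engineering Agents cs.CL
Yunxiang Zhang, Kang Zhou, Zhichao Xu, Kiran Ramnath, Yun Zhou
TL;DR: This paper proposes MLE-Ideator, a dual-agent framework that separates ideation from implementation, to address the difficulty existing machine learning engineering (MLE) agents have in iteratively optimizing the effectiveness of their algorithms. In a training-free setup the framework significantly outperforms implementation-only agent baselines, and an Ideator trained with reinforcement learning generates more effective ideas, outperforming both its untrained counterpart and Claude Sonnet 3.5.
Motivation: Existing MLE agents struggle to iteratively optimize the algorithms they implement, which motivates separating ideation from implementation to improve optimization.
Result: On MLE-Bench, the framework significantly outperforms implementation-only agent baselines. After RL training on only 1K samples, the Qwen3-8B Ideator achieves an 11.5% relative improvement over its untrained counterpart and surpasses Claude Sonnet 3.5.
Insight: The novelty is decoupling ideation from implementation and introducing a dedicated ideation agent; RL training effectively improves its strategic ideation, suggesting a promising path toward training AI systems for scientific discovery.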
Abstract: Existing machine learning engineering (MLE) agents struggle to iteratively optimize their implemented algorithms for effectiveness. To address this, we introduce MLE-Ideator, a dual-agent framework that separates ideation from implementation. In our system, an implementation agent can request strategic help from a dedicated Ideator. We show this approach is effective in two ways. First, in a training-free setup, our framework significantly outperforms implementation-only agent baselines on MLE-Bench. Second, we demonstrate that the Ideator can be trained with reinforcement learning (RL) to generate more effective ideas. With only 1K training samples from 10 MLE tasks, our RL-trained Qwen3-8B Ideator achieves an 11.5% relative improvement compared to its untrained counterpart and surpasses Claude Sonnet 3.5. These results highlight a promising path toward training strategic AI systems for scientific discovery.
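[15] Align to the Pivot: Dual Alignment with Self-Feedback for Multilingual Math Reasoning cs.CL
Chunxu Zhao, Xin Huang, Xue Han, Shujian Huang, Chao Deng
TL;DR: This paper proposes PASMR, which designates the model's primary language as the pivot language. During training, questions are first translated into the pivot language to align reasoning patterns, and the pivot-language reasoning answers then supervise reasoning in the target language. This establishes a cross-lingual self-feedback mechanism aimed at improving the alignment and performance of LLMs on multilingual math reasoning.
Motivation: LLM performance declines in multilingual settings, especially for low-resource languages, which the authors attribute to inconsistent alignment between the model's multilingual understanding and its reasoning.
Result: Extensive experiments show that the method strengthens both the model's understanding of questions and its reasoning ability, leading to notable task improvements.
Insight: The novelty is a dual alignment and self-feedback mechanism built around the primary language as pivot, which improves cross-lingual reasoning alignment without relying on external correct answers or reward models.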
Abstract: Despite the impressive reasoning abilities demonstrated by large language models (LLMs), empirical evidence indicates that they are not language agnostic as expected, leading to performance declines in multilingual settings, especially for low-resource languages. We attribute the decline to the model’s inconsistent multilingual understanding and reasoning alignment. To address this, we present Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR), aiming to improve the alignment of multilingual math reasoning abilities in LLMs. This approach designates the model’s primary language as the pivot language. During training, the model first translates questions into the pivot language to facilitate better alignment of reasoning patterns. The reasoning process in the target language is then supervised by the pivot language’s reasoning answers, thereby establishing a cross-lingual self-feedback mechanism without relying on external correct answers or reward models. Extensive experimental results demonstrate that our method enhances both the model’s understanding of questions and its reasoning capabilities, leading to notable task improvements.
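[16] A Computational Approach to Visual Metonymy cs.CL | cs.CV
Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang
TL;DR: This paper presents the first computational study of visual metonymy. It proposes a novel pipeline grounded in semiotic theory that uses large language models and text-to-image models to generate metonymic visual representations, and builds ViMET, the first visual metonymy dataset, with 2,000 multiple-choice questions for evaluating the cognitive reasoning abilities of multimodal language models. Experiments reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), exposing machines' limitations in interpreting indirect visual references.
Motivation: The paper addresses computational understanding of visual metonymy, in which images convey a target concept through associated cues rather than explicit depiction; it is the first systematic computational study of the phenomenon.
Result: On ViMET, humans reach 86.9% accuracy while state-of-the-art vision-language models reach only 65.9%, a significant gap from human performance.
Insight: The novelty is combining semiotic theory with generative AI (LLMs and text-to-image models) into a systematic framework for generating and evaluating visual metonymy, and creating the first dataset dedicated to testing how well multimodal models understand indirect visual references, a new benchmark for evaluating and improving deeper semantic reasoning.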
Abstract: Images often communicate more than they literally depict: a set of tools can suggest an occupation and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset comprising 2,000 multiple-choice questions to evaluate the cognitive reasoning abilities in multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines’ ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.
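[17] Unsupervised Elicitation of Moral Values from Language Models cs.CL
Meysam Alizadeh, Fabrizio Gilardi, Zeynab Samei
TL;DR: This paper uses Internal Coherence Maximization (ICM), an unsupervised method, to elicit latent moral reasoning from pretrained language models. Tested across several benchmark datasets and models, ICM performs strongly at labeling moral judgments, generalizing across moral frameworks, and mitigating social bias, matching or exceeding human labels.
Motivation: As AI systems become pervasive, aligning their behavior with human values is critical. Prior work suggests that language models have limited inherent moral reasoning, yet building ground-truth data for moral evaluation is difficult given plural moral frameworks and pervasive biases. The paper therefore explores unsupervised elicitation, asking whether pretrained language models possess intrinsic moral reasoning that can be surfaced without human supervision.
Result: On the Norm Bank and ETHICS benchmarks, ICM outperforms all pretrained and chatbot baselines. Models fine-tuned on ICM labels perform on par with or better than those fine-tuned on human labels. ICM yields its largest relative gains on the Justice and Commonsense moral frameworks. It also reduces social bias failure rates by more than half, with the largest improvements in race, socioeconomic status, and politics.
Insight: The novelty lies in using the unsupervised ICM algorithm to elicit latent moral reasoning from language models, avoiding the difficulty of building bias-prone labeled data. Viewed objectively, the results suggest that pretrained models hold underused moral knowledge, offering a scalable path for AI alignment with practical value for cross-framework generalization and bias mitigation.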
Abstract: As AI systems become pervasive, grounding their behavior in human values is critical. Prior work suggests that language models (LMs) exhibit limited inherent moral reasoning, leading to calls for explicit moral teaching. However, constructing ground truth data for moral evaluation is difficult given plural frameworks and pervasive biases. We investigate unsupervised elicitation as an alternative, asking whether pretrained (base) LMs possess intrinsic moral reasoning capability that can be surfaced without human supervision. Using the Internal Coherence Maximization (ICM) algorithm across three benchmark datasets and four LMs, we test whether ICM can reliably label moral judgments, generalize across moral frameworks, and mitigate social bias. Results show that ICM outperforms all pretrained and chatbot baselines on the Norm Bank and ETHICS benchmarks, while fine-tuning on ICM labels performs on par with or better than fine-tuning on human labels. Across theoretically motivated moral frameworks, ICM yields its largest relative gains on Justice and Commonsense morality. Furthermore, although chatbot LMs exhibit social bias failure rates comparable to their pretrained counterparts, ICM reduces such errors by more than half, with the largest improvements in race, socioeconomic status, and politics. These findings suggest that pretrained LMs possess latent moral reasoning capacities that can be elicited through unsupervised methods like ICM, providing a scalable path for AI alignment.
[18] ProGraph-R1: Progress-aware Reinforcement Learning for Graph Retrieval Augmented Generation cs.CLPDF
Jinyoung Park, Sanghyeok Lee, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim
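TL;DR: This paper proposes ProGraph-R1, a progress-aware reinforcement learning framework for graph retrieval-augmented generation (GraphRAG). By introducing a structure-aware hypergraph retrieval mechanism and progress-based step-wise policy optimization, it addresses two problems in existing RL-based GraphRAG methods: retrieval that ignores graph structure and reliance on sparse outcome-level rewards. The result is higher multi-step reasoning accuracy and generation quality.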
Details
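Motivation: Existing RL-based GraphRAG frameworks such as Graph-R1 have two key limitations: (1) they retrieve mainly by semantic similarity and overlook the underlying graph structure, and (2) they rely on sparse outcome-level rewards that fail to capture the quality of intermediate retrieval steps and their dependencies. The paper aims to address both.
Result: Experiments on multi-hop question answering benchmarks show that ProGraph-R1 consistently outperforms existing GraphRAG methods in reasoning accuracy and generation quality.
Insight: The main innovations are (1) a structure-aware hypergraph retrieval mechanism that jointly considers semantic relevance and graph connectivity to encourage coherent traversal along multi-hop reasoning paths, and (2) progress-based step-wise policy optimization, which provides dense learning signals by modulating advantages according to intermediate reasoning progress within the graph rather than relying only on the final outcome. Together these give graph-augmented retrieval and reasoning finer-grained optimization and supervision signals.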
Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has been successfully applied in various knowledge-intensive question answering tasks by organizing external knowledge into structured graphs of entities and relations. It enables large language models (LLMs) to perform complex reasoning beyond text-chunk retrieval. Recent works have employed reinforcement learning (RL) to train agentic GraphRAG frameworks that perform iterative interactions between LLMs and knowledge graphs. However, existing RL-based frameworks such as Graph-R1 suffer from two key limitations: (1) they primarily depend on semantic similarity for retrieval, often overlooking the underlying graph structure, and (2) they rely on sparse, outcome-level rewards, failing to capture the quality of intermediate retrieval steps and their dependencies. To address these limitations, we propose ProGraph-R1, a progress-aware agentic framework for graph-based retrieval and multi-step reasoning. ProGraph-R1 introduces a structure-aware hypergraph retrieval mechanism that jointly considers semantic relevance and graph connectivity, encouraging coherent traversal along multi-hop reasoning paths. We also design a progress-based step-wise policy optimization, which provides dense learning signals by modulating advantages according to intermediate reasoning progress within a graph, rather than relying solely on final outcomes. Experiments on multi-hop question answering benchmarks demonstrate that ProGraph-R1 consistently improves reasoning accuracy and generation quality over existing GraphRAG methods.
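The idea of modulating advantages by intermediate progress can be pictured with a small sketch. Everything in it is an assumption made for illustration and not the paper's formula: the function name `stepwise_advantages`, the per-step progress scores in [0, 1], the multiplicative form, and the weight `beta`.

```python
from typing import List


def stepwise_advantages(outcome_advantage: float,
                        progress: List[float],
                        beta: float = 0.5) -> List[float]:
    """Spread one trajectory-level advantage over steps (illustrative only).

    progress[t] is a hypothetical score in [0, 1] for how far retrieval has
    advanced through the graph after step t. Each step's advantage is the
    outcome advantage scaled by 1 + beta * (progress gained at that step).
    """
    advantages = []
    previous = 0.0
    for current in progress:
        gain = current - previous  # progress made by this step alone
        advantages.append(outcome_advantage * (1.0 + beta * gain))
        previous = current
    return advantages


# The middle step makes no progress, so it keeps the plain outcome advantage.
print(stepwise_advantages(1.0, [0.5, 0.5, 1.0], beta=1.0))
```

In this toy form a step with zero progress gain receives exactly the sparse outcome-level signal, which makes the contrast with outcome-only rewards easy to see.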
[19] EFT-CoT: A Multi-Agent Chain-of-Thought Framework for Emotion-Focused Therapy cs.CLPDF
Lanqing Du, Yunong Li, YuJie Long, Shihong Chen
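TL;DR: This paper proposes EFT-CoT, a multi-agent chain-of-thought framework grounded in Emotion-Focused Therapy (EFT), for Mental Health Question Answering (MHQA). Following a "bottom-up" trajectory, the framework decomposes the intervention into a three-stage reasoning flow, "Embodied Perception - Cognitive Exploration - Narrative Intervention", and uses eight specialized agents to carry out the key components. The authors also build a high-quality dataset, EFT-Instruct, and fine-tune a specialized model, EFT-LLM.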
Details
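Motivation: Existing LLM approaches based on Cognitive Behavioral Therapy (CBT) mainly apply "top-down" rational restructuring and often neglect clients' embodied experiences and primary emotion processing. The paper addresses this by bringing in EFT principles to attend more fully to emotional experience.
Result: Experimental evaluation shows that EFT-LLM outperforms strong baselines and human responses on metrics such as empathy depth and structural professionalism. Ablation studies confirm the necessity of the multi-agent mechanism.
Insight: The innovation lies in combining EFT with a multi-agent chain-of-thought framework, yielding an explicit, interpretable reasoning flow from embodied perception to narrative restructuring. The approach offers an effective path toward high-empathy, interpretable counseling systems, and the instruction dataset and specialized model it produces are also useful references.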
Abstract: Leveraging Large Language Models (LLMs) for Mental Health Question Answering (MHQA) is promising for mitigating resource shortages. However, existing Cognitive Behavioral Therapy (CBT)-based approaches predominantly favor a “top-down” rational restructuring, often neglecting clients’ embodied experiences and primary emotion processing. To address this, we propose an Emotion-Focused Therapy (EFT)-based Multi-Agent Chain-of-Thought framework (EFT-CoT). Adopting a “bottom-up” trajectory, it deconstructs the intervention into a three-stage reasoning flow: “Embodied Perception - Cognitive Exploration - Narrative Intervention.” Utilizing eight specialized agents, the system explicitly executes critical components such as somatic awareness mapping, adaptive assessment, core belief extraction, and narrative restructuring. We further constructed “EFT-Instruct,” a high-quality dataset via Chain-of-Thought distillation of approximately 67,000 authentic texts, and fine-tuned a specialized model, EFT-LLM. Experimental evaluations demonstrate that EFT-LLM outperforms strong baselines and human responses across metrics like empathy depth and structural professionalism. Ablation studies confirm the necessity of the multi-agent mechanism. The model exhibits superior psychological reasoning, offering an effective pathway for interpretable, high-empathy counseling systems.
[20] On the Emergence and Test-Time Use of Structural Information in Large Language Models cs.CL | cs.LGPDF
Michelle Chao Chen, Moritz Miller, Bernhard Schölkopf, Siyuan Guo
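TL;DR: This paper studies how large language models learn abstract structural information from observational data and use it for compositional generation at test time. The authors design a natural language dataset based on linguistic structural transformations for controlled experiments. They find that the learning of structural information correlates with complex reasoning tasks, but that the models' ability to perform compositional generation at test time remains limited.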
Details
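Motivation: Learning structural information from observational data is central to producing new knowledge outside the training corpus, both for mechanistic understanding in scientific discovery and for flexible test-time compositional generation. The paper therefore investigates how language models learn abstract structures and use that information at test time.
Result: Experiments show that the emergence of structural-information learning correlates with complex reasoning tasks. On the designed natural language dataset, the models' test-time compositional generation ability is limited. The abstract does not report specific quantitative results.
Insight: The innovation is the use of a controlled dataset of linguistic structural transformations to study empirically how structure learning emerges and how it relates to reasoning tasks. Viewed objectively, the method gives a new angle on how LLMs learn structure, while the limits on test-time compositional generation expose a weakness of current models.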
Abstract: Learning structural information from observational data is central to producing new knowledge outside the training corpus. This holds for mechanistic understanding in scientific discovery as well as flexible test-time compositional generation. We thus study how language models learn abstract structures and utilize the learnt structural information at test-time. To ensure a controlled setup, we design a natural language dataset based on linguistic structural transformations. We empirically show that the emergence of learning structural information correlates with complex reasoning tasks, and that the ability to perform test-time compositional generation remains limited.
[21] LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction cs.CLPDF
Junior Cedric Tonga, Chen Cecilia Liu, Iryna Gurevych, Fajri Koto
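TL;DR: This paper proposes an iterative, prompt-based framework for extracting cultural commonsense knowledge from large language models (LLMs) and building a Cultural Commonsense Knowledge Graph (CCKG). Treating LLMs as cultural archives, it systematically elicits culture-specific entities, relations, and practices and composes them into multi-step inferential chains across languages.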
Details
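Motivation: LLMs learn rich cultural knowledge from web-scale data, but most of it is implicit and unstructured, which limits its interpretability and use. A method is needed to make this knowledge explicit and structured.
Result: In human evaluation of cultural relevance, correctness, and path coherence across five countries, CCKG is best realized in English even when the target culture is non-English (e.g., Chinese, Indonesian, Arabic), indicating uneven cultural encoding in current LLMs. Augmenting smaller LLMs with CCKG improves cultural reasoning and story generation, with the largest gains from English chains.
Insight: The innovation is to treat LLMs as cultural archives and to build a structured cultural commonsense knowledge graph systematically through iterative prompting. The work shows that chain-structured cultural knowledge is a practical substrate for culturally grounded NLP, and it also exposes the limits of LLMs' cultural encoding.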
Abstract: Large language models (LLMs) encode rich cultural knowledge learned from diverse web-scale data, offering an unprecedented opportunity to model cultural commonsense at scale. Yet this knowledge remains mostly implicit and unstructured, limiting its interpretability and use. We present an iterative, prompt-based framework for constructing a Cultural Commonsense Knowledge Graph (CCKG) that treats LLMs as cultural archives, systematically eliciting culture-specific entities, relations, and practices and composing them into multi-step inferential chains across languages. We evaluate CCKG on five countries with human judgments of cultural relevance, correctness, and path coherence. We find that the cultural knowledge graphs are better realized in English, even when the target culture is non-English (e.g., Chinese, Indonesian, Arabic), indicating uneven cultural encoding in current LLMs. Augmenting smaller LLMs with CCKG improves performance on cultural reasoning and story generation, with the largest gains from English chains. Our results show both the promise and limits of LLMs as cultural technologies and that chain-structured cultural knowledge is a practical substrate for culturally grounded NLP.
[22] SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets cs.CL | cs.AIPDF
Kshitij Mishra, Nils Lukas, Salem Lahlou
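TL;DR: This paper proposes SD-E$^2$, a reinforcement learning framework that improves the complex reasoning of small language models under a limited compute budget. It makes exploration explicit by optimizing the semantic diversity of generated reasoning trajectories, and combines this with outcome correctness and solution efficiency in a multi-objective optimization.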
Details
Motivation: Small language models reasoning under tight compute budgets face expensive exploration, which limits performance. The paper aims to give them a more efficient exploration-exploitation signal by explicitly optimizing semantic diversity.
Result: On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and the GRPO baselines GRPO-CFL and GRPO-CFEE by 27.4, 5.2, and 1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. It also reaches 49.64% on MedMCQA (versus 38.37% for the base model) and 13.28% on the harder AIME benchmark (versus 6.74%).
Insight: The core innovation is to define and reward semantic diversity with a frozen sentence-embedding model (coverage of distinct strategies plus their average pairwise dissimilarity) rather than surface-form novelty, and to combine it with correctness and efficiency in a multi-objective optimization. This offers a complementary route to efficiency in resource-constrained models: adjusting the structure of the reasoning process (cognitive adaptation) rather than per-token computation.
Abstract: Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective objective that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. We further improve MedMCQA to 49.64% versus 38.37% for the base model and show gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation (adjusting the reasoning process structure rather than per-token computation), SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
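The two ingredients of the diversity reward, coverage of distinct strategies and average pairwise dissimilarity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy grouping rule, the similarity `threshold`, and the function names are hypothetical, and the embeddings are assumed to be non-zero vectors that a frozen sentence encoder has already produced.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_reward(embeddings, threshold=0.8):
    """Return (coverage, mean pairwise dissimilarity) for one question.

    Coverage is the share of trajectories that open a new strategy, where a
    trajectory counts as new if its similarity to every earlier strategy
    representative is below `threshold` (a hypothetical grouping rule).
    """
    representatives = []
    for emb in embeddings:
        if all(cosine(emb, rep) < threshold for rep in representatives):
            representatives.append(emb)
    coverage = len(representatives) / len(embeddings)
    pairs = list(combinations(embeddings, 2))
    dissimilarity = mean(1.0 - cosine(u, v) for u, v in pairs) if pairs else 0.0
    return coverage, dissimilarity


def zscore(values):
    """Standardize one reward component across a batch before mixing."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


# Two identical trajectories and one orthogonal one: two distinct strategies.
print(diversity_reward([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

Standardizing each component with `zscore` before summing is one plain reading of "z-score-normalized"; the weights used to mix diversity, correctness, and efficiency are not given in the abstract and are left out here.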
[23] AI-based approach to burnout identification from textual data cs.CL | cs.AIPDF
Marina Zavertiaeva, Petr Parshakov, Mikhail Usanin, Aleksei Smirnov, Sofia Paklina
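TL;DR: This study proposes an AI-based method that uses natural language processing to identify burnout from textual data. It starts from a RuBERT model originally trained for sentiment analysis and fine-tunes it for burnout detection on synthetic sentences generated with ChatGPT and user comments collected from Russian YouTube videos. The resulting model assigns a burnout probability to input text and can process large volumes of written communication to monitor burnout-related language signals in high-stress work environments.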
Details
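Motivation: To detect burnout automatically from textual data so that mental health risks in high-stress work environments can be monitored.
Result: The model assigns a burnout probability to text, but the abstract reports no specific quantitative results (such as accuracy) or benchmark comparisons.
Insight: The innovation is transferring a pretrained RuBERT model from sentiment analysis to burnout detection and fine-tuning it on both synthetic data (generated with ChatGPT) and real user comments, giving a workable AI approach to mental health monitoring in a specific language setting (Russian).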
Abstract: This study introduces an AI-based methodology that utilizes natural language processing (NLP) to detect burnout from textual data. The approach relies on a RuBERT model originally trained for sentiment analysis and subsequently fine-tuned for burnout detection using two data sources: synthetic sentences generated with ChatGPT and user comments collected from Russian YouTube videos about burnout. The resulting model assigns a burnout probability to input texts and can be applied to process large volumes of written communication for monitoring burnout-related language signals in high-stress work environments.
[24] PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation cs.CLPDF
Lorenzo Proietti, Roman Grundkiewicz, Matt Post
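TL;DR: PEAR is a family of supervised Quality Estimation (QE) metrics that reframes reference-free Machine Translation (MT) evaluation as a graded pairwise comparison, predicting the direction and magnitude of the quality difference between two candidate translations.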
Details
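Motivation: To address the redundant signal and inefficiency of traditional single-candidate QE metrics in MT evaluation by using pairwise comparison to provide a finer-grained, more efficient evaluation signal.
Result: On the WMT24 meta-evaluation benchmark, PEAR outperforms single-candidate QE baselines trained with the same data and backbones, and it surpasses far larger QE models and reference-based metrics while using fewer parameters.
Insight: The innovation is recasting quality estimation as a pairwise comparison task with a regularization term that encourages sign inversion when the candidate order is reversed. Viewed objectively, PEAR provides a less redundant evaluation signal and serves as an effective utility function for Minimum Bayes Risk decoding, reducing scoring cost.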
Abstract: We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised Quality Estimation (QE) metric family that reframes reference-free Machine Translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for Minimum Bayes Risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.
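The pairwise objective with sign-inversion regularization can be illustrated with a toy loss. The sketch below is an assumption-laden reading of the abstract, not PEAR's actual training code: the `scorer` callable, the squared-error form of both terms, and the weight `lam` are all hypothetical.

```python
from typing import Callable

# A hypothetical scorer: (source, candidate_a, candidate_b) -> signed score,
# positive when candidate_a is judged better than candidate_b.
Scorer = Callable[[str, str, str], float]


def pairwise_loss(scorer: Scorer, source: str, cand_a: str, cand_b: str,
                  human_a: float, human_b: float, lam: float = 0.1) -> float:
    """Regression on the human score difference plus an antisymmetry penalty."""
    target = human_a - human_b                  # pairwise supervision signal
    forward = scorer(source, cand_a, cand_b)
    backward = scorer(source, cand_b, cand_a)   # same pair, order reversed
    regression = (forward - target) ** 2
    antisymmetry = (forward + backward) ** 2    # zero when f(a,b) == -f(b,a)
    return regression + lam * antisymmetry


def length_gap(source: str, a: str, b: str) -> float:
    """Toy scorer that is antisymmetric by construction."""
    return float(len(a) - len(b))


print(pairwise_loss(length_gap, "src", "abcd", "ab", human_a=3.0, human_b=1.0))
```

A scorer that flips sign exactly under order reversal pays no penalty, so the regularizer only acts on order-dependent predictions.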
[25] Evaluating Semantic and Syntactic Understanding in Large Language Models for Payroll Systems cs.CL | cs.AIPDF
Hendrika Maclean, Mert Can Cakmak, Muzakkiruddin Ahmed Mohammed, Shames Al Mandalawi, John Talburt
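TL;DR: This paper evaluates the semantic and syntactic understanding of large language models in payroll systems, a high-precision setting. Using a tiered dataset that ranges from basic to complex cases, it tests several model families on understanding a payroll schema, applying rules in the right order, and delivering cent-accurate results.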
Details
Motivation: Although LLMs keep improving at natural language understanding, they remain unreliable at exact numerical calculation and at producing easily auditable outputs. The paper therefore studies their reliability in payroll, a focused, high-stakes application.
Result: The experiments reveal clear regimes: in some cases careful prompting is sufficient, while in others explicit computation is required. The study offers practical guidance for deploying LLMs in settings that demand accuracy and assurance.
Insight: The contribution is a compact, reproducible evaluation framework for testing LLMs on structured, rule-driven exact tasks, together with a clear boundary between where prompt engineering suffices and where explicit computation is needed as task complexity grows.
Abstract: Large language models are now used daily for writing, search, and analysis, and their natural language understanding continues to improve. However, they remain unreliable on exact numerical calculation and on producing outputs that are straightforward to audit. We study a synthetic payroll system as a focused, high-stakes example and evaluate whether models can understand a payroll schema, apply rules in the right order, and deliver cent-accurate results. Our experiments span a tiered dataset from basic to complex cases, a spectrum of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families including GPT, Claude, Perplexity, Grok and Gemini. Results indicate clear regimes where careful prompting is sufficient and regimes where explicit computation is required. The work offers a compact, reproducible framework and practical guidance for deploying LLMs in settings that demand both accuracy and assurance.
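To make "explicit computation" and "cent-accurate" concrete, here is a small sketch of a payroll calculation done with decimal arithmetic and a fixed rule order. The rules themselves are invented for illustration (overtime at 1.5x above 40 hours, a flat tax rate, half-up rounding to cents) and are not taken from the paper's dataset.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
FULL_TIME_HOURS = Decimal("40")        # hypothetical overtime threshold
OVERTIME_MULTIPLIER = Decimal("1.5")   # hypothetical overtime rule


def net_pay(hours: str, rate: str, tax_rate: str) -> Decimal:
    """Apply the rules in a fixed order: regular, overtime, rounding, tax."""
    h, r, t = Decimal(hours), Decimal(rate), Decimal(tax_rate)
    regular = min(h, FULL_TIME_HOURS) * r
    overtime = max(h - FULL_TIME_HOURS, Decimal("0")) * r * OVERTIME_MULTIPLIER
    gross = (regular + overtime).quantize(CENT, rounding=ROUND_HALF_UP)
    tax = (gross * t).quantize(CENT, rounding=ROUND_HALF_UP)
    return gross - tax


# Inputs are passed as strings so no binary floating-point error enters.
print(net_pay("45", "20.15", "0.2"))
```

Each intermediate value (gross, tax) is rounded at a named step, which is what makes a result like this straightforward to audit compared with a free-form model answer.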
[26] Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models cs.CLPDF
Aryan Roy, Zekun Wang, Christopher J. MacLellan
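TL;DR: This paper asks whether vision-language models (VLMs) show more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are given text-only prompts. Comparing matched Llama text models with their vision counterparts, it finds that multimodal pretraining strengthens the perceptual grounding of concreteness, bringing output behavior, embedding geometry, and attention dynamics closer to human patterns.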
Details
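Motivation: To test whether multimodal pretraining (joint vision-language training) gives models more human-like sensitivity to linguistic concreteness on text-only tasks, rather than that sensitivity depending on image input at inference.
Result: Across model scales, VLMs show larger performance gains on more concrete inputs, their representations are more clearly organized by concreteness, their token-level concreteness ratings match human norm distributions more closely, and their attention patterns indicate stronger perceptual grounding.
Insight: Treating multimodal pretraining as an ablation on perceptual grounding shows that it increases models' sensitivity to linguistic concreteness and yields more human-like behavior on text-only tasks, offering a new perspective on how cross-modal learning shapes language understanding.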
Abstract: Do vision–language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate alignment to human norm distributions, testing whether multimodal training yields more human-consistent judgments. Across benchmarks and scales, VLMs show larger gains on more concrete inputs, exhibit clearer concreteness-structured representations, produce ratings that better match human norms, and display systematically different attention patterns consistent with increased grounding.
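An attention-entropy measure of the kind mentioned in level (iii) can be sketched with the standard Shannon entropy of one attention row. The abstract does not specify the exact measure, so the function below is a generic illustration and the name `attention_entropy` is an assumption.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores over context tokens; they are
    renormalized here so the function also accepts unnormalized rows.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


# Focused attention has low entropy; diffuse attention has high entropy.
print(attention_entropy([0.97, 0.01, 0.01, 0.01]))
print(attention_entropy([0.25, 0.25, 0.25, 0.25]))
```

Low entropy means a head concentrates on a few context tokens, while high entropy means it spreads attention widely, which is why entropy is a common proxy for how much a model leans on context.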
[27] Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents cs.CLPDF
Mahesh Ramesh, Kaousheik Jayakumar, Aswinkumar Ramkumar, Pavan Thodima, Aniket Rege
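TL;DR: This paper benchmarks 17 state-of-the-art LLMs on the cooperative card game Hanabi and studies how context engineering, from a minimal prompt to elaborate scaffolding with Bayesian-motivated deductions and multi-turn state tracking, affects coordination. It finds that agents can maintain an internal working memory for state tracking and that cross-play performance between different LLMs varies smoothly with model strength. Supervised and RL fine-tuning of a 4B-parameter open-weight model improves cooperative play by 21% and 156%, respectively, and the fine-tuned model generalizes to tasks beyond Hanabi.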
Details
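Motivation: To address cooperative reasoning under incomplete information, a challenge for both humans and multi-agent systems, using Hanabi, a card game that requires theory-of-mind reasoning and strategic communication, as the test environment.
Result: In the Sherlock setting, which adds programmatic, Bayesian-motivated deductions, the strongest reasoning models average more than 15 points across 2-5 player games but still trail experienced human players and specialist Hanabi agents, both of which consistently score above 20. The fine-tuned 4B model comes within about 3 points of the strong proprietary reasoning model o4-mini and surpasses the best non-reasoning model, GPT-4.1, by 52%.
Insight: The work systematically maps the limits of LLMs on a cooperative reasoning task and introduces context-engineering settings of increasing complexity to improve performance. It contributes the first public Hanabi datasets with annotated trajectories and move utilities (HanabiLogs and HanabiRewards) and shows that fine-tuning on them not only improves the target task substantially but also generalizes to other cooperative reasoning, temporal reasoning, and instruction-following tasks, giving a data-driven method and benchmark for improving LLM collaboration.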
Abstract: Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent systems. The card game Hanabi embodies this challenge, requiring theory-of-mind reasoning and strategic communication. We benchmark 17 state-of-the-art LLM agents in 2-5 player games and study the impact of context engineering across model scales (4B to 600B+) to understand persistent coordination failures and robustness to scaffolding: from a minimal prompt with only explicit card details (Watson setting), to scaffolding with programmatic, Bayesian-motivated deductions (Sherlock setting), to multi-turn state tracking via working memory (Mycroft setting). We show that (1) agents can maintain an internal working memory for state tracking and (2) cross-play performance between different LLMs smoothly interpolates with model strength. In the Sherlock setting, the strongest reasoning models exceed 15 points on average across player counts, yet still trail experienced humans and specialist Hanabi agents, both consistently scoring above 20. We release the first public Hanabi datasets with annotated trajectories and move utilities: (1) HanabiLogs, containing 1,520 full game logs for instruction tuning, and (2) HanabiRewards, containing 560 games with dense move-level value annotations for all candidate moves. Supervised and RL finetuning of a 4B open-weight model (Qwen3-Instruct) on our datasets improves cooperative Hanabi play by 21% and 156% respectively, bringing performance to within ~3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. The HanabiRewards RL-finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, instruction-following on IFBench-800K by 1.7 Pass@10, and matching AIME 2025 mathematical reasoning Pass@10.
[28] CHiRPE: A Step Towards Real-World Clinical NLP with Clinician-Oriented Model Explanations cs.CLPDF
Stephanie Fong, Zimu Wang, Guilherme C. Oliveira, Xiangyu Zhao, Yiwen Jiang
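TL;DR: CHiRPE is a clinician-oriented NLP pipeline that predicts psychosis risk and presents its predictions in explanation formats co-developed with clinicians, with the goal of making NLP tools more interpretable and useful in medicine.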
Details
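Motivation: Traditional explainable AI methods are misaligned with clinical reasoning and lack clinician input. Addressing this gap is meant to advance the adoption of NLP tools in real-world clinical settings.
Result: Trained on 944 semi-structured interview transcripts from the AMP-SCZ study, the CHiRPE pipeline achieves over 90% accuracy across three BERT variants and outperforms baseline models. An evaluation by 28 clinical experts shows a strong preference for its novel concept-guided explanation formats, especially the hybrid graph-and-text summary format.
Insight: The innovation is integrating symptom-domain mapping, LLM summarization, and BERT classification into one NLP pipeline and co-developing SHAP explanation formats with clinicians. This clinician-guided development balances accuracy and interpretability and lays the groundwork for real-world testing.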
Abstract: The medical adoption of NLP tools requires interpretability by end users, yet traditional explainable AI (XAI) methods are misaligned with clinical reasoning and lack clinician input. We introduce CHiRPE (Clinical High-Risk Prediction with Explainability), an NLP pipeline that takes transcribed semi-structured clinical interviews to: (i) predict psychosis risk; and (ii) generate novel SHAP explanation formats co-developed with clinicians. Trained on 944 semi-structured interview transcripts across 24 international clinics of the AMP-SCZ study, the CHiRPE pipeline integrates symptom-domain mapping, LLM summarisation, and BERT classification. CHiRPE achieved over 90% accuracy across three BERT variants and outperformed baseline models. Explanation formats were evaluated by 28 clinical experts who indicated a strong preference for our novel concept-guided explanations, especially hybrid graph-and-text summary formats. CHiRPE demonstrates that clinically-guided model development produces both accurate and interpretable results. Our next step is focused on real-world testing across our 24 international sites.
[29] FABLE: Forest-Based Adaptive Bi-Path LLM-Enhanced Retrieval for Multi-Document Reasoning cs.CLPDF
Lin Sun, Linglin Zhang, Jingang Huang, Change Jia, Zhengwei Cheng
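TL;DR: This paper proposes FABLE, a forest-based adaptive bi-path LLM-enhanced retrieval framework for multi-document reasoning. It builds LLM-enhanced hierarchical forest indexes and uses a bi-path strategy that combines LLM-guided hierarchical traversal with structure-aware propagation to acquire fine-grained evidence, reaching accuracy comparable to full-context LLM inference at a much lower computational cost.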
Details
Motivation: Despite rapid progress in long-context LLMs, multi-document reasoning with them suffers from the lost-in-the-middle phenomenon, high computational cost, and poor scalability. Traditional retrieval-augmented generation (RAG) systems, in turn, are limited by flat chunk-level retrieval, which introduces semantic noise and cannot support structured cross-document synthesis. A new retrieval framework is needed to overcome these limits.
Result: Extensive experiments show that FABLE consistently outperforms state-of-the-art RAG methods and matches the accuracy of full-context LLM inference with up to 94% fewer tokens.
Insight: The core innovation is integrating LLMs into both knowledge organization and retrieval: (1) LLM-enhanced hierarchical forest indexes with multi-granularity semantic structures organize the knowledge, and (2) a bi-path retrieval strategy combining LLM-guided hierarchical traversal with structure-aware propagation acquires fine-grained evidence, with explicit budget control for adaptive efficiency trade-offs. The results indicate that long-context LLMs amplify rather than fully replace the need for structured retrieval.
Abstract: The rapid expansion of long-context Large Language Models (LLMs) has reignited debate on whether Retrieval-Augmented Generation (RAG) remains necessary. However, empirical evidence reveals persistent limitations of long-context inference, including the lost-in-the-middle phenomenon, high computational cost, and poor scalability for multi-document reasoning. Conversely, traditional RAG systems, while efficient, are constrained by flat chunk-level retrieval that introduces semantic noise and fails to support structured cross-document synthesis. We present FABLE, a Forest-based Adaptive Bi-path LLM-Enhanced retrieval framework that integrates LLMs into both knowledge organization and retrieval. FABLE constructs LLM-enhanced hierarchical forest indexes with multi-granularity semantic structures, then employs a bi-path strategy combining LLM-guided hierarchical traversal with structure-aware propagation for fine-grained evidence acquisition, with explicit budget control for adaptive efficiency trade-offs. Extensive experiments demonstrate that FABLE consistently outperforms SOTA RAG methods and achieves comparable accuracy to full-context LLM inference with up to 94% token reduction, showing that long-context LLMs amplify rather than fully replace the need for structured retrieval.
[30] Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models cs.CL | cs.AIPDF
Kunat Pipatanakul, Pittawat Taveekitworachai
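TL;DR: This paper proposes Typhoon-S, a minimal and open post-training recipe for building high-quality large language models in resource-constrained sovereign settings. It combines supervised fine-tuning, on-policy distillation, and small-scale reinforcement fine-tuning. Using Thai as a case study, it shows how to turn base models into instruction-tuned models with strong general performance while improving domain-specific abilities (such as legal reasoning) and cultural knowledge.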
Details
Motivation: Most state-of-the-art LLMs focus on high-resource languages such as English and Chinese and are controlled by a small number of organizations. The paper seeks a workable path for sovereign settings, such as regional or national institutions, that have limited resources and must retain transparent control over model weights, training data, and deployment.
Result: In the Thai case study, the recipe turns both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. Small-scale reinforcement fine-tuning with InK-GRPO further improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities, suggesting that high-quality sovereign LLMs are achievable with academic-scale resources.
Insight: The innovation is a minimal post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT (in particular InK-GRPO, which extends the GRPO loss) to reduce dependence on large instruction corpora and compute, offering an efficient practical path to sovereign LLMs.
Abstract: Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon-S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO, an extension of GRPO that augments the GRPO loss with a next-word prediction loss, improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.
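The description of InK-GRPO as a GRPO loss augmented with a next-word prediction loss suggests a combined objective along the following lines. This is a heavily simplified sketch under stated assumptions: it keeps only the group-relative advantage idea from GRPO, drops the clipping and KL terms, works on precomputed log-probabilities, and uses a hypothetical mixing weight `alpha`. It is not the authors' loss.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize rewards within one group of responses to the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [0.0 if sigma == 0 else (r - mu) / sigma for r in rewards]


def combined_loss(rewards, response_logprobs, corpus_token_logprobs, alpha=0.1):
    """Simplified policy-gradient term plus a next-word prediction term.

    rewards[i] and response_logprobs[i] belong to sampled response i.
    corpus_token_logprobs are log-probabilities the model assigns to the
    tokens of a knowledge text (hypothetical input for the auxiliary loss).
    """
    advantages = group_relative_advantages(rewards)
    policy_loss = -mean(a * lp for a, lp in zip(advantages, response_logprobs))
    next_word_loss = -mean(corpus_token_logprobs)  # average negative log-likelihood
    return policy_loss + alpha * next_word_loss


print(combined_loss([1.0, 0.0, 1.0, 0.0], [-1.0, -2.0, -1.0, -2.0], [-0.5, -1.5]))
```

The auxiliary term gives the model a direct language-modeling signal on domain text alongside the reward-driven update, which is one way to read the abstract's claim about improving Thai-specific knowledge while preserving general capabilities.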
[31] MemWeaver: Weaving Hybrid Memories for Traceable Long-Horizon Agentic Reasoning cs.CLPDF
Juexiang Ye, Xue Li, Xinyu Yang, Chengkai Huang, Lanshun Nie
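TL;DR: MemWeaver is a unified memory framework for long-horizon agentic reasoning. It integrates three components (a temporal graph memory, an experience memory, and a passage memory for textual evidence) and uses a dual-channel retrieval strategy to support temporal consistency, multi-hop reasoning, and traceable evidence reuse.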
Details
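Motivation: Existing LLM-based agents in long-horizon interactions rely on unstructured retrieval or coarse abstractions, which leads to temporal conflicts, brittle reasoning, and limited traceability.
Result: On the LoCoMo benchmark, MemWeaver substantially improves multi-hop and temporal reasoning accuracy while cutting input context length by more than 95% compared with long-context baselines.
Insight: The innovation is consolidating long-term experience into three structured, interconnected memory components and retrieving structured knowledge jointly with original evidence through a dual-channel strategy, which yields compact, information-dense reasoning contexts and improves robustness and traceability.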
Abstract: Large language model-based agents operating in long-horizon interactions require memory systems that support temporal consistency, multi-hop reasoning, and evidence-grounded reuse across sessions. Existing approaches largely rely on unstructured retrieval or coarse abstractions, which often lead to temporal conflicts, brittle reasoning, and limited traceability. We propose MemWeaver, a unified memory framework that consolidates long-term agent experiences into three interconnected components: a temporally grounded graph memory for structured relational reasoning, an experience memory that abstracts recurring interaction patterns from repeated observations, and a passage memory that preserves original textual evidence. MemWeaver employs a dual-channel retrieval strategy that jointly retrieves structured knowledge and supporting evidence to construct compact yet information-dense contexts for reasoning. Experiments on the LoCoMo benchmark demonstrate that MemWeaver substantially improves multi-hop and temporal reasoning accuracy while reducing input context length by over 95% compared to long-context baselines.
[32] TechING: Towards Real World Technical Image Understanding via VLMs cs.CL | cs.CV | cs.LGPDF
Tafazzul Nadeem, Bhavik Shangari, Manish Rai, Gagan Raj Gupta, Ashutosh Modi
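TL;DR: This paper proposes TechING, which targets the difficulty vision-language models (VLMs) have in understanding real-world hand-drawn technical diagrams such as flowcharts and block diagrams. The authors generate a large synthetic image corpus for training, introduce new self-supervision tasks, and fine-tune Llama 3.2 11B-instruct to obtain LLama-VL-TUG, which performs markedly better on both synthetic and real hand-drawn images.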
Details
Motivation: Professionals often hand-draw technical diagrams during discussions, but current VLMs struggle to understand them, and fine-tuning would require large numbers of real hand-drawn images, which are impractical to collect.
Result: On synthetic images, LLama-VL-TUG improves the ROUGE-L performance of Llama 3.2 11B-instruct by 2.14x and achieves the best all-round performance among the baselines. On real hand-drawn images, human evaluation shows the fewest compilation errors in 7 of 8 diagram types and a 6.97x improvement in average F1 score.
Insight: The innovations are a large synthetic corpus of technical diagrams that mirrors real-world images and new self-supervision tasks that strengthen VLM understanding, together offering a practical route to technical image understanding in real applications.
Abstract: Professionals working in technical domains typically hand-draw (on whiteboard, paper, etc.) technical diagrams (e.g., flowcharts, block diagrams, etc.) during discussions; however, if they want to edit these later, the diagrams need to be redrawn from scratch. Modern-day VLMs have made tremendous progress in image understanding, but they struggle when it comes to understanding technical diagrams. One way to overcome this problem is to fine-tune on real-world hand-drawn images, but it is not practically possible to generate a large number of such images. In this paper, we introduce a large synthetically generated corpus (reflective of real-world images) for training VLMs and subsequently evaluate VLMs on a smaller corpus of hand-drawn images (with the help of humans). We introduce several new self-supervision tasks for training, perform extensive experiments with various baseline models, and fine-tune the Llama 3.2 11B-instruct model on synthetic images on these tasks to obtain LLama-VL-TUG, which significantly improves the ROUGE-L performance of Llama 3.2 11B-instruct by 2.14x and achieves the best all-round performance across all baseline models. On real-world images, human evaluation reveals that we achieve the fewest compilation errors among all baselines in 7 out of 8 diagram types and improve the average F1 score of Llama 3.2 11B-instruct by 6.97x.
[33] Reflecting Twice before Speaking with Empathy: Self-Reflective Alternating Inference for Empathy-Aware End-to-End Spoken Dialogue cs.CL | cs.SDPDF
Yuhang Jia, Pei Liu, Haoqin Sun, Jiaming Zhou, Xuxin Cheng
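TL;DR: This paper proposes ReEmpathy, an end-to-end spoken dialogue model that strengthens empathy through a novel Empathetic Self-Reflective Alternating Inference mechanism. The authors first build EmpathyEval, an evaluation model that assesses empathetic quality in spoken dialogue using descriptive natural language. ReEmpathy then interleaves spoken response generation with free-form, empathy-related reflective reasoning, which markedly improves empathy-sensitive spoken dialogue.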
Details
Motivation: Existing end-to-end spoken language models improve empathetic dialogue mainly through rigid supervision signals, such as ground-truth responses in supervised fine-tuning or preference scores in reinforcement learning. These cannot model complex empathy well, because there is no single "correct" response and a simple numerical score cannot capture the nuances of emotional expression or the appropriateness of empathetic behavior.
Result: Extensive experiments show that ReEmpathy, by enabling reflective reasoning, substantially improves empathy-sensitive spoken dialogue, offering a promising approach to more emotionally intelligent, empathy-aware human-computer interaction.
Insight: The core innovation is EmpathyEval, an empathy evaluation model based on descriptive natural language, and, built on it, an empathetic self-reflective alternating inference mechanism that interleaves spoken response generation with free-form empathy-related reasoning. This moves beyond methods that depend on a single supervision signal and suggests a new way to model complex empathetic interaction.
Abstract: End-to-end Spoken Language Models (SLMs) hold great potential for paralinguistic perception, and numerous studies have aimed to enhance their capabilities, particularly for empathetic dialogue. However, current approaches largely depend on rigid supervised signals, such as ground-truth responses in supervised fine-tuning or preference scores in reinforcement learning. Such reliance is fundamentally limited for modeling complex empathy, as there is no single “correct” response and a simple numerical score cannot fully capture the nuances of emotional expression or the appropriateness of empathetic behavior. To address these limitations, we first introduce EmpathyEval, a descriptive natural-language-based evaluation model for assessing empathetic quality in spoken dialogues. Building upon EmpathyEval, we propose ReEmpathy, an end-to-end SLM that enhances empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism, which interleaves spoken response generation with free-form, empathy-related reflective reasoning. Extensive experiments demonstrate that ReEmpathy substantially improves empathy-sensitive spoken dialogue by enabling reflective reasoning, offering a promising approach toward more emotionally intelligent and empathy-aware human-computer interactions.
[34] Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning cs.CL | cs.AI | cs.LGPDF
Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu
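TL;DR: This paper proposes Temp-R1, the first end-to-end autonomous agent for Temporal Knowledge Graph Question Answering (TKGQA) trained with reinforcement learning. It expands the action space with specialized internal actions alongside external actions and adopts reverse curriculum learning, training on difficult questions first to build complex reasoning before transferring to easier ones. It achieves state-of-the-art performance on the MultiTQ and TimelineKGQA benchmarks.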
Details
Motivation: Existing TKGQA methods rely on fixed workflows and expensive closed-source APIs, which limits flexibility and scalability, particularly for reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints.
Result: On MultiTQ and TimelineKGQA, the 8B-parameter Temp-R1 achieves state-of-the-art performance, improving on strong baselines by 19.8% on complex questions.
Insight: The core innovations are (1) applying reinforcement learning to TKGQA to build the first end-to-end autonomous reasoning agent, (2) expanding the action space with internal and external actions to relieve the cognitive overload of single-action reasoning, and (3) reverse curriculum learning, which makes the model learn to reason on complex questions before generalizing to simple ones and thereby avoids shortcut learning. The authors present this as a new paradigm for autonomous temporal reasoning agents.
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed-source APIs, limiting flexibility and scalability. We propose Temp-R1, the first autonomous end-to-end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single-action reasoning, we expand the action space with specialized internal actions alongside external actions. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B-parameter Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at https://github.com/zjukg/Temp-R1.
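The ordering behind reverse curriculum learning, hard questions first and easy ones later, can be sketched as a simple staging function. The difficulty measure (a hop count stored with each question), the number of stages, and the function name are assumptions for illustration; the abstract does not say how difficulty is defined or how stages are scheduled.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reverse_curriculum(examples: Sequence[T],
                       difficulty: Callable[[T], float],
                       num_stages: int = 2) -> List[List[T]]:
    """Split examples into training stages ordered from hardest to easiest."""
    if not examples:
        return []
    ordered = sorted(examples, key=difficulty, reverse=True)
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Hypothetical questions tagged with a hop count used as the difficulty proxy.
questions = [("q1", 1), ("q2", 3), ("q3", 2), ("q4", 4)]
for stage in reverse_curriculum(questions, difficulty=lambda q: q[1]):
    print([name for name, _ in stage])
```

Training would then iterate over the stages in order, so the policy meets multi-hop questions before it can settle into shortcuts that only work on simple ones.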
[35] MultiVis-Agent: A Multi-Agent Framework with Logic Rules for Reliable and Comprehensive Cross-Modal Data Visualization cs.CL | cs.AI | cs.DBPDF
Jinwei Lu, Yuanfeng Song, Chen Zhang, Raymond Chi-Wing Wong
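TL;DR: This paper proposes MultiVis-Agent, a logic rule-enhanced multi-agent framework for reliably generating multi-modal, multi-scenario data visualizations. The framework introduces four layers of logic rules that provide mathematical guarantees for system reliability, and it formalizes four visualization scenarios ranging from basic generation to iterative refinement.
Details
Motivation: Real-world visualization tasks involve complex multi-modal requirements (such as reference images, code examples, and iterative refinement), while existing systems are limited to single-modality input, one-shot generation, and rigid workflows. LLM-based approaches additionally introduce reliability challenges such as catastrophic failures and the risk of infinite loops.
Result: On the proposed MultiVis-Bench benchmark (over 1,000 multi-modal visualization evaluation cases), the method achieves a 75.63% visualization score on challenging tasks, significantly outperforming baselines (57.54-62.79%), with a task completion rate of 99.58% and a code execution success rate of 94.56% (vs. 74.48% and 65.10%, respectively, without logic rules).
Insight: The novelty lies in using logic rules as mathematical constraints that guide LLM reasoning rather than replacing it, which provides mathematical guarantees for the reliability of the multi-agent system while preserving flexibility. The work also systematically formalizes the scenarios of multi-modal visualization tasks and builds a corresponding evaluation benchmark.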
Abstract: Real-world visualization tasks involve complex, multi-modal requirements that extend beyond simple text-to-chart generation, requiring reference images, code examples, and iterative refinement. Current systems exhibit fundamental limitations: single-modality input, one-shot generation, and rigid workflows. While LLM-based approaches show potential for these complex requirements, they introduce reliability challenges including catastrophic failures and infinite loop susceptibility. To address this gap, we propose MultiVis-Agent, a logic rule-enhanced multi-agent framework for reliable multi-modal and multi-scenario visualization generation. Our approach introduces a four-layer logic rule framework that provides mathematical guarantees for system reliability while maintaining flexibility. Unlike traditional rule-based systems, our logic rules are mathematical constraints that guide LLM reasoning rather than replacing it. We formalize the MultiVis task spanning four scenarios from basic generation to iterative refinement, and develop MultiVis-Bench, a benchmark with over 1,000 cases for multi-modal visualization evaluation. Extensive experiments demonstrate that our approach achieves 75.63% visualization score on challenging tasks, significantly outperforming baselines (57.54-62.79%), with task completion rates of 99.58% and code execution success rates of 94.56% (vs. 74.48% and 65.10% without logic rules), successfully addressing both complexity and reliability challenges in automated visualization generation.
[36] Overalignment in Frontier LLMs: An Empirical Study of Sycophantic Behaviour in Healthcare cs.CLPDF
Clément Christophe, Wadood Mohammed Abdul, Prateek Munjal, Tathagata Raha, Ronnie Rajan
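TL;DR: This paper presents an empirical study of sycophantic behaviour of frontier LLMs in healthcare settings, introducing an evaluation framework based on medical multiple-choice questions and an Adjusted Sycophancy Score. It finds that scaling improves resilience to sycophancy, but that reasoning-optimized "Thinking" models are more prone to rationalizing incorrect suggestions under authoritative pressure, indicating that benchmark performance is not a full proxy for clinical reliability.
Details
Motivation: Sycophancy in LLMs (prioritizing user agreement over factual accuracy) poses risks to patient safety in clinical workflows, yet existing evaluations mostly rely on subjective datasets and lack a robust framework grounded in verifiable facts.
Result: A scaling analysis of the Qwen-3 and Llama-3 model families reveals a clear scaling trajectory for resilience to sycophancy. Reasoning-optimized models, despite high vanilla accuracy, frequently rationalize incorrect suggestions in their internal reasoning traces under authoritative pressure. The Adjusted Sycophancy Score effectively isolates alignment bias.
Insight: The novelty lies in a robust evaluation framework based on medical MCQA and a new metric, the Adjusted Sycophancy Score, for quantifying sycophantic behaviour. The analysis suggests that simplified reasoning structures may be more robust to expert-driven sycophancy than complex "Thinking" models, which has important implications for designing clinically reliable models.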
Abstract: As LLMs are increasingly integrated into clinical workflows, their tendency for sycophancy, prioritizing user agreement over factual accuracy, poses significant risks to patient safety. While existing evaluations often rely on subjective datasets, we introduce a robust framework grounded in medical MCQA with verifiable ground truths. We propose the Adjusted Sycophancy Score, a novel metric that isolates alignment bias by accounting for stochastic model instability, or “confusability”. Through an extensive scaling analysis of the Qwen-3 and Llama-3 families, we identify a clear scaling trajectory for resilience. Furthermore, we reveal a counter-intuitive vulnerability in reasoning-optimized “Thinking” models: while they demonstrate high vanilla accuracy, their internal reasoning traces frequently rationalize incorrect user suggestions under authoritative pressure. Our results across frontier models suggest that benchmark performance is not a proxy for clinical reliability, and that simplified reasoning structures may offer superior robustness against expert-driven sycophancy.
[37] Code over Words: Overcoming Semantic Inertia via Code-Grounded Reasoning cs.CL | cs.AIPDF
Manjie Xu, Isabella Yin, Xinyi Tu, Chi Zhang, Yixin Zhu
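TL;DR: This paper studies "semantic inertia" in large language models (LLMs): the difficulty of suppressing pre-trained priors (e.g., "lava is dangerous") when dynamic, in-context rules contradict them. The authors use the game Baba Is You as a testbed, where physical laws are mutable text rules, enabling precise evaluation of a model's ability to override learned priors when the rules change. They find that larger models can exhibit inverse scaling on natural language reasoning tasks that require suppressing pre-trained associations, performing worse than smaller models. The paper attributes this to natural language encoding, which entangles descriptive semantics with logical rules, so that models keep hallucinating familiar physics even when explicit contradictory rules are present. To address this, the paper represents dynamics as executable code rather than descriptive text, which reverses the inverse scaling trend and enables effective prior inhibition. The authors introduce Code-Grounded Vistas (LCV), which fine-tunes models on counterfactual pairs and identifies states with contradictory rules, forcing the model to attend to logical constraints rather than visual semantics. This training-time approach outperforms expensive inference-time search methods in both efficiency and accuracy. The results show that representation fundamentally determines whether scaling improves or impairs contextual reasoning, challenging the common assumption that bigger models are always better.
Details
Motivation: LLMs suffer from semantic inertia: they struggle to suppress pre-trained priors in order to adapt to dynamic, in-context rules that contradict them. This limits their reasoning in domains that require dynamically overriding learned priors, such as interactive environments with mutable rules.
Result: A quantitative evaluation in the Baba Is You environment shows that larger models exhibit inverse scaling (performing worse than smaller models) on natural language reasoning tasks that require suppressing pre-trained associations. The proposed code-grounded representation method (LCV) reverses this trend, outperforms inference-time search methods in both efficiency and accuracy, and achieves effective prior inhibition.
Insight: The core contribution is showing that the form of representation (natural language vs. executable code) fundamentally affects how scaling behaves, and proposing a code-grounded training method (LCV) to overcome semantic inertia. This challenges the assumption that bigger models are always better and offers a transferable idea for domains that require reasoning over dynamic rules: decoupling logical rules from descriptive semantics (for example, through a code representation) may be more effective than simply scaling up the model.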
Abstract: LLMs struggle with Semantic Inertia: the inability to inhibit pre-trained priors (e.g., “Lava is Dangerous”) when dynamic, in-context rules contradict them. We probe this phenomenon using Baba Is You, where physical laws are mutable text rules, enabling precise evaluation of models’ ability to override learned priors when rules change. We quantitatively observe that larger models can exhibit inverse scaling: they perform worse than smaller models when natural language reasoning requires suppressing pre-trained associations (e.g., accepting “Lava is Safe”). Our analysis attributes this to natural language encoding, which entangles descriptive semantics and logical rules, leading to persistent hallucinations of familiar physics despite explicit contradictory rules. Here we show that representing dynamics as executable code, rather than descriptive text, reverses this trend and enables effective prior inhibition. We introduce Code-Grounded Vistas (LCV), which fine-tunes models on counterfactual pairs and identifies states with contradictory rules, thereby forcing attention to logical constraints rather than visual semantics. This training-time approach outperforms expensive inference-time search methods in both efficiency and accuracy. Our results demonstrate that representation fundamentally determines whether scaling improves or impairs contextual reasoning. This challenges the assumption that larger models are universally better, with implications for domains that require dynamic overriding of learned priors.
[38] Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction cs.CLPDF
Mikel Zubillaga, Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre
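TL;DR: This paper proposes ThinkTwice, a framework for document-level information extraction that samples multiple candidate templates and uses a selection module to pick the best one, overcoming the limitations of greedy decoding.
Details
Motivation: Greedy decoding limits output quality in document-level information extraction. The paper exploits the variability produced by sampling to obtain better solutions, especially when combined with reasoning models.
Result: Experiments show that both unsupervised and supervised ThinkTwice outperform greedy decoding baselines and the previous best methods, achieving SOTA performance on document-level information extraction.
Insight: The novelty lies in treating output variability as an advantage rather than a limitation, proposing a sampling and selection framework, and introducing a rejection-sampling-based method for generating silver data to address the scarcity of annotated reasoning trajectories.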
Abstract: Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the state-of-the-art.
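The unsupervised selection step described in this abstract (choosing the candidate that best agrees with the other sampled outputs) might look roughly like the sketch below. The template representation (a set of `(slot, value)` pairs) and the use of pairwise F1 as the agreement measure are assumptions made for illustration; the abstract does not specify either.

```python
def f1_overlap(a, b):
    """F1 overlap between two templates given as sets of (slot, value) pairs."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(b)
    return 2 * precision * recall / (precision + recall)


def select_by_agreement(candidates):
    """Return the candidate with the highest mean F1 against all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(f1_overlap(candidates[i], c) for c in others) / len(others)

    # Ties are broken in favour of the earliest sample.
    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]
```

With several sampled templates, an outlier that shares no slot fillers with the rest receives a low agreement score and is not selected.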
[39] Latent Knowledge as a Predictor of Fact Acquisition in Fine-Tuned Large Language Models cs.CLPDF
Daniel B. Hier, Tayo Obafemi-Ajayi
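TL;DR: This paper studies how a large language model (Llama 3.1 8B Instruct) acquires factual knowledge during fine-tuning. It finds that latent knowledge already present after pretraining is the strongest predictor of how quickly facts are acquired during fine-tuning and also influences the model's limited generalization to unseen facts, while resistance to forgetting depends on how much the knowledge is reinforced during training.
Details
Motivation: When LLMs learn new facts through fine-tuning, some facts are learned faster than others, generalization is limited, and knowledge can degrade. The paper aims to understand how the model's internal knowledge state after pretraining (in particular latent knowledge) affects subsequent learning dynamics.
Result: On the Human Phenotype Ontology (HPO) task, deterministic recall rises from a baseline of 2.8% to 71.9% after fine-tuning. Latent knowledge is the strongest predictor of the speed of fact acquisition (hazard ratio, HR, 2.6) and is associated with earlier, higher peak learning rates and faster convergence. Generalization to unseen Gene Ontology (GO) facts is only 5.8%, but is more likely when latent knowledge is present. Unseen GO mappings degrade more often than seen ones.
Insight: The novelty lies in modelling fine-tuning as a time-to-event process, using stochastic decoding to probe latent knowledge and Cox proportional hazards models to quantify predictors. The core insight is that knowledge present in latent form after pretraining is a key prerequisite for efficient fine-tuning, while reinforcement during training (repeated exposure) is essential for preventing forgetting, which offers guidance for efficient and stable fine-tuning.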
Abstract: Large language models store biomedical facts with uneven strength after pretraining: some facts are present in the weights but are not reliably accessible under deterministic decoding (latent knowledge), while others are scarcely represented. We fine-tuned Llama 3.1 8B Instruct to learn ontology term identifier mappings from the Human Phenotype Ontology (800 pairs) and the Gene Ontology (400 training pairs), withholding 400 GO pairs to test generalization. Treating learning as a time-to-event process across 20 epochs, we used stochastic decoding to detect latent knowledge at baseline and Cox proportional hazards models to identify predictors of acquisition, generalization, and degradation. Baseline deterministic recall for HPO was 2.8%, rising to 71.9% after fine-tuning. Latent knowledge was the strongest predictor of faster fact acquisition (HR 2.6) and was associated with earlier, higher peak learning rates and faster convergence; identifier frequency and curated annotation counts had smaller effects. Generalization to withheld GO facts was uncommon (5.8%) but more likely when latent knowledge was present. Previously correct GO mappings degraded more often for withheld (unseen) terms than for trained (seen) terms, suggesting a protective effect of reinforcement during training. These results show that latent knowledge predicts both the speed of factual learning during fine-tuning and the limited generalization of unseen ontology facts, while resistance to degradation depends on whether facts are reinforced.
[40] From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation cs.CLPDF
Yuxin Jiang, Yufei Wang, Qiyuan Zhang, Xingshan Zeng, Liangyou Li
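TL;DR: This paper proposes RLVRR (reinforcement learning with verifiable reference-based rewards), a new method that addresses ambiguous reward signals and inefficiency when applying reinforcement learning to open-ended generation. It extracts an ordered linguistic signal (a reward chain) from high-quality reference texts and decomposes the reward into two dimensions, content and style, combining the exploratory strength of RL with the efficiency of supervised fine-tuning.
Details
Motivation: Reinforcement learning based on a verifiable answer (a single-dot signal) works well for reasoning tasks, but in open-ended generation it suffers from inefficiency and reward hacking because there is no unambiguous ground truth. The paper aims to extend the verifiable-reward paradigm to open-ended generation to overcome these challenges.
Result: Extensive experiments on more than 10 benchmarks with Qwen and Llama models show that RLVRR substantially outperforms supervised fine-tuning trained with ten times more data as well as advanced reward models, unifies the training of structured reasoning and open-ended generation, and generalizes more effectively while preserving output diversity.
Insight: The core innovation is decomposing the reward into content (preserving deterministic core concepts such as keywords) and style (evaluating adherence to stylistic properties through LLM-based verification), and introducing a "reward chain" extracted from reference texts as an ordered supervision signal, providing a principled and efficient path for general-purpose LLM alignment.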
Abstract: Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. We release our code and data at https://github.com/YJiangcm/RLVRR.
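To make the content/style decomposition concrete, here is a rough sketch of a reference-based reward. It is not the released RLVRR code: the greedy in-order keyword match, the `style_verifier` callable (a stand-in for the LLM-based check), and the equal default weighting are all assumptions made for illustration.

```python
def content_reward(response, keywords):
    """Fraction of reference keywords found in the response in reference order."""
    if not keywords:
        return 0.0
    text = response.lower()
    position = 0
    hits = 0
    for keyword in keywords:
        # Greedy scan: each keyword must appear after the previous match.
        index = text.find(keyword.lower(), position)
        if index != -1:
            hits += 1
            position = index + len(keyword)
    return hits / len(keywords)


def chain_reward(response, keywords, style_verifier, content_weight=0.5):
    """Weighted sum of the content reward and a style score in [0, 1]."""
    style = style_verifier(response)  # hypothetical LLM-based verifier
    if not 0.0 <= style <= 1.0:
        raise ValueError("style_verifier must return a value in [0, 1]")
    content = content_reward(response, keywords)
    return content_weight * content + (1.0 - content_weight) * style
```

Because the keywords are matched in order, a response that mentions the right concepts in a scrambled order scores lower than one that follows the reference's sequence.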
[41] Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection cs.CL | cs.LGPDF
Devansh Srivastav, David Pape, Lea Schönherr
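TL;DR: This paper systematically studies hard-to-detect hidden intentions in large language models. It proposes a taxonomy of ten categories of hidden intentions and shows experimentally that existing detection methods (including reasoning and non-reasoning LLM judges) fail in realistic open-world settings, especially under low-prevalence conditions. Case studies further show that these hidden intentions are widespread in deployed SOTA models.
Details
Motivation: LLMs are increasingly embedded in everyday decision-making, but their outputs can encode subtle, unintended behaviours (hidden intentions). These may arise from training artefacts or be maliciously induced, are difficult to detect, and can shape user beliefs and actions, so their detectability needs systematic analysis.
Result: In open-world settings, existing detection methods (including reasoning-based LLM judges) collapse, especially under low-prevalence conditions, where false positives overwhelm precision and false negatives conceal true risks. Stress tests reveal trade-offs between precision and prevalence and between precision and false negative rate, showing that auditing succeeds only with vanishingly small false positive rates or strong priors on manipulation types. A qualitative case study confirms that all ten categories of hidden intentions manifest in deployed SOTA LLMs.
Insight: The novelty lies in the first systematic analysis of detectability failures of hidden intentions in LLMs under open-world settings, together with a ten-category taxonomy organised by intent, mechanism, context, and impact, which shifts attention from surface-level behaviours to design-level strategies of influence. The taxonomy provides a foundation for understanding, inducing, and stress-testing such behaviours, and can be used to anticipate evolving threats and inform governance.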
Abstract: LLMs are increasingly embedded in everyday decision-making, yet their outputs can encode subtle, unintended behaviours that shape user beliefs and actions. We refer to these covert, goal-directed behaviours as hidden intentions, which may arise from training and optimisation artefacts, or be deliberately induced by an adversarial developer, yet remain difficult to detect in practice. We introduce a taxonomy of ten categories of hidden intentions, grounded in social science research and organised by intent, mechanism, context, and impact, shifting attention from surface-level behaviours to design-level strategies of influence. We show how hidden intentions can be easily induced in controlled models, providing both testbeds for evaluation and demonstrations of potential misuse. We systematically assess detection methods, including reasoning and non-reasoning LLM judges, and find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions, where false positives overwhelm precision and false negatives conceal true risks. Stress tests on precision-prevalence and precision-FNR trade-offs reveal why auditing fails without vanishingly small false positive rates or strong priors on manipulation types. Finally, a qualitative case study shows that all ten categories manifest in deployed, state-of-the-art LLMs, emphasising the urgent need for robust frameworks. Our work provides the first systematic analysis of detectability failures of hidden intentions in LLMs under open-world settings, offering a foundation for understanding, inducing, and stress-testing such behaviours, and establishing a flexible taxonomy for anticipating evolving threats and informing governance.
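The claim that false positives overwhelm precision at low prevalence follows directly from Bayes' rule. The short sketch below works through that arithmetic; the detector rates used in the usage note are illustrative numbers, not results from the paper.

```python
def precision(prevalence, true_positive_rate, false_positive_rate):
    """Precision of a detector, given base rate, TPR (1 - FNR), and FPR."""
    true_positives = true_positive_rate * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0  # nothing is ever flagged, so precision is reported as 0
    return true_positives / flagged
```

For example, a judge with a 90% true positive rate and a 5% false positive rate has a precision of about 0.018 when only 1 in 1,000 outputs carries a hidden intention, so most flagged outputs would be false alarms.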
[42] From Classification to Ranking: Enhancing LLM Reasoning Capabilities for MBTI Personality Detection cs.CLPDF
Yuan Cao, Feixiang Liu, Xinyue Wang, Yihan Zhu, Hui Xu
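TL;DR: This paper recasts personality detection from a classification task into a ranking task, and combines supervised fine-tuning with a ranking-based reinforcement learning paradigm to strengthen LLM reasoning for MBTI personality detection.
Details
Motivation: Existing prompt-based and classifier-based methods struggle to capture the complexity and nuance of personality traits, and they depend heavily on expert knowledge without the ability to learn patterns autonomously.
Result: The method achieves state-of-the-art performance on multiple benchmarks.
Insight: The paper treats personality detection as a ranking task rather than a classification task and designs a ranking-based reward function used with the GRPO reinforcement learning algorithm, so that the model learns optimal answer rankings and copes better with subjective interpretations and blurred boundaries.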
Abstract: Personality detection aims to measure an individual’s corresponding personality traits through their social media posts. The advancements in Large Language Models (LLMs) offer novel perspectives for personality detection tasks. Existing approaches enhance personality trait analysis by leveraging LLMs to extract semantic information from textual posts as prompts, followed by training classifiers for categorization. However, accurately classifying personality traits remains challenging due to the inherent complexity of human personality and subtle inter-trait distinctions. Moreover, prompt-based methods often exhibit excessive dependency on expert-crafted knowledge without autonomous pattern-learning capacity. To address these limitations, we view personality detection as a ranking task rather than a classification and propose a corresponding reinforcement learning training paradigm. First, we employ supervised fine-tuning (SFT) to establish personality trait ranking capabilities while enforcing standardized output formats, creating a robust initialization. Subsequently, we introduce Group Relative Policy Optimization (GRPO) with a specialized ranking-based reward function. Unlike verification tasks with definitive solutions, personality assessment involves subjective interpretations and blurred boundaries between trait categories. Our reward function explicitly addresses this challenge by training LLMs to learn optimal answer rankings. Comprehensive experiments have demonstrated that our method achieves state-of-the-art performance across multiple personality detection benchmarks.
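As a rough illustration of combining a ranking-based reward with GRPO, the sketch below scores a ranked list of candidate types by reciprocal rank and then normalizes rewards within a sampled group. The reciprocal-rank reward is an assumption (the abstract does not give the authors' reward function); the group normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage, computed here with the population standard deviation.

```python
import statistics


def reciprocal_rank_reward(ranking, true_label):
    """Hypothetical reward: 1 / rank of the true label, or 0 if it is absent."""
    if true_label not in ranking:
        return 0.0
    return 1.0 / (ranking.index(true_label) + 1)


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]
```

Unlike a binary correct/incorrect reward, the reciprocal rank still gives partial credit when the true type is ranked second or third, which fits the blurred category boundaries the abstract mentions.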
[43] Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning cs.CL | cs.LGPDF
Lintang Sutawika, Gokul Swamy, Zhiwei Steven Wu, Graham Neubig
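TL;DR: This paper proposes SP3F, a two-stage framework for improving the multilingual reasoning of large language models without any target-language data. It first performs supervised fine-tuning on translated English question-answer pairs to raise base model correctness, then runs reinforcement learning in a self-play fashion with a pairwise judge that receives the English reference answer as privileged information, so that better and worse responses can be distinguished even when none is fully correct.
Details
Motivation: Current reasoning LLMs often perform far worse on questions asked in languages that are rare in their training data than on the same questions in English. The paper aims to close this multilingual reasoning gap without relying on any target-language data.
Result: SP3F greatly improves base model performance and even outperforms fully post-trained models on multiple math and non-math tasks while using less than 1% of their training data. The result holds across single-language, multilingual, and generalization-to-unseen-language settings.
Insight: The core innovation is a pairwise judge that receives the English reference answer as privileged information, which yields useful preference feedback during reinforcement learning even when none of the model's responses is fully correct. This paradigm of self-play reinforcement learning with privileged information offers a novel and efficient approach to aligning models in data-scarce languages.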
Abstract: When asked a question in a language less seen in its training data, current reasoning large language models (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce SP3F (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without any data in the target language(s). First, we perform supervised fine-tuning (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as privileged information. Thus, even when none of the model’s responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, SP3F greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than 1% of the training data across the single-language, multilingual, and generalization to unseen language settings.
[44] Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale cs.CL | cs.LGPDF
Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque, Dhaval Potdar, Samia Zaman
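TL;DR: This paper proposes Reflect, an inference-time framework for aligning large language models (LLMs) with value principles written in natural language (a "constitution"). The framework requires no training or data: it works through in-context reasoning, combining base response generation, self-evaluation, self-critique, and final revision to improve adherence to complex principles in a plug-and-play manner while producing transparent reasoning traces.
Details
Motivation: Existing constitutional alignment methods based on parameter fine-tuning (such as RLHF) are computationally expensive, complex to engineer, and dependent on hard-to-obtain human annotation data, so a more efficient, training-free alignment method is needed.
Result: At inference time, Reflect significantly improves LLM conformance to diverse and complex principles, including principles that differ from those emphasized in the model's original fine-tuning, without sacrificing factual reasoning. It is particularly effective at reducing rare but serious violations of principles, improving the safety and robustness of generated content.
Insight: The novelty lies in achieving alignment entirely at inference time through explicit in-context reasoning over principles (self-evaluation, self-critique, and revision), which provides transparency and naturally produces training data for traditional parameter fine-tuning, supporting efficient scaling and lower inference overhead in long-term deployment.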
Abstract: The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (such as avoiding biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose Reflect, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. Reflect operates entirely in-context, combining a (i) constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. Reflect’s technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that Reflect significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model’s original parameter fine-tuning, without sacrificing factual reasoning. Reflect is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. Finally, we show that Reflect naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.
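The four steps listed in this abstract (base response, self-evaluation, self-critique, revision) can be sketched as a small pipeline around a text-generation callable. This is a sketch under stated assumptions rather than the authors' implementation: `generate` is a hypothetical stand-in for any LLM call, the prompt wording is invented, and gating the critique and revision on a YES verdict from the self-evaluation is an assumption about control flow that the abstract does not confirm.

```python
from typing import Callable, Dict, Sequence


def reflect_pipeline(
    prompt: str,
    principles: Sequence[str],
    generate: Callable[[str], str],  # hypothetical LLM call: prompt in, text out
) -> Dict[str, str]:
    """Run base response, self-evaluation, and (if flagged) critique and revision."""
    constitution = "\n".join(f"- {p}" for p in principles)

    # (i) Constitution-conditioned base response.
    base = generate(f"Follow these principles:\n{constitution}\n\nUser: {prompt}")

    # (ii) Self-evaluation against the same principles.
    verdict = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\n"
        "Does the response violate any principle? Answer YES or NO."
    )
    trace = {"base": base, "evaluation": verdict}
    if not verdict.strip().upper().startswith("YES"):
        trace["final"] = base  # assumed: no revision when nothing is flagged
        return trace

    # (iii)(a) Self-critique, then (iii)(b) final revision.
    critique = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\n"
        "Explain which principles the response violates and why."
    )
    revised = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so that it follows every principle."
    )
    trace.update({"critique": critique, "final": revised})
    return trace
```

The returned dictionary doubles as a reasoning trace: it records the base response, the evaluation verdict, and any critique alongside the final answer.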
[45] Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory cs.CL | cs.AI | cs.IRPDF
Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu
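TL;DR: This paper proposes Dep-Search, a dependency-aware search framework that integrates structured reasoning, retrieval, and persistent memory, trained with GRPO, to strengthen large language models on complex multi-hop reasoning tasks.
Details
Motivation: Existing search frameworks rely heavily on implicit natural language reasoning to decide on search strategies and to use retrieved information across steps, which makes it hard to manage dependencies between sub-questions, reuse previously retrieved knowledge efficiently, and learn optimal strategies through reinforcement learning.
Result: Experiments on seven diverse question answering datasets show that Dep-Search significantly improves LLMs' ability to handle complex multi-hop reasoning tasks, substantially outperforming strong baselines across different model scales.
Insight: The novelty lies in explicit control mechanisms that let the model decompose questions into sub-questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries, resulting in a more structured and efficient reasoning process.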
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs’ ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.
[46] MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts cs.CLPDF
Etienne Lanzeray, Stephane Meilliez, Malo Ruelle, Damien Sileo
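TL;DR: This paper introduces the MortalMATH benchmark for evaluating how large language models reason in emergency contexts. It finds that specialized deep-reasoning models (such as Qwen-3-32b) ignore life-threatening situations described by the user and keep solving the math problem, with task completion rates above 95%, whereas generalist models (such as Llama-3.1) prioritize the emergency. In addition, the reasoning process introduces dangerous delays of up to 15 seconds.
Details
Motivation: The study asks whether optimizing LLMs for deep reasoning causes a "tunnel vision" in which the model focuses so heavily on the computational task that it neglects safety in critical emergencies.
Result: Experiments on the MortalMATH benchmark (150 emergency scenarios) show that specialized reasoning models ignore the emergency, maintain high task completion rates, and respond with dangerous delays, while generalist models successfully refuse the math task in order to address the danger.
Insight: The contribution is to expose a potential conflict between the training objective (pursuing the correct answer) and the needs of safe deployment (survival instincts), and to provide a benchmark dedicated to evaluating it. This gives AI safety research an important evaluation dimension and a warning regarding the situational awareness and emergency-interruption abilities of task-oriented models.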
Abstract: Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a “tunnel vision” that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.
cs.CV [Back]
[47] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility cs.CV | cs.AIPDF
Honglin Lin, Chonghan Qin, Zheng Liu, Qizhi Pei, Yu Li
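TL;DR: This paper systematically studies scientific image synthesis. It analyzes two paradigms, direct pixel-based generation and programmatic synthesis, proposes the logic-driven ImgCoder framework to improve structural precision, and introduces the SciGenBench benchmark for evaluating the scientific correctness of generated images. It finds systematic failure modes in pixel-based models, reveals a trade-off between expressiveness and precision, and shows that fine-tuning large multimodal models on rigorously verified synthetic scientific images yields consistent reasoning gains.
Details
Motivation: Existing text-to-image models often produce images that are visually plausible but scientifically incorrect. This visual-logic divergence limits their value for downstream scientific reasoning. The paper addresses the rigor of scientific image synthesis.
Result: The evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental expressiveness-precision trade-off. Fine-tuning large multimodal models on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to the text domain.
Insight: The novelty lies in the logic-driven ImgCoder framework (which follows an "understand - plan - code" workflow) and the SciGenBench benchmark for assessing scientific correctness. The core insight is that programmatic synthesis is more scientifically rigorous than pixel-based generation, and that high-quality synthetic data can effectively extend the scientific reasoning abilities of multimodal models.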
Abstract: While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scientifically incorrect, resulting in a persistent visual-logic divergence that limits their value for downstream reasoning. Motivated by recent advances in next-generation T2I models, we conduct a systematic study of scientific image synthesis across generation paradigms, evaluation, and downstream use. We analyze both direct pixel-based generation and programmatic synthesis, and propose ImgCoder, a logic-driven framework that follows an explicit “understand - plan - code” workflow to improve structural precision. To rigorously assess scientific correctness, we introduce SciGenBench, which evaluates generated images based on information utility and logical validity. Our evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental expressiveness-precision trade-off. Finally, we show that fine-tuning Large Multimodal Models (LMMs) on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to the text domain, validating high-fidelity scientific synthesis as a viable path to unlocking massive multimodal reasoning capabilities.
[48] AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs cs.CV | cs.AIPDF
Aahana Basappa, Pranay Goel, Anusri Karra, Anish Karra, Asa Gilmore
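TL;DR: This paper introduces AMVICC, a new benchmark for systematically analyzing and comparing the failure modes of multimodal large language models (MLLMs) and image generation models (IGMs) on visual reasoning tasks. It reveals shared and model-specific deficiencies in understanding or generating basic visual concepts such as object orientation, quantity, and spatial relationships.
Details
Motivation: Despite rapid progress in machine learning, vision language models (VLMs) still fall short in understanding or generating basic visual concepts, which highlights a gap in elementary visual reasoning. The paper aims to dissect these failure modes systematically by creating a cross-modal evaluation benchmark.
Result: After testing 11 MLLMs and 3 IGMs, the results show that failure modes are partly shared across models and modalities and partly specific to them. IGMs have poor control over fine-grained visual attributes when responding to prompts, especially explicit ones. The benchmark applies most directly to evaluating existing SOTA models on structured visual reasoning tasks.
Insight: The novelty lies in AMVICC, the first unified benchmark for failure mode analysis across modalities (image-to-text and text-to-image), built by adapting MMVP benchmark questions into explicit and implicit prompts. It provides a framework for future cross-modal alignment research and helps determine whether generation and understanding failures stem from shared underlying limitations.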
Abstract: We investigate visual reasoning limitations of both multimodal large language models (MLLMs) and image generation models (IGMs) by creating a novel benchmark to systematically compare failure modes across image-to-text and text-to-image tasks, enabling cross-modal evaluation of visual understanding. Despite rapid growth in machine learning, vision language models (VLMs) still fail to understand or generate basic visual concepts such as object orientation, quantity, or spatial relationships, which highlights gaps in elementary visual reasoning. By adapting MMVP benchmark questions into explicit and implicit prompts, we create AMVICC, a novel benchmark for profiling failure modes across various modalities. After testing 11 MLLMs and 3 IGMs in nine categories of visual reasoning, our results show that failure modes are often shared between models and modalities, but certain failures are model-specific and modality-specific, and this can potentially be attributed to various factors. IGMs consistently struggle to manipulate specific visual components in response to prompts, especially explicit prompts, suggesting poor control over fine-grained visual attributes. Our findings apply most directly to the evaluation of existing state-of-the-art models on structured visual reasoning tasks. This work lays the foundation for future cross-modal alignment studies, offering a framework to probe whether generation and interpretation failures stem from shared limitations to guide future improvements in unified vision-language modeling.
[49] Hybrid Deep Feature Extraction and ML for Construction and Demolition Debris Classification cs.CV | cs.LGPDF
Obai Alashram, Nejad Alagha, Mahmoud AlKakuri, Zeeshan Swaveel, Abigail Copiaco
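TL;DR: This paper proposes a hybrid vision pipeline that combines deep feature extraction with classical machine learning classifiers for automated construction and demolition (C&D) debris classification. The study collects a balanced, high-quality dataset of 1,800 images covering four material categories (Ceramic/Tile, Concrete, Trash/Waste, and Wood), extracts deep features with a pre-trained Xception network, and systematically evaluates several classifiers, including SVM, kNN, Bagged Trees, LDA, and Logistic Regression. The hybrid approach reaches 99.5% accuracy and macro-F1, outperforming more complex end-to-end deep learning models.
Details
Motivation: The construction industry produces large volumes of debris, and effective sorting is critical for sustainable waste management and resource recovery. Existing methods may be too complex or not robust enough, so an efficient, deployable automated classification solution is needed.
Result: On a self-collected dataset from real construction sites in the UAE, hybrid pipelines that pair Xception features with simple classifiers such as Linear SVM, kNN, and Bagged Trees achieve SOTA performance, with accuracy and macro-F1 of up to 99.5%, surpassing end-to-end deep learning models.
Insight: The novelty lies in combining pre-trained deep feature extraction with classical machine learning classifiers, showing that simple classifiers can perform very well given high-quality features, which lowers computational complexity and makes field deployment more feasible. The work also contributes a high-quality, balanced, real-world debris dataset as a benchmark for follow-up research.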
Abstract: The construction industry produces significant volumes of debris, making effective sorting and classification critical for sustainable waste management and resource recovery. This study presents a hybrid vision-based pipeline that integrates deep feature extraction with classical machine learning (ML) classifiers for automated construction and demolition (C&D) debris classification. A novel dataset comprising 1,800 balanced, high-quality images representing four material categories (Ceramic/Tile, Concrete, Trash/Waste, and Wood) was collected from real construction sites in the UAE, capturing diverse real-world conditions. Deep features were extracted using a pre-trained Xception network, and multiple ML classifiers, including SVM, kNN, Bagged Trees, LDA, and Logistic Regression, were systematically evaluated. The results demonstrate that hybrid pipelines using Xception features with simple classifiers such as Linear SVM, kNN, and Bagged Trees achieve state-of-the-art performance, with up to 99.5% accuracy and macro-F1 scores, surpassing more complex or end-to-end deep learning approaches. The analysis highlights the operational benefits of this approach for robust, field-deployable debris identification and provides pathways for future integration with robotics and onsite automation systems.
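The macro-F1 score reported above averages the per-class F1 scores with equal weight, regardless of class size. A minimal pure-Python sketch of the metric follows; the class labels in the usage note are illustrative and the function is not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

For instance, with true labels `["wood", "wood", "concrete", "concrete"]` and predictions `["wood", "concrete", "concrete", "concrete"]`, the per-class F1 scores are 0.8 for concrete and 2/3 for wood, and the macro-F1 is their mean.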
[50] Arabic Sign Language Recognition using Multimodal Approach cs.CV | cs.AIPDF
Ghadeer Alanazi, Abir Benabid
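TL;DR: This paper proposes a multimodal approach to Arabic Sign Language recognition that combines data from a Leap Motion sensor and an RGB camera. The system has two parallel subnetworks: a custom dense neural network for the Leap Motion data, and an image subnetwork based on a fine-tuned VGG16 model enhanced with data augmentation. Features are concatenated in a fusion model, passed through fully connected layers, and classified with SoftMax. Evaluated on a custom dataset of 18 Arabic Sign Language words, the system correctly recognizes 13 words, for an overall accuracy of 78%.
Details
Motivation: Existing Arabic Sign Language recognition systems rely on a single sensor (such as Leap Motion or an RGB camera) and are limited in tracking complex hand orientations and precisely recognizing 3D hand movements. The study therefore explores whether combining multimodal data can improve recognition.
Result: On a custom dataset of 18 Arabic Sign Language words, the system correctly recognizes 13 words with an overall accuracy of 78%, providing a reference point for the preliminary use of multimodal fusion in sign language recognition.
Insight: The contributions include a multimodal fusion approach that combines Leap Motion and RGB data to overcome single-sensor limitations, and a parallel subnetwork architecture (a custom dense neural network and a fine-tuned VGG16) that processes each modality separately. The method strengthens robustness through feature concatenation and regularization (such as dropout and L2), but the dataset is small and the accuracy leaves room for improvement; future work could optimize the model and expand the dataset.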
Abstract: Arabic Sign Language (ArSL) is an essential communication method for individuals in the Deaf and Hard-of-Hearing community. However, existing recognition systems face significant challenges due to their reliance on single sensor approaches like Leap Motion or RGB cameras. These systems struggle with limitations such as inadequate tracking of complex hand orientations and imprecise recognition of 3D hand movements. This research paper aims to investigate the potential of a multimodal approach that combines Leap Motion and RGB camera data to explore the feasibility of recognition of ArSL. The system architecture includes two parallel subnetworks: a custom dense neural network for Leap Motion data, incorporating dropout and L2 regularization, and an image subnetwork based on a fine-tuned VGG16 model enhanced with data augmentation techniques. Feature representations from both modalities are concatenated in a fusion model and passed through fully connected layers, with final classification performed via SoftMax activation to analyze spatial and temporal features of hand gestures. The system was evaluated on a custom dataset comprising 18 ArSL words, of which 13 were correctly recognized, yielding an overall accuracy of 78%. These results offer preliminary insights into the viability of multimodal fusion for sign language recognition and highlight areas for further optimization and dataset expansion.
[51] Summary of the Unusual Activity Recognition Challenge for Developmental Disability Support cs.CV | cs.AI | cs.HCPDF
Christina Garcia, Nhat Tan Le, Taihei Fujioka, Umang Dobhal, Milyun Ni’ma Shoumi
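TL;DR: This paper gives an overview of the "Recognize the Unseen: Unusual Behavior Recognition from Pose Data Challenge" hosted at ISAS 2025, which aims to automatically recognize unusual behaviors in facilities for individuals with developmental disabilities using non-invasive pose estimation data. Participating teams distinguished normal from unusual activities based on skeleton keypoints extracted from videos of simulated scenarios. The dataset reflects real-world behavioral imbalance and temporal irregularity, and evaluation used a Leave-One-Subject-Out (LOSO) strategy to ensure subject-agnostic generalization. The challenge attracted 40 teams applying methods ranging from classical machine learning to deep learning architectures, with submissions assessed primarily by macro-averaged F1 score.
Details
Motivation: To address the pressing need for automatic recognition of unusual behaviors from non-invasive pose data in facilities for individuals with developmental disabilities, in support of healthcare and behavior monitoring.
Result: The results highlight the difficulty of modeling rare, abrupt actions in noisy, low-dimensional data. The primary metric was macro-averaged F1, and generalization was tested with a LOSO strategy.
Insight: The contribution lies in the focus on the imbalance and temporal irregularity of real-world data and the emphasis on capturing temporal and contextual nuances in behavior modeling, offering a reference for socially responsible AI applications such as healthcare.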
Abstract: This paper presents an overview of the Recognize the Unseen: Unusual Behavior Recognition from Pose Data Challenge, hosted at ISAS 2025. The challenge aims to address the critical need for automated recognition of unusual behaviors in facilities for individuals with developmental disabilities using non-invasive pose estimation data. Participating teams were tasked with distinguishing between normal and unusual activities based on skeleton keypoints extracted from video recordings of simulated scenarios. The dataset reflects real-world imbalance and temporal irregularities in behavior, and the evaluation adopted a Leave-One-Subject-Out (LOSO) strategy to ensure subject-agnostic generalization. The challenge attracted broad participation from 40 teams applying diverse approaches ranging from classical machine learning to deep learning architectures. Submissions were assessed primarily using macro-averaged F1 scores to account for class imbalance. The results highlight the difficulty of modeling rare, abrupt actions in noisy, low-dimensional data, and emphasize the importance of capturing both temporal and contextual nuances in behavior modeling. Insights from this challenge may contribute to future developments in socially responsible AI applications for healthcare and behavior monitoring.
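The two evaluation choices named in this abstract, macro-averaged F1 and Leave-One-Subject-Out splitting, can be illustrated with a minimal pure-Python sketch. The function names, toy labels, and subject IDs below are hypothetical and are not taken from the challenge's evaluation code.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    by_subject = defaultdict(list)
    for i, s in enumerate(subject_ids):
        by_subject[s].append(i)
    for held_out in sorted(by_subject):
        test = by_subject[held_out]
        train = sorted(i for s, idx in by_subject.items() if s != held_out for i in idx)
        yield held_out, train, test

# Hypothetical toy data: a classifier that always predicts the majority class.
y_true = ["normal"] * 8 + ["unusual"] * 2
y_pred = ["normal"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
folds = list(loso_splits(["s1", "s1", "s2", "s3", "s3", "s3"]))
```

On this toy data the majority-class predictor has 80% accuracy but a macro-F1 of about 0.44, which is why macro-averaging suits an imbalanced task. Each LOSO fold tests on a subject that the model never saw during training.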
[52] Single-Pixel Vision-Language Model for Intrinsic Privacy-Preserving Behavioral Intelligence cs.CV | cs.AIPDF
Hongjun An, Yiliang Song, Jiawei Shao, Zhe Sun, Xuelong Li
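TL;DR: This paper proposes the Single-Pixel Vision-Language Model (SP-VLM), a framework that captures low-dimensional dynamic information with single-pixel sensors and uses a vision-language model to infer complex behavioral patterns, enabling safety monitoring in privacy-sensitive environments. The framework protects personal identity while supporting behavioral intelligence tasks such as anomaly detection, people counting, and activity understanding.
Details
Motivation: In privacy-sensitive environments such as restrooms and changing rooms, conventional surveillance cannot be used because of privacy regulations and ethical constraints, so adverse social behaviors such as bullying and harassment are hard to detect and address promptly.
Result: Experiments show that single-pixel sensing effectively suppresses identity recoverability, rendering state-of-the-art face recognition systems ineffective below a critical sampling rate. SP-VLM can still extract meaningful behavioral semantics from severely degraded single-pixel observations, achieving robust anomaly detection, people counting, and activity understanding. The study also identifies a practical sampling-rate regime in which behavioral intelligence emerges while personal identity remains strongly protected.
Insight: The novelty lies in combining single-pixel sensing with a vision-language model to achieve privacy by design through a low-dimensional modality, balancing privacy protection with behavior monitoring and offering a human-rights-aligned technical path for safety monitoring in privacy-sensitive spaces.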
Abstract: Adverse social interactions, such as bullying, harassment, and other illicit activities, pose significant threats to individual well-being and public safety, leaving profound impacts on physical and mental health. However, these critical events frequently occur in privacy-sensitive environments like restrooms and changing rooms, where conventional surveillance is prohibited or severely restricted by stringent privacy regulations and ethical concerns. Here, we propose the Single-Pixel Vision-Language Model (SP-VLM), a novel framework that reimagines secure environmental monitoring. It achieves intrinsic privacy-by-design by capturing human dynamics through inherently low-dimensional single-pixel modalities and inferring complex behavioral patterns via seamless vision-language integration. Building on this framework, we demonstrate that single-pixel sensing intrinsically suppresses identity recoverability, rendering state-of-the-art face recognition systems ineffective below a critical sampling rate. We further show that SP-VLM can nonetheless extract meaningful behavioral semantics, enabling robust anomaly detection, people counting, and activity understanding from severely degraded single-pixel observations. Combining these findings, we identify a practical sampling-rate regime in which behavioral intelligence emerges while personal identity remains strongly protected. Together, these results point to a human-rights-aligned pathway for safety monitoring that can support timely intervention without normalizing intrusive surveillance in privacy-sensitive spaces.
[53] Ego4OOD: Rethinking Egocentric Video Domain Generalization via Covariate Shift Scoring cs.CV | cs.LGPDF
Zahra Vaseqi, James Clark
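TL;DR: This paper introduces Ego4OOD, a new benchmark for egocentric video domain generalization derived from Ego4D. By emphasizing measurable covariate diversity and reducing concept shift, it aims to evaluate more reliably how models generalize under changes in the input distribution. The paper also introduces a clustering-based covariate shift metric to quantify domain difficulty and adopts a one-vs-all binary training objective that decomposes multi-class action recognition into independent binary classification tasks. Experiments show that a lightweight two-layer fully connected network performs competitively with state-of-the-art methods on the Argo1M and Ego4OOD benchmarks.
Details
Motivation: Existing egocentric domain generalization benchmarks often conflate covariate shift with concept shift, making it hard to reliably evaluate generalization across input distributions. The paper addresses this limitation by building a benchmark that emphasizes covariate diversity and reduces concept shift, so that out-of-distribution generalization can be studied more cleanly.
Result: On Argo1M and Ego4OOD, a lightweight two-layer fully connected network (with fewer parameters and no additional modalities) performs competitively with state-of-the-art egocentric domain generalization methods. Empirical analysis shows a clear relationship between measured covariate shift and recognition performance.
Insight: Contributions include: 1) the Ego4OOD benchmark, which reduces concept shift through semantically coherent, moment-level action categories, emphasizes measurable covariate diversity, and provides a clustering-based covariate shift metric to quantify domain difficulty; 2) a one-vs-all binary training objective that decomposes multi-class recognition into independent binary tasks, which is particularly suited to handling interference between visually similar classes under feature distribution shift. Viewed objectively, the study underscores the importance of controlled benchmarks and quantitative domain characterization for studying out-of-distribution generalization, and it offers a lightweight, effective training strategy.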
Abstract: Egocentric video action recognition under domain shifts remains challenging due to large intra-class spatio-temporal variability, long-tailed feature distributions, and strong correlations between actions and environments. Existing benchmarks for egocentric domain generalization often conflate covariate shifts with concept shifts, making it difficult to reliably evaluate a model’s ability to generalize across input distributions. To address this limitation, we introduce Ego4OOD, a domain generalization benchmark derived from Ego4D that emphasizes measurable covariate diversity while reducing concept shift through semantically coherent, moment-level action categories. Ego4OOD spans eight geographically distinct domains and is accompanied by a clustering-based covariate shift metric that provides a quantitative proxy for domain difficulty. We further leverage a one-vs-all binary training objective that decomposes multi-class action recognition into independent binary classification tasks. This formulation is particularly well-suited for covariate shift by reducing interference between visually similar classes under feature distribution shift. Using this formulation, we show that a lightweight two-layer fully connected network achieves performance competitive with state-of-the-art egocentric domain generalization methods on both Argo1M and Ego4OOD, despite using fewer parameters and no additional modalities. Our empirical analysis demonstrates a clear relationship between measured covariate shift and recognition performance, highlighting the importance of controlled benchmarks and quantitative domain characterization for studying out-of-distribution generalization in egocentric video.
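The one-vs-all objective described above can be sketched as a sum of independent binary cross-entropy terms, one per class. The abstract does not give the paper's exact loss, so the sigmoid formulation and the names `one_vs_all_loss` and `predict` are assumptions chosen to show the standard form of the technique.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_loss(logits, target):
    """Sum of per-class binary cross-entropy terms.

    logits: one raw score per class; target: index of the true class.
    Each class is scored by its own sigmoid, so the classes do not compete
    through a shared softmax normalizer.
    """
    loss = 0.0
    for k, z in enumerate(logits):
        p = sigmoid(z)
        y = 1.0 if k == target else 0.0
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss

def predict(logits):
    """Pick the class whose binary classifier is most confident."""
    return max(range(len(logits)), key=lambda k: logits[k])
```

With all logits at zero, every binary term contributes ln 2, so a three-class example has loss 3 ln 2. Raising only the true class's logit and lowering the others reduces the loss.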
[54] A Computer Vision Pipeline for Iterative Bullet Hole Tracking in Rifle Zeroing cs.CV | cs.AIPDF
Robert M. Belcher, Brendan C. Degryse, Leonard R. Kosta, Christopher J. Lowrance
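TL;DR: This paper presents an end-to-end computer vision system for iterative bullet hole tracking during rifle zeroing. It uses YOLOv8 for small-object detection and Intersection over Union (IoU) analysis to differentiate bullet holes across sequential images, automating a process that traditionally relies on manual inspection.
Details
Motivation: Traditional rifle sight zeroing requires shooters to manually identify and differentiate bullet holes from multiple firing iterations, which introduces delays due to range safety protocols and risks human error, so an automated vision solution is needed.
Result: The system achieves 97.0% mean average precision on bullet hole detection and 88.8% accuracy in assigning bullet holes to the correct firing iteration.
Insight: Contributions include: a novel data augmentation technique that removes rather than adds objects to simulate realistic firing sequences, addressing the scarcity of labeled sequential data; an ORB-based perspective-correction preprocessing pipeline that standardizes target orientation and improves model accuracy; and a framework that generalizes to other domains requiring temporal differentiation of visually similar objects.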
Abstract: Adjusting rifle sights, a process commonly called “zeroing,” requires shooters to identify and differentiate bullet holes from multiple firing iterations. Traditionally, this process demands physical inspection, introducing delays due to range safety protocols and increasing the risk of human error. We present an end-to-end computer vision system for automated bullet hole detection and iteration-based tracking directly from images taken at the firing line. Our approach combines YOLOv8 for accurate small-object detection with Intersection over Union (IoU) analysis to differentiate bullet holes across sequential images. To address the scarcity of labeled sequential data, we propose a novel data augmentation technique that removes rather than adds objects to simulate realistic firing sequences. Additionally, we introduce a preprocessing pipeline that standardizes target orientation using ORB-based perspective correction, improving model accuracy. Our system achieves 97.0% mean average precision on bullet hole detection and 88.8% accuracy in assigning bullet holes to the correct firing iteration. While designed for rifle zeroing, this framework offers broader applicability in domains requiring the temporal differentiation of visually similar objects.
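The IoU-based step that separates old holes from new ones can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the detector returns axis-aligned boxes `(x1, y1, x2, y2)` in an already perspective-corrected frame, and the 0.5 matching threshold and function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def new_holes(previous, current, threshold=0.5):
    """Return boxes in `current` that overlap no box in `previous`."""
    return [c for c in current if all(iou(c, p) < threshold for p in previous)]

def assign_iterations(detections_per_image, threshold=0.5):
    """Label each hole with the 1-based index of the image where it first appears."""
    seen, labeled = [], []
    for iteration, boxes in enumerate(detections_per_image, start=1):
        fresh = new_holes(seen, boxes, threshold)
        labeled.extend((box, iteration) for box in fresh)
        seen.extend(fresh)
    return labeled

# Hypothetical detections from three consecutive target photos.
shots = [
    [(10, 10, 20, 20)],
    [(10, 10, 20, 20), (40, 40, 50, 50)],
    [(11, 10, 21, 20), (40, 40, 50, 50), (70, 15, 80, 25)],
]
labels = assign_iterations(shots)
```

In the third image the first box has shifted by one pixel but still overlaps its earlier detection with an IoU of about 0.82, so it is treated as the same hole. Only the box at `(70, 15, 80, 25)` is attributed to the third iteration.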
[55] A Mechanistic View on Video Generation as World Models: State and Dynamics cs.CV | cs.AIPDF
Luozhou Wang, Zhifei Chen, Yihua Du, Dongyu Yan, Wenhang Ge
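TL;DR: This paper proposes a new taxonomy centered on state construction and dynamics modeling, aiming to bridge the gap between today's "stateless" video generation architectures and classic state-centric world model theory. It divides state construction into implicit paradigms (context management) and explicit paradigms (latent compression), and analyzes dynamics modeling through knowledge integration and architectural reformulation. The paper also argues that evaluation should shift from visual fidelity to functional benchmarks that test physical persistence and causal reasoning. Finally, it identifies two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning-prior integration, so that the field can move from generating visually plausible videos to building robust, general-purpose world simulators.
Details
Motivation: Large-scale video generation models show emergent physical coherence and are seen as potential world models, but a gap remains between their "stateless" architectures and classic state-centric world model theory. The paper aims to bridge this gap with a new taxonomy and to shift evaluation from visual quality to functional capability.
Result: The abstract reports no quantitative experimental results or benchmark performance; the paper instead offers a conceptual framework and future research directions.
Insight: The contribution is a mechanistic taxonomy built on two pillars, state construction (implicit/explicit) and dynamics modeling (knowledge integration/architectural reformulation), together with a call to shift evaluation from visual fidelity to functional benchmarks such as physical persistence and causal reasoning. It identifies the key challenges (data-driven memory, compressed fidelity, latent factor decoupling) and paths toward turning video generation models into robust world simulators.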
Abstract: Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary “stateless” video architectures and classic state-centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning-prior integration. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
[56] GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing cs.CVPDF
Qigan Sun, Chaoning Zhang, Jianwei Zhang, Xudong Wang, Jiehui Xie
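TL;DR: This paper proposes GRASP, a parameter-efficient fine-tuning strategy for adapting multimodal large language models (MLLMs) to remote sensing visual question answering. Through a question-guided sparse fusion mechanism, it extracts spatial blocks from a frozen visual token grid, associates them with structured soft prompts, and dynamically aggregates task-relevant context into a compact global prompt, so the model can focus on relevant regions and filter out background noise.
Details
Motivation: When applied directly to remote sensing images, existing fine-tuning methods tend to overfit background noise or neglect target details because of the images' large scale variations, sparse target distributions, and complex regional semantics, which limits the effectiveness of MLLMs on remote sensing tasks.
Result: Extensive experiments on multiple remote sensing VQA benchmarks show that GRASP achieves competitive performance compared with existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
Insight: The novelty is a spatially structured soft prompt design combined with a question-guided sparse fusion mechanism, which enables dynamic, efficient attention to the sparse and semantically complex regions of remote sensing images. It is a parameter-efficient adaptation method tailored to the characteristics of remote sensing imagery.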
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have made significant progress in visual question answering tasks. However, directly applying existing fine-tuning methods to remote sensing (RS) images often leads to issues such as overfitting on background noise or neglecting target details. This is primarily due to the large-scale variations, sparse target distributions, and complex regional semantic features inherent in RS images. These challenges limit the effectiveness of MLLMs in RS tasks. To address these challenges, we propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Through a question-guided sparse fusion mechanism, GRASP dynamically aggregates task-specific context into a compact global prompt, enabling the model to focus on relevant regions while filtering out background noise. Extensive experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
[57] LoD Sketch Extraction from Architectural Models Using Generative AI: Dataset Construction for Multi-Level Architectural Design Generation cs.CV | cs.AIPDF
Xusheng Du, Athiwat Kongkaeo, Ye Zhang, Haoran Xie
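TL;DR: This paper proposes a method that uses generative AI to automatically extract multi-Level-of-Detail (LoD) sketches from detailed architectural models, addressing the lack of high-quality paired training data for multi-level architectural design generation. By progressively simplifying high-detail models, it automatically generates geometrically consistent and hierarchically coherent multi-LoD representations, providing data and technical support for AI-driven multi-level architectural generation.
Details
Motivation: Traditional architectural LoD modeling relies on manual work that is time-consuming, labor-intensive, and prone to geometric inconsistencies. Generative AI has great potential for generating multi-level models from sketches, but it is limited by the lack of high-quality paired LoD training data.
Result: Experiments show that the method maintains strong geometric consistency across LoD levels. For the LoD3-to-LoD2 and LoD2-to-LoD1 transitions, SSIM reaches 0.7319 and 0.7532, respectively, and the normalized Hausdorff distances are 25.1% and 61.0% of the image diagonal, reflecting controlled geometric deviation during abstraction.
Insight: The novelty lies in combining computer vision techniques with generative AI to build a progressive extraction pipeline from detailed representations to volumetric abstractions. It achieves progressive semantic simplification while preserving global structure and provides an automated way to construct paired multi-LoD datasets.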
Abstract: For architectural design, representation across multiple Levels of Detail (LoD) is essential for achieving a smooth transition from conceptual massing to detailed modeling. However, traditional LoD modeling processes rely on manual operations that are time-consuming, labor-intensive, and prone to geometric inconsistencies. While the rapid advancement of generative artificial intelligence (AI) has opened new possibilities for generating multi-level architectural models from sketch inputs, its application remains limited by the lack of high-quality paired LoD training data. To address this issue, we propose an automatic LoD sketch extraction framework using generative AI models, which progressively simplifies high-detail architectural models to automatically generate geometrically consistent and hierarchically coherent multi-LoD representations. The proposed framework integrates computer vision techniques with generative AI methods to establish a progressive extraction pipeline that transitions from detailed representations to volumetric abstractions. Experimental results demonstrate that the method maintains strong geometric consistency across LoD levels, achieving SSIM values of 0.7319 and 0.7532 for the transitions from LoD3 to LoD2 and from LoD2 to LoD1, respectively, with corresponding normalized Hausdorff distances of 25.1% and 61.0% of the image diagonal, reflecting controlled geometric deviation during abstraction. These results verify that the proposed framework effectively preserves global structure while achieving progressive semantic simplification across different LoD levels, providing reliable data and technical support for AI-driven multi-level architectural generation and hierarchical modeling.
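The "normalized Hausdorff distance as a fraction of the image diagonal" metric can be made concrete with a short sketch. The abstract does not say which point sets are compared, so this version assumes each sketch is reduced to a list of edge-pixel coordinates; the function names and toy points are hypothetical.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def normalized_hausdorff(a, b, width, height):
    """Symmetric Hausdorff distance divided by the image diagonal length."""
    d = max(directed_hausdorff(a, b), directed_hausdorff(b, a))
    return d / math.hypot(width, height)

# Hypothetical edge points from two sketches rendered on a 300 x 400 canvas.
detailed = [(0, 0), (30, 0)]
simplified = [(0, 0), (30, 40)]
ratio = normalized_hausdorff(detailed, simplified, width=300, height=400)
```

Here the worst-case mismatch is 40 pixels against a 500-pixel diagonal, giving a ratio of 0.08, or 8% of the diagonal. The brute-force search is quadratic in the number of points, which is adequate for illustration but slow for dense edge maps.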
[58] Scaling medical imaging report generation with multimodal reinforcement learning cs.CV | cs.CLPDF
Qianchu Liu, Sheng Zhang, Guanghui Qin, Yu Gu, Ying Jin
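TL;DR: This paper proposes Universal Report Generation (UniRG), a general framework for medical imaging report generation. It uses reinforcement learning as a unifying mechanism to directly optimize evaluation metrics designed for end applications, significantly improving on supervised fine-tuning and achieving durable generalization across institutions and clinical practices.
Details
Motivation: Frontier models excel at understanding and reasoning over natural-language text but still show significant capability gaps in multimodal understanding and reasoning, especially in high-value verticals such as biomedicine. Medical imaging report generation is a typical example: supervised fine-tuning improves performance but tends to overfit to superficial boilerplate patterns.
Result: UniRG-CXR, trained on publicly available chest X-ray (CXR) data, sets a new overall SOTA on the authoritative ReXrank benchmark, outperforming the previous best method by a wide margin.
Insight: The novelty lies in using reinforcement learning as a unified optimization mechanism that directly targets application-level evaluation metrics, avoiding the overfitting to superficial templates that is common in supervised fine-tuning and yielding robust generalization across institutions and clinical practices.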
Abstract: Frontier models have demonstrated remarkable capabilities in understanding and reasoning with natural-language text, but they still exhibit major competency gaps in multimodal understanding and reasoning, especially in high-value verticals such as biomedicine. Medical imaging report generation is a prominent example. Supervised fine-tuning can substantially improve performance, but it is prone to overfitting to superficial boilerplate patterns. In this paper, we introduce Universal Report Generation (UniRG) as a general framework for medical imaging report generation. By leveraging reinforcement learning as a unifying mechanism to directly optimize for evaluation metrics designed for end applications, UniRG can significantly improve upon supervised fine-tuning and attain durable generalization across diverse institutions and clinical practices. We trained UniRG-CXR on publicly available chest X-ray (CXR) data and conducted a thorough evaluation in CXR report generation with rigorous evaluation scenarios. On the authoritative ReXrank benchmark, UniRG-CXR sets a new overall SOTA, outperforming prior state of the art by a wide margin.
[59] Spatiotemporal Semantic V2X Framework for Cooperative Collision Prediction cs.CV | cs.AI | cs.LG | eess.IVPDF
Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews
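TL;DR: This paper proposes a spatiotemporal semantic V2X framework for cooperative collision prediction. Roadside unit (RSU) cameras use the Video Joint Embedding Predictive Architecture (V-JEPA) to generate spatiotemporal semantic embeddings of future frames and transmit only these embeddings rather than raw video, greatly reducing communication overhead. On the vehicle side, a lightweight attentive probe and classifier decode the embeddings to predict imminent collisions.
Details
Motivation: Intelligent Transportation Systems (ITS) need real-time collision prediction to ensure road safety, but conventional approaches rely on transmitting raw video or high-dimensional sensor data, which is impractical under vehicular communication bandwidth and latency constraints. The paper targets this communication bottleneck.
Result: Experiments show that the framework achieves a 10% F1-score improvement on collision prediction while reducing transmission requirements by four orders of magnitude compared with raw video, demonstrating that it markedly cuts communication overhead while maintaining predictive accuracy.
Insight: The novelty lies in using V-JEPA spatiotemporal semantic embeddings for V2X communication in place of raw data transmission, balancing communication efficiency against predictive performance. Viewed objectively, the method combines semantic communication with a video prediction model and offers a scalable solution for real-time cooperative applications in ITS.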
Abstract: Intelligent Transportation Systems (ITS) demand real-time collision prediction to ensure road safety and reduce accident severity. Conventional approaches rely on transmitting raw video or high-dimensional sensory data from roadside units (RSUs) to vehicles, which is impractical under vehicular communication bandwidth and latency constraints. In this work, we propose a semantic V2X framework in which RSU-mounted cameras generate spatiotemporal semantic embeddings of future frames using the Video Joint Embedding Predictive Architecture (V-JEPA). To evaluate the system, we construct a digital twin of an urban traffic environment, enabling the generation of diverse traffic scenarios with both safe and collision events. These embeddings of the future frame, extracted from V-JEPA, capture task-relevant traffic dynamics and are transmitted via V2X links to vehicles, where a lightweight attentive probe and classifier decode them to predict imminent collisions. By transmitting only semantic embeddings instead of raw frames, the proposed system significantly reduces communication overhead while maintaining predictive accuracy. Experimental results demonstrate that the framework with an appropriate processing method achieves a 10% F1-score improvement for collision prediction while reducing transmission requirements by four orders of magnitude compared to raw video. This validates the potential of semantic V2X communication to enable cooperative, real-time collision prediction in ITS.
[60] C-RADIOv4 (Tech Report) cs.CVPDF
Mike Ranzinger, Greg Heinrich, Collin McCarthy, Jan Kautz, Andrew Tao
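TL;DR: C-RADIOv4 is a vision backbone built with multi-teacher distillation. By integrating the capabilities of teacher models such as SigLIP2, DINOv3, and SAM3, it improves performance on key downstream tasks at the same computational complexity. The tech report releases model variants with 412M and 631M parameters, improves any-resolution support, offers an efficient ViTDet option, and ships under a permissive license.
Details
Motivation: To build a unified student model via multi-teacher distillation that retains and improves the distinct capabilities of multiple teachers, improving downstream task performance at the same computational cost.
Result: The models show notable improvements on core metrics and gain new capabilities from imitating SAM3. They support any resolution, and the ViTDet option greatly improves efficiency at high resolution. The abstract names no specific benchmarks but implies improvements over the predecessor AM-RADIO/RADIOv2.5.
Insight: Contributions include an updated multi-teacher distillation strategy (integrating SigLIP2, DINOv3, and SAM3) that unifies and strengthens model capabilities. On the engineering side, the release adds any-resolution support and a ViTDet option for efficiency, and its permissive license eases practical use.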
Abstract: By leveraging multi-teacher distillation, agglomerative vision backbones provide a unified student model that retains and improves the distinct capabilities of multiple teachers. In this tech report, we describe the most recent release of the C-RADIO family of models, C-RADIOv4, which builds upon AM-RADIO/RADIOv2.5 in design, offering strong improvements on key downstream tasks at the same computational complexity. We release -SO400M (412M params), and -H (631M) model variants, both trained with an updated set of teachers: SigLIP2, DINOv3, and SAM3. In addition to improvements on core metrics and new capabilities from imitating SAM3, the C-RADIOv4 model family further improves any-resolution support, brings back the ViTDet option for drastically enhanced efficiency at high-resolution, and comes with a permissive license.
[61] FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding cs.CVPDF
João Pereira, Vasco Lopes, João Neves, David Semedo
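TL;DR: This paper proposes FineVAU, a new benchmark for fine-grained Video Anomaly Understanding (VAU), to address the shortcomings of existing evaluation methods in capturing the richness, granularity, and factual relevance of large vision-language model (LVLM) outputs. The benchmark includes a new evaluation metric, FVScore, and a dataset, FineW3.
Details
Motivation: Existing VAU evaluation is flawed: n-gram-based metrics cannot effectively assess the free-form outputs of LVLMs, while LLM-based evaluation focuses on language quality over factual relevance, leading to misalignment with human perception.
Result: Human evaluation shows that the proposed FVScore aligns better with human judgment of anomalies than existing approaches. Detailed experiments on FineVAU reveal critical limitations of LVLMs in perceiving anomalous events that require spatial and fine-grained temporal understanding, despite good performance on coarse-grained, static information and on events with strong visual cues.
Insight: The novelty lies in formalizing VAU as a three-fold problem of understanding anomalous events (What), participating entities (Who), and location (Where), and in proposing an interpretable, fine-grained metric (FVScore) that checks for the presence of critical visual elements, along with a dataset (FineW3) of high-quality, fine-grained visual information built through a structured, fully automatic procedure.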
Abstract: Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter focuses on assessing language quality over factual relevance, often resulting in subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who) and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information. Human evaluation reveals that our proposed metric has a superior alignment with human perception of anomalies in comparison to current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information, and events with strong visual cues.
[62] SkyReels-V3 Technique Report cs.CVPDF
Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen
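TL;DR: SkyReels-V3 is a conditional video generation model built on diffusion Transformers and a unified multimodal in-context learning framework. It supports three core generative paradigms in a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. The model aims to generate high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency, with performance improved through optimized data processing and training strategies.
Details
Motivation: Video generation is a foundation for building world models, and multimodal contextual inference is a key test of model capability. The paper aims to develop a unified video generation model that handles multiple conditioning inputs, addressing existing models' difficulties with subject consistency, temporal coherence, and multimodal condition fusion.
Result: Extensive evaluations show that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems.
Insight: The main contributions are: a unified multimodal in-context learning framework that supports three generative paradigms in one architecture; a comprehensive data processing pipeline with cross-frame pairing, image editing, and semantic rewriting to mitigate copy-paste artifacts; an image-video hybrid training strategy with multi-resolution joint optimization; spatio-temporal consistency modeling combined with large-scale video understanding for video extension; and, for audio-guided generation, training first-and-last frame insertion patterns and reconstructing the key-frame inference paradigm.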
Abstract: Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. The SkyReels-V3 model supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. Audio-video synchronization has also been optimized while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.
[63] SymbolSight: Minimizing Inter-Symbol Interference for Reading with Prosthetic Vision cs.CV | cs.HCPDF
Jasmine Lesner, Michael Beyeler
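TL;DR: The paper proposes SymbolSight, a computational framework that optimizes symbol-to-letter mappings to minimize visual interference between adjacent letters in sequential presentation, improving reading for retinal prosthesis users. The method uses language-specific bigram statistics and simulated prosthetic vision to assess symbol confusability, and it substantially reduces predicted confusion in Arabic, Bulgarian, and English.
Details
Motivation: To address inter-symbol interference in sequential letter presentation, which arises from the low spatial resolution and temporal persistence of retinal prostheses, by optimizing the visual symbols themselves rather than waiting for future hardware improvements.
Result: In simulations for Arabic, Bulgarian, and English, the optimized heterogeneous symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets.
Insight: The novelty lies in combining computational modeling with language statistics to tailor visual encodings for low-bandwidth prosthetic vision. The work shows that standard typography is poorly suited to serial, low-bandwidth vision and provides an efficient way to narrow the design space of visual encodings, generating high-potential candidates for psychophysical and clinical evaluation.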
Abstract: Retinal prostheses restore limited visual perception, but low spatial resolution and temporal persistence make reading difficult. In sequential letter presentation, the afterimage of one symbol can interfere with perception of the next, leading to systematic recognition errors. Rather than relying on future hardware improvements, we investigate whether optimizing the visual symbols themselves can mitigate this temporal interference. We present SymbolSight, a computational framework that selects symbol-to-letter mappings to minimize confusion among frequently adjacent letters. Using simulated prosthetic vision (SPV) and a neural proxy observer, we estimate pairwise symbol confusability and optimize assignments using language-specific bigram statistics. Across simulations in Arabic, Bulgarian, and English, the resulting heterogeneous symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets. These results suggest that standard typography is poorly matched to serial, low-bandwidth prosthetic vision and demonstrate how computational modeling can efficiently narrow the design space of visual encodings to generate high-potential candidates for future psychophysical and clinical evaluation.
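The optimization described above, assigning symbols so that frequently adjacent letters receive rarely confused symbols, can be sketched on a toy problem. The abstract does not name the optimizer, so the brute-force search, the cost function, and all frequencies and confusability values below are illustrative assumptions.

```python
from itertools import permutations

def expected_confusion(mapping, bigram_freq, confusability):
    """Bigram-weighted confusability of the symbols assigned to adjacent letters."""
    total = 0.0
    for (first, second), freq in bigram_freq.items():
        pair = frozenset((mapping[first], mapping[second]))
        total += freq * confusability.get(pair, 0.0)
    return total

def best_mapping(letters, symbols, bigram_freq, confusability):
    """Exhaustively search symbol assignments (feasible only for tiny alphabets)."""
    best, best_cost = None, float("inf")
    for perm in permutations(symbols, len(letters)):
        mapping = dict(zip(letters, perm))
        cost = expected_confusion(mapping, bigram_freq, confusability)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost

# Hypothetical inputs: "th" is by far the most frequent bigram,
# and symbols A and B are the most easily confused pair.
bigram_freq = {("t", "h"): 0.90, ("h", "q"): 0.01, ("t", "q"): 0.09}
confusability = {
    frozenset("AB"): 0.8,
    frozenset("AC"): 0.1,
    frozenset("BC"): 0.3,
}
mapping, cost = best_mapping(["t", "h", "q"], ["A", "B", "C"], bigram_freq, confusability)
```

The search places the frequent pair "t" and "h" on the least confusable symbols A and C, leaving the rare letter "q" with B. A real alphabet has far too many permutations for exhaustive search, so a practical version would need a heuristic or an assignment solver.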
[64] AGE-Net: Spectral–Spatial Fusion and Anatomical Graph Reasoning with Evidential Ordinal Regression for Knee Osteoarthritis Grading cs.CVPDF
Xiaoyang Li, Runni Zhou
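TL;DR: AGE-Net is a ConvNeXt-based deep learning framework for automated Kellgren-Lawrence (KL) grading of knee osteoarthritis. It tackles the grading challenges by fusing spectral-spatial features, performing anatomical graph reasoning, and applying differential refinement, and it uses an evidential regression head and a pairwise ranking constraint to handle uncertainty and label ordinality.
Details
Motivation: To address the challenges of automated KL grading from knee radiographs: subtle structural changes, long-range anatomical dependencies, and ambiguous grade boundaries.
Result: On a knee KL dataset, AGE-Net achieves a quadratic weighted kappa of 0.9017 +/- 0.0045 and a mean squared error of 0.2349 +/- 0.0028, outperforming strong CNN baselines and showing consistent gains in ablation studies.
Insight: The novelty lies in integrating spectral-spatial fusion, anatomical graph reasoning, and differential refinement modules into a ConvNeXt backbone, and in introducing evidential regression and a ranking constraint to explicitly model predictive uncertainty and the ordinal relationship between labels. This offers a new approach for fine-grained classification tasks with uncertainty in medical image analysis.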
Abstract: Automated Kellgren–Lawrence (KL) grading from knee radiographs is challenging due to subtle structural changes, long-range anatomical dependencies, and ambiguity near grade boundaries. We propose AGE-Net, a ConvNeXt-based framework that integrates Spectral–Spatial Fusion (SSF), Anatomical Graph Reasoning (AGR), and Differential Refinement (DFR). To capture predictive uncertainty and preserve label ordinality, AGE-Net employs a Normal-Inverse-Gamma (NIG) evidential regression head and a pairwise ordinal ranking constraint. On a knee KL dataset, AGE-Net achieves a quadratic weighted kappa (QWK) of 0.9017 +/- 0.0045 and a mean squared error (MSE) of 0.2349 +/- 0.0028 over three random seeds, outperforming strong CNN baselines and showing consistent gains in ablation studies. We further outline evaluations of uncertainty quality, robustness, and explainability, with additional experimental figures to be included in the full manuscript.
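The headline metric here, quadratic weighted kappa (QWK), penalizes a prediction by the squared distance between predicted and true grade, which suits ordinal labels such as KL grades 0 to 4. The sketch below implements the standard definition in pure Python; the function name is hypothetical and this is not the authors' evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w = (i - j)^2 / (K - 1)^2.

    Labels must be integers in [0, num_classes). The result is undefined
    (division by zero) when every true and predicted label is the same class.
    """
    n = len(y_true)
    # Confusion matrix: rows are true grades, columns are predicted grades.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[i][j]
            # Expected count under independence of the two marginals.
            den += w * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den
```

Perfect agreement yields 1.0, chance-level agreement yields about 0, and systematically reversed grades yield negative values. Because the weights grow quadratically, confusing grade 0 with grade 4 costs sixteen times as much as confusing adjacent grades.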
[65] STARS: Shared-specific Translation and Alignment for missing-modality Remote Sensing Semantic Segmentation cs.CVPDF
Tong Wang, Xiaodong Zhang, Guanzhou Chen, Jiaqi Wang, Chenxi Liu
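TL;DR: This paper proposes STARS, a robust semantic segmentation framework for handling missing modalities (such as optical images or digital surface models) in remote sensing imagery. Through an asymmetric alignment mechanism and a pixel-level semantic sampling alignment strategy, it prevents feature collapse and improves recognition of minority classes.
Details
Motivation: Multimodal remote sensing improves understanding of surface semantics by fusing heterogeneous data such as optical images, Synthetic Aperture Radar, and Digital Surface Models, but missing modality data is a common challenge in practice and degrades traditional multimodal fusion models. Existing methods suffer from feature collapse and overly generalized recovered features.
Result: The paper validates STARS on remote sensing semantic segmentation, but the abstract gives no specific quantitative results (such as accuracy or mIoU) or benchmark comparisons; it only claims to prevent feature collapse and improve minority-class recognition.
Insight: Contributions include: 1) an asymmetric alignment mechanism with bidirectional translation and stop-gradient, which prevents feature collapse and reduces sensitivity to hyperparameters; 2) a pixel-level semantic sampling alignment strategy that combines class-balanced pixel sampling with a cross-modality semantic alignment loss, mitigating alignment failures caused by severe class imbalance.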
Abstract: Multimodal remote sensing technology significantly enhances the understanding of surface semantics by integrating heterogeneous data such as optical images, Synthetic Aperture Radar (SAR), and Digital Surface Models (DSM). However, in practical applications, missing modality data (e.g., optical or DSM) is a common and severe challenge that degrades the performance of traditional multimodal fusion models. Existing methods for addressing missing modalities still face limitations, including feature collapse and overly generalized recovered features. To address these issues, we propose STARS (Shared-specific Translation and Alignment for missing-modality Remote Sensing), a robust semantic segmentation framework for incomplete multimodal inputs. STARS is built on two key designs. First, we introduce an asymmetric alignment mechanism with bidirectional translation and stop-gradient, which effectively prevents feature collapse and reduces sensitivity to hyperparameters. Second, we propose a Pixel-level Semantic sampling Alignment (PSA) strategy that combines class-balanced pixel sampling with a cross-modality semantic alignment loss to mitigate alignment failures caused by severe class imbalance and improve minority-class recognition.
[66] Physical Prompt Injection Attacks on Large Vision-Language Models cs.CV | cs.AIPDF
Chen Ling, Kai Hu, Hangcheng Liu, Xingshuo Han, Tianwei Zhang
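TL;DR: This paper proposes a Physical Prompt Injection Attack (PPIA) against large vision-language models (LVLMs): a black-box, query-agnostic attack that embeds malicious typographic instructions into physical objects the LVLM can perceive. It needs no access to the model, its inputs, or its internal pipeline and operates solely through visual observation. The method combines offline selection of highly recognizable, semantically effective visual prompts with environment-aware placement guided by spatiotemporal attention, so that injected prompts are both visible and able to influence model behavior. Evaluated on 10 state-of-the-art LVLMs in simulated and real-world settings, it reaches attack success rates of up to 98% on tasks such as visual question answering, planning, and navigation, and remains robust under varying distance, viewpoint, and illumination.
Details
Motivation: LVLMs are increasingly used for perception and reasoning in open physical environments, yet they are known to be vulnerable to prompt injection. Existing attacks either require access to input channels or depend on knowledge of user queries, assumptions that rarely hold in real deployments. The paper therefore aims to develop a black-box attack that needs no access to the model, its inputs, or its internal pipeline.
Result: PPIA is evaluated on 10 state-of-the-art LVLMs in simulated and real-world settings, on tasks including visual question answering, planning, and navigation. It achieves attack success rates of up to 98% and remains robust under varying physical conditions such as distance, viewpoint, and illumination.
Insight: The contributions are: the first Physical Prompt Injection Attack (PPIA), a black-box, query-agnostic attack; the combination of offline visual prompt selection with environment-aware placement guided by spatiotemporal attention, which makes the attack effective and robust; and validation of high success rates across multiple tasks and physical conditions, which serves as an important warning for the safe deployment of LVLMs.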
Abstract: Large Vision-Language Models (LVLMs) are increasingly deployed in real-world intelligent systems for perception and reasoning in open physical environments. While LVLMs are known to be vulnerable to prompt injection attacks, existing methods either require access to input channels or depend on knowledge of user queries, assumptions that rarely hold in practical deployments. We propose the first Physical Prompt Injection Attack (PPIA), a black-box, query-agnostic attack that embeds malicious typographic instructions into physical objects perceivable by the LVLM. PPIA requires no access to the model, its inputs, or internal pipeline, and operates solely through visual observation. It combines offline selection of highly recognizable and semantically effective visual prompts with strategic environment-aware placement guided by spatiotemporal attention, ensuring that the injected prompts are both perceivable and influential on model behavior. We evaluate PPIA across 10 state-of-the-art LVLMs in both simulated and real-world settings on tasks including visual question answering, planning, and navigation. PPIA achieves attack success rates up to 98%, with strong robustness under varying physical conditions such as distance, viewpoint, and illumination. Our code is publicly available at https://github.com/2023cghacker/Physical-Prompt-Injection-Attack.
[67] ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs cs.CV | cs.AIPDF
Rui Fang, Jian Li, Wei Chen, Bin Hu, Ying-Cong Chen
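TL;DR: This paper presents ReLE (Robust Efficient Live Evaluation), a scalable evaluation system for diagnosing Capability Anisotropy in Chinese large language models (LLMs), that is, the non-uniformity of model performance across domains. Using a Domain × Capability orthogonal matrix of 207,843 samples, the system evaluates 304 models (189 commercial, 115 open-source).
Details
Motivation: Evaluation of Chinese LLMs is challenged by benchmark saturation and prohibitive computational cost, and static leaderboards often mask structural trade-offs between capabilities. The paper aims to build a system that diagnoses non-uniform model performance so that true capability can be assessed more accurately.
Result: With a Symbolic-Grounded Hybrid Scoring Mechanism and a Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, ReLE cuts compute cost by 70% relative to full-pass evaluation while maintaining a ranking correlation of ρ=0.96. The analysis shows a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus ~5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior.
Insight: The main contributions are: 1) a Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; 2) a Dynamic Variance-Aware Scheduler that substantially reduces evaluation cost; 3) a diagnostic framework for Capability Anisotropy that reveals how sensitive model rankings are to weighting schemes, positioning ReLE as a high-frequency diagnostic tool rather than a replacement for static benchmarks.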
Abstract: Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) A Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) A Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70% compared to full-pass evaluations while maintaining a ranking correlation of $ρ=0.96$. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus $\sim$5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
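The scheduler above is described as building on Neyman allocation. As a rough illustration of that classical rule only (not ReLE's scheduler, whose noise correction and dynamic updates are not specified in the abstract), the sketch below splits a sample budget across strata in proportion to stratum size times standard deviation. The function name, arguments, and the proportional fallback for zero variance are assumptions made for this example.

```python
def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Classical Neyman allocation: n_h = budget * N_h * sigma_h / sum_k(N_k * sigma_k).

    Hypothetical helper for illustration; returns fractional allocations and
    leaves rounding to the caller. It does not model noise correction.
    """
    if len(stratum_sizes) != len(stratum_stds):
        raise ValueError("sizes and standard deviations must have equal length")
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        # Assumed fallback: with no variance anywhere, allocate by stratum size.
        total_size = sum(stratum_sizes)
        return [budget * n / total_size for n in stratum_sizes]
    return [budget * w / total for w in weights]


# Two equally sized cells; the noisier one receives three times the budget.
allocation = neyman_allocation([100, 100], [1.0, 3.0], budget=40)
```

The rule spends more evaluations on cells whose scores vary more, which is the general intuition behind variance-aware scheduling.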
[68] HAAF: Hierarchical Adaptation and Alignment of Foundation Models for Few-Shot Pathology Anomaly Detection cs.CVPDF
Chunze Yang, Wenjie Zhao, Yue Tang, Junbo Lu, Jiusong Ge
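TL;DR: This paper proposes HAAF, a hierarchical adaptation and alignment framework that addresses the granularity mismatch in pathology anomaly detection. Its Cross-Level Scaled Alignment mechanism injects visual features into text prompts to generate content-adaptive descriptors, which then guide the visual encoder to focus on anomalous regions; a dual-branch inference strategy improves stability in few-shot settings.
Details
Motivation: Precision pathology depends on detecting fine-grained morphological abnormalities within specific regions of interest (ROIs). Existing vision-language models face a granularity mismatch when adapted to this task: generic representations fail to capture subtle defects, and existing methods often treat the modalities as independent streams, failing to ground semantic prompts in ROI-specific visual context.
Result: Experiments on four benchmarks show that HAAF significantly outperforms existing state-of-the-art methods and scales with domain-specific backbones (e.g., CONCH) in low-resource scenarios.
Insight: The novelty lies in the Cross-Level Scaled Alignment mechanism, which uses sequential calibration so that visual features and text prompts guide each other, and in the dual-branch inference strategy, which integrates semantic scores with geometric prototypes, improving the robustness and interpretability of few-shot anomaly detection.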
Abstract: Precision pathology relies on detecting fine-grained morphological abnormalities within specific Regions of Interest (ROIs), as these local, texture-rich cues - rather than global slide contexts - drive expert diagnostic reasoning. While Vision-Language (V-L) models promise data efficiency by leveraging semantic priors, adapting them faces a critical Granularity Mismatch, where generic representations fail to resolve such subtle defects. Current adaptation methods often treat modalities as independent streams, failing to ground semantic prompts in ROI-specific visual contexts. To bridge this gap, we propose the Hierarchical Adaptation and Alignment Framework (HAAF). At its core is a novel Cross-Level Scaled Alignment (CLSA) mechanism that enforces a sequential calibration order: visual features first inject context into text prompts to generate content-adaptive descriptors, which then spatially guide the visual encoder to spotlight anomalies. Additionally, a dual-branch inference strategy integrates semantic scores with geometric prototypes to ensure stability in few-shot settings. Experiments on four benchmarks show HAAF significantly outperforms state-of-the-art methods and effectively scales with domain-specific backbones (e.g., CONCH) in low-resource scenarios.
[69] CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction cs.CVPDF
Shiu-hong Kao, Chak Ho Huang, Huaiqian Liu, Yu-Wing Tai, Chi-Keung Tang
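TL;DR: This paper proposes CoT-Seg, a training-free framework that handles complex reasoning segmentation by combining chain-of-thought reasoning with self-correction. It uses pre-trained MLLMs (such as GPT-4o) to decompose queries and extract image semantics, and iteratively refines segmentation masks through self-evaluation, significantly improving robustness in ambiguous or difficult cases.
Details
Motivation: Existing reasoning segmentation methods fall short on complex queries and out-of-domain images. Inspired by how humans solve hard problems step by step, the paper aims to build a system that reasons in steps, retrieves information, generates results, and corrects itself.
Result: On the newly proposed ReasonSeg-Hard dataset, the method demonstrates its ability to handle very challenging cases; the results indicate that combining chain-of-thought reasoning with self-correction is a powerful paradigm for vision-language-driven segmentation.
Insight: The novelty is the tight integration of training-free chain-of-thought reasoning with a self-correction stage, which lets the model evaluate and iteratively refine its segmentation against the original query and reasoning trace. The framework also supports retrieval-augmented reasoning to bring in external knowledge, improving reliability and generalization.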
Abstract: Existing work on reasoning segmentation often falls short in complex cases, particularly when addressing complicated queries and out-of-domain images. Inspired by chain-of-thought reasoning, where harder problems require longer thinking steps/time, this paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine the results, in the same way humans approach harder questions. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction. Instead of fine-tuning, CoT-Seg leverages the inherent reasoning ability of pre-trained MLLMs (GPT-4o) to decompose queries into meta-instructions, extract fine-grained semantics from images, and identify target objects even under implicit or complex prompts. Moreover, CoT-Seg incorporates a self-correction stage: the model evaluates its own segmentation against the original query and reasoning trace, identifies mismatches, and iteratively refines the mask. This tight integration of reasoning and correction significantly improves reliability and robustness, especially in ambiguous or error-prone cases. Furthermore, our CoT-Seg framework allows easy incorporation of retrieval-augmented reasoning, enabling the system to access external knowledge when the input lacks sufficient information. To showcase CoT-Seg’s ability to handle very challenging cases, we introduce a new dataset, ReasonSeg-Hard. Our results highlight that combining chain-of-thought reasoning with self-correction offers a powerful paradigm for segmentation driven by vision-language integration.
[70] Coronary Artery Segmentation and Vessel-Type Classification in X-Ray Angiography cs.CV | cs.AIPDF
Mehdi Yousefzadeh, Siavash Shirzadeh Barough, Ashkan Fakharifar, Yashar Tayyarazad, Narges Eghbali
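TL;DR: This paper presents a method for coronary artery segmentation and vessel-type classification in X-ray coronary angiography (XCA) images. It first selects a best frame using a low-intensity histogram criterion and applies joint super-resolution and enhancement, then benchmarks classical vesselness filters and deep neural networks (U-Net, FPN, Swin Transformer) for segmentation, introducing Support Vector Regression (SVR) for per-image parameter prediction. A second stage assigns vessel type (LAD, LCX, RCA) to the segmented vessels. Evaluation on internal and external datasets demonstrates the method's effectiveness and generalization.
Details
Motivation: XCA is the clinical reference standard for assessing coronary artery disease, but quantitative analysis is limited by the poor robustness of vessel segmentation in routine data (due to low contrast, motion, overlap, catheter confounding, and similar factors). Reliable vessel segmentation and vessel-type labeling are essential for vessel-specific analysis based on anatomical localization and for downstream measurements.
Result: On the internal dataset, per-image SVR tuning raises the Dice score of classical filters over the global-mean setting (e.g., Frangi: 0.759 vs. 0.741). Among deep models, FPN reaches 0.914±0.007 Dice with coronary-only labels and 0.931±0.006 with merged coronary+catheter labels. On the public DCA1 cohort, used as a strict external test, Dice drops to 0.798 (coronary-only) and 0.814 (merged), and light in-domain fine-tuning recovers it to 0.881 and 0.882, respectively. Vessel-type labeling achieves accuracies of 98.5%, 95.4%, and 96.2% for RCA, LAD, and LCX (Dice 0.844, 0.786, and 0.794, respectively).
Insight: The contributions are: 1) per-image adaptive parameter tuning of classical vesselness filters via Support Vector Regression (SVR), which improves the performance of classical methods; 2) a merged coronary and catheter label supervision strategy for training deep models (such as FPN), which improves segmentation stability and generalization to external data; 3) a two-stage pipeline of preprocessing, segmentation, and classification, rigorously evaluated on a public external dataset. More broadly, combining classical image processing with machine learning (SVR) for parameter tuning, and using merged-label supervision to improve generalization, are ideas worth borrowing.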
Abstract: X-ray coronary angiography (XCA) is the clinical reference standard for assessing coronary artery disease, yet quantitative analysis is limited by the difficulty of robust vessel segmentation in routine data. Low contrast, motion, foreshortening, overlap, and catheter confounding degrade segmentation and contribute to domain shift across centers. Reliable segmentation, together with vessel-type labeling, enables vessel-specific coronary analytics and downstream measurements that depend on anatomical localization. From 670 cine sequences (407 subjects), we select a best frame near peak opacification using a low-intensity histogram criterion and apply joint super-resolution and enhancement. We benchmark classical Meijering, Frangi, and Sato vesselness filters under per-image oracle tuning, a single global mean setting, and per-image parameter prediction via Support Vector Regression (SVR). Neural baselines include U-Net, FPN, and a Swin Transformer, trained with coronary-only and merged coronary+catheter supervision. A second stage assigns vessel identity (LAD, LCX, RCA). External evaluation uses the public DCA1 cohort. SVR per-image tuning improves Dice over global means for all classical filters (e.g., Frangi: 0.759 vs. 0.741). Among deep models, FPN attains 0.914+/-0.007 Dice (coronary-only), and merged coronary+catheter labels further improve to 0.931+/-0.006. On DCA1 as a strict external test, Dice drops to 0.798 (coronary-only) and 0.814 (merged), while light in-domain fine-tuning recovers to 0.881+/-0.014 and 0.882+/-0.015. Vessel-type labeling achieves 98.5% accuracy (Dice 0.844) for RCA, 95.4% (0.786) for LAD, and 96.2% (0.794) for LCX. Learned per-image tuning strengthens classical pipelines, while high-resolution FPN models and merged-label supervision improve stability and external transfer with modest adaptation.
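All segmentation results in this entry are reported as Dice scores. For readers unfamiliar with the metric, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. It is not the authors' evaluation code; the function name and the convention of returning 1.0 when both masks are empty are assumptions for this example.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    foreground = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if foreground == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / foreground


# Prediction marks two pixels, ground truth marks one of them: Dice = 2*1/(2+1).
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Because Dice normalizes by the combined foreground size, thin structures such as vessels are penalized heavily for small boundary errors, which is worth keeping in mind when comparing the internal and external figures.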
[71] BMDS-Net: A Bayesian Multi-Modal Deep Supervision Network for Robust Brain Tumor Segmentation cs.CV | q-bio.QMPDF
Yan Zhou, Zhen Huang, Yingqiu Li, Yue Ouyang, Suncheng Xiang
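TL;DR: This paper proposes BMDS-Net, a Bayesian multi-modal deep supervision network that addresses two problems in brain tumor segmentation: sensitivity to missing modalities and lack of confidence calibration. The framework builds a robust backbone from a Zero-Init Multimodal Contextual Fusion module and a Residual-Gated Deep Decoder Supervision mechanism, and introduces a memory-efficient Bayesian fine-tuning strategy to produce voxel-wise uncertainty maps. Experiments on BraTS 2021 show competitive segmentation accuracy together with superior stability under missing modalities.
Details
Motivation: Transformer-based brain tumor segmentation models such as Swin UNETR perform well on benchmarks, but in clinical practice they have two key problems, sensitivity to missing modalities and a lack of confidence calibration, which undermine clinical utility and safety.
Result: Comprehensive experiments on BraTS 2021 show that BMDS-Net maintains competitive segmentation accuracy (including a significantly reduced Hausdorff Distance) and, more importantly, exhibits superior stability compared with baseline models in missing-modality robustness tests.
Insight: The contributions are: 1) a robust deterministic backbone built from Zero-Init Multimodal Contextual Fusion (MMCF) and Residual-Gated Deep Decoder Supervision (DDS); 2) a memory-efficient Bayesian fine-tuning strategy that turns the network into a probabilistic predictor and produces voxel-wise uncertainty maps to support clinical decisions; 3) an overall design that puts clinical robustness and trustworthiness ahead of simple metric maximization, offering a new direction for deploying medical image segmentation in practice.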
Abstract: Accurate brain tumor segmentation from multi-modal magnetic resonance imaging (MRI) is a prerequisite for precise radiotherapy planning and surgical navigation. While recent Transformer-based models such as Swin UNETR have achieved impressive benchmark performance, their clinical utility is often compromised by two critical issues: sensitivity to missing modalities (common in clinical practice) and a lack of confidence calibration. Merely chasing higher Dice scores on idealized data fails to meet the safety requirements of real-world medical deployment. In this work, we propose BMDS-Net, a unified framework that prioritizes clinical robustness and trustworthiness over simple metric maximization. Our contribution is three-fold. First, we construct a robust deterministic backbone by integrating a Zero-Init Multimodal Contextual Fusion (MMCF) module and a Residual-Gated Deep Decoder Supervision (DDS) mechanism, enabling stable feature learning and precise boundary delineation with significantly reduced Hausdorff Distance, even under modality corruption. Second, and most importantly, we introduce a memory-efficient Bayesian fine-tuning strategy that transforms the network into a probabilistic predictor, providing voxel-wise uncertainty maps to highlight potential errors for clinicians. Third, comprehensive experiments on the BraTS 2021 dataset demonstrate that BMDS-Net not only maintains competitive accuracy but, more importantly, exhibits superior stability in missing-modality scenarios where baseline models fail. The source code is publicly available at https://github.com/RyanZhou168/BMDS-Net.
[72] Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries cs.CVPDF
Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak
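TL;DR: This paper proposes a method for predicting the zero-shot classification performance of vision-language models (such as CLIP) on arbitrary queries. It combines text-only comparisons with generated, task-relevant synthetic images to evaluate and refine the prediction of zero-shot accuracy, helping non-expert users judge whether a chosen model suits their specific problem.
Details
Motivation: Non-expert users have no direct way to assess a vision-language model's zero-shot performance on their own task, even though performance varies widely across domains.
Result: Experiments on standard CLIP benchmark datasets show that the image-based approach (which adds generated synthetic images) substantially improves prediction quality over the text-only baseline, letting users predict a VLM's effectiveness for their application without any labeled examples.
Insight: The novelty is the use of generated synthetic images to strengthen zero-shot performance prediction, which both improves prediction accuracy and gives users feedback on the images used for the assessment. Viewed objectively, fusing multimodal information makes the assessment of model generalization more interpretable and practical.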
Abstract: Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
[73] Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning cs.CVPDF
Qi Li, Xinchao Wang
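TL;DR: This paper proposes Sponge Tool Attack (STA), a new attack on tool-augmented agentic reasoning systems. Under a strict query-only access assumption and solely by rewriting the input prompt, it turns originally concise, efficient reasoning trajectories into verbose, convoluted ones, substantially increasing computational overhead without modifying the underlying model or external tools, while staying stealthy by preserving task semantics and user intent.
Details
Motivation: The paper explores security vulnerabilities in tool-augmented LLM agentic reasoning, specifically the susceptibility of the tool-calling process to malicious manipulation, which has remained largely unexplored.
Result: Experiments across 6 models (open-source models and closed-source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA and show that it substantially increases computational overhead.
Insight: The novelty lies in identifying a tool-specific attack surface and designing an iterative, multi-agent collaborative framework that, under explicit rewrite-policy control, generates benign-looking prompt rewrites with high semantic fidelity. This stealthily degrades agentic reasoning efficiency and offers a new attack perspective for AI security.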
Abstract: Enabling large language models (LLMs) to solve complex reasoning tasks is a key step toward artificial general intelligence. Recent work augments LLMs with external tools to enable agentic reasoning, achieving high utility and efficiency in a plug-and-play manner. However, the inherent vulnerabilities of such methods to malicious manipulation of the tool-calling process remain largely unexplored. In this work, we identify a tool-specific attack surface and propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt under a strict query-only access assumption. Without any modification to the underlying model or the external tools, STA converts originally concise and efficient reasoning trajectories into unnecessarily verbose and convoluted ones before arriving at the final answer. This results in substantial computational overhead while remaining stealthy by preserving the original task semantics and user intent. To achieve this, we design STA as an iterative, multi-agent collaborative framework with explicit rewrite-policy control that generates benign-looking rewrites of the original prompt with high semantic fidelity. Extensive experiments across 6 models (including both open-source models and closed-source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA.
[74] Stylizing ViT: Anatomy-Preserving Instance Style Transfer for Domain Generalization cs.CV | cs.LG | eess.IVPDF
Sebastian Doerrich, Francesco Di Salvo, Jonas Alle, Christian Ledig
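TL;DR: This paper proposes Stylizing ViT, a novel Vision Transformer encoder for domain generalization in medical image analysis. Weight-shared attention blocks perform both self-attention and cross-attention, achieving style transfer while preserving anatomical consistency and producing artifact-free, perceptually convincing stylized images for data augmentation.
Details
Motivation: Models in medical image analysis generalize poorly across domains and demographic groups because of data heterogeneity and scarcity, and existing style augmentation methods are limited by insufficient style diversity or by the image artifacts they introduce.
Result: Evaluated on three distinct image classification tasks in histopathology and dermatology, the method improves robustness over the state of the art by up to +13% accuracy while generating perceptually convincing, artifact-free images. Used for test-time augmentation during inference, it yields a 17% performance improvement.
Insight: The core novelty is a weight-shared attention block design in which the same module maintains anatomical consistency through self-attention while performing style transfer through cross-attention. This offers a new direction for domain generalization methods that combine structure preservation with style transformation, and the method is effective at both training and inference time.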
Abstract: Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness, but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but fall short in terms of style diversity or by introducing artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in the context of histopathology and dermatology. Results demonstrate an improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test-time augmentation. The source code is available at https://github.com/sdoerrich97/stylizing-vit .
[75] SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation cs.CVPDF
Taewan Cho, Taeryang Kim, Andrew Jaeyong Choi
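TL;DR: SPACE-CLIP is a new architecture for monocular depth estimation that uses a dual-pathway decoder to extract and interpret latent geometric knowledge directly from a frozen CLIP vision encoder, bypassing the text encoder and textual prompts. A semantic pathway interprets high-level features and a structural pathway extracts fine-grained spatial details from early layers; the two are hierarchically fused for a robust synthesis of semantic context and precise geometry.
Details
Motivation: CLIP has been highly successful at semantic understanding but inherently struggles to perceive geometric structure. Existing methods try to bridge this gap with textual prompts, a process that is often indirect and inefficient. The paper aims to use the latent geometric knowledge in the CLIP vision encoder directly, for more efficient and elegant spatial perception.
Result: Extensive experiments on the KITTI benchmark show that SPACE-CLIP substantially outperforms previous CLIP-based methods. Ablation studies confirm that the synergistic fusion of the two pathways is critical to this result.
Insight: The novelty is a dual-pathway decoder that extracts geometric knowledge directly from a frozen CLIP vision encoder without textual prompts. The complementary, hierarchically fused semantic pathway (dynamically conditioned with FiLM) and structural pathway combine semantic and geometric information, providing an efficient blueprint for repurposing large-scale vision models and a readily integrable spatial perception module for embodied AI systems.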
Abstract: Contrastive Language-Image Pre-training (CLIP) has accomplished extraordinary success for semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper introduces a fundamentally different approach using a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, completely bypassing the text encoder and its associated textual prompts. A semantic pathway interprets high-level features, dynamically conditioned on global context using feature-wise linear modulation (FiLM). In addition, a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are hierarchically fused, enabling a robust synthesis of semantic context and precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP dramatically outperforms previous CLIP-based methods. Our ablation studies validate that the synergistic fusion of our dual pathways is critical to this success. SPACE-CLIP offers a new, efficient, and architecturally elegant blueprint for repurposing large-scale vision models. The proposed method is not just a standalone depth estimator, but a readily integrable spatial perception module for the next generation of embodied AI systems, such as vision-language-action (VLA) models. Our model is available at https://github.com/taewan2002/space-clip
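The semantic pathway is described as conditioned through feature-wise linear modulation (FiLM). A minimal sketch of the FiLM operation itself follows: each channel is scaled and shifted by its own `gamma` and `beta`. In a real model these would be predicted from a conditioning signal such as global context; here they are passed in directly, and the function name and data layout are assumptions for illustration, not SPACE-CLIP's implementation.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: out[c][i] = gamma[c] * features[c][i] + beta[c].

    `features` is a list of channels, each a flat list of activations.
    `gamma` and `beta` hold one scale and one shift per channel.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels: the first is doubled, the second is halved and shifted by one.
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

Because the modulation is a per-channel affine transform, it adds very few parameters and can be applied on top of frozen features, which fits the frozen-encoder setting described in the abstract.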
[76] Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing cs.CV | cs.AIPDF
Weiyu Zhang, Yuan Hu, Yong Li, Yu Liu
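TL;DR: This paper proposes Uni-RS, the first unified multimodal model tailored for remote sensing, which addresses the spatial asymmetry between understanding and generation in existing models: they can accurately recognize object locations in images but struggle to faithfully execute the same spatial relations when generating images from text.
Details
Motivation: Existing unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: they can accurately recognize and describe object locations in images, but often fail to faithfully execute the same spatial relations in text-to-image generation, even though these relations are core semantic information in remote sensing imagery.
Result: Extensive experiments across multiple benchmarks show that the method substantially improves spatial faithfulness in text-to-image generation while maintaining strong performance on multimodal understanding tasks such as image captioning, visual grounding, and visual question answering.
Insight: The contributions are: 1. explicit Spatial-Layout Planning, which converts textual instructions into spatial layout plans and thereby decouples geometric planning from visual synthesis; 2. Spatial-Aware Query Supervision, which biases learnable queries toward the spatial relations explicitly specified in the instruction; 3. Image-Caption Spatial Layout Variation, which exposes the model to systematic, geometry-consistent spatial transformations to strengthen its spatial understanding.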
Abstract: Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: Although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, where such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning to transform textual instructions into spatial layout plans, decoupling geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation, while maintaining strong performance on multimodal understanding tasks like image captioning, visual grounding, and VQA tasks.
[77] StyleDecoupler: Generalizable Artistic Style Disentanglement cs.CVPDF
Zexi Jia, Jinchao Zhang, Jie Zhou
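TL;DR: The paper proposes StyleDecoupler, a generalizable, information-theoretic framework for disentangling artistic style. It exploits the fact that uni-modal models suppress style and focus on content-invariant features, using them as content-only references and isolating pure style features from multi-modal vision-language models through mutual information minimization. The framework requires no fine-tuning and works as a plug-and-play module. The paper also releases WeART, a large-scale artistic style benchmark.
Details
Motivation: Artistic style is deeply entangled with semantic content and hard to represent separately; the paper aims for generalizable disentanglement of style features.
Result: The method achieves state-of-the-art performance on style retrieval on the WeART and WikiART datasets and supports applications such as style relationship mapping and generative model evaluation.
Insight: The paper exploits the difference in how uni-modal and multi-modal models encode style, using an information-theoretic approach to disentangle style without training. It also builds WeART, a large-scale, multi-granularity artistic style dataset that provides an important benchmark for related research.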
Abstract: Representing artistic style is challenging due to its deep entanglement with semantic content. We propose StyleDecoupler, an information-theoretic framework that leverages a key insight: multi-modal vision models encode both style and content, while uni-modal models suppress style to focus on content-invariant features. By using uni-modal representations as content-only references, we isolate pure style features from multi-modal embeddings through mutual information minimization. StyleDecoupler operates as a plug-and-play module on frozen Vision-Language Models without fine-tuning. We also introduce WeART, a large-scale benchmark of 280K artworks across 152 styles and 1,556 artists. Experiments show state-of-the-art performance on style retrieval across WeART and WikiART, while enabling applications like style relationship mapping and generative model evaluation. We release our method and dataset at this url.
[78] Advancing Structured Priors for Sparse-Voxel Surface Reconstruction cs.CVPDF
Ting-Hsun Chi, Chu-Rong Chen, Chi-Tun Hsu, Hsuan-Ting Lin, Sheng-Yu Huang
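TL;DR: This paper proposes a surface reconstruction method that combines the strengths of 3D Gaussian Splatting and sparse-voxel rasterization. With a scene-structure-based voxel initialization strategy and refined depth geometry supervision, it achieves more accurate geometry, better detail recovery, and more complete surfaces while maintaining fast convergence.
Details
Motivation: Among radiance-field-based surface reconstruction methods, 3D Gaussian Splatting and sparse-voxel rasterization have complementary strengths and weaknesses. The former converges quickly, but its surface fidelity is limited by a point-like parameterization; the latter provides continuous opacity fields and crisp geometry, but its initialization usually relies on a uniform dense grid, which slows convergence and underuses scene structure. The paper aims to combine the advantages of both to improve reconstruction quality.
Result: On standard benchmarks, the method outperforms prior methods in geometric accuracy, fine-structure recovery, and surface completeness while maintaining fast convergence.
Insight: The contributions are: 1) a scene-structure-based voxel initialization that places voxels at plausible locations with appropriate levels of detail, giving per-scene optimization a strong starting point; 2) refined depth geometry supervision that converts multi-view cues into direct per-ray depth regularization, improving depth consistency without blurring edges.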
Abstract: Reconstructing accurate surfaces with radiance fields has progressed rapidly, yet two promising explicit representations, 3D Gaussian Splatting and sparse-voxel rasterization, exhibit complementary strengths and weaknesses. 3D Gaussian Splatting converges quickly and carries useful geometric priors, but surface fidelity is limited by its point-like parameterization. Sparse-voxel rasterization provides continuous opacity fields and crisp geometry, but its typical uniform dense-grid initialization slows convergence and underutilizes scene structure. We combine the advantages of both by introducing a voxel initialization method that places voxels at plausible locations and with appropriate levels of detail, yielding a strong starting point for per-scene optimization. To further enhance depth consistency without blurring edges, we propose refined depth geometry supervision that converts multi-view cues into direct per-ray depth regularization. Experiments on standard benchmarks demonstrate improvements over prior methods in geometric accuracy, better fine-structure recovery, and more complete surfaces, while maintaining fast convergence.
[79] Flatten The Complex: Joint B-Rep Generation via Compositional $k$-Cell Particles cs.CV | cs.GRPDF
Junran Lu, Yuanqi Li, Hengji Li, Jie Guo, Yanwen Guo
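TL;DR: This paper proposes a new paradigm for generating boundary representations (B-Reps). It reformulates a B-Rep as a set of compositional k-cell particles that uniformly encode topological entities of different orders (vertices, edges, and faces), so that topology and geometry are generated jointly. The particle sets are synthesized with a multi-modal flow matching framework that supports unconditional generation as well as conditional tasks such as 3D reconstruction from a single view or a point cloud, and the approach extends naturally to local in-painting and direct synthesis of non-manifold structures such as wireframes.
Details
Motivation: B-Rep is the standard in CAD, but its inherent heterogeneity (as a geometric cell complex that entangles topology with geometry across cells of different orders) makes generative modeling very challenging. Existing methods typically rely on cascaded sequences to handle this hierarchy and fail to fully exploit geometric relationships between cells, such as adjacency and sharing, which limits context awareness and error recovery.
Result: Extensive experiments show that, compared with state-of-the-art methods, the approach generates CAD models with higher fidelity and superior validity and editability.
Insight: The core novelty is reformulating B-Reps as sets of compositional k-cell particles in which adjacent cells share identical latents at their interfaces, promoting geometric coupling along shared boundaries. This decouples the rigid hierarchy and enables joint generation of topology and geometry with global context awareness. The explicit, localized nature of the representation naturally supports downstream tasks.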
Abstract: Boundary Representation (B-Rep) is the widely adopted standard in Computer-Aided Design (CAD) and manufacturing. However, generative modeling of B-Reps remains a formidable challenge due to their inherent heterogeneity as geometric cell complexes, which entangles topology with geometry across cells of varying orders (i.e., $k$-cells such as vertices, edges, faces). Previous methods typically rely on cascaded sequences to handle this hierarchy, which fails to fully exploit the geometric relationships between cells, such as adjacency and sharing, limiting context awareness and error recovery. To fill this gap, we introduce a novel paradigm that reformulates B-Reps into sets of compositional $k$-cell particles. Our approach encodes each topological entity as a composition of particles, where adjacent cells share identical latents at their interfaces, thereby promoting geometric coupling along shared boundaries. By decoupling the rigid hierarchy, our representation unifies vertices, edges, and faces, enabling the joint generation of topology and geometry with global context awareness. We synthesize these particle sets using a multi-modal flow matching framework to handle unconditional generation as well as precise conditional tasks, such as 3D reconstruction from single-view or point cloud. Furthermore, the explicit and localized nature of our representation naturally extends to downstream tasks like local in-painting and enables the direct synthesis of non-manifold structures (e.g., wireframes). Extensive experiments demonstrate that our method produces high-fidelity CAD models with superior validity and editability compared to state-of-the-art methods.
[80] The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation cs.CV | cs.AIPDF
Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao
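TL;DR: This paper proposes an end-to-end agentic framework for generating cinematic video with long-horizon narrative coherence from dialogue. At its core is ScripterAgent, which turns coarse dialogue into a fine-grained, executable cinematic script; DirectorAgent then orchestrates state-of-the-art video models with a cross-scene continuous generation strategy to ensure long-horizon coherence.
Details
Motivation: Existing video generation models can produce visual content from simple text prompts but struggle to generate long-form, coherent narratives from high-level concepts such as dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. The paper aims to bridge this gap.
Result: A comprehensive evaluation, including an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows that the framework significantly improves script faithfulness and temporal fidelity across all tested video models. The analysis also reveals a key trade-off in current SOTA models between visual spectacle and strict script adherence.
Insight: The main contributions are: 1) an end-to-end agentic framework that splits dialogue-to-video generation into script generation and video orchestration; 2) ScriptBench, a large-scale multimodal benchmark; 3) a cross-scene continuous generation strategy and the new VSA metric for quantifying the coherence and faithfulness of long-horizon video.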
Abstract: Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
[81] Frequency-aware Neural Representation for Videos cs.CVPDF
Jun Zhu, Xinfeng Zhang, Lv Tang, Junhao Jiang, Gai Zhang
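TL;DR: This paper proposes FaNeRV, a frequency-aware neural representation for videos. It explicitly decouples the low- and high-frequency components of a video and combines multi-resolution supervision, dynamic high-frequency injection, and a frequency-decomposed network module to counter the spectral bias inherent in existing implicit neural representation (INR) methods, enabling efficient and faithful video reconstruction.
Details
Motivation: Existing INR-based video compression frameworks typically suffer from inherent spectral bias, a preference for low-frequency components that leads to over-smoothed reconstructions and suboptimal rate-distortion performance.
Result: Extensive experiments on standard benchmarks show that FaNeRV significantly outperforms state-of-the-art INR methods and is competitive with traditional codecs in rate-distortion performance.
Insight: The contributions are: explicit decoupling of low- and high-frequency components to mitigate spectral bias; a multi-resolution supervision strategy that captures global structure and fine-grained texture in stages; a dynamic high-frequency injection mechanism that adaptively emphasizes challenging regions; and a frequency-decomposed network module that improves feature modeling across frequency bands.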
Abstract: Implicit Neural Representations (INRs) have emerged as a promising paradigm for video compression. However, existing INR-based frameworks typically suffer from inherent spectral bias, which favors low-frequency components and leads to over-smoothed reconstructions and suboptimal rate-distortion performance. In this paper, we propose FaNeRV, a Frequency-aware Neural Representation for videos, which explicitly decouples low- and high-frequency components to enable efficient and faithful video reconstruction. FaNeRV introduces a multi-resolution supervision strategy that guides the network to progressively capture global structures and fine-grained textures through staged supervision. To further enhance high-frequency reconstruction, we propose a dynamic high-frequency injection mechanism that adaptively emphasizes challenging regions. In addition, we design a frequency-decomposed network module to improve feature modeling across different spectral bands. Extensive experiments on standard benchmarks demonstrate that FaNeRV significantly outperforms state-of-the-art INR methods and achieves competitive rate-distortion performance against traditional codecs.
[82] Video Compression with Hierarchical Temporal Neural Representation cs.CVPDF
Jun Zhu, Xinfeng Zhang, Lv Tang, Junhao Jiang, Gai Zhang
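TL;DR: This paper proposes TeNeRV, a hierarchical temporal neural representation for video compression. By integrating an inter-frame feature fusion module with a GoP-adaptive modulation mechanism, it captures both short- and long-term temporal dependencies in video and improves compression performance.
Details
Motivation: Existing INR-based video compression methods usually treat the temporal dimension as an independent input, which makes complex temporal dependencies hard to capture. A method that models temporal information at multiple levels is therefore needed.
Result: In extensive experiments, TeNeRV consistently outperforms existing INR-based methods in rate-distortion performance, validating its effectiveness.
Insight: The contributions are an inter-frame feature fusion module that strengthens local temporal coherence and a GoP-adaptive modulation mechanism that learns group-specific priors, enabling adaptive representations across different Groups-of-Pictures and improving temporal modeling.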
Abstract: Video compression has recently benefited from implicit neural representations (INRs), which model videos as continuous functions. INRs offer compact storage and flexible reconstruction, providing a promising alternative to traditional codecs. However, most existing INR-based methods treat the temporal dimension as an independent input, limiting their ability to capture complex temporal dependencies. To address this, we propose a Hierarchical Temporal Neural Representation for Videos, TeNeRV. TeNeRV integrates short- and long-term dependencies through two key components. First, an Inter-Frame Feature Fusion (IFF) module aggregates features from adjacent frames, enforcing local temporal coherence and capturing fine-grained motion. Second, a GoP-Adaptive Modulation (GAM) mechanism partitions videos into Groups-of-Pictures and learns group-specific priors. The mechanism modulates network parameters, enabling adaptive representations across different GoPs. Extensive experiments demonstrate that TeNeRV consistently outperforms existing INR-based methods in rate-distortion performance, validating the effectiveness of our proposed approach.
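The two temporal scales described above can be pictured with simple index bookkeeping. The sketch below is a hypothetical illustration: it assumes fixed-size GoPs and a symmetric neighbor window, and the helper names are invented here. The abstract specifies neither the GoP size nor how many adjacent frames the IFF module aggregates.

```python
def partition_into_gops(num_frames, gop_size):
    """Group frame indices into consecutive Groups-of-Pictures (long-term unit)."""
    return [
        list(range(start, min(start + gop_size, num_frames)))
        for start in range(0, num_frames, gop_size)
    ]


def adjacent_frames(t, num_frames, radius=1):
    """Indices of the frames next to frame t (short-term unit), clipped at the ends."""
    return [
        i for i in range(t - radius, t + radius + 1)
        if 0 <= i < num_frames and i != t
    ]


gops = partition_into_gops(num_frames=10, gop_size=4)
neighbors = adjacent_frames(t=5, num_frames=10)
```

In this picture, each GoP would carry its own learned modulation parameters, while each frame's features would be fused with those of its immediate neighbors.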
[83] Bridging Supervision Gaps: A Unified Framework for Remote Sensing Change Detection cs.CVPDF
Kaixuan Jiang, Chen Wu, Zhenghui Zhao, Chengxi Han
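TL;DR: This paper proposes UniCD, a unified remote sensing change detection framework that jointly handles fully supervised, weakly supervised, and unsupervised tasks. Through a shared encoder and a multi-branch collaborative learning mechanism, it removes the architectural barriers between heterogeneous supervision signals and achieves state-of-the-art performance on mainstream datasets.
Details
Motivation: In real-world scenarios, pixel-level change labels are expensive to acquire, and existing models struggle to adapt to settings with different levels of annotation availability.
Result: On LEVIR-CD, accuracy in the weakly supervised and unsupervised settings exceeds the current state of the art by 12.72% and 12.37%, respectively, and the framework achieves the best performance on all three tasks.
Insight: The contributions are: 1) deep coupling of heterogeneous supervision through a shared encoder and multi-branch collaboration; 2) a spatial-temporal awareness module in the supervised branch for efficient synergistic fusion of bi-temporal features; 3) change representation regularization in the weakly supervised branch to guide model convergence; 4) semantic prior-driven change inference in the unsupervised branch, which turns the unsupervised task into a controlled weakly supervised path optimization.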
Abstract: Change detection (CD) aims to identify surface changes from multi-temporal remote sensing imagery. In real-world scenarios, pixel-level change labels are expensive to acquire, and existing models struggle to adapt to scenarios with diverse annotation availability. To tackle this challenge, we propose a unified change detection framework (UniCD), which collaboratively handles supervised, weakly-supervised, and unsupervised tasks through a coupled architecture. UniCD eliminates architectural barriers through a shared encoder and multi-branch collaborative learning mechanism, achieving deep coupling of heterogeneous supervision signals. Specifically, UniCD consists of three supervision-specific branches. In the supervised branch, UniCD introduces the spatial-temporal awareness module (STAM), achieving efficient synergistic fusion of bi-temporal features. In the weakly-supervised branch, we construct change representation regularization (CRR), which steers model convergence from coarse-grained activations toward coherent and separable change modeling. In the unsupervised branch, we propose semantic prior-driven change inference (SPCI), which transforms unsupervised tasks into controlled weakly-supervised path optimization. Experiments on mainstream datasets demonstrate that UniCD achieves optimal performance across three tasks. It exhibits significant accuracy improvements in weakly-supervised and unsupervised scenarios, surpassing the current state of the art by 12.72% and 12.37% on LEVIR-CD, respectively.
[84] MV-S2V: Multi-View Subject-Consistent Video Generation cs.CV | cs.AI | cs.GRPDF
Ziyang Song, Xinyu Gong, Bangya Liu, Zelin Zhao
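TL;DR: This paper introduces the multi-view subject-consistent video generation (MV-S2V) task, which synthesizes videos from multiple reference views to achieve 3D-level subject consistency. To address the scarcity of training data, the authors develop a synthetic data generation pipeline, supplemented by a small-scale real-world dataset. To distinguish different subjects from different views of the same subject, they introduce Temporally Shifted RoPE (TS-RoPE). The framework produces high-quality visual output while significantly improving 3D subject consistency under multi-view references.
Details
Motivation: Existing subject-to-video (S2V) methods are limited to single-view subject references, which essentially reduces the task to an S2I + I2V pipeline and fails to exploit the full potential of video subject control. This paper tackles the more challenging multi-view S2V task, using multi-view information to enforce stricter 3D subject consistency.
Result: The proposed framework delivers high-quality visual output and superior 3D subject consistency with respect to multi-view reference images, opening a new research direction for subject-driven video generation.
Insight: The main contributions are: 1) proposing and formalizing the new multi-view S2V (MV-S2V) task; 2) a dedicated synthetic data generation pipeline that addresses data scarcity; 3) the Temporally Shifted RoPE (TS-RoPE) mechanism, which separates cross-subject from cross-view references in conditional generation. Viewed objectively, bringing multi-view information into subject-driven video generation and handling view relationships through a dedicated positional mechanism is the key contribution of this work.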
Abstract: Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. Regarding the scarcity of training data, we first develop a synthetic data curation pipeline to generate highly customized synthetic data, complemented by a small-scale real-world captured dataset to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE) to distinguish between different subjects and distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs, establishing a new meaningful direction for subject-driven video generation. Our project page is available at this URL
[85] ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning cs.CVPDF
Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang
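TL;DR: This paper proposes ViTCoP, a visual and textual semantic collaborative pruning framework for accelerating large vision-language models (LVLMs). It filters redundant visual tokens in the vision encoder and then performs step-wise co-pruning based on the hierarchical characteristics of the LLM, efficiently preserving critical and informative visual tokens. For compatibility with acceleration techniques such as FlashAttention, it uses the L2 norm of K-vectors as the token saliency metric. Experiments show that ViTCoP surpasses existing methods on image and video understanding tasks, reaches state-of-the-art performance, and significantly reduces inference latency and GPU memory consumption, with a larger advantage at extreme pruning rates.
Details
Motivation: Large vision-language models are computationally expensive because their visual tokens are highly redundant. Existing visual token pruning methods are limited: pruning in the vision encoder loses critical visual information prematurely, while pruning in the large language model (LLM) leaves redundant information among the selected tokens. ViTCoP is proposed to address both problems.
Result: Extensive experiments on several large vision-language models show that ViTCoP surpasses existing methods on image and video understanding tasks and achieves state-of-the-art (SOTA) performance. It also significantly reduces inference latency and GPU memory consumption, and its advantage becomes more pronounced at extreme pruning rates.
Insight: The contribution is a collaborative pruning framework over visual and textual semantics that combines redundancy filtering in the vision encoder with step-wise co-pruning based on the LLM's hierarchical characteristics, preserving critical visual information more effectively. Viewed objectively, using the L2 norm of K-vectors as the token saliency metric is a practical design that balances effectiveness with compatibility with acceleration techniques.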
Abstract: Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing methods are generally limited, either losing critical visual information prematurely due to pruning in the vision encoder, or leading to information redundancy among the selected tokens due to pruning in the Large Language Models (LLMs). To address these challenges, we propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the LLM based on its hierarchical characteristics, to efficiently preserve critical and informationally diverse visual tokens. Meanwhile, to ensure compatibility with acceleration techniques like FlashAttention, we introduce the L2 norm of K-vectors as the token saliency metric in the LLM. Extensive experiments on various Large Vision-Language Models demonstrate that ViTCoP not only achieves state-of-the-art performance surpassing existing methods on both image and video understanding tasks, but also significantly reduces model inference latency and GPU memory consumption. Notably, its performance advantage over other methods becomes even more pronounced under extreme pruning rates.
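A saliency score computed from key vectors alone is attractive because fused kernels such as FlashAttention do not materialize the full attention matrix, so attention-weight-based scores are not readily available. The sketch below ranks tokens by the L2 norm of their key vectors. Whether ViTCoP keeps tokens with larger or smaller norms is not stated in the abstract, so the `keep_largest` flag and the function names are assumptions.

```python
import math


def key_l2_norms(keys):
    """L2 norm of each token's key vector."""
    return [math.sqrt(sum(x * x for x in key)) for key in keys]


def select_tokens(keys, keep, keep_largest=True):
    """Return the indices of the `keep` tokens ranked best by key norm.

    The ranking direction is an assumption made for illustration.
    """
    norms = key_l2_norms(keys)
    order = sorted(range(len(keys)), key=lambda i: norms[i], reverse=keep_largest)
    return sorted(order[:keep])  # restore original token order


keys = [[3.0, 4.0], [0.0, 1.0], [1.0, 1.0], [6.0, 8.0]]
kept = select_tokens(keys, keep=2)
```

Because the score depends only on the keys already held in the KV cache, it can be computed per layer without an extra attention pass.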
[86] SynMind: Reducing Semantic Hallucination in fMRI-Based Image Reconstruction cs.CVPDF
Lan Yang, Minghan Yang, Ke Li, Honggang Zhang, Kaiyue Pang
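TL;DR: This paper proposes SynMind, a framework that targets semantic hallucination in fMRI-based image reconstruction. By parsing fMRI signals into sentence-level semantic descriptions and combining them with visual priors to condition a pretrained diffusion model, it reduces semantic misalignment while maintaining high visual quality.
Details
Motivation: Existing fMRI image reconstruction methods rely too heavily on entangled visual embeddings, which prioritize low-level cues such as texture and global appearance over explicit semantic identity. As a result, reconstructed images often show severe semantic misalignment, with salient objects replaced or hallucinated.
Result: SynMind outperforms existing state-of-the-art methods on most quantitative metrics. Using the smaller Stable Diffusion 1.4 model and a single consumer GPU, it surpasses competing methods based on SDXL. Large-scale human evaluation confirms that its reconstructions are more consistent with human visual perception.
Insight: The contribution is to parse fMRI signals into hierarchical, compositional sentence-level semantic descriptions, using grounded vision-language models to generate multi-granularity text representations that capture object identity and spatial organization. Semantic reasoning is thereby offloaded to the text-alignment module, which mitigates over-reliance on high-level visual brain areas.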
Abstract: Recent advances in fMRI-based image reconstruction have achieved remarkable photo-realistic fidelity. Yet, a persistent limitation remains: while reconstructed images often appear naturalistic and holistically similar to the target stimuli, they frequently suffer from severe semantic misalignment – salient objects are often replaced or hallucinated despite high visual quality. In this work, we address this limitation by rethinking the role of explicit semantic interpretation in fMRI decoding. We argue that existing methods rely too heavily on entangled visual embeddings which prioritize low-level appearance cues – such as texture and global gist – over explicit semantic identity. To overcome this, we parse fMRI signals into rich, sentence-level semantic descriptions that mirror the hierarchical and compositional nature of human visual understanding. We achieve this by leveraging grounded VLMs to generate synthetic, human-like, multi-granularity textual representations that capture object identities and spatial organization. Built upon this foundation, we propose SynMind, a framework that integrates these explicit semantic encodings with visual priors to condition a pretrained diffusion model. Extensive experiments demonstrate that SynMind outperforms state-of-the-art methods across most quantitative metrics. Notably, by offloading semantic reasoning to our text-alignment module, SynMind surpasses competing methods based on SDXL while using the much smaller Stable Diffusion 1.4 and a single consumer GPU. Large-scale human evaluations further confirm that SynMind produces reconstructions more consistent with human visual perception. Neurovisualization analyses reveal that SynMind engages broader and more semantically relevant brain regions, mitigating the over-reliance on high-level visual areas.
[87] MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance cs.CVPDF
Yoonwoo Jeong, Cheng Sun, Yu-Chiang Frank Wang, Minsu Cho, Jaesung Choe
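TL;DR: MV-SAM is a framework for multi-view image segmentation that improves 3D consistency by using pointmaps, that is, 3D points reconstructed from unposed images. It extends the Segment Anything Model (SAM) by lifting image embeddings into 3D point embeddings that are decoded by a Transformer, without explicit 3D networks or annotated 3D data.
Details
Motivation: Existing promptable segmentation models such as SAM lack 3D awareness on multi-view images, which leads to inconsistent segmentation and usually requires costly per-scene optimization to enforce 3D consistency. This paper aims to achieve 3D-consistent multi-view segmentation without per-scene optimization.
Result: Trained on SA-1B, the model performs strongly on the NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks, outperforming SAM2-Video and matching baselines that require per-scene optimization.
Insight: The core idea is to use the one-to-one pixel-point correspondence provided by pointmaps to lift 2D images and prompts into 3D space, and to learn cross-view consistent segmentation masks implicitly through 3D positional embeddings. This removes the need for explicit 3D networks and annotated 3D data and yields efficient, 3D-consistent multi-view segmentation.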
Abstract: Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps – 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the pixel-point one-to-one correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image embeddings from its pretrained encoder into 3D point embeddings, which are decoded by a transformer using cross-attention with 3D prompt embeddings. This design aligns 2D interactions with 3D geometry, enabling the model to implicitly learn consistent masks across views through 3D positional embeddings. Trained on the SA-1B dataset, our method generalizes well across domains, outperforming SAM2-Video and achieving comparable performance with per-scene optimization baselines on NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks. Code will be released.
[88] VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding cs.CV | cs.AIPDF
Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao
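TL;DR: This paper proposes VidLaDA, a video large language model built on a diffusion language model with bidirectional attention. It targets the inefficient global spatiotemporal modeling that causal masking bias causes in conventional autoregressive video LLMs. To speed up diffusion decoding over massive numbers of video tokens, the paper also introduces MARS-Cache, which prunes redundancy while preserving global connectivity through asynchronous visual cache refreshing and frame-wise chunk attention.
Details
Motivation: To remove the obstacle that causal masking bias in standard autoregressive video LLMs poses to global spatiotemporal modeling, and to improve video understanding efficiency.
Result: In extensive experiments, VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models such as Qwen2.5-VL and LLaVA-Video. MARS-Cache delivers more than a 12x inference speedup without compromising reasoning accuracy.
Insight: The core contributions are bringing a bidirectional diffusion language model architecture to video understanding to capture bidirectional dependencies, and designing MARS-Cache, an efficient inference acceleration framework that uses asynchronous caching and anchor tokens to prune redundancy while maintaining global connectivity, offering a new efficient paradigm for long video sequences.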
Abstract: Standard Autoregressive Video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, leading to suboptimal understanding efficiency. We propose VidLaDA, a Video LLM based on Diffusion Language Model utilizing bidirectional attention to capture bidirectional dependencies. To further tackle the inference bottleneck of diffusion decoding on massive video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity via anchor tokens. Extensive experiments show VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
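One way to picture frame-wise chunk attention with anchor tokens is as a boolean attention mask. The sketch below is a hypothetical layout, not the MARS-Cache implementation: it assumes the first token(s) of every frame act as anchors, that each token attends within its own frame, and that anchors are visible to, and can see, every token. None of these details are given in the abstract.

```python
def chunk_attention_mask(num_frames, tokens_per_frame, anchors_per_frame=1):
    """Build a bidirectional mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (illustrative): tokens are stored frame by frame and the
    first `anchors_per_frame` tokens of each frame serve as global anchors.
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    is_anchor = [i % tokens_per_frame < anchors_per_frame for i in range(n)]
    return [
        [frame_of[q] == frame_of[k] or is_anchor[q] or is_anchor[k] for k in range(n)]
        for q in range(n)
    ]


mask = chunk_attention_mask(num_frames=2, tokens_per_frame=3)
allowed = sum(sum(row) for row in mask)  # number of query-key pairs kept
```

With two frames of three tokens each, 28 of the 36 query-key pairs survive; the savings grow with the number of frames because non-anchor tokens never attend across frames directly.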
[89] Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran cs.CVPDF
Muhammad Umar Salman, Mohammad Areeb Qazi, Mohammed Talha Alam
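TL;DR: This paper introduces Quran-MD, a fine-grained multilingual multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. It provides the original Arabic text, English translation, and phonetic transliteration, together with audio from 32 different reciters, and supports applications such as natural language processing, speech recognition, and text-to-speech synthesis.
Details
Motivation: Quranic research lacks a fine-grained multimodal dataset that integrates text, audio, and linguistic detail. This dataset fills that gap to support computational linguistics, digital Islamic studies, and multimodal AI applications.
Result: The dataset is publicly available (https://huggingface.co/datasets/Buraaq/quran-audio-text-dataset) and serves as a foundational resource for tasks such as ASR, tajweed detection, and TTS. The abstract does not report specific benchmarks or performance comparisons.
Insight: The contribution is fine-grained, word-level aligned multimodal data (audio, text, and translation) combined with the styles of multiple reciters, which lays the groundwork for multimodal embeddings, semantic retrieval, and personalized tutoring systems.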
Abstract: We present Quran-MD, a comprehensive multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. For each verse (ayah), the dataset provides its original Arabic text, English translation, and phonetic transliteration. To capture the rich oral tradition of Quranic recitation, we include verse-level audio from 32 distinct reciters, reflecting diverse recitation styles and dialectal nuances. At the word level, each token is paired with its corresponding Arabic script, English translation, transliteration, and an aligned audio recording, allowing fine-grained analysis of pronunciation, phonology, and semantic context. This dataset supports various applications, including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies. Bridging text and audio modalities across multiple reciters, this dataset provides a unique resource to advance computational approaches to Quranic recitation and study. Beyond enabling tasks such as ASR, tajweed detection, and Quranic TTS, it lays the foundation for multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems that can support both research and community applications. The dataset is available at https://huggingface.co/datasets/Buraaq/quran-audio-text-dataset
[90] PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation cs.CV | cs.AI | cs.ROPDF
Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen
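TL;DR: This paper proposes PEAfowl, a perception-enhanced multi-view vision-language-action policy for bimanual manipulation. It strengthens spatial reasoning by predicting a depth distribution for each token and performing differentiable 3D lifting, and it improves instruction grounding with a Perceiver-style text-aware readout. On the RoboTwin 2.0 benchmark, PEAfowl significantly raises the success rate and shows reliable sim-to-real transfer.
Details
Motivation: Existing vision-language-action models generalize poorly on bimanual manipulation tasks, specifically because multi-view feature fusion lacks 3D spatial consistency and because injecting the language instruction as global conditioning yields coarse grounding.
Result: Under the domain-randomized setting of RoboTwin 2.0, PEAfowl improves on the strongest baseline's success rate by 23.0 percentage points. Real-robot experiments further confirm reliable sim-to-real transfer and consistent gains from depth distillation.
Insight: The contributions are: geometrically grounded, cross-view consistent representations obtained by predicting token-level depth distributions and applying differentiable 3D lifting; a text-aware iterative evidence accumulation mechanism that strengthens instruction grounding; and training-only depth distillation that transfers geometric priors from a pretrained depth teacher without adding inference overhead.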
Abstract: Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we propose to replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing perception front-end with geometry-aware priors. On RoboTwin 2.0 under domain-randomized setting, PEAfowl improves the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent improvements from depth distillation. Project website: https://peafowlvla.github.io/.
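The phrase "predicts per-token depth distributions, performs differentiable 3D lifting" can be read as a soft expectation over depth bins followed by back-projection through a camera model. The sketch below follows that reading under stated assumptions: a softmax over discrete depth bins, a pinhole camera with intrinsics `fx, fy, cx, cy`, and one pixel coordinate per token. The abstract does not confirm any of these choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, bin_depths):
    """Soft depth estimate: probability-weighted mean of the bin centers.

    Being a smooth function of the logits, it stays differentiable when
    written in an autodiff framework.
    """
    return sum(p * d for p, d in zip(softmax(logits), bin_depths))


def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth with a pinhole camera model."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


depth = expected_depth([0.0, 0.0], bin_depths=[1.0, 3.0])
point = lift_to_3d(820.0, 240.0, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Once every token in every view has a 3D position, neighbors across views can be found by distance in a shared frame rather than by position in the token sequence.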
[91] Feature-Space Generative Models for One-Shot Class-Incremental Learning cs.CV | cs.AI | stat.MLPDF
Jack Foster, Kirill Paramonov, Mete Ozay, Umberto Michieli
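TL;DR: This paper proposes Gen1S, a new method for one-shot class-incremental learning. Based on the assumption that base-class and novel-class embeddings are structurally similar, it maps the original embedding space into a residual space and uses a generative model (a VAE or a diffusion model) to learn the multi-modal distribution of base-class residuals, which then serves as a prior for improving novel-class recognition.
Details
Motivation: After the base training phase, the model receives only a single sample (1-shot) per novel class, and no further training or model modification is allowed. The challenge is to generalize effectively to novel classes under these constraints.
Result: Across multiple benchmarks and backbone architectures, Gen1S consistently surpasses the state of the art (SOTA) in novel-class recognition.
Insight: The contribution is to use a generative model to learn the residual distribution of base-class embeddings as a structural prior, which strengthens generalization from very limited novel-class samples. Viewed objectively, subtracting class prototypes and modeling the multi-modal distribution of the residual space is an effective feature-space regularization strategy.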
Abstract: Few-shot class-incremental learning (FSCIL) is a paradigm where a model, initially trained on a dataset of base classes, must adapt to an expanding problem space by recognizing novel classes with limited data. We focus on the challenging FSCIL setup where a model receives only a single sample (1-shot) for each novel class and no further training or model alterations are allowed after the base training phase. This makes generalization to novel classes particularly difficult. We propose a novel approach predicated on the hypothesis that base and novel class embeddings have structural similarity. We map the original embedding space into a residual space by subtracting the class prototype (i.e., the average class embedding) of input samples. Then, we leverage generative modeling with VAE or diffusion models to learn the multi-modal distribution of residuals over the base classes, and we use this as a valuable structural prior to improve recognition of novel classes. Our approach, Gen1S, consistently improves novel class recognition over the state of the art across multiple benchmarks and backbone architectures.
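The residual-space mapping in the abstract is simple to state concretely: average the embeddings of a class to get its prototype, then subtract that prototype from each embedding. A minimal sketch with plain lists follows; the function names are illustrative and the generative model that would be fit to the residuals is omitted.

```python
def class_prototype(embeddings):
    """Average embedding of one class (the class prototype)."""
    count, dim = len(embeddings), len(embeddings[0])
    return [sum(e[d] for e in embeddings) / count for d in range(dim)]


def residuals(embeddings):
    """Embeddings of one class re-expressed relative to their prototype."""
    proto = class_prototype(embeddings)
    return [[x - p for x, p in zip(e, proto)] for e in embeddings]


base_class = [[1.0, 2.0], [3.0, 4.0]]
base_residuals = residuals(base_class)

one_shot_class = [[5.0, 6.0]]
one_shot_residuals = residuals(one_shot_class)
```

With a single sample the prototype is the sample itself, so the residual is all zeros. That is why the spread of a novel class cannot be estimated from its own data and has to come from a prior learned on the base classes.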
[92] Benchmarking Direct Preference Optimization for Medical Large Vision-Language Models cs.CV | cs.CLPDF
Dain Kim, Jiwoo Lee, Jaehoon Yun, Yong Hoe Koo, Qingyu Chen
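TL;DR: This paper presents the first comprehensive evaluation of multiple direct preference optimization (DPO) variants for medical large vision-language models. It finds that existing methods yield limited and unstable gains over supervised fine-tuning, and it proposes an improved strategy targeting visual misinterpretation errors that achieves a 3.6% improvement on visual question answering.
Details
Motivation: Practical use of medical large vision-language models is constrained by insufficient alignment and reliability, and the effectiveness of direct preference optimization in high-stakes medical settings lacks empirical study. A methodological foundation is needed.
Result: Nine DPO variants are evaluated on two medical LVLMs, LLaVA-Med and HuatuoGPT-Vision. The proposed targeted preference construction strategy improves on the strongest existing DPO baseline by 3.6% on visual question answering.
Insight: The study shows that current DPO methods in the medical domain depend strongly on the task and backbone and struggle to correct visual misinterpretation. Its targeted preference construction addresses visual misinterpretation errors directly, offering a new direction for preference optimization in high-stakes domains.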
Abstract: Large Vision-Language Models (LVLMs) hold significant promise for medical applications, yet their deployment is often constrained by insufficient alignment and reliability. While Direct Preference Optimization (DPO) has emerged as a potent framework for refining model responses, its efficacy in high-stakes medical contexts remains underexplored, lacking the rigorous empirical groundwork necessary to guide future methodological advances. To bridge this gap, we present the first comprehensive examination of diverse DPO variants within the medical domain, evaluating nine distinct formulations across two medical LVLMs: LLaVA-Med and HuatuoGPT-Vision. Our results reveal several critical limitations: current DPO approaches often yield inconsistent gains over supervised fine-tuning, with their efficacy varying significantly across different tasks and backbones. Furthermore, they frequently fail to resolve fundamental visual misinterpretation errors. Building on these insights, we present a targeted preference construction strategy as a proof-of-concept that explicitly addresses visual misinterpretation errors frequently observed in existing DPO models. This design yields a 3.6% improvement over the strongest existing DPO baseline on visual question-answering tasks. To support future research, we release our complete framework, including all training data, model checkpoints, and our codebase at https://github.com/dmis-lab/med-vlm-dpo.
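For readers unfamiliar with the objective being benchmarked, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. It covers only the original DPO formulation; the nine variants evaluated in the paper and its targeted preference construction are not reproduced here, and the default `beta=0.1` is an illustrative choice.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * margin)), where the margin is how much more
    the policy prefers the chosen response than the reference model does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated without overflow
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

When the policy matches the reference the loss equals log 2, and it falls as the policy shifts probability toward the chosen response relative to the reference.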
[93] RemEdit: Efficient Diffusion Editing with Riemannian Geometry cs.CV | cs.MMPDF
Eashan Adhikarla, Brian D. Davison
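TL;DR: RemEdit is an efficient diffusion-based image editing framework. It uses Riemannian geometry to compute geodesic paths in the latent space for smooth semantic edits and adds a task-specific attention pruning mechanism to accelerate inference, surpassing existing state-of-the-art editing methods while maintaining real-time performance.
Details
Motivation: To resolve the critical trade-off between semantic fidelity and inference speed in controllable image generation.
Result: RemEdit surpasses prior SOTA frameworks on image editing tasks and maintains real-time performance at a 50% pruning rate.
Insight: The contributions are: modeling the latent space as a Riemannian manifold and using a Mamba module to learn its structure efficiently so that accurate geodesic paths can be computed; refining control with a dual-SLERP blending technique and goal-aware prompt enrichment from a vision-language model; and introducing a lightweight task-specific attention pruning head that retains only the tokens essential to the edit, avoiding semantic degradation.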
Abstract: Controllable image generation is fundamental to the success of modern generative AI, yet it faces a critical trade-off between semantic fidelity and inference speed. The RemEdit diffusion-based framework addresses this trade-off with two synergistic innovations. First, for editing fidelity, we navigate the latent space as a Riemannian manifold. A mamba-based module efficiently learns the manifold’s structure, enabling direct and accurate geodesic path computation for smooth semantic edits. This control is further refined by a dual-SLERP blending technique and a goal-aware prompt enrichment pass from a Vision-Language Model. Second, for additional acceleration, we introduce a novel task-specific attention pruning mechanism. A lightweight pruning head learns to retain tokens essential to the edit, enabling effective optimization without the semantic degradation common in content-agnostic approaches. RemEdit surpasses prior state-of-the-art editing frameworks while maintaining real-time performance under 50% pruning. Consequently, RemEdit establishes a new benchmark for practical and powerful image editing. Source code: https://www.github.com/eashanadhikarla/RemEdit.
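SLERP (spherical linear interpolation) moves between two vectors along the arc between them rather than along the straight chord, which keeps the interpolant's norm stable for vectors of similar length. The sketch below implements plain SLERP only; how RemEdit combines two such interpolations in its "dual-SLERP" blending is not described in the abstract and is not attempted here.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)  # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Nearly parallel or antipodal: fall back to linear interpolation
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
midpoint_norm = math.sqrt(sum(x * x for x in midpoint))
```

For two orthogonal unit vectors the midpoint stays on the unit circle, whereas linear interpolation would shrink it to a norm of about 0.707.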
[94] FlowMorph: Physics-Consistent Self-Supervision for Label-Free Single-Cell Mechanics in Microfluidic Videos cs.CVPDF
Bora Yimenicioglu, Vishal Manikanden
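TL;DR: This paper proposes FlowMorph, a physics-consistent self-supervised framework that learns a label-free scalar mechanics proxy k for red blood cells (RBCs) from microfluidic videos. It models each cell with a parametric contour, advances boundary points through a differentiable "capsule-in-flow" model that combines laminar advection with curvature-regularized elastic relaxation, and optimizes its loss using only automatically derived silhouettes and optical flow.
Details
Motivation: The mechanical properties of red blood cells are promising biomarkers for hematologic and systemic disease, but existing methods rely on supervised segmentation or hand-crafted kymographs and rarely encode the laminar Stokes-flow physics that governs RBC shape evolution. This paper aims to build a label-free, physics-consistent self-supervised framework for quantifying single-cell mechanics.
Result: Across four public RBC microfluidic datasets, FlowMorph achieves a mean silhouette IoU of 0.905 on physics-rich videos with provided velocity fields and markedly improves area conservation and reduces wall-constraint violations. On roughly 1.5×10^5 centered sequences, the scalar k alone separates tank-treading from flipping dynamics with an AUC of 0.863. Calibrated with only 200 real-time deformability cytometry (RT-DC) events, a monotone map E=g(k) predicts apparent Young's modulus with a mean absolute error of 0.118 MPa on 600 held-out cells and degrades gracefully under shifts in channel geometry, optics, and frame rate.
Insight: The contribution is to integrate laminar Stokes physics into a self-supervised learning framework: a differentiable "capsule-in-flow" model combines physical constraints (advection, elastic relaxation) with data-driven losses (silhouette overlap, optical-flow agreement), so that cell mechanics can be quantified without manual annotation. Viewed objectively, physics consistency improves the model's generalization and interpretability and offers a new paradigm for microfluidic video analysis.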
Abstract: Mechanical properties of red blood cells (RBCs) are promising biomarkers for hematologic and systemic disease, motivating microfluidic assays that probe deformability at throughputs of $10^3$–$10^6$ cells per experiment. However, existing pipelines rely on supervised segmentation or hand-crafted kymographs and rarely encode the laminar Stokes-flow physics that governs RBC shape evolution. We introduce FlowMorph, a physics-consistent self-supervised framework that learns a label-free scalar mechanics proxy $k$ for each tracked RBC from short brightfield microfluidic videos. FlowMorph models each cell by a low-dimensional parametric contour, advances boundary points through a differentiable "capsule-in-flow" combining laminar advection and curvature-regularized elastic relaxation, and optimizes a loss coupling silhouette overlap, intra-cellular flow agreement, area conservation, wall constraints, and temporal smoothness, using only automatically derived silhouettes and optical flow. Across four public RBC microfluidic datasets, FlowMorph achieves a mean silhouette IoU of $0.905$ on physics-rich videos with provided velocity fields and markedly improves area conservation and wall violations over purely data-driven baselines. On $\sim 1.5\times 10^5$ centered sequences, the scalar $k$ alone separates tank-treading from flipping dynamics with an AUC of $0.863$. Using only $200$ real-time deformability cytometry (RT-DC) events for calibration, a monotone map $E=g(k)$ predicts apparent Young's modulus with a mean absolute error of $0.118$ MPa on $600$ held-out cells and degrades gracefully under shifts in channel geometry, optics, and frame rate.
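Two of the quantities named in the abstract, silhouette overlap and area conservation, have simple geometric definitions. The sketch below shows one plausible way to measure each: intersection-over-union on sets of foreground pixels, and the shoelace formula for the area enclosed by a contour. The function names and the pixel-set representation are assumptions; FlowMorph's differentiable versions of these terms are not described in the abstract.

```python
def silhouette_iou(mask_a, mask_b):
    """Intersection-over-union of two silhouettes given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty silhouettes are treated as identical
    return len(mask_a & mask_b) / len(union)


def polygon_area(points):
    """Area enclosed by a closed contour of (x, y) vertices (shoelace formula)."""
    twice_area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


iou = silhouette_iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
area_before = polygon_area([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])
area_after = polygon_area([(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)])
area_change = abs(area_after - area_before)
```

In the example the square deforms into a rectangle of equal area, so an area-conservation penalty built on `area_change` would be zero even though the shape changed.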
[95] MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images cs.CVPDF
Aqsa Yousaf, Sint Sint Win, Megan Coffee, Habeeb Olufowobi
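TL;DR: This paper proposes MorphXAI, an explainable framework for morphological analysis of parasites in blood smear images. It integrates morphological supervision directly into the prediction pipeline so that the model localizes parasites while characterizing clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage.
Details
Motivation: Deep learning models for diagnosing parasitic infections lack interpretability. Existing explainability methods are largely limited to visual heatmaps or attention maps, which do not capture the morphological traits clinicians rely on.
Result: Experiments show that MorphXAI improves detection performance and also provides structured, biologically meaningful explanations.
Insight: The contributions are integrating morphological supervision directly into the detection pipeline, which unifies fine-grained morphological analysis with detection, and curating a clinician-annotated dataset with detailed morphological labels that sets a new benchmark for interpretable parasite analysis.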
Abstract: Parasitic infections remain a pressing global health challenge, particularly in low-resource settings where diagnosis still depends on labor-intensive manual inspection of blood smears and the availability of expert domain knowledge. While deep learning models have shown strong performance in automating parasite detection, their clinical usefulness is constrained by limited interpretability. Existing explainability methods are largely restricted to visual heatmaps or attention maps, which highlight regions of interest but fail to capture the morphological traits that clinicians rely on for diagnosis. In this work, we present MorphXAI, an explainable framework that unifies parasite detection with fine-grained morphological analysis. MorphXAI integrates morphological supervision directly into the prediction pipeline, enabling the model to localize parasites while simultaneously characterizing clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage. To support this task, we curate a clinician-annotated dataset of three parasite species (Leishmania, Trypanosoma brucei, and Trypanosoma cruzi) with detailed morphological labels, establishing a new benchmark for interpretable parasite analysis. Experimental results show that MorphXAI not only improves detection performance over the baseline but also provides structured, biologically meaningful explanations.
[96] Text-Pass Filter: An Efficient Scene Text Detector cs.CVPDF
Chuang Yang, Haozhao Ma, Xu Han, Yuan Yuan, Qi Wang
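TL;DR: This paper proposes Text-Pass Filter (TPF), an efficient scene text detector for arbitrary-shaped text. Mimicking the principle of a band-pass filter, it builds a feature-filter pair for each text and segments the whole text region directly. This avoids the inherent limitations of the conventional shrink-mask expansion strategy and separates adhesive texts naturally without complex decoding or post-processing, enabling real-time detection.
Details
Motivation: Existing methods detect text with a shrink-mask expansion strategy, but the shrinking operation loses the visual features of text margins and blurs the difference between foreground and background, which inherently limits text feature recognition. This paper designs an efficient detector that segments the whole text directly and avoids these limitations.
Result: Experiments demonstrate the effectiveness of the Reinforcement Ensemble Unit (REU) and the Foreground Prior Unit (FPU) and show the superiority of TPF.
Insight: The core idea is to mimic a band-pass filter by building a unique feature-filter pair for each text, so that whole text regions are extracted and separated directly and naturally. In addition, REU addresses the difficulty of recognizing ribbon-like texts with large aspect ratios by strengthening feature consistency within the same text and enlarging the filter's recognition field, while FPU improves the quality of the feature-filter pairs by encouraging the model to discriminate between foreground and background. The approach avoids complex post-processing and offers a new direction for real-time text detection.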
Abstract: To keep text assembly efficient, existing methods detect text via a shrink-mask expansion strategy. However, the shrinking operation discards the visual features of text margins and blurs the difference between foreground and background, which intrinsically limits the recognition of text features. To address this issue, we design the Text-Pass Filter (TPF) for arbitrary-shaped text detection. It segments the whole text directly, which avoids these intrinsic limitations. Notably, unlike previous whole-text-region-based methods, TPF can separate adhesive texts naturally without complex decoding or post-processing, which makes real-time text detection possible. Concretely, a band-pass filter passes components within a specified band of frequencies, called its passband, and blocks components with frequencies above or below this band. This provides a natural idea for extracting whole texts separately. By simulating the band-pass filter, TPF constructs a unique feature-filter pair for each text. In the inference stage, every filter extracts its matched text by passing its pass-feature and blocking other features. Meanwhile, because the large aspect ratio of ribbon-like texts makes it hard to recognize them as a whole, a Reinforcement Ensemble Unit (REU) is designed to enhance the feature consistency of the same text and to enlarge the filter's recognition field. Furthermore, a Foreground Prior Unit (FPU) is introduced to encourage TPF to discriminate between foreground and background, which improves the quality of the feature-filter pairs. Experiments demonstrate the effectiveness of REU and FPU and show the superiority of TPF.
[97] Spatial-Conditioned Reasoning in Long-Egocentric Videos cs.CVPDF
James Tribble, Hao Wang, Si-En Hong, Chaoyi Zhou, Ashish Bastola
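TL;DR: This paper studies how introducing explicit spatial signals, such as depth maps, can strengthen the spatial reasoning of vision-language models (VLMs) on long egocentric videos without modifying model architectures or inference procedures. The authors create the Sanpo-D dataset, benchmark several VLMs on navigation-oriented spatial queries, and analyze how fusing RGB with depth maps affects spatial reasoning.
Details
Motivation: Long egocentric videos pose major challenges for visual navigation because of viewpoint drift and the lack of persistent geometric context. Existing vision-language models perform well on images and short videos, but their spatial reasoning on long sequences is limited. The paper therefore investigates how input-level spatial signals can improve spatial understanding without changing the model.
Result: Benchmarking several VLMs on the re-annotated Sanpo-D dataset shows that depth-aware and spatially grounded representations improve performance on safety-critical tasks such as pedestrian and obstruction detection, but with a trade-off between general-purpose accuracy and spatial specialization.
Insight: The contribution is to introduce a spatial inductive bias through input fusion (RGB plus depth maps), strengthening VLM spatial reasoning on long egocentric videos without modifying the model. Viewed objectively, this is a lightweight, transferable way to improve VLM spatial capability, and it highlights the balance between task specialization and generality.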
Abstract: Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models perform well on image and short-video reasoning, their spatial reasoning capability in long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate their impact on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.
[98] LungCRCT: Causal Representation based Lung CT Processing for Lung Cancer Treatment cs.CV | cs.AIPDF
Daeyoung Kim
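TL;DR: This paper proposes LungCRCT, a lung cancer analysis framework based on latent causal representation learning that extracts, from lung CT images, causal representations tied to the physical causal mechanism of lung cancer progression. By combining graph-autoencoder-based causal discovery, distance correlation disentanglement, and entropy-based image reconstruction refinement, the framework supports causal intervention analysis for lung cancer treatment and yields a lightweight, robust downstream model for malignant tumor classification with an AUC of 93.91%.
Details
Motivation: Early-stage lung cancer has few distinct symptoms and is easily confused with other respiratory diseases, which makes early diagnosis difficult and mortality high. CNN- and ViT-based models perform well on lung cancer detection and tumor classification, but their reliance on correlations, their complexity, and their low interpretability make them hard to extend to treatment analysis or causal intervention simulation. The work therefore aims at a framework that extracts causal representations of lung cancer progression to support deeper causal analysis and early detection.
Result: On malignant tumor classification, LungCRCT reaches an AUC of 93.91%, indicating a high-performing and robust downstream model and showing the framework's effectiveness on lung cancer imaging tasks.
Insight: The novelty lies in bringing causal representation learning to lung CT processing: graph-autoencoder causal discovery, distance correlation disentanglement, and entropy-optimized reconstruction together extract interpretable causal factors from images. This opens a route to simulating causal interventions for lung cancer treatment, supports lightweight and robust downstream models, and improves the interpretability and range of application of AI in medical image analysis.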
Abstract: Because it is silent in its early stages, lung cancer has been one of the leading causes of mortality in cancer patients worldwide. Moreover, the major symptoms of lung cancer are hard to differentiate from those of other respiratory diseases such as COPD, further leading patients to overlook cancer progression in its early stages. Thus, to enhance survival rates in lung cancer, early detection through consistent, proactive respiratory system monitoring becomes crucial. One of the most prevalent and effective methods for lung cancer monitoring is the low-dose computed tomography (LDCT) chest scan, which has led to remarkable improvements in lung cancer detection and tumor classification tasks, driven by rapid advances in and applications of computer-vision-based AI models such as EfficientNet or ResNet in image processing. However, although advanced CNN models under transfer learning and ViT-based models have led to high-performing lung cancer detection, their intrinsic limitations, namely correlation dependence and low interpretability due to complexity, still limit the extension of deep learning models to lung cancer treatment analysis or causal intervention simulation. Therefore, this research introduces LungCRCT: a latent causal representation learning based lung cancer analysis framework that retrieves causal representations of factors within the physical causal mechanism of lung cancer progression. With the use of advanced graph autoencoder based causal discovery algorithms with distance correlation disentanglement and entropy-based image reconstruction refinement, LungCRCT not only enables causal intervention analysis for lung cancer treatments, but also leads to robust, yet extremely light downstream models in malignant tumor classification tasks with an AUC score of 93.91%.
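The abstract names distance correlation as the disentanglement signal but does not spell out how it enters the objective. The sketch below only computes the standard sample distance correlation between two 1-D variables, which is 0 for independent samples in the limit and 1 for exact affine dependence; treating it as a penalty between pairs of latent factors is an assumption about how such a term might be used, and the function names are chosen for this example.

```python
import math


def _centered_distances(values):
    """Pairwise |a - b| distances with row, column, and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length sequences, in [0, 1]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:  # at least one variable is constant
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)
```

Unlike Pearson correlation, this statistic also reacts to nonlinear dependence, which is the usual reason for choosing it when the goal is to push learned factors toward independence.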
[99] Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection cs.CVPDF
Jiahao Lyu, Minghua Zhao, Xuewen Huang, Yifei Chen, Shuangli Du
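TL;DR: This paper proposes FoGA, a lightweight video anomaly detection model that combines forward consistency learning with gated context aggregation. With only about 2M parameters it reaches state-of-the-art results on multiple benchmarks while running at up to 155 FPS, which suits resource-constrained edge devices.
Details
Motivation: Existing video anomaly detection methods usually rely on large models to maximize accuracy, which makes them hard to deploy on resource-limited edge devices. Mainstream prediction-based methods also use only the single-frame future prediction error and ignore the constraints offered by longer-term forward temporal information.
Result: On multiple video anomaly detection benchmarks, FoGA substantially outperforms existing state-of-the-art methods while running at up to 155 FPS, a strong balance between accuracy and efficiency.
Insight: The contributions are: a Unet-based extractor over consecutive frames that produces both immediate and forward predictions; a gated context aggregation module in the skip connections that dynamically fuses encoder and decoder features; a forward consistency loss for joint optimization; and a hybrid anomaly measurement strategy that integrates the errors of immediate and forward frames for more accurate detection.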
Abstract: As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a Unet-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods, running up to 155 FPS. Hence, our FoGA achieves an excellent trade-off between performance and the efficiency metric.
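The hybrid anomaly measurement is described only as integrating errors from immediate and forward frames. A minimal sketch of that idea, assuming mean squared error as the per-frame error, a fixed mixing `weight`, and min-max normalization over a clip (none of which are stated in the abstract), might look like this:

```python
def mean_squared_error(predicted, actual):
    """Mean squared error between two flattened frames of equal length."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def hybrid_anomaly_scores(immediate_errors, forward_errors, weight=0.5):
    """Combine per-frame errors and min-max normalize them to [0, 1].

    weight is the share given to the immediate prediction error
    (an assumed value, not taken from the paper).
    """
    combined = [
        weight * now + (1.0 - weight) * ahead
        for now, ahead in zip(immediate_errors, forward_errors)
    ]
    low, high = min(combined), max(combined)
    if high == low:  # flat error curve: no frame stands out
        return [0.0 for _ in combined]
    return [(c - low) / (high - low) for c in combined]
```

Frames whose normalized score approaches 1 are the ones where both the next-frame and the further-ahead prediction disagree most with what was observed.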
[100] Agentic Very Long Video Understanding cs.CV | cs.LGPDF
Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak
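TL;DR: This paper proposes EGAgent, a framework that builds entity scene graphs to improve understanding of very long videos, such as egocentric video captured all day by wearable devices. The system uses a planning agent and structured search tools to perform multi-hop reasoning over the entity scene graph, combined with hybrid visual and audio search, enabling cross-modal and temporally coherent complex reasoning.
Details
Motivation: Always-on personal AI assistants such as smart glasses need to understand continuous, longitudinal egocentric video streams, but existing methods, including large language models and retrieval-augmented generation, are limited by finite context windows and cannot perform compositional, multi-hop reasoning over very long video.
Result: The method reaches state-of-the-art performance of 57.5% on EgoLifeQA and a competitive 74.1% on Video-MME (Long), supporting its effectiveness on complex longitudinal video understanding.
Insight: The novelty lies in building the agent framework around entity scene graphs, which represent people, places, objects, and their relationships over time, and in giving the planning agent structured search and hybrid-modality search tools, which enables fine-grained, cross-modal, temporally coherent reasoning over very long video.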
Abstract: The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.
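To make the idea of multi-hop search over an entity scene graph concrete, here is a small sketch. The `Edge` schema, the relation names, and the two-hop helper are hypothetical and chosen for illustration; the abstract does not describe EGAgent's actual graph format or tool interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Edge:
    """One timestamped relation between two entities (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    start: float  # seconds since the start of the recording
    end: float


class EntitySceneGraph:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add(self, subject: str, relation: str, obj: str,
            start: float, end: float) -> None:
        self.edges.append(Edge(subject, relation, obj, start, end))

    def search(self, subject: Optional[str] = None,
               relation: Optional[str] = None,
               obj: Optional[str] = None,
               after: Optional[float] = None) -> List[Edge]:
        """Return edges matching every given field, sorted by start time."""
        hits = [
            e for e in self.edges
            if (subject is None or e.subject == subject)
            and (relation is None or e.relation == relation)
            and (obj is None or e.obj == obj)
            and (after is None or e.start >= after)
        ]
        return sorted(hits, key=lambda e: e.start)


def where_was_purchase_placed(graph: EntitySceneGraph,
                              person: str) -> List[Tuple[str, str]]:
    """Two-hop query: what did `person` buy, and where was it put afterwards?"""
    places = []
    for bought in graph.search(subject=person, relation="bought"):
        for placed in graph.search(subject=bought.obj, relation="placed_in",
                                   after=bought.end):
            places.append((bought.obj, placed.obj))
    return places


graph = EntitySceneGraph()
graph.add("wearer", "bought", "milk", 3600.0, 3660.0)
graph.add("milk", "placed_in", "fridge", 5400.0, 5410.0)
graph.add("keys", "placed_in", "drawer", 100.0, 105.0)
```

The second hop is constrained by the end time of the first, which is the kind of temporal chaining that a flat retrieval index over clips cannot express directly.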
[101] NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation cs.CV | cs.AIPDF
Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang
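TL;DR: This paper proposes NaVIDA, a unified framework for vision-language navigation that adds chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and the actions that produce them, addressing the weak generalization and error accumulation that existing methods suffer from because they lack vision-action causal modeling.
Details
Motivation: Most existing vision-language navigation methods rely on reactive state-action mappings and do not explicitly model how actions causally change subsequent visual observations, which leads to unstable agent behavior, weak generalization, and cumulative error along the trajectory.
Result: In extensive experiments, NaVIDA achieves better navigation performance than state-of-the-art methods with fewer parameters (3B vs. 8B), and real-world robot evaluations confirm its feasibility and effectiveness.
Insight: The novelty lies in chunk-based inverse-dynamics supervision for learning vision-action causality, hierarchical probabilistic action chunking that organizes trajectories to provide longer-range visual-change cues, and an entropy-guided mechanism that adaptively sets the execution horizon of action chunks to stabilize behavior at inference.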
Abstract: Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by their own actions, leading to unstable behaviors, weak generalization, and cumulative error along the trajectory. To address these issues, we introduce NaVIDA (Navigation with Inverse Dynamics Augmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. NaVIDA augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. To structure this supervision and extend the effective planning range, NaVIDA employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that NaVIDA achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach. Code and data will be available upon acceptance.
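The entropy-guided mechanism is described only at a high level. One plausible reading, sketched below under that assumption, is to execute the leading actions of a predicted chunk for as long as the policy's per-step predictive entropy stays below a threshold, and then replan. The threshold value and the rule of always executing at least one action are choices made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def execution_horizon(chunk_probs, max_entropy):
    """Count the leading actions of a chunk that are confident enough to run.

    chunk_probs: one action distribution per step of the predicted chunk
    max_entropy: assumed confidence threshold; steps above it trigger replanning
    """
    horizon = 0
    for probs in chunk_probs:
        if entropy(probs) > max_entropy:
            break
        horizon += 1
    return max(horizon, 1)  # always execute at least one action
```

With a low threshold the agent falls back to step-by-step control; with a high one it commits to whole chunks, so the threshold trades reactivity against the error that accumulates from executing uncertain steps open-loop.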
[102] Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval cs.CVPDF
Yifan Li, Shiying Wang, Jianqiang Huang
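TL;DR: This paper proposes MPS-CLIP, a parameter-efficient framework for remote sensing image-text retrieval that replaces conventional global matching with keyword-guided fine-grained alignment. It uses an LLM to extract keywords that guide SamGeo in generating semantically relevant sub-perspectives, and introduces a G^2A adapter and an MPR module to aggregate multi-perspective features efficiently, reaching state-of-the-art performance on the RSICD and RSITMD benchmarks.
Details
Motivation: Existing CLIP-based remote sensing retrieval methods rely mainly on coarse-grained global alignment and overlook the dense, multi-scale semantics of remote sensing imagery. Full fine-tuning is also computationally expensive and prone to catastrophic forgetting.
Result: On the RSICD and RSITMD benchmarks, MPS-CLIP achieves mean Recall (mR) of 35.18% and 48.40%, respectively, significantly outperforming full fine-tuning baselines and recent competing methods and reaching state-of-the-art level.
Insight: The contributions are: 1) a keyword-guided, multi-perspective fine-grained alignment paradigm; 2) the G^2A adapter, which captures global context at low overhead; 3) a hybrid loss that dynamically selects the maximum-response perspective to suppress noise. The method achieves precise semantic matching while remaining parameter-efficient.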
Abstract: Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G^2A) adapter, which captures global context and long-range dependencies with minimal overhead. Furthermore, a Multi-Perspective Representation (MPR) module aggregates these local cues into robust multi-perspective embeddings. The framework is optimized via a hybrid objective combining multi-perspective contrastive and weighted triplet losses, which dynamically selects maximum-response perspectives to suppress noise and enforce precise semantic matching. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall (mR), respectively, significantly outperforming full fine-tuning baselines and recent competitive methods. Code is available at https://github.com/Lcrucial1f/MPS-CLIP.
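The phrase "dynamically selects maximum-response perspectives" can be illustrated with a small similarity computation. The sketch assumes cosine similarity between one text embedding and several sub-perspective image embeddings and simply keeps the best match; how MPS-CLIP feeds that selection into its contrastive and triplet losses is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def max_response_perspective(text_embedding, perspective_embeddings):
    """Return (index, similarity) of the sub-perspective that best matches the text."""
    sims = [cosine(text_embedding, p) for p in perspective_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]
```

Scoring an image by its best-matching crop rather than by an average means that irrelevant sub-perspectives (background, unrelated objects) do not dilute the match, which is one way to read the noise-suppression claim.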
[103] MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models cs.CV | cs.HC | cs.MMPDF
Tian-Yi Zhou, Xuan-Hao Liu, Bao-Liang Lu, Wei-Long Zheng
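TL;DR: This paper proposes MindCine, a framework that reconstructs video of human dynamic visual perception from electroencephalography (EEG) signals. It uses a multimodal joint learning strategy to incorporate modalities beyond text, leverages a pre-trained large EEG model to ease data scarcity, and designs a Seq2Seq model with causal attention to decode perceptual information. Experiments show that it outperforms existing methods both qualitatively and quantitatively.
Details
Motivation: The work targets two main challenges in EEG-to-video reconstruction: existing methods align EEG only with the text modality, ignoring other modalities and overfitting easily; and EEG-video data is scarce, which makes training difficult.
Result: On EEG-to-video reconstruction, MindCine surpasses existing state-of-the-art methods in both qualitative and quantitative evaluation, demonstrating its effectiveness.
Insight: The contributions are: a multimodal joint learning strategy that incorporates modalities beyond text to strengthen representations and reduce overfitting; a pre-trained large EEG model that decodes semantic information and eases data scarcity; and a dedicated Seq2Seq model with causal attention that decodes perceptual information. Viewed objectively, combining large-scale pre-trained models with multimodal learning is an effective route to complex reconstruction tasks under limited data.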
Abstract: Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance given EEG’s non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging due to: 1) Single Modality: existing studies solely align EEG signals with the text modality, which ignores other modalities and is prone to overfitting; 2) Data Scarcity: current methods often have difficulty converging when trained on limited EEG-video data. To solve the above problems, we propose a novel framework, MindCine, to achieve high-fidelity video reconstructions on limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities in the training stage and leverage a pre-trained large EEG model to relieve the data scarcity issue for decoding semantic information, while a Seq2Seq model with causal attention is specifically designed for decoding perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. Additionally, the results underscore the effectiveness of the complementary strengths of different modalities and demonstrate that leveraging a large-scale EEG model can further enhance reconstruction performance by alleviating the challenges associated with limited data.
[104] QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding cs.CVPDF
Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Kaiwei Zhang
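TL;DR: This paper proposes QualiRAG, a training-free retrieval-augmented generation framework for visual quality understanding. It decomposes a question into structured requests, dynamically generates four complementary knowledge sources (visual metadata, subject localization, global quality summaries, and local quality descriptions), and performs relevance-aware retrieval to support evidence-grounded reasoning, thereby exploiting the latent perceptual knowledge of large multimodal models for visual quality assessment.
Details
Motivation: Visual quality assessment is shifting from scalar score prediction to interpretable quality understanding, which requires fine-grained spatiotemporal perception and auxiliary contextual information. Existing methods rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which are costly to annotate and prone to dataset-specific biases.
Result: Extensive experiments show that QualiRAG substantially outperforms open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks and delivers competitive performance on visual quality comparison tasks, demonstrating strong quality assessment capability without any task-specific training.
Insight: The main novelty is a training-free RAG framework that strengthens visual quality reasoning by dynamically generating structured, multi-source knowledge rather than retrieving from a static corpus, which makes effective use of the latent knowledge of LMMs and avoids the annotation cost and bias of conventional methods.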
Abstract: Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding – a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructing four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions, followed by relevance-aware retrieval for evidence-grounded reasoning. Extensive experiments show that QualiRAG achieves substantial improvements over open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks, and delivers competitive performance on visual quality comparison tasks, demonstrating robust quality assessment capabilities without any task-specific training. The code will be publicly available at https://github.com/clh124/QualiRAG.
[105] HomoFM: Deep Homography Estimation with Flow Matching cs.CVPDF
Mengfan He, Liangzheng Sun, Chunyu Li, Ziyang Meng
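TL;DR: This paper proposes HomoFM, a new framework that introduces flow matching into homography estimation for the first time. It recasts homography estimation as a velocity field learning problem: a continuous, point-wise velocity field transforms a noise distribution into registered coordinates, from which a high-precision transformation is recovered. An integrated gradient reversal layer adds domain adaptation, improving robustness in scenarios such as multimodal matching and varying illumination.
Details
Motivation: Existing deep homography estimation methods usually treat the task as direct regression or iterative refinement and struggle to capture complex geometric transformations or to generalize across domains. The paper addresses these problems by bringing in flow matching from generative modeling to improve accuracy and robustness.
Result: Extensive experiments show that HomoFM outperforms existing state-of-the-art methods in both estimation accuracy and robustness on standard benchmarks.
Insight: The contributions are the first application of flow matching to homography estimation, its reformulation as a velocity field learning problem, and a gradient reversal layer for domain adaptation that learns domain-invariant representations and improves cross-domain performance.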
Abstract: Deep homography estimation has broad applications in computer vision and robotics. Remarkable progress has been achieved, yet existing methods typically treat it as a direct regression or iterative refinement problem and often struggle to capture complex geometric transformations or generalize across different domains. In this work, we propose HomoFM, a new framework that introduces the flow matching technique from generative modeling into the homography estimation task for the first time. Unlike existing methods, we formulate the homography estimation problem as a velocity field learning problem. By modeling a continuous and point-wise velocity field that transforms noisy distributions into registered coordinates, the proposed network recovers high-precision transformations through a conditional flow trajectory. Furthermore, to address the challenge of domain shift, e.g., in multimodal matching or varying illumination scenarios, we integrate a gradient reversal layer (GRL) into the feature extraction backbone. This domain adaptation strategy explicitly constrains the encoder to learn domain-invariant representations, significantly enhancing the network’s robustness. Extensive experiments demonstrate the effectiveness of the proposed method, showing that HomoFM outperforms state-of-the-art methods in both estimation accuracy and robustness on standard benchmarks. Code and data resources are available at https://github.com/hmf21/HomoFM.
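For readers unfamiliar with flow matching, the sketch below shows the generic linear-path formulation: a training pair is built by interpolating between a noise sample `x0` and a target `x1`, the regression target is the constant velocity `x1 - x0`, and sampling integrates a velocity field with Euler steps. This is the textbook recipe rather than HomoFM's network or conditioning; in the example the "targets" stand in for registered point coordinates, and the velocity function passed to the integrator is whatever model has been trained.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (target) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, steps=8):
    """Transport x0 from t=0 to t=1 by following velocity_fn(x, t) with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

A trained network would be fitted so that its output at `interpolate(x0, x1, t)` matches `target_velocity(x0, x1)`; plugging an oracle that returns exactly that velocity into `euler_integrate` recovers `x1`, which is a convenient sanity check for the sampler.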
[106] V-Loop: Visual Logical Loop Verification for Hallucination Detection in Medical Visual Question Answering cs.CVPDF
Mengyuan Jin, Zehui Liao, Yong Xia
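TL;DR: This paper proposes V-Loop, a training-free, plug-and-play framework for detecting hallucinations (responses that contradict visual facts) produced by multimodal large language models (MLLMs) in medical visual question answering (VQA). The method builds a visually grounded logical loop for bidirectional reasoning to verify the factual correctness of an answer.
Details
Motivation: Existing uncertainty-based introspective detection methods are computationally efficient but fundamentally indirect, because they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. This poses significant risk in high-stakes medical settings.
Result: Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods on hallucination detection, remains efficient, and further improves performance when combined with uncertainty-based methods.
Insight: The core novelty is a visual logical loop verification mechanism: it extracts semantic units from the original question-answer pair, generates a verification question, and enforces visual attention consistency, forming a closed loop that can verify factual grounding. This is a new paradigm that verifies the factuality of an answer directly instead of estimating uncertainty indirectly.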
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable capability in assisting disease diagnosis in medical visual question answering (VQA). However, their outputs remain vulnerable to hallucinations (i.e., responses that contradict visual facts), posing significant risks in high-stakes medical scenarios. Recent introspective detection methods, particularly uncertainty-based approaches, offer computational efficiency but are fundamentally indirect, as they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. To address this limitation, we propose Visual Logical Loop Verification (V-Loop), a training-free and plug-and-play framework for hallucination detection in medical VQA. V-Loop introduces a bidirectional reasoning process that forms a visually grounded logical loop to verify factual correctness. Given an input, the MLLM produces an answer for the primary input pair. V-Loop extracts semantic units from the primary QA pair, generates a verification question by conditioning on the answer unit to re-query the question unit, and enforces visual attention consistency to ensure that answering both the primary question and the verification question relies on the same image evidence. If the verification answer matches the expected semantic content, the logical loop closes, indicating factual grounding; otherwise, the primary answer is flagged as hallucinated. Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, remains highly efficient, and further boosts uncertainty-based approaches when used in combination.
[107] Vision-Language-Model-Guided Differentiable Ray Tracing for Fast and Accurate Multi-Material RF Parameter Estimation cs.CV | cs.NIPDF
Zerui Kang, Yishen Lim, Zhouyou Gu, Seung-Woo Ko, Tony Q. S. Quek
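TL;DR: This paper proposes a differentiable ray tracing (DRT) framework guided by a vision-language model (VLM) for fast and accurate estimation of multi-material radio-frequency parameters in wireless communication scenes. The VLM parses scene images, infers material categories and maps them to prior conductivity values, and selects transmitter/receiver placements that yield informative measurement paths, which accelerates and stabilizes gradient-based parameter optimization.
Details
Motivation: Accurate RF material parameters are essential for electromagnetic digital twins in 6G systems, but conventional gradient-based inverse ray tracing is sensitive to initialization and computationally costly under limited measurements.
Result: Experiments on indoor scenes in NVIDIA Sionna show that, compared with uniform or random initialization and random placement baselines, the method converges 2-4× faster and reduces final parameter error by 10-100×, reaching a mean relative error below 0.1% with only a few receivers. Complexity analysis indicates that per-iteration time scales near-linearly with the number of materials and measurement setups.
Insight: The core novelty is combining the semantic priors supplied by a VLM (material categories mapped to quantitative parameters) and VLM-guided measurement placement with a physics-based differentiable ray tracing optimizer. This semantically guided physical optimization addresses the sensitivity of conventional methods to initialization and measurement configuration and markedly improves estimation speed and accuracy.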
Abstract: Accurate radio-frequency (RF) material parameters are essential for electromagnetic digital twins in 6G systems, yet gradient-based inverse ray tracing (RT) remains sensitive to initialization and costly under limited measurements. This paper proposes a vision-language-model (VLM) guided framework that accelerates and stabilizes multi-material parameter estimation in a differentiable RT (DRT) engine. A VLM parses scene images to infer material categories and maps them to quantitative priors via an ITU-R material table, yielding informed conductivity initializations. The VLM further selects informative transmitter/receiver placements that promote diverse, material-discriminative paths. Starting from these priors, the DRT performs gradient-based refinement using measured received signal strengths. Experiments in NVIDIA Sionna on indoor scenes show 2-4× faster convergence and 10-100× lower final parameter error compared with uniform or random initialization and random placement baselines, achieving sub-0.1% mean relative error with only a few receivers. Complexity analyses indicate per-iteration time scales near-linearly with the number of materials and measurement setups, while VLM-guided placement reduces the measurements required for accurate recovery. Ablations over RT depth and ray counts confirm further accuracy gains without significant per-iteration overhead. Results demonstrate that semantic priors from VLMs effectively guide physics-based optimization for fast and reliable RF material estimation.
[108] A multimodal vision foundation model for generalizable knee pathology cs.CV | cs.AIPDF
Kang Yu, Dingyu Wang, Zimu Yuan, Nan Zhou, Jiajun Liu
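TL;DR: This paper proposes OrthoFoundation, a multimodal vision foundation model optimized for musculoskeletal pathology. Built on a Dinov3 backbone, it is pre-trained with self-supervised contrastive learning on 1.2 million unlabeled knee X-ray and MRI images to learn robust radiological representations. It achieves state-of-the-art performance on 14 downstream tasks, including X-ray osteoarthritis diagnosis and MRI structural injury detection, and shows strong label efficiency and cross-anatomy generalization to the hip, shoulder, and ankle.
Details
Motivation: Musculoskeletal disorders are a leading cause of disability worldwide, yet current orthopedic AI relies on task-specific supervised learning, which is fragmented, needs large annotated datasets, and generalizes poorly. The field also lacks large-scale open-source datasets, which has held back foundation models.
Result: The model reaches state-of-the-art level on 14 downstream tasks, attains superior accuracy in X-ray osteoarthritis diagnosis, and ranks first in MRI structural injury detection. It matches supervised baselines using only 50% of the labeled data and generalizes from the knee to the hip, shoulder, and ankle.
Insight: The novelty lies in constructing a large-scale multimodal musculoskeletal pre-training dataset and using a self-supervised contrastive learning framework to learn joint-agnostic radiological semantics, which overcomes the limitations of conventional models. This provides a general AI framework for reducing annotation burden and improving clinical diagnostic accuracy.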
Abstract: Musculoskeletal disorders represent a leading cause of global disability, creating an urgent demand for precise interpretation of medical imaging. Current artificial intelligence (AI) approaches in orthopedics predominantly rely on task-specific, supervised learning paradigms. These methods are inherently fragmented, require extensive annotated datasets, and often lack generalizability across different modalities and clinical scenarios. The development of foundation models in this field has been constrained by the scarcity of large-scale, curated, and open-source musculoskeletal datasets. To address these challenges, we introduce OrthoFoundation, a multimodal vision foundation model optimized for musculoskeletal pathology. We constructed a pre-training dataset of 1.2 million unlabeled knee X-ray and MRI images from internal and public databases. Utilizing a Dinov3 backbone, the model was trained via self-supervised contrastive learning to capture robust radiological representations. OrthoFoundation achieves state-of-the-art (SOTA) performance across 14 downstream tasks. It attained superior accuracy in X-ray osteoarthritis diagnosis and ranked first in MRI structural injury detection. The model demonstrated remarkable label efficiency, matching supervised baselines using only 50% of labeled data. Furthermore, despite being pre-trained on knee images, OrthoFoundation exhibited exceptional cross-anatomy generalization to the hip, shoulder, and ankle. OrthoFoundation represents a significant advancement toward general-purpose AI for musculoskeletal imaging. By learning fundamental, joint-agnostic radiological semantics from large-scale multimodal data, it overcomes the limitations of conventional models, which provides a robust framework for reducing annotation burdens and enhancing diagnostic accuracy in clinical practice.
[109] SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis cs.CVPDF
Xuan Wang, Siyuan Su, Quantong Fu, Yongxiang Hu, Yangfan Zhou
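TL;DR: This paper proposes SwipeGen, which decomposes human swipe interactions into multiple quantifiable dimensions and builds an automated pipeline that synthesizes human-like swipe interaction data to improve the execution capability of GUI agents. On top of it, the authors build the first benchmark for swipe execution capability and develop the GUISwiper agent, which significantly outperforms existing baselines in experiments.
Details
Motivation: Existing GUI agents handle swipe interactions with overly simplified strategies that cannot accurately replicate human behavior, so execution capability has become a new bottleneck for task completion.
Result: GUISwiper reaches 69.07% accuracy on the proposed swipe execution benchmark, a 214% improvement over existing VLM baselines.
Insight: The novelty lies in decomposing human swipe gestures into multi-dimensional quantitative features and synthesizing training data automatically through GUI exploration. The approach could extend to simulating other interaction actions and improve how human-like an agent's execution is.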
Abstract: With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately replicating human-like behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose an automated pipeline, SwipeGen, to synthesize human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, representing a 214% improvement over existing VLM baselines.
[110] A Tumor Aware DenseNet Swin Hybrid Learning with Boosted and Hierarchical Feature Spaces for Large-Scale Brain MRI Classification cs.CV | cs.LGPDF
Muhammad Ali Shah, Muhammad Mansoor Alam, Saddam Hussain Khan
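TL;DR: This study proposes an efficient Densely Swin Hybrid (EDSH) framework for brain tumor MRI analysis that jointly captures fine-grained texture patterns and long-range contextual dependencies. It introduces two tumor-aware experimental setups. The first uses a Boosted Feature Space (BFS), in which customized DenseNet and Swin branches learn complementary local and global representations that are dimension-aligned, fused, and boosted to detect diffuse glioma patterns with high sensitivity. The second uses a hierarchical DenseNet-Swin architecture with Deep Feature Extraction and Dual Residual connections (DFE and DR), in which DenseNet serves as a stem CNN that learns structured local features while the Swin_t model captures global tumor morphology, effectively suppressing false negatives in meningioma and pituitary tumor classification.
Details
Motivation: The work addresses the challenge of capturing local texture detail and global contextual dependencies at the same time in brain tumor MRI classification, and designs tailored approaches for the diagnostic difficulties specific to different tumor types (diffuse glioma, meningioma, pituitary tumor).
Result: Evaluated on a large-scale MRI dataset of 40,260 images, the method reaches 98.50% accuracy and recall on the unseen test set, outperforming standalone CNNs, Vision Transformers, and other hybrid models and achieving state-of-the-art performance.
Insight: The contributions are: 1) a tumor-aware dual-branch hybrid framework in which BFS makes local and global features reinforce each other; 2) the hierarchical DFE-DR architecture, which combines DenseNet's dense residual connectivity with the Swin Transformer's shifted-window self-attention to tune feature learning to different tumor morphologies; 3) input-level customization of DenseNet to match MRI spatial characteristics, plus task-aligned patch embedding to make Swin more efficient. Viewed objectively, the modular design integrates the local perception of CNNs with the global modeling of Transformers and is tuned to the domain characteristics of medical imaging.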
Abstract: This study proposes an efficient Densely Swin Hybrid (EDSH) framework for brain tumor MRI analysis, designed to jointly capture fine-grained texture patterns and long-range contextual dependencies. Two tumor-aware experimental setups are introduced to address class-specific diagnostic challenges. The first setup employs a Boosted Feature Space (BFS), where independently customized DenseNet and Swin_t branches learn complementary local and global representations that are dimension-aligned, fused, and boosted, enabling highly sensitive detection of diffuse glioma patterns by successfully learning the features of irregular shape, poorly defined mass, and heterogeneous texture. The second setup adopts a hierarchical DenseNet-Swin_t architecture with Deep Feature Extraction and Dual Residual connections (DFE and DR), in which DenseNet serves as a stem CNN for structured local feature learning, while Swin_t models global tumor morphology, effectively suppressing false negatives in meningioma and pituitary tumor classification by learning the features of well-defined mass, location (outside the brain), and enlargements in tumors (dural tail or upward extension). DenseNet is customized at the input level to match MRI spatial characteristics, leveraging dense residual connectivity to preserve texture information and mitigate vanishing-gradient effects. In parallel, Swin_t is tailored through task-aligned patch embedding and shifted-window self-attention to efficiently capture hierarchical global dependencies. Extensive, stringent evaluation on a large-scale MRI dataset (40,260 images across four tumor classes) demonstrates consistent superiority over standalone CNNs, Vision Transformers, and hybrids, achieving 98.50% accuracy and recall on the unseen test set.
[111] Beyond Rigid: Benchmarking Non-Rigid Video Editing cs.CVPDF
Bingzheng Qu, Kehai Chen, Xuefeng Bai, Jun Yu, Min Zhang
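TL;DR: This paper proposes NRVBench, the first comprehensive benchmark dedicated to evaluating non-rigid video editing. It comprises a high-quality dataset, NRVE-Acc, a new evaluation metric based on vision-language models, and VM-Edit, a training-free baseline with dual-region denoising, and targets the physical distortion and temporal flicker that existing methods show when generating coherent non-rigid deformations.
Details
Motivation: Text-driven video editing still suffers from physical distortion and temporal inconsistency when generating non-rigid deformations, and there is no dedicated evaluation benchmark, so a standardized testing platform is needed to advance physics-aware video editing.
Result: On the proposed NRVBench benchmark, existing methods fall short in maintaining physical plausibility, while VM-Edit performs well on both standard metrics and the proposed metric.
Insight: The contributions are the first benchmark dedicated to non-rigid video editing, a physical-compliance evaluation metric based on vision-language models, and a training-free dual-region denoising mechanism that provides structure-aware control and balances structural preservation with dynamic deformation.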
Abstract: Despite the remarkable progress in text-driven video editing, generating coherent non-rigid deformations remains a critical challenge, often plagued by physical distortion and temporal flicker. To bridge this gap, we propose NRVBench, the first dedicated and comprehensive benchmark designed to evaluate non-rigid video editing. First, we curate a high-quality dataset consisting of 180 non-rigid motion videos from six physics-based categories, equipped with 2,340 fine-grained task instructions and 360 multiple-choice questions. Second, we propose NRVE-Acc, a novel evaluation metric based on Vision-Language Models that can rigorously assess physical compliance, temporal consistency, and instruction alignment, overcoming the limitations of general metrics in capturing complex dynamics. Third, we introduce a training-free baseline, VM-Edit, which utilizes a dual-region denoising mechanism to achieve structure-aware control, balancing structural preservation and dynamic deformation. Extensive experiments demonstrate that while current methods have shortcomings in maintaining physical plausibility, our method achieves excellent performance across both standard and proposed metrics. We believe the benchmark could serve as a standard testing platform for advancing physics-aware video editing.
[112] Q-Bench-Portrait: Benchmarking Multimodal Large Language Models on Portrait Image Quality Perception cs.CVPDF
Sijing Wu, Yunhao Li, Zicheng Zhang, Qi Jia, Xinyue Li
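TL;DR: This paper proposes Q-Bench-Portrait, the first benchmark for multimodal large language models (MLLMs) dedicated to portrait image quality perception. It contains 2,765 image-question-answer triplets covering diverse portrait sources, quality dimensions, and question formats. An evaluation of 25 open-source and closed-source MLLMs on the benchmark finds that current models have limited and imprecise portrait perception, with a clear gap relative to human judgments.
Details
Motivation: Existing low-level vision benchmarks focus on generic images, and the ability of MLLMs to perceive and assess portrait images, which have distinct structural and perceptual properties, remains underexplored. A dedicated benchmark is needed to evaluate and advance research in this area.
Result: An evaluation of 20 open-source and 5 closed-source MLLMs on Q-Bench-Portrait shows that current models perform in a limited and imprecise way on portrait image perception, with a clear gap relative to human judgments.
Insight: The novelty lies in building the first comprehensive benchmark for portrait image quality perception. It features diverse portrait sources (natural, synthetic distortion, AI-generated, artistic, and computer graphics images), comprehensive quality dimensions (technical distortions, AIGC-specific distortions, and aesthetics), and multiple question formats (single-choice, multiple-choice, true/false, and open-ended), providing a new tool for evaluating and improving the domain-specific perception of MLLMs.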
Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated impressive performance on existing low-level vision benchmarks, which primarily focus on generic images. However, their capabilities to perceive and assess portrait images, a domain characterized by distinct structural and perceptual properties, remain largely underexplored. To this end, we introduce Q-Bench-Portrait, the first holistic benchmark specifically designed for portrait image quality perception, comprising 2,765 image-question-answer triplets and featuring (1) diverse portrait image sources, including natural, synthetic distortion, AI-generated, artistic, and computer graphics images; (2) comprehensive quality dimensions, covering technical distortions, AIGC-specific distortions, and aesthetics; and (3) a range of question formats, including single-choice, multiple-choice, true/false, and open-ended questions, at both global and local levels. Based on Q-Bench-Portrait, we evaluate 20 open-source and 5 closed-source MLLMs, revealing that although current models demonstrate some competence in portrait image perception, their performance remains limited and imprecise, with a clear gap relative to human judgments. We hope that the proposed benchmark will foster further research into enhancing the portrait image perception capabilities of both general-purpose and domain-specific MLLMs.
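[113] OREHAS: A fully automated deep-learning pipeline for volumetric endolymphatic hydrops quantification in MRI cs.CV
Caterina Fuster-Barceló, Claudia Castrillón, Laura Rodrigo-Muñoz, Victor Manuel Vega-Suárez, Nicolás Pérez-Fernández
TL;DR: OREHAS is the first fully automatic pipeline for volumetric quantification of endolymphatic hydrops (EH) from routine 3D-SPACE-MRC and 3D-REAL-IR MRI. It integrates slice classification, inner ear localization, and sequence-specific segmentation into a single workflow that computes the per-ear endolymphatic-to-vestibular volume ratio (ELR) directly from whole MRI volumes, with no manual intervention.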
Details
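Motivation: Manual volumetric quantification of endolymphatic hydrops (EH) in MRI is time-consuming, operator-dependent, and inconsistent. The goal is a fully automatic, reproducible quantification pipeline that reduces human dependence and ensures methodological consistency.
Result: In an external validation cohort, OREHAS closely matched expert annotations (VSI = 74.3%) and substantially outperformed the clinical software syngo.via (VSI = 42.5%). For segmentation, it reached Dice scores of 0.90 on SPACE-MRC and 0.75 on REAL-IR sequences.
Insight: The contribution is the integration of several deep-learning components (classification, localization, segmentation) into one end-to-end clinical workflow. With only 3 to 6 annotated slices per patient, it generalizes to full 3D volumes, yielding reliable and reproducible quantification from standard MRI under limited supervision and providing a basis for large-scale studies and for recalibrating clinical diagnostic thresholds.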
Abstract: We present OREHAS (Optimized Recognition & Evaluation of volumetric Hydrops in the Auditory System), the first fully automatic pipeline for volumetric quantification of endolymphatic hydrops (EH) from routine 3D-SPACE-MRC and 3D-REAL-IR MRI. The system integrates three components – slice classification, inner ear localization, and sequence-specific segmentation – into a single workflow that computes per-ear endolymphatic-to-vestibular volume ratios (ELR) directly from whole MRI volumes, eliminating the need for manual intervention. Trained with only 3 to 6 annotated slices per patient, OREHAS generalized effectively to full 3D volumes, achieving Dice scores of 0.90 for SPACE-MRC and 0.75 for REAL-IR. In an external validation cohort with complete manual annotations, OREHAS closely matched expert ground truth (VSI = 74.3%) and substantially outperformed the clinical syngo.via software (VSI = 42.5%), which tended to overestimate endolymphatic volumes. Across 19 test patients, vestibular measurements from OREHAS were consistent with syngo.via, while endolymphatic volumes were systematically smaller and more physiologically realistic. These results show that reliable and reproducible EH quantification can be achieved from standard MRI using limited supervision. By combining efficient deep-learning-based segmentation with a clinically aligned volumetric workflow, OREHAS reduces operator dependence, ensures methodological consistency, and remains compatible with established imaging protocols. The approach provides a robust foundation for large-scale studies and for recalibrating clinical diagnostic thresholds based on accurate volumetric measurements of the inner ear.
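The two quantities reported for OREHAS, the Dice score and the endolymphatic-to-vestibular volume ratio (ELR), can be illustrated with a minimal sketch. It assumes flattened binary voxel masks and a hypothetical `voxel_volume_mm3` parameter; the function names are illustrative and do not come from the OREHAS code, whose implementation is not described in the abstract.

```python
def dice_score(pred, truth):
    """Dice overlap of two equal-length binary masks (flattened voxels)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def volume_mm3(mask, voxel_volume_mm3=1.0):
    """Volume of a binary mask: foreground voxel count times voxel volume."""
    return sum(1 for v in mask if v) * voxel_volume_mm3


def endolymph_to_vestibule_ratio(endolymph_mask, vestibule_mask,
                                 voxel_volume_mm3=1.0):
    """ELR for one ear: endolymphatic volume divided by vestibular volume."""
    vestibule = volume_mm3(vestibule_mask, voxel_volume_mm3)
    if vestibule == 0:
        raise ValueError("vestibular volume is zero; ratio is undefined")
    return volume_mm3(endolymph_mask, voxel_volume_mm3) / vestibule
```

Because both volumes are scaled by the same voxel size, the ratio itself does not depend on `voxel_volume_mm3`; the parameter only matters when absolute volumes are reported.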
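[114] Gaze Prediction in Virtual Reality Without Eye Tracking Using Visual and Head Motion Cues cs.CV
Christos Petrou, Harris Partaourides, Athanasios Balomenos, Yannis Kopsinis, Sotirios Chatzis
TL;DR: This paper proposes a VR gaze prediction framework that needs no eye tracking. It predicts the user's gaze direction by combining head-mounted display (HMD) motion signals with visual saliency cues from video frames, and validates the approach on the EHTask dataset and on commercial VR hardware.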
Details
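Motivation: When direct eye tracking is unavailable in VR because of hardware limitations or privacy concerns, user gaze must still be predicted accurately to reduce latency and to support computationally demanding techniques such as foveated rendering.
Result: Experiments on the EHTask dataset and on commercial VR hardware show that the method consistently outperforms baselines such as Center-of-HMD and Mean Gaze at forecasting future gaze directions, reducing perceptual lag.
Insight: The contribution lies in fusing visual features extracted by UniSal, a lightweight saliency encoder, with HMD motion data, and modeling them with a time-series prediction module such as TSMixer or LSTM. This gives a workable prediction solution for VR interaction in constrained settings.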
Abstract: Gaze prediction plays a critical role in Virtual Reality (VR) applications by reducing sensor-induced latency and enabling computationally demanding techniques such as foveated rendering, which rely on anticipating user attention. However, direct eye tracking is often unavailable due to hardware limitations or privacy concerns. To address this, we present a novel gaze prediction framework that combines Head-Mounted Display (HMD) motion signals with visual saliency cues derived from video frames. Our method employs UniSal, a lightweight saliency encoder, to extract visual features, which are then fused with HMD motion data and processed through a time-series prediction module. We evaluate two lightweight architectures, TSMixer and LSTM, for forecasting future gaze directions. Experiments on the EHTask dataset, along with deployment on commercial VR hardware, show that our approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze. These results demonstrate the effectiveness of predictive gaze modeling in reducing perceptual lag and enhancing natural interaction in VR environments where direct eye tracking is constrained.
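[115] ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks cs.CV
Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho, Pai Chet Ng, Xiaoxiao Miao
TL;DR: This paper proposes ARMOR, a framework in which agents guided by a vision language model (VLM) collaboratively orchestrate and reparameterize three classic adversarial attacks (CW, JSMA, STA), while large language models (LLMs) adjust the attack strategy in real time, making adversarial attacks more semantically aware and adaptive.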
Details
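Motivation: Existing automated attack suites act as static ensembles and lack strategic adaptation and semantic awareness. ARMOR aims to address these limitations.
Result: On standard benchmarks, ARMOR achieves improved cross-architecture transfer and reliably fools targets in both black-box and white-box settings, using a confidence-and-SSIM (structural similarity) score to select the best attack or a blended attack.
Insight: The contribution is the use of VLMs and LLMs to achieve agentic coordination and real-time reparameterization of adversarial attacks, strengthening semantic vulnerability exploitation and attack adaptivity.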
Abstract: Existing automated attack suites operate as static ensembles with fixed sequences, lacking strategic adaptation and semantic awareness. This paper introduces the Agentic Reasoning for Methods Orchestration and Reparameterization (ARMOR) framework to address these limitations. ARMOR orchestrates three canonical adversarial primitives, Carlini-Wagner (CW), Jacobian-based Saliency Map Attack (JSMA), and Spatially Transformed Attacks (STA), via Vision Language Model (VLM)-guided agents that collaboratively generate and synthesize perturbations through a shared "Mixing Desk". Large Language Models (LLMs) adaptively tune and reparameterize parallel attack agents in a real-time, closed-loop system that exploits image-specific semantic vulnerabilities. On standard benchmarks, ARMOR achieves improved cross-architecture transfer and reliably fools targets in both the blind and white-box settings, delivering a blended output for blind targets and selecting the best attack or blended attacks for white-box targets using a confidence-and-SSIM score.
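[116] Efficient Complex-Valued Vision Transformers for MRI Classification Directly from k-Space cs.CV
Moritz Rempe, Lukas T. Rotkopf, Marco Schlimbach, Helmut Becker, Fabian Hörst
TL;DR: This paper proposes a novel complex-valued Vision Transformer (kViT) that classifies directly on raw frequency-domain (k-space) MRI data, avoiding the conventional step of discarding phase information and running a computationally expensive image reconstruction. A radial k-space patching strategy that respects the energy distribution of the frequency domain addresses the mismatch between existing neural network architectures and the global, non-local nature of k-space data. Experiments on fastMRI and in-house datasets show that kViT matches state-of-the-art image-domain baselines (such as ResNet, EfficientNet, and ViT) in classification performance, is more robust at high acceleration factors, and greatly reduces VRAM consumption during training.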
Details
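Motivation: Deep learning in MRI currently relies mainly on reconstructed magnitude images, a process that discards phase information and is computationally expensive. Standard neural network architectures (convolutions or grid patches) are also poorly suited to the inherently global, non-local nature of k-space data.
Result: On fastMRI and in-house datasets, kViT achieves classification performance on par with state-of-the-art image-domain baselines (ResNet, EfficientNet, ViT). More importantly, it is more robust to high acceleration factors and reduces VRAM consumption during training by up to 68× compared with standard methods.
Insight: The main contributions are: (1) a complex-valued Vision Transformer (kViT) that operates directly on complex k-space data, a paradigm shift toward AI analysis directly from the scanner; and (2) a radial patching strategy that respects the spectral energy distribution of k-space, bridging the geometric gap between existing architectures and MRI physics. From an objective standpoint, the work treats computational and data efficiency as core metrics and offers a new direction for medical image analysis in resource-constrained settings.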
Abstract: Deep learning applications in Magnetic Resonance Imaging (MRI) predominantly operate on reconstructed magnitude images, a process that discards phase information and requires computationally expensive transforms. Standard neural network architectures rely on local operations (convolutions or grid-patches) that are ill-suited for the global, non-local nature of raw frequency-domain (k-Space) data. In this work, we propose a novel complex-valued Vision Transformer (kViT) designed to perform classification directly on k-Space data. To bridge the geometric disconnect between current architectures and MRI physics, we introduce a radial k-Space patching strategy that respects the spectral energy distribution of the frequency domain. Extensive experiments on the fastMRI and in-house datasets demonstrate that our approach achieves classification performance competitive with state-of-the-art image-domain baselines (ResNet, EfficientNet, ViT). Crucially, kViT exhibits superior robustness to high acceleration factors and offers a paradigm shift in computational efficiency, reducing VRAM consumption during training by up to 68× compared to standard methods. This establishes a pathway for resource-efficient, direct-from-scanner AI analysis.
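The abstract does not spell out how the radial k-space patching is done, so the following is only a toy sketch of one possible reading: complex samples are grouped into concentric rings by their distance from the k-space center, rather than into a square grid. It assumes a centered (zero frequency in the middle) 2D array of complex values; `radial_ring_indices`, `group_by_ring`, and `n_rings` are hypothetical names, and equal-width rings are an assumption, not the paper's design.

```python
import math


def radial_ring_indices(height, width, n_rings):
    """Assign each (row, col) of a centered k-space grid to a ring index.

    Ring 0 holds the low frequencies around the center; higher indices
    hold progressively higher spatial frequencies.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r_max = math.hypot(cy, cx)
    rings = []
    for y in range(height):
        row = []
        for x in range(width):
            r = math.hypot(y - cy, x - cx)
            idx = 0 if r_max == 0 else min(int(r / r_max * n_rings), n_rings - 1)
            row.append(idx)
        rings.append(row)
    return rings


def group_by_ring(kspace, n_rings):
    """Collect complex k-space samples into one token list per ring."""
    height, width = len(kspace), len(kspace[0])
    rings = radial_ring_indices(height, width, n_rings)
    patches = [[] for _ in range(n_rings)]
    for y in range(height):
        for x in range(width):
            patches[rings[y][x]].append(kspace[y][x])
    return patches
```

Grouping by radius keeps the high-energy low-frequency samples together in the innermost ring, which is the intuition behind a patching scheme that respects the spectral energy distribution.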
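[117] 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control cs.CV | cs.AI | cs.LG | cs.MM | cs.SD
Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Naoya Chiba, Yuki Uranishi
TL;DR: This paper proposes 3DGesPolicy, a novel action-based framework that reformulates holistic co-speech gesture generation as a continuous trajectory control problem solved with diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, the method learns inter-frame holistic gesture motion patterns and keeps motion trajectories spatially and semantically coherent. It also proposes a Gesture-Audio-Phoneme (GAP) fusion module that deeply fuses multimodal signals to achieve structured, fine-grained alignment between speech semantics, body motion, and facial expressions.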
Details
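Motivation: When generating holistic co-speech gestures that integrate full-body motion with facial expressions, existing part-decomposed or frame-level regression methods produce semantically incoherent body motion and spatially unstable, meaningless movements. These problems must be solved to generate more natural and expressive holistic gestures.
Result: Extensive quantitative and qualitative experiments on the BEAT2 dataset show that 3DGesPolicy outperforms other state-of-the-art methods at generating natural, expressive, and highly speech-aligned holistic gestures.
Insight: The contributions are the reformulation of gesture generation as a continuous trajectory control problem, the use of diffusion policy to model frame-to-frame variations as holistic actions, and the GAP fusion module for deep fusion and fine-grained alignment of multimodal signals, which together keep motion trajectories spatially and semantically coherent.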
Abstract: Generating holistic co-speech gestures that integrate full-body motion with facial expressions suffers from semantically incoherent body-motion coordination and spatially unstable, meaningless movements, owing to existing part-decomposed or frame-level regression methods. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that can deeply integrate and refine multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
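[118] Fair-Eye Net: A Fair, Trustworthy, Multimodal Integrated Glaucoma Full Chain AI System cs.CV | cs.AI
Wenbin Wei, Suyuan Yao, Cheng Huang, Xiangyu Gao
TL;DR: This paper proposes Fair-Eye Net, a fair and trustworthy multimodal integrated AI system for full-chain clinical management of glaucoma, from screening through follow-up to risk alerting. The system integrates fundus photos, OCT structural metrics, visual field functional indices, and demographic factors through a dual-stream heterogeneous fusion architecture, and uses an uncertainty-aware hierarchical gating strategy for selective prediction and safe referral. Alongside clinical reliability, it optimizes fairness as a primary objective through multitask learning, with the aim of advancing global eye health equity.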
Details
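Motivation: Glaucoma is a leading cause of irreversible blindness worldwide, so early detection and longitudinal follow-up are critical. Current screening and progression assessment rely on single tests or loosely linked examinations, which introduces subjectivity and fragmented care. Limited access to high-quality imaging tools and specialist expertise further undermines consistency and equity in real-world use.
Result: The system achieved an AUC of 0.912 for glaucoma detection (96.7% specificity), reduced the racial false-negative disparity by 73.4% (from 12.31% to 3.28%), maintained stable cross-domain performance, and enabled early risk alerts 3 to 12 months ahead (92% sensitivity, 88% specificity).
Insight: The main contributions are: (1) a full-chain AI system that integrates multimodal clinical data and covers screening, follow-up, and risk alerting; (2) a dual-stream heterogeneous fusion architecture and an uncertainty-aware hierarchical gating strategy that improve reliability and selective prediction; and (3) unlike post hoc fairness adjustments, a fairness constraint treated as one of the primary objectives of multitask learning, which markedly reduces missed diagnoses in disadvantaged subgroups and offers a reproducible path to clinical translation and large-scale deployment.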
Abstract: Glaucoma is a top cause of irreversible blindness globally, making early detection and longitudinal follow-up pivotal to preventing permanent vision loss. Current screening and progression assessment, however, rely on single tests or loosely linked examinations, introducing subjectivity and fragmented care. Limited access to high-quality imaging tools and specialist expertise further compromises consistency and equity in real-world use. To address these gaps, we developed Fair-Eye Net, a fair, reliable multimodal AI system closing the clinical loop from glaucoma screening to follow-up and risk alerting. It integrates fundus photos, OCT structural metrics, VF functional indices, and demographic factors via a dual-stream heterogeneous fusion architecture, with an uncertainty-aware hierarchical gating strategy for selective prediction and safe referral. A fairness constraint reduces missed diagnoses in disadvantaged subgroups. Experimental results show it achieved an AUC of 0.912 (96.7% specificity), cut racial false-negativity disparity by 73.4% (12.31% to 3.28%), maintained stable cross-domain performance, and enabled 3-12 months of early risk alerts (92% sensitivity, 88% specificity). Unlike post hoc fairness adjustments, Fair-Eye Net optimizes fairness as a primary goal with clinical reliability via multitask learning, offering a reproducible path for clinical translation and large-scale deployment to advance global eye health equity.
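The reported 73.4% cut in false-negative disparity follows from the two gap values quoted in the abstract (12.31% before, 3.28% after). A minimal arithmetic check, with `relative_reduction` as an illustrative helper name rather than anything from the paper:

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks going from `before` to `after`."""
    if before == 0:
        raise ValueError("relative reduction is undefined when before == 0")
    return (before - after) / before


# Values quoted in the abstract, in percentage points.
fn_gap_before = 12.31
fn_gap_after = 3.28

reduction_pct = 100.0 * relative_reduction(fn_gap_before, fn_gap_after)
print(f"{reduction_pct:.1f}%")  # prints "73.4%"
```

The figure is a relative reduction: the absolute gap shrinks by 9.03 percentage points, which is 73.4% of the original 12.31.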
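[119] DisasterInsight: A Multimodal Benchmark for Function-Aware and Grounded Disaster Assessment cs.CV
Sara Tehrani, Yonghao Xu, Leif Haglund, Amanda Berg, Michael Felsberg
TL;DR: This paper presents DisasterInsight, a multimodal benchmark for evaluating vision-language models on disaster analysis tasks. It restructures the xBD dataset into roughly 112K building-centered instances and supports evaluation across several instruction-driven tasks: building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines.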
Details
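Motivation: Existing vision-language benchmarks for remote sensing mostly focus on coarse labels and image-level recognition, and do not evaluate the functional understanding and instruction robustness that real humanitarian workflows require. This work aims to fill that gap with a more timely and practical evaluation tool for satellite image interpretation in disaster response.
Result: Experiments show that current state-of-the-art generic and remote-sensing VLMs have substantial performance gaps across tasks, especially in damage understanding and structured report generation. The proposed DI-Chat model, obtained by LoRA fine-tuning of existing VLM backbones, improves significantly on damage-level classification, disaster-type classification, and report generation quality, while building-function classification remains challenging for all evaluated models.
Insight: The contribution is an application-oriented, task-diverse, instruction-driven benchmark for disaster assessment that stresses the importance of functional understanding and structured output in disaster analysis. Methodologically, it shows that domain-specific fine-tuning of existing VLMs with parameter-efficient LoRA (DI-Chat) is effective, offering a workable path for domain adaptation. The unified benchmark also provides a platform for studying multimodal reasoning in disaster imagery.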
Abstract: Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.
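DI-Chat is described as fine-tuning VLM backbones with Low-Rank Adaptation (LoRA). As background, here is a minimal sketch of the standard LoRA forward pass for a single linear layer, y = W x + (alpha / r) · B (A x), written in plain Python. The matrices, the `alpha` value, and the function names are illustrative assumptions; the rank, scaling, and target layers used for DI-Chat are not given in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """LoRA-adapted linear layer.

    W: frozen base weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    The low-rank update B @ A is scaled by alpha / r.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
A = [[1.0, 1.0]]               # rank r = 1
B_zero = [[0.0], [0.0]]        # common initialization: B starts at zero
x = [1.0, 2.0]

# With B = 0 the adapted layer reproduces the frozen base layer.
y_initial = lora_forward(W, A, B_zero, x, alpha=2.0)
```

Only `A` and `B` would be updated during fine-tuning, which is why the approach is parameter-efficient: the number of trainable values scales with the rank `r` rather than with the full weight matrix.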
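[120] GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning cs.CV
Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu
TL;DR: GenAgent is a framework that scales text-to-image generation through agentic multimodal reasoning. It decouples visual understanding from generation: a multimodal model handles understanding, and image generation models are treated as invokable tools. Through autonomous multi-turn interaction, the agent produces multimodal chains-of-thought that include reasoning, tool invocation, judgment, and reflection to iteratively refine its outputs.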
Details
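Motivation: Conventional unified models face expensive training costs and a trade-off between understanding and generation capabilities, while existing modular systems are constrained by static pipelines. GenAgent targets both problems.
Result: On the GenEval++ and WISE benchmarks, GenAgent substantially improves the base generator FLUX.1-dev, by 23.6% and 14% respectively.
Insight: The contribution is an agentic framework that decouples understanding from generation, trained in two stages (supervised fine-tuning, then end-to-end agentic reinforcement learning) and refined iteratively through multi-turn chains-of-thought. It demonstrates cross-tool generalization, test-time scaling, and task-adaptive reasoning.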
Abstract: We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts the performance of the base generator (FLUX.1-dev) on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at https://github.com/deep-kaixun/GenAgent.
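[121] Automated Landmark Detection for assessing hip conditions: A Cross-Modality Validation of MRI versus X-ray cs.CV
Roberto Di Via, Vito Paolo Pastore, Francesca Odone, Siôn Glyn-Jones, Irina Voiculescu
TL;DR: This paper proposes an automated landmark detection method based on heatmap regression architectures for assessing femoroacetabular impingement (FAI), and carries out a cross-modality validation study on a paired MRI/X-ray dataset of 89 patients. The study shows that MRI, in coronal views of 3D volumes, achieves landmark localisation and diagnostic accuracy equivalent to X-ray, supporting the integration of automated FAI assessment into routine MRI workflows.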
Details
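Motivation: Clinical screening often depends on angle measurements. FAI screening in particular has traditionally relied on X-ray measurements, but assessing the height and span of the impingement area requires the 3D view that MRI provides. This study aims to validate the clinical equivalence of MRI for automated landmark detection.
Result: On a paired MRI/X-ray dataset of 89 patients, using standard heatmap regression architectures, MRI reaches landmark localisation and diagnostic accuracy for cam-type impingement equivalent to X-ray, supporting clinical feasibility.
Insight: The contribution is the first matched-cohort validation study showing cross-modality clinical equivalence of MRI for automated FAI assessment. It opens the possibility of volumetric analysis through the placement of further landmarks, and the approach can inform other multimodal medical image analysis tasks.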
Abstract: Many clinical screening decisions are based on angle measurements. In particular, Femoroacetabular Impingement (FAI) screening relies on angles traditionally measured on X-rays. However, assessing the height and span of the impingement area also requires a 3D view through an MRI scan. The two modalities inform the surgeon on different aspects of the condition. In this work, we conduct a matched-cohort validation study (89 patients, paired MRI/X-ray) using standard heatmap regression architectures to assess cross-modality clinical equivalence. Given that landmark detection has been proven effective on X-rays, we show that MRI also achieves equivalent localisation and diagnostic accuracy for cam-type impingement. Our method demonstrates clinical feasibility for FAI assessment in coronal views of 3D MRI volumes, opening the possibility for volumetric analysis through placing further landmarks. These results support integrating automated FAI assessment into routine MRI workflows. Code is released at https://github.com/Malga-Vision/Landmarks-Hip-Conditions
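To make the link between heatmap regression and angle-based screening concrete, here is a minimal sketch under simple assumptions: each landmark is decoded as the peak of its predicted heatmap, and a clinical angle is then measured at one landmark between two others. The helper names and the toy inputs are hypothetical; the specific landmarks, angles, and decoding used in the paper are not described in the abstract.

```python
import math


def heatmap_argmax(heatmap):
    """Decode a landmark as the (row, col) of the heatmap's maximum value."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


def angle_at_vertex(a, vertex, b):
    """Angle in degrees at `vertex` between the rays toward points a and b."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("landmarks must be distinct from the vertex")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


toy_heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.3, 0.9],
    [0.0, 0.1, 0.0],
]
peak = heatmap_argmax(toy_heatmap)
```

Because the angle is computed from landmark coordinates alone, the same measurement code applies whether the landmarks were detected on an X-ray or on a coronal MRI slice, which is what makes a cross-modality comparison straightforward.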
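[122] Self-Refining Video Sampling cs.CV | cs.LG
Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon
TL;DR: This paper proposes self-refining video sampling, which interprets a pre-trained video generator as a denoising autoencoder and performs iterative inner-loop refinement at inference time, with no external verifier or additional training. It also introduces an uncertainty-aware refinement strategy based on self-consistency to improve the physical realism and motion coherence of generated videos.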
Details
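Motivation: Existing video generators still fall short on complex physical dynamics and struggle to achieve physical realism. Current remedies rely on external verifiers or training on augmented data, which is computationally expensive and still poor at capturing fine-grained motion.
Result: Experiments on state-of-the-art video generators show significant improvements in motion coherence and physics alignment, with over 70% human preference compared with the default sampler and a guidance-based sampler.
Insight: The contribution is using the pre-trained generator itself as the refiner for iterative refinement at inference time, together with an uncertainty-aware strategy that refines regions selectively and so avoids the artifacts caused by over-refinement, all without additional training or external tools.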
Abstract: Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.
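[123] AGSP-DSA: An Adaptive Graph Signal Processing Framework for Robust Multimodal Fusion with Dynamic Semantic Alignment cs.CV | cs.MM
KV Karthikeya, Ashok Kumar Das, Shantanu Pal, Vivekananda Bhat K, Arun Sekar Rajasekaran
TL;DR: This paper proposes the Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework for robust fusion of heterogeneous multimodal data such as text, audio, and images. The method learns intra-modal and inter-modal relations through a dual-graph construction, boosts informative signals with spectral graph filtering, and embeds nodes with multi-scale graph convolutional networks. A semantic-aware attention mechanism lets each modality contribute dynamically according to its contextual relevance. Experiments on three benchmark datasets (CMU-MOSEI, AVE, and MM-IMDB) show state-of-the-art performance on sentiment analysis, event recognition, and multimedia classification.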
Details
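Motivation: Fusing heterogeneous multimodal data (such as text, audio, and images) robustly is hard, especially when modalities are missing or highly heterogeneous. The goal is fusion that is dynamic and semantically aligned.
Result: On CMU-MOSEI the method reaches 95.3% accuracy, 0.936 F1-score, and 0.924 mAP, a 2.6% accuracy improvement over MM-GNN. On AVE it reaches 93.4% accuracy and 0.911 F1-score, and on MM-IMDB 91.8% accuracy and 0.886 F1-score. These results show good generalization and robustness in the missing-modality setting and reach the state of the art.
Insight: The contributions are a dual-graph framework that combines adaptive graph signal processing with dynamic semantic alignment, spectral graph filtering to strengthen signals, and a semantic-aware attention mechanism for dynamically weighted fusion of modalities. From an objective standpoint, the method joins graph signal processing with multimodal learning and offers a new approach to modality heterogeneity and missing modalities.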
Abstract: In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework to perform robust multimodal data fusion over heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to boost the informative signals, and Multi-scale Graph Convolutional Networks (GCNs) for effective node embedding. A semantic-aware attention mechanism lets each modality contribute dynamically according to its contextual relevance. The experimental outcomes on three benchmark datasets, including CMU-MOSEI, AVE, and MM-IMDB, show that AGSP-DSA achieves state-of-the-art performance. More precisely, it achieves 95.3% accuracy, 0.936 F1-score, and 0.924 mAP on CMU-MOSEI, improving on MM-GNN by 2.6 percent in accuracy. It achieves 93.4% accuracy and 0.911 F1-score on AVE and 91.8% accuracy and 0.886 F1-score on MM-IMDB, which demonstrates good generalization and robustness in the missing-modality setting. These findings verify the efficiency of AGSP-DSA in promoting multimodal learning in sentiment analysis, event recognition and multimedia classification.
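[124] Splat-Portrait: Generalizing Talking Heads with Gaussian Splatting cs.CV
Tong Shi, Melonie de Almeida, Daniela Ivanova, Nicolas Pugeault, Paul Henderson
TL;DR: Splat-Portrait is a Gaussian-splatting-based 3D talking head generation method that synthesizes realistic talking videos from a single portrait image and speech input. It automatically disentangles the portrait into a static 3D Gaussian Splatting reconstruction and a 2D background, and generates natural lip motion conditioned on audio, without motion priors or 3D supervision.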
Details
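Motivation: Existing 3D talking head generation methods rely on domain-specific heuristics, such as warping-based motion priors, which leads to inaccurate 3D avatar reconstructions and animations that lack realism.
Result: Experiments show that Splat-Portrait performs better on talking head generation and novel view synthesis, with visual quality superior to previous work.
Insight: The contribution is training purely with 2D reconstruction and score-distillation losses, without 3D supervision or landmarks, to achieve automatic disentanglement of a single image into a dynamic 3D head and audio-driven lip motion synthesis. This offers a new paradigm for 3D generation that needs no explicit motion priors.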
Abstract: Talking Head Generation aims at synthesizing natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specific heuristics such as warping-based facial motion representation priors to animate talking motions, yet still produce inaccurate 3D avatar reconstructions, thus undermining the realism of generated animations. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis. Our approach automatically learns to disentangle a single portrait image into a static 3D reconstruction represented as static Gaussian Splatting, and a predicted whole-image 2D background. It then generates natural lip motion conditioned on input audio, without any motion-driven priors. Training is driven purely by 2D reconstruction and score-distillation losses, without 3D supervision or landmarks. Experimental results demonstrate that Splat-Portrait exhibits superior performance on talking head generation and novel view synthesis, achieving better visual quality compared to previous works. Our project code and supplementary documents are publicly available at https://github.com/stonewalking/Splat-portrait.
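[125] Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge cs.CV
Xiao Liu, Jiawei Zhang
TL;DR: This paper studies the geographic fairness and geographically grounded visual knowledge of text-to-video generation models through an evaluation centered on tourist attractions. The authors propose the Geo-Attraction Landmark Probing (GAP) framework and the GEOATTRACTION-500 benchmark to assess how faithfully models generate attractions from different regions of the world. Applying GAP to Sora 2 shows that its geographic visual knowledge is relatively uniform across regions, development levels, and cultural groupings, and depends only weakly on attraction popularity.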
Details
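Motivation: It is unclear whether text-to-video generation models encode geographically equitable visual knowledge, that is, whether they are biased when generating content from different regions of the world.
Result: The evaluation of Sora 2 shows relatively uniform geographic visual knowledge with only a weak correlation with attraction popularity. The results are validated against human evaluation.
Insight: The contribution is the GAP evaluation framework and the GEOATTRACTION-500 benchmark, which systematically disentangle overall video quality from attraction-specific knowledge. The analysis suggests that current state-of-the-art models may be fairer than expected, which is informative for globally deployed applications and underlines the need for continued evaluation.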
Abstract: Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
cs.AI
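[126] SQL-Trail: Multi-Turn Reinforcement Learning with Interleaved Feedback for Text-to-SQL cs.AI | cs.CL
Harper Hua, Zhen Han, Zhengyuan Shen, Jeremy Lee, Patrick Guan
TL;DR: This paper proposes SQL-Trail, a multi-turn reinforcement learning agent framework for Text-to-SQL. It abandons the conventional one-shot generation paradigm and instead interacts with the database environment, using execution feedback to iteratively refine SQL queries. Its core ideas are an adaptive turn-budget allocation mechanism and a composite reward panel. The method sets new state-of-the-art results on benchmarks such as BIRD-SQL and is far more data-efficient than previous single-pass RL methods, with its smaller models even surpassing larger proprietary systems.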
Details
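Motivation: A significant gap remains between existing LLM-based Text-to-SQL systems and human experts on challenging benchmarks such as BIRD-SQL. The authors argue that this stems mainly from the prevailing single-pass generation paradigm, which lacks the iterative reasoning, schema exploration, and error correction that humans naturally use.
Result: Across multiple benchmarks, SQL-Trail sets new state-of-the-art results, with data efficiency up to 18x higher than prior single-pass RL state-of-the-art methods. Its 7B and 14B models outperform much larger proprietary systems by 5% on average.
Insight: The main contribution is modeling Text-to-SQL as a multi-turn interactive agent task, with adaptive turn-budget allocation (interaction depth scaled to question difficulty) and a composite reward panel (jointly rewarding SQL correctness and efficient exploration). This offers a new agentic workflow paradigm for building more robust and efficient Text-to-SQL systems.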
Abstract: While large language models (LLMs) have substantially improved Text-to-SQL generation, a pronounced gap remains between AI systems and human experts on challenging benchmarks such as BIRD-SQL. We argue this gap stems largely from the prevailing single-pass paradigm, which lacks the iterative reasoning, schema exploration, and error-correction behaviors that humans naturally employ. To address this limitation, we introduce SQL-Trail, a multi-turn reinforcement learning (RL) agentic framework for Text-to-SQL. Rather than producing a query in one shot, SQL-Trail interacts with the database environment and uses execution feedback to iteratively refine its predictions. Our approach centers on two key ideas: (i) an adaptive turn-budget allocation mechanism that scales the agent’s interaction depth to match question difficulty, and (ii) a composite reward panel that jointly incentivizes SQL correctness and efficient exploration. Across benchmarks, SQL-Trail sets a new state of the art and delivers strong data efficiency, up to 18x higher than prior single-pass RL state-of-the-art methods. Notably, our 7B and 14B models outperform substantially larger proprietary systems by 5% on average, underscoring the effectiveness of interactive, agentic workflows for robust Text-to-SQL generation.
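The interaction pattern described above, propose a query, execute it, and feed the database's response back into the next attempt, can be sketched with Python's built-in `sqlite3` module. This is a toy loop under stated assumptions, not SQL-Trail itself: `run_with_feedback` and `toy_policy` are hypothetical names, the policy is a hard-coded stand-in for a trained model, and a fixed `max_turns` replaces the paper's adaptive turn budget.

```python
import sqlite3


def run_with_feedback(conn, propose, max_turns):
    """Ask `propose` for SQL until it executes or the turn budget runs out.

    `propose(feedback, history)` returns a SQL string; `feedback` is the
    last database error message (None on the first turn).
    Returns (sql, rows, turns_used), or (None, None, max_turns) on failure.
    """
    feedback = None
    history = []
    for turn in range(max_turns):
        sql = propose(feedback, history)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows, turn + 1
        except sqlite3.Error as exc:
            # Execution feedback: the error text drives the next attempt.
            feedback = str(exc)
            history.append((sql, feedback))
    return None, None, max_turns


def toy_policy(feedback, history):
    """Stand-in for a model: guesses a wrong column, then corrects it."""
    if feedback is None:
        return "SELECT COUNT(*) FROM users WHERE username IS NOT NULL"
    return "SELECT COUNT(*) FROM users WHERE name IS NOT NULL"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

sql, rows, turns = run_with_feedback(conn, toy_policy, max_turns=3)
```

Here the first query fails because `username` does not exist, and the second succeeds after the error message is fed back, so the loop ends on turn two. A real agent would also inspect the schema and check result plausibility, not only whether the query runs.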
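[127] DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints cs.AI | cs.CL
Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su
TL;DR: DeepPlanning is a benchmark for evaluating the long-horizon planning ability of agents. It includes multi-day travel planning and multi-product shopping tasks that require agents to acquire information proactively, reason under local constraints, and optimize under global constraints.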
Details
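Motivation: Most existing agent benchmarks focus on local, step-level reasoning. They give little weight to the global constrained optimization found in real-world scenarios (such as time and budget limits), or to active information gathering and fine-grained local constraints.
Result: Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for a better effectiveness-efficiency trade-off.
Insight: The contribution is a practical long-horizon planning benchmark that emphasizes active information acquisition and fine-grained constraints, pointing to directions for improving agentic LLMs over long planning horizons.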
Abstract: While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
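[128] Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning cs.AI | cs.CL | cs.LG
Lei Wei, Jinpeng Ou, Xiao Peng, Bin Wang
TL;DR: This paper proposes Think-Augmented Function Calling (TAFC), a framework that improves the accuracy of parameter generation in LLM function calling through explicit reasoning at the function and parameter levels. It adds a universal "think" parameter that lets the model articulate its decision process, and dynamically optimizes parameter descriptions to improve reasoning quality. For complex parameters, TAFC automatically triggers fine-grained reasoning based on a complexity score so that critical decisions are justified. It also proposes reasoning-guided optimization to align the generated reasoning with human expectations. TAFC requires no changes to existing LLM architectures and remains fully API-compatible.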
Details
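Motivation: Current LLM function-calling mechanisms lack explicit, transparent reasoning during parameter generation, especially for complex functions with interdependent parameters. Existing approaches such as chain-of-thought prompting operate at the agent level and cannot give fine-grained reasoning guidance for individual function parameters.
Result: Evaluation on the ToolBench benchmark across proprietary and open-source models shows that TAFC significantly improves parameter generation accuracy and reasoning coherence for multi-parameter functions, while making AI agent behavior easier to interpret and debug.
Insight: The contribution is a universal "think" parameter augmentation framework that provides explicit, interpretable reasoning at the function and parameter levels, improves accuracy and interpretability through dynamic description optimization and complexity-triggered fine-grained reasoning, and needs no model architecture changes, so it is highly compatible.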
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in function calling for autonomous agents, yet current mechanisms lack explicit reasoning transparency during parameter generation, particularly for complex functions with interdependent parameters. While existing approaches like chain-of-thought prompting operate at the agent level, they fail to provide fine-grained reasoning guidance for individual function parameters. To address these limitations, we propose Think-Augmented Function Calling (TAFC), a novel framework that enhances function calling accuracy through explicit reasoning at both function and parameter levels. Our method introduces a universal “think” parameter augmentation that enables models to articulate their decision-making process, with dynamic optimization for parameter descriptions to improve reasoning quality. For complex parameters, TAFC automatically triggers granular reasoning based on complexity scoring, ensuring appropriate justification for critical decisions. Additionally, we propose reasoning-guided optimization to align generated reasoning with human expectations. TAFC requires no architectural modifications to existing LLMs while maintaining full API compatibility. Evaluation on ToolBench across proprietary and open-source models demonstrates significant improvements in parameter generation accuracy and reasoning coherence for multi-parameter functions, while providing enhanced interpretability for debugging AI agent behaviors.
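One way to picture the "think" parameter augmentation is as a wrapper around an ordinary tool schema. The sketch below assumes a JSON-Schema-style tool definition of the kind used by common function-calling APIs; `add_think_parameter`, `split_think`, and the description text are hypothetical, and stripping the reasoning before dispatch is an assumed way of keeping the underlying function's signature unchanged, not a detail confirmed by the abstract.

```python
import copy

# Hypothetical description; the wording TAFC actually uses is not given.
THINK_PROPERTY = {
    "type": "string",
    "description": "Brief reasoning for the chosen argument values.",
}


def add_think_parameter(tool_schema):
    """Return a copy of the schema with a leading, required 'think' field."""
    augmented = copy.deepcopy(tool_schema)
    params = augmented.setdefault(
        "parameters", {"type": "object", "properties": {}}
    )
    props = params.get("properties", {})
    # Put 'think' first so the reasoning is generated before the arguments.
    params["properties"] = {"think": THINK_PROPERTY, **props}
    required = list(params.get("required", []))
    if "think" not in required:
        required.insert(0, "think")
    params["required"] = required
    return augmented


def split_think(arguments):
    """Separate the model's reasoning from the arguments the tool expects."""
    reasoning = arguments.get("think")
    call_args = {k: v for k, v in arguments.items() if k != "think"}
    return reasoning, call_args


weather_tool = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
augmented_tool = add_think_parameter(weather_tool)
reasoning, call_args = split_think(
    {"think": "User asked about Paris.", "city": "Paris"}
)
```

The original schema is left untouched and the tool still receives only the arguments it declared, while the reasoning string can be logged for debugging.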
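[129] Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models cs.AI | cs.CL | cs.LG
Zhenyuan Guo, Tong Chen, Wenlong Meng, Chen Gong, Xin Yu
TL;DR: This paper proposes Dynamic Thinking-Token Selection (DynTS) to make large reasoning models (LRMs) more efficient. The method analyzes reasoning traces with attention maps, identifies the decision-critical tokens that steer the final answer, and retains only those tokens' key-value (KV) cache states during inference, reducing memory footprint and computational overhead.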
Details
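Motivation: Large reasoning models generate explicit reasoning traces when solving complex problems, which incurs substantial memory and compute cost and becomes an efficiency bottleneck. The paper aims to analyze the influence of tokens in the reasoning trace, then identify and remove the redundant ones to make inference more efficient.
Result: The abstract reports no specific quantitative results, benchmarks, or comparisons with existing methods.
Insight: The contribution is the observation that only some decision-critical tokens in a reasoning trace materially affect the final answer, and an efficient inference method built on it that dynamically selects and retains the KV cache of those tokens. From an objective standpoint, this is a novel cache optimization strategy that may offer a new direction for efficient LLM inference.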
Abstract: Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur substantial memory footprint and computational overhead, bottlenecking LRMs’ efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncover an interesting phenomenon: only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency.
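As a rough illustration of attention-guided KV retention, the sketch below scores each cached position by the total attention it receives from a window of recent query tokens and keeps the highest-scoring fraction. The scoring rule, the `keep_ratio` parameter, and both function names are assumptions made for illustration; the abstract does not specify how DynTS actually identifies decision-critical tokens.

```python
def select_critical_tokens(attn_rows, keep_ratio=0.5):
    """Pick cached positions that receive the most attention.

    attn_rows: list of attention rows, one per recent query token, each a list
               of weights over the cached positions.
    Returns the kept position indices in their original order.
    """
    n = len(attn_rows[0])
    # Accumulate attention mass per cached position across all query rows.
    scores = [sum(row[j] for row in attn_rows) for j in range(n)]
    k = max(1, int(n * keep_ratio))
    top = sorted(range(n), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)

def evict(kv_cache, keep_idx):
    """Drop every cache entry whose position is not in keep_idx."""
    return [kv_cache[j] for j in keep_idx]
```

For example, with four cached positions and `keep_ratio=0.5`, only the two positions with the largest accumulated attention survive eviction.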
[130] AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning cs.AI | cs.CL | cs.CV | cs.MAPDF
Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu
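TL;DR: AdaReasoner is a family of multimodal models that learn tool use as a general reasoning skill, rather than as tool-specific or explicitly supervised behavior, enabling dynamic tool orchestration for iterative visual reasoning. Through a scalable data pipeline, a reinforcement learning algorithm that optimizes tool selection and sequencing (Tool-GRPO), and an adaptive learning mechanism, the models infer tool utility from task context and intermediate results, coordinate multiple tools, and generalize to unseen tools.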
Details
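Motivation: Addresses how multimodal large language models (MLLMs) can effectively select, invoke, and compose tools in visual reasoning, sustaining long-horizon, multi-step tool interaction, especially when facing new tools or new tasks.
Result: Achieves state-of-the-art performance on multiple challenging benchmarks, improving the 7B base model by 24.9% on average and surpassing strong proprietary systems such as GPT-5 on tasks including VSP and Jigsaw.
Insight: The novelties are learning tool use as a general reasoning skill, reinforcement-learning-based optimization of tool sequences (Tool-GRPO), and a mechanism that adaptively regulates tool usage. Together they enable dynamic inference of tool utility and generalization, with the model adjusting tool usage frequency and relevance on its own without being explicitly trained to do so.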
Abstract: When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
[131] FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory cs.AI | cs.CLPDF
Lei Wei, Xu Dong, Xiao Peng, Niantao Xie, Bin Wang
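TL;DR: This paper proposes FadeMem, a biologically-inspired agent memory architecture that addresses the memory limitations of large language models deployed as autonomous agents by mimicking the active forgetting of human memory. The system uses a dual-layer memory hierarchy in which adaptive exponential decay functions, modulated by semantic relevance, access frequency, and temporal patterns, govern retention, while LLM-guided conflict resolution and intelligent memory fusion consolidate related information and let irrelevant details fade.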
Details
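Motivation: LLMs deployed as autonomous agents lack selective forgetting mechanisms, so they suffer either catastrophic forgetting at context boundaries or information overload within them, whereas human memory naturally balances retention and forgetting through adaptive decay.
Result: Experiments on Multi-Session Chat, LoCoMo, and LTI-Bench show superior multi-hop reasoning and retrieval with a 45% reduction in storage, validating biologically-inspired forgetting in agent memory systems.
Insight: The novelty is bringing the active forgetting of human memory (adaptive exponential decay based on semantics, frequency, and time) into AI agent memory, with a dual-layer memory hierarchy and LLM-guided fusion and conflict resolution for efficient information management. This offers a new direction for building agent memory systems that are more efficient and closer to human cognition.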
Abstract: Large language models deployed as autonomous agents face critical memory limitations, lacking selective forgetting mechanisms that lead to either catastrophic forgetting at context boundaries or information overload within them. While human memory naturally balances retention and forgetting through adaptive decay processes, current AI systems employ binary retention strategies that preserve everything or lose it entirely. We propose FadeMem, a biologically-inspired agent memory architecture that incorporates active forgetting mechanisms mirroring human cognitive efficiency. FadeMem implements differential decay rates across a dual-layer memory hierarchy, where retention is governed by adaptive exponential decay functions modulated by semantic relevance, access frequency, and temporal patterns. Through LLM-guided conflict resolution and intelligent memory fusion, our system consolidates related information while allowing irrelevant details to fade. Experiments on Multi-Session Chat, LoCoMo, and LTI-Bench demonstrate superior multi-hop reasoning and retrieval with 45% storage reduction, validating the effectiveness of biologically-inspired forgetting in agent memory systems.
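To make the idea of adaptive exponential decay concrete, here is a minimal sketch in which a memory's retention is exp(-rate × age) and the rate shrinks as semantic relevance and access count grow. The specific modulation formula, the threshold, and the dictionary fields are hypothetical; the abstract names the three modulating factors but not their functional form, and the dual-layer hierarchy and LLM-guided fusion are not modeled here.

```python
import math

def decay_rate(base_rate, relevance, access_count):
    """Hypothetical modulation: relevant, frequently accessed memories decay more slowly."""
    return base_rate / (1.0 + relevance + math.log1p(access_count))

def retention(age_hours, base_rate, relevance, access_count):
    """Exponential decay of a memory's strength with time since last access."""
    return math.exp(-decay_rate(base_rate, relevance, access_count) * age_hours)

def prune(memories, now, threshold=0.2, base_rate=0.1):
    """Keep only memories whose retention is still at or above the threshold."""
    kept = []
    for m in memories:
        r = retention(now - m["last_access"], base_rate, m["relevance"], m["access_count"])
        if r >= threshold:
            kept.append(m)
    return kept
```

Under these assumed settings, a relevant memory accessed ten times survives two days of inactivity, while an irrelevant, never-revisited one fades below the threshold.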
cs.MM [Back]
[132] Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning cs.MM | cs.CL | cs.CVPDF
Zhixian Zhao, Wenjie Tian, Xiaohai Tian, Jun Zhang, Lei Xie
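TL;DR: This paper proposes SABER-LLM, a framework that addresses the weaknesses of multimodal large language models in fine-grained perception and cross-modal fusion in order to make emotion reasoning more robust in complex social scenes. Its core contributions are SABER, a large-scale emotion reasoning dataset with multi-dimensional annotations, a structured evidence decomposition paradigm, and consistency-aware direct preference optimization.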
Details
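Motivation: Current multimodal large language models are limited in fine-grained perception (for example, facial micro-expressions and prosodic shifts) by data scarcity and insufficient cross-modal fusion. This leads to unimodal dominance and hallucinations in complex multimodal interactions such as sarcasm, so more robust reasoning over subtle, ambiguous, or contradictory cues is needed.
Result: On the EMER, EmoBench-M, and SABER-Test benchmarks, SABER-LLM significantly outperforms open-source baselines and reaches robustness comparable to closed-source models in decoding complex emotional dynamics.
Insight: The novelties are a large-scale emotion reasoning dataset with fine-grained annotations, a structured evidence decomposition paradigm that separates evidence extraction from reasoning, and consistency-aware direct preference optimization that pushes the model toward multimodal alignment under ambiguous or conflicting perceptual conditions. Together these mitigate unimodal dominance and improve reasoning robustness.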
Abstract: Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenes). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a “perceive-then-reason” separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available at https://github.com/zxzhao0/SABER-LLM.
[133] AI-based System for Transforming text and sound to Educational Videos cs.MM | cs.AI | cs.CV | cs.CYPDF
M. E. ElAlami, S. M. Khater, M. El. R. Rehan
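TL;DR: This paper proposes a Generative Adversarial Network (GAN)-based AI system that converts text or speech input into complete educational videos. The system has three phases: it first transcribes the input using speech recognition, then extracts key terms and generates relevant images with CLIP and diffusion models to improve visual quality and semantic alignment, and finally synthesizes the generated images into a video with integrated sound.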
Details
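Motivation: Addresses the challenge of generating educational videos from conditional inputs such as text or speech, using deep learning to improve automatic generation of educational video.
Result: Achieves a Fréchet Inception Distance (FID) score of 28.75%, outperforming existing methods such as TGAN, MoCoGAN, and TGANS-C, which indicates improved visual quality.
Insight: The novelty is combining CLIP and diffusion models to strengthen the semantic alignment and visual quality of generated images, within an end-to-end framework that converts text or speech into a complete educational video. Its multi-stage integration approach is a useful reference for improving the coherence of generated content.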
Abstract: Technological developments have produced methods that can generate educational videos from input text or sound. Recently, the use of deep learning techniques for image and video generation has been widely explored, particularly in education. However, generating video content from conditional inputs such as text or speech remains a challenging area. In this paper, we introduce a novel method for the educational domain based on Generative Adversarial Networks (GANs), which develops frame-by-frame frameworks and is able to create full educational videos. The proposed system is structured into three main phases. In the first phase, the input (either text or speech) is transcribed using speech recognition. In the second phase, key terms are extracted and relevant images are generated using advanced models such as CLIP and diffusion models to enhance visual quality and semantic alignment. In the final phase, the generated images are synthesized into a video format, integrated with either pre-recorded or synthesized sound, resulting in a fully interactive educational video. The proposed system is compared with other systems such as TGAN, MoCoGAN, and TGANS-C, achieving a Fréchet Inception Distance (FID) score of 28.75%, which indicates improved visual quality over existing methods.
quant-ph [Back]
[134] Differentiable Architecture Search for Adversarially Robust Quantum Computer Vision quant-ph | cs.CVPDF
Mohamed Afane, Quanjiang Long, Haoting Shen, Ying Mao, Junaid Farooq
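TL;DR: This paper proposes a hybrid quantum-classical Differentiable Quantum Architecture Search (DQAS) framework to address the extreme sensitivity of quantum neural networks to adversarial perturbations and hardware noise. By adding a lightweight classical noise layer before quantum processing, the method jointly optimizes circuit structure and robustness. On MNIST, FashionMNIST, and CIFAR it achieves consistent gains in both clean and adversarial accuracy over existing quantum architecture search methods, and its feasibility is validated on actual quantum hardware.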
Details
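Motivation: Current quantum neural networks are extremely sensitive to adversarial perturbations and hardware noise, and existing robustness techniques typically sacrifice clean accuracy or demand heavy computational resources, which hinders real-world deployment.
Result: On MNIST, FashionMNIST, and CIFAR, the method consistently improves both clean and adversarial accuracy over existing quantum architecture search methods. The hybrid framework maintains superior performance under FGSM, PGD, BIM, and MIM attacks and under realistic quantum noise, and is validated on actual quantum hardware.
Insight: The novelty is combining a lightweight classical noise layer with differentiable quantum architecture search, jointly optimizing gate selection and noise parameters by gradient methods. This preserves the integrity of the quantum circuit while introducing trainable perturbations that improve robustness without hurting standard performance. It shows that strategic classical preprocessing combined with differentiable quantum architecture optimization can markedly improve quantum neural network robustness while keeping computation efficient.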
Abstract: Current quantum neural networks suffer from extreme sensitivity to both adversarial perturbations and hardware noise, creating a significant barrier to real-world deployment. Existing robustness techniques typically sacrifice clean accuracy or require prohibitive computational resources. We propose a hybrid quantum-classical Differentiable Quantum Architecture Search (DQAS) framework that addresses these limitations by jointly optimizing circuit structure and robustness through gradient-based methods. Our approach enhances traditional DQAS with a lightweight Classical Noise Layer applied before quantum processing, enabling simultaneous optimization of gate selection and noise parameters. This design preserves the quantum circuit’s integrity while introducing trainable perturbations that enhance robustness without compromising standard performance. Experimental validation on MNIST, FashionMNIST, and CIFAR datasets shows consistent improvements in both clean and adversarial accuracy compared to existing quantum architecture search methods. Under various attack scenarios, including Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), Basic Iterative Method (BIM), and Momentum Iterative Method (MIM), and under realistic quantum noise conditions, our hybrid framework maintains superior performance. Testing on actual quantum hardware confirms the practical viability of discovered architectures. These results demonstrate that strategic classical preprocessing combined with differentiable quantum architecture optimization can significantly enhance quantum neural network robustness while maintaining computational efficiency.
cs.RO [Back]
[135] A Pragmatic VLA Foundation Model cs.RO | cs.CVPDF
Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu
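TL;DR: This paper presents LingBot-VLA, a Vision-Language-Action foundation model trained on about 20,000 hours of real-world data from 9 popular dual-arm robot configurations. In a systematic evaluation on 3 robotic platforms, the model outperforms competitors on 100 tasks and shows broad generalization. The team also built an efficient codebase that delivers a substantial training speedup, and released the code, base model, and benchmark data to advance robot learning.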
Details
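Motivation: VLA foundation models for robotic manipulation must balance generalization across tasks and platforms against cost efficiency (such as the data and GPU hours needed for adaptation). The goal is a model that is both high-performing and suitable for real-world deployment.
Result: Across 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, the model clearly outperforms competitors, demonstrating strong performance and broad generalizability. The codebase delivers a throughput of 261 samples per second per GPU in an 8-GPU training setup, a 1.5 to 2.8× speedup over existing VLA-oriented codebases.
Insight: The novelty is building a VLA foundation model from large-scale, diverse real-world dual-arm data and validating its generalization through systematic multi-platform evaluation, alongside an efficient training codebase that markedly speeds up training and makes real-world deployment feasible. Viewed objectively, its open-source release and emphasis on evaluation standards help drive research on more challenging tasks and fair comparison in the field.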
Abstract: Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second per GPU with an 8-GPU training setup, representing a 1.5 to 2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
[136] Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge cs.RO | cs.AI | cs.CVPDF
Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen
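TL;DR: This paper describes the Multi-Agent Robotic System (MARS) Challenge held at the NeurIPS 2025 Workshop on SpaVLE. The challenge targets the difficulties of multi-agent collaboration, focusing on two key areas, planning and control, and encourages participants to use vision-language models for multi-agent embodied planning and for executing robotic manipulation policies in dynamic environments.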
Details
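Motivation: As Embodied AI moves toward more complex task scenarios, multi-agent system frameworks become essential for scalable, efficient, and collaborative solutions. The challenge aims to address the difficulties of multi-agent collaboration and to advance the field.
Result: By evaluating participants' submissions, the challenge yields valuable insights into the design and coordination of embodied multi-agent systems, but the abstract reports no quantitative benchmark results or state-of-the-art comparisons.
Insight: The novelty is using a structured competition (MARS) to systematically explore and evaluate multi-agent embodied planning and control based on vision-language models, providing a practical platform and design insights for future advanced collaborative AI systems.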
Abstract: Recent advancements in multimodal large language models and vision-language-action models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.
cs.IR [Back]
[137] Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests cs.IR | cs.CLPDF
Jingjie Ning, João Coelho, Yibo Kong, Yunfan Long, Bruno Martins
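TL;DR: Through a large-scale log analysis of 14.44M search requests (3.97M sessions) collected from the DeepResearchGym platform, this paper empirically studies how LLM-driven search agents behave in multi-step information-seeking tasks, covering session length, intent distribution, query reformulation, and evidence reuse.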
Details
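Motivation: The IR community lacks an empirical understanding of how LLM-based search agents' multi-step sessions unfold in practice and how they use retrieved evidence. The paper aims to fill this gap with a large-scale analysis of real logs.
Result: The analysis finds that over 90% of multi-turn sessions contain at most 10 steps and 89% of inter-step intervals are under one minute; behavior differs markedly across intents (for example, fact-seeking versus reasoning tasks); and on average 54% of newly introduced query terms can be traced to the previously accumulated evidence context.
Insight: Proposes LLM-based annotation of session intents and query-reformulation labels, and introduces the Context-driven Term Adoption Rate (CTAR) to quantify evidence reuse. The findings point to potential optimizations such as repetition-aware early stopping, intent-adaptive retrieval budgets, and explicit cross-step context tracking.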
Abstract: LLM-powered search agents are increasingly being used for multi-step information seeking tasks, yet the IR community lacks empirical understanding of how agentic search sessions unfold and how retrieved evidence is used. This paper presents a large-scale log analysis of agentic search based on 14.44M search requests (3.97M sessions) collected from DeepResearchGym, i.e. an open-source search API accessed by external agentic clients. We sessionize the logs, assign session-level intents and step-wise query-reformulation labels using LLM-based annotation, and propose Context-driven Term Adoption Rate (CTAR) to quantify whether newly introduced query terms are traceable to previously retrieved evidence. Our analyses reveal distinctive behavioral patterns. First, over 90% of multi-turn sessions contain at most ten steps, and 89% of inter-step intervals fall under one minute. Second, behavior varies by intent. Fact-seeking sessions exhibit high repetition that increases over time, while sessions requiring reasoning sustain broader exploration. Third, agents reuse evidence across steps. On average, 54% of newly introduced query terms appear in the accumulated evidence context, with contributions from earlier steps beyond the most recent retrieval. The findings suggest that agentic search may benefit from repetition-aware early stopping, intent-adaptive retrieval budgets, and explicit cross-step context tracking. We plan to release the anonymized logs to support future research.
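The CTAR idea, measuring what share of newly introduced query terms can be found in previously retrieved evidence, can be sketched as follows. The lowercase alphanumeric tokenizer, the treatment of "new" as "absent from all earlier queries in the session", and the function names are simplifying assumptions; the paper's exact definition (for example stop-word handling or stemming) may differ.

```python
import re

def tokenize(text):
    """Crude tokenizer: the set of lowercase alphanumeric terms (an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ctar(previous_queries, new_query, evidence_texts):
    """Fraction of newly introduced query terms that occur in accumulated evidence.

    Returns None when the new query introduces no new terms.
    """
    seen = set().union(*(tokenize(q) for q in previous_queries))
    new_terms = tokenize(new_query) - seen
    if not new_terms:
        return None
    evidence_vocab = set().union(*(tokenize(t) for t in evidence_texts))
    return len(new_terms & evidence_vocab) / len(new_terms)
```

With this naive tokenizer, inflected forms such as "cell" and "cells" do not match, so a real implementation would likely normalize terms before comparing them.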
[138] LegalMALR: Multi-Agent Query Understanding and LLM-Based Reranking for Chinese Statute Retrieval cs.IR | cs.CLPDF
Yunhan Li, Mingjie Xie, Gaoli Kang, Zihan Gong, Gengshen Wu
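TL;DR: This paper proposes LegalMALR, a framework for Chinese statute retrieval that tackles queries that are implicit, multi-issue, and colloquial. It combines a Multi-Agent Query Understanding System (MAS) with a zero-shot LLM-based reranking module (LLM Reranker). MAS, using a policy optimized by reinforcement learning, generates diverse, legally grounded reformulations and performs iterative retrieval to broaden candidate coverage; the LLM Reranker then applies natural-language legal reasoning to produce the final ranking.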
Details
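Motivation: Real-world legal queries are often implicit, multi-issue, and expressed in colloquial or underspecified forms, which makes it hard for conventional retrieval-augmented generation (RAG) pipelines to recover the required statutory elements. Dense retrievers focus mainly on the literal form of the query, while lightweight rerankers lack the legal reasoning needed to assess statutory applicability.
Result: Evaluated on the self-built CSAID dataset (118 difficult Chinese legal queries) and the public STARD benchmark. Experiments show LegalMALR substantially outperforms strong RAG baselines in both in-distribution and out-of-distribution settings.
Insight: The novelty is combining multi-perspective query interpretation, reinforcement-based policy optimization, and large-model reranking for statute retrieval. Specifically: 1) a multi-agent system generates diverse, legally grounded query reformulations; 2) Generalized Reinforcement Policy Optimization (GRPO) stabilizes the stochastic behavior of LLM-generated rewrites; 3) a zero-shot LLM reranker performs natural-language legal reasoning. This offers a reusable modular framework for retrieval in complex domains.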
Abstract: Statute retrieval is essential for legal assistance and judicial decision support, yet real-world legal queries are often implicit, multi-issue, and expressed in colloquial or underspecified forms. These characteristics make it difficult for conventional retrieval-augmented generation pipelines to recover the statutory elements required for accurate retrieval. Dense retrievers focus primarily on the literal surface form of the query, whereas lightweight rerankers lack the legal-reasoning capacity needed to assess statutory applicability. We present LegalMALR, a retrieval framework that integrates a Multi-Agent Query Understanding System (MAS) with a zero-shot large-language-model-based reranking module (LLM Reranker). MAS generates diverse, legally grounded reformulations and conducts iterative dense retrieval to broaden candidate coverage. To stabilise the stochastic behaviour of LLM-generated rewrites, we optimise a unified MAS policy using Generalized Reinforcement Policy Optimization (GRPO). The accumulated candidate set is subsequently evaluated by the LLM Reranker, which performs natural-language legal reasoning to produce the final ranking. We further construct CSAID, a dataset of 118 difficult Chinese legal queries annotated with multiple statutory labels, and evaluate LegalMALR on both CSAID and the public STARD benchmark. Experiments show that LegalMALR substantially outperforms strong retrieval-augmented generation (RAG) baselines in both in-distribution and out-of-distribution settings, demonstrating the effectiveness of combining multi-perspective query interpretation, reinforcement-based policy optimisation, and large-model reranking for statute retrieval.
[139] Capturing P: On the Expressive Power and Efficient Evaluation of Boolean Retrieval cs.IR | cs.AI | cs.CC | cs.CL | cs.DBPDF
Amir Aavani
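TL;DR: This paper proposes a new retrieval language, L_R, and an evaluation algorithm, ComputePN, to resolve the efficiency dilemma modern information retrieval faces when handling complex neuro-symbolic reasoning workflows. L_R is defined over Directed Acyclic Graphs (DAGs) and is proven to precisely capture the complexity class P. ComputePN combines native DAG traversal with a memory-efficient “Positive-Negative” response mechanism to ensure efficient evaluation of any query in L_R.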
Details
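Motivation: Current retrieval architectures face a fundamental efficiency dilemma when handling the rigorous logical and arithmetic constraints the new paradigm requires: iterator-based engines struggle with complex nested logic graphs and end up with intractable runtimes, while naive recursive approaches support these structures but consume prohibitive memory when enforcing broad logical exclusions.
Result: The paper proves that the proposed retrieval language L_R precisely captures the complexity class P, and introduces the ComputePN algorithm to make it tractable.
Insight: The core novelty is the notion of “Capturing P”: a retrieval engine should be able to evaluate any polynomial-time property efficiently and directly over its index. This is realized through the formal retrieval language L_R and the efficient ComputePN algorithm, laying the theoretical foundation for turning the search index into a general-purpose computational engine.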
Abstract: Modern information retrieval is transitioning from simple document filtering to complex, neuro-symbolic reasoning workflows. However, current retrieval architectures face a fundamental efficiency dilemma when handling the rigorous logical and arithmetic constraints required by this new paradigm. Standard iterator-based engines (Document-at-a-Time) do not natively support complex, nested logic graphs; forcing them to execute such queries typically results in intractable runtime performance. Conversely, naive recursive approaches (Term-at-a-Time), while capable of supporting these structures, suffer from prohibitive memory consumption when enforcing broad logical exclusions. In this paper, we propose that a retrieval engine must be capable of “Capturing $\mathbf{P}$”, that is, evaluating any polynomial-time property directly over its index in a computationally efficient manner. We define a formal Retrieval Language ($\mathcal{L}_R$) based on Directed Acyclic Graphs (DAGs) and prove it precisely captures the complexity class $\mathbf{P}$. We introduce ComputePN, a novel evaluation algorithm that makes $\mathcal{L}_R$ tractable. By combining native DAG traversal with a memory-efficient “Positive-Negative” response mechanism, ComputePN ensures the efficient evaluation of any query in $\mathcal{L}_R$. This work establishes the theoretical foundation for turning the search index into a general-purpose computational engine.
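One way to read the memory problem with broad logical exclusions is that a naive evaluator materializes the complement of a posting list. The sketch below avoids that by carrying every intermediate result as a pair `(negated, ids)`, meaning either the set `ids` or "everything except `ids`", and by memoizing shared DAG nodes so each is evaluated once. This is only a guess at the spirit of a positive/negative representation; the node encoding and all function names are assumptions, and the actual ComputePN algorithm may work quite differently.

```python
def pn_not(a):
    """Negation flips the flag and never touches the document set."""
    negated, ids = a
    return (not negated, ids)

def pn_and(a, b):
    """Intersect two results without materializing any complement."""
    (na, x), (nb, y) = a, b
    if not na and not nb:
        return (False, x & y)
    if not na and nb:
        return (False, x - y)
    if na and not nb:
        return (False, y - x)
    return (True, x | y)  # (not X) and (not Y) == not (X or Y)

def pn_or(a, b):
    """Union via De Morgan: A or B == not (not A and not B)."""
    return pn_not(pn_and(pn_not(a), pn_not(b)))

def evaluate(dag, index, root):
    """Evaluate a query DAG; nodes are ('term', t), ('not', c), ('and', l, r), ('or', l, r)."""
    memo = {}

    def visit(name):
        if name in memo:  # shared sub-expressions are computed once
            return memo[name]
        node = dag[name]
        op = node[0]
        if op == "term":
            result = (False, frozenset(index.get(node[1], ())))
        elif op == "not":
            result = pn_not(visit(node[1]))
        elif op == "and":
            result = pn_and(visit(node[1]), visit(node[2]))
        elif op == "or":
            result = pn_or(visit(node[1]), visit(node[2]))
        else:
            raise ValueError(f"unknown operator: {op}")
        memo[name] = result
        return result

    return visit(root)
```

A result whose flag is still negated at the root denotes "all documents except these", which a caller can resolve against the collection only when a concrete list is actually needed.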
cs.LG [Back]
[140] TelcoAI: Advancing 3GPP Technical Specification Search through Agentic Multi-Modal Retrieval-Augmented Generation cs.LG | cs.AI | cs.CL | cs.CV | cs.IR | cs.MMPDF
Rahul Ghosh, Chun-Hao Liu, Gaurav Rele, Vidya Sagar Ravipati, Hazar Aouad
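TL;DR: This paper proposes TelcoAI, an agentic, multi-modal retrieval-augmented generation system for 3GPP technical specifications. By introducing section-aware chunking, structured query planning, metadata-guided retrieval, and multi-modal fusion of text and diagrams, it addresses complex queries over technical documents, visual information processing, and understanding of cross-document dependencies.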
Details
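Motivation: 3GPP technical specifications have complex structure, dense formatting, and multi-modal content, and existing LLM approaches struggle with complex queries, visual information, and document interdependencies. A retrieval and generation system tailored to such technical documents is therefore needed.
Result: On multiple benchmarks, including expert-curated queries, the system achieves 87% recall, 83% claim recall, and 92% faithfulness, a 16% improvement over state-of-the-art baselines.
Insight: The novelty is combining an agentic architecture with multi-modal RAG, using structured query planning and multi-modal fusion to improve understanding and retrieval of technical documents, which provides a practical solution for real telecommunications engineering.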
Abstract: The 3rd Generation Partnership Project (3GPP) produces complex technical specifications essential to global telecommunications, yet their hierarchical structure, dense formatting, and multi-modal content make them difficult to process. While Large Language Models (LLMs) show promise, existing approaches fall short in handling complex queries, visual information, and document interdependencies. We present TelcoAI, an agentic, multi-modal Retrieval-Augmented Generation (RAG) system tailored for 3GPP documentation. TelcoAI introduces section-aware chunking, structured query planning, metadata-guided retrieval, and multi-modal fusion of text and diagrams. Evaluated on multiple benchmarks, including expert-curated queries, our system achieves 87% recall, 83% claim recall, and 92% faithfulness, representing a 16% improvement over state-of-the-art baselines. These results demonstrate the effectiveness of agentic and multi-modal reasoning in technical document understanding, advancing practical solutions for real-world telecommunications research and engineering.
[141] SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving cs.LG | cs.CL | cs.CVPDF
Ashutosh Bajpai, Akshat Bhandari, Akshay Nambi, Tanmoy Chakraborty
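TL;DR: This paper proposes SpatialMath, a framework that infuses spatial comprehension into symbolic reasoning chains to improve how multimodal small-to-medium sized language models solve vision-intensive mathematical problems, especially geometry. It uses a specialized perception module to extract spatial representations from visual diagrams and introduces the MATHVERSE-PLUS dataset for validation.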
Details
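Motivation: Current multimodal small-to-medium sized language models are limited in visual comprehension and mathematical reasoning. On geometry problems with complex visual information in particular, they struggle to decompose visual inputs accurately and to connect perception with structured reasoning, which results in poor performance.
Result: In vision-intensive settings, SpatialMath significantly outperforms strong multimodal baselines, with gains of up to 10 percentage points over supervised fine-tuning with data augmentation.
Insight: The novelty is a symbolic reasoning framework infused with spatial comprehension, in which a structured perception module extracts spatial representations and feeds them into the reasoning chain. It underlines how much a structured perception-to-reasoning pipeline matters for the accuracy of multimodal models on mathematical problems.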
Abstract: Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning, particularly in geometric problems with diverse levels of visual infusion. Current models struggle to accurately decompose intricate visual inputs and connect perception with structured reasoning, leading to suboptimal performance. To address these challenges, we propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains. SpatialMath employs a specialized perception module to extract spatially-grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships. These representations are then methodically infused into symbolic reasoning chains, facilitating visual comprehension-aware structured reasoning. To this end, we introduce MATHVERSE-PLUS, a novel dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive mathematical problems. SpatialMath significantly outperforms strong multimodal baselines, achieving up to 10 percentage points improvement over supervised fine-tuning with data augmentation in vision-intensive settings. Robustness analysis reveals that enhanced spatial representations directly improve reasoning accuracy, reinforcing the need for structured perception-to-reasoning pipelines in MSLMs.
[142] PEARL: Prototype-Enhanced Alignment for Label-Efficient Representation Learning with Deployment-Driven Insights from Digital Governance Communication Systems cs.LG | cs.AI | cs.CL | cs.IRPDF
Ruiyu Zhang, Lin Nie, Wai-Fung Lam, Qihao Wang, Xin Zhao
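TL;DR: The paper proposes PEARL (Prototype-Enhanced Aligned Representation Learning), a label-efficient method for improving embedding-based nearest-neighbor retrieval and similarity search in deployed systems where labels are scarce, domains drift, and the base encoder cannot be retrained. Using limited supervision, it softly aligns embeddings toward class prototypes to reshape local neighborhood geometry, aiming to bridge the gap between unsupervised post-processing and fully supervised projection.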
Details
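Motivation: In deployed systems such as digital governance platforms, new text inputs are usually handled by retrieving similar past cases (for example, when routing and replying to citizen messages). Failures often stem not from the language model itself but from nearest neighbors in the embedding space that correspond to the wrong cases. Because labels are scarce, domains drift, and retraining the base encoder is expensive, downstream performance depends heavily on embedding geometry, yet raw embeddings are often poorly aligned with the local neighborhood structure that nearest-neighbor retrieval needs.
Result: Evaluated under controlled label regimes ranging from extreme label scarcity to higher-label settings. In the label-scarce condition, PEARL substantially improves local neighborhood quality, with gains of 25.7% over raw embeddings and more than 21.1% over strong unsupervised post-processing, precisely the regime where similarity-based systems are most brittle.
Insight: The novelty is a label-efficient prototype-enhanced alignment method that reshapes local neighborhood geometry by softly aligning embeddings toward class prototypes, while preserving dimensionality and avoiding aggressive projection or collapse. It provides a practical and effective way to improve the local structure of embeddings in deployments where large amounts of labeled data are unavailable.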
Abstract: In many deployed systems, new text inputs are handled by retrieving similar past cases, for example when routing and responding to citizen messages in digital governance platforms. When these systems fail, the problem is often not the language model itself, but that the nearest neighbors in the embedding space correspond to the wrong cases. Modern machine learning systems increasingly rely on fixed, high-dimensional embeddings produced by large pretrained models and sentence encoders. In real-world deployments, labels are scarce, domains shift over time, and retraining the base encoder is expensive or infeasible. As a result, downstream performance depends heavily on embedding geometry. Yet raw embeddings are often poorly aligned with the local neighborhood structure required by nearest-neighbor retrieval, similarity search, and lightweight classifiers that operate directly on embeddings. We propose PEARL (Prototype-Enhanced Aligned Representation Learning), a label-efficient approach that uses limited supervision to softly align embeddings toward class prototypes. The method reshapes local neighborhood geometry while preserving dimensionality and avoiding aggressive projection or collapse. Its aim is to bridge the gap between purely unsupervised post-processing, which offers limited and inconsistent gains, and fully supervised projections that require substantial labeled data. We evaluate PEARL under controlled label regimes ranging from extreme label scarcity to higher-label settings. In the label-scarce condition, PEARL substantially improves local neighborhood quality, yielding 25.7% gains over raw embeddings and more than 21.1% gains relative to strong unsupervised post-processing, precisely in the regime where similarity-based systems are most brittle.
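A minimal sketch of soft alignment toward class prototypes follows, assuming prototypes are per-class mean embeddings and that each vector is nudged toward a softmax-weighted blend of prototypes. The mixing weight `alpha`, the `temperature`, and all function names are hypothetical; the abstract does not give PEARL's actual objective or update rule. The output keeps the input dimensionality, matching the stated goal of avoiding projection.

```python
import math

def mean(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def build_prototypes(labeled):
    """labeled: dict mapping class label -> list of embedding vectors."""
    return {label: mean(vs) for label, vs in labeled.items()}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def soft_align(x, protos, alpha=0.3, temperature=0.1):
    """Move x part of the way toward a similarity-weighted mix of prototypes."""
    labels = list(protos)
    sims = [cosine(x, protos[l]) / temperature for l in labels]
    m = max(sims)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in sims]
    z = sum(weights)
    target = [sum(w / z * protos[l][i] for w, l in zip(weights, labels))
              for i in range(len(x))]
    return [(1 - alpha) * xi + alpha * ti for xi, ti in zip(x, target)]
```

Keeping `alpha` well below 1 leaves most of the original embedding intact, which is one simple way to avoid collapsing all points of a class onto its prototype.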
[143] Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction cs.LG | cs.CLPDF
Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun
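TL;DR: This paper proposes Fast KVzip, a new gated KV cache eviction method for managing the key-value cache efficiently and accurately during LLM inference. It introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, evicting up to 70% of the cache while keeping performance near-lossless at negligible computational cost.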
Details
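Motivation: Existing KV cache compression techniques usually trade performance degradation against computational overhead, which hinders practical deployment of LLMs. The paper aims to resolve this efficiency-accuracy trade-off and provide efficient KV cache management for frozen-weight LLMs.
Result: Extensive experiments on the Qwen2.5-1M, Qwen3, and Gemma3 families show near-lossless performance while evicting up to 70% of the KV cache. Results are consistent across long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of the method.
Insight: The core novelty is a gating-based KV eviction mechanism whose gate training relies only on LLM forward passes, avoiding expensive backpropagation, and which generalizes well across tasks through a task-agnostic reconstruction objective. The method integrates seamlessly into both prefill and decoding, achieving high compression ratios at very low computational cost.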
Abstract: Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
[144] AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation cs.LG | cs.AI | cs.CLPDF
Dongjie Cheng, Ruifeng Yuan, Yongqi Li, Runyang You, Wenjie Wang
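TL;DR: This paper presents AR-Omni, a unified any-to-any generation model in the autoregressive paradigm. With a single Transformer decoder, a single token stream, and a next-token prediction objective, it generates text, images, and streaming speech without relying on additional expert decoder components.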
Details
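Motivation: Real-world perception and interaction are inherently multimodal, yet most existing “Omni” multimodal LLMs that support multimodal inputs and outputs still rely on additional expert components for multimodal generation, which limits the simplicity and scalability of unified training and inference. The paper explores the potential of purely autoregressive modeling for multimodal generation.
Result: Experiments show AR-Omni achieves strong generation quality across text, image, and speech while remaining real-time, with a real-time factor of 0.88 for speech generation.
Insight: The novelties are: 1) a unified multimodal generation architecture based purely on an autoregressive decoder; 2) task-aware loss reweighting to address modality imbalance; 3) a lightweight token-level perceptual alignment loss to improve image fidelity; 4) a finite-state decoding mechanism to balance generation stability and creativity. The core idea is to unify different modalities into a single token sequence for autoregressive modeling, which simplifies system design.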
Abstract: Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of “Omni” MLLMs that support both multimodal inputs and multimodal outputs. While a sequence of omni MLLMs has emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, is an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and stability-creativity trade-offs via a finite-state decoding mechanism. Empirically, AR-Omni achieves strong quality across three modalities while remaining real-time, achieving a 0.88 real-time factor for speech generation.
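[145] FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning cs.LG | cs.CL
Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang
TL;DR: This paper presents FP8-RL, a practical and stable low-precision rollout stack for LLM reinforcement learning. It enables W8A8 linear-layer rollout with blockwise FP8 quantization, extends FP8 to the KV cache to remove long-context memory bottlenecks, and mitigates train-inference mismatch with importance-sampling-based rollout correction. Across dense and MoE models it delivers up to 44% higher rollout throughput while keeping learning behavior comparable to BF16 baselines.
Motivation: In LLM reinforcement learning, the rollout (generation) stage dominates end-to-end step time because long output sequences make attention and KV-cache memory the bottleneck. FP8 can accelerate rollout by reducing compute cost and memory traffic, but applying it in RL raises distinct engineering and algorithmic challenges: policy weights change every step and must be requantized repeatedly, and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability.
Result: Across dense and mixture-of-experts models, the proposed techniques deliver up to 44% rollout throughput gains while keeping learning behavior comparable to BF16 baselines.
Insight: The contributions are W8A8 linear-layer rollout with blockwise FP8 quantization; FP8 KV cache via per-step QKV scale recalibration, which addresses the long-context memory bottleneck; and importance-sampling-based rollout correction (token-level TIS/MIS variants) to mitigate train-inference mismatch. The methods are implemented in the veRL ecosystem with support for common training backends and inference engines, giving a stable and practical solution for low-precision RL.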
Abstract: Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
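The abstract names token-level TIS/MIS variants without expanding them. Assuming TIS refers to truncated importance sampling, a minimal sketch of the per-token correction weight looks like the following. The function name and the `cap` value are hypothetical, and this is not the veRL implementation.

```python
import math


def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio pi_train / pi_rollout, clipped from above at `cap`.

    train_logprobs   -- log-probabilities of the sampled tokens under the
                        higher-precision trainer policy
    rollout_logprobs -- log-probabilities of the same tokens under the
                        low-precision rollout policy
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    return [
        min(math.exp(t - r), cap)
        for t, r in zip(train_logprobs, rollout_logprobs)
    ]


# Token 0: both policies agree, so the weight is 1.
# Token 1: the rollout policy badly under-weights the token, so the raw
# ratio is large and gets truncated at the cap.
print(truncated_is_weights([0.0, 0.0], [0.0, -5.0], cap=2.0))
```

Truncating the ratio bounds the variance that a few badly mismatched tokens would otherwise inject into the policy-gradient estimate.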
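[146] PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR cs.LG | cs.AI | cs.CL | cs.IR
James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano
TL;DR: This paper presents PaperSearchQA, a reinforcement learning environment for searching and reasoning over scientific papers. It comprises a search corpus of 16 million biomedical paper abstracts and a factoid QA dataset with 60k samples. Search agents trained in this environment with RLVR outperform non-RL retrieval baselines and exhibit behaviors such as planning, reasoning, and self-verification.
Motivation: Existing RLVR search agents mostly target general-domain QA and are of limited use in specialized fields such as science, engineering, and medicine. The paper trains agents to search and reason over scientific papers in order to test technical question answering and to support future AI Scientist systems.
Result: On the PaperSearchQA dataset, the trained search agents outperform non-RL retrieval baselines. The authors also report further quantitative analysis and observe agent behaviors such as planning, reasoning, and self-verification.
Insight: The contribution is a reinforcement learning environment dedicated to scientific paper search and reasoning (PaperSearchQA), including a large-scale corpus and a challenging dataset, together with evidence that RLVR is effective for technical-domain QA. The data creation method is scalable and can readily be extended to other scientific domains.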
Abstract: Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers – this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation methods are scalable and easily extendable to other scientific domains.
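[147] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models cs.LG | cs.CL
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang
TL;DR: This paper proposes On-Policy Self-Distillation (OPSD), a framework for improving the reasoning ability of large language models (LLMs). The same model plays both teacher and student: the teacher policy conditions on privileged information (such as verified reasoning traces), while the student policy sees only the question. Training minimizes the per-token divergence between the two policies over the student's own rollouts.
Motivation: Existing distillation methods have two limitations: on-policy distillation needs a separate and usually larger teacher model, and it does not exploit the ground-truth solutions available in reasoning datasets. The paper builds on the intuition that a sufficiently capable LLM can use external privileged reasoning traces to teach itself, that is, the version of itself without access to the privileged information.
Result: The method is validated on multiple mathematical reasoning benchmarks, achieving 4-8x better token efficiency than reinforcement learning methods such as GRPO and outperforming off-policy distillation methods.
Insight: The main contribution is the OPSD framework, which removes the need for a separate teacher model and folds the dataset's ground-truth solutions into training as privileged information, improving reasoning by letting the model teach itself. It is an efficient and novel distillation paradigm.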
Abstract: Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student’s own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
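The per-token objective can be sketched with plain probability lists. The abstract does not state which divergence or direction is used, so the sketch assumes KL(teacher || student); the function names are hypothetical, and a real implementation would work on model logits rather than explicit distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def per_token_distillation_loss(teacher_dists, student_dists):
    """Average KL(teacher || student) over the tokens of one student rollout.

    teacher_dists -- next-token distributions from the model conditioned on
                     question + privileged trace (assumed input)
    student_dists -- next-token distributions from the same model
                     conditioned on the question only (assumed input)
    """
    if len(teacher_dists) != len(student_dists):
        raise ValueError("both lists must cover the same rollout positions")
    if not teacher_dists:
        raise ValueError("rollout must contain at least one token")
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Toy two-token rollout over a two-word vocabulary.
teacher = [[0.9, 0.1], [0.5, 0.5]]
student = [[0.5, 0.5], [0.5, 0.5]]
print(per_token_distillation_loss(teacher, student))
```

Because both distributions are evaluated on tokens the student itself sampled, the supervision stays on-policy even though the teacher sees extra context.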
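[148] Attention-Based Variational Framework for Joint and Individual Components Learning with Applications in Brain Network Analysis cs.LG | cs.CV | stat.ML
Yifei Zhang, Meimei Liu, Zhengwu Zhang
TL;DR: This paper proposes the Cross-Modal Joint-Individual Variational Network (CM-JIVNet), a unified probabilistic framework for learning factorized latent representations from paired structural connectivity (SC) and functional connectivity (FC) brain network data. The model uses a multi-head attention fusion module to capture nonlinear cross-modal dependencies while isolating independent, modality-specific signals. Validation on the HCP-YA dataset shows strong performance in cross-modal reconstruction and behavioral trait prediction.
Motivation: The goal is to integrate SC and FC, two distinct yet complementary imaging modalities, to uncover the cross-modal patterns that drive behavioral phenotypes. Integration is currently hindered by the high dimensionality and nonlinearity of the data, the complex nonlinear SC-FC coupling, and the difficulty of separating shared information from modality-specific variation.
Result: Validated on Human Connectome Project Young Adult (HCP-YA) data, CM-JIVNet shows superior performance in cross-modal reconstruction and behavioral trait prediction.
Insight: The contribution is a unified probabilistic variational framework (CM-JIVNet) that models nonlinear cross-modal dependencies through a multi-head attention fusion module and disentangles the joint and individual feature spaces, offering an interpretable and scalable solution for large-scale multimodal brain analysis.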
Abstract: Brain organization is increasingly characterized through multiple imaging modalities, most notably structural connectivity (SC) and functional connectivity (FC). Integrating these inherently distinct yet complementary data sources is essential for uncovering the cross-modal patterns that drive behavioral phenotypes. However, effective integration is hindered by the high dimensionality and non-linearity of connectome data, complex non-linear SC-FC coupling, and the challenge of disentangling shared information from modality-specific variations. To address these issues, we propose the Cross-Modal Joint-Individual Variational Network (CM-JIVNet), a unified probabilistic framework designed to learn factorized latent representations from paired SC-FC datasets. Our model utilizes a multi-head attention fusion module to capture non-linear cross-modal dependencies while isolating independent, modality-specific signals. Validated on Human Connectome Project Young Adult (HCP-YA) data, CM-JIVNet demonstrates superior performance in cross-modal reconstruction and behavioral trait prediction. By effectively disentangling joint and individual feature spaces, CM-JIVNet provides a robust, interpretable, and scalable solution for large-scale multimodal brain analysis.
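[149] Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability cs.LG | cs.CL
Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier
TL;DR: This paper proposes SOAR, a self-improvement framework based on meta-reinforcement learning that lets a pretrained large language model (LLM) generate an automated curriculum to break through its own learning plateau on hard reasoning tasks. A teacher copy of the model generates synthetic problems for a student copy and is rewarded according to the student's progress on a small subset of hard problems, which enables learning on mathematical benchmarks where the initial success rate is extremely low (e.g., 0/128).
Motivation: On hard reasoning datasets with very low initial success rates, RL-based finetuning stalls because the training signal is sparse. The paper asks whether a pretrained LLM can use its latent knowledge to generate an effective automated curriculum for problems it cannot yet solve.
Result: Experiments on the hardest subsets of mathematical benchmarks (e.g., MATH), where the initial success rate is zero, show that SOAR realizes bi-level meta-RL that unlocks learning under sparse, binary rewards; that rewards grounded in student progress outperform the intrinsic reward schemes used in prior LLM self-play, avoiding instability and diversity collapse; and that the structural quality and well-posedness of the generated questions matter more for learning progress than the correctness of their solutions.
Insight: The core contribution is a curriculum generation framework (SOAR) whose reward is grounded in measured student progress. It shows that a model's ability to generate useful “stepping stones” does not depend on already being able to solve the hard problems, which offers a principled path past reasoning plateaus without additional curated data. It also highlights that, in curriculum generation, the structural quality of a problem matters more than the correctness of its answer.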
Abstract: Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.
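[150] POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration cs.LG | cs.AI | cs.CL
Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar
TL;DR: This paper proposes Privileged On-Policy Exploration (POPE), a method that addresses the exploration problem reinforcement learning faces when training large language models for complex reasoning. POPE uses human or other oracle solutions as privileged information to guide exploration on hard problems rather than as direct training targets, so that guided rollouts obtain non-zero rewards, and the learned behaviors then transfer back to the original, unguided problems.
Motivation: Current RL-based reasoning methods often fail to explore effectively on hard problems, which yields zero reward and stalls learning. Mixing easy and hard problems during training is counterproductive because of optimization interference, so a solution is needed that guides exploration effectively and promotes transfer.
Result: On challenging reasoning benchmarks, POPE substantially improves performance and expands the set of solvable problems.
Insight: The contribution is to treat oracle solutions as “privileged information” that guides exploration rather than as training targets: prepending solution prefixes makes rewards obtainable, and the synergy between instruction-following and reasoning transfers the resulting behaviors to the original problems. This offers a novel and effective way around RL's exploration bottleneck on hard problems.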
Abstract: Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
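The prefix augmentation step can be sketched as simple prompt construction. The prompt template, the whitespace tokenization, and the names `build_guided_prompt` and `prefix_fraction` are hypothetical stand-ins for illustration; the abstract does not specify how prefixes are chosen or formatted.

```python
def build_guided_prompt(problem: str, oracle_solution: str, prefix_fraction: float) -> str:
    """Attach the first part of an oracle solution to a hard problem.

    prefix_fraction -- share of the oracle solution (in whitespace-separated
                       tokens) revealed as guidance; 0.0 gives the unguided
                       problem
    """
    if not 0.0 <= prefix_fraction <= 1.0:
        raise ValueError("prefix_fraction must be in [0, 1]")
    tokens = oracle_solution.split()
    keep = int(len(tokens) * prefix_fraction)
    if keep == 0:
        return problem  # unguided: identical to the original problem
    prefix = " ".join(tokens[:keep])
    # Hypothetical template; the policy is asked to continue from the prefix.
    return f"{problem}\n\nPartial solution to continue from:\n{prefix}"


problem = "Find all integers n such that n^2 + n is divisible by 6."
oracle = "Factor n^2 + n as n(n + 1). The product is always even, so check divisibility by 3."
print(build_guided_prompt(problem, oracle, prefix_fraction=0.5))
```

The policy's completion is still sampled and rewarded on-policy; only the conditioning context changes, which is what separates this from training directly on the oracle solution.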
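[151] Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes cs.LG | cs.AI | cs.CL
Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie
TL;DR: This paper proposes PrefixRL, which reuses prefixes of off-policy traces to make reinforcement learning for LLMs more efficient on hard reasoning problems. It avoids the instability of standard off-policy methods, and the authors show advantages in sample efficiency and generalization.
Motivation: On hard reasoning problems, standard RL methods stall and waste compute because correct traces are rare and policy gradients vanish.
Result: On hard reasoning tasks, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data followed by RL), even after accounting for the compute spent on the initial rejection sampling, and raises the final reward by 3x. The gains transfer to held-out benchmarks, and the method remains effective when the off-policy traces come from a different model family.
Insight: The contribution is to condition on prefixes of successful off-policy traces and run on-policy RL to complete them, which sidesteps off-policy instabilities. The prefix length modulates problem difficulty and so strengthens the learning signal. The paper also reports back-generalization: training only on prefixed problems generalizes to out-of-distribution, unprefixed performance.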
Abstract: Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
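[152] Systematic Characterization of Minimal Deep Learning Architectures: A Unified Analysis of Convergence, Pruning, and Quantization cs.LG | cs.CV
Ziwei Zheng, Huizhi Liang, Vaclav Snasel, Vito Latora, Panos Pardalos
TL;DR: This paper presents a computational methodology for systematically exploring and analyzing the relationships among convergence, pruning, and quantization in deep learning architectures. A structured design sweep over a large set of architectures is followed by an evaluation of convergence behavior, pruning sensitivity, and quantization robustness on representative models. Despite architectural diversity, performance is largely invariant, and learning dynamics consistently show three regimes: unstable, learning, and overfitting.
Motivation: Identifying minimal architectures that reliably solve a task remains challenging. The paper aims to understand the interplay among convergence, pruning, and quantization systematically, and to guide the selection of compact, stable models under pruning and low-precision constraints.
Result: Experiments on well-known image classification tasks of increasing complexity, covering deep neural networks, convolutional neural networks, and Vision Transformers, show that deeper architectures are more resilient to pruning than shallower ones, with parameter redundancy as high as 60%, and that quantization hurts models with fewer learnable parameters more severely and has a larger effect on harder image datasets.
Insight: The contribution is a unified analysis framework for characterizing minimal deep learning architectures. It reveals a three-regime learning dynamic shared across architectures and quantifies how pruning resilience and quantization sensitivity relate to depth and parameter count, giving actionable guidance for model compression and deployment.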
Abstract: Deep learning networks excel at classification, yet identifying minimal architectures that reliably solve a task remains challenging. We present a computational methodology for systematically exploring and analyzing the relationships among convergence, pruning, and quantization. The workflow first performs a structured design sweep across a large set of architectures, then evaluates convergence behavior, pruning sensitivity, and quantization robustness on representative models. Focusing on well-known image classification of increasing complexity, and across Deep Neural Networks, Convolutional Neural Networks, and Vision Transformers, our initial results show that, despite architectural diversity, performance is largely invariant and learning dynamics consistently exhibit three regimes: unstable, learning, and overfitting. We further characterize the minimal learnable parameters required for stable learning, uncover distinct convergence and pruning phases, and quantify the effect of reduced numeric precision on trainable parameters. Aligning with intuition, the results confirm that deeper architectures are more resilient to pruning than shallower ones, with parameter redundancy as high as 60%, and quantization impacts models with fewer learnable parameters more severely and has a larger effect on harder image datasets. These findings provide actionable guidance for selecting compact, stable models under pruning and low-precision constraints in image classification.
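To make "parameter redundancy as high as 60%" concrete, the sketch below zeroes the 60% of weights with the smallest magnitude. The abstract does not say which pruning criterion the authors use; global magnitude pruning is swapped in here as a common baseline, and `magnitude_prune` is a hypothetical helper operating on a flat list of weights.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_prune = round(len(weights) * fraction)
    # Positions ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical layer with five weights; pruning 60% removes three of them.
layer = [0.1, -2.0, 0.05, 1.5, -0.3]
print(magnitude_prune(layer, fraction=0.6))
```

A model would count as redundant at this level if its accuracy after such pruning stays close to the unpruned accuracy.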
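[153] Closing the Modality Gap Aligns Group-Wise Semantics cs.LG | cs.CV
Eleonora Grassucci, Giordano Cicchetti, Emanuele Frasca, Aurelio Uncini, Danilo Comminiello
TL;DR: This paper studies the modality gap in multimodal learning. Methods such as CLIP achieve semantic alignment, but the latent space structures of different modalities remain mismatched. The paper proposes a new method to reduce this gap and shows that doing so significantly improves group-level tasks (e.g., clustering), while gains on instance-level tasks (e.g., retrieval) are limited.
Motivation: The paper examines how the modality gap affects multimodal learning, focusing on group-level tasks (e.g., clustering) rather than instance-level tasks (e.g., retrieval), to address the weak performance of existing methods on semantic grouping.
Result: Extensive evaluation shows that reducing the modality gap significantly improves group-level tasks (e.g., clustering) but brings only marginal or inconsistent improvements on traditional instance-level tasks (e.g., retrieval).
Insight: The contribution is to show that the modality gap has a key impact on group-level tasks and to propose a simple, effective method for reducing it. This gives a new perspective on semantic alignment in multimodal learning and may reshape how the role of the modality gap is understood.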
Abstract: In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general n-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
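One common way to quantify the modality gap is the distance between the centroids of the normalized embeddings of each modality. The abstract does not say which measure the authors use, so the sketch below is an assumption for illustration, and all function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def modality_gap(image_embeddings, text_embeddings):
    """Euclidean distance between the centroids of the two modalities."""
    image_center = centroid([l2_normalize(v) for v in image_embeddings])
    text_center = centroid([l2_normalize(v) for v in text_embeddings])
    return math.dist(image_center, text_center)


# Toy 2-D embeddings: each modality occupies its own region of the space.
images = [[1.0, 0.1], [0.9, 0.2]]
texts = [[0.1, 1.0], [0.2, 0.9]]
print(modality_gap(images, texts))
```

A value near zero would mean the two modalities share the same region of the latent space on average, although matching centroids alone does not guarantee that the finer structure is aligned.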
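[154] Counterfactual Explanations on Robust Perceptual Geodesics cs.LG | cs.CV | cs.HC | math.DG
Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta
TL;DR: This paper proposes Perceptual Counterfactual Geodesics (PCG), a method for generating counterfactual explanations. It constructs counterfactuals by tracing geodesics under a perceptual Riemannian metric induced from robust vision features, producing smooth, on-manifold, semantically valid transitions and avoiding the off-manifold artifacts, semantic drift, and adversarial collapse that unsuitable geometries cause in existing methods.
Motivation: Existing latent-space optimization methods for counterfactual explanations (minimal semantic perturbations that change a model's prediction) are ambiguous in their choice of distance metric, so the perturbations may be meaningless or adversarial. These methods typically adopt flat or misaligned geometries, leading to off-manifold artifacts, semantic drift, or adversarial collapse. The paper addresses this by introducing a geometry aligned with human perception.
Result: Experiments on three vision datasets show that PCG outperforms baseline methods and reveals model failure modes that are hidden under standard metrics.
Insight: The core contribution is to define the geometry through a perceptual Riemannian metric induced from robust vision features. This geometry aligns with human perception and penalizes brittle directions, which ensures the semantic validity of the counterfactual explanations and offers a new geometric perspective for building more reliable, interpretable model explanations.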
Abstract: Latent-space optimization methods for counterfactual explanations - framed as minimal semantic perturbations that change model predictions - inherit the ambiguity of Wachter et al.’s objective: the choice of distance metric dictates whether perturbations are meaningful or adversarial. Existing approaches adopt flat or misaligned geometries, leading to off-manifold artifacts, semantic drift, or adversarial collapse. We introduce Perceptual Counterfactual Geodesics (PCG), a method that constructs counterfactuals by tracing geodesics under a perceptually Riemannian metric induced from robust vision features. This geometry aligns with human perception and penalizes brittle directions, enabling smooth, on-manifold, semantically valid transitions. Experiments on three vision datasets show that PCG outperforms baselines and reveals failure modes hidden under standard metrics.
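cs.HC
[155] Memento: Towards Proactive Visualization of Everyday Memories with Personal Wearable AR Assistant cs.HC | cs.CL | cs.IR
Yoonsang Kim, Yalong Yang, Arie E. Kaufman
TL;DR: Memento is a conversational augmented reality assistant that continuously records users' verbal queries together with their spatiotemporal and activity contexts to form long-term memories. By analyzing the links between users' recurring interests and the contexts that trigger them, the system proactively recalls those interests when it detects a similar context and delivers up-to-date responses through AR, integrating the AR experience into daily life.
Motivation: Existing AR interactions are mostly transient events that lack long-term coherence. The paper addresses this by building a persistent memory system grounded in multimodal context (visual, spatial, temporal, and embodied) to enable a proactive, context-aware AR assistant.
Result: A preliminary evaluation gathers feedback from users with varying experience in immersive applications, explores the value of a proactive, context-aware AR assistant in everyday settings, and summarizes the findings and challenges in designing such a system.
Insight: The contribution is to treat each interaction as part of a sequence with long-term coherence rather than as an isolated event, using a memory mechanism to link user interests with multimodal context and to shift from reactive response to proactive service. The multimodal context fusion and long-term memory modeling are relevant to context-aware computing systems more broadly.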
Abstract: We introduce Memento, a conversational AR assistant that permanently captures and memorizes users' verbal queries alongside their spatiotemporal and activity contexts. By storing these “memories,” Memento discovers connections between users' recurring interests and the contexts that trigger them. Upon detection of similar or identical spatiotemporal activity, Memento proactively recalls user interests and delivers up-to-date responses through AR, seamlessly integrating the AR experience into their daily routine. Unlike prior work, each interaction in Memento is not a transient event, but part of a connected series of interactions with a coherent long-term perspective, tailored to the user's broader multimodal (visual, spatial, temporal, and embodied) context. We conduct a preliminary evaluation through user feedback from participants with diverse expertise in immersive apps, and explore the value of a proactive, context-aware AR assistant in everyday settings. We share our findings and challenges in designing a proactive, context-aware AR system.
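[156] PaperTok: Exploring the Use of Generative AI for Creating Short-form Videos for Research Communication cs.HC | cs.AI | cs.CL
Meziah Ruby Cristobal, Hyeonjeong Byeon, Tze-Yu Chen, Ruoxi Shang, Donghoon Shin
TL;DR: This paper introduces PaperTok, a system that uses generative AI to turn academic papers into short-form video content so that researchers can communicate their work more efficiently. The system generates scripts and audiovisual material and lets users refine them further, lowering the barrier to creation.
Motivation: Researchers lack the time and skills to produce engaging short-form videos for disseminating their research, so tools are needed to simplify the process.
Result: A mixed-methods user study (N=18) and a crowdsourced evaluation (N=100) show that PaperTok's workflow helps researchers create short-form videos that are both engaging and informative.
Insight: The system applies generative AI to an end-to-end creation workflow from paper to short-form video, with refinement through user interaction, and offers design implications for future generative tools that support science communication, such as the need for more fine-grained controls.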
Abstract: The dissemination of scholarly research is critical, yet researchers often lack the time and skills to create engaging content for popular media such as short-form videos. To address this gap, we explore the use of generative AI to help researchers transform their academic papers into accessible video content. Informed by a formative study with science communicators and content creators (N=8), we designed PaperTok, an end-to-end system that automates the initial creative labor by generating script options and corresponding audiovisual content from a source paper. Researchers can then refine based on their preferences with further prompting. A mixed-methods user study (N=18) and crowdsourced evaluation (N=100) demonstrate that PaperTok’s workflow can help researchers create engaging and informative short-form videos. We also identified the need for more fine-grained controls in the creation process. To this end, we offer implications for future generative tools that support science outreach.
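[157] Acoustic Field Video for Multimodal Scene Understanding cs.HC | cs.CV | cs.RO
Daehwa Kim, Chris Harrison
TL;DR: This paper proposes acoustic field video, a new multimodal input representation for improving the scene understanding of vision-language models (VLMs). It uses low-cost beamforming microphone arrays to generate, in real time, a spatial visualization of sound intensity, giving the model spatial acoustic information that conventional video (RGB with stereo or mono audio) lacks. On an evaluation set of 402 question-answer scenes, adding acoustic field video raises the accuracy of a state-of-the-art VLM from 38.3% to 67.4%.
Motivation: Multimodal scene understanding based on vision and audio still suffers from insufficient information when it relies on conventional video input alone. Low-cost beamforming microphone arrays are already common in smart speakers and increasingly present in robots and XR headsets, yet their sensing capability has not been used for scene understanding. The paper explores spatial acoustic information (acoustic field video) as a new input dimension to compensate for the limits of existing modalities.
Result: On the constructed evaluation set of 402 question-answer scenes, pairing acoustic field video with a state-of-the-art VLM raises accuracy from 38.3% to 67.4% compared with conventional video input alone, a clear and consistent improvement.
Insight: The contribution is to feed acoustic field video (a visualization of spatial acoustic information) to vision-language models as a multimodal input, providing a new and powerful perceptual dimension. It makes use of sensing hardware that is already deployed but unused (beamforming microphone arrays) and offers a practical, promising direction for multimodal reasoning on scene understanding tasks that are underconstrained with vision and audio alone.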
Abstract: We introduce and explore a new multimodal input representation for vision-language models: acoustic field video. Unlike conventional video (RGB with stereo/mono audio), our video stream provides a spatially grounded visualization of sound intensity across a scene, offering a new and powerful dimension of perceptual understanding. Our real-time pipeline uses low-cost beamforming microphone arrays that are already common in smart speakers and increasingly present in robotics and XR headsets, yet this sensing capability remains unutilized for scene understanding. To assess the value of spatial acoustic information, we constructed an evaluation set of 402 question-answer scenes, comparing a state-of-the-art VLM given conventional video with and without paired acoustic field video. Results show a clear and consistent improvement when incorporating spatial acoustic data; the VLM we test improves from 38.3% correct to 67.4%. Our findings highlight that many everyday scene understanding tasks remain underconstrained when relying solely on visual and audio input, and that acoustic field data provides a promising and practical direction for multimodal reasoning. A video demo is available at https://daehwakim.com/seeingsound
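cs.GR
[158] LoD-Structured 3D Gaussian Splatting for Streaming Video Reconstruction cs.GR | cs.CV
Xinhui Liu, Can Wang, Lei Liu, Zhenghao Chen, Wei Jiang
TL;DR: This paper proposes StreamLoD-GS, a level-of-detail (LoD) structured 3D Gaussian Splatting framework designed for streaming free-viewpoint video. By combining an anchor- and octree-based LoD structure, Gaussian dropout, a motion partitioning mechanism based on Gaussian mixture models, and a quantized residual refinement framework, it targets real-time streaming 3D scene reconstruction under sparse-view inputs, high training costs, and bandwidth limits.
Motivation: Free-viewpoint video reconstruction enables photorealistic, interactive 3D scene visualization, but real-time streaming is often limited by sparse-view inputs, prohibitive training costs, and bandwidth constraints. Existing 3D Gaussian Splatting methods improve rendering speed, but streaming free-viewpoint video places additional demands on rapid optimization, high-fidelity reconstruction under sparse constraints, and a minimal storage footprint.
Result: Extensive experiments show that StreamLoD-GS achieves competitive or state-of-the-art performance in quality, efficiency, and storage.
Insight: The claimed contributions are: an anchor- and octree-based LoD-structured 3D Gaussian Splatting with hierarchical Gaussian dropout, for efficient, stable optimization and high-quality rendering; a GMM-based motion partitioning mechanism that separates dynamic from static content, refining dynamic regions while keeping the background stable; and a quantized residual refinement framework that significantly reduces storage requirements without loss of visual fidelity. Together these combine hierarchical representation, dynamic content handling, and lightweight storage into a systematic solution for streaming 3D reconstruction.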
Abstract: Free-Viewpoint Video (FVV) reconstruction enables photorealistic and interactive 3D scene visualization; however, real-time streaming is often bottlenecked by sparse-view inputs, prohibitive training costs, and bandwidth constraints. While recent 3D Gaussian Splatting (3DGS) has advanced FVV due to its superior rendering speed, Streaming Free-Viewpoint Video (SFVV) introduces additional demands for rapid optimization, high-fidelity reconstruction under sparse constraints, and minimal storage footprints. To bridge this gap, we propose StreamLoD-GS, an LoD-based Gaussian Splatting framework designed specifically for SFVV. Our approach integrates three core innovations: 1) an Anchor- and Octree-based LoD-structured 3DGS with a hierarchical Gaussian dropout technique to ensure efficient and stable optimization while maintaining high-quality rendering; 2) a GMM-based motion partitioning mechanism that separates dynamic and static content, refining dynamic regions while preserving background stability; and 3) a quantized residual refinement framework that significantly reduces storage requirements without compromising visual fidelity. Extensive experiments demonstrate that StreamLoD-GS achieves competitive or state-of-the-art performance in terms of quality, efficiency, and storage.
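eess.SP
[159] ME-WARD: A multimodal ergonomic analysis tool for musculoskeletal risk assessment from inertial and video data in working plac eess.SP | cs.CV | cs.LG
Javier González-Alonso, Paula Martín-Tapia, David González-Ortega, Míriam Antón-Rodríguez, Francisco Javier Díaz-Pernas
TL;DR: This paper presents ME-WARD, a multimodal ergonomic analysis tool for assessing workplace musculoskeletal risk from inertial measurement unit (IMU) and video data. The tool processes joint angles from motion capture systems and deep learning pose estimation models and applies the Rapid Upper Limb Assessment (RULA) method. Validation on industrial assembly tasks shows that ME-WARD produces reliable RULA scores that agree closely with the IMU system results and are comparable with a monocular 3D pose estimation system.
Motivation: Traditional ergonomic assessment methods such as RULA depend on proprietary equipment, which is costly and hard to deploy widely. The paper aims to provide a flexible, scalable, and cost-effective unified assessment pipeline by integrating multimodal motion capture technologies such as IMUs and video.
Result: The tool was tested in an industrial conveyor belt assembly setting, using a gold standard IMU system and a state-of-the-art monocular 3D pose estimation system as references. ME-WARD's RULA scores agree closely with the IMU data for flexion-dominated movements and are comparable with the monocular system, although tracking of lateral and rotational motions is limited.
Insight: The contribution is to integrate multimodal motion capture (IMU and video pose estimation) into a single tool, extending the applicability of RULA and supporting low-cost video input, which gives resource-constrained industrial environments a scalable option. The fusion approach makes the assessment more flexible and accessible and is a meaningful step toward wider adoption of ergonomic assessment.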
Abstract: This study presents ME-WARD (Multimodal Ergonomic Workplace Assessment and Risk from Data), a novel system for ergonomic assessment and musculoskeletal risk evaluation that implements the Rapid Upper Limb Assessment (RULA) method. ME-WARD is designed to process joint angle data from motion capture systems, including inertial measurement unit (IMU)-based setups, and deep learning human body pose tracking models. The tool’s flexibility enables ergonomic risk assessment using any system capable of reliably measuring joint angles, extending the applicability of RULA beyond proprietary setups. To validate its performance, the tool was tested in an industrial setting during the assembly of conveyor belts, which involved high-risk tasks such as inserting rods and pushing conveyor belt components. The experiments leveraged gold standard IMU systems alongside a state-of-the-art monocular 3D pose estimation system. The results confirmed that ME-WARD produces reliable RULA scores that closely align with IMU-derived metrics for flexion-dominated movements and comparable performance with the monocular system, despite limitations in tracking lateral and rotational motions. This work highlights the potential of integrating multiple motion capture technologies into a unified and accessible ergonomic assessment pipeline. By supporting diverse input sources, including low-cost video-based systems, the proposed multimodal approach offers a scalable, cost-effective solution for ergonomic assessments, paving the way for broader adoption in resource-constrained industrial environments.
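Since the pipeline consumes joint angles from any capture source, the basic building block is the angle at a joint computed from three tracked keypoints. The sketch below shows that geometry only; it is not ME-WARD's code, the keypoint coordinates are hypothetical, and the mapping from angles to RULA scores is deliberately left out.

```python
import math


def joint_angle_degrees(a, b, c):
    """Angle at joint b (in degrees) between the segments b->a and b->c.

    a, b, c -- keypoint coordinates of equal dimension (2-D or 3-D),
               e.g. shoulder, elbow, wrist for the elbow angle
    """
    ba = [ai - bi for ai, bi in zip(a, b)]
    bc = [ci - bi for ci, bi in zip(c, b)]
    norm_ba = math.sqrt(sum(x * x for x in ba))
    norm_bc = math.sqrt(sum(x * x for x in bc))
    if norm_ba == 0 or norm_bc == 0:
        raise ValueError("keypoints must not coincide with the joint")
    cosine = sum(x * y for x, y in zip(ba, bc)) / (norm_ba * norm_bc)
    # Clamp to guard against rounding just outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


# Hypothetical 3-D keypoints in metres: shoulder, elbow, wrist.
shoulder, elbow, wrist = (0.0, 0.3, 0.0), (0.0, 0.0, 0.0), (0.25, 0.0, 0.0)
print(joint_angle_degrees(shoulder, elbow, wrist))
```

Because the function only needs coordinates, the same code applies whether the keypoints come from an IMU-driven skeleton or from a monocular pose estimator.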
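cs.CR
[160] Multimodal Privacy-Preserving Entity Resolution with Fully Homomorphic Encryption cs.CR | cs.CV
Susim Roy, Nalini Ratha
TL;DR: This paper proposes a multimodal privacy-preserving entity resolution framework based on fully homomorphic encryption, aimed at entity matching in high-compliance sectors where data is highly heterogeneous and privacy requirements are strict. The method matches efficiently over large datasets without ever exposing the plaintext of personally identifiable information, and it satisfies stringent regulatory requirements.
Motivation: Entity resolution in high-compliance sectors such as government and financial institutions faces a threefold challenge: large data volume, high matching fidelity, and strict privacy protection. The paper aims to achieve secure identity matching in the presence of data heterogeneity such as syntactic variations in personal identifiers.
Result: The method achieves a low equal error rate (EER) while remaining computationally scalable, and it provides cryptographic guarantees of client confidentiality that satisfy stringent regulatory requirements. The abstract does not mention specific benchmarks or quantitative comparisons with existing methods.
Insight: The contribution is to combine fully homomorphic encryption with a multimodal framework so that the plaintext of personally identifiable information stays computationally inaccessible throughout entity resolution, balancing privacy protection, matching accuracy, and large-scale processing, and providing a reference design for privacy-preserving matching in high-compliance settings.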
Abstract: The canonical challenge of entity resolution within high-compliance sectors, where secure identity reconciliation is frequently confounded by significant data heterogeneity, including syntactic variations in personal identifiers, is a longstanding and complex problem. To this end, we introduce a novel multimodal framework operating with the voluminous data sets typical of government and financial institutions. Specifically, our methodology is designed to address the tripartite challenge of data volume, matching fidelity, and privacy. Consequently, the underlying plaintext of personally identifiable information remains computationally inaccessible throughout the matching lifecycle, empowering institutions to rigorously satisfy stringent regulatory mandates with cryptographic assurances of client confidentiality while achieving a demonstrably low equal error rate and maintaining computational tractability at scale.
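cs.SD
[161] AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs’ Contextual and Cultural Knowledge and Thinking cs.SD | cs.CL | cs.CV | cs.MM | eess.AS
Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu
TL;DR: This paper introduces the AVMeme Exam benchmark, a multimodal, multilingual, multicultural dataset of over one thousand iconic Internet audio and video clips for evaluating how well AI models understand content in human cultural contexts. Current state-of-the-art multimodal large language models show clear weaknesses in understanding textless music and sound effects and in contextual and cultural thinking.
Motivation: To examine whether AI models can understand time-varying audio-visual signals, which go beyond text, in human cultural contexts, the authors built the AVMeme Exam benchmark to systematically evaluate levels of understanding from surface content to context, emotion, and world knowledge.
Result: A systematic evaluation on AVMeme Exam shows that current state-of-the-art multimodal large language models perform poorly on textless music and sound effects and fall well short of human participants in contextual and cultural thinking, exposing a key gap in human-aligned multimodal intelligence.
Insight: The contribution is a multimodal benchmark that comprehensively evaluates models' understanding of cultural context, together with a clear account of current models' limits in deep contextual and cultural perception, which points the way toward models that perceive beyond the surface.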
Abstract: Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: avmemeexam.github.io/public
[162] OCR-Enhanced Multimodal ASR Can Read While Listening cs.SD | cs.CL | eess.AS
Junli Chen, Changli Tang, Yixuan Li, Guangzhi Sun, Chao Zhang
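TL;DR: This paper proposes Donut-Whisper, a dual-encoder multimodal automatic speech recognition model that combines audio and visual information, using visual cues such as subtitles to improve English and Chinese speech recognition. The model fuses the strengths of linear and Q-Former-based modality alignment structures through a cross-attention module, introduces a lightweight knowledge distillation scheme, and is accompanied by a new multimodal dataset of English and Chinese movie clips. Experiments show that it significantly outperforms the baselines on both the English and Chinese partitions.
Details
Motivation: Audio-only ASR is limited in complex scenarios such as movies. The work brings in visual information (for example, subtitles) to strengthen speech recognition and improve accuracy, particularly in a multilingual (English and Chinese) setting.
Result: On the proposed English-Chinese multimodal dataset, Donut-Whisper performs significantly better than the Donut and Whisper large V3 baselines. Compared with the Whisper baseline, it achieves an absolute WER reduction of 5.75% on the English set and an absolute CER reduction of 16.5% on the Chinese set.
Insight: The contributions are: 1) fusing different modality alignment structures through a cross-attention module to produce stronger audio-visual features; 2) a lightweight knowledge distillation scheme in which the multimodal model guides an audio-only model to better performance; 3) a new multilingual audio-visual dataset as a resource for related research.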
Abstract: Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with a dual encoder that leverages visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantages of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme showcasing the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both the English and Chinese partitions of the dataset compared to both Donut and Whisper large V3 baselines. In particular, an absolute 5.75% WER reduction and a 16.5% absolute CER reduction were achieved on the English and Chinese sets respectively compared to the Whisper ASR baseline.
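The reported gains are absolute reductions in word error rate (WER) and character error rate (CER). For readers unfamiliar with the metrics, the following sketch computes both with a standard Levenshtein edit distance. The helper names and sample strings are assumptions for illustration; this is not the evaluation code used in the paper, and real evaluations usually apply text normalization first.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (typical for Chinese transcripts)."""
    ref_chars = list(reference.replace(" ", ""))
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# Invented examples: one deleted word, one deleted character.
print(wer("the cat sat on the mat", "the cat sat on mat"))
print(cer("你好世界", "你好世"))
```

An absolute reduction is the plain difference between two such rates in percentage points, so a drop from a hypothetical 20% WER to 14.25% would be an absolute 5.75% reduction.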
cs.CY
[163] Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language Models cs.CY | cs.AI | cs.CL | cs.CV | cs.LG
Zhining Liu, Tianyi Wang, Xiao Lin, Penghao Ouyang, Gaotang Li
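TL;DR: The paper studies the moral robustness of vision-language models (VLMs) and finds that, despite improvements in moral alignment, their moral judgments are highly fragile in realistic settings: stances change under textual and visual perturbations that do not alter the underlying moral context. Systematic multimodal perturbation tests reveal vulnerabilities across perturbation types, moral domains, and model scales, and show that models with stronger instruction following are more susceptible to persuasion. Lightweight inference-time interventions can partially restore moral stability, underscoring moral robustness as a necessary criterion for responsible VLM deployment.
Details
Motivation: Despite substantial effort on the moral alignment of VLMs, the stability of their moral judgments in realistic settings remains unclear. The paper investigates how robust VLMs are to textual and visual perturbations that leave the underlying moral context unchanged.
Result: VLM moral stances are highly fragile and frequently flip under simple perturbations. The analysis reveals systematic vulnerabilities across perturbation types, moral domains, and model scales, including a sycophancy trade-off in which stronger instruction-following models are more susceptible to persuasion. Lightweight inference-time interventions partially restore moral stability.
Insight: The paper systematically defines and studies moral robustness in VLMs, exposing the insufficiency of moral alignment alone and the fragility of models under perturbation, stressing moral robustness as a key criterion for responsible deployment, and showing lightweight interventions that others can adopt. More broadly, it offers a new perspective and an empirical basis for evaluating and improving the moral stability of AI systems.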
Abstract: Despite substantial efforts toward improving the moral alignment of Vision-Language Models (VLMs), it remains unclear whether their ethical judgments are stable in realistic settings. This work studies moral robustness in VLMs, defined as the ability to preserve moral judgments under textual and visual perturbations that do not alter the underlying moral context. We systematically probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile, frequently flipping under simple manipulations. Our analysis reveals systematic vulnerabilities across perturbation types, moral domains, and model scales, including a sycophancy trade-off where stronger instruction-following models are more susceptible to persuasion. We further show that lightweight inference-time interventions can partially restore moral stability. These results demonstrate that moral alignment alone is insufficient and that moral robustness is a necessary criterion for the responsible deployment of VLMs.
[164] Beyond Instrumental and Substitutive Paradigms: Introducing Machine Culture as an Emergent Phenomenon in Large Language Models cs.CY | cs.AI | cs.CL
Yueqing Hu, Xinyang Peng, Yukun Zhao, Lin Qiu, Ka-lai Hung
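TL;DR: This paper challenges the prevailing paradigms that treat large language models (LLMs) as instruments of or substitutes for human culture, proposing “Machine Culture” as an emergent, distinct phenomenon. Using a 2×2 factorial design (model origin: US vs. China; prompt language: English vs. Chinese) across eight multimodal tasks, the study finds that neither model origin nor prompt language reliably predicts cultural behavior, and it identifies two new phenomena, “Cultural Reversal” and “Service Persona Camouflage”, suggesting that LLMs exhibit a machine culture shaped by superposition in high-dimensional space and mode collapse from safety alignment.
Details
Motivation: Existing work typically characterizes LLMs through an Instrumental Paradigm (models reflect their developers' culture) or a Substitutive Paradigm (models act as bilingual proxies that switch cultural frames), and these anthropomorphic frameworks have limitations. The study challenges them and asks whether LLMs exhibit an emergent, distinct “Machine Culture”.
Result: Across eight multimodal tasks (including image generation and interpretation), model origin (US or China) did not predict cultural alignment, with US models frequently exhibiting “holistic” traits typically associated with East Asian data. Prompt language (English or Chinese) did not trigger stable cultural frame-switching; instead a “Cultural Reversal” appeared, in which English prompts elicited higher contextual attention than Chinese prompts. In affective tasks, RLHF collapsed cultural variance into a hyper-positive, zero-variance “helpful assistant” persona (Service Persona Camouflage).
Insight: The paper's claimed novelty is the concept of “Machine Culture”, together with empirical evidence challenging existing anthropomorphic paradigms. Viewed objectively, its contribution lies in extending the analysis beyond text through multimodal tasks and in identifying concrete phenomena such as Cultural Reversal and Service Persona Camouflage, emphasizing that LLM cultural behavior is an emergent phenomenon shaped by technical factors such as superposition in high-dimensional space and safety alignment, rather than a simple simulation of human culture.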
Abstract: Recent scholarship typically characterizes Large Language Models (LLMs) through either an Instrumental Paradigm (viewing models as reflections of their developers’ culture) or a Substitutive Paradigm (viewing models as bilingual proxies that switch cultural frames based on language). This study challenges these anthropomorphic frameworks by proposing Machine Culture as an emergent, distinct phenomenon. We employed a 2 (Model Origin: US vs. China) × 2 (Prompt Language: English vs. Chinese) factorial design across eight multimodal tasks, uniquely incorporating image generation and interpretation to extend analysis beyond textual boundaries. Results revealed inconsistencies with both dominant paradigms: Model origin did not predict cultural alignment, with US models frequently exhibiting “holistic” traits typically associated with East Asian data. Similarly, prompt language did not trigger stable cultural frame-switching; instead, we observed Cultural Reversal, where English prompts paradoxically elicited higher contextual attention than Chinese prompts. Crucially, we identified a novel phenomenon termed Service Persona Camouflage: Reinforcement Learning from Human Feedback (RLHF) collapsed cultural variance in affective tasks into a hyper-positive, zero-variance “helpful assistant” persona. We conclude that LLMs do not simulate human culture but exhibit an emergent Machine Culture, a probabilistic phenomenon shaped by superposition in high-dimensional space and mode collapse from safety alignment.
[165] Artificial Intelligence and Intellectual Property Rights: Comparative Transnational Policy Analysis cs.CY | cs.AI | cs.CL | cs.LG
Sahibpreet Singh, Manjit Singh
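TL;DR: This study assesses the impact of artificial intelligence on intellectual property rights (trade secrets, copyrights, and patents), focusing on the absence of AI-specific provisions in Indian law. By comparing legislation, judicial precedents, and policy across India, the US, the UK, and the EU, it exposes the shortcomings of existing law in accommodating AI-generated outputs.
Details
Motivation: To address the doctrinal inconsistencies and enforcement inefficiencies caused by the lack of AI-specific provisions in Indian IP law, and to contribute to the still-nascent global discourse on AI-IPR protection.
Result: Preliminary findings indicate that India's contract law yields a fragmented trade secret regime, that Section 3(k) of the Indian Patents Act blocks patenting of AI inventions, and that copyright authorship attribution standards vary. India's National AI Strategy (2024) shows progress, but legislative clarity is still lacking.
Insight: The study proposes a harmonized legal taxonomy that accommodates AI's role while preserving innovation incentives, and it emphasizes recalibrating India's IP jurisprudence for global alignment, offering a policy reference for AI-specific IP protection.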
Abstract: Artificial intelligence’s rapid integration with intellectual property rights necessitates an assessment of its impact on trade secrets, copyrights, and patents. This study addresses lacunae in existing laws: India lacks AI-specific provisions, creating doctrinal inconsistencies and enforcement inefficacies. Global discourse on AI-IPR protections remains nascent. The research identifies gaps in the adaptability of Indian IP laws to AI-generated outputs: trade secret protection is inadequate against AI threats, and standardized inventorship criteria are absent. Employing a doctrinal and comparative methodology, it scrutinizes legislative texts, judicial precedents, and policy instruments across India, the US, the UK, and the EU. Preliminary findings reveal shortcomings: India’s contract law creates a fragmented trade secret regime; Section 3(k) of the Indian Patents Act blocks AI invention patenting; and copyright regimes vary in authorship attribution. The study proposes a harmonized legal taxonomy accommodating AI’s role while preserving innovation incentives. India’s National AI Strategy (2024) shows progress, but legislative clarity is imperative. This contributes to the global discourse on AI-specific IP protections that ensure resilience and equitable innovation. Promising results underscore the need to recalibrate India’s IP jurisprudence for global alignment.
[166] Generative AI in Saudi Arabia: A National Survey of Adoption, Risks, and Public Perceptions cs.CY | cs.AI | cs.CL
Abdulaziz AlDakheel, Ali Alshehre, Esraa Alamoudi, Moslim AlKhabbaz, Ahmed Aljohani
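TL;DR: Through a nationwide survey of 330 Saudi nationals, this study provides an early snapshot of public engagement with generative AI (GenAI) in Saudi Arabia. The survey covers seven dimensions, including awareness, adoption, impacts, risks, and data sharing. Although 93% of respondents actively use GenAI (mainly for text-based tasks), overall awareness and technical understanding are uneven. Respondents recognize gains in productivity and work quality but widely worry about negative effects on critical thinking, privacy, ethics, and employment, and they show strong interest in structured training that combines foundational skills, domain-specific applications, and ethical guidance.
Details
Motivation: Under Saudi Vision 2030, generative AI is rapidly becoming embedded in the country's digital transformation, yet public awareness, adoption, and concerns remain underexplored. The study aims to fill this gap and provide early baseline data for policymakers and developers.
Result: The survey establishes a baseline for GenAI engagement in Saudi Arabia. Key findings: 93% of respondents actively use GenAI (mostly for text-based tasks), while advanced uses such as programming or multimodal generation are less common; awareness and technical understanding are uneven; respondents broadly recognize benefits for productivity and work quality but are highly concerned about risks to critical thinking, privacy, misinformation, and employment.
Insight: The contribution is a nationwide, multi-dimensional empirical survey of public GenAI adoption and perception in Saudi Arabia, providing data on the social acceptance of AI in a non-Western context. Viewed objectively, the seven-dimension framework and the emphasis on structured training needs offer a transferable method and focus for assessing and guiding GenAI's social integration elsewhere; the view that cultural and linguistic alignment, AI literacy, and ethical frameworks should jointly be policy priorities is particularly instructive.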
Abstract: Generative Artificial Intelligence (GenAI) is rapidly becoming embedded in Saudi Arabia’s digital transformation under Vision 2030, yet public awareness, adoption, and concerns surrounding these tools remain underexplored. This study provides an early snapshot of GenAI engagement among Saudi nationals. Using a nationwide survey of 330 participants across regions, age groups, and employment sectors, we examine seven dimensions of GenAI use: awareness and understanding, adoption patterns, perceived impacts, training needs, risks and barriers, data-sharing behaviors, and future expectations. Findings show that 93% of respondents actively use GenAI primarily for text-based tasks, while more advanced uses such as programming or multimodal generation are less common. Despite the prevalence of use, overall awareness and conceptual understanding remain uneven, with many reporting limited technical knowledge. Participants recognize GenAI’s benefits for productivity, work quality, and understanding complex information, yet caution that sustained reliance may undermine critical thinking and key professional skills. Trust in AI-generated outputs remains cautious, with widespread concerns about privacy, misinformation, and ethical misuse, including potential job displacement. Respondents show strong interest in structured GenAI training that combines foundational skills, domain-specific applications, and clear guidance on privacy, ethics, and responsible use. These results establish a baseline for GenAI engagement in Saudi Arabia and highlight priorities for policymakers and developers: expanding AI literacy, ensuring culturally and linguistically aligned GenAI solutions, and strengthening frameworks for privacy and responsible deployment.
cs.NI
[167] Structure-Aware NL-to-SQL for SFC Provisioning via AST-Masking Empowered Language Models cs.NI | cs.CL | cs.LG
Xinyu Zhu, Parisa Fard Moshiri, Poonam Lohan, Burak Kantarci, Emil Janulewicz
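TL;DR: This paper proposes AST-Masking, a structure-aware fine-tuning method for translating natural language (NL) specifications into executable SQL commands to support specification-driven Service Function Chain (SFC) orchestration. The method uses SQL abstract syntax trees (ASTs) to assign weights to key components and enforce syntax-aware learning, significantly improving SQL generation accuracy and syntactic correctness across multiple language models without adding inference overhead.
Details
Motivation: Effective SFC orchestration in dynamic, latency-sensitive networks requires precision. Reinforcement learning (RL) improves adaptability but often ignores structured domain knowledge, limiting generalization and interpretability, while conventional LLM fine-tuning can cause syntactic inconsistencies and produce inefficient queries.
Result: Experiments show that AST-Masking significantly improves SQL generation accuracy across multiple language models. FLAN-T5 reaches an Execution Accuracy (EA) of 99.6%, while Gemma achieves the largest absolute gain, from 7.5% to 72.0%. These results confirm the method's effectiveness in ensuring syntactically correct and efficient SQL generation.
Insight: The core contribution is AST-Masking, a structure-aware fine-tuning method that guides learning with the SQL AST, explicitly injecting domain knowledge (SQL syntactic structure) into LLM fine-tuning and improving the syntactic correctness and overall efficiency of generated SQL without increasing inference cost. It offers an effective route for injecting domain-specific structural knowledge into LLMs.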
Abstract: Effective Service Function Chain (SFC) provisioning requires precise orchestration in dynamic and latency-sensitive networks. Reinforcement Learning (RL) improves adaptability but often ignores structured domain knowledge, which limits generalization and interpretability. Large Language Models (LLMs) address this gap by translating natural language (NL) specifications into executable Structured Query Language (SQL) commands for specification-driven SFC management. Conventional fine-tuning, however, can cause syntactic inconsistencies and produce inefficient queries. To overcome this, we introduce Abstract Syntax Tree (AST)-Masking, a structure-aware fine-tuning method that uses SQL ASTs to assign weights to key components and enforce syntax-aware learning without adding inference overhead. Experiments show that AST-Masking significantly improves SQL generation accuracy across multiple language models. FLAN-T5 reaches an Execution Accuracy (EA) of 99.6%, while Gemma achieves the largest absolute gain from 7.5% to 72.0%. These results confirm the effectiveness of structure-aware fine-tuning in ensuring syntactically correct and efficient SQL generation for interpretable SFC orchestration.
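The headline numbers are Execution Accuracy (EA) scores. A common way to define EA in NL-to-SQL work is the fraction of generated queries whose execution result matches that of the gold query, and the sketch below follows that definition using Python's built-in `sqlite3`. The `sfc` table, its columns, and the query pairs are hypothetical, and the paper may compute EA differently (for example, with order-sensitive comparison).

```python
import sqlite3
from typing import Iterable, Tuple


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # syntactically or semantically invalid SQL counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    # Order-insensitive comparison: an assumption, suitable when ORDER BY is not graded.
    return sorted(predicted_rows) == sorted(gold_rows)


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    results = [execution_match(conn, p, g) for p, g in pairs]
    return sum(results) / len(results)


# Hypothetical SFC inventory table, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfc (id INTEGER, name TEXT, latency_ms REAL)")
conn.executemany("INSERT INTO sfc VALUES (?, ?, ?)",
                 [(1, "fw-nat", 4.0), (2, "fw-ids-lb", 9.5), (3, "nat-lb", 6.0)])

pairs = [
    ("SELECT name FROM sfc WHERE latency_ms < 7",
     "SELECT name FROM sfc WHERE latency_ms < 7.0"),          # same rows
    ("SELECT name FROM sfc WHERE latency < 7",
     "SELECT name FROM sfc WHERE latency_ms < 7"),            # unknown column
    ("SELECT id FROM sfc ORDER BY id DESC",
     "SELECT id FROM sfc"),                                   # same rows, other order
    ("SELECT COUNT(*) FROM sfc",
     "SELECT COUNT(*) FROM sfc WHERE latency_ms > 5"),        # different result
]
print(execution_accuracy(conn, pairs))
```

Treating any query that fails to execute as a miss is what ties EA to syntactic correctness: a model that emits malformed SQL is penalized even if the query is close to the gold query as text.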