Table of Contents
- cs.CL [Total: 41]
- cs.CV [Total: 52]
- cs.LG [Total: 6]
- cs.MM [Total: 1]
- eess.IV [Total: 2]
- cs.CR [Total: 3]
- cs.HC [Total: 1]
- cs.MA [Total: 2]
- cs.AI [Total: 6]
cs.CL
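[1] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue cs.CL
Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen
TL;DR: This paper proposes ProUtt, an LLM-driven preference data synthesis method for proactively predicting the user's next utterance in human-machine dialogue. It converts the dialogue history into an intent tree, explicitly models intent reasoning trajectories from both exploitation and exploration perspectives, and constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at future turns.
Details
Motivation: Existing commercial API-based solutions raise privacy concerns, while deploying general-purpose LLMs locally is computationally expensive. Existing user simulators mainly imitate the user's speaking style rather than advancing the dialogue, and existing preference data synthesis methods cannot explicitly model the intent reasoning that leads to the user's next utterance.
Result: Extensive experiments on four benchmark datasets, using LLM-as-a-judge and human evaluation, show that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs on proactive next-utterance prediction.
Insight: The novelty lies in structuring the dialogue history as an intent tree, explicitly modeling intent reasoning trajectories (from exploitation and exploration perspectives), and synthesizing preference and non-preference data by perturbing intent tree paths, which better aligns LLMs with user preferences for proactive prediction.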
Abstract: Proactively predicting a user’s next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns, while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user’s next utterance, they mainly imitate the user’s speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user’s next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user’s next utterance. To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets. We release both the code and the synthesized datasets to facilitate future research.
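[2] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines cs.CL | cs.AI
Devesh Saraogi, Rohit Singhee, Dhruv Kumar
TL;DR: This paper studies whether multi-step agentic workflows (iterative reasoning, evolutionary search, recursive decomposition) produce more novel and feasible research plans than single-step prompting. It benchmarks five reasoning architectures and finds that decomposition-based and long-context workflows score highest on novelty, while reflection-based approaches score lower.
Details
Motivation: Integrating LLMs into the scientific ecosystem raises questions about the creativity and originality of AI-generated research, in particular the concern about "smart plagiarism" in single-step prompting, where models repeat existing ideas with terminological shifts.
Result: In evaluations of thirty proposals on novelty, feasibility, and impact, decomposition-based and long-context workflows reach a mean novelty score of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). High-performing workflows maintain feasibility without sacrificing creativity.
Insight: The paper's claimed contribution is a systematic evaluation and comparison of several multi-step agentic workflows for generating research plans. Viewed objectively, its core insight is that carefully designed multi-stage agentic workflows (especially those based on recursive decomposition and long context) can raise the novelty of AI-assisted research ideation, offering a concrete technical path and empirical support for overcoming "smart plagiarism".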
Abstract: The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified "smart plagiarism" as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows (multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition) can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini 3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.
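[3] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model cs.CL | cs.AI
Jing-Yi Zeng, Guan-Hua Huang
TL;DR: This study proposes StatLLaMA, a multi-stage training framework for efficiently building a large language model specialized for statistics. Based on the lightweight LLaMA-3.2-3B family, it systematically compares three training pipelines that start from different foundation models (a base model with no instruction-following capability, a base model with post-hoc instruction tuning, and an instruction-tuned model), covering continual pretraining, supervised fine-tuning, RLHF preference alignment, and downstream task adaptation. It finds that starting from an instruction-tuned model with general reasoning ability is the key to effective domain specialization.
Details
Motivation: The study asks how to build an LLM with expertise in statistics efficiently and with limited resources, and how the choice of starting point and training pipeline affects domain specialization.
Result: The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise. Pipelines that start from a base model fail to develop meaningful statistical reasoning, whereas starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. Direct preference optimization is shown to provide stable and effective RLHF preference alignment.
Insight: The contribution is a systematic comparison of how the starting point and training stages affect the construction of a domain-specialized LLM, together with a practical multi-stage training blueprint. Key insights: 1) domain specialization needs an instruction-tuned model with strong general reasoning ability as its starting point; 2) SFT variants trade off domain expertise against general reasoning ability; 3) downstream fine-tuning of a highly optimized model must use extremely low intensity to avoid catastrophic forgetting.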
Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.
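The abstract names direct preference optimization as the alignment step. As a rough illustration only, the standard DPO objective for a single preference pair can be sketched as below. This is the generic published loss, not code from the StatLLaMA repository; the function name, the argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy favors the chosen response over the
    # rejected one, measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically
    # stable form so very large |x| does not overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response.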
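[4] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models cs.CL | cs.AI
Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song
TL;DR: This paper proposes Bounded Hyperbolic Tanh (BHyT) as a replacement for the Pre-Layer Normalization (Pre-LN) widely used in large language models. By combining a tanh nonlinearity with explicit, data-driven input bounding, BHyT addresses the training instability caused by activation magnitude and variance growing with depth under Pre-LN, while also improving computational efficiency.
Details
Motivation: Pre-LN is key to stable training and effective transfer learning in LLMs, but it is inefficient because of repeated statistical computation, and as depth grows the magnitude and variance of the hidden state keep increasing, which destabilizes training. Existing efficiency-oriented normalization-free methods such as Dynamic Tanh remain fragile at depth. The paper aims to address stability and efficiency jointly.
Result: During pretraining, BHyT remains stable while achieving on average 15.8% faster training and 4.2% higher token generation throughput than RMSNorm. On language understanding and reasoning benchmarks, its inference performance and robustness match or surpass those of RMSNorm.
Insight: The core idea is to couple the tanh nonlinearity with explicit, data-driven input bounding, which comes with a theoretical stability guarantee. On the efficiency side, exact statistics are computed only once per block and the second normalization is replaced with a lightweight variance approximation, improving efficiency without hurting performance.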
Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
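[5] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering cs.CL | cs.AI
Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai
TL;DR: The paper presents an uncertainty-aware dynamic knowledge graph framework intended to improve the reliability and interpretability of question answering systems. It combines dynamic construction of evolving knowledge graphs, confidence scoring with uncertainty-aware retrieval, and an interactive interface, so that the system can handle incomplete, noisy, or uncertain evidence. Using healthcare as the example, it builds personalized knowledge graphs from electronic health records and applies them to a mortality prediction task, illustrating the framework's potential in high-stakes applications.
Details
Motivation: Existing knowledge-graph-based QA systems typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning, so reliability degrades when evidence is incomplete, noisy, or uncertain.
Result: The framework is instantiated on a mortality prediction task in healthcare, building personalized knowledge graphs and visualizing uncertainty across patient visits to evaluate its impact on QA reliability. No specific quantitative benchmark results or comparisons with the state of the art are reported.
Insight: The novelty lies in combining uncertainty modeling with dynamic knowledge graphs, using confidence-annotated triples and an interactive interface to make QA more robust and transparent, which offers a reusable approach to improving reliability in high-stakes domains such as healthcare.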
Abstract: Question answering (QA) systems are increasingly deployed across domains. However, their reliability is undermined when retrieved evidence is incomplete, noisy, or uncertain. Existing knowledge graph (KG) based QA frameworks typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning. We present a demonstration of uncertainty-aware dynamic KGs, a framework that combines (i) dynamic construction of evolving KGs, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for reliable and interpretable QA. Our system highlights how uncertainty modeling can make QA more robust and transparent by enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline versus confidence-aware answers. The target users of this demo are clinical data scientists and clinicians, and we instantiate the framework in healthcare: constructing personalized KGs from electronic health records, visualizing uncertainty across patient visits, and evaluating its impact on a mortality prediction task. This use case demonstrates the broader promise of uncertainty-aware dynamic KGs for enhancing QA reliability in high-stakes applications.
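[6] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions cs.CL | cs.AI
Katherine Elkins, Jon Chun
TL;DR: This paper studies the robustness of LLM ethical decisions to benign prompt variation. It introduces the Syntactic Framing Fragility (SFF) evaluation framework, which isolates purely syntactic effects via Logical Polarity Normalization (LPN), and audits the decision consistency of 23 state-of-the-art models over 14 ethical scenarios. It finds widespread, statistically significant inconsistency, with open-source models more than twice as fragile as commercial ones, and shows that chain-of-thought reasoning substantially reduces fragility.
Details
Motivation: LLMs lack robustness to prompts that are logically equivalent but syntactically different when making ethical decisions. The paper evaluates whether models keep consistent ethical judgments under syntactic variations such as negation and conditional structure.
Result: Across 39,975 decisions over 14 ethical scenarios and four controlled framings, many models reverse ethical endorsements solely because of syntactic polarity. Open-source models are more than twice as fragile as commercial ones, and some models endorse actions in 80-97% of cases when explicitly prompted with "should not". Chain-of-thought reasoning substantially reduces fragility, and financial and business scenarios carry higher risk than medical ones.
Insight: The novelty lies in the SFF framework and the LPN method, which show that syntactic consistency is a critical dimension of ethical robustness. The authors recommend making such audits a standard component of safety evaluation for deployed LLMs, with chain-of-thought reasoning as a practical mitigation lever.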
Abstract: Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with “should not.” We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios. Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on github.com.
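To make the idea of polarity normalization concrete, the sketch below maps raw yes/no answers onto a common "endorses the action" scale and then counts scenarios whose normalized decision changes across framings. It is a simplified illustration under stated assumptions: the function names, the data layout, and the per-scenario inconsistency rate are all hypothetical and are not the paper's LPN procedure or its fragility metric.

```python
def normalize(answer_is_yes: bool, framing_is_negated: bool) -> bool:
    """Return True when the model endorses the action, whatever the framing.

    "Yes" to "should X be done?" endorses X, while "yes" to
    "should X not be done?" rejects it, so the answer is flipped
    for negated framings.
    """
    return answer_is_yes != framing_is_negated


def inconsistency_rate(responses):
    """Fraction of scenarios whose normalized decision differs across framings.

    `responses` maps a scenario id to a list of
    (answer_is_yes, framing_is_negated) pairs. This layout is a
    hypothetical choice for the example.
    """
    flipped = 0
    for answers in responses.values():
        endorsements = {normalize(yes, negated) for yes, negated in answers}
        if len(endorsements) > 1:  # the model contradicted itself
            flipped += 1
    return flipped / len(responses)
```

A model that answers "yes" to both the positive and the negated framing of the same scenario is counted as inconsistent, which is the failure pattern the abstract describes as negation sensitivity.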
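[7] Forgetting as a Feature: Cognitive Alignment of Large Language Models cs.CL | cs.AI
Hien Tran, Quinten Steenhuis, Alexandros Christoforos, Chadbourne Davis
TL;DR: This paper reinterprets forgetting in large language models as a functional cognitive mechanism rather than a defect. Inspired by human memory dynamics, the authors model LLM inference as a probabilistic memory process governed by exponential decay and introduce a benchmark suite covering temporal reasoning, concept drift adaptation, and associative recall. Empirically, LLM forgetting rates resemble the human trade-off between stability and adaptability, and on that basis the authors propose probabilistic memory prompting, which mimics human memory decay to improve long-horizon reasoning.
Details
Motivation: LLMs systematically forget past information during in-context reasoning. The paper revisits this behavior and treats it as an adaptive cognitive mechanism rather than a limitation measured against the ideal of perfect Bayesian inference.
Result: On the proposed benchmark suite for temporal reasoning, concept drift adaptation, and associative recall, LLM forgetting rates are analogous to the efficiency trade-off in human memory (stability vs. adaptability). The proposed probabilistic memory prompting strategy improves long-horizon reasoning performance.
Insight: The paper reframes forgetting as a principled mechanism for adaptive intelligence and models LLM inference after human memory dynamics. Probabilistic memory prompting is a lightweight strategy that shapes evidence integration to mimic human-like memory decay and improves long-horizon reasoning. Viewed objectively, this offers a new cognitive-alignment perspective for understanding and improving in-context learning in LLMs.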
Abstract: Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.
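The phrase "probabilistic memory process governed by exponential decay" can be illustrated with a small weighting scheme in which older observations count less. The sketch below is a generic exponentially weighted average, offered only as an intuition aid; the function names and the `decay_rate` parameter are assumptions for this example, and the paper's actual model may differ.

```python
import math

def decay_weights(ages, decay_rate):
    """Normalized weights proportional to exp(-decay_rate * age)."""
    raw = [math.exp(-decay_rate * age) for age in ages]
    total = sum(raw)
    return [w / total for w in raw]


def decayed_estimate(observations, decay_rate):
    """Weighted mean of observations ordered from oldest to newest.

    The newest observation has age 0. With decay_rate = 0 this reduces
    to the plain mean (perfect memory); larger rates forget faster.
    """
    n = len(observations)
    ages = [n - 1 - i for i in range(n)]
    weights = decay_weights(ages, decay_rate)
    return sum(w * x for w, x in zip(weights, observations))
```

For a stream that drifts from 0 to 1, a positive decay rate tracks the new value more closely than the plain mean, which is the stability versus adaptability trade-off the abstract refers to.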
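[8] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis cs.CL | cs.AI | cs.IR | cs.LG
Sauhard Dubey
TL;DR: SciNets is a framework for mechanistic synthesis of cross-domain scientific literature based on graph-constrained multi-hop reasoning. It builds a concept graph and synthesizes explanations by exploring multi-hop paths that connect rarely co-occurring concepts. The study compares several reasoning strategies and introduces a behavioral evaluation framework that measures reasoning depth, diversity, and stability, revealing trade-offs in integrating graph constraints with language models.
Details
Motivation: Retrieval systems and unconstrained language models struggle to connect mechanistic explanations across fragmented literature in cross-domain scientific synthesis, and they offer little control over reasoning depth and structural grounding.
Result: On machine learning, biology, and climate science tasks, graph constraints enable controllable multi-hop reasoning. Deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning is highly stable but structurally conservative.
Insight: The paper formalizes mechanistic synthesis as a graph-constrained multi-hop reasoning problem and introduces a behavioral evaluation framework (reasoning depth, diversity, and stability) instead of evaluating correctness alone, providing a systematic behavioral characterization of the capabilities and limits of graph-LLM integration.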
Abstract: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative. These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.
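The path-based strategies named in the abstract (shortest path and k-shortest paths) can be pictured with a breadth-first enumeration of simple paths over a directed graph. The sketch below is a generic textbook-style routine, not SciNets code; it omits the diversity constraints and the random-walk variant, and the toy graph and its node names are hypothetical. The enumeration is exponential in the worst case, so it suits only small, query-local graphs.

```python
from collections import deque

def k_shortest_paths(graph, source, target, k):
    """Return up to k simple paths from source to target, fewest hops first.

    `graph` maps each node to a list of successor nodes (directed edges).
    Breadth-first expansion of partial paths yields complete paths in
    non-decreasing hop count, so the first path found is a shortest path.
    """
    found = []
    queue = deque([[source]])
    while queue and len(found) < k:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no revisits)
                queue.append(path + [neighbor])
    return found


# Hypothetical toy concept graph with placeholder node names.
toy_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["D"],
}
paths = k_shortest_paths(toy_graph, "A", "D", k=2)
```

Here `paths[0]` is the two-hop route through `C`, and `paths[1]` is the longer three-hop route through `B` and `C`, a small instance of the depth versus conservatism contrast discussed in the abstract.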
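[9] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens cs.CL | cs.AI
Meicong Zhang, Tiancheng su, Guoxiu He
TL;DR: This paper proposes STIG (Stage Token for Introduction Generation), which removes the long reasoning chains, error accumulation, and reduced textual coherence of conventional agentic-workflow approaches to generating research introductions. It parameterizes the workflow's logical structure into the LLM and uses stage tokens as explicit signals that guide the model to produce a complete, logically coherent introduction in a single inference.
Details
Motivation: Methods based on predefined agentic workflows struggle with research introductions, which require rigorous logic, coherent structure, and abstract summarization, while long reasoning chains tend to accumulate errors and reduce coherence. The paper addresses this by eliminating the external workflow and encoding its logical structure directly into the model parameters.
Result: Experiments show that STIG generates multi-stage text in a single inference without explicit workflow calls. It outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality.
Insight: The novelty is the stage token, which turns the logical roles and functions of a multi-stage workflow into explicit, learnable signals; through instruction tuning the model internalizes the logical order and transition patterns between stages. This suggests a way to encode the structural knowledge of complex tasks directly into model parameters, simplifying inference and improving generation quality.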
Abstract: In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.
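[10] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings cs.CL
Wen G. Gong
TL;DR: Many multilingual embedding models are available, but there is no standard for evaluating their cross-lingual semantic alignment. This paper proposes the Semantic Affinity (SA) metric, pairs it with the Semanscope visualization framework, and benchmarks 13 models on 4 datasets. It finds that alignment is determined mainly by the training objective rather than by architecture or scale, and that explicit translation supervision is key.
Details
Motivation: Task-driven benchmarks such as MTEB may mask fundamental shortcomings in the cross-lingual semantic alignment of multilingual embedding models. The paper aims to give practitioners clear guidance for picking, from hundreds of available models, those with genuine cross-lingual semantic alignment.
Result: The 52 experiments on 4 datasets reveal a three-tier structure: 1) top BERT models trained with translation-pair supervision (e.g., LaBSE, SA = 0.70) align strongly; 2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of scale (0.6B to 8B); 3) BERT models trained only with masked language modeling (e.g., mBERT, XLM-R, SA < 0.50) fail to align. The training objective determines alignment.
Insight: The novelty is a bounded Semantic Affinity (SA) metric and a visualization framework dedicated to evaluating cross-lingual semantic alignment. Viewed objectively, the core insight is that cross-lingual alignment depends on an explicit translation-supervised training objective rather than on model scale or multilingual data alone, which gives useful guidance for model design and selection.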
Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which provide genuine cross-lingual semantic alignment versus task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of scale (0.6B to 8B); (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.
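[11] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets cs.CL | cs.AI
Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He
TL;DR: This paper proposes a shift from ad-hoc data curation to closed-loop dataset engineering. Using the OpenDataArena (ODA) framework, it turns benchmarking into feedback signals for dataset construction through value-anchored rankings and multi-dimensional analysis. The approach is instantiated in two new datasets: ODA-Math-460k, a mathematical reasoning dataset built with a two-stage difficulty-aware pipeline, and ODA-Mixture, a multi-domain instruction dataset built with an "Anchor-and-Patch" strategy. Both yield significant gains in domain-specific reasoning and general utility with superior data efficiency.
Details
Motivation: Constructing supervised fine-tuning (SFT) datasets for LLMs is a critical but under-theorized stage. Common practice relies on heuristic aggregation and lacks a systematic understanding of how individual samples affect model performance.
Result: ODA-Math-460k achieves state-of-the-art (SOTA) results on benchmarks such as AIME and HMMT. The ODA-Mixture datasets outperform open-source baselines that are significantly larger.
Insight: The novelty is a closed-loop dataset engineering framework (ODA) with transparent evaluation as its core engine, which turns benchmarking directly into feedback for data construction. Concrete strategies such as value-anchored rankings, the two-stage difficulty-aware pipeline, and "Anchor-and-Patch" improve data quality and efficiency together, advancing data-centric AI.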
Abstract: The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.
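[12] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis cs.CL | cs.AI
Yanyi Liu, Qingwen Yang, Tiezheng Guo, Feiyu Qu, Jun Liu
TL;DR: This paper proposes a shift in research paradigm from hallucination detection to hallucination diagnosis and introduces the Hallucination Diagnosis Task, which requires models not only to detect hallucinations but also to localize errors, explain causes, and correct content. The authors build the Hallucination Diagnosis Generator (HDG) to synthesize high-quality training data automatically and train HDM-4B-RL, which surpasses previous SOTA detection models.
Details
Motivation: Current research focuses on binary hallucination detection, which can identify hallucinations but cannot provide interpretable and actionable feedback for model improvement, limiting practical utility. A deeper diagnostic paradigm is therefore needed.
Result: On the HaluEval benchmark, HDM-4B-RL surpasses previous state-of-the-art detection models. On comprehensive diagnosis tasks it matches larger general-purpose models while remaining small.
Insight: The novelty lies in proposing the hallucination diagnosis task, building the automated HDG data synthesis pipeline that produces training samples with rich metadata, and training a dedicated diagnosis model with GRPO and a comprehensive reward function, which together offer an effective method for building more trustworthy generative AI systems.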
Abstract: Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary “detection” approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from “detection” to “diagnosis”. The Hallucination Diagnosis Task is introduced, a task which requires models to not only detect hallucinations, but also perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size. This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.
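The training recipe mentions Group Relative Policy Optimization with a reward that mixes structural, accuracy, and localization signals. The sketch below shows only the group-relative advantage step that GRPO is known for, plus a toy composite reward. The reward weights, the argument names, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
import statistics

def composite_reward(structure_ok, label_correct, span_overlap,
                     weights=(0.2, 0.5, 0.3)):
    """Toy weighted sum of three reward signals (weights are hypothetical).

    structure_ok and label_correct are booleans; span_overlap is a
    localization score in [0, 1].
    """
    w_structure, w_accuracy, w_localization = weights
    return (w_structure * float(structure_ok)
            + w_accuracy * float(label_correct)
            + w_localization * span_overlap)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation, so no learned value
    function is needed. eps guards against a zero deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group's average receive a positive advantage and the rest a negative one; if every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.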
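[13] Bears, all bears, and some bears. Language Constraints on Language Models’ Inductive Inferences cs.CL
Sriram Padmanabhan, Siyuan Song, Kanishka Misra
TL;DR: This paper studies how language models handle different linguistic constraints (universal quantification, generic statements, and existential quantification) in inductive inference. Replicating the child psychology experiment of Gelman et al., it finds that vision language models (VLMs) behave like human children in their inferences and that their internal representations reflect inductive constraints rather than surface-form differences.
Details
Motivation: The paper asks whether general-purpose statistical learners such as vision language models can, like human children, distinguish between linguistic constraints ("all bears", "bears", and "some bears") in inductive inference, in order to understand the subtle influence of language on reasoning.
Result: In the replicated psychology task, vision language models show the same behavioral pattern as human children aged 4 and older (universal > generic > existential), indicating that the models distinguish these linguistic constraints.
Insight: Vision language models capture the inductive inference constraints carried by human language, and their internal representations are organized by semantic constraints rather than surface linguistic form, offering a new perspective on how models approximate human cognitive processes.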
Abstract: Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements (“Bears are daxable”), universally quantified NPs (“all bears are daxable”) and indefinite plural NPs (“some bears are daxable”) in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. On tasking them through a series of precondition tests (robust identification of categories in images and sensitivities to all and some), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses on their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.
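[14] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing cs.CL | cs.AI | cs.LG
Yilin Bao, Ziyao He, Zayden Yang
TL;DR: This paper proposes OUTLINEFORGE, a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures, addressing the weaknesses of current LLMs in global structure, input coverage, and citation consistency. It models outline evolution through structured actions and introduces a two-stage optimization procedure, consisting of backward outline reconstruction and forward value-guided reinforcement learning, to improve planning for scientific writing.
Details
Motivation: Current LLMs are locally fluent when generating scientific papers but fall short in global structure planning, coverage of the input, and citation consistency, so a generation method with document-level planning and factual grounding is needed.
Result: On the authors' newly introduced scientific paper generation benchmark, the method shows consistent improvements over strong neural and LLM baselines in document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy, particularly in long-range structural coherence and citation reliability.
Insight: The novelty lies in formalizing scientific outline construction as a long-horizon planning problem for hierarchical reinforcement learning, and in a two-stage optimization strategy that combines backward reconstruction (to enforce global structural consistency) with forward reinforcement learning (with rewards that explicitly model scientific correctness, discourse coherence, and citation fidelity), providing a reusable planning and optimization framework for document-level text generation.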
Abstract: Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models evolving outlines through structured edit actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.
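[15] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL cs.CL
Yifei Shen, Yilun Zhao, Justice Ou, Tinglin Huang, Arman Cohan
TL;DR: This paper introduces CLINSQL, a clinical text-to-SQL benchmark built on MIMIC-IV v3.1 with 633 expert-annotated tasks that require models to handle heterogeneous electronic health record tables, temporal windows, and patient-similarity cohorts in order to produce executable queries. It evaluates 22 proprietary and open-source models under Chain-of-Thought self-refinement and finds that, despite progress, performance remains far from clinical reliability.
Details
Motivation: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to generate executable SQL, and existing methods fall short of the reliability that clinical use demands.
Result: On the test set, GPT-5-mini attains a 74.7% execution score, DeepSeek-R1 leads open-source models at 69.2%, and Gemini-2.5-Pro scores 85.5% on Easy tasks but drops to 67.2% on Hard tasks, showing that a gap to clinical reliability remains.
Insight: The novelty is the CLINSQL benchmark, which emphasizes multi-table joins, clinically meaningful filters, and executable SQL, and which uses rubric-based SQL analysis with execution checks to prioritize critical clinical requirements. Viewed objectively, by involving clinical coding systems and long contexts, the work pushes toward reliable text-to-SQL for real-world EHR analytics.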
Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.
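An execution check of the kind mentioned in the abstract can be sketched with Python's built-in `sqlite3` module: run the predicted query and a reference query against the same database and compare the returned rows. This is a minimal illustration only. The `admissions` table, its columns, and its rows are hypothetical stand-ins rather than the MIMIC-IV schema, and the benchmark's rubric-based scoring is not reproduced here.

```python
import sqlite3
from collections import Counter

def build_toy_db():
    """In-memory database with a hypothetical admissions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE admissions (patient_id INTEGER, age INTEGER, los_days REAL)"
    )
    conn.executemany(
        "INSERT INTO admissions VALUES (?, ?, ?)",
        [(1, 70, 3.0), (2, 45, 10.5), (3, 82, 7.0)],
    )
    return conn


def same_result(conn, predicted_sql, reference_sql):
    """True when both queries run and return the same multiset of rows.

    Row order is ignored. A predicted query that fails to execute
    counts as a mismatch instead of raising.
    """
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    reference = conn.execute(reference_sql).fetchall()
    return Counter(predicted) == Counter(reference)
```

Comparing row multisets rather than SQL text lets differently written but equivalent queries pass, while a query that selects the wrong cohort or does not execute fails.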
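[16] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction cs.CL
Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim
TL;DR: SocraticKG is an automated knowledge graph construction method that uses question-answer pairs as a structured intermediate representation. It systematically unfolds document-level semantics through 5W1H-guided QA expansion before triple extraction, addressing the trade-off between factual coverage and relational coherence in existing LLM-based methods.
Details
Motivation: Current LLM-based knowledge graph construction faces a fundamental trade-off: pursuing factual coverage leads to relational fragmentation, while premature consolidation causes information loss.
Result: Evaluation on the MINE benchmark shows that the method effectively addresses the coverage-connectivity trade-off, achieving better factual retention and higher structural cohesion even as the volume of extracted knowledge grows substantially.
Insight: The novelty is the use of QA pairs as an intermediate representation to unfold document semantics systematically. 5W1H-guided QA expansion captures contextual dependencies and implicit relational links and gives later graph construction explicit grounding in the source document, which reduces implicit reasoning errors and makes the resulting graph more coherent and reliable.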
Abstract: Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.
[17] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records cs.CLPDF
Lingfei Qian, Mauro Giuffre, Yan Wang, Huan He, Qianqian Xie
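TL;DR: This paper presents EHRNavigator, a multi-agent system for patient-level clinical question answering over heterogeneous electronic health records. The system uses AI agents to process multimodal EHR data and is evaluated under realistic hospital conditions involving diverse schemas, temporal reasoning demands, and multimodal evidence integration.
Details
Motivation: Most existing clinical natural-language QA systems are evaluated only on benchmark datasets, which limits their practical value. A system is needed that adapts to real, heterogeneous EHR environments and supports patient-level question answering.
Result: Evaluation on public benchmarks and institutional datasets shows that EHRNavigator reaches 86% accuracy on real-world cases while maintaining clinically acceptable response times, demonstrating strong generalization.
Insight: The novelty is a multi-agent framework that bridges the gap between benchmark evaluation and clinical deployment. Agents collaborate over heterogeneous, multimodal EHR data to provide a robust, adaptive, and efficient solution for real-world clinical QA.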
Abstract: Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.
[18] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment cs.CLPDF
Zhenghao Liu, Zhuoyang Wu, Xinze Li, Yukun Yan, Shuo Wang
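TL;DR: This paper proposes P-ALIGN, a distillation framework that addresses the difficulty small student models have in learning from the long, complex reasoning trajectories generated by large language models (LLMs). Through adaptive prefix alignment, it truncates the teacher's reasoning trajectory to a prefix that is concise yet sufficient to guide the student and supervises on that prefix, improving the student's reasoning performance.
Details
Motivation: Long teacher-generated reasoning trajectories are structurally complex and mismatched with the student model's learning capacity, which makes the supervision signal inefficient. The paper aims to close this gap by adaptively selecting and using the key prefix of each trajectory, improving knowledge distillation on complex reasoning tasks.
Result: Experiments on multiple mathematical reasoning benchmarks show that P-ALIGN outperforms all baselines by over 3%. Analysis indicates that the prefixes it constructs provide a more effective supervision signal and avoid the negative impact of redundant and uncertain reasoning components.
Insight: The novelty is a distillation mechanism based on adaptive prefix alignment. The core idea is to dynamically identify the concise prefix of the teacher's trajectory that matters most for the student and supervise on it, rather than on the full lengthy trajectory, which offers a new angle on knowledge distillation for complex reasoning tasks.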
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P-ALIGN.
[19] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature cs.CL | cs.AI | cs.MMPDF
Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin
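TL;DR: This paper proposes SIN-Bench, a benchmark for evaluating how well multimodal large language models trace cross-modal evidence chains in long scientific documents. Built on the FITO paradigm, it requires models to construct explicit evidence chains within native scientific documents and comprises four progressive tasks: evidence discovery, hypothesis verification, grounded QA, and evidence-anchored summarization.
Details
Motivation: Existing evaluation methods, such as answer-only matching and synthetic "Needle-In-A-Haystack" tests, cannot verify whether a model performed causal, evidence-linked reasoning over the document. A new paradigm is therefore needed to assess genuine understanding of long scientific papers.
Result: Experiments on eight MLLMs show that evidence-grounded reasoning is the main bottleneck. Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, revealing a gap between answer correctness and traceable evidence support.
Insight: The novelty is the "Fish-in-the-Ocean" evaluation paradigm, which emphasizes building explicit cross-modal evidence chains within native scientific documents, together with the "No Evidence, No Score" scoring mechanism, which diagnoses evidence quality via matching, relevance, and logic. This offers a new way to assess deeper reasoning ability.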
Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic “Needle-In-A-Haystack” tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the “Fish-in-the-Ocean” (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce “No Evidence, No Score”, scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
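The "No Evidence, No Score" idea can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption: `gated_score`, its arguments, and the precision-style evidence weight are hypothetical and only show an answer earning credit when it cites at least one verifiable anchor.

```python
def gated_score(answer_correct: bool, cited: set, gold: set) -> float:
    """Hypothetical evidence-gated score (not the benchmark's actual metric).

    cited: anchors (e.g. figure or paragraph ids) the model points to.
    gold:  anchors annotated as valid evidence for the question.
    """
    matched = cited & gold
    if not matched:
        # No verifiable evidence: the answer earns nothing, even if correct.
        return 0.0
    # Assumed weighting: fraction of cited anchors that are verifiable.
    evidence_precision = len(matched) / len(cited)
    return (1.0 if answer_correct else 0.0) * evidence_precision
```

Under this assumed gate, a correct answer with no matching anchor scores zero, which is the kind of gap between correctness and traceable support that the abstract describes.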
[20] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation cs.CLPDF
Lechen Zhang, Yunxiang Zhang, Wei Hu, Lu Wang
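TL;DR: This paper proposes a skill-centric framework for efficient reasoning distillation. Through skill-based data selection and skill-aware fine-tuning, it transfers the reasoning ability of large reasoning models to weaker student models using only a small amount of data.
Details
Motivation: Distilling reasoning models usually requires large-scale supervised fine-tuning data. The paper targets data-efficient training to reduce this dependence on large datasets.
Result: Across five mathematical reasoning benchmarks, using only 1,000 training examples selected from a 100K teacher-generated corpus, the method improves over random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B.
Insight: The novelty is decomposing reasoning ability into concrete skills and targeting data selection and fine-tuning at the student model's weak skills. This yields data-efficient gains that concentrate on the skills emphasized during training.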
Abstract: Large reasoning models such as DeepSeek-R1 and their distilled variants achieve strong performance on complex reasoning tasks. Yet, distilling these models often demands large-scale data for supervised fine-tuning (SFT), motivating the pursuit of data-efficient training methods. To address this, we propose a skill-centric distillation framework that efficiently transfers reasoning ability to weaker models with two components: (1) Skill-based data selection, which prioritizes examples targeting the student model’s weaker skills, and (2) Skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. With only 1,000 training examples selected from a 100K teacher-generated corpus, our method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks. Further analysis confirms that these gains concentrate on skills emphasized during training, highlighting the effectiveness of skill-centric training for efficient reasoning distillation.
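Skill-based data selection can be sketched as ranking examples by how weak the student is on the skills they exercise. This is an illustrative assumption, not the paper's procedure: the skill tags, the per-skill accuracy table, and the mean-weakness score are all hypothetical.

```python
def select_examples(examples: list, skill_accuracy: dict, k: int) -> list:
    """Return the k examples that most target the student's weaker skills.

    examples:       dicts with an "id" and a non-empty "skills" list (assumed schema).
    skill_accuracy: student's accuracy per skill in [0, 1]; unseen skills count as 0.
    """
    def weakness(example: dict) -> float:
        skills = example["skills"]
        # Mean of (1 - accuracy): higher means the student is weaker here.
        return sum(1.0 - skill_accuracy.get(s, 0.0) for s in skills) / len(skills)

    # sorted() is stable, so ties keep their original corpus order.
    return sorted(examples, key=weakness, reverse=True)[:k]


corpus = [
    {"id": "a", "skills": ["algebra"]},
    {"id": "b", "skills": ["geometry"]},
    {"id": "c", "skills": ["algebra", "geometry"]},
]
picked = select_examples(corpus, {"algebra": 0.9, "geometry": 0.4}, k=2)
```

Here the student is assumed weaker at geometry, so the geometry-heavy examples are ranked ahead of the algebra-only one.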
[21] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends cs.CL | cs.AI | cs.HCPDF
Ye Wang, Jiaxing Chen, Hongjiang Xiao
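TL;DR: This paper systematically surveys role-playing language agents (RPLAs) driven by large language models: their current status, key technologies, data construction, evaluation methods, and future trends. It traces the technical evolution from early rule-based templates, through language-style imitation, to cognitive simulation centered on personality modeling and memory mechanisms, and summarizes the key technical pathways, data challenges, and multi-dimensional evaluation frameworks behind high-quality role-playing.
Details
Motivation: With the rapid development of large language models, role-playing language agents have become a research hotspot at the intersection of natural language processing and human-computer interaction. The paper aims to organize the field's current progress and key technical challenges and to offer a systematic perspective and methodological reference for later research.
Result: As a survey, the paper proposes no specific model or experiment and therefore reports no quantitative results. It collates and comments on existing multi-dimensional evaluation frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination.
Insight: The contribution is a systematic account of the technical evolution of RPLAs and a distillation of the key technical pathways (psychological-scale-driven character modeling, memory-augmented prompting, and motivation-situation-based behavioral decisions). The future directions it outlines (personality evolution modeling, multi-agent collaborative narrative, and multimodal immersive interaction) give the field a clear roadmap.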
Abstract: In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.
[22] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback cs.CLPDF
Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang
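TL;DR: This paper proposes ToolSafe, a framework for improving the safety of LLM-based agents when they invoke external tools. It has three parts: TS-Bench, a new benchmark for evaluating step-level tool invocation safety; TS-Guard, a guardrail model trained with multi-task reinforcement learning that proactively detects unsafe invocations before tool execution; and TS-Flow, a framework that combines guardrails with feedback to substantially reduce harmful tool invocations and raise benign task completion.
Details
Motivation: The ability of LLM agents to interact with environments through external tools extends what they can do but also amplifies safety risks. Monitoring step-level invocation behavior in real time and intervening proactively before execution is critical for agent deployment, yet this area remains under-explored.
Result: TS-Flow reduces harmful tool invocations of ReAct-style agents by 65% on average and improves benign task completion by about 10% under prompt injection attacks.
Insight: The main contributions are: 1) TS-Bench, the first benchmark focused on step-level tool invocation safety detection; 2) TS-Guard, a proactive guardrail model based on multi-task reinforcement learning that reasons over the interaction history to assess request harmfulness and action-attack correlation, producing interpretable and generalizable safety judgments; 3) TS-Flow, a guardrail-feedback-driven reasoning framework that integrates the safety mechanism into the agent's decision loop, improving safety and task performance together.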
Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
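The step-level pattern described above, checking each proposed tool call before execution and returning feedback when it is blocked, can be sketched as follows. This is a generic illustration under stated assumptions, not TS-Guard or TS-Flow: `guarded_step`, `toy_guard`, and the action schema are hypothetical, and the hand-written rule stands in for a learned guardrail model.

```python
from typing import Callable, Tuple


def guarded_step(
    action: dict,
    guard: Callable[[dict], Tuple[bool, str]],
    execute: Callable[[dict], str],
) -> dict:
    """Check a proposed tool call before running it."""
    is_safe, reason = guard(action)
    if not is_safe:
        # The tool is never executed; the agent gets the reason as feedback
        # and can revise its next step.
        return {"executed": False, "feedback": reason}
    return {"executed": True, "result": execute(action)}


def toy_guard(action: dict) -> Tuple[bool, str]:
    # Hypothetical rule standing in for a learned guardrail model.
    if action["tool"] == "send_email" and "password" in action["args"].get("body", ""):
        return False, "blocked: message body appears to leak a credential"
    return True, "ok"
```

Returning the reason rather than silently dropping the call is what lets the agent continue a benign task after a blocked step.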
[23] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection cs.CLPDF
Nhung Nguyen Thi Hong, Cuong Nguyen Dang, Tri Le Ngoc
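TL;DR: This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection. Within a single reasoning-based framework, the model integrates dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction. Evaluated on proprietary annotated datasets, it outperforms traditional pipeline approaches.
Details
Motivation: Debt collection is a critical function in banking, financial services, and insurance, and relies heavily on large-scale human-to-human conversations in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems.
Result: Experiments show that the model consistently outperforms traditional pipeline-based approaches on proprietary human-annotated datasets, providing a scalable and privacy-aware solution for real-time assistance and post-call analytics.
Insight: The novelty is a domain-specialized large language model for Vietnamese debt collection that integrates multiple conversational intelligence tasks in a single reasoning framework, enabling end-to-end optimization, and that highlights the potential for scalable solutions in privacy-conscious enterprise settings.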
Abstract: Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.
[24] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning cs.CL | cs.AIPDF
Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi
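TL;DR: This paper targets the cross-lingual verbosity bias of large language models in translation and proposes HOMURA, a reinforcement learning framework that optimizes the balance between semantic fidelity and temporal feasibility under syllable-level duration constraints.
Details
Motivation: In strictly time-constrained tasks such as subtitling and dubbing, cross-lingual verbosity bias makes LLM translations too long, and existing prompt-engineering approaches struggle to balance semantic accuracy against the time limit.
Result: On Sand-Glass, a benchmark designed for syllable-level duration constraints, HOMURA significantly outperforms strong LLM baselines, achieving precise length control while preserving semantic adequacy.
Insight: The contributions include a KL-regularized objective with a dynamic syllable-ratio reward that explicitly optimizes translation length, and Sand-Glass, a dedicated benchmark for time-constrained translation, which together offer an effective solution for practical applications.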
Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively “tames” the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
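The trade-off between semantic preservation and temporal compliance can be illustrated with a toy reward. The abstract does not define the dynamic syllable-ratio reward, so this is a static simplification under stated assumptions: `duration_reward`, the linear overshoot penalty, and the `penalty` weight are hypothetical, and syllable counts are assumed to be computed elsewhere.

```python
def duration_reward(
    semantic_score: float,
    output_syllables: int,
    budget_syllables: int,
    penalty: float = 1.0,
) -> float:
    """Toy reward: semantic quality minus a penalty for exceeding the syllable budget."""
    if budget_syllables <= 0:
        raise ValueError("budget_syllables must be positive")
    ratio = output_syllables / budget_syllables
    # Only overshoot is penalized; translations within budget keep their full score.
    overshoot = max(0.0, ratio - 1.0)
    return semantic_score - penalty * overshoot
```

In an RL setup such a scalar would be combined with a KL term against the reference policy, which is omitted here.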
[25] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns cs.CLPDF
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang
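TL;DR: This paper proposes HUMANLLM, a framework that strengthens LLM anthropomorphism by treating human psychological patterns as interacting causal forces. It constructs 244 psychological patterns from about 12,000 academic papers and synthesizes 11,359 multi-turn conversation scenarios in which 2-5 patterns reinforce, conflict with, or modulate each other. Dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91).
Details
Motivation: Although LLMs excel at reasoning and generation and underpin advanced persona simulation and Role-Playing Language Agents (RPLAs), achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge.
Result: HUMANLLM-8B outperforms Qwen3-32B, a model with 4x as many parameters, on multi-pattern dynamics. The dual-level checklists reach strong agreement with human judgments (correlation r=0.91), and the study shows that holistic metrics conflate simulation accuracy with social desirability.
Insight: The core novelty is modeling psychological patterns as interacting causal forces and building a large, structured dataset of multi-pattern interaction scenarios. The key insight is that authentic anthropomorphism requires cognitive modeling, simulating not only what humans do but also the psychological processes that produce those behaviors, rather than merely adding model parameters.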
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling: simulating not just what humans do, but the psychological processes generating those behaviors.
[26] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients cs.CLPDF
Kentaro Kazama, Daiki Shirafuji, Tatsuhiko Saito
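TL;DR: This paper proposes GeoSteer, a manifold-learning-based framework for improving the quality of intermediate steps in the Chain-of-Thought (CoT) reasoning of large language models (LLMs). It builds a CoT dataset with segment-level scores, trains a Variational Autoencoder (VAE) and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and steers the target LLM's hidden states toward higher-quality regions in the latent space, yielding geometrically coherent gradient adjustments.
Details
Motivation: When LLMs generate chains of thought, the intermediate reasoning steps are often logically inconsistent even when the final answer is correct, which reduces the reliability of step-level reasoning. The paper aims to improve intermediate reasoning quality through manifold-guided steering.
Result: Evaluated on GSM8k with the Qwen3 series, GeoSteer improves exact-match accuracy by up to 2.6 points and the pairwise win rate by 5.3 points, indicating that it effectively improves the quality of intermediate reasoning.
Insight: The novelty is modeling high-quality CoT trajectories as a low-dimensional manifold and steering hidden states in the latent space in a natural-gradient-like manner, which offers a new approach to improving the coherence and controllability of LLM reasoning steps.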
Abstract: Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in the latent space behaves like a natural-gradient adjustment in the original hidden-state space, which ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series, measuring answer accuracy and overall reasoning performance. GeoSteer improves exact-match accuracy by up to 2.6 points and the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.
[27] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts cs.CLPDF
Prottay Kumar Adhikary, Reena Rawat, Tanmoy Chakraborty
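TL;DR: This paper presents coTherapist, a unified framework that uses a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning, with the aim of supporting mental healthcare experts. Evaluation shows that on clinical queries it produces more relevant and clinically grounded responses than existing baselines, and that it exhibits high empathy and therapist-consistent personality traits.
Details
Motivation: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, which motivates intelligent systems that can support mental healthcare experts.
Result: On clinical queries, coTherapist generates more relevant and clinically grounded responses than contemporary baselines. The novel T-BARS rubric and psychometric profiling confirm high empathy and therapist-consistent personality traits, and human evaluation by domain experts validates that its responses are accurate, trustworthy, and safe.
Insight: The novelty is a unified framework of domain-specific fine-tuning, retrieval augmentation, and agentic reasoning that lets a small language model exhibit expert-like behavior, offering a scalable path for digital mental health tools. Viewed objectively, engineering small models into behavior-aligned expert systems is a promising practical direction.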
Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.
[28] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs cs.CL | cs.AIPDF
Nan Li, Bo Kang, Tijl De Bie
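TL;DR: This paper proposes a diagnostic framework that disentangles the effects of input language and reasoning language on the cross-lingual moral alignment of large language models (LLMs). By manipulating the two factors separately, including both matched and mismatched conditions, and interpreting results through Moral Foundations Theory, the framework quantifies and diagnoses why a model's moral judgments differ across languages.
Details
Motivation: When LLMs judge moral dilemmas, are cross-language differences in their conclusions driven by the language of the dilemma (the input language) or by the language in which the model reasons (the reasoning language)? Standard evaluation conflates the two and cannot separate their contributions.
Result: English-Chinese moral judgment experiments on 13 LLMs show that: 1) reasoning-language effects contribute twice the variance of input-language effects; 2) the framework detects context-dependency, missed by standard evaluation, in nearly half of the models; 3) a taxonomy built from the diagnostic results can guide model deployment.
Insight: The main novelty is a diagnostic evaluation framework that decouples the influence of input language from that of reasoning language, using matched and mismatched conditions to decompose their contributions. Viewed objectively, the method provides a systematic tool for understanding and diagnosing inconsistent LLM behavior in multilingual settings, particularly for complex tasks such as moral judgment. It applies Moral Foundations Theory to interpret the results and even suggests a refinement of the theory, splitting the Authority dimension into a family-related and an institutional dimension.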
Abstract: When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study what changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at https://anonymous.4open.science/r/CrossCulturalMoralJudgement.
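Crossing input language with reasoning language gives a 2x2 design, and the contribution of each factor can be read off as a main effect. The sketch below is a textbook factorial decomposition offered as an assumption about how such a separation could be computed. It is not the authors' analysis, and the cell means are made-up illustration values.

```python
def main_effects(cell: dict) -> dict:
    """Main effects and interaction for a 2x2 (input language x reasoning language) design.

    cell[(input_lang, reasoning_lang)] holds the mean judgment score for that condition.
    """
    en_en, en_zh = cell[("en", "en")], cell[("en", "zh")]
    zh_en, zh_zh = cell[("zh", "en")], cell[("zh", "zh")]
    return {
        # Switching the dilemma's language, averaged over reasoning languages.
        "input": ((zh_en + zh_zh) - (en_en + en_zh)) / 2,
        # Switching the reasoning language, averaged over input languages.
        "reasoning": ((en_zh + zh_zh) - (en_en + zh_en)) / 2,
        # Nonzero when the reasoning-language effect depends on the input language.
        "interaction": (zh_zh - zh_en) - (en_zh - en_en),
    }


# Made-up cell means, for illustration only.
effects = main_effects({
    ("en", "en"): 0.50, ("en", "zh"): 0.70,
    ("zh", "en"): 0.60, ("zh", "zh"): 0.80,
})
```

Testing only the matched cells (en/en and zh/zh) would show a single combined difference, which is why the mismatched conditions are needed to tell the two factors apart.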
[29] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts cs.CL | cs.AI | cs.LG | cs.SDPDF
Yuxuan Lou, Kai Yang, Yang You
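TL;DR: MoST is a new multimodal large language model that integrates speech and text processing through the proposed Modality-Aware Mixture of Experts architecture. It uses modality-specific expert groups and shared experts to strengthen both modality-specific learning and cross-modal understanding, and an efficient training pipeline gives it strong results on multiple speech and text benchmarks.
Details
Motivation: Current multimodal models typically process different modality representations with identical parameters, ignoring their inherent representational differences. MoST addresses this with a modality-aware routing mechanism for more effective joint processing of speech and text.
Result: On ASR, TTS, audio language modeling, and spoken question answering benchmarks, MoST consistently outperforms existing models of comparable parameter count, and ablation studies confirm the effectiveness of modality-specific routing and the shared-expert design.
Insight: The contributions include the Modality-Aware Mixture of Experts architecture, which promotes cross-modal learning through modality-specific routing and shared experts, and a training pipeline that relies only on open-source datasets while achieving data efficiency and strong performance. According to the authors, MoST is the first fully open-source speech-text LLM built on a Mixture of Experts architecture.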
Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release the MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST.
[30] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis cs.CLPDF
Songsong Tian, Kongsheng Zhuo, Zhendong Wang, Rong Shen, Shengtao Zhang
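TL;DR: This paper presents BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into natural-language-to-SQL generation. It builds an enterprise-style corpus with a Seed Mutation data synthesis paradigm and produces interpretable Chain-of-Thought traces with Knowledge-Grounded Reasoning Synthesis. The model is trained in two stages, supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization, using a Task-Conditioned Hybrid Reward. On the authors' Ent-SQL-Bench benchmark, BAR-SQL surpasses leading closed-source models in both SQL generation quality and boundary-aware abstention.
Details
Motivation: Existing NL2SQL systems lack reliability and boundary awareness when they face ambiguous queries or queries that are unanswerable within the schema. The goal is a system that generates accurate SQL and abstains safely when necessary.
Result: On the authors' Ent-SQL-Bench benchmark, BAR-SQL achieves 91.48% average accuracy, surpassing leading proprietary models such as Claude 4.5 Sonnet and GPT-5 in both SQL generation quality and boundary-aware abstention.
Insight: The main contributions are: 1) a Seed Mutation data synthesis paradigm for building a representative enterprise corpus that includes boundary cases; 2) Knowledge-Grounded Reasoning Synthesis, which generates interpretable Chain-of-Thought traces anchored in schema metadata and business rules; 3) a Task-Conditioned Hybrid Reward that jointly optimizes SQL execution accuracy and semantically precise abstention responses; 4) Group Relative Policy Optimization for the reinforcement learning stage. Together these integrate reliability and boundary awareness into model training.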
Abstract: In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability. The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR-SQL.
[31] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit cs.CL | cs.IR | cs.LG | cs.SIPDF
Warren Jouanneau, Emma Jouffroy, Marc Palyart
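TL;DR: This paper proposes a re-ranking model based on a new generation of late cross-attention architecture for efficiently matching long resumes with job descriptions. A generative large language model acts as a teacher that produces fine-grained semantic supervision, which is distilled into the student model through an enriched distillation loss to mitigate historical data biases and deliver consistent, interpretable person-job matching.
Details
Motivation: The paper addresses the challenge of finding the most relevant candidates for a job proposal in real time when resumes are long, structured, and multilingual, while mitigating biases present in historical data.
Result: Experiments on relevance, ranking, and calibration metrics show that the approach outperforms existing state-of-the-art baselines.
Insight: The contributions include a late cross-attention architecture that decomposes long inputs to reduce computational overhead, the use of a generative LLM as a teacher providing fine-grained, semantically grounded supervision, and an enriched distillation loss for knowledge transfer, which together improve performance and interpretability.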
Abstract: Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture, that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.
[32] Training-Trajectory-Aware Token Selection cs.CL | cs.AI | cs.LGPDF
Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye
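TL;DR: This paper proposes Training-Trajectory-Aware Token Selection (T3S) to improve efficient knowledge distillation. By identifying and separating Imitation-Anchor Tokens from yet-to-learn tokens, it reconstructs the token-level training objective, addressing the performance bottleneck or even degradation that arises when continual distillation is applied to a student that already reasons well.
Details
Motivation: Efficient distillation converts expensive reasoning capability into deployable efficiency, but when the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degrades performance. The authors observe a characteristic bottleneck during training: while the loss decreases monotonically, all performance metrics drop sharply and then gradually recover.
Result: T3S yields consistent gains in both autoregressive (AR) and diffusion LLM (dLLM) settings. With only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, reaching state-of-the-art performance among all 16B-scale no-think models.
Insight: The novelty is tracing the failure of continual distillation to a token-level confidence bifurcation, in which Imitation-Anchor Tokens and yet-to-learn tokens cannot coexist, and proposing a training-trajectory-aware token selection mechanism that reshapes the optimization path. Viewed objectively, the method offers a fine-grained distillation strategy based on training dynamics and a new approach to distilling into strong student models.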
Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two token types to coexist is the root cause of the failure of continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.
[33] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models cs.CLPDF
Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
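TL;DR: This paper studies how large language models (LLMs) represent different personas and finds a dominant "Assistant Axis" that defines the model's default Assistant persona. Steering activations along this axis strengthens or weakens Assistant behavior, and deviation from it leads to "persona drift," in which the model exhibits harmful or bizarre behavior. The study also shows that restricting activations to a specific region of the Assistant Axis stabilizes model behavior and resists persona-based adversarial jailbreaks.
Details
Motivation: The aim is to explore the structure of the LLM persona space, understand how the default Assistant persona forms and how stable it is, and address persona drift, where a model departs from its typical Assistant behavior during a conversation and shows harmful or bizarre tendencies.
Result: An Assistant Axis is identified across several models, both pre-trained and post-trained. Steering toward the Assistant direction reinforces helpful and harmless behavior, while steering away makes the model more likely to identify as other entities and, at extreme values, induces a mystical, theatrical speaking style. Restricting activations to a fixed region along the axis stabilizes model behavior and holds up against persona-based adversarial jailbreaks.
Insight: The novelty is quantifying the structure of the LLM persona space by extracting activation directions, revealing the Assistant Axis as a key dimension, and proposing to use it to monitor and stabilize the model's persona, which suggests training and steering strategies that anchor models more firmly to a coherent persona.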
Abstract: Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an “Assistant Axis,” which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios – and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
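Restricting activations to a fixed region along an axis can be pictured as clamping the activation's projection onto that direction while leaving the orthogonal part unchanged. This is one plausible reading, sketched under stated assumptions: `clamp_along_axis`, the plain-list vectors, and the bounds are hypothetical, and the paper's actual procedure may differ.

```python
import math


def clamp_along_axis(hidden: list, axis: list, lo: float, hi: float) -> list:
    """Clamp the component of `hidden` along `axis` to [lo, hi]."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be a nonzero vector")
    unit = [a / norm for a in axis]
    # Scalar projection of the activation onto the axis direction.
    projection = sum(h * u for h, u in zip(hidden, unit))
    clamped = min(max(projection, lo), hi)
    # Shift only along the axis; the orthogonal component is unchanged.
    return [h + (clamped - projection) * u for h, u in zip(hidden, unit)]
```

An activation whose projection already lies inside the range is returned unchanged, so the intervention only acts once the model has moved outside the allowed region.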
[34] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability cs.CL | cs.RO
Ruochen Li, Kun Yuan, Yufei Xia, Yue Zhou, Qingyu Lu
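TL;DR: This paper proposes a goal-satisfiability view of surgical planning evaluation, defines plan correctness by whether phase goals are satisfied, and builds a multicentric meta-evaluation benchmark containing valid procedural variations and invalid plans. It finds that sequence similarity metrics systematically misjudge plan quality, so it adopts a rule-based goal-satisfiability metric as the meta-evaluation reference. That reference exposes Video-LLM failures caused by perception errors and under-constrained reasoning, and shows that structural knowledge consistently improves performance.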
Details
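Motivation: It is unclear whether current evaluation protocols reliably assess the surgical planning ability of vision-language models (VLMs) in safety-critical settings, so the paper rethinks surgical planning evaluation from a goal-oriented perspective.
Result: On the constructed multicentric meta-evaluation benchmark, sequence similarity metrics are shown to penalize valid plans while failing to identify invalid ones. Evaluation against the rule-based goal-satisfiability metric reveals specific failure modes of Video-LLMs in perception and reasoning.
Insight: The core contribution is defining surgical plan correctness as phase-goal satisfiability and building a meta-evaluation benchmark with procedural variations on that basis. The rule-based metric offers a more reliable, domain-consistent alternative for evaluating planning in safety-critical domains, and the results underline the importance of structural knowledge over purely semantic guidance.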
Abstract: Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
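The contrast between sequence similarity and rule-based goal satisfiability can be illustrated with a toy example. Everything here is hypothetical: the step names, the required-step set, and the precedence rules are invented for illustration and are not the expert-defined rules from the paper; `difflib.SequenceMatcher` stands in for a generic sequence similarity metric.

```python
from difflib import SequenceMatcher

# Hypothetical phase goal: these steps must all occur, in this relative order.
REQUIRED = {"expose", "clip", "cut"}
PRECEDENCE = [("expose", "clip"), ("clip", "cut")]

def satisfies_goal(plan):
    """Rule check: all required steps present and every precedence rule respected."""
    if not REQUIRED.issubset(plan):
        return False
    return all(plan.index(a) < plan.index(b) for a, b in PRECEDENCE)

def similarity(plan, reference):
    """Generic sequence similarity in [0, 1] against a single reference plan."""
    return SequenceMatcher(None, plan, reference).ratio()

reference = ["expose", "clip", "cut"]
# Valid variation: extra optional steps, required order preserved.
valid_variant = ["irrigate", "expose", "retract", "irrigate", "clip", "coagulate", "cut"]
# Invalid plan: cutting before clipping violates a precedence rule.
invalid_plan = ["expose", "cut", "clip"]

for name, plan in [("valid_variant", valid_variant), ("invalid_plan", invalid_plan)]:
    print(name, satisfies_goal(plan), round(similarity(plan, reference), 3))
```

In this toy case the invalid plan scores higher on similarity than the valid variation, while the rule check classifies both correctly, which is the kind of misjudgment the abstract describes.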
[35] DR-Arena: an Automated Evaluation Framework for Deep Research Agents cs.CL
Yiwen Gao, Ruochen Zhao, Yang Deng, Wenxuan Zhang
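TL;DR: This paper presents DR-Arena, an automated framework for evaluating Deep Research (DR) agents that addresses the limited task generality, temporal misalignment, and data contamination of static-dataset evaluation. It dynamically builds real-time Information Trees and runs an Adaptive Evolvement Loop to test agents' deep reasoning and wide coverage, and its rankings correlate strongly with human-preference evaluation.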
Details
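Motivation: Evaluation on static datasets suffers from limited task generality, temporal misalignment, and data contamination, and cannot reliably assess Deep Research agents that autonomously investigate and synthesize information. A dynamic, automated evaluation framework is therefore needed.
Result: Experiments with six advanced DR agents show that DR-Arena reaches a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard, achieving state-of-the-art alignment with human preferences without manual intervention.
Insight: The contributions are: building Information Trees from live web trends so that evaluation stays synchronized with the current state of the world; an Adaptive Evolvement Loop that adjusts task complexity dynamically to probe capability boundaries; and an automated Examiner that tests two orthogonal capabilities, deep reasoning and wide coverage. Together they give a reliable and efficient automated way to evaluate autonomous research agents.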
Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.
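To make the reported agreement measure concrete, here is a small sketch of Spearman rank correlation for rankings without ties. The scores below are invented for illustration and are not taken from the paper or from any leaderboard.

```python
def ranks(values):
    """Rank 1 = highest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for six agents under two evaluation schemes.
human_arena = [1200, 1150, 1100, 1050, 1000, 950]
automated = [0.9, 0.8, 0.6, 0.7, 0.5, 0.4]   # agents 3 and 4 swap places

rho = spearman(human_arena, automated)
print(round(rho, 3))
```

As a purely arithmetic observation, with six items and no ties a single swap of two adjacent ranks gives 1 - 12/210, roughly 0.943. This says nothing about how the rankings in the paper actually differ.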
[36] AEQ-Bench: Measuring Empathy of Omni-Modal Large Models cs.CL | cs.HC
Xuan Luo, Lewei Yao, Libo Zhao, Lanqing Hong, Kai Chen
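TL;DR: This paper introduces AEQ-Bench (Audio Empathy Quotient Benchmark), a new benchmark for systematically evaluating the empathy of omni-modal large models (OLMs). It focuses on a model's ability to generate empathetic responses from multimodal audio and text input, and to judge the empathy of audio responses directly, without relying on text transcription.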
Details
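Motivation: Automatic evaluation of omni-modal large models is essential, but assessing empathy remains a major challenge because of its inherently affective nature. The paper sets out to address this challenge.
Result: Evaluation across linguistic and paralinguistic metrics shows that OLMs with audio output capabilities generally outperform models with text-only output, and that OLMs agree with human judgments on coarse-grained quality assessment but remain unreliable at evaluating fine-grained paralinguistic expressiveness.
Insight: The benchmark is built on two novel settings that vary context specificity and speech tone, and it provides the first systematic evaluation of OLMs' empathetic generation and judgment in the audio modality. The results highlight the importance of audio output capability for empathetic performance and the limits of current models in fine-grained paralinguistic evaluation.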
Abstract: While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.
[37] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models cs.CL
Chengbing Wang, Wuqiang Zheng, Yang Zhang, Fengbin Zhu, Junyi Cheng
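TL;DR: This paper proposes Psychology-grounded Empathetic Reward Modeling (PERM), a reward modeling method grounded in psychological theory that improves the empathy of large language models in emotional-support dialogue. It evaluates empathy through a bidirectional decomposition across three perspectives (supporter, seeker, and bystander), overcoming the single-perspective limitation of existing methods. Experiments show that PERM significantly outperforms existing methods on an emotional intelligence benchmark and an industrial daily conversation dataset.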
Details
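Motivation: Existing reinforcement-learning methods usually evaluate LLM empathy from a single perspective and overlook the bidirectional supporter-seeker interaction described by Empathy Cycle theory, which leaves models unable to provide substantive emotional support.
Result: On a widely used emotional intelligence benchmark and an industrial daily conversation dataset, PERM outperforms state-of-the-art baselines by over 10%. A blinded user study shows a 70% preference for PERM's responses.
Insight: The core contribution is operationalizing Empathy Cycle theory from psychology. Empathy is evaluated through a bidirectional decomposition over the supporter (internal resonance and communicative expression), the seeker (emotional reception), and a bystander (overall interaction quality), yielding a more comprehensive reward model that better reflects the nature of human interaction. This offers a modeling approach for building dialogue AI that provides better emotional support.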
Abstract: Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at https://github.com/ZhengWwwq/PERM.
[38] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure cs.CL | cs.LG
Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib
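TL;DR: This paper proposes the Knowledge Immunization Framework (KIF), a representation-aware unlearning method that aims at genuine knowledge erasure from large language models rather than mere suppression of surface behavior. By targeting internal activation signatures with dynamic suppression and parameter-efficient adaptation, it achieves near-oracle erasure while preserving model utility, breaking the trade-off between stability and erasure seen in existing methods.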
Details
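Motivation: Current selective unlearning methods for LLMs tend to conflate behavioral suppression with true knowledge removal, so latent capabilities persist beneath surface-level refusals. This falls short of what GDPR compliance and model safety require. The paper addresses the problem by distinguishing genuine erasure from obfuscation.
Result: KIF is evaluated on standard foundation models (Llama, Mistral) and reasoning-prior models (Qwen, DeepSeek) ranging from 3B to 14B parameters. It achieves near-oracle erasure (forgetting quality FQ ≈ 0.99 vs. an oracle value of 1.00) while preserving utility at oracle levels (model utility MU = 0.62). Standard models show scale-independent true erasure (utility drift below 3%), whereas reasoning-prior models reveal a fundamental architectural divergence.
Insight: The core contribution is a representation-aware unlearning framework based on internal activation signatures, which moves the erasure target from surface outputs to internal representations and so makes it possible to distinguish and achieve genuine knowledge removal. The dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales. The combination of parameter-efficient adaptation with dynamic suppression is also a useful idea for addressing the stability-erasure trade-off.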
Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
[39] Detecting Winning Arguments with Large Language Models and Persuasion Strategies cs.CL
Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino
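TL;DR: This paper proposes a method based on large language models (LLMs) and Multi-Strategy Persuasion Scoring for detecting persuasiveness in argumentative text. Guiding the model to reason over six persuasion strategies (such as attack on reputation, distraction, and manipulative wording) improves prediction performance. Experiments cover three annotated datasets (Winning Arguments, Anthropic/Persuasion, and Persuasion for Good), and the Winning Arguments dataset is additionally organized and analyzed by discussion topic.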
Details
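Motivation: Detecting persuasion in argumentative text is a challenging task that matters for understanding human communication. The paper examines the role persuasion strategies play in determining a text's persuasiveness, with the aim of improving interpretability and robustness in argument quality assessment.
Result: Strategy-guided reasoning improves the prediction of persuasiveness. The method is validated on multiple datasets, and a topic-annotated version of the Winning Arguments dataset is released to support future research.
Insight: The contribution is a structured, strategy-aware prompting method that integrates persuasion strategies into the LLM's reasoning, improving interpretability and robustness in argument quality assessment. Explicitly modeling the strategy dimension gives a finer-grained framework for analyzing what makes a text persuasive.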
Abstract: Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies, such as Attack on reputation, Distraction, and Manipulative wording, in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our method uses large language models (LLMs) with Multi-Strategy Persuasion Scoring, which guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
[40] Grounding Agent Memory in Contextual Intent cs.CL | cs.AI | cs.IR
Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang
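TL;DR: The paper proposes STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue (contextual intent) and retrieves history by matching the current step's intent. It targets context-mismatched retrieval in long-horizon, goal-oriented interactions, where similar entities and facts recur under different latent goals.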
Details
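Motivation: Deploying LLMs in long-horizon, goal-oriented interactions is difficult because similar entities and facts recur under different latent goals and constraints, which causes memory systems to retrieve context-mismatched evidence.
Result: On CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases.
Insight: The contributions include contextual intent as a compact signal for disambiguation and interference reduction, indexed by latent goal, action type, and salient entity types, and CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Together they support intent-aware memory for more robust long-horizon reasoning.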
Abstract: Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step’s intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
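Intent-based filtering and prioritization of memory snippets can be sketched as follows. This is an illustrative toy under stated assumptions, not STITCH itself: the `Intent` fields mirror the three signals named in the abstract, but the hard goal filter, the scoring weights, and all example data are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    goal: str                 # current latent goal (thematic segment)
    action: str               # action type
    entity_types: frozenset   # salient entity types

@dataclass
class Snippet:
    text: str
    intent: Intent

def compatibility(query, stored):
    """Hypothetical score: action match plus Jaccard overlap of entity types."""
    score = 1.0 if query.action == stored.action else 0.0
    union = query.entity_types | stored.entity_types
    if union:
        score += len(query.entity_types & stored.entity_types) / len(union)
    return score

def retrieve(query, memory, k=2):
    # Filter: drop snippets recorded under a different latent goal.
    same_goal = [s for s in memory if s.intent.goal == query.goal]
    # Prioritize: rank the remainder by intent compatibility.
    ranked = sorted(same_goal, key=lambda s: compatibility(query, s.intent), reverse=True)
    return [s.text for s in ranked[:k]]

memory = [
    Snippet("Rome hotel: 90 per night", Intent("plan Rome trip", "search", frozenset({"hotel", "price"}))),
    Snippet("Paris flight booked for May 3", Intent("plan Paris trip", "book", frozenset({"flight", "date"}))),
    Snippet("Paris hotel: 120 per night", Intent("plan Paris trip", "search", frozenset({"hotel", "price"}))),
]
query = Intent("plan Paris trip", "search", frozenset({"hotel", "price"}))
results = retrieve(query, memory)
print(results)
```

The Rome snippet is lexically close to the query but is suppressed because it was recorded under a different goal, which is the kind of semantically similar but context-incompatible history the abstract refers to.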
[41] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching cs.CL | cs.AI
Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang
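TL;DR: MatchTIR is a reinforcement learning framework for Tool-Integrated Reasoning (TIR). It addresses the coarse-grained credit assignment of existing methods through turn-level reward assignment based on bipartite matching and dual-level advantage estimation, so that effective and redundant tool calls are distinguished more reliably in long-horizon, multi-turn tasks.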
Details
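Motivation: Existing RL methods typically rely on outcome- or trajectory-level rewards and assign a uniform advantage to every step in a trajectory. This coarse-grained credit assignment cannot separate effective tool calls from redundant or erroneous ones in long-horizon, multi-turn scenarios.
Result: Extensive experiments on three benchmarks demonstrate MatchTIR's superiority; its 4B model surpasses the majority of 8B competitors, particularly on long-horizon and multi-turn tasks.
Insight: The contribution is to formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, which yields dense turn-level rewards, and to introduce a dual-level advantage estimation scheme that combines turn-level and trajectory-level signals to balance local step precision with global task success.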
Abstract: Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
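Turn-level rewards from a one-to-one matching between predicted and ground-truth tool calls can be sketched as below. This is a toy under assumptions, not the paper's method: the similarity function and its 1.0 / 0.5 / 0.0 values are invented, and the exhaustive search over assignments is only practical for a handful of turns (a Hungarian-style solver would replace it at scale).

```python
from itertools import permutations

def call_similarity(pred, gold):
    """Hypothetical similarity: 1.0 exact match, 0.5 same tool with different args, else 0.0."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    return 1.0 if pred["args"] == gold["args"] else 0.5

def turn_rewards(preds, golds):
    """Max-weight one-to-one matching by exhaustive search; unmatched turns get 0.0."""
    # Each predicted turn takes either a distinct gold index or None (left unmatched).
    slots = list(range(len(golds))) + [None] * len(preds)
    best_total, best_assign = -1.0, None
    for assign in permutations(slots, len(preds)):
        total = sum(call_similarity(preds[i], golds[j])
                    for i, j in enumerate(assign) if j is not None)
        if total > best_total:          # ties keep the first assignment found
            best_total, best_assign = total, assign
    return [0.0 if j is None else call_similarity(preds[i], golds[j])
            for i, j in enumerate(best_assign)]

golds = [{"tool": "search", "args": "weather Paris"},
         {"tool": "calc", "args": "2+2"}]
preds = [{"tool": "search", "args": "weather Paris"},   # effective call
         {"tool": "search", "args": "weather Paris"},   # redundant repeat
         {"tool": "calc", "args": "2 + 2"}]             # right tool, different args

rewards = turn_rewards(preds, golds)
print(rewards)
```

Because each ground-truth call can be matched at most once, the redundant repeat receives no reward even though it is identical to an effective call, which a uniform trajectory-level reward would not capture.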
cs.CV
[42] Diffusion-Driven Deceptive Patches: Adversarial Manipulation and Forensic Detection in Facial Identity Verification cs.CV | cs.AI
Shahrzad Sayyafzadeh, Hongmei Chi, Shonda Bernadin
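TL;DR: This paper presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches that attack facial biometric systems, with applications in forensic analysis and security testing. It uses FGSM to generate adversarial noise targeting an identity classifier, applies a diffusion model with reverse diffusion, Gaussian smoothing, and adaptive brightness correction to make the patch less perceptible, and uses a ViT-GPT2 model to generate semantic captions of adversarial images in support of forensic interpretation.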
Details
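Motivation: The goal is to develop a method that generates inconspicuous adversarial patches capable of deceiving facial identity verification systems, while also providing analysis and detection tools for forensics and security testing, thereby helping assess the threat adversarial attacks pose to biometric security.
Result: The pipeline evaluates identity classification, captioning, and the vulnerability of facial identity verification and expression recognition under adversarial conditions. It also detects adversarial patches and samples using perceptual hashing and segmentation, achieving a structural similarity index (SSIM) of 0.95.
Insight: The contribution is combining a diffusion model and reverse diffusion to improve the visual naturalness of adversarial patches, and adding a ViT-GPT2 model that captions adversarial images to improve forensic interpretability. The result is an integrated end-to-end framework for generating and detecting adversarial attacks.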
Abstract: This work presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems, with applications in forensic analysis and security testing. We utilize FGSM to generate adversarial noise targeting an identity classifier and employ a diffusion model with reverse diffusion to enhance imperceptibility through Gaussian smoothing and adaptive brightness correction, thereby facilitating synthetic adversarial patch evasion. The refined patch is applied to facial images to test its ability to evade recognition systems while maintaining natural visual characteristics. A Vision Transformer (ViT)-GPT2 model generates captions to provide a semantic description of a person’s identity for adversarial images, supporting forensic interpretation and documentation for identity evasion and recognition attacks. The pipeline evaluates changes in identity classification, captioning results, and vulnerabilities in facial identity verification and expression recognition under adversarial conditions. We further demonstrate effective detection and analysis of adversarial patches and adversarial samples using perceptual hashing and segmentation, achieving an SSIM of 0.95.
[43] LCF3D: A Robust and Real-Time Late-Cascade Fusion Framework for 3D Object Detection in Autonomous Driving cs.CV | cs.RO
Carlo Sgaravatti, Riccardo Pieroni, Matteo Corno, Sergio M. Savaresi, Luca Magri
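TL;DR: LCF3D is a robust, real-time late-cascade fusion framework for 3D object detection in autonomous driving. It combines a 2D detector on RGB images with a 3D detector on LiDAR point clouds, using late fusion to reduce LiDAR false positives and cascade fusion to recover objects that LiDAR missed, which improves detection performance and domain generalization.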
Details
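Motivation: Accurately detecting 3D objects such as pedestrians and cyclists is essential in autonomous driving, but effectively fusing RGB camera and LiDAR data remains challenging, and existing methods struggle to deliver both accuracy and robustness.
Result: On KITTI and nuScenes, LCF3D improves significantly over LiDAR-based methods, particularly on difficult categories such as pedestrians and cyclists, and it generalizes across different sensor configurations between training and testing domains.
Insight: The contribution is the late-cascade fusion strategy: late fusion matches 2D and 3D detections to filter false positives, and cascade fusion uses unmatched 2D detections to generate new 3D proposals that recover missed objects. The two modalities thus complement each other, and domain adaptability improves.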
Abstract: Accurately localizing 3D objects like pedestrians, cyclists, and other vehicles is essential in Autonomous Driving. To ensure high detection performance, Autonomous Vehicles complement RGB cameras with LiDAR sensors, but effectively combining these data sources for 3D object detection remains challenging. We propose LCF3D, a novel sensor fusion framework that combines a 2D object detector on RGB images with a 3D object detector on LiDAR point clouds. By leveraging multimodal fusion principles, we compensate for inaccuracies in the LiDAR object detection network. Our solution combines two key principles: (i) late fusion, to reduce LiDAR False Positives by matching LiDAR 3D detections with RGB 2D detections and filtering out unmatched LiDAR detections; and (ii) cascade fusion, to recover missed objects from LiDAR by generating new 3D frustum proposals corresponding to unmatched RGB detections. Experiments show that LCF3D is beneficial for domain generalization, as it turns out to be successful in handling different sensor configurations between training and testing domains. LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in the KITTI dataset, as well as motorcycles and bicycles in nuScenes. Code can be downloaded from: https://github.com/CarloSgaravatti/LCF3D.
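The two fusion principles, (i) and (ii), can be sketched on 2D boxes alone. This is a simplified illustration with assumptions: LiDAR detections are taken to be already projected into the image as `(x1, y1, x2, y2)` boxes, matching is greedy with a hypothetical IoU threshold of 0.5, and frustum proposal generation is represented only by returning the indices of unmatched RGB detections.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def late_cascade_split(lidar_boxes_2d, rgb_boxes, thr=0.5):
    """Return (kept LiDAR indices, RGB indices that would seed frustum proposals)."""
    kept, used = [], set()
    for li, lb in enumerate(lidar_boxes_2d):
        best_j, best_iou = None, thr
        for j, rb in enumerate(rgb_boxes):
            if j in used:
                continue
            v = iou(lb, rb)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:      # late fusion: keep only LiDAR boxes confirmed by RGB
            kept.append(li)
            used.add(best_j)
    # cascade fusion: RGB detections with no LiDAR match are candidates for recovery
    unmatched_rgb = [j for j in range(len(rgb_boxes)) if j not in used]
    return kept, unmatched_rgb

lidar = [(10, 10, 50, 50), (200, 200, 240, 240)]   # second box has no RGB support
rgb = [(12, 12, 52, 52), (100, 100, 140, 140)]     # second box was missed by LiDAR
kept, to_recover = late_cascade_split(lidar, rgb)
print(kept, to_recover)
```

In a full system each index in `to_recover` would define a viewing frustum in the point cloud in which a second-stage 3D detector searches for the missed object.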
[44] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval cs.CV
Xiaoxu Ma, Runhao Li, Hanwen Liu, Xiangbo Zhang, Zhenyu Weng
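TL;DR: This paper proposes UniHash, a dual-branch hashing framework that unifies the strengths of the pointwise and pairwise hashing training paradigms to achieve balanced, strong retrieval performance on both seen and unseen categories. A center-based pointwise branch and a pairwise branch learn complementarily, and a novel hash code learning method, comprising a mutual learning loss and a Split-Merge Mixture of Hash Experts module, promotes knowledge transfer and exchange of hash representations between the branches.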
Details
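Motivation: Existing deep hashing methods are usually confined to a single training paradigm, pointwise or pairwise: the pointwise paradigm excels at retrieval on seen categories, while the pairwise paradigm generalizes better to unseen ones. The paper aims to overcome this limitation with a unified framework that improves retrieval on both.
Result: Extensive experiments on CIFAR-10, MSCOCO, and ImageNet show that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.
Insight: The core contribution is a unified dual-branch framework that fuses the two mainstream hashing paradigms, with bidirectional knowledge transfer between branches through the mutual learning loss and the Split-Merge Mixture of Hash Experts (SM-MoH) module, improving both the discriminability and the generalization of the hash codes. Letting complementary paradigms learn from each other is an idea that may transfer to other settings.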
Abstract: Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.
[45] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning cs.CV | cs.AI | cs.HC
Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali
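TL;DR: This paper proposes the ViSIL (Video Summary Information Loss) score, an information-theoretic framework for quantifying information loss in multimodal video summaries that combine keyframes and natural language. Using vision-language model (VLM) inference, it measures how much video information a summary fails to capture, enabling unified evaluation across structurally different summary formats. Experiments show that ViSIL scores correlate significantly with human and VLM performance on Video Question Answering (VQA) and can be used to optimize the trade-off between information loss and processing speed.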
Details
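Motivation: Multimodal video summarization condenses dense video into a structured format of keyframes and text, but traditional metrics such as BLEU and ROUGE cannot quantify information coverage across modalities (for example, a paragraph of text versus a sequence of keyframes). A unified metric is therefore needed to evaluate and compare information loss across summary formats.
Result: ViSIL scores show a statistically significant correlation with both human and VLM performance on VQA tasks. Using ViSIL for summary selection establishes a Pareto-optimal frontier that improves VQA accuracy by 7% over text-only summaries without increasing processing load.
Insight: The contribution is a unified metric framework (ViSIL) based on information theory and VLMs, enabling, for the first time, direct comparison and quantitative evaluation of structurally different multimodal summaries. It provides a quantifiable tool for trading off a summary's information density against computational cost, going beyond traditional text-overlap metrics.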
Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
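Selecting summaries on a Pareto-optimal frontier between information loss and processing cost can be sketched as follows. The candidate formats and their numbers are invented for illustration; in the paper's setting the loss values would come from the ViSIL score, whose computation is not reproduced here.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (information loss, processing cost); lower is better."""
    front = []
    for name, loss, cost in candidates:
        dominated = any(
            l2 <= loss and c2 <= cost and (l2 < loss or c2 < cost)
            for _, l2, c2 in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (format, information loss, relative processing cost) triples.
candidates = [
    ("text only", 0.60, 1.0),
    ("text + 1 keyframe", 0.45, 1.5),
    ("text + 3 keyframes", 0.30, 2.5),
    ("5 keyframes only", 0.50, 3.0),   # worse than "text + 3 keyframes" on both axes
    ("full video", 0.00, 10.0),
]
front = pareto_front(candidates)
print(front)
```

A single loss measure that applies to every format is what makes this comparison possible, since the candidates mix text and keyframes in different proportions.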
[46] Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP cs.CV | cs.LG
Anant Mehta, Xiyuan Wei, Xingyu Chen, Tianbao Yang
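TL;DR: This paper proposes TuneCLIP, a self-supervised fine-tuning framework that improves the general performance of open-weight CLIP models across downstream tasks without training from scratch. A two-stage optimization process, a warm-up stage followed by a fine-tuning stage, resolves the performance degradation caused by naive fine-tuning.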
Details
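Motivation: The aim is to avoid the high cost of training CLIP from scratch and to explore whether existing self-supervised datasets alone can improve the general downstream performance of open-weight CLIP models, while overcoming the performance degradation that naive fine-tuning causes.
Result: Experiments show that TuneCLIP consistently improves performance across model architectures and scales. For example, on SigLIP (ViT-B/16) it gains up to +2.5% on ImageNet and related out-of-distribution benchmarks and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.
Insight: The contribution is a two-stage self-supervised fine-tuning framework: (1) a theoretically motivated warm-up stage that recovers optimization statistics to reduce cold-start bias; and (2) a fine-tuning stage that optimizes a new contrastive loss to mitigate the penalization of false negative pairs. This offers a reusable optimization strategy for making efficient use of existing pretrained models.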
Abstract: CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.
[47] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation cs.CV | cs.AI
Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang
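TL;DR: This paper proposes MedVL-SAM2, a unified 3D medical vision-language model that supports report generation, visual question answering (VQA), and multi-paradigm 3D segmentation (semantic, referring, and interactive). It integrates image-level reasoning with pixel-level perception and uses a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning.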
Details
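Motivation: Existing medical vision-language models perform well on image-level text tasks, but fine-grained visual grounding and volumetric spatial reasoning in 3D medical settings remain challenging, especially when these capabilities must be unified in a single general framework.
Result: The model achieves state-of-the-art performance on report generation, VQA, and multiple 3D segmentation tasks, and extensive analyses demonstrate reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning.
Insight: The contribution is a unified 3D medical multimodal architecture trained in multiple stages (pre-training on large-scale 3D CT image-text pairs, then joint optimization with language-understanding and segmentation objectives). It combines high-level semantic reasoning with precise 3D localization and allows flexible interaction via language, point, or box prompts.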
Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
[48] Transition Matching Distillation for Fast Video Generation cs.CV | cs.AI | cs.LG
Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie
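TL;DR: This paper proposes Transition Matching Distillation (TMD), a new framework for distilling large video diffusion models into efficient few-step generators, addressing the inefficient multi-step sampling that limits their use in real-time interactive applications. It matches the diffusion model's multi-step denoising trajectory with a few-step probability transition process in which each transition is modeled as a lightweight conditional flow.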
Details
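Motivation: Large video diffusion and flow models generate high-quality video, but their inefficient multi-step sampling limits use in real-time interactive applications, so a method is needed that greatly speeds up generation while preserving visual quality.
Result: Extensive experiments on distilling the Wan2.1 1.3B and 14B text-to-video models show that TMD offers a flexible and strong trade-off between generation speed and visual quality. At comparable inference cost, TMD outperforms existing distilled models in visual fidelity and prompt adherence.
Insight: The core contribution is a distillation framework that matches the multi-step diffusion trajectory with a few-step probability transition process, together with an architecture that splits the backbone into a main backbone for semantic extraction and a flow head that performs multiple inner flow updates. This enables efficient distillation and a better speed-quality trade-off.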
Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd
[49] OT-Drive: Out-of-Distribution Off-Road Traversable Area Segmentation via Optimal Transport cs.CV | cs.RO
Zhihua Zhao, Guoqiang Li, Chen Min, Kangping Lu
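TL;DR: This paper proposes OT-Drive, an Optimal Transport-driven multi-modal fusion framework that addresses the performance drop of traversable area segmentation in unstructured environments under out-of-distribution (OOD) scenarios. A Scene Anchor Generator decomposes scene information and constructs semantic anchors, and an Optimal Transport module maps RGB and surface normal features onto the manifold defined by those anchors, enabling robust OOD segmentation.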
Details
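Motivation: Existing data-driven methods suffer a marked drop in traversable area segmentation performance in OOD scenarios, which impairs planning and decision-making in autonomous driving, so better generalization to unseen scenes is needed.
Result: The method achieves 95.16% mIoU on ORFD OOD scenarios, 6.35% above prior methods, and 89.79% mIoU on cross-dataset transfer, 13.99% above baselines, indicating strong OOD generalization with limited training data.
Insight: The contributions include formulating multi-modal fusion as a distribution transport problem, a Scene Anchor Generator that builds generalizable semantic anchors, and an Optimal Transport-based feature mapping mechanism. Together they suggest a way to make vision tasks more robust in OOD scenarios.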
Abstract: Reliable traversable area segmentation in unstructured environments is critical for planning and decision-making in autonomous driving. However, existing data-driven approaches often suffer from degraded segmentation performance in out-of-distribution (OOD) scenarios, consequently impairing downstream driving tasks. To address this issue, we propose OT-Drive, an Optimal Transport-driven multi-modal fusion framework. The proposed method formulates RGB and surface normal fusion as a distribution transport problem. Specifically, we design a novel Scene Anchor Generator (SAG) to decompose scene information into the joint distribution of weather, time-of-day, and road type, thereby constructing semantic anchors that can generalize to unseen scenarios. Subsequently, we design an innovative Optimal Transport-based multi-modal fusion module (OT Fusion) to transport RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust traversable area segmentation under OOD scenarios. Experimental results demonstrate that our method achieves 95.16% mIoU on ORFD OOD scenarios, outperforming prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer tasks, surpassing baselines by 13.99%. These results indicate that the proposed model can attain strong OOD generalization with only limited training data, substantially enhancing its practicality and efficiency for real-world deployment.
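For readers unfamiliar with optimal transport, the sketch below computes a transport plan between a set of features and a set of anchors using entropy-regularized OT solved with Sinkhorn iterations. Whether OT-Drive uses Sinkhorn is not stated in the abstract, so this is an assumption chosen because it is a common solver; the cost matrix, marginals, and `eps` value are hypothetical.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT: returns a plan P whose rows sum to a and columns sum to b (approximately)."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical setup: 2 features, 2 anchors; cost is low when a feature is near an anchor.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
a = [0.5, 0.5]   # mass on features
b = [0.5, 0.5]   # mass on anchors
plan = sinkhorn(cost, a, b)
for row in plan:
    print([round(p, 4) for p in row])
```

Each row of the plan indicates how a feature's mass is distributed over the anchors, so nearly all of it goes to the closest anchor when costs are well separated.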
[50] The Spatial Blindspot of Vision-Language Models cs.CVPDF
Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung
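TL;DR: This paper argues that current CLIP-style vision-language models (VLMs) have a blind spot in understanding spatial relationships, mainly because training flattens images into 1D sequences and discards 2D structure. To address this, it studies image encoders trained with alternative objectives and 2D positional encodings; experiments show that these architectural changes improve spatial reasoning on several benchmarks.
Details
Motivation: Existing VLMs typically use CLIP-style image encoders whose training flattens images into 1D sequences and discards 2D structure, leaving models without spatial awareness; this is a bottleneck for applications that need spatial grounding, such as robotics and embodied AI.
Result: Experiments show that changing the image encoder's training objective and introducing 2D positional encodings improves performance on several spatial reasoning benchmarks.
Insight: The paper explicitly identifies the spatial-awareness deficit of VLMs and strengthens spatial reasoning through architectural changes (alternative training objectives and 2D positional encodings), adding a new dimension to VLM design that is relevant to applications requiring spatial grounding.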
Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
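The contrast between a flattened 1D patch index and a 2D positional encoding can be sketched as follows. The abstract does not specify which 2D encoding the authors evaluate; the sinusoidal row/column scheme, the function names, and the 4x4 grid below are assumptions chosen only to illustrate the idea.

```python
import math

def sincos_1d(pos, dim):
    """Sinusoidal encoding of one integer position into `dim` values (dim even)."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

def sincos_2d(row, col, dim):
    """First half of the channels encodes the row, second half the column."""
    half = dim // 2  # dim is assumed divisible by 4
    return sincos_1d(row, half) + sincos_1d(col, half)

def patch_encodings(grid_h, grid_w, dim):
    """One encoding per patch, indexed as [row][col]."""
    return [[sincos_2d(r, c, dim) for c in range(grid_w)] for r in range(grid_h)]

# Hypothetical 4x4 patch grid with 8-channel encodings.
grid = patch_encodings(4, 4, 8)

# Patches (0, 3) and (1, 0) are neighbours in flattened order (indices 3 and 4)
# but far apart in the image; vertical neighbours (0, 0) and (1, 0) are four
# steps apart when flattened. The 2D encoding keeps this explicit:
same_column = grid[0][0][4:] == grid[1][0][4:]  # column half is shared
same_row = grid[0][0][:4] == grid[0][3][:4]     # row half is shared
```

With this layout, patches in the same row or column share half of their encoding, which is information a purely sequential index does not expose directly.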
[51] DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models cs.CVPDF
Yulin He, Wei Chen, Zhikang Jian, Tianhang Guo, Wenjuan Zhou
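TL;DR: This paper proposes DR$^2$Seg, a framework that uses a decomposed two-stage rollout strategy to address overthinking by multimodal large language models in reasoning segmentation, improving both reasoning efficiency and segmentation accuracy. It decomposes reasoning segmentation into multimodal reasoning and referring segmentation, and introduces self-rewards that strengthen goal-oriented reasoning and suppress redundant thinking.
Details
Motivation: Existing methods for reasoning segmentation often overthink, producing verbose reasoning chains that interfere with object localization in multimodal large language models and lead to inefficiency and inaccurate segmentation.
Result: Extensive experiments across multimodal large language models of varying scales and different segmentation models show that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.
Insight: The novelty lies in decomposing reasoning segmentation into two stages and designing self-rewards to optimize the reasoning process, achieving more efficient and accurate segmentation without extra supervision.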
Abstract: Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR$^2$Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR$^2$Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Extensive experiments across MLLMs of varying scales and segmentation models demonstrate that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.
[52] VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models cs.CV | cs.AIPDF
Zefan Zhang, Kehua Zhu, Shijie Jiang, Hongyuan Lu, Shengkai Sun
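TL;DR: This paper introduces VERHallu, a benchmark for evaluating event relation hallucination in video large language models. It focuses on causal, temporal, and subevent relations between events and evaluates them through relation classification, question answering, and counterfactual question answering. The study finds that current SOTA models perform poorly on dense-event relation reasoning, often relying on prior knowledge while neglecting frame-level cues; the authors therefore propose a Key-Frame Propagating strategy that enhances multi-event understanding and mitigates hallucination without affecting inference speed.
Details
Motivation: Prior work focuses mainly on hallucinations about the presence of events, objects, and scenes in videos and neglects event relation hallucination; this paper aims to fill that gap by systematically evaluating and mitigating hallucination in event relation reasoning by video large language models.
Result: Experiments on VERHallu show that current SOTA video large language models perform poorly on dense-event relation reasoning, while the proposed Key-Frame Propagating strategy effectively mitigates event relation hallucination without affecting inference speed.
Insight: The paper is the first to systematically define event relation hallucination and builds the VERHallu benchmark, which includes counterintuitive scenarios and human-annotated bias candidates. The Key-Frame Propagating strategy reallocates frame-level attention in intermediate layers to enhance multi-event understanding, offering a new approach to mitigating hallucination in video large language models.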
Abstract: Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.
[53] UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow cs.CV | cs.ROPDF
Nick Truong, Pritam P. Karmokar, William J. Beksi
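TL;DR: This paper presents UEOF, the first synthetic underwater benchmark dataset for event-based optical flow. The dataset is generated from physically based ray-traced RGBD sequences, includes realistic underwater optical effects along with dense ground-truth flow, depth, and camera motion, and aims to advance the development and evaluation of underwater event-based perception algorithms.
Details
Motivation: Underwater imaging faces wavelength-dependent light attenuation, scattering from suspended particles, turbidity-induced blur, and non-uniform illumination, and ground-truth motion is hard to obtain with standard cameras. Event cameras offer microsecond resolution and high dynamic range, but the lack of datasets pairing realistic underwater optics with accurate optical flow has limited progress in underwater settings.
Result: The paper benchmarks state-of-the-art learning-based and model-based optical flow prediction methods on UEOF, analyzes how underwater light transport affects event formation and motion estimation accuracy, and provides a new evaluation baseline for future algorithms.
Insight: The novelty is the first synthetic underwater dataset for event-based optical flow, which produces realistic event streams through physically based rendering and a video-to-event pipeline, sidestepping the difficulty of collecting real underwater ground truth and providing a key resource for underwater event-based perception research.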
Abstract: Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at https://robotic-vision-lab.github.io/ueof.
[54] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation cs.CV | cs.AIPDF
Chengzhuo Tong, Mingkun Chang, Shenglong Zhang, Yuran Wang, Cheng Liang
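TL;DR: This paper proposes CoF-T2I, a model that brings the Chain-of-Frame (CoF) reasoning ability of video generation models to text-to-image generation. Through progressive visual refinement and independent per-frame encoding, it uses intermediate frames as explicit reasoning steps to improve generation quality.
Details
Motivation: The goal is to use the Chain-of-Frame reasoning that video models already exhibit to enhance text-to-image generation, a direction that has been underexplored because the T2I process lacks a clearly defined visual reasoning starting point and interpretable intermediate states.
Result: Reaches 0.86 on GenEval and 7.468 on Imagine-Bench, significantly outperforming the base video model and achieving competitive performance on challenging benchmarks.
Insight: The contributions are integrating Chain-of-Frame reasoning into T2I generation for progressive refinement, curating the CoF-Evol-Instruct dataset to model generation trajectories from semantics to aesthetics, and using independent per-frame encoding to avoid motion artifacts and improve quality.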
Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
[55] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology cs.CV | cs.AIPDF
Hyun Do Jung, Jungwon Choi, Hwiyoung Kim
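TL;DR: This paper proposes ReaMIL, a multiple instance learning approach for whole-slide histopathology images. It adds a light selection head to a strong MIL backbone; the head produces soft per-tile gates and is trained with a budgeted-sufficiency objective that, under a sparsity budget, requires the true-class probability to reach a threshold using only the kept evidence. The method yields small, spatially compact evidence sets without sacrificing baseline performance.
Details
Motivation: Existing multiple instance learning methods for whole-slide histopathology offer little transparency into the model's decision process and little interpretable evidence; the paper aims to identify a small number of key, spatially compact tiles as the basis for a decision.
Result: On TCGA-NSCLC, TCGA-BRCA, and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On the NSCLC task it attains AUC 0.983 at τ = 0.90, with a mean minimal sufficient K of about 8.2 tiles and AUKC of about 0.864.
Insight: The novelty is the budgeted-sufficiency objective (a hinge loss with a sparsity budget) used to train the selection head, which learns to select a few compact key evidence tiles without extra supervision while preserving classification performance. The method naturally yields slide-level overlays and is evaluated with metrics such as MSK, AUKC, and contiguity.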
Abstract: We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq τ$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $τ= 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.
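The budgeted-sufficiency objective and the minimal sufficient K (MSK) diagnostic from the ReaMIL entry can be sketched as below. The abstract gives only a verbal description, so the exact loss form, the penalty weight `lam`, the budget value, the reading of MSK as "smallest number of top-gated tiles whose prediction reaches τ", and the toy classifier `toy_prob` are all assumptions rather than the authors' formulation.

```python
import math

def budgeted_sufficiency_loss(p_true_kept, gates, tau=0.90, budget=8.0, lam=1.0):
    """Hinge on the true-class probability plus a hinge on the gate budget.

    p_true_kept: true-class probability predicted from the kept tiles only.
    gates: soft per-tile gates in [0, 1]; their sum approximates tiles kept.
    """
    sufficiency = max(0.0, tau - p_true_kept)  # zero once probability >= tau
    overspend = max(0.0, sum(gates) - budget)  # zero while within budget
    return sufficiency + lam * overspend

def minimal_sufficient_k(gates, prob_fn, tau=0.90):
    """Smallest K such that the K highest-gated tiles reach probability tau."""
    order = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    for k in range(1, len(order) + 1):
        if prob_fn(order[:k]) >= tau:
            return k
    return None  # threshold never reached

# Hypothetical toy slide: per-tile evidence logits and gates.
evidence = [2.0, 0.1, 1.5, -0.2, 0.05]
gates = [0.9, 0.2, 0.8, 0.1, 0.05]

def toy_prob(kept):
    """Stand-in classifier: sigmoid of the summed evidence of the kept tiles."""
    return 1.0 / (1.0 + math.exp(-sum(evidence[i] for i in kept)))

msk = minimal_sufficient_k(gates, toy_prob)
```

On this toy slide the single highest-gated tile is not enough to reach τ = 0.90, but the top two are, which is the kind of "confidence rises sharply once a small set of tiles is kept" behaviour the abstract reports.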
[56] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting cs.CV | cs.GR | cs.LGPDF
Zhendong Wang, Lebin Zhou, Jingchuan Xiao, Rongduo Han, Nam Ling
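TL;DR: This paper proposes a flow-guided geometric advection framework based on 3D Gaussian Splatting (3DGS) for structure-aware 3D style transfer. The method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, guiding Gaussian primitives into structured deformations that conform to scene topology and align with brushstrokes, for a more authentic reproduction of Post-Impressionist style.
Details
Motivation: Most existing 3D style transfer methods treat geometry as a rigid substrate and only project surface texture, which runs counter to the Post-Impressionist emphasis on geometric abstraction as the primary vehicle of expression. The paper aims to let 3D style transfer embrace geometric abstraction for more authentic stylization.
Result: The paper proposes a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment rather than conventional pixel-level metrics, explicitly addressing the subjectivity of artistic stylization. The abstract reports no specific quantitative benchmark results or SOTA comparisons.
Insight: The contributions are: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a subjective aesthetic evaluation framework that goes beyond conventional objective metrics.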
Abstract: In 1888, Vincent van Gogh wrote, “I am seeking exaggeration in the essential.” This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints. Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.
[57] V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation cs.CV | cs.AI | cs.LGPDF
Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
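TL;DR: V-Zero is a self-improving post-training framework for multimodal reasoning that needs no human annotation. By instantiating two roles, a Questioner and a Solver, it establishes a co-evolutionary loop on unlabeled images and delivers consistent performance gains for vision-language models.
Details
Motivation: State-of-the-art multimodal learning methods rely heavily on large-scale, costly human-annotated data; the paper explores how a model can improve itself using only unlabeled images.
Result: On Qwen2.5-VL-7B-Instruct, without any human annotation, visual mathematical reasoning improves by +1.7 and general vision-centric tasks by +2.6.
Insight: The novelty is a co-evolutionary framework with a Questioner and a Solver. The Questioner synthesizes high-quality questions using a dual-track reasoning reward that contrasts intuitive guesses with reasoned results, while the Solver is optimized with pseudo-labels obtained by majority voting over its own sampled responses. Both are trained iteratively via Group Relative Policy Optimization, forming a mutually reinforcing loop and offering a new paradigm for self-supervised learning in multimodal systems.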
Abstract: Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
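The Solver's pseudo-labelling step in the V-Zero entry, followed by a group-relative normalisation of rewards, can be sketched as below. The abstract states only that pseudo-labels come from majority voting and that training uses GRPO; the binary agreement reward, the mean/standard-deviation normalisation as GRPO is commonly described, the function names, and the sample answers are assumptions for illustration.

```python
from collections import Counter
import statistics

def majority_vote(answers):
    """Return (pseudo_label, agreement) for a list of sampled answers."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and std of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical answers sampled from the Solver for one unlabeled image.
samples = ["12", "12", "15", "12", "9"]
label, agreement = majority_vote(samples)

# Assumed reward: 1 if a sample matches the pseudo-label, else 0.
rewards = [1.0 if s == label else 0.0 for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples that agree with the majority receive a positive advantage and the rest a negative one, so no external label is needed to produce a training signal; the agreement ratio could also serve as a rough confidence estimate for the pseudo-label.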
[58] FlowAct-R1: Towards Interactive Humanoid Video Generation cs.CV | cs.AIPDF
Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang
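TL;DR: This paper proposes FlowAct-R1, a framework designed for real-time interactive humanoid video generation. Built on an MMDiT architecture, it synthesizes video of arbitrary duration in a streaming fashion while keeping response latency low. A chunkwise diffusion forcing strategy and its self-forcing variant alleviate error accumulation during continuous interaction and ensure long-term temporal consistency. With efficient distillation and system-level optimizations, the framework reaches a stable 25 fps at 480p with a time-to-first-frame of only about 1.5 seconds, and provides fine-grained full-body control so that the agent can transition naturally between behavioral states in interactive scenarios.
Details
Motivation: Existing video synthesis methods struggle to balance high-fidelity synthesis against real-time interaction requirements; the goal is to generate lifelike visual agents that can interact with humans continuously and responsively.
Result: Experiments show that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism while maintaining robust generalization across diverse character styles.
Insight: The main contributions are a streaming synthesis framework designed for real-time interactive video generation, the chunkwise diffusion forcing strategy (and its self-forcing variant) for long-term temporal consistency, and the efficiency obtained through distillation and system optimization (25 fps at 480p, TTFF of about 1.5 s). Viewed objectively, the system design that combines streaming synthesis, low latency, and fine-grained full-body control is a useful reference.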
Abstract: Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.
[59] MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers cs.CV | cs.AIPDF
Chenyue Zhou, Jiayi Tuo, Shitong Qin, Wei Dai, Mingxuan Wang
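TL;DR: MathDoc is the first benchmark for document-level information extraction from authentic high school mathematics exam papers. It contains 3,609 mathematics questions with real-world noise and includes unrecognizable samples to evaluate a model's active refusal ability. The paper proposes a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability, and runs experiments on SOTA multimodal large language models such as Qwen3-VL and Gemini-2.5-Pro.
Details
Motivation: Existing benchmarks focus mainly on clean documents or generic layout analysis and overlook both the structural integrity of mathematical problems and a model's ability to actively refuse incomplete inputs, while real-world exam papers carry severe visual noise that makes automated extraction of structured questions challenging.
Result: Experiments show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs and instead produce confident but invalid outputs, revealing a critical reliability gap in current MLLMs under degraded document conditions.
Insight: The novelty is the first mathematics exam-paper extraction benchmark that includes real-world noise and an active refusal evaluation, together with a multi-dimensional evaluation framework. Viewed objectively, the study highlights model reliability in realistic noisy settings and provides an important benchmark for assessing behavior on degraded inputs.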
Abstract: The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains 3,609 carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at https://github.com/winnk123/papers/tree/master
[60] Enhancing Visual In-Context Learning by Multi-Faceted Fusion cs.CVPDF
Wenwen Liao, Jianbo Yu, Yuansong Wang, Qingchao Jiang, Xiaofeng Yang
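TL;DR: This paper proposes a novel multi-faceted collaborative fusion framework to enhance Visual In-Context Learning (VICL). Instead of relying on the single best prompt or simply fusing the top-K prompts, as in the conventional "retrieve-then-prompt" approach, it generates three contextual representation branches, each formed from a different combination of high-quality prompts, and designs a MULTI-VQGAN architecture to jointly exploit this complementary information.
Details
Motivation: The dominant "retrieve-then-prompt" VICL approach usually selects only the single best visual prompt, discarding valuable contextual information from other suitable candidates. Recent work that fuses the top-K prompts into a single enhanced representation still collapses multiple rich signals into one, limiting the model's reasoning capability. The paper aims to unlock the potential of diverse contexts through a more comprehensive, collaborative fusion.
Result: Extensive experiments cover diverse tasks, including foreground segmentation, single-object detection, and image colorization. The method shows strong cross-task generalization and effective contextual fusion, and produces more robust and accurate predictions than existing methods.
Insight: The core novelty is the shift from single-prompt fusion to multi-combination collaborative fusion, which generates several complementary contextual representation branches to preserve and jointly exploit richer context. The proposed MULTI-VQGAN architecture is designed specifically to jointly interpret and use collaborative information from multiple sources.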
Abstract: Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant “retrieve-then-prompt” approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still simply collapses multiple rich signals into one, limiting the model’s reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
[61] Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL cs.CVPDF
Wenwen Liao, Jianbo Yu, Yuansong Wang, Shifu Yan, Xiaofeng Yang
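TL;DR: This paper proposes an end-to-end Visual In-Context Learning (VICL) framework that aggregates key information from multiple prompts with an adaptive fusion module, introduces lightweight MLPs to decouple layout priors, and uses a bidirectional fine-tuning mechanism to strengthen collaboration between components, overcoming the limitations of existing methods that use only a single prompt and ignore the structural information in prompt arrangements.
Details
Motivation: Existing VICL methods have two key problems: they select only the most similar prompt, discarding complementary information from other high-quality prompts, and they fail to exploit the structured information implied by different prompt arrangements.
Result: Experiments on foreground segmentation, single-object detection, and image colorization show superior results and strong cross-task generalization.
Insight: The contributions are an adaptive fusion module that exploits complementary information across prompts, lightweight MLPs that decouple layout priors while leaving the core model largely intact, and a bidirectional fine-tuning mechanism that promotes collaboration between the fusion module and the inpainting model, improving the accuracy and generalization of in-context learning.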
Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) they fail to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. First, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Second, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model, while minimally affecting the overall model. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
[62] LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning cs.CV | cs.AIPDF
Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang
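TL;DR: This paper proposes LaViT, a framework that addresses the perception gap in multimodal reasoning by aligning latent visual thoughts rather than static embeddings. It compels the student model to autoregressively reconstruct the teacher's visual semantics and attention trajectories before generating text, and uses a curriculum sensory gating mechanism to prevent shortcut learning.
Details
Motivation: Current multimodal latent reasoning often relies on external supervision (such as auxiliary images) and ignores intrinsic visual attention dynamics. As a result, student models that mimic the teacher's textual output attend to very different visual regions and rely on language priors rather than grounded perception.
Result: Experiments show that LaViT significantly enhances visual grounding, with gains of up to +16.9% on complex reasoning tasks, enabling a compact 3B model to outperform larger open-source variants and proprietary models such as GPT-4o.
Insight: The novelty is bridging the perception gap by aligning latent visual thoughts and attention trajectories, plus a curriculum sensory gating mechanism that prevents shortcut learning and improves the visual grounding of multimodal reasoning.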
Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher’s textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher’s visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
[63] Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method cs.CVPDF
Chao Huang, Benfeng Wang, Wei Wang, Jie Wen, Li Shen
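TL;DR: This paper introduces a new task, Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning, and builds a large-scale dataset of 8,641 videos and more than 50,000 samples annotated with the PerCoAct-CoT chain of thought. The authors also propose Vad-R1-Plus, an end-to-end multimodal large language model that uses Anomaly-Aware Group Relative Policy Optimization to improve reasoning reliability under weak supervision and outperforms open-source and proprietary baselines on VAR.
Details
Motivation: Existing methods for video anomaly detection and understanding based on multimodal large language models (MLLMs) are largely limited to anomaly localization or post-hoc description and lack explicit reasoning processes, risk awareness, and decision-oriented interpretation, so video anomaly analysis needs to move toward structured, multi-stage reasoning.
Result: On the proposed VAR task, extensive experiments show that Vad-R1-Plus reasons better than open-source and proprietary baselines, achieving SOTA performance.
Insight: The contributions are: defining the VAR task with its emphasis on progressive reasoning; building a large-scale dataset with structured PerCoAct-CoT annotations; proposing Anomaly-Aware Group Relative Policy Optimization to improve reasoning reliability under weak supervision; and developing Vad-R1-Plus, an end-to-end MLLM that supports adaptive hierarchical reasoning and risk-aware decision making.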
Abstract: Recent progress in reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.
[64] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation cs.CVPDF
Kim Youwang, Lee Hyoseok, Subin Park, Gerard Pons-Moll, Tae-Hyun Oh
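TL;DR: ELITE efficiently synthesizes Gaussian head avatars from a monocular video through learned initialization and test-time generative adaptation. It combines the strengths of 3D data priors and 2D generative priors, using a feed-forward Mesh2Gaussian prior model to initialize the avatar quickly and a rendering-guided single-step diffusion enhancer to restore missing visual details at test time.
Details
Motivation: Existing methods rely on either a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular video, but 3D-prior methods generalize poorly in the wild and 2D-prior methods are computationally heavy and prone to identity hallucination. ELITE aims to combine the two priors for efficient, high-fidelity animatable avatar synthesis that generalizes well.
Result: Experiments show that ELITE produces visually superior avatars to prior work, even for challenging expressions, while synthesizing 60x faster than the 2D generative prior method.
Insight: The contributions are a design that exploits the complementary synergy of 3D and 2D priors, a feed-forward Mesh2Gaussian prior model for fast initialization, a test-time generative adaptation stage supervised by both real and synthetic images, and a rendering-guided single-step diffusion enhancer that replaces slow, hallucination-prone full diffusion denoising, improving efficiency and detail recovery.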
Abstract: We introduce ELITE, an Efficient method for Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.
[65] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation cs.CV | cs.GRPDF
Dong-Yu Chen, Yixin Guo, Shuojin Yang, Tai-Jiang Mu, Shi-Min Hu
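TL;DR: This paper proposes DepthDirector, a video re-rendering framework with precise camera control guided by depth video. A dual-stream conditioning mechanism injects the source video and the warped depth sequence rendered under the target viewpoint into a pretrained video diffusion model, faithfully reproducing dynamic scenes under novel camera trajectories.
Details
Motivation: Existing camera-control methods rely mainly on warping a 3D representation, but they fail to fully leverage the 3D priors of video diffusion models and tend to fall into the Inpainting Trap, leading to subject inconsistency and degraded generation quality.
Result: Experiments with the newly built large-scale multi-camera synchronized dataset MultiCam-WarpData show that DepthDirector outperforms existing methods in both camera controllability and visual quality.
Insight: The contributions include the View-Content Dual-Stream Condition mechanism and a lightweight LoRA adapter; geometric guidance signals activate the 3D understanding of video diffusion models, enabling precise camera control and content-consistent generation.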
Abstract: Camera control has been extensively studied in conditioned video generation; however, precisely altering the camera trajectories while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.
[66] Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge cs.CV | cs.MM | eess.IVPDF
Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng
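TL;DR: This paper presents an optimized framework for the HD-EPIC VQA challenge. By integrating query/option pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting scheme, and robust post-processing, it addresses the ambiguous queries, weak long-range temporal reasoning, and non-standardized outputs that multimodal large language models face on complex video question answering.
Details
Motivation: Multimodal large language models perform poorly on complex video QA benchmarks such as HD-EPIC VQA because of ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs.
Result: Reaches 41.6% accuracy on the HD-EPIC VQA benchmark, demonstrating the value of holistic pipeline optimization for demanding video understanding tasks.
Insight: The contributions are query/option pre-processing, domain-specific model fine-tuning, Temporal Chain-of-Thought prompting for multi-step reasoning, and post-processing optimization, underscoring the importance of systematically improving the entire video QA pipeline.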
Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
[67] Attend to what I say: Highlighting relevant content on slides cs.CVPDF
Megha Mariam K M, C. V. Jawahar
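TL;DR: This paper proposes a method that automatically identifies and highlights the slide regions most relevant to the speaker's narration. It analyzes the spoken content and matches it with textual or graphical elements on the slide, so that what the audience hears and what it looks at stay in sync.
Details
Motivation: When watching a presentation such as a conference talk, listeners must follow the speaker while scanning the slides for relevant information, which adds cognitive burden and leaves them visually catching up. The work aims to reduce cognitive load and improve comprehension.
Result: The paper explores different approaches to the problem and assesses their success and failure cases; the code and dataset are publicly available.
Insight: The novelty lies in applying multimedia document analysis (combining speech, text, graphics, and layout) to content-rich videos such as educational videos and conference talks, using automatic highlighting of relevant regions to strengthen audio-visual synchronization.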
Abstract: Imagine sitting in a presentation, trying to follow the speaker while simultaneously scanning the slides for relevant information. While the entire slide is visible, identifying the relevant regions can be challenging. As you focus on one part of the slide, the speaker moves on to a new sentence, leaving you scrambling to catch up visually. This constant back-and-forth creates a disconnect between what is being said and the most important visual elements, making it hard to absorb key details, especially in fast-paced or content-heavy presentations such as conference talks. This requires an understanding of slides, including text, graphics, and layout. We introduce a method that automatically identifies and highlights the most relevant slide regions based on the speaker’s narrative. By analyzing spoken content and matching it with textual or graphical elements in the slides, our approach ensures better synchronization between what listeners hear and what they need to attend to. We explore different ways of solving this problem and assess their success and failure cases. Analyzing multimedia documents is emerging as a key requirement for seamless understanding of content-rich videos, such as educational videos and conference talks, by reducing cognitive strain and improving comprehension. Code and dataset are available at: https://github.com/meghamariamkm2002/Slide_Highlight
[68] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset cs.CV | cs.AIPDF
Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu
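TL;DR: This paper presents DanQing, a large-scale Chinese vision-language pre-training dataset containing 100 million image-text pairs collected from Common Crawl. The dataset is built with a more rigorous selection pipeline, which yields higher data quality, and it is drawn mainly from 2024-2025 web data, so it better captures semantic trends. Continual pre-training experiments with SigLIP2 show that DanQing achieves superior performance on multiple Chinese downstream tasks.
Details
Motivation: Chinese vision-language pre-training lags well behind English, mainly because high-quality Chinese image-text data is scarce. The paper aims to fill this gap by building a high-quality Chinese cross-modal dataset to advance research in the area.
Result: With continual pre-training of SigLIP2, DanQing achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations.
Insight: The contribution is a new pipeline for constructing a high-quality Chinese vision-language dataset with better data quality and greater timeliness (2024-2025 data), which helps models capture evolving semantic trends and increases practical value. Viewed objectively, the planned open release of the dataset under CC-BY 4.0 should considerably help Chinese multimodal research.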
Abstract: Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
[69] Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models cs.CV | cs.MMPDF
Peng-Fei Zhang, Zi Huang
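TL;DR: This paper proposes Hierarchical Refinement Attack (HRA), a multimodal universal adversarial attack framework that addresses the sample-specific nature and high computational overhead of existing attacks on vision-language pre-training (VLP) models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. It disentangles clean images from perturbations in the image modality, introduces the ScMix augmentation strategy, refines the optimization path with a temporal hierarchy of historical and future gradients, and identifies globally influential words in the text modality as universal text perturbations, yielding an efficient, universal attack on a variety of VLP models.
Details
Motivation: Most existing adversarial attacks on VLP models are sample-specific and incur substantial computational overhead when scaled to large datasets or new scenarios. The paper aims to overcome this limitation with a universal multimodal attack framework.
Result: Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attack over existing methods.
Insight: The contributions are: 1) in the image modality, disentangling adversarial examples into clean images and perturbations that are handled independently, which disrupts cross-modal alignment more effectively; 2) the ScMix augmentation strategy, which diversifies visual contexts, strengthens the global and local utility of UAPs, and reduces reliance on spurious features; 3) refining the optimization path with a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning; 4) combining intra-sentence and inter-sentence importance measures to identify globally influential words as universal text perturbations. Viewed objectively, the hierarchical, multimodal design is clearly structured and offers useful ideas for improving the efficiency and generalization of adversarial attacks.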
Abstract: Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
[70] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding cs.CV | cs.CLPDF
Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang
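TL;DR: This paper presents ROMA, a real-time omni-multimodal assistant that unifies reactive and proactive interaction. To handle the challenges of streaming audio-video understanding, it processes continuous inputs as synchronized multimodal units and introduces a lightweight speak head that decouples response initiation from generation.
Details
Motivation: Existing omni-multimodal LLMs offer incomplete modality support or lack autonomous proactive monitoring in streaming audio-video understanding. The work aims at unified real-time reactive and proactive interaction.
Result: Extensive experiments on 12 benchmarks covering proactive (alert, narration) and reactive (QA) settings show that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings.
Insight: The contributions are processing continuous inputs as synchronized multimodal units to align dense audio with discrete video frames, and a lightweight speak head for precise online triggering. Viewed objectively, the unified evaluation suite and the two-stage curriculum are useful reference points for standardizing the evaluation of streaming multimodal understanding.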
Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
[71] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders cs.CVPDF
Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang
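TL;DR: This paper proposes Think-Then-Generate (T2G), a new paradigm for improving reasoning-aware generation in text-to-image (T2I) diffusion models. A lightweight supervised fine-tuning stage activates a think-then-rewrite pattern in the large language model (LLM) encoder, and Dual-GRPO then co-optimizes the LLM encoder and the diffusion backbone, so that the model reasons about and rewrites the raw user prompt and generates images that are more factual, semantically aligned, and visually realistic.
Details
Motivation: Most existing T2I diffusion models use the LLM only as a text encoder and do not exploit its inherent reasoning ability to infer what the prompt should depict, so outputs tend to stay at a literal mapping. The paper aims to move beyond literal generation by introducing a reasoning step that improves image quality and consistency.
Result: On reasoning-based image generation and editing benchmarks, the method brings substantial improvements in factual consistency, semantic alignment, and visual realism, reaching a WISE score of 0.79, nearly on par with GPT-4.
Insight: The core contribution is the think-then-generate paradigm, which turns the LLM from a plain text encoder into a component that reasons and rewrites, and which couples reasoning tightly with image generation through image-grounded rewards and co-optimization (Dual-GRPO), enabling more intelligent, context-appropriate image synthesis. This is a promising path toward next-generation unified models with reasoning, expression, and demonstration capacities.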
Abstract: Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers – they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
[72] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs cs.CVPDF
Ningyu Sun, Zhaolin Cai, Zitong Xu, Peihang Chen, Huiyu Duan
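TL;DR: This paper introduces the HPE-Bench benchmark and a unified framework based on layer-selective multimodal large language models (MLLMs) for fine-grained assessment of text-guided human pose editing. Using contrastive LoRA tuning and a layer sensitivity analysis mechanism, the framework performs strongly on both authenticity detection and multi-dimensional quality regression.
Details
Motivation: Existing evaluation methods for text-guided human pose editing separate authenticity detection from quality assessment and cannot provide fine-grained analysis of pose inconsistencies.
Result: On HPE-Bench, which contains 1,700 samples, the proposed framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, bridging the gap between forensic detection and quality assessment.
Insight: The contributions are a dedicated pose-editing assessment benchmark, HPE-Bench, and a unified evaluation framework based on layer-selective MLLMs, in which a layer sensitivity analysis mechanism automatically determines the best feature layer for pose evaluation.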
Abstract: Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.
[73] Global Context Compression with Interleaved Vision-Text Transformation cs.CV | cs.AIPDF
Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang
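TL;DR: This paper proposes VIST2, a model that renders text chunks into sketch images and interleaves them with their visual encodings to achieve global context compression. It reduces the number of tokens at both the prefilling and inference stages, which significantly improves the efficiency of long-text generation.
Details
Motivation: The success of vision-language models in end-to-end OCR shows that textual information can be compressed with low loss, but partial compression methods save no computation or memory during token-by-token inference. The paper therefore studies global context compression to improve both prefilling and inference efficiency.
Result: At a 4x compression ratio, VIST2 models significantly outperform baselines on long-text generation tasks, achieving on average a 3x speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS. Models from 0.6B to 8B parameters were evaluated.
Insight: The contributions are a Transformer architecture with interleaved vision-text transformation, the rendering of text chunks as sketch images, and a multi-stage training strategy that goes from curriculum-scheduled pretraining to modal-interleaved instruction tuning, offering a new approach to efficient long-context processing.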
Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This has motivated earlier works that render the Transformer’s input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically growing attention computation. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS. Our code and datasets will be made public to support further studies.
[74] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer cs.CV | cs.AI | cs.MMPDF
Filippo Ruffini, Camillo Maria Caruso, Claudia Tacconi, Lorenzo Nibid, Francesca Miccolis
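TL;DR: This paper proposes a missing-aware multimodal survival prediction framework for unresectable stage II-III non-small cell lung cancer that integrates CT imaging, whole-slide histopathology images, and structured clinical variables. Using foundation models for modality-specific feature extraction and a missing-aware encoding strategy, the method performs intermediate multimodal fusion when modalities are naturally missing, so all available data can be used in training and inference without discarding patients.
Details
Motivation: Multimodal deep learning for NSCLC survival prediction is severely limited by small cohorts and missing modalities, which hampers clinical applicability and usually forces complete-case filtering or aggressive imputation.
Result: Intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies on the concordance index, and the fusion of WSI and clinical modalities performs best (C-index of 73.30). Further analysis shows that the model adaptively down-weights less informative modalities such as CT.
Insight: The contribution is an architecture that is inherently robust to missing modalities: foundation-model feature extraction and missing-aware encoding enable intermediate fusion under naturally incomplete modality profiles, and the model adaptively assesses modality importance.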
Abstract: Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) offers promise for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology Images (WSI), and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index). Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., the CT modality, are automatically down-weighted and contribute less to the final survival prediction.
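The headline metric in this entry is the concordance index (C-index). As a reader aid, the sketch below shows one common way to compute Harrell's C-index with right-censored data. The function name, the tie handling (half credit for tied risks), and the toy data are assumptions made for illustration; they are not taken from the paper, which may use a different estimator or library. Note that the function returns a value in [0, 1], whereas the abstract reports the metric on a 0-100 scale.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times  : observed follow-up times
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the earlier
        # time corresponds to an observed event (not a censoring).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier event has higher risk: correct order
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risks get half credit (assumed convention)
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients, the third one is censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.6, 0.7, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)  # 4 of 5 pairs concordant
```

In the toy data, five pairs are comparable and four of them are ordered correctly, so the index is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.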
[75] Multi-Temporal Frames Projection for Dynamic Processes Fusion in Fluorescence Microscopy cs.CVPDF
Hassan Eshkiki, Sarah Costa, Mostafa Mohammadpour, Farinaz Tanhaei, Christopher H. George
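TL;DR: This paper proposes a novel computational framework that fuses multiple time-resolved fluorescence microscopy frames into a single high-quality image, overcoming noise, temporal variability, and inconsistent signal visualisation. The method is evaluated extensively on a dataset of dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells, and the results show that it preserves and enhances the biological content of the original video.
Details
Motivation: Fluorescence microscopy recordings suffer from degraded image quality caused by noise, temporal variability, and inconsistent visualisation of oscillating signals. The goal is to integrate information from multiple time frames into a high-quality static image.
Result: Evaluated over 111 configurations on a challenging cardiac cell dataset, the method yields a 44% average increase in cell count compared to previous methods, indicating that it preserves and enhances image quality and information.
Insight: The contribution is a general framework that combines explainable techniques from different computer vision application fields. It is applicable to other imaging domains that need to fuse multi-temporal image stacks into high-quality 2D images, which facilitates annotation and downstream segmentation.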
Abstract: Fluorescence microscopy is widely employed for the analysis of living biological samples; however, the utility of the resulting recordings is frequently constrained by noise, temporal variability, and inconsistent visualisation of signals that oscillate over time. We present a unique computational framework that integrates information from multiple time-resolved frames into a single high-quality image, while preserving the underlying biological content of the original video. We evaluate the proposed method across a large number of configurations (n = 111) and on a challenging dataset comprising dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells. Results show that our framework, which consists of a combination of explainable techniques from different computer vision application fields, is capable of generating composite images that preserve and enhance the quality and information of individual microscopy frames, yielding a 44% average increase in cell count compared to previous methods. The proposed pipeline is applicable to other imaging domains that require the fusion of multi-temporal image stacks into high-quality 2D images, thereby facilitating annotation and downstream segmentation.
[76] Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation cs.CVPDF
Clementine Grethen, Nicolas Menga, Roland Brochard, Geraldine Morin, Simone Gasparini
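TL;DR: This paper proposes Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without multi-view imagery or dedicated hardware, with the goal of high-fidelity lunar surface rendering.
Details
Motivation: Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are hard to estimate and which cannot capture local reflectance variations, limiting photometric realism. A method is needed that infers spatially varying reflectance directly from terrain geometry.
Result: On a geographically held-out region of the Tycho crater, the method reduces photometric error by 38% compared to a state-of-the-art baseline, achieves higher PSNR and SSIM, improves perceptual similarity, and captures fine-scale reflectance variations that uniform models miss.
Insight: According to the authors, this is the first method to infer a spatially varying reflectance model directly from terrain geometry. It trains a U-Net with differentiable rendering and needs no multi-view imagery or controlled illumination at inference time, enabling physically based, high-fidelity rendering.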
Abstract: We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are difficult to estimate and fail to capture local reflectance variations, limiting photometric realism. We propose Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without requiring multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. The method leverages a U-Net trained with differentiable rendering to minimize photometric discrepancies between real orbital images and physically based renderings under known viewing and illumination geometry. Experiments on a geographically held-out region of the Tycho crater show that our approach reduces photometric error by 38% compared to a state-of-the-art baseline, while achieving higher PSNR and SSIM and improved perceptual similarity, capturing fine-scale reflectance variations absent from spatially uniform models. To our knowledge, this is the first method to infer a spatially varying reflectance model directly from terrain geometry.
[77] Urban Socio-Semantic Segmentation with Vision-Language Reasoning cs.CV | cs.AI | cs.CYPDF
Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu
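TL;DR: This paper proposes an urban socio-semantic segmentation method based on vision-language model reasoning. It introduces a new dataset, SocioSeg, and a framework, SocioReasoner, to tackle the segmentation of socially defined categories (e.g., schools, parks) in satellite imagery.
Details
Motivation: Current advanced segmentation models reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but struggle with socially defined categories, so a new approach is needed for urban socio-semantic segmentation.
Result: Experiments show that the method outperforms existing state-of-the-art models on SocioSeg and exhibits strong zero-shot generalization.
Insight: The contributions are a hierarchically structured urban socio-semantic segmentation dataset and a vision-language reasoning framework that simulates human cross-modal recognition and multi-stage reasoning, with reinforcement learning used to optimize the non-differentiable process.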
Abstract: As hubs of human activity, urban surfaces contain a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach’s gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.
[78] MERGETUNE: Continued Fine-Tuning of Vision-Language Models cs.CVPDF
Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler
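TL;DR: This paper proposes MERGETUNE, an instance of a new paradigm called continued fine-tuning (CFT), to address the catastrophic forgetting caused by fine-tuning vision-language models such as CLIP. Guided by linear mode connectivity (LMC), the method searches for a continued model that has low-loss paths to both the zero-shot model and the fine-tuned model, recovering pretrained knowledge lost during fine-tuning. Experiments show that MERGETUNE significantly improves base-novel generalization and reaches state-of-the-art results in robust fine-tuning evaluations.
Details
Motivation: Fine-tuning vision-language models such as CLIP often leads to catastrophic forgetting. Existing methods focus on mitigating forgetting during adaptation, but forgetting often remains inevitable. The paper proposes a new paradigm that recovers lost pretrained knowledge after the model has been fine-tuned.
Result: On base-novel generalization, MERGETUNE improves the harmonic mean of CoOp by +5.6% without adding parameters. On cross-dataset transfer, it surpasses CLIP on DTD and EuroSAT for the first time. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines, and it achieves further gains and state-of-the-art results when ensembled with the zero-shot model.
Insight: The contribution is the continued fine-tuning (CFT) paradigm, which uses linear mode connectivity (LMC) to guide model merging and a second-order approximation to avoid large-scale data replay. The result is a model-agnostic, post hoc strategy that restores pretrained knowledge and improves generalization.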
Abstract: Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. We show, for the first time, performance superior to CLIP on both DTD and EuroSAT in cross-dataset transfer. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
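Linear mode connectivity, which this entry relies on, asks whether the loss stays low along the straight line between two parameter vectors. The sketch below illustrates that general notion on toy one- and two-parameter losses. It is not MERGETUNE's training objective or its second-order surrogate; the function names, the barrier definition used here (maximum excess over the linearly interpolated endpoint losses), and the toy losses are all assumptions for illustration.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess of the path loss over the interpolated endpoint losses.

    A barrier near zero suggests the two solutions are linearly mode
    connected under `loss_fn`; a large barrier suggests they are not.
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for k in range(steps):
        alpha = k / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def convex_loss(theta):
    # Single basin centred at (1, 1): every straight path stays low.
    return sum((x - 1.0) ** 2 for x in theta)


def double_well_loss(theta):
    # Two minima at x = -1 and x = +1 separated by a bump at x = 0.
    return (theta[0] ** 2 - 1.0) ** 2


connected = loss_barrier(convex_loss, [0.0, 0.0], [2.0, 2.0])   # no barrier
separated = loss_barrier(double_well_loss, [-1.0], [1.0])        # bump at x = 0
```

In the toy double-well case the midpoint has loss 1 while both endpoints have loss 0, so the barrier is 1; in the convex case the barrier is 0. In practice the loss would be evaluated on held-out data for real model weights rather than on closed-form functions.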
[79] BikeActions: An Open Platform and Benchmark for Cyclist-Centric VRU Action Recognition cs.CVPDF
Max A. Buettner, Kanak Mazumder, Luca Koecher, Mario Finkbeiner, Sebastian Niebler
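TL;DR: This paper presents FUSE-Bike, the first open perception platform built around the cyclist's viewpoint, and BikeActions, a multimodal dataset, to address the challenge of predicting the behavior of vulnerable road users (VRUs) in dense shared spaces. It also establishes the first performance benchmark on this dataset.
Details
Motivation: Current autonomous driving research focuses mainly on pedestrian crossing behavior from the vehicle's perspective, while interactions in dense shared spaces (for example, where cyclists interact with pedestrians and vehicles) remain underexplored. High-fidelity, close-range data captured from the cyclist's first-person viewpoint is needed to improve VRU behavior modeling.
Result: State-of-the-art graph convolution and transformer-based models are evaluated on the publicly released data splits, establishing the first performance baselines for this challenging task.
Insight: The contribution is the first open multimodal platform and dataset captured from the cyclist's viewpoint, focused on VRU action recognition in dense shared spaces, released together with the full dataset, data curation tools, the open hardware design, and benchmark code to foster research in the area.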
Abstract: Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle’s perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist’s viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding under https://iv.ee.hm.edu/bikeactions/.
[80] SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery cs.CVPDF
Chong Liu, Luxuan Fu, Yang Jia, Zhen Dong, Bisheng Yang
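TL;DR: SVII-3D is a unified framework for digitizing roadside infrastructure and taking asset inventories, aiming at decimeter-level 3D localization and fine-grained state understanding from sparse street imagery. It fuses LoRA fine-tuned open-set detection with a spatial-attention matching network to associate sparse views, introduces a geometry-guided refinement mechanism to improve localization accuracy, and integrates a vision-language model agent that automatically diagnoses the operational state of assets.
Details
Motivation: In smart city construction and facility lifecycle management, using low-cost sparse imagery for automated digital twins and precise asset inventories suffers from limited robustness, inaccurate localization, and a lack of fine-grained state understanding.
Result: Experiments show that SVII-3D significantly improves identification accuracy and minimizes localization errors, providing a scalable and cost-effective solution for high-fidelity infrastructure digitization.
Insight: The contribution is the combination of open-set detection, spatial-attention matching, geometry-guided 3D localization refinement, and a vision-language model agent, which takes asset digitization end to end from sparse perception to automated intelligent maintenance and goes beyond traditional static geometric mapping.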
Abstract: The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.
[81] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure cs.CVPDF
Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
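TL;DR: This paper proposes a domain-adapted framework that turns large vision-language models into specialized agents for intelligent roadside infrastructure analysis. The framework combines a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism: open-vocabulary fine-tuning of Grounding DINO localizes assets, and LoRA-based adaptation of Qwen-VL performs deep semantic attribute reasoning. A dual-modality retrieval-augmented generation module dynamically retrieves authoritative industry standards and visual exemplars at inference time to reduce hallucinations and enforce professional compliance.
Details
Motivation: General-purpose models struggle to capture the fine-grained attributes and domain rules needed for urban roadside infrastructure, and large vision-language models have difficulty interpreting complex facility states in compliance with engineering standards, which makes real-world use unreliable. The paper aims to make VLMs dependable for intelligent infrastructure monitoring.
Result: Evaluated on a comprehensive new dataset of urban roadside scenes, the framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%.
Insight: The contribution is the combination of open-vocabulary detection, parameter-efficient fine-tuning, and dual-modality RAG into a domain-adapted framework dedicated to infrastructure analysis. The core insight is to use external authoritative knowledge (industry standards and visual exemplars) to constrain and strengthen VLM reasoning, addressing hallucination and compliance problems in a professional domain and moving from generic open-world recognition to reliable domain-specific perception.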
Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
[82] Inference-time Physics Alignment of Video Generative Models with Latent World Models cs.CVPDF
Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall
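TL;DR: This paper proposes WMReward, a method that improves the physical plausibility of video generative models through inference-time alignment. It uses the physics prior of a latent world model (such as VJEPA-2) as a reward to search over and steer multiple candidate denoising trajectories, improving the physical realism of generated videos by spending more inference-time compute without changing the pretrained model.
Details
Motivation: State-of-the-art video generative models produce visually appealing content but often violate basic physics principles, which limits their utility. The authors argue that this deficiency stems not only from insufficient physics understanding in pre-training but also from suboptimal inference strategies.
Result: The method substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned video generation settings, as validated by a human preference study. In the ICCV 2025 Perception Test PhysicsIQ Challenge it achieves a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%.
Insight: The core contribution is to treat improving the physical plausibility of video generation as an inference-time alignment problem and to use a latent world model as the reward function that guides generation. This offers a new paradigm for improving a specific aspect of generation quality, such as physical plausibility, by optimizing the inference strategy rather than retraining or fine-tuning the base generative model.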
Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
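The simplest form of reward-guided inference-time search is best-of-N selection: draw several candidates, score each with a reward, and keep the highest-scoring one. The sketch below shows only that selection step on toy scalars. It is not WMReward's procedure, which searches over and steers denoising trajectories with a latent world model; `toy_sampler` and `toy_reward` are hypothetical stand-ins for a video sampler and a world-model reward.

```python
import random


def best_of_n(sample_fn, reward_fn, n, seed=0):
    """Draw n candidates and return the one with the highest reward."""
    rng = random.Random(seed)
    candidates = [sample_fn(rng) for _ in range(n)]
    scores = [reward_fn(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_sampler(rng):
    # Hypothetical stand-in for sampling one generation: a scalar "video".
    return rng.uniform(-1.0, 1.0)


def toy_reward(candidate):
    # Hypothetical stand-in for a plausibility score: closer to 0 is better.
    return -abs(candidate)


_, score_1 = best_of_n(toy_sampler, toy_reward, n=1)
_, score_16 = best_of_n(toy_sampler, toy_reward, n=16)
```

Because both calls use the same seed, the single candidate from the first call is also the first of the sixteen candidates in the second call, so the selected reward can only stay the same or improve as N grows. This is the basic sense in which extra test-time compute can buy better samples under a fixed generator.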
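[83] Jordan-Segmentable Masks: A Topology-Aware definition for characterizing Binary Image Segmentation cs.CV | math.AT | math.NA
Serena Grazia De Benedictis, Amedeo Altavilla, Nicoletta Del Buono
TL;DR: The paper proposes a topology-aware approach to segmentation evaluation grounded in the Jordan Curve Theorem and digital topology. It defines the notion of a Jordan-segmentable mask to assess the structural coherence of binary image segmentations.
Details
Motivation: Existing segmentation metrics, whether pixel-wise, region-based, or boundary-focused, struggle to capture structural and topological coherence. In applications such as medical imaging, a mask with holes or fragmented regions can still score highly, so these metrics cannot guarantee that a segmentation partitions the image into meaningful interior and exterior regions.
Result: The paper analyzes segmentation masks using digital topology and homology theory: it extracts a 4-curve candidate from the mask and verifies its topological validity with Betti numbers. A mask is Jordan-segmentable when the candidate is a digital 4-curve with β_0 = β_1 = 1, or equivalently when its complement splits into exactly two 8-connected components.
Insight: The contribution is adapting the Jordan Curve Theorem to the digital plane and combining it with homological invariants to obtain an unsupervised framework for assessing topological correctness. It complements conventional metrics in applications that must preserve topological coherence, such as medical image segmentation.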
Abstract: Image segmentation plays a central role in computer vision. However, widely used evaluation metrics, whether pixel-wise, region-based, or boundary-focused, often struggle to capture the structural and topological coherence of a segmentation. In many practical scenarios, such as medical imaging or object delineation, predictions with small boundary inaccuracies, holes, or fragmented regions can still receive high metric scores, even though the resulting masks fail to preserve the object's global shape or connectivity. This highlights a limitation of conventional metrics: they are unable to assess whether a predicted segmentation partitions the image into meaningful interior and exterior regions. In this work, we introduce a topology-aware notion of segmentation based on the Jordan Curve Theorem, adapted for use in digital planes. We define the concept of a Jordan-segmentable mask, which is a binary segmentation whose structure ensures a topological separation of the image domain into two connected components. We analyze segmentation masks through the lens of digital topology and homology theory, extracting a $4$-curve candidate from the mask and verifying its topological validity using Betti numbers. A mask is considered Jordan-segmentable when this candidate forms a digital 4-curve with $\beta_0 = \beta_1 = 1$, or equivalently when its complement splits into exactly two $8$-connected components. This framework provides a mathematically rigorous, unsupervised criterion with which to assess the structural coherence of segmentation masks. By combining digital Jordan theory and homological invariants, our approach provides a valuable alternative to standard evaluation metrics, especially in applications where topological correctness must be preserved.
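The complement criterion stated at the end of this abstract is easy to check directly. The sketch below covers only that half of the definition: it counts the 8-connected components of the background around a given curve and does not extract the 4-curve candidate or compute Betti numbers. The function names and the toy grids are illustrative assumptions, not the authors' code.

```python
from collections import deque
from typing import List

# Offsets of the eight neighbours used for 8-connectivity.
NEIGHBOURS_8 = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, -1),           (0, 1),
    (1, -1),  (1, 0),  (1, 1),
]


def count_components_8(grid: List[List[int]], value: int) -> int:
    """Count 8-connected components of the cells equal to `value`."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and grid[ny][nx] == value
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def complement_splits_in_two(curve: List[List[int]]) -> bool:
    """True when the background (0 cells) has exactly two 8-components."""
    return count_components_8(curve, 0) == 2


CLOSED_RING = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# Same ring with one pixel removed: interior and exterior now touch.
BROKEN_RING = [row[:] for row in CLOSED_RING]
BROKEN_RING[1][2] = 0

if __name__ == "__main__":
    print(complement_splits_in_two(CLOSED_RING))  # True
    print(complement_splits_in_two(BROKEN_RING))  # False
```

Pairing a 4-connected curve with an 8-connected background is the usual convention in digital topology: a curve that closes only through diagonal steps would let the background leak through under 8-connectivity.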
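[84] Adversarial Evasion Attacks on Computer Vision using SHAP Values cs.CV | cs.AI
Frank Mollard, Marcus Becker, Florian Roehrbein
TL;DR: The paper proposes a white-box adversarial attack on computer vision models based on SHAP values. The attack uses SHAP values to quantify how much each input feature contributes to the model output at inference time, and generates adversarial examples that reduce output confidence or induce misclassification. Compared with the well-known Fast Gradient Sign Method (FGSM), the SHAP attack is found to be more robust at producing misclassifications, particularly in gradient-hiding scenarios.
Details
Motivation: The work explores how SHAP values can drive stealthier and more effective adversarial evasion attacks, exposing how vulnerable deep learning models are to carefully crafted input perturbations, especially when the attack fools the algorithm while remaining imperceptible to humans.
Result: Experiments comparing the SHAP attack with FGSM provide evidence that the SHAP attack is more robust at generating misclassifications, especially under gradient hiding. The abstract does not say which standard benchmark datasets were used or whether the results reach the state of the art.
Insight: The claimed novelty is using SHAP values for the first time to guide white-box adversarial example generation, giving an attack perspective based on feature-importance explanations. Viewed objectively, this is a new use of an explainability tool in adversarial attacks and may inspire research into the vulnerabilities of model interpretability itself.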
Abstract: The paper introduces a white-box attack on computer vision models using SHAP values. It demonstrates how adversarial evasion attacks can compromise the performance of deep learning models by reducing output confidence or inducing misclassifications. Such attacks are particularly insidious as they can deceive the perception of an algorithm while eluding human perception due to their imperceptibility to the human eye. The proposed attack leverages SHAP values to quantify the significance of individual inputs to the output at the inference stage. A comparison is drawn between the SHAP attack and the well-known Fast Gradient Sign Method. We find evidence that SHAP attacks are more robust in generating misclassifications particularly in gradient hiding scenarios.
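For context on the baseline named in this abstract, the Fast Gradient Sign Method perturbs an input by a step of size ε in the direction of the sign of the loss gradient with respect to that input. The sketch below applies it to a two-feature logistic regression so the gradient can be written by hand. The model, weights, and ε are illustrative assumptions, and the SHAP-based attack itself is not reproduced because the abstract does not specify it.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One FGSM step: x_adv = x + eps * sign(dL/dx).

    For binary cross-entropy on logistic regression the input gradient
    is (p - y) * w, where p is the predicted probability of class 1.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0   # illustrative weights
    x, y = [0.5, 0.2], 1      # clean input with true label 1
    x_adv = fgsm(w, b, x, y, eps=0.3)
    print(round(predict(w, b, x), 3), round(predict(w, b, x_adv), 3))
```

Because the step follows the gradient of the loss, the model's confidence in the true class drops after the perturbation. An attack that does not need gradients, which is how the abstract positions the SHAP variant in gradient-hiding scenarios, would replace the `grad` computation with a different importance signal.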
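[85] Action100M: A Large-scale Video Action Dataset cs.CV
Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung
TL;DR: The paper introduces Action100M, a large-scale video action dataset built from 1.2 million Internet instructional videos, containing about 100 million temporally localized segments with open-vocabulary action annotations. The dataset is produced by a fully automated pipeline that performs hierarchical temporal segmentation with V-JEPA 2 embeddings, generates tree-structured multi-level captions, and uses GPT-OSS-120B with multi-round Self-Refine reasoning to output structured annotations. Training VL-JEPA on Action100M yields consistent data-scaling gains and strong zero-shot performance across several action recognition benchmarks.
Details
Motivation: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world, and it requires large-scale, open-vocabulary, cross-domain video action datasets. Existing datasets fall short in scale and annotation diversity, so the paper builds a new large-scale, open-vocabulary dataset to serve as a foundation for video understanding and world modeling research.
Result: VL-JEPA models trained on Action100M improve steadily as the amount of data grows and achieve strong zero-shot performance on a range of action recognition benchmarks.
Insight: The contribution is the Action100M dataset itself: very large, open-vocabulary, and richly annotated with structured labels. At its core is a fully automated annotation pipeline combining hierarchical temporal segmentation, tree-structured caption generation, and multi-round Self-Refine reasoning with a large language model (GPT-OSS-120B), which automates high-quality annotation at scale and gives video understanding a new, scalable research foundation.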
Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
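[86] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation cs.CV
Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen
TL;DR: RSATalker is the first framework to use 3D Gaussian Splatting (3DGS) for realistic, socially-aware talking head generation with support for multi-turn conversation. It drives mesh-based 3D facial motion, binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos, and introduces a learnable query mechanism that encodes social relationships to capture interpersonal dynamics.
Details
Motivation: Existing talking head generation methods are limited in VR social scenarios: mesh-based 3D methods can model two-person dialogue but lack realistic textures; large-model-based 2D methods look natural but are prohibitively expensive to compute; and 3DGS-based methods are efficient and realistic but handle only a single speaker and ignore social relationships.
Result: Extensive experiments show that RSATalker achieves state-of-the-art performance in both realism and social awareness.
Insight: The contributions are the first application of 3DGS to socially-aware talking head generation for multi-turn conversation, a learnable socially-aware module that encodes social relationships (blood versus non-blood, equal versus unequal), a three-stage training paradigm, and the RSATalker dataset annotated with social relationships.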
Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
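[87] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV | cs.AI
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi
TL;DR: The paper presents Molmo2, an open family of video-language models that is state of the art among open-source models on video understanding and grounding tasks. Its key contributions are 7 new video datasets and 2 multi-image datasets, together with a training recipe that includes efficient data packing, message-tree encoding, bi-directional attention on vision tokens, and a novel token-weight strategy.
Details
Motivation: The strongest video-language models today are mostly proprietary, while open models either distill from synthetic data produced by proprietary models or do not disclose their training data and recipe. This holds back the open-source community's progress on video (and image) language models. In addition, many downstream applications need pixel-level grounding on top of high-level video understanding, a capability that even proprietary models lack.
Result: On short-video, counting, and captioning tasks, the 8B model is the best among models with open weights and data, and it is competitive on long-video tasks. On video grounding, Molmo2 significantly outperforms open-weight models such as Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models such as Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing, 56.2 vs 41.1 J&F on video tracking).
Insight: The main contributions are: 1) a collection of new, high-quality video and multi-image datasets built without help from proprietary models, giving the open-source community a key foundation; 2) a training recipe with efficient data packing, message-tree encoding, bi-directional vision attention, and a novel token-weight strategy that improves performance, particularly on pixel-level grounding.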
Abstract: Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
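[88] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos cs.CV
Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu
TL;DR: CoMoVi is a co-generative framework that couples two video diffusion models to generate 3D human motion and 2D human video synchronously within a single denoising loop. Building on the observation that 3D motion supplies a structural prior for video while video models supply generalization for motion, it designs a dual-branch diffusion model with feature interaction and curates a large-scale annotated dataset of real-world human videos.
Details
Motivation: 3D human motion generation and 2D human video generation are intrinsically coupled, yet existing methods usually treat them independently. The paper couples the two generation processes so that the structural prior of 3D motion improves the plausibility and consistency of videos, while the strong generalization of pre-trained video models strengthens motion generation.
Result: Extensive experiments demonstrate the method's effectiveness on both 3D human motion and video generation. The abstract gives no specific quantitative results (such as FID or R-Precision) and does not say whether the method reaches the state of the art on particular benchmarks (such as HumanML3D or KIT-ML).
Insight: The main contributions are: 1) an effective 2D human motion representation that inherits the strong prior of pre-trained video diffusion models; 2) a dual-branch diffusion model with mutual feature interaction and 3D-2D cross attention that co-generates motion and video; 3) the CoMoVi Dataset, a large-scale, diverse real-world human video dataset with text and motion annotations.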
Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
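[89] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning cs.CV
Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari
TL;DR: The paper introduces CURVE, a benchmark for evaluating video models on multicultural and multilingual long-video reasoning. It contains high-quality, human-annotated cultural video data from 18 locales worldwide, with complex questions, answers, and reasoning steps written in native languages. State-of-the-art Video-LLMs perform far below human level on CURVE, with errors stemming mainly from poor visual perception of cultural elements.
Details
Motivation: Current video understanding benchmarks are built mostly on Western-centric data and English, which introduces significant evaluation bias. The authors created CURVE to address this and to encourage a deep understanding of cultural context in multicultural, multilingual video.
Result: On CURVE, state-of-the-art Video-LLMs score substantially below human accuracy, highlighting their weakness in visual perception of cultural content.
Insight: The contribution is a high-quality, multicultural, multilingual video reasoning benchmark. Its reasoning traces are used to build evidence-based graphs, and a novel iterative strategy over these graphs identifies fine-grained reasoning errors, suggesting a new direction for evaluating and improving models' cultural understanding.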
Abstract: Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE’s reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
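[90] A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements cs.CV
S M Rayeed, Mridul Khurana, Alyson East, Isadora E. Fluck, Elizabeth G. Campolongo
TL;DR: The study builds a multimodal dataset of more than 13,200 ground beetle specimens from 30 sites across the continental United States and Hawaii. Through high-resolution imaging and digital measurement of morphological traits, it addresses the under-representation of invertebrates in trait databases and provides a basis for AI-driven automated species identification and trait-based ecological research.
Details
Motivation: Global trait databases are heavily biased toward vertebrates and plants, which limits comprehensive ecological analysis of high-diversity invertebrate groups such as ground beetles. The National Ecological Observatory Network (NEON) holds a large physical collection of carabid specimens, but access and large-scale analysis are restricted, so a digitized, widely accessible multimodal dataset is needed.
Result: The dataset digitizes NEON's carabid specimens through high-resolution imaging and includes digitally measured elytra length and width for each specimen. Validation against manual measurements shows that the digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies.
Insight: The contribution is turning a physical specimen collection into a high-quality, computationally analyzable multimodal digital resource and validating the precision of the digitized trait measurements. This gives a reliable basis for AI-based automated trait extraction and species identification and supports biodiversity monitoring and conservation.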
Abstract: Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.
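[91] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion cs.CV
Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di
TL;DR: The paper proposes Cross-Layer Injection (CLI), a novel lightweight framework that addresses the visual feature bottleneck in existing vision-language models (VLMs). Through an Adaptive Multi-Projection (AMP) module and an Adaptive Gating Fusion (AGF) mechanism, CLI builds a dynamic many-to-many connection between the vision encoder and the large language model (LLM), letting the LLM selectively inject the most relevant hierarchical visual information based on its real-time decoding context and thereby improving multimodal understanding.
Details
Motivation: Existing VLMs typically use a static, asymmetric architecture that connects only the final output of the vision encoder to the input of the LLM. This limits how fully the LLM can align with hierarchical visual knowledge and makes it hard to integrate local details with global semantics into coherent reasoning.
Result: CLI was integrated into LLaVA-OneVision and LLaVA-1.5 and evaluated extensively on 18 diverse benchmarks. The results show significant performance improvements, supporting CLI's effectiveness and versatility.
Insight: The contribution is the move from a static one-to-one connection to dynamic many-to-many cross-layer injection. The parameter-efficient AMP and AGF components give the LLM on-demand access to the full visual hierarchy, offering a scalable paradigm for deeper multimodal understanding.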
Abstract: Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
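[92] Alterbute: Editing Intrinsic Attributes of Objects in Images cs.CV | cs.GR
Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch
TL;DR: The paper presents Alterbute, a diffusion-based method for editing the intrinsic attributes of objects in images (such as color, texture, material, and shape) while preserving their perceived identity and scene context. It achieves identity-preserving supervised learning by combining a relaxed training objective with Visual Named Entities (VNEs).
Details
Motivation: Existing methods either rely on unsupervised priors, which often fail to preserve identity, or use overly restrictive supervision, which prevents meaningful changes to intrinsic attributes. A method is needed that edits intrinsic attributes effectively while preserving object identity.
Result: Alterbute outperforms existing methods on identity-preserving editing of object intrinsic attributes.
Insight: The contributions are: 1) a relaxed training objective, with extrinsic changes restricted at inference by reusing the original background and object mask; 2) Visual Named Entities (VNEs) as fine-grained visual identity categories, with a vision-language model automatically extracting labels and attribute descriptions to enable scalable identity-preserving supervision.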
Abstract: We introduce Alterbute, a diffusion-based method for editing an object’s intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs), fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
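[93] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments cs.CV
Xuweiyi Chen, Wentao Zhou, Zezhou Cheng
TL;DR: WildRayZer is a self-supervised framework for novel view synthesis in dynamic environments where both the camera and objects move. It runs an analysis-by-synthesis test in which a static renderer explains rigid structure and the residuals reveal transient regions. From these it constructs pseudo motion masks, distills a motion estimator, and uses the estimator to mask input tokens and gate loss gradients so that supervision focuses on cross-view background completion.
Details
Motivation: Dynamic content breaks the multi-view consistency that static novel view synthesis models depend on, causing ghosting, hallucinated geometry, and unstable pose estimation.
Result: Experiments on the Dynamic RealEstate10K dataset show that WildRayZer consistently outperforms both optimization-based and feed-forward baselines in transient-region removal and in full-frame novel view synthesis quality, using a single feed-forward pass.
Insight: The contribution is building pseudo motion masks from static-rendering residuals in a self-supervised way and using them to focus training on background completion. The paper also contributes the large-scale dynamic dataset D-RE10K for training and evaluation.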
Abstract: We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
cs.LG
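[94] Social Determinants of Health Prediction for ICD-9 Code with Reasoning Models cs.LG | cs.CL | cs.CY
Sharim Khan, Paul Landes, Adam Cross, Jimeng Sun
TL;DR: The paper explores multi-label classification of Social Determinants of Health (SDoH) ICD-9 codes for hospital admissions on the MIMIC-III dataset, using reasoning models and traditional large language models. Exploiting existing ICD-9 codes for prediction achieved an F1 of 89%, and the study identified missing SDoH codes in 139 admissions.
Details
Motivation: Social Determinants of Health correlate with patient outcomes but are rarely captured in structured data. The study aims to extract these markers automatically from clinical text to give diagnostic systems knowledge of patients' social circumstances, and to address the prediction difficulty caused by long-distance dependencies.
Result: On MIMIC-III, admission-level prediction that exploits existing ICD-9 codes reached an F1 of 89%.
Insight: The contribution is applying reasoning models and large language models to multi-label SDoH ICD-9 code classification over long texts (admission records), using existing diagnosis codes as prediction cues, and revealing through analysis that SDoH codes are missing in the dataset.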
Abstract: Social Determinants of Health correlate with patient outcomes but are rarely captured in structured data. Recent attention has been given to automatically extracting these markers from clinical text to supplement diagnostic systems with knowledge of patients’ social circumstances. Large language models demonstrate strong performance in identifying Social Determinants of Health labels from sentences. However, prediction in large admissions or longitudinal notes is challenging given long distance dependencies. In this paper, we explore hospital admission multi-label Social Determinants of Health ICD-9 code classification on the MIMIC-III dataset using reasoning models and traditional large language models. We exploit existing ICD-9 codes for prediction on admissions, which achieved an 89% F1. Our contributions include our findings, missing SDoH codes in 139 admissions, and code to reproduce the results.
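As a reading aid for the F1 figure in this abstract, the sketch below shows one common way to score multi-label predictions, micro-averaged F1 over per-admission code sets. The abstract does not say which averaging the authors use, so micro-averaging is an assumption here, and the code strings in the example are opaque placeholders rather than data from the paper.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over a list of gold and predicted label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Pool true positives, false positives and false negatives over
    # all admissions before computing precision and recall.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two hypothetical admissions with placeholder code labels.
    gold = [{"CODE_A", "CODE_B"}, {"CODE_C"}]
    pred = [{"CODE_A"}, {"CODE_C", "CODE_D"}]
    print(round(micro_f1(gold, pred), 3))
```

Micro-averaging weights every code assignment equally, so frequent codes dominate the score. Macro-averaging, which averages per-code F1, would weight rare codes more heavily and can give a noticeably different number on imbalanced label sets.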
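[95] Process-Guided Concept Bottleneck Model cs.LG | cs.AI | cs.CV
Reza M. Asiyabi, SEOSAW Partnership, Steven Hancock, Casey Ryan
TL;DR: The paper proposes the Process-Guided Concept Bottleneck Model (PG-CBM), which extends the standard Concept Bottleneck Model (CBM) with domain-defined causal mechanisms and biophysically meaningful intermediate concepts, aiming to improve the interpretability and accuracy of deep learning models in scientific applications.
Details
Motivation: Standard CBMs overlook domain-specific relationships and causal mechanisms and depend on complete concept labels, which limits their use in scientific domains such as Earth Observation where supervision is sparse but processes are well defined.
Result: In a case study estimating above-ground biomass density from Earth Observation data, PG-CBM reduces error and bias relative to multiple benchmarks while leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs.
Insight: By constraining learning to follow domain-defined causal mechanisms, PG-CBM improves accuracy and interpretability, enables detection of spurious learning, and provides scientific insights, a step toward more trustworthy AI systems for science.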
Abstract: Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.
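[96] The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit cs.LG | cs.CL
Faruk Alpay, Bilge Senturk
TL;DR: The paper proves that in the high-confidence regime (as the inverse temperature β tends to infinity), Transformer self-attention operates in the tropical semiring (max-plus algebra). Specifically, taking the tropical limit of softmax attention turns it into a tropical matrix product, revealing that the Transformer's forward pass effectively executes a dynamic programming recurrence (in particular, a Bellman-Ford path-finding update) on a latent graph defined by token similarities. This gives a new geometric perspective on chain-of-thought reasoning: it emerges from a shortest-path (or longest-path) algorithm inherent in the network's computation.
Details
Motivation: The paper seeks a mathematical understanding of Transformer self-attention, especially the nature of its computation in the high-confidence limit, in order to expose its connection to dynamic programming and path-finding algorithms.
Result: The paper shows by theoretical proof that, in the high-confidence limit, the Transformer is equivalent to a tropical polynomial circuit whose forward pass corresponds to running Bellman-Ford on a latent graph, giving a new theoretical framework for understanding the geometry of Transformer computation.
Insight: The contribution is linking Transformer self-attention to tropical algebra, dynamic programming, and graph algorithms. This offers a geometric, shortest-path-based account of behaviors such as chain-of-thought reasoning and supports theoretical analysis of Transformer reasoning ability.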
Abstract: We prove that the Transformer self-attention mechanism in the high-confidence regime ($β\to \infty$, where $β$ is an inverse temperature) operates in the tropical semiring (max-plus algebra). In particular, we show that taking the tropical limit of the softmax attention converts it into a tropical matrix product. This reveals that the Transformer’s forward pass is effectively executing a dynamic programming recurrence (specifically, a Bellman-Ford path-finding update) on a latent graph defined by token similarities. Our theoretical result provides a new geometric perspective for chain-of-thought reasoning: it emerges from an inherent shortest-path (or longest-path) algorithm being carried out within the network’s computation.
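The limit this abstract refers to rests on a standard identity: as β grows, (1/β) log Σ_i exp(β s_i) converges to max_i s_i, so sums of exponentials turn into maxima and ordinary matrix products turn into max-plus products. The sketch below checks the identity numerically and implements a max-plus matrix product. It illustrates the algebra only and is not the paper's proof or construction; the function names are my own.

```python
import math
from typing import List


def scaled_logsumexp(scores: List[float], beta: float) -> float:
    """(1/beta) * log(sum_i exp(beta * s_i)), computed stably."""
    m = max(scores)
    # Factor out the maximum so every exponent is <= 0 and cannot overflow.
    return m + math.log(sum(math.exp(beta * (s - m)) for s in scores)) / beta


def maxplus_matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Tropical (max-plus) product: C[i][j] = max_k (A[i][k] + B[k][j])."""
    inner = len(b)
    return [
        [max(a[i][k] + b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


if __name__ == "__main__":
    scores = [1.0, 3.0, 2.5]
    for beta in (1.0, 10.0, 1000.0):
        print(beta, round(scaled_logsumexp(scores, beta), 6))
    print(maxplus_matmul([[0, 1], [2, 0]], [[0, 3], [1, 0]]))
```

Reading `A[i][k]` as the weight of an edge from i to k, one max-plus product extends every best path by one hop, which is the relaxation step of Bellman-Ford in its longest-path form. Swapping max for min gives the shortest-path version.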
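[97] Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts cs.LG | cs.AI | cs.CL
Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang
TL;DR: The paper proposes Sparse-RL, which targets the excessive KV-cache memory overhead caused by long-sequence generation during reinforcement learning of large language models. It performs rollouts with a sparsified sampling policy and combines rejection sampling with importance reweighting, reducing memory use while keeping training stable and improving robustness under sparse inference deployment.
Details
Motivation: KV-cache memory overhead during long-sequence generation is a bottleneck for LLM reinforcement learning, and applying existing KV compression techniques directly to RL training causes policy mismatch and performance collapse.
Result: Experiments show that Sparse-RL reduces memory overhead while matching the performance of dense baselines, and that it significantly improves model robustness under sparse inference deployment.
Insight: The contribution is identifying the policy mismatch caused by sparse sampling and proposing Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the resulting bias, enabling stable and efficient sparse RL training.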
Abstract: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, which enables stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
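The two corrections named in this abstract belong to the general family of off-policy fixes: drop samples the target policy would almost never have produced, and reweight the rest by a likelihood ratio. The sketch below shows that generic pattern at the sequence level. It is not the paper's algorithm: the thresholds `reject_below` and `clip_max`, the sequence-level ratio, and the function names are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def sequence_importance_weight(
    logp_target: Sequence[float], logp_sampler: Sequence[float]
) -> float:
    """Ratio p_target(y) / p_sampler(y) from per-token log-probabilities."""
    if len(logp_target) != len(logp_sampler):
        raise ValueError("log-prob sequences must have the same length")
    return math.exp(sum(t - s for t, s in zip(logp_target, logp_sampler)))


def corrected_weight(
    logp_target: Sequence[float],
    logp_sampler: Sequence[float],
    reject_below: float = 0.1,  # hypothetical rejection threshold
    clip_max: float = 2.0,      # hypothetical clipping bound
) -> Optional[float]:
    """Return a clipped importance weight, or None if the rollout is rejected.

    A rollout whose ratio is very small was likely only under the sparse
    sampler, so it is discarded instead of being trained on.
    """
    weight = sequence_importance_weight(logp_target, logp_sampler)
    if weight < reject_below:
        return None
    return min(weight, clip_max)


if __name__ == "__main__":
    print(corrected_weight([-1.0, -2.0], [-1.0, -2.0]))  # policies agree
    print(corrected_weight([-5.0, -5.0], [-1.0, -1.0]))  # rejected
    print(corrected_weight([-1.0], [-3.0]))              # clipped
```

Clipping the ratio trades a small bias for lower variance, the same trade made by clipped policy-gradient objectives. A per-token ratio is a common alternative to the sequence-level product used here.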
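[98] PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary cs.LG | cs.AI | cs.CL
Jiarui Yao, Ruida Wang, Tong Zhang
TL;DR: The paper proposes Process Reward Learning (PRL), a method for improving the reasoning ability of large language models (LLMs) and broadening their reasoning boundary. PRL decomposes the entropy-regularized reinforcement learning objective into intermediate steps, giving fine-grained process supervision signals for the reasoning process and thereby improving training.
Details
Motivation: Most existing work relies on trajectory-level outcome rewards and lacks fine-grained supervision during reasoning. Training frameworks that do incorporate process signals usually depend on tedious extra steps, such as Monte Carlo Tree Search or training a separate reward model, which hurts efficiency, and the design of the process signals lacks theoretical support.
Result: Experiments show that PRL improves the average reasoning performance of LLMs (measured by average@n) and also broadens the reasoning boundary by improving pass@n. Its effectiveness is verified across extensive experiments and generalizes.
Insight: Starting from a theoretical motivation, PRL derives a formulation equivalent to reward maximization plus a KL-divergence penalty and turns outcome rewards into process supervision signals, guiding exploration during RL optimization more effectively without extra complex steps.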
Abstract: Improving the reasoning abilities of Large Language Models (LLMs) has been an active topic recently. However, most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Existing training frameworks that try to incorporate process signals to optimize LLMs also rely heavily on tedious additional steps such as MCTS or training a separate reward model, which harms training efficiency. Moreover, the intuition behind the design of the process signals lacks rigorous theoretical support, leaving the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that can be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL, which is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL can turn the outcome reward into process supervision signals, which helps better guide exploration during RL optimization. Our experimental results demonstrate that PRL not only improves the average performance of LLMs’ reasoning ability as measured by average@n, but also broadens the reasoning boundary by improving the pass@n metric. Extensive experiments show that the effectiveness of PRL can be verified and generalized.
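The two metrics named in this abstract are not defined there, so the sketch below uses their common definitions as an assumption: the average metric is the mean correctness over n sampled answers, and the pass metric is the probability that at least one of k answers drawn from those n is correct, computed with the standard combinatorial estimator 1 - C(n-c, k) / C(n, k), where c is the number of correct samples.

```python
from math import comb
from typing import Sequence


def average_at_n(correct_flags: Sequence[bool]) -> float:
    """Mean correctness over n sampled answers to one problem."""
    return sum(correct_flags) / len(correct_flags)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a random size-k subset of n samples contains
    at least one of the c correct ones: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    flags = [True, False, False, False]  # 1 correct out of 4 samples
    print(average_at_n(flags))           # 0.25
    print(pass_at_k(n=4, c=1, k=1))      # 0.25
    print(pass_at_k(n=4, c=1, k=4))      # 1.0
```

The example shows why the two metrics measure different things: with one correct answer in four samples the average stays at 0.25, while the pass metric reaches 1.0 once all four samples are considered. That gap is what "broadening the reasoning boundary" refers to.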
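[99] SuS: Strategy-aware Surprise for Intrinsic Exploration cs.LG | cs.AI | cs.CL | cs.GT
Mark Kashirskiy, Ilya Makarov
TL;DR: The paper proposes Strategy-aware Surprise (SuS), a novel intrinsic exploration framework for reinforcement learning. It uses pre-post prediction mismatch as a novelty signal and introduces two complementary components, Strategy Stability (SS) and Strategy Surprise (SuS), combined into a reward function through learned weighting coefficients. On mathematical reasoning tasks with large language models, the method yields significant gains in both accuracy and solution diversity.
Details
Motivation: Traditional curiosity-driven methods rely solely on state prediction error. The paper aims to drive agent exploration more effectively by incorporating strategy-level stability and surprise.
Result: On mathematical reasoning tasks, Pass@1 improves by 17.4% and Pass@5 by 26.4% over baseline methods, while higher strategy diversity is maintained. Ablation studies show that removing either component degrades performance by at least 10%.
Insight: The core idea is combining the temporal consistency of the behavioral strategy (SS) with outcomes that are unexpected relative to the current strategy representation (SuS) into a synergistic intrinsic reward signal, offering a new approach to exploration in RL with large models.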
Abstract: We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent’s current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.
cs.MM
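[100] Subjective evaluation of UHD video coded using VVC with LCEVC and ML-VVC cs.MM | cs.CV
Naeem Ramzan, Muhammad Tufail Khan
TL;DR: The paper reports a subjective quality assessment of a multilayer video coding configuration in which Low Complexity Enhancement Video Coding (LCEVC) is applied as an enhancement layer on top of a Versatile Video Coding (VVC) base layer. Following the methodology established for MPEG multilayer video coding assessments, it compares UHD output reconstructed from an HD VVC base layer plus LCEVC enhancement against two references: upsampled VVC base layer decoding and multilayer VVC (ML-VVC). The test covers two bitrate-allocation operating points and uses the DCR method on 15 SDR and HDR sequences.
Details
Motivation: The goal is to assess the perceptual quality of a new multilayer configuration that combines an LCEVC enhancement layer with a VVC base layer, and to compare it with VVC upsampling and multilayer VVC in order to explore practical options for efficient video coding.
Result: Results are reported as Mean Opinion Scores (MOS) with 95% confidence intervals, comparing perceptual quality across coding approaches and operating points within the defined test scope. The abstract gives no specific values, though it describes the complete evaluation framework.
Insight: The contribution is a systematic subjective evaluation of a hybrid multilayer scheme that pairs the low-complexity LCEVC enhancement technology with the recent VVC standard. This provides a new technical path and empirical data for enhancement-layer design in future video coding standards such as VVC.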
Abstract: This paper presents the results of a subjective quality assessment of a multilayer video coding configuration in which Low Complexity Enhancement Video Coding (LCEVC) is applied as an enhancement layer on top of a Versatile Video Coding (VVC) base layer. The evaluation follows the same test methodology and conditions previously defined for MPEG multilayer video coding assessments, with the LCEVC enhancement layer encoded using version 8.1 of the LCEVC Test Model (LTM). The test compares reconstructed UHD output generated from an HD VVC base layer with LCEVC enhancement against two reference cases: upsampled VVC base layer decoding and multilayer VVC (ML-VVC). Two operating points are considered, corresponding to enhancement layers representing approximately 10% and 50% of the total bitrate. Subjective assessment was conducted using the Degradation Category Rating (DCR) methodology with twenty five participants, across a dataset comprising fifteen SDR and HDR sequences. The reported results include Mean Opinion Scores (MOS) with associated 95% confidence intervals, enabling comparison of perceptual quality across coding approaches and operating points within the defined test scope.
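The MOS-with-confidence-interval reporting described above can be illustrated with a minimal sketch. It assumes ratings on a 5-point scale and a Student-t interval; the function name `mos_with_ci` and the sample ratings are hypothetical, and the default critical value 2.064 is the two-sided 95% t value for 24 degrees of freedom (25 participants), not something stated in the paper.

```python
import math
import statistics


def mos_with_ci(scores, t_crit=2.064):
    """Return (MOS, half-width of the 95% confidence interval).

    scores: one rating per participant for a single test condition.
    t_crit: two-sided 95% Student-t critical value; 2.064 assumes
            25 participants (24 degrees of freedom). Pass another
            value for a different panel size.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(scores)
    sample_sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = t_crit * sample_sd / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings from 25 participants for one sequence and codec.
ratings = [5, 4, 4, 5, 3] * 5
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Two coding approaches are usually read as perceptually distinguishable at an operating point when their intervals do not overlap, which is why the interval is reported alongside the mean.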
eess.IV [Back]
[101] Cell Behavior Video Classification Challenge, a benchmark for computer vision methods in time-lapse microscopy eess.IV | cs.CV | q-bio.QM
Raffaella Fiamma Cabini, Deborah Barkauskas, Guangyu Chen, Zhi-Qi Cheng, David E Cicchetti
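TL;DR: This paper introduces the Cell Behavior Video Classification Challenge (CBVCC), a benchmark for evaluating computer vision methods that classify complex cell behaviors in time-lapse microscopy videos. The challenge gathered 35 methods built on three main strategies: classification of tracking-derived features, end-to-end deep learning that learns spatiotemporal features directly from the video sequence, and ensembles that combine tracking-derived with image-derived features.
Details
Motivation: Classifying complex cell behaviors in microscopy videos remains difficult. It requires computer vision methods that model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and handle multiple objects in the field of view.
Result: The challenge benchmarked 35 methods; the paper discusses the participants' results and compares the potential and limitations of each approach.
Insight: The contribution is a dedicated benchmark challenge to advance computer vision methods for studying cellular dynamics, together with a systematic comparison of tracking-based, end-to-end, and feature-ensembling strategies that can inform future method design.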
Abstract: The classification of microscopy videos capturing complex cellular behaviors is crucial for understanding and quantifying the dynamics of biological processes over time. However, it remains a frontier in computer vision, requiring approaches that effectively model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and account for multiple objects within the field of view. To this end, we organized the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches: classification of tracking-derived features, end-to-end deep learning architectures to directly learn spatiotemporal features from the entire video sequence without explicit cell tracking, or ensembling tracking-derived with image-derived features. We discuss the results achieved by the participants and compare the potential and limitations of each approach, serving as a basis to foster the development of computer vision methods for studying cellular dynamics.
[102] Multi-Objective Pareto-Front Optimization for Efficient Adaptive VVC Streaming eess.IV | cs.CV
Angeliki Katsenou, Vignesh V. Menon, Guoda Laurinaviciute, Benjamin Bross, Detlev Marpe
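TL;DR: This paper proposes a multi-objective Pareto-front optimization framework for efficient adaptive VVC streaming that jointly optimizes video quality, bitrate, and decoding time (used as a proxy for decoding energy). It introduces two strategies, a joint rate-quality-time Pareto front (JRQT-PF) and a joint quality-time Pareto front (JQT-PF), to build adaptive bitrate ladders that satisfy quality monotonicity constraints. Experiments show bitrate savings at the same video quality, together with lower decoding time in the main reported configurations, compared with fixed ladders and dynamic resolution selection.
Details
Motivation: Adaptive video streaming must balance several coding performance objectives, including bitrate, video quality, and decoding complexity, to be efficient and both content- and codec-dependent. Existing approaches such as fixed bitrate ladders may not reach the best trade-off.
Result: On a large-scale UHD dataset (Inter-4K), JQT-PF achieves 11.76% average bitrate savings and a 0.29% reduction in average decoding time at the same XPSNR, compared with a widely used fixed ladder. More aggressive configurations reach up to 27.88% bitrate savings at the cost of increased complexity. JRQT-PF offers a more controlled trade-off, with 6.38% bitrate savings and a 6.17% decoding time reduction. The framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks.
Insight: The main contribution is a multi-objective Pareto-front framework that makes decoding time (as an energy proxy) an explicit objective in bitrate ladder construction and adds quality monotonicity constraints to keep quality of experience consistent during adaptive streaming. More broadly, the work applies multi-objective optimization systematically to streaming parameter selection, and Pareto-front analysis yields flexible trade-off strategies for sustainable, high-quality adaptive streaming.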
Abstract: Adaptive video streaming has facilitated improved video streaming over the past years. A balance among coding performance objectives such as bitrate, video quality, and decoding complexity is required to achieve efficient, content- and codec-dependent, adaptive video streaming. This paper proposes a multi-objective Pareto-front (PF) optimization framework to construct quality-monotonic, content-adaptive bitrate ladders for Versatile Video Coding (VVC) streaming that jointly optimize video quality, bitrate, and decoding time, which is used as a practical proxy for decoding energy. Two strategies are introduced: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF), each exploring different tradeoff formulations and objective prioritizations. The ladders are constructed under quality monotonicity constraints during adaptive streaming to ensure a consistent Quality of Experience (QoE). Experiments are conducted on a large-scale UHD dataset (Inter-4K), with quality assessed using PSNR, VMAF, and XPSNR, and complexity measured via decoding time and energy consumption. Compared to a widely-used fixed ladder, the JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29% at the same XPSNR. More aggressive configurations yield up to 27.88% bitrate savings at the cost of increased complexity. The JRQT-PF strategy, on the other hand, offers more controlled tradeoffs, achieving 6.38% bitrate savings and 6.17% decoding time reduction. This framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks. The results confirm that PF optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities.
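The idea of a rate-quality-time Pareto front with a quality-monotonic ladder can be sketched in a few lines. This is a generic non-dominated filter, not the authors' JRQT-PF or JQT-PF algorithm; the `Encode` record, the function names, and the sample numbers are all hypothetical.

```python
from typing import List, NamedTuple


class Encode(NamedTuple):
    """One candidate encoding of a video (hypothetical record layout)."""
    bitrate_kbps: float   # lower is better
    quality: float        # e.g. XPSNR in dB; higher is better
    decode_time_s: float  # lower is better


def dominates(a: Encode, b: Encode) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.bitrate_kbps <= b.bitrate_kbps
                and a.quality >= b.quality
                and a.decode_time_s <= b.decode_time_s)
    better = (a.bitrate_kbps < b.bitrate_kbps
              or a.quality > b.quality
              or a.decode_time_s < b.decode_time_s)
    return no_worse and better


def pareto_front(encodes: List[Encode]) -> List[Encode]:
    """Keep only the encodes that no other encode dominates."""
    return [e for e in encodes
            if not any(dominates(other, e) for other in encodes)]


def monotonic_ladder(front: List[Encode]) -> List[Encode]:
    """Order by bitrate and keep a rung only if quality strictly increases."""
    ladder: List[Encode] = []
    for e in sorted(front, key=lambda e: e.bitrate_kbps):
        if not ladder or e.quality > ladder[-1].quality:
            ladder.append(e)
    return ladder


candidates = [
    Encode(1000, 34.0, 1.0),
    Encode(2000, 37.0, 1.5),
    Encode(2500, 36.5, 1.2),  # on the front, but breaks quality monotonicity
    Encode(3000, 36.0, 2.0),  # dominated by the 2000 kbps encode
    Encode(4000, 40.0, 2.5),
]
for rung in monotonic_ladder(pareto_front(candidates)):
    print(rung)
```

The quadratic filter is adequate for the handful of resolution and quantizer combinations a ladder typically considers; the monotonicity pass is what prevents a client that switches up in bitrate from seeing a drop in quality.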
cs.CR [Back]
[103] Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute cs.CR | cs.AI | cs.CL
David Brundage
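TL;DR: This study evaluates how LLM-generated synthetic narratives affect the safety of de-identification for veterinary electronic health records (vEHRs) under a fixed compute budget. Comparing synthetic augmentation with synthetic substitution, it finds that augmentation improves model performance and lowers the document-level identifier leakage rate, whereas replacing real data with synthetic data harms safety.
Details
Motivation: vEHRs contain privacy-sensitive identifiers, which limits their secondary use. Although PetEVAL provides a benchmark, the domain remains low-resource. The study asks whether LLM-generated synthetic data can safely improve de-identification models under a fixed compute budget.
Result: Experiments on a PetEVAL-derived corpus show that: under fixed-sample substitution, replacing real notes with synthetic ones monotonically increases identifier leakage; under compute-matched training, moderate synthetic mixing matches real-only performance, but high synthetic proportions degrade utility; and under epoch-scaled augmentation, PetBERT span-overlap F1 rises from 0.831 to 0.850 and document-level leakage falls from 6.32% to 4.02%.
Insight: The paper systematically evaluates two ways of using synthetic data (augmentation vs. substitution) for safety-critical veterinary de-identification under a fixed compute budget, and stresses that synthetic data works as a complement to real data rather than a replacement. Its central observation is that the gains largely reflect increased training exposure rather than intrinsic synthetic data quality, and that systematic synthetic-real mismatches in note length and label distribution align with the residual leakage.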
Abstract: Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving “template-only” regime where identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span-overlap F1 increased from 0.831 to 0.850 +/- 0.014, and leakage decreased from 6.32% to 4.02% +/- 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
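The primary safety metric above, document-level leakage rate, is defined as the fraction of documents with at least one missed identifier. A minimal sketch follows, assuming identifiers are half-open character spans and that a gold identifier counts as caught when any predicted span overlaps it; that matching rule and the function names are assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open character offsets: (start, end)


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def document_leakage_rate(gold: Sequence[List[Span]],
                          pred: Sequence[List[Span]]) -> float:
    """Fraction of documents with at least one gold identifier missed.

    gold[i] and pred[i] hold the annotated and predicted identifier
    spans of document i. Overlap-based matching is an assumption here.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same documents")
    if not gold:
        return 0.0
    leaked_docs = 0
    for gold_spans, pred_spans in zip(gold, pred):
        missed = any(
            not any(overlaps(g, p) for p in pred_spans)
            for g in gold_spans
        )
        leaked_docs += int(missed)
    return leaked_docs / len(gold)


# Three hypothetical notes: the first leaks one identifier.
gold = [[(0, 5), (10, 15)], [(3, 8)], []]
pred = [[(0, 5)], [(4, 6)], []]
print(document_leakage_rate(gold, pred))
```

Counting documents rather than spans is deliberately strict: one missed name is enough to expose a whole note, so a model with a high span-level F1 can still show a non-trivial leakage rate.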
[104] ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack cs.CR | cs.AI | cs.CL
Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
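TL;DR: This paper presents ReasAlign, a model-level safety alignment method against indirect prompt injection attacks. It defends by adding structured reasoning steps that analyze the user query, detect conflicting instructions, and preserve the continuity of the user's intended task.
Details
Motivation: LLM-driven agentic systems are highly vulnerable to indirect prompt injection, where malicious instructions embedded in external data can hijack agent behavior, so an effective safety alignment solution is needed.
Result: Across multiple benchmarks, ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, ReasAlign achieves 94.6% utility and only a 3.6% attack success rate (ASR), compared with 56.4% utility and 74.4% ASR for Meta SecAlign, which the authors describe as the best trade-off between security and utility.
Insight: The contribution is integrating structured reasoning steps into safety alignment and adding a test-time scaling mechanism in which a preference-optimized judge model scores reasoning steps and selects the best trajectory, improving the logic and accuracy of the defense.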
Abstract: Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user’s intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state-of-the-art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results could be found at https://github.com/leolee99/ReasAlign.
[105] Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay cs.CR | cs.CL
Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha
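TL;DR: This paper proposes Safety Self-Play (SSP), a method for strengthening the safety alignment of large language models (LLMs). A single LLM acts as both Attacker (generating jailbreaks) and Defender (refusing harmful requests) within a unified reinforcement learning loop. A Reflective Experience Replay mechanism with Upper Confidence Bound (UCB) sampling lets the model learn from past failure cases, so that attack and defense evolve autonomously and dynamically.
Details
Motivation: Current LLM safety alignment depends heavily on static external red teaming with fixed defense prompts or pre-collected adversarial datasets. The resulting defenses are rigid, overfit known patterns, and fail to generalize to novel, sophisticated threats. The paper addresses this limitation by enabling the model to red-team itself.
Result: Extensive experiments indicate that SSP autonomously evolves robust defense capabilities and significantly outperforms baselines trained on static adversarial datasets; the authors present this as a new benchmark for proactive safety alignment.
Insight: The core idea is the "be your own red teamer" paradigm: self-play within a single model, plus reflective experience replay, yields dynamic co-evolution of attack and defense strategies. This shifts safety alignment from passive, static defense toward proactive, adaptive defense. The UCB sampling strategy in the replay mechanism is also a useful reference for balancing exploration and exploitation while focusing on hard examples in reinforcement learning.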
Abstract: Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial “jailbreak” attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
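To make the UCB-based replay idea concrete, here is a minimal sketch of how an experience pool could be prioritized. The abstract does not give the scoring formula, so the one below (failure severity plus a standard UCB1-style exploration bonus) is an assumption, as are the `Experience` fields and function names.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    """One stored attack/defense episode (hypothetical record layout)."""
    prompt: str
    mean_reward: float   # Defender's average reward in [0, 1]; low = failure
    times_replayed: int


def ucb_score(exp: Experience, total_replays: int, c: float = 1.0) -> float:
    """Assumed score: (1 - mean_reward) + c * sqrt(ln(total) / count)."""
    if exp.times_replayed == 0:
        return math.inf  # always try an experience that was never replayed
    bonus = c * math.sqrt(math.log(total_replays) / exp.times_replayed)
    return (1.0 - exp.mean_reward) + bonus


def select_for_replay(pool: List[Experience], c: float = 1.0) -> Experience:
    """Pick the experience with the highest UCB score."""
    total = max(1, sum(e.times_replayed for e in pool))
    return max(pool, key=lambda e: ucb_score(e, total, c))


pool = [
    Experience("well-handled request", mean_reward=0.9, times_replayed=10),
    Experience("hard jailbreak", mean_reward=0.1, times_replayed=10),
    Experience("rarely seen case", mean_reward=0.5, times_replayed=1),
]
print(select_for_replay(pool).prompt)         # exploration bonus decides
print(select_for_replay(pool, c=0.0).prompt)  # pure exploitation of failures
```

The first term pulls sampling toward low-reward failures, and the second keeps rarely replayed episodes from being ignored; setting `c` to zero reduces the rule to always replaying the worst failure.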
cs.HC [Back]
[106] The Algorithmic Gaze: An Audit and Ethnography of the LAION-Aesthetics Predictor Model cs.HC | cs.AI | cs.CV
Jordan Taylor, William Agnew, Maarten Sap, Sarah E. Fox, Haiyi Zhu
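TL;DR: This paper audits and conducts a digital ethnography of the LAION Aesthetic Predictor (LAP), which is widely used to curate datasets for visual generative AI models. It finds systematic bias in the model's aesthetic scores: LAP disproportionately filters in images whose captions mention women while filtering out those mentioning men or LGBTQ+ people, and it favors realistic images by western and Japanese artists, thereby reinforcing the imperial and male gazes found in western art history.
Details
Motivation: Visual generative AI models are typically trained with a one-size-fits-all measure of aesthetic appeal, yet aesthetic judgment is closely tied to personal taste and cultural values. This raises the question of whose taste these models represent, and what biases and harms may follow.
Result: The audit finds that, in the LAION-Aesthetics dataset, LAP disproportionately filters out images with captions mentioning men or LGBTQ+ people while filtering in images with captions mentioning women. On two art datasets (approximately 330k images), LAP gives its highest scores to realistic images of landscapes, cityscapes, and portraits by western and Japanese artists.
Insight: The paper shows that an AI aesthetic evaluation model can inherit and reinforce social and historical biases (such as the imperial and male gazes), and that its training data, with aesthetic scores coming primarily from English-speaking photographers and western AI enthusiasts, is an important source of that bias. The contribution is combining an audit with digital ethnography to trace the origins of the algorithmic gaze; the authors call on developers to move away from prescriptive measures of aesthetics toward more pluralistic evaluation.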
Abstract: Visual generative AI models are trained using a one-size-fits-all measure of aesthetic appeal. However, what is deemed “aesthetic” is inextricably linked to personal taste and cultural values, raising the question of whose taste is represented in visual generative AI models. In this work, we study an aesthetic evaluation model, the LAION Aesthetic Predictor (LAP), that is widely used to curate datasets to train visual generative image models, like Stable Diffusion, and evaluate the quality of AI-generated images. To understand what LAP measures, we audited the model across three datasets. First, we examined the impact of aesthetic filtering on the LAION-Aesthetics Dataset (approximately 1.2B images), which was curated from LAION-5B using LAP. We find that the LAP disproportionately filters in images with captions mentioning women, while filtering out images with captions mentioning men or LGBTQ+ people. Then, we used LAP to score approximately 330k images across two art datasets, finding the model rates realistic images of landscapes, cityscapes, and portraits from western and Japanese artists most highly. In doing so, the algorithmic gaze of this aesthetic evaluation model reinforces the imperial and male gazes found within western art history. In order to understand where these biases may have originated, we performed a digital ethnography of public materials related to the creation of LAP. We find that the development of LAP reflects the biases we found in our audits, such as the aesthetic scores used to train LAP primarily coming from English-speaking photographers and western AI-enthusiasts. In response, we discuss how aesthetic evaluation can perpetuate representational harms and call on AI developers to shift away from prescriptive measures of “aesthetics” toward more pluralistic evaluation.
cs.MA [Back]
[107] Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts cs.MA | cs.AI | cs.CV | cs.LG
Philip Xu, Isabel Wagner, Eerke Boiten
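TL;DR: This paper proposes a Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents (image, text, name, and coordination) mitigate modality imbalance through structured message passing. The framework also incorporates multi-agent feature space name learning, a context exchange enhanced few-shot learning algorithm, and an adaptive dynamic balancing mechanism.
Details
Motivation: To address the degradation of cross-modal alignment, caused by modality imbalance, when vision-language models handle OOD concepts.
Result: Experiments on the VISTA-Beyond dataset show significant performance improvements in both few-shot and zero-shot settings, with 1-5% precision gains across diverse visual domains.
Insight: The contribution is bringing multi-agent cooperative learning to vision-language alignment, using structured message passing and adaptive balancing to improve robustness to OOD concepts. Treating the name as a separate learning agent and combining it with context exchange for few-shot learning is a novel architectural choice.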
Abstract: This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents, including image, text, name, and coordination agents, collaboratively mitigate modality imbalance through structured message passing. The proposed framework enables multi-agent feature space name learning, incorporates a context exchange enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset demonstrate that MACL significantly improves performance in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.
[108] Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems cs.MA | cs.AI | cs.CL
Xi Shi, Mengxin Zheng, Qian Lou
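TL;DR: This paper proposes LAMaS, a latency-aware orchestration framework for multi-agent systems that lowers inference latency through parallel execution and explicit optimization of the critical execution path. Across multiple benchmarks it reduces critical path length by 38-46% relative to the state-of-the-art baseline while maintaining or improving task performance.
Details
Motivation: Multi-agent systems incur high inference latency on complex reasoning tasks. Existing approaches mainly optimize task performance and inference cost and usually assume sequential execution, which makes it hard to control latency under parallel execution.
Result: Across multiple benchmarks, LAMaS reduces critical path length by 38-46% compared with the state-of-the-art baseline for multi-agent architecture search, while maintaining or improving task performance.
Insight: The contribution is a learning-based orchestration mechanism with explicit latency supervision under parallel execution, which optimizes the critical execution path to build lower-latency execution topology graphs and offers a new direction for designing efficient multi-agent systems.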
Abstract: Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at https://github.com/xishi404/LAMaS
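The quantity being optimized, critical path length, is the longest latency-weighted path through the execution graph, which is what bounds end-to-end latency when independent agents run in parallel. The sketch below computes it for a small directed acyclic graph; it illustrates the metric only, not the LAMaS controller, and the agent names and latencies are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set


def critical_path_length(latency: Dict[str, float],
                         deps: Dict[str, Set[str]]) -> float:
    """Longest latency-weighted path through an agent execution DAG.

    latency: agent name -> time that agent takes on its own.
    deps:    agent name -> set of agents whose output it must wait for.
    Every agent named in deps must also appear in latency.
    """
    graph = {node: deps.get(node, set()) for node in latency}
    finish: Dict[str, float] = {}
    # static_order() yields each node after all of its predecessors.
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[node]), default=0.0)
        finish[node] = start + latency[node]
    return max(finish.values(), default=0.0)


# Hypothetical topology: two agents run in parallel after a planner.
latency = {"plan": 1.0, "search": 2.0, "code": 3.0, "merge": 0.5}
deps = {"search": {"plan"}, "code": {"plan"}, "merge": {"search", "code"}}

print(critical_path_length(latency, deps))  # parallel execution
print(sum(latency.values()))                # fully sequential execution
```

Comparing the two printed values shows why a controller that assumes sequential execution optimizes the wrong target: shortening `search` changes the sequential sum but leaves the parallel latency untouched, because `code` is on the critical path.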
cs.AI [Back]
[109] ChartComplete: A Taxonomy-based Inclusive Chart Dataset cs.AI | cs.CV
Ahmad Mustapha, Charbel Toumieh, Mariette Awad
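TL;DR: This paper presents ChartComplete, a comprehensive chart dataset based on a taxonomy from the visualization community. It covers thirty chart types and targets the limited chart-type coverage of existing chart understanding benchmarks.
Details
Motivation: Benchmark datasets currently used to evaluate the chart understanding of multimodal large language models (MLLMs) are limited to a small set of chart types, which prevents a full assessment of model capability. The paper aims to close this gap with a more comprehensive chart dataset.
Result: The paper builds and releases ChartComplete, a collection of classified chart images covering thirty chart types. The dataset does not include a learning signal.
Insight: The main contribution is using a chart taxonomy from the visualization community to build a more inclusive dataset, giving chart understanding research a broader basis for benchmarking and supporting the evaluation and development of models across a wider range of chart types.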
Abstract: With advancements in deep learning (DL) and computer vision techniques, the field of chart understanding is evolving rapidly. In particular, multimodal large language models (MLLMs) are proving to be efficient and accurate in understanding charts. To accurately measure the performance of MLLMs, the research community has developed multiple datasets to serve as benchmarks. By examining these datasets, we found that they are all limited to a small set of chart types. To bridge this gap, we propose the ChartComplete dataset. The dataset is based on a chart taxonomy borrowed from the visualization community, and it covers thirty different chart types. The dataset is a collection of classified chart images and does not include a learning signal. We present the ChartComplete dataset as is to the community to build upon it.
[110] A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 cs.AI | cs.CL | cs.CV | cs.LG
Xingjun Ma, Yixu Wang, Hengyuan Xu, Yutao Wu, Yifan Ding
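TL;DR: This report presents an integrated safety evaluation of seven frontier LLMs and multimodal models, including GPT-5.2 and Gemini 3 Pro. The evaluation spans language, vision-language, and image generation settings and uses a unified protocol that integrates benchmark, adversarial, multilingual, and compliance evaluation. The results show a sharply heterogeneous safety landscape: GPT-5.2 is the most balanced and robust, other models show pronounced trade-offs across evaluation dimensions, and all models are substantially vulnerable under adversarial evaluation.
Details
Motivation: Large models are advancing quickly in reasoning, perception, and generation, but it remains unclear whether safety is improving at the same pace. Existing evaluation practice is fragmented and often limited to a single modality or threat model. The report aims to expose the real-world safety risks of frontier models through an integrated evaluation.
Result: The evaluations are aggregated into safety leaderboards and model safety profiles across multiple evaluation modes. GPT-5.2 shows consistently strong and balanced safety performance across evaluations; other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Models perform well on standard benchmarks but degrade substantially under adversarial evaluation. Text-to-image models achieve relatively stronger alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts.
Insight: The contribution is a unified safety evaluation protocol that integrates multiple evaluation dimensions and reveals how multidimensional and heterogeneous frontier model safety is, shaped jointly by modality, language, and evaluation scheme. The central message is that model safety is inherently multidimensional and that standardized safety evaluations are needed to assess real-world risk accurately and guide responsible model development and deployment.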
Abstract: The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances yield commensurate improvements in safety remains unclear, in part due to fragmented evaluation practices limited to single modalities or threat models. In this report, we present an integrated safety evaluation of 7 frontier models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We evaluate each model across language, vision-language, and image generation settings using a unified protocol that integrates benchmark evaluation, adversarial evaluation, multilingual evaluation, and compliance evaluation. Aggregating our evaluations into safety leaderboards and model safety profiles across multiple evaluation modes reveals a sharply heterogeneous safety landscape. While GPT-5.2 demonstrates consistently strong and balanced safety performance across evaluations, other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Both language and vision-language modalities show significant vulnerability under adversarial evaluation, with all models degrading substantially despite strong results on standard benchmarks. Text-to-image models achieve relatively stronger alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts. Overall, these results show that safety in frontier models is inherently multidimensional, shaped by modality, language, and evaluation scheme, underscoring the need for standardized safety evaluations to accurately assess real-world risk and guide responsible model development and deployment.
[111] Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models cs.AI | cs.CL
Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi
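TL;DR: This paper proposes Min-Seek, a sequential test-time scaling method that addresses the accuracy degradation and instability seen when existing training-free methods extend the reasoning of large reasoning models. By keeping only the KV pairs of one additional induced thought in the KV cache, it reasons efficiently and stably, and it can continue reasoning beyond the model's maximum context length.
Details
Motivation: Existing sequential test-time scaling methods suffer accuracy degradation and model instability as reasoning length grows, and they require reasoning length to be tuned. A method is needed that improves accuracy stably and efficiently without such tuning.
Result: Min-Seek significantly improves model accuracy across a variety of reasoning tasks, stabilizes the accuracy of sequential scaling, and removes the need for reasoning length fine-tuning. With a custom KV cache it can continue reasoning beyond the maximum context length and, under mild conditions, has linear computational complexity.
Insight: The contribution is Min-Seek itself: keeping only the KV pairs of one additional induced thought, and storing keys without position embeddings so they can be dynamically encoded contiguously before each new thought. This improves stability and efficiency and overcomes the model's context length limit, offering a new direction for training-free reasoning optimization.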
Abstract: Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model’s maximum context length, and under mild conditions has linear computational complexity.
[112] MATRIX AS PLAN: Structured Logical Reasoning with Feedback-Driven Replanning cs.AI | cs.CL
Ke Chen, Jiandian Zeng, Zihao Peng, Guo Li, Guangxue Zhang
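TL;DR: This paper proposes MatrixCoT, a structured chain-of-thought (CoT) framework with a matrix-based plan, designed to strengthen LLMs on complex symbolic logical reasoning. It normalizes and types natural language expressions, attaches explicit citation fields, and uses matrix-based planning to preserve global relations among reasoning steps, with a feedback-driven replanning mechanism for verification and correction.
Details
Motivation: Existing CoT prompting falls short on logical reasoning tasks that rely on symbolic expressions and strict deductive rules. Neuro-symbolic methods rely on external solvers that are format-sensitive and fail easily, while purely LLM-driven approaches lack structured representations and process-level error correction.
Result: Experiments on five logical reasoning benchmarks and five LLMs show that, without relying on external solvers, MatrixCoT improves robustness and interpretability on complex symbolic reasoning tasks while maintaining competitive performance.
Insight: The contribution is structuring the reasoning process as a verifiable matrix-based plan and adding feedback-driven replanning for self-correction, which gives LLMs a more stable, interpretable, solver-free approach to complex logical reasoning.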
Abstract: As knowledge and semantics on the web grow increasingly complex, enhancing Large Language Models (LLMs) comprehension and reasoning capabilities has become particularly important. Chain-of-Thought (CoT) prompting has been shown to enhance the reasoning capabilities of LLMs. However, it still falls short on logical reasoning tasks that rely on symbolic expressions and strict deductive rules. Neuro-symbolic methods address this gap by enforcing formal correctness through external solvers. Yet these solvers are highly format-sensitive, and small instabilities in model outputs can lead to frequent processing failures. LLM-driven approaches avoid parsing brittleness, but they lack structured representations and process-level error-correction mechanisms. To further enhance the logical reasoning capabilities of LLMs, we propose MatrixCoT, a structured CoT framework with a matrix-based plan. Specifically, we normalize and type natural language expressions, attach explicit citation fields, and introduce a matrix-based planning method to preserve global relations among steps. The plan becomes a verifiable artifact, making execution more stable. For verification, we also add a feedback-driven replanning mechanism. Under semantic-equivalence constraints, it identifies omissions and defects, rewrites and compresses the dependency matrix, and produces a more trustworthy final answer. Experiments on five logical-reasoning benchmarks and five LLMs show that, without relying on external solvers, MatrixCoT enhances both robustness and interpretability when tackling complex symbolic reasoning tasks, while maintaining competitive performance.
[113] TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks cs.AI | cs.CL | cs.LG
Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang
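TL;DR: TRIM is a hybrid inference method for multi-step reasoning tasks. It routes only critical steps to a larger model and lets a smaller model handle routine steps, improving inference efficiency and preventing cascading errors.
Details
Motivation: Current LLM routing methods assign an entire query to a single model and treat all reasoning steps as equal, which leaves a high risk of cascading errors and wastes compute. TRIM uses step-level routing so that the large model is called only on critical steps, improving the accuracy-cost trade-off.
Result: On MATH-500, TRIM's simple thresholding strategy is 5x more cost-efficient than prior routing methods, and more advanced policies match the strong model's performance with 80% fewer expensive-model tokens. On harder benchmarks such as AIME, cost efficiency improves by up to 6x.
Insight: The contribution is step-level routing that uses process reward models to identify erroneous steps. The core insight is that targeted interventions on critical steps can fundamentally change inference efficiency, and that step-level difficulty is a fundamental characteristic of reasoning.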
Abstract: Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted routing in multi-step reasoning tasks), which routes only critical steps (those likely to derail the solution) to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step-level: it uses process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold-based policy to more expressive policies that reason about long-horizon accuracy-cost trade-offs and uncertainty in step-level correctness estimates. On MATH-500, even the simplest thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while more advanced policies match the strong, expensive model’s performance using 80% fewer expensive model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6x higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step-level difficulty represents fundamental characteristics of reasoning.
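The simplest policy mentioned above, threshold-based step routing, can be sketched as a loop in which a small model proposes each step, a process reward model scores it, and low-scoring steps are regenerated by the large model while budget remains. The callables, their signatures, and the default threshold and budget are hypothetical stand-ins rather than the TRIM implementation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: each model maps (question, steps so far) to the
# next step, or None when the solution is finished.
StepFn = Callable[[str, List[str]], Optional[str]]
ScoreFn = Callable[[str, List[str], str], float]


def route_steps(question: str,
                small_step: StepFn,
                large_step: StepFn,
                prm_score: ScoreFn,
                threshold: float = 0.5,
                budget: int = 2,
                max_steps: int = 8) -> Tuple[List[str], int]:
    """Solve step by step, escalating only low-scoring steps.

    Returns the accepted steps and the number of large-model calls.
    """
    steps: List[str] = []
    large_calls = 0
    for _ in range(max_steps):
        candidate = small_step(question, steps)
        if candidate is None:
            break
        score = prm_score(question, steps, candidate)
        if score < threshold and large_calls < budget:
            # The step looks likely to derail the solution: redo it
            # with the larger model instead of accepting it.
            replacement = large_step(question, steps)
            large_calls += 1
            if replacement is not None:
                candidate = replacement
        steps.append(candidate)
    return steps, large_calls


# Stub models for illustration: the second small-model step is "bad".
def small(q: str, s: List[str]) -> Optional[str]:
    return f"s{len(s)}" if len(s) < 3 else None

def large(q: str, s: List[str]) -> Optional[str]:
    return f"L{len(s)}"

def score(q: str, s: List[str], step: str) -> float:
    return 0.2 if step == "s1" else 0.9

print(route_steps("toy question", small, large, score))
```

Because escalation happens before the bad step is appended, later small-model steps continue from the corrected prefix, which is the mechanism by which a single intervention can stop a cascade.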
[114] Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning cs.AI | cs.CL
Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou
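TL;DR: This paper proposes EAPO (Evidence-Augmented Policy Optimization), a reinforcement learning method for long-context reasoning, where sparse outcome rewards leave the evidence retrieval process largely unsupervised. It establishes an Evidence-Augmented Reasoning paradigm and uses Tree-Structured Evidence Sampling to validate that precise evidence extraction is the decisive bottleneck. At its core is a specialized RL algorithm in which a reward model computes a Group-Relative Evidence Reward, providing dense process supervision on evidence quality. An Adaptive Reward-Policy Co-Evolution mechanism iteratively refines the reward model to keep supervision accurate throughout training.
Details
Motivation: Although reinforcement learning (RL) has advanced LLM reasoning, its application to long-context scenarios is hindered by the sparsity of outcome rewards. Sparse rewards fail to penalize ungrounded "lucky guesses", leaving the critical needle-in-a-haystack evidence retrieval process largely unsupervised.
Result: Comprehensive evaluations across eight benchmarks show that EAPO significantly improves long-context reasoning performance compared with state-of-the-art baselines.
Insight: The paper's claimed contributions are the Evidence-Augmented Reasoning paradigm and a specialized RL algorithm that combines the Group-Relative Evidence Reward with Adaptive Reward-Policy Co-Evolution. In essence, it brings dense process supervision on evidence quality into RL training for long-context reasoning and uses co-evolution to keep the reward model discriminative, addressing both reward sparsity and imprecise supervision.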
Abstract: While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded “lucky guesses,” leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
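The abstract does not spell out how the Group-Relative Evidence Reward is computed. One common reading of "group-relative" is to standardize reward-model scores within the group of rollouts sampled for the same question, so each rollout is judged against its siblings. The sketch below shows only that normalization; the formula and the function name are assumptions, not EAPO's definition.

```python
import statistics
from typing import List


def group_relative_rewards(evidence_scores: List[float],
                           eps: float = 1e-6) -> List[float]:
    """Standardize evidence scores within one group of rollouts.

    evidence_scores: reward-model scores for the evidence extracted by
    each rollout of the same question. The result is positive for
    rollouts with better-than-average evidence and negative otherwise.
    """
    mean = statistics.fmean(evidence_scores)
    spread = statistics.pstdev(evidence_scores)
    # eps keeps the division defined when all rollouts score the same.
    return [(s - mean) / (spread + eps) for s in evidence_scores]


# Four hypothetical rollouts for one long-context question.
print(group_relative_rewards([0.2, 0.8, 0.5, 0.5]))
```

Under this reading, a rollout that reaches the right answer with poorly grounded evidence still receives a negative evidence signal when its siblings extracted better evidence, which is how dense supervision could penalize lucky guesses that an outcome reward alone would reinforce.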