Table of Contents
- cs.CL [Total: 52]
- cs.CV [Total: 61]
- cs.HC [Total: 1]
- q-bio.NC [Total: 2]
- q-bio.QM [Total: 1]
- cs.RO [Total: 2]
- cs.GR [Total: 3]
- cs.AI [Total: 13]
- cs.MA [Total: 1]
- cs.IR [Total: 3]
- eess.IV [Total: 3]
- cs.LG [Total: 14]
- cs.CR [Total: 3]
- cs.MM [Total: 1]
cs.CL
[1] Towards Open-Ended Discovery for Low-Resource NLP
Bonaventure F. P. Dossou,Henri Aïdasso
Main category: cs.CL
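TL;DR: This position paper calls for a paradigm shift in low-resource natural language processing (NLP): from reliance on static datasets to open-ended, interactive language discovery, in which systems learn new languages dynamically through dialogue.
Details
Motivation: Current NLP technology depends on massive pre-collected data and centralized infrastructure, which are out of reach for low-resource language communities. The paper argues for addressing this through a dynamic learning process built on human-machine collaboration.
Contribution: Proposes a framework grounded in joint human-machine uncertainty that combines the model's epistemic uncertainty with hesitation cues from human speakers to guide interaction, query selection, and memory retention.
Method: The framework uses the model's epistemic uncertainty together with confidence signals from human speakers to drive dynamic learning and interaction.
Result: The paper reports no experimental results; it presents a conceptual framework and directions for future research.
Insight: Future language technology should respect and empower communities, discovering and preserving linguistic diversity through cooperative learning, in line with the principles of human-centered AI.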
Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world’s linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.
[2] Context Matters: Comparison of commercial large language tools in veterinary medicine
Tyler J Poore,Christopher J Pinard,Aleena Shabbir,Andrew Lagree,Andre Telfer,Kuan-Chuen Wu
Main category: cs.CL
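TL;DR: The paper evaluates three veterinary-focused large language model (LLM) summarization tools on a standardized dataset of veterinary oncology records and finds that Product 1 performs best, particularly in factual accuracy and chronological order.
Details
Motivation: LLMs are increasingly used in clinical settings, but their performance in veterinary medicine remains underexplored. The paper aims to fill this gap.
Contribution: 1. Compares the performance of three commercial veterinary LLM tools; 2. Introduces a rubric-guided LLM-as-a-judge evaluation framework and verifies its reproducibility.
Method: Summaries generated for a standardized veterinary oncology record dataset are scored against a rubric covering five domains, with an LLM acting as judge; the evaluation is repeated across three independent runs.
Result: Product 1 achieves the highest median average score (4.61) and perfect median scores in Factual Accuracy and Chronological Order; the grading framework is highly reproducible (low standard deviation across runs).
Insight: Veterinary-specific commercial LLM tools matter, and LLM-as-a-judge evaluation is a scalable and reproducible assessment method.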
Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
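The summary statistics quoted in this abstract (median, IQR, and run-to-run standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the score values, the quartile method, and the use of the sample standard deviation are all assumptions.

```python
import statistics

def median_and_iqr(scores):
    """Return (median, interquartile range) for a list of per-record scores."""
    # "inclusive" quartiles are an assumption; the paper does not state its method.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q3 - q1

def run_to_run_std(run_averages):
    """Sample standard deviation of one product's average score across repeated runs."""
    return statistics.stdev(run_averages)

# Hypothetical per-record average scores for one product (not data from the paper).
record_scores = [4.2, 4.6, 4.8, 5.0, 3.9]
med, iqr = median_and_iqr(record_scores)

# Hypothetical average scores from three independent grading runs.
spread = run_to_run_std([4.60, 4.62, 4.61])
```

A small spread across runs is what the abstract reports as evidence that the LLM grader is reproducible.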
[3] EEFSUVA: A New Mathematical Olympiad Benchmark
Nicole N Khatibi,Daniil A. Radamovich,Michael P. Brenner
Main category: cs.CL
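TL;DR: EEFSUVA is a new mathematical Olympiad benchmark intended to assess the mathematical reasoning of large language models (LLMs) more holistically. Its problems are drawn from under-circulated regional and national Olympiads of Eastern Europe and the former Soviet Union; they are comparable in difficulty to the International Mathematical Olympiad (IMO) but call for more nonstandard techniques, so they probe reasoning more faithfully. Preliminary results show that even state-of-the-art LLMs decline notably on EEFSUVA.
Details
Motivation: Existing math benchmarks, which draw mainly on the IMO and related competitions, may suffer from data contamination and a narrow range of problem types, and may therefore overstate LLM mathematical reasoning. A broader, more faithful evaluation set is needed to gauge mathematical understanding accurately.
Contribution: Introduces the EEFSUVA benchmark, whose problems come from a distinctive source (regional and national Olympiads of Eastern Europe and the former Soviet Union), match the IMO in difficulty, and are less standardized in type, giving a more faithful test of mathematical reasoning.
Method: Curates the EEFSUVA dataset from under-circulated regional and national Olympiad problems of Eastern Europe and the former Soviet Union, then compares LLM performance on it with performance on existing Olympiad-style benchmarks.
Result: Preliminary results show that even state-of-the-art LLMs perform notably worse on EEFSUVA than on other Olympiad-style benchmarks, suggesting that existing evaluations may overstate mathematical reasoning ability.
Insight: Diverse, uncontaminated evaluation data matters for assessing LLM mathematical reasoning accurately. The EEFSUVA results suggest that future model development needs broader evaluation datasets to avoid overestimating ability.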
Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold medal Olympiad to graduate level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematical Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries from the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.
[4] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
Dimitar Peshevski,Kiril Blazhevski,Martin Popovski,Gjorgji Madjarov
Main category: cs.CL
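TL;DR: The paper proposes using LLMs to generate synthetic data and supervision in place of human-labeled data, improving the reranking performance of small models at low cost.
Details
Motivation: Reranking requires large amounts of human-labeled data, which is costly and scarce; LLMs rerank well but are computationally expensive, which limits practical deployment.
Contribution: 1) Proposes a pipeline of synthetic data and LLM supervision that needs no human labels; 2) Optimizes a small model with contrastive learning and the LCE loss.
Method: 1) Use an LLM to generate synthetic queries from a domain corpus; 2) Use an LLM classifier to label positive and hard-negative pairs; 3) Train a small model on the synthetic data with the LCE loss.
Result: Experiments on the MedQuAD dataset show that the approach significantly improves in-domain performance and generalizes well out of domain.
Insight: Using LLMs for data generation and supervision rather than direct inference reduces cost while maintaining performance, offering a new route for optimizing small models.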
Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
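The Localized Contrastive Estimation (LCE) loss named in this abstract is commonly written as a cross-entropy over one query's candidate group, consisting of one positive document and several hard negatives. The sketch below follows that standard form; the function name and the scores are hypothetical, and the paper's exact training setup may differ.

```python
import math

def lce_loss(scores, positive_index=0):
    """Negative log-softmax of the positive document's score within one query's group."""
    # Log-sum-exp with max subtraction for numerical stability.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[positive_index]

# Hypothetical reranker scores: the positive pair first, then two hard negatives.
group_scores = [2.0, 0.5, -1.0]
loss = lce_loss(group_scores)
```

The loss shrinks as the positive score rises relative to the hard negatives, which is the signal that fine-tunes the smaller reranker.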
[5] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
Dongjun Kim,Gyuho Shim,Yongchan Chun,Minhyuk Kim,Chanjun Park,Heuiseok Lim
Main category: cs.CL
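TL;DR: The paper introduces Benchmark Profiling, a framework that decomposes benchmark performance into ten cognitive abilities, exposing the multi-ability demands and potential limitations of current LLM benchmarks.
Details
Motivation: The ability labels attached to current LLM benchmarks (such as reasoning or commonsense) have not been systematically verified, so benchmark scores may overstate a model's real capability.
Contribution: Proposes the Benchmark Profiling framework, which combines gradient-based importance scoring with parameter ablation to quantify each cognitive ability's contribution to benchmark performance through the Ability Impact Score (AIS).
Method: Uses gradient-based importance scoring and targeted parameter ablation to compute the AIS, decomposing each benchmark into ten cognitive abilities.
Result: Most benchmarks draw on several abilities; datasets with similar labels rely on different ability mixtures; code-generation benchmarks reward broad, multi-skill improvement; abilities irrelevant to a task can hurt performance.
Insight: Benchmark Profiling reveals what benchmarks actually demand and offers a transparent tool for model interpretability and benchmark auditing.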
Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model’s success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
[6] LLMRank: Understanding LLM Strengths for Model Routing
Shubham Agrawal,Prasang Gupta
Main category: cs.CL
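TL;DR: LLMRank is a prompt-aware routing framework that extracts multifaceted features and signals from a lightweight proxy solver to choose among large language models (LLMs), balancing performance against efficiency.
Details
Motivation: With the rapid growth of diverse LLMs, selecting a model for each prompt that balances performance against latency and computational cost has become a key challenge.
Contribution: Proposes the LLMRank framework, which uses multifaceted features and a neural ranking model to improve model routing and to support interpretable routing decisions.
Method: Combines features such as task type, reasoning patterns, and complexity indicators to train a neural ranking model, trained and evaluated on the RouterBench dataset.
Result: LLMRank achieves up to 89.2% of oracle utility while providing interpretable feature attributions.
Insight: Multifaceted feature extraction and a hybrid ranking objective are key to efficient and transparent LLM deployment.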
Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.
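The "fraction of oracle utility" figure can be illustrated with a toy router that picks the model with the highest predicted utility for each prompt and is compared with an oracle that always picks the truly best model. This is only a sketch: the model names, utility values, and function names are hypothetical, and the paper's utility definition (which trades off performance and cost) is not reproduced here.

```python
def route(predicted_utility):
    """Pick the model name with the highest predicted utility for one prompt."""
    return max(predicted_utility, key=predicted_utility.get)

def fraction_of_oracle(prompts):
    """prompts: list of (predicted, actual) dicts mapping model name -> utility."""
    achieved = 0.0
    oracle = 0.0
    for predicted, actual in prompts:
        achieved += actual[route(predicted)]   # utility of the routed model
        oracle += max(actual.values())         # utility of the best possible choice
    return achieved / oracle

# Hypothetical predictions and outcomes for two prompts and two models.
prompts = [
    ({"small": 0.9, "large": 0.8}, {"small": 1.0, "large": 1.0}),
    ({"small": 0.6, "large": 0.4}, {"small": 0.0, "large": 1.0}),
]
ratio = fraction_of_oracle(prompts)
```

In this toy case the router misroutes the second prompt, so it recovers half of the oracle's utility.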
[7] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings
Ismam Nur Swapnil,Aranya Saha,Tanvir Ahmed Khan,Mohammad Ariful Haque
Main category: cs.CL
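TL;DR: The paper proposes GRPO++ and uses it in a multi-stage, resource-efficient training pipeline to build DermIQ-VLM, strengthening dermatological diagnostic reasoning in low-resource settings.
Details
Motivation: The structured reasoning ability of existing vision-language models (VLMs) in complex domains such as dermatology is limited by data scarcity and high computational cost.
Contribution: 1. Proposes GRPO++, a modified method that stabilizes the data-intensive GRPO framework; 2. Designs a multi-stage training pipeline (GRPO++ reasoning training, supervised fine-tuning, and DPO alignment); 3. Uses a knowledge-graph-based system as a scalable proxy for expert preference.
Method: 1. Strengthen reasoning with GRPO++; 2. Improve conversational ability through supervised fine-tuning; 3. Apply DPO alignment to correct factual errors.
Result: A preliminary evaluation on a dermatological dataset shows that the approach outperforms standard fine-tuning.
Insight: A resource-efficient, multi-stage training pipeline makes it feasible to develop reliable specialized VLMs in low-resource settings.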
Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist’s diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.
[8] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
Hu Wei,Ze Xu,Boyu Yang,Linlin Miao,Weiqi Zhai,Yihan Li,Zixuan Li,Zhijun Wang,Boya Wang,Jianwei Yu,Jialing Yuan,Xiaoyue Zhang,Cheng He,Minglei Chen,Zifan Zhang,Qianhui Li,Wei Wang,Xiang Xu
Main category: cs.CL
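TL;DR: The paper presents two complementary math benchmarks for multi-level evaluation: SKYLENAGE-ReasoningMATH (a structure-aware diagnostic set) and SKYLENAGE-MATH (a contest-style suite). The strongest model reaches 44% on the contest suite and 81% on the diagnostic set, revealing performance differences across difficulty levels and educational stages.
Details
Motivation: Large language models (LLMs) are approaching the ceiling on public math suites, which no longer separate frontier models. More discriminative math benchmarks are therefore needed.
Contribution: 1) Presents two complementary math benchmarks: SKYLENAGE-ReasoningMATH (a structure-aware diagnostic set) and SKYLENAGE-MATH (a contest-style suite); 2) Provides a more comprehensive assessment of mathematical ability through rich metadata (item length, numeric density, symbolic complexity) and stratified difficulty.
Method: 1) SKYLENAGE-ReasoningMATH contains 100 items with structured metadata; 2) SKYLENAGE-MATH contains 150 items spanning four stages from high school to doctoral and seven subjects; 3) Fifteen LLM variants are evaluated, with subject x model and grade x model performance analyzed.
Result: 1) On the contest suite, the strongest model reaches 44% accuracy, and performance declines from high school to doctoral level; 2) On the diagnostic set, the strongest model reaches 81% accuracy, but the hardest slice reveals robustness gaps between models.
Insight: Through stratified difficulty and structured metadata, the SKYLENAGE benchmarks offer a finer-grained and more challenging reference for evaluating mathematical reasoning. The results show that current LLMs still have substantial room to improve on higher-stage math tasks.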
Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject x model and grade x model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.
[9] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI
Seyma Yaman Kayadibi
Main category: cs.CL
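TL;DR: The paper introduces the Artificial Age Score (AAS) to quantify memory aging in generative AI. The AAS is grounded in entropy and information-overlap theory, and its behavior on semantic and episodic memory is examined experimentally.
Details
Motivation: Large language models (LLMs) are stable in semantic memory but tend to lose episodic memory when sessions are reset. The author seeks a theoretical framework for quantifying this memory aging.
Contribution: 1. Proposes the AAS, a log-scaled, entropy-informed metric of memory aging; 2. Proves mathematical properties of the AAS (boundedness, monotonicity); 3. Demonstrates experimentally that the AAS is useful for evaluating LLM memory.
Method: 1. Defines the AAS with redundancy treated as overlapping information that reduces the penalized mass; 2. Compares memory performance in stateless and persistent session phases over a 25-day bilingual study with ChatGPT-5; 3. Assumes a redundancy-neutral setting (R = 0), yielding conservative upper bounds on the AAS.
Result: In persistent sessions the model retains both semantic and episodic memory (the AAS approaches its minimum, indicating structural youth); after session resets, episodic memory is lost and the AAS rises sharply, indicating memory aging.
Insight: 1. The AAS can be used to evaluate memory degradation in AI systems; 2. The structure of memory aging in LLMs resembles that in humans; 3. The method builds on information theory and automata theory and is broadly applicable.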
Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann’s work on automata, Shannon’s theories of information and redundancy, and Turing’s behavioral approach to intelligence.
[10] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
Yisong Xiao,Aishan Liu,Siyuan Liang,Zonghao Ying,Xianglong Liu,Dacheng Tao
Main category: cs.CL
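TL;DR: The paper proposes ARGRE, a framework that detoxifies LLMs through autoregressive reward-guided representation editing, markedly reducing toxic generation while improving efficiency.
Details
Motivation: LLMs perform well across many tasks but are prone to generating toxic content. Existing test-time detoxification methods intervene imprecisely because they insufficiently explore the transition space between toxic and non-toxic outputs.
Contribution: Proposes the ARGRE framework, which explicitly models toxicity transitions in the latent representation space and achieves precise detoxification through an adaptive two-step editing process.
Method: ARGRE identifies non-toxic semantic directions, interpolates between toxic and non-toxic representations, and builds an autoregressive reward model that guides an adaptive two-step editing process.
Result: Experiments on 8 widely used LLMs show that ARGRE significantly outperforms baselines, reducing toxicity by 62.21% and inference time by 47.58%.
Insight: By explicitly modeling toxicity transitions and editing adaptively, ARGRE improves detoxification while preserving the core capabilities of the original model.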
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
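The interpolation step described in this abstract can be pictured as generating intermediate points between a toxic and a non-toxic representation. The sketch below assumes simple linear interpolation on small hypothetical vectors; the paper's actual procedure, vector dimensions, and function names are not given in the abstract.

```python
def interpolate(toxic, non_toxic, steps):
    """Return steps + 1 points moving linearly from the toxic to the non-toxic vector."""
    trajectory = []
    for k in range(steps + 1):
        lam = k / steps  # 0.0 at the toxic end, 1.0 at the non-toxic end
        point = [(1 - lam) * t + lam * n for t, n in zip(toxic, non_toxic)]
        trajectory.append(point)
    return trajectory

# Hypothetical 2-dimensional representations (real hidden states are far larger).
trajectory = interpolate([1.0, 0.0], [0.0, 2.0], steps=4)
```

Each intermediate point could then receive a graded label, which is one way sparse toxicity annotations become the dense training signal the abstract mentions.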
[11] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering
Jiaqing Xie
Main category: cs.CL
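TL;DR: The paper compares sparse autoencoders (SAEs) with activation-difference methods for language model steering. It finds that conventional top-k SAE latents may capture non-semantic information and therefore proposes focusing on the single most relevant latent (top-1). It also introduces a token-wise decaying steering strategy that addresses the degenerate outputs caused by constant SAE steering. Experiments show that SAEs outperform activation-difference methods on mathematical reasoning tasks.
Details
Motivation: The top-k latents of a conventional SAE may include redundant non-semantic features (such as punctuation), and constant SAE steering can produce degenerate outputs (such as repeated words), both of which limit effectiveness in language model steering.
Contribution: 1. Proposes focusing on the single most relevant SAE latent (top-1). 2. Introduces a token-wise decaying steering strategy to avoid degenerate outputs. 3. Shows experimentally that SAEs are superior on mathematical reasoning tasks.
Method: 1. Steer with the single most relevant SAE latent (top-1). 2. Apply a token-wise decaying steering strategy that adjusts steering strength dynamically. 3. Compare SAEs with activation-difference methods across several tasks.
Result: SAEs significantly outperform activation-difference methods on mathematical reasoning tasks and match them on IF-Eval. Steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning.
Insight: Focusing on a single latent (top-1) and decaying the steering strength dynamically captures semantic information more efficiently and avoids degenerate outputs, offering a new approach to language model steering.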
Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
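Token-wise decaying steering can be sketched as adding a steering direction to the hidden state with a strength that shrinks at each generated token. The exponential schedule, parameter names, and toy vectors below are assumptions; the abstract does not specify the decay function.

```python
def decayed_strength(alpha, decay, token_index):
    """Steering strength at a token position: alpha * decay ** token_index (assumed schedule)."""
    return alpha * decay ** token_index

def steer(hidden, direction, alpha, decay, token_index):
    """Add the scaled steering direction (e.g. a top-1 SAE decoder vector) to a hidden state."""
    scale = decayed_strength(alpha, decay, token_index)
    return [h + scale * d for h, d in zip(hidden, direction)]

# Hypothetical 2-dimensional hidden state and steering direction.
steered = steer([1.0, 1.0], [1.0, 0.0], alpha=4.0, decay=0.5, token_index=1)
```

With `decay=1.0` this reduces to constant steering, the setting the abstract associates with degenerate, repetitive outputs.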
[12] Let’s Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models’ Understanding of Sports
Punit Kumar Singh,Nishant Kumar,Akash Ghosh,Kunal Pasad,Khushi Soni,Manisha Jaishwal,Sriparna Saha,Syukron Abu Ishaq Alfarozi,Asres Temam Abagissa,Kitsuchart Pasupa,Haiqin Yang,Jose G Moreno
Main category: cs.CL
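TL;DR: The paper introduces CultSportQA, a benchmark for assessing language models' understanding of traditional sports across 60 countries, filling a gap left by evaluations that overlook regional and indigenous sports.
Details
Motivation: Existing language model evaluations focus on globally popular sports and overlook regional and traditional sporting cultures, so model performance in these areas goes unassessed.
Contribution: Introduces CultSportQA, a multilingual, multicultural sports question-answering benchmark with 33,000 multiple-choice questions covering 60 countries and 6 continents, categorized as history-based, rule-based, or scenario-based.
Method: Evaluates a range of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs) using zero-shot, few-shot, and chain-of-thought (CoT) prompting.
Result: CultSportQA establishes a new standard for assessing models' ability to understand and reason about traditional sports.
Insight: The study exposes the limitations of current language models in understanding and reasoning about regional sporting cultures and points the way for future multicultural AI evaluation.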
Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs’ understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI’s ability to understand and reason about traditional sports.
[13] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs
Ruyue Liu,Rong Yin,Xiangzhen Bo,Xiaoshuai Hao,Yong Liu,Jinwen Zhong,Can Ma,Weiping Wang
Main category: cs.CL
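TL;DR: SSTAG is a structure-aware self-supervised learning method that combines the strengths of LLMs and GNNs through a dual knowledge distillation framework, improving scalability and generalization on text-attributed graphs.
Details
Motivation: Graph learning models are typically trained on a single dataset, generalize poorly across graphs and tasks, and depend on large amounts of labeled data. SSTAG aims to address these problems by using text as a unified representation medium and combining the strengths of LLMs and GNNs.
Contribution: 1) Proposes SSTAG, which combines LLMs and GNNs through dual knowledge distillation; 2) Introduces an in-memory mechanism that stores typical graph representations to improve generalization; 3) Performs strongly in cross-domain transfer learning while reducing inference cost.
Method: 1) Use text as a unified representation medium; 2) Apply a dual knowledge distillation framework that co-distills LLMs and GNNs into structure-aware multilayer perceptrons (MLPs); 3) Store typical graph representations in an in-memory repository and align them with memory anchors.
Result: SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, scales well, and reduces inference cost.
Insight: Text can serve as a unified representation medium for graphs; combining the strengths of LLMs and GNNs markedly improves generalization and efficiency.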
Abstract: Large-scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross-domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph-structured data presents unique challenges due to its inherent heterogeneity, including domain-specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure-aware self-supervised learning method for Text-Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model’s generalization ability. Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.
[14] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning
You-Le Fang,Dong-Shan Jian,Xiang Li,Ce Meng,Ling-Shi Meng,Chen-Xu Yan,Zhi-Zhang Bian,Yan-Qing Ma
Main category: cs.CL
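TL;DR: LOCA (Logical Chain Augmentation) is a new framework that automatically completes missing logical steps and separates scientific principles from their derivations, substantially lowering the error rate of scientific question-answering datasets.
Details
Motivation: Large language models (LLMs) excel in general domains but are not reliable enough for scientific problem-solving. A major cause is the high error rate of scientific question-answering datasets, especially logical leaps and implicit reasoning within answers.
Contribution: By filling in missing steps in the logical chain and explicitly separating the scientific principle from its derivation, LOCA cleans scientific corpora automatically, typically reducing the error rate from as high as 20% to below 2% and providing high-quality data for training and evaluating scientific AI.
Method: LOCA uses an augment-and-review loop. At its core, it completes the logical chain in each answer and explicitly separates the scientific principle from its subsequent derivation steps, so that noisy datasets can be filtered automatically.
Result: Experiments on challenging scientific corpora show that LOCA typically reduces the error rate from as high as 20% to below 2%, markedly improving dataset quality.
Insight: LOCA shows that making logical chains explicit matters for the quality of scientific corpora, and it offers a scalable method for building reliable data for scientific AI.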
Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.
[15] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages
Trung Duc Anh Dang,Ferdinando Pio D’Elia
Main category: cs.CL
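TL;DR: The paper describes the GemDetox submission to TextDetox CLEF 2025, which enhances the massively multilingual Gemma-3 model with LoRA fine-tuning and few-shot prompting to perform text detoxification across 15 languages.
Details
Motivation: Social-media platforms evolve faster than the regulations meant to oversee them, creating demand for automated detoxification, particularly for low-resource languages.
Contribution: Proposes an approach that combines LoRA fine-tuning with prompting techniques and markedly improves multilingual text detoxification, especially on low-resource languages.
Method: The approach applies LoRA fine-tuning to Gemma-3 together with few-shot and Chain-of-Thought prompting, trained on a corpus of human-authored, machine-translated, and model-generated pairs. At inference, inputs are enriched with LaBSE-retrieved neighbors and explicit toxic-span annotations.
Result: The system ranks first on both high-resource and low-resource languages; few-shot examples and basic CoT prompting increase the joint score by 0.081 and 0.088, respectively.
Insight: Language resource status is the strongest predictor of performance (η² = 0.667), which underlines the importance of data augmentation and efficient fine-tuning for low-resource languages.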
Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance ($\eta^2$ = 0.667, p < 0.01).
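Filtering model-generated pairs by Jaccard thresholds can be sketched as keeping only pairs whose token overlap falls inside a band, so the rewrite is neither a verbatim copy nor unrelated to the source. Whitespace tokenization, the threshold values, and the example sentences are assumptions; the abstract states only that Jaccard thresholds were used.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase whitespace-token sets of two sentences."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def filter_pairs(pairs, low, high):
    """Keep (source, rewrite) pairs whose similarity lies within [low, high]."""
    return [(s, t) for s, t in pairs if low <= jaccard(s, t) <= high]

# Hypothetical candidate pairs (not data from the paper).
candidates = [
    ("this is a terrible awful idea", "this is a bad idea"),  # partial overlap: kept
    ("same text", "same text"),                               # unchanged: dropped
    ("hello there", "completely different words"),            # no overlap: dropped
]
kept = filter_pairs(candidates, low=0.2, high=0.9)
```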
[16] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Carlo Bono,Federico Belotti,Matteo Palmonari
Main category: cs.CL
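TL;DR: The paper proposes an efficient uncertainty estimation method for LLM-based entity linking in tabular data that reduces computational overhead by relying on a single generation.
Details
Motivation: In practice, LLM-based entity linking requires not only accurate predictions but also reliable uncertainty estimates. Conventional approaches based on multiple generations are computationally expensive, which limits their practicality, so a more efficient method is needed.
Contribution: Proposes a self-supervised uncertainty estimation method that works from a single LLM output using token-level features, substantially reducing computational cost.
Method: Analyzes token-level features in the single-shot LLM output and builds a self-supervised estimator on them, removing the need for multiple generations.
Result: Experiments across multiple LLMs show that the method detects low-accuracy outputs effectively at a fraction of the computational cost.
Insight: Estimating uncertainty from a single generation offers a practical way to integrate LLMs efficiently into real-world workflows.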
Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown State-of-The-Art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.
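Token-level features from a single generation can be illustrated with simple statistics over the per-token log-probabilities of the output. The three features below are generic examples chosen for illustration; the abstract does not list the features the authors use, and the input values are hypothetical.

```python
import math

def token_level_features(token_logprobs):
    """Summarize one generation's per-token log-probabilities into scalar features."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": mean_logprob,
        "min_prob": min(math.exp(lp) for lp in token_logprobs),  # least confident token
        "perplexity": math.exp(-mean_logprob),
    }

# Hypothetical log-probabilities for a two-token entity identifier.
features = token_level_features([math.log(0.5), math.log(0.5)])
```

Features like these could feed a lightweight estimator that flags likely-wrong links without sampling the LLM several times.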
[17] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models
Mariam Mahran,Katharina Simbeck
Main category: cs.CL
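TL;DR: This paper pairs sparse autoencoders (SAEs) with large language models (LLMs) to analyze and interpret internal model representations together with the deeper structures and biases in the training data. The authors train a GPT-style model on Jane Austen's novels and use SAEs to uncover sparse, interpretable features that reflect the novels' key themes and social concepts.
Details
Motivation: As LLMs are increasingly trained on massive, uncurated corpora, understanding their internal representations and what they learn from the data becomes more important. The paper aims to use SAEs as a scalable way to expose deeper structures and biases in both model behavior and training data.
Contribution: A new method that combines SAEs with LLMs to analyze and interpret internal representations as well as the structures, themes, and biases of the training data.
Method: The authors train a GPT-style Transformer on a corpus of Jane Austen's novels, then apply SAEs to hidden states across multiple layers, extracting sparse, interpretable features that reflect key themes of the corpus such as gender, class, and societal duty.
Result: SAEs extract sparse, semantically clear features from the LLM's hidden states. These features reflect the deeper structure of the training data and expose its social biases and thematic patterns.
Insight: SAEs show strong potential as a tool for understanding the complex internal representations of LLMs and the biases in their training data, offering a new path for large-scale corpus exploration and model interpretability.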
Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.
[18] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs
Can Lin,Zhengwang Jiang,Ling Zheng,Qi Zhao,Yuhang Zhang,Qi Song,Wangqiu Zhou
Main category: cs.CL
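TL;DR: The paper proposes RJE, a retrieval-judgment-exploration framework that makes knowledge graph question answering more efficient, lets small LLMs perform competitively, and substantially reduces LLM calls and token usage.
Details
Motivation: Existing methods depend either on retrieval quality or on proprietary LLMs, which limits the effectiveness and generality of KGQA.
Contribution: The RJE framework, which combines retrieval, judgment, and exploration stages and adds auxiliary modules that make small LLMs competitive.
Method: RJE has three stages: 1) retrieve refined reasoning paths; 2) judge whether the paths are sufficient; 3) conditionally explore additional evidence. Auxiliary modules help small LLMs work effectively.
Result: RJE outperforms baselines with proprietary LLMs, lets small open-source LLMs reach competitive results, and reduces LLM calls and token usage.
Insight: The framework balances retrieval quality against dependence on LLMs, making it a practical solution for KGQA.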
Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
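The retrieve, judge, then conditionally explore control flow described above can be sketched as follows. This is only an outline of the loop: the four callables, the `max_explore_steps` cap, and the toy stubs are hypothetical stand-ins, not the paper's retriever, judge, or auxiliary modules.

```python
from typing import Callable, List


def rje_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    explore: Callable[[str, List[str]], List[str]],
    answer: Callable[[str, List[str]], str],
    max_explore_steps: int = 2,  # hypothetical cap on exploration rounds
) -> str:
    """Retrieve paths, judge sufficiency, and explore only when needed."""
    paths = retrieve(question)
    steps = 0
    while not is_sufficient(question, paths) and steps < max_explore_steps:
        paths = paths + explore(question, paths)
        steps += 1
    return answer(question, paths)


# Toy stubs standing in for the retriever, the LLM judge, and the explorer.
def toy_retrieve(q):
    return ["Paris -capital_of-> France"]


def toy_sufficient(q, paths):
    return any("currency" in p for p in paths)


def toy_explore(q, paths):
    return ["France -currency-> Euro"]


def toy_answer(q, paths):
    return paths[-1].split("-> ")[-1]


result = rje_answer(
    "What currency is used in the country whose capital is Paris?",
    toy_retrieve, toy_sufficient, toy_explore, toy_answer,
)
```

Because exploration runs only when the judge rejects the retrieved paths, questions answerable from retrieval alone cost a single judgment call in this sketch.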
[19] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
Nathan Junzi Chen
Main category: cs.CL
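TL;DR: This paper uses zero-shot classification to evaluate the political leanings of large language models (LLMs), finds a pervasive liberal-authoritarian alignment, and discusses how this bias affects political discourse.
Details
Motivation: Generative artificial intelligence (GAI) is increasingly present in political discourse, but training data skews, human prejudice, and algorithmic flaws can give it a political slant. The paper aims to quantify this slant and its social impact.
Contribution: 1) A zero-shot classification approach combining ideological alignment, topicality, sentiment, and objectivity; 2) evidence of a liberal-authoritarian alignment across six mainstream LLMs; 3) a discussion of how algorithmic bias may influence public discourse.
Method: The author collects 1800 responses from six LLMs and feeds each into four separate fine-tuned classifiers that compute ideological alignment, topicality, sentiment, and objectivity.
Result: All evaluated LLMs show a pronounced liberal-authoritarian alignment, along with instances of reasoning supersessions and canned refusals.
Insight: Intrinsic political bias in these systems can seep into public discourse through human-computer interaction and distort the political landscape, manifesting as conformity or polarization depending on a region's existing socio-political structures.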
Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region’s pre-existing socio-political structures.
[20] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Yongchao Chen,Jiefeng Chen,Rui Meng,Ji Yin,Na Li,Chuchu Fan,Chi Wang,Tomas Pfister,Jinsung Yoon
Main category: cs.CL
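TL;DR: The paper proposes TUMIX, a tool-use mixture framework that runs multiple agents in parallel and improves reasoning by having them iteratively share and refine answers. Experiments show significant gains over existing methods on several benchmarks.
Details
Motivation: Tools such as code interpreters and search have greatly improved LLM reasoning, but practical guidance on how best to use them is lacking. TUMIX addresses how to combine textual reasoning, coding, and search effectively across diverse questions.
Contribution: A parallel multi-agent tool-use mixture framework with iterative sharing and refinement of answers, which improves reasoning accuracy, together with a study of the roles of agent diversity and automatic agent optimization.
Method: TUMIX runs multiple agents in parallel, each with a distinct tool-use strategy and answer path, and lets them iteratively share and refine answers. Experiments use Gemini-2.5-Pro and Gemini-2.5-Flash.
Result: TUMIX improves average accuracy by up to 3.55% over the best baseline across key reasoning benchmarks at near-equal inference cost. It can also halt refinement once confidence is sufficient, preserving performance at only 49% of the inference cost.
Insight: Agent diversity and quality are crucial to performance and can be improved further by using LLMs to auto-optimize agent designs. TUMIX scales along a trade-off between performance and cost.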
Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
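The share-and-refine loop with early halting can be sketched as below. This is a hedged illustration: the agents are plain functions rather than tool-using LLMs, and the majority-agreement ratio used as the stopping signal (`agree_threshold`) is an assumption standing in for whatever confidence criterion the paper uses.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# An agent maps (question, previous-round answers) to a new answer.
Agent = Callable[[str, Sequence[str]], str]


def tumix_loop(
    question: str,
    agents: List[Agent],
    max_rounds: int = 3,
    agree_threshold: float = 0.75,  # hypothetical stopping criterion
) -> Tuple[str, int]:
    """Run agents for several rounds; stop early once enough of them agree."""
    answers: List[str] = []
    for round_idx in range(max_rounds):
        # Every agent sees the same snapshot of the previous round's answers.
        answers = [agent(question, answers) for agent in agents]
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= agree_threshold:
            return top, round_idx + 1
    return Counter(answers).most_common(1)[0][0], max_rounds


# Toy agents standing in for text-only, code, and search strategies.
def text_agent(q, prev):
    return "42"


def code_agent(q, prev):
    return str(6 * 7)


def search_agent(q, prev):
    # Adopts the previous round's majority answer when one exists.
    return Counter(prev).most_common(1)[0][0] if prev else "41"


def late_agent(q, prev):
    return "42" if prev else "40"


final_answer, rounds_used = tumix_loop(
    "What is 6 times 7?", [text_agent, code_agent, search_agent, late_agent]
)
```

In this toy run the agents disagree in the first round and converge in the second, so the loop stops before reaching `max_rounds`.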
[21] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies
Maithili Kadam,Francis Ferraro
Main category: cs.CL
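TL;DR: TAG-EQA is a prompting framework that improves event question answering by converting causal event graphs into natural-language statements and injecting them into LLM inputs. Across nine prompting configurations, accuracy improves by 5% on average.
Details
Motivation: LLMs perform well on general language tasks but struggle with event question answering, especially causal or temporal reasoning, and need structured knowledge to support inference.
Contribution: The TAG-EQA framework, which embeds causal event graphs in LLM inputs, systematically analyzes when structured knowledge helps, and verifies gains under zero-shot, few-shot, and chain-of-thought prompting.
Method: Structured event graphs are converted into natural-language statements and combined with three prompting strategies (zero-shot, few-shot, chain-of-thought) and three input modalities (text-only, graph-only, text+graph).
Result: On the TORQUESTRA benchmark, accuracy improves by 5% on average, by up to 12% in zero-shot settings, and by up to 18% when graph-augmented chain-of-thought prompting is effective.
Insight: Causal event graphs can strengthen event reasoning without fine-tuning the LLM, offering a flexible way to encode structure in prompt-based QA.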
Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
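Converting graph relations into statements and assembling a prompt for each input modality might look like the sketch below. The relation names, sentence templates, and prompt layout are hypothetical; the abstract does not specify them.

```python
# Hypothetical relation names and templates; the abstract does not list them.
TEMPLATES = {
    "causes": "{src} causes {dst}.",
    "enables": "{src} enables {dst}.",
    "before": "{src} happens before {dst}.",
}


def graph_to_statements(edges):
    """Turn (source, relation, target) triples into one sentence each."""
    return [TEMPLATES[rel].format(src=src, dst=dst) for src, rel, dst in edges]


def build_prompt(passage, edges, question, modality="text+graph"):
    """Assemble a zero-shot prompt for one of the three input modalities."""
    parts = []
    if modality in ("text-only", "text+graph"):
        parts.append("Passage: " + passage)
    if modality in ("graph-only", "text+graph"):
        parts.append("Event relations: " + " ".join(graph_to_statements(edges)))
    parts.append("Question: " + question)
    parts.append("Answer:")
    return "\n".join(parts)


edges = [
    ("The storm", "causes", "the power outage"),
    ("The power outage", "before", "the evacuation"),
]
statements = graph_to_statements(edges)
prompt = build_prompt(
    "A storm hit the town overnight.", edges,
    "What happened after the power outage?",
)
```

Few-shot and chain-of-thought variants would prepend worked examples or a reasoning instruction to the same skeleton.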
[22] A-VERT: Agnostic Verification with Embedding Ranking Targets
Nicolás Aguirre,Ramiro Caso,Ramiro Rodríguez Colmeiro,Mauro Santelli,Joaquín Toranzo Calderón
Main category: cs.CL
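TL;DR: The paper proposes A-VERT, a structure-free evaluation method that matches target candidates against arbitrary language-model output using semantic embedding distances, giving robust classification at low compute cost.
Details
Motivation: Current automatic evaluation methods for language-model responses are either too expensive (such as LLM-as-a-Judge) or far from real-world conditions (such as string matching and logprobs). A cheaper and more realistic method is needed.
Contribution: A-VERT, a structure-free evaluation method based on semantic embedding distances that classifies language-model responses robustly at low compute cost.
Method: Semantic embedding distances are used to match target candidates against generated text, with embedding models of fewer than 10B parameters.
Result: Tested on three datasets and three language-model architectures, the method reaches a regression score of about 0.97 and an accuracy of about 96%.
Insight: Semantic embedding distance is an efficient, low-cost way to evaluate language-model responses and can replace existing expensive or unrealistic approaches.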
Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification rely on methods that are too expensive (e.g., LLM-as-a-Judge) or that are far from real-world conditions (string matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of fewer than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
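The idea of ranking target candidates by embedding similarity to a free-form response can be sketched as follows. To stay self-contained, the sketch uses a toy bag-of-words vector (`toy_embed`) in place of a neural embedding model; that substitution, the grouping of candidates by label, and the max-over-candidates scoring rule are all assumptions for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy bag-of-words vector; a real setup would call an embedding model."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return Counter(cleaned.split())


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(response, targets):
    """Score each label by its best-matching candidate and pick the top label.

    targets maps a label to a list of candidate strings.
    """
    emb = toy_embed(response)
    scores = {
        label: max(cosine(emb, toy_embed(c)) for c in candidates)
        for label, candidates in targets.items()
    }
    return max(scores, key=scores.get), scores


targets = {
    "correct": ["the answer is paris", "paris"],
    "wrong": ["the answer is london", "london", "the answer is rome"],
}
label, scores = classify("I believe the capital is Paris.", targets)
```

The response does not follow any fixed answer format, yet it still lands closest to a candidate in the "correct" group, which is the point of a structure-free check.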
[23] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning
Mengyu Wang,Sotirios Sabanis,Miguel de Carvalho,Shay B. Cohen,Tiejun Ma
Main category: cs.CL
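TL;DR: EQD is an expert question decomposition model that improves QA on domain quantitative reasoning through a two-step fine-tuning framework and a reward function. It needs only a small amount of training data and a single A100 GPU, and it outperforms existing methods.
Details
Motivation: Domain-specific quantitative reasoning remains hard for LLMs, especially in tasks that need expert knowledge and complex question answering. EQD aims to balance domain knowledge with computational efficiency.
Contribution: The EQD model, which uses a reward function to optimize sub-question generation, improves QA results, and outperforms existing methods with very modest compute requirements.
Method: A two-step fine-tuning framework in which a reward function measures how much generated sub-questions improve QA outcomes, using only a few thousand training examples and a single A100 GPU.
Result: On four financial-domain datasets, EQD improves QA performance by 0.6% to 10.5%, outperforming domain-tuned models and advanced prompting strategies.
Insight: In domain-specific QA, a single supporting question often helps more than detailed step-by-step guidance.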
Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.
[24] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning
Haochen You,Baojing Liu
Main category: cs.CL
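TL;DR: ReSSFormer is a recursive sparse structured Transformer that addresses long-context reasoning and computational efficiency through recurrent reasoning, adaptive sparse attention, and a self-organizing encoder structure.
Details
Motivation: Transformers still struggle with long-context reasoning, computational efficiency, and structural generalization, largely because of rigid layer stacking, dense attention, and reliance on positional encodings.
Contribution: ReSSFormer, which combines three innovations: the Recurrent Reasoning & Memory Unit (R2MU), the Adaptive Sparse Attention Module (ASAM), and the Self-Organizing Encoder Structure (SOES).
Method: Recurrent inference replaces layer stacking, ASAM provides efficient context selection, and SOES models latent token topology directly.
Result: Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer outperforms baselines under comparable compute and parameter budgets.
Insight: Recurrent reasoning and sparse attention improve the model's efficiency and structural flexibility, which suits long-context tasks.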
Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.
[25] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
Zhenwen Liang,Ruosen Li,Yujun Zhou,Linfeng Song,Dian Yu,Xinya Du,Haitao Mi,Dong Yu
Main category: cs.CL
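TL;DR: CLUE is a non-parametric verification method based on hidden-state clustering. It verifies model outputs by analyzing geometric features of the LLM's internal hidden states, has no trainable parameters, relies only on clusters built from past experience, and improves verification performance.
Details
Motivation: Existing approaches to verifying LLM outputs, such as text-based reward models or calibrated confidence, are limited by superficial cues or by poor model calibration. CLUE uses the rich information in hidden states directly as a unified basis for verification.
Contribution: 1) Shows that correctness is encoded as a geometric signature in LLM hidden states; 2) proposes CLUE, a training-free, non-parametric verifier that relies only on hidden-state clustering and nearest-centroid classification; 3) shows experimentally that CLUE outperforms existing baselines.
Method: CLUE summarizes each reasoning trace by a hidden-state delta and classifies correctness by its distance to the nearest centroid of "success" and "failure" clusters formed from past experience. No trainable parameters are involved.
Result: On AIME 24/25 and GPQA, CLUE improves top-1 and majority-vote accuracy (for example, from 56.7% to 70.0% on AIME 24) and outperforms LLM-as-a-judge and confidence-based baselines.
Insight: Hidden states carry rich semantic and confidence information, and their geometric separability provides a unified, efficient signal for verifying LLM outputs that avoids the limitations of earlier methods.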
Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model's internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by a hidden-state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
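A nearest-centroid verifier over hidden-state deltas can be sketched in a few lines. The abstract does not say which layers or positions are used to form the delta, so the sketch assumes the delta is simply the final hidden vector minus the initial one, uses Euclidean distance, and works on toy 2-dimensional vectors.

```python
import math


def delta(first_state, last_state):
    """Summarize a trace as the change between two hidden vectors (assumed)."""
    return [b - a for a, b in zip(first_state, last_state)]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def verify(trace_delta, success_deltas, failure_deltas):
    """Return True if the delta lies closer to the success centroid."""
    c_success = centroid(success_deltas)
    c_failure = centroid(failure_deltas)
    return distance(trace_delta, c_success) < distance(trace_delta, c_failure)


# Toy "experience": deltas from past traces with known outcomes.
success_deltas = [[1.0, 0.0], [0.8, 0.2]]
failure_deltas = [[-1.0, 0.0], [-0.9, -0.1]]

new_delta = delta([0.0, 0.0], [0.7, 0.1])
looks_correct = verify(new_delta, success_deltas, failure_deltas)
```

Nothing here is trained: the two centroids are plain averages of stored deltas, which matches the non-parametric framing of the method.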
[26] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation
Neal Gregory Lawton,Alfy Samuel,Anoop Kumar,Daben Liu
Main category: cs.CL
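TL;DR: This paper compares independent, joint, and two-phase fine-tuning strategies for retrieval-augmented generation (RAG). They reach similar generation quality at very different computational costs, and the best choice depends on whether the dataset includes context labels and whether a learning-rate grid search is required.
Details
Motivation: RAG is widely used for question answering, but there are many ways to fine-tune its embedding and generator models and no systematic comparison, which makes it hard to pick a strategy in practice.
Contribution: A systematic evaluation and comparison of independent, joint, and two-phase fine-tuning for RAG, showing the trade-offs between performance and cost under different conditions.
Method: Experiments compare the three fine-tuning strategies using EM and F1 as metrics and analyze how context labels and learning-rate grid search affect the outcome.
Result: All strategies yield similar improvements in generation quality (EM and F1), but their computational costs differ significantly. The optimal strategy depends on the presence of context labels and the need for a learning-rate grid search.
Insight: When choosing a fine-tuning strategy in practice, weigh computational cost against dataset characteristics such as context labels rather than committing to a single method.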
Abstract: Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required. (EMNLP 2025 Findings; published 20 Aug 2025, last modified 17 Sept 2025.)
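For readers unfamiliar with the EM and F1 metrics mentioned above, the sketch below shows one common way to compute them for QA answers. The abstract does not say how the authors compute them; the SQuAD-style normalization (lowercasing, dropping punctuation and English articles) is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
```

EM rewards only exact answers, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.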
[27] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering
Lovely Yeswanth Panchumarthi,Sai Prasad Gudari,Atharva Negi,Praveen Raj Budime,Harsit Upadhya
Main category: cs.CL
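TL;DR: The paper proposes RAG-BioQA, a framework that combines retrieval-augmented generation (RAG) with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers, and reports significant improvements over baselines.
Details
Motivation: The rapid growth of biomedical literature makes precise medical information hard to find, and existing systems focus on short answers that lack the comprehensive explanations needed for clinical decision-making.
Contribution: The RAG-BioQA framework, which combines BioBERT embeddings, FAISS indexing, and several re-ranking strategies (BM25, ColBERT, MonoT5), and generates evidence-based long-form answers with a fine-tuned T5 model.
Method: 1) Retrieve efficiently with BioBERT embeddings and FAISS indexing; 2) compare re-ranking strategies including BM25, ColBERT, and MonoT5; 3) generate answers with a fine-tuned T5 model.
Result: Experiments on the PubMedQA dataset show that RAG-BioQA significantly outperforms baseline models on BLEU, ROUGE, and METEOR.
Insight: The study shows the potential of retrieval-augmented generation in biomedicine, particularly for producing long-form, evidence-based answers.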
Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
[28] SoK: Measuring What Matters for Closed-Loop Security Agents
Mudita Khurana,Raunak Jain
Main category: cs.CL
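TL;DR: This paper proposes the CLASP framework and the CLC Score for measuring the capabilities of closed-loop security agents, filling the field's gap in unified evaluation standards.
Details
Motivation: Cybersecurity lacks a unified framework and method for evaluating closed-loop security agents, leaving research fragmented and real-world effectiveness hard to measure.
Contribution: The CLASP framework, which aligns the security lifecycle with agentic capabilities; the CLC Score, which quantifies loop closure and operational effectiveness; and an analysis of 21 representative works.
Method: CLASP links security tasks (such as reconnaissance and exploitation) to agentic capabilities (such as planning and tool use). The CLC Score combines the degree of loop closure with effectiveness.
Result: Applying CLASP to 21 systems exposes capability gaps, and the CLC Score provides a quantitative standard for closed-loop agents.
Insight: Evaluating closed-loop security agents requires considering both task completion and loop closure, and a unified framework and score help move the field forward.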
Abstract: Cybersecurity is a relentless arms race, with AI-driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across the security lifecycle, a principled method for evaluating closed-loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP, the Closed-Loop Autonomous Security Performance framework, which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception), providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed-loop benchmark. Together, CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed to advance function-level performance and to measure closed-loop security agents.
[29] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization
Yinhong Liu,Jianfeng He,Hang Su,Ruixue Lian,Yi Nian,Jake Vincent,Srikanth Vishnubhotla,Robinson Piramuthu,Saab Mansour
Main category: cs.CL
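TL;DR: MDSEval is the first meta-evaluation benchmark for multimodal dialogue summarization (MDS), intended to support the development of effective MDS models.
Details
Motivation: MDS has wide-ranging applications and needs robust automatic evaluation to reduce cost and human effort, but current evaluation methods lack an effective benchmark.
Contribution: 1) MDSEval, the first meta-evaluation benchmark for MDS; 2) eight key evaluation dimensions specific to MDS; 3) the MEKI filtering framework for ensuring data quality and richness.
Method: A new MEKI-based filtering framework is combined with human annotation to build benchmark data containing multimodal dialogues, summaries, and human judgments.
Result: Benchmarking reveals the limitations of existing evaluation methods in distinguishing summaries from advanced MLLMs and in handling various biases.
Insight: The work is the first to formalize evaluation dimensions specific to MDS, which offers useful guidance for improving how MDS models are evaluated.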
Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
[30] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang,Anzhou Zhang,Jian Dai
Main category: cs.CL
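TL;DR: FOR-Prompting is an asymmetric prompting protocol that achieves self-revision through a division of roles (Defender, Objectioner, Host) and improves model reasoning, with especially strong results on small models.
Details
Motivation: Existing reasoning protocols such as CoT and ToT lack an external questioning mechanism that elicits self-revision. FOR-Prompting fills this gap.
Contribution: 1) The FOR-Prompting protocol, which achieves self-revision through role-structured prompts; 2) a significant performance gain on GSM8K; 3) evidence that the protocol is effective for small models.
Method: A Defender proposes an answer, an Objectioner raises objections, and a Host enforces consistency and closure.
Result: On GSM8K, the protocol beats single-prompt baselines and matches CoT, and it improves the small model Llama3.2:1b by about 19%.
Insight: A role-structured prompting protocol can improve model performance without extra training and is especially suitable for small models and on-device use.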
Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe a gain of about 22 percentage points over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale models (approximately 19% accuracy improvement for Llama3.2:1b on the GSM8K task), highlighting promise for small models and on-device use. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
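The three-role turn structure can be sketched as a small loop. The role functions here are toy stand-ins for prompted LLM calls, and the Host's closing rule (stop when the Defender repeats its answer) and the `max_rounds` cap are assumptions, not the paper's criteria.

```python
def for_prompting(question, defender, objectioner, host, max_rounds=3):
    """Run Defender/Objectioner turns until the Host closes the dialogue."""
    answer = defender(question, None, None)
    transcript = [("defender", answer)]
    for _ in range(max_rounds):
        # The Objectioner only asks questions; it never proposes a fix.
        objection = objectioner(question, answer)
        transcript.append(("objectioner", objection))
        answer = defender(question, answer, objection)
        transcript.append(("defender", answer))
        if host(question, transcript):
            break
    return answer, transcript


# Toy roles standing in for prompted model calls.
def toy_defender(question, previous_answer, objection):
    return "5" if objection is None else "6"


def toy_objectioner(question, answer):
    return "Did you count the apple bought on Tuesday?"


def toy_host(question, transcript):
    # Hypothetical closure rule: stop once the Defender repeats its answer.
    defended = [text for role, text in transcript if role == "defender"]
    return len(defended) >= 2 and defended[-1] == defended[-2]


final, transcript = for_prompting(
    "How many apples does Sam have?", toy_defender, toy_objectioner, toy_host
)
```

Because the loop only passes text between roles, the same skeleton would work with any hosted or local model behind each function.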
[31] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Jiwan Chung,Neel Joshi,Pratyusha Sharma,Youngjae Yu,Vibhav Vineet
Main category: cs.CL
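TL;DR: The paper introduces MathLens, a benchmark that separates multimodal reasoning into subskills (perception, reasoning, and integration), and shows experimentally that different training approaches affect these subskills unevenly.
Details
Motivation: Evaluation of multimodal reasoning models relies mainly on aggregate accuracy, which hides where models actually improve. The paper aims to provide a finer-grained evaluation framework that isolates perception, reasoning, and integration.
Contribution: 1) The MathLens benchmark; 2) an analysis of how different training approaches affect each subskill; 3) evidence that integration is the bottleneck of multimodal reasoning.
Method: MathLens decomposes problems into perception, reasoning, and integration subtasks and provides annotations such as visual diagrams, textual descriptions, and controlled questions, supporting isolated or joint evaluation.
Result: 1) Reinforcement learning chiefly strengthens perception, while textual SFT improves perception indirectly; 2) reasoning improves only in tandem with perception; 3) integration is the weakest capacity; 4) robustness diverges across training approaches.
Insight: Improving multimodal reasoning requires targeted training of each subskill. Integration is the key bottleneck and deserves focused attention in future work.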
Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.
[32] Machine-interpretable Engineering Design Standards for Valve Specification
Anders Gjerver,Rune Frostad,Vedrana Barisic,Melinda Hodkiewicz,Caitlin Woods,Mihaly Fekete,Arild Braathen Torjusen,Johan Wilhelm Kluwer
Main category: cs.CL
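TL;DR: The paper presents a method for turning engineering design standards into modular, reusable, machine-interpretable ontologies, used for semantic reasoning and quality validation in valve selection.
Details
Motivation: Despite the clear ambition to digitalize industrial work, engineering design standards remain document-centric. The paper aims to make design standards machine-interpretable and automatically verifiable through semantic technology.
Contribution: 1) Knowledge from International Standards converted into modular ontologies; 2) ontology interoperability based on W3C standards and ISO IDO; 3) a valve selection case that validates the ontologies' usefulness and automation potential.
Method: Modelling patterns are used to create modular ontologies from International Standards and industry norms. Valve data sheets and manufacturer product types are represented in OWL, and semantic reasoning with design rules enables automated validation.
Result: The approach validates whether valve data sheets comply with industry standards and shows the potential of semantic reasoning in equipment selection.
Insight: An ontology-based approach can support the transition to digitized Smart Standards, and an interoperable ontology library for standards and industry norms is valuable for automating design processes.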
Abstract: Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create “functional location tags” as OWL individuals that become instances of OWL class Valve Data Sheet (VDS) specified valves. Similarly we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards. Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification. Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.
[33] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors
Dane Williamson,Yangfeng Ji,Matthew Dwyer
Main category: cs.CL
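TL;DR: The paper identifies a systematic failure mode of LLMs on math problems, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically simple but phrased in unfamiliar ways. Rephrasing the syntax, preserving semantics while reducing complexity, often leads to correct answers.
Details
Motivation: LLMs are strong at math problems but tend to fail when the syntax deviates from their training distribution. The authors want to identify the root cause of these failures.
Contribution: 1) Identifies a systematic LLM failure mode, syntactic blind spots; 2) shows that rephrasing syntax while preserving semantics and reducing complexity improves correctness; 3) introduces a syntactic complexity metric based on Dependency Locality Theory (DLT).
Method: 1) Rephrase problems, preserving semantics while simplifying syntactic structure; 2) quantify syntactic complexity with DLT; 3) test the relationship between syntactic complexity and failure rate on multiple datasets.
Result: Many reasoning errors stem from structural misalignment rather than conceptual difficulty, and syntactic interventions can mitigate them.
Insight: LLM reasoning is strongly affected by how a problem is phrased, and future models may need to be more robust to syntactic variation.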
Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
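To make the idea of a DLT-based complexity score concrete, the sketch below computes a simplified integration cost over a hand-annotated dependency parse. It is not the paper's metric: it assumes each dependency costs one unit per intervening discourse referent (approximated here as tokens tagged NOUN, PROPN, or VERB), and the parses are written by hand rather than produced by a parser.

```python
# Simplifying assumption: nouns, proper nouns, and verbs introduce referents.
REFERENT_TAGS = {"NOUN", "PROPN", "VERB"}


def dlt_score(tokens):
    """Sum, over all dependencies, the referents between head and dependent.

    tokens is a list of (word, pos_tag, head_index); the root has head -1.
    """
    total = 0
    for index, (_, _, head) in enumerate(tokens):
        if head < 0:
            continue
        lo, hi = sorted((index, head))
        total += sum(
            1 for j in range(lo + 1, hi) if tokens[j][1] in REFERENT_TAGS
        )
    return total


# "Tom has three apples": every dependent sits next to or near its head.
simple = [
    ("Tom", "PROPN", 1),
    ("has", "VERB", -1),
    ("three", "NUM", 3),
    ("apples", "NOUN", 1),
]

# "The apples that Tom bought yesterday number three": the subject is
# separated from its verb by a relative clause.
complex_sentence = [
    ("The", "DET", 1),
    ("apples", "NOUN", 6),
    ("that", "PRON", 4),
    ("Tom", "PROPN", 4),
    ("bought", "VERB", 1),
    ("yesterday", "ADV", 4),
    ("number", "VERB", -1),
    ("three", "NUM", 6),
]

simple_score = dlt_score(simple)
complex_score = dlt_score(complex_sentence)
```

Both sentences carry similar quantitative content, but the second scores higher because long-distance dependencies span more intervening referents, which is the kind of contrast the rephrasing experiments exploit.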
[34] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
Shicheng Liu,Kai Sun,Lisheng Fu,Xilun Chen,Xinyuan Zhang,Zhaojiang Lin,Rulin Shao,Yue Liu,Anuj Kumar,Wen-tau Yih,Xin Luna Dong
Main category: cs.CL
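TL;DR: SCRIBES is a reinforcement learning framework that exploits layout similarity across webpages to generate reusable extraction scripts, improving semi-structured data extraction quality and downstream task performance.
Details
Motivation: Semi-structured web content (HTML tables, lists, and so on) accounts for a large share of factual data, but existing methods either generalize poorly or are resource-intensive because they process each page separately. SCRIBES addresses both problems by generating scripts for efficient, scalable extraction.
Contribution: 1) Introduces the SCRIBES framework, which uses reinforcement learning with a layout-similarity reward signal to generate reusable extraction scripts; 2) Trains iteratively on CommonCrawl data to process large numbers of webpages efficiently.
Method: 1) Design an RL reward signal based on webpage layout similarity; 2) Generate reusable extraction scripts instead of processing pages one by one; 3) Train and refine iteratively on CommonCrawl data.
Result: SCRIBES outperforms strong baselines by over 13% in script quality and improves downstream question answering accuracy for GPT-4o by more than 4%, while reducing resource consumption.
Insight: Using layout similarity as a reward signal is an efficient strategy that improves the generalization and scalability of extraction. Training on synthetic annotations further improves the model.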
Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
[35] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
Ece Takmaz,Lisa Bylinina,Jakub Dotlacil
Main category: cs.CL
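TL;DR: The paper uses model merging to preserve language-only performance in multimodal models, mitigating their tendency to underperform on language-only tasks.
Details
Motivation: Existing vision-and-language models have many parameters and rely on enormous datasets, far exceeding the linguistic input children receive during language acquisition. The paper aims to build developmentally plausible multimodal models in low-resource settings while avoiding degraded language-only performance.
Contribution: Proposes model merging (weighted linear interpolation of parameters) to preserve language-only performance in multimodal models, and evaluates it within the BabyLM challenge.
Method: Train language-only and multimodal models on developmentally plausible datasets, then merge the parameters of the multimodal and language-only models by weighted linear interpolation.
Result: Multimodal models underperform on grammar-focused language-only benchmarks; merging alleviates this to some extent while maintaining multimodal performance.
Insight: Model merging is an effective way to balance multimodal and language-only performance, especially in developmentally plausible, low-resource settings.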
Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in language-only tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
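Weighted linear interpolation of parameters can be sketched in a few lines. The version below is a minimal illustration, assuming both models share the same parameter names and shapes and that each parameter is stored as a flat list of floats; the function name and the mixing weight `alpha` are hypothetical, not the authors' code.

```python
def merge_parameters(lang_params, mm_params, alpha):
    """Return alpha * language-only + (1 - alpha) * multimodal, per parameter.

    Both arguments map parameter names to flat lists of floats
    (an assumption made to keep the sketch dependency-free).
    """
    if lang_params.keys() != mm_params.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name, lang_w in lang_params.items():
        mm_w = mm_params[name]
        if len(lang_w) != len(mm_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * a + (1.0 - alpha) * b for a, b in zip(lang_w, mm_w)]
    return merged


# Toy weights, made up for illustration only.
lang = {"layer.weight": [1.0, 2.0]}
mm = {"layer.weight": [3.0, 6.0]}
halfway = merge_parameters(lang, mm, alpha=0.5)
```

With `alpha=1.0` the result is the language-only model and with `alpha=0.0` the multimodal one, so the weight directly trades off the two sets of abilities.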
[36] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu,Hao Xu,Xuhong Chen,Wei Chen,Yee Whye Teh,Ning Miao
Main category: cs.CL
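TL;DR: This survey gives a systematic introduction to reward models (RMs) and their use in LLM reasoning, covering architectures, training methods, and evaluation techniques, and examines their role in guiding generation, data synthesis, and RL-based finetuning.
Details
Motivation: Reward models play an important role in improving LLM reasoning, but a systematic analysis and a complete survey of their practical applications have been lacking.
Contribution: 1. Systematically introduces the fundamental concepts of reward models; 2. Surveys three key applications of RMs in LLM reasoning; 3. Discusses open questions and future research directions for RMs.
Method: Through literature review and empirical analysis, summarizes RM architectures, training methods, and evaluation techniques, and examines their concrete applications in LLM reasoning.
Result: Summarizes how RMs are applied in LLM reasoning in practice and identifies key open problems and directions for improvement.
Insight: RMs are not only an important tool for finetuning LLMs but can also improve output selection at inference time; future work should address RM selection, generalization, and evaluation.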
Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
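Selecting the best answer from multiple candidates at inference time is often called best-of-N selection. The sketch below shows the idea with a toy scoring function standing in for a learned reward model; `toy_reward` and the candidate strings are hypothetical.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(answer):
    # Hypothetical stand-in for a learned reward model:
    # it simply rewards answers that end with "4".
    return 1.0 if answer.strip().endswith("4") else 0.0


candidates = ["2 + 2 = 5", "2 + 2 = 4", "I am not sure"]
best = best_of_n(candidates, toy_reward)
```

In practice `reward_fn` would wrap a trained RM that scores a (prompt, response) pair; the selection logic itself stays this simple.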
[37] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning
Qi He,Cheng Qian,Xiusi Chen,Bingxiang He,Yi R. Fung,Heng Ji
Main category: cs.CL
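TL;DR: Veri-R1 is an online reinforcement learning framework in which an LLM interacts with a search engine and reward signals shape its planning, retrieval, and reasoning behaviors, substantially improving claim verification accuracy and evidence scores.
Details
Motivation: Existing claim verification methods rely mainly on prompt engineering or predesigned reasoning workflows and lack a unified training paradigm for improving the necessary skills. Online claim verification requires iterative evidence retrieval and reasoning, which calls for a more dynamic approach.
Contribution: Proposes Veri-R1, which uses online RL to improve an LLM's planning, retrieval, and reasoning, and reports substantial gains in verification accuracy and evidence quality.
Method: An online RL setup in which the LLM interacts with a search engine and receives reward signals that train it to adjust its planning, retrieval, and reasoning behaviors.
Result: Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often surpassing larger-scale models.
Insight: Online RL can improve LLM claim verification, and ablations show that the individual reward components matter to the outcome.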
Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM empowered claim verification.
[38] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models
Donghoon Jung,Jiwoo Choi,Songeun Chae,Seohyon Jung
Main category: cs.CL
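TL;DR: Taking a process-oriented approach grounded in narratology, the paper studies the authorial creativity of large language models (LLMs), proposes constraint-based decision-making as a lens for assessing creativity, and finds that LLMs emphasize Style over other story elements.
Details
Motivation: Existing evaluations of LLM creativity focus mostly on output quality and overlook the generation process. The paper fills this gap by analyzing LLMs' authorial creativity with a process-oriented approach.
Contribution: 1. Introduces a constraint-based decision-making framework for assessing authorial creativity; 2. Uses controlled prompting to assign authorial personas and reveal LLMs' creative preferences; 3. Finds that LLMs consistently value Style over Character, Event, or Setting.
Method: 1. Design experiments informed by narratology; 2. Assign authorial personas through controlled prompting; 3. Analyze the models' creative preferences and the reasoning they give.
Result: LLMs clearly favor Style over other story elements (Character, Event, Setting). Creative preferences and reasoning form distinctive profiles across models.
Insight: LLM creativity has quantifiable, systematic characteristics, and the process-oriented approach offers a new tool for analyzing AI authorial creativity.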
Abstract: Evaluations of large language models (LLMs)’ creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI’s authorial creativity.
[39] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
Siddhant Arora,Jinchuan Tian,Hayato Futami,Jiatong Shi,Yosuke Kashiwagi,Emiru Tsunoo,Shinji Watanabe
Main category: cs.CL
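TL;DR: The paper proposes SCoT, a streaming chain-of-thought framework for duplex spoken dialogue systems that alternates between processing blocks of user input and generating response blocks, addressing the semantic reasoning and latency shortcomings of prior approaches.
Details
Motivation: Conventional end-to-end spoken dialogue systems rely on voice activity detection (VAD) for turn-taking, but VAD cannot distinguish pauses from turn completions. Duplex systems remove this problem but lag in semantic reasoning and have complex architectures.
Contribution: Proposes the SCoT framework, which uses streaming chain-of-thought reasoning and blockwise processing to improve the semantic coherence and interaction latency of duplex dialogue systems.
Method: Alternates between processing fixed-duration blocks of user input and generating responses, and uses frame-level alignments to create intermediate targets (user transcripts and system responses) for each block.
Result: SCoT produces more coherent and interpretable responses than existing duplex methods, while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
Insight: Chain-of-thought reasoning can improve semantic coherence in streaming dialogue systems, and blockwise processing is key to low latency.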
Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
[40] The Disparate Impacts of Speculative Decoding
Jameson Sandler,Ahmet Üstün,Marco Romanelli,Sara Hooker,Ferdinando Fioretto
Main category: cs.CL
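TL;DR: The paper analyzes how the speed-up from speculative decoding is unevenly distributed across tasks, finds that it is smaller for under-fit or underrepresented tasks, and proposes a mitigation strategy that improves the fairness metric by 12% on average.
Details
Motivation: Speculative decoding has become a standard technique for reducing LLM decoding time, but its uneven speed-up across tasks may raise fairness concerns. The paper aims to quantify and address this disparity.
Contribution: 1) Shows that the speed-up from speculative decoding is not uniform across tasks; 2) Derives an analysis that quantifies the unfairness; 3) Designs a mitigation strategy that improves the fairness metric.
Method: Quantifies speed-up disparities through theoretical analysis, proposes a mitigation strategy informed by task characteristics, and validates it experimentally on several model pairs.
Result: The mitigation strategy improves the fairness metric by 12% on average.
Insight: Efficiency gains from speculative decoding can hide unfairness; strategies should be tuned to task characteristics to keep speed-ups equitable.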
Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, “drafter” model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed “unfairness” and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
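To see why speed-up depends on how well the drafter fits a task, consider a common simplified model of speculative decoding, used here as an assumption rather than as this paper's derivation: each drafted token is accepted independently with probability `alpha`, and the drafter proposes `gamma` tokens per verification pass. The expected number of tokens produced per pass is then (1 - alpha^(gamma+1)) / (1 - alpha).

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target-model verification pass.

    Assumes i.i.d. per-token acceptance with probability alpha and
    a draft length of gamma tokens (simplifying assumptions).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates for a well-fit and an under-fit task.
well_fit = expected_tokens_per_pass(alpha=0.9, gamma=4)
under_fit = expected_tokens_per_pass(alpha=0.5, gamma=4)
```

Under this model a task on which the drafter agrees with the target less often yields fewer tokens per pass, which is one way a disparity of the kind described above can arise.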
[41] RESTRAIN: From Spurious Votes to Signals – Self-Driven RL with Self-Penalization
Zhaoning Yu,Will Su,Leitian Tao,Haozhu Wang,Aashu Singh,Hanchao Yu,Jianyu Wang,Hongyang Gao,Weizhe Yuan,Jason Weston,Ping Yu,Jing Xu
Main category: cs.CL
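TL;DR: RESTRAIN is a self-penalizing reinforcement learning framework that improves reasoning models using unlabeled data, avoiding reliance on spurious majority votes and delivering large gains on reasoning tasks.
Details
Motivation: RL with human-annotated data is costly and falters on harder tasks. RESTRAIN aims for continual improvement without supervision by exploiting the model's own signals.
Contribution: Proposes the RESTRAIN framework, whose self-penalization mechanism converts the absence of gold labels into a useful learning signal and substantially improves reasoning.
Method: Uses signals from the model's entire answer distribution, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains; integrates seamlessly into policy optimization methods such as GRPO.
Result: Large gains on AIME25, MMLU_STEM, and GPQA-Diamond, with Pass@1 improving by up to 140.7% (on AIME25), nearly matching gold-label training.
Insight: RESTRAIN shows the potential of unsupervised RL for reasoning and offers a feasible path to continual improvement without labeled data.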
Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
[42] Learning to Reason for Hallucination Span Detection
Hsuan Su,Ting-Yao Hu,Hema Swetha Koppula,Kundan Krishna,Hadi Pouransari,Cheng-Yu Hsieh,Cem Koc,Joseph Yitan Cheng,Oncel Tuzel,Raviteja Vemulapalli
Main category: cs.CL
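TL;DR: The paper proposes RL4HS, a reinforcement learning framework that uses explicit reasoning to detect hallucinated spans in LLM output, outperforming pretrained reasoning models and supervised fine-tuning.
Details
Motivation: LLMs often generate hallucinations (unsupported content). Most prior work treats detection as a binary task, but many applications need the specific hallucinated spans identified, which is a multi-step decision process. The paper asks whether explicit reasoning helps.
Contribution: 1. Proposes RL4HS, which combines explicit reasoning with RL and optimizes detection with a span-level reward; 2. Introduces Class-Aware Policy Optimization to mitigate reward imbalance; 3. Shows experimentally that RL4HS outperforms pretrained reasoning models and supervised fine-tuning.
Method: 1. Evaluate pretrained models with and without CoT reasoning; 2. Build the RL4HS framework on Group Relative Policy Optimization plus Class-Aware Policy Optimization; 3. Run experiments on the RAGTruth benchmark (summarization, question answering, data-to-text).
Result: RL4HS surpasses the baselines across tasks, supporting the need for RL with span-level rewards.
Insight: Explicit reasoning can improve multi-step decision tasks; span-level rewards with RL are an effective route to hallucination span detection.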
Abstract: Large language models (LLMs) often generate hallucinations – unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
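One plausible form of a span-level reward is the character-level F1 overlap between predicted and gold hallucination spans. The digest does not state the exact reward RL4HS uses, so the sketch below is an assumption for illustration; spans are half-open `(start, end)` character offsets and the function name is hypothetical.

```python
def span_f1(pred_spans, gold_spans):
    """Character-level F1 between predicted and gold spans.

    Each span is a half-open (start, end) character offset pair.
    Returns 1.0 when both sides are empty (nothing to find, nothing flagged).
    """
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy spans, made up for illustration only.
partial = span_f1([(0, 10)], [(5, 15)])
```

A reward like this gives partial credit for partially correct spans, which a binary hallucinated-or-not label cannot do.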
[43] ARUQULA – An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities
Felix Brei,Lorenz Bühmann,Johannes Frey,Daniel Gerber,Lars-Peter Meyer,Claus Stadler,Kirill Bulert
Main category: cs.CL
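TL;DR: The paper presents ARUQULA, an LLM-based Text2SPARQL approach that combines the ReAct framework with knowledge graph exploration utilities to translate natural language questions into SPARQL queries step by step.
Details
Motivation: SPARQL has a high barrier to entry for users without a computer science background; LLMs can lower it by translating natural language to SPARQL.
Contribution: Proposes a Text2SPARQL method based on iterative exploration and execution, and analyzes its behavior and design rationale.
Method: Builds on SPINACH, an LLM-backed agent, combining the ReAct framework with knowledge graph exploration utilities to generate SPARQL queries over multiple iterations.
Result: Demonstrates the feasibility of the approach and, from an analysis of agent behavior, identifies directions for future improvement.
Insight: Combining iterative query generation with knowledge graph exploration utilities can improve performance on Text2SPARQL tasks.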
Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.
[44] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents
Lingzhong Dong,Ziqi Zhou,Shuaibo Yang,Haiyue Sheng,Pengzhou Cheng,Zongru Wu,Zheng Wu,Gongshen Liu,Zhuosheng Zhang
Main category: cs.CL
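TL;DR: The paper proposes a new evaluation framework for diagnosing reasoning-execution gaps in VLM-powered mobile-use agents, showing that such gaps are prevalent and potentially harmful.
Details
Motivation: Existing work overlooks whether a VLM agent's chain-of-thought (CoT) reasoning is consistent with the ground-truth action, so users may authorize harmful actions on the strength of plausible-looking reasoning.
Contribution: Introduces the Ground-Truth Alignment (GTA) metric, combines it with Exact Match (EM) to assess reasoning and execution accuracy jointly, and defines two types of reasoning-execution gap: Execution Gap (EG) and Reasoning Gap (RG).
Method: Jointly evaluates with GTA and EM to quantify the discrepancy between reasoning and execution, and measures how prevalent EG and RG are and how they vary with model size.
Result: Reasoning-execution gaps are prevalent, with execution gaps more common than reasoning gaps; scaling up model size reduces the overall gap, but sizable execution gaps remain.
Insight: VLM agents show systematic discrepancies between reasoning and execution; the framework provides a diagnostic tool for building more reliable mobile-use agents.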
Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
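The joint GTA/EM view can be expressed as a four-way classification of each step. The sketch below follows the EG and RG definitions in the abstract; the label strings, function names, and toy data are hypothetical, and the boolean inputs are assumed to come from separate GTA and EM judgments.

```python
from collections import Counter


def classify_step(gta_match, em_match):
    """Label one step from its GTA and EM outcomes.

    gta_match: the action implied by the CoT matches the ground truth.
    em_match:  the executed action matches the ground truth.
    """
    if gta_match and em_match:
        return "consistent"
    if gta_match:
        return "execution_gap"  # reasoning correct, execution fails
    if em_match:
        return "reasoning_gap"  # execution succeeds, reasoning conflicts
    return "both_wrong"


def gap_rates(steps):
    """Fraction of steps falling under each label that occurs."""
    counts = Counter(classify_step(g, e) for g, e in steps)
    total = len(steps)
    return {label: count / total for label, count in counts.items()}


# Toy (gta_match, em_match) pairs, made up for illustration only.
toy_steps = [(True, True), (True, False), (True, False), (False, True)]
rates = gap_rates(toy_steps)
```

Reporting the two gap rates separately is what lets an evaluation say which kind of mismatch dominates for a given model.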
[45] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
Xiaoyang Yuan,Yujuan Ding,Yi Bin,Wenqi Shao,Jinyu Cai,Jingkuan Song,Yang Yang,Hengtao Shen
Main category: cs.CL
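TL;DR: AMPO strengthens LLM reasoning through adaptive multi-teacher guidance in policy optimization, overcoming the limits of single-source guidance and improving reasoning diversity and performance.
Details
Motivation: Current RL methods rely on a single teacher or on self-exploration, which introduces model bias and restricts exploration, limiting reasoning diversity and performance.
Contribution: Proposes the AMPO framework, which adaptively draws on guidance from multiple teachers only when the policy model fails, and adds a comprehension-based selection mechanism to balance exploration and exploitation.
Method: AMPO uses adaptive multi-teacher guidance, with a “guidance-on-demand” scheme and comprehension-based selection of reasoning paths.
Result: Improves over the GRPO baseline by 4.3% on mathematical reasoning tasks and 12.2% on out-of-distribution (OOD) tasks, with significantly better Pass@k and more diverse exploration.
Insight: Several peer-sized teachers can match a single, more powerful teacher trained with more data, offering a more efficient and scalable path to stronger LLM reasoning.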
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This “guidance-on-demand” approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.
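The “guidance-on-demand” idea can be sketched as a small selection routine: teacher solutions are consulted only when none of the student's own rollouts is correct, and among correct teacher solutions the one the student is most likely to comprehend is preferred. This is a loose illustration, not the AMPO implementation; `is_correct`, `comprehension_score`, and the toy data are hypothetical placeholders.

```python
def select_guidance(student_rollouts, teacher_rollouts, is_correct,
                    comprehension_score, k=1):
    """Return up to k teacher rollouts to learn from, or [] if none are needed.

    Guidance is requested only when every student rollout is incorrect.
    """
    if any(is_correct(r) for r in student_rollouts):
        return []  # the student solved it; keep pure self-exploration
    correct_teacher = [r for r in teacher_rollouts if is_correct(r)]
    ranked = sorted(correct_teacher, key=comprehension_score, reverse=True)
    return ranked[:k]


# Toy verifier and comprehension proxy, made up for illustration only.
def ends_with_4(rollout):
    return rollout.endswith("4")


def shorter_is_easier(rollout):
    return -len(rollout)


teachers = ["a long derivation, so the answer is 4", "2 + 2 = 4", "answer: 5"]
no_help = select_guidance(["2 + 2 = 4"], teachers, ends_with_4, shorter_is_easier)
help_needed = select_guidance(["2 + 2 = 5"], teachers, ends_with_4, shorter_is_easier)
```

A real comprehension score would depend on the student model, for example how likely it finds the teacher's reasoning path, rather than on length.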
[46] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches
Ebtesam Jaber Aljohani,Wael M. S. Yafoo
Main category: cs.CL
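TL;DR: The paper combines deep embeddings with BERT-based approaches to improve Arabic-language cyberbullying detection; Bi-LSTM with FastText embeddings reaches 98% accuracy.
Details
Motivation: Methods for detecting Arabic-language cyberbullying are scarce, and the authors aim to fill this gap with deep learning.
Contribution: Builds an Arabic-language dataset, compares several deep learning models, and shows the strong performance of Bi-LSTM with FastText embeddings.
Method: Experiments with LSTM, Bi-LSTM, and BERT models, tested with several word embeddings, including FastText.
Result: Bi-LSTM with FastText embeddings performs best, reaching 98% accuracy.
Insight: The choice of word embedding strongly affects model performance; for non-English languages in particular, pretrained embeddings such as FastText may have an advantage.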
Abstract: Recent technological advances in smartphones and communications, including the growth of online platforms such as massive social media networks like X (formerly known as Twitter), endanger young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We also tested the LSTM and Bi-LSTM models with a novel pre-trained bidirectional encoder representations from transformers (BERT) model and then tested them again with a different experimental BERT model. LSTM-BERT and Bi-LSTM-BERT demonstrated a 97% accuracy. Bi-LSTM with FastText word embeddings performed even better, achieving 98% accuracy. As a result, the outcomes are generalize
[47] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation
Tianyi Jiang,Yi Bin,Yujuan Ding,Kainian Zhu,Fei Ma,Jingkuan Song,Heng Tao Shen
Main category: cs.CL
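TL;DR: The paper proposes a new metric, TECA, and a mechanism, CER, to address “overthinking” in large language model (LLM) reasoning and improve reasoning efficiency.
Details
Motivation: LLMs reason well on complex problems but often produce unnecessarily long reasoning on simple ones (overthinking), which hurts efficiency. The paper addresses this by adapting reasoning depth dynamically.
Contribution: 1. Proposes the Token Entropy Cumulative Average (TECA) metric, which measures the extent of exploration during reasoning; 2. Designs the Explore Briefly, Then Decide paradigm and the associated Cumulative Entropy Regulation (CER) mechanism, which determines dynamically when to stop reasoning.
Method: Uses TECA to quantify entropy over the reasoning steps and CER to adjust reasoning depth dynamically, avoiding redundant steps.
Result: Across diverse mathematical benchmarks, the method reduces average response length by up to 71% on simpler datasets while preserving problem-solving ability.
Insight: Dynamically regulating reasoning depth is an effective way to improve LLM efficiency, and TECA offers a new lens for quantifying the reasoning process.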
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make them difficult to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm – Explore Briefly, Then Decide – with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
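Reading the metric's name literally, TECA at step t can be taken as the running mean of the per-token Shannon entropy of the model's next-token distributions up to t. The sketch below implements that reading; it is an assumption based on the name, not the paper's exact definition, and the toy distributions are made up.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teca(step_distributions):
    """Running mean of token entropies: one value per generated token."""
    running_sum = 0.0
    averages = []
    for t, probs in enumerate(step_distributions, start=1):
        running_sum += token_entropy(probs)
        averages.append(running_sum / t)
    return averages


# Toy distributions: a confident step followed by a maximally uncertain one.
toy_teca = teca([[1.0, 0.0], [0.5, 0.5]])
```

A controller in the spirit of CER could watch this curve and stop reasoning once it settles, though the actual regulation mechanism is not specified in this digest.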
[48] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
Yaxin Du,Yuanshuo Zhang,Xiyuan Yang,Yifan Zhou,Cheng Wang,Gongyi Zou,Xianghe Pang,Wenhao Wang,Menglan Chen,Shuo Tang,Zhiyu Li,Siheng Chen
Main category: cs.CL
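TL;DR: InfoMosaic-Bench is a benchmark for evaluating tool-augmented agents on multi-source information seeking. It covers six domains and requires agents to combine general-purpose search with domain-specific tools. Experiments show that current LLM agents still fall short on such tasks.
Details
Motivation: Existing LLM agents rely heavily on open-web search, but web content is noisy and unreliable, and many tasks need domain-specific knowledge. The Model Context Protocol (MCP) lets agents access specialized tools, but how well they can use them is unclear.
Contribution: Proposes InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking; designs the InfoMosaic-Flow pipeline for task synthesis; and exposes the limitations of current LLM agents experimentally.
Method: Synthesizes tasks with InfoMosaic-Flow so that they require combining general-purpose search with domain-specific tools, and evaluates 14 state-of-the-art LLM agents.
Result: With web information alone, GPT-5 reaches only 38.2% accuracy; domain tools bring inconsistent benefits; 22.4% of failures stem from incorrect tool usage or selection.
Insight: Web information alone is insufficient and tool-augmented agents need improvement; the benefit of domain tools is unstable; LLM agents still struggle with tool handling.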
Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools – and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
[49] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective
Wen Yang,Junhong Wu,Chong Li,Chengqing Zong,Jiajun Zhang
Main category: cs.CL
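TL;DR: The paper studies reasoning generalization from a cross-linguistic perspective. It finds that the reasoning ability of English-centric LRMs transfers unevenly to other languages, uses parallel training to reveal a “First-Parallel Leap” and a “Parallel Scaling Law”, and identifies a “Monolingual Generalization Gap”.
Details
Motivation: To investigate how the reasoning ability of large reasoning models (LRMs) trained with Reinforcement Post-Training (RPT) generalizes in multilingual settings, in particular whether English-centric models extend effectively to other languages.
Contribution: 1. Proposes a cross-linguistic perspective and a metric for quantifying reasoning transfer; 2. Reveals the First-Parallel Leap and the Parallel Scaling Law; 3. Shows the limits of English-centric LRMs in multilingual generalization.
Method: Systematically evaluates English-centric LRMs on multilingual reasoning benchmarks and introduces a metric for cross-lingual transferability. Interventional studies analyze the effects of initial model, target language, and training paradigm.
Result: Cross-lingual transferability varies with initial model and target language; parallel training substantially improves performance and follows a power law in the number of parallel languages. English-centric LRMs fail to generalize fully to other languages.
Insight: The findings challenge the assumption that LRM reasoning mirrors human cognition and point toward developing more language-agnostic LRMs.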
Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: Does the reasoning capability achieved from English RPT effectively transfer to other languages? We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: First-Parallel Leap, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable Parallel Scaling Law, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as Monolingual Generalization Gap, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
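A power-law relationship of the form f(n) = a * n^b can be fitted by ordinary least squares in log-log space. The sketch below shows that fit on synthetic data; the functional form is an assumption (the abstract only says transfer follows a power law in the number of parallel languages), and the numbers are generated for illustration, not taken from the paper.

```python
import math


def fit_power_law(ns, ys):
    """Fit y = a * n**b by least squares on (log n, log y); returns (a, b)."""
    if len(ns) != len(ys) or len(ns) < 2:
        raise ValueError("need at least two (n, y) pairs of equal length")
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    b = sxy / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b


# Synthetic data generated from y = 2 * n**0.5 (illustration only).
languages = [1, 2, 4, 8]
scores = [2.0 * n ** 0.5 for n in languages]
a_hat, b_hat = fit_power_law(languages, scores)
```

Given such a fit, the gap between an observed monolingual score and the value the fitted curve predicts is the kind of quantity the abstract calls the Monolingual Generalization Gap.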
[50] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
Hala Sheta,Eric Huang,Shuyu Wu,Ilia Alenabi,Jiajun Hong,Ryker Lin,Ruoxi Ning,Daniel Wei,Jialin Yang,Jiawei Zhou,Ziqiao Ma,Freda Shi
Main category: cs.CL
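TL;DR: VLM-Lens is a toolkit for systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It supports extracting intermediate outputs from any layer of open-source VLMs through a unified, YAML-configurable interface.
Details
Motivation: VLM internals are complex and varied, and the lack of a unified tool for systematically analyzing intermediate outputs has limited understanding and improvement of their internal competence.
Contribution: Provides a unified toolkit that extracts intermediate outputs from 16 mainstream base VLMs and over 30 variants and supports user-friendly analysis and interpretation.
Method: Abstracts away model-specific complexity behind a YAML configuration interface, operates across different VLMs, and integrates easily with a range of interpretability and analysis methods.
Result: Two simple experiments reveal systematic differences in VLM hidden representations across layers and target concepts, demonstrating the toolkit's usefulness.
Insight: The toolkit's flexibility gives the community a means of studying VLM internals in depth, which should help accelerate model improvement.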
Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.
[51] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang
Main category: cs.CL
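TL;DR: F2LLM is a suite of embedding models fine-tuned directly on open-source data. At three sizes (0.6B, 1.7B, and 4B), it reaches state-of-the-art performance at low training cost.
Details
Motivation: Existing top-ranking embedding models require massive contrastive pretraining and costly synthetic data. F2LLM aims for strong embedding performance at low cost using open-source data.
Contribution: Introduces F2LLM, which ranks highly on the MTEB English leaderboard, and releases the models, training data, and code to support further research.
Method: Fine-tunes foundation models directly on 6 million open-source, non-synthetic query-document-negative tuples.
Result: F2LLM-4B ranks 2nd among models with approximately 4B parameters, and F2LLM-1.7B ranks 1st among models in the 1B-2B size range.
Insight: Fine-tuning on open-source data can yield strong embedding models at low cost and provides a reproducible baseline for future work.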
Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
[52] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
Raphael Tang,Crystina Zhang,Wenyan Li,Carmen Lai,Pontus Stenetorp,Yao Lu
Main category: cs.CL
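TL;DR: This paper questions the semantics of draws in arena-style evaluation of large language models (LLMs). It argues that draws more likely reflect query difficulty than equal model ability, and shows empirically that skipping rating updates for draws improves prediction accuracy.
Details
Motivation: Arena-style evaluation typically applies the Elo rating system, which treats a draw as evidence that two models are equally capable. The authors argue that draws may instead reflect how easy or objective the query is.
Contribution: An empirical challenge to the draw semantics assumed by existing rating systems, together with evidence that ignoring rating updates for draws improves outcome prediction accuracy.
Method: The authors analyze three real-world arena datasets, compare the prediction accuracy of four rating systems with and without rating updates for draws, and study how draws relate to query properties such as difficulty and objectivity.
Result: Ignoring rating updates for draws yields a 1-3% relative increase in prediction accuracy. Draws are also more common for very easy or highly objective queries.
Insight: Future rating systems should reconsider draw semantics and account for query properties in rating updates.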
Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the “battle” a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend future rating systems to reconsider existing draw semantics and to account for query properties in rating updates.
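The idea of ignoring rating updates for draws can be sketched with a plain Elo update. This is only an illustration: the K-factor of 32 and the 400-point scale are the conventional Elo constants, the `skip_draws` flag is a hypothetical name, and the abstract does not specify which four rating systems the paper studies or how they are configured.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome, k=32.0, skip_draws=True):
    """Return updated (r_a, r_b).

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    With skip_draws=True a draw leaves both ratings unchanged.
    """
    if skip_draws and outcome == 0.5:
        return r_a, r_b
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

win = elo_update(1500.0, 1500.0, 1.0)           # → (1516.0, 1484.0)
draw_skipped = elo_update(1500.0, 1400.0, 0.5)  # ratings unchanged
draw_applied = elo_update(1500.0, 1400.0, 0.5, skip_draws=False)
```

With `skip_draws=False`, the draw pulls the two ratings toward each other, which is the equalizing behavior the paper questions.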
cs.CV [Back]
[53] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration
Alessio Spagnoletti,Andrés Almansa,Marcelo Pereyra
Main category: cs.CV
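TL;DR: LVTINO is a zero-shot (plug-and-play) inverse solver for video restoration built on Video Consistency Models (VCMs). It delivers high-fidelity, temporally consistent video reconstruction.
Details
Motivation: Applying image latent diffusion models (LDMs) frame by frame to video restoration produces temporally inconsistent results. LVTINO addresses this by using VCMs, which explicitly capture temporal causality.
Contribution: The first zero-shot/plug-and-play inverse solver for video restoration with VCM priors, improving both temporal consistency and computational efficiency.
Method: Exploits the fast generation of VCMs and introduces a conditioning mechanism that needs no automatic differentiation, achieving high-fidelity reconstruction with only a few neural function evaluations.
Result: Across a diverse set of video inverse problems, LVTINO outperforms existing frame-by-frame methods in reconstruction quality and computational efficiency.
Insight: Video restoration benefits from explicitly modeling temporal dependencies, and VCMs offer an efficient, high-quality way to encode the prior.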
Abstract: Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.
[54] Image Generation Based on Image Style Extraction
Shuochen Chang
Main category: cs.CV
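TL;DR: This paper proposes a three-stage image generation method based on style extraction. A style encoder and a style projection layer give fine-grained control over style representations, and a Style30k-captions dataset is built for training.
Details
Motivation: Existing text-to-image models struggle to describe and control fine-grained styles precisely through natural language, and the guidance from style reference images is hard to align with conventional text-guided generation.
Contribution: (1) A style extraction method trained in three stages; (2) a style encoder and a style projection layer; (3) the Style30k-captions dataset.
Method: A style encoder extracts fine-grained style representations from a single style reference image. A style projection layer aligns them with textual representations and injects them into a pretrained generative model.
Result: Enables fine-grained, style-controlled image generation driven by text prompts.
Insight: Aligning extracted style with textual representations is key to improving style control in generative models.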
Abstract: Text-to-image generation models face a practical limitation: fine-grained styles cannot be precisely described and controlled in natural language, while the guidance information in stylized reference images is difficult to align directly with the textual conditions used in conventional text-guided generation. This study focuses on how to make full use of the generative capability of a pretrained generative model by obtaining fine-grained style representations from a single style reference image and injecting them into the model without changing the structural framework of the downstream generative model, so as to achieve fine-grained, controllable stylized image generation. We propose a style-extraction-based image generation method, trained in three stages, which uses a style encoder and a style projection layer to align style representations with textual representations, enabling fine-grained, text-prompt-based style-guided generation. In addition, this study constructs the Style30k-captions dataset, whose samples are triplets of an image, a style label, and a text description, to train the style encoder and style projection layer.
[55] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels
Shijia Feng,Michael Wray,Walterio Mayol-Cuevas
Main category: cs.CV
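TL;DR: This paper introduces EvoStruggle, a dataset that captures how struggle evolves during skill learning across multiple tasks and skill levels, and shows experimentally that temporal action localization models are effective for this task.
Details
Motivation: Existing datasets do not address how struggle evolves during skill acquisition, yet this evolution matters for optimizing learning and building assistive systems.
Contribution: (1) A collected and annotated dataset with 61.68 hours of video and 5,385 struggle segments; (2) a formulation of struggle determination as a temporal action localization task.
Method: The dataset covers 18 tasks grouped into four activities (for example, origami and knot tying), and participants repeat each task to capture skill evolution. Temporal action localization models are used to detect struggle.
Result: The models reach an average mAP of 34.56% when generalizing across tasks and 19.24% across activities, suggesting that struggle is a transferable concept.
Insight: Struggle shares common cues across tasks, but detection remains challenging and model performance needs further improvement.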
Abstract: The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user’s current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities – tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.
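Temporal action localization results such as the mAP figures above are typically scored by matching predicted segments to ground-truth segments using temporal intersection-over-union (tIoU). The following is a minimal sketch of that overlap measure. The function name and the example segments are illustrative assumptions, and the abstract does not state which tIoU thresholds the paper uses.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments, both given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and annotated struggle segments.
partial = temporal_iou((0.0, 10.0), (5.0, 15.0))   # 5 s overlap, 15 s union
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))    # no overlap
exact = temporal_iou((3.0, 7.0), (3.0, 7.0))       # identical segments
```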
[56] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs
Abu Bucker Siddik,Diane Oyen,Alexander Most,Michal Kucer,Ayan Biswas
Main category: cs.CV
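TL;DR: SPUS is a lightweight, parameter-efficient foundation model for solving a wide range of partial differential equations (PDEs). Built on a residual U-Net architecture, it outperforms existing models based on complex transformers.
Details
Motivation: Existing PDE foundation models are usually built on complex transformer architectures with high computational and parameter overhead. SPUS aims to offer a lighter, more efficient alternative.
Contribution: Proposes SPUS, a lightweight foundation model based on a residual U-Net that learns the underlying physics of PDEs efficiently through an auto-regressive pretraining strategy.
Method: Uses a residual U-Net architecture and an auto-regressive pretraining strategy that mimics the behavior of numerical solvers, learning from diverse PDE data.
Result: SPUS generalizes well to 6 unseen downstream PDEs while using fewer parameters and requiring little fine-tuning data.
Insight: Lightweight U-Net architectures show promise for PDE solving, balancing performance and efficiency.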
Abstract: We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs, which are primarily based on large, complex transformer architectures with high computational and parameter overhead, SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.
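The auto-regressive strategy described above, in which a model repeatedly maps the current state to the next one the way a numerical time-stepper does, can be sketched as a simple rollout loop. In this sketch the learned network is replaced by a hand-written stand-in (`diffusion_step`, one explicit finite-difference step of the 1D heat equation with periodic boundaries). That stand-in, the function names, and the initial state are assumptions for illustration and are not taken from the paper.

```python
def rollout(step, u0, n_steps):
    """Apply `step` auto-regressively: each output becomes the next input."""
    states = [u0]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

def diffusion_step(u, alpha=0.25):
    """Stand-in for a learned model: one explicit heat-equation step
    on a periodic 1D grid, u_i += alpha * (u_{i-1} - 2 u_i + u_{i+1})."""
    n = len(u)
    return [u[i] + alpha * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]

# Hypothetical initial state: a single spike on a 4-point grid.
states = rollout(diffusion_step, [0.0, 4.0, 0.0, 0.0], n_steps=3)
```

In a learned setting, `step` would be the trained network, and pretraining would compare each predicted state against the solver's reference trajectory.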
[57] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation
Shubhankar Borse,Farzad Farhadzadeh,Munawar Hayat,Fatih Porikli
Main category: cs.CV
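TL;DR: DisCo is a reinforcement learning framework that uses diversity constraints to optimize identity diversity in multi-human generation, markedly improving the distinctness and accuracy of the generated people.
Details
Motivation: Current text-to-image models duplicate faces, merge identities, and miscount individuals on multi-human prompts, which calls for a method that directly optimizes identity diversity.
Contribution: Proposes DisCo, the first framework to directly optimize identity diversity in multi-human generation via reinforcement learning, addressing a long-standing identity confusion problem in generative models.
Method: Uses Group-Relative Policy Optimization (GRPO) with a compositional reward that penalizes similar faces within an image and repeated identities across samples, while enforcing accurate person counts and preserving visual fidelity.
Result: On the DiverseHumans test set, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread, surpassing both open-source and proprietary methods.
Insight: DisCo offers a scalable, annotation-free reinforcement learning approach to multi-human generation and sets a new benchmark for the task.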
Abstract: State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
[58] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
Angel Daruna,Nicholas Meegan,Han-Pang Chiu,Supun Samarasekera,Rakesh Kumar
Main category: cs.CV
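TL;DR: GeoSURGE is a visual geo-localization method that combines a hierarchy of geographic embeddings with semantic segmentation, improving performance on multiple benchmark datasets.
Details
Motivation: Existing visual geo-localization methods rely mainly on visual features and overlook the hierarchical structure of geographic information and semantic cues, which limits localization accuracy.
Contribution: (1) A hierarchical geographic embedding representation; (2) an efficient method for fusing visual features with semantic segmentation maps; (3) clear performance gains on multiple benchmark datasets.
Method: (1) Model the world as a hierarchy of geographic embeddings; (2) fuse the query image's visual features with its semantic segmentation map; (3) align the visual representation with the geographic representation.
Result: Surpasses prior state-of-the-art methods and large vision-language models on 22 of 25 metrics across five benchmark datasets.
Insight: Combining hierarchical geographic embeddings with semantic information is central to improving visual geo-localization.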
Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.
[59] Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
Nilay Naharas,Dang Nguyen,Nesihan Bulut,Mohammadhossein Bateni,Vahab Mirrokni,Baharan Mirzasoleiman
Main category: cs.CV
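TL;DR: This paper proposes XMAS, a data selection method for efficiently fine-tuning large vision-language models (LVLMs). It removes redundant data by analyzing the trajectories of cross-modal attention matrices, improving training efficiency while preserving model performance.
Details
Motivation: Existing data selection methods perform poorly on LVLMs and cannot even outperform random selection. XMAS addresses this by using cross-modal alignment trajectories to select the most informative data.
Contribution: (1) The first data selection method based on trajectories of cross-modal attention matrices; (2) a proof that examples with similar attention matrices influence model parameters similarly during training; (3) validation on multiple datasets, with substantial data reduction and faster training.
Method: XMAS fine-tunes a small proxy LVLM, computes the trajectories of each example's cross-modal attention matrices, and clusters examples by these trajectories. It then samples a balanced subset from the clusters to remove redundant data.
Result: XMAS discards 50%-85% of the training data while fully preserving the performance of LLaVA-1.5-7B on 10 downstream tasks and speeding up training by 1.2x. Its data reduction is 30% higher than the best existing baseline.
Insight: Trajectories of cross-modal attention matrices are an effective signal for selecting high-quality training data. Clustering and balanced sampling improve training efficiency while preserving model performance.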
Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection across different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project’s website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
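The last step described above, drawing a balanced subset from clusters, can be sketched as a round-robin draw over cluster buckets. This sketch assumes the clustering of singular-value trajectories has already produced a cluster id per example. The function name `balanced_sample`, the round-robin rule, and the toy cluster assignment are assumptions, since the abstract does not spell out the exact sampling procedure.

```python
import random
from collections import defaultdict

def balanced_sample(cluster_ids, budget, seed=0):
    """Pick `budget` example indices, cycling over clusters so that
    small clusters are not crowded out by large (redundant) ones."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        buckets[cid].append(idx)
    for members in buckets.values():
        rng.shuffle(members)            # random order within each cluster
    chosen = []
    while len(chosen) < budget and any(buckets.values()):
        for cid in sorted(buckets):     # one pick per non-empty cluster per round
            if buckets[cid] and len(chosen) < budget:
                chosen.append(buckets[cid].pop())
    return chosen

# Toy assignment: a large cluster of 8 examples and a small cluster of 2.
cluster_ids = [0] * 8 + [1] * 2
picked = balanced_sample(cluster_ids, budget=4)
```

Here the small cluster contributes as many examples as the large one, which is the sense in which redundant examples from a big cluster get dropped.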
[60] Purrception: Variational Flow Matching for Vector-Quantized Image Generation
Răzvan-Andrei Matişan,Vincent Tao Hu,Grigory Bartosh,Björn Ommer,Cees G. M. Snoek,Max Welling,Jan-Willem van de Meent,Mohammad Mahdi Derakhshani,Floor Eijkelboom
Main category: cs.CV
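TL;DR: Purrception is a variational flow matching method for vector-quantized image generation. It combines continuous transport dynamics with explicit categorical supervision to improve training efficiency.
Details
Motivation: Existing image generation methods do not fully combine the strengths of continuous flow matching and discrete supervision. Purrception aims to fill this gap with variational flow matching for efficient vector-quantized image generation.
Contribution: (1) Proposes Purrception, the first application of variational flow matching to vector-quantized latent spaces; (2) combines continuous transport dynamics with discrete supervision, supporting uncertainty quantification and temperature-controlled generation; (3) shows efficient training and competitive performance on ImageNet-1k 256x256 generation.
Method: (1) Learn categorical posteriors over codebook indices; (2) compute velocity fields in the continuous embedding space; (3) combine geometric awareness with discrete supervision for efficient variational flow matching.
Result: On ImageNet-1k 256x256 generation, training converges faster than both continuous and discrete flow matching baselines, with FID scores comparable to state-of-the-art models.
Insight: Variational flow matching can bridge continuous transport and discrete supervision, improving training efficiency and performance in image generation.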
Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
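One way to read "categorical posteriors over codebook indices" combined with "velocity fields in the continuous embedding space" is sketched below: the model outputs logits over codes, the posterior mean embedding is the probability-weighted average of codebook vectors, and the velocity points from the current state toward that mean. The formula v = (mean - x_t) / (1 - t) is the usual form for a linear interpolation path and is an assumption here, as are all names and the toy codebook. The abstract does not give the exact parameterization.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def vfm_velocity(x_t, t, logits, codebook, temperature=1.0):
    """Velocity toward the posterior-mean code embedding (assumed linear path)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    probs = softmax(logits, temperature)
    mean = [sum(p * code[d] for p, code in zip(probs, codebook))
            for d in range(len(x_t))]
    return [(m - x) / (1.0 - t) for m, x in zip(mean, x_t)]

# Toy 2-entry codebook in a 2-D embedding space, with uniform posterior.
codebook = [[0.0, 0.0], [2.0, 2.0]]
velocity = vfm_velocity([0.0, 0.0], 0.5, [0.0, 0.0], codebook)
```

The `temperature` argument illustrates how sharpening or flattening the categorical posterior could control generation, in the spirit of the temperature-controlled generation mentioned in the abstract.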
[61] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging
Yuxuan Ou,Ning Bi,Jiazhen Pan,Jiancheng Yang,Boliang Yu,Usama Zidan,Regent Lee,Vicente Grau
Main category: cs.CV
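TL;DR: The paper proposes AortaDiff, a unified multitask diffusion framework that generates synthetic contrast-enhanced CT images from non-contrast CT scans while simultaneously segmenting the aortic lumen and thrombus. By combining conditional diffusion models with multi-task learning, it avoids the error accumulation of multi-stage pipelines and improves performance and clinical utility.
Details
Motivation: Contrast-enhanced CT (CECT) for assessing abdominal aortic aneurysms (AAA) requires iodinated contrast agents, which carry risks of nephrotoxicity, allergic reactions, and environmental harm. Existing deep learning methods use multi-stage pipelines, which accumulate errors and fail to exploit shared semantic and anatomical structure.
Contribution: (1) A unified multitask diffusion framework that jointly optimizes image synthesis and anatomical segmentation; (2) no initial predictions required, with encoder and decoder parameters shared across tasks; (3) a semi-supervised training strategy that handles the missing labels common in clinical data.
Method: Combines conditional diffusion models (CDM) with multi-task learning, using shared parameters and joint optimization to train CECT synthesis and segmentation end to end. A semi-supervised strategy lets the model learn from scans with missing labels.
Result: On data from 264 patients, the model outperforms single-task and multi-stage methods. It achieves a PSNR of 25.61 dB for image synthesis, a lumen Dice score of 0.89, and a thrombus Dice score of 0.53, leading to more accurate clinical measurements.
Insight: Joint multitask optimization and parameter sharing improve performance, the semi-supervised strategy makes the model more robust to clinical data, and the approach suggests a route to reducing contrast agent use.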
Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
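For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of the standard definitions of the Dice score (segmentation overlap) and PSNR (image fidelity, in dB). The flat-list inputs, the function names, and the `max_val` default are illustrative assumptions, and the paper's own evaluation code may differ in normalization and masking.

```python
import math

def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for flat intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(img, ref)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# Toy masks and intensities, not real CT data.
dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])   # one shared pixel out of 2 + 1
quality = psnr([0.0, 0.0], [0.1, 0.1])          # MSE of about 0.01
```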
[62] From Videos to Indexed Knowledge Graphs – Framework to Marry Methods for Multimodal Content Analysis and Understanding
Basem Rizk,Joel Walsh,Mark Core,Benjamin Nye
Main category: cs.CV
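TL;DR: This paper presents a framework that combines multimodal content analysis with pre-trained models to convert videos into a queryable, temporal, semi-structured knowledge graph that supports continual learning.
Details
Motivation: Multimodal content analysis is complex and computationally expensive. Existing pre-trained models are mostly applied to static data, and combining them to process video remains challenging.
Contribution: A framework for efficient prototyping that converts videos into temporal semi-structured data and then into a queryable knowledge graph supporting continual learning.
Method: Builds a pipeline from pre-trained models that converts videos into semi-structured data and then into a knowledge graph, allowing new domain knowledge to be added dynamically.
Result: Achieves video-to-knowledge-graph conversion, producing a representation that is queryable and supports continual learning.
Insight: Combining pre-trained models with knowledge graph techniques can handle multimodal data effectively and support dynamic knowledge updates.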
Abstract: Analysis of multi-modal content can be tricky and computationally expensive, and can require a significant amount of engineering effort. Much work exists on applying pre-trained models to static data, yet fusing these open-source models and methods with complex data such as videos remains relatively challenging. In this paper, we present a framework that enables efficient prototyping of pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further into a frame-level indexed knowledge graph representation that is queryable and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
[63] WALT: Web Agents that Learn Tools
Viraj Prabhu,Yutong Dai,Matthew Fernandez,Jing Gu,Krithika Ramakrishnan,Yanqi Luo,Silvio Savarese,Caiming Xiong,Junnan Li,Zeyuan Chen,Ran Xu
Main category: cs.CV
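TL;DR: WALT is a framework that learns website functionality through reverse engineering and turns it into reusable tools, reducing the agent's reliance on step-by-step reasoning.
Details
Motivation: Current web agents based on UI interaction and LLM reasoning are brittle under dynamic layouts and long-horizon tasks, whereas humans work more efficiently through high-level operations such as search and filter. WALT aims to mimic this behavior.
Contribution: By wrapping website functionality as tools, WALT improves agent robustness and efficiency and reduces dependence on step-by-step LLM reasoning.
Method: WALT reverse-engineers latent website functionality (such as search and post) and wraps it as invocable tools, which the agent calls directly to complete tasks.
Result: On the VisualWebArena and WebArena benchmarks, WALT achieves higher success rates with fewer steps and less reliance on LLM reasoning.
Insight: Abstracting website functionality into tools can substantially improve the efficiency and robustness of web agents and offers a general paradigm for browser automation.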
Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle – relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites – spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
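The shift from click-and-type reasoning to calls like search(query) can be illustrated with a small tool registry. Everything in this sketch is hypothetical: the `ToolRegistry` class, the in-memory catalog, and the `search` implementation are stand-ins and do not reflect WALT's actual interfaces, which the abstract does not describe.

```python
from typing import Callable, Dict, List

class ToolRegistry:
    """Maps tool names to callables so an agent can invoke them by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str):
        def wrap(fn):
            self._tools[name] = fn
            return fn
        return wrap

    def call(self, name: str, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

registry = ToolRegistry()

@registry.register("search")
def search(query: str) -> List[str]:
    # Stand-in: a real tool would drive the website's own search feature.
    catalog = ["red bike", "blue bike", "red car"]
    return [item for item in catalog if query in item]

# The agent issues one high-level call instead of a sequence of UI actions.
results = registry.call("search", query="red")
```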
[64] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation
Meilong Xu,Xiaoling Hu,Shahira Abousamra,Chen Li,Chao Chen
Main category: cs.CV
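TL;DR: The paper proposes MATCH, a semi-supervised segmentation framework that uses multiple perturbed predictions and topological consistency to identify and preserve semantic structures in histopathology images, addressing the challenge of densely distributed objects.
Details
Motivation: In histopathology image analysis, objects are densely distributed and annotation is costly. Semi-supervised segmentation must capture semantic structure from unlabeled data, but existing methods struggle to separate biological structures from noise.
Contribution: (1) MATCH, a semi-supervised segmentation framework based on topological consistency; (2) a matching strategy that combines spatial overlap with global structural alignment; (3) experiments showing a marked reduction in topological errors.
Method: (1) Generate multiple predictions through stochastic dropout and temporal training snapshots; (2) match topological features using spatial overlap and global structural alignment; (3) enforce topological consistency across the predictions.
Result: Experiments show that MATCH reduces topological errors and yields more robust and accurate segmentations, particularly for densely packed objects.
Insight: Multiple perturbed predictions combined with topological consistency can separate real structures from noise, which suits histopathology analysis where annotations are scarce.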
Abstract: In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.
[65] Towards Better Optimization For Listwise Preference in Diffusion Models
Jiamu Bai,Xin Yu,Meilong Xu,Weitao Lu,Xin Pan,Kiwan Maeng,Daniel Kifer,Jian Wang,Yu Wang
Main category: cs.CV
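TL;DR: This paper proposes Diffusion-LPO, a framework for optimizing listwise preference data in diffusion models. It extends the DPO objective with the Plackett-Luce model and improves both visual quality and preference alignment.
Details
Motivation: Existing DPO methods for diffusion models rely mainly on pairwise preference data and ignore the implicit ranking information in human feedback, which limits how precisely preferences can be expressed.
Contribution: Proposes Diffusion-LPO, which brings listwise preference optimization to diffusion models by extending the DPO objective under the Plackett-Luce model, enabling more precise preference alignment.
Method: Aggregates user feedback into ranked lists, derives a listwise extension of the DPO objective under the Plackett-Luce model, and enforces consistency across the whole ranking.
Result: Diffusion-LPO consistently outperforms pairwise DPO baselines on text-to-image generation, image editing, and personalized preference alignment.
Insight: Listwise preference data carries more information than pairwise comparisons and captures human preferences more accurately, improving generation quality and alignment.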
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
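The Plackett-Luce model mentioned above assigns a probability to a full ranking: at each position, the item placed there is chosen with softmax probability among the items not yet ranked. A minimal sketch of the resulting negative log-likelihood follows. The scores stand in for whatever per-image preference score the method uses (in a DPO-style setup this would plausibly be an implicit reward from policy and reference models, which is an assumption here), and the function name is hypothetical.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under Plackett-Luce.

    `scores` must be ordered from most preferred to least preferred.
    Each item competes against itself and every lower-ranked item.
    """
    nll = 0.0
    for k in range(len(scores) - 1):
        tail = scores[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= scores[k] - log_z
    return nll

tie = plackett_luce_nll([0.0, 0.0])               # two items, equal scores
consistent = plackett_luce_nll([2.0, 1.0, 0.0])   # scores agree with the ranking
reversed_ = plackett_luce_nll([0.0, 1.0, 2.0])    # scores contradict the ranking
```

For a list of two items the expression reduces to the negative log-sigmoid of the score difference, which is the pairwise form that DPO-style objectives use.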
[66] Growing Visual Generative Capacity for Pre-Trained MLLMs
Hanyu Wang,Jiaming Han,Ziyan Yang,Qi Zhao,Shanchuan Lin,Xiangyu Yue,Abhinav Shrivastava,Zhenheng Yang,Hao Chen
Main category: cs.CV
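TL;DR: This paper proposes Bridge, a pure autoregressive unified multimodal large language model (MLLM). It uses a Mixture-of-Transformers architecture to add generative ability to pre-trained visual understanding models, enabling image understanding and generation within a single next-token prediction framework.
Details
Motivation: Building unified MLLMs that support both understanding and generation is challenging. Hybrid approaches generate high-quality images but break the autoregressive paradigm, while pure autoregressive approaches face trade-offs between semantic alignment and pixel-level fidelity.
Contribution: Proposes Bridge, which uses a semantic-to-pixel discrete representation and a Mixture-of-Transformers architecture to achieve high-quality image generation and understanding while keeping a unified autoregressive formulation.
Method: Uses a Mixture-of-Transformers architecture with a discrete representation that combines semantic tokens and fine-grained pixel tokens, improving visual generation fidelity with only a small increase in sequence length.
Result: Across diverse multimodal benchmarks, Bridge achieves competitive or superior results in both understanding and generation while requiring less training data and training time.
Insight: Combining semantic and pixel tokens lets Bridge keep strong language alignment while describing visual details precisely, offering a new design direction for unified MLLMs.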
Abstract: Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.
[67] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations
Ricardo Gonzalez Penuela,Felipe Arias-Russi,Victor Capriles
Main category: cs.CV
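TL;DR: This paper proposes guiding multimodal large language models (MLLMs) with historical questions so that they generate more relevant visual descriptions for blind and low vision users, avoiding lengthy and irrelevant information.
Details
Motivation: When describing images for blind and low vision users, existing MLLMs usually produce lengthy descriptions that ignore context, which leads to a poor user experience.
Contribution: A system that uses historical questions to guide MLLMs toward descriptions that better match user needs.
Method: The system uses historical questions from the VizWiz-LF dataset together with the visual context of the new image to guide the MLLM's description.
Result: In the evaluation, context-aware descriptions anticipated and answered users' questions in 76.1% of cases and were preferred in 54.4% of comparisons.
Insight: Incorporating historical questions makes MLLM descriptions more relevant and improves the user experience.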
Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users’ questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions .
[68] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Krishna Teja Chitty-Venkata,Murali Emani
Main category: cs.CV
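TL;DR: The paper introduces ImageNet-Think-250K, a dataset intended to support the development of vision language models (VLMs) with explicit reasoning capabilities. It covers 250,000 images with structured thinking tokens and answers generated by two state-of-the-art VLMs.
Details
Motivation: Current VLMs fall short on explicit reasoning and need large-scale datasets for training and evaluation. The authors aim to fill this gap with a synthetic dataset and to advance multimodal reasoning research.
Contribution: The ImageNet-Think-250K dataset, which pairs images with structured thinking tokens; the dataset and evaluation benchmarks are to be released publicly.
Method: Thinking-answer sequences are generated with two VLMs, GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506; each image receives two pairs of sequences.
Result: The dataset contains 250,000 images with their reasoning traces and final answers, serving as a training and evaluation resource for multimodal reasoning models.
Insight: Synthetic datasets can help address the weakness of VLMs in explicit reasoning and further multimodal reasoning research.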
Abstract: We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.
[69] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems
Roman Jacome,Romario Gualdrón-Hurtado,Leon Suarez,Henry Arguello
Main category: cs.CV
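TL;DR: The paper proposes NPN, a novel regularization method that uses non-linear projections computed by a neural network to exploit the null-space structure of the sensing matrix and improve reconstruction in imaging inverse problems.
Details
Motivation: Imaging inverse problems usually rely on handcrafted regularizers or learned models, but these overlook the specific structure of the sensing matrix's null-space. NPN aims to exploit that structure to improve reconstruction accuracy and adaptability.
Contribution: NPN, a new regularization approach centered on non-linear projections of the sensing matrix's null-space, offering a more interpretable and flexible regularization strategy that applies to many inverse problems.
Method: A neural network learns a low-dimensional projection of the sensing matrix's null-space, which is combined with existing reconstruction frameworks (plug-and-play methods, unrolling networks, deep image prior, and diffusion models).
Result: Experiments show that NPN improves reconstruction quality across inverse problems including compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging.
Insight: The results suggest that exploiting the null-space structure of the sensing matrix yields a more effective regularization strategy for imaging inverse problems while remaining compatible with other methods.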
Abstract: Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinitely many solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix’s null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
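The null-space ambiguity this abstract refers to can be made concrete with a small linear-algebra sketch. It is not the NPN method itself (NPN learns a non-linear, low-dimensional projection with a neural network); it only shows the standard decomposition of a signal into the part fixed by the measurements and the part lying in the null-space of a sensing matrix. The matrix `A`, the signal `x`, and their sizes are hypothetical.

```python
import numpy as np

def null_space_projector(A):
    """Return P = I - pinv(A) @ A, the orthogonal projector onto null(A)."""
    n = A.shape[1]
    return np.eye(n) - np.linalg.pinv(A) @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 10))  # hypothetical sensing matrix: 4 measurements, 10 unknowns
x = rng.standard_normal(10)       # hypothetical ground-truth signal

P = null_space_projector(A)
x_null = P @ x          # component the measurements cannot see
x_range = x - x_null    # component determined by y = A @ x

# Both x and x_range explain the measurements equally well,
# so a prior is needed to choose the null-space component.
y = A @ x
```

Any regularizer that acts on `x_null` leaves the data-fidelity term unchanged, which is why a prior on the null-space component can complement image-domain priors.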
[70] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics
Zijun Li,Jinchang Zhang,Ming Zhang,Guoyu Lu
Main category: cs.CV
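TL;DR: The paper proposes an automated genomic interpretation module that combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), producing interpretable decisions through biological concepts for use in medical robotic systems.
Details
Motivation: Automated interpretation of genomic data is critical in clinical applications, but existing methods lack interpretability and reliability. The paper aims to fill this gap and provide a reliable foundation for medical automation and robotic systems.
Contribution: A framework that combines CGR and CBM to produce interpretable decisions via biological concepts, along with several techniques that improve model reliability.
Method: Chaos Game Representation (CGR) and a Concept Bottleneck Model (CBM), combined with concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration.
Result: State-of-the-art performance on HIV subtype classification, together with higher concept prediction fidelity and more favorable cost-benefit trade-offs.
Insight: By designing for interpretability through biological concepts, the framework offers a reliable and efficient approach to automated decision-making in genomic medicine.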
Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
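As a rough illustration of the input representation named above, the following sketch builds a frequency Chaos Game Representation image from a DNA string and computes GC content, one of the concepts the abstract lists. The corner assignment, the grid resolution `k`, and the choice to skip ambiguous bases are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

# Conventional CGR corner layout (an assumption; other layouts are also in use).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_image(seq, k=3):
    """Frequency CGR: a 2^k x 2^k grid in which each cell counts one k-mer."""
    size = 2 ** k
    img = np.zeros((size, size), dtype=np.int64)
    x, y = 0.5, 0.5
    steps = 0
    for base in seq.upper():
        if base not in CORNERS:
            continue  # simplification: ambiguous bases are skipped
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        steps += 1
        if steps >= k:  # after k moves the cell is fixed by the last k bases
            img[min(int(y * size), size - 1), min(int(x * size), size - 1)] += 1
    return img

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

example = "ACGTACGTAC"  # hypothetical sequence
image = cgr_image(example, k=3)
gc = gc_content(example)
```

The resulting grid can be fed to an image model, while scalar concepts such as `gc` could serve as supervision targets for a concept bottleneck layer.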
[71] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Angen Ye,Zeyu Zhang,Boyuan Wang,Xiaofeng Wang,Dapeng Zhang,Zheng Zhu
Main category: cs.CV
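TL;DR: VLA-R1 is a framework that strengthens reasoning in vision-language-action models by combining reinforcement learning with policy optimization, improving multi-task generalization.
Details
Motivation: Current VLA models lack explicit step-by-step reasoning, overlook the geometric and affordance constraints of action generation, and use training pipelines that do not reinforce reasoning quality.
Contribution: VLA-R1, which integrates RLVR with GRPO to optimize both reasoning and execution; the VLA-CoT-13K dataset; and validation on cross-domain and real-robot settings.
Method: RLVR with verifiable rewards (region alignment, trajectory consistency, output formatting), combined with GRPO for policy optimization; supervised training on a chain-of-thought dataset.
Result: Outperforms existing VLA methods on simulation and real-robot platforms, with better generalization and execution accuracy.
Insight: Explicit reasoning and verifiable rewards are important for modeling geometric and affordance constraints in VLA models; chain-of-thought data effectively strengthens multimodal reasoning.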
Abstract: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
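The group-relative step in GRPO can be sketched in a few lines: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The reward combination below (an unweighted sum of region, trajectory, and format terms) is a hypothetical stand-in; the abstract names the three reward types but not how they are weighted.

```python
import numpy as np

def total_reward(region_score, trajectory_score, format_ok):
    """Hypothetical combination: unweighted sum of the three verifiable terms."""
    return float(region_score) + float(trajectory_score) + (1.0 if format_ok else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four hypothetical rollouts sampled for one instruction.
group = [
    total_reward(0.9, 0.8, True),
    total_reward(0.2, 0.1, True),
    total_reward(0.5, 0.4, False),
    total_reward(0.9, 0.7, True),
]
advantages = group_relative_advantages(group)
```

Because the baseline comes from the group itself, no separate value network is needed; responses better than their siblings get positive advantages and the rest get negative ones.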
[72] Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery
Minh Tran,Maksim Siniukov,Zhangyu Jin,Mohammad Soleymani
Main category: cs.CV
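TL;DR: DFE is an unsupervised, data-driven framework that uses an RVQ-VAE to learn a compact, interpretable dictionary of facial expressions from 3D mesh sequences, outperforming FACS and other encoding methods.
Details
Motivation: Existing facial expression coding systems such as FACS have limited coverage and high annotation cost, so a more efficient, data-driven alternative is needed.
Contribution: The DFE framework, which learns discrete facial expression codes with an RVQ-VAE and improves the precision and diversity of expression analysis.
Method: A 3DMM extracts identity-invariant expression features, which an RVQ-VAE encodes into discrete tokens; each token represents a specific, reusable facial deformation pattern.
Result: DFE outperforms FACS and other models on stress detection, personality prediction, and depression detection, and covers a wider range of facial behavior.
Insight: DFE shows the potential of unsupervised learning for facial expression analysis and offers a scalable option for psychology and affective computing.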
Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
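The quantization step can be illustrated with a minimal residual vector quantization sketch: each stage picks the nearest codeword to what is left of the feature, subtracts it, and passes the remainder on. The codebook here is random rather than learned, and the dimensions, the number of stages, and the reuse of one codebook across stages are assumptions for illustration. A Bag-of-Words vector is then just a histogram of the emitted token ids.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruct the feature as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))  # hypothetical: 16 codewords of dimension 4
codebooks = [codebook] * 3               # the same codebook shared by 3 stages
feature = rng.standard_normal(4)         # hypothetical expression feature for one frame

tokens, residual = rvq_encode(feature, codebooks)
bag_of_words = np.bincount(tokens, minlength=len(codebook))
```

In a trained RVQ-VAE the codewords are learned jointly with the encoder and decoder, so each token id would correspond to a recurring deformation pattern rather than a random direction.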
[73] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale
Yongbo Chen,Yanhao Zhang,Shaifali Parashar,Liang Zhao,Shoudong Huang
Main category: cs.CV
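TL;DR: The paper proposes Con-NRSfM, a new method for non-rigid structure-from-motion (NRSfM) that recovers the local conformal scale using differential geometry, improving the accuracy and robustness of depth estimation.
Details
Motivation: Traditional NRSfM methods rely on strict assumptions of locally planar surfaces or locally linear deformations and fail to recover the conformal scale, which limits accuracy and adaptability. The paper aims to remove these limitations.
Contribution: 1. Con-NRSfM, which recovers the local conformal scale without strict assumptions; 2. decoupled constraints on depth and conformal scale; 3. self-supervised learning to generate dense 3D point clouds.
Method: 1. Point-wise reconstruction based on differential geometry; 2. a graph-based optimization framework; 3. parallel separable iterative optimization; 4. an encoder-decoder network that generates dense point clouds.
Result: Experiments on synthetic and real datasets show that the method surpasses existing approaches in reconstruction accuracy and robustness.
Insight: Decoupling the depth and conformal scale constraints is the key innovation and makes NRSfM more flexible; adding self-supervised learning opens a new route to dense reconstruction.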
Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: https://sites.google.com/view/con-nrsfm.
[74] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Jin Cao,Hongrui Wu,Ziyong Feng,Hujun Bao,Xiaowei Zhou,Sida Peng
Main category: cs.CV
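TL;DR: UniVerse decouples robust 3D reconstruction into two subtasks, restoration and reconstruction, and uses a video diffusion model to restore consistent images before reconstructing the scene.
Details
Motivation: Inconsistent multi-view images make 3D reconstruction difficult; existing methods depend on dense observations and are hard to optimize.
Contribution: The UniVerse framework, which decouples the task into restoration and reconstruction and uses a video diffusion model to learn a scene prior.
Method: 1. Convert inconsistent images into initial videos; 2. restore consistent images with a video diffusion model; 3. reconstruct the 3D scene from the restored images.
Result: Strong generalization and superior performance on synthetic and real-world datasets, with support for controlling the style of the 3D scene.
Insight: A video diffusion model trained on large-scale data supplies a general, robust prior for both image restoration and 3D reconstruction.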
Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/
[75] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning
Xuchen Li,Xuzhao Li,Jiahui Gao,Renjie Pi,Shiyu Hu,Wentao Zhang
Main category: cs.CV
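TL;DR: The paper proposes an adaptive pixel reasoning framework that decides dynamically when pixel-level operations are needed, addressing the inefficiency and distraction of vision-language models on fine-grained visual tasks; it improves performance while reducing unnecessary visual operations.
Details
Motivation: Vision-language models perform well on multimodal tasks but often struggle on tasks that require understanding fine-grained visual elements, because of information loss or insufficient attention to critical regions. Existing methods bring in pixel-level information, but it can be overused and inefficient.
Contribution: The first adaptive pixel reasoning framework, which determines dynamically when to apply pixel-level operations; a reinforcement learning framework driven by the model's own feedback that optimizes operation calls; and experiments showing better performance with far fewer unnecessary visual operations.
Method: Operation-aware supervised fine-tuning establishes baseline competence in textual reasoning and visual operations; a rollout-guided reinforcement learning framework based on the model's own responses then decides when pixel-level operations are invoked.
Result: Reaches 73.4% accuracy on HR-Bench 4K with a tool usage ratio of only 20.1%; compared with previous methods, accuracy improves while tool usage drops by 66.5%.
Insight: Adjusting when pixel-level operations are invoked is a key strategy for improving both the efficiency and the performance of vision-language models.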
Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model’s own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.
[76] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring
Han-Jay Shu,Wei-Ning Chiu,Shun-Ting Chang,Meng-Ping Huang,Takeshi Tohyama,Ahram Han,Po-Chih Kuo
Main category: cs.CV
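TL;DR: The paper proposes an augmentation-sensitivity risk scoring (ASRS) framework that identifies error-prone cases for chest X-ray (CXR) models by measuring embedding shifts, with the aim of improving fairness and safety in medical AI.
Details
Motivation: Deep learning models perform well at CXR interpretation, but accuracy is uneven across patient subgroups and hidden failures remain. Existing approaches such as confidence calibration or out-of-distribution detection struggle to catch subtle within-distribution errors.
Contribution: The ASRS framework, which uses clinically plausible rotations and embedding-shift measurements to identify highly sensitive (error-prone) cases without labels, supporting more reliable medical AI.
Method: ASRS uses the RAD-DINO encoder to measure embedding shifts after image rotation (±15°/±30°) and stratifies samples into stability quartiles by sensitivity score.
Result: Highly sensitive cases show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence, indicating that ASRS can flag error-prone cases.
Insight: ASRS offers a label-free route to selective prediction and clinician review, helping cover gaps that existing techniques leave in the fairness and safety of medical AI.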
Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches – based on confidence calibration or out-of-distribution (OOD) detection – struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
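A minimal sketch of the scoring step, assuming the embeddings have already been computed: each image gets a sensitivity score equal to the average distance between its original embedding and the embeddings of its rotated copies, and scores are then cut into quartiles. Cosine distance is an assumption here (the abstract says only that embedding shifts are measured), and the random arrays stand in for RAD-DINO outputs.

```python
import numpy as np

def sensitivity_scores(emb_orig, emb_rot):
    """Mean cosine distance between each image and its rotated views.

    emb_orig: (N, D) embeddings of the original images.
    emb_rot:  (R, N, D) embeddings of the same images under R rotations.
    """
    a = emb_orig / np.linalg.norm(emb_orig, axis=-1, keepdims=True)
    b = emb_rot / np.linalg.norm(emb_rot, axis=-1, keepdims=True)
    cos = np.einsum("nd,rnd->rn", a, b)
    return (1.0 - cos).mean(axis=0)

def stability_quartiles(scores):
    """Assign 0 (most stable) to 3 (most sensitive) by score quartile."""
    edges = np.quantile(scores, [0.25, 0.5, 0.75])
    return np.digitize(scores, edges)

# Placeholder embeddings: 8 images, 4 rotations (e.g. -30, -15, +15, +30 degrees), 5 dims.
rng = np.random.default_rng(0)
emb_orig = rng.standard_normal((8, 5))
noise_scale = np.linspace(0.05, 0.8, 8)[None, :, None]
emb_rot = emb_orig[None] + noise_scale * rng.standard_normal((4, 8, 5))

scores = sensitivity_scores(emb_orig, emb_rot)
quartiles = stability_quartiles(scores)
```

Cases in the top quartile would be the candidates for abstention or clinician review; no labels are needed to compute the score.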
[77] FreeViS: Training-free Video Stylization with Inconsistent References
Jiacong Xu,Yiqun Mei,Ke Zhang,Vishal M. Patel
Main category: cs.CV
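TL;DR: FreeViS is a training-free video stylization framework that integrates multiple stylized references into a pretrained I2V model, addressing frame-to-frame style inconsistency and high training cost, and uses high-frequency compensation and flow-based motion cues to improve temporal consistency.
Details
Motivation: Existing video stylization methods either suffer from inconsistency between frames or require expensive training. FreeViS aims to provide a training-free solution that keeps both style richness and temporal consistency.
Contribution: FreeViS: 1) a training-free video stylization framework; 2) high-frequency compensation and flow-based motion cues to improve consistency; 3) better stylization fidelity and temporal consistency than baseline methods.
Method: Multiple stylized references are integrated into a pretrained I2V model; high-frequency compensation constrains content layout and motion, and flow-based cues preserve style textures in low-saliency regions.
Result: FreeViS outperforms existing baselines in stylization fidelity and temporal consistency and achieves higher human preference.
Insight: High-frequency compensation and flow-based motion cues are key to temporal consistency in video stylization, and a training-free framework is more practical.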
Abstract: Video stylization plays a key role in content creation, but it remains a challenging problem. Naively applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references to a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/
[78] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
Jiyao Liu,Jinjie Wei,Wanying Qu,Chenglong Ma,Junzhi Ning,Yunheng Li,Ying Chen,Xinzhe Luo,Pengcheng Chen,Xin Gao,Ming Hu,Huihui Xu,Xin Wang,Shujian Gao,Dingkang Yang,Zhongying Deng,Jin Ye,Lihao Liu,Junjun He,Ningsheng Xu
Main category: cs.CV
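TL;DR: MedQ-Bench is a comprehensive benchmark for evaluating the medical image quality assessment (IQA) abilities of multimodal large language models (MLLMs); its perception and reasoning tasks address limitations of existing approaches.
Details
Motivation: Existing medical image quality assessment methods rely too heavily on scalar scores, cannot reproduce the descriptive reasoning of experts, and do not meet the needs of clinical AI.
Contribution: The MedQ-Bench benchmark, comprising two tasks (MedQ-Perception and MedQ-Reasoning) that span multiple imaging modalities and quality attributes, together with a multi-dimensional judging protocol.
Method: MedQ-Bench contains two tasks: MedQ-Perception (low-level perceptual capability) and MedQ-Reasoning (no-reference and comparison reasoning), covering five imaging modalities and more than 40 quality attributes. Model outputs are assessed with a multi-dimensional judging protocol.
Result: An evaluation of 14 state-of-the-art MLLMs finds preliminary but unstable perceptual and reasoning skills that fall short of reliable clinical use. Human-AI alignment validation indicates that further optimization is still needed.
Insight: MLLMs show considerable potential for medical IQA but need targeted optimization. MedQ-Bench gives future research a new direction and supports the use of MLLMs in medical image quality assessment.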
Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
[79] Holistic Order Prediction in Natural Scenes
Pierre Musacchio,Hyunmin Lee,Jaesik Park
Main category: cs.CV
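TL;DR: InstaFormer is a network that predicts the occlusion and depth orderings of all instances in a scene in a single forward pass from an RGB image alone, avoiding expensive input formats and inference costs.
Details
Motivation: Modern vision models rely on expensive inputs (such as category labels and segmentation masks) and costly inference (many forward passes) to understand instance-level geometric relations. InstaFormer aims to solve this with a single forward pass and RGB input.
Contribution: InstaFormer, which takes an RGB image as input and predicts the occlusion and depth orderings of the whole scene in one forward pass, reducing input and computation cost.
Method: InstaFormer relies on interactions between object queries and latent mask descriptors, which carry complementary semantic representations, to predict instance orderings.
Result: Comprehensive benchmarking and ablation experiments demonstrate the effectiveness of the method and its advantages in predicting instance orderings.
Insight: The interaction between object queries and mask descriptors is key to efficient order prediction, and the single-forward-pass design substantially lowers the computational burden.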
Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, state-of-the-art methods rely on expensive input formats (category labels, binary segmentation masks) and incur high inference costs (a quadratic number of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.
[80] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
Raahul Krishna Durairaju,K. Saruladha
Main category: cs.CV
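TL;DR: PyramidStyler is a transformer-based neural style transfer framework that introduces pyramidal positional encoding and reinforcement learning, improving efficiency and quality for complex styles and high-resolution inputs.
Details
Motivation: Existing CNN and transformer models are inefficient on complex styles and high-resolution inputs; a method is needed that captures local detail while preserving global context.
Contribution: The PyramidStyler framework, which combines Pyramidal Positional Encoding (PPE) with reinforcement learning, lowers content and style losses, and achieves fast rendering.
Method: A transformer architecture with Pyramidal Positional Encoding (PPE), a hierarchical multi-scale encoding that reduces computational load; reinforcement learning dynamically optimizes the stylization process.
Result: Trained on the COCO and WikiArt datasets, content and style losses fall by 62.6% and 57.4% respectively, with an inference time of 1.39 seconds.
Insight: Combining pyramidal encoding with reinforcement learning gives an efficient approach to artistic style transfer on high-resolution images.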
Abstract: Neural Style Transfer (NST) has evolved from Gatys et al.’s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs, achieving 1.39 s inference, and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
[81] LoBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction
Sheng-Hsiang Hung,Ting-Yu Yen,Wei-Fang Sun,Simon See,Shih-Hsuan Hung,Hung-Kuo Chu
Main category: cs.CV
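TL;DR: LoBE-GS is a load-balanced and efficient 3D Gaussian Splatting framework that uses depth-aware partitioning and an optimization strategy to address load imbalance and preprocessing overhead in large-scale scene reconstruction, improving training speed and scalability.
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) methods face load imbalance and high preprocessing overhead on large-scale scenes such as city blocks.
Contribution: 1. A depth-aware partitioning method that shortens preprocessing time; 2. an optimization strategy that balances the load of visible Gaussians across blocks; 3. visibility cropping and selective densification techniques that lower training cost.
Method: 1. Depth-aware partitioning; 2. optimization-based load balancing; 3. visibility cropping and selective densification.
Result: On large-scale urban scenes, LoBE-GS achieves up to 2x faster end-to-end training than baseline methods while maintaining reconstruction quality.
Insight: Load balancing and efficient preprocessing are key to scaling 3D Gaussian Splatting to large scenes.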
Abstract: 3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework, that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians – a strong proxy for computational load – across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to $2\times$ faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.
[82] Pack and Force Your Memory: Long-form and Consistent Video Generation
Xiaofei Wu,Guozhen Zhang,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Xuming He
Main category: cs.CV
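TL;DR: The paper proposes two methods, MemoryPack and Direct Forcing, to address long-range dependency and error accumulation in long-form video generation, improving video consistency and reliability.
Details
Motivation: Long-form video generation faces the dual challenge of modeling long-range dependencies and limiting error accumulation in autoregressive decoding; existing methods struggle to handle both at once.
Contribution: 1. MemoryPack, a learnable context-retrieval mechanism that uses textual and image information as global guidance to jointly model short- and long-term dependencies.
2. Direct Forcing, an efficient single-step approximation strategy that reduces the mismatch between training and inference and curbs error propagation.
Method: MemoryPack models context dynamically by combining textual and image information; Direct Forcing aligns training and inference through a single-step approximation.
Result: The method improves the temporal consistency and reliability of long-form video generation while remaining computationally efficient with linear complexity.
Insight: Global guidance such as text and images helps model long-term dependencies more effectively, and a better training-inference alignment strategy reduces error accumulation.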
Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.
[83] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving
Cornelius Schröder,Marius-Raphael Schlüter,Markus Lienkamp
Main category: cs.CV
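TL;DR: The paper studies confidence calibration for the classification task of 3D object detectors in autonomous driving, proposes a metric for the calibration of the full predictive class distribution, and designs two regularizing loss terms to improve calibration.
Details
Motivation: Autonomous driving systems need well-calibrated predictive distributions from object detection to operate safely and reliably. The paper focuses on classification confidence calibration for 3D object detectors and stresses the importance of calibrating the full class distribution, covering both dominant and secondary classes.
Contribution: 1. A new metric for the calibration of the full predictive class distribution. 2. Two regularizing loss terms, one targeting the dominant prediction and one targeting the full prediction vector. 3. Experiments validating the approach on CenterPoint and PillarNet and showing that DSVT-Pillar has different calibration needs.
Method: 1. Two regularizing loss terms: (1) calibration of the dominant class prediction; (2) calibration of the full prediction vector. 2. Combination with isotonic regression to improve calibration. 3. Evaluation on several 3D detectors (CenterPoint, PillarNet, DSVT-Pillar).
Result: Combining the full-prediction-vector loss term with isotonic regression gives the best calibration for CenterPoint and PillarNet on both dominant and secondary classes. DSVT-Pillar requires a different approach.
Insight: 1. Calibrating the full predictive distribution matters for autonomous driving; both dominant and secondary classes need attention. 2. Calibration strategies may need to be tailored to each detector.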
Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to regard the calibration of the full predictive confidence distribution over all classes and derive a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, and isotonic regression leads to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar cannot be jointly calibrated for dominant and secondary predictions using the same method.
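The difference between calibrating only the dominant prediction and calibrating the whole class vector can be shown with a generic expected calibration error (ECE) sketch. The paper's own metric is not specified in the abstract, so the "full" variant below, which treats every per-class probability as a separate binary prediction, is an assumed, commonly used stand-in; the function names and example numbers are hypothetical.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    conf = np.asarray(conf, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def dominant_and_full_ece(probs, labels, n_bins=10):
    """ECE of the top-1 prediction and of all class probabilities pooled together."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    pred = probs.argmax(axis=1)
    dominant = expected_calibration_error(probs.max(axis=1), pred == labels, n_bins)
    onehot = np.eye(probs.shape[1])[labels]
    full = expected_calibration_error(probs.ravel(), onehot.ravel(), n_bins)
    return dominant, full

# Hypothetical detector outputs over three classes (e.g. car, pedestrian, cyclist).
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.5, 0.1], [0.6, 0.1, 0.3]])
labels = np.array([0, 1, 0, 2])
dominant_ece, full_ece = dominant_and_full_ece(probs, labels)
```

A detector can score well on the dominant-class number while its secondary probabilities are poorly calibrated, which is the gap the full-vector view is meant to expose.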
[84] Leveraging Prior Knowledge of Diffusion Model for Person Search
Giyeol Kim,Sooyoung Yang,Jihyong Oh,Myungjoo Kang,Chanho Eom
Main category: cs.CV
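TL;DR: The paper proposes DiffPS, a framework that uses the prior knowledge of a pre-trained diffusion model to improve person search. Three modules resolve the optimization conflict between detection and re-identification in existing methods, reaching state-of-the-art results on CUHK-SYSU and PRW.
Details
Motivation: Existing person search methods mostly use ImageNet pre-trained backbones, which struggle to capture complex spatial context and fine-grained identity cues, and their shared backbone leads to conflicting optimization objectives.
Contribution: 1. The DiffPS framework, which brings diffusion model priors to person search; 2. three modules, DGRPN, MSFRN, and SFAN, that address localization, shape bias, and feature aggregation respectively.
Method: 1. DGRPN strengthens person localization with the diffusion prior; 2. MSFRN mitigates shape bias through multi-scale frequency refinement; 3. SFAN leverages text-aligned diffusion features.
Result: New state-of-the-art performance on the CUHK-SYSU and PRW datasets.
Insight: Diffusion model priors can substantially improve person search, and resolving the sub-task conflict through dedicated modules is a design worth borrowing.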
Abstract: Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.
[85] ClustViT: Clustering-based Token Merging for Semantic Segmentation
Fabio Montello,Ronja Güldenring,Lazaros Nalpantidis
Main category: cs.CV
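TL;DR: ClustViT merges Vision Transformer tokens by clustering, substantially reducing computational cost while keeping semantic segmentation accuracy comparable.
Details
Motivation: The quadratic attention complexity of Vision Transformers (ViTs) limits their use in real-time robotic systems, especially for dense prediction tasks such as semantic segmentation.
Contribution: Proposes ClustViT, which merges similar tokens with a trainable Cluster module and restores fine details with a Regenerator module, improving computational efficiency.
Method: Combines clustering with token merging, using pseudo-clusters derived from segmentation masks to guide the merging and reduce computation.
Result: Achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three datasets with comparable segmentation accuracy.
Insight: Cluster-guided token merging is an effective way to make ViTs more efficient on dense prediction tasks.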
Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
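The merge-then-restore idea can be illustrated with a minimal sketch. It assumes cluster assignments are already given and uses plain averaging and copy-back; the paper's Cluster and Regenerator modules are trainable, so `merge_tokens` and `regenerate` below are hypothetical stand-ins, not the authors' implementation.

```python
def merge_tokens(tokens, cluster_ids):
    """Average tokens that share a cluster id; return merged tokens and the id order."""
    sums, counts = {}, {}
    for tok, cid in zip(tokens, cluster_ids):
        if cid not in sums:
            sums[cid] = [0.0] * len(tok)
            counts[cid] = 0
        sums[cid] = [s + t for s, t in zip(sums[cid], tok)]
        counts[cid] += 1
    order = list(sums)  # dicts preserve insertion order
    merged = [[s / counts[cid] for s in sums[cid]] for cid in order]
    return merged, order


def regenerate(merged, order, cluster_ids):
    """Naive restore: copy each merged token back to every position of its cluster."""
    lookup = dict(zip(order, merged))
    return [list(lookup[cid]) for cid in cluster_ids]
```

Attention cost grows quadratically with the number of tokens, so running the intermediate layers on the merged sequence is where the savings would come from; the restore step brings the sequence back to full length for the segmentation head.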
[86] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Yongyi Su,Haojie Zhang,Shijie Li,Nanqing Liu,Jingyi Liao,Junyi Pan,Yuan Liu,Xiaofen Xing,Chong Sun,Chen Li,Nancy F. Chen,Shuicheng Yan,Xulei Yang,Xun Xu
Main category: cs.CV
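TL;DR: The paper proposes Patch-as-Decodable Token (PaDT), a unified paradigm for multimodal vision tasks in which Visual Reference Tokens (VRTs) let an MLLM directly generate both textual and visual outputs, improving performance on dense prediction tasks.
Details
Motivation: Existing MLLMs rely on indirect representations for vision tasks (such as coordinates written as text), which limits performance, particularly on dense prediction tasks such as segmentation. A more direct form of visual output is needed.
Contribution: Proposes PaDT and introduces VRTs so that MLLMs can directly generate diverse visual outputs. The embedding table is expanded dynamically and VRTs are processed independently at each forward pass, improving localization and differentiation among similar objects.
Method: A lightweight decoder operating on VRTs, combined with a training strategy that randomly selects VRTs for supervised fine-tuning and a per-token cross-entropy loss.
Result: Achieves state-of-the-art performance on four visual perception and understanding tasks, even compared with significantly larger MLLMs.
Insight: Dynamic, independent processing of VRTs is the key design choice and suggests a route toward unifying multimodal vision tasks.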
Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with LLM’s output textual tokens. A lightweight decoder then transforms LLM’s outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
[87] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing
Lei Liu,Can Wang,Zhenghao Chen,Dong Xu
Main category: cs.CV
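TL;DR: 4DGS-Craft is a consistent and interactive 4D Gaussian Splatting editing framework. A 4D-aware InstructPix2Pix model and a multi-view grid module ensure view and temporal consistency, a Gaussian selection mechanism keeps non-edited regions stable, and an LLM module parses user intent and decomposes complex instructions into a sequence of atomic operations.
Details
Motivation: Existing 4D Gaussian Splatting editing methods fall short on view, temporal, and non-edited-region consistency and struggle with complex text instructions. 4DGS-Craft is proposed to address these issues.
Contribution: 1. Introduces a 4D-aware InstructPix2Pix model and a multi-view grid module to keep edits consistent; 2. Proposes a Gaussian selection mechanism that protects non-edited regions; 3. Designs an LLM-based module for understanding user intent.
Method: 1. Enhances InstructPix2Pix with 4D VGGT geometry features; 2. The multi-view grid module iteratively refines the multi-view input images while jointly optimizing the 4D scene; 3. The Gaussian selection mechanism identifies and optimizes only the Gaussians in edited regions; 4. An LLM parses user instructions into a sequence of atomic operations.
Result: Enables more consistent and controllable 4D scene editing than related work and handles complex user instructions.
Insight: Combining geometry features with an LLM improves both the consistency and the interactivity of 4D editing and offers a new direction for dynamic scene editing.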
Abstract: Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.
[88] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution
Junyu Wu,Jie Tang,Jie Liu,Gangshan Wu
Main category: cs.CV
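TL;DR: Pure-Pass (PP) is a pixel-level masking mechanism that classifies pixels using fixed color center points, giving fine-grained, spatially flexible masks that dynamically route token mixers in lightweight image super-resolution and improve performance.
Details
Motivation: Existing lightweight super-resolution methods such as CAMixer are limited in adaptability, masking granularity, and spatial flexibility, which restricts their practical deployment.
Contribution: Proposes Pure-Pass (PP): 1) a pixel-level masking mechanism; 2) pixel classification using fixed color center points; 3) fine-grained, spatially flexible dynamic routing.
Method: PP identifies pure pixels and exempts them from expensive computation. It is integrated into the ATD-light model to route token mixers dynamically.
Result: PP-ATD-light outperforms CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
Insight: Pixel-level fine-grained masking combined with dynamic routing can improve lightweight super-resolution.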
Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
[89] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
Nanaka Hosokawa,Ryo Takahashi,Tomoya Kitano,Yukihiro Iida,Chisako Muramatsu,Tatsuro Hayashi,Yuta Seino,Xiangrong Zhou,Takeshi Hara,Akitoshi Katsumata,Hiroshi Fujita
Main category: cs.CV
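TL;DR: GPT-4o is used to automatically generate findings for jaw cysts in dental panoramic radiographs, and a two-stage Self-correction Loop with Structured Output (SLSO) framework is proposed to improve accuracy. Experiments show that SLSO outperforms the conventional CoT method on many items, though limitations remain.
Details
Motivation: Generated findings for dental images suffer from accuracy problems such as hallucinated descriptions and incorrect tooth numbers. SLSO aims to make finding generation more reliable through structured output and a self-correction loop.
Contribution: 1. Proposes the SLSO framework, which combines structured output with iterative correction; 2. Evaluates SLSO on seven items, including transparency and tooth movement; 3. Demonstrates the potential of GPT-4o on a multimodal task.
Method: 1. Image input and analysis; 2. Structured data generation; 3. Tooth number extraction and consistency checking; 4. Iterative regeneration when inconsistencies are detected; 5. Finding generation followed by restructuring and consistency verification.
Result: SLSO achieved improvement rates of 66.9%, 33.3%, and 28.6% for tooth number, tooth movement, and root resorption, respectively, but statistical significance was not reached because the dataset is small (22 cases).
Insight: SLSO suppresses hallucinations through structured output, but identifying extensive lesions that span multiple teeth remains limited. Further refinement is needed before practical use.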
Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth remains limited, and further refinement is required to enhance overall performance and move toward a practical finding generation system.
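The regenerate-until-consistent control flow described above can be sketched as a small loop. This is only an illustration of the pattern: `generate` and `is_consistent` are hypothetical callables standing in for the model call and the tooth-number consistency check, and the default budget of five regenerations mirrors the "up to five regenerations" reported in the abstract rather than a documented parameter.

```python
def self_correct(generate, is_consistent, max_regenerations=5):
    """Call generate() until is_consistent(output) holds or the budget runs out.

    Returns the last output, the number of regenerations used, and whether
    the final output passed the consistency check.
    """
    output = generate()
    regenerations = 0
    while not is_consistent(output) and regenerations < max_regenerations:
        output = generate()
        regenerations += 1
    return output, regenerations, is_consistent(output)
```

Returning the pass/fail flag alongside the output lets a caller separate cases that converged from cases that exhausted the budget, which matters when only the successful cases are reported.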
[90] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction
Mario Resino,Borja Pérez,Jaime Godoy,Abdulla Al-Kaff,Fernando García
Main category: cs.CV
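TL;DR: LiLa-Net is a lightweight 3D autoencoder that encodes efficient features from real traffic environments using only LiDAR point clouds. It uses fewer encoder layers and simplified skip connections while still reconstructing the point cloud accurately.
Details
Motivation: Existing 3D point cloud reconstruction architectures are resource-intensive. A lightweight design is needed that improves efficiency while preserving reconstruction quality.
Contribution: 1) Proposes the lightweight LiLa-Net architecture; 2) Simplifies the skip-connection design; 3) Obtains an efficient and representative latent space.
Method: Uses fewer encoder layers and simplified skip connections, and balances the information carried by the skip connections against the latent encoding to improve reconstruction quality.
Result: The model improves reconstruction quality without compromising performance and generalizes well, reconstructing objects unrelated to the original traffic environment.
Insight: A lightweight design with well-balanced skip connections is key to efficient 3D point cloud reconstruction.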
Abstract: This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments, employing only the LiDAR’s point clouds. For this purpose, we use a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages the skip-connection concept to improve performance without using resources as extensive as those of state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows the original point cloud to be reconstructed accurately. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.
[91] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring
Jenna Kline,Maksim Kholiavchenko,Samuel Stevens,Nina van Tiel,Alison Zhong,Namrata Banerji,Alec Sheets,Sowbaranika Balasubramaniam,Isla Duporge,Matthew Thompson,Elizabeth Campolongo,Jackson Miliko,Neil Rosser,Tanya Berger-Wolf,Charles V. Stewart,Daniel I. Rubenstein
Main category: cs.CV
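TL;DR: kabr-tools is an open-source framework for automated multi-species behavioral monitoring. It combines drone-based video with machine learning systems to extract behavioral, social, and spatial metrics, markedly improving the granularity and efficiency of behavioral data collection.
Details
Motivation: Traditional field observation is limited in scope and time-consuming, making it hard to quantify complex, multidimensional behavioral patterns. A scalable automated solution is needed.
Contribution: Develops kabr-tools, an open-source package that integrates drone-based video with machine learning to extract behavioral and social metrics efficiently and support large-scale ecological studies.
Method: Uses object detection, tracking, and behavioral classification systems to extract key metrics from wildlife footage, such as time budgets, behavioral transitions, and social interactions.
Result: Compared with ground-based methods, drone-based observation reduced visibility loss by 15% and captured more transitions with higher accuracy and continuity. The framework is validated through three case studies covering 969 behavioral sequences.
Insight: The tool provides a powerful automated means of studying behavior at ecosystem scale, advancing conservation biology and ecological monitoring.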
Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy’s zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy’s zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy’s zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
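Two of the metrics named above, time budgets and behavioral transitions, can be computed from a per-frame label sequence with a few lines of code. The sketch below is a generic illustration under that assumption; the function names and the label strings are hypothetical and do not reflect the kabr-tools API.

```python
from collections import Counter


def time_budget(labels):
    """Fraction of frames spent in each behavior."""
    counts = Counter(labels)
    total = len(labels)
    return {behavior: n / total for behavior, n in counts.items()}


def transitions(labels):
    """Count changes between consecutive frames whose behaviors differ."""
    return Counter((a, b) for a, b in zip(labels, labels[1:]) if a != b)
```

A track that almost never leaves its current behavior produces few entries in the transition counter, which is one simple way to see the behavioral inertia mentioned in the abstract.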
[92] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation
Arman Behnam
Main category: cs.CV
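TL;DR: VGDM is a framework that combines a vision transformer with a diffusion model for brain tumor detection and segmentation, improving accuracy through global contextual reasoning and iterative denoising.
Details
Motivation: Conventional U-Nets have limited capacity to capture long-range dependencies and complex tumor structures, while diffusion models have shown promise for high-fidelity medical image generation and for refining segmentation boundaries.
Contribution: The core contribution is embedding a vision transformer in the diffusion process, combining global contextual reasoning with iterative denoising to improve segmentation accuracy and boundary precision.
Method: A transformer-driven diffusion framework: the transformer models spatial relationships across entire MRI volumes, and diffusion refinement mitigates voxel-level errors and recovers fine-grained detail.
Result: Experiments on MRI brain tumor datasets show consistent gains in Dice similarity and Hausdorff distance.
Insight: Combining transformers with diffusion models offers a new approach to medical image segmentation, with particular advantages for complex structures and long-range dependencies.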
Abstract: Accurate detection and segmentation of brain tumors from magnetic resonance imaging (MRI) are essential for diagnosis, treatment planning, and clinical monitoring. While convolutional architectures such as U-Net have long been the backbone of medical image segmentation, their limited capacity to capture long-range dependencies constrains performance on complex tumor structures. Recent advances in diffusion models have demonstrated strong potential for generating high-fidelity medical images and refining segmentation boundaries. In this work, we propose VGDM (Vision-Guided Diffusion Model), a transformer-driven diffusion framework for brain tumor detection and segmentation. By embedding a vision transformer at the core of the diffusion process, the model leverages global contextual reasoning together with iterative denoising to enhance both volumetric accuracy and boundary precision. The transformer backbone enables more effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details. This hybrid design provides a pathway toward improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines. Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance, underscoring the potential of transformer-guided diffusion models to advance the state of the art in tumor segmentation.
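For readers unfamiliar with the two evaluation metrics mentioned above, the following sketch gives their textbook definitions: the Dice coefficient on flattened binary masks and the symmetric Hausdorff distance on point sets. It is a plain reference implementation for small inputs, not the evaluation code used in the paper; returning 1.0 for two empty masks is a convention chosen here.

```python
import math


def dice(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for flat binary masks; 1.0 when both are empty."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if size == 0 else 2.0 * intersection / size


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))
```

Dice rewards volumetric overlap (higher is better), while Hausdorff distance penalizes the single worst boundary error (lower is better), which is why the two are usually reported together.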
[93] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques
Walid Rabehi,Marion Le Texier,Rémi Lemoy
Main category: cs.CV
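TL;DR: This study develops a scalable deep learning pipeline that extracts urban footprints for 1925-1950 from historical maps of France, filling a gap in national-scale data. The key innovation is a dual-pass U-Net approach that handles the high complexity of historical maps. The final dataset reaches an overall accuracy of 73%.
Details
Motivation: The lack of urbanization data for France before the 1970s limits quantitative analysis. This study aims to fill that gap and provide national-scale data for research on historical urbanization.
Contribution: 1. Produces the first open-access, national-scale historical urban footprint dataset for France; 2. Designs a dual-pass U-Net approach that copes with the complexity of historical maps; 3. Openly releases the code, training datasets, and results.
Method: A dual-pass U-Net approach. The first pass generates a preliminary map and identifies areas of confusion (such as text and roads) to guide data augmentation. The second pass uses a refined dataset and the binarized output of the first model to reduce noise. A high-performance computing cluster processes 941 high-resolution tiles.
Result: The final dataset has an overall accuracy of 73%, captures diverse urban patterns, and overcomes common artifacts such as labels and contour lines.
Insight: A dual-pass approach combined with targeted data augmentation improves accuracy on historical maps, and a high-performance computing cluster made processing at national scale practical.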
Abstract: Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.
[94] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos
Woowon Jang,Jiwon Im,Juseung Choi,Niki Rashidian,Wesley De Neve,Utku Ozbulak
Main category: cs.CV
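TL;DR: The paper analyzes the failure modes of SAM2 for point-based tracking in surgical videos. Point-based tracking works well for surgical tools but underperforms on anatomical targets because of tissue similarity and ambiguous boundaries.
Details
Motivation: Point-based tracking is an efficient, low-cost form of interaction for surgical video, but its reliability and failure modes in complex surgical environments are not well understood.
Contribution: Systematically analyzes the failure modes of point-based tracking in laparoscopic cholecystectomy videos and compares point-based tracking with segmentation mask initialization.
Method: The study focuses on three surgical targets (gallbladder, grasper, and L-hook electrocautery) and uses qualitative analysis to reveal the key factors that influence tracking outcomes.
Result: Point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, mainly because of tissue similarity and ambiguous boundaries.
Insight: The paper offers actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.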
Abstract: Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.
[95] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation
Ding-Ruei Shen
Main category: cs.CV
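TL;DR: The paper proposes FRIEREN, a federated learning framework that integrates vision and language modalities and uses CLIP text embeddings to improve how semantic segmentation generalizes to unlabeled client data.
Details
Motivation: Existing federated learning methods usually assume that client data is labeled or fail to exploit modern Vision Foundation Models (VFMs), whereas client data in practice is often unlabeled.
Contribution: Proposes a new task, FFREEDG, and designs the FRIEREN framework, which uses a Vision-Language decoder and a weak-to-strong consistency learning strategy to tackle federated semantic segmentation with unlabeled data.
Method: A Vision-Language decoder guided by CLIP-based text embeddings improves semantic disambiguation, and a weak-to-strong consistency learning strategy enables robust local training on pseudo-labels.
Result: On synthetic-to-real and clear-to-adverse-weather benchmarks, FRIEREN achieves performance competitive with established domain generalization and adaptation methods.
Insight: Combining vision and language modalities can substantially help federated learning when client data is unlabeled, and sets a baseline for future research.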
Abstract: Federated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server’s labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.
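Weak-to-strong consistency training typically derives pseudo-labels from predictions on a weakly augmented view and uses them to supervise the strongly augmented view. The sketch below shows only the pseudo-label selection step under a confidence threshold. The thresholding rule and the value 0.95 are assumptions borrowed from common semi-supervised practice; the abstract does not say how FRIEREN filters its pseudo-labels.

```python
def pseudo_labels(weak_probs, threshold=0.95):
    """Per pixel: the argmax class if its probability reaches the threshold, else None.

    weak_probs is a list of per-pixel class-probability lists predicted on the
    weakly augmented view. Pixels labeled None would be ignored in the loss
    computed on the strongly augmented view.
    """
    labels = []
    for probs in weak_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else None)
    return labels
```

Dropping low-confidence pixels trades label coverage for label quality, which matters when clients never see ground truth.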
[96] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
Shu Zou,Xinyu Tian,Lukas Wesemann,Fabian Waschkowski,Zhaoyuan Yang,Jing Zhang
Main category: cs.CV
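TL;DR: ASK-Hint is a structured prompting framework based on action-centric knowledge. Its fine-grained prompts improve the accuracy, interpretability, and generalization of video anomaly detection.
Details
Motivation: Prompts used in existing video anomaly detection methods are overly abstract and overlook fine-grained human-object interactions and action semantics, which hurts detection of complex anomalies.
Contribution: Proposes the ASK-Hint framework, which uses action-centric knowledge to design fine-grained prompts and improves the performance of frozen vision-language models on anomaly detection.
Method: Organizes prompts into semantically coherent groups and formulates fine-grained guiding questions that align model predictions with discriminative visual cues.
Result: Improves AUC on the UCF-Crime and XD-Violence datasets and demonstrates generalization across datasets and VLM backbones as well as interpretability.
Insight: Prompt granularity is critical to video anomaly detection performance, and ASK-Hint offers a training-free, generalizable solution.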
Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces for detected anomalies and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
[97] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
Weijia Dou,Xu Zhang,Yi Bin,Jian Liu,Bo Peng,Guoqing Wang,Yang Yang,Heng Tao Shen
Main category: cs.CV
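TL;DR: GeoPurify is a data-efficient geometric distillation framework. It transfers features from 2D vision-language models to 3D segmentation and purifies them with geometric priors, sharply reducing the training data required while improving performance.
Details
Motivation: Existing methods that transfer 2D vision-language model features to 3D semantic segmentation produce noisy, fragmented predictions, while enforcing geometric coherence requires large-scale annotated data and costly training. GeoPurify aims to resolve this trade-off and improve data efficiency.
Contribution: Proposes the GeoPurify framework, which uses geometric priors to purify 3D point features generated by 2D vision-language models, and designs a Geometry-Guided Pooling module.
Method: GeoPurify trains a small Student Affinity Network with geometric priors distilled from a 3D self-supervised teacher model. At inference, a Geometry-Guided Pooling module further denoises the point cloud and enforces semantic and structural consistency.
Result: On major 3D benchmarks, GeoPurify matches or surpasses state-of-the-art performance while using only about 1.5% of the training data.
Insight: Geometric cues are not lost during 2D-to-3D feature transfer but remain latent in the noisy features. GeoPurify exploits them for effective denoising and better data efficiency.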
Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at https://github.com/tj12323/GeoPurify.
[98] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications
Emmanuel Nsengiyumva,Leonard Niyitegeka,Eric Umuhoza
Main category: cs.CV
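TL;DR: The paper proposes a non-invasive pig identification method based on auricular (ear) vein pattern recognition, using computer vision and machine learning. It is aimed in particular at mixed-breed pigs on small-scale farms.
Details
Motivation: Conventional pig identification methods such as ear tags and microchips are costly, fragile, and poorly suited to mixed breeds, so a low-cost, reliable, non-invasive alternative is needed.
Contribution: Proposes a biometric identification method based on auricular vein patterns that works for mixed-breed pigs and demonstrates its feasibility for real-time farm deployment.
Method: A multistage computer vision pipeline enhances vein visibility and extracts structural features, which are then classified with machine learning models. A Support Vector Machine (SVM) performs best.
Result: The SVM identifies pigs with 98.12% precision, and the full process takes 8.3 seconds on average.
Insight: Auricular vein patterns are a permanent biological marker and can offer resource-constrained farming communities a low-cost route to precision farming.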
Abstract: Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable and costly, target pure breeds, and are thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages the uniqueness of auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy, correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.
[99] MMDEW: Multipurpose Multiclass Density Estimation in the Wild
Villanelle O’Reilly,Jonathan Cox,Georgios Leontidis,Marc Hanheide,Petra Bosilj,James Brown
Main category: cs.CV
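TL;DR: MMDEW is a multi-class density estimation framework. A Twins pyramid vision transformer backbone and a multiscale decoding approach improve counting in dense, occluded scenes, and the method shows promise in areas such as ecological monitoring.
Details
Motivation: Counting-by-detection methods perform poorly in dense and occluded scenes, so a multi-class density estimation method is needed.
Contribution: Proposes a multi-class counting framework that combines a Twins transformer backbone with multiscale decoding and adds a Category Focus Module to suppress cross-talk between categories.
Method: Uses a Twins pyramid vision transformer as the backbone, a multiscale multi-class counting head, and a segmentation-based Category Focus Module in a two-task training design.
Result: Outperforms prior multi-class crowd-counting methods on the VisDrone and iSAID benchmarks (33%, 43%, and 64% reductions in MAE) and is also applied to a biodiversity monitoring dataset.
Insight: Multi-class density estimation is not limited to crowd counting. It extends to other domains such as biodiversity monitoring.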
Abstract: Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reduction in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method’s regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
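In density-map counting, the predicted count for a class is the sum of that class's density map, and MAE compares predicted and true counts. The sketch below illustrates this for a single image with one map per class. The helper names are hypothetical, and averaging over classes of one image is a simplification: benchmarks usually average the absolute error over images.

```python
def class_counts(density_maps):
    """Sum each per-class density map (a 2D list) to obtain a per-class count."""
    return [sum(sum(row) for row in dmap) for dmap in density_maps]


def mean_absolute_error(predicted, actual):
    """Mean of |predicted - actual| over paired counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

Because counts come from integrating a map rather than from discrete boxes, heavily overlapping objects still contribute their share of density, which is the motivation for this approach in dense scenes.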
[100] TempoControl: Temporal Attention Guidance for Text-to-Video Models
Shira Schiber,Ofir Lindenbaum,Idan Schwartz
Main category: cs.CV
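TL;DR: TempoControl is a new method that temporally aligns visual concepts in text-to-video generation by optimizing cross-attention maps, without retraining or additional supervision.
Details
Motivation: Current generative video models produce high-quality videos from text prompts but lack fine-grained temporal control: users cannot specify when particular visual elements should appear in the generated sequence.
Contribution: Proposes TempoControl, which steers cross-attention maps by aligning their temporal shape with a control signal (correlation), amplifying attention where visibility is needed (energy), and maintaining spatial focus (entropy), giving precise control over timing.
Method: Uses the cross-attention maps of text-to-video diffusion models and optimizes them at inference according to three complementary principles (correlation, energy, and entropy) to guide when visual concepts appear.
Result: TempoControl is effective across several applications, including temporal reordering of single and multiple objects and action- and audio-aligned generation, while preserving video quality and diversity.
Insight: Cross-attention maps are a useful handle for temporal control. Optimizing attention enables fine-grained control over the timing of generated content.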
Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
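The three quantities named above (correlation, energy, entropy) have standard definitions that can be written down directly. The sketch below computes them on plain lists: a per-frame attention signal, a control signal, and one frame's spatial attention. It is a generic illustration only. The abstract does not give the paper's loss formulation, so the function names, the masking rule in `masked_energy`, and how the terms would be combined are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between per-frame attention x and control signal y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def masked_energy(temporal_attn, control):
    """Mean attention over the frames where the control signal is active (> 0)."""
    selected = [a for a, c in zip(temporal_attn, control) if c > 0]
    return sum(selected) / len(selected) if selected else 0.0


def spatial_entropy(attn_frame):
    """Shannon entropy of one frame's attention, normalized to a distribution."""
    total = sum(attn_frame)
    if total <= 0:
        return 0.0
    probs = [a / total for a in attn_frame if a > 0]
    return -sum(p * math.log(p) for p in probs)
```

Under this reading, an inference-time objective would raise the correlation and the masked energy while lowering the entropy, so that attention follows the control signal in time and stays spatially focused.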
[101] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
Sicheng Feng,Kaiwen Tuo,Song Wang,Lingdong Kong,Jianke Zhu,Huan Wang
Main category: cs.CV
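TL;DR: RewardMap tackles sparse rewards in fine-grained visual reasoning with multi-stage reinforcement learning. It proposes a difficulty-aware reward design and a multi-stage training scheme that improve a model's visual understanding and reasoning.
Details
Motivation: Fine-grained visual reasoning is a core challenge for multimodal large language models (MLLMs). In spatial reasoning tasks in particular, sparse rewards and unstable optimization impede standard reinforcement learning.
Contribution: 1) Constructs the ReasonMap-Plus dataset, which provides dense reward signals; 2) Proposes the RewardMap framework, comprising a difficulty-aware reward design and a multi-stage training strategy.
Method: 1) A difficulty-aware reward design with detail rewards that addresses reward sparsity; 2) A multi-stage RL scheme that moves from simple perception tasks to complex reasoning tasks.
Result: Experiments on ReasonMap and ReasonMap-Plus show that each component of RewardMap yields consistent gains and that their combination performs best. Models trained with RewardMap improve by 3.47% on average across 6 benchmarks.
Insight: Dense rewards and multi-stage training are effective for fine-grained visual reasoning and provide a better cold-start strategy than conventional supervised fine-tuning.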
Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
[102] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Zihan Zhou,Shilin Lu,Shuli Leng,Shaocong Zhang,Zhuming Lian,Xinlei Yu,Adams Wai-Kin Kong
Main category: cs.CV
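TL;DR: DragFlow harnesses DiT priors for drag-based image editing through region-based supervision and affine transformations, outperforming existing baselines.
Details
Motivation: Existing drag-based image editing methods distort the target region when the base model's prior is too weak, while the stronger priors of DiT models have not yet been exploited for this task.
Contribution: Proposes DragFlow, the first drag-editing framework to effectively harness FLUX's prior, introducing region-based supervision and affine transformations to improve performance.
Method: Adopts a region-based editing paradigm that combines affine transformations with gradient mask-based hard constraints, and integrates pretrained personalization adapters and MLLMs to resolve task ambiguity.
Result: Surpasses point-based and region-based baselines on DragBench-DR and ReD Bench, setting a new state of the art in drag-based editing.
Insight: DiT features are less compressed and less structured than UNet features, so direct point-wise supervision is unreliable, whereas region-based supervision exploits the stronger prior more consistently.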
Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX’s rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
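The two mechanisms named in the abstract, affine transformation of a region and a gradient mask used as a hard constraint, can be illustrated with a toy sketch. It works on plain Python lists rather than DiT latents, and the function names `affine_apply` and `masked_update` are hypothetical; this is not the DragFlow implementation.

```python
def affine_apply(points, matrix, offset):
    """Map 2D points through x' = A x + t, with A given as ((a, b), (c, d))."""
    (a, b), (c, d) = matrix
    tx, ty = offset
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


def masked_update(latent, grad, mask, lr):
    """One gradient step where mask == 0 freezes an entry (hard constraint)."""
    # Entries with mask 0 (background, by assumption) receive no update at all.
    return [z - lr * g * m for z, g, m in zip(latent, grad, mask)]
```

Moving a whole region with one affine map supervises many feature locations consistently, while the mask guarantees that background entries are left exactly as they were.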
[103] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding
Guangyu Sun,Archit Singhal,Burak Uzkent,Mubarak Shah,Chen Chen,Garin Kessler
Main category: cs.CV
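TL;DR: The paper proposes F2C, which extends key-frame selection to key-clip selection to improve long-form video understanding, and uses an adaptive resolution strategy to keep the computational budget fixed.
Details
Motivation: Existing video large language models (VLMs) are constrained by their context window because of the large number of visual tokens, and sparse frame selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity.
Contribution: (1) Proposes F2C, which extends selection from key frames to key clips; (2) designs an adaptive resolution strategy that balances spatial resolution against clip length; (3) validates the method on three long-form video benchmarks.
Method: (1) Extends sparse frame selection to the selection of short, temporally coherent clips; (2) dynamically adjusts spatial resolution and clip length to maintain a constant token count.
Result: On Video-MME, LongVideoBench, and MLVU, F2C outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3%, respectively.
Insight: Preserving temporal coherence is important for video understanding; F2C offers a practical, training-free solution for large-scale video applications.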
Abstract: Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the “needle in a haystack” problem: the massive number of visual tokens produced from raw video frames exhausts the model’s context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. The project webpage is available at https://guangyusun.com/f2c .
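The trade-off between clip length and spatial resolution under a constant token budget can be worked through with simple arithmetic. The sketch below assumes a ViT-style encoder that emits one token per 14x14 patch with no token merging; the patch size, the function names, and the square-root scaling rule are assumptions for illustration, not details taken from F2C.

```python
import math


def tokens_per_frame(height, width, patch=14):
    """Tokens for one frame, assuming one token per patch x patch tile."""
    return (height // patch) * (width // patch)


def resolution_for_budget(budget, num_clips, clip_len, height, width, patch=14):
    """Pick a frame size so num_clips * clip_len frames fit the token budget."""
    per_frame = budget // (num_clips * clip_len)   # token allowance per frame
    full = tokens_per_frame(height, width, patch)  # tokens at full resolution
    # Token count scales with area, so side length scales with the square root.
    scale = min(1.0, math.sqrt(per_frame / full))
    # Round each side down to a multiple of the patch size (at least one patch).
    new_h = max(patch, int(height * scale) // patch * patch)
    new_w = max(patch, int(width * scale) // patch * patch)
    return new_h, new_w
```

For example, with a budget of 8192 tokens and 8 selected positions, single 448x448 frames use the whole budget, while 4-frame clips at the same positions require halving each side to 224x224 to stay within it.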
[104] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
Mario Medrano-Paredes,Carmen Fernández-González,Francisco-Javier Díaz-Pernas,Hichem Saoudi,Javier González-Alonso,Mario Martínez-Zarzuela
Main category: cs.CV
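TL;DR: This paper compares monocular-video 3D human pose estimation models against inertial measurement units (IMUs) on daily activities performed by healthy subjects, finds that MotionAGFormer performs best, and discusses the trade-offs between the two technologies in cost, accessibility, and precision.
Details
Motivation: To assess how accurately monocular video and IMU sensors capture human movement in real-world settings, in support of reliable tools for telemedicine, sports science, and rehabilitation.
Contribution: Uses the VIDIMU dataset to compare several 3D pose estimation models against IMUs, clarifying their strengths, weaknesses, and suitable use cases.
Method: On the VIDIMU dataset, joint angles estimated by MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack are compared against joint angles computed from IMU data with OpenSim inverse kinematics.
Result: MotionAGFormer performs best, with an overall RMSE of 9.27° ± 4.80°, MAE of 7.86° ± 4.18°, Pearson correlation of 0.86 ± 0.15, and R² of 0.67 ± 0.28.
Insight: Both monocular video and IMUs are viable for out-of-the-lab kinematic assessment, subject to trade-offs in cost, precision, and ease of use; MotionAGFormer stands out in healthy adults and is a promising option for remote monitoring.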
Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27^\circ \pm 4.80^\circ$) and MAE ($7.86^\circ \pm 4.18^\circ$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
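For readers less familiar with the four agreement metrics reported above, the following sketch computes them for a single joint-angle series using their standard textbook definitions. How the study aggregates values across joints, activities, and subjects is not described in the abstract, so that step is left out; the function names are illustrative.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between video-based and IMU-based angles."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mae(pred, ref):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def pearson(pred, ref):
    """Pearson correlation coefficient."""
    mp, mr = sum(pred) / len(pred), sum(ref) / len(ref)
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_r = sum((r - mr) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


def r_squared(pred, ref):
    """Coefficient of determination, treating the IMU series as reference."""
    mr = sum(ref) / len(ref)
    ss_res = sum((r - p) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - mr) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot
```

Note that a high Pearson correlation can coexist with a lower R², because correlation ignores constant offsets and scale errors while R² penalizes them.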
[105] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes
Shiyi Zhang,Dong Liang,Yihang Zhou
Main category: cs.CV
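TL;DR: NeuroSwift is a lightweight cross-subject framework that combines AutoKL and CLIP adapters to reconstruct visual stimuli from fMRI data, with particular strength in cross-subject reconstruction of complex scenes.
Details
Motivation: Existing methods for cross-subject fMRI reconstruction suffer from low accuracy and high computational cost, owing to inter-subject variability in neural representations and the brain's abstract encoding of semantics.
Contribution: Proposes the NeuroSwift framework, which combines AutoKL and CLIP adapters for efficient and accurate cross-subject visual reconstruction and trains quickly on lightweight GPUs.
Method: AutoKL extracts low-level features, and a CLIP adapter emulates higher visual cortex encoding; a pretrain-then-fine-tune strategy adapts to a new subject by fine-tuning only 17% of the parameters.
Result: Reaches state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090s), outperforming existing methods.
Insight: A modular design with partial-parameter fine-tuning improves cross-subject generalization and lowers computational cost.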
Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain’s abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift’s CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.
[106] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
Sathira Silva,Eman Ali,Chetan Arora,Muhammad Haris Khan
Main category: cs.CV
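TL;DR: microCLIP is an unsupervised self-training framework that improves CLIP on fine-grained image classification through coarse-fine token fusion, using Saliency-Oriented Attention Pooling (SOAP) and Dynamic Knowledge Aggregation to raise classification accuracy.
Details
Motivation: CLIP is limited on fine-grained classification because it relies on global features and overlooks local detail. Existing methods align language-model descriptions with CLIP's [CLS] token but lack spatial precision. microCLIP aims to address these limitations.
Contribution: (1) A TokenFusion module that fuses the coarse [CLS] token with a fine-grained [FG] token; (2) Saliency-Oriented Attention Pooling (SOAP); (3) a two-headed classifier and Dynamic Knowledge Aggregation, which stabilize training.
Method: (1) SOAP builds the fine-grained [FG] token; (2) Dynamic Knowledge Aggregation and a two-headed classifier (one frozen head, one learnable head) refine pseudo-labels; (3) coarse-fine token alignment refines CLIP's visual and textual representations.
Result: Average accuracy improves by 2.90% across 13 fine-grained benchmarks, with only light adaptation.
Insight: Combining coarse and fine-grained features with Dynamic Knowledge Aggregation improves CLIP on fine-grained tasks while keeping adaptation lightweight and stable.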
Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
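The phrase "convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits" can be read as a weighted average whose weights are non-negative and sum to one. The sketch below shows only that reading; the single mixing weight `alpha`, its schedule, and the function names are assumptions, and the paper's actual aggregation rule may differ.

```python
def aggregate(prior_logits, fused_logits, alpha):
    """Convex combination: alpha * fixed prior + (1 - alpha) * evolving logits."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return [alpha * p + (1.0 - alpha) * f
            for p, f in zip(prior_logits, fused_logits)]


def pseudo_label(logits):
    """Index of the highest-scoring class, used as the pseudo-label."""
    return max(range(len(logits)), key=logits.__getitem__)
```

With `alpha` close to 1 the pseudo-labels follow the frozen prior, and as `alpha` decreases they increasingly follow the adapted model, which is one plausible way to refine labels iteratively without drifting early in training.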
[107] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
Kyoungjun Park,Yifan Yang,Juheon Yi,Shicheng Zheng,Yifei Shen,Dongqi Han,Caihua Shan,Muhammad Muaz,Lili Qiu
Main category: cs.CV
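TL;DR: VidGuard-R1 is a video authenticity detector built on a multimodal large language model (MLLM) and reinforcement learning (RL); fine-tuned with GRPO, it classifies AI-generated videos with high accuracy and provides interpretable reasoning.
Details
Motivation: With the rapid advance of AI video generation, society faces risks such as misinformation and reputational harm, and needs detection tools that are both accurate and interpretable.
Contribution: Presents the first video authenticity detector that fine-tunes an MLLM with GRPO, and curates a challenging dataset of 140k videos.
Method: Fine-tunes Qwen-VL with GRPO, using two specialized reward models that target temporal artifacts and generation complexity.
Result: Achieves state-of-the-art zero-shot performance on existing benchmarks, exceeds 95% accuracy after additional training, and produces precise explanations.
Insight: Interpretability is a key requirement for detection tools, and combining GRPO with multimodal models offers a new direction for detecting AI-generated content.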
Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.
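Group relative policy optimization scores each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value baseline. A minimal sketch of that normalization step follows; how VidGuard-R1 combines its two reward models into one scalar per response is not stated in the abstract, so the rewards are taken as given.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

A group in which every response earns the same reward yields all-zero advantages and therefore no learning signal, which is one reason a dataset designed for high discrimination difficulty matters for this kind of training.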
[108] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Justin Cui,Jie Wu,Ming Li,Tao Yang,Xiaojie Li,Rui Wang,Andrew Bai,Yuanhao Ban,Cho-Jui Hsieh
Main category: cs.CV
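TL;DR: The paper proposes Self-Forcing++, which mitigates quality degradation in long video generation without supervision from long-video teachers or retraining on long-video datasets, and generates high-quality videos up to 4 minutes and 15 seconds long.
Details
Motivation: Diffusion models excel at image and video generation, but extending them to long videos is computationally prohibitive, and existing autoregressive methods degrade in quality because errors accumulate.
Contribution: Proposes Self-Forcing++, which uses the teacher model's knowledge to guide the student model in long video generation, improving the quality and consistency of long videos.
Method: Segments sampled from the student's self-generated long videos are used to obtain guidance from the teacher model, which avoids error accumulation and the recomputation of overlapping frames.
Result: Experiments show that the method generates videos up to 4 minutes and 15 seconds long and substantially outperforms baselines in both fidelity and consistency.
Insight: Local guidance from a teacher model on sampled segments can improve the global consistency of long videos and curb error accumulation.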
Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher’s capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
[109] Learning to Generate Object Interactions with Physics-Guided Video Diffusion
David Romero,Ariana Bermudez,Hao Li,Fabio Pizzati,Ivan Laptev
Main category: cs.CV
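TL;DR: KineMask is a physics-guided video generation method that improves the physical plausibility of object interactions through a two-stage training strategy with mask supervision, combines low-level motion control with high-level text conditioning, and outperforms comparable models.
Details
Motivation: Existing video generation models struggle with physically plausible object interactions and lack physics-grounded control, which limits their use in robotics and decision-making simulation.
Contribution: Proposes KineMask, a physics-guided video generation method that uses mask supervision and a two-stage training strategy to improve the physical plausibility of object interactions and to support the synthesis of complex dynamical phenomena.
Method: A two-stage training strategy gradually removes the future-motion supervision supplied through object masks, and video diffusion models (VDMs) are trained with both low-level motion control and high-level text conditioning.
Result: Trained on synthetic scenes, KineMask improves object interactions in real scenes and outperforms recent models of comparable size.
Insight: Low-level motion control and high-level text conditioning play complementary roles in video diffusion models, jointly improving the physical plausibility and complexity of generated videos.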
Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
[110] MultiModal Action Conditioned Video Generation
Yichen Li,Antonio Torralba
Main category: cs.CV
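TL;DR: The paper proposes multimodal action-conditioned video generation: it introduces fine-grained multisensory signals (proprioception, kinesthesia, force haptics, and others) to address the lack of fine-grained control in existing video models, and proposes feature learning and regularization schemes that improve simulation accuracy and temporal stability.
Details
Motivation: Existing video models lack fine-grained control and cannot meet the real-time fine motor control needs of general-purpose household robots. The paper uses multimodal sensory signals to capture precise control and make simulation more fine-grained and practical.
Contribution: (1) Introduces fine-grained multimodal action data (proprioception, kinesthesia, and others); (2) proposes a feature learning paradigm that aligns the modalities; (3) designs a regularization scheme that enhances the causality of action trajectory features.
Method: (1) Integrates multimodal sensory signals such as force haptics and muscle activation; (2) aligns the modalities with a feature learning paradigm; (3) regularizes the action trajectory features.
Result: Experiments show that multimodal sensory signals improve simulation accuracy and reduce temporal drift; extensive ablation studies and downstream applications support the method's effectiveness.
Insight: Multimodal sensory signals are key to simulating fine-grained interactions, and feature alignment and stronger causality are important for making video generation models practical.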
Abstract: Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
[111] VideoNSA: Native Sparse Attention Scales Video Understanding
Enxin Song,Wenhao Chai,Shusheng Yang,Ethan Armand,Xiaojun Shan,Haiyang Xu,Jianwen Xie,Zhuowen Tu
Main category: cs.CV
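TL;DR: VideoNSA applies Native Sparse Attention (NSA) to address the context-length limit in video understanding, using a hardware-aware hybrid attention scheme that improves long-video understanding and spatio-temporal reasoning.
Details
Motivation: Because of limited context length, existing video-language models often miss key transition frames and struggle to stay coherent over long time scales.
Contribution: (1) Proposes VideoNSA, which adapts NSA to video-language models; (2) uses a hardware-aware hybrid attention scheme with dense attention for text and NSA for video; (3) trains end to end on a 216K video instruction dataset.
Method: (1) Builds on Qwen2.5-VL; (2) applies hardware-aware hybrid attention (NSA for video, dense attention for text); (3) trains end to end on a video instruction dataset.
Result: (1) Scales reliably to 128K tokens; (2) outperforms baselines on long-video understanding, temporal reasoning, and spatial benchmarks; (3) identifies an optimal global-local attention allocation.
Insight: (1) Branch usage patterns depend on the task; (2) learnable sparse attention helps induce dynamic attention sinks; (3) the global-local attention allocation matters under a fixed budget.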
Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
[112] Inferring Dynamic Physical Properties from Video Foundation Models
Guanqi Zhan,Xianzheng Ma,Weidi Xie,Andrew Zisserman
Main category: cs.CV
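TL;DR: The paper studies the prediction of dynamic physical properties from video, namely elasticity, viscosity, and dynamic friction; it introduces new datasets and methods and compares the performance of different models.
Details
Motivation: Predicting dynamic physical properties requires temporal information, and traditional methods struggle to extract these properties directly from video, so new approaches are needed.
Contribution: (i) Collects new video datasets containing synthetic and real data; (ii) explores three ways to infer physical properties from video; (iii) compares pretrained video generative models, self-supervised models, and multimodal large language models.
Method: (a) An oracle method that supplies the relevant visual cues using classical computer vision techniques; (b) a readout mechanism that uses a visual prompt and a trainable prompt vector for cross-attention on pretrained video models; (c) prompting strategies for multimodal large language models.
Result: Video foundation models trained generatively or with self-supervision perform similarly, though behind the oracle; multimodal large language models are currently weaker but improve with suitable prompting.
Insight: Video foundation models show potential for predicting dynamic physical properties but leave room for improvement, especially multimodal large language models.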
Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
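As an example of a visual cue that "intrinsically reflects" a property, elasticity can be read from a bounce trajectory: for a ball bouncing under gravity with negligible drag, the coefficient of restitution equals the square root of the ratio of successive peak heights. The sketch below assumes the peak heights have already been extracted by some tracker (hypothetical input); it illustrates the physics and is not the paper's oracle pipeline.

```python
import math


def restitution_from_peaks(peak_heights):
    """Estimate the coefficient of restitution e from successive bounce peaks.

    Uses e = sqrt(h_next / h_prev), which follows from v = sqrt(2 * g * h)
    just before and just after each impact, averaged over all bounces.
    """
    if len(peak_heights) < 2:
        raise ValueError("need at least two peak heights")
    if any(h <= 0 for h in peak_heights):
        raise ValueError("peak heights must be positive")
    ratios = [math.sqrt(nxt / prev)
              for prev, nxt in zip(peak_heights, peak_heights[1:])]
    return sum(ratios) / len(ratios)
```

Because only height ratios enter the estimate, the heights can be measured in pixels as long as the camera view is fixed and perspective distortion along the bounce is small.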
[113] Clink! Chop! Thud! – Learning Object Sounds from Real-World Interactions
Mengyu Yang,Yiming Chen,Haozheng Pei,Siddhant Agarwal,Arun Balajee Vasudevan,James Hays
Main category: cs.CV
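TL;DR: This paper introduces a new task, sounding object detection, which aims to learn object sounds from real-world interactions with a multimodal framework, and improves performance with object segmentation masks and a slot attention visual encoder.
Details
Motivation: Humans can tell from sound which objects are interacting; inspired by this, the paper aims for models that learn the same ability and distinguish the sounds produced by different object interactions.
Contribution: (1) Introduces the sounding object detection task; (2) develops an automatic pipeline for computing segmentation masks that focuses the model on the interaction region; (3) uses a slot attention visual encoder to enforce an object prior.
Method: (1) Automatically computed segmentation masks guide the model toward the interaction region; (2) a slot attention visual encoder is used to learn multimodal representations; (3) training uses in-the-wild egocentric videos.
Result: Achieves state-of-the-art performance on the new task and on existing multimodal action understanding tasks.
Insight: Combining object segmentation with slot attention improves the model's understanding of multimodal information.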
Abstract: Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model’s ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model’s focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state of the art performance on our new task along with existing multimodal action understanding tasks.
cs.HC [Back]
[114] Development and Evaluation of an AI-Driven Telemedicine System for Prenatal Healthcare
Juan Barrientos,Michaelle Pérez,Douglas González,Favio Reyna,Julio Fajardo,Andrea Lara
Main category: cs.HC
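TL;DR: Develops an AI-based telemedicine system that helps midwives acquire fetal images using blind sweep protocols and improves diagnostic efficiency through asynchronous specialist review.
Details
Motivation: Rural areas of low- and middle-income countries lack ultrasound diagnostic resources, which limits access to prenatal care.
Contribution: Proposes a human-in-the-loop AI system that combines a classification model with a web-based platform, helping non-experts acquire key fetal images and streamlining specialist review.
Method: The system uses blind sweep protocols and a classification model to identify key frames, and supports asynchronous specialist review through a web platform.
Result: The system performs well at identifying standard fetal planes, and a field evaluation indicates good usability and low cognitive workload.
Insight: AI-driven telemedicine systems can extend prenatal care in underserved regions while reducing the workload on specialists.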
Abstract: Access to obstetric ultrasound is often limited in low-resource settings, particularly in rural areas of low- and middle-income countries. This work proposes a human-in-the-loop artificial intelligence (AI) system designed to assist midwives in acquiring diagnostically relevant fetal images using blind sweep protocols. The system incorporates a classification model along with a web-based platform for asynchronous specialist reviews. By identifying key frames in blind sweep studies, the AI system allows specialists to concentrate on interpretation rather than having to review entire videos. To evaluate its performance, blind sweep videos captured by a small group of soft-trained midwives using a low-cost Point-of-Care Ultrasound (POCUS) device were analyzed. The system demonstrated promising results in identifying standard fetal planes from sweeps made by non-experts. A field evaluation indicated good usability and a low cognitive workload, suggesting that it has the potential to expand access to prenatal imaging in underserved regions.
q-bio.NC [Back]
[115] Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning
Kathy Garcia,Leyla Isik
Main category: q-bio.NC
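TL;DR: This paper examines whether pretrained video models capture the similarity structure that humans perceive in social videos, and fine-tunes a model on behavioral data to align it with human perception.
Details
Motivation: Humans intuitively perceive complex social signals in visual scenes, but it is unclear whether current AI models share this ability. The paper addresses this question and proposes a method for aligning models with human social judgments.
Contribution: (1) Introduces a new benchmark of over 49,000 similarity judgments; (2) identifies a modality gap, in which text embeddings align with human perception better than video models do; (3) proposes fine-tuning on human behavioral data with a hybrid triplet-RSA objective.
Method: A TimeSformer video model is fine-tuned with low-rank adaptation (LoRA) using a hybrid triplet-RSA objective that aligns the model with human similarity judgments.
Result: The fine-tuned model aligns significantly better with human perception on held-out videos and encodes social-affective attributes more strongly.
Insight: Pretrained video models fall short in social recognition, and behavior-guided fine-tuning can improve their social perception.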
Abstract: Humans intuitively perceive complex social signals in visual scenes, yet it remains unclear whether state-of-the-art AI models encode the same similarity structure. We study (Q1) whether modern video and language models capture human-perceived similarity in social videos, and (Q2) how to instill this structure into models using human behavioral data. To address this, we introduce a new benchmark of over 49,000 odd-one-out similarity judgments on 250 three-second video clips of social interactions, and discover a modality gap: despite the task being visual, caption-based language embeddings align better with human similarity than any pretrained video model. We close this gap by fine-tuning a TimeSformer video model on these human judgments with our novel hybrid triplet-RSA objective using low-rank adaptation (LoRA), aligning pairwise distances to human similarity. This fine-tuning protocol yields significantly improved alignment with human perceptions on held-out videos in terms of both explained variance and odd-one-out triplet accuracy. Variance partitioning shows that the fine-tuned video model increases shared variance with language embeddings and explains additional unique variance not captured by the language model. Finally, we test transfer via linear probes and find that human-similarity fine-tuning strengthens the encoding of social-affective attributes (intimacy, valence, dominance, communication) relative to the pretrained baseline. Overall, our findings highlight a gap in pretrained video models’ social recognition and demonstrate that behavior-guided fine-tuning shapes video representations toward human social perception.
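Odd-one-out triplet accuracy, one of the two alignment measures mentioned above, can be computed from model embeddings as follows: for each triplet, the model's odd item is the one left out of the most similar pair, and accuracy is the fraction of triplets where this matches the human choice. The use of cosine similarity and the data layout (a dict of embeddings and a list of `(triplet, human_choice)` pairs) are assumptions for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def odd_one_out(emb, triplet):
    """Return the item excluded from the most similar pair in the triplet."""
    a, b, c = triplet
    # Key = candidate odd item, value = similarity of the remaining pair.
    pair_sim = {
        c: cosine(emb[a], emb[b]),
        a: cosine(emb[b], emb[c]),
        b: cosine(emb[a], emb[c]),
    }
    return max(pair_sim, key=pair_sim.get)


def triplet_accuracy(emb, judgments):
    """Fraction of triplets where the model's odd item matches the human's."""
    hits = sum(1 for triplet, human in judgments
               if odd_one_out(emb, triplet) == human)
    return hits / len(judgments)
```

Chance level for this measure is one third, since each triplet has three candidate odd items.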
[116] Uncovering Semantic Selectivity of Latent Groups in Higher Visual Cortex with Mutual Information-Guided Diffusion
Yule Wang,Joseph Yu,Chengrui Li,Weihan Li,Anqi Wu
Main category: q-bio.NC
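TL;DR: The paper proposes MIG-Vis, which uses diffusion models with mutual information guidance to reveal how neural latent subspaces in higher visual cortex encode visual-semantic features, and validates their semantic selectivity experimentally.
Details
Motivation: Prior work analyzes neural coding mainly through representational alignment between artificial neural networks and the visual cortex, or through decoding; these approaches are indirect and do not reveal how neural populations are organized. The paper therefore asks how semantic features are distributed across neural populations in higher visual cortex and whether they form structured subspaces.
Contribution: (1) Proposes MIG-Vis, which combines a variational autoencoder with mutual information-guided diffusion synthesis to visualize the semantic features of neural latent subspaces; (2) validates the method on multi-session neural spiking data from macaque IT cortex, finding that neural latent groups are selective for diverse visual-semantic features such as object pose, inter-category transformations, and intra-class content.
Method: (1) A variational autoencoder infers the latent subspaces of neural populations; (2) a mutual information (MI)-guided diffusion synthesis procedure visualizes the specific visual-semantic features encoded by each latent group; (3) the method is validated on multi-session macaque IT cortex data.
Result: Experiments show that MIG-Vis identifies neural latent groups with clear semantic selectivity, for example for object pose, inter-category transformations, and intra-class content.
Insight: The method provides direct, interpretable evidence of structured semantic representation in higher visual cortex and advances understanding of its encoding principles.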
Abstract: Understanding how neural populations in higher visual areas encode object-centered visual information remains a central challenge in computational neuroscience. Prior works have investigated representational alignment between artificial neural networks and the visual cortex. Nevertheless, these findings are indirect and offer limited insights to the structure of neural populations themselves. Similarly, decoding-based methods have quantified semantic features from neural populations but have not uncovered their underlying organizations. This leaves open a scientific question: “how feature-specific visual information is distributed across neural populations in higher visual areas, and whether it is organized into structured, semantically meaningful subspaces.” To tackle this problem, we present MIG-Vis, a method that leverages the generative power of diffusion models to visualize and validate the visual-semantic attributes encoded in neural latent subspaces. Our method first uses a variational autoencoder to infer a group-wise disentangled neural latent subspace from neural populations. Subsequently, we propose a mutual information (MI)-guided diffusion synthesis procedure to visualize the specific visual-semantic features encoded by each latent group. We validate MIG-Vis on multi-session neural spiking datasets from the inferior temporal (IT) cortex of two macaques. The synthesized results demonstrate that our method identifies neural latent groups with clear semantic selectivity to diverse visual features, including object pose, inter-category transformations, and intra-class content. These findings provide direct, interpretable evidence of structured semantic representation in the higher visual cortex and advance our understanding of its encoding principles.
q-bio.QM [Back]
[117] A Multicentric Dataset for Training and Benchmarking Breast Cancer Segmentation in H&E Slides
Carlijn Lems,Leslie Tessier,John-Melle Bokhorst,Mart van Rijthoven,Witali Aswolinskiy,Matteo Pozzi,Natalie Klubickova,Suzanne Dintzis,Michela Campora,Maschenka Balkenhol,Peter Bult,Joey Spronck,Thomas Detone,Mattia Barbareschi,Enrico Munari,Giuseppe Bogina,Jelle Wesseling,Esther H. Lips,Francesco Ciompi,Frédérique Meeuwsen,Jeroen van der Laak
Main category: q-bio.QM
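TL;DR: This paper introduces the BEETLE dataset for semantic segmentation of H&E-stained breast cancer slides. It contains 587 biopsies and resections covering all molecular subtypes and histological grades, with annotations for four classes collected through diverse annotation strategies. Its diversity and external evaluation set support standardized model benchmarking.
Details
Motivation: Existing public datasets for breast cancer segmentation lack morphological diversity, which limits model generalizability and the robustness of biomarker validation. To address this, the authors introduce a multicentric, diverse dataset.
Contribution: The main contribution is the BEETLE dataset, which covers a wide range of breast cancer subtypes and morphologies and provides high-quality multiclass annotations and an external evaluation set to support the development and benchmarking of breast cancer segmentation models.
Method: 587 cases were collected from three clinical centers and two public datasets and digitized with seven scanners; diverse annotation strategies were used to annotate four tissue classes, including invasive and non-invasive epithelium.
Result: The dataset provides diverse samples and multiclass annotations, with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ.
Insight: The dataset's diversity and standardized annotations offer a more reliable benchmark for automated biomarker analysis in breast cancer and can improve cross-center generalization.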
Abstract: Automated semantic segmentation of whole-slide images (WSIs) stained with hematoxylin and eosin (H&E) is essential for large-scale artificial intelligence-based biomarker analysis in breast cancer. However, existing public datasets for breast cancer segmentation lack the morphological diversity needed to support model generalizability and robust biomarker validation across heterogeneous patient cohorts. We introduce BrEast cancEr hisTopathoLogy sEgmentation (BEETLE), a dataset for multiclass semantic segmentation of H&E-stained breast cancer WSIs. It consists of 587 biopsies and resections from three collaborating clinical centers and two public datasets, digitized using seven scanners, and covers all molecular subtypes and histological grades. Using diverse annotation strategies, we collected annotations across four classes - invasive epithelium, non-invasive epithelium, necrosis, and other - with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells. The dataset’s diversity and relevance to the rapidly growing field of automated biomarker quantification in breast cancer ensure its high potential for reuse. Finally, we provide a well-curated, multicentric external evaluation set to enable standardized benchmarking of breast cancer segmentation models.
cs.RO [Back]
[118] VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation
Arthur Zhang,Xiangyun Meng,Luca Calliari,Dong-Ki Kim,Shayegan Omidshafiei,Joydeep Biswas,Ali Agha,Amirreza Shaban
Main category: cs.RO
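TL;DR: VENTURA fine-tunes a pretrained diffusion model to generate visual path masks and pairs it with a lightweight behavior-cloning policy, enabling diverse robot navigation from natural language instructions and substantially improving task performance and generalization.
Details
Motivation: Robots must adapt to diverse human instructions and operate safely in open environments. Existing vision-language models are hard to apply directly to navigation because differences in action spaces and pretraining objectives hamper transfer. VENTURA aims to address this. Contribution: 1. Proposes VENTURA, a system that uses a diffusion model to generate visual path masks (i.e. visual plans) and a behavior-cloning policy to carry out navigation; 2. Uses self-supervised tracking models and VLM-augmented captions to produce training labels, avoiding costly manual annotation; 3. Validates performance in real-world environments, clearly outperforming existing methods.
Method: 1. Fine-tune a pretrained diffusion model to generate visual masks of navigation paths; 2. Use a lightweight behavior-cloning policy to turn the visual masks into executable trajectories; 3. Train on data derived from self-supervised tracking models and VLM augmentation.
Result: In real-world evaluations, VENTURA improves success rates by 33% and reduces collisions by 54% relative to state-of-the-art baselines, and it generalizes to unseen task combinations.
Insight: Diffusion models can generate high-level visual plans without directly predicting low-level actions; combining visual masks with behavior cloning gives a flexible interface for multi-task navigation.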
Abstract: Robots must adapt to diverse human instructions and operate safely in unstructured, open-world environments. Recent Vision-Language models (VLMs) offer strong priors for grounding language and perception, but remain difficult to steer for navigation due to differences in action spaces and pretraining objectives that hamper transferability to robotics tasks. Towards addressing this, we introduce VENTURA, a vision-language navigation system that finetunes internet-pretrained image diffusion models for path planning. Instead of directly predicting low-level actions, VENTURA generates a path mask (i.e. a visual plan) in image space that captures fine-grained, context-aware navigation behaviors. A lightweight behavior-cloning policy grounds these visual plans into executable trajectories, yielding an interface that follows natural language instructions to generate diverse robot behaviors. To scale training, we supervise on path masks derived from self-supervised tracking models paired with VLM-augmented captions, avoiding manual pixel-level annotation or highly engineered data collection setups. In extensive real-world evaluations, VENTURA outperforms state-of-the-art foundation model baselines on object reaching, obstacle avoidance, and terrain preference tasks, improving success rates by 33% and reducing collisions by 54% across both seen and unseen scenarios. Notably, we find that VENTURA generalizes to unseen combinations of distinct tasks, revealing emergent compositional capabilities. Videos, code, and additional materials: https://venturapath.github.io
[119] DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis
Jialin Gao,Donghao Zhou,Mingjian Liang,Lihao Liu,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng
Main category: cs.RO
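TL;DR: DisCo-Layout proposes a novel multi-agent framework that generates 3D indoor layouts by disentangling and coordinating semantic and physical refinement, addressing the poor generalization of traditional methods and achieving state-of-the-art performance in experiments.
Details
Motivation: Traditional 3D indoor layout synthesis methods are limited by fixed datasets and generalize poorly; LLM- and VLM-based methods are semantically richer but lack flexible, robust refinement, which leads to suboptimal layouts. Contribution: Proposes the DisCo-Layout framework, which separates semantic refinement (SRT) from physical refinement (PRT) and uses multi-agent collaboration to generate high-quality layouts.
Method: The Semantic Refinement Tool (SRT) corrects abstract object relationships, and the Physical Refinement Tool (PRT) resolves spatial issues via a grid-matching algorithm; a multi-agent framework consisting of a planner, a designer, and an evaluator coordinates the refinement process.
Result: Experiments show that DisCo-Layout generates realistic, coherent, and generalizable layouts and reaches state-of-the-art performance.
Insight: Separating semantic and physical refinement and combining them through multi-agent collaboration is an effective approach to layout synthesis that improves both output quality and flexibility.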
Abstract: 3D indoor layout synthesis is crucial for creating virtual environments. Traditional methods struggle with generalization due to fixed datasets. While recent LLM and VLM-based approaches offer improved semantic richness, they often lack robust and flexible refinement, resulting in suboptimal layouts. We develop DisCo-Layout, a novel framework that disentangles and coordinates physical and semantic refinement. For independent refinement, our Semantic Refinement Tool (SRT) corrects abstract object relationships, while the Physical Refinement Tool (PRT) resolves concrete spatial issues via a grid-matching algorithm. For collaborative refinement, a multi-agent framework intelligently orchestrates these tools, featuring a planner for placement rules, a designer for initial layouts, and an evaluator for assessment. Experiments demonstrate DisCo-Layout’s state-of-the-art performance, generating realistic, coherent, and generalizable 3D indoor layouts. Our code will be publicly available.
cs.GR [Back]
[120] MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics
Changmin Lee,Jihyun Lee,Tae-Kyun Kim
Main category: cs.GR
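TL;DR: MPMAvatar is a framework for creating 3D human avatars from multi-view videos. It combines a Material Point Method-based physics simulator with 3D Gaussian rendering to achieve accurate, robust dynamics modeling and photorealistic rendering.
Details
Motivation: For 3D avatars created from visual observations, modeling the physical dynamics of loose garments remains challenging, and existing methods fall short in accuracy and in robustness to novel animation inputs. Contribution: Proposes the MPMAvatar framework, which combines a Material Point Method simulator with a novel collision handling algorithm to achieve accurate dynamics modeling and photorealistic rendering, and which generalizes well to zero-shot interaction tasks.
Method: Uses a Material Point Method-based simulator with an anisotropic constitutive model and a novel collision handling algorithm, together with 3D Gaussian Splatting with quasi-shadowing to produce high-fidelity images.
Result: Experiments show that MPMAvatar significantly outperforms existing methods in dynamics modeling accuracy, rendering accuracy, and robustness and efficiency, and that it extends to unseen interaction tasks.
Insight: By combining physical simulation with rendering, MPMAvatar shows promise for complex dynamics modeling and high-fidelity rendering and offers a new route to zero-shot interaction tasks.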
Abstract: While there has been significant progress in the field of 3D avatar creation from visual observations, modeling physically plausible dynamics of humans with loose garments remains a challenging problem. Although a few existing works address this problem by leveraging physical simulation, they suffer from limited accuracy or robustness to novel animation inputs. In this work, we present MPMAvatar, a framework for creating 3D human avatars from multi-view videos that supports highly realistic, robust animation, as well as photorealistic rendering from free viewpoints. For accurate and robust dynamics modeling, our key idea is to use a Material Point Method-based simulator, which we carefully tailor to model garments with complex deformations and contact with the underlying body by incorporating an anisotropic constitutive model and a novel collision handling algorithm. We combine this dynamics modeling scheme with our canonical avatar that can be rendered using 3D Gaussian Splatting with quasi-shadowing, enabling high-fidelity rendering for physically realistic animations. In our experiments, we demonstrate that MPMAvatar significantly outperforms the existing state-of-the-art physics-based avatar in terms of (1) dynamics modeling accuracy, (2) rendering accuracy, and (3) robustness and efficiency. Additionally, we present a novel application in which our avatar generalizes to unseen interactions in a zero-shot manner, which was not achievable with previous learning-based methods due to their limited simulation generalizability. Our project page is at: https://KAISTChangmin.github.io/MPMAvatar/
[121] ROI-GS: Interest-based Local Quality 3D Gaussian Splatting
Quoc-Anh Bui,Gilles Rougeron,Géraldine Morin,Simone Gasparini
Main category: cs.GR
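TL;DR: ROI-GS proposes an interest-based, local-quality 3D Gaussian Splatting method. Through object-guided camera selection and high-resolution reconstruction, it substantially improves detail in regions of interest while maintaining real-time performance.
Details
Motivation: Existing 3D Gaussian Splatting methods allocate resources uniformly, which limits detail in regions of interest and inflates model size; ROI-GS addresses this through targeted optimization. Contribution: Proposes the ROI-GS framework, which combines object-guided camera selection, targeted training, and high-fidelity reconstruction to improve detail in regions of interest and reduce model size.
Method: Uses an object-guided camera selection strategy, trains specifically on the objects of interest, and integrates them seamlessly into the global scene.
Result: Experiments show that ROI-GS improves local quality (by up to 2.96 dB PSNR), reduces model size by about 17%, and trains faster.
Insight: By allocating resources in a targeted way, ROI-GS shows the potential for balancing global and local quality in 3D reconstruction.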
Abstract: We tackle the challenge of efficiently reconstructing 3D scenes with high detail on objects of interest. Existing 3D Gaussian Splatting (3DGS) methods allocate resources uniformly across the scene, limiting fine detail in Regions of Interest (ROIs) and leading to inflated model size. We propose ROI-GS, an object-aware framework that enhances local details through object-guided camera selection, targeted object training, and seamless integration of high-fidelity object-of-interest reconstructions into the global scene. Our method prioritizes higher-resolution details on chosen objects while maintaining real-time performance. Experiments show that ROI-GS significantly improves local quality (up to 2.96 dB PSNR), while reducing overall model size by $\approx 17\%$ of baseline and achieving faster training for a scene with a single object of interest, outperforming existing methods.
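For readers unfamiliar with the metric, the 2.96 dB figure quoted in the abstract is a difference between two PSNR values. The following is a minimal sketch of the usual PSNR definition, assuming pixel values normalized to [0, 1]; the function name `psnr` and the flat-list image representation are illustrative assumptions and do not come from the paper.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a gain of about 3 dB corresponds to roughly halving the mean squared error inside the evaluated region.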
[122] Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects
Georgios Kouros,Minye Wu,Tinne Tuytelaars
Main category: cs.GR
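TL;DR: The paper proposes a relightable framework that combines a microfacet BRDF with Gaussian Splatting. By introducing a Spec-Gloss parameterization and normal-diffuse priors, it improves geometry and material reconstruction for glossy objects.
Details
Motivation: When reconstructing and relighting glossy objects, current neural rendering methods often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which limits the accuracy of material recovery and the fidelity of relighting. Contribution: Proposes a Gaussian Splatting framework that combines a microfacet BRDF with the Spec-Gloss parameterization, uses normal-diffuse priors to guide early-stage optimization, and introduces staged environment lighting optimization to preserve high-dynamic-range specular reflections.
Method: Integrates a microfacet BRDF parameterization into 2D Gaussian Splatting with deferred shading; uses diffusion-based priors to guide the optimization of surface normals and diffuse color; applies a coarse-to-fine optimization of the environment lighting.
Result: Experiments show that the method achieves high-quality geometry and material reconstruction on complex glossy scenes and substantially improves the realism and consistency of relighting under novel illumination.
Insight: Combining a physically consistent BRDF model with Gaussian Splatting, and guiding optimization with priors, can effectively resolve the ambiguities in reconstructing and relighting glossy objects.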
Abstract: Accurate reconstruction and relighting of glossy objects remain a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restricts faithful material recovery and limits relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine optimization of the environment map accelerates convergence and preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods.
cs.AI [Back]
[123] VaPR – Vision-language Preference alignment for Reasoning
Rohan Wadhawan,Fabrice Y Harel-Canada,Zi-Yi Dou,Suhaila Shakiah,Robinson Piramuthu,Nanyun Peng
Main category: cs.AI
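TL;DR: The paper proposes the VaPR framework, which uses LLM-guided hard-negative generation to address noise in synthetic preference annotations and substantially improves the reasoning performance of vision-language models.
Details
Motivation: Existing preference fine-tuning methods overlook noise in synthetic preference annotations, in the form of stylistic and length biases, which hurts the alignment of vision-language models. Contribution: Proposes a hard-negative generation framework based on LLM-guided editing and builds the VaPR dataset of 30K high-quality samples, yielding significant performance improvements.
Method: An LLM generates rejected responses that contain targeted errors while staying similar in style and length to the accepted responses; these pairs are used to fine-tune LVLMs.
Result: Across ten benchmarks, VaPR models improve on average by 6.5% for LLaVA, 4.0% for Qwen2VL, and 1.5% for Qwen2.5VL, and they reduce the "Yes" bias on binary questions.
Insight: Performance improves consistently as data size grows, with LLaVA benefiting even at smaller scales; with an open-source LLM as the editor, the resulting models come close to those trained on data synthesized with GPT-4o.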
Abstract: Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing that produces rejected responses with targeted errors, maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL, and Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer “Yes” in binary questions, addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page https://vap-r.github.io
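The abstract names Direct Preference Optimization (DPO) as the kind of preference finetuning that consumes accepted/rejected pairs. As a rough illustration, here is the standard per-pair DPO loss written over scalar sequence log-probabilities. This is a sketch of the generic objective, not the authors' training code; the function name, the argument names, and the default `beta=0.1` are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the frozen reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the rejected response differs from the accepted one only by a targeted error, as the abstract describes, style and length contribute little to the margin, which is the intuition behind hard negatives.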
[124] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Yu Zeng,Wenxuan Huang,Shiting Huang,Xikun Bao,Yukun Qi,Yiming Zhao,Qiuchen Wang,Lin Chen,Zehui Chen,Huaian Chen,Wanli Ouyang,Feng Zhao
Main category: cs.AI
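TL;DR: The paper proposes AGILE, an interactive learning method that models jigsaw solving as a process of interaction with an environment and substantially improves the perception and reasoning abilities of vision-language models (VLMs).
Details
Motivation: Although current large vision-language models have advanced in multimodal understanding and reasoning, their core perception and reasoning abilities remain limited; even on simple jigsaw tasks they perform close to random. High-quality vision-language data could help, but it is scarce and hard to scale. Contribution: The main contribution is AGILE, which models jigsaw solving as an interactive learning process in which the model interacts with the environment by generating code and progressively improves its perception and reasoning.
Method: AGILE frames the jigsaw task as an iterative interaction: at each step the model generates executable code to perform an action, and the environment returns fine-grained visual feedback. Through this cycle of observation and interaction, the model improves step by step.
Result: Experiments show that AGILE substantially improves performance on jigsaw tasks of varying complexity (e.g. accuracy in the 2×2 setting rises from 9.5% to 82.8%) and yields an average improvement of 3.1% across 9 general vision tasks.
Insight: The method opens a new path for improving reasoning and generalization in multimodal models and offers an efficient, scalable answer to the scarcity of multimodal reinforcement learning data.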
Abstract: Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 $\times$ 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.
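The observe-act-feedback cycle described in the abstract can be pictured with a toy environment. The sketch below is purely hypothetical: `JigsawEnv`, its `swap` action, and the count-based feedback are invented for illustration (the abstract does not specify the action API or the form of the visual feedback), and a scripted `greedy_solve` routine stands in for the VLM that would generate the actions.

```python
class JigsawEnv:
    """Toy jigsaw: state[pos] is the piece currently placed at position pos."""

    def __init__(self, permutation):
        self.state = list(permutation)

    def correct_count(self):
        # Feedback signal: number of pieces sitting in their target position.
        return sum(1 for pos, piece in enumerate(self.state) if pos == piece)

    def solved(self):
        return self.correct_count() == len(self.state)

    def swap(self, i, j):
        # One action: exchange two pieces, then return the updated feedback.
        self.state[i], self.state[j] = self.state[j], self.state[i]
        return self.correct_count()


def greedy_solve(env, max_steps=16):
    """Scripted stand-in for the model: fix the first misplaced position each step."""
    steps = 0
    while not env.solved() and steps < max_steps:
        pos = next(p for p, piece in enumerate(env.state) if p != piece)
        env.swap(pos, env.state.index(pos))
        steps += 1
    return steps
```

In the setting the abstract describes, the scripted policy would be replaced by model-generated code, and the feedback would be visual rather than a simple count.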
[125] Aristotle: IMO-level Automated Theorem Proving
Tudor Achim,Alex Best,Kevin Der,Mathïs Fédérico,Sergei Gukov,Daniel Halpern-Leister,Kirsten Henningsgard,Yury Kudryashov,Alexander Meiburg,Martin Michelsen,Riley Patterson,Eric Rodriguez,Laura Scharff,Vikram Shanker,Vladmir Sicca,Hari Sowrirajan,Aidan Swope,Matyas Tamas,Vlad Tenev,Jonathan Thomm,Harold Williams,Lawrence Wu
Main category: cs.AI
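TL;DR: Aristotle is an AI system that combines formal verification with informal reasoning and achieves gold-medal-level performance on the 2025 IMO problems.
Details
Motivation: The goal is to build an AI system that can solve problems from top-level mathematics competitions, combining formal and informal methods to improve performance. Contribution: Proposes the Aristotle system, which integrates Lean proof search, lemma generation through informal reasoning, and a geometry solver to reach IMO gold-medal-level performance.
Method: A three-component architecture: formal proof search in Lean, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver.
Result: Performs strongly on the 2025 IMO problems and shows favorable scaling properties for automated theorem proving.
Insight: Combining formal and informal methods can substantially improve the competition performance of automated theorem proving.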
Abstract: We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.
[126] Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang,Nitish Joshi,Barbara Plank,Rico Angell,He He
Main category: cs.AI
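TL;DR: The paper proposes TRACE, which detects implicit reward hacking by quantifying a model's reasoning effort. It avoids the limitations of existing monitoring approaches and substantially improves detection.
Details
Motivation: Reward hacking, especially in its implicit form, poses a serious threat, yet existing methods such as CoT monitoring struggle to detect it. An unsupervised way of measuring a model's actual reasoning effort is therefore needed. Contribution: Proposes TRACE, which quantifies effort by truncating the reasoning chain and measuring the verifier-passing rate. Experiments show that TRACE substantially outperforms existing methods on math and coding tasks.
Method: The core of TRACE is to progressively truncate the model's reasoning chain, force it to output an answer, and measure the verifier-passing rate. Because it takes a shortcut, a hacking model shows a high passing rate at low effort.
Result: TRACE achieves over 65% gains over a 72B CoT monitor on math reasoning and over 30% gains over a 32B monitor on coding, and it can discover unknown loopholes during training.
Insight: Implicit reward hacking can be detected by quantifying reasoning effort; TRACE offers a scalable, unsupervised solution for cases where existing methods fail.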
Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model’s chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less “effort” than required to achieve high reward. TRACE quantifies effort by measuring how early a model’s reasoning becomes sufficient to pass a verifier. We progressively truncate a model’s CoT at various lengths, force the model to answer, and measure the verifier-passing rate at each cutoff. A hacking model, which takes a shortcut, will achieve a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
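The scoring step described in the abstract, the area under the accuracy-vs-length curve, can be sketched as follows. This is a minimal illustration under stated assumptions: cutoffs are expressed as fractions of the full CoT in [0, 1], the area is computed with the trapezoidal rule, and the function name `trace_score` and the pass rates in the test data are made up; the abstract does not give the paper's exact cutoffs or integration scheme.

```python
def trace_score(cutoffs, pass_rates):
    """Area under the verifier-passing-rate vs. truncation-fraction curve."""
    if len(cutoffs) != len(pass_rates) or len(cutoffs) < 2:
        raise ValueError("need at least two (cutoff, pass_rate) points")
    points = sorted(zip(cutoffs, pass_rates))
    area = 0.0
    # Trapezoidal rule over consecutive cutoffs.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A model that already passes the verifier after reading a small fraction of its own reasoning yields a score close to 1, whereas a model whose passing rate rises only near the end of its CoT yields a much lower score.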
[127] VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
Rui Liu,Dian Yu,Tong Zheng,Runpeng Dai,Zongxia Li,Wenhao Yu,Zhenwen Liang,Linfeng Song,Haitao Mi,Pratap Tokekar,Dong Yu
Main category: cs.AI
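TL;DR: VOGUE treats the visual input as stochastic and introduces an uncertainty-aware exploration signal, improving the accuracy and robustness of multimodal reasoning.
Details
Motivation: Existing multimodal large language models (MLLMs) explore poorly during learning, particularly when it comes to handling uncertainty in the visual input. Contribution: Proposes VOGUE, which shifts exploration from the text space to the visual input space and uses the symmetric KL divergence to quantify and exploit visual uncertainty.
Method: Uses the symmetric KL divergence to measure sensitivity to visual perturbations, and combines an uncertainty-proportional bonus, a token-entropy bonus, and an annealed sampling schedule to balance exploration and exploitation.
Result: At two model scales (Qwen2.5-VL-3B/7B), VOGUE improves pass@1 accuracy on visual math and general-domain reasoning benchmarks while mitigating exploration decay.
Insight: Uncertainty in the visual input is a key factor in multimodal reasoning, and exploiting it effectively can improve both performance and robustness.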
Abstract: Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that still persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and struggling to build policies robust to plausible visual variations. We introduce VOGUE (Visual Uncertainty Guided Exploration), a novel method that shifts exploration from the output (text) to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy’s sensitivity to visual perturbations using the symmetric KL divergence between a “raw” and “noisy” branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
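The uncertainty signal named in the abstract is a symmetric KL divergence between the policy's outputs on the raw and the noisy image. A minimal sketch for two discrete next-token distributions follows. It assumes the symmetric form is the sum of the two directed divergences (the abstract does not say whether the paper sums or averages them), and the function names and the small `eps` smoothing term are illustrative choices.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p_raw, p_noisy):
    """Sum of both directed KL divergences between raw and noisy branches."""
    return kl_divergence(p_raw, p_noisy) + kl_divergence(p_noisy, p_raw)
```

A value near zero means the prediction barely changes when the image is perturbed; a larger value marks an input on which the policy is visually uncertain, which is where the abstract's exploration bonus would be largest.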
[128] Information Seeking for Robust Decision Making under Partial Observability
Djengo Cyun-Jyun Fang,Tsung-Wei Ke
Main category: cs.AI
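TL;DR: This paper proposes InfoSeeker, an LLM decision-making framework that integrates task-oriented planning with information seeking to handle uncertainty in partially observable environments, and that significantly outperforms existing methods in experiments.
Details
Motivation: Humans solve problems in partially observable environments by actively seeking information, but existing LLM planning agents overlook discrepancies between their internal dynamics and the actual environment, which makes their decisions less robust. Contribution: 1. Proposes the InfoSeeker framework, which combines task planning with information seeking; 2. Introduces a new benchmark suite; 3. Demonstrates a 74% performance gain in experiments and verifies generalization.
Method: InfoSeeker prompts an LLM to actively plan actions that validate its understanding, detect environmental changes, or test hypotheses, and then generates or revises task-oriented plans.
Result: InfoSeeker achieves a 74% absolute performance gain in partially observable environments and outperforms baselines on established benchmarks such as robotic manipulation and web navigation.
Insight: In partially observable environments, tightly integrating planning and information seeking is key to robust behavior.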
Abstract: Explicit information seeking is essential to human problem-solving in practical environments characterized by incomplete information and noisy dynamics. When the true environmental state is not directly observable, humans seek information to update their internal dynamics and inform future decision-making. Although existing Large Language Model (LLM) planning agents have addressed observational uncertainty, they often overlook discrepancies between their internal dynamics and the actual environment. We introduce Information Seeking Decision Planner (InfoSeeker), an LLM decision-making framework that integrates task-oriented planning with information seeking to align internal dynamics and make optimal decisions under uncertainty in both agent observations and environmental dynamics. InfoSeeker prompts an LLM to actively gather information by planning actions to validate its understanding, detect environmental changes, or test hypotheses before generating or revising task-oriented plans. To evaluate InfoSeeker, we introduce a novel benchmark suite featuring partially observable environments with incomplete observations and uncertain dynamics. Experiments demonstrate that InfoSeeker achieves a 74% absolute performance gain over prior methods without sacrificing sample efficiency. Moreover, InfoSeeker generalizes across LLMs and outperforms baselines on established benchmarks such as robotic manipulation and web navigation. These findings underscore the importance of tightly integrating planning and information seeking for robust behavior in partially observable environments. The project page is available at https://infoseekerllm.github.io
[129] The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Phuc Minh Nguyen,Chinh D. La,Duy M. H. Nguyen,Nitesh V. Chawla,Binh T. Nguyen,Khoa D. Doan
Main category: cs.AI
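TL;DR: This paper examines how reinforcement learning with verifiable rewards (RLVR), used to improve the reasoning of large language models, can shrink their reasoning boundary. It identifies two key phenomena behind the effect and proposes a data curation algorithm to mitigate it.
Details
Motivation: Although RLVR is widely used to improve the reasoning of language models, it may shrink the model's reasoning boundary; this paradox is what the paper sets out to explain. Contribution: 1. Reveals negative interference in RLVR and a winner-take-all phenomenon during learning; 2. Proposes an efficient data curation algorithm that improves Pass@$k$ performance.
Method: Analyzes the learning dynamics of RLVR theoretically and empirically, identifies the root causes, and proposes a data curation method that focuses on low-likelihood problems.
Result: Experiments show that the method notably improves Pass@$k$ performance on multiple mathematical reasoning benchmarks.
Insight: The standard RLVR objective can drive the model to converge on narrow solution strategies, and focusing learning on low-likelihood problems can alleviate this.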
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models’ reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to the decline of Pass@$k$ performance, or the probability of generating a correct solution within $k$ attempts. Second, we uncover the winner-take-all phenomenon: RLVR disproportionately reinforces problems whose correct solutions already have high likelihood under the base model, while suppressing those with initially low likelihood. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@$k$ performance. Our code is available at https://github.com/mail-research/SELF-llm-interference.
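The abstract defines Pass@$k$ as the probability of generating a correct solution within $k$ attempts. In practice this is commonly estimated from $n \ge k$ samples per problem, of which $c$ are correct, using the combinatorial estimator $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that widely used estimator; whether the paper computes Pass@$k$ exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate Pass@k from n samples of which c are correct (requires 1 <= k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no correct solution is C(n - c, k) / C(n, k); comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over problems gives the benchmark-level score. A problem with no correct samples contributes 0 for every $k$, which illustrates why suppressing initially low-likelihood solutions can lower Pass@$k$ at large $k$ even while single-sample accuracy improves.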
[130] InvThink: Towards AI Safety via Inverse Reasoning
Yubin Kim,Taehan Kim,Eugene Park,Chunjong Park,Cynthia Breazeal,Daniel McDuff,Hae Won Park
Main category: cs.AI
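TL;DR: InvThink proposes improving the safety of large language models (LLMs) through inverse thinking: enumerating potential harms, analyzing their consequences, and then generating safe responses. It achieves larger safety improvements than baseline methods.
Details
Motivation: Existing safety alignment methods optimize directly for safe responses but may sacrifice the model's general reasoning ability. InvThink aims to consider failure modes systematically through inverse thinking while preserving general capabilities. Contribution: 1. Proposes InvThink, which improves LLM safety through inverse thinking; 2. Finds that the safety improvements become more pronounced as model size grows; 3. Shows strong results in high-stakes domains (e.g. medicine, law) and agentic scenarios (e.g. blackmail).
Method: InvThink works in three steps: 1. enumerate potential harms; 2. analyze their consequences; 3. generate safe outputs that proactively avoid these risks. It is implemented via supervised fine-tuning and reinforcement learning across multiple LLM families.
Result: Compared with baseline methods such as SafetyPrompt, InvThink reduces harmful responses by up to 15.7% while preserving general reasoning ability. The safety improvements correlate positively with model size.
Insight: Inverse thinking improves LLM safety without the usual cost to other capabilities, suggesting a scalable and generalizable path toward AI safety.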
Abstract: We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods; (ii) InvThink mitigates the safety tax: by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks; (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
[131] Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
Erfan Shayegani,Keegan Hines,Yue Dong,Nael Abu-Ghazaleh,Roman Lutz,Spencer Whitehead,Vidhisha Balachandran,Besmira Nushi,Vibhav Vineet
Main category: cs.AI
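TL;DR: The paper shows that Computer-Use Agents (CUAs) consistently exhibit Blind Goal-Directedness (BGD), characterizes three prevalent BGD patterns, and measures BGD rates for several frontier models with the BLIND-ACT benchmark. BGD creates risk even when inputs are harmless, and existing interventions have limited effect.
Details
Motivation: As Computer-Use Agents (CUAs) are deployed more widely, the safety of their behavior draws increasing attention. The authors find that these agents tend to pursue goals blindly (BGD), which can lead to infeasible actions or safety risks and therefore calls for systematic study. Contribution: 1. Systematically characterizes BGD and its three prevalent patterns; 2. Develops the BLIND-ACT benchmark for evaluating BGD behavior in CUAs; 3. Confirms BGD in several frontier LLMs and exposes the limits of prompting-based interventions.
Method: The authors build BLIND-ACT, a benchmark of 90 tasks that uses OSWorld to provide realistic environments and LLM-based judges to evaluate agent behavior. The judges reach 93.75% agreement with human annotations. The benchmark is used to analyze BGD in nine frontier models.
Result: The evaluated models show an average BGD rate of 80.8%, indicating that the problem is widespread. Prompting-based interventions lower BGD levels, but substantial risk remains. Qualitative analysis reveals typical failure modes: execution-first bias, thought-action disconnect, and request-primacy.
Insight: BGD exposes deep-seated risks in how CUAs are designed and deployed and points to the need for stronger training-time or inference-time interventions. BLIND-ACT provides a foundation for future work on studying and mitigating BGD.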
Abstract: Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals. In this paper, we show that CUAs consistently exhibit Blind Goal-Directedness (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and decisions under ambiguity, and (iii) contradictory or infeasible goals. We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns. Built on OSWorld, BLIND-ACT provides realistic environments and employs LLM-based judges to evaluate agent behavior, achieving 93.75% agreement with human annotations. We use BLIND-ACT to evaluate nine frontier models, including Claude Sonnet and Opus 4, Computer-Use-Preview, and GPT-5, observing high average BGD rates (80.8%) across them. We show that BGD exposes subtle risks that arise even when inputs are not directly harmful. While prompting-based interventions lower BGD levels, substantial risk persists, highlighting the need for stronger training- or inference-time interventions. Qualitative analysis reveals observed failure modes: execution-first bias (focusing on how to act over whether to act), thought-action disconnect (execution diverging from reasoning), and request-primacy (justifying actions due to user request). Identifying BGD and introducing BLIND-ACT establishes a foundation for future research on studying and mitigating this fundamental risk and ensuring safe CUA deployment.
[132] Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Zhihao Dou,Qinjian Zhao,Zhongwei Wan,Dinggen Zhang,Weida Wang,Towsif Raiyan,Benteng Chen,Qingtao Pan,Yang Ouyang,Zhiqiang Gao,Shufei Zhang,Sumon Biswas
Main category: cs.AI
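TL;DR: This paper proposes PTA-GRPO, a two-stage framework that combines high-level planning with fine-grained reasoning optimization to significantly improve the reasoning ability of large language models (LLMs).
Details
Motivation: Existing LLMs rely on autoregressive, token-level generation for reasoning and lack global planning, which leads to redundant, incoherent, or inaccurate reasoning and hurts overall performance. Existing approaches such as tree search and reinforcement learning (RL) are computationally expensive and often fall short. Contribution: 1. Proposes the PTA-GRPO framework, which optimizes high-level planning and CoT reasoning in stages. 2. Introduces high-level guidance distillation followed by supervised fine-tuning (SFT). 3. Designs a guidance-aware RL method that jointly optimizes the final output and the quality of the high-level guidance.
Method: 1. Stage one: use advanced LLMs to distill CoT into compact high-level guidance, then perform SFT. 2. Stage two: apply a guidance-aware RL method that jointly optimizes the output and the guidance quality.
Result: Stable and significant improvements are confirmed on multiple mathematical reasoning benchmarks (e.g. MATH, AIME) and across several base models (e.g. the Qwen series, LLaMA3.2).
Insight: Through staged planning and optimization, PTA-GRPO addresses the lack of global planning in LLM reasoning and offers a more efficient and general approach.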
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
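Group Relative Policy Optimization, which the method's name refers to, scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step as it is commonly described. The guidance-aware reward that PTA-GRPO adds is not specified in the abstract and is not modeled here; the function name, the use of the population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group provides no learning signal.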
[133] Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
Claas Beger,Ryan Yi,Shuhao Fu,Arseny Moskvichev,Sarah W. Tsai,Sivasankaran Rajamanickam,Melanie Mitchell
Main category: cs.AI
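TL;DR: The paper examines how AI models perform on abstract reasoning tasks across modalities and shows that evaluating on accuracy alone can overestimate or underestimate model capabilities.
Details
Motivation: The study asks whether AI models solving abstract reasoning tasks actually grasp and apply the abstractions the task designers intended, rather than relying on surface patterns. Contribution: 1. Proposes a dual evaluation (output accuracy plus rule analysis); 2. Reveals differences in models' abstract reasoning between the textual and visual modalities; 3. Provides a more comprehensive evaluation framework.
Method: Using ConceptARC tasks, models are evaluated under settings that vary the input modality (textual vs. visual), whether external tools are allowed, and the amount of reasoning effort.
Result: In the textual modality, some models approach human accuracy, but rule analysis shows reliance on surface-level shortcuts; in the visual modality, accuracy drops, yet rule analysis reveals latent abstraction ability.
Insight: Accuracy alone is a limited measure of abstract reasoning and should be complemented by rule-level analysis; models still lag humans considerably in cross-modal abstract reasoning.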
Abstract: OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models’ abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models’ rules are often based on surface-level “shortcuts” and capture intended abstractions far less often than humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models’ output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.
[134] A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports
Yang Yao,Yixu Wang,Yuxuan Zhang,Yi Lu,Tianle Gu,Lingyu Li,Dingyi Zhao,Keming Wu,Haozhe Wang,Ping Nie,Yan Teng,Yingchun Wang
Main category: cs.AI
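TL;DR: This paper proposes a rigorous benchmark and a multidimensional evaluation framework for Deep Research Agents (DRAs), addressing shortcomings of existing benchmarks in evaluation dimensions, response format, and scoring mechanisms.
Details
Motivation: AI is shifting from closed language models to interconnected agent systems capable of external perception and information integration. Deep Research Agents (DRAs) embody this systematic capability, but existing benchmarks cannot evaluate such systems effectively.
Contribution: Introduces a benchmark of 214 expert-curated queries and a multidimensional evaluation framework that scores the long-form reports generated by DRAs on semantic quality, topical focus, and retrieval trustworthiness.
Method: Designs challenging queries across 10 broad thematic domains, manually constructs reference bundles to support composite evaluation, and proposes integrated scoring metrics.
Result: Experiments show that mainstream DRAs outperform web-search-tool-augmented reasoning models, yet leave considerable room for improvement.
Insight: The study provides a solid foundation for capability assessment, architectural refinement, and paradigm advancement of DRAs.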
Abstract: Artificial intelligence is undergoing the paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) systematically exhibit the capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
[135] RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Yuxiao Qu,Anikait Singh,Yoonho Lee,Amrith Setlur,Ruslan Salakhutdinov,Chelsea Finn,Aviral Kumar
Main category: cs.AI
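TL;DR: This paper proposes RLAD, which trains large language models (LLMs) to discover abstractions for reasoning problems and thereby improves their reasoning. RLAD uses a two-player reinforcement learning framework that trains an abstraction generator and a solution generator, enabling structured exploration and better generalization.
Details
Motivation: Existing large models often fail to consistently capture or reuse procedural knowledge in reasoning tasks, which makes their reasoning verbose and inefficient. A method is needed that helps models learn to propose useful reasoning abstractions that guide more efficient reasoning.
Contribution: 1. Proposes RLAD, which trains an abstraction generator and a solution generator in a two-player reinforcement learning framework; 2. Demonstrates the role of abstractions in structured exploration and generalization; 3. Finds that, at large test-time budgets, allocating more compute to generating abstractions improves performance more than generating more solutions.
Method: RLAD uses a two-player reinforcement learning framework: the abstraction generator proposes multiple abstractions (concise natural-language descriptions of procedural and factual knowledge), and the solution generator builds solutions conditioned on them. Joint training decouples the learning signals for abstraction proposal and solution generation.
Result: Experiments show that RLAD improves reasoning and generalization, particularly on harder problems. In addition, at test time, generating more abstractions improves performance more than generating more solutions.
Insight: 1. Abstractions play a key guiding role in reasoning tasks; 2. Structured exploration and decoupled learning signals are effective strategies for improving LLM reasoning; 3. How test-time compute is allocated has a significant effect on reasoning performance.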
Abstract: Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement “algorithmic procedures” that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To address more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.
cs.MA
[136] LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science
Alireza Salemi,Mihir Parmar,Palash Goyal,Yiwen Song,Jinsung Yoon,Hamed Zamani,Hamid Palangi,Tomas Pfister
Main category: cs.MA
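TL;DR: This paper proposes an LLM-based multi-agent blackboard system for information discovery in large, heterogeneous data lakes in data science, and it substantially outperforms existing methods.
Details
Motivation: Existing single-agent systems struggle with large, heterogeneous data, while master-slave multi-agent systems require precise knowledge of each sub-agent's capabilities and therefore lack flexibility and scalability.
Contribution: Proposes a multi-agent communication paradigm based on the blackboard architecture that does not require the central coordinator to have prior knowledge of the sub-agents, improving scalability and flexibility.
Method: Uses a blackboard architecture: a central agent posts requests to a shared blackboard, and autonomous sub-agents, each responsible for either a partition of the data lake or general information retrieval, volunteer to respond based on their capabilities.
Result: Across several benchmarks, the approach substantially outperforms RAG and the master-slave multi-agent paradigm, with a 13% to 57% relative improvement in end-to-end task success and up to a 9% relative gain in data-discovery F1 score.
Insight: The blackboard architecture offers a scalable and generalizable communication framework for multi-agent systems, suited to large, heterogeneous data environments.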
Abstract: The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by large, heterogeneous files in large data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent’s capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents – either responsible for a partition of the data lake or general information retrieval – volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents’ expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving between 13% and 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.
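Sketch: The post-and-volunteer pattern described in the abstract can be illustrated in a few lines of Python. This is a toy under stated assumptions, not the paper's system: the class names `SubAgent` and `Blackboard`, the keyword-overlap rule for volunteering, and the file names are all hypothetical, and no LLM is involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubAgent:
    """Hypothetical sub-agent that owns one partition of a data lake."""
    name: str
    files: Dict[str, str]  # file name -> short description of its contents

    def volunteer(self, request: str) -> List[str]:
        # Assumed rule: offer a file when its description shares a word with the request.
        words = set(request.lower().split())
        return [f for f, desc in self.files.items()
                if words & set(desc.lower().split())]


@dataclass
class Blackboard:
    """Shared board: the central agent posts, sub-agents read and reply."""
    requests: List[str] = field(default_factory=list)
    responses: Dict[str, List[str]] = field(default_factory=dict)

    def post(self, request: str) -> None:
        self.requests.append(request)

    def collect(self, agents: List[SubAgent]) -> Dict[str, List[str]]:
        latest = self.requests[-1]
        for agent in agents:
            hits = agent.volunteer(latest)
            if hits:  # agents with nothing relevant stay silent
                self.responses[agent.name] = hits
        return self.responses


agents = [
    SubAgent("sales", {"q1.csv": "quarterly sales revenue",
                       "staff.csv": "employee roster"}),
    SubAgent("weather", {"rain.csv": "daily rainfall measurements"}),
]
board = Blackboard()
board.post("total sales revenue by quarter")
found = board.collect(agents)
```

The central agent never inspects what each sub-agent holds; it only reads whatever replies appear on the board, which is the property the abstract credits for scalability.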
cs.IR
[137] Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete
Adithya Rajan,Xiaoyu Liu,Prateek Verma,Vibhu Arora
Main category: cs.IR
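TL;DR: The paper proposes a data-centric approach that generates synthetic prefixes to reduce presentation bias in real-time neural query autocomplete systems. The prefixes are derived from complete user queries issued when autocomplete was not active, which diversifies the training data.
Details
Motivation: Presentation bias arises in autocomplete systems because user behavior is influenced by the model's suggestions, which biases the training data. A data-level intervention is needed to address this.
Contribution: 1. Proposes synthetic prefixes to reduce bias; 2. Designs an efficient simplification of the listwise loss; 3. Validates the approach in a large-scale e-commerce setting.
Method: 1. Generates synthetic prefixes from queries issued without autocomplete; 2. Designs a low-latency neural ranking model; 3. Reduces the computational complexity of the listwise loss.
Result: Achieves statistically significant improvements in user engagement metrics such as mean reciprocal rank.
Insight: Synthetic prefixes not only improve generalization but also offer a scalable path to bias mitigation in other low-latency ranking tasks, such as related searches and query recommendations.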
Abstract: We introduce a data-centric approach for mitigating presentation bias in real-time neural query autocomplete systems through the use of synthetic prefixes. These prefixes are generated from complete user queries collected during regular search sessions where autocomplete was not active. This allows us to enrich the training data for learning to rank models with more diverse and less biased examples. This method addresses the inherent bias in engagement signals collected from live query autocomplete interactions, where model suggestions influence user behavior. Our neural ranker is optimized for real-time deployment under strict latency constraints and incorporates a rich set of features, including query popularity, seasonality, fuzzy match scores, and contextual signals such as department affinity, device type, and vertical alignment with previous user queries. To support efficient training, we introduce a task-specific simplification of the listwise loss, reducing computational complexity from $O(n^2)$ to $O(n)$ by leveraging the query autocomplete structure of having only one ground-truth selection per prefix. Deployed in a large-scale e-commerce setting, our system demonstrates statistically significant improvements in user engagement, as measured by mean reciprocal rank and related metrics. Our findings show that synthetic prefixes not only improve generalization but also provide a scalable path toward bias mitigation in other low-latency ranking tasks, including related searches and query recommendations.
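Sketch: The two ideas in the abstract, deriving prefixes from completed queries and exploiting the single ground-truth selection per prefix, can be illustrated as follows. The abstract does not give the exact loss, so softmax cross-entropy is used here as a stand-in that happens to cost O(n) when there is one positive; the function names and the `min_len` parameter are assumptions made for the example.

```python
import math
from typing import List, Tuple


def synthetic_prefixes(query: str, min_len: int = 1) -> List[Tuple[str, str]]:
    """Turn one completed query into (prefix, target) training pairs."""
    return [(query[:i], query) for i in range(min_len, len(query))]


def one_positive_listwise_loss(scores: List[float], positive: int) -> float:
    """Softmax cross-entropy with a single ground-truth candidate.

    A single pass computes the log-normaliser, so the cost is O(n) in the
    number of candidates. This is a stand-in, not the paper's loss.
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive]


pairs = synthetic_prefixes("shoes", min_len=2)
uniform_loss = one_positive_listwise_loss([0.0, 0.0], positive=0)
confident_loss = one_positive_listwise_loss([3.0, 0.0], positive=0)
```

Here `pairs` holds three (prefix, target) examples for one query, and the loss falls as the ranker scores the selected completion above the alternatives.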
[138] Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations
Bo Ma,LuYao Liu,Simon Lau,Chandler Yuan,XueY Cui,Rosie Zhang
Main category: cs.IR
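TL;DR: The paper proposes a framework that addresses three main challenges in combining collaborative filtering with large language models through dynamic alignment, multimodal fusion, and evidence-grounded explanations.
Details
Motivation: Existing approaches to combining collaborative filtering with large language models have three problems: static data fails to capture changing user preferences, multimodal content is underused, and explanations lack trustworthy evidence. A more effective solution is needed.
Contribution: 1. Proposes a dynamic adaptation mechanism that updates user preferences in real time; 2. Unifies the representation of collaborative signals and multimodal features; 3. Designs an explanation system grounded in concrete evidence.
Method: Uses lightweight adapter networks combined with online learning and multimodal fusion to generate verifiable natural-language explanations.
Result: The approach remains efficient without significantly increasing computational overhead, making it suitable for real-world deployment.
Insight: Dynamic learning and multimodal fusion are key to making recommender systems more flexible and explainable.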
Abstract: Recent research has explored using Large Language Models for recommendation tasks by transforming user interaction histories and item metadata into text prompts, then having the LLM produce rankings or recommendations. A promising approach involves connecting collaborative filtering knowledge to LLM representations through compact adapter networks, which avoids expensive fine-tuning while preserving the strengths of both components. Yet several challenges persist in practice: collaborative filtering models often use static snapshots that miss rapidly changing user preferences; many real-world items contain rich visual and audio content beyond textual descriptions; and current systems struggle to provide trustworthy explanations backed by concrete evidence. Our work introduces a framework that tackles these limitations through three key innovations. We develop an online adaptation mechanism that continuously incorporates new user interactions through lightweight modules, avoiding the need to retrain large models. We create a unified representation that seamlessly combines collaborative signals with visual and audio features, handling cases where some modalities may be unavailable. Finally, we design an explanation system that grounds recommendations in specific collaborative patterns and item attributes, producing natural language rationales users can verify. Our approach maintains the efficiency of frozen base models while adding minimal computational overhead, making it practical for real-world deployment.
[139] LLM4Rec: Large Language Models for Multimodal Generative Recommendation with Causal Debiasing
Bo Ma,Hang Li,ZeHua Hu,XiaoFan Gui,LuYao Liu,Simon Lau
Main category: cs.IR
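TL;DR: LLM4Rec is a multimodal generative recommendation framework built on large language models. It combines multimodal data, causal debiasing, and real-time adaptive learning, and improves recommendation accuracy, fairness, and diversity.
Details
Motivation: Current generative recommendation systems are limited in handling multimodal data, eliminating algorithmic bias, and providing transparent decision-making.
Contribution: Five innovations: a multimodal fusion architecture, retrieval-augmented generation mechanisms, causal-inference-based debiasing, explainable recommendation generation, and real-time adaptive learning.
Method: Uses large language models as the backbone, with specialized modules for cross-modal understanding, contextual knowledge integration, bias mitigation, explanation synthesis, and continuous model adaptation.
Result: On three benchmark datasets, including MovieLens-25M, NDCG@10 improves by up to 2.3% and diversity metrics by 1.4%, while computational efficiency is maintained.
Insight: Combining causal inference with multimodal learning is key to improving recommender system performance.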
Abstract: Contemporary generative recommendation systems face significant challenges in handling multimodal data, eliminating algorithmic biases, and providing transparent decision-making processes. This paper introduces an enhanced generative recommendation framework that addresses these limitations through five key innovations: multimodal fusion architecture, retrieval-augmented generation mechanisms, causal inference-based debiasing, explainable recommendation generation, and real-time adaptive learning capabilities. Our framework leverages advanced large language models as the backbone while incorporating specialized modules for cross-modal understanding, contextual knowledge integration, bias mitigation, explanation synthesis, and continuous model adaptation. Extensive experiments on three benchmark datasets (MovieLens-25M, Amazon-Electronics, Yelp-2023) demonstrate consistent improvements in recommendation accuracy, fairness, and diversity compared to existing approaches. The proposed framework achieves up to 2.3% improvement in NDCG@10 and 1.4% enhancement in diversity metrics while maintaining computational efficiency through optimized inference strategies.
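Sketch: For readers unfamiliar with the NDCG@10 metric quoted above, a minimal implementation of the common linear-gain variant is shown below. Which gain function the paper uses is not stated in the abstract, so the linear gain and the function names are assumptions.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return 0.0 if ideal == 0 else dcg_at_k(relevances, k) / ideal


perfect = ndcg_at_k([1.0, 0.0, 0.0])   # relevant item ranked first
swapped = ndcg_at_k([0.0, 1.0])        # relevant item ranked second
```

A ranking that places the relevant item first scores 1.0, and moving it down the list lowers the score through the logarithmic discount.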
eess.IV
[140] An Efficient Quality Metric for Video Frame Interpolation Based on Motion-Field Divergence
Conall Daly,Darren Ramsook,Anil Kokaram
Main category: eess.IV
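TL;DR: The paper proposes $\text{PSNR}_{\text{DIV}}$, an efficient quality metric for video frame interpolation based on motion-field divergence. It addresses the fact that existing metrics (PSNR, SSIM, LPIPS) ignore temporal coherence, and it is much more efficient than FloLPIPS in compute and memory.
Details
Motivation: Existing quality metrics (PSNR, SSIM, LPIPS) cannot effectively assess the perceptual impact of interpolation artefacts. FloLPIPS, which is designed for video frame interpolation, performs better but is computationally inefficient, which limits its practical use. An efficient and accurate metric is therefore needed.
Contribution: Proposes $\text{PSNR}_{\text{DIV}}$, a full-reference quality metric weighted by motion-field divergence, which improves both evaluation accuracy and computational efficiency.
Method: Weights image errors by motion divergence, which highlights singularities in the motion field, to enhance PSNR.
Result: On the BVI-VFI dataset, $\text{PSNR}_{\text{DIV}}$ improves the Pearson Linear Correlation Coefficient by 0.09 over FloLPIPS, while being 2.5$\times$ faster and using 4$\times$ less memory.
Insight: Motion-divergence weighting captures temporal inconsistencies in video frame interpolation at low computational cost, making the metric suitable as a loss function for training neural networks.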
Abstract: Video frame interpolation is a fundamental tool for temporal video enhancement, but existing quality metrics struggle to evaluate the perceptual impact of interpolation artefacts effectively. Metrics like PSNR, SSIM and LPIPS ignore temporal coherence. State-of-the-art quality metrics tailored towards video frame interpolation, like FloLPIPS, have been developed but suffer from computational inefficiency that limits their practical application. We present $\text{PSNR}_{\text{DIV}}$, a novel full-reference quality metric that enhances PSNR through motion divergence weighting, a technique adapted from archival film restoration where it was developed to detect temporal inconsistencies. Our approach highlights singularities in motion fields, which are then used to weight image errors. Evaluation on the BVI-VFI dataset (180 sequences across multiple frame rates, resolutions and interpolation methods) shows $\text{PSNR}_{\text{DIV}}$ achieves statistically significant improvements: +0.09 Pearson Linear Correlation Coefficient over FloLPIPS, while being 2.5$\times$ faster and using 4$\times$ less memory. Performance remains consistent across all content categories and is robust to the motion estimator used. The efficiency and accuracy of $\text{PSNR}_{\text{DIV}}$ enable fast quality evaluation and practical use as a loss function for training neural networks for video frame interpolation tasks. An implementation of our metric is available at www.github.com/conalld/psnr-div.
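Sketch: The idea of weighting image errors by motion-field divergence can be illustrated with a small pure-Python example. The abstract does not specify the weighting function, so the choice of `1 + |divergence|` as the per-pixel weight, the finite-difference scheme, and the function names are assumptions; the authors' implementation at the URL above is the reference.

```python
import math
from typing import List

Grid = List[List[float]]


def divergence(u: Grid, v: Grid) -> Grid:
    """Divergence du/dx + dv/dy of a flow field (central differences,
    one-sided at the borders)."""
    h, w = len(u), len(u[0])
    div = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            du_dx = (u[y][x1] - u[y][x0]) / max(x1 - x0, 1)
            dv_dy = (v[y1][x] - v[y0][x]) / max(y1 - y0, 1)
            div[y][x] = du_dx + dv_dy
    return div


def weighted_psnr(ref: Grid, test: Grid, u: Grid, v: Grid,
                  peak: float = 255.0) -> float:
    """PSNR with squared errors weighted by 1 + |divergence| (assumed weighting)."""
    div = divergence(u, v)
    num = den = 0.0
    for y in range(len(ref)):
        for x in range(len(ref[0])):
            weight = 1.0 + abs(div[y][x])
            num += weight * (ref[y][x] - test[y][x]) ** 2
            den += weight
    mse = num / den
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


zeros = [[0.0] * 3 for _ in range(3)]
ref = [[10.0] * 3 for _ in range(3)]
test = [[12.0] * 3 for _ in range(3)]
plain = weighted_psnr(ref, test, zeros, zeros)  # zero flow: ordinary PSNR
```

With a zero flow field every weight is 1 and the result reduces to ordinary PSNR; an error that coincides with a high-divergence region is penalised more heavily than the same error elsewhere.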
[141] Median2Median: Zero-shot Suppression of Structured Noise in Images
Jianxu Wang,Ge Wang
Main category: eess.IV
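TL;DR: Median2Median (M2M) is a zero-shot denoising framework designed for structured noise. Using a novel sampling strategy and generalized median filtering, it removes correlated noise without high-quality labeled data.
Details
Motivation: Existing denoising methods perform poorly on structured noise (noise with strong anisotropic correlations). Data-driven methods rely on high-quality labeled data and generalize poorly, while zero-shot methods are effective only for independent and identically distributed (i.i.d.) noise. M2M aims to fill this gap.
Contribution: Proposes M2M, a zero-shot denoising framework for structured noise, presented as a first step beyond the strict i.i.d. assumption; introduces a sampling strategy based on directional interpolation and generalized median filtering that generates pseudo-independent sub-image pairs; and designs a randomized assignment strategy to eliminate systematic bias.
Method: Generates pseudo-independent sub-image pairs through directional interpolation and generalized median filtering so that they are suitable for Noise2Noise training; a randomized assignment strategy enlarges the sampling space and eliminates bias.
Result: M2M performs on par with existing zero-shot methods under i.i.d. noise and consistently outperforms them under correlated noise.
Insight: M2M moves zero-shot denoising beyond its strict reliance on i.i.d. noise and offers an efficient, data-free solution for structured noise.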
Abstract: Image denoising is a fundamental problem in computer vision and medical imaging. However, real-world images are often degraded by structured noise with strong anisotropic correlations that existing methods struggle to remove. Most data-driven approaches rely on large datasets with high-quality labels and still suffer from limited generalizability, whereas existing zero-shot methods avoid this limitation but remain effective only for independent and identically distributed (i.i.d.) noise. To address this gap, we propose Median2Median (M2M), a zero-shot denoising framework designed for structured noise. M2M introduces a novel sampling strategy that generates pseudo-independent sub-image pairs from a single noisy input. This strategy leverages directional interpolation and generalized median filtering to adaptively exclude values distorted by structured artifacts. To further enlarge the effective sampling space and eliminate systematic bias, a randomized assignment strategy is employed, ensuring that the sampled sub-image pairs are suitable for Noise2Noise training. In our realistic simulation studies, M2M performs on par with state-of-the-art zero-shot methods under i.i.d. noise, while consistently outperforming them under correlated noise. These findings establish M2M as an efficient, data-free solution for structured noise suppression and mark the first step toward effective zero-shot denoising beyond the strict i.i.d. assumption.
[142] GFSR-Net: Guided Focus via Segment-Wise Relevance Network for Interpretable Deep Learning in Medical Imaging
Jhonatan Contreras,Thomas Bocklitz
Main category: eess.IV
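TL;DR: GFSR-Net guides model focus through a segment-wise relevance network to improve the interpretability and reliability of deep learning in medical imaging. It uses a small number of human annotations to steer the model toward diagnostically relevant regions; experiments show that it keeps accuracy high while producing more trustworthy saliency maps.
Details
Motivation: Deep learning in medical imaging is limited by a lack of interpretability: models may rely on irrelevant regions or spurious cues, which reduces clinical trust. GFSR-Net addresses this by guiding the model's focus to improve trustworthiness and practical usefulness.
Contribution: Proposes GFSR-Net, which uses a small number of human annotations to guide model attention and improve the interpretability of saliency maps; experiments show strong performance across several types of medical images.
Method: Uses a segment-wise relevance network in which a few annotations approximate where a person would intuitively focus, without requiring precise boundaries. During training, the model learns to align its focus with these regions and progressively emphasizes diagnostically meaningful features.
Result: On chest X-rays, retinal scans, and dermatological images, GFSR-Net achieves accuracy comparable or superior to baselines while producing saliency maps that better match human expectations.
Insight: A small number of human annotations is enough to guide model attention and improve interpretability; the approach is general and applies to a range of medical imaging tasks.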
Abstract: Deep learning has achieved remarkable success in medical image analysis. However, its adoption in clinical practice is limited by a lack of interpretability. These models often make correct predictions without explaining their reasoning. They may also rely on image regions unrelated to the disease or visual cues, such as annotations, that are not present in real-world conditions. This can reduce trust and increase the risk of misleading diagnoses. We introduce the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), an approach designed to improve interpretability and reliability in medical imaging. GFSR-Net uses a small number of human annotations to approximate where a person would focus within an image intuitively, without requiring precise boundaries or exhaustive markings, making the process fast and practical. During training, the model learns to align its focus with these areas, progressively emphasizing features that carry diagnostic meaning. This guidance works across different types of natural and medical images, including chest X-rays, retinal scans, and dermatological images. Our experiments demonstrate that GFSR-Net achieves comparable or superior accuracy while producing saliency maps that better reflect human expectations. This reduces the reliance on irrelevant patterns and increases confidence in automated diagnostic tools.
cs.LG
[143] From 2D to 3D, Deep Learning-based Shape Reconstruction in Magnetic Resonance Imaging: A Review
Emma McMillian,Abhirup Banerjee,Alfonso Bueno-Orovio
Main category: cs.LG
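TL;DR: This review surveys deep learning methods for reconstructing 3D shapes from 2D MRI data, focusing on four main approaches (point cloud, mesh-based, shape-aware, and volumetric models) and summarizing their strengths and weaknesses, applications, and future research directions.
Details
Motivation: 3D shape reconstruction is important in medical imaging, but generating 3D models directly from 2D MRI data remains challenging. The review systematically summarizes existing methods to advance more robust, generalizable, and clinically useful deep learning solutions.
Contribution: (1) Categorizes current techniques and applications for the four 3D reconstruction approaches; (2) analyzes clinical applicability, datasets, and computational demands; (3) identifies future directions such as multimodal integration and cross-modality frameworks.
Method: Focuses on four main approaches: point cloud, mesh-based, shape-aware, and volumetric models. For each, the review analyzes the methodological foundation, technical implementation, and limitations.
Result: The approaches differ in their trade-offs among reconstruction accuracy, computational efficiency, and clinical applicability, and no single approach meets all requirements.
Insight: Future research should focus on multimodal data fusion and cross-modality learning to improve robustness and generalization, while addressing challenges such as data scarcity and limited compute.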
Abstract: Deep learning-based 3-dimensional (3D) shape reconstruction from 2-dimensional (2D) magnetic resonance imaging (MRI) has become increasingly important in medical disease diagnosis, treatment planning, and computational modeling. This review surveys the methodological landscape of 3D MRI reconstruction, focusing on 4 primary approaches: point cloud, mesh-based, shape-aware, and volumetric models. For each category, we analyze the current state-of-the-art techniques, their methodological foundation, limitations, and applications across anatomical structures. We provide an extensive overview ranging from cardiac to neurological to lung imaging. We also focus on the clinical applicability of models to diseased anatomy, and the influence of their training and testing data. We examine publicly available datasets, computational demands, and evaluation metrics. Finally, we highlight the emerging research directions including multimodal integration and cross-modality frameworks. This review aims to provide researchers with a structured overview of current 3D reconstruction methodologies to identify opportunities for advancing deep learning towards more robust, generalizable, and clinically impactful solutions.
[144] Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs
Sergey Troshin,Wafaa Mohammed,Yan Meng,Christof Monz,Antske Fokkens,Vlad Niculae
Main category: cs.LG
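TL;DR: The paper proposes selective sampling, which dynamically switches between greedy and high-temperature sampling to balance the diversity and accuracy of language model outputs.
Details
Motivation: Temperature-based sampling is commonly used to increase the diversity of language model outputs, but it can reduce accuracy on tasks that require high precision, such as mathematical reasoning. The paper aims to address this.
Contribution: Proposes selective sampling, which improves the quality-diversity trade-off by dynamically switching between greedy and high-temperature sampling.
Method: Trains a lightweight classifier to predict sampling risk and uses it to decide whether to apply high-temperature sampling at each position, reducing errors at sensitive decoding positions.
Result: Experiments on mathematical reasoning tasks show that the method improves the quality-diversity trade-off, even in high-temperature settings.
Insight: Sampling risk is predictable, so dynamically adjusting the sampling strategy can preserve diversity while avoiding errors at critical positions.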
Abstract: Diversity is an essential metric for evaluating the creativity of outputs generated by language models. Temperature-based sampling is a common strategy to increase diversity. However, for tasks that require high precision, e.g., mathematical reasoning, uncontrolled high temperature sampling, e.g., min-$p$ or top-$p$, degrades reasoning quality. We demonstrate that the loss of accuracy is caused by sampling incorrect continuations in sensitive decoding positions. To address this, in this paper, we propose selective sampling, a method that dynamically switches between greedy and high-temperature sampling based on a sampling risk metric. This risk metric estimates the likelihood of output errors when applying high-temperature sampling on the current token position. To predict sampling risk, we train a lightweight classifier on a small subset of verifiable problems. The trained classifier can be integrated with the base language model with minimal latency overhead. Experiments on mathematical reasoning tasks demonstrate that selective sampling enhances the quality-diversity trade-off, even in high-temperature settings.
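Sketch: The switching rule described in the abstract can be written as a single decoding step. The risk value would come from the paper's trained classifier; here it is simply passed in as a number, and the threshold of 0.5, the temperature of 1.5, and the function names are assumptions made for the example.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax with max-subtraction for stability."""
    m = max(logits)
    exps = [math.exp((value - m) / temperature) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selective_sample(logits: List[float], risk: float,
                     threshold: float = 0.5, temperature: float = 1.5,
                     rng: Optional[random.Random] = None) -> int:
    """Pick the next token id: greedy at risky positions, sampled otherwise."""
    if risk >= threshold:
        # Sensitive position: take the arg-max token.
        return max(range(len(logits)), key=logits.__getitem__)
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


greedy_token = selective_sample([0.1, 2.0, 0.3], risk=0.9)
rng = random.Random(0)
sampled = [selective_sample([0.0, 0.0, 0.0], risk=0.1, rng=rng)
           for _ in range(200)]
```

At a high-risk position the call is deterministic, while low-risk positions still draw from the flattened distribution and so retain diversity.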
[145] Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis
Han Wu,Yanming Sun,Yunhe Yang,Derek F. Wong
Main category: cs.LG
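TL;DR: This paper proposes the Adaptive Gated Fusion Network (AGFN), which uses a dual-gate fusion mechanism to dynamically adjust the weights of multimodal features. It addresses the performance degradation of conventional fusion methods when modality quality varies (noisy, missing, or semantically conflicting inputs) and improves the accuracy and robustness of sentiment analysis.
Details
Motivation: Multimodal sentiment analysis (MSA) often fails to handle inconsistent modality quality (noisy, missing, or semantically conflicting inputs), which degrades sentiment prediction. The paper aims to adjust modality weights dynamically so that the model makes better use of high-quality modality information.
Contribution: 1. Proposes AGFN, which adjusts feature weights through a dual-gate mechanism based on information entropy and modality importance. 2. Demonstrates gains in accuracy and robustness on the CMU-MOSI and CMU-MOSEI datasets.
Method: 1. Uses a dual-gate fusion mechanism (an information-entropy gate and a modality-importance gate) to dynamically adjust the weight of each modality's features. 2. Improves generalization by reducing the correlation between feature location and prediction error.
Result: Experiments show that AGFN outperforms baselines on CMU-MOSI and CMU-MOSEI in accuracy and robustness. Visualization analysis shows that AGFN learns from a broader feature distribution and relies less on feature location for prediction.
Insight: Dynamically adjusting modality weights mitigates the effect of noisy or conflicting modalities and improves generalization. Reducing the correlation between feature location and prediction error is key to robust feature representations.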
Abstract: Multimodal sentiment analysis (MSA) leverages information fusion from diverse modalities (e.g., text, audio, visual) to enhance sentiment prediction. However, simple fusion techniques often fail to account for variations in modality quality, such as those that are noisy, missing, or semantically conflicting. This oversight leads to suboptimal performance, especially in discerning subtle emotional nuances. To mitigate this limitation, we introduce a simple yet efficient Adaptive Gated Fusion Network (AGFN) that adaptively adjusts feature weights via a dual gate fusion mechanism based on information entropy and modality importance. This mechanism mitigates the influence of noisy modalities and prioritizes informative cues following unimodal encoding and cross-modal interaction. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance. Visualization analysis of feature representations demonstrates that AGFN enhances generalization by learning from a broader feature distribution, achieved by reducing the correlation between feature location and prediction error, thereby decreasing reliance on specific locations and creating more robust multimodal feature representations.
[146] RLP: Reinforcement as a Pretraining Objective
Ali Hatamizadeh,Syeda Nahida Akter,Shrimai Prabhumoye,Jan Kautz,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Yejin Choi
Main category: cs.LG
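TL;DR: RLP introduces an information-gain-based reinforcement learning objective for pretraining, bringing exploration into the pretraining stage and improving the model's reasoning ability.
Details
Motivation: The dominant approach introduces reinforcement learning only in the final phase of training and overlooks its potential during pretraining. RLP introduces a reinforcement learning objective during pretraining to improve reasoning ability.
Contribution: 1. Proposes RLP, which brings the exploratory behavior of reinforcement learning into pretraining; 2. Computes the reward signal from information gain; 3. Improves performance on math and science reasoning tasks.
Method: Uses information gain (the increase in log-likelihood of the next token) as the reward signal, encouraging the model to reason on its own (chain-of-thought) before predicting the next token.
Result: On Qwen3-1.7B-Base, RLP lifts the average across eight math-and-science benchmarks by 19%; on Nemotron-Nano-12B-v2, it raises the scientific reasoning average by 23%.
Insight: The exploratory behavior of reinforcement learning can be incorporated effectively into pretraining, improving reasoning before any post-training is applied.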
Abstract: The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective, that brings the core spirit of reinforcement learning – exploration – to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training for the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
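Sketch: The reward described in the abstract is the change in next-token log-likelihood when a sampled reasoning chain is added to the context. Assuming the two log-probabilities are already available from the model (the function and argument names here are hypothetical), the per-token reward is a simple difference:

```python
import math
from typing import List


def information_gain_rewards(logp_with_thought: List[float],
                             logp_context_only: List[float]) -> List[float]:
    """Per-token reward: log p(x_t | context, thought) - log p(x_t | context)."""
    if len(logp_with_thought) != len(logp_context_only):
        raise ValueError("both inputs must cover the same token positions")
    return [with_t - without_t
            for with_t, without_t in zip(logp_with_thought, logp_context_only)]


# Position 0: the thought doubles the probability of the true token.
# Position 1: the thought halves it, so the reward is negative.
rewards = information_gain_rewards(
    [math.log(0.5), math.log(0.1)],
    [math.log(0.25), math.log(0.2)],
)
```

Because the reward needs only the model's own likelihoods, it is available at every token position without an external verifier, which matches the abstract's description of a dense, verifier-free signal.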
[147] Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks
Bruno Corcuera,Carlos Eiras-Franco,Brais Cancela
Main category: cs.LG
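TL;DR: The paper proposes an unsupervised Dynamic Feature Selection (DFS) method that removes noisy or irrelevant features from images to improve the performance and robustness of latent representations.
Details
Motivation: In vision tasks, latent representations are often affected by noisy or irrelevant features, which degrades model performance and generalization. A method is needed that dynamically selects the most useful features without relying on labeled data.
Contribution: Proposes unsupervised Dynamic Feature Selection (DFS), which dynamically identifies and removes misleading or redundant information in images without labeled data.
Method: Uses an unsupervised framework to assess feature importance for each instance, keeping the most relevant features and discarding irrelevant noise.
Result: Across image tasks such as clustering and image generation, DFS improves generalization performance with a minimal increase in computational cost.
Insight: Unsupervised dynamic feature selection improves the quality of latent representations efficiently and applies to a wide range of datasets and tasks.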
Abstract: Latent representations are critical for the performance and robustness of machine learning models, as they encode the essential features of data in a compact and informative manner. However, in vision tasks, these representations are often affected by noisy or irrelevant features, which can degrade the model’s performance and generalization capabilities. This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS). For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space. By leveraging an unsupervised framework, our approach avoids reliance on labeled data, making it broadly applicable across various domains and datasets. Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks, including clustering and image generation, while incurring a minimal increase in the computational cost.
[148] $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models
Yujie Zhou,Pengyang Ling,Jiazi Bu,Yibin Wang,Yuhang Zang,Jiaqi Wang,Li Niu,Guangtao Zhai
Main category: cs.LG
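TL;DR: The paper proposes the $\text{G}^2$RPO framework, a granular GRPO method that improves the reward signal for reinforcement learning in flow models so that they align more precisely with human preferences.
Details
Motivation: Existing methods for reinforcement learning of generative models suffer from sparse and narrow reward signals, which leads to sub-optimal preference alignment.
Contribution: 1) Proposes the Granular-GRPO ($\text{G}^2$RPO) framework for more precise reward assessment; 2) introduces a Singular Stochastic Sampling strategy and a Multi-Granularity Advantage Integration module.
Method: 1) The Singular Stochastic Sampling strategy supports step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise; 2) the Multi-Granularity Advantage Integration module aggregates advantages computed at multiple diffusion scales for a more comprehensive evaluation of sampling directions.
Result: Experiments show that $\text{G}^2$RPO significantly outperforms existing GRPO baselines across various reward models, demonstrating its effectiveness and robustness.
Insight: Granular reward assessment and multi-scale advantage integration can improve the alignment of generative models with human preferences.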
Abstract: The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
[149] LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning
Weizhe Chen,Sven Koenig,Bistra Dilkina
Main category: cs.LG
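TL;DR: The paper proposes LSPO, a dynamic sampling method based on response length, to improve policy optimization for large language models (LLMs) on reasoning tasks.
Details
Motivation: Existing RLVR methods are not efficient enough when training LLMs; in particular, the effect of response length on learning has not been fully exploited.
Contribution: Proposes LSPO, an algorithm that dynamically selects training data to improve policy optimization, and validates it experimentally.
Method: LSPO dynamically adjusts its sampling strategy based on the average response length, which determines which training data are selected.
Result: Experiments show that LSPO improves learning effectiveness across multiple base models and datasets.
Insight: Response length is a useful signal for dynamic sampling in RLVR; future work can explore further ways of using length signals to improve training.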
Abstract: Since the release of Deepseek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.
[150] Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression
Joykirat Singh,Justin Chih-Yao Chen,Archiki Prasad,Elias Stengel-Eskin,Akshay Nambi,Mohit Bansal
Main category: cs.LG
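TL;DR: TRAAC is an adaptive, attention-based compression method that matches reasoning length to task difficulty, addressing both underthinking and overthinking and improving accuracy and efficiency.
Details
Motivation: Existing reasoning models struggle to allocate compute dynamically at test time: underthinking leads to errors on hard problems, and overthinking wastes tokens.
Contribution: Proposes TRAAC, which combines adaptive budget allocation with attention-based compression and improves both accuracy and reasoning efficiency across multiple tasks.
Method: Uses online reinforcement learning. The model's self-attention identifies important reasoning steps and prunes redundant ones, while the reasoning budget is adjusted to task difficulty.
Result: Across multiple tasks, TRAAC (Qwen3-4B) raises average accuracy by 8.4% (absolute) and cuts reasoning length by 36.8% (relative) compared with the base model, and it generalizes to non-math tasks.
Insight: Combining task-difficulty calibration with attention-based compression is key to efficient adaptive reasoning.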
Abstract: Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On the one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; on the other hand, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model’s self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
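The idea of pruning low-attention steps with a difficulty-dependent budget can be sketched as below. This is a toy under stated assumptions, not TRAAC's procedure: the per-step attention scores are taken as given, difficulty is a number in [0, 1], and the linear mapping from difficulty to the fraction of steps kept (with a hypothetical `min_keep` floor) is invented for the example.

```python
from math import ceil


def compress_steps(steps, attention, difficulty, min_keep=0.5):
    """Keep the highest-attention reasoning steps, in their original order.

    Harder problems (difficulty near 1.0) keep a larger share of steps.
    The linear budget rule and min_keep are illustrative assumptions.
    """
    if len(steps) != len(attention):
        raise ValueError("steps and attention must have the same length")
    keep_fraction = min_keep + (1.0 - min_keep) * difficulty
    n_keep = max(1, ceil(keep_fraction * len(steps)))
    ranked = sorted(range(len(steps)), key=lambda i: attention[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore chronological order
    return [steps[i] for i in kept]


if __name__ == "__main__":
    steps = ["s1", "s2", "s3", "s4"]
    attention = [0.4, 0.1, 0.3, 0.2]
    print(compress_steps(steps, attention, difficulty=0.0))
    print(compress_steps(steps, attention, difficulty=1.0))
```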
[151] Continual Personalization for Diffusion Models
Yu-Chien Liao,Jr-Jen Chen,Chi-Pin Huang,Ci-Siang Lin,Meng-Lin Wu,Yu-Chiang Frank Wang
Main category: cs.LG
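TL;DR: The paper proposes Concept Neuron Selection (CNS), a method for continual personalization of diffusion models that mitigates catastrophic forgetting while preserving zero-shot text-to-image generation ability.
Details
Motivation: Diffusion models face catastrophic forgetting and high computational cost in incremental learning; CNS aims at efficient, continual concept personalization.
Contribution: CNS identifies the neurons closely related to the target concepts, enabling continual personalization of diffusion models while reducing memory storage and processing time.
Method: CNS selectively fine-tunes the neurons related to the target concepts and jointly preserves previously learned knowledge, which mitigates catastrophic forgetting.
Result: Experiments show that CNS achieves state-of-the-art performance on both single- and multi-concept personalization while adjusting only a small number of parameters.
Insight: The neuron-selection mechanism makes personalization both efficient and continual, which matters for real-world multi-concept incremental learning.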
Abstract: Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of Concept Neuron Selection (CNS), a simple yet effective approach to perform personalization in a continual learning scheme. CNS uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, CNS finetunes concept neurons in an incremental manner and jointly preserves knowledge learned from previous concepts. Evaluation on real-world datasets demonstrates that CNS achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single- and multi-concept personalization. CNS also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.
[152] Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
Feiyang Kang,Michael Kuchnik,Karthik Padthe,Marin Vlastelica,Ruoxi Jia,Carole-Jean Wu,Newsha Ardalani
Main category: cs.LG
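TL;DR: The paper challenges the standard two-stage SFT-then-RL recipe, finding that high SFT scores do not reliably predict RL outcomes and can even lead to worse results. It proposes generalization loss and Pass@large k as alternative metrics, which substantially improve prediction accuracy.
Details
Motivation: In current practice, LLM post-training is split into two independent stages, SFT and RL, but there is no clear evidence that high SFT scores translate into better performance after RL. The authors question this assumption and look for more reliable proxies.
Contribution: 1. Shows the limits of high SFT scores and demonstrates that they can lead to worse RL performance; 2. Proposes generalization loss and Pass@large k as strong proxies for RL outcomes; 3. Validates the new metrics through large-scale experiments.
Method: 1. Runs SFT and RLVR (via GRPO) on multiple LLMs (e.g., Llama3, Mistral-Nemo); 2. Analyzes the relationship between SFT scores and RL outcomes; 3. Proposes prediction based on generalization loss and Pass@large k.
Result: Generalization loss and Pass@large k substantially improve prediction of RL outcomes ($R^2$ and Spearman's rank correlation improve by up to 0.5). Experiments show that SFT-stage choices (such as data selection and training duration) strongly affect RL outcomes.
Insight: SFT should not blindly chase high scores; data diversity and generalization matter. The alternative metrics give a more reliable basis for evaluating LLM post-training.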
Abstract: In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance can lead to substantially worse outcomes compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance as strong proxies for the RL outcome. We trained hundreds of models with up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending >1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman’s rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for one epoch underperforms training on half as many examples for two epochs, whether after SFT or after SFT-then-RL. With the same SFT budget, training only on short examples may lead to better SFT performance, but it often leads to worse outcomes after RL compared to training on examples of varying lengths. The evaluation tool will be open-sourced.
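Pass@k is commonly estimated from n sampled answers, c of which are correct, with the unbiased estimator 1 - C(n-c, k) / C(n, k). The abstract does not say which estimator the authors use, so the sketch below assumes this standard one; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n generated samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 256 repetitions and 8 correct answers, compare small and large k.
    print(pass_at_k(256, 8, 1), pass_at_k(256, 8, 64))
```

With large k the estimate rewards models that solve a problem at least occasionally, which is one reason it can diverge from single-sample accuracy.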
[153] Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
Adam Filipek
Main category: cs.LG
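TL;DR: The paper proposes Sparse Query Attention (SQA), which lowers computational cost by reducing the number of query heads and is suited to long-sequence tasks.
Details
Motivation: The computational complexity of Multi-Head Attention (MHA) is quadratic in sequence length, which limits scalability on long sequences. Existing methods such as MQA and GQA address the memory-bandwidth bottleneck by sharing key and value heads but do not reduce FLOPs.
Contribution: Proposes SQA, which directly lowers computational cost by reducing the number of query heads, achieving up to 3x higher throughput with only a minimal impact on model quality.
Method: SQA reduces the number of query heads; the paper gives its mathematical formulation and a family of architectural variants.
Result: On long sequences (32k-200k tokens), SQA achieves significant throughput improvements in pre-training, fine-tuning, and encoder-based tasks.
Insight: SQA offers a new optimization path for attention in compute-bound settings and may be a useful tool for building efficient, scalable models.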
Abstract: The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length presents a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have effectively addressed the memory bandwidth bottleneck that dominates autoregressive inference latency by sharing Key and Value projections. While highly successful, these methods do not reduce the fundamental number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification directly decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate that SQA can achieve significant throughput improvements of up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models.
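The claimed scaling can be checked with back-of-the-envelope arithmetic. The sketch below counts only the two large matrix products per head (scores and the weighted sum over values), ignores projections and softmax, and assumes one score matrix per query head. The head counts and dimensions are made-up example values, not configurations from the paper.

```python
def attention_core_flops(seq_len: int, num_query_heads: int, head_dim: int) -> int:
    """Approximate FLOPs for QK^T plus (attention weights) x V.

    Each product costs about 2 * seq_len^2 * head_dim FLOPs per query head.
    """
    per_head = 2 * 2 * seq_len * seq_len * head_dim
    return num_query_heads * per_head


if __name__ == "__main__":
    seq_len, head_dim = 32_768, 128      # hypothetical long-context setting
    full = attention_core_flops(seq_len, num_query_heads=32, head_dim=head_dim)
    sparse = attention_core_flops(seq_len, num_query_heads=8, head_dim=head_dim)
    print(f"reduction factor: {full / sparse:.1f}x")
```

Sharing key/value heads, as in MQA and GQA, leaves this count unchanged because the number of score matrices still equals the number of query heads.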
[154] StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Yanxu Chen,Zijun Yao,Yantao Liu,Jin Ye,Jianing Yu,Lei Hou,Juanzi Li
Main category: cs.LG
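TL;DR: StockBench is a benchmark for evaluating LLM agents in realistic stock trading environments, filling a gap in the evaluation of dynamic trading in finance.
Details
Motivation: High-stakes decisions in finance are directly tied to economic value, but existing financial benchmarks mainly test static knowledge through question answering and cannot capture the dynamic, iterative nature of trading, so a new evaluation tool is needed.
Contribution: Proposes the StockBench benchmark for evaluating LLM agents in realistic, multi-month stock trading environments, and releases it openly to support reproducibility and future research.
Method: Agents receive daily market signals (prices, fundamentals, news) and make sequential buy, sell, or hold decisions; performance is measured with financial metrics (cumulative return, maximum drawdown, Sortino ratio).
Result: Most LLM agents fail to outperform a simple buy-and-hold baseline, but several models show potential for higher returns and more effective risk management.
Insight: Strong performance on static financial knowledge does not necessarily translate into successful trading strategies, which highlights both the challenges and the opportunities in building financial LLM agents.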
Abstract: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals – including prices, fundamentals, and news – and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
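The three metrics named above have standard textbook definitions, sketched here for a series of portfolio values. The benchmark's exact conventions (annualization, target return, sampling frequency) are not given in the abstract, so the zero target and per-period returns below are assumptions.

```python
from math import sqrt


def cumulative_return(values):
    """Total return from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = values[0], 0.0
    for value in values:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation (target is assumed 0)."""
    excess = [r - target for r in returns]
    downside = sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        return float("inf")
    return (sum(excess) / len(excess)) / downside


if __name__ == "__main__":
    values = [100.0, 120.0, 90.0, 110.0]
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    print(cumulative_return(values), max_drawdown(values), sortino_ratio(returns))
```

Unlike the Sharpe ratio, the Sortino ratio penalizes only returns below the target, so upside volatility does not count against a strategy.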
[155] ExGRPO: Learning to Reason from Experience
Runzhe Zhan,Yafu Li,Zhi Wang,Xiaoye Qu,Dongrui Liu,Jing Shao,Derek F. Wong,Yu Cheng
Main category: cs.LG
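TL;DR: ExGRPO is a learning framework built around experience value; by reusing and prioritizing valuable reasoning experiences, it improves the reasoning performance of large language models.
Details
Motivation: Existing RLVR methods use each experience only once during training, which is computationally inefficient and unstable. ExGRPO studies how experience characteristics affect the learning dynamics of reasoning models, aiming to improve performance through efficient experience management.
Contribution: 1) First to study what makes an experience valuable in RLVR; 2) Proposes the ExGRPO framework, which organizes and prioritizes experiences to improve reasoning performance; 3) Achieves gains on mathematical and general benchmarks.
Method: ExGRPO uses rollout correctness and entropy as indicators of experience value and employs a mixed-policy objective to balance exploration with experience exploitation.
Result: On models with 1.5B-8B parameters, ExGRPO gains +3.5/7.6 points on average on mathematical/general benchmarks over on-policy RLVR, and it trains stably on both stronger and weaker models.
Insight: Efficient experience management is key to scalable and stable RLVR.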
Abstract: Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
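A toy version of prioritized experience reuse is sketched below. The abstract names correctness and entropy as value indicators but does not state the ranking rule or the mixing scheme, so preferring correct, low-entropy rollouts and filling a fixed share of each batch from the buffer are assumptions; the `Rollout` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    correct: bool
    entropy: float  # e.g. mean token entropy under the policy (assumed field)


def select_experiences(buffer, k):
    """Return up to k correct rollouts, lowest entropy first (assumed rule)."""
    correct = [r for r in buffer if r.correct]
    return sorted(correct, key=lambda r: r.entropy)[:k]


def mix_batch(fresh, replayed, replay_ratio, batch_size):
    """Build a batch from replayed and fresh on-policy rollouts."""
    n_replay = min(len(replayed), int(batch_size * replay_ratio))
    return replayed[:n_replay] + fresh[:batch_size - n_replay]


if __name__ == "__main__":
    buffer = [
        Rollout("q1", "a", True, 0.9),
        Rollout("q2", "b", False, 0.1),
        Rollout("q3", "c", True, 0.2),
    ]
    fresh = [Rollout(f"new{i}", "x", False, 0.5) for i in range(4)]
    replayed = select_experiences(buffer, k=2)
    print([r.prompt for r in mix_batch(fresh, replayed, 0.5, 4)])
```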
[156] Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Ruohao Guo,Afshin Oroojlooy,Roshan Sridhar,Miguel Ballesteros,Alan Ritter,Dan Roth
Main category: cs.LG
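TL;DR: The paper proposes DialTree-RPO, a tree-based dialogue reinforced policy optimization method that automatically discovers multi-turn adversarial attack strategies and markedly raises attack success rates.
Details
Motivation: Large language models remain vulnerable to adversarial attacks in multi-turn interactions, but existing methods mostly rely on manual red-teaming or single-turn attack templates and do not fully explore complex dialogue dynamics and multi-turn attack strategies.
Contribution: Proposes the DialTree-RPO framework, which combines reinforcement learning with tree search to autonomously discover diverse multi-turn attack strategies without manually curated data.
Method: Treats the dialogue as a sequential decision-making problem and explores attack paths systematically through policy optimization and tree search, learning a policy that maximizes attack success across turns.
Result: Achieves an attack success rate (ASR) more than 25.9% higher than previous methods across 10 target models, and uncovers new attack strategies.
Insight: Automatically discovered multi-turn attack strategies expose safety weaknesses of LLMs in complex dialogues and underline how hard, and how important, dynamic dialogue planning is.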
Abstract: Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods do not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
cs.CR [Back]
[157] Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
Hui Dou,Ning Xu,Yiwen Zhang,Kaibin Wang
Main category: cs.CR
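TL;DR: The paper proposes RTS-Attack, which builds nested scenarios that are semantically relevant to the query and contain targeted toxic knowledge in order to bypass the alignment defenses of large language models (LLMs), yielding an efficient and well-concealed jailbreak attack.
Details
Motivation: Although LLMs perform well on many tasks, their alignment defenses are vulnerable to semantically relevant nested scenarios combined with toxic knowledge. The paper explores this underexplored direction and proposes an effective attack framework.
Contribution: 1. First to systematically verify that LLM alignment defenses are insufficiently sensitive to semantically relevant nested scenarios; 2. Proposes the RTS-Attack framework for adaptive, automated attack generation; 3. The generated attack prompts are well concealed and contain no directly harmful queries.
Method: RTS-Attack builds nested scenarios that are highly semantically relevant to the query and embeds targeted toxic knowledge, producing concealed attack prompts that bypass LLM alignment defenses.
Result: Experiments show that RTS-Attack jailbreaks efficiently and generalizes across advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro.
Insight: LLM alignment defenses have a latent weakness when handling semantically relevant nested scenarios, which should inform the design of future defenses.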
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks. However, they remain exposed to jailbreak attacks, eliciting harmful responses. The nested scenario strategy has been increasingly adopted across various methods, demonstrating immense potential. Nevertheless, these methods are easily detectable due to their prominent malicious intentions. In this work, we are the first to find and systematically verify that LLMs’ alignment defenses are not sensitive to nested scenarios, where these scenarios are highly semantically relevant to the queries and incorporate targeted toxic knowledge. This is a crucial yet insufficiently explored direction. Based on this, we propose RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive and automated framework to examine LLMs’ alignment. By building scenarios highly relevant to the queries and integrating targeted toxic knowledge, RTS-Attack bypasses the alignment defenses of LLMs. Moreover, the jailbreak prompts generated by RTS-Attack are free from harmful queries, leading to outstanding concealment. Extensive experiments demonstrate that RTS-Attack exhibits superior performance in both efficiency and universality compared to the baselines across diverse advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro. Our complete code is available in the supplementary material. WARNING: THIS PAPER CONTAINS POTENTIALLY HARMFUL CONTENT.
[158] ZK-WAGON: Imperceptible Watermark for Image Generation Models using ZK-SNARKs
Aadarsh Anantha Ramakrishnan,Shubham Agarwal,Selvanayagam S,Kunwar Singh
Main category: cs.CR
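TL;DR: ZK-WAGON is the first watermarking method for image generation models based on ZK-SNARKs. Using selective layer conversion and LSB steganography, it embeds imperceptible watermarks and provides verifiable proof of origin, addressing the quality loss and security weaknesses of traditional methods.
Details
Motivation: As image generation models spread, concerns about the authenticity, ownership, and misuse of synthetic media grow. Traditional watermarking methods degrade quality or lack security, so a secure approach that preserves image quality is needed.
Contribution: 1. First to apply ZK-SNARKs to watermarking for image generation models; 2. Proposes SL-ZKCC, which significantly reduces proof generation time; 3. Designs a model-agnostic, secure watermarking pipeline.
Method: Uses ZK-SNARKs: SL-ZKCC selectively converts key layers into a circuit, and the resulting proof is embedded into the image via LSB steganography.
Result: Demonstrated on both GAN and diffusion models, combining high-quality images with verifiable provenance.
Insight: Combining ZK-SNARKs with steganography offers a secure approach to copyright protection for generative models that applies across model types.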
Abstract: As image generation models grow increasingly powerful and accessible, concerns around authenticity, ownership, and misuse of synthetic media have become critical. The ability to generate lifelike images indistinguishable from real ones introduces risks such as misinformation, deepfakes, and intellectual property violations. Traditional watermarking methods either degrade image quality, are easily removed, or require access to confidential model internals - making them unsuitable for secure and scalable deployment. We are the first to introduce ZK-WAGON, a novel system for watermarking image generation models using Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (ZK-SNARKs). Our approach enables verifiable proof of origin without exposing model weights, generation prompts, or any sensitive internal information. We propose Selective Layer ZK-Circuit Creation (SL-ZKCC), a method to selectively convert key layers of an image generation model into a circuit, reducing proof generation time significantly. Generated ZK-SNARK proofs are imperceptibly embedded into a generated image via Least Significant Bit (LSB) steganography. We demonstrate this system on both GAN and Diffusion models, providing a secure, model-agnostic pipeline for trustworthy AI image generation.
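LSB steganography itself is simple to illustrate: each payload bit replaces the lowest bit of one 8-bit channel value, so no value changes by more than 1. The sketch below works on a flat list of channel values and a short placeholder payload; the flat layout, the function names, and the payload are assumptions, and it says nothing about how ZK-WAGON serializes or locates its proofs.

```python
def embed_lsb(pixels, payload: bytes):
    """Write payload bits (most significant bit first) into the lowest bit
    of successive 8-bit channel values."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("payload does not fit in the image")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit
    return out


def extract_lsb(pixels, num_bytes: int) -> bytes:
    """Read num_bytes back from the lowest bits, in the same bit order."""
    data = bytearray()
    for b in range(num_bytes):
        byte = 0
        for value in pixels[b * 8:(b + 1) * 8]:
            byte = (byte << 1) | (value & 1)
        data.append(byte)
    return bytes(data)


if __name__ == "__main__":
    cover = list(range(40, 104))          # 64 stand-in channel values
    stego = embed_lsb(cover, b"proof")    # placeholder payload, not a real proof
    print(extract_lsb(stego, 5))
```

Because only the lowest bit is touched, the change is visually imperceptible, but it is also fragile: re-encoding or resizing the image would typically destroy the embedded bits.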
[159] Position: Privacy Is Not Just Memorization!
Niloofar Mireshghallah,Tianshi Li
Main category: cs.CR
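TL;DR: This position paper argues that the privacy risks of large language models (LLMs) go far beyond verbatim memorization of training data and also include data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks.
Details
Motivation: Current discussion of LLM privacy risk focuses disproportionately on verbatim memorization of training data and neglects other, more immediate and scalable privacy threats. The paper aims to surface these underestimated threats and draw the research community's attention to them.
Contribution: Presents a taxonomy of privacy risks across the LLM lifecycle (from data collection through deployment) and uses case studies to show where current privacy frameworks fall short; through an analysis of 1,322 AI/ML privacy papers from the past decade, it reveals a mismatch between research focus and actual privacy threats.
Method: Builds the taxonomy of LLM privacy risks through a longitudinal literature analysis and case studies.
Result: Finds that current privacy research over-focuses on memorization, while other, more pressing risks (such as context leakage and deep inference attacks) lack effective solutions.
Insight: Privacy risk is multifaceted and calls for interdisciplinary solutions, not only technical optimization.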
Abstract: The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remain underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle – from data collection through deployment – and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016–2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.
cs.MM [Back]
[160] Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Chetwin Low,Weimin Wang,Calder Katyal
Main category: cs.MM
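TL;DR: The paper presents Ovi, a unified paradigm for audio-video generation based on twin-backbone cross-modal fusion, which avoids complex multi-stage architectures and separate generation of audio and video.
Details
Motivation: Audio-video generation usually relies on complex multi-stage architectures or separate synthesis of sound and visuals, which hurts synchronization and coherence. Ovi aims to solve this with a single generative process.
Contribution: 1. Proposes Ovi, a unified paradigm for audio-video generation; 2. Uses blockwise cross-modal fusion of twin-DiT modules to achieve natural synchronization; 3. Initializes the audio backbone with the same architecture as the video backbone to support fine-grained multimodal fusion modeling.
Method: 1. Blockwise cross-modal fusion with twin-DiT modules; 2. An audio backbone whose architecture is identical to that of the pretrained video backbone; 3. Joint training of the audio and video backbones through exchange of timing (scaled-RoPE embeddings) and semantics (bidirectional cross-attention).
Result: The generated clips are movie-grade, with natural speech and accurate, context-matched sound effects.
Insight: A unified generative process with blockwise fusion improves the synchronization and quality of audio-video generation and suggests a new direction for multimodal generation.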
Abstract: Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at https://aaxwaz.github.io/Ovi