Table of Contents
- cs.CL [Total: 38]
- cs.CV [Total: 107]
- cs.GR [Total: 1]
- eess.AS [Total: 1]
- cs.AI [Total: 22]
- cs.IR [Total: 1]
- cs.CR [Total: 1]
- eess.IV [Total: 5]
- cs.CY [Total: 1]
- cs.LG [Total: 19]
- cs.RO [Total: 2]
- cs.IT [Total: 1]
cs.CL
[1] From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation
Viacheslav Yusupov,Danil Maksimov,Ameliia Alaeva,Anna Vasileva,Anna Antipina,Tatyana Zaitseva,Alina Ermilova,Evgeny Burnaev,Egor Shvetsov
Main category: cs.CL
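TL;DR: This paper proposes evaluating the quality of LLM-generated text through geometric properties of the model's internal representations, validates several such metrics, and finds that they reflect inherent properties of the text rather than model-specific artifacts.
Details
Motivation: Existing LLM evaluation methods usually depend on external criteria or human-annotated data and lack an efficient, automated alternative. The paper aims to fill this gap using geometric properties of internal representations.
Contribution: Shows that geometric properties of internal representations, such as Intrinsic Dimensionality and Effective Rank, can serve as universal measures of text naturalness and quality, and proposes a reference-free automated evaluation method.
Method: Validates a set of metrics (Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms) measured across different layers of LLMs.
Result: Different models rank text from various sources in the same order, which indicates that the metrics capture inherent text characteristics rather than model-specific bias.
Insight: Geometric properties can act as a reliable tool for unsupervised text quality evaluation, which opens a new route for automated evaluation pipelines.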
Abstract: This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms measured across different layers of LLMs, demonstrating that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding reveals that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that these metrics reflect inherent text characteristics rather than model-specific artifacts. This allows a reference-free text quality evaluation that does not require human-annotated datasets, offering practical advantages for automated evaluation pipelines.
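To make one of the metrics above concrete, here is a minimal sketch of Effective Rank using the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values). The centering step, the function name, and the synthetic matrices are assumptions for illustration; the paper's exact computation and layer selection may differ.

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens x hidden_dim) matrix."""
    # Assumption: center the token representations before the SVD.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Normalize singular values into a probability distribution.
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

# Synthetic stand-ins for hidden states (not real model activations).
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(200, 16))
low_rank = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 16))

er_iso = effective_rank(isotropic)
er_low = effective_rank(low_rank)
print(f"isotropic: {er_iso:.2f}, rank-2: {er_low:.2f}")
```

The rank-2 matrix can score at most 2, while the isotropic matrix spreads its variance over many directions and scores much higher (bounded by 16).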
[2] From Faithfulness to Correctness: Generative Reward Models that Think Critically
Qiyao Ma,Yunsheng Shi,Hongtao Tian,Chao Wang,Weiming Chang,Ting Yao
Main category: cs.CL
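TL;DR: This paper proposes TRM (Thinking-supervised Reward Model), a reward model that adds sentence-level thinking supervision to improve correctness and critical assessment in open-domain question answering.
Details
Motivation: RLVR-based methods struggle on complex tasks such as open-domain question answering because correctness is hard to verify. Recent work focuses heavily on faithfulness (consistency with external documents), which makes models over-rely on external knowledge and weakens their critical thinking.
Contribution: Proposes TRM, which combines sentence-level faithfulness assessment with a reasoning step so that both external and internal knowledge are used to judge answer correctness.
Method: TRM first assesses the faithfulness of each answer sentence to the supporting documents, then applies a reasoning step to evaluate sentence-level correctness, structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations.
Result: Experiments show that TRM substantially improves the identification of incorrect sentences, and that using TRM in policy optimization yields significant gains in answer correctness and usefulness.
Insight: On complex tasks, faithfulness to external knowledge alone is not enough; adding critical thinking and reasoning markedly improves correctness and usefulness.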
Abstract: Through reinforcement learning with verifiable rewards (RLVR), large language models have achieved substantial progress in domains with easily verifiable outcomes, such as mathematics and coding. However, when applied to more complex tasks like open-domain question answering, RLVR faces significant challenges due to the difficulty of verifying correctness. The nuanced and ambiguous nature of real-world knowledge makes it difficult to reliably evaluate correctness in these settings, necessitating further abilities that extend beyond mere logical consistency to encompass an understanding and assessment of both external and internal knowledge. Recent work has primarily focused on improving faithfulness, defined as semantic alignment with supporting documents, which can cause models to rely excessively on external sources and diminish their capacity for critical assessment. To address this, we propose the Thinking-supervised Reward Model (TRM), which incorporates sentence-level thinking supervision to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness. By structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and leverage both external and internal knowledge. Experiments on reward signals demonstrate that TRM substantially improves the identification of incorrect sentences, and incorporating TRM into policy optimization leads to significant gains in both answer correctness and usefulness.
[3] MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Huu Nguyen,Victor May,Harsh Raj,Marianna Nezhurina,Yishan Wang,Yanqi Luo,Minh Chien Vu,Taishi Nakamura,Ken Tsui,Van Khue Nguyen,David Salinas,Aleksandra Krasnodębska,Christoph Schuhmann,Mats Leon Richter,Xuan-Son Vu,Jenia Jitsev
Main category: cs.CL
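TL;DR: MixtureVitae is an open pretraining dataset built to reduce legal risk while still delivering strong model performance. It combines public-domain and permissively licensed text with targeted instruction data and synthetic data, and achieves competitive results.
Details
Motivation: Pretraining data often relies on large-scale web scraping, which carries legal and ethical risks. MixtureVitae aims to build a high-performing pretraining dataset from legally compliant sources.
Contribution: Presents MixtureVitae, an open pretraining dataset designed to minimize legal risk, and releases a transparent pipeline for data filtering, processing, and mixing.
Method: A multi-stage pipeline performs license-aware filtering, safety and quality screening, and domain-aware mixing over public-domain data, permissively licensed data, and synthetic data.
Result: Across a suite of standard benchmarks, models trained on MixtureVitae outperform those trained on other permissive datasets, with particularly strong results on math/code and competitive results on QA tasks.
Insight: Legally compliant data sources are a viable basis for training capable language models and reduce reliance on indiscriminate web scraping.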
Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
[4] Calibrating Verbalized Confidence with Self-Generated Distractors
Victor Wang,Elias Stengel-Eskin
Main category: cs.CL
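TL;DR: This paper proposes DINCO, a method for calibrating the verbalized confidence of large language models (LLMs). It uses self-generated distractors to reduce overconfidence, improving trust and safety.
Details
Motivation: Confidence scores verbalized by LLMs are usually miscalibrated: the model reports high confidence even where accuracy is low, which harms user trust. The authors hypothesize that this overconfidence stems from heightened suggestibility on claims the model encodes little information about.
Contribution: 1. Validates the hypothesis that LLMs are more suggestible on lower-accuracy claims; 2. Proposes DINCO, which normalizes confidence over distractors; 3. Uses generator-validator disagreement to further improve calibration.
Method: DINCO has the model verbalize its confidence independently on several self-generated distractors and normalizes by the total verbalized confidence to reduce the effect of suggestibility. It also augments the normalized validator confidence with a consistency-based estimate of generator confidence.
Result: DINCO produces less saturated confidence estimates, and DINCO at 10 inference calls outperforms self-consistency at 100 calls.
Insight: 1. LLM overconfidence is linked to how little information the model encodes about a claim; 2. Distractor normalization is an effective way to calibrate confidence; 3. Generator-validator disagreement provides a complementary dimension for calibration.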
Abstract: Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM’s heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM’s suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated – and therefore more usable – confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
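The normalization step described in the abstract can be sketched as follows. This is a minimal illustration that assumes the verbalized confidences are already available as floats in [0, 1]; the function name and the example numbers are hypothetical, and the sketch divides by the raw total without any further handling the paper may apply (for example when the total is below 1, or the generator-side consistency term).

```python
from typing import Sequence

def normalized_confidence(claim_conf: float, distractor_confs: Sequence[float]) -> float:
    """Divide the confidence in the main claim by the total confidence
    verbalized across the claim and its mutually incompatible distractors."""
    total = claim_conf + sum(distractor_confs)
    if total <= 0.0:
        return 0.0  # the model expressed no confidence in any candidate
    return claim_conf / total

# Hypothetical verbalized confidences: the model says 0.9 for its own answer
# but also 0.8 and 0.7 for two incompatible alternatives.
score = normalized_confidence(0.9, [0.8, 0.7])
print(round(score, 3))  # 0.375
```

A suggestible model that endorses every candidate ends up with a much lower normalized score than its raw 0.9, which is the intended correction.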
[5] Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
Zhiling Ye,Yun Yue,Haowen Wang,Xudong Han,Jiadi Jiang,Cheng Wei,Lei Fan,Jiaxin Liang,Shuowen Zhang,Ji Li,Chunxiao Guo,Jian Wang,Peng Wei,Jinjie Gu
Main category: cs.CL
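TL;DR: This paper proposes Self-Rewarding Rubric-Based Reinforcement Learning, a framework for improving language models on open-ended reasoning tasks. The model acts as its own grader and produces rubric-based reward signals, which substantially improves its reasoning ability and also makes it a stronger grader.
Details
Motivation: Open-ended evaluation is essential for deploying language models in real-world settings. On HealthBench, the authors observe that using the model itself as the grader and training on rubric-based reward signals substantially improves both reasoning and grading ability.
Contribution: 1. Proposes a lightweight self-rewarding rubric-based reinforcement learning framework for open-ended reasoning;
2. Shows experimentally that the framework trains faster and more resource-efficiently while surpassing baselines;
3. On Qwen3-32B, training on only the 4000-sample HealthBench Easy subset yields a model that exceeds GPT-5 on HealthBench Hard.
Method: 1. The model itself acts as the grader and generates rubric-based reward signals;
2. These rewards drive training within a reinforcement learning framework;
3. A small amount of teacher-graded data further improves less capable models.
Result: On Qwen3-32B, training with only the 4000-sample HealthBench Easy subset produces a model that exceeds GPT-5 on HealthBench Hard. Adding a small amount of teacher-graded data brings further gains for less capable models.
Insight: 1. The model itself can serve as an effective grader;
2. A small amount of high-quality data is enough to improve performance substantially;
3. Rubric design is critical to improving the model's reasoning ability.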
Abstract: Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Remarkably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
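A rubric-based reward of the kind described above can be sketched as below. The scoring rule (earned points over the maximum achievable positive points, clipped to [0, 1]) follows the style of rubric benchmarks such as HealthBench but is an assumption here, as are the class name, field names, and example rubric; in the self-rewarding setup the `met` verdicts would come from the policy model acting as grader.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RubricItem:
    criterion: str
    points: float  # negative points penalize undesirable behaviour
    met: bool      # grader verdict (hypothetically produced by the model itself)

def rubric_reward(items: List[RubricItem]) -> float:
    """Earned points divided by the maximum positive points, clipped to [0, 1]."""
    max_points = sum(item.points for item in items if item.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(item.points for item in items if item.met)
    return min(1.0, max(0.0, earned / max_points))

# Hypothetical rubric for a single response.
items = [
    RubricItem("Recommends seeking urgent care", 5.0, True),
    RubricItem("Asks about symptom duration", 3.0, False),
    RubricItem("States an unsupported dosage", -2.0, True),
]
reward = rubric_reward(items)
print(reward)  # 0.375
```

The scalar returned here is what a reinforcement learning loop would consume as the reward for the sampled response.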
[6] Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model
Fahim Faisal,Kaiqiang Song,Song Wang,Simin Ma,Shujian Liu,Haoyun Deng,Sathish Reddy Indurthi
Main category: cs.CL
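TL;DR: The PB-RLSVR framework uses a high-resource English LLM to generate reference responses and rewards a multilingual model for semantic equivalence with them, substantially improving multilingual reasoning without annotated data in the target languages.
Details
Motivation: The reasoning gains from existing reinforcement learning methods are largely confined to English, leaving a significant performance gap for other languages.
Contribution: Proposes PB-RLSVR, which uses a high-performing English LLM as a “pivot” model and transfers its reasoning ability across languages through semantic-equivalence rewards.
Method: An English LLM generates reference responses; cross-lingual semantic reward functions (for example, based on embeddings or machine translation) are then used to train the multilingual model.
Result: PB-RLSVR improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively.
Insight: Semantic-equivalence rewards derived from a high-resource language model are an efficient way to improve multilingual reasoning.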
Abstract: While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a “pivot” model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model’s reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
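One of the reward families mentioned above, an embedding-based semantic reward, might look like the following sketch. It assumes both the target-language response and the English pivot reference have already been encoded by some multilingual sentence encoder; the encoder, the rescaling to [0, 1], and the toy vectors are all assumptions and not taken from the paper.

```python
import numpy as np

def semantic_reward(response_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """Cosine similarity between two sentence embeddings, rescaled to [0, 1]."""
    denom = np.linalg.norm(response_emb) * np.linalg.norm(reference_emb)
    if denom == 0.0:
        return 0.0  # degenerate embedding, no usable signal
    cosine = float(np.dot(response_emb, reference_emb) / denom)
    return 0.5 * (cosine + 1.0)

# Toy vectors standing in for embeddings of a response and the pivot reference.
reference = np.array([1.0, 0.0, 0.0])
same_meaning = np.array([2.0, 0.0, 0.0])
unrelated = np.array([0.0, 1.0, 0.0])

r_same = semantic_reward(same_meaning, reference)
r_unrelated = semantic_reward(unrelated, reference)
print(r_same, r_unrelated)  # 1.0 0.5
```

A translation-based variant would instead translate the response into English and compare text against the reference; that path is not sketched here.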
[7] Probing the Limits of Stylistic Alignment in Vision-Language Models
Asma Farajidizaji,Akash Gupta,Vatsal Raina
Main category: cs.CL
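TL;DR: This paper studies the data efficiency of aligning small vision-language models to humorous and romantic styles, probes the limits of these models in the zero-shot setting, and offers a way to benchmark their capabilities and limitations.
Details
Motivation: Vision-language models struggle to generate image captions in a specific style (such as humorous or romantic), especially in the zero-shot setting. Preference data for stylistic alignment is expensive to acquire, which limits how far the models' capabilities can be explored.
Contribution: By studying the data efficiency of humor and romance alignment, the paper defines the performance limits of small vision-language models and determines how little preference data is needed to reach stylistic saturation.
Method: Small vision-language models are aligned to the target styles with preference data, and alignment quality is studied at different data volumes to assess the models' capabilities and limitations.
Result: Even small vision-language models can be aligned reasonably well to a specific style with limited preference data, but performance saturates.
Insight: The paper exposes both the potential and the limits of vision-language models on stylistic alignment and provides a data-efficiency benchmark for future work.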
Abstract: Vision-language models are increasingly used to generate image captions in specific styles, such as humor or romantic. However, these transformer-based models often struggle with this subjective task in a zero-shot setting. While preference data can be used to align them toward a desired style, such data is expensive to acquire, limiting the ability to explore the models’ full capabilities. This work addresses this by studying the data efficiency of aligning small vision-language models to humor and romantic styles. This approach helps to define the performance limits of these models and determine how little preference data is needed to achieve stylistic saturation, benchmarking their capabilities and limitations.
[8] RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance
Tianlang Chen,Minkai Xu,Jure Leskovec,Stefano Ermon
Main category: cs.CL
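TL;DR: This paper proposes RFG (Reward-Free Guidance), a method for guiding the reasoning trajectory of diffusion large language models (dLLMs) without an explicit process reward. RFG parameterizes the reward implicitly through the likelihood ratio of an enhanced model and a reference model, and significantly improves performance on complex tasks.
Details
Motivation: Autoregressive language models typically rely on a densely annotated process reward model to guide reasoning steps, but this does not carry over to dLLMs, whose generation is any-order and whose intermediate states are partially masked. A guidance method that needs no explicit reward is therefore required.
Contribution: 1. Proposes RFG, which guides the reasoning trajectory of dLLMs without an external reward model; 2. Shows theoretically that RFG induces the reward-guided sampling distribution; 3. Validates the generality and effectiveness of RFG on several challenging tasks.
Method: RFG represents the process reward implicitly through the log-likelihood ratio of the enhanced and reference dLLMs. The enhanced model can be any dLLM post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT), so no additional reward annotation is needed.
Result: On four mathematical reasoning and code generation benchmarks, RFG significantly improves dLLM performance, with accuracy gains of up to 9.2%, across a variety of enhanced dLLMs.
Insight: RFG is a training-free framework that needs no external reward model and suits reasoning-time optimization of diffusion models, offering a new route to better performance on complex tasks.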
Abstract: Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is an increasing interest in further improving the capacity to solve complex problems by guiding the reasoning process step by step. Common practice for autoregressive language models typically learns a process reward model with dense annotation for each intermediate step. However, this is challenging for dLLMs where the generation is in an any-order fashion and intermediate states are partially masked sentences. To this end, in this paper, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without explicit process reward. The key idea of RFG is to parameterize the process reward by log-likelihood ratios of the enhanced and reference dLLMs, where the enhanced model can be easily obtained by any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without reliance on external reward models.
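The key idea above (a process reward given by the log-likelihood ratio of the enhanced and reference models) can be illustrated for a single masked position as follows. This is a sketch under stated assumptions: the guidance weight `w`, the per-position application, and the toy logits are illustrative choices, and the paper's actual sampler for dLLMs is not reproduced here.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def guided_log_probs(ref_logits: np.ndarray, enh_logits: np.ndarray, w: float) -> np.ndarray:
    """Tilt the reference distribution by w times the implicit reward
    r = log p_enhanced - log p_reference, then renormalize."""
    ref_lp = log_softmax(ref_logits)
    enh_lp = log_softmax(enh_logits)
    implicit_reward = enh_lp - ref_lp
    return log_softmax(ref_lp + w * implicit_reward)

# Toy logits over a 3-token vocabulary at one masked position.
ref_logits = np.array([0.0, 0.0, 0.0])
enh_logits = np.array([1.0, 0.0, -1.0])

p_ref = np.exp(guided_log_probs(ref_logits, enh_logits, w=0.0))
p_enh = np.exp(guided_log_probs(ref_logits, enh_logits, w=1.0))
p_strong = np.exp(guided_log_probs(ref_logits, enh_logits, w=2.0))
print(p_ref, p_enh, p_strong)
```

With `w = 0` the reference distribution is recovered, with `w = 1` the enhanced one, and with `w > 1` the distribution is pushed further in the direction the post-training moved it.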
[9] Transformers through the lens of support-preserving maps between measures
Takashi Furuya,Maarten V. de Hoop,Matti Lassas
Main category: cs.CL
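TL;DR: This paper analyzes the expressivity of transformers from the viewpoint of probability measures, proves that they can approximate maps with any continuous in-context map, and establishes a connection to the Vlasov equation.
Details
Motivation: To study the expressivity of transformers when handling an arbitrary number of context tokens, and to analyze their mathematical properties uniformly within the framework of probability measures.
Contribution: 1. Fully characterizes the properties of maps between measures that transformers can represent; 2. Proves that transformers universally approximate representations with any continuous in-context map; 3. Establishes the connection between transformers and the Vlasov equation.
Method: Uses properties of probability measures and the Fréchet derivative to analyze the mapping capability of transformers, and compares it with the solution map of the Vlasov equation.
Result: Transformers can represent maps that preserve the cardinality of support and whose Fréchet derivative has a uniformly continuous regular part, and they can approximate the solution map of the Vlasov equation.
Insight: Transformer expressivity can be understood in a unified way through probability measures, and the link to the Vlasov equation offers a new perspective on their dynamics.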
Abstract: Transformers are deep architectures that define “in-context maps” which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In previous work, we studied the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly analyze their expressivity, we considered the case that the mappings are conditioned on a context represented by a probability distribution which becomes discrete for a finite number of tokens. Modeling neural networks as maps on probability measures has multiple applications, such as studying Wasserstein regularity, proving generalization bounds and doing a mean-field limit analysis of the dynamics of interacting particles as they go through the network. In this work, we study the question what kind of maps between measures are transformers. We fully characterize the properties of maps between measures that enable these to be represented in terms of in-context maps via a push forward. On the one hand, these include transformers; on the other hand, transformers universally approximate representations with any continuous in-context map. These properties are preserving the cardinality of support and that the regular part of their Fréchet derivative is uniformly continuous. Moreover, we show that the solution map of the Vlasov equation, which is of nonlocal transport type, for interacting particle systems in the mean-field regime for the Cauchy problem satisfies the conditions on the one hand and, hence, can be approximated by a transformer; on the other hand, we prove that the measure-theoretic self-attention has the properties that ensure that the infinite depth, mean-field measure-theoretic transformer can be identified with a Vlasov flow.
[10] The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale
Samar Haider,Amir Tohidi,Jenny S. Wang,Timothy Dörr,David M. Rothschild,Chris Callison-Burch,Duncan J. Watts
Main category: cs.CL
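TL;DR: This paper presents the Media Bias Detector, a framework for annotating and analyzing news content at scale in order to study selection and framing bias in news media.
Details
Motivation: Mainstream news organizations shape public perception through which topics they cover and how they frame them, but measuring these subtle forms of media bias at scale remains difficult.
Contribution: 1) Provides a continuously updated, near-real-time news dataset and computational framework; 2) Combines large language models (LLMs) with news scraping to extract structured annotations; 3) Releases an interactive web platform for exploring the data.
Method: Uses LLMs and near-real-time news scraping to extract annotations along several dimensions, including political lean, tone, and topics, and quantifies them at the sentence, article, and publisher levels.
Result: Examples drawn from the 150,000+ articles examined in 2024 reveal patterns in news coverage and bias, supporting academic research and media accountability.
Insight: The framework offers a scalable methodology for studying bias in the modern news landscape and provides empirical resources for improving media transparency.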
Abstract: Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations – including political lean, tone, topics, article type, and major events – across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels – the sentence level, the article level, and the publisher level – expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel data set can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.
[11] Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities
Jiayi Kuang,Haojing Huang,Yinghui Li,Xinnian Liang,Zhikun Xu,Yangning Li,Xiaoyu Tan,Chao Qu,Meishan Zhang,Ying Shen,Philip S. Yu
Main category: cs.CL
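TL;DR: This paper proposes a new paradigm that decomposes the mathematical reasoning ability of large language models (LLMs) into atomic capabilities, analyzes them along two dimensions (field-specific and logical abilities), and uses experiments to reveal how the atomic capabilities interact and affect model performance.
Details
Motivation: Current LLM performance on mathematical reasoning depends on training data with large-scale, diverse problems and long reasoning chains, and it is unclear whether models genuinely acquire mathematical concepts and reasoning principles. Inspired by how humans break complex problems into atomic capabilities, the authors propose a new way to evaluate mathematical atomic capabilities.
Contribution: 1. Proposes a taxonomy of mathematical atomic capabilities covering four fields (algebra, geometry, analysis, topology) and logical abilities at several levels; 2. Builds training and evaluation datasets for each atomic capability and validates them experimentally; 3. Reveals interactions between atomic capabilities and what they imply for model performance.
Method: Decomposes mathematical ability into atomic capabilities (field-specific and logical), builds matching training and evaluation datasets, and analyzes how different atomic capabilities influence one another.
Result: Models perform differently across atomic capabilities, and the atomic capabilities interact with each other.
Insight: Decomposing mathematical ability into atomic capabilities helps clarify model cognition and suggests more efficient, transferable training strategies.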
Abstract: Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of “atomic thinking”.
[12] CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling
Mingyu Chen,Jingkai Lin,Zhaojie Chu,Xiaofen Xing,Yirong Chen,Xiangmin Xu
Main category: cs.CL
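TL;DR: CATCH is a new data synthesis framework that aims to improve therapy fidelity and logical coherence in AI counseling. It does so through a Progressive Dialogue Synthesis strategy and a Memory-Driven Dynamic Planning thinking pattern.
Details
Motivation: Existing studies synthesize multi-turn dialogue samples in a single generation pass, which results in low therapy fidelity and fails to capture the decision-making rationale behind each response.
Contribution: Proposes the CATCH framework, comprising the Progressive Dialogue Synthesis strategy and the Memory-Driven Dynamic Planning thinking pattern, which significantly improves the quality of AI counseling.
Method: 1. Progressive Dialogue Synthesis: extracts goals, resources, and solutions from the client's self-report, organizes them into structured outlines, and incrementally generates stage-aligned dialogues;
2. Memory-Driven Dynamic Planning: integrates memory enhancement, global planning, and strategy reasoning, and attaches an explicit chain-of-thought to each dialogue turn.
Result: Experiments and human evaluations show that CATCH significantly improves therapy fidelity and logical coherence.
Insight: Structured dialogue generation combined with explicit chain-of-thought simulates the human counseling process more effectively.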
Abstract: Recently, advancements in AI counseling based on large language models have shown significant progress. However, existing studies employ a one-time generation approach to synthesize multi-turn dialogue samples, resulting in low therapy fidelity and failing to capture the decision-making rationale behind each response. In this work, we propose CATCH, a novel data synthesis framework designed to address these challenges. Specifically, to improve therapy fidelity, we introduce the Progressive Dialogue Synthesis strategy, which extracts goals, resources, and solutions from a client’s self-report, organizes them into structured outlines, and then incrementally generates stage-aligned counseling dialogues. To capture decision-making rationale behind each response, we propose the Memory-Driven Dynamic Planning thinking pattern that integrates memory enhancement, global planning, and strategy reasoning; a collaborative multi-agent optimizer then leverages MDP to attach explicit chain-of-thought to each dialogue turn. Extensive experiments and human evaluations demonstrate that CATCH significantly enhances fidelity and logical coherence in AI counseling.
[13] Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications
Chenhua Shi,Gregor Macdonald,Bhavika Jalli,Wanlu Lei,John Zou,Mridul Jain,Joji Philip
Main category: cs.CL
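TL;DR: This paper proposes a multi-stage framework that automatically generates high-quality, domain-specific question-answer (QA) pairs for fine-tuning large language models (LLMs), reducing the need for human annotation, with a focus on telecommunications tasks.
Details
Motivation: Manually annotating large-scale, high-quality instruction and reinforcement learning data is time-consuming and requires expert knowledge for domain-specific tasks such as telecom network troubleshooting, so the authors propose an automated solution.
Contribution: Proposes a fully automated, retrieval-augmented synthetic data generation framework that integrates a retriever, a base generator, and a refinement model, and uses RAGAS-based scoring to ensure data quality.
Method: A multi-stage framework: retrieve documents from a domain-specific knowledge graph, generate initial QA pairs, enhance them with a refinement model, and finally filter for high-quality samples using RAGAS-based scores.
Result: On telecom RAN troubleshooting, the pipeline generates complex, context-rich troubleshooting solution plans without human intervention.
Insight: The method offers a scalable way to build instruction and reinforcement learning datasets for specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.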
Abstract: The success of large language models (LLMs) depends heavily on large-scale, high-quality instruction-following and reinforcement datasets. However, generating such data through human annotation is prohibitively time-consuming particularly for domain-specific tasks like telecom network troubleshooting, where accurate responses require deep technical expertise and contextual understanding. In this paper, we present a fully automated, retrieval-augmented pipeline for generating synthetic question-answer (QA) pairs grounded in structured domain knowledge. Our multi-stage framework integrates a retriever, base generator, and refinement model to synthesize and enhance QA pairs using documents retrieved from a domain-specific knowledge graph. To ensure data quality, we employ customized RAGAS-based scoring to filter low-quality samples, producing a high-quality dataset suitable for reinforcement fine-tuning (RFT). We demonstrate our approach in a real-world telecom scenario focused on radio access network (RAN) troubleshooting. The resulting pipeline generates complex, context-rich troubleshooting solution plans without human intervention. This work offers a scalable solution for building instruction and reinforcement datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.
[14] TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Zhepei Wei,Xiao Yang,Kai Sun,Jiaqi Wang,Rulin Shao,Sean Chen,Mohammad Kachuee,Teja Gollapudi,Tony Liao,Nicolas Scheffer,Rakesh Wanga,Anuj Kumar,Yu Meng,Wen-tau Yih,Xin Luna Dong
Main category: cs.CL
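TL;DR: TruthRL is a reinforcement learning framework that directly optimizes the truthfulness of large language models (LLMs). It uses a ternary reward that distinguishes correct answers, hallucinations, and abstentions, significantly reducing hallucinations and improving truthfulness.
Details
Motivation: LLMs are prone to hallucination and untruthful responses on factual questions, especially when a task demands information outside their parametric knowledge. Existing methods trade off between optimizing accuracy and encouraging abstention, and both extremes compromise truthfulness.
Contribution: Proposes the TruthRL framework, which directly optimizes LLM truthfulness through reinforcement learning with a ternary reward, significantly reducing hallucinations and improving truthfulness.
Method: Uses GRPO (a reinforcement learning algorithm) with a ternary reward that rewards correct answers, penalizes hallucinations, and encourages abstention when the model is uncertain.
Result: Across four knowledge-intensive benchmarks, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1% compared to vanilla RL, with consistent gains across backbone models (e.g., Qwen, Llama).
Insight: Learning objective design is key to developing truthful LLMs; TruthRL's results indicate that optimizing truthfulness directly balances accuracy and uncertainty better than accuracy-driven methods.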
Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy – models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. In-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
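The ternary reward and its use inside a GRPO-style update can be sketched as follows. The specific reward values (+1, 0, -1) and the outcome labels are assumptions chosen for illustration, since the abstract does not state them; the group-relative normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO advantage computation.

```python
from statistics import mean, pstdev
from typing import List

# Assumed reward values; the abstract only says the reward is ternary.
TERNARY_REWARD = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}

def ternary_reward(outcome: str) -> float:
    """Map a judged outcome for one sampled response to a scalar reward."""
    return TERNARY_REWARD[outcome]

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one group of samples
    drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical responses sampled for one question.
outcomes = ["correct", "abstain", "hallucination", "abstain"]
rewards = [ternary_reward(o) for o in outcomes]
advantages = group_advantages(rewards)
print(rewards, [round(a, 2) for a in advantages])
```

Unlike a binary reward, which would score abstentions and hallucinations identically, this ordering gives abstention a higher advantage than a hallucinated answer within the same group.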
[15] Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches
Obed Junias,Prajakta Kini,Theodora Chaspari
Main category: cs.CL
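TL;DR: This paper studies algorithmic bias in language-based automated depression detection, compares DNN and LLM approaches on both performance and fairness, and evaluates several bias mitigation strategies.
Details
Motivation: To examine algorithmic bias in language models for depression detection, in particular socio-demographic disparities related to gender and race/ethnicity, in order to improve model fairness.
Contribution: 1) Compares the performance and bias of DNNs and LLMs in depression detection; 2) Proposes and evaluates several bias mitigation methods, such as fairness-aware loss functions and prompt engineering.
Method: Uses DNN-based embedding models and few-shot learning with LLMs, applying fairness-aware loss functions (worst-group loss and fairness-regularized loss) to the DNNs and varied prompting strategies (such as ethical framing) to the LLMs.
Result: LLMs outperform DNN-based models in depression classification, particularly for underrepresented groups such as Hispanic participants. LLMs show less gender bias, but racial disparities persist. Among the DNN techniques, worst-group loss gives the better balance between performance and fairness.
Insight: Guided prompting with ethical framing helps mitigate gender bias in LLMs in the 1-shot setting but has little effect on racial bias; increasing the number of shots (N-shot) does not further reduce disparities, which points to the complexity of the bias problem.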
Abstract: This paper investigates algorithmic bias in language-based models for automated depression detection, focusing on socio-demographic disparities related to gender and race/ethnicity. Models trained using deep neural networks (DNN) based embeddings are compared to few-shot learning approaches with large language models (LLMs), evaluating both performance and fairness on clinical interview transcripts from the Distress Analysis Interview Corpus/Wizard-of-Oz (DAIC-WOZ). To mitigate bias, fairness-aware loss functions are applied to DNN-based models, while in-context learning with varied prompt framing and shot counts is explored for LLMs. Results indicate that LLMs outperform DNN-based models in depression classification, particularly for underrepresented groups such as Hispanic participants. LLMs also exhibit reduced gender bias compared to DNN-based embeddings, though racial disparities persist. Among fairness-aware techniques for mitigating bias in DNN-based embeddings, the worst-group loss, which is designed to minimize loss for the worst-performing demographic group, achieves a better balance between performance and fairness. In contrast, the fairness-regularized loss minimizes loss across all groups but performs less effectively. In LLMs, guided prompting with ethical framing helps mitigate gender bias in the 1-shot setting. However, increasing the number of shots does not lead to further reductions in disparities. For race/ethnicity, neither prompting strategy nor increasing $N$ in $N$-shot learning effectively reduces disparities.
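The worst-group loss mentioned above can be sketched in a few lines: compute the mean loss per demographic group and optimize the largest one. This is a minimal NumPy illustration with made-up per-sample losses and group labels; the paper's actual training code, loss function, and grouping are not given in the abstract.

```python
import numpy as np

def worst_group_loss(per_sample_loss: np.ndarray, group_ids: np.ndarray) -> float:
    """Return the mean loss of the worst-performing group."""
    group_means = [
        per_sample_loss[group_ids == g].mean() for g in np.unique(group_ids)
    ]
    return float(max(group_means))

# Hypothetical per-sample losses and group labels (two demographic groups).
losses = np.array([0.2, 0.4, 1.0, 0.8])
groups = np.array([0, 0, 1, 1])

wg = worst_group_loss(losses, groups)
avg = float(losses.mean())
print(wg, avg)
```

The average loss hides the gap between groups, whereas the worst-group objective focuses each update on the group with the highest mean loss.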
[16] RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models
Dragos-Dumitru Ghinea,Adela-Nicoleta Corbeanu,Adrian-Marius Dumitran
Main category: cs.CL
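TL;DR: This paper introduces RoBiologyDataChoiceQA, a Romanian-language multiple-choice biology dataset for evaluating and improving the comprehension and reasoning of large language models (LLMs) in scientific domains.
Details
Motivation: LLM performance in domain-specific applications and non-English languages remains underexplored, particularly for low-resource languages and specialized-knowledge tasks. Contribution: A dataset of approximately 14,000 Romanian biology questions, together with an evaluation of several LLMs on it.
Method: The authors design the dataset, benchmark several LLMs, and analyze model performance under prompt engineering, fine-tuning, and other optimization techniques.
Result: Current LLMs show both strengths and limitations on specialized-knowledge tasks in low-resource languages.
Insight: The study offers a reference point for future LLM development on low-resource languages and domain-specific tasks.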
Abstract: In recent years, large language models (LLMs) have demonstrated significant potential across various natural language processing (NLP) tasks. However, their performance in domain-specific applications and non-English languages remains less explored. This study introduces a novel Romanian-language dataset for multiple-choice biology questions, carefully curated to assess LLM comprehension and reasoning capabilities in scientific contexts. Containing approximately 14,000 questions, the dataset provides a comprehensive resource for evaluating and improving LLM performance in biology. We benchmark several popular LLMs, analyzing their accuracy, reasoning patterns, and ability to understand domain-specific terminology and linguistic nuances. Additionally, we perform comprehensive experiments to evaluate the impact of prompt engineering, fine-tuning, and other optimization techniques on model performance. Our findings highlight both the strengths and limitations of current LLMs in handling specialized knowledge tasks in low-resource languages, offering valuable insights for future research and development.
[17] ReTAG: Retrieval-Enhanced, Topic-Augmented Graph-Based Global Sensemaking
Boyoung Kim,Dosung Lee,Sumin An,Jinseong Jeong,Paul Hongsuck Seo
Main category: cs.CL
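TL;DR: ReTAG is a graph framework that combines retrieval enhancement and topic augmentation to address missing retrieval, lack of topic specificity, and high inference cost in global sensemaking.
Details
Motivation: Existing graph-based methods for global sensemaking lack retrieval mechanisms and topic specificity, and incur high inference costs. Contribution: The ReTAG framework, which combines retrieval and topic augmentation, constructing topic-specific subgraphs and retrieving relevant content to improve response quality.
Method: Topic-specific subgraphs are constructed, and the relevant summaries are retrieved for response generation.
Result: Experiments show that ReTAG improves response quality while significantly reducing inference time.
Insight: Combining retrieval with topic augmentation improves both performance and efficiency on global sensemaking tasks.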
Abstract: Recent advances in question answering have led to substantial progress in tasks such as multi-hop reasoning. However, global sensemaking, that is, answering questions by synthesizing information from an entire corpus, remains a significant challenge. A prior graph-based approach to global sensemaking lacks retrieval mechanisms and topic specificity, and incurs high inference costs. To address these limitations, we propose ReTAG, a Retrieval-Enhanced, Topic-Augmented Graph framework that constructs topic-specific subgraphs and retrieves the relevant summaries for response generation. Experiments show that ReTAG improves response quality while significantly reducing inference time compared to the baseline. Our code is available at https://github.com/bykimby/retag.
[18] Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer
Jaeyoung Kim,Jongho Lee,Hongjun Choi,Sion Jang
Main category: cs.CL
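TL;DR: The paper studies personalized figure caption generation using author profile data from scientific papers. Rich author profiles combined with metadata significantly improve the personalization of multimodal large language models (MLLMs), but there is a trade-off between matching author style and caption quality.
Details
Motivation: Scientific figure captions are usually written by the authors and carry a distinctive writing style. The study uses author profile data to generate personalized captions and so make automated captioning systems more practical. Contribution: (1) shows that author profile data significantly affects personalized caption generation; (2) reveals a fundamental trade-off between matching author style and caption quality; (3) offers directions for practical automated systems that balance the two.
Method: Multimodal LLMs are combined with author profile data, and experiments analyze how the data affects personalized caption generation.
Result: Rich author profile data significantly improves personalization performance, at the cost of a trade-off between style matching and quality.
Insight: Personalized caption generation must balance author style against content quality; future work should focus on optimizing that balance.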
Abstract: We study personalized figure caption generation using author profile data from scientific papers. Our experiments demonstrate that rich author profile data, combined with relevant metadata, can significantly improve the personalization performance of multimodal large language models. However, we also reveal a fundamental trade-off between matching author style and maintaining caption quality. Our findings offer valuable insights and future directions for developing practical caption automation systems that balance both objectives. This work was conducted as part of the 3rd SciCap challenge.
[19] Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Shuyang Jiang,Yusheng Liao,Ya Zhang,Yanfeng Wang,Yu Wang
Main category: cs.CL
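TL;DR: The DECS framework addresses "overthinking" in large reasoning models through decoupled rewards and curriculum data scheduling, substantially reducing reasoning tokens while maintaining or improving performance.
Details
Motivation: Large reasoning models trained with RLVR (critic-free reinforcement learning with verifiable rewards) suffer from "overthinking": they generate long reasoning paths with no performance gain. Existing methods that penalize length work poorly because trajectory-level rewards are misaligned with token-level optimization. Contribution: (1) the DECS framework, built on a theoretical analysis that identifies and addresses two flaws in current length rewards; (2) a decoupled token-level reward mechanism and a curriculum batch scheduling strategy.
Method: (1) a decoupled reward mechanism that precisely penalizes redundant tokens; (2) a curriculum batch scheduling strategy that balances efficiency and efficacy.
Result: Across seven benchmarks, DECS reduces reasoning tokens by more than 50% while maintaining or improving performance.
Insight: Carefully designed reward and scheduling strategies can substantially improve efficiency without harming a model's reasoning ability.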
Abstract: While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework’s innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model’s underlying reasoning power.
[20] Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
Keyu He,Tejas Srinivasan,Brihi Joshi,Xiang Ren,Jesse Thomason,Swabha Swayamdipta
Main category: cs.CL
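TL;DR: The paper proposes two explanation quality scoring functions, Visual Fidelity and Contrastiveness, to help users judge when vision-language model (VLM) predictions are reliable and to prevent overreliance on misleading explanations. Experiments show that the scores significantly improve users' ability to judge whether a prediction is correct.
Details
Motivation: VLM explanations can lead users to trust incorrect predictions, especially when the visual context is not visible to them (e.g., for blind or low-vision users). The paper addresses this by quantifying explanation quality. Contribution: (1) two explanation quality scoring functions, Visual Fidelity and Contrastiveness; (2) evidence that these scores are better calibrated with model correctness than existing explanation qualities; (3) a user study showing that the scores significantly improve users' ability to judge prediction correctness (accuracy up 11.1%, false belief in incorrect predictions down 15.4%).
Method: (1) Visual Fidelity measures how faithful an explanation is to the visual context. (2) Contrastiveness measures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. (3) Both scoring functions are evaluated on the A-OKVQA and VizWiz tasks. (4) A user study tests their effect on user judgments.
Result: (1) On A-OKVQA and VizWiz, the proposed scoring functions are better calibrated with model correctness than existing explanation quality measures. (2) In the user study, showing the scores improves participants' accuracy at predicting VLM correctness by 11.1% and reduces the rate of falsely believing incorrect predictions by 15.4%.
Insight: Explanation quality scores can significantly reduce user reliance on incorrect VLM predictions, especially when the visual context is unavailable, which improves the transparency and practical usefulness of the model.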
Abstract: When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model’s prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants’ accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
[21] ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Yindong Wang,Martin Preiß,Margarita Bugueño,Jan Vincent Hoffbauer,Abdullatif Ghajar,Tolga Buz,Gerard de Melo
Main category: cs.CL
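TL;DR: ReFACT is a benchmark dataset for detecting scientific confabulation by large language models (LLMs). It contains 1,001 expert-annotated question-answer pairs and supports multi-stage evaluation (detection, localization, and correction). Tests show limited LLM performance: even top models such as GPT-4o struggle to distinguish confabulated from factual scientific answers, which underscores the need for fine-grained, human-validated benchmarks.
Details
Motivation: LLMs frequently confabulate scientific facts, which undermines their reliability. Existing evaluation is mostly limited to binary factuality and lacks fine-grained analysis tools. Contribution: The ReFACT benchmark, which supports multi-stage evaluation (detection, localization, and correction) and provides precise error spans and error types, filling the gap in fine-grained evaluation for scientific domains.
Method: A dataset of 1,001 expert-annotated question-answer pairs is built, each pairing a correct answer with a confabulated one and annotating error spans and error types. Nine state-of-the-art LLMs are benchmarked.
Result: LLMs perform poorly at detecting scientific confabulation (about 50% accuracy); even GPT-4o fails to distinguish confabulated from factual scientific answers.
Insight: LLM reliability in scientific domains remains clearly insufficient; fine-grained, human-validated benchmarks are needed to improve evaluation quality.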
Abstract: Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question–answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (~50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is released on GitHub: https://github.com/ddz5431/ReFACT.
[22] Mem-α: Learning Memory Construction via Reinforcement Learning
Yu Wang,Ryuichi Takanobu,Zhiqi Liang,Yuzhen Mao,Yuanzhe Hu,Julian McAuley,Xiaojian Wu
Main category: cs.CL
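TL;DR: Mem-α is a reinforcement learning framework that trains LLM agents to manage complex memory systems through interaction and feedback, significantly improving their handling of long-sequence information.
Details
Motivation: LLM agents must handle long-term information within limited context windows, and reliance on pre-defined instructions and tools for memory updates leads to suboptimal memory construction and information loss. Contribution: The Mem-α framework, which optimizes memory construction via reinforcement learning; a specialized training dataset and a multi-component memory architecture; agents that generalize to sequences more than 13 times the training length.
Method: Reinforcement learning trains agents to extract, store, and update memory. The reward signal comes from downstream question-answering accuracy. The memory architecture comprises core, episodic, and semantic components.
Result: Mem-α significantly improves over existing baselines and generalizes to sequences far beyond the training length (over 400k tokens).
Insight: Reinforcement learning can effectively optimize LLM memory management; the multi-component memory architecture and specialized training data are key to the gains.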
Abstract: Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.
[23] Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Chuanyang Zheng,Jiankai Sun,Yihang Gao,Enze Xie,Yuehao Wang,Peihao Wang,Ting Xu,Matthew Chang,Liliang Ren,Jingyao Li,Jing Xiong,Kashif Rasul,Mac Schwager,Anderson Schneider,Zhangyang Wang,Yuriy Nevmyvaka
Main category: cs.CL
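TL;DR: The paper revisits the routing mechanism of Mixture-of-Experts (MoE) and proposes KERN, an FFN-style router function based on the Nadaraya-Watson kernel, as an alternative to the conventional Softmax. Its effectiveness is validated in MoE and LLM experiments.
Details
Motivation: MoE is widely used in large language models, but its router function has always relied on Softmax without principled justification. The authors observe that MoE shares the mathematical form of Nadaraya-Watson regression and seek a more natural routing mechanism. Contribution: The KERN router function, grounded in Nadaraya-Watson kernel theory, which adds no extra cost and generalizes both Sigmoid- and Softmax-based routers.
Method: (1) reformulate MoE as Nadaraya-Watson regression; (2) design the FFN-style KERN router function; (3) use ReLU activation and $\ell_2$-normalization in the router.
Result: Experiments show that KERN matches or outperforms conventional Softmax routing in MoE and LLM settings and offers more flexible generalization.
Insight: MoE routing can be connected to classical statistical regression; future work could explore further kernel-based router designs.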
Abstract: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on Softmax as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using Softmax to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to Softmax. We demonstrate that this router generalizes both Sigmoid- and Softmax-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of ReLU activation and $\ell_2$-normalization in the KERN router function. Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function KERN.
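To make the contrast between a Softmax router and the recommended ReLU plus l2-normalization concrete, here is a minimal sketch. The exact form of KERN is not given in the abstract, so `relu_l2_router` is one assumed reading (ReLU on the router logits, then division by the l2 norm). All function names and the `eps` constant are hypothetical.

```python
import math


def softmax_router(logits):
    """Conventional router: scores form a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu_l2_router(logits, eps=1e-12):
    """ReLU followed by l2-normalization (an assumed reading of the abstract).

    Experts with non-positive logits get exactly zero weight, and the
    weight vector has (approximately) unit l2 norm rather than summing to one.
    """
    activated = [max(0.0, z) for z in logits]
    norm = math.sqrt(sum(a * a for a in activated))
    return [a / (norm + eps) for a in activated]


def mix_experts(weights, expert_outputs):
    """Weighted sum of scalar expert outputs."""
    return sum(w * y for w, y in zip(weights, expert_outputs))


logits = [3.0, -1.0, 4.0]
print(softmax_router(logits))
print(relu_l2_router(logits))
```

One visible difference in this sketch is sparsity: the Softmax weights are all strictly positive, while the ReLU-based weights zero out the expert with a negative logit.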
[24] Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning
Arash Marioriyad,Shaygan Adim,Nima Alighardashi,Mahdieh Soleymani Banghshah,Mohammad Hossein Rohban
Main category: cs.CL
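TL;DR: This paper systematically studies how hints embedded in prompts affect the faithfulness of chain-of-thought (CoT) reasoning in large language models (LLMs), showing how hint correctness and presentation style influence both task accuracy and whether the hint is acknowledged.
Details
Motivation: LLMs widely use CoT prompting for mathematical and logical reasoning, but it remains unclear whether the generated rationales are faithful to the underlying computation or are post-hoc narratives shaped by hints embedded in the prompt. Contribution: A systematic experimental design that measures how the correctness, presentation style, and complexity of hints affect task accuracy and hint acknowledgement, revealing the systematic influence of hints on LLM reasoning.
Method: The study covers four datasets (AIME, GSM-Hard, MATH-500, UniADILR) and two state-of-the-art models (GPT-4o and Gemini-2-Flash). It varies hint correctness (correct vs. incorrect), presentation style (sycophancy vs. data leak), and complexity (raw answers, two-operator expressions, four-operator expressions), and evaluates task accuracy and whether hints are explicitly acknowledged.
Result: (1) Correct hints substantially improve accuracy, especially on harder tasks. (2) Hint acknowledgement is uneven: more complex hints are more likely to be mentioned explicitly. (3) Presentation style matters: sycophancy prompts encourage acknowledgement, while leak-style prompts raise accuracy but promote hidden reliance.
Insight: LLM reasoning is systematically shaped by shortcuts embedded in the prompt. Hint correctness and presentation style affect task performance and may also reflect RLHF-related effects such as human-pleasing and self-censoring behavior. Evaluations of reasoning faithfulness should therefore account for potential bias introduced by prompt design.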
Abstract: Large language models (LLMs) increasingly rely on chain-of-thought (CoT) prompting to solve mathematical and logical reasoning tasks. Yet, a central question remains: to what extent are these generated rationales faithful to the underlying computations, rather than post-hoc narratives shaped by hints that function as answer shortcuts embedded in the prompt? Following prior work on hinted vs. unhinted prompting, we present a systematic study of CoT faithfulness under controlled hint manipulations. Our experimental design spans four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models (GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in correctness (correct and incorrect), presentation style (sycophancy and data leak), and complexity (raw answers, two-operator expressions, four-operator expressions). We evaluate both task accuracy and whether hints are explicitly acknowledged in the reasoning. Our results reveal three key findings. First, correct hints substantially improve accuracy, especially on harder benchmarks and logical reasoning, while incorrect hints sharply reduce accuracy in tasks with lower baseline competence. Second, acknowledgement of hints is highly uneven: equation-based hints are frequently referenced, whereas raw hints are often adopted silently, indicating that more complex hints push models toward verbalizing their reliance in the reasoning process. Third, presentation style matters: sycophancy prompts encourage overt acknowledgement, while leak-style prompts increase accuracy but promote hidden reliance. This may reflect RLHF-related effects, as sycophancy exploits the human-pleasing side and data leak triggers the self-censoring side. Together, these results demonstrate that LLM reasoning is systematically shaped by shortcuts in ways that obscure faithfulness.
[25] RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection
Daocheng Fu,Jianbiao Mei,Licheng Wen,Xuemeng Yang,Cheng Yang,Rong Wu,Tao Hu,Siqi Li,Yufan Shen,Xinyu Cai,Pinlong Cai,Botian Shi,Yong Liu,Yu Qiao
Main category: cs.CL
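TL;DR: The paper proposes RE-Searcher, a search agent that combines goal-oriented planning with self-reflection, improving robustness and accuracy by stating explicit search goals and self-assessing the retrieved results.
Details
Motivation: Large language models (LLMs) excel at knowledge-intensive tasks but remain limited by knowledge cutoff, hallucination, and limited interaction modalities. External search tools partly alleviate these issues, yet in complex search environments small variations in the query can steer reasoning off productive paths. The paper addresses this challenge. Contribution: The RE-Searcher method, which strengthens search-agent robustness through goal-oriented planning and self-reflection, with experiments showing strong performance and resilience to noise.
Method: During search, RE-Searcher explicitly articulates a concrete search goal and then reflects on whether the retrieved evidence satisfies it, which helps it resist spurious cues in complex environments.
Result: Experiments show that RE-Searcher improves search accuracy, remains robust under noisy conditions, and achieves state-of-the-art results.
Insight: Combining goal-oriented planning with self-reflection stabilizes LLM agents in complex interactive environments and offers practical guidance for more autonomous decision-making.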
Abstract: Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
[26] DyFlow: Dynamic Workflow Framework for Agentic Reasoning
Yanbo Wang,Zixiang Xu,Yue Huang,Xiangqi Wang,Zirui Song,Lang Gao,Chenxi Wang,Xiangru Tang,Yue Zhao,Arman Cohan,Xiangliang Zhang,Xiuying Chen
Main category: cs.CL
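TL;DR: DyFlow is a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time feedback, significantly improving cross-task generalization.
Details
Motivation: Existing LLM-based agent systems typically rely on manually designed workflows, which lack adaptability and flexibility and make limited use of intermediate feedback, restricting generalization and reasoning depth. Contribution: A dynamic workflow generation framework consisting of a designer and an executor, which adaptively decomposes tasks and carries out dynamic reasoning operations.
Method: The designer decomposes complex problems into sub-goals and dynamically plans the next steps; the executor carries out each operation using dynamic operators with context-aware parameterization.
Result: DyFlow significantly outperforms existing baselines across social reasoning, biomedical tasks, mathematical problem solving, and code generation, with improved Pass@k.
Insight: Dynamically adjusting the reasoning workflow is key to improving the generalization of LLM agents; real-time feedback and flexible operations matter for complex tasks.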
Abstract: Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed processes, which limits their adaptability across different tasks. While a few methods attempt automated workflow generation, they are often tied to specific datasets or query types and make limited use of intermediate feedback, reducing system robustness and reasoning depth. Moreover, their operations are typically predefined and inflexible. To address these limitations, we propose DyFlow, a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback, thereby enhancing cross-task generalization. DyFlow consists of two core components: a designer and an executor. The designer decomposes complex problems into a sequence of sub-goals defined by high-level objectives and dynamically plans the next steps based on intermediate outputs and feedback. These plans are then carried out by the executor, which executes each operation using dynamic operators with context-aware parameterization, enabling flexible and semantically grounded reasoning. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains. The code is publicly available at https://github.com/wyf23187/DyFlow.
[27] IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
Johannes Schmitt,Gergely Bérczi,Jasper Dekoninck,Jeremy Feusi,Tim Gehrunger,Raphael Appenzeller,Jim Bryan,Niklas Canova,Timo de Wolff,Filippo Gaia,Michel van Garrel,Baran Hashemi,David Holmes,Aitor Iribar Lopez,Victor Jaeck,Martina Jørgensen,Steven Kelk,Stefan Kuhlmann,Adam Kurpisz,Chiara Meroni,Ingmar Metzler,Martin Möller,Samuel Muñoz-Echániz,Robert Nowak,Georg Oberdieck,Daniel Platt,Dylan Possamaï,Gabriel Ribeiro,Raúl Sánchez Galán,Zheming Sun,Josef Teichmann,Richard P. Thomas,Charles Vial
Main category: cs.CL
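TL;DR: IMProofBench is a private benchmark for evaluating large language models on research-level mathematical proof generation, consisting of 39 expert-designed problems that require detailed proofs. The evaluation setup simulates a realistic research environment with tool use (such as literature search and mathematical software). Results show that current models do reasonably well on some problems but still struggle with the more challenging ones.
Details
Motivation: Existing benchmarks focus on high-school competition problems or evaluate only final answers, so they cannot fully measure model ability on research-level mathematics. A new benchmark is needed to fill this gap. Contribution: IMProofBench, a benchmark focused on research-level proof generation, in which each detailed-proof problem is paired with final-answer subproblems, supporting both expert evaluation and automated grading. The evaluation environment simulates a realistic research setting.
Method: The benchmark contains 39 expert-designed problems requiring detailed proofs. Models run in an agentic framework with access to tools such as web search and mathematical software. Evaluation combines expert grading with automated analysis.
Result: Grok-4 achieves the highest accuracy on final-answer subproblems (52%), while GPT-5 performs best at proof generation (fully correct on 22% of problems).
Insight: Current models remain limited on research-level problems and need further improvement; tool integration helps. IMProofBench will continue to be updated so that it remains suitable for evaluating next-generation models.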
Abstract: As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
[28] Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts
Xiaoyan Zhao
Main category: cs.CL
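TL;DR: The paper proposes RSO, a hierarchical framework that uses reinforcement learning to optimize interaction strategies in conversational recommender systems, significantly improving recommendation quality.
Details
Motivation: Existing conversational recommender systems lack explicit optimization of interaction strategies and rely on unified prompts, which can yield suboptimal results. Contribution: The RSO framework, which improves interaction strategies through hierarchical strategy optimization (macro-level strategy planning and micro-level adaptation) and an LLM-based reward.
Method: A hierarchical framework in which a Planner selects strategies (e.g., recommend, explain) and an Actor generates responses guided by auxiliary experts. Reinforcement learning with an LLM-based reward addresses the shortage of multi-turn data.
Result: Experiments show that RSO outperforms existing methods, validating hierarchical strategy optimization.
Insight: Combining hierarchical design with reinforcement learning is a viable direction for optimizing complex interactive tasks.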
Abstract: Conversational Recommender Systems (CRSs) provide personalized recommendations through multi-turn interactions. With the strong reasoning abilities of Large Language Models (LLMs), applying them to CRSs has become promising. Yet, existing methods often lack explicit optimization of interaction strategies, relying instead on unified prompts, which can yield suboptimal outcomes. We propose Reinforced Strategy Optimization (RSO), a hierarchical framework that decomposes response generation into macro-level strategy planning and micro-level adaptation within a network-of-experts. A Planner selects strategies (e.g., recommend, explain, encourage), while an Actor generates responses guided by auxiliary experts for preferences and factual grounding. This disentanglement enables more tractable learning. To address limited multi-turn data, we model strategy learning as reinforcement learning with an LLM-based reward for exploration. Experiments show RSO outperforms state-of-the-art baselines, validating the effectiveness of hierarchical strategy optimization.
[29] End-to-End Aspect-Guided Review Summarization at Scale
Ilya Boytsov,Vinny DeGenova,Mikhail Balyasin,Joseph Walt,Caitlin Eusden,Marie-Claire Rochat,Margaret Pierson
Main category: cs.CL
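TL;DR: The paper presents a large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise, interpretable summaries of customer reviews on the Wayfair platform. The system extracts aspect-sentiment pairs from reviews, selects the most frequent aspects, and builds structured prompts from representative reviews to guide LLM summary generation. Its effectiveness is validated through a large-scale online A/B test, and a dataset of 11.8 million anonymized reviews is released.
Details
Motivation: Online shopping platforms accumulate huge numbers of customer reviews; processing them manually is inefficient, and key information is hard to extract. An automated method is needed to generate concise, accurate review summaries that help customers quickly understand product feedback. Contribution: (1) an end-to-end review summarization system combining ABSA and LLMs; (2) structured prompting that keeps summaries grounded in actual customer feedback; (3) a released large-scale review dataset to support future research.
Method: (1) extract aspect-sentiment pairs from reviews and identify the most frequent aspects; (2) select representative reviews and build structured prompts; (3) use the LLM to generate summaries from the prompts.
Result: The online A/B test demonstrates the system's effectiveness; the generated summaries are concise and accurately reflect customer feedback.
Insight: (1) structured prompts can significantly improve the content relevance of LLM-generated summaries; (2) selecting frequent aspects focuses the summary on the most important customer concerns; (3) the large-scale dataset is an important resource for future research.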
Abstract: We present a scalable large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise and interpretable product review summaries for the Wayfair platform. Our approach first extracts and consolidates aspect-sentiment pairs from individual reviews, selects the most frequent aspects for each product, and samples representative reviews accordingly. These are used to construct structured prompts that guide the LLM to produce summaries grounded in actual customer feedback. We demonstrate the real-world effectiveness of our system through a large-scale online A/B test. Furthermore, we describe our real-time deployment strategy and release a dataset of 11.8 million anonymized customer reviews covering 92,000 products, including extracted aspects and generated summaries, to support future research in aspect-guided review summarization.
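The aspect-selection and prompt-construction steps described above can be sketched in a few lines. This is a minimal illustration rather than the deployed system: the tuple layout of `pairs`, the prompt wording, and the "first matching review" sampling rule are all assumptions, and the upstream ABSA extraction and the LLM call are not shown.

```python
from collections import Counter


def top_aspects(pairs, k=2):
    """Return the k most frequently mentioned aspects.

    `pairs` is a list of (review_id, aspect, sentiment) tuples, assumed to
    come from an upstream ABSA step.
    """
    counts = Counter(aspect for _, aspect, _ in pairs)
    return [aspect for aspect, _ in counts.most_common(k)]


def build_prompt(pairs, reviews, k=2, per_aspect=1):
    """Assemble a structured prompt from representative reviews per aspect."""
    lines = ["Summarize the customer feedback below, one sentence per aspect."]
    for aspect in top_aspects(pairs, k):
        # Hypothetical sampling rule: take the first reviews mentioning the aspect.
        ids = [rid for rid, a, _ in pairs if a == aspect][:per_aspect]
        lines.append(f"Aspect: {aspect}")
        lines.extend(f"- {reviews[rid]}" for rid in ids)
    return "\n".join(lines)


reviews = {
    1: "Comfortable but hard to assemble.",
    2: "Too firm, nice color.",
    3: "Very comfy; assembly took hours.",
}
pairs = [
    (1, "comfort", "positive"), (2, "comfort", "negative"),
    (3, "comfort", "positive"), (1, "assembly", "negative"),
    (3, "assembly", "negative"), (2, "color", "positive"),
]
print(build_prompt(pairs, reviews))
```

Restricting the prompt to frequent aspects keeps it short and steers the summary toward what most customers mention; rarely mentioned aspects such as "color" in this toy data are left out.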
[30] QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization
Mohamed Imed Eddine Ghebriout,Gaël Guibon,Ivan Lerner,Emmanuel Vincent
Main category: cs.CL
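TL;DR: QUARTZ is a QA-based unsupervised abstractive refinement framework for task-oriented dialogue summarization. It generates multiple summaries and task-related questions in a zero-shot manner, uses LLMs to evaluate summary quality and select the best summary, and achieves strong results on multiple datasets, rivaling fully supervised state-of-the-art (SotA) methods.
Details
Motivation: Existing dialogue summarization methods usually rely on human-written summaries for supervised training, which is costly and lacks task-specific focus, limiting effectiveness in applications such as medicine. Contribution: The QUARTZ framework, which uses LLMs to both generate and evaluate summaries, enabling unsupervised task-oriented summary refinement.
Method: (1) generate multiple summaries and task-oriented questions zero-shot; (2) evaluate summary quality by having LLMs answer the questions; (3) select the best answers and the best summary; (4) fine-tune the best LLM.
Result: Strong results on multiple datasets, rivaling fully supervised SotA methods.
Insight: Using the LLMs' own abilities to evaluate summary quality avoids the high cost of human annotation while improving task specificity.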
Abstract: Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications, such as medical tasks. In this paper, we propose QUARTZ, a framework for task-oriented utility-based dialogue summarization. QUARTZ starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully-supervised State-of-the-Art (SotA) methods.
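The idea of ranking candidate summaries by how well task questions can be answered from them can be sketched as follows. The abstract does not specify the scoring rule, so exact-match scoring against generated answers is an assumption, and `toy_answer_fn` is a hypothetical keyword-lookup stand-in for an LLM answerer.

```python
def score_summary(summary, qa_pairs, answer_fn):
    """Fraction of questions whose expected answer is recovered from the summary."""
    hits = sum(
        1
        for question, expected in qa_pairs
        if answer_fn(summary, question).strip().lower() == expected.strip().lower()
    )
    return hits / len(qa_pairs)


def select_best_summary(summaries, qa_pairs, answer_fn):
    """Pick the candidate summary with the highest QA-based score."""
    return max(summaries, key=lambda s: score_summary(s, qa_pairs, answer_fn))


def toy_answer_fn(summary, question):
    """Hypothetical stand-in for an LLM: looks up one keyword per question."""
    keyword = {
        "What symptom is reported?": "headache",
        "What drug is prescribed?": "ibuprofen",
    }[question]
    return keyword if keyword in summary.lower() else "unknown"


qa_pairs = [
    ("What symptom is reported?", "headache"),
    ("What drug is prescribed?", "ibuprofen"),
]
summaries = [
    "Patient reports a headache.",
    "Patient reports a headache; ibuprofen prescribed.",
]
print(select_best_summary(summaries, qa_pairs, toy_answer_fn))
```

In a real pipeline `answer_fn` would wrap a model call; passing it as a parameter keeps the selection logic independent of any particular LLM client.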
[31] One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Rui Ming,Haoyuan Wu,Shoubo Hu,Zhuolun He,Bei Yu
Main category: cs.CL
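TL;DR: The paper proposes One-Token Rollout (OTR), a fine-tuning algorithm that brings the policy gradient method into supervised fine-tuning (SFT) and significantly improves the generalization of LLMs.
Details
Motivation: Conventional SFT generalizes poorly because it relies on a static dataset, whereas RL benefits from dynamic on-policy data. OTR aims to combine the strengths of both. Contribution: OTR treats each token generation as a single-step RL trajectory, turning static supervised data into a dynamic on-policy signal and improving performance without the costly overhead of full sentence generation.
Method: At each token step, OTR samples multiple candidate tokens from the current policy and uses the ground-truth token from the supervised data to provide a reward signal, with learning guided by policy gradient.
Result: Experiments show that OTR consistently outperforms standard SFT across domains including mathematical reasoning and code generation.
Insight: The on-policy nature of the data is a key driver of generalization; OTR offers an efficient new approach to LLM fine-tuning.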
Abstract: Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy’s distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
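The per-token rollout described above can be illustrated for a single position. This is a minimal sketch, not the authors' implementation: it only evaluates the scalar surrogate loss, whereas a training loop would differentiate it with an autodiff framework. The reward values (1.0 for a match, `miss_reward` otherwise) and all function names are assumptions.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def otr_token_loss(logits, target, num_samples=8, miss_reward=0.0, rng=random):
    """One-step policy-gradient surrogate loss for a single token position.

    Samples candidate tokens from the current policy, rewards those that
    match the ground-truth token, and returns -mean(reward * log pi(token)).
    """
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    total = 0.0
    for token in samples:
        reward = 1.0 if token == target else miss_reward
        total += reward * math.log(probs[token])
    return -total / num_samples


rng = random.Random(0)
print(otr_token_loss([0.1, 0.2, 0.3], target=2, rng=rng))
```

With `miss_reward=0.0`, only samples that hit the ground-truth token contribute, so the loss resembles cross-entropy weighted by how often the current policy already samples the right token. A non-zero `miss_reward` would also push probability away from sampled mismatches.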
[32] Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts
Hanwen Du,Yuxin Dong,Xia Ning
Main category: cs.CL
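TL;DR: This paper studies how Huggin-3.5B thinks in latent space, shows that latent thoughts leading to correct and incorrect answers are distinguishable, and proposes Latent Thinking Optimization (LTO), which uses a latent reward model to improve the reliability and efficiency of latent thinking.
Details
Motivation: Large language models (LLMs) solve problems by generating chains of thought in natural language, but this is computationally costly and prone to overthinking. Latent thinking addresses the cost, but it lacks interpretability and is hard to supervise, which undermines its reliability and correctness. Contribution: 1. Shows that latent thoughts in Huggin-3.5B leading to correct versus incorrect answers are clearly distinguishable; 2. Proposes Latent Thinking Optimization (LTO), which uses a latent reward model to optimize the thinking process; 3. Demonstrates that LTO generalizes across domains.
Method: Proposes Latent Thinking Optimization (LTO), which builds a Latent Reward Model (LRM) from a latent classifier to detect and optimize latent thinking patterns.
Result: Experiments show that the LRM effectively detects incorrect latent thinking patterns, that LTO significantly improves the efficiency and reliability of latent thinking, and that the method generalizes to other LLMs.
Insight: Latent thoughts can be used directly for reward modeling and supervised optimization, giving LLMs an efficient and general way to optimize their thinking.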
Abstract: Large Language Models (LLMs) excel at problem solving by generating chains of thought in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture Huggin-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huggin-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that LRM is highly effective in detecting incorrect latent thinking patterns, and LTO can significantly improve the latent thinking processes. Furthermore, we show that LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
[33] Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
Jinyeop Song,Song Wang,Julian Shun,Yada Zhu
Main category: cs.CL
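TL;DR: This paper proposes KG-R1, a reinforcement-learning-based single-agent framework for knowledge-graph retrieval-augmented generation that achieves efficient and transferable question answering through end-to-end optimization.
Details
Motivation: Existing KG-RAG systems are typically composed of multiple LLM modules, which raises inference cost and makes them hard to transfer to new knowledge graphs. KG-R1 aims to optimize retrieval and generation with single-agent reinforcement learning. Contribution: 1. Proposes a single-agent KG-RAG framework trained with reinforcement learning; 2. Demonstrates efficiency and transferability; 3. Improves accuracy on KGQA benchmarks while generating fewer tokens.
Method: A single agent interacts with the knowledge graph as its environment, and retrieval, reasoning, and generation are optimized end to end with reinforcement learning.
Result: On KGQA benchmarks, KG-R1 uses fewer generation tokens and reaches higher accuracy than multi-module methods, and it maintains strong accuracy on unseen knowledge graphs.
Insight: A single-agent RL framework can unify retrieval, reasoning, and generation, lowering computational cost and improving transferability, which makes it suitable for real-world deployment.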
Abstract: Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g., planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework trained through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.
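To make the "KG as environment" idea concrete, here is a toy sketch of a single agent issuing one retrieval action per step. It is an illustration only: `ToyKGEnv`, `run_episode`, and the scripted `policy` are hypothetical names, the policy is hard-coded rather than learned, and the RL training loop of KG-R1 is not shown.

```python
from collections import defaultdict

class ToyKGEnv:
    """Hypothetical environment whose only action is a one-hop retrieval."""

    def __init__(self, triples):
        self.edges = defaultdict(lambda: defaultdict(list))
        for head, relation, tail in triples:
            self.edges[head][relation].append(tail)

    def retrieve(self, entity, relation):
        # .get avoids creating empty entries in the defaultdict on a miss.
        return list(self.edges.get(entity, {}).get(relation, []))

def run_episode(env, policy, question_entity, max_steps=4):
    """Roll out one episode; `policy` maps (frontier, step) to a relation, or None to stop."""
    frontier, trace = [question_entity], []
    for step in range(max_steps):
        relation = policy(frontier, step)
        if relation is None:
            break
        frontier = [tail for entity in frontier for tail in env.retrieve(entity, relation)]
        trace.append((relation, list(frontier)))
    return frontier, trace

# Usage with a scripted two-hop policy standing in for the learned agent.
env = ToyKGEnv([("Paris", "capital_of", "France"), ("France", "continent", "Europe")])
plan = ["capital_of", "continent"]
policy = lambda frontier, step: plan[step] if step < len(plan) else None
answer, trace = run_episode(env, policy, "Paris")
```

In an RL setting, the scripted `policy` would be replaced by the language model, and a reward on the final answer would drive the update; swapping in a different set of triples leaves the agent interface unchanged, which is the property behind the plug-and-play claim.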
[34] An Annotation Scheme for Factuality and its Application to Parliamentary Proceedings
Gili Goldin,Shira Wigderson,Ella Rabinovich,Shuly Wintner
Main category: cs.CL
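TL;DR: This paper proposes a multi-faceted annotation scheme for assessing the factuality of language utterances and applies it to parliamentary proceedings. The authors annotate almost 5,000 sentences and explore the feasibility of automatically predicting some features of the scheme.
Details
Motivation: Factuality assessment is instrumental for fact checking, but existing factuality annotation schemes are fragmented and complex. The paper aims to combine prior work into a comprehensive annotation scheme and apply it to parliamentary discourse. Contribution: 1. Proposes a multi-faceted factuality annotation scheme that combines concepts from multiple disciplines; 2. Annotates almost 5,000 sentences of parliamentary discourse; 3. Studies the feasibility of automatically predicting annotation features.
Method: 1. Designs a comprehensive factuality annotation scheme; 2. Manually annotates a dataset of parliamentary proceedings; 3. Evaluates inter-annotator agreement and experiments with automating part of the annotation.
Result: Inter-annotator agreement is reported, and experiments on automatically predicting some features of the scheme show potential for extending the annotation to a large corpus.
Insight: Factuality annotation needs to combine linguistic signals with knowledge from multiple disciplines, and the feasibility of automatic annotation points the way to annotating large corpora.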
Abstract: Factuality assesses the extent to which a language utterance relates to real-world information; it determines whether utterances correspond to facts, possibilities, or imaginary situations, and as such, it is instrumental for fact checking. Factuality is a complex notion that relies on multiple linguistic signals, and has been studied in various disciplines. We present a complex, multi-faceted annotation scheme of factuality that combines concepts from a variety of previous works. We developed the scheme for Hebrew, but we trust that it can be adapted to other languages. We also present a set of almost 5,000 sentences in the domain of parliamentary discourse that we manually annotated according to this scheme. We report on inter-annotator agreement, and experiment with various approaches to automatically predict (some features of) the scheme, in order to extend the annotation to a large corpus.
[35] Automatic Fact-checking in English and Telugu
Ravi Kiran Chikkala,Tatiana Anikina,Natalia Skachkova,Ivan Vykopal,Rodrigo Agerri,Josef van Genabith
Main category: cs.CL
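TL;DR: The paper studies how well large language models (LLMs) classify the veracity of factual claims and generate justifications in English and Telugu. Its contributions include a bilingual dataset and a benchmark of different classification approaches.
Details
Motivation: False information is a global challenge, and manual verification is time-consuming and resource-intensive, so automated solutions are needed. Contribution: 1. Creates a bilingual English-Telugu dataset; 2. Benchmarks different LLM-based veracity classification approaches.
Method: Experiments with several LLM-based approaches for classifying the veracity of factual claims and generating justifications.
Result: Demonstrates the effectiveness and potential of LLMs for automatic fact verification in a multilingual setting.
Insight: LLM-based automated fact-checking in multilingual settings has practical value, especially for low-resource languages.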
Abstract: False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.
[36] OceanGym: A Benchmark Environment for Underwater Embodied Agents
Yida Xue,Mingjun Mao,Xiangyuan Ru,Yuqi Zhu,Baochang Ren,Shuofei Qiao,Mengru Wang,Shumin Deng,Xinyu An,Ningyu Zhang,Ying Chen,Huajun Chen
Main category: cs.CL
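TL;DR: The paper introduces OceanGym, the first comprehensive benchmark environment for underwater embodied agents, designed to advance AI in this highly demanding domain.
Details
Motivation: Underwater environments place higher demands on an agent's perception and decision-making because of low visibility, dynamic ocean currents, and similar factors, yet the field lacks a standardized test platform. Contribution: 1) Proposes OceanGym, the first benchmark for underwater embodied agents, covering eight task domains; 2) Designs a unified agent framework driven by multi-modal large language models (MLLMs) that integrates perception, memory, and decision-making; 3) Reveals the gap between current MLLM agents and human experts on underwater tasks.
Method: An MLLM-driven agent framework takes optical and sonar data as input and supports autonomous exploration and long-horizon objectives in complex environments.
Result: Experiments show substantial gaps between existing MLLM agents and human experts in perception, planning, and adaptability on underwater tasks.
Insight: OceanGym provides a high-fidelity platform for developing robust underwater AI and is relevant to advancing the capabilities of real-world autonomous underwater vehicles.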
Abstract: We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth’s last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
[37] Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Seiji Maekawa,Jackson Hassell,Pouya Pezeshkpour,Tom Mitchell,Estevam Hruschka
Main category: cs.CL
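TL;DR: The paper presents FuncBenchGen, a contamination-free, controllable evaluation framework for testing tool-augmented language models (TaLMs) on multi-step function calling, addressing the insufficient control and data contamination of existing benchmarks.
Details
Motivation: Existing TaLM benchmarks cannot precisely control task difficulty (e.g., number of functions, task complexity) and are vulnerable to data contamination, so a more reliable and controllable evaluation framework is needed. Contribution: Proposes the FuncBenchGen framework, which generates synthetic multi-step tool-use tasks to give precise control over task difficulty while avoiding data contamination.
Method: Tool use is modeled as traversal over a hidden function-dependency directed acyclic graph (DAG), where nodes are function calls and edges are dependencies between functions. Tasks are generated from external function schemas, initial variables, and a target variable.
Result: Seven LLMs are evaluated. Reasoning-optimized models consistently outperform general-purpose models, but performance declines sharply as dependency depth increases. A simple mitigation strategy (explicitly restating variable values) substantially improves success rates.
Insight: State tracking by LLMs in multi-step tool calling is brittle, and lightweight mitigations such as explicitly restating variable values can substantially improve performance.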
Abstract: As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving the success rate from 62.5% to 81.3% for GPT-5.
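The function-dependency DAG formulation can be sketched in a few lines. The generator below is a simplified stand-in, not FuncBenchGen itself: the variable naming (`v0`, `v1`, ..., `d0`, ...), the chain-plus-extra-edge construction, and the helper names `generate_task` and `required_calls` are assumptions chosen to show how depth and distractors can be controlled.

```python
import random

def generate_task(num_core=4, num_distractors=2, seed=0):
    """Build a hidden dependency DAG: output variable -> list of input variables."""
    rng = random.Random(seed)
    deps = {}
    available = ["v0"]                      # v0 is the initial variable given to the model
    for i in range(1, num_core + 1):
        # Always consume the previous core variable, so dependency depth == num_core.
        inputs = {f"v{i - 1}"}
        if i > 1 and rng.random() < 0.5:
            inputs.add(rng.choice(available[:-1]))   # optional extra edge to an earlier variable
        deps[f"v{i}"] = sorted(inputs)
        available.append(f"v{i}")
    for j in range(num_distractors):
        # Distractors are connected to the graph but never needed for the target.
        deps[f"d{j}"] = [rng.choice(available)]
    return deps, f"v{num_core}"

def required_calls(deps, target):
    """Return one valid call order (post-order DFS) that computes `target`."""
    order, seen = [], set()

    def visit(var):
        if var in seen or var not in deps:  # initial variables need no function call
            return
        seen.add(var)
        for inp in deps[var]:
            visit(inp)
        order.append(var)

    visit(target)
    return order

# Usage: a depth-4 task with three distractor functions.
deps, target = generate_task(num_core=4, num_distractors=3, seed=1)
calls = required_calls(deps, target)
```

A model's call sequence could then be scored against `calls`; because each task is freshly sampled from a seed, it cannot have appeared in pretraining data.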
[38] MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse,Sebastian Ruder,Tony Lin,Oksana Kurylo,Haruka Takagi,Janice Lam,Nicolò Busetto,Denise Diaz
Main category: cs.CL
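TL;DR: MENLO is a framework for evaluating the native-like quality of responses from multilingual large language models (LLMs). Using a human-annotated dataset and multi-dimensional rubrics, it improves zero-shot LLM judges and further improves them with reinforcement learning and related methods. The study finds that automatic judges still have room for improvement but offer a viable direction for multilingual evaluation and preference alignment.
Details
Motivation: Ensuring native-like quality of LLM responses across many languages is a complex challenge. MENLO aims to fill this gap with a systematic evaluation method and dataset, supporting the evaluation and improvement of multilingual LLMs. Contribution: 1. Proposes the MENLO framework for evaluating the native-like quality of LLM responses; 2. Releases a dataset of 6,423 annotated preference pairs covering 47 language varieties; 3. Shows the limitations of zero-shot LLM judges and substantially improves them with reinforcement learning and related methods; 4. Explores the potential of generative reward models.
Method: 1. Designs multi-dimensional rubrics based on audience-design-inspired mechanisms; 2. Builds the dataset through human annotation; 3. Evaluates zero-shot LLM judges and improves them with reinforcement learning (RL); 4. Adds reward shaping and multi-task learning for further gains.
Result: Zero-shot LLM judges improve with pairwise evaluation and structured rubrics but still underperform human annotators. RL-trained judges can serve as generative reward models that enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.
Insight: Structured rubrics and reinforcement learning are effective ways to improve multilingual LLM evaluation. Future work needs to further close the gap between automatic judges and human annotators.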
Abstract: Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs’ multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
cs.CV [Back]
[39] LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model
Haozhe Jia,Wenshuo Chen,Yuqi Lin,Yang Yang,Lei Wang,Mang Ning,Bowen Tian,Songning Lai,Nanqian Jia,Yifan Chen,Yutao Yue
Main category: cs.CV
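TL;DR: LUMA proposes a low-dimension unified motion alignment method that uses dual-path anchoring to strengthen semantic alignment in text-to-motion generation, addressing gradient attenuation and motion artifacts in existing diffusion models.
Details
Motivation: Although existing U-Net-based diffusion models perform well on text-to-motion generation, they still suffer from semantic misalignment and motion artifacts. The study identifies gradient attenuation in deep layers as a key bottleneck that hampers learning of high-level features. Contribution: Proposes the LUMA model with a dual-path anchoring mechanism: one path provides semantic supervision in the temporal domain through a lightweight MoCLIP model, and the other provides alignment signals in the frequency domain through low-frequency DCT components. A temporal modulation mechanism adaptively fuses the two signals to progressively refine semantic alignment.
Method: Combines temporal-domain and frequency-domain semantic alignment signals and fuses them adaptively so that alignment moves from coarse to fine during denoising. The method is validated on the HumanML3D and KIT-ML datasets.
Result: LUMA reaches FID scores of 0.035 on HumanML3D and 0.123 on KIT-ML, achieving state-of-the-art performance, and converges 1.4× faster than the baseline.
Insight: Dual-path anchoring effectively addresses semantic misalignment, and the low-frequency DCT components strengthen the model's semantic representation. The temporal modulation mechanism offers a new approach to efficient motion generation.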
Abstract: While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose LUMA (Low-dimension Unified Motion Alignment), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4× compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
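The frequency-domain anchor relies on low-frequency DCT components of a motion sequence. The sketch below shows only that extraction step under stated assumptions: it applies an orthonormal DCT-II along the time axis, keeps the first `keep` coefficients, and uses hypothetical names (`dct_matrix`, `low_frequency_anchor`, `reconstruct`). How LUMA actually selects and uses these components is not specified in the abstract.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix, so that m @ m.T == I."""
    k = np.arange(n)[:, None]               # frequency index
    t = np.arange(n)[None, :]               # time index
    m = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)                    # rescale the DC row for orthonormality
    return m

def low_frequency_anchor(motion, keep=8):
    """motion: (frames, features). Returns the first `keep` temporal DCT coefficients."""
    coeffs = dct_matrix(motion.shape[0]) @ motion
    return coeffs[:keep]

def reconstruct(low_coeffs, frames):
    """Inverse transform using only the retained low-frequency rows."""
    basis = dct_matrix(frames)[: low_coeffs.shape[0]]
    return basis.T @ low_coeffs

# Usage: a slowly varying 3-feature motion of 32 frames, compressed to 6 coefficients.
frames = 32
time = np.linspace(0.0, 1.0, frames)[:, None]
motion = np.hstack([time, time ** 2, np.ones_like(time)])
anchor = low_frequency_anchor(motion, keep=6)
smooth = reconstruct(anchor, frames)
```

Slow, large-scale movement concentrates in the first few coefficients, which is why a short low-frequency vector can serve as a compact alignment target.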
[40] VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Paul Gavrikov,Wei Lin,M. Jehanzeb Mirza,Soumya Jahagirdar,Muhammad Huzaifa,Sivan Doveh,Serena Yeung-Levy,James Glass,Hilde Kuehne
Main category: cs.CV
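TL;DR: VisualOverload is a new visual question answering benchmark focused on simple, knowledge-free vision tasks in dense scenes, designed to expose the shortcomings of current vision language models in understanding details.
Details
Motivation: The study examines whether state-of-the-art vision language models (VLMs) have really solved basic visual understanding, particularly the handling of details in dense scenes. Contribution: The main contributions are: (1) a new VQA benchmark, VisualOverload, comprising 2,720 question-answer pairs; (2) evidence of multiple failure modes of current VLMs in dense scenes; (3) a detailed error analysis to help improve future models.
Method: Uses high-resolution scans of public-domain paintings, manually annotated with questions across six task categories, to evaluate models' understanding of details in dense scenes.
Result: Of 37 models tested, the best model achieves only 19.6% accuracy on the hardest test split and 69.5% accuracy overall.
Insight: VisualOverload shows that current VLMs still have clear weaknesses in encoding and reasoning over details, especially in logical consistency and counting in complex scenes.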
Abstract: Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload
[41] Editing Physiological Signals in Videos Using Latent Representations
Tianwen Zhou,Akshay Paruchuri,Josef Spjut,Kaan Akşit
Main category: cs.CV
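TL;DR: The paper proposes a framework based on a 3D VAE and a text encoder for editing physiological signals (such as heart rate) in videos, protecting privacy while preserving visual quality.
Details
Motivation: Camera-based estimation of physiological signals (such as heart rate) is convenient, but it can leak sensitive information about health or emotional state, raising privacy concerns. The paper aims to address this. Contribution: Proposes a controllable heart rate editing framework that fuses the input video with a target heart rate prompt through a 3D VAE and a text encoder, achieving high-fidelity physiological signal modulation.
Method: A pretrained 3D VAE encodes the video and a frozen text encoder embeds the target heart rate prompt. The two are fused with trainable spatio-temporal layers using Adaptive Layer Normalization (AdaLN), and Feature-wise Linear Modulation (FiLM) is applied in the decoder.
Result: In experiments, the method preserves visual quality (PSNR 38.96 dB, SSIM 0.98) and achieves a heart rate modulation error of 10.00 bpm MAE and 10.09% MAPE.
Insight: The method can be used to anonymize biometric signals in real videos as well as to synthesize realistic videos with desired physiological signals, giving it practical application potential.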
Abstract: Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design’s controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.
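Feature-wise Linear Modulation scales and shifts each feature channel with parameters predicted from a conditioning vector. The following NumPy sketch shows that operation on a video-shaped tensor. The linear maps `w_gamma`, `w_beta` and the tensor layout `(channels, frames, height, width)` are assumptions for illustration, not the paper's decoder.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: gamma(cond) * features + beta(cond), one (gamma, beta) pair per channel.

    features: (channels, frames, height, width)
    cond:     (cond_dim,) conditioning vector, e.g. an embedded target heart rate prompt
    """
    gamma = w_gamma @ cond + b_gamma        # (channels,)
    beta = w_beta @ cond + b_beta           # (channels,)
    # Broadcast the per-channel parameters over time and space.
    return gamma[:, None, None, None] * features + beta[:, None, None, None]

# Usage: two channels, conditioning vector of size 5.
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 3, 4, 4))
cond = rng.normal(size=5)
zeros = np.zeros((2, 5))

# With zero weights the modulation is set entirely by the biases.
identity = film(features, cond, zeros, np.ones(2), zeros, np.zeros(2))
shifted = film(features, cond, zeros, np.array([2.0, 0.0]), zeros, np.array([1.0, -1.0]))
```

Initializing `gamma` near 1 and `beta` near 0, as in the `identity` call, leaves the pretrained features untouched at the start of training, a common choice when adding conditioning to a pretrained decoder.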
[42] SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Yuyou Zhang,Radu Corcodel,Chiori Hori,Anoop Cherian,Ding Zhao
Main category: cs.CV
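TL;DR: SpinBench is a diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). It is designed around the core challenge of perspective taking and reveals systematic weaknesses of VLMs in spatial reasoning.
Details
Motivation: The motivation is to evaluate VLMs' spatial reasoning, in particular how scenes and object relations change under viewpoint transformation, filling a gap in the understanding of VLM spatial cognition. Contribution: The main contribution is the SpinBench benchmark, with progressively structured diagnostic tasks, which reveals systematic weaknesses of VLMs in perspective taking and rotational understanding and compares them with human performance.
Method: SpinBench tasks cover translation, rotation, object relative pose, and viewpoint change, progressing from simple single-object tasks to demanding multi-object perspective-taking tasks.
Result: The 37 state-of-the-art VLMs evaluated show strong egocentric bias and poor rotational understanding, among other problems. Human accuracy reaches 91.2%, and task difficulty as measured by human response time correlates strongly with VLM accuracy.
Insight: The study exposes key deficiencies of VLMs in spatial reasoning, highlights gaps in their ability to reason about physical space, and points to directions for improvement.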
Abstract: We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, relative positions grounding, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that single-object simpler tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space. Our website can be found at https://spinbench25.github.io/.
[43] A Deep Learning Approach for Spatio-Temporal Forecasting of InSAR Ground Deformation in Eastern Ireland
Wendong Yao,Binhua Huang,Soumyabrata Dev
Main category: cs.CV
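TL;DR: This paper proposes a Multi-Modal Spatio-Temporal Transformer (MM-STT) framework that fuses dynamic displacement data with static physical priors and, through a joint spatio-temporal attention mechanism, substantially improves spatio-temporal forecasting of land subsidence.
Details
Motivation: Existing methods such as ConvLSTM struggle to model the complex non-linear dynamics and long-range dependencies of land subsidence, and they are usually limited to uni-modal data. Contribution: 1) Proposes the MM-STT framework, which processes multi-modal data (dynamic displacement and static physical priors) in a unified manner; 2) Designs a joint spatio-temporal attention mechanism that substantially improves forecast accuracy.
Method: MM-STT models the spatio-temporal dependencies of land subsidence through multi-modal fusion and a joint spatio-temporal attention mechanism.
Result: On the EGMS dataset, MM-STT reduces long-range forecast RMSE by an order of magnitude compared to all baselines, including SOTA methods such as STGCN and STAEformer.
Insight: Multi-modal fusion is essential for highly complex problems such as land subsidence, and architecture design should emphasize deep interaction between modalities.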
Abstract: Forecasting high-resolution land subsidence is a critical yet challenging task due to its complex, non-linear dynamics. While standard architectures like ConvLSTM often fail to model long-range dependencies, we argue that a more fundamental limitation of prior work lies in the uni-modal data paradigm. To address this, we propose the Multi-Modal Spatio-Temporal Transformer (MM-STT), a novel framework that fuses dynamic displacement data with static physical priors. Its core innovation is a joint spatio-temporal attention mechanism that processes all multi-modal features in a unified manner. On the public EGMS dataset, MM-STT establishes a new state-of-the-art, reducing the long-range forecast RMSE by an order of magnitude compared to all baselines, including SOTA methods like STGCN and STAEformer. Our results demonstrate that for this class of problems, an architecture’s inherent capacity for deep multi-modal fusion is paramount for achieving transformative performance.
[44] DepthLM: Metric Depth From Vision Language Models
Zhipeng Cai,Ching-Feng Yeh,Hu Xu,Zhuang Liu,Gregory Meyer,Xinjie Lei,Changsheng Zhao,Shang-Wen Li,Vikas Chandra,Yangyang Shi
Main category: cs.CV
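TL;DR: DepthLM shows that vision language models (VLMs) can reach expert-level accuracy without changes to architecture or loss, unlocking 3D understanding through text-based supervised fine-tuning with sparse labels, visual prompting, and related techniques.
Details
Motivation: Existing VLMs excel at semantic understanding but struggle to understand 3D from 2D inputs (e.g., metric depth estimation), while pure vision models reach super-human accuracy but require task-specific architectures and losses. Contribution: 1. Shows that VLMs can reach expert-level accuracy through supervised fine-tuning with sparse labels; 2. Proposes visual prompting and intrinsic-conditioned augmentation to resolve pixel reference and cross-dataset camera ambiguity; 3. With DepthLM, makes VLMs comparable in accuracy with pure vision models for the first time.
Method: 1. Text-based supervised fine-tuning; 2. Visual prompting to resolve pixel reference; 3. Intrinsic-conditioned augmentation to resolve cross-dataset camera ambiguity; 4. No dense prediction head or complex regression loss.
Result: DepthLM surpasses the accuracy of most advanced VLMs by over 2x, is comparable with pure vision models for the first time, and naturally avoids over-smoothing.
Insight: Sparse labels are sufficient to unlock the 3D capabilities of VLMs; visual prompting and intrinsic-conditioned augmentation are key; a single VLM can cover more 3D tasks.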
Abstract: Vision language models (VLMs) can flexibly address various vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs including GPT-5 still struggle in understanding 3D from 2D inputs. On the other hand, expert pure vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task. However, they require task-specific architectures and losses. This difference motivates us to ask: Can VLMs reach expert-level accuracy without architecture or loss change? We take per-pixel metric depth estimation as the representative task and show that the answer is yes! Surprisingly, comprehensive analysis shows that text-based supervised-finetuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding; no dense prediction head or complex regression/regularization loss is needed. The bottleneck for VLMs actually lies in pixel reference and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of most advanced VLMs by over 2x, making VLMs for the first time comparable with pure vision models. Interestingly, without explicit enforcement during training, VLMs trained with DepthLM naturally avoid over-smoothing, having far fewer flying points at boundary regions than pure vision models. The simplicity of DepthLM also enables a single VLM to cover various 3D tasks beyond metric depth. Our code and model will be released at the link below.
[45] Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection
Kaiqing Lin,Zhiyuan Yan,Ruoxin Chen,Junyan Ye,Ke-Yue Zhang,Yue Zhou,Peng Jin,Bin Li,Taiping Yao,Shouhong Ding
Main category: cs.CV
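TL;DR: The paper proposes a new paradigm, “Seeing Before Reasoning”: multimodal large language models (MLLMs) are first trained to perceive forgery artifacts and then reason on that basis, improving both detection performance and explainability for AI-generated images. The authors propose Forensic-Chat and a new benchmark, ExplainFake-Bench, and experiments demonstrate the approach's superiority.
Details
Motivation: Existing MLLMs perform poorly at detecting AI-generated images because their vision encoders are optimized for semantic recognition rather than low-level signal perception, leaving them insensitive to forgery traces. In addition, existing fine-tuning data diverges sharply from the pretraining distribution, so models rely on linguistic shortcuts and forget pretrained knowledge. Contribution: 1. Proposes the “Seeing Before Reasoning” paradigm; 2. Designs Forensic-Chat, which is generalizable, explainable, and conversational; 3. Proposes the ExplainFake-Bench benchmark for evaluating explainability in image forensics.
Method: 1. Strengthens MLLMs' visual perception of forgery artifacts; 2. Builds the Forensic-Chat model, combining visual perception with reasoning; 3. Designs ExplainFake-Bench, which evaluates explainability from five aspects.
Result: Experiments show that Forensic-Chat performs strongly in both generalization and explainability, surpassing existing methods.
Insight: Visual perception is key to detecting AI-generated images; strengthening perception before reasoning substantially improves performance. Evaluating explainability also requires a dedicated benchmark.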
Abstract: Detecting AI-generated images with multimodal large language models (MLLMs) has gained increasing attention, due to their rich world knowledge, common-sense reasoning, and potential for explainability. However, naively applying those MLLMs for detection often leads to suboptimal performance. We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. First, they do not really see: existing MLLMs’ vision encoders are primarily optimized for semantic-oriented recognition rather than the perception of low-level signals, leaving them insensitive to subtle forgery traces. Without access to reliable perceptual evidence, the model grounds its judgment on incomplete and limited visual observations. Second, existing finetuning data for detection typically uses narrow, instruction-style formats, which diverge sharply from the diverse, heterogeneous distributions seen in pretraining. In the absence of meaningful visual cues, the model therefore exploits these linguistic shortcuts, resulting in catastrophic forgetting of pretrained knowledge (even the basic dialogue capabilities). In response, we advocate for a new paradigm: seeing before reasoning. We propose that MLLMs should first be trained to perceive artifacts, strengthening their artifact-aware visual perception, so that subsequent reasoning is grounded in actual observations. We therefore propose Forensic-Chat, a generalizable, explainable, and still-conversational (for multi-round dialogue) assistant for fake image detection. We also propose ExplainFake-Bench, a benchmark tailored for the evaluation of the MLLM’s explainability for image forensics from five key aspects. Extensive experiments show its superiority in generalization and genuinely reliable explainability.
[46] DeepFake Detection in Dyadic Video Calls using Point of Gaze Tracking
Odin Kohler,Rahul Vijaykumar,Masudul H. Imtiaz
Main category: cs.CV
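TL;DR: Proposes a method that uses gaze tracking to detect real-time deepfakes in video calls, reaching 82% detection accuracy by analyzing gaze patterns in dyadic conversations.
Details
Motivation: As deepfake technology advances, malicious actors can use deepfakes generated in real time for phishing attacks during video meetings. Conventional detection methods struggle with this new kind of attack. Contribution: The first reported method to use point-of-gaze tracking for deepfake detection. It exploits biometric information in gaze patterns during dyadic conversations to build a real-time detection method.
Method: Builds a model from explainable features selected after reviewing research on gaze patterns in dyadic conversations, and tests it on a dataset created by the authors.
Result: Achieves 82% detection accuracy on the authors' dataset.
Insight: Deepfakes struggle to mimic subtle nonverbal human behavior in conversation (such as gaze patterns), and this can be exploited for detection.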
Abstract: With recent advancements in deepfake technology, it is now possible to generate convincing deepfakes in real-time. Unfortunately, malicious actors have started to use this new technology to perform real-time phishing attacks during video meetings. The nature of a video call allows access to what the deepfake is “seeing,” that is, the screen displayed to the malicious actor. Using this with the estimated gaze from the malicious actor’s streamed video enables us to estimate where the deepfake is looking on screen, the point of gaze. Because the point of gaze during conversations is not random and is instead used as a subtle nonverbal communicator, it can be used to detect deepfakes, which are not capable of mimicking this subtle nonverbal communication. This paper proposes a real-time deepfake detection method adapted to this genre of attack, utilizing previously unavailable biometric information. We built our model based on explainable features selected after careful review of research on gaze patterns during dyadic conversations. We then test our model on a novel dataset of our creation, achieving an accuracy of 82%. This is the first reported method to utilize point-of-gaze tracking for deepfake detection.
[47] Robust Visual Localization in Compute-Constrained Environments by Salient Edge Rendering and Weighted Hamming Similarity
Tu-Hoa Pham,Philip Bailey,Daniel Posada,Georgios Georgakis,Jorge Enriquez,Surya Suresh,Marco Dolci,Philip Twu
Main category: cs.CV
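TL;DR: The paper proposes a visual localization method for compute-constrained environments based on salient edge rendering and weighted Hamming similarity, targeting 6-DoF object pose estimation for the Mars Sample Return campaign.
Details
Motivation: The study addresses the challenge of a robotic arm needing to estimate the pose of multiple objects accurately under severely constrained hardware, as in a Mars mission. Conventional methods are computationally heavy and depend on high-fidelity models, making it hard to meet real-time and resource limits. Contribution: 1. Proposes a new localization algorithm combining a custom renderer with an edge-domain template matching metric; 2. Requires only low-fidelity, textureless 3D models as input; 3. Outperforms the state of the art under compute and memory constraints.
Method: 1. Uses a custom renderer to generate salient edges; 2. Designs a weighted Hamming similarity metric tailored to edge-domain matching; 3. Keeps the algorithm lightweight enough for general-purpose hardware.
Result: Experiments on synthetic datasets, physical testbeds on Earth, and in situ Mars imagery show that the method beats the state of the art in both robustness and accuracy.
Insight: The method shows that optimizing rendering and matching strategies can deliver efficient, reliable visual localization in resource-constrained environments, expanding what low-cost hardware can do.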
Abstract: We consider the problem of vision-based 6-DoF object pose estimation in the context of the notional Mars Sample Return campaign, in which a robotic arm would need to localize multiple objects of interest for low-clearance pickup and insertion, under severely constrained hardware. We propose a novel localization algorithm leveraging a custom renderer together with a new template matching metric tailored to the edge domain to achieve robust pose estimation using only low-fidelity, textureless 3D models as inputs. Extensive evaluations on synthetic datasets, as well as on data from physical testbeds on Earth and in situ Mars imagery, show that our method consistently beats the state of the art in compute and memory-constrained localization, both in terms of robustness and accuracy, in turn enabling new possibilities for cheap and reliable localization on general-purpose hardware.
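The abstract does not define the edge-domain metric, so the sketch below shows one plausible reading of a weighted Hamming similarity: the weighted fraction of pixels on which two binary edge maps agree. The formula, the per-pixel `weights` array, and the function names are assumptions for illustration and may differ from the paper's metric.

```python
import numpy as np

def weighted_hamming_similarity(template_edges, observed_edges, weights):
    """Weighted fraction of pixels where two binary edge maps agree, in [0, 1]."""
    agree = template_edges == observed_edges
    return float((weights * agree).sum() / weights.sum())

def best_template(templates, observed_edges, weights):
    """Score every rendered template (one per candidate pose) and return the best index."""
    scores = [weighted_hamming_similarity(t, observed_edges, weights) for t in templates]
    return int(np.argmax(scores)), scores

# Usage: a 4x4 observation with a vertical edge in column 1.
observed = np.zeros((4, 4), dtype=bool)
observed[:, 1] = True
template_a = observed.copy()                 # edge in the right place
template_b = np.zeros((4, 4), dtype=bool)
template_b[:, 2] = True                      # edge shifted by one pixel
weights = np.ones((4, 4))                    # uniform weights; salient pixels could get more
index, scores = best_template([template_a, template_b], observed, weights)
```

On binary maps the agreement test reduces to XOR and popcount on packed bits, which is what makes Hamming-style metrics attractive on constrained hardware.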
[48] LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
Pranav Saxena,Avigyan Bhattacharya,Ji Zhang,Wenshan Wang
Main category: cs.CV
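TL;DR: LLM-RG is a hybrid approach that combines vision-language models (VLMs) with large language models (LLMs) for referential grounding in outdoor driving scenes. It extracts object attributes and spatial information, then uses an LLM for symbolic reasoning, yielding substantial gains.
Details
Motivation: Outdoor driving scenes contain many visually similar objects and dynamic elements, which makes natural-language references hard to resolve. Existing methods handle such cases poorly.
Contribution: LLM-RG, a hybrid pipeline that pairs fine-grained attribute extraction by a VLM with symbolic reasoning by an LLM to perform zero-shot referential grounding.
Method: Given an image and a free-form referring expression, LLM-RG extracts the relevant object types and attributes, generates visual descriptors for candidate regions, and combines them with spatial metadata in prompts for LLM chain-of-thought reasoning, which identifies the referent's bounding box.
Result: On the Talk2Car benchmark, LLM-RG clearly outperforms both LLM-based and VLM-based baselines; adding 3D spatial cues improves performance further.
Insight: VLMs and LLMs have complementary strengths, and applied zero-shot they can robustly resolve difficult referential grounding cases.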
Abstract: Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., “the black car on the right”). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. LLM-RG processes an image and a free-form referring expression by using an LLM to extract relevant object types and attributes, detecting candidate regions, generating rich visual descriptors with a VLM, and then combining these descriptors with spatial metadata into natural-language prompts that are input to an LLM for chain-of-thought reasoning to identify the referent’s bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
[49] VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models
Ravikumar Balakrishnan,Mansi Phute
Main category: cs.CV
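TL;DR: VISOR++ is a framework that steers the behavior of vision-language models through optimized visual inputs alone. It needs no runtime access to model internals and applies to both open and closed-source models.
Details
Motivation: As vision-language models (VLMs) are deployed in safety-critical applications, understanding and controlling their behavior has become essential. Existing methods such as system prompts or activation steering have clear limitations, so a general, non-invasive control method is needed.
Contribution: 1. VISOR++, which controls the behavior of multiple models through optimized visual inputs; 2. a single generated image that works across several VLMs; 3. validation on open and closed-source models with task performance preserved.
Method: VISOR++ generates optimized visual inputs that emulate the effect of activation steering vectors, without runtime access to model internals. Experiments cover three alignment directions: refusal, sycophancy, and survival instinct.
Result: VISOR++ images perform on par with activation steering on LLaVA-1.5-7B and IDEFICS2-8B, show promise for shifting behavior in unseen models, and preserve 99.9% of performance on MMLU tasks.
Insight: Visual inputs can act as a universal medium for behavioral control across a diverse model ecosystem, offering a new route to safe and controllable VLM deployment.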
Abstract: As Vision Language Models (VLMs) are deployed across safety-critical applications, understanding and controlling their behavioral patterns has become increasingly important. Existing behavioral control methods face significant limitations: system prompting approaches could easily be overridden by user instructions, while applying activation-based steering vectors requires invasive runtime access to model internals, precluding deployment with API-based services and closed-source models. Finding steering methods that transfer across multiple VLMs is still an open area of research. To this end, we introduce universal visual input based steering for output redirection (VISOR++), to achieve behavioral control through optimized visual inputs alone. We demonstrate that a single VISOR++ image can be generated for an ensemble of VLMs to emulate each of their steering vectors. By crafting universal visual inputs that induce target activation patterns, VISOR++ eliminates the need for runtime model access while remaining deployment-agnostic. This means that when an underlying model supports multimodal capability, model behaviors can be steered by inserting an image input replacing runtime steering vector based interventions. We first demonstrate the effectiveness of the VISOR++ images on open-access models such as LLaVA-1.5-7B and IDEFICS2-8B along three alignment directions: refusal, sycophancy and survival instinct. Both the model-specific steering images and the jointly optimized images achieve performance parity closely following that of steering vectors for both positive and negative steering tasks. We also show the promise of VISOR++ images in achieving directional behavioral shifts for unseen models including both open-access and closed-access ones. Furthermore, VISOR++ images are able to preserve 99.9% performance on 14,000 unrelated MMLU evaluation tasks.
[50] Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Qinsi Wang,Bo Liu,Tianyi Zhou,Jing Shi,Yueqian Lin,Yiran Chen,Hai Helen Li,Kun Wan,Wentian Zhao
Main category: cs.CV
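TL;DR: Vision-Zero is a self-improvement framework that needs no human annotation. It uses visual games to strengthen the strategic reasoning of vision-language models (VLMs) across domains and delivers sustained performance gains.
Details
Motivation: Existing reinforcement learning methods rely on large amounts of manually annotated data, which drives up training cost and limits practical VLM deployment. Vision-Zero addresses this with self-generated game data.
Contribution: 1) A strategic self-play framework in which models generate training data through role play; 2) game generation from arbitrary images, improving reasoning across domains; 3) the Iterative-SPO algorithm, which combines self-play with RLVR for sustained improvement.
Method: Vision-Zero builds a "Who Is the Spy"-style game framework and combines strategic self-play with Iterative-SPO, which alternates between self-play and RLVR.
Result: State-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing annotation-based methods.
Insight: Self-generated game data can efficiently replace human annotation, and combining strategic self-play with reinforcement learning markedly improves generalization and long-term performance.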
Abstract: Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in “Who Is the Spy”-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model’s reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
[51] FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology
Faizan Farooq Khan,Yousef Radwan,Eslam Abdelrahman,Abdulwahab Felemban,Aymen Mir,Nico K. Michiels,Andrew J. Temple,Michael L. Berumen,Mohamed Elhoseiny
Main category: cs.CV
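TL;DR: FishNet++ is a multimodal benchmark dataset for evaluating and improving multimodal large language models (MLLMs) on fine-grained fish species recognition in marine biology. The best open-source models reach less than 10% accuracy, and the dataset supplies large-scale annotations to address this.
Details
Motivation: The capabilities of MLLMs in specialized scientific fields such as marine biology remain underexplored. Fine-grained fish recognition is critical for monitoring marine ecosystems under anthropogenic pressure.
Contribution: FishNet++, a large-scale multimodal benchmark with 35,133 textual descriptions, 706,426 key-point annotations, and 119,399 bounding boxes, built to address the performance gap of existing models.
Method: Existing resources are extended with comprehensive multimodal annotations (textual descriptions, key points, and bounding boxes), and current MLLMs are evaluated systematically on fish recognition.
Result: Fine-grained fish recognition accuracy of existing models is below 10%, which exposes their limits in specialized domains.
Insight: FishNet++ both reveals the shortcomings of MLLMs in a specialized domain and provides key data for developing stronger vision-language models, supporting progress in aquatic science.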
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive cross-domain capabilities, yet their proficiency in specialized scientific fields like marine biology remains underexplored. In this work, we systematically evaluate state-of-the-art MLLMs and reveal significant limitations in their ability to perform fine-grained recognition of fish species, with the best open-source models achieving less than 10% accuracy. This task is critical for monitoring marine ecosystems under anthropogenic pressure. To address this gap and investigate whether these failures stem from a lack of domain knowledge, we introduce FishNet++, a large-scale, multimodal benchmark. FishNet++ significantly extends existing resources with 35,133 textual descriptions for multimodal learning, 706,426 key-point annotations for morphological studies, and 119,399 bounding boxes for detection. By providing this comprehensive suite of annotations, our work facilitates the development and evaluation of specialized vision-language models capable of advancing aquatic science.
[52] AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs
Hakan Emre Gedik,Andrew Martin,Mustafa Munir,Oguzhan Baser,Radu Marculescu,Sandeep P. Chinchali,Alan C. Bovik
Main category: cs.CV
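TL;DR: AttentionViG proposes a cross-attention-based dynamic neighbor aggregation method for Vision Graph Neural Networks (ViGs) and achieves SOTA performance on benchmarks such as ImageNet-1K.
Details
Motivation: Existing ViG methods lack a node-neighbor feature aggregation scheme that is both versatile and able to capture complex relationships efficiently. An aggregation method that needs no architecture-specific refinement is required.
Contribution: 1. A cross-attention-based dynamic neighbor aggregation method; 2. the AttentionViG architecture, which supports non-local message passing; 3. validation of performance and efficiency on multiple tasks.
Method: Query projections are computed from the node and key projections from its neighbors; cross-attention then aggregates neighbor information dynamically.
Result: SOTA performance on ImageNet-1K, with good transfer to downstream tasks on COCO and ADE20K.
Insight: Cross-attention gives graph neural networks a flexible and efficient way to model node relationships and outperforms conventional graph convolution methods.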
Abstract: Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
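The aggregation described above, with queries taken from the node and keys from its neighbors, can be sketched in a few lines of plain Python. The abstract does not say where the values come from or how scores are scaled, so the value projection `w_v`, the `1/sqrt(d)` scaling, and every name below are assumptions in the style of standard scaled dot-product attention, not the AttentionViG implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention_aggregate(node, neighbors, w_q, w_k, w_v):
    """Aggregate neighbor features with attention weights from the node's query."""
    q = matvec(w_q, node)                          # query from the node
    keys = [matvec(w_k, n) for n in neighbors]     # keys from the neighbors
    values = [matvec(w_v, n) for n in neighbors]   # values from neighbors (assumed)
    scale = math.sqrt(len(q))
    attn = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    out = [sum(a * v[i] for a, v in zip(attn, values)) for i in range(dim)]
    return out, attn


identity = [[1.0, 0.0], [0.0, 1.0]]
node = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
out, attn = cross_attention_aggregate(node, neighbors, identity, identity, identity)
print(attn, out)
```

With identity projections, the neighbor whose feature matches the node receives the larger weight, which is the "dynamic" behavior that fixed aggregators such as max-relative pooling lack.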
[53] MetaChest: Generalized few-shot learning of patologies from chest X-rays
Berenice Montalvo-Lezama,Gibran Fuentes-Pineda
Main category: cs.CV
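TL;DR: The paper presents the MetaChest dataset and a multi-label few-shot learning framework to address the shortage of annotated data in medical image analysis, and shows that transfer learning performs better on generalized few-shot classification tasks.
Details
Motivation: Annotated data is scarce in medical image analysis. Standard few-shot learning assumes all classes in a task are new, whereas real applications often need to learn new classes while using knowledge of known ones.
Contribution: 1) MetaChest, a dataset of 479,215 chest X-rays; 2) a multi-label few-shot learning framework; 3) evidence that transfer learning has the advantage in generalized few-shot classification.
Method: 1) Uses the MetaChest dataset; 2) compares a transfer learning approach with an extension of ProtoNet; 3) studies how image resolution and model architecture affect performance.
Result: Transfer learning outperforms the ProtoNet extension; higher-resolution images improve performance at higher compute cost; efficient architectures maintain performance with lower resource requirements.
Insight: 1) Generalized few-shot learning is the more practical setting for medical imaging; 2) the generalization ability of transfer learning deserves attention; 3) resource efficiency matters in medical applications.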
Abstract: The limited availability of annotated data presents a major challenge for applying deep learning methods to medical image analysis. Few-shot learning methods aim to recognize new classes from only a small number of labeled examples. These methods are typically studied under the standard few-shot learning setting, where all classes in a task are new. However, medical applications such as pathology classification from chest X-rays often require learning new classes while simultaneously leveraging knowledge of previously known ones, a scenario more closely aligned with generalized few-shot classification. Despite its practical relevance, few-shot learning has been scarcely studied in this context. In this work, we present MetaChest, a large-scale dataset of 479,215 chest X-rays collected from four public databases. MetaChest includes a meta-set partition specifically designed for standard few-shot classification, as well as an algorithm for generating multi-label episodes. We conduct extensive experiments evaluating both a standard transfer learning approach and an extension of ProtoNet across a wide range of few-shot multi-label classification tasks. Our results demonstrate that increasing the number of classes per episode and the number of training examples per class improves classification performance. Notably, the transfer learning approach consistently outperforms the ProtoNet extension, despite not being tailored for few-shot learning. We also show that higher-resolution images improve accuracy at the cost of additional computation, while efficient model architectures achieve comparable performance to larger models with significantly reduced resource requirements.
[54] K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model
Bangwei Guo,Yunhe Gao,Meng Ye,Difei Gu,Yang Zhou,Leon Axel,Dimitris Metaxas
Main category: cs.CV
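TL;DR: K-Prism is a knowledge-guided, prompt-integrated universal medical image segmentation model. By integrating semantic priors, in-context knowledge, and interactive feedback, it switches flexibly between tasks and supports joint training.
Details
Motivation: Existing medical image segmentation models are usually specific to a task, modality, or organ, which does not match clinical practice, where experts integrate knowledge from multiple sources.
Contribution: K-Prism, a unified segmentation framework that systematically integrates three knowledge paradigms and switches between them through a dual-prompt representation and an MoE decoder.
Method: 1-D sparse prompts and 2-D dense prompts encode knowledge from multiple sources, and a Mixture-of-Experts (MoE) decoder routes them dynamically.
Result: SOTA performance on semantic, in-context, and interactive segmentation across 18 public datasets.
Insight: A unified representation of multi-source knowledge with dynamic routing substantially improves the flexibility and performance of medical image segmentation.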
Abstract: Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present K-Prism, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) semantic priors learned from annotated datasets, (ii) in-context knowledge from few-shot reference examples, and (iii) interactive feedback from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining what to segment and 2-D dense prompts indicating where to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings. Code will be released upon publication.
[55] GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification
Yijia Weng,Zhicheng Wang,Songyou Peng,Saining Xie,Howard Zhou,Leonidas J. Guibas
Main category: cs.CV
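TL;DR: The paper proposes GaussianLens for localized high-resolution reconstruction. On-demand Gaussian densification avoids the high compute cost of uniform-resolution reconstruction while capturing fine detail in user-specified regions of interest.
Details
Motivation: Human vision focuses on regions of interest, such as shelf labels in a grocery store. Existing 3D Gaussian Splatting (3DGS) methods reconstruct at uniform resolution, which is computationally expensive and prevents them from using the original high-resolution images to recover detail. The paper aims to bridge the gap between the high cost of global high-resolution reconstruction and the user's need for localized fine detail.
Contribution: 1. Defines the problem of localized high-resolution reconstruction, with on-demand Gaussian densification of a user-specified region of interest (RoI); 2. proposes GaussianLens, a feed-forward framework that fuses multi-modal information from the initial 3DGS and multi-view images; 3. designs a pixel-guided densification mechanism that captures detail under large resolution increases.
Method: Starting from a low-resolution 3DGS reconstruction, GaussianLens learns a generalizable network that densifies the Gaussians in the RoI using sparse high-resolution observations. It fuses multi-modal information from the 3DGS and multi-view images and uses pixel guidance for efficient detail reconstruction.
Result: GaussianLens performs strongly on local fine-detail reconstruction and scales to images of up to 1024×1024 resolution.
Insight: An on-demand local reconstruction strategy avoids the cost and redundancy of uniform high-resolution reconstruction and makes full use of high-resolution images in critical regions, giving an efficient solution for practical use.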
Abstract: We perceive our surroundings with an active focus, paying more attention to regions of interest, such as the shelf labels in a grocery store. When it comes to scene reconstruction, this human perception trait calls for spatially varying degrees of detail ready for closer inspection in critical regions, preferably reconstructed on demand. While recent works in 3D Gaussian Splatting (3DGS) achieve fast, generalizable reconstruction from sparse views, their uniform resolution output leads to high computational costs unscalable to high-resolution training. As a result, they cannot leverage available images at their original high resolution to reconstruct details. Per-scene optimization methods reconstruct finer details with adaptive density control, yet require dense observations and lengthy offline optimization. To bridge the gap between the prohibitive cost of high-resolution holistic reconstructions and the user needs for localized fine details, we propose the problem of localized high-resolution reconstruction via on-demand Gaussian densification. Given a low-resolution 3DGS reconstruction, the goal is to learn a generalizable network that densifies the initial 3DGS to capture fine details in a user-specified local region of interest (RoI), based on sparse high-resolution observations of the RoI. This formulation avoids the high cost and redundancy of uniformly high-resolution reconstructions and fully leverages high-resolution captures in critical regions. We propose GaussianLens, a feed-forward densification framework that fuses multi-modal information from the initial 3DGS and multi-view images. We further design a pixel-guided densification mechanism that effectively captures details under large resolution increases. Experiments demonstrate our method’s superior performance in local fine detail reconstruction and strong scalability to images of up to $1024\times1024$ resolution.
[56] LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology
Zhenyue Qin,Yang Liu,Yu Yin,Jinyu Ding,Haoran Zhang,Anran Li,Dylan Campbell,Xuansheng Wu,Ke Zou,Tiarnan D. L. Keenan,Emily Y. Chew,Zhiyong Lu,Yih-Chung Tham,Ninghao Liu,Xiuzhen Zhang,Qingyu Chen
Main category: cs.CV
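TL;DR: LMOD+ is a large-scale multimodal ophthalmology benchmark for developing and evaluating multimodal large language models (MLLMs). It contains 32,633 instances covering 12 common ophthalmic conditions and 5 imaging modalities, supports tasks such as disease screening and staging, and is used to evaluate 24 state-of-the-art MLLMs.
Details
Motivation: Worldwide, diagnosis of vision-threatening eye diseases is limited by a shortage of medical resources. MLLMs show promise for medical image interpretation, but the lack of comprehensive benchmarks suitable for generative models has held back their progress in ophthalmology.
Contribution: 1) Expanded dataset size (by nearly 50%) and task coverage; 2) multimodal data with multi-granular annotations; 3) a systematic evaluation of 24 state-of-the-art MLLMs that exposes their strengths and limitations.
Method: The dataset integrates multiple imaging modalities, anatomical structures, demographics, and free-text annotations, and supports disease screening, staging, and demographic prediction.
Result: The best models reach about 58% accuracy on disease screening in the zero-shot setting but perform poorly on harder tasks such as disease staging.
Insight: MLLMs show potential in ophthalmology, but performance on complex tasks still needs to improve; public release of the dataset may help advance ophthalmic AI.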
Abstract: Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.
[57] Anchor-free Cross-view Object Geo-localization with Gaussian Position Encoding and Cross-view Association
Xingtao Ling,Chenlin Fu,Yingying Zhu
Main category: cs.CV
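TL;DR: AFGeo is an anchor-free method for cross-view object geo-localization. Gaussian position encoding and a cross-view association module remove the dependence on predefined anchors and make localization more robust.
Details
Motivation: Existing cross-view geo-localization methods depend on predefined anchors, which limits flexibility. AFGeo aims to remove this dependence and improve localization performance.
Contribution: 1) AFGeo, an anchor-free localization method; 2) Gaussian Position Encoding (GPE) for a stronger spatial prior; 3) a Cross-view Object Association Module (CVOAM) that improves association across views.
Method: 1) Directly predicts, for each pixel, the four directional offsets to the target box; 2) uses GPE to model the click point in the query image; 3) uses CVOAM to relate the object and its context across viewpoints.
Result: AFGeo is lightweight and efficient and achieves state-of-the-art performance on benchmark datasets.
Insight: Anchor-free methods are more flexible, and combining GPE with CVOAM effectively reduces localization uncertainty in cross-view scenarios.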
Abstract: Most existing cross-view object geo-localization approaches adopt an anchor-based paradigm. Although effective, such methods are inherently constrained by predefined anchors. To eliminate this dependency, we first propose an anchor-free formulation for cross-view object geo-localization, termed AFGeo. AFGeo directly predicts the four directional offsets (left, right, top, bottom) to the ground-truth box for each pixel, thereby localizing the object without any predefined anchors. To obtain a more robust spatial prior, AFGeo incorporates Gaussian Position Encoding (GPE) to model the click point in the query image, mitigating the uncertainty of object position that challenges object localization in cross-view scenarios. In addition, AFGeo incorporates a Cross-view Object Association Module (CVOAM) that relates the same object and its surrounding context across viewpoints, enabling reliable localization under large cross-view appearance gaps. By adopting an anchor-free localization paradigm that integrates GPE and CVOAM with minimal parameter overhead, our model is both lightweight and computationally efficient, achieving state-of-the-art performance on benchmark datasets.
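Two ingredients of the anchor-free formulation lend themselves to a small sketch: encoding the click point as a Gaussian map, and decoding a box from per-pixel (left, right, top, bottom) offsets. The isotropic Gaussian form, the `sigma` parameter, and the function names are assumptions for illustration; the abstract does not specify the exact encoding AFGeo uses.

```python
import math


def gaussian_position_encoding(height, width, click_xy, sigma):
    """Isotropic Gaussian heatmap centered on the click point (assumed form)."""
    cx, cy = click_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def decode_box(pixel_xy, offsets):
    """Turn the four directional offsets predicted at a pixel into a box.

    offsets follow the order used in the abstract: (left, right, top, bottom).
    Returns (x_min, y_min, x_max, y_max).
    """
    x, y = pixel_xy
    left, right, top, bottom = offsets
    return (x - left, y - top, x + right, y + bottom)


heatmap = gaussian_position_encoding(5, 5, click_xy=(2, 2), sigma=1.0)
box = decode_box((10, 20), (3, 4, 5, 6))
print(heatmap[2][2], box)
```

A soft heatmap, unlike a single hot pixel, tolerates an imprecise click, which matches the stated goal of mitigating positional uncertainty.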
[58] Generalized Contrastive Learning for Universal Multimodal Retrieval
Jungsoo Lee,Janghoon Cho,Hyojin Park,Munawar Hayat,Kyuwoong Hwang,Fatih Porikli,Sungha Choi
Main category: cs.CV
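TL;DR: The paper proposes Generalized Contrastive Learning (GCL), a new loss formulation that improves multimodal retrieval performance without the burden of curating new datasets.
Details
Motivation: Cross-modal retrieval models such as CLIP degrade when retrieving keys that fuse image and text (for example, Wikipedia pages with both images and text). Existing remedies require carefully constructed new datasets and do not generalize to unseen modality combinations.
Contribution: GCL, which uses existing image-text paired datasets and performs contrastive learning across all modalities within a mini-batch to learn a unified representation space, with no additional dataset.
Method: GCL enforces contrastive learning across all modalities within a mini-batch, using existing image-text paired datasets to learn a unified multimodal representation space.
Result: GCL improves off-the-shelf multimodal retrieval models (VISTA, CLIP, TinyCLIP) on the M-BEIR, MMEB, and CoVR benchmarks.
Insight: Existing datasets plus multimodal contrastive learning are enough to improve the generalization of multimodal retrieval, avoiding the cost of building new datasets.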
Abstract: Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of fused image-text modalities (e.g., Wikipedia pages with both images and text). To address this critical challenge, multimodal retrieval has been recently explored to develop a unified single retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image-text triplets (e.g., retrieving a pair of image and text given a query image). However, such an approach requires careful curation to ensure the dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burdensome need for new dataset curation. Specifically, GCL operates by enforcing contrastive learning across all modalities within a mini-batch, utilizing existing image-caption paired datasets to learn a unified representation space. We demonstrate the effectiveness of GCL by showing consistent performance improvements on off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP) using the M-BEIR, MMEB, and CoVR benchmarks.
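One way to read "contrastive learning across all modalities within a mini-batch" is an InfoNCE loss averaged over every ordered pair of views (image, text, fused image-text) of the same samples. The sketch below takes that reading. The fused embedding as a normalized sum, the temperature of 0.07, and the pairwise averaging are assumptions chosen for illustration and may differ from the GCL loss in the paper.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against a zero vector
    return [x / n for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(queries)


def generalized_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average InfoNCE over all ordered pairs of modality views (assumed form)."""
    fused = [normalize([a + b for a, b in zip(i, t)])
             for i, t in zip(image_emb, text_emb)]
    views = {"image": image_emb, "text": text_emb, "fused": fused}
    total, count = 0.0, 0
    for q_name, q_view in views.items():
        for k_name, k_view in views.items():
            if q_name != k_name:
                total += info_nce(q_view, k_view, temperature)
                count += 1
    return total / count


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = generalized_contrastive_loss(images, aligned_text)
shuffled_loss = generalized_contrastive_loss(images, shuffled_text)
print(aligned_loss, shuffled_loss)
```

Because the fused view is built from pairs already present in an image-caption dataset, no new triplet dataset is needed, which is the practical point the abstract makes.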
[59] Using Images from a Video Game to Improve the Detection of Truck Axles
Leandro Arab Marcomini,Andre Luiz Cunha
Main category: cs.CV
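TL;DR: The paper examines whether synthetic images generated from a video game can train CNNs to detect real truck axles. Experiments with three YOLO architectures show that synthetic images clearly improve model performance.
Details
Motivation: CNN training traditionally needs large amounts of real data, which is costly to collect. The paper studies whether synthetic images from a video game can serve as a low-cost alternative for training high-performing models.
Contribution: Shows that synthetic images generated from a video game can train CNNs effectively and improve truck axle detection, offering a low-cost data source.
Method: Three datasets are built from real and synthetic images and used to train three YOLO architectures. Performance is measured by recall, precision, F1-score, and mAP, and the Mann-Whitney U test is used to assess statistical significance.
Result: Synthetic images improved the performance of all networks, with the highest mAP reaching 99%, indicating that they are a reliable source of training data.
Insight: Synthetic images are a cost-effective answer to data scarcity and have practical potential, especially for object detection.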
Abstract: Convolutional Neural Networks (CNNs) traditionally require large amounts of data to train models with good performance. However, data collection is an expensive process, both in time and resources. Generated synthetic images are a good alternative, with video games producing realistic 3D models. This paper aims to determine whether images extracted from a video game can be effectively used to train a CNN to detect real-life truck axles. Three different databases were created, with real-life and synthetic trucks, to provide training and testing examples for three different You Only Look Once (YOLO) architectures. Results were evaluated based on four metrics: recall, precision, F1-score, and mean Average Precision (mAP). To evaluate the statistical significance of the results, the Mann-Whitney U test was also applied to the resulting mAP of all models. Synthetic images from trucks extracted from a video game proved to be a reliable source of training data, contributing to the performance of all networks. The highest mAP score reached 99%. Results indicate that synthetic images can be used to train neural networks, providing a reliable, low-cost data source for extracting knowledge.
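For readers unfamiliar with the Mann-Whitney U test mentioned above, the statistic itself is simple to compute from ranks. The sketch below implements only the U statistic with average ranks for ties, not the p-value, and the mAP values are made-up placeholders, not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical mAP scores for models trained without and with synthetic images.
map_real_only = [0.91, 0.93, 0.95]
map_with_synthetic = [0.97, 0.98, 0.99]
u_real, u_synth = mann_whitney_u(map_real_only, map_with_synthetic)
print(u_real, u_synth)
```

The smaller of the two U values is compared against a critical value (or converted to a p-value) to decide significance. A rank-based test is a natural choice here because mAP scores from a handful of models cannot be assumed to be normally distributed.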
[60] DescribeEarth: Describe Anything for Remote Sensing Images
Kaiyu Li,Zixuan Jiang,Xiangyong Cao,Jiayu Wang,Yuchen Xiao,Deyu Meng,Zhi Wang
Main category: cs.CV
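TL;DR: The paper introduces the Geo-DLC task to address the lack of fine-grained object-level description for remote sensing images, and builds DE-Dataset and DE-Benchmark. The proposed DescribeEarth model adds a scale-adaptive focal strategy and a domain-guided fusion module to a multimodal large language model architecture and clearly outperforms existing methods.
Details
Motivation: Research on remote sensing image captioning works mainly at the image level and lacks fine-grained object-level description, which limits use of the rich semantic information in remote sensing images.
Contribution: 1) The Geo-DLC task; 2) DE-Dataset and DE-Benchmark; 3) the DescribeEarth model, which integrates a scale-adaptive focal strategy and a domain-guided fusion module.
Method: DescribeEarth uses a multimodal large language model architecture with a scale-adaptive focal strategy (to capture high-resolution detail) and a domain-guided fusion module (to use features from remote sensing vision-language models).
Result: DescribeEarth outperforms existing general-purpose MLLMs across DE-Benchmark, most notably in describing object features and environmental attributes.
Insight: Fine-grained object-level description is key for remote sensing applications, and combining domain knowledge with multimodal large language models markedly improves description quality.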
Abstract: Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted question-answering based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code and weights are released at https://github.com/earth-insights/DescribeEarth.
[61] AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Xiping Li,Jianghong Ma
Main category: cs.CV
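TL;DR: AIMCoT is an active, information-driven multimodal chain-of-thought framework. Context-enhanced attention-map generation, active visual probing, and a dynamic attention-shifting trigger together yield clear gains in vision-language reasoning.
Details
Motivation: Existing multimodal chain-of-thought methods are built on simplistic heuristics such as attention maps. These are unreliable and select information passively, so they cannot adapt to what the model needs at each reasoning step.
Contribution: 1) Context-enhanced Attention-map Generation (CAG), which mitigates the granularity imbalance between text and vision; 2) Active Visual Probing (AVP), which selects image regions actively on information-theoretic grounds; 3) a Dynamic Attention-shifting Trigger (DAT), which decides when to insert visual information.
Method: AIMCoT combines CAG, AVP, and DAT, achieving efficient multimodal reasoning through more reliable attention maps, active information selection, and dynamic triggering.
Result: AIMCoT significantly outperforms existing methods on three challenging benchmarks.
Insight: Active information selection and dynamic attention shifting are key to better multimodal chain-of-thought performance; AIMCoT offers a mode of reasoning closer to how humans reason.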
Abstract: Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing vision-language reasoning with interleaved information. However, existing methods often rely on simplistic heuristics for constructing interleaved CoT, typically depending on attention maps, which our empirical analysis reveals can be unreliable. Moreover, the shortcomings of their passive and purposeless selection strategies and their arbitrary triggering mechanisms in capturing the model’s cognitive need for information are further amplified. In this paper, we propose AIMCoT, an Active Information-driven Multi-modal Chain-of-Thought framework that addresses these fundamental limitations. AIMCoT introduces three synergistic components: (1) Context-enhanced Attention-map Generation (CAG), which mitigates the text-vision granularity imbalance, thereby producing more reliable attention maps as a foundation. (2) Active Visual Probing (AVP), which replaces passive selection with a proactive, goal-oriented strategy grounded in information theory to select image regions that help answer the questions maximally. (3) Dynamic Attention-shifting Trigger (DAT), which intelligently determines the optimal moments to insert visual information by monitoring the model’s text-to-vision attention shifts. Extensive experiments on three challenging benchmarks demonstrate that AIMCoT significantly outperforms state-of-the-art methods across different settings. By actively foraging for information and dynamically structuring its reasoning process, AIMCoT represents a critical step towards more robust, effective, and human-like multimodal reasoning. Our code is available at https://anonymous.4open.science/r/AIMCoT.
[62] ProbMed: A Probabilistic Framework for Medical Multimodal Binding
Yuan Gao,Sangwook Kim,Jianzhong You,Chris McIntosh
Main category: cs.CV
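TL;DR: ProbMED is a probabilistic multimodal medical vision-language pretraining model. It uses probabilistic contrastive learning to model distributions over embeddings, addressing the inability of existing models to handle many-to-many mappings in medical data.
Details
Motivation: Medical decision-making requires integrating information from several modalities, such as imaging and text, but existing Med-VLPMs do not directly handle the many-to-many mapping between them, which limits the representational power of their embeddings.
Contribution: The ProbMED framework, which introduces probabilistic contrastive learning to model the embedding distributions of four modalities in one space; an InfoNCE loss with Hellinger distance and an intra-modality binding loss that improve cross-modal retrieval and classification.
Method: 1. Probabilistic contrastive learning models distributions over embeddings; 2. an InfoNCE loss with Hellinger distance optimizes cross-modal alignment; 3. a probabilistic synthetic sampling loss strengthens intra-modality binding.
Result: Across 13 medical datasets, ProbMED outperforms existing Med-VLPMs on cross-modal retrieval and on zero-shot and few-shot classification, and shows robust multimodal integration.
Insight: A probabilistic framework models the many-to-many relationships in multimodal medical data more naturally, and distributional embeddings outperform deterministic ones, suiting the complexity and uncertainty of medical data.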
Abstract: Medical decision-making requires integrating diverse medical information, from imaging to clinical narratives. These medical modalities are often acquired in a many-to-many manner. However, current medical vision-language pretraining models (Med-VLPMs) fail to directly account for this many-to-many mapping in their model training and embeddings. To address this, we present Probabilistic Modality-Enhanced Diagnosis (ProbMED), a multimodal Med-VLPM that employs probabilistic contrastive learning to model distributions over embeddings rather than deterministic estimates. ProbMED aligns four distinct modalities – chest X-rays, electrocardiograms, echocardiograms, and clinical text – into a unified probabilistic embedding space. We use InfoNCE loss with Hellinger distance to integrate inter-modality distributions. We introduce a probabilistic synthetic sampling loss that captures modality-specific mean and variance to improve intra-modality binding. Extensive experiments across 13 medical datasets demonstrate that our model outperforms current Med-VLPMs in cross-modality retrieval, zero-shot, and few-shot classification. We also demonstrate the robust integration of multiple modalities for prognostication, showing improved intra- and inter-medical modality binding.
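When embeddings are distributions rather than points, the Hellinger distance has a closed form if each embedding is a Gaussian with diagonal covariance. The sketch below assumes that parameterization, which the abstract suggests (it mentions modality-specific mean and variance) but does not confirm; the function name is hypothetical.

```python
import math


def squared_hellinger_diag_gaussians(mu1, var1, mu2, var2):
    """Squared Hellinger distance between two diagonal-covariance Gaussians.

    Uses H^2 = 1 - BC, where the Bhattacharyya coefficient BC factorizes
    over dimensions as
        sqrt(2*s1*s2 / (s1^2 + s2^2)) * exp(-(m1 - m2)^2 / (4*(s1^2 + s2^2))).
    Inputs are per-dimension means and variances (variances must be > 0).
    """
    log_bc = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        log_bc += 0.5 * math.log(2.0 * math.sqrt(v1 * v2) / (v1 + v2))
        log_bc -= (m1 - m2) ** 2 / (4.0 * (v1 + v2))
    return 1.0 - math.exp(log_bc)


same = squared_hellinger_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
shifted = squared_hellinger_diag_gaussians([0.0], [1.0], [2.0], [1.0])
print(same, shifted)
```

The value is bounded in [0, 1], so its negative can serve as a well-behaved similarity score inside a softmax-based contrastive loss. How exactly ProbMED plugs the distance into InfoNCE is not stated in the abstract.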
[63] Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization
Xintong Li,Chuhan Wang,Junda Wu,Rohan Surana,Tong Yu,Julian McAuley,Jingbo Shang
Main category: cs.CV
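TL;DR: MISP-DPO is a framework that brings multi-negative direct preference optimization to vision-language models. Using the Plackett-Luce model and an importance sampling strategy, it addresses the optimization bias and hallucinations that arise when existing methods use a single negative sample.
Details
Motivation: Existing multimodal direct preference optimization methods rely on simple pairwise comparisons and generate only a single negative sample, which cannot capture the complexity of multimodal preferences and leads to optimization bias and hallucinations.
Contribution: 1) Proposes MISP-DPO, the first framework to incorporate multiple negatives in multimodal DPO; 2) uses the Plackett-Luce model to handle multi-negative comparisons; 3) designs an importance sampling strategy that improves training efficiency.
Method: Prompts and candidate images are embedded with CLIP, and a sparse autoencoder is used to expose semantic deviations. Negatives are selected by reconstruction difficulty, semantic deviation, and diversity, and training combines a Plackett-Luce objective with importance sampling.
Result: Across five diverse benchmarks, MISP-DPO consistently improves multimodal alignment over prior methods, validating semantic-aware multi-negative sampling.
Insight: The choice and diversity of negative samples are critical for multimodal alignment. The Plackett-Luce model and importance sampling offer a new approach to multimodal preference learning.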
Abstract: Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval, which fail to capture the complex nature of multimodal preferences, inducing optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to uncover semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
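To make the Plackett-Luce objective mentioned above concrete, here is a minimal sketch of its negative log-likelihood over a list of scores ordered from most to least preferred. The top-1 variant (one positive ranked above an unordered set of negatives) is included as an assumption about how a multi-negative setup might use it. The paper's exact objective and its importance sampling weights are not reproduced here.

```python
import math

def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def plackett_luce_nll(scores_in_rank_order):
    """Negative log-likelihood of a full ranking under the Plackett-Luce model.

    scores_in_rank_order[0] belongs to the most preferred item. At each
    position, the chosen item competes against all items not yet placed.
    """
    nll = 0.0
    for k in range(len(scores_in_rank_order) - 1):
        tail = scores_in_rank_order[k:]
        nll += _logsumexp(tail) - scores_in_rank_order[k]
    return nll

def top1_nll(positive_score, negative_scores):
    """Only the first Plackett-Luce factor: the positive beats every negative."""
    return _logsumexp([positive_score] + list(negative_scores)) - positive_score

# With a single negative, both forms reduce to the pairwise Bradley-Terry loss.
pairwise = plackett_luce_nll([1.0, 0.0])
```

The pairwise case is what standard DPO optimizes, so adding negatives generalizes the comparison rather than replacing it.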
[64] LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing
Zhenghao Zhang,Ziying Zhang,Junchao Liao,Xiangyu Meng,Qiang Hu,Siyu Zhu,Xiaoyun Zhang,Long Qin,Weizhi Wang
Main category: cs.CV
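TL;DR: LaTo performs fine-grained face editing with a landmark-tokenized diffusion transformer. It addresses the weaknesses of existing methods in attribute control and identity preservation through three innovations: landmark tokenization, location-mapping positional encoding, and a landmark predictor.
Details
Motivation: Existing multimodal face editing methods struggle with precise attribute control and identity preservation, and perform poorly when the conditional landmarks differ substantially from the source image.
Contribution: 1) A landmark tokenizer that quantizes raw landmark coordinates into discrete tokens; 2) a location-mapping positional encoding that processes landmark and image tokens in a unified way; 3) a landmark predictor based on vision-language models that improves the accuracy of target landmark estimation.
Method: A diffusion transformer architecture combined with landmark tokenization, location-mapping positional encoding, and a landmark predictor, enabling flexible and efficient decoupled geometry-appearance interaction.
Result: LaTo outperforms existing methods by 7.8% in identity preservation and 4.6% in semantic consistency. The authors also construct the HFL-150K dataset.
Insight: Landmark tokenization and a unified encoding strategy substantially improve editing flexibility and identity preservation, and chain-of-thought reasoning in the vision-language model improves landmark estimation accuracy.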
Abstract: Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapping positional encoding that integrates facial and image tokens for unified processing, enabling flexible yet decoupled geometry-appearance interactions with high efficiency and strong identity preservation; and (3) a landmark predictor that leverages vision-language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency. Code and dataset will be made publicly available upon acceptance.
[65] The 1st Solution for MOSEv1 Challenge on LSVOS 2025: CGFSeg
Tingmin Li,Yixuan Li,Yang Yang
Main category: cs.CV
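TL;DR: CGFSeg is an improved video object segmentation method that took first place in the MOSEv1 challenge. It freezes SAM2's feature extractor and adds a pixel-check strategy, substantially improving segmentation in complex scenes.
Details
Motivation: Video object segmentation (VOS) remains challenging in complex real-world scenes, especially with long-term object disappearance and reappearance and with small or inconspicuous targets. The MOSEv1 and LVOS datasets are designed to improve the robustness of VOS models.
Contribution: Proposes Confidence-Guided Fusion Segmentation (CGFSeg), which combines a frozen SAM2 feature extractor with a pixel-check strategy to improve segmentation accuracy and robustness.
Method: During training, SAM2's feature extractor is frozen and the remaining components are fine-tuned to preserve feature extraction ability. At inference, a pixel-check strategy progressively refines predictions by exploiting the complementary strengths of multiple models.
Result: CGFSeg achieves a J&F score of 86.37% on the MOSEv1 challenge test set, ranking first.
Insight: Freezing a pretrained feature extractor and combining it with a multi-model complementary strategy can effectively improve segmentation in complex scenes.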
Abstract: Video Object Segmentation (VOS) aims to track and segment specific objects across entire video sequences, yet it remains highly challenging under complex real-world scenarios. The MOSEv1 and LVOS datasets, adopted in the MOSEv1 challenge on LSVOS 2025, are specifically designed to enhance the robustness of VOS models in complex real-world scenarios, including long-term object disappearances and reappearances, as well as the presence of small and inconspicuous objects. In this paper, we present our improved method, Confidence-Guided Fusion Segmentation (CGFSeg), for the VOS task in the MOSEv1 Challenge. During training, the feature extractor of SAM2 is frozen, while the remaining components are fine-tuned to preserve strong feature extraction ability and improve segmentation accuracy. In the inference stage, we introduce a pixel-check strategy that progressively refines predictions by exploiting complementary strengths of multiple models, thereby yielding robust final masks. As a result, our method achieves a J&F score of 86.37% on the test set, ranking 1st in the MOSEv1 Challenge at LSVOS 2025. These results highlight the effectiveness of our approach in addressing the challenges of the VOS task in complex scenarios.
[66] LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion
Donghwan Kim,Tae-Kyun Kim
Main category: cs.CV
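TL;DR: The paper proposes LieHMR, which uses an $SO(3)$ diffusion model to generate a distribution over human poses. A transformer extracts per-joint latent vectors, and a small MLP denoising model learns the conditional distribution of each joint, addressing the trade-off between single-prediction accuracy and sample diversity.
Details
Motivation: Most existing human mesh recovery methods produce a single deterministic output and cannot fully model ambiguity. Probabilistic methods generate diverse results but lack accuracy. This paper aims to improve both diversity and accuracy by modeling a pose distribution aligned with the 2D observations.
Contribution: 1. Introduces an $SO(3)$ diffusion model that generates a distribution over pose parameters; 2. proposes a hierarchical transformer structure that extracts per-joint latent vectors; 3. uses a small MLP denoising model to learn per-joint conditional distributions, improving distribution alignment and prediction accuracy.
Method: 1. An $SO(3)$ diffusion model generates pose distributions both unconditionally and conditioned on the image; 2. a hierarchical transformer extracts per-joint latent vectors; 3. an MLP denoising model learns the distribution of each joint.
Result: Experiments show that the method effectively predicts accurate pose probability distributions, addressing the trade-off between diversity and accuracy.
Insight: Combining a diffusion model with a transformer models the ambiguity of the problem while retaining high accuracy, offering a new approach to human mesh recovery.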
Abstract: We tackle the problem of Human Mesh Recovery (HMR) from a single RGB image, formulating it as image-conditioned human pose and shape generation. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches regress a single deterministic output. Probabilistic methods attempt to address this by generating multiple plausible outputs to model the ambiguity. However, these methods often exhibit a trade-off between accuracy and sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. To overcome these limitations, we propose a novel approach that models a distribution well aligned with the 2D observations. In particular, we introduce an $SO(3)$ diffusion model, which generates the distribution of pose parameters, represented as 3D rotations, both unconditionally and conditioned on image observations via conditioning dropout. Our model learns the hierarchical structure of human body joints using a transformer. Instead of using the transformer as a denoising model, a time-independent transformer extracts latent vectors for the joints, and a small MLP-based denoising model learns the per-joint distribution conditioned on the latent vector. We experimentally demonstrate and analyze that our model effectively predicts accurate pose probability distributions.
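Since the abstract above represents pose parameters as 3D rotations, a short sketch may help show what working on $SO(3)$ means in practice. The code maps an axis-angle vector to a rotation matrix with Rodrigues' formula and perturbs a rotation by composing it with a small rotation, so the result stays a valid rotation. This is only an illustration of the manifold constraint. The `perturb` helper is hypothetical, and the authors' actual noise schedule and denoiser are not shown.

```python
import math

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector w (3 floats) -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in w))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (c / theta for c in w)
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]  # skew matrix of the unit axis
    K2 = matmul(K, K)
    sin_t, one_minus_cos = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + sin_t * K[i][j] + one_minus_cos * K2[i][j]
             for j in range(3)] for i in range(3)]

def perturb(R, noise_w):
    """Compose R with a small rotation so the noisy sample remains in SO(3)."""
    return matmul(R, so3_exp(noise_w))

# Usage: a quarter turn about the z-axis, then a small perturbation.
R = so3_exp([0.0, 0.0, math.pi / 2])
R_noisy = perturb(R, [0.3, -0.2, 0.1])
```

Adding Gaussian noise directly to the nine matrix entries would leave the rotation group, which is one reason a diffusion process defined on $SO(3)$ itself is attractive.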
[67] Dragging with Geometry: From Pixels to Geometry-Guided Image Editing
Xinyu Pu,Hongsong Wang,Jie Gui,Pan Zhou
Main category: cs.CV
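TL;DR: GeoDrag is a geometry-guided image editing method that combines 3D geometric cues with 2D spatial priors, addressing the imprecise and inconsistent results of existing pixel-level drag editing in geometry-intensive scenarios.
Details
Motivation: Existing drag-based image editing methods operate mainly on the 2D pixel plane and make little use of 3D geometric cues, which leads to imprecise and inconsistent edits in geometry-intensive scenarios such as rotations and perspective transformations.
Contribution: 1) A unified displacement field that jointly encodes 3D geometry and 2D spatial priors; 2) a conflict-free partitioning strategy that isolates editing regions; 3) structure-consistent, high-fidelity editing in a single forward pass.
Method: GeoDrag uses a unified displacement field that combines 3D geometric cues with 2D spatial priors, and isolates editing regions with a conflict-free partitioning strategy to keep edits consistent and high-fidelity.
Result: Experiments across various editing scenarios show superior precision, structural consistency, and reliable multi-point editing.
Insight: Bringing 3D geometric information into 2D image editing can effectively improve editing quality and consistency under complex geometric transformations.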
Abstract: Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, most drag-based methods operate primarily on the 2D pixel plane with limited use of 3D cues. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method - GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. The code will be available on https://github.com/xinyu-pu/GeoDrag .
[68] FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
Siddhant Sukhani,Yash Bhardwaj,Riya Bhadani,Veer Kejriwal,Michael Galarnyk,Sudheer Chava
Main category: cs.CV
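TL;DR: The paper evaluates how well multimodal large language models (MLLMs) generate topic-aligned captions for financial short-form videos (SVs). Testing seven modality combinations across five topics, it finds that video alone performs strongly on most topics, while adding too many modalities may introduce noise.
Details
Motivation: To study multimodal caption generation for financial short-form videos, fill the research gap in this area, and provide baseline results and resources.
Contribution: 1. Establishes baseline results for financial short-form video captioning; 2. finds that the video modality alone performs strongly on most tasks; 3. releases code and data.
Method: Using 624 annotated YouTube short videos, the authors test seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics (e.g., sentiment analysis and financial entity recognition).
Result: The video modality alone (V) performs strongly on four of the five topics; selective pairs such as TV or AV often surpass the full combination (TAV).
Insight: 1. The visual modality is highly valuable in financial short-form videos; 2. too many modalities may introduce noise, so modality combinations should be chosen carefully.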
Abstract: We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value for capturing visual context and effective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that too many modalities may introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate the potential and challenges of grounding complex visual cues in this domain. All code and data can be found on our Github under the CC-BY-NC-SA 4.0 license.
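The seven modality combinations listed above are exactly the non-empty subsets of the three modalities (2^3 - 1 = 7). A small sketch that enumerates them, using a hypothetical helper name:

```python
from itertools import combinations

MODALITIES = ("T", "A", "V")  # transcript, audio, video

def modality_subsets(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest subsets first."""
    return ["".join(combo)
            for size in range(1, len(modalities) + 1)
            for combo in combinations(modalities, size)]

subsets = modality_subsets()
```

An evaluation loop over `subsets` would then run the same captioning prompt once per combination, which is what makes a comparison such as TV versus TAV possible.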
[69] Dolphin v1.0 Technical Report
Taohan Weng,Chi zhang,Chaoran Yan,Siya Liu,Xiaoyang Liu,Yalun Wu,Boyang Wang,Boyan Wang,Jiren Ren,Kaiwen Yan,Jinze Yu,Kaibing Hu,Henan Liu,Haoyun zheng,Anjie Le,Hongcheng Guo
Main category: cs.CV
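TL;DR: Dolphin v1.0 and its reasoning-augmented version, Dolphin R1, are the first large-scale ultrasound foundation models to unify multiple clinical tasks. A three-stage training strategy addresses ultrasound data variability and noise, and the models perform strongly across many tasks.
Details
Motivation: Ultrasound is essential in medicine, but operator dependence, image noise, and real-time scanning make AI integration difficult. Existing large models perform well in other medical imaging areas but adapt poorly to the complexities of ultrasound.
Contribution: 1. Proposes Dolphin, the first vision-language foundation model to unify multiple ultrasound tasks; 2. improves robustness with a large-scale multimodal dataset (2 million samples); 3. introduces a three-stage training strategy (pretraining, instruction alignment, reinforcement-based refinement) that substantially improves diagnostic reasoning.
Method: 1. Build a dataset combining textbook knowledge, public data, and synthetic samples; 2. train in three stages: domain-specialized pretraining, instruction alignment, and reinforcement-based refinement; 3. Dolphin R1 adds ultrasound-specific rewards to strengthen reasoning.
Result: Across the eight ultrasound tasks of U2-Bench, Dolphin R1 reaches a U2-score of 0.5835, nearly twice that of the second-best model (0.2968), supporting the unified framework and reasoning-oriented training.
Insight: Reasoning-enhanced training is important for improving the diagnostic accuracy, consistency, and interpretability of high-stakes medical AI. Dataset diversity and task-specific rewards are key to its success.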
Abstract: Ultrasound is crucial in modern medicine but faces challenges like operator dependence, image noise, and real-time scanning, hindering AI integration. While large multimodal models excel in other medical imaging areas, they struggle with ultrasound’s complexities. To address this, we introduce Dolphin v1.0 (V1) and its reasoning-augmented version, Dolphin R1, the first large-scale multimodal ultrasound foundation models unifying diverse clinical tasks in a single vision-language framework. To tackle ultrasound variability and noise, we curated a 2-million-scale multimodal dataset, combining textbook knowledge, public data, synthetic samples, and general corpora. This ensures robust perception, generalization, and clinical adaptability. The Dolphin series employs a three-stage training strategy: domain-specialized pretraining, instruction-driven alignment, and reinforcement-based refinement. Dolphin v1.0 delivers reliable performance in classification, detection, regression, and report generation. Dolphin R1 enhances diagnostic inference, reasoning transparency, and interpretability through reinforcement learning with ultrasound-specific rewards. Evaluated on U2-Bench across eight ultrasound tasks, Dolphin R1 achieves a U2-score of 0.5835, nearly twice that of the second-best model (0.2968), setting a new state of the art. Dolphin v1.0 also performs competitively, validating the unified framework. Comparisons show reasoning-enhanced training significantly improves diagnostic accuracy, consistency, and interpretability, highlighting its importance for high-stakes medical AI.
[70] ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On
Junseo Park,Hyeryung Jang
Main category: cs.CV
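TL;DR: ART-VITON is a measurement-guided latent diffusion framework for virtual try-on that addresses the artifacts traditional methods produce at the boundaries of non-try-on regions. By combining residual initialization with measurement-guided sampling, it produces high-quality, artifact-free try-on results.
Details
Motivation: In virtual try-on, latent diffusion models improve garment alignment and detail synthesis, but preserving non-try-on regions remains challenging. Directly replacing non-try-on regions, as traditional methods do, causes boundary artifacts.
Contribution: 1. Reformulates virtual try-on as a linear inverse problem and uses trajectory-aligned solvers that progressively enforce measurement consistency. 2. Proposes residual initialization to reduce training-inference mismatch, combined with measurement-guided sampling (data consistency, frequency-level correction, periodic standard denoising) to eliminate artifacts.
Method: 1. Use a trajectory-aligned solver to reduce abrupt changes in non-try-on regions; 2. combine residual prior-based initialization with measurement-guided sampling (data consistency, frequency-level correction, periodic standard denoising).
Result: Experiments on VITON-HD, DressCode, and SHHQ-1.0 show that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and surpasses existing methods in visual fidelity and robustness.
Insight: A measurement-guided diffusion framework with residual initialization can balance garment alignment against preservation of non-try-on regions in complex tasks such as virtual try-on, and suggests an approach for other image generation tasks.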
Abstract: Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
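The linear inverse problem view above can be illustrated with a toy data-consistency step. Here the measurement is the masked original image, y = M * x0, where the mask M is 1 on non-try-on pixels that must be preserved. One gradient step on 0.5 * ||M * x - y||^2 pulls the current estimate toward the measurement only where the mask is active. The function names and step size are illustrative assumptions, and this generic step is not the authors' trajectory-aligned solver.

```python
def measure(image, mask):
    """Linear measurement operator: keep pixels where mask is 1, zero elsewhere."""
    return [m * p for p, m in zip(image, mask)]

def data_consistency_step(x, y, mask, step=0.5):
    """One gradient step on 0.5 * ||mask * x - y||^2 for a binary mask.

    step=1.0 reproduces hard replacement of the masked pixels; smaller
    steps move toward the measurement gradually across iterations.
    """
    return [xi - step * mi * (mi * xi - yi) for xi, yi, mi in zip(x, y, mask)]

# Toy 3-pixel example: the first two pixels are non-try-on and must be kept.
original = [2.0, 6.0, 9.0]
mask = [1.0, 1.0, 0.0]
y = measure(original, mask)
estimate = [0.0, 0.0, 4.0]
soft = data_consistency_step(estimate, y, mask, step=0.5)
hard = data_consistency_step(estimate, y, mask, step=1.0)
```

The contrast between `soft` and `hard` mirrors the abstract's point: replacing the region in one shot is the abrupt post-hoc strategy, while repeated partial steps enforce the same constraint progressively.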
[71] Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Jia Jun Cheng Xian,Muchen Li,Haotian Yang,Xin Tao,Pengfei Wan,Leonid Sigal,Renjie Liao
Main category: cs.CV
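TL;DR: The paper proposes the Text Preference Optimization (TPO) framework, which aligns text-to-image (T2I) diffusion models without paired image preference data and so avoids costly human annotation. TPO trains the model to prefer matched prompts over perturbed prompts.
Details
Motivation: Existing T2I models still struggle to align text and image accurately when generating high-quality images. Reinforcement learning methods that depend on human feedback are costly and scale poorly, so a low-cost alignment method that needs no paired data is required.
Contribution: Proposes the TPO framework for aligning T2I models without paired image preference data; extends the DPO and KTO algorithms to TDPO and TKTO; releases the code as open source.
Method: TPO constructs perturbed prompts with a large language model to serve as mismatched samples and trains the model to prefer the matched prompt. The framework is compatible with existing preference algorithms such as DPO and KTO.
Result: Across multiple benchmarks, TPO improves human preference scores and text-image alignment, outperforming the original algorithms.
Insight: By substituting perturbed prompts for paired data, TPO shows that alignment is possible without paired image preference supervision, offering a low-cost way to optimize T2I models.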
Abstract: Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables “free-lunch” alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
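The idea of preferring matched over mismatched prompts can be sketched with the standard DPO loss, where the "winner" is the image scored under its original caption and the "loser" is the same image scored under a perturbed caption. The function name and the scalar log-probability inputs are assumptions for illustration. In a diffusion model these log-probabilities would typically be approximated from denoising errors, which is not shown, and the paper's exact TDPO formulation may differ.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def text_preference_loss(logp_matched, logp_mismatched,
                         ref_logp_matched, ref_logp_mismatched, beta=0.1):
    """DPO-style loss: -log sigmoid(beta * (matched gain - mismatched gain)).

    Each gain is the log-probability under the trained model minus the
    log-probability under a frozen reference model.
    """
    matched_gain = logp_matched - ref_logp_matched
    mismatched_gain = logp_mismatched - ref_logp_mismatched
    margin = beta * (matched_gain - mismatched_gain)
    return softplus(-margin)

# Toy values: the trained model favors the matched caption more than the reference does.
loss_at_init = text_preference_loss(0.0, 0.0, 0.0, 0.0)
loss_after = text_preference_loss(-1.0, -5.0, -2.0, -2.0)
```

No image pair appears anywhere in the loss, which is the sense in which the alignment signal comes for free from caption perturbation.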
[72] V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
Zhengpeng Shi,Hengli Li,Yanpeng Zhao,Jianqun Zhou,Yuxuan Wang,Qinrong Cui,Wei Bi,Songchun Zhu,Bo Zhao,Zilong Zheng
Main category: cs.CV
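TL;DR: The paper introduces v-HUB, a video benchmark focused on visual humor understanding, for evaluating the humor comprehension of multimodal large language models (MLLMs). Experiments show that MLLMs struggle to understand humor from visual cues alone, and that adding audio helps.
Details
Motivation: Humor understanding has practical value for improving human-machine interaction, but there is no benchmark dedicated to evaluating the humor understanding of MLLMs.
Contribution: Proposes the v-HUB benchmark, a curated collection of minimally verbal short video clips with rich annotations that supports diverse evaluation tasks, and exposes the difficulties MLLMs face in visual humor understanding.
Method: The authors build the v-HUB dataset of minimally verbal short video clips with multimodal annotations, design several evaluation tasks (such as caption matching and humor explanation), and test a wide range of MLLMs.
Result: MLLM performance drops markedly when models rely on visual cues alone, while adding audio improves understanding.
Insight: Audio plays an important role in video humor understanding. Future work should consider integrating richer modalities to improve performance on complex video tasks.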
Abstract: AI models capable of comprehending humor hold real-world promise – for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, and reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.
[73] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
Jeongjae Lee,Jong Chul Ye
Main category: cs.CV
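TL;DR: PCPO is a new policy optimization framework that uses proportionate credit assignment to address the instability and high variance of alignment training for text-to-image (T2I) models, substantially improving convergence speed and image quality.
Details
Motivation: Current policy gradient methods for aligning text-to-image models suffer from training instability and high variance, which slows convergence and degrades image quality. A key cause is disproportionate credit assignment arising from the mathematical structure of the generative sampler.
Contribution: PCPO achieves proportional credit assignment through a stable objective reformulation and a reweighting of timesteps, addressing the instability and high variance and substantially improving model performance.
Method: PCPO is a proportionate credit policy optimization framework that corrects the disproportionality of credit assignment and optimizes training through a stable objective reformulation and timestep reweighting.
Result: PCPO significantly improves convergence speed and image quality, mitigates model collapse (a common failure mode in recursive training), and outperforms existing policy gradient methods, including DanceGRPO, on all benchmarks.
Insight: Disproportionate credit assignment is a key cause of unstable text-to-image training. Enforcing proportional credit assignment effectively improves training and model performance.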
Abstract: While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
[74] Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation
Mingyu Kang,Yong Suk Choi
Main category: cs.CV
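TL;DR: The paper proposes ENM Inversion, a technique that optimizes noise maps to retain both content fidelity and editability, resolving the conflict between content reconstruction and editing flexibility in existing text-guided image editing methods.
Details
Motivation: When existing diffusion-based editing methods invert the source image into noise maps, they struggle to achieve both faithful content reconstruction and flexible text-guided editing.
Contribution: Proposes ENM Inversion, analyzes the properties of noise maps that enhance editability, and introduces an editable noise refinement that minimizes the difference between the reconstructed and edited noise maps.
Method: 1. Analyze the properties of noise maps; 2. introduce an editable noise refinement; 3. optimize noise maps to achieve both content preservation and the target edit.
Result: Experiments show that ENM Inversion outperforms existing methods across a range of image editing tasks and extends to video editing with temporal consistency.
Insight: Optimizing noise maps, rather than only reconstructing the source, better balances content preservation against flexibility toward the target prompt.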
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
[75] Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Wen Wen,Tianwu Zhi,Kanglong Fan,Yang Li,Xinge Peng,Yabin Zhang,Yiting Liao,Junlin Li,Li Zhang
Main category: cs.CV
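TL;DR: EvoQuality is a self-supervised framework that needs no ground-truth labels. It uses voting and ranking to iteratively improve the image quality assessment ability of a vision-language model (VLM), substantially raising zero-shot performance and even surpassing some supervised methods.
Details
Motivation: Existing vision-language models depend on costly human-annotated data for image quality assessment (IQA). The paper proposes a label-free, self-supervised method to address this.
Contribution: Proposes the EvoQuality framework, which adapts the self-consistency principle to IQA. Pseudo-labels generated by voting, combined with GRPO optimization, allow the model to evolve on its own outputs.
Method: Repeated votes over the model's own outputs produce pseudo-rankings, which are formulated into a fidelity reward; the model is then optimized iteratively with GRPO (group relative policy optimization).
Result: In the zero-shot setting, EvoQuality improves the base VLM's PLCC by 31.8% and outperforms state-of-the-art supervised VLM-based IQA models on 5 of 7 IQA benchmarks.
Insight: Self-supervised methods have strong potential in IQA. A model's ability to refine itself can reduce dependence on annotated data while improving performance.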
Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM’s own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model’s iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM’s perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM’s zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.
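The voting and reward steps described above can be sketched in a few lines. For one image pair, repeated comparisons by the model are reduced to a vote fraction and a majority pseudo-label, and a predicted preference probability is then scored against that pseudo-label. The fidelity formula used here is the one commonly used for pairwise ranking in the IQA literature. Treating it as the form of EvoQuality's fidelity reward is an assumption, and the GRPO update itself is not shown.

```python
import math

def majority_vote(votes):
    """votes: list of booleans, True meaning 'image A looks better than image B'.

    Returns (fraction of votes for A, pseudo-label), where the pseudo-label is
    1.0 if A wins the majority, 0.0 if B wins, and 0.5 on a tie.
    """
    fraction = sum(votes) / len(votes)
    if fraction > 0.5:
        return fraction, 1.0
    if fraction < 0.5:
        return fraction, 0.0
    return fraction, 0.5

def fidelity(p_pred, p_target):
    """Fidelity between two Bernoulli preferences: 1.0 when identical, 0.0 when opposite."""
    return (math.sqrt(p_pred * p_target)
            + math.sqrt((1.0 - p_pred) * (1.0 - p_target)))

# Usage: three sampled comparisons for one pair, then score two predictions.
fraction, pseudo_label = majority_vote([True, True, False])
weak_reward = fidelity(0.7, pseudo_label)
strong_reward = fidelity(0.9, pseudo_label)
```

A prediction that agrees more confidently with the consensus earns a higher reward, which is how the model's own votes can stand in for human labels.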
[76] EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks
Yuan Gao,Sangwook Kim,Chris McIntosh
Main category: cs.CV
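TL;DR: EchoingECG is a cross-modal model built on electrocardiograms (ECG) and echocardiograms (ECHO). Using a probabilistic contrastive framework and knowledge distillation, it predicts cardiac function measurements from ECHO tasks using ECG alone, substantially outperforming existing ECG models.
Details
Motivation: ECG is widely used to assess cardiac function because it is inexpensive and accessible, whereas the more complex ECHO is central to clinical assessment but resource-intensive. The study aims to predict ECHO-derived measurements from ECG to provide a more convenient alternative.
Contribution: 1) Proposes EchoingECG, a probabilistic student-teacher model that combines ECG embeddings with ECHO supervision; 2) uses PCME++ and ECHO-CLIP to distill ECHO knowledge into ECG representations; 3) improves the analysis of model performance through uncertainty estimation.
Method: The method combines the probabilistic contrastive framework PCME++ with the pretrained ECHO-CLIP model, transfers ECHO task knowledge into ECG representations through knowledge distillation, and uses uncertainty-aware techniques to improve prediction.
Result: EchoingECG outperforms existing ECG models in zero-shot, few-shot, and fine-tuning settings, and its uncertainty estimates help identify regions of uncertainty within ECGs.
Insight: ECG can serve as an alternative or complement for ECHO tasks, and uncertainty estimation gives deeper insight into model performance, supporting clinical decision-making.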
Abstract: Electrocardiogram (ECG) is a widely used tool for assessing cardiac function due to its low cost and accessibility. Emerging research shows that ECGs can help make predictions on key outcomes traditionally derived from more complex modalities such as echocardiograms (ECHO), enabling the use of ECGs as a more accessible method to predict broader measurements of cardiac function. ECHOs, in particular, matter here because they play a key role in clinical cardiac assessment yet require considerable hospital resources. To aid this use case, we introduce EchoingECG, a probabilistic student-teacher model that leverages uncertainty-aware ECG embeddings and ECHO supervision to improve ECG-based cardiac function prediction. Our approach integrates Probabilistic Cross-Modal Embeddings (PCME++), a probabilistic contrastive framework, with ECHO-CLIP, a vision-language pre-trained model trained on ECHO-text pairs, to distill ECHO knowledge into ECG representations. Through experiments and external validation, we showed that EchoingECG outperforms state-of-the-art foundation ECG models in zero-shot, few-shot, and fine-tuning settings for ECHO predictions based on ECG. We also highlighted that variance estimation (enabled through our method) enhanced our understanding of model performance by identifying underlying regions of uncertainty within ECGs. The code is available at https://github.com/mcintoshML/EchoingECG.
[77] Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
Haotian Xue,Yunhao Ge,Yu Zeng,Zhaoshuo Li,Ming-Yu Liu,Yongxin Chen,Jiaojiao Fan
Main category: cs.CV
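TL;DR: The paper proposes the Point-It-Out (PIO) benchmark, which systematically evaluates the embodied reasoning of VLMs through multi-stage visual grounding tasks (S1: object localization, S2: task-driven pointing, S3: visual trace prediction). Experiments find that general-purpose models such as GPT-4o underperform some open-source models in precise visual grounding, while models such as MoLMO do well in S1 and S2 but struggle in S3.
Details
Motivation: Existing benchmarks mainly evaluate the embodied reasoning of VLMs through multiple-choice questions based on image annotations and lack a systematic evaluation of precise visual grounding.
Contribution: Proposes the PIO benchmark, which evaluates the embodied reasoning of VLMs across three stages (S1 to S3), filling a gap in existing work.
Method: The authors design a three-stage evaluation protocol covering object localization, task-driven pointing, and visual trace prediction, with data collected from several key domains (indoor, kitchen, driving, and robotic manipulation).
Result: GPT-4o underperforms some open-source models in precise visual grounding; MoLMO performs well in S1 and S2 but struggles in S3, which requires visual trace planning.
Insight: Precise visual grounding challenges current VLMs, especially in complex scenarios that combine grounding with visual trace planning, where there is still room for improvement.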
Abstract: Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations (for example, selecting which trajectory better describes an event in the image). In this work, we introduce the Point-It-Out (PIO) benchmark, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from critical domains for embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform compared to some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.
[78] Adapting SAM with Dynamic Similarity Graphs for Few-Shot Parameter-Efficient Small Dense Object Detection: A Case Study of Chickpea Pods in Field Conditions
Xintong Jiang,Yixue Liu,Mohamed Debbagh,Yu Tian,Valerio Hoyos-Villegas,Viacheslav Adamchuk,Shangpeng Sun
Main category: cs.CV
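TL;DR: The study proposes a Dynamic Similarity-based Graph Adaptation (DSGA) module that efficiently adapts the Segment Anything Model (SAM) under extreme data scarcity, enabling precise segmentation of small dense objects (such as chickpea pods) in complex agricultural environments.
Details
Motivation: In agricultural computer vision, parameter-efficient fine-tuning (PEFT) of foundation models is hindered by limited training data and complex field conditions.
Contribution: 1. Proposes the DSGA module, which dynamically constructs a similarity graph and strengthens feature representation with a learnable weight ranking mechanism initialized by polynomial decay; 2. combines it with LoRA so that local and global dependencies are optimized in a complementary way.
Method: DSGA learns a dynamic similarity graph and adaptive local feature aggregation with only 4.00M trainable parameters, and is combined with LoRA to capture both global and local dependencies in the image embeddings.
Result: On the chickpea pod dataset, DSGA with LoRA performs best in the 2-, 4-, 8-, and 10-shot settings, improving Structure-measure by 17.31% and adaptive F-measure by 62.36% over baseline SAM fine-tuning, with gains growing as the shot count increases.
Insight: Combining a dynamic similarity graph with LoRA gives an efficient solution for few-shot agricultural object detection while preserving model stability and parameter efficiency.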
Abstract: Parameter-Efficient Fine-Tuning (PEFT) of foundation models for agricultural computer vision tasks remains challenging due to limited training data and complex field conditions. This study introduces a Dynamic Similarity-based Graph Adaptation (DSGA) module to adapt the Segment Anything Model (SAM) under extreme data constraints for precise foreground and instance segmentation of small dense objects in complex agricultural environments. Through dynamic similarity graph construction with a learnable polynomial decay-initialized weight ranking mechanism and adaptive local feature aggregation, DSGA establishes robust spatial and dynamic similarity representation with only 4.00M trainable parameters, which is 4.26% of the original SAM. Integrating this graph-based feature adaptation with Low-Rank Adaptation (LoRA) creates a complementary optimization framework that effectively captures both local and global dependencies in image embeddings while preserving model stability and parameter efficiency. Experimental results on a challenging chickpea pod dataset demonstrated that DSGA with LoRA achieved superior performance across multiple metrics evaluated under 2, 4, 8 and 10 shots, with progressive performance gains as shot count increased. Quantitative metrics showed a 17.31% improvement in Structure-measure and a 62.36% gain in adaptive F-measure compared to the baseline SAM fine-tuning. Comprehensive ablation studies and visualization analyses through Grad-CAM and t-SNE validated the framework’s effectiveness in feature discrimination. The proposed adaptation demonstrated practical utility for automated agricultural monitoring applications, achieving accurate pod-counting with an adjusted R-squared of 0.8987 for images with 10 to 120 pods under challenging field conditions.
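Two pieces of arithmetic behind the parameter-efficiency claims above can be checked quickly. First, if 4.00M trainable parameters are 4.26% of the original SAM, the implied model size is about 94M parameters. Second, a rank-r LoRA update on a d_out by d_in weight adds r * (d_in + d_out) parameters instead of d_in * d_out. The helper names, the 768-wide layer, and the rank of 4 are illustrative assumptions, not values taken from the paper.

```python
def implied_total_params(trainable, fraction):
    """Total parameter count implied by a trainable count and its share of the model."""
    return trainable / fraction

def lora_param_count(d_in, d_out, rank):
    """Parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# 4.00M trainable parameters reported as 4.26% of the original SAM.
sam_size = implied_total_params(4.00e6, 0.0426)

# Hypothetical 768 x 768 projection adapted at rank 4.
full_layer = 768 * 768
lora_layer = lora_param_count(768, 768, 4)
share = lora_layer / full_layer  # fraction of the layer that becomes trainable
```

At these assumed sizes the low-rank pair is roughly 1% of the dense layer, which shows why combining LoRA with a small adapter module keeps the overall trainable budget in the single-digit percent range.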
[79] Logo-VGR: Visual Grounded Reasoning for Open-world Logo Recognition
Zichen Liang,Jingjing Fei,Jie Wang,Zheming Yang,Changqing Li,Pei Wu,Minghui Qiu,Fei Yang,Xialei Liu
Main category: cs.CV
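TL;DR: Logo-VGR is a new approach to open-world logo recognition. It reformulates logo recognition as a comparison-based task and adds domain-specific multimodal reasoning, so that supervision on a small subset of brands generalizes to large-scale brand recognition.
Details
Motivation: Current multimodal large language models (MLLMs) are evaluated mainly on general-purpose benchmarks, and their use in domain-specific scenarios such as intelligent product moderation remains underexplored. Traditional logo recognition methods rely on memorizing a large number of brands, which does not suit open-world settings.
Contribution: 1) An open-world logo recognition benchmark; 2) a reformulation of logo recognition as a comparison-based task; 3) a domain-specific multimodal reasoning paradigm (Logo Perception Grounding and Logo-Guided Visual Grounded Reasoning).
Method: Logo-VGR injects domain knowledge through Logo Perception Grounding and strengthens the reasoning capability of the model through Logo-Guided Visual Grounded Reasoning, which counters the tendency of existing models to overfit to brand distributions.
Result: Experiments show that Logo-VGR outperforms strong baselines by nearly 10 points in OOD settings, demonstrating strong generalization.
Insight: Turning logo recognition from direct label generation into a comparison task, and combining it with domain-informed multimodal reasoning, is key to better open-world logo recognition.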
Abstract: Recent advances in multimodal large language models (MLLMs) have been primarily evaluated on general-purpose benchmarks, while their applications in domain-specific scenarios, such as intelligent product moderation, remain underexplored. To address this gap, we introduce an open-world logo recognition benchmark, a core challenge in product moderation. Unlike traditional logo recognition methods that rely on memorizing representations of tens of thousands of brands (an impractical approach in real-world settings), our proposed method, Logo-VGR, enables generalization to large-scale brand recognition with supervision from only a small subset of brands. Specifically, we reformulate logo recognition as a comparison-based task, requiring the model to match product images with candidate logos rather than directly generating brand labels. We further observe that existing models tend to overfit by memorizing brand distributions instead of learning robust multimodal reasoning, which results in poor performance on unseen brands. To overcome this limitation, Logo-VGR introduces a new paradigm of domain-specific multimodal reasoning: Logo Perception Grounding injects domain knowledge, and Logo-Guided Visual Grounded Reasoning enhances the model’s reasoning capability. Experimental results show that Logo-VGR outperforms strong baselines by nearly 10 points in OOD settings, demonstrating superior generalization.
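To picture why a comparison-based formulation can handle brands never seen in training, here is a deliberately simplified sketch: the candidate logos are supplied at query time, and the answer is whichever candidate best matches the product image. It uses cosine similarity over made-up embedding vectors purely for illustration. Logo-VGR itself relies on MLLM reasoning rather than this nearest-match rule, and `match_logo`, the brand names, and the vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_logo(product_vec, candidates):
    """Return the candidate brand whose logo embedding is closest to the product.

    `candidates` maps brand name -> logo embedding and is passed in at query
    time, so adding a new brand needs no retraining of a label classifier.
    """
    return max(candidates, key=lambda name: cosine(product_vec, candidates[name]))


# Hypothetical 2-D embeddings; real embeddings would come from a vision encoder.
candidate_logos = {"brand_a": [1.0, 0.0], "brand_b": [0.0, 1.0]}
best = match_logo([0.2, 0.9], candidate_logos)
```

The design point is that the label set lives in the input rather than in the model weights, which is what makes the open-world setting tractable.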
[80] VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda,Yuiga Wada,Shinnosuke Hirano,Seitaro Otsuki,Komei Sugiura
Main category: cs.CV
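TL;DR: VELA is an automatic evaluation metric, built within an LLM-Hybrid-as-a-Judge framework, for long image captions generated by multimodal large language models. Unlike traditional metrics designed for short captions, it targets long captions, and it is validated on the LongCap-Arena benchmark.
Details
Motivation: Existing automatic metrics for image captioning mainly target short captions, while the need to evaluate long captions is growing. Recent LLM-as-a-Judge approaches are also slow because they rely on autoregressive inference and early fusion of visual information.
Contribution: 1. VELA, an automatic evaluation metric suited to long captions; 2. the LLM-Hybrid-as-a-Judge framework; 3. the LongCap-Arena benchmark, comprising 7,805 images, reference and candidate captions, and 32,246 human judgments.
Method: VELA is built on an LLM-hybrid evaluation framework that combines multimodal information with an efficient inference mechanism, avoiding the efficiency problems of earlier LLM-as-a-Judge methods.
Result: VELA outperforms existing metrics on LongCap-Arena and achieves superhuman performance.
Insight: Evaluating long captions calls for purpose-built metrics and frameworks; an LLM-hybrid approach can improve both efficiency and performance.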
Abstract: In this study, we focus on the automatic evaluation of long and detailed image captions generated by multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive inference and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, the corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.
[81] Training-Free Reward-Guided Image Editing via Trajectory Optimal Control
Jinho Chang,Jaemin Kim,Jong Chul Ye
Main category: cs.CV
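TL;DR: The paper proposes a training-free, reward-guided image editing method that formulates editing as a trajectory optimal control problem, achieving a markedly better balance between reward gain and fidelity to the source image.
Details
Motivation: Reward-guided methods have mainly been used to steer image generation at inference time. How to preserve the semantic content of a source image while increasing a target reward in image editing is largely unexplored.
Contribution: A training-free, reward-guided image editing framework that treats the reverse process of a diffusion model as a controllable trajectory and steers editing by iteratively updating adjoint states.
Method: Editing is cast as trajectory optimal control: the reverse diffusion process serves as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to maximize the target reward.
Result: Across several editing tasks, the method significantly outperforms existing inversion-based training-free guidance baselines, striking a better balance between reward maximization and source-image fidelity.
Insight: Formalizing editing as an optimal control problem makes it possible to exploit the dynamics of the diffusion model for reward-guided editing that preserves semantic content.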
Abstract: Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, leveraging this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.
[82] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Xinyu Tian,Shu Zou,Zhaoyuan Yang,Mengqi He,Fabian Waschkowski,Lukas Wesemann,Peter Tu,Jing Zhang
Main category: cs.CV
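TL;DR: The paper studies the dual nature of reasoning in vision-language models (VLMs): reasoning strengthens logical inference but can also degrade perceptual grounding (visual forgetting). The authors propose VAPO, which significantly strengthens the reliance of the model on visual information and sets new state-of-the-art results on multiple benchmarks.
Details
Motivation: Reasoning in VLMs has a dual nature: it helps solve complex tasks, but prolonged reasoning can cause the model to increasingly disregard the visual input and impair perceptual grounding. The authors call this phenomenon visual forgetting and set out to address it.
Contribution: (1) Identifying the visual forgetting phenomenon; (2) proposing Vision-Anchored Policy Optimization (VAPO), which explicitly steers the reasoning process toward visual information; (3) developing VAPO-Thinker-7B, which reaches new SOTA results on multiple benchmarks.
Method: VAPO uses reinforcement learning (specifically, a modified version of Group Relative Policy Optimization) to explicitly steer the reasoning process toward the visual input and avoid visual forgetting. Training adds anchoring to visually grounded trajectories so that the model keeps relying on visual information.
Result: VAPO-Thinker-7B performs strongly on multiple benchmarks, significantly strengthening reliance on visual information while maintaining high performance on complex reasoning tasks.
Insight: The paper exposes a potential side effect of reasoning (visual forgetting) and offers a simple, effective remedy, providing a useful reference for balancing reasoning and perception in future VLMs.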
Abstract: Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model’s reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
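The abstract names GRPO as the typical RL recipe behind reasoning models. As background, the core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalised by the mean and standard deviation of its group. The sketch below shows only that standard step. It does not include the vision-anchoring term of VAPO, whose exact form the abstract does not give; the function name and the use of the population standard deviation are assumptions (implementations differ on that detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the other samples drawn for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    Population std is an assumption; eps guards against a zero-variance group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one visual question
# (1.0 = correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; correct answers in a mixed group receive positive advantages and incorrect ones negative.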
[83] MuSLR: Multimodal Symbolic Logical Reasoning
Jundong Xu,Hao Fei,Yuhui Zhang,Liangming Pan,Qijun Huang,Qian Liu,Preslav Nakov,Min-Yen Kan,William Yang Wang,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
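TL;DR: MuSLR is the first benchmark for multimodal symbolic logical reasoning, designed to evaluate the logical reasoning of current vision-language models. Mainstream models perform poorly: the best, GPT-4.1, reaches only 46.8%. The proposed LogiCAM framework substantially improves performance, and the error analysis highlights the importance of logical alignment between modalities.
Details
Motivation: In high-stakes domains such as autonomous driving and medical diagnosis, the rigorous, deterministic nature of multimodal symbolic logical reasoning is critical. How well current vision-language models handle this task is unclear, and no benchmark has existed to evaluate it.
Contribution: 1. MuSLR, the first benchmark for multimodal symbolic logical reasoning; 2. the finding that mainstream models perform poorly on this task; 3. the LogiCAM framework, which significantly improves model performance.
Method: 1. Build the MuSLR dataset, with 1,093 instances across 7 domains and varying reasoning depths; 2. propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs.
Result: Mainstream models perform poorly on MuSLR (GPT-4.1 reaches only 46.8%). LogiCAM boosts the Chain-of-Thought performance of GPT-4.1 by 14.13%, with even larger gains on complex logics.
Insight: Around 70% of errors stem from logical misalignment between modalities, underscoring the importance of logical consistency in multimodal fusion.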
Abstract: Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1’s Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
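To make "deducing new facts via formal logic" and "reasoning depth" concrete, the following toy sketch runs forward chaining over propositional rules and records how many rule applications each derived fact needed. It is a generic illustration, not the LogiCAM framework or the MuSLR data format; the fact names, the rule encoding, and `forward_chain` are hypothetical.

```python
def forward_chain(facts, rules):
    """Apply rules of the form (premises, conclusion) until nothing new follows.

    Returns a dict mapping each known fact to the depth of the derivation
    that first produced it (0 for the given facts). The depth is that of the
    derivation found, which is not guaranteed to be the shortest one.
    """
    depth = {fact: 0 for fact in facts}
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in depth and all(p in depth for p in premises):
                depth[conclusion] = 1 + max(depth[p] for p in premises)
                changed = True
    return depth


# Hypothetical facts: one read from an image, one from accompanying text.
given = {"traffic_light_is_red", "vehicle_is_moving"}
rule_set = [
    (("traffic_light_is_red",), "must_stop"),
    (("must_stop", "vehicle_is_moving"), "should_brake"),
]
derived = forward_chain(given, rule_set)
```

Here "should_brake" needs two chained rule applications, so it sits at depth 2; the benchmark reports depths from 2 to 9.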
[84] PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
Po-Han Huang,Jeng-Lin Li,Po-Hsuan Huang,Ming-Ching Chang,Wei-Chao Chen
Main category: cs.CV
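TL;DR: The paper proposes PatchEAD, a unified visual prompting framework focused on patch-level processing for industrial anomaly detection. It enables training-free anomaly detection and is compatible with a variety of foundation models.
Details
Motivation: Industrial anomaly detection increasingly relies on foundation models, but past work has concentrated on textual prompt tuning, leaving the visual processing steps fragmented and specific to each model. PatchEAD aims to close this gap by unifying visual prompting techniques.
Contribution: A unified visual prompting framework, PatchEAD, that is compatible with multiple foundation models and enables training-free anomaly detection.
Method: The framework includes an alignment module and foreground masking, and unifies visual prompting through patch-level processing to improve anomaly detection.
Result: Experiments show that PatchEAD outperforms prior work in few-shot and batch zero-shot settings without relying on textual features.
Insight: The study examines how backbone structure and pretraining characteristics affect the robustness of patch similarity, giving guidance for selecting and configuring foundation models in practice.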
Abstract: Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
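As a rough picture of what "patch-only, training-free" detection can mean, the sketch below scores each patch of a test image by its distance to the nearest patch from a few normal reference images, a generic nearest-neighbour scheme that is common in few-shot anomaly detection. It is not the PatchEAD pipeline: the alignment module and foreground masking are omitted, and the feature vectors and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def patch_anomaly_scores(query_patches, reference_patches):
    """Score each query patch as 1 - (best similarity to any normal patch)."""
    return [
        1.0 - max(cosine(q, r) for r in reference_patches)
        for q in query_patches
    ]


def image_anomaly_score(query_patches, reference_patches):
    """Image-level score: the most anomalous patch decides."""
    return max(patch_anomaly_scores(query_patches, reference_patches))


# Hypothetical 2-D patch features; real ones would come from a frozen backbone.
normal_patches = [[1.0, 0.0], [0.0, 1.0]]
test_patches = [[1.0, 0.0], [-1.0, 0.0]]
scores = patch_anomaly_scores(test_patches, normal_patches)
```

No parameters are fitted anywhere, which is why such schemes depend heavily on how robust the patch similarity of the backbone is.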
[85] LiDAR Point Cloud Colourisation Using Multi-Camera Fusion and Low-Light Image Enhancement
Pasindu Ranasinghe,Dibyayan Patra,Bikram Banerjee,Simit Raval
Main category: cs.CV
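TL;DR: The paper proposes a LiDAR point cloud colourisation method based on multi-camera fusion and low-light image enhancement that achieves real-time performance under low-light conditions.
Details
Motivation: Fusing LiDAR and camera data enhances spatial understanding, but performance under low light is a challenge. The study addresses this with a hardware-agnostic colourisation method.
Contribution: 1) A hardware-agnostic multi-camera fusion method; 2) an integrated low-light image enhancement module that improves colourisation in low light; 3) removal of the need for specialised calibration targets.
Method: The pipeline covers intrinsic camera calibration, automatic computation of the geometric transformation between the LiDAR and the cameras, colour correction across camera feeds, and integration of a low-light image enhancement module.
Result: Tested with a Velodyne Puck Hi-Res LiDAR and a four-camera configuration, the system achieved real-time performance and reliable colourisation under low light, recovering scene details.
Insight: The method shows the potential of fusing LiDAR and camera data under difficult lighting and offers a new approach to real-time, low-light point cloud colourisation.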
Abstract: In recent years, the fusion of camera data with LiDAR measurements has emerged as a powerful approach to enhance spatial understanding. This study introduces a novel, hardware-agnostic methodology that generates colourised point clouds from mechanical LiDAR using multiple camera inputs, providing complete 360-degree coverage. The primary innovation lies in its robustness under low-light conditions, achieved through the integration of a low-light image enhancement module within the fusion pipeline. The system requires initial calibration to determine intrinsic camera parameters, followed by automatic computation of the geometric transformation between the LiDAR and cameras, removing the need for specialised calibration targets and streamlining the setup. The data processing framework uses colour correction to ensure uniformity across camera feeds before fusion. The algorithm was tested using a Velodyne Puck Hi-Res LiDAR and a four-camera configuration. The optimised software achieved real-time performance and reliable colourisation even under very low illumination, successfully recovering scene details that would otherwise remain undetectable.
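Once the intrinsic parameters and the LiDAR-to-camera transformation are known, colourising a point amounts to projecting it into the image and reading the pixel there. The sketch below shows that step with a plain pinhole model and no lens distortion. It is an illustration under stated assumptions, not the pipeline of the paper: the function names, the single-camera setup, and the tiny synthetic image are hypothetical, and the colour correction and low-light enhancement stages are omitted.

```python
def project_point(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Map a 3-D LiDAR point to pixel coordinates (u, v) with a pinhole model.

    rotation (3x3) and translation (3,) take LiDAR coordinates to the camera
    frame. Returns None for points behind the camera.
    """
    x, y, z = (
        sum(rotation[i][j] * point_lidar[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def colourise(points, image, rotation, translation, fx, fy, cx, cy):
    """Attach an RGB colour to every point that lands inside the image."""
    height, width = len(image), len(image[0])
    coloured = []
    for point in points:
        pixel = project_point(point, rotation, translation, fx, fy, cx, cy)
        colour = None
        if pixel is not None:
            col, row = int(round(pixel[0])), int(round(pixel[1]))
            if 0 <= col < width and 0 <= row < height:
                colour = image[row][col]
        coloured.append((point, colour))
    return coloured


# Synthetic 3x3 image and identity extrinsics, purely for illustration.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tiny_image = [[(r, c, 0) for c in range(3)] for r in range(3)]
result = colourise([(0.0, 0.0, 2.0), (0.0, 0.0, -1.0)], tiny_image,
                   identity, [0.0, 0.0, 0.0], 1.0, 1.0, 1.0, 1.0)
```

With several cameras, the same projection would be repeated per camera, which is where consistent colour across feeds becomes important.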
[86] MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification
Junjie Zhou,Wei Shao,Yagao Yue,Wei Mu,Peng Wan,Qi Zhu,Daoqiang Zhang
Main category: cs.CV
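TL;DR: MAPLE is a multi-scale attribute-enhanced prompt learning method for few-shot whole slide image (WSI) classification. By combining multi-scale visual semantics with hierarchical prediction at the entity and slide levels, it improves the accuracy of cancer diagnosis.
Details
Motivation: Existing methods rely mainly on slide-level prompts and fail to capture the subtype-specific phenotypic variations of histological entities (e.g., nuclei, glands) that are critical for cancer diagnosis. MAPLE aims to fill this gap and improve WSI classification through multi-scale attribute-enhanced prompt learning.
Contribution: 1) MAPLE, a hierarchical framework that integrates multi-scale visual semantics and predicts at both the entity and slide levels; 2) use of large language models to generate entity-level and slide-level prompts; 3) an entity-guided cross-attention module and a cross-scale entity graph learning module that refine entity representations.
Method: 1) Use an LLM to generate multi-scale entity-level prompts and global slide-level prompts; 2) generate entity-level features with the entity-guided cross-attention module and align them with their attributes; 3) update entity representations with the cross-scale entity graph learning module; 4) combine entity-level and slide-level predictions into the final output.
Result: Experiments on three cancer cohorts confirm the effectiveness of MAPLE on few-shot pathology diagnosis tasks.
Insight: Multi-scale prompt learning with hierarchical prediction captures key phenotypic variations better and offers a new solution for few-shot WSI classification.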
Abstract: Prompt learning has emerged as a promising paradigm for adapting pre-trained vision-language models (VLMs) to few-shot whole slide image (WSI) classification by aligning visual features with textual representations, thereby reducing annotation cost and enhancing model generalization. Nevertheless, existing methods typically rely on slide-level prompts and fail to capture the subtype-specific phenotypic variations of histological entities (e.g., nuclei, glands) that are critical for cancer diagnosis. To address this gap, we propose Multi-scale Attribute-enhanced Prompt Learning (MAPLE), a hierarchical framework for few-shot WSI classification that jointly integrates multi-scale visual semantics and performs prediction at both the entity and slide levels. Specifically, we first leverage large language models (LLMs) to generate entity-level prompts that can help identify multi-scale histological entities and their phenotypic attributes, as well as slide-level prompts to capture global visual descriptions. Then, an entity-guided cross-attention module is proposed to generate entity-level features, followed by aligning with their corresponding subtype-specific attributes for fine-grained entity-level prediction. To enrich entity representations, we further develop a cross-scale entity graph learning module that can update these representations by capturing their semantic correlations within and across scales. The refined representations are then aggregated into a slide-level representation and aligned with the corresponding prompts for slide-level prediction. Finally, we combine both entity-level and slide-level outputs to produce the final prediction results. Results on three cancer cohorts confirm the effectiveness of our approach in addressing few-shot pathology diagnosis tasks.
[87] DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
Chi Zhang,Haibo Qiu,Qiming Zhang,Zhixiong Zeng,Lin Ma,Jing Zhang
Main category: cs.CV
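TL;DR: DeepSketcher proposes a new approach to multimodal reasoning that generates “visual thoughts” directly in the visual embedding space, avoiding reliance on external tools.
Details
Motivation: Conventional vision-language models (VLMs) reason mainly in text, which limits the depth of their image understanding and multimodal reasoning. DeepSketcher aims to advance image-interactive reasoning (“thinking with images”) and make it more flexible, without invoking external tools.
Contribution: 1) A dataset of 31k chain-of-thought (CoT) reasoning trajectories; 2) a self-contained model that operates directly in the visual embedding space, enabling tool-free “thinking with images”.
Method: The model is trained on the image-text interleaved reasoning dataset and generates intermediate visual representations directly in the visual embedding space, without external tools.
Result: The model performs strongly on multimodal reasoning benchmarks, validating both the dataset and the model design.
Insight: Operating directly in the visual embedding space avoids repeatedly re-encoding generated images, improving reasoning efficiency while broadening the application scenarios of multimodal reasoning.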
Abstract: The “thinking with images” paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning. As an emerging paradigm, however, it still leaves substantial room for exploration in data construction accuracy, structural design, and broader application scenarios, which offer rich opportunities for advancing multimodal reasoning. To further advance this line of work, we present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model. The dataset contains 31k chain-of-thought (CoT) reasoning trajectories with diverse tool calls and resulting edited images, covering a wide range of data types and manipulation instructions with high annotation accuracy. Building on this resource, we design a model that performs interleaved image-text reasoning and natively generates “visual thoughts” by operating directly in the visual embedding space, rather than invoking external tools and repeatedly re-encoding generated images. This design enables tool-free and more flexible “thinking with images”. Extensive experiments on multimodal reasoning benchmarks demonstrate strong performance, validating both the utility of the dataset and the effectiveness of the model design.
[88] A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
Arvind Murari Vepa,Yannan Yu,Jingru Gan,Anthony Cuturrufo,Weikai Li,Wei Wang,Fabien Scalzo,Yizhou Sun
Main category: cs.CV
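TL;DR: The paper proposes mpLLM, a multimodal LLM approach for visual question answering over multi-parametric 3D brain MRI. A hierarchical mixture-of-experts architecture and a synthetic VQA protocol address the shortage of paired data; the approach is clinically validated and outperforms baseline methods.
Details
Motivation: Visual question answering on 3D brain multi-parametric MRI suffers from limited image-text paired data, which calls for a method that fuses multiple modalities efficiently without image-report pretraining.
Contribution: 1. The first clinically validated VQA dataset for 3D brain mpMRI; 2. mpLLM, a novel multimodal LLM architecture that fuses multiple 3D modalities; 3. empirical results that demonstrate medical utility, with performance above the baselines.
Method: mpLLM combines a hierarchical mixture-of-experts (MoE) with prompt-conditioned routing, and uses a synthetic VQA protocol to generate medically relevant question-answer data from segmentation annotations.
Result: mpLLM outperforms baseline methods by 5.3% on average across multiple mpMRI datasets.
Insight: The hierarchical design with modality-level and token-level experts is important for multimodal fusion, and the ablations also highlight the importance of prompt-conditioned routing.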
Abstract: We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) architecture for visual question answering over multi-parametric 3D brain MRI (mpMRI). mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities, enabling efficient training without image–report pretraining. To address limited image-text paired supervision, mpLLM integrates a synthetic visual question answering (VQA) protocol that generates medically relevant VQA from segmentation annotations, and we collaborate with medical experts for clinical validation. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets. Our study features three main contributions: (1) the first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong empirical results that demonstrate the medical utility of our methodology. Ablations highlight the importance of modality-level and token-level experts and prompt-conditioned routing. We have included our source code in the supplementary materials and will release our dataset upon publication.
[89] LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Guolei Huang,Qingzhi Peng,Gan Xu,Yuxuan Lu,Yongjun Shen
Main category: cs.CV
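TL;DR: The paper proposes LLaVAShield, a method for safeguarding multimodal multi-turn dialogues. It builds the MMDS dataset with an MCTS-based automated framework that generates unsafe dialogue samples, and develops a tool that detects and assesses risk under dynamic policy configurations.
Details
Motivation: As multimodal language models move into interactive, multi-turn use, traditional single-turn or single-modality moderation cannot handle safety risks that span turns and modalities, so a new solution is needed.
Contribution: 1. The first systematic definition of multimodal multi-turn dialogue safety; 2. the MMDS dataset; 3. an automated red-teaming framework based on MCTS; 4. LLaVAShield, which significantly outperforms baseline methods.
Method: Monte Carlo Tree Search (MCTS) is used to generate unsafe multimodal multi-turn dialogues for the MMDS dataset, and LLaVAShield jointly detects and assesses risk in user inputs and assistant responses.
Result: LLaVAShield performs strongly on multimodal content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results.
Insight: Safety in multimodal multi-turn dialogue requires joint analysis across turns and modalities, and automatically generating unsafe samples can improve the robustness of content moderation.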
Abstract: As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.
[90] VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Peng Liu,Haozhan Shen,Chunxin Fang,Zhicheng Sun,Jiajia Liao,Tiancheng Zhao
Main category: cs.CV
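TL;DR: VLM-FO1 is a novel framework that reframes object-centric perception from a coordinate generation problem into a feature retrieval task, addressing the limits of VLMs on fine-grained perception. It generates rich region features with a dual vision encoder and lets the LLM reason over them, achieving strong performance.
Details
Motivation: Existing VLMs excel at high-level scene understanding but cannot handle fine-grained perception tasks that require precise localization, mainly because language-centric architectures are poor at generating exact coordinates.
Contribution: 1. The VLM-FO1 framework, which redefines object perception as feature retrieval; 2. the Hybrid Fine-grained Region Encoder (HFRE), which produces region features rich in semantic and spatial detail; 3. a two-stage training strategy that improves perception without compromising the general visual understanding of the base model.
Method: 1. Generate region features with HFRE (a dual vision encoder); 2. have the LLM reason over these features; 3. train with a two-stage strategy.
Result: VLM-FO1 reaches SOTA on multiple benchmarks, showing strong capabilities in object grounding, region generational understanding, and visual region reasoning.
Insight: Replacing coordinate generation with feature retrieval addresses the limits of VLMs on fine-grained perception more effectively while preserving their high-level reasoning ability.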
Abstract: Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model’s general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
[91] The Impact of Scaling Training Data on Adversarial Robustness
Marco Zimmerli,Andreas Plesner,Till Aczel,Roger Wattenhofer
Main category: cs.CV
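TL;DR: The paper studies how the scale and characteristics of training data affect adversarial robustness. Robustness follows a logarithmic scaling law in data volume and model size, but data quality, architecture, and training objectives matter more than raw scale.
Details
Motivation: Despite advances in architectures and training paradigms, deep neural networks remain vulnerable to adversarial examples. The paper explores how training data characteristics, such as scale and diversity, affect adversarial robustness.
Contribution: 1. A logarithmic scaling law relating data volume and model size to adversarial robustness; 2. evidence that data quality, architecture, and training objectives are more decisive than raw scale; 3. the finding that some self-supervised models trained on curated datasets outperform models trained on much larger but less curated datasets.
Method: The authors evaluate 36 state-of-the-art vision models (covering supervised, self-supervised, and contrastive learning) under six black-box attack categories and analyse the effect of data scale and model scale.
Result: A tenfold increase in data reduces attack success rate (ASR) by about 3.2% on average; a tenfold increase in model size reduces it by about 13.4% on average. Self-supervised models trained on curated datasets, such as DINOv2, stand out.
Insight: Adversarial robustness depends not only on data scale; data quality and model design are just as important. A significant gap between human and machine vision persists in adversarial settings.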
Abstract: Deep neural networks remain vulnerable to adversarial examples despite advances in architectures and training paradigms. We investigate how training data characteristics affect adversarial robustness across 36 state-of-the-art vision models spanning supervised, self-supervised, and contrastive learning approaches, trained on datasets from 1.2M to 22B images. Models were evaluated under six black-box attack categories: random perturbations, two types of geometric masks, COCO object manipulations, ImageNet-C corruptions, and ImageNet-R style shifts. Robustness follows a logarithmic scaling law with both data volume and model size: a tenfold increase in data reduces attack success rate (ASR) on average by ~3.2%, whereas a tenfold increase in model size reduces ASR on average by ~13.4%. Notably, some self-supervised models trained on curated datasets, such as DINOv2, outperform others trained on much larger but less curated datasets, challenging the assumption that scale alone drives robustness. Adversarial fine-tuning of ResNet50s improves generalization across structural variations but not across color distributions. Human evaluation reveals persistent gaps between human and machine vision. These results show that while scaling improves robustness, data quality, architecture, and training objectives play a more decisive role than raw scale in achieving broad-spectrum adversarial resilience.
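A logarithmic scaling law means each tenfold increase in scale buys a fixed reduction in attack success rate. The sketch below turns the two averages quoted in the abstract into that form. It assumes the reductions are in percentage points and that the trend is a straight line in log10(scale); both are assumptions, since the abstract gives only the per-decade averages. The baseline ASR of 50 is hypothetical.

```python
import math


def predicted_asr(base_asr, base_scale, new_scale, drop_per_decade):
    """Extrapolate ASR under ASR(N) = ASR(N0) - d * log10(N / N0).

    drop_per_decade (d) is the reduction for every tenfold increase in scale.
    """
    return base_asr - drop_per_decade * math.log10(new_scale / base_scale)


DATA_DROP = 3.2    # per tenfold increase in training data (from the abstract)
MODEL_DROP = 13.4  # per tenfold increase in model size (from the abstract)

# Hypothetical baseline: ASR of 50 for a model trained on 1.2M images.
asr_10x_data = predicted_asr(50.0, 1.2e6, 1.2e7, DATA_DROP)
asr_full_range = predicted_asr(50.0, 1.2e6, 22e9, DATA_DROP)
decades_of_data = math.log10(22e9 / 1.2e6)
```

The 1.2M to 22B image range spans a little over four decades, so under these assumptions the whole data range accounts for roughly the same reduction as a single tenfold increase in model size.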
[92] UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Yuan Zhao,Youwei Pang,Lihe Zhang,Hanqi Liu,Jiaming Zuo,Huchuan Lu,Xiaoqi Zhao
Main category: cs.CV
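TL;DR: UniMMAD is a unified framework for multi-modal and multi-class anomaly detection. Its MoE-driven feature decompression mechanism enables adaptive reconstruction and significantly improves both performance and efficiency.
Details
Motivation: Existing anomaly detection methods often treat modality and class as independent factors, which leads to fragmented solutions and excessive memory overhead. UniMMAD aims to solve these problems with a single unified framework.
Contribution: The UniMMAD framework, which introduces an MoE-driven feature decompression mechanism to handle multi-modal and multi-class anomaly detection in a unified way, with gains in both efficiency and performance.
Method: The framework follows a “general to specific” paradigm. In the encoding stage, a feature compression module processes the multi-modal inputs; in the decoding stage, a sparsely-gated cross MoE dynamically selects expert pathways.
Result: UniMMAD achieves SOTA on 9 anomaly detection datasets spanning 3 fields, 12 modalities, and 66 classes, while reducing parameter usage by 75%.
Insight: The MoE-driven decompression mechanism effectively addresses the domain interference and distorted normality boundaries of conventional methods and offers a new approach to multi-modal anomaly detection.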
Abstract: Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a “general to specific” paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
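For readers unfamiliar with sparse gating, the sketch below shows the generic top-k mechanism: only the k experts with the highest gate scores run, and their outputs are mixed with softmax weights computed over those k scores. It is a textbook-style illustration, not the cross MoE, grouped dynamic filtering, or MoE-in-MoE structure of UniMMAD. The gate logits are passed in directly here, whereas the abstract says routing depends on input modality and class; all names are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax-normalise them.

    Returns {expert_index: weight}; every other expert is skipped entirely.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the selected expert outputs; unselected experts never run."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Three toy scalar experts; the third has a low gate score and is not evaluated.
toy_experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
output = moe_forward(2.0, toy_experts, [0.0, 0.0, -5.0], k=2)
```

Sparse activation is what lets total capacity grow with the number of experts while the per-input compute stays roughly constant.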
[93] Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation
Longzhen Yang,Zhangkai Ni,Ying Wen,Yihang Liu,Lianghua He,Heng Tao Shen
Main category: cs.CV
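TL;DR: The paper proposes Self-Supervised Anatomical Consistency Learning (SS-ACL), a framework that needs no expert annotations and uses a hierarchical anatomical graph and aligned embeddings to improve the accuracy and interpretability of medical report generation.
Details
Motivation: Existing methods rely on detection modules trained with expert annotations, which are costly and limit generalization. The authors address these problems with a self-supervised approach.
Contribution: 1. The SS-ACL framework, which requires no expert annotations; 2. region-level contrastive learning based on anatomical consistency; 3. alignment of generated reports with visual evidence.
Method: 1. Build a hierarchical anatomical graph; 2. recursively reconstruct anatomical regions to align spatial attention; 3. introduce region-level contrastive learning to strengthen semantic alignment.
Result: SS-ACL outperforms SOTA methods by 10% in lexical accuracy and 25% in clinical efficacy, and leads by 8% on zero-shot visual grounding.
Insight: The spatial and semantic consistency of anatomical structure can effectively guide model learning and reduce the dependence on expert annotations.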
Abstract: Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL) – a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence. Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports – outperforming state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8% in zero-shot visual grounding.
[94] A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments
Espen Uri Høgstedt,Christian Schellewald,Annette Stahl,Rudolf Mester
Main category: cs.CV
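TL;DR: The paper proposes a multi-purpose tracking framework for salmon welfare monitoring. A pose estimation network extracts bounding boxes around each fish and its body parts, and the body-part information is used to handle tracking challenges in underwater scenes; the resulting tracks are then used to compute welfare indicators.
Details
Motivation: Existing computer vision methods focus on single indicators and rely on object detectors and trackers from other application areas. This makes them resource-hungry and vulnerable to occlusion, similar appearance, and similar motion in underwater scenes.
Contribution: 1. A flexible tracking framework that combines pose estimation with body-part information to handle the challenges of underwater salmon tracking; 2. two new datasets for evaluating tracking performance; 3. a demonstration that the method is suited to tail beat analysis.
Method: A pose estimation network extracts bounding boxes around salmon and their body parts, specialized modules handle the challenges specific to underwater scenes, and the detailed body-part tracks are then used to compute welfare indicators.
Result: The method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, on both tracking challenges (ID transfers in crowded scenes and ID switches during turning), and is shown to be suited to automated monitoring based on tail beat wavelength.
Insight: Combining pose estimation with body-part information can markedly improve tracking accuracy in complex underwater scenes and provides an efficient basis for monitoring multiple welfare indicators.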
Abstract: Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.
[95] VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Abdelilah Aitrouga,Youssef Hmamouche,Amal El Fallah Seghrouchni
Main category: cs.CV
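TL;DR: VRWKV-Editor is a video editing method built on a linear spatio-temporal aggregation module. It avoids the quadratic computational complexity of conventional attention and substantially improves editing efficiency for long and high-resolution videos.
Details
Motivation: Current Transformer-based video editing methods scale poorly to long and high-resolution videos because attention has quadratic computational complexity, which limits applications such as real-time video processing.
Contribution: VRWKV-Editor, which uses the bidirectional weighted key-value recurrence mechanism of the RWKV Transformer to model spatio-temporal dependencies with linear complexity, reducing compute and memory costs.
Method: A linear spatio-temporal aggregation module is integrated into a video diffusion model. The RWKV Transformer's bidirectional weighted key-value recurrence captures global dependencies while preserving temporal coherence.
Result: Experiments show up to a 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, with competitive frame consistency and text alignment.
Insight: The advantage of linear-complexity methods grows with video length, making them a viable option for real-time video processing.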
Abstract: In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
[96] Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
Nicola Messina,Rosario Leonardi,Luca Ciampi,Fabio Carrara,Giovanni Maria Farinella,Fabrizio Falchi,Antonino Furnari
Main category: cs.CV
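TL;DR: This paper proposes learning to segment in-hand objects in egocentric images using weak supervision from human narrations, avoiding costly pixel-level annotation.
Details
Motivation: Segmenting in-hand objects from an egocentric viewpoint matters for assistive technologies, industrial safety, and activity monitoring, but existing approaches depend on costly annotated data, which limits progress.
Contribution: 1) The Narration-Supervised in-Hand Object Segmentation (NS-iHOS) task; 2) WISH, a model that learns from natural-language narrations as weak supervision and does not use narrations at inference time.
Method: WISH is an end-to-end model that distills information from narrations to learn hand-object associations, enabling weakly supervised segmentation.
Result: On EPIC-Kitchens and Ego4D, WISH outperforms all baselines and recovers more than 50% of the performance of fully supervised methods.
Insight: Natural-language narrations are an effective weak supervision signal that reduces reliance on annotated data for egocentric segmentation.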
Abstract: Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations – natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., “I am pouring vegetables from the chopping board to the pan”). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.
[97] AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment
Hanwei Zhu,Yu Tian,Keyan Ding,Baoliang Chen,Bolin Chen,Shiqi Wang,Weisi Lin
Main category: cs.CV
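TL;DR: AgenticIQA is a modular agentic framework that dynamically combines vision-language models with traditional IQA tools, addressing the limited adaptability, query handling, and interpretability of conventional methods. It decomposes the task and coordinates planning, execution, and summarization to produce more accurate scores and explanations.
Details
Motivation: Conventional IQA methods rely on fixed models and adapt poorly to diverse distortions, user-specific queries, and interpretability needs. Scoring and interpretation are also usually handled separately. The paper proposes a dynamic, query-aware agentic framework to address these problems.
Contribution: The AgenticIQA framework, which integrates VLMs with traditional IQA tools; a decomposition into four subtasks (distortion detection, distortion analysis, tool selection, and tool execution); the AgenticIQA-200K dataset and the AgenticIQA-Eval benchmark.
Method: The framework consists of a planner, an executor, and a summarizer. The planner formulates task-specific strategies, the executor collects perceptual evidence through tool invocation, and the summarizer integrates the evidence into scores and explanations.
Result: Experiments show that AgenticIQA surpasses strong baselines in both scoring accuracy and explanatory alignment.
Insight: Dynamic planning and task decomposition improve the interpretability and adaptability of IQA, and VLMs and traditional tools complement each other.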
Abstract: Image quality assessment (IQA) is inherently complex, as it reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. Conventional approaches typically rely on fixed models to output scalar scores, limiting their adaptability to diverse distortions, user-specific queries, and interpretability needs. Furthermore, scoring and interpretation are often treated as independent processes, despite their interdependence: interpretation identifies perceptual degradations, while scoring abstracts them into a compact metric. To address these limitations, we propose AgenticIQA, a modular agentic framework that integrates vision-language models (VLMs) with traditional IQA tools in a dynamic, query-aware manner. AgenticIQA decomposes IQA into four subtasks – distortion detection, distortion analysis, tool selection, and tool execution – coordinated by a planner, executor, and summarizer. The planner formulates task-specific strategies, the executor collects perceptual evidence via tool invocation, and the summarizer integrates this evidence to produce accurate scores with human-aligned explanations. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents. Extensive experiments across diverse IQA datasets demonstrate that AgenticIQA consistently surpasses strong baselines in both scoring accuracy and explanatory alignment.
[98] PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion
Zhiwei Zhang,Ruikai Xu,Weijian Zhang,Zhizhong Zhang,Xin Tan,Jingyu Gong,Yuan Xie,Lizhuang Ma
Main category: cs.CV
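TL;DR: PFDepth is the first joint pinhole-fisheye framework for heterogeneous multi-view depth estimation. It jointly optimizes over the complementary characteristics of the two camera types using distortion-aware Gaussian-splatted volumetric fusion.
Details
Motivation: Pinhole and fisheye images have complementary characteristics (undistorted vs. distorted, small vs. large FOV, far vs. near field) that can be exploited to improve depth estimation.
Contribution: A unified architecture that processes arbitrary combinations of pinhole and fisheye cameras, with finer feature fusion through Heterogeneous Spatial Fusion and a 3D Gaussian representation.
Method: PFDepth lifts 2D features into a canonical 3D volumetric space, fuses them with a Heterogeneous Spatial Fusion module, and aggregates features with a 3D Gaussian representation that adapts dynamically to local image textures.
Result: On the KITTI-360 and RealHet datasets, PFDepth achieves state-of-the-art performance over current mainstream depth networks.
Insight: Jointly optimizing heterogeneous sensors such as pinhole and fisheye cameras can improve the robustness and accuracy of depth estimation, especially in complex scenes.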
Abstract: In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.
[99] New Fourth-Order Grayscale Indicator-Based Telegraph Diffusion Model for Image Despeckling
Rajendra K. Ray,Manish Kumar
Main category: cs.CV
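TL;DR: The paper proposes a fourth-order nonlinear PDE model that combines diffusion and wave properties to remove multiplicative noise from images. It reduces the blocky artifacts of conventional second-order methods and performs better on both grayscale and color images.
Details
Motivation: Second-order PDE models tend to introduce blocky artifacts in the early stages of removing multiplicative noise. The authors propose a higher-order model to address this.
Contribution: A fourth-order nonlinear PDE model that integrates diffusion and wave properties, improving denoising while preserving more detail.
Method: The model combines a diffusion process guided by both the Laplacian and intensity values with a wave component that preserves fine details and textures. For color images, each channel is processed independently.
Result: On PSNR, MSSIM, and SI, the new model outperforms the two second-order anisotropic diffusion methods it is compared against.
Insight: Higher-order PDE models may be more effective for denoising, particularly where noise removal must be balanced against detail preservation.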
Abstract: Second-order PDE models have been widely used for suppressing multiplicative noise, but they often introduce blocky artifacts in the early stages of denoising. To resolve this, we propose a fourth-order nonlinear PDE model that integrates diffusion and wave properties. The diffusion process, guided by both the Laplacian and intensity values, reduces noise better than gradient-based methods, while the wave part keeps fine details and textures. The effectiveness of the proposed model is evaluated against two second-order anisotropic diffusion approaches using the Peak Signal-to-Noise Ratio (PSNR) and Mean Structural Similarity Index (MSSIM) for images with available ground truth. For SAR images, where a noise-free reference is unavailable, the Speckle Index (SI) is used to measure noise reduction. Additionally, we extend the proposed model to color images by applying the denoising process independently to each channel, preserving both structure and color consistency. The same quantitative metrics, PSNR and MSSIM, are used for performance evaluation, ensuring a fair comparison across grayscale and color images. In all cases, the proposed model produces better results than existing models of this kind.
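The evaluation metrics named in this abstract can be illustrated with a small NumPy sketch. PSNR follows its standard definition. The speckle index shown here is a simplified global form (standard deviation over mean of a region), which is an assumption: the abstract does not give the paper's exact definition, and windowed variants are also common. The function names are illustrative.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a clean reference and an estimate."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

def speckle_index(region):
    """Simplified speckle index: std / mean over a region (assumes a positive mean).

    Lower values indicate less residual speckle. This global form is an
    assumption, not necessarily the definition used in the paper.
    """
    region = np.asarray(region, dtype=float)
    return float(region.std() / region.mean())
```

PSNR needs a noise-free reference, which is why a reference-free measure such as SI is used for the SAR images.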
[100] SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval
Yuqi Xiao,Yingying Zhu
Main category: cs.CV
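TL;DR: SETR is a two-stage semantic-enhanced framework for zero-shot composed image retrieval (ZS-CIR). It addresses limitations of CLIP-based methods through coarse retrieval followed by fine-grained re-ranking, substantially improving retrieval performance.
Details
Motivation: Existing CLIP-based methods for composed image retrieval have two problems: union-based feature fusion carries over irrelevant background information, and global cosine similarity cannot capture fine-grained semantic relations. SETR aims to resolve both.
Contribution: A two-stage method: (1) a coarse retrieval stage that uses an intersection-driven strategy to filter out irrelevant information; (2) a re-ranking stage that uses a multimodal LLM to judge semantic relevance, compensating for CLIP's shortcomings.
Method: Coarse retrieval keeps only the overlapping semantics between the reference image and the relative text to produce a high-precision candidate set. Re-ranking uses a multimodal LLM adapted with Low-Rank Adaptation to make binary ("Yes/No") semantic relevance judgments.
Result: On CIRR, Fashion-IQ, and CIRCO, SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points.
Insight: Two-stage reasoning (coarse retrieval plus fine-grained re-ranking) is a general paradigm for ZS-CIR. Combining the intersection-driven strategy with a multimodal LLM improves retrieval robustness and fine-grained semantic understanding.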
Abstract: Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary semantic relevance judgments (“Yes/No”), which goes beyond CLIP’s global feature matching by explicitly verifying relational and attribute-level consistency. Together, these two stages form a complementary pipeline: coarse retrieval narrows the candidate pool with high recall, while re-ranking ensures precise alignment with nuanced textual modifications. Experiments on CIRR, Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points. Our results establish two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
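The coarse-then-rerank structure described in this abstract can be sketched generically. The code below is a minimal illustration, not SETR itself: it assumes the query embedding has already been built (the intersection-driven construction is not reproduced), and `yes_probability` is a hypothetical stand-in for the multimodal LLM's "Yes" score for a candidate.

```python
import numpy as np

def coarse_retrieve(query_emb, gallery_embs, k):
    """Stage 1: return indices of the top-k gallery items by cosine similarity."""
    q = np.asarray(query_emb, dtype=float)
    g = np.asarray(gallery_embs, dtype=float)
    q = q / np.linalg.norm(q)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:k].tolist()

def rerank(candidates, yes_probability):
    """Stage 2: reorder candidates by a relevance scorer (hypothetical LLM 'Yes' probability)."""
    return sorted(candidates, key=yes_probability, reverse=True)

# Toy usage: three gallery embeddings, a query, and made-up re-ranking scores.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
top2 = coarse_retrieve([1.0, 0.0], gallery, k=2)
scores = {0: 0.2, 1: 0.9}  # illustrative values only
final = rerank(top2, lambda idx: scores[idx])
```

The first stage is cheap and runs over the whole gallery; the expensive scorer only sees the k survivors, which is the usual reason for splitting retrieval this way.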
[101] GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data
Lubian Bai,Xiuyuan Zhang,Siqi Zhang,Zepeng Zhang,Haoyu Wang,Wei Qin,Shihong Du
Main category: cs.CV
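TL;DR: GeoLink is a multimodal framework that uses OpenStreetMap (OSM) data to enhance remote sensing (RS) foundation models, improving geospatial intelligence through cross-modal synergy in both pretraining and downstream tasks.
Details
Motivation: Current RS foundation models rely mainly on imagery, and the modality gap between RS and geospatial data such as OSM makes effective integration difficult. GeoLink aims to fill this gap and achieve multimodal synergy between RS and geospatial data.
Contribution: 1) The GeoLink framework, which integrates OSM data into RS FMs; 2) enhancement of pretraining and downstream tasks through cross-modal spatial correlations; 3) image mask-reconstruction for efficient pretraining.
Method: GeoLink derives multi-granularity learning signals from OSM data, with information interaction guided by cross-modal spatial correlations, and uses image mask-reconstruction to enable sparse input for efficient pretraining.
Result: Experiments show that incorporating OSM data during pretraining improves the RS image encoder, and that fusing RS and OSM data in downstream tasks improves the model's adaptability to complex geographic scenarios.
Insight: Spatial correlation is a key factor in effective multimodal geospatial data integration.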
Abstract: Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, the modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RS FMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance RS FMs during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image mask-reconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM’s adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence. Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration. Code, checkpoints, and usage examples are released at https://github.com/bailubin/GeoLink_NeurIPS2025
[102] PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
Shian Du,Menghan Xia,Chang Liu,Xintao Wang,Jing Wang,Pengfei Wan,Di Zhang,Xiangyang Ji
Main category: cs.CV
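TL;DR: PatchVSR is a patch-wise video super-resolution method based on video diffusion priors. A dual-stream adapter and patch location injection address the limitations of pre-trained models in patch-level detail generation, enabling efficient 4K super-resolution.
Details
Motivation: Existing video super-resolution methods usually process the full frame, which incurs heavy computation and fixes the output resolution. PatchVSR explores patch-level generation with video diffusion priors to improve efficiency and flexibility.
Contribution: 1. The first use of video diffusion priors for patch-wise video super-resolution; 2. a dual-stream adapter and location information injection to improve generation quality; 3. a multi-patch joint modulation module to ensure visual consistency.
Method: 1. A dual-stream adapter (patch branch and global branch) extracts local and global features respectively; 2. patch location information is injected to strengthen context; 3. multi-patch joint modulation keeps patches consistent.
Result: Experiments show that PatchVSR achieves 4K super-resolution efficiently from a 512x512 base model, generating high-fidelity details.
Insight: The patch-based paradigm is flexible and efficient, and is particularly suited to high-resolution scenarios.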
Abstract: Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessary intensive full-attention computation and fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not native for patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity while the global branch extracts context features from the resized full video to bridge the generation gap caused by incomplete semantics of patches. Particularly, we also inject the patch’s location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512x512 resolution base model, with extremely high efficiency.
[103] Causally Guided Gaussian Perturbations for Out-Of-Distribution Generalization in Medical Imaging
Haoran Pei,Yuguang Yang,Kexin Liu,Baochang Zhang
Main category: cs.CV
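TL;DR: This paper proposes CGP, a lightweight framework that improves out-of-distribution generalization in medical imaging through causally guided Gaussian perturbations.
Details
Motivation: In domains such as medical imaging, distribution shifts are common and subtle, and existing methods may overlook the causal mechanisms underlying generalization.
Contribution: Causally-Guided Gaussian Perturbations (CGP), which uses soft causal masks derived from Vision Transformers to guide noise injection, encouraging the model to rely on causally relevant features.
Method: Input images are perturbed with spatially varying Gaussian noise whose strength is set per region by the causal mask: stronger in background regions, weaker in foreground areas.
Result: On the Camelyon17 benchmark, CGP shows consistent gains over state-of-the-art OOD baselines.
Insight: Causal perturbation improves generalization and is interpretable, suggesting a direction for reliable model design.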
Abstract: Out-of-distribution (OOD) generalization remains a central challenge in deploying deep learning models to real-world scenarios, particularly in domains such as biomedical images, where distribution shifts are both subtle and pervasive. While existing methods often pursue domain invariance through complex generative models or adversarial training, these approaches may overlook the underlying causal mechanisms of generalization. In this work, we propose Causally-Guided Gaussian Perturbations (CGP), a lightweight framework that enhances OOD generalization by injecting spatially varying noise into input images, guided by soft causal masks derived from Vision Transformers. By applying stronger perturbations to background regions and weaker ones to foreground areas, CGP encourages the model to rely on causally relevant features rather than spurious correlations. Experimental results on the challenging WILDS benchmark Camelyon17 demonstrate consistent performance gains over state-of-the-art OOD baselines, highlighting the potential of causal perturbation as a tool for reliable and interpretable generalization.
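The mask-guided noise injection described in this abstract can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the mask is taken as given (the Vision Transformer step that produces it is not shown), and the function name, the linear interpolation between the two noise levels, and the default values of `sigma_bg` and `sigma_fg` are all hypothetical.

```python
import numpy as np

def causally_guided_perturbation(image, causal_mask, sigma_bg=0.2, sigma_fg=0.02, rng=None):
    """Add spatially varying Gaussian noise to an image.

    image:       float array (H, W, C) with values in [0, 1]
    causal_mask: float array (H, W) in [0, 1]; 1 marks foreground (causally relevant)
    sigma_bg / sigma_fg: noise std for background / foreground (illustrative values)
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.clip(np.asarray(causal_mask, dtype=float), 0.0, 1.0)[..., None]
    # Per-pixel noise level: weak where the mask is high, strong where it is low.
    sigma = mask * sigma_fg + (1.0 - mask) * sigma_bg
    noise = rng.standard_normal(np.shape(image)) * sigma
    return np.clip(np.asarray(image, dtype=float) + noise, 0.0, 1.0)
```

Used as a training-time augmentation, a transform of this kind leaves foreground pixels nearly intact while making background cues unreliable, which is the stated mechanism for discouraging spurious correlations.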
[104] SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Christoph Timmermann,Hyunse Lee,Woojin Lee
Main category: cs.CV
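TL;DR: SeMoBridge is a lightweight method that maps images into the text modality through a Semantic Modality Bridge, addressing CLIP's intra-modal misalignment in few-shot classification and improving performance.
Details
Motivation: CLIP excels at zero-shot tasks, but its few-shot classification performance is limited by intra-modal misalignment. Existing methods are either computationally expensive or of limited effectiveness.
Contribution: SeMoBridge, a lightweight method that directly calibrates the image and text embedding spaces through a Semantic Modality Bridge to improve few-shot classification.
Method: Images are mapped into the text modality through the Semantic Modality Bridge, and the projection can be optimized with multi-modal supervision. The method comes in a closed-form version and a trained version, SeMoBridge-T.
Result: SeMoBridge-T outperforms other methods overall, particularly in 1-, 2-, and 4-shot settings, while requiring only a fraction of the training time.
Insight: Directly calibrating the intra-modal embedding space is key to improving CLIP's few-shot performance, and lightweight methods do well in low-data settings.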
Abstract: While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP’s exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality, while keeping their semantic content intact through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time while overall outperforming other methods, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
[105] SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies
Gagandeep Singh,Samudi Amarsinghe,Urawee Thani,Ki Fung Wong,Priyanka Singh,Xue Li
Main category: cs.CV
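TL;DR: The paper proposes a lightweight segmentation-guided scoring (SGS) pipeline that addresses the HAMMER model's failure to detect global scene inconsistencies such as foreground-background mismatch, improving its robustness.
Details
Motivation: HAMMER performs well at multimodal manipulation detection but fails on global scene inconsistencies such as a foreground that does not match its background. The paper diagnoses this limitation and proposes a remedy that needs no retraining.
Contribution: A lightweight SGS pipeline that separates foreground and background regions with segmentation masks and computes region-aware coherence scores with a vision-language model, improving HAMMER's detection of global manipulations.
Method: 1. Separate foreground and background regions using person/face segmentation masks; 2. extract embeddings with a joint vision-language model; 3. compute region-aware coherence scores and fuse them with HAMMER's original prediction.
Result: SGS is inference-only, adds negligible computational overhead, and improves HAMMER's robustness to global manipulations across binary detection, grounding, and token-level explanations.
Insight: Region-aware reasoning is important for multimodal disinformation detection.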
Abstract: We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER’s original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
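The scoring-and-fusion step described in this abstract can be illustrated as follows. The abstract does not give the actual scoring or fusion formulas, so everything here is an assumption made for illustration: coherence is taken as the cosine similarity between foreground and background embeddings, and fusion is a fixed-weight average. The function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def region_coherence(fg_emb, bg_emb):
    """Assumed coherence score: how well the foreground embedding agrees with the background."""
    return cosine(fg_emb, bg_emb)

def fuse_with_base(base_fake_prob, coherence, weight=0.3):
    """Blend a base detector's fake probability with an incoherence term.

    coherence in [-1, 1] is mapped to incoherence in [0, 1]; `weight` is a
    hypothetical mixing coefficient, not a value from the paper.
    """
    incoherence = (1.0 - coherence) / 2.0
    return (1.0 - weight) * base_fake_prob + weight * incoherence
```

Because the fusion happens on the detector's output, a scheme like this needs no retraining of the base model, which matches the inference-only design the abstract describes.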
[106] DGM4+: Dataset Extension for Global Scene Inconsistency
Gagandeep Singh,Samudi Amarsinghe,Priyanka Singh,Xue Li
Main category: cs.CV
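TL;DR: The paper extends the DGM4 dataset with 5,000 high-quality samples that introduce foreground-background (FG-BG) mismatches and their hybrids with text manipulations, filling a gap in global inconsistency detection.
Details
Motivation: Rapid advances in generative models have made multimodal disinformation easier to produce, while the existing DGM4 dataset covers only local manipulations (face swaps, attribute edits, and caption changes) and lacks global inconsistencies such as mismatched foregrounds and backgrounds.
Contribution: 1. An extension of DGM4 with 5,000 samples introducing FG-BG mismatch and its hybrid categories; 2. a data generation and quality control pipeline; 3. the DGM4+ benchmark for testing multimodal models on both local and global reasoning.
Method: Images are generated with OpenAI's gpt-image-1 so that foreground and background clearly mismatch, and captions are produced under three conditions (literal, text attribute, and text split). Quality control enforces one to three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length.
Result: DGM4+ fills the gap in global inconsistency detection and provides a more comprehensive evaluation benchmark for multimodal models such as HAMMER. The dataset and generation script are released.
Insight: Global inconsistencies such as FG-BG mismatch are prevalent in real-world forgeries and easily overlooked. Existing models struggle with them and need further improvement.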
Abstract: The rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI’s gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark DGM4+ that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
[107] Geometric Learning of Canonical Parameterizations of $2D$-curves
Ioana Ciuclea,Giorgio Longari,Alice Barbara Tumpach
Main category: cs.CV
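TL;DR: The paper proposes a method based on sections of a principal fiber bundle that mods out symmetries without data augmentation, applied to canonical parameterizations of 2D curves.
Details
Motivation: Datasets in computer vision and medical applications often exhibit symmetries such as rotation and scaling. The usual approach relies on data augmentation; this paper aims to avoid it and build more sustainable algorithms.
Contribution: A framework based on sections of a principal fiber bundle for modding out symmetries, and a 2-parameter family of canonical parameterizations of 2D curves that contains the constant-speed parameterization as a special case.
Method: Using a section of a principal fiber bundle, simple metrics on the space of objects measure dissimilarities between orbits under the symmetry group. The section can be optimized to maximize class separation.
Result: The method is illustrated on a dataset of object contours for the groups of translations, rotations, scalings, and reparameterizations.
Insight: Geometric methods offer an alternative to data augmentation for handling symmetries, and the underlying framework has a wide range of possible applications.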
Abstract: Most datasets encountered in computer vision and medical applications present symmetries that should be taken into account in classification tasks. A typical example is the symmetry by rotation and/or scaling in object detection. A common way to build neural networks that learn the symmetries is to use data augmentation. In order to avoid data augmentation and build more sustainable algorithms, we present an alternative method to mod out symmetries based on the notion of section of a principal fiber bundle. This framework allows the use of simple metrics on the space of objects in order to measure dissimilarities between orbits of objects under the symmetry group. Moreover, the section used can be optimized to maximize separation of classes. We illustrate this methodology on a dataset of contours of objects for the groups of translations, rotations, scalings and reparameterizations. In particular, we present a $2$-parameter family of canonical parameterizations of curves, containing the constant-speed parameterization as a special case, which we believe is interesting in its own right. We hope that this simple application will serve to convey the geometric concepts underlying this method, which have a wide range of possible applications. The code is available at https://github.com/GiLonga/Geometric-Learning. A tutorial notebook showcasing an application of the code to a specific dataset is available at https://github.com/ioanaciuclea/geometric-learning-notebook
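The constant-speed parameterization mentioned in this abstract, the special case of the 2-parameter family, has a simple discrete analogue: resample a polyline at equal arc-length steps. The sketch below shows only that special case for an open curve and assumes consecutive input points are distinct. It is not the authors' code, and the function name is illustrative.

```python
import numpy as np

def constant_speed_resample(points, n_samples):
    """Resample an open 2D polyline at equally spaced arc-length positions.

    points: array-like of shape (N, 2), consecutive points assumed distinct
    Returns an (n_samples, 2) array tracing the same polyline at constant speed.
    """
    points = np.asarray(points, dtype=float)
    seg_len = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    targets = np.linspace(0.0, arc[-1], n_samples)
    x = np.interp(targets, arc, points[:, 0])
    y = np.interp(targets, arc, points[:, 1])
    return np.stack([x, y], axis=1)
```

Choosing one canonical parameterization per curve removes the reparameterization symmetry, so two samplings of the same contour map to the same representative and can be compared with a simple pointwise metric.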
[108] EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models
Seamie Hayes,Ganesh Sistu,Ciarán Eising
Main category: cs.CV
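TL;DR: EasyOcc is a self-supervised approach that uses 3D pseudo-labels (generated by Grounded-SAM and Metric3Dv2) and temporal information, substantially improving semantic occupancy prediction while avoiding costly rendering strategies.
Details
Motivation: Existing self-supervised techniques such as novel view synthesis are computationally and memory intensive during training. The paper proposes a lightweight alternative based on 3D pseudo-labels.
Contribution: 1. A 3D pseudo-label generation method; 2. EasyOcc, a model that learns solely from the pseudo-labels; 3. substantial gains in performance and transferability over existing methods.
Method: 3D pseudo-labels are generated with Grounded-SAM and Metric3Dv2 and densified using temporal information. The EasyOcc model learns from these labels without complex rendering.
Result: Integrating the pseudo-labels into OccNeRF raises mIoU by 45%, from 9.73 to 14.09. EasyOcc itself reaches 13.86 mIoU and, on the full scene without the camera mask, 7.71 mIoU, outperforming the previous best model by 31%.
Insight: Foundation models and temporal context are critical to self-supervised learning, and the choice of loss computation space strongly affects performance.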
Abstract: Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address the issue of semantic and depth ambiguity. However, such techniques typically incur high computational costs and memory usage during the training stage, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, which yields substantial performance improvements, with mIoU increasing by 45%, from 9.73 to 14.09, when implemented into the OccNeRF model. This stands in contrast to earlier advancements in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, achieving 13.86 mIoU. This model conducts learning solely from our labels, avoiding complex rendering strategies mentioned previously. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask, with EasyOcc achieving 7.71 mIoU, outperforming the previous best model by 31%. These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
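The percentages in this abstract are relative gains, which a quick calculation confirms. The second figure below, the previous best full-scene mIoU, is not stated in the abstract; it is inferred on the assumption that "31%" is also a relative improvement.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

# OccNeRF with the 3D pseudo-labels: 9.73 -> 14.09 mIoU (about 44.8%, reported as 45%).
occnerf_gain = relative_gain(9.73, 14.09)

# Implied previous best on the full scene, assuming 7.71 is a 31% relative gain.
# This is an inference from the abstract, not a number reported in it.
implied_previous_best = 7.71 / 1.31
```

Note that 45% is a relative gain; the absolute improvement is 4.36 mIoU points.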
[109] Predicting Penalty Kick Direction Using Multi-Modal Deep Learning with Pose-Guided Attention
Pasindu Ranasinghe,Pamudu Ranasinghe
Main category: cs.CV
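TL;DR: The paper proposes a real-time framework that combines multi-modal deep learning with pose-guided attention to predict the direction of a football penalty kick before the ball is struck. It outperforms baselines that use only visual or only pose information.
Details
Motivation: Penalty kicks often decide matches, yet goalkeepers must anticipate the kick direction from subtle biomechanical cues within a very short time window. The study aims to improve prediction accuracy through multi-modal deep learning.
Contribution: 1) A dual-branch architecture combining a CNN and an LSTM network; 2) a pose-guided attention mechanism; 3) an annotated dataset of 755 penalty kick events; 4) 89% accuracy on a held-out test set with real-time inference.
Method: The model has two branches: MobileNetV2 extracts spatial features from RGB frames, and an LSTM with attention processes 2D keypoints. A distance-based threshold segments input sequences immediately before ball contact to ensure consistent input.
Result: The model reaches 89% accuracy on the test set, 14-22% higher than visual-only and pose-only baselines, with an inference time of 22 milliseconds, making it suitable for real-time analysis and training.
Insight: Pose information effectively guides visual attention toward task-relevant regions, and multi-modal fusion is key. The lightweight design shows potential for practical deployment.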
Abstract: Penalty kicks often decide championships, yet goalkeepers must anticipate the kicker’s intent from subtle biomechanical cues within a very short time window. This study introduces a real-time, multi-modal deep learning framework to predict the direction of a penalty kick (left, middle, or right) before ball contact. The model uses a dual-branch architecture: a MobileNetV2-based CNN extracts spatial features from RGB frames, while 2D keypoints are processed by an LSTM network with attention mechanisms. Pose-derived keypoints further guide visual focus toward task-relevant regions. A distance-based thresholding method segments input sequences immediately before ball contact, ensuring consistent input across diverse footage. A custom dataset of 755 penalty kick events was created from real match videos, with frame-level annotations for object detection, shooter keypoints, and final ball placement. The model achieved 89% accuracy on a held-out test set, outperforming visual-only and pose-only baselines by 14-22%. With an inference time of 22 milliseconds, the lightweight and interpretable design makes it suitable for goalkeeper training, tactical analysis, and real-time game analytics.
[110] Text-to-Scene with Large Reasoning Models
Frédéric Berdoz,Luca A. Lanzendörfer,Nick Tuninga,Roger Wattenhofer
Main category: cs.CV
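TL;DR: Reason-3D is a text-to-scene model powered by large reasoning models (LRMs). By combining object retrieval with spatial reasoning, it improves the visual fidelity and constraint adherence of complex 3D scene generation.
Details
Motivation: Existing text-to-scene methods struggle with complex geometries and object transformations and adhere weakly to complex instructions.
Contribution: Reason-3D, which uses LRMs for object retrieval and spatial refinement to address these limitations.
Method: Object retrieval based on physical, functional, and contextual attributes is combined with collision-aware spatial reasoning to generate 3D scenes.
Result: On indoor configuration instructions ranging from simple to complex, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, constraint adherence, and asset retrieval quality.
Insight: LRMs show strong spatial reasoning abilities, opening a direction for text-to-scene generation.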
Abstract: Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
[111] EVODiff: Entropy-aware Variance Optimized Diffusion Inference
Shigui Li,Wei Chen,Delu Zeng
Main category: cs.CV
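TL;DR: EVODiff is an information-theoretic, entropy-aware variance optimization method that significantly improves the inference efficiency and generation quality of diffusion models.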
Details
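Motivation: Diffusion models excel at image generation but suffer from slow inference and training-inference discrepancies. Gradient-based solvers accelerate inference, yet they lack a theoretical foundation in information transmission efficiency.
Contribution: 1. Shows from an information-theoretic perspective that successful denoising reduces conditional entropy; 2. argues that data-prediction parameterization outperforms noise prediction; 3. proposes EVODiff, a reference-free method that optimizes conditional variance to reduce errors.
Method: EVODiff systematically reduces uncertainty during denoising by optimizing conditional entropy, using entropy-aware variance optimization of the generative process.
Result: EVODiff outperforms SOTA gradient-based solvers. Compared with DPM-Solver++, it reduces reconstruction error by up to 45.5% on CIFAR-10 at 10 NFE (FID improves from 5.10 to 2.78), cuts NFE cost by 25% on ImageNet-256 (from 20 to 15), and improves text-to-image generation while reducing artifacts.
Insight: 1. The information-theoretic view gives diffusion model inference a new theoretical basis; 2. optimizing conditional entropy is key to more efficient inference.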
Abstract: Diffusion models (DMs) excel in image generation, but suffer from slow inference and the training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.
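The link between conditional variance and conditional entropy can be made concrete with a standard identity. This is a sketch under a stated assumption, not the paper's derivation: if the reverse transition is an isotropic Gaussian whose variance does not depend on the conditioning state, its differential entropy is (d/2)·log(2πe·σ²), so lowering the variance lowers the entropy. The function name is hypothetical.

```python
import math

def gaussian_entropy(variance, dim):
    """Differential entropy (in nats) of an isotropic Gaussian N(mu, variance * I)
    in `dim` dimensions: 0.5 * dim * log(2 * pi * e * variance)."""
    return 0.5 * dim * math.log(2 * math.pi * math.e * variance)

# Halving the variance in 4 dimensions removes 0.5 * 4 * log(2) nats of entropy.
h_wide = gaussian_entropy(1.0, dim=4)
h_narrow = gaussian_entropy(0.5, dim=4)
entropy_drop = h_wide - h_narrow
```

Under this assumption the entropy depends on the variance only through its logarithm, which is why a variance choice can be compared across steps without a reference sample.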
[112] EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
Ruixiao Dong,Zhendong Wang,Keli Liu,Li Li,Ying Chen,Kai Li,Daowen Li,Houqiang Li
Main category: cs.CV
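TL;DR: EchoGen is a feed-forward subject-driven generation framework built on Visual Auto-Regressive (VAR) models. Its dual-path injection strategy delivers efficient, high-fidelity subject-driven generation while avoiding the slow inference of diffusion-based approaches.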
Details
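Motivation: Current subject-driven generation methods either require computationally expensive per-subject fine-tuning or rely on diffusion models with slow inference. EchoGen aims to resolve this tension using the fast sampling of VAR models.
Contribution: 1. Presents EchoGen, which the authors describe as the first feed-forward subject-driven generation framework built on VAR models; 2. designs a dual-path injection strategy that separates a subject's high-level semantics from its low-level details for better controllability and fidelity.
Method: 1. A semantic encoder extracts the subject's abstract identity, which is injected through decoupled cross-attention; 2. a content encoder captures fine details, which are integrated through a multi-modal attention mechanism to preserve texture and structure with high fidelity.
Result: EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods, with significantly lower sampling latency.
Insight: VAR models are an efficient and promising foundation for subject-driven generation, and the dual-path injection strategy is key to high-quality results.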
Abstract: Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject’s high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject’s abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.
[113] EntroPE: Entropy-Guided Dynamic Patch Encoder for Time Series Forecasting
Sachith Abeywickrama,Emadeldeen Eldele,Min Wu,Xiaoli Li,Chau Yuen
Main category: cs.CV
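TL;DR: The paper proposes EntroPE (Entropy-Guided Dynamic Patch Encoder), which dynamically detects transition points in a time series to preserve temporal structure, improving both the accuracy and the efficiency of time series forecasting.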
Details
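Motivation: Existing Transformer-based forecasting models construct patches without regard to temporal coherence, so segmentation disrupts short-term dependencies and weakens representation learning.
Contribution: 1. Proposes EntroPE, a novel dynamic patching framework. 2. Introduces an Entropy-based Dynamic Patcher (EDP), which locates patch boundaries with information-theoretic criteria, and an Adaptive Patch Encoder (APE), which together preserve temporal structure.
Method: 1. Conditional entropy is used to detect natural transition points in the time series. 2. The dynamically sized patches are encoded into fixed-size latent representations through pooling and cross-attention. 3. A global Transformer models inter-patch dynamics.
Result: Experiments on long-term forecasting benchmarks show that EntroPE improves both accuracy and efficiency.
Insight: Entropy-guided dynamic patching is a promising new paradigm for time series modeling that better preserves temporal structure and short-term dependencies.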
Abstract: Transformer-based models have significantly advanced time series forecasting, with patch-based input strategies offering efficiency and improved long-horizon modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. In response, we propose EntroPE (Entropy-Guided Dynamic Patch Encoder), a novel, temporally informed framework that dynamically detects transition points via conditional entropy and dynamically places patch boundaries. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. These embeddings are then processed by a global transformer to model inter-patch dynamics. Experiments across long-term forecasting benchmarks demonstrate that EntroPE improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at: https://github.com/Sachithx/EntroPE.
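To give a feel for entropy-guided boundary placement, here is a deliberately simplified sketch. It does not reproduce the paper's conditional-entropy criterion; as a stand-in it uses the Shannon entropy of quantized values in a sliding window and places a boundary where that entropy peaks. All function names, the bin count, and the window size are hypothetical.

```python
import math
from collections import Counter

def quantize(series, n_bins=4):
    """Map each value to an integer bin index in [0, n_bins - 1]."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in series]

def window_entropy(symbols):
    """Shannon entropy (bits) of the symbol distribution inside one window."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

def patch_boundaries(series, window=4, n_bins=4):
    """Place a boundary at the centre of each window whose entropy is a local peak."""
    symbols = quantize(series, n_bins)
    scores = [window_entropy(symbols[i:i + window])
              for i in range(len(symbols) - window + 1)]
    boundaries = []
    for i in range(1, len(scores) - 1):
        if scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]:
            boundaries.append(i + window // 2)
    return boundaries

# A step signal: the only natural transition is between index 3 and index 4.
boundaries = patch_boundaries([0, 0, 0, 0, 3, 3, 3, 3])
```

In this toy case the boundary falls on the step itself, so neither resulting patch straddles the transition, which is the behavior fixed-length patching cannot guarantee.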
[114] Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Yuansen Liu,Haiming Tang,Jinlong Peng,Jiangning Zhang,Xiaozhong Ji,Qingdong He,Donghao Luo,Zhenye Gan,Junwei Zhu,Yunhang Shen,Chaoyou Fu,Chengjie Wang,Xiaobin Hu,Shuicheng Yan
Main category: cs.CV
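TL;DR: Human-MME is a benchmark for holistically evaluating how well multimodal large language models (MLLMs) understand human-centric scenes. It covers diverse scenes and tasks and provides high-quality annotations and multi-dimensional evaluation.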
Details
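Motivation: Existing MLLM benchmarks do not examine human-centric scenes comprehensively. In particular, fine-grained perception and higher-dimensional causal reasoning remain underexplored.
Contribution: 1. Diverse coverage of human scenes; 2. a progressive, multi-dimensional evaluation framework; 3. high-quality annotations with rich data paradigms.
Method: Builds an evaluation suite of 19,945 real-world image-question pairs spanning eight dimensions, using an automated annotation pipeline together with a human-annotation platform.
Result: Experiments on 17 state-of-the-art MLLMs expose the models' limitations and point to directions for future research.
Insight: Human-centric scene understanding requires both fine-grained perception and higher-dimensional reasoning, and current MLLMs still have substantial room for improvement in this area.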
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scene, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
[115] AttriGen: Automated Multi-Attribute Annotation for Blood Cell Datasets
Walid Houmaidi,Youssef Sabiri,Fatima Zahra Iguenfer,Amine Abouaomar
Main category: cs.CV
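TL;DR: AttriGen is a new framework for automated multi-attribute annotation, aimed in particular at cell microscopy images. Its dual-model architecture achieves high classification accuracy and improves model interpretability and annotation efficiency.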
Details
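Motivation: Multi-attribute annotation remains underexplored compared with traditional cell type classification, and manual annotation is costly. AttriGen aims to provide an automated, efficient solution.
Contribution: Proposes the AttriGen framework, a dual-model architecture combining a CNN and a ViT that reaches 94.62% accuracy on cell microscopy data, which the authors report as a new benchmark.
Method: A CNN performs cell type classification and a ViT performs multi-attribute classification. The dual-model architecture is trained on two complementary datasets (PBC and WBCAtt).
Result: Reaches 94.62% accuracy while improving model interpretability and reducing annotation time and cost relative to full manual labeling.
Insight: The dual-model architecture performs well on multi-attribute classification, and AttriGen can be extended to other computer vision classification tasks as a paradigm for automated annotation.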
Abstract: We introduce AttriGen, a novel framework for automated, fine-grained multi-attribute annotation in computer vision, with a particular focus on cell microscopy where multi-attribute classification remains underrepresented compared to traditional cell type categorization. Using two complementary datasets: the Peripheral Blood Cell (PBC) dataset containing eight distinct cell types and the WBC Attribute Dataset (WBCAtt) that contains their corresponding 11 morphological attributes, we propose a dual-model architecture that combines a CNN for cell type classification, as well as a Vision Transformer (ViT) for multi-attribute classification achieving a new benchmark of 94.62% accuracy. Our experiments demonstrate that AttriGen significantly enhances model interpretability and offers substantial time and cost efficiency relative to conventional full-scale human annotation. Thus, our framework establishes a new paradigm that can be extended to other computer vision classification tasks by effectively automating the expansion of multi-attribute labels.
[116] TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos
Ioannis Kontostathis,Evlampios Apostolidis,Vasileios Mezaris
Main category: cs.CV
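TL;DR: The paper presents the TSalV360 method and the TSV360 dataset for text-driven saliency detection in 360-degree videos. The method combines a vision-language model with a cross-modal attention mechanism to enable customized saliency detection.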
Details
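Motivation: Saliency detection in 360-degree videos usually relies on visual information alone, ignoring that users may want to specify objects or events of interest through a text description.
Contribution: 1) Introduces the TSV360 dataset, containing 16,000 triplets of ERP frames, text descriptions, and saliency maps; 2) develops the TSalV360 method, which uses a text description to perform customized saliency detection.
Method: TSalV360 extends a visual saliency detection approach with a vision-language model, a similarity estimation module, and a viewport spatio-temporal cross-attention mechanism to capture dependencies between modalities.
Result: On the TSV360 dataset, TSalV360 is competitive with a SOTA visual-only approach and demonstrates customized text-driven saliency detection.
Insight: Text-conditioned saliency detection can serve user needs better than purely visual detection, and modeling the dependencies between modalities is key.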
Abstract: In this paper, we deal with the task of text-driven saliency detection in 360-degree videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection, and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations on the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its ability to perform customized text-driven saliency detection in 360-degree videos.
[117] Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation
Chenyang Jiang,Zhengcen Li,Hang Zhao,Qiben Shan,Shaocong Wu,Jingyong Su
Main category: cs.CV
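TL;DR: The paper proposes GSDD, an efficient dataset distillation method based on a sparse Gaussian representation. Encoding the key discriminative information with a small number of Gaussian primitives avoids the redundancy of conventional methods and improves dataset diversity and distillation performance.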
Details
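Motivation: Modern model training imposes heavy computational and storage burdens. Conventional approaches based on dense pixel-level representations are redundant and hard to scale.
Contribution: Proposes GSDD, which represents distilled data with sparse 2D Gaussians, reducing redundancy and improving efficiency and performance.
Method: A small number of Gaussian primitives encode the critical information, and CUDA-based splatting operators are adapted for parallel training and inference.
Result: Achieves SOTA performance on CIFAR-10, CIFAR-100, and ImageNet subsets while keeping encoding and decoding costs low.
Insight: A sparse Gaussian representation captures discriminative information efficiently and offers a scalable approach to dataset distillation.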
Abstract: Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while keeping encoding and decoding costs low. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.
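The idea of storing an image as a handful of Gaussian primitives rather than a dense pixel grid can be sketched in a few lines. This is a toy illustration under simplifying assumptions that are not taken from the paper: the Gaussians are isotropic, the image is grayscale, contributions are summed additively, and rendering runs in plain Python instead of CUDA splatting. The `render` function and the parameter layout are hypothetical.

```python
import math

def render(gaussians, height, width):
    """Rasterize (cx, cy, sigma, amplitude) primitives into a height x width grid."""
    image = [[0.0] * width for _ in range(height)]
    for cx, cy, sigma, amp in gaussians:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                image[y][x] += amp * math.exp(-d2 / (2 * sigma ** 2))
    return image

# Two primitives = 8 stored numbers, versus 64 values for a dense 8x8 image.
primitives = [(2.0, 2.0, 1.0, 1.0), (5.0, 5.0, 1.5, 0.5)]
image = render(primitives, height=8, width=8)
stored_numbers = sum(len(p) for p in primitives)
```

The storage saving in this toy case is what, under a fixed budget, would leave room for more distilled images per class.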
[118] An Experimental Study on Generating Plausible Textual Explanations for Video Summarization
Thomas Eleftheriadis,Evlampios Apostolidis,Vasileios Mezaris
Main category: cs.CV
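TL;DR: An experimental study on generating plausible textual explanations for video summarization. It integrates the LLaVA-OneVision model, proposes a way to evaluate plausibility, and examines the relationship between the faithfulness and the plausibility of explanations.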
Details
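Motivation: To generate plausible textual explanations for video summarization outcomes and to test whether more faithful explanations are also more plausible.
Contribution: Extends an existing framework for explaining video summarization, proposes an approach for evaluating the plausibility of explanations through their textual descriptions, and evaluates the generated explanations experimentally.
Method: LLaVA-OneVision is integrated to produce textual descriptions, and SBERT and SimCSE sentence embeddings are used to quantify semantic overlap as a measure of plausibility.
Result: Experiments with the CA-SUM method on the SumMe and TVSum datasets examine the relationship between faithfulness and plausibility and identify the most appropriate approach for generating plausible textual explanations.
Insight: More faithful explanations are not necessarily more plausible, and choosing a suitable text generation approach is key to plausibility.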
Abstract: In this paper, we present our experimental study on generating plausible textual explanations for the outcomes of video summarization. For the needs of this study, we extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. Next, we focus on one of the most desired characteristics of explainable AI: the plausibility of the obtained explanations, which relates to their alignment with human reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, with the help of two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and identify the most appropriate approach for generating plausible textual explanations for video summarization.
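Semantic overlap between two sentence embeddings is commonly scored with cosine similarity, and the sketch below shows that computation. It assumes the embeddings have already been produced by a sentence encoder such as SBERT or SimCSE; the vectors here are made-up placeholders, and the abstract does not state that cosine similarity is the exact measure the authors use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Placeholder embeddings (hypothetical): one for the description of a visual
# explanation, one for the description of the corresponding video summary.
explanation_emb = [0.2, 0.7, 0.1]
summary_emb = [0.25, 0.65, 0.05]
plausibility = cosine_similarity(explanation_emb, summary_emb)
```

A score near 1 would indicate that the explanation describes much the same content as the summary, which is the sense of plausibility the study quantifies.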
[119] Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts
Haiyang Zheng,Nan Pu,Wenjing Li,Nicu Sebe,Zhun Zhong
Main category: cs.CV
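TL;DR: Proposes the Multi-Granularity Conceptual Experts (MGCE) framework, which tackles generalized category discovery through dynamic conceptual contrastive learning and multi-granularity expert collaboration. It does not need to know the number of categories in the unlabeled data in advance and achieves SOTA results on fine-grained visual recognition.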
Details
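Motivation: Generalized Category Discovery (GCD) is an open-world problem in which unlabeled data, containing both known and novel categories, must be clustered. Existing methods fail to exploit multi-granularity conceptual information and assume the number of categories is known in advance, which is impractical.
Contribution: 1. Proposes the MGCE framework with DCCL and MECL modules, which mine multi-granularity concepts dynamically and improve representation quality; 2. estimates the number of categories automatically instead of requiring it in advance; 3. outperforms existing methods on multiple fine-grained visual benchmarks.
Method: 1. DCCL alternates between concept mining and dual-level representation learning to jointly optimize feature learning and category discovery; 2. MECL introduces experts at different granularities and uses a concept alignment matrix for cross-expert collaboration.
Result: Achieves SOTA results on nine fine-grained visual recognition benchmarks, particularly in novel-class accuracy. Even without knowing the number of categories, it outperforms parametric approaches that require the exact number, by 3.6% on average.
Insight: Multi-granularity conceptual information and dynamic learning can substantially improve open-world category discovery, and removing the need for a known category count makes the approach more practical.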
Abstract: Generalized Category Discovery (GCD) is an open-world problem that clusters unlabeled data by leveraging knowledge from partially labeled categories. A key challenge is that unlabeled data may contain both known and novel categories. Existing approaches suffer from two main limitations. First, they fail to exploit multi-granularity conceptual information in visual data, which limits representation quality. Second, most assume that the number of unlabeled categories is known during training, which is impractical in real-world scenarios. To address these issues, we propose a Multi-Granularity Conceptual Experts (MGCE) framework that adaptively mines visual concepts and integrates multi-granularity knowledge for accurate category discovery. MGCE consists of two modules: (1) Dynamic Conceptual Contrastive Learning (DCCL), which alternates between concept mining and dual-level representation learning to jointly optimize feature learning and category discovery; and (2) Multi-Granularity Experts Collaborative Learning (MECL), which extends the single-expert paradigm by introducing additional experts at different granularities and by employing a concept alignment matrix for effective cross-expert collaboration. Importantly, MGCE can automatically estimate the number of categories in unlabeled data, making it suitable for practical open-world settings. Extensive experiments on nine fine-grained visual recognition benchmarks demonstrate that MGCE achieves state-of-the-art results, particularly in novel-class accuracy. Notably, even without prior knowledge of category numbers, MGCE outperforms parametric approaches that require knowing the exact number of categories, with an average improvement of 3.6%. Code is available at https://github.com/HaiyangZheng/MGCE.
[120] IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo,Chuanhao Yan,Xingqian Xu,Yulin Wang,Kai Wang,Gao Huang,Humphrey Shi
Main category: cs.CV
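TL;DR: The paper proposes Implicit Multimodal Guidance (IMG), a new method that calibrates diffusion models through implicit multimodal guidance, improving the alignment between images and prompts without extra data or editing operations.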
Details
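Motivation: Aligning diffusion-generated images with their input prompts is a long-standing challenge. Existing methods rely on high-quality preference data, which is hard to scale, or on local editing, which may compromise overall image quality.
Contribution: 1. Proposes the IMG framework, which calibrates alignment with no extra data or editing operations; 2. designs an Implicit Aligner and a trainable objective, the Iteratively Updated Preference Objective; 3. works as a plug-and-play adapter that enhances existing methods.
Method: 1. A multimodal large language model (MLLM) identifies misalignments; 2. the Implicit Aligner manipulates diffusion conditioning features to enable re-generation; 3. the iteratively updated objective optimizes alignment.
Result: Experiments on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods and is compatible with prior finetuning-based methods.
Insight: IMG achieves alignment through implicit feature manipulation, which avoids dependence on extra data and editing operations and offers a flexible, efficient way to calibrate diffusion models.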
Abstract: Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
[121] Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document
Adnan Ben Mansour,Ayoub Karine,David Naccache
Main category: cs.CV
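TL;DR: The paper proposes Donut-MINT, a model compression approach based on mechanistic interpretability. Using knowledge distillation and pruning, it reduces the compute and memory footprint of the Donut model while maintaining strong performance on DocVQA.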
Details
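Motivation: Large vision-language models such as Donut are effective but too costly for real-time or resource-constrained applications, so a lightweight compressed model is needed.
Contribution: Proposes Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruning and distillation approach guided by mechanistic interpretability that reduces model cost while maintaining performance.
Method: The model's internal computations are analyzed to identify which subcomponents to retain and which to approximate, skip, or reparametrize. Compact student models are then trained from the larger teacher with knowledge distillation.
Result: Donut-MINT maintains strong performance on the DocVQA benchmark while reducing inference time and memory usage.
Insight: Compression can be reframed as circuit discovery, connecting mechanistic interpretability research with the practical deployment of lightweight vision-language models.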
Abstract: Recent advances in Visually-rich Document Understanding rely on large Vision-Language Models like Donut, which perform document-level Visual Question Answering without Optical Character Recognition. Despite their effectiveness, these models are too costly for real-time or resource-constrained applications. We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. By analyzing internal computations, we identify essential subcomponents to retain, while having a clear view of which subcomponents should be approximated, skipped, or reparametrized based on their function. This approach yields Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA, a standard benchmark for document Visual Question Answering. Our method reframes compression as circuit discovery, bridging interpretability research and practical Vision-Language Model deployment.
[122] Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA
Zhejia Cai,Yandan Yang,Xinyuan Chang,Shiyi Liang,Ronghan Chen,Feng Xiong,Mu Xu,Ruqi Huang
Main category: cs.CV
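TL;DR: The paper proposes the Farsighted-LAM and SSM-VLA frameworks, which use geometry-aware spatial encoding and multi-scale temporal modeling to address the weak spatial understanding and limited temporal perception of Latent Action Models, improving the performance and interpretability of VLA systems.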
Details
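Motivation: Existing Latent Action Models (LAMs) are bottlenecked by poor spatial understanding and limited temporal perception, which hinders stable and clear action modeling. An approach that combines spatial and dynamic awareness is needed.
Contribution: 1. Proposes Farsighted-LAM, which combines geometry-aware spatial encoding with multi-scale temporal modeling; 2. designs SSM-VLA, which integrates structured perception with a visual Chain-of-Thought module to improve decision consistency and interpretability.
Method: 1. Geometry-aware spatial encoding improves spatial understanding; 2. multi-scale temporal modeling captures dynamic motion patterns; 3. SSM-VLA combines structured perception with a visual Chain-of-Thought module for explicit reasoning.
Result: Achieves state-of-the-art performance on multiple VLA tasks in both simulation and real-world settings, demonstrating improved robustness and generalizability of embodied intelligence.
Insight: Combining geometry-aware modeling, temporal coherence, and explicit reasoning is key to improving VLA system performance.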
Abstract: Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
[123] PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
Tuan Nguyen,Naseem Khan,Khang Tran,NhatHai Phan,Issa Khalil
Main category: cs.CV
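TL;DR: The paper introduces PRPO (Paragraph-level Relative Policy Optimization), a reinforcement learning algorithm for vision-language deepfake detection that improves detection accuracy by aligning reasoning with image content at the paragraph level.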
Details
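Motivation: Deepfake detection is constrained by the scarcity of large, high-quality datasets, and existing multimodal large language models produce explanations that are misaligned with the visual evidence or hallucinated.
Contribution: Introduces a reasoning-annotated dataset and the PRPO algorithm, which improves the accuracy and interpretability of deepfake detection through paragraph-level reasoning alignment.
Method: A reinforcement learning approach aligns the language model's reasoning with image content at the paragraph level.
Result: PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score, 4.55/5.0. In ablation studies it significantly outperforms GRPO under test-time conditions.
Insight: Multimodal reasoning must be grounded in visual evidence to make deepfake detection more reliable and interpretable.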
Abstract: The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
[124] ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi,Jacopo Staiano,Antonio Liotta
Main category: cs.CV
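TL;DR: ProfVLM is a lightweight video-language model that uses generative reasoning to jointly predict skill level and generate expert-like feedback. It fuses multi-view features dynamically, uses far fewer parameters than prior methods, and trains faster.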
Details
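Motivation: Existing approaches to skill proficiency estimation often rely on black-box video classifiers that ignore multi-view context and lack explainability. The paper proposes a transparent and efficient alternative.
Contribution: Proposes ProfVLM, which introduces an AttentiveGatedProjector to fuse multi-view features dynamically and supports both skill level prediction and natural-language feedback generation.
Method: Features from a frozen TimeSformer backbone are fused across views and projected by the AttentiveGatedProjector into a language model tuned for feedback generation.
Result: On EgoExo4D, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%, and it provides transparent natural-language critiques.
Insight: Generative vision-language modeling is a promising new direction for skill assessment that combines efficiency with explainability.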
Abstract: Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
[125] Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization
Teng Zhang,Ziqian Fan,Mingxin Liu,Xin Zhang,Xudong Lu,Wentong Li,Yue Zhou,Yi Yu,Xiang Li,Junchi Yan,Xue Yang
Main category: cs.CV
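TL;DR: Point2RBox-v3 is a self-bootstrapping method based on point annotations. By refining and using dynamic pseudo labels, it addresses the inefficient use and poor quality of pseudo labels in existing methods.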
Details
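Motivation: Existing point-supervised weakly supervised methods use pseudo labels inefficiently and produce low-quality pseudo labels, which limits performance on oriented object detection.
Contribution: 1) Proposes Progressive Label Assignment (PLA), which dynamically estimates instance sizes; 2) designs the Prior-Guided Dynamic Mask Loss (PGDM-Loss), which combines the strengths of the SAM model and the watershed algorithm.
Method: PLA assigns labels dynamically during training, while PGDM-Loss combines SAM and watershed to perform well in both sparse and dense scenes.
Result: Competitive performance on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR, with a highest reported score of 66.09% (on DOTA-v1.0).
Insight: Dynamic pseudo labels and the combination of complementary methods can substantially improve weakly supervised learning, especially in scenes with large variations in object size or sparse objects.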
Abstract: Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM’s poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.
[126] Continuous Space-Time Video Super-Resolution with 3D Fourier Fields
Alexander Becker,Julius Erbach,Dominik Narnhofer,Konrad Schindler
Main category: cs.CV
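TL;DR: The paper proposes a new approach to continuous space-time video super-resolution. It represents video as a 3D Video Fourier Field (VFF), which allows flexible sampling in space and time and avoids the explicit frame warping of conventional methods. The method performs strongly on multiple benchmarks.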
Details
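Motivation: Conventional video super-resolution methods decouple spatial and temporal components and rely on explicit frame warping for motion compensation, which is brittle under complex motion. The paper addresses this with a continuous spatio-temporal representation.
Contribution: Proposes the 3D Video Fourier Field (VFF) representation, which is continuous and spatio-temporally coherent; designs a neural encoder that predicts the Fourier basis coefficients; and sets a new state of the art on multiple benchmarks.
Method: Video is represented as a 3D Fourier field whose basis coefficients are predicted by a neural network, and a Gaussian point spread function is included in the sampling for aliasing-free reconstruction.
Result: Across a wide range of upscaling factors, the method delivers sharper and more temporally consistent reconstructions while being computationally more efficient.
Insight: A continuous spatio-temporal representation sidesteps the motion compensation problems of conventional methods, and combining a Fourier basis with a neural encoder yields efficient, high-quality video super-resolution.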
Abstract: We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Project page: https://v3vsr.github.io.
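The appeal of a sinusoidal basis with an analytical Gaussian point spread function can be shown with a small sketch. It relies on a standard fact: convolving a sinusoid of spatial angular frequency (wx, wy) with an isotropic Gaussian of standard deviation sigma only scales its amplitude by exp(-0.5 · sigma² · (wx² + wy²)). The field layout below (a list of amplitude, frequency, phase terms) and the function name are hypothetical and are not the paper's parameterization.

```python
import math

def sample_field(terms, point, psf_sigma=0.0):
    """Evaluate a sum of sinusoids at a continuous (x, y, t) location.

    Each term is (amplitude, (wx, wy, wt), phase). A spatial Gaussian PSF with
    standard deviation `psf_sigma` is applied in closed form by damping each
    term according to its spatial frequency.
    """
    x, y, t = point
    value = 0.0
    for amp, (wx, wy, wt), phase in terms:
        damping = math.exp(-0.5 * psf_sigma ** 2 * (wx ** 2 + wy ** 2))
        value += amp * damping * math.sin(wx * x + wy * y + wt * t + phase)
    return value

terms = [(1.0, (1.0, 0.0, 0.0), 0.0)]  # a single horizontal wave
point = (math.pi / 2, 0.0, 0.0)        # any real-valued coordinate is allowed
sharp = sample_field(terms, point)
blurred = sample_field(terms, point, psf_sigma=1.0)
```

Because the blur is a per-term multiplier, high frequencies can be suppressed for coarse output grids without any extra filtering pass, which is the sense in which such sampling can be anti-aliased at arbitrary scale.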
[127] SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval
Ren-Di Wu,Yu-Yen Lin,Huei-Fang Yang
Main category: cs.CV
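TL;DR: SQUARE is a training-free, two-stage framework for zero-shot composed image retrieval (ZS-CIR) that uses multimodal large language models (MLLMs) to improve retrieval. The first stage, Semantic Query-Augmented Fusion (SQAF), enriches the query embedding. The second stage, Efficient Batch Reranking (EBR), improves ranking accuracy. Experiments show strong performance on multiple benchmarks.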
Details
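Motivation: Composed image retrieval (CIR) must preserve the content of a reference image while incorporating the user's textual modifications, but zero-shot CIR (ZS-CIR) methods, which use no task-specific training data, struggle to capture user intent accurately. SQUARE aims to improve such training-free methods with MLLMs.
Contribution: 1) Proposes the two-stage SQUARE framework; 2) the SQAF stage enriches the query embedding with MLLM-generated captions of the target image; 3) the EBR stage improves ranking through batched visual-semantic reasoning; 4) delivers strong performance on four CIR benchmarks.
Method: 1) SQAF: an initial query embedding from a vision-language model such as CLIP is combined with MLLM-generated semantic captions; 2) EBR: top-ranked candidates are shown to the MLLM as an image grid and reranked through joint reasoning in a single pass.
Result: SQUARE delivers strong performance on four standard CIR benchmarks without task-specific training.
Insight: High-level semantic guidance from an MLLM improves retrieval quality, an efficient batch reranking strategy can refine the ranking in a single pass, and performance remains high even with lightweight pre-trained models.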
Abstract: Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging. In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user’s intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings. Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
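The SQAF idea of enriching a query embedding with caption embeddings can be illustrated with a small sketch. This is not the authors' implementation: the weighted-average fusion rule, the `alpha` weight, and the function names `fuse_query` and `rank_gallery` are assumptions made for illustration, and the toy vectors stand in for real CLIP embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse_query(query_emb, caption_embs, alpha=0.5):
    """Blend the query embedding with the mean caption embedding.

    alpha is a hypothetical fusion weight: 0 keeps the original query,
    1 uses only the captions.
    """
    q = l2_normalize(query_emb)
    if not caption_embs:
        return q
    caps = [l2_normalize(c) for c in caption_embs]
    mean_cap = [sum(c[i] for c in caps) / len(caps) for i in range(len(q))]
    fused = [(1 - alpha) * q[i] + alpha * mean_cap[i] for i in range(len(q))]
    return l2_normalize(fused)

def rank_gallery(fused_query, gallery):
    """Return gallery keys sorted by cosine similarity to the fused query."""
    scores = {
        name: sum(a * b for a, b in zip(fused_query, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D example: the caption pulls the query toward image "b".
gallery = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, -1.0]}
fused = fuse_query([1.0, 0.0], [[0.0, 1.0]], alpha=0.5)
print(rank_gallery(fused, gallery))
```

In this toy setup the caption embedding moves the top result from "a" to "b", which is the qualitative effect the abstract attributes to caption-based semantic guidance.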
[128] EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Keming Wu,Sicong Jiang,Max Ku,Ping Nie,Minghao Liu,Wenhu Chen
Main category: cs.CV
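TL;DR: This paper proposes EditReward, a reward model aligned with human preferences for instruction-guided image editing. Trained on a new large-scale human preference dataset, the model performs strongly on multiple benchmarks and can select high-quality data to improve training.
Details
Motivation: Open-source models lag behind on natural-language-guided image editing; the main bottleneck is the lack of a reliable reward model for producing high-quality training data. This paper addresses that bottleneck. Contribution: 1. proposes EditReward, a reward model closely aligned with human preferences; 2. builds a large-scale annotated dataset of 200K preference pairs; 3. demonstrates the model's effectiveness for selecting high-quality data and improving training.
Method: 1. Train the reward model on a large-scale human preference dataset annotated under a rigorous protocol; 2. validate the model on multiple benchmarks; 3. use the model to select a high-quality subset of an existing dataset for training.
Result: EditReward achieves state-of-the-art human correlation on benchmarks such as GenAI-Bench and AURORA-Bench, and significantly improves the quality of training data.
Insight: A reliable reward model not only improves image editing models but also has potential applications in reinforcement-learning-based post-training and test-time scaling.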
Abstract: Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models like GPT-Image-1, Seedream, and Google-Nano-Banana have shown highly promising progress. However, the open-source models are still lagging. The main bottleneck is the lack of a reliable reward model to scale up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset, meticulously annotated by trained experts following a rigorous protocol containing over 200K preference pairs. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new benchmark, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset. We train Step1X-Edit on the selected subset, which shows significant improvement over training on the full set. This demonstrates EditReward’s ability to serve as a reward model to scale up high-quality training data for image editing. Furthermore, its strong alignment suggests potential for advanced applications like reinforcement learning-based post-training and test-time scaling of image editing models. EditReward with its training dataset will be released to help the community build more high-quality image editing training datasets.
[129] TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Xiangrui Liu,Minghao Qin,Yan Shu,Zhengyang Liang,Yang Tian,Chen Jason Zhang,Bo Zhao,Zheng Liu
Main category: cs.CV
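TL;DR: The paper introduces the Task-oriented Temporal Grounding (ToTG) problem and a new framework, TimeScope, which localizes key moments in long videos through progressive reasoning. It also releases the ToTG Bench benchmark and the ToTG Pile dataset; experiments show that TimeScope outperforms existing methods.
Details
Motivation: Identifying key moments in long videos is essential for downstream tasks, but traditional methods generalize poorly and struggle with long videos. The paper therefore poses ToTG, which localizes the relevant time intervals from a natural-language description of the task. Contribution: 1. defines the ToTG problem; 2. proposes the TimeScope framework, which reasons progressively through coarse-grained localization followed by fine-grained partitioning; 3. releases the ToTG Bench benchmark and the ToTG Pile dataset.
Method: TimeScope reasons progressively: it first localizes a coarse-grained temporal scope likely to contain the key moments, then refines that scope through fine-grained partitioning.
Result: Experiments show that TimeScope outperforms existing temporal grounding methods and multimodal large language models (MLLMs) across various settings.
Insight: Progressive reasoning offers a clear advantage on long videos, and task-oriented descriptions guide temporal grounding more naturally.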
Abstract: Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Task-oriented Temporal Grounding (ToTG), which aims to localize time intervals containing the necessary information based on a task’s natural description. Along with the definition, we also present ToTG Bench, a comprehensive benchmark for evaluating the performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose TimeScope, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through fine-grained moment partitioning. Additionally, we curate a high-quality dataset, namely ToTG Pile, to enhance TimeScope’s ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporal-grounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this new challenging problem.
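The coarse-then-fine procedure described above can be sketched as a two-level search over time windows. This is only an illustration of the general pattern, not TimeScope itself: `score_fn` is a hypothetical stand-in for a model that rates how relevant a time window is to the task, and the window lengths and threshold are assumed values.

```python
def split(start, end, step):
    """Partition [start, end) into consecutive segments of at most `step` seconds."""
    segments = []
    t = start
    while t < end:
        segments.append((t, min(t + step, end)))
        t += step
    return segments

def coarse_to_fine(duration, score_fn, coarse_len=60.0, fine_len=5.0, threshold=0.5):
    """Pick the best coarse window, then keep and merge its relevant fine segments."""
    # Stage 1: coarse scope, the single window with the highest relevance score.
    best = max(split(0.0, duration, coarse_len), key=lambda seg: score_fn(*seg))
    # Stage 2: fine partitioning inside that window only.
    kept = [seg for seg in split(best[0], best[1], fine_len) if score_fn(*seg) >= threshold]
    # Merge fine segments that touch each other into contiguous intervals.
    merged = []
    for s, e in kept:
        if merged and merged[-1][1] == s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

def toy_score(start, end, target=(130.0, 142.0)):
    """Hypothetical scorer: fraction of the window that overlaps a known target interval."""
    overlap = max(0.0, min(end, target[1]) - max(start, target[0]))
    return overlap / (end - start)

print(coarse_to_fine(300.0, toy_score))
```

The fine stage only scores segments inside the selected coarse window, so the number of fine-grained evaluations stays fixed as the video gets longer; that is the practical appeal of a progressive scheme for long videos.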
[130] PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
Zhiwei Yang,Chen Gao,Mike Zheng Shou
Main category: cs.CV
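TL;DR: PANDA is an agentic AI engineer built on multimodal large language models (MLLMs) that targets generalist video anomaly detection without training data or human involvement. Through self-adaptive scene-aware strategy planning, goal-driven heuristic reasoning, tool-augmented self-reflection, and a self-improving chain-of-memory mechanism, PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex-scenario settings.
Details
Motivation: Existing video anomaly detection (VAD) methods rely on domain-specific training data and manual adjustment, which limits generalization and raises cost. PANDA aims at a generalist VAD system that automatically handles any scene and anomaly type without training or human involvement. Contribution: PANDA makes four core contributions: 1) a self-adaptive scene-aware retrieval-augmented generation (RAG) mechanism; 2) a latent anomaly-guided heuristic prompt strategy; 3) a progressive reflection mechanism with context-aware tools; 4) a chain-of-memory mechanism for continual performance improvement.
Method: PANDA works as follows: 1) the self-adaptive RAG mechanism retrieves anomaly-specific knowledge; 2) the heuristic prompt strategy improves reasoning precision; 3) progressive reflection and tools refine decisions; 4) the chain-of-memory mechanism uses historical experience to improve performance.
Result: Experiments show that PANDA reaches state-of-the-art performance in multi-scenario, open-set, and complex-scenario settings without training or human involvement, validating its generalization and robustness.
Insight: PANDA demonstrates the potential of agentic AI for generalist VAD; combining adaptive strategies with memory mechanisms suggests a new direction for designing future AI engineers.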
Abstract: Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
[131] MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
Chenhui Zhu,Yilu Wu,Shuai Wang,Gangshan Wu,Limin Wang
Main category: cs.CV
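TL;DR: MotionRAG takes a retrieval-augmented approach: it extracts and adapts motion priors from reference videos through Context-Aware Motion Adaptation (CAMA) to improve motion realism in image-to-video generation. Its core innovations are a retrieval pipeline, in-context-learning motion adaptation, and attention-based feature injection.
Details
Motivation: Although diffusion models have advanced image-to-video generation, producing videos with realistic motion remains challenging. Motion modeling involves physical constraints, object interactions, and domain-specific dynamics that are hard to generalize. Contribution: 1) proposes the MotionRAG framework, which improves motion realism through retrieval augmentation; 2) designs Context-Aware Motion Adaptation (CAMA) to adapt motion priors; 3) achieves zero-shot generalization that requires only updating the retrieval database.
Method: 1) A retrieval-based pipeline extracts high-level motion features; 2) a causal Transformer architecture performs in-context-learning motion adaptation; 3) an attention mechanism injects the motion features into a pretrained video diffusion model.
Result: Experiments show that MotionRAG significantly improves motion realism across multiple domains and base models, with negligible computational overhead at inference.
Insight: Retrieving and adapting motion priors is an effective way to improve the realism of generated video; the modular design supports zero-shot generalization and reduces the need for retraining.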
Abstract: Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
[132] PRISM: Progressive Rain removal with Integrated State-space Modeling
Pengze Xue,Shanwen Wang,Fei Zhou,Yan Cui,Xin Sun
Main category: cs.CV
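TL;DR: This paper proposes PRISM (Progressive Rain removal with Integrated State-space Modeling), a framework for image deraining. Its three-stage progressive pipeline (coarse extraction, frequency fusion, and refinement), combined with a Hybrid Attention UNet and a Hybrid Domain Mamba, significantly improves deraining quality.
Details
Motivation: Current single-scale models struggle to achieve both fine-grained recovery and global consistency in image deraining, which affects the accuracy of vision tasks such as autonomous driving. Contribution: 1. proposes the PRISM framework with three stages: CENet, SFNet, and RNet; 2. designs HA-UNet for multi-scale feature aggregation and HDMamba for jointly modeling spatial semantics and wavelet-domain characteristics; 3. RNet recovers fine-grained structures through an original-resolution subnetwork.
Method: PRISM uses a three-stage progressive pipeline: 1. CENet (Coarse Extraction Network) and SFNet (Frequency Fusion Network) use HA-UNet, which combines channel attention with windowed spatial transformers; 2. SFNet adds HDMamba to model spatial semantics and wavelet-domain characteristics; 3. RNet (Refine Network) recovers details through a subnetwork.
Result: PRISM is competitive on multiple datasets and outperforms existing deraining methods.
Insight: Through progressive processing and hybrid-domain modeling, PRISM balances fine-grained recovery with global consistency in deraining.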
Abstract: Image deraining is an essential vision technique that removes rain streaks and water droplets, enhancing clarity for critical vision tasks like autonomous driving. However, current single-scale models struggle with fine-grained recovery and global consistency. To address this challenge, we propose Progressive Rain removal with Integrated State-space Modeling (PRISM), a progressive three-stage framework: Coarse Extraction Network (CENet), Frequency Fusion Network (SFNet), and Refine Network (RNet). Specifically, CENet and SFNet utilize a novel Hybrid Attention UNet (HA-UNet) for multi-scale feature aggregation by combining channel attention with windowed spatial transformers. Moreover, we propose Hybrid Domain Mamba (HDMamba) for SFNet to jointly model spatial semantics and wavelet domain characteristics. Finally, RNet recovers the fine-grained structures via an original-resolution subnetwork. Our model learns high-frequency rain characteristics while preserving structural details and maintaining global context, leading to improved image quality. Our method achieves competitive results on multiple datasets against recent deraining methods.
[133] Multi-View Camera System for Variant-Aware Autonomous Vehicle Inspection and Defect Detection
Yash Kulkarni,Raman Jha,Renu Kachhoria
Main category: cs.CV
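TL;DR: The paper presents AVI, a multi-view camera system that checks variant specifications and surface defects on an automotive production line in real time, achieving high-accuracy inspection by combining multiple deep-learning modules with a semantic rule engine.
Details
Motivation: Modern automotive production lines must check every vehicle's variant specification and surface defects efficiently and accurately, which traditional methods struggle to do. Contribution: Presents AVI, the first publicly reported deployable multi-view vehicle inspection system, which combines deep-learning modules with a rule engine for high-accuracy real-time inspection.
Method: Eleven synchronized cameras capture a 360° view; the views are routed to specialised modules (such as YOLOv8 and EfficientNet) and then processed jointly by a fusion layer and a rule engine.
Result: On a mixed dataset, AVI reaches 93% verification accuracy and 86% defect-detection recall at a throughput of 3.3 vehicles/min.
Insight: A multi-view system that combines specialised modules with a rule engine can significantly improve the efficiency and accuracy of vehicle inspection.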
Abstract: Ensuring that every vehicle leaving a modern production line is built to the correct variant specification and is free from visible defects is an increasingly complex challenge. We present the Automated Vehicle Inspection (AVI) platform, an end-to-end, multi-view perception system that couples deep-learning detectors with a semantic rule engine to deliver variant-aware quality control in real time. Eleven synchronized cameras capture a full 360° sweep of each vehicle; task-specific views are then routed to specialised modules: YOLOv8 for part detection, EfficientNet for ICE/EV classification, Gemini-1.5 Flash for mascot OCR, and YOLOv8-Seg for scratch-and-dent segmentation. A view-aware fusion layer standardises evidence, while a VIN-conditioned rule engine compares detected features against the expected manifest, producing an interpretable pass/fail report in approximately 300 ms. On a mixed data set of Original Equipment Manufacturer (OEM) vehicle data sets of four distinct models plus public scratch/dent images, AVI achieves 93% verification accuracy, 86% defect-detection recall, and sustains 3.3 vehicles/min, surpassing single-view or no-segmentation baselines by large margins. To our knowledge, this is the first publicly reported system that unifies multi-camera feature validation with defect detection in a deployable automotive setting in industry.
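The rule-engine step, comparing detected features against the manifest expected for a given VIN, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AVI implementation: the manifest schema, the feature names, the `MANIFESTS` lookup table, and the demo VIN are all hypothetical.

```python
# Hypothetical manifest table keyed by VIN; a real system would query a
# production database. The feature names and values are illustrative only.
MANIFESTS = {
    "VIN_DEMO_EV": {"powertrain": "EV", "badge": "LX", "roof_rails": True},
}

def verify_vehicle(expected, detected):
    """Compare fused detections with the expected manifest.

    Returns an interpretable report: an overall pass/fail flag plus, for every
    feature that disagrees, the expected and detected values.
    """
    mismatches = {}
    for feature, want in expected.items():
        got = detected.get(feature)  # None means the feature was not detected
        if got != want:
            mismatches[feature] = {"expected": want, "detected": got}
    return {"passed": not mismatches, "mismatches": mismatches}

# Example: the detectors read the wrong badge for this variant.
detected = {"powertrain": "EV", "badge": "GT", "roof_rails": True}
print(verify_vehicle(MANIFESTS["VIN_DEMO_EV"], detected))
```

Keeping the comparison as explicit rules, rather than another learned model, is what makes the pass/fail report interpretable. For scale, the reported throughput of 3.3 vehicles/min corresponds to roughly 60 / 3.3 ≈ 18 s per vehicle.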
[134] Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting
Hanzhou Liu,Jia Huang,Mi Lu,Srikanth Saripalli,Peng Jiang
Main category: cs.CV
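TL;DR: Stylos is a single-forward-pass 3D Gaussian framework for style transfer that works on unposed content (from a single image to a multi-view collection), conditioned on a reference style image, with no per-scene optimization or precomputed poses.
Details
Motivation: Existing 3D style transfer methods usually require costly optimization or precomputed pose information, which limits their flexibility and scalability in practice. Stylos aims to remove these requirements. Contribution: 1. proposes a single-forward-pass 3D Gaussian style transfer framework; 2. injects style through global cross-attention; 3. designs a voxel-based 3D style loss that ensures view-consistent stylization and geometric fidelity; 4. demonstrates that the framework scales from a single view to multiple views.
Method: Stylos uses a Transformer with two pathways: geometry prediction retains self-attention to preserve geometric fidelity, while style is injected through global cross-attention to keep views consistent. A voxel-based 3D style loss aligns scene features with style statistics.
Result: Experiments on multiple datasets show that Stylos delivers high-quality zero-shot stylization, demonstrating the effectiveness of its global style-content coupling, its 3D style loss, and its multi-view scaling.
Insight: Combining global attention with a 3D style loss enables high-quality 3D style transfer without per-scene optimization, suggesting a new direction for future 3D vision tasks.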
Abstract: We present Stylos, a single-forward 3D Gaussian framework for 3D style transfer that operates on unposed content, from a single image to a multi-view collection, conditioned on a separate reference style image. Stylos synthesizes a stylized 3D Gaussian scene without per-scene optimization or precomputed poses, achieving geometry-aware, view-consistent stylization that generalizes to unseen categories, scenes, and styles. At its core, Stylos adopts a Transformer backbone with two pathways: geometry predictions retain self-attention to preserve geometric fidelity, while style is injected via global cross-attention to enforce visual consistency across views. With the addition of a voxel-based 3D style loss that aligns aggregated scene features to style statistics, Stylos enforces view-consistent stylization while preserving geometry. Experiments across multiple datasets demonstrate that Stylos delivers high-quality zero-shot stylization, highlighting the effectiveness of global style-content coupling, the proposed 3D style loss, and the scalability of our framework from single view to large-scale multi-view settings.
[135] Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification
Artur Barros,Carlos Caetano,João Macedo,Jefersson A. dos Santos,Sandra Avila
Main category: cs.CV
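TL;DR: The paper proposes ASGRA, a framework that converts images into scene graphs and applies a graph attention network to model the relationships between scene components directly, improving indoor scene classification and sensitive content analysis.
Details
Motivation: Indoor scene classification is challenging, particularly in sensitive content analysis such as child sexual abuse imagery (CSAI) classification, where traditional pixel-based methods struggle to capture complex object relationships and spatial layouts. Contribution: Proposes the ASGRA framework, which uses scene graphs and a graph attention network to model relationships between scene components directly, with benefits for explainability and privacy preservation.
Method: 1. Convert images into scene graphs; 2. run inference with a graph attention network that models interactions between objects.
Result: Reaches 81.27% balanced accuracy on the Places8 dataset, surpassing image-based methods, and 74.27% in a real-world CSAI evaluation.
Insight: Structured scene representations are a robust paradigm for indoor scene classification and sensitive content analysis, offering both explainability and privacy preservation.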
Abstract: Indoor scene classification is a critical task in computer vision, with wide-ranging applications that go from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene’s components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at https://github.com/tutuzeraa/ASGRA.
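Both results above are reported as balanced accuracy, which is the mean of the per-class recalls rather than the fraction of correct predictions. A minimal sketch of the metric, using made-up labels that are not taken from the paper:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Imbalanced toy data: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one of the two minority samples is missed
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain, balanced_accuracy(y_true, y_pred))
```

On this toy data plain accuracy is 0.9 while balanced accuracy is 0.75, which shows why the metric is preferred when class frequencies differ widely.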
[136] Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation
Miao Rang,Zhenni Bi,Hang Zhou,Hanting Chen,An Xiao,Tianyu Guo,Kai Han,Xinghao Chen,Yunhe Wang
Main category: cs.CV
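TL;DR: This paper proposes a systematic post-training pipeline that combines curriculum-based supervised fine-tuning (SFT) with offline on-policy knowledge distillation to improve small language models so that they reach high performance under strict hardware constraints.
Details
Motivation: Large language models (LLMs) are powerful, but their computational cost and scale make them unsuitable for edge environments. Small models suit edge deployment, yet pre-training alone rarely meets the requirements of complex tasks, so an efficient way to improve small models is needed. Contribution: Proposes a systematic post-training pipeline that combines curriculum learning with knowledge distillation, significantly improving small language models and bringing them to an advanced level on edge devices.
Method: 1. Curriculum-based supervised fine-tuning (SFT): task difficulty increases gradually. 2. Offline on-policy knowledge distillation: knowledge is transferred to the small model using data prepared offline.
Result: The final model performs well under strict hardware constraints, achieves leading performance among billion-parameter models, and remains competitive across a variety of tasks.
Insight: Post-training, especially knowledge distillation and curriculum learning, is key to improving small language models and offers a practical route to efficient models on edge devices.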
Abstract: The rapid advancement of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can operate efficiently at the edge. Yet, after pre-training alone, these smaller models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently enhances small model accuracy. Our post-training pipeline consists of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
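The abstract does not spell out the distillation objective, so the sketch below shows only the generic temperature-scaled KL distillation loss commonly used for knowledge distillation. It is an assumption that something of this form is involved; the paper's offline on-policy recipe, its data, and its hyperparameters are not reproduced here, and the logits are toy values.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention that keeps gradient magnitudes
    comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # identical logits
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # student disagrees
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge; a higher temperature exposes more of the teacher's relative preferences among non-top tokens.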
[137] Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Zhen Yang,Zi-Yi Dou,Di Feng,Forrest Huang,Anh Nguyen,Keen You,Omar Attia,Yuhao Yang,Michael Feng,Haotian Zhang,Ram Ramrakhya,Chao Jia,Jeffrey Nichols,Alexander Toshev,Yinfei Yang,Zhe Gan
Main category: cs.CV
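TL;DR: This paper presents Ferret-UI Lite, a small on-device GUI agent that achieves effective interaction across multiple platforms through a diverse data mixture, chain-of-thought reasoning, visual tool use, and reinforcement learning.
Details
Motivation: Building small on-device GUI agents that interact effectively across diverse platforms is a challenging open problem. Contribution: Presents Ferret-UI Lite, a compact 3B-parameter end-to-end GUI agent that runs on mobile, web, and desktop platforms.
Method: Improves the model by mixing real and synthetic data, applying chain-of-thought reasoning and visual tool use at inference time, and training with reinforcement learning.
Result: Strong GUI grounding results (ScreenSpot-V2: 91.6%, ScreenSpot-Pro: 53.3%, OSWorld-G: 61.2%) and moderate navigation success rates (AndroidWorld: 28.0%, OSWorld: 19.8%).
Insight: With a diverse data mixture and targeted optimization techniques, small on-device models can perform well on GUI interaction tasks.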
Abstract: Developing autonomous agents that effectively interact with Graphic User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Utilizing techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent through curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and reinforcement learning with designed rewards. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of 91.6%, 53.3%, and 61.2% on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of 28.0% on AndroidWorld and 19.8% on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
[138] Stable Cinemetrics : Structured Taxonomy and Evaluation for Professional Video Generation
Agneet Chatterjee,Rahim Entezari,Maksym Zhuravinskyi,Maksim Lapin,Reshinth Adithyan,Amit Raj,Chitta Baral,Yezhou Yang,Varun Jampani
Main category: cs.CV
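TL;DR: This paper introduces Stable Cinemetrics, a structured evaluation framework that addresses the complex requirements of professional video generation. By defining 76 fine-grained control nodes and building an automated evaluation pipeline, it reveals significant weaknesses in current video generation models, particularly in event and camera control.
Details
Motivation: Existing video generation models and benchmarks do not capture the complex requirements of professional video generation, so a structured way to evaluate and improve models is needed. Contribution: 1. introduces the Stable Cinemetrics framework; 2. defines 76 fine-grained control nodes; 3. builds an automated evaluation pipeline; 4. trains an automatic evaluator that outperforms zero-shot baselines.
Method: 1. Disentangle filmmaking controls into four hierarchical taxonomies: Setup, Event, Lighting, and Camera; 2. build a benchmark aligned with professional use cases; 3. develop an automated pipeline for prompt categorization and question generation; 4. run a large-scale human study and train an automatic evaluator.
Result: Even the strongest current models show significant gaps in event and camera control. The automatic evaluator outperforms existing zero-shot baselines.
Insight: Professional video generation calls for finer-grained control and disentangled evaluation; existing models still need improvement.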
Abstract: Recent advances in video generation have enabled high-fidelity video synthesis from user-provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analyses, both coarse and fine-grained, reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
[139] Autoproof: Automated Segmentation Proofreading for Connectomics
Gary B Huang,William M Katz,Stuart Berg,Louis Scheffer
Main category: cs.CV
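TL;DR: The paper proposes AutoProof, an automated segmentation proofreading method that trains machine learning models on existing human annotations to reduce the manual proofreading cost of electron microscopy connectomics.
Details
Motivation: Manual proofreading for electron microscopy connectomics is expensive and is the bottleneck for scaling connectomics and enabling comparative connectomics. Contribution: 1. proposes the AutoProof system, which uses machine learning to automate or optimize proofreading workflows; 2. validates it on a reconstruction of the Drosophila central nervous system.
Method: Train machine learning models on existing human annotation data to proofread segmentation results automatically.
Result: 1. In a guided proofreading workflow, cost falls by 80% while 90% of the value is retained; 2. 200 thousand segmentation fragments are merged automatically, equivalent to four years of manual work, raising the connectome's connectivity completion rate by 1.3 percentage points.
Insight: Machine learning can make effective use of existing human annotations to cut proofreading cost in connectomics substantially and improve efficiency.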
Abstract: Producing connectomes from electron microscopy (EM) images has historically required a great deal of human proofreading effort. This manual annotation cost is the current bottleneck in scaling EM connectomics, for example, in making larger connectome reconstructions feasible, or in enabling comparative connectomics where multiple related reconstructions are produced. In this work, we propose using the available ground-truth data generated by this manual annotation effort to learn a machine learning model to automate or optimize parts of the required proofreading workflows. We validate our approach on a recent complete reconstruction of the Drosophila male central nervous system. We first show our method would allow for obtaining 90% of the value of a guided proofreading workflow while reducing required cost by 80%. We then demonstrate a second application for automatically merging many segmentation fragments to proofread neurons. Our system is able to automatically attach 200 thousand fragments, equivalent to four proofreader years of manual work, and increases the connectivity completion rate of the connectome by 1.3 percentage points.
[140] Video Object Segmentation-Aware Audio Generation
Ilpo Viertola,Vladimir Iashin,Esa Rahtu
Main category: cs.CV
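TL;DR: The paper introduces the new task of video object segmentation-aware audio generation, which combines visual segmentation masks, video, and text to give fine-grained control over audio generation, and presents the SAGANet model and the Segmented Music Solos dataset.
Details
Motivation: Existing audio generation models lack precise user control and fall short of the needs of professional Foley workflows, particularly in prioritizing specific objects and managing background noise. Contribution: 1. introduces the task of video object segmentation-aware audio generation; 2. develops SAGANet, which supports multimodal controllable audio generation based on visual segmentation masks; 3. releases the Segmented Music Solos dataset to support further research.
Method: SAGANet combines visual segmentation masks, video frames, and text, and uses this multimodal conditioning to generate high-quality audio that is precisely aligned with the specified object.
Result: Experiments show that SAGANet substantially outperforms existing methods in controllability and audio fidelity, setting a new standard for high-precision Foley synthesis.
Insight: Introducing object-level control through visual segmentation masks is an important advance for multimodal audio generation and could help bring professional Foley synthesis into practical use.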
Abstract: Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models focus on the entire video and do not provide precise methods for prioritizing a specific object within a scene, generating unnecessary background sounds, or focusing on the wrong objects. To address this gap, we introduce the novel task of video object segmentation-aware audio generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained and visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at https://saganet.notion.site
[141] Hy-Facial: Hybrid Feature Extraction by Dimensionality Reduction Methods for Enhanced Facial Expression Classification
Xinjin Li,Yu Ma,Kaisen Ye,Jinghan Cao,Minghao Zhou,Yeyang Zhou
Main category: cs.CV
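TL;DR: Hy-Facial is a hybrid feature extraction framework that combines deep learning with traditional image processing and uses dimensionality reduction to improve facial expression classification.
Details
Motivation: Facial expression classification is challenging because image data is high-dimensional and complex; it needs a method that extracts rich features while reducing redundancy. Contribution: Proposes the Hy-Facial framework, which fuses deep features from VGG19 with traditional handcrafted features (SIFT, ORB) and improves feature quality through UMAP dimensionality reduction.
Method: Extract features with VGG19, SIFT, and ORB, apply K-means clustering and UMAP dimensionality reduction, then classify.
Result: Reaches 83.3% classification accuracy on the FER dataset, showing the importance of dimensionality reduction for feature quality and classification performance.
Insight: Dimensionality reduction is not just a pre-processing step but an essential component for improving feature quality and classification performance.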
Abstract: Facial expression classification remains a challenging task due to the high dimensionality and inherent complexity of facial image data. This paper presents Hy-Facial, a hybrid feature extraction framework that integrates both deep learning and traditional image processing techniques, complemented by a systematic investigation of dimensionality reduction strategies. The proposed method fuses deep features extracted from the Visual Geometry Group 19-layer network (VGG19) with handcrafted local descriptors from the scale-invariant feature transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) algorithms, to obtain rich and diverse image representations. To mitigate feature redundancy and reduce computational complexity, we conduct a comprehensive evaluation of dimensionality reduction techniques and feature extraction. Among these, UMAP is identified as the most effective, preserving both local and global structures of the high-dimensional feature space. The Hy-Facial pipeline integrates VGG19, SIFT, and ORB for feature extraction, followed by K-means clustering and UMAP for dimensionality reduction, resulting in a classification accuracy of 83.3% on the facial expression recognition (FER) dataset. These findings underscore the pivotal role of dimensionality reduction not only as a pre-processing step but as an essential component in improving feature quality and overall classification performance.
[142] DA$^2$: Depth Anything in Any Direction
Haodong Li,Wangguangdong Zheng,Jing He,Yuhao Liu,Xin Lin,Xin Yang,Ying-Cong Chen,Chunchao Guo
Main category: cs.CV
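TL;DR: DA$^2$ is an end-to-end panoramic depth estimator that scales up training data with a data curation engine and introduces SphereViT to handle spherical distortion, achieving zero-shot generalization and high efficiency.
Details
Motivation: Panoramic data is scarce and spherical distortion is severe; existing methods generalize poorly and run inefficiently, so a depth estimator that handles panoramas efficiently and generalizes well is needed. Contribution: 1. proposes a data curation engine that converts perspective images into panoramic depth data, enlarging the dataset; 2. designs SphereViT, which uses spherical coordinates to improve feature consistency; 3. achieves zero-shot generalization that surpasses in-domain methods.
Method: Create high-quality panoramic data with the data curation engine and model spherical geometry with SphereViT, yielding end-to-end panoramic depth estimation.
Result: State-of-the-art results on multiple datasets, with an average 38% improvement in AbsRel and higher efficiency than fusion-based methods.
Insight: Large-scale data generation and spherical geometric modeling are key to panoramic depth estimation, and the end-to-end design substantially improves efficiency and generalization.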
Abstract: Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose DA$^2$: Depth Anything in Any Direction, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^2$’s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^2$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^2$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data will be released. Project page: https://depth-any-in-any-dir.github.io/.
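Two ingredients mentioned above can be made concrete with a short sketch: the spherical coordinates of a pixel in a 360°×180° equirectangular panorama, and the AbsRel metric. This is not SphereViT; the axis convention (longitude zero at the image centre, latitude positive toward the top) is an assumption for illustration, and the depth values are toy numbers.

```python
import math

def pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (longitude, latitude) in radians.

    Assumed convention: longitude in [-pi, pi) with 0 at the image centre,
    latitude in (-pi/2, pi/2) with positive values toward the top row.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return lon, lat

def sphere_to_ray(lon, lat):
    """Unit viewing direction (x right, y up, z forward) for given angles."""
    return (math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon))

def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid (gt > 0) pixels."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

# The centre of a 360x180 panorama looks straight ahead.
lon, lat = pixel_to_sphere(179.5, 89.5, 360, 180)
print(sphere_to_ray(lon, lat))
# Third pixel has no ground truth (0.0) and is ignored.
print(abs_rel([1.1, 2.0, 5.0], [1.0, 2.0, 0.0]))
```

Every pixel in a row shares one latitude, and rows near the poles cover far less solid angle than rows near the equator. That uneven sampling is the spherical distortion that motivates giving the network explicit spherical coordinates.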
[143] Benchmarking Egocentric Visual-Inertial SLAM at City Scale
Anusha Krishnan,Shaohui Liu,Paul-Edouard Sarlin,Oscar Gentilhomme,David Caruso,Maurizio Monge,Richard Newcombe,Jakob Engel,Marc Pollefeys
Main category: cs.CV
TL;DR: 该论文提出了一个新的数据集和基准测试,用于城市尺度的视觉-惯性SLAM(同时定位与建图),专注于解决穿戴设备在复杂运动、动态内容和长时任务中的挑战。
Details
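Motivation: Existing SLAM benchmarks do not adequately reflect the challenges wearable devices face with complex motion, dynamic content, and long sessions, and they lack sufficiently accurate ground-truth poses. Contribution: Introduces a city-scale visual-inertial SLAM dataset and benchmark with centimeter-accurate ground-truth poses, covering a diverse set of challenging scenarios.
Method: Surveying tools are used to obtain control points that serve as indirect pose annotations, and multi-modal sensor recordings capture hours and kilometers of trajectories.
Result: State-of-the-art SLAM systems developed by academia are not robust to these challenges; tracks with different difficulty levels support an in-depth analysis of where the systems fail.
Insight: A city-scale SLAM dataset exposes the limitations of existing systems in complex environments and gives future research an evaluation tool and concrete directions for improvement.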
Abstract: Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at https://www.lamaria.ethz.ch.
[144] Query-Kontext: An Unified Multimodal Model for Image Generation and Editing
Yuxin Song,Wenkai Dong,Shizun Wang,Qi Zhang,Song Xue,Tao Yuan,Hu Yang,Haocheng Feng,Hang Zhou,Xinyan Xiao,Jingdong Wang
Main category: cs.CV
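TL;DR: Query-Kontext is a unified multimodal model that couples a vision-language model (VLM) with a diffusion model for high-quality image generation and editing. Its core ideas are a multimodal “kontext” design and a progressive training strategy.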
Details
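Motivation: In existing unified multimodal models (UMMs), multimodal generative reasoning (instruction understanding, grounding, and image referring) is intrinsically entangled with high-fidelity synthesis, which limits performance. Contribution: 1. Proposes the Query-Kontext framework, which bridges the VLM and the diffusion model through a multimodal “kontext”; 2. designs a three-stage progressive training strategy; 3. builds a comprehensive data pipeline that supports diverse tasks.
Method: 1. A multimodal “kontext” combines semantic cues with coarse-grained image conditions; 2. three-stage training: first connect the VLM to a lightweight diffusion head, then scale the head to a large pre-trained diffusion model, and finally add a low-level image encoder; 3. the data pipeline integrates real, synthetic, and open-source data.
Result: Experiments show that Query-Kontext matches strong unified baselines on image generation and editing and in several cases outperforms task-specific models.
Insight: Separating the roles of multimodal generative reasoning (handled by the VLM) and high-quality synthesis (handled by the diffusion model) is key to improving unified models.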
Abstract: Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I), whether instantiated as assembled unified frameworks which couple powerful vision-language model (VLM) with diffusion-based generator, or as naive Unified Multimodal Models with an early fusion of understanding and generation modalities. We contend that in current unified frameworks, the crucial capability of multimodal generative reasoning which encompasses instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal “kontext” composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex ability of multimodal generative reasoning to powerful VLM while reserving diffusion model’s role for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unleash the VLM’s generative reasoning ability. Second, we scale this head to a large, pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks. Furthermore, we build a comprehensive data pipeline integrating real, synthetic, and open-source datasets, covering diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
[145] Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Jessica Bader,Mateusz Pach,Maria A. Bravo,Serge Belongie,Zeynep Akata
Main category: cs.CV
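TL;DR: Stitch is a training-free method that adds external position control to Multi-Modal Diffusion Transformers (MMDiT) through automatically generated bounding boxes, producing images that are both spatially accurate and visually appealing.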
Details
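Motivation: Existing T2I generation models struggle to capture spatial relationships such as “above” or “to the right of”. Earlier methods addressed this with external position control, but those techniques are incompatible with modern architectures. Stitch is proposed to fill this gap. Contribution: 1. Proposes Stitch, a training-free position control method; 2. introduces the PosEval benchmark, which extends position-based T2I tasks; 3. achieves state-of-the-art results on leading models, substantially improving spatial relationship generation.
Method: Stitch generates individual objects inside automatically generated bounding boxes and stitches them together. Targeted attention heads capture the information needed to isolate and cut out each object mid-generation, without fully completing the image.
Result: Stitch consistently improves Qwen-Image, FLUX, and SD3.5. It improves FLUX by 218% on GenEval's Position task and by 206% on PosEval, and with Qwen-Image it achieves state-of-the-art results on PosEval, improving over previous models by 54%.
Insight: 1. Specific attention heads effectively capture spatial relationship information; 2. training-free position control can be integrated into modern models; 3. PosEval shows that even top models still fall short on position-based generation.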
Abstract: Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like “above” or “to the right of” poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval’s Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
cs.GR [Back]
[146] Vector sketch animation generation with differentialable motion trajectories
Xinding Zhu,Xinye Yang,Shuyang Zheng,Zhexin Zhang,Fei Gao,Jing Huang,Jiazhou Chen
Main category: cs.GR
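TL;DR: This paper proposes an end-to-end method for generating vector sketch animation based on Differentiable Motion Trajectories (DMT), which resolves inter-frame flickering and improves semantic consistency and temporal coherence.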
Details
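Motivation: Video-based sketch animation generation is very challenging because of its temporal coherence requirement, and existing methods do not handle flickering well, so a new approach is needed. Contribution: Proposes the Differentiable Motion Trajectory (DMT) representation, which describes the frame-wise movement of stroke control points with polynomial trajectories and improves semantic consistency and temporal coherence.
Method: Polynomial trajectories in a Bernstein basis balance the sensitivity of the parameters; sparse track points provide explicit spatial modeling, which improves efficiency and supports long-duration video.
Result: Experiments on the DAVIS and LVOS datasets show that the method outperforms existing methods, and cross-domain validation confirms its robustness and compatibility.
Insight: Through global semantic gradient propagation and multi-frame optimization, DMT offers an efficient and stable approach to vector animation generation.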
Abstract: Sketching is a direct and inexpensive means of visual expression. Though image-based sketching has been well studied, video-based sketch animation generation is still very challenging due to the temporal coherence requirement. In this paper, we propose a novel end-to-end automatic generation approach for vector sketch animation. To solve the flickering issue, we introduce a Differentiable Motion Trajectory (DMT) representation that describes the frame-wise movement of stroke control points using differentiable polynomial-based trajectories. DMT enables global semantic gradient propagation across multiple frames, significantly improving the semantic consistency and temporal coherence, and producing high-framerate output. DMT employs a Bernstein basis to balance the sensitivity of polynomial parameters, thus achieving more stable optimization. Instead of implicit fields, we introduce sparse track points for explicit spatial modeling, which improves efficiency and supports long-duration video processing. Evaluations on DAVIS and LVOS datasets demonstrate the superiority of our approach over SOTA methods. Cross-domain validation on 3D models and text-to-video data confirms the robustness and compatibility of our approach.
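The abstract describes trajectories of stroke control points expressed as polynomials in a Bernstein basis. The following is a minimal sketch of evaluating one such 2D trajectory; the function names and the coefficient layout are assumptions made for illustration and are not taken from the paper.

```python
from math import comb


def bernstein_basis(n, i, t):
    """i-th Bernstein basis polynomial of degree n, evaluated at t in [0, 1]."""
    return comb(n, i) * t**i * (1 - t) ** (n - i)


def trajectory_point(coeffs, t):
    """Position at normalized time t of a control point whose motion is a
    Bernstein-basis polynomial with 2D coefficients coeffs = [(x0, y0), ...]."""
    n = len(coeffs) - 1
    x = sum(bernstein_basis(n, i, t) * cx for i, (cx, _) in enumerate(coeffs))
    y = sum(bernstein_basis(n, i, t) * cy for i, (_, cy) in enumerate(coeffs))
    return (x, y)


# A degree-2 trajectory sampled at 5 frame times (hypothetical coefficients).
coeffs = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
frames = [trajectory_point(coeffs, k / 4) for k in range(5)]
```

Because the basis functions are non-negative and sum to one on [0, 1], every coefficient influences the curve on a comparable scale, which is one plausible reason such a basis gives better-conditioned optimization than raw monomial coefficients.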
eess.AS [Back]
[147] TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
Yi-Cheng Lin,Yu-Hua Chen,Jia-Kai Dong,Yueh-Hsuan Huang,Szu-Chi Chen,Yu-Chen Chen,Chih-Yao Chen,Yu-Jung Lin,Yu-Ling Chen,Zih-Yu Chen,I-Ning Tsai,Hsiu-Hsuan Wang,Ho-Lam Chung,Ke-Han Lu,Hung-yi Lee
Main category: eess.AS
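TL;DR: TAU (Taiwan Audio Understanding) is a benchmark for understanding culturally distinctive audio. A carefully designed pipeline produces 702 audio clips and 1,794 multiple-choice items, revealing the limits of current state-of-the-art audio-language models on culturally specific sounds.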
Details
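Motivation: Existing evaluations of audio-language models focus on speech or globally sourced sounds and overlook culturally distinctive cues, so models perform poorly in these settings. Contribution: Proposes the TAU benchmark, focused on everyday Taiwanese “soundmarks” and built through human editing and LLM-assisted question generation, filling a gap in the evaluation of culturally distinctive audio.
Method: A pipeline combining human editing with LLM-assisted question generation yields a dataset of 702 audio clips and 1,794 multiple-choice items that cannot be solved from transcripts alone.
Result: Experiments show that state-of-the-art models, including Gemini 2.5 and Qwen2-Audio, perform far below local humans on TAU.
Insight: TAU exposes a blind spot of current audio-language models on culturally distinctive audio and underlines the need for evaluation tools that serve diverse communities.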
Abstract: Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese “soundmarks.” TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.
cs.AI [Back]
[148] Artificial Phantasia: Evidence for Propositional Reasoning-Based Mental Imagery in Large Language Models
Morgan McCarty,Jorge Morales
Main category: cs.AI
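TL;DR: Using novel items of a classic mental imagery task, this paper tests large language models (LLMs) on a task traditionally thought to depend on visual imagery. The best LLMs perform above the human average, and the paper discusses the possible role of propositional reasoning in mental imagery tasks.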
Details
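Motivation: The study asks whether LLMs, which lack a visual architecture, can complete tasks traditionally thought to depend on visual mental imagery, and what this reveals about their cognitive capacities and representational formats. Contribution: 1. Proposes a new benchmark task that tests LLMs on a task argued to be unsolvable through language alone; 2. finds that the best LLMs outperform humans on the mental imagery task; 3. provides new evidence for the debate over the representational format of mental imagery.
Method: 1. Creates text versions of a classic mental imagery task; 2. tests several state-of-the-art LLMs and 100 human subjects; 3. varies the level of reasoning allocated to reasoning models to measure its effect on performance.
Result: The best LLMs perform significantly above average human performance, with the strongest results when models allocate more reasoning tokens.
Insight: The results suggest that propositional reasoning may be sufficient for tasks long thought to be imagery-dependent, challenging the assumption that such tasks require visual imagery.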
Abstract: This study offers a novel approach for benchmarking complex cognitive behavior in artificial systems. Almost universally, Large Language Models (LLMs) perform best on tasks which may be included in their training data and can be accomplished solely using natural language, limiting our understanding of their emergent sophisticated cognitive capacities. In this work, we created dozens of novel items of a classic mental imagery task from cognitive psychology. A task which, traditionally, cognitive psychologists have argued is solvable exclusively via visual mental imagery (i.e., language alone would be insufficient). LLMs are perfect for testing this hypothesis. First, we tested several state-of-the-art LLMs by giving text-only models written instructions and asking them to report the resulting object after performing the transformations in the aforementioned task. Then, we created a baseline by testing 100 human subjects in exactly the same task. We found that the best LLMs performed significantly above average human performance. Finally, we tested reasoning models set to different levels of reasoning and found the strongest performance when models allocate greater amounts of reasoning tokens. These results provide evidence that the best LLMs may have the capability to complete imagery-dependent tasks despite the non-pictorial nature of their architectures. Our study not only demonstrates an emergent cognitive capacity in LLMs while performing a novel task, but it also provides the field with a new task that leaves lots of room for improvement in otherwise already highly capable models. Finally, our findings reignite the debate over the formats of representation of visual imagery in humans, suggesting that propositional reasoning (or at least non-imagistic reasoning) may be sufficient to complete tasks that were long-thought to be imagery-dependent.
[149] TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan,Zijie Meng,Dianqi Li,Shiyu Wang,Chao-Han Huck Yang,Qingsong Wen,Zuozhu Liu,Sabato Marco Siniscalchi,Ming Jin,Shirui Pan
Main category: cs.AI
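TL;DR: The paper introduces the Time Series Reasoning Suite (TSR-Suite) and the TimeOmni-1 model to address the lack of deep reasoning in multimodal time series datasets. TSR-Suite defines four atomic tasks spanning three core capabilities (perception, extrapolation, and decision-making), and TimeOmni-1 is a unified time series reasoning model with strong out-of-distribution generalization.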
Details
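Motivation: Existing time series datasets mostly stay at the level of surface alignment and question answering. The lack of tasks and data that require genuine reasoning has limited progress on time series reasoning models, so an evaluation suite and high-quality data that support deep reasoning are needed. Contribution: 1. Proposes TSR-Suite, which formalizes four atomic tasks covering the core capabilities of time series reasoning; 2. releases a dataset of more than 23K samples, 2.3K of which are carefully curated; 3. proposes TimeOmni-1, which gains strong reasoning and generalization through multi-stage training.
Method: 1. Builds the dataset with human-guided hierarchical annotation; 2. designs a mixture of task scenarios, novel reward functions, and tailored optimizations; 3. trains TimeOmni-1 in multiple stages, integrating perception, extrapolation, and decision-making.
Result: TimeOmni-1 performs well across all tasks. It outperforms GPT-4.1 in causality discovery accuracy (64.0% vs. 35.9%) and raises the valid response rate on event-aware forecasting by over 6% compared to GPT-4.1.
Insight: Time series reasoning needs to move from basic pattern recognition toward advanced understanding and decision-making; TSR-Suite and TimeOmni-1 provide a standardized evaluation and a unified model framework for that shift.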
Abstract: Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.
[150] A Formal Comparison Between Chain-of-Thought and Latent Thought
Kevin Xu,Issei Sato
Main category: cs.AI
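TL;DR: This paper compares two reasoning approaches, Chain-of-Thought (CoT) and Latent Thought. CoT generates intermediate steps in natural language, whereas Latent Thought operates directly in a continuous latent space and supports parallel computation. CoT, in turn, is better suited to problems where exact computation is intractable.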
Details
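Motivation: The comparative capabilities of CoT and Latent Thought remain underexplored; the study compares their strengths and the settings each one suits. Contribution: Provides a formal analysis showing that Latent Thought in Looped Transformers enables parallel computation and is therefore more efficient, while CoT uses stochastic decoding to approximate solutions to problems where exact computation is intractable.
Method: A formal analysis compares the reasoning mechanisms of CoT and Latent Thought and delineates the advantages and scope of each.
Result: The analysis shows that Latent Thought is better suited to parallelizable computation, while CoT is better suited to approximately solving problems that are intractable to compute exactly.
Insight: The choice of reasoning paradigm should follow the task: prefer Latent Thought when efficient parallel computation is needed, and CoT when approximate solutions are needed.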
Abstract: Chain-of-Thought (CoT) elicits reasoning in large language models by explicitly generating intermediate steps in natural language. In contrast, Latent Thought in looped models operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that Latent Thought in Looped Transformers enables parallel computation, which is more efficient than the inherently sequential process of CoT. In contrast, CoT leverages stochastic decoding to approximate solutions to problems where exact computation is intractable. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms. Code is available at https://github.com/kevin671/cot-vs-loop.
[151] Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
Tianrui Qin,Qianben Chen,Sinuo Wang,He Xing,King Zhu,He Zhu,Dingfeng Shi,Xinxin Liu,Ge Zhang,Jiaheng Liu,Yuchen Eleanor Jiang,Xitong Gao,Wangchunshu Zhou
Main category: cs.AI
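TL;DR: Flash-Searcher is a DAG-based parallel agent reasoning framework. It decomposes complex tasks into subtasks with explicit dependencies so that independent reasoning paths run concurrently, improving both execution efficiency and task accuracy.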
Details
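Motivation: Existing LLM agent frameworks rely mainly on sequential processing, which makes tasks requiring extensive tool interaction inefficient. Flash-Searcher aims to improve efficiency and performance by parallelizing reasoning paths. Contribution: Proposes Flash-Searcher, a DAG-based parallel agent reasoning framework that supports dynamic workflow optimization and concurrent execution, reducing execution steps while improving task accuracy.
Method: Complex tasks are decomposed into subtasks arranged in a DAG whose explicit dependencies allow parallel execution. The framework dynamically refines the execution graph based on intermediate results and integrates a summary module.
Result: Flash-Searcher reaches 67.7% accuracy on BrowseComp and 83% on xbench-DeepSearch while reducing agent execution steps by up to 35%.
Insight: A DAG structure with parallel execution can substantially improve LLM efficiency on complex reasoning tasks, and the pipeline also shows promise when distilled into a single model.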
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks when equipped with external tools. However, current frameworks predominantly rely on sequential processing, leading to inefficient execution particularly for tasks requiring extensive tool interaction. This paper introduces Flash-Searcher, a novel parallel agent reasoning framework that fundamentally reimagines the execution paradigm from sequential chains to directed acyclic graphs (DAGs). Flash-Searcher decomposes complex tasks into subtasks with explicit dependencies, enabling concurrent execution of independent reasoning paths while maintaining logical constraints. Through dynamic workflow optimization, our framework continuously refines the execution graph based on intermediate results, effectively integrating summary module. Comprehensive evaluations across multiple benchmarks demonstrate that Flash-Searcher consistently outperforms existing approaches. Specifically, it achieves 67.7% accuracy on BrowseComp and 83% on xbench-DeepSearch, while reducing agent execution steps by up to 35% compared to current frameworks. Furthermore, when distilling this parallel reasoning pipeline into single models, we observe substantial performance gains across diverse backbone architectures, underscoring the generalizability of our methodology. Our work thus represents a significant advance in agent architecture design, offering a more scalable and efficient paradigm for complex reasoning tasks.
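The core idea in the abstract, running independent subtasks concurrently while respecting explicit dependencies, can be illustrated with a small scheduler. This is a simplified sketch rather than the authors' implementation: `run_dag`, the task names, and the wave-by-wave scheduling are all assumptions, and the dynamic refinement of the graph described in the abstract is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def run_dag(tasks, deps, max_workers=4):
    """Execute subtasks in dependency order, running independent ones concurrently.

    tasks: name -> callable taking a dict of already-computed results.
    deps:  name -> set of task names that must finish first.
    Returns (results, waves), where waves lists which tasks ran together.
    """
    results = {}
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            # A task is ready once all of its dependencies have results.
            ready = sorted(n for n, d in remaining.items() if d.issubset(results))
            if not ready:
                raise ValueError("cycle or unknown dependency in task graph")
            snapshot = dict(results)
            futures = {n: pool.submit(tasks[n], snapshot) for n in ready}
            for name, future in futures.items():
                results[name] = future.result()
                del remaining[name]
            waves.append(ready)
    return results, waves


# Hypothetical subtasks: two independent searches feed one summary step.
tasks = {
    "search_a": lambda r: "fact A",
    "search_b": lambda r: "fact B",
    "summarize": lambda r: r["search_a"] + " + " + r["search_b"],
}
deps = {"summarize": {"search_a", "search_b"}}
results, waves = run_dag(tasks, deps)
```

Here the two searches share a wave and the summary waits for both, so three sequential agent steps collapse into two scheduling rounds. A production scheduler would more likely dispatch each task as soon as its own dependencies finish instead of waiting for the whole wave.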
[152] Spontaneous High-Order Generalization in Neural Theory-of-Mind Networks
Yiming Wang,Rui Wang
Main category: cs.AI
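TL;DR: This paper shows that neural networks can, like humans, spontaneously generalize from first-order to higher-order Theory-of-Mind (ToM) without relying on advanced skills. The proposed ToMNN simulates a minimal cognitive system and exhibits higher-order ToM ability after acquiring only first-order ToM competence.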
Details
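Motivation: Humans progress from first-order to higher-order ToM within a short span, before formal education or advanced skill acquisition, whereas autoregressive language models make this step only alongside gains in advanced skills. The paper asks whether neural networks can make it independently, as humans do. Contribution: Shows that neural networks can spontaneously generalize to higher-order ToM without relying on advanced skills. The proposed ToMNN also exhibits generalization patterns consistent with human cognitive difficulty, and the results hold across settings.
Method: The authors introduce a neural Theory-of-Mind network (ToMNN) that simulates a minimal cognitive system and is trained only for first-order ToM competence. They then evaluate it on second- and third-order ToM tasks and analyze generalization patterns and the effect of task complexity.
Result: ToMNN's accuracy on second- and third-order ToM tasks is well above chance, and its generalization pattern matches human cognitive difficulty (the drop from first to second order is sharper than from second to higher orders). The results are confirmed across different parameter scales.
Insight: Machine ToM generalization patterns can resemble human cognitive development, offering a foundation for building more human-like cognitive systems.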
Abstract: Theory-of-Mind (ToM) is a core human cognitive capacity for attributing mental states to self and others. Wimmer and Perner demonstrated that humans progress from first- to higher-order ToM within a short span, completing this development before formal education or advanced skill acquisition. In contrast, neural networks represented by autoregressive language models progress from first- to higher-order ToM only alongside gains in advanced skills like reasoning, leaving open whether their trajectory can unfold independently, as in humans. In this research, we provided evidence that neural networks could spontaneously generalize from first- to higher-order ToM without relying on advanced skills. We introduced a neural Theory-of-Mind network (ToMNN) that simulated a minimal cognitive system, acquiring only first-order ToM competence. Evaluations of its second- and third-order ToM abilities showed accuracies well above chance. Also, ToMNN exhibited a sharper decline when generalizing from first- to second-order ToM than from second- to higher orders, and its accuracy decreased with greater task complexity. These perceived difficulty patterns were aligned with human cognitive expectations. Furthermore, the universality of results was confirmed across different parameter scales. Our findings illuminate machine ToM generalization patterns and offer a foundation for developing more human-like cognitive systems.
[153] Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search
Yingqian Cui,Zhenwei Dai,Pengfei He,Bing He,Hui Liu,Xianfeng Tang,Jingying Zeng,Suhang Wang,Yue Xing,Jiliang Tang,Benoit Dumoulin
Main category: cs.AI
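TL;DR: This paper proposes a dual-phase test-time reasoning framework that splits reasoning into a planning phase and an execution phase and guides the search in each phase with its own reward model, improving efficiency and accuracy.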
Details
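Motivation: Existing tree-based search methods improve accuracy but ignore the planning-execution nature of tasks, which makes exploration of the reasoning process inefficient. Contribution: Proposes a dual-phase test-time scaling framework that allocates computation dynamically and uses separate reward models to guide search in the planning and execution phases.
Method: Reasoning trajectories are decomposed into planning and execution phases, a reward model is developed for each phase, and a dynamic budget allocation mechanism adaptively redistributes computation.
Result: Experiments on mathematical reasoning and code generation benchmarks show that the method improves accuracy while reducing redundant computation.
Insight: Explicitly separating planning from execution and allocating resources dynamically improves both the efficiency and the performance of reasoning.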
Abstract: Large Language Models (LLMs) have achieved significant advances in reasoning tasks. A key approach is tree-based search with verifiers, which expand candidate reasoning paths and use reward models to guide pruning and selection. Although effective in improving accuracy, these methods are not optimal in terms of efficiency: they perform simple decomposition on the reasoning process, but ignore the planning-execution nature of tasks such as math reasoning or code generation. This results in inefficient exploration of reasoning process. To address this, we propose a dual-phase test-time scaling framework that explicitly separates reasoning into planning and execution, and performs search over the two phases individually. Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately. We further introduce a dynamic budget allocation mechanism that adaptively redistributes sampling effort based on reward feedback, allowing early stopping on confident steps and reallocation of computation to more challenging parts of the reasoning process. Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.
[154] DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Fang Wu,Weihao Xuan,Heli Qi,Ximing Lu,Aaron Tu,Li Erran Li,Yejin Choi
Main category: cs.AI
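TL;DR: DeepSearch integrates Monte Carlo Tree Search (MCTS) directly into RLVR training, addressing the performance bottleneck that sparse exploration causes in existing methods and improving both the efficiency and the accuracy of reasoning models.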
Details
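Motivation: Existing RLVR methods rely on a limited number of rollouts, so training plateaus and performance gains stall. DeepSearch aims to broaden exploration through structured search. Contribution: 1) A global frontier selection strategy; 2) entropy-guided path selection; 3) adaptive replay buffer training with solution caching for efficiency.
Method: Monte Carlo Tree Search is embedded in the training loop and combined with global frontier selection, entropy-guided path selection, and an adaptive replay buffer.
Result: On mathematical reasoning benchmarks, DeepSearch reaches 62.95% average accuracy with 5.7x fewer GPU hours than extended training approaches, setting a new state of the art for 1.5B reasoning models.
Insight: Strategic exploration is more effective than brute-force scaling of computation, pointing to a new direction for RLVR methods.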
Abstract: Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
[155] IRIS: Intrinsic Reward Image Synthesis
Yihang Chen,Yuanhao Ban,Yunqi Hong,Cho-Jui Hsieh
Main category: cs.AI
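TL;DR: IRIS is a framework for autoregressive text-to-image (T2I) generation that needs no external reward. It improves image generation by maximizing the model's own uncertainty, with performance competitive with or superior to methods based on external rewards.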
Details
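Motivation: Reinforcement Learning from Human Feedback (RLHF) works well for language reasoning, but its use in autoregressive T2I generation is constrained by scarce human preference data. The paper explores how a T2I model can improve using internal signals alone. Contribution: IRIS is the first framework to improve autoregressive T2I models with reinforcement learning using only an intrinsic reward. It shows that maximizing uncertainty, not certainty, yields images better aligned with human preferences.
Method: IRIS uses the model's own uncertainty as an intrinsic reward to guide reinforcement learning of the autoregressive T2I model, steering it away from overly simple and uniform images.
Result: Experiments show that IRIS, without any external reward, achieves image quality competitive with or superior to methods that use external rewards.
Insight: Uncertainty in autoregressive T2I models has a positive effect on image diversity and quality, suggesting a new route to image generation without labeled data.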
Abstract: Despite the success of Reinforcement Learning from Human Feedback (RLHF) in language reasoning, its application to autoregressive Text-to-Image (T2I) generation is often constrained by the limited availability of human preference data. This paper explores how an autoregressive T2I model can learn from internal signals without relying on external rewards or labeled data. Contrary to recent findings in text generation, we show that maximizing self-uncertainty, rather than self-certainty, improves image generation. We observe that this is because autoregressive T2I models with low uncertainty tend to generate simple and uniform images, which are less aligned with human preferences. Based on these observations, we propose IRIS (Intrinsic Reward Image Synthesis), the first framework to improve autoregressive T2I models with reinforcement learning using only an intrinsic reward. Empirical results demonstrate that applying IRIS to autoregressive T2I models achieves performance that is competitive with or superior to external rewards.
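One common way to quantify a model's uncertainty over a generated sequence is the average Shannon entropy of its per-step next-token distributions. The sketch below illustrates that idea only; the abstract does not give the exact definition of self-uncertainty, so equating it with mean token entropy is an assumption, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def self_uncertainty_reward(step_distributions):
    """Mean per-step entropy over a generated token sequence.

    Assumption: a higher value stands in for the 'self-uncertainty' that the
    abstract says should be maximized; the paper may define it differently.
    """
    entropies = [token_entropy(d) for d in step_distributions]
    return sum(entropies) / len(entropies)


# Two decoding steps over a toy 4-token vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01], [1.0, 0.0, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
r_conf = self_uncertainty_reward(confident)
r_unc = self_uncertainty_reward(uncertain)
```

Under this proxy, the near-deterministic sequence receives a lower reward than the more spread-out one, consistent with the abstract's observation that low-uncertainty models tend to produce simple, uniform images.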
[156] Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models
Max Hartman,Vidhata Jayaraman,Moulik Choraria,Akhil Bhimaraju,Lav R. Varshney
Main category: cs.AI
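TL;DR: This paper presents a theoretical and experimental framework for determining which layers of a vision-language model (VLM) can be skipped to speed up inference while preserving performance. Using information and learning theory, the authors characterize when layer skipping is appropriate and show that these conditions agree with popular methods used in practice.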
Details
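Motivation: VLM inference is costly, and selectively skipping layers can improve efficiency, but theoretical guidance has been lacking. The study aims to fill this gap. Contribution: Proposes a theoretical and experimental framework that identifies the conditions under which layer skipping is appropriate and shows that these conditions closely match the layers chosen by practical methods.
Method: Information and learning theory are used to analyze how hidden representations evolve, redundant layers are identified, and experiments verify the effect of skipping them.
Result: Skipping layers that satisfy the theoretical conditions speeds up inference without sacrificing performance, whereas skipping other layers degrades the model.
Insight: The agreement between theory and practice suggests that information and learning theory can provide a unified explanation for efficient inference methods.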
Abstract: Vision-language models (VLMs) achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work shows that selectively skipping VLM layers can improve efficiency with minimal performance loss or even performance improvements. However, this technique remains underused due to the limited understanding of when layer skipping is beneficial. In this paper, we develop a framework that uses information and learning theory to characterize the conditions under which layer skipping enhances efficiency without sacrificing performance. Motivated by these observations, we analyze the evolution of the VLM’s hidden representations through the LLM backbone and show that layers with large redundancy as predicted by our framework coincide with those skipped by popular layer-skipping methods in practice, providing a unified theoretical scaffolding for multiple efficient inference techniques. Our experiments demonstrate that skipping such layers yields faster inference that preserves performance, and also show that applying skipping outside these conditions leads to model degradation.
[157] ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning
Jihye Choi,Jinsung Yoon,Jiefeng Chen,Somesh Jha,Tomas Pfister
Main category: cs.AI
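TL;DR: ATLAS is a multi-agent collaboration framework for real-world travel planning under complex constraints that substantially improves the final pass rate and practical usefulness.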
Details
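Motivation: Large language models (LLMs) struggle to generate optimal, grounded solutions under complex constraints, and the evolving constraints and user needs in travel planning add to the challenge. Contribution: Through dynamic constraint management, iterative plan critique, and adaptive interleaved search, ATLAS is the first to demonstrate quantitative effectiveness on travel planning tasks with live information search and multi-turn feedback.
Method: ATLAS uses a multi-agent framework that combines dynamic constraint management, iterative plan critique, and adaptive interleaved search.
Result: On the TravelPlanner benchmark, ATLAS raises the final pass rate from 23.3% to 44.4% over the best alternative. In the realistic setting with live information search, it reaches an 84% final pass rate, well above baselines including ReAct (59%) and a monolithic agent (27%).
Insight: Multi-agent collaboration and dynamic constraint management are key to complex planning tasks, particularly in interactive and dynamic environments.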
Abstract: While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents’ abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to effectively handle such complex nature of constraints awareness in real-world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real-world travel planning tasks with live information search and multi-turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).
[158] Building the EHR Foundation Model via Next Event Prediction
Zekai Chen,Arda Pekis,Kevin Brown
Main category: cs.AI
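TL;DR: This paper proposes Next Event Prediction (NEP), a framework that strengthens the temporal reasoning of large language models (LLMs) over electronic health records (EHRs) through autoregressive fine-tuning on clinical event sequences, yielding substantially better predictive performance.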
Details
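Motivation: Conventional encoding approaches fail to capture the rich temporal dynamics in EHRs, and existing LLMs struggle with the sequential dependencies and temporal relationships of clinical events. Contribution: By reformulating EHRs as timestamped event chains and predicting future medical events, NEP explicitly models disease progression patterns and causal relationships, combining strong predictive accuracy with clinical interpretability.
Method: NEP fine-tunes an LLM autoregressively on clinical event sequences, models timestamped event chains, and predicts future medical events.
Result: On oncology survival prediction and clinical diagnosis tasks, NEP outperforms specialized EHR models by 4.6% AUROC and general-purpose LLMs by 7.2% C-index.
Insight: NEP improves predictive performance and also produces interpretable attention patterns that align with known disease pathways, giving it both technical and clinical value.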
Abstract: Electronic Health Records (EHRs) contain rich temporal dynamics that conventional encoding approaches fail to adequately capture. While Large Language Models (LLMs) show promise for EHR modeling, they struggle to reason about sequential clinical events and temporal dependencies. We propose Next Event Prediction (NEP), a framework that enhances LLMs’ temporal reasoning through autoregressive fine-tuning on clinical event sequences. By reformulating EHRs as timestamped event chains and predicting future medical events, NEP explicitly models disease progression patterns and causal relationships. Extensive evaluations across oncology survival prediction and clinical diagnosis tasks demonstrate NEP’s superiority, outperforming specialized EHR models by 4.6% AUROC and general-purpose LLMs by 7.2% C-index in temporal reasoning tasks. Our analyses reveal dual benefits: state-of-the-art prediction accuracy combined with clinically interpretable attention patterns that align with known disease pathways.
[159] Causal Autoencoder-like Generation of Feedback Fuzzy Cognitive Maps with an LLM Agent
Akash Kumar Panda,Olaoluwa Adigun,Bart Kosko
Main category: cs.AI
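TL;DR: The paper uses a large language model (LLM) to map a causal fuzzy cognitive map (FCM) into text and then reconstruct it. The process resembles an autoencoder (AE) but is more interpretable. The reconstruction preserves strong causal edges and removes weak ones.
Details
Motivation: Conventional autoencoders are black boxes that lack interpretability. Mapping an FCM to text and back with an LLM gives natural-language explanations of the causal relationships, which improves transparency and human-machine interaction.
Contribution: 1. An AE-like LLM system that maps between FCMs and text in both directions; 2. a reconstruction that preserves strong causal edges and removes weak ones, improving interpretability and practicality; 3. a human-readable expression of causal relationships.
Method: 1. An LLM acts as the encoder and maps the FCM into natural-language text; 2. an LLM decodes the text and reconstructs the FCM; 3. a sequence of system instructions steers the reconstruction so that strong causal edges are kept and weak edges are filtered out.
Result: The method reconstructs FCMs with their strong causal edges intact and produces human-readable explanatory text, which makes the model easier to interpret.
Insight: An LLM can serve as a bridge for causal knowledge, turning an otherwise opaque mapping into readable text. This suggests a new direction for causal reasoning and decision-support systems.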
Abstract: A large language model (LLM) can map a feedback causal fuzzy cognitive map (FCM) into text and then reconstruct the FCM from the text. This explainable AI system approximates an identity map from the FCM to itself and resembles the operation of an autoencoder (AE). Both the encoder and the decoder explain their decisions in contrast to black-box AEs. Humans can read and interpret the encoded text in contrast to the hidden variables and synaptic webs in AEs. The LLM agent approximates the identity map through a sequence of system instructions that does not compare the output to the input. The reconstruction is lossy because it removes weak causal edges or rules while it preserves strong causal edges. The encoder preserves the strong causal edges even when it trades off some details about the FCM to make the text sound more natural.
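The lossy nature of the reconstruction can be made concrete with a small sketch. This is an illustration only: the dictionary representation of an FCM, the `0.3` threshold, and the helper names `prune_weak_edges` and `strong_edge_recall` are assumptions, and the thresholding stands in for whatever the LLM actually does when it drops weak edges.

```python
def prune_weak_edges(fcm, threshold=0.3):
    """Keep only edges whose causal weight magnitude reaches the threshold."""
    return {edge: w for edge, w in fcm.items() if abs(w) >= threshold}

def strong_edge_recall(original, reconstructed, threshold=0.3):
    """Fraction of the original strong edges that survive in the reconstruction."""
    strong = {e for e, w in original.items() if abs(w) >= threshold}
    if not strong:
        return 1.0
    return len(strong & set(reconstructed)) / len(strong)

# Hypothetical FCM: (cause, effect) -> signed causal weight in [-1, 1].
fcm = {
    ("rain", "flooding"): 0.8,
    ("flooding", "crop yield"): -0.7,
    ("rain", "tourism"): -0.1,  # weak edge, expected to be lost
}

reconstructed = prune_weak_edges(fcm)
print(reconstructed)
print(strong_edge_recall(fcm, reconstructed))
```

A score such as `strong_edge_recall` is one possible way to check that an approximate identity map kept the edges that matter, without requiring the weak edges to survive.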
[160] NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language
Danial Kamali,Parisa Kordjamshidi
Main category: cs.AI
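TL;DR: NePTune is a neuro-symbolic framework that combines the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. It addresses the limits of current vision-language models on compositional reasoning by dynamically generating Python programs that perform soft logical reasoning.
Details
Motivation: Modern vision-language models struggle with compositional reasoning, that is, decomposing and recombining concepts to solve novel problems. Existing neuro-symbolic approaches are constrained by crisp logical execution or predefined predicates, so a more flexible method is needed.
Contribution: NePTune is a neuro-symbolic framework that dynamically translates natural-language queries into executable Python programs, supports soft logic over uncertain perception outputs, and decouples perception from reasoning.
Method: NePTune combines foundation vision models for perception with symbolic reasoning for composition. It generates Python programs on the fly and performs soft logical reasoning without training. Its modular design decouples perception from reasoning, and its differentiable operations still allow fine-tuning.
Result: On multiple visual reasoning benchmarks, including adversarial tests, NePTune significantly outperforms strong base models and shows effective compositional generalization and adaptation to novel environments.
Insight: The neuro-symbolic design yields flexible reasoning: it keeps the perception model's handling of uncertainty while using the compositionality of symbolic reasoning for complex tasks. The modular design leaves room for future extension.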
Abstract: Modern Vision-Language Models (VLMs) have achieved impressive performance in various tasks, yet they often struggle with compositional reasoning, the ability to decompose and recombine concepts to solve novel problems. While neuro-symbolic approaches offer a promising direction, they are typically constrained by crisp logical execution or predefined predicates, which limit flexibility. In this work, we introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybrid execution model that integrates the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. NePTune dynamically translates natural language queries into executable Python programs that blend imperative control flow with soft logic operators capable of reasoning over VLM-generated uncertainty. Operating in a training-free manner, NePTune, with a modular design, decouples perception from reasoning, yet its differentiable operations support fine-tuning. We evaluate NePTune on multiple visual reasoning benchmarks and various domains, utilizing adversarial tests, and demonstrate a significant improvement over strong base models, as well as its effective compositional generalization and adaptation capabilities in novel environments.
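To show what soft logic operators over perception uncertainty might look like, here is a minimal sketch. The abstract does not say which operators NePTune uses, so the product t-norm and probabilistic sum below are an assumed, common choice, and the names `soft_and`, `soft_or`, `soft_not` and the example scores are hypothetical.

```python
import math

def soft_and(*probs):
    """Product t-norm: all conditions hold (treats scores as independent)."""
    return math.prod(probs)

def soft_or(*probs):
    """Probabilistic sum: at least one condition holds."""
    return 1.0 - math.prod(1.0 - p for p in probs)

def soft_not(p):
    """Soft negation."""
    return 1.0 - p

# Hypothetical confidences that a vision model might return for one image.
is_red = 0.9
is_cube = 0.8
is_left_of_sphere = 0.5

# "Is there a red cube that is NOT left of the sphere?"
score = soft_and(is_red, is_cube, soft_not(is_left_of_sphere))
print(round(score, 3))
```

Because these operators are smooth functions of their inputs, gradients can flow through them, which is the property that would make fine-tuning possible in a design of this kind.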
[161] Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
Raphael Schumann,Stefan Riezler
Main category: cs.AI
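TL;DR: The paper argues that reasoning quality in large language models depends on valid intermediate steps as well as correct answers. Studying multiple-choice question answering (MCQA), the authors find that spurious chains of thought (CoTs) are more likely when a question is effectively unsolvable for the model. They estimate per-question solvability and build it into outcome-supervised reward models and reinforcement learning, which improves process correctness and answer accuracy.
Details
Motivation: An LLM's performance on reasoning tasks depends on valid intermediate steps, not only on producing the correct answer. When a model faces a question it cannot solve, it tends to produce spurious CoTs that reach the right option through wrong reasoning. Using solvability estimates to improve reasoning quality is therefore worth studying.
Contribution: 1. Shows that spurious CoTs are more likely when questions are effectively unsolvable for the model; 2. proposes estimating question solvability to improve reasoning; 3. adapts outcome-supervised reward models and reinforcement learning to account for solvability, improving process correctness and answer accuracy.
Method: 1. Analyze question solvability in the MCQA setting; 2. incorporate solvability into the objectives of outcome-supervised reward models (ORMs) and of reinforcement learning (RL) with group-relative advantage.
Result: Experiments on math and multimodal datasets show consistently higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well.
Insight: Solvability is a key factor for reducing hallucinations and increasing the reliability of CoT reasoning, and it offers a new angle for improving model reasoning.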
Abstract: Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
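The idea of estimating solvability from sampled rollouts and using it alongside a group-relative advantage can be sketched as follows. The group-relative advantage is the usual "reward minus group mean, divided by group standard deviation" form. The weighting function `solvability_weight` is purely a hypothetical illustration of emphasizing an intermediate regime; the abstract does not state how solvability enters the objective.

```python
from statistics import mean, pstdev

def solvability(rewards):
    """Estimate solvability as the fraction of sampled answers that are correct."""
    return mean(rewards)

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def solvability_weight(s):
    """Hypothetical weight: largest at s = 0.5, zero when s is 0 or 1."""
    return 4.0 * s * (1.0 - s)

# Rewards (1 = correct option chosen) for 4 rollouts of one question.
rewards = [1, 0, 1, 0]
s = solvability(rewards)
weighted = [solvability_weight(s) * a for a in group_relative_advantages(rewards)]
print(s, weighted)
```

With this assumed weighting, questions the model always or never answers correctly contribute nothing, which mirrors the observation that learning is most effective in an intermediate solvability regime.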
[162] RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning
Gang Li,Yulei Qin,Xiaoyu Tan,Dingkang Yang,Yuchen Shi,Zihan Xu,Xiang Li,Xing Sun,Ke Li
Main category: cs.AI
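TL;DR: The paper proposes RoRecomp, a method that steers models toward more concise reasoning by recomposing the training data, substantially improving reasoning efficiency in reinforcement learning.
Details
Motivation: Standard RLVR training often produces verbose reasoning on complex tasks and inefficient exploration trajectories. Outcome-only rewards give no incentive for efficiency, and the high variance of response length within relatively small rollout groups makes the optimization signal noisy.
Contribution: RoRecomp recomposes rollout responses into priority batches and compensation batches, which gives a clear gradient signal for reasoning efficiency.
Method: RoRecomp separates responses into two batch types: 1) priority batches, which combine short-correct and long-incorrect responses to provide a gradient signal for brevity; 2) compensation batches, which use the remaining responses from a replay buffer to maintain stability.
Result: RoRecomp reduces reasoning length by 27.7% in zero RL training, reduces unnecessary tool calls by 46.8% while improving accuracy in agentic RL, and achieves up to 52.5% length reduction in thinking compression.
Insight: Making the optimization target explicit through data recomposition improves reasoning efficiency while avoiding model collapse, giving RLVR training a more efficient form of guidance.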
Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). However, standard RLVR training often leads to excessively verbose processes (in reasoning tasks) and inefficient exploration trajectories (in agentic settings), as outcome-only rewards provide no incentive for efficiency and the high variance in response length within relatively small rollout groups results in noisy optimization signals. To address this, we propose Rollout Response Recomposition (RoRecomp), a plug-and-play method that guides models toward concise reasoning by strategically recomposing the training data. RoRecomp separates responses into two distinct batch types: 1) priority batches, which combine short-correct and long-incorrect responses selected from online batches to provide a clear gradient signal for brevity, and 2) compensation batches, which utilize remaining responses from a replay buffer to maintain stability and prevent model collapse. To comprehensively evaluate effectiveness, we test RoRecomp across three settings where results demonstrate substantial efficiency gains: reducing reasoning length by 27.7% in zero RL training, reducing unnecessary tool calls by 46.8% while improving accuracy in agentic RL, and achieving up to 52.5% length reduction in thinking compression, all with minimal performance impact.
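The split into priority and compensation material can be sketched in a few lines. This is a simplified reading of the abstract: the selection rule (take the `k` shortest correct and `k` longest incorrect responses), the record fields, and the function name `recompose` are assumptions, and the remaining responses are simply returned so that a caller could push them into a replay buffer.

```python
def recompose(rollouts, k=1):
    """Split one rollout group into a priority batch and leftover responses.

    priority: the k shortest correct and k longest incorrect responses.
    remaining: everything else, e.g. to be stored in a replay buffer.
    """
    correct = sorted((r for r in rollouts if r["correct"]), key=lambda r: r["length"])
    incorrect = sorted((r for r in rollouts if not r["correct"]),
                       key=lambda r: r["length"], reverse=True)
    priority = correct[:k] + incorrect[:k]
    remaining = correct[k:] + incorrect[k:]
    return priority, remaining

# Hypothetical rollout group for one prompt.
group = [
    {"id": "a", "correct": True, "length": 120},
    {"id": "b", "correct": True, "length": 900},
    {"id": "c", "correct": False, "length": 1500},
    {"id": "d", "correct": False, "length": 300},
]

priority, remaining = recompose(group)
print([r["id"] for r in priority], [r["id"] for r in remaining])
```

Pairing short-correct with long-incorrect responses is what makes the length signal unambiguous: within the priority batch, brevity and correctness point in the same direction.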
[163] RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Xiuyuan Chen,Jian Zhao,Yuchen Yuan,Tianle Zhang,Huilin Zhou,Zheng Zhu,Ping Hu,Linghe Kong,Chi Zhang,Weiran Huang,Xuelong Li
Main category: cs.AI
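TL;DR: The paper proposes RADAR, a multi-agent collaborative framework that addresses evaluator bias and model homogeneity in existing safety evaluation of large language models (LLMs). By decomposing the latent risk concept space and adding a dynamic update mechanism, RADAR improves the accuracy and stability of risk identification.
Details
Motivation: Existing LLM safety evaluation methods have two main limitations: evaluator bias, and detection failures caused by model homogeneity. Both undermine the robustness of risk evaluation, so a new framework is needed that covers explicit and implicit risks.
Contribution: 1. A theoretical framework that decomposes the risk concept space into three mutually exclusive subspaces. 2. The RADAR framework, in which multi-agent collaboration and dynamic updates let the risk concept distributions evolve. 3. A dataset of 800 challenging cases, on which RADAR significantly outperforms baseline methods.
Method: 1. Decompose the risk concept space into explicit-risk, implicit-risk, and non-risk subspaces. 2. Run multi-round debate among four role-specialized agents, with dynamic updates. 3. Use the debate mechanism to cover both explicit and implicit risks and to mitigate evaluator bias.
Result: On the challenging test set and on public benchmarks, RADAR significantly outperforms baseline methods, improving risk identification accuracy by 28.87% over the strongest baseline. It also improves stability and self-evaluation risk sensitivity.
Insight: Role-specialized collaboration and dynamic updates address the limitations of existing evaluation methods, which suggests that multi-agent collaboration and decomposition of the risk concept space are central to better LLM safety evaluation.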
Abstract: Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
[164] Saliency Guided Longitudinal Medical Visual Question Answering
Jialin Wu,Xiaofeng Liu
Main category: cs.AI
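TL;DR: The paper proposes a saliency-guided encoder-decoder for longitudinal medical visual question answering (Diff-VQA) on chest X-rays. It turns saliency maps into a supervision signal so the model can reason about disease change.
Details
Motivation: Longitudinal medical VQA compares images from different time points and answers questions about clinical change. In this setting the difference signal and the consistency of visual focus matter more than absolute findings in a single image.
Contribution: 1) A saliency-guided encoder-decoder framework; 2) a lightweight affine pre-alignment that reduces nuisance motion between visits; 3) keyword-conditioned Grad-CAM masks that close the language-vision loop.
Method: The method has two steps: 1) extract a medical keyword from the answer and generate keyword-conditioned Grad-CAM on both images; 2) apply the shared saliency mask to both time points and generate the final answer, which enforces spatially consistent attention.
Result: On the Medical-Diff-VQA dataset the model achieves competitive performance on BLEU, ROUGE-L, CIDEr, and METEOR without radiology-specific pretraining, which points to practicality and transferability.
Insight: Saliency-conditioned generation combined with mild pre-alignment can serve as a principled framework for longitudinal reasoning in medical VQA, and it also supports interpretability and generalization.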
Abstract: Longitudinal medical visual question answering (Diff-VQA) requires comparing paired studies from different time points and answering questions about clinically meaningful changes. In this setting, the difference signal and the consistency of visual focus across time are more informative than absolute single-image findings. We propose a saliency-guided encoder-decoder for chest X-ray Diff-VQA that turns post-hoc saliency into actionable supervision. The model first performs a lightweight near-identity affine pre-alignment to reduce nuisance motion between visits. It then executes a within-epoch two-step loop: step 1 extracts a medically relevant keyword from the answer and generates keyword-conditioned Grad-CAM on both images to obtain disease-focused saliency; step 2 applies the shared saliency mask to both time points and generates the final answer. This closes the language-vision loop so that the terms that matter also guide where the model looks, enforcing spatially consistent attention on corresponding anatomy. On Medical-Diff-VQA, the approach attains competitive performance on BLEU, ROUGE-L, CIDEr, and METEOR while providing intrinsic interpretability. Notably, the backbone and decoder are general-domain pretrained without radiology-specific pretraining, highlighting practicality and transferability. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA.
[165] Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Minhui Zhu,Minyang Tian,Xiaocheng Yang,Tianci Zhou,Penghao Zhu,Eli Chertkov,Shengyan Liu,Yufeng Du,Lifan Yuan,Ziming Ji,Indranil Das,Junyi Cao,Yufeng Du,Jinchen He,Yifan Su,Jiabin Yu,Yikun Jiang,Yujie Zhang,Chang Liu,Ze-Min Huang,Weizhen Jia,Xinan Chen,Peixue Wu,Yunkai Wang,Juntai Zhou,Yong Zhao,Farshid Jafarpour,Jessie Shelton,Aaron Young,John Bartolotta,Wenchao Xu,Yue Sun,Anjun Chu,Victor Colussi,Chris Akers,Nathan Brooks,Wenbo Fu,Christopher Wilson,Jinchao Zhao,Marvin Qi,Anqi Mu,Yubo Yang,Allen Zang,Yang Lyu,Peizhi Mai,Xuefei Guo,Luyu Gao,Ze Yang,Chi Xue,Dmytro Bandak,Yaïr Hein,Yonatan Kahn,Kevin Zhou,John Drew Wilson,Jarrod T. Reilly,Di Luo,Daniel Inafuku,Hao Tong,Liang Yang,Ruixing Zhang,Xueying Wang,Ofir Press,Nicolas Chia,Eliu Huerta,Hao Peng
Main category: cs.AI
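TL;DR: CritPt is the first benchmark built from unpublished, research-level reasoning tasks in frontier physics, designed to evaluate how well large language models handle complex reasoning.
Details
Motivation: The work asks whether LLMs can reason effectively through the open-ended challenges of frontier physics research, and which kinds of reasoning tasks physicists actually want LLMs to help with.
Contribution: The CritPt benchmark, which covers many areas of modern physics and contains 71 composite research challenges decomposed into 190 checkpoint tasks.
Method: More than 50 active physics researchers wrote new problems based on their own research. Each problem admits a guess-resistant, machine-verifiable answer and is scored by an automated grading pipeline.
Result: Current state-of-the-art LLMs show early promise on isolated checkpoints but perform poorly on full research challenges. The best average accuracy among base models is only 4.0%, rising to around 10% with coding tools.
Insight: CritPt exposes a large gap between current model capabilities and the demands of real physics research, and offers a foundation for developing scientifically grounded AI tools.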
Abstract: While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced “critical point”), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 4.0% , achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
[166] PUREVQ-GAN: Defending Data Poisoning Attacks through Vector-Quantized Bottlenecks
Alexander Branch,Omead Pooladzandi,Radin Khosraviani,Sunay Gajanan Bhat,Jeffrey Jiang,Gregory Pottie
Main category: cs.AI
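TL;DR: PureVQ-GAN defends against data poisoning by passing images through a vector-quantized bottleneck. It combines a VQ-VAE with a GAN discriminator to remove trigger patterns while preserving semantic content.
Details
Motivation: Data poisoning attacks such as backdoor triggers are a serious threat in machine learning, and existing defenses such as diffusion-based methods are slow and computationally expensive.
Contribution: 1. PureVQ-GAN, which combines a VQ-VAE with a GAN discriminator and uses the quantization bottleneck to destroy trigger patterns; 2. a large efficiency gain, running over 50x faster than diffusion-based defenses.
Method: 1. The VQ-VAE encoder quantizes each image through a learned discrete codebook; 2. a GAN discriminator keeps the outputs on the natural image distribution; 3. together these destroy the fine-grained patterns that triggers rely on.
Result: On CIFAR-10, PureVQ-GAN achieves a 0% poison success rate (PSR) against Gradient Matching and Bullseye Polytope attacks and 1.64% against Narcissus, while maintaining 91-95% clean accuracy.
Insight: Discrete quantization is an effective defense against data poisoning, and pairing it with a GAN further improves robustness and efficiency.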
Abstract: We introduce PureVQ-GAN, a defense against data poisoning that forces backdoor triggers through a discrete bottleneck using Vector-Quantized VAE with GAN discriminator. By quantizing poisoned images through a learned codebook, PureVQ-GAN destroys fine-grained trigger patterns while preserving semantic content. A GAN discriminator ensures outputs match the natural image distribution, preventing reconstruction of out-of-distribution perturbations. On CIFAR-10, PureVQ-GAN achieves 0% poison success rate (PSR) against Gradient Matching and Bullseye Polytope attacks, and 1.64% against Narcissus while maintaining 91-95% clean accuracy. Unlike diffusion-based defenses requiring hundreds of iterative refinement steps, PureVQ-GAN is over 50x faster, making it practical for real training pipelines.
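Why a discrete bottleneck can erase fine-grained perturbations is easiest to see in the quantization step itself. The sketch below is a toy nearest-neighbor codebook lookup in pure Python, not the PureVQ-GAN implementation: the two-dimensional latents, the three-entry codebook, and the function name `quantize` are all hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to the vector."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))
    return idx, codebook[idx]

# Hypothetical learned codebook of latent vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

clean_latent = [0.95, 0.02]
poisoned_latent = [1.05, -0.03]  # same content plus a small perturbation

print(quantize(clean_latent, codebook))
print(quantize(poisoned_latent, codebook))
```

Both latents snap to the same codebook entry, so whatever small difference the perturbation introduced is discarded before decoding. A perturbation large enough to change the selected entry would not be removed by this step alone.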
[167] CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search
Zhe Li,Zhiwei Lin,Yongtao Wang
Main category: cs.AI
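TL;DR: CoLLM-NAS is a two-stage NAS framework driven by two complementary large language models (LLMs). A Navigator LLM and a Generator LLM collaborate to search neural architectures efficiently, outperforming conventional NAS methods.
Details
Motivation: Existing methods that combine LLMs with NAS suffer from invalid architectures, computational inefficiency, and inferior performance, so a more efficient knowledge-guided search framework is needed.
Contribution: The CoLLM-NAS framework, which introduces collaboration between a Navigator LLM and a Generator LLM and improves architecture search through knowledge guidance and historical trajectories.
Method: In a two-stage NAS framework, the Navigator LLM guides the search direction, the Generator LLM synthesizes high-quality candidate architectures, and a Coordinator module manages their interaction.
Result: On ImageNet and NAS-Bench-201, CoLLM-NAS surpasses existing NAS methods and conventional search algorithms, and it also improves the performance and efficiency of several two-stage NAS methods.
Insight: Combining an LLM's inherent knowledge of structured neural architectures with progressive knowledge from iterative feedback can substantially improve the efficiency and effectiveness of NAS.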
Abstract: The integration of Large Language Models (LLMs) with Neural Architecture Search (NAS) has introduced new possibilities for automating the design of neural architectures. However, most existing methods face critical limitations, including architectural invalidity, computational inefficiency, and inferior performance compared to traditional NAS. In this work, we present Collaborative LLM-based NAS (CoLLM-NAS), a two-stage NAS framework with knowledge-guided search driven by two complementary LLMs. Specifically, we propose a Navigator LLM to guide search direction and a Generator LLM to synthesize high-quality candidates, with a dedicated Coordinator module to manage their interaction. CoLLM-NAS efficiently guides the search process by combining LLMs’ inherent knowledge of structured neural architectures with progressive knowledge from iterative feedback and historical trajectory. Experimental results on ImageNet and NAS-Bench-201 show that CoLLM-NAS surpasses existing NAS methods and conventional search algorithms, achieving new state-of-the-art results. Furthermore, CoLLM-NAS consistently enhances the performance and efficiency of various two-stage NAS methods (e.g., OFA, SPOS, and AutoFormer) across diverse search spaces (e.g., MobileNet, ShuffleNet, and AutoFormer), demonstrating its excellent generalization.
[168] ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
Yichao Liang,Dat Nguyen,Cambridge Yang,Tianyang Li,Joshua B. Tenenbaum,Carl Edward Rasmussen,Adrian Weller,Zenna Tavares,Tom Silver,Kevin Ellis
Main category: cs.AI
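TL;DR: The paper proposes a framework for learning abstract models of dynamic worlds to support robot planning. It jointly learns symbolic state representations and causal processes, covering both endogenous actions and exogenous mechanisms, and combines variational Bayesian inference with LLM proposals. The models are learned from limited data and enable efficient planning in simulated environments.
Details
Motivation: Long-horizon embodied planning is hard because the world changes not only through the agent's actions but also through exogenous processes (for example water heating or dominoes cascading) that unfold concurrently. The paper addresses this with an abstract world model framework.
Contribution: 1) A framework that jointly learns symbolic state representations and causal processes; 2) learning from limited data by combining variational Bayesian inference with LLM proposals; 3) validation of effectiveness and generalization in five simulated environments.
Method: 1) Learn symbolic state representations; 2) model causal processes for both endogenous actions and exogenous mechanisms; 3) learn these models with variational Bayesian inference and LLM proposals.
Result: Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
Insight: Learning symbolic representations together with causal processes gives an efficient and generalizable way to model dynamic worlds, with particular promise for long-horizon planning.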
Abstract: Long-horizon embodied planning is challenging because the world does not only change through an agent’s actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent’s actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic causal-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
[169] Zero-Shot Decentralized Federated Learning
Alessio Masano,Matteo Pennisi,Federica Proietto Salanitri,Concetto Spampinato,Giovanni Bellitto
Main category: cs.AI
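TL;DR: The paper proposes ZeroDFL, a fully decentralized federated learning framework in which distributed clients share optimized textual prompts, enabling zero-shot adaptation without a central coordinator. It performs well on multiple datasets while sharply reducing communication overhead and improving privacy.
Details
Motivation: Existing federated prompt learning methods such as FedCoOp and FedTPG improve CLIP's adaptability but fall short on generalization, communication cost, and privacy. ZeroDFL aims to address these issues with a more efficient, privacy-friendly decentralized solution.
Contribution: ZeroDFL, a fully decentralized federated learning framework; an iterative prompt-sharing mechanism; and an evaluation showing the method is effective while reducing communication overhead by 118x compared to FedTPG.
Method: ZeroDFL uses an iterative prompt-sharing mechanism in which clients optimize and exchange textual prompts to improve generalization. This removes the dependence on a central server and greatly lowers communication cost.
Result: On nine image classification datasets, ZeroDFL outperforms or matches existing federated prompt learning methods while reducing communication overhead by 118x compared to FedTPG.
Insight: A decentralized approach can improve the efficiency and privacy of federated learning and can play an important role in zero-shot adaptation of large vision-language models.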
Abstract: CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques like CoOp and CoCoOp enhance CLIP’s adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning approaches, such as FedCoOp and FedTPG, improve performance but face generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. We propose Zero-shot Decentralized Federated Learning (ZeroDFL), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL employs an iterative prompt-sharing mechanism, allowing clients to optimize and exchange textual prompts to enhance generalization while drastically reducing communication overhead. We validate ZeroDFL on nine diverse image classification datasets, demonstrating that it consistently outperforms–or remains on par with–state-of-the-art federated prompt learning methods. More importantly, ZeroDFL achieves this performance in a fully decentralized setting while reducing communication overhead by 118x compared to FedTPG. These results highlight that our approach not only enhances generalization in federated zero-shot learning but also improves scalability, efficiency, and privacy preservation–paving the way for decentralized adaptation of large vision-language models in real-world applications.
cs.IR
[170] MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
Junjie Zhou,Ze Liu,Lei Xiong,Jin-Ge Yao,Yueze Wang,Shitao Xiao,Fenfen Lin,Miguel Hu Chen,Zhicheng Dou,Siqi Bao,Defu Lian,Yongping Xiong,Zheng Liu
Main category: cs.IR
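TL;DR: MR$^2$-Bench is a reasoning-focused benchmark for multimodal retrieval. It goes beyond the shallow semantic matching of existing evaluations and requires logical, spatial, and causal reasoning.
Details
Motivation: Existing benchmarks cannot assess the deeper reasoning that multimodal retrieval requires, so they do not reflect the complex scenarios found in real applications.
Contribution: 1) A reasoning-driven benchmark for multimodal retrieval; 2) coverage of diverse multimodal data and complex queries; 3) evidence of a large performance gap for existing models on the new benchmark.
Method: The dataset contains 1,309 queries, built from manual collection and annotation and from selective consolidation of public datasets. It covers natural images, diagrams, and visual puzzles.
Result: Existing state-of-the-art models perform far worse on the new benchmark. For example, Seed1.6-Embedding reaches a Recall@1 of 77.78 on MMEB but only 9.91 on MR$^2$-Bench.
Insight: Multimodal retrieval needs stronger reasoning, and the new benchmark gives future research a more demanding evaluation standard.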
Abstract: Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic correspondence (e.g., object-text matching) while failing to assess the deeper reasoning required to capture complex relationships between visual and textual information. To address this gap, we introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval. MR$^2$-Bench presents the following critical values: 1) all tasks are reasoning-driven, going beyond shallow matching to effectively assess models’ capacity for logical, spatial, and causal inference; 2) it features diverse multimodal data, such as natural images, diagrams, and visual puzzles, enabling comprehensive evaluation across content types; 3) it supports complex queries and documents containing multiple images and covers diverse retrieval scenarios, more accurately reflecting real-world applications. Our benchmark contains 1,309 curated queries, derived either from manual collection and annotation or from selective consolidation of public datasets. Despite achieving strong results on existing benchmarks, current state-of-the-art models still struggle on MR$^2$-Bench: for example, the leading Seed1.6-Embedding model attains a Recall@1 of 77.78 on MMEB, but only 9.91 on MR$^2$-Bench. This substantial performance gap highlights both the increased challenge posed by our benchmark and the pressing need for further advances in reasoning-intensive multimodal retrieval. The dataset and evaluation code will be made publicly available at https://github.com/VectorSpaceLab/MR2-Bench.
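For readers unfamiliar with the reported metric, the following sketch shows one common way to compute Recall@k for retrieval. It is an illustration under assumptions: it scores a query as a hit if any relevant document appears in the top k, which may differ from the exact definition used by MR$^2$-Bench or MMEB, and the function name and toy data are hypothetical.

```python
def recall_at_k(ranked, relevant, k=1):
    """Fraction of queries with at least one relevant document in the top k.

    ranked:   query id -> list of document ids, best first
    relevant: query id -> set of relevant document ids
    """
    hits = sum(1 for qid, docs in ranked.items() if set(docs[:k]) & relevant[qid])
    return hits / len(ranked)

# Hypothetical retrieval output for two queries.
ranked = {"q1": ["d3", "d1"], "q2": ["d2", "d9"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}}

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=2))
```

Scores such as 77.78 and 9.91 in the abstract are this kind of fraction expressed as a percentage.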
cs.CR
[171] STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
Jing-Jing Li,Jianfeng He,Chao Shang,Devang Kulshreshtha,Xun Xian,Yi Zhang,Hang Su,Sandesh Swamy,Yanjun Qi
Main category: cs.CR
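TL;DR: The paper proposes STAC, an attack framework that chains tool calls which each look harmless but together carry out a harmful operation. It exposes a major security weakness in current LLM agents and proposes a new defense.
Details
Motivation: As LLMs become autonomous agents with tool-use capabilities, their security challenges extend beyond traditional content-safety concerns. STAC reveals the risks that multi-step tool calls can introduce.
Contribution: 1. The STAC attack framework, which automatically generates and validates multi-step tool chains; 2. a systematic evaluation of 483 cases, with attack success rates exceeding 90% in most cases; 3. a new reasoning-based defense.
Method: STAC uses a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers multi-turn prompts that induce the agent to execute the malicious sequence.
Result: Even advanced agents such as GPT-4.1 are highly vulnerable to STAC, and existing prompt-based defenses offer limited protection. The proposed reasoning-driven defense cuts the attack success rate by up to 28.8%.
Insight: Defending tool-enabled agents requires reasoning over the entire action sequence and its cumulative effects, not only evaluating individual prompts or responses.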
Abstract: As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC’s automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.
eess.IV
[172] Position-Blind Ptychography: Viability of image reconstruction via data-driven variational inference
Simon Welker,Lorenz Kuger,Tim Roith,Berthy Feng,Martin Burger,Timo Gerkmann,Henry Chapman
Main category: eess.IV
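TL;DR: The paper studies position-blind ptychography, a novel blind inverse problem. Using data-driven variational inference, it jointly recovers the image and the unknown scan positions, and it demonstrates viability in simulated experiments.
Details
Motivation: The problem comes from single-particle diffractive X-ray imaging, where particles in random orientations are illuminated and diffraction patterns are collected. With a highly focused X-ray beam, the measurements also become sensitive to the beam position, and these positions are likewise unknown.
Contribution: The paper is the first to study position-blind ptychography. It proposes variational inference with modern data-driven image priors, in the form of score-based diffusion models, as a viable way to jointly recover the image and the positions when scan positions are unknown.
Method: Variational inference with score-based diffusion models as image priors, evaluated on a simulated, simplified 2-D variant of the problem.
Result: With the right illumination structure and a strong prior, reliable image reconstruction is achievable even under measurement noise, in all but the most difficult imaging scenario evaluated.
Insight: The image prior and the illumination structure are both important for solving position-blind ptychography, particularly within a data-driven variational inference framework.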
Abstract: In this work, we present and investigate the novel blind inverse problem of position-blind ptychography, i.e., ptychographic phase retrieval without any knowledge of scan positions, which then must be recovered jointly with the image. The motivation for this problem comes from single-particle diffractive X-ray imaging, where particles in random orientations are illuminated and a set of diffraction patterns is collected. If one uses a highly focused X-ray beam, the measurements would also become sensitive to the beam positions relative to each particle and therefore ptychographic, but these positions are also unknown. We investigate the viability of image reconstruction in a simulated, simplified 2-D variant of this difficult problem, using variational inference with modern data-driven image priors in the form of score-based diffusion models. We find that, with the right illumination structure and a strong prior, one can achieve reliable and successful image reconstructions even under measurement noise, in all except the most difficult evaluated imaging scenario.
[173] Anatomy-DT: A Cross-Diffusion Digital Twin for Anatomical Evolution
Moinak Bhattacharya,Gagandeep Singh,Prateek Prasanna
Main category: eess.IV
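TL;DR: The paper proposes Anatomy-DT, a cross-diffusion digital twin framework that simulates the spatiotemporal evolution of tumor morphology together with the surrounding anatomy. It combines partial differential equations (PDEs) with differentiable deep learning to produce accurate, topologically consistent results.
Details
Motivation: Existing methods focus on tumor growth and neglect changes in the surrounding anatomy. In practice, tumor evolution is highly non-linear and heterogeneous, shaped by therapeutic interventions and by spatial interaction with neighboring tissues. A framework that models the tumor together with its surrounding anatomy is therefore needed.
Contribution: 1. A mathematical framework that combines PDEs with differentiable deep learning to model the evolution of multi-class anatomy; 2. a cross-diffusion reaction-diffusion system that enforces inter-class competition and exclusivity; 3. a topology regularizer that preserves centerlines and penalizes region overlaps; 4. validation on synthetic and clinical datasets.
Method: 1. Represent anatomy as a multi-class probability field on the simplex; 2. evolve it with a cross-diffusion reaction-diffusion system; 3. use a differentiable implicit-explicit scheme that treats stiff diffusion implicitly and nonlinear reaction and event terms explicitly; 4. project back to the simplex to keep the field a valid probability field; 5. add a topology regularizer to improve global consistency.
Result: The method achieves state-of-the-art accuracy on synthetic benchmarks while preserving topology, and it also shows superior performance on the clinical dataset.
Insight: Combining PDE dynamics, topology-aware regularization, and differentiable solvers offers a principled path to anatomy-to-anatomy generation for digital twins that are visually realistic, anatomically exclusive, and topologically consistent.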
Abstract: Accurately modeling the spatiotemporal evolution of tumor morphology from baseline imaging is a pre-requisite for developing digital twin frameworks that can simulate disease progression and treatment response. Most existing approaches primarily characterize tumor growth while neglecting the concomitant alterations in adjacent anatomical structures. In reality, tumor evolution is highly non-linear and heterogeneous, shaped not only by therapeutic interventions but also by its spatial context and interaction with neighboring tissues. Therefore, it is critical to model tumor progression in conjunction with surrounding anatomy to obtain a comprehensive and clinically relevant understanding of disease dynamics. We introduce a mathematically grounded framework that unites mechanistic partial differential equations with differentiable deep learning. Anatomy is represented as a multi-class probability field on the simplex and evolved by a cross-diffusion reaction-diffusion system that enforces inter-class competition and exclusivity. A differentiable implicit-explicit scheme treats stiff diffusion implicitly while handling nonlinear reaction and event terms explicitly, followed by projection back to the simplex. To further enhance global plausibility, we introduce a topology regularizer that simultaneously enforces centerline preservation and penalizes region overlaps. The approach is validated on synthetic datasets and a clinical dataset. On synthetic benchmarks, our method achieves state-of-the-art accuracy while preserving topology, and also demonstrates superior performance on the clinical dataset. By integrating PDE dynamics, topology-aware regularization, and differentiable solvers, this work establishes a principled path toward anatomy-to-anatomy generation for digital twins that are visually realistic, anatomically exclusive, and topologically consistent.
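The "projection back to the simplex" step can be illustrated per voxel with the standard sort-based Euclidean projection onto the probability simplex. Whether Anatomy-DT uses this particular projection is not stated in the abstract, so treat the choice of algorithm and the function name `project_to_simplex` as assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of a vector onto {p : p_i >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        # Keep the shift from the largest prefix whose entries stay positive.
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

# Hypothetical class scores for one voxel after an update step
# (tumor, organ, background); they no longer form a valid distribution.
after_step = [0.6, 0.5, -0.1]
print(project_to_simplex(after_step))
```

After projection the class values are non-negative and sum to one, so each voxel again carries a valid multi-class probability vector before the next time step.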
[174] Multi-modal Liver Segmentation and Fibrosis Staging Using Real-world MRI Images
Yang Zhou,Kunhao Yuan,Ye Wei,Jishizhan Chen
Main category: eess.IV
TL;DR: 该论文提出了一种自动化的多模态MRI肝脏分割和纤维化分期方法,结合伪标签和多模态配准技术,通过深度学习网络和STAD特征实现了高性能的非侵入性诊断。
Details
Motivation: 肝脏纤维化的精确分期通常需要侵入性方法,存在风险和并发症。因此,研究旨在开发一种非侵入性的AI解决方案,利用多模态MRI数据进行肝脏分割和纤维化分期,以支持早期诊断和临床决策。Contribution: 1)提出了一个自动化流水线,整合了多模态配准、深度学习分割和STAD特征提取技术;2)在CARE 2025挑战赛中展现了卓越的泛化能力和性能;3)提供了一种可重复的非侵入性肝脏纤维化评估框架。
Method: 1)基于多模态配准生成伪标签;2)使用深度神经网络进行肝脏分割;3)从分割掩码和MRI图像中提取形状、纹理、外观和方向(STAD)特征进行纤维化分期。
Result: 该方法在多中心、多模态MRI数据中表现出色,在挑战赛的所有子任务中均取得顶尖性能。
Insight: 该研究表明,结合多模态数据和深度学习可以有效实现非侵入性肝脏疾病评估,为临床提供了快速、可重复的诊断工具。
Abstract: Liver fibrosis represents the accumulation of excessive extracellular matrix caused by sustained hepatic injury. It disrupts normal lobular architecture and function, increasing the chances of cirrhosis and liver failure. Precise staging of fibrosis for early diagnosis and intervention is often invasive, which carries risks and complications. To address this challenge, recent advances in artificial intelligence-based liver segmentation and fibrosis staging offer a non-invasive alternative. As a result, the CARE 2025 Challenge aimed for automated methods to quantify and analyse liver fibrosis in real-world scenarios, using multi-centre, multi-modal, and multi-phase MRI data. This challenge included tasks of precise liver segmentation (LiSeg) and fibrosis staging (LiFS). In this study, we developed an automated pipeline for both tasks across all the provided MRI modalities. This pipeline integrates pseudo-labelling based on multi-modal co-registration, liver segmentation using deep neural networks, and liver fibrosis staging based on shape, textural, appearance, and directional (STAD) features derived from segmentation masks and MRI images. By solely using the released data with limited annotations, our proposed pipeline demonstrated excellent generalisability for all MRI modalities, achieving top-tier performance across all competition subtasks. This approach provides a rapid and reproducible framework for quantitative MRI-based liver fibrosis assessment, supporting early diagnosis and clinical decision-making. Code is available at https://github.com/YangForever/care2025_liver_biodreamer.
[175] Ordinal Label-Distribution Learning with Constrained Asymmetric Priors for Imbalanced Retinal Grading
Nagur Shareef Shaik,Teja Krishna Cherukuri,Adnan Masood,Ehsan Adeli,Dong Hye Ye
Main category: eess.IV
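TL;DR: The paper proposes CAP-WAE, which uses an asymmetric prior and redesigned loss functions to address class imbalance and ordinality in diabetic retinopathy grading.
Details
Motivation: Diabetic retinopathy grading is ordinal and long-tailed; minority classes are scarce yet clinically important. Conventional methods rely on symmetric assumptions (such as Gaussian priors and symmetric losses) that fail to capture the task's asymmetry and class imbalance, which limits performance.
Contribution: 1. The CAP-WAE framework, which combines a Wasserstein autoencoder with an asymmetric prior to preserve the heavy-tailed, skewed structure of minority classes. 2. The MAOC loss, which enforces grade-ordered separability in the latent space. 3. A direction-aware ordinal loss that reflects clinical priorities through asymmetric penalties. 4. An adaptive multi-task weighting scheme that reduces the need for tuning.
Method: 1. Use a WAE whose aggregate posterior is constrained to align with an asymmetric prior. 2. Optimize the latent structure with the MAOC loss so that classes are compact and ordered. 3. Use a lightweight head to predict asymmetric dispersions, generating soft labels that reflect clinical priorities. 4. Stabilize training with adaptive multi-task weighting.
Result: On public DR benchmarks, CAP-WAE achieves state-of-the-art Quadratic Weighted Kappa, accuracy, and macro-F1, surpassing existing ordinal classification and generative baselines. t-SNE visualizations show that the method forms compact, grade-ordered latent clusters.
Insight: 1. Asymmetric priors and asymmetric supervision losses are important for long-tailed, ordinal tasks. 2. Structuring the latent space (for example with MAOC) can substantially improve performance. 3. Clinical priorities can be built into the loss function naturally through asymmetric penalties.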
Abstract: Diabetic retinopathy grading is inherently ordinal and long-tailed, with minority stages being scarce, heterogeneous, and clinically critical to detect accurately. Conventional methods often rely on isotropic Gaussian priors and symmetric loss functions, misaligning latent representations with the task’s asymmetric nature. We propose the Constrained Asymmetric Prior Wasserstein Autoencoder (CAP-WAE), a novel framework that addresses these challenges through three key innovations. Our approach employs a Wasserstein Autoencoder (WAE) that aligns its aggregate posterior with an asymmetric prior, preserving the heavy-tailed and skewed structure of minority classes. The latent space is further structured by a Margin-Aware Orthogonality and Compactness (MAOC) loss to ensure grade-ordered separability. At the supervision level, we introduce a direction-aware ordinal loss, where a lightweight head predicts asymmetric dispersions to generate soft labels that reflect clinical priorities by penalizing under-grading more severely. Stabilized by an adaptive multi-task weighting scheme, our end-to-end model requires minimal tuning. Across public DR benchmarks, CAP-WAE consistently achieves state-of-the-art Quadratic Weighted Kappa, accuracy, and macro-F1, surpassing both ordinal classification and latent generative baselines. t-SNE visualizations further reveal that our method reshapes the latent manifold into compact, grade-ordered clusters with reduced overlap.
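Quadratic Weighted Kappa, the headline metric in this abstract, rewards predictions that land close to the true grade. The sketch below is a plain-Python version of the standard definition, not the authors' evaluation code; the function name and the toy labels are illustrative assumptions.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Standard QWK: 1 - (weighted observed disagreement / weighted expected)."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Penalty grows with the squared distance between grades.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * true_hist[i] * pred_hist[j] / n
    # Note: the denominator is zero if all labels and predictions share one class.
    return 1.0 - numerator / denominator


grades = [0, 1, 2, 3, 4]
near_miss = quadratic_weighted_kappa(grades, [0, 1, 2, 3, 3], 5)  # off by one grade
far_miss = quadratic_weighted_kappa(grades, [0, 1, 2, 3, 0], 5)   # off by four grades
```

The weights are symmetric, so QWK penalizes under-grading and over-grading equally. That symmetry is one way to see why the paper adds a separate direction-aware loss to penalize under-grading more heavily during training.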
[176] GastroViT: A Vision Transformer Based Ensemble Learning Approach for Gastrointestinal Disease Classification with Grad CAM & SHAP Visualization
Sumaiya Tabassum,Md. Faysal Ahamed,Hafsa Binte Kibria,Md. Nahiduzzaman,Julfikar Haider,Muhammad E. H. Chowdhury,Mohammad Tariqul Islam
Main category: eess.IV
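TL;DR: The paper proposes GastroViT, an ensemble of pre-trained vision transformers (ViTs) for gastrointestinal disease classification, and uses Grad-CAM and SHAP visualizations to improve model interpretability.
Details
Motivation: Early diagnosis of gastrointestinal diseases is critical for treatment, but conventional methods are limited in recognizing complex abnormalities. Because the attention mechanism of ViTs performs well on large-scale image tasks, the authors explore its use in this domain.
Contribution: 1. A ViT-based ensemble model, GastroViT, that reaches 91.98% accuracy on the HyperKvasir dataset; 2. improved interpretability through Grad-CAM and SHAP.
Method: 1. Ensemble two pre-trained ViT models (MobileViT_XS and MobileViT_V2_200); 2. evaluate on 23-class and 16-class classification tasks; 3. apply XAI methods (Grad-CAM and SHAP) for visual analysis.
Result: 1. On the 23-class task, accuracy is 91.98% with an average F1 score of 64%; 2. on the 16-class task, accuracy rises to 92.70% with an average F1 score of 87%.
Insight: 1. The ViT ensemble performs well on a small, highly imbalanced dataset; 2. XAI methods provide model explanations that support medical diagnosis.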
Abstract: The gastrointestinal (GI) tract of humans can have a wide variety of aberrant mucosal abnormality findings, ranging from mild irritations to extremely fatal illnesses. Prompt identification of gastrointestinal disorders greatly contributes to arresting the progression of the illness and improving therapeutic outcomes. This paper presents an ensemble of pre-trained vision transformers (ViTs) for accurately classifying endoscopic images of the GI tract to categorize gastrointestinal problems and illnesses. ViTs, attention-based neural networks, have revolutionized image recognition by leveraging the transformative power of the transformer architecture, achieving state-of-the-art (SOTA) performance across various visual tasks. The proposed model was evaluated on the publicly available HyperKvasir dataset with 10,662 images of 23 different GI diseases for the purpose of identifying GI tract diseases. An ensemble method is proposed utilizing the predictions of two pre-trained models, MobileViT_XS and MobileViT_V2_200, which achieved accuracies of 90.57% and 90.48%, respectively. All the individual models are outperformed by the ensemble model, GastroViT, with an average precision, recall, F1 score, and accuracy of 69%, 63%, 64%, and 91.98%, respectively, in the first testing that involves 23 classes. The model comprises only 20 million (M) parameters, even without data augmentation and despite the highly imbalanced dataset. For the second testing with 16 classes, the scores are even higher, with average precision, recall, F1 score, and accuracy of 87%, 86%, 87%, and 92.70%, respectively. Additionally, the incorporation of explainable AI (XAI) methods such as Grad-CAM (Gradient Weighted Class Activation Mapping) and SHAP (Shapley Additive Explanations) enhances model interpretability, providing valuable insights for reliable GI diagnosis in real-world settings.
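The abstract says the ensemble uses the predictions of two models but does not state how they are combined. As one plausible reading, the sketch below uses soft voting (a weighted average of class probabilities); the function name, the equal weights, and the toy probabilities are all assumptions.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Average two class-probability vectors and return (label, combined).

    Assumption: soft voting stands in for the paper's unspecified
    ensembling rule.
    """
    combined = [weight_a * pa + (1.0 - weight_a) * pb
                for pa, pb in zip(probs_a, probs_b)]
    label = max(range(len(combined)), key=combined.__getitem__)
    return label, combined


# The two models disagree; the averaged distribution decides.
label, combined = soft_vote([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

Averaging probabilities lets a confident model override an uncertain one, which hard majority voting between only two models cannot do.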
cs.CY [Back]
[177] Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions
Smita Khapre,Melkamu Abay Mersha,Hassan Shakil,Jonali Baruah,Jugal Kalita
Main category: cs.CY
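TL;DR: This survey of toxicity in online platforms and AI systems proposes a taxonomy of toxicity, reviews detection methods, summarizes related datasets and research, and identifies future research directions.
Details
Motivation: The design of digital communication systems and online platforms has inadvertently facilitated the spread of toxic behavior, which seriously threatens individual and collective well-being. Systematic research is therefore needed to detect and mitigate toxicity.
Contribution: The survey proposes a comprehensive taxonomy of toxicity, summarizes toxicity datasets and research, and identifies research gaps in toxicity mitigation across datasets, mitigation strategies, large language models, adaptability, and explainability.
Method: The paper is a survey. By systematically reviewing existing work, it proposes a multi-perspective taxonomy of toxicity and summarizes the techniques and datasets used for toxicity detection and mitigation.
Result: The study provides a detailed taxonomy of toxicity and a summary of related datasets and techniques, and points out limitations of existing work such as reactive strategies and insufficient contextual understanding.
Insight: Detecting and mitigating toxicity requires attention to context and environment. Future research should emphasize proactive strategies and multilingual support to improve the adaptability and explainability of large language models and online platforms.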
Abstract: The evolution of digital communication systems and the designs of online platforms have inadvertently facilitated the subconscious propagation of toxic behavior, giving rise to reactive responses to toxic behavior. Toxicity in online content and Artificial Intelligence Systems has become a serious challenge to individual and collective well-being around the world. It is more detrimental to society than we realize. Toxicity, expressed in language, image, and video, can be interpreted in various ways depending on the context of usage. Therefore, a comprehensive taxonomy is crucial to detect and mitigate toxicity in online content, Artificial Intelligence systems, and/or Large Language Models in a proactive manner. A comprehensive understanding of toxicity is likely to facilitate the design of practical solutions for toxicity detection and mitigation. The classification in published literature has focused on only a limited number of aspects of this very complex issue, with a pattern of reactive strategies in response to toxicity. This survey attempts to generate a comprehensive taxonomy of toxicity from various perspectives. It presents a holistic approach to explain the toxicity by understanding the context and environment that society is facing in the Artificial Intelligence era. This survey summarizes the toxicity-related datasets and research on toxicity detection and mitigation for Large Language Models, social media platforms, and other online platforms, detailing their attributes in textual mode, focused on the English language. Finally, we suggest the research gaps in toxicity mitigation based on datasets, mitigation strategies, Large Language Models, adaptability, explainability, and evaluation.
cs.LG [Back]
[178] Spectral Logit Sculpting: Adaptive Low-Rank Logit Transformation for Controlled Text Generation
Jin Li,Zhebo Wang,Tianliang Lu,Mohan Li,Wenpeng Xing,Meng Han
Main category: cs.LG
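TL;DR: The paper proposes Spectral Logit Sculpting (SLS), a lightweight inference-time optimization method that dynamically modulates token distributions to improve controlled text generation in large language models, outperforming existing baselines.
Details
Motivation: Existing entropy-minimization methods have high computational overhead and fail to make good use of historical token context. SLS aims to address both problems.
Contribution: The SLS method, which optimizes inference by dynamically modulating logits according to their spectral and entropic properties, without updating model parameters.
Method: SLS keeps a sliding buffer of top-K logits, performs Singular Value Decomposition (SVD) on the fly to identify dominant directions, and adaptively rescales the logits based on entropy and logit-gap statistics.
Result: Experiments show that SLS outperforms baseline methods on mathematical, coding, and scientific reasoning tasks.
Insight: SLS achieves efficient inference-time optimization with lightweight operations while preserving contextual consistency.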
Abstract: Entropy-based inference methods have gained traction for improving the reliability of Large Language Models (LLMs). However, many existing approaches, such as entropy minimization techniques, suffer from high computational overhead and fail to leverage historical token context effectively. To address these limitations, we propose Spectral Logit Sculpting (SLS), a lightweight inference-time optimization method that dynamically modulates token distributions using spectral and entropic properties of recent logits. SLS maintains a sliding buffer of top-K logits, performs on-the-fly Singular Value Decomposition (SVD) to identify dominant spectral directions, and adaptively rescales logits based on both entropy and logit gap statistics–only activating when uncertainty is high. Without updating any model parameters, SLS effectively sharpens the output distribution while preserving contextual consistency. Experimental results on multiple public benchmarks demonstrate that SLS consistently outperforms existing baseline methods, achieving superior accuracy in mathematical, coding, and scientific reasoning tasks.
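The three ingredients named in the abstract (a sliding logit buffer, a dominant spectral direction, and an uncertainty gate) can be put together in a rough sketch. Everything specific here is an assumption rather than the paper's rule: the thresholds, the `strength` parameter, the use of power iteration in place of a full SVD, and the simplification that every buffer row holds logits for the same K candidate tokens.

```python
import math
from collections import deque


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dominant_direction(rows, iters=50):
    # Power iteration on M^T M approximates the top right-singular vector of M.
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        mv = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(mv[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def sculpt(logits, buffer, entropy_threshold=1.0, gap_threshold=1.0, strength=0.5):
    """Amplify the logit component along the buffer's dominant direction,
    but only when entropy is high and the top-two logit gap is small."""
    top_two = sorted(logits, reverse=True)[:2]
    gap = top_two[0] - top_two[1]
    uncertain = entropy(softmax(logits)) > entropy_threshold and gap < gap_threshold
    if not uncertain or len(buffer) < 2:
        return list(logits)  # confident step: leave the logits untouched
    v = dominant_direction(list(buffer))
    coeff = sum(l * d for l, d in zip(logits, v))
    return [l + strength * coeff * d for l, d in zip(logits, v)]


# Hypothetical history in which the first candidate dominated recent steps.
history = deque(maxlen=8)
for past in ([2.0, 0.0, 0.0, 0.0], [2.1, 0.0, 0.0, 0.0], [1.9, 0.0, 0.0, 0.0]):
    history.append(past)
```

The gate is the part that keeps the method cheap: confident steps skip the spectral computation entirely, so the extra cost is paid only where the distribution is flat.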
[179] HAMMER: Hamiltonian Curiosity Augmented Large Language Model Reinforcement
Ming Yang,Xiaofan Li,Zhiyuan Ma,Dengliang Shi,Jintao Du,Yu Cheng,Weiguo Zheng
Main category: cs.LG
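TL;DR: HAMMER proposes a diversity-driven reinforcement learning method based on Hamiltonian paths. It dynamically orders training samples to encourage model “curiosity”, improving the performance of large language models.
Details
Motivation: Existing difficulty-based curriculum reinforcement learning methods tend to get stuck in local optimization, so the model loses its ability to explore early in training. HAMMER aims to solve this problem.
Contribution: The HAMMER framework, which brings dataset diversity metrics into the reinforcement learning procedure and orders samples along a minimum-semantic Hamiltonian path to strengthen exploration.
Method: Training samples are ordered dynamically along a Hamiltonian path, and diversity metrics are used to guide the reinforcement learning process and avoid local optimization.
Result: Experiments show that HAMMER improves average accuracy by 3% to 4% across a range of reasoning benchmarks.
Insight: Diversity-driven sample ordering stimulates model “curiosity” and stabilizes convergence, avoiding the under-exploration caused by training on overly simple samples early on.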
Abstract: Recent curriculum reinforcement learning methods for large language models (LLMs) typically rely on difficulty-based annotations for data filtering and ordering. However, such methods suffer from local optimization, where continual training on simple samples in the early steps can cause the policy to lose its exploration. We propose a novel schema, namely Hamiltonian curiosity augmented large language model reinforcement (HAMMER), that transfers diversity metrics, commonly used in dataset evaluation, into the dynamic reinforcement learning procedure, where training samples are ordered via a minimum-semantic Hamiltonian path making the initial training retain more exploration. From a theoretical perspective of generalization bounds, diversity-driven ordering facilitates stable convergence. Empirical evaluations indicate that HAMMER stimulates model “curiosity” and consistently achieves a 3% to 4% average accuracy gain across diverse inference benchmarks.
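A diversity-driven ordering of this kind can be sketched with a greedy heuristic. The code assumes one interpretation of “minimum-semantic Hamiltonian path”: visit every sample once while keeping consecutive samples as semantically dissimilar as possible, measured by cosine similarity of embeddings. The exact path objective and solver in the paper may differ, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_diverse_order(embeddings, start=0):
    """Greedy approximation of a low-similarity Hamiltonian path.

    Assumption: finding the exact minimum path is NP-hard, so this picks
    the least similar unvisited sample at each step instead.
    """
    remaining = set(range(len(embeddings))) - {start}
    order = [start]
    while remaining:
        current = embeddings[order[-1]]
        # Ties are broken by index so the ordering is deterministic.
        nxt = min(remaining, key=lambda i: (cosine(current, embeddings[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Two tight clusters; the ordering alternates between them.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
order = greedy_diverse_order(toy_embeddings)
```

On the toy data the path hops between the two clusters at every step, which is the intended effect: early batches mix topics rather than dwelling on one easy region.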
[180] Dynamic Policy Induction for Adaptive Prompt Optimization: Bridging the Efficiency-Accuracy Gap via Lightweight Reinforcement Learning
Jiexi Xu
Main category: cs.LG
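TL;DR: The paper proposes the Prompt Policy Network (PPN), a lightweight reinforcement learning framework that dynamically selects prompting strategies to improve the accuracy of large language models while remaining efficient.
Details
Motivation: Static prompting strategies (such as Zero-Shot, Few-Shot, or Chain-of-Thought) impose an inherent trade-off between efficiency and accuracy and cannot adapt to the complexity of different tasks. An adaptive method is needed to select strategies dynamically and optimize resource use.
Contribution: 1. The PPN framework, which formalizes adaptive strategy selection as a single-step Markov Decision Process (MDP). 2. With a resource-explicit reward function and PPO training, PPN learns to allocate costly reasoning strategies efficiently.
Method: PPN is trained with reinforcement learning (PPO) to select prompting strategies dynamically. The reward function explicitly accounts for resource cost, optimizing the trade-off between efficiency and accuracy.
Result: On arithmetic reasoning benchmarks, PPN performs strongly on the efficiency-accuracy Pareto front, cutting token cost by up to 61.5% relative to Self-Consistency while maintaining competitive accuracy.
Insight: Lightweight reinforcement learning can handle the dynamic selection of prompting strategies, providing a systematic framework for cost-efficient deployment of large language models.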
Abstract: The performance of Large Language Models (LLMs) depends heavily on the chosen prompting strategy, yet static approaches such as Zero-Shot, Few-Shot, or Chain-of-Thought (CoT) impose a rigid efficiency-accuracy trade-off. Highly accurate strategies like Self-Consistency (SC) incur substantial computational waste on simple tasks, while lightweight methods often fail on complex inputs. This paper introduces the Prompt Policy Network (PPN), a lightweight reinforcement learning framework that formalizes adaptive strategy selection as a single-step Markov Decision Process (MDP). The PPN, trained with Proximal Policy Optimization (PPO) and guided by a resource-explicit reward function, learns to allocate costly reasoning strategies only when necessary. Experiments on arithmetic reasoning benchmarks demonstrate that PPN achieves superior performance on the efficiency-accuracy Pareto front, delivering up to 61.5% token cost reduction compared to Self-Consistency while maintaining competitive accuracy. This work contributes a systematic, adaptive framework for cost-efficient LLM deployment, advancing the design of lightweight optimization techniques for scalable and sustainable language model applications.
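A resource-explicit reward for a single-step decision can be sketched as correctness minus a token penalty. The abstract does not give the reward's exact form, so the linear penalty, the token counts, and the `cost_weight` value below are illustrative assumptions; the real PPN is a learned policy trained with PPO, not the lookup shown here.

```python
# Hypothetical average token cost per prompting strategy.
STRATEGY_TOKENS = {"zero_shot": 60, "chain_of_thought": 400, "self_consistency": 2000}


def reward(correct, tokens_used, cost_weight=2e-4):
    """Assumed reward: 1 for a correct answer, minus a linear token penalty."""
    return (1.0 if correct else 0.0) - cost_weight * tokens_used


def best_strategy(success_prob, cost_weight=2e-4):
    """Pick the strategy with the highest expected reward for one input."""
    def expected(name):
        return success_prob[name] - cost_weight * STRATEGY_TOKENS[name]
    return max(STRATEGY_TOKENS, key=expected)


# Hypothetical per-strategy success probabilities for two inputs.
easy = {"zero_shot": 0.95, "chain_of_thought": 0.97, "self_consistency": 0.98}
hard = {"zero_shot": 0.10, "chain_of_thought": 0.40, "self_consistency": 0.90}
```

Because the episode ends after one action, the problem is effectively a contextual bandit: the policy only has to learn which strategy maximizes this reward for each input.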
[181] Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs
Hao Ban,Kaiyi Ji
Main category: cs.LG
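TL;DR: The paper studies LoRA parameter sharing in multi-task fine-tuning and proposes ALoRA and Fed-ALoRA, which improve performance through an asymmetric design and a matrix decomposition strategy.
Details
Motivation: Prior work suggests that the LoRA A matrices are highly similar during training. The authors find that this similarity stems from identical initialization rather than shared knowledge, which prompts them to re-examine how parameters should be shared.
Contribution: ALoRA (an asymmetric multi-LoRA design) and Fed-ALoRA (sharing the B matrix in federated fine-tuning), with experiments validating their performance advantages.
Method: ALoRA uses multiple A matrices and a single shared B matrix. Fed-ALoRA shares B across clients and uses a matrix decomposition strategy to accommodate heterogeneous ranks.
Result: On commonsense reasoning, math reasoning, multi-task NLP, and federated NLP datasets, the methods achieve more balanced performance across tasks, with average accuracy comparable or superior to existing multi-LoRA approaches.
Insight: The B matrix plays the more critical role in knowledge encoding and transfer, whereas the similarity of the A matrices is mainly due to initialization rather than shared knowledge.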
Abstract: Large language models are often adapted using parameter-efficient techniques such as Low-Rank Adaptation (LoRA), formulated as $y = W_0x + BAx$, where $W_0$ is the pre-trained parameters and $x$ is the input to the adapted layer. While multi-adapter extensions often employ multiple LoRAs, prior studies suggest that the inner $A$ matrices are highly similar during training and thus suitable for sharing. We revisit this phenomenon and find that this similarity is largely attributable to the identical initialization rather than shared knowledge, with $B$ playing a more critical role in knowledge encoding and transfer. Motivated by these insights, we propose \textbf{ALoRA}, an asymmetric multi-LoRA design with multiple $A$ matrices and a single shared $B$ in multi-task fine-tuning, and \textbf{Fed-ALoRA}, which shares $B$ across clients in federated fine-tuning under both homogeneous and heterogeneous settings, through a novel matrix decomposition strategy to accommodate heterogeneous ranks across clients. Experiments on commonsense reasoning, math reasoning, multi-task NLP dataset, and federated NLP dataset demonstrate that our methods achieve more balanced performance across tasks with comparable or superior average accuracy relative to existing multi-LoRA approaches. Codes are available at https://github.com/OptMN-Lab/ALoRA.
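The asymmetric design follows directly from the formula in the abstract: each task $i$ gets its own $A_i$ while one $B$ is shared, so the adapted layer computes $y = W_0x + BA_ix$. The sketch below uses plain lists and toy dimensions; the function names and values are illustrative and not taken from the released code.

```python
def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def alora_forward(w0, shared_b, task_a, x):
    """Compute y = W0 x + B (A_i x) with a task-specific A_i and a shared B."""
    base = matvec(w0, x)
    low_rank = matvec(shared_b, matvec(task_a, x))
    return [b + l for b, l in zip(base, low_rank)]


# Toy setup: d_in = d_out = 2, rank r = 1 (all values hypothetical).
w0 = [[1.0, 0.0], [0.0, 1.0]]          # frozen pre-trained weights
shared_b = [[1.0], [2.0]]              # single B shared by every task
a_by_task = {"math": [[1.0, 0.0]],     # one A matrix per task
             "code": [[0.0, 1.0]]}

x = [3.0, 4.0]
y_math = alora_forward(w0, shared_b, a_by_task["math"], x)
y_code = alora_forward(w0, shared_b, a_by_task["code"], x)
```

With $T$ tasks this stores $T$ matrices of shape $r \times d_{in}$ plus one of shape $d_{out} \times r$, instead of $T$ full $(A, B)$ pairs.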
[182] Nudging the Boundaries of LLM Reasoning
Justin Chih-Yao Chen,Becky Xiangyu Peng,Prafulla Kumar Choubey,Kung-Hsiang Huang,Jiaxin Zhang,Mohit Bansal,Chien-Sheng Wu
Main category: cs.LG
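TL;DR: The paper proposes NuRL, which uses self-generated hints to push the upper bound of LLM reasoning, addressing the inability of conventional RL methods to learn from “unsolvable” problems. NuRL improves consistently across 6 benchmarks and 3 models and is complementary to test-time scaling.
Details
Motivation: A core limitation of online reinforcement learning algorithms such as GRPO in LLM reasoning is that they cannot learn from problems the model finds “unsolvable”. Such problems provide no gradient signal, so the model's upper limit cannot rise. NuRL aims to break this limitation with self-generated hints.
Contribution: 1. The NuRL method, which uses self-generated hints to introduce training signal for “unsolvable” problems; 2. a study showing the best form of hint (abstract and high-level) and the best time to apply it (after GRPO has converged); 3. validation of NuRL on multiple benchmarks and models.
Method: 1. Given a question and its gold answer, the model generates a chain of thought (CoT) and then a hint containing the core knowledge needed; 2. G rollouts are generated from the base policy, and the pass rate decides whether the hint is injected; 3. for hard samples with a 0% pass rate, the hint is injected and the trajectories are regenerated.
Result: NuRL improves consistently across 6 benchmarks and 3 models and can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. The best hints are abstract and high-level, and they work best when applied after GRPO has converged.
Insight: Self-generated hints avoid distributional shift and dependence on external models. Abstract, high-level hints are the most effective and should be used only when necessary. NuRL complements conventional RL methods and can further improve model performance.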
Abstract: Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are “unsolvable” to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. Consequently, the model’s “upper limit” remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard samples cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a “nudging” method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a CoT and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts pass rates (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated, avoiding distributional shift and any reliance on external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model’s upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level, and are most beneficial when applied only when necessary and after GRPO has converged.
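The training-time rule in the abstract (sample G rollouts, inject the hint only when the pass rate is 0%, then resample) can be sketched as below. The function names, the prompt format for the hint, and the toy policy are hypothetical stand-ins; a real setup would call the model and a verifier.

```python
def collect_rollouts(question, hint, sample_fn, is_correct, num_rollouts=8):
    """Return (answers, pass_rate, hint_used) following a 0%-pass-rate gate.

    Assumption: the hint is appended to the prompt as plain text; the
    paper's actual prompt template is not given in the abstract.
    """
    answers = [sample_fn(question) for _ in range(num_rollouts)]
    pass_rate = sum(1 for a in answers if is_correct(a)) / num_rollouts
    if pass_rate > 0.0:
        return answers, pass_rate, False  # solvable as-is: no nudge needed
    nudged_prompt = f"{question}\nHint: {hint}"
    answers = [sample_fn(nudged_prompt) for _ in range(num_rollouts)]
    pass_rate = sum(1 for a in answers if is_correct(a)) / num_rollouts
    return answers, pass_rate, True


def toy_policy(prompt):
    # Stand-in for the LLM: it only succeeds when a hint is present.
    return "42" if "Hint:" in prompt else "0"


answers, pass_rate, hint_used = collect_rollouts(
    "What is 6 * 7?", "Multiply the two numbers.",
    toy_policy, lambda a: a == "42", num_rollouts=4)
```

Gating on a zero pass rate keeps hints away from problems the policy can already solve, which matches the abstract's finding that hints help most when used only where needed.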
[183] Can VLM Pseudo-Labels Train a Time-Series QA Model That Outperforms the VLM?
Takuya Fujimura,Kota Dohi,Natsuo Yamashita,Yohei Kawaguchi
Main category: cs.LG
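TL;DR: The paper investigates whether a time-series question answering (TSQA) model can be trained on pseudo-labels generated by a vision-language model (VLM). The results show that this is feasible and that the trained model can surpass the VLM itself by exploiting large amounts of unlabeled data.
Details
Motivation: TSQA tasks suffer from a shortage of labeled data, while VLMs have shown potential to analyze time-series signals in a zero-shot setting.
Contribution: A training approach that uses VLM-generated pseudo-labels for TSQA models, with evidence that the models train successfully and can exceed the VLM's performance using unlabeled data.
Method: Pseudo-labels are generated by a VLM, and the TSQA model is trained on them, relying on the robustness of deep neural networks to noisy labels.
Result: Experiments show that TSQA models are trained successfully with pseudo-labels and surpass the performance of the VLM itself.
Insight: Even when pseudo-labels are noisy, deep neural networks can learn useful features from large amounts of unlabeled data and thereby improve performance.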
Abstract: Time-series question answering (TSQA) tasks face significant challenges due to the lack of labeled data. Alternatively, with recent advancements in large-scale models, vision-language models (VLMs) have demonstrated the potential to analyze time-series signals in a zero-shot manner. In this paper, we propose a training approach that uses pseudo labels generated by a VLM. Although VLMs can produce incorrect labels, TSQA models can still be effectively trained based on the property that deep neural networks are inherently robust to such noisy labels. Our experimental results demonstrate that TSQA models are not only successfully trained with pseudo labels, but also surpass the performance of the VLM itself by leveraging a large amount of unlabeled data.
[184] MuPlon: Multi-Path Causal Optimization for Claim Verification through Controlling Confounding
Hanghui Guo,Shimin Di,Pasquale De Meo,Zhangze Chen,Jia Zhu
Main category: cs.LG
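TL;DR: MuPlon is a multi-path causal optimization framework that improves claim verification accuracy by controlling confounding. It applies a dual causal intervention through a back-door path and a front-door path, which handle data noise and data bias respectively, and it outperforms existing methods.
Details
Motivation: Traditional claim verification methods overlook the complex interactions between pieces of evidence and are vulnerable to data noise and data bias, which makes their results unreliable. The authors propose the MuPlon framework to address these problems.
Contribution: 1. The concept of the Claim-Evidence Graph (C-E Graph); 2. a dual-path causal intervention strategy that handles noise and bias separately; 3. experiments demonstrating MuPlon's advantages.
Method: MuPlon combines a back-door path and a front-door path: 1. the back-door path optimizes node probability weights, diluting noise and strengthening the connections between relevant evidence; 2. the front-door path extracts highly relevant subgraphs, constructs reasoning paths, and applies counterfactual reasoning to eliminate bias.
Result: Experiments show that MuPlon outperforms existing methods on claim verification and achieves state-of-the-art performance.
Insight: The dual-path causal intervention strategy offers a new way to handle complex interactions and confounding factors, with particular promise in settings that demand high data quality and truthfulness.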
Abstract: As a critical task in data quality control, claim verification aims to curb the spread of misinformation by assessing the truthfulness of claims based on a wide range of evidence. However, traditional methods often overlook the complex interactions between evidence, leading to unreliable verification results. A straightforward solution represents the claim and evidence as a fully connected graph, which we define as the Claim-Evidence Graph (C-E Graph). Nevertheless, claim verification methods based on fully connected graphs face two primary confounding challenges, Data Noise and Data Biases. To address these challenges, we propose a novel framework, Multi-Path Causal Optimization (MuPlon). MuPlon integrates a dual causal intervention strategy, consisting of the back-door path and front-door path. In the back-door path, MuPlon dilutes noisy node interference by optimizing node probability weights, while simultaneously strengthening the connections between relevant evidence nodes. In the front-door path, MuPlon extracts highly relevant subgraphs and constructs reasoning paths, further applying counterfactual reasoning to eliminate data biases within these paths. The experimental results demonstrate that MuPlon outperforms existing methods and achieves state-of-the-art performance.
[185] Learning to Reason as Action Abstractions with Scalable Mid-Training RL
Shenao Zhang,Donghan Yu,Yihao Feng,Bowen Jin,Zhaoran Wang,John Peebles,Zirui Wang
Main category: cs.LG
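TL;DR: The paper proposes RA3 (Reasoning as Action Abstractions), a mid-training algorithm that identifies a compact action set to improve language models in reinforcement learning (RL), and validates it experimentally.
Details
Motivation: Large language models excel with reinforcement learning, but fully unlocking this potential requires a mid-training stage that narrows the action space and speeds up RL convergence.
Contribution: 1. The first theoretical analysis of how mid-training shapes post-training, which characterizes an optimal action subspace; 2. the RA3 algorithm, which uses action abstractions to improve performance and convergence speed.
Method: RA3 derives a sequential variational lower bound and optimizes it by iteratively discovering temporally consistent latent structures via RL, followed by fine-tuning on the bootstrapped data.
Result: On code generation tasks, RA3 improves average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline, and it achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
Insight: Mid-training is most effective when the decision space is compact and the effective horizon is short, so operating on action abstractions is more efficient than operating on primitive actions.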
Abstract: Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
[186] Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Ziniu Li,Congliang Chen,Tianyun Yang,Tian Ding,Ruoyu Sun,Ge Zhang,Wenhao Huang,Zhi-Quan Luo
Main category: cs.LG
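TL;DR: The paper proposes Knapsack RL, which allocates exploration budgets by casting the allocation as a knapsack problem, for reinforcement learning of large language models (LLMs). It improves both training efficiency and performance.
Details
Motivation: Current LLM reinforcement learning methods allocate exploration budgets uniformly. Easy tasks then consistently succeed and hard tasks consistently fail, and both cases produce zero gradients, which limits learning efficiency.
Contribution: 1. Formulates exploration budget allocation as the classical knapsack problem; 2. derives an adaptive allocation rule that adjusts budgets dynamically; 3. substantially raises the share of non-zero gradients and improves computational efficiency.
Method: Each task's exploration is treated as an “item” in a knapsack problem, and budgets are allocated dynamically according to its “value” and “cost”, which improves GRPO training.
Result: The method raises the effective ratio of non-zero policy gradients by 20-40% and improves mathematical reasoning benchmarks by 2-4 points on average, with peak gains of 9 points. Reaching comparable performance with uniform allocation would require about 2x the computational resources.
Insight: Dynamic budget allocation moves resources from tasks where learning is saturated to tasks with more potential, improving performance without adding computational cost.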
Abstract: Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task’s exploration as an “item” with a distinct “value” and “cost”, we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model’s current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational “free lunch”, our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
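The intuition behind budget allocation can be sketched with a simplified value function. With binary rewards, a group of $n$ rollouts yields a non-zero GRPO gradient only if the outcomes are mixed, which happens with probability $1 - p^n - (1-p)^n$ for per-rollout success rate $p$. The code below greedily assigns rollouts by marginal gain in that probability. This objective and the greedy rule are assumptions standing in for the paper's knapsack formulation, which defines its own item values and costs.

```python
def prob_mixed_outcomes(success_rate, num_rollouts):
    """Probability that a group contains both a success and a failure."""
    p = success_rate
    return 1.0 - p ** num_rollouts - (1.0 - p) ** num_rollouts


def allocate_budget(success_rates, total_budget, min_per_task=2):
    """Greedy stand-in for the knapsack solution (illustrative only).

    Each extra rollout goes to the task whose chance of producing a
    non-zero gradient increases the most.
    """
    allocation = [min_per_task] * len(success_rates)
    for _ in range(total_budget - sum(allocation)):
        gains = [prob_mixed_outcomes(p, n + 1) - prob_mixed_outcomes(p, n)
                 for p, n in zip(success_rates, allocation)]
        best = max(range(len(gains)), key=gains.__getitem__)
        allocation[best] += 1
    return allocation


# Hypothetical success rates: one medium task, one very hard, one very easy.
allocation = allocate_budget([0.5, 0.05, 0.95], total_budget=24)
```

Under this simplified value, both extremes receive more rollouts than the medium task, because a medium task already shows mixed outcomes with only a few samples. The paper's own value function may treat nearly solved tasks differently.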
[187] CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
Weiyu Huang,Yuezhou Hu,Jun Zhu,Jianfei Chen
Main category: cs.LG
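TL;DR: The paper proposes CAST, a continuous adaptive sparse training framework for large language models with semi-structured (N:M) sparsity. By jointly optimizing sparsity patterns and weights, CAST improves model performance while reducing the training resources required.
Details
Motivation: Sparsifying large language models lowers inference latency and memory consumption, but existing methods usually optimize sparsity patterns and weights separately, which is inefficient. CAST aims to solve this with continuous, differentiable joint optimization.
Contribution: 1) The CAST framework, which supports joint optimization of sparsity patterns and weights; 2) the AdamS optimizer, a Weight Scaling module, and knowledge distillation; 3) validation across several model scales, together with a scaling law for predicting sparse model performance.
Method: CAST achieves continuous, differentiable sparse training through three components: 1) the AdamS optimizer (adaptive L1 decay); 2) the Weight Scaling module (which mitigates the magnitude reduction caused by decay); 3) knowledge distillation (which uses the dense model as a self-teacher).
Result: Under the 2:4 sparsity pattern, CAST clearly outperforms prior methods on models from 125M to 13B parameters. On LLaMA2-7B, the sparse model's perplexity increases by only 0.09 and its zero-shot accuracy improves by 0.36% relative to the dense model, using only 2% of the original pretraining tokens.
Insight: CAST's continuous, differentiable joint optimization substantially improves sparse model performance, and the sparse models remain practical under quantization and fine-tuning.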
Abstract: Sparsity-aware training is an effective approach for transforming large language models (LLMs) into hardware-friendly sparse patterns, thereby reducing latency and memory consumption during inference. In this paper, we propose Continuous Adaptive Sparse Trainer (CAST), a fully continuous and differentiable sparsity-aware training framework for semi-structured (or “N:M”) sparse models. Unlike previous approaches that optimize sparsity patterns and weights separately, CAST enables seamless joint optimization during training, while progressively transforming the model into the desired sparsity format. Specifically, CAST introduces three key components: 1) AdamS, a sparsity-aware optimizer that leverages adaptive L1 decay to promote uniform sparsification across all parameters; 2) Weight Scaling, a module designed to mitigate the magnitude reduction caused by decay while preserving desired sparsity patterns; 3) Knowledge Distillation, which employs the dense model as a self-teacher to enhance training efficiency. We evaluate CAST under 2:4 sparsity patterns across multiple model families, ranging from 125M to 13B parameters. Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources. Notably, on LLaMA2-7B, our 2:4 sparse model achieves a negligible perplexity increase of 0.09 and a 0.36% gain in zero-shot accuracy compared to the dense model using only 2% of the original pretraining tokens. Additionally, we establish an accurate and robust empirical scaling law to predict sparse model performance given adequate training resources. Finally, we demonstrate the practical applicability of our sparse models by evaluating them under quantization and fine-tuning scenarios.
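The target format itself is simple to state: in N:M sparsity, every group of M consecutive weights keeps at most N non-zeros. The sketch below shows one-shot magnitude pruning to the 2:4 pattern, purely to illustrate the format. It is not CAST, which reaches this pattern gradually during training; the function name is hypothetical.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m; zero the rest.

    Assumption: one-shot magnitude pruning, used here only to show what
    a 2:4 sparse weight row looks like.
    """
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        by_magnitude = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(by_magnitude[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
sparse_row = nm_prune(row)
```

Because exactly half of each group of four is zero, hardware that supports 2:4 sparsity can skip the zeros with a fixed, predictable layout, which is what makes the pattern hardware-friendly.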
[188] FITS: Towards an AI-Driven Fashion Information Tool for Sustainability
Daphne Theodorakopoulos,Elisabeth Eberling,Miriam Bodenheimer,Sabine Loos,Frederic Stahl
Main category: cs.LG
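TL;DR: The paper “FITS: Towards an AI-Driven Fashion Information Tool for Sustainability” presents FITS, a transformer-based system that uses natural language processing to extract and classify sustainability information about fashion brands from unstructured text, addressing the scarcity of such information.
Details
Motivation: Despite growing public and regulatory demand for transparency in the fashion industry, credible sustainability information remains hard to obtain and hard to interpret. General-purpose language models lack domain knowledge and tend to hallucinate, which is especially harmful in fields that require factual accuracy.
Contribution: 1. The FITS tool, built on fine-tuned BERT-based models, which extracts and classifies sustainability information for fashion brands; 2. the SustainableTextileCorpus dataset and a methodology for updating it; 3. a user evaluation that confirms the tool's value.
Method: Several BERT-based language models, including models pretrained on scientific and climate-specific data, are fine-tuned with a custom classification schema, with hyperparameters tuned via Bayesian optimization.
Result: FITS was evaluated in two focus groups of potential users on usability, visual design, content clarity, possible use cases, and desired features. The results support the value of domain-adapted NLP.
Insight: Domain-adapted natural language processing can support informed decision-making on sustainability and illustrates the potential of AI in addressing climate-related challenges.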
Abstract: Access to credible sustainability information in the fashion industry remains limited and challenging to interpret, despite growing public and regulatory demands for transparency. General-purpose language models often lack domain-specific knowledge and tend to “hallucinate”, which is particularly harmful for fields where factual correctness is crucial. This work explores how Natural Language Processing (NLP) techniques can be applied to classify sustainability data for fashion brands, thereby addressing the scarcity of credible and accessible information in this domain. We present a prototype Fashion Information Tool for Sustainability (FITS), a transformer-based system that extracts and classifies sustainability information from credible, unstructured text sources: NGO reports and scientific publications. Several BERT-based language models, including models pretrained on scientific and climate-specific data, are fine-tuned on our curated corpus using a domain-specific classification schema, with hyperparameters optimized via Bayesian optimization. FITS allows users to search for relevant data, analyze their own data, and explore the information via an interactive interface. We evaluated FITS in two focus groups of potential users concerning usability, visual design, content clarity, possible use cases, and desired features. Our results highlight the value of domain-adapted NLP in promoting informed decision-making and emphasize the broader potential of AI applications in addressing climate-related challenges. Finally, this work provides a valuable dataset, the SustainableTextileCorpus, along with a methodology for future updates. Code available at https://github.com/daphne12345/FITS
[189] Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
Xin Xu,Cliveb AI,Kai Yang,Tianhao Chen,Yang Wang,Saiyong Yang,Can Yang
Main category: cs.LG
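TL;DR: The paper proposes TFPI, a method that cuts token usage by discarding the thinking content at inference time, improving model performance and training efficiency while avoiding the compute overhead of conventional approaches.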
Details
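Motivation: Conventional RLVR requires extremely long contexts during training, which makes it computationally expensive. Multi-stage training only partly helps: starting with contexts that are too short can cause irreversible performance degradation, so overall training compute is not significantly reduced.
Contribution: Proposes TFPI, which uses a simple ThinkFree operation (explicitly discarding the thinking content) to reduce token usage at inference time and thereby improve model performance and training efficiency.
Method: Applies the ThinkFree operation, which explicitly discards the thinking content by directly appending a ‘’ token to the input, lowering token consumption at inference time.
Result: Experiments show that TFPI accelerates RL convergence, reaches a higher performance ceiling, and markedly reduces token usage across multiple benchmarks. For example, a 4B model reaches 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench.
Insight: TFPI shows that a simple input-adaptation strategy can substantially cut compute cost while maintaining performance, suggesting a new direction for training efficient reasoning models.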
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation, explicitly discarding the thinking content via a direct append, to reduce token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.
[190] Hyperbolic Optimization
Yanke Wang,Kyriakos Flouris
Main category: cs.LG
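TL;DR: The paper proposes a hyperbolic optimization approach that extends hyperbolic stochastic gradient descent (SGD) to a hyperbolic Adam optimizer for optimization on hyperbolic manifolds. The method is particularly relevant for learning on the Poincaré ball and can also speed up convergence in other non-Euclidean settings.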
Details
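Motivation: Conventional optimization methods perform poorly in non-Euclidean spaces such as hyperbolic manifolds, whereas hyperbolic optimization shows promise for representation learning (e.g., Poincaré embeddings). Efficient optimization on hyperbolic manifolds is therefore worth studying.
Contribution: Extends Riemannian optimization methods to hyperbolic manifolds, proposes a hyperbolic Adam optimizer, and verifies its faster convergence on diffusion models.
Method: Building on Riemannian optimization principles, extends hyperbolic SGD to hyperbolic Adam and trains diffusion models using a hyperbolic time-discretization of the Langevin dynamics.
Result: Experiments show that hyperbolic optimization markedly accelerates convergence in the early stages of training without sacrificing generative quality.
Insight: Hyperbolic optimization is advantageous for problems in non-Euclidean spaces and converges faster especially when parameters are far from the optimum.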
Abstract: This work explores optimization methods on hyperbolic manifolds. Building on Riemannian optimization principles, we extend the Hyperbolic Stochastic Gradient Descent (a specialization of Riemannian SGD) to a Hyperbolic Adam optimizer. While these methods are particularly relevant for learning on the Poincaré ball, they may also provide benefits in Euclidean and other non-Euclidean settings, as the chosen optimization encourages the learning of Poincaré embeddings. This representation, in turn, accelerates convergence in the early stages of training, when parameters are far from the optimum. As a case study, we train diffusion models using the hyperbolic optimization methods with hyperbolic time-discretization of the Langevin dynamics, and show that they achieve faster convergence on certain datasets without sacrificing generative quality.
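As a rough illustration of the Riemannian SGD that the abstract builds on, the sketch below takes retraction-based steps on the Poincaré ball with curvature -1, where the Euclidean gradient is rescaled by the inverse metric factor ((1 - ||x||^2) / 2)^2. It is the commonly used textbook update, not the paper's Hyperbolic Adam; the toy loss, the learning rate, and all function names are assumptions made for this example.

```python
import math

EPS = 1e-5  # keeps iterates strictly inside the open unit ball


def project_to_ball(x):
    """Rescale x so that its norm stays below 1 - EPS."""
    norm = math.sqrt(sum(v * v for v in x))
    max_norm = 1.0 - EPS
    if norm >= max_norm:
        return [v * max_norm / norm for v in x]
    return list(x)


def riemannian_sgd_step(x, euclid_grad, lr):
    """One retraction-based Riemannian SGD step on the Poincare ball."""
    sq_norm = sum(v * v for v in x)
    # Inverse metric factor: turns the Euclidean gradient into the Riemannian one.
    scale = (1.0 - sq_norm) ** 2 / 4.0
    stepped = [v - lr * scale * g for v, g in zip(x, euclid_grad)]
    return project_to_ball(stepped)


def minimize_toy_loss(target, steps=200, lr=0.5):
    """Minimize the toy loss ||x - target||^2 (an assumption, not the paper's loss)."""
    x = [0.0 for _ in target]
    for _ in range(steps):
        grad = [2.0 * (v - t) for v, t in zip(x, target)]
        x = riemannian_sgd_step(x, grad, lr)
    return x


result = minimize_toy_loss([0.5, 0.3])
```

The rescaling shrinks steps near the boundary of the ball, where the hyperbolic metric blows up, and the projection guards against numerical drift outside the ball.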
[191] InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
Liangjian Wen,Qun Dai,Jianzhuang Liu,Jiangtao Zheng,Yong Dai,Dongkai Wang,Zhao Kang,Jun Wang,Zenglin Xu,Jiang Duan
Main category: cs.LG
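TL;DR: InfMasking is a novel contrastive multimodal interaction method that uses an infinite masking strategy to enhance synergistic information between modalities and thereby improve multimodal representations.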
Details
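Motivation: Existing methods struggle to fully capture synergistic information across modalities, which limits the performance of multimodal representations. Because synergistic information is the core value of multimodal learning, a more effective way to extract and exploit it is needed.
Contribution: Proposes InfMasking, whose infinite masking strategy randomly masks modality features during fusion and preserves only partial information to produce diverse synergistic patterns. Also derives an InfMasking loss that approximates the mutual information maximization.
Method: InfMasking adopts a contrastive learning framework in which the infinite masking strategy randomly masks modality features to generate diverse synergistic patterns. Unmasked fused representations are aligned with masked ones, and comprehensive synergistic information is encoded through mutual information maximization.
Result: Experiments on seven large-scale real-world datasets show that InfMasking markedly enhances synergistic information between modalities and achieves state-of-the-art performance.
Insight: By generating diverse synergistic patterns, the infinite masking strategy strengthens generalization and the model's ability to capture synergistic information, offering a new direction for multimodal learning.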
Abstract: In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
[192] Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
John Gkountouras,Ivan Titov
Main category: cs.LG
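TL;DR: This paper proposes Adaptive-Clarification Reinforcement Learning (AC-RL), which trains vision models through interaction to produce more comprehensive text descriptions, improving performance on visual mathematical reasoning tasks.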
Details
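Motivation: Text descriptions produced by current vision-language models often omit details that are critical to the reasoning system, so reasoning fails because of missing information rather than limited model capability. A method that automatically learns to generate more comprehensive descriptions is therefore needed.
Contribution: 1. Proposes AC-RL, which exposes information gaps through interaction and clarification requests and optimizes the vision model's text generation. 2. Learns an effective vision-language interface through interaction alone, without explicit annotations.
Method: AC-RL uses clarification requests as an implicit supervision signal: during training it penalizes successes that required clarification, pushing the model toward more comprehensive initial descriptions.
Result: Across seven visual mathematical reasoning benchmarks, AC-RL improves average accuracy by 4.4 points over pretrained baselines and can cut clarification requests by up to 39%.
Insight: Clarification requests can serve as an implicit supervision signal that helps vision models learn to generate more comprehensive descriptions, improving downstream reasoning performance.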
Abstract: Recent text-only models demonstrate remarkable mathematical reasoning capabilities. Extending these to visual domains requires vision-language models to translate images into text descriptions. However, current models, trained to produce captions for human readers, often omit the precise details that reasoning systems require. This creates an interface mismatch: reasoners often fail not due to reasoning limitations but because they lack access to critical visual information. We propose Adaptive-Clarification Reinforcement Learning (AC-RL), which teaches vision models what information reasoners need through interaction. Our key insight is that clarification requests during training reveal information gaps; by penalizing success that requires clarification, we create pressure for comprehensive initial captions that enable the reasoner to solve the problem in a single pass. AC-RL improves average accuracy by 4.4 points over pretrained baselines across seven visual mathematical reasoning benchmarks, and analysis shows it would cut clarification requests by up to 39% if those were allowed. By treating clarification as a form of implicit supervision, AC-RL demonstrates that vision-language interfaces can be effectively learned through interaction alone, without requiring explicit annotations.
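The idea of penalizing success that requires clarification can be pictured as a simple reward-shaping rule. The sketch below is only one way to express it: the abstract does not give the reward values, so the `penalty` hyperparameter, the 0/1 base reward, and the function name are all hypothetical.

```python
from statistics import mean


def shaped_reward(correct, used_clarification, penalty=0.5):
    """Reward for one episode; `penalty` is a hypothetical hyperparameter."""
    if not 0.0 <= penalty <= 1.0:
        raise ValueError("penalty must lie in [0, 1]")
    if not correct:
        return 0.0  # a wrong answer earns nothing either way
    # Correct answers that needed a clarification round earn a discounted reward.
    return 1.0 - penalty if used_clarification else 1.0


# Toy batch of (correct, used_clarification) outcomes.
episodes = [(True, False), (True, True), (False, True), (False, False)]
rewards = [shaped_reward(c, u) for c, u in episodes]
batch_mean = mean(rewards)
```

Under such a rule, a caption that lets the reasoner succeed in a single pass always earns at least as much as one that succeeds only after a clarification round, which is the pressure the abstract describes.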
[193] Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
Runze Liu,Jiakang Wang,Yuling Shi,Zhihui Xie,Chenxin An,Kaiyan Zhang,Jian Zhao,Xiaodong Gu,Lei Lin,Wenping Hu,Xiu Li,Fuzheng Zhang,Guorui Zhou,Kun Gai
Main category: cs.LG
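TL;DR: The paper proposes a new process-supervised reinforcement learning framework (AttnRL) that improves the reasoning ability of large language models through efficient exploration.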
Details
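Motivation: Existing process-supervised RL (PSRL) methods are limited in exploration efficiency, both in branching positions and in sampling, so a more efficient exploration strategy is needed.
Contribution: Proposes the AttnRL framework, which uses attention scores to guide branch selection, together with an adaptive sampling strategy and a one-step off-policy training pipeline.
Method: 1. Selects branching positions based on high attention scores; 2. Uses an adaptive sampling strategy that accounts for problem difficulty and historical batch size; 3. Uses one-step off-policy training to improve sampling efficiency.
Result: On multiple mathematical reasoning benchmarks, AttnRL outperforms existing methods in performance as well as sampling and training efficiency.
Insight: Attention scores can effectively guide exploration, and the adaptive sampling strategy markedly improves training efficiency.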
Abstract: Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm compared to outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in terms of branching positions and sampling. In this paper, we introduce a novel PSRL framework (AttnRL), which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high values. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in terms of performance and sampling and training efficiency.
[194] Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction
Tingyu Shi,Fan Lyu,Shaoliang Peng
Main category: cs.LG
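TL;DR: CPATTA is a conformal-prediction-based method that actively adapts to domain shift at test time and improves annotation efficiency through coverage-guaranteed uncertainty.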
Details
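Motivation: Existing ATTA methods rely on heuristic uncertainty measures, which makes annotation inefficient and wastes the annotation budget. CPATTA aims to address this with principled uncertainty.
Contribution: 1. First to bring coverage-guaranteed conformal prediction into ATTA; 2. Proposes smoothed conformal scores with a top-K certainty measure; 3. Designs an online weight-update algorithm and a domain-shift detector; 4. Improves accuracy by around 5%.
Method: 1. Measures uncertainty with smoothed conformal scores and a top-K certainty measure; 2. Drives online weight updates with pseudo coverage; 3. Combines a domain-shift detector with a staged update scheme to balance human-labeled and model-labeled data.
Result: In experiments, CPATTA clearly outperforms existing ATTA methods, with accuracy gains of around 5%.
Insight: Conformal prediction can effectively improve annotation efficiency in ATTA, and its coverage guarantee gives the uncertainty measure a theoretical footing.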
Abstract: Active Test-Time Adaptation (ATTA) improves model robustness under domain shift by selectively querying human annotations at deployment, but existing methods use heuristic uncertainty measures and suffer from low data selection efficiency, wasting human annotation budget. We propose Conformal Prediction Active TTA (CPATTA), which first brings principled, coverage-guaranteed uncertainty into ATTA. CPATTA employs smoothed conformal scores with a top-K certainty measure, an online weight-update algorithm driven by pseudo coverage, a domain-shift detector that adapts human supervision, and a staged update scheme that balances human-labeled and model-labeled data. Extensive experiments demonstrate that CPATTA consistently outperforms the state-of-the-art ATTA methods by around 5% in accuracy. Our code and datasets are available at https://github.com/tingyushi/CPATTA.
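For readers unfamiliar with coverage-guaranteed uncertainty, the sketch below shows plain split conformal prediction for classification: a threshold is calibrated on held-out nonconformity scores, and every test input receives a set of labels whose size reflects how uncertain the model is. This is the standard baseline procedure, not CPATTA's smoothed scores, top-K certainty measure, or online weight updates; the toy probabilities and function names are assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate the score threshold for target coverage 1 - alpha."""
    # Nonconformity score: one minus the probability assigned to the true label.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed rank of the quantile
    if rank > n:
        return float("inf")  # too few calibration points: every label is kept
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration data: per-class probabilities and true labels.
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
cal_labels = [0, 1, 0, 1]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident = prediction_set([0.95, 0.05], threshold)
ambiguous = prediction_set([0.5, 0.5], threshold)
```

A larger prediction set signals a less certain sample, which is the kind of principled signal an active method could use when deciding which test samples to send for human annotation.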
[195] Optimizing Indoor Environmental Quality in Smart Buildings Using Deep Learning
Youssef Sabiri,Walid Houmaidi,Aaya Bougrine,Salmane El Mansour Billah
Main category: cs.LG
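TL;DR: This paper proposes a deep-learning-based approach to optimizing indoor environmental quality (IEQ) in smart buildings, using three architectures (LSTM, GRU, and CNN-LSTM) to forecast CO2 concentration, temperature, and humidity while balancing energy efficiency and comfort.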
Details
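Motivation: Conventional HVAC systems consume a lot of energy to maintain IEQ, so an intelligent method that balances energy savings and comfort is needed.
Contribution: Proposes a deep-learning-driven approach to IEQ management, compares the performance of three architectures, and offers practical guidance for intelligent BMS.
Method: Uses the ROBOD dataset to compare the forecasting performance of LSTM, GRU, and CNN-LSTM across different time horizons.
Result: GRU performs best for short-term prediction, CNN-LSTM is suited to feature extraction over extended forecasting windows, and LSTM is robust for long-range temporal modeling.
Insight: Prediction reliability depends on data resolution, sensor placement, and fluctuating occupancy conditions, which points to where real-world deployments can be improved.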
Abstract: Ensuring optimal Indoor Environmental Quality (IEQ) is vital for occupant health and productivity, yet it often comes at a high energy cost in conventional Heating, Ventilation, and Air Conditioning (HVAC) systems. This paper proposes a deep-learning-driven approach to proactively manage IEQ parameters, specifically CO2 concentration, temperature, and humidity, while balancing building energy efficiency. Leveraging the ROBOD dataset collected from a net-zero energy academic building, we benchmark three architectures, namely Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and a hybrid Convolutional Neural Network LSTM (CNN-LSTM), to forecast IEQ variables across various time horizons. Our results show that GRU achieves the best short-term prediction accuracy with lower computational overhead, whereas CNN-LSTM excels in extracting dominant features for extended forecasting windows. Meanwhile, LSTM offers robust long-range temporal modeling. The comparative analysis highlights that prediction reliability depends on data resolution, sensor placement, and fluctuating occupancy conditions. These findings provide actionable insights for intelligent Building Management Systems (BMS) to implement predictive HVAC control, thereby reducing energy consumption and enhancing occupant comfort in real-world building operations.
[196] Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han,Shengbang Tong,David Fan,Yufan Ren,Koustuv Sinha,Philip Torr,Filippos Kokkinos
Main category: cs.LG
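TL;DR: The paper shows that large language models (LLMs) unexpectedly acquire rich visual priors through language pre-training, separates them into perception and reasoning priors, and proposes a data-centric pre-training recipe, offering a new direction for multimodal LLMs.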
Details
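Motivation: To study how LLMs trained only on text acquire visual priors and how these priors can be applied to vision tasks, in order to advance multimodal LLMs.
Contribution: 1) Shows that LLM visual priors are separable into perception and reasoning priors; 2) Analyzes how different data affect visual priors; 3) Proposes a pre-training recipe for vision-aware LLMs.
Method: Analyzes the full pipeline of LLM pre-training, visual alignment, and multimodal fine-tuning through more than 100 controlled experiments consuming 500,000 GPU-hours, and introduces the MLE-Bench evaluation benchmark.
Result: The reasoning prior is driven mainly by data such as code and math, whereas the perception prior depends more on broad corpora; the performance impact of text describing the visual world saturates quickly.
Insight: The visual abilities of LLMs can be deliberately cultivated through language pre-training, which provides important guidance for designing multimodal LLMs.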
Abstract: Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM’s latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
cs.RO [Back]
[197] dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Junjie Wen,Minjie Zhu,Jiaming Liu,Zhiyuan Liu,Yicun Yang,Linfeng Zhang,Shanghang Zhang,Yichen Zhu,Yi Xu
Main category: cs.RO
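TL;DR: The paper introduces dVLA, a diffusion-based vision-language-action model that unifies visual perception, language reasoning, and robotic control through a multimodal chain-of-thought, showing strong cross-modal reasoning and generalization.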
Details
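Motivation: Vision-language-action (VLA) models are a next-generation paradigm for robotics, but existing models still fall short in cross-modal reasoning and generalization. This paper aims to improve the joint optimization of perception, language understanding, and action planning within a unified framework.
Contribution: 1. Proposes dVLA, a diffusion-based VLA model; 2. Unifies cross-modal reasoning through a multimodal chain-of-thought; 3. Introduces inference acceleration strategies (a prefix attention mask and KV caching).
Method: 1. Jointly optimizes perception, language understanding, and action under a single diffusion objective; 2. Performs cross-modal reasoning within a multimodal chain-of-thought framework; 3. Accelerates inference with a prefix attention mask and KV caching.
Result: 1. Achieves a 96.4% average success rate on the LIBERO benchmark; 2. Completes complex tasks on a real Franka robot, such as bin-picking that requires multi-step planning; 3. Substantially speeds up inference.
Insight: A unified diffusion framework can effectively improve the cross-modal capability and practicality of VLA models, and the acceleration strategies make real-world deployment more feasible.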
Abstract: Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, yielding up to around times speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
[198] SDA-PLANNER: State-Dependency Aware Adaptive Planner for Embodied Task Planning
Zichao Shen,Chen Gao,Jiaqi Yuan,Tianchen Zhu,Xingcheng Fu,Qingyun Sun
Main category: cs.RO
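TL;DR: SDA-PLANNER is an LLM-based adaptive task planning method that uses a State-Dependency Graph and an error-adaptive replanning strategy to markedly improve the accuracy and robustness of task execution.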
Details
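Motivation: Existing LLM-based task planners suffer from fixed planning paradigms, a lack of action sequence constraints, and insensitivity to errors, which limits their performance in complex environments.
Contribution: Proposes a State-Dependency Graph and an error-adaptive replanning strategy, enabling dynamic planning and local repair.
Method: SDA-PLANNER uses the State-Dependency Graph to explicitly model action preconditions and effects, and performs local replanning through Error Backtrack and Diagnosis and Adaptive Action SubTree Generation.
Result: Experiments show that SDA-PLANNER outperforms baselines under various error conditions, notably in success rate and goal completion.
Insight: Explicitly modeling action dependencies and handling errors dynamically are key to more robust task planning.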
Abstract: Embodied task planning requires agents to produce executable actions in a closed-loop manner within the environment. With progressively improving capabilities of LLMs in task decomposition, planning, and generalization, current embodied task planning methods adopt LLM-based architectures. However, existing LLM-based planners remain limited in three aspects, i.e., fixed planning paradigms, lack of action sequence constraints, and error-agnostic behavior. In this work, we propose SDA-PLANNER, enabling an adaptive planning paradigm, state-dependency aware and error-aware mechanisms for comprehensive embodied task planning. Specifically, SDA-PLANNER introduces a State-Dependency Graph to explicitly model action preconditions and effects, guiding the dynamic revision. To handle execution errors, it employs an error-adaptive replanning strategy consisting of Error Backtrack and Diagnosis and Adaptive Action SubTree Generation, which locally reconstructs the affected portion of the plan based on the current environment state. Experiments demonstrate that SDA-PLANNER consistently outperforms baselines in success rate and goal completion, particularly under diverse error conditions.
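To make the notion of explicit action preconditions and effects concrete, the sketch below simulates a plan over a set of state facts and reports the first action whose preconditions are unmet, which is the kind of information a local repair step would need. It is a simplified STRIPS-style illustration, not the paper's State-Dependency Graph or replanning algorithm; the `Action` class, the fact names, and `first_violation` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset = frozenset()


def first_violation(plan, initial_state):
    """Return (index, missing facts) for the first inapplicable action, else None."""
    state = set(initial_state)
    for index, action in enumerate(plan):
        missing = action.preconditions - state
        if missing:
            return index, missing
        # Apply the action's effects to obtain the next state.
        state -= action.del_effects
        state |= action.add_effects
    return None


open_fridge = Action("open_fridge", frozenset({"near_fridge"}), frozenset({"fridge_open"}))
grab_milk = Action(
    "grab_milk",
    frozenset({"fridge_open", "hand_empty"}),
    frozenset({"holding_milk"}),
    frozenset({"hand_empty"}),
)

start = {"near_fridge", "hand_empty"}
broken = first_violation([grab_milk], start)
repaired = first_violation([open_fridge, grab_milk], start)
```

Here the missing fact `fridge_open` points directly at an action that can supply it, so only the affected part of the plan needs to be rebuilt.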
cs.IT [Back]
[199] Challenges and Solutions in Selecting Optimal Lossless Data Compression Algorithms
Md. Atiqur Rahman,MM Fazle Rabbi
Main category: cs.IT
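TL;DR: This paper presents a mathematical framework for uniformly evaluating lossless data compression algorithms; by balancing compression ratio, encoding time, and decoding time, it gives users a tool for selecting the most suitable algorithm.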
Details
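Motivation: The rapid growth of digital data has increased the demand for lossless compression, but existing algorithms trade off between metrics and rarely satisfy requirements such as compression ratio and speed at the same time.
Contribution: Proposes a mathematical framework that combines compression ratio, encoding time, and decoding time into a single performance score, supporting algorithm selection under multi-metric trade-offs.
Method: Integrates multiple performance metrics through normalization and weighting into a unified scoring model, validated experimentally on image and text datasets.
Result: Experiments show that the framework reliably selects the most suitable algorithm for applications with different priorities, and reveal that learning-based codecs have the advantage in compression ratio while classical algorithms have the advantage in speed.
Insight: Learning-based algorithms achieve high compression ratios but are slow, whereas classical algorithms are fast but compress less; the framework gives users a principled basis for choosing between them.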
Abstract: The rapid growth of digital data has heightened the demand for efficient lossless compression methods. However, existing algorithms exhibit trade-offs: some achieve high compression ratios, others excel in encoding or decoding speed, and none consistently perform best across all dimensions. This mismatch complicates algorithm selection for applications where multiple performance metrics are simultaneously critical, such as medical imaging, which requires both compact storage and fast retrieval. To address this challenge, we present a mathematical framework that integrates compression ratio, encoding time, and decoding time into a unified performance score. The model normalizes and balances these metrics through a principled weighting scheme, enabling objective and fair comparisons among diverse algorithms. Extensive experiments on image and text datasets validate the approach, showing that it reliably identifies the most suitable compressor for different priority settings. Results also reveal that while modern learning-based codecs often provide superior compression ratios, classical algorithms remain advantageous when speed is paramount. The proposed framework offers a robust and adaptable decision-support tool for selecting optimal lossless data compression techniques, bridging theoretical measures with practical application needs.
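One way to picture a unified score of this kind is min-max normalization followed by a weighted sum, sketched below. The abstract does not state the paper's actual formula, so the normalization, the weights, the codec names, and the measurements are all assumptions made up for illustration.

```python
import math


def min_max(values, higher_is_better):
    """Scale values to [0, 1] so that 1 is always the best candidate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # no spread: every candidate ties
    if higher_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def unified_scores(candidates, weights):
    """candidates: name -> (ratio, encode_s, decode_s); weights must sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    names = list(candidates)
    columns = list(zip(*(candidates[n] for n in names)))
    directions = (True, False, False)  # ratio: higher is better; times: lower is better
    normalized = [min_max(list(col), d) for col, d in zip(columns, directions)]
    return {
        name: sum(w * col[i] for w, col in zip(weights, normalized))
        for i, name in enumerate(names)
    }


def best(candidates, weights):
    scores = unified_scores(candidates, weights)
    return max(scores, key=scores.get)


# Made-up measurements for three hypothetical codecs, for illustration only.
toy = {
    "codec_a": (2.0, 1.0, 1.0),
    "codec_b": (4.0, 10.0, 8.0),
    "codec_c": (3.0, 3.0, 2.0),
}
storage_first = best(toy, (0.8, 0.1, 0.1))
speed_first = best(toy, (0.2, 0.4, 0.4))
```

With the ratio-heavy weights the slow but compact `codec_b` wins, while the speed-heavy weights favor `codec_a`, mirroring the trade-off between learning-based and classical compressors described in the abstract.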