Table of Contents
- cs.CL [Total: 26]
- cs.CV [Total: 50]
- eess.SY [Total: 1]
- eess.IV [Total: 1]
- cs.CR [Total: 1]
- cs.MA [Total: 1]
- cs.SD [Total: 1]
- cs.LG [Total: 5]
- cs.AI [Total: 2]
- cs.RO [Total: 5]
cs.CL
[1] TextualVerifier: Verify TextGrad Step-by-Step
Eugenius Mario Situmorang,Adila Alfa Krisnadhi,Ari Wibisono
Main category: cs.CL
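TL;DR: TextualVerifier is a verification framework that gives TextGrad a self-verification mechanism through four stages (chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation), making text-based optimization reasoning more reliable.
Details
Motivation: TextGrad, a text-based automatic differentiation method, lacks a self-verification mechanism and so cannot guarantee that the reasoning behind its text-based decisions is valid. This work aims to close that verification gap.
Contribution: Proposes TextualVerifier, presented as the first LLM-based self-verification framework for TextGrad; it requires no numerical gradients and improves reasoning reliability.
Method: A four-stage workflow (chain-of-thought decomposition, variant generation, majority voting, consensus aggregation), integrated into TextGrad at the loss function and at the optimization-result verification stage.
Result: On PRM800K, the validity of reasoning steps improves by 29%. Integrating the verifier into the TextGrad loss function raises accuracy by 2.2 percentage points (68.2% to 70.4%), and the improvements on GPQA-Diamond, MMLU-ML, and MMLU-CP are statistically significant (p < 0.001).
Insight: LLM-based self-verification opens a new direction for text-based optimization and shows promise for improving reasoning reliability.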
Abstract: TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into TextGrad loss function yields a 2.2 percentage point gain from 68.2 to 70.4 percent with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versioning yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
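The majority-voting stage named in the abstract can be sketched as follows. This is a minimal illustration, assuming an LLM has already rewritten each reasoning step into several variants; the function name `majority_vote`, the normalization rule, and the example strings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(variants):
    """Return the most common variant and its share of the votes.

    Variants are compared after trimming whitespace and lowercasing.
    This normalization is an assumption; the paper may compare
    variants differently (for example, with an LLM judge).
    """
    if not variants:
        raise ValueError("need at least one variant")
    normalized = [v.strip().lower() for v in variants]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Three hypothetical rewrites of the same reasoning step.
step_variants = ["x = 4", "x = 4 ", "x = 5"]
winner, share = majority_vote(step_variants)
```

A consensus-aggregation stage could then keep only the steps whose vote share clears a chosen cutoff; that cutoff would be a design choice, not something the abstract specifies.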
[2] PLLuM: A Family of Polish Large Language Models
Jan Kocoń,Maciej Piasecki,Arkadiusz Janz,Teddy Ferdinan,Łukasz Radliński,Bartłomiej Koptyra,Marcin Oleksy,Stanisław Woźniak,Paweł Walkowiak,Konrad Wojtasik,Julia Moska,Tomasz Naskręt,Bartosz Walkowiak,Mateusz Gniewkowski,Kamil Szyc,Dawid Motyka,Dawid Banach,Jonatan Dalasiński,Ewa Rudnicka,Bartłomiej Alberski,Tomasz Walkowiak,Aleksander Szczęsny,Maciej Markiewicz,Tomasz Bernaś,Hubert Mazur,Kamil Żyta,Mateusz Tykierko,Grzegorz Chodak,Tomasz Kajdanowicz,Przemysław Kazienko,Agnieszka Karlińska,Karolina Seweryn,Anna Kołos,Maciej Chrabąszcz,Katarzyna Lorenc,Aleksandra Krasnodębska,Artur Wilczek,Katarzyna Dziewulska,Paula Betscher,Zofia Cieślińska,Katarzyna Kowol,Daria Mikoś,Maciej Trzciński,Dawid Krutul,Marek Kozłowski,Sławomir Dadas,Rafał Poświata,Michał Perełkiewicz,Małgorzata Grębowiec,Maciej Kazuła,Marcin Białas,Roman Roszko,Danuta Roszko,Jurgita Vaičenonienė,Andrius Utka,Paweł Levchuk,Paweł Kowalski,Irena Prawdzic-Jankowska,Maciej Ogrodniczuk,Monika Borys,Anna Bulińska,Wiktoria Gumienna,Witold Kieraś,Dorota Komosińska,Katarzyna Krasnowska-Kieraś,Łukasz Kobyliński,Martyna Lewandowska,Marek Łaziński,Mikołaj Łątkowski,Dawid Mastalerz,Beata Milewicz,Agnieszka Anna Mykowiecka,Angelika Peljak-Łapińska,Sandra Penno,Zuzanna Przybysz,Michał Rudolf,Piotr Rybak,Karolina Saputa,Aleksandra Tomaszewska,Aleksander Wawer,Marcin Woliński,Joanna Wołoszyn,Alina Wróblewska,Bartosz Żuk,Filip Żarnecki,Konrad Kaczyński,Anna Cichosz,Zuzanna Deckert,Monika Garnys,Izabela Grabarczyk,Wojciech Janowski,Sylwia Karasińska,Aleksandra Kujawiak,Piotr Misztela,Maria Szymańska,Karolina Walkusz,Igor Siek,Jakub Kwiatkowski,Piotr Pęzik
Main category: cs.CL
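TL;DR: PLLuM is the largest open-source family of foundation models built specifically for Polish. It addresses the shortage of non-English language models and emphasizes data governance and responsible AI.
Details
Motivation: Existing large language models focus mainly on English and offer limited high-quality support for languages such as Polish. PLLuM aims to address this and to strengthen sovereign AI technology in Poland.
Contribution: Develops PLLuM, including a 140-billion-token Polish pre-training corpus, a 77k instruction dataset, and a 100k preference optimization dataset, together with a Responsible AI framework.
Method: The models are built with strict data governance, a hybrid module for output correction and safety filtering, and alignment techniques for both base and instruction-tuned variants.
Result: PLLuM demonstrates its utility on a downstream public administration task, and the models are released publicly to foster Polish AI research.
Insight: Multilingual models need attention to data quality and cultural relevance; a Responsible AI framework is key to improving model transparency and safety.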
Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models’ architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.
[3] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
Mohammad Atif Quamar,Mohammad Areeb,Mikhail Kuznetsov,Muslum Ozgur Ozmen,Z. Berkay Celik
Main category: cs.CL
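TL;DR: STARS is a decoding-time algorithm that steers large language model generation through segment-level token alignment with rejection sampling, improving both computational efficiency and alignment quality.
Details
Motivation: Existing alignment methods such as fine-tuning are computationally expensive and suboptimal, while inference-time methods such as Best-of-N sampling need practically infeasible computation to reach optimal alignment. STARS aims to overcome both limitations.
Contribution: STARS, a new decoding-time algorithm that achieves efficient, high-quality alignment through segment-level token alignment and rejection sampling.
Method: At decoding time, STARS iteratively samples, scores, and accepts or rejects short, fixed-size token segments, correcting the generation path early.
Result: Across six LLMs, STARS outperforms supervised fine-tuning by up to 14.9 percentage points and direct preference optimization by up to 4.3 percentage points in win rate, while remaining competitive with strong Best-of-N baselines.
Insight: Granular, reward-guided sampling is a generalizable, robust, and efficient alternative for alignment.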
Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
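The sample, score, accept-or-reject loop described above can be sketched as follows. This is a toy illustration under stated assumptions: a hard reward threshold stands in for the paper's acceptance rule, the best-candidate fallback is an added safeguard, and `toy_sampler` and `toy_reward` are hypothetical stand-ins for a language model and a reward model.

```python
import random


def segment_rejection_sampling(sample_segment, reward, num_segments,
                               threshold, max_tries=8, seed=0):
    """Build a sequence segment by segment.

    A candidate segment is accepted as soon as the reward of
    prefix + candidate reaches `threshold`. If no candidate is accepted
    within `max_tries`, the best one seen is kept so that generation
    always makes progress (an assumed fallback).
    """
    rng = random.Random(seed)
    prefix = []
    for _ in range(num_segments):
        best, best_score = None, float("-inf")
        for _ in range(max_tries):
            candidate = sample_segment(prefix, rng)
            score = reward(prefix + candidate)
            if score > best_score:
                best, best_score = candidate, score
            if score >= threshold:
                break  # accept this segment and move on
        prefix = prefix + best
    return prefix


def toy_sampler(prefix, rng):
    # Hypothetical "model": emits two random tokens per segment.
    return [rng.choice(["good", "bad"]) for _ in range(2)]


def toy_reward(tokens):
    # Hypothetical "reward model": fraction of "good" tokens so far.
    return tokens.count("good") / len(tokens)


output = segment_rejection_sampling(toy_sampler, toy_reward,
                                    num_segments=3, threshold=0.5)
```

Because a poor segment is rejected before later tokens are generated, wasted computation is bounded by the segment length rather than by the full response length, which is the intuition behind the efficiency claim.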
[4] Context informs pragmatic interpretation in vision-language models
Alvin Wei Ming Tan,Ben Prystawski,Veronica Boyce,Michael C. Frank
Main category: cs.CL
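TL;DR: The paper studies whether vision-language models can perform context-sensitive pragmatic reasoning in multi-turn linguistic settings, testing models and humans on iterated reference games. Models do poorly without relevant context but improve markedly when relevant context is provided.
Details
Motivation: To assess whether vision-language models can reason pragmatically about context as humans do in complex dialogue settings, particularly in multi-turn iterated reference games.
Contribution: Shows that relevant context substantially improves the pragmatic reasoning of vision-language models, and exposes their limitations on reference tasks with abstract referents.
Method: Humans and models are tested on trials from iterated reference games while the given context is varied in amount, order, and relevance.
Result: Without relevant context, models were above chance but substantially worse than humans; with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain difficult for models.
Insight: Context is essential to the pragmatic reasoning of vision-language models; further work is needed to improve their performance on abstract reference tasks.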
Abstract: Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents’ ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.
[5] The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013–2023
Stefano M. Iacus,Devika Jain,Andrea Nasuto,Giuseppe Porro,Marcello Carammia,Andrea Vezzulli
Main category: cs.CL
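TL;DR: Introduces the Human Flourishing Geographic Index (HFGI), a dataset built by analyzing about 2.6 billion geolocated U.S. tweets from 2013 to 2023 with fine-tuned large language models to quantify 48 flourishing-related indicators, giving multidisciplinary research a high-resolution tool for analyzing societal well-being.
Details
Motivation: Existing measures of human well-being often lack fine spatial and temporal resolution, which limits understanding of how well-being changes over time. This work aims to fill that gap by capturing multidimensional flourishing indicators from social media data.
Contribution: Develops the HFGI dataset, which uses large language models and social media data to provide flourishing measures for the United States at county level and at monthly resolution.
Method: About 2.6 billion geolocated tweets are analyzed with fine-tuned large language models that classify expressions across 48 flourishing-related indicators; the resulting measures are validated against established indicators.
Result: The HFGI measures accurately represent the underlying constructs and show the expected correlations with established indicators, supporting high-resolution analysis of well-being dynamics.
Insight: Social media data can be a powerful tool for quantifying and analyzing the multidimensional character of human well-being, especially where high temporal and spatial resolution is needed.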
Abstract: Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard’s Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.
[6] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises
Shiyin Lin
Main category: cs.CL
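TL;DR: The paper proposes a framework that brings abductive inference into retrieval-augmented generation (RAG) language models to fill reasoning gaps caused by incomplete evidence, generating and validating missing premises to improve answer accuracy and reasoning faithfulness.
Details
Motivation: Existing RAG systems perform poorly when retrieved evidence is incomplete, leaving gaps in the reasoning process. To address this, the paper uses abductive inference to generate plausible missing premises, improving robustness and explainability.
Contribution: A framework that integrates abductive inference into RAG systems, covering detection of insufficient evidence, generation of candidate missing premises, and validation of those premises through consistency and plausibility checks.
Method: The framework has three parts: (1) detect insufficient retrieved evidence; (2) generate candidate missing premises; (3) validate the generated premises through consistency and plausibility checks.
Result: On abductive reasoning and multi-hop QA benchmarks, the method improves both answer accuracy and reasoning faithfulness.
Insight: Abductive inference is an effective way to make RAG systems more robust and explainable, especially when evidence is incomplete.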
Abstract: Large Language Models (LLMs) enhanced with retrieval – commonly referred to as Retrieval-Augmented Generation (RAG) – have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, abductive inference – the process of generating plausible missing premises to explain observations – offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
[7] WST: Weakly Supervised Transducer for Automatic Speech Recognition
Dongji Gao,Chenda Liao,Changliang Liu,Matthew Wiesner,Leibny Paola Garcia,Daniel Povey,Sanjeev Khudanpur,Jian Wu
Main category: cs.CL
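TL;DR: The paper proposes a Weakly Supervised Transducer (WST) that reduces the dependence of automatic speech recognition (ASR) on high-quality annotated data and maintains performance with transcription error rates of up to 70%.
Details
Motivation: RNN-T models for ASR depend on large amounts of high-quality annotated data, which are costly and difficult to obtain. The paper addresses this with weakly supervised learning.
Contribution: Proposes WST, whose flexible training graph robustly handles transcript errors without additional confidence estimation or auxiliary pre-trained models.
Method: WST integrates a flexible training graph that handles erroneous transcripts directly, in contrast to CTC-based weakly supervised methods such as BTC and OTC.
Result: Experiments on synthetic and industrial datasets show that WST maintains performance even at transcription error rates of up to 70% and consistently outperforms existing CTC-based weakly supervised methods.
Insight: WST is practical and robust in realistic ASR settings where transcripts are noisy.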
Abstract: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
[8] T-FIX: Text-Based Explanations with Features Interpretable to eXperts
Shreya Havaldar,Helen Jin,Chaehyeon Kim,Anton Xue,Weiqiu You,Marco Gatti,Bhuvnesh Jain,Helen Qu,Daniel A Hashimoto,Amin Madani,Rajat Deo,Sameed Ahmed M. Khatana,Gary E. Weissman,Lyle Ungar,Eric Wong
Main category: cs.CL
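TL;DR: T-FIX is a new benchmark for evaluating whether LLM-generated explanations align with expert intuition. It spans seven knowledge-intensive domains and introduces new metrics for measuring that alignment.
Details
Motivation: As LLMs are deployed in knowledge-intensive domains (such as surgery, astronomy, and therapy), users, who are often domain experts, need explanations that match their professional reasoning, not just answers. Existing evaluation schemes emphasize plausibility or internal faithfulness and do not capture whether the content of an explanation actually aligns with expert intuition.
Contribution: Proposes the T-FIX benchmark for measuring the alignment of LLM explanations with expert judgment, and develops new metrics across several knowledge-intensive domains.
Method: Working with domain experts, the authors formalize expert alignment as an evaluation criterion and design new metrics to quantify it.
Result: T-FIX provides a standard evaluation framework for checking whether LLM explanations reflect expert-level reasoning.
Insight: In knowledge-intensive domains, LLM explanations need to align with expert intuition, not merely appear plausible.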
Abstract: As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.
[9] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering
Xinying Qian,Ying Zhang,Yu Zhao,Baohang Zhou,Xuhui Sui,Xiaojie Yuan
Main category: cs.CL
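TL;DR: The paper proposes PoK, a framework that combines a Plan of Knowledge module with a contrastive temporal retriever to improve large language models on temporal knowledge graph question answering, substantially outperforming existing methods.
Details
Motivation: Existing methods for temporal knowledge graph question answering (TKGQA) fail to fully understand the complex semantics of time constraints. Large language models (LLMs) have strong semantic understanding and reasoning abilities, but their temporal reasoning is limited and they suffer from hallucination and missing knowledge.
Contribution: Proposes the Plan of Knowledge (PoK) framework, consisting of a Plan of Knowledge module and a contrastive temporal retriever; by decomposing questions and selectively retrieving temporally aligned facts, it improves interpretability and factual consistency.
Method: (1) The Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives drawn from pre-defined tools; (2) a Temporal Knowledge Store (TKS) with a contrastive retrieval framework retrieves facts that are both semantically and temporally aligned.
Result: On four TKGQA benchmark datasets, PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing state-of-the-art TKGQA methods by up to 56.0%.
Insight: Combining structured planning with temporal knowledge retrieval compensates for the weak temporal reasoning of LLMs and strengthens interpretability and factual consistency.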
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by 56.0% at most.
[10] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Eva Prakash,Maayane Attias,Pierre Chambon,Justin Xu,Steven Truong,Jean-Benoit Delbrouck,Tessa Cook,Curtis Langlotz
Main category: cs.CL
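TL;DR: The paper fine-tunes transformer models on large-scale training data to improve automated de-identification of radiology reports, and outperforms commercial cloud vendor systems at PHI detection.
Details
Motivation: Existing PHI de-identification methods generalize poorly across institutions and lack robustness, particularly on synthetic PHI. The paper addresses this with large-scale datasets and a new PHI category (AGE).
Contribution: (1) Fine-tunes a transformer model on two large annotated radiology corpora from Stanford and adds the new PHI category AGE; (2) assesses the stability of synthetic PHI generation using a "hide-in-plain-sight" method; (3) outperforms all commercial cloud vendor systems tested.
Method: A transformer model is fine-tuned on two large radiology report corpora and evaluated on token-level PHI detection using test sets from Stanford and the University of Pennsylvania (Penn). The stability of synthetic PHI generation and performance relative to commercial systems are also assessed.
Result: The model reaches overall F1 scores of 0.996 on the Stanford dataset and 0.973 on the Penn dataset, and clearly outperforms commercial systems on synthetic Penn reports (F1: 0.960 vs. 0.632-0.754). Detection of synthetic PHI is consistent (F1: 0.959).
Insight: Large-scale, multimodal training improves cross-institutional generalization and robustness; synthetic PHI generation preserves data utility while protecting privacy; transformer models show strong potential for PHI detection.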
Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a “hide-in-plain-sight” method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
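The token-level precision, recall, and F1 figures reported above follow the standard definitions, which can be sketched as below. This is a simplified illustration that assumes binary labels (1 = PHI token, 0 = other), whereas the paper scores multiple PHI categories; the function name `token_prf` and the example labels are hypothetical.

```python
def token_prf(gold, pred):
    """Token-level precision, recall, and F1 for binary PHI labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical five-token report fragment.
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
precision, recall, f1 = token_prf(gold, pred)
```

For de-identification, recall is usually the more safety-critical number, since a missed PHI token is a privacy leak while a false positive only removes some non-sensitive text.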
[11] Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Wenmo Qiu,Saurabh Srivastava
Main category: cs.CL
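TL;DR: Batch prompting not only improves throughput for reasoning models but also substantially suppresses overthinking in multi-step reasoning, improving accuracy while reducing token usage.
Details
Motivation: To study how batch prompting affects the behavior of Large Reasoning Models (LRMs), and to explore its regularizing effect during reasoning, particularly its potential to suppress overthinking and improve efficiency.
Contribution: (1) Finds that batch prompting substantially reduces overthinking and reasoning token usage, often by 3x-5x; (2) shows that batching also reduces hedging language, such as repetitive self-corrections; (3) observes collective effects within a batch, where models generalize patterns from earlier examples to solve harder ones.
Method: A comprehensive study across 13 diverse benchmarks, with behavioral analysis of how batch prompting affects reasoning behavior and efficiency.
Result: Batch prompting improves accuracy while substantially reducing reasoning token usage, and suppresses overthinking and hedging.
Insight: Batch prompting is not only a throughput optimization but also an inference-time regularizer that makes models more efficient and reliable.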
Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
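For readers unfamiliar with the setup, batch prompting means packing several independent questions into one prompt so that the model answers them in a single call. A minimal sketch, assuming a simple numbered format; the instruction wording and the function name `build_batch_prompt` are hypothetical and are not the prompt used in the paper.

```python
def build_batch_prompt(questions):
    """Pack several questions into one prompt with numbered slots."""
    if not questions:
        raise ValueError("need at least one question")
    lines = [
        "Answer each question. Reply with one line per question, "
        "formatted as 'A<n>: <answer>'.",
        "",
    ]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Q{i}: {question}")
    return "\n".join(lines)


prompt = build_batch_prompt(["What is 17 + 25?", "Is 91 prime?"])
```

The batch size then becomes a tunable parameter: larger batches amortize more cost per call, and according to the abstract they also shorten the reasoning spent on each question.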
[12] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Xinyuan Li,Murong Xu,Wenbiao Tao,Hanlun Zhu,Yike Zhao,Jipeng Zhang,Yunshi Lan
Main category: cs.CL
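TL;DR: RIDE is an adversarial question-rewriting framework based on Item Response Theory (IRT) that generates mathematical problems of controlled difficulty in order to measure the true mathematical reasoning ability of LLMs.
Details
Motivation: The high scores of large language models on mathematical reasoning may stem from training data leakage or superficial pattern matching instead of genuine reasoning, so a more rigorous evaluation method is needed.
Contribution: Proposes the RIDE framework, which uses IRT to generate progressively harder, well-posed questions, and builds a difficulty ranker for systematically evaluating LLM reasoning.
Method: 35 LLMs simulate student responses; a difficulty ranker built from those responses provides the reward signal for reinforcement learning of a question-rewriting model.
Result: On competition-level mathematical benchmarks, the questions generated by RIDE reduce LLM performance by 21.73% on average across 26 models, exposing limited robustness in mathematical reasoning.
Insight: Using IRT to measure and control question difficulty is effective and offers a new approach to LLM evaluation.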
Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
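As background on Item Response Theory, the standard two-parameter logistic (2PL) model gives the probability that a test-taker with a given ability answers an item of a given difficulty correctly. The sketch below shows that textbook formula only; the abstract does not say which IRT variant RIDE uses, so treating it as 2PL is an assumption, and the ability and difficulty values are hypothetical.

```python
import math


def irt_probability(ability, difficulty, discrimination=1.0):
    """2PL IRT: P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Three hypothetical "students" (LLMs) with low, medium, and high ability.
abilities = [-1.0, 0.0, 1.0]
easy = [irt_probability(theta, difficulty=-0.5) for theta in abilities]
hard = [irt_probability(theta, difficulty=1.5) for theta in abilities]
```

Every student is less likely to solve the item with the higher difficulty parameter, which is the property a difficulty ranker fitted on many model responses can exploit as a reward signal.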
[13] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese
Dazhong Chen,Yi-Cheng Lin,Yuchen Huang,Ziwei Gong,Di Jiang,Zeying Xie,Yi R. Fung
Main category: cs.CL
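TL;DR: CantoASR is a collaborative framework that combines automatic speech recognition (ASR) with a large audio-language model (LALM); by fusing acoustic features with contextual reasoning, it substantially improves recognition accuracy for low-resource Cantonese.
Details
Motivation: Cantonese is a low-resource language that poses challenges from limited annotated data, six lexical tones, tone sandhi, and accent variation; existing ASR models such as Whisper perform poorly on it.
Contribution: (1) Proposes the CantoASR framework, which pairs ASR with an LALM; (2) introduces forced-alignment-based acoustic feature extraction and a LoRA-finetuned Whisper model; (3) uses an instruction-tuned Qwen-Audio for prosody-aware error correction.
Method: (1) Extract acoustic features with forced alignment; (2) fine-tune Whisper with LoRA to improve tone discrimination; (3) apply Qwen-Audio for prosody-aware error correction.
Result: On spontaneous Cantonese data, the character error rate (CER) improves substantially over Whisper-Large-V3.
Insight: Combining acoustic cues with LALM contextual reasoning offers a scalable strategy for ASR on low-resource tonal languages and dialects.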
Abstract: Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
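The character error rate (CER) used in the evaluation is the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of that standard metric follows; the example sentence pair is hypothetical and does not come from the paper's data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of five.
score = cer("我哋去食飯", "我地去食飯")
```

CER suits Cantonese better than word error rate because written Cantonese has no whitespace word boundaries, so word-level scoring would depend on a segmenter.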
[14] BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation
Fahim Ahmed,Md Mubtasim Ahasan,Jahir Sadik Monon,Muntasir Wahed,M Ashraful Amin,A K M Mahbubur Rahman,Amin Ahsan Ali
Main category: cs.CL
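TL;DR: The paper explores three multi-agent LLM pipelines for text-to-SQL generation, with particular attention to small, efficient models. Systematic benchmarking shows that multi-agent discussion and the Planner-Coder pipeline can noticeably improve performance.
Details
Motivation: Existing large language models (LLMs) struggle to generate SQL from natural language instructions because of large schemas and complex reasoning. Prior work has focused on complex, somewhat impractical pipelines built on flagship models, while small, efficient models have been overlooked. This paper aims to fill that gap.
Contribution: Proposes three multi-agent LLM pipelines (discussion, Planner-Coder, Coder-Aggregator) and systematically benchmarks open-source models from small to large. The experiments show that multi-agent methods can improve the SQL generation of small models.
Method: (1) Multi-agent discussion pipeline: agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline: a thinking-model planner produces a stepwise SQL generation plan and a coder synthesizes the query; (3) Coder-Aggregator pipeline: multiple coders independently generate queries and a reasoning agent selects the best one.
Result: Multi-agent discussion raises the Execution Accuracy of Qwen2.5-7b-Instruct by up to 10.6%. The Planner-Coder pipeline performs best: DeepSeek-R1-32B and QwQ-32B planners lift Gemma 3 27B IT accuracy from 52.4% to 56.4%.
Insight: Multi-agent collaboration and stepwise planning effectively improve small models on complex text-to-SQL tasks, offering an efficient option for practical use.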
Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLM) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at https://github.com/treeDweller98/bappa-sql.
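Execution Accuracy, the metric quoted above, counts a predicted query as correct when running it returns the same result as the gold query. A minimal sketch using Python's built-in `sqlite3` module; the table, the queries, and the order-insensitive row comparison are illustrative assumptions and not the official Bird-Bench evaluation script.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries return the same rows, ignoring row order.

    A predicted query that fails to execute counts as a miss.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Ana", 31), (2, "Bo", 24), (3, "Cy", 45)])

gold = "SELECT name FROM users WHERE age > 30"
pred_ok = "SELECT name FROM users WHERE age >= 31"   # different text, same rows
pred_bad = "SELECT name FROM users WHERE age > 40"   # wrong rows

results = [execution_match(conn, q, gold) for q in (pred_ok, pred_bad)]
accuracy = sum(results) / len(results)
```

Comparing results instead of query text is what lets differently phrased but equivalent queries from different agents both count as correct.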
[15] LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal
Michał Karp,Anna Kubaszewska,Magdalena Król,Robert Król,Aleksander Smywiński-Pohl,Mateusz Szymański,Witold Wydmański
Main category: cs.CL
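TL;DR: This empirical study finds that current large language models (LLMs) cannot pass the practical part of the qualifying examination for membership in Poland's National Appeal Chamber, and that "LLM-as-a-judge" evaluations often diverge from those of the official examiners.
Details
Motivation: To test how LLMs perform on a professional legal examination, and in particular whether they can serve as candidates or as judges in a high-stakes legal qualifying exam.
Contribution: (1) Empirically demonstrates the limitations of LLMs on a legal examination; (2) reveals evaluation discrepancies in the "LLM-as-a-judge" approach; (3) argues for close collaboration between legal experts and technical teams.
Method: (1) Build a hybrid information recovery and extraction pipeline; (2) test several LLMs in closed-book and Retrieval-Augmented Generation (RAG) settings; (3) compare model-assigned scores with those of the official examining committee.
Result: The LLMs achieved satisfactory scores on the multiple-choice knowledge test, but none met the passing threshold in the practical written part, and the model judges' evaluations often diverged from the official committee's judgments.
Insight: The study exposes the limitations of LLMs in the legal domain, including susceptibility to hallucinations, incorrect citation of legal provisions, and weak logical argumentation, and underlines the importance of collaboration between legal experts and technical teams.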
Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland’s National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLM as actual exam candidates and applying the ‘LLM-as-a-judge’ approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the ‘LLM-as-a-judge’ often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.
[16] SSPO: Subsentence-level Policy Optimization
Kun Yang,Zikang Chen,Yanmeng Wang,Zhigen Li
Main category: cs.CL
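TL;DR: SSPO is a new reinforcement learning optimization method that computes the importance ratio at the subsentence level, balancing the strengths of GRPO and GSPO. It avoids training collapse and high variance while making better use of sampled data.
Details
Motivation: Existing RLVR algorithms such as GRPO and GSPO suffer from unstable training or low utilization of sampled data, respectively. SSPO aims to solve both problems and further improve the reasoning ability of large language models.
Contribution: Proposes SSPO, which introduces a subsentence-level importance ratio and an entropy-based clipping mechanism, stabilizing training while improving data utilization.
Method: Building on GRPO and GSPO, SSPO uses a subsentence-level importance ratio and adjusts the clipping range dynamically according to sentence entropy.
Result: SSPO achieves an average score of 46.57 across five datasets, ahead of GRPO (43.01) and GSPO (44.42), and reaches state-of-the-art performance on three datasets.
Insight: Optimizing at the subsentence level gives a finer balance between stability and data utilization, and the dynamic clipping mechanism further encourages exploration.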
Abstract: As a significant part of post-training of the Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs’ reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This will be easily affected by outliers, leading to model training collapse. GSPO proposed the calculation of the response level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded, resulting in a decrease in the utilization of sampled data. This paper introduces SSPO, which applies sentence-level importance ratio, taking the balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents the whole response tokens from being abandoned by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrow the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and wins state-of-the-art performance on three datasets. These results highlight SSPO’s effectiveness in leveraging generated data by taking the essence of GSPO but rejecting its shortcomings.
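The three granularities of importance ratio contrasted above can be sketched as follows. Token-level ratios are per-token probability ratios, and the response-level ratio is taken as the length-normalized (geometric-mean) ratio, as in GSPO. Applying the same geometric mean within each sentence is an assumption about how SSPO defines its ratio; the log-probabilities and sentence boundaries are hypothetical.

```python
import math


def token_ratios(new_logp, old_logp):
    """Token-level ratios: pi_new / pi_old for every token."""
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]


def segment_ratios(new_logp, old_logp, boundaries):
    """Geometric-mean ratio over each (start, end) span, end exclusive.

    One span covering the whole response gives the response-level ratio;
    one span per sentence gives a sentence-level ratio (assumed form).
    """
    ratios = []
    for start, end in boundaries:
        diffs = [n - o for n, o in
                 zip(new_logp[start:end], old_logp[start:end])]
        ratios.append(math.exp(sum(diffs) / len(diffs)))
    return ratios


# Hypothetical log-probs for a 4-token response made of two sentences.
new_logp = [-1.0, -1.2, -0.5, -2.0]
old_logp = [-1.1, -1.0, -0.5, -1.0]

per_token = token_ratios(new_logp, old_logp)
per_response = segment_ratios(new_logp, old_logp, [(0, 4)])
per_sentence = segment_ratios(new_logp, old_logp, [(0, 2), (2, 4)])
```

In this toy case the outlier token in the second sentence pulls down the single response-level ratio, whereas with sentence-level ratios only the second sentence is affected, so clipping need not discard the first sentence.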
[17] Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Mohammad Amin Ghanizadeh,Mohammad Javad Dousti
Main category: cs.CL
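TL;DR: The paper proposes a data selection method that exploits the synergy between a learner model and a pre-trained reference model, defines a learnability score, and selects batches dynamically, substantially improving the data efficiency of machine translation fine-tuning.
Details
Motivation: Data quality and effective selection are critical to machine translation performance, yet conventional methods often ignore dependencies among data points. The paper aims to improve data efficiency and training effectiveness through dynamic batch selection. Contribution: 1) A data selection method based on a learnability score; 2) a dynamic batch selection strategy that accounts for interdependencies among data points; 3) improved data efficiency and translation quality on several language pairs.
Method: A learner model and a pre-trained reference model jointly define a learnability score for each data point, and batches are selected dynamically to improve training efficiency. Experiments use an mBART model fine-tuned on the CCMatrix dataset.
Result: On English-to-Persian and other language pairs, data efficiency improves by up to 5x and computational efficiency by 24x when embeddings are cached; translation quality exceeds a random-selection baseline.
Insight: Data selection should consider not only the quality of individual samples but also how samples within a batch complement each other; dynamic batch selection is an effective way to make fine-tuning more efficient.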
Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by a factor of 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to a random selection method.
[18] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs
Lars Bungum,Charles Yijia Huang,Abeer Kashar
Main category: cs.CL
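TL;DR: The study examines how LLMs handle temporal reasoning by asking trivia questions from a 1940 Norwegian book as if it were 1940, comparing prompt languages and model sizes.
Details
Motivation: To assess how well LLMs reason about time and historical context when questions are anchored to a past date. Contribution: Experiments showing that English prompts outperform Norwegian ones and that larger models perform better.
Method: Questions from a 1940 Norwegian trivia book are posed in English and Norwegian; answers are graded by LLM-as-judge with sampled checks by a native speaker.
Result: English prompts consistently gave better results than Norwegian ones, and larger models did better. The evaluation covered the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 families as well as the largest available LLM built specifically for Norwegian.
Insight: Prompt language matters for task performance, and model size correlates positively with performance.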
Abstract: In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM especially crafted for Norwegian.
[19] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung,Teetouch Jaknamon,Sirinya Chaiophat,Natapong Nitarach,Chanakan Wittayasakpan,Warit Sirichotedumrong,Adisai Na-Thalang,Kunat Pipatanakul
Main category: cs.CL
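TL;DR: The paper presents ThaiOCRBench, the first vision-language understanding benchmark for Thai. It fills a gap left by existing benchmarks, evaluates a range of VLMs, and exposes the weaknesses of open-source models.
Details
Motivation: Existing vision-language model (VLM) benchmarks focus mainly on high-resource languages; low-resource languages such as Thai lack representative benchmarks, especially for tasks that require document structure understanding. Contribution: ThaiOCRBench, the first comprehensive benchmark for Thai text-rich visual understanding, with 13 task categories and 2,808 annotated samples, together with a zero-shot evaluation of many VLMs.
Method: A diverse, human-annotated Thai dataset is built, and proprietary and open-source VLMs are evaluated on it in a zero-shot setting.
Result: Proprietary models (e.g., Gemini 2.5 Pro) outperform open-source models, which do worst on fine-grained text recognition and handwritten content extraction.
Insight: Language bias, structural mismatch, and hallucinated content are the main challenges for VLMs on Thai tasks, which points to concrete directions for improving document understanding in low-resource languages.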
Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.
[20] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables
Nikhil Abhyankar,Purvi Chaurasia,Sanchit Kabra,Ananya Srivastava,Vivek Gupta,Chandan K. Reddy
Main category: cs.CL
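TL;DR: RUST-BENCH is a new benchmark for evaluating large language models (LLMs) on reasoning over complex real-world tables, addressing the limits of existing benchmarks that test on small, uniform tables.
Details
Motivation: Existing tabular reasoning benchmarks mostly use small tables with uniform structure. They do not reflect the long, heterogeneous, domain-specific tables found in practice and so give an incomplete picture of LLM reasoning ability. Contribution: RUST-BENCH, a benchmark of 7966 questions over 2031 real-world tables in two domains, science (NSF grant records) and sports (NBA statistics), with an emphasis on multi-hop reasoning over mixed structured fields and free text.
Method: A dataset of real tables with heterogeneous schemas and free text is constructed, and open-source and proprietary LLMs are evaluated on complex multi-hop reasoning.
Result: LLMs struggle with heterogeneous schemas and complex reasoning, revealing limitations of current architectures and prompting strategies.
Insight: RUST-BENCH offers a more challenging testbed for tabular reasoning research and highlights how much room current techniques have to improve.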
Abstract: Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models’ (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.
[21] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation
Cuong Huynh,Jie Cao
Main category: cs.CL
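TL;DR: The paper describes the OUNLP team's LLM-prompting-based text simplification system for the TSAR-2025 Shared Task. It finds that simplification performance is closely tied to the gap between the source and target CEFR levels and proposes two multi-round simplification methods.
Details
Motivation: Readability-controlled text simplification targets specific CEFR levels; the paper explores what LLMs can do when simplification is applied over multiple rounds. Contribution: Two multi-round simplification methods (MRS-Rule and MRS-Joint), and evidence that starting from LLM-simplified candidates further improves performance.
Method: Rule-based simplification (MRS-Rule) and joint rule-based and LLM simplification (MRS-Joint), both generated via GPT-4o.
Result: The submitted systems ranked 7th out of 20 teams; later work with MRS-Joint improved performance further.
Insight: Simplification performance depends on the CEFR level gap, and multi-round simplification combined with LLM prompting can improve results.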
Abstract: This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Based on the analysis of prompt-based text simplification methods, we discovered an interesting finding that text simplification performance is highly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods and generate them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams. Later improvements with MRS-Joint show that taking the LLM simplified candidates as the starting point could further boost the multi-round simplification performance.
[22] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways
Paloma Rabaey,Jong Hak Moon,Jung-Oh Lee,Min Gwan Kim,Hangyul Yoon,Thomas Demeester,Edward Choi
Main category: cs.CL
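TL;DR: The paper proposes a two-part framework for quantifying explicit and implicit uncertainty in radiology reports. Explicit uncertainty markers are ranked with an LLM and validated by experts, implicit uncertainty is modeled through systematic expansion, and the result is released as Lunguage++, a structured dataset with uncertainty information.
Details
Motivation: Uncertainty in radiology reports matters for clinical decision-making and automated analysis, but existing methods struggle to quantify explicit and implicit uncertainty accurately. Contribution: 1. A two-part framework for quantifying explicit and implicit uncertainty; 2. an expert-validated, LLM-based ranking of explicit uncertainty markers; 3. a diagnostic-pathway expansion method for modeling implicit uncertainty; 4. the Lunguage++ dataset.
Method: 1. Common hedging phrases are ranked with an LLM, validated by experts, and mapped to probability values; 2. implicit uncertainty is modeled by systematically adding sub-findings from expert-defined diagnostic pathways for 14 common diagnoses.
Result: The Lunguage++ dataset is released, supporting uncertainty-aware image classification and diagnostic reasoning.
Insight: Quantifying explicit and implicit uncertainty improves automated analysis of radiology reports and gives researchers a new tool for studying clinical uncertainty.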
Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.
[23] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
Amir Zur,Atticus Geiger,Ekdeep Singh Lubana,Eric Bigelow
Main category: cs.CL
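TL;DR: The paper asks whether language models are aware of the paths they do not take during generation, and analyzes token-level uncertainty through hidden-state dynamics.
Details
Motivation: To explore whether language models implicitly represent the alternative paths available during reasoning, which would help quantify uncertainty. Contribution: Hidden-state interventions show that models are easier to steer when uncertainty is high, revealing that they implicitly represent the space of possible paths.
Method: Hidden activations are used to control and predict a model's uncertainty during chain-of-thought reasoning, and model behavior is analyzed at different tokens.
Result: Model uncertainty correlates clearly with how effective hidden-state steering is, and hidden states can predict the model's future outcome distribution.
Insight: Language models implicitly represent multiple paths while deciding, and this is most pronounced when uncertainty is high.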
Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model’s uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model – in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model’s future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
[24] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering
Sadia Sultana,Saiyma Sittul Muna,Mosammat Zannatul Samarukh,Ajwad Abrar,Tareque Mohmud Chowdhury
Main category: cs.CL
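TL;DR: The paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical multiple-choice datasets, and evaluates several retrieval-augmented generation (RAG) strategies for improving medical QA accuracy. Agentic RAG performs best, reaching 89.54% accuracy.
Details
Motivation: Biomedical QA systems for low-resource languages are underdeveloped, which limits equitable access to reliable medical knowledge. The paper aims to fill this gap for Bangla and to explore what RAG methods can offer on such tasks. Contribution: 1. The first Bangla biomedical multiple-choice datasets (BanglaMedQA and BanglaMMedBench); 2. a benchmark of several RAG strategies; 3. an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies and clearly improves accuracy.
Method: Textbook-based and web retrieval are combined with generative reasoning; a Bangla medical textbook corpus is integrated through OCR; and five RAG strategies are compared (Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG). Agentic RAG chooses the strategy dynamically.
Result: Agentic RAG achieves the highest accuracy, 89.54%, with openai/gpt-oss-120b, outperforming the other configurations and producing better rationales.
Insight: RAG methods can markedly improve the reliability of medical QA in low-resource languages, with dynamic strategy selection (Agentic RAG) as the key innovation. This opens a direction for multilingual medical AI research.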
Abstract: Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy 89.54% with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.
[25] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Alamgir Munir Qazi,John P. McCrae,Jamal Abdul Nasir
Main category: cs.CL
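TL;DR: DeReC is a lightweight dense retrieval classification framework that combines dense retrieval with specialized classification, improving both the efficiency and the accuracy of fake news detection relative to generative LLM methods.
Details
Motivation: The spread of fake news calls for efficient fact verification systems, but current methods that use LLMs to generate explanatory rationales are computationally expensive and prone to hallucination. Contribution: The DeReC framework shows that general-purpose text embeddings can replace autoregressive LLM approaches in fact verification, with higher efficiency and accuracy.
Method: Dense retrieval is combined with specialized classification, using general-purpose text embeddings in place of LLM generation.
Result: DeReC reaches an F1 score of 65.58% on RAWFC, ahead of L-Defense at 61.20%, while cutting runtime by 95%.
Insight: Carefully engineered retrieval systems can match or exceed LLM performance on specialized tasks while being more practical to deploy.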
Abstract: The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
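The runtime reductions quoted in the abstract follow directly from the reported wall-clock times. The short check below only redoes that arithmetic; the helper names are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M minutes S seconds' duration to seconds."""
    return minutes * 60 + seconds

def percent_reduction(baseline_s, new_s):
    """Percentage by which new_s is shorter than baseline_s."""
    return 100.0 * (1.0 - new_s / baseline_s)

# Times as reported in the abstract (baseline first, DeReC second).
rawfc = percent_reduction(to_seconds(454, 12), to_seconds(23, 36))
liar_raw = percent_reduction(to_seconds(1692, 23), to_seconds(134, 14))
print(f"RAWFC: {rawfc:.1f}%  LIAR-RAW: {liar_raw:.1f}%")
```

Both values round to the 95% and 92% figures stated in the abstract.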
[26] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Mohammad Atif Quamar,Mohammad Areeb
Main category: cs.CL
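TL;DR: The paper proposes LEASH, an adaptive stopping heuristic that halts generation early during chain-of-thought reasoning to reduce wasted computation.
Details
Motivation: Chain-of-thought (CoT) prompting has large language models generate full, fixed-length rationales, which wastes computation and increases latency. Contribution: LEASH, which monitors the slope of token-level entropy and the improvement in the top-logit margin to stop rationale generation adaptively, cutting token usage by 30-35% and latency by 27%.
Method: LEASH tracks the slope of token-level entropy and the improvement in the top-logit margin, and stops generation once both signals plateau.
Result: On GSM8K and AQuA-RAT, LEASH reduces generated tokens by 30-35% and latency by 27%, at the cost of a 10 percentage point drop in accuracy.
Insight: LEASH is training-free and model-agnostic, offering an efficient alternative to standard CoT decoding.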
Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30–35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
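A stopping rule of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the window length, both tolerances, the least-squares slope estimate, and the function names are all assumptions made for the example.

```python
import numpy as np

def entropy_and_margin(logits):
    """Token-level entropy of softmax(logits) and the gap between the top two logits."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()                 # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    top2 = np.sort(logits)[-2:]
    return entropy, top2[1] - top2[0]

def should_stop(entropies, margins, window=4, slope_tol=0.01, margin_tol=0.01):
    """Stop once both signals are flat over the last `window` decoding steps.

    window, slope_tol and margin_tol are illustrative values, not from the paper.
    """
    if len(entropies) < window:
        return False
    h = np.asarray(entropies[-window:], dtype=float)
    m = np.asarray(margins[-window:], dtype=float)
    slope = np.polyfit(np.arange(window), h, 1)[0]   # least-squares entropy slope
    margin_gain = m[-1] - m[0]                       # change in top-logit margin
    return bool(abs(slope) < slope_tol and abs(margin_gain) < margin_tol)
```

In a decoding loop, one would call `entropy_and_margin` on each step's logits, append the two values to running lists, and break out of generation when `should_stop` returns `True`.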
cs.CV [Back]
[27] SILVI: Simple Interface for Labeling Video Interactions
Ozan Kanbertay,Richard Vogg,Elif Karakoc,Peter M. Kappeler,Claudia Fichtel,Alexander S. Ecker
Main category: cs.CV
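TL;DR: SILVI is an open-source video annotation tool for labeling behaviors and interactions between individuals, filling the gap left by existing tools that cannot support behavioral labeling and localization of individuals at the same time.
Details
Motivation: Computer vision methods for large-scale video analysis focus mainly on detecting individual actions and offer little support for annotating interactions, which are essential for understanding social and individualized animal behavior. Contribution: SILVI integrates behavioral labeling with localization of individuals and produces structured output that supports the development and validation of automated methods.
Method: SILVI is open-source labeling software that lets researchers annotate behaviors and interactions directly in video data.
Result: SILVI provides tooling for research at the intersection of behavioral ecology and computer vision, and may also be useful in other areas that require dynamic scene graph annotation.
Insight: SILVI's design reflects the dynamic nature of behavioral interactions and supports fine-grained behavioral analysis; being open source also encourages wider use.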
Abstract: Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions – a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SILVI, an open-source labeling software that integrates both functionalities. SILVI enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SILVI facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SILVI could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: https://gitlab.gwdg.de/kanbertay/interaction-labelling-app.
[28] Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets
Duong Mai,Lawrence Hall
Main category: cs.CV
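TL;DR: The paper studies noise injection (Gaussian, speckle, Poisson, and salt-and-pepper noise) as a way to improve the OOD generalization of deep learning models trained on limited-size datasets, and substantially narrows the gap between ID and OOD performance.
Details
Motivation: Deep learning models for image recognition exploit source-specific shortcut features (such as device- or population-related artifacts) instead of reasonable biomarkers, and so perform poorly on OOD data, notably in COVID-19 detection from chest X-rays. The paper aims to make models more robust to distribution shift through noise injection. Contribution: Four noise injection techniques (Gaussian, speckle, Poisson, and salt-and-pepper) are investigated, and experiments show that they shrink the ID-OOD performance gap from 0.10-0.20 to 0.01-0.06, improving generalization.
Method: Different types of noise (Gaussian, speckle, Poisson, and salt-and-pepper) are injected during training, and the effect is assessed by comparing ID and OOD performance on metrics such as AUC, F1, accuracy, recall, and specificity.
Result: Noise injection significantly narrows the ID-OOD performance gap, from 0.10-0.20 to 0.01-0.06 on average.
Insight: Noise injection disrupts the model's reliance on shortcut features and pushes it to learn biomarkers that generalize better, improving OOD performance. The approach is simple, effective, and suited to tasks with limited data.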
Abstract: Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. Rendering the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
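The four noise types named in the abstract are standard image corruptions, and a generic NumPy version is sketched below. This is not the authors' released code: the noise parameters (`sigma`, `peak`, `amount`), the assumption that images are floats in [0, 1], and the function names are all illustrative.

```python
import numpy as np

def add_gaussian(img, rng, sigma=0.05):
    """Additive zero-mean Gaussian noise."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_speckle(img, rng, sigma=0.05):
    """Multiplicative noise: each pixel is perturbed in proportion to its intensity."""
    return np.clip(img + img * rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_poisson(img, rng, peak=255.0):
    """Shot noise: sample photon counts with mean img * peak, then rescale."""
    return np.clip(rng.poisson(img * peak) / peak, 0.0, 1.0)

def add_salt_pepper(img, rng, amount=0.02):
    """Set roughly amount/2 of the pixels to 0 and amount/2 to 1."""
    out = img.copy()
    u = rng.random(img.shape)
    out[u < amount / 2] = 0.0
    out[u > 1 - amount / 2] = 1.0
    return out

rng = np.random.default_rng(0)
image = rng.random((64, 64))           # stand-in for a normalized chest X-ray
noisy = add_gaussian(image, rng)
```

During training, one of these functions would typically be applied to each batch as an augmentation step before the forward pass.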
[29] Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures
Florence Klitzner,Blanca Inigo,Benjamin D. Killeen,Lalithkumar Seenivasan,Michelle Song,Axel Krieger,Mathias Unberath
Main category: cs.CV
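TL;DR: The study investigates imitation learning for robot control in X-ray-guided spine procedures, develops a highly realistic simulation environment, and examines the feasibility and limitations of the approach experimentally.
Details
Motivation: Imitation learning is attracting attention in video-based robotics but has not been validated for complex X-ray-guided procedures such as spine surgery. The study explores whether it applies in this setting. Contribution: 1. A highly realistic simulation environment for automated training on X-ray-guided spine procedures; 2. imitation learning policies that perform planning and open-loop control from visual information alone.
Method: 1. Build the simulation environment and dataset; 2. train imitation learning policies that align the cannula step by step using bi-planar X-ray sequences; 3. test generalization on real X-rays.
Result: The policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories, and generalized to complex anatomy and varied initial conditions.
Insight: Imitation learning shows promise for X-ray-guided procedures, but entry point precision and the feedback frequency needed for closed-loop control remain open problems.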
Abstract: Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation. This is because interpretation of multi-view X-rays is complex. We examine opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy generalized to complex anatomy, including fractures, and remained robust to varied initializations. Rollouts on real bi-planar X-rays further suggest that the model can produce plausible trajectories, despite training exclusively in simulation. While these preliminary results are promising, we also identify limitations, especially in entry point precision. Full closed-loop control will require additional considerations around how to provide sufficiently frequent feedback. With more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.
[30] Desert Waste Detection and Classification Using Data-Based and Model-Based Enhanced YOLOv12 DL Model
Abdulmumin Sa’ad,Sulaimon Oyeniyi Adebayo,Abdul Jabbar Siddiqui
Main category: cs.CV
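TL;DR: The paper proposes an enhanced real-time object detection framework based on YOLOv12 for detecting and classifying waste in desert environments, combining data augmentation with model optimization to improve detection accuracy and efficiency.
Details
Motivation: The global waste crisis is worsening, yet traditional waste collection is inefficient and hazardous in remote areas such as deserts. Existing computer vision research concentrates on urban environments and overlooks deserts and other unusual terrain, as well as organic and hazardous waste. Contribution: A pruned, lightweight YOLOv12 model that integrates Self-Adversarial Training (SAT) and specialized data augmentation strategies, achieving higher accuracy and efficiency on the DroneTrashNet dataset.
Method: YOLOv12 is pruned into a lightweight design and combined with SAT and data augmentation to improve detection in desert environments.
Result: The model shows significant gains in precision, recall, and mAP while keeping latency low and model size small, making it suitable for resource-constrained drones.
Insight: Combining data-centric and model-centric optimization enables efficient, robust real-time detection in difficult environments and offers a practical approach for similar scenarios.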
Abstract: The global waste crisis is escalating, with solid waste generation expected to increase by 70% by 2050. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking our model against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.
[31] I Detect What I Don’t Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging
Nand Kumar Yadav,Rodrigue Rizk,William CW Chen,KC Santosh
Main category: cs.CV
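TL;DR: The paper proposes an unsupervised, oracle-free incremental anomaly learning framework that expands the set of normal samples using lightweight adapters and uncertainty gating, enabling efficient anomaly detection in medical imaging.
Details
Motivation: Detecting unknown anomalies in medical imaging is hampered by scarce labels and the high cost of expert supervision, so an adaptive method that needs no anomaly labels is required. Contribution: An incremental learning framework that needs neither generative models nor replay buffers, and that safely expands the normal sample memory using lightweight adapters and a dual probabilistic gating mechanism.
Method: A frozen pretrained vision backbone is combined with lightweight convolutional adapters; safe expansion is enforced through k-NN anomaly scoring and SWAG-based epistemic uncertainty gating.
Result: Anomaly detection improves substantially on several medical imaging datasets (COVID-CXR, Pneumonia CXR, Brain MRI ND-5); for example, ROC-AUC on COVID-CXR rises from 0.9489 to 0.9982.
Insight: Combining adapters with uncertainty gating allows the normal sample memory to grow efficiently without labels, offering a practical approach to anomaly detection in medical imaging.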
Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset enabling efficient k-nearest neighbor (k-NN) anomaly scoring. Safety during incremental expansion is enforced by dual probabilistic gates: a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.
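The dual-gate admission rule described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the k-NN score is taken as the mean distance to the `k` nearest coreset embeddings, the z-score is computed against scores from the seed set, and the SWAG-based uncertainty is passed in as a precomputed number instead of being estimated here. The names `knn_score`, `admit`, `z_max`, and `max_uncertainty` are hypothetical.

```python
import numpy as np

def knn_score(embedding, coreset, k=3):
    """Mean Euclidean distance from `embedding` to its k nearest coreset entries."""
    d = np.linalg.norm(np.asarray(coreset) - np.asarray(embedding), axis=1)
    return np.sort(d)[:k].mean()

def admit(embedding, uncertainty, coreset, seed_scores, max_uncertainty,
          z_max=2.0, k=3):
    """Admit a sample to the normal memory only if both gates pass.

    Gate 1: z-score of the k-NN distance, calibrated on seed_scores, is <= z_max.
    Gate 2: epistemic uncertainty (assumed precomputed, e.g. via SWAG) is below
            a bound calibrated on the seed set.
    """
    seed_scores = np.asarray(seed_scores, dtype=float)
    z = (knn_score(embedding, coreset, k) - seed_scores.mean()) / (seed_scores.std() + 1e-12)
    return bool(z <= z_max and uncertainty <= max_uncertainty)

# Toy 2-D example with a hand-made coreset and seed calibration scores.
coreset = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
seed_scores = np.array([0.9, 1.0, 1.1])

print(admit([0.5, 0.4], 0.1, coreset, seed_scores, max_uncertainty=0.3))  # close, confident
print(admit([5.0, 5.0], 0.1, coreset, seed_scores, max_uncertainty=0.3))  # far from coreset
print(admit([0.5, 0.4], 0.9, coreset, seed_scores, max_uncertainty=0.3))  # too uncertain
```

Requiring both gates to pass means a sample that is close to the coreset but highly uncertain is still rejected, which is the conservative behavior the abstract attributes to the method.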
[32] Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization
Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Main category: cs.CV
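TL;DR: The paper proposes two methods for improving temporal action localization: Boundary Distance Regression (BDR) and Adaptive Temporal Refinement (ATR), which improve accuracy and efficiency through distance regression and dynamic allocation of computation, respectively.
Details
Motivation: Existing methods apply uniform computation to all temporal action boundaries, ignoring differences in difficulty and wasting compute. The paper addresses this with adaptive computation allocation and a better boundary regression method. Contribution: 1. Boundary Distance Regression (BDR), which replaces classification with signed-distance regression to improve boundary detection accuracy. 2. Adaptive Temporal Refinement (ATR), which allocates computation dynamically and improves both efficiency and performance.
Method: 1. BDR: signed-distance regression for boundary detection, producing sharper boundary peaks. 2. ATR: continuous depth selection (τ∈[0,1]) allocates computation dynamically and allows end-to-end differentiable optimization.
Result: On THUMOS14, ATR reaches 56.5% mAP@0.7 at 162G FLOPs, 2.9 points higher than uniform processing (53.6% at 198G FLOPs) with 18% less compute. BDR retrofits to existing methods and yields 1.8 to 3.1% mAP gains.
Insight: Adaptive computation allocation and distance regression improve both the efficiency and the accuracy of temporal action localization, with the largest gain (4.2%) on short actions. Knowledge distillation further reduces training cost.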
Abstract: Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
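To make the idea of signed-distance regression concrete, the sketch below builds per-frame regression targets from a list of boundary positions. It is an assumption-laden illustration, not the paper's formulation: distances are measured in frames, the sign convention (negative before the nearest boundary, positive after it) is chosen for the example, and the function name is hypothetical.

```python
import numpy as np

def signed_distance_targets(num_frames, boundaries):
    """For each frame, the signed offset (in frames) to its nearest boundary.

    Assumed convention: negative before the boundary, zero at it, positive after.
    A boundary is then recovered as a zero crossing of the regressed signal.
    """
    t = np.arange(num_frames)[:, None]          # shape (frames, 1)
    b = np.asarray(boundaries)[None, :]         # shape (1, boundaries)
    diff = t - b                                # signed offsets to every boundary
    nearest = np.abs(diff).argmin(axis=1)       # index of the closest boundary
    return diff[np.arange(num_frames), nearest]

print(signed_distance_targets(6, [1, 4]))
```

Compared with a per-frame boundary/non-boundary label, this target changes linearly around each boundary, so a regressor receives a graded signal about how far off it is.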
[33] Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images
Sam Bahrami,Dylan Campbell
Main category: cs.CV
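TL;DR: The paper presents Room Envelopes, a synthetic dataset that pairs RGB images with two pointmaps (visible surface and structural layout surface) to advance research on indoor layout reconstruction.
Details
Motivation: Existing scene reconstruction methods recover only visible surfaces, so reconstructions are incomplete and miss occluded structural elements such as walls, floors, and ceilings. Because these elements are typically planar, repetitive, and simple, they should be relatively easy to predict, and less costly approaches may suffice. Contribution: The Room Envelopes dataset, which provides RGB images with corresponding visible-surface and structural-layout pointmaps, giving monocular geometry estimators a direct supervision signal.
Method: Synthetic RGB images and two pointmaps per image (visible surface and structural layout surface) are used to supervise feed-forward monocular geometry estimators that predict both the visible and the occluded layout surfaces.
Result: The dataset enables a more complete understanding of a scene's extent and of the shape and location of its objects.
Insight: Structural elements can be reconstructed with low-cost methods, since they are usually simple and repetitive and do not require complex generative models.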
Abstract: Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset – Room Envelopes – that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene’s extent, as well as the shape and location of its objects.
[34] Simple 3D Pose Features Support Human and Machine Social Scene Understanding
Wenshuo Qin,Leyla Isik
Main category: cs.CV
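TL;DR: The paper shows that human understanding of social interactions relies on 3D pose information, which most current AI vision models lack. Extracting 3D joint positions and social pose features significantly improves model performance.
Details
Motivation: Humans quickly extract information about social interactions from visual input, but AI systems do poorly at this task. The authors hypothesize that humans rely on 3D pose information that AI models lack. Contribution: An approach to social scene understanding based on 3D pose information, showing that 3D joint positions and social pose features are important for predicting social interaction judgments and that they significantly improve AI model performance.
Method: State-of-the-art pose and depth estimation algorithms are combined to extract the 3D joint positions of people in videos, and a compact set of 3D social pose features is derived to predict human social judgments.
Result: 3D joint position features outperform most current AI vision models; the social pose features match the predictive strength of the full joint set and significantly improve off-the-shelf AI models when combined with their embeddings.
Insight: Human social scene understanding relies on explicit 3D pose representations, and simple, structured visuospatial features can help AI models better match human social judgments.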
Abstract: Humans can quickly and effortlessly extract a variety of information about others’ social interactions from visual input, ranging from visuospatial cues like whether two people are facing each other to higher-level information. Yet, the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, which is absent in most AI vision models. To test this, we combined state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in short video clips depicting everyday human actions and compared their ability to predict human social interaction judgments with current AI vision models. Strikingly, 3D joint positions outperformed most current AI vision models, revealing that key social information is available in explicit body position but not in the learned features of most vision models, including even the layer-wise embeddings of the pose models used to extract joint positions. To uncover the critical pose features humans use to make social judgments, we derived a compact set of 3D social pose features describing only the 3D position and direction of faces in the videos. We found that these minimal descriptors matched the predictive strength of the full set of 3D joints and significantly improved the performance of off-the-shelf AI vision models when combined with their embeddings. Moreover, the degree to which 3D social pose features were represented in each off-the-shelf AI vision model predicted the model’s ability to match human social judgments. Together, our findings provide strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.
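To make the idea of "compact 3D social pose features" concrete, here is a minimal sketch that derives three descriptors from the 3D position and direction of two faces. The specific features (distance and two mutual-facing cosines) and the function name `social_pose_features` are illustrative assumptions, not the feature set used in the paper.

```python
import math

def _unit(v):
    """Normalize a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def social_pose_features(pos_a, dir_a, pos_b, dir_b):
    """Return (distance, a_faces_b, b_faces_a) for two people.

    pos_*: 3D face position; dir_*: 3D face direction (any non-zero length).
    a_faces_b is the cosine between A's face direction and the vector from
    A to B: 1.0 means A looks straight at B, -1.0 means straight away.
    These three features are a hypothetical choice for illustration.
    """
    a_to_b = tuple(b - a for a, b in zip(pos_a, pos_b))
    distance = math.sqrt(_dot(a_to_b, a_to_b))
    a_to_b_unit = _unit(a_to_b)
    b_to_a_unit = tuple(-c for c in a_to_b_unit)
    a_faces_b = _dot(_unit(dir_a), a_to_b_unit)
    b_faces_a = _dot(_unit(dir_b), b_to_a_unit)
    return distance, a_faces_b, b_faces_a
```

Two people standing two metres apart and looking at each other would give a distance of 2 and both cosines equal to 1; such a short vector could then be concatenated with a vision model's embedding before fitting a regressor.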
[35] CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation
Yuwen Tao,Kanglei Zhou,Xin Tan,Yuan Xie
Main category: cs.CV
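TL;DR: CaRF is a framework that improves multi-view consistency in referring 3D Gaussian Splatting segmentation. It introduces camera-aware Gaussian Field Camera Encoding (GFCE) and In Training Paired View Supervision (ITPVS), and outperforms existing methods on several benchmarks.
Details
Motivation: Existing Referring 3D Gaussian Splatting Segmentation methods rely on 2D rendered pseudo supervision and view-specific feature learning, which leads to poor cross-view consistency. CaRF aims to achieve multi-view consistency and geometric reasoning directly in the 3D Gaussian space.
Contribution: 1. Gaussian Field Camera Encoding (GFCE), which explicitly models view-dependent variations to enhance geometric reasoning; 2. In Training Paired View Supervision (ITPVS), which aligns per-Gaussian logits across calibrated views to mitigate single-view overfitting.
Method: CaRF is a fully differentiable framework. GFCE incorporates camera geometry into Gaussian-text interactions, and ITPVS optimizes cross-view consistency during training.
Result: On the Ref LERF, LERF OVS, and 3D OVS benchmarks, CaRF improves mIoU over state-of-the-art methods by 16.8%, 4.3%, and 2.0% on average, respectively.
Insight: By operating directly in the 3D Gaussian space and explicitly modeling view dependence, CaRF makes 3D scene understanding more reliable and view-consistent, with potential applications in embodied AI, AR/VR interaction, and autonomous perception.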
Abstract: Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret free-form language expressions and localize the corresponding 3D regions in Gaussian fields. While recent advances have introduced cross-modal alignment between language and 3D geometry, existing pipelines still struggle with cross-view consistency due to their reliance on 2D rendered pseudo supervision and view-specific feature learning. In this work, we present Camera Aware Referring Field (CaRF), a fully differentiable framework that operates directly in the 3D Gaussian space and achieves multi-view consistency. Specifically, CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions to explicitly model view-dependent variations and enhance geometric reasoning. Building on this, In Training Paired View Supervision (ITPVS) is proposed to align per-Gaussian logits across calibrated views during training, effectively mitigating single-view overfitting and exposing inter-view discrepancies for optimization. Extensive experiments on three representative benchmarks demonstrate that CaRF achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state-of-the-art methods on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively. Moreover, this work promotes more reliable and view-consistent 3D scene understanding, with potential benefits for embodied AI, AR/VR interaction, and autonomous perception.
[36] PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection
Peiyao Wang,Weining Wang,Qi Li
Main category: cs.CV
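TL;DR: PhysCorr is a unified framework that addresses physical consistency in text-to-video generation with a dual-reward model, PhysicsRM, and an optimization method, PhyDPO, significantly improving the physical plausibility of generated videos.
Details
Motivation: Existing text-to-video models achieve impressive visual quality but often violate physical laws (implausible object dynamics, incoherent interactions), which limits their use in embodied AI, robotics, and simulation-intensive domains.
Contribution: 1. PhysicsRM, the first dual-dimensional reward model, which quantifies both intra-object stability and inter-object interactions; 2. PhyDPO, an optimization method that uses contrastive feedback and physics-aware reweighting to improve the physical consistency of generated videos.
Method: 1. PhysicsRM: a dual-reward model that scores physical consistency; 2. PhyDPO: a direct preference optimization pipeline that combines contrastive feedback with physics-aware reweighting; 3. The approach is model-agnostic and scalable, and integrates with a range of video diffusion and transformer-based backbones.
Result: Experiments across multiple benchmarks show significant improvements in physical realism while preserving visual fidelity and semantic alignment.
Insight: Explicitly modeling physical constraints and optimizing generation against them offers a route to physically grounded video generation, with particular promise for embodied AI and robotics.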
Abstract: Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility - manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.
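As a rough illustration of a reweighted preference objective, the sketch below implements the standard DPO loss for one preferred/rejected pair and multiplies it by a per-pair weight. The weighting rule in `physics_weight` is a hypothetical stand-in for "physics-aware reweighting", and treating video samples as having scalar log-probabilities is a simplifying assumption; the abstract does not give PhyDPO's actual formulation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, weight=1.0):
    """Standard DPO loss for one (preferred, rejected) pair, times a weight.

    logp_*: log-probabilities under the model being trained.
    ref_logp_*: log-probabilities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin); guard against overflow.
    loss = math.log1p(math.exp(-margin)) if margin > -30.0 else -margin
    return weight * loss

def physics_weight(reward_w, reward_l):
    """Hypothetical rule: emphasize pairs with a larger physics-reward gap."""
    return 1.0 + max(0.0, reward_w - reward_l)
```

A pair whose preferred video scores much higher on a physics reward would then contribute more to the gradient than a pair with nearly equal scores.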
[37] GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization
Mahmoud Soliman,Omar Abdelaziz,Ahmed Radwan,Anand,Mohamed Shehata
Main category: cs.CV
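TL;DR: Proposes GNN-MoE, which combines GNNs with a Mixture-of-Experts framework and dynamically routes image patches to experts, improving the parameter efficiency and performance of ViTs on domain generalization.
Details
Motivation: Existing PEFT methods still suffer from limited parameter efficiency and generalization on domain generalization tasks, and need a more effective context-aware mechanism to adapt to domain shift.
Contribution: 1. The GNN-MoE framework, which combines a GNN and MoE for dynamic routing; 2. A GNN router (GCN, GAT, SAGE) that captures inter-patch relationships; 3. State-of-the-art or competitive performance with high parameter efficiency.
Method: 1. Replaces conventional token-based routing with a GNN router; 2. Builds an inter-patch graph and dynamically assigns patches to specialized experts; 3. Uses Kronecker adapters for parameter efficiency.
Result: Achieves state-of-the-art or competitive performance on domain generalization benchmarks while remaining highly parameter-efficient.
Insight: Graph-based routing exploits inter-patch context and improves robustness to domain shift.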
Abstract: Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.
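The routing idea can be sketched with a single GCN layer that mixes each patch's features with its neighbours before projecting to expert logits. This is a minimal illustration only: the graph construction, the single-layer design, and the names `gcn_route` and `router_weights` are assumptions rather than the paper's architecture.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by a GCN layer."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gcn_route(patch_feats, adj, router_weights):
    """Return per-patch expert probabilities from one GCN layer."""
    mixed = matmul(normalized_adjacency(adj), patch_feats)  # aggregate neighbours
    logits = matmul(mixed, router_weights)                  # project to expert logits
    return [softmax(row) for row in logits]
```

Unlike per-token routing, the probabilities for a patch here depend on its neighbours in the graph, which is the context-aware property the abstract emphasizes.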
[38] MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging
Mahmoud Soliman,Islam Osman,Mohamed S. Shehata,Rasika Rajapakshe
Main category: cs.CV
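TL;DR: The paper presents MedDChest, a foundational ViT model built for thoracic imaging. Large-scale in-domain pre-training and a content-aware augmentation strategy, Guided Random Resized Crops, significantly improve performance on thoracic diagnostic tasks.
Details
Motivation: Vision models often underperform in medical imaging because they are fine-tuned from backbones pre-trained on natural images, which creates a domain gap.
Contribution: Proposes MedDChest, pre-trained from scratch on large-scale thoracic imaging data, and introduces Guided Random Resized Crops, an augmentation that overcomes the inefficiency of standard cropping on medical scans.
Method: Uses a Vision Transformer pre-trained on a multimodal dataset of over 1.2 million chest X-ray and CT images compiled from 10 public sources, with a content-aware cropping strategy that biases sampling toward anatomically relevant regions.
Result: Experiments show that MedDChest significantly outperforms ImageNet-pretrained models and serves as a stronger feature extractor for thoracic diagnostic tasks.
Insight: Large-scale in-domain pre-training combined with domain-specific augmentation is a better path for medical image analysis; the planned public release of the model weights should support follow-up research.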
Abstract: The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model’s effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
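One way to bias crop sampling toward anatomically relevant regions is to draw the crop centre from a relevance mask with some probability and fall back to uniform sampling otherwise. The sketch below does exactly that with a fixed crop size; the mask input, the `guide_prob` parameter, and the omission of random scale and aspect ratio are all assumptions, since the abstract does not specify how Guided Random Resized Crops is implemented.

```python
import random

def guided_crop_box(mask, crop_h, crop_w, guide_prob=0.8, rng=None):
    """Return (top, left, height, width) of a crop biased toward mask pixels.

    mask: 2D list of 0/1 flags marking relevant pixels (hypothetical input,
    e.g. from a coarse lung segmentation). Requires crop_h <= image height
    and crop_w <= image width.
    """
    rng = rng or random.Random()
    height, width = len(mask), len(mask[0])
    relevant = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    if relevant and rng.random() < guide_prob:
        center_r, center_c = rng.choice(relevant)   # biased toward anatomy
    else:
        center_r, center_c = rng.randrange(height), rng.randrange(width)
    # Clamp so the crop stays inside the image.
    top = min(max(center_r - crop_h // 2, 0), height - crop_h)
    left = min(max(center_c - crop_w // 2, 0), width - crop_w)
    return top, left, crop_h, crop_w
```

With `guide_prob=0.0` this reduces to ordinary uniform cropping, which makes it easy to compare the two strategies on the same data.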
[39] A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals
Arfina Rahman,Mahesh Banavar
Main category: cs.CV
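TL;DR: The paper proposes CVT-ConvMixer-LSTM, a hybrid deep learning model for robust biometric authentication from low-frame-rate PPG signals, reaching 98% accuracy.
Details
Motivation: PPG signals are attractive for biometric authentication because of non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearables, but they are degraded by motion artifacts, illumination changes, and inter-subject physiological variability, so robust feature extraction and classification are needed.
Contribution: 1. The CVT-ConvMixer-LSTM hybrid model, which combines spatial and temporal features; 2. Conversion of PPG segments into time-frequency scalograms with the CWT; 3. 98% authentication accuracy on 46 subjects.
Method: 1. Preprocess the PPG signal (baseline drift removal, PCA-based motion artifact suppression, bandpass filtering, Fourier-based resampling, amplitude normalization); 2. Generate time-frequency scalograms with the CWT; 3. Combine features from the CVT, ConvMixer, and LSTM branches.
Result: On the CFIHSR dataset, the model reaches 98% authentication accuracy and is robust to noise and inter-subject variability.
Insight: Combining spatial and temporal features in a hybrid model improves performance, making the system suitable for mobile and embedded biometric security applications.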
Abstract: Photoplethysmography (PPG) signals, which measure changes in blood volume in the skin using light, have recently gained attention in biometric authentication because of their non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearable devices. However, PPG signal quality is challenged by motion artifacts, illumination changes, and inter-subject physiological variability, making robust feature extraction and classification crucial. This study proposes a lightweight and cost-effective biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The raw PPG signals undergo a standard preprocessing pipeline involving baseline drift removal, motion artifact suppression using Principal Component Analysis (PCA), bandpass filtering, Fourier-based resampling, and amplitude normalization. To generate robust representations, each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), effectively capturing transient cardiovascular dynamics. We developed a hybrid deep learning model, termed CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision Transformer (CVT) and ConvMixer branches with temporal features from a Long Short-Term Memory network (LSTM). The experimental results on 46 subjects demonstrate an authentication accuracy of 98%, validating the robustness of the model to noise and variability between subjects. Due to its efficiency, scalability, and inherent liveness detection capability, the proposed system is well-suited for real-world mobile and embedded biometric security applications.
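The scalogram step can be illustrated with a direct, unoptimized Continuous Wavelet Transform using a complex Morlet wavelet. This is a minimal sketch in plain Python: the wavelet choice, the `w0=6.0` parameter, and the frequency grid are assumptions, since the abstract does not state which mother wavelet or scales were used.

```python
import cmath
import math

def morlet_scalogram(signal, fs, freqs, w0=6.0):
    """Return |CWT| magnitudes: one row per frequency, one column per sample."""
    n = len(signal)
    rows = []
    for f in freqs:
        scale = w0 / (2 * math.pi * f)   # seconds; centres the wavelet on f Hz
        half = int(min(n - 1, math.ceil(4 * scale * fs)))  # truncate near 4 sigma
        offsets = range(-half, half + 1)
        kernel = []
        for k in offsets:
            t = k / fs / scale
            kernel.append(cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t) / math.sqrt(scale))
        row = []
        for i in range(n):
            acc = 0j
            for k, psi in zip(offsets, kernel):
                j = i + k
                if 0 <= j < n:           # samples outside the segment count as zero
                    acc += signal[j] * psi.conjugate()
            row.append(abs(acc) / fs)
        rows.append(row)
    return rows
```

For a 10-second segment sampled at 14 Hz, a frequency grid covering roughly 0.5 to 4 Hz would span typical heart-rate content; the resulting 2D array is what an image-style network branch would consume.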
[40] Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment
Zehui Feng,Chenqi Zhang,Mingru Wang,Minuo Wei,Shiwei Cheng,Cuntai Guan,Ting Han
Main category: cs.CV
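TL;DR: Bratrix is an end-to-end framework for multimodal language-anchored vision-brain alignment. It decomposes visual stimuli into hierarchical semantic components and adds an uncertainty perception module, improving retrieval, reconstruction, and captioning on EEG, MEG, and fMRI data.
Details
Motivation: Existing methods align neural activity directly with visual embeddings, but visual-only representations fail to capture latent semantic dimensions, limiting interpretability and robustness. Bratrix addresses this with language anchoring and uncertainty perception.
Contribution: 1. Bratrix, the first end-to-end language-anchored vision-brain alignment framework; 2. An uncertainty perception module; 3. Higher alignment precision through hierarchically decoupled semantic components and a two-stage training strategy.
Method: 1. Decouple visual stimuli into hierarchical visual and linguistic semantic components; 2. Apply uncertainty-aware weighting during alignment; 3. Use learnable language-anchored semantic matrices to enhance cross-modal correlations; 4. Train in two stages: single-modality pretraining followed by multimodal fine-tuning.
Result: On EEG, MEG, and fMRI benchmarks, Bratrix outperforms state-of-the-art methods on retrieval, reconstruction, and captioning; the abstract reports it "surpassing 14.3%" on the 200-way EEG retrieval task.
Insight: Language anchoring and uncertainty perception help capture latent semantics in neural signals and suggest a new direction for cross-modal alignment.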
Abstract: Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and deep robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, specifically surpassing 14.3% in 200-way EEG retrieval task. Code and model are available.
[41] Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score
Abu Hanif Muhammad Syarubany
Main category: cs.CV
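TL;DR: The paper compares CycleGAN and Noise2Score for CT image denoising. CycleGAN gives the best final image quality, while Noise2Score is a robust alternative when clean pairs are unavailable.
Details
Motivation: Studies CT image denoising in the unpaired and self-supervised regimes by comparing two training-data-efficient approaches, in order to identify the best denoising setup.
Contribution: 1. Identifies the most reliable CycleGAN configuration; 2. Evaluates CycleGAN and Noise2Score on CT denoising under a common protocol; 3. Releases the source code.
Method: 1. CycleGAN: a residual translator with a standard U-Net backbone (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64), selected by a configuration sweep and then trained to convergence; 2. Noise2Score: a self-supervised denoiser based on score matching.
Result: CycleGAN improves PSNR from 34.66 dB to 38.913 dB and SSIM from 0.9234 to 0.971; Noise2Score is slightly behind in absolute PSNR/SSIM but achieves large gains on very noisy inputs.
Insight: CycleGAN offers the strongest final image quality in the unpaired setting; Noise2Score is a robust pair-free alternative with competitive performance.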
Abstract: We study CT image denoising in the unpaired and self-supervised regimes by evaluating two strong, training-data-efficient paradigms: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Under a common evaluation protocol, a configuration sweep identifies a simple standard U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64) as the most reliable setting; we then train it to convergence with a longer schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs, highlighting its utility when clean pairs are unavailable. Overall, CycleGAN offers the strongest final image quality, whereas Noise2Score provides a robust pair-free alternative with competitive performance. Source code is available at https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.
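For readers who want to see how the reported numbers relate to code, the sketch below computes PSNR for flattened images and combines CycleGAN-style loss terms using the weights quoted in the abstract. How `lambda_iden` enters the objective is an assumption here (some implementations scale the identity term by the cycle weight as well), and the function names are illustrative.

```python
import math

def psnr(clean, denoised, data_range=1.0):
    """Peak signal-to-noise ratio in dB for two equally long pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(clean, denoised)) / len(clean)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)

def generator_objective(adv_loss, cycle_loss, identity_loss,
                        lambda_cycle=30.0, lambda_iden=2.0):
    """Weighted sum of adversarial, cycle-consistency, and identity terms.

    Defaults follow the abstract; the additive form is an assumption.
    """
    return adv_loss + lambda_cycle * cycle_loss + lambda_iden * identity_loss
```

The reported improvement from 34.66 dB to 38.913 dB is a gain of about 4.25 dB, which corresponds to reducing the mean squared error by a factor of roughly 10^(0.425), or about 2.7.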
[42] When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation
Nishchal Sapkota,Haoyan Shi,Yejia Zhang,Xianshi Ma,Bofang Zheng,Danny Z. Chen
Main category: cs.CV
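TL;DR: The paper proposes UKAST, an architecture that integrates KANs into the Swin Transformer for medical image segmentation, with better data efficiency and performance.
Details
Motivation: Medical image segmentation is hard because of complex anatomical structures and limited annotated data. CNNs excel at local features but struggle with long-range dependencies; Transformers capture global context but are data-hungry and computationally expensive.
Contribution: Proposes UKAST, which integrates Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders, yielding a more expressive and data-efficient framework with reduced FLOPs.
Method: Replaces vanilla spline-based KANs with rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), integrated into Swin Transformer encoders in a U-Net-like architecture.
Result: Achieves state-of-the-art results on four 2D and 3D medical image segmentation benchmarks, with especially strong accuracy in data-scarce settings, reduced FLOPs, and only a very small increase in parameter count compared to SwinUNETR.
Insight: KAN-enhanced Transformers are a promising direction for data-efficient medical image segmentation and alleviate the data hunger of standard ViTs.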
Abstract: Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net like architecture that integrates rational-function based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST
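The following sketch shows what a group-wise rational activation can look like: each contiguous group of channels shares one learnable rational function. The "safe" form P(x) / (1 + |Q(x)|) and the grouping rule are assumptions chosen for illustration; the abstract does not give the exact parameterization that UKAST or KAT uses.

```python
def rational(x, num_coeffs, den_coeffs):
    """Safe rational function P(x) / (1 + |Q(x)|).

    P(x) = a0 + a1*x + ...; Q(x) = b1*x + b2*x^2 + ... (no constant term),
    so the denominator is always at least 1 and never reaches zero.
    """
    p = sum(a * x ** i for i, a in enumerate(num_coeffs))
    q = sum(b * x ** (j + 1) for j, b in enumerate(den_coeffs))
    return p / (1.0 + abs(q))

def group_rational(features, group_coeffs):
    """Apply one shared rational function per contiguous channel group.

    group_coeffs: list of (num_coeffs, den_coeffs), one pair per group.
    Sharing coefficients within a group keeps the parameter count small.
    """
    group_size = len(features) // len(group_coeffs)
    out = []
    for channel, value in enumerate(features):
        group = min(channel // group_size, len(group_coeffs) - 1)
        num, den = group_coeffs[group]
        out.append(rational(value, num, den))
    return out
```

With G groups and polynomial orders m and n, the activation adds only G * (m + 1 + n) parameters regardless of channel count, which is consistent with the abstract's claim of a very small parameter increase.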
[43] SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
Biao Liu,Yuanzhi Liang
Main category: cs.CV
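TL;DR: SpatialLock is a text-to-image generation framework that combines perception signals and grounding information to precisely control object locations in generated images.
Details
Motivation: Existing text-to-image methods do not fully use positional information and therefore control the spatial layout of objects poorly.
Contribution: Proposes SpatialLock, with two components, Position-Engaged Injection (PoI) and Position-Guided Learning (PoG), which improve object localization precision and the visual quality of generated images.
Method: PoI integrates spatial information directly through an attention layer; PoG uses perception-based supervision to further refine object localization.
Result: Achieves IOU scores above 0.9 across multiple datasets, a new state of the art for precise object positioning.
Insight: Jointly using perception signals and grounding information is an effective way to improve spatial precision in text-to-image generation.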
Abstract: Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IOU scores above 0.9 across multiple datasets.
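The IOU metric quoted above is the standard intersection-over-union between a requested box and the box detected in the generated image. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0
```

An IOU above 0.9 is a strict threshold: for a 100 by 100 pixel box, a horizontal shift of about 5 pixels already brings the score down to roughly 0.9.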
[44] Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
Yunghee Lee,Byeonghyun Pak,Junwha Hong,Hoseong Kim
Main category: cs.CV
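TL;DR: The paper proposes Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion model inference with multirate integration while maintaining high-fidelity generation.
Details
Motivation: Diffusion model inference is computationally expensive, which hinders real-time use. By analyzing the multirate structure of the classifier-free guidance (CFG) ODE, the authors find that the additional guidance branch is more robust to numerical error, exposing redundancy that conventional solvers do not exploit.
Contribution: 1. THG, which integrates the noise estimate and the additional guidance term on timestep grids of different granularity; 2. An error-bound-aware timestep sampler and a guidance-scale scheduler; 3. Up to 30% fewer function evaluations (NFE) with virtually no loss in generation fidelity.
Method: 1. Reformulate the CFG ODE as a multirate system of ODEs; 2. Integrate the noise estimate on the fine-grained grid (tortoise equation) and the additional guidance on a coarse grid (hare equation); 3. Use the error-bound-aware timestep sampler and the guidance-scale scheduler to select step sizes and stabilize large extrapolation spans.
Result: THG outperforms state-of-the-art training-free CFG accelerators under identical computation budgets, reducing NFE by up to 30% with virtually no loss in fidelity (ΔImageReward ≤ 0.032).
Insight: Multirate ODE formulations offer an efficient inference route for diffusion models, moving toward real-time high-quality image synthesis without retraining.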
Abstract: In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.
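The multirate idea can be illustrated on a toy scalar problem: evaluate the conditional branch at every fine step, but refresh the guidance difference only every `ratio` steps and hold it constant in between. This zero-order hold with plain Euler steps is a deliberate simplification and an assumption; THG itself integrates a separate hare equation and uses an error-bound-aware sampler, neither of which is reproduced here.

```python
def multirate_cfg_sample(x, timesteps, eps_cond, eps_uncond, scale, ratio):
    """Euler sampling with CFG where the guidance term is refreshed sparsely.

    eps_cond / eps_uncond: callables (x, t) -> noise estimate (toy stand-ins
    for the two network evaluations). Returns (final x, number of evaluations).
    """
    nfe = 0
    guidance = 0.0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        noise = eps_cond(x, t)                    # every fine step ("tortoise")
        nfe += 1
        if i % ratio == 0:                        # coarse grid only ("hare")
            guidance = noise - eps_uncond(x, t)
            nfe += 1
        # CFG: eps_u + s * (eps_c - eps_u) == eps_c + (s - 1) * (eps_c - eps_u)
        x = x + (t_next - t) * (noise + (scale - 1.0) * guidance)
    return x, nfe
```

Setting `ratio=1` recovers ordinary CFG with two evaluations per step, so the same function can be used to compare evaluation counts and final samples between the two schedules.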
[45] Text to Sketch Generation with Multi-Styles
Tengjie Li,Shikui Tu,Lei Xu
Main category: cs.CV
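TL;DR: The paper proposes M3S, a training-free diffusion-based framework for sketch generation with multi-style control through text prompts and reference style sketches; linear smoothing and a style-content guidance mechanism reduce content leakage and improve quality.
Details
Motivation: Vision-language models have advanced sketch generation, but existing methods lack mechanisms for precise control over sketch style.
Contribution: A training-free framework that supports multi-style sketch generation and introduces linear smoothing and style-content guidance to reduce content leakage.
Method: Built on diffusion models, with style controlled by text prompts and reference sketches. Reference features are incorporated as auxiliary information with linear smoothing, and a joint AdaIN module coordinates features from multiple reference sketches.
Result: Experiments show accurate style alignment and high generation quality, especially when the structural similarity between reference and target sketches is low.
Insight: Treating reference features as auxiliary information, coordinated by a joint AdaIN module, allows flexible multi-style generation and reduces the content leakage seen in earlier style transfer methods.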
Abstract: Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
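Adaptive Instance Normalization (AdaIN) re-scales content features so that their mean and standard deviation match those of a style reference. The sketch below applies it to one channel and blends the statistics of several references with convex weights; that blending rule is a hypothetical reading of "joint AdaIN", not the M3S implementation.

```python
import math

def channel_stats(values, eps=1e-5):
    """Mean and (stabilized) standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def adain(content, style_mean, style_std):
    """Normalize content, then re-scale to the target style statistics."""
    mean, std = channel_stats(content)
    return [style_std * (v - mean) / std + style_mean for v in content]

def multi_style_adain(content, styles, weights):
    """Blend the statistics of several style references, then apply AdaIN once.

    The weighted average of means and standard deviations is an assumption
    made for illustration.
    """
    total = sum(weights)
    stats = [channel_stats(s) for s in styles]
    mean = sum(w * m for w, (m, _) in zip(weights, stats)) / total
    std = sum(w * s for w, (_, s) in zip(weights, stats)) / total
    return adain(content, mean, std)
```

Changing the weights moves the output statistics smoothly between the references, which is one simple way to expose a user-facing style-mixing control.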
[46] Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)
Venkata Manikanta Desu,Syed Fawaz Ali
Main category: cs.CV
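TL;DR: The paper presents an automated tennis match analysis system that combines several deep learning models to track players and the ball in real time and detect court keypoints, producing detailed match analytics.
Details
Motivation: Automation can make tennis match analysis more efficient and accurate, giving coaches, broadcasters, and players real-time, actionable insights.
Contribution: A complete analysis pipeline that integrates YOLOv8, YOLOv5, and a ResNet50-based architecture to detect players, track the ball, locate court keypoints, and generate detailed match data.
Method: Uses YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based model for court keypoint detection, running in real time.
Result: Experiments show robust performance across court conditions and match scenarios; the system outputs an annotated video and detailed performance metrics.
Insight: Integrating multiple specialized models is an effective approach to automated sports analysis, with deep learning supporting real-time tracking and detection.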
Abstract: This study presents a complete pipeline for automated tennis match analysis. Our framework integrates multiple deep learning models to detect and track players and the tennis ball in real time, while also identifying court keypoints for spatial reference. Using YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection, our system provides detailed analytics including player movement patterns, ball speed, shot accuracy, and player reaction times. The experimental results demonstrate robust performance in varying court conditions and match scenarios. The model outputs an annotated video along with detailed performance metrics, enabling coaches, broadcasters, and players to gain actionable insights into the dynamics of the game.
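Court keypoints give a spatial reference that turns pixel displacements into physical quantities such as ball speed. The sketch below uses a single scale factor derived from the two baselines, which assumes a roughly top-down view with no perspective distortion; a real pipeline would more likely fit a homography from several keypoints. Function names are illustrative.

```python
import math

COURT_LENGTH_M = 23.77  # baseline-to-baseline length of a tennis court

def metres_per_pixel(baseline_near, baseline_far):
    """Scale factor from two keypoints lying on opposite baselines."""
    return COURT_LENGTH_M / math.dist(baseline_near, baseline_far)

def ball_speed_kmh(p_prev, p_curr, frames_apart, fps, scale):
    """Average ball speed between two detections, in km/h."""
    metres = math.dist(p_prev, p_curr) * scale
    seconds = frames_apart / fps
    return metres / seconds * 3.6
```

Averaging the estimate over several consecutive frames would dampen the jitter that single-frame detections introduce.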
[47] Learning from Online Videos at Inference Time for Computer-Use Agents
Yujian Liu,Ze Wang,Hao Chen,Ximeng Sun,Xiaodong Yu,Jialian Wu,Jiang Liu,Emad Barsoum,Zicheng Liu,Shiyu Chang
Main category: cs.CV
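TL;DR: The paper proposes a framework that lets computer-use agents learn from online videos at inference time: it retrieves and filters videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance, improving agent performance.
Details
Motivation: Computer-use agents lag behind humans on tasks that require domain-specific procedural knowledge, whereas humans learn quickly by watching video tutorials. The paper studies how agents can use online video tutorials effectively at inference time.
Contribution: 1. A framework that converts online videos into structured demonstration trajectories; 2. A two-stage trajectory selection mechanism that provides in-context guidance dynamically; 3. Experiments showing that the framework outperforms variants that use only textual tutorials or transcripts.
Method: 1. Infer UI actions with a vision-language model (VLM); 2. Segment videos into short action subsequences and assign each a textual objective; 3. At each step, use a two-stage selection mechanism to choose the single most helpful trajectory to add in context.
Result: On two widely used benchmarks, the framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts.
Insight: Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that online videos can be systematically distilled into actionable guidance for computer-use agents.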
Abstract: Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time. Our code is available at https://github.com/UCSB-NLP-Chang/video_demo.
[48] Seeing Straight: Document Orientation Detection for Efficient OCR
Suranjan Goswami,Abhinav Ravi,Raja Kolla,Ali Faraz,Shaharukh Khan,Akash,Chandra Khatri,Shubham Agarwal
Main category: cs.CV
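TL;DR: The paper introduces OCR-Rotation-Bench (ORB), a benchmark for evaluating OCR robustness to image rotation, and a lightweight rotation classification pipeline that substantially improves OCR performance.
Details
Motivation: Document orientation detection is a critical OCR pre-processing step, and misalignment commonly arises in real-world settings from user errors such as incorrect camera orientation during capture.
Contribution: 1) ORB, comprising ORB-En and ORB-Indic; 2) A lightweight rotation classification pipeline built on the vision encoder of Phi-3.5-Vision.
Method: Uses the vision encoder of Phi-3.5-Vision with dynamic image cropping, fine-tuned for a 4-class rotation task.
Result: Reaches 96% and 92% accuracy on ORB-En and ORB-Indic, respectively, and boosts OCR performance by up to 14% for closed-source models and up to 4x for open-weights models.
Insight: Accurate rotation correction is essential for OCR performance, and a lightweight classifier is fast and robust in simulated real-world settings.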
Abstract: Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy in identifying rotations on the two datasets, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance for closed-source (up to 14%) and open-weights (up to 4x) models in the simulated real-world setting.
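Once a classifier has predicted one of four rotation classes, correcting the page is a matter of applying the inverse rotation before OCR. The sketch below works on a row-major 2D grid and assumes class k means "the page was rotated clockwise by k * 90 degrees"; that class convention is an assumption, as the abstract does not define it.

```python
def rotate_cw(image):
    """Rotate a row-major 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotate_ccw(image):
    """Rotate a row-major 2D grid by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def correct_orientation(image, predicted_class):
    """Undo a clockwise rotation of predicted_class * 90 degrees."""
    for _ in range(predicted_class % 4):
        image = rotate_ccw(image)
    return image
```

Because the correction is exact for multiples of 90 degrees, any remaining OCR degradation comes from classifier errors rather than from resampling.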
[49] Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology
Fatemehzahra Darzi,Rodrigo Escobar Diaz Guerrero,Thomas Bocklitz
Main category: cs.CV
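TL;DR: The paper systematically evaluates how preprocessing techniques affect multimodal image registration in digital pathology and finds that CycleGAN color transformation gives the lowest registration errors.
Details
Motivation: In digital pathology, registering images from different stains or imaging modalities is essential for integrating and analyzing information, but how preprocessing affects registration accuracy is unclear.
Contribution: A systematic evaluation of color transformations and other preprocessing steps for registering H&E-stained images with non-linear multimodal images, showing that CycleGAN yields the lowest registration error.
Method: Applies several preprocessing steps (color transformation with CycleGAN, Macenko, Reinhard, or Vahadane; inversion; contrast adjustment; intensity normalization; denoising) to 20 tissue sample pairs, registers the images with VALIS, and evaluates with the relative Target Registration Error (rTRE).
Result: CycleGAN color transformation achieved the lowest registration errors (MMrTRE and AMrTRE) in both test scenarios.
Insight: Preprocessing, and color transformation in particular, plays a key role in multimodal registration; a suitable choice improves alignment and the reliability of downstream analysis.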
Abstract: Image registration refers to the process of spatially aligning two or more images by mapping them into a common coordinate system, so that corresponding anatomical or tissue structures are matched across images. In digital pathology, registration enables direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction. Accurate registration of images from different modalities is an essential step in digital pathology. In this study, we investigated how various color transformation techniques affect image registration between hematoxylin and eosin (H&E) stained images and non-linear multimodal images. We used a dataset of 20 tissue sample pairs, with each pair undergoing several preprocessing steps, including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. All images were registered using the VALIS registration method, which first applies rigid registration and then performs non-rigid registration in two steps on both low and high-resolution images. Registration performance was evaluated using the relative Target Registration Error (rTRE). We reported the median of median rTRE values (MMrTRE) and the average of median rTRE values (AMrTRE) for each method. In addition, we performed a custom point-based evaluation using ten manually selected key points. Registration was done separately for two scenarios, using either the original or inverted multimodal images. In both scenarios, CycleGAN color transformation achieved the lowest registration errors, while the other methods showed higher errors. These findings show that applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology.
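The two summary metrics can be computed in a few lines. The sketch assumes the common definition of rTRE as the Euclidean landmark error divided by the image diagonal; the abstract does not restate the definition, so treat the normalization as an assumption.

```python
import math
import statistics

def rtre(warped_points, target_points, image_height, image_width):
    """Relative target registration error for each landmark pair."""
    diagonal = math.hypot(image_height, image_width)
    return [math.dist(p, q) / diagonal for p, q in zip(warped_points, target_points)]

def summarize(per_pair_rtre):
    """Median-of-medians and average-of-medians over image pairs.

    per_pair_rtre: one list of landmark rTRE values per registered pair.
    """
    medians = [statistics.median(values) for values in per_pair_rtre]
    return {
        "MMrTRE": statistics.median(medians),
        "AMrTRE": statistics.mean(medians),
    }
```

Reporting both values is informative because the average of medians is sensitive to a few badly registered pairs, while the median of medians is not.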
[50] Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
Josef Mayr,Anna Reithmeir,Maxime Di Folco,Julia A. Schnabel
Main category: cs.CV
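TL;DR: This paper examines the effectiveness of covariance descriptors for medical image classification, building them from features of pre-trained general vision encoders (GVEs) and comparing them with handcrafted features. GVE-derived covariance descriptors perform better, and combined with SPDNet they surpass existing methods.
Details
Motivation: Covariance descriptors perform well in general computer vision tasks but remain underexplored in medical image analysis. The authors aim to assess their potential for medical image classification, particularly in combination with pre-trained general vision encoders.
Contribution: 1. Proposes constructing covariance descriptors from GVE features; 2. Shows that GVE-derived covariance descriptors outperform those built from handcrafted features; 3. Combines DINOv2 features with SPDNet to outperform current state-of-the-art methods.
Method: 1. Extract features from pre-trained GVEs (DINOv2 and MedSAM) and build covariance descriptors from them; 2. Compare against covariance descriptors built from handcrafted features; 3. Classify the resulting SPD matrices with SPDNet.
Result: On eleven datasets from the MedMNIST benchmark, GVE-derived covariance descriptors outperform handcrafted ones, and SPDNet performs best when combined with DINOv2 features.
Insight: Pre-trained general vision encoders provide stronger feature representations for medical image analysis, and combining them with covariance descriptors can further improve classification performance.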
Abstract: Covariance descriptors capture second-order statistics of image features. They have shown strong performance in general computer vision tasks, but remain underexplored in medical imaging. We investigate their effectiveness for both conventional and learning-based medical image classification, with a particular focus on SPDNet, a classification network specifically designed for symmetric positive definite (SPD) matrices. We propose constructing covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and MedSAM - are evaluated across eleven binary and multi-class datasets from the MedMNIST benchmark. Our results show that covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features. Moreover, SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features. Our findings highlight the potential of combining covariance descriptors with powerful pretrained vision encoders for medical image analysis.
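As a rough illustration of what "second-order statistics of image features" means, the sketch below builds a covariance descriptor from a handful of feature vectors. The toy 2-dimensional "tokens" stand in for encoder patch features, and the small diagonal regularizer `eps` is an assumption added here so that the matrix is strictly positive definite; the paper's exact construction may differ.

```python
def covariance_descriptor(features, eps=1e-6):
    """Sample covariance of N feature vectors of length d, regularized to be SPD.

    features: list of N lists of d floats (e.g. patch tokens from an encoder).
    eps: hypothetical diagonal regularizer, not taken from the paper.
    """
    n = len(features)
    d = len(features[0])
    mean_vec = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - mean_vec[j] for j in range(d)] for f in features]
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    for i in range(d):
        cov[i][i] += eps  # keeps the matrix strictly positive definite
    return cov


# Four toy 2-D feature vectors standing in for patch tokens.
tokens = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 9.0]]
C = covariance_descriptor(tokens)
determinant = C[0][0] * C[1][1] - C[0][1] * C[1][0]
print(C, determinant)
```

The descriptor has a fixed d x d size regardless of how many tokens the image yields, which is what lets a network that operates on SPD matrices, such as SPDNet, consume it directly.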
[51] MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection
Marawan Elbatel,Anbang Wang,Keyuan Liu,Kaouther Mouheb,Enrique Almar-Munoz,Lizhuo Lin,Yanqi Yang,Karim Lekadir,Xiaomeng Li
Main category: cs.CV
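TL;DR: This paper presents MedSapiens, which adapts Sapiens, a foundation model for human pose estimation, to anatomical landmark detection in medical imaging and outperforms existing methods.
Details
Motivation: Landmark detection in medical imaging has traditionally relied on domain-specific models, while the emergence of large-scale pre-trained vision models opens new opportunities for cross-domain use. The paper explores the potential of human pose estimation models in the medical domain.
Contribution: Shows that human-centric pose estimation foundation models such as Sapiens provide strong priors for landmark detection in medical imaging, and demonstrates experimentally that the adapted model outperforms both generalist and specialist models.
Method: Adapts the Sapiens model to medical imaging through multi-dataset pretraining, yielding the MedSapiens model.
Result: MedSapiens sets a new state of the art on multiple datasets, improving the average success detection rate (SDR) by up to 5.26% over generalist models and up to 21.81% over specialist models; in few-shot settings it exceeds the previous state of the art by 2.69% in SDR.
Insight: Foundation models optimized for human pose have largely untapped potential in medical imaging tasks, and cross-domain adaptation is an effective way to improve performance.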
Abstract: This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens .
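The success detection rate (SDR) used above is, in its usual form, the percentage of predicted landmarks that fall within a distance threshold of the ground truth. The sketch below assumes that definition; the coordinates and thresholds are hypothetical and unit-free, whereas benchmark thresholds are typically given in millimetres.

```python
import math


def success_detection_rate(predicted, ground_truth, threshold):
    """Percentage of landmarks whose prediction lies within `threshold` of the truth."""
    hits = sum(
        1 for p, t in zip(predicted, ground_truth) if math.dist(p, t) <= threshold
    )
    return 100.0 * hits / len(ground_truth)


# Hypothetical landmarks; the errors are 1, 3, 5 and 0 units respectively.
truth = [(10, 10), (20, 20), (30, 30), (40, 40)]
preds = [(10, 11), (20, 23), (33, 34), (40, 40)]

sdr_by_threshold = {t: success_detection_rate(preds, truth, t) for t in (2, 4, 5)}
print(sdr_by_threshold)
```

Reporting SDR at several thresholds, as here, shows how quickly accuracy degrades as the tolerance tightens, which a single mean error would not reveal.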
[52] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery
Claudio Giusti,Luca Guarnera,Sebastiano Battiato
Main category: cs.CV
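TL;DR: Proto-LeakNet is a signal-leak-aware image attribution framework that exploits statistical traces in the latent space of diffusion models. Using partial forward diffusion and a temporal attention encoder, it achieves high classification accuracy and also performs well on generators not seen during training.
Details
Motivation: As synthetic image and deepfake generation advances, source attribution and authenticity verification have become key challenges. Recent studies find that diffusion models leave statistical traces (signal leaks) in their outputs, which suggests a new route to image attribution.
Contribution: Proposes the Proto-LeakNet framework, which combines closed-set classification with a density-based open-set evaluation and uses signal leaks in the diffusion latent space to provide accurate and interpretable attribution.
Method: Operating in the latent domain of diffusion models, the method re-simulates partial forward diffusion to expose generator-specific cues, aggregates multi-step latent features with a temporal attention encoder, and structures the embedding space with a feature-weighted prototype head.
Result: Trained only on closed-set data, the model reaches a Macro AUC of 98.13%, separates known from unseen generators well, and remains robust under post-processing.
Insight: Signal leaks exist in the latent space of diffusion models, and modeling the resulting latent geometry enables reliable and interpretable image attribution and deepfake detection.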
Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Operating in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability between known and unseen generators. These results demonstrate that modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics. The code for the whole work will be available upon submission.
[53] DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification
Yujie Yang,Shuang Li,Jun Ye,Neng Dong,Fan Li,Huafeng Li
Main category: cs.CV
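TL;DR: This paper proposes a new method for video-based visible-infrared cross-modal person re-identification that combines the visual priors of DINOv2 with gait feature learning and markedly improves performance.
Details
Motivation: Existing person re-identification methods focus on modality-invariant visual features and overlook gait features, which are both modality-invariant and rich in temporal dynamics. As a result, they model the spatiotemporal consistency needed for cross-modal video matching poorly.
Contribution: 1. Proposes the DinoGRL framework, which uses the visual priors of DINOv2 to learn gait features that complement appearance cues. 2. Designs the SASGL model to produce semantically enhanced silhouette representations. 3. Develops the PBMGE module, which refines global representations through bidirectional interaction between the gait and appearance streams.
Method: 1. Use DINOv2 to extract semantic priors and enhance silhouette representations. 2. Build the SASGL model to optimize silhouettes jointly with the ReID objective. 3. Refine feature representations progressively across multiple granularities with the PBMGE module.
Result: On the HITSZ-VCM and BUPT datasets, DinoGRL significantly outperforms existing state-of-the-art methods.
Insight: Gait features are modality-invariant and temporally dynamic; combining them with semantic priors can markedly improve cross-modal person re-identification.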
Abstract: Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
[54] Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment
Leire Benito-Del-Valle,Artzai Picón,Daniel Mugica,Manuel Ramos,Eva Portillo,Javier Romero,Carlos Javier Jimenez,Ramón Navarra-Mestre
Main category: cs.CV
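TL;DR: This paper uses self-supervised learning to adapt a general-purpose vision foundation model to herbicide trial assessment in agriculture, improving species identification and damage classification and generalizing better under unseen conditions.
Details
Motivation: Herbicide trials require accurate plant species identification and damage assessment, but general-purpose vision foundation models perform poorly on fine-grained agricultural tasks, so domain-specific adaptation is needed.
Contribution: Proposes a domain-specific vision foundation model for herbicide trials that improves species identification and damage classification and reduces the need for manual annotation.
Method: Trains the model with self-supervised learning on a large agricultural dataset to learn rich, transferable representations suited to herbicide trial images.
Result: The domain-specific model outperforms the best general-purpose model in species identification (F1 from 0.91 to 0.94) and damage classification (F1 from 0.26 to 0.33), with larger gains under unseen conditions (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27).
Insight: Domain-specific pretraining improves performance and also cuts annotation requirements substantially (80% fewer labeled samples), which makes automated solutions for agriculture more feasible.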
Abstract: Herbicide field trials require accurate identification of plant species and assessment of herbicide-induced damage across diverse environments. While general-purpose vision foundation models have shown promising results in complex visual domains, their performance can be limited in agriculture, where fine-grained distinctions between species and damage types are critical. In this work, we adapt a general-purpose vision foundation model to herbicide trial characterization. Trained using a self-supervised learning approach on a large, curated agricultural dataset, the model learns rich and transferable representations optimized for herbicide trial images. Our domain-specific model significantly outperforms the best general-purpose foundation model in both species identification (F1 score improvement from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions (new locations and other times), it achieves even greater gains (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In domain-shift scenarios, such as drone imagery, it maintains strong performance (species classification from 0.49 to 0.60). Additionally, we show that domain-specific pretraining enhances segmentation accuracy, particularly in low-annotation regimes. An annotation-efficiency analysis reveals that, under unseen conditions, the domain-specific model achieves 5.4% higher F1 score than the general-purpose model, while using 80% fewer labeled samples. These results demonstrate the generalization capabilities of domain-specific foundation models and their potential to significantly reduce manual annotation efforts, offering a scalable and automated solution for herbicide trial analysis.
[55] Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data
Robin Spanier,Thorsten Hoeser,Claudia Kuenzer
Main category: cs.CV
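TL;DR: The paper studies training YOLOv10 models on synthetic data combined with real Sentinel-1 satellite imagery to improve offshore platform detection and geographic transferability.
Details
Motivation: The rapid expansion of marine infrastructure (offshore wind farms, oil and gas platforms, and similar) calls for effective monitoring, but existing models perform poorly when data are scarce, particularly for underrepresented classes and unusual object shapes.
Contribution: 1. Investigates how synthetic training data improves model performance; 2. Demonstrates the model's generalization under geographic transfer; 3. Detects 3,529 offshore platforms across several regions worldwide.
Method: Trains YOLOv10 models on a mix of synthetic and real Sentinel-1 imagery and evaluates geographic transfer on unseen regions (Gulf of Mexico, North Sea, Persian Gulf).
Result: The F1 score rises from 0.85 to 0.90 and 3,529 offshore platforms are detected, showing that synthetic data helps with unbalanced classes and overall performance.
Insight: Synthetic data can effectively address data imbalance in remote sensing tasks and supports globally transferable detection models.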
Abstract: The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure. This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.
[56] RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation
Xiangjun Zhang,Litong Gong,Yinglin Zheng,Yansong Liu,Wentao Jiang,Mingyi Xu,Biao Wang,Tiezheng Ge,Ming Zeng
Main category: cs.CV
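TL;DR: RISE-T2V is a new framework that merges prompt rephrasing and semantic feature extraction into a single step, improving text-to-video (T2V) generation quality and alignment with user intent.
Details
Motivation: Existing T2V diffusion models rely on pre-trained text encoders for semantic alignment, but these encoders understand concise prompts poorly and cannot rephrase prompts online to better match user intent.
Contribution: 1. Proposes the RISE-T2V framework, which integrates prompt rephrasing and semantic feature extraction; 2. Designs the Rephrasing Adapter module, which uses LLM hidden states to condition video generation; 3. Demonstrates that the framework is general and strengthens T2V models.
Method: The Rephrasing Adapter uses the hidden states produced during the LLM's next-token prediction as the condition for video generation, implicitly rephrasing the prompt and enriching semantic understanding.
Result: Experiments show that RISE-T2V applies to several video diffusion model architectures and improves both video quality and alignment with user intent.
Insight: The strong semantic understanding of LLMs can compensate for weaknesses in T2V models; combining prompt rephrasing with semantic extraction is key to the performance gain.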
Abstract: Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited textual semantics understanding. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single and seamless step instead of two separate steps. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states during the next token prediction of the LLM as a condition for video generation. By employing a Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user’s intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent. Visual results are available on the webpage at https://rise-t2v.github.io.
[57] Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset
Muhammad Annas Shaikh,Hamza Zaman,Arbaz Asif
Main category: cs.CV
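TL;DR: The paper compares nine CNN architectures on binary classification of horses and motorcycles in the VOC 2008 dataset, focusing on class imbalance and showing experimentally that data augmentation has a substantial effect.
Details
Motivation: The study explores how different CNN architectures behave on a class-imbalanced binary classification task and quantifies the effect of data augmentation on performance.
Contribution: 1) A comprehensive performance comparison of nine modern CNN architectures; 2) A substantial improvement on the imbalanced task through minority-class augmentation; 3) Evidence that ConvNeXt-Tiny is the strongest architecture on this task.
Method: 1) Use the VOC 2008 dataset, restricted to horse versus motorcycle classification; 2) Apply minority-class data augmentation; 3) Compare architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer.
Result: ConvNeXt-Tiny performs best (AP of 95.53% for horses and 89.12% for motorcycles). Data augmentation significantly improves minority-class detection, with deeper architectures benefiting most.
Insight: 1) Data augmentation is essential for handling class imbalance; 2) Architecture choice has a significant effect on performance; 3) Deeper architectures benefit more from augmentation.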
Abstract: This paper presents a comprehensive evaluation of nine convolutional neural network architectures for binary classification of horses and motorcycles in the VOC 2008 dataset. We address the significant class imbalance problem by implementing minority-class augmentation techniques. Our experiments compare modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer across multiple performance metrics. Results demonstrate substantial performance variations, with ConvNeXt-Tiny achieving the highest Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection. We observe that data augmentation significantly improves minority class detection, particularly benefiting deeper architectures. This study provides insights into architecture selection for imbalanced binary classification tasks and quantifies the impact of data augmentation strategies in mitigating class imbalance issues in object detection.
[58] Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection
Sanjay Kumar,Tim Brophy,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
Main category: cs.CV
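TL;DR: This paper studies how weather-induced sensor occlusion affects the BEVFusion architecture for 3D object detection. Camera occlusion strongly degrades camera-only detection, while LiDAR degrades sharply under heavy occlusion. In the fused setting, occluding LiDAR has the larger effect.
Details
Motivation: Automated vehicles need accurate 3D object detection in complex environments, and bird's-eye-view (BEV) representations provide strong perception by fusing multi-sensor data. The effect of sensor occlusion caused by weather or physical obstruction on detection performance, however, has not been studied thoroughly.
Contribution: 1) Quantifies the effect of camera and LiDAR occlusion on BEVFusion performance; 2) Reveals the fused model's stronger reliance on LiDAR; 3) Identifies the need for improved sensor fusion techniques and occlusion-aware evaluation methods.
Method: Evaluates the effect of camera and LiDAR occlusion on 3D detection with the BEVFusion architecture on the nuScenes dataset, using mAP and NDS as metrics.
Result: 1) Camera occlusion reduces camera-only mAP by 41.3%; 2) LiDAR mAP falls by 47.3% under heavy occlusion; 3) In the fused setting, LiDAR occlusion has the larger effect (mAP down 26.8%).
Insight: The paper highlights the key role of LiDAR in multimodal fusion and points to the need for more robust fusion techniques that tolerate partial sensor failure or environmental interference.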
Abstract: Accurate 3D object detection is essential for automated vehicles to navigate safely in complex real-world environments. Bird’s Eye View (BEV) representations, which project multi-sensor data into a top-down spatial format, have emerged as a powerful approach for robust perception. Although BEV-based fusion architectures have demonstrated strong performance through multimodal integration, the effects of sensor occlusions, caused by environmental conditions such as fog, haze, or physical obstructions, on 3D detection accuracy remain underexplored. In this work, we investigate the impact of occlusions on both camera and Light Detection and Ranging (LiDAR) outputs using the BEVFusion architecture, evaluated on the nuScenes dataset. Detection performance is measured using mean Average Precision (mAP) and the nuScenes Detection Score (NDS). Our results show that moderate camera occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is based only on the camera. On the other hand, LiDAR sharply drops in performance only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection. In fused settings, the effect depends on which sensor is occluded: occluding the camera leads to a minor 4.1% drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8% drop (to 50.1%), revealing the model’s stronger reliance on LiDAR for the task of 3D object detection. Our results highlight the need for future research into occlusion-aware evaluation methods and improved sensor fusion techniques that can maintain detection accuracy in the presence of partial sensor failure or degradation due to adverse environmental conditions.
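The percentages in this abstract are relative drops rather than percentage-point differences, which is easy to misread. The short check below recomputes them from the rounded mAP values quoted in the abstract; the scenario labels are paraphrases, not the paper's terminology.

```python
def relative_drop(before, after):
    """Relative decrease in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


# (baseline mAP, occluded mAP) pairs as quoted in the abstract, in percent.
cases = {
    "camera-only, camera occluded": (35.6, 20.9),
    "LiDAR-only, heavy occlusion": (64.7, 34.1),
    "fused, camera occluded": (68.5, 65.7),
    "fused, LiDAR occluded": (68.5, 50.1),
}

drops = {name: relative_drop(b, a) for name, (b, a) in cases.items()}
for name, value in drops.items():
    baseline, occluded = cases[name]
    print(f"{name}: {baseline - occluded:.1f} points absolute, {value:.2f}% relative")
```

Recomputed from these rounded figures, the last case comes out near 26.9% rather than the reported 26.8%; the small gap is consistent with the authors having worked from unrounded mAP values.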
[59] A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications
Puneet Mishra,Martijntje Vollebregt,Yizhou Ma,Maria Font-i-Furnols
Main category: cs.CV
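TL;DR: This tutorial paper gives analytical chemistry researchers a step-by-step guide to extracting deep features from imaging data in MATLAB by combining deep learning with chemometrics, filling a gap in structured guidance for applying existing deep learning models in the field.
Details
Motivation: Extracting and analyzing spatial information from imaging data remains difficult in analytical chemistry, and traditional chemometric methods handle it inefficiently. Deep learning has advanced image processing considerably, but its use in analytical chemistry is limited by the lack of structured implementation guidance.
Contribution: Provides a MATLAB tutorial for analytical chemistry that combines deep learning with chemometrics, showing researchers how to extract deep features with open-source deep learning models and integrate them with other data such as spectra.
Method: Through MATLAB code demonstrations, the tutorial shows how to use existing open-source deep learning models to extract deep features from several imaging modalities rather than training new models.
Result: The tutorial provides reproducible demonstration code that researchers can follow step by step to apply deep feature extraction to their own datasets.
Insight: By using existing deep learning models to extract deep features, analytical chemists can process complex imaging data more efficiently without training models from scratch.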
Abstract: Background: In analytical chemistry, spatial information about materials is commonly captured through imaging techniques, such as traditional color cameras or with advanced hyperspectral cameras and microscopes. However, efficiently extracting and analyzing this spatial information for exploratory and predictive purposes remains a challenge, especially when using traditional chemometric methods. Recent advances in deep learning and artificial intelligence have significantly enhanced image processing capabilities, enabling the extraction of multiscale deep features that are otherwise challenging to capture with conventional image processing techniques. Despite the wide availability of open-source deep learning models, adoption in analytical chemistry remains limited because of the absence of structured, step-by-step guidance for implementing these models. Results: This tutorial aims to bridge this gap by providing a step-by-step guide for applying deep learning approaches to extract spatial information from imaging data and integrating it with other data sources, such as spectral information. Importantly, the focus of this work is not on training deep learning models for image processing but on using existing open source models to extract deep features from imaging data. Significance: The tutorial provides MATLAB code demonstrations, showcasing the processing of imaging data from various imaging modalities commonly encountered in analytical chemistry. Readers must run the tutorial steps on their own datasets using the codes presented in this tutorial.
[60] Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
Itbaan Safwan,Muhammad Annas Shaikh,Muhammad Haaris,Ramail Khan,Muhammad Atif Tahir
Main category: cs.CV
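TL;DR: This paper proposes a multi-task learning framework built on a LoRA-tuned Florence-2 model for medical visual question answering (VQA), explanation generation, and visual grounding. It learns jointly from three datasets and clearly outperforms single-task baselines.
Details
Motivation: Single-task learning limits medical VQA. Multi-task learning is used to strengthen the model's visual grounding, reasoning, and explanation abilities and so produce answers that are more accurate and interpretable.
Contribution: 1. Proposes a multi-task learning framework that combines VQA, explanation generation, and visual grounding; 2. Integrates three datasets to strengthen the model's multi-task ability; 3. Shows experimentally that multi-task learning clearly outperforms single-task baselines.
Method: 1. Fine-tune the Florence-2 model with LoRA; 2. Train jointly on three tasks: VQA, explanation generation, and visual grounding; 3. Combine three datasets (Kvasir-VQA-x1, a synthetically enriched explanation dataset, and a text-to-region pair dataset).
Result: Experiments show that the approach substantially improves over single-task baselines in both answer accuracy and visual localization.
Insight: Multi-task learning offers a clear advantage for medical VQA; combining visual grounding with explanation generation improves both interpretability and accuracy.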
Abstract: We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
[61] HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats
Alan de Aguiar,Michaella Pereira Andrade,Charles Morphy D. Santos,João Paulo Gois
Main category: cs.CV
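TL;DR: HideAndSeg is an AI-based tool that combines SAM2 and YOLOv11 to automatically segment videos of octopuses in their natural habitats. Automated prompting and unsupervised metrics reduce manual intervention and improve segmentation quality.
Details
Motivation: Octopuses are hard to analyze in their natural habitats because of camouflage, rapid changes in skin texture and color, non-rigid deformation, and frequent occlusion, and large-scale annotated datasets are lacking.
Contribution: Proposes the HideAndSeg tool, which combines SAM2 and YOLOv11 for automated prompting and segmentation; designs two unsupervised metrics for assessing segmentation quality; and can re-identify and segment the octopus even after complete occlusion.
Method: The user provides initial point coordinates to generate SAM2 segmentation masks, which are used to train a YOLO model. The pipeline then generates SAM2 masks automatically from bounding-box prompts without further manual input, and refines the segmentation using temporal consistency and new component count metrics.
Result: HideAndSeg reduces segmentation noise and outperforms the manually prompted approach in scenes with complete occlusion.
Insight: Unsupervised metrics can guide segmentation refinement when no ground-truth labels are available, and the automated pipeline makes behavioral studies in the wild considerably more efficient.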
Abstract: Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses. It establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics - temporal consistency $DICE_t$ and new component count $NC_t$ - to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails. By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
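The two unsupervised metrics can be sketched under explicit assumptions, since the abstract names them but does not define them. Here $DICE_t$ is taken to be the Dice overlap between the masks of consecutive frames, and $NC_t$ the number of 4-connected components in the current mask that do not overlap the previous mask at all. Both definitions, and all function names, are illustrative guesses rather than the authors' formulation.

```python
def dice(mask_a, mask_b):
    """Dice overlap of two binary masks given as equally sized lists of 0/1 rows."""
    inter = sum(a and b for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb))
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    return 1.0 if total == 0 else 2.0 * inter / total


def components(mask):
    """4-connected foreground components, each returned as a list of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                stack, comp = [(r, c)], []
                while stack:  # iterative flood fill
                    y, x = stack.pop()
                    comp.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                found.append(comp)
    return found


def new_component_count(prev_mask, curr_mask):
    """Assumed NC_t: components of the current mask with no pixel in the previous mask."""
    return sum(
        1 for comp in components(curr_mask) if not any(prev_mask[y][x] for y, x in comp)
    )


prev = [[0, 1, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
curr = [[0, 1, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 1, 1]]
dice_t = dice(prev, curr)
nc_t = new_component_count(prev, curr)
print(dice_t, nc_t)
```

Under these assumed definitions, a sudden fall in the Dice value or a positive component count flags frames where the mask has jumped or picked up a spurious blob, which is the kind of signal that could trigger mask refinement without ground truth.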
[62] Solving Convex Partition Visual Jigsaw Puzzles
Yaniv Ohayon,Ofir Itzhak Shahar,Ohad Ben-Shahar
Main category: cs.CV
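TL;DR: The paper proposes a greedy solver for convex-partition visual jigsaw puzzles, widening the range of puzzle types that can be handled computationally. It combines geometric and pictorial compatibility and provides a new benchmark dataset.
Details
Motivation: Existing puzzle solvers mostly target square-piece puzzles, which limits their practical use. Convex partitions are a major subset of polygonal puzzles but have received little study because of their complexity.
Contribution: 1. Expands the types of puzzles handled computationally, focusing on convex partitions; 2. Proposes a greedy solver that combines geometric and pictorial compatibility; 3. Releases the first benchmark dataset for this class of puzzles.
Method: Uses geometric and pictorial compatibility within a greedy solver for convex-partition puzzles.
Result: The paper reports several performance measures and provides the new benchmark dataset.
Insight: Solving convex-partition puzzles has potential practical impact, particularly in image processing and computer vision.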
Abstract: Jigsaw puzzle solving requires the rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole, often an image, and is known to be an intractable problem. While the possible impact of automatic puzzle solvers can be disruptive in various application domains, most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions, a major subset of polygonal puzzles whose pieces are convex. We utilize both geometrical and pictorial compatibilities, introduce a greedy solver, and report several performance measures next to the first benchmark dataset of such puzzles.
[63] V-Thinker: Interactive Thinking with Images
Runqi Qiao,Qiuna Tan,Minghan Yang,Guanting Dong,Peiqing Yang,Shiqiang Lang,Enhui Wan,Xiaowan Wang,Yida Xu,Lan Yang,Chong Sun,Chen Li,Honggang Zhang
Main category: cs.CV
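TL;DR: V-Thinker is a general-purpose multimodal reasoning assistant that achieves vision-centric interactive thinking through end-to-end reinforcement learning, addressing the limits of existing large multimodal models (LMMs) in combining image interaction with long-horizon reasoning.
Details
Motivation: Existing LMMs are weak at combining image interaction with long-horizon reasoning, and limited visual tool spaces and task-specific designs constrain progress. V-Thinker aims to improve interactive visual thinking through reinforcement learning.
Contribution: 1. Proposes V-Thinker, which achieves general interactive reasoning through a Data Evolution Flywheel and a Visual Progressive Training Curriculum. 2. Introduces VTBench, an expert-verified benchmark for interactive visual reasoning.
Method: 1. The Data Evolution Flywheel automatically generates and refines diverse interactive reasoning datasets. 2. The Visual Progressive Training Curriculum combines point-level supervision with a two-stage reinforcement learning framework.
Result: V-Thinker outperforms existing LMM baselines on both general and interactive reasoning tasks.
Insight: Through synthesized data and progressive training, V-Thinker shows how LMMs can be improved on visual interactive reasoning, suggesting a direction for future research.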
Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
[64] Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability, Data Scarcity, and Band Adaptability
Wenwen Li,Sizhe Wang,Hyunho Lee,Chenyan Lu,Sujit Roy,Rahul Ramachandran,Chia-Yu Hsu
Main category: cs.CV
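TL;DR: This paper presents a framework based on geospatial foundation models (GeoFMs) that addresses sensor, label, and domain adaptation in landslide hazard mapping. Compared with conventional deep learning methods, GeoFMs cope better with spectral variation, label scarcity, and cross-region generalization.
Details
Motivation: Accurate landslide prediction and mapping is essential for disaster response, but conventional deep learning methods perform poorly across sensors, across regions, and when training data are scarce. This study aims to address these problems with GeoFMs.
Contribution: 1. Proposes a three-axis analytical framework (sensor, label, domain) for adapting GeoFMs; 2. Shows that GeoFMs outperform task-specific CNNs and vision transformers in landslide mapping; 3. Demonstrates robustness to spectral variation, label scarcity, and cross-region generalization.
Method: 1. Build on global pretraining and self-supervised learning; 2. Adapt to different tasks through adaptable fine-tuning; 3. Assess model performance through comparative experiments.
Result: GeoFMs such as Prithvi-EO-2.0 outperform conventional models such as U-Net and Segformer in landslide mapping and are more robust to spectral variation, label scarcity, and cross-region generalization.
Insight: Geospatial foundation models offer a more robust and scalable approach to hazard mapping, but computational cost and the shortage of AI-ready data remain challenges.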
Abstract: Landslides cause severe damage to lives, infrastructure, and the environment, making accurate and timely mapping essential for disaster preparedness and response. However, conventional deep learning models often struggle when applied across different sensors, regions, or under conditions of limited training data. To address these challenges, we present a three-axis analytical framework of sensor, label, and domain for adapting geospatial foundation models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a series of experiments, we show that it consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE). The model, built on global pretraining, self-supervision, and adaptable fine-tuning, proved resilient to spectral variation, maintained accuracy under label scarcity, and generalized more reliably across diverse datasets and geographic settings. Alongside these strengths, we also highlight remaining challenges such as computational cost and the limited availability of reusable AI-ready training data for landslide research. Overall, our study positions GeoFMs as a step toward more robust and scalable approaches for landslide risk reduction and environmental monitoring.
[65] THEval. Evaluation Framework for Talking Head Video Generation
Nabyl Quignon,Baptiste Chopin,Yaohui Wang,Antitza Dantcheva
Main category: cs.CV
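TL;DR: The paper proposes THEval, a new evaluation framework for talking head video generation that covers three dimensions (quality, naturalness, and synchronization) with 8 efficient metrics aligned with human preferences, addressing gaps in current evaluation metrics.
Details
Motivation: Current evaluation of talking head video generation relies mainly on a limited set of metrics (such as video quality and lip synchronization) and on user studies, which do not fully reflect how generated videos perform. The authors aim to design a more comprehensive evaluation framework.
Contribution: Proposes the THEval evaluation framework with 8 metrics covering quality, naturalness, and synchronization, and introduces a new real dataset to avoid training data bias.
Method: Designs efficient metrics by analyzing fine-grained dynamics of the head, mouth, and eyebrows together with face quality. Experiments on 85,000 videos generated by 17 state-of-the-art models validate the framework.
Result: Experiments show that although many models do well on lip synchronization, they still struggle to generate expressive and artifact-free details.
Insight: Existing algorithms still have room to improve in facial expression and detail generation; an efficient and comprehensive evaluation framework is important for advancing the field.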
Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on a limited set of metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we focus on analyzing fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset that we curated in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.
[66] Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy
Dimitrios Anastasiou,Santiago Barbarisi,Lucy Culshaw,Jayna Patel,Evangelos B. Mazomenos,Imanol Luengo,Danail Stoyanov
Main category: cs.CV
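TL;DR: STC-Net is a novel single-timestamp-based framework for complexity estimation in laparoscopic cholecystectomy (LC) that grades full videos on the Parkland Grading Scale (PGS) under weak temporal supervision.
Details
Motivation: Accurate assessment of surgical complexity in LC matters for post-operative analysis and training. The PGS is clinically validated, but its automation remains largely unexplored, particularly for untrimmed full videos. Contribution: Proposes STC-Net, which performs PGS grading of full videos under weak supervision by combining temporal localization and grading modules, together with a novel loss formulation.
Method: Jointly performs temporal localization and grading through localization, window proposal, and grading modules, with a loss that combines hard and soft localization objectives and background-aware grading supervision.
Result: On a private dataset of 1,859 LC videos, STC-Net reaches 62.11% accuracy and a 61.42% F1-score, outperforming non-localized baselines by over 10% on both metrics.
Insight: Weak supervision is effective for surgical complexity assessment, and STC-Net is promising for post-operative analysis and surgical training.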
Abstract: Purpose: Accurate assessment of surgical complexity is essential in Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with longer operative times and increased risk of postoperative complications. The Parkland Grading Scale (PGS) provides a clinically validated framework for stratifying inflammation severity; however, its automation in surgical videos remains largely unexplored, particularly in realistic scenarios where complete videos must be analyzed without prior manual curation. Methods: In this work, we introduce STC-Net, a novel framework for Single-Timestamp-based Complexity estimation in LC via the PGS, designed to operate under weak temporal supervision. Unlike prior methods limited to static images or manually trimmed clips, STC-Net operates directly on full videos. It jointly performs temporal localization and grading through a localization, window proposal, and grading module. We introduce a novel loss formulation combining hard and soft localization objectives and background-aware grading supervision. Results: Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by over 10% in both metrics and highlighting the effectiveness of weak supervision for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity estimation from full LC videos, making it promising for post-operative analysis and surgical training.
[67] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Jingqi Tong,Yurong Mou,Hangcheng Li,Mingzhe Li,Yongzhuo Yang,Ming Zhang,Qiguang Chen,Tianyi Liang,Xiaomeng Hu,Yining Zheng,Xinchi Chen,Jun Zhao,Xuanjing Huang,Xipeng Qiu
Main category: cs.CV
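TL;DR: The paper introduces “Thinking with Video”, a new paradigm that uses video generation models such as Sora-2 to bridge visual and textual reasoning and overcome limitations of text- and image-based reasoning. Evaluated on VideoThinkBench, Sora-2 performs strongly on both vision-centric and text-centric tasks, suggesting that video generation models could serve as unified multimodal understanding and generation models.
Details
Motivation: The “Thinking with Text” and “Thinking with Images” paradigms cannot capture dynamic processes, and they treat text and vision as separate modalities. The paper proposes “Thinking with Video” to address these limitations with video generation models. Contribution: 1. proposes the “Thinking with Video” paradigm; 2. develops the VideoThinkBench benchmark; 3. shows that Sora-2 is a capable reasoner on multimodal tasks; 4. analyzes the source of these abilities and ways to improve performance.
Method: Uses video generation models such as Sora-2 as a unified reasoning tool, evaluates them on VideoThinkBench (vision-centric and text-centric tasks), and improves performance with self-consistency and in-context learning.
Result: On vision-centric tasks, Sora-2 is generally comparable to SOTA VLMs and surpasses them on several tasks such as Eyeballing Games; on text-centric tasks, it reaches 92% accuracy on MATH and 75.53% on MMMU.
Insight: Video generation models are a potential unified model for multimodal understanding and generation, with temporal dynamics compensating for the limits of static images.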
Abstract: The “Thinking with Text” and “Thinking with Images” paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce “Thinking with Video”, a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning “thinking with video” as a unified multimodal reasoning paradigm.
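The self-consistency mentioned in the abstract can be illustrated as majority voting over several sampled answers. The following is a minimal sketch, assuming the final answer has already been extracted from each generated sample as a string; the function name `majority_vote` and the sample answers are hypothetical and not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled generations.

    Ties are broken in favor of the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical answers extracted from five independently sampled generations.
samples = ["42", "41", "42", "42", "40"]
print(majority_vote(samples))
```

Breaking ties by first appearance keeps the result deterministic; other tie-breaking rules would be equally reasonable.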
[68] UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
Chen Shi,Shaoshuai Shi,Xiaoyang Lyu,Chunyang Liu,Kehua Sheng,Bo Zhang,Li Jiang
Main category: cs.CV
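TL;DR: UniSplat is a unified feed-forward framework for dynamic driving scene reconstruction that fuses spatio-temporal information in a 3D latent scaffold, addressing sparse, non-overlapping camera views and complex scene dynamics.
Details
Motivation: Existing methods struggle to reconstruct scenes with sparse, non-overlapping views and complex dynamics; UniSplat targets these challenges. Contribution: 1) construction of a 3D latent scaffold; 2) an efficient spatio-temporal fusion mechanism; 3) a dual-branch decoder; 4) a persistent memory of static Gaussians that supports streaming scene completion.
Method: UniSplat builds the 3D latent scaffold using pretrained foundation models and fuses information directly within the scaffold for spatio-temporal alignment. A dual-branch decoder combines point-anchored refinement with voxel-based generation to produce dynamic-aware Gaussians.
Result: Experiments on real-world datasets show state-of-the-art novel view synthesis, with robust, high-quality renderings even for viewpoints outside the original camera coverage.
Insight: A unified latent scaffold with efficient fusion yields robust, accurate dynamic scene reconstruction and underlines the importance of spatio-temporal modeling.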
Abstract: Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.
[69] PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
Yicheng Xiao,Yu Chen,Haoxuan Ma,Jiale Hong,Caorui Li,Lingxiang Wu,Haiyun Guo,Jinqiao Wang
Main category: cs.CV
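TL;DR: PixCLIP improves the fine-grained vision-language understanding of CLIP through pixel-level alignment learning and long-text processing.
Details
Motivation: CLIP performs well on multimodal tasks, but its fine-grained image-text alignment can still be improved. Most existing work increases the granularity of visual processing, while the token length limit of CLIP’s text encoder restricts its handling of long text. Contribution: Proposes the PixCLIP framework, which accommodates visual prompts together with long textual descriptions; builds the LongGRIT dataset (nearly 1.5 million samples); designs a three-branch pixel-text alignment learning framework.
Method: Generates pixel-level localized long-form descriptions with an automated annotation pipeline, replaces CLIP’s text encoder with an LLM, and trains with the three-branch alignment framework.
Result: PixCLIP reports breakthroughs in pixel-level interaction and long-text handling, achieving state-of-the-art performance.
Insight: Jointly refining the granularity of visual and textual processing is key to improving multimodal models.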
Abstract: While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model’s fine-grained vision-language alignment. However, the inherent token length limitation of CLIP’s text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP’s original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.
[70] NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment
Kylie Cancilla,Alexander Moore,Amar Saini,Carmen Carrano
Main category: cs.CV
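TL;DR: NovisVQ is a streaming convolutional neural network for no-reference, opinion-unaware video quality assessment (VQA). Trained on synthetic degradations of the DAVIS dataset, it predicts FR metrics (LPIPS, PSNR, SSIM) directly, with no reference video at inference. Its temporal modeling outperforms an image-based baseline, and it correlates with FR metrics better than BRISQUE.
Details
Motivation: Existing VQA approaches are limited: FR metrics need reference videos, and most NR methods depend on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based and ignore the temporal context that matters for video tasks. Contribution: Proposes a no-reference, opinion-unaware streaming VQA model, built on a temporal-aware architecture and trained on synthetically degraded data.
Method: Trains a convolutional network on synthetically degraded DAVIS videos to predict FR metrics directly. The streaming architecture captures temporal context and needs no reference video as input.
Result: The model outperforms the image-based baseline across diverse degradations and achieves higher correlation with FR metrics than BRISQUE, supporting the value of temporal modeling.
Insight: Temporal modeling matters for video quality assessment, and no-reference methods trained on synthetic data can stand in for approaches that depend on human opinion labels in practical vision systems.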
Abstract: Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely-used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
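PSNR is one of the full-reference metrics the model is trained to predict. As an illustration of how such a training target could be computed from a clean frame and its synthetically degraded copy, here is a minimal sketch using the standard PSNR definition. It assumes frames are given as flat sequences of 8-bit pixel intensities; the function name and this representation are illustrative choices, not details from the paper.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel intensities (an assumption made for
    this sketch). Identical frames give infinite PSNR.
    """
    if len(reference) != len(degraded) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel frame and a slightly degraded copy.
clean = [100, 100, 100, 100]
noisy = [110, 90, 110, 90]
print(psnr(clean, noisy))
```

At inference, a no-reference model of the kind described would estimate this value from the degraded frames alone, since the clean frame is not available.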
[71] Polarization-resolved imaging improves eye tracking
Mantas Žurauskas,Tom Bu,Sanaz Alali,Beyza Kalkanli,Derek Shi,Fernando Alamos,Gauresh Pandit,Christopher Mei,Ali Behrooz,Ramin Mirjalili,Dave Stronks,Alexander Fix,Dmitri Model
Main category: cs.CV
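TL;DR: The paper presents polarization-enabled eye tracking (PET) based on polarization-resolved near-infrared imaging, which measures the polarization state as well as the intensity of light reflected by ocular tissues and improves tracking performance.
Details
Motivation: Conventional eye tracking relies on intensity information alone and is affected by eyelid occlusion, pupil-size variation, and similar factors, which limits its robustness and accuracy. Contribution: Proposes a PET system that pairs a polarization-filter-array camera with a linearly polarized near-infrared illuminator, revealing additional trackable features on the sclera and cornea.
Method: Captures optical contrast from ocular tissues with polarization-resolved imaging and trains convolutional neural network (CNN) models on the resulting data.
Result: Across 346 participants, PET reduced the median 95th-percentile absolute gaze error by 10–16% relative to capacity-matched intensity baselines, under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation.
Insight: Polarization-resolved imaging adds optical contrast for eye tracking, improving robustness and accuracy, and is a candidate sensing modality for future wearable devices.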
Abstract: Polarization-resolved near-infrared imaging adds a useful optical contrast mechanism to eye tracking by measuring the polarization state of light reflected by ocular tissues in addition to its intensity. In this paper we demonstrate how this contrast can be used to enable eye tracking. Specifically, we demonstrate that a polarization-enabled eye tracking (PET) system composed of a polarization–filter–array camera paired with a linearly polarized near-infrared illuminator can reveal trackable features across the sclera and gaze-informative patterns on the cornea, largely absent in intensity-only images. Across a cohort of 346 participants, convolutional neural network based machine learning models trained on data from PET reduced the median 95th-percentile absolute gaze error by 10–16% relative to capacity-matched intensity baselines under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation. These results link light–tissue polarization effects to practical gains in human–computer interaction and position PET as a simple, robust sensing modality for future wearable devices.
[72] Benchmark Designers Should “Train on the Test Set” to Expose Exploitable Non-Visual Shortcuts
Ellis Brown,Jihan Yang,Shusheng Yang,Rob Fergus,Saining Xie
Main category: cs.CV
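TL;DR: The paper proposes diagnostic and debiasing procedures that expose and reduce non-visual shortcuts in benchmarks for Multimodal Large Language Models (MLLMs), improving benchmark design by “training on the test set” and applying Iterative Bias Pruning.
Details
Motivation: Models can pass current multimodal benchmarks through non-visual shortcuts such as linguistic priors and superficial patterns, which makes evaluation unreliable. The paper aims to help benchmark designers find and remove these shortcuts up front. Contribution: 1. proposes the Test-set Stress-Test (TsT) methodology to expose non-visual biases in a benchmark; 2. designs the Iterative Bias Pruning (IBP) procedure to filter high-bias samples; 3. applies the framework to several benchmarks (e.g., VSI-Bench) and demonstrates the debiasing effect.
Method: 1. fine-tunes an LLM via k-fold cross-validation on only the non-visual, textual inputs of the test set to assign each sample a bias score; 2. adds a lightweight Random Forest-based diagnostic for fast auditing; 3. filters high-bias samples with IBP.
Result: Pervasive non-visual biases are found across four benchmarks, and the debiased VSI-Bench-Debiased shows reduced non-visual solvability and a wider vision-blind performance gap than the original.
Insight: Benchmark designers should try to game their own test sets first, using diagnosis and debiasing to make evaluation more robust; exposing non-visual shortcuts is essential for assessing what multimodal models can actually do.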
Abstract: Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to “game” their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly “training on the test set” – probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a “Test-set Stress-Test” (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an “Iterative Bias Pruning” (IBP) procedure. Applying this framework to four benchmarks – VSI-Bench, CV-Bench, MMMU, and VideoMME – we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
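The two components described in the abstract, cross-validated bias scoring and iterative pruning, can be sketched in a few lines. This is only a toy illustration under stated assumptions: the fine-tuned LLM and Random Forest diagnostics of the paper are replaced here by a trivial blind predictor that guesses the most frequent answer per question type, each sample is reduced to a hypothetical `(question_type, answer)` pair, and the stopping rule (`target`, `batch`) is invented for the example.

```python
from collections import Counter, defaultdict


def blind_bias_scores(samples, k=3):
    """k-fold cross-validated scores of a text-only 'blind' predictor.

    samples: list of (question_type, answer) pairs; no visual input is used.
    Returns one score per sample: 1.0 if the held-out sample is answered
    correctly from text-side statistics alone, else 0.0.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        # Fit on the other folds: most frequent answer per question type.
        by_type = defaultdict(Counter)
        for i, (qtype, ans) in enumerate(samples):
            if i % k != fold:
                by_type[qtype][ans] += 1
        # Score the held-out fold.
        for i, (qtype, ans) in enumerate(samples):
            if i % k == fold and by_type[qtype]:
                guess = by_type[qtype].most_common(1)[0][0]
                scores[i] = 1.0 if guess == ans else 0.0
    return scores


def iterative_bias_pruning(samples, k=3, target=0.5, batch=1, max_rounds=100):
    """Drop the highest-bias samples until blind accuracy is <= target."""
    kept = list(samples)
    for _ in range(max_rounds):
        if len(kept) <= k:
            break  # too few samples left to cross-validate
        scores = blind_bias_scores(kept, k)
        if sum(scores) / len(scores) <= target:
            break
        order = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)
        drop = set(order[:batch])
        kept = [s for i, s in enumerate(kept) if i not in drop]
    return kept
```

Re-scoring after every pruning round matters because removing samples changes the statistics the blind predictor can exploit, so scores computed once up front would go stale.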
[73] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis Brown,Arijit Ray,Ranjay Krishna,Ross Girshick,Rob Fergus,Saining Xie
Main category: cs.CV
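TL;DR: The paper presents SIMS-V, a framework that uses 3D simulators to generate spatially rich video training data. Systematic ablations over question types, mixes, and scales identify the three most effective question categories, which substantially improve the spatial reasoning of multimodal language models.
Details
Motivation: Multimodal language models handle high-level video comprehension well but struggle with spatial reasoning across time and space, and approaches that rely on real-world video face a bottleneck in obtaining diverse footage with precise spatial annotations. Contribution: 1. proposes the SIMS-V framework, which generates spatially rich video training data from 3D simulators; 2. identifies the three most effective question categories through systematic ablations; 3. obtains a strong spatial reasoning model from a small amount of simulated data.
Method: 1. generates simulated video data with spatial annotations from 3D simulators; 2. systematically analyzes the effect of question types, mixes, and scales; 3. trains on three key question categories (metric measurement, perspective-dependent reasoning, and temporal tracking).
Result: A 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the 72B baseline and is competitive on real-world spatial reasoning benchmarks, while maintaining general video understanding.
Insight: Simulated data can improve spatial reasoning efficiently; a small set of key question categories beats comprehensive coverage, and simulated training transfers well to real-world tasks.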
Abstract: Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V – a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
[74] Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang,Jihan Yang,Pinzhi Huang,Ellis Brown,Zihao Yang,Yue Yu,Shengbang Tong,Zihan Zheng,Yifan Xu,Muhan Wang,Daohan Lu,Rob Fergus,Yann LeCun,Li Fei-Fei,Saining Xie
Main category: cs.CV
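TL;DR: The paper proposes spatial supersensing as a new paradigm for multimodal intelligence and introduces the VSI-SUPER benchmark and the Cambrian-S model to drive research in this direction.
Details
Motivation: Current multimodal systems are largely reactive, task-driven, and reliant on brute-force long context, and they lack integrated abilities for continuous events, 3D spatial cognition, and world modeling. A broader perception paradigm, supersensing, is therefore needed. Contribution: 1. Frames spatial supersensing as four stages (semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling). 2. Presents the VSI-SUPER benchmark (VSR and VSC tasks), which challenges long-horizon and spatial cognition abilities. 3. Trains Cambrian-S, achieving a +30% absolute improvement on VSI-Bench while revealing its limits on spatial supersensing. 4. Proposes predictive sensing and demonstrates it in a proof-of-concept with a self-supervised next-latent-frame predictor.
Method: 1. Evaluates long-video understanding and spatial cognition with VSI-SUPER. 2. Curates VSI-590K and trains Cambrian-S to test data scaling limits, finding that scale alone does not solve spatial supersensing. 3. Proposes predictive sensing, in which a self-supervised next-latent-frame predictor uses surprise (prediction error) to drive memory and event segmentation.
Result: Cambrian-S achieves a +30% absolute improvement on VSI-Bench, but its performance on VSI-SUPER remains limited. On VSI-SUPER, predictive sensing substantially outperforms leading proprietary baselines.
Insight: 1. Spatial supersensing goes beyond simple visual tasks and requires continuous event understanding and prediction. 2. Data scaling improves some performance but is insufficient on its own. 3. Predictive sensing gives models a way to actively select and organize information.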
Abstract: We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
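The idea of using surprise (prediction error) to drive event segmentation can be sketched as follows. This is a toy illustration, not the paper's implementation: latent frames are plain lists of floats, `predict_next` is a hypothetical stand-in for a learned next-latent-frame predictor, and the mean squared error and fixed threshold are assumptions made for the example.

```python
def segment_by_surprise(latents, predict_next, threshold):
    """Return indices where a new event starts in a stream of latent frames.

    A boundary is placed at frame t when the prediction made from frame t-1
    misses frame t by more than `threshold` (mean squared error).
    """
    if not latents:
        return []
    boundaries = [0]  # the first frame always opens an event
    for t in range(1, len(latents)):
        predicted = predict_next(latents[t - 1])
        surprise = sum(
            (p - x) ** 2 for p, x in zip(predicted, latents[t])
        ) / len(predicted)
        if surprise > threshold:
            boundaries.append(t)
    return boundaries


# Stand-in predictor: assume the next latent equals the current one.
identity = lambda z: z

stream = [[0.0], [0.1], [0.1], [5.0], [5.1]]
print(segment_by_surprise(stream, identity, threshold=1.0))
```

In this toy stream the only large prediction error occurs at the jump to the fourth frame, so that is where a second event begins; the same surprise signal could also decide which frames are worth keeping in memory.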
[75] InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation
Jinlai Liu,Jian Han,Bin Yan,Hui Wu,Fengda Zhu,Xing Wang,Yi Jiang,Bingyue Peng,Zehuan Yuan
Main category: cs.CV
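TL;DR: InfinityStar is a unified spacetime autoregressive framework for high-resolution image and dynamic video generation that jointly models spatial and temporal dependencies and performs strongly across generation tasks.
Details
Motivation: Building on the success of autoregressive modeling in vision and language, the work aims at a unified framework for efficient, high-quality video generation, addressing the resolution and efficiency limits of existing methods. Contribution: Presents, to the authors’ knowledge, the first discrete autoregressive video generator capable of producing industrial-level 720p videos, unifying image and video generation while improving generation speed and quality.
Method: A purely discrete autoregressive approach jointly models spatial and temporal dependencies in a single architecture and supports text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression.
Result: Scores 83.74 on VBench, outperforming all autoregressive models and some diffusion competitors such as HunyuanVideo, and generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods.
Insight: A unified spacetime autoregressive design simplifies multi-task generation and offers a new route to efficient, high-quality video generation.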
Abstract: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.
[76] Tracking and Understanding Object Transformations
Yihong Sun,Xinyu Yang,Jennifer J. Sun,Bharath Hariharan
Main category: cs.CV
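TL;DR: The paper introduces the Track Any State task for tracking objects through state transformations, along with the TubeletGraph system and the new VOST-TAS dataset. Using semantic and proximity priors, the system recovers missing objects and generates a state graph, achieving SOTA performance.
Details
Motivation: Real-world objects often undergo state transformations (e.g., an apple being cut into pieces), and existing trackers tend to lose the target when its appearance changes significantly. The paper therefore proposes the task of tracking and understanding object state changes. Contribution: 1. proposes the Track Any State task and the VOST-TAS dataset; 2. presents TubeletGraph, a zero-shot system that recovers missing objects and generates a state graph; 3. achieves SOTA tracking performance under transformations.
Method: TubeletGraph identifies potentially overlooked tracks, decides whether to integrate them based on semantic and proximity priors, and then reasons about the added tracks to generate a state graph describing each transformation.
Result: TubeletGraph achieves SOTA tracking performance under object transformations and shows promise in temporal grounding and semantic reasoning.
Insight: A graph representation combined with semantic and proximity priors can capture complex object state changes and offers a new approach for zero-shot settings.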
Abstract: Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.
eess.SY [Back]
[77] A convolutional neural network deep learning method for model class selection
Marios Impraimakis
Main category: eess.SY
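TL;DR: The paper proposes a convolutional neural network (CNN) deep learning method for model class selection that needs neither system input information nor full system identification, targeting structural health monitoring.
Details
Motivation: Conventional model class selection typically requires system input information or full system identification, which may be impractical in some applications. The author therefore proposes a response-only deep learning method. Contribution: A response-only model class selection method based on a one-dimensional CNN, shown to work on linear and nonlinear dynamic systems and on a 3D building finite element model, with an optional physics-based enhancement using the Kalman filter.
Method: A one-dimensional CNN is trained and validated on responses from a single degree of freedom together with their class information; optionally, a Kalman filter fuses the response signals using the kinematic constraints of the acceleration and displacement data.
Result: The method selects the model class even under slight signal variations attributed to damping or hysteresis behavior, and it applies to complex systems.
Insight: The method shows the potential of deep learning for structural health monitoring, particularly when full system information is unavailable.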
Abstract: The response-only model class selection capability of a novel deep convolutional neural network method is examined herein in a simple, yet effective, manner. Specifically, the responses from a unique degree of freedom along with their class information train and validate a one-dimensional convolutional neural network. In doing so, the network selects the model class of new and unlabeled signals without the need of the system input information, or full system identification. An optional physics-based algorithm enhancement is also examined using the Kalman filter to fuse the system response signals using the kinematics constraints of the acceleration and displacement data. Importantly, the method is shown to select the model class in slight signal variations attributed to the damping behavior or hysteresis behavior on both linear and nonlinear dynamic systems, as well as on a 3D building finite element model, providing a powerful tool for structural health monitoring applications.
eess.IV [Back]
[78] Shape Deformation Networks for Automated Aortic Valve Finite Element Meshing from 3D CT Images
Linchen Qian,Jiasong Chen,Ruonan Gong,Wei Sun,Minliang Liu,Liang Liang
Main category: eess.IV
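TL;DR: The paper proposes a template-fitting pipeline based on deep neural networks that generates structured quadrilateral meshes of the aortic valve from 3D CT images, improving mesh quality and consistency.
Details
Motivation: Traditional aortic valve geometry modeling produces triangular meshes with irregular topologies, which leads to poorly shaped elements and inconsistent correspondence across patients and complicates biomechanical analysis and preoperative planning. Contribution: A template-fitting pipeline in which deep neural networks generate structured quad meshes, ensuring consistent mesh topology across patients and high-quality element shapes.
Method: Remeshes the aortic valves of all patients with a common quad mesh template, which simplifies the learning objective to a loss with only two terms: geometry reconstruction and smoothness regularization.
Result: Experiments show that the resulting aortic valve surface meshes have improved smoothness and shape quality while requiring fewer explicit regularization terms than traditional methods.
Insight: Using structured quad meshes for both the template and training ensures mesh correspondence and quality and also simplifies training, improving modeling efficiency and effectiveness.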
Abstract: Accurate geometric modeling of the aortic valve from 3D CT images is essential for biomechanical analysis and patient-specific simulations to assess valve health or make a preoperative plan. However, it remains challenging to generate aortic valve meshes with both high quality and consistency across different patients. Traditional approaches often produce triangular meshes with irregular topologies, which can result in poorly shaped elements and inconsistent correspondence due to inter-patient anatomical variation. In this work, we address these challenges by introducing a template-fitting pipeline with deep neural networks to generate structured quad (i.e., quadrilateral) meshes from 3D CT images to represent aortic valve geometries. By remeshing aortic valves of all patients with a common quad mesh template, we ensure a uniform mesh topology with consistent node-to-node and element-to-element correspondence across patients. This consistency enables us to simplify the learning objective of the deep neural networks, by employing a loss function with only two terms (i.e., a geometry reconstruction term and a smoothness regularization term), which is sufficient to preserve mesh smoothness and element quality. Our experiments demonstrate that the proposed approach produces high-quality aortic valve surface meshes with improved smoothness and shape quality, while requiring fewer explicit regularization terms compared to the traditional methods. These results highlight that using structured quad meshes for the template and neural network training not only ensures mesh correspondence and quality but also simplifies the training process, thus enhancing the effectiveness and efficiency of aortic valve modeling.
cs.CR [Back]
[79] Black-Box Guardrail Reverse-engineering Attack
Hongwei Yao,Yun Xia,Shuo Shao,Haoran Shi,Tong Qiao,Cong Wang
Main category: cs.CR
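TL;DR: The paper presents the first study of black-box reverse-engineering attacks on large language model (LLM) guardrails. It proposes GRA, a reinforcement learning-based framework that uses genetic algorithm-driven data augmentation to approximate the decision policy of a target guardrail with high fidelity.
Details
Motivation: As LLMs increasingly adopt guardrails to constrain their outputs, these mechanisms expose a new class of vulnerabilities. The study aims to reveal their decision logic through reverse engineering and thereby assess the security risk. Contribution: Proposes GRA, the first black-box LLM guardrail reverse-engineering attack, which combines reinforcement learning with genetic algorithms to extract guardrail behavior at low cost and high fidelity.
Method: GRA is a reinforcement learning-based framework that iteratively collects input-output pairs, prioritizes divergence cases, and applies genetic algorithm-driven mutations and crossovers to converge toward a surrogate of the target guardrail.
Result: On three commercial systems (ChatGPT, DeepSeek, and Qwen3), GRA achieves a rule matching rate above 0.92 at under $85 in API costs.
Insight: The study exposes critical vulnerabilities in current LLM guardrail designs and underscores the need for more robust defenses in deployment.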
Abstract: Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3, and demonstrate that it achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs. These findings underscore the practical feasibility of guardrail extraction and highlight significant security risks for current LLM safety mechanisms. Our findings expose critical vulnerabilities in current guardrail designs and highlight the urgent need for more robust defense mechanisms in LLM deployment.
cs.MA [Back]
[80] Multi-Agent Collaborative Framework For Math Problem Generation
Kia Karbasi,Kevin Hong,Mohammad Amin Samadi,Gregory Pottie
Main category: cs.MA
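TL;DR: The paper proposes a collaborative multi-agent framework for generating math problems that adds iterative refinement at inference time to balance problem complexity and cognitive demand, improving the quality of automatically generated questions.
Details
Motivation: Pre-trained language models struggle to control problem complexity and cognitive demand precisely when generating math questions, which limits the quality and usefulness of educational content. Contribution: A novel collaborative multi-agent framework in which several agents iteratively refine generated questions to control complexity and cognitive demand.
Method: Multiple agents collaboratively generate and iteratively refine question-answer pairs, which are evaluated on five criteria (relevance, importance, clarity, difficulty matching, and answerability).
Result: Preliminary evaluations show that the framework achieves a more nuanced balance between cognitive challenge and clarity, raising the quality of generated educational content.
Insight: Collaborative multi-agent workflows can give finer control over automatically generated educational content and may help advance adaptive learning environments.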
Abstract: Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre-trained transformer-based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi-agent framework as a novel method of incorporating inference-time computation into AQG. This approach leverages multiple agents that iteratively refine generated question-answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta-evaluation criteria (relevance, importance, clarity, difficulty matching, and answerability) to assess the system’s ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi-agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.
cs.SD [Back]
[81] MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation
Shih-Lun Wu,Yoon Kim,Cheng-Zhi Anna Huang
Main category: cs.SD
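TL;DR: MIDI-LLM adapts a large language model (LLM) to generate multitrack MIDI music from free-form text prompts, using vocabulary expansion and a two-stage training recipe, and outperforms the Text2midi model in quality and speed.
Details
Motivation: Existing text-to-music models fall short in controllability and inference speed; MIDI-LLM addresses this by extending an LLM while preserving its parameter structure. Contribution: Expands the LLM vocabulary to include MIDI tokens, introduces a two-stage training recipe, and uses the vLLM library directly for fast inference.
Method: Expands the text LLM’s vocabulary with MIDI tokens and applies a two-stage training recipe; because the original LLM’s parameter structure is preserved, the vLLM library can be used directly for accelerated inference.
Result: Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference than Text2midi.
Insight: Extending an LLM while preserving its original structure lets multimodal tasks reuse existing language model infrastructure efficiently.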
Abstract: We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM’s vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM’s parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.
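Vocabulary expansion, as mentioned in the abstract above, can be pictured as appending new token ids after the existing text ids so that the original ids (and the embedding rows tied to them) stay untouched. The following is a minimal sketch under that assumption; the token strings and the `expand_vocab` helper are hypothetical and are not taken from MIDI-LLM.

```python
def expand_vocab(vocab, new_tokens):
    """Return a copy of `vocab` with unseen tokens appended at fresh ids.

    Existing token ids are never changed, so anything indexed by the old
    ids (for example embedding rows) remains valid.
    """
    expanded = dict(vocab)
    next_id = max(expanded.values(), default=-1) + 1
    for token in new_tokens:
        if token not in expanded:
            expanded[token] = next_id
            next_id += 1
    return expanded


# Toy text vocabulary (hypothetical).
text_vocab = {"<pad>": 0, "a": 1, "jazz": 2, "tune": 3}
# Hypothetical MIDI event tokens; the real token format is not given in the abstract.
midi_tokens = ["<note_on_60>", "<note_off_60>", "<time_shift_10>"]

vocab = expand_vocab(text_vocab, midi_tokens)
# Number of embedding rows that would need to be added for the new tokens.
new_rows = len(vocab) - len(text_vocab)
```

In a real model the embedding and output matrices would also have to grow by `new_rows` rows; how those rows are initialized and trained is outside the scope of this sketch.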
cs.LG [Back]
[82] RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods
Raghav Sharma,Manan Mehta,Sai Tiger Raina
Main category: cs.LG
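TL;DR: This survey synthesizes recent progress in Reinforcement Learning from Human Feedback (RLHF) across multimodal alignment, cultural fairness, and low-latency optimization, and outlines open challenges for future research.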
Details
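Motivation: RLHF has been applied mainly to text, leaving gaps in multimodal alignment, cultural fairness, and low-latency optimization. The survey aims to address these gaps and to guide the building of more robust, efficient, and equitable AI systems.
Contribution: (1) Surveys recent RLHF methods for multimodal alignment, cultural fairness, and low-latency optimization; (2) reviews foundational algorithms including PPO, DPO, and GRPO; (3) outlines open challenges and future research directions.
Method: The paper first reviews foundational algorithms, including PPO, DPO, and GRPO, then analyzes in detail the latest techniques for multimodal alignment, cultural fairness, and low-latency optimization, and closes with a comparative synthesis.
Result: The synthesis charts the progress and shortcomings of current RLHF research in these emerging areas and identifies the problems that remain open.
Insight: Multimodality and cultural fairness are emerging directions for RLHF research; future AI systems need further work in these areas to improve robustness and fairness.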
Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning Large Language Models (LLMs), yet recent progress has moved beyond canonical text-based methods. This survey synthesizes the new frontier of alignment research by addressing critical gaps in multi-modal alignment, cultural fairness, and low-latency optimization. To systematically explore these domains, we first review foundational algorithms, including PPO, DPO, and GRPO, before presenting a detailed analysis of the latest innovations. By providing a comparative synthesis of these techniques and outlining open challenges, this work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems.
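For readers unfamiliar with the algorithms named above, the DPO objective is usually written as follows. This is the commonly cited form, not a formula quoted from the survey; the symbols $\pi_\theta$ (policy being trained), $\pi_{\mathrm{ref}}$ (frozen reference policy), $\beta$ (temperature), and $(x, y_w, y_l)$ (prompt, preferred response, rejected response) are the conventional ones.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $\sigma$ is the logistic function. The loss decreases as the policy raises the likelihood of the preferred response relative to the reference more than it raises that of the rejected one, which is why no separate reward model or on-policy sampling loop is needed.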
[83] The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity
Tim Tomov,Dominik Fuchsgruber,Tom Wollschläger,Stephan Günnemann
Main category: cs.LG
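TL;DR: This paper shows that current uncertainty quantification (UQ) methods for large language models (LLMs) degrade to close-to-random performance on ambiguous questions, introduces the first ambiguous question-answering datasets with ground-truth answer distributions, and identifies theoretical limitations of existing methods.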
Details
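Motivation: Real-world language is inherently ambiguous, yet existing UQ methods are typically benchmarked on tasks with no ambiguity, so their behavior in realistic settings is poorly understood.
Contribution: (1) Introduces MAQA* and AmbigQA*, the first ambiguous QA datasets equipped with ground-truth answer distributions; (2) shows that existing UQ methods degrade under ambiguity; (3) explains the limitations of these methods theoretically.
Method: Builds ambiguous QA datasets and evaluates uncertainty estimators from three paradigms: the predictive distribution, internal representations throughout the model, and an ensemble of models.
Result: Existing UQ methods perform close to random on ambiguous data, and the theoretical analysis shows that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity.
Insight: Current UQ modeling paradigms for LLMs need rethinking to handle the ambiguity of real-world language.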
Abstract: Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.
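One way to see why ambiguity is hard for estimators built on the predictive distribution is a two-answer toy case. The sketch below is an illustration constructed for this digest, not the paper's analysis: the same predicted distribution receives the same entropy score whether the question is genuinely ambiguous (the model is right to be split) or unambiguous (the model is simply unsure). All distributions are made-up toy values.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl_divergence(p_true, p_model):
    """KL(p_true || p_model); assumes p_model > 0 wherever p_true > 0."""
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_model) if p > 0)


# The model splits its probability evenly between two candidate answers.
model_pred = [0.5, 0.5]

# Case A: the question is ambiguous, both answers are equally valid.
ambiguous_truth = [0.5, 0.5]
# Case B: the question has a single correct answer.
unambiguous_truth = [1.0, 0.0]

# The entropy-based uncertainty score is identical in both cases...
score = entropy(model_pred)
# ...but the model's actual error differs: zero in case A, ln(2) in case B.
err_ambiguous = kl_divergence(ambiguous_truth, model_pred)
err_unambiguous = kl_divergence(unambiguous_truth, model_pred)
```

Because `score` cannot separate the two cases, an entropy threshold that flags case B as unreliable will also flag the perfectly calibrated prediction in case A.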
[84] What’s in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
Candace Ross,Florian Bordes,Adina Williams,Polina Kirichenko,Mark Ibrahim
Main category: cs.LG
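TL;DR: The paper introduces Common-O, a new benchmark for evaluating hallucination in multimodal language models when they reason across scenes. Current models perform poorly on cross-scene reasoning and hallucinate more when similar objects are present.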
Details
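Motivation: Multimodal language models handle open-vocabulary objects well, yet they still hallucinate when reasoning about real-world scenes. The paper aims to close the gap between saturating perception benchmarks and real-world reasoning.
Contribution: (1) Introduces the Common-O benchmark, with more than 10.5k examples built exclusively from new images, to probe reasoning across scenes; (2) finds that current models perform poorly on cross-scene reasoning and hallucinate more readily when similar objects are present; (3) suggests that scaled multi-image training may improve performance.
Method: Builds the Common-O benchmark and evaluates leading multimodal language models, including models specifically trained for chain-of-thought reasoning, on both single-image perception and cross-scene reasoning.
Result: Although models saturate perception leaderboards, the best model reaches only 35% on Common-O and only 1% on the more complex Common-O Complex. Models are more prone to hallucinate when similar objects are present in the scene.
Insight: The paper exposes the limits of multimodal language models in cross-scene reasoning and points to scaled multi-image training as a promising direction. The benchmark is released publicly to spur further research.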
Abstract: Multimodal language models possess a remarkable ability to handle an open-vocabulary’s worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking “what’s in common?”. We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many leaderboards focusing on perception, the best performing model only achieves 35% on Common-O – and on Common-O Complex, consisting of more complex scenes, the best model achieves only 1%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we found scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.
[85] NVIDIA Nemotron Nano V2 VL
NVIDIA: Amala Sanjay Deshmukh,Kateryna Chumachenko,Tuomas Rintamaki,Matthieu Le,Tyler Poon,Danial Mohseni Taheri,Ilia Karmanov,Guilin Liu,Jarno Seppanen,Guo Chen,Karan Sapra,Zhiding Yu,Adi Renduchintala,Charles Wang,Peter Jin,Arushi Goel,Mike Ranzinger,Lukas Voegtle,Philipp Fischer,Timo Roman,Wei Ping,Boxin Wang,Zhuolin Yang,Nayeon Lee,Shaokun Zhang,Fuxiao Liu,Zhiqi Li,Di Zhang,Greg Heinrich,Hongxu Yin,Song Han,Pavlo Molchanov,Parth Mannan,Yao Xu,Jane Polak Scowcroft,Tom Balough,Subhashree Radhakrishnan,Paris Zhang,Sean Cha,Ratnesh Kumar,Zaid Pervaiz Bhat,Jian Zhang,Darragh Hanley,Pritam Biswas,Jesse Oliver,Kevin Vasques,Roger Waleffe,Duncan Riach,Oluwatobi Olabiyi,Ameya Sunil Mahabaleshwarkar,Bilal Kartal,Pritam Gundecha,Khanh Nguyen,Alexandre Milesi,Eugene Khvedchenia,Ran Zilberstein,Ofri Masad,Natan Bagrov,Nave Assaf,Tomer Asida,Daniel Afrimi,Amit Zuker,Netanel Haber,Zhiyu Cheng,Jingyu Xin,Di Wu,Nik Spirin,Maryam Moosaei,Roman Ageev,Vanshil Atul Shah,Yuting Wu,Daniel Korzekwa,Unnikrishnan Kizhakkemadam Sreekumar,Wanli Jiang,Padmavathy Subramanian,Alejandra Rico,Sandip Bhaskar,Saeid Motiian,Kedi Wu,Annie Surla,Chia-Chih Chen,Hayden Wolff,Matthew Feinberg,Melissa Corpuz,Marek Wawrzos,Eileen Long,Aastha Jhunjhunwala,Paul Hendricks,Farzan Memarian,Benika Hall,Xin-Yu Wang,David Mosallanezhad,Soumye Singhal,Luis Vega,Katherine Cheung,Krzysztof Pawelec,Michael Evans,Katherine Luna,Jie Lou,Erick Galinkin,Akshay Hazare,Kaustubh Purandare,Ann Guan,Anna Warno,Chen Cui,Yoshi Suhara,Shibani Likhite,Seph Mard,Meredith Price,Laya Sleiman,Saori Kaji,Udi Karpas,Kari Briski,Joey Conway,Michael Lightstone,Jan Kautz,Mohammad Shoeybi,Mostofa Patwary,Jonathen Cohen,Oleksii Kuchaiev,Andrew Tao,Bryan Catanzaro
Main category: cs.LG
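TL;DR: NVIDIA introduces Nemotron Nano V2 VL, the latest model in the Nemotron vision-language series, designed for real-world document understanding, long video comprehension, and reasoning tasks. It improves on the previous model.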
Details
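Motivation: Improve vision-language model performance on document understanding and long-video scenarios while increasing inference efficiency.
Contribution: Presents Nemotron Nano V2 VL, built on the hybrid Mamba-Transformer LLM Nemotron Nano V2 and combined with token reduction techniques to achieve higher inference throughput.
Method: Builds on a hybrid Mamba-Transformer architecture and applies token reduction techniques to raise throughput in long document and video scenarios.
Result: The model delivers significant improvements over the previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains.
Insight: A hybrid Mamba-Transformer architecture combined with efficient token reduction may be a promising direction for vision-language models in long-sequence settings.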
Abstract: We introduce Nemotron Nano V2 VL, the latest model of the Nemotron vision-language series designed for strong real-world document understanding, long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes and training code.
[86] Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks
Alper Kalle,Theo Rudkiewicz,Mohamed-Oumar Ouerfelli,Mohamed Tamaazousti
Main category: cs.LG
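TL;DR: The paper proposes a data-informed tensor decomposition method for compressing convolutional neural networks (CNNs), reducing compute and memory cost while preserving model accuracy.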
Details
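Motivation: Classical approaches compress by minimizing an isotropic norm (such as the Frobenius norm) in weight space, which ignores the input data distribution and can degrade accuracy. The paper instead optimizes the compression under a data-informed norm.
Contribution: (1) Proposes a data-informed norm that minimizes the change in a layer's output distribution; (2) derives alternating least squares algorithms for the Tucker-2 and CPD tensor decompositions; (3) often reaches competitive accuracy without fine-tuning, and the norm transfers across datasets.
Method: (1) Measures compression error with the data-informed norm $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$; (2) designs alternating least squares algorithms for the Tucker-2 and CPD decompositions; (3) uses the covariance matrix $\Sigma$ of the layer's input to guide the compression.
Result: Experiments on several CNN architectures (ResNet-18/50 and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100) confirm the method's effectiveness, often with competitive accuracy and no fine-tuning.
Insight: (1) Data-informed compression outperforms classical weight-space optimization; (2) covariance information transfers across datasets with only a minor accuracy drop, which broadens the method's applicability; (3) the approach offers an efficient route to CNN compression that often needs no fine-tuning.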
Abstract: Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory- and compute-footprint can be reduced by compression. In this work, we focus on compression through tensorization and low-rank representations. Whereas classical approaches search for a low-rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight-space, we use data-informed norms that measure the error in function space. Concretely, we minimize the change in the layer’s output distribution, which can be expressed as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$ where $\Sigma^{1/2}$ is the square root of the covariance matrix of the layer’s input and $W$, $\widetilde{W}$ are the original and compressed weights. We propose new alternating least square algorithms for the two most common tensor decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post-compression fine-tuning, our data-informed approach often achieves competitive accuracy without any fine-tuning. We further show that the same covariance-based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50, and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100) confirm the advantages of the proposed method.
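The data-informed norm in the abstract has a direct interpretation: its square, $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F^2 = \operatorname{tr}\big((W - \widetilde{W})\,\Sigma\,(W - \widetilde{W})^\top\big)$, equals the mean squared change in the layer's output over the inputs. The sketch below checks this numerically for a plain linear layer in pure Python. It is an illustration only: it uses the uncentered second-moment matrix (which coincides with the covariance for zero-mean inputs), works on matrices rather than convolution tensors, and does not implement the paper's Tucker-2 or CPD algorithms. All data and helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def data_aware_sq_norm(delta, sigma):
    """tr(delta @ sigma @ delta^T), i.e. ||delta @ sigma^(1/2)||_F^2."""
    m = matmul(matmul(delta, sigma), transpose(delta))
    return sum(m[i][i] for i in range(len(m)))


def second_moment(xs):
    """(1/n) * sum_i x_i x_i^T; equals the covariance for zero-mean inputs."""
    n, d = len(xs), len(xs[0])
    return [[sum(x[i] * x[j] for x in xs) / n for j in range(d)]
            for i in range(d)]


def mean_sq_output_change(delta, xs):
    """Average of ||delta @ x||^2 over the inputs."""
    total = 0.0
    for x in xs:
        y = [sum(row[j] * x[j] for j in range(len(x))) for row in delta]
        total += sum(v * v for v in y)
    return total / len(xs)


rng = random.Random(0)
# Toy anisotropic inputs: the first coordinate has much larger variance.
xs = [[rng.gauss(0, 3.0), rng.gauss(0, 0.1), rng.gauss(0, 1.0)]
      for _ in range(500)]
sigma = second_moment(xs)

# Two weight errors (W - W_tilde) with the SAME Frobenius norm (0.1).
delta_a = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hits the high-variance input
delta_b = [[0.0, 0.1, 0.0], [0.0, 0.0, 0.0]]  # hits the low-variance input

norm_a = data_aware_sq_norm(delta_a, sigma)
norm_b = data_aware_sq_norm(delta_b, sigma)
```

An isotropic Frobenius norm rates `delta_a` and `delta_b` as equally bad, whereas the data-informed norm penalizes `delta_a` far more because it perturbs a direction the inputs actually excite. That is the motivation the abstract gives for measuring error in function space.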
cs.AI [Back]
[87] DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration
Narjes Nourzad,Hanqing Yang,Shiyu Chen,Carlee Joe-Wong
Main category: cs.AI
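TL;DR: DR. WELL is a decentralized neurosymbolic framework for multi-agent collaboration that combines symbolic planning with a dynamic world model, avoiding brittle step-level alignment and improving task completion rates and efficiency.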
Details
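Motivation: In multi-agent collaboration, trajectory-level coordination often fails because small deviations cascade into conflicts. Symbolic planning mitigates this by raising the level of abstraction and providing a minimal action vocabulary that enables synchronization.
Contribution: Proposes the DR. WELL framework, which combines symbolic planning with a shared dynamic world model to achieve synchronized, efficient multi-agent collaboration.
Method: Uses a two-phase negotiation protocol: agents first propose candidate roles, then commit to a joint allocation under consensus. Each agent then independently generates and executes a symbolic plan, grounded through a shared world model that is updated as agents act.
Result: Experiments on cooperative block-push tasks show that DR. WELL improves task completion rates and efficiency, at the cost of a time overhead.
Insight: Reasoning over symbolic plans with a dynamic world model enables higher-level operations that are reusable, synchronizable, and interpretable.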
Abstract: Cooperative multi-agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi-agent planning. Cooperation unfolds through a two-phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step-level alignment and enables higher-level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block-push tasks show that agents adapt across episodes, with the dynamic world model capturing reusable patterns and improving task completion rates and efficiency. Experiments on cooperative block-push tasks show that our dynamic world model improves task completion and efficiency through negotiation and self-refinement, trading a time overhead for evolving, more efficient collaboration strategies.
[88] VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Yu Feng,Nathaniel Weir,Kaj Bostrom,Sam Bayless,Darion Cassel,Sapana Chaudhary,Benjamin Kiesl-Reiter,Huzefa Rangwala
Main category: cs.AI
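TL;DR: VeriCoT is a neuro-symbolic method that validates Chain-of-Thought (CoT) reasoning through formal logic. Automated solvers check logical validity while natural-language premises support human review, so flawed reasoning is identified effectively, and the verification signal improves models both at inference time and through fine-tuning.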
Details
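Motivation: LLMs cannot reliably verify the logic of their own multi-step Chain-of-Thought reasoning. Even when the answer is correct, the reasoning may be flawed, which undermines trust in high-stakes scenarios.
Contribution: Proposes VeriCoT, a neuro-symbolic method that extracts formal logical arguments from CoT reasoning and verifies them, combining automated solvers with human-reviewable premises to make the reasoning more trustworthy.
Method: Formalizes each CoT step into first-order logic and verifies its validity with automated solvers. It also identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps; these natural-language premises let humans and systems spot ungrounded or fallacious steps.
Result: On the ProofWriter, LegalBench, and BioASQ datasets, VeriCoT effectively identifies flawed reasoning and is a strong predictor of final-answer correctness. Its verification signal, used for inference-time self-reflection, supervised fine-tuning, and preference fine-tuning with DPO, further improves reasoning validity and accuracy.
Insight: Combining neural and symbolic methods can make LLM reasoning more trustworthy and offers new options for inference-time refinement and model fine-tuning.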
Abstract: LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT’s verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
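The core check described above, whether a reasoning step follows from its stated premises, can be illustrated with a toy entailment test. The sketch below is a deliberately simplified stand-in: it uses propositional logic and brute-force truth tables, whereas VeriCoT formalizes steps in first-order logic and uses automated solvers. The example premises, variable names, and the `entails` helper are hypothetical.

```python
from itertools import product


def entails(premises, conclusion, variables):
    """Return True if every truth assignment that satisfies all premises
    also satisfies the conclusion (propositional entailment)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample
    return True


variables = ["rain", "wet", "slippery"]

premises = [
    lambda e: e["rain"],                     # fact taken from the context
    lambda e: (not e["rain"]) or e["wet"],   # rule: rain -> wet
]

# A step that follows from the premises.
step_valid = lambda e: e["wet"]
# A step that asserts something no premise supports (ungrounded).
step_ungrounded = lambda e: e["slippery"]

valid_ok = entails(premises, step_valid, variables)
ungrounded_ok = entails(premises, step_ungrounded, variables)
```

A failed check such as `ungrounded_ok` signals that the step needs an additional premise (here, something like "wet implies slippery") before it can be accepted, which mirrors the role the abstract assigns to identifying ungrounded steps.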
cs.RO [Back]
[89] GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies
Maëlic Neau,Zoe Falomir,Paulo E. Santos,Anne-Gwenn Bosser,Cédric Buche
Main category: cs.RO
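TL;DR: GraSP-VLA is a neuro-symbolic framework that uses a Continuous Scene Graph to generate symbolic action representations, addressing both the lack of high-level planning in VLA models and the limited generalization of symbolic methods in long-horizon tasks.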
Details
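Motivation: Current VLA models lack high-level symbolic planning, while symbolic approaches are limited in generalization and scalability, so neither meets the needs of long-horizon tasks.
Contribution: Proposes GraSP-VLA, a neuro-symbolic framework that generates symbolic representations from a Continuous Scene Graph and uses them to generate planning domains and to orchestrate low-level VLA policies.
Method: Represents human demonstrations with a Continuous Scene Graph, derives symbolic action representations from it, and uses these to generate planning domains at inference time and to orchestrate low-level VLA policies.
Result: Experiments show that GraSP-VLA is effective for automatic planning domain generation from observations, and real-world experiments show its potential to orchestrate low-level VLA policies in long-horizon tasks.
Insight: Combining neural and symbolic components lets GraSP-VLA coordinate high-level planning with low-level execution in long-horizon tasks, offering a new approach to robot learning from demonstrations.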
Abstract: Deploying autonomous robots that can learn new skills from demonstrations is an important challenge of modern robotics. Existing solutions often apply end-to-end imitation learning with Vision-Language Action (VLA) models or symbolic approaches with Action Model Learning (AML). On the one hand, current VLA models are limited by the lack of high-level symbolic planning, which hinders their abilities in long-horizon tasks. On the other hand, symbolic approaches in AML lack generalization and scalability perspectives. In this paper we present a new neuro-symbolic approach, GraSP-VLA, a framework that uses a Continuous Scene Graph representation to generate a symbolic representation of human demonstrations. This representation is used to generate new planning domains during inference and serves as an orchestrator for low-level VLA policies, scaling up the number of actions that can be reproduced in a row. Our results show that GraSP-VLA is effective for modeling symbolic representations on the task of automatic planning domain generation from observations. In addition, results on real-world experiments show the potential of our Continuous Scene Graph representation to orchestrate low-level VLA policies in long-horizon tasks.
[90] Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Tao Lin,Yilei Zhong,Yuxin Du,Jingjing Zhang,Jiting Liu,Yinxinyu Chen,Encheng Gu,Ziyan Liu,Hongyi Cai,Yanwen Zou,Lixing Zou,Zhaoye Zhou,Gen Li,Bo Zhao
Main category: cs.RO
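TL;DR: Evo-1 is a lightweight Vision-Language-Action (VLA) model whose architecture and training paradigm reduce computation and deployment cost while maintaining strong performance, without pretraining on robot data.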
Details
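Motivation: Existing VLA models typically have very large parameter counts and rely on large-scale robot data pretraining, which makes them costly to train and hard to deploy. Their training also tends to degrade the perceptual representations of the vision-language backbone, hurting generalization. Evo-1 aims to address these problems.
Contribution: (1) Proposes Evo-1, a lightweight VLA model with only 0.77B parameters that needs no robot data pretraining; (2) designs an efficient architecture built around a cross-modulated diffusion transformer and an optimized integration module; (3) introduces a two-stage training paradigm that preserves the representations of the vision-language model; (4) achieves state-of-the-art results on several benchmarks and in real-world evaluations.
Method: (1) Builds on a native multimodal vision-language model (VLM); (2) adds a cross-modulated diffusion transformer and an optimized integration module; (3) trains in two stages that progressively align action with perception.
Result: (1) Surpasses the previous best models on Meta-World and RoboTwin by 12.4% and 6.9%, respectively; (2) reaches 94.8% on LIBERO; (3) attains a 78% success rate in real-world evaluations, with high inference frequency and low memory overhead.
Insight: A lightweight, efficient architecture is key, and preserving the semantic alignment of the vision-language model is an important strategy for improving generalization.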
Abstract: Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms often degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suite, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
[91] Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions
Kaifeng Zhang,Shuo Sha,Hanxiao Jiang,Matthew Loper,Hyunjong Song,Guangyan Cai,Zhuo Xu,Xiaochen Hu,Changxi Zheng,Yunzhu Li
Main category: cs.RO
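TL;DR: The paper presents a real-to-sim policy evaluation framework based on 3D Gaussian Splatting for efficiently evaluating robot policies on tasks that involve deformable objects.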
Details
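Motivation: Evaluating robot policies directly in the real world is costly, time-consuming, and hard to reproduce, especially for tasks involving deformable objects. Existing simulators struggle to capture the coupled visual and physical complexity of such interactions.
Contribution: Proposes a framework that combines physics-informed reconstruction with high-quality rendering to build digital twins of deformable objects, and shows that policy evaluation in the resulting simulation correlates strongly with real-world performance.
Method: Reconstructs soft-body digital twins from real-world videos and renders robots, objects, and environments with 3D Gaussian Splatting, enabling photorealistic evaluation in simulation.
Result: On plush toy packing, rope routing, and T-block pushing, simulated rollouts correlate strongly with real-world performance and reveal key behavioral patterns of the learned policies.
Insight: Combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robot policies.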
Abstract: Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/
[92] X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Maximus A. Pace,Prithwish Dan,Chuanruo Ning,Atiksh Bhardwaj,Audrey Du,Edward W. Duan,Wei-Chiu Ma,Kushal Kedia
Main category: cs.RO
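TL;DR: X-Diffusion exploits the forward diffusion process to extract useful guidance from human demonstrations despite human-robot embodiment differences. A classifier on noised actions decides when human data can be used, which maximizes its value for training robot policies.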
Details
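Motivation: Human video data is abundant, but humans and robots execute actions differently, so using human data directly can teach infeasible actions. A method is needed that exploits human data without learning dynamically infeasible motions.
Contribution: Proposes the X-Diffusion framework, which uses a classifier on noised actions to distinguish human from robot actions and incorporates human data only at sufficiently high noise levels, preserving its task-level guidance.
Method: Trains a classifier to predict whether a noisy action was executed by a human or a robot. A human action enters policy training only after enough noise has been added that the classifier cannot discern its embodiment, while actions consistent with robot execution supervise fine-grained denoising at low noise levels.
Result: Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline, demonstrating that it uses human data effectively.
Insight: The forward diffusion process naturally filters out low-level execution differences while preserving high-level task information, offering a new approach to cross-embodiment learning.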
Abstract: Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.
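The gating rule described above, using a human action only at noise levels where the embodiment classifier is near chance, can be sketched as follows. This is a toy illustration under stated assumptions: the cosine-style noise schedule, the `margin` parameter, and especially the stand-in `toy_human_prob` classifier (whose confidence shrinks with the remaining signal level) are hypothetical and are not taken from X-Diffusion, which trains a real classifier on noised actions.

```python
import math


def alpha_bar(t, num_steps):
    """Fraction of signal remaining at diffusion step t (assumed cosine-style
    schedule): 1.0 at t = 0 (clean action), 0.0 at t = num_steps (pure noise)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def toy_human_prob(mismatch, t, num_steps):
    """Stand-in for a trained classifier's P(human | noisy action at step t).
    `mismatch` in [0, 1] is a hypothetical measure of how un-robot-like the
    clean action is; confidence decays toward chance (0.5) as noise grows."""
    return 0.5 + 0.5 * mismatch * math.sqrt(alpha_bar(t, num_steps))


def min_usable_step(mismatch, num_steps, margin=0.1):
    """Smallest step at which the classifier is within `margin` of chance.
    Returns None if that never happens."""
    for t in range(num_steps + 1):
        if abs(toy_human_prob(mismatch, t, num_steps) - 0.5) <= margin:
            return t
    return None


def use_for_training(t, t_min):
    """A human action supervises denoising at step t only if t >= t_min."""
    return t_min is not None and t >= t_min


NUM_STEPS = 100
robot_like = min_usable_step(mismatch=0.0, num_steps=NUM_STEPS)
mismatched = min_usable_step(mismatch=0.8, num_steps=NUM_STEPS)
```

With these toy numbers, an action indistinguishable from robot execution is usable at every noise level, while a strongly mismatched action contributes only at high noise levels, where it provides the coarse guidance the abstract describes.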
[93] GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction
Qingzhou Lu,Yao Feng,Baiyu Shi,Michael Piseno,Zhenan Bao,C. Karen Liu
Main category: cs.RO
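TL;DR: The paper proposes GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance, reducing peak contact forces in human and object interaction while maintaining task success.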
Details
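Motivation: Humanoid robots in human-centered environments need safe and natural physical interaction, but most recent reinforcement learning policies emphasize rigid tracking and offer little support for compliance.
Contribution: Proposes a unified spring-based formulation that models both resistive and guiding contacts, ensures kinematically consistent forces, and supports safety through task-adjustable force thresholds.
Method: Integrates impedance control into a whole-body motion tracking policy, models resistive and guiding contacts with the spring-based formulation, and trains the policy across diverse interaction scenarios.
Result: Evaluated in simulation and on the Unitree G1 humanoid, the policy consistently reduces peak contact forces compared with baselines while maintaining task success, yielding smoother and more natural interactions.
Insight: A unified compliance framework can balance safety and task completion in contact-rich interaction, supporting practical deployment of humanoid robots.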
Abstract: Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.
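To make the idea of a spring-based contact force with a task-adjustable threshold concrete, the sketch below computes a linear spring force toward an anchor point and caps its magnitude. The abstract does not give the exact formulation, so the linear spring law, the magnitude-capping rule, and all parameter values here are assumptions chosen for illustration, not GentleHumanoid's implementation.

```python
import math


def spring_force(position, anchor, stiffness, max_force):
    """Linear spring force pulling `position` toward `anchor` (F = k * (a - p)),
    with its magnitude capped at `max_force` while keeping its direction.

    position, anchor: 3D points in metres; stiffness in N/m; max_force in N.
    """
    raw = [stiffness * (a - p) for p, a in zip(position, anchor)]
    magnitude = math.sqrt(sum(f * f for f in raw))
    if magnitude <= max_force:
        return raw
    scale = max_force / magnitude
    return [f * scale for f in raw]


# Small displacement: the spring force stays below the threshold.
gentle = spring_force([0.0, 0.0, 0.0], [0.01, 0.0, 0.0],
                      stiffness=100.0, max_force=5.0)

# Large displacement: the raw force would be 50 N, so it is capped at 5 N.
capped = spring_force([0.0, 0.0, 0.0], [0.3, 0.4, 0.0],
                      stiffness=100.0, max_force=5.0)
capped_magnitude = math.sqrt(sum(f * f for f in capped))
```

Lowering `max_force` for a task such as hugging and raising it for sit-to-stand assistance would be one way to realize a task-adjustable threshold in this simplified picture.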