cs.CL

[1] eSapiens’s DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs

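Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi

Main category: cs.CL

TL;DR: The DEREK module is an enterprise document question-answering system built on retrieval-augmented generation; it combines hybrid retrieval, reranking, and verification to keep answers accurate and traceable.

Motivation: Enterprises need document QA that is efficient, secure, and auditable, especially in high-stakes domains such as legal and finance. The DEREK module targets this need.

Contribution: 1. A retrieval-augmented generation pipeline that combines hybrid retrieval (HNSW+BM25) with reranking; 2. A verifiable answer-generation mechanism that keeps answers traceable; 3. Support for heterogeneous document inputs with strong security (end-to-end TLS 1.3 and AES-256).

Method: 1. Documents are split into 1,000-token overlapping chunks and indexed in a hybrid store; 2. Queries are refined by GPT-4o, retrieved with combined vector+BM25 search, and reranked with Cohere; 3. Answers are generated with CO-STAR prompt engineering, and a LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded.

Result: On four LegalBench subsets, 1,000-token chunks improve Recall@50 by approximately 1 percentage point, and hybrid retrieval plus reranking improves Precision@10 by approximately 7 percentage points; the verifier raises TRACe Utilization above 0.50 and keeps unsupported statements below 3%.

Insight: 1. Hybrid retrieval and reranking improve retrieval quality; 2. Verifiable answer generation is essential in high-stakes domains; 3. The system design balances security with ease of use.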

Abstract: We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers, enforce end-to-end TLS 1.3 and AES-256. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
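The chunking and hybrid-retrieval steps in this abstract can be illustrated with a minimal sketch. The abstract does not state the overlap size or how the vector and BM25 rankings are merged, so the `overlap` default and the use of reciprocal rank fusion (a common fusion technique, swapped in here for illustration) are assumptions, as are the function names `chunk_tokens` and `reciprocal_rank_fusion`.

```python
from typing import Dict, List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 1000, overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window.

    The overlap default is an assumption; the abstract only says chunks overlap.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) to a document's score.

    Assumed fusion rule for illustration, not necessarily the one DEREK uses.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In such a setup, one ranking would come from the vector (HNSW) index and the other from BM25, and the fused list would then be passed to the reranker.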

[2] Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models

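Altynbek Ismailov, Salia Asanova

Main category: cs.CL

TL;DR: This paper studies how robust large language models (LLMs) are to small prompt perturbations and finds that they under-react to important semantic changes while being overly robust to irrelevant noise.

Motivation: In deployment, LLMs that are insensitive to small but meaningful prompt changes (such as a swapped or missing keyword) can cause serious failures, so it matters how models tell important changes from harmless noise.

Contribution: 1. Introduces three perturbations (progressive underspecification, quantifier flip, jargon substitution) for testing LLM robustness;
2. Shows that LLMs are insensitive to semantically important changes and overly robust to underspecification;
3. Advocates evaluation and training protocols that reward sensitivity to semantic changes while staying steady under noise.

Method: Three kinds of perturbed prompts are generated from 50 LeetCode problems (progressively deleting 10% of words per step, flipping a pivotal quantifier, replacing a common noun with obscure jargon). Six frontier models, including three reasoning-tuned versions, solve each prompt, and their outputs are checked against the original test suites to see whether they adapted to the change.

Result: 1. Models stay correct in 85% of cases even when 90% of the prompt is missing;
2. Models react to a pivotal quantifier flip in only 54% of cases, and reasoning-tuned versions are even less sensitive;
3. Jargon substitutions fall in between, passing through in 56% of cases.

Insight: Current LLMs struggle to separate harmless noise from semantic changes; training and evaluation should raise sensitivity to important changes while preserving robustness to noise.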

Abstract: Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier (“max” to “min”); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three “reasoning-tuned” versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
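The progressive underspecification perturbation might look roughly like the sketch below. The abstract does not say which words are deleted at each step, so uniformly random, cumulative deletion with a fixed seed is an assumption, and `progressive_underspecification` is a hypothetical helper name.

```python
import random
from typing import List


def progressive_underspecification(
    prompt: str, steps: int = 9, fraction: float = 0.10, seed: int = 0
) -> List[str]:
    """Return `steps` variants of `prompt`, each missing a further `fraction` of its words.

    Assumption: words are removed in a single random order fixed up front, so every
    variant drops a superset of the words dropped by the previous one.
    """
    words = prompt.split()
    rng = random.Random(seed)
    order = list(range(len(words)))
    rng.shuffle(order)

    variants: List[str] = []
    for step in range(1, steps + 1):
        n_removed = min(len(words), round(step * fraction * len(words)))
        removed = set(order[:n_removed])
        # Keep the surviving words in their original order.
        variants.append(" ".join(w for i, w in enumerate(words) if i not in removed))
    return variants
```

With the defaults, the last variant has roughly 90% of the words missing, which corresponds to the most extreme condition reported in the abstract.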

[3] Deep Researcher with Test-Time Diffusion

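Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee

Main category: cs.CL

TL;DR: This paper proposes TTD-DR, a deep research agent framework modeled on a diffusion process; an iterative, retrieval-informed "denoising" loop and a self-evolutionary algorithm improve the quality of complex long-form research reports.

Motivation: Existing LLM-based research agents plateau when generating complex long-form reports and lack the cycles of searching, reasoning, and revision found in human research.

Contribution: Proposes the TTD-DR framework, which treats report generation as a diffusion process and uses dynamic retrieval and a self-evolutionary algorithm to improve report quality and coherence.

Method: 1. Initialize an updatable draft skeleton; 2. Iteratively "denoise" the draft, with a retrieval mechanism bringing in external information at each step; 3. Apply a self-evolutionary algorithm to each component of the agentic workflow.

Result: TTD-DR achieves state-of-the-art results on multiple benchmarks that require intensive search and multi-hop reasoning.

Insight: Framing report generation as a diffusion process, combined with retrieval and self-evolution, can improve performance on complex research tasks.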

Abstract: Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a “denoising” process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.

[4] The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

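Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier

Main category: cs.CL

TL;DR: This paper systematically evaluates sociodemographic persona prompting in large language models (LLMs) and finds that prompt format and strategy significantly affect the simulations, especially for marginalized groups.

Motivation: Persona prompting is used to make LLMs simulate the views of different sociodemographic groups, but how the prompt is constructed may affect the fidelity of the results.

Contribution: 1. A systematic evaluation of how persona prompt strategies affect sociodemographic simulation; 2. Evidence that prompt format and strategy can reduce stereotyping; 3. The finding that smaller models can outperform larger ones.

Method: Five open-source LLMs are tested under different prompt strategies (role adoption formats and demographic priming strategies) across 15 intersectional demographic groups in both open- and closed-ended tasks.

Result: LLMs struggle to simulate marginalized groups (such as nonbinary, Hispanic, and Middle Eastern identities), but interview-style prompting and name-based priming improve the simulations. The smaller OLMo-2-7B outperforms the larger Llama-3.3-70B.

Insight: Prompt design is critical when LLMs simulate sociodemographic groups, and smaller models can in some cases outperform larger ones.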

Abstract: Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

[5] Efficient Compositional Multi-tasking for On-device Large Language Models

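Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli

Main category: cs.CL

TL;DR: This paper studies efficient compositional multi-tasking for on-device large language models (LLMs), proposing a benchmark for compositional tasks and a resource-efficient method, Learnable Calibration.

Motivation: Prior work on task merging in LLMs is limited to scenarios where each test example addresses a single task, while real applications often require several tasks at once (for example, translation plus summarization). On-device compute is limited, so efficient compositional multi-tasking solutions are needed.

Contribution: 1. A compositional multi-tasking benchmark comprising four practically relevant tasks. 2. An efficient method (Learnable Calibration) tailored to resource-constrained on-device settings.

Method: Learnable Calibration is an efficient method for combining tasks, designed to keep performance high under the limited compute available on device.

Result: The summary reports that the method performs well in resource-constrained on-device scenarios; the abstract gives no quantitative results.

Insight: Compositional multi-tasking for on-device LLMs has practical value but requires balancing performance against resource consumption. The paper lays groundwork for follow-up research.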

Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

[6] Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task

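Jared Moore, Ned Cooper, Rasmus Overmark, Beba Cibralic, Nick Haber, Cameron R. Jones

Main category: cs.CL

TL;DR: This paper uses the MindGames task to evaluate large language models on planning theory of mind (PToM) and finds that humans clearly outperform the model when the task requires dynamically planning and intervening on others' mental states.

Motivation: To examine whether LLMs have a human-like planning theory of mind, particularly in tasks that require dynamically intervening on others' mental states.

Contribution: Proposes the MindGames task, which explicitly evaluates how language models use theory of mind for planning rather than only for spectator-style prediction.

Method: A multi-step persuasion task (MindGames) requires agents to infer an interlocutor's beliefs and desires in order to change their behavior; human performance is compared with that of o1-preview.

Result: Humans significantly outperform o1-preview on the PToM task (11% higher; p=0.006), while the model outperforms humans in a baseline condition that requires similar planning but minimal mental-state inference.

Insight: LLMs still show a significant gap from humans on dynamic tasks that require complex mental-state reasoning, which highlights the limits of their social reasoning.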

Abstract: Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents’ behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others’ mental states. We present MindGames: a novel ‘planning theory of mind’ (PToM) task which requires agents to infer an interlocutor’s beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; $p=0.006$). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people’s preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone’s preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.

[7] WakenLLM: A Fine-Grained Benchmark for Evaluating LLM Reasoning Potential and Reasoning Process Stability

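Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Yao Wan, Kejia Huang, Zhichao Hou, Xuming Hu

Main category: cs.CL

TL;DR: The paper proposes WakenLLM, a fine-grained benchmark framework for evaluating the reasoning potential and reasoning-process stability of large language models (LLMs), centered on the Vague Perception phenomenon.

Motivation: Current evaluations focus on whether an LLM's "Unknown" output is honest rather than on why it arises, which blurs the difference between a genuinely indeterminate input and a solvable problem that the model fails to resolve.

Contribution: 1. A method that quantifies the proportion of "Unknown" responses attributable to model incapacity; 2. Tests of whether guided stimulation can convert them into correct or intrinsically indeterminate outcomes; 3. A clearer picture of LLM reasoning limits and their potential for improvement.

Method: 1. Separate the sources of uncertainty; 2. Test whether the model can reach the theoretical reasoning accuracy given a baseline framework; 3. Analyze how much different methods improve model performance.

Result: Provides a new evaluation perspective on the true reasoning ability of LLMs and explores ways to address the Vague Perception phenomenon.

Insight: Vague Perception exposes instability in the LLM reasoning process and room for improvement, pointing to directions for model optimization.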

Abstract: Large Language Models (LLMs) frequently output the label Unknown, yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon Vague Perception. We therefore introduce a framework that quantifies the proportion of Unknown responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct (Known) or intrinsically indeterminate outcomes. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. Having obtained a theoretical accuracy on reasoning tasks for different LLMs, we apply different methods to test whether the model can reach that accuracy given a baseline framework. Our work is meaningful in exploring the true reasoning ability of LLMs and providing a new perspective on solving the Vague Perception phenomenon.

[8] Towards Compute-Optimal Many-Shot In-Context Learning

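Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Main category: cs.CL

TL;DR: The paper proposes two efficient demonstration selection strategies for many-shot in-context learning (ICL) that improve performance with minimal computational overhead while supporting caching and substantially reducing inference cost.

Motivation: In many-shot ICL, a fixed set of randomly selected demonstrations is common practice because of high inference costs and the benefits of caching, but its performance is not optimal. More effective selection methods are therefore needed.

Contribution: 1. A selection strategy that combines similarity-based and random demonstrations; 2. A refinement that selects demonstrations using k-means centroids instead of random ones, improving performance further while keeping cost low.

Method: 1. Combine a small number of demonstrations similar to each test sample with a much larger set of random demonstrations that is cached; 2. Replace the random demonstrations with ones selected using centroids derived from test sample representations via k-means clustering.

Result: With Gemini Pro and Flash, the new strategies consistently outperform random selection and match or surpass the most performant selection approach, while reducing inference cost by up to an order of magnitude.

Insight: Adjusting the proportion of demonstrations selected by different criteria can balance performance and inference cost, which gives a practical recipe for many-shot ICL.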

Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
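The two selection strategies can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are plain NumPy arrays, similarity is cosine similarity, the small k-means routine is a plain Lloyd's algorithm, and all function names are hypothetical; the abstract does not specify the embedding model, the similarity measure, or how centroids are mapped back to demonstrations.

```python
import numpy as np


def top_k_similar(test_vec, demo_vecs, k):
    """Indices of the k demonstrations most cosine-similar to one test sample."""
    demo_unit = demo_vecs / np.linalg.norm(demo_vecs, axis=1, keepdims=True)
    test_unit = test_vec / np.linalg.norm(test_vec)
    sims = demo_unit @ test_unit
    return [int(i) for i in np.argsort(-sims)[:k]]


def kmeans_centroids(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns an (n_clusters, dim) centroid matrix."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = points[labels == c]
            if len(members):  # keep the old centroid if a cluster becomes empty
                centroids[c] = members.mean(axis=0)
    return centroids


def cached_demos_from_centroids(test_vecs, demo_vecs, n_cached, seed=0):
    """Strategy 2 (assumed mapping): take the demonstration nearest to each test-set centroid.

    May return fewer than n_cached indices if two centroids share a nearest demonstration.
    """
    centroids = kmeans_centroids(test_vecs, n_cached, seed=seed)
    dists = np.linalg.norm(demo_vecs[:, None, :] - centroids[None, :, :], axis=2)
    return sorted({int(i) for i in dists.argmin(axis=0)})


def build_prompt_indices(test_vec, demo_vecs, cached_idx, k_similar):
    """Cached block first (a shared, reusable prefix), then the per-sample similar demos."""
    cached = [int(i) for i in cached_idx]
    ranked = top_k_similar(test_vec, demo_vecs, len(demo_vecs))
    similar = [i for i in ranked if i not in cached][:k_similar]
    return cached + similar
```

For the first strategy, `cached_idx` would simply be a fixed random sample of demonstration indices. Placing the cached block first keeps the prompt prefix identical across test samples, which is what would allow its computation to be reused.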

[9] FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents

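Run Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, Zhengwen Qiu

Main category: cs.CL

TL;DR: FinResearchBench is a logic-tree-based evaluation framework for automatically assessing financial research agents, addressing the lack of systematic evaluation tools in this area.

Motivation: AI agents are advancing quickly in professional research applications such as financial research, but systematic, automated evaluation frameworks are scarce, especially ones that handle the complexity and subtlety of financial research.

Contribution: 1. The first logic-tree-based Agent-as-a-Judge system, which extracts the logic tree of a research outcome and uses it for comprehensive evaluation; 2. A finance-specific focus covering 7 frequently encountered task types and 70 typical questions.

Method: FinResearchBench uses the logic tree as intermediate information to automatically evaluate financial research agents across 7 task types.

Result: An evaluation framework for financial research agents that the authors present as comprehensive, reliable, and robust across complex financial tasks.

Insight: A logic-tree structure can better capture the complex reasoning in financial research and offers a new approach to evaluating research-oriented AI agents.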

Abstract: Recently, AI agents have been rapidly evolving in intelligence and are widely used in professional research applications, such as STEM, software development, finance, etc. Among these AI agents, the deep research agent is a key category as it can perform long-horizon tasks and solve problems of greater complexity. However, there are few evaluation frameworks and benchmarks that systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, which is a logic tree based Agent-as-a-Judge that specifically targets financial research agents. It provides a comprehensive and automatic assessment of the research agents across 7 key types of tasks in the financial research domain. The contributions of this work are twofold: (1) the first and innovative Agent-as-a-Judge system that extracts the logic tree of the research outcome and uses it as the intermediate information to present a comprehensive, reliable and robust evaluation; (2) a finance-oriented design that covers 70 typical financial research questions, spread across 7 frequently encountered types of tasks in the domain.

[10] Efficient RL for optimizing conversation level outcomes with an LLM-based tutor

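Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar

Main category: cs.CL

TL;DR: This paper proposes a lightweight method that represents the student with a low-dimensional latent state and optimizes a long-term policy for an LLM-based math tutor, improving tutoring outcomes in multi-turn dialogue.

Motivation: Existing RLHF frameworks typically optimize LLM responses on immediate turn-level human preferences, which fails to capture long-term goals in multi-turn dialogue such as math tutoring.

Contribution: The main contribution is a method that represents the dialogue history as a low-dimensional latent state of the student and optimizes a long-term policy over high-level actions to better achieve the tutoring goal.

Method: The dialogue history is represented as a low-dimensional latent student state, and a long-term policy is optimized on top of it; this requires fewer computational resources than training the tutor policy end-to-end.

Result: Experiments show that the method leads to better long-term outcomes than prompting in LLM-simulated tutoring tasks.

Insight: In multi-turn LLM dialogue, a latent state representation can capture long-term goals more efficiently while reducing computational overhead.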

Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor’s behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring less computational resources than prior work of training the tutor policy end-to-end to directly output the tutor’s next utterance. Our experiment results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.

[11] Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

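Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

Main category: cs.CL

TL;DR: This paper uses a formal language (Dafny) to reduce reliance on human priors, combining an automatic data curation pipeline with reinforcement learning (RL) design to make formal software verification more reliable and scalable.

Motivation: For existing LLMs based on informal language (such as natural language), verification is neither reliable nor scalable, whereas formal languages such as Dafny enable automatic and mathematically provable verification.

Contribution: 1. An approach that reduces reliance on human priors; 2. The DafnyComp benchmark; 3. SFT and RL training that lets small models surpass proprietary models on formal-language tasks.

Method: 1. An automatic, scalable data curation pipeline; 2. RL designs integrated with feedback from the formal language verifier; 3. The DafnyComp benchmark of compositional formal programs.

Result: After SFT, even small models (e.g., 0.5B) generate syntactically valid and verifiable Dafny code, surpassing proprietary models; RL with regularization further improves generalization to out-of-domain tasks.

Insight: Formal languages give LLMs a reliable verification signal, and reducing reliance on human priors can improve generality and efficiency.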

Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

[12] GG-BBQ: German Gender Bias Benchmark for Question Answering

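Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan

Main category: cs.CL

TL;DR: This paper introduces GG-BBQ, a benchmark dataset for evaluating gender bias in German large language models, built through machine translation plus manual correction, and reports the bias that models show on it.

Motivation: Evaluating and reducing bias is central to fairness research in natural language processing (NLP), particularly in multilingual settings. Existing benchmarks are mostly in English, and bias evaluation tools for languages such as German are lacking.

Contribution: Proposes GG-BBQ, a gender bias benchmark dataset for German LLMs, and shows experimentally how models behave along the gender dimension.

Method: Templates from the gender identity subset of the English BBQ dataset are machine translated into German and manually corrected, yielding two subsets: one with gender-related group terms and one with proper names. Several LLMs used for German NLP are evaluated for accuracy and bias scores.

Result: All tested models exhibit gender bias, both along and against existing social stereotypes.

Insight: Machine translation has limits when building multilingual bias benchmarks, and manual revision is a key step for data quality, especially for a language with grammatical gender.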

Abstract: Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model’s predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.

[13] PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning

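Hui Xiang, Jinqiao Shi, Ting Zhang, Xiaojie Zhao, Yong Liu, Yong Ma

Main category: cs.CL

TL;DR: This paper proposes PromptAL, a sample-aware dynamic soft prompt framework for few-shot active learning that uses unlabeled data to improve the decision boundary and sample selection.

Motivation: In few-shot settings, the empirical distribution of labeled data diverges significantly from the target distribution, which shifts the decision boundary and degrades sample selection in existing active learning methods. PromptAL aims to improve the empirical distribution through dynamic soft prompts so that selected samples are more representative.

Contribution: 1) Sample-aware dynamic soft prompts that adjust the model's predictive distribution and decision boundary; 2) Uncertainty estimation combined with global and local diversity to select more representative samples; 3) Validation of PromptAL on multiple datasets.

Method: PromptAL builds dynamic soft prompts from unlabeled data to optimize the decision boundary, then selects samples by combining uncertainty and diversity.

Result: On six in-domain and three out-of-domain datasets, PromptAL outperforms nine baselines.

Insight: Using unlabeled data to dynamically adjust the model's predictive distribution can improve few-shot active learning.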

Abstract: Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model’s predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.

[14] Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models

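Mohamad Ballout, Serwan Jassim, Elia Bruni

Main category: cs.CL

TL;DR: A systematic evaluation of multimodal large language models (MLLMs) on intuitive physics tasks finds that even the latest models cannot reliably distinguish physically plausible from implausible scenarios. Vision encoders capture the physical cues, but the language model fails to use them, which leads to reasoning failures.

Motivation: To evaluate MLLMs on intuitive physics tasks and investigate why they fail, with a focus on vision-language alignment.

Contribution: Shows that the main limitation of MLLMs on intuitive physics tasks is the integration of visual and linguistic information rather than the vision component itself.

Method: Several open-source and proprietary MLLMs are evaluated on the GRASP and IntPhys 2 datasets, and a probing analysis is run on model embeddings.

Result: Models struggle on these tasks; depending on task difficulty, vision encoders capture physical plausibility cues that the language model does not effectively use.

Insight: Vision-language alignment is a key direction for improving future MLLMs.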

Abstract: This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLMs development.

[15] Step-Audio 2 Technical Report

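Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Main category: cs.CL

TL;DR: Step-Audio 2 is an end-to-end multimodal large language model for audio understanding and speech conversation. It combines a latent audio encoder with reinforcement learning, generates discrete audio tokens as part of language modeling, and integrates retrieval-augmented generation.

Motivation: To build a model that handles complex real-world audio and speech tasks while responding better to paralinguistic information such as speaking style and emotion.

Contribution: 1. An end-to-end multimodal large language model that integrates a latent audio encoder with reasoning-centric reinforcement learning; 2. Discrete audio token generation that improves responsiveness to paralinguistic information; 3. Retrieval-augmented generation and external tool calls that mitigate hallucination and extend capabilities.

Method: 1. A latent audio encoder combined with reinforcement learning; 2. Discrete audio token generation; 3. Retrieval-augmented generation (RAG) and external tool calls such as web search and audio search.

Result: State-of-the-art performance on various audio understanding and conversational benchmarks compared with other open-source and commercial solutions.

Insight: Combining multimodal modeling with reinforcement learning yields greater intelligence and expressiveness in real conversational scenarios, and RAG plus tool calling improves practical usefulness.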

Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

[16] P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs

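Dongjun Jang, Youngchae Ahn, Hyopil Shin

Main category: cs.CL

TL;DR: This paper proposes P-CoT, a prompting method grounded in educational theory that uses structured guidance to activate latent phonological reasoning in LLMs and performs strongly on phonological tasks.

Motivation: To explore the potential of text-based large language models (LLMs) for phonological reasoning and to address the inconsistent gains of few-shot learning on these tasks.

Contribution: Proposes the P-CoT prompting method, grounded in educational theory, which consistently improves LLM performance on phonological tasks and surpasses human baselines on some of them.

Method: P-CoT prompts are designed around educational theories such as scaffolding and discovery learning, using structured guidance to activate the LLM's phonological abilities. Twelve LLMs are evaluated on PhonologyBench tasks (rhyme word generation, g2p conversion, syllable counting).

Result: P-CoT consistently improves performance on phonological tasks, with gains of up to 52%, and surpasses human baselines on certain tasks.

Insight: Designing prompts around educational theory can activate latent abilities in LLMs and may carry over to other linguistic domains.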

Abstract: This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.

[17] Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs

Yujin Han,Hao Chen,Andi Han,Zhiheng Wang,Xinyu Lin,Yingya Zhang,Shiwei Zhang,Difan Zou

Main category: cs.CL

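TL;DR: The paper reveals a self-contradiction between generation and understanding in multimodal large language models (MLLMs) and proposes a metric, the Nonunified score, to quantify it. By using the stronger understanding ability to guide the weaker generation ability, the model improves itself. The study also finds that fine-tuning only the generation branch improves both understanding and generation, and proposes a curriculum-learning-based strategy to further optimize the model.

Details Motivation: Although researchers have tried to unify multimodal generation and understanding in a single model, these MLLMs still exhibit self-contradiction, in which generated content is inconsistent with the input prompt. The paper aims to uncover the root cause of this phenomenon and to explore how the model's internal supervision can be used for self-improvement.

Contribution: The main contributions are: 1) defining the Nonunified score to quantify self-contradiction; 2) showing that weak generation is the main cause of self-contradiction; 3) proposing an improvement method in which the model's internal understanding guides its generation; 4) discovering that fine-tuning the generation branch also improves understanding; 5) proposing a curriculum-learning-based optimization strategy.

Method: The paper applies standard post-training methods (e.g., SFT, DPO) with the model's internal supervision signal (its understanding ability) to improve generation. It also analyzes the aligned training dynamics between generation and understanding, and proposes a curriculum-learning-based strategy that gradually introduces harder samples.

Result: Experiments show that improving generation through internal supervision substantially reduces self-contradiction and improves unification. The paper also verifies the co-improvement of understanding when the generation branch is fine-tuned, as well as the risk that poor supervision leads to co-degradation.

Insight: 1) Self-contradiction stems from insufficient generation ability; 2) the model's internal understanding can serve as an effective supervision signal for improving generation; 3) improvements in generation and understanding can reinforce each other; 4) data quality is crucial for avoiding co-degradation; 5) curriculum learning can optimize the model's gradual improvement.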

Abstract: Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model’s own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
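As a rough illustration of how a self-contradiction metric of this kind could be computed, the sketch below treats the score as the fraction of prompts for which the model's own understanding branch judges its generated output to be misaligned. The abstract does not give the exact definition, so this fraction-based formula is an assumption, and the `generate` and `judge_aligned` callables are hypothetical placeholders for the two branches.

```python
from typing import Callable, Iterable


def nonunified_score(
    prompts: Iterable[str],
    generate: Callable[[str], object],
    judge_aligned: Callable[[str, object], bool],
) -> float:
    """Fraction of prompts whose generated output the same model judges misaligned.

    `generate` stands in for the generation branch and `judge_aligned` for the
    understanding branch. Both are hypothetical placeholders.
    """
    total = 0
    contradictions = 0
    for prompt in prompts:
        output = generate(prompt)
        if not judge_aligned(prompt, output):
            contradictions += 1
        total += 1
    return contradictions / total if total else 0.0


# Toy stand-ins: the "generator" echoes the prompt but drops the word "red",
# and the "judge" checks that every prompt word survived in the output.
def toy_generate(prompt: str) -> str:
    return " ".join(word for word in prompt.split() if word != "red")


def toy_judge(prompt: str, output: str) -> bool:
    return set(prompt.split()) <= set(output.split())


score = nonunified_score(["a red cube", "a blue ball"], toy_generate, toy_judge)
```

In this toy setup one of the two prompts loses an attribute, so half of the generations contradict the judge. A real evaluation would replace both stand-ins with calls to the same MLLM.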

[18] Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

Hongyin Luo,Nathaniel Morgan,Tina Li,Derek Zhao,Ai Vy Ngo,Philip Schroeder,Lijie Yang,Assaf Ben-Kish,Jack O’Brien,James Glass

Main category: cs.CL

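TL;DR: The paper proposes the Thread Inference Model (TIM) and the TIMRUN runtime to break the context limits of large language models (LLMs) and support recursive, decompositional problem solving. TIMRUN provides virtually unlimited working memory and multi-hop tool calls, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks.

Details Motivation: Existing LLMs are bottlenecked by context limits on long-horizon reasoning tasks, which hurts reasoning accuracy and efficiency. The paper aims to remove this limit and enable longer-range, structured reasoning.

Contribution: 1) Proposes the TIM model and the TIMRUN runtime, which let LLMs reason beyond their context limits; 2) enables long-horizon reasoning by modeling natural language as reasoning trees rather than linear sequences; 3) designs a rule-based subtask-pruning mechanism that enables reuse of GPU memory and positional embeddings.

Method: 1) Problems are modeled as reasoning trees consisting of tasks, recursive subtasks, and conclusions; 2) during inference, a working memory retains only the key-value states of the most relevant context tokens; 3) the TIMRUN runtime dynamically manages GPU memory and positional embeddings and supports multi-hop tool calls.

Result: Experiments show that the system sustains high inference throughput even when manipulating up to 90% of the KV cache in GPU memory, and that it performs well on tasks requiring long-horizon reasoning, such as mathematical tasks and information retrieval.

Insight: Modeling language as reasoning trees rather than linear sequences is an effective way to get past context limits, and dynamic memory management with rule-based pruning offers a new approach to improving inference efficiency.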

Abstract: To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
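The idea of a reasoning tree with subtask pruning can be sketched at the text level. The example below is a minimal illustration, not the TIMRUN implementation: it assumes a pruning rule in which a finished task keeps its own thought and conclusion while each finished subtask is collapsed to its conclusion. The `Task` class and both helper functions are hypothetical names, and the real system prunes key-value states rather than strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """One node of a reasoning tree: a thought, a conclusion, optional subtasks."""
    thought: str
    conclusion: str
    subtasks: List["Task"] = field(default_factory=list)


def full_trace(task: Task) -> List[str]:
    """Everything a linear-sequence model would keep in context."""
    lines = [task.thought]
    for sub in task.subtasks:
        lines.extend(full_trace(sub))
    lines.append(task.conclusion)
    return lines


def pruned_memory(task: Task) -> List[str]:
    """Assumed rule: collapse each finished subtask to its conclusion only."""
    kept = [task.thought]
    kept.extend(sub.conclusion for sub in task.subtasks)
    kept.append(task.conclusion)
    return kept


root = Task(
    thought="Solve 12*13 + 5",
    conclusion="Answer: 161",
    subtasks=[
        Task(
            thought="Compute 12*13",
            conclusion="12*13 = 156",
            subtasks=[
                Task(thought="Compute 12*10", conclusion="12*10 = 120"),
                Task(thought="Compute 12*3", conclusion="12*3 = 36"),
            ],
        ),
        Task(thought="Add 5 to 156", conclusion="156 + 5 = 161"),
    ],
)

print(len(full_trace(root)), "entries without pruning")
print(len(pruned_memory(root)), "entries with pruning")
```

The pruned memory grows with the number of direct subtasks instead of with the size of the whole tree, which is the property that would allow memory and position slots to be reused.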

[19] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Yanjun Zheng,Xiyang Du,Longfei Liao,Xiaoke Zhao,Zhaowen Zhou,Bo Zhang,Jiawei Liu,Xiang Qi,Zhe Li,Zhiqiang Zhang,Wang Wei,Peng Zhang

Main category: cs.CL

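TL;DR: The paper presents the Agentar-Fin-R1 series of financial large language models, which improve reasoning ability, reliability, and adaptability on financial tasks through domain expertise and efficient training optimization.

Details Motivation: Existing LLMs lack strong reasoning, high trustworthiness, and task adaptability in the financial domain, so more specialized financial language models are needed.

Contribution: 1. Develops financial LLMs with 8B and 32B parameters based on the Qwen3 foundation model; 2. proposes a systematic financial task taxonomy and a multi-layered trustworthiness assurance framework; 3. designs efficient training optimization methods (such as difficulty-aware optimization and two-stage learning).

Method: 1. High-quality trustworthy knowledge engineering; 2. multi-agent trustworthy data synthesis; 3. label-guided automated difficulty-aware optimization and two-stage learning; 4. a comprehensive evaluation framework (the Finova benchmark).

Result: Achieves state-of-the-art performance on financial benchmarks (FinEva and others) and strong results on general reasoning datasets (MATH-500 and others), validating its effectiveness for high-stakes financial applications.

Insight: 1. Domain expertise is essential for financial LLMs; 2. a multi-layered trustworthiness framework can substantially improve model reliability and compliance; 3. efficient training methods can balance performance against compute cost.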

Abstract: Large Language Models (LLMs) demonstrate tremendous potential in the financial domain, yet existing models often fall short in scenarios demanding robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task taxonomy with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, two-stage learning processes, and detailed attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including FinEva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications.

[20] LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Da-Chen Lian,Ri-Sheng Huang,Pin-Er Chen,Chunki Lim,You-Kuan Lin,Guan-Yu Tseng,Zi-Cheng Yang,Shu-Kai Hsieh

Main category: cs.CL

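TL;DR: LingBench++ is a linguistically informed benchmark and reasoning framework for evaluating large language models on complex linguistic tasks, providing structured reasoning traces and stepwise evaluation protocols.

Details Motivation: Existing benchmarks usually focus only on final-answer accuracy and lack in-depth evaluation of the reasoning process and of cross-cultural linguistic diversity. LingBench++ aims to fill this gap.

Contribution: Proposes the LingBench++ benchmark framework, which supports multi-step reasoning and cross-cultural linguistic evaluation, and develops a multi-agent architecture that combines external knowledge retrieval with iterative reasoning.

Method: A multi-agent architecture combines grammatical knowledge retrieval, tool-augmented reasoning, and hypothesis testing, improving model performance through structured reasoning and stepwise evaluation.

Result: Experiments show that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability.

Insight: Introducing linguistic knowledge and cross-cultural data can substantially improve LLM performance on complex tasks while making the models more transparent.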

Abstract: We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.

[21] MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Run-Ze Fan,Zengzhi Wang,Pengfei Liu

Main category: cs.CL

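TL;DR: The paper presents TextbookReasoning and MegaScience, two large-scale, high-quality scientific reasoning datasets. It refines public scientific datasets through systematic data selection and trains Llama3.1 and Qwen series models that significantly outperform the official instruct models.

Details Motivation: Scientific reasoning is essential for AI scientists and human researchers, but the open-source community lacks high-quality, verifiable scientific reasoning datasets, which holds back progress in this area. The paper aims to fill this gap.

Contribution: 1. Releases two large-scale scientific reasoning datasets, TextbookReasoning and MegaScience; 2. proposes a systematic data selection methodology; 3. builds a comprehensive evaluation system; 4. trains Llama3.1 and Qwen series models with superior performance.

Method: 1. Builds TextbookReasoning by extracting 650k scientific questions from 12k university textbooks; 2. refines public scientific datasets through data selection to form MegaScience, with 1.25M instances; 3. evaluates models using diverse answer extraction strategies.

Result: Experiments show that the MegaScience datasets outperform existing open-source scientific datasets in performance and training efficiency. The trained models exceed the official instruct models, and the gains are larger for bigger, stronger models.

Insight: Data quality and scale are critical to model performance in scientific reasoning, and larger models make better use of high-quality datasets.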

Abstract: Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.

cs.CV

[22] Salience Adjustment for Context-Based Emotion Recognition

Bin Han,Jonathan Gratch

Main category: cs.CV

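TL;DR: The paper proposes a salience-adjusted framework for context-aware emotion recognition that combines Bayesian Cue Integration (BCI) with Visual-Language Models (VLMs) to dynamically weight facial and contextual information, improving emotion recognition performance.

Details Motivation: Emotion recognition in dynamic social scenes requires understanding the complex interaction between facial expressions and situational cues. Traditional methods do not fully combine facial and contextual information, so a dynamic weighting solution is needed.

Contribution: 1. Proposes a salience adjustment framework that dynamically balances facial and contextual information; 2. combines Bayesian Cue Integration with Visual-Language Models; 3. validates the method in prisoner's dilemma scenarios.

Method: Uses BCI and VLMs to dynamically weight facial and contextual information, adjusting the weights according to the salience of the facial expression.

Result: Experiments show that salience adjustment improves emotion recognition performance.

Insight: Dynamically adjusting the weights of facial and contextual information is key to better emotion recognition. Future work could extend this to broader social contexts and multimodal applications.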

Abstract: Emotion recognition in dynamic social contexts requires an understanding of the complex interaction between facial expressions and situational cues. This paper presents a salience-adjusted framework for context-aware emotion recognition with Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically weight facial and contextual information based on the expressivity of facial cues. We evaluate this approach using human annotations and automatic emotion recognition systems in prisoner’s dilemma scenarios, which are designed to evoke emotional reactions. Our findings demonstrate that incorporating salience adjustment enhances emotion recognition performance, offering promising directions for future research to extend this framework to broader social contexts and multimodal applications.
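One simple way to picture salience-based weighting of two cues is a log-linear pool of the face-only and context-only emotion distributions. The sketch below is an assumption for illustration, not the paper's formulation: the `integrate_cues` function and the rule `p(e) ∝ p_face(e)^w · p_context(e)^(1-w)`, with `w` as the facial salience in [0, 1], are hypothetical.

```python
from typing import Dict


def integrate_cues(
    p_face: Dict[str, float],
    p_context: Dict[str, float],
    salience: float,
) -> Dict[str, float]:
    """Combine two emotion distributions, trusting the face more as salience rises.

    Assumed rule (log-linear pooling): p(e) is proportional to
    p_face(e)**salience * p_context(e)**(1 - salience).
    """
    if not 0.0 <= salience <= 1.0:
        raise ValueError("salience must lie in [0, 1]")
    unnormalized = {
        emotion: (p_face[emotion] ** salience)
        * (p_context[emotion] ** (1.0 - salience))
        for emotion in p_face
    }
    total = sum(unnormalized.values())
    return {emotion: value / total for emotion, value in unnormalized.items()}


face = {"joy": 0.7, "anger": 0.3}
context = {"joy": 0.2, "anger": 0.8}

expressive = integrate_cues(face, context, salience=1.0)  # follows the face
neutral = integrate_cues(face, context, salience=0.0)     # follows the context
mixed = integrate_cues(face, context, salience=0.5)
```

With a highly expressive face the result follows the facial cue, and with an inexpressive face it falls back on the context, which mirrors the weighting behaviour the abstract describes.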

[23] Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

Goeric Huybrechts,Srikanth Ronanki,Sai Muralidhar Jayanthi,Jack Fitzgerald,Srinivasan Veeravanallur

Main category: cs.CV

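TL;DR: The paper introduces Document Haystack, a benchmark for evaluating the understanding and retrieval abilities of Vision Language Models (VLMs) on long documents, addressing the current lack of benchmarks for long-document processing.

Details Motivation: Multimodal large language models have advanced the processing of complex multimodal data, but long-document processing remains under-explored, largely because suitable benchmarks are lacking.

Contribution: The main contribution is the Document Haystack benchmark, which contains documents of up to 200 pages with pure-text or multimodal text+image "needles" inserted into them to evaluate VLM retrieval ability.

Method: Builds a dataset of 400 document variants and 8,250 questions, together with an automated evaluation framework, to test VLM retrieval performance at different depths within documents.

Result: The paper reports the performance of prominent VLMs on Document Haystack and analyzes the results.

Insight: Long-document processing is a promising direction for VLM research, and Document Haystack provides tooling and direction for future work.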

Abstract: The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image “needles” at various depths within the documents to challenge VLMs’ retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.
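Inserting a needle "at various depths" can be pictured as mapping a relative depth to a page and appending the needle there. The sketch below works on plain-text pages only and is an assumption for illustration. The function names, the rounding rule, and the needle wording are hypothetical and are not taken from the benchmark's construction code.

```python
from typing import List


def needle_page(num_pages: int, depth: float) -> int:
    """Map a relative depth in [0, 1] to a 1-indexed page number."""
    if num_pages < 1:
        raise ValueError("document must have at least one page")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")
    return 1 + round(depth * (num_pages - 1))


def insert_needle(pages: List[str], needle: str, depth: float) -> List[str]:
    """Return a copy of `pages` with `needle` appended to the page at `depth`."""
    index = needle_page(len(pages), depth) - 1
    result = list(pages)
    result[index] = result[index] + "\n" + needle
    return result


pages = [f"Page {i} body text." for i in range(1, 6)]
needle = "The hidden keyword is kiwi."  # hypothetical needle text
haystack = insert_needle(pages, needle, depth=0.5)
```

A retrieval question about the needle can then be scored automatically by checking whether the model's answer contains the hidden value.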

[24] PAT++: a cautionary tale about generative visual augmentation for Object Re-identification

Leonardo Santiago Benitez Pereira,Arathy Jeevan

Main category: cs.CV

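TL;DR: The paper examines the effectiveness of generative data augmentation for object re-identification. Using the PAT++ pipeline, which combines Diffusion Self-Distillation with the Part-Aware Transformer, it finds that generated images degrade performance, challenging assumptions about the suitability of generative models for fine-grained recognition tasks.

Details Motivation: Generative data augmentation has shown promise in vision tasks, but its effect on object re-identification, where fine-grained visual details must be preserved, has not been studied enough. The paper aims to assess the effectiveness of identity-preserving image generation for this task.

Contribution: The main contributions are: 1) proposing the PAT++ pipeline, which combines Diffusion Self-Distillation with the Part-Aware Transformer; 2) showing experimentally that generated images degrade performance on object re-identification, exposing key limitations of current visual augmentation methods.

Method: The PAT++ pipeline combines Diffusion Self-Distillation with the Part-Aware Transformer to generate identity-preserving images, and is evaluated on the Urban Elements ReID Challenge dataset.

Result: When generated images are used for model training and query expansion, performance degrades consistently, mainly because of domain shift and failure to retain identity-defining features.

Insight: The paper exposes the limitations of generative augmentation for fine-grained recognition and challenges existing assumptions about the transferability of generative models, providing a useful reference for future research.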

Abstract: Generative data augmentation has demonstrated gains in several vision tasks, but its impact on object re-identification, where preserving fine-grained visual details is essential, remains largely unexplored. In this work, we assess the effectiveness of identity-preserving image generation for object re-identification. Our novel pipeline, named PAT++, incorporates Diffusion Self-Distillation into the well-established Part-Aware Transformer. Using the Urban Elements ReID Challenge dataset, we conduct extensive experiments with generated images used for both model training and query expansion. Our results show consistent performance degradation, driven by domain shifts and failure to retain identity-defining features. These findings challenge assumptions about the transferability of generative models to fine-grained recognition tasks and expose key limitations in current approaches to visual augmentation for identity-preserving applications.

[25] Local Dense Logit Relations for Enhanced Knowledge Distillation

Liuchi Xu,Kang Liu,Jinshuai Liu,Lu Wang,Lisheng Xu,Jun Cheng

Main category: cs.CV

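TL;DR: The paper proposes LDRLD, a new knowledge distillation method that captures inter-class relationships by recursively decoupling and recombining logit information, and combines it with the Adaptive Decay Weight (ADW) strategy to improve student model performance.

Details Motivation: Existing logit distillation methods have not fully explored the fine-grained relationships within logit knowledge, which limits knowledge transfer.

Contribution: 1. Proposes LDRLD, which captures fine-grained inter-class relationships by recursively decoupling and recombining logit information; 2. introduces the ADW strategy, which dynamically adjusts the weights of critical category pairs; 3. validates the method experimentally.

Method: 1. Recursively decouples and recombines logit information; 2. combines IRW and ERD for adaptive weight adjustment; 3. distills the remaining non-target knowledge to ensure knowledge completeness.

Result: On CIFAR-100, ImageNet-1K, and Tiny-ImageNet, LDRLD compares favorably with existing logit distillation methods.

Insight: Mining fine-grained inter-class relationships and weighting them adaptively are key to improving knowledge distillation.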

Abstract: State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize the performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student’s performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based distillation approaches. The code will be made publicly available.
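The two weighting terms can be illustrated with a small sketch. The abstract states only that IRW is inversely proportional to the rank difference of a category pair and that ERD decays with the pair's total ranking score. The concrete forms used here (`1 / |r_i - r_j|` and `exp(-beta * (r_i + r_j))`), their product, and the `beta` parameter are assumptions for illustration, not the paper's formulas.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def class_ranks(logits: Sequence[float]) -> List[int]:
    """Rank classes by teacher logit, with rank 1 for the largest logit."""
    order = sorted(range(len(logits)), key=lambda c: logits[c], reverse=True)
    ranks = [0] * len(logits)
    for rank, cls in enumerate(order, start=1):
        ranks[cls] = rank
    return ranks


def pair_weights(
    logits: Sequence[float], beta: float = 0.5
) -> Dict[Tuple[int, int], float]:
    """Weight every class pair by an assumed IRW term times an assumed ERD term."""
    ranks = class_ranks(logits)
    weights = {}
    for i, j in combinations(range(len(logits)), 2):
        irw = 1.0 / abs(ranks[i] - ranks[j])            # closer ranks, larger weight
        erd = math.exp(-beta * (ranks[i] + ranks[j]))   # top-ranked pairs decay less
        weights[(i, j)] = irw * erd
    return weights


teacher_logits = [2.0, 5.0, 1.0, 3.0]
ranks = class_ranks(teacher_logits)
weights = pair_weights(teacher_logits)
top_pair = max(weights, key=weights.get)
```

Under these assumed forms the pair made of the two highest-ranked classes receives the largest weight, which matches the stated goal of emphasizing the most critical relationships.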

[26] An empirical study for the early detection of Mpox from skin lesion images using pretrained CNN models leveraging XAI technique

Mohammad Asifur Rahim,Muhammad Nazmul Arefin,Md. Mizanur Rahman,Md Ali Hossain,Ahmed Moustafa

Main category: cs.CV

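TL;DR: The paper studies pretrained CNN models (such as VGG16 and InceptionV3) combined with Grad-CAM for early Mpox detection. InceptionV3 and MobileNetV2 perform best on the binary and multi-class datasets, respectively, and the paper also notes overfitting in some models.

Details Motivation: Mpox resembles other skin conditions, which makes early diagnosis difficult. Although AI performs well in medical image analysis, the use of pretrained CNN models and XAI techniques for Mpox detection remains underexplored.

Contribution: 1. Evaluates several pretrained CNN models on Mpox detection; 2. uses Grad-CAM to improve model interpretability; 3. reports experimental results on binary and multi-class datasets.

Method: Applies transfer learning, freezing the initial layers of the pretrained CNN models and adding custom layers to avoid overfitting. Uses the MSLD and MSLD v2.0 datasets, evaluates the models with metrics such as accuracy and recall, and visualizes critical features with Grad-CAM.

Result: InceptionV3 reaches 95% accuracy on the binary dataset and MobileNetV2 reaches 93% on the multi-class dataset. Grad-CAM successfully highlights key regions, but some models overfit.

Insight: Pretrained CNNs and XAI techniques show clear potential for Mpox detection, but overfitting must be addressed and multimodal data incorporated to improve diagnostic reliability.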

Abstract: Context: Mpox is a zoonotic disease caused by the Mpox virus, which shares similarities with other skin conditions, making accurate early diagnosis challenging. Artificial intelligence (AI), especially Deep Learning (DL), has become a strong tool for medical image analysis; however, the use of pre-trained models such as CNNs and of XAI techniques for mpox detection is underexplored. Objective: This study aims to evaluate the effectiveness of pre-trained CNN models (VGG16, VGG19, InceptionV3, MobileNetV2) for the early detection of monkeypox using binary and multi-class datasets. It also seeks to enhance model interpretability using Grad-CAM, an XAI technique. Method: Two datasets, MSLD and MSLD v2.0, were used for training and validation. Transfer learning techniques were applied to fine-tune pre-trained CNN models by freezing initial layers and adding custom layers to adapt the final features to the mpox detection task and avoid overfitting. Model performance was evaluated using metrics such as accuracy, precision, recall, F1-score and ROC. Grad-CAM was utilized for visualizing critical features. Results: InceptionV3 demonstrated the best performance on the binary dataset with an accuracy of 95%, while MobileNetV2 performed best on the multi-class dataset with an accuracy of 93%. Grad-CAM successfully highlighted key image regions. Despite high accuracy, some models showed overfitting tendencies, as evidenced by discrepancies between training and validation losses. Conclusion: This study underscores the potential of pre-trained CNN models in monkeypox detection and the value of XAI techniques. Future work should address dataset limitations, incorporate multimodal data, and explore additional interpretability techniques to improve diagnostic reliability and model transparency.

[27] Is Tracking really more challenging in First Person Egocentric Vision?

Matteo Dunnhofer,Zaira Manigrasso,Christian Micheloni

Main category: cs.CV

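TL;DR: The paper questions the conclusion of prior work that first person egocentric vision is more challenging for visual object tracking and segmentation. It designs a new benchmark that explicitly separates the challenges of the first person viewpoint from those of the human-object activity domain, in order to identify the true sources of difficulty more precisely.

Details Motivation: Prior work holds that the first person viewpoint is more challenging for tracking and segmentation, but these conclusions may conflate the viewpoint with the complexity of the human-object activity domain. The paper aims to separate the two effects with a more precise evaluation.

Contribution: 1) Proposes a new benchmark that explicitly separates the challenges of the first person viewpoint from those of the human-object activity domain; 2) provides deeper insight into the sources of difficulty in egocentric tracking and segmentation.

Method: Designs a new evaluation strategy and benchmark that compare tracking and segmentation performance on first person and third person videos, separating the effect of viewpoint from that of the activity domain.

Result: The study finds that some challenges do not arise from the first person viewpoint alone but are tied to the human-object activity domain, which gives clearer guidance for future research.

Insight: The paper exposes a potential pitfall in current conclusions and stresses the importance of explicitly separating factors when designing benchmarks, so that progress can be better targeted.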

Abstract: Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements on this task.

[28] Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers

Andrew Lu,Wentinn Liao,Liuhui Wang,Huzheng Yang,Jianbo Shi

Main category: cs.CV

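TL;DR: The paper studies highly activated attention sinks (massive tokens) in ViTs and the artifact tokens that emerge during inference, finding that they suppress one another through the attention mechanism and affect information flow. It proposes Fast Nyström Attention (FNA), a training-free attention approximation with linear complexity, together with a masking strategy to reduce noise. Experiments show competitive performance across several tasks at lower computational cost.

Details Motivation: Vision Transformers (ViTs) are widely used, but their inner workings are not fully understood. In particular, the roles of massive tokens and artifact tokens have not been studied enough, and they may affect information flow and computational efficiency.

Contribution: 1. Reveals the roles of massive tokens and artifact tokens in ViTs; 2. proposes Fast Nyström Attention (FNA), an attention approximation with linear complexity; 3. designs a masking strategy that reduces noise and improves performance.

Method: 1. Analyzes the behavior of massive tokens and artifact tokens; 2. designs FNA around their structured patterns to approximate self-attention; 3. introduces a masking strategy to suppress noise.

Result: On retrieval, classification, segmentation, and VQA tasks, FNA performs competitively while reducing computational overhead.

Insight: Massive tokens and artifact tokens play a key role in ViTs, and a structured approximation can exploit them to make attention more efficient.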

Abstract: Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. In this work, we examine the phenomenon of massive tokens (tokens with exceptionally high activation norms that act as attention sinks) and artifact tokens that emerge as a byproduct during inference. Our analysis reveals that these tokens mutually suppress one another through the attention mechanism, playing a critical role in regulating information flow within the network. Leveraging these insights, we introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space by exploiting the structured patterns formed by massive and artifact tokens. Additionally, we propose a masking strategy to mitigate noise from these tokens, yielding modest performance gains at virtually no cost. We evaluate our approach on popular pretrained vision backbones and demonstrate competitive performance on retrieval, classification, segmentation, and visual question answering (VQA), all while reducing computational overhead.

[29] Discovering and using Spelke segments

Rahul Venkatesh,Klemen Kotar,Lilian Naing Chen,Seungwoo Kim,Luca Thomas Wheeler,Jared Watrous,Ashley Xu,Gia Ancone,Wanhee Lee,Honglin Chen,Daniel Bear,Stefan Stojanov,Daniel Yamins

Main category: cs.CV

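TL;DR: The paper builds on the Spelke object concept, introducing the SpelkeBench dataset and the SpelkeNet model, which discovers Spelke segments by predicting distributions over future motion. SpelkeNet outperforms conventional segmentation methods and performs well on downstream tasks.

Details Motivation: Conventional image segmentation relies on semantic annotation, whereas human perception is grounded more in physical motion. The paper aims to improve segmentation through the Spelke object concept (category-agnostic groupings based on physical motion relationships).

Contribution: 1. Introduces the SpelkeBench dataset; 2. develops SpelkeNet, which discovers Spelke segments by predicting motion distributions; 3. demonstrates the usefulness of Spelke segments in downstream tasks.

Method: SpelkeNet predicts distributions over future motion to produce motion affordance maps and expected-displacement maps, and uses statistical counterfactual probing (virtual pokes) to define Spelke segments.

Result: SpelkeNet outperforms supervised baselines such as SAM on SpelkeBench and improves object manipulation performance on 3DEditBench.

Insight: Segmentation grounded in physical motion may outperform conventional semantic segmentation, especially in downstream tasks that require physical interaction.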

Abstract: Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects: groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for “statistical counterfactual probing”, where diverse “virtual pokes” are applied on regions of high motion-affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.

[30] Stop-band Energy Constraint for Orthogonal Tunable Wavelet Units in Convolutional Neural Networks for Computer Vision problems

An D. Le,Hung Nguyen,Sungbal Seo,You-Suk Bae,Truong Q. Nguyen

Main category: cs.CV

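TL;DR: The paper proposes a stop-band energy constraint for optimizing filter design in orthogonal tunable wavelet units, aiming to improve CNN performance on image classification and anomaly detection, especially on texture-rich datasets.

Details Motivation: Although wavelet transforms perform well in image processing, their direct use in CNNs has had limited effect. The authors introduce a stop-band energy constraint to optimize the wavelet filters and make better use of the transform's properties to improve CNN performance.

Contribution: The main contribution is a stop-band energy constraint for optimizing the filters in orthogonal tunable wavelet units, integrated into ResNet, which improves performance on image classification and anomaly detection.

Method: The method consists of: 1) designing the stop-band energy constraint to optimize the wavelet filters; 2) integrating orthogonal tunable wavelet units into ResNet-18 and ResNet-34 to enhance convolution, pooling, and downsampling; 3) validating the approach on multiple datasets.

Result: Classification accuracy improves by 2.48% on CIFAR-10 and by 13.56% on the Describable Textures dataset. On the MVTec hazelnut anomaly detection task, the method performs well in both segmentation and detection.

Insight: Optimizing wavelet filter design lets CNNs exploit the properties of the wavelet transform and improves performance on texture-related tasks. The stop-band energy constraint also offers a new approach to wavelet unit design.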

Abstract: This work introduces a stop-band energy constraint for filters in orthogonal tunable wavelet units with a lattice structure, aimed at improving image classification and anomaly detection in CNNs, especially on texture-rich datasets. Integrated into ResNet-18, the method enhances convolution, pooling, and downsampling operations, yielding accuracy gains of 2.48% on CIFAR-10 and 13.56% on the Describable Textures dataset. Similar improvements are observed in ResNet-34. On the MVTec hazelnut anomaly detection task, the proposed method achieves competitive results in both segmentation and detection, outperforming existing approaches.
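To make the notion of stop-band energy concrete, the sketch below numerically integrates the squared magnitude response of a low-pass FIR filter over its stop band. It is a minimal illustration under stated assumptions: the stop-band edge of π/2, the midpoint-rule integration, and the function names are choices made here, and the paper applies its constraint to learnable lattice-structured filters rather than to the fixed Haar and Daubechies-2 filters used as examples.

```python
import cmath
import math
from typing import Sequence


def frequency_response(h: Sequence[float], omega: float) -> complex:
    """H(e^{j*omega}) = sum_n h[n] * exp(-j*omega*n) for an FIR filter h."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))


def stop_band_energy(
    h: Sequence[float], omega_s: float = math.pi / 2, num_points: int = 2048
) -> float:
    """Midpoint-rule estimate of the integral of |H|^2 over [omega_s, pi]."""
    width = (math.pi - omega_s) / num_points
    total = 0.0
    for k in range(num_points):
        omega = omega_s + (k + 0.5) * width
        total += abs(frequency_response(h, omega)) ** 2
    return total * width


# Two standard orthogonal low-pass filters.
haar = [1 / math.sqrt(2), 1 / math.sqrt(2)]
s3 = math.sqrt(3)
db2 = [c / (4 * math.sqrt(2)) for c in (1 + s3, 3 + s3, 3 - s3, 1 - s3)]

haar_energy = stop_band_energy(haar)
db2_energy = stop_band_energy(db2)
```

For the Haar filter the squared response is 1 + cos ω, so the integral has the closed form π/2 − 1, which gives a check on the numerical estimate. The longer Daubechies filter leaks less energy into the stop band, which is the quantity a constraint of this kind would penalize during training.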

[31] PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

Yaofang Liu,Yumeng Ren,Aitor Artola,Yuxuan Hu,Xiaodong Cun,Xiaotong Zhao,Alan Zhao,Raymond H. Chan,Suiyun Zhang,Rui Liu,Dandan Tu,Jean-Michel Morel

Main category: cs.CV

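TL;DR: Pusa V1.0 proposes vectorized timestep adaptation (VTA), surpassing Wan-I2V at low cost and showing zero-shot capability across multiple video generation tasks.

Details Motivation: Current video diffusion models are limited in temporal modeling: the conventional scalar timestep variable restricts the flexibility of frame evolution, and task-specific methods are computationally inefficient and generalize poorly.

Contribution: 1. Proposes VTA for fine-grained temporal control; 2. surpasses Wan-I2V at a training cost of $500; 3. demonstrates zero-shot multi-task capability.

Method: Fine-tunes a SOTA model with vectorized timestep adaptation (VTA), injecting temporal dynamics while preserving the base model's capabilities.

Result: Achieves a VBench-I2V total score of 87.32%, above the 86.86% of Wan-I2V, at no more than 1/200 of the training cost.

Insight: As a non-destructive adaptation, VTA preserves the generative priors and avoids the combinatorial explosion inherent to vectorized timesteps, offering a new paradigm for efficient video synthesis.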

Abstract: The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency – surpassing the performance of Wan-I2V-14B with $\leq$ 1/200 of the training cost ($500 vs. $\geq$ $100,000) and $\leq$ 1/2500 of the dataset size (4K vs. $\geq$ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension – all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model’s generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
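The difference between a scalar timestep and a vectorized one can be shown with a toy noising step in which every frame carries its own timestep. The sketch below is an illustration under assumptions, not the Pusa training code: the linear interpolation `x_t = (1 - t) * x_0 + t * noise`, the `noise_frames` helper, and the use of `t = 0` to keep a conditioning frame clean are all hypothetical choices.

```python
import random
from typing import List, Sequence

Frame = List[float]


def noise_frames(
    frames: Sequence[Frame], timesteps: Sequence[float], rng: random.Random
) -> List[Frame]:
    """Noise each frame with its own timestep in [0, 1].

    Assumed interpolation: x_t = (1 - t) * x_0 + t * noise, noise ~ N(0, 1).
    """
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noised = []
    for frame, t in zip(frames, timesteps):
        noised.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noised


clean = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Scalar timestep: every frame shares the same noise level.
scalar_t = [0.8, 0.8, 0.8]

# Vectorized timestep: the first frame stays clean and can act as the
# conditioning image, while later frames are noised.
vector_t = [0.0, 0.8, 0.8]

synced = noise_frames(clean, scalar_t, random.Random(0))
conditioned = noise_frames(clean, vector_t, random.Random(0))
```

With a per-frame timestep vector, the same denoiser can be asked to keep some frames fixed and generate others, which is one way to picture how tasks such as image-to-video or start-end-frame generation could share a single model.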

[32] Universal Wavelet Units in 3D Retinal Layer Segmentation

An D. Le,Hung Nguyen,Melanie Tran,Jesse Most,Dirk-Uwe G. Bartsch,William R Freeman,Shyamanga Borooah,Truong Q. Nguyen,Cheolhong An

Main category: cs.CV

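TL;DR: The paper is the first to apply tunable wavelet units (UwUs) to 3D retinal layer segmentation, introducing three wavelet-based downsampling modules that improve spatial detail and structural consistency on OCT volumes.

Details Motivation: Conventional max-pooling can lose high-frequency detail in 3D medical image segmentation. The authors use wavelet units to preserve both low- and high-frequency features and improve segmentation accuracy.

Contribution: 1. Applies tunable wavelet units (UwUs) to 3D retinal layer segmentation for the first time; 2. designs three wavelet-based downsampling modules (OrthLattUwU, BiorthLattUwU, and LS-BiorthLattUwU); 3. integrates these modules into the MGU-Net architecture, significantly improving segmentation performance.

Method: The authors integrate the three wavelet downsampling modules into a motion-corrected MGU-Net, using learnable lattice filter banks to preserve low- and high-frequency features.

Result: Experiments on the JRC OCT dataset show that the LS-BiorthLattUwU module performs best, significantly improving accuracy and Dice score.

Insight: Tunable wavelet units better preserve detail in medical images and offer a new approach to 3D medical image segmentation.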

Abstract: This paper presents the first study to apply tunable wavelet units (UwUs) for 3D retinal layer segmentation from Optical Coherence Tomography (OCT) volumes. To overcome the limitations of conventional max-pooling, we integrate three wavelet-based downsampling modules, OrthLattUwU, BiorthLattUwU, and LS-BiorthLattUwU, into a motion-corrected MGU-Net architecture. These modules use learnable lattice filter banks to preserve both low- and high-frequency features, enhancing spatial detail and structural consistency. Evaluated on the Jacobs Retina Center (JRC) OCT dataset, our framework shows significant improvement in accuracy and Dice score, particularly with LS-BiorthLattUwU, highlighting the benefits of tunable wavelet filters in volumetric medical image segmentation.
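
The difference between max-pooling and wavelet-based downsampling can be sketched in one dimension. The example below uses a fixed Haar filter bank as a stand-in for the learnable lattice filter banks described in the abstract; the function names and the toy signal are assumptions for illustration, and the input length must be even.

```python
import math


def max_pool(signal):
    """Stride-2 max-pooling: keeps one value per pair and discards the other."""
    return [max(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]


def haar_analysis(signal):
    """Stride-2 Haar filter bank: returns half-length (low, high) bands."""
    s = 1.0 / math.sqrt(2.0)
    low = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    high = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis by recombining the two bands."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([s * (a + d), s * (a - d)])
    return out


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 7.0, 3.0, 0.0]
pooled = max_pool(signal)            # half the samples, rest is lost
low, high = haar_analysis(signal)    # half-length bands, nothing lost
restored = haar_synthesis(low, high)
```

Both paths halve the resolution, but only the wavelet path keeps a high-frequency band from which the original signal can be rebuilt. A tunable unit would learn the filter coefficients instead of fixing them to Haar.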

[33] LongSplat: Online Generalizable 3D Gaussian Splatting from Long Sequence Images

Guichen Huang,Ruoyu Wang,Xiangjun Gao,Che Sun,Yuwei Wu,Shenghua Gao,Yunde Jia

Main category: cs.CV

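TL;DR: LongSplat is an online, real-time 3D Gaussian reconstruction framework for long image sequences. A streaming update mechanism and a Gaussian-Image Representation (GIR) enable efficient incremental updates and redundancy compression, improving reconstruction efficiency and memory usage.

Details Motivation: Existing 3D Gaussian Splatting methods either rely on slow per-scene optimization or lack efficient incremental updates, which limits them in online long-sequence scenarios. LongSplat aims to provide efficient online reconstruction.

Contribution: 1. A streaming update mechanism and the GIR representation, which support efficient incremental updates; 2. Selective compression of redundant historical Gaussians to reduce memory and computation; 3. Use of an existing image compression method to guide the generation of more compact, higher-quality Gaussians.

Method: 1. GIR encodes 3D Gaussian parameters into a structured, image-like 2D format; 2. The streaming mechanism incrementally fuses current-view observations with historical Gaussians; 3. Redundancy compression and image-compression guidance keep the Gaussian set compact.

Result: LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, reducing Gaussian counts by 44% compared with existing per-pixel Gaussian prediction methods.

Insight: Structured encoding combined with online incremental updates is an effective way to handle long-sequence 3D reconstruction, and compression can substantially reduce storage and computation.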

Abstract: 3D Gaussian Splatting achieves high-fidelity novel view synthesis, but its application to online long-sequence scenarios is still limited. Existing methods either rely on slow per-scene optimization or fail to provide efficient incremental updates, hindering continuous performance. In this paper, we propose LongSplat, an online real-time 3D Gaussian reconstruction framework designed for long-sequence image input. The core idea is a streaming update mechanism that incrementally integrates current-view observations while selectively compressing redundant historical Gaussians. Crucial to this mechanism is our Gaussian-Image Representation (GIR), a representation that encodes 3D Gaussian parameters into a structured, image-like 2D format. GIR simultaneously enables efficient fusion of current-view and historical Gaussians and identity-aware redundancy compression. These functions enable online reconstruction and adapt the model to long sequences without overwhelming memory or computational costs. Furthermore, we leverage an existing image compression method to guide the generation of more compact and higher-quality 3D Gaussians. Extensive evaluations demonstrate that LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction while reducing Gaussian counts by 44% compared to existing per-pixel Gaussian prediction methods.

[34] SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities

Yasser Ashraf,Ahmed Sharshar,Velibor Bojkovic,Bin Gu

Main category: cs.CV

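TL;DR: The paper introduces SPACT18, the first video action recognition (VAR) dataset captured with a spike camera, with synchronized complementary RGB and thermal modalities, to advance research on Spiking Neural Networks (SNNs) for energy-efficient, ultra-low-power video understanding.

Details Motivation: Event cameras capture motion by recording changes in light intensity, but their spatiotemporal resolution is limited. Spike cameras offer finer spatiotemporal resolution and a more precise representation of continuous changes, yet no standardized spike-camera dataset existed for action recognition.

Contribution: 1. SPACT18, the first multimodal action recognition dataset covering spike-camera, RGB, and thermal modalities; 2. A platform for directly comparing these modalities in research on energy-efficient SNNs.

Method: Spike-camera, RGB, and thermal video are captured synchronously, preserving the inherent sparsity and temporal precision of the spiking data, to build a standardized dataset for action recognition.

Result: SPACT18 offers a unique resource for multimodal video understanding and SNN research, and supports direct comparison of spiking, thermal, and RGB data.

Insight: Spike cameras show promise for multimodal action recognition, especially where high spatiotemporal resolution and low power consumption are required. The dataset is intended to drive further SNN research in video understanding.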

Abstract: Spike cameras, bio-inspired vision sensors, asynchronously fire spikes by accumulating light intensities at each pixel, offering ultra-high energy efficiency and exceptional temporal resolution. Unlike event cameras, which record changes in light intensity to capture motion, spike cameras provide even finer spatiotemporal resolution and a more precise representation of continuous changes. In this paper, we introduce the first video action recognition (VAR) dataset using a spike camera, alongside synchronized RGB and thermal modalities, to enable comprehensive benchmarking for Spiking Neural Networks (SNNs). By preserving the inherent sparsity and temporal precision of spiking data, our three datasets offer a unique platform for exploring multimodal video understanding and serve as a valuable resource for directly comparing spiking, thermal, and RGB modalities. This work contributes a novel dataset that will drive research in energy-efficient, ultra-low-power video understanding, specifically for action recognition tasks using spike-based data.

[35] LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation

Jyun-Ze Tang,Chih-Fan Hsu,Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen

Main category: cs.CV

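TL;DR: LSSGen performs resolution scaling directly in the latent space using a lightweight latent upsampler, improving both the efficiency and the visual quality of text-to-image generation.

Details Motivation: Conventional downscaling and upscaling in pixel space introduces artifacts and distortions, which degrade final image quality once the upscaled images are re-encoded into the latent space. LSSGen aims to avoid this.

Contribution: A latent space scaling generation framework that improves generation efficiency and image quality without altering the Transformer or U-Net architecture.

Method: LSSGen scales resolution directly in the latent space with a lightweight latent upsampler, avoiding the problems caused by pixel-space operations.

Result: When generating 1024^2 images at speeds similar to conventional scaling approaches, LSSGen improves the TOPIQ score by up to 246%.

Insight: Operating directly in the latent space can improve generative model performance while leaving the architecture unchanged, and it supports flexible multi-resolution generation.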

Abstract: Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose Latent Space Scaling Generation (LSSGen), a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating $1024^2$ images at similar speeds, it achieves up to 246% TOPIQ score improvement.

[36] AMMNet: An Asymmetric Multi-Modal Network for Remote Sensing Semantic Segmentation

Hui Ye,Haodong Chen,Zeke Zexi Hu,Xiaoming Chen,Yuk Ying Chung

Main category: cs.CV

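TL;DR: The paper proposes AMMNet, an asymmetric multi-modal network for remote sensing semantic segmentation that reduces architectural redundancy and addresses modality misalignment to achieve efficient and robust multi-modal fusion.

Details Motivation: In multi-modal remote sensing semantic segmentation, fusing RGB and DSM data often suffers from high computational complexity and modality misalignment, which hurt segmentation efficiency and accuracy, especially in complex urban environments.

Contribution: 1. An Asymmetric Dual Encoder (ADE) that allocates representational capacity according to modality characteristics; 2. An Asymmetric Prior Fuser (APF) for modality alignment; 3. A Distribution Alignment (DA) module that improves cross-modal compatibility.

Method: 1. ADE uses a deeper encoder for RGB imagery and a lightweight encoder for DSM; 2. APF performs modality-aware fusion through a prior matrix; 3. DA aligns feature distributions through divergence minimization.

Result: On the ISPRS Vaihingen and Potsdam datasets, AMMNet attains state-of-the-art segmentation accuracy among multi-modal networks while reducing computational and memory requirements.

Insight: Asymmetric design can balance efficiency and performance in multi-modal networks; modality-aware fusion and distribution alignment are key to robust segmentation.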

Abstract: Semantic segmentation in remote sensing (RS) has advanced significantly with the incorporation of multi-modal data, particularly the integration of RGB imagery and the Digital Surface Model (DSM), which provides complementary contextual and structural information about the ground object. However, integrating RGB and DSM often faces two major limitations: increased computational complexity due to architectural redundancy, and degraded segmentation performance caused by modality misalignment. These issues undermine the efficiency and robustness of semantic segmentation, particularly in complex urban environments where precise multi-modal integration is essential. To overcome these limitations, we propose Asymmetric Multi-Modal Network (AMMNet), a novel asymmetric architecture that achieves robust and efficient semantic segmentation through three designs tailored for RGB-DSM input pairs. To reduce architectural redundancy, the Asymmetric Dual Encoder (ADE) module assigns representational capacity based on modality-specific characteristics, employing a deeper encoder for RGB imagery to capture rich contextual information and a lightweight encoder for DSM to extract sparse structural features. Besides, to facilitate modality alignment, the Asymmetric Prior Fuser (APF) integrates a modality-aware prior matrix into the fusion process, enabling the generation of structure-aware contextual features. Additionally, the Distribution Alignment (DA) module enhances cross-modal compatibility by aligning feature distributions through divergence minimization. Extensive experiments on the ISPRS Vaihingen and Potsdam datasets demonstrate that AMMNet attains state-of-the-art segmentation accuracy among multi-modal networks while reducing computational and memory requirements.
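
The abstract says only that the DA module aligns feature distributions "through divergence minimization" and does not name the divergence. As a hedged illustration, the sketch below uses a symmetric KL divergence between softmax-normalized feature vectors; the choice of KL, the function names, and the toy vectors are assumptions, not the authors' definition.

```python
import math


def softmax(values):
    """Turn a feature vector into a probability distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions with strictly positive entries in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def alignment_loss(rgb_features, dsm_features):
    """Symmetric KL between the two modalities' normalized features."""
    p = softmax(rgb_features)
    q = softmax(dsm_features)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


rgb = [2.0, 0.5, -1.0, 0.0]
dsm_far = [-1.0, 0.0, 2.0, 0.5]    # poorly aligned with rgb
dsm_near = [1.9, 0.6, -0.9, 0.1]   # nearly aligned with rgb

loss_far = alignment_loss(rgb, dsm_far)
loss_near = alignment_loss(rgb, dsm_near)
```

Minimizing a loss of this kind during training pulls the RGB and DSM feature distributions toward each other, which is the general effect the DA module is described as having.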

[37] Explicit Context Reasoning with Supervision for Visual Tracking

Fansheng Zeng,Bineng Zhong,Haiying Xia,Yufei Tan,Xiantao Hu,Liangtao Shi,Shuxiang Song

Main category: cs.CV

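TL;DR: The paper proposes RSTrack, which explicitly models and supervises context reasoning to address the loosely defined, unsupervised contextual association used by conventional visual tracking algorithms.

Details Motivation: Mainstream tracking algorithms typically associate context by merely stacking historical information, without explicit supervision, which makes it hard to model the target's evolving dynamics.

Contribution: Three core mechanisms (context reasoning, forward supervision, and efficient state modeling) that explicitly model and supervise contextual association, improving temporal consistency in tracking.

Method: 1) Context Reasoning Mechanism: builds a target state reasoning pipeline; 2) Forward Supervision Strategy: constrains the reasoning with true target features; 3) Efficient State Modeling: extracts core target features with a compression-reconstruction mechanism.

Result: RSTrack achieves state-of-the-art performance on multiple benchmark datasets while running in real time.

Insight: Explicit supervision and efficient context modeling can improve the stability and accuracy of visual tracking.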

Abstract: Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target’s evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.

[38] LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs

Zitong Xu,Huiyu Duan,Bingnan Liu,Guangji Ma,Jiarui Wang,Liu Yang,Shiqi Gao,Xiaoyu Wang,Jia Wang,Xiongkuo Min,Guangtao Zhai,Weisi Lin

Main category: cs.CV

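TL;DR: The paper introduces EBench-18K, a large-scale benchmark with 18K edited images, and builds on it LMM4Edit, an LMM-based metric for evaluating image editing that covers perceptual quality, editing alignment, and related dimensions and aligns well with human preference.

Details Motivation: Current text-guided image editing (TIE) models struggle to balance image quality, editing alignment, and consistency with the original image, and existing benchmarks and metrics are limited in scale or in alignment with human perception.

Contribution: 1) EBench-18K, the first large-scale image editing benchmark; 2) LMM4Edit, an LMM-based metric that evaluates editing models along multiple dimensions.

Method: The authors collect 18K+ edited images and 55K+ human ratings, on which LMM4Edit is built to assess edited images in terms of perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy; zero-shot validation on other datasets tests its generalization.

Result: LMM4Edit performs strongly and aligns well with human preference; zero-shot validation on other datasets shows good generalization.

Insight: Using LMMs to evaluate image editing is effective: it covers multiple dimensions at once and tracks human perception.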

Abstract: The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics are limited in scale or in alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs’ understanding ability and human preferences. Then, we propose LMM4Edit, an LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.

[39] A Single-step Accurate Fingerprint Registration Method Based on Local Feature Matching

Yuwei Jia,Zhe Cui,Fei Su

Main category: cs.CV

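TL;DR: The paper proposes a single-step fingerprint registration method based on local feature matching. It directly predicts semi-dense matching point correspondences, avoiding the initial-registration failures that low-quality fingerprint images cause in conventional two-step methods.

Details Motivation: Conventional fingerprint registration uses two steps (an initial minutiae-based registration followed by dense registration). With low-quality images, too few minutiae are detected and the initial step often fails, so the whole registration fails. A more robust single-step method is needed.

Contribution: 1. An end-to-end single-step fingerprint registration algorithm; 2. Global-local attention for pixel-level alignment; 3. Evidence that single-step registration reaches state-of-the-art matching performance.

Method: The method directly predicts semi-dense matching point correspondences between two fingerprint images and uses global-local attention to achieve end-to-end alignment.

Result: Experiments show state-of-the-art matching performance with only single-step registration; the method can also be combined with dense registration algorithms for further gains.

Insight: Single-step registration avoids the failure risk of the conventional two-step pipeline and is better suited to low-quality images.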

Abstract: Distortion of fingerprint images degrades fingerprint recognition performance, and fingerprint registration can mitigate this distortion by accurately aligning two fingerprint images. Currently, fingerprint registration methods often consist of two steps: an initial registration based on minutiae, and a dense registration based on matching points. However, when the quality of the fingerprint image is low, fewer minutiae are detected, leading to frequent failures in the initial registration, which ultimately cause the entire fingerprint registration process to fail. In this study, we propose an end-to-end single-step fingerprint registration algorithm that aligns two fingerprints by directly predicting the semi-dense matching point correspondences between them. Thus, our method minimizes the risk of minutiae registration failure and also leverages global-local attention to achieve end-to-end pixel-level alignment between the two fingerprints. Experimental results show that our method achieves state-of-the-art matching performance with only single-step registration, and it can also be used in conjunction with dense registration algorithms for further performance improvements.

[40] Advancing Visual Large Language Model for Multi-granular Versatile Perception

Wentao Xiang,Haoxian Tan,Cong Wei,Yujie Zhong,Dengjie Li,Yujiu Yang

Main category: cs.CV

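TL;DR: The paper proposes MVP-LM, a multi-granular and versatile perception framework built on a visual large language model, which supports tasks across multiple prediction types and instruction types.

Details Motivation: Existing visual perception research usually covers only a few task combinations and lacks generality and versatility across tasks.

Contribution: MVP-LM integrates diverse visual tasks through a multi-granularity decoder and a dataset unification strategy.

Method: A multi-granularity decoder, a CoT-inspired dataset unification strategy, and a query enhancement strategy.

Result: MVP-LM performs well on word-based and sentence-based perception tasks across multiple benchmarks.

Insight: With a unified framework, visual large language models can serve a wider range of visual tasks.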

Abstract: Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.

[41] LDRFusion: A LiDAR-Dominant multimodal refinement framework for 3D object detection

Jijun Wang,Yan Wu,Yujian Mo,Junqiao Zhao,Jun Yan,Yinghao Hu

Main category: cs.CV

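TL;DR: LDRFusion is a LiDAR-dominant multimodal fusion framework for 3D object detection. Its two-stage refinement strategy improves detection accuracy and limits the noise that pseudo point clouds introduce when they are used to compensate for sparse point clouds.

Details Motivation: Existing LiDAR-camera fusion methods usually generate pseudo point clouds via depth completion as auxiliary input, but pseudo points can introduce noise and cause inaccurate predictions. LDRFusion uses a LiDAR-dominant two-stage refinement strategy to reduce the effect of this noise and improve detection performance.

Contribution: 1. LDRFusion, a LiDAR-dominant two-stage multimodal fusion framework. 2. A hierarchical pseudo point residual encoding module that strengthens the representation of local structures. 3. Strong performance on the KITTI dataset across multiple categories and difficulty levels.

Method: 1. The first stage relies solely on LiDAR to produce accurately localized proposals. 2. The second stage incorporates pseudo point clouds to detect challenging instances. 3. The hierarchical pseudo point residual encoding module encodes neighborhood sets using feature and positional residuals, improving the local-structure representation of pseudo point clouds.

Result: On the KITTI dataset, LDRFusion performs strongly across multiple categories and difficulty levels.

Insight: LiDAR is more reliable for localization, while the camera is more useful for supplementing features; staged fusion and local-structure encoding exploit each modality's strengths while suppressing noise.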

Abstract: Existing LiDAR-Camera fusion methods have achieved strong results in 3D object detection. To address the sparsity of point clouds, previous approaches typically construct spatial pseudo point clouds via depth completion as auxiliary input and adopt a proposal-refinement framework to generate detection results. However, introducing pseudo points inevitably brings noise, potentially resulting in inaccurate predictions. Considering the differing roles and reliability levels of each modality, we propose LDRFusion, a novel LiDAR-dominant two-stage refinement framework for multi-sensor fusion. The first stage relies solely on LiDAR to produce accurately localized proposals, followed by a second stage where pseudo point clouds are incorporated to detect challenging instances. The instance-level results from both stages are subsequently merged. To further enhance the representation of local structures in pseudo point clouds, we present a hierarchical pseudo point residual encoding module, which encodes neighborhood sets using both feature and positional residuals. Experiments on the KITTI dataset demonstrate that our framework consistently achieves strong performance across multiple categories and difficulty levels.
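
Encoding a neighborhood "using both feature and positional residuals" can be read as describing each neighbor relative to its center point. The sketch below shows a single level of such an encoding; the data layout, the function name `residual_encode`, and the concatenation order are assumptions, and the hierarchical stacking used in the paper is not reproduced.

```python
def residual_encode(center, neighbors):
    """Encode a neighborhood as residuals relative to its center point.

    Each point is a (position, feature) pair of float lists. The output row
    for a neighbor is its positional residual followed by its feature residual.
    """
    center_pos, center_feat = center
    encoded = []
    for pos, feat in neighbors:
        pos_res = [p - c for p, c in zip(pos, center_pos)]
        feat_res = [f - c for f, c in zip(feat, center_feat)]
        encoded.append(pos_res + feat_res)
    return encoded


center = ([1.0, 2.0, 0.5], [0.25, 0.75])
neighbors = [
    ([1.5, 2.0, 0.5], [0.25, 0.5]),
    ([1.0, 1.0, 1.0], [0.75, 0.75]),
]
codes = residual_encode(center, neighbors)

# Shifting the whole neighborhood leaves the residual code unchanged.
shift = [10.0, -4.0, 2.0]
moved_center = ([p + s for p, s in zip(center[0], shift)], center[1])
moved_neighbors = [([p + s for p, s in zip(pos, shift)], feat) for pos, feat in neighbors]
moved_codes = residual_encode(moved_center, moved_neighbors)
```

Because only differences are stored, the code describes local shape and feature contrast rather than absolute location, which is one reason residual encodings are commonly used for local structure in point clouds.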

[42] MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing

Shreelekha Revankar,Utkarsh Mall,Cheng Perng Phoo,Kavita Bala,Bharath Hariharan

Main category: cs.CV

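TL;DR: MONITRS is a new multimodal dataset that pairs satellite imagery with natural language annotations for natural disaster monitoring and response; fine-tuning existing models on it yields significant performance gains.

Details Motivation: Disaster response is hampered by the difficulty of accessing affected areas and by the limitations of existing remote sensing approaches (a narrow focus on specific disaster types, reliance on manual expert interpretation, and datasets with insufficient temporal granularity). MONITRS addresses these problems with multimodal data.

Contribution: 1. A multimodal dataset of more than 10,000 FEMA disaster events with satellite imagery, natural language annotations, and geotagged locations. 2. Evidence that fine-tuning existing multimodal large language models (MLLMs) on the dataset significantly improves disaster monitoring performance.

Method: 1. Collect and annotate multimodal data (satellite imagery, news articles, question-answer pairs). 2. Fine-tune existing MLLMs on these data to improve their understanding of, and responses to, disaster monitoring tasks.

Result: Fine-tuned models significantly outperform baselines on disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems.

Insight: Multimodal data, especially natural language annotations tied to temporal and spatial information, can overcome the limitations of single-modality approaches and improve the automation and accuracy of disaster monitoring.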

Abstract: Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters remotely. More recently, there have been advances in computer vision and deep learning that help automate satellite imagery analysis. However, these methods remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of more than 10,000 FEMA disaster events with temporal satellite imagery and natural language annotations from news articles, accompanied by geotagged locations, and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems. Code can be found at: https://github.com/ShreelekhaR/MONITRS

[43] Positive Style Accumulation: A Style Screening and Continuous Utilization Framework for Federated DG-ReID

Xin Xu,Chaoyue Ren,Wei Liu,Wenke Huang,Bin Yang,Zhixi Yu,Kui Jiang

Main category: cs.CV

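TL;DR: The paper proposes SSCU, a framework that screens and continuously utilizes positive styles in a federated learning setting to improve the generalization of person re-identification (ReID) models.

Details Motivation: Existing methods increase sample diversity through style transformation, but not all styles help generalization. Beneficial (positive) styles therefore need to be screened and continuously utilized, while harmful (negative) styles are avoided.

Contribution: 1. A Generalization Gain-guided Dynamic Style Memory (GGDSM) to screen and accumulate positive styles. 2. A style memory recognition loss that fully leverages the positive styles held in memory. 3. A Collaborative Style Training (CST) strategy that trains the model with both newly generated styles and accumulated positive styles.

Method: 1. GGDSM dynamically screens and stores positive styles. 2. The style memory recognition loss makes fuller use of the stored styles. 3. CST trains client models on two branches to improve style utilization.

Result: Experiments show that the method outperforms existing methods in both the source and target domains.

Insight: Screening and continuously utilizing positive styles is key to improving generalization, particularly in federated learning.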

Abstract: The Federated Domain Generalization for Person re-identification (FedDG-ReID) aims to learn a global server model that can be effectively generalized to source and target domains through distributed source domain data. Existing methods mainly improve the diversity of samples through style transformation, which to some extent enhances the generalization performance of the model. However, we discover that not all styles contribute to the generalization performance. Therefore, we define styles that are beneficial or harmful to the model’s generalization performance as positive or negative styles. Based on this, a new question arises: how can positive styles be effectively screened and continuously utilized? To solve this problem, we propose a Style Screening and Continuous Utilization (SSCU) framework. Firstly, we design a Generalization Gain-guided Dynamic Style Memory (GGDSM) for each client model to screen and accumulate generated positive styles. Meanwhile, we propose a style memory recognition loss to fully leverage the positive styles stored in the memory. Furthermore, we propose a Collaborative Style Training (CST) strategy to make full use of positive styles. Unlike traditional learning strategies, our approach leverages both newly generated styles and the accumulated positive styles stored in memory to train client models on two distinct branches. This training strategy is designed to effectively promote the rapid acquisition of new styles by the client models, and guarantees the continuous and thorough utilization of positive styles, which is highly beneficial for the model’s generalization performance. Extensive experimental results demonstrate that our method outperforms existing methods in both the source domain and the target domain.
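
The screening idea behind a gain-guided style memory can be illustrated with a toy class. Everything below is hypothetical: the abstract does not say how generalization gain is measured or how the memory is bounded, so this sketch assumes the gain is a difference in some validation accuracy and that the memory keeps only the highest-gain styles up to a fixed capacity.

```python
class StyleMemory:
    """Toy memory that keeps styles whose measured generalization gain is positive."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # list of (gain, style) pairs, highest gain first

    def screen(self, style, acc_before, acc_after):
        """Store the style if it improved accuracy; return whether it was kept as positive."""
        gain = acc_after - acc_before
        if gain <= 0.0:
            return False  # negative (or neutral) style: discard
        self.entries.append((gain, style))
        # Assumed eviction rule: keep only the highest-gain styles.
        self.entries.sort(key=lambda entry: entry[0], reverse=True)
        del self.entries[self.capacity:]
        return True

    def styles(self):
        return [style for _, style in self.entries]


memory = StyleMemory(capacity=2)
kept_a = memory.screen("style_a", acc_before=0.60, acc_after=0.64)
kept_b = memory.screen("style_b", acc_before=0.60, acc_after=0.58)
kept_c = memory.screen("style_c", acc_before=0.60, acc_after=0.70)
kept_d = memory.screen("style_d", acc_before=0.60, acc_after=0.61)
```

In a two-branch training loop of the kind the abstract describes, one branch would consume freshly generated styles while the other would replay `memory.styles()`.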

[44] Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling

Chao Zhou,Tianyi Wei,Nenghai Yu

Main category: cs.CV

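TL;DR: The paper proposes SaaS, which dynamically scales attention activations so that unified image generation models follow multiple sub-instructions more faithfully, addressing the text instruction neglect seen in existing models.

Details Motivation: Unified image generation models such as OmniGen handle diverse multimodal tasks well, but they neglect parts of the text instruction when it contains multiple sub-instructions, which hurts usability and task execution.

Contribution: Self-Adaptive Attention Scaling (SaaS), which dynamically scales attention activations to improve adherence to multiple sub-instructions without additional training or test-time optimization.

Method: Perturbation analysis identifies critical steps and layers; the consistency of cross-attention between adjacent timesteps is then used to dynamically scale the attention activation for each sub-instruction.

Result: On instruction-based image editing and visual conditional image generation, SaaS shows better instruction-following fidelity than existing methods.

Insight: Attention conflicts between neglected sub-instructions and input-image activations are a main cause of instruction neglect; dynamically scaling attention activations mitigates the problem and suits unified image generation models.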

Abstract: Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods. The code is available at https://github.com/zhouchao-ops/SaaS.

[45] HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery

Yu Wang,Bo Dang,Wanchun Li,Wei Chen,Yansheng Li

Main category: cs.CV

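TL;DR: HoliTracer is a framework for holistically extracting vectorized geographic objects from large-size remote sensing imagery. It combines a Context Attention Net (CAN) with a vectorization pipeline (MCR and PST) and outperforms existing methods.

Details Motivation: Existing methods are typically limited to small image patches, which loses contextual information and fragments the vector outputs. HoliTracer processes large-size imagery holistically to improve the accuracy and completeness of vector extraction.

Contribution: 1. HoliTracer, the first framework for holistic vector extraction from large-size remote sensing imagery. 2. The Context Attention Net (CAN), which captures local-to-global contextual dependencies. 3. The Mask Contour Reformer (MCR) and Polygon Sequence Tracer (PST), which form a robust vectorization pipeline.

Method: 1. CAN segments large-size imagery, using a local-to-global attention mechanism to strengthen context modeling. 2. MCR reconstructs polygons from masks and PST traces vertex sequences to produce the vector output.

Result: On large-size remote sensing datasets covering buildings, water bodies, and roads, HoliTracer outperforms existing methods.

Insight: 1. Holistic processing of large-size imagery is essential for preserving contextual information. 2. A local-to-global attention mechanism can capture the spatial dependencies of geographic objects. 3. A staged vectorization pipeline (MCR + PST) improves the quality and robustness of the results.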

Abstract: With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at https://github.com/vvangfaye/HoliTracer.

[46] Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models

Futa Waseda,Saku Sugawara,Isao Echizen

Main category: cs.CV

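TL;DR: The paper proposes QT-AFT, adversarial fine-tuning guided by high-quality text. It addresses the reliance of existing adversarial training methods for vision-language models on short texts, or their lack of semantic guidance, and improves zero-shot robustness and accuracy.

Details Motivation: Existing adversarial training methods for vision-language models have two main problems. Supervised methods rely on short texts (such as class labels), which leads to overfitting to object classes in the training data. Unsupervised methods lack semantic guidance and are therefore suboptimal against practical text-guided adversarial attacks.

Contribution: 1) QT-AFT, which uses high-quality text (such as image captions) to guide adversarial fine-tuning, making the visual encoder more robust under adversarial noise; 2) state-of-the-art robustness and accuracy across 16 zero-shot datasets; 3) insights into the role of language in visual robustness, for example that describing object properties further improves zero-shot robustness.

Method: QT-AFT uses high-quality captions during training to guide adversarial examples away from the diverse semantics present in images, so that the visual encoder learns to recognize a broader range of image features. This avoids both overfitting and the lack of semantic guidance.

Result: Across 16 zero-shot datasets, QT-AFT achieves state-of-the-art zero-shot adversarial robustness and clean accuracy. The study also finds that describing object properties in addition to object names further improves robustness.

Insight: High-quality linguistic supervision plays a key role in robust visual representation learning; future work should pay more attention to how the quality and diversity of language information affect model performance.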

Abstract: Defending pre-trained vision-language models (VLMs), such as CLIP, against adversarial attacks is crucial, as these models are widely used in diverse zero-shot tasks, including image classification. However, existing adversarial training (AT) methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. Specifically, (1) supervised AT methods rely on short texts (e.g., class labels) to generate adversarial perturbations, leading to overfitting to object classes in the training data, and (2) unsupervised AT avoids this overfitting but remains suboptimal against practical text-guided adversarial attacks due to its lack of semantic guidance. To address these limitations, we propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics present in images. This enables the visual encoder to robustly recognize a broader range of image features even under adversarial noise, thereby enhancing robustness across diverse downstream tasks. QT-AFT overcomes the key weaknesses of prior methods – overfitting in supervised AT and lack of semantic awareness in unsupervised AT – achieving state-of-the-art zero-shot adversarial robustness and clean accuracy, evaluated across 16 zero-shot datasets. Furthermore, our comprehensive study uncovers several key insights into the role of language in enhancing vision robustness; for example, describing object properties in addition to object names further enhances zero-shot robustness. Our findings point to an urgent direction for future work – centering high-quality linguistic supervision in robust visual representation learning.

[47] ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference

Haoyue Zhang,Jie Zhang,Song Guo

Main category: cs.CV

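TL;DR: ToFe is a framework for efficient Vision Transformer inference that temporarily freezes unimportant tokens and reuses them later, cutting computation while maintaining performance. On LV-ViT it reduces computational cost by 50% with a Top-1 accuracy drop of less than 2%.

Details Motivation: Vision Transformers are computationally expensive. Token reduction improves efficiency, but unimportant tokens are discarded irreversibly and cannot be reused by later blocks. ToFe manages tokens dynamically to balance performance and efficiency.

Contribution: 1. The ToFe framework, which freezes and later reuses tokens; 2. A prediction module for identifying important tokens and an approximate module for recovering frozen ones; 3. Computation budget-aware end-to-end training that optimizes the trade-off between performance and complexity.

Method: 1. A prediction module identifies important tokens; 2. An approximate module recovers frozen tokens; 3. Computation budget-aware joint training with the backbone lets the model adaptively process the necessary tokens at each block.

Result: On LV-ViT, ToFe reduces computational cost by 50% with less than a 2% drop in Top-1 accuracy, a better trade-off than state-of-the-art methods.

Insight: Managing tokens dynamically can balance Vision Transformer performance against computational cost, and token reuse offers a new direction for efficient Transformers.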

Abstract: Although vision transformers (ViT) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information among blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reusing at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of LV-ViT model by 50% with less than 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity compared to state-of-the-art methods.
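
The freeze-then-reuse flow described above can be sketched with scalar "tokens". This is a toy under stated assumptions, not the ToFe implementation: the importance scores are given rather than predicted, the block and the approximation are simple lambdas, and the helper names `split_tokens`, `run_stage`, and `reuse_frozen` are hypothetical.

```python
def split_tokens(tokens, scores, keep):
    """Return (active, frozen) index lists; the `keep` highest-scoring tokens stay active."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep]), sorted(order[keep:])


def run_stage(tokens, active, block):
    """Apply `block` only to active tokens; frozen tokens pass through untouched."""
    out = list(tokens)
    for i in active:
        out[i] = block(tokens[i])
    return out


def reuse_frozen(tokens, frozen, approximate):
    """Bring frozen tokens back with a cheap approximation before a later stage."""
    out = list(tokens)
    for i in frozen:
        out[i] = approximate(tokens[i])
    return out


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.8, 0.2]  # stand-in for the prediction module's output

active, frozen = split_tokens(tokens, scores, keep=2)
stage1 = run_stage(tokens, active, block=lambda x: x * 2.0)       # 2 of 4 tokens computed
merged = reuse_frozen(stage1, frozen, approximate=lambda x: x + 0.5)
stage2 = run_stage(merged, range(len(merged)), block=lambda x: x * 2.0)
```

Unlike irreversible token pruning, the frozen tokens are still present after the first stage, so a later stage can draw on them again once they have been approximately updated.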

[48] MAN++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks

Junhao Su,Feiyu Zhu,Hengyu Shi,Tianyang Han,Yurui Qiu,Junfeng Luo,Xiaoming Wei,Jialin Gao

Main category: cs.CV

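TL;DR: MAN++ is a supervised local learning method for vision tasks. It introduces a dynamic interaction mechanism and a learnable scaling bias to address the poor information flow between blocks in conventional local learning, reaching performance close to end-to-end training while significantly reducing GPU memory consumption.

Details Motivation: End-to-end backpropagation suffers from update locking, high GPU memory consumption, and a lack of biological plausibility. Supervised local learning mitigates these issues by partitioning the network into local blocks that are updated independently, but the resulting performance drop has kept it from replacing end-to-end training. MAN++ aims to improve information flow between local blocks.

Contribution: Proposes MAN++, which uses an EMA mechanism to strengthen dynamic interaction between local blocks and introduces a learnable scaling bias to balance feature discrepancies, significantly improving supervised local learning performance.

Method: The auxiliary network is updated with the EMA of parameters from adjacent blocks to improve inter-block information flow; a learnable scaling bias addresses the suboptimal results of applying EMA parameters directly.

Result: On image classification, object detection, and image segmentation, MAN++ performs comparably to end-to-end training while significantly reducing GPU memory usage.

Insight: Dynamic interaction between blocks and balancing of feature discrepancies are key to improving supervised local learning. MAN++ offers a viable alternative to conventional training methods.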

Abstract: Deep learning typically relies on end-to-end backpropagation for training, a method that inherently suffers from issues such as update locking during parameter optimization, high GPU memory consumption, and a lack of biological plausibility. In contrast, supervised local learning seeks to mitigate these challenges by partitioning the network into multiple local blocks and designing independent auxiliary networks to update each block separately. However, because gradients are propagated solely within individual local blocks, performance degradation occurs, preventing supervised local learning from supplanting end-to-end backpropagation. To address these limitations and facilitate inter-block information flow, we propose the Momentum Auxiliary Network++ (MAN++). MAN++ introduces a dynamic interaction mechanism by employing the Exponential Moving Average (EMA) of parameters from adjacent blocks to enhance communication across the network. The auxiliary network, updated via EMA, effectively bridges the information gap between blocks. Notably, we observed that directly applying EMA parameters can be suboptimal due to feature discrepancies between local blocks. To resolve this issue, we introduce a learnable scaling bias that balances feature differences, thereby further improving performance. We validate MAN++ through extensive experiments on tasks that include image classification, object detection, and image segmentation, utilizing multiple network architectures. The experimental results demonstrate that MAN++ achieves performance comparable to end-to-end training while significantly reducing GPU memory usage. Consequently, MAN++ offers a novel perspective for supervised local learning and presents a viable alternative to conventional training methods.
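
The EMA update described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: parameters are flat lists of floats, the `momentum` value is an arbitrary assumption, and the learnable scaling bias is modeled here as a simple additive per-feature offset, which the abstract does not specify.

```python
def ema_update(aux_params, block_params, momentum=0.99):
    """Move auxiliary-network parameters toward the adjacent block's parameters."""
    return [momentum * a + (1.0 - momentum) * b
            for a, b in zip(aux_params, block_params)]


def apply_scaling_bias(features, bias):
    """Add a per-feature offset (assumed form of the learnable bias)."""
    return [f + b for f, b in zip(features, bias)]
```

A momentum close to 1 makes the auxiliary network track the adjacent block slowly, which is the usual reason for choosing an exponential moving average over a direct parameter copy.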

[49] Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition

Zefeng Qian,Xincheng Yao,Yifei Huang,Chongyang Zhang,Jiangyong Ying,Hong Sun

Main category: cs.CV

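TL;DR: This paper proposes LGA, a framework that uses a large language model (LLM) and a Visual Anatomy Module to uncover the core representational characteristics of actions beyond label semantics, improving few-shot action recognition.

Details Motivation: Few-shot action recognition (FSAR) suffers from data scarcity, and methods that rely on action labels alone cannot fully capture the spatiotemporal dynamics of actions. The paper combines language and vision to extract fine-grained action representations.

Contribution: 1. Proposes the LGA framework, which combines an LLM with a Visual Anatomy Module to uncover the core characteristics hidden beneath action labels; 2. Designs an atomic-level multimodal fusion strategy; 3. Introduces a Multimodal Matching mechanism to make classification more robust.

Method: 1. An LLM anatomizes each action label into a sequence of atomic action descriptions (subject, motion, object); 2. A Visual Anatomy Module segments actions into atomic video phases; 3. Textual and visual features are fused at the atomic level; 4. A Multimodal Matching mechanism performs few-shot classification.

Result: LGA achieves state-of-the-art performance across multiple FSAR benchmarks.

Insight: The high-level semantics provided by language models act as a strong prior for action recognition; combining them with fine-grained visual representations substantially improves few-shot performance.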

Abstract: Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, motion dynamics, and the object interactions that occur during different phases are critical inherent knowledge of actions that cannot be fully exploited by action labels alone. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics by leveraging Large Language Models (LLMs) to dissect the essential representational characteristics hidden beneath action labels. Guided by the prior knowledge encoded in the LLM, LGA effectively captures rich spatiotemporal cues in few-shot scenarios. Specifically, for text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of action (subject, motion, object). For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure robust few-shot classification. Experimental results demonstrate that LGA achieves state-of-the-art performance across multiple FSAR benchmarks.

[50] Dens3R: A Foundation Model for 3D Geometry Prediction

Xianze Fang,Jingnan Gao,Zhe Wang,Zhuo Chen,Xingyu Ren,Jiangjing Lyu,Qiaomu Ren,Zhonglei Yang,Xiaokang Yang,Yichao Yan,Chengfei Lyu

Main category: cs.CV

TL;DR: Dens3R is a 3D foundation model for joint geometric prediction, addressing the challenge of inconsistency in isolated geometric quantity predictions through a unified framework. It employs a two-stage training approach and a shared encoder-decoder backbone, achieving accurate and consistent results.

Details Motivation: Existing methods for 3D reconstruction often predict geometric quantities (e.g., depth, normals) in isolation, leading to inconsistency. Dens3R aims to unify these predictions by modeling their structural coupling.

Contribution: Dens3R introduces a foundation model for joint 3D geometric prediction, featuring a two-stage training framework, shared encoder-decoder backbone, and position-interpolated rotary encoding.

Method: The method uses a lightweight shared encoder-decoder backbone and position-interpolated rotary encoding. It integrates image-pair matching for intrinsic invariance and includes a post-processing pipeline for multi-view consistency.

Result: Experiments show Dens3R outperforms existing methods in accuracy and consistency for diverse 3D prediction tasks.

Insight: Joint prediction of correlated geometric quantities improves consistency and accuracy, making Dens3R a versatile tool for 3D reconstruction applications.

Abstract: Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
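
To make the idea of position interpolation concrete, the sketch below computes rotary-encoding angles for a 1D position and rescales positions when the test sequence is longer than the training length. It follows the generic position-interpolation recipe and is only an assumption about how such an encoding might look; the abstract does not give Dens3R's exact formulation, and the function and parameter names are hypothetical.

```python
def rope_angles(position, dim, train_len, test_len, base=10000.0):
    """Rotary angles for one position, with positions squeezed into the trained range."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # Interpolate only when the input is longer than what was seen in training.
    scale = min(1.0, train_len / test_len)
    pos = position * scale
    # One angle per pair of feature channels, with geometrically spaced frequencies.
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]
```

With this rescaling, a position at the end of a longer input maps to an angle the model already saw during training, which is the usual motivation for interpolating instead of extrapolating positions at higher resolutions.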

[51] MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Yanchen Liu,Yanan Sun,Zhening Xing,Junyao Gao,Kai Chen,Wenjie Pei

Main category: cs.CV

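TL;DR: MotionShot is a training-free framework that parses fine-grained reference-target correspondences to achieve high-fidelity motion transfer while preserving appearance coherence.

Details Motivation: Existing text-to-video methods struggle to transfer motion smoothly when the reference and target objects differ significantly in appearance or structure.

Contribution: MotionShot combines semantic feature matching, morphological alignment, and temporal attention encoding to achieve high-fidelity motion transfer across objects.

Method: Semantic feature matching provides high-level alignment, reference-to-target shape retargeting provides low-level morphological alignment, and temporal attention encodes the motion to be transferred.

Result: Extensive experiments show that MotionShot transfers motion coherently even under significant appearance and structure disparities.

Insight: Fine-grained correspondence parsing and temporal attention are key to improving motion transfer quality.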

Abstract: Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. To be specific, MotionShot first performs semantic feature matching to ensure high-level alignments between the reference and target objects. It then further establishes low-level morphological alignments through reference-to-target shape retargeting. By encoding motion with temporal attention, our MotionShot can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities, demonstrated by extensive experiments. The project page is available at: https://motionshot.github.io/.

[52] M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision

Kailai Zhou,Fuqiang Yang,Shixian Wang,Bihan Wen,Chongde Zi,Linsen Chen,Qiu Shen,Xun Cao

Main category: cs.CV

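TL;DR: M-SpecGene is a generalized multispectral foundation model that learns modality-invariant representations from large-scale data in a self-supervised manner. It addresses the reliance of existing RGBT tasks on manually customized models and improves pre-training through the CMSS metric and the GMM-CMSS masking strategy.

Details Motivation: Existing RGBT multispectral vision tasks rely on manually customized models and are constrained by artificial inductive bias, modality bias, and data bottlenecks. A general-purpose foundation model is needed to address these issues.

Contribution: 1. Makes an initial attempt at a generalized RGBT multispectral foundation model, M-SpecGene; 2. Introduces the CMSS metric to quantify information density across the two modalities; 3. Develops the GMM-CMSS progressive masking strategy to improve pre-training.

Method: 1. Self-supervised learning extracts modality-invariant representations from large-scale data; 2. The CMSS metric measures information density across modalities; 3. The GMM-CMSS progressive masking strategy enables flexible, easy-to-hard, object-centric pre-training.

Result: Experiments on 11 datasets validate M-SpecGene's generalizability across 4 RGBT downstream tasks.

Insight: M-SpecGene integrates scattered case-by-case RGBT studies into a unified paradigm, and it addresses the information imbalance between modalities by quantifying information density and masking progressively.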

Abstract: RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene’s generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.

[53] Scene Text Detection and Recognition “in light of” Challenging Environmental Conditions using Aria Glasses Egocentric Vision Cameras

Joseph De Mathia,Carlos Francisco Moreno-García

Main category: cs.CV

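TL;DR: This paper studies how environmental conditions affect Scene Text Detection and Recognition (STDR) on Meta’s Project Aria smart glasses, and presents a custom dataset together with an evaluation of two OCR pipelines. Resolution and distance are the main factors, and image upscaling is an effective pre-processing technique.

Details Motivation: As wearable technology spreads, it becomes important to understand how STDR performs under real-world environmental conditions, particularly with egocentric cameras such as Project Aria.

Contribution: 1. Introduces a custom STDR dataset; 2. Evaluates two OCR pipelines; 3. Finds that resolution and distance significantly influence recognition accuracy; 4. Demonstrates the potential of eye-gaze tracking to optimise processing efficiency.

Method: Data is captured with Project Aria glasses; two OCR pipelines, EAST with CRNN and EAST with PyTesseract, are compared; pre-processing techniques such as image upscaling are tested.

Result: Image upscaling reduces the Character Error Rate (CER) from 0.65 to 0.48. Resolution and distance are the main factors, while lighting plays a less predictable role.

Insight: Eye-gaze tracking can be used to optimise STDR processing efficiency, laying the groundwork for adaptive, user-aware AR systems.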

Abstract: In an era where wearable technology is reshaping applications, Scene Text Detection and Recognition (STDR) becomes a straightforward choice through the lens of egocentric vision. Leveraging Meta’s Project Aria smart glasses, this paper investigates how environmental variables, such as lighting, distance, and resolution, affect the performance of state-of-the-art STDR algorithms in real-world scenarios. We introduce a novel, custom-built dataset captured under controlled conditions and evaluate two OCR pipelines: EAST with CRNN, and EAST with PyTesseract. Our findings reveal that resolution and distance significantly influence recognition accuracy, while lighting plays a less predictable role. Notably, image upscaling emerged as a key pre-processing technique, reducing Character Error Rate (CER) from 0.65 to 0.48. We further demonstrate the potential of integrating eye-gaze tracking to optimise processing efficiency by focusing on user attention zones. This work not only benchmarks STDR performance under realistic conditions but also lays the groundwork for adaptive, user-aware AR systems. Our contributions aim to inspire future research in robust, context-sensitive text recognition for assistive and research-oriented applications, such as asset inspection and nutrition analysis. The code is available at https://github.com/josepDe/Project_Aria_STR.
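
For readers unfamiliar with the metric, Character Error Rate is commonly computed as the Levenshtein edit distance between the recognised text and the reference, divided by the reference length. The sketch below follows that common definition; whether the paper normalises in exactly this way is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate of hypothesis `hyp` against a non-empty reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition a CER of 0.48 means that, on average, roughly one edit is needed for every two reference characters, and the value can exceed 1 when the hypothesis is much longer than the reference.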

[54] One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution

Xinyu Mao,Xiaohan Xing,Fei Meng,Jianbang Liu,Fan Bai,Qiang Nie,Max Meng

Main category: cs.CV

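TL;DR: This paper proposes OP-SAM, a one-shot polyp segmentation framework based on SAM (Segment Anything Model) that uses cascaded priors and iterative prompt evolution to achieve accurate segmentation without additional annotation.

Details Motivation: Polyp segmentation is vital for early colorectal cancer detection, but traditional methods struggle with morphological variability and domain shifts and require large-scale annotation. SAM generalizes well but depends on manually supplied prompts, which limits automation.

Contribution: 1. Proposes the OP-SAM framework, which generates prompts automatically from a single annotated image; 2. Introduces Correlation-based Prior Generation (CPG) and Scale-cascaded Prior Fusion (SPF) to handle noise and size variation; 3. Proposes Euclidean Prompt Evolution (EPE) to refine the segmentation iteratively.

Method: 1. CPG transfers semantic labels; 2. SPF adapts to polyp size variations and filters out noisy transfers; 3. EPE iteratively refines prompts to improve segmentation quality.

Result: On Kvasir, OP-SAM achieves 76.93% IoU, surpassing the state of the art by 11.44%.

Insight: With a single annotated image and automatic prompt generation, OP-SAM removes SAM's automation bottleneck while substantially improving polyp segmentation, offering an efficient option for medical image segmentation.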

Abstract: Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM’s prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM’s effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%.
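
The IoU figure quoted above is the intersection-over-union of the predicted and ground-truth masks. A minimal sketch for binary masks stored as nested lists follows; returning 1.0 when both masks are empty is a convention chosen here, not something stated in the paper.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly under the convention used here.
    return inter / union if union else 1.0
```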

[55] Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model

Mingtao Guo,Guanyu Xing,Yanci Zhang,Yanli Liu

Main category: cs.CV

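TL;DR: This paper proposes the Face Reenactment Video Diffusion model (FRVD) for high-fidelity face reenactment under large pose changes. It extracts implicit keypoints, aligns motion by warping, and exploits the latent space of a pretrained image-to-video model to improve reenactment quality.

Details Motivation: Existing face reenactment methods struggle with large pose variations, either producing warping artifacts or being limited by coarse facial landmarks. FRVD is proposed to overcome this.

Contribution: 1. Proposes FRVD, which targets high-fidelity face reenactment under large pose changes. 2. Introduces the Warping Feature Mapper (WFM), which uses the latent space of a pretrained I2V model to correct warping degradation and improve temporal coherence.

Method: 1. A motion extractor extracts implicit keypoints. 2. A warping module performs motion alignment. 3. The WFM maps the warped source image into the latent space of a pretrained I2V model.

Result: Experiments show that FRVD outperforms existing methods in pose accuracy, identity preservation, and visual quality, especially in scenarios with extreme pose variations.

Insight: The latent space of a pretrained image-to-video model encodes rich priors of facial dynamics that can correct warping degradation and improve temporal coherence; the idea may extend to other video generation tasks.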

Abstract: Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.

[56] Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video

Alessandro Sebastiano Catinello,Giovanni Maria Farinella,Antonino Furnari

Main category: cs.CV

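TL;DR: Mamba-OTR is a Mamba-based model for Online detection of Take and Release (OTR) of objects in untrimmed egocentric video. It addresses label imbalance and computational efficiency and outperforms a transformer-based approach.

Details Motivation: Online take and release detection in untrimmed egocentric video is challenging because of severe label imbalance, the need for precise temporal predictions, and the requirement for computational efficiency.

Contribution: Proposes Mamba-OTR, which combines the temporal recurrence of the Mamba architecture with a novel regularization scheme, improving both accuracy and efficiency on the OTR task.

Method: The model is built on the Mamba architecture and exploits temporal recurrence at inference time; focal loss and a novel regularization scheme address label imbalance.

Result: On EPIC-KITCHENS-100, Mamba-OTR reaches an mp-mAP of 45.48 in sliding-window mode and 43.35 in streaming mode, versus 20.32 for a vanilla transformer and 25.16 for a vanilla Mamba.

Insight: The advantages of Mamba-OTR are most evident on full-length videos and high frame-rate sequences, even though the model is trained on short video snippets.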

Abstract: This work tackles the problem of Online detection of Take and Release (OTR) of an object in untrimmed egocentric videos. This task is challenging due to severe label imbalance, with temporally sparse positive annotations, and the need for precise temporal predictions. Furthermore, methods need to be computationally efficient in order to be deployed in real-world online settings. To address these challenges, we propose Mamba-OTR, a model based on the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. To address label imbalance, our training pipeline incorporates the focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric. Extensive experiments on EPIC-KITCHENS-100, comparisons with a transformer-based approach, and evaluations of different training and test schemes demonstrate the superiority of Mamba-OTR in both accuracy and efficiency. These findings are particularly evident when evaluating full-length videos or high frame-rate sequences, even when trained on short video snippets for computational convenience. The proposed Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a vanilla Mamba, thus providing a strong baseline for OTR. We will publicly release the source code of Mamba-OTR to support future research.
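
The focal loss mentioned above down-weights easy examples so that the sparse positive frames contribute more to training. The sketch below implements the standard binary focal loss for a single prediction; the `gamma` and `alpha` defaults are the values commonly used with this loss and are not taken from the Mamba-OTR paper, whose regularization scheme is also not shown.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability `p` of the positive class and label `y`."""
    # Probability and class weight assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for confidently wrong predictions.
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0` recovers a class-weighted cross-entropy, which makes it easy to check how much the modulating factor changes the contribution of easy frames.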

[57] From Flat to Round: Redefining Brain Decoding with Surface-Based fMRI and Cortex Structure

Sijin Yu,Zijiao Chen,Wenxuan Wu,Shengxian Chen,Zhongliang Liu,Jingxin Nie,Xiaofen Xing,Xiangmin Xu,Xin Zhang

Main category: cs.CV

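TL;DR: Proposes a cortical-surface sphere tokenizer and the integration of structural MRI for more accurate fMRI decoding, together with a positive-sample mixup strategy that improves reconstruction.

Details Motivation: Existing methods overlook the relationship between brain structure and function and make little use of individual anatomical information, which limits reconstruction accuracy and interpretability.

Contribution: 1. A sphere tokenizer; 2. Integration of sMRI data; 3. A positive-sample mixup strategy. Together these improve reconstruction accuracy and generalization across individuals.

Method: 1. fMRI signals are modeled as 2D spherical data on the cortical surface; 2. sMRI is used to encode individual anatomy; 3. A positive-sample mixup strategy makes better use of multiple scans of the same stimulus.

Result: Experiments show better reconstruction performance than SOTA methods, with improved biological interpretability.

Insight: Combining brain structure with multimodal data can substantially improve decoding while making the method more interpretable.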

Abstract: Reconstructing visual stimuli from human brain activity (e.g., fMRI) bridges neuroscience and computer vision by decoding neural representations. However, existing methods often overlook critical brain structure-function relationships, flattening spatial information and neglecting individual anatomical variations. To address these issues, we propose (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data, enabling personalized encoding of individual anatomical variations; and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus. Collectively, these innovations enhance reconstruction accuracy, biological interpretability, and generalizability across individuals. Experiments demonstrate superior reconstruction performance compared to SOTA methods, highlighting the effectiveness and interpretability of our biologically informed approach.
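
One plausible reading of the positive-sample mixup strategy is a convex combination of several fMRI scans recorded for the same visual stimulus. The sketch below implements that reading; the random weighting scheme and the flat-list representation of a scan are assumptions, and the paper's actual sampling procedure may differ.

```python
import random


def positive_mixup(scans, rng=None):
    """Mix scans of the same stimulus with random weights that sum to 1."""
    rng = rng or random.Random()
    # Draw strictly positive raw weights, then normalise them.
    raw = [rng.random() + 1e-9 for _ in scans]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Weighted average, computed position by position.
    mixed = [sum(w * scan[i] for w, scan in zip(weights, scans))
             for i in range(len(scans[0]))]
    return mixed, weights
```

Because all inputs share the same stimulus label, the mixed sample keeps that label unchanged, unlike classic mixup, which also interpolates the targets.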

[58] ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

Thuy-Duong Tran,Trung-Kien Tran,Manfred Hauswirth,Danh Le Phuoc

Main category: cs.CV

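TL;DR: This paper proposes ReasonVQA, a new Visual Question Answering (VQA) dataset that integrates structured encyclopedic knowledge and uses a low-cost framework to generate complex multi-hop questions, posing significant challenges to existing VQA models.

Details Motivation: Existing VQA datasets fall short on questions that require external knowledge and complex reasoning; ReasonVQA aims to fill this gap.

Contribution: 1. Proposes the ReasonVQA dataset, which integrates structured knowledge and supports complex multi-hop questions; 2. Uses a low-cost framework that makes the dataset easy to scale; 3. Shows experimentally that current VQA models perform poorly on the dataset, underlining its difficulty.

Method: Structured encyclopedic knowledge is integrated automatically, a low-cost framework generates multi-hop questions, and state-of-the-art VQA models are evaluated on the result.

Result: The current version of ReasonVQA exceeds the largest existing datasets requiring external knowledge by more than an order of magnitude, and existing VQA models perform poorly on it.

Insight: ReasonVQA provides a new benchmark for VQA tasks that require complex reasoning and external knowledge.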

Abstract: In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
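
To illustrate what a multi-hop question over structured knowledge looks like, the sketch below chains two (subject, relation, object) triples starting from the entity depicted in an image. The triple format, the question template, and the function name are all hypothetical; the abstract does not describe the actual generation framework.

```python
def two_hop_questions(triples, start_entity):
    """Build (question, answer) pairs by following two relations from `start_entity`."""
    by_subject = {}
    for subject, relation, obj in triples:
        by_subject.setdefault(subject, []).append((relation, obj))
    questions = []
    # First hop: image entity -> intermediate entity.
    for rel1, mid in by_subject.get(start_entity, []):
        # Second hop: intermediate entity -> answer.
        for rel2, answer in by_subject.get(mid, []):
            text = f"What is the {rel2} of the {rel1} of the entity shown in the image?"
            questions.append((text, answer))
    return questions
```

Answering such a question requires recognising the entity in the image and then retrieving two facts in sequence, which is why purely visual models tend to struggle with it.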

[59] Sparse-View 3D Reconstruction: Recent Advances and Open Challenges

Tanveer Younis,Zhanglin Cheng

Main category: cs.CV

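TL;DR: This survey reviews recent advances and open challenges in sparse-view 3D reconstruction, comparing neural implicit models, explicit point-cloud-based approaches, and hybrid frameworks, and outlines future research directions.

Details Motivation: Sparse-view 3D reconstruction is essential in robotics, AR/VR, and autonomous systems, but minimal image overlap prevents reliable correspondence matching, so traditional methods such as SfM and MVS fail and new techniques are needed.

Contribution: The survey offers a unified analysis of geometry-based, neural implicit, and generative (diffusion-based) methods, reveals their trade-offs in sparse-view reconstruction (accuracy, efficiency, and generalization), and proposes future directions.

Method: Reviews neural implicit models (e.g., NeRF and its regularized versions), explicit point-cloud-based approaches (e.g., 3D Gaussian Splatting), and hybrid frameworks that leverage priors from diffusion and vision foundation models.

Result: Compares methods on standard benchmarks and shows how geometric regularization, explicit shape modeling, and generative inference mitigate artifacts such as floaters and pose ambiguities in sparse-view settings.

Insight: Persistent challenges include domain generalization and pose-free reconstruction; future directions include 3D-native generative priors and real-time, unconstrained sparse-view reconstruction.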

Abstract: Sparse-view 3D reconstruction is essential for applications in which dense image acquisition is impractical, such as robotics, augmented/virtual reality (AR/VR), and autonomous systems. In these settings, minimal image overlap prevents reliable correspondence matching, causing traditional methods, such as structure-from-motion (SfM) and multiview stereo (MVS), to fail. This survey reviews the latest advances in neural implicit models (e.g., NeRF and its regularized versions), explicit point-cloud-based approaches (e.g., 3D Gaussian Splatting), and hybrid frameworks that leverage priors from diffusion and vision foundation models (VFMs). We analyze how geometric regularization, explicit shape modeling, and generative inference are used to mitigate artifacts such as floaters and pose ambiguities in sparse-view settings. Comparative results on standard benchmarks reveal key trade-offs between the reconstruction accuracy, efficiency, and generalization. Unlike previous reviews, our survey provides a unified perspective on geometry-based, neural implicit, and generative (diffusion-based) methods. We highlight the persistent challenges in domain generalization and pose-free reconstruction and outline future directions for developing 3D-native generative priors and achieving real-time, unconstrained sparse-view reconstruction.

[60] Towards Railway Domain Adaptation for LiDAR-based 3D Detection: Road-to-Rail and Sim-to-Real via SynDRA-BBox

Xavier Diaz,Gianluca D’Amico,Raul Dominguez-Sanchez,Federico Nesti,Max Ronecker,Giorgio Buttazzo

Main category: cs.CV

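TL;DR: This paper presents SynDRA-BBox, a synthetic dataset designed for 2D and 3D object detection in the railway domain, addressing the lack of publicly available real-world annotated data. A semi-supervised domain adaptation method is used to show that synthetic data is useful for railway 3D object detection.

Details Motivation: The railway sector lacks publicly available real-world annotated datasets, which limits the development and validation of vision-based perception algorithms and motivates the use of synthetic data.

Contribution: Introduces SynDRA-BBox, to the authors' knowledge the first synthetic dataset tailored to 2D and 3D object detection in the railway domain, and evaluates how well domain adaptation transfers synthetic data to real railway scenes.

Method: A state-of-the-art semi-supervised domain adaptation method, originally developed for automotive perception, is adapted to the railway context so that synthetic data can be transferred to 3D object detection.

Result: Experiments show promising performance, highlighting the value of synthetic datasets and domain adaptation for railway perception.

Insight: Synthetic data and domain adaptation can compensate for the shortage of railway data and advance perception in this domain.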

Abstract: In recent years, interest in automatic train operations has significantly increased. To enable advanced functionalities, robust vision-based algorithms are essential for perceiving and understanding the surrounding environment. However, the railway sector suffers from a lack of publicly available real-world annotated datasets, making it challenging to test and validate new perception solutions in this domain. To address this gap, we introduce SynDRA-BBox, a synthetic dataset designed to support object detection and other vision-based tasks in realistic railway scenarios. To the best of our knowledge, it is the first synthetic dataset specifically tailored for 2D and 3D object detection in the railway domain. The dataset is publicly available at https://syndra.retis.santannapisa.it. In the presented evaluation, a state-of-the-art semi-supervised domain adaptation method, originally developed for automotive perception, is adapted to the railway context, enabling the transferability of synthetic data to 3D object detection. Experimental results demonstrate promising performance, highlighting the effectiveness of synthetic datasets and domain adaptation techniques in advancing perception capabilities for railway environments.

[61] Combined Image Data Augmentations diminish the benefits of Adaptive Label Smoothing

Georg Siedel,Ekagra Gupta,Weijia Shao,Silvia Vock,Andrey Morozov

Main category: cs.CV

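TL;DR: This paper applies the adaptive label smoothing framework to several data augmentation methods, such as random erasing and noise injection. The framework works with a single or limited set of augmentation types but loses its benefit with diverse augmentations such as TrivialAugment, and can even harm robustness to common corruptions.

Details Motivation: The paper asks whether adaptive label smoothing applies to aggressive augmentations other than random crop, and analyzes its limitations when augmentations are diverse.

Contribution: Extends adaptive label smoothing to random erasing and noise injection, and shows that it fails under diverse augmentation and can reduce model robustness.

Method: Experiments verify the effectiveness of adaptive label smoothing with random erasing and noise injection, then evaluate it with diverse augmentation schemes such as TrivialAugment.

Result: Adaptive label smoothing helps with higher-intensity Random Erasing but loses its advantage under diverse augmentation; excessive label smoothing harms robustness to common corruptions.

Insight: Adaptive label smoothing should only be applied when the training data distribution is dominated by a limited, homogeneous set of image transformation types.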

Abstract: Soft augmentation regularizes the supervised learning process of image classifiers by reducing label confidence of a training sample based on the magnitude of random-crop augmentation applied to it. This paper extends this adaptive label smoothing framework to other types of aggressive augmentations beyond random-crop. Specifically, we demonstrate the effectiveness of the method for random erasing and noise injection data augmentation. Adaptive label smoothing permits stronger regularization via higher-intensity Random Erasing. However, its benefits vanish when applied with a diverse range of image transformations as in the state-of-the-art TrivialAugment method, and excessive label smoothing harms robustness to common corruptions. Our findings suggest that adaptive label smoothing should only be applied when the training data distribution is dominated by a limited, homogeneous set of image transformation types.
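
The core mechanism, lowering label confidence as augmentation gets stronger, can be sketched as follows for random erasing. The linear mapping from erased fraction to confidence, with chance level as the floor, is an assumption made for illustration; the paper and the original soft augmentation work may use a different curve.

```python
def soft_target(num_classes, label, erased_fraction):
    """Soft label whose confidence decreases with the fraction of the image erased."""
    if not 0.0 <= erased_fraction <= 1.0:
        raise ValueError("erased_fraction must be in [0, 1]")
    chance = 1.0 / num_classes
    # Confidence falls linearly from 1.0 (nothing erased) to chance (everything erased).
    confidence = 1.0 - erased_fraction * (1.0 - chance)
    # Spread the remaining probability mass evenly over the other classes.
    off_value = (1.0 - confidence) / (num_classes - 1)
    return [confidence if k == label else off_value for k in range(num_classes)]
```

With a single augmentation type the magnitude is a scalar like `erased_fraction`; with a diverse mix of transformations there is no single comparable magnitude, which is consistent with the paper's finding that the benefit vanishes in that setting.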

[62] Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model

Lin Xi,Yingliang Ma,Cheng Wang,Sandra Howell,Aldo Rinaldi,Kawal S. Rhode

Main category: cs.CV

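TL;DR: Proposes a diffusion-based method for semi-supervised medical image segmentation that uses prototype-based contrastive consistency to reduce the impact of noisy pseudo-labels, and validates it on the newly released MOSXAV dataset.

Details Motivation: Pixel-level annotation of medical images is expensive and time-consuming, and existing semi-supervised methods struggle to structure semantic distributions in the latent space because of pseudo-label noise.

Contribution: 1. Proposes a diffusion framework that constrains the latent semantic structure during denoising through prototype-based contrastive consistency; 2. Releases the new MOSXAV benchmark dataset.

Method: During the denoising diffusion process, prototype-based contrastive consistency is enforced, using class prototypes as centralized semantic anchors in the latent space to improve robustness to noisy pseudo-labels.

Result: Experiments on EndoScapes2023 and MOSXAV show that the method outperforms existing semi-supervised segmentation methods.

Insight: Constraining the latent semantic distribution with prototype-based contrastive consistency reduces the impact of pseudo-label noise and makes segmentation more robust.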

Abstract: Obtaining pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. Semi-supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation. However, existing semi-supervised methods often struggle to structure semantic distributions in the latent space due to noise introduced by pseudo-labels. In this paper, we propose a novel diffusion-based framework for semi-supervised medical image segmentation. Our method introduces a constraint into the latent structure of semantic labels during the denoising diffusion process by enforcing prototype-based contrastive consistency. Rather than explicitly delineating semantic boundaries, the model leverages class prototypes (centralized semantic representations in the latent space) as anchors. This strategy improves the robustness of dense predictions, particularly in the presence of noisy pseudo-labels. We also introduce a new publicly available benchmark: Multi-Object Segmentation in X-ray Angiography Videos (MOSXAV), which provides detailed, manually annotated segmentation ground truth for multiple anatomical structures in X-ray angiography videos. Extensive experiments on the EndoScapes2023 and MOSXAV datasets demonstrate that our method outperforms state-of-the-art medical image segmentation approaches under the semi-supervised learning setting. This work presents a robust and data-efficient diffusion model that offers enhanced flexibility and strong potential for a wide range of clinical applications.
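
A class prototype in this sense is typically the mean feature vector of a class, and it can serve as an anchor against which individual features are compared. The sketch below shows only that building block, using plain lists and squared Euclidean distance; the contrastive consistency loss and the diffusion process from the paper are not reproduced, and the helper names are hypothetical.

```python
def class_prototypes(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def nearest_prototype(feature, prototypes):
    """Label of the prototype closest to `feature` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(feature, prototypes[label]))
```

Averaging over many pixels makes a prototype less sensitive to individual mislabeled pixels, which is the intuition behind using prototypes as anchors when pseudo-labels are noisy.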

[63] VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s Limits on Kilometer-scale Long RGB Sequences

Kai Deng,Zexin Ti,Jiawei Xu,Jian Yang,Jin Xie

Main category: cs.CV

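TL;DR: VGGT-Long extends VGGT with chunk-based processing, overlapping alignment, and lightweight loop closure optimization to achieve kilometer-scale monocular 3D reconstruction without camera calibration, depth supervision, or model retraining.

Details Motivation: Existing foundation models hit memory limits when applied to 3D reconstruction from large-scale RGB sequences.

Contribution: 1. A chunk-based processing and overlapping alignment strategy; 2. Lightweight loop closure optimization; 3. Kilometer-scale monocular 3D reconstruction.

Method: Chunk-based processing, overlapping alignment, and lightweight loop closure optimization.

Result: On KITTI, Waymo, and Virtual KITTI, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods and runs on long RGB sequences where foundation models typically fail.

Insight: Foundation models can be extended to large-scale 3D scene reconstruction with lightweight strategies, with particular potential for autonomous driving.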

Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
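
The chunk-based strategy can be illustrated by how a long frame sequence might be split into overlapping windows, so that consecutive chunks share frames that can later be used for alignment. The function and its `chunk_size` and `overlap` parameters are hypothetical; the alignment and loop closure steps are not shown.

```python
def overlapping_chunks(num_frames, chunk_size, overlap):
    """Return half-open (start, end) frame ranges where consecutive chunks share `overlap` frames."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_frames <= 0:
        return []
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break  # the last chunk reaches the end of the sequence
        start += stride
    return chunks
```

Each chunk fits in memory on its own, and the shared frames give two estimates of the same geometry from which a relative transform between neighbouring chunks can be computed.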

[64] C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Xiuwei Chen,Wentao Hu,Hanhui Li,Jun Zhou,Zisheng Chen,Meng Cao,Yihan Zeng,Kui Zhang,Yu-Jie Yuan,Jianhua Han,Hang Xu,Xiaodan Liang

Main category: cs.CV

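TL;DR: C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves training data and model capabilities. It addresses the mismatch between multimodal data evolution and model evolution and yields considerable gains on mathematical reasoning tasks.

Details Motivation: Improving existing multimodal large language models (MLLMs) requires high-quality vision-language datasets, which are costly and hard to scale. Existing self-improving methods also evolve visual and textual data separately, and evolve data and models separately, which leads to mismatched complexity and difficulty.

Contribution: The C2-Evo framework, which jointly optimizes data and model capabilities through a cross-modal data evolution loop and a data-model evolution loop.

Method: Complex multimodal problems are generated by combining structured textual sub-problems with iteratively specified geometric diagrams. The generated problems are then selected adaptively, based on model performance, for alternating supervised fine-tuning and reinforcement learning.

Result: Consistent and considerable performance gains across multiple mathematical reasoning benchmarks.

Insight: Jointly evolving data and model handles complexity mismatches in multimodal tasks more effectively while reducing data annotation cost.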

Abstract: Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
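The data-model loop's adaptive selection step might look roughly like the sketch below. It assumes difficulty is estimated from the model's empirical pass rate over several attempts and that problems in a middle band are kept; the abstract only says selection is "based on the performance of the base model", so the band thresholds and both function names are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of successful attempts; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)

def select_problems(attempts, low=0.2, high=0.8):
    """Keep problems that are neither trivially easy nor out of reach.

    attempts: dict mapping problem id -> list of per-attempt booleans.
    The [low, high] band is an illustrative choice, not a value from the paper.
    """
    return [pid for pid, outcomes in attempts.items()
            if low <= pass_rate(outcomes) <= high]
```

For example, a problem solved in 2 of 4 attempts would be kept, while one solved in all 4 or in none would be dropped from the next training round.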

[65] Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Ang Li,Charles Wang,Kaiyu Yue,Zikui Cai,Ollie Liu,Deqing Fu,Peng Guo,Wang Bill Zhu,Vatsal Sharan,Robin Jia,Willie Neiswanger,Furong Huang,Tom Goldstein,Micah Goldblum

Main category: cs.CV

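TL;DR: Zebra-CoT is a diverse, large-scale dataset of 182,384 samples for Visual Chain of Thought (Visual CoT). It provides logically coherent interleaved text-image reasoning traces across several task domains, and fine-tuning on it substantially improves multimodal models.

Details Motivation: Off-the-shelf Visual CoT performance is poor and high-quality Visual CoT training data is lacking; both limit the reasoning abilities of multimodal models. Zebra-CoT targets these two problems.

Contribution: The Zebra-CoT dataset for interleaved vision-language reasoning, covering scientific questions, 2D and 3D visual reasoning, and visual logic problems and strategic games. Fine-tuning on it significantly improves multimodal reasoning.

Method: A diverse dataset of 182,384 samples spanning visual reasoning tasks in multiple domains. Its effectiveness is validated by fine-tuning Anole-7B and Bagel-7B.

Result: Fine-tuned Anole-7B gains +12% test-set accuracy and up to +13% on standard VLM benchmarks. Fine-tuned Bagel-7B generates high-quality interleaved visual reasoning chains.

Insight: Diverse, high-quality visual reasoning data is essential for improving multimodal models, and Zebra-CoT is a useful resource for both training and evaluating Visual CoT.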

Abstract: Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT’s effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.

[66] Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models

Xiaoyan Wang,Zeju Li,Yifan Xu,Jiaxing Qi,Zhifei Yang,Ruifei Ma,Xiangde Liu,Chao Zhang

Main category: cs.CV

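TL;DR: Spatial 3D-LLM is a multimodal large language model for 3D vision-language tasks that enriches spatial embeddings through a progressive spatial awareness scheme, improving 3D scene understanding.

Details Motivation: Existing 3D multimodal LLMs (MLLMs) typically compress holistic scene information or segment independent objects, which leaves them with insufficient spatial awareness of 3D scenes and limits performance.

Contribution: 1. Spatial 3D-LLM, which enriches the spatial embeddings of 3D scenes with a progressive spatial awareness scheme; 2. two new tasks (3D object distance measurement and 3D layout editing) and the 3D instruction dataset MODEL; 3. state-of-the-art performance across a wide range of tasks.

Method: An LLM backbone is combined with a progressive spatial awareness scheme that captures spatial information as the perception field expands, producing location-enriched 3D scene embeddings that serve as visual prompts.

Result: Experiments show state-of-the-art performance across a range of 3D vision-language tasks, supporting the effectiveness of the progressive spatial awareness scheme.

Insight: Progressive spatial awareness mines deeper spatial information and markedly improves a model's understanding of 3D scenes.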

Abstract: A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model’s spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, showing that the improvements stem from our progressive spatial awareness scheme, which mines more profound spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.

[67] EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion

Shang Liu,Chenjie Cao,Chaohui Yu,Wen Qian,Jing Wang,Fan Wang

Main category: cs.CV

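TL;DR: EarthCrafter achieves large-scale 3D Earth generation through dual-sparse latent diffusion and new data infrastructure, contributing the Aerial-Earth3D dataset and a tailored generation framework.

Details Motivation: Current 3D generation methods struggle to scale to geographic extents (e.g., thousands of square kilometers of Earth's surface); new data and methods are needed to support large-scale 3D Earth generation.

Contribution: 1) Aerial-Earth3D, the largest 3D aerial dataset to date; 2) the EarthCrafter framework, which performs large-scale 3D Earth generation via dual-sparse latent diffusion.

Method: 1) Dual sparse 3D-VAEs compress geometric voxels and textural 2D Gaussian Splats (2DGS); 2) condition-aware flow matching models handle geometry and texture features independently.

Result: Experiments show that EarthCrafter performs substantially better in extremely large-scale generation, and that it supports semantic-guided urban layout generation and unconditional terrain synthesis.

Insight: Separating structure and texture generation, together with sparse latent representations, effectively addresses the compute and storage cost of large-scale 3D generation.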

Abstract: Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, substantially reducing the computational cost incurred at vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.
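The dataset figures in the abstract imply a total coverage and a per-scene frame count that are easy to check. The short calculation below assumes the 50k scenes do not overlap (so the area is an upper bound) and that frames are spread evenly across scenes; neither is stated in the abstract.

```python
# Figures taken from the abstract.
num_scenes = 50_000
scene_side_m = 600
num_frames = 45_000_000

# Total covered area, assuming non-overlapping scenes (an upper bound).
total_area_km2 = num_scenes * scene_side_m * scene_side_m // 1_000_000

# Average number of multi-view frames per scene, assuming an even spread.
frames_per_scene = num_frames // num_scenes

print(total_area_km2)    # 18000
print(frames_per_scene)  # 900
```

So the dataset covers on the order of 18,000 square kilometers with roughly 900 views per scene, which is consistent with the "thousands of square kilometers" framing in the opening sentence.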

[68] Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach

Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe

Main category: cs.CV

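TL;DR: This paper presents optimizations for a DNN-based HSI segmentation processor on an FPGA-based SoC. Software/hardware co-design, hardware-aware preprocessing, and model compression substantially reduce computational complexity and parameter count while speeding up inference.

Details Motivation: Using hyperspectral imaging (HSI) in autonomous driving systems (ADS) means meeting strict constraints on latency, resource consumption, and security, but the over-parameterization of DNNs and the intensive preprocessing of HSI data pose computational challenges.

Contribution: 1. A set of optimization techniques for the software/hardware co-design of a DNN-based HSI segmentation processor on an FPGA-based SoC; 2. model compression that greatly reduces operations and parameters; 3. faster inference through a complete pipelined deployment.

Method: 1. Functional software/hardware task distribution; 2. hardware-aware data preprocessing; 3. ML model compression; 4. complete pipelined deployment.

Result: The compressed DNN requires 24.34% of the original operations and 1.02% of the original parameters, with a 2.86x inference speed-up and no noticeable loss of segmentation accuracy.

Insight: For safety-critical systems such as ADS, software/hardware co-design and model compression are effective ways to deploy DNNs in real time, and the impact of data preprocessing on overall performance deserves attention.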

Abstract: The use of HSI for autonomous navigation is a promising research field aimed at improving the accuracy and robustness of detection, tracking, and scene understanding systems based on vision sensors. Combining advanced computer algorithms, such as DNNs, with small-size snapshot HSI cameras enhances the reliability of these systems. HSI overcomes intrinsic limitations of greyscale and RGB imaging in depicting physical properties of targets, particularly regarding spectral reflectance and metamerism. Despite promising results in HSI-based vision developments, safety-critical systems like ADS demand strict constraints on latency, resource consumption, and security, motivating the shift of ML workloads to edge platforms. This involves a thorough software/hardware co-design scheme to distribute and optimize the tasks efficiently among the limited resources of computing platforms. With respect to inference, the over-parameterized nature of DNNs poses significant computational challenges for real-time on-the-edge deployment. In addition, the intensive data preprocessing required by HSI, which is frequently overlooked, must be carefully managed in terms of memory arrangement and inter-task communication to enable an efficient integrated pipeline design on a SoC. This work presents a set of optimization techniques for the practical co-design of a DNN-based HSI segmentation processor deployed on an FPGA-based SoC targeted at ADS, including key optimizations such as functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. Applied compression techniques significantly reduce the complexity of the designed DNN to 24.34% of the original operations and to 1.02% of the original number of parameters, achieving a 2.86x speed-up in the inference task without noticeable degradation of the segmentation accuracy.
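The compression figures are stated as remaining fractions, so it can help to convert them into reduction factors and compare them with the measured speed-up. The arithmetic below uses only the three numbers from the abstract; the variable names are illustrative.

```python
# Remaining fractions after compression, as reported in the abstract.
ops_fraction = 0.2434      # 24.34% of the original operations
params_fraction = 0.0102   # 1.02% of the original parameters
measured_speedup = 2.86    # reported inference speed-up

ops_reduction = 1.0 / ops_fraction        # about 4.11x fewer operations
params_reduction = 1.0 / params_fraction  # about 98x fewer parameters

print(f"operations reduced by {ops_reduction:.2f}x")
print(f"parameters reduced by {params_reduction:.1f}x")
print(f"speed-up / ops reduction = {measured_speedup / ops_reduction:.2f}")
```

The measured 2.86x speed-up is smaller than the roughly 4.11x drop in operations, which is a common pattern when memory traffic and fixed per-layer overheads do not shrink in proportion to the arithmetic; the abstract itself does not break down where the remaining time goes.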

[69] Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge

Tobias Rueckert,David Rauber,Raphaela Maerkl,Leonard Klausmann,Suemeyye R. Yildiran,Max Gutbrod,Danilo Weber Nunes,Alvaro Fernandez Moreno,Imanol Luengo,Danail Stoyanov,Nicolas Toussaint,Enki Cho,Hyeon Bae Kim,Oh Sung Choo,Ka Young Kim,Seong Tae Kim,Gonçalo Arantes,Kehan Song,Jianjun Zhu,Junchen Xiong,Tingyi Lin,Shunsuke Kikuchi,Hiroki Matsuzaki,Atsushi Kouno,João Renato Ribeiro Manesco,João Paulo Papa,Tae-Min Choi,Tae Kyeong Jeong,Juyoun Park,Oluwatosin Alabi,Meng Wei,Tom Vercauteren,Runzhi Wu,Mengya Xu,An Wang,Long Bai,Hongliang Ren,Amine Yamlahi,Jakob Hennighausen,Lena Maier-Hein,Satoshi Kondo,Satoshi Kasai,Kousuke Hirasawa,Shu Yang,Yihui Wang,Hao Chen,Santiago Rodríguez,Nicolás Aparicio,Leonardo Manrique,Juan Camilo Lyons,Olivia Hosie,Nicolás Ayobi,Pablo Arbeláez,Yiping Li,Yasmina Al Khalil,Sahar Nasirihaghighi,Stefanie Speidel,Daniel Rueckert,Hubertus Feussner,Dirk Wilhelm,Christoph Palm

Main category: cs.CV

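TL;DR: This paper reports on the PhaKIR challenge, which uses a multi-task dataset (surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation) to advance robustness and interpretability in computer-assisted surgery.

Details Motivation: Recognition and localization of surgical instruments is not yet robust under real-world conditions; incorporating surgical context, such as the current phase, is a promising way to improve it.

Contribution: A multi-center dataset with unified annotations that enables joint investigation of instrument localization and procedural context and supports the integration of temporal information.

Method: The challenge was organized around three defined tasks, with models evaluated on the new unified dataset.

Result: Benchmark results are provided that point to the potential of context and temporal information for improving surgical scene understanding.

Insight: Joint multi-task learning and the integration of temporal information are key directions for improving instrument and scene understanding in surgery.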

Abstract: Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.

[70] A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization

Wenbo Xu,Junyan Wu,Wei Lu,Xiangyang Luo,Qian Wang

Main category: cs.CV

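TL;DR: The paper proposes a multimodal deviation perceiving framework (MDP) for weakly-supervised temporal forgery localization. It identifies partially forged segments from video-level annotations only, using a novel multimodal interaction mechanism and an extensible deviation perceiving loss for fine-grained localization.

Details Motivation: Current Deepfake forensics typically treats detection as a classification task or a temporal forgery localization problem; these approaches are restrictive, time-consuming, and hard to scale. The authors therefore propose a weakly-supervised method that needs only video-level annotations.

Contribution: 1. The multimodal deviation perceiving framework (MDP); 2. a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss; 3. results comparable to fully-supervised methods using only video-level annotations.

Method: 1. MI uses cross-modal attention to measure the relevance between the visual and audio modalities and identify inter-modality deviation; 2. the deviation perceiving loss enlarges the deviation between adjacent segments of forged samples and reduces it for genuine samples.

Result: Extensive experiments show results comparable to fully-supervised approaches on several evaluation metrics.

Insight: Multimodal interaction and deviation perception improve temporal forgery localization in the weakly-supervised setting, and the need for only video-level annotations makes the approach practical.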

Abstract: Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem, approaches that are usually restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporally partial forged segments using only video-level annotations. The MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which achieves refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It can identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To explore further temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss is proposed, aiming at enlarging the deviation of adjacent segments of the forged samples and reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.
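The intent of the deviation perceiving loss, enlarging adjacent-segment deviation for forged videos and shrinking it for genuine ones, can be illustrated with a simple stand-in. The sketch below assumes deviation is one minus cosine similarity between neighbouring segment features and uses a hinge on the largest deviation for forged samples; the actual loss in the paper is not given in the abstract, and `margin` and both function names are hypothetical.

```python
import numpy as np

def adjacent_deviation(segments):
    """1 - cosine similarity between neighbouring segment features.

    segments: (T, D) array of per-segment features; returns shape (T - 1,).
    """
    a, b = segments[:-1], segments[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return 1.0 - (a * b).sum(axis=1) / norms

def deviation_loss(segments, is_forged, margin=1.0):
    """Video-level stand-in loss (assumed form, not the paper's definition)."""
    dev = adjacent_deviation(segments)
    if is_forged:
        # Forged video: at least one boundary should show a large deviation.
        return max(0.0, margin - dev.max())
    # Genuine video: neighbouring segments should stay close everywhere.
    return dev.mean()
```

Only the video-level flag `is_forged` is needed, which matches the weakly-supervised setting; at inference time, boundaries with large deviation would be candidate start and end timestamps.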

[71] CTSL: Codebook-based Temporal-Spatial Learning for Accurate Non-Contrast Cardiac Risk Prediction Using Cine MRIs

Haoyang Su,Shaohao Rui,Jinyi Xiang,Lianming Wu,Xiaosong Wang

Main category: cs.CV

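TL;DR: The paper proposes CTSL, a self-supervised framework for contrast-free prediction of Major Adverse Cardiac Events (MACE) from Cine MRI sequences, which learns spatiotemporal features through multi-view distillation and dynamic lesion self-detection.

Details Motivation: Existing methods rely on supervised learning with human-refined masks of the ventricular myocardium, which become impractical without contrast agents. CTSL aims to remove this dependence and deliver accurate MACE prediction without segmentation masks.

Contribution: 1. The self-supervised CTSL framework, which learns spatiotemporal features without segmentation masks; 2. multi-view distillation and dynamic lesion self-detection to improve model performance.

Method: 1. Multi-view distillation to decouple temporal and spatial features; 2. codebook-based feature representations; 3. dynamic lesion self-detection from motion cues.

Result: CTSL outperforms traditional contrast-dependent methods at MACE risk prediction, offering a rapid, non-invasive clinical solution.

Insight: Self-supervised learning can reduce reliance on manual annotation in medical image analysis while still achieving strong predictive performance.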

Abstract: Accurate and contrast-free Major Adverse Cardiac Events (MACE) prediction from Cine MRI sequences remains a critical challenge. Existing methods typically necessitate supervised learning based on human-refined masks in the ventricular myocardium, which become impractical without contrast agents. We introduce a self-supervised framework, namely Codebook-based Temporal-Spatial Learning (CTSL), that learns dynamic, spatiotemporal representations from raw Cine data without requiring segmentation masks. CTSL decouples temporal and spatial features through a multi-view distillation strategy, where the teacher model processes multiple Cine views, and the student model learns from reduced-dimensional Cine-SA sequences. By leveraging codebook-based feature representations and dynamic lesion self-detection through motion cues, CTSL captures intricate temporal dependencies and motion patterns. High-confidence MACE risk predictions are achieved through our model, providing a rapid, non-invasive solution for cardiac risk assessment that outperforms traditional contrast-dependent methods, thereby enabling timely and accessible heart disease diagnosis in clinical settings.
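A codebook-based feature representation usually means mapping each continuous feature to its nearest entry in a learned set of vectors. The sketch below shows that lookup step only, assuming squared Euclidean distance as in standard vector quantization; how CTSL trains or uses its codebook is not described in the abstract, and `quantize` is a hypothetical name.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-codebook lookup.

    features: (N, D) continuous features, codebook: (K, D) code vectors.
    Returns the chosen code indices (N,) and the quantized features (N, D).
    """
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)   # (N, K) squared distances
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]
```

The resulting index sequence gives a compact, discrete description of each frame's features, which is one way temporal patterns across a Cine sequence could be summarized.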

[72] Automatic Fine-grained Segmentation-assisted Report Generation

Frederic Jonske,Constantin Seibold,Osman Alperen Koras,Fin Bahnsen,Marie Bauer,Amin Dada,Hamza Kalisch,Anton Schily,Jens Kleesiek

Main category: cs.CV

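TL;DR: ASaRG fuses intermediate features and fine-grained segmentation maps from specialist radiological models into LLaVA's multi-modal projection layer, improving medical report generation and making the reports easier to verify.

Details Motivation: The goal is to reduce radiologists' workload through automated report generation and to provide reliable second opinions. This requires strong general performance and some form of grounding that lets clinicians or patients check the generated reports.

Contribution: The ASaRG method, which combines fine-grained segmentation maps with intermediate features to raise the CE F1 score of generated reports and make report elements traceable to segmentation maps.

Method: Intermediate features and fine-grained segmentation maps produced by specialist models are fused into LLaVA's multi-modal projection layer via simple concatenation.

Result: CE F1 improves by +2.77% over the LLaVA baseline; compared with COMG and ORID, the gains are 6.98% and 6.28% in F1 score, respectively.

Insight: Beyond its performance gains, ASaRG allows report elements to be traced to the corresponding segmentation maps, improving transparency and trustworthiness.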

Abstract: Reliable end-to-end clinical report generation has been a longstanding goal of medical ML research. The end goal for this process is to alleviate radiologists’ workloads and provide second opinions to clinicians or patients. Thus, a necessary prerequisite for report generation models is a strong general performance and some type of innate grounding capability, to convince clinicians or patients of the veracity of the generated reports. In this paper, we present ASaRG (Automatic Segmentation-assisted Report Generation), an extension of the popular LLaVA architecture that aims to tackle both of these problems. ASaRG proposes to fuse intermediate features and fine-grained segmentation maps created by specialist radiological models into LLaVA’s multi-modal projection layer via simple concatenation. With a small number of added parameters, our approach achieves a +0.89% performance gain ($p=0.012$) in CE F1 score compared to the LLaVA baseline when using only intermediate features, and +2.77% performance gain ($p<0.001$) when adding a combination of intermediate features and fine-grained segmentation maps. Compared with COMG and ORID, two other report generation methods that utilize segmentations, the performance gain amounts to 6.98% and 6.28% in F1 score, respectively. ASaRG is not mutually exclusive with other changes made to the LLaVA architecture, potentially allowing our method to be combined with other advances in the field. Finally, the use of an arbitrary number of segmentations as part of the input demonstrably allows tracing elements of the report to the corresponding segmentation maps and verifying the groundedness of assessments. Our code will be made publicly available at a later date.
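Fusion "via simple concatenation" into a projection layer can be pictured as below. This sketch assumes the extra features are aligned per token and concatenated along the channel axis before a single linear projection; the abstract does not say whether concatenation is along channels or tokens, and all names and shapes here are hypothetical.

```python
import numpy as np

def fuse_and_project(image_tokens, intermediate_feats, seg_feats, weight, bias):
    """Concatenate per-token features along channels, then project linearly.

    image_tokens:       (T, D_img) vision-encoder tokens
    intermediate_feats: (T, D_int) features from a specialist model (assumed aligned)
    seg_feats:          (T, D_seg) encoded segmentation maps (assumed aligned)
    weight:             (D_img + D_int + D_seg, D_out), bias: (D_out,)
    """
    fused = np.concatenate([image_tokens, intermediate_feats, seg_feats], axis=-1)
    return fused @ weight + bias
```

Under this reading, the only added parameters are the extra rows of `weight` for the new channels, which is consistent with the abstract's remark about a small number of added parameters.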

[73] A2Mamba: Attention-augmented State Space Models for Visual Recognition

Meng Lou,Yunxiang Fu,Yizhou Yu

Main category: cs.CV

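TL;DR: A2Mamba is a new Transformer-Mamba hybrid architecture. Its Multi-scale Attention-augmented State Space Model (MASS) token mixer deeply integrates local details with global context, yielding strong gains on visual recognition tasks.

Details Motivation: Existing hybrid models simply stack Transformer and Mamba layers without any interaction mechanism between them, which limits performance. A2Mamba addresses this by deeply integrating the two architectures to strengthen spatial dependencies and dynamic modeling.

Contribution: 1. The A2Mamba architecture, which combines the strengths of Transformers and Mamba. 2. MASS, a new token mixer in which multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM) to enhance spatial dependencies.

Method: A2SSM performs a variant of cross-attention by spatially aggregating the SSM's hidden states using multi-scale attention maps, which integrates Transformer and Mamba components and improves dynamic modeling.

Result: A2Mamba-L reaches 86.1% top-1 accuracy on ImageNet-1K. A2Mamba also outperforms prior methods in semantic segmentation and object detection, with fewer parameters and higher efficiency.

Insight: The attention-augmented SSM design shows the potential of combining Transformers and Mamba, offering a more efficient and stronger option for visual recognition.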

Abstract: Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer-Mamba hybrid network architecture, featuring a new token mixer termed Multi-scale Attention-augmented State Space Model (MASS), where multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross-attention by spatially aggregating the SSM’s hidden states using the multi-scale attention maps, which enhances spatial dependencies pertaining to a two-dimensional space while improving the dynamic modeling capabilities of SSMs. Our A2Mamba outperforms all previous ConvNet-, Transformer-, and Mamba-based architectures in visual recognition tasks. For instance, A2Mamba-L achieves an impressive 86.1% top-1 accuracy on ImageNet-1K. In semantic segmentation, A2Mamba-B exceeds CAFormer-S36 by 2.5% in mIoU, while exhibiting higher efficiency. In object detection and instance segmentation with Cascade Mask R-CNN, A2Mamba-S surpasses MambaVision-B by 1.2%/0.9% in AP^b/AP^m, while having 40% fewer parameters. Code is publicly available at https://github.com/LMMMEng/A2Mamba.

[74] Benchmarking pig detection and tracking under diverse and challenging conditions

Jonathan Henrich,Christian Post,Maximilian Zilke,Parth Shiroya,Emma Chanut,Amir Mollazadeh Yamchi,Ramin Yahyapour,Thomas Kneib,Imke Traulsen

Main category: cs.CV

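TL;DR: This paper builds two datasets, PigDetect and PigTrack, and uses them to benchmark pig detection and multi-object tracking systematically. It finds that challenging training data improves detection and compares the strengths and weaknesses of different models, providing a reference for practical use in pig farming.

Details Motivation: Pig monitoring has traditionally been manual, which is inefficient and hard to individualize. Although machine learning has been studied for pig detection and tracking, the lack of a systematic benchmark has held back practical progress.

Contribution: 1. Two high-quality datasets, PigDetect (detection) and PigTrack (tracking); 2. a systematic evaluation of multiple detection and tracking models; 3. evidence that challenging training data matters for detection performance; 4. public release of datasets and code to support follow-up research.

Method: 1. Diverse image and video data is collected from realistic barn conditions, including challenging scenarios such as occlusions and bad visibility; 2. for detection, randomly sampled training images are compared with challenging ones; 3. real-time models are compared with state-of-the-art (SOTA) models; 4. SORT-based methods are compared with end-to-end models for tracking.

Result: 1. Challenging training images improve detection beyond what randomly sampled images achieve; 2. SOTA detection models outperform real-time alternatives; 3. SORT-based methods achieve better detection performance, while end-to-end models show better association performance; 4. models perform well in unseen pens, suggesting good generalization.

Insight: High-quality, diverse training data is crucial for model performance. End-to-end tracking models associate better but need improved detection, and they could become strong alternatives in the future.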

Abstract: To ensure animal welfare and effective management in pig farming, monitoring individual behavior is a crucial prerequisite. While monitoring tasks have traditionally been carried out manually, advances in machine learning have made it possible to collect individualized information in an increasingly automated way. Central to these methods is the localization of animals across space (object detection) and time (multi-object tracking). Despite extensive research on these two tasks in pig farming, a systematic benchmarking study has not yet been conducted. In this work, we address this gap by curating two datasets: PigDetect for object detection and PigTrack for multi-object tracking. The datasets are based on diverse image and video material from realistic barn conditions, and include challenging scenarios such as occlusions or bad visibility. For object detection, we show that challenging training images improve detection performance beyond what is achievable with randomly sampled images alone. Comparing different approaches, we found that state-of-the-art models offer substantial improvements in detection quality over real-time alternatives. For multi-object tracking, we observed that SORT-based methods achieve superior detection performance compared to end-to-end trainable models. However, end-to-end models show better association performance, suggesting they could become strong alternatives in the future. We also investigate characteristic failure cases of end-to-end models, providing guidance for future improvements. The detection and tracking models trained on our datasets perform well in unseen pens, suggesting good generalization capabilities. This highlights the importance of high-quality training data. The datasets and research code are made publicly available to facilitate reproducibility, re-use and further development.
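The distinction between detection and association in tracking can be illustrated with the association step on its own. The sketch below matches existing track boxes to new detections by intersection-over-union using a greedy rule; this is a simplification chosen for brevity, since SORT itself combines Kalman-filter prediction with Hungarian assignment, and the function names and the `iou_threshold` default are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_associate(tracks, detections, iou_threshold=0.3):
    """Greedily pair track boxes with detection boxes by descending IoU.

    Returns a list of (track_index, detection_index) pairs. Unmatched
    detections would start new tracks; unmatched tracks would age out.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap even less
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches
```

Box-overlap association of this kind is sensitive to occlusion and crowding, the challenging scenarios the datasets include, which gives some intuition for why learned association in end-to-end models can do better even when their detections are weaker.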

[75] QRetinex-Net: Quaternion-Valued Retinex Decomposition for Low-Level Computer Vision Applications

Sos Agaian,Vladimir Frants

Main category: cs.CV

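TL;DR: QRetinex-Net introduces the first quaternion-based Retinex decomposition. Quaternion modeling improves low-light image processing and addresses four key flaws of classic Retinex methods.

Details Motivation: Low-light images suffer from color shift, low contrast, and noise. Classic Retinex models fall short in four ways: they process the RGB channels independently, lack a neuroscientific basis, cannot perfectly reconstruct the input, and do not explain human color constancy.

Contribution: 1. The first quaternion Retinex formulation, which models the scene as the product of quaternion-valued reflectance and illumination; 2. the Reflectance Consistency Index.

Method: Images are decomposed into the Hamilton product of quaternion-valued reflectance and illumination, which avoids treating the RGB channels independently.

Result: On low-light crack inspection, face detection under varied lighting, and infrared-visible fusion, the method gains 2-11 percent over leading methods, with better color fidelity, lower noise, and higher reflectance stability.

Insight: Quaternion modeling better reflects the color constancy of human vision and offers a more robust approach to low-level computer vision tasks.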

Abstract: Images taken in low light often show color shift, low contrast, noise, and other artifacts that hurt computer-vision accuracy. Retinex theory addresses this by viewing an image S as the pixel-wise product of reflectance R and illumination I, mirroring the way people perceive stable object colors under changing light. The decomposition is ill-posed, and classic Retinex models have four key flaws: (i) they treat the red, green, and blue channels independently; (ii) they lack a neuroscientific model of color vision; (iii) they cannot perfectly rebuild the input image; and (iv) they do not explain human color constancy. We introduce the first Quaternion Retinex formulation, in which the scene is written as the Hamilton product of quaternion-valued reflectance and illumination. To gauge how well reflectance stays invariant, we propose the Reflectance Consistency Index. Tests on low-light crack inspection, face detection under varied lighting, and infrared-visible fusion show gains of 2-11 percent over leading methods, with better color fidelity, lower noise, and higher reflectance stability.
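The Hamilton product named in the abstract is the standard quaternion multiplication, and a short sketch shows why it couples the color channels. The encoding of an RGB pixel as a pure quaternion (0, r, g, b) is a common convention assumed here for illustration; the abstract does not state how the paper maps pixels to quaternions.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) = w + xi + yj + zk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])

# Unit quaternions i and j.
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))  # [0. 0. 0. 1.], i.e. k
print(hamilton_product(j, i))  # [ 0.  0.  0. -1.], i.e. -k
```

Every output component mixes all four input components, so a quaternion-valued reflectance multiplied by a quaternion-valued illumination cannot treat red, green, and blue independently, in contrast to the pixel-wise product of classic Retinex. The product is also non-commutative, so the order of reflectance and illumination matters in such a formulation.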

[76] Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation

Yiguo He,Junjie Zhu,Yiying Li,Xiaoyu Zhang,Chunping Qiu,Jun Wang,Qiangjuan Huang,Ke Yang

Main category: cs.CV

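TL;DR: The paper proposes MpGI, a two-stage method that combines multimodal large language models (MLLMs) and large language models (LLMs) to generate high-quality captions for remote sensing images, and uses it to build the HQRS-IT-210K dataset. HQRS-CLIP and RS-CoCa, trained on this dataset, perform strongly on multiple downstream tasks.

Details Motivation: Vision-language foundation models (VLFMs) for remote sensing are limited by the scarcity of high-quality image-text pairs. Existing caption generation methods are rudimentary and yield low-quality data, so models need more training data for only modest gains.

Contribution: 1. The MpGI method for generating high-quality captions; 2. the HQRS-IT-210K dataset; 3. the HQRS-CLIP and RS-CoCa models, which outperform existing methods.

Method: MpGI has two stages: it first generates detailed descriptions from multiple perspectives (Rule-MLLM Relay Generation and MLLM generation), then uses LLMs to integrate them into comprehensive captions. The resulting dataset is used to fine-tune CLIP and CoCa.

Result: HQRS-CLIP surpasses the previous SOTA remote sensing CLIP model using only 4.2% of the training data; RS-CoCa generates captions that rival or even exceed manual annotations.

Insight: Combining MLLMs and LLMs substantially improves the quality of remote sensing image-text data, which in turn improves vision-language models.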

Abstract: The application of vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of datasets is suboptimal, requiring larger volumes of training data, while only yielding modest performance improvements. In this paper, we propose a two-stage method named MpGI (Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. Firstly, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM (Multimodal Large Language Model) Relay Generation and MLLM generation methods. Next, we utilize Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions, capturing details from multiple perspectives. Finally, we have created the HQRS-IT-210K dataset, including about 210,000 RS images and 1.3 million captions. We fine-tuned two VLFMs using our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model. This process resulted in our proposed HQRS-CLIP and RS-CoCa models. Experimental results demonstrate that HQRS-CLIP surpassed the previous SOTA RS CLIP model in various downstream tasks while using only 4.2% of the training data. RS-CoCa outperforms other advanced approaches across benchmark datasets and can generate captions for RS images that rival or even exceed manual annotations. Dataset, pre-trained models, and codes will be released at https://github.com/YiguoHe/HQRS-210K-and-HQRS-CLIP.

[77] Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction

Yiqing Shen,Chenjia Li,Chenxiao Fan,Mathias Unberath

Main category: cs.CV

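TL;DR: The paper introduces temporally-constrained video reasoning segmentation, a task that addresses the inability of conventional video segmentation to handle complex text queries and objects whose relevance changes over time, and develops an automated benchmark construction method.

Details Motivation: Conventional video segmentation cannot handle dynamically relevant objects or complex text queries, which limits its use in variable scenarios. The paper targets this problem, focusing on dynamic environments such as operating room video analysis.

Contribution: 1. The temporally-constrained video reasoning segmentation task; 2. an automated benchmark construction method; 3. the TCVideoRSBenchmark dataset.

Method: 1. A temporally-constrained reasoning segmentation task that requires models to infer when target objects become relevant from text queries involving temporal reasoning; 2. automated construction of the benchmark dataset.

Result: The TCVideoRSBenchmark dataset, containing 52 samples built from videos in the MVOR dataset.

Insight: Text queries with temporal reasoning allow more flexible segmentation of objects in dynamic scenes, and automated benchmark construction scales well.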

Abstract: Conventional approaches to video segmentation are confined to predefined object categories and cannot identify out-of-vocabulary objects, let alone objects that are not identified explicitly but only referred to implicitly in complex text queries. This shortcoming limits the utility of video segmentation in complex and variable scenarios, where a closed set of object categories is difficult to define and where users may not know the exact object category that will appear in the video. Such scenarios can arise in operating room video analysis, where different health systems may use different workflows and instrumentation, requiring flexible solutions for video analysis. Reasoning segmentation (RS) now offers promise towards such a solution, enabling natural language text queries as interaction for identifying the object to segment. However, existing video RS formulations assume that target objects remain contextually relevant throughout entire video sequences. This assumption is inadequate for real-world scenarios in which objects of interest appear, disappear or change relevance dynamically based on temporal context, such as surgical instruments that become relevant only during specific procedural phases or anatomical structures that gain importance at particular moments during surgery. Our first contribution is the introduction of temporally-constrained video reasoning segmentation, a novel task formulation that requires models to implicitly infer when target objects become contextually relevant based on text queries that incorporate temporal reasoning. Since manual annotation of temporally-constrained video RS datasets would be expensive and limit scalability, our second contribution is an innovative automated benchmark construction method. Finally, we present TCVideoRSBenchmark, a temporally-constrained video RS dataset containing 52 samples using the videos from the MVOR dataset.

[78] DFR: A Decompose-Fuse-Reconstruct Framework for Multi-Modal Few-Shot Segmentation

Shuai Chen,Fanman Meng,Xiwei Zhang,Haoran Wei,Chenhao Wu,Qingbo Wu,Hongliang Li

Main category: cs.CV

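TL;DR: DFR is a multi-modal few-shot segmentation framework that uses a three-step decompose, fuse, and reconstruct strategy to integrate visual, textual, and audio modalities, substantially improving segmentation performance.

Details Motivation: Existing few-shot segmentation methods rely mainly on one or two modalities and cannot fully exploit the multi-modal information available in real-world scenes, which limits their semantic understanding.

Contribution: 1) Multi-modal Decompose: uses SAM to extract visual regions, refines textual descriptions into fine-grained descriptors, and processes audio features; 2) Multi-modal Contrastive Fuse: uses contrastive learning to keep the modalities consistent and to strengthen semantic interaction; 3) Dual-path Reconstruct: combines semantic guidance with geometric priors.

Method: A decompose-fuse-reconstruct framework that combines SAM with multi-modal contrastive learning to dynamically integrate visual, textual, and audio features.

Result: Under both synthetic and real settings, DFR outperforms existing methods on segmentation guided by visual, textual, and audio modalities.

Insight: Systematic fusion of multi-modal information, through hierarchical decomposition and contrastive learning, can substantially improve semantic understanding in few-shot segmentation.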

Abstract: This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR’s substantial performance improvements over state-of-the-art methods.

[79] Denoising-While-Completing Network (DWCNet): Robust Point Cloud Completion Under Corruption

Keneni W. Tesema,Lyndon Hill,Mark W. Jones,Gary K. L. Tam

Main category: cs.CV

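TL;DR: DWCNet is a robust point cloud completion network that denoises while completing. It targets real-world point clouds affected by multiple degradations, adds a Noise Management Module (NMM) to improve performance, and achieves state-of-the-art results on synthetic and real-world datasets.

Details Motivation: Point cloud completion is central to 3D computer vision, but noise and occlusion make real-world point clouds incomplete and noisy. Existing methods are trained on synthetic data and struggle with the many degradations found in the real world.

Contribution: 1) Proposes DWCNet, whose Noise Management Module (NMM) combines contrastive learning and self-attention to suppress noise and model structural relationships; 2) Introduces the CPCCD dataset for evaluating the robustness of point cloud completion methods under multiple degradations.

Method: The DWCNet framework handles noise through the NMM, using contrastive learning to separate noise from true points and self-attention to model structural relationships between points.

Result: DWCNet achieves state-of-the-art performance on both clean and corrupted, synthetic and real-world datasets.

Insight: Denoising and structural modeling are key to point cloud completion; combining contrastive learning with self-attention effectively improves robustness to noise.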

Abstract: Point cloud completion is crucial for 3D computer vision tasks in autonomous driving, augmented reality, and robotics. However, obtaining clean and complete point clouds from real-world environments is challenging due to noise and occlusions. Consequently, most existing completion networks – trained on synthetic data – struggle with real-world degradations. In this work, we tackle the problem of completing and denoising highly corrupted partial point clouds affected by multiple simultaneous degradations. To benchmark robustness, we introduce the Corrupted Point Cloud Completion Dataset (CPCCD), which highlights the limitations of current methods under diverse corruptions. Building on these insights, we propose DWCNet (Denoising-While-Completing Network), a completion framework enhanced with a Noise Management Module (NMM) that leverages contrastive learning and self-attention to suppress noise and model structural relationships. DWCNet achieves state-of-the-art performance on both clean and corrupted, synthetic and real-world datasets. The dataset and code will be publicly available at https://github.com/keneniwt/DWCNET-Robust-Point-Cloud-Completion-against-Corruptions

[80] Faithful, Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks

Marcel Kleinmann,Shashank Agnihotri,Margret Keuper

Main category: cs.CV

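TL;DR: The paper improves B-cos networks by adding anti-aliasing strategies (FLCPooling and BlurPool) that raise explanation quality, and extends them to multi-label classification for clinical medical image diagnosis.

Details Motivation: In safety-critical domains such as medical imaging, the interpretability and faithfulness of deep neural networks are essential. Standard B-cos networks fall short of clinical needs because of aliasing artifacts and their restriction to multi-class classification.

Contribution: 1. Proposes anti-aliasing strategies (FLC and BP) that significantly improve explanation quality; 2. Extends B-cos networks to multi-label classification, preserving strong performance while providing artifact-free, clinically usable explanations.

Method: 1. Introduces the FLCPooling and BlurPool anti-aliasing techniques; 2. Modifies B-cos networks to support multi-label classification.

Result: Experiments show that the modified B-cos_FLC and B-cos_BP preserve predictive performance while providing faithful, artifact-free explanations suitable for clinical multi-label settings.

Insight: Anti-aliasing and the multi-label extension give B-cos networks both strong performance and clinically usable interpretability in medical image diagnosis.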

Abstract: Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. Additionally, the original B-cos formulation is limited to multi-class settings, whereas chest X-ray analysis often requires multi-label classification due to co-occurring abnormalities. In this work, we address both limitations: (1) we introduce anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality, and (2) we extend B-cos networks to support multi-label classification. Our experiments on chest X-ray datasets demonstrate that the modified $\text{B-cos}_\text{FLC}$ and $\text{B-cos}_\text{BP}$ preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-label settings. Code available at: https://github.com/mkleinma/B-cos-medical-paper

[81] Enhancing Domain Diversity in Synthetic Data Face Recognition with Dataset Fusion

Anjith George,Sebastien Marcel

Main category: cs.CV

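TL;DR: The paper fuses two synthetic face datasets generated by architecturally distinct models, reducing model-specific artifacts and increasing diversity, which improves face recognition performance.

Details Motivation: Face recognition models trained on synthetic data suffer from model-specific artifacts and limited diversity when a single generator produces the whole dataset. The work addresses this while also easing the ethical and privacy concerns of web-crawled data.

Contribution: Proposes fusing synthetic datasets generated by two different architectures, which increases data diversity and reduces artifacts, thereby improving model performance.

Method: Combines two architecturally distinct synthetic face datasets and exploits their complementarity to reduce artifacts and increase diversity (pose, lighting, demographics), with implicit regularization from the emphasis on identity-relevant features.

Result: On standard face recognition benchmarks, models trained on the fused dataset outperform models trained on a single dataset across many of the benchmarks.

Insight: The complementarity of multiple generators can offset the limitations of synthetic data and give the model stronger generalization.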

Abstract: While the accuracy of face recognition systems has improved significantly in recent years, the datasets used to train these models are often collected through web crawling without the explicit consent of users, raising ethical and privacy concerns. To address this, many recent approaches have explored the use of synthetic data for training face recognition models. However, these models typically underperform compared to those trained on real-world data. A common limitation is that a single generator model is often used to create the entire synthetic dataset, leading to model-specific artifacts that may cause overfitting to the generator’s inherent biases and artifacts. In this work, we propose a solution by combining two state-of-the-art synthetic face datasets generated using architecturally distinct backbones. This fusion reduces model-specific artifacts, enhances diversity in pose, lighting, and demographics, and implicitly regularizes the face recognition model by emphasizing identity-relevant features. We evaluate the performance of models trained on this combined dataset using standard face recognition benchmarks and demonstrate that our approach achieves superior performance across many of these benchmarks.
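The abstract does not say how the two synthetic datasets are merged. A minimal sketch of one common approach, assuming each dataset is a list of `(image_path, identity_label)` pairs with labels numbered from 0 and that the two identity sets are disjoint (both assumptions, as is the name `fuse_datasets`), offsets the labels of the second dataset so identities never collide:

```python
def fuse_datasets(dataset_a, dataset_b):
    """Concatenate two labeled datasets, shifting dataset_b's identity labels.

    Each dataset is a list of (image_path, identity_label) pairs whose labels
    start at 0. The identities are assumed to be disjoint across datasets.
    """
    if dataset_a:
        offset = max(label for _, label in dataset_a) + 1
    else:
        offset = 0
    fused = list(dataset_a)
    # Shift labels so identity 0 of dataset_b does not alias identity 0 of dataset_a.
    fused.extend((path, label + offset) for path, label in dataset_b)
    return fused


if __name__ == "__main__":
    gen_a = [("a/000.png", 0), ("a/001.png", 0), ("a/002.png", 1)]
    gen_b = [("b/000.png", 0), ("b/001.png", 1)]
    for sample in fuse_datasets(gen_a, gen_b):
        print(sample)
```

Without the offset, a classifier head would treat unrelated synthetic identities from the two generators as the same class.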

[82] HOComp: Interaction-Aware Human-Object Composition

Dong Liang,Jinyuan Jia,Yuhao Liu,Rynson W. H. Lau

Main category: cs.CV

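TL;DR: HOComp is a new interaction-aware human-object composition method. Using MLLM-driven pose guidance and detail-consistent appearance preservation, it produces natural, harmonious compositions, and it comes with IHOC, the first interaction-aware dataset for the task.

Details Motivation: Existing image composition methods struggle to produce natural, harmonious results when human-object interactions are involved, because they are not aware of the interaction region or type.

Contribution: 1. Proposes HOComp, which combines MLLM-driven pose guidance with detail-consistent appearance preservation; 2. Introduces IHOC, the first interaction-aware dataset.

Method: 1. MLLMs-driven Region-based Pose Guidance (MRPG) identifies the interaction region and type and generates a harmonious pose; 2. Detail-Consistent Appearance Preservation (DCAP) keeps appearance consistent through attention modulation and a multi-view loss.

Result: On the IHOC dataset, HOComp outperforms related methods both qualitatively and quantitatively, producing harmonious compositions with consistent appearance.

Insight: Combining interaction awareness with appearance consistency can markedly improve the naturalness and harmony of human-object composition.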

Abstract: While existing image-guided composition methods may help insert a foreground object onto a user-specified region of a background image, achieving natural blending inside the region with the rest of the image unchanged, we observe that these existing methods often struggle in synthesizing seamless interaction-aware compositions when the task involves human-object interactions. In this paper, we first propose HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person and their consistent appearances. Our approach includes two key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes MLLMs to identify the interaction region as well as the interaction type (e.g., holding and lifting) to provide coarse-to-fine constraints to the generated pose for the interaction while incorporating human pose landmarks to track action variations and enforcing fine-grained pose constraints; and (2) Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware attention modulation mechanism, a multi-view appearance loss, and a background consistency loss to ensure consistent shapes/textures of the foreground and faithful reproduction of the background human. We then propose the first dataset, named Interaction-aware Human-Object Composition (IHOC), for the task. Experimental results on our dataset show that HOComp effectively generates harmonious human-object interactions with consistent appearances, and outperforms relevant methods qualitatively and quantitatively.

[83] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang,Yueh-Hua Wu,Min-Hung Chen,Yu-Chiang Frank Wang,Fu-En Yang

Main category: cs.CV

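TL;DR: ThinkAct is a dual-system framework for vision-language-action reasoning that uses reinforced visual latent planning to connect reasoning with action execution in embodied tasks.

Details Motivation: Existing methods usually train multimodal models end to end without explicit reasoning, which makes multi-step planning and adaptation to complex task variations difficult. ThinkAct aims to address this.

Contribution: Proposes the ThinkAct framework, which bridges high-level reasoning and low-level action execution via reinforced visual latent planning and supports few-shot adaptation, long-horizon planning, and self-correction.

Method: A dual-system framework: 1) a multimodal large language model generates reasoning plans; 2) a visual plan latent conditions a downstream action model for execution.

Result: Performs strongly on embodied reasoning and robot manipulation benchmarks, demonstrating adaptability in complex tasks.

Insight: Explicit reasoning combined with latent plans offers a new approach to planning in multimodal embodied tasks.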

Abstract: Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.

eess.IV [Back]

[84] Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices

Haitian Wang,Xinyu Wang,Yiren Wang,Karen Lee,Zichen Geng,Xian Zhang,Kehkashan Kiran,Yu Zhang,Bo Miao

Main category: eess.IV

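TL;DR: The paper proposes QANA, a quantization-aware neuromorphic architecture for efficient skin disease classification on resource-constrained devices. It performs strongly on HAM10000 and a clinical dataset while sharply reducing latency and energy use.

Details Motivation: Accurate and efficient skin disease classification on edge devices is essential for accessible dermatological care, but it remains challenging because of computational, energy, and privacy constraints.

Contribution: Proposes the QANA architecture, which combines low latency with energy efficiency, converts seamlessly to an SNN for deployment on neuromorphic platforms, and significantly outperforms existing methods.

Method: Combines ghost modules, efficient channel attention, and squeeze-and-excitation blocks, with a quantization-aware head and spike-compatible transformations.

Result: Reaches 91.6% Top-1 accuracy and 82.4% macro F1 on HAM10000; achieves 1.5 ms latency and 1.7 mJ energy per image on BrainChip Akida hardware.

Insight: QANA offers an efficient solution for real-time, privacy-sensitive medical analysis in edge environments, with large gains in computational and energy efficiency.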

Abstract: Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6% Top-1 accuracy and 82.4% macro F1 on HAM10000, and 90.8% / 81.7% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5 ms inference latency and 1.7 mJ energy per image, reducing inference latency and energy use by over 94.6% and 98.6%, respectively, compared to GPU-based CNNs, and surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.

[85] MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation

Nand Kumar Yadav,Rodrigue Rizk,Willium WC Chen,KC

Main category: eess.IV

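TL;DR: MLRU++ is an efficient, lightweight 3D medical image segmentation model that combines multiscale features with attention, improving segmentation accuracy while lowering computational cost.

Details Motivation: Accuracy and efficiency are the key challenges in medical image segmentation, especially for 3D volumetric data. Existing hybrid CNN-Transformer architectures perform well but are complex. The paper aims for an efficient, lightweight model that balances accuracy and computational efficiency.

Contribution: Proposes the MLRU++ architecture with two key innovations, a Lightweight Channel and Bottleneck Attention Module (LCBAM) and a Multiscale Bottleneck Block (M2B), which improve performance while reducing model complexity.

Method: Combines multiscale features with attention: LCBAM enhances contextual feature encoding, and M2B captures fine-grained detail through multi-resolution feature aggregation.

Result: Achieves state-of-the-art performance on several public datasets (Synapse, BTCV, ACDC, and Decathlon Lung), with higher Dice scores and lower parameter count and computational cost.

Insight: Lightweight attention and multiscale features are key to efficient medical image segmentation; MLRU++ is a practical, high-performing option for 3D segmentation tasks.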

Abstract: Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: https://github.com/1027865/MLRUPP
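The Dice score reported above is the standard overlap measure Dice = 2|P ∩ T| / (|P| + |T|) between a predicted mask P and a ground-truth mask T. A minimal sketch for flat binary masks follows; the function name and the choice to return 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper:

```python
def dice_score(pred, target):
    """Dice coefficient of two binary masks given as equal-length 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    # One overlapping voxel, three foreground voxels in total: 2 * 1 / 3.
    print(round(dice_score(prediction, ground_truth), 3))  # 0.667
```

For multi-organ benchmarks, this per-class value is usually averaged over classes and cases to obtain the reported average Dice.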

[86] Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis

Xiaojiao Xiao,Qinmin Vivian Hu,Guanghui Wang

Main category: eess.IV

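TL;DR: PHMDiff is a new multi-scale hierarchical masked diffusion model for medical image synthesis. Multi-scale masking speeds up training and improves image quality, a Transformer provides cross-granularity regularization, and experiments show it outperforms other methods on PSNR and SSIM.

Details Motivation: In medical imaging, some modalities are often missing because of long scan times, artifacts, or patient intolerance, so efficient image synthesis methods are needed to fill the gaps.

Contribution: Proposes the PHMDiff model, which combines multi-scale masking with a Transformer to model information consistency across granularities, improving the detail and structural fidelity of synthesized images.

Method: Uses a multi-scale hierarchical masking strategy to speed up diffusion model training, and adds a Transformer with cross-granularity regularization to improve pixel-level perceptual accuracy.

Result: On two datasets, PHMDiff achieves the best PSNR and SSIM, showing strong structural integrity and detail recovery.

Insight: Combining multi-scale masking with cross-granularity regularization is a key strategy for improving medical image synthesis quality.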

Abstract: Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes random multi-scale high-proportion masks to speed up diffusion model training, and balances detail fidelity and overall structure. The Transformer-based diffusion process incorporates cross-granularity regularization, modeling the mutual information consistency across each granularity’s latent spaces, thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at https://github.com/xiaojiao929/PHMDiff
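For readers unfamiliar with the first metric, PSNR is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and the synthesized image and MAX is the largest possible intensity. A minimal sketch on flat intensity lists follows; the function name and the default 8-bit range of 255 are assumptions for illustration, and the paper's evaluation code may normalize intensities differently:

```python
import math


def psnr(reference, synthesized, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    if len(reference) != len(synthesized):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [52, 60, 61, 58]
    synthesized = [50, 61, 63, 58]
    print(f"{psnr(reference, synthesized):.2f} dB")
```

Higher values mean the synthesized image is closer to the reference; SSIM complements this by comparing local structure rather than raw pixel error.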

[87] MultiTaskDeltaNet: Change Detection-based Image Segmentation for Operando ETEM with Application to Carbon Gasification Kinetics

Yushuo Niu,Tianyu Li,Yuanyuan Zhu,Qian Yang

Main category: eess.IV

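TL;DR: The paper proposes MultiTaskDeltaNet (MTDN), which recasts semantic segmentation as a change detection problem to address the data scarcity and small-object issues that conventional deep learning faces when segmenting dynamic features in TEM images.

Details Motivation: Conventional deep learning methods for TEM image analysis face scarce labeled data, visually ambiguous features, and difficult small-object segmentation, which limits their use for automated segmentation of dynamic features in solid-state reactions.

Contribution: Proposes the MTDN architecture, a Siamese network with a U-Net backbone that uses change detection and multi-task learning to substantially improve segmentation of small, ambiguous features.

Method: Takes paired TEM images as input, extracts feature changes with a Siamese network and U-Net backbone, and uses multi-task learning to improve performance.

Result: On ETEM video data of carbon gasification, MTDN improves on conventional segmentation models by 10.22% for small and visually ambiguous features, and is particularly strong at delineating fine structural features.

Insight: Recasting the task as change detection and combining it with multi-task learning makes effective use of limited data and improves segmentation accuracy, offering a new approach to analyzing complex nanomaterial experiments.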

Abstract: Transforming in-situ transmission electron microscopy (TEM) imaging into a tool for spatially-resolved operando characterization of solid-state reactions requires automated, high-precision semantic segmentation of dynamically evolving features. However, traditional deep learning methods for semantic segmentation often encounter limitations due to the scarcity of labeled data, visually ambiguous features of interest, and small-object scenarios. To tackle these challenges, we introduce MultiTaskDeltaNet (MTDN), a novel deep learning architecture that creatively reconceptualizes the segmentation task as a change detection problem. By implementing a unique Siamese network with a U-Net backbone and using paired images to capture feature changes, MTDN effectively utilizes minimal data to produce high-quality segmentations. Furthermore, MTDN utilizes a multi-task learning strategy to leverage correlations between physical features of interest. In an evaluation using data from in-situ environmental TEM (ETEM) videos of filamentous carbon gasification, MTDN demonstrated a significant advantage over conventional segmentation models, particularly in accurately delineating fine structural features. Notably, MTDN achieved a 10.22% performance improvement over conventional segmentation models in predicting small and visually ambiguous physical features. This work bridges several key gaps between deep learning and practical TEM image analysis, advancing automated characterization of nanomaterials in complex experimental settings.

cs.CR [Back]

[88] DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling

Boheng Li,Junjie Wang,Yiming Li,Zhiyang Hu,Leyi Qi,Jianshuo Dong,Run Wang,Han Qiu,Zhan Qin,Tianwei Zhang

Main category: cs.CR

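TL;DR: DREAM is a scalable red teaming framework based on distribution modeling that automatically discovers diverse harmful prompts for text-to-image generative systems, significantly outperforming existing methods.

Details Motivation: Despite safety alignment and external filters, text-to-image models can still produce harmful content, and existing red teaming methods are limited because they optimize prompts in isolation.

Contribution: Proposes the DREAM framework, which directly models the probability distribution of harmful prompts, enabling explicit optimization of both diversity and effectiveness, and designs GC-SPSA, an efficient optimization algorithm.

Method: Draws on energy-based models to reformulate the complex objective into tractable ones, and introduces GC-SPSA to handle the long, non-differentiable generation pipeline.

Result: In extensive experiments, DREAM significantly outperforms 9 existing baselines in prompt success rate and diversity.

Insight: Red teaming should move from prompt-level optimization to distribution modeling to improve diversity and effectiveness, and efficient algorithms are key to handling complex systems.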

Abstract: Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, in this paper, we propose DREAM, a scalable red teaming framework to automatically uncover diverse problematic prompts from a given T2I system. Unlike most prior works that optimize prompts individually, DREAM directly models the probabilistic distribution of the target system’s problematic prompts, which enables explicit optimization over both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into simple and tractable objectives. We further introduce GC-SPSA, an efficient optimization algorithm that provides stable gradient estimates through the long and potentially non-differentiable T2I pipeline. The effectiveness of DREAM is validated through extensive experiments, demonstrating that it surpasses 9 state-of-the-art baselines by a notable margin across a broad range of T2I models and safety filters in terms of prompt success rate and diversity.
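The abstract does not describe GC-SPSA itself. As background, the classic SPSA estimator that its name refers to approximates a gradient from only two evaluations of a black-box loss per sample, which is why it suits non-differentiable pipelines. The sketch below implements plain SPSA only, not the paper's variant; the function name, step size, and sample count are illustrative assumptions:

```python
import random


def spsa_gradient(loss, x, c=1e-2, num_samples=8, rng=None):
    """Estimate the gradient of a black-box loss at x with plain SPSA.

    Each sample perturbs all coordinates at once with random +/-1 signs and
    uses two loss evaluations, regardless of the dimension of x.
    """
    if rng is None:
        rng = random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(num_samples):
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
        x_plus = [xi + c * di for xi, di in zip(x, delta)]
        x_minus = [xi - c * di for xi, di in zip(x, delta)]
        diff = (loss(x_plus) - loss(x_minus)) / (2.0 * c)
        for i, di in enumerate(delta):
            grad[i] += diff / di
    return [g / num_samples for g in grad]


if __name__ == "__main__":
    # Toy quadratic loss; the true gradient at (1, -2) is (2, -4).
    quadratic = lambda v: sum(vi * vi for vi in v)
    print(spsa_gradient(quadratic, [1.0, -2.0], num_samples=200))
```

Individual SPSA samples are noisy in more than one dimension, so estimates are averaged over several perturbations; the abstract's emphasis on stable gradient estimates suggests GC-SPSA targets this variance, though the mechanism is not stated here.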

cs.RO [Back]

[89] Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

Guowei Lan,Kaixian Qu,René Zurbrügg,Changan Chen,Christopher E. Mower,Haitham Bou-Ammar,Marco Hutter

Main category: cs.RO

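TL;DR: ExpTeach is a framework that grounds vision-language models (VLMs) in robotic tasks through a self-generated memory of experiences, substantially improving task success rates and adaptability.

Details Motivation: VLMs train well on internet data, but grounding them in diverse robotic tasks remains challenging. A key question is how a VLM can learn from real experience and adapt to robotic tasks.

Contribution: Proposes the ExpTeach framework, in which the VLM autonomously plans, verifies, reflects on, and adapts its behavior, builds a self-generated memory of experiences, and uses retrieval-augmented generation (RAG) to improve task performance. It also adds an on-demand image annotation module to strengthen the VLM's spatial understanding.

Method: 1. A self-generated memory records experiences gathered during task execution. 2. Retrieval-augmented generation (RAG) reuses past experiences. 3. A reflection mechanism refines behavior. 4. An image annotation module improves spatial understanding.

Result: Experiments show that reflection raises task success rates from 36% to 84%, and long-term memory raises single-trial success rates from 22% to 80%. The framework generalizes well across multiple scenarios.

Insight: By learning from real experience and reusing memory, VLMs show greater adaptability and creativity in robotic tasks, such as flexible tool use. Self-generated experience offers a viable path to grounding VLMs in complex real-world tasks.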

Abstract: Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.

[90] Improved Semantic Segmentation from Ultra-Low-Resolution RGB Images Applied to Privacy-Preserving Object-Goal Navigation

Xuying Huang,Sicong Pan,Olga Zatsarynna,Juergen Gall,Maren Bennewitz

Main category: cs.RO

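TL;DR: The paper proposes a new fully joint-learning method that combines an agglomerative feature extractor with a segmentation-aware discriminator to solve semantic segmentation of ultra-low-resolution RGB images, enabling privacy-preserving semantic object-goal navigation.

Details Motivation: As privacy concerns around mobile robots grow, existing methods usually cannot deliver both downstream task performance and privacy protection. The authors study whether semantic navigation is feasible with ultra-low-resolution images, so that tasks can be completed without exposing visual privacy.

Contribution: The main contribution is a fully joint-learning method that combines an agglomerative feature extractor and a segmentation-aware discriminator, improving semantic segmentation of ultra-low-resolution images and, in turn, the success rate of semantic object-goal navigation under privacy constraints.

Method: The method has two key components: an agglomerative feature extractor that integrates multi-scale, multi-level features, and a segmentation-aware discriminator that refines the segmentation output. The two are learned jointly to improve ultra-low-resolution semantic segmentation.

Result: The method outperforms several baselines on ultra-low-resolution semantic segmentation, and in a real-world privacy-constrained scenario the improved segmentation increases the success rate of semantic object-goal navigation.

Insight: The paper shows that joint learning and task-aware design can protect privacy without sacrificing downstream task performance, which offers a new direction for privacy-sensitive robotic applications.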

Abstract: User privacy in mobile robotics has become a critical concern. Existing methods typically prioritize either the performance of downstream robotic tasks or privacy protection, with the latter often constraining the effectiveness of task execution. To jointly address both objectives, we study semantic-based robot navigation in an ultra-low-resolution setting to preserve visual privacy. A key challenge in such scenarios is recovering semantic segmentation from ultra-low-resolution RGB images. In this work, we introduce a novel fully joint-learning method that integrates an agglomerative feature extractor and a segmentation-aware discriminator to solve ultra-low-resolution semantic segmentation, thereby enabling privacy-preserving, semantic object-goal navigation. Our method outperforms different baselines on ultra-low-resolution semantic segmentation and our improved segmentation results increase the success rate of the semantic object-goal navigation in a real-world privacy-constrained scenario.

[91] Designing for Difference: How Human Characteristics Shape Perceptions of Collaborative Robots

Sabrina Livanec,Laura Londoño,Michael Gorki,Adrian Röfer,Abhinav Valada,Andrea Kiesel

Main category: cs.RO

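TL;DR: The paper examines how human characteristics shape perceptions of collaborative robots. An online study finds that the interaction paradigm strongly affects how acceptable people judge the robot to be, underscoring the importance of prosocial design.

Details Motivation: Little research exists on how different robot behaviors are assessed in combination with diverse human needs, especially when the interaction involves protected groups such as people with disabilities or older adults.

Contribution: Fills a gap in the evaluation of human-robot collaboration by providing empirical evidence on how human characteristics and interaction paradigms affect the acceptability of robots.

Method: In an online study, 112 participants rated videos drawn from 28 variations of human-robot collaboration; the experimental group first completed a cognitive-affective mapping (CAM) exercise to support reflective assessment.

Result: Antisocial robot behavior received the lowest ratings, collaboration with older adults elicited more sensitive evaluations, and scenarios with object handovers were viewed more positively. CAM did not significantly change overall ratings but elicited more nuanced feedback.

Insight: Prosocial design is essential for collaborative robots, and reflective methods such as CAM help in developing user-centered, socially responsible robotic systems.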

Abstract: The development of assistive robots for social collaboration raises critical questions about responsible and inclusive design, especially when interacting with individuals from protected groups such as those with disabilities or advanced age. Currently, research is scarce on how participants assess varying robot behaviors in combination with diverse human needs, likely since participants have limited real-world experience with advanced domestic robots. In the current study, we aim to address this gap while using methods that enable participants to assess robot behavior, as well as methods that support meaningful reflection despite limited experience. In an online study, 112 participants (from both experimental and control groups) evaluated 7 videos from a total of 28 variations of human-robot collaboration types. The experimental group first completed a cognitive-affective mapping (CAM) exercise on human-robot collaboration before providing their ratings. Although CAM reflection did not significantly affect overall ratings, it led to more pronounced assessments for certain combinations of robot behavior and human condition. Most importantly, the type of human-robot collaboration influences the assessment. Antisocial robot behavior was consistently rated as the lowest, while collaboration with aged individuals elicited more sensitive evaluations. Scenarios involving object handovers were viewed more positively than those without them. These findings suggest that both human characteristics and interaction paradigms influence the perceived acceptability of collaborative robots, underscoring the importance of prosocial design. They also highlight the potential of reflective methods, such as CAM, to elicit nuanced feedback, supporting the development of user-centered and socially responsible robotic systems tailored to diverse populations.

cs.GR [Back]

[92] MMS Player: an open source software for parametric data-driven animation of Sign Language avatars

Fabrizio Nunnari,Shailesh Mishra,Patrick Gebhard

Main category: cs.GR

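TL;DR: MMS-Player is open source software that synthesizes sign language animations from MMS, a new sign language representation format that encodes parallel execution of signs, timing, and inflections.

Details Motivation: To make synthesized sign language animation more expressive by enriching traditional gloss-based representations with multimodal data.

Contribution: Proposes the MMS representation format and develops the open source tool MMS-Player, which integrates with Blender.

Method: Built on Python scripts for Blender; animations are generated through a command line or HTTP API call.

Result: The software supports video rendering and export to 3D animation formats, and is open source and easy to use.

Insight: Multimodal data can markedly improve the naturalness and expressiveness of sign language animation.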

Abstract: This paper describes the MMS-Player, an open source software able to synthesise sign language animations from a novel sign language representation format called MMS (MultiModal Signstream). The MMS enhances gloss-based representations by adding information on parallel execution of signs, timing, and inflections. The implementation consists of Python scripts for the popular Blender 3D authoring tool and can be invoked via command line or HTTP API. Animations can be rendered as videos or exported in other popular 3D animation exchange formats. The software is freely available under GPL-3.0 license at https://github.com/DFKI-SignLanguage/MMS-Player.

cs.LG [Back]

[93] RDMA: Cost Effective Agent-Driven Rare Disease Discovery within Electronic Health Record Systems

John Wu,Adam Cross,Jimeng Sun

Main category: cs.LG

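TL;DR: RDMA is an agent-based framework for efficiently discovering rare disease information in electronic health records (EHR). It processes data locally to address privacy concerns, substantially improving performance while reducing cost.

Details Motivation: Rare diseases are hard to capture in EHR with standard ICD codes, and existing methods fall short on privacy, efficiency, and clinical reasoning. RDMA aims to address these issues.

Contribution: Proposes the RDMA framework, which mirrors how medical experts recognize patterns, handles abbreviations, identifies implicit disease mentions, and reasons locally, substantially improving performance while reducing cost.

Method: RDMA takes an agent-driven approach that connects scattered clinical observations and runs locally, avoiding reliance on cloud services while improving inference efficiency.

Result: Experiments show that RDMA improves F1 by upwards of 30% and cuts inference costs 10-fold, while reducing privacy risk.

Insight: Local processing and clinical reasoning are key to rare disease discovery; RDMA gives clinical practice an efficient, privacy-preserving solution.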

Abstract: Rare diseases affect 1 in 10 Americans, yet standard ICD coding systems fail to capture these conditions in electronic health records (EHR), leaving crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, raise privacy concerns with cloud processing, and lack clinical reasoning abilities. We present Rare Disease Mining Agents (RDMA), a framework that mirrors how medical experts identify rare disease patterns in EHR. RDMA connects scattered clinical observations that together suggest specific rare conditions. By handling clinical abbreviations, recognizing implicit disease patterns, and applying contextual reasoning locally on standard hardware, RDMA reduces privacy risks while improving F1 performance by upwards of 30% and decreasing inference costs 10-fold. This approach helps clinicians avoid the privacy risk of using cloud services while accessing key rare disease information from EHR systems, supporting earlier diagnosis for rare disease patients. Available at https://github.com/jhnwu3/RDMA.

[94] Scaling Linear Attention with Sparse State Expansion

Yuqi Pan,Yongqi An,Zheng Li,Yuhong Chou,Ruijie Zhu,Xiaohui Wang,Mingxuan Wang,Jinqiao Wang,Guoqi Li

Main category: cs.LG

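TL;DR: The paper proposes Sparse State Expansion (SSE), which uses sparse updates and state partitioning to address the trade-off between efficiency and performance that linear attention faces on long-context tasks.

Details Motivation: The Transformer architecture is limited in long-context scenarios by quadratic computation and linear memory growth. Linear attention improves efficiency but degrades performance, so improvements are needed.

Contribution: Proposes a sparse state update and Sparse State Expansion (SSE), which expands the contextual state into partitions to balance efficiency and performance.

Method: Updates the state sparsely via softmax-based top-k hard classification, and uses SSE to split the state into multiple partitions to increase capacity, with efficient parallel implementations.

Result: SSE performs strongly on language modeling, in-context retrieval, and mathematical reasoning; the 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models.

Insight: Combining sparse classification with state partitioning can markedly improve both the efficiency and the performance of long-context modeling, and points to a new direction for small reasoning models.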

Abstract: The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
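The row-sparse update described in the abstract can be illustrated with a toy sketch. The abstract does not give the exact update rule, so everything below is an assumption made for illustration: the state is a list of rows, each token produces one logit per row, and only the top-k rows by softmax probability receive the token's value vector, weighted by that probability. The names `softmax` and `sparse_state_update` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_state_update(state, logits, value, k):
    """Update only the top-k rows of `state` in place.

    state:  list of rows (each a list of floats), the compressed context
    logits: one score per row, treated as class logits (assumption)
    value:  vector to write into the selected rows
    k:      number of rows that receive the update
    Returns the indices of the rows that were updated.
    """
    probs = softmax(logits)
    # Hard top-k selection: rows outside the top k are left untouched.
    top_rows = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    for i in top_rows:
        # Probability-weighted additive write (an illustrative choice).
        state[i] = [s + probs[i] * v for s, v in zip(state[i], value)]
    return top_rows


state = [[0.0, 0.0] for _ in range(4)]
updated = sparse_state_update(state, [0.1, 2.0, -1.0, 1.5], [1.0, 1.0], k=2)
```

Because only k rows change per token, rows assigned to other "classes" are not overwritten, which is one way to read the abstract's claim about reduced inter-class interference.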

[95] Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Mehul Damani,Isha Puri,Stewart Slocum,Idan Shenfeld,Leshem Choshen,Yoon Kim,Jacob Andreas

Main category: cs.LG

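TL;DR: The paper proposes RLCR, which trains language models with a reward that combines a binary correctness score and a calibration score. This improves accuracy and calibrated confidence together and addresses the poor calibration caused by conventional RL training.

Details Motivation: Conventional RL training of language models uses only binary reward functions that evaluate output correctness, which degrades calibration and raises the risk of incorrect generations.

Contribution: Proposes RLCR, which augments the reward function with a Brier score (a calibration score), effectively improving both accuracy and calibration.

Method: RLCR has the model produce both a prediction and a numerical confidence estimate, and trains it on a reward that combines the binary correctness score with the Brier score.

Result: Experiments show that RLCR substantially improves calibration with no loss in accuracy, outperforming ordinary RL training and post-hoc confidence scoring methods.

Insight: Explicitly optimizing for calibration yields more reliable reasoning models, and the confidence estimates can be used at test time to improve performance.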

Abstract: When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or “hallucinate”) in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score – a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations – outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
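A minimal sketch of a reward that augments a binary correctness score with a Brier score is shown below. The abstract does not state the exact formula, so this assumes one natural form: the correctness indicator minus the squared error between the stated confidence and that indicator. The function name `rlcr_reward` is hypothetical.

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    """Correctness score minus a Brier-style penalty (assumed form).

    correct:    whether the model's final answer was right
    confidence: the model's verbalized probability of being right, in [0, 1]
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


# A confident wrong answer is penalized; an unconfident wrong answer is not.
print(rlcr_reward(False, 1.0))  # -1.0
print(rlcr_reward(False, 0.0))  # 0.0
print(rlcr_reward(True, 0.5))   # 0.75
```

Under this form, guessing with high stated confidence is no longer free: a wrong answer at confidence 1.0 scores lower than a wrong answer at confidence 0.0, which is the incentive a purely binary reward lacks.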

[96] Understanding Generalization, Robustness, and Interpretability in Low-Capacity Neural Networks

Yash Kumar

Main category: cs.LG

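TL;DR: The paper uses low-capacity neural networks to study the relationship among generalization, robustness, and interpretability. It finds that the minimum required model capacity scales directly with task complexity and that sparse subnetworks retain high performance even at high pruning rates.

Details Motivation: Although modern deep learning relies on massive over-parameterized models, the fundamental relationship among capacity, sparsity, and robustness in low-capacity networks remains underexplored. The author addresses this gap experimentally.

Contribution: 1. Shows a direct relationship between minimum model capacity and task complexity; 2. Finds that sparse subnetworks retain high performance at high pruning rates; 3. Demonstrates that over-parameterization improves robustness against input corruption.

Method: Creates binary classification tasks of increasing difficulty from the MNIST dataset (e.g., 0 and 1 vs. 4 and 9), then analyzes generalization, post-pruning performance, and robustness under input corruption.

Result: Low-capacity networks need more capacity to generalize as task complexity grows, while sparse subnetworks at high pruning rates still preserve the reasoning process of the original model.

Insight: Low-capacity networks exhibit trade-offs between sparsity and robustness: over-parameterization helps robustness against input corruption, while sparse subnetworks preserve the core logic of the dense model.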

Abstract: Although modern deep learning often relies on massive over-parameterized models, the fundamental interplay between capacity, sparsity, and robustness in low-capacity networks remains a vital area of study. We introduce a controlled framework to investigate these properties by creating a suite of binary classification tasks from the MNIST dataset with increasing visual difficulty (e.g., 0 and 1 vs. 4 and 9). Our experiments reveal three core findings. First, the minimum model capacity required for successful generalization scales directly with task complexity. Second, these trained networks are robust to extreme magnitude pruning (up to 95% sparsity), revealing the existence of sparse, high-performing subnetworks. Third, we show that over-parameterization provides a significant advantage in robustness against input corruption. Interpretability analysis via saliency maps further confirms that these identified sparse subnetworks preserve the core reasoning process of the original dense models. This work provides a clear, empirical demonstration of the foundational trade-offs governing simple neural networks.
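Magnitude pruning, the technique named in the abstract, can be sketched in a few lines. This is a generic illustration on a flat list of weights, not the paper's code; the function name `magnitude_prune` and the choice to round the prune count down are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of `weights`.

    weights:  flat list of floats (e.g. one flattened layer)
    sparsity: fraction of weights to remove, in [0, 1]
    Returns a new list; the input is not modified.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(len(weights) * sparsity)  # rounded down (assumption)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


print(magnitude_prune([0.5, -0.1, 0.05, -2.0], sparsity=0.5))
# [0.5, 0.0, 0.0, -2.0]
```

At the 95% sparsity level mentioned in the abstract, the same call would keep only the largest 5% of weights by magnitude; whether pruning is applied per layer or globally is not stated there.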

[97] Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Viktor Muryn,Marta Sumyk,Mariya Hirna,Sofiya Garkot,Maksym Shamrai

Main category: cs.LG

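TL;DR: Screen2AX is a vision-based framework that automatically generates real-time, tree-structured accessibility metadata from a single screenshot, filling the gap left by existing techniques that do not capture the full hierarchy of desktop interfaces.

Details Motivation: Many macOS applications lack complete accessibility metadata, leaving users who depend on assistive tools unable to use them fully. Existing techniques do not reproduce the full hierarchical structure of desktop interfaces.

Contribution: Presents the first method to automatically generate real-time, tree-structured accessibility metadata and publicly releases three macOS application datasets; evaluation shows a substantial improvement in autonomous agents' task execution.

Method: Combines vision-language and object detection models to detect UI elements in a screenshot and organize them hierarchically. The released datasets support UI element detection, grouping, and metadata annotation.

Result: Screen2AX reaches a 77% F1 score in reconstructing a complete accessibility tree, delivers a 2.2x improvement in task execution over native accessibility representations, and surpasses OmniParser V2 on the ScreenSpot benchmark.

Insight: Hierarchical accessibility trees substantially improve autonomous agents' ability to interpret complex interfaces and compensate for missing accessibility metadata.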

Abstract: Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers - our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS’s system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.

[98] Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Junhao Shen,Haiteng Zhao,Yuzhe Gu,Songyang Gao,Kuikun Liu,Haian Huang,Jianfei Gao,Dahua Lin,Wenwei Zhang,Kai Chen

Main category: cs.LG

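TL;DR: The paper proposes SOPHIA, a semi-off-policy reinforcement learning method that combines on-policy visual understanding from a trainable large vision-language model (LVLM) with off-policy slow-thinking reasoning from a language model. It avoids the visual hallucinations that direct off-policy RL can cause and substantially improves performance on multimodal reasoning tasks.

Details Motivation: LVLMs need slow-thinking reasoning to solve complex multimodal tasks, but conventional on-policy RL is limited by the model's initial abilities. Direct off-policy RL can cause visual hallucinations because visual perception abilities are mismatched across models. A new approach is therefore needed.

Contribution: Proposes SOPHIA, a semi-off-policy RL method that combines on-policy visual understanding with off-policy reasoning and uses outcome-based rewards together with backward-propagated visual rewards, substantially improving LVLM performance on multimodal reasoning tasks.

Method: SOPHIA builds a semi-off-policy behavior model that combines the LVLM's on-policy visual understanding with a language model's off-policy slow-thinking reasoning, assigns outcome-based rewards to the reasoning, and propagates visual rewards backward. The LVLM then learns from the resulting reasoning trajectories via off-policy RL algorithms.

Result: SOPHIA improves InternVL3.0-38B by 8.50% on average, reaches state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on certain tasks.

Insight: As a policy initialization, SOPHIA outperforms supervised fine-tuning and direct on-policy RL, giving a better starting point for further on-policy training while avoiding visual hallucinations.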

Abstract: Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.

cs.AI [Back]

[99] Why Braking? Scenario Extraction and Reasoning Utilizing LLM

Yin Wu,Daniel Slieter,Vivek Subramanian,Ahmed Abouelazm,Robin Bohn,J. Marius Zöllner

Main category: cs.AI

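TL;DR: The paper proposes a framework that uses a large language model (LLM) to extract and understand braking scenarios from driving data. It supports classification and retrieval of both known and out-of-distribution (OOD) scenarios and outperforms conventional rule-based methods.

Details Motivation: As more vehicles are equipped with ADAS, driving data has grown dramatically, yet most of it captures only routine driving behavior. Identifying and understanding safety-critical scenarios, such as braking events, within this data is the main challenge. Existing rule-based methods generalize poorly in complex urban environments.

Contribution: 1) Uses an LLM to bridge the gap between low-level numerical signals and natural language descriptions; 2) Proposes a dual-path scenario retrieval method that supports category-based search for known scenarios and embedding-based retrieval for unknown OOD scenarios; 3) Curates scenario annotations on the Argoverse 2 Sensor Dataset and evaluates on them.

Method: 1) Uses an LLM to interpret driving scenarios; 2) Combines category-based search with embedding-based retrieval to cover both known and OOD scenarios; 3) Validates the framework on a real-world dataset.

Result: Experiments show that the method outperforms rule-based baselines and generalizes well to OOD scenarios.

Insight: LLMs open new possibilities for understanding complex driving scenarios, and dual-path retrieval offers a flexible way to handle unknown scenarios.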

Abstract: The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of it captures routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages a Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling the LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.
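The dual-path retrieval idea can be sketched as a simple dispatcher: a query naming a known category is answered by exact category match, and anything else falls back to nearest-neighbour search over embeddings. The abstract does not describe the implementation, so the record layout (`id`, `category`, `embedding`), the use of cosine similarity, and the function names are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(scenarios, known_categories, query_category=None,
             query_embedding=None, top_n=3):
    """Return scenario ids via category match or embedding similarity."""
    # Path 1: the query names a known category, so filter by label.
    if query_category in known_categories:
        hits = [s["id"] for s in scenarios if s["category"] == query_category]
        return hits[:top_n]
    # Path 2: unknown (OOD) query, so rank all scenarios by similarity.
    if query_embedding is None:
        raise ValueError("an embedding is required for out-of-category queries")
    ranked = sorted(scenarios,
                    key=lambda s: cosine(s["embedding"], query_embedding),
                    reverse=True)
    return [s["id"] for s in ranked[:top_n]]


# Toy records with made-up labels and 2-D embeddings.
scenarios = [
    {"id": "a", "category": "lead_vehicle_braking", "embedding": [1.0, 0.0]},
    {"id": "b", "category": "pedestrian_crossing", "embedding": [0.0, 1.0]},
    {"id": "c", "category": "unknown", "embedding": [0.7, 0.7]},
]
known = {"lead_vehicle_braking", "pedestrian_crossing"}
print(retrieve(scenarios, known, query_category="pedestrian_crossing"))  # ['b']
print(retrieve(scenarios, known, query_embedding=[0.6, 0.8], top_n=2))   # ['c', 'b']
```

In a real pipeline the embeddings would come from a text or scenario encoder applied to the LLM-generated descriptions; that step is outside this sketch.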

[100] SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Shuhao Mei,Yongchao Long,Shan Cao,Xiaobo Han,Shijia Geng,Jinbo Sun,Yuxi Zhou,Shenda Hong

Main category: cs.AI

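TL;DR: SpiroLLM is the first multimodal large language model that can understand spirograms. It combines morphological features with numerical data to generate comprehensive diagnostic reports and shows high accuracy and robustness in COPD diagnosis.

Details Motivation: COPD is a major chronic respiratory disease worldwide. Existing AI models give no rationale for their diagnoses, and large language models cannot understand spirograms. SpiroLLM aims to solve this through multimodal fusion.

Contribution: Proposes SpiroLLM, the first multimodal large language model that can understand spirograms. It extracts curve features with a SpiroEncoder, aligns them with numerical data, and generates interpretable diagnostic reports.

Method: A SpiroEncoder extracts morphological features from respiratory curves, a SpiroProjector aligns those features with PFT numerical values, and a large language model then generates the diagnostic report.

Result: SpiroLLM achieves a diagnostic AUROC of 0.8980 and maintains a 100% valid response rate in a robustness test with missing core data, far surpassing the 13.4% of a text-only model.

Insight: Fusing physiological signals with large language models offers a new paradigm for interpretable and reliable clinical decision support tools.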

Abstract: Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot yet understand spirograms, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.

[101] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Shanghai AI Lab: Xiaoyang Chen,Yunhao Chen,Zeren Chen,Zhiyun Chen,Hanyun Cui,Yawen Duan,Jiaxuan Guo,Qi Guo,Xuhao Hu,Hong Huang,Lige Huang,Chunxiao Li,Juncheng Li,Qihao Lin,Dongrui Liu,Xinmin Liu,Zicheng Liu,Chaochao Lu,Xiaoya Lu,Jingjing Qu,Qibing Ren,Jing Shao,Jingwei Shi,Jingwei Sun,Peng Wang,Weibing Wang,Jia Xu,Lewen Yan,Xiao Yu,Yi Yu,Boxuan Zhang,Jie Zhang,Weichen Zhang,Zhijie Zheng,Tianyi Zhou,Bowen Zhou

Main category: cs.AI

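TL;DR: The report comprehensively assesses the risks of frontier AI models. Using E-T-C analysis and the AI-$45^\circ$ Law, it identifies seven critical risk areas and defines red, yellow, and green risk zones. Experiments show that all current frontier AI models fall in the green and yellow zones, with none crossing a red line.

Details Motivation: As AI technology advances rapidly, its potential for uncontrollable risk is becoming more apparent. The report aims to identify and manage these frontier risks through systematic analysis and to guide the safe deployment of AI.

Contribution: Proposes a risk assessment framework based on E-T-C analysis and the AI-$45^\circ$ Law, defines red, yellow, and green risk zones, and classifies the risks of frontier AI models accordingly.

Method: Applies E-T-C analysis (deployment environment, threat source, enabling capability) and the AI-$45^\circ$ Law, uses "red line" and "yellow line" indicators to delimit risk zones, and quantitatively evaluates risk in seven areas.

Result: All frontier AI models evaluated fall in the green and yellow zones, with none crossing a red line. In some areas, such as persuasion and manipulation, most models are in the yellow zone, while for self-replication and strategic deception most models are in the green zone.

Insight: Frontier AI models have not yet reached uncontrollable risk levels, but some areas, such as biological and chemical risks, still require more detailed assessment. The report calls for collective action to address potential risks.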

Abstract: To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-$45^\circ$ Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
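The zone logic described in the abstract reduces to a two-threshold classification, sketched below. The report defines red and yellow lines qualitatively; representing a model's risk as a single number and treating a score equal to a line as crossing it are assumptions made only for illustration, and the name `risk_zone` is hypothetical.

```python
def risk_zone(score: float, yellow_line: float, red_line: float) -> str:
    """Map a numeric risk score to a zone given two thresholds.

    score:       hypothetical scalar risk measurement for one model and area
    yellow_line: early-warning threshold
    red_line:    intolerable threshold, must exceed yellow_line
    """
    if yellow_line >= red_line:
        raise ValueError("yellow_line must be below red_line")
    if score >= red_line:
        return "red"      # suspend development and/or deployment
    if score >= yellow_line:
        return "yellow"   # strengthened mitigations, controlled deployment
    return "green"        # routine deployment with continuous monitoring


# Made-up scores against made-up thresholds.
for area, score in [("cyber offense", 0.2), ("persuasion", 0.6)]:
    print(area, risk_zone(score, yellow_line=0.5, red_line=0.9))
# cyber offense green
# persuasion yellow
```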