Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 101]
- eess.IV [Total: 2]
- cs.CR [Total: 1]
- cs.LG [Total: 1]
- cs.RO [Total: 3]
- cs.AI [Total: 4]
- cs.DL [Total: 1]
- cs.MM [Total: 2]
- cs.NE [Total: 1]
cs.CL
[1] Refine Thought: A Test-Time Inference Method for Embedding Model Reasoning
Guangzhi Wang,Kai Li,Yinghao Jiao,Zhi Liu
Main category: cs.CL
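TL;DR: The paper proposes RT (Refine Thought), a test-time inference method that strengthens the semantic reasoning ability of text embedding models by running multiple forward passes. It improves results on reasoning-focused tasks such as BRIGHT and PJBenchmark1 while maintaining performance on general-purpose tasks such as C-MTEB.
Details
Motivation: Existing text embedding models acquire rich semantic reasoning ability during pretraining, but how to further activate that ability at test time remains unexplored.
Contribution: Proposes RT, which activates the model's semantic reasoning ability through multiple forward passes (test-time inference) and improves performance on specific tasks.
Method: RT runs multiple forward passes of the model at test time, progressively refining the semantic representation to strengthen reasoning.
Result: Significant improvements on semantic reasoning tasks such as BRIGHT and PJBenchmark1, with stable performance on general-purpose tasks such as C-MTEB.
Insight: RT shows that the latent semantic reasoning ability of pretrained models can be further activated through test-time refinement, offering a new direction for reasoning in embedding models.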
Abstract: We propose RT (Refine Thought), a method that can enhance the semantic reasoning ability of text embedding models. The method obtains the final semantic representation by running multiple forward passes of the text embedding model. Experiments show that RT achieves significant improvements on semantic reasoning tasks in BRIGHT and the person job matching benchmark PJBenchmark1, while maintaining consistent performance on general-purpose semantic understanding tasks such as C-MTEB. Our results indicate that RT is effective because it further activates the semantic reasoning ability learned during pretraining by decoder-only text embedding models (e.g., Qwen3-Embedding-8B). RT can be seen as a test-time inference method.
[2] What Works for ‘Lost-in-the-Middle’ in LLMs? A Study on GM-Extract and Mitigations
Mihir Gupte,Eshan Dixit,Muhammad Tayyab,Arun Adiththan
Main category: cs.CL
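TL;DR: The paper studies the "lost-in-the-middle" problem that large language models (LLMs) exhibit on long contexts and introduces GM-Extract, a new benchmark dataset for evaluating LLM retrieval of control variables. It diagnoses failures with two metrics (spatial retrieval and semantic retrieval), analyzes 7-8B parameter models, and compares the effect of different data representations. It also summarizes the effectiveness and limitations of black-box and white-box mitigation methods.
Details
Motivation: LLMs suffer from the "lost-in-the-middle" phenomenon on long contexts, which degrades retrieval performance. Understanding and addressing it requires a targeted evaluation tool and methodology.
Contribution: 1. Introduces the GM-Extract benchmark dataset for evaluating LLM retrieval ability; 2. designs two metrics (spatial and semantic retrieval) to diagnose failures; 3. systematically evaluates 7-8B parameter models; 4. summarizes the applicability and limitations of mitigation methods.
Method: 1. Evaluates models on the GM-Extract dataset; 2. defines two metrics, the Document Metric and the Variable Extraction Metric; 3. analyzes how data representation affects performance; 4. surveys mitigation methods and classifies them as black-box or white-box.
Result: Changing the data representation significantly affects retrieval performance, but a U-shaped curve is not consistently observed. The effect of mitigation methods varies by scenario, and some have a negative impact.
Insight: 1. Data representation is a key factor in retrieval performance; 2. mitigation methods must be chosen with care, because their effect depends heavily on the task and the model; 3. further research is needed on how to improve long-context processing.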
Abstract: The diminishing ability of large language models (LLMs) to effectively utilize long-range context, known as the "lost-in-the-middle" phenomenon, poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables. To accurately diagnose failure modes, we propose a simple yet elegant evaluation system using two distinct metrics: one for spatial retrieval capability (Document Metric) and the other for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply by altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we perform a literature survey of mitigation methods, which we categorize into two distinct approaches: black-box and white-box methods. We then apply these techniques to our benchmark, finding that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies successfully improve performance, as well as surprising cases where they lead to a negative impact, providing a comprehensive understanding of their utility in a practical context.
[3] Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition
Yilun Zhu,Nikhita Vedula,Shervin Malmasi
Main category: cs.CL
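TL;DR: The paper proposes Hint-Augmented Re-ranking, an LLM-based query decomposition method that extracts structured hints to improve e-commerce search, yielding notable gains in search and ranking quality while remaining efficient.
Details
Motivation: E-commerce queries containing superlatives (e.g., "best", "most popular") require comparing candidate products along multiple dimensions. Traditional methods struggle to capture the latent intent, so an approach combining language understanding with domain knowledge is needed.
Contribution: 1) A framework of LLM-generated attribute-value hints that captures the latent intent of a query; 2) an efficient method for transferring the interpretations to lightweight models, avoiding the high latency of calling an LLM directly; 3) search and ranking metrics that clearly exceed the baselines.
Method: Two parts: 1) an LLM generates structured hints in parallel with retrieval; 2) the hints are integrated into the ranking pipeline of a lightweight model for efficient re-ranking.
Result: Search performance (MAP) improves by 10.9 points and ranking performance (MRR) by 5.9 points, with a substantial reduction in latency.
Insight: Superlative semantics can be transferred between models through structured hints, which offers a new angle on linguistic interpretation in retrieval systems and addresses efficiency concerns in real deployments.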
Abstract: Search queries with superlatives (e.g., best, most popular) require comparing candidates across multiple dimensions, demanding linguistic understanding and domain knowledge. We show that LLMs can uncover latent intent behind these expressions in e-commerce queries through a framework that extracts structured interpretations or hints. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval, enabling efficient integration into the ranking pipeline. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines. Since direct LLM-based reranking faces prohibitive latency, we develop an efficient approach transferring superlative interpretations to lightweight models. Our findings provide insights into how superlative semantics can be represented and transferred between models, advancing linguistic interpretation in retrieval systems while addressing practical deployment constraints.
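For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how MAP and MRR are commonly computed from binary relevance judgments. It is not taken from the paper: the function names and the toy rankings are illustrative assumptions, and average precision is normalized by the number of relevant items that appear in the ranked list (which assumes every relevant item was retrieved).

```python
from typing import Callable, List


def average_precision(relevance: List[int]) -> float:
    """Average precision for one ranked list of binary relevance labels (1 = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(relevance: List[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_metric(metric: Callable[[List[int]], float], runs: List[List[int]]) -> float:
    """Average a per-query metric over all queries."""
    return sum(metric(r) for r in runs) / len(runs)


# Hypothetical rankings for two queries (not data from the paper).
runs = [[0, 1, 1, 0], [1, 0, 0, 1]]
map_score = mean_metric(average_precision, runs)
mrr_score = mean_metric(reciprocal_rank, runs)
print(f"MAP={map_score:.3f} MRR={mrr_score:.3f}")
```

A gain of "10.9 points in MAP" corresponds to an absolute difference of 0.109 on this 0-to-1 scale.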
[4] Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Chenchen Kuai,Zihao Li,Braden Rosen,Stephanie Paan,Navid Jafari,Jean-Louis Briaud,Yunlong Zhang,Youssef M. A. Hashash,Yang Zhou
Main category: cs.CL
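TL;DR: The paper proposes the MoRA-RAG framework, which uses a knowledge-grounded LLM to analyze and structure disaster reconnaissance reports in support of multi-hazard understanding, substantially improving accuracy and reducing hallucinations.
Details
Motivation: The unstructured nature of disaster reconnaissance reports makes them hard to use systematically, and LLMs tend to produce unreliable output when they lack domain knowledge. The paper sets out to address this.
Contribution: Proposes the MoRA-RAG framework, which combines dynamic retrieval, agentic chunking, and a verification loop to markedly improve the accuracy and trustworthiness of LLM analysis of hazard reports.
Method: A Mixture-of-Retrieval mechanism routes queries dynamically, agentic chunking preserves contextual coherence, and a verification loop refines the retrieved results.
Result: Reaches up to 94.5% accuracy on the HazardRecQA dataset, outperforming zero-shot LLMs and state-of-the-art RAG systems while reducing hallucinations.
Insight: Knowledge grounding and structured retrieval markedly improve LLM reliability in specialized domains, and open-weight LLMs can reach performance comparable to proprietary models.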
Abstract: Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
[5] HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection
Junjie Wu,Yumeng Fu,Nan Yu,Guohong Fu
Main category: cs.CL
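TL;DR: HiEAG is a novel hierarchical evidence-augmented generation framework that draws on the knowledge of multimodal large language models (MLLMs) to refine external consistency checking, improving the accuracy of multimodal out-of-context misinformation detection.
Details
Motivation: Existing methods rely too heavily on internal consistency and overlook how much the external consistency between image-text pairs and external evidence matters for misinformation detection, which limits their effectiveness.
Contribution: 1. Proposes the HiEAG framework, which decomposes external consistency checking into three modules: retrieval, reranking, and rewriting; 2. introduces Automatic Evidence Selection Prompting (AESP) and Automatic Evidence Generation Prompting (AEGP) to improve task adaptation; 3. supports explanations for its judgments and achieves strong performance with instruction tuning.
Method: HiEAG obtains external evidence through a retrieval module, reranks the evidence with AESP, rewrites it with AEGP to fit the task, and finally detects misinformation with a multimodal large language model.
Result: On multiple benchmark datasets, HiEAG surpasses previous state-of-the-art methods, achieving higher accuracy over all samples.
Insight: Careful use of external evidence combined with multimodal large language models markedly improves out-of-context misinformation detection and also strengthens the model's ability to explain its decisions.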
Abstract: Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistencies between different modalities for supporting or refuting image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significance of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework to refine external consistency checking through leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline, which integrates reranking and rewriting, apart from retrieval. The evidence reranking module utilizes Automatic Evidence Selection Prompting (AESP), which acquires the relevant evidence item from the products of evidence retrieval. Subsequently, the evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach enables explanation for judgment, and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in the accuracy over all samples.
[6] Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Le Yu,Zhengyue Zhao,Yawen Zheng,Yunhao Liu
Main category: cs.CL
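TL;DR: The paper presents an attack called "Stealth Fine-Tuning", which elicits harmful reasoning chains through segment-level interference and reuses them as supervised fine-tuning data, efficiently breaking the safety alignment of reasoning-augmented vision-language models (RVLMs). The method achieves a high attack success rate with only a small number of samples and modest compute.
Details
Motivation: RVLMs rely on safety alignment to prevent harmful behavior, but their exposed chain-of-thought (CoT) traces introduce a new attack surface. The paper explores how these traces can be exploited to break a model's safety alignment efficiently.
Contribution: 1. Proposes "Stealth Fine-Tuning", a lightweight, distribution-consistent fine-tuning method; 2. breaks safety alignment efficiently through segment-level interference and self-generated outputs; 3. shows experimentally that the method reaches a high attack success rate at low cost and with few samples.
Method: 1. Elicits harmful reasoning chains with segment-level interference; 2. uses the self-generated outputs as supervised fine-tuning data; 3. designs a turn-based weighted loss function; 4. fine-tunes with QLoRA under a small compute budget.
Result: With only 499 samples and under 3 hours of training on a single A100, the attack success rate (ASR) exceeds the baseline (IDEATOR) by 38.52% while the model's general reasoning ability is preserved.
Insight: 1. The safety alignment of RVLMs has latent weaknesses; 2. self-generated data can be used for efficient fine-tuning attacks; 3. lightweight methods can sharply reduce the cost of an attack.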
Abstract: Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data. A turn-based weighted loss design yields a lightweight, distribution-consistent fine-tuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.
[7] Selective Weak-to-Strong Generalization
Hao Lang,Fei Huang,Yongbin Li
Main category: cs.CL
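TL;DR: The paper proposes a selective weak-to-strong generalization (W2SG) framework that trains a binary classifier to identify questions the strong model can already answer, avoiding unnecessary weak supervision, and refines weak labels with a graph smoothing method. Experiments show it outperforms baselines and that the classifier generalizes across tasks, supporting the goal of superalignment.
Details
Motivation: As superhuman models surpass human ability, humans will only be able to provide weak supervision, yet existing methods apply weak supervision indiscriminately, which can hurt model performance. The paper addresses this problem.
Contribution: Proposes a selective W2SG framework, consisting of a binary classifier and a graph smoothing method for weak labels, that avoids harmful weak supervision and improves model performance.
Method: 1. Trains a binary classifier P(IK) to identify questions the strong model can answer and uses the model's self-generated labels for alignment on those questions; 2. refines weak labels with a graph smoothing method.
Result: Outperforms baseline methods on three benchmarks; the classifier P(IK) generalizes across tasks and difficulty levels.
Insight: Using weak supervision selectively is the key point, and the classifier's ability to generalize suggests the approach can help achieve superalignment.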
Abstract: Future superhuman models will surpass the ability of humans and humans will only be able to weakly supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
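The selection-plus-smoothing idea can be illustrated with a small sketch. The abstract does not specify the graph construction, the smoothing rule, or the decision threshold, so everything below is an assumption made for illustration: one step of neighbor averaging over a hand-written adjacency matrix, and a fixed threshold of 0.5 on hypothetical P(IK) scores.

```python
from typing import List


def smooth_labels(weak: List[float], adjacency: List[List[float]], alpha: float = 0.5) -> List[float]:
    """One smoothing step: blend each weak soft label with the weighted mean of its neighbors.

    This is generic neighbor averaging, not necessarily the paper's smoothing method.
    """
    smoothed = []
    for i, row in enumerate(adjacency):
        total = sum(row)
        neighbor_mean = sum(w * weak[j] for j, w in enumerate(row)) / total if total else weak[i]
        smoothed.append((1 - alpha) * weak[i] + alpha * neighbor_mean)
    return smoothed


def select_labels(p_ik: List[float], self_labels: List[float],
                  weak_labels: List[float], threshold: float = 0.5) -> List[float]:
    """Use the strong model's own label when P(IK) says it likely knows the answer,
    otherwise fall back to the (smoothed) weak label."""
    return [s if p >= threshold else w for p, s, w in zip(p_ik, self_labels, weak_labels)]


# Hypothetical toy data: three questions on a fully connected similarity graph.
weak = [1.0, 0.0, 1.0]
adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
smoothed = smooth_labels(weak, adjacency)            # [0.75, 0.5, 0.75]
p_ik = [0.9, 0.2, 0.4]                               # assumed classifier outputs
self_labels = [1.0, 1.0, 0.0]                        # assumed strong-model labels
final = select_labels(p_ik, self_labels, smoothed)   # [1.0, 0.5, 0.75]
print(final)
```

Only the first question clears the threshold, so it keeps the strong model's own label; the other two fall back to weak labels that have been pulled toward their neighbors.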
[8] Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning
Rui Liu,Yuan Zhao,Zhenqi Jia
Main category: cs.CL
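TL;DR: The paper proposes Authentic-Dubber, a retrieve-augmented director-actor interaction learning scheme that simulates the real movie dubbing workflow to improve emotional expressiveness and lip-sync.
Details
Motivation: Existing movie dubbing methods ignore the dynamic collaboration between director and actor and cannot adequately reproduce the emotional expression of a real dubbing workflow.
Contribution: 1. Builds a multimodal reference footage library; 2. proposes an emotion-similarity-based retrieval-augmentation strategy; 3. develops a progressive graph-based speech generation approach.
Method: Integrates LLMs to understand emotional representations, retrieves multimodal information to achieve emotional alignment, and incorporates the retrieved results incrementally when generating speech.
Result: Evaluation on the V2C Animation dataset confirms clear improvements in emotional expressiveness and dubbing quality.
Insight: Simulating the real director-actor interaction can markedly improve the emotional authenticity and expressiveness of automated dubbing.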
Abstract: The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker’s timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor’s final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
[9] AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
Gabrial Zencha Ashungafac,Mardhiyah Sanni,Busayo Awobade,Alex Gichamba,Tobi Olatunji
Main category: cs.CL
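TL;DR: AfriSpeech-MultiBench is the first multi-domain, multi-country ASR benchmark suite for over 100 African English accents, covering 10+ countries and 7 application domains.
Details
Motivation: There is currently no publicly available ASR evaluation tool that reflects Africa's linguistic diversity, and this gap needs to be filled.
Contribution: Releases the first multi-country, multi-domain ASR benchmark suite for Africa and evaluates open-source, closed-source, and multimodal LLM-based models.
Method: Using African-accented English datasets, tests a range of models on spontaneous and non-spontaneous speech across different countries and domains.
Result: Open-source ASR performs well on spontaneous speech but poorly on non-native and noisy dialogue; multimodal LLMs are more robust to accents but weaker at recognizing named entities; proprietary models excel on clean speech, but their performance varies by country and domain.
Insight: Models fine-tuned on African English do better on accuracy and latency, but most SOTA models still struggle with hallucinations.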
Abstract: Recent advances in speech-enabled AI, including Google’s NotebookLM and OpenAI’s speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa’s linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment, but hallucinations remain a major problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use-cases, fostering inclusive voice applications for underserved communities.
[10] Entropy-Guided Reasoning Compression
Hourun Zhu,Yang Gao,Wenlong Fei,Jiawei Li,Huashan Sun
Main category: cs.CL
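TL;DR: The paper proposes an entropy-guided training framework that resolves the entropy conflict arising when large reasoning models are compressed, enabling more efficient reasoning compression.
Details
Motivation: Large reasoning models perform well on complex tasks, but their overly long reasoning chains make them costly to run and hard to deploy. Existing compression methods overlook the entropy conflict during training, which leaves the model stuck in a local optimum.
Contribution: Provides an analysis of the entropy conflict, showing that it stems from the tension between the large gradients received by logical connectors and the compression objective. Designs an entropy-guided training method that dynamically balances compression and exploration.
Method: Uses an entropy-guided training framework that encourages concise reasoning steps while entropy is falling and reinforces exploration while entropy is rising.
Result: On six mathematical benchmarks, the method compresses reasoning length to 20% of the original while maintaining or exceeding baseline accuracy.
Insight: The entropy conflict originates in the large gradients on logical connectors; dynamically balancing compression and exploration is the key to efficient reasoning compression.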
Abstract: Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process – the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
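The quantity at the center of this analysis, per-token entropy, can be sketched as follows. The abstract does not give the exact definition, so this assumes the usual Shannon entropy (in nats) of the model's next-token distribution computed from logits; the logit values are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (nats) of the next-token distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits over a 4-token vocabulary.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # maximally uncertain: log(4)
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])   # nearly deterministic: close to 0
print(f"uniform={uniform:.3f} peaked={peaked:.3f}")
```

Under this reading, a "high-entropy token" such as a logical connector is a position where the distribution looks more like the uniform case than the peaked one.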
[11] Don’t Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space
Ante Wang,Weizhi Ma,Yang Liu
Main category: cs.CL
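TL;DR: The paper proposes a new method that improves the depth and reliability of confidence estimation for large language models (LLMs) by predicting a probability distribution over the answer space rather than a single answer.
Details
Motivation: Existing confidence estimation methods mostly rely on a single answer or on chain-of-thought reasoning, and how reasoning strategies affect confidence estimation has not been explored in depth. The paper aims to improve the transparency and accuracy of confidence estimation by encouraging the model to reason in depth over the answer space.
Contribution: Proposes a confidence estimation method based on a probability distribution over the answer space. The model must consider more than a single answer and assign a reasonable confidence to every candidate, which reflects its reliability more completely.
Method: The core idea is to guide in-depth reasoning by predicting a probability distribution over the answer space. The model has to evaluate all candidate answers and make its confidence assignments satisfy the properties of a distribution, which improves transparency and logical consistency.
Result: The method shows an advantage across different models and tasks, and remains effective even when the answer space is unknown or after reinforcement learning. Qualitative analysis shows that its reasoning patterns match human expectations.
Insight: Confidence estimation should analyze the whole answer space rather than a single result. Doing so improves estimation quality and offers a new perspective on model interpretability.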
Abstract: Knowing the reliability of a model’s response is essential in application. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of relying on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
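As a rough illustration of what "meeting the requirements of a distribution" involves, the sketch below post-processes a set of verbalized scores into a valid probability distribution and reads off the top answer with its confidence. The renormalization step, the function names, and the example scores are assumptions for illustration; in the paper the model itself is asked to produce a well-formed distribution.

```python
from typing import Dict, Tuple


def normalize_distribution(scores: Dict[str, float]) -> Dict[str, float]:
    """Clip negative scores to zero and rescale so the values sum to 1."""
    cleaned = {answer: max(float(score), 0.0) for answer, score in scores.items()}
    total = sum(cleaned.values())
    if total == 0:
        raise ValueError("at least one candidate needs a positive score")
    return {answer: score / total for answer, score in cleaned.items()}


def top_answer(dist: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable candidate and its probability as the confidence."""
    answer = max(dist, key=dist.get)
    return answer, dist[answer]


# Hypothetical verbalized scores that do not quite sum to 1.
verbalized = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.3}
dist = normalize_distribution(verbalized)
answer, confidence = top_answer(dist)
print(answer, round(confidence, 3))
```

Here the raw score of 0.6 for the top answer becomes a confidence of 0.5 once the competing candidates are accounted for, which is the kind of adjustment that considering the whole answer space forces.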
[12] AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Mohammad Zbib,Hasan Abed Al Kader Hammoud,Sina Mukalled,Nadine Rizk,Fatima Karnib,Issam Lakkis,Ammar Mohanna,Bernard Ghanem
Main category: cs.CL
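TL;DR: AraLingBench is a human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs), covering five core categories: grammar, morphology, spelling, reading comprehension, and syntax.
Details
Motivation: Existing evaluations of LLMs' Arabic linguistic competence lack rigor and depth, and deeper language understanding such as grammar and syntax is especially under-tested.
Contribution: Introduces AraLingBench, a benchmark of 150 expert-designed multiple-choice questions focused on the Arabic linguistic competence of LLMs, and exposes the models' weaknesses in deeper language understanding.
Method: Designs multiple-choice questions in five core linguistic categories, annotated by humans, and analyzes the performance of 35 Arabic and bilingual LLMs on them.
Result: Current models show good surface-level proficiency but perform poorly on deeper grammatical and syntactic reasoning, relying on memorization or pattern recognition rather than genuine understanding.
Insight: AraLingBench provides a diagnostic tool for developing stronger Arabic LLMs and highlights the gap between knowledge-based benchmarks and real linguistic ability.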
Abstract: We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
[13] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
Hongwei Liu,Junnan Liu,Shudong Liu,Haodong Duan,Yuqiang Li,Mao Su,Xiaohong Liu,Guangtao Zhai,Xinyu Fang,Qianhong Ma,Taolin Zhang,Zihan Ma,Yufeng Zhao,Peiheng Zhou,Linchen Xiao,Wenlong Zhang,Shijie Zhou,Xingjian Ma,Siqi Sun,Jiaye Ge,Meng Li,Yuhong Liu,Jianxin Dong,Jiaying Li,Hui Wu,Hanwen Liang,Jintai Lin,Yanting Wang,Jie Dong,Tong Zhu,Tianfan Fu,Conghui He,Qi Zhang,Songyang Zhang,Lei Bai,Kai Chen
Main category: cs.CL
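TL;DR: ATLAS is a high-difficulty, cross-disciplinary benchmark for evaluating how well frontier language models handle complex scientific reasoning and the integration of knowledge across fields, filling gaps left by existing benchmarks.
Details
Motivation: Existing benchmarks are saturated by frontier models, and they tend to be limited to a single discipline, use simplified answer formats, or be vulnerable to data contamination, so they do not reflect the complexity of real scientific research.
Contribution: Introduces ATLAS, a cross-disciplinary evaluation suite of roughly 800 original problems, characterized by high originality, contamination resistance, high-fidelity answers, and rigorous quality control.
Method: Problems are written by domain experts across seven scientific fields, use complex open-ended answers, and pass through multi-stage expert review to ensure quality. A panel of LLM judges is used for automated evaluation.
Result: Preliminary results show that ATLAS effectively differentiates the scientific reasoning abilities of frontier models. The authors plan to develop it into a long-term, open, community-driven platform.
Insight: ATLAS offers a reliable instrument for measuring progress toward AGI and underlines the importance of cross-disciplinary reasoning and complex answers.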
Abstract: The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models’ ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS’s effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable “ruler” for progress toward Artificial General Intelligence.
[14] MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Jinru Ding,Lu Lu,Chao Ding,Mouxiao Bian,Jiayuan Chen,Renjie Lu,Wenrao Pang,Xiaoqin Wu,Zhiqiang Liu,Luyi Jiang,Bing Han,Yunqiu Wang,Jie Xu
Main category: cs.CL
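TL;DR: MedBench v4 is a benchmark for Chinese medical language models, multimodal models, and intelligent agents. It comprises over 700,000 expert-curated tasks, evaluates 15 frontier models, exposes weaknesses in multimodal reasoning and safety, and shows that agents with governance-aware orchestration perform markedly better.
Details
Motivation: With the rapid development of medical LLMs, multimodal models, and agents, evaluation frameworks are needed that reflect real clinical workflows and safety constraints.
Contribution: Presents MedBench v4, a nationwide, cloud-based benchmarking infrastructure with tasks spanning 24 primary and 91 secondary specialties. Items go through multi-stage refinement and multi-round review by clinical experts, and there are dedicated tracks for LLMs, multimodal models, and agents.
Method: Uses over 700,000 expert-curated tasks, scores open-ended responses with an LLM-as-a-judge calibrated to human ratings, and evaluates 15 frontier models including base LLMs, multimodal models, and agents.
Result: Base LLMs average 54.1/100 and multimodal models 47.5/100. Agents perform best (79.8/100). Safety and ethics remain a weak point for base LLMs (18.4/100).
Insight: Multimodal reasoning and safety are the models' weak points, while agents with governance-aware orchestration improve performance markedly. The results offer a reference for hospitals, developers, and policymakers.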
Abstract: Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
[15] Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning
Trishala Jayesh Ahalpara
Main category: cs.CL
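TL;DR: Tell Me is an LLM-based mental well-being assistant that combines RAG, synthetic dialogue generation, and agentic planning to provide context-aware support for users and researchers. The system offers personalized dialogue, synthetic dialogue generation for research data augmentation, and dynamically adaptive well-being plans.
Details
Motivation: Mental health resources are scarce and privacy concerns limit data availability; the limitations of static well-being tools call for dynamic, personalized solutions.
Contribution: 1. Uses RAG to deliver personalized dialogue; 2. introduces synthetic dialogue generation to address the data shortage; 3. uses agentic planning to produce dynamically adaptive well-being plans.
Method: The system integrates a RAG assistant, a synthetic dialogue generator, and a well-being planning agent built with CrewAI.
Result: Automatic evaluation and a user study indicate that the RAG assistant performs well in mental well-being scenarios. The system illustrates the potential for innovation in human-AI interaction in the mental health domain.
Insight: A system design that combines several components can lower the barrier to mental health support, and it demonstrates the value of synthetic data and research tools in this domain.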
Abstract: We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
[16] Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
Mingyue Cheng,Jie Ouyang,Shuo Yu,Ruiran Yan,Yucong Luo,Zirui Liu,Daoyu Wang,Qi Liu,Enhong Chen
Main category: cs.CL
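TL;DR: The paper introduces Agent-R1, a reinforcement learning (RL) training framework for large language model (LLM) agents that targets the challenges LLM agents face in environment interaction and is designed to be modular, flexible, and easy to extend.
Details
Motivation: The interactive abilities of LLM agents on complex problems are still immature, applying RL in this area remains challenging, and flexible training frameworks designed specifically for LLM agents are lacking.
Contribution: 1. Systematically extends the Markov Decision Process (MDP) framework to define the key components of an LLM agent; 2. proposes Agent-R1, a modular, flexible, and user-friendly RL training framework.
Method: Defines the components of an LLM agent by extending the MDP framework and designs the Agent-R1 framework, which supports experiments across diverse task scenarios and interactive environments.
Result: Experiments on Multihop QA benchmark tasks provide initial validation of the proposed methods and framework.
Insight: RL holds promise for training LLM agents but requires purpose-built frameworks that fit their interactive nature; Agent-R1 is an important step in that direction.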
Abstract: Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
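To make the MDP view of an agent concrete, here is a minimal rollout sketch. It is not the Agent-R1 API: the names `Trajectory`, `rollout`, `toy_policy`, and `toy_env_step` are hypothetical, and the mapping assumed here (state = interaction history, action = the model's generated text or tool call, transition = the environment's response appended to the history, reward = a scalar per step) is one plausible reading of the abstract, not the paper's formal definition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Sequence of (action, observation, reward) steps collected during one episode."""
    steps: List[Tuple[str, str, float]] = field(default_factory=list)

    def total_reward(self) -> float:
        return sum(reward for _, _, reward in self.steps)


def rollout(policy: Callable[[List[str]], str],
            env_step: Callable[[str], Tuple[str, float, bool]],
            prompt: str, max_turns: int = 4) -> Trajectory:
    """Run one multi-turn episode: the policy acts on the history, the environment responds."""
    history = [prompt]                      # state: everything seen so far
    trajectory = Trajectory()
    for _ in range(max_turns):
        action = policy(history)            # action: generated text or tool call
        observation, reward, done = env_step(action)
        trajectory.steps.append((action, observation, reward))
        history += [action, observation]    # transition: append the new turn
        if done:
            break
    return trajectory


# Stand-ins for an LLM policy and a tool environment (illustrative only).
def toy_policy(history: List[str]) -> str:
    seen_evidence = any("Paris" in turn for turn in history[1:])
    return "answer: Paris" if seen_evidence else "search: capital of France"


def toy_env_step(action: str) -> Tuple[str, float, bool]:
    if action.startswith("answer:"):
        return "", (1.0 if "Paris" in action else 0.0), True
    return "Paris is the capital of France.", 0.0, False


trajectory = rollout(toy_policy, toy_env_step, "What is the capital of France?")
print(len(trajectory.steps), trajectory.total_reward())
```

An RL trainer would collect many such trajectories and update the policy from the rewards; that part is omitted here.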
[17] Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
Kahaan Gandhi,Boris Bolliet,Inigo Zubeldia
Main category: cs.CL
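TL;DR: This paper proposes a multi-agent system built on vision-language models (VLMs) to strengthen autonomous scientific discovery. By treating plots as verifiable checkpoints, the system corrects errors in real time and steers data analysis, markedly improving task completion rates.
Details
Motivation: Conventional agentic systems for autonomous scientific discovery usually rely on code or text alone and lack real-time understanding of, and feedback on, visualized data. The study aims to improve robustness and adaptability in scientific discovery by introducing VLMs. Contribution: A VLM-guided multi-agent system that dynamically evaluates plots and corrects errors in real time against domain-specific rubrics, substantially raising the success rate of scientific discovery.
Method: Plots serve as verifiable checkpoints; a VLM acts as judge, domain-specific rubrics are generated dynamically, and the agents' data exploration and error correction are guided in real time.
Result: On a 10-task benchmark, the VLM-augmented systems achieve pass@1 scores of 0.7-0.8, well above the code-only (0.2-0.3) and code-and-text (0.4-0.5) baselines.
Insight: Real-time feedback from VLMs can improve the autonomy and adaptability of scientific discovery tasks while also making the system more interpretable.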
Abstract: We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
[18] A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases
Tao Yang,Dandan Huang,Yunting Lin,Pengfei Wu,Zhikun Wu,Gangyuan Ma,Yulan Lu,Xinran Dong,Dingpeng Li,Junshuang Ge,Zhiyan Zhang,Xuanzhao Huang,Wenyan Nong,Yao Zhou,Hui Tang,Hongxi Yang,Shijie Zhang,Juan Li,Xiaojun Cao,Lin Yang,Xia Gao,Kaishou Xu,Xiaoqiong Gu,Wen Zhang,Huimin Xia,Li Liu,Wenhao Zhou,Mulin Jun Li
Main category: cs.CL
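TL;DR: The paper presents RareSeek R1, a specialized large language model for clinical reasoning and diagnosis in rare diseases, which markedly improves diagnostic accuracy through staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval.
Details
Motivation: Rare-disease diagnosis is long and complex, existing pipelines decouple evidence extraction from inferential diagnosis, and general or medical LLMs suffer from scarce data and hallucinations. Contribution: Develops the first specialized LLM for rare diseases, integrating a clinical corpus with graph retrieval to substantially improve diagnostic accuracy and reasoning transparency.
Method: Uses staged instruction tuning, chain-of-thought learning, and graph retrieval, combined with a clinician-validated reasoning dataset and domain knowledge.
Result: Achieves state-of-the-art performance on multicenter EHR data and public benchmarks, with highly transparent reasoning and performance on par with experienced physicians in some settings.
Insight: Non-phenotypic evidence (such as imaging and interventions) plays a key role in diagnosis (median 23.1%), and retrieval augmentation markedly improves the alignment of candidates to mechanisms.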
Abstract: Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real-world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain-specialized clinical corpus and a clinician-validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state-of-the-art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.
[19] Graded strength of comparative illusions is explained by Bayesian inference
Yuhan Zhang,Erxiao Wang,Cory Shain
Main category: cs.CL
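TL;DR: The paper explains the graded strength of comparative illusions (CI) through Bayesian inference, quantitatively predicting illusion strength by combining statistical language models with human behavioral data.
Details
Motivation: The comparative illusion in language processing (e.g., "More students have been to Russia than I have") calls for a quantitative explanation, in support of noisy-channel theory as a unifying account of language comprehension. Contribution: A quantitative model that combines Bayesian inference with language models to predict illusion strength directly, and that explains the previously unexplained effect of pronominal versus full noun phrase subjects on illusion strength.
Method: Derives a posterior probability model from statistical language models and human behavioral data, quantitatively assessing the plausibility of alternative interpretations in order to predict illusion strength.
Result: The model explains fine gradations in illusion strength and also accounts for the effect of pronominal versus noun phrase subjects, supporting the generality of noisy-channel theory in language processing.
Insight: The study further validates the role of the noisy-channel model in language comprehension and provides a computational-level foundation for a unified account of language processing phenomena.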
Abstract: Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case, the comparative illusion (CI; e.g., "More students have been to Russia than I have"), comprehenders tend to judge the sentence as acceptable despite its underlying nonsensical comparison. Prior research has argued that this phenomenon can be explained as Bayesian inference over a noisy channel: the posterior probability of an interpretation of a sentence is proportional to both the prior probability of that interpretation and the likelihood of corruption into the observed (CI) sentence. Initial behavioral work has supported this claim by evaluating a narrow set of alternative interpretations of CI sentences and showing that comprehenders favor interpretations that are more likely to have been corrupted into the illusory sentence. In this study, we replicate and go substantially beyond this earlier work by directly predicting the strength of illusion with a quantitative model of the posterior probability of plausible interpretations, which we derive through a novel synthesis of statistical language models with human behavioral data. Our model explains not only the fine gradations in the strength of CI effects, but also a previously unexplained effect caused by pronominal vs. full noun phrase than-clause subjects. These findings support a noisy-channel theory of sentence comprehension by demonstrating that the theory makes novel predictions about the comparative illusion that bear out empirically. This outcome joins related evidence of noisy channel processing in both illusory and non-illusory contexts to support noisy channel inference as a unified computational-level theory of diverse language processing phenomena.
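The noisy-channel account above states that the posterior of an interpretation is proportional to its prior times the likelihood of it being corrupted into the observed sentence. A minimal Python sketch of that normalization follows. The interpretation labels and every probability value are hypothetical placeholders chosen for illustration; they are not numbers or code from the paper.

```python
def noisy_channel_posterior(priors, likelihoods):
    """Posterior over interpretations: P(i | observed) is proportional to
    P(i) * P(observed | i), normalized over the candidate set."""
    unnormalized = {name: priors[name] * likelihoods[name] for name in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all candidate interpretations have zero probability")
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical candidate interpretations of one illusory sentence.
priors = {"interpretation_a": 0.6, "interpretation_b": 0.3, "interpretation_c": 0.1}
# Hypothetical probability that each interpretation is corrupted into the observed sentence.
likelihoods = {"interpretation_a": 0.1, "interpretation_b": 0.4, "interpretation_c": 0.2}

posterior = noisy_channel_posterior(priors, likelihoods)
best = max(posterior, key=posterior.get)
```

In this toy setting the interpretation with the highest prior is not the one with the highest posterior, because a less probable interpretation is much more likely to be corrupted into the observed sentence.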
[20] Streamlining Industrial Contract Management with Retrieval-Augmented LLMs
Kristi Topollai,Tolga Dimlioglu,Anna Choromanska,Simon Odie,Reginald Hui
Main category: cs.CL
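TL;DR: Proposes a modular framework based on retrieval-augmented generation (RAG) for automatically identifying and optimizing revisions to contract provisions, achieving over 80% accuracy under low-resource conditions.
Details
Motivation: Contract management workflows are complex and labeled data is scarce, so traditional methods struggle to automate the identification and optimization of provision revisions. Contribution: 1. Designs a modular framework that integrates synthetic data generation, semantic retrieval, acceptability classification, and reward-based alignment; 2. Identifies and optimizes problematic provisions effectively under low-resource conditions.
Method: Applies RAG together with synthetic data generation and cooperating modules to perform semantic retrieval of contract clauses and optimize revisions.
Result: In a real industrial setting, the system achieves over 80% accuracy and substantially improves the efficiency of contract revision.
Insight: Synthetic data and cooperating modules make it possible to handle complex document processing tasks effectively under low-resource conditions.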
Abstract: Contract management involves reviewing and negotiating provisions, individual clauses that define rights, obligations, and terms of agreement. During this process, revisions to provisions are proposed and iteratively refined, some of which may be problematic or unacceptable. Automating this workflow is challenging due to the scarcity of labeled data and the abundance of unstructured legacy contracts. In this paper, we present a modular framework designed to streamline contract management through a retrieval-augmented generation (RAG) pipeline. Our system integrates synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment to flag problematic revisions and generate improved alternatives. Developed and evaluated in collaboration with an industry partner, our system achieves over 80% accuracy in both identifying and optimizing problematic revisions, demonstrating strong performance under real-world, low-resource conditions and offering a practical means of accelerating contract revision workflows.
[21] SMRC: Aligning Large Language Models with Student Reasoning for Mathematical Error Correction
Biaojie Zeng,Min Zhang,Juan Zhou,Fengrui Liu,Ruiyang Huang,Xin Lin
Main category: cs.CL
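TL;DR: SMRC is a novel method that aligns large language models (LLMs) with student reasoning to correct mathematical errors automatically, guided by Monte Carlo Tree Search (MCTS) and breadth-first search (BFS). It reduces the cost of annotating process-level rewards and performs well on MSEB, a high school mathematics benchmark.
Details
Motivation: Existing methods rely mainly on model self-correction and cannot meet the need in educational settings for teacher-style correction that provides systematic guidance. An automatic correction method aligned with student reasoning is therefore needed. Contribution: 1. Proposes SMRC, which combines MCTS and BFS with LLMs to explore optimal correction paths; 2. Reduces the cost of annotating process rewards; 3. Builds MSEB, a benchmark of high school mathematics errors; 4. Proposes a dual evaluation protocol that considers solution accuracy and correct-step retention.
Method: SMRC models student reasoning as a multi-step sequential decision problem and uses MCTS to optimize correction paths. Reward signals are generated with BFS and LLMs and distributed across intermediate reasoning steps through a back-propagation mechanism.
Result: SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and on MSEB.
Insight: The novelty of SMRC lies in aligning LLMs with student reasoning to mimic the systematic correction found in teaching, offering a new direction for automatic error correction in education.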
Abstract: Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches mainly focus on self-correction within the model, which falls short of the teacher-style correction required in educational settings, i.e., systematically guiding and revising a student’s problem-solving process. To address this gap, we propose SMRC (Student Mathematical Reasoning Correction), a novel method that aligns LLMs with student reasoning. Specifically, SMRC formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on solution accuracy and correct-step retention, offering a comprehensive measure of educational applicability. Experiments demonstrate that SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.
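The abstract describes final-answer rewards being distributed across intermediate reasoning steps through back-propagation. The sketch below shows the generic MCTS-style backup that this wording suggests: each step keeps a visit count and a running mean of the rewards observed beneath it. It is an illustration under assumptions, not the authors' implementation; the `StepNode` class, the step labels, and the reward values are all hypothetical.

```python
class StepNode:
    """One intermediate reasoning step in a search tree (hypothetical structure)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.visits = 0
        self.total_reward = 0.0

    @property
    def value(self):
        # Mean final-answer reward of all rollouts that passed through this step.
        return self.total_reward / self.visits if self.visits else 0.0


def backpropagate(leaf, reward):
    """Credit a final-answer reward to the leaf and to every ancestor step."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


root = StepNode("problem statement")
step1 = StepNode("step 1", parent=root)
variant_a = StepNode("step 2, variant A", parent=step1)
variant_b = StepNode("step 2, variant B", parent=step1)

backpropagate(variant_a, 1.0)  # rollout ending in a correct final answer
backpropagate(variant_b, 0.0)  # rollout ending in a wrong final answer
```

After the two rollouts, the shared step carries the average of both outcomes, while each variant keeps its own score. This is how a final-answer signal can become a step-level signal without step-by-step human annotation.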
[22] Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries
Kiera McCormick,Rafael Martínez-Galarza
Main category: cs.CL
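TL;DR: The paper studies whether large language models (LLMs) can encode physical information from astrophysics in text embeddings, and examines the role of prompting and the key features of language involved.
Details
Motivation: To explore whether LLMs can encode physical information that is usually available only from scientific measurements, using astrophysics as a test bed. Contribution: Proposes using sparse autoencoders to extract interpretable features from text and verifies that LLM embeddings can encode physical summary statistics.
Method: Uses sparse autoencoders to extract interpretable features from LLM-generated text, and analyzes how prompting affects the encoding of information and which language features matter most.
Result: Finds that LLMs can encode physical information, that prompting has a notable effect on how well it is encoded, and identifies the key language features.
Insight: LLM embeddings can be used to encode scientific information, and prompt design and language features are important factors in how well that encoding works.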
Abstract: Large Language Models have demonstrated the ability to generalize well at many levels across domains and modalities, and have even shown in-context learning capabilities. This raises research questions regarding how they can be used to encode physical information that is usually only available from scientific measurements, and only loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate if LLM embeddings can codify physical summary statistics that are obtained from scientific measurements through two main questions: 1) Does prompting play a role in how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.
[23] Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances
Rishu Kumar Singh,Navneet Shreya,Sarmistha Das,Apoorva Singh,Sriparna Saha
Main category: cs.CL
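TL;DR: The paper proposes the VALOR framework, which analyzes customer complaints in multimodal (text plus image), multi-turn dialogues and uses expert routing and semantic alignment for fine-grained classification, outperforming baseline models.
Details
Motivation: Existing complaint analysis methods rely mainly on unimodal short texts (such as tweets or reviews) and overlook the rich visual evidence and multi-turn interaction found in multimodal dialogues. Contribution: 1. Proposes the VALOR framework for multimodal complaint classification; 2. Combines semantic alignment with expert routing to improve classification performance; 3. Aligns with the UN SDGs (SDG 9 and SDG 12), promoting the use of AI in service infrastructure and responsible consumption.
Method: 1. Makes decisions with multi-expert reasoning (Chain-of-Thought prompting); 2. Computes a semantic alignment score to integrate multimodal information; 3. Produces the final classification through a meta-fusion strategy.
Result: Evaluated on a multimodal complaint dataset, VALOR clearly outperforms baseline models in complex scenarios.
Insight: Multimodal interaction and expert validation are essential for practical complaint understanding systems, especially when information is spread across text and images.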
Abstract: Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR
cs.CV
[24] Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection
Yogesh Kumar,Anand Mishra
Main category: cs.CV
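TL;DR: The paper proposes a temporal object-aware vision Transformer approach for few-shot video object detection (FSVOD) that improves detection accuracy and temporal consistency by selectively propagating high-confidence object features.
Details
Motivation: Traditional object detection needs large amounts of training data, while few-shot video object detection suffers from poor temporal consistency and the computational cost of complex region proposals, so a more efficient solution is needed. Contribution: A novel object-aware temporal modeling approach whose filtering mechanism propagates features selectively, reducing noise accumulation and improving detection accuracy.
Method: Combines a vision Transformer with a filtering mechanism that selectively propagates high-confidence object features, and uses few-shot trained detection and classification heads.
Result: Achieves notable AP gains on several datasets (e.g., 3.7% on FSVOD-500), with improvements also shown in the 1-shot, 3-shot, and 10-shot settings.
Insight: Selective feature propagation is the key: it makes effective use of temporal information under few-shot conditions and avoids dependence on complex region proposals.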
Abstract: Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves performance gains, with AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5 (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: https://github.com/yogesh-iitj/fs-video-vit
[25] FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching
Huayi Zhu,Xiu Shu,Youqiang Xiong,Qiao Liu,Rui Chen,Di Yuan,Xiaojun Chang,Zhenyu He
Main category: cs.CV
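TL;DR: FusionFM is a unified multi-modal image fusion method based on flow matching. It addresses the high training cost and low sampling efficiency of existing methods and improves performance through pseudo-label selection and a fusion refinement module.
Details
Motivation: Current multi-modal image fusion methods usually rely on task-specific models, which leads to high training costs and poor scalability; generative methods offer a unified modeling perspective but sample inefficiently. Contribution: 1) A direct probabilistic transport framework based on flow matching that improves sampling efficiency and structural consistency; 2) Pseudo-label selection and a fusion refinement module that address the shortage of high-quality supervision; 3) Support for multi-task scenarios through elastic weight consolidation and experience replay.
Method: 1) Uses the flow matching framework to transport directly from the source modality distribution to the fused image distribution; 2) Selects reliable pseudo-labels from the outputs of existing models with a task-aware selection function; 3) Designs a fusion refinement module that decomposes and enhances degraded components in the pseudo-labels; 4) Applies elastic weight consolidation and experience replay in multi-task scenarios.
Result: FusionFM performs well across diverse fusion tasks, significantly improves sampling efficiency, and keeps a lightweight model design.
Insight: The flow matching paradigm effectively addresses the low sampling efficiency of generative models, while pseudo-label selection and task-aware refinement offer a workable solution for tasks that lack high-quality annotations.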
Abstract: Current multi-modal image fusion methods typically rely on task-specific models, leading to high training costs and limited scalability. While generative methods provide a unified modeling perspective, they often suffer from slow inference due to the complex sampling trajectories from noise to image. To address this, we formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution, leveraging the flow matching paradigm to improve sampling efficiency and structural consistency. To mitigate the lack of high-quality fused images for supervision, we collect fusion results from multiple state-of-the-art models as priors, and employ a task-aware selection function to select the most reliable pseudo-labels for each task. We further introduce a Fusion Refiner module that employs a divide-and-conquer strategy to systematically identify, decompose, and enhance degraded components in selected pseudo-labels. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance and enhance continual learning ability from both parameter stability and memory retention perspectives. Our approach achieves competitive performance across diverse fusion tasks, while significantly improving sampling efficiency and maintaining a lightweight model design. The code will be available at: https://github.com/Ist-Zhy/FusionFM.
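The abstract frames fusion as a direct transport from source modalities to the fused image distribution via flow matching. The sketch below illustrates the widely used linear-path form of flow matching: a point on the path is `(1 - t) * x0 + t * x1`, the regression target for the velocity network is `x1 - x0`, and sampling integrates the learned velocity with a few Euler steps. Whether FusionFM uses exactly this path is an assumption. The vectors stand in for images, and the `oracle_velocity` function is a hypothetical stand-in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from source x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity that a flow matching model regresses onto for the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy vectors standing in for a source-modality image and a fused image.
source = [0.0, 1.0, 2.0]
fused = [1.0, 1.0, 0.0]


def oracle_velocity(x, t):
    # Hypothetical stand-in for a trained network: returns the exact target velocity.
    return target_velocity(source, fused)


midpoint = interpolate(source, fused, 0.5)
sample = euler_sample(source, oracle_velocity, num_steps=4)
```

Because the path starts at the source data rather than at pure noise, a handful of integration steps can suffice in this idealized setting, which is the intuition behind the sampling-efficiency claim.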
[26] A Trajectory-free Crash Detection Framework with Generative Approach and Segment Map Diffusion
Weiying Shen,Hao Yu,Yu Dong,Pan Liu,Yu Han,Xin Wen
Main category: cs.CV
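TL;DR: This paper proposes a trajectory-free, two-stage crash detection framework that uses a generative model with road segment map diffusion to detect crashes in real time directly from road segment map data.
Details
Motivation: Traditional crash detection depends on acquiring and tracking vehicle trajectories, but trajectory data is hard to obtain in real time and is often incomplete. The paper addresses this by recording traffic dynamics directly in road segment maps. Contribution: 1) Proposes Mapfusion, a diffusion-based road segment map generation method; 2) Designs a two-stage crash detection framework that incorporates background context; 3) Validates the method experimentally on real-world data.
Method: 1) Mapfusion generates future road segment maps through a noisy-to-normal diffusion process; 2) ControlNet is used to bring in background context; 3) Crashes are detected by comparing the monitored data with the generated data.
Result: Mapfusion generates road segment maps that follow realistic motion patterns and remains robust across different sampling intervals. Experiments show that the method detects crashes accurately.
Insight: Crash detection can work efficiently from road segment maps alone, without trajectory data, and diffusion models show promise for generating this kind of spatiotemporal data.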
Abstract: Real-time crash detection is essential for developing proactive safety management strategies and enhancing overall traffic efficiency. To address the limitations associated with trajectory acquisition and vehicle tracking, road segment maps recording individual-level traffic dynamics are used directly for crash detection. A novel two-stage trajectory-free crash detection framework is presented to generate a rational future road segment map and identify crashes. The first-stage diffusion-based segment map generation model, Mapfusion, conducts a noisy-to-normal process: noise is progressively added to the road segment map until the map is corrupted to pure Gaussian noise, and the denoising process is guided by sequential embedding components that capture the temporal dynamics of segment map sequences. Furthermore, the generation model is designed to incorporate background context through ControlNet to enhance generation control. In the second stage, crash detection is achieved by comparing the monitored segment map with the maps generated by the diffusion model. Trained on non-crash vehicle motion data, Mapfusion successfully generates realistic road segment evolution maps based on learned motion patterns and remains robust across different sampling intervals. Experiments on real-world crashes indicate the effectiveness of the proposed two-stage method in accurately detecting crashes.
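The abstract says noise is added progressively until the segment map becomes pure Gaussian noise. A minimal sketch of that forward process follows, using the standard closed form from denoising diffusion models, `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps`. The linear beta schedule, the 1000-step horizon, and the four-cell toy map are assumptions made for illustration; the abstract does not state Mapfusion's actual schedule.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Sample x_t from x_0 in one shot using the closed-form forward process."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = alpha_bar_schedule(1000)
rng = random.Random(0)

# Toy flattened segment map: 1.0 marks an occupied cell (hypothetical encoding).
segment_map = [0.0, 1.0, 1.0, 0.0]
slightly_noisy = add_noise(segment_map, alpha_bars[0], rng)
almost_pure_noise = add_noise(segment_map, alpha_bars[-1], rng)
```

At the first step almost all of the signal survives, while at the last step the signal coefficient is close to zero, which matches the description of the map being corrupted to pure Gaussian noise.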
[27] Synergizing Multigrid Algorithms with Vision Transformer: A Novel Approach to Enhance the Seismic Foundation Model
Huiwen Wu,Shuo Zhang,Yi Liu,Hongbin Ye
Main category: cs.CV
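TL;DR: This paper proposes a novel adaptive two-grid foundation model training strategy (ADATG) combined with Hilbert encoding. It is tailored to the distinctive high- and low-frequency characteristics of seismic data and improves the pretraining of visual seismic foundation models.
Details
Motivation: Because of the distinctive high- and low-frequency characteristics of seismic data, existing vision Transformers (ViTs) cannot capture this information efficiently, so a specialized technique is needed to improve pretraining. Contribution: 1. Proposes an adaptive two-grid training strategy (ADATG) with Hilbert encoding, tailored to the high- and low-frequency features of seismic data. 2. Uses spectrum decomposition and hierarchical Hilbert encoding to represent the data effectively. 3. Designs a coarse-to-fine adaptive training strategy based on the frequency principle observed in ViTs.
Method: 1. Spectrum decomposition separates high- and low-frequency components. 2. Hierarchical Hilbert encoding represents the data effectively. 3. An adaptive training strategy first emphasizes coarse-level information and then progressively refines the focus on fine-level features.
Result: Experiments show that the method is efficient and effective for seismic data pretraining.
Insight: Data encoding and training strategies should be tailored to the high- and low-frequency characteristics of seismic images, which is crucial for pretraining visual seismic foundation models.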
Abstract: Due to the emergence and homogenization of Artificial Intelligence (AI) technology development, transformer-based foundation models have revolutionized scientific applications, such as drug discovery, materials research, and astronomy. However, seismic data presents unique characteristics that require specialized processing techniques for pretraining foundation models in seismic contexts, with high- and low-frequency features playing crucial roles. Existing vision transformers (ViTs) with sequential tokenization ignore the intrinsic pattern and fail to grasp both the high- and low-frequency seismic information efficiently and effectively. This work introduces a novel adaptive two-grid foundation model training strategy (ADATG) with Hilbert encoding specifically tailored for seismogram data, leveraging the hierarchical structures inherent in seismic data. Specifically, our approach employs spectrum decomposition to separate high- and low-frequency components and utilizes hierarchical Hilbert encoding to represent the data effectively. Moreover, building on the frequency principle observed in ViTs, we propose an adaptive training strategy that initially emphasizes coarse-level information and then progressively refines the model’s focus on fine-level features. Our extensive experiments demonstrate the effectiveness and efficiency of our training methods. This research highlights the importance of data encoding and training strategies informed by the distinct characteristics of high- and low-frequency features in seismic images, ultimately contributing to the enhancement of visual seismic foundation model pretraining.
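The abstract contrasts sequential tokenization with Hilbert encoding. A Hilbert curve orders the cells of a 2-D grid so that neighbors in the 1-D sequence are also neighbors in the grid, which row-by-row ordering does not guarantee at row ends. The sketch below uses the standard bit-manipulation algorithm for the Hilbert index on an n-by-n grid with n a power of two. It only illustrates the curve itself; the paper's hierarchical Hilbert encoding may differ, and the function names are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n-by-n grid.

    n must be a power of two; x and y are in range(n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All grid cells (e.g., patch positions) sorted along the Hilbert curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda cell: hilbert_index(n, cell[0], cell[1]))


order = hilbert_order(4)  # ordering of a 4x4 grid of patches
```

Applied to image patches, such an ordering keeps spatially close patches close in the token sequence, which is the locality property the encoding relies on.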
[28] Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video
Filippo Cenacchi,Longbing Cao,Mitchell McEwan,Deborah Richards
Main category: cs.CV
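TL;DR: The paper proposes a language-free dementia screening method that analyzes subtle facial dynamics in video. It uses cues such as blinks and mouth movements in natural talking-head video and needs no speech or script.
Details
Motivation: Most existing dementia screening methods depend on speech or scripted interviews, which limits where they can be applied. The study aims to develop a passive, language-free screening method that can be applied at scale in natural settings. Contribution: 1) A language-free dementia screening framework based on subtle facial dynamics; 2) The YT DemTalk dataset, which offers a first benchmark for dementia screening from in-the-wild video.
Method: Stabilizes facial signals, converts micro-movements into interpretable time series, analyzes the distribution of their activity mix, and makes predictions with lightweight classifiers.
Result: On YT DemTalk, the model performs strongly (AUROC 0.953, AP 0.961, F1-score 0.851, accuracy 0.857).
Insight: Gaze lability and mouth movements are the most informative cues, and lightweight models are sufficient for effective screening.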
Abstract: We target passive dementia screening from short camera-facing talking-head video, developing a facial temporal micro-dynamics analysis for language-free detection of early neurocognitive change. This enables unscripted, in-the-wild video analysis at scale to capture natural facial behaviors, transferable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we identify and analyze whether temporal facial kinematics, including blink dynamics, small mouth-jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro-movements into interpretable facial microdynamic time series, smooth them, and summarize short windows into compact clip-level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams), thus the predictor analyzes the distribution of motion across streams rather than its magnitude, making per-channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in-the-wild camera-facing videos. It contains 300 clips (150 with self-reported dementia, 150 controls) to test our model and offer a first benchmarking of the corpus. On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and lightweight shallow classifiers attain an AUROC of 0.953, an Average Precision (AP) of 0.961, an F1-score of 0.851, and an accuracy of 0.857.
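The abstract defines the activity mix as the relative share of motion across streams, so the predictor sees how motion is distributed rather than how large it is. A minimal sketch of that normalization follows. The stream names mirror the cues listed in the abstract, but the per-window motion values and the `activity_mix` function itself are hypothetical, not taken from the paper.

```python
def activity_mix(stream_motion):
    """Relative share of motion per stream within one window.

    stream_motion maps a stream name to a non-negative motion magnitude.
    Returns shares that sum to 1, or all zeros if the window has no motion.
    """
    total = sum(stream_motion.values())
    if total <= 0:
        return {name: 0.0 for name in stream_motion}
    return {name: value / total for name, value in stream_motion.items()}


# Hypothetical motion magnitudes for one short window.
window = {"blink": 2.0, "mouth_jaw": 5.0, "gaze": 2.0, "head": 1.0}
mix = activity_mix(window)

# The same window with every stream ten times more active has the same mix.
louder_window = {name: 10.0 * value for name, value in window.items()}
louder_mix = activity_mix(louder_window)
```

The second call shows the design choice: scaling all streams equally leaves the encoding unchanged, so overall expressiveness or camera distance would matter less than which streams dominate.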
[29] Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Xinxin Liu,Zhaopan Xu,Kai Wang,Yong Jae Lee,Yuzhang Shang
Main category: cs.CV
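TL;DR: Gen-ViRe is a new generative visual reasoning benchmark for evaluating video generation models on cognitive dimensions such as multi-step planning and abstract reasoning, filling the gap left by existing benchmarks that do not assess Chain-of-Frames reasoning.
Details
Motivation: Chain-of-Thought prompting is confined to textual reasoning and cannot simulate the continuous physical dynamics of the real world. Video generation models show promise through Chain-of-Frames reasoning, but there is no systematic tool for measuring their cognitive abilities. Contribution: Proposes the Gen-ViRe framework, which decomposes visual reasoning into 6 cognitive dimensions and 24 subtasks and, through multi-source data, minimal prompting protocols, and hybrid VLM-assisted evaluation, provides the first quantitative assessment of the reasoning ability of video models.
Method: Grounded in cognitive science and real-world AI applications, the work designs a multi-task evaluation framework and evaluates models with a hybrid VLM-assisted method, focusing on the depth of visual reasoning rather than visual quality alone.
Result: Experiments reveal substantial discrepancies between the visual quality of SOTA models and their actual reasoning depth, and provide baselines and diagnostic tools for improving world simulators.
Insight: The reasoning ability of visual generation models needs systematic evaluation; high visual quality alone does not indicate complex cognitive ability. Gen-ViRe gives a clear direction for improving future models.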
Abstract: While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning – materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions – from perceptual logic to abstract planning – and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.
[30] RSPose: Ranking Based Losses for Human Pose Estimation
Muhammed Can Keles,Bedrettin Cetinkaya,Sinan Kalkan,Emre Akbas
Main category: cs.CV
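TL;DR: RSPose proposes ranking-based loss functions that address three core problems in heatmap-based pose estimation, and experiments on multiple datasets confirm their advantage.
Details
Motivation: Traditional heatmap-based human pose estimation has three main problems: the MSE loss does not focus effectively on peak localization, heatmaps are spatially and class-wise imbalanced, and the loss function does not match the evaluation metric (mAP). Contribution: Proposes ranking-based loss functions, the first to align the loss with the evaluation metric (mAP), which considerably increase the correlation between confidence scores and localization quality.
Method: Designs ranking losses for two modes, one-dimensional and two-dimensional heatmaps, and applies them to different datasets and models (such as ViTPose-H and SimCC Resnet-50).
Result: RSPose reaches 79.9 mAP on the COCO-val set, surpassing the previous SOTA; SimCC Resnet-50 also improves by 1.5 AP.
Insight: Loss functions should be designed to agree with the evaluation metric; ranking losses optimize model performance better, particularly for peak localization and instance selection during NMS.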
Abstract: While heatmap-based human pose estimation methods have shown strong performance, they suffer from three main problems: (P1) the commonly used Mean Squared Error (MSE) loss may not always improve joint localization because it penalizes all pixel deviations equally, without focusing explicitly on sharpening and correctly localizing the peak corresponding to the joint; (P2) heatmaps are spatially and class-wise imbalanced; and (P3) there is a discrepancy between the evaluation metric (i.e., mAP) and the loss functions. We propose ranking-based losses to address these issues. Both theoretically and empirically, we show that our proposed losses are superior to commonly used heatmap losses (MSE, KL-Divergence). Our losses considerably increase the correlation between confidence scores and localization qualities, which is desirable because higher correlation leads to more accurate instance selection during Non-Maximum Suppression (NMS) and better mean Average Precision (mAP) performance. We refer to the models trained with our losses as RSPose. We show the effectiveness of RSPose across two different modes, one-dimensional and two-dimensional heatmaps, on three different datasets (COCO, CrowdPose, MPII). To the best of our knowledge, we are the first to propose losses that align with the evaluation metric (mAP) for human pose estimation. RSPose outperforms the previous state of the art on the COCO-val set and achieves an mAP score of 79.9 with ViTPose-H, a vision transformer model for human pose estimation. We also improve SimCC Resnet-50, a coordinate classification-based pose estimation method, by 1.5 AP on the COCO-val set, achieving 73.6 AP.
[31] Segmenting Collision Sound Sources in Egocentric Videos
Kranti Kumar Parida,Omar Emara,Hazel Doughty,Dima Damen
Main category: cs.CV
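TL;DR: The paper proposes the task of Collision Sound Source Segmentation (CS3), which aims to segment the objects that produce a collision sound in video, conditioned on the audio. It tackles the task with a weakly-supervised method and foundation models (CLIP and SAM2) and performs strongly on the EPIC-CS3 and Ego4D-CS3 benchmarks.
Details
Motivation: Humans excel at multisensory perception and can recognize object properties from the sound of interactions. Inspired by this, the paper seeks to segment collision sound sources in cluttered egocentric video, where the sound depends on both colliding objects and the visual scene is complex. Contribution: Proposes the CS3 task and the associated EPIC-CS3 and Ego4D-CS3 benchmarks, and develops a weakly-supervised segmentation method that combines audio with egocentric visual cues.
Method: Uses the CLIP and SAM2 foundation models, adds hand-object cues, and performs weakly-supervised, audio-conditioned segmentation of the objects that are the collision sound sources.
Result: On the two new benchmarks, the method outperforms baselines in mIoU by 3× and 4.7×, respectively.
Insight: Multimodal perception (audio plus vision) improves object segmentation in complex scenes, and hand information in the egocentric view is a key cue.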
Abstract: Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by 3× and 4.7× in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.
[32] GRLoc: Geometric Representation Regression for Visual Localization
Changyang Li,Xuejian Ma,Lixiang Liu,Zhan Li,Qingan Yan,Yi Xu
Main category: cs.CV
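TL;DR: GRLoc proposes Geometric Representation Regression (GRR) as an alternative to Absolute Pose Regression (APR). By explicitly predicting disentangled geometric representations (ray directions and a pointmap), it introduces a geometric prior and improves visual localization performance.
Details
Motivation: APR methods act as black boxes that regress a 6-DoF pose directly and tend to memorize training views rather than understand 3D scene geometry. GRLoc models the inverse of the rendering process and represents geometric relations explicitly, improving generalization. Contribution: 1) Proposes the GRR paradigm, which recasts APR as regression of geometric representations; 2) Explicitly decouples rotation and translation prediction; 3) Recovers the pose with a differentiable solver, introducing a geometric prior.
Method: 1) Predicts ray directions (for rotation) and a pointmap (for translation) in the world coordinate system; 2) Recovers the 6-DoF pose from these geometric representations with a differentiable solver; 3) Separates the learned visual-to-geometry mapping from the pose calculation.
Result: Achieves SOTA performance on the 7-Scenes and Cambridge Landmarks datasets, indicating that modeling the inverse rendering process supports generalizable pose estimation.
Insight: Explicitly disentangled geometric representations effectively improve the robustness and generalization of pose estimation, and the geometric prior is the key.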
Abstract: Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a ray bundle’s directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final 6-DoF camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.
[33] H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction
Xueyang Li,Zongren Wang,Yuliang Zhang,Zixuan Pan,Yu-Jen Chen,Nishchal Sapkota,Gelei Xu,Danny Z. Chen,Yiyu Shi
Main category: cs.CV
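TL;DR: The paper proposes H-CNN-ViT, a hierarchical gated attention multi-branch model for bladder cancer recurrence prediction, and introduces a dedicated multi-sequence MRI dataset.
Details
Motivation: Bladder cancer has a high recurrence rate, yet post-operative MRI scans are hard to interpret, and the lack of dedicated datasets has hindered progress on AI-assisted diagnostic tools.
Contribution: 1) Introduces a multi-sequence MRI dataset for bladder cancer recurrence prediction; 2) proposes the H-CNN-ViT model, which combines global (ViT) and local (CNN) features for targeted feature fusion.
Method: H-CNN-ViT uses a hierarchical gated attention mechanism, processes each modality independently, and dynamically weights global and local features.
Result: The model reaches an AUC of 78.6%, surpassing existing methods.
Insight: A multi-branch architecture that combines CNN and ViT can capture the distinct characteristics of each MRI modality and improve prediction performance.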
Abstract: Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
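The selective weighting of global and local features described above can be illustrated with a generic gated fusion. This is a minimal sketch under stated assumptions, not the authors' architecture: the scalar sigmoid gate, the function name `gated_fusion`, and the parameters `gate_weights` and `gate_bias` are all hypothetical, and the feature vectors are plain Python lists standing in for ViT and CNN embeddings.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(global_feat, local_feat, gate_weights, gate_bias=0.0):
    """Blend two equal-length feature vectors with one learned scalar gate.

    Hypothetical illustration: the gate is computed from the concatenated
    features, then used as a convex-combination weight.
    """
    concat = list(global_feat) + list(local_feat)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * g + (1.0 - gate) * l for g, l in zip(global_feat, local_feat)]
    return fused, gate


# With all-zero gate weights the gate is 0.5, so the output is a plain average.
fused, gate = gated_fusion([1.0, 3.0], [3.0, 1.0], gate_weights=[0.0] * 4)
print(gate, fused)  # 0.5 [2.0, 2.0]
```

A gate near 1 favors the global branch and a gate near 0 favors the local branch; the paper's hierarchical mechanism is presumably richer than this single scalar.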
[34] QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt tuning
Xiaoyang Wei,Camille Kurtz,Florence Cloppet
Main category: cs.CV
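TL;DR: QwenCLIP replaces CLIP's text encoder with a large language model (LLM)-based embedding module and adds learnable prompts (prompt tuning), addressing CLIP's limitation in handling long medical reports.
Details
Motivation: CLIP's text encoder accepts only 77 tokens, whereas long-form radiology reports are information-rich. Domain-specific encoders such as PubMedBERT ease the problem but remain limited in input length and depth of semantic understanding.
Contribution: Proposes QwenCLIP, which replaces the CLIP text encoder with an LLM (e.g., Qwen3-Embedding) and introduces learnable prompts to strengthen cross-modal alignment, substantially improving performance on medical image-text tasks.
Method: 1. Replace CLIP's text encoder with an LLM to exploit its long context window and rich semantic representations; 2. introduce learnable prompts to enhance cross-modal alignment.
Result: Performs strongly on radiology benchmarks and improves semantic alignment between medical images and long text.
Insight: Combining the strong semantic representations of LLMs with CLIP-style vision-language pretraining can effectively address long-text handling in the medical domain.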
Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated strong generalization for vision-language tasks in computer vision and medical domains, yet its text encoder accepts only up to 77 tokens, which limits its ability to represent long and information-rich radiology reports. Recent adaptations using domain-specific encoders, such as PubMedBERT or ClinicalBERT, mitigate this issue by leveraging medical corpora, but remain constrained by their limited input length (typically 512 tokens) and relatively shallow semantic understanding. To address these limitations, we propose QwenCLIP, a vision-language framework that replaces CLIP’s text encoder with a large language model (LLM)-based embedding module (e.g., Qwen3-Embedding) and introduces learnable prompts to enhance cross-modal alignment. By leveraging the extended context window and richer representations of LLMs, QwenCLIP captures comprehensive medical semantics from long-form clinical text, substantially improving medical image-text alignment and downstream performance on radiology benchmarks. Our code is publicly available at https://github.com/Wxy-24/QwenCLIP.
[35] Hybrid Convolution Neural Network Integrated with Pseudo-Newton Boosting for Lumbar Spine Degeneration Detection
Pandiyaraju V,Abishek Karthik,Jaspin K,Kannan A,Jaime Lloret
Main category: cs.CV
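TL;DR: The paper proposes a hybrid convolutional neural network that combines EfficientNet and VGG19. By adding a Pseudo-Newton Boosting layer and a Sparsity-Induced Feature Reduction layer, it improves on conventional transfer learning for detecting lumbar spine degeneration.
Details
Motivation: Conventional transfer learning struggles to capture detailed anatomical features in high-dimensional medical images, and redundant feature representations reduce model efficiency and accuracy.
Contribution: 1. Proposes a hybrid model architecture; 2. designs a Pseudo-Newton Boosting layer and a Sparsity-Induced Feature Reduction layer; 3. markedly improves performance on lumbar spine degeneration detection.
Method: Combines EfficientNet and VGG19, adds a Pseudo-Newton Boosting layer to adjust feature weights, and uses a sparsity-inducing layer to remove redundant features, forming a multi-tiered framework.
Result: The model reports a precision of 0.9, recall of 0.861, F1 score of 0.88, and accuracy of 88.1%, outperforming the EfficientNet baseline.
Insight: By adjusting feature weights and reducing redundancy, the model offers a new approach to high-dimensional feature extraction and representation in medical images.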
Abstract: This paper proposes a new enhanced model architecture to perform classification of lumbar spine degeneration with DICOM images while using a hybrid approach, integrating EfficientNet and VGG19 together with custom-designed components. The proposed model is differentiated from traditional transfer learning methods as it incorporates a Pseudo-Newton Boosting layer along with a Sparsity-Induced Feature Reduction Layer that forms a multi-tiered framework, further improving feature selection and representation. The Pseudo-Newton Boosting layer makes smart variations of feature weights, with more detailed anatomical features, which are mostly left out in a transfer learning setup. In addition, the Sparsity-Induced Layer removes redundancy for learned features, producing lean yet robust representations for pathology in the lumbar spine. This architecture is novel as it overcomes the constraints in the traditional transfer learning approach, especially in the high-dimensional context of medical images, and achieves a significant performance boost, reaching a precision of 0.9, recall of 0.861, F1 score of 0.88, loss of 0.18, and an accuracy of 88.1%, compared to the baseline model, EfficientNet. This work will present the architectures, preprocessing pipeline, and experimental results. The results contribute to the development of automated diagnostic tools for medical images.
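As a quick consistency check on the reported metrics, the F1 score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. The snippet below is only a sanity check on the numbers quoted above; the function name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract.
reported_precision = 0.9
reported_recall = 0.861

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # 0.88
```

The recomputed value agrees with the reported F1 score of 0.88.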
[36] VLMs Guided Interpretable Decision Making for Autonomous Driving
Xin Hu,Taotao Jing,Renran Tian,Zhengming Ding
Main category: cs.CV
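TL;DR: The paper proposes shifting vision-language models (VLMs) from direct decision generators to semantic enhancers, using a multi-modal interactive architecture and a post-hoc refinement module to improve the reliability and interpretability of autonomous driving decisions.
Details
Motivation: Existing VQA-based decision-making methods for autonomous driving depend on handcrafted prompts and perform inconsistently, which limits their robustness and generalization in real-world scenarios.
Contribution: 1) Uses VLMs as semantic enhancers; 2) designs a multi-modal interactive architecture; 3) proposes a post-hoc refinement module; 4) achieves state-of-the-art performance on two autonomous driving benchmarks.
Method: Uses the scene understanding of VLMs to enrich existing vision-based benchmarks with scene descriptions, fuses visual and linguistic features in a multi-modal architecture, and applies a post-hoc module to improve prediction reliability.
Result: Experiments show superior performance on autonomous driving tasks, with reliable decisions and interpretable textual explanations.
Insight: VLMs are better suited as semantic enhancers than as direct decision generators, and multi-modal fusion can markedly improve the accuracy and interpretability of autonomous driving systems.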
Abstract: Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
[37] Uni-Hema: Unified Model for Digital Hematopathology
Abdul Rehman,Iqra Rasool,Ayesha Imran,Mohsen Ali,Waqas Sultani
Main category: cs.CV
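TL;DR: Uni-Hema is a unified model for digital hematopathology that integrates detection, classification, segmentation, morphology prediction, and disease reasoning, overcoming the limits of existing methods and enabling unified analysis across diseases.
Details
Motivation: Digital hematopathology requires cell-level analysis across many disease categories, but existing methods cannot handle its multi-task, multi-modal complexity in a unified way.
Contribution: Proposes Uni-Hema, a unified multi-task model that integrates several tasks (detection, classification, segmentation, and others) and uses the multimodal module Hema-Former to bridge visual and textual representations.
Method: Built on the Hema-Former multimodal module, the model leverages 46 public datasets (over 700K images and 21K question-answer pairs) and supports tasks at different granularities.
Result: Performs comparably to or better than single-task models across diverse hematology tasks while providing interpretable single-cell morphological insights.
Insight: Uni-Hema sets a new standard for multi-task, multi-modal digital hematopathology and shows the potential of unified models.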
Abstract: Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Existing approaches, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to models trained on a single task and a single dataset, across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.
[38] Weakly Supervised Ephemeral Gully Detection In Remote Sensing Images Using Vision Language Models
Seyed Mohamad Ali Tousi,John A. Lory,G. N. DeSouza
Main category: cs.CV
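TL;DR: The paper proposes a weakly supervised pipeline that uses vision-language models (VLMs) to reduce labeling effort and automate ephemeral gully detection, improving performance with a teacher-student model and a noise-aware loss function.
Details
Motivation: Ephemeral gullies are a major soil erosion problem in agriculture, but their short-lived nature and the scarcity of labeled data make automatic detection difficult for both classical methods and machine learning.
Contribution: 1) Presents the first weakly supervised pipeline for ephemeral gully detection; 2) exploits the knowledge embedded in VLM pretraining; 3) releases the first dataset for semi-supervised detection, with more than 18,000 high-resolution remote-sensing images.
Method: Built on VLMs and a teacher-student model: the teacher learns from noisy labels generated by the VLMs, and the student is trained by weak supervision with teacher-generated labels and a noise-aware loss function.
Result: Experiments show that the method outperforms the VLMs alone and the label model itself.
Insight: Knowledge from VLM pretraining can substantially reduce the need for manual annotation, and weak supervision can still perform well when labeled data is limited.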
Abstract: Among soil erosion problems, Ephemeral Gullies are one of the most concerning phenomena occurring in agricultural fields. Their short temporal cycles increase the difficulty in automatically detecting them using classical computer vision approaches and remote sensing. Also, due to scarcity of and the difficulty in producing accurate labeled data, automatic detection of ephemeral gullies using Machine Learning is limited to zero-shot approaches which are hard to implement. To overcome these challenges, we present the first weakly supervised pipeline for detection of ephemeral gullies. Our method relies on remote sensing and uses Vision Language Models (VLMs) to drastically reduce the labor-intensive task of manual labeling. In order to achieve that, the method exploits: 1) the knowledge embedded in the VLM’s pretraining; 2) a teacher-student model where the teacher learns from noisy labels coming from the VLMs, and the student learns by weak supervision using teacher-generated labels and a noise-aware loss function. We also make available the first-of-its-kind dataset for semi-supervised detection of ephemeral gullies from remote-sensed images. The dataset consists of a number of locations labeled by a group of soil and plant scientists, as well as a large number of unlabeled locations. The dataset represents more than 18,000 high-resolution remote-sensing images obtained over the course of 13 years. Our experimental results demonstrate the validity of our approach by showing superior performance compared to VLMs and the label model itself when using weak supervision to train a student model. The code and dataset for this work are made publicly available.
[39] Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors
Mert Onur Cakiroglu,Idil Bilge Altun,Zhihe Lu,Mehmet Dalkilic,Hasan Kurban
Main category: cs.CV
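TL;DR: The paper proposes a framework that evaluates the temporal realism of generated videos using motion vectors (MVs) from compressed video streams, and its experiments reveal systematic motion defects in generated videos.
Details
Motivation: Current evaluation metrics for generative video models focus on spatial appearance and are insufficiently sensitive to motion, so a new way to assess temporal behavior is needed.
Contribution: Proposes an evaluation framework based on compressed-domain MVs, quantifies motion differences between generated and real videos, and explores MV-RGB fusion to improve downstream tasks.
Method: Extracts MVs from streams encoded with standards such as H.264 and HEVC, computes Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences to quantify motion differences, and improves classification with MV-RGB fusion modules.
Result: Experiments reveal systematic motion defects in generated videos (e.g., center bias and sparse flows), and MV fusion improves classifier performance (e.g., I3D reaches 99.0% accuracy).
Insight: Compressed-domain MVs are an effective signal for diagnosing motion defects in generated videos and for strengthening temporal reasoning in discriminative models.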
Abstract: Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination. These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: https://github.com/KurbanIntelligenceLab/Motion-Vector-Learning
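To make the three divergences concrete, the following minimal sketch computes them between two histograms of a motion-vector statistic. It uses the textbook definitions with pure-Python lists; the bin counts are hypothetical placeholders, and the way the paper actually bins MV statistics is not specified here.

```python
import math


def normalize(counts):
    """Turn non-negative bin counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against empty bins in q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q, bin_width=1.0):
    """1-D Wasserstein-1 distance between histograms on the same
    equally spaced bins: the area between the two CDFs."""
    cdf_p = cdf_q = total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q) * bin_width
    return total


# Hypothetical MV-magnitude histograms (same bins) for real and generated clips.
real = normalize([10, 30, 40, 15, 5])
generated = normalize([40, 35, 15, 7, 3])

print("KL :", kl_divergence(real, generated))
print("JS :", js_divergence(real, generated))
print("W1 :", wasserstein_1d(real, generated))
```

KL is asymmetric, so the order of arguments matters, whereas JS and the Wasserstein distance are symmetric; this may be one reason rankings can differ across measures.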
[40] SAE-MCVT: A Real-Time and Scalable Multi-Camera Vehicle Tracking Framework Powered by Edge Computing
Yuqiang Lin,Sam Lockyer,Florian Stanek,Markus Zarbock,Adrian Evans,Wenbin Li,Nic Zhang
Main category: cs.CV
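TL;DR: SAE-MCVT is a real-time, scalable multi-camera vehicle tracking framework built on edge computing. It addresses the real-time and scalability shortcomings of prior methods and is suited to city-scale deployment.
Details
Motivation: Existing multi-camera vehicle tracking (MCVT) methods emphasize accuracy while overlooking real-time performance and scalability, both of which are critical in city-scale applications.
Contribution: Proposes SAE-MCVT, the first scalable real-time MCVT framework, which combines edge computing with a self-supervised camera link model to improve real-time performance and scalability.
Method: The system consists of several edge devices and one central workstation. Edge devices process video streams and extract lightweight metadata; the central workstation performs cross-camera association based on spatial-temporal relations and the self-supervised model.
Result: On the RoundaboutHD dataset, SAE-MCVT maintains real-time operation on 2K 15 FPS video streams and achieves an IDF1 score of 61.2.
Insight: Edge computing and lightweight metadata transmission are key to real-time performance and scalability, and the self-supervised camera link model reduces the need for manual annotation.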
Abstract: In modern Intelligent Transportation Systems (ITS), cameras are a key component due to their ability to provide valuable information for multiple stakeholders. A central task is Multi-Camera Vehicle Tracking (MCVT), which generates vehicle trajectories and enables applications such as anomaly detection, traffic density estimation, and suspect vehicle tracking. However, most existing studies on MCVT emphasize accuracy while overlooking real-time performance and scalability. These two aspects are essential for real-world deployment and become increasingly challenging in city-scale applications as the number of cameras grows. To address this issue, we propose SAE-MCVT, the first scalable real-time MCVT framework. The system includes several edge devices that interact with one central workstation separately. On the edge side, live RTSP video streams are serialized and processed through modules including object detection, object tracking, geo-mapping, and feature extraction. Only lightweight metadata – vehicle locations and deep appearance features – are transmitted to the central workstation. On the central side, cross-camera association is calculated under the constraint of spatial-temporal relations between adjacent cameras, which are learned through a self-supervised camera link model. Experiments on the RoundaboutHD dataset show that SAE-MCVT maintains real-time operation on 2K 15 FPS video streams and achieves an IDF1 score of 61.2. To the best of our knowledge, this is the first scalable real-time MCVT framework suitable for city-scale deployment.
[41] Mind the Gap: Evaluating LLM Understanding of Human-Taught Road Safety Principles
Chalamalasetti Kranti
Main category: cs.CV
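TL;DR: The paper evaluates how well multi-modal large language models (LLMs) understand human-taught road safety principles and finds that the models perform poorly in a zero-shot setting.
Details
Motivation: Road safety is essential for autonomous driving systems, but it is unclear whether multi-modal LLMs understand the road safety principles taught to humans.
Contribution: Provides an evaluation dataset and analyzes the performance gaps of LLMs in road safety reasoning.
Method: Collects a dataset of images depicting traffic signs and road safety norms from school textbooks and tests model performance in a zero-shot setting.
Result: Preliminary results show that the models struggle with road safety reasoning and that gaps remain between human learning and model interpretation.
Insight: Reveals the limitations of LLMs in understanding human road safety principles and points to directions for future research.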
Abstract: Following road safety norms is non-negotiable not only for humans but also for the AI systems that govern autonomous vehicles. In this work, we evaluate how well multi-modal large language models (LLMs) understand road safety concepts, specifically through schematic and illustrative representations. We curate a pilot dataset of images depicting traffic signs and road-safety norms sourced from school textbooks and use it to evaluate models' capabilities in a zero-shot setting. Our preliminary results show that these models struggle with safety reasoning and reveal gaps between human learning and model interpretation. We further provide an analysis of these performance gaps for future research.
[42] Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding
Qingyang Yan,Guangyao Chen,Yixiong Zou
Main category: cs.CV
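TL;DR: The paper proposes Curriculum-based Relative Policy Optimization (CuRPO), which trains visual grounding models progressively from simple to complex examples and substantially improves performance.
Details
Motivation: RL-fine-tuned CoT reasoning is found to underperform on visual grounding tasks, especially when CoT outputs become complex, and larger datasets can also degrade performance. CuRPO is proposed to structure the training data.
Contribution: Proposes CuRPO, which uses CoT length and gIoU rewards as complexity indicators to train the model progressively; it performs strongly on multiple datasets, with gains of up to 12.52 mAP.
Method: Adopts a curriculum learning strategy that orders training data by complexity, measured with CoT length and gIoU rewards, to improve training efficiency and performance.
Result: Clearly outperforms existing methods on RefCOCO and related datasets and remains robust in few-shot settings.
Insight: Structuring training data from simple to complex can markedly improve model performance, especially on tasks with complex textual descriptions.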
Abstract: Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +12.52 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate textual descriptions. The code is released at https://github.com/qyoung-yan/CuRPO.
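The two complexity indicators named above can be sketched as follows. The `giou` function follows the standard generalized IoU definition for axis-aligned boxes in `(x1, y1, x2, y2)` form. The ordering rule (shorter CoT first, higher gIoU reward breaking ties) and the sample fields `cot_length` and `giou_reward` are assumptions made for illustration; the abstract does not say how CuRPO combines the indicators.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2).

    Assumes both boxes have positive area. Returns a value in (-1, 1].
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def curriculum_order(samples):
    """Hypothetical easy-to-hard ordering: short CoT first, and among
    equal lengths, higher gIoU reward (better localized) first."""
    return sorted(samples, key=lambda s: (s["cot_length"], -s["giou_reward"]))


# Hypothetical training samples with precomputed indicators.
samples = [
    {"id": "c", "cot_length": 120, "giou_reward": 0.35},
    {"id": "a", "cot_length": 40, "giou_reward": 0.90},
    {"id": "b", "cot_length": 40, "giou_reward": 0.60},
]
print([s["id"] for s in curriculum_order(samples)])  # ['a', 'b', 'c']
```

Unlike plain IoU, gIoU stays informative for non-overlapping boxes because the enclosing-box penalty grows as the boxes move apart.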
[43] Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
Noam Glazner,Noam Tsfaty,Sharon Shalev,Avishai Weizman
Main category: cs.CV
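TL;DR: The paper proposes a cluster-based frame selection strategy to reduce information leakage in video-derived datasets.
Details
Motivation: Frames in video-derived datasets are highly correlated, so conventional dataset splitting can cause information leakage and undermine the reliability of model evaluation.
Contribution: Proposes a cluster-based frame selection method that makes the training, validation, and test partitions more representative, balanced, and reliable.
Method: Clusters frames by visual similarity and then splits the dataset at the cluster level, so that similar frames do not end up in different sets.
Result: The method produces more representative dataset partitions and reduces information leakage.
Insight: Clustering can effectively address leakage caused by frame correlation in video datasets and offers a new approach to dataset splitting.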
Abstract: We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
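The idea of splitting at the cluster level rather than the frame level can be sketched as below. This is an illustrative sketch, not the authors' procedure: it assumes each frame already has a cluster ID from some visual-similarity clustering (the clustering step itself is not shown), and the greedy balancing rule, the function name `split_by_cluster`, and the default ratios are hypothetical choices.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters of frames to train/val/test.

    cluster_ids[i] is the (assumed precomputed) cluster of frame i.
    No cluster is ever divided across two splits.
    """
    clusters = defaultdict(list)
    for frame_idx, cid in enumerate(cluster_ids):
        clusters[cid].append(frame_idx)

    groups = list(clusters.values())
    random.Random(seed).shuffle(groups)    # randomize tie order
    groups.sort(key=len, reverse=True)     # place large clusters first

    targets = [r * len(cluster_ids) for r in ratios]
    splits = [[], [], []]
    for group in groups:
        # Greedy: give the cluster to the split furthest below its target.
        deficits = [targets[i] - len(splits[i]) for i in range(3)]
        splits[deficits.index(max(deficits))].extend(group)
    return {"train": splits[0], "val": splits[1], "test": splits[2]}


# Hypothetical cluster assignment for 20 frames.
example_cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3,
                       4, 4, 5, 5, 5, 6, 7, 7, 8, 9]
parts = split_by_cluster(example_cluster_ids)
for name, frames in parts.items():
    print(name, sorted(frames))
```

Because clusters move as a unit, the realized split sizes only approximate the requested ratios; that is the price of keeping near-duplicate frames on one side of the split.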
[44] Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Zachary Shinnick,Liangze Jiang,Hemanth Saratchandran,Damien Teney,Anton van den Hengel
Main category: cs.CV
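TL;DR: The paper explores instilling generic inductive biases in vision transformers (ViTs) by pretraining them on procedurally generated data devoid of visual or semantic content. This warm-up phase bypasses the visual patch embedding mechanism and improves data efficiency, convergence speed, and downstream performance.
Details
Motivation: To explore a new way of pretraining ViTs on procedurally generated, semantics-free data so as to instil generic inductive biases that are useful across modalities, thereby improving data efficiency and performance.
Contribution: Proposes a novel pretraining approach based on procedurally generated data that improves the data efficiency, convergence speed, and downstream performance of ViTs. For example, allocating just 1% of the training budget to procedural data noticeably improves ImageNet-1k accuracy.
Method: Generates data devoid of visual or semantic content with simple algorithms such as formal grammars and uses it in a warm-up pretraining phase for ViTs. This phase bypasses the patch embedding mechanism and encourages the model to internalise abstract computational priors.
Result: Experiments show that procedural-data pretraining improves performance. On ImageNet-1k, 1% procedural data has an effect equivalent to 28% of the real data, and final accuracy improves by over 1.7%.
Insight: Procedurally generated data can serve as an efficient pretraining strategy, with particular promise for data-scarce or cross-domain tasks, offering ViTs a new data-efficient training path.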
Abstract: Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
[45] EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation
Matin Daghyani,Lyuyang Wang,Nima Hashemi,Bassant Medhat,Baraa Abdelsamad,Eros Rojas Velez,XiaoXiao Li,Michael Y. C. Tsang,Christina Luong,Teresa S. M. Tsang,Purang Abolmaesumi
Main category: cs.CV
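TL;DR: EchoAgent is a guideline-centric reasoning agent for echocardiography measurement and interpretation. It uses a large language model (LLM) to orchestrate specialized vision tools, enabling structured and interpretable automation of echocardiographic analysis.
Details
Motivation: Current deep learning models do not support video-level reasoning or guideline-based measurement analysis for echocardiography. EchoAgent fills this gap with a transparent and reliable automated solution.
Contribution: 1) Introduces a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame; 2) builds a benchmark of diverse, clinically validated video-query pairs.
Method: Places specialized vision tools under LLM control to perform temporal localization, spatial measurement, and clinical interpretation, with the measurement-feasibility model enabling autonomous tool selection.
Result: EchoAgent achieves accurate and interpretable results on complex spatiotemporal video analysis. Its outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability.
Insight: EchoAgent shows how task-specific tools and full video-level automation enable guideline-aligned echocardiographic analysis, setting a new direction for trustworthy AI in cardiac ultrasound.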
Abstract: Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite the added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.
[46] Learning Skill-Attributes for Transferable Assessment in Video
Kumar Ashutosh,Kristen Grauman
Main category: cs.CV
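TL;DR: The paper proposes CrossTrainer, which discovers skill-attributes shared across sports (such as balance and control) and trains a multimodal language model to generate actionable feedback and proficiency ratings for videos, improving skill assessment in both cross-sport and intra-sport settings.
Details
Motivation: Current video-based skill assessment models typically specialize in a single sport, and expert annotation is costly and scarce, making it hard to cover the long tail of sports. Transferable video representations are therefore key to addressing this problem.
Contribution: Proposes CrossTrainer, which extracts skill-attributes that generalize across sports and uses a multimodal language model to generate feedback and assessments, achieving gains of up to 60%.
Method: The method has two steps: 1) discover cross-sport skill-attributes (e.g., balance, control); 2) train a multimodal language model to generate feedback and proficiency ratings. It is validated in both cross-sport and intra-sport settings.
Result: Validated on multiple datasets, CrossTrainer achieves gains of up to 60% relative to the state of the art in cross-sport (transfer) and intra-sport (in-domain) settings.
Insight: Abstracting out the shared behaviors that indicate human skill makes video representations generalize substantially better, opening new possibilities for multimodal large language models.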
Abstract: Skill assessment from video entails rating the quality of a person’s physical performance and explaining what could be done better. Today’s models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes, such as balance, control, and hand positioning – whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., “lift hands more to generate more power” as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today’s multimodal large language models.
[47] CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution
Xianming Gu,Lihui Wang,Ying Cao,Zeyu Deng,Yingfeng Ou,Guodong Hu,Yi Chen
Main category: cs.CV
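TL;DR: The paper proposes CD-DPE, a dual-prompt expert network based on convolutional dictionary feature decoupling, for multi-contrast MRI super-resolution. It decouples cross-contrast and intra-contrast features and applies a dual-prompt fusion strategy, improving reconstruction quality.
Details
Motivation: In multi-contrast MRI super-resolution, contrast disparities between modalities lead to suboptimal feature integration, which limits the quality of the generated high-resolution images.
Contribution: Proposes a convolutional dictionary feature decoupling module (CD-FDM) and a dual-prompt feature fusion expert module (DP-FFEM), for feature decoupling and optimal fusion respectively, improving reconstruction performance and generalization.
Method: Decouples features into cross-contrast and intra-contrast components with an iterative convolutional dictionary, then selects and fuses features using a frequency prompt and an adaptive routing prompt.
Result: Outperforms existing methods on public datasets and shows strong generalization on unseen datasets.
Insight: Feature decoupling combined with a dual-prompt strategy reduces redundancy and interference and improves fine-detail reconstruction in multi-contrast MRI super-resolution.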
Abstract: Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrast disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrate that CD-DPE exhibits strong generalization capabilities.
[48] RISE: Single Static Radar-based Indoor Scene Understanding
Kaichen Zhou,Laura Dodds,Sayed Saad Afzal,Fadel Adib
Main category: cs.CV
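TL;DR: RISE is the first indoor scene understanding system based on a single static radar. Using multipath reflection enhancement and a hierarchical diffusion framework, it achieves accurate layout reconstruction and object detection.
Details
Motivation: Optical sensors such as RGB cameras and LiDAR suffer from occlusion and privacy risks in indoor scenes, whereas millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles but has low spatial resolution. RISE addresses this by exploiting multipath reflections to improve geometric reasoning.
Contribution: 1) Presents RISE, the first benchmark and system for single-static-radar scene understanding, jointly addressing layout reconstruction and object detection; 2) proposes Bi-Angular Multipath Enhancement, which uses Angle-of-Arrival (AoA) and Angle-of-Departure (AoD) to recover secondary reflections; 3) builds the first large-scale dataset for radar-based indoor scene understanding.
Method: 1) Bi-Angular Multipath Enhancement explicitly models AoA and AoD to recover invisible structures; 2) a hierarchical diffusion framework turns fragmented radar responses into complete layout reconstruction and object detection.
Result: Reduces the Chamfer Distance for layout reconstruction by 60% (down to 16 cm) and delivers the first mmWave-based object detection (58% IoU).
Insight: Multipath reflections, traditionally treated as noise, carry rich geometric information and are key to better radar scene understanding. The hierarchical diffusion framework effectively compensates for fragmented data.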
Abstract: Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar.
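For readers unfamiliar with the layout metric, the sketch below computes one common form of the Chamfer Distance between two point sets: the average of the two directed mean nearest-neighbor distances. Conventions vary (summed versus averaged, squared versus unsquared distances), and the abstract does not state which one RISE uses, so this variant and the example points are assumptions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance (mean nearest-neighbor form).

    points_a, points_b: non-empty sequences of equal-dimension coordinates.
    Brute force, O(len(a) * len(b)); fine for small illustrative sets.
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


# Hypothetical 2-D layout samples in meters: a predicted wall offset from the true one.
true_wall = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred_wall = [(0.0, 0.1), (1.0, 0.2), (2.0, 0.3)]
print(chamfer_distance(true_wall, pred_wall))
```

Because each point is matched only to its nearest neighbor, the metric needs no point-to-point correspondence, which suits comparing a reconstructed layout against a reference.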
[49] MRI Plane Orientation Detection using a Context-Aware 2.5D Model
SangHyuk Kim,Daniel Haehn,Sumientra Rampersad
Main category: cs.CV
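TL;DR: The paper proposes an MRI plane orientation detection method based on a context-aware 2.5D model. Using multi-slice information avoids the ambiguity of isolated slices and improves classification accuracy, and the method's usefulness is validated on a brain tumor detection task.
Details
Motivation: Identifying the plane orientation of MRI slices (axial, coronal, and sagittal) matters for medical image analysis, but automated systems struggle with it. Missing orientation metadata complicates analysis and worsens heterogeneity when datasets are merged.
Contribution: 1) Proposes a context-aware 2.5D model that improves plane orientation classification accuracy; 2) shows how the generated metadata improves performance on a brain tumor detection task; 3) open-sources the model and an interactive web application.
Method: A 2.5D model uses multi-slice information to avoid the ambiguity of single slices and is trained on both 3D slice sequences and static 2D images. In the downstream task, a gated strategy based on uncertainty scores selectively uses metadata-enhanced predictions.
Result: The 2.5D model raises accuracy from 98.74% (2D model) to 99.49%, reducing errors by 60%. On brain tumor detection, accuracy rises from 97.0% to 98.0%, reducing misdiagnoses by 33.3%.
Insight: Adding multi-slice context makes plane orientation recognition more robust, and the generated metadata can help downstream tasks such as tumor detection.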
Abstract: Humans can easily identify anatomical planes (axial, coronal, and sagittal) on a 2D MRI slice, but automated systems struggle with this task. Missing plane orientation metadata can complicate analysis, increase domain shift when merging heterogeneous datasets, and reduce accuracy of diagnostic classifiers. This study develops a classifier that accurately generates plane orientation metadata. We adopt a 2.5D context-aware model that leverages multi-slice information to avoid ambiguity from isolated slices and enable robust feature learning. We train the 2.5D model on both 3D slice sequences and static 2D images. While our 2D reference model achieves 98.74% accuracy, our 2.5D method raises this to 99.49%, reducing errors by 60%, highlighting the importance of 2.5D context. We validate the utility of our generated metadata in a brain tumor detection task. A gated strategy selectively uses metadata-enhanced predictions based on uncertainty scores, boosting accuracy from 97.0% with an image-only model to 98.0%, reducing misdiagnoses by 33.3%. We integrate our plane orientation model into an interactive web application and provide it open-source.
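The quoted error reductions follow from the accuracy figures: the error rate is 100% minus accuracy, and the reduction is measured relative to the original error rate. The short check below reproduces both numbers; the function name is an illustrative choice.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Percentage drop in error rate when accuracy (in %) improves."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before * 100.0


# Plane orientation: 2D reference model vs. 2.5D model.
print(round(relative_error_reduction(98.74, 99.49), 1))  # 59.5

# Brain tumor detection: image-only model vs. gated metadata-enhanced model.
print(round(relative_error_reduction(97.0, 98.0), 1))  # 33.3
```

The first result, 59.5%, is what the abstract rounds to 60%; the second matches the stated 33.3% exactly.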
[50] LINGUAL: Language-INtegrated GUidance in Active Learning for Medical Image Segmentation
Md Shazid Islam,Shreyangshu Bera,Sudipta Paul,Amit K. Roy-Chowdhury
Main category: cs.CV
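TL;DR: LINGUAL is an active learning framework with integrated language guidance. It uses natural language instructions to simplify annotation in medical image segmentation, substantially reducing annotation time and the cognitive load on experts.
Details
Motivation: Annotation in conventional active learning (AL) for medical image segmentation is complex and time-consuming, especially where boundaries are blurry. LINGUAL aims to replace conventional annotation with language instructions to lower expert cognitive load and annotation cost.
Contribution: LINGUAL proposes a language-integrated active learning approach that translates natural language instructions into executable programs and performs the segmentation sub-tasks automatically, cutting estimated annotation time by about 80%.
Method: LINGUAL uses in-context learning to translate natural language instructions into programs and executes the resulting sequence of segmentation sub-tasks without human intervention.
Result: In active domain adaptation (ADA), LINGUAL matches or outperforms conventional AL baselines while reducing estimated annotation time by approximately 80%.
Insight: Language guidance is an efficient alternative that simplifies complex annotation tasks and improves efficiency without sacrificing performance.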
Abstract: Although active learning (AL) in segmentation tasks enables experts to annotate selected regions of interest (ROIs) instead of entire images, it remains highly challenging, labor-intensive, and cognitively demanding due to the blurry and ambiguous boundaries commonly observed in medical images. Also, in conventional AL, annotation effort is a function of the ROI: larger regions make the task cognitively easier but incur higher annotation costs, whereas smaller regions demand finer precision and more attention from the expert. In this context, language guidance provides an effective alternative, requiring minimal expert effort while bypassing the cognitively demanding task of precise boundary delineation in segmentation. Towards this goal, we introduce LINGUAL: a framework that receives natural language instructions from an expert, translates them into executable programs through in-context learning, and automatically performs the corresponding sequence of sub-tasks without any human intervention. We demonstrate the effectiveness of LINGUAL in active domain adaptation (ADA), achieving comparable or superior performance to AL baselines while reducing estimated annotation time by approximately 80%.
[51] Training-free Detection of AI-generated images via Cropping Robustness
Sungik Choi,Hankook Lee,Moontae Lee
Main category: cs.CV
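TL;DR: The paper proposes WaRPAD, a training-free method for detecting AI-generated images that exploits the robustness of self-supervised models to cropping, with broad applicability and strong robustness.
Details
Motivation: With the rapid progress of visual generative models, detecting AI-generated images has become crucial. Conventional approaches train a detector for specific datasets; this paper explores a training-free approach built on self-supervised models.
Contribution: Proposes WaRPAD, a training-free AI-generated image detection algorithm based on self-supervised models, which detects generated images by quantifying the sensitivity of image embeddings to perturbations along high-frequency directions.
Method: High-frequency directions are extracted via Haar wavelet decomposition, and a base score function quantifies the sensitivity of image embeddings to perturbations along them. The final detection score is obtained by rescaling the image, splitting it into patches, and averaging the per-patch base scores.
Result: WaRPAD shows competitive performance and robustness on diverse datasets and on images generated by 23 different generative models.
Insight: Invariance to RandomResizedCrop is a common training scheme across self-supervised models, which makes WaRPAD applicable to many of them.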
Abstract: AI-generated image detection has become crucial with the rapid advancement of vision-generative models. Instead of training detectors tailored to specific datasets, we study a training-free approach leveraging self-supervised models without requiring prior data knowledge. These models, pre-trained with augmentations like RandomResizedCrop, learn to produce consistent representations across varying resolutions. Motivated by this, we propose WaRPAD, a training-free AI-generated image detection algorithm based on self-supervised models. Since neighborhood pixel differences in images are highly sensitive to resizing operations, WaRPAD first defines a base score function that quantifies the sensitivity of image embeddings to perturbations along high-frequency directions extracted via Haar wavelet decomposition. To simulate robustness against cropping augmentation, we rescale each image to a multiple of the model's input size, divide it into smaller patches, and compute the base score for each patch. The final detection score is then obtained by averaging the scores across all patches. We validate WaRPAD on real datasets of diverse resolutions and domains, and images generated by 23 different generative models. Our method consistently achieves competitive performance and demonstrates strong robustness to test-time corruptions. Furthermore, as invariance to RandomResizedCrop is a common training scheme across self-supervised models, we show that WaRPAD is applicable across self-supervised models.
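The scoring pipeline described in this abstract (Haar high-frequency direction, per-patch sensitivity, averaging) can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: the embedding is passed in as a hypothetical `embed` callable, `toy_embed` is a stand-in for a real self-supervised model, the rescaling step is omitted, and the step size `eps` and the Euclidean distance are assumptions.

```python
import math


def haar_high_freq(img):
    """High-frequency part of a single-level 2D Haar decomposition.

    Dropping the detail bands and inverting the transform leaves every 2x2
    block at its mean, so the detail part is each pixel minus its block mean.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            mean = (img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
            for di in (0, 1):
                for dj in (0, 1):
                    out[i + di][j + dj] = img[i + di][j + dj] - mean
    return out


def base_score(img, embed, eps=0.1):
    """Embedding shift caused by a small step along the high-frequency direction."""
    hf = haar_high_freq(img)
    perturbed = [[p + eps * d for p, d in zip(r1, r2)] for r1, r2 in zip(img, hf)]
    return math.dist(embed(img), embed(perturbed))


def detection_score(img, embed, patch, eps=0.1):
    """Average the base score over non-overlapping patch x patch tiles.

    Assumes image height and width are multiples of `patch`, and `patch` is even.
    """
    scores = []
    for i in range(0, len(img), patch):
        for j in range(0, len(img[0]), patch):
            tile = [row[j:j + patch] for row in img[i:i + patch]]
            scores.append(base_score(tile, embed, eps))
    return sum(scores) / len(scores)


def toy_embed(img):
    """Hypothetical stand-in embedding: mean intensity and mean horizontal gradient."""
    flat = [p for row in img for p in row]
    diffs = [abs(row[j + 1] - row[j]) for row in img for j in range(len(row) - 1)]
    return [sum(flat) / len(flat), sum(diffs) / len(diffs)]


flat_img = [[0.5] * 4 for _ in range(4)]
checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
flat_score = detection_score(flat_img, toy_embed, patch=2)
checker_score = detection_score(checker, toy_embed, patch=2)
```

A constant image has no high-frequency content, so its score is zero, while the checkerboard yields a positive score. How the score is thresholded to separate real from generated images is not specified here.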
[52] Saliency-Guided Deep Learning for Bridge Defect Detection in Drone Imagery
Loucif Hebbache,Dariush Amirkhani,Mohand Saïd Allili,Jean-François Lapointe
Main category: cs.CV
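TL;DR: The paper proposes a method that combines saliency-based region proposals with a YOLOX deep learning detector to automatically detect and classify bridge defects in drone imagery. Experiments confirm its efficiency and accuracy.
Details
Motivation: Bridge defect detection is a challenging computer vision problem. Drone imagery offers a new way to inspect bridges, but locating and classifying defects efficiently and accurately remains the key difficulty.
Contribution: 1) A saliency-guided region proposal method for initial defect localization; 2) a YOLOX-based detector that operates on saliency-enhanced images to improve detection accuracy.
Method: The method has two stages: 1) saliency detection extracts candidate defect regions; 2) YOLOX detects and classifies defects on the saliency-enhanced images.
Result: Evaluation on standard datasets shows high accuracy and computational efficiency, making the method suitable for a self-powered inspection system.
Insight: Combining saliency-based region proposals with deep learning addresses local feature extraction in bridge defect detection and makes practical deployment feasible.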
Abstract: Anomaly object detection and classification are among the most challenging tasks in computer vision and pattern recognition. In this paper, we propose a new method to automatically detect, localize and classify defects in concrete bridge structures using drone imagery. This framework consists of two main stages. The first stage uses saliency for defect region proposals, since defects often exhibit local discontinuities in the normal surface patterns relative to their surroundings. The second stage employs a YOLOX-based deep learning detector that operates on saliency-enhanced images obtained by applying bounding-box level brightness augmentation to salient defect regions. Experimental results on standard datasets confirm the performance of our framework and its suitability in terms of accuracy and computational efficiency, which gives it strong potential for implementation in a self-powered inspection system.
[53] Semantic Context Matters: Improving Conditioning for Autoregressive Models
Dongyang Jin,Ryan Xu,Jianhao Zeng,Rui Lan,Yancheng Bai,Lei Sun,Xiangxiang Chu
Main category: cs.CV
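TL;DR: SCAR is a semantic-context-driven method for improving autoregressive models. Through Compressed Semantic Prefilling and Semantic Alignment Guidance, it improves instruction adherence and visual fidelity in image editing.
Details
Motivation: Autoregressive models perform well in image generation, but their weak and inefficient conditioning leads to poor instruction adherence and visual artifacts in image editing. SCAR aims to address these problems.
Contribution: Proposes SCAR, with two key modules, Compressed Semantic Prefilling and Semantic Alignment Guidance, that strengthen conditioning control in autoregressive models.
Method: Compressed Semantic Prefilling encodes high-level semantics into a compact prefix, and Semantic Alignment Guidance aligns the visual hidden states with the target semantics during decoding.
Result: SCAR achieves superior visual fidelity and semantic alignment on instruction editing and controllable generation tasks, outperforming prior autoregressive methods.
Insight: SCAR shows the importance of semantic context for autoregressive models and offers an efficient, general conditioning framework.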
Abstract: Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based prefilling while overcoming its semantic limitations and high cost. It generalizes across both next-token and next-set AR paradigms with minimal architectural changes. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks, outperforming prior AR-based methods while maintaining controllability. All code will be released.
[54] CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Jingyu Lei,Gaoang Wang,Der-Horng Lee
Main category: cs.CV
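TL;DR: CORE proposes a new visual token compression method that uses object segmentation and a centroid-guided sorting mechanism to produce compact object-centric representations, improving efficiency while preserving performance.
Details
Motivation: Existing visual token compression methods lack high-level semantic understanding, which leads to information redundancy or context loss. CORE aims to address this.
Contribution: CORE generates object-centric representations through object segmentation and centroid-guided sorting, achieving efficient token compression while retaining performance.
Method: An efficient segmentation decoder produces object masks that serve as a semantic prior for merging tokens, and a centroid-guided sorting mechanism restores spatial order.
Result: State-of-the-art results on six benchmarks; under extreme compression, 97.4% of baseline performance is retained.
Insight: Object-centric representations are an effective route to efficient LVLM processing.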
Abstract: Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, when aggressively retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
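The mask-guided merging and centroid-guided sorting described in this abstract can be illustrated with a small sketch. It is a simplified reading under stated assumptions: tokens sharing an object id are merged by plain averaging, and the merged tokens are ordered by centroid in raster order (top to bottom, then left to right). Neither detail is confirmed by the abstract, and the function name is hypothetical.

```python
from collections import defaultdict


def merge_tokens_by_object(tokens, mask):
    """Average the tokens of each object, then order objects by centroid.

    tokens[r][c] is a feature vector (list of floats) for grid cell (r, c).
    mask[r][c] is an integer object id for the same cell.
    """
    groups = defaultdict(list)  # object id -> list of (row, col) cells
    for r, row in enumerate(mask):
        for c, obj_id in enumerate(row):
            groups[obj_id].append((r, c))

    merged = []
    for cells in groups.values():
        n = len(cells)
        dim = len(tokens[cells[0][0]][cells[0][1]])
        # Mean feature over all cells covered by this object.
        feature = [sum(tokens[r][c][k] for r, c in cells) / n for k in range(dim)]
        centroid = (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)
        merged.append((centroid, feature))

    # Assumed ordering: raster order of the object centroids.
    merged.sort(key=lambda item: item[0])
    return [feature for _, feature in merged]


# A 2x2 token grid: object 1 covers the top row, object 0 the bottom row.
tokens = [[[1.0], [3.0]], [[10.0], [20.0]]]
mask = [[1, 1], [0, 0]]
merged = merge_tokens_by_object(tokens, mask)
```

Four tokens collapse to two, and the top-row object comes first regardless of its id, which is the role the sorting step plays in keeping positional information.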
[55] Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification
Yao Qin,Yangyang Yan,YuanChao Yang,Jinhua Pang,Huanyong Bi,Yuan Liu,HaiHua Wang
Main category: cs.CV
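TL;DR: The paper proposes a new paradigm, ZS-TMS, in which a pre-trained generative engine directly synthesizes the parameters of a task-specific classifier without any task-specific training, enabling few-shot (even 1-shot) medical image classification.
Details
Motivation: Medical image analysis depends on large annotated datasets, but medical data are hard to acquire and expensive to annotate, and samples for rare diseases are scarce. Conventional methods are limited by these data requirements, so a few-shot or training-free approach is needed.
Contribution: Proposes the ZS-TMS paradigm, in which a pre-trained generative engine directly synthesizes classifier parameters, avoiding task-specific training or fine-tuning. Develops the Semantic-Guided Parameter Synthesizer (SGPS) framework, which generates classifier weights from multi-modal information such as images and clinical text.
Method: SGPS feeds a small amount of multi-modal information (for example, a single image and a clinical text description) to the generative engine, which directly synthesizes the parameters of a task-specific classifier. The resulting lightweight classifier (e.g., an EfficientNet-V2) can be deployed for inference without training.
Result: On benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset, SGPS significantly outperforms existing few-shot and zero-shot learning methods on 1-shot and 5-shot classification, setting a new state of the art.
Insight: The work shows the potential of generative models to synthesize task-specific model parameters directly, opening a path to rapid development and deployment of AI diagnostic tools in data-scarce medical settings, especially rare diseases.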
Abstract: Deep learning models have achieved remarkable success in medical image analysis but are fundamentally constrained by the requirement for large-scale, meticulously annotated datasets. This dependency on “big data” is a critical bottleneck in the medical domain, where patient data is inherently difficult to acquire and expert annotation is expensive, particularly for rare diseases where samples are scarce by definition. To overcome this fundamental challenge, we propose a novel paradigm: Zero-Training Task-Specific Model Synthesis (ZS-TMS). Instead of adapting a pre-existing model or training a new one, our approach leverages a large-scale, pre-trained generative engine to directly synthesize the entire set of parameters for a task-specific classifier. Our framework, the Semantic-Guided Parameter Synthesizer (SGPS), takes minimal multi-modal task information as input: as little as a single example image (1-shot) and a corresponding clinical text description. The generative engine interprets these inputs to generate the weights for a lightweight, efficient classifier (e.g., an EfficientNet-V2), which can be deployed for inference immediately without any task-specific training or fine-tuning. We conduct extensive evaluations on challenging few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset. Our results demonstrate that SGPS establishes a new state-of-the-art, significantly outperforming advanced few-shot and zero-shot learning methods, especially in the ultra-low data regimes of 1-shot and 5-shot classification. This work paves the way for the rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited.
[56] Error-Driven Scene Editing for 3D Grounding in Large Language Models
Yue Zhang,Zun Wang,Han Lin,Jialu Li,Jianing Yang,Yonatan Bitton,Idan Szpektor,Mohit Bansal
Main category: cs.CV
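TL;DR: The paper proposes DEER-3D, an error-driven framework that generates targeted counterfactual data through fine-grained spatial scene editing in order to improve the spatial grounding of 3D large language models (LLMs).
Details
Motivation: Current 3D-LLMs have progressed in language reasoning but remain limited in accurately grounding language to visual and spatial elements, largely because training data emphasizes language reasoning over spatial understanding.
Contribution: Proposes the DEER-3D framework, which follows a decompose, diagnostic evaluation, edit, and re-train workflow to apply targeted edits to 3D scenes and generate counterfactual data that improves grounding.
Method: Following the "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, the framework diagnoses the model's errors and applies minimal spatial edits (such as recoloring or repositioning) to produce counterfactual supervision.
Result: Across multiple benchmarks for 3D grounding and scene understanding, DEER-3D improves grounding accuracy through iterative refinement.
Insight: Error-driven, targeted scene editing is more efficient than conventional data augmentation and effectively improves the spatial grounding of 3D-LLMs.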
Abstract: Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured “Decompose, Diagnostic Evaluation, Edit, and Re-train” workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.
[57] GCA-ResUNet: Image segmentation in medical images using grouped coordinate attention
Jun Ding,Shang Gao
Main category: cs.CV
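TL;DR: GCA-ResUNet introduces Grouped Coordinate Attention (GCA) to address the difficulty U-Net style networks have in capturing long-range dependencies in medical image segmentation, achieving efficient global modeling under limited compute.
Details
Motivation: U-Net style networks perform well in medical image segmentation but struggle to capture long-range dependencies. Transformer variants address this but are computationally heavy and need large training datasets. A method that models global context efficiently is therefore needed.
Contribution: Proposes GCA-ResUNet, which integrates Grouped Coordinate Attention (GCA) into ResNet-50 residual blocks to model global dependencies efficiently with low computational overhead.
Method: GCA uses grouped coordinate modeling to jointly encode global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation at much lower cost than self-attention.
Result: Dice scores of 86.11% on Synapse and 92.64% on ACDC, surpassing several state-of-the-art methods while keeping inference efficient.
Insight: GCA offers a practical way to add global modeling capability to convolutional architectures and suits resource-constrained medical image segmentation.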
Abstract: Medical image segmentation underpins computer-aided diagnosis and therapy by supporting clinical diagnosis, preoperative planning, and disease monitoring. While U-Net style convolutional neural networks perform well due to their encoder-decoder structures with skip connections, they struggle to capture long-range dependencies. Transformer-based variants address global context but often require heavy computation and large training datasets. This paper proposes GCA-ResUNet, an efficient segmentation network that integrates Grouped Coordinate Attention (GCA) into ResNet-50 residual blocks. GCA uses grouped coordinate modeling to jointly encode global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation while adding minimal parameter and FLOP overhead compared with self-attention. On the Synapse dataset, GCA-ResUNet achieves a Dice score of 86.11%, and on the ACDC dataset, it reaches 92.64%, surpassing several state-of-the-art baselines while maintaining fast inference and favorable computational efficiency. These results indicate that GCA offers a practical way to enhance convolutional architectures with global modeling capability, enabling high-accuracy and resource-efficient medical image segmentation.
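For reference, the Dice score reported above is the standard overlap measure between a predicted and a ground-truth mask. The sketch below computes it for flattened binary masks; treating two empty masks as a perfect match is a convention chosen for this example, and how the paper averages over classes and cases is not stated here.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match (convention)
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)  # one shared pixel, mask sizes 2 and 1
```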
[58] SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts
Fan Zhang,Haoyuan Ren,Fei Ma,Qiang Yin,Yongsheng Zhou
Main category: cs.CV
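TL;DR: SMGeo is a transformer-based end-to-end model that uses a grid-level sparse Mixture-of-Experts (GMoE) and an anchor-free detection head for efficient cross-view object geo-localization, with significant performance gains.
Details
Motivation: Conventional cross-view object geo-localization methods are prone to cumulative errors because of large viewpoint and scale differences and complex backgrounds. The authors propose a new interactive end-to-end model to address this.
Contribution: 1) The SMGeo model, which supports click prompting and real-time geo-localization; 2) a grid-level sparse Mixture-of-Experts (GMoE) that adaptively captures inter-modal and intra-view dependencies; 3) an anchor-free detection head that avoids the scale bias introduced by predefined anchor boxes.
Method: 1) A Swin-Transformer jointly encodes drone and satellite image features; 2) GMoE dynamically activates expert modules; 3) an anchor-free detection head with heat-map supervision directly predicts object coordinates.
Result: On accuracy at IoU=0.25 and mIoU, SMGeo significantly outperforms baselines such as DetGeo (87.51%, 62.50%, and 61.45% on the test set, respectively).
Insight: The grid-level sparse Mixture-of-Experts and the anchor-free design improve the accuracy and efficiency of cross-view geo-localization and show the potential of transformers for complex vision tasks.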
Abstract: Cross-view object Geo-localization aims to precisely pinpoint the same object across large-scale satellite imagery based on drone images. Due to significant differences in viewpoint and scale, coupled with complex background interference, traditional multi-stage “retrieval-matching” pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable end-to-end transformer-based model for object Geo-localization. This model supports click prompting and can output object Geo-localization in real time when prompted to allow for interactive use. The model employs a fully transformer-based architecture, utilizing a Swin-Transformer for joint feature encoding of both drone and satellite imagery and an anchor-free transformer detection head for coordinate regression. In order to better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale and source of each grid. We also employ an anchor-free detection head for coordinate regression, directly predicting object locations via heat-map supervision in the reference images. This approach avoids scale bias and matching complexity introduced by predefined anchor boxes. On the drone-to-satellite task, SMGeo achieves leading performance in accuracy at IoU=0.25 and mIoU metrics (e.g., 87.51%, 62.50%, and 61.45% in the test set, respectively), significantly outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%, respectively). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and grid-level sparse mixture-of-experts.
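The two metrics named in this abstract can be sketched as follows. The code assumes axis-aligned boxes given as (x1, y1, x2, y2), counts a prediction as correct when its IoU with the ground truth is at least the threshold, and takes mIoU as the mean IoU over all samples. These conventions are common but are assumptions here, not details confirmed by the abstract.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_miou(preds, targets, threshold=0.25):
    """Fraction of samples with IoU >= threshold, and the mean IoU."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    accuracy = sum(1 for v in ious if v >= threshold) / len(ious)
    return accuracy, sum(ious) / len(ious)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(1, 1, 3, 3), (0, 0, 2, 2)]
acc, miou = accuracy_and_miou(preds, targets)
```

In this toy case the first pair overlaps with IoU 1/7, below the 0.25 threshold, and the second pair matches exactly, so the accuracy is 0.5 and the mean IoU is 4/7.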
[59] BCE3S: Binary Cross-Entropy Based Tripartite Synergistic Learning for Long-tailed Recognition
Weijia Fan,Qiufu Li,Jiajun Wen,Xiaoyang Peng
Main category: cs.CV
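TL;DR: The paper proposes BCE3S, a tripartite synergistic learning method based on binary cross-entropy (BCE) that addresses feature compactness, class separability, and classifier balance in long-tailed recognition, reaching SOTA performance on several datasets.
Details
Motivation: Existing long-tailed recognition methods based on cross-entropy (CE) couple imbalanced classifier vectors in the Softmax denominator, which makes compact and separable features hard to learn. BCE3S aims to improve on this by using multiple Sigmoid terms with BCE.
Contribution: Proposes BCE3S, which consists of BCE-based joint learning, contrastive learning, and uniform learning, targeting feature compactness and separability, intra-class compactness, and classifier balance, respectively.
Method: 1) BCE-based joint learning decouples the metrics between features and classifier vectors through multiple Sigmoid terms; 2) BCE-based contrastive learning improves intra-class compactness; 3) BCE-based uniform learning balances separability among classifier vectors.
Result: SOTA performance on long-tailed datasets including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018.
Insight: BCE shows promise for synergistic multi-objective learning and is particularly suited to the imbalance found in long-tailed data.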
Abstract: For long-tailed recognition (LTR) tasks, high intra-class compactness and inter-class separability in both head and tail classes, as well as balanced separability among all the classifier vectors, are preferred. The existing LTR methods based on cross-entropy (CE) loss not only struggle to learn features with desirable properties but also couple imbalanced classifier vectors in the denominator of its Softmax, amplifying the imbalance effects in LTR. In this paper, for the LTR, we propose a binary cross-entropy (BCE)-based tripartite synergistic learning, termed BCE3S, which consists of three components: (1) BCE-based joint learning optimizes both the classifier and sample features, which achieves better compactness and separability among features than the CE-based joint learning, by decoupling the metrics between feature and the imbalanced classifier vectors in multiple Sigmoid; (2) BCE-based contrastive learning further improves the intra-class compactness of features; (3) BCE-based uniform learning balances the separability among classifier vectors and interactively enhances the feature properties by combining with the joint learning. The extensive experiments show that the LTR model trained by BCE3S not only achieves higher compactness and separability among sample features, but also balances the classifier’s separability, achieving SOTA performance on various long-tailed datasets such as CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018.
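The coupling argument in this abstract can be seen by writing the two losses side by side. The sketch below uses the textbook forms of softmax cross-entropy and one-vs-rest sigmoid BCE on raw logits; it illustrates only the general contrast and is not the BCE3S objective itself, whose exact terms are not given here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ce_loss(logits, label):
    """Softmax cross-entropy: every class logit enters the shared denominator."""
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[label]


def bce_loss(logits, label):
    """One-vs-rest sigmoid BCE: each class contributes a term of its own logit only."""
    return sum(
        softplus(-z) if k == label else softplus(z)  # -log s(z) or -log(1 - s(z))
        for k, z in enumerate(logits)
    )


balanced = [0.0, 0.0]
ce_value = ce_loss(balanced, label=0)    # log 2
bce_value = bce_loss(balanced, label=0)  # 2 * log 2
```

In `ce_loss` the target-class term cannot be separated from the other logits because of the shared log-denominator, whereas in `bce_loss` the target-class term `softplus(-z)` depends on that class's logit alone.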
[60] FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
Jingren Liu,Shuning Xu,Qirui Yang,Yun Wang,Xiangyu Chen,Zhong Ji
Main category: cs.CV
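TL;DR: FAPE-IR is a frequency-aware image restoration framework that combines a multimodal large language model (MLLM) with a LoRA-MoE module to handle multiple degradations in one model, achieving state-of-the-art results in experiments.
Details
Motivation: Existing multi-degradation restoration methods often rely on task-specific designs or latent routing strategies and adapt poorly to complex real-world scenarios. FAPE-IR aims to provide a unified and interpretable solution.
Contribution: Proposes the FAPE-IR framework, which uses a frozen MLLM as a planner to generate frequency-aware restoration plans and introduces a LoRA-MoE module and a frequency regularization loss to improve restoration quality.
Method: 1) An MLLM analyzes the degraded image and generates a restoration plan; 2) a LoRA-MoE inside a diffusion model dynamically selects high- or low-frequency experts; 3) adversarial training and a frequency regularization loss are added.
Result: SOTA performance across seven restoration tasks and strong zero-shot generalization under mixed degradations.
Insight: Coupling frequency-aware planning with execution yields a unified framework for multi-degradation restoration and improves interpretability and adaptability.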
Abstract: All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
[61] Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
Yiqing Shen,Chenjia Li,Mathias Unberath
Main category: cs.CV
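TL;DR: The paper proposes RIVER, a reinforcement-learning-based method for text-driven video editing that uses digital twin representations and multi-hop reasoning to handle complex edits specified by implicit queries, achieving the best performance on several benchmarks.
Details
Motivation: Existing video editing methods need explicit spatial and temporal descriptions, whereas users often express editing intent through implicit queries, which limits practicality. The paper aims to address this gap.
Contribution: 1) Introduces the reasoning video editing task and a first model for it, RIVER; 2) uses digital twin representations to decouple reasoning from generation; 3) designs a reinforcement learning training framework; 4) releases RVEBenchmark.
Method: RIVER uses a digital twin to preserve the spatial, temporal, and semantic information of the video, applies a large language model for multi-hop reasoning to produce editing instructions, and then uses a diffusion model for pixel-level changes. Training uses reinforcement learning to optimize reasoning and generation quality.
Result: RIVER performs best on the proposed RVEBenchmark and surpasses six baseline methods on the VegGIE and FiVE benchmarks, setting a new state of the art.
Insight: Digital twins and the decoupling of reasoning from generation offer a new approach to complex implicit-query tasks, and reinforcement learning further improves reasoning and generation.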
Abstract: Text-driven video editing enables users to modify video content only using text queries. While existing methods can modify video content if explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users attempt to conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model then processes this representation jointly with the implicit query, performing multi-hop reasoning to determine modifications, then outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity specifically for reasoning video editing. RIVER demonstrates best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.
[62] RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment
Zeyu Cheng,Tongfei Liu,Tao Lei,Xiang Hua,Yi Zhang,Chengkai Tang
Main category: cs.CV
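TL;DR: RTS-Mono is a real-time self-supervised monocular depth estimation method whose lightweight encoder-decoder architecture improves both computational efficiency and accuracy, targeting real-world deployment in autonomous driving and robot navigation.
Details
Motivation: Current self-supervised monocular depth estimation models consume substantial compute, while lightweight methods lose accuracy, which hinders real-world use.
Contribution: Proposes the lightweight and efficient RTS-Mono, which achieves SoTA performance on KITTI at both low and high resolution and reaches 49 FPS real-time inference in deployment.
Method: A lightweight encoder based on Lite-Encoder is paired with a multi-scale sparse fusion decoder to reduce redundancy and speed up inference.
Result: At low resolution, Abs Rel and Sq Rel improve by 5.6% and 9.8%; at high resolution, Sq Rel and RMSE improve by 6.1% and 1.9%.
Insight: The multi-scale sparse fusion design balances accuracy and efficiency, showing that lightweight models are viable for real-world deployment.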
Abstract: Depth information is crucial for autonomous driving and intelligent robot navigation. The simplicity and flexibility of self-supervised monocular depth estimation are conducive to its role in these fields. However, most existing monocular depth estimation models consume many computing resources. Although some methods have reduced the model's size and improved computing efficiency, their performance deteriorates, seriously hindering the real-world deployment of self-supervised monocular depth estimation models. To address this problem, we propose a real-time self-supervised monocular depth estimation method and deploy it in the real world. The method, called RTS-Mono, is a lightweight and efficient encoder-decoder architecture. The encoder is based on Lite-Encoder, and the decoder is designed with a multi-scale sparse fusion framework to minimize redundancy, ensure performance, and improve inference speed. RTS-Mono achieved state-of-the-art (SoTA) performance in high and low resolutions with extremely low parameter counts (3 M) in experiments based on the KITTI dataset. Compared with lightweight methods, RTS-Mono improved Abs Rel and Sq Rel by 5.6% and 9.8% at low resolution and improved Sq Rel and RMSE by 6.1% and 1.9% at high resolution. In real-world deployment experiments, RTS-Mono has extremely high accuracy and can perform real-time inference on Nvidia Jetson Orin at a speed of 49 FPS. Source code is available at https://github.com/ZYCheng777/RTS-Mono.
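Abs Rel, Sq Rel, and RMSE are the usual depth-estimation error metrics, where lower is better. The sketch below uses their standard definitions over paired predicted and ground-truth depths; masking of invalid pixels and depth capping, which evaluation protocols typically apply, are left out, and the function name is illustrative.

```python
import math


def depth_errors(pred, gt):
    """Return (abs_rel, sq_rel, rmse) for paired depth values with gt > 0."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n   # mean |d - d*| / d*
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n  # mean (d - d*)^2 / d*
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    return abs_rel, sq_rel, rmse


# Two pixels: one over-estimated by 1 m at 1 m depth, one exact at 4 m.
abs_rel, sq_rel, rmse = depth_errors(pred=[2.0, 4.0], gt=[1.0, 4.0])
```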
[63] $A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors
Zhenyu Li,Tianyi Shang
Main category: cs.CV
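TL;DR: The paper proposes $A^2$GC-VPR, an asymmetric aggregation method that adds geometric constraints to locally aggregated descriptors to improve visual place recognition (VPR).
Details
Motivation: Existing optimal-transport aggregation methods treat the two marginals symmetrically, which limits their effectiveness when image features and cluster centers follow different distributions. An asymmetric aggregation method is needed to handle this discrepancy.
Contribution: 1) The asymmetric aggregation method $A^2$GC-VPR, which achieves adaptive matching through separate marginal calibration; 2) learnable coordinate embeddings that add geometric constraints and improve spatial awareness.
Method: Row-column normalization averaging provides asymmetric matching, and geometric constraints (coordinate embeddings) yield compatibility scores that are fused with feature similarities.
Result: Superior performance on the MSLS, NordLand, and Pittsburgh datasets, validating the method's matching accuracy and robustness.
Insight: Combining asymmetric aggregation with geometric constraints adapts better to distributional differences and improves spatial awareness, which matters especially for visual place recognition.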
Abstract: Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.
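For context, the standard Sinkhorn iteration that this abstract contrasts against alternates a row rescaling and a column rescaling of a positive score matrix until both marginals are met. The sketch below shows only that baseline; it is not the $A^2$GC row-column normalization averaging, whose details are not given here, and the matrix values and marginals are made up for illustration.

```python
def sinkhorn(K, r, c, n_iters=200):
    """Scale a positive matrix K so its rows sum to r and its columns sum to c."""
    n, m = len(K), len(K[0])
    u = [1.0] * n  # row scaling factors
    v = [1.0] * m  # column scaling factors
    for _ in range(n_iters):
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy feature-to-cluster affinities: 2 features (rows), 3 clusters (columns).
K = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5]]
P = sinkhorn(K, r=[0.5, 0.5], c=[1 / 3, 1 / 3, 1 / 3])
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(2)) for j in range(3)]
```

Both marginals are enforced by the same alternating rescaling, which is the symmetric treatment the abstract identifies as limiting when features and cluster centers are distributed differently.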
[64] CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer
Srivathsan Sivakumar,Faisal Z. Qureshi
Main category: cs.CV
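TL;DR: The paper proposes CascadedViT (CViT), a lightweight and compute-efficient vision transformer architecture. Its Cascaded-Chunk Feed Forward Network (CCFFN) and Cascaded Group Attention design reduce computation and energy use while delivering strong results on ImageNet-1K.
Details
Motivation: Vision Transformers (ViTs) perform well across computer vision tasks, but their compute, memory, and energy demands limit deployment on resource-constrained devices. The authors propose a more efficient ViT architecture to address this.
Contribution: Proposes CascadedViT (CViT), built on CCFFN and Cascaded Group Attention, which improves parameter and FLOP efficiency. Across model sizes, the CViT family shows the lowest energy consumption, making it suitable for mobile deployment.
Method: CViT processes input features in chunks with CCFFN and combines this with a cascaded attention design, lowering computational cost while maintaining accuracy.
Result: On ImageNet-1K, CViT-XL reaches 75.5% Top-1 accuracy with 15% fewer FLOPs and 3.3% lower energy consumption than EfficientViT-M5. CViT-L is 2.2% more accurate than EfficientViT-M2 at a comparable APF score.
Insight: The CViT design suggests that chunking and cascading can substantially improve transformer compute efficiency, pointing to a direction for lightweight ViTs.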
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose Cascaded-ViT (CViT), a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called Cascaded-Chunk Feed Forward Network (CCFFN). By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called Accuracy-Per-FLOP (APF), which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2% more accurate than EfficientViT-M2 while having comparable APF scores.
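A rough count shows why splitting input features can save parameters in a feed-forward block. The sketch assumes a plain two-layer FFN with an expansion ratio of 4, ignores biases, and treats each chunk as having its own independent FFN; the actual CCFFN cascades its chunks and may differ in all of these respects, so the numbers are illustrative only.

```python
def ffn_weights(dim, expansion=4):
    """Weights in a two-layer FFN: dim -> expansion*dim -> dim (biases ignored)."""
    return 2 * expansion * dim * dim


def chunked_ffn_weights(dim, chunks, expansion=4):
    """Weights when the features are split into equal chunks, one small FFN per chunk."""
    assert dim % chunks == 0, "dim must be divisible by the number of chunks"
    return chunks * ffn_weights(dim // chunks, expansion)


full = ffn_weights(64)                     # single FFN over all 64 channels
halved = chunked_ffn_weights(64, chunks=2)  # two FFNs over 32 channels each
```

Because the weight count is quadratic in the channel width, splitting into k equal chunks divides it by k under these assumptions.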
[65] Coffee: Controllable Diffusion Fine-tuning
Ziyao Zeng,Jingcheng Ni,Ruyi Liu,Alex Wong
Main category: cs.CV
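TL;DR: Coffee is a controllable fine-tuning method for diffusion models that uses language descriptions to regularize fine-tuning and keep the model from learning undesired concepts.
Details
Motivation: During fine-tuning, diffusion models tend to pick up undesired concepts and entangle them with user prompts, which is a problem for downstream tasks such as bias mitigation and preventing malicious adaptation.
Contribution: Coffee specifies undesired concepts through language so that they are not learned during fine-tuning, and these concepts can be changed flexibly without additional training.
Method: By keeping the user prompt embeddings from aligning with undesired concepts, Coffee separates concepts during fine-tuning; control is adjusted simply by editing the text descriptions.
Result: Experiments show that Coffee prevents diffusion models from learning the specified undesired concepts during fine-tuning and outperforms existing methods.
Insight: Language-based control provides a flexible and efficient way to regularize diffusion models and a new direction for controllable generation.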
Abstract: Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. It is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee, which uses language to specify undesired concepts and thereby regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.
[66] Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models
Hao Zhen,Yunxiang Yang,Jidong J. Yang
Main category: cs.CV
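TL;DR: MP-PVIR is a multi-view, phase-aware framework for reasoning about pedestrian-vehicle incidents. It combines vision-language models with behavioral theory to analyze automatically how an incident unfolds and to propose prevention strategies.
Details
Motivation: Current pedestrian-vehicle incident analysis systems lack a detailed understanding of behavioral phases and cannot integrate multiple views, so their diagnosis of an event is incomplete.
Contribution: Proposes the MP-PVIR framework, which consists of four parts (multi-view video acquisition, behavior phase segmentation, phase-specific multi-view reasoning, and diagnostic report generation), and designs two specialized VLMs for efficient analysis.
Method: (1) event-triggered multi-view video acquisition; (2) pedestrian behavior phase segmentation; (3) phase-specific multi-view reasoning; (4) hierarchical synthesis and diagnostic reasoning. Two models are used: TG-VLM and PhaVR-VLM.
Result: TG-VLM reaches an mIoU of 0.4881 on behavioral phase segmentation; PhaVR-VLM achieves a captioning score of 33.063 and up to 64.70% question-answering accuracy.
Insight: By bringing in behavioral theory and multi-view analysis, MP-PVIR turns video data into actionable prevention strategies and advances the use of AI in traffic safety.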
Abstract: Pedestrian-vehicle incidents remain a critical urban safety challenge, with pedestrians accounting for over 20% of global traffic fatalities. Although existing video-based systems can detect when incidents occur, they provide little insight into how these events unfold across the distinct cognitive phases of pedestrian behavior. Recent vision-language models (VLMs) have shown strong potential for video understanding, but they remain limited in that they typically process videos in isolation, without explicit temporal structuring or multi-view integration. This paper introduces Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a unified framework that systematically processes multi-view video streams into structured diagnostic reports through four stages: (1) event-triggered multi-view video acquisition, (2) pedestrian behavior phase segmentation, (3) phase-specific multi-view reasoning, and (4) hierarchical synthesis and diagnostic reasoning. The framework operationalizes behavioral theory by automatically segmenting incidents into cognitive phases, performing synchronized multi-view analysis within each phase, and synthesizing results into causal chains with targeted prevention strategies. Particularly, two specialized VLMs underpin the MP-PVIR pipeline: TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis, achieving a captioning score of 33.063 and up to 64.70% accuracy on question answering. Finally, a designated large language model is used to generate comprehensive reports detailing scene understanding, behavior interpretation, causal reasoning, and prevention recommendations. Evaluation on the Woven Traffic Safety dataset shows that MP-PVIR effectively translates multi-view video data into actionable insights, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
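To make the reported phase-segmentation mIoU concrete, the sketch below computes a mean temporal IoU over predicted and ground-truth phase intervals. It assumes one predicted `(start, end)` interval per phase, matched by position; the digest does not state the paper's exact evaluation protocol, so treat this as an illustration only.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average IoU over phases, pairing intervals by position (assumption)."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# Two phases: the predicted boundary is one second late.
preds = [(0.0, 2.0), (2.0, 5.0)]
gts = [(0.0, 1.0), (1.0, 5.0)]
score = mean_iou(preds, gts)
```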
[67] Attention Via Convolutional Nearest Neighbors
Mingi Kang,Jeová Farias Sales Rocha Neto
Main category: cs.CV
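TL;DR: The paper proposes Convolutional Nearest Neighbors (ConvNN), a framework that unifies convolution and self-attention by showing that both are special cases of neighbor selection and aggregation, differing in whether neighbors are chosen by spatial proximity or by feature similarity.
Details
Motivation: Convolutional neural networks (CNNs) and Transformers are traditionally viewed as fundamentally different architectures. The paper aims to reveal the deeper connection between them and to explore intermediate designs within a unified framework.
Contribution: Proposes the ConvNN framework, shows that convolution and self-attention can be unified as different forms of k-nearest-neighbor aggregation, and validates its benefits experimentally on classification tasks.
Method: ConvNN unifies convolution (spatial proximity) and attention (feature similarity) through neighbor selection and aggregation, and supports hybrid designs that mix the two. Experiments are run on VGG and ViT architectures.
Result: On CIFAR-10 and CIFAR-100, combining the two selection modes in VGG improves accuracy, and ConvNN in ViT outperforms standard attention and other attention variants.
Insight: Convolution and attention are not opposites but the two ends of a continuous spectrum. Balancing local and global receptive fields has a regularizing effect, which points toward more principled vision architectures.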
Abstract: The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation; convolution selects neighbors by spatial proximity, while attention selects by feature similarity, revealing they exist on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework’s coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) Hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on $k$ values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
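The neighbor-selection view described above can be sketched on a 1-D sequence of scalar features. This is a simplified illustration under assumptions: it uses hard top-k selection with uniform averaging in place of any learned weights, and `knn_aggregate` and its `mode` argument are hypothetical names, not the paper's API.

```python
def knn_aggregate(features, k, mode):
    """For each token, average the features of its k selected neighbors.
    mode="spatial": neighbors closest in position (convolution-like).
    mode="feature": neighbors closest in feature value (attention-like)."""
    n = len(features)
    out = []
    for i in range(n):
        if mode == "spatial":
            key = lambda j: abs(j - i)
        else:
            key = lambda j: abs(features[j] - features[i])
        # sorted() is stable, so ties keep their original index order.
        neighbors = sorted(range(n), key=key)[:k]
        out.append(sum(features[j] for j in neighbors) / k)
    return out

x = [0.0, 10.0, 0.0, 10.0, 0.0]
spatial = knn_aggregate(x, 3, "spatial")
feature = knn_aggregate(x, 3, "feature")
```

For the middle token, spatial selection mixes in the two adjacent high values, while feature selection averages only the tokens that look like it; swapping the selection rule is the only difference between the two modes.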
[68] SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
An Yu,Weiheng Lu,Jian Li,Zhenfei Zhang,Yunhang Shen,Felix X. -F. Ye,Ming-Ching Chang
Main category: cs.CV
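TL;DR: SMART is an MLLM-based framework for video moment retrieval that strengthens multimodal representations by combining audio and visual features and exploiting shot-level temporal structure. It significantly outperforms existing methods on Charades-STA and QVHighlights.
Details
Motivation: Existing video moment retrieval methods rely mainly on coarse temporal understanding and a single visual modality, which makes complex videos hard to handle. SMART addresses this by introducing audio cues and shot-level temporal structure.
Contribution: (1) proposes the SMART framework, which integrates audio cues and shot-level structure; (2) designs Shot-aware Token Compression to reduce redundancy while preserving detail; (3) refines prompt design to make better use of audio-visual cues.
Method: (1) combine audio and visual features; (2) apply Shot-aware Token Compression to selectively retain high-information tokens; (3) refine prompt design to strengthen multimodal representations.
Result: SMART significantly outperforms existing methods on Charades-STA and QVHighlights, including gains of 1.61% in R1@0.5 and 2.59% in R1@0.7 on Charades-STA.
Insight: Audio cues and shot-level temporal structure are essential for video moment retrieval and effectively improve a model's understanding of complex videos.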
Abstract: Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and 2.59% gain in R1@0.7 on Charades-STA.
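Retaining high-information tokens within each shot can be sketched as a per-shot top-k selection. The scoring function is not described in the digest, so the per-token `scores`, the `shot_bounds` format, and the fixed `keep_per_shot` budget below are assumptions made for illustration.

```python
def compress_tokens(scores, shot_bounds, keep_per_shot):
    """Return indices of the highest-scoring tokens inside each shot.
    scores: per-token information scores (hypothetical).
    shot_bounds: list of (start, end) token index ranges, end exclusive."""
    kept = []
    for start, end in shot_bounds:
        ranked = sorted(range(start, end), key=lambda t: scores[t], reverse=True)
        kept.extend(sorted(ranked[:keep_per_shot]))  # restore temporal order
    return kept

scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
kept = compress_tokens(scores, [(0, 3), (3, 6)], keep_per_shot=2)
```

Selecting per shot rather than globally guarantees that every shot keeps some tokens, which is one plausible way to preserve temporal coverage.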
[69] iGaussian: Real-Time Camera Pose Estimation via Feed-Forward 3D Gaussian Splatting Inversion
Hao Wang,Linqing Zhao,Xiuwei Xu,Jiwen Lu,Haibin Yan
Main category: cs.CV
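TL;DR: The paper proposes iGaussian, a two-stage feed-forward framework for real-time camera pose estimation that avoids the computational overhead of traditional iterative methods and achieves a 10x speedup.
Details
Motivation: Traditional methods estimate camera pose through an iterative render-compare-refine loop, which is computationally heavy and hard to run in real time, limiting performance in robotics applications in particular.
Contribution: Proposes a two-stage feed-forward framework that needs no differentiable rendering; a Gaussian scene prior regression network and multi-view feature fusion substantially improve speed and accuracy.
Method: The method has a coarse stage and a refinement stage: the coarse stage uses a Gaussian scene prior-based regression network, and the refinement stage optimizes the pose through feature matching and multi-view fusion.
Result: On NeRF Synthetic and other datasets, the median rotation error drops to 0.2° at 2.87 FPS, 10 times faster than optimization-based methods.
Insight: A cross-correlation module aligns image embeddings directly with 3D Gaussian attributes, which avoids rendering overhead and suggests a new route to real-time pose estimation.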
Abstract: Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10 times speedup compared to optimization-based approaches. Code: https://github.com/pythongod-exe/iGaussian
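A rotation error such as the reported 0.2° is commonly measured as the geodesic angle between the estimated and ground-truth rotation matrices. The sketch below shows that standard formula; whether the paper uses exactly this definition is an assumption, and `rot_z` is a helper introduced only to build test inputs.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (test helper)."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

err = rotation_error_deg(rot_z(10.0), rot_z(10.2))
```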
[70] Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
Laura Dodds,Maisy Lam,Waleed Akbar,Yibo Cheng,Fadel Adib
Main category: cs.CV
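TL;DR: Wave-Former is a new method that uses millimeter-wave wireless signals to reconstruct fully occluded everyday objects in 3D with high accuracy. It combines a physics-aware shape completion model with a three-stage pipeline and substantially improves reconstruction recall.
Details
Motivation: Existing mmWave reconstruction methods are limited by coverage and high noise and cannot effectively reconstruct fully occluded objects. Wave-Former aims to solve this problem and enable applications in robotics, augmented reality, and logistics.
Contribution: Proposes a physics-aware shape completion model and a three-stage pipeline that bridges raw wireless signals with vision-based shape completion, substantially improving 3D reconstruction of occluded objects.
Method: The pipeline has three stages: proposing candidate geometric surfaces, applying a transformer-based shape completion model, and performing entropy-guided surface selection. The method is trained entirely on synthetic point clouds and generalizes well to real-world data.
Result: Compared with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a precision of 85%.
Insight: By combining the physical properties of mmWave signals with vision-based shape completion, Wave-Former shows that high-accuracy 3D reconstruction from wireless signals is possible even under complete occlusion.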
Abstract: We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former’s design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.
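Precision and recall for reconstructed point clouds are often defined with a distance threshold: precision is the share of predicted points near the true surface, and recall is the share of true points near the prediction. The sketch below uses that convention as an assumption; the threshold `tau` and the toy points are hypothetical, and the digest does not give the paper's exact definition.

```python
import math

def coverage(src, dst, tau):
    """Fraction of points in src lying within distance tau of some point in dst."""
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tau)
    return hits / len(src)

pred = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]            # one spurious point
gt = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]   # half the surface is missed
precision = coverage(pred, gt, tau=0.1)  # predicted points that are correct
recall = coverage(gt, pred, tau=0.1)     # true surface that was recovered
```

Under this reading, raising recall at fixed precision means recovering more of the hidden surface without adding spurious geometry.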
[71] Learning Representation and Synergy Invariances: A Provable Framework for Generalized Multimodal Face Anti-Spoofing
Xun Lin,Shuai Wang,Yi Yu,Zitong Yu,Jiale Zhou,Yizhong Liu,Xiaochun Cao,Alex Kot,Yefeng Zheng
Main category: cs.CV
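TL;DR: The paper proposes RiSe, a provable framework that improves cross-domain generalization in multimodal face anti-spoofing by addressing two risks: representation invariance and modal synergy invariance.
Details
Motivation: Multimodal face anti-spoofing methods degrade severely in unseen domains, mainly because the modal representation invariant risk and the modal synergy invariant risk have been overlooked. The paper aims to address both risks and improve cross-domain generalization.
Contribution: (1) proposes the RiSe framework to handle representation and synergy invariance in multimodal face anti-spoofing; (2) designs AsyIRM, which learns an invariant spherical decision boundary in radial space; (3) proposes the MMSD task, which uses self-supervised learning to make modal features more generalizable.
Method: (1) AsyIRM learns an invariant spherical decision boundary in radial space to fit asymmetric distributions; (2) MMSD enhances modal features through cross-sample mixing and disentanglement.
Result: RiSe achieves state-of-the-art cross-domain performance on multimodal face anti-spoofing.
Insight: (1) modal representation and synergy invariance are critical to cross-domain generalization; (2) asymmetric classification problems call for a dedicated design; (3) self-supervised learning effectively improves the generalization of modal features.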
Abstract: Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
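A spherical decision boundary in radial space can be pictured as thresholding the feature norm. The sketch below assumes that compact real samples sit inside the sphere and diverse spoofs fall outside it; that direction, the fixed `radius`, and the function name are assumptions for illustration, and AsyIRM itself learns the boundary rather than fixing it.

```python
import math

def radial_predict(feature, radius):
    """Label a sample 'real' if its feature norm lies inside the sphere."""
    norm = math.sqrt(sum(x * x for x in feature))
    return "real" if norm <= radius else "spoof"

# Feature norms are 0.5, 5.0, and 2.0 respectively.
samples = ([0.3, 0.4], [3.0, 4.0], [-2.0, 0.0])
labels = [radial_predict(f, radius=1.0) for f in samples]
```

Because only the norm is used, the angular direction of a feature stays free to carry other information, which matches the abstract's remark about preserving domain cues in angular space.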
[72] MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Huiyi Chen,Jiawei Peng,Dehai Min,Changchang Sun,Kaijie Chen,Yan Yan,Xu Yang,Lu Cheng
Main category: cs.CV
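TL;DR: MVI-Bench is a benchmark designed to evaluate the robustness of large vision-language models (LVLMs) to misleading visual inputs, filling the gap left by existing evaluations that focus on misleading text and overlook misleading visuals.
Details
Motivation: Existing robustness evaluations focus mainly on hallucination or misleading textual inputs and neglect the effect of misleading visual inputs on visual understanding, so a more comprehensive evaluation tool is needed.
Contribution: (1) proposes MVI-Bench, the first benchmark focused on misleading visual inputs; (2) designs a hierarchical taxonomy of misleading visual inputs (concept, attribute, relationship); (3) introduces the MVI-Sensitivity metric for fine-grained robustness evaluation.
Method: Builds a three-level taxonomy of misleading inputs (visual concept, attribute, relationship) grounded in visual primitives, constructs 1,248 annotated VQA instances, and proposes the MVI-Sensitivity metric.
Result: An evaluation of 18 state-of-the-art LVLMs shows pronounced vulnerability to misleading visual inputs, and the analysis yields actionable insights for improving model robustness.
Insight: Misleading visual inputs are a major robustness challenge for LVLMs, and MVI-Bench provides a systematic evaluation framework for future model development.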
Abstract: Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.
[73] O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
Rishi Gupta,Mukilan Karuppasamy,Shyam Marjit,Aditay Tripathi,Anirban Chakraborty
Main category: cs.CV
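TL;DR: O3SLM is an open-weight, open-data, open-vocabulary sketch-language model that addresses the limitations of current large vision-language models (LVLMs) on abstract sketch inputs. Its main contributions are a new large-scale dataset and an LVLM trained on it, which substantially improves sketch understanding.
Details
Motivation: Current LVLMs are limited in understanding abstract visual inputs such as hand-drawn sketches, mainly because no large-scale dataset jointly models sketches, photorealistic images, and natural language instructions.
Contribution: (1) a new large-scale dataset of image-sketch-instruction triplets; (2) the O3SLM model trained on this dataset, which achieves the best performance on multiple sketch tasks.
Method: Designs a new large-scale dataset and trains an LVLM (O3SLM) on it, then runs a multi-task evaluation that combines existing sketch datasets (QuickDraw!, Sketchy, Tu Berlin) with the generated SketchVCL dataset.
Result: O3SLM substantially outperforms existing LVLMs on object localization, counting, image retrieval (SBIR and fine-grained SBIR), and visual question answering (VQA), achieving state-of-the-art performance.
Insight: A dataset that jointly models sketches and language is the key to improving the abstract visual understanding of LVLMs.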
Abstract: While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA), which incorporate the three existing sketch datasets (QuickDraw!, Sketchy, and Tu Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
[74] AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs
Xinliang Zhang,Lei Zhu,Hangzhou He,Shuang Zeng,Ourui Fu,Jiakui Hu,Zhengjian Yao,Yanye Lu
Main category: cs.CV
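TL;DR: The paper proposes AdaTok, an adaptive token compression method that uses an object-level token merging strategy to reduce computational redundancy in multimodal large language models (MLLMs) while maintaining high performance.
Details
Motivation: Traditional patch-based tokenization makes the number of image tokens grow quadratically, which increases the computation and memory burden; it is also misaligned with the human visual cognition system, leading to hallucination and redundancy.
Contribution: Proposes AdaTok, an object-level token merging strategy that sharply reduces the token count (only 10% of the tokens are needed) while retaining 96% of the performance, balancing compression ratio and performance.
Method: Applies adaptive token compression with object-aware representations to merge tokens efficiently in multimodal LLMs.
Result: Across multiple benchmarks, AdaTok reaches 96% of the original model's performance with only 10% of the tokens and outperforms related methods.
Insight: Object-level token merging is closer to human visual cognition and effectively reduces computational redundancy while improving model efficiency.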
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural paradigm. However, patch-level tokenization leads to a quadratic growth in image tokens, burdening MLLMs’ understanding and reasoning with enormous computation and memory. Additionally, the traditional patch-wise scanning tokenization workflow misaligns with the human vision cognition system, further leading to hallucination and computational redundancy. To address this issue, we propose an object-level token merging strategy for Adaptive Token compression that is consistent with the human vision system. Experiments on multiple comprehensive benchmarks show that our approach uses, on average, only 10% of the tokens while achieving almost 96% of the vanilla model’s performance. More extensive experimental results in comparison with relevant works demonstrate the superiority of our method in balancing compression ratio and performance. Our code will be available.
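Object-level token merging can be illustrated by pooling all patch tokens that belong to the same object into a single token. The sketch assumes each patch already carries an object id (for example from a segmentation mask) and uses mean pooling; both are assumptions, since the digest does not describe how AdaTok assigns or merges tokens.

```python
def merge_tokens(patch_tokens, object_ids):
    """Average all patch tokens that share an object id into one token."""
    groups = {}
    for token, obj in zip(patch_tokens, object_ids):
        groups.setdefault(obj, []).append(token)
    # zip(*toks) iterates over feature dimensions across the grouped tokens.
    return {obj: [sum(col) / len(toks) for col in zip(*toks)]
            for obj, toks in groups.items()}

patches = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
merged = merge_tokens(patches, object_ids=["cat", "cat", "sky", "sky"])
```

Four patch tokens collapse to two object tokens here, so the token count scales with the number of objects rather than with image resolution.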
[75] UniSER: A Foundation Model for Unified Soft Effects Removal
Jingdong Zhang,Lingzhi Zhang,Qing Liu,Mang Tik Chiu,Connelly Barnes,Yizhou Wang,Haoran You,Xiaoyang Liu,Yuqian Zhou,Zhe Lin,Eli Shechtman,Sohrab Amirghodsi,Xin Li,Wenping Wang,Xiaohang Zhan
Main category: cs.CV
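TL;DR: UniSER is a foundation model that handles multiple soft-effect degradations (such as lens flare, haze, shadows, and reflections) in a unified way. With a large-scale dataset and a tailored training pipeline, it outperforms both specialist and generalist models.
Details
Motivation: Existing methods usually design a dedicated model for each soft-effect degradation and therefore lack generality and scalability, while generalist models fall short on task detail and scene fidelity.
Contribution: Proposes UniSER, a unified foundation model that exploits the common essence of soft effects (semi-transparent occlusion) to handle multiple degradations within a single framework.
Method: Builds a large-scale dataset of 3.8M pairs to fill gaps in public datasets, and fine-tunes a Diffusion Transformer to learn robust restoration priors, integrating fine-grained mask and strength controls.
Result: UniSER significantly outperforms specialist and generalist models across multiple soft-effect tasks and achieves high-fidelity restoration in the wild.
Insight: The shared nature of soft effects makes unified modeling possible; large-scale, diverse data and a tailored training pipeline are the keys to better generalization.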
Abstract: Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized models that lack scalability and fail to exploit the shared underlying essence of these restoration problems. While specialist models are limited, recent large-scale pretrained generalist models (e.g., GPT-4o, Flux Kontext, Nano Banana) offer powerful, text-driven image editing capabilities, but they require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or to preserve the identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.
[76] GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation
Xuan Zhao,Zhongyu Zhang,Yuge Huang,Yuxi Mi,Guodong Mu,Shouhong Ding,Jun Wang,Rizen Guo,Shuigeng Zhou
Main category: cs.CV
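TL;DR: GloTok uses global relational information to model a more uniform semantic distribution. It proposes a semantic transfer method based on histogram relation learning and a residual learning module, which together substantially improve image reconstruction and generation quality.
Details
Motivation: Existing image tokenization methods rely on locally supervised semantic features, which makes the semantic distribution non-uniform, whereas VA-VAE shows that a more uniform distribution benefits generation. GloTok aims to solve this from a global perspective.
Contribution: (1) proposes GloTok, a global-perspective tokenizer; (2) designs a codebook-wise histogram relation learning method and a residual learning module; (3) shows experimentally that it achieves state-of-the-art results on ImageNet-1k.
Method: (1) model the semantic distribution with global relational information from pre-trained models; (2) transfer the semantics to the codebook through histogram relation learning; (3) add a residual module to recover the detail lost to quantization.
Result: On the ImageNet-1k benchmark, GloTok reaches state-of-the-art performance in both image reconstruction and generation, without requiring direct access to pre-trained models during training.
Insight: Global semantic relation modeling and detail recovery are the keys to better image generation quality, and uniformly distributed feature representations make autoregressive models easier to train.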
Abstract: Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
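The reconstruction error that a residual module has to recover is simply the gap between a latent vector and its nearest codebook entry. The sketch below shows only that quantity for a toy codebook; it is not GloTok's learned residual module, and the codebook values and function name are hypothetical.

```python
def quantize(z, codebook):
    """Return the nearest codebook vector and the residual z - code."""
    code = min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))
    residual = [a - b for a, b in zip(z, code)]
    return code, residual

codebook = [[0.0, 0.0], [1.0, 1.0]]
code, residual = quantize([0.9, 1.2], codebook)

# Adding the residual back to the code recovers the original latent.
restored = [c + r for c, r in zip(code, residual)]
```

A learned residual branch would predict an approximation of `residual` from the available features, which is why it can restore fine detail that the discrete code alone drops.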
[77] Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation
Zhaoyu Liu,Kan Jiang,Murong Ma,Zhe Hou,Yun Lin,Jin Song Dong
Main category: cs.CV
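TL;DR: The paper proposes a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot precise event spotting (PES). UMEG-Net integrates human skeletons and sport-specific object keypoints, extracts spatio-temporal features with an advanced graph convolutional network and multi-scale temporal shift, and improves performance through multimodal distillation.
Details
Motivation: Precise event spotting (PES) is essential in sports analytics, but rapid successive actions, motion blur, and subtle visual differences make it hard. Traditional methods depend on large labeled datasets and on pixel- or pose-based inputs, so they perform poorly in few-shot conditions.
Contribution: (1) UMEG-Net unifies human skeletons and sport-specific object keypoints in one graph representation; (2) an efficient spatio-temporal feature extraction module; (3) multimodal distillation that further improves few-shot performance.
Method: UMEG-Net extracts spatio-temporal features with a graph convolutional network (GCN) and multi-scale temporal shift, and uses multimodal distillation to transfer knowledge from keypoint graphs to visual representations.
Result: Experiments show that UMEG-Net outperforms baseline models in few-shot conditions and offers a scalable solution.
Insight: Integrating multi-entity information with knowledge distillation can substantially improve few-shot precise event spotting, especially in practical settings where labeled data is scarce.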
Abstract: Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
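A temporal shift mixes information across frames by moving some feature channels forward or backward in time at no extra parameter cost. The sketch below shows a generic shift of this kind; reading "multi-scale" as varying the `step` size is an assumption, and the function is not taken from the UMEG-Net code.

```python
def temporal_shift(frames, fold, step=1):
    """frames: list over time of per-frame channel lists.
    The first `fold` channels take values from `step` frames earlier,
    the next `fold` channels from `step` frames later, the rest stay put.
    Positions with no source frame are zero-padded."""
    T, C = len(frames), len(frames[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:
                src = t - step
            elif c < 2 * fold:
                src = t + step
            else:
                src = t
            if 0 <= src < T:
                out[t][c] = frames[src][c]
    return out

frames = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
shifted = temporal_shift(frames, fold=1)
```

After the shift, each frame's feature vector holds one channel from the past, one from the future, and one from the present, so a per-frame layer applied next sees a short temporal window.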
[78] Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
Zitang Sun,Masakazu Yoshimura,Junji Otsuka,Atsushi Irie,Takeshi Ohashi
Main category: cs.CV
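TL;DR: DetGain is an online data curation method for object detection that selects training samples by estimating each image's marginal contribution to dataset-level AP, improving both efficiency and model performance.
Details
Motivation: Under scaling laws, high-quality data is a key driver of performance, but existing online sampling strategies are hard to apply directly to object detection because of its structural complexity and domain gaps.
Contribution: Proposes DetGain, which extends online data curation to object detection for the first time and selects samples efficiently by modeling global score distributions and teacher-student contribution gaps.
Method: DetGain estimates each image's marginal perturbation to dataset-level AP from its prediction quality, and uses a teacher-student framework to measure how informative a sample is, selecting training samples dynamically.
Result: Experiments on COCO show that DetGain consistently improves the accuracy of several representative detectors and is highly robust under low-quality data and in knowledge distillation settings.
Insight: DetGain shows the potential of online data curation for object detection and offers a general, complementary strategy for using data efficiently.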
Abstract: High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model’s evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
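The quantity being estimated, an image's marginal contribution to dataset-level AP, can be written down by brute force as a leave-one-out difference. The sketch below does exactly that with a simple non-interpolated AP; it is a reference computation under assumptions, not the DetGain estimator, which the abstract says avoids this cost by modeling global score distributions.

```python
def average_precision(dets, num_gt):
    """dets: list of (score, is_true_positive). Non-interpolated AP."""
    tp, ap = 0, 0.0
    for rank, (_, hit) in enumerate(sorted(dets, key=lambda d: -d[0]), start=1):
        if hit:
            tp += 1
            ap += tp / rank  # precision at each true-positive rank
    return ap / num_gt if num_gt else 0.0

def marginal_ap(per_image):
    """per_image: {image_id: (detections, num_gt)}. Leave-one-out AP change."""
    def pooled(skip=None):
        dets = [d for k, (ds, _) in per_image.items() if k != skip for d in ds]
        n = sum(g for k, (_, g) in per_image.items() if k != skip)
        return average_precision(dets, n)
    full = pooled()
    return {k: full - pooled(skip=k) for k in per_image}

# Image "a" has one confident hit; image "b" has a confident false positive.
data = {"a": ([(0.9, True)], 1), "b": ([(0.8, False), (0.3, True)], 1)}
gains = marginal_ap(data)
```

Image "a" raises the pooled AP and image "b" lowers it, which is the kind of per-image signal a curation method could rank on.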
[79] InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Weimin Bai,Suzhe Xu,Yiwei Ren,Jinhua Hao,Ming Sun,Wenzheng Chen,He Sun
Main category: cs.CV
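TL;DR: InstantViR is a real-time framework for solving video inverse problems with a pre-trained video diffusion prior. Teacher-student distillation enables single-forward-pass inference with no iterative optimization, which greatly increases speed.
Details
Motivation: Video inverse problems are essential to streaming, telepresence, and AR/VR, but existing methods are either slow or introduce temporal artifacts. InstantViR aims to solve these problems and deliver high-quality real-time reconstruction.
Contribution: (1) proposes InstantViR, which distills a bidirectional video diffusion model for efficient single-step inference; (2) replaces the VAE with an efficient LeanVAE to cut latency further; (3) needs no external paired data, relying only on the teacher model and known degradation operators.
Method: (1) distill the bidirectional teacher video diffusion model into a causal autoregressive student; (2) replace the VAE with LeanVAE, using teacher-space regularization to improve efficiency; (3) support low-latency processing in latent space.
Result: On random inpainting, Gaussian deblurring, and super-resolution, InstantViR matches or surpasses baseline methods while running at over 35 FPS on an A100 GPU.
Insight: Diffusion models can be adapted to real-time settings through distillation and efficient architecture design, giving a practical solution for high-quality video restoration.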
Abstract: Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher’s strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
[80] Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
N Dinesh Reddy,Sudeep Pillai
Main category: cs.CV
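TL;DR: Orion is a unified visual agent framework for multimodal perception, advanced visual reasoning, and execution; it achieves active visual intelligence by combining neural perception with symbolic execution.
Details
Motivation: Traditional vision-language models usually produce only descriptive outputs and cannot handle complex multi-step visual tasks. Orion aims to fill this gap with a tool-driven framework.
Contribution: Proposes Orion, a unified visual agent framework that can call a range of computer vision tools, moving from passive visual understanding to active visual intelligence.
Method: Orion combines neural perception with symbolic execution and calls specialized tools such as object detection, keypoint localization, and panoptic segmentation to complete complex multi-step visual tasks.
Result: Achieves competitive performance on benchmarks including MMMU, MMBench, DocVQA, and MMLongBench.
Insight: Through its tool-driven framework, Orion shows how visual intelligence can evolve from static understanding toward dynamic execution.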
Abstract: We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
[81] StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model
Yifan Yang,Zhi Cen,Sida Peng,Xiangwei Chen,Yifu Deng,Xinyu Zhu,Fan Jia,Xiaowei Zhou,Hujun Bao
Main category: cs.CV
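TL;DR: StreamingTalker is an audio-driven 3D facial animation method built on an autoregressive diffusion model. It processes audio input in a streaming manner, which addresses the performance and latency problems of earlier methods on long audio.
Details
Motivation: Existing audio-driven 3D facial animation methods perform poorly on long audio inputs and suffer from significant latency, so a method is needed that adapts to different audio lengths with low latency.
Contribution: Proposes a novel autoregressive diffusion model that combines streaming audio input with a dynamic historical motion context to generate high-quality 3D facial animation in real time.
Method: The method uses an autoregressive diffusion model: a limited number of past frames serve as historical motion context and are combined with the audio input to form a dynamic condition that guides the diffusion process to generate facial animation frame by frame.
Result: The method performs well on audio inputs of varying length, supports real-time interaction, and produces high-quality facial animation.
Insight: Streaming processing and dynamic conditioning are the keys to handling long audio inputs, and autoregressive diffusion models show promise for real-time 3D animation generation.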
Abstract: This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
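The streaming design, conditioning each step on the current audio plus a bounded window of past frames, can be sketched as a generator loop. The `generate_frame` callback stands in for the diffusion sampler and `toy_generator` is a purely hypothetical placeholder; only the control flow is meant to reflect the description above.

```python
from collections import deque

def stream_animation(audio_chunks, generate_frame, history_len=2):
    """Yield one motion frame per audio chunk, conditioning each step on
    the chunk plus a bounded window of previously generated frames."""
    history = deque(maxlen=history_len)  # old frames drop out automatically
    for chunk in audio_chunks:
        frame = generate_frame(chunk, list(history))
        history.append(frame)
        yield frame

def toy_generator(chunk, history):
    """Placeholder sampler: audio value plus the mean of the history."""
    return chunk + (sum(history) / len(history) if history else 0.0)

frames = list(stream_animation([1.0, 2.0, 3.0], toy_generator))
```

Because the history window has a fixed size, the per-frame cost does not grow with the total audio length, which is what makes latency independent of duration.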
[82] Breaking the Passive Learning Trap: An Active Perception Strategy for Human Motion Prediction
Juncheng Hu,Zijian Zhang,Zeyu Wang,Guoyu Wang,Yingji Li,Kedi Lyu
Main category: cs.CV
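TL;DR: This paper proposes an Active Perception Strategy (APS) for 3D human motion prediction that strengthens spatio-temporal modeling through quotient space representations and auxiliary learning objectives, substantially improving prediction performance.
Details
Motivation: Current methods rely excessively on the network's implicit modeling of spatio-temporal relationships and motion characteristics, which leads to redundant and monotonous acquisition of 3D coordinate information and lacks actively guided explicit learning mechanisms.
Contribution: 1. Proposes the Active Perception Strategy (APS), which uses the quotient space to explicitly encode motion properties; 2. Introduces auxiliary learning objectives to strengthen spatio-temporal modeling; 3. Designs a data perception module and a network perception module that achieve geometric dimension reduction and dynamic constraint enforcement.
Method: 1. The data perception module projects poses into the quotient space, decoupling motion geometry from coordinate redundancy; 2. The network perception module actively learns spatio-temporal dependencies through masking and noise injection; 3. An auxiliary learning network learns from the perturbed information.
Result: Substantial gains: the method outperforms existing methods by 16.3%, 13.9%, and 10.1% on H3.6M, CMU Mocap, and 3DPW respectively.
Insight: APS addresses the limitations of traditional passive learning through explicit modeling and auxiliary learning objectives, and can be integrated seamlessly into different prediction models.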
Abstract: Forecasting 3D human motion is an important embodiment of fine-grained understanding and cognition of human behavior by artificial agents. Current approaches excessively rely on implicit network modeling of spatiotemporal relationships and motion characteristics, falling into the passive learning trap that results in redundant and monotonous 3D coordinate information acquisition while lacking actively guided explicit learning mechanisms. To overcome these issues, we propose an Active Perceptual Strategy (APS) for human motion prediction, leveraging quotient space representations to explicitly encode motion properties while introducing auxiliary learning objectives to strengthen spatio-temporal modeling. Specifically, we first design a data perception module that projects poses into the quotient space, decoupling motion geometry from coordinate redundancy. By jointly encoding tangent vectors and Grassmann projections, this module simultaneously achieves geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement for effective motion pose characterization. Furthermore, we introduce a network perception module that actively learns spatio-temporal dependencies through restorative learning. This module deliberately masks specific joints or injects noise to construct auxiliary supervision signals. A dedicated auxiliary learning network is designed to actively adapt and learn from perturbed information. Notably, APS is model agnostic and can be integrated with different prediction models to enhance active perception. The experimental results demonstrate that our method achieves a new state of the art, outperforming existing methods by large margins: 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.
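The restorative-learning step (mask joints or inject noise, then learn to recover the clean pose) can be illustrated with a small sketch. Everything here is an assumption for illustration: the abstract does not give the masking ratio, the noise level, or the loss, so `perturb_pose`, `restoration_loss`, `mask_ratio`, and `noise_std` are hypothetical, and this sketch applies both perturbations at once for brevity.

```python
import random
from typing import List, Tuple

Pose = List[List[float]]  # joints x 3 coordinates


def perturb_pose(
    pose: Pose,
    mask_ratio: float = 0.25,
    noise_std: float = 0.01,
    seed: int = 0,
) -> Tuple[Pose, List[bool]]:
    """Zero out a random subset of joints and add Gaussian noise to the rest."""
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * len(pose)))
    masked_ids = set(rng.sample(range(len(pose)), num_masked))
    perturbed: Pose = []
    mask: List[bool] = []
    for j, joint in enumerate(pose):
        if j in masked_ids:
            perturbed.append([0.0, 0.0, 0.0])  # masked joint carries no information
            mask.append(True)
        else:
            perturbed.append([c + rng.gauss(0.0, noise_std) for c in joint])
            mask.append(False)
    return perturbed, mask


def restoration_loss(predicted: Pose, clean: Pose) -> float:
    """Mean squared error between a restored pose and the clean pose."""
    total, count = 0.0, 0
    for p_joint, c_joint in zip(predicted, clean):
        for p, c in zip(p_joint, c_joint):
            total += (p - c) ** 2
            count += 1
    return total / max(count, 1)


clean_pose = [[float(j), 0.5 * j, 1.0] for j in range(8)]
noisy_pose, joint_mask = perturb_pose(clean_pose)
```

An auxiliary network would take `noisy_pose` as input and be trained to minimize `restoration_loss` against `clean_pose`.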
[83] Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization
Yan Huang,Yongyi Su,Xin Lin,Le Zhang,Xun Xu
Main category: cs.CV
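TL;DR: WeSTAR is a parameter-efficient framework that improves the robustness of monocular depth estimation (MDE) foundation models in unseen domains through weakly supervised self-training adaptation with regularization.
Details
Motivation: Although foundation models such as the Depth Anything series generalize well zero-shot, whether their performance can be improved further when downstream task data is available remains an open question.
Contribution: 1. Proposes the WeSTAR framework, combining a self-training objective, semantically aware hierarchical normalization, and weakly supervised ordinal depth annotations; 2. Introduces a weight regularization loss to stabilize training and preserve the model's general knowledge.
Method: 1. Uses a dense self-training objective as structural self-supervision; 2. Performs multi-scale normalization using instance-level segmentation maps; 3. Exploits weakly supervised pairwise ordinal depth annotations; 4. Adds a weight regularization loss.
Result: WeSTAR significantly improves generalization in diverse and challenging scenarios and reaches state-of-the-art results on multiple benchmarks.
Insight: Combining several forms of supervision (self-supervision, weak supervision) with regularization can effectively improve a foundation model's adaptability to downstream tasks.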
Abstract: The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model’s generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
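Two ingredients named in this abstract, pairwise ordinal depth supervision and a weight regularizer that anchors the updates, are often written as a logistic ranking loss and an L2 penalty. The sketch below uses those common forms as assumptions; the paper's exact losses are not given in the abstract, and `ordinal_pair_loss` and `anchor_penalty` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordinal_pair_loss(depth: Sequence[float], pairs: List[Tuple[int, int, int]]) -> float:
    """Logistic ranking loss over annotated pairs.

    Each pair is (i, j, order), with order = +1 if point i is farther than
    point j and order = -1 if it is closer.
    """
    total = 0.0
    for i, j, order in pairs:
        margin = order * (depth[i] - depth[j])
        total += softplus(-margin)  # small when the predicted order is correct
    return total / max(len(pairs), 1)


def anchor_penalty(weights: Sequence[float], anchor: Sequence[float]) -> float:
    """Squared L2 distance between current weights and their anchor values."""
    return sum((w - a) ** 2 for w, a in zip(weights, anchor))


predicted_depth = [1.0, 3.0]
loss_correct = ordinal_pair_loss(predicted_depth, [(1, 0, +1)])  # annotation agrees
loss_wrong = ordinal_pair_loss(predicted_depth, [(1, 0, -1)])    # annotation disagrees
```

A pair annotation only says which of two points is farther, which is why this kind of supervision is cheap to collect compared with dense metric depth.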
[84] ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation
Zitong Xu,Huiyu Duan,Xiaoyu Wang,Zhaolin Cai,Kaiwei Zhang,Qiang Hu,Jing Liu,Xiongkuo Min,Guangtao Zhai
Main category: cs.CV
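TL;DR: This paper proposes the ManipBench benchmark and the ManipShield framework for detecting, localizing, and explaining AI-edited images, addressing the limited content diversity, narrow model coverage, and poor interpretability of existing benchmarks.
Details
Motivation: With the rapid development of generative models, AI-edited images far surpass traditional deepfake techniques in diversity and realism, while existing detection methods fall short in diversity, model coverage, and interpretability.
Contribution: 1. Proposes ManipBench, a large-scale benchmark with over 450K AI-edited images covering 25 models and 12 manipulation categories; 2. Designs ManipShield, a unified MLLM-based framework for detection, localization, and explanation.
Method: Built on a multimodal large language model (MLLM), using contrastive LoRA fine-tuning and task-specific decoders to detect, localize, and explain image manipulations.
Result: On ManipBench and several public datasets, ManipShield performs best and generalizes strongly to unseen manipulation models.
Insight: A unified benchmark and framework can markedly improve the detection and explanation of AI-edited images, and could be extended to more tasks and scenarios.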
Abstract: With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generalization to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
[85] Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
Weimin Bai,Yubo Li,Weijian Luo,Zeqiang Lai,Yequan Wang,Wenzheng Chen,He Sun
Main category: cs.CV
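TL;DR: The paper proposes VLM3D, a framework that uses large vision-language models (VLMs) as semantic and spatial critics to address semantic alignment and geometric consistency in text-to-3D generation.
Details
Motivation: Existing text-to-3D models perform poorly at fine-grained semantic alignment and 3D spatial relationship understanding, leading to geometric inconsistencies and part assembly failures.
Contribution: Proposes VLM3D, which uses the VLM's dual critic signal (semantic fidelity and geometric coherence) to improve 3D generation, applicable to both the optimization-based and the feed-forward paradigm.
Method: A dual-query critic signal derived from the VLM's Yes/No log-odds assesses semantic and geometric plausibility, and is used either as an optimization objective or as a test-time guidance module.
Result: VLM3D significantly outperforms existing methods on standard benchmarks and corrects severe spatial errors in native 3D models.
Insight: The VLM's language-grounded understanding offers a path to jointly optimizing semantics and space in 3D generation, and the signal generalizes across pipelines.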
Abstract: Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM’s Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM’s rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
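The Yes/No log-odds critic can be made concrete: under a softmax over the vocabulary, the log-odds of "Yes" against "No" reduces to the difference of the two logits, because the normalizer cancels. The sketch below shows this in plain Python. The weighted sum in `critic_score` and all names are assumptions for illustration; the abstract does not say how the two queries are combined, and a real pipeline would compute this inside an autograd framework to keep it differentiable.

```python
import math
from typing import Dict


def yes_no_log_odds(logits: Dict[str, float], yes: str = "Yes", no: str = "No") -> float:
    """log p(Yes) - log p(No); the softmax normalizer cancels out."""
    return logits[yes] - logits[no]


def softmax_log_odds(logits: Dict[str, float], yes: str = "Yes", no: str = "No") -> float:
    """Same quantity computed the long way, via explicit log-probabilities."""
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return (logits[yes] - log_z) - (logits[no] - log_z)


def critic_score(
    semantic_logits: Dict[str, float],
    geometric_logits: Dict[str, float],
    w_sem: float = 1.0,
    w_geo: float = 1.0,
) -> float:
    """Hypothetical combination of the two queries as a weighted sum."""
    return w_sem * yes_no_log_odds(semantic_logits) + w_geo * yes_no_log_odds(geometric_logits)


example_logits = {"Yes": 2.0, "No": -1.0, "Maybe": 0.5}
score = critic_score({"Yes": 1.0, "No": 0.0}, {"Yes": 0.0, "No": 2.0})
```

A positive score means the VLM leans toward "Yes" on balance; in this toy call the geometric query outweighs the semantic one and the score is negative.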
[86] NeuralSSD: A Neural Solver for Signed Distance Surface Reconstruction
Zi-Chen Xi,Jiahui Huang,Hao-Xiang Chen,Francis Williams,Qun-Ce Xu,Tai-Jiang Mu,Shi-Min Hu
Main category: cs.CV
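TL;DR: NeuralSSD is a solver based on the neural Galerkin method for reconstructing high-quality 3D implicit surfaces from point cloud data. A new energy equation and a convolutional network ensure that the reconstructed surface fits the input data closely and accurately.
Details
Motivation: Existing implicit methods lack an explicit mechanism for fitting the point cloud, which limits reconstruction accuracy. NeuralSSD aims to address this by optimizing an energy equation and introducing a new network architecture.
Contribution: 1. Proposes a new energy equation that balances the reliability of point cloud information; 2. Designs a new convolutional network that learns three-dimensional information to improve optimization; 3. Achieves state-of-the-art results on the ShapeNet and Matterport datasets.
Method: Builds on the neural Galerkin method, combining the new energy equation with a 3D convolutional network to fit the input point cloud closely and reconstruct high-quality surfaces.
Result: On datasets including ShapeNet and Matterport, NeuralSSD reaches state-of-the-art surface reconstruction accuracy and generalizability.
Insight: Explicitly optimizing the fit to point cloud information, combined with a neural network that learns 3D features, can markedly improve the accuracy and stability of implicit surface reconstruction.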
Abstract: We propose a generalized method, NeuralSSD, for reconstructing a 3D implicit surface from widely available point cloud data. NeuralSSD is a solver based on the neural Galerkin method, aimed at reconstructing higher-quality and more accurate surfaces from input point clouds. Implicit methods are preferred due to their ability to accurately represent shapes and their robustness in handling topological changes. However, existing parameterizations of implicit fields lack explicit mechanisms to ensure a tight fit between the surface and input data. To address this, we propose a novel energy equation that balances the reliability of point cloud information. Additionally, we introduce a new convolutional network that learns three-dimensional information to achieve superior optimization results. This approach ensures that the reconstructed surface closely adheres to the raw input points and infers valuable inductive biases from point clouds, resulting in a highly accurate and stable surface reconstruction. NeuralSSD is evaluated on a variety of challenging datasets, including the ShapeNet and Matterport datasets, and achieves state-of-the-art results in terms of both surface reconstruction accuracy and generalizability.
[87] NeuralBoneReg: A Novel Self-Supervised Method for Robust and Accurate Multi-Modal Bone Surface Registration
Luohong Wu,Matthias Seibold,Nicola A. Cavalcanti,Yunke Ao,Roman Flepp,Aidana Massalimova,Lilian Calvet,Philipp Fürnstahl
Main category: cs.CV
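TL;DR: NeuralBoneReg is a novel self-supervised method for multi-modal bone surface registration that achieves robust and accurate cross-modal alignment in computer- and robot-assisted orthopedic surgery.
Details
Motivation: In orthopedic surgery, modality heterogeneity between preoperative and intraoperative data makes registration complex and error-prone, so a robust, automatic, and modality-agnostic bone surface registration method is needed.
Contribution: Proposes the NeuralBoneReg framework, comprising an implicit neural unsigned distance field (UDF) and an MLP-based registration module, for self-supervised multi-modal bone surface registration.
Method: Uses 3D point clouds as a modality-agnostic representation; the UDF learns the preoperative bone model, and the MLP module generates transformation hypotheses for global initialization and local refinement.
Result: Performs strongly on multiple datasets, with mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, showing strong generalization across anatomies and modalities.
Insight: The self-supervised approach removes the need for inter-subject training data and offers a new way to handle registration under modality heterogeneity.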
Abstract: In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial modality heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT–ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, 1.88°/1.89 mm on UltraBones-Hip, and 3.79°/2.45 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
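The RRE/RTE figures quoted above are registration error metrics. The abstract does not spell out their definitions, so the sketch below assumes the commonly used ones: relative rotation error as the geodesic angle between the estimated and ground-truth rotations, and relative translation error as the Euclidean distance between the translations. Function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # 3x3 rotation matrix as nested lists


def relative_rotation_error_deg(r_est: Matrix, r_gt: Matrix) -> float:
    """Angle (degrees) of the rotation taking r_gt to r_est."""
    # trace(R_gt^T R_est) equals the element-wise (Frobenius) inner product.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for float safety
    return math.degrees(math.acos(cos_angle))


def relative_translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between translations, in the units of the inputs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def rotation_z(angle_deg: float) -> Matrix:
    """Rotation about the z-axis, used here only to build a test case."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rotation_z(0.0)
rre = relative_rotation_error_deg(rotation_z(10.0), identity)
rte = relative_translation_error([1.0, 2.0, 2.0], [0.0, 0.0, 0.0])
```

With translations expressed in millimetres, `rte` is directly comparable to the mm values reported in the abstract.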
[88] SAM-Fed: SAM-Guided Federated Semi-Supervised Learning for Medical Image Segmentation
Sahar Nasirihaghighi,Negin Ghamsarian,Yiping Li,Marcel Breeuwer,Raphael Sznitman,Klaus Schoeffmann
Main category: cs.CV
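TL;DR: SAM-Fed is a federated semi-supervised learning framework that uses a high-capacity segmentation foundation model to guide the training of lightweight clients. It refines pixel-level supervision through dual knowledge distillation and an adaptive agreement mechanism, significantly improving medical image segmentation performance.
Details
Motivation: Medical image segmentation faces data privacy constraints and high annotation costs. Federated semi-supervised learning (FSSL) is promising, but weak local models and device heterogeneity limit pseudo-label quality.
Contribution: Proposes the SAM-Fed framework, which introduces a high-capacity segmentation foundation model and dual knowledge distillation to address unreliable pseudo-labels and device heterogeneity in FSSL.
Method: Combines a high-capacity segmentation foundation model (such as SAM) with dual knowledge distillation, and designs an adaptive agreement mechanism to improve pseudo-label quality.
Result: On skin lesion and polyp segmentation, SAM-Fed outperforms existing FSSL methods in both heterogeneous and homogeneous settings.
Insight: A high-capacity foundation model can guide lightweight clients in federated learning, markedly improving pseudo-label quality and model performance, which matters especially for medical image segmentation.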
Abstract: Medical image segmentation is clinically important, yet data privacy and the cost of expert annotation limit the availability of labeled data. Federated semi-supervised learning (FSSL) offers a solution but faces two challenges: pseudo-label reliability depends on the strength of local models, and client devices often require compact or heterogeneous architectures due to limited computational resources. These constraints reduce the quality and stability of pseudo-labels, while large models, though more accurate, cannot be trained or used for routine inference on client devices. We propose SAM-Fed, a federated semi-supervised framework that leverages a high-capacity segmentation foundation model to guide lightweight clients during training. SAM-Fed combines dual knowledge distillation with an adaptive agreement mechanism to refine pixel-level supervision. Experiments on skin lesion and polyp segmentation across homogeneous and heterogeneous settings show that SAM-Fed consistently outperforms state-of-the-art FSSL methods.
[89] Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs
Yiyi Miao,Taoyu Wu,Tong Chen,Ji Jiang,Zhe Tang,Zhengyong Jiang,Angelos Stefanidis,Limin Yu,Jionglong Su
Main category: cs.CV
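TL;DR: Dental3R is a 3D reconstruction method for sparse intraoral photographs. A Geometry-Aware Pairing Strategy (GAPS) and wavelet-regularized 3D Gaussian Splatting (3DGS) optimization address unstable geometry and pose estimation under sparse views while preserving fine diagnostic details.
Details
Motivation: Conventional intraoral scanning is not feasible for remote orthodontic treatment, and existing 3D Gaussian Splatting methods suffer from unstable geometry and pose estimation on sparse, unstructured photographs and tend to lose critical diagnostic details.
Contribution: 1. Proposes a Geometry-Aware Pairing Strategy (GAPS) to efficiently select high-value image pairs; 2. Introduces wavelet-regularized 3D Gaussian Splatting, which suppresses high-frequency artifacts while preserving fine structures; 3. Validates the method on a dataset of 950 clinical cases.
Method: 1. GAPS optimizes image pair selection, improving the stability of geometry initialization; 2. The 3DGS model is trained with a wavelet-regularized objective that balances high- and low-frequency information; 3. Geometry and pose are estimated from the sparse views.
Result: Performs strongly on 950 clinical cases and a 195-case video test set, outperforming existing methods and preserving high-quality diagnostic detail, especially under sparse views.
Insight: 1. The Geometry-Aware Pairing Strategy markedly improves reconstruction stability under sparse views; 2. Wavelet regularization preserves low-frequency structure while avoiding high-frequency noise, which suits medical images.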
Abstract: Intraoral 3D reconstruction is fundamental to digital orthodontics, yet conventional methods like intraoral scanning are inaccessible for remote tele-orthodontics, which typically relies on sparse smartphone imagery. While 3D Gaussian Splatting (3DGS) shows promise for novel view synthesis, its application to the standard clinical triad of unposed anterior and bilateral buccal photographs is challenging. The large view baselines, inconsistent illumination, and specular surfaces common in intraoral settings can destabilize simultaneous pose and geometry estimation. Furthermore, sparse-view photometric supervision often induces a frequency bias, leading to over-smoothed reconstructions that lose critical diagnostic details. To address these limitations, we propose Dental3R, a pose-free, graph-guided pipeline for robust, high-fidelity reconstruction from sparse intraoral photographs. Our method first constructs a Geometry-Aware Pairing Strategy (GAPS) to intelligently select a compact subgraph of high-value image pairs. The GAPS focuses on correspondence matching, thereby improving the stability of the geometry initialization and reducing memory usage. Building on the recovered poses and point cloud, we train the 3DGS model with a wavelet-regularized objective. By enforcing band-limited fidelity using a discrete wavelet transform, our approach preserves fine enamel boundaries and interproximal edges while suppressing high-frequency artifacts. We validate our approach on a large-scale dataset of 950 clinical cases and an additional video-based test set of 195 cases. Experimental results demonstrate that Dental3R effectively handles sparse, unposed inputs and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art methods.
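A wavelet-regularized objective compares rendered and target signals per frequency band rather than per pixel only. The sketch below illustrates the idea on a 1D signal with a single-level Haar transform. The choice of Haar, the single level, the L1 distance, and the band weights `w_low` and `w_high` are all assumptions for illustration; the abstract states only that a discrete wavelet transform is used, and the actual method operates on images.

```python
import math
from typing import List, Sequence, Tuple


def haar_dwt(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Single-level 1D Haar transform: returns (approximation, detail) bands."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[k] + signal[k + 1]) * s for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) * s for k in range(0, len(signal), 2)]
    return approx, detail


def wavelet_loss(
    rendered: Sequence[float],
    target: Sequence[float],
    w_low: float = 1.0,
    w_high: float = 0.5,
) -> float:
    """Weighted mean L1 distance between the low and high bands of two signals."""
    r_lo, r_hi = haar_dwt(rendered)
    t_lo, t_hi = haar_dwt(target)
    low = sum(abs(a - b) for a, b in zip(r_lo, t_lo)) / len(r_lo)
    high = sum(abs(a - b) for a, b in zip(r_hi, t_hi)) / len(r_hi)
    return w_low * low + w_high * high


flat_approx, flat_detail = haar_dwt([1.0, 1.0, 1.0, 1.0])
edge_loss = wavelet_loss([1.0, -1.0], [0.0, 0.0])  # error lives entirely in the detail band
```

Separating the bands lets the weights control how strongly fine edges are enforced relative to smooth structure.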
[90] LSP-YOLO: A Lightweight Single-Stage Network for Sitting Posture Recognition on Embedded Devices
Nanjun Li,Ziyue Hao,Quanqiang Wang,Xuanyin Wang
Main category: cs.CV
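TL;DR: LSP-YOLO is a lightweight single-stage network for sitting posture recognition on embedded devices. With the Light-C3k2 module and a direct mapping from keypoints to posture classes, it achieves efficient, lightweight real-time recognition.
Details
Motivation: As sedentary behavior increases, health problems caused by poor sitting posture are drawing attention. Most existing methods rely on two-stage pipelines, which are intrusive, computationally heavy, and slow on embedded devices.
Contribution: 1. Proposes LSP-YOLO, a lightweight single-stage network; 2. Designs the Light-C3k2 module to reduce computational cost; 3. Maps keypoints directly to posture classes in the recognition head; 4. Builds a dataset of 5,000 images covering six sitting postures.
Method: Integrates partial convolution (PConv) and the Similarity-Aware Activation Module (SimAM) into the Light-C3k2 module; maps keypoints directly to posture classes through pointwise convolution; uses intermediate supervision to fuse pose estimation and classification efficiently.
Result: The smallest model, LSP-YOLO-n, reaches 94.2% accuracy and 251 FPS on a PC with a model size of only 1.9 MB; real-time, high-accuracy inference is demonstrated on the SV830C + GC030A platform.
Insight: LSP-YOLO's efficiency and lightweight design make it suitable for edge computing scenarios such as smart classrooms, rehabilitation, and human-computer interaction.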
Abstract: With the rise in sedentary behavior, health problems caused by poor sitting posture have drawn increasing attention. Most existing methods, whether using invasive sensors or computer vision, rely on two-stage pipelines, which result in high intrusiveness, intensive computation, and poor real-time performance on embedded edge devices. Inspired by YOLOv11-Pose, a lightweight single-stage network for sitting posture recognition on embedded edge devices, termed LSP-YOLO, was proposed. By integrating partial convolution (PConv) and the Similarity-Aware Activation Module (SimAM), a lightweight module, Light-C3k2, was designed to reduce computational cost while maintaining feature extraction capability. In the recognition head, keypoints were directly mapped to posture classes through pointwise convolution, and intermediate supervision was employed to enable efficient fusion of pose estimation and classification. Furthermore, a dataset containing 5,000 images across six posture categories was constructed for model training and testing. The smallest trained model, LSP-YOLO-n, achieved 94.2% accuracy and 251 FPS on a personal computer (PC) with a model size of only 1.9 MB. Meanwhile, real-time and high-accuracy inference under constrained computational resources was demonstrated on the SV830C + GC030A platform. The proposed approach is characterized by high efficiency, lightweight design and deployability, making it suitable for smart classrooms, rehabilitation, and human-computer interaction applications.
[91] ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding
Bohan Zhang,Yiyi Miao,Taoyu Wu,Tong Chen,Ji Jiang,Zhuoxiao Li,Zhe Tang,Limin Yu,Jionglong Su
Main category: cs.CV
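TL;DR: ArchMap is a training-free, knowledge-guided framework that performs structured analysis of 3D intraoral scans using a geometry-aware arch-flattening module and a multimodal knowledge base.
Details
Motivation: Existing deep learning methods require modality-specific training, depend on large annotated datasets, and generalize poorly, making them hard to adapt to real clinical needs.
Contribution: Proposes the ArchMap framework, which combines geometric normalization with knowledge-guided reasoning for robust, training-free structured dental understanding.
Method: 1) A geometry-aware arch-flattening module; 2) A Dental Knowledge Base (DKB) that constrains the symbolic reasoning space; 3) Multimodal knowledge-guided reasoning.
Result: Validated on 1,060 cases, ArchMap outperforms supervised pipelines and VLM baselines on tooth counting, anatomical partitioning, and clinical condition identification.
Insight: Combining geometric normalization with knowledge-guided multimodal reasoning offers a scalable solution for the structured analysis of 3D intraoral scans.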
Abstract: A structured understanding of intraoral 3D scans is essential for digital orthodontics. However, existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limit generalization across devices and hinder deployment in real clinical workflows. Moreover, raw intraoral meshes exhibit substantial variation in arch pose, incomplete geometry caused by occlusion or tooth contact, and a lack of texture cues, making unified semantic interpretation highly challenging. To address these limitations, we propose ArchMap, a training-free and knowledge-guided framework for robust structured dental understanding. ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. We then construct a Dental Knowledge Base (DKB) encoding hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. We validate ArchMap on 1060 pre-/post-orthodontic cases, demonstrating robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions. As a fully training-free system, ArchMap demonstrates that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics.
[92] ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
Junfu Pu,Teng Wang,Yixiao Ge,Yuying Ge,Chen Li,Ying Shan
Main category: cs.CV
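TL;DR: ARC-Chapter is the first large-scale video chaptering model. Trained on million-scale long-video chapter data, it structures long video content effectively.
Details
Motivation: Existing methods, limited by small-scale training and coarse annotations, struggle with subtle transitions in long videos, so a method that can effectively structure long video content is needed.
Contribution: 1. Proposes ARC-Chapter, the first large-scale video chaptering model; 2. Builds a bilingual long-video chapter dataset with multi-level annotations; 3. Designs a new evaluation metric, GRACE.
Method: 1. Unifies ASR transcripts, scene texts, and visual captions into multi-level annotations; 2. Improves model performance by scaling data volume and label intensity; 3. Evaluates the model with the GRACE metric.
Result: ARC-Chapter substantially surpasses existing methods (F1 up 14.0%, SODA up 11.3%) and performs well on downstream tasks such as dense video captioning.
Insight: Data scale and annotation quality are key to video chaptering performance, and the GRACE metric better reflects real-world chaptering flexibility.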
Abstract: The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on million-scale long video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state of the art on downstream tasks like dense video captioning on YouCook2.
[93] Clinically-Validated Innovative Mobile Application for Assessing Blinking and Eyelid Movements
Gustavo Adolpho Bonesso,Carlos Marcelo Gurjão de Godoy,Tammy Hentona Osaki,Midori Hentona Osaki,Bárbara Moreira Ribeiro Trindade dos Santos,Regina Célia Coelho
Main category: cs.CV
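TL;DR: The paper presents Bapp, a mobile application for assessing blinking and eyelid movements, and shows through clinical validation that it is accurate and reliable.
Details
Motivation: Traditional tools for assessing eyelid movements are complex, costly, and of limited clinical applicability, so a portable and objective solution is needed.
Contribution: Develops and validates Bapp, a mobile application built with the Flutter framework and Google ML Kit for real-time, high-accuracy eyelid movement analysis.
Method: Uses 45 patient videos, manually annotated by specialists as ground truth, to evaluate Bapp's Precision, Recall, and F1-Score.
Result: Bapp performs well, with 98.4% precision, 96.9% recall, and 98.3% overall accuracy.
Insight: Bapp provides a portable, affordable tool for objective monitoring of eyelid movements, with the potential to replace traditional manual counting and to support clinical and postoperative evaluation.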
Abstract: Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the clinical validation of Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. The validation occurred using 45 videos from real patients, whose blinks were manually annotated by ophthalmology specialists from the Paulista School of Medicine of the Federal University of Sao Paulo (EPM-UNIFESP) to serve as the ground truth. Bapp’s performance was evaluated using standard metrics, including Precision, Recall, and F1-Score, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of Bapp as a portable, accessible, and objective tool for monitoring both normal and abnormal eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.
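The abstract lists F1-Score among the metrics but quotes only precision, recall, and accuracy. As a worked example, the F1 implied by those two figures follows from the standard harmonic-mean formula. This is a derived value, assuming the reported precision and recall refer to the same blink-detection class; it is not a number taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.984, 0.969)
print(f"F1 = {f1:.4f}")  # prints F1 = 0.9764
```

The harmonic mean sits slightly below the arithmetic mean of the two inputs and is pulled toward the lower one, here recall.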
[94] Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection
Xiaolin Wang,Houzhang Fang,Qingshan Li,Lu Wang,Yi Chang,Luxin Yan
Main category: cs.CV
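TL;DR: The paper proposes JFD3, an end-to-end framework that performs joint feature-domain deblurring and detection to improve target detection in blurred infrared UAV images. A dual-branch architecture lets the clear branch guide feature restoration in the blurred branch, and a frequency structure guidance module enriches target information. Experiments demonstrate its superiority.
Details
Motivation: Infrared UAV images often suffer motion blur from rapid sensor movement, which reduces target-background contrast and degrades detection. Existing methods focus mostly on visual quality rather than enhancing task-relevant features, which motivates the new method.
Contribution: 1. Proposes JFD3, a novel framework that jointly performs feature-domain deblurring and detection; 2. Designs a dual-branch architecture and a feature restoration network in which the clear branch guides the blurred branch; 3. Introduces a frequency structure guidance module to enrich target structural information; 4. Builds the IRBlurUAV dataset.
Method: 1. Dual-branch architecture: the clear branch guides the blurred branch; 2. Feature restoration network: clear-branch features supervise feature restoration in the blurred branch; 3. Frequency structure guidance module: enriches structural information in shallow detection layers; 4. Feature consistency self-supervised loss: drives blurred-branch features toward those of the clear branch.
Result: On the IRBlurUAV dataset, JFD3 achieves superior detection performance while maintaining real-time efficiency.
Insight: End-to-end joint optimization of deblurring and detection markedly improves detection in blurred infrared images, indicating that task-relevant feature enhancement is more effective than improving visual quality alone.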
Abstract: Infrared unmanned aerial vehicle (UAV) target images often suffer from motion blur degradation caused by rapid sensor movement, significantly reducing contrast between target and background. Generally, detection performance heavily depends on the discriminative feature representation between target and background. Existing methods typically treat deblurring as a preprocessing step focused on visual quality, while neglecting the enhancement of task-relevant features crucial for detection. Improving feature representation for detection under blur conditions remains challenging. In this paper, we propose a novel Joint Feature-Domain Deblurring and Detection end-to-end framework, dubbed JFD3. We design a dual-branch architecture with shared weights, where the clear branch guides the blurred branch to enhance discriminative feature representation. Specifically, we first introduce a lightweight feature restoration network, where features from the clear branch serve as feature-level supervision to guide the blurred branch, thereby enhancing its distinctive capability for detection. We then propose a frequency structure guidance module that refines the structure prior from the restoration network and integrates it into shallow detection layers to enrich target structural information. Finally, a feature consistency self-supervised loss is imposed between the dual-branch detection backbones, driving the blurred branch to approximate the feature representations of the clear one. We also construct a benchmark, named IRBlurUAV, containing 30,000 simulated and 4,118 real infrared UAV target images with diverse motion blur. Extensive experiments on IRBlurUAV demonstrate that JFD3 achieves superior detection performance while maintaining real-time efficiency.
[95] Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving
Kangqiao Zhao,Shuo Huai,Xurui Song,Jun Luo
Main category: cs.CV
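TL;DR: This paper proposes the first texture-enabled physical adversarial attack against binocular depth estimation in autonomous driving, using a 3D global camouflage texture and a new stereo matching rendering module to achieve both visual consistency and attack effectiveness.
Details
Motivation: Existing adversarial attacks mostly target monocular perception or use local 2D patches, while the effectiveness of physical adversarial attacks on binocular stereo matching models remains largely unexplored.
Contribution: 1. Proposes the first texture-enabled physical adversarial attack against stereo matching models; 2. Designs a 3D stereo matching rendering module to handle binocular disparity; 3. Proposes a merging attack that improves stealth and lethality.
Method: Uses a 3D global camouflage texture together with a new stereo matching rendering module and alignment optimization to achieve visual consistency and an effective attack.
Result: Experiments show that the attack fools stereo models into producing erroneous depth information.
Insight: The 3D global texture and merging optimization markedly improve the stealth and effectiveness of physical adversarial attacks.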
Abstract: Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It significantly improves stealth and lethality over existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
[96] Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition
Fabian Schmidt,Noushiq Mohammed Kayilan Abdul Nazar,Markus Enzweiler,Abhinav Valada
Main category: cs.CV
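TL;DR: The paper proposes TLS-Assist, a module that adds traffic light and sign recognition to LLM-based autonomous driving agents, addressing existing methods' neglect of traffic rules and significantly improving driving performance and safety.
Details
Motivation: Although LLMs show generalization ability in autonomous driving, existing methods lack explicit attention to traffic rules (such as traffic lights and signs), which creates safety risks.
Contribution: Introduces TLS-Assist, which gives LLM driving agents explicit traffic light and sign recognition and strengthens attention to safety-critical cues by injecting natural language messages.
Method: A modular design converts traffic light and sign detections into structured natural language messages that are fed to the LLM, supporting both single-view and multi-view camera setups.
Result: On the LangAuto benchmark in CARLA, driving performance improves by up to 14% over LMDrive and 7% over BEVDriver, while traffic light and sign infractions are consistently reduced.
Insight: A modular redundancy layer such as TLS-Assist can compensate for LLMs' weakness at detecting small objects, improving the safety and reliability of autonomous driving.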
Abstract: Large Language Models (LLMs) are increasingly used for decision-making and planning in autonomous driving, showing promising reasoning capabilities and potential to generalize across diverse traffic situations. However, current LLM-based driving agents lack explicit mechanisms to enforce traffic rules and often struggle to reliably detect small, safety-critical objects such as traffic lights and signs. To address this limitation, we introduce TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition. TLS-Assist converts detections into structured natural language messages that are injected into the LLM input, enforcing explicit attention to safety-critical cues. The framework is plug-and-play, model-agnostic, and supports both single-view and multi-view camera setups. We evaluate TLS-Assist in a closed-loop setup on the LangAuto benchmark in CARLA. The results demonstrate relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver, while consistently reducing traffic light and sign infractions. We publicly release the code and models on https://github.com/iis-esslingen/TLS-Assist.
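The core mechanism described here, turning detector output into a structured natural language message that is prepended to the LLM input, can be sketched as follows. The message wording, the `Detection` fields, and the function names are all hypothetical; the released TLS-Assist code defines its own format, which this sketch does not attempt to reproduce.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical detector output for one traffic light or sign."""
    kind: str          # "traffic_light" or "traffic_sign"
    state: str         # e.g. "red", "stop", "speed_limit_30"
    distance_m: float  # estimated distance to the object
    camera: str = "front"


def detections_to_message(detections: List[Detection], max_items: int = 3) -> str:
    """Render the nearest detections as one short natural language message."""
    if not detections:
        return "No traffic lights or signs detected."
    nearest = sorted(detections, key=lambda d: d.distance_m)[:max_items]
    parts = [
        f"{d.kind.replace('_', ' ')} '{d.state}' about {d.distance_m:.0f} m ahead ({d.camera} camera)"
        for d in nearest
    ]
    return "Safety-critical cues: " + "; ".join(parts) + "."


def build_prompt(instruction: str, detections: List[Detection]) -> str:
    """Prepend the detection message to the navigation instruction."""
    return detections_to_message(detections) + "\n" + instruction


message = detections_to_message(
    [Detection("traffic_sign", "stop", 30.0), Detection("traffic_light", "red", 12.4)]
)
```

Because the message is plain text, the same module can sit in front of different LLM driving agents without changes to the agent itself, which is the plug-and-play property the abstract emphasizes.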
[97] BEDLAM2.0: Synthetic Humans and Cameras in Motion
Joachim Tesch,Giorgio Becherini,Prerana Achar,Anastasios Yiannakidis,Muhammed Kocabas,Priyanka Patel,Michael J. Black
Main category: cs.CV
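TL;DR: BEDLAM2.0 is a new synthetic dataset that extends BEDLAM with more diverse body shapes, motions, clothing, hair, and 3D environments, adds shoes and more varied camera motion, and significantly improves the accuracy of human motion estimation in world coordinates.
Details
Motivation: Video datasets with ground-truth human and camera motion are lacking, which has limited progress on inferring 3D human motion from video. Contribution: BEDLAM2.0, an improved dataset with more realistic and diverse synthetic human and camera motion, particularly suited to training methods that estimate human motion in world coordinates.
Method: BEDLAM2.0 extends BEDLAM by increasing the diversity and realism of camera motion, body shape, motions, clothing, and environments, and by adding shoes.
Result: Experiments show that methods trained on BEDLAM2.0 estimate human motion significantly more accurately than methods trained on BEDLAM.
Insight: High-quality synthetic data such as BEDLAM2.0 can address the scarcity of real data and improve 3D human motion estimation in practical applications.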
Abstract: Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
[98] Language as an Anchor: Preserving Relative Visual Geometry for Domain Incremental Learning
Shuyi Geng,Tao Zhou,Yi Zhou
Main category: cs.CV
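TL;DR: LAVA is a language-anchored framework for domain incremental learning that replaces direct feature alignment with text-driven relative alignment, addressing the inter-domain interference and knowledge fragmentation of earlier methods.
Details
Motivation: In Domain Incremental Learning (DIL), existing methods face a dilemma under distribution shift: a single unified visual space causes interference, while isolated parameters cause knowledge fragmentation. Contribution: Proposes the LAVA framework, which uses a language anchor (text class names) as a semantic reference and preserves relative geometry to retain and reuse knowledge across domains.
Method: LAVA defines the relative geometry from the semantic similarities between text class names and guides visual representations to stay consistent with it, avoiding the drawbacks of direct feature alignment.
Result: LAVA significantly outperforms existing methods on standard DIL benchmarks.
Insight: Language can serve as an anchor for cross-domain learning; its semantic similarities preserve the relative geometry of visual representations and mitigate forgetting.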
Abstract: A key challenge in Domain Incremental Learning (DIL) is to continually learn under shifting distributions while preserving knowledge from previous domains. Existing methods face a fundamental dilemma. On one hand, projecting all domains into a single unified visual space leads to inter-domain interference and semantic distortion, as large shifts may vary with not only visual appearance but also underlying semantics. On the other hand, isolating domain-specific parameters causes knowledge fragmentation, creating “knowledge islands” that hamper knowledge reuse and exacerbate forgetting. To address this issue, we propose LAVA (Language-Anchored Visual Alignment), a novel DIL framework that replaces direct feature alignment with relative alignment driven by a text-based reference anchor. LAVA guides the visual representations of each incoming domain to preserve a consistent relative geometry, which is defined by mirroring the pairwise semantic similarities between the class names. This anchored geometric structure acts as a bridge across domains, enabling the retrieval of class-aware prior knowledge and facilitating robust feature aggregation. Extensive experiments on standard DIL benchmarks demonstrate that LAVA achieves significant performance improvements over state-of-the-art methods. Code is available at https://github.com/ShuyiGeng/LAVA.
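The "relative geometry" idea in this abstract can be sketched as a loss that compares two pairwise-similarity matrices, one built from visual class prototypes and one from class-name text embeddings. The function names, the use of cosine similarity, and the mean-squared penalty are assumptions for illustration only; the abstract does not specify LAVA's actual objective.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def similarity_matrix(vectors: List[Vector]) -> List[List[float]]:
    """Pairwise cosine similarities between all vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def relative_geometry_loss(visual_protos: List[Vector],
                           text_embeds: List[Vector]) -> float:
    """Mean squared gap between visual and text pairwise similarities.

    Only off-diagonal pairs are compared, so the loss constrains how
    classes relate to each other, not where each class sits in absolute
    feature space. Requires at least two classes.
    """
    sv = similarity_matrix(visual_protos)
    st = similarity_matrix(text_embeds)
    n = len(sv)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum((sv[i][j] - st[i][j]) ** 2 for i, j in pairs) / len(pairs)
```

Note that the visual and text vectors may have different dimensions here, since only their similarity matrices are compared; that is one reason a relative objective can be less intrusive than aligning features directly.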
[99] Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Yusen Cai,Bhargava Satya Nunna,Qing Lin,Mengmi Zhang
Main category: cs.CV
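TL;DR: The paper studies how a "visual diet" that simulates the stages of infant visual development (CATDiet) improves the robustness of self-supervised learning models, and validates it on multiple datasets. CombDiet, which combines CATDiet with standard training, improves performance further.
Details
Motivation: To explore the ecological advantages of early infant visual development and use them to improve machine learning models, in particular whether simulating the constraints of infant vision (degraded color, blur, and temporal continuity) improves robustness. Contribution: (1) CATDiet, which simulates the constraints of infant visual development; (2) a comprehensive benchmark covering multiple tasks; (3) evidence that the models' plasticity changes are consistent with biological phenomena; (4) CombDiet, which further improves performance.
Method: (1) CATDiet simulates infant vision through grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T); (2) CombDiet initializes SSL with CATDiet before standard training; (3) models are trained with self-supervised learning on object-centric videos or head-mounted infant videos.
Result: (1) CATDiet improves model robustness; (2) CombDiet outperforms standard self-supervised learning on in-domain and out-of-domain object recognition and depth perception.
Insight: The gradual progression of early infant vision offers a reverse-engineering framework for robust machine visual intelligence; the models' neural and behavioral characteristics resemble those of infant development.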
Abstract: Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged “visual diets”, we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants’ visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
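To make the grayscale-to-color (C) and blur-to-sharp (A) constraints concrete, here is a minimal curriculum sketch. The linear schedule, the `max_sigma` value, and both function names are assumptions for illustration; the abstract does not state how the staged constraints are actually scheduled.

```python
def catdiet_schedule(progress: float, max_sigma: float = 4.0):
    """Return (color_saturation, blur_sigma) for training progress in [0, 1].

    Hypothetical linear schedule: training starts fully grayscale and
    heavily blurred, and ends in full color with no blur.
    """
    p = min(max(progress, 0.0), 1.0)   # clamp progress to [0, 1]
    saturation = p                     # 0 = grayscale, 1 = full color
    sigma = max_sigma * (1.0 - p)      # Gaussian blur width, shrinking to 0
    return saturation, sigma


def desaturate(rgb, saturation: float):
    """Blend one RGB pixel toward its luminance (ITU-R BT.601 weights)."""
    r, g, b = rgb
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return tuple(gray + saturation * (c - gray) for c in rgb)
```

In a training loop, the returned `sigma` would be passed to whatever Gaussian blur the image pipeline provides, and `desaturate` would be applied per pixel (or vectorized) before the frame is fed to the SSL model. Temporal continuity (T) concerns frame ordering and is not modeled in this sketch.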
[100] Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Hong Gao,Yiming Bao,Xuezhen Tu,Yutong Xu,Yue Jin,Yiyang Mu,Bin Zhong,Linan Yue,Min-Ling Zhang
Main category: cs.CV
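TL;DR: AVI is a flexible, training-free framework that mirrors human video comprehension through system-level design and optimization. Its main contributions are a three-phase reasoning process, a structured video knowledge base, and an open-source model ensemble; it performs competitively on several benchmarks while remaining interpretable.
Details
Motivation: Existing video understanding methods typically depend on expensive proprietary models or require reinforcement learning, which limits flexibility. AVI aims for efficient, open video exploration and reasoning through system design and lightweight tools. Contribution: 1. A three-phase Retrieve-Perceive-Review reasoning process; 2. A structured video knowledge base and multi-granularity tools; 3. An open-source model ensemble that reduces dependence on proprietary APIs.
Method: AVI combines an entity-graph knowledge base, multi-granularity tools, and open-source models (reasoning LLMs plus lightweight CV models and a VLM), using three-phase reasoning for global exploration and focused local analysis.
Result: Competitive performance on LVBench and other benchmarks, combined with interpretability.
Insight: System-level design and tool use can substitute for reinforcement learning, and the open-source model ensemble reduces cost, which makes the approach suitable for long-video understanding.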
Abstract: Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent’s interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
[101] CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring
Mingchen Zhong,Xin Lu,Dong Li,Senyan Xu,Ruixuan Jiang,Xueyang Fu,Baocai Yin
Main category: cs.CV
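TL;DR: CompEvent is a complex-valued neural network framework for low-light video enhancement and deblurring that improves performance through full-process spatiotemporal fusion and complementary learning between modalities.
Details
Motivation: Video deblurring in low light (e.g., nighttime surveillance and autonomous driving) is very challenging because of dim lighting and long exposures, and the staged strategies of existing methods handle this combined degradation poorly. Contribution: 1) CompEvent, a complex-valued neural network framework that fuses event data and RGB frames throughout the whole pipeline; 2) a Complex Temporal Alignment GRU and a Complex Space-Frequency Learning module for temporal alignment and deep fusion.
Method: 1) Complex-valued convolutions and a GRU iteratively process the video and event streams to achieve temporal alignment; 2) the Complex Space-Frequency Learning module performs unified signal processing in the spatial and frequency domains.
Result: Experiments show that CompEvent outperforms existing methods on low-light video deblurring.
Insight: Complex-valued neural networks represent the complementarity of multimodal data well, and full-process fusion overcomes the limitations of staged approaches.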
Abstract: Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task. The code is available at https://github.com/YuXie1/CompEvent.
[102] 2D Gaussians Spatial Transport for Point-supervised Density Regression
Miao Shang,Xiaopeng Hong
Main category: cs.CV
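TL;DR: The paper proposes GST, a framework that uses Gaussian splatting to transport the probability measure from the image coordinate space to the annotation map. The transport plan is computed from Bayesian probability, and a derived loss function avoids iterative computation during training, significantly improving efficiency.
Details
Motivation: Conventional optimal-transport methods in computer vision usually recompute the transport plan iteratively during training, which is inefficient. The paper addresses this with Gaussian splatting and probability measure transport. Contribution: 1. The GST framework, which uses Gaussian splatting for probability measure transport. 2. A transport plan computed from Bayesian probability. 3. A new loss function that avoids iterative computation during training.
Method: 1. Estimate pixel-annotation correspondence with Gaussian splatting. 2. Compute the transport plan from Bayesian probability. 3. Derive a loss function that integrates the transport plan directly into network optimization.
Result: Experiments on crowd counting and landmark detection show that GST is significantly more efficient than conventional optimal transport while remaining accurate.
Insight: Gaussian splatting combined with Bayesian probability offers a new route to probability measure transport that avoids costly iteration and suits large-scale vision tasks.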
Abstract: This paper introduces Gaussian Spatial Transport (GST), a novel framework that leverages Gaussian splatting to facilitate transport from the probability measure in the image coordinate space to the annotation map. We propose a Gaussian splatting-based method to estimate pixel-annotation correspondence, which is then used to compute a transport plan derived from Bayesian probability. To integrate the resulting transport plan into standard network optimization in typical computer vision tasks, we derive a loss function that measures discrepancy after transport. Extensive experiments on representative computer vision tasks, including crowd counting and landmark detection, validate the effectiveness of our approach. Compared to conventional optimal transport schemes, GST eliminates iterative transport plan computation during training, significantly improving efficiency. Code is available at https://github.com/infinite0522/GST.
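One way to read "a transport plan derived from Bayesian probability" is as a posterior over annotations for each pixel, computed in closed form from Gaussian likelihoods. The sketch below follows that reading with isotropic Gaussians of a fixed `sigma` and a uniform prior over annotations; both choices, and the function names, are assumptions rather than the paper's formulation.

```python
import math


def gaussian_2d(x, mu, sigma):
    """Isotropic 2D Gaussian density at point x with mean mu."""
    d2 = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def transport_plan(pixels, annotations, sigma=1.0):
    """Row i holds P(annotation k | pixel i) under a uniform annotation prior.

    By Bayes' rule the posterior is the likelihood normalized over all
    annotations, so each row is obtained in one pass with no iterative
    solver.
    """
    plan = []
    for x in pixels:
        lik = [gaussian_2d(x, mu, sigma) for mu in annotations]
        total = sum(lik)
        if total == 0.0:
            # Pixel is numerically far from every annotation: fall back
            # to a uniform row instead of dividing by zero.
            plan.append([1.0 / len(annotations)] * len(annotations))
        else:
            plan.append([l / total for l in lik])
    return plan
```

The contrast with Sinkhorn-style optimal transport is that nothing here is iterated to convergence, which matches the efficiency argument made in the abstract, although the actual GST plan may differ in form.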
[103] Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM
Jack Qin,Zhitao Wang,Yinan Zheng,Keyu Chen,Yang Zhou,Yuanxin Zhong,Siyuan Cheng
Main category: cs.CV
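TL;DR: The paper proposes Risk Semantic Distillation (RSD), a framework that uses Vision-Language Models (VLMs) to enhance the training of end-to-end autonomous driving (E2E AD); its RiskHead module introduces risk attention into BEV features to improve generalization.
Details
Motivation: Current autonomous driving systems generalize poorly to unseen scenarios and unfamiliar sensor configurations, while existing VLM-based approaches lead either to inconsistent hybrid systems or to prohibitive computational demands. Contribution: The RSD framework and the RiskHead module, which distill risk semantics from a VLM into BEV features to improve generalization and interpretability.
Method: RiskHead distills causal risk estimates from the VLM into BEV features, producing interpretable risk-attention maps and strengthening the risk representation of the BEV features.
Result: On the Bench2Drive benchmark, RSD significantly improves perception and planning, particularly in complex and unpredictable driving conditions.
Insight: Focusing on risk attention aligns the model more closely with human driving behavior, which improves autonomous driving performance in dynamic environments.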
Abstract: The autonomous driving (AD) system has exhibited remarkable performance in complex driving scenarios. However, generalization is still a key limitation for the current system, which refers to the ability to handle unseen scenarios or unfamiliar sensor configurations. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from VLM directly. However, these end-to-end solutions demonstrate prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird’s-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model’s ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential to navigate in complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
[104] D-PerceptCT: Deep Perceptual Enhancement for Low-Dose CT Images
Taifour Yousra Nabila,Azeddine Beghdadi,Marie Luong,Zuheng Ming,Habib Zaidi,Faouzi Alaya Cheikh
Main category: cs.CV
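TL;DR: D-PerceptCT is a deep learning architecture inspired by the human visual system for enhancing low-dose CT images; it combines semantic priors with multiscale features and a new perceptual loss function to better preserve image detail.
Details
Motivation: Low-dose CT (LDCT) reduces radiation risk but degrades image quality, and existing methods often lose critical details through over-smoothing, so an enhancement method that attends to perceptually relevant features is needed. Contribution: The D-PerceptCT architecture, which combines a Visual Dual-path Extractor (ViDex), a Global-Local State-Space block, and a new Deep Perceptual Relevancy Loss Function (DPRLF) to preserve diagnostically critical details.
Method: 1) ViDex brings in semantic priors from a pretrained DINOv2 model; 2) the Global-Local State-Space block captures multiscale features; 3) the DPRLF loss emphasizes perceptually important features.
Result: On the Mayo2016 dataset, D-PerceptCT preserves structural and textural information in LDCT images better than existing methods.
Insight: Loss functions and architectures informed by human visual perception preserve critical details in medical images more effectively and can better support diagnosis.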
Abstract: Low Dose Computed Tomography (LDCT) is widely used as an imaging solution to aid diagnosis and other clinical tasks. However, this comes at the price of a deterioration in image quality due to the low dose of radiation used to reduce the risk of secondary cancer development. While some efficient methods have been proposed to enhance LDCT quality, many overestimate noise and perform excessive smoothing, leading to a loss of critical details. In this paper, we introduce D-PerceptCT, a novel architecture inspired by key principles of the Human Visual System (HVS) to enhance LDCT images. The objective is to guide the model to enhance or preserve perceptually relevant features, thereby providing radiologists with CT images where critical anatomical structures and fine pathological details are perceptually visible. D-PerceptCT consists of two main blocks: 1) a Visual Dual-path Extractor (ViDex), which integrates semantic priors from a pretrained DINOv2 model with local spatial features, allowing the network to incorporate semantic-awareness during enhancement; 2) a Global-Local State-Space block that captures long-range information and multiscale features to preserve the important structures and fine details for diagnosis. In addition, we propose a novel deep perceptual loss, designated as the Deep Perceptual Relevancy Loss Function (DPRLF), which is inspired by human contrast sensitivity, to further emphasize perceptually important features. Extensive experiments on the Mayo2016 dataset demonstrate the effectiveness of the D-PerceptCT method for LDCT enhancement, showing better preservation of structural and textural information within LDCT images compared to SOTA methods.
[105] A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement
Yufeng Tian,Yifan Chen,Zhe Sun,Libang Chen,Mingyu Dou,Jijun Lu,Ye Zheng,Xuelong Li
Main category: cs.CV
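TL;DR: The paper proposes a generative data framework that translates natural images into underwater-degraded versions to build synthetic datasets, addressing the scarcity of high-quality paired data in underwater image restoration and enhancement and improving color restoration and generalization.
Details
Motivation: High-quality real reference images are hard to obtain for underwater restoration and enhancement, so existing benchmarks usually rely on manually selected enhancement results, which lack globally consistent color and authentic supervision and limit model performance. Contribution: A generative data framework based on unpaired image-to-image translation, used to build a large-scale synthetic dataset that covers 6 representative underwater degradation types and provides authentic supervision signals.
Method: In-air natural images serve as clean reference targets and are translated into underwater-degraded versions, giving synthetic datasets from which models learn an accurate mapping from degraded images to the original scene.
Result: Experiments across 6 network architectures and 3 independent test sets show that models trained on the synthetic data match or exceed models trained on existing benchmarks in color restoration and generalization.
Insight: The synthetic data framework is a reliable and scalable answer to data scarcity in underwater image restoration; its precise ground-truth labels improve model performance.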
Abstract: Underwater image restoration and enhancement are crucial for correcting color distortion and restoring image details, thereby establishing a fundamental basis for subsequent underwater visual tasks. However, current deep learning methodologies in this area are frequently constrained by the scarcity of high-quality paired datasets. Since it is difficult to obtain pristine reference labels in underwater scenes, existing benchmarks often rely on manually selected results from enhancement algorithms, providing debatable reference images that lack globally consistent color and authentic supervision. This limits the model’s capabilities in color restoration, image enhancement, and generalization. To overcome this limitation, we propose using in-air natural images as unambiguous reference targets and translating them into underwater-degraded versions, thereby constructing synthetic datasets that provide authentic supervision signals for model learning. Specifically, we establish a generative data framework based on unpaired image-to-image translation, producing a large-scale dataset that covers 6 representative underwater degradation types. The framework constructs synthetic datasets with precise ground-truth labels, which facilitate the learning of an accurate mapping from degraded underwater images to their pristine scene appearances. Extensive quantitative and qualitative experiments across 6 representative network architectures and 3 independent test sets show that models trained on our synthetic data achieve comparable or superior color restoration and generalization performance to those trained on existing benchmarks. This research provides a reliable and scalable data-driven solution for underwater image restoration and enhancement. The generated dataset is publicly available at: https://github.com/yftian2025/SynUIEDatasets.git.
[106] DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
Xiangchen Yin,Jiahui Yuan,Zhangchi Hu,Wenzhang Sun,Jie Chen,Xiaozhen Qiao,Hao Li,Xiaoyan Sun
Main category: cs.CV
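TL;DR: DeCo-VAE decouples video content into keyframe, motion, and residual components and learns a dedicated latent representation for each, avoiding redundant modeling and yielding compact latents; a shared 3D decoder maintains spatiotemporal consistency during reconstruction.
Details
Motivation: Existing video VAEs overlook the similarity between frame contents, which leads to redundant latent representations. DeCo-VAE aims to learn compact latents by decoupling video content. Contribution: A decoupled VAE (DeCo-VAE) that decomposes video content into keyframe, motion, and residual, with a dedicated encoder for each to avoid cross-component interference and a shared decoder to keep reconstruction consistent.
Method: Dedicated encoders encode the keyframe, motion, and residual, followed by a shared 3D decoder; a decoupled adaptation strategy freezes some encoders while training the others sequentially, giving stable learning of static and dynamic features.
Result: Quantitative and qualitative experiments show that DeCo-VAE achieves superior video reconstruction performance.
Insight: Decoupling video content and learning a dedicated latent for each component reduces redundancy and improves reconstruction quality and efficiency.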
Abstract: Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
[107] Interaction-Aware 4D Gaussian Splatting for Dynamic Hand-Object Interaction Reconstruction
Hao Tian,Chenyangguang Zhang,Rui Liu,Wen Shen,Xiaolin Qin
Main category: cs.CV
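TL;DR: The paper proposes a 4D Gaussian splatting method for reconstructing dynamic hand-object interaction without any object priors. Interaction-aware Gaussians, interaction-aware dynamic fields, and a progressive optimization strategy improve reconstruction quality.
Details
Motivation: Modeling the geometry and appearance of dynamic hand-object interaction scenes is challenging, especially without object priors. Existing dynamic 3D Gaussian splatting methods struggle with complex interaction such as mutual occlusion and edge blur, so a more effective model is needed. Contribution: 1. Interaction-aware hand-object Gaussians with optimizable parameters for clearer structure; 2. Hand information incorporated into the object deformation field to build interaction-aware dynamic fields; 3. A progressive optimization strategy with explicit regularization that stabilizes the representation and improves motion transitions and physical plausibility.
Method: 1. Model hand-object interaction with interaction-aware Gaussians; 2. Use dynamic fields that couple hand and object shape dynamics; 3. Optimize progressively, handling dynamic regions and the static background step by step; 4. Apply regularization for smooth motion and physical realism.
Result: Experiments show that the method surpasses existing dynamic 3D Gaussian splatting methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.
Insight: The tight coupling of hand and object during interaction is captured better by joint modeling with dynamic fields. Progressive optimization and explicit regularization are effective for optimizing complex interaction scenes.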
Abstract: This paper focuses on a challenging setting of simultaneously modeling geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting based methods, and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters aiming to adopt piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic regions and static background step by step. Correspondingly, explicit regularizations are designed to stabilize the hand-object representations for smooth motion transition, physical interaction reality, and coherent lighting. Experiments show that our approach surpasses existing dynamic 3D-GS-based methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.
[108] ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection
Mohammad Romani
Main category: cs.CV
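TL;DR: ForensicFlow is a tri-modal adaptive network for robust Deepfake detection that fuses RGB, texture, and frequency evidence to improve detection performance.
Details
Motivation: Single-stream CNNs fail to capture multi-scale forgery artifacts across the spatial, texture, and frequency domains, which limits the robustness and generalization of Deepfake detection. Contribution: A tri-modal framework (RGB, texture, and frequency branches) with attention mechanisms and an adaptive fusion strategy.
Method: ConvNeXt extracts global RGB features, a Swin Transformer detects texture artifacts, and a CNN with SE analyzes frequency features; attention dynamically integrates the three branches.
Result: On Celeb-DF (v2), ForensicFlow reaches AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines.
Insight: Multimodal feature fusion is key to Deepfake detection performance, especially for recognizing subtle forgery traces.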
Abstract: Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances branch contributions. Trained on Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
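Attention-based temporal pooling, as mentioned in this abstract, generally means weighting per-frame features by a softmax over per-frame scores. The sketch below shows that generic mechanism in plain Python; where the scores come from (for example a small learned scoring layer) is left open, and none of the names are taken from the ForensicFlow implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(frame_feats, frame_scores):
    """Weighted average of per-frame feature vectors.

    frame_feats:  list of T feature vectors, each of length D
    frame_scores: list of T scalar scores (assumed to be produced by
                  some scoring layer, which is not shown here)
    """
    weights = softmax(frame_scores)
    dim = len(frame_feats[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_feats))
        for d in range(dim)
    ]
```

With equal scores the pooling reduces to a plain mean over frames; a frame with a much higher score dominates the pooled vector, which is how high-evidence frames can be prioritized.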
[109] OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Keda Tao,Kele Shao,Bohan Yu,Weiqiang Wang,Jian liu,Huan Wang
Main category: cs.CV
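TL;DR: OmniZip is a training-free, audio-guided dynamic token compression framework for accelerating inference in omnimodal large language models. It identifies salient audio tokens and computes information density to guide video token pruning and compression, giving substantial speedup and memory savings.
Details
Motivation: Existing token compression methods do not address joint compression of multimodal tokens, while processing audio-video token sequences is a computational bottleneck for omnimodal LLMs. Contribution: The OmniZip framework, a training-free method for jointly and dynamically compressing audio-visual tokens that improves inference speed and memory efficiency.
Method: OmniZip identifies salient audio tokens, computes an audio retention score, and uses it to guide video token pruning and compression while preserving cues from audio anchors enhanced by cross-modal similarity.
Result: 3.42x inference speedup and 1.4x memory reduction, with performance maintained and no training required.
Insight: Audio-guided dynamic compression is an effective route to multimodal token optimization, and the training-free design eases practical deployment.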
Abstract: Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need of jointly compressing multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves 3.42X inference speedup and 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
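The audio-guided pruning step can be pictured as follows: a time window whose audio is information-dense keeps more video tokens, and a quiet window keeps fewer. In this sketch the retention score is simply the fraction of salient audio tokens in the window, and the keep ratio is a linear map between `min_keep` and `max_keep`; the scoring rule, thresholds, and function names are all hypothetical and are not taken from OmniZip.

```python
def audio_retention(saliency, threshold=0.5):
    """Fraction of audio tokens in a time window whose saliency is high.

    `saliency` holds one hypothetical importance score per audio token.
    """
    return sum(s >= threshold for s in saliency) / len(saliency)


def prune_video_tokens(video_scores, retention, min_keep=0.2, max_keep=0.8):
    """Return the indices of video tokens to keep, in original order.

    The keep ratio grows linearly with the audio retention score, so the
    audio stream decides how aggressively this window is pruned. Which
    tokens survive is decided by `video_scores`, a hypothetical
    per-token importance score.
    """
    keep_ratio = min_keep + (max_keep - min_keep) * retention
    k = max(1, round(keep_ratio * len(video_scores)))
    ranked = sorted(range(len(video_scores)),
                    key=lambda i: video_scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because nothing here is learned, a scheme of this kind can be applied at inference time without retraining, which is the property the abstract emphasizes.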
[110] CCSD: Cross-Modal Compositional Self-Distillation for Robust Brain Tumor Segmentation with Missing Modalities
Dongqing Xie,Yonghuang Wu,Zisheng Ai,Jun Min,Zhencun Jiang,Shaojin Geng,Lei Wang
Main category: cs.CV
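TL;DR: The paper proposes CCSD, a framework that uses cross-modal compositional self-distillation to handle missing modalities in brain tumor segmentation, improving robustness and generalization.
Details
Motivation: Integrating complementary information from multi-modal MRI is essential for brain tumor segmentation, but modalities are often missing in practice, which degrades model performance. Contribution: 1. The CCSD framework, which flexibly handles arbitrary modality combinations; 2. Two strategies, hierarchical modality self-distillation and progressive modality combination distillation, that improve robustness to missing modalities.
Method: A shared-specific encoder-decoder architecture with two self-distillation strategies: hierarchical knowledge transfer across modalities and progressive modality combination training.
Result: State-of-the-art performance on public datasets, particularly in missing-modality scenarios.
Insight: Distilling knowledge between modalities improves generalization and stability, offering a practical solution for clinical use.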
Abstract: The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
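The phrase "simulating gradual modality dropout during training" can be illustrated with a sampler that allows more modalities to go missing as training proceeds. The four sequence names are the ones commonly used in brain tumor MRI and are an assumption here, as are the linear schedule and the function names; the abstract does not describe the exact procedure.

```python
import random

# Assumed set of MRI sequences; the abstract does not list them.
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def max_dropped(epoch: int, total_epochs: int, n: int = len(MODALITIES)) -> int:
    """Upper bound on missing modalities, growing linearly with training.

    At least one modality is always kept, so the bound never exceeds n - 1.
    """
    return min(n - 1, int(n * epoch / total_epochs))


def sample_available(epoch: int, total_epochs: int, rng: random.Random):
    """Sample which modalities are visible to the student at this step."""
    k = rng.randint(0, max_dropped(epoch, total_epochs))
    dropped = set(rng.sample(MODALITIES, k))
    return [m for m in MODALITIES if m not in dropped]
```

In a self-distillation setup, a branch that sees all modalities would act as the teacher for a branch that sees only the sampled subset; that pairing is implied by the abstract but not shown in this sketch.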
[111] MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer’s Disease Cohorts
Nathaniel Putera,Daniel Vilet Rodríguez,Noah Videcrantz,Julia Machnio,Mostafa Mehdipour Ghazi
Main category: cs.CV
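TL;DR: The paper studies how MRI embeddings and tabular data complement each other in modeling cognitive decline in Alzheimer's disease, finding that clinical features and MRI embeddings each have strengths in different settings.
Details
Motivation: Modeling cognitive decline in Alzheimer's disease calls for accurate prediction and stratification, but tabular data capture subtle brain changes poorly; MRI embeddings provide complementary information. Contribution: A trajectory-aware labeling strategy based on Dynamic Time Warping clustering, and a 3D Vision Transformer trained without supervision to obtain MRI embeddings, which are shown to be better at identifying stable individuals.
Method: Dynamic Time Warping clustering captures heterogeneous patterns of cognitive change; a 3D ViT is trained by unsupervised reconstruction to produce MRI embeddings, which are evaluated with both traditional machine learning classifiers and deep learning heads.
Result: Clinical and volumetric features are best at predicting mild and severe progression (AUC ≈ 0.70), while ViT MRI embeddings are best at identifying cognitively stable individuals (AUC = 0.71); all approaches struggle on the heterogeneous moderate group.
Insight: Clinical features and MRI embeddings have complementary strengths, which supports multimodal fusion for modeling Alzheimer's disease progression.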
Abstract: Accurate modeling of cognitive decline in Alzheimer’s disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
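Dynamic Time Warping, which underlies the trajectory-aware labeling described here, measures the distance between two sequences while allowing them to be stretched in time. Below is the standard dynamic-programming form for 1D sequences with an absolute-difference cost. How the paper clusters the resulting distances is not stated in the abstract, so only the distance itself is sketched.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance.

    cost[i][j] is the minimal accumulated cost of aligning the first i
    points of `a` with the first j points of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: a step in `a` only, a step in `b`
            # only, or a step in both sequences together.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

Two cognitive-score trajectories that decline by the same amount at different rates get a small DTW distance even when a pointwise distance would be large, which is why DTW suits subjects whose visits are not aligned in time.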
[112] XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation
Yilin Zhang,Leo D. Westbury,Elaine M. Dennison,Nicholas C. Harvey,Nicholas R. Fuggle,Rahman Attar
Main category: cs.CV
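TL;DR: XAttn-BMD is a multimodal deep learning framework that uses cross-attention to integrate hip X-ray images with clinical metadata and predict femoral neck bone mineral density (BMD). A tailored loss function addresses data imbalance; the model outperforms baselines on regression and is also evaluated for screening via classification.
Details
Motivation: Low BMD is a key feature of osteoporosis, and conventional methods do not integrate multimodal data well enough to predict BMD accurately. The paper proposes dynamic fusion of image and clinical data to improve prediction. Contribution: 1. The XAttn-BMD framework, which uses bidirectional cross-attention to integrate multimodal data dynamically; 2. A Weighted Smooth L1 loss that addresses BMD imbalance; 3. Validation on a real-world dataset.
Method: 1. Fuse image and metadata features with bidirectional cross-attention; 2. Train with a Weighted Smooth L1 loss; 3. Evaluate on regression and classification tasks.
Result: Compared with naive feature concatenation, cross-attention fusion reduces MSE by 16.7% and MAE by 6.03% and increases the R² score by 16.4%. Screening performance is also evaluated by binary classification at clinically relevant thresholds.
Insight: Cross-attention captures relationships between modalities, and the tailored loss focuses the model on clinically important cases, which points to the value of dynamic feature fusion and task-specific optimization in multimodal learning.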
Abstract: Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R² score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model’s potential in real-world scenarios.
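A weighted Smooth L1 loss can be sketched as the usual Smooth L1 (Huber-style) penalty with a per-sample weight, so that rare or clinically important BMD values count more. The per-element formula below is the standard Smooth L1 definition with threshold `beta`; how the paper actually derives its weights is not given in the abstract, so the `weights` argument is a placeholder.

```python
def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Standard Smooth L1: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def weighted_smooth_l1(preds, targets, weights, beta: float = 1.0) -> float:
    """Weight-normalized average of per-sample Smooth L1 losses.

    `weights` is a hypothetical per-sample importance, e.g. larger for
    samples whose target BMD falls in an under-represented range.
    """
    total = sum(w * smooth_l1(p, t, beta)
                for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)
```

For example, with errors of 0.5 and 2.0 and weights 1 and 3, the per-sample losses are 0.125 and 1.5, and the weighted loss is (0.125 + 4.5) / 4 = 1.15625, so the heavily weighted large error dominates.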
[113] 3D-Guided Scalable Flow Matching for Generating Volumetric Tissue Spatial Transcriptomics from Serial Histology
Mohammad Vali Sanian,Arshia Hemmat,Amirhossein Vahidi,Jonas Maaskola,Jimmy Tsz Hang Lee,Stanislaw Makarchuk,Yeliz Demirci,Nana-Jane Chipampe,Omer Bayraktar,Lassi Paavolainen,Mohammad Lotfollahi
Main category: cs.CV
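TL;DR: The paper presents HoloTea, a 3D-aware flow-matching framework that infers spatial transcriptomics from H&E-stained images, using context from adjacent sections to improve the accuracy of the predicted 3D expression.
Details
Motivation: Existing predictive algorithms either ignore 3D structure by treating each section independently, or are not generative and scale poorly. HoloTea aims to address these limitations and produce more accurate 3D virtual tissue data.
Contribution: 1. Proposes the HoloTea framework, which fuses context from adjacent sections; 2. Designs a 3D-consistent flow-matching prior that combines a learned ZINB prior with a spatial-empirical prior; 3. Achieves efficient scaling through a global attention block.
Method: HoloTea builds on flow matching, uses a lightweight ControlNet to fuse cross-section information, and combines the ZINB and spatial-empirical priors to capture the count nature of the data. A global attention block that scales linearly with the number of spots enables training and inference on large datasets.
Result: On three datasets spanning different tissue types and resolutions, HoloTea improves 3D expression accuracy and generalization over 2D and 3D baselines.
Insight: By explicitly exploiting 3D context and designing for scalability, HoloTea offers a new route to accurate 3D virtual tissues, with potential benefits for disease research and biomarker discovery.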
Abstract: A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.
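To make the "count nature of the data" concrete, the following sketch evaluates the probability mass function of a zero-inflated negative binomial (ZINB) distribution in its common mean/dispersion parameterization. It illustrates the distribution family only; how HoloTea learns or samples its ZINB prior is not described in the abstract, and the parameter names `mu`, `theta`, and `pi` are conventional choices rather than the paper's notation.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-pmf of a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def zinb_pmf(k, mu, theta, pi):
    """ZINB pmf: with probability pi emit a structural zero, else draw from NB."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    if k == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb
```

The extra mass at zero is what lets the distribution model the many empty counts typical of spot-level expression data, while the negative binomial part handles overdispersed nonzero counts.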
[114] Fusing Biomechanical and Spatio-Temporal Features for Fall Prediction: Characterizing and Mitigating the Simulation-to-Reality Gap
Md Fokhrul Islam,Sajeda Al-Hammouri,Christopher J. Arellano,Kavan Hazeli,Heman Shakeri
Main category: cs.CV
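TL;DR: The paper proposes BioST-GCN, a dual-stream model that combines pose and biomechanical information for fall prediction and fuses the two streams with cross-attention. The model performs well on simulated data, but a simulation-to-reality gap remains and zero-shot generalization degrades sharply.
Details
Motivation: Falls seriously affect the health and independence of older adults. Vision-based fall prediction systems need large amounts of data, but real fall data is scarce, which makes such systems hard to develop.
Contribution: 1. Proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines pose and biomechanical information; 2. Fuses features with a cross-attention mechanism; 3. Characterizes a substantial gap between simulated and real-world data.
Method: BioST-GCN uses a dual-stream design that processes pose and biomechanical information separately and fuses them through cross-attention. Spatio-temporal attention in the ST-GCN stream adds interpretability.
Result: On the simulated MCF-UA and MUVIM datasets, BioST-GCN improves F1 over the ST-GCN baseline by 5.32% and 2.91%, respectively. However, F1 falls from 89.0% under full supervision to 35.9% under zero-shot generalization to unseen subjects.
Insight: Simulated and real-world data differ substantially, and the distinctive kinematic profiles of older adults widen the gap. Future directions include personalization strategies and privacy-preserving data pipelines.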
Abstract: Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as ‘intent-to-fall’ cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.
[115] SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction
Meiying Gu,Jiawei Zhang,Jiahe Li,Xiaohan Yu,Haonan Luo,Jin Zheng,Xiao Bai
Main category: cs.CV
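TL;DR: The paper proposes SparseSurf, which uses Stereo Geometry-Texture Alignment and Pseudo-Feature Enhanced Geometry Consistency to counter overfitting in sparse-view 3D Gaussian Splatting, improving both surface reconstruction and novel view synthesis.
Details
Motivation: With sparse input views, existing 3D Gaussian Splatting methods overfit easily, which degrades surface reconstruction and novel view synthesis. The paper aims to fix this and improve both reconstruction and rendering.
Contribution: 1. Proposes Stereo Geometry-Texture Alignment, which jointly optimizes rendering quality and geometry estimation; 2. Introduces Pseudo-Feature Enhanced Geometry Consistency, which strengthens multi-view geometric consistency and mitigates overfitting.
Method: Combines Stereo Geometry-Texture Alignment with a multi-view geometric consistency constraint, using pseudo-features to enforce consistency across training and unseen views and thereby improve sparse-view reconstruction.
Result: Achieves state-of-the-art performance on the DTU, BlendedMVS, and Mip-NeRF360 datasets.
Insight: Jointly optimizing geometry and texture substantially improves sparse-view 3D reconstruction and novel view synthesis.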
Abstract: Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves state-of-the-art performance.
[116] RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT
John M. Oyer,Ali Namvar,Benjamin A. Hoff,Wassim W. Labaki,Ella A. Kazerooni,Charles R. Hatt,Fernando J. Martinez,MeiLan K. Han,Craig J. Galbán,Sundaresh Ram
Main category: cs.CV
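TL;DR: RepAir is a three-stage framework for airway segmentation and discontinuity repair in CT images. It combines an nnU-Net-based network with anatomically informed topology correction, yielding more complete and anatomically consistent airway trees.
Details
Motivation: Airway segmentation is essential for quantitative lung analysis, but manual annotation is impractical, and existing automated methods (e.g., U-Net-based) often produce disconnected airway components that hinder reliable biomarker extraction.
Contribution: Proposes the RepAir framework, with three stages: nnU-Net segmentation, skeleton-based discontinuity detection, and link decisions made by a 1D convolutional classifier. The framework improves both segmentation accuracy and topological completeness.
Method: 1. Generate an initial airway mask with nnU-Net; 2. Identify potential discontinuities with a skeleton-based algorithm and propose reconnections; 3. Use a 1D convolutional classifier to judge whether each candidate link is anatomically plausible.
Result: On the ATM’22 and AeroPath datasets, RepAir outperforms existing methods such as Bronchinet and NaviAirway on both voxel-level and topological metrics.
Insight: Combining deep learning with anatomically informed post-processing effectively resolves topological discontinuities in airway segmentation and provides more reliable data for quantitative analysis.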
Abstract: Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM’22, comprising annotated CT scans from predominantly healthy subjects and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.
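The second stage (finding disconnected pieces and proposing reconnections) can be sketched in a simplified form. The version below is an illustration under stated assumptions, not the paper's algorithm: it works on raw voxel coordinates instead of a skeleton, uses 26-connectivity, and proposes a link by brute-force nearest-voxel search. The names `connected_components`, `propose_links`, and `max_gap` are hypothetical.

```python
import math
from collections import deque
from itertools import product


def connected_components(voxels):
    """Group (x, y, z) voxels into 26-connected components, largest first."""
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    unvisited = set(voxels)
    components = []
    while unvisited:
        seed = unvisited.pop()
        queue, comp = deque([seed]), {seed}
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                n = (x + dx, y + dy, z + dz)
                if n in unvisited:
                    unvisited.remove(n)
                    comp.add(n)
                    queue.append(n)
        components.append(comp)
    return sorted(components, key=len, reverse=True)


def propose_links(components, max_gap=5.0):
    """For each fragment, propose its closest voxel pair to the main tree.

    Brute force over voxel pairs; fine for a toy example, too slow for real CT.
    """
    main = components[0]
    links = []
    for fragment in components[1:]:
        gap, a, b = min((math.dist(a, b), a, b)
                        for a in fragment for b in main)
        if gap <= max_gap:
            links.append((a, b, gap))
    return links
```

In the framework described above, each proposed link would then go to the 1D convolutional classifier, which decides whether it corresponds to a true anatomical branch.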
[117] Impact of Image Resolution on Age Estimation with DeepFace and InsightFace
Shiyar Jamo
Main category: cs.CV
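TL;DR: The paper studies how image resolution affects age estimation accuracy in DeepFace and InsightFace. Both frameworks perform best at 224x224 pixels; InsightFace has the lower MAE at that resolution and is faster at every resolution.
Details
Motivation: In age verification applications, input image resolution varies widely, so the effect of resolution on estimation accuracy is of practical interest.
Contribution: Systematically evaluates the effect of image resolution on age estimation with DeepFace and InsightFace and identifies the best-performing resolution.
Method: Uses 1000 images from the IMDB-Clean dataset, processed at seven resolutions to yield 7000 test samples; performance is measured with MAE, SD, and MedAE.
Result: Accuracy is best at 224x224 pixels (MAE of 10.83 years for DeepFace and 7.46 years for InsightFace). Both low and very high resolutions reduce accuracy. InsightFace is faster than DeepFace.
Insight: Image resolution is a key factor in age estimation; practical deployments must balance resolution against computational cost.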
Abstract: Automatic age estimation is widely used for age verification, where input images often vary considerably in resolution. This study evaluates the effect of image resolution on age estimation accuracy using DeepFace and InsightFace. A total of 1000 images from the IMDB-Clean dataset were processed in seven resolutions, resulting in 7000 test samples. Performance was evaluated using Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Based on this study, we conclude that input image resolution has a clear and consistent impact on the accuracy of age estimation in both DeepFace and InsightFace. Both frameworks achieve optimal performance at 224x224 pixels, with an MAE of 10.83 years (DeepFace) and 7.46 years (InsightFace). At low resolutions, MAE increases substantially, while very high resolutions also degrade accuracy. InsightFace is consistently faster than DeepFace across all resolutions.
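The three reported metrics are straightforward to compute from predicted and true ages. The sketch below is a minimal illustration; in particular, it assumes that SD means the sample standard deviation of the absolute errors, which the abstract does not specify.

```python
import statistics


def age_error_metrics(predicted, actual):
    """Return MAE, SD, and MedAE of absolute age errors (in years)."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "MAE": statistics.fmean(errors),     # mean absolute error
        "SD": statistics.stdev(errors),      # assumed: SD of absolute errors
        "MedAE": statistics.median(errors),  # median absolute error
    }
```

MedAE is less sensitive than MAE to a few badly mis-estimated faces, so reporting both gives a sense of how much outliers drive the average.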
[118] Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images
Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Meryem Jabrane,Vicente Grau,Shahnaz Jamil-Copley,Richard H. Clayton,Chen Chen
Main category: cs.CV
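TL;DR: The paper proposes a multimodal framework that integrates ECG-derived physiological information with the AHA-17 anatomical atlas and uses Temporal Aware Feature Fusion (TAFF) to weight and fuse features dynamically, improving myocardial scar segmentation from LGE-MRI.
Details
Motivation: Existing LGE-MRI scar segmentation methods are limited by variable contrast and imaging artifacts. ECG signals carry complementary physiological information that can help localize scarred regions.
Contribution: 1. Proposes a multimodal framework that integrates ECG and the AHA-17 anatomical atlas; 2. Introduces the TAFF mechanism for dynamic feature weighting; 3. Substantially improves segmentation (Dice rises from 0.6149 to 0.8463).
Method: Designs the TAFF mechanism, which dynamically fuses features according to the acquisition time difference between ECG and LGE-MRI, and incorporates anatomical priors to improve the segmentation model.
Result: On a clinical dataset the method reaches Dice 0.8463, precision 0.9115, and sensitivity 0.9043, a substantial gain over the image-only nnU-Net baseline.
Insight: Integrating physiological and anatomical knowledge lets the model learn “beyond the image”, opening a new direction for cardiac scar segmentation.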
Abstract: Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to “see beyond the image”, setting a new direction for robust and physiologically grounded cardiac scar segmentation.
[119] FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
Yunfeng Wu,Jiayi Song,Zhenxiong Tan,Zihao He,Songhua Liu
Main category: cs.CV
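TL;DR: FreeSwim is a training-free method for generating ultra-high-resolution video with pretrained video Diffusion Transformers. It uses an inward sliding-window attention mechanism and a dual-path pipeline to address the repetitive content and global incoherence caused by local window attention, and improves efficiency with cross-attention caching.
Details
Motivation: The quadratic complexity of attention makes it prohibitively expensive to train Transformer-based video generators at ultra-high resolution. FreeSwim aims to generate such video from pretrained models without additional training or adaptation.
Contribution: 1. Introduces a training-free approach to ultra-high-resolution video generation; 2. Designs an inward sliding-window attention mechanism; 3. Proposes a dual-path pipeline with a cross-attention override strategy; 4. Achieves efficient generation with global coherence.
Method: 1. Inward sliding-window attention preserves each query token's training-scale receptive field; 2. A dual-path pipeline pairs local window attention with a full-receptive-field branch that guides it through cross-attention override; 3. A cross-attention caching strategy reduces computation.
Result: The method outperforms even training-based alternatives on VBench and produces ultra-high-resolution videos with fine-grained visual detail at high efficiency.
Insight: Preserving the training-scale receptive field of local attention is the key, and combining it with global cross-attention guidance resolves repetitive content and global incoherence.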
Abstract: The quadratic time and memory complexity of the attention mechanism in modern Transformer based video generators makes end-to-end training for ultra high resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token’s training scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention, unfortunately, often leads to repetitive content and exhibits a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Codes are available at: https://github.com/WillWu111/FreeSwim
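One way to read "inward sliding window" is that a query near the border keeps a full-size window by shifting it toward the interior instead of truncating it, so every query always attends to the same number of keys as at training scale. The sketch below shows that reading for a single axis. This interpretation, and the names `inward_window` and `window_size`, are assumptions; the abstract does not give the exact definition.

```python
def inward_window(query_index, sequence_length, window_size):
    """Return the [start, end) key range for one query along one axis.

    The window is centered on the query where possible and shifted inward
    at the borders so that it always spans exactly window_size keys.
    """
    if window_size >= sequence_length:
        return 0, sequence_length
    start = query_index - window_size // 2
    start = max(0, min(start, sequence_length - window_size))
    return start, start + window_size
```

For example, with 10 tokens and a window of 4, the query at index 0 would attend to keys 0 to 3 and the query at index 9 to keys 6 to 9, rather than to a truncated window of two or three keys.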
[120] Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
Xiyuan Wang,Muhan Zhang
Main category: cs.CV
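TL;DR: The paper proposes Diffusion as Self-Distillation (DSD), which unifies the encoder, decoder, and diffusion network into a single end-to-end trainable network, addressing the inefficiency and suboptimal performance caused by the modular design of standard Latent Diffusion Models (LDMs). By drawing an analogy to self-distillation, DSD stabilizes latent-space training and achieves strong generation results.
Details
Motivation: Standard LDMs use a three-part architecture (encoder, decoder, and diffusion network) trained in multiple stages, which is computationally inefficient and yields suboptimal performance. The paper aims to unify these modules into a single end-to-end network to improve efficiency and simplify the architecture.
Contribution: 1) Identifies “latent collapse” as the reason naive joint training fails; 2) Proposes the DSD framework, which modifies the training objective to stabilize latent-space training; 3) Achieves, for the first time, stable end-to-end training of a single-network latent diffusion model.
Method: Building on the analogy with self-distillation, DSD modifies the training objective so that the diffusion objective does not interfere with learning the latent representation. These stabilizing modifications allow encoding, decoding, and diffusion to be trained jointly in one network.
Result: On ImageNet 256×256 conditional generation, DSD reaches FID 13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs, without classifier-free guidance.
Insight: Viewing diffusion as analogous to self-distillation resolves the stability problem in latent-space training and suggests a route to simpler diffusion architectures.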
Abstract: Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to “latent collapse”, where the diffusion training objective interferes with the network’s ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.
[121] Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising
Yifan Wang,Liya Ji,Zhanghan Ke,Harry Yang,Ser-Nam Lim,Qifeng Chen
Main category: cs.CV
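TL;DR: The paper proposes a zero-shot framework that enhances the realism of synthetic videos through structure-aware denoising. It requires no fine-tuning, builds on a diffusion video foundation model, and conditions generation on structural information such as depth maps, semantic maps, and edge maps.
Details
Motivation: Synthetic videos often lack realism, and existing methods typically require fine-tuning or depend on simulator internals. The paper aims to improve realism in a zero-shot manner while preserving structural and semantic consistency.
Contribution: 1. Proposes a zero-shot framework for enhancing synthetic video realism; 2. Guides the denoising process with structure-aware information (depth maps, semantic maps, and edge maps); 3. Requires neither fine-tuning nor simulator internals.
Method: Built on a diffusion video foundation model, the method uses an auxiliary model to estimate structural information (depth, semantics, and edges) from the synthetic video and conditions the denoising process on it, keeping the enhanced video structurally and semantically consistent with the original.
Result: In experiments, the method outperforms existing baselines in structural consistency while maintaining state-of-the-art photorealism.
Insight: Conditioning on structural information markedly improves the structural consistency of generated video, underscoring the value of structural cues in video generation.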
Abstract: We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures from synthetic videos into the enhanced one in both spatial and temporal domains, built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on estimated structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps, by an auxiliary model, rather than extracting the information from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.
[122] Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
Alexander Vedernikov,Puneet Kumar,Haoyu Chen,Tapio Seppänen,Xiaobai Li
Main category: cs.CV
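TL;DR: The paper uses Vision Large Language Models (VLMs) to handle subjective and noisy labels in video engagement datasets: a questionnaire extracts behavioral cues and splits the data by reliability, and curriculum learning combined with soft label refinement improves model performance.
Details
Motivation: Engagement recognition in video datasets is limited by subjective and noisy labels, a difficulty that sets it apart from traditional image classification.
Contribution: Proposes a VLM-based framework that mitigates label noise through data splitting and a curriculum learning strategy, improving model performance.
Method: Uses VLMs to extract behavioral cues and split the data into high- and low-reliability subsets, then trains with curriculum learning and soft label refinement, gradually incorporating ambiguous samples.
Result: Surpasses prior state of the art on EngageNet (three of six feature settings, maximum improvement of +1.21%) and on DREAMS and PAFE (F1 gains of +0.22 and +0.06).
Insight: VLMs show promise for handling subjective and noisy labels, and curriculum learning with data splitting improves model robustness.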
Abstract: Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.
[123] UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning
Rui Tian,Mingfei Gao,Haiming Gang,Jiasen Lu,Zhe Gan,Yinfei Yang,Zuxuan Wu,Afshin Dehghan
Main category: cs.CV
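TL;DR: UniGen-1.5 is a unified multimodal large language model that jointly improves image generation and editing through a unified reinforcement learning strategy with shared reward models, and adds a light alignment stage to improve comprehension of editing instructions.
Details
Motivation: Existing image generation and editing models are often trained separately, without joint optimization. UniGen-1.5 aims to improve both capabilities through a unified reinforcement learning strategy.
Contribution: 1. Proposes a unified reinforcement learning strategy that jointly optimizes image generation and editing via shared reward models; 2. Introduces a light Edit Instruction Alignment stage that significantly improves comprehension of editing instructions.
Method: 1. Comprehensively enhances UniGen's model architecture and training pipeline; 2. Applies a unified reinforcement learning strategy with shared reward models to both generation and editing; 3. Adds an Edit Instruction Alignment stage to improve instruction comprehension.
Result: Achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
Insight: Jointly optimizing generation and editing improves the model's multimodal capabilities, and the light alignment stage is key to editing performance.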
Abstract: We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 and 4.31 on GenEval and ImgEdit, respectively, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
[124] ARC Is a Vision Problem!
Keya Hu,Ali Cy,Linlu Qiu,Xiaoman Delores Ding,Runqian Wang,Yeyin Eva Zhu,Jacob Andreas,Kaiming He
Main category: cs.CV
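TL;DR: The paper reframes the Abstraction and Reasoning Corpus (ARC) as a vision problem rather than a language problem and proposes the Vision ARC (VARC) framework, which applies standard vision architectures such as ViT to image-to-image mapping and achieves strong performance.
Details
Motivation: Most existing work treats ARC as a language problem and addresses it with large language models (LLMs) or recurrent reasoning models, even though ARC tasks are inherently visual. The authors aim to reformulate and solve ARC within a vision paradigm.
Contribution: 1. Reformulates ARC as a vision problem; 2. Proposes the VARC framework, which handles ARC tasks with a standard vision architecture (ViT); 3. Trained from scratch, the model substantially outperforms existing methods that are also trained from scratch.
Method: 1. Represents ARC inputs on a “canvas” that can be processed like a natural image; 2. Uses a Vision Transformer (ViT) for image-to-image mapping; 3. Generalizes to unseen tasks through test-time training.
Result: VARC reaches 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods trained from scratch; it is competitive with leading LLMs and narrows the gap to average human performance.
Insight: 1. A vision-centric formulation suits ARC tasks; 2. A simple vision architecture such as ViT can achieve high performance within a standard framework; 3. Test-time training is an effective generalization strategy.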
Abstract: The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a “canvas” that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
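The canvas idea can be illustrated with a small sketch that places an ARC grid of color indices onto a larger fixed-size canvas. The abstract only states that inputs are represented on a canvas; the integer `scale`, the `offset`, the canvas size, and the `background` value below are hypothetical parameters chosen for illustration.

```python
def place_on_canvas(grid, canvas_size=8, scale=2, offset=(1, 1), background=-1):
    """Paint a 2D grid of color indices onto a square canvas.

    Each grid cell becomes a scale x scale block; the grid's top-left corner
    lands at offset = (row, column). Unpainted cells keep the background value.
    """
    rows, cols = len(grid), len(grid[0])
    top, left = offset
    if top + rows * scale > canvas_size or left + cols * scale > canvas_size:
        raise ValueError("grid does not fit on the canvas")
    canvas = [[background] * canvas_size for _ in range(canvas_size)]
    for r in range(rows):
        for c in range(cols):
            for dr in range(scale):
                for dc in range(scale):
                    canvas[top + r * scale + dr][left + c * scale + dc] = grid[r][c]
    return canvas
```

Once every grid lives on a canvas of the same size, a standard patch-based vision model can consume it like an image, and varying the scale or offset would be one way to apply image-style augmentation.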
eess.IV [Back]
[125] Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar
Rongsheng Qian,Chi Xu,Xiaoqiang Ma,Hao Fang,Yili Jin,William I. Atlas,Jiangchuan Liu
Main category: eess.IV
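TL;DR: SCOPE is a self-supervised framework that jointly handles compression and artifact correction for underwater imaging sonar through Adaptive Codebook Compression and Frequency-Aware Multiscale Segmentation, improving image quality and bandwidth efficiency.
Details
Motivation: Optical sensing is unreliable underwater, so real-time imaging sonar has become an important tool, but its wider use is constrained by limited bandwidth and severe artifacts (speckle, motion blur, and others).
Contribution: 1) Proposes SCOPE, which jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions; 2) Introduces Adaptive Codebook Compression (ACC) and Frequency-Aware Multiscale Segmentation (FAMS).
Method: 1) ACC learns frequency-encoded latent representations; 2) FAMS decomposes frames into low-frequency structure and sparse high-frequency dynamics; 3) A hedging training strategy guides learning with low-pass proxy pairs generated without labels.
Result: On ARIS sonar data, SSIM reaches 0.77 (a 40% improvement over prior self-supervised denoising baselines) at bitrates down to 0.0118 bpp; uplink bandwidth falls by more than 80%; and the system runs in real time (3.1 ms encoding, 97 ms decoding).
Insight: Frequency-structured latent representations are key to low-bitrate sonar streaming that preserves signal detail.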
Abstract: Real-time imaging sonar has become an important tool for underwater monitoring in environments where optical sensing is unreliable. Its broader use is constrained by two coupled challenges: highly limited uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) that affect up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE combines (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar, with (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A hedging training strategy further guides frequency-aware learning using low-pass proxy pairs generated without labels. Evaluated on months of in-situ ARIS sonar data, SCOPE achieves a structural similarity index (SSIM) of 0.77, representing a 40% improvement over prior self-supervised denoising baselines, at bitrates down to <= 0.0118 bpp. It reduces uplink bandwidth by more than 80% while improving downstream detection. The system runs in real time, with 3.1 ms encoding on an embedded GPU and 97 ms full multi-layer decoding on the server end. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild. Results demonstrate that learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions.
[126] ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
Junsik Kim,Gun Bang,Soowoong Kim
Main category: eess.IV
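TL;DR: ELiC is an efficient LiDAR geometry compression framework that combines cross-bit-depth feature propagation, Bag-of-Encoders selection, and a Morton-order-preserving hierarchy to reach real-time throughput with state-of-the-art compression on Ford and SemanticKITTI.
Details
Motivation: Existing LiDAR geometry compression methods encode each depth level independently, re-estimating local context at every level and limiting compression efficiency. ELiC aims to improve efficiency through feature reuse and adaptive encoder selection.
Contribution: 1. Cross-bit-depth feature propagation; 2. A Bag-of-Encoders (BoE) selection scheme; 3. A Morton-order-preserving hierarchy.
Method: 1. Reuses features extracted at lower depths to support prediction at higher depths; 2. Selects, for each depth, the most suitable coding network from a pool of encoders; 3. Maintains global Morton order across depth transitions, which removes per-level sorting and reduces latency.
Result: Achieves real-time compression with state-of-the-art efficiency on the Ford and SemanticKITTI datasets.
Insight: Cross-level feature reuse and adaptive encoder selection are key to compression efficiency, and preserving global order reduces computational overhead.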
Abstract: Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI. Code and models will be released upon publication.
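Why a Morton (Z-order) hierarchy avoids per-level sorting can be seen from a property of the code itself: dropping the lowest three bits of a voxel's Morton code gives the Morton code of its parent voxel at the next coarser bit-depth. The sketch below demonstrates that property; it is a generic illustration of Morton coding, not ELiC's implementation, and the function names are hypothetical.

```python
def morton_encode(x, y, z, bits):
    """Interleave the low `bits` bits of x, y, z into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def parent_codes(sorted_child_codes):
    """Derive the sorted, de-duplicated parent codes without re-sorting.

    Because code >> 3 is monotone, a sorted child list yields a sorted
    parent list; duplicates are adjacent and can be skipped in one pass.
    """
    parents = []
    for code in sorted_child_codes:
        parent = code >> 3
        if not parents or parents[-1] != parent:
            parents.append(parent)
    return parents
```

Read in the other direction, the eight children of a parent occupy one contiguous block of codes, so expanding a sorted coarse level into the next finer level also keeps the voxels in Z-order.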
cs.CR [Back]
[127] GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
Yule Liu,Heyi Zhang,Jinyi Zheng,Zhen Sun,Zifan Peng,Tianshuo Cong,Yilong Yang,Xinlei He,Zhuo Ma
Main category: cs.CR
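TL;DR: The paper proposes DIBA, the first membership inference attack framework targeting RLVR. It infers whether data was used in training from behavioral change rather than memorization, and it significantly outperforms existing baselines.
Details
Motivation: RLVR introduces a new privacy leakage pattern in LLM training: training data can be inferred from changes in model behavior rather than from memorized outputs. This risk calls for systematic analysis.
Contribution: Proposes DIBA, the first membership inference attack designed specifically for RLVR, and shows that behavioral change is a new channel for privacy leakage.
Method: DIBA builds its attack on behavioral change along two axes: advantage-side improvement and logit-side divergence.
Result: DIBA significantly outperforms baselines, reaching around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR, and its advantage holds across multiple settings.
Insight: Even without explicit supervision, training data can be inferred from behavioral traces, which exposes a privacy vulnerability in RLVR.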
Abstract: Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) was used during fine-tuning. This creates a threat where leakage arises not from answer memorization. To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. We validate DIBA’s superiority across multiple settings, including in-distribution, cross-dataset, cross-algorithm, and black-box scenarios, as well as extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures. To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces.
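The two reported metrics, AUC and TPR at a fixed low FPR, can be computed from attack scores as in the sketch below. It assumes that a higher score means "more likely a training member" and that a sample is flagged when its score is strictly above the threshold; the function names and the score lists are illustrative and not taken from the paper.

```python
def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


def tpr_at_fpr(member_scores, non_member_scores, max_fpr=0.001):
    """True-positive rate at the strictest threshold that keeps FPR <= max_fpr."""
    allowed = int(max_fpr * len(non_member_scores))  # tolerated false positives
    ranked = sorted(non_member_scores, reverse=True)
    if allowed >= len(ranked):
        return 1.0
    threshold = ranked[allowed]  # flag only scores strictly above this value
    return sum(m > threshold for m in member_scores) / len(member_scores)
```

TPR at 0.1% FPR matters for auditing because it measures how many members an attacker can identify while making almost no false accusations, which average-case metrics such as AUC can hide.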
cs.LG [Back]
[128] Certified but Fooled! Breaking Certified Defences with Ghost Certificates
Quoc Viet Vo,Tashreque M. Haq,Paul Montague,Tamas Abraham,Ehsan Abbasnejad,Damith C. Ranasinghe
Main category: cs.LG
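TL;DR: The paper studies how small, imperceptible perturbations can fool certified defenses into issuing false robustness guarantees, exposing limits of existing certification methods.
Details
Motivation: Certified defenses offer provable robustness guarantees, but the certification process itself may be manipulated to produce spoofed certificates.
Contribution: Introduces “ghost certificates” and shows that small perturbations can coax a certified model into issuing a deceptively large robustness radius for a target class.
Method: Uses region-focused adversarial examples to craft perturbations that are small yet sufficient to mislead the certification process, and evaluates them against certified defenses on ImageNet.
Result: Experiments show that the method fools state-of-the-art certified defenses such as Densepure and obtains spoofed certification radii.
Insight: The work underscores the limits of robustness certification methods and the need to understand them better.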
Abstract: Certified defenses promise provable robustness guarantees. We study the malicious exploitation of probabilistic certification frameworks to better understand the limits of guarantee provisions. Now, the objective is to not only mislead a classifier, but also manipulate the certification process to generate a robustness guarantee for an adversarial input (certificate spoofing). A recent study in ICLR demonstrated that crafting large perturbations can shift inputs far into regions capable of generating a certificate for an incorrect class. Our study investigates whether the perturbations needed to cause a misclassification and yet coax a certified model into issuing a deceptive, large robustness radius for a target class can still be made small and imperceptible. We explore the idea of region-focused adversarial examples to craft imperceptible perturbations, spoof certificates, and achieve certification radii larger than those of the source class (ghost certificates). Extensive evaluations on ImageNet demonstrate the ability to effectively bypass state-of-the-art certified defenses such as Densepure. Our work underscores the need to better understand the limits of robustness certification methods.
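For readers unfamiliar with how a probabilistic certificate produces a radius, the sketch below uses the well-known randomized smoothing bound of Cohen et al. (2019), where the certified L2 radius is sigma times the inverse normal CDF of a lower bound on the top-class probability. That the defenses studied here use exactly this bound is an assumption; the abstract only says "probabilistic certification frameworks".

```python
from statistics import NormalDist


def certified_radius(p_a_lower, sigma):
    """L2 radius certified by randomized smoothing, or None to abstain.

    p_a_lower: lower confidence bound on the probability that the base
               classifier returns the predicted class under Gaussian noise.
    sigma:     standard deviation of the smoothing noise.
    """
    if not 0.0 <= p_a_lower < 1.0:
        raise ValueError("p_a_lower must lie in [0, 1)")
    if p_a_lower <= 0.5:
        return None  # no certificate can be issued
    return sigma * NormalDist().inv_cdf(p_a_lower)
```

Under this bound the radius depends only on how consistently the noisy copies of an input vote for one class, not on whether that class is correct, which suggests why an adversarial input whose noisy neighborhood votes consistently for a wrong class could receive a large certificate.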
cs.RO
[129] RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action
Xiaoquan Sun,Ruijian Zhang,Kang Pang,Bingchen Miao,Yuxiang Tan,Zhen Yang,Ming Li,Jiayu Chen
Main category: cs.RO
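TL;DR: RoboTidy is a 3D Gaussian Splatting benchmark for household tidying that supports training and evaluation of language-guided Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) models, providing high-quality scenes, objects, and demonstration trajectories.
Details
Motivation: Current household tidying benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities.
Contribution: Proposes the RoboTidy benchmark, which provides 500 3D Gaussian Splatting household scenes, 6.4k manipulation demonstrations, and 1.5k navigation trajectories, supporting training and evaluation of language-guided robot manipulation and navigation.
Method: Builds scenes with 3D Gaussian Splatting, formulates tidying tasks as an "Action (Object, Container)" list, and supports training with demonstration trajectories.
Result: RoboTidy is deployed in the real world for end-to-end household tidying, providing a comprehensive evaluation platform for embodied AI.
Insight: RoboTidy fills a gap in evaluating language-guided robots on household tasks and provides important support for making embodied AI practical.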
Abstract: Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an “Action (Object, Container)” list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k navigation trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
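The "Action (Object, Container)" formulation can be pictured as a simple list of typed steps. The sketch below is purely illustrative: the class name `TidyStep`, its field names, and the helper `containers_used` are hypothetical and do not come from the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TidyStep:
    """One entry of a tidying plan (hypothetical schema)."""
    action: str
    obj: str
    container: str

    def __str__(self):
        return f"{self.action} ({self.obj}, {self.container})"


def containers_used(plan):
    """Map each container to the objects placed in it, in plan order."""
    placed = {}
    for step in plan:
        placed.setdefault(step.container, []).append(step.obj)
    return placed


plan = [
    TidyStep("Place", "mug", "cabinet"),
    TidyStep("Place", "sock", "laundry basket"),
    TidyStep("Place", "plate", "cabinet"),
]
```

Representing a plan as an ordered list keeps the target state (which object ends up in which container) separate from the low-level manipulation and navigation trajectories that realize each step.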
[130] Going Places: Place Recognition in Artificial and Natural Systems
Michael Milford,Tobias Fischer
Main category: cs.RO
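TL;DR: This review discusses the importance of place recognition for biological navigation and autonomous systems, summarizes methods and findings from robotic systems, animal studies, and human research, and proposes a unifying set of concepts to advance artificial place recognition.
Details
Motivation: Place recognition is a core capability of biological navigation and autonomous systems, yet a systematic summary and comparison of place recognition approaches across artificial, animal, and human systems has been lacking. The paper aims to fill this gap and provide cross-disciplinary insights for the development of artificial systems.
Contribution: The main contributions are: (1) a summary of the computational and representational strategies used for place recognition in artificial systems, animals, and humans; (2) a unifying conceptual framework; and (3) identification of key future challenges such as generalization and robustness.
Method: The paper takes a cross-disciplinary review approach, analyzing topological map building in robotic systems, multimodal navigation mechanisms in animals, and semantic place concepts and cultural influences in humans.
Result: The paper finds that artificial systems tend to use scalable, data-driven models, whereas biological systems rely on multimodal cues and adaptive mechanisms. Human studies reveal the influence of semantics and culture on environmental cognition.
Insight: Navigation mechanisms in biological systems, such as multimodal cue integration, can inspire the design of artificial systems, while the scalability of artificial systems offers a new perspective for studying biological navigation.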
Abstract: Place recognition, the ability to identify previously visited locations, is critical for both biological navigation and autonomous systems. This review synthesizes findings from robotic systems, animal studies, and human research to explore how different systems encode and recall place. We examine the computational and representational strategies employed across artificial systems, animals, and humans, highlighting convergent solutions such as topological mapping, cue integration, and memory management. Animal systems reveal evolved mechanisms for multimodal navigation and environmental adaptation, while human studies provide unique insights into semantic place concepts, cultural influences, and introspective capabilities. Artificial systems showcase scalable architectures and data-driven models. We propose a unifying set of concepts by which to consider and develop place recognition mechanisms and identify key challenges such as generalization, robustness, and environmental variability. This review aims to foster innovations in artificial localization by connecting future developments in artificial place recognition systems to insights from both animal navigation research and human spatial cognition studies.
[131] Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning
Xiuxiu Qi,Yu Yang,Jiannong Cao,Luyao Bai,Chongshan Fan,Chengtai Cao,Hongpeng Wang
Main category: cs.RO
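TL;DR: The paper proposes CCoL, a new framework that uses continuous vision-language-action co-learning with semantic-physical alignment to address compounding errors in behavioral cloning, improving the temporal consistency of policy execution and semantic grounding.
Details
Motivation: Behavioral cloning (BC) for language-conditioned manipulation faces compounding errors and semantic-physical misalignment; existing methods suffer from physical discontinuities and intermittent execution.
Contribution: Proposes the CCoL framework, which achieves semantic-physical alignment and coherent action generation through cross-modal co-learning and a bidirectional attention mechanism.
Method: Performs continuous co-learning over vision, language, and proprioceptive inputs, combined with bidirectional cross-attention that aligns language semantics with visuomotor representations.
Result: Achieves an average relative improvement of 8.0% across three simulation suites and up to 19.2% relative gain on bimanual insertion tasks, and shows good generalization in real-robot tests.
Insight: Semantic-physical alignment and cross-modal co-learning are key to improving behavioral cloning performance, and can markedly reduce execution errors, particularly in complex tasks.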
Abstract: Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL’s generalization under unseen and noisy object states.
cs.AI
[132] AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance
Chandrachur Bhattacharya,Sibendu Som
Main category: cs.AI
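TL;DR: AISAC is an integrated multi-agent system for transparent, retrieval-grounded scientific assistance, built on LangGraph, FAISS, and SQLite.
Details
Motivation: Addresses the lack of transparency and adaptability in scientific workflows by providing a traceable, flexible assistant.
Contribution: Proposes a transparent, customizable multi-agent system that supports semantic retrieval and structured history, suitable for cross-domain scientific tasks.
Method: Uses a Router-Planner-Coordinator workflow with a hybrid FAISS + SQLite memory, and achieves rapid customization through a configuration-driven design.
Result: Successfully applied to areas such as waste-to-products research and energy process safety, demonstrating the system's cross-domain applicability.
Insight: Designed for transparency and flexibility, AISAC can effectively support complex scientific workflows while adapting to different research needs.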
Abstract: AI Scientific Assistant Core (AISAC) is an integrated multi-agent system developed at Argonne National Laboratory for scientific and engineering workflows. AISAC builds on established technologies - LangGraph for orchestration, FAISS for vector search, and SQLite for persistence - and integrates them into a unified system prototype focused on transparency, provenance tracking, and scientific adaptability. The system implements a Router-Planner-Coordinator workflow and an optional Evaluator role, using prompt-engineered agents coordinated via LangGraph’s StateGraph and supported by helper agents such as a Researcher. Each role is defined through custom system prompts that enforce structured JSON outputs. A hybrid memory approach (FAISS + SQLite) enables both semantic retrieval and structured conversation history. An incremental indexing strategy based on file hashing minimizes redundant re-embedding when scientific corpora evolve. A configuration-driven project bootstrap layer allows research teams to customize tools, prompts, and data sources without modifying core code. All agent decisions, tool invocations, and retrievals are logged and visualized through a custom Gradio interface, providing step-by-step transparency for each reasoning episode. The authors have applied AISAC to multiple research areas at Argonne, including specialized deployments for waste-to-products research and energy process safety, as well as general-purpose scientific assistance, demonstrating its cross-domain applicability.
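The incremental indexing idea (re-embed only what changed, detected by file hashing) can be sketched as follows. This is a minimal illustration under stated assumptions rather than AISAC's implementation: the function name `files_to_reembed`, the in-memory `documents` mapping, the `manifest` structure, and the choice of SHA-256 are all hypothetical.

```python
import hashlib


def files_to_reembed(documents, manifest):
    """Return (changed_names, new_manifest).

    documents: dict mapping a document name to its raw bytes.
    manifest:  dict mapping a document name to the SHA-256 hex digest
               recorded at the previous indexing run.
    """
    new_manifest = {}
    changed = []
    for name, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        new_manifest[name] = digest
        # New files have no manifest entry; edited files have a stale digest.
        if manifest.get(name) != digest:
            changed.append(name)
    return sorted(changed), new_manifest
```

Only the names in `changed_names` would need to be re-embedded and written to the vector index. Documents that disappear from the corpus drop out of `new_manifest`, so a caller can diff the two manifests to find vectors to delete.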
[133] DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning
Xiaochuan Liu,Yuanfeng Song,Xiaoming Yin,Xing Chen
Main category: cs.AI
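TL;DR: DataSage is a multi-agent framework that uses external knowledge retrieval, a multi-role debating mechanism, and multi-path reasoning to address the insufficient use of domain knowledge, shallow analytical depth, and error-prone code generation of current data insight agents.
Details
Motivation: Current data insight agents fall short in their use of domain knowledge, analytical depth, and code generation accuracy, and struggle to meet the needs of automated data analysis and insight discovery.
Contribution: Proposes the DataSage framework, which enriches the analytical context through external knowledge retrieval, simulates diverse analytical perspectives through multi-role debating, and improves the accuracy of generated code and insights through multi-path reasoning.
Method: Combines external knowledge retrieval, multi-role debating, and multi-path reasoning, using multi-agent collaboration to improve the depth and accuracy of data insights.
Result: Experiments on InsightBench show that DataSage outperforms existing data insight agents across all difficulty levels.
Insight: Multi-agent collaboration and diverse mechanisms can substantially improve the depth and accuracy of data insights, offering a new approach to automated data analysis.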
Abstract: In today’s data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
[134] KANGURA: Kolmogorov-Arnold Network-Based Geometry-Aware Learning with Unified Representation Attention for 3D Modeling of Complex Structures
Mohammad Reza Shafie,Morteza Hajiabadi,Hamed Khosravi,Mobina Noori,Imtiaz Ahmed
Main category: cs.AI
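TL;DR: KANGURA is a Kolmogorov-Arnold Network-based geometry-aware learning method that uses unified representation attention to improve 3D modeling of complex structures, significantly improving model performance.
Details
Motivation: The performance of microbial fuel cells (MFCs) depends on anode structure design, and existing models struggle to capture the complex geometric dependencies involved. KANGURA aims to address this with a more accurate 3D modeling approach.
Contribution: Proposes the KANGURA framework, which combines Kolmogorov-Arnold Networks with geometry-aware learning and dynamically enhances the representation of critical geometric regions to improve modeling accuracy.
Method: Uses KAN-based representation learning to reconstruct geometric relationships instead of a conventional MLP, and refines spatial understanding through geometry-disentangled representation learning and a unified attention mechanism.
Result: Achieves 92.7% accuracy on ModelNet40, outperforming over 15 SOTA models, and reaches 97% accuracy on a real-world MFC anode structure problem.
Insight: Through geometric disentanglement and dynamic attention, KANGURA offers a new approach to 3D modeling of complex structures, applicable to advanced manufacturing and quality-driven engineering.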
Abstract: Microbial Fuel Cells (MFCs) offer a promising pathway for sustainable energy generation by converting organic matter into electricity through microbial processes. A key factor influencing MFC performance is the anode structure, where design and material properties play a crucial role. Existing predictive models struggle to capture the complex geometric dependencies necessary to optimize these structures. To solve this problem, we propose KANGURA: Kolmogorov-Arnold Network-Based Geometry-Aware Learning with Unified Representation Attention. KANGURA introduces a new approach to three-dimensional (3D) machine learning modeling. It formulates prediction as a function decomposition problem, where Kolmogorov-Arnold Network (KAN)-based representation learning reconstructs geometric relationships without a conventional multi-layer perceptron (MLP). To refine spatial understanding, geometry-disentangled representation learning separates structural variations into interpretable components, while unified attention mechanisms dynamically enhance critical geometric regions. Experimental results demonstrate that KANGURA outperforms over 15 state-of-the-art (SOTA) models on the ModelNet40 benchmark dataset, achieving 92.7% accuracy, and excels in a real-world MFC anode structure problem with 97% accuracy. This establishes KANGURA as a robust framework for 3D geometric modeling, unlocking new possibilities for optimizing complex structures in advanced manufacturing and quality-driven engineering applications.
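For context on the "function decomposition" framing, KANs take their name from the Kolmogorov-Arnold representation theorem, which says that any continuous function of n variables on a bounded domain can be written using only sums and continuous univariate functions. The statement below is the classical theorem in standard notation (f, the outer functions Φ_q, and the inner functions φ_{q,p}); it is background, not KANGURA's specific architecture or the paper's own notation.

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

A KAN layer makes the univariate functions learnable, placing them on the network's edges where an MLP would have fixed activations on its nodes.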
[135] Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Sanjay Acharjee,Abir Khan Ratul,Diego Patino,Md Nazmus Sakib
Main category: cs.AI
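TL;DR: The paper proposes a scene graph-guided generative AI framework for synthesizing and evaluating industrial hazard scenarios. It combines historical OSHA accident reports with GPT-4o analysis to generate photorealistic images, and evaluates the realism and semantic fidelity of the generated data with a VQA framework.
Details
Motivation: Training vision models to accurately detect workplace hazards requires realistic images of hazardous scenes, but such data is hard to obtain because accident-triggering scenarios are difficult to capture as they occur. The study proposes a generative AI framework to address this.
Contribution: The main contributions are: (1) a scene graph-guided generative AI framework that generates photorealistic hazard scene images; (2) use of GPT-4o to extract structured hazard reasoning from OSHA reports; and (3) the VQA Graph Score evaluation method, which outperforms existing CLIP and BLIP metrics.
Method: The method has three steps: (1) analyze OSHA reports with GPT-4o to extract structured hazard reasoning and build scene graphs; (2) use the scene graphs to guide a text-to-image diffusion model in generating hazard scenes; (3) evaluate the realism and semantic accuracy of the generated images with a VQA framework.
Result: Experiments show that the proposed VQA Graph Score outperforms existing metrics such as CLIP and BLIP in discriminating the fidelity and realism of generated images, confirming its higher discriminative sensitivity.
Insight: The study demonstrates the potential of generative AI for industrial safety, offering a new approach to synthetic data generation and quality assessment, especially in data-scarce settings.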
Abstract: Training vision models to detect workplace hazards accurately requires realistic images of unsafe conditions that could lead to accidents. However, acquiring such datasets is difficult because capturing accident-triggering scenarios as they occur is nearly impossible. To overcome this limitation, this study presents a novel scene graph-guided generative AI framework that synthesizes photorealistic images of hazardous scenarios grounded in historical Occupational Safety and Health Administration (OSHA) accident reports. OSHA narratives are analyzed using GPT-4o to extract structured hazard reasoning, which is converted into object-level scene graphs capturing spatial and contextual relationships essential for understanding risk. These graphs guide a text-to-image diffusion model to generate compositionally accurate hazard scenes. To evaluate the realism and semantic fidelity of the generated data, a visual question answering (VQA) framework is introduced. Across four state-of-the-art generative models, the proposed VQA Graph Score outperforms CLIP and BLIP metrics based on entropy-based validation, confirming its higher discriminative sensitivity.
cs.DL
[136] SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature
Hang Ding,Yilun Zhao,Tiansheng Hu,Manasi Patwardhan,Arman Cohan
Main category: cs.DL
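TL;DR: SciRAG is an open-source framework for scientific literature exploration that uses adaptive retrieval, citation-aware symbolic reasoning, and outline-guided synthesis to address the shortcomings of conventional RAG methods: neglect of citation graph structure, poor adaptation to complex queries, and fragmented synthesis.
Details
Motivation: The rapid growth of scientific literature creates demand for scalable, trustworthy knowledge synthesis systems. Conventional RAG methods overlook citation graph structure, adapt poorly to complex queries, and produce fragmented, hard-to-verify syntheses.
Contribution: The main contributions are: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that uses citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that ensures coherent answers and transparent attribution.
Method: SciRAG combines adaptive retrieval, citation-aware symbolic reasoning, and outline-guided synthesis. Adaptive retrieval dynamically adjusts how evidence is gathered; symbolic reasoning uses the citation graph to improve document organization; outline-guided synthesis plans, critiques, and refines answers to ensure quality.
Result: On multiple benchmarks such as QASA and ScholarQA, SciRAG outperforms existing systems in factual accuracy and synthesis quality, laying a foundation for reliable, large-scale scientific knowledge aggregation.
Insight: SciRAG's novelty lies in combining citation graph structure with RAG, which improves the trustworthiness and adaptability of knowledge synthesis; the outline-guided approach improves the coherence and transparency of answers.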
Abstract: The accelerating growth of scientific publications has intensified the need for scalable, trustworthy systems to synthesize knowledge across diverse literature. While recent retrieval-augmented generation (RAG) methods have improved access to scientific information, they often overlook citation graph structure, adapt poorly to complex queries, and yield fragmented, hard-to-verify syntheses. We introduce SciRAG, an open-source framework for scientific literature exploration that addresses these gaps through three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution. Extensive experiments across multiple benchmarks such as QASA and ScholarQA demonstrate that SciRAG outperforms prior systems in factual accuracy and synthesis quality, establishing a new foundation for reliable, large-scale scientific knowledge aggregation.
cs.MM
[137] Can LLMs Create Legally Relevant Summaries and Analyses of Videos?
Lyra Hoeben-Kuil,Gijs van Dijck,Jaromir Savelka,Johanna Gunawan,Konrad Kollnig,Marta Kolacz,Mindy Duffourc,Shashank Chakravarthy,Hannes Westermann
Main category: cs.MM
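TL;DR: The paper examines whether large language models (LLMs) can summarize and analyze legally relevant content from videos; in the experiments, 71.7% of the summaries were rated as high or medium quality.
Details
Motivation: Legal professionals must extract the legally relevant facts of an event and convey them in text, which is challenging for laypeople. Existing AI approaches still require users to describe the event in text, so the paper examines whether LLMs can generate legal summaries and documents directly from videos.
Contribution: The paper is the first to evaluate LLMs' ability to generate legally relevant summaries and analyses from videos, and it validates their potential value for legal applications.
Method: The study uses 120 YouTube videos covering a range of legal domains, asks an LLM to summarize them and draft legal letters, and evaluates the quality of the output.
Result: 71.7% of the summaries were rated as high or medium quality, indicating that LLMs have potential for legal summarization and analysis.
Insight: LLMs can extract legally relevant content directly from videos, which may help bridge the gap between legal professionals and the general public, particularly for access to justice.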
Abstract: Understanding the legally relevant factual basis of an event and conveying it through text is a key skill of legal professionals. This skill is important for preparing forms (e.g., insurance claims) or other legal documents (e.g., court claims), but often presents a challenge for laypeople. Current AI approaches aim to bridge this gap, but mostly rely on the user to articulate what has happened in text, which may be challenging for many. Here, we investigate the capability of large language models (LLMs) to understand and summarize events occurring in videos. We ask an LLM to summarize and draft legal letters, based on 120 YouTube videos showing legal issues in various domains. Overall, 71.7% of the summaries were rated as being of high or medium quality, which is a promising result that opens the door to a number of applications in areas such as access to justice.
[138] MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals
Xuan-Hao Liu,Yan-Kai Liu,Tianyi Zhou,Bao-Liang Lu,Wei-Long Zheng
Main category: cs.MM
TL;DR: MindCross是一种新颖的跨被试视频重建框架,通过特定编码器和共享编码器分别提取被试特异性和不变性信息,并采用Top-K协作模块实现快速新被试适应,仅需少量数据。
Details
Motivation: 现有的脑解码框架多为被试依赖型,需要大量数据。然而,脑信号数据收集成本高昂,导致数据稀缺。现有跨被试方法过度关注被试不变信息,忽视特异性信息,导致适应速度慢。Contribution: 1. 提出MindCross框架,结合特异性编码器和共享编码器;2. 引入Top-K协作模块,利用已有知识增强新被试解码;3. 在fMRI/EEG到视频的任务中验证其高效性。
Method: 1. 使用N个特异性编码器和1个共享编码器分别提取被试特异性和不变性信息;2. 通过Top-K协作模块选择最相关的编码器,快速适应新被试;3. 仅需一个模型即可实现跨被试解码。
Result: 在fMRI/EEG到视频的基准测试中,MindCross表现出高效的跨被试解码能力,仅需少量数据即可快速适应新被试。
Insight: 1. 提取被试特异性信息对解码至关重要;2. Top-K协作模块显著提升新被试适应速度;3. 统一框架可减少模型复杂度。
Abstract: Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods have been introduced, they often overfocus on subject-invariant information while neglecting subject-specific information, resulting in slow fine-tuning-based adaptation. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross’s N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects’ encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross’s efficacy and efficiency in cross-subject decoding and new subject adaptation using only one model.
cs.NE
[139] Attention via Synaptic Plasticity is All You Need: A Biologically Inspired Spiking Neuromorphic Transformer
Kallol Mondal,Ankush Kumar
Main category: cs.NE
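TL;DR: The paper proposes a biologically inspired spiking neuromorphic Transformer based on synaptic plasticity (S²TDPT), which implements self-attention through spike-timing-dependent plasticity (STDP) and significantly improves energy efficiency and hardware friendliness.
Details
Motivation: Existing Transformers consume large amounts of energy, and their core attention layer is not fully neuromorphic, which limits deployment on event-driven hardware.
Contribution: Proposes the S²TDPT model, which implements self-attention through STDP, addressing the non-neuromorphic attention of existing spiking Transformers and significantly reducing energy consumption.
Method: Replaces conventional dot-product similarity with an STDP mechanism, embedding query-key correlations in synaptic weights, and supports in-memory computing and non-von Neumann hardware.
Result: Achieves 94.35% and 78.08% accuracy on CIFAR-10 and CIFAR-100, respectively, with an 88.47% reduction in energy consumption relative to a standard ANN Transformer.
Insight: STDP is a biologically inspired learning mechanism that enables efficient, hardware-friendly, and interpretable attention models.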
Abstract: Attention is the brain’s ability to selectively focus on a few specific aspects while ignoring irrelevant ones. This biological principle inspired the attention mechanism in modern Transformers. Transformers now underpin large language models (LLMs) such as GPT, but at the cost of massive training and inference energy, leading to a large carbon footprint. While brain attention emerges from neural circuits, Transformer attention relies on dot-product similarity to weight elements in the input sequence. Neuromorphic computing, especially spiking neural networks (SNNs), offers a brain-inspired path to energy-efficient intelligence. Despite recent work on attention-based spiking Transformers, the core attention layer remains non-neuromorphic. Current spiking attention (i) relies on dot-product or element-wise similarity suited to floating-point operations, not event-driven spikes; (ii) keeps attention matrices that suffer from the von Neumann bottleneck, limiting in-memory computing; and (iii) still diverges from brain-like computation. To address these issues, we propose the Spiking STDP Transformer (S$^{2}$TDPT), a neuromorphic Transformer that implements self-attention through spike-timing-dependent plasticity (STDP), embedding query–key correlations in synaptic weights. STDP, a core mechanism of memory and learning in the brain and widely studied in neuromorphic devices, naturally enables in-memory computing and supports non-von Neumann hardware. On CIFAR-10 and CIFAR-100, our model achieves 94.35% and 78.08% accuracy with only four timesteps and 0.49 mJ on CIFAR-100, an 88.47% energy reduction compared to a standard ANN Transformer. Grad-CAM shows that the model attends to semantically relevant regions, enhancing interpretability. Overall, S$^{2}$TDPT illustrates how biologically inspired attention can yield energy-efficient, hardware-friendly, and explainable neuromorphic models.
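As background on the mechanism the abstract builds on, the sketch below implements the textbook pair-based STDP rule: a synapse is strengthened when the presynaptic spike precedes the postsynaptic spike and weakened otherwise, with an effect that decays exponentially in the timing gap. This is the generic rule, not the paper's attention formulation; the function name `stdp_delta_w` and the amplitude and time-constant defaults are illustrative assumptions.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in ms).

    Pre-before-post (dt > 0) potentiates; post-before-pre (dt < 0) depresses.
    Amplitudes and time constants are illustrative defaults.
    """
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0  # simultaneous spikes: no change in this simple variant
```

Because the update depends only on spike timing at the synapse itself, it can be computed locally where the weight is stored, which is why the abstract links STDP to in-memory computing.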