cs.CL

[1] Op-Fed: Opinion, Stance, and Monetary Policy Annotations on FOMC Transcripts Using Active Learning

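Alisa Kanganis, Katherine A. Keith

Main category: cs.CL

TL;DR: The paper releases Op-Fed, a dataset of 1044 human-annotated sentences from FOMC transcripts together with their contexts. It addresses two technical challenges, class imbalance and inter-sentence dependence, and uses active learning to make annotation more efficient.

Motivation: FOMC monetary policy discussions strongly affect the public, yet relevant datasets are scarce and hard to annotate. The paper aims to handle class imbalance and inter-sentence dependence and to provide high-quality annotated data.

Contribution: 1. Releases the Op-Fed dataset, annotated for opinion, monetary policy, and stance; 2. Proposes a five-stage hierarchical annotation schema; 3. Uses active learning to roughly double the number of positive instances across all schema aspects.

Method: 1. Design a five-stage hierarchical annotation schema; 2. Select instances to annotate with active learning; 3. Evaluate a closed-weight large language model (LLM) in a zero-shot setting.

Result: The LLM reaches 0.80 zero-shot accuracy on opinion classification but only 0.61 on classifying stance towards monetary policy, below the human baseline of 0.89.

Insight: Active learning helps counter class imbalance; models still fall short on harder tasks such as stance classification; inter-sentence dependence is a major challenge in annotation.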

Abstract: The U.S. Federal Open Market Committee (FOMC) regularly discusses and sets monetary policy, affecting the borrowing and spending decisions of millions of people. In this work, we release Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts. We faced two major technical challenges in dataset creation: imbalanced classes – we estimate fewer than 8% of sentences express a non-neutral stance towards monetary policy – and inter-sentence dependence – 65% of instances require context beyond the sentence-level. To address these challenges, we developed a five-stage hierarchical schema to isolate aspects of opinion, monetary policy, and stance towards monetary policy as well as the level of context needed. Second, we selected instances to annotate using active learning, roughly doubling the number of positive instances across all schema aspects. Using Op-Fed, we found a top-performing, closed-weight LLM achieves 0.80 zero-shot accuracy in opinion classification but only 0.61 zero-shot accuracy classifying stance towards monetary policy – below our human baseline of 0.89. We expect Op-Fed to be useful for future model training, confidence calibration, and as a seed dataset for future annotation efforts.
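
Example (illustrative sketch, not from the paper): the abstract does not say which acquisition function was used for active learning. The snippet below shows two common pool-based choices, uncertainty sampling and positive-class sampling, under the assumption that a classifier has already produced a positive-class probability for each unlabeled sentence. The function name `select_for_annotation` and the sample probabilities are hypothetical.

```python
def select_for_annotation(probs, k, strategy="uncertainty"):
    """Return indices of the k unlabeled items to send to annotators.

    probs: predicted positive-class probability for each unlabeled item.
    """
    if strategy == "uncertainty":
        # Items closest to 0.5 are the ones the model is least sure about.
        score = lambda p: -abs(p - 0.5)
    elif strategy == "positive":
        # Favor likely positives, one way to counter class imbalance.
        score = lambda p: p
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(range(len(probs)), key=lambda i: score(probs[i]), reverse=True)
    return ranked[:k]


pool_probs = [0.02, 0.48, 0.91, 0.10, 0.55, 0.70]  # made-up model outputs
print(select_for_annotation(pool_probs, 2))              # [1, 4]
print(select_for_annotation(pool_probs, 2, "positive"))  # [2, 5]
```

With rare positive classes, sampling by predicted positive probability tends to surface more positive instances per annotation round than random sampling, which is the kind of effect the abstract reports.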

[2] Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning

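Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee, Haw-Shiuan Chang

Main category: cs.CL

TL;DR: The paper proposes an analysis framework that builds a transfer learning matrix and applies dimensionality reduction to study latent abilities and cross-task interactions when fine-tuning LLMs on different datasets. Performance gains turn out to track hidden statistical properties of the source dataset (such as class distribution and generation length proclivities) rather than surface similarity.

Motivation: Deployed large language models (LLMs) meet tasks they never saw during training, and collecting high-quality training data for every task is infeasible. Transfer learning is therefore needed, but the mechanisms of cross-task interaction are not well understood.

Contribution: Proposes an analysis framework that reveals the role of latent abilities (e.g., reasoning, sentiment classification) in transfer learning, and finds that performance gains are driven mainly by hidden statistical factors of the source dataset and by specific linguistic features.

Method: Builds a transfer learning matrix, applies dimensionality reduction, and trains and analyzes 10 models to identify the latent abilities behind different tasks and the side effects of transfer learning.

Result: Performance gains are more closely tied to hidden statistical factors of the source dataset (class distribution, generation length proclivities) and specific linguistic features than to surface-level similarity or source data quality.

Insight: Transfer learning dynamics cannot be explained by surface-level dataset similarity alone; hidden statistical factors matter more, which points towards more predictable LLM adaptation.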

Abstract: Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training. This implies that enumerating and obtaining the high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests. Motivated by this practical need, we propose an analysis framework, building a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions. We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic) and discover the side effects of the transfer learning. Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential. This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation.

[3] Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs

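Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu

Main category: cs.CL

TL;DR: The paper finds that a sparse set of neurons in the internal representations of large language models (LLMs) linearly encodes question ambiguity during the pre-filling stage, indicating that ambiguity signals form early in the model's processing.

Motivation: Real-world questions are often ambiguous, yet LLMs tend to answer confidently instead of asking for clarification. Understanding how ambiguity is encoded in LLMs, and how it can be controlled, is therefore valuable.

Contribution: 1. Shows that a small number of neurons (as few as one) encode question ambiguity, termed Ambiguity-Encoding Neurons (AENs); 2. Shows that these neurons support ambiguity detection and behavior control (e.g., switching from direct answering to abstention); 3. Shows that ambiguity signals are encoded early, in shallow layers.

Method: 1. Analyze neuron activations during the pre-filling stage; 2. Identify the sparse neurons that encode ambiguity (AENs); 3. Train probes on AENs for ambiguity detection; 4. Manipulate AENs to control model behavior.

Result: Probes trained on AENs perform strongly on ambiguity detection and generalize across datasets; manipulating AENs effectively controls model behavior.

Insight: LLMs form compact, interpretable internal representations of ambiguity, which opens new directions for controllability and interpretability.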

Abstract: Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model’s pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model’s processing pipeline. Finally, we show that through manipulating AENs, we can control LLM’s behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.

[4] Improving Context Fidelity via Native Retrieval-Augmented Reasoning

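Suyuchen Wang, Jinlin Wang, Xinyu Wang, Shiqi Li, Xiangru Tang, Sirui Hong, Xiao-Wen Chang, Chenglin Wu, Bang Liu

Main category: cs.CL

TL;DR: The paper proposes CARE, a native retrieval-augmented reasoning framework that improves the context fidelity of large language models (LLMs) and substantially outperforms existing methods.

Motivation: LLMs often struggle with context fidelity and produce answers inconsistent with the information they are given. Existing approaches either rely on expensive supervised fine-tuning or train models to search externally, without necessarily improving how the given context is used.

Contribution: Proposes CARE, a native retrieval-augmented reasoning framework that explicitly integrates in-context evidence into the model's reasoning process, improving both retrieval accuracy and answer generation. The method needs only limited labeled evidence data and no external retrieval tool.

Method: CARE uses the model's own retrieval capabilities to strategically retrieve relevant in-context tokens within the reasoning chain, improving context fidelity and answer accuracy.

Result: On multiple real-world and counterfactual QA benchmarks, CARE substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions.

Insight: Tightly coupling the model's own retrieval capabilities with its reasoning process can make LLMs more accurate and reliable on knowledge-intensive tasks while reducing dependence on external data or tools.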

Abstract: Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model’s own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

[5] Integrating Text and Time-Series into (Large) Language Models to Predict Medical Outcomes

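Iyadh Ben Cheikh Larbi, Ajay Madhavan Ravichandran, Aljoscha Burchardt, Roland Roller

Main category: cs.CL

TL;DR: The paper studies how to apply large language models (LLMs) to clinical classification tasks that combine text and time-series data, using DSPy-based prompt optimization to achieve strong performance and adaptability across tasks.

Motivation: LLMs excel at text generation, but their ability to handle clinical classification tasks involving structured data such as time series remains underexplored, so their potential on such tasks needs to be tested.

Contribution: 1) Adapts LLMs with DSPy-based prompt optimization to process clinical notes and structured EHR data jointly; 2) Shows the advantages of this approach in performance and adaptability across tasks.

Method: Uses DSPy-based prompt optimization to feed text and time-series data into instruction-tuned LLMs, avoiding the design of a complex multimodal system.

Result: The approach performs on par with specialized multimodal systems while requiring less complexity and adapting more readily across tasks.

Insight: LLMs are not limited to text generation; with suitable optimization and input integration they can also handle tasks involving structured data, which broadens their range of application.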

Abstract: Large language models (LLMs) excel at text generation, but their ability to handle clinical classification tasks involving structured data, such as time series, remains underexplored. In this work, we adapt instruction-tuned LLMs using DSPy-based prompt optimization to process clinical notes and structured EHR inputs jointly. Our results show that this approach achieves performance on par with specialized multimodal systems while requiring less complexity and offering greater adaptability across tasks.

[6] DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models

Xiao Zheng

Main category: cs.CL

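TL;DR: DSCC-HS is a dynamic self-reinforcing framework for suppressing hallucination in large language models. It intervenes in real time during autoregressive decoding and markedly improves factuality.

Motivation: Hallucination is a major barrier to the reliable deployment of large language models (LLMs), and existing methods such as RAG are mostly reactive. DSCC-HS aims to address the problem through proactive intervention.

Contribution: Proposes the DSCC-HS framework, in which a compact proxy model acting as FAP and HDP dynamically steers the target model, improving factuality without modifying the target model.

Method: Inspired by dual-process cognitive theory, a compact proxy model is trained in two adversarial roles, a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). At inference time, a steering vector (the difference between FAP and HDP logits) is injected at each decoding step.

Result: Achieves state-of-the-art performance on TruthfulQA and BioGEN: a 99.2% Factual Consistency Rate (FCR) on TruthfulQA and a FActScore of 46.50 on BioGEN.

Insight: Steering decoding with dynamic proxy models is a promising and efficient way to improve LLM factuality without modifying the target model.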

Abstract: Large Language Model (LLM) hallucination is a significant barrier to their reliable deployment. Current methods like Retrieval-Augmented Generation (RAG) are often reactive. We introduce Dynamic Self-reinforcing Calibration for Hallucination Suppression (DSCC-HS), a novel, proactive framework that intervenes during autoregressive decoding. Inspired by dual-process cognitive theory, DSCC-HS uses a compact proxy model, trained in adversarial roles as a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). During inference, these proxies dynamically steer a large target model by injecting a real-time steering vector, which is the difference between FAP and HDP logits, at each decoding step. This plug-and-play approach requires no modification to the target model. Our experiments on TruthfulQA and BioGEN show DSCC-HS achieves state-of-the-art performance. On TruthfulQA, it reached a 99.2% Factual Consistency Rate (FCR). On the long-form BioGEN benchmark, it attained the highest FActScore of 46.50. These results validate DSCC-HS as a principled and efficient solution for enhancing LLM factuality.
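
Example (illustrative sketch, not the authors' implementation): the abstract states only that the steering vector is the difference between FAP and HDP logits and that it is injected at each decoding step. The snippet below assumes the injection is a plain addition to the target model's logits, that all three models share a vocabulary, and that a scale factor `alpha` exists; all names and numbers are hypothetical.

```python
def steer_logits(target_logits, fap_logits, hdp_logits, alpha=1.0):
    """Add alpha * (FAP - HDP) to the target model's next-token logits."""
    steering = [f - h for f, h in zip(fap_logits, hdp_logits)]
    return [t + alpha * s for t, s in zip(target_logits, steering)]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy 3-token vocabulary; values are made up for illustration.
target = [2.0, 1.8, 0.1]   # target model slightly prefers token 0
fap = [0.5, 1.5, 0.0]      # factual proxy prefers token 1
hdp = [1.5, 0.2, 0.0]      # hallucination proxy prefers token 0

steered = steer_logits(target, fap, hdp)
print(greedy_token(target), greedy_token(steered))  # 0 1
```

Because only logits are combined, a scheme of this shape leaves the target model's weights untouched, which matches the plug-and-play property the abstract describes.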

[7] DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning

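Yaxin Gao, Yao Lu, Zongfei Zhang, Jiaqi Nie, Shanqing Yu, Qi Xuan

Main category: cs.CL

TL;DR: The paper proposes DSPC, a dual-stage progressive compression framework for efficient long-context reasoning. It is training-free and combines coarse-grained semantic sentence filtering with fine-grained token pruning to cut computational cost.

Motivation: As large language models (LLMs) have spread, prompts have grown longer in pursuit of more accurate output, and computational costs have grown with them. Existing prompt compression methods usually require training an auxiliary model, which adds overhead, so a training-free compression scheme is needed.

Contribution: Proposes DSPC, a training-free two-stage compression method that uses semantic sentence filtering and token pruning to achieve efficient compression and strong performance on long-context reasoning tasks.

Method: 1. Coarse-grained stage: remove sentences with low semantic value based on TF-IDF; 2. Fine-grained stage: assess token importance by combining attention contribution, cross-model loss difference, and positional importance, then prune low-utility tokens.

Result: Validated on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo. On the FewShot task of LongBench, DSPC scores 49.17 using 3x fewer tokens, beating the best baseline, LongLLMLingua, by 7.76 points.

Insight: A training-free progressive compression framework can preserve semantics while improving efficiency, offering a low-cost option for long-context reasoning.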

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly longer, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model for compression, incurring a significant amount of additional computation. To avoid this, we propose a two-stage, training-free approach, called Dual-Stage Progressive Compression (DSPC). In the coarse-grained stage, semantic-related sentence filtering removes sentences with low semantic value based on TF-IDF. In the fine-grained stage, token importance is assessed using attention contribution, cross-model loss difference, and positional importance, enabling the pruning of low-utility tokens while preserving semantics. We validate DSPC on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget and observe consistent improvements. For instance, in the FewShot task of the Longbench dataset, DSPC achieves a performance of 49.17 by using only 3x fewer tokens, outperforming the best state-of-the-art baseline LongLLMLingua by 7.76.
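
Example (illustrative sketch of the coarse-grained stage only, not the authors' code): the abstract says low-value sentences are removed based on TF-IDF but does not give the scoring rule. The snippet below assumes each sentence is treated as a document, scored by the term-frequency-weighted mean of a smoothed IDF, and that a fixed fraction `keep_ratio` of sentences is kept in original order. Function names and the whitespace tokenizer are hypothetical choices.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the TF-weighted mean of smoothed IDF values."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(sum(
            (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        ))
    return scores


def filter_sentences(sentences, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of sentences, in original order."""
    scores = tfidf_sentence_scores(sentences)
    k = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


prompt = ["the the the", "the cat sat", "the dog ran home", "the the cat"]
print(filter_sentences(prompt, keep_ratio=0.5))
# ['the cat sat', 'the dog ran home']
```

Sentences made up mostly of tokens that occur everywhere in the prompt receive low scores and are dropped first; the fine-grained token pruning described in the abstract would then operate on what remains.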

[8] Combining Evidence and Reasoning for Biomedical Fact-Checking

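Mariano Barone, Antonio Romano, Giuseppe Riccio, Marco Postiglione, Vincenzo Moscato

Main category: cs.CL

TL;DR: CER is a new biomedical fact-checking framework that combines scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction, advancing automated fact-checking in the biomedical domain.

Motivation: Biomedical misinformation, from vaccine hesitancy to unproven treatments, harms public health and trust in medical systems. Existing automated fact-checking methods face challenges specific to this domain, such as complex terminology and the need for domain expertise.

Contribution: By combining scientific evidence retrieval, LLM reasoning, and supervised veracity prediction, the CER framework mitigates the risk of hallucination in biomedical fact-checking and grounds its outputs in high-quality evidence.

Method: CER integrates the text-generation capabilities of large language models with advanced retrieval of biomedical scientific evidence, and adds a supervised veracity prediction module on top.

Result: Experiments on the expert-annotated datasets HealthFC, BioASQ-7b, and SciFact show state-of-the-art performance and promising cross-dataset generalization.

Insight: Combining generative models with retrieval effectively reduces hallucination risk and gives biomedical fact-checking a verifiable scientific basis.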

Abstract: Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER

[9] Combating Biomedical Misinformation through Multi-modal Claim Detection and Evidence-based Verification

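Mariano Barone, Antonio Romano, Giuseppe Riccio, Marco Postiglione, Vincenzo Moscato

Main category: cs.CL

TL;DR: CER is a framework for biomedical fact-checking that combines scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. It reduces hallucination risk and shows state-of-the-art performance on several datasets.

Motivation: Biomedical misinformation, from vaccine hesitancy to unproven treatments, threatens public health and trust in medical systems, while existing techniques for validating biomedical claims struggle with complex terminology and the need for domain expertise.

Contribution: Proposes CER, a new framework that integrates scientific evidence retrieval, LLM reasoning, and supervised prediction to deliver evidence-based biomedical fact-checking with improved performance.

Method: CER combines a text-generating LLM with advanced retrieval techniques, retrieving high-quality biomedical evidence and coupling it with reasoning so that outputs are grounded in verifiable evidence.

Result: On the expert-annotated datasets HealthFC, BioASQ-7b, and SciFact, CER shows state-of-the-art performance and cross-dataset generalization.

Insight: By combining retrieval and reasoning, CER reduces hallucination and offers an interpretable and reliable approach to biomedical fact-checking.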

Abstract: Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER

[10] Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency

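Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov

Main category: cs.CL

TL;DR: Proposes Slim-SC, an efficient variant of Self-Consistency (SC) inference that prunes redundant reasoning chains at the thought level, cutting latency and resource usage.

Motivation: Self-Consistency (SC) improves the reasoning performance of large language models, but its high computational overhead limits practical deployment.

Contribution: Provides the first theoretical and empirical analysis of SC's inefficiencies and proposes Slim-SC, a pruning method based on thought-level similarity between chains.

Method: Step by step, identify and remove redundant reasoning chains using inter-chain similarity at the thought level, reducing computation while maintaining or improving accuracy.

Result: On three STEM reasoning datasets, Slim-SC reduces inference latency by up to 45% and KVC usage by up to 26% (with R1-Distill), with no loss of accuracy.

Insight: Thought-level similarity is the key to addressing SC's inefficiency, and pruning on it can substantially reduce compute usage.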

Abstract: Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
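
Example (illustrative sketch, not the authors' algorithm): the abstract describes majority voting over parallel chains and step-wise pruning by inter-chain similarity at the thought level, but does not name the similarity measure or threshold. The snippet below assumes word-level Jaccard similarity on each chain's latest thought and a threshold of 0.8; both are hypothetical choices, as are the function names.

```python
from collections import Counter


def jaccard(a, b):
    """Word-level Jaccard similarity between two thought strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def prune_redundant(thoughts, threshold=0.8):
    """thoughts: {chain_id: latest thought}. Return the chain ids to keep.

    A chain is dropped if its latest thought is too similar to that of a
    chain already kept, so near-duplicate chains stop consuming compute.
    """
    kept = []
    for cid in sorted(thoughts):
        if all(jaccard(thoughts[cid], thoughts[k]) < threshold for k in kept):
            kept.append(cid)
    return kept


def majority_vote(answers):
    """Standard Self-Consistency aggregation: most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


latest = {
    0: "add 3 and 4 to get 7",
    1: "add 3 and 4 to get 7",        # duplicate of chain 0
    2: "multiply 3 by 4 to get 12",
}
print(prune_redundant(latest))          # [0, 2]
print(majority_vote(["7", "7", "12"]))  # 7
```

In a full decoder the pruning check would run after every thought, and dropping a chain would also release its KV cache, which is where latency and memory savings of the kind reported would come from.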

[11] Early Stopping Chain-of-thoughts in Large Language Models

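Minjia Mao, Bowen Yin, Yu Zhu, Xiao Fang

Main category: cs.CL

TL;DR: The paper proposes ES-CoT, a method that detects answer convergence during reasoning and stops chain-of-thought (CoT) generation early, reducing inference cost with minimal performance loss.

Motivation: Large language models (LLMs) incur high inference costs when generating long chains of thought (CoT), so an efficient way to shorten generation without hurting performance much is needed.

Contribution: Proposes ES-CoT, which monitors the convergence of step answers to stop CoT generation dynamically, reducing inference tokens by about 41% on average while keeping accuracy comparable to standard CoT.

Method: 1. At the end of each reasoning step, prompt the LLM to output its current final answer (the step answer). 2. Track the run length of consecutive identical step answers as a measure of convergence. 3. Terminate generation once the run length shows a sharp increase and exceeds a minimum threshold.

Result: Experiments on five reasoning datasets across three LLMs show that ES-CoT reduces inference tokens by about 41% on average with minimal performance loss, and that it integrates well with self-consistency prompting.

Insight: Step answers steadily converge to the final answer, and a large jump in run length is a reliable marker of that convergence, giving efficient reasoning an empirical and theoretical basis.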

Abstract: Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. In this study, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with minimal performance loss. At the end of each reasoning step, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. Once the run length exhibits a sharp increase and exceeds a minimum threshold, the generation is terminated. We provide both empirical and theoretical support for this heuristic: step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on five reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by about 41% on average while maintaining accuracy comparable to standard CoT. Further, ES-CoT integrates seamlessly with self-consistency prompting and remains robust across hyperparameter choices, highlighting it as a practical and effective approach for efficient reasoning.
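
Example (illustrative sketch, not the authors' stopping rule): the abstract says generation stops once the run length of identical step answers "exhibits a sharp increase and exceeds a minimum threshold" but does not define "sharp". The snippet below assumes a jump means the current run is at least `jump_factor` times the longest earlier run; `min_run`, `jump_factor`, and the function name are hypothetical.

```python
def early_stop_step(step_answers, min_run=3, jump_factor=2.0):
    """Return the 1-based step at which to stop generating, or None.

    step_answers: the answer the model reports at the end of each step.
    """
    longest_prev = 0  # longest run of identical answers that has ended
    run = 0           # length of the current run
    prev = None
    for step, ans in enumerate(step_answers, start=1):
        if ans == prev:
            run += 1
        else:
            longest_prev = max(longest_prev, run)
            run, prev = 1, ans
        # Stop when the run is long enough and clearly exceeds earlier runs.
        if run >= min_run and run >= jump_factor * max(longest_prev, 1):
            return step
    return None


answers = ["12", "15", "15", "9", "9", "9", "9", "9"]
print(early_stop_step(answers))          # 7
print(early_stop_step(["1", "2", "3"]))  # None
```

Here the run of "9" reaches length 4 at step 7, twice the earlier run of "15", so the remaining step would not be generated.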

[12] Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

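Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem

Main category: cs.CL

TL;DR: Hala is a family of Arabic-centric instruction and translation models built with a translate-and-tune pipeline, achieving state-of-the-art results on Arabic NLP tasks.

Motivation: Arabic NLP currently lacks high-quality instruction and translation models; Hala aims to fill this gap and improve performance on Arabic tasks.

Contribution: 1) Compresses a strong AR-EN teacher model to FP8 and uses it to generate high-fidelity bilingual data; 2) Trains Hala models at several scales that perform strongly on Arabic tasks; 3) Releases models, data, evaluation, and recipes.

Method: 1) Generate supervision data with the FP8-compressed teacher; 2) Fine-tune the lightweight language model LFM2-1.2B on this data and use it to translate English instruction sets into Arabic; 3) Train models at multiple scales and apply slerp merging to balance Arabic specialization with base-model strengths.

Result: On Arabic-centric benchmarks, Hala reaches state-of-the-art results in both the "nano" (≤2B) and "small" (7-9B) categories.

Insight: Efficient compression and high-quality data generation can substantially improve small models, which suits languages with limited resources.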

Abstract: We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR$\leftrightarrow$EN teacher to FP8 (yielding $\sim$2$\times$ higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the “nano” ($\leq$2B) and “small” (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
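
Example (illustrative sketch, not the Hala recipe): slerp (spherical linear interpolation) between two weight vectors $a$ and $b$ with angle $\Omega$ between them is $\mathrm{slerp}(a, b; t) = \frac{\sin((1-t)\Omega)}{\sin\Omega}\,a + \frac{\sin(t\Omega)}{\sin\Omega}\,b$. The snippet below applies it to flat lists of floats. How Hala applies slerp (per tensor or per model, and with which $t$) is not stated in the abstract, so the usage shown is an assumption, and the function assumes both vectors are nonzero.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between nonzero vectors a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


base_weights = [1.0, 0.0]   # stand-in for a base-model tensor
tuned_weights = [0.0, 1.0]  # stand-in for the fine-tuned tensor
merged = slerp(base_weights, tuned_weights, t=0.5)
print([round(w, 4) for w in merged])  # [0.7071, 0.7071]
```

Unlike plain averaging, which would give [0.5, 0.5] here, slerp keeps the merged vector on the arc between the two inputs, so its norm is preserved when the inputs have equal norm.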

[13] Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

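Sami Ul Haq, Sheila Castilho, Yvette Graham

Main category: cs.CL

TL;DR: The paper compares audio-based and text-only evaluation of machine translation quality and finds that audio-based evaluation can, in some cases, distinguish translation systems more clearly.

Motivation: Despite major progress in machine translation (MT), quality assessment still relies mainly on text, overlooking real-world applications in which translations are spoken. Studying the feasibility and effect of audio-based evaluation is therefore of practical interest.

Contribution: Compares audio-based and text-only evaluation of MT quality, and tests the reliability and consistency of the audio-based approach through statistical significance testing and self-replication experiments.

Method: Compares audio-based and text-only evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk, with statistical significance testing and self-replication experiments.

Result: Rankings from audio-based evaluation are largely consistent with text-only evaluation but, in some cases, identify significant differences between translation systems, suggesting that audio, as a richer and more natural modality, may have an advantage for evaluation.

Insight: Future MT evaluation frameworks should consider incorporating speech-based assessment to better reflect translation quality in spoken settings.

Abstract: Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.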

[14] Enhancing Multi-Agent Debate System Performance via Confidence Expression

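Zijie Lin, Bryan Hooi

Main category: cs.CL

TL;DR: The paper proposes adding confidence expression to multi-agent debate systems to improve debate effectiveness and task performance.

Motivation: Existing multi-agent debate systems lack a clear mechanism for expressing confidence, so agents may stubbornly hold incorrect beliefs or converge prematurely on suboptimal answers, which reduces debate effectiveness.

Contribution: Proposes the ConfMAD framework, which improves debate dynamics and task performance by having agents express confidence explicitly.

Method: Designs ConfMAD, a new framework that integrates confidence expression throughout the debate process.

Result: Experiments demonstrate the effectiveness of the method, and the paper analyzes how confidence influences debate dynamics.

Insight: Confidence expression plays a key role in multi-agent debate systems and can inform their design.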

Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Recent research has introduced Multi-Agent Debate (MAD) systems, which leverage multiple LLMs to simulate human debate and thereby improve task performance. However, while some LLMs may possess superior knowledge or reasoning capabilities for specific tasks, they often struggle to clearly communicate this advantage during debates, in part due to a lack of confidence expression. Moreover, inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers, ultimately reducing debate effectiveness and overall system performance. To address these challenges, we propose incorporating confidence expression into MAD systems to allow LLMs to explicitly communicate their confidence levels. To validate this approach, we develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process. Experimental results demonstrate the effectiveness of our method, and we further analyze how confidence influences debate dynamics, offering insights into the design of confidence-aware MAD systems.

[15] SSL-SSAW: Self-Supervised Learning with Sigmoid Self-Attention Weighting for Question-Based Sign Language Translation

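Zekang Liu, Wei Feng, Fanhua Shang, Lianyu Hu, Jichao Feng, Liqing Gao

Main category: cs.CL

TL;DR: SSL-SSAW is a cross-modality self-supervised learning method with Sigmoid Self-attention Weighting for question-based sign language translation, using dialogue context to improve translation quality.

Motivation: Sign Language Translation (SLT) plays a key role in communication between deaf and hearing people, and dialogue provides important contextual cues. The paper proposes Question-based SLT (QB-SLT) to explore how to integrate dialogue efficiently, easing annotation and improving translation.

Contribution: 1. Proposes the new QB-SLT task, which exploits dialogue context; 2. Designs a cross-modality Self-supervised Learning (SSL) method with Sigmoid Self-attention Weighting (SSAW); 3. Achieves SOTA performance on newly constructed datasets, showing the benefit of question assistance.

Method: 1. Use contrastive learning to align multimodality features; 2. Introduce an SSAW module for adaptive feature extraction from question and sign language sequences; 3. Leverage available question text through self-supervised learning to enhance representation and translation capabilities.

Result: On the CSL-Daily-QA and PHOENIX-2014T-QA datasets, SSL-SSAW achieves the best performance; question assistance matches or even surpasses gloss assistance, and visualizations confirm the value of dialogue context.

Insight: Dialogue information can ease annotation and improve translation quality, making sign language translation more practical.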

Abstract: Sign Language Translation (SLT) bridges the communication gap between deaf people and hearing people, where dialogue provides crucial contextual cues to aid in translation. Building on this foundational concept, this paper proposes Question-based Sign Language Translation (QB-SLT), a novel task that explores the efficient integration of dialogue. Unlike gloss (sign language transcription) annotations, dialogue naturally occurs in communication and is easier to annotate. The key challenge lies in aligning multimodality features while leveraging the context of the question to improve translation. To address this issue, we propose a cross-modality Self-supervised Learning with Sigmoid Self-attention Weighting (SSL-SSAW) fusion method for sign language translation. Specifically, we employ contrastive learning to align multimodality features in QB-SLT, then introduce a Sigmoid Self-attention Weighting (SSAW) module for adaptive feature extraction from question and sign language sequences. Additionally, we leverage available question text through self-supervised learning to enhance representation and translation capabilities. We evaluated our approach on newly constructed CSL-Daily-QA and PHOENIX-2014T-QA datasets, where SSL-SSAW achieved SOTA performance. Notably, easily accessible question assistance can achieve or even surpass the performance of gloss assistance. Furthermore, visualization results demonstrate the effectiveness of incorporating dialogue in improving translation quality.

[16] AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity

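Yifan Liu, Wenkuan Zhao, Shanshan Zhong, Jinghui Qin, Mingfu Liang, Zhongzhan Huang, Wushao Wen

Main category: cs.CL

TL;DR: The paper proposes AssoCiAm, a benchmark for evaluating the associative ability of multimodal large language models (MLLMs) that circumvents internal and external ambiguity through a hybrid computational method. Experiments show a positive correlation between association and cognition, and that ambiguity makes model behavior more random.

Motivation: Current frameworks for evaluating associative ability tend to overlook the ambiguity inherent in association tasks, which undermines the reliability of evaluation. A benchmark that circumvents ambiguity is needed to measure the associative ability of MLLMs more accurately.

Contribution: 1. Decomposes ambiguity into internal ambiguity and external ambiguity; 2. Proposes the AssoCiAm benchmark, which circumvents ambiguity through a hybrid computational method; 3. Experimentally shows a positive correlation between association and cognition and reveals how ambiguity affects model behavior.

Method: Proposes the AssoCiAm benchmark, which uses a hybrid computational method (possibly combining several evaluation strategies) to circumvent internal and external ambiguity, and runs experiments and analysis on a range of MLLMs.

Result: Association and cognition are positively correlated, ambiguity makes model behavior more random, and AssoCiAm is validated as effective for evaluation.

Insight: Circumventing ambiguity makes evaluation markedly more reliable; associative ability is an important indicator when assessing the creativity of MLLMs; ambiguity reduces the predictability of model behavior.

Abstract: Recent advancements in multimodal large language models (MLLMs) have garnered significant attention, offering a promising pathway toward artificial general intelligence (AGI). Among the essential capabilities required for AGI, creativity has emerged as a critical trait for MLLMs, with association serving as its foundation. Association reflects a model's ability to think creatively, making it vital to evaluate and understand. While several frameworks have been proposed to assess associative ability, they often overlook the inherent ambiguity in association tasks, which arises from the divergent nature of associations and undermines the reliability of evaluations. To address this issue, we decompose ambiguity into two types, internal ambiguity and external ambiguity, and introduce AssoCiAm, a benchmark designed to evaluate associative ability while circumventing the ambiguity through a hybrid computational method. We then conduct extensive experiments on MLLMs, revealing a strong positive correlation between cognition and association. Additionally, we observe that the presence of ambiguity in the evaluation process causes MLLMs' behavior to become more random-like. Finally, we validate the effectiveness of our method in ensuring more accurate and reliable evaluations. See Project Page for the data and codes.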

[17] Synthesizing Behaviorally-Grounded Reasoning Chains: A Data-Generation Framework for Personal Finance LLMs

Akhil Theerthala

Main category: cs.CL

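TL;DR: The paper proposes a data-generation framework that combines financial context with behavioral finance to train LLMs for personalized financial advice, and shows experimentally that its 8B model is competitive in performance at lower cost.

Motivation: Personalized financial advice must account for user goals, constraints, risk tolerance, and other factors. Existing agentic pipelines incur high maintenance costs and yield less than 25% of their expected financial returns, so a more efficient and cheaper framework for data generation and model training is needed.

Contribution: 1. Proposes a reproducible data-generation framework that combines financial context with behavioral finance; 2. Builds a reasoning dataset of 19k samples; 3. Shows that an 8B model is comparable to larger baselines on accuracy, fluency, and personalization at 80% lower cost.

Method: 1. Integrate financial context and behavioral finance studies to generate supervision data; 2. Fine-tune Qwen-3-8B; 3. Evaluate on a held-out test split and with a blind LLM-jury study.

Result: The 8B model is comparable to larger baselines (14-32B parameters) on factual accuracy, fluency, and personalization, at 80% lower cost.

Insight: A carefully designed supervision-data generation framework can sharply reduce the model size required while keeping output quality high, which is encouraging for small models on complex tasks.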

Abstract: Personalized financial advice requires consideration of user goals, constraints, risk tolerance, and jurisdiction. Prior LLM work has focused on support systems for investors and financial planners. Simultaneously, numerous recent studies examine broader personal finance tasks, including budgeting, debt management, retirement, and estate planning, through agentic pipelines that incur high maintenance costs, yielding less than 25% of their expected financial returns. In this study, we introduce a novel and reproducible framework that integrates relevant financial context with behavioral finance studies to construct supervision data for end-to-end advisors. Using this framework, we create a 19k sample reasoning dataset and conduct a comprehensive fine-tuning of the Qwen-3-8B model on the dataset. Through a held-out test split and a blind LLM-jury study, we demonstrate that through careful data curation and behavioral integration, our 8B model achieves performance comparable to significantly larger baselines (14-32B parameters) across factual accuracy, fluency, and personalization metrics while incurring 80% lower costs than the larger counterparts.

cs.CV

[18] Hybrid Quantum-Classical Model for Image Classification

Muhammad Adnan Shahzad

Main category: cs.CV

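TL;DR: The paper systematically compares hybrid quantum-classical neural networks with purely classical models on three benchmark datasets in terms of performance, efficiency, and robustness, reporting that the hybrid models beat the classical ones on accuracy, training speed, and resource efficiency, especially on complex tasks.

Motivation: The study explores whether combining quantum computing with conventional deep learning can improve image classification, in particular whether it can surpass purely classical models on accuracy, efficiency, and robustness.

Contribution: 1) Shows that hybrid models outperform classical models on several benchmark datasets; 2) Reports advantages of hybrid models in training speed and resource efficiency; 3) Examines how hybrid models behave across tasks of different complexity.

Method: The hybrid models combine parameterized quantum circuits with classical deep learning architectures (such as CNNs) and are evaluated over 50 training epochs on MNIST, CIFAR100, and STL10, measuring validation accuracy, test accuracy, training time, resource usage, and adversarial robustness.

Result: The hybrid models achieve higher accuracy than the classical models (e.g., +9.44% on CIFAR100), train 5-12 times faster, and use less memory and CPU, but their adversarial robustness on complex datasets is comparable to that of classical models.

Insight: Hybrid quantum-classical models show clear advantages on complex vision tasks, but their robustness may depend on dataset complexity, suggesting that future work could refine quantum circuit design to improve robustness.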

Abstract: This study presents a systematic comparison between hybrid quantum-classical neural networks and purely classical models across three benchmark datasets (MNIST, CIFAR100, and STL10) to evaluate their performance, efficiency, and robustness. The hybrid models integrate parameterized quantum circuits with classical deep learning architectures, while the classical counterparts use conventional convolutional neural networks (CNNs). Experiments were conducted over 50 training epochs for each dataset, with evaluations on validation accuracy, test accuracy, training time, computational resource usage, and adversarial robustness (tested with $\epsilon=0.1$ perturbations). Key findings demonstrate that hybrid models consistently outperform classical models in final accuracy, achieving 99.38% (MNIST), 41.69% (CIFAR100), and 74.05% (STL10) validation accuracy, compared to classical benchmarks of 98.21%, 32.25%, and 63.76%, respectively. Notably, the hybrid advantage scales with dataset complexity, showing the most significant gains on CIFAR100 (+9.44%) and STL10 (+10.29%). Hybrid models also train 5–12$\times$ faster (e.g., 21.23s vs. 108.44s per epoch on MNIST) and use 6–32% fewer parameters while maintaining superior generalization to unseen test data. Adversarial robustness tests reveal that hybrid models are significantly more resilient on simpler datasets (e.g., 45.27% robust accuracy on MNIST vs. 10.80% for classical) but show comparable fragility on complex datasets like CIFAR100 ($\sim$1% robustness for both). Resource efficiency analyses indicate that hybrid models consume less memory (4–5GB vs. 5–6GB for classical) and lower CPU utilization (9.5% vs. 23.2% on average). These results suggest that hybrid quantum-classical architectures offer compelling advantages in accuracy, training efficiency, and parameter scalability, particularly for complex vision tasks.

[19] Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention

Tong Yulin,Liang Xuechen

Main category: cs.CV

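TL;DR: The paper proposes an expressway congestion warning technique based on YOLOv11-DIoU and GRU-Attention that improves both detection accuracy and congestion-warning accuracy by optimizing the object detection and long-sequence prediction models.

Details Motivation: Existing expressway congestion warning systems suffer from low vehicle perception accuracy and loss of long-sequence dependencies, which seriously degrades warning quality.

Contribution: 1. Proposes the YOLOv11-DIoU model, which improves object detection accuracy with DIoU Loss; 2. Improves the DeepSort algorithm by fusing motion and appearance features; 3. Designs a GRU-Attention model that captures congestion precursors.

Method: 1. Optimize YOLOv11 with DIoU Loss; 2. Improve DeepSort by combining Mahalanobis and cosine distances; 3. Build a GRU-Attention model for congestion prediction.

Result: YOLOv11-DIoU improves mAP by 6.5 percentage points, GRU-Attention reaches 99.7% test accuracy, and the congestion warning time error is ≤ 1 minute.

Insight: Combining object detection with sequence prediction can substantially improve congestion warning systems, with stable performance under occlusion and high traffic flow.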

Abstract: Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing “detection-prediction” systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues. For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7% mAP (6.5 percentage points higher than baseline) with 5.3% occlusion miss rate. DeepSort reached 93.8% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained for 300 epochs with flow, density, and speed, it achieved 99.7% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, time error was $\leq$ 1 minute. Validation with an independent video showed 95% warning accuracy, over 90% spatial overlap of congestion points, and stable performance in high-flow ($>$5 vehicles/second) scenarios. This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
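The DIoU Loss mentioned in this abstract can be illustrated with a minimal sketch. It uses the commonly cited definition, 1 - IoU plus the squared centre distance divided by the squared diagonal of the smallest enclosing box. The function name `diou_loss` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, not the paper's code.

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centres.
    dx = (ax1 + ax2) / 2 - (bx1 + bx2) / 2
    dy = (ay1 + ay2) / 2 - (by1 + by2) / 2
    center_dist_sq = dx * dx + dy * dy

    # Squared diagonal of the smallest box enclosing both inputs.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enc_w * enc_w + enc_h * enc_h

    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou + penalty


print(diou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 0.0
print(diou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes still get a distance signal
```

Unlike a plain IoU loss, the centre-distance term stays informative when the boxes do not overlap at all, which is one reason such a penalty may help under occlusion.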

[20] Parking Space Ground Truth Test Automation by Artificial Intelligence Using Convolutional Neural Networks

Tony Rohe,Martin Margreiter,Markus Moertl

Main category: cs.CV

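TL;DR: This paper proposes a CNN-based automated testing method that streamlines ground truth testing for a real-time, cloud-based on-street parking service, greatly reducing the human time involved.

Details Motivation: Ground truth tests for existing on-street parking services rely heavily on manual work, which is inefficient and costly. The study aims to automate testing with AI to improve service efficiency.

Contribution: 1. Proposes a CNN-based image pattern recognition method to automate the analysis of parking space data. 2. Reduces manual testing time by up to 99.58%. 3. Provides a high-quality ground truth testing tool for parking services.

Method: Uses convolutional neural networks for image pattern recognition to automatically classify and analyze parking space data collected by ultrasonic sensors, replacing manual testing.

Result: The automated tool reduces human resource time by up to 99.58%, improving testing efficiency and accuracy.

Insight: Machine learning techniques such as CNNs can efficiently replace manual testing tasks in areas like parking services, but the models need further optimization for more complex real-world scenarios.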

Abstract: This research is part of a study of a real-time, cloud-based on-street parking service using crowd-sourced in-vehicle fleet data. The service provides real-time information about available parking spots by classifying crowd-sourced detections observed via ultrasonic sensors. The goal of this research is to optimize the current parking service quality by analyzing the automation of the existing test process for ground truth tests. Therefore, methods from the field of machine learning, especially image pattern recognition, are applied to enrich the database and substitute human engineering work in major areas of the analysis process. After an introduction into the related areas of machine learning, this paper explains the methods and implementations made to achieve a high level of automation, applying convolutional neural networks. Finally, predefined metrics present the performance level achieved, showing a time reduction of human resources up to 99.58 %. The overall improvements are discussed, summarized, and followed by an outlook for future development and potential application of the analysis automation tool.

[21] An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity

Yuxiao Lee,Xiaofeng Cao,Wei Ye,Jiangchao Yao,Jingkuan Song,Heng Tao Shen

Main category: cs.CV

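TL;DR: The paper systematically analyzes the mechanisms, advantages, and sensitivity of zero-shot OOD detection based on vision-language models (VLMs), revealing their ability to exploit semantic novelty and their sensitivity to prompts.

Details Motivation: Although VLMs perform well at zero-shot OOD detection, a comprehensive understanding of their mechanisms, advantages, and robustness is still lacking, and systematic analysis is needed to guide future research.

Contribution: 1) Formalizes key operational properties of the VLM embedding space; 2) Quantifies the advantage of VLMs over single-modal methods; 3) Reveals an asymmetric robustness profile with high sensitivity to prompts.

Method: Systematic empirical analysis using ID and OOD prompts to study the mechanisms of the VLM embedding space and its behavioral sensitivity.

Result: By exploiting semantic novelty, VLMs clearly outperform single-modal methods, but they are highly sensitive to prompt choice.

Insight: VLM-based OOD detection relies on the semantic properties of the embedding space; the influence of prompts cannot be ignored, and robustness should be strengthened in future designs.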

Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities, vital for reliable AI systems. Despite this promising capability, a comprehensive understanding of (1) why they work so effectively, (2) what advantages they have over single-modal methods, and (3) how robust their behavior is remains notably incomplete within the research community. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts. (1) Mechanisms: We systematically characterize and formalize key operational properties within the VLM embedding space that facilitate zero-shot OOD detection. (2) Advantages: We empirically quantify the superiority of these models over established single-modal approaches, attributing this distinct advantage to the VLM’s capacity to leverage rich semantic novelty. (3) Sensitivity: We uncover a significant and previously under-explored asymmetry in their robustness profile: while exhibiting resilience to common image noise, these VLM-based methods are highly sensitive to prompt phrasing. Our findings contribute a more structured understanding of the strengths and critical vulnerabilities inherent in VLM-based OOD detection, offering crucial, empirically-grounded guidance for developing more robust and reliable future designs.

[22] Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension

Charlotte Beylier,Parvaneh Joharinad,Jürgen Jost,Nahid Torbati

Main category: cs.CV

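TL;DR: The paper proposes a curvature-based method for evaluating dimensionality reduction techniques and estimating the intrinsic dimension of datasets. Using abstract notions of sectional curvature, it builds a geometric profile of discrete metric spaces and uses it to analyze the geometry of data representations.

Details Motivation: Dimensionality reduction is widely used in practice, but tools for quantitatively evaluating its results are lacking. The authors use geometric curvature to provide a new way to assess how faithfully a reduction preserves geometry and, further, to estimate a dataset's intrinsic dimension.

Contribution: 1. A method for constructing a geometric profile based on sectional curvature; 2. A curvature-based quantitative measure for evaluating dimensionality reduction techniques; 3. Applications to estimating intrinsic dimension and analyzing the geometry of large-scale networks.

Method: Using abstract notions of sectional curvature, the authors capture the metric relations between triples of points and other points and build a geometric profile of discrete metric spaces. The curvature properties of a data representation are then analyzed to assess how well its geometric structure is preserved.

Result: Experiments show that the method can both evaluate dimensionality reduction techniques and estimate intrinsic dimension, and that it applies to analyzing the geometry of large-scale networks.

Insight: Curvature as a geometric tool gives a better understanding of the geometry of data representations and a new perspective on evaluating dimensionality reduction and analyzing datasets.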

Abstract: Utilizing recently developed abstract notions of sectional curvature, we introduce a method for constructing a curvature-based geometric profile of discrete metric spaces. The curvature concept that we use here captures the metric relations between triples of points and other points. More significantly, based on this curvature profile, we introduce a quantitative measure to evaluate the effectiveness of data representations, such as those produced by dimensionality reduction techniques. Furthermore, our experiments demonstrate that this curvature-based analysis can be employed to estimate the intrinsic dimensionality of datasets. We use this to explore the large-scale geometry of empirical networks and to evaluate the effectiveness of dimensionality reduction techniques.

[23] Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence

Xinan Wang,Di Shi,Fengyu Wang

Main category: cs.CV

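TL;DR: The paper proposes a three-stage framework for real-time detection and tracking of foreign object intrusions in power systems, combining YOLOv7 segmentation, ConvNeXt feature extraction, and an IoU tracker, and optimizes it for efficient deployment on edge hardware.

Details Motivation: Foreign object intrusions in power systems can cause serious accidents, and existing methods fall short in real-time performance and robustness. The paper proposes an efficient edge-intelligence solution.

Contribution: 1) A three-stage real-time detection and tracking framework; 2) A feature extraction approach combining YOLOv7 and ConvNeXt; 3) A system design that supports edge deployment and incremental updates.

Method: 1) YOLOv7 for fast object localization; 2) A ConvNeXt feature extractor trained with triplet loss; 3) A feature-assisted IoU tracker that handles occlusion and multiple objects.

Result: The framework's accuracy and robustness are validated on real-world surveillance and drone video datasets, and hardware benchmarks on NVIDIA Jetson devices confirm its practicality.

Insight: Through edge optimization and incremental updates, the method shows how to achieve efficient real-time detection and tracking on resource-constrained devices.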

Abstract: This paper presents a novel three-stage framework for real-time foreign object intrusion (FOI) detection and tracking in power transmission systems. The framework integrates: (1) a YOLOv7 segmentation model for fast and robust object localization, (2) a ConvNeXt-based feature extractor trained with triplet loss to generate discriminative embeddings, and (3) a feature-assisted IoU tracker that ensures resilient multi-object tracking under occlusion and motion. To enable scalable field deployment, the pipeline is optimized for deployment on low-cost edge hardware using mixed-precision inference. The system supports incremental updates by adding embeddings from previously unseen objects into a reference database without requiring model retraining. Extensive experiments on real-world surveillance and drone video datasets demonstrate the framework’s high accuracy and robustness across diverse FOI scenarios. In addition, hardware benchmarks on NVIDIA Jetson devices confirm the framework’s practicality and scalability for real-world edge applications.
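The incremental-update idea in this abstract, adding embeddings of previously unseen objects to a reference database without retraining, can be sketched as a nearest-neighbour lookup. Everything below is a hypothetical illustration: the class name `ReferenceDatabase`, the use of cosine similarity, and the `threshold` value are assumptions, not details taken from the paper.

```python
import numpy as np


class ReferenceDatabase:
    """Toy embedding store; matching is by cosine similarity (an assumption)."""

    def __init__(self, threshold=0.8):
        self.labels = []
        self.embeddings = []  # unit-normalised vectors
        self.threshold = threshold  # hypothetical acceptance threshold

    def add(self, label, embedding):
        # Adding a new object class needs no model retraining, only a new row.
        v = np.asarray(embedding, dtype=float)
        self.embeddings.append(v / np.linalg.norm(v))
        self.labels.append(label)

    def query(self, embedding):
        # Returns (label, similarity); label is None when nothing is close enough.
        if not self.embeddings:
            return None, 0.0
        v = np.asarray(embedding, dtype=float)
        v = v / np.linalg.norm(v)
        sims = np.stack(self.embeddings) @ v
        best = int(np.argmax(sims))
        if sims[best] < self.threshold:
            return None, float(sims[best])
        return self.labels[best], float(sims[best])


db = ReferenceDatabase()
db.add("kite", [1.0, 0.0, 0.0])
print(db.query([0.9, 0.1, 0.0]))  # matches "kite"
print(db.query([0.0, 1.0, 0.0]))  # unknown until a matching embedding is added
db.add("balloon", [0.0, 1.0, 0.0])
print(db.query([0.0, 1.0, 0.0]))  # now matches "balloon"
```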

[24] EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing

Tianyu Chen,Yasi Zhang,Zhi Zhang,Peiyu Yu,Shu Wang,Zhendong Wang,Kevin Lin,Xiaofei Wang,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Jianwen Xie,Oscar Leong,Lijuan Wang,Ying Nian Wu,Mingyuan Zhou

Main category: cs.CV

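TL;DR: The paper introduces EdiVal-Agent, an object-centric framework for automated, scalable, fine-grained evaluation of multi-turn instruction-based editing. It combines vision-language models with object detectors to improve evaluation accuracy.

Details Motivation: Current image editing evaluation relies on reference images or on a single vision-language model, which leads to limited coverage and imprecise assessments. A more reliable, interpretable automated evaluation framework is needed.

Contribution: 1. Proposes the EdiVal-Agent framework for evaluating multi-turn instruction-based editing. 2. Combines vision-language models with object detectors to improve evaluation accuracy. 3. Builds the EdiVal-Bench benchmark, covering 9 instruction types and 11 editing models.

Method: 1. Decompose the image into semantic objects and generate diverse editing instructions. 2. Assess instruction following with vision-language models and open-vocabulary object detectors. 3. Evaluate content consistency with semantic-level feature extractors and visual quality with human preference models.

Result: Experiments show that combining vision-language models with object detectors agrees more closely with human judgments on instruction following than using vision-language models alone.

Insight: The modular design allows future tools to be integrated, improving evaluation accuracy over time. EdiVal-Agent can identify failure modes of current editing models and inform the development of the next generation.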

Abstract: Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images – resulting in limited coverage and inheriting biases from prior generative models – or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline’s modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.

[25] MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha,Norman Müller,Johannes Schönberger,Lorenzo Porzi,Yuchen Zhang,Tobias Fischer,Arno Knapitsch,Duncan Zauss,Ethan Weber,Nelson Antunes,Jonathon Luiten,Manuel Lopez-Antequera,Samuel Rota Bulò,Christian Richardt,Deva Ramanan,Sebastian Scherer,Peter Kontschieder

Main category: cs.CV

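TL;DR: MapAnything is a unified transformer-based feed-forward model that directly regresses metric 3D scene geometry and camera parameters through a factored representation of multi-view geometry, and applies to a range of 3D vision tasks.

Details Motivation: Most existing 3D reconstruction models are designed for specific tasks and lack generality. MapAnything aims to provide a single model that handles many 3D reconstruction tasks efficiently.

Contribution: Proposes MapAnything, a transformer-based feed-forward model that uses a factored representation of multi-view geometry and supports diverse input and output configurations for efficient, general 3D reconstruction.

Method: The model uses a factored representation (depth maps, local ray maps, camera poses, and a metric scale factor) and, with standardized supervision and flexible input augmentation, handles multiple tasks in a single feed-forward pass.

Result: Experiments show that MapAnything outperforms or matches specialist models across tasks while offering more efficient training and greater generality.

Insight: A unified geometric representation and training strategy make it possible to build a universal 3D reconstruction backbone in place of traditional specialist models.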

Abstract: We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
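To make the factored representation in this abstract concrete, the sketch below shows one plausible way the four factors could be composed into metric world-space points. The conventions are assumptions for illustration only: depth is measured along unit ray directions in the camera frame, the pose is camera-to-world, and the scale factor multiplies the whole up-to-scale reconstruction. The function name `compose_world_points` is hypothetical and the model's actual conventions may differ.

```python
import numpy as np


def compose_world_points(depth, rays, rotation, translation, metric_scale):
    """Combine per-view factors into metric 3D points.

    depth:        (H, W) up-to-scale distance along each ray
    rays:         (H, W, 3) unit ray directions in the camera frame
    rotation:     (3, 3) camera-to-world rotation (assumed convention)
    translation:  (3,) up-to-scale camera position in the world frame
    metric_scale: scalar that upgrades the reconstruction to metric units
    """
    local = rays * depth[..., None]            # points in the camera frame
    world = local @ rotation.T + translation   # rigid transform to the world frame
    return metric_scale * world                # one global factor sets the metric scale


depth = np.full((1, 2), 2.0)
rays = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
points = compose_world_points(depth, rays, np.eye(3), np.array([1.0, 0.0, 0.0]), 0.5)
print(points.shape)  # (1, 2, 3)
```

Keeping scale as a separate scalar means the per-view geometry can be predicted up to scale and made metric afterwards, which matches the abstract's description of upgrading local reconstructions into a globally consistent metric frame.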

[26] Semantic-Enhanced Cross-Modal Place Recognition for Robust Robot Localization

Yujia Lin,Nicholas Evans

Main category: cs.CV

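TL;DR: The paper proposes SCM-PR, a semantic-enhanced cross-modal place recognition framework that combines high-level semantics from RGB images with geometric information from LiDAR maps to make robot localization more robust in GPS-denied environments.

Details Motivation: Existing RGB-based localization methods are sensitive to changes in illumination, weather, and other environmental conditions, while cross-modal localization methods struggle with complex scenes, fine-grained matching, and viewpoint changes.

Contribution: Proposes the SCM-PR framework, comprising a VMamba backbone for RGB feature extraction, a SAFF module for fusing semantic information, LiDAR descriptors that incorporate semantics and geometry, and a cross-modal semantic attention mechanism.

Method: Uses VMamba for feature extraction, the SAFF module to fuse semantic features, and cross-modal semantic attention in NetVLAD, together with Multi-View Semantic-Geometric Matching and a Semantic Consistency Loss.

Result: On the KITTI and KITTI-360 datasets, SCM-PR outperforms other cross-modal place recognition methods and achieves state-of-the-art performance.

Insight: Semantic information plays a key role in making cross-modal localization more robust, particularly in complex environments and under viewpoint changes.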

Abstract: Ensuring accurate localization of robots in environments without GPS capability is a challenging task. Visual Place Recognition (VPR) techniques can potentially achieve this goal, but existing RGB-based methods are sensitive to changes in illumination, weather, and other seasonal changes. Existing cross-modal localization methods leverage the geometric properties of RGB images and 3D LiDAR maps to reduce the sensitivity issues highlighted above. Currently, state-of-the-art methods struggle in complex scenes, fine-grained or high-resolution matching, and situations where changes can occur in viewpoint. In this work, we introduce a framework we call Semantic-Enhanced Cross-Modal Place Recognition (SCM-PR) that combines high-level semantics utilizing RGB images for robust localization in LiDAR maps. Our proposed method introduces: a VMamba backbone for feature extraction of RGB images; a Semantic-Aware Feature Fusion (SAFF) module for using both place descriptors and segmentation masks; LiDAR descriptors that incorporate both semantics and geometry; and a cross-modal semantic attention mechanism in NetVLAD to improve matching. Incorporating the semantic information was also instrumental in designing a Multi-View Semantic-Geometric Matching and a Semantic Consistency Loss, both in a contrastive learning framework. Our experimental work on the KITTI and KITTI-360 datasets shows that SCM-PR achieves state-of-the-art performance compared to other cross-modal place recognition methods.

[27] Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

Hao Xu,Xiaolin Wu,Xi Zhang

Main category: cs.CV

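TL;DR: The paper proposes SALVQ (scene-adaptive lattice vector quantization) to improve the compression of 3D Gaussian Splatting (3DGS) data. It optimizes the lattice basis for each scene, improving compression efficiency while keeping complexity low.

Details Motivation: 3DGS is popular for its high-quality rendering and real-time performance, but it generates massive amounts of data, and existing compression methods rely on uniform scalar quantization (USQ), which lacks flexibility. The paper asks whether a more sophisticated quantizer, such as lattice vector quantization (LVQ), can improve compression while keeping the system simple.

Contribution: 1) Proposes scene-adaptive LVQ (SALVQ), which optimizes the lattice basis for each scene; 2) SALVQ integrates seamlessly into existing 3DGS compression architectures and improves R-D performance without significant computational overhead; 3) Scaling the lattice basis vectors adjusts the compression rate dynamically, so separate models need not be trained for different bit rates.

Method: Replace USQ with LVQ and optimize the lattice basis per scene (SALVQ). Scale the lattice basis vectors to adjust the compression rate and support multiple bit rate targets. The approach balances the R-D efficiency of vector quantization against the low complexity of USQ.

Result: SALVQ improves the R-D performance of 3DGS compression with minimal computational overhead and modification. Its dynamic rate adjustment reduces training time and memory consumption.

Insight: Optimizing and dynamically scaling the lattice basis is key to more efficient 3DGS compression. Scene adaptivity compensates for the limitations of conventional quantizers and suggests a direction for future compression methods.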

Abstract: 3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ’s adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
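The relationship between USQ, LVQ, and basis scaling described in this abstract can be illustrated with a small sketch. It quantizes a vector by rounding its coordinates in a lattice basis (Babai rounding), which is a simple approximation chosen here for illustration and not necessarily the quantizer used in the paper. The function name `lattice_quantize` and the example bases are assumptions.

```python
import numpy as np


def lattice_quantize(x, basis, scale=1.0):
    """Snap x to the lattice spanned by the columns of scale * basis.

    Uses Babai rounding: express x in lattice coordinates, round to integers,
    and map back. With an identity basis this reduces to uniform scalar
    quantization with step size `scale`.
    """
    b = scale * np.asarray(basis, dtype=float)
    coords = np.rint(np.linalg.solve(b, np.asarray(x, dtype=float)))
    return b @ coords, coords.astype(int)


identity = np.eye(2)
skewed = np.array([[1.0, 0.5],
                   [0.0, 1.0]])  # hypothetical "learned" per-scene basis

print(lattice_quantize([0.4, 1.6], identity))             # USQ with step 1
print(lattice_quantize([0.4, 1.6], identity, scale=0.5))  # denser lattice, finer steps
print(lattice_quantize([1.2, 0.9], skewed))               # non-orthogonal lattice
```

Changing `scale` alone changes the lattice density, which is how a single basis could serve several bit rate targets. In a learned setting the basis entries would be optimized per scene rather than fixed by hand.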

[28] MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu,Alexandra Kudaeva,Marco Cipriano,Fatimeh Al Ghannam,Freya Tan,Gerard de Melo,Andres Sevtsuk

Main category: cs.CV

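TL;DR: MINGLE is a modular three-stage pipeline for detecting semantically complex social group regions in urban street-view images, combining off-the-shelf human detection, VLM-based reasoning, and a lightweight spatial aggregation algorithm.

Details Motivation: Understanding group-level social interactions in public spaces is crucial for urban planning, but visual detection methods that capture abstract interpersonal relations are lacking.

Contribution: Introduces the social group region detection task and develops the MINGLE pipeline; releases a new dataset of 100K annotated images.

Method: Three stages: 1) human detection and depth estimation; 2) VLM-based reasoning to classify pairwise social affiliation; 3) a lightweight spatial aggregation algorithm to localize group regions.

Result: MINGLE detects semantically complex social group regions, and the new dataset is used to validate its performance.

Insight: Combining off-the-shelf models with VLM-based reasoning can handle abstract semantic tasks that traditional methods struggle to capture.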

Abstract: Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
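The third stage in this abstract, turning pairwise affiliation decisions into group regions, can be sketched with a union-find pass followed by a bounding-box union. This is a minimal sketch under stated assumptions: the abstract does not specify the aggregation algorithm, so connected components are used here as a stand-in, and the function name `group_regions` and the `(x1, y1, x2, y2)` box format are hypothetical.

```python
def group_regions(boxes, affiliated_pairs):
    """Merge people linked by pairwise affiliation into group regions.

    boxes:            list of (x1, y1, x2, y2) person boxes
    affiliated_pairs: list of (i, j) index pairs judged socially affiliated
    Returns a list of (member_indices, enclosing_box) for groups of 2 or more.
    """
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in affiliated_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for idx in range(len(boxes)):
        groups.setdefault(find(idx), []).append(idx)

    regions = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a lone person does not form a group region
        xs1, ys1, xs2, ys2 = zip(*(boxes[m] for m in members))
        regions.append((members, (min(xs1), min(ys1), max(xs2), max(ys2))))
    return regions


people = [(0, 0, 1, 2), (1, 0, 2, 2), (10, 0, 11, 2), (3, 0, 4, 3)]
print(group_regions(people, [(0, 1), (1, 3)]))
# -> [([0, 1, 3], (0, 0, 4, 3))]
```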

[29] BiasMap: Leveraging Cross-Attentions to Discover and Mitigate Hidden Social Biases in Text-to-Image Generation

Rajatsubhra Chakraborty,Xujun Che,Depeng Xu,Cori Faklaris,Xi Niu,Shuhan Yuan

Main category: cs.CV

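TL;DR: BiasMap is a model-agnostic framework that uses cross-attention to uncover latent concept-level biases in text-to-image generation models, and proposes energy-guided diffusion sampling to mitigate them.

Details Motivation: Existing bias discovery work focuses mainly on output-level demographic distributions, which does not guarantee disentangled concept representations, so deeper concept-level bias needs to be studied.

Contribution: 1) Proposes the BiasMap framework, which uses cross-attention maps to reveal structural entanglement between demographics and semantic concepts; 2) Proposes a bias mitigation method based on energy-guided diffusion sampling.

Method: Quantify the spatial entanglement between demographics and semantic concepts via IoU over cross-attention attribution maps, and use energy-guided diffusion sampling to modify the latent noise space directly and minimize the expected SoftIoU.

Result: Experiments show that existing fairness interventions may reduce the output distributional gap but fail to disentangle concept-level coupling, whereas BiasMap's mitigation reduces concept entanglement and complements distributional bias mitigation.

Insight: Discovering and mitigating concept-level bias is key to fairer text-to-image generation, and attention mechanisms provide an effective analysis tool.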

Abstract: Bias discovery is critical for black-box generative models, especially text-to-image (TTI) models. Existing works predominantly focus on output-level demographic distributions, which do not necessarily guarantee concept representations to be disentangled post-mitigation. We propose BiasMap, a model-agnostic framework for uncovering latent concept-level representational biases in stable diffusion models. BiasMap leverages cross-attention attribution maps to reveal structural entanglements between demographics (e.g., gender, race) and semantics (e.g., professions), going deeper into representational bias during the image generation. Using attribution maps of these concepts, we quantify the spatial demographics-semantics concept entanglement via Intersection over Union (IoU), offering a lens into bias that remains hidden in existing fairness discovery approaches. In addition, we further utilize BiasMap for bias mitigation through energy-guided diffusion sampling that directly modifies latent noise space and minimizes the expected SoftIoU during the denoising process. Our findings show that existing fairness interventions may reduce the output distributional gap but often fail to disentangle concept-level coupling, whereas our mitigation method can mitigate concept entanglement in image generation while complementing distributional bias mitigation.
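The overlap measures named in this abstract can be illustrated on two attribution maps. The sketch below computes a thresholded IoU and one common soft relaxation (sum of element-wise minima over sum of element-wise maxima). The abstract does not give the exact SoftIoU definition, so that formula, the `threshold` value, and the function names are assumptions made for illustration.

```python
import numpy as np


def hard_iou(map_a, map_b, threshold=0.5):
    """IoU of two attribution maps after binarising at `threshold`."""
    a = np.asarray(map_a) >= threshold
    b = np.asarray(map_b) >= threshold
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def soft_iou(map_a, map_b, eps=1e-8):
    """Differentiable-style relaxation: sum(min) / sum(max) over non-negative maps."""
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    return float(np.minimum(a, b).sum() / (np.maximum(a, b).sum() + eps))


# Hypothetical attention maps for a demographic token and a profession token.
demographic = [[0.8, 0.2]]
profession = [[0.4, 0.6]]
print(hard_iou(demographic, profession))  # no overlap after thresholding
print(soft_iou(demographic, profession))  # partial overlap remains visible
```

The soft version avoids the hard threshold, which is what would make it usable as a quantity to minimize during sampling. A lower value indicates that the two concepts attend to more separate image regions.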

[30] LivePyxel: Accelerating image annotations with a Python-integrated webcam live streaming

Uriel Garcilazo-Cruz,Joseph O. Okeme,Rodrigo A. Vargas–Hernández

Main category: cs.CV

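TL;DR: LivePyxel is a Python-based graphical user interface that integrates with devices such as webcams and microscopes to enable real-time image annotation and speed up AI model development.

Details Motivation: Existing image annotation tools usually require uploading a precollected dataset and do not support real-time data acquisition, which limits AI model deployment, particularly in laboratory environments.

Contribution: Develops LivePyxel, a Python tool for real-time image annotation that offers Bézier splines, binary masks, and other features, and integrates OpenCV and NumPy for performance.

Method: LivePyxel provides precise annotation through a simple graphical interface, supports non-destructive layer editing, and is optimized for video device compatibility and object detection operations.

Result: The tool simplifies data collection and labeling and suits AI model development in experimental workflows.

Insight: Flexible real-time annotation tools are important for accelerating AI model deployment in scientific fields, especially in experimental settings.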

Abstract: The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where real-time data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePyxel, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable real-time image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software’s capacity to work with non-destructive layers that enable high-performance editing. LivePyxel is also compatible with a wide range of video devices, and it is optimized for object detection operations through OpenCV combined with NumPy for high-performance matrix and linear algebra operations. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel is freely available at https://github.com/UGarCil/LivePyxel

[31] DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform

Xingzi Xu,Qi Li,Shuwen Qiu,Julien Han,Karim Bouyarmane

Main category: cs.CV

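TL;DR: DEFT-VTON is an efficient virtual try-on method that freezes the pre-trained model's parameters and trains a small h-transform network, greatly reducing the number of trained parameters. Combined with an adaptive consistency loss, it achieves high-quality, fast virtual try-on.

Details Motivation: Real-world virtual try-on applications need efficient training and inference under limited budgets. Current methods rely on extensive end-to-end training and struggle to meet this need.

Contribution: 1. Proposes DEFT, which fine-tunes pre-trained models efficiently via an h-transform, training only 1.42% of the parameters; 2. Adds an adaptive consistency loss that further improves performance and reduces inference time.

Method: 1. Freeze the pre-trained model's parameters and train a small h-transform network; 2. Introduce an adaptive consistency loss and combine it with the denoising score matching loss for fine-tuning.

Result: DEFT-VTON achieves SOTA performance on virtual try-on with as few as 15 denoising steps.

Insight: Efficient fine-tuning and a consistency loss are key techniques for high-quality virtual try-on and offer a practical route to real-world deployment.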

Abstract: Diffusion models enable high-quality virtual try-on (VTO) with their established image synthesis abilities. Despite the extensive end-to-end training of large pre-trained models involved in current VTO methods, real-world applications often prioritize limited training and inference, serving, and deployment budgets for VTO. To solve this obstacle, we apply Doob’s h-transform efficient fine-tuning (DEFT) for adapting large pre-trained unconditional models for downstream image-conditioned VTO abilities. DEFT freezes the pre-trained model’s parameters and trains a small h-transform network to learn a conditional h-transform. The h-transform network allows training only 1.42 percent of the frozen parameters, compared to a baseline of 5.52 percent in traditional parameter-efficient fine-tuning (PEFT). To further improve DEFT’s performance and decrease existing models’ inference time, we additionally propose an adaptive consistency loss. Consistency training distills slow but high-performing diffusion models into a fast one while retaining performance by enforcing consistencies along the inference path. Inspired by constrained optimization, instead of distillation, we combine the consistency loss and the denoising score matching loss in a data-adaptive manner for fine-tuning existing VTO models at a low cost. Empirical results show the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks, with as few as 15 denoising steps, while maintaining competitive results.

[32] Adversarial Appearance Learning in Augmented Cityscapes for Pedestrian Recognition in Autonomous Driving

Artem Savkin,Thomas Lapotre,Kevin Strauss,Uzair Akbar,Federico Tombari

Main category: cs.CV

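TL;DR: The paper proposes using adversarial learning to generate more realistic synthetic data for improving pedestrian recognition in autonomous driving, and evaluates it on the Cityscapes dataset.

Details Motivation: Autonomous driving needs large amounts of scenario-specific data, but there is a domain gap between synthetic and real data that hurts generalization. The paper aims to reduce this gap with data augmentation and adversarial learning.

Contribution: 1) A generative network architecture for adversarial learning of the dataset's lighting conditions, making synthetic data more realistic; 2) A pipeline for augmenting the Cityscapes dataset with virtual pedestrians.

Method: 1) Generate custom traffic scenarios via data augmentation; 2) Use adversarial learning to refine the lighting of the synthetic data; 3) Evaluate the approach on semantic and instance segmentation.

Result: Experiments show that adversarial learning makes the synthetic data markedly more realistic and in turn improves pedestrian recognition.

Insight: Adversarial learning can mitigate the domain gap between synthetic and real data and provide higher-quality training data for pedestrian recognition in autonomous driving.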

Abstract: In the autonomous driving area, synthetic data is crucial for covering specific traffic scenarios that an autonomous vehicle must handle. This data commonly introduces a domain gap between the synthetic and real domains. In this paper, we deploy data augmentation to generate custom traffic scenarios with vulnerable road users (VRUs) in order to improve pedestrian recognition. We provide a pipeline for augmentation of the Cityscapes dataset with virtual pedestrians. In order to improve the augmentation realism of the pipeline, we present a novel generative network architecture for adversarial learning of the dataset lighting conditions. We also evaluate our approach on the tasks of semantic and instance segmentation.

[33] FunKAN: Functional Kolmogorov-Arnold Network for Medical Image Enhancement and Segmentation

Maksim Penkin,Andrey Krylov

Main category: cs.CV

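TL;DR: FunKAN is a functional Kolmogorov-Arnold network designed for medical image enhancement and segmentation. It learns inner functions via Fourier decomposition over Hermite basis functions, improving interpretability while preserving the spatial structure of images.

Details Motivation: Address the limited interpretability of traditional deep learning in medical image processing while using the mathematical framework of the Kolmogorov-Arnold theorem without disrupting the spatial structure of images.

Contribution: Proposes the FunKAN and U-FunKAN models, generalizes the Kolmogorov-Arnold theorem to functional spaces, and demonstrates strong performance on medical image enhancement and segmentation.

Method: Builds on the Kolmogorov-Arnold theorem in function spaces and learns inner functions using Fourier decomposition over Hermite basis functions, improving interpretability and performance.

Result: On the IXI, BUSI, GlaS, and CVC-ClinicDB datasets, FunKAN outperforms other KAN-based methods in image enhancement (PSNR, TV) and segmentation (IoU, F1).

Insight: By connecting theoretical function approximation with medical image analysis, FunKAN offers a robust, interpretable solution for clinical use.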

Abstract: Medical image enhancement and segmentation are critical yet challenging tasks in modern clinical practice, constrained by artifacts and complex anatomical variations. Traditional deep learning approaches often rely on complex architectures with limited interpretability. While Kolmogorov-Arnold networks offer interpretable solutions, their reliance on flattened feature representations fundamentally disrupts the intrinsic spatial structure of imaging data. To address this issue we propose a Functional Kolmogorov-Arnold Network (FunKAN) – a novel interpretable neural framework, designed specifically for image processing, that formally generalizes the Kolmogorov-Arnold representation theorem onto functional spaces and learns inner functions using Fourier decomposition over the basis Hermite functions. We explore FunKAN on several medical image processing tasks, including Gibbs ringing suppression in magnetic resonance images, benchmarking on IXI dataset. We also propose U-FunKAN as state-of-the-art binary medical segmentation model with benchmarks on three medical datasets: BUSI (ultrasound images), GlaS (histological structures) and CVC-ClinicDB (colonoscopy videos), detecting breast cancer, glands and polyps, respectively. Experiments on those diverse datasets demonstrate that our approach outperforms other KAN-based backbones in both medical image enhancement (PSNR, TV) and segmentation (IoU, F1). Our work bridges the gap between theoretical function approximation and medical image analysis, offering a robust, interpretable solution for clinical applications.

[34] Multimodal Hate Detection Using Dual-Stream Graph Neural Networks

Jiangbei Yue,Shuonan Yang,Tailin Chen,Jianbo Jiao,Zeyu Fu

Main category: cs.CV

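TL;DR: Proposes a multimodal hateful video detection model based on a dual-stream graph neural network, which uses an instance graph and a weight graph to extract features and importance weights respectively, improving classification performance and explainability.

Details Motivation: Existing multimodal methods fail to emphasize hateful content and do not systematically model structured information in videos, which limits detection performance.

Contribution: 1. A dual-stream graph neural network model that separates instance features and fuses them with weights; 2. Graph-based modeling of relationships within and across modalities; 3. Experiments demonstrating advantages in classification and explainability.

Method: 1. Build an instance graph to extract instance-level features; 2. Assign importance weights through a complementary weight graph; 3. Combine weights and features to produce video labels.

Result: Achieves SOTA performance on public datasets with strong explainability.

Insight: Multimodal modeling that highlights hateful instances and captures structured relationships is the key direction for improvement.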

Abstract: Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video’s category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.

[35] ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors

Romain Hardy,Tyler Berzin,Pranav Rajpurkar

Main category: cs.CV

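TL;DR: ColonCrafter is a diffusion-based depth estimation model that generates temporally consistent depth maps from monocular colonoscopy videos. It learns geometric priors from synthetic data, uses style transfer, and achieves zero-shot SOTA on the C3VD dataset.

Details Motivation: 3D scene understanding in colonoscopy requires automated depth estimation, but existing methods fall short on temporal consistency.

Contribution: 1. Proposes ColonCrafter, which uses diffusion priors to generate temporally consistent depth maps; 2. Introduces a style transfer technique that adapts real videos to the synthetic training domain; 3. Achieves zero-shot SOTA performance on the C3VD dataset.

Method: 1. Learn geometric priors from synthetic colonoscopy sequences; 2. Apply style transfer that preserves geometric structure while adapting real videos.

Result: Outperforms both general-purpose and endoscopy-specific methods on the C3VD dataset and supports 3D point cloud generation and surface coverage assessment.

Insight: Diffusion models can be used for temporally consistent depth estimation in medicine, and style transfer helps with domain adaptation.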

Abstract: Three-dimensional (3D) scene understanding in colonoscopy presents significant challenges that necessitate automated methods for accurate depth estimation. However, existing depth estimation models for endoscopy struggle with temporal consistency across video sequences, limiting their applicability for 3D reconstruction. We present ColonCrafter, a diffusion-based depth estimation model that generates temporally consistent depth maps from monocular colonoscopy videos. Our approach learns robust geometric priors from synthetic colonoscopy sequences to generate temporally consistent depth maps. We also introduce a style transfer technique that preserves geometric structure while adapting real clinical videos to match our synthetic training domain. ColonCrafter achieves state-of-the-art zero-shot performance on the C3VD dataset, outperforming both general-purpose and endoscopy-specific approaches. Although full trajectory 3D reconstruction remains a challenge, we demonstrate clinically relevant applications of ColonCrafter, including 3D point cloud generation and surface coverage assessment.

[36] MemGS: Memory-Efficient Gaussian Splatting for Real-Time SLAM

Yinlong Bai,Hongxin Zhang,Sheng Zhong,Junkai Niu,Hai Li,Yijia He,Yi Zhou

Main category: cs.CV

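TL;DR: MemGS is a memory-efficient 3D Gaussian Splatting method for real-time SLAM on embedded platforms. It merges redundant Gaussian primitives in voxel space and uses Patch-Grid point sampling to improve rendering quality.

Details Motivation: Existing 3DGS research mostly targets high-performance GPUs and overlooks the resource limits of embedded devices such as micro air vehicles. MemGS addresses real-time SLAM under limited memory and compute.

Contribution: 1. A voxel-space merging method based on geometric similarity that reduces the memory footprint of redundant Gaussian primitives; 2. Patch-Grid point sampling to initialize 3D Gaussian primitives, improving rendering quality.

Method: 1. Identify redundant Gaussian primitives in SLAM and merge them in voxel space based on geometric similarity; 2. Initialize Gaussian primitives with Patch-Grid point sampling.

Result: Experiments on public datasets show that MemGS reduces memory usage while improving rendering quality, without affecting real-time performance.

Insight: With efficient memory management and optimized Gaussian primitives, embedded platforms can achieve high-quality real-time SLAM without relying on high-performance GPUs.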

Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have made a significant impact on rendering and reconstruction techniques. Current research predominantly focuses on improving rendering performance and reconstruction quality using high-performance desktop GPUs, largely overlooking applications for embedded platforms like micro air vehicles (MAVs). These devices, with their limited computational resources and memory, often face a trade-off between system performance and reconstruction quality. In this paper, we improve existing methods in terms of GPU memory usage while enhancing rendering quality. Specifically, to address redundant 3D Gaussian primitives in SLAM, we propose merging them in voxel space based on geometric similarity. This reduces GPU memory usage without impacting system runtime performance. Furthermore, rendering quality is improved by initializing 3D Gaussian primitives via Patch-Grid (PG) point sampling, enabling more accurate modeling of the entire scene. Quantitative and qualitative evaluations on publicly available datasets demonstrate the effectiveness of our improvements.
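A rough sketch of merging redundant Gaussian primitives in voxel space is given below. The abstract does not define "geometric similarity", so this example assumes a simple stand-in: primitives in the same voxel are merged when their isotropic scales differ by at most `scale_tol`. The function name, the dictionary layout, and the opacity-weighted averaging are all illustrative assumptions, not the MemGS algorithm.

```python
from collections import defaultdict


def merge_gaussians_in_voxels(gaussians, voxel_size, scale_tol):
    """Merge similar Gaussians that fall into the same voxel.

    gaussians: list of dicts with 'mean' (x, y, z), 'scale', 'opacity' (> 0).
    Assumption: similarity is a scale difference of at most scale_tol.
    """
    # Bucket primitives by the integer voxel index of their center.
    buckets = defaultdict(list)
    for g in gaussians:
        key = tuple(int(c // voxel_size) for c in g["mean"])
        buckets[key].append(g)

    merged = []
    for group in buckets.values():
        # Greedy clustering inside one voxel, compared to each cluster's first member.
        clusters = []
        for g in group:
            for cluster in clusters:
                if abs(cluster[0]["scale"] - g["scale"]) <= scale_tol:
                    cluster.append(g)
                    break
            else:
                clusters.append([g])
        # Replace each cluster by one opacity-weighted primitive.
        for cluster in clusters:
            total = sum(g["opacity"] for g in cluster)
            mean = tuple(
                sum(g["opacity"] * g["mean"][i] for g in cluster) / total
                for i in range(3)
            )
            scale = sum(g["opacity"] * g["scale"] for g in cluster) / total
            merged.append({"mean": mean, "scale": scale, "opacity": min(1.0, total)})
    return merged


gaussians = [
    {"mean": (0.10, 0.10, 0.10), "scale": 0.05, "opacity": 0.5},
    {"mean": (0.20, 0.10, 0.10), "scale": 0.06, "opacity": 0.5},
    {"mean": (0.15, 0.12, 0.10), "scale": 0.50, "opacity": 0.8},
    {"mean": (2.10, 0.10, 0.10), "scale": 0.05, "opacity": 0.9},
]
merged = merge_gaussians_in_voxels(gaussians, voxel_size=1.0, scale_tol=0.02)
```

In this toy input the first two primitives share a voxel and have similar scales, so they collapse into one, while the large-scale primitive in the same voxel and the primitive in a distant voxel are kept.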

[37] Dynamic Aware: Adaptive Multi-Mode Out-of-Distribution Detection for Trajectory Prediction in Autonomous Vehicles

Tongfei Guo,Lili Su

Main category: cs.CV

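TL;DR: Proposes a dynamics-aware, adaptive multi-mode OOD detection framework for trajectory prediction in autonomous driving, with substantial improvements in detection delay and false alarm rate.

Details Motivation: Trajectory prediction models in autonomous driving face distribution shift in real-world scenarios. Prior OOD detection work has concentrated on computer vision tasks, while trajectory-level OOD detection remains underexplored.

Contribution: An adaptive multi-mode OOD detection framework that explicitly models error modes that change over time, substantially improving detection performance.

Method: Extends the OOD detection framework based on the quickest change detection (QCD) formulation and introduces adaptive mechanisms to cope with dynamic changes in complex driving environments.

Result: Experiments on multiple real-world datasets show that the method significantly outperforms existing UQ-based and vision-based OOD methods in detection delay and false alarm rate.

Insight: Prediction errors, even on in-distribution samples, exhibit mode-dependent distributions that evolve over time. Explicitly modeling these error modes is key to better OOD detection.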

Abstract: Trajectory prediction is central to the safe and seamless operation of autonomous vehicles (AVs). In deployment, however, prediction models inevitably face distribution shifts between training data and real-world conditions, where rare or underrepresented traffic scenarios induce out-of-distribution (OOD) cases. While most prior OOD detection research in AVs has concentrated on computer vision tasks such as object detection and segmentation, trajectory-level OOD detection remains largely underexplored. A recent study formulated this problem as a quickest change detection (QCD) task, providing formal guarantees on the trade-off between detection delay and false alarms [1]. Building on this foundation, we propose a new framework that introduces adaptive mechanisms to achieve robust detection in complex driving environments. Empirical analysis across multiple real-world datasets reveals that prediction errors – even on in-distribution samples – exhibit mode-dependent distributions that evolve over time with dataset-specific dynamics. By explicitly modeling these error modes, our method achieves substantial improvements in both detection delay and false alarm rates. Comprehensive experiments on established trajectory prediction benchmarks show that our framework significantly outperforms prior UQ- and vision-based OOD approaches in both accuracy and computational efficiency, offering a practical path toward reliable, driving-aware autonomy.
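To make the quickest change detection (QCD) formulation concrete, here is a sketch of the classical CUSUM test applied to a stream of prediction errors. It is the textbook single-mode procedure under an assumed Gaussian error model with known pre-change mean `mu0`, post-change mean `mu1`, and shared standard deviation `sigma`. It is not the adaptive multi-mode method of the paper, and all parameter values are made up.

```python
def cusum_detect(errors, mu0, mu1, sigma, threshold):
    """Return the first index where the CUSUM statistic reaches threshold, else None.

    Assumption: errors are Gaussian with mean mu0 before the change and mu1 after,
    both with standard deviation sigma.
    """
    stat = 0.0
    for t, x in enumerate(errors):
        # Log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2).
        llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / (sigma ** 2)
        # Reset at zero so that evidence accumulates only after a change.
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


# Ten in-distribution errors followed by ten shifted errors (synthetic data).
errors = [0.5] * 10 + [1.5] * 10
alarm = cusum_detect(errors, mu0=0.5, mu1=1.5, sigma=0.5, threshold=5.0)
no_alarm = cusum_detect([0.5] * 20, mu0=0.5, mu1=1.5, sigma=0.5, threshold=5.0)
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract refers to. A mode-aware variant would presumably keep separate error statistics per mode rather than a single `mu0` and `sigma`.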

[38] Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of Change Detection

Nathalie Neptune,Josiane Mothe

Main category: cs.CV

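TL;DR: Presents a deep-learning method for change detection in satellite images of the Amazon rain forest and annotates the changed regions with keywords extracted from a corpus. The method is effective for environmental monitoring and is generic.

Details Motivation: The Amazon rain forest is vital to global climate and biodiversity, yet deforestation is severe. Traditional monitoring is inefficient, so automated tools are needed.

Contribution: 1. A deep-learning method for change detection in satellite images; 2. A visual semantic model that automatically annotates changed regions with keywords; 3. Validation of the method on Amazon rain forest monitoring.

Method: Uses satellite image pairs, comparing images of the same area at different dates with deep learning to detect changes in forest cover. Keywords extracted from a specialized corpus of scientific literature are then used to annotate the changed regions.

Result: The method is validated on an Amazon rain forest dataset, where it detects deforestation and generates relevant annotations.

Insight: The method is not limited to environmental monitoring and can be extended to other domains, showing generality and practical potential.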

Abstract: The Amazon rain forest is a vital ecosystem that plays a crucial role in regulating the Earth’s climate and providing habitat for countless species. Deforestation in the Amazon is a major concern as it has a significant impact on global carbon emissions and biodiversity. In this paper, we present a method for detecting deforestation in the Amazon using image pairs from Earth observation satellites. Our method leverages deep learning techniques to compare the images of the same area at different dates and identify changes in the forest cover. We also propose a visual semantic model that automatically annotates the detected changes with relevant keywords. The candidate annotations for images are extracted from scientific documents related to the Amazon region. We evaluate our approach on a dataset of Amazon image pairs and demonstrate its effectiveness in detecting deforestation and generating relevant annotations. Our method provides a useful tool for monitoring and studying the impact of deforestation in the Amazon. While we focus on environmental applications of our work by using images of deforestation in the Amazon rain forest to demonstrate the effectiveness of our proposed approach, it is generic enough to be applied to other domains.

[39] Intelligent Healthcare Imaging Platform: An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation

Samer Al-Hamadani

Main category: cs.CV

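TL;DR: Proposes an intelligent medical image analysis framework based on Vision-Language Models (VLMs), using Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities.

Details Motivation: The rapid progress of AI in medical imaging opens new opportunities for diagnosis and decision-making, and calls for an automated, multimodal analysis tool that improves efficiency and accuracy.

Contribution: A VLM framework that combines visual feature extraction with natural language processing, supporting multimodal image analysis and clinical report generation, with zero-shot learning capability.

Method: Uses Google Gemini 2.5 Flash to extract visual features, analyzes anomaly distribution through coordinate verification and Gaussian modeling, and generates structured reports with multi-layered visualization and prompt engineering.

Result: Experiments report high performance in anomaly detection across multiple modalities, with location measurement showing an average deviation of 80 pixels. A user-friendly Gradio interface eases clinical integration.

Insight: The framework shows the potential of AI in medical imaging, but further clinical validation and multi-center evaluation are needed before widespread adoption.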

Abstract: The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

[40] Federated Learning for Deforestation Detection: A Distributed Approach with Satellite Imagery

Yuvraj Dutta,Aaditya Sikder,Basabdatta Palit

Main category: cs.CV

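TL;DR: Proposes a distributed, federated-learning approach to detecting deforestation in satellite imagery, using the FLOWER and RAY frameworks for distributed learning while protecting client data privacy and security.

Details Motivation: Traditional centralized training requires pooling data, which can endanger the security and privacy of client data, so a distributed approach is needed.

Contribution: A federated-learning framework for deforestation detection that uses FLOWER and RAY for efficient distributed learning and evaluates several models (for example YOLOS-small and Faster R-CNN).

Method: Runs the distributed learning workload with the FLOWER and RAY frameworks, using YOLOS-small (a Vision Transformer variant), Faster R-CNN with a ResNet50 backbone, and Faster R-CNN with a MobileNetV3 backbone, trained and tested on publicly available datasets.

Result: The framework carries out the distributed learning task while protecting client data privacy and security, and offers a new perspective on image segmentation-based tasks on satellite imagery.

Insight: Federated learning has potential for satellite image analysis: it enables efficient collaborative training without sharing data and can inform other geospatial tasks.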

Abstract: Accurate identification of deforestation from satellite images is essential in order to understand the geographical situation of an area. This paper introduces a new distributed approach to identify as well as locate deforestation across different clients using Federated Learning (FL). Federated Learning enables distributed network clients to collaboratively train a model while maintaining data privacy and security of the active users. In our framework, a client corresponds to an edge satellite center responsible for local data processing. Moreover, FL provides an advantage over centralized training methods, which require combining data and thereby compromise the data security of the clients. Our framework leverages the FLOWER framework with the RAY framework to execute the distributed learning workload. Furthermore, efficient client spawning is ensured by RAY, as it can select a definite number of users to create an emulation environment. Our FL framework uses YOLOS-small (a Vision Transformer variant), Faster R-CNN with a ResNet50 backbone, and Faster R-CNN with a MobileNetV3 backbone, all trained and tested on publicly available datasets. Our approach offers a different perspective on image segmentation-based tasks on satellite imagery.
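The collaborative training idea can be illustrated with federated averaging (FedAvg), a common aggregation rule in which the server averages client parameters weighted by local dataset size. The abstract does not say which aggregation rule the authors use, so FedAvg is an assumption here, and the sketch is plain Python that does not call the FLOWER or RAY APIs.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes: number of local training samples per client.
    Assumption: FedAvg is the aggregation rule (not stated in the abstract).
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical edge satellite centers with different amounts of local data.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [100, 300]
global_weights = federated_average(client_weights, client_sizes)
```

Only parameter vectors leave each client, while the raw satellite images stay local. This is the privacy argument the abstract makes against centralized training.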

[41] Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction

Yumin Li,Dylan Campbell

Main category: cs.CV

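TL;DR: GARPS is a training-free method that tackles relative camera pose estimation through single-view reconstruction. It aligns 3D Gaussian Mixture Models (GMMs) to refine camera pose at metric scale and clearly outperforms existing methods.

Details Motivation: Conventional two-view pose estimation is not metric and performs poorly with wide baselines or textureless regions. GARPS combines single-view reconstruction with multi-view geometry to provide robust metric pose estimation.

Contribution: The GARPS framework, which combines single-view reconstruction with multi-view geometric alignment for robust pose estimation at metric scale, and a differentiable GMM alignment objective that jointly considers geometry, colour, and semantic information.

Method: 1. Use a metric monocular depth estimator and a 3D Gaussian scene reconstructor to produce a metric GMM for each image; 2. Refine the initial pose estimate by optimising a GMM alignment objective that combines geometry, colour, covariance, and semantic features.

Result: On the Real-Estate10K dataset, GARPS outperforms classical methods and state-of-the-art learning-based methods such as MASt3R, confirming its robustness and accuracy.

Insight: Combining single-view perception with multi-view geometry offers a new route to metric pose estimation without relying on explicit 2D correspondences or large-scale training data.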

Abstract: Estimating metric relative camera pose from a pair of images is of great importance for 3D reconstruction and localisation. However, conventional two-view pose estimation methods are not metric, with camera translation known only up to a scale, and struggle with wide baselines and textureless or reflective surfaces. This paper introduces GARPS, a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. GARPS leverages a metric monocular depth estimator and a Gaussian scene reconstructor to obtain a metric 3D Gaussian Mixture Model (GMM) for each image. It then refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. This objective jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency, and is robust to occlusions and texture-poor regions without requiring explicit 2D correspondences. Extensive experiments on the Real-Estate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods, including MASt3R. These results highlight the potential of bridging single-view perception with multi-view geometry to achieve robust and metric relative pose estimation.

[42] Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation

Xiaobo Yang,Xiaojin Gong

Main category: cs.CV

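TL;DR: Proposes a semantic visual projector that uses semantic superpixels generated by SAM to compress visual tokens, markedly reducing computational cost while preserving semantic clarity.

Details Motivation: Existing Referring Image Segmentation frameworks that pair an MLLM with SAM are computationally expensive, mainly because of redundant visual tokens. Traditional methods struggle to balance reducing the number of tokens against preserving semantic clarity.

Contribution: A semantic visual projector that compresses visual tokens with semantic superpixels and adapts the token sequence length dynamically, plus a new positional embedding and aggregator that preserve fine detail and global context.

Method: Uses SAM to generate semantic superpixels as visual words, combined with a semantic superpixel positional embedding and aggregator, to reduce token redundancy while retaining semantic information.

Result: Experiments show that the method cuts visual tokens by 93% without compromising performance, notably speeds up training and inference, and outperforms existing compressive projectors on RIS.

Insight: Segmenting an image into semantic superpixels and compressing their representation reduces computational cost efficiently while keeping semantic information intact.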

Abstract: Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify “visual words” in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM’s awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by 93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.
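The compression idea, where one visual token is produced per superpixel instead of per patch, can be sketched with plain mean pooling. This is only an illustration: the paper's projector uses a dedicated aggregator and positional embedding, whereas the sketch below assumes a precomputed superpixel id for every patch and simply averages patch features. All names are hypothetical.

```python
def pool_patches_by_superpixel(patch_features, superpixel_ids):
    """Mean-pool patch features that share a superpixel id.

    patch_features: list of equal-length feature vectors, one per patch.
    superpixel_ids: list of ints assigning each patch to a superpixel.
    Assumption: mean pooling stands in for the paper's learned aggregator.
    """
    sums, counts = {}, {}
    for feat, sp in zip(patch_features, superpixel_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
            counts[sp] = 0
        sums[sp] = [a + b for a, b in zip(sums[sp], feat)]
        counts[sp] += 1
    return {sp: [v / counts[sp] for v in total] for sp, total in sums.items()}


# Six patches that belong to two superpixels (toy features).
patch_features = [
    [1.0, 0.0], [3.0, 0.0], [0.0, 2.0],
    [0.0, 4.0], [0.0, 6.0], [2.0, 0.0],
]
superpixel_ids = [0, 0, 1, 1, 1, 0]

tokens = pool_patches_by_superpixel(patch_features, superpixel_ids)
reduction = 1.0 - len(tokens) / len(patch_features)
```

The number of output tokens equals the number of superpixels, so a simple scene yields a short sequence and a cluttered scene a longer one, which is the adaptive behaviour the abstract describes.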

[43] FishBEV: Distortion-Resilient Bird’s Eye View Segmentation with Surround-View Fisheye Cameras

Hang Li,Dianmo Sheng,Qiankun Dong,Zichun Wang,Zhiwei Xu,Tao Li

Main category: cs.CV

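TL;DR: FishBEV is a BEV segmentation framework designed for fisheye cameras. Three new modules address severe geometric distortion, ambiguous multi-view correspondences, and temporal instability, yielding clear performance gains.

Details Motivation: Existing BEV segmentation methods perform poorly on fisheye cameras, mainly because of severe geometric distortion, ambiguous multi-view correspondences, and temporal instability. A robust solution is needed.

Contribution: Three modules: DRME (Distortion-Resilient Multi-scale Extraction), U-SCA (Uncertainty-aware Spatial Cross-Attention), and D-TSA (Distance-aware Temporal Self-Attention), which together improve BEV segmentation with fisheye cameras.

Method: DRME learns robust features while preserving scale consistency, U-SCA uses uncertainty estimation for reliable multi-view alignment, and D-TSA adaptively balances near-field and far-field content for temporal coherence.

Result: Experiments on the Synwoodscapes dataset show that FishBEV clearly outperforms existing SOTA methods on surround-view fisheye BEV segmentation.

Insight: Geometric distortion and multi-view alignment are the main challenges for BEV segmentation with fisheye cameras, and combining uncertainty estimation with temporal modeling improves performance.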

Abstract: As a cornerstone technique for autonomous driving, Bird’s Eye View (BEV) segmentation has recently achieved remarkable progress with pinhole cameras. However, it is non-trivial to extend the existing methods to fisheye cameras with severe geometric distortion, ambiguous multi-view correspondences and unstable temporal dynamics, all of which significantly degrade BEV performance. To address these challenges, we propose FishBEV, a novel BEV segmentation framework specifically tailored for fisheye cameras. This framework introduces three complementary innovations: a Distortion-Resilient Multi-scale Extraction (DRME) backbone that learns robust features under distortion while preserving scale consistency, an Uncertainty-aware Spatial Cross-Attention (U-SCA) mechanism that leverages uncertainty estimation for reliable cross-view alignment, and a Distance-aware Temporal Self-Attention (D-TSA) module that adaptively balances near-field details and far-field context to ensure temporal coherence. Extensive experiments on the Synwoodscapes dataset demonstrate that FishBEV consistently outperforms SOTA baselines on surround-view fisheye BEV segmentation tasks.

[44] Taylor-Series Expanded Kolmogorov-Arnold Network for Medical Imaging Classification

Kaniz Fatema,Emad A. Mohammed,Sukhjit Singh Sehra

Main category: cs.CV

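TL;DR: Proposes spline-based Kolmogorov-Arnold Networks (KANs) for medical image classification. Combining B-splines with Taylor series and related techniques sharply reduces the parameter count while improving model performance.

Details Motivation: Medical image classification is challenging in resource-limited clinical settings and needs efficient, interpretable models. Traditional CNNs have many parameters and depend on preprocessing, whereas KANs model nonlinear relationships directly from raw data and offer a lighter solution.

Contribution: 1. Three spline-based KAN variants, including SBTAYLOR-KAN, that combine B-splines with Taylor series, radial basis functions, and wavelet transforms; 2. A drastically reduced parameter count (only 2,872) with performance comparable to traditional CNNs; 3. Interpretability through Grad-CAM.

Method: 1. Build KANs from B-spline basis functions combined with Taylor series, radial basis functions, or wavelet transforms; 2. Train directly on raw data without preprocessing; 3. Use Grad-CAM for visual explanation.

Result: SBTAYLOR-KAN performs well on multiple datasets, reaching up to 98.93% accuracy and staying above 86% accuracy in the data reduction experiments. Its parameter count is roughly one ten-thousandth of that of a traditional CNN.

Insight: 1. Spline basis functions model both local and global nonlinearities effectively; 2. KANs generalize well in data-scarce scenarios; 3. The lightweight design suits resource-constrained medical environments.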

Abstract: Effective and interpretable classification of medical images is a challenge in computer-aided diagnosis, especially in resource-limited clinical settings. This study introduces spline-based Kolmogorov-Arnold Networks (KANs) for accurate medical image classification with limited, diverse datasets. The models include SBTAYLOR-KAN, integrating B-splines with Taylor series; SBRBF-KAN, combining B-splines with Radial Basis Functions; and SBWAVELET-KAN, embedding B-splines in Morlet wavelet transforms. These approaches leverage spline-based function approximation to capture both local and global nonlinearities. The models were evaluated on brain MRI, chest X-rays, tuberculosis X-rays, and skin lesion images without preprocessing, demonstrating the ability to learn directly from raw data. Extensive experiments, including cross-dataset validation and data reduction analysis, showed strong generalization and stability. SBTAYLOR-KAN achieved up to 98.93% accuracy, with a balanced F1-score, maintaining over 86% accuracy using only 30% of the training data across three datasets. Despite class imbalance in the skin cancer dataset, experiments on both imbalanced and balanced versions showed SBTAYLOR-KAN outperforming other models, achieving 68.22% accuracy. Unlike traditional CNNs, which require millions of parameters (e.g., ResNet50 with 24.18M), SBTAYLOR-KAN achieves comparable performance with just 2,872 trainable parameters, making it more suitable for constrained medical environments. Gradient-weighted Class Activation Mapping (Grad-CAM) was used for interpretability, highlighting relevant regions in medical images. This framework provides a lightweight, interpretable, and generalizable solution for medical image classification, addressing the challenges of limited datasets and data-scarce scenarios in clinical AI applications.
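In a KAN, each edge between an input and an output carries its own learnable univariate function, and an output is the sum of its incoming edge functions. The sketch below shows only the Taylor-series part of that idea: every edge function is a truncated Taylor polynomial whose coefficients would be the trainable parameters. The B-spline component of SBTAYLOR-KAN is omitted, and the layer layout and names (`taylor_edge`, `kan_layer`, `coeff_table`) are assumptions made for illustration.

```python
def taylor_edge(x, coeffs, center=0.0):
    """Evaluate sum_k coeffs[k] * (x - center)**k with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * (x - center) + c
    return result


def kan_layer(inputs, coeff_table):
    """Compute one KAN-style layer.

    coeff_table[j][i] holds the Taylor coefficients of the edge function
    from input i to output j; output j is the sum over its edges.
    Assumption: pure Taylor edges, without the paper's B-spline term.
    """
    return [
        sum(taylor_edge(x, edge_coeffs) for x, edge_coeffs in zip(inputs, row))
        for row in coeff_table
    ]


# One output fed by two inputs: 1 + 2x + 3x^2 on the first, x on the second.
coeff_table = [[[1.0, 2.0, 3.0], [0.0, 1.0]]]
single_edge = taylor_edge(2.0, [1.0, 2.0, 3.0])
outputs = kan_layer([2.0, 1.0], coeff_table)
```

Because every edge needs only a handful of coefficients, the total parameter count grows with the number of edges times the polynomial order. This may help explain how such models stay small, although the sketch says nothing about how the reported 2,872 parameters are distributed.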

[45] StyleProtect: Safeguarding Artistic Identity in Fine-tuned Diffusion Models

Qiuyu Tang,Joshua Krinsky,Aparna Bharati

Main category: cs.CV

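TL;DR: Proposes StyleProtect, which protects the distinctive style of artworks against malicious mimicry by selectively updating cross-attention layers in diffusion models.

Details Motivation: With the rapid progress of generative models, especially diffusion models, they can be misused to replicate an artist's distinctive style cheaply, infringing on the artist's creative labor and personal vision. This creates a need to protect artistic style.

Contribution: StyleProtect, which analyzes how sensitive cross-attention layers are to artistic style and updates only the key layers for efficient, lightweight style protection.

Method: Measures the activation strength of attention layers in response to style and content representations, identifies the style-sensitive cross-attention layers, and selectively updates those layers to defend against style mimicry.

Result: Experiments show that StyleProtect performs well at protecting artistic and anime styles from malicious customization while maintaining competitive imperceptibility.

Insight: The sensitivity of cross-attention layers to artistic style is the key: updating only these layers gives efficient style protection and avoids the computational cost of updating the whole model.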

Abstract: The rapid advancement of generative models, particularly diffusion-based approaches, has inadvertently facilitated their potential for misuse. Such models enable malicious exploiters to replicate artistic styles that capture an artist’s creative labor, personal vision, and years of dedication in an inexpensive manner. This has led to a growing need for, and exploration of, methods for protecting artworks against style mimicry. Although generic diffusion models can easily mimic an artistic style, fine-tuning amplifies this capability, enabling the model to internalize and reproduce the style with higher fidelity and control. We hypothesize that certain cross-attention layers exhibit heightened sensitivity to artistic styles. Sensitivity is measured through the activation strengths of attention layers in response to style and content representations, and by assessing their correlations with features extracted from external models. Based on our findings, we introduce an efficient and lightweight protection strategy, StyleProtect, that achieves effective style defense against fine-tuned diffusion models by updating only selected cross-attention layers. Our experiments utilize a carefully curated artwork dataset based on WikiArt, comprising representative works from 30 artists known for their distinctive and influential styles, and cartoon animations from the Anita dataset. The proposed method demonstrates promising performance in safeguarding unique styles of artworks and anime from malicious diffusion customization, while maintaining competitive imperceptibility.

[46] UM-Depth: Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry

Tae-Wook Um,Ki-Hyeon Kim,Hyun-Duck Choi,Hyo-Sung Ahn

Main category: cs.CV

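TL;DR: UM-Depth is a self-supervised monocular depth estimation framework that combines motion awareness and uncertainty awareness. A teacher-student training strategy improves depth accuracy at dynamic object boundaries and in textureless regions, with no extra labels or runtime overhead.

Details Motivation: Self-supervised monocular depth estimation performs poorly in dynamic and textureless regions, mainly because of uncertainty in the input data. Existing methods usually rely on additional labels or auxiliary networks, which adds complexity and overhead.

Contribution: 1. The UM-Depth framework, which combines motion awareness and uncertainty awareness; 2. A teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and the network architecture; 3. Optical flow is used only during training, so no extra labels or runtime cost are required.

Method: 1. The teacher network generates depth and uncertainty maps; 2. The student network refines its depth estimates using an uncertainty mask; 3. Optical flow is used for motion awareness during training only.

Result: The method is validated on the KITTI and Cityscapes datasets and achieves state-of-the-art results in self-supervised depth and pose estimation on KITTI.

Insight: Uncertainty awareness can compensate for weak photometric signals in self-supervised training, and the teacher-student strategy is key to efficient training.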

Abstract: Monocular depth estimation has been increasingly adopted in robotics and autonomous driving for its ability to infer scene geometry from a single camera. In self-supervised monocular depth estimation frameworks, the network jointly generates and exploits depth and pose estimates during training, thereby eliminating the need for depth labels. However, these methods remain challenged by uncertainty in the input data, such as low-texture or dynamic regions, which can cause reduced depth accuracy. To address this, we introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy at dynamic object boundaries and in textureless regions. Specifically, we develop a teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and network architecture, thereby strengthening supervision where photometric signals are weak. Unlike prior motion-aware approaches that incur inference-time overhead and rely on additional labels or auxiliary networks for real-time generation, our method uses optical flow exclusively within the teacher network during training, eliminating extra labeling demands and any runtime cost. Extensive experiments on the KITTI and Cityscapes datasets demonstrate the effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI dataset.

[47] Mitigating Query Selection Bias in Referring Video Object Segmentation

Dingwei Zhang,Dong Zhang,Jinhui Tang

Main category: cs.CV

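TL;DR: Proposes Triple Query Former (TQF) to address query selection bias in query-based Referring Video Object Segmentation (RVOS). The query is factorized into appearance, intra-frame interaction, and inter-frame motion components and built dynamically from linguistic and visual guidance, improving performance.

Details Motivation: Existing query-based RVOS methods rely on static queries and are easily misled by distractors with similar appearance or motion, which causes query selection bias. The paper addresses this with dynamic query design and motion-aware modules.

Contribution: 1) TQF, which factorizes the query into three dynamic components; 2) Intra-frame Interaction Aggregation and Inter-frame Motion Aggregation modules that enhance object representations; 3) Validation on multiple RVOS benchmarks.

Method: 1) TQF factorizes the query into an appearance query, an intra-frame interaction query, and an inter-frame motion query; 2) Queries are constructed dynamically from linguistic cues and visual guidance; 3) Two motion-aware modules are added: Intra-frame Interaction Aggregation and Inter-frame Motion Aggregation.

Result: TQF shows clear gains on multiple RVOS benchmarks, confirming the effectiveness of the structured query design and the motion-aware modules.

Insight: Dynamic query design and motion-aware modules mitigate query selection bias and make cross-modal alignment more robust.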

Abstract: Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.

[48] Improving Generalized Visual Grounding with Instance-aware Joint Learning

Ming Dai,Wenxuan Cheng,Jiang-Jiang Liu,Lingfeng Yang,Zhenhua Feng,Wankou Yang,Jingdong Wang

Main category: cs.CV

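TL;DR: InstanceVG is a multi-task generalized visual grounding framework with instance-aware capabilities. It trains GREC and GRES jointly, unifies instance-level box and mask prediction, and clearly outperforms existing methods.

Details Motivation: Existing methods usually handle GREC and GRES independently, overlooking the consistency gained from joint training and the importance of instance-aware capabilities.

Contribution: The InstanceVG framework, which, to the authors' knowledge, is the first to handle GREC and GRES jointly, using instance queries to predict consistent instance-level boxes and masks.

Method: A multi-task framework that assigns each instance query a prior reference point and predicts points, boxes, and masks in a unified way to ensure consistency.

Result: State-of-the-art performance on ten datasets across four tasks, significantly surpassing existing methods.

Insight: Joint training and instance-aware capabilities are critical to performance on generalized visual grounding tasks.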

Abstract: Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments conducted on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.

[49] Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval

Hao Yin,Xin Man,Feiyu Chen,Jie Shao,Heng Tao Shen

Main category: cs.CV

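TL;DR: The FMFA framework improves text-image cross-modal alignment through explicit fine-grained alignment combined with implicit relational reasoning, raising retrieval performance.

Details Motivation: In Text-to-Image Person Retrieval (TIPR), existing methods cannot verify whether local features are correctly aligned, and they focus on hard negative samples while neglecting incorrectly matched positive pairs. FMFA targets both problems.

Contribution: 1. The A-SDM module, which adaptively rectifies unmatched positive sample pairs; 2. The EFA module, which explicitly strengthens fine-grained alignment by sparsifying the similarity matrix; 3. State-of-the-art performance on three datasets.

Method: 1. A-SDM improves global alignment by adaptively pulling unmatched positive pairs closer; 2. EFA explicitly optimizes fine-grained interactions with a sparsified similarity matrix and a hard coding method.

Result: FMFA outperforms all global matching methods on three public datasets, achieving state-of-the-art performance.

Insight: Combining explicit fine-grained alignment with implicit relational reasoning ("full-mode") effectively improves cross-modal retrieval accuracy.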

Abstract: Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning (hence the term "full-mode") without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.
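The abstract mentions sparsifying the similarity matrix between text and image features but does not say how. One simple possibility, shown here purely as an assumption, is to keep the `k` largest entries in each row and zero the rest, so that each text token interacts only with its best-matching image regions. The function name and the toy values are hypothetical.

```python
def sparsify_top_k(similarity, k):
    """Keep the k largest entries of each row and set the others to 0.0.

    similarity: list of rows (text tokens) over columns (image regions).
    Assumption: row-wise top-k is the sparsification rule.
    """
    sparse = []
    for row in similarity:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        sparse.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return sparse


# Two text tokens scored against three image regions (toy values).
similarity = [
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.7],
]
sparse = sparsify_top_k(similarity, k=1)
```

After sparsification each row points to a single region, which makes the local alignment explicit and easy to inspect. Whether EFA uses top-k, a threshold, or another rule is not stated in the abstract.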

[50] Iterative Prompt Refinement for Safer Text-to-Image Generation

Jinwoo Jeon,JunHyeok Oh,Hayeong Lee,Byung-Jun Lee

Main category: cs.CV

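TL;DR: Proposes an iterative prompt refinement algorithm based on Vision Language Models (VLMs) that uses feedback from both the text and the generated image to make text-to-image (T2I) models safer while preserving user intent.

Details Motivation: Existing safety methods usually refine prompts with large language models (LLMs) but ignore the content of the generated images, which can lead to unsafe outputs or to excessive changes to prompts that were already safe.

Contribution: 1. An iterative prompt refinement algorithm that incorporates visual feedback; 2. A new dataset with both textual and visual safety signals; 3. Experiments showing improved safety without sacrificing alignment with user intent.

Method: A vision language model analyzes the input prompt and the generated image, and the prompt is refined iteratively, combined with multimodal supervised fine-tuning.

Result: Experiments show that the method produces safer images while keeping high alignment with user intent.

Insight: Visual feedback plays a crucial role in prompt refinement, and multimodal data and methods can markedly improve the safety and reliability of T2I models.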

Abstract: Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multimodal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.
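The control flow of an iterative refinement loop with visual feedback might look like the sketch below. It is a schematic under stated assumptions, not the released IPR code: `generate`, `judge`, and `rewrite` are hypothetical callables standing in for the T2I model and the VLM, and the toy stand-ins only manipulate strings so that the loop can be run without any model.

```python
def refine_prompt(prompt, generate, judge, rewrite, max_rounds=3):
    """Rewrite a prompt until the judge accepts both prompt and image.

    Returns (final_prompt, number_of_rewrites). All three callables are
    hypothetical placeholders for real models.
    """
    for rewrites in range(max_rounds):
        image = generate(prompt)
        verdict = judge(prompt, image)
        if verdict["safe"]:
            # Safe prompts are returned untouched, avoiding needless edits.
            return prompt, rewrites
        prompt = rewrite(prompt, verdict["reason"])
    # Budget exhausted: the last rewrite has not been re-checked.
    return prompt, max_rounds


# Toy stand-ins: the "image" is just a caption, and one term is blocked.
BLOCKED_TERMS = {"gore"}


def toy_generate(prompt):
    return {"caption": prompt}


def toy_judge(prompt, image):
    for term in image["caption"].split(", "):
        if term in BLOCKED_TERMS:
            return {"safe": False, "reason": term}
    return {"safe": True, "reason": None}


def toy_rewrite(prompt, reason):
    return ", ".join(t for t in prompt.split(", ") if t != reason)


unsafe_result = refine_prompt(
    "a battle scene, gore, oil painting", toy_generate, toy_judge, toy_rewrite
)
safe_result = refine_prompt(
    "a quiet harbor, oil painting", toy_generate, toy_judge, toy_rewrite
)
```

The judge sees the generated output rather than the prompt alone, which is the difference from text-only refinement that the abstract emphasizes. The second call shows an already safe prompt passing through unchanged.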

[51] Task-Aware Image Signal Processor for Advanced Visual Perception

Kai Chen,Jin Xiao,Leheng Zhang,Kexuan Shi,Shuhang Gu

Main category: cs.CV

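TL;DR: TA-ISP is a lightweight RAW-to-RGB framework that predicts multi-scale modulation operators tailored to visual perception tasks, markedly reducing computational overhead while improving task performance.

Details Motivation: Existing ISP approaches to RAW data suffer from either heavy computational overhead or limited representational capacity, which limits performance gains on visual perception tasks.

Contribution: Proposes the Task-Aware Image Signal Processor (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations through multi-scale modulation operators, improving both computational efficiency and task performance.

Method: TA-ISP predicts lightweight multi-scale modulation operators (global, regional, and pixel level) that reshape image statistics, avoiding heavy dense convolutional pipelines.

Result: On several RAW-domain detection and segmentation tasks, TA-ISP improves accuracy while markedly reducing parameter count and inference time.

Insight: Factorized multi-scale modulation balances computational cost against task performance and suits resource-constrained devices.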

Abstract: In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity. To address these issues, we propose Task-Aware Image Signal Processing (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.
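
To make the idea of modulation at global, regional, and pixel scales concrete, here is a small sketch. It assumes the simplest possible operators (one scalar gain, one gain per tile, one additive offset per pixel) with fixed values; in TA-ISP the operators are predicted by a network, and their actual form is not given in the abstract.

```python
def modulate(image, global_gain, region_gains, pixel_offsets, tile):
    """Apply three factorized modulations to a 2D image (a list of rows).

    global_gain:   one scalar for the whole image
    region_gains:  one gain per tile of size `tile` x `tile`
    pixel_offsets: one additive offset per pixel
    All three operator shapes are illustrative assumptions.
    """
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, value in enumerate(row):
            regional = region_gains[y // tile][x // tile]
            out_row.append(value * global_gain * regional + pixel_offsets[y][x])
        out.append(out_row)
    return out


raw = [[1.0, 2.0], [3.0, 4.0]]
result = modulate(
    raw,
    global_gain=2.0,
    region_gains=[[1.0, 0.5], [0.5, 1.0]],
    pixel_offsets=[[0.0, 0.0], [0.0, 1.0]],
    tile=1,
)
```

The three operators hold far fewer values than a dense convolutional stack would, which is the intuition behind keeping memory and latency low.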

[52] VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

Daiqi Liu,Tomás Arias-Vergara,Johannes Enk,Fangxu Xing,Maureen Stone,Jerry L. Prince,Jana Hutter,Andreas Maier,Jonghye Woo,Paula Andrea Pérez-Toro

Main category: cs.CV

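TL;DR: VocSegMRI is a multimodal framework that combines video, audio, and phonological inputs through cross-attention fusion and contrastive learning to improve vocal tract segmentation accuracy in real-time MRI.

Details Motivation: Existing methods rely mainly on visual information and overlook the complementary role of acoustic and phonological signals. Multimodal learning can segment vocal tract structures more precisely.

Contribution: 1. Proposes VocSegMRI, which integrates video, audio, and phonological inputs; 2. Introduces cross-attention fusion and contrastive learning to improve segmentation; 3. Achieves state-of-the-art performance on a subset of the USC-75 dataset.

Method: 1. A multimodal framework fuses video, audio, and phonological data; 2. Cross-attention aligns features dynamically; 3. A contrastive learning objective makes the representations more robust.

Result: A Dice score of 0.95 and an HD_95 of 4.20 mm, outperforming unimodal and multimodal baselines.

Insight: Multimodal modeling and contrastive learning markedly improve segmentation precision and robustness, and performance holds up even when audio is unavailable at inference.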

Abstract: Accurately segmenting articulatory structures in real-time magnetic resonance imaging (rtMRI) remains challenging, as most existing methods rely almost entirely on visual cues. Yet synchronized acoustic and phonological signals provide complementary context that can enrich visual information and improve precision. In this paper, we introduce VocSegMRI, a multimodal framework that integrates video, audio, and phonological inputs through cross-attention fusion for dynamic feature alignment. To further enhance cross-modal representation, we incorporate a contrastive learning objective that improves segmentation performance even when the audio modality is unavailable at inference. Evaluated on a subset of the USC-75 rtMRI dataset, our approach achieves state-of-the-art performance, with a Dice score of 0.95 and a 95th percentile Hausdorff Distance (HD_95) of 4.20 mm, outperforming both unimodal and multimodal baselines. Ablation studies confirm the contributions of cross-attention and contrastive learning to segmentation precision and robustness. These results highlight the value of integrative multimodal modeling for accurate vocal tract analysis.
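
For readers unfamiliar with the reported metric, the Dice score for two binary masks is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on toy masks; the masks are made up for illustration and the convention for two empty masks (score 1.0) is an assumption.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as lists of rows of 0/1."""
    intersection = sum(
        p * t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A score of 0.95, as reported above, therefore means the predicted and reference regions overlap almost completely.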

[53] AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

Yuechen Luo,Fang Li,Shaoqing Xu,Zhiyi Lai,Lei Yang,Qimao Chen,Ziang Luo,Zixun Xie,Shengyin Jiang,Jiaxin Liu,Long Chen,Bing Wang,Zhi-xin Yang

Main category: cs.CV

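TL;DR: AdaThinkDrive proposes a dual-mode reasoning framework that combines fast and slow thinking and selects the reasoning mode adaptively, improving both decision quality and efficiency in autonomous driving.

Details Motivation: Reasoning techniques such as CoT in current autonomous driving models fall short in simple scenarios and add unnecessary computational overhead, so an adaptive mechanism is needed to distinguish the reasoning needs of different scenarios.

Contribution: 1. Proposes the AdaThinkDrive framework with a dual-mode reasoning mechanism; 2. Introduces an Adaptive Think Reward strategy used together with GRPO; 3. On the Navsim benchmark, reaches a PDMS of 90.3 and cuts inference time by 14% relative to the always-Think baseline.

Method: 1. Pretraining on QA and trajectory data to acquire world knowledge and driving commonsense; 2. Supervised fine-tuning on a two-mode dataset that separates fast answering (without CoT) from slow thinking (with CoT); 3. An Adaptive Think Reward strategy that optimizes the choice of reasoning mode.

Result: On the Navsim benchmark, AdaThinkDrive reaches a PDMS of 90.3, 1.7 points above the best vision-only baseline, and reduces inference time by 14% compared with the always-Think baseline.

Insight: Adaptive reasoning balances decision quality against computational cost, offering a new direction for designing reasoning mechanisms in complex tasks.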

Abstract: Reasoning techniques such as Chain of Thought (CoT) have been widely adopted in Vision Language Action (VLA) models and show promising capabilities in end-to-end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual-mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large-scale autonomous driving (AD) scenarios using both question answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine-tuning (SFT), we introduce a two-mode dataset, fast answering (w/o CoT) and slow thinking (with CoT), enabling the model to distinguish between scenarios that require reasoning. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision-only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never-Think and always-Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also reduces inference time by 14% compared to the always-Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.
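
GRPO, mentioned in this abstract, scores each sampled response relative to the other responses drawn for the same input. A minimal sketch of that group-relative normalization follows. The reward values are hypothetical and do not reproduce the paper's Adaptive Think Reward; the use of the population standard deviation and the `eps` term are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one driving scene.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, which is how a reward that compares reasoning modes could push the policy toward the cheaper mode when it performs equally well.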

[54] CETUS: Causal Event-Driven Temporal Modeling With Unified Variable-Rate Scheduling

Hanfang Liang,Bing Wang,Shizhen Zhang,Wen Jiang,Yizhuo Yang,Weixiang Guo,Shenghai Yuan

Main category: cs.CV

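TL;DR: CETUS proposes a new architecture that processes raw event streams directly. A lightweight causal spatial encoder and linear-complexity Mamba state space models provide efficient spatiotemporal modeling, and the processing speed is adjusted dynamically to balance latency.

Details Motivation: Existing methods convert event streams into intermediate representations (such as frames or voxel grids), which introduces window latency, while pointwise detection methods are too computationally expensive to run in real time. CETUS aims to process raw event streams directly and avoid these limitations.

Contribution: 1. Proposes the Variable-Rate Spatial Event Mamba architecture, which processes raw event streams directly. 2. Introduces a lightweight causal spatial encoder and linear-complexity Mamba state space models. 3. Adjusts the processing speed dynamically to optimize latency.

Method: 1. A causal spatial encoder captures local geometric relations. 2. Mamba state space models provide efficient spatiotemporal modeling. 3. The processing speed is adjusted dynamically at inference time.

Result: CETUS avoids the window latency of intermediate representations and improves processing efficiency, making it suitable for high-speed vision tasks.

Insight: Processing raw event streams directly reduces latency; combining a lightweight encoder with state space models is key to efficient spatiotemporal modeling.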

Abstract: Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, pointwise detection methods face computational challenges that prevent real-time efficiency due to their high computational cost. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency.

[55] BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Hanshuai Cui,Zhiqing Tang,Zhifei Xu,Zhi Yao,Wenyi Zeng,Weijia Jia

Main category: cs.CV

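TL;DR: The paper proposes BWCache, a training-free acceleration method that reduces computational redundancy in Diffusion Transformers (DiTs) for video generation by reusing block-wise caches, markedly speeding up inference.

Details Motivation: DiTs perform well in video generation, but their sequential denoising process causes high latency, and existing acceleration methods either sacrifice visual quality or fail to reuse intermediate features effectively. The paper finds that DiT blocks are the main source of latency and that their features are highly similar across intermediate timesteps, which leaves room for optimization.

Contribution: Proposes BWCache, a training-free acceleration method based on block-wise caching; introduces a similarity indicator that triggers feature reuse dynamically to cut redundant computation; validates the method on several video diffusion models.

Method: Analyzes the U-shaped pattern of DiT block feature variation; caches and reuses block features dynamically; uses a similarity threshold to control reuse and preserve visual quality.

Result: Experiments show that BWCache achieves up to 2.24× speedup on several models while maintaining comparable visual quality.

Insight: The high similarity of DiT block features at intermediate timesteps creates an opportunity for caching and reuse, and the similarity-threshold design is key to balancing speed against visual quality.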

Abstract: Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24× speedup with comparable visual quality.
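
The threshold-triggered reuse described in this abstract can be illustrated with a toy cache. This is a sketch under stated assumptions, not BWCache itself: blocks are plain Python callables on lists of floats, and the indicator (relative L1 change of a block's input since its last real computation) is one plausible choice that the abstract does not specify.

```python
def relative_change(current, previous):
    """Relative L1 difference between two feature vectors."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1.0
    return num / den


def run_with_block_cache(blocks, step_inputs, threshold):
    """Run `blocks` in sequence at every timestep, reusing cached outputs.

    Returns the per-step outputs and the number of block evaluations made.
    """
    cache = [None] * len(blocks)  # per block: (input, output) of last real run
    evaluations = 0
    outputs = []
    for features in step_inputs:
        hidden = features
        for i, block in enumerate(blocks):
            entry = cache[i]
            if entry is not None and relative_change(hidden, entry[0]) < threshold:
                hidden = entry[1]  # features barely moved: reuse cached output
            else:
                result = block(hidden)
                cache[i] = (hidden, result)
                hidden = result
                evaluations += 1
        outputs.append(hidden)
    return outputs, evaluations


toy_blocks = [lambda v: [2 * a for a in v], lambda v: [a + 1 for a in v]]
toy_steps = [[1.0, 1.0], [1.01, 1.0], [2.0, 2.0]]
outs, evals = run_with_block_cache(toy_blocks, toy_steps, threshold=0.05)
```

In the toy run the second timestep is nearly identical to the first, so both blocks are skipped and only four of the six block evaluations are carried out. Raising the threshold skips more work at the cost of staler features.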

[56] Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models

Weihang Wang,Xinhao Li,Ziyue Wang,Yan Pang,Jielei Zhang,Peiyi Li,Qiang Zhang,Longwen Gao

Main category: cs.CV

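TL;DR: The paper addresses object hallucination in large vision-language models (LVLMs) by proposing a new benchmark, VHBench-10, and a dynamic routing network, VisionWeaver, to reduce hallucinations and improve performance.

Details Motivation: Object hallucination significantly limits the real-world use of LVLMs. Different training paradigms may give visual encoders different inductive biases and therefore diverse hallucination behaviors, which existing benchmarks fail to capture.

Contribution: 1) Proposes the VHBench-10 benchmark, covering ten fine-grained hallucination categories; 2) Proposes VisionWeaver, a dynamic routing network that uses global visual features to aggregate features from multiple experts and reduce hallucinations.

Method: Designs the VHBench-10 benchmark to analyze hallucination behavior and proposes the VisionWeaver network, which generates routing signals from global visual features and fuses features from multiple experts dynamically.

Result: Experiments show that VisionWeaver significantly reduces hallucinations and improves overall model performance.

Insight: The inductive biases of visual encoders strongly influence hallucination behavior, and dynamic routing is an effective feature-fusion strategy.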

Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the diverse hallucinations elaborated in our hypothesis. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark with approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm encoders exhibit unique hallucination characteristics. Building on these insights and the suboptimality of simple feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network. It employs global visual features to generate routing signals, dynamically aggregating visual features from multiple specialized experts. Comprehensive experiments confirm the effectiveness of VisionWeaver in significantly reducing hallucinations and improving overall model performance.
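
The routing idea in this abstract (a global feature produces routing signals that weight several expert features) can be sketched as below. The linear router and softmax-weighted sum are assumptions chosen for illustration; the abstract does not state VisionWeaver's exact routing or fusion rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_fuse(global_feature, router_weights, expert_features):
    """Weight expert features by gates computed from a global feature.

    router_weights has one row per expert (a hypothetical linear router).
    """
    logits = [
        sum(w * g for w, g in zip(row, global_feature)) for row in router_weights
    ]
    gates = softmax(logits)
    dim = len(expert_features[0])
    fused = [
        sum(gate * feat[d] for gate, feat in zip(gates, expert_features))
        for d in range(dim)
    ]
    return gates, fused


experts = [[1.0, 0.0], [0.0, 1.0]]
even_gates, even_fused = route_and_fuse([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], experts)
skew_gates, _ = route_and_fuse([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]], experts)
```

Because the gates depend on the image, each input can lean on whichever encoder is least prone to the relevant kind of hallucination, which is the motivation the abstract gives for going beyond simple feature fusion.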

[57] SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments

Jiayu Yuan,Ming Dai,Enhui Zheng,Chao Su,Nanxing Chen,Qiming Hu,Shibo Zhu,Yibin Cao

Main category: cs.CV

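TL;DR: The paper proposes a Semantic-Weighted Adaptive Particle Filter (SWA-PF) for efficient, accurate UAV localization in GNSS-denied environments and releases the Multi-Altitude Flight Segments (MAFS) dataset.

Details Motivation: Existing retrieval-based UAV localization methods are limited in real-time performance, environmental sensitivity, and generalization, particularly in dynamic or time-varying environments. The paper aims to address these problems.

Contribution: 1) Presents the large-scale Multi-Altitude Flight Segments (MAFS) dataset; 2) Proposes a new Semantic-Weighted Adaptive Particle Filter (SWA-PF) that combines semantic features from UAV images and satellite imagery.

Method: SWA-PF integrates semantic features from UAV and satellite imagery through a semantic weighting mechanism and an optimized particle filtering architecture to localize efficiently.

Result: The method achieves a 10x computational efficiency gain over feature extraction methods, keeps global positioning errors below 10 meters, and estimates a 4-DoF pose within seconds on low-resolution satellite maps.

Insight: Particle filtering with semantic information offers clear advantages for UAV localization, especially in dynamic environments, and low-resolution satellite maps can still support accurate localization.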

Abstract: Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. However, existing retrieval-based approaches face limitations in dataset availability and persistent challenges including suboptimal real-time performance, environmental sensitivity, and limited generalization capability, particularly in dynamic or temporally varying environments. To overcome these limitations, we present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. This approach integrates robust semantic features from both UAV-captured images and satellite imagery through two key innovations: a semantic weighting mechanism and an optimized particle filtering architecture. Evaluated using our dataset, the proposed method achieves a 10x computational efficiency gain over feature extraction methods, maintains global positioning errors below 10 meters, and enables rapid 4-degree-of-freedom (4-DoF) pose estimation within seconds using accessible low-resolution satellite maps. Code and dataset will be available at https://github.com/YuanJiayuuu/SWA-PF.
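
As a rough illustration of how a semantic similarity score can drive a particle filter, the sketch below reweights particles, measures how concentrated the weights are, and forms a pose estimate. Everything specific here is an assumption: the exponential weighting with a `temperature`, the pose layout (x, y, altitude, yaw), and the toy similarity values are not taken from the paper.

```python
import math


def update_weights(weights, similarities, temperature=0.1):
    """Scale each particle weight by exp(similarity / temperature), then normalize."""
    scaled = [w * math.exp(s / temperature) for w, s in zip(weights, similarities)]
    total = sum(scaled)
    return [v / total for v in scaled]


def effective_sample_size(weights):
    """Standard ESS estimate; a low value signals that resampling is due."""
    return 1.0 / sum(w * w for w in weights)


def weighted_mean_pose(particles, weights):
    """Weighted mean of (x, y, altitude, yaw); yaw is averaged on the circle."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    z = sum(w * p[2] for p, w in zip(particles, weights))
    sin_sum = sum(w * math.sin(p[3]) for p, w in zip(particles, weights))
    cos_sum = sum(w * math.cos(p[3]) for p, w in zip(particles, weights))
    return x, y, z, math.atan2(sin_sum, cos_sum)


prior = [0.25] * 4
posterior = update_weights(prior, [0.9, 0.1, 0.1, 0.1])
pose = weighted_mean_pose(
    [(0.0, 0.0, 100.0, 0.0), (2.0, 0.0, 100.0, 0.0)], [0.5, 0.5]
)
```

After the update, nearly all weight sits on the particle whose predicted view best matches the satellite semantics, so the effective sample size drops and a resampling step would normally follow.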

[58] Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation

Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink

Main category: cs.CV

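TL;DR: The paper challenges the assumption in representation learning that uncorrelated views are sufficient for learning effective representations and proposes Consistent View Alignment, a method that aligns views explicitly and performs strongly on 3D medical image segmentation.

Details Motivation: Existing representation learning methods assume that uncorrelated views are sufficient for learning effective representations, but the paper finds that meaningful structure in the latent space does not emerge naturally; views must be aligned explicitly to improve results.

Contribution: Proposes Consistent View Alignment, which explicitly aligns representations from different views so that complementary information is aligned without inducing false positives.

Method: A self-supervised learning method explicitly aligns the latent representations of multiple views, aligning complementary information without introducing false positives.

Result: In the MICCAI 2025 SSL3D challenge, the method took first and second place when using a Primus vision transformer and a ResEnc convolutional neural network, respectively.

Insight: Effective latent representations require explicit, structured alignment rather than natural emergence, which matters especially for medical image segmentation.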

Abstract: Many recent approaches in representation learning implicitly assume that uncorrelated views of a data point are sufficient to learn meaningful representations for various downstream tasks. In this work, we challenge this assumption and demonstrate that meaningful structure in the latent space does not emerge naturally. Instead, it must be explicitly induced. We propose a method that aligns representations from different views of the data to align complementary information without inducing false positives. Our experiments show that our proposed self-supervised learning method, Consistent View Alignment, improves performance for downstream tasks, highlighting the critical role of structured view alignment in learning effective representations. Our method achieved first and second place in the MICCAI 2025 SSL3D challenge when using a Primus vision transformer and ResEnc convolutional neural network, respectively. The code and pretrained model weights are released at https://github.com/Tenbatsu24/LatentCampus.

[59] SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation

Jiayi Pan,Jiaming Xu,Yongkang Zhou,Guohao Dai

Main category: cs.CV

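TL;DR: SpecDiff is a training-free multi-level feature caching strategy that uses self-speculative information to make diffusion model inference more efficient, delivering substantial speedups while preserving quality.

Details Motivation: Existing feature caching methods rely only on historical information, which limits both accuracy and speed. The authors address this by introducing future information through self-speculation.

Contribution: Proposes SpecDiff, a multi-level feature caching strategy that combines self-speculative and historical information and overcomes the speedup-accuracy trade-off bottleneck.

Method: A feature selection algorithm based on self-speculative information and a multi-level feature classification algorithm, which compute dynamic importance scores and classify tokens accordingly.

Result: On Stable Diffusion 3, 3.5, and FLUX, SpecDiff achieves average speedups of 2.80×, 2.74×, and 3.17× with negligible quality loss.

Insight: By merging speculative and historical information, SpecDiff pushes the Pareto frontier of speedup and accuracy in efficient diffusion model inference.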

Abstract: Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process of the diffusion model. In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. We propose a novel paradigm that introduces future information via self-speculation based on the information similarity at the same time step across different iteration times. Based on this paradigm, we present SpecDiff, a training-free multi-level feature caching strategy including a cached feature selection algorithm and a multi-level feature classification algorithm. (1) Feature selection algorithm based on self-speculative information. SpecDiff determines a dynamic importance score for each token based on self-speculative information and historical information, and performs cached feature selection through the importance score. (2) Multi-level feature classification algorithm based on feature importance scores. SpecDiff classifies tokens by leveraging the differences in feature importance scores and introduces a multi-level feature calculation strategy. Extensive experiments show that SpecDiff achieves average speedups of 2.80×, 2.74×, and 3.17× with negligible quality loss on Stable Diffusion 3, 3.5, and FLUX compared to RFlow on an NVIDIA A800-80GB GPU. By merging speculative and historical information, SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing the Pareto frontier of speedup and accuracy in efficient diffusion model inference.

[60] Dense Video Understanding with Gated Residual Tokenization

Haichao Zhang,Wenhao Chai,Shwai He,Ang Li,Yun Fu

Main category: cs.CV

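TL;DR: The paper proposes Dense Video Understanding (DVU) and Gated Residual Tokenization (GRT) for efficient high-frame-rate video understanding. By reducing tokenization time and token overhead, they address the weakness of existing video large language models on dense temporal information.

Details Motivation: Existing video large language models and benchmarks mostly rely on low-frame-rate sampling and discard dense temporal information, so they perform poorly on tasks that need precise temporal alignment, such as lecture comprehension.

Contribution: 1. Proposes DVU, which supports high-frame-rate video understanding; 2. Proposes the DIVE benchmark, focused on dense temporal reasoning; 3. Designs the GRT framework, which reduces token redundancy and compute cost through motion compensation and semantic merging.

Method: GRT has two stages: 1. Motion-compensated inter-frame tokenization that skips static regions; 2. Token merging within a semantic scene, which reduces redundancy while preserving dynamic semantics.

Result: On DIVE, GRT outperforms existing video large language model baselines, and its performance improves as FPS increases.

Insight: Dense temporal information is crucial for video understanding, and GRT offers an efficient, scalable way to process high-frame-rate video.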

Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
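
The first GRT stage skips static regions before tokenization. A much simplified stand-in for that gating is sketched below: it uses plain frame differencing per patch rather than the pixel-level motion estimation the abstract describes, and the patch size and threshold are hypothetical parameters.

```python
def changed_patches(prev_frame, frame, patch, threshold):
    """Return (row, col) indices of patches whose mean absolute change
    between two grayscale frames exceeds `threshold`.

    Only these patches would be passed on to the tokenizer.
    """
    rows, cols = len(frame), len(frame[0])
    keep = []
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            diffs = [
                abs(frame[y][x] - prev_frame[y][x])
                for y in range(top, min(top + patch, rows))
                for x in range(left, min(left + patch, cols))
            ]
            if sum(diffs) / len(diffs) > threshold:
                keep.append((top // patch, left // patch))
    return keep


previous = [[0.0] * 4 for _ in range(4)]
current = [row[:] for row in previous]
for y in (2, 3):
    for x in (2, 3):
        current[y][x] = 1.0  # only the bottom-right patch changes
active = changed_patches(previous, current, patch=2, threshold=0.1)
```

In the toy frames only one of four patches changes, so only that patch needs new tokens. When most of a scene is static, the token count grows with the amount of motion rather than with the number of frames.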

[61] EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics

Qianxin Xia,Jiawei Du,Guoming Lu,Zhiyong Shu,Jielei Wang

Main category: cs.CV

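TL;DR: EDITS is a new framework that enhances dataset distillation by exploiting the implicit textual semantics in images, combining a vision-language model and a large language model to generate the synthetic dataset.

Details Motivation: Traditional dataset distillation methods mainly capture low-level visual features and neglect the high-level semantic and structural information in images. EDITS improves distillation by bringing in textual semantics.

Contribution: 1) Proposes a Global Semantic Query module that fuses text generated by a vision-language model with image features; 2) Uses a large language model to generate text prototypes; 3) Proposes a Dual Prototype Guidance strategy for generating the synthetic dataset.

Method: 1) External texts generated by a VLM are fused with image features; 2) Local Semantic Awareness selects representative samples to build image and text prototypes; 3) A diffusion model guided by both prototypes generates the final dataset.

Result: Experiments confirm the effectiveness of EDITS for dataset distillation.

Insight: Textual semantics play an important role in dataset distillation, and combining multimodal models (a VLM and an LLM) captures high-level information better.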

Abstract: Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Finally, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at https://github.com/einsteinxia/EDITS.

[62] LamiGauss: Pitching Radiative Gaussian for Sparse-View X-ray Laminography Reconstruction

Chu Chen,Ander Biguri,Jean-Michel Morel,Raymond H. Chan,Carola-Bibiane Schönlieb,Jizhou Li

Main category: cs.CV

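TL;DR: LamiGauss is a sparse-view X-ray laminography reconstruction algorithm based on Gaussian Splatting radiative rasterization and a dedicated detector-to-world transformation model, markedly improving reconstruction quality under highly sparse views.

Details Motivation: X-ray laminography is essential for non-destructive inspection of plate-like structures such as microchips and composite battery materials, where traditional CT struggles because of geometric constraints, yet high-quality reconstruction under sparse-view conditions remains challenging.

Contribution: 1. Proposes the LamiGauss algorithm, which combines Gaussian Splatting radiative rasterization with a dedicated transformation model; 2. Designs an initialization strategy that filters out common artifacts and keeps Gaussians from being allocated to false structures; 3. Outperforms an iterative method optimized on the full dataset while using only 3% of full views.

Method: Combines Gaussian Splatting radiative rasterization with a detector-to-world transformation model, removes artifacts through the initialization strategy, and optimizes directly from sparse projections for efficient reconstruction.

Result: Experiments on synthetic and real datasets validate the effectiveness and superiority of LamiGauss; with only 3% of full views it outperforms an iterative method optimized on the full dataset.

Insight: Gaussian Splatting radiative rasterization shows promise for sparse-view reconstruction, and combining it with a dedicated transformation model and artifact filtering can markedly improve reconstruction quality.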

Abstract: X-ray Computed Laminography (CL) is essential for non-destructive inspection of plate-like structures in applications such as microchips and composite battery materials, where traditional computed tomography (CT) struggles due to geometric constraints. However, reconstructing high-quality volumes from laminographic projections remains challenging, particularly under highly sparse-view acquisition conditions. In this paper, we propose a reconstruction algorithm, namely LamiGauss, that combines Gaussian Splatting radiative rasterization with a dedicated detector-to-world transformation model incorporating the laminographic tilt angle. LamiGauss leverages an initialization strategy that explicitly filters out common laminographic artifacts from the preliminary reconstruction, preventing redundant Gaussians from being allocated to false structures and thereby concentrating model capacity on representing the genuine object. Our approach effectively optimizes directly from sparse projections, enabling accurate and efficient reconstruction with limited data. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness and superiority of the proposed method over existing techniques. LamiGauss uses only 3% of full views to achieve superior performance over the iterative method optimized on a full dataset.

[63] Distractor-Aware Memory-Based Visual Object Tracking

Jovana Videnovic,Matej Kristan,Alan Lukezic

Main category: cs.CV

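TL;DR: The paper proposes DAM4SAM, a distractor-aware memory module for visual object tracking that reduces tracking drift and improves redetection after occlusion, with leading results on multiple benchmarks.

Details Motivation: Memory-based video segmentation methods such as SAM2 perform very well on segmentation tasks, but in object tracking they do not handle distractors (objects visually similar to the target) effectively.

Contribution: Proposes a distractor-aware memory module with an introspection-based management method (DAM4SAM), constructs the distractor dataset DiDi, and markedly improves tracking performance with leading results on multiple benchmarks.

Method: Designs a distractor-aware drop-in memory module and an introspection-based management method for the SAM2 framework to handle distractors and occlusion.

Result: DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten; integrating it into the real-time tracker EfficientTAM and the edge tracker EdgeTAM yields improvements of 11% and 4%, respectively.

Insight: Distractor-aware design is crucial for tracking performance, especially in complex scenes and under occlusion.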

Abstract: The recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into the real-time tracker EfficientTAM leads to an 11% improvement and matches the tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with the edge-based tracker EdgeTAM delivers a 4% performance boost, demonstrating strong generalization across architectures.

[64] EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View

Zhen Xu,Guorui Lu,Chang Gao,Qinyu Chen

Main category: cs.CV

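TL;DR: EvHand-FPV is an efficient framework for first-person-view 3D hand tracking from a single event camera; wrist-based ROI localization and multi-task learning, among other techniques, markedly improve accuracy and efficiency.

Details Motivation: Frame-based methods struggle with low latency and energy efficiency, especially on resource-constrained XR devices, which motivates an efficient method based on event cameras.

Contribution: 1. Builds an FPV dataset that pairs synthetic training data with real event data; 2. Proposes wrist-based ROI localization and an end-to-end mapping strategy to reduce computation; 3. Introduces multi-task learning to improve representations.

Method: 1. Wrist geometric cues localize the hand ROI; 2. An end-to-end mapping strategy embeds ROI offsets into the network; 3. Multi-task learning with an auxiliary geometric feature head improves representations.

Result: 2D-AUCp improves from 0.77 to 0.85, the parameter count falls by 89% to 1.2M, FLOPs per inference fall by 89% to 0.185G, and 3D-AUCp on synthetic data remains at 0.84.

Insight: Combining event cameras with a lightweight design enables efficient hand tracking on resource-constrained devices and suits XR applications.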

Abstract: Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide μs-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters from 11.2M to 1.2M and FLOPs per inference from 1.648G to 0.185G, a reduction of 89% in each case. It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications. The dataset and code are available at https://github.com/zen5x5/EvHand-FPV.
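
The two 89% figures in this abstract follow directly from the before and after values it reports. The short check below recomputes them; only the helper name `percent_reduction` is introduced here.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


param_cut = percent_reduction(11.2, 1.2)     # millions of parameters
flop_cut = percent_reduction(1.648, 0.185)   # GFLOPs per inference
print(f"{param_cut:.1f}% fewer parameters, {flop_cut:.1f}% fewer FLOPs")
# prints "89.3% fewer parameters, 88.8% fewer FLOPs"
```

Both values round to the 89% quoted in the abstract.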

[65] Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

Yuanchen Wu,Ke Yan,Shouhong Ding,Ziyin Zhou,Xiaoqiang Li

Main category: cs.CV

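TL;DR: The paper proposes the Self-Rationale Calibration (SRC) framework, which iteratively calibrates the alignment between rationales and answers in large vision-language models (LVLMs), markedly improving perception, reasoning, and generalization.

Details Motivation: LVLMs perform well on visual question answering, but the rationales and answers they generate are often inconsistent, which leads to reasoning errors. The paper proposes the SRC framework to address this.

Contribution: 1. Proposes the SRC framework, which self-calibrates the alignment between rationales and answers; 2. Designs a lightweight rationale fine-tuning method; 3. Proposes the R-Scorer scoring model to evaluate the rationale quality and factual consistency of candidates.

Method: 1. Lightweight rationale fine-tuning changes the response format so that the model produces a rationale before its answer; 2. A diverse set of candidate responses is generated; 3. The R-Scorer model scores candidates pairwise; 4. Preference fine-tuning follows, based on confidence-weighted preference curation.

Result: SRC significantly improves the perception, reasoning, and generalization of LVLMs across multiple benchmarks, supporting the value of rationale-oriented alignment.

Insight: Aligning rationales with answers is key to improving LVLM reasoning, and SRC offers a self-calibration approach to the problem.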

Abstract: Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving an answer without explicit prompts. Next, SRC searches for a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both rationale quality and factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.

[66] Noise-Level Diffusion Guidance: Well Begun is Half Done

Harvey Mannering,Zhiwu Huang,Adam Prugel-Bennett

Main category: cs.CV

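TL;DR: The paper proposes Noise Level Guidance (NLG), a simple and efficient method that refines the initial noise in diffusion models to improve image quality and prompt adherence, without extra data, networks, or backpropagation.

Details Motivation: The initial Gaussian noise of a diffusion model affects final image quality and prompt adherence, and existing approaches usually depend on extra datasets, networks, or optimization, which limits their practicality.

Contribution: Proposes Noise Level Guidance (NLG), which needs no additional training data or networks and improves diffusion model outputs by increasing the likelihood that the initial noise aligns with general guidance.

Method: NLG refines the initial noise so that it is more likely to align with general guidance; it applies to both conditional and unconditional diffusion models and accommodates various forms of guidance.

Result: On five standard benchmarks, NLG improves generation quality and condition adherence while remaining computationally efficient.

Insight: Refining the initial noise matters for diffusion model performance; as a lightweight method, NLG integrates with existing guidance methods and makes diffusion models more practical.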

Abstract: Diffusion models have achieved state-of-the-art image generation. However, the random Gaussian noise used to start the diffusion process influences the final output, causing variations in image quality and prompt adherence. Existing noise-level optimization approaches generally rely on extra dataset construction, additional networks, or backpropagation-based optimization, limiting their practicality. In this paper, we propose Noise Level Guidance (NLG), a simple, efficient, and general noise-level optimization approach that refines initial noise by increasing the likelihood of its alignment with general guidance - requiring no additional training data, auxiliary networks, or backpropagation. The proposed NLG approach provides a unified framework generalizable to both conditional and unconditional diffusion models, accommodating various forms of diffusion-level guidance. Extensive experiments on five standard benchmarks demonstrate that our approach enhances output generation quality and input condition adherence. By seamlessly integrating with existing guidance methods while maintaining computational efficiency, our method establishes NLG as a practical and scalable enhancement to diffusion models. Code can be found at https://github.com/harveymannering/NoiseLevelGuidance.

[67] Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation

Gia Khanh Nguyen,Yifeng Huang,Minh Hoai

Main category: cs.CV

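TL;DR: The paper introduces the PairTally dataset for evaluating fine-grained visual counting and finds that current AI models still struggle to count the objects users actually intend in complex scenes.

Details Motivation: Current AI models perform well on visual counting tasks, but their ability to do fine-grained, intent-driven counting remains unclear, so a more rigorous evaluation standard is needed.

Contribution: 1. Introduces the PairTally dataset of 681 high-resolution images, supporting both inter-category and intra-category fine-grained counting evaluation; 2. Systematically evaluates a range of SOTA models.

Method: Uses PairTally to benchmark exemplar-based methods, language-prompted models, and large vision-language models (VLMs).

Result: Current models count unreliably in fine-grained and visually ambiguous cases and do not fully match user intent.

Insight: Fine-grained counting requires stronger discrimination and semantic understanding; PairTally provides a foundation for improving future models.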

Abstract: Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.

[68] MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

Elena Camuffo,Francesco Barbato,Mete Ozay,Simone Milani,Umberto Michieli

Main category: cs.CV

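TL;DR: MOCHA is a multimodal knowledge distillation method that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO), using a dual-objective loss to achieve object-level semantic alignment.

Details Motivation: Existing methods focus mainly on dense or global alignment. MOCHA instead targets object-level semantic transfer, aiming to move multimodal semantic knowledge into a vision-only model efficiently and without requiring text input at inference.

Contribution: Proposes MOCHA, an object-level multimodal knowledge distillation framework that uses a translation module and a dual-objective loss to enforce local alignment and global relational consistency, significantly improving lightweight detectors.

Method: A translation module maps student features into a joint space, and a dual-objective loss (local alignment and global relational consistency) guides the training of both the student and the translator.

Result: Across four personalized detection benchmarks, MOCHA improves on baselines by an average of 10.1 points and, with a lightweight architecture, reaches performance on par with larger multimodal models.

Insight: Object-level alignment suits multimodal knowledge transfer better than dense or global alignment, particularly for lightweight models, and yields significant gains without relying on text input.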

Abstract: We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
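
The dual-objective loss described above can be illustrated with a small sketch. The abstract does not give the exact formulation, so the choices here are assumptions made only for illustration: cosine distance for the local alignment term, a squared difference between pairwise similarity matrices for the relational term, and a hypothetical weight `lam`. Student embeddings are taken to be already mapped into the joint space by the translation module.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def local_alignment_loss(student, teacher):
    """Mean cosine distance between matched per-object embeddings (assumed form)."""
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(student)

def relational_loss(student, teacher):
    """Mean squared gap between the two pairwise-similarity matrices (assumed form)."""
    n = len(student)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gap = cosine(student[i], student[j]) - cosine(teacher[i], teacher[j])
            total += gap * gap
    return total / (n * n)

def dual_objective(student, teacher, lam=1.0):
    """Local term plus lam times the relational term; lam is a hypothetical weight."""
    return local_alignment_loss(student, teacher) + lam * relational_loss(student, teacher)
```

With `student = [[1, 0], [0, 1]]` and `teacher = [[0, 1], [1, 0]]`, the relational term is zero because both sets have the same internal geometry, while the local term is 1.0 because each object is matched to an orthogonal embedding. This suggests why a method might want both terms rather than either one alone.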

[69] SAIL-VL2 Technical Report

Weijie Yin,Yongjie Ye,Fangxun Shu,Yue Liao,Zijian Kang,Hongyuan Dong,Haiyang Yu,Dingkang Yang,Jiacong Wang,Han Wang,Wenzhuo Liu,Xiao Liang,Shuicheng Yan,Chao Feng

Main category: cs.CV

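TL;DR: SAIL-VL2 is an open-source vision-language foundation model that reaches state-of-the-art multimodal understanding and reasoning at the 2B and 8B parameter scales through large-scale data curation, a progressive training framework, and sparse Mixture-of-Experts architectural advances.

Details Motivation: Existing vision-language models still have room to improve on fine-grained perception and complex reasoning. SAIL-VL2 aims to push the capability boundary of multimodal models through innovations in data, training, and architecture.

Contribution: 1. A large-scale data curation and cleaning strategy; 2. A progressive training framework combined with an SFT-RL hybrid paradigm; 3. An extension to sparse Mixture-of-Experts (MoE) architectures.

Method: 1. Improves data quality with scoring and filtering strategies; 2. Starts from a pre-trained vision encoder (SAIL-ViT), proceeds through multimodal pre-training, and finishes by combining supervised fine-tuning (SFT) with reinforcement learning (RL); 3. Adopts MoE designs to improve model efficiency.

Result: Performs strongly across 106 datasets and reaches SOTA on challenging reasoning benchmarks such as MMMU and MathVista. On the OpenCompass leaderboard, the 2B model ranks first among open-source models under 4B parameters.

Insight: Data quality and diversity, a systematically designed training paradigm, and sparse architectures are key to improving multimodal model performance.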

Abstract: We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.

[70] PROFUSEme: PROstate Cancer Biochemical Recurrence Prediction via FUSEd Multi-modal Embeddings

Suhang You,Carla Pitarch-Abaigar,Sanket Kachole,Sumedh Sonawane,Juhyung Ha,Anish Sudarshan Gada,David Crandall,Rakesh Shiradkar,Spyridon Bakas

Main category: cs.CV

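TL;DR: PROFUSEme combines an intermediate fusion configuration over multi-modal embeddings (clinical, radiology, and pathology data) with Cox Proportional Hazard regression to predict prostate cancer biochemical recurrence (BCR) early, outperforming late fusion.

Details Motivation: Almost 30% of prostate cancer patients experience biochemical recurrence (BCR) after radical prostatectomy. Accurate early prediction of BCR can improve clinical decision-making and patient outcomes.

Contribution: Proposes PROFUSEme, which improves BCR prediction through an intermediate fusion learning strategy over multi-modal data (clinical, radiology, pathology).

Method: Uses an intermediate fusion configuration together with Cox Proportional Hazard regressors to learn cross-modal interactions in the multi-modal data.

Result: A mean C-index of 0.861 (σ=0.112) in internal 5-fold nested cross-validation, and a C-index of 0.7103 on the CHIMERA 2025 challenge validation leaderboard.

Insight: Intermediate fusion outperforms late fusion on multi-modal data and offers the potential for more accurate BCR prediction.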

Abstract: Almost 30% of prostate cancer (PCa) patients undergoing radical prostatectomy (RP) experience biochemical recurrence (BCR), characterized by increased prostate specific antigen (PSA) and associated with increased mortality. Accurate early prediction of BCR, at the time of RP, would contribute to prompt adaptive clinical decision-making and improved patient outcomes. In this work, we propose prostate cancer BCR prediction via fused multi-modal embeddings (PROFUSEme), which learns cross-modal interactions of clinical, radiology, and pathology data, following an intermediate fusion configuration in combination with Cox Proportional Hazard regressors. Quantitative evaluation of our proposed approach reveals superior performance, when compared with late fusion configurations, yielding a mean C-index of 0.861 ($\sigma=0.112$) on the internal 5-fold nested cross-validation framework, and a C-index of 0.7103 on the hold out data of CHIMERA 2025 challenge validation leaderboard.
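
Because the results above are reported as a C-index, a minimal sketch of Harrell's concordance index may help in reading them. This is the standard textbook definition, not code from the paper; the function name and the handling of tied risk scores (counted as half-concordant) are illustrative choices.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up time per patient
    events: 1 if recurrence was observed, 0 if the patient was censored
    risks:  predicted risk score (higher means earlier expected recurrence)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk, which places the reported 0.861 and 0.7103 in context.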

[71] Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

Gang Cheng,Xin Gao,Li Hu,Siqi Hu,Mingyang Huang,Chaonan Ji,Ju Li,Dechao Meng,Jinwei Qi,Penchong Qiao,Zhen Shen,Yafei Song,Ke Sun,Linrui Tian,Feng Wang,Guangyuan Wang,Qi Wang,Zhongjian Wang,Jiayu Xiao,Sheng Xu,Bang Zhang,Peng Zhang,Xindi Zhang,Zhe Zhang,Jingren Zhou,Lian Zhuo

Main category: cs.CV

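TL;DR: Wan-Animate is a unified framework for character animation and replacement. It generates high-fidelity character videos by precisely replicating the expressions and movements in a reference video, or replaces the original character in that video with seamless environmental integration.

Details Motivation: Addresses the challenges of generating high-fidelity video and achieving seamless environmental integration in existing character animation and replacement tasks.

Contribution: Proposes the Wan-Animate framework, which unifies character animation and replacement and introduces a modified input paradigm, spatially-aligned skeleton signals, and an auxiliary Relighting LoRA module to improve generation quality and environmental adaptability.

Method: Builds on the Wan model and modifies the input paradigm to differentiate between reference conditions and regions for generation; uses spatially-aligned skeleton signals and implicit facial features for high controllability; develops a Relighting LoRA module to enhance environmental integration.

Result: Experiments show that Wan-Animate achieves state-of-the-art performance, producing high-quality videos with seamless environmental integration.

Insight: A common symbolic representation supports multiple tasks, and auxiliary modules such as Relighting LoRA are an effective way to improve environmental adaptability.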

Abstract: We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene’s lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character’s appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.

[72] VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement

Jun Du,Weiwei Xing,Ming Li,Fei Richard Yu

Main category: cs.CV

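TL;DR: The paper proposes VSE-MOT, a framework that uses visual semantic enhancement to improve multi-object tracking in low-quality video. It combines a vision-language model with an adapter design and significantly outperforms existing methods.

Details Motivation: Existing MOT algorithms perform poorly on low-quality video, which limits real-world use. The paper aims to solve this with visual semantic enhancement.

Contribution: Proposes the VSE-MOT framework, comprising a tri-branch architecture, the MOT-Adapter, and the VSFM module, which significantly improves tracking in low-quality video.

Method: Uses a vision-language model to extract global visual semantic information, adapts it to the multi-object tracking task through the MOT-Adapter, and improves feature fusion with the VSFM.

Result: In low-quality video scenarios, VSE-MOT's tracking metrics exceed those of existing methods by approximately 8% to 20%, while performance in conventional scenarios remains robust.

Insight: Introducing visual semantic information and designing a task-specific adapter are key to improving MOT performance on low-quality video.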

Abstract: Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking framework (VSE-MOT). Specifically, we first design a tri-branch architecture that leverages a vision-language model to extract global visual semantic information from images and fuse it with query vectors. Subsequently, to further enhance the utilization of visual semantic information, we introduce the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic information to suit multi-object tracking tasks, while the VSFM improves the efficacy of feature fusion. Through extensive experiments, we validate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Its tracking performance metrics outperform those of existing methods by approximately 8% to 20%, while maintaining robust performance in conventional scenarios.

[73] AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration

Jingyi Yuan,Jianxiong Ye,Wenkang Chen,Chenqiang Gao

Main category: cs.CV

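TL;DR: AD-DINOv3 is a multimodal framework combining DINOv3 and CLIP that improves feature alignment and anomalous-region identification for zero-shot anomaly detection, yielding significant performance gains.

Details Motivation: Zero-shot anomaly detection (ZSAD) needs efficient, annotation-free methods for handling anomalies in unseen categories. Traditional methods rely on CLIP, while the transfer-learning strengths of models such as DINOv3 remain underused. The paper aims to resolve DINOv3's feature misalignment and bias toward global semantics in the ZSAD setting.

Contribution: 1. Adapts DINOv3 to ZSAD for the first time; 2. Proposes the AD-DINOv3 framework, combining multimodal contrastive learning with lightweight adapters; 3. Designs an Anomaly-Aware Calibration Module (AACM) to improve anomalous-region identification.

Method: 1. Extracts visual features with DINOv3 and encodes text prompts with CLIP; 2. Introduces lightweight adapters to align the visual and text modalities; 3. Uses the AACM to guide the CLS token toward anomalous regions.

Result: On eight industrial and medical benchmarks, AD-DINOv3 matches or surpasses existing state-of-the-art methods.

Insight: Contrastive learning across visual and text modalities effectively mitigates the bias of pretrained models on ZSAD, and the AACM significantly improves the discriminability of anomalous regions.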

Abstract: Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods, verifying its superiority as a general zero-shot anomaly detection framework.
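
The idea of scoring anomalies by comparing visual tokens with normal and abnormal text embeddings can be sketched as follows. This is an illustrative assumption about the general CLIP-style recipe, not the paper's implementation: the adapters and AACM are omitted, the temperature value is borrowed from common CLIP practice, and all vectors are toy inputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def patch_anomaly_scores(patch_tokens, normal_text, abnormal_text, temperature=0.07):
    """Per-patch probability of 'abnormal' from a two-way softmax over similarities.

    temperature is an assumed hyperparameter, not a value taken from the paper.
    """
    scores = []
    for patch in patch_tokens:
        logit_normal = cosine(patch, normal_text) / temperature
        logit_abnormal = cosine(patch, abnormal_text) / temperature
        peak = max(logit_normal, logit_abnormal)  # subtract the max for numerical stability
        exp_normal = math.exp(logit_normal - peak)
        exp_abnormal = math.exp(logit_abnormal - peak)
        scores.append(exp_abnormal / (exp_normal + exp_abnormal))
    return scores
```

Reshaping the per-patch scores back onto the patch grid would give a coarse anomaly map, and an image-level score could be derived in the same way from the CLS token.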

[74] Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

Yaru Chen,Ruohao Guo,Liting Gao,Yang Xiang,Qingyu Luo,Zhenbo Li,Wenwu Wang

Main category: cs.CV

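TL;DR: The paper proposes a method for weakly-supervised audio-visual video parsing that uses an EMA-guided pseudo supervision framework and a class-aware cross-modal agreement loss to provide segment-level supervision and modality alignment, achieving SOTA performance.

Details Motivation: Existing weakly-supervised audio-visual video parsing methods neglect segment-level supervision and class-aware cross-modal alignment, which limits performance. The paper aims to address both gaps.

Contribution: 1. Proposes an EMA-guided pseudo supervision framework that generates reliable segment-level supervision via adaptive thresholds or top-k selection; 2. Designs a class-aware cross-modal agreement loss that aligns the audio and visual modalities.

Method: 1. Generates segment-level pseudo supervision with an EMA teacher; 2. Aligns cross-modal embeddings with the CMA loss.

Result: Achieves SOTA performance on the LLP and UnAV-100 datasets.

Insight: Segment-level supervision and class-aware modality alignment are crucial for weakly-supervised audio-visual video parsing, and EMA and CMA are effective ways to provide them.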

Abstract: Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
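
The two ingredients of the pseudo supervision step, an EMA teacher and mask selection by threshold or top-k, can be sketched in a few lines. This is a minimal illustration under stated assumptions: parameters are flat lists of floats, the decay value is hypothetical, and the threshold is passed in by the caller because the abstract does not say how the adaptive threshold is computed.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student by a factor of (1 - decay)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def pseudo_mask(segment_probs, threshold=None, top_k=None):
    """Binary segment-level mask for one class from teacher probabilities.

    Exactly one of threshold or top_k is expected; both are illustrative knobs.
    """
    if threshold is not None:
        return [1 if p >= threshold else 0 for p in segment_probs]
    if top_k is None:
        raise ValueError("provide either threshold or top_k")
    ranked = sorted(range(len(segment_probs)),
                    key=lambda i: segment_probs[i], reverse=True)
    keep = set(ranked[:top_k])
    return [1 if i in keep else 0 for i in range(len(segment_probs))]
```

For teacher probabilities `[0.1, 0.8, 0.6, 0.2]`, both `threshold=0.5` and `top_k=2` select the second and third segments. The resulting mask could then serve as a segment-level target for the student alongside the video-level label.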

[75] Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows

Jiabo MA,Wenqiang Li,Jinbang Li,Ziyi Liu,Linshan Wu,Fengtao Zhou,Li Liang,Ronald Cheong Kin Chan,Terence T. W. Wong,Hao Chen

Main category: cs.CV

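TL;DR: This study proposes a generative AI framework that uses cascaded registration mechanisms to resolve spatial misalignment in virtual staining, significantly improving performance, especially on datasets with severe misalignment.

Details Motivation: Conventional histopathological diagnosis requires multiple stains, which is time-consuming, labor-intensive, and environmentally taxing. Virtual staining is promising, but existing methods are limited by their reliance on well-aligned paired data.

Contribution: Proposes a virtual staining framework with cascaded registration that resolves spatial mismatches between generated outputs and ground truth, significantly improving performance and robustness.

Method: Uses cascaded registration mechanisms to resolve spatial misalignment step by step, enabling more accurate pixel-level supervision on unpaired or roughly paired data.

Result: Outperforms existing methods across five datasets, with an average improvement of 3.2% on internal datasets and 10.1% on external datasets, and a 23.8% PSNR improvement on datasets with substantial misalignment.

Insight: Cascaded registration simplifies data acquisition and offers a new direction for virtual staining, with the largest benefit on severely misaligned data.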

Abstract: Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face significant challenges in clinical applications, primarily due to their reliance on well-aligned paired data. Obtaining such data is inherently difficult because chemical staining processes can distort tissue structures, and a single tissue section cannot undergo multiple staining procedures without damage or loss of information. As a result, most available virtual staining datasets are either unpaired or roughly paired, making it difficult for existing methods to achieve accurate pixel-level supervision. To address this challenge, we propose a robust virtual staining framework featuring cascaded registration mechanisms to resolve spatial mismatches between generated outputs and their corresponding ground truth. Experimental results demonstrate that our method significantly outperforms state-of-the-art models across five datasets, achieving an average improvement of 3.2% on internal datasets and 10.1% on external datasets. Moreover, in datasets with substantial misalignment, our approach achieves a remarkable 23.8% improvement in peak signal-to-noise ratio compared to baseline models. The exceptional robustness of the proposed method across diverse datasets simplifies the data acquisition process for virtual staining and offers new insights for advancing its development.
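
Since the misalignment result is reported in peak signal-to-noise ratio, a short sketch of the standard PSNR definition shows why spatial mismatch hurts pixel-level metrics. The function below follows the usual formula, 10 * log10(MAX^2 / MSE); the 8-bit `max_value` default and the nested-list image format are assumptions for illustration.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized grayscale images."""
    ref = [v for row in reference for v in row]
    gen = [v for row in generated for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, gen)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Comparing `[[0, 255, 0, 0]]` with the same row shifted one pixel to the right, `[[0, 0, 255, 0]]`, gives about 3.01 dB even though the content is identical. This illustrates how unregistered image pairs can penalize an otherwise faithful output.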

[76] Deceptive Beauty: Evaluating the Impact of Beauty Filters on Deepfake and Morphing Attack Detection

Sara Concas,Simone Maurizio La Cava,Andrea Panzino,Ester Masala,Giulia Orrù,Gian Luca Marcialis

Main category: cs.CV

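TL;DR: The paper studies how beauty filters affect the performance of deepfake and morphing attack detectors, finding that filters degrade detector performance and expose the fragility of existing models.

Details Motivation: The spread of beauty filters on social media raises concerns about the reliability of facial data and the effectiveness of automated face analysis systems, particularly for detecting deepfakes and morphing attacks.

Contribution: The main contribution is a systematic evaluation of how beauty filters affect deepfake and morphing attack detectors, revealing their negative impact.

Method: Applies various smoothing filters to several face datasets and measures the performance change of multiple state-of-the-art detectors before and after filtering.

Result: Beauty filters significantly degrade detector performance.

Insight: Existing detection models lack robustness to image enhancement operations such as beauty filters, so more robust detection methods are needed.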

Abstract: Digital beautification through social media filters has become increasingly popular, raising concerns about the reliability of facial images and videos and the effectiveness of automated face analysis. This issue is particularly critical for digital manipulation detectors, systems aiming at distinguishing between genuine and manipulated data, especially in cases involving deepfakes and morphing attacks designed to deceive humans and automated facial recognition. This study examines whether beauty filters impact the performance of deepfake and morphing attack detectors. We perform a comprehensive analysis, evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Our findings reveal performance degradation, highlighting vulnerabilities introduced by facial enhancements and underscoring the need for robust detection models resilient to such alterations.

[77] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Peng Xu,Shengwu Xiong,Jiajun Zhang,Yaxiong Chen,Bowen Zhou,Chen Change Loy,David A. Clifton,Kyoung Mu Lee,Luc Van Gool,Ruiming He,Ruilin Yao,Xinwei Long,Jirui Huang,Kai Tian,Sa Yang,Yihua Shao,Jin Feng,Yue Zhong,Jiakai Zhou,Cheng Tang,Tianyu Zou,Yifang Zhang,Junming Liang,Guoyou Li,Zhaoxiang Wang,Qiang Zhou,Yichen Zhao,Shili Xiong,Hyeongjin Nam,Jaerin Lee,Jaeyoung Chung,JoonKyu Park,Junghun Oh,Kanggeon Lee,Wooseok Lee,Juneyoung Ro,Turghun Osman,Can Hu,Chaoyang Liao,Cheng Chen,Chengcheng Han,Chenhao Qiu,Chong Peng,Cong Xu,Dailin Li,Feiyu Wang,Feng Gao,Guibo Zhu,Guopeng Tang,Haibo Lu,Han Fang,Han Qi,Hanxiao Wu,Haobo Cheng,Hongbo Sun,Hongyao Chen,Huayong Hu,Hui Li,Jiaheng Ma,Jiang Yu,Jianing Wang,Jie Yang,Jing He,Jinglin Zhou,Jingxuan Li,Josef Kittler,Lihao Zheng,Linnan Zhao,Mengxi Jia,Muyang Yan,Nguyen Thanh Thien,Pu Luo,Qi Li,Shien Song,Shijie Dong,Shuai Shao,Shutao Li,Taofeng Xue,Tianyang Xu,Tianyi Gao,Tingting Li,Wei Zhang,Weiyang Su,Xiaodong Dong,Xiao-Jun Wu,Xiaopeng Zhou,Xin Chen,Xin Wei,Xinyi You,Xudong Kang,Xujie Zhou,Xusheng Liu,Yanan Wang,Yanbin Huang,Yang Liu,Yang Yang,Yanglin Deng,Yashu Kang,Ye Yuan,Yi Wen,Yicen Tian,Yilin Tao,Yin Tang,Yipeng Lin,Yiqing Wang,Yiting Xi,Yongkang Yu,Yumei Li,Yuxin Qin,Yuying Chen,Yuzhe Cen,Zhaofan Zou,Zhaohong Liu,Zhehao Shen,Zhenglin Du,Zhengyang Li,Zhenni Huang,Zhenwei Shao,Zhilong Song,Zhiyong Feng,Zhiyu Wang,Zhou Yu,Ziang Li,Zihan Zhai,Zijian Zhang,Ziyang Peng,Ziyun Xiao,Zongshu Li

Main category: cs.CV

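TL;DR: This paper reviews the MARS2 2025 Challenge, which aims to advance multimodal machine learning and large language models through a large benchmark, with a focus on multimodal reasoning in real-world and specialized scenarios.

Details Motivation: Multimodal reasoning is evolving rapidly but lacks unified test standards and broad application scenarios. The MARS2 challenge aims to drive progress through diverse datasets and tasks.

Contribution: 1. Releases two tailored datasets, Lens and AdsQA, covering 12 daily scenarios and the specialized domain of advertisement videos.
2. Evaluates 40+ baseline models and opens three competition tracks (VG-RS, VQA-SA, VR-Ads).
3. Attracted 76 teams, with 40+ valid submissions included in the rankings.

Method: Organizes a multimodal reasoning challenge around public datasets and benchmarks, evaluating both generalist MLLMs and task-specific models.

Result: The challenge drew broad participation and released datasets, code, and rankings, advancing multimodal reasoning in practical scenarios.

Insight: Progress in multimodal reasoning requires diverse test scenarios and sound evaluation standards, and open-sourcing data and code helps the community advance together.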

Abstract: This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year’s MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants’ methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.

[78] An Exploratory Study on Abstract Images and Visual Representations Learned from Them

Haotian Li,Jianbo Jiao

Main category: cs.CV

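TL;DR: This paper examines whether abstract images built from primitive shapes can effectively convey visual semantic information. It introduces the Hierarchical Abstraction Image Dataset (HAID) and compares abstract images with traditional raster images on vision tasks.

Details Motivation: The study aims to understand whether abstract images can effectively convey visual semantic information, and why they underperform traditional raster images in deep learning.

Contribution: Introduces the Hierarchical Abstraction Image Dataset (HAID) and provides a comprehensive study of abstract images on classification, segmentation, and detection tasks.

Method: Trains and evaluates conventional vision systems on HAID, comparing images at different abstraction levels across vision tasks.

Result: Abstract images convey some semantic information but underperform traditional images on high-level tasks.

Insight: Abstract images may have potential for some vision tasks but need further optimization to close the gap with traditional images.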

Abstract: Imagine living in a world composed solely of primitive shapes: could you still recognise familiar objects? Recent studies have shown that abstract images, constructed from primitive shapes, can indeed convey visual semantic information to deep learning models. However, representations obtained from such images often fall short compared to those derived from traditional raster images. In this paper, we study the reasons behind this performance gap and investigate how much high-level semantic content can be captured at different abstraction levels. To this end, we introduce the Hierarchical Abstraction Image Dataset (HAID), a novel data collection that comprises abstract images generated from normal raster images at multiple levels of abstraction. We then train and evaluate conventional vision systems on HAID across various tasks including classification, segmentation, and object detection, providing a comprehensive comparison between rasterised and abstract image representations. We also discuss whether abstract images can be considered a potentially effective format for conveying visual semantic information and contributing to vision tasks.

[79] BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection

Rongyu Zhang,Jiaming Liu,Xiaoqi Li,Xiaowei Chi,Dan Wang,Li Du,Yuan Du,Shanghang Zhang

Main category: cs.CV

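TL;DR: BEVUDA++ proposes a geometric-aware unsupervised domain adaptation method for Bird's Eye View (BEV) perception in multi-view 3D object detection. It uses a Reliable Depth Teacher and a Geometric Consistent Student to reduce performance degradation in cross-domain scenarios.

Details Motivation: BEV perception is important for autonomous driving, but domain shift in cross-domain scenarios has been overlooked, causing substantial performance degradation. The paper addresses the domain adaptation challenge for multi-view 3D object detection in BEV perception.

Contribution: 1. Proposes the BEVUDA++ framework, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS); 2. Designs an Uncertainty-guided Exponential Moving Average (UEMA) to reduce error accumulation caused by domain shift; 3. Validates the method's superiority in four cross-domain scenarios.

Method: 1. RDT blends target LiDAR with reliable depth predictions to generate depth-aware information; 2. GCS maps features from multiple spaces into a unified geometric embedding space; 3. UEMA uses uncertainty guidance to reduce domain-shift errors.

Result: Achieves state-of-the-art performance in BEV 3D object detection, for example a 12.9% NDS and 9.5% mAP improvement on Day-Night adaptation.

Insight: 1. The core difficulty of domain adaptation here is the accumulation of domain shift across geometric spaces; 2. Depth awareness and geometric consistency are effective ways to mitigate domain shift; 3. Uncertainty guidance can improve the stability of domain adaptation methods.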

Abstract: Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.

[80] Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Michal Szczepanski,Martyna Poreba,Karim Haroun

Main category: cs.CV

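TL;DR: The paper proposes STEP, a hybrid token-reduction framework that improves ViT efficiency in high-resolution semantic segmentation through dynamic merging and early pruning, cutting computational cost substantially with almost no loss in accuracy.

Details Motivation: Vision Transformers (ViTs) perform well in high-resolution semantic segmentation, but their high computational and memory costs limit their use. An efficient method is needed that reduces the computational burden without significantly sacrificing accuracy.

Contribution: 1. Proposes the STEP framework, combining dynamic merging with token pruning; 2. Designs dCTS, a lightweight CNN-based policy network for flexible merging into superpatches; 3. Integrates an early-exit mechanism that removes high-confidence tokens ahead of the final layer.

Method: STEP uses dCTS to merge tokens dynamically into superpatches and adds early exits to the encoder blocks so that high-confidence tokens are removed early, reducing computation. Experiments on high-resolution images (up to 1024 x 1024) validate its effectiveness.

Result: STEP reduces computational cost by up to 4x and raises inference speed by 1.7x, with an accuracy drop of no more than 2%. Applied alone, dCTS reduces the token count by a factor of 2.5 and computational cost by 2.6x.

Insight: Dynamic token merging and early exits are effective ways to make ViTs more efficient in high-resolution semantic segmentation and offer a direction for follow-up work.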

Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
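
The token arithmetic and the early-exit step can be made concrete with a short sketch. It is an illustration only: the confidence threshold is a hypothetical value, tokens are stand-in labels, and the dCTS policy network that decides which patches to merge is not modelled.

```python
def patch_token_count(height, width, patch=16):
    """Number of tokens produced by non-overlapping square patches."""
    return (height // patch) * (width // patch)

def early_exit(tokens, confidences, threshold=0.95):
    """Split tokens into those that continue and those halted as confident.

    threshold is an assumed hyperparameter, not a value from the paper.
    """
    active, halted = [], []
    for token, confidence in zip(tokens, confidences):
        if confidence >= threshold:
            halted.append(token)   # prediction is kept, token skips later blocks
        else:
            active.append(token)   # token continues through the encoder
    return active, halted
```

A 1024 x 1024 image with 16 x 16 patches yields `patch_token_count(1024, 1024) == 4096` tokens, so a 2.5x reduction leaves roughly 1,638. Calling `early_exit` at each encoder block shrinks the active set further, which is where the additional savings of the full framework would come from.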

[81] Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark

Nisarg A. Shah,Amir Ziai,Chaitanya Ekanadham,Vishal M. Patel

Main category: cs.CV

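TL;DR: This paper introduces Cinéaste, a fine-grained contextual movie question answering benchmark for evaluating deep understanding of long-form video narratives, with diverse question types and a strict quality-control pipeline.

Details Motivation: Existing video understanding benchmarks mostly test short-clip recognition or use template-based questions, and do not evaluate fine-grained reasoning over long-form narrative content.

Contribution: 1) Introduces the Cinéaste benchmark, with questions spanning five fine-grained reasoning categories; 2) Uses GPT-4o to generate diverse questions and applies two-stage filtering to ensure quality; 3) Shows that long-range temporal reasoning is a bottleneck for existing models.

Method: Uses GPT-4o to generate questions by integrating multimodal information, then applies two-stage filtering (Context-Independence and Contextual Veracity) to ensure question quality.

Result: Existing MLLMs struggle on Cinéaste; the best open-source model reaches only 63.15% accuracy, highlighting the difficulty of long-range temporal reasoning.

Insight: Deep understanding of long-form narrative content requires stronger long-range temporal reasoning, where existing models still need improvement.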

Abstract: While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce $\mathsf{Cin\acute{e}aste}$, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on $\mathsf{Cin\acute{e}aste}$; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.

[82] GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang,Penghao Yin,Xiangyu Zhao,Changyao Tian,Yu Qiao,Wenhai Wang,Jifeng Dai,Gen Luo

Main category: cs.CV

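TL;DR: GenExam is a multidisciplinary text-to-image exam benchmark, the first to evaluate image generation in exam form. It covers 1,000 samples across 10 subjects and shows that current state-of-the-art models struggle under strict scoring.

Details Motivation: Existing exam-style benchmarks focus mainly on understanding and reasoning, while generation benchmarks emphasize the illustration of world knowledge and visual concepts, leaving rigorous drawing exams unevaluated. GenExam fills this gap by assessing a model's ability to integrate knowledge, reasoning, and generation.

Contribution: GenExam is the first multidisciplinary text-to-image exam benchmark, providing 1,000 samples with fine-grained scoring points for precise evaluation of semantic correctness and visual plausibility.

Method: Organizes exam prompts under a four-level taxonomy covering 10 subjects. Each problem comes with ground-truth images and fine-grained scoring points, and models are evaluated under strict scoring criteria.

Result: Even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image score below 15% under strict scoring, and most models score almost 0%, showing how challenging the benchmark is.

Insight: By framing image generation as an exam, GenExam offers a new perspective on a model's ability to integrate knowledge, reasoning, and generation, and exposes the limits of current generative models.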

Abstract: Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models’ ability to integrate knowledge, reasoning, and generation, providing insights on the path to general AGI.

q-bio.PE [Back]

[83] Autonomous Reporting of Normal Chest X-rays by Artificial Intelligence in the United Kingdom. Can We Take the Human Out of the Loop?

Katrina Nash,James Vaz,Ahmed Maiter,Christopher Johns,Nicholas Woznitza,Aditya Kale,Abdala Espinosa Morgado,Rhidian Bramley,Mark Hall,David Lowe,Alex Novak,Sarim Ather

Main category: q-bio.PE

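TL;DR: This paper examines the feasibility and potential impact of using AI to autonomously report normal chest X-rays (CXRs) in the UK, covering the technical, legal, and practical challenges.

Details Motivation: A shortage of radiologists in the UK delays CXR reporting. AI tools that can identify normal CXRs and report them automatically could substantially reduce the workload.

Contribution: The paper analyzes the feasibility of autonomous AI reporting of normal CXRs and highlights the technical, legal, and ethical issues that must be resolved before wider adoption.

Method: Drawing on the capabilities of existing AI tools, the paper discusses how to define a "normal" CXR, generalisability across populations, the sensitivity-specificity trade-off, and legal compliance (e.g., IR(ME)R and GDPR).

Result: AI shows promise in this area, but further validation and a regulatory framework are needed to ensure safety and clear accountability.

Insight: AI can assist radiology work, but removing human oversight entirely may be premature. Progress depends on technical improvement, legal support, and the involvement of multiple stakeholders.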

Abstract: Chest X-rays (CXRs) are the most commonly performed imaging investigation. In the UK, many centres experience reporting delays due to radiologist workforce shortages. Artificial intelligence (AI) tools capable of distinguishing normal from abnormal CXRs have emerged as a potential solution. If normal CXRs could be safely identified and reported without human input, a substantial portion of radiology workload could be reduced. This article examines the feasibility and implications of autonomous AI reporting of normal CXRs. Key issues include defining normal, ensuring generalisability across populations, and managing the sensitivity-specificity trade-off. It also addresses legal and regulatory challenges, such as compliance with IR(ME)R and GDPR, and the lack of accountability frameworks for errors. Further considerations include the impact on radiologists' practice, the need for robust post-market surveillance, and incorporation of patient perspectives. While the benefits are clear, adoption must be cautious.

cs.RO [Back]

[84] Semantic 3D Reconstructions with SLAM for Central Airway Obstruction

Ayberk Acar,Fangjie Li,Hao Li,Lidia Al-Zogbi,Kanyifeechukwu Jane Oguine,Susheela Sharma Stern,Jesse F. d’Almeida,Robert J. Webster III,Ipek Oguz,Jie Ying Wu

Main category: cs.RO

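TL;DR: The paper presents a pipeline that combines semantic segmentation with real-time monocular SLAM for endoscopic 3D reconstruction of central airway obstruction (CAO), producing accurate, clinically relevant annotated maps in real time.

Details Motivation: CAO is a life-threatening condition, and traditional treatments carry a high risk of complications. Automated approaches that combine robotic intervention with scene understanding could reduce risk and improve precision.

Contribution: This is the first work to integrate semantic segmentation with real-time monocular SLAM for endoscopic CAO scenarios. It provides a modular framework that generates annotated 3D reconstructions in real time.

Method: DROID-SLAM is combined with a segmentation model. SLAM reconstructs the 3D airway geometry in real time, and the segmentation masks guide the annotation of obstruction regions.

Result: Validation on ex vivo models shows high similarity between the reconstructions and ground-truth CT scans (0.62 mm Chamfer distance), with faster reconstruction than previous work.

Insight: Integrating semantic segmentation directly into the SLAM workflow makes it possible to annotate clinically relevant regions in real time, a feasible direction for automated robotic intervention.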

Abstract: Central airway obstruction (CAO) is a life-threatening condition with increasing incidence, caused by tumors in and outside of the airway. Traditional treatment methods such as bronchoscopy and electrocautery can be used to remove the tumor completely; however, these methods carry a high risk of complications. Recent advances allow robotic interventions with lower risk. The combination of robotic interventions with scene understanding and mapping also opens up possibilities for automation. We present a novel pipeline that enables real-time, semantically informed 3D reconstructions of the central airway using monocular endoscopic video. Our approach combines DROID-SLAM with a segmentation model trained to identify obstructive tissues. The SLAM module reconstructs the 3D geometry of the airway in real time, while the segmentation masks guide the annotation of obstruction regions within the reconstructed point cloud. To validate our pipeline, we evaluate the reconstruction quality using ex vivo models. Qualitative and quantitative results show high similarity between ground truth CT scans and the 3D reconstructions (0.62 mm Chamfer distance). By integrating segmentation directly into the SLAM workflow, our system produces annotated 3D maps that highlight clinically relevant regions in real time. The high-speed capabilities of the pipeline allow quicker reconstructions compared to previous work, reflecting the surgical scene more accurately. To the best of our knowledge, this is the first work to integrate semantic segmentation with real-time monocular SLAM for endoscopic CAO scenarios. Our framework is modular and can generalize to other anatomies or procedures with minimal changes, offering a promising step toward autonomous robotic interventions.
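
The Chamfer distance quoted above compares a reconstructed point cloud with a reference one. The following is a minimal sketch of one common definition (the average of the two directed mean nearest-neighbour distances). The abstract does not say which variant the authors use, so the exact normalisation here is an assumption, and the brute-force pairwise computation is only suitable for small clouds.

```python
import numpy as np


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds of shape (N, 3) and (M, 3).

    Assumed variant: mean nearest-neighbour Euclidean distance in each
    direction, averaged over the two directions. Other papers use sums
    or squared distances instead.
    """
    a = np.asarray(cloud_a, dtype=float)
    b = np.asarray(cloud_b, dtype=float)
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1).mean()  # each point in a to its nearest point in b
    b_to_a = dists.min(axis=0).mean()  # each point in b to its nearest point in a
    return 0.5 * (a_to_b + b_to_a)
```

If both clouds are expressed in millimetres, the result is in millimetres as well, which is how a figure such as 0.62 mm would be read.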

[85] Object Pose Estimation through Dexterous Touch

Amir-Hossein Shahidzadeh,Jiyue Zhu,Kezhou Chen,Sha Yi,Cornelia Fermüller,Yiannis Aloimonos,Xiaolong Wang

Main category: cs.RO

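TL;DR: This paper proposes estimating object pose through bimanual tactile exploration, using reinforcement learning to actively collect tactile data and iterative refinement to estimate the pose.

Details Motivation: When visual data is limited or unreliable (lighting, occlusion, or appearance changes), tactile sensors provide only local contact information that is hard to use directly for pose estimation. The paper addresses this problem through active exploration.

Contribution: 1) A cooperative bimanual tactile exploration method; 2) active collection of tactile data with reinforcement learning; 3) object pose estimation through iterative refinement of point clouds.

Method: 1) One hand holds the object steady while the other explores it actively; 2) the exploration policy is trained with reinforcement learning; 3) the collected 3D point clouds are used to iteratively refine the object's shape and pose.

Result: Experiments show that the method can identify critical pose features through tactile exploration without prior knowledge of the object's geometry.

Insight: Active tactile exploration combined with reinforcement learning offers a new approach to object pose estimation, with particular promise when visual information is limited.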

Abstract: Robust object pose estimation is essential for manipulation and interaction tasks in robotics, particularly in scenarios where visual data is limited or sensitive to lighting, occlusions, and appearances. Tactile sensors often offer limited and local contact information, making it challenging to reconstruct the pose from partial data. Our approach uses sensorimotor exploration to actively control a robot hand to interact with the object. We train with Reinforcement Learning (RL) to explore and collect tactile data. The collected 3D point clouds are used to iteratively refine the object’s shape and pose. In our setup, one hand holds the object steady while the other performs active exploration. We show that our method can actively explore an object’s surface to identify critical pose features without prior knowledge of the object’s geometry. Supplementary material and more demonstrations will be provided at https://amirshahid.github.io/BimanualTactilePose .

[86] MAP: End-to-End Autonomous Driving with Map-Assisted Planning

Huilin Yin,Yiming Kan,Daniel Watzenig

Main category: cs.RO

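TL;DR: The paper proposes MAP (Map-Assisted Planning), an end-to-end trajectory planning framework that explicitly integrates segmentation-based map features with the current ego status. Experiments show substantially lower error and a higher overall score, even without post-processing.

Details Motivation: Existing end-to-end autonomous driving methods underutilize the online mapping module, which limits its contribution to trajectory planning. The paper aims to improve planning accuracy and robustness by explicitly integrating map features and ego status.

Contribution: 1. The MAP framework, which explicitly integrates map features and ego status through a Plan-enhancing Online Mapping module, an Ego-status-guided Planning module, and a Weight Adapter; 2. validation on the DAIR-V2X-seq-SPD dataset; 3. a top-ranking result in a CVPR2025 workshop challenge.

Method: 1. Plan-enhancing Online Mapping module: extracts segmentation-based map features; 2. Ego-status-guided Planning module: plans with the ego status as guidance; 3. Weight Adapter: adjusts weights dynamically based on the current ego status.

Result: On the DAIR-V2X-seq-SPD dataset, MAP reduces L2 displacement error by 16.6%, reduces the off-road rate by 56.2%, and improves the overall score by 44.5% relative to the UniV2X baseline. In the CVPR2025 challenge, its overall score is 39.5% higher than that of the second-best model.

Insight: Explicitly exploiting semantic map features can substantially improve planning in end-to-end autonomous driving systems and suggests new directions for system design.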

Abstract: In recent years, end-to-end autonomous driving has attracted increasing attention for its ability to jointly model perception, prediction, and planning within a unified framework. However, most existing approaches underutilize the online mapping module, leaving its potential to enhance trajectory planning largely untapped. This paper proposes MAP (Map-Assisted Planning), a novel map-assisted end-to-end trajectory planning framework. MAP explicitly integrates segmentation-based map features and the current ego status through a Plan-enhancing Online Mapping module, an Ego-status-guided Planning module, and a Weight Adapter based on current ego status. Experiments conducted on the DAIR-V2X-seq-SPD dataset demonstrate that the proposed method achieves a 16.6% reduction in L2 displacement error, a 56.2% reduction in off-road rate, and a 44.5% improvement in overall score compared to the UniV2X baseline, even without post-processing. Furthermore, it achieves top ranking in Track 2 of the End-to-End Autonomous Driving through V2X Cooperation Challenge of MEIS Workshop @CVPR2025, outperforming the second-best model by 39.5% in terms of overall score. These results highlight the effectiveness of explicitly leveraging semantic map features in planning and suggest new directions for improving structure design in end-to-end autonomous driving systems. Our code is available at https://gitee.com/kymkym/map.git

[87] MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping

Zhihao Cao,Hanyu Wu,Li Wa Tang,Zizhou Luo,Zihan Zhu,Wei Zhang,Marc Pollefeys,Martin R. Oswald

Main category: cs.RO

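TL;DR: MCGS-SLAM is the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS), and it achieves high-fidelity map reconstruction. With multi-camera bundle adjustment and a scale consistency module, it maintains real-time performance and usually outperforms monocular baselines in geometric coverage and reconstruction quality.

Details Motivation: Current dense SLAM methods mainly target monocular cameras and overlook what multiple cameras offer in robustness and geometric coverage. Multi-camera input provides a wider field of view and compensates for the weakness of monocular systems in reconstructing side-view regions, which matters for safety-critical applications such as autonomous driving.

Contribution: 1. The first 3D Gaussian Splatting SLAM system based on multi-camera RGB input; 2. a multi-camera bundle adjustment (MCBA) and a scale consistency module that jointly optimize poses and depths; 3. a demonstration of the advantages of multi-camera SLAM in high-fidelity reconstruction and real-time performance.

Method: 1. Multi-camera inputs are fused into a unified Gaussian map; 2. MCBA jointly refines poses and depths using dense photometric and geometric residuals; 3. the scale consistency module uses low-rank priors to enforce metric alignment across views.

Result: Experiments on synthetic and real-world datasets show that MCGS-SLAM usually outperforms monocular baselines in trajectory accuracy and reconstruction quality, particularly when reconstructing side-view regions.

Insight: Combining multi-camera input with Gaussian Splatting improves the robustness and coverage of SLAM and offers a new approach to high-fidelity mapping in areas such as autonomous driving.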

Abstract: Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.

cs.IR [Back]

[88] Enhancing Time Awareness in Generative Recommendation

Sunkyung Lee,Seongmin Park,Jonghyo Kim,Mincheol Yoon,Jongwuk Lee

Main category: cs.IR

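TL;DR: The paper proposes GRUT, a model that uses Time-aware Prompting and Trend-aware Inference to address the neglect of temporal dynamics in generative recommendation, substantially improving recommendation performance.

Details Motivation: Existing generative recommendation methods focus on the sequential order of items and ignore the temporal dynamics across items, which can imply evolving user preferences. The paper aims to address this limitation.

Contribution: The GRUT model, which introduces Time-aware Prompting and Trend-aware Inference to capture the temporal dynamics of user preferences and improve recommendation performance.

Method: GRUT models temporal dynamics through a user-level temporal context and an item-level transition context, and it enhances rankings at inference time through Trend-aware Inference.

Result: Across four benchmark datasets, GRUT improves Recall@5 and NDCG@5 by up to 15.4% and 14.3%, respectively.

Insight: Temporal dynamics are essential for capturing how user preferences evolve, and GRUT offers a new approach to generative recommendation.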

Abstract: Generative recommendation has emerged as a promising paradigm that formulates the recommendations into a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on considering the sequential order of items and neglect to handle the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The source code is available at https://github.com/skleee/GRUT.
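
The gains above are reported in Recall@5 and NDCG@5. As a reading aid, here is a minimal sketch of how these two metrics are commonly computed when each test case has a single held-out target item. That evaluation protocol is an assumption, since the abstract does not describe it, and the function names are illustrative.

```python
import math


def recall_at_k(ranked_items, target, k=5):
    """1.0 if the held-out target appears in the top-k ranking, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=5):
    """NDCG@k with a single relevant item: 1 / log2(rank + 1), or 0.0 if missed.

    With one relevant item the ideal DCG is 1, so no further normalisation is needed.
    """
    for index, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(index + 2)  # index 0 corresponds to rank 1
    return 0.0
```

Dataset-level scores are the averages of these per-user values, so NDCG@5 rewards placing the target near the top while Recall@5 only checks whether it is present.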

eess.AS [Back]

[89] TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

Haolong Zheng,Yekaterina Yegorova,Mark Hasegawa-Johnson

Main category: eess.AS

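TL;DR: The paper proposes TICL, a method that selects in-context examples by semantic context to improve the speech recognition ability of large multimodal models without fine-tuning, substantially reducing WER across several challenging tasks.

Details Motivation: Speech foundation models can perform Speech In-Context Learning (SICL), but example selection methods remain underexplored, which limits performance. This work aims to improve example selection for SICL using semantic context.

Contribution: The TICL method, which uses text embeddings and KNN to select high-quality examples and substantially improves the speech recognition ability of large multimodal models, notably on accented English, multilingual speech, and children's speech.

Method: TICL selects examples by the semantic similarity of their text embeddings, using KNN to choose the in-context examples for SICL. It applies to off-the-shelf models without fine-tuning.

Result: Across several challenging tasks, TICL achieves up to 84.7% relative WER reduction over zero-shot performance, and ablation studies demonstrate its robustness and efficiency.

Insight: Semantically informed example selection is crucial for SICL. A simple pipeline based on text embeddings and KNN can substantially improve model performance and offers a new approach to speech recognition tasks.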

Abstract: Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models’ speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children’s speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
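
The selection step and the headline metric can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that a text embedding is already available for the test utterance and for every candidate example (the abstract does not say how these are obtained), it uses cosine similarity as the distance, and all function names are hypothetical.

```python
import numpy as np


def select_in_context_examples(query_embedding, pool_embeddings, k=4):
    """Return indices of the k pool items most similar to the query (cosine similarity)."""
    query = np.asarray(query_embedding, dtype=float)
    pool = np.asarray(pool_embeddings, dtype=float)
    query = query / np.linalg.norm(query)
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    similarities = pool @ query
    # Stable sort on negated similarity gives descending order with ties kept in pool order.
    order = np.argsort(-similarities, kind="stable")
    return order[:k].tolist()


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction, e.g. 0.20 -> 0.05 is a 75% relative WER reduction."""
    return (baseline_wer - new_wer) / baseline_wer
```

The selected indices would then be used to pick the audio and transcript pairs placed in the model's prompt ahead of the test utterance.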

eess.IV [Back]

[90] 3D Reconstruction of Coronary Vessel Trees from Biplanar X-Ray Images Using a Geometric Approach

Ethan Koland,Lin Xi,Nadeev Wijesuriya,YingLiang Ma

Main category: eess.IV

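TL;DR: The study proposes a geometric framework for reconstructing 3D coronary vessel trees from biplanar X-ray images. Its three main steps (image segmentation, motion phase matching, and 3D reconstruction) improve on the accuracy and workflow of traditional methods.

Details Motivation: X-ray angiography is used in cardiac interventions to visualize coronary vessels, but traditional 3D reconstruction methods suffer from errors and complexity. The study aims to simplify reconstruction and improve its accuracy with a geometric approach.

Contribution: 1. An automatic video segmentation method that supports semantic segmentation and motion phase matching; 2. a 3D vessel tree reconstruction method based on a geometric algorithm; 3. a simpler reconstruction workflow with improved accuracy.

Method: 1. An automatic segmentation method labels the different object classes; 2. motion phases are matched by tracking a stationary object such as a catheter; 3. a heuristic method matches key anatomical landmarks, and a geometric algorithm reconstructs the vessel tree.

Result: Segmentation accuracy reaches 0.703, and the 3D reconstruction achieves a reprojection error of 0.62 ± 0.38 mm, supporting the effectiveness of the method.

Insight: The geometric approach simplifies the 3D reconstruction workflow while improving accuracy, which is valuable for assisting clinical cardiac procedures.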

Abstract: X-ray angiography is widely used in cardiac interventions to visualize coronary vessels, assess integrity, detect stenoses and guide treatment. We propose a framework for reconstructing 3D vessel trees from biplanar X-ray images which are extracted from two X-ray videos captured at different C-arm angles. The proposed framework consists of three main components: image segmentation, motion phase matching, and 3D reconstruction. An automatic video segmentation method for X-ray angiography provides the semantic segmentation used for image segmentation and motion phase matching. The goal of the motion phase matching is to identify a pair of X-ray images that correspond to a similar respiratory and cardiac motion phase to reduce errors in 3D reconstruction. This is achieved by tracking a stationary object such as a catheter or lead within the X-ray video. The semantic segmentation approach assigns different labels to different object classes, enabling accurate differentiation between blood vessels, balloons, and catheters. Once a suitable image pair is selected, key anatomical landmarks (vessel branching points and endpoints) are matched between the two views using a heuristic method that minimizes reconstruction errors. This is followed by a novel geometric reconstruction algorithm to generate the 3D vessel tree. The algorithm computes the 3D vessel centrelines by determining the intersection of two 3D surfaces. Compared to traditional methods based on epipolar constraints, the proposed approach simplifies the reconstruction workflow and improves overall accuracy. We trained and validated our segmentation method on 62 X-ray angiography video sequences. On the test set, our method achieved a segmentation accuracy of 0.703. The 3D reconstruction framework was validated by measuring the reconstruction error of key anatomical landmarks, achieving a reprojection error of 0.62 mm ± 0.38 mm.

[91] Generative AI Pipeline for Interactive Prompt-driven 2D-to-3D Vascular Reconstruction for Fontan Geometries from Contrast-Enhanced X-Ray Fluoroscopy Imaging

Prahlad G Menon

Main category: eess.IV

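TL;DR: This paper presents a generative AI pipeline for interactive, prompt-driven 2D-to-3D vascular reconstruction of Fontan geometries from contrast-enhanced X-ray fluoroscopy imaging and demonstrates its clinical feasibility.

Details Motivation: Fontan palliation for univentricular congenital heart disease progresses to hemodynamic failure, and conventional 2D imaging characterizes the complex flow patterns poorly. A method is needed to generate 3D geometry from routine 2D angiographic data.

Contribution: A multi-step AI pipeline combining Google's Gemini 2.5 Flash and Tencent's Hunyuan3D-2mini that generates CFD-suitable 3D geometries from single-view angiograms and supports rapid virtual flow visualization.

Method: The pipeline covers medical image preprocessing, vascular segmentation, contrast enhancement, artifact removal, and virtual hemodynamic flow visualization within 2D projections. The final views are processed by Hunyuan3D-2mini to generate stereolithography files.

Result: The pipeline generated geometrically optimized 2D projections and completed processing in under 15 minutes. The virtual flow visualization identified stagnation zones and flow patterns in branch arteries.

Insight: The approach shows the clinical potential of generating CFD-suitable geometries from routine angiographic data. Although it needs iterative refinement for accuracy, it lays a foundation for advanced hemodynamic analysis using readily available imaging data.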

Abstract: Fontan palliation for univentricular congenital heart disease progresses to hemodynamic failure with complex flow patterns poorly characterized by conventional 2D imaging. Current assessment relies on fluoroscopic angiography, providing limited 3D geometric information essential for computational fluid dynamics (CFD) analysis and surgical planning. A multi-step AI pipeline was developed utilizing Google’s Gemini 2.5 Flash (2.5B parameters) for systematic, iterative processing of fluoroscopic angiograms through a transformer-based neural architecture. The pipeline encompasses medical image preprocessing, vascular segmentation, contrast enhancement, artifact removal, and virtual hemodynamic flow visualization within 2D projections. Final views were processed through Tencent’s Hunyuan3D-2mini (384M parameters) for stereolithography file generation. The pipeline successfully generated geometrically optimized 2D projections from single-view angiograms after 16 processing steps using a custom web interface. Initial iterations contained hallucinated vascular features requiring iterative refinement to achieve anatomically faithful representations. Final projections demonstrated accurate preservation of complex Fontan geometry with enhanced contrast suitable for 3D conversion. AI-generated virtual flow visualization identified stagnation zones in central connections and flow patterns in branch arteries. Complete processing required under 15 minutes with second-level API response times. This approach demonstrates the clinical feasibility of generating CFD-suitable geometries from routine angiographic data, enabling 3D generation and rapid virtual flow visualization for cursory insights prior to full CFD simulation. While requiring refinement cycles for accuracy, this establishes a foundation for democratizing advanced geometric and hemodynamic analysis using readily available imaging data.

cs.AI [Back]

[92] Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness

Pratik Jayarao,Himanshu Gupta,Neeraj Varshney,Chaitanya Dwivedi

Main category: cs.AI

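TL;DR: The paper compares "thinking" and "non-thinking" LLMs in the LLM-as-a-judge setting and finds that explicit reasoning (thinking models) wins on accuracy, efficiency, and robustness.

Details Motivation: As LLMs are widely adopted as automated judges, ensuring their reliability, efficiency, and robustness has become critical. The paper systematically studies the role of explicit reasoning in LLM judging.

Contribution: 1. A systematic comparison of LLMs with and without explicit reasoning; 2. evidence that explicit reasoning is better in accuracy, efficiency, and robustness; 3. an extension of the conclusions to multilingual settings.

Method: Open-source Qwen 3 models (0.6B to 4B parameters) are evaluated on RewardBench tasks in thinking and non-thinking modes, together with augmentation strategies such as in-context learning and rubric-guided judging.

Result: Thinking models are about 10 percentage points more accurate with little overhead (under 2x), whereas augmentation strategies deliver modest gains at a higher cost (>8x). Under a variety of bias conditions, thinking models are 6% more consistent on average.

Insight: Explicit reasoning is key to better LLM-as-a-judge performance, and its benefits extend beyond English.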

Abstract: As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of “thinking” and “non-thinking” LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Thinking models achieve approximately 10 percentage points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting, and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work provides systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.
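
One of the bias conditions named above, positional bias, can be probed by asking the judge about the same pair of responses in both orders. The sketch below is an illustration under stated assumptions: the `judge` callable and its "A"/"B" return convention are hypothetical, and the abstract does not specify how the authors compute their consistency scores.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs where the same underlying response wins in both orderings.

    `judge(prompt, first, second)` is a hypothetical callable returning "A" if it
    prefers the response shown first and "B" if it prefers the one shown second.
    `pairs` is a list of (prompt, response_x, response_y) tuples.
    """
    consistent = 0
    for prompt, response_x, response_y in pairs:
        original = judge(prompt, response_x, response_y)
        swapped = judge(prompt, response_y, response_x)
        # The verdict must flip with the order for the same response to win twice.
        if (original, swapped) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(pairs)
```

A judge that always prefers whichever response is shown first scores 0.0 on this measure, while a judge whose preference depends only on content scores 1.0.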

[93] Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning

Pulkit Verma,Ngoc La,Anthony Favier,Swaroop Mishra,Julie A. Shah

Main category: cs.AI

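TL;DR: The paper proposes PDDL-Instruct, an instruction tuning framework that strengthens the symbolic planning ability of large language models (LLMs) through logical chain-of-thought reasoning and substantially improves planning accuracy.

Details Motivation: LLMs perform well across diverse tasks, but their ability remains limited on structured symbolic planning tasks that require formal representations such as PDDL.

Contribution: The PDDL-Instruct framework, which uses logical chain-of-thought reasoning to teach LLMs to reason rigorously about action applicability, state transitions, and plan validity.

Method: Instruction prompts decompose the planning process into explicit reasoning chains about precondition satisfaction, effect application, and invariant preservation, and structured reflection enables self-correction.

Result: In experiments on multiple planning domains, planning accuracy reaches up to 94%, a 66% absolute improvement over baseline models.

Insight: Logical chain-of-thought reasoning narrows the gap between the general reasoning ability of LLMs and the logical precision required for automated planning.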

Abstract: Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, yet their ability to perform structured symbolic planning remains limited, particularly in domains requiring formal representations like the Planning Domain Definition Language (PDDL). In this paper, we present a novel instruction tuning framework, PDDL-Instruct, designed to enhance LLMs’ symbolic planning capabilities through logical chain-of-thought reasoning. Our approach focuses on teaching models to rigorously reason about action applicability, state transitions, and plan validity using explicit logical inference steps. By developing instruction prompts that guide models through the precise logical reasoning required to determine when actions can be applied in a given state, we enable LLMs to self-correct their planning processes through structured reflection. The framework systematically builds verification skills by decomposing the planning process into explicit reasoning chains about precondition satisfaction, effect application, and invariant preservation. Experimental results on multiple planning domains show that our chain-of-thought reasoning based instruction-tuned models are significantly better at planning, achieving planning accuracy of up to 94% on standard benchmarks, representing a 66% absolute improvement over baseline models. This work bridges the gap between the general reasoning capabilities of LLMs and the logical precision required for automated planning, offering a promising direction for developing better AI planning systems.
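
The three checks the framework teaches (precondition satisfaction, effect application, and plan validity) correspond to what a classical plan validator does. The following is a minimal STRIPS-style sketch of that logic over ground facts, included only to make the reasoning steps concrete. It is not the authors' code, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts made true by the action
    del_effects: frozenset    # facts made false by the action


def is_applicable(state, action):
    """An action is applicable when every precondition is in the current state."""
    return action.preconditions <= state


def apply_action(state, action):
    """State transition: remove the delete effects, then add the add effects."""
    return (state - action.del_effects) | action.add_effects


def validate_plan(initial_state, plan, goal):
    """Return (is_valid, explanation) for a sequence of ground actions."""
    state = frozenset(initial_state)
    for step, action in enumerate(plan, start=1):
        if not is_applicable(state, action):
            missing = sorted(action.preconditions - state)
            return False, f"step {step} ({action.name}): unmet preconditions {missing}"
        state = apply_action(state, action)
    if not frozenset(goal) <= state:
        return False, f"goal not reached, missing {sorted(frozenset(goal) - state)}"
    return True, "plan is valid"
```

The explanation string plays the role of the feedback a model would reflect on: it names the first step whose preconditions fail, or the goal facts that are still missing.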

[94] SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

Vincent Siu,Nicholas Crispino,David Park,Nathan W. Henry,Zhun Wang,Yang Liu,Dawn Song,Chenguang Wang

Main category: cs.AI

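TL;DR: The paper introduces SteeringControl, a benchmark for evaluating representation steering methods on core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors (such as sycophancy and commonsense morality). It identifies tradeoffs that prior work has not explored systematically and uses a modular framework to evaluate five popular steering methods on different models.

Details Motivation: Prior alignment work tends to focus only on truthfulness or reasoning ability and overlooks other tradeoffs that are not yet systematically understood. The paper aims to fill this gap with a benchmark that evaluates the effects of steering methods comprehensively.

Contribution: 1. The SteeringControl benchmark for evaluating the effectiveness and side effects of steering methods across multiple objectives; 2. a modular steering framework that supports flexible evaluation of different methods; 3. evidence of complex interactions among steering method, model, and targeted behavior.

Method: A modular steering framework is built around five popular steering methods and evaluated on Qwen-2.5-7B and Llama-3.1-8B, analyzing effects on both core alignment objectives and secondary behaviors.

Result: Steering performance depends strongly on the combination of method, model, and targeted behavior, and poor combinations can cause severe concept entanglement.

Insight: Steering methods should be designed with tradeoffs across multiple objectives in mind, because optimizing a single objective can cause unexpected side effects. The modular framework gives future alignment research added flexibility.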

Abstract: We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives–bias, harmful generation, and hallucination–and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance is dependent on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of these three as well. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
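
For readers unfamiliar with representation steering, the sketch below shows one widely used family: a direction computed as a difference of mean activations, which is then either added to hidden states or projected out of them. The abstract does not name the five methods it benchmarks, so this is only an assumed illustration of the general idea, operating on plain NumPy arrays rather than a real model.

```python
import numpy as np


def difference_of_means_direction(positive_activations, negative_activations):
    """Steering direction: mean activation on positive examples minus mean on negative ones."""
    positive = np.asarray(positive_activations, dtype=float)
    negative = np.asarray(negative_activations, dtype=float)
    return positive.mean(axis=0) - negative.mean(axis=0)


def add_direction(hidden_states, direction, alpha=1.0):
    """Activation addition: shift every hidden state along the direction by alpha."""
    return np.asarray(hidden_states, dtype=float) + alpha * direction


def ablate_direction(hidden_states, direction):
    """Directional ablation: remove each hidden state's component along the direction."""
    hidden = np.asarray(hidden_states, dtype=float)
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)
```

Entanglement in the sense of the abstract would show up when a direction chosen for one behavior also moves activations that matter for another behavior.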

[95] See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

Zongru Wu,Rui Mao,Zhiyuan Tian,Pengzhou Cheng,Tianjie Ju,Zheng Wu,Lingzhong Dong,Haiyue Sheng,Zhuosheng Zhang,Gongshen Liu

Main category: cs.AI

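TL;DR: The paper proposes State-aware Reasoning (StaR), a training method that improves how accurately multimodal agents execute toggle instructions in GUIs. By perceiving the current toggle state and analyzing the desired state in the instruction, StaR substantially improves performance.

Details Motivation: The inability of multimodal agents to reliably execute toggle instructions is a key bottleneck in GUI interaction, particularly when the current state already matches the desired state. The authors propose StaR to address this.

Contribution: 1. A state control benchmark with binary toggle instructions; 2. the StaR method, which substantially improves toggle instruction execution accuracy; 3. evidence of StaR's potential for general task performance and dynamic environments.

Method: StaR trains agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly.

Result: Experiments on three multimodal agents show that StaR improves toggle instruction execution accuracy by over 30%. Further evaluations on public benchmarks and in a dynamic environment support its generality and practicality.

Insight: StaR addresses the reliability of toggle instruction execution and offers a new approach to efficient interaction for multimodal agents in complex scenarios.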

Abstract: The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
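
The perceive, analyze, act sequence can be written down as a small decision rule. The sketch below is only an illustration of that rule: in the paper the perception and analysis are done by a trained multimodal agent, whereas here the current state is passed in directly and the desired state comes from a hypothetical keyword parser. The action names are also hypothetical.

```python
from enum import Enum


class ToggleState(Enum):
    ON = "on"
    OFF = "off"


def desired_state_from_instruction(instruction):
    """Hypothetical keyword parser standing in for the agent's instruction analysis."""
    text = instruction.lower()
    if any(phrase in text for phrase in ("turn off", "switch off", "disable")):
        return ToggleState.OFF
    if any(phrase in text for phrase in ("turn on", "switch on", "enable")):
        return ToggleState.ON
    raise ValueError(f"no toggle intent found in: {instruction!r}")


def plan_toggle_action(instruction, current_state):
    """Click only when the perceived state differs from the desired one."""
    desired_state = desired_state_from_instruction(instruction)
    if current_state == desired_state:
        return "no_op"  # the toggle already matches the instruction
    return "click_toggle"
```

The `no_op` branch is the case the abstract singles out as unreliable for existing agents: the toggle is already in the requested state, so clicking would undo it.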

[96] THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Qikai Chang,Zhenrong Zhang,Pengfei Hu,Jiefeng Ma,Yicheng Pan,Jianshu Zhang,Jun Du,Quan Liu,Jianqing Gao

Main category: cs.AI

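TL;DR: The paper proposes THOR, which combines reinforcement learning with external tools to address the difficulty language models have with high-precision tasks in mathematical reasoning. It covers dataset construction, fine-grained optimization, and inference enhancement.

Details Motivation: Although large language models have made progress in mathematical reasoning, they still perform poorly on high-precision tasks such as numerical computation and symbolic manipulation, and they need external tools to close the gap.

Contribution: 1. TIRGen, a multi-agent pipeline for generating high-quality tool-integrated reasoning datasets; 2. a hierarchical RL strategy that jointly optimizes trajectory-level problem solving and step-level code generation; 3. a self-correction mechanism that uses tool feedback to revise reasoning paths dynamically.

Method: A multi-agent actor-critic pipeline constructs the dataset, hierarchical RL optimizes problem solving and code generation, and a self-correction mechanism driven by tool feedback is applied during inference.

Result: THOR achieves state-of-the-art performance among models of similar scale on multiple mathematical benchmarks and delivers consistent improvements on code benchmarks.

Insight: The success of an intermediate tool call is a strong predictor of the correctness of the final answer.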

Abstract: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer’s correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.

[97] Exploring Major Transitions in the Evolution of Biological Cognition With Artificial Neural Networks

Konstantinos Voudouris,Andrew Barron,Marta Halina,Colin Klein,Matishalin Patel

Main category: cs.AI

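TL;DR: The paper uses artificial neural networks to study major transitions in the evolution of biological cognition. It finds that changes in the structure of information flow can produce qualitative changes in cognitive performance, with recurrent networks handling complex inputs better, and that training difficulty acts as an evolutionary barrier.

Details Motivation: To explore how changes in the structure of biological neural networks affect cognitive performance through major transitions, providing a theoretical model for understanding the evolution of cognition.

Contribution: The paper tests how changes in information flow topology (feed-forward, recurrent, and laminated networks) qualitatively affect cognitive performance, and it shows both the advantage of recurrent networks and the role of their training difficulty as an evolutionary barrier.

Method: Artificial neural networks with different topologies (feed-forward, recurrent, and laminated) are compared on learning artificial grammars of varying complexity, controlling for network size and resources.

Result: Recurrent networks outperform feed-forward networks on complex inputs, laminated networks show no advantage in grammar learning, and training difficulty forms a transition barrier.

Insight: Changes in the structure of information flow may be a key factor in the evolution of cognition. The advantage of recurrent networks illustrates the qualitative change and contingent irreversibility seen in evolutionary transitions.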

Abstract: Transitional accounts of evolution emphasise a few changes that shape what is evolvable, with dramatic consequences for derived lineages. More recently it has been proposed that cognition might also have evolved via a series of major transitions that manipulate the structure of biological neural networks, fundamentally changing the flow of information. We used idealised models of information flow, artificial neural networks (ANNs), to evaluate whether changes in information flow in a network can yield a transitional change in cognitive performance. We compared networks with feed-forward, recurrent and laminated topologies, and tested their performance learning artificial grammars that differed in complexity, controlling for network size and resources. We documented a qualitative expansion in the types of input that recurrent networks can process compared to feed-forward networks, and a related qualitative increase in performance for learning the most complex grammars. We also noted how the difficulty in training recurrent networks poses a form of transition barrier and contingent irreversibility – other key features of evolutionary transitions. Not all changes in network topology confer a performance advantage in this task set. Laminated networks did not outperform non-laminated networks in grammar learning. Overall, our findings show how some changes in information flow can yield transitions in cognitive performance.

[98] The Art of Saying “Maybe”: A Conformal Lens for Uncertainty Benchmarking in VLMs

Asif Azad,Mohammad Sadat Hossain,MD Sadik Hossain Shanto,M Saifur Rahman,Md Rizwan Pervez

Main category: cs.AI

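TL;DR: The paper comprehensively benchmarks uncertainty quantification in 16 state-of-the-art vision-language models (VLMs). Larger, more accurate models quantify uncertainty better, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models.

Details Motivation: Although VLMs perform well on multimodal tasks, the critical dimension of uncertainty quantification has received insufficient attention. This paper aims to fill that gap.

Contribution: A comprehensive uncertainty-quantification evaluation of 16 VLMs on 6 multimodal datasets, finding that larger models consistently quantify uncertainty better.

Method: Conformal prediction with 3 distinct scoring functions is applied to 16 open- and closed-source VLMs across 6 multimodal datasets.

Result: Larger, more accurate models show better uncertainty quantification, whereas mathematical and reasoning tasks yield poorer uncertainty performance than other domains across all models.

Insight: High accuracy alone is not enough; models also need good uncertainty quantification, especially on complex tasks.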

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don’t know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
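
As a rough illustration of the conformal setup the abstract refers to, the following minimal sketch builds prediction sets for a multiple-choice task with split conformal prediction. The LAC score (one minus the probability of the true answer) is one commonly used scoring function and is an assumption here; the abstract does not name the three scores the authors use, and the toy probabilities are invented for illustration.

```python
import numpy as np

def lac_scores(probs, labels):
    """LAC nonconformity score: 1 - probability assigned to the true label."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the threshold score
    if k > n:
        return np.inf  # too few calibration points: keep every option
    return np.sort(cal_scores)[k - 1]

def prediction_sets(test_probs, qhat):
    """Keep every answer option whose score 1 - p is at most the threshold."""
    return [np.flatnonzero(1.0 - p <= qhat) for p in test_probs]

# Toy calibration data (hypothetical): softmax over 4 answer options.
cal_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.20, 0.50, 0.20],
    [0.25, 0.25, 0.10, 0.40],
])
cal_labels = np.array([0, 1, 2, 3])

qhat = conformal_quantile(lac_scores(cal_probs, cal_labels), alpha=0.25)

test_probs = np.array([
    [0.90, 0.05, 0.03, 0.02],  # confident model: small set
    [0.45, 0.45, 0.05, 0.05],  # uncertain model: larger set
])
sets = prediction_sets(test_probs, qhat)
print([s.tolist() for s in sets])  # [[0], [0, 1]]
```

Set size serves as the uncertainty measure: at a fixed coverage level, a model that produces smaller sets is expressing less uncertainty.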

cs.LG [Back]

[99] Privacy-Aware In-Context Learning for Large Language Models

Bishnu Bhusal,Manoj Acharya,Ramneet Kaur,Colin Samplawski,Anirban Roy,Adam D. Cobb,Rohit Chadha,Susmit Jha

Main category: cs.LG

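TL;DR: The paper proposes a private prediction framework based on differential privacy (DP) for generating high-quality synthetic text with worst-case theoretical bounds on information leakage, without fine-tuning the underlying model.

Details Motivation: Large language models (LLMs) excel at natural language understanding and generation but carry privacy risks, in particular that adversaries can extract sensitive information embedded in prompts. This paper aims to address that problem.

Contribution: 1) A private prediction framework; 2) generation of longer, coherent synthetic text by aggregating per-token output distributions; 3) a blending operation that combines private and public inference to further improve utility.

Method: Within the DP framework, the method runs inference on private records and aggregates the resulting per-token output distributions to generate synthetic text. A blending operation combines private and public inference to improve utility.

Result: Experiments show that the method outperforms previous state-of-the-art methods on in-context learning (ICL) tasks while balancing privacy and utility.

Insight: The study demonstrates the potential of differential privacy for protecting privacy in LLMs and offers a new technical route to privacy-preserving text generation.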

Abstract: Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility.

[100] LLM-I: LLMs are Naturally Interleaved Multimodal Creators

Zirun Guo,Feng Zhang,Kai Jia,Tao Jin

Main category: cs.LG

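TL;DR: LLM-I is a flexible framework that reframes interleaved image-text generation as a tool-use problem, overcoming the “one-tool” bottleneck of current unified models. It uses reinforcement learning to train an LLM/MLLM agent to orchestrate specialized visual tools.

Details Motivation: Existing unified models are limited to synthetic imagery and struggle with tasks that require factual grounding or programmatic precision. LLM-I aims to remove this limitation.

Contribution: 1) The LLM-I framework, which recasts the task as a tool-use problem; 2) a diverse toolkit including online image search, diffusion-based generation, code execution, and image editing; 3) agent training via reinforcement learning with a hybrid reward system; 4) large gains over existing methods on four benchmarks.

Method: 1) A central LLM/MLLM agent orchestrates specialized tools; 2) the agent is trained in a reinforcement learning framework whose hybrid reward combines rule-based logic with LLM/MLLM evaluator judgments; 3) a test-time scaling strategy provides further gains.

Result: Trained on a diverse new dataset with four model backbones, LLM-I outperforms existing methods by a large margin on four benchmarks.

Insight: Combining tool use with reinforcement learning shows the potential of dynamically orchestrating specialized tools for multimodal generation.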

Abstract: We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the “one-tool” bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.

cs.SD [Back]

[101] Noise Supervised Contrastive Learning and Feature-Perturbed for Anomalous Sound Detection

Shun Huang,Zhihua Fang,Liang He

Main category: cs.SD

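TL;DR: The paper proposes one-stage supervised contrastive learning (OS-SCL) with feature perturbation, which substantially reduces false alarms in anomalous sound detection, together with a new time-frequency feature, TFgram, that further improves performance.

Details Motivation: Current unsupervised anomalous sound detection methods produce frequent false alarms when handling samples of the same type from different machines, and this problem remains unresolved. The paper aims to improve detection through supervised contrastive learning and feature perturbation.

Contribution: 1. The one-stage supervised contrastive learning (OS-SCL) method; 2. feature perturbation in the embedding space; 3. a new time-frequency feature, TFgram, that improves detection.

Method: 1. OS-SCL uses noisy supervised contrastive learning; 2. features are perturbed in the embedding space; 3. TFgram features are extracted from raw audio.

Result: On DCASE 2020 Challenge Task 2, Log-Mel features reach 94.64% AUC and TFgram features reach 95.71% AUC, significantly outperforming the baseline.

Insight: Supervised contrastive learning and feature perturbation effectively reduce false alarms on same-type samples, and TFgram better captures the information critical for anomalous sound detection.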

Abstract: Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines remains unresolved. This paper introduces a novel training technique called one-stage supervised contrastive learning (OS-SCL), which significantly addresses this problem by perturbing features in the embedding space and employing a one-stage noisy supervised contrastive learning approach. On the DCASE 2020 Challenge Task 2, it achieved 94.64% AUC, 88.42% pAUC, and 89.24% mAUC using only Log-Mel features. Additionally, a time-frequency feature named TFgram is proposed, which is extracted from raw audio. This feature effectively captures critical information for anomalous sound detection, ultimately achieving 95.71% AUC, 90.23% pAUC, and 91.23% mAUC. The source code is available at: www.github.com/huangswt/OS-SCL.
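
To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch of a standard supervised contrastive loss combined with a perturbation of the embeddings. It is not the authors' OS-SCL implementation: the Gaussian noise in `perturb`, the temperature, and the toy embeddings are all assumptions chosen for illustration.

```python
import numpy as np

def perturb(embeddings, sigma, rng):
    """Hypothetical feature perturbation: additive Gaussian noise in embedding space."""
    return embeddings + rng.normal(0.0, sigma, size=embeddings.shape)

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss over L2-normalised embeddings."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    self_mask = np.eye(len(labels), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-pairs from the softmax
    # Numerically stable log-softmax over all other samples.
    m = sim.max(axis=1, keepdims=True)
    log_denominator = m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    log_prob = sim - log_denominator
    # Positives share the anchor's label (for example, the same machine ID).
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # anchors with no positive are skipped
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1)[valid] / n_pos[valid]
    return per_anchor.mean()

# Toy embeddings (hypothetical): two tight clusters in 2-D.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_clustered = supervised_contrastive_loss(emb, np.array([0, 0, 1, 1]))
loss_mixed = supervised_contrastive_loss(emb, np.array([0, 1, 0, 1]))

rng = np.random.default_rng(0)
noisy_emb = perturb(emb, sigma=0.05, rng=rng)
loss_perturbed = supervised_contrastive_loss(noisy_emb, np.array([0, 0, 1, 1]))
```

The loss is small when samples with the same label sit close together (`loss_clustered`) and large when labels cut across the clusters (`loss_mixed`), which is the property a detector relies on to separate machines of the same type.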

cs.AR [Back]

[102] A TRRIP Down Memory Lane: Temperature-Based Re-Reference Interval Prediction For Instruction Caching

Henry Kao,Nikhil Sreekumar,Prabhdeep Singh Soni,Ali Sedaghati,Fang Su,Bryan Chan,Maziar Goudarzi,Reza Azimi

Main category: cs.AR

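TL;DR: The paper proposes TRRIP, a software-hardware co-design in which the compiler classifies code by temperature (hot/cold) and passes that information to hardware through an OS interface. The instruction cache replacement policy then evicts hot code less often, improving mobile CPU performance.

Details Motivation: The complex runtime behavior of modern mobile CPU software causes high reuse distances in the instruction cache, which conventional hardware-centric cache management cannot handle. Code size and complexity are growing faster than on-chip memory, so a new approach is needed.

Contribution: TRRIP combines compiler analysis with a lightweight hardware extension, using code temperature attributes to optimize the instruction cache replacement policy and reduce the eviction rate of hot code.

Method: The compiler analyzes code temperature, an OS interface based on code page attributes conveys it, and the hardware uses the attributes to adjust cache replacement (on top of RRIP) so that hot code is retained preferentially.

Result: On mobile code already optimized with PGO, TRRIP reduces L2 instruction MPKI by 26.5% and yields a geomean speedup of 3.9% over RRIP.

Insight: Software-hardware co-design can use code temperature information to improve cache management and mobile system performance, and it is practical to deploy.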

Abstract: Modern mobile CPU software poses challenges for conventional instruction cache replacement policies due to its complex runtime behavior, which causes high reuse distance between executions of the same instruction. Mobile code commonly suffers from a large number of stalls in the CPU frontend and thus starvation of the rest of the CPU resources. The complexity of these applications and their code footprint are projected to grow at a rate faster than available on-chip memory due to power and area constraints, making conventional hardware-centric methods for managing instruction caches inadequate. We present a novel software-hardware co-design approach called TRRIP (Temperature-based Re-Reference Interval Prediction) that enables the compiler to analyze, classify, and transform code based on “temperature” (hot/cold), and to provide the hardware with a summary of code temperature information through a well-defined OS interface based on using code page attributes. TRRIP’s lightweight hardware extension employs code temperature attributes to optimize the instruction cache replacement policy, resulting in a reduced eviction rate for hot code. TRRIP is designed to be practical and adoptable in real mobile systems that have strict feature requirements on both the software and hardware components. TRRIP can reduce the L2 MPKI for instructions by 26.5%, resulting in a geomean speedup of 3.9%, on top of RRIP cache replacement running mobile code already optimized using PGO.
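
The idea of steering replacement by code temperature can be sketched with a toy model of one cache set under SRRIP-style replacement. The baseline rules (2-bit re-reference prediction values, promotion to 0 on a hit, eviction of a line at the maximum value) follow the standard SRRIP scheme. The temperature-dependent insertion values are an assumption made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field

MAX_RRPV = 3  # 2-bit re-reference prediction value (RRPV)

# Hypothetical insertion policy: hot code is predicted to be reused soon,
# cold code is predicted to be reused in the distant future.
INSERT_RRPV = {"hot": 0, "warm": MAX_RRPV - 1, "cold": MAX_RRPV}

@dataclass
class CacheSet:
    ways: int
    lines: dict = field(default_factory=dict)  # tag -> RRPV

    def access(self, tag, temperature="warm"):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.lines:
            self.lines[tag] = 0  # hit promotion
            return True
        if len(self.lines) >= self.ways:
            self._evict()
        self.lines[tag] = INSERT_RRPV[temperature]
        return False

    def _evict(self):
        """Evict a line with RRPV == MAX_RRPV, ageing all lines until one exists."""
        while True:
            for tag, rrpv in self.lines.items():
                if rrpv == MAX_RRPV:
                    del self.lines[tag]
                    return
            for tag in self.lines:
                self.lines[tag] += 1

def hot_line_survives(use_temperature):
    """Touch one hot line, stream three cold lines, then re-access the hot line."""
    cache = CacheSet(ways=2)
    cache.access("hot_fn", "hot" if use_temperature else "warm")
    for tag in ("cold_a", "cold_b", "cold_c"):
        cache.access(tag, "cold" if use_temperature else "warm")
    return cache.access("hot_fn", "hot" if use_temperature else "warm")

print(hot_line_survives(use_temperature=True))   # True
print(hot_line_survives(use_temperature=False))  # False
```

With every line treated as "warm" the model reduces to plain SRRIP, and the stream of cold lines pushes the hot line out. With temperature hints the cold lines evict one another instead.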

cs.CY [Back]

[103] Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI

Zihao Li,Weiwei Yi,Jiahong Chen

Main category: cs.CY

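TL;DR: The paper examines the “accuracy paradox” in large language models (LLMs): overreliance on accuracy metrics masks the complexity of the hallucination problem. It proposes a taxonomy of hallucination types and new directions for risk governance.

Details Motivation: As LLMs spread into everyday decision-making, hallucinations (fabricated, misleading, or untrustworthy outputs) pose epistemic and societal risks that demand urgent scrutiny. Existing governance frameworks rely too heavily on accuracy and therefore misdiagnose the problem.

Contribution: 1. A taxonomy of hallucination types; 2. an account of the accuracy paradox along three intertwined dimensions (outputs, individuals, and society); 3. a critique of the limits of current regulation (such as the EU AI Act) and a call for a more comprehensive governance framework.

Method: Interdisciplinary literature analysis is used to build the taxonomy of hallucination types, combined with a critical assessment of existing regulation (the EU AI Act, GDPR, and DSA).

Result: Accuracy as a single metric fails to capture harms such as misleading content, value-laden bias, and social distortion, so the paper calls for pluralistic governance approaches.

Insight: Governing LLM hallucination requires moving beyond accuracy towards context awareness, resilience to manipulation, and social pluralism, in order to address broader epistemic and societal risks.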

Abstract: As Large Language Models (LLMs) permeate everyday decision-making, their epistemic and societal risks demand urgent scrutiny. Hallucinations, the generation of fabricated, misleading, oversimplified or untrustworthy outputs, have emerged as an imperative challenge. While regulatory, academic, and technical discourse position accuracy as the principal benchmark for mitigating such harms, this article contends that overreliance on accuracy misdiagnoses the problem and has a counterproductive effect: the accuracy paradox. Drawing on interdisciplinary literatures, this article develops a taxonomy of hallucination types and shows the paradox along three intertwining dimensions: outputs, individuals and society. First, accuracy functions as a superficial proxy for reliability, incentivising the optimisation of rhetorical fluency and surface-level correctness over epistemic trustworthiness. This encourages passive user trust in outputs that appear accurate but are epistemically untenable. Second, accuracy as a singular metric fails to detect harms that are not factually false but are nonetheless misleading, value-laden, or socially distorting, including consensus illusions, sycophantic alignment, and subtle manipulation. Third, regulatory overemphasis on accuracy obscures the wider societal consequences of hallucination, including social sorting, privacy violations, equity harms, and epistemic convergence that marginalises dissent, reduces pluralism, and causes social deskilling. By examining the EU AI Act, GDPR, and DSA, the article argues that current regulations are not yet structurally equipped to address these epistemic, relational, and systemic harms, which are exacerbated by the overreliance on accuracy. By exposing such conceptual and practical challenges, this article calls for a fundamental shift towards pluralistic, context-aware, and manipulation-resilient approaches to trustworthy AI governance.

[104] CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI

Hasin Jawad Ali,Ilhamul Azam,Ajwad Abrar,Md. Kamrul Hasan,Hasan Mahmud

Main category: cs.CY

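TL;DR: CogniAlign is a multi-agent deliberation framework based on naturalistic moral realism. It grounds moral reasoning in survivability through structured debate among discipline-specific agents (neuroscience, psychology, and others), and it clearly outperforms GPT-4o in transparency and depth of explanation.

Details Motivation: Existing AI alignment approaches struggle with the abstract nature of moral principles and with opaque reasoning. CogniAlign aims to improve transparency and explanation quality through debate among agents from multiple disciplines.

Contribution: The CogniAlign framework, which grounds moral reasoning in survivability and produces transparent, empirically anchored judgments through debate among discipline-specific agents.

Method: Agents representing neuroscience, psychology, sociology, and evolutionary biology exchange arguments and rebuttals, which an arbiter synthesizes into a transparent decision.

Result: Across more than sixty moral questions, CogniAlign improves on GPT-4o by an average of 16.2 points in analytic quality, 14.3 points in breadth, and 28.4 points in depth of explanation.

Insight: Interdisciplinary deliberation can markedly improve the transparency and safety of AI moral reasoning, offering a scalable path for alignment.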

Abstract: The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches. This paper introduces CogniAlign, a multi-agent deliberation framework based on naturalistic moral realism, that grounds moral reasoning in survivability, defined across individual and collective dimensions, and operationalizes it through structured deliberations among discipline-specific scientist agents. Each agent, representing neuroscience, psychology, sociology, and evolutionary biology, provides arguments and rebuttals that are synthesized by an arbiter into transparent and empirically anchored judgments. We evaluate CogniAlign on classic and novel moral questions and compare its outputs against GPT-4o using a five-part ethical audit framework. Results show that CogniAlign consistently outperforms the baseline across more than sixty moral questions, with average performance gains of 16.2 points in analytic quality, 14.3 points in breadth, and 28.4 points in depth of explanation. In the Heinz dilemma, for example, CogniAlign achieved an overall score of 89.2 compared to GPT-4o’s 69.2, demonstrating a decisive advantage in handling moral reasoning. By reducing black-box reasoning and avoiding deceptive alignment, CogniAlign highlights the potential of interdisciplinary deliberation as a scalable pathway for safe and transparent AI alignment.

cs.SE [Back]

[105] An Empirical Study on Failures in Automated Issue Solving

Simiao Liu,Fang Liu,Liehao Li,Xin Tan,Yinghao Zhu,Xiaoli Lian,Li Zhang

Main category: cs.SE

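TL;DR: The paper analyzes failure modes in automated issue solving, proposes a three-level taxonomy of failures, and designs a collaborative Expert-Executor framework to improve performance.

Details Motivation: Automated issue-solving tools still fail on a substantial portion of SWE-Bench-Verified tasks, and current evaluations report only aggregate solve rates. This hides the root causes of failure and gives little guidance for targeted improvement.

Contribution: 1) A systematic classification of the failure modes of automated issue-solving tools; 2) the finding that most failures of agentic architectures stem from flawed reasoning and cognitive deadlocks; 3) a collaborative Expert-Executor framework that raises the issue-solving rate.

Method: 1) Analysis of three SOTA tools under varying task characteristics; 2) manual analysis of 150 failed instances, yielding a failure taxonomy with 3 phases, 9 categories, and 25 subcategories; 3) design of the collaborative Expert-Executor framework.

Result: Experiments show that the proposed framework solves 22.2% of the issues that a leading single agent could not solve.

Insight: 1) The main weakness of agentic architectures lies in reasoning and cognition; 2) diagnostic evaluation and collaborative design can make agents more robust.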

Abstract: Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic tools show great promise, they still fail on a substantial portion of tasks. Moreover, current evaluations primarily report aggregate issue-solving rates, which obscure the underlying causes of success and failure, making it challenging to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified under varying task characteristics. Furthermore, to move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. From this analysis, we developed a comprehensive taxonomy of failure modes comprising 3 primary phases, 9 main categories, and 25 fine-grained subcategories. We then systematically analyze the distribution of the identified failure modes. The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these insights, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent tasked with providing strategic oversight and course-correction for a primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that frequently lead to failure. Experiments show that our framework solves 22.2% of previously intractable issues for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.

[106] Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework

Kerui Huang,Shuhan Liu,Xing Hu,Tongtong Xu,Lingfeng Bao,Xin Xia

Main category: cs.SE

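TL;DR: The paper proposes SEER, a framework that adaptively compresses Chain-of-Thought (CoT) reasoning to reduce computational overhead while preserving accuracy.

Details Motivation: CoT reasoning improves the accuracy and robustness of LLMs, but it is computationally expensive. Overly long reasoning can cause truncation, accuracy drops, and higher latency, especially in tasks that require concise outputs.

Contribution: The SEER framework, which adaptively compresses CoT to lower computational overhead and latency while preserving accuracy.

Method: SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds to compress CoT.

Result: Experiments show that SEER shortens CoT by 42.1% on average, reduces truncation, eliminates most infinite loops, and improves efficiency.

Insight: Longer reasoning is not always better; adaptive CoT compression can substantially improve efficiency while maintaining performance.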

Abstract: Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by prompting intermediate steps, improving accuracy and robustness in arithmetic, logic, and commonsense tasks. However, this benefit comes with high computational costs: longer outputs increase latency, memory usage, and KV-cache demands. These issues are especially critical in software engineering tasks where concise and deterministic outputs are required. To investigate these trade-offs, we conduct an empirical study based on code generation benchmarks. The results reveal that longer CoT does not always help. Excessive reasoning often causes truncation, accuracy drops, and latency up to five times higher, with failed outputs consistently longer than successful ones. These findings challenge the assumption that longer reasoning is inherently better and highlight the need for adaptive CoT control. Motivated by this, we propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy. SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. We then evaluate SEER on three software engineering tasks and one math task. On average, SEER shortens CoT by 42.1%, improves accuracy by reducing truncation, and eliminates most infinite loops. These results demonstrate SEER as a practical method to make CoT-enhanced LLMs more efficient and robust, even under resource constraints.
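
The combination of Best-of-N sampling and adaptive length filtering can be sketched as follows. This is only an illustration of the general idea, not SEER itself: the median-based threshold, the `slack` factor, the whitespace token count, and the helper names are all hypothetical, and the abstract does not specify how the thresholds are computed.

```python
from statistics import median

def token_count(text):
    """Crude length proxy: whitespace-separated tokens."""
    return len(text.split())

def adaptive_threshold(pre_inference_outputs, slack=1.2):
    """Hypothetical rule: allow CoTs up to `slack` times the median pre-inference length."""
    return slack * median(token_count(t) for t in pre_inference_outputs)

def select_compressed_cot(candidates, threshold):
    """Best-of-N selection: the shortest correct candidate under the length threshold.

    `candidates` is a list of (cot_text, is_correct) pairs sampled for one task.
    Returns None when no candidate passes the filter.
    """
    kept = [cot for cot, ok in candidates if ok and token_count(cot) <= threshold]
    return min(kept, key=token_count) if kept else None

# Toy example with invented reasoning traces.
pre_inference = ["step " * 10, "step " * 20, "step " * 30]
threshold = adaptive_threshold(pre_inference)

candidates = [
    ("check input then return sum", True),  # short and correct
    ("think " * 50, True),                  # correct but too verbose
    ("return zero", False),                 # short but wrong
]
chosen = select_compressed_cot(candidates, threshold)
```

Here the verbose trace is filtered out by the threshold, the incorrect one by the correctness check, and the short correct trace is selected. Selected traces of this kind could then serve as targets when tuning a model towards concise reasoning.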