cs.CL

[1] Dynamic Prompt Fusion for Multi-Task and Cross-Domain Adaptation in LLMs

Xin Hu,Yue Kang,Guanzi Yao,Tianze Kang,Mengjie Wang,Heyao Liu

Main category: cs.CL

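TL;DR: This study proposes a dynamic prompt fusion method to improve the generalization of large language models in multi-task and cross-domain settings. Through dynamic scheduling and a task-aware prompt pool design, it significantly improves model performance.

Details Motivation: To address the limited generalization of large language models in multi-task and cross-domain settings and to avoid the limitations of traditional fixed prompt templates.

Contribution: 1. Proposes a dynamic prompt scheduling mechanism and a prompt pool design; 2. introduces task embeddings and a gating mechanism to optimize prompt fusion; 3. proposes a joint multi-task learning objective that automatically learns scheduling weights, reducing task interference.

Method: 1. A dynamic prompt pool with a task-aware scheduling strategy; 2. prompt signal control based on task embeddings and a gating mechanism; 3. an objective function that combines joint multi-task learning with automatic weight optimization.

Result: Experiments show that the method significantly improves performance on language understanding and knowledge reasoning tasks, and confirm its stability and cross-domain adaptability.

Insight: Dynamic prompt fusion can balance the shared and task-specific needs of multi-task learning through flexible cross-task sharing pathways and fine-grained signal control.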

Abstract: This study addresses the generalization limitations commonly observed in large language models under multi-task and cross-domain settings. Unlike prior methods such as SPoT, which depends on fixed prompt templates, our study introduces a unified multi-task learning framework with a dynamic prompt scheduling mechanism. By introducing a prompt pool and a task-aware scheduling strategy, the method dynamically combines and aligns prompts for different tasks. This enhances the model’s ability to capture semantic differences across tasks. During prompt fusion, the model uses task embeddings and a gating mechanism to finely control the prompt signals. This ensures alignment between prompt content and task-specific demands. At the same time, it builds flexible sharing pathways across tasks. In addition, the proposed optimization objective centers on joint multi-task learning. It incorporates an automatic learning strategy for scheduling weights, which effectively mitigates task interference and negative transfer. To evaluate the effectiveness of the method, a series of sensitivity experiments were conducted. These experiments examined the impact of prompt temperature parameters and task number variation. The results confirm the advantages of the proposed mechanism in maintaining model stability and enhancing transferability. Experimental findings show that the prompt scheduling method significantly improves performance on a range of language understanding and knowledge reasoning tasks. These results demonstrate its applicability and effectiveness in unified multi-task modeling and cross-domain adaptation.
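To make the idea of gated prompt fusion concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each prompt in the pool has a key vector and a value vector, that the gate is a temperature-scaled softmax over dot products with the task embedding, and that the fused prompt is the weighted sum of the values. The names `fuse_prompts`, `prompt_keys`, and `prompt_values` are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_prompts(task_embedding, prompt_keys, prompt_values, temperature=1.0):
    """Return (gate_weights, fused_prompt) for one task (illustrative gate).

    prompt_keys[i] scores prompt i against the task embedding;
    prompt_values[i] is the prompt vector that gets mixed in.
    """
    scores = [dot(task_embedding, key) for key in prompt_keys]
    weights = softmax(scores, temperature)
    dim = len(prompt_values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, prompt_values)) for d in range(dim)]
    return weights, fused

# Toy pool of three prompts; all numbers are made up for illustration.
task = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharp_weights, _ = fuse_prompts(task, keys, values, temperature=0.5)
soft_weights, _ = fuse_prompts(task, keys, values, temperature=5.0)
```

In this sketch a low temperature concentrates the gate on the best-matching prompt, while a high temperature spreads weight across the pool, which is one plausible reading of why the abstract reports sensitivity experiments on the prompt temperature.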

[2] Event Causality Identification with Synthetic Control

Haoyu Wang,Fengze Liu,Jiayao Zhang,Dan Roth,Kyle Richardson

Main category: cs.CL

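TL;DR: This paper proposes a new method that combines the synthetic control method with text embedding techniques to identify event causality more robustly, addressing the misjudgments that traditional methods make because of linguistic patterns and specious inference.

Details Motivation: Traditional event causality identification methods rely on linguistic patterns and multi-hop relational inference, and are prone to errors caused by informal language use and specious inference. The paper aims to identify causal relations more accurately by introducing the Rubin Causal Model and the synthetic control method.

Contribution: Applies the synthetic control method to event causality identification for the first time, and proposes a “twin” generation approach based on text embedding synthesis and inversion that significantly improves the accuracy of causality identification.

Method: Adopts the Rubin Causal Model, treating the first event as the “treatment” and the second as the “outcome”. The synthetic control method generates a “twin” (an untreated control) that is used to estimate the causal effect. Data synthesis is implemented with text embedding and inversion techniques.

Result: The method outperforms traditional methods and GPT-4 on the COPES-hard benchmark, demonstrating its robustness and effectiveness.

Insight: Applying the synthetic control method to text offers a new way to address data scarcity in causal inference, and text embeddings provide a more efficient tool for semantic matching and causal analysis.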

Abstract: Event causality identification (ECI), a process that extracts causal relations between events from text, is crucial for distinguishing causation from correlation. Traditional approaches to ECI have primarily utilized linguistic patterns and multi-hop relational inference, risking false causality identification due to informal usage of causality and specious graphical inference. In this paper, we adopt the Rubin Causal Model to identify event causality: given two temporally ordered events, we see the first event as the treatment and the second one as the observed outcome. Determining their causality involves manipulating the treatment and estimating the resultant change in the likelihood of the outcome. Given that it is only possible to implement manipulation conceptually in the text domain, as a work-around, we try to find a twin for the protagonist from existing corpora. This twin should have identical life experiences with the protagonist before the treatment but undergoes an intervention of treatment. However, the practical difficulty of locating such a match limits its feasibility. Addressing this issue, we use the synthetic control method to generate such a “twin” from relevant historical data, leveraging text embedding synthesis and inversion techniques. This approach allows us to identify causal relations more robustly than previous methods, including GPT-4, which is demonstrated on a causality benchmark, COPES-hard.
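The classical synthetic control idea behind this abstract can be sketched on plain numeric vectors. The sketch below is not the paper's pipeline: it assumes small feature vectors in place of text embeddings, omits embedding inversion entirely, and fits the donor weights with projected gradient descent on the probability simplex. All names (`synthetic_control_weights`, `donors_pre`, and so on) and all numbers are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    ordered = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]

def synthetic_control_weights(treated_pre, donors_pre, steps=5000, lr=0.005):
    """Fit convex weights so the weighted donors match the treated unit
    on pre-treatment features (least squares, projected gradient descent)."""
    n = len(donors_pre)
    dims = range(len(treated_pre))
    w = [1.0 / n] * n
    for _ in range(steps):
        synth = [sum(w[j] * donors_pre[j][d] for j in range(n)) for d in dims]
        residual = [s - t for s, t in zip(synth, treated_pre)]
        grad = [2.0 * sum(residual[d] * donors_pre[j][d] for d in dims) for j in range(n)]
        w = project_to_simplex([wj - lr * g for wj, g in zip(w, grad)])
    return w

# Pre-treatment features of the treated unit and of three untreated donors.
treated_pre = [1.0, 2.0]
donors_pre = [[0.0, 2.0], [2.0, 2.0], [5.0, 5.0]]
weights = synthetic_control_weights(treated_pre, donors_pre)

# Outcome likelihoods after the treatment event (made-up values).
treated_outcome = 0.9
donor_outcomes = [0.2, 0.4, 0.9]
synthetic_outcome = sum(w * y for w, y in zip(weights, donor_outcomes))
effect = treated_outcome - synthetic_outcome
```

Here the synthetic twin ends up as an even mix of the first two donors, and the estimated effect is the gap between the treated unit's outcome and the twin's outcome. The step size is chosen small enough for this toy data; other data would need its own setting.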

[3] ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization

Seungyoun Yi,Minsoo Khang,Sungrae Park

Main category: cs.CL

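TL;DR: ZERA is a new framework that jointly optimizes system and user prompts through principle-based, low-overhead refinement and converges quickly to high-quality prompts.

Details Motivation: Existing automatic prompt optimization (APO) methods typically focus only on user prompts, rely on unstructured feedback, and require large sample sizes and long iteration cycles, which makes them costly and brittle. ZERA aims to address these problems.

Contribution: Proposes the ZERA framework, which optimizes system and user prompts using scores on eight generalizable criteria and structured feedback, enabling fast and efficient prompt generation.

Method: Scores prompts on eight criteria and revises them based on structured feedback; jointly optimizes system and user prompts; reaches high-quality prompts with few examples and short iteration cycles.

Result: Validated on five large language models and nine datasets, ZERA outperforms the baselines; ablation studies confirm that each component contributes to prompt construction.

Insight: Structured feedback and joint optimization are key to prompt optimization; small samples and short iteration cycles can substantially reduce cost while improving results.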

Abstract: Automatic Prompt Optimization (APO) improves large language model (LLM) performance by refining prompts for specific tasks. However, prior APO methods typically focus only on user prompts, rely on unstructured feedback, and require large sample sizes and long iteration cycles, making them costly and brittle. We propose ZERA (Zero-init Instruction Evolving Refinement Agent), a novel framework that jointly optimizes both system and user prompts through principled, low-overhead refinement. ZERA scores prompts using eight generalizable criteria with automatically inferred weights, and revises prompts based on these structured critiques. This enables fast convergence to high-quality prompts using minimal examples and short iteration cycles. We evaluate ZERA across five LLMs and nine diverse datasets spanning reasoning, summarization, and code generation tasks. Experimental results demonstrate consistent improvements over strong baselines. Further ablation studies highlight the contribution of each component to more effective prompt construction. Our implementation including all prompts is publicly available at https://github.com/younatics/zera-agent.
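Criterion-weighted scoring of a prompt can be illustrated with a few lines of code. This is only a sketch under stated assumptions: the abstract does not name the eight criteria or say how the weights are inferred, so the example uses three placeholder criteria with hand-set weights, and both function names are hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

def weakest_criterion(scores, weights):
    """Criterion with the largest weighted shortfall from a perfect score of 1.0."""
    return max(weights, key=lambda name: weights[name] * (1.0 - scores[name]))

# Placeholder criteria and weights; the real system uses eight criteria
# with automatically inferred weights.
weights = {"criterion_1": 3.0, "criterion_2": 1.0, "criterion_3": 1.0}
scores = {"criterion_1": 0.9, "criterion_2": 0.4, "criterion_3": 0.9}

overall = weighted_score(scores, weights)
target = weakest_criterion(scores, weights)
```

A refinement loop could then ask the revising model to focus its next edit on `target`, which is one simple way to turn a structured critique into a concrete revision instruction.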

[4] Thinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning

Haodong Zhao,Chenyan Zhao,Yansi Li,Zhuosheng Zhang,Gongshen Liu

Main category: cs.CL

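TL;DR: The paper studies how external auxiliary information affects the reasoning of large language models (LLMs) and shows that a model's “thinking mode” is a double-edged sword: helpful information improves accuracy, but misleading information sharply degrades performance.

Details Motivation: In real-world scenarios, LLMs often rely on external information for reasoning, but that information may be helpful, irrelevant, or misleading. The study aims to quantify its effect on model reasoning.

Contribution: 1. Introduces the SciAux dataset to systematically test LLM robustness to auxiliary information; 2. shows that the model's “thinking mode” amplifies the negative effect of misleading information.

Method: Builds the SciAux dataset from ScienceQA and experimentally evaluates LLM reasoning in scenarios that contain helpful or misleading information.

Result: Misleading information causes a large drop in model performance, and “thinking mode” further aggravates the errors.

Insight: The challenge is not only to make models “think” but also to give them the ability to critically evaluate information.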

Abstract: The capacity of Large Language Models (LLMs) to reason is fundamental to their application in complex, knowledge-intensive domains. In real-world scenarios, LLMs are often augmented with external information that can be helpful, irrelevant, or even misleading. This paper investigates the causal impact of such auxiliary information on the reasoning process of LLMs with explicit step-by-step thinking capabilities. We introduce SciAux, a new dataset derived from ScienceQA, to systematically test the robustness of the model against these types of information. Our findings reveal a critical vulnerability: the model’s deliberative “thinking mode” is a double-edged sword. While helpful context improves accuracy, misleading information causes a catastrophic drop in performance, which is amplified by the thinking process. Instead of conferring robustness, thinking reinforces the degree of error when provided with misinformation. This highlights that the challenge is not merely to make models “think”, but to endow them with the critical faculty to evaluate the information upon which their reasoning is based. The SciAux dataset is available at https://huggingface.co/datasets/billhdzhao/SciAux.

[5] SIRAG: Towards Stable and Interpretable RAG with A Process-Supervised Multi-Agent Framework

Junlin Wang,Zehao Wu,Shaowei Lu,Yanlan Li,Xinghao Huang

Main category: cs.CL

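TL;DR: The paper proposes SIRAG, a multi-agent framework that aims to make retrieval-augmented generation (RAG) more stable and interpretable. It introduces two lightweight agents, a Decision Maker and a Knowledge Selector, and combines process-level supervision with a tree-structured exploration strategy to achieve higher accuracy and more stable training convergence.

Details Motivation: In existing RAG methods the retriever and generator are developed independently and interact poorly: retrieved results may be irrelevant or redundant, and the generator may fail to make full use of the retrieved evidence. As a result, overall RAG performance is unsatisfactory.

Contribution: 1) Proposes SIRAG, a modular multi-agent framework that introduces a Decision Maker and a Knowledge Selector; 2) uses process-level supervision (LLM-as-a-Judge) and a tree-structured exploration strategy to optimize training; 3) shows experimentally that the method improves accuracy and stability without modifying the existing retriever or generator.

Method: 1) The Decision Maker decides whether to continue retrieval or stop and generate an answer; 2) the Knowledge Selector filters the retrieved documents; 3) an LLM-as-a-Judge provides process-level rewards; 4) tree-structured exploration of reasoning paths and PPO are used for end-to-end training.

Result: On single-hop and multi-hop question answering benchmarks, SIRAG shows higher accuracy, more stable convergence, and more interpretable reasoning trajectories than standard RAG baselines.

Insight: Process-level supervision and multi-agent collaboration can compensate for the poor interaction between retriever and generator, providing a modular, plug-and-play solution for practical RAG applications.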

Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access external knowledge sources, but the effectiveness of RAG relies on the coordination between the retriever and the generator. Since these components are developed independently, their interaction is often suboptimal: the retriever may return irrelevant or redundant documents, while the generator may fail to fully leverage retrieved evidence. In this work, we propose a process-supervised multi-agent framework to bridge the gap between retriever and generator. The framework introduces two lightweight agents: a Decision Maker, which determines when to continue retrieval or stop for answer generation, and a Knowledge Selector, which filters retrieved documents to retain only the most useful evidence. To provide fine-grained supervision, we employ an LLM-as-a-Judge that evaluates each intermediate action with process-level rewards, ensuring more accurate credit assignment than relying solely on final answer correctness. We further adopt a tree-structured rollout strategy to explore diverse reasoning paths, and train both agents with Proximal Policy Optimization (PPO) in an end-to-end manner. Experiments on single-hop and multi-hop question answering benchmarks show that our approach achieves higher accuracy, more stable convergence, and produces more interpretable reasoning trajectories compared with standard RAG baselines. Importantly, the proposed framework is modular and plug-and-play, requiring no modification to the retriever or generator, making it practical for real-world RAG applications.

[6] ERFC: Happy Customers with Emotion Recognition and Forecasting in Conversation in Call Centers

Aditi Debsharma,Bhushan Jagyasi,Surajit Sen,Priyanka Pandey,Devicharith Dovari,Yuvaraj V. C,Rosalin Parida,Gopali Contractor

Main category: cs.CL

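TL;DR: The paper proposes a new architecture called ERFC for recognizing and forecasting emotion during calls, with the aim of improving customer experience in call centers.

Details Motivation: In call centers, the agent's emotion management is crucial to customer experience. Recognizing and forecasting emotional changes in a conversation helps improve customer satisfaction and turn unhappy customers into happy ones.

Contribution: Proposes the ERFC architecture, which combines multiple modalities, emotion attributes, context, and conversational dependencies to predict the emotion of future utterances.

Method: ERFC integrates multimodal data, emotion attribute analysis, and contextual information, and exploits the dependencies among speakers' utterances in a conversation for emotion recognition and forecasting.

Result: Experiments on the IEMOCAP dataset confirm the feasibility of ERFC and indicate significant business value for applications such as call centers.

Insight: Emotion forecasting not only improves customer experience but can also give agents real-time guidance and improve problem resolution, raising overall business performance.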

Abstract: Emotion Recognition in Conversation is widely applicable in call center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call center scenario, the role of the call center agent is not confined to receiving calls but also includes providing a good customer experience by pacifying the frustration or anger of customers. This can be achieved by maintaining neutral and positive emotion on the agent's side. As in any conversation, the emotion of one speaker usually depends on the emotion of the other speaker. Hence the positive emotion of an agent, accompanied by the right resolution, helps enhance customer experience and can turn an unhappy customer into a happy one. Imparting the right resolution at the right time becomes easier if the agent has insight into the emotions of future utterances. To predict the emotions of future utterances, we propose a novel architecture, Emotion Recognition and Forecasting in Conversation (ERFC). Our proposed ERFC architecture considers multiple modalities, different attributes of emotion, context, and the interdependencies of the utterances of the speakers in the conversation. Our intensive experiments on the IEMOCAP dataset have shown the feasibility of the proposed ERFC. This approach can provide tremendous business value for applications such as call centers, where customer happiness is of utmost importance.

[7] Evaluating Large Language Models for Detecting Antisemitism

Jay Patel,Hrudayangam Mehta,Jeremy Blackburn

Main category: cs.CL

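TL;DR: This paper evaluates eight open-source LLMs on detecting antisemitic content, proposes a new prompting method, Guided-CoT, that significantly improves model performance, and analyzes semantic divergence and paradoxical behaviors in LLMs.

Details Motivation: Detecting hateful content is an important and challenging task that requires automated tools such as LLMs to keep adapting to the changing nature of social media content. The paper aims to evaluate the capabilities and limitations of LLMs in detecting antisemitic content.

Contribution: 1. Proposes the Guided-CoT prompting method, which makes effective use of the in-context policy to improve model performance; 2. finds that Llama 3.1 70B outperforms fine-tuned GPT-3.5; 3. introduces new metrics that quantify semantic divergence in LLM-generated rationales, revealing behavioral differences.

Method: Designs the Guided-CoT prompt, uses an in-context definition as a policy guideline, tests eight open-source LLMs on antisemitism detection, and analyzes the rationales they generate.

Result: Guided-CoT improves the performance of all evaluated models, with Llama 3.1 70B outperforming fine-tuned GPT-3.5. The quantitative analysis also reveals notable differences among LLMs in semantic consistency and behavior.

Insight: LLMs perform unevenly at hateful content detection, prompt design has a marked effect on performance, and model rationales exhibit semantic divergence and paradoxical behaviors.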

Abstract: Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs’ capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs’ utility, explainability, and reliability.

[8] Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Hieu Tran,Zonghai Yao,Hong Yu

Main category: cs.CL

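TL;DR: The paper proposes P2T, a simple procedure that converts a group of responses into a prefix tree, and builds on it the TEMPO algorithm, which addresses token-level credit assignment under sparse, delayed rewards when training LLMs with reinforcement learning. TEMPO outperforms PPO and GRPO on multiple tasks.

Details Motivation: When training LLMs with reinforcement learning, sparse and delayed rewards (especially over long sequences) make token-level credit assignment the main bottleneck. Existing methods such as PPO and GRPO each have limitations, so a more efficient method that needs no complex value network is required.

Contribution: 1. Proposes P2T, which converts a group of responses into a prefix tree and computes nonparametric prefix values; 2. designs the TEMPO algorithm on top of P2T, which achieves token-level credit assignment through branch-gated TD corrections without an additional value network or external judges.

Method: 1. P2T converts a group of responses into a prefix tree and computes prefix values; 2. TEMPO combines GRPO's sequence-level signal with tree-based TD corrections to provide precise token-level credit at branching points.

Result: On multiple benchmarks (e.g., MATH and MedQA), TEMPO outperforms PPO and GRPO, reaching higher validation accuracy with similar training time.

Insight: 1. Tree structure can effectively capture token-level contributions; 2. branch-gated TD corrections make credit assignment more precise and avoid training a complex value network.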

Abstract: Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values $V(s)$ by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
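The prefix-value idea can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: it assumes that the prefix value is the mean reward of all sampled responses sharing that prefix, that the sequence-level signal is the reward minus the group mean (without the standard-deviation normalization often used with GRPO), and that the TD term for a token is the difference between the value after and before that token. The function names are hypothetical.

```python
from collections import defaultdict

def prefix_values(responses, rewards):
    """Map each prefix (a tuple of tokens) to the mean reward of the
    responses that pass through it. The empty prefix is the group mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        for end in range(len(tokens) + 1):
            prefix = tuple(tokens[:end])
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}

def token_advantages(tokens, reward, values):
    """Sequence-level advantage plus a per-token TD correction from the tree."""
    group_mean = values[()]
    advantages = []
    for t in range(len(tokens)):
        td = values[tuple(tokens[: t + 1])] - values[tuple(tokens[:t])]
        advantages.append((reward - group_mean) + td)
    return advantages

# Four sampled responses to one prompt, with verifiable 0/1 rewards.
responses = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"], ["a", "e", "g"]]
rewards = [1.0, 1.0, 0.0, 1.0]
values = prefix_values(responses, rewards)
first = token_advantages(responses[0], rewards[0], values)   # [0.25, 0.5, 0.25]
third = token_advantages(responses[2], rewards[2], values)   # [-0.75, -1.0, -1.25]
```

The sketch does not gate on branches explicitly. When a prefix has a single continuation, the same responses pass through both nodes, so the two values coincide and the TD term is zero on its own; token "a", shared by all four responses, receives only the sequence-level signal.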

[9] Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

Saksham Khatwani,He Cheng,Majid Afshar,Dmitriy Dligach,Yanjun Gao

Main category: cs.CL

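TL;DR: The paper explores a new paradigm: using a large language model (LLM) as a reward model over knowledge graph (KG) reasoning paths to make diagnostic reasoning more trustworthy. Experiments show promise for path judging, but transfer to downstream tasks is weak.

Details Motivation: LLMs show promise for diagnostic reasoning but lack reliable knowledge grounding; KGs provide structured biomedical knowledge, but existing methods do not fully exploit their reasoning potential. The paper proposes reward modeling to combine the strengths of both, mirroring how physicians assess diagnoses.

Contribution: 1. Proposes a new paradigm that treats the LLM as a reward model over KG reasoning paths; 2. systematically evaluates five task formulations and eight training paradigms; 3. provides the first systematic assessment of reward-model-style reasoning over clinical KGs.

Method: Treats the LLM as a reward model over KG reasoning paths and trains it to judge whether a candidate path leads to the correct diagnosis. Experiments compare how different task formulations and training paradigms affect path judging and downstream tasks (such as diagnosis summarization and medical question answering).

Result: Specific reward optimization and distillation methods perform well on path judging, but generalization to downstream tasks is weak.

Insight: Verification-style reasoning (judging whether a path is correct) is easier than generative reasoning, but its generalization still needs improvement for practical use; structured reward supervision shows promise for diagnostic reasoning but needs further exploration.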

Abstract: Large language models (LLMs) show promise for diagnostic reasoning but often lack reliable, knowledge-grounded inference. Knowledge graphs (KGs), such as the Unified Medical Language System (UMLS), offer structured biomedical knowledge that can support trustworthy reasoning. Prior approaches typically integrate KGs via retrieval-augmented generation or fine-tuning, inserting KG content into prompts rather than enabling structured reasoning. We explore an alternative paradigm: treating the LLM as a reward model of KG reasoning paths, where the model learns to judge whether a candidate path leads to a correct diagnosis for a given patient input. This approach is inspired by recent work that leverages reward training to enhance model reasoning abilities, and grounded in computational theory, which suggests that verifying a solution is often easier than generating one from scratch. It also parallels physicians’ diagnostic assessment, where they judge which sequences of findings and intermediate conditions most plausibly support a diagnosis. We first systematically evaluate five task formulations for knowledge path judging and eight training paradigms. Second, we test whether the path judging abilities generalize to downstream diagnostic tasks, including diagnosis summarization and medical question answering. Experiments with three open source instruct-tuned LLMs reveal both promise and brittleness: while specific reward optimization and distillation lead to strong path-judging performance, the transferability to downstream tasks remains weak. Our findings provide the first systematic assessment of “reward model style” reasoning over clinical KGs, offering insights into how structured, reward-based supervision influences diagnostic reasoning in GenAI systems for healthcare.

[10] Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations

Oscar J. Ponce-Ponte,David Toro-Tobon,Luis F. Figueroa,Michael Gionfriddo,Megan Branda,Victor M. Montori,Saturnino Luz,Juan P. Brito

Main category: cs.CL

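TL;DR: The study proposes a method that uses language models and a conversational alignment (CA) score to automatically detect shared decision-making (SDM) in patient-doctor conversations. Deep learning models and fine-tuned BERT models compute the CA scores, which are then tested for association with SDM outcomes.

Details Motivation: SDM is key to patient-centred care, but no scalable method currently exists to measure it automatically.

Contribution: Develops an automated method based on language models and CA scores for measuring SDM at scale, and validates it on clinical data.

Method: Trains deep learning models and fine-tuned BERT models with context-response pairs and negative sampling via the next sentence prediction (NSP) task, computes CA scores, and analyzes their association with SDM outcomes.

Result: CA scores from the DL model were associated with OPTION12 scores, and the Max CA score from the fine-tuned BERT model was associated with the DCS score, supporting the validity of the method.

Insight: CA scores can serve as an explainable, automated tool for evaluating SDM strategies at scale.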

Abstract: Shared decision-making (SDM) is necessary to achieve patient-centred care. Currently no methodology exists to automatically measure SDM at scale. This study aimed to develop an automated approach to measure SDM by using language modelling and the conversational alignment (CA) score. A total of 157 video-recorded patient-doctor conversations from a randomized multi-centre trial evaluating SDM decision aids for anticoagulation in atrial fibrillation were transcribed and segmented into 42,559 sentences. Context-response pairs and negative sampling were employed to train deep learning (DL) models and fine-tuned BERT models via the next sentence prediction (NSP) task. Each top-performing model was used to calculate four types of CA scores. A random-effects analysis by clinician, adjusting for age, sex, race, and trial arm, assessed the association between CA scores and SDM outcomes: the Decisional Conflict Scale (DCS) and the Observing Patient Involvement in Decision-Making 12 (OPTION12) scores. p-values were corrected for multiple comparisons with the Benjamini-Hochberg method. Among 157 patients (34% female; mean age 70, SD 10.8), clinicians on average spoke more words than patients (1911 vs 773). The DL model without the stylebook strategy achieved a recall@1 of 0.227, while the fine-tuned BERTbase (110M) achieved the highest recall@1 with 0.640. The AbsMax (18.36, SE 7.74, p=0.025) and Max CA (21.02, SE 7.63, p=0.012) scores generated with the DL model without the stylebook strategy were associated with OPTION12. The Max CA score generated with the fine-tuned BERTbase (110M) was associated with the DCS score (-27.61, SE 12.63, p=0.037). BERT model size did not affect the association between CA scores and SDM. This study introduces an automated, scalable methodology to measure SDM in patient-doctor conversations through explainable CA scores, with potential to evaluate SDM strategies at scale.
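The Benjamini-Hochberg correction named in the abstract is short enough to show in full. The sketch below implements the standard step-up adjustment in plain Python; the input p-values are made up for illustration and are not the study's raw values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]       # hypothetical raw p-values
adjusted = benjamini_hochberg(raw)   # approx. [0.04, 0.0533, 0.0533, 0.20]
```

An association is then reported as significant at a false discovery rate of 0.05 when its adjusted value is below 0.05; in this toy input only the first test qualifies.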

[11] CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Daniel Kaiser,Arnoldo Frigessi,Ali Ramezani-Kebrya,Benjamin Ricaud

Main category: cs.CL

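TL;DR: CogniLoad is a synthetic natural-language reasoning benchmark grounded in Cognitive Load Theory (CLT). By independently controlling intrinsic difficulty, distractor density, and task length, it provides a precise failure-analysis and diagnostic tool for the long-context reasoning of large language models (LLMs).

Details Motivation: Current long-context reasoning benchmarks often conflate key factors such as task complexity, distractor interference, and task length. CogniLoad aims to control these factors independently to enable precise analysis of LLM reasoning limitations.

Contribution: Proposes the CogniLoad benchmark, which independently controls intrinsic difficulty, distractor density, and task length, providing a systematic, scalable diagnostic tool for LLM reasoning.

Method: Based on CLT, generates natural-language logic puzzles with three independently tunable dimensions: intrinsic difficulty (d) controls intrinsic load; the distractor-to-signal ratio (ρ) regulates extraneous load; and task length (N) serves as an operational proxy for conditions demanding germane load.

Result: Evaluates 22 state-of-the-art reasoning LLMs, revealing task length as a dominant constraint, varied tolerance to intrinsic complexity, and U-shaped responses to the distractor ratio.

Insight: CogniLoad offers a reproducible, scalable diagnostic tool that reveals the diversity and limitations of LLM reasoning and can guide future model development.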

Abstract: Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT’s core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

[12] Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

Ben Finkelshtein,Silviu Cucerzan,Sujay Kumar Jauhar,Ryen White

Main category: cs.CL

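TL;DR: Through large-scale experiments, the paper evaluates LLMs on text-based graph inference tasks, analyzes strengths and weaknesses across interaction modes, datasets, and model configurations, and finds that code generation performs best on long-text or high-degree graphs.

Details Motivation: Although LLMs are widely used for text-rich graph machine learning tasks, a systematic understanding of how well they interact with graph data is still lacking.

Contribution: Presents a large-scale, controlled experimental framework that evaluates LLMs across graph inference tasks and offers practical guidance.

Method: Runs comparative experiments across interaction modes (prompting, tool use, code generation) and dataset domains (citation, web-link, e-commerce, and social networks).

Result: Code generation performs best on long-text or high-degree graphs and remains effective on heterophilic graphs.

Insight: Code generation can flexibly shift its reliance among structure, features, and labels to exploit the most informative input type.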

Abstract: Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.

[13] A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

Mohamad Elzohbi,Richard Zhao

Main category: cs.CL

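TL;DR: Proposes a ByT5-based method for inserting phrases that conform to a specific rhythm into Arabic poems, combining a rule-based grapheme-to-beat transformation with a curriculum learning strategy to achieve high rhythmic alignment and semantic coherence.

Details Motivation: Composing classical Arabic poetry requires strict adherence to rhythmic rules; traditional methods rely on manual effort, which is inefficient and error-prone. The paper aims to automate the task with modern natural language processing techniques.

Contribution: 1. Proposes a rule-based grapheme-to-beat transformation that extracts rhythm from fully diacritized Arabic text. 2. Fine-tunes ByT5 with a conditional denoising objective to achieve rhythmic alignment. 3. Adopts curriculum learning and explores cross-lingual transfer.

Method: 1. Rule-based grapheme-to-beat transformation. 2. Fine-tuning ByT5 with a conditional denoising objective. 3. A curriculum learning strategy: pre-training on a general Arabic dataset, then fine-tuning on a poetry dataset. 4. Exploring cross-lingual transfer from English to Arabic.

Result: Experiments show that the models achieve high rhythmic alignment while maintaining semantic coherence, indicating practical potential for Arabic poetry composition.

Insight: Combining rule-based rhythm extraction with deep learning can handle the complex constraints of poetry composition, and cross-lingual transfer points to the possibility of language-independent tasks.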

Abstract: This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on a poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.

[14] CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs

Jin Young Kim,Ji Won Yoon

Main category: cs.CL

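TL;DR: The paper proposes CCQA, a new reasoning method that improves the reasoning of small language models (SLMs) by scoring the similarity between generated questions and the original question, and shows that it outperforms existing methods on mathematical and commonsense reasoning tasks.

Details Motivation: Conventional inference-time reasoning methods perform poorly on small language models, so the authors propose CCQA, which generates questions and evaluates their similarity to the original in order to improve reasoning performance.

Contribution: 1. Proposes CCQA, which improves SLM reasoning through question generation and similarity scoring. 2. Verifies CCQA's advantage across multiple models and tasks.

Method: Based on cycle consistency, CCQA generates a question from each reasoning path, compares it with the original question, and selects the answer with the highest similarity score. A lightweight Flan-T5 model assists with question generation.

Result: Experiments show that CCQA outperforms existing methods on mathematical and commonsense reasoning tasks and establishes a new efficient reasoning baseline for SLMs.

Insight: Generating questions and checking their consistency with the original can improve small-model reasoning, and the lightweight auxiliary model makes the method more practical.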

Abstract: Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller models remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose Cycle-Consistency in Question Answering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. Experimental results verify that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at https://github.com/scai-research/ccqa_official.
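The selection step of a cycle-consistency approach can be sketched without any model. This sketch makes two loud assumptions: the regenerated questions are supplied as strings (in the paper they come from a Flan-T5 question generator), and similarity is a simple Jaccard overlap of lowercase tokens, a stand-in for whatever similarity measure the authors use. The function names are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (stand-in metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def select_answer(original_question, candidates):
    """Pick the answer whose regenerated question best matches the original.

    candidates: list of (answer, regenerated_question) pairs.
    """
    best = max(candidates, key=lambda pair: jaccard(original_question, pair[1]))
    return best[0]

question = "how many apples does tom have after buying 3 more"
candidates = [
    ("8", "how many apples does tom have after he buys 3 more"),
    ("5", "how many oranges did tom sell"),
]
chosen = select_answer(question, candidates)
```

The intuition is that a reasoning path that drifted away from the problem tends to regenerate a different question, so its similarity to the original drops and its answer is not selected.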

[15] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

Yeongbin Seo,Gayoung Kim,Jaehyung Kim,Jinyoung Yeo

Main category: cs.CL

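TL;DR: The paper proposes a prior-based text data filtering method that estimates token priors from corpus-level term frequency statistics, avoiding the time cost and instability of perplexity (PPL)-based methods and greatly improving filtering efficiency.

Details Motivation: Pretraining large language models (LLMs) requires efficient and reliable data filtering, but conventional PPL-based methods are slow and unreliable on noisy data.

Contribution: Proposes a fast prior-based data filtering method that estimates token priors from corpus-level term frequency statistics and uses them instead of PPL as the filtering criterion, greatly reducing time cost while performing better.

Method: Uses the mean and standard deviation of the token priors in a document as the filtering criterion; requires no model inference and applies to multilingual corpora and to symbolic languages such as code and math.

Result: Achieves the best average performance across 20 downstream tasks while reducing time cost by more than 1000x compared with PPL-based filtering.

Insight: Corpus-level statistics can serve as an efficient data filtering signal, especially in multilingual and multi-domain settings, without supervision.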

Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and the inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
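A prior-based filter of this kind can be sketched with the standard library alone. The abstract does not give the tokenizer or the thresholds, so this sketch assumes whitespace tokens, relative term frequency as the token prior, and hand-picked bounds on the per-document mean and standard deviation. The names `token_priors`, `doc_stats`, and `keep`, and all threshold values, are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def token_priors(corpus):
    """Estimate each token's prior as its relative frequency in the corpus."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}

def doc_stats(doc, priors):
    """Mean and population standard deviation of the token priors in a document."""
    values = [priors.get(tok, 0.0) for tok in doc.split()]
    return mean(values), pstdev(values)

def keep(doc, priors, low=0.1, high=0.3, min_std=0.05):
    """Keep documents whose prior statistics look like natural text (toy thresholds)."""
    if not doc.split():
        return False
    avg, spread = doc_stats(doc, priors)
    return low <= avg <= high and spread >= min_std

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the the the the the the",   # degenerate repetition: very high mean, zero spread
    "zxq vbn qwp lkj mnb rty",   # gibberish: very low mean, zero spread
]
priors = token_priors(corpus)
kept = [doc for doc in corpus if keep(doc, priors)]
```

Natural sentences mix frequent function words with rarer content words, so their mean prior sits in a middle range with a visible spread, while both noise documents fall outside it. No model inference is involved; the cost is one counting pass plus one statistics pass.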

[16] TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning

Yu Chen,Yifei Han,Long Zhang,Yue Du,Bin Li

Main category: cs.CL

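TL;DR: TsqLoRA is a low-rank adaptation method that combines data-quality-driven selection with sensitivity awareness. By selecting high-quality data and adjusting ranks according to layer sensitivity, it improves the efficiency and performance of fine-tuning large models.

Details Motivation: Full-parameter fine-tuning of large models is expensive in compute and memory, and existing parameter-efficient fine-tuning methods overlook differences in layer sensitivity and the importance of training data.

Contribution: Proposes TsqLoRA, which integrates data-quality-driven selection with sensitivity-aware low-rank adaptation and comprises a quality-aware sampling mechanism and a dynamic rank allocation module.

Method: 1. A quality-aware sampling mechanism selects the most informative training data; 2. a dynamic rank allocation module adjusts each layer's rank according to its sensitivity.

Result: Across a variety of NLP tasks, TsqLoRA improves fine-tuning efficiency while maintaining or improving performance.

Insight: Combining data quality with dynamic, sensitivity-based rank adjustment is key to improving parameter-efficient fine-tuning.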

Abstract: Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisting of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at https://github.com/Benjamin-Ricky/TsqLoRA.
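One simple way to picture sensitivity-based rank allocation is to split a fixed rank budget across layers in proportion to a sensitivity score. The abstract does not say how TsqLoRA measures sensitivity or distributes ranks, so the sketch below is an assumption throughout: sensitivities are given as non-negative numbers, every layer receives a minimum rank, and rounding uses the largest-remainder rule. The function name `allocate_ranks` is hypothetical.

```python
def allocate_ranks(sensitivities, total_rank, min_rank=1):
    """Split a total LoRA rank budget across layers in proportion to sensitivity."""
    n = len(sensitivities)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("total_rank is too small for the requested min_rank")
    total = sum(sensitivities)
    if total <= 0:
        raise ValueError("sensitivities must have a positive sum")
    shares = [spare * s / total for s in sensitivities]
    ranks = [min_rank + int(share) for share in shares]
    # Hand the ranks lost to truncation to the largest fractional remainders.
    leftover = total_rank - sum(ranks)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_remainder[:leftover]:
        ranks[i] += 1
    return ranks

# Three layers with made-up sensitivity scores and a budget of 16 ranks.
ranks = allocate_ranks([5.0, 3.0, 2.0], total_rank=16, min_rank=2)
even = allocate_ranks([1.0, 1.0, 1.0], total_rank=10, min_rank=1)
```

The most sensitive layer receives the largest adapter while the total number of trainable low-rank dimensions stays fixed, which is the trade-off that a dynamic rank allocation module is meant to manage.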

[17] UniECG: Understanding and Generating ECG in One Unified Model

Jiarui Jin,Haoyu Wang,Xiang Lan,Jun Li,Gaofeng Cheng,Hongyan Li,Shenda Hong

Main category: cs.CL

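TL;DR: UniECG is a unified model that performs both electrocardiogram (ECG) interpretation and ECG generation, trained with a decoupled two-stage approach that extends the capabilities of existing ECG models.

Details Motivation: Existing unified models (such as GPT-5) perform poorly on ECG understanding and generation; UniECG aims to fill this gap.

Contribution: Proposes the first unified model that can perform both evidence-based ECG interpretation and text-conditioned ECG generation, with generation added via latent space alignment in a two-stage training procedure.

Method: A decoupled two-stage training strategy: the first stage learns ECG-to-text interpretation, and the second injects text-to-ECG generation through latent space alignment.

Result: UniECG autonomously chooses whether to interpret or generate an ECG based on user input, significantly extending the capability boundaries of current ECG models.

Insight: Latent space alignment can unify different task capabilities in a single model, suggesting a route to integrating multimodal medical tasks.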

Abstract: Recent unified models such as GPT-5 have achieved encouraging progress on vision-language tasks. However, these unified models typically fail to correctly understand ECG signals and provide accurate medical diagnoses, nor can they correctly generate ECG signals. To address these limitations, we propose UniECG, the first unified model for ECG capable of concurrently performing evidence-based ECG interpretation and text-conditioned ECG generation tasks. Through a decoupled two-stage training approach, the model first learns evidence-based interpretation skills (ECG-to-Text), and then injects ECG generation capabilities (Text-to-ECG) via latent space alignment. UniECG can autonomously choose to interpret or generate an ECG based on user input, significantly extending the capability boundaries of current ECG models. Our code and checkpoints will be made publicly available at https://github.com/PKUDigitalHealth/UniECG upon acceptance.

[18] A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

Nishant Balepur,Matthew Shu,Yoo Yeon Sung,Seraphina Goldfarb-Tarrant,Shi Feng,Fumeng Yang,Rachel Rudinger,Jordan Lee Boyd-Graber

Main category: cs.CL

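TL;DR: Experiments show that whether an LLM-generated plan actually helps users does not match what users prefer, so alignment practices that train on preferences (RLHF) or evaluate on them (ChatbotArena) may steer models in the wrong direction. Feedback from real user interactions is key to improving plan helpfulness.

Details Motivation: The study tests whether existing alignment practices (RLHF training, ChatbotArena evaluation) really ensure that LLM-generated plans help users, or merely reflect what users prefer.

Contribution: 1) Reveals a mismatch between user preferences and actual plan helpfulness; 2) shows that this mismatch is not due to user-specific preferences; 3) shows that surface-level cues such as brevity influence preferences but do not predict helpfulness.

Method: Using the Planorama interface, 126 users answer 300 multi-step questions with LLM plans, yielding 4388 plan executions and 5584 comparisons; the study measures plan helpfulness and user preferences and recreates the setup with agents and reward models.

Result: 1) User and model preferences and agent success do not accurately predict which plans help users; 2) surface-level cues such as brevity strongly link to preferences but fail to predict helpfulness; 3) feedback from real users is needed to optimize plan helpfulness.

Insight: Aligning LLM plan generation should rely on feedback about actual helpfulness rather than only on preferences or surface features; future work should put more weight on real user interaction data.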

Abstract: To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions, not just preferences of what looks helpful, so we discuss the plan NLP researchers can execute to solve this problem.

[19] Consistency-Aware Parameter-Preserving Knowledge Editing Framework for Multi-Hop Question Answering

Lingwen Deng,Yifei Han,Long Zhang,Yue Du,Bin Li

Main category: cs.CL

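TL;DR: Proposes CAPE-KG, a consistency-aware knowledge editing framework for multi-hop question answering that addresses the consistency shortcomings of existing parameter-preserving knowledge editing methods.

Details Motivation: Existing knowledge-graph-based parameter-preserving knowledge editing methods suffer from knowledge contamination and unstable updates in multi-hop question answering, which undermines the reliability of edits.

Contribution: Proposes the CAPE-KG framework, which addresses consistency in knowledge editing by keeping knowledge graph construction, update, and retrieval aligned with the requirements of the multi-hop QA task.

Method: CAPE-KG uses consistency-aware knowledge graph construction, update, and retrieval mechanisms so that reasoning stays coherent over both unedited and edited knowledge.

Result: On the MQuAKE benchmark, CAPE-KG improves the accuracy of knowledge editing for multi-hop question answering.

Insight: Consistency is a key challenge for knowledge editing in multi-hop QA, and task-aligned knowledge graph management can improve edit reliability.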

Abstract: Parameter-Preserving Knowledge Editing (PPKE) enables updating models with new or corrected information without retraining or parameter adjustment. Recent PPKE approaches use knowledge graphs (KGs) to extend knowledge editing (KE) capabilities to multi-hop question answering (MHQA). However, these methods often lack consistency, leading to knowledge contamination, unstable updates, and retrieval behaviors that fail to reflect the intended edits. Such inconsistencies undermine the reliability of PPKE in multi-hop reasoning. We present CAPE-KG, Consistency-Aware Parameter-Preserving Editing with Knowledge Graphs, a novel consistency-aware framework for PPKE on MHQA. CAPE-KG ensures KG construction, update, and retrieval are always aligned with the requirements of the MHQA task, maintaining coherent reasoning over both unedited and edited knowledge. Extensive experiments on the MQuAKE benchmark show accuracy improvements in PPKE performance for MHQA, demonstrating the effectiveness of addressing consistency in PPKE.

[20] MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Yizhe Huang,Yang Liu,Ruiyu Zhao,Xiaolong Zhong,Xingming Yue,Ling Jiang

Main category: cs.CL

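TL;DR: Proposes MemOrb, a lightweight, plug-and-play verbal-reinforcement memory layer that improves the success rate and consistency of LLM-based customer service agents in multi-turn conversations.

Details Motivation: Existing LLM-based customer service agents forget across sessions, repeat errors, and lack mechanisms for continual self-improvement, which limits their reliability and stability in dynamic settings.

Contribution: MemOrb is a lightweight memory layer that requires no fine-tuning; it distills multi-turn interactions into compact strategy reflections and uses a shared memory bank to guide decisions.

Method: MemOrb extracts strategy reflections from multi-turn interactions, stores them in a shared memory bank, and retrieves them during later decision-making.

Result: Experiments show that MemOrb significantly improves task success rate and consistency, with a gain of up to 63 percentage points in multi-turn success rate.

Insight: Structured reflection is an effective mechanism for improving the long-term reliability of frozen LLM agents.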

Abstract: Large Language Model-based agents (LLM-based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self-improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To better evaluate these properties, we emphasize two indicators: task success rate as a measure of overall effectiveness, and consistency metrics such as Pass$^k$ to capture reliability across multiple trials. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer that distills multi-turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision-making, without requiring any fine-tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing long-term reliability of frozen LLM agents in customer service scenarios.
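
The Pass$^k$ consistency metric mentioned in this abstract can be illustrated as follows. This sketch assumes the common definition in which Pass$^k$ is the probability that all $k$ independently sampled trials of a task succeed, estimated without bias from $n$ trials with $c$ successes as $\binom{c}{k} / \binom{n}{k}$; the abstract does not spell out the definition the authors use, and the function name is hypothetical.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the chance that k independent trials of one task ALL succeed.

    Assumed estimator: C(c, k) / C(n, k) for n trials with c successes.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie between 0 and num_trials")
    # comb(c, k) is 0 when k > c, so the estimate drops to 0 automatically.
    return comb(num_successes, k) / comb(num_trials, k)
```

Unlike pass@k, which rewards succeeding at least once, this quantity shrinks quickly as k grows unless the agent succeeds on nearly every trial, which is why it serves as a reliability measure.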

[21] Global-Recent Semantic Reasoning on Dynamic Text-Attributed Graphs with Large Language Models

Yunan Wang,Jianxin Li,Ziwei Zhang

Main category: cs.CL

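TL;DR: Proposes DyGRASP, which combines LLMs with temporal GNNs to reason efficiently on dynamic text-attributed graphs (DyTAGs), addressing the recent-global temporal semantics that existing methods neglect, and outperforms baselines in experiments.

Details Motivation: Existing methods (such as GNNs and LLMs) mostly focus on static TAGs and struggle with the temporal dynamics and evolving text semantics of DyTAGs, so a new method is needed to capture both recent and global semantics.

Contribution: Proposes DyGRASP, which combines LLMs and temporal GNNs and uses implicit and explicit reasoning to capture recent and global semantic dynamics in DyTAGs.

Method: 1. Node-centric implicit reasoning with a sliding window mechanism captures recent semantics; 2. explicit reasoning with an RNN-like chain structure captures global semantics; 3. updating and merging layers integrate these semantics with the dynamic graph structure.

Result: DyGRASP performs best on DyTAG benchmarks, with up to a 34% improvement in Hit@10 on destination node retrieval, and generalizes well across different temporal GNNs and LLMs.

Insight: Combining the semantic capabilities of LLMs with the dynamic modeling of temporal GNNs is an effective direction for dynamic text-attributed graphs.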

Abstract: Dynamic Text-Attributed Graphs (DyTAGs), characterized by time-evolving graph interactions and associated text attributes, are prevalent in real-world applications. Existing methods, such as Graph Neural Networks (GNNs) and Large Language Models (LLMs), mostly focus on static TAGs. Extending these existing methods to DyTAGs is challenging as they largely neglect the recent-global temporal semantics: the recent semantic dependencies among interaction texts and the global semantic evolution of nodes over time. Furthermore, applying LLMs to the abundant and evolving text in DyTAGs faces efficiency issues. To tackle these challenges, we propose Dynamic Global-Recent Adaptive Semantic Processing (DyGRASP), a novel method that leverages LLMs and temporal GNNs to efficiently and effectively reason on DyTAGs. Specifically, we first design a node-centric implicit reasoning method together with a sliding window mechanism to efficiently capture recent temporal semantics. In addition, to capture global semantic dynamics of nodes, we leverage explicit reasoning with tailored prompts and an RNN-like chain structure to infer long-term semantics. Lastly, we intricately integrate the recent and global temporal semantics as well as the dynamic graph structural information using updating and merging layers. Extensive experiments on DyTAG benchmarks demonstrate DyGRASP’s superiority, achieving up to 34% improvement in Hit@10 on the destination node retrieval task. Besides, DyGRASP exhibits strong generalization across different temporal GNNs and LLMs.

[22] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Chen Liang,Zhaoqi Huang,Haofen Wang,Fu Chai,Chunying Yu,Huanhuan Wei,Zhengjie Liu,Yanpeng Li,Hongjun Wang,Ruifeng Luo,Xianzhong Zhao

Main category: cs.CL

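TL;DR: Proposes AECBench, a hierarchical benchmark for evaluating the knowledge of large language models (LLMs) in the Architecture, Engineering, and Construction (AEC) field; it reveals a performance decline across cognitive levels and lays groundwork for the reliable integration of LLMs in safety-critical domains.

Details Motivation: As LLMs see growing use in the AEC field, their robustness and reliability in this specialized, safety-critical domain have not been adequately evaluated, so a comprehensive benchmark is needed to quantify their performance.

Contribution: Proposes the AECBench benchmark, covering 23 tasks across 5 cognitive levels, with a 4,800-question dataset; also introduces an LLM-as-a-Judge approach for evaluating complex responses.

Method: The dataset is crafted mainly by engineers and validated through expert review; the evaluation framework has five levels (Knowledge Memorization, Understanding, Reasoning, Calculation, and Application), and LLM-as-a-Judge is used for scalable evaluation.

Result: Nine LLMs were evaluated; models do well on foundational tasks but show significant deficits in table interpretation, complex reasoning and calculation, and domain-specific document generation.

Insight: LLMs still have clear limitations on higher-level cognitive tasks in the AEC field; future work needs to improve performance on complex tasks to support safety-critical applications.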

Abstract: Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.

[23] MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction

Liting Zhang,Shiwan Zhao,Aobo Kong,Qicheng Li

Main category: cs.CL

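TL;DR: Proposes MAPEX, a multi-agent collaboration framework that improves LLM-based unsupervised keyphrase extraction by adapting dynamically to document length and coordinating multiple agents.

Details Motivation: Existing unsupervised prompt-based LLM methods usually use a single-stage inference pipeline with uniform prompting, which underuses LLMs' reasoning and generation capabilities and ignores document length and the choice of LLM backbone.

Contribution: Proposes MAPEX, the first framework to bring multi-agent collaboration to keyphrase extraction, with a dynamic dual-path strategy (knowledge-driven for short texts, topic-guided for long texts).

Method: MAPEX coordinates LLM agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing, and uses a dynamic dual-path strategy to adapt to document length.

Result: On six benchmark datasets, MAPEX's average F1@5 is 2.44% higher than the state-of-the-art unsupervised method and 4.01% higher than standard LLM baselines.

Insight: Multi-agent collaboration and a dynamic path strategy help address the complexity and diversity of keyphrase extraction and make fuller use of LLMs.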

Abstract: Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs’ reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at https://github.com/NKU-LITI/MAPEX.
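
For readers unfamiliar with the F1@5 metric reported here, the sketch below shows one common way to compute it for a single document. It is an illustration under stated assumptions: matching is by lowercased exact string (published evaluations often also apply stemming), precision is taken over the predictions actually returned in the top k, and the function name is hypothetical.

```python
def f1_at_k(predicted, gold, k=5):
    """F1 between the top-k predicted keyphrases and the gold keyphrases."""
    top_k = [phrase.lower() for phrase in predicted[:k]]
    gold_set = {phrase.lower() for phrase in gold}
    if not top_k or not gold_set:
        return 0.0
    correct = len(set(top_k) & gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score would then be the average of this value over all documents in a benchmark.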

[24] Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?

Damian Stachura,Joanna Konieczna,Artur Nowak

Main category: cs.CL

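TL;DR: Examines whether smaller open-weight large language models (LLMs) can match or even surpass proprietary models on biomedical question answering, with experiments from the BioASQ challenge.

Details Motivation: With open-weight LLMs advancing rapidly, whether they can replace proprietary models in biomedical question answering has become an important question.

Contribution: Shows that open-weight LLMs perform comparably to proprietary models on biomedical QA and in some cases surpass them with ensembling strategies.

Method: Retrieval of relevant snippets by embedding distance, in-context learning, structured output generation, and ensemble methods that combine the outputs of different models.

Result: Open-weight LLMs are comparable to proprietary models and in some instances perform better.

Insight: Open-weight LLMs can replace proprietary models in some scenarios, especially with ensembling, which shows the potential and competitiveness of open models.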

Abstract: Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.
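
One simple way to ensemble exact answers from several models is a majority vote. The sketch below is only an assumption about what such an ensemble could look like: the abstract does not describe the voting scheme, and the normalization (strip and lowercase) and tie-breaking rule (earliest answer wins) are hypothetical choices.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent normalized answer, or None if there is none.

    Ties are broken in favor of the answer that appears first in the input.
    """
    normalized = [answer.strip().lower() for answer in answers if answer.strip()]
    if not normalized:
        return None
    counts = Counter(normalized)
    best_count = max(counts.values())
    for answer in normalized:
        if counts[answer] == best_count:
            return answer
```

For yes/no questions, for example, `majority_vote(["Yes", "no", "yes "])` treats the first and third answers as the same vote.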

[25] Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass

Nicholas Popovič,Michael Färber

Main category: cs.CL

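TL;DR: Proposes JEDI, an encoder-only model that jointly performs extractive atomic fact decomposition and interpretable inference, avoiding the high resource cost of generative LLMs at inference time. JEDI is competitive across multiple datasets and markedly more robust in out-of-distribution and adversarial settings.

Details Motivation: Existing natural language inference (NLI) methods rely on generative LLMs for atomic fact decomposition, which is resource-intensive. The paper aims for an encoder-only architecture that achieves efficient, interpretable inference through extractive decomposition.

Contribution: 1) Proposes JEDI, which performs atomic fact decomposition and inference in a single forward pass; 2) trains on synthetic rationale data, so no generative model is needed at inference time; 3) is more robust in out-of-distribution and adversarial settings.

Method: JEDI is an encoder-only architecture that jointly performs atomic fact decomposition and inference; it is trained on a large corpus of synthetic rationales and extracts the decomposition directly.

Result: JEDI is competitive in distribution and significantly more robust out of distribution and in adversarial settings than models based solely on extractive rationale supervision.

Insight: Encoder-only architectures combined with synthetic rationales can deliver interpretability and robust generalization in NLI without the resource bottleneck of generative models.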

Abstract: Recent works in Natural Language Inference (NLI) and related tasks, such as automated fact-checking, employ atomic fact decomposition to enhance interpretability and robustness. For this, existing methods rely on resource-intensive generative large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we produce a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in distribution and significantly improves robustness out of distribution and in adversarial settings over models based solely on extractive rationale supervision. Our findings show that interpretability and robust generalization in NLI can be realized using encoder-only architectures and synthetic rationales. Code and data available at https://jedi.nicpopovic.com

[26] Investigating Test-Time Scaling with Reranking for Machine Translation

Shaomu Tan,Ryosuke Mitani,Ritvik Choudhary,Toshiyuki Sekiya

Main category: cs.CL

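TL;DR: A systematic study of test-time scaling (TTS) for machine translation, finding that it improves translation quality for high-resource languages but can degrade quality in low-resource cases because of metric blind spots.

Details Motivation: Scaling model parameters is the standard way to improve NLP systems but is computationally expensive. Test-time scaling is an alternative that spends more compute at inference by generating multiple candidates and selecting the best. It has not been systematically studied for machine translation.

Contribution: Presents the first systematic study of TTS for machine translation, investigating a simple, practical best-of-N framework on high-resource and low-resource language pairs.

Method: A best-of-N framework that generates multiple translation candidates and selects the best, across model sizes (3B-72B) and compute budgets (N up to 1024).

Result: TTS generally improves translation quality for high-resource languages but can suffer from metric blind spots for low-resource languages. Smaller models with large N can match larger models, at a higher compute cost.

Insight: Under fixed compute budgets, larger models are typically more efficient; the effect of TTS depends on how well-resourced the language is, so compute cost and quality gains must be weighed case by case.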

Abstract: Scaling model parameters has become the de facto strategy for improving NLP systems, but it comes with substantial computational costs. Test-Time Scaling (TTS) offers an alternative by allocating more computation at inference: generating multiple candidates and selecting the best. While effective in tasks such as mathematical reasoning, TTS has not been systematically explored for machine translation (MT). In this paper, we present the first systematic study of TTS for MT, investigating a simple but practical best-of-N framework on WMT24 benchmarks. Our experiments cover six high-resource and one low-resource language pairs, five model sizes (3B-72B), and various TTS compute budgets (N up to 1024). Our results show that a) for high-resource languages, TTS generally improves translation quality according to multiple neural MT evaluation metrics, and our human evaluation confirms these gains; b) augmenting smaller models with large $N$ can match or surpass larger models at $N{=}1$, at a higher compute cost; c) under fixed compute budgets, larger models are typically more efficient, and TTS can degrade quality due to metric blind spots in low-resource cases.
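
The best-of-N procedure studied here follows a simple pattern: sample N candidates, score each, keep the highest-scoring one. The sketch below shows that pattern only; `generate` and `score` are hypothetical stand-ins for an MT model's sampler and a quality metric, neither of which is specified by the abstract.

```python
import random


def best_of_n(source, generate, score, n, seed=0):
    """Sample n candidate translations and return the one the scorer prefers.

    generate(source, rng) -> str   hypothetical sampler for the MT model
    score(source, hypothesis) -> float   hypothetical quality metric
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(source, rng) for _ in range(n)]
    # max() keeps the first candidate among equally scored ones.
    return max(candidates, key=lambda hypothesis: score(source, hypothesis))
```

Because selection depends entirely on `score`, any blind spot in the metric is inherited by the output, which is in line with the degradation the abstract reports for low-resource cases.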

[27] Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus

Chiara Alzetta,Serena Auriemma,Alessandro Bondielli,Luca Dini,Chiara Fazzone,Alessio Miaschi,Martina Miliani,Marta Sartor

Main category: cs.CL

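TL;DR: Analyzes research trends in Italian computational linguistics over the past decade by compiling the proceedings of the CLiC-it conference (2014-2024) into a corpus and analyzing its metadata and content.

Details Motivation: With the rise of Transformer-based large language models, research priorities in computational linguistics have shifted. The study analyzes CLiC-it papers to trace the research trends and developments of the Italian community.

Contribution: Builds the CLiC-it Corpus, covering ten editions of conference papers, and systematically analyzes its metadata and content as a reference for the Italian and international research communities.

Method: Collects the proceedings of the first ten editions of CLiC-it and analyzes paper metadata (author provenance, gender, affiliations, and more) and content topics.

Result: Documents the research trends of the Italian computational linguistics community, including the shift from lexical and semantic resources toward language modelling and multimodality.

Insight: Italian computational linguistics research has gradually moved from traditional resources toward large language models and multimodality, mirroring the international trend.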

Abstract: Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata, including author provenance, gender, affiliations, and more, as well as the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.

[28] Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

Alireza Salemi,Cheng Li,Mingyang Zhang,Qiaozhu Mei,Zhuowan Li,Spurthi Amba Hombaiah,Weize Kong,Tao Chen,Hamed Zamani,Michael Bendersky

Main category: cs.CL

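TL;DR: Proposes Pathways of Thoughts (PoT), which generates personalized answers through multi-directional thinking and improves answer quality on personalized question answering.

Details Motivation: Personalized QA is hard because preferences must be inferred from long, noisy, implicit contexts, and responses must be simultaneously correct, contextually appropriate, and aligned with user expectations; PoT targets these challenges.

Contribution: Proposes PoT, an inference-stage method in which the model dynamically selects among cognitive operations (reasoning, revision, personalization, clarification), producing diverse candidate responses that are aggregated into a final personalized answer.

Method: PoT models LLM reasoning as an iterative decision process that explores multiple reasoning trajectories; candidate responses are aggregated and reweighted by inferred user preferences to produce the final result.

Result: On the LaMP-QA benchmark, PoT achieves up to a 13.1% relative improvement; in human evaluation, annotators prefer PoT outputs in 66% of cases, with ties in only 15%.

Insight: Multiple reasoning pathways capture different perspectives, and aggregating them according to user preferences markedly improves personalized QA.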

Abstract: Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.

[29] Soft Tokens, Hard Truths

Natasha Butt,Ariel Kwiatkowski,Ismail Labiad,Julia Kempe,Yann Ollivier

Main category: cs.CL

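TL;DR: Introduces the first scalable method for training continuous chains of thought (CoT) with reinforcement learning (RL), avoiding costly distillation from discrete CoTs, and shows its advantages on math reasoning tasks.

Details Motivation: Training continuous CoTs has so far been limited by high computational cost and other training difficulties, which restricts practical use; a more efficient training method is needed.

Contribution: 1. A scalable method for learning continuous CoTs via RL; 2. "soft" tokens (mixtures of tokens) plus noise on the input embedding to provide RL exploration; 3. on math reasoning, continuous-CoT training matches discrete CoTs at pass@1 and surpasses them at pass@32.

Method: Trains continuous CoTs directly with RL instead of distilling from discrete CoTs, using soft tokens and input-embedding noise for exploration.

Result: On Llama and Qwen models, continuous-CoT training matches discrete CoTs at pass@1 and surpasses them at pass@32, and better preserves the base model's predictions on out-of-domain tasks.

Insight: Continuous-CoT training is efficient, increases CoT diversity, and disturbs the base model less, making it a "softer" form of adaptation.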

Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use “soft” tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the “soft” models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
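
A "soft" token as described here, a probability-weighted mixture of token embeddings with noise added to the input embedding, can be sketched in plain Python. This is a toy illustration rather than the paper's code: the temperature parameter, the Gaussian form of the noise, and the function names are assumptions, and a real model would operate on tensors instead of lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def soft_token(logits, embedding_table, noise_std=0.0, temperature=1.0, rng=None):
    """Mix the rows of the embedding table by next-token probability.

    embedding_table[i] is the embedding of vocabulary item i. Optional Gaussian
    noise (an assumed form) is added to the mixture to provide exploration.
    """
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    mixed = [sum(p * row[d] for p, row in zip(probs, embedding_table))
             for d in range(dim)]
    if noise_std > 0:
        rng = rng or random.Random()
        mixed = [value + rng.gauss(0.0, noise_std) for value in mixed]
    return mixed
```

When the distribution is sharply peaked, the mixture collapses to a single token's embedding, so a discrete token is the limiting case of a soft one.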

[30] Online Process Reward Learning for Agentic Reinforcement Learning

Xiaoqian Liu,Ke Wang,Yuchuan Wu,Fei Huang,Yongbin Li,Junge Zhang,Jianbin Jiao

Main category: cs.CL

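TL;DR: Proposes Online Process Reward Learning (OPRL) for temporal credit assignment in agentic reinforcement learning; it optimizes an implicit process reward model that turns trajectory preferences into step rewards, improving sample efficiency and training stability.

Details Motivation: When LLMs are trained with RL as autonomous agents, sparse and sometimes unverifiable rewards make temporal credit assignment difficult. Existing process-supervision methods suffer from biased annotation, reward hacking, and related problems, so a more effective approach is needed.

Contribution: Proposes OPRL, a general credit-assignment strategy that integrates seamlessly with standard on-policy algorithms without additional rollouts or explicit step labels.

Method: Alternately optimizes an implicit process reward model (PRM) and the agent's policy, turning trajectory preferences into implicit step rewards; the resulting step-level advantages are combined with episode-level advantages for policy updates, forming a self-reinforcing loop.

Result: On WebShop, VisualSokoban, and SOTOPIA, OPRL outperforms frontier LLMs and strong RL baselines, with higher sample efficiency and lower variance during training.

Insight: Through implicit reward learning and a self-reinforcing loop, OPRL improves task performance while exploring efficiently with fewer actions, which makes it promising for agentic learning in real-world scenarios.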

Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent’s policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
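
The idea of turning trajectory preferences into implicit step rewards can be illustrated with the standard DPO-style parameterization. The sketch below rests on that assumption: it defines the step reward as a scaled log-ratio between the PRM and a reference model and applies a Bradley-Terry loss on summed trajectory rewards. The abstract does not give OPRL's exact formulas, and the names `beta`, `logp_prm`, and `logp_ref` are hypothetical.

```python
import math


def implicit_step_rewards(logp_prm, logp_ref, beta=1.0):
    """Step reward r_t = beta * (log pi_prm(a_t|s_t) - log pi_ref(a_t|s_t)).

    Inputs are per-step action log-probabilities under the PRM and a
    reference model (assumed form, borrowed from DPO-style implicit rewards).
    """
    return [beta * (p - q) for p, q in zip(logp_prm, logp_ref)]


def trajectory_dpo_loss(rewards_preferred, rewards_rejected):
    """-log sigmoid(margin), where margin is the gap in summed step rewards."""
    margin = sum(rewards_preferred) - sum(rewards_rejected)
    x = -margin
    # Stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Minimizing this loss over preference pairs pushes the summed step rewards of preferred trajectories above those of rejected ones, and the per-step terms can then be read off as dense rewards without any explicit step labels.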

[31] Steering Multimodal Large Language Models Decoding for Context-Aware Safety

Zheyuan Liu,Zhangchen Xu,Guangyao Dou,Xiangchi Yuan,Zhaoxuan Tan,Radha Poovendran,Meng Jiang

Main category: cs.CL

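TL;DR: Proposes SafeCoDe, which uses contrastive decoding and a global-aware token modulation strategy to dynamically adjust the generation of multimodal large language models (MLLMs) for better context-aware safety.

Details Motivation: Current MLLMs fail to balance context-aware safety: they tend either to over-refuse benign queries or to miss visually grounded risks.

Contribution: Proposes the SafeCoDe framework, which dynamically adjusts generation through contrastive decoding and global-aware modulation to improve the safety alignment of MLLMs.

Method: SafeCoDe has two stages: 1) contrastive decoding, which highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images; 2) global-aware token modulation, which combines scene-level reasoning with token-level adjustment.

Result: Across diverse MLLM architectures and safety benchmarks, SafeCoDe consistently improves context-sensitive refusal behavior while preserving helpfulness.

Insight: Contrastive decoding and global-aware modulation can balance safety and helpfulness, offering a lightweight, model-agnostic approach to MLLM safety alignment.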

Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
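
The first stage, contrasting predictions on a real image with predictions on a Gaussian-noised copy, can be sketched with the contrast rule commonly used in visual contrastive decoding. This is an assumption rather than SafeCoDe's published formula: the weighting `(1 + alpha) * real - alpha * noised`, the parameter `alpha`, and both function names are hypothetical.

```python
def contrastive_logits(logits_real, logits_noised, alpha=1.0):
    """Amplify what the real image contributes beyond the noised image.

    Both inputs are next-token logits over the same vocabulary (assumed form).
    """
    return [(1 + alpha) * real - alpha * noised
            for real, noised in zip(logits_real, logits_noised)]


def visually_sensitive_tokens(logits_real, logits_noised, top_k=1):
    """Indices of the tokens whose logits depend most on the real image."""
    gaps = [real - noised for real, noised in zip(logits_real, logits_noised)]
    order = sorted(range(len(gaps)), key=lambda index: gaps[index], reverse=True)
    return order[:top_k]
```

Tokens whose scores barely change when the image is replaced by noise are driven by the text alone, so the gap between the two passes is a rough signal of visual grounding.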

[32] Reinforcement Learning on Pre-Training Data

Siheng Li,Kejiao Li,Zenan Xu,Guanhua Huang,Evander Yang,Kun Li,Haoyuan Wu,Jiajia Wu,Zihao Zheng,Chenchen Zhang,Kun Shi,Kyrierl Deng,Qi Yi,Ruibin Xiong,Tingqiang Xu,Yuhao Jiang,Jianfeng Yan,Yuyuan Zeng,Guanghui Xu,Jinbao Xue,Zhijiang Xu,Zheng Fang,Shuai Li,Qibin Liu,Xiaoxue Li,Zhuoyu Li,Yangyu Tao,Fei Gao,Cheng Jiang,Bo Chao Wang,Kai Liu,Jianchen Zhu,Wai Lam,Wayyt Wang,Bo Zhou,Di Wang

Main category: cs.CL

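TL;DR: RLPT is a new training paradigm that optimizes large language models with reinforcement learning directly on pre-training data, without relying on human-annotated reward signals, and improves reasoning ability.

Details Motivation: High-quality text data grows slowly while compute scales exponentially, which constrains conventional scaling approaches for LLMs and calls for a new training paradigm.

Contribution: Proposes RLPT, which learns from pre-training data through RL without human-annotated rewards and improves generalizable reasoning.

Method: Adopts a next-segment reasoning objective: the policy is rewarded for accurately predicting subsequent text segments given the preceding context, which allows RL to scale on pre-training data.

Result: Strong results on multiple benchmarks; for example, applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0 to 8.1 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25.

Insight: RLPT extends the reasoning boundaries of LLMs, provides a foundation that enhances RLVR performance, and shows favorable scaling behavior with more compute.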

Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.

[33] DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture

Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Nemil Shah,Abhilekh Borah,Vanshika Shah,Nishant Mishra,Sriparna Saha

Main category: cs.CL

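TL;DR: DRISHTIKON is a multimodal, multilingual benchmark focused on Indian culture. It evaluates the cultural understanding of generative AI systems and addresses the limited cultural coverage of existing benchmarks.

Details Motivation: Existing language-model benchmarks have a generic or global scope and lack in-depth coverage of specific cultures, particularly for a country as culturally diverse as India. DRISHTIKON is proposed to fill this gap and to advance more culturally inclusive AI research.

Contribution: DRISHTIKON is the first multimodal, multilingual benchmark centered on Indian culture. It spans 15 languages, covers all Indian states and union territories, and contains over 64,000 aligned text-image pairs.

Method: The dataset covers rich cultural themes (festivals, attire, cuisines, and more). A range of vision-language models (VLMs) are evaluated under zero-shot and chain-of-thought settings.

Result: Current models show significant limitations on culturally grounded multimodal inputs, particularly for low-resource languages and less-documented traditions.

Insight: DRISHTIKON highlights the importance of cultural diversity in AI research and provides a foundation for developing more culturally aware language technologies.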

Abstract: We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India’s diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage amongst many more. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models’ ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.

cs.CV [Back]

[34] PerceptronCARE: A Deep Learning-Based Intelligent Teleopthalmology Application for Diabetic Retinopathy Diagnosis

Akwasi Asare,Isaac Baffour Senkyire,Emmanuel Freeman,Simon Hilary Ayinedenaba Aluze-Ele,Kelvin Kwao

Main category: cs.CV

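TL;DR: PerceptronCARE is a deep learning-based teleophthalmology application for automated diabetic retinopathy diagnosis. It was developed by evaluating several convolutional neural networks, reaches 85.4% accuracy with real-time screening, and offers cloud-based scalability and secure data management.

Details Motivation: Diabetic retinopathy is a leading cause of vision loss worldwide, particularly in underserved regions, where conventional diagnosis suffers from limited resources and delays.

Contribution: Presents PerceptronCARE, a teleophthalmology system built from an evaluation of multiple CNN models. It classifies disease severity in real time (85.4% accuracy) and extends its reach through a cloud-based architecture.

Method: CNN models including ResNet-18, EfficientNet-B0, and SqueezeNet are used for development and evaluation to balance accuracy and computational efficiency.

Result: The final model reaches 85.4% accuracy, is suitable for clinical and telemedicine settings, and can help reduce healthcare costs.

Insight: AI-driven telemedicine solutions could expand diabetic retinopathy screening in resource-constrained regions and improve early diagnosis and doctor-patient interaction.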

Abstract: Diabetic retinopathy is a leading cause of vision loss among adults and a major global health challenge, particularly in underserved regions. This study presents PerceptronCARE, a deep learning-based teleophthalmology application designed for automated diabetic retinopathy detection using retinal images. The system was developed and evaluated using multiple convolutional neural networks, including ResNet-18, EfficientNet-B0, and SqueezeNet, to determine the optimal balance between accuracy and computational efficiency. The final model classifies disease severity with an accuracy of 85.4%, enabling real-time screening in clinical and telemedicine settings. PerceptronCARE integrates cloud-based scalability, secure patient data management, and a multi-user framework, facilitating early diagnosis, improving doctor-patient interactions, and reducing healthcare costs. This study highlights the potential of AI-driven telemedicine solutions in expanding access to diabetic retinopathy screening, particularly in remote and resource-constrained environments.

[35] Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

Khalil Hennara,Muhammad Hreden,Mohamed Motasim Hamed,Ahmad Bastati,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan

Main category: cs.CV

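TL;DR: Baseer is a vision-language model designed for Arabic document OCR. By fine-tuning a pre-trained multimodal large language model (MLLM) while preserving its visual features, it substantially improves Arabic document OCR and outperforms existing open-source and commercial solutions.

Details Motivation: Arabic document OCR is challenging because of cursive script, diverse fonts, diacritics, and right-to-left orientation. Existing MLLMs perform well on high-resource languages but remain limited on Arabic.

Contribution: 1. Baseer, a vision-language model designed for Arabic document OCR; 2. Misraj-DocOCR, a high-quality, expert-verified benchmark; 3. Evidence for the benefits of domain-specific adaptation of general-purpose MLLMs.

Method: Using a large-scale dataset that combines synthetic and real-world documents, a pre-trained MLLM is fine-tuned with a decoder-only fine-tuning strategy while its general visual features are preserved.

Result: Baseer achieves a WER of 0.25 on Arabic document OCR, significantly outperforming existing solutions and establishing a new state of the art.

Insight: Domain-specific adaptation of MLLMs matters for high-accuracy OCR on morphologically rich languages such as Arabic.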

Abstract: Arabic document OCR remains a challenging task due to the language’s cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
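For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of word error rate as it is commonly defined: word-level edit distance divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the abstract does not state how the authors tokenize or normalize Arabic text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER of 0.25 corresponds to roughly one word-level error for every four reference words.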

[36] A Deep Learning Approach for Spatio-Temporal Forecasting of InSAR Ground Deformation in Eastern Ireland

Wendong Yao,Saeed Azadnejad,Binhua Huang,Shane Donohue,Soumyabrata Dev

Main category: cs.CV

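TL;DR: The paper proposes a deep learning framework for forecasting ground deformation from sparse InSAR time-series data. It converts sparse point measurements into a dense spatio-temporal tensor and applies a CNN-LSTM model, significantly improving forecast accuracy.

Details Motivation: Monitoring ground deformation is crucial for urban infrastructure stability and geological hazard mitigation, but forecasting future deformation from sparse InSAR data is challenging. The paper addresses this with deep learning.

Contribution: 1) A method for converting sparse InSAR data into a dense spatio-temporal tensor; 2) a hybrid CNN-LSTM model that learns spatial and temporal features jointly; 3) experiments showing that the model outperforms traditional machine learning methods.

Method: A CNN extracts spatial features and an LSTM captures temporal dependencies. The hybrid CNN-LSTM model processes the generated dense spatio-temporal tensor to forecast ground deformation.

Result: The proposed CNN-LSTM model significantly outperforms baselines such as LightGBM and LASSO in accuracy and spatial coherence.

Insight: Baseline methods tend to fall back on simple persistence patterns, whereas an integrated spatio-temporal approach better captures the complex dynamics of ground deformation. This supports the potential of spatio-temporal deep learning for high-resolution deformation forecasting.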

Abstract: Monitoring ground displacement is crucial for urban infrastructure stability and mitigating geological hazards. However, forecasting future deformation from sparse Interferometric Synthetic Aperture Radar (InSAR) time-series data remains a significant challenge. This paper introduces a novel deep learning framework that transforms these sparse point measurements into a dense spatio-temporal tensor. This methodological shift allows, for the first time, the direct application of advanced computer vision architectures to this forecasting problem. We design and implement a hybrid Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model, specifically engineered to simultaneously learn spatial patterns and temporal dependencies from the generated data tensor. The model’s performance is benchmarked against powerful machine learning baselines, Light Gradient Boosting Machine and LASSO regression, using Sentinel-1 data from eastern Ireland. Results demonstrate that the proposed architecture provides significantly more accurate and spatially coherent forecasts, establishing a new performance benchmark for this task. Furthermore, an interpretability analysis reveals that baseline models often default to simplistic persistence patterns, highlighting the necessity of our integrated spatio-temporal approach to capture the complex dynamics of ground deformation. Our findings confirm the efficacy and potential of spatio-temporal deep learning for high-resolution deformation forecasting.
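The step of turning sparse point measurements into a dense spatio-temporal tensor can be pictured with a simple binning scheme. This is only a sketch under stated assumptions: the abstract does not describe the authors' densification procedure, so the regular grid, the per-cell averaging, the zero fill for empty cells, and the name `grid_sparse_points` are all hypothetical.

```python
def grid_sparse_points(points, n_steps, height, width, x_range, y_range, fill=0.0):
    """Average sparse (x, y, step, value) samples into a dense [T][H][W] grid.

    Assumes every point lies inside x_range and y_range.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    sums = [[[0.0] * width for _ in range(height)] for _ in range(n_steps)]
    counts = [[[0] * width for _ in range(height)] for _ in range(n_steps)]
    for x, y, step, value in points:
        # Map coordinates to cell indices; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        sums[step][row][col] += value
        counts[step][row][col] += 1
    # Cells with no measurement receive the fill value (an illustrative choice).
    return [
        [
            [sums[t][r][c] / counts[t][r][c] if counts[t][r][c] else fill
             for c in range(width)]
            for r in range(height)
        ]
        for t in range(n_steps)
    ]


points = [(0.1, 0.1, 0, 2.0), (0.2, 0.2, 0, 4.0), (0.9, 0.9, 1, -1.0)]
tensor = grid_sparse_points(points, n_steps=2, height=2, width=2,
                            x_range=(0.0, 1.0), y_range=(0.0, 1.0))
print(tensor)
```

A tensor of this shape can be fed to convolutional layers frame by frame and to a recurrent layer along the time axis. A real pipeline would likely interpolate empty cells rather than fill them with a constant.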

[37] A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts

George Corrêa de Araújo,Helena de Almeida Maia,Helio Pedrini

Main category: cs.CV

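TL;DR: The paper presents the Scrapbook framework, which generates diverse artificial datasets to validate how well AI models understand basic concepts such as absolute and relative position.

Details Motivation: Current AI models recognize objects well but still struggle with positional information and with questions that carry additional constraints. A systematic way to generate datasets is needed to assess model weaknesses and room for improvement.

Contribution: Presents the Scrapbook framework, which generates datasets covering basic concepts such as object recognition, positional information, and attribute identification, providing a tool for systematically evaluating and improving AI models.

Method: The Scrapbook framework generates large, diverse question datasets that cover basic concepts with wide linguistic variation, and experiments measure model performance on them.

Result: MobileVLM-V2 showed significant answer disagreements and plausible wrong answers, while other models exhibited a bias toward affirmative answers and struggled with questions about geometric shapes and positional information.

Insight: The study exposes weaknesses in how current models handle positional information and constrained questions, giving clear directions for improvement.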

Abstract: In this paper, we present the Scrapbook framework, a novel methodology designed to generate extensive datasets for probing the learned concepts of artificial intelligence (AI) models. The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification. By generating datasets with a large number of questions about individual concepts and a wide linguistic variation, the Scrapbook framework aims to validate the model’s understanding of these basic elements before tackling more complex tasks. Our experimental findings reveal that, while contemporary models demonstrate proficiency in recognizing and enumerating objects, they encounter challenges in comprehending positional information and addressing inquiries with additional constraints. Specifically, the MobileVLM-V2 model showed significant answer disagreements and plausible wrong answers, while other models exhibited a bias toward affirmative answers and struggled with questions involving geometric shapes and positional information, indicating areas for improvement in understanding and consistency. The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets, which can be utilized to systematically assess and enhance the performance of AI models.

[38] The Describe-Then-Generate Bottleneck: How VLM Descriptions Alter Image Generation Outcomes

Sai Varun Kodathala,Rakesh Vunnam

Main category: cs.CV

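TL;DR: The paper quantifies the information loss caused by textual intermediation in vision-language-vision pipelines and shows that the describe-then-generate bottleneck is a significant limitation.

Details Motivation: As multimodal AI systems are increasingly used in creative workflows, it becomes important to understand how much visual information is lost when it passes through language.

Contribution: An empirical analysis of information loss in the describe-then-generate pipeline: 99.3% of samples show substantial perceptual degradation and 91.5% show significant structural information loss.

Method: 150 image pairs were generated, and LPIPS, SSIM, and color distance were used to measure information preservation along perceptual, structural, and chromatic dimensions.

Result: The describe-then-generate bottleneck is a measurable and consistent limitation of contemporary multimodal systems.

Insight: Textual intermediation causes significant loss of visual information, which has implications for optimizing creative tools and multimodal systems that depend on such pipelines.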

Abstract: With the increasing integration of multimodal AI systems in creative workflows, understanding information loss in vision-language-vision pipelines has become important for evaluating system limitations. However, the degradation that occurs when visual content passes through textual intermediation remains poorly quantified. In this work, we provide empirical analysis of the describe-then-generate bottleneck, where natural language serves as an intermediate representation for visual information. We generated 150 image pairs through the describe-then-generate pipeline and applied existing metrics (LPIPS, SSIM, and color distance) to measure information preservation across perceptual, structural, and chromatic dimensions. Our evaluation reveals that 99.3% of samples exhibit substantial perceptual degradation and 91.5% demonstrate significant structural information loss, providing empirical evidence that the describe-then-generate bottleneck represents a measurable and consistent limitation in contemporary multimodal systems.

[39] AI-Derived Structural Building Intelligence for Urban Resilience: An Application in Saint Vincent and the Grenadines

Isabelle Tingzon,Yoji Toriumi,Caroline Gevaert

Main category: cs.CV

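TL;DR: The paper presents an AI-driven workflow that automatically infers rooftop attributes from high-resolution satellite imagery. It fills the gap in structural building information for small island developing states (SIDS) to support urban resilience planning.

Details Motivation: Many SIDS in climate-vulnerable regions lack detailed structural building information, which limits disaster risk assessment and urban resilience planning.

Contribution: 1) An AI workflow that combines geospatial foundation models with shallow classifiers; 2) an assessment of how training data from neighboring SIDS affects model performance; 3) high F1 scores on roof pitch and roof material classification.

Method: Compares geospatial foundation models combined with shallow classifiers against fine-tuned deep learning models, and tests the effect of cross-region training data.

Result: The best models achieve F1 scores of 0.88 and 0.83 on roof pitch and roof material classification, respectively.

Insight: Combining AI with Earth Observation data can give SIDS more efficient disaster risk assessment tools and support evidence-based urban governance.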

Abstract: Detailed structural building information is used to estimate potential damage from hazard events like cyclones, floods, and landslides, making it critical for urban resilience planning and disaster risk reduction. However, such information is often unavailable in many small island developing states (SIDS) in climate-vulnerable regions like the Caribbean. To address this data gap, we present an AI-driven workflow to automatically infer rooftop attributes from high-resolution satellite imagery, with Saint Vincent and the Grenadines as our case study. Here, we compare the utility of geospatial foundation models combined with shallow classifiers against fine-tuned deep learning models for rooftop classification. Furthermore, we assess the impact of incorporating additional training data from neighboring SIDS to improve model performance. Our best models achieve F1 scores of 0.88 and 0.83 for roof pitch and roof material classification, respectively. Combined with local capacity building, our work aims to provide SIDS with novel capabilities to harness AI and Earth Observation (EO) data to enable more efficient, evidence-based urban governance.

[40] VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation

Jinyue Bian,Zhaoxing Zhang,Zhengyu Liang,Shiwei Zheng,Shengtao Zhang,Rong Shen,Chen Yang,Anzhou Hou

Main category: cs.CV

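TL;DR: The paper proposes VLA-LPAF, a lightweight module that improves the adaptability of vision-language-action (VLA) models across viewpoints. It addresses perspective heterogeneity and significantly raises task success rates.

Details Motivation: The visual observations used by existing VLA models vary in perspective, which limits model generality. The paper proposes the lightweight module VLA-LPAF to improve adaptability to multi-view data.

Contribution: Proposes VLA-LPAF, which improves the multi-view adaptability of VLA models using only 2D data, and validates the performance gains on multiple benchmarks.

Method: VLA-LPAF is fine-tuned on single-view images and fuses multi-view observations in the latent space to bridge perspective inconsistency.

Result: RoboFlamingo-LPAF improves task success rate by about 8% on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark.

Insight: A lightweight perspective-adaptive module can significantly improve VLA performance in complex environments, with particular promise for multi-view fusion.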

Abstract: Vision-Language-Action (VLA) models can follow text instructions according to visual observations of the surrounding environment. This ability to map multimodal inputs to actions is derived from training the VLA model on extensive standard demonstrations. The visual observations, captured by third-person global cameras and wrist-mounted local cameras, inevitably vary in number and perspective across environments, resulting in significant differences in the visual features. This perspective heterogeneity constrains the generality of VLA models. In light of this, we propose the lightweight module VLA-LPAF to foster the perspective adaptivity of VLA models using only 2D data. VLA-LPAF is fine-tuned using images from a single view and fuses other multi-view observations in the latent space, which effectively and efficiently bridges the gap caused by perspective inconsistency. We instantiate our VLA-LPAF framework with the VLA model RoboFlamingo to construct RoboFlamingo-LPAF. Experiments show that RoboFlamingo-LPAF achieves an average task success rate improvement of around 8% on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark. We also demonstrate the view-adaptive characteristics of the proposed RoboFlamingo-LPAF through real-world tasks.

[41] URNet: Uncertainty-aware Refinement Network for Event-based Stereo Depth Estimation

Yifeng Cheng,Alois Knoll,Hu Cao

Main category: cs.CV

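TL;DR: URNet is an uncertainty-aware refinement network for event-based stereo depth estimation. It combines a local-global refinement module with KL divergence-based uncertainty modeling to significantly improve performance.

Details Motivation: Event cameras offer high temporal resolution, high dynamic range, and low latency, but current event-based depth estimation methods fall short in capturing local detail and global context and lack reliability modeling.

Contribution: 1. A local-global refinement module that fuses fine detail with global information; 2. A KL divergence-based uncertainty modeling method that improves prediction reliability; 3. Superior performance validated on the DSEC dataset.

Method: 1. A local-global refinement module handles fine-grained local details and long-range context; 2. Uncertainty is modeled with KL divergence to improve prediction reliability.

Result: On the DSEC dataset, URNet outperforms existing SOTA methods in both quantitative and qualitative evaluations.

Insight: Module design that fuses local and global information is key to improving event-camera depth estimation, and uncertainty modeling further strengthens reliability.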

Abstract: Event cameras provide high temporal resolution, high dynamic range, and low latency, offering significant advantages over conventional frame-based cameras. In this work, we introduce an uncertainty-aware refinement network called URNet for event-based stereo depth estimation. Our approach features a local-global refinement module that effectively captures fine-grained local details and long-range global context. Additionally, we introduce a Kullback-Leibler (KL) divergence-based uncertainty modeling method to enhance prediction reliability. Extensive experiments on the DSEC dataset demonstrate that URNet consistently outperforms state-of-the-art (SOTA) methods in both qualitative and quantitative evaluations.
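As background for the KL divergence-based uncertainty modeling mentioned above, here is a minimal sketch of the KL divergence between two discrete distributions. It is not the authors' loss: the abstract does not specify which distributions are compared, so treating them as a target and a predicted distribution over disparity bins is an assumption, as is the function name.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0) in q.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical target and predicted distributions over two disparity bins.
target = [0.5, 0.5]
confident_but_wrong = [0.9, 0.1]
print(kl_divergence(target, target))
print(kl_divergence(target, confident_but_wrong))
```

The divergence is zero when the prediction matches the target and grows as the predicted distribution concentrates mass in the wrong bins, which is why it can serve as a training signal for calibrated uncertainty.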

[42] Visionerves: Automatic and Reproducible Hybrid AI for Peripheral Nervous System Recognition Applied to Endometriosis Cases

Giammarco La Barbera,Enzo Bonnot,Thomas Isla,Juan Pablo de la Plata,Joy-Rose Dunoyer de Segonzac,Jennifer Attali,Cécile Lozach,Alexandre Bellucci,Louis Marcellin,Laure Fournier,Sabine Sarnacki,Pietro Gori,Isabelle Bloch

Main category: cs.CV

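TL;DR: Visionerves is a novel hybrid AI framework that recognizes the peripheral nervous system from multi-gradient DWI and morphological MRI data. It is applied to nerve analysis in endometriosis and substantially outperforms standard tractography.

Details Motivation: Endometriosis often causes chronic pelvic pain and may involve nerves, yet imaging the peripheral nerves remains challenging. Conventional methods depend on manual ROI selection and are hard to reproduce, so an automatic and reproducible solution is needed.

Contribution: Proposes Visionerves, a hybrid AI framework that combines deep learning with symbolic spatial reasoning. It needs no manual ROI selection and significantly improves the accuracy and reproducibility of peripheral nerve recognition.

Method: Two phases: (A) automatic segmentation of anatomical structures with a deep learning model; (B) tractography and nerve recognition by symbolic spatial reasoning, with anatomical knowledge encoded as fuzzy spatial relationships.

Result: On the lumbosacral plexus of 10 women with confirmed or suspected endometriosis, Dice scores improved by up to 25% and spatial errors fell below 5 mm compared with standard tractography.

Insight: A hybrid of deep learning and symbolic reasoning can address nerve recognition in medical imaging and opens a path toward non-invasive diagnosis of neuropathy.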

Abstract: Endometriosis often leads to chronic pelvic pain and possible nerve involvement, yet imaging the peripheral nerves remains a challenge. We introduce Visionerves, a novel hybrid AI framework for peripheral nervous system recognition from multi-gradient DWI and morphological MRI data. Unlike conventional tractography, Visionerves encodes anatomical knowledge through fuzzy spatial relationships, removing the need for selection of manual ROIs. The pipeline comprises two phases: (A) automatic segmentation of anatomical structures using a deep learning model, and (B) tractography and nerve recognition by symbolic spatial reasoning. Applied to the lumbosacral plexus in 10 women with (confirmed or suspected) endometriosis, Visionerves demonstrated substantial improvements over standard tractography, with Dice score improvements of up to 25% and spatial errors reduced to less than 5 mm. This automatic and reproducible approach enables detailed nerve analysis and paves the way for non-invasive diagnosis of endometriosis-related neuropathy, as well as other conditions with nerve involvement.
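The Dice score used to report the improvement above measures overlap between a predicted and a reference mask. A minimal sketch follows, assuming flattened binary masks; the function name and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One shared voxel, two voxels marked in each mask.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```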

[43] V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling

Muhammad Naveed,Nazia Perwaiz,Sidra Sultana,Mohaira Ahmad,Muhammad Moazam Fraz

Main category: cs.CV

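TL;DR: V-SenseDrive is a privacy-preserving multimodal driver behaviour dataset, the first collected in the Pakistani driving environment. It combines smartphone inertial and GPS sensor data with synchronized road-facing video for driver behaviour modelling and road safety.

Details Motivation: Most existing datasets come from developed countries and underrepresent the behavioural diversity of emerging economies, and recording the driver's face violates privacy. Complex traffic environments such as Pakistan's need a better-suited solution.

Contribution: 1) The first privacy-preserving multimodal dataset collected in the Pakistani driving environment; 2) a combination of smartphone sensors and road video that supports multimodal analysis; 3) a layered data structure designed to support future research.

Method: A custom Android application captures high-frequency accelerometer, gyroscope, and GPS data alongside synchronized video with time alignment. The data is organized into raw, processed, and semantic layers.

Result: V-SenseDrive fills a gap among global driver behaviour datasets and supports context-aware intelligent transportation solutions.

Insight: The behavioural diversity of emerging economies calls for localized datasets, and privacy preservation is a key consideration in multimodal data collection.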

Abstract: Road traffic accidents remain a major public health challenge, particularly in countries with heterogeneous road conditions, mixed traffic flow, and variable driving discipline, such as Pakistan. Reliable detection of unsafe driving behaviours is a prerequisite for improving road safety, enabling advanced driver assistance systems (ADAS), and supporting data-driven decisions in insurance and fleet management. Most existing datasets originate from developed countries, with limited representation of the behavioural diversity observed in emerging economies, and recording the driver’s face violates privacy. We present V-SenseDrive, the first privacy-preserving multimodal driver behaviour dataset collected entirely within the Pakistani driving environment. V-SenseDrive combines smartphone-based inertial and GPS sensor data with synchronized road-facing video to record three target driving behaviours (normal, aggressive, and risky) on multiple types of roads, including urban arterials, secondary roads, and motorways. Data was gathered using a custom Android application designed to capture high-frequency accelerometer, gyroscope, and GPS streams alongside continuous video, with all sources precisely time-aligned to enable multimodal analysis. The focus of this work is on the data acquisition process, covering participant selection, driving scenarios, environmental considerations, and sensor-video synchronization techniques. The dataset is structured into raw, processed, and semantic layers, ensuring adaptability for future research in driver behaviour classification, traffic safety analysis, and ADAS development. By representing real-world driving in Pakistan, V-SenseDrive fills a critical gap in the global landscape of driver behaviour datasets and lays the groundwork for context-aware intelligent transportation solutions.
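Time-aligning sensor streams with video frames, as described above, can be illustrated by nearest-timestamp matching. This is a sketch under assumptions, not the authors' synchronization technique: the function names, the millisecond timestamps, and the `max_gap` tolerance are all hypothetical.

```python
from bisect import bisect_left


def nearest_sample_index(sensor_ts, t):
    """Index of the sensor timestamp closest to t (sensor_ts must be sorted)."""
    if not sensor_ts:
        raise ValueError("sensor_ts must not be empty")
    i = bisect_left(sensor_ts, t)
    if i == 0:
        return 0
    if i == len(sensor_ts):
        return len(sensor_ts) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if (sensor_ts[i] - t) < (t - sensor_ts[i - 1]) else i - 1


def align_frames(frame_ts, sensor_ts, max_gap):
    """Pair each frame with its nearest sensor sample, dropping pairs beyond max_gap."""
    pairs = []
    for frame_index, t in enumerate(frame_ts):
        sample_index = nearest_sample_index(sensor_ts, t)
        if abs(sensor_ts[sample_index] - t) <= max_gap:
            pairs.append((frame_index, sample_index))
    return pairs


# Timestamps in milliseconds: a 100 Hz sensor and three video frames.
sensor_ts = [0, 10, 20, 30, 40]
frame_ts = [0, 33, 500]
print(align_frames(frame_ts, sensor_ts, max_gap=5))
```

The last frame has no sensor sample within the tolerance, so it is dropped rather than paired with stale data.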

[44] Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Daxiang Dong,Mingming Zheng,Dong Xu,Bairong Zhuang,Wenyu Zhang,Chunhua Luo,Haoran Wang,Zijian Zhao,Jie Li,Yuxuan Li,Hanjun Zhong,Mengyue Liu,Jieting Chen,Shupeng Li,Lun Tian,Yaping Feng,Xin Li,Donggang Jiang,Yong Chen,Yehua Xu,Duohao Qin,Chen Feng,Dan Wang,Henghua Zhang,Jingjing Ha,Jinhui He,Yanfeng Zhai,Chengxin Zheng,Jiayi Mao,Jiacheng Chen,Ruchang Yao,Ziye Yuan,Jianmin Wu,Guangjun Xie,Dou Shen

Main category: cs.CV

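TL;DR: Qianfan-VL is a series of multimodal large language models ranging from 3B to 70B parameters. Through domain enhancement techniques and multi-stage progressive training, it achieves strong general performance and state-of-the-art results on domain-specific tasks.

Details Motivation: Build a multimodal model that performs well on general tasks while being optimized for specific domains (such as OCR and document understanding), to meet diverse enterprise deployment needs.

Contribution: Proposes a domain enhancement strategy and multi-stage progressive training that significantly improve OCR and document understanding, with strong results on mathematical reasoning and logical inference.

Method: Uses multi-stage progressive training and high-precision data synthesis pipelines, combined with long chain-of-thought capabilities, to improve domain adaptation and generalization.

Result: Achieves state-of-the-art or comparable results on benchmarks including CCBench, SEEDBench IMG, ScienceQA, MMStar, OCRBench, DocVQA, and MathVista.

Insight: Balancing domain enhancement against general performance is key. Large-scale AI infrastructure (such as Baidu's Kunlun P800 chips) can train multimodal models efficiently, which suits enterprise deployment.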

Abstract: We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong general performance. Qianfan-VL achieves comparable results to leading open-source models on general benchmarks, with state-of-the-art performance on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar. The domain enhancement strategy delivers significant advantages in OCR and document understanding, validated on both public benchmarks (OCRBench 873, DocVQA 94.75%) and in-house evaluations. Notably, Qianfan-VL-8B and 70B variants incorporate long chain-of-thought capabilities, demonstrating superior performance on mathematical reasoning (MathVista 78.6%) and logical inference tasks. All models are trained entirely on Baidu’s Kunlun P800 chips, validating the capability of large-scale AI infrastructure to train SOTA-level multimodal models with over 90% scaling efficiency on 5000 chips for a single task. This work establishes an effective methodology for developing domain-enhanced multimodal models suitable for diverse enterprise deployment scenarios.

[45] HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

Junseong Shin,Seungwoo Chung,Yunjeong Yang,Tae Hyun Kim

Main category: cs.CV

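TL;DR: The paper proposes HazeFlow, an ODE-based framework that reformulates the Atmospheric Scattering Model (ASM) as an ODE and performs real-world dehazing in a single inference step. It also uses Markov Chain Brownian Motion (MCBM) to generate non-homogeneous haze data, addressing the scarcity of paired real-world data.

Details Motivation: Deep learning dehazing methods are limited by the lack of paired real-world data, and traditional ASM-based methods struggle with real-world complexity and diverse haze patterns. This motivates physics-grounded methods that are better suited to real scenes.

Contribution: 1. HazeFlow, a framework that reformulates ASM as an ODE to improve dehazing performance; 2. An MCBM-based method for generating non-homogeneous haze data that eases data scarcity; 3. State-of-the-art performance on real-world dehazing benchmarks.

Method: 1. Recasts ASM as an ODE and, inspired by Rectified Flow, learns an ODE trajectory from hazy to clean images; 2. Uses MCBM to simulate realistic haze patterns and generate non-homogeneous haze data for training.

Result: HazeFlow outperforms existing methods across multiple real-world dehazing benchmark datasets.

Insight: Recasting a physical model as an ODE and pairing it with data generation can improve real-world generalization in dehazing and may inform other physics-grounded tasks.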

Abstract: Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.
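The Atmospheric Scattering Model that HazeFlow starts from is commonly written per pixel as I = J·t + A·(1 − t), where I is the hazy intensity, J the clean intensity, t the transmission, and A the atmospheric light. The sketch below applies and inverts that standard formula on a few pixel values. It does not reproduce the paper's ODE reformulation or its MCBM haze generation; the function names and the `t_min` clamp are assumptions for illustration.

```python
def add_haze(clean, transmission, airlight):
    """Apply the ASM per pixel: I = J * t + A * (1 - t)."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clean, transmission)]


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the ASM per pixel: J = (I - A) / t + A, with t clamped away from zero."""
    restored = []
    for i, t in zip(hazy, transmission):
        t = max(t, t_min)  # avoid dividing by a vanishing transmission
        restored.append((i - airlight) / t + airlight)
    return restored


clean = [0.2, 0.5, 0.9]
# Spatially varying transmission stands in for non-homogeneous haze.
transmission = [0.9, 0.5, 0.3]
hazy = add_haze(clean, transmission, airlight=1.0)
restored = remove_haze(hazy, transmission, airlight=1.0)
print(hazy)
print(restored)
```

In practice the transmission and atmospheric light are unknown and must be estimated, which is where learned methods come in; the round trip here only confirms that the two formulas are consistent.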

[46] Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction

Yi Gu,Kuniaki Saito,Jiaxin Ma

Main category: cs.CV

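TL;DR: The paper proposes a novel multimodal learning framework that combines improved modality dropout with contrastive learning to address modality imbalance and missingness in medical diagnosis. Using learnable modality tokens and multimodal contrastive learning, it performs strongly on disease detection and prediction tasks.

Details Motivation: Fusing multimodal data in medical diagnosis faces modality imbalance and missing modalities, which calls for a robust and efficient solution.

Contribution: 1. An improved modality dropout mechanism and a contrastive learning approach that strengthen the robustness of multimodal fusion. 2. Learnable modality tokens that improve the handling of missing modalities.

Method: Combines modality dropout with contrastive learning and fuses multimodal data with learnable modality tokens, so that the model stays robust when modalities are missing.

Result: The method is validated on large-scale clinical datasets, with particularly strong results when only a single modality is available.

Insight: Better modality handling combined with contrastive learning can yield an efficient, low-cost multimodal learning solution for clinical use.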

Abstract: As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modality dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.
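The idea of modality dropout with learnable modality tokens can be sketched as follows: during training, available modalities are randomly hidden, and any hidden or genuinely missing modality is represented by a per-modality token before fusion. This is a generic illustration, not the authors' implementation (their code is at the URL above). The mean fusion, the rule that at least one real modality is kept, and the fixed lists standing in for trained token parameters are all assumptions.

```python
import random


def fuse_with_modality_dropout(embeddings, tokens, drop_prob, rng):
    """Fuse modality embeddings, substituting a token for missing or dropped ones.

    embeddings: dict name -> list of floats, or None if the modality is missing.
    tokens: dict name -> list of floats (stand-ins for learnable parameters).
    """
    present = [name for name, emb in embeddings.items() if emb is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    dropped = {name for name in present if rng.random() < drop_prob}
    if len(dropped) == len(present):
        # Assumption: always keep one real modality so the sample stays informative.
        dropped.discard(rng.choice(present))
    vectors = []
    for name in sorted(embeddings):
        emb = embeddings[name]
        vectors.append(tokens[name] if emb is None or name in dropped else emb)
    dim = len(vectors[0])
    # Mean fusion is an illustrative choice; real models often use attention.
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


rng = random.Random(0)
tokens = {"image": [0.0, 1.0], "tabular": [9.0, 9.0]}
# The image modality is missing, so its token is used in the fusion.
fused = fuse_with_modality_dropout(
    {"image": None, "tabular": [1.0, 3.0]}, tokens, drop_prob=0.0, rng=rng
)
print(fused)
```

Because the model sees token-substituted inputs during training, it meets the same representation at inference time when a modality is truly absent.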

[47] Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model

Yixin Zhang,Ryan Chamberlain,Lawrance Ngo,Kevin Kramer,Maciej A. Mazurowski

Main category: cs.CV

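TL;DR: The study systematically evaluates segmentation architectures for pulmonary embolism (PE) segmentation. It finds that 3D U-Net remains highly effective, that CNNs outperform ViTs, and that classification pretraining can hurt segmentation performance.

Details Motivation: PE segmentation is challenging, and existing methods lack systematic comparison. In particular, the difference in segmentation performance between CNNs and ViTs has not been studied enough.

Contribution: Presents a systematic evaluation of multiple segmentation architectures on the PE task, revealing performance differences among 3D models, CNNs, and ViTs, as well as the negative effect of classification pretraining.

Method: Uses a densely annotated dataset of 490 CTPA scans to evaluate nine CNN and ViT segmentation architectures under a unified testing framework.

Result: 3D U-Net with a ResNet encoder remains highly effective, and CNNs generally outperform ViTs. Classification-based pretraining can harm segmentation, and distal emboli remain challenging. The best-performing model reaches a mean Dice score of 0.7131.

Insight: PE classification and segmentation may rely on different features, and 3D models suit the morphological characteristics of emboli. The best model's generalizability is further validated on public datasets.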

Abstract: In this study, we curated a densely annotated in-house dataset comprising 490 CTPA scans. Using this dataset, we systematically evaluated nine widely used segmentation architectures from both the CNN and Vision Transformer (ViT) families, initialized with either pretrained or random weights, under a unified testing framework as a performance audit. Our study leads to several important observations: (1) 3D U-Net with a ResNet encoder remains a highly effective architecture for PE segmentation; (2) 3D models are particularly well-suited to this task given the morphological characteristics of emboli; (3) CNN-based models generally yield superior performance compared to their ViT-based counterparts in PE segmentation; (4) classification-based pretraining, even on large PE datasets, can adversely impact segmentation performance compared to training from scratch, suggesting that PE classification and segmentation may rely on different sets of discriminative features; (5) different model architectures show a highly consistent pattern of segmentation performance when trained on the same data; and (6) while central and large emboli can be segmented with satisfactory accuracy, distal emboli remain challenging due to both task complexity and the scarcity of high-quality datasets. Besides these findings, our best-performing model achieves a mean Dice score of 0.7131 for segmentation. It detects 181 emboli with 49 false positives and 28 false negatives from 60 in-house testing scans. Its generalizability is further validated on public datasets.
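The detection counts reported above can be turned into precision, recall, and F1 with a few lines of arithmetic. The sketch assumes that the 181 detected emboli are true positives, which the abstract implies but does not state outright; the derived rates are therefore illustrative, not figures from the paper.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from raw detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
    return precision, recall, f1


# Counts from the abstract: 181 detected emboli, 49 false positives, 28 false negatives.
precision, recall, f1 = detection_metrics(181, 49, 28)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Under that assumption, precision is 181/230, recall is 181/209, and F1 is 362/439.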

[48] Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

Riad Ahmed Anonto,Sardar Md. Saffat Zabin,M. Saifur Rahman

Main category: cs.CV

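TL;DR: The paper proposes an image captioning method for a low-resource language (Bengali). By combining cross-attention-guided patch alignment, contrastive learning, and optimal transport regularization, it significantly improves grounding and captioning performance.

Details Motivation: Vision-language models for low-resource languages such as Bengali suffer from scarce paired data, translation pivots that break alignment, and English-centric pretraining, so the generated text often does not match the image content. The paper proposes a compute-efficient cross-modal alignment approach to address this.

Contribution: 1) A lightweight bridge module that links the vision and language modalities; 2) a tri-loss objective that combines a cross-attention-guided Patch-Alignment Loss (PAL), an InfoNCE contrastive loss, and Sinkhorn-based optimal transport (OT); 3) significant gains on Flickr30k and MSCOCO, narrowing the gap between real and synthetic data.

Method: 1) A frozen MaxViT extracts stable visual patch features; 2) a Bengali-native mBART-50 decodes; 3) the PAL+InfoNCE+OT tri-loss optimizes cross-modal alignment and reduces spurious matches.

Result: The method achieves significant gains on Flickr30k-1k and MSCOCO-1k, with BLEU-4, METEOR, and BERTScore-F1 all above the baselines, and narrows the real-synthetic centroid gap by 41%.

Insight: Combining cross-attention-guided local alignment with global contrastive learning can improve cross-modal alignment and reduce noise in low-resource language tasks, which may carry over to similar settings.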

Abstract: Grounding vision–language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN–BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real–synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real–synthetic centroid gap by 41%.
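Of the three loss terms named above, InfoNCE is the most standard, and a generic version is sketched below. It treats the i-th anchor and the i-th candidate as a positive pair and all other candidates in the batch as negatives. How the paper pairs real and synthetic descriptors, and what temperature it uses, is not stated in the abstract, so the pairing, the temperature value, and the function name are assumptions.

```python
import math


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] and candidates[i] form the positive pair."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    c = [normalize(v) for v in candidates]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in c]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is close to zero when each anchor is most similar to its own partner and large when the partners are swapped, which is the behavior a contrastive term relies on.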

[49] TinyBEV: Cross Modal Knowledge Distillation for Efficient Multi Task Bird’s Eye View Perception and Planning

Reeshad Khan,John Gauch

Main category: cs.CV

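TL;DR: TinyBEV is an efficient, camera-only BEV perception and planning framework. It uses knowledge distillation to compress the capabilities of a large teacher model (UniAD) into a lightweight student that supports multiple tasks, with clear advantages in parameter count and speed.

Details Motivation: Large multi-modal perception-planning models are hard to deploy on resource-constrained devices for real-time autonomous driving, so a balance between efficiency and functional completeness is needed.

Contribution: 1. The TinyBEV framework, which supports multiple tasks with 78% fewer parameters; 2. A multi-stage distillation strategy that effectively transfers multi-modal knowledge; 3. Efficient performance on the nuScenes dataset.

Method: A multi-stage knowledge distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to transfer the teacher's knowledge into a lightweight BEV representation.

Result: On nuScenes, TinyBEV reaches 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32 collision rate, running at 11 FPS with camera input only.

Insight: A lightweight model can retain multi-task capability through knowledge distillation, offering a feasible route to real-time autonomous driving on resource-constrained devices.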

Abstract: We present TinyBEV, a unified, camera-only Bird’s Eye View (BEV) framework that distills the full-stack capabilities of a large planning-oriented teacher (UniAD [19]) into a compact, real-time student model. Unlike prior efficient camera-only baselines such as VAD [23] and VADv2 [7], TinyBEV supports the complete autonomy stack (3D detection, HD-map segmentation, motion forecasting, occupancy prediction, and goal-directed planning) within a streamlined 28M-parameter backbone, achieving a 78% reduction in parameters over UniAD [19]. Our model-agnostic, multi-stage distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to effectively transfer high-capacity multi-modal knowledge to a lightweight BEV representation. On nuScenes [4], TinyBEV achieves 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32 collision rate, while running 5x faster (11 FPS) and requiring only camera input. These results demonstrate that full-stack driving intelligence can be retained in resource-constrained settings, bridging the gap between large-scale, multi-modal perception-planning models and deployment-ready real-time autonomy.
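To make the combination of feature-level, output-level, and region-aware supervision concrete, here is a hedged sketch of one way such a loss could be assembled. It is not TinyBEV's actual objective: `distillation_loss`, `region_weights`, `temperature`, and `alpha` are hypothetical names and settings, and the features are flattened to plain lists for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                      region_weights, temperature=2.0, alpha=0.5):
    """Weighted sum of a feature-level and an output-level distillation term."""
    # Feature-level term: mean squared error over BEV cells, where each cell
    # is weighted by an assumed region-importance score.
    weight_sum = sum(region_weights)
    feat = sum(w * (s - t) ** 2
               for w, s, t in zip(region_weights, student_feat, teacher_feat)) / weight_sum
    # Output-level term: KL(teacher || student) on temperature-softened outputs.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return alpha * feat + (1.0 - alpha) * temperature ** 2 * kl

loss = distillation_loss([0.1, 0.4], [0.2, 0.4], [1.0, 0.0], [2.0, 0.0], [1.0, 3.0])
```

The loss is zero when the student reproduces the teacher exactly. Cells with larger region weights contribute more to the feature term.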

[50] BlurBall: Joint Ball and Motion Blur Estimation for Table Tennis Ball Tracking

Thomas Gossard,Filip Radovic,Andreas Ziegler,Andrea Zell

Main category: cs.CV

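TL;DR: The paper proposes a new labeling strategy that places the ball at the center of the motion blur and annotates blur attributes, and it releases a new table tennis ball detection dataset. The proposed BlurBall model jointly estimates ball position and motion blur attributes and, combined with attention mechanisms, achieves state-of-the-art detection results.

Details Motivation: Fast-moving objects are hard to detect because of motion blur. Existing methods label the leading edge of the blur, which introduces asymmetry and ignores motion cues correlated with velocity.

Contribution: 1. A new labeling strategy that places the ball at the blur center and annotates blur attributes; 2. a new table tennis ball detection dataset; 3. the BlurBall model, which jointly estimates ball position and blur attributes and uses attention mechanisms to improve detection.

Method: 1. Attention mechanisms (such as Squeeze-and-Excitation) over multi-frame inputs; 2. joint estimation of ball position and motion blur attributes; 3. a new labeling strategy.

Result: BlurBall achieves state-of-the-art ball detection performance and improves the reliability of trajectory prediction.

Insight: Exploiting motion blur not only improves detection accuracy but also yields more reliable data for real-time sports analytics.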

Abstract: Motion blur reduces the clarity of fast-moving objects, posing challenges for detection systems, especially in racket sports, where balls often appear as streaks rather than distinct points. Existing labeling conventions mark the ball at the leading edge of the blur, introducing asymmetry and ignoring valuable motion cues correlated with velocity. This paper introduces a new labeling strategy that places the ball at the center of the blur streak and explicitly annotates blur attributes. Using this convention, we release a new table tennis ball detection dataset. We demonstrate that this labeling approach consistently enhances detection performance across various models. Furthermore, we introduce BlurBall, a model that jointly estimates ball position and motion blur attributes. By incorporating attention mechanisms such as Squeeze-and-Excitation over multi-frame inputs, we achieve state-of-the-art results in ball detection. Leveraging blur not only improves detection accuracy but also enables more reliable trajectory prediction, benefiting real-time sports analytics.

[51] MVP: Motion Vector Propagation for Zero-Shot Video Object Detection

Binhua Huang,Ni Wang,Wendong Yao,Soumyabrata Dev

Main category: cs.CV

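TL;DR: The paper proposes MVP, a training-free zero-shot video object detection method that uses motion vector propagation to reduce invocations of a large open-vocabulary detector while maintaining high detection accuracy.

Details Motivation: Running a large open-vocabulary detector on every video frame is accurate but computationally expensive. The paper aims to reduce detector invocations through compressed-domain motion vector propagation while preserving detection quality.

Contribution: 1. MVP, a training-free, label-free zero-shot video object detection method; 2. compressed-domain motion vector propagation that lowers how often the open-vocabulary detector is invoked; 3. substantially lower compute cost at preserved detection accuracy.

Method: 1. Invoke the OWLv2 detector on fixed-interval keyframes; 2. propagate detections to intermediate frames using compressed-domain motion vectors (MV); 3. aggregate motion vectors over a 3x3 grid, refined with an area-growth check and an optional single-class switch.

Result: On ILSVRC2015-VID, MVP reaches mAP@0.5 of 0.609 and mAP@[0.5:0.95] of 0.316. At loose IoU thresholds its performance is close to framewise OWLv2-Large. MVP also outperforms tracker-based propagation methods.

Insight: Compressed-domain motion vector propagation is an efficient and practical way to reduce detector invocations while keeping strong zero-shot coverage in video. The method needs no labeled data and suits open-vocabulary settings.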

Abstract: Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.
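The abstract says a 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, guarded by an area-growth check. The sketch below is one plausible reading of that description, not the released MVP code. The function `propagate_box`, the way scale is derived from opposite grid columns and rows, and the `max_area_growth` threshold are all assumptions. It also assumes the motion vectors have already been converted to forward pixel displacements.

```python
def propagate_box(box, grid_mvs, max_area_growth=1.5):
    """Move and uniformly rescale a box from nine per-cell motion vectors.

    box is (x1, y1, x2, y2); grid_mvs holds nine (dx, dy) pairs in row-major
    order, one per cell of a 3x3 grid laid over the box (hypothetical layout).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Translation: mean displacement over the nine grid cells.
    tx = sum(dx for dx, _ in grid_mvs) / 9.0
    ty = sum(dy for _, dy in grid_mvs) / 9.0
    # Scale: how much faster the outer columns/rows move apart than together.
    left = sum(grid_mvs[r * 3][0] for r in range(3)) / 3.0
    right = sum(grid_mvs[r * 3 + 2][0] for r in range(3)) / 3.0
    top = sum(grid_mvs[c][1] for c in range(3)) / 3.0
    bottom = sum(grid_mvs[6 + c][1] for c in range(3)) / 3.0
    # Centres of the outer cells are 2/3 of the box width (height) apart.
    sx = (right - left) / (2.0 * w / 3.0)
    sy = (bottom - top) / (2.0 * h / 3.0)
    scale = 1.0 + 0.5 * (sx + sy)
    # Area-growth check: reject implausible size changes, keep translation only.
    area_ratio = scale * scale
    if scale <= 0 or area_ratio > max_area_growth or area_ratio < 1.0 / max_area_growth:
        scale = 1.0
    cx, cy = (x1 + x2) / 2.0 + tx, (y1 + y2) / 2.0 + ty
    half_w, half_h = w * scale / 2.0, h * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

shifted = propagate_box((10.0, 20.0, 40.0, 80.0), [(2.0, -1.0)] * 9)
```

A pure translation field leaves the box size unchanged. A field that diverges from the box centre enlarges the box, unless the implied area change exceeds the threshold.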

[52] Improving the color accuracy of lighting estimation models

Zitian Zhang,Joshua Urban Davis,Jeanne Phuong Anh Vu,Jiangtao Kuang,Jean-François Lalonde

Main category: cs.CV

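TL;DR: The paper studies how simple adaptation techniques can improve the color accuracy of existing lighting estimation models. It finds that preprocessing the input image with a pre-trained white balance network significantly improves color accuracy without retraining the model.

Details Motivation: High dynamic range (HDR) lighting estimation has opened new possibilities for augmented reality (AR), but its color accuracy is often overlooked. The paper aims to improve the color accuracy of existing models through adaptation techniques, to enhance the visual realism of virtual objects.

Contribution: 1) A systematic study of the color accuracy of lighting estimation models; 2) a white balance preprocessing method that needs no model retraining and significantly improves color accuracy; 3) validation of the method's generality on three state-of-the-art lighting estimation models.

Method: The input image is preprocessed with a pre-trained white balance network to improve color accuracy, and several adaptation strategies are evaluated on an HDR dataset with diverse lighting colors.

Result: White balance preprocessing performs best across all tested scenarios and significantly improves color accuracy without retraining the lighting estimation model.

Insight: Color accuracy in lighting estimation is easily overlooked but important, and a simple preprocessing step can significantly improve existing models.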

Abstract: Advances in high dynamic range (HDR) lighting estimation from a single image have opened new possibilities for augmented reality (AR) applications. Predicting complex lighting environments from a single input image allows for the realistic rendering and compositing of virtual objects. In this work, we investigate the color robustness of such methods – an often overlooked yet critical factor for achieving visual realism. While most evaluations conflate color with other lighting attributes (e.g., intensity, direction), we isolate color as the primary variable of interest. Rather than introducing a new lighting estimation algorithm, we explore whether simple adaptation techniques can enhance the color accuracy of existing models. Using a novel HDR dataset featuring diverse lighting colors, we systematically evaluate several adaptation strategies. Our results show that preprocessing the input image with a pre-trained white balance network improves color robustness, outperforming other strategies across all tested scenarios. Notably, this approach requires no retraining of the lighting estimation model. We further validate the generality of this finding by applying the technique to three state-of-the-art lighting estimation methods from recent literature.

[53] Check Field Detection Agent (CFD-Agent) using Multimodal Large Language and Vision Language Models

Sourav Halder,Jinjun Tong,Xinyu Wu

Main category: cs.CV

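TL;DR: CFD-Agent is a new training-free framework that uses a vision language model (VLM) and a multimodal large language model (MLLM) for zero-shot detection of key check fields, significantly lowering the barrier to deployment in real-world financial settings.

Details Motivation: Checks are widely used in financial transactions but are also a frequent target of fraud, so efficient and accurate field detection is needed. Traditional methods depend on large labeled datasets, which are hard to obtain because of privacy and proprietary concerns.

Contribution: A training-free automated check field detection framework that combines a VLM and an MLLM for zero-shot detection, with strong performance and generalization validated on a diverse check dataset.

Method: A vision language model and a multimodal large language model detect key check fields (such as the signature and the MICR line) zero-shot, without additional training data.

Result: On a hand-curated dataset of 110 checks, the framework shows strong performance and generalization. It can also generate high-quality labeled data for later development of specialized models.

Insight: Zero-shot methods that combine a VLM and an MLLM show promise for financial document processing. They reduce data requirements and allow fast deployment, offering a new approach to data-scarce problems in the domain.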

Abstract: Checks remain a foundational instrument in the financial ecosystem, facilitating substantial transaction volumes across institutions. However, their continued use also renders them a persistent target for fraud, underscoring the importance of robust check fraud detection mechanisms. At the core of such systems lies the accurate identification and localization of critical fields, such as the signature, magnetic ink character recognition (MICR) line, courtesy amount, legal amount, payee, and payer, which are essential for subsequent verification against reference checks belonging to the same customer. This field-level detection is traditionally dependent on object detection models trained on large, diverse, and meticulously labeled datasets, a resource that is scarce due to proprietary and privacy concerns. In this paper, we introduce a novel, training-free framework for automated check field detection, leveraging the power of a vision language model (VLM) in conjunction with a multimodal large language model (MLLM). Our approach enables zero-shot detection of check components, significantly lowering the barrier to deployment in real-world financial settings. Quantitative evaluation of our model on a hand-curated dataset of 110 checks spanning multiple formats and layouts demonstrates strong performance and generalization capability. Furthermore, this framework can serve as a bootstrap mechanism for generating high-quality labeled datasets, enabling the development of specialized real-time object detection models tailored to institutional needs.

[54] Losing the Plot: How VLM responses degrade on imperfect charts

Philip Wootaek Shin,Jack Sampson,Vijaykrishnan Narayanan,Andres Marquez,Mahantesh Halappanavar

Main category: cs.CV

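TL;DR: The paper studies how vision language model (VLM) performance degrades on imperfect charts and introduces the CHART NOISe dataset along with mitigation strategies.

Details Motivation: Existing benchmarks assume clean charts and fact-based questions, but real charts often contain distortions or demand complex reasoning. The study evaluates how VLMs behave under corruption or occlusion.

Contribution: 1) Benchmarking state-of-the-art VLMs and exposing systematic vulnerabilities; 2) releasing CHART NOISe, the first dataset that unifies corruption, occlusion, and reverse inconsistency; 3) proposing mitigation strategies such as quality filtering and occlusion detection.

Method: ChatGPT 4o, Claude Sonnet 4, and Gemini 2.5 Pro are evaluated on corrupted or occluded charts, and the CHART NOISe dataset is constructed for testing.

Result: Model performance drops sharply under corruption or occlusion, with hallucinations (such as fabricated values), while the models remain overconfident.

Insight: Making VLM chart understanding more robust requires attention to corruption and occlusion. Future work should combine datasets with mitigation strategies to make progress.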

Abstract: Vision language models (VLMs) show strong results on chart understanding, yet existing benchmarks assume clean figures and fact-based queries. Real-world charts often contain distortions and demand reasoning beyond simple matching. We evaluate ChatGPT 4o, Claude Sonnet 4, and Gemini 2.5 Pro, finding sharp performance drops under corruption or occlusion, with hallucinations such as value fabrication, trend misinterpretation, and entity confusion becoming more frequent. Models remain overconfident in degraded settings, generating plausible but unsupported explanations. To address this gap, we introduce CHART NOISe (Chart Hallucinations, Answers, and Reasoning Testing on Noisy and Occluded Input Selections), a dataset combining chart corruptions, occlusions, and exam-style multiple-choice questions inspired by Korea’s CSAT English section. A key innovation is prompt reverse inconsistency, where models contradict themselves when asked to confirm versus deny the same statement. Our contributions are threefold: (1) benchmarking state-of-the-art VLMs, exposing systematic vulnerabilities in chart reasoning; (2) releasing CHART NOISe, the first dataset unifying corruption, occlusion, and reverse inconsistency; and (3) proposing baseline mitigation strategies such as quality filtering and occlusion detection. Together, these efforts establish a rigorous testbed for advancing robustness and reliability in chart understanding.
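Prompt reverse inconsistency, as described in the abstract, can be measured with a very small check. The sketch below is an illustration only: the function name, the yes/no answer encoding, and the prompt wording in the comments are assumptions, not the benchmark's actual protocol.

```python
def reverse_inconsistency_rate(answer_pairs):
    """Fraction of statements on which a model contradicts itself.

    Each pair is (confirm_answer, deny_answer): the model's "yes"/"no" reply to
    a hypothetical confirming prompt ("Is statement S true?") and to the
    reversed prompt ("Is statement S false?") about the same chart statement.
    A self-consistent model gives opposite answers to the two prompts.
    """
    if not answer_pairs:
        raise ValueError("need at least one answer pair")
    inconsistent = sum(1 for confirm, deny in answer_pairs
                       if confirm.strip().lower() == deny.strip().lower())
    return inconsistent / len(answer_pairs)

pairs = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
rate = reverse_inconsistency_rate(pairs)
```

In this toy input, the second and fourth... more precisely the second and third pairs repeat the same answer under both phrasings, so they count as contradictions.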

[55] CPT-4DMR: Continuous sPatial-Temporal Representation for 4D-MRI Reconstruction

Xinyang Wu,Muheng Li,Xia Li,Orso Pusterla,Sairos Safai,Philippe C. Cattin,Antony J. Lomax,Ye Zhang

Main category: cs.CV

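TL;DR: CPT-4DMR is a neural-representation-based 4D-MRI reconstruction method that replaces conventional discrete binning with a continuous spatial-temporal representation, significantly improving efficiency and accuracy.

Details Motivation: Conventional 4D-MRI reconstruction relies on phase binning or template scans, which fails to capture temporal variability, complicates workflows, and imposes heavy computation. The paper aims to address these problems.

Contribution: 1. A continuous spatial-temporal representation framework that models respiratory motion as a smooth, continuous deformation; 2. two synergistic networks (SAN and TMN) that combine spatial anatomy and temporal motion modeling; 3. markedly better reconstruction efficiency and accuracy, including for irregular breathing patterns.

Method: SAN encodes a continuous 3D anatomical representation, and TMN produces temporally consistent deformation fields guided by Transformer-derived respiratory signals. Together they enable efficient reconstruction.

Result: On a free-breathing dataset of 19 volunteers, the method cuts processing time from about five hours to 15 minutes of training, infers each 3D volume in under one second, and reconstructs more accurately than conventional methods.

Insight: Continuous representations offer clear advantages for 4D-MRI reconstruction, especially for capturing dynamic respiratory motion. Combining neural representations with Transformer-based signal processing is a promising direction for future work.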

Abstract: Four-dimensional MRI (4D-MRI) is a promising technique for capturing respiratory-induced motion in radiation therapy planning and delivery. Conventional 4D reconstruction methods, which typically rely on phase binning or separate template scans, struggle to capture temporal variability, complicate workflows, and impose heavy computational loads. We introduce a neural representation framework that considers respiratory motion as a smooth, continuous deformation steered by a 1D surrogate signal, completely replacing the conventional discrete sorting approach. The new method fuses motion modeling with image reconstruction through two synergistic networks: the Spatial Anatomy Network (SAN) encodes a continuous 3D anatomical representation, while a Temporal Motion Network (TMN), guided by Transformer-derived respiratory signals, produces temporally consistent deformation fields. Evaluation using a free-breathing dataset of 19 volunteers demonstrates that our template- and phase-free method accurately captures both regular and irregular respiratory patterns, while preserving vessel and bronchial continuity with high anatomical fidelity. The proposed method significantly improves efficiency, reducing the total processing time from approximately five hours required by conventional discrete sorting methods to just 15 minutes of training. Furthermore, it enables inference of each 3D volume in under one second. The framework accurately reconstructs 3D images at any respiratory state, achieves superior performance compared to conventional methods, and demonstrates strong potential for application in 4D radiation therapy planning and real-time adaptive treatment.

[56] An Analysis of Kalman Filter based Object Tracking Methods for Fast-Moving Tiny Objects

Prithvi Raj Singh,Raju Gottumukkala,Anthony Maida

Main category: cs.CV

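TL;DR: The paper analyzes five Kalman-filter-based object tracking methods (OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT) on fast-moving tiny objects such as a racquetball. DeepOCSORT has the lowest error and ByteTrack is the fastest, but all methods show significant tracking drift.

Details Motivation: Irregular motion patterns and a small visual footprint make tracking fast-moving tiny objects a hard computer vision problem, especially in sport robotics, which needs lightweight, high-accuracy tracking.

Contribution: An evaluation of five Kalman-filter-based trackers on a custom racquetball dataset, revealing their differences in speed and error and their limitations.

Method: Using a dataset of 10,000 annotated racquetball frames at 720p-1280p resolution, the paper analyzes how inference speed and per-image update frequency affect tracking accuracy.

Result: DeepOCSORT has the lowest error (ADE 31.15 pixels) and ByteTrack is the fastest (26.6 ms), but all methods still drift by 3-11 cm in spatial error.

Insight: Existing Kalman-filter-based trackers adapt poorly to the irregular motion of fast-moving tiny objects, so more specialized tracking algorithms are needed.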

Abstract: Unpredictable movement patterns and a small visual mark make precise tracking of fast-moving tiny objects like a racquetball one of the challenging problems in computer vision. This challenge is particularly relevant for sport robotics applications, where lightweight and accurate tracking systems can improve robot perception and planning capabilities. While Kalman filter-based tracking methods have shown success in general object tracking scenarios, their performance degrades substantially when dealing with rapidly moving objects that exhibit irregular bouncing behavior. In this study, we evaluate the performance of five state-of-the-art Kalman filter-based tracking methods (OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT) using a custom dataset containing 10,000 annotated racquetball frames captured at 720p-1280p resolution. We focus our analysis on two critical performance factors: inference speed and update frequency per image, examining how these parameters affect tracking accuracy and reliability for fast-moving tiny objects. Our experimental evaluation across four distinct scenarios reveals that DeepOCSORT achieves the lowest tracking error with an average ADE of 31.15 pixels compared to ByteTrack’s 114.3 pixels, while ByteTrack demonstrates the fastest processing at 26.6 ms average inference time versus DeepOCSORT’s 26.8 ms. However, our results show that all Kalman filter-based trackers exhibit significant tracking drift with spatial errors ranging from 3-11 cm (ADE values: 31-114 pixels), indicating fundamental limitations in handling the unpredictable motion patterns of fast-moving tiny objects like racquetballs. Our analysis demonstrates that current tracking approaches require substantial improvements, with error rates 3-4x higher than standard object tracking benchmarks, highlighting the need for specialized methodologies for fast-moving tiny object tracking applications.
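The ADE figures quoted above are average displacement errors: the mean Euclidean distance between predicted and ground-truth positions over a trajectory. A minimal sketch follows. The optional `cm_per_pixel` argument is a hypothetical convenience for converting to physical units, since the abstract does not state the calibration it used.

```python
import math

def average_displacement_error(predicted, ground_truth, cm_per_pixel=None):
    """Mean Euclidean distance between matched (x, y) positions.

    cm_per_pixel is an assumed, caller-supplied calibration factor; when it is
    omitted the result is in pixels.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.hypot(px - gx, py - gy)
                for (px, py), (gx, gy) in zip(predicted, ground_truth))
    ade = total / len(predicted)
    return ade * cm_per_pixel if cm_per_pixel is not None else ade

ade_px = average_displacement_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Here the two per-frame errors are 0 and 5 pixels, so `ade_px` is 2.5.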

[57] MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition

Binhua Huang,Wendong Yao,Shaowu Chen,Guoxin Wang,Qingyuan Wang,Soumyabrata Dev

Main category: cs.CV

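TL;DR: MoCrop is a training-free, motion-aware adaptive cropping module that makes action recognition in compressed video more efficient. It uses the motion vectors in H.264 video to locate motion-dense regions and produce a crop.

Details Motivation: Video action recognition is compute-intensive, particularly in the compressed domain. Existing methods often rely on training or add parameters, which reduces efficiency. MoCrop aims to exploit the motion vectors already present in the video, without training, to improve model performance.

Contribution: 1. The MoCrop module, which is training-free and adds no parameters; 2. a lightweight pipeline (denoising & merge, Monte Carlo sampling, and adaptive cropping) that produces robust crops; 3. validation of generality across several backbones, with higher accuracy or lower compute cost.

Method: MoCrop uses the motion vectors in H.264 video to locate motion-dense regions and produces a crop through a lightweight pipeline (DM, MCS, AC). The method is training-free and plugs directly into diverse backbones.

Result: On UCF101, MoCrop improves Top-1 accuracy by 3.5% at equal FLOPs, or by 2.4% with 26.5% fewer FLOPs. Applied to CoViAR, it reaches 89.2% accuracy at the original compute cost, or 88.5% with reduced compute.

Insight: Motion vectors in video are a key signal for efficient action recognition. MoCrop's lightweight design shows that performance in the compressed domain can improve without training, which makes it practical for real-time deployment.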

Abstract: We introduce MoCrop, a motion-aware adaptive cropping module for efficient video action recognition in the compressed domain. MoCrop uses motion vectors that are available in H.264 video to locate motion-dense regions and produces a single clip-level crop that is applied to all I-frames at inference. The module is training free, adds no parameters, and can be plugged into diverse backbones. A lightweight pipeline that includes denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC) via a motion-density submatrix search yields robust crops with negligible overhead. On UCF101, MoCrop improves accuracy or reduces compute. With ResNet-50, it delivers +3.5% Top-1 accuracy at equal FLOPs (attention setting), or +2.4% Top-1 accuracy with 26.5% fewer FLOPs (efficiency setting). Applied to CoViAR, it reaches 89.2% Top-1 accuracy at the original cost and 88.5% Top-1 accuracy while reducing compute from 11.6 to 8.5 GFLOPs. Consistent gains on MobileNet-V3, EfficientNet-B1, and Swin-B indicate strong generality and make MoCrop practical for real-time deployment in the compressed domain. Our code and models are available at https://github.com/microa/MoCrop.
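The adaptive cropping step is described as a motion-density submatrix search. A common way to run such a search efficiently is a summed-area table, sketched below. This is an assumed realization, not the released MoCrop code: `best_crop`, the fixed crop size, and the toy density grid are illustrative.

```python
def best_crop(density, crop_h, crop_w):
    """Return (row, col, score) of the crop_h x crop_w window with most motion.

    density is a 2D list of per-cell motion magnitudes (hypothetical input,
    e.g. accumulated motion-vector magnitudes per macroblock).
    """
    rows, cols = len(density), len(density[0])
    # Summed-area table with a zero border: sat[r][c] = sum of density[:r][:c].
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = density[r][c] + sat[r][c + 1] + sat[r + 1][c] - sat[r][c]
    best = None
    for r in range(rows - crop_h + 1):
        for c in range(cols - crop_w + 1):
            # Window sum in O(1) from four table look-ups.
            score = (sat[r + crop_h][c + crop_w] - sat[r][c + crop_w]
                     - sat[r + crop_h][c] + sat[r][c])
            if best is None or score > best[2]:
                best = (r, c, score)
    return best

grid = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 3, 4],
        [0, 0, 5, 6]]
row, col, score = best_crop(grid, 2, 2)
```

The table is built once per clip, after which every candidate window costs constant time, which fits the "negligible overhead" claim in the abstract.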

[58] Codebook-Based Adaptive Feature Compression With Semantic Enhancement for Edge-Cloud Systems

Xinyu Wang,Zikun Zhou,Yingjian Li,Xin An,Hongpeng Wang

Main category: cs.CV

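TL;DR: The paper proposes CAFC-SE, a codebook-based adaptive feature compression framework that preserves more informative visual patterns under low-bitrate conditions through vector quantization and selective transmission.

Details Motivation: In edge-cloud systems, the key problem is to keep analysis performance high while minimizing bitrate. Conventional methods perform poorly at low bitrates because they retain redundant details or learn over-concentrated symbol distributions.

Contribution: CAFC-SE, a codebook-based adaptive feature compression framework that uses vector quantization to map continuous visual features to discrete indices and transmits them selectively to improve low-bitrate performance.

Method: Vector quantization (VQ) maps each feature vector to its nearest visual primitive, and the discrete indices are transmitted selectively so that more informative visual patterns are preserved.

Result: Experiments show that CAFC-SE is superior in both rate and accuracy under low-bitrate conditions.

Insight: With vector quantization and selective transmission, CAFC-SE reduces redundant information and improves feature compression at low bitrates, giving edge-cloud systems a more efficient option.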

Abstract: Coding images for machines with minimal bitrate and strong analysis performance is key to effective edge-cloud systems. Several approaches deploy an image codec and perform analysis on the reconstructed image. Other methods compress intermediate features using entropy models and subsequently perform analysis on the decoded features. Nevertheless, these methods both perform poorly under low-bitrate conditions, as they retain many redundant details or learn over-concentrated symbol distributions. In this paper, we propose a Codebook-based Adaptive Feature Compression framework with Semantic Enhancement, named CAFC-SE. It maps continuous visual features to discrete indices with a codebook at the edge via Vector Quantization (VQ) and selectively transmits them to the cloud. The VQ operation that projects feature vectors onto the nearest visual primitives enables us to preserve more informative visual patterns under low-bitrate conditions. Hence, CAFC-SE is less vulnerable to low-bitrate conditions. Extensive experiments demonstrate the superiority of our method in terms of rate and accuracy.
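The core VQ operation, mapping each feature vector to the index of its nearest codebook entry, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the CAFC-SE implementation: the codebook values are made up, squared Euclidean distance is assumed as the metric, and the selective-transmission and semantic-enhancement parts are not modeled.

```python
import math

def quantize(features, codebook):
    """Edge side: replace each feature vector by its nearest codebook index."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def dequantize(indices, codebook):
    """Cloud side: look the transmitted indices up in the shared codebook."""
    return [codebook[k] for k in indices]

def bits_per_index(codebook_size):
    """Fixed-length cost of one index, before any entropy coding."""
    return max(1, math.ceil(math.log2(codebook_size)))

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # toy visual primitives
indices = quantize([[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]], codebook)
decoded = dequantize(indices, codebook)
```

Only the integer indices cross the network, so the payload depends on the codebook size rather than on the feature dimension.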

[59] BridgeSplat: Bidirectionally Coupled CT and Non-Rigid Gaussian Splatting for Deformable Intraoperative Surgical Navigation

Maximilian Fehrentz,Alexander Winkler,Thomas Heiliger,Nazim Haouchine,Christian Heiliger,Nassir Navab

Main category: cs.CV

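TL;DR: BridgeSplat is a new approach to deformable surgical navigation that couples intraoperative 3D reconstruction with preoperative CT data, linking surgical video to volumetric patient data. It jointly optimizes Gaussian parameters and mesh deformation through photometric supervision.

Details Motivation: In deformable surgical navigation there is a gap between surgical video and volumetric patient data. BridgeSplat aims to close it by coupling CT with non-rigid Gaussian Splatting.

Contribution: BridgeSplat, a method that bidirectionally couples CT with non-rigid Gaussian Splatting, jointly optimizes Gaussian parameters and mesh deformation, and propagates the deformation back to the CT.

Method: 3D Gaussians are rigged to a CT mesh, and each Gaussian is parametrized relative to its parent mesh triangle, which keeps Gaussians and mesh aligned. The deformation is optimized jointly through photometric supervision.

Result: BridgeSplat is validated on visceral pig surgeries and on synthetic data of a simulated human liver, showing sensible deformations of the preoperative CT from monocular RGB data.

Insight: BridgeSplat shows the potential of coupling CT with Gaussian Splatting for deformable surgical navigation and provides new technical support for intraoperative guidance.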

Abstract: We introduce BridgeSplat, a novel approach for deformable surgical navigation that couples intraoperative 3D reconstruction with preoperative CT data to bridge the gap between surgical video and volumetric patient data. Our method rigs 3D Gaussians to a CT mesh, enabling joint optimization of Gaussian parameters and mesh deformation through photometric supervision. By parametrizing each Gaussian relative to its parent mesh triangle, we enforce alignment between Gaussians and mesh and obtain deformations that can be propagated back to update the CT. We demonstrate BridgeSplat’s effectiveness on visceral pig surgeries and synthetic data of a human liver under simulation, showing sensible deformations of the preoperative CT on monocular RGB data. Code, data, and additional resources can be found at https://maxfehrentz.github.io/ct-informed-splatting/ .
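Parametrizing a Gaussian relative to its parent mesh triangle means its position follows the triangle when the mesh deforms. One simple way to express this, sketched below, is barycentric coordinates plus an offset along the triangle normal. The abstract does not give the exact parametrization, so `gaussian_center`, the barycentric weights, and the normal offset are assumptions for illustration.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def gaussian_center(triangle, barycentric, normal_offset):
    """World-space centre of a Gaussian attached to one mesh triangle.

    triangle is three 3D vertices; barycentric is three weights summing to 1;
    normal_offset is a signed distance along the unit triangle normal.
    """
    v0, v1, v2 = triangle
    normal = _cross(_sub(v1, v0), _sub(v2, v0))
    length = math.sqrt(sum(n * n for n in normal))
    unit = [n / length for n in normal]
    b0, b1, b2 = barycentric
    # Point on the triangle, then lifted off the surface along the normal.
    return [b0 * v0[i] + b1 * v1[i] + b2 * v2[i] + normal_offset * unit[i]
            for i in range(3)]

rest = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
moved = [[x, y, z + 2.0] for x, y, z in rest]  # mesh deformation: lift by 2
before = gaussian_center(rest, (1 / 3, 1 / 3, 1 / 3), 0.5)
after = gaussian_center(moved, (1 / 3, 1 / 3, 1 / 3), 0.5)
```

Because only the triangle vertices change between `before` and `after`, any deformation estimated for the Gaussians is expressed directly on the mesh, which is what allows it to be carried back to the CT.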

[60] Source-Free Domain Adaptive Semantic Segmentation of Remote Sensing Images with Diffusion-Guided Label Enrichment

Wenjie Liu,Hongmin Liu,Lixin Zhang,Bin Fan

Main category: cs.CV

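TL;DR: The paper proposes DGLE, a new method for source-free domain adaptation in semantic segmentation of remote sensing images. It uses a diffusion model to improve pseudo-label generation and significantly boosts target-domain performance.

Details Motivation: Source-free domain adaptation (SFDA) is a practical problem in remote sensing semantic segmentation, but existing methods struggle to optimize the full, noisy pseudo-label set directly. The paper addresses this by generating high-quality pseudo-labels and propagating them with diffusion.

Contribution: The DGLE framework, which uses the denoising and modeling capacity of diffusion models to start from an incomplete set of high-quality seed pseudo-labels and generate a complete, high-quality pseudo-label set, significantly improving domain adaptation performance.

Method: 1. A pseudo-label fusion method based on confidence filtering and super-resolution enhancement produces high-quality seed pseudo-labels. 2. A diffusion model propagates the seed pseudo-labels into a complete, high-quality pseudo-label set.

Result: The DGLE framework improves pseudo-label quality and thereby significantly improves model performance in the target domain.

Insight: The denoising and modeling capacity of diffusion models offers an efficient route to pseudo-label generation and avoids the difficulty of optimizing noisy labels directly.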

Abstract: Research on unsupervised domain adaptation (UDA) for semantic segmentation of remote sensing images has been extensively conducted. However, research on how to achieve domain adaptation in practical scenarios where source domain data is inaccessible, namely source-free domain adaptation (SFDA), remains limited. Self-training has been widely used in SFDA, which requires obtaining as many high-quality pseudo-labels as possible to train models on target domain data. Most existing methods optimize the entire pseudo-label set to obtain more supervisory information. However, as pseudo-label sets often contain substantial noise, simultaneously optimizing all labels is challenging. This limitation undermines the effectiveness of optimization approaches and thus restricts the performance of self-training. To address this, we propose a novel pseudo-label optimization framework called Diffusion-Guided Label Enrichment (DGLE), which starts from a few easily obtained high-quality pseudo-labels and propagates them to a complete set of pseudo-labels while ensuring the quality of newly generated labels. First, a pseudo-label fusion method based on confidence filtering and super-resolution enhancement is proposed, which utilizes cross-validation of details and contextual information to obtain a small number of high-quality pseudo-labels as initial seeds. Then, we leverage the diffusion model to propagate incomplete seed pseudo-labels with irregular distributions, owing to its strong denoising capability for randomly distributed noise and powerful modeling capacity for complex distributions, thereby generating complete and high-quality pseudo-labels. This method effectively avoids the difficulty of directly optimizing the complete set of pseudo-labels, significantly improves the quality of pseudo-labels, and thus enhances the model’s performance in the target domain.

[61] Hyperbolic Coarse-to-Fine Few-Shot Class-Incremental Learning

Jiaxin Dai,Xiang Xiang

Main category: cs.CV

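TL;DR: The paper proposes a coarse-to-fine few-shot class-incremental learning method implemented in hyperbolic space, using the properties of hyperbolic space to improve hierarchical data representation and classification.

Details Motivation: Conventional Euclidean space is limited in representing hierarchical data, whereas hyperbolic space, with its hierarchical structure, has stronger representation capacity. The paper uses this advantage to address overfitting and classification performance in few-shot class-incremental learning.

Contribution: 1. Embedding the feature extractor in hyperbolic space (the Poincaré ball model) to improve hierarchical data representation; 2. a hyperbolic contrastive loss and hyperbolic fully-connected layers for training and classification in hyperbolic space; 3. maximum entropy distribution estimation in hyperbolic space, which generates augmented features to mitigate overfitting in few-shot training.

Method: 1. Map image features into hyperbolic space with the Poincaré ball model; 2. use a hyperbolic contrastive loss and classification layers for optimization and classification in hyperbolic space; 3. generate augmented features from the maximum entropy distribution to extend the few-shot training data.

Result: On C2FSCIL benchmarks, the method improves both coarse-class and fine-class accuracy.

Insight: An effective hyperbolic embedding improves hierarchical data representation and model generalization, and it can reduce overfitting notably in few-shot settings.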

Abstract: In the field of machine learning, hyperbolic space demonstrates superior representation capabilities for hierarchical data compared to conventional Euclidean space. This work focuses on the Coarse-To-Fine Few-Shot Class-Incremental Learning (C2FSCIL) task. Our study follows the Knowe approach, which contrastively learns coarse class labels and subsequently normalizes and freezes the classifier weights of learned fine classes in the embedding space. To better interpret the “coarse-to-fine” paradigm, we propose embedding the feature extractor into hyperbolic space. Specifically, we employ the Poincaré ball model of hyperbolic space, enabling the feature extractor to transform input images into feature vectors within the Poincaré ball instead of Euclidean space. We further introduce hyperbolic contrastive loss and hyperbolic fully-connected layers to facilitate model optimization and classification in hyperbolic space. Additionally, to enhance performance under few-shot conditions, we implement maximum entropy distribution in hyperbolic space to estimate the probability distribution of fine-class feature vectors. This allows generation of augmented features from the distribution to mitigate overfitting during training with limited samples. Experiments on C2FSCIL benchmarks show that our method effectively improves both coarse and fine class accuracies.
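For readers unfamiliar with the Poincaré ball, the two standard operations a hyperbolic feature extractor relies on are the exponential map at the origin, which sends a Euclidean vector into the ball, and the geodesic distance. The sketch below uses the textbook formulas with curvature parameter `c`. The function names are illustrative, and nothing here reproduces the paper's specific layers or loss.

```python
import math

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    factor = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [factor * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the ball of curvature -c."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    x_sq = sum(a * a for a in x)
    y_sq = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * diff_sq / ((1.0 - c * x_sq) * (1.0 - c * y_sq))
    return math.acosh(arg) / math.sqrt(c)

point = expmap0([3.0, 4.0])                      # lands strictly inside the unit ball
d = poincare_distance([0.0, 0.0], [0.5, 0.0])    # equals 2 * atanh(0.5) for c = 1
```

Distances grow rapidly near the boundary of the ball, which is why tree-like hierarchies (coarse classes near the centre, fine classes further out) embed with low distortion.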

[62] GeoRemover: Removing Objects and Their Causal Visual Artifacts

Zixin Zhu,Haoxiang Li,Xuelu Feng,He Wu,Chunming Qiao,Junsong Yuan

Main category: cs.CV

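TL;DR: GeoRemover is a two-stage geometry-aware framework that removes target objects together with their causal visual artifacts (such as shadows and reflections). It achieves more natural edits through geometry removal followed by appearance rendering.

Details Motivation: Existing methods either fail to handle an object's causal visual artifacts (such as shadows and reflections) or lack controllability and erase beyond the target region. GeoRemover addresses both by decoupling geometry from appearance.

Contribution: A geometry-aware two-stage framework (geometry removal and appearance rendering) that removes objects and their causal artifacts with high quality through strictly mask-aligned geometry removal and conditional rendering.

Method: 1. Geometry stage: remove the object's geometry (e.g., depth) under strictly mask-aligned supervision. 2. Appearance stage: render the RGB image conditioned on the updated geometry. A preference-driven learning objective guides geometry removal.

Result: On two popular benchmarks, GeoRemover achieves state-of-the-art performance in removing objects and their causal artifacts.

Insight: Decoupling geometry from appearance is key to natural object removal, and causal artifacts can be eliminated implicitly by editing the geometry.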

Abstract: Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object’s geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.

[63] SEGA: A Transferable Signed Ensemble Gaussian Black-Box Attack against No-Reference Image Quality Assessment Models

Yujia Liu,Dingquan Li,Tiejun Huang

Main category: cs.CV

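TL;DR: The paper proposes SEGA, a black-box attack that uses Gaussian smoothing and gradient ensembling to improve the transferability of adversarial examples to unknown NR-IQA models.

Details Motivation: Existing adversarial attacks work well against NR-IQA models in white-box settings but perform poorly in black-box settings because of weak transferability. The paper addresses this challenge.

Contribution: SEGA, the first transferable black-box attack designed for NR-IQA models, which improves the generalization of adversarial examples through Gaussian smoothing and gradient ensembling.

Method: SEGA applies Gaussian smoothing to the gradients of source models and ensembles them to approximate the target model's gradient. A perturbation filter mask keeps the attack imperceptible.

Result: Experiments on the CLIVE dataset show that SEGA's adversarial examples transfer better than those of existing methods.

Insight: Gaussian smoothing and gradient ensembling effectively improve the transferability of black-box attacks, and perturbation filtering helps keep the attack inconspicuous.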

Abstract: No-Reference Image Quality Assessment (NR-IQA) models play an important role in various real-world applications. Recently, adversarial attacks against NR-IQA models have attracted increasing attention, as they provide valuable insights for revealing model vulnerabilities and guiding robust system design. Some effective attacks have been proposed against NR-IQA models in white-box settings, where the attacker has full access to the target model. However, these attacks often suffer from poor transferability to unknown target models in more realistic black-box scenarios, where the target model is inaccessible. This work makes the first attempt to address the challenge of low transferability in attacking NR-IQA models by proposing a transferable Signed Ensemble Gaussian black-box Attack (SEGA). The main idea is to approximate the gradient of the target model by applying Gaussian smoothing to source models and ensembling their smoothed gradients. To ensure the imperceptibility of adversarial perturbations, SEGA further removes inappropriate perturbations using a specially designed perturbation filter mask. Experimental results on the CLIVE dataset demonstrate the superior transferability of SEGA, validating its effectiveness in enabling successful transfer-based black-box attacks against NR-IQA models.
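The main idea, estimating Gaussian-smoothed gradients on several source models, ensembling them, and keeping only the sign, can be sketched with a zeroth-order estimator. This is a toy illustration under several assumptions and not the SEGA algorithm itself: the antithetic sampling estimator, the averaging before the sign, the binary `mask`, and the linear stand-in "models" are all hypothetical choices.

```python
import random

def smoothed_gradient(model, x, sigma=0.1, n_samples=2000, rng=None):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed model."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = model([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = model([xi - sigma * ui for xi, ui in zip(x, u)])
        coeff = (plus - minus) / (2.0 * sigma)  # antithetic pair reduces variance
        for i, ui in enumerate(u):
            grad[i] += coeff * ui / n_samples
    return grad

def signed_ensemble_step(source_models, x, mask, epsilon=0.01, seed=0):
    """One perturbation: sign of the summed smoothed gradients, then masked."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for model in source_models:
        grad = smoothed_gradient(model, x, rng=rng)
        total = [t + g for t, g in zip(total, grad)]
    signs = [(t > 0) - (t < 0) for t in total]
    # mask[i] == 0 suppresses the perturbation where it would be too visible.
    return [epsilon * s * m for s, m in zip(signs, mask)]

# Two linear stand-ins for source quality predictors (purely illustrative).
model_a = lambda x: 1.0 * x[0] - 2.0 * x[1] + 0.5 * x[2]
model_b = lambda x: 2.0 * x[0] - 1.0 * x[1] + 1.0 * x[2]
step = signed_ensemble_step([model_a, model_b], [0.2, 0.4, 0.6], mask=[1, 1, 0])
```

Only forward evaluations of the source models are needed here, and the third pixel receives no perturbation because the mask zeroes it out.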

[64] HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles

Mohammad Junayed Hasan,Nabeel Mohammed,Shafin Rahman,Philipp Koehn

Main category: cs.CV

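TL;DR: HadaSmileNet is a feature fusion framework based on the Hadamard product that combines deep-learning and handcrafted features, significantly improving genuine-smile recognition while reducing computational complexity.

Details Motivation: Distinguishing genuine from posed expressions is an important pattern recognition challenge. Existing methods are computationally inefficient because of multi-task supervision and complex loss balancing.

Contribution: 1. A parameter-free Hadamard product fusion method; 2. state-of-the-art performance on four benchmark datasets; 3. a 26% reduction in parameters and simpler training.

Method: Transformer-based features and physiologically grounded handcrafted D-Marker features are fused directly with a Hadamard product, after evaluating 15 fusion strategies.

Result: The method achieves UvA-NEMO (88.7%), MMI (99.7%), SPOS (98.5%), and BBC (100%).

Insight: The Hadamard product enables efficient feature interaction, and incorporating domain knowledge strengthens discriminative power, which suits real-time affective computing.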

Abstract: The distinction between genuine and posed emotions represents a fundamental pattern recognition challenge with significant implications for data mining applications in social sciences, healthcare, and human-computer interaction. While recent multi-task learning frameworks have shown promise in combining deep learning architectures with handcrafted D-Marker features for smile facial emotion recognition, these approaches exhibit computational inefficiencies due to auxiliary task supervision and complex loss balancing requirements. This paper introduces HadaSmileNet, a novel feature fusion framework that directly integrates transformer-based representations with physiologically grounded D-Markers through parameter-free multiplicative interactions. Through systematic evaluation of 15 fusion strategies, we demonstrate that Hadamard multiplicative fusion achieves optimal performance by enabling direct feature interactions while maintaining computational efficiency. The proposed approach establishes new state-of-the-art results for deep learning methods across four benchmark datasets: UvA-NEMO (88.7 percent, +0.8), MMI (99.7 percent), SPOS (98.5 percent, +0.7), and BBC (100 percent, +5.0). Comprehensive computational analysis reveals 26 percent parameter reduction and simplified training compared to multi-task alternatives, while feature visualization demonstrates enhanced discriminative power through direct domain knowledge integration. The framework’s efficiency and effectiveness make it particularly suitable for practical deployment in multimedia data mining applications that require real-time affective computing capabilities.
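Hadamard fusion is an element-wise product, so it adds no trainable parameters at the fusion step, but both inputs must have the same length. The sketch below shows only that operation. How the transformer features and D-Markers are brought to a common dimension is not stated in the abstract, so the equal-length inputs and the function name `hadamard_fuse` are assumptions.

```python
def hadamard_fuse(deep_features, handcrafted_features):
    """Element-wise (Hadamard) product of two equally sized feature vectors."""
    if len(deep_features) != len(handcrafted_features):
        raise ValueError("Hadamard fusion needs vectors of equal length")
    return [d * h for d, h in zip(deep_features, handcrafted_features)]

# Toy values: a deep embedding gated by hypothetical D-Marker responses.
fused = hadamard_fuse([0.5, -1.0, 2.0], [2.0, 0.0, 0.5])
```

A handcrafted value near zero suppresses the matching deep feature and a large value amplifies it, which is the direct multiplicative interaction the abstract describes.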

[65] Event-guided 3D Gaussian Splatting for Dynamic Human and Scene Reconstruction

Xiaoting Yin,Hao Shi,Kailun Yang,Jiajun Zhai,Shangwei Guo,Lin Wang,Kaiwei Wang

Main category: cs.CV

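TL;DR: The paper proposes a framework based on an event camera and 3D Gaussian Splatting that jointly reconstructs a dynamic human and a static scene. A semantic attribute separates dynamic from static regions, and an event-guided loss improves local fidelity in fast-moving regions, yielding notable gains under blur.

Details Motivation: Jointly reconstructing a dynamic human and a static scene from monocular video is difficult under fast motion because of motion blur. The high temporal resolution of event cameras helps address this problem.

Contribution: 1. Proposes a unified framework based on 3D Gaussian Splatting that distinguishes the dynamic human from the static scene;
2. Introduces an event-guided loss that improves fidelity in fast-moving regions;
3. Requires no external human masks and simplifies the management of Gaussian sets.

Method: A semantic attribute splits the 3D Gaussians into dynamic-human and static-scene parts, and only the dynamic Gaussians are deformed. A loss built on event-camera data matches the brightness changes between consecutive renderings to the event stream.

Result: Achieves SOTA performance on the ZJU-MoCap-Blur and MMHPSD-Blur datasets, with notably higher PSNR/SSIM and lower LPIPS, especially in high-speed scenes.

Insight: Combining event cameras with 3D Gaussian Splatting offers a new approach to dynamic reconstruction, and the semantic attribute reduces modeling complexity while improving performance.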

Abstract: Reconstructing dynamic humans together with static scenes from monocular videos remains difficult, especially under fast motion, where RGB frames suffer from motion blur. Event cameras exhibit distinct advantages, e.g., microsecond temporal resolution, making them a superior sensing choice for dynamic human reconstruction. Accordingly, we present a novel event-guided human-scene reconstruction framework that jointly models human and scene from a single monocular event camera via 3D Gaussian Splatting. Specifically, a unified set of 3D Gaussians carries a learnable semantic attribute; only Gaussians classified as human undergo deformation for animation, while scene Gaussians stay static. To combat blur, we propose an event-guided loss that matches simulated brightness changes between consecutive renderings with the event stream, improving local fidelity in fast-moving regions. Our approach removes the need for external human masks and simplifies managing separate Gaussian sets. On two benchmark datasets, ZJU-MoCap-Blur and MMHPSD-Blur, it delivers state-of-the-art human-scene reconstruction, with notable gains over strong baselines in PSNR/SSIM and reduced LPIPS, especially for high-speed subjects.
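One common way to write an event-guided loss of the kind the abstract describes is sketched below. It assumes the standard event-camera model, in which each event marks a log-brightness change of one contrast threshold. The threshold value, the squared-error form, and every name in the code are assumptions of this sketch; the abstract does not specify them.

```python
import numpy as np

def event_guided_loss(render_prev, render_next, event_count, contrast=0.2, eps=1e-6):
    """Mean squared error between simulated and event-derived brightness change.

    render_prev, render_next: consecutive rendered grayscale frames, values > 0.
    event_count: per-pixel signed sum of event polarities between the two frames.
    contrast: assumed contrast threshold of the event camera (hypothetical value).
    """
    simulated = np.log(render_next + eps) - np.log(render_prev + eps)
    observed = contrast * event_count
    return float(np.mean((simulated - observed) ** 2))

# Toy check: brightness rises by exp(0.4), which is two events at threshold 0.2.
prev = np.full((2, 2), 0.5)
nxt = prev * np.exp(0.4)
loss_match = event_guided_loss(prev, nxt, np.full((2, 2), 2.0))
loss_miss = event_guided_loss(prev, nxt, np.zeros((2, 2)))
```

The loss is near zero when the renderings reproduce the brightness change implied by the events, and grows when they do not.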

[66] Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought

Yuhan Wang,Cheng Liu,Zihan Zhao,Weichao Wu

Main category: cs.CV

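TL;DR: Live-E2T is a real-time threat monitoring framework that balances real-time performance and explainability through structured semantic tuples, an efficient deduplication mechanism, and an LLM fine-tuned with Chain-of-Thought.

Details Motivation: Existing methods struggle to satisfy real-time and explainability requirements at the same time. Live-E2T aims to resolve this with new mechanisms.

Contribution: 1. Structures video frames into semantic tuples; 2. Provides an efficient online event deduplication mechanism; 3. Fine-tunes an LLM with Chain-of-Thought for transparent reasoning.

Method: 1. Decompose video frames into Human-Object-Interaction-Place tuples; 2. Deduplicate events online; 3. Fine-tune an LLM with Chain-of-Thought to generate threat reports.

Result: On the XD-Violence and UCF-Crime datasets, detection accuracy, real-time efficiency, and explainability are significantly better than SOTA.

Insight: Structured decomposition and semantic compression are key to real-time video analysis, and the logical reasoning ability of LLMs can strengthen explainability.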

Abstract: Real-time threat monitoring identifies threatening behaviors in video streams and provides reasoning and assessment of threat events through explanatory text. However, prevailing methodologies, whether based on supervised learning or generative models, struggle to concurrently satisfy the demanding requirements of real-time performance and decision explainability. To bridge this gap, we introduce Live-E2T, a novel framework that unifies these two objectives through three synergistic mechanisms. First, we deconstruct video frames into structured Human-Object-Interaction-Place semantic tuples. This approach creates a compact, semantically focused representation, circumventing the information degradation common in conventional feature compression. Second, an efficient online event deduplication and updating mechanism is proposed to filter spatio-temporal redundancies, ensuring the system’s real-time responsiveness. Finally, we fine-tune a Large Language Model using a Chain-of-Thought strategy, endowing it with the capability for transparent and logical reasoning over event sequences to produce coherent threat assessment reports. Extensive experiments on benchmark datasets, including XD-Violence and UCF-Crime, demonstrate that Live-E2T significantly outperforms state-of-the-art methods in terms of threat detection accuracy, real-time efficiency, and the crucial dimension of explainability.
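A rough sketch of online deduplication over Human-Object-Interaction-Place tuples follows. The abstract does not describe the actual mechanism, so the exact-match key, the `max_gap` time window, and all class and field names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple

class HOIP(NamedTuple):
    """Human-Object-Interaction-Place tuple extracted from one frame."""
    human: str
    obj: str
    interaction: str
    place: str

@dataclass
class Event:
    key: HOIP
    start: float  # timestamp of first observation
    end: float    # timestamp of most recent observation

class OnlineDeduplicator:
    """Merge repeated tuples seen within max_gap seconds into a single event."""

    def __init__(self, max_gap: float = 2.0):
        self.max_gap = max_gap
        self.active: Dict[HOIP, Event] = {}
        self.events: List[Event] = []

    def push(self, key: HOIP, t: float) -> bool:
        """Return True if a new event was created, False if one was extended."""
        ev = self.active.get(key)
        if ev is not None and t - ev.end <= self.max_gap:
            ev.end = t  # update the existing event instead of duplicating it
            return False
        ev = Event(key, t, t)
        self.active[key] = ev
        self.events.append(ev)
        return True

dedup = OnlineDeduplicator(max_gap=2.0)
grab = HOIP("person", "bag", "grabs", "street")
flags = [dedup.push(grab, t) for t in (0.0, 1.0, 2.5, 10.0)]
```

Here four observations of the same tuple collapse into two events, so a downstream language model would reason over two entries rather than four.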

[67] The Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Daiqing Qi,Handong Zhao,Jing Shi,Simon Jenni,Yifei Fan,Franck Dernoncourt,Scott Cohen,Sheng Li

Main category: cs.CV

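TL;DR: The paper introduces the PhotoCritique dataset and the PhotoEye model to improve aesthetic visual understanding in multimodal large language models (MLLMs), and presents a professional benchmark, PhotoBench.

Details Motivation: Current MLLMs fall well short in aesthetic visual understanding (color, composition, and so on), and their performance is especially limited in professional photography scenarios. The paper aims to address this with professional data and methods.

Contribution: 1. Introduces the professional PhotoCritique dataset; 2. Proposes the PhotoEye model, which has a language-guided multi-view vision fusion mechanism; 3. Designs PhotoBench, a professional benchmark for aesthetic evaluation.

Method: A language-guided multi-view vision fusion mechanism interprets image aesthetics from multiple perspectives, and the model is trained on the PhotoCritique dataset.

Result: On existing benchmarks and PhotoBench, PhotoEye outperforms existing models.

Insight: Professional data and methods are key to improving aesthetic understanding in MLLMs, and multi-view fusion captures aesthetic features more comprehensively.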

Abstract: While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component, a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.

[68] Enhancing Video Object Segmentation in TrackRAD Using XMem Memory Network

Pengchao Deng,Shengqi Chen

Main category: cs.CV

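TL;DR: The paper presents a tumor segmentation framework for real-time MRI-guided radiotherapy based on the XMem memory network, targeting long-sequence MRI tumor segmentation in the TrackRAD2025 challenge. The experimental records were lost, but preliminary impressions suggest that it meets the clinical real-time requirement.

Details Motivation: Real-time MRI-guided radiotherapy requires precise tumor segmentation and tracking to improve the accuracy and safety of cancer treatment. Existing methods struggle with long-sequence segmentation and real-time performance, especially when annotated data is limited.

Contribution: The main contribution is a tumor segmentation framework based on the XMem memory network that tracks tumor motion across long MRI sequences and meets clinical real-time needs.

Method: The framework adopts the XMem memory-network architecture, whose memory mechanism strengthens retention over long sequences and supports real-time tracking of tumor motion.

Result: No quantitative results are reported because the experimental records were lost. Preliminary impressions during development indicate reasonable segmentation performance and real-time speed that meets clinical needs.

Insight: The XMem memory network appears able to segment well even with limited annotated data, which suggests potential in medical image analysis, especially where real-time operation is required.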

Abstract: This paper presents an advanced tumor segmentation framework for real-time MRI-guided radiotherapy, designed for the TrackRAD2025 challenge. Our method leverages the XMem model, a memory-augmented architecture, to segment tumors across long cine-MRI sequences. The proposed system efficiently integrates memory mechanisms to track tumor motion in real time, achieving high segmentation accuracy even under challenging conditions with limited annotated data. Unfortunately, the detailed experimental records have been lost, preventing us from reporting precise quantitative results at this stage. Nevertheless, from our preliminary impressions during development, the XMem-based framework demonstrated reasonable segmentation performance and satisfied the clinical real-time requirement. Our work contributes to improving the precision of tumor tracking during MRI-guided radiotherapy, which is crucial for enhancing the accuracy and safety of cancer treatments.

[69] SSCM: A Spatial-Semantic Consistent Model for Multi-Contrast MRI Super-Resolution

Xiaoman Wu,Lubin Gan,Siying Wu,Jing Zhang,Yunwei Ou,Xiaoyan Sun

Main category: cs.CV

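TL;DR: SSCM is a spatial-semantic consistent model for multi-contrast MRI super-resolution. It uses dynamic spatial warping, semantic-aware token aggregation, and spatial-frequency fusion to produce structurally consistent high-resolution reconstructions with recovered detail.

Details Motivation: The main difficulty in multi-contrast MRI super-resolution is maintaining spatial-semantic consistency. Existing methods model it insufficiently and underuse frequency-domain information, so detail recovery is poor.

Contribution: 1. Proposes a Dynamic Spatial Warping Module for spatial alignment; 2. Proposes a Semantic-Aware Token Aggregation Block to maintain long-range semantic consistency; 3. Proposes a Spatial-Frequency Fusion Block to restore fine structures.

Method: SSCM combines the Dynamic Spatial Warping Module, the Semantic-Aware Token Aggregation Block, and the Spatial-Frequency Fusion Block to perform multi-contrast MRI super-resolution end to end.

Result: On public and private datasets, SSCM achieves state-of-the-art performance with fewer parameters while keeping reconstructions spatially and semantically consistent.

Insight: 1. Dynamic spatial alignment is crucial in multi-contrast MRI; 2. Fully exploiting frequency-domain information significantly improves detail recovery; 3. Combining spatial and semantic modeling enables high-quality medical image reconstruction.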

Abstract: Multi-contrast Magnetic Resonance Imaging super-resolution (MC-MRI SR) aims to enhance low-resolution (LR) contrasts leveraging high-resolution (HR) references, shortening acquisition time and improving imaging efficiency while preserving anatomical details. The main challenge lies in maintaining spatial-semantic consistency, ensuring anatomical structures remain well-aligned and coherent despite structural discrepancies and motion between the target and reference images. Conventional methods insufficiently model spatial-semantic consistency and underuse frequency-domain information, which leads to poor fine-grained alignment and inadequate recovery of high-frequency details. In this paper, we propose the Spatial-Semantic Consistent Model (SSCM), which integrates a Dynamic Spatial Warping Module for inter-contrast spatial alignment, a Semantic-Aware Token Aggregation Block for long-range semantic consistency, and a Spatial-Frequency Fusion Block for fine structure restoration. Experiments on public and private datasets show that SSCM achieves state-of-the-art performance with fewer parameters while ensuring spatially and semantically consistent reconstructions.

[70] OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Zhuoxiao Chen,Hongyang Yu,Ying Xu,Yadan Luo,Long Duong,Yuan-Fang Li

Main category: cs.CV

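TL;DR: OraPO is a reinforcement learning method based on lightweight oracle guidance and a FactScore reward. It markedly improves the efficiency and clinical accuracy of radiology report generation.

Details Motivation: Existing radiology report generation methods typically require large-scale data and compute. OraPO targets the task under constrained budgets.

Contribution: 1. Proposes the OraPO framework, which converts failed explorations into supervision signals through oracle guidance; 2. Designs a dense FactScore-based reward that improves the clinical accuracy of generated reports.

Method: Combines lightweight oracle guidance (turning failed RL explorations into supervision signals) with the FactScore reward (extracting clinical facts and checking them against ground-truth labels).

Result: Achieves SOTA on the CheXpert Plus dataset (F1 = 0.341) while requiring only a small amount of training data and compute.

Insight: Efficient RL design and reward mechanisms can substantially reduce dependence on large-scale data and compute.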

Abstract: Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2–3 orders of magnitude less training data using a small base VLM on modest hardware.
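The sentence-level reward idea can be sketched as follows. This is a deliberately simplified stand-in rather than the FactS implementation: fact extraction is assumed to have happened already, the entailment check is replaced by a plain dictionary lookup against the labels, and all names and values are hypothetical.

```python
from typing import Dict, List, Tuple

Fact = Tuple[str, bool]  # (finding name, asserted as present?)

def sentence_rewards(sentence_facts: List[List[Fact]],
                     labels: Dict[str, bool]) -> List[float]:
    """Score each sentence by the fraction of its facts that agree with the labels.

    A sentence with no extracted facts receives 0.0 (a choice made for this sketch).
    Facts about findings missing from the labels count as unsupported.
    """
    rewards = []
    for facts in sentence_facts:
        if not facts:
            rewards.append(0.0)
            continue
        supported = sum(1 for name, present in facts if labels.get(name) == present)
        rewards.append(supported / len(facts))
    return rewards

# Toy example with made-up ground-truth labels and pre-extracted facts.
labels = {"cardiomegaly": True, "pleural effusion": False}
report = [
    [("cardiomegaly", True)],                              # fully supported
    [("pleural effusion", True), ("cardiomegaly", True)],  # half supported
    [],                                                    # no clinical facts
]
rewards = sentence_rewards(report, labels)
```

Scoring each sentence separately gives one reward per sentence instead of a single report-level score, which is the sense in which the abstract calls the reward dense and interpretable.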

[71] Training-Free Multi-Style Fusion Through Reference-Based Adaptive Modulation

Xu Liu,Yibo Lu,Xinxian Wang,Xinyu Wu

Main category: cs.CV

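TL;DR: AMSF is a training-free, reference-based multi-style fusion framework that achieves controllable multi-style fusion through adaptive modulation.

Details Motivation: Existing methods usually accept only a single style image and lack controllability and scalability for multi-style fusion.

Contribution: Proposes the training-free AMSF framework, which fuses multiple reference style images and balances styles controllably through adaptive modulation.

Method: A semantic token decomposition module encodes the style images and text prompts, and a similarity-aware re-weighting module recalibrates the style weights at every denoising step.

Result: AMSF outperforms existing methods on multi-style fusion and scales seamlessly to two or more styles.

Insight: AMSF offers a practical solution for expressive multi-style generation in diffusion models.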

Abstract: We propose Adaptive Multi-Style Fusion (AMSF), a reference-based training-free framework that enables controllable fusion of multiple reference styles in diffusion models. Most of the existing reference-based methods are limited by (a) acceptance of only one style image, thus prohibiting hybrid aesthetics and scalability to more styles, and (b) lack of a principled mechanism to balance several stylistic influences. AMSF mitigates these challenges by encoding all style images and textual hints with a semantic token decomposition module that is adaptively injected into every cross-attention layer of a frozen diffusion model. A similarity-aware re-weighting module then recalibrates, at each denoising step, the attention allocated to every style component, yielding balanced and user-controllable blends without any fine-tuning or external adapters. Both qualitative and quantitative evaluations show that AMSF produces multi-style fusion results that consistently outperform the state-of-the-art approaches, while its fusion design scales seamlessly to two or more styles. These capabilities position AMSF as a practical step toward expressive multi-style generation in diffusion models.
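One plausible reading of similarity-aware re-weighting is sketched below: at a given denoising step, each style's weight is a softmax over its cosine similarity to the current query feature, biased by a user-chosen prior. The abstract does not give the formula, so the cosine measure, the softmax, the temperature, and all names are assumptions of this sketch.

```python
import numpy as np

def reweight_styles(query, style_tokens, user_weights, temperature=1.0):
    """Return normalized per-style weights for one denoising step.

    query: (D,) feature of the current latent or attention query.
    style_tokens: (S, D) one token per reference style.
    user_weights: (S,) positive prior expressing the user's desired balance.
    """
    query = np.asarray(query, dtype=float)
    style_tokens = np.asarray(style_tokens, dtype=float)
    user_weights = np.asarray(user_weights, dtype=float)
    q = query / np.linalg.norm(query)
    s = style_tokens / np.linalg.norm(style_tokens, axis=1, keepdims=True)
    sims = s @ q                                  # cosine similarity per style
    logits = sims / temperature + np.log(user_weights)
    logits = logits - logits.max()                # for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Two styles, equal user prior; the query is aligned with the first style.
weights = reweight_styles([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Because the weights are recomputed from the current features, the balance between styles can shift from step to step while the user prior still tilts the blend.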

[72] MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

Yuzhi Wu,Li Xiao,Jun Liu,Guangfeng Jiang,XiangGen Xia

Main category: cs.CV

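TL;DR: MLF-4DRCNet is a multi-level fusion framework that combines 4D radar and camera data for 3D object detection in autonomous driving. It addresses the sparsity and noise of radar point clouds and approaches the performance of LiDAR-based methods.

Details Motivation: 4D radar is low-cost and robust, but its sparse, noisy point clouds limit its standalone use in 3D object detection. Most existing radar-camera fusion methods follow the Bird's-Eye-View paradigm, overlook the geometric deficiencies of radar point clouds, and perform only coarse scene-level fusion.

Contribution: Proposes the MLF-4DRCNet framework, which achieves comprehensive feature representation through point-, scene-, and proposal-level fusion, and designs three key modules (ERPE, HSFP, and PLFE) to improve performance.

Method: 1. The ERPE module densifies the radar point cloud with 2D image instances and encodes it with the Triple-Attention Voxel Feature Encoder; 2. The HSFP module dynamically fuses multi-modal features with deformable attention; 3. The PLFE module refines region proposals and further integrates features.

Result: Achieves SOTA performance on the VoD and TJ4DRadSet datasets, and is comparable to LiDAR-based methods on VoD.

Insight: A multi-level fusion strategy effectively compensates for the deficiencies of radar point clouds, and adding visual information enables low-cost 3D detection. The Triple-Attention mechanism and deformable attention are key to the performance gains.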

Abstract: The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird’s-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar’s inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point level, ERPE densifies radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context and applies pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates them with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.

[73] Learning neuroimaging models from health system-scale data

Yiwei Lyu,Samir Harake,Asadur Chowdury,Soumyanil Banerjee,Rachel Gologorsky,Shixuan Liu,Anna-Katharina Meissner,Akshay Rao,Chenhui Zhao,Akhil Kondepudi,Cheng Jiang,Xinhai Hou,Rushikesh S. Joshi,Volker Neuschmelting,Ashok Srinivasan,Dawn Kleindorfer,Brian Athey,Vikas Gulani,Aditya Pandey,Honglak Lee,Todd Hollon

Main category: cs.CV

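TL;DR: The paper presents Prima, a vision language model (VLM) built on large-scale MRI studies, to improve the diagnostic efficiency and fairness of neuroimaging.

Details Motivation: Global demand for MRI studies keeps rising, which strains health systems and prolongs turnaround times, especially in low-resource and rural settings. An efficient AI tool is therefore needed to address these problems.

Contribution: Prima is the first vision language model that supports real-world clinical MRI studies as input. It performs well across a broad range of neurologic diagnoses and provides explainable differential diagnoses and worklist prioritization.

Method: Prima uses a hierarchical vision architecture trained on over 220,000 MRI studies, and was validated in a health system-wide study that included 30,000 MRI studies.

Result: Across 52 radiologic diagnoses from the major neurologic disorders, Prima achieved a mean area under the ROC curve (AUC) of 92.0, outperforming other state-of-the-art general and medical AI models.

Insight: Prima not only improves diagnostic efficiency but can also help mitigate health system biases, with fairness gains that matter most for low-resource populations.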

Abstract: Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout [Chen2017-bt, Rula2024-qp-1]. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima’s role in advancing AI-driven healthcare.
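For readers unfamiliar with the headline metric, a mean area under the ROC curve across several diagnoses can be computed as sketched below. This illustrates the metric only and is not the paper's evaluation code; the toy labels and scores are made up, and the sketch returns a value on a 0 to 1 scale, whereas the abstract's 92.0 appears to be on a 0 to 100 scale.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half. Equivalent to the normalized Mann-Whitney U statistic.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    diff = pos[:, None] - neg[None, :]  # all positive-negative score pairs
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

def mean_auc(label_matrix, score_matrix):
    """Average the per-diagnosis AUC over the columns (one column per diagnosis)."""
    label_matrix = np.asarray(label_matrix)
    score_matrix = np.asarray(score_matrix)
    aucs = [roc_auc(label_matrix[:, j], score_matrix[:, j])
            for j in range(label_matrix.shape[1])]
    return float(np.mean(aucs))

# Toy data: 4 studies, 2 diagnoses.
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.6], [0.3, 0.5]]
result = mean_auc(labels, scores)
```

In the toy data the first diagnosis is ranked perfectly (AUC 1.0) and the second has one mis-ordered pair out of four (AUC 0.75), so the mean is 0.875.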

[74] Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation

Yuanhuiyi Lyu,Chi Kit Wong,Chenfei Liao,Lutao Jiang,Xu Zheng,Zexin Lu,Linfeng Zhang,Xuming Hu

Main category: cs.CV

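TL;DR: The paper proposes Understanding-in-Generation (UiG), a new reasoning framework that injects a unified model's understanding capability into the generation process to improve its image generation.

Details Motivation: Existing CoT-based text-to-image methods separate understanding from generation, which limits how far generative capability can be improved. The authors aim to close this gap by letting understanding guide generation.

Contribution: Proposes the UiG framework, which uses the unified model's understanding capability to refine generated images step by step, introducing image editing as the bridge.

Method: 1. Verify the generated image and fold the model's understanding into the editing instructions; 2. Refine the generated image step by step, infusing understanding into the generation process.

Result: Achieves a 3.92% gain on the long-prompt setting of the TIIF benchmark, significantly outperforming existing methods.

Insight: Tightly integrating understanding and generation can significantly improve a unified model's generative capability, and image editing is an effective tool for doing so.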

Abstract: Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce “Image Editing” as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: https://github.com/QC-LY/UiG

[75] Zero-shot Monocular Metric Depth for Endoscopic Images

Nicolas Toussaint,Emanuele Colleoni,Ricardo Sanchez-Matilla,Joshua Sutcliffe,Vanessa Thompson,Muhammad Asad,Imanol Luengo,Danail Stoyanov

Main category: cs.CV

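TL;DR: The paper addresses zero-shot monocular metric depth estimation for endoscopic images. A benchmark and a synthetic dataset tackle the lack of data in this area, and the paper demonstrates generalization and performance gains in real clinical scenarios.

Details Motivation: The endoscopic imaging field lacks robust benchmarks and high-quality datasets, which hinders the development and application of depth estimation models.

Contribution: 1. Provides a comprehensive benchmark that evaluates existing depth estimation models on endoscopic images; 2. Releases a new synthetic dataset, EndoSynth, containing endoscopic surgical instruments with metric depth and segmentation masks.

Method: 1. Build the synthetic dataset EndoSynth; 2. Fine-tune depth foundation models (for example, transformer-based networks) on the synthetic data; 3. Test zero-shot on real endoscopic images.

Result: The fine-tuned depth foundation models significantly improve depth estimation accuracy on most real datasets.

Insight: Synthetic data can compensate for scarce real data, and fine-tuning foundation models can significantly improve performance in practice.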

Abstract: Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to the sharp advancements in foundation models and in particular transformer based networks. As we start to see applications to the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at https://github.com/TouchSurgery/EndoSynth.

[76] LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection

Lanhu Wu,Zilin Gao,Hao Fei,Mong-Li Lee,Wynne Hsu

Main category: cs.CV

TL;DR: LEAF-Mamba is a novel state space model designed for RGB-D salient object detection, addressing limitations of CNNs and Vision Transformers by combining local emphatic and adaptive fusion modules for efficient cross-modality interaction.

Details Motivation: Current RGB-D SOD methods rely on CNNs (limited by local receptive fields) or Vision Transformers (high computational cost). State space models (SSM) like Mamba offer long-range dependency modeling with linear complexity but struggle with local semantics and cross-modality fusion.

Contribution: 1) Introduces LE-SSM for multi-scale local dependency capture. 2) Proposes an SSM-based AFM for effective cross-modality fusion. Demonstrates superior performance and efficiency over 16 SOTA methods and generalizes well to RGB-T SOD.

Method: LEAF-Mamba combines LE-SSM to enhance local semantics and AFM for adaptive cross-modality fusion, leveraging the strengths of SSMs for efficient long-range modeling.

Result: Outperforms 16 SOTA RGB-D SOD methods in efficacy and efficiency, and shows strong generalization to RGB-T SOD.

Insight: SSMs like Mamba can effectively balance performance and efficiency in multi-modal tasks when enhanced with local and adaptive fusion mechanisms.

Abstract: RGB-D salient object detection (SOD) aims to identify the most conspicuous objects in a scene with the incorporation of depth cues. Existing methods mainly rely on CNNs, which are limited by local receptive fields, or Vision Transformers, which suffer from quadratic complexity, posing a challenge in balancing performance and computational efficiency. Recently, state space models (SSMs) such as Mamba have shown great potential for modeling long-range dependencies with linear complexity. However, directly applying SSMs to RGB-D SOD may lead to deficient local semantics as well as inadequate cross-modality fusion. To address these issues, we propose a Local Emphatic and Adaptive Fusion state space model (LEAF-Mamba) that contains two novel components: 1) a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies for both modalities, and 2) an SSM-based adaptive fusion module (AFM) for complementary cross-modality interaction and reliable cross-modality integration. Extensive experiments demonstrate that LEAF-Mamba consistently outperforms 16 state-of-the-art RGB-D SOD methods in both efficacy and efficiency. Moreover, our method achieves excellent performance on the RGB-T SOD task, demonstrating strong generalization ability.

[77] Lightweight Vision Transformer with Window and Spatial Attention for Food Image Classification

Xinle Gao,Linghui Ye,Zhiyong Xiao

Main category: cs.CV

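TL;DR: Proposes a lightweight Vision Transformer model that combines window multi-head attention with spatial attention for efficient food image classification.

Details Motivation: Food image classification is vital for automated quality control, food safety supervision, and intelligent agriculture, but existing Vision Transformer models have many parameters and high computational complexity.

Contribution: Proposes a lightweight Vision Transformer model that uses window multi-head attention and spatial attention to significantly reduce computational cost and improve classification performance.

Method: Combines a Window Multi-Head Attention Mechanism (WMHAM) and a Spatial Attention Mechanism (SAM). WMHAM captures local and global contextual features through window partitioning, while SAM adaptively emphasizes key spatial regions.

Result: Reaches accuracies of 95.24% on Food-101 and 94.33% on Vireo Food-172 while substantially reducing parameters and computation.

Insight: Optimizing the attention mechanisms can substantially reduce model complexity while maintaining high classification accuracy, which suits resource-constrained environments.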

Abstract: With the rapid development of society and continuous advances in science and technology, the food industry increasingly demands higher production quality and efficiency. Food image classification plays a vital role in enabling automated quality control on production lines, supporting food safety supervision, and promoting intelligent agricultural production. However, this task faces challenges due to the large number of parameters and high computational complexity of Vision Transformer models. To address these issues, we propose a lightweight food image classification algorithm that integrates a Window Multi-Head Attention Mechanism (WMHAM) and a Spatial Attention Mechanism (SAM). The WMHAM reduces computational cost by capturing local and global contextual features through efficient window partitioning, while the SAM adaptively emphasizes key spatial regions to improve discriminative feature representation. Experiments conducted on the Food-101 and Vireo Food-172 datasets demonstrate that our model achieves accuracies of 95.24% and 94.33%, respectively, while significantly reducing parameters and FLOPs compared with baseline methods. These results confirm that the proposed approach achieves an effective balance between computational efficiency and classification performance, making it well-suited for deployment in resource-constrained environments.
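To show why window partitioning lowers attention cost, here is a minimal NumPy sketch of single-head self-attention restricted to non-overlapping windows. It is a generic illustration in the style of windowed Transformers, not the paper's WMHAM: the identity query/key/value projections, the single head, and all names are assumptions.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into (num_windows, window*window, C) tokens."""
    h, w, c = x.shape
    if h % window or w % window:
        raise ValueError("H and W must be divisible by the window size")
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two block indices together
    return x.reshape(-1, window * window, c)

def window_self_attention(x, window):
    """Self-attention computed independently inside each window (Q = K = V = tokens)."""
    tokens = window_partition(x, window)                       # (nW, N, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)       # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ tokens                                       # (nW, N, C)

feat = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
windows = window_partition(feat, 2)
out = window_self_attention(feat, 2)
```

For this 4x4 map, global attention would score 16 x 16 = 256 token pairs, while four 2x2 windows score 4 x (4 x 4) = 64, and the gap widens as the feature map grows.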

[78] OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery

Siyi Chen,Kai Wang,Weicong Pang,Ruiming Yang,Ziru Chen,Renjun Gao,Alexis Kai Hon Lau,Dasa Gu,Chenchen Zhang,Cheng Li

Main category: cs.CV

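TL;DR: The paper presents OSDA, a framework for annotation-free open-set land-cover discovery, segmentation, and description in remote sensing imagery. It combines pixel-level accuracy with high-level semantic understanding to address key challenges in open-world remote sensing interpretation.

Details Motivation: Open-set land-cover analysis in remote sensing imagery requires fine-grained spatial localization and semantically open categorization. Conventional methods rely on supervised learning and struggle with novel categories or unlabeled data.

Contribution: 1. Proposes the OSDA framework, which supports annotation-free open-set land-cover discovery, segmentation, and description; 2. Combines SAM and an MLLM to achieve pixel-level accuracy and high-level semantic understanding; 3. Offers a scalable and interpretable solution for dynamic land-cover monitoring and large-scale earth observation analysis.

Method: 1. Precise discovery and mask extraction with a promptable fine-tuned segmentation model (SAM); 2. Semantic attribution and contextual description via a two-phase fine-tuned multimodal large language model (MLLM); 3. Evaluation of the MLLM outputs with LLM-as-judge and manual scoring.

Result: OSDA performs well across diverse satellite imagery, supports unsupervised classification and dynamic land-cover monitoring, and shows potential for automated map updating and large-scale earth observation analysis.

Insight: A multimodal approach that combines pixel-level segmentation with high-level semantic understanding is an effective route to open-world remote sensing interpretation. The label-free, scalable design is the framework's core strength.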

Abstract: Open-set land-cover analysis in remote sensing requires the ability to achieve fine-grained spatial localization and semantically open categorization. This involves not only detecting and segmenting novel objects without categorical supervision but also assigning them interpretable semantic labels through multimodal reasoning. In this study, we introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. The pipeline consists of: (1) precise discovery and mask extraction with a promptable fine-tuned segmentation model (SAM), (2) semantic attribution and contextual description via a two-phase fine-tuned multimodal large language model (MLLM), and (3) evaluation of the MLLM's outputs through LLM-as-judge and manual scoring. By combining pixel-level accuracy with high-level semantic understanding, OSDA addresses key challenges in open-world remote sensing interpretation. Designed to be architecture-agnostic and label-free, the framework supports robust evaluation across diverse satellite imagery without requiring manual annotation. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.

[79] Overview of PlantCLEF 2021: cross-domain plant identification

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

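TL;DR: PlantCLEF 2021 assesses how digitised herbarium specimens can improve automated identification of plants in data-poor regions such as tropical countries. The challenge is a cross-domain classification task that combines herbarium sheets with a small number of field photos.

Details Motivation: Automated plant identification has advanced markedly for North America and Western Europe, but data are scarce for biodiversity-rich tropical regions. Herbaria hold millions of digitised specimens that can help fill this gap.

Contribution: Designs a cross-domain classification task that combines herbarium specimens with a small number of field photos, and assesses the potential of herbarium data to improve identification.

Method: The dataset covers about 1,000 plant species from the Guiana Shield of South America. The training set contains several hundred thousand herbarium sheets and a few thousand field photos, and the test set consists of field photos.

Result: The task drew several research groups and showed that cross-domain learning is effective, though the handling of sparse data still needs improvement.

Insight: Herbarium specimens are a valuable complementary resource for data-poor regions, but the challenges of cross-domain learning need further study.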

Abstract: Automated plant identification has improved considerably thanks to recent advances in deep learning and the availability of training data with more and more field photos. However, this profusion of data concerns only a few tens of thousands of species, mainly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have systematically collected, catalogued and stored plant specimens in herbaria, especially in tropical regions, and recent efforts by the biodiversity informatics community have made it possible to put millions of digitised records online. The LifeCLEF 2021 plant identification challenge (or “PlantCLEF 2021”) was designed to assess the extent to which automated identification of flora in data-poor regions can be improved by using herbarium collections. It is based on a dataset of about 1,000 species mainly focused on the Guiana Shield of South America, a region known to have one of the highest plant diversities in the world. The challenge was evaluated as a cross-domain classification task where the training set consisted of several hundred thousand herbarium sheets and a few thousand photos to allow learning a correspondence between the two domains. In addition to the usual metadata (location, date, author, taxonomy), the training data also includes the values of 5 morphological and functional traits for each species. The test set consisted exclusively of photos taken in the field. This article presents the resources and evaluations of the assessment carried out, summarises the approaches and systems used by the participating research groups and provides an analysis of the main results.

[80] RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

Ke Li,Di Wang,Ting Wang,Fuyu Dong,Yiming Zhang,Luyao Zhang,Xiangyu Wang,Shaofeng Li,Quan Wang

Main category: cs.CV

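TL;DR: The paper proposes RSVG-ZeroOV, a training-free framework for zero-shot open-vocabulary remote sensing visual grounding that uses frozen foundation models to overcome the limitations of existing methods.

Details Motivation: Existing remote sensing visual grounding methods are usually restricted to closed vocabularies and depend on expensive high-quality datasets and fine-tuning, which limits their use in open-world scenarios.

Contribution: Proposes the training-free framework RSVG-ZeroOV, which uses frozen foundation models (a VLM and a DM) for zero-shot open-vocabulary remote sensing visual grounding.

Method: The framework has three stages: 1) obtain cross-attention maps with a VLM; 2) use a DM to fill in structural and shape information; 3) apply an attention evolution module to produce purified segmentation masks.

Result: Experiments show that RSVG-ZeroOV outperforms existing weakly-supervised and zero-shot methods.

Insight: Combining the strengths of a VLM and a DM addresses remote sensing visual grounding without fine-tuning, offering an efficient and scalable solution for open-vocabulary tasks.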

Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.) (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.

[81] What Makes You Unique? Attribute Prompt Composition for Object Re-Identification

Yingquan Wang,Pingping Zhang,Chong Sun,Dong Wang,Huchuan Lu

Main category: cs.CV

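TL;DR: The paper proposes the Attribute Prompt Composition (APC) framework, which uses textual semantics to improve both the discrimination and the generalization of object re-identification (ReID), addressing the limitations of single-domain and cross-domain models.

Details Motivation: Existing ReID models tend to overfit in single-domain settings and to suppress identity-specific discriminative features in cross-domain settings, which limits real-world use. The goal is to improve both discrimination and generalization through textual semantics.

Contribution: 1. Proposes the Attribute Prompt Composition (APC) framework, which uses textual semantics to generate discriminative features.
2. Designs the APG module, consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM).
3. Proposes a Fast-Slow Training Strategy (FSTS) that balances ReID-specific discrimination and generalizable representation learning.

Method: 1. Build a Semantic Attribute Dictionary (SAD) that provides rich semantic descriptions.
2. Use the Prompt Composition Module (PCM) to adaptively compose attributes and generate discriminative features.
3. Apply the Fast-Slow Training Strategy (FSTS) to acquire ReID-specific knowledge quickly while retaining the general knowledge of the pre-trained vision-language model.

Result: Experiments on conventional and domain generalized (DG) ReID datasets show that APC surpasses existing methods in both discrimination and generalization.

Insight: 1. Introducing textual semantics effectively improves the discrimination and generalization of ReID models.
2. The fast-slow training strategy is a general approach for balancing task-specific and generalizable representation learning.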

Abstract: Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary to provide rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Specifically, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting. Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performances in terms of both discrimination and generalization. The source code is available at https://github.com/AWangYQ/APC.

[82] Knowledge Transfer from Interaction Learning

Yilin Gao,Kangyi Chen,Zhongxing Peng,Hengjie Lu,Shugong Xu

Main category: cs.CV

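TL;DR: The paper proposes LFI, a framework that explicitly models visual understanding as an interactive process, addressing the inability of visual foundation models (VFMs) to effectively absorb knowledge from vision language models (VLMs).

Details Motivation: Existing VFMs usually follow a result-oriented paradigm that ignores the underlying interaction processes, which limits knowledge transfer. The paper proposes capturing the dynamic interaction patterns in VLMs to improve VFM performance.

Contribution: The LFI framework, together with Interaction Queries and an interaction-based supervision mechanism, enabling more efficient knowledge transfer and cross-domain generalization.

Method: The core components of LFI are Interaction Queries and interaction-based supervision derived from the cross-modal attention mechanisms of VLMs.

Result: Consistent gains on multiple benchmarks: a 3.3-point absolute gain on TinyImageNet classification and 1.6 mAP / 2.4 AP on COCO detection/segmentation, plus strong results on cross-domain zero-shot tasks.

Insight: The paper highlights the importance of interaction processes for knowledge transfer and supports the method's effectiveness with human evaluations of cognitive alignment.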

Abstract: Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision language models (VLMs): while VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt result-oriented paradigms that neglect the underlying interaction processes. This representational discrepancy hinders effective knowledge transfer and limits generalization across diverse vision tasks. We propose Learning from Interactions (LFI), a cognitive-inspired framework that addresses this gap by explicitly modeling visual understanding as an interactive process. Our key insight is that capturing the dynamic interaction patterns encoded in pre-trained VLMs enables more faithful and efficient knowledge transfer to VFMs. The approach centers on two technical innovations: Interaction Queries, which maintain persistent relational structures across network layers, and interaction-based supervision, derived from the cross-modal attention mechanisms of VLMs. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks, achieving 3.3 and 1.6 mAP/2.4 AP absolute gains on TinyImageNet classification and COCO detection/segmentation respectively, with minimal parameter overhead and faster convergence. The framework particularly excels in cross-domain settings, delivering 2.4 and 9.3 zero-shot improvements on PACS and VLCS. Human evaluations further confirm its cognitive alignment, outperforming result-oriented methods by 2.7 times in semantic consistency metrics.

[83] HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection

Ruichao Hou,Xingyuan Li,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao

Main category: cs.CV

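TL;DR: The paper proposes HyPSAM, a hybrid prompt-driven Segment Anything Model (SAM) for RGB-thermal salient object detection (RGB-T SOD). A dynamic fusion network (DFNet) generates high-quality initial saliency maps that serve as visual prompts, and a plug-and-play refinement network (P2RNet) then uses hybrid prompts (text, mask, and box) to guide SAM's zero-shot generalization in refining those maps. Experiments show state-of-the-art performance on public datasets.

Details Motivation: RGB-T SOD is challenged by multi-modal fusion and data scarcity. Conventional methods fall short in feature fusion and object boundary extraction, so a more effective way to combine the strengths of the RGB and thermal modalities is needed.

Contribution: 1. Proposes HyPSAM, which uses SAM's zero-shot generalization for RGB-T SOD; 2. Designs a dynamic fusion network (DFNet) that improves multi-modal feature fusion through dynamic convolution and multi-branch decoding; 3. Proposes a plug-and-play refinement network (P2RNet) that uses hybrid prompts to refine salient object localization.

Method: 1. DFNet dynamically fuses RGB and thermal features to generate initial saliency maps used as visual prompts; 2. P2RNet refines SAM's output with a text prompt (for reliable modality input) and mask and box prompts (for precise localization).

Result: State-of-the-art performance on three public datasets, demonstrating HyPSAM's versatility and effectiveness.

Insight: Prompt engineering (text, mask, and box prompts) can substantially improve multi-modal tasks, and combining it with SAM's zero-shot ability offers a new way to deal with data scarcity.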

Abstract: RGB-thermal salient object detection (RGB-T SOD) aims to identify prominent objects by integrating complementary information from RGB and thermal modalities. However, learning precise boundaries and complete objects remains challenging due to intrinsically insufficient feature fusion and the extrinsic limitation of data scarcity. In this paper, we propose a novel hybrid prompt-driven segment anything model (HyPSAM), which leverages the zero-shot generalization capabilities of the segment anything model (SAM) for RGB-T SOD. Specifically, we first propose a dynamic fusion network (DFNet) that generates high-quality initial saliency maps as visual prompts. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction, overcoming the limitations of fixed-parameter kernels and enhancing multi-modal feature representation. Moreover, we propose a plug-and-play refinement network (P2RNet), which serves as a general optimization strategy to guide SAM in refining saliency maps by using hybrid prompts. The text prompt ensures reliable modality input, while the mask and box prompts enable precise salient object localization. Extensive experiments on three public datasets demonstrate that our method achieves state-of-the-art performance. Notably, HyPSAM has remarkable versatility, seamlessly integrating with different RGB-T SOD methods to achieve significant performance gains, thereby highlighting the potential of prompt engineering in this field. The code and results of our method are available at: https://github.com/milotic233/HyPSAM.

[84] TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing

Susmit Neogi

Main category: cs.CV

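TL;DR: TriFusion-AE is a multimodal cross-attention autoencoder that improves the robustness of point cloud processing by integrating textual priors, depth maps from multi-view images, and LiDAR point clouds.

Details Motivation: LiDAR point clouds are essential to autonomous driving and robotics but are vulnerable to noise, occlusion, and adversarial attacks. Conventional autoencoders degrade under challenging real-world conditions, so a more robust solution is needed.

Contribution: Proposes TriFusion-AE, which combines text, depth maps, and LiDAR to make point cloud reconstruction significantly more robust, especially under adversarial attacks and heavy noise.

Method: A multimodal cross-attention mechanism aligns semantic cues from text, geometric features from images, and spatial structure from LiDAR for joint representation learning.

Result: On the nuScenes-mini dataset, TriFusion-AE outperforms conventional CNN autoencoders under heavy noise and adversarial attacks.

Insight: Multimodal fusion can effectively improve the robustness of point cloud processing, particularly in extreme conditions where conventional single-modality methods tend to fail.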

Abstract: LiDAR-based perception is central to autonomous driving and robotics, yet raw point clouds remain highly vulnerable to noise, occlusion, and adversarial corruptions. Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. In this work, we propose TriFusion-AE, a multimodal cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. By aligning semantic cues from text, geometric (depth) features from images, and spatial structure from LiDAR, TriFusion-AE learns representations that are resilient to stochastic noise and adversarial perturbations. Interestingly, while showing limited gains under mild perturbations, our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse. We evaluate on the nuScenes-mini dataset to reflect realistic low-data deployment scenarios. Our multimodal fusion framework is designed to be model-agnostic, enabling seamless integration with any CNN-based point cloud autoencoder for joint representation learning.

[85] COLT: Enhancing Video Large Language Models with Continual Tool Usage

Yuyang Liu,Xinyuan Shi,Bang Yang,Peilin Zhou,Jiahua Dong,Long Chen,Ian Reid,Xiaondan Liang

Main category: cs.CV

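TL;DR: COLT enhances open-source video LLMs with continual tool usage, letting them acquire tool-use ability automatically from a continually updated tool stream without forgetting previously learned tools.

Details Motivation: Existing methods either prompt closed-source LLMs or fine-tune for tool use through instruction tuning, but both assume a fixed tool repository and struggle to adapt to real-world environments that evolve in real time.

Contribution: 1) Proposes COLT, which addresses "catastrophic forgetting" when learning from a continual tool stream; 2) Introduces a learnable tool codebook as a tool-specific memory system; 3) Collects and releases VideoToolBench, a video-centric tool-use instruction tuning dataset.

Method: COLT dynamically selects tools whose features in the tool codebook are similar to the user instruction, enabling continual tool usage in combination with the capabilities of video LLMs.

Result: COLT achieves state-of-the-art performance on video LLM benchmarks and on the VideoToolBench dataset.

Insight: The tool codebook and its dynamic selection mechanism offer an effective way to counter forgetting in continual learning while giving video understanding tasks stronger tool-use capability.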

Abstract: The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering ‘catastrophic forgetting’ of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
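
The similarity-based tool selection described above can be illustrated with a minimal sketch. It assumes cosine similarity, a plain dictionary as the codebook, and hand-written 3-dimensional features; the names `select_tools` and `CODEBOOK` and the tool entries are hypothetical, and the abstract does not say which similarity measure or feature size COLT actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tools(instruction_vec, codebook, top_k=2):
    """Return the names of the top_k codebook entries most similar to the instruction."""
    ranked = sorted(
        codebook.items(),
        key=lambda item: cosine(instruction_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical tool features; in a learned codebook these would be trained vectors.
CODEBOOK = {
    "captioner": [0.9, 0.1, 0.0],
    "ocr": [0.1, 0.9, 0.1],
    "tracker": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    print(select_tools([1.0, 0.2, 0.0], CODEBOOK, top_k=2))
```

Because tools are looked up by similarity rather than by a fixed output head, adding a new tool in this sketch only means adding a new entry to the codebook.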

[86] FixingGS: Enhancing 3D Gaussian Splatting via Training-Free Score Distillation

Zhaorui Wang,Yi Gu,Deming Zhou,Renjing Xu

Main category: cs.CV

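TL;DR: FixingGS is a training-free method that uses a diffusion model to supply more accurate and multi-view-consistent priors for sparse-view 3D Gaussian Splatting (3DGS) reconstruction, effectively removing artifacts and filling in missing content.

Details Motivation: Sparse-view 3D reconstruction suffers from artifacts and missing content because visual information is insufficient, and existing methods fall short in ensuring multi-view consistency and plausible details.

Contribution: 1. Proposes the training-free FixingGS method, which uses a diffusion model to improve sparse-view 3DGS reconstruction; 2. Introduces an adaptive progressive enhancement scheme that further refines details in under-constrained regions.

Method: 1. A distillation approach built on an existing diffusion model provides accurate, multi-view-consistent priors; 2. An adaptive progressive enhancement strategy refines under-constrained regions.

Result: Experiments show that FixingGS outperforms existing methods in visual quality and reconstruction performance.

Insight: Without any additional training, a diffusion model can effectively improve 3DGS reconstruction quality through distillation, especially for sparse views and multi-view consistency.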

Abstract: Recently, 3D Gaussian Splatting (3DGS) has demonstrated remarkable success in 3D reconstruction and novel view synthesis. However, reconstructing 3D scenes from sparse viewpoints remains highly challenging due to insufficient visual information, which results in noticeable artifacts persisting across the 3D representation. To address this limitation, recent methods have resorted to generative priors to remove artifacts and complete missing content in under-constrained areas. Despite their effectiveness, these approaches struggle to ensure multi-view consistency, resulting in blurred structures and implausible details. In this work, we propose FixingGS, a training-free method that fully exploits the capabilities of the existing diffusion model for sparse-view 3DGS reconstruction enhancement. At the core of FixingGS is our distillation approach, which delivers more accurate and cross-view coherent diffusion priors, thereby enabling effective artifact removal and inpainting. In addition, we propose an adaptive progressive enhancement scheme that further refines reconstructions in under-constrained regions. Extensive experiments demonstrate that FixingGS surpasses existing state-of-the-art methods with superior visual quality and reconstruction performance. Our code will be released publicly.

[87] Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

Xijun Wang,Junyun Huang,Rayyan Abdalla,Chengyuan Zhang,Ruiqi Xian,Dinesh Manocha

Main category: cs.CV

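TL;DR: Bi-VLM separates model weights non-uniformly before quantizing vision-language models (VLMs), substantially improving ultra-low-precision (≤2-bit) post-training quantization and enabling higher efficiency in hardware-constrained environments.

Details Motivation: The high computational cost and memory requirements of VLMs restrict their use in hardware-constrained environments. The authors address this with ultra-low-precision post-training quantization.

Contribution: Bi-VLM proposes a non-uniform weight separation based on Gaussian quantiles, combined with a saliency-aware hybrid quantization algorithm, which substantially improves the performance of quantized models.

Method: Gaussian quantiles split the model weights into an outlier (salient) subset and multiple inlier (unsalient) subsets, and each subset is quantized under different constraints determined by the saliency metric and the compression objective.

Result: On visual question answering, Bi-VLM outperforms the SOTA by 3%-47% for the language model part and by 4%-45% for the overall VLM, and 90%-99% of image tokens in the quantized models are found to be redundant.

Insight: Combining ultra-low-precision quantization with token pruning can substantially improve VLM efficiency; the large redundancy in quantized models leaves considerable room for further optimization.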

Abstract: We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth $\leq 2$ bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into outlier (salient) and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four different benchmarks and three different models. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that 90%-99% of image tokens are redundant in the quantized models. This helps us to further prune the visual tokens to improve efficiency.
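
To make the idea of grouping weights by Gaussian quantiles concrete, here is a small sketch of the partitioning step only. It assumes the weights are roughly Gaussian, fits a mean and standard deviation, and bands each weight by its absolute z-score; the function name `quantile_groups` and the band probabilities 0.5 and 0.95 are illustrative assumptions, and the quantization itself is not shown. This is not Bi-VLM's actual procedure, whose details the abstract does not give.

```python
from statistics import NormalDist, fmean, pstdev


def quantile_groups(weights, central_probs=(0.5, 0.95)):
    """Assign each weight a group index based on Gaussian quantile bands.

    central_probs are increasing central probability masses. Group 0 holds the
    weights inside the innermost band; the highest index holds the outliers
    (salient weights) that fall outside the last band.
    """
    mu, sigma = fmean(weights), pstdev(weights)
    std_normal = NormalDist()
    # z cut-off such that P(|Z| <= z) = p for a standard normal variable.
    cuts = [std_normal.inv_cdf(0.5 + p / 2) for p in central_probs]
    groups = []
    for w in weights:
        z = abs(w - mu) / sigma if sigma else 0.0
        # The group index is the number of cut-offs this weight exceeds.
        groups.append(sum(z > c for c in cuts))
    return groups


if __name__ == "__main__":
    # Hypothetical weights: many small values plus one large outlier.
    demo = [0.01 * i for i in range(-10, 11)] + [5.0]
    print(quantile_groups(demo))
```

In a scheme of this kind, each group could then receive its own scaling factor or bit budget, with the outlier group treated most carefully.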

[88] DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision

Azad Singh,Deepak Mishra

Main category: cs.CV

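TL;DR: DiSSECT is a discrete self-supervised learning method that uses multi-scale vector quantization to explicitly constrain feature learning, improving the transferability and robustness of medical image representations.

Details Motivation: Existing self-supervised methods for medical images are prone to shortcut learning and rely on complex architectures or domain priors, which limits their scalability and generalization. DiSSECT aims to address this.

Contribution: 1. Proposes DiSSECT, a discrete self-supervised framework that introduces a representational bottleneck through vector quantization; 2. Learns structure-aware features while suppressing view-specific or low-utility patterns; 3. Transfers efficiently with few labels, with performance validated across tasks and domains.

Method: Multi-scale vector quantization is integrated into the self-supervised pipeline to constrain the feature space to discrete representations, which avoids shortcut learning and makes features more structured and transferable.

Result: Across multiple public medical imaging datasets, DiSSECT performs strongly on classification and segmentation with minimal or no fine-tuning, and is especially label-efficient in low-label regimes.

Insight: Discretizing feature representations effectively suppresses noise and makes the model more sensitive to structural and pathological information, so pre-trained models transfer more easily to new tasks.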

Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for medical image representation learning, particularly in settings with limited labeled data. However, existing SSL methods often rely on complex architectures, anatomy-specific priors, or heavily tuned augmentations, which limit their scalability and generalizability. More critically, these models are prone to shortcut learning, especially in modalities like chest X-rays, where anatomical similarity is high and pathology is subtle. In this work, we introduce DiSSECT – Discrete Self-Supervision for Efficient Clinical Transferable Representations, a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. This constrains the model to learn repeatable, structure-aware features while suppressing view-specific or low-utility patterns, improving representation transfer across tasks and domains. DiSSECT achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning, and shows particularly high label efficiency in low-label regimes. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability compared to existing state-of-the-art approaches.
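
The discrete bottleneck mentioned above rests on vector quantization, where each continuous feature vector is replaced by its nearest codebook entry. The sketch below shows only that lookup step, with one hypothetical codebook per scale; the function names and toy codebooks are assumptions, and training details such as codebook updates or the straight-through gradient are omitted because the abstract does not describe them.

```python
def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to vec (squared L2 distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))
    return idx, codebook[idx]


def quantize_multiscale(features_by_scale, codebooks):
    """Quantize every feature vector at every scale with that scale's own codebook.

    Returns one list of code indices per scale.
    """
    return [
        [quantize(vec, codebook)[0] for vec in features]
        for features, codebook in zip(features_by_scale, codebooks)
    ]


if __name__ == "__main__":
    # Hypothetical two-scale setup: 2-d features at scale 0, 1-d features at scale 1.
    codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0], [3.0]]]
    features = [[[0.1, 0.0], [0.9, 1.0]], [[2.0]]]
    print(quantize_multiscale(features, codebooks))
```

Only the code indices survive the bottleneck in this sketch, which is why small view-specific variations that map to the same code are discarded.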

[89] Real-time Deer Detection and Warning in Connected Vehicles via Thermal Sensing and Deep Learning

Hemanth Puppala,Wayne Sarasua,Srinivas Biyaguda,Farhad Farzinpour,Mashrur Chowdhury

Main category: cs.CV

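TL;DR: The paper presents a real-time deer detection and driver warning system based on thermal imaging and deep learning, aimed at reducing deer-vehicle collisions. Using thermal imaging and connected-vehicle technology, the system achieves high detection accuracy and low-latency response across diverse weather conditions.

Details Motivation: Deer-vehicle collisions are a serious safety and economic problem in the United States, causing many casualties and substantial economic losses every year. Existing systems based on visible-light cameras perform poorly in bad weather, so a more reliable solution is needed.

Contribution: 1) A high-performance deer detection system combining thermal imaging and deep learning; 2) Integration of connected-vehicle (CV2X) communication for real-time data sharing and warnings; 3) Validation of high accuracy and low latency on a dataset of over 12,000 thermal images.

Method: 1) Collect deer image data with thermal cameras; 2) Train a deep-learning object detector; 3) Broadcast real-time warnings via CV2X communication.

Result: The system reaches 98.84% mean average precision (mAP), 95.44% precision, and 95.96% recall. In field tests, thermal imaging maintains 88%-92% detection accuracy in challenging conditions where visible-light cameras achieve less than 60%, and end-to-end latency stays below 100 milliseconds.

Insight: Thermal imaging combined with deep learning has a clear advantage in challenging environments, and adding connected-vehicle communication makes the system more practical and widely applicable, offering a viable technical path for reducing deer-vehicle collisions.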

Abstract: Deer-vehicle collisions represent a critical safety challenge in the United States, with nearly 2.1 million incidents annually resulting in approximately 440 fatalities, 59,000 injuries, and 10 billion USD in economic damages. These collisions also contribute significantly to declining deer populations. This paper presents a real-time detection and driver warning system that integrates thermal imaging, deep learning, and vehicle-to-everything communication to help mitigate deer-vehicle collisions. Our system was trained and validated on a custom dataset of over 12,000 thermal deer images collected in Mars Hill, North Carolina. Experimental evaluation demonstrates exceptional performance with 98.84 percent mean average precision, 95.44 percent precision, and 95.96 percent recall. The system was field tested during a follow-up visit to Mars Hill and readily sensed deer, providing the driver with advance warning. Field testing validates robust operation across diverse weather conditions, with thermal imaging maintaining between 88 and 92 percent detection accuracy in challenging scenarios where conventional visible-light cameras achieve less than 60 percent effectiveness. When a high probability threshold is reached, sensor data sharing messages are broadcast to surrounding vehicles and roadside units via cellular vehicle-to-everything (CV2X) communication devices. Overall, our system achieves end-to-end latency consistently under 100 milliseconds from detection to driver alert. This research establishes a viable technological pathway for reducing deer-vehicle collisions through thermal imaging and connected vehicles.

[90] Towards Application Aligned Synthetic Surgical Image Synthesis

Danush Kumar Venkatesh,Stefanie Speidel

Main category: cs.CV

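TL;DR: The paper proposes SAADi, a synthetic image generation framework that explicitly aligns generation with the preferences of downstream tasks, improving surgical image classification and segmentation.

Details Motivation: The scarcity of surgical data, together with the inconsistent or non-diverse images produced by existing diffusion models, hinders the development of deep learning systems for computer-assisted interventions.

Contribution: Proposes the SAADi framework, which applies lightweight fine-tuning to diffusion models so that the synthetic images they generate match the preferences of downstream tasks.

Method: Pairs of preferred and non-preferred synthetic images are constructed and used to explicitly align the image generation process with downstream objectives.

Result: On three surgical datasets, classification improves by 7-9% and segmentation by 2-10%, with especially notable gains for underrepresented classes.

Insight: Task-aware alignment is a key principle for mitigating data scarcity, and iterative refinement of synthetic samples boosts performance further.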

Abstract: The scarcity of annotated surgical data poses a significant challenge for developing deep learning systems in computer-assisted interventions. While diffusion models can synthesize realistic images, they often suffer from data memorization, resulting in inconsistent or non-diverse samples that may fail to improve, or even harm, downstream performance. We introduce Surgical Application-Aligned Diffusion (SAADi), a new framework that aligns diffusion models with samples preferred by downstream models. Our method constructs pairs of preferred and non-preferred synthetic images and employs lightweight fine-tuning of diffusion models to align the image generation process with downstream objectives explicitly. Experiments on three surgical datasets demonstrate consistent gains of 7–9% in classification and 2–10% in segmentation tasks, with considerable improvements observed for underrepresented classes. Iterative refinement of synthetic samples further boosts performance by 4–10%. Unlike baseline approaches, our method overcomes sample degradation and establishes task-aware alignment as a key principle for mitigating data scarcity and advancing surgical vision applications.

[91] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Junhao Su,Yuanliang Wan,Junwei Yang,Hengyu Shi,Tianyang Han,Junfeng Luo,Yurui Qiu

Main category: cs.CV

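TL;DR: The paper proposes structured reflection, which uses an explicit, controllable, and trainable action to improve how large language models repair errors in multi-turn tool interactions, substantially raising tool-call success and error recovery.

Details Motivation: Tool-augmented LLMs are usually trained with supervised imitation or coarse-grained reinforcement learning, while self-reflection practices rely on heuristic prompts or one-way reasoning, leaving error repair weak in multi-turn interactions.

Contribution: Proposes structured reflection, which turns the path from error to repair into an explicit, trainable action; designs a tool-specific reward scheme combined with DAPO and GSPO objectives; and optimizes the stepwise strategy Reflect, then Call, then Final.

Method: DAPO and GSPO objectives are combined with a reward scheme tailored to tool use, and the model is trained to perform structured reflection following the Reflect-Call-Final strategy.

Result: Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, along with fewer redundant calls.

Insight: Making reflection explicit and optimizing it directly can substantially improve the reliability of tool interaction and gives models a reproducible path for learning from failure.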

Abstract: Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to ‘think more’ instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.

[92] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Hao Wang,Eiki Murata,Lingfang Zhang,Ayako Sato,So Fukuda,Ziqi Yin,Wentao Hu,Keisuke Nakao,Yusuke Nakamura,Sebastian Zwirner,Yi-Chia Chen,Hiroyuki Otomo,Hiroki Ouchi,Daisuke Kawahara

Main category: cs.CV

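TL;DR: VIR-Bench is a new benchmark for evaluating the geospatial-temporal understanding of multimodal large language models (MLLMs) on long-distance travel videos; its usefulness is validated through experiments and a prototype application.

Details Motivation: Current video understanding benchmarks focus mostly on indoor scenes or short-range outdoor activities and do not assess the geospatial-temporal challenges of long-distance travel, which matter for real-world applications of next-generation MLLMs.

Contribution: Proposes VIR-Bench, a benchmark of 200 travel videos that uses itinerary reconstruction as the task for evaluating MLLMs' geospatial-temporal intelligence, and demonstrates practical gains with a prototype agent.

Method: Build a dataset of long-distance travel videos, design the itinerary reconstruction task, evaluate existing MLLMs, and develop a prototype travel-planning agent.

Result: Experiments show that current MLLMs perform poorly at extended geospatial-temporal scales, while the prototype agent built on insights from VIR-Bench produces markedly better itinerary recommendations.

Insight: Long-distance travel videos are a hard case for MLLMs' geospatial-temporal understanding, and VIR-Bench both evaluates models and translates into better performance in practical applications.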

Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs’ geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent’s markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

[93] Surgical Video Understanding with Label Interpolation

Garam Kim,Tae Kyeong Jeong,Juyoun Park

Main category: cs.CV

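TL;DR: The paper proposes a framework that combines optical flow-based segmentation label interpolation with multi-task learning to address imbalanced annotations in surgical video understanding.

Details Motivation: Robot-assisted surgery (RAS) requires a precise understanding of surgical video data, but prior work mostly targets single tasks, and annotations are unevenly distributed in time and space (long-term annotations are dense while short-term annotations are sparse).

Contribution: Proposes a novel optical flow-based label interpolation method, combined with multi-task learning, to address sparse annotations and the temporal-spatial imbalance in surgical videos.

Method: Optical flow propagates labels from annotated key frames to unlabeled frames, enriching spatial supervision; this is combined with multi-task learning to improve the model's understanding of surgical scenes.

Result: The method improves the accuracy and efficiency of surgical scene understanding and increases the practical value of RAS.

Insight: Optical flow and label propagation can effectively ease the shortage of annotated data, and multi-task learning helps capture the complex dynamics of surgical videos.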

Abstract: Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
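
Label propagation with optical flow can be sketched as a forward warp of a key-frame segmentation mask. The example below assumes a dense per-pixel flow field given as `(dx, dy)` displacements rounded to whole pixels; the function name `propagate_labels` and the tiny 3x3 frame are hypothetical, and a practical pipeline would also need flow estimation and handling of holes and occlusions, which the abstract does not detail.

```python
def propagate_labels(mask, flow, background=0):
    """Forward-warp a key-frame label mask into the next frame.

    mask: 2-D list of integer labels for the annotated key frame.
    flow: 2-D list of (dx, dy) displacements, one per pixel, from the key
          frame to the adjacent unlabeled frame.
    Pixels that receive no label keep the background value.
    """
    height, width = len(mask), len(mask[0])
    warped = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            label = mask[y][x]
            if label == background:
                continue
            dx, dy = flow[y][x]
            new_y, new_x = y + round(dy), x + round(dx)
            # Labels that move outside the frame are dropped.
            if 0 <= new_y < height and 0 <= new_x < width:
                warped[new_y][new_x] = label
    return warped


if __name__ == "__main__":
    key_mask = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
    shift_right = [[(1, 0)] * 3 for _ in range(3)]  # every pixel moves one step right
    print(propagate_labels(key_mask, shift_right))
```

Applying the warp to the frames between two key frames yields pseudo-labels that can supplement the sparse segmentation annotations during multi-task training.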

[94] ColorBlindnessEval: Can Vision-Language Models Pass Color Blindness Tests?

Zijian Ling,Han Zhang,Yazhuo Zhou,Jiahao Cui

Main category: cs.CV

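TL;DR: The paper introduces ColorBlindnessEval, a benchmark that evaluates the robustness of vision-language models in visually adversarial scenarios inspired by the Ishihara color blindness test, and finds that the models are limited in recognizing numbers embedded in complex visual patterns.

Details Motivation: Studying how vision-language models behave in visually adversarial scenarios calls for more comprehensive evaluation tools, in particular for recognizing complex visual patterns of the kind used in color blindness tests.

Contribution: Proposes the ColorBlindnessEval benchmark of 500 Ishihara-like images for evaluating how well vision-language models recognize numbers in complex visual environments.

Method: Builds a dataset of Ishihara-like images featuring the numbers 0 to 99, evaluates 9 vision-language models with Yes/No and open-ended prompts, and compares them with human participants.

Result: Experiments show that vision-language models have limited ability to recognize numbers in adversarial visual scenarios and exhibit hallucination issues.

Insight: The study exposes the limitations of vision-language models on complex visual tasks and underscores the need to improve their robustness for real-world applications.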

Abstract: This paper presents ColorBlindnessEval, a novel benchmark designed to evaluate the robustness of Vision-Language Models (VLMs) in visually adversarial scenarios inspired by the Ishihara color blindness test. Our dataset comprises 500 Ishihara-like images featuring numbers from 0 to 99 with varying color combinations, challenging VLMs to accurately recognize numerical information embedded in complex visual patterns. We assess 9 VLMs using Yes/No and open-ended prompts and compare their performance with human participants. Our experiments reveal limitations in the models’ ability to interpret numbers in adversarial contexts, highlighting prevalent hallucination issues. These findings underscore the need to improve the robustness of VLMs in complex visual environments. ColorBlindnessEval serves as a valuable tool for benchmarking and improving the reliability of VLMs in real-world applications where accuracy is critical.

[95] Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Yanzuo Lu,Xin Xia,Manlin Zhang,Huafeng Kuang,Jianbin Zheng,Yuxi Ren,Xuefeng Xiao

Main category: cs.CV

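TL;DR: Hyper-Bagel is a unified acceleration framework that substantially speeds up both multimodal understanding and generation tasks.

Details Motivation: Current unified multimodal models incur heavy computational cost on multimodal content because iterative processes such as diffusion denoising and autoregressive decoding are inefficient, so optimization is needed.

Contribution: 1. Proposes Hyper-Bagel, which accelerates both multimodal understanding and generation; 2. Improves efficiency through a divide-and-conquer strategy with speculative decoding and multi-stage distillation; 3. Achieves 16.67x and 22x speedups on generation tasks, with a 1-NFE model that supports near real-time interaction.

Method: A divide-and-conquer strategy pairs speculative decoding (for multimodal understanding) with multi-stage distillation (for diffusion denoising), and the 1-NFE model is optimized with adversarial distillation and human feedback learning.

Result: Over 2x speedup on multimodal understanding; 16.67x speedup on text-to-image generation and 22x on image editing; the 1-NFE model supports near real-time interaction.

Insight: A unified framework can process multimodal tasks efficiently while balancing quality and speed, making real-time interaction feasible.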

Abstract: Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
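
For readers unfamiliar with speculative decoding, the following sketch shows its greedy form with toy stand-ins for the models. A cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is kept, so the output matches plain greedy decoding with the target. The callables `draft_next` and `target_next` and the integer "tokens" are hypothetical, and nothing here reflects Hyper-Bagel's actual implementation.

```python
def speculative_decode(prefix, draft_next, target_next, num_new, k=4):
    """Greedy speculative decoding over token lists.

    draft_next / target_next: callables mapping a token list to the next token.
    Returns num_new tokens identical to greedy decoding with target_next alone.
    """
    tokens = list(prefix)
    goal = len(tokens) + num_new
    while len(tokens) < goal:
        # 1) The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target model verifies the proposal position by position.
        #    (A real system scores all k positions in one batched forward pass.)
        for drafted in proposal:
            expected = target_next(tokens)
            tokens.append(expected)  # equals drafted if accepted, else the correction
            if expected != drafted or len(tokens) >= goal:
                break
        else:
            # Every drafted token was accepted: the target adds one bonus token.
            tokens.append(target_next(tokens))
    return tokens[len(prefix):]


def toy_target(tokens):
    """Hypothetical target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Hypothetical draft model: agrees with the target except after token 2."""
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode([0], toy_draft, toy_target, num_new=8))
```

The speedup in practice comes from the verification step being a single parallel pass of the large model, so each pass can commit several tokens whenever the draft model agrees with it.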

[96] Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Guoxin Wang,Jun Zhao,Xinyi Liu,Yanbo Liu,Xuyang Cao,Chao Li,Zhuoyun Liu,Qintian Sun,Fangru Zhou,Haoqiang Xing,Zhenhong Yang

Main category: cs.CV

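TL;DR: Citrus-V is a multimodal medical foundation model that combines image analysis with textual reasoning and supports a unified pipeline from visual grounding to clinical reasoning, outperforming existing open-source medical models and expert-level imaging systems.

Details Motivation: Existing medical imaging models are usually narrow in function and generalize poorly, whereas real clinical use requires multimodal integration, precise visual grounding, and chain-of-thought reasoning.

Contribution: Proposes Citrus-V, which integrates detection, segmentation, and multimodal chain-of-thought reasoning in a single framework, supporting pixel-level lesion localization, structured report generation, and physician-like diagnostic inference.

Method: Uses a novel multimodal training approach that integrates detection, segmentation, and chain-of-thought reasoning tasks, and releases an open-source data suite to support model training and evaluation.

Result: Across multiple benchmarks, Citrus-V outperforms existing open-source medical models and expert-level imaging systems, delivering a unified pipeline from visual grounding to clinical reasoning.

Insight: Through a unified multimodal framework, Citrus-V shows the potential of tightly coupling medical image analysis with clinical reasoning, providing reliable support for precision medicine.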

Abstract: Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

[97] Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

Gianmarco Spinaci,Lukas Klic,Giovanni Colavizza

Main category: cs.CV

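TL;DR: This study evaluates multimodal large language models (LLMs) and vision-language models (VLMs) on single-label classification of Christian iconography. Gemini-2.5 Pro and GPT-4o outperform the ResNet50 baselines; enriching prompts with descriptions improves zero-shot performance, but few-shot learning offers limited benefit.

Details Motivation: The study explores whether general-purpose VLMs and LLMs can handle Christian iconography classification and evaluates them in zero-shot and few-shot scenarios, informing metadata curation tools for the digital humanities.

Contribution: By evaluating multiple models on Christian iconography classification, it reveals the advantages of Gemini-2.5 Pro and GPT-4o and proposes LLMs as metadata curation tools for the cultural heritage domain.

Method: Three datasets (ArtDL, ICONCLASS, Wikidata) are used to test the models under three conditions: classification with class labels, classification with Iconclass descriptions, and five-exemplar few-shot learning. Results are compared against fine-tuned ResNet50 baselines.

Result: Gemini-2.5 Pro and GPT-4o outperform the ResNet50 baselines; accuracy drops on the Wikidata dataset; prompt enrichment improves zero-shot performance, while few-shot learning shows no clear benefit.

Insight: General-purpose multimodal LLMs can handle complex classification tasks in visual cultural heritage; prompt optimization and few-shot learning are directions for future research.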

Abstract: This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian Iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret this iconography, a task typically addressed by supervised classifiers, and to evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? (RQ2) How does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where SigLIP reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.

[98] ViG-LRGC: Vision Graph Neural Networks with Learnable Reparameterized Graph Construction

Ismael Elsharkawi,Hossam Sharara,Ahmed Rafea

Main category: cs.CV

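TL;DR: ViG-LRGC proposes Learnable Reparameterized Graph Construction (LRGC) to improve graph construction in Vision Graph Neural Networks (ViG); it requires no hyper-parameter search and performs strongly on ImageNet-1k.

Details Motivation: Conventional ViG models build graphs with non-parameterized, non-learnable statistical methods, which may not select the best neighborhood for each node. ViG-LRGC addresses this with a learnable graph construction method that removes the dependence on hyper-parameter search.

Contribution: Proposes LRGC, which selects edges through key-query attention and soft-threshold reparameterization, making graph construction learnable and letting each layer's threshold adapt to the training data.

Method: LRGC uses key-query attention to score the relation between each pair of nodes and selects edges through soft-threshold reparameterization, making graph construction differentiable and trainable.

Result: On the ImageNet-1k benchmark, ViG-LRGC outperforms existing ViG models of similar size.

Insight: Learnable graph construction captures relations between nodes more flexibly and avoids the bias introduced by the clustering or thresholding of earlier methods, improving model performance.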

Abstract: Image Representation Learning is an important problem in Computer Vision. Traditionally, images were processed as grids, using Convolutional Neural Networks, or as a sequence of visual tokens, using Vision Transformers. Recently, Vision Graph Neural Networks (ViG) have proposed the treatment of images as a graph of nodes, which provides a more intuitive image representation. The challenge is to construct a graph of nodes in each layer that best represents the relations between nodes and does not need a hyper-parameter search. ViG models in the literature depend on non-parameterized and non-learnable statistical methods that operate on the latent features of nodes to create a graph. This might not select the best neighborhood for each node. From k-NN graph construction to HyperGraph Construction and Similarity-Thresholded graph construction, these methods lack the ability to provide a learnable, hyper-parameter-free graph construction method. To overcome those challenges, we present the Learnable Reparameterized Graph Construction (LRGC) for Vision Graph Neural Networks. LRGC applies key-query attention between every pair of nodes, then uses soft-threshold reparameterization for edge selection, which allows the use of a differentiable mathematical model for training. Using learnable parameters to select the neighborhood removes the bias that is induced by any clustering or thresholding methods previously introduced in the literature. In addition, LRGC allows tuning the threshold in each layer to the training data since the thresholds are learnable through training and are not provided as hyper-parameters to the model. We demonstrate that the proposed ViG-LRGC approach outperforms state-of-the-art ViG models of similar sizes on the ImageNet-1k benchmark dataset.
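To make the edge-selection idea concrete, here is a minimal sketch of key-query scoring followed by a soft threshold. It is an assumption-laden illustration rather than the authors' formulation: the sigmoid squashing of scores, the scalar threshold parameter `s`, and the function name `soft_threshold_edges` are all hypothetical choices.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_threshold_edges(queries: List[Vector], keys: List[Vector],
                         s: float) -> List[List[float]]:
    """Return an edge-weight matrix; a zero entry means "no edge".

    Assumed form: w_ij = max(sigmoid(q_i . k_j / sqrt(d)) - sigmoid(s), 0).
    The threshold sigmoid(s) is a function of a trainable scalar s, so in an
    autograd framework gradients would reach both the scores and s wherever
    w_ij > 0.
    """
    n, d = len(queries), len(queries[0])
    tau = sigmoid(s)  # learnable threshold, squashed into (0, 1)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops in this sketch
            score = sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            weights[i][j] = max(sigmoid(score) - tau, 0.0)
    return weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    for row in soft_threshold_edges(feats, feats, s=0.0):
        print([round(w, 3) for w in row])
```

Unlike k-NN, the number of neighbors per node is not fixed here: it depends on how many scores clear the threshold, which is the property that removes the need to search over a neighborhood size.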

[99] Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model

Xueyu Liu,Xiaoyi Zhang,Guangze Shi,Meilin Liu,Yexin Lai,Yongfei Wu,Mingqiang Wei

Main category: cs.CV

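TL;DR: Proposes Point Prompt Defender, an adversarial reinforcement learning framework that uses an attack-for-defense paradigm to automatically optimize point prompts and improve SAM's segmentation performance.

Details Motivation: Existing prompting approaches for SAM rely on heuristic or manually crafted prompts, limiting scalability and generalization.

Contribution: Proposes a task-agnostic dual-space graph representation and an adversarial reinforcement learning framework in which attacker and defender agents optimize prompts, improving SAM's segmentation performance without retraining.

Method: Image patches are represented as nodes in a dual-space graph; an adversarial reinforcement learning framework with an attacker agent and a defender agent is built on it, and the agents are trained with Deep Q-Networks.

Result: Experiments show that Point Prompt Defender significantly improves SAM's robustness and generalization.

Insight: Optimizing prompts through adversarial learning can improve model performance without task-specific tuning or retraining.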

Abstract: Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM’s segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM’s robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.
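The dual-space graph is described only at a high level: patches are nodes and edges carry both a physical and a semantic distance. One plausible reading is sketched below. The `Patch` type, the use of Euclidean distance on the patch grid, and cosine distance on patch embeddings are assumptions made for illustration, not details confirmed by the abstract.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Patch:
    row: int                     # patch position on the image grid
    col: int
    feature: Tuple[float, ...]   # patch embedding (assumed non-zero)


def cosine_distance(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_dual_space_graph(
    patches: List[Patch],
) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Map each node pair (i, j), i < j, to (physical, semantic) distance."""
    edges: Dict[Tuple[int, int], Tuple[float, float]] = {}
    for (i, p), (j, q) in combinations(enumerate(patches), 2):
        physical = math.hypot(p.row - q.row, p.col - q.col)
        semantic = cosine_distance(p.feature, q.feature)
        edges[(i, j)] = (physical, semantic)
    return edges


if __name__ == "__main__":
    patches = [
        Patch(0, 0, (1.0, 0.0)),
        Patch(0, 1, (1.0, 0.0)),
        Patch(3, 4, (0.0, 1.0)),
    ]
    for pair, dists in build_dual_space_graph(patches).items():
        print(pair, dists)
```

Keeping the two distances separate lets an agent distinguish prompts that are close in the image but semantically different from prompts that are far apart but look alike.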

[100] SmartWilds: Multimodal Wildlife Monitoring Dataset

Jenna Kline,Anirudh Potlapally,Bharath Pillai,Tanishka Wani,Rugved Katole,Vedant Patil,Penelope Covey,Hari Subramoni,Tanya Berger-Wolf,Christopher Stewart

Main category: cs.CV

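TL;DR: SmartWilds is a multimodal wildlife monitoring dataset; its first release contains synchronized drone imagery, camera trap photographs and videos, and bioacoustic recordings collected during summer 2025 at The Wilds safari park in Ohio.

Details Motivation: Supports multimodal AI research and addresses needs in endangered species research, conservation ecology, and habitat management, filling a data gap in the field.

Contribution: Provides the first release of a synchronized multimodal dataset that supports AI research on comprehensive environmental monitoring, and lays the groundwork for future extensions (such as GPS tracking data).

Method: Data are collected synchronously with drones, camera traps, and bioacoustic devices, followed by a comparative analysis of modality performance.

Result: Demonstrates the complementary strengths of the sensor modalities for land-use patterns, species detection, behavioral analysis, and habitat monitoring.

Insight: Combining multimodal data supports wildlife monitoring more comprehensively, and open datasets advance conservation computer vision research.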

Abstract: We present the first release of SmartWilds, a multimodal wildlife monitoring dataset. SmartWilds is a synchronized collection of drone imagery, camera trap photographs and videos, and bioacoustic recordings collected during summer 2025 at The Wilds safari park in Ohio. This dataset supports multimodal AI research for comprehensive environmental monitoring, addressing critical needs in endangered species research, conservation ecology, and habitat management. Our pilot deployment captured four days of synchronized monitoring across three modalities in a 220-acre pasture containing Pere David’s deer, Sichuan takin, Przewalski’s horses, as well as species native to Ohio, including bald eagles, white-tailed deer, and coyotes. We provide a comparative analysis of sensor modality performance, demonstrating complementary strengths for landuse patterns, species detection, behavioral analysis, and habitat monitoring. This work establishes reproducible protocols for multimodal wildlife monitoring while contributing open datasets to advance conservation computer vision research. Future releases will include synchronized GPS tracking data from tagged individuals, citizen science data, and expanded temporal coverage across multiple seasons.

[101] RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing

Jiayu Wang,Ruizhi Wang,Jie Song,Haofei Zhang,Mingli Song,Zunlei Feng,Li Sun

Main category: cs.CV

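TL;DR: The paper introduces RS3DBench, a new benchmark to advance general-purpose, large-scale 3D vision models for remote sensing imagery, addressing the shortcomings of existing datasets in depth information and image alignment.

Details Motivation: Many current remote sensing datasets lack comprehensive depth information or fail to align depth data precisely with remote sensing images, which limits the development of 3D vision models.

Contribution: 1. Presents the RS3DBench dataset, containing 54,951 pairs of remote sensing images and pixel-level aligned depth maps with textual descriptions; 2. Proposes a remote sensing depth estimation model based on Stable Diffusion that achieves SOTA performance.

Method: 1. Builds a dataset covering diverse geographical scenes; 2. Designs a depth estimation model that exploits the multimodal fusion capabilities of Stable Diffusion.

Result: The proposed depth estimation model achieves SOTA performance on RS3DBench; the dataset and model are intended to advance 3D visual perception and geographic AI.

Insight: Multimodal data (such as precisely aligned images and depth maps) is key to improving 3D vision models, and the adaptability of Stable Diffusion shows promise in remote sensing.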

Abstract: In this paper, we introduce a novel benchmark designed to propel the advancement of general-purpose, large-scale 3D vision models for remote sensing imagery. While several datasets have been proposed within the realm of remote sensing, many existing collections either lack comprehensive depth information or fail to establish precise alignment between depth data and remote sensing images. To address this deficiency, we present a visual Benchmark for 3D understanding of Remotely Sensed images, dubbed RS3DBench. This dataset encompasses 54,951 pairs of remote sensing images and pixel-level aligned depth maps, accompanied by corresponding textual descriptions, spanning a broad array of geographical contexts. It serves as a tool for training and assessing 3D visual perception models within remote sensing image spatial understanding tasks. Furthermore, we introduce a remotely sensed depth estimation model derived from stable diffusion, harnessing its multimodal fusion capabilities, thereby delivering state-of-the-art performance on our dataset. Our endeavor seeks to make a profound contribution to the evolution of 3D visual perception models and the advancement of geographic artificial intelligence within the remote sensing domain. The dataset, models and code will be accessible at https://rs3dbench.github.io.

[102] DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

Pengteng Li,Yunfan Lu,Pinhao Song,Weiyu Guo,Huizai Yao,F. Richard Yu,Hui Xiong

Main category: cs.CV

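TL;DR: This paper proposes DeblurSplat, a Structure-from-Motion (SfM)-free deblurring method for 3D Gaussian Splatting with an event camera; it combines a dense stereo module (DUSt3R) with event stream data to improve deblurring quality and rendering efficiency.

Details Motivation: Conventional SfM-based deblurring methods suffer from accumulated camera pose errors, and their deblurring quality is limited by the blurred images. Event cameras, being highly sensitive to dynamic change, offer a new approach to deblurring.

Contribution: 1. Proposes the first SfM-free deblurring 3D Gaussian Splatting method; 2. Uses DUSt3R to generate the initial point cloud directly, avoiding propagation of pose errors; 3. Introduces the event stream to provide a fine-grained supervision signal for deblurring.

Method: 1. Generate the initial point cloud from blurred images with DUSt3R; 2. Decode latent sharp images from the event stream and blurred images; 3. Optimize scene reconstruction to achieve deblurring and efficient rendering.

Result: Experiments show that DeblurSplat outperforms existing deblurring 3D-GS methods in both high-fidelity novel view synthesis and rendering efficiency.

Insight: Event cameras have great potential for dynamic scene reconstruction, and skipping SfM to generate point clouds directly avoids error accumulation, opening a new direction for 3D reconstruction.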

Abstract: In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method via event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Because camera poses are not calculated as an intermediate result, we avoid transferring cumulative errors from inaccurate camera poses to the positions of the initial point clouds. Second, we introduce the event stream into the deblurring pipeline for its high sensitivity to dynamic change. By decoding the latent sharp images from the event stream and blurred images, we can provide a fine-grained supervision signal for scene reconstruction optimization. Extensive experiments across a range of scenes demonstrate that DeblurSplat not only excels in generating high-fidelity novel views but also achieves significant rendering efficiency compared to state-of-the-art deblurring 3D-GS methods.

[103] Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation

Yunzhe Shen,Kai Peng,Leiye Liu,Wei Ji,Jingjing Li,Miao Zhang,Yongri Piao,Huchuan Lu

Main category: cs.CV

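TL;DR: The paper proposes FAVS, an audio-visual segmentation (AVS) framework based on frequency-domain decomposition and recomposition; its FDED and SCMC modules resolve the contradiction between audio and visual high-frequency features and significantly improve segmentation performance.

Details Motivation: Existing AVS methods overlook the inherent frequency-domain contradictions between the audio and visual modalities (for example, high-frequency noise in audio versus high-frequency detail in vision), which limits performance. The authors therefore rethink the AVS task from a frequency-domain perspective.

Contribution: 1. Proposes the FAVS framework, which formulates AVS as a frequency-domain decomposition and recomposition task; 2. Designs the FDED module, which decomposes modality features through residual-based iteration, and the SCMC module, which strengthens cross-modal consistency through dynamic expert routing.

Method: 1. FDED module: residual-based iterative frequency decomposition that separates modality-specific semantics from structural features; 2. SCMC module: a mixture-of-experts architecture that dynamically optimizes cross-modal consistency.

Result: Achieves SOTA performance on three benchmark datasets, with visualizations confirming the effectiveness of the FDED and SCMC modules.

Insight: A frequency-domain perspective effectively models the contradictions between audio and visual modalities, and dynamic expert routing is an effective way to achieve cross-modal consistency.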

Abstract: Audio-visual segmentation (AVS) plays a critical role in multimodal machine learning by effectively integrating audio and visual cues to precisely segment objects or regions within visual scenes. Recent AVS methods have demonstrated significant improvements. However, they overlook the inherent frequency-domain contradictions between audio and visual modalities: the pervasively interfering noise in audio high-frequency signals versus the structurally rich details in visual high-frequency signals. Ignoring these differences can result in suboptimal performance. In this paper, we rethink the AVS task from a deeper perspective by reformulating it as a frequency-domain decomposition and recomposition problem. To this end, we introduce a novel Frequency-Aware Audio-Visual Segmentation (FAVS) framework consisting of two key modules: a Frequency-Domain Enhanced Decomposer (FDED) module and a Synergistic Cross-Modal Consistency (SCMC) module. The FDED module employs a residual-based iterative frequency decomposition to discriminate modality-specific semantics and structural features, and the SCMC module leverages a mixture-of-experts architecture to reinforce semantic consistency and modality-specific feature preservation through dynamic expert routing. Extensive experiments demonstrate that our FAVS framework achieves state-of-the-art performance on three benchmark datasets, and abundant qualitative visualizations further verify the effectiveness of the proposed FDED and SCMC modules. The code will be released as open source upon acceptance of the paper.

[104] xAI-CV: An Overview of Explainable Artificial Intelligence in Computer Vision

Nguyen Van Tu,Pham Nguyen Hai Long,Vo Hoai Viet

Main category: cs.CV

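TL;DR: The paper surveys four representative approaches to explainable artificial intelligence (xAI) in computer vision: saliency maps, concept bottleneck models, prototype-based methods, and hybrid approaches, analyzing their mechanisms, strengths and limitations, and evaluation metrics.

Details Motivation: The “black-box” nature of deep learning makes its decision process hard to interpret, limiting reliability in critical applications. xAI emerged to address this and to help humans understand how AI models make decisions.

Contribution: The paper systematically reviews four main xAI approaches in computer vision, analyzing and comparing their mechanisms to guide future research and applications.

Method: Summarizes four xAI approaches: (i) saliency maps, (ii) concept bottleneck models, (iii) prototype-based methods, and (iv) hybrid approaches, and analyzes their mechanisms and evaluation metrics.

Result: Shows the behavior and limitations of different xAI methods on visual perception tasks, providing a practical reference for researchers and practitioners.

Insight: Each xAI method has its own strengths in improving model transparency, but combining several may be more effective. Future work should address unified evaluation standards and the feasibility of real-world deployment.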

Abstract: Deep learning has become the de facto standard and dominant paradigm in image analysis tasks, achieving state-of-the-art performance. However, this approach often results in “black-box” models, whose decision-making processes are difficult to interpret, raising concerns about reliability in critical applications. To address this challenge and give humans a way to understand how AI models process information and make decisions, the field of xAI has emerged. This paper surveys four representative approaches in xAI for visual perception tasks: (i) Saliency Maps, (ii) Concept Bottleneck Models (CBM), (iii) Prototype-based methods, and (iv) Hybrid approaches. We analyze their underlying mechanisms, strengths and limitations, as well as evaluation metrics, thereby providing a comprehensive overview to guide future research and applications.

[105] LiDAR Point Cloud Image-based Generation Using Denoising Diffusion Probabilistic Models

Amirhesam Aghanouri,Cristina Olaverri-Monreal

Main category: cs.CV

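TL;DR: The paper proposes a method based on denoising diffusion probabilistic models (DDPM) that generates high-quality synthetic LiDAR point cloud data to improve the perception performance of autonomous vehicles (AVs).

Details Motivation: Collecting real LiDAR data is time-consuming and affected by noise and sparsity, especially under adverse weather or sensor limitations. Synthetic data generation can mitigate this, but existing methods still fall short in generating high-fidelity point clouds.

Contribution: 1. Proposes an enhanced DDPM with novel noise scheduling and time-step embedding techniques that improve generation quality; 2. The generated point clouds have rich spatial relationships and structural detail and outperform most existing baselines.

Method: 1. Generate LiDAR point clouds with a DDPM framework; 2. Introduce improved noise scheduling and time-step embedding to refine the denoising process and temporal awareness; 3. Evaluate the generated point clouds on the IAMCV and KITTI-360 datasets.

Result: Experiments show that the method outperforms most existing baselines under various configurations and effectively mitigates the effects of noise and sparsity in LiDAR data.

Insight: 1. DDPM shows promise for synthetic LiDAR data generation; 2. Noise scheduling and time-step embedding are key to point cloud quality; 3. Synthetic data can significantly improve AV perception tasks.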

Abstract: Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among sensors AVs use to create a comprehensive view of surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications impact the denoising process and the model’s temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model’s superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.
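For readers unfamiliar with what a noise schedule controls, the sketch below shows the standard DDPM forward process with a linear beta schedule. This is the textbook baseline, not the paper's novel schedule (which the abstract does not specify); the default `beta_start` and `beta_end` values and the idea of treating one row of a projected range image as `x0` are assumptions for illustration.

```python
import math
import random
from typing import List


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Standard linear schedule: beta rises evenly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = prod_{i <= t} (1 - beta_i): the signal fraction left at t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0: List[float], t: int, alpha_bars: List[float],
             rng: random.Random) -> List[float]:
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]


if __name__ == "__main__":
    betas = linear_beta_schedule(1000)
    alpha_bars = cumulative_alphas(betas)
    ranges = [12.5, 12.7, 30.1, 29.8]  # hypothetical range-image row (metres)
    rng = random.Random(0)
    for t in (0, 499, 999):
        print(t, round(alpha_bars[t], 5), q_sample(ranges, t, alpha_bars, rng))
```

A different schedule changes how quickly `alpha_bar_t` decays, and therefore how much structure the denoiser still sees at each step; that is the quantity a modified noise schedule would reshape.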

[106] Audio-Driven Universal Gaussian Head Avatars

Kartik Teotia,Helge Rhodin,Mohit Mendiratta,Hyeongwoo Kim,Marc Habermann,Christian Theobalt

Main category: cs.CV

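TL;DR: This paper presents the first audio-driven universal photorealistic head avatar synthesis method, combining a person-agnostic speech model with a new Universal Head Avatar Prior (UHAP); supervised with neutral scan data, UHAP captures identity details at high fidelity, yielding audio-visual synchronization and nuanced expression changes.

Details Motivation: Existing methods usually map audio features only to geometric deformations and ignore audio-dependent appearance variations; this paper uses UHAP to model both geometry and appearance.

Contribution: 1. Proposes the first universal audio-driven avatar synthesis method; 2. Designs the UHAP prior, which captures identity details at high fidelity; 3. Achieves lightweight personalization through a monocular encoder.

Method: 1. Combine a universal speech model with UHAP; 2. Train UHAP on cross-identity multi-view videos with neutral scan supervision; 3. Regress dynamic expression variations with a monocular encoder; 4. Decode with UHAP to generate high-quality avatars.

Result: The method outperforms existing geometry-only methods in lip-sync accuracy, image quality, and perceptual realism, producing highly realistic avatars.

Insight: The success of UHAP suggests that modeling geometry and appearance variations jointly is key for audio-driven avatar generation, and the monocular encoder offers an effective route to lightweight personalization.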

Abstract: We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes both geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject’s global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.

[107] No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

Matheus Vinícius Todescato,Joel Luís Carbonera

Main category: cs.CV

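TL;DR: The paper proposes an unsupervised zero-shot image classification framework that combines a vision-language model (VLM) with a pre-trained visual model to train a lightweight classifier dynamically, without annotated data.

Details Motivation: Deep learning typically relies on large annotated datasets, which are scarce in many practical scenarios. To address this, the authors propose an unsupervised classification method built on a VLM and a pre-trained visual model.

Contribution: Proposes a zero-shot image classification framework that requires no annotated data, combining a VLM and a pre-trained visual model in a self-learning cycle that trains the classifier dynamically.

Method: Uses a confidence-based pseudo-labeling strategy: the VLM identifies high-confidence samples, the pre-trained visual model enhances their visual representations, and a lightweight classifier is trained iteratively.

Result: Experiments on ten datasets show that the method outperforms the baseline zero-shot method, demonstrating strong unsupervised classification ability.

Insight: The method avoids VLM fine-tuning and the use of large language models, relying on a visual-only model to reduce dependence on semantic representations, which offers an efficient approach to unsupervised learning.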

Abstract: While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.
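One round of the self-learning cycle described above can be sketched as follows. This is a simplified illustration under stated assumptions: `vlm_probs` stands for per-class probabilities from any VLM, `features` for embeddings from any pre-trained visual model, the fixed `threshold` is hypothetical, and a nearest-centroid rule stands in for the paper's lightweight classifier, whose architecture the abstract does not specify.

```python
from typing import Dict, List, Tuple


def select_pseudo_labels(vlm_probs: List[List[float]],
                         threshold: float) -> List[Tuple[int, int]]:
    """Keep (sample index, predicted class) where the VLM is confident."""
    selected = []
    for idx, probs in enumerate(vlm_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((idx, label))
    return selected


def fit_centroids(features: List[List[float]],
                  pseudo: List[Tuple[int, int]]) -> Dict[int, List[float]]:
    """Stand-in classifier: one mean feature vector per pseudo-labeled class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for idx, label in pseudo:
        acc = sums.setdefault(label, [0.0] * len(features[idx]))
        for d, value in enumerate(features[idx]):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], feature: List[float]) -> int:
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], feature)))


if __name__ == "__main__":
    vlm_probs = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.10, 0.10, 0.80]]
    features = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0]]
    pseudo = select_pseudo_labels(vlm_probs, threshold=0.75)
    centroids = fit_centroids(features, pseudo)
    print(pseudo, [predict(centroids, f) for f in features])
```

In this toy run the second sample is too ambiguous for the VLM to label, but the visual features place it next to class 0, which is the kind of complementary cue the iterative scheme is meant to exploit.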

[108] Generative data augmentation for biliary tract detection on intraoperative images

Cristina Iacono,Mariarosaria Meola,Federica Conte,Laura Mecozzi,Umberto Bracale,Pietro Falco,Fanny Ficuciello

Main category: cs.CV

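TL;DR: By combining deep learning with generative adversarial network (GAN) techniques, the paper proposes a data augmentation method for bile duct detection on intraoperative images, aiming to help prevent bile duct injury.

Details Motivation: Laparoscopic cholecystectomy carries a higher risk of bile duct injury, so intraoperative visualization of the bile duct needs to improve. Existing methods depend on manually annotated data, which is scarce and costly to label.

Contribution: 1) Builds and annotates a biliary tract image database for training the Yolo detection algorithm; 2) Proposes GAN-based generative data augmentation to expand the training dataset.

Method: 1) Use the Yolo algorithm for bile duct detection; 2) Use a GAN to generate synthetic data that supplements the limited annotated real dataset.

Result: Experimental results reportedly show that generative data augmentation improves bile duct detection accuracy.

Insight: Generative data augmentation can ease the scarcity of annotated medical image data while improving model performance.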

Abstract: Cholecystectomy is one of the most frequently performed procedures in gastrointestinal surgery, and the laparoscopic approach is the gold standard for symptomatic cholecystolithiasis and acute cholecystitis. Alongside its advantages of a significantly faster recovery and better cosmetic results, the laparoscopic approach bears a higher risk of bile duct injury, which has a significant impact on quality of life and survival. To avoid bile duct injury, it is essential to improve the intraoperative visualization of the bile duct. This work aims to address this problem by leveraging a deep-learning approach for the localization of the biliary tract from white-light images acquired during the surgical procedures. To this end, an image database was constructed and annotated to train the Yolo detection algorithm. Besides classical data augmentation techniques, the paper proposes a Generative Adversarial Network (GAN) to generate a synthetic portion of the training dataset. Experimental results are discussed along with ethical considerations.

[109] Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

Honghao Chen,Xingzhou Lou,Xiaokun Feng,Kaiqi Huang,Xinlong Wang

Main category: cs.CV

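TL;DR: The paper proposes a fine-grained chain-of-step reasoning framework for vision-language models that addresses the limitations of existing methods in evaluating reasoning-chain quality and in reinforcement learning, with consistent improvements on benchmarks.

Details Motivation: Reasoning chains in existing vision-language models are usually too coarse-grained, making it hard to assess the quality of intermediate reasoning steps or to apply reinforcement learning effectively. The paper aims to make reasoning finer-grained and easier to evaluate.

Contribution: 1. Proposes a fine-grained chain-of-step reasoning framework comprising step-level reasoning data, a process reward model (PRM), and reinforcement learning training; 2. Achieves consistent improvements on vision-language benchmarks; 3. Provides detailed empirical analysis and component ablations.

Method: 1. Define step-level reasoning data to support fine-grained reasoning; 2. Design a process reward model (PRM) to assess the quality of each reasoning step; 3. Optimize reasoning chains with reinforcement learning.

Result: The framework delivers consistent improvements across multiple vision-language tasks, and ablation experiments confirm the contribution of each component.

Insight: 1. Fine-grained reasoning chains are easier to evaluate and optimize; 2. Combining reinforcement learning with inference-time scaling can substantially improve model performance; 3. The process reward model is key to optimizing reasoning chains.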

Abstract: Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.
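To illustrate how step-level rewards can drive inference-time scaling, the sketch below ranks several sampled reasoning chains by an aggregate of their per-step scores and keeps the best one. The scores are assumed to come from some process reward model; the aggregation rules (`min` and `mean`) and the function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def chain_score(step_rewards: Sequence[float], how: str = "min") -> float:
    """Collapse per-step rewards into one score for the whole chain.

    "min" treats a chain as only as good as its weakest step; "mean" averages
    over steps. Both are common choices, used here purely as assumptions.
    """
    if not step_rewards:
        raise ValueError("a reasoning chain needs at least one step")
    if how == "min":
        return min(step_rewards)
    if how == "mean":
        return sum(step_rewards) / len(step_rewards)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: List[Tuple[str, Sequence[float]]],
              how: str = "min") -> str:
    """Return the answer whose reasoning chain scores highest."""
    answer, _ = max(candidates, key=lambda c: chain_score(c[1], how))
    return answer


if __name__ == "__main__":
    # Each candidate: (final answer, hypothetical PRM reward for each step).
    candidates = [
        ("A", [0.90, 0.30, 0.95]),  # one weak intermediate step
        ("B", [0.70, 0.60, 0.70]),  # uniformly decent steps
    ]
    print(best_of_n(candidates, "min"), best_of_n(candidates, "mean"))
```

The two aggregations disagree on this toy input, which shows why a coarse, chain-level reward can hide a faulty intermediate step that step-level scoring exposes.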

[110] Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model

Ioannis Sarafis,Alexandros Papadopoulos,Anastasios Delopoulos

Main category: cs.CV

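TL;DR: The paper proposes a weakly supervised food image segmentation method that combines the strengths of Vision Transformers (ViTs) and the Segment Anything Model (SAM); the model is trained with image-level annotations only, without pixel-level labels, and CAMs generate the prompts for SAM.

Details Motivation: Food image segmentation matters for nutrition tracking and health applications, but conventional pixel-level annotation is expensive. The authors therefore use weak supervision to reduce annotation needs while exploiting SAM's zero-shot capability and the attention mechanisms of ViTs.

Contribution: 1. Proposes a weakly supervised segmentation framework combining a ViT and SAM; 2. Uses CAMs to generate prompts for SAM, reducing dependence on pixel-level annotation; 3. Studies how image preprocessing and multi-mask generation strategies affect SAM's segmentation quality.

Method: 1. Use a Swin Transformer (ViT) to generate CAMs as prompts for SAM; 2. Train the ViT on image-level annotations; 3. Combine image preprocessing with multi-mask generation strategies to refine SAM's output.

Result: On the FoodSeg103 dataset, the method generates an average of 2.4 masks per image (excluding background) and reaches an mIoU of 0.54 in the multi-mask scenario.

Insight: Combining SAM and a ViT under weak supervision can greatly reduce annotation cost while retaining useful segmentation performance, providing a practical tool for food image analysis.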

Abstract: In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
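The abstract does not say how a class activation map is turned into a prompt. One simple possibility, shown here purely as an assumption, is to take the strongest CAM cell for a class and map its centre to pixel coordinates, giving a single foreground point in the (x, y) convention that point-prompted segmenters typically expect. The function name and the coordinate convention are hypothetical.

```python
from typing import List, Tuple


def cam_to_point_prompt(cam: List[List[float]],
                        image_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map the peak of a low-resolution CAM to an (x, y) pixel coordinate.

    cam        -- 2D activation map for one class, indexed as cam[row][col]
    image_size -- (height, width) of the original image in pixels
    """
    rows, cols = len(cam), len(cam[0])
    # Locate the cell with the highest activation.
    peak_row, peak_col = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: cam[rc[0]][rc[1]],
    )
    img_h, img_w = image_size
    # Use the centre of the winning cell, rescaled to image resolution.
    x = (peak_col + 0.5) * img_w / cols
    y = (peak_row + 0.5) * img_h / rows
    return (x, y)


if __name__ == "__main__":
    cam = [[0.1, 0.2],
           [0.9, 0.3]]
    print(cam_to_point_prompt(cam, image_size=(100, 200)))
```

A practical pipeline would likely use several points or a box per class and might threshold the CAM first; a single peak is the smallest example that shows the mapping from activation space to prompt space.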

[111] A DyL-Unet framework based on dynamic learning for Temporally Consistent Echocardiographic Segmentation

Jierui Qu,Jianchun Zhao

Main category: cs.CV

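TL;DR: DyL-UNet is a dynamic learning-based architecture that uses an Echo-Dynamics Graph and Cardiac Phase-Dynamics Attention to achieve temporally consistent echocardiographic segmentation while maintaining accuracy.

Details Motivation: Deformation and speckle noise in echocardiography cause frame-to-frame segmentation jitter, which weakens functional estimates and clinical interpretability; a temporally stable segmentation method is needed.

Contribution: 1) Proposes the DyL-UNet framework, combining dynamic learning with a Swin-Transformer encoder-decoder structure; 2) Introduces the Echo-Dynamics Graph (EDG) and Cardiac Phase-Dynamics Attention (CPDA) to strengthen temporal consistency.

Method: 1) Build the EDG to extract dynamic information; 2) Process single frames with multiple Swin-Transformer branches; 3) Apply CPDA at the skip connections to fuse dynamic features with cardiac-phase cues.

Result: On the CAMUS and EchoNet-Dynamic datasets, DyL-UNet maintains accuracy comparable to existing methods while achieving superior temporal consistency.

Insight: Combining dynamic learning with phase attention is an effective way to address temporal consistency in medical video segmentation.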

Abstract: Accurate segmentation of cardiac anatomy in echocardiography is essential for cardiovascular diagnosis and treatment. Yet echocardiography is prone to deformation and speckle noise, causing frame-to-frame segmentation jitter. Even with high accuracy in single-frame segmentation, temporal instability can weaken functional estimates and impair clinical interpretability. To address these issues, we propose DyL-UNet, a dynamic learning-based temporal consistency U-Net segmentation architecture designed to achieve temporally stable and precise echocardiographic segmentation. The framework constructs an Echo-Dynamics Graph (EDG) through dynamic learning to extract dynamic information from videos. DyL-UNet incorporates multiple Swin-Transformer-based encoder-decoder branches for processing single-frame images. It further introduces Cardiac Phase-Dynamics Attention (CPDA) at the skip connections, which uses EDG-encoded dynamic features and cardiac-phase cues to enforce temporal consistency during segmentation. Extensive experiments on the CAMUS and EchoNet-Dynamic datasets demonstrate that DyL-UNet maintains segmentation accuracy comparable to existing methods while achieving superior temporal consistency, providing a reliable solution for automated clinical echocardiography.

[112] 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

Alexey Nekrasov,Ali Athar,Daan de Geus,Alexander Hermans,Bastian Leibe

Main category: cs.CV

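TL;DR: Sa2VA-i is an improved version of Sa2VA that resolves inconsistencies between training and inference, substantially improving results on several video segmentation benchmarks.

Details Motivation: Sa2VA falls short of its potential on language-guided dense video segmentation; the authors identify inconsistencies between training and inference as the main cause.

Contribution: Proposes Sa2VA-i, which fixes the training-inference inconsistencies, substantially improves performance, and sets a new state of the art on multiple benchmarks.

Method: Revises Sa2VA's training and inference procedures so that they are consistent.

Result: Gains of up to +11.6 J&F on MeViS, with further improvements on other benchmarks; the Sa2VA-i-1B model performs on par with the original Sa2VA-26B on MeViS.

Insight: Seemingly trivial implementation details matter, which is a useful lesson for the referring video segmentation field.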

Abstract: Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i

[113] Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

Ganesh Mallya,Yotam Gigi,Dahun Kim,Maxim Neumann,Genady Beryozkin,Tomer Shekel,Anelia Angelova

Main category: cs.CV

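TL;DR: The paper proposes a zero-shot multi-spectral learning approach that lets generalist multimodal models such as Gemini 2.5 handle multi-spectral data without additional training, improving efficiency and adaptability for remote sensing applications.

Details Motivation: Multi-spectral data is essential in remote sensing, but dedicated models are costly to train, and generalist multimodal models cannot process multi-spectral inputs. The authors aim to let generalist models use multi-spectral data effectively.

Contribution: A training-free method that converts multi-spectral data into inputs a generalist multimodal model (such as Gemini 2.5) can process and injects domain information through instructions, yielding zero-shot performance gains.

Method: Map multi-spectral inputs into the model's visual space and inject domain knowledge via instructions. Experiments use Gemini 2.5 to evaluate zero-shot performance on remote sensing tasks.

Result: On land cover and land use classification, the method shows strong zero-shot gains, demonstrating that Gemini 2.5 adapts readily to multi-spectral inputs.

Insight: Generalist multimodal models can be extended through simple input adaptation and instruction injection, offering an efficient route for remote sensing tasks with non-standard inputs such as multi-spectral data.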

Abstract: Multi-spectral imagery plays a crucial role in diverse Remote Sensing applications including land-use classification, environmental monitoring and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation. This allows for more accurate identification, and their public availability from missions, such as Sentinel-2 and Landsat, only adds to their value. Currently, the automatic analysis of such data is predominantly managed through machine learning models specifically trained for multi-spectral input, which are costly to train and support. Furthermore, although providing a lot of utility for Remote Sensing, such additional inputs cannot be used with powerful generalist large multimodal models, which are capable of solving many visual problems, but are not able to understand specialized multi-spectral signals. To address this, we propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode, as inputs to generalist multimodal models, trained on RGB-only inputs. Our approach leverages the multimodal models’ understanding of the visual space: it adapts the inputs to that space and injects domain-specific information into the model as instructions. We exemplify this idea with the Gemini 2.5 model and observe strong Zero-Shot performance gains of the approach on popular Remote Sensing benchmarks for land cover and land use classification and demonstrate the easy adaptability of Gemini 2.5 to new inputs. These results highlight the potential for geospatial professionals, working with non-standard specialized inputs, to easily leverage powerful multimodal models, such as Gemini 2.5, to accelerate their work, benefiting from their rich reasoning and contextual capabilities, grounded in the specialized sensor data.

[114] Investigating Traffic Accident Detection Using Multimodal Large Language Models

Ilhan Skender,Kailin Tong,Selim Solmaz,Daniel Watzenig

Main category: cs.CV

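TL;DR: This paper studies the zero-shot ability of multimodal large language models (MLLMs) to detect and describe traffic accidents from infrastructure camera images, reducing reliance on labeled data.

Details Motivation: Traffic accident detection is critical to public safety, but existing methods depend on large labeled datasets. The paper explores the zero-shot capabilities of MLLMs and combines them with advanced visual analytics to improve detection.

Contribution: 1. Evaluates MLLMs on the simulated DeepAccident dataset; 2. compares the performance of Gemini, Gemma, and other models; 3. integrates YOLO, Deep SORT, and SAM to improve model accuracy and explainability.

Method: Uses MLLMs (e.g., Pixtral, Gemini) for zero-shot detection, and builds enhanced prompts from YOLO, Deep SORT, and SAM outputs to improve model responses.

Result: Pixtral performs best (F1 = 0.71, recall 83%); the Gemini models gain precision with enhanced prompts at the cost of F1 and recall; Gemma 3 is the most balanced.

Insight: Combining MLLMs with visual analytics makes them more applicable to real-world traffic monitoring systems, and zero-shot capability offers a way to reduce dependence on labeled data.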

Abstract: Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3 and Pixtral models in accident identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
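
The abstract reports an F1-score and a recall for Pixtral but no precision. Since F1 is the harmonic mean of precision and recall, the missing value can be estimated by inverting the formula. The sketch below does this under the assumption that both reported figures come from the same evaluation run and are rounded; the helper names are illustrative, and the resulting number is an inference, not a figure from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denom


# Reported for Pixtral: F1 = 0.71, recall = 0.83 (rounded values).
p = implied_precision(0.71, 0.83)
print(round(p, 2))  # roughly 0.62
```

A precision near 0.62 alongside 83% recall is consistent with a model that flags accidents liberally, which is the opposite trade-off from the one described for the Gemini models.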

[115] Track-On2: Enhancing Online Point Tracking with Memory

Görkay Aydemir,Weidi Xie,Fatma Güney

Main category: cs.CV

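TL;DR: Track-On2 is an improved version of Track-On for online point tracking. Architectural refinements to its transformer, more effective memory use, and better synthetic training strategies yield higher performance and efficiency.

Details Motivation: Long-term point tracking must identify points consistently despite appearance changes, motion, and occlusion; the online setting targets real-time and streaming applications.

Contribution: 1. Proposes Track-On2, a transformer-based model for online long-term tracking; 2. improves memory use and synthetic training strategies; 3. reaches state-of-the-art results on multiple benchmarks.

Method: 1. Processes frames causally and maintains temporal coherence through a memory mechanism; 2. at inference, performs coarse patch-level classification followed by refinement; 3. systematically studies how synthetic training setups affect memory behavior.

Result: Surpasses prior online trackers on five synthetic and real-world benchmarks, and even outperforms strong offline methods that exploit bidirectional context.

Insight: Causal, memory-based architectures trained purely on synthetic data are an effective solution for real-world point tracking.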

Abstract: In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e. tracking points frame-by-frame, making it suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. Through comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2

[116] NeuCODEX: Edge-Cloud Co-Inference with Spike-Driven Compression and Dynamic Early-Exit

Maurf Hassan,Steven Davy,Muhammad Zawish,Owais Bin Zuber,Nouman Ashraf

Main category: cs.CV

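TL;DR: NeuCODEX is a neuromorphic co-inference architecture that jointly optimizes spatial and temporal redundancy, combining a spike-driven compression module with a dynamic early-exit mechanism to substantially cut data transfer and edge energy consumption.

Details Motivation: Fixed, high timestep overheads make full SNN inference on edge devices costly in latency and energy. Edge-cloud co-inference is promising but is limited by high latency and feature transmission costs.

Contribution: 1) A spike-driven compression module that reduces data transmission; 2) a dynamic early-exit mechanism that adaptively terminates inference; 3) efficient deployment on both static images and event streams.

Method: NeuCODEX combines learned spike-driven compression, which reduces the data to transmit, with dynamic early exit, which terminates inference adaptively based on output confidence.

Result: Data transfer is reduced by up to 2048x, edge energy consumption by over 90%, and end-to-end latency by up to 3x compared to edge-only inference, with an accuracy drop of less than 2%.

Insight: Jointly optimizing spatial and temporal redundancy offers an efficient way to deploy SNNs in resource-constrained environments.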

Abstract: Spiking Neural Networks (SNNs) offer significant potential for enabling energy-efficient intelligence at the edge. However, performing full SNN inference at the edge can be challenging due to the latency and energy constraints arising from fixed and high timestep overheads. Edge-cloud co-inference systems present a promising solution, but their deployment is often hindered by high latency and feature transmission costs. To address these issues, we introduce NeuCODEX, a neuromorphic co-inference architecture that jointly optimizes both spatial and temporal redundancy. NeuCODEX incorporates a learned spike-driven compression module to reduce data transmission and employs a dynamic early-exit mechanism to adaptively terminate inference based on output confidence. We evaluated NeuCODEX on both static images (CIFAR10 and Caltech) and neuromorphic event streams (CIFAR10-DVS and N-Caltech). To demonstrate practicality, we prototyped NeuCODEX on ResNet-18 and VGG-16 backbones in a real edge-to-cloud testbed. Our proposed system reduces data transfer by up to 2048x and edge energy consumption by over 90%, while reducing end-to-end latency by up to 3x compared to edge-only inference, all with a negligible accuracy drop of less than 2%. In doing so, NeuCODEX enables practical, high-performance SNN deployment in resource-constrained environments.
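
To make the idea of confidence-based early exit concrete, here is a minimal sketch of terminating inference over SNN timesteps once the output is confident enough. It is not NeuCODEX's actual mechanism, which the abstract does not detail: the softmax-over-accumulated-spike-counts confidence measure, the threshold value, and the function names are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_inference(spike_counts_per_step, threshold=0.9):
    """Accumulate per-class output spikes over timesteps and stop early.

    spike_counts_per_step: non-empty list with one entry per timestep, each
    a list of output spike counts per class (a hypothetical interface).
    Returns (predicted_class, timesteps_used).
    """
    accumulated = [0.0] * len(spike_counts_per_step[0])
    for step, counts in enumerate(spike_counts_per_step, start=1):
        accumulated = [a + c for a, c in zip(accumulated, counts)]
        probs = softmax(accumulated)
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining timesteps
    return probs.index(confidence), step


steps = [[1, 0, 0], [2, 0, 0], [3, 0, 0], [1, 0, 0]]
print(early_exit_inference(steps, threshold=0.9))
```

With these toy inputs the loop stops after the second of four timesteps, which is the kind of saving in temporal redundancy that an early-exit scheme targets; a stricter threshold trades more timesteps for higher confidence.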

[117] RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions

Yun Wang,Junjie Hu,Junhui Hou,Chenghao Zhang,Renwei Yang,Dapeng Oliver Wu

Main category: cs.CV

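TL;DR: RoSe is a robust self-supervised stereo matching method for adverse weather. It injects priors from a visual foundation model and introduces scene correspondence priors to improve both feature representation and supervisory signals.

Details Motivation: Existing self-supervised stereo matching methods degrade significantly in adverse weather, mainly because feature extractors struggle with degraded regions (such as reflective or textureless areas) and because supervision based solely on the photometric consistency assumption breaks down.

Contribution: 1. Injects robust priors from a visual foundation model into the CNN feature extractor; 2. introduces scene correspondence priors to build robust supervisory signals; 3. creates synthetic stereo datasets with adverse weather; 4. proposes a robust self-supervised training paradigm comprising scene correspondence learning and adverse weather distillation.

Method: 1. Improve the feature extractor using a visual foundation model; 2. establish scene correspondence priors from the synthetic datasets; 3. train in two steps: robust self-supervised scene correspondence learning, then adverse weather distillation.

Result: Experiments show that RoSe outperforms existing self-supervised methods under adverse weather, with greater robustness and generalization. The code is publicly available.

Insight: 1. Priors from a visual foundation model can improve feature extraction under adverse weather; 2. scene correspondence priors are better suited than the photometric consistency assumption alone as supervision under adverse weather; 3. synthetic datasets play a key role in training.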

Abstract: Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Code is available at https://github.com/cocowy1/RoSe-Robust-Self-supervised-Stereo-Matching-under-Adverse-Weather-Conditions.

[118] The 1st Solution for MOSEv2 Challenge 2025: Long-term and Concept-aware Video Segmentation via SeC

Mingqi Gao,Jingkun Chen,Yunqi Miao,Gengshen Wu,Zhijin Qin,Jungong Han

Main category: cs.CV

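TL;DR: This report presents the first-place solution to the MOSEv2 challenge. By analysing and adapting SeC, an enhanced SAM-2 framework, it studies long-term memory and concept-aware memory, which address occlusion, reappearance, and distractors in video object segmentation.

Details Motivation: The MOSEv2 track of the LSVOS Challenge targets complex semi-supervised video object segmentation (VOS); the authors aim to address its core challenges, such as long-term occlusion and target reappearance.

Contribution: Analyses and adapts SeC, a framework that combines long-term memory with concept-aware memory, improving performance on complex VOS.

Method: Builds on SeC, an enhanced SAM-2 framework, in which long-term memory preserves temporal continuity and concept-aware memory supplies semantic priors that suppress distractors.

Result: Achieves a JF score of 39.89% on the MOSEv2 test set, ranking 1st.

Insight: Combining long-term memory with concept-aware memory substantially helps on complex VOS tasks and suggests directions for future work.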

Abstract: This technical report explores the MOSEv2 track of the LSVOS Challenge, which targets complex semi-supervised video object segmentation. By analysing and adapting SeC, an enhanced SAM-2 framework, we conduct a detailed study of its long-term memory and concept-aware memory, showing that long-term memory preserves temporal continuity under occlusion and reappearance, while concept-aware memory supplies semantic priors that suppress distractors; together, these traits directly benefit several of MOSEv2’s core challenges. Our solution achieves a JF score of 39.89% on the test set, ranking 1st in the MOSEv2 track of the LSVOS Challenge.

[119] Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Yueyan Li,Chenggong Zhao,Zeyuan Zang,Caixia Yuan,Xiaojie Wang

Main category: cs.CV

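TL;DR: The paper analyses two pathways of visual processing in vision-language models (VLMs), object recognition and spatial perception, revealing a layered processing mechanism and the geometric structure of positional representations, and proposes a token compression algorithm and a RoPE scaling technique to improve model performance.

Details Motivation: Existing VLMs process images by serializing them, which differs markedly from the parallel nature of human vision, and their opaque internal mechanisms limit both deeper understanding and architectural innovation.

Contribution: 1. Decomposes visual processing into object recognition and spatial perception and studies each separately; 2. reveals the two-stage layered mechanism of object recognition and the geometric structure of positional representation; 3. proposes a token compression algorithm and a RoPE scaling technique to improve decoding efficiency and spatial reasoning.

Method: 1. Convert images into text token maps to analyse the layered mechanism of object recognition; 2. theoretically derive and empirically verify the geometric structure of positional representation in VLMs; 3. design a plug-and-play visual decoder for token compression and use RoPE scaling to improve spatial reasoning.

Result: Experiments validate the analyses, deepening the understanding of VLM internals and providing clear principles for future architecture design.

Insight: 1. Visual processing in VLMs can be split into object recognition and spatial perception pathways; 2. object recognition is a two-stage process, from attribute recognition in shallow layers to semantic disambiguation in deep layers; 3. token compression improves decoding efficiency, and RoPE scaling improves spatial reasoning.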

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the “what” and “where” pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model’s perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
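
For readers unfamiliar with rotary position embeddings, the sketch below shows standard RoPE applied to a single vector, with a generic multiplicative `scale` on the position index. This is only a common form of RoPE scaling used here as an assumption; the abstract does not specify the paper's actual scaling rule, and the function names are illustrative.

```python
import math


def rope(vec, position, scale=1.0, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    scale multiplies the position index before rotation (a generic
    interpolation-style scaling, assumed here for illustration).
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even embedding dimension"
    out = []
    for i in range(0, dim, 2):
        theta = (position * scale) * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Attention scores depend only on the relative offset (here 2 in both cases).
print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 5), rope(k, 3)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes the geometry of RoPE amenable to analysis; a `scale` below 1 compresses positions so that distant tokens rotate as if they were closer together.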

[120] Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Ioanna Ntinou,Alexandros Xenos,Yassine Ouali,Adrian Bulat,Georgios Tzimiropoulos

Main category: cs.CV

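TL;DR: The paper proposes a vision-free, single-encoder retrieval method that performs retrieval over textual scene descriptions, avoiding the modality gap of traditional dual-encoder models while being cheap to compute and privacy-friendly.

Details Motivation: Contrastive vision-language models such as CLIP show shallow language understanding and a pronounced modality gap, and rely on large-scale data and expensive computation. The paper addresses these problems with a vision-free retrieval scheme.

Contribution: A vision-free, single-encoder retrieval pipeline that moves to a text-to-text paradigm, replacing raw images with VLLM-generated descriptions; this substantially reduces the modality gap and improves compositionality and retrieval performance.

Method: Uses VLLM-generated structured image descriptions to build a text-to-text retrieval paradigm, requiring only a few hours of calibration on two GPUs.

Result: The method achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters.

Insight: Textual image descriptions can effectively stand in for raw images, reducing the modality gap while lowering computational cost and privacy risk.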

Abstract: Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP

[121] Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

Israfel Salazar,Desmond Elliott,Yova Kementchedjhieva

Main category: cs.CV

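TL;DR: The paper studies the relationship between compositionality and long-caption understanding in vision-language models (VLMs), finds that each benefits the other, and shows experimentally that data quality and model design are critical to these gains.

Details Motivation: Contrastive vision-language models have made significant progress in binding visual and textual information, but understanding long, dense captions remains a challenge. The authors hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding long captions.

Contribution: Reveals a bidirectional relationship between compositional training and long-caption understanding, and shows that it is sensitive to data quality and model design. The paper also offers a practical strategy: joint training on high-quality long-caption data.

Method: The authors train and evaluate a range of models targeting compositionality and long-caption understanding, and analyse how training strategies such as freezing positional embeddings affect performance.

Result: Compositional training improves long-caption retrieval, and long-caption training promotes compositionality, but the gains depend on high-quality data and sound model design.

Insight: Compositional understanding and long-caption understanding are intertwined capabilities that can be improved together by training on high-quality data, which offers practical guidance for improving VLM generalization.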

Abstract: Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, but understanding long, dense captions remains an open challenge. We hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding longer captions. In this paper, we investigate the interaction between compositionality and long-caption understanding, asking whether training for one property enhances the other. We train and evaluate a range of models that target each of these capabilities. Our results reveal a bidirectional relationship: compositional training improves performance on long-caption retrieval, and training on long captions promotes compositionality. However, these gains are sensitive to data quality and model design. We find that training on poorly structured captions, or with limited parameter updates, fails to support generalization. Likewise, strategies that aim at retaining general alignment, such as freezing positional embeddings, do not improve compositional understanding. Overall, we find that compositional understanding and long-caption understanding are intertwined capabilities that can be jointly learned through training on dense, grounded descriptions. Despite these challenges, we show that models trained on high-quality, long-caption data can achieve strong performance in both tasks, offering practical guidance for improving VLM generalization.

[122] Enabling Plant Phenotyping in Weedy Environments using Multi-Modal Imagery via Synthetic and Generated Training Data

Earl Ranario,Ismael Mayanja,Heesup Yun,Brian N. Bailey,J. Mason Earles

Main category: cs.CV

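TL;DR: The study combines synthetic data, a small number of real annotations, and a generative adversarial network (GAN) to improve plant segmentation in multi-modal (thermal) imagery of weedy environments. Combining synthetic data with a few real images yields clear segmentation gains.

Details Motivation: Plant phenotyping is difficult in weedy environments, especially in thermal images, where low contrast between plants and weeds and frequent occlusions hinder accurate segmentation.

Contribution: 1. A framework combining synthetic RGB imagery, a limited set of real annotations, and GAN-based cross-modality alignment; 2. evidence that synthetic data plus a few real annotations significantly improves segmentation.

Method: 1. Train models on 1,128 synthetic RGB images to generate segmentation masks; 2. add a small number of real annotated images; 3. align RGB and thermal modalities using CycleGAN-turbo.

Result: Compared with the full real-data baseline, the maximum relative improvement is 22% for the weed class and 17% for the plant class. Cross-modal alignment further improves robustness.

Insight: Combining synthetic and real data with generative models offers an efficient approach to multi-modal image segmentation in complex scenes.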

Abstract: Accurate plant segmentation in thermal imagery remains a significant challenge for high-throughput field phenotyping, particularly in outdoor environments where low contrast between plants and weeds and frequent occlusions hinder performance. To address this, we present a framework that leverages synthetic RGB imagery, a limited set of real annotations, and GAN-based cross-modality alignment to enhance semantic segmentation in thermal images. We trained models on 1,128 synthetic images containing complex mixtures of crop and weed plants in order to generate image segmentation masks for crop and weed plants. We additionally evaluated the benefit of integrating as few as five real, manually segmented field images within the training process using various sampling strategies. When combining all the synthetic images with a few labeled real images, we observed a maximum relative improvement of 22% for the weed class and 17% for the plant class compared to the full real-data baseline. Cross-modal alignment was enabled by translating RGB to thermal using CycleGAN-turbo, allowing robust template matching without calibration. Results demonstrated that combining synthetic data with limited manual annotations and cross-domain translation via generative models can significantly boost segmentation performance in complex field environments for multi-modal imagery.

[123] MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

Tongshuai Wu,Chao Lu,Ze Song,Yunlong Lin,Sizhe Fan,Xuemei Chen

Main category: cs.CV

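TL;DR: The paper proposes a Multi-scale Feature Interaction Network (MsFIN) for early anticipation of traffic accidents from dashcam video; multi-scale feature aggregation, temporal feature processing, and a post-fusion stage improve both prediction correctness and earliness.

Details Motivation: Existing accident anticipation models struggle with occluded traffic participants in dashcam views and with complex behavioral cues that unfold over multiple time scales, so a method is needed that models multi-scale feature interactions and their temporal evolution.

Contribution: 1) A Multi-scale Module and a Transformer architecture for multi-scale feature aggregation and interaction; 2) a temporal feature processing module that captures the causal evolution of scene and object features; 3) a multi-scale post-fusion stage that produces a comprehensive risk representation.

Method: MsFIN has three parts: multi-scale feature aggregation (a Multi-scale Module extracts short-term, mid-term, and long-term features), temporal feature processing (capturing feature evolution under causal constraints), and multi-scale post fusion (fusing scene and object features).

Result: On the DAD and DADA datasets, MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module.

Insight: Multi-scale feature fusion and contextual interaction modeling are key to better accident anticipation; causal temporal constraints help capture how behavior evolves before an accident.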

Abstract: With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing complex, asynchronous multi-temporal behavioral cues preceding accidents. To deal with these two challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term and long-term temporal scales. Meanwhile, the Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module in MsFIN, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.

[124] Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Zijun Wei,Aditya Grover,Jason Kuen

Main category: cs.CV

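TL;DR: Lavida-O is a unified multimodal Masked Diffusion Model (MDM) for image understanding and generation, with new capabilities such as object grounding, image editing, and high-resolution image synthesis.

Details Motivation: Existing methods support only simple image-level understanding and low-resolution image generation; Lavida-O aims to handle more complex tasks with a single model and to use its understanding capabilities to improve generation and editing.

Contribution: The first unified MDM that uses its understanding capabilities to improve generation and editing through planning and iterative self-reflection; introduces an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling.

Method: A masked diffusion model combined with the Elastic Mixture-of-Transformer architecture and stratified sampling, supporting unified task modeling and efficient training and inference.

Result: Outperforms existing autoregressive and continuous diffusion models (such as Qwen2.5-VL) on benchmarks including RefCOCO, GenEval, and ImgEdit, with considerable inference speedup.

Insight: Multimodal tasks can be modeled in a unified way with diffusion models; coupling understanding with generation further improves generation quality, and the elastic architecture improves efficiency and scalability.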

Abstract: We propose Lavida-O, a unified multi-modal Masked Diffusion Model (MDM) capable of image understanding and generation tasks. Unlike existing multimodal diffusion language models such as MMaDa and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To allow effective and efficient training and sampling, Lavida-O introduces many novel techniques such as the Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference.

[125] ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

Benedetta Liberatori,Alessandro Conti,Lorenzo Vaquero,Yiming Wang,Elisa Ricci,Paolo Rota

Main category: cs.CV

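TL;DR: The paper introduces ConViS, a video similarity estimation task based on semantic concepts, together with the ConViS-Bench benchmark; it uses large multimodal models to assess video similarity along multiple dimensions and reports how their performance differs.

Details Motivation: Existing models rely on global similarity scores and cannot compare videos along multiple aspects (such as actions or locations) the way humans do, which limits progress in video understanding.

Contribution: 1. Proposes the ConViS task, which computes interpretable similarity scores over predefined semantic concepts; 2. releases the ConViS-Bench benchmark of annotated video pairs with concept-level scores; 3. benchmarks models and shows that some concepts are harder for similarity estimation than others.

Method: Uses Large Multimodal Models (LMMs) to compare video pairs through natural language, quantifying similarity per semantic concept (such as action or location) and comparing the results with human annotations.

Result: Models differ significantly in performance on ConViS, and some concepts (such as abstract scenes) are more challenging for them.

Insight: The ConViS task and benchmark open a new direction for language-driven video understanding and highlight the importance of fine-grained semantic concepts.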

Abstract: What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.

[126] Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

Gabriel Maldonado,Narges Rashvand,Armin Danesh Pazho,Ghazal Alinezhad Noghre,Vinit Katariya,Hamed Tabkhi

Main category: cs.CV

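TL;DR: The paper proposes an adversarially-refined VQ-GAN framework with dense motion tokenization that compresses spatio-temporal heatmaps efficiently while preserving fine-grained motion traces. It clearly outperforms the baseline and quantifies the vocabulary size that different levels of motion complexity require.

Details Motivation: Continuous human motion understanding is a core challenge in computer vision because of its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analysing complex motion dynamics.

Contribution: 1. An adversarially-refined VQ-GAN framework; 2. a dense motion tokenization technique; 3. a quantitative link between motion complexity and tokenization requirements.

Method: 1. Combine dense motion tokenization with adversarial refinement to eliminate reconstruction artifacts; 2. validate the model on the CMU Panoptic dataset.

Result: The method outperforms the baseline by 9.31% SSIM and reduces temporal instability by 37.1%. 2D and 3D motion require vocabularies of 128 and 1024 tokens, respectively.

Insight: Motion complexity and tokenization requirements are closely linked: 2D motion can be represented with a small vocabulary, while 3D motion needs a larger one for faithful reconstruction.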

Abstract: Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method’s superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion’s complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.
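
The vocabulary sizes mentioned above (128 and 1024 tokens) refer to the size of a vector-quantization codebook. As a minimal sketch of that step only, the code below maps feature vectors to the index of their nearest codebook entry and back. It omits the encoder, decoder, and adversarial training, and the toy codebook and function names are assumptions for illustration rather than the paper's implementation.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (L2)."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Replace each token id with its codebook vector."""
    return [codebook[t] for t in tokens]


# Toy 3-entry codebook; the abstract reports 128 or 1024 entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, -0.2], [0.9, 1.2], [3.0, 3.5]]
tokens = quantize(features, codebook)
print(tokens, dequantize(tokens, codebook))
```

A larger codebook lowers the quantization error at the cost of a bigger token vocabulary, which is the trade-off behind the 2D versus 3D finding.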

[127] Moving by Looking: Towards Vision-Driven Avatar Motion Generation

Markos Diomataris,Berat Mert Albaba,Giorgio Becherini,Partha Ghosh,Omid Taheri,Michael J. Black

Main category: cs.CV

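TL;DR: This paper presents CLOPS, the first human avatar that perceives its surroundings and navigates using egocentric vision alone. It generates human-like behavior by decoupling the learning of low-level motion skills from high-level visual control.

Details Motivation: Current human motion generation methods neglect the interdependency between perception and movement, and rely on perception that differs radically from that of humans. The authors argue that human-like avatar behavior requires human-like perception, and therefore use egocentric vision as the driver of motion.

Contribution: Presents CLOPS, the first human avatar driven solely by egocentric vision, and shows how decoupling motion-skill learning from the mapping of visual input to control yields motion with human-like characteristics.

Method: 1. Train a motion prior model on a large motion capture dataset; 2. train a policy with Q-learning that maps egocentric visual input to high-level control commands for the motion prior.

Result: Experiments show that egocentric vision gives rise to human-like motion characteristics, such as avoiding obstacles in the visual field.

Insight: Equipping avatars with human-like sensors, particularly egocentric vision, holds promise for training avatars that behave more like real humans.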

Abstract: The way we perceive the world fundamentally shapes how we move, whether it is how we navigate in a room or how we interact with other humans. Current human motion generation methods neglect this interdependency and use task-specific “perception” that differs radically from that of humans. We argue that the generation of human-like avatar behavior requires human-like perception. Consequently, in this work we present CLOPS, the first human avatar that solely uses egocentric vision to perceive its surroundings and navigate. Using vision as the primary driver of motion, however, gives rise to a significant challenge for training avatars: existing datasets have either isolated human motion, without the context of a scene, or lack scale. We overcome this challenge by decoupling the learning of low-level motion skills from the learning of high-level control that maps visual input to motion. First, we train a motion prior model on a large motion capture dataset. Then, a policy is trained using Q-learning to map egocentric visual inputs to high-level control commands for the motion prior. Our experiments empirically demonstrate that egocentric vision can give rise to human-like motion characteristics in our avatars. For example, the avatars walk such that they avoid obstacles present in their visual field. These findings suggest that equipping avatars with human-like sensors, particularly egocentric vision, holds promise for training avatars that behave like humans.
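
The Q-learning step mentioned in the abstract can be sketched in its simplest tabular form. This is only an illustration under strong assumptions: the paper's policy takes egocentric images as input and is unlikely to be tabular, and the state names, the action set, and the reward values below are hypothetical.

```python
from collections import defaultdict

# Hypothetical high-level commands that a motion prior could consume.
ACTIONS = ["forward", "turn_left", "turn_right"]

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

def greedy_action(q, state):
    """Pick the action with the highest current Q-value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

q = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# Toy experience: walking into an obstacle is penalised, turning away is rewarded.
q_update(q, "obstacle_ahead", "forward", -1.0, "collision")
q_update(q, "obstacle_ahead", "turn_left", 1.0, "path_clear")

chosen = greedy_action(q, "obstacle_ahead")
```

After these two updates the greedy choice in the `obstacle_ahead` state is to turn, which mirrors the obstacle-avoidance behavior the abstract reports.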

[128] Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

Sherwin Bahmani,Tianchang Shen,Jiawei Ren,Jiahui Huang,Yifeng Jiang,Haithem Turki,Andrea Tagliasacchi,David B. Lindell,Zan Gojcic,Sanja Fidler,Huan Ling,Jun Gao,Xuanchi Ren

Main category: cs.CV

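TL;DR: The paper proposes Lyra, a self-distillation framework that distills the implicit 3D knowledge of a video diffusion model into an explicit 3D Gaussian Splatting (3DGS) representation without multi-view training data. It generates 3D scenes from a text prompt or a single image and extends to dynamic scenes.

Details Motivation: Existing learning-based 3D reconstruction methods depend on captured multi-view data, while video diffusion models, despite their strong generative ability, are limited by their 2D nature in physical AI applications. The paper aims to close this gap.

Contribution: A self-distillation framework that extracts the implicit 3D knowledge of video diffusion models into a 3DGS representation, requires no real multi-view data, and supports both static and dynamic 3D scene generation.

Method: The typical RGB decoder is augmented with a 3DGS decoder that is supervised by the RGB decoder's output and trained purely on synthetic data generated by the video diffusion model, which enables 3D scene generation from text or a single image.

Result: Experiments show state-of-the-art performance in static and dynamic 3D scene generation.

Insight: Self-distillation can extend the capabilities of 2D generative models to 3D, offering a new approach to virtual environment generation.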

Abstract: The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.

[129] VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Weijie Wang,Yeqing Chen,Zeyu Zhang,Hengyu Liu,Haoxiao Wang,Zhiyuan Feng,Wenkang Qin,Zheng Zhu,Donny Y. Chen,Bohan Zhuang

Main category: cs.CV

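TL;DR: VolSplat proposes a voxel-aligned paradigm for feed-forward 3D Gaussian prediction, addressing the limitations of conventional pixel-aligned methods: weak multi-view consistency, view-biased density distributions, and dependence on the number of input views.

Details Motivation: Existing feed-forward 3D Gaussian Splatting methods rely on a pixel-aligned paradigm, which makes reconstructions sensitive to the number of input views, produces view-biased density distributions, and introduces alignment errors under occlusion or low texture.

Contribution: By predicting voxel-aligned Gaussians, VolSplat removes the reliance on 2D feature matching, improves multi-view consistency and geometric robustness, and supports adaptive density control based on 3D scene complexity.

Method: VolSplat predicts Gaussians directly from a predicted 3D voxel grid instead of aligning them to pixels, which improves the quality of the resulting Gaussian point clouds and renderings.

Result: On benchmarks including RealEstate10K and ScanNet, VolSplat achieves state-of-the-art performance and produces more plausible, view-consistent 3D reconstructions.

Insight: Voxel-aligned Gaussian prediction offers a more robust and scalable framework for feed-forward 3D reconstruction and may encourage broader follow-up research.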

Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment’s reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities. The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.
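
The difference between pixel-aligned and voxel-aligned primitives can be made concrete with a small sketch. It is a simplification under stated assumptions: VolSplat predicts Gaussians from learned voxel features, whereas the hypothetical `voxelize` helper below only averages already unprojected points per voxel to show why the primitive count stops depending on the number of views.

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group 3D points by voxel index and return the centroid of each occupied voxel."""
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    return {
        key: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for key, pts in buckets.items()
    }

# Two views see the same surface point with slight disagreement; view A also sees
# a second, distinct point. A pixel-aligned scheme would emit one primitive per pixel (3).
view_a = [(0.10, 0.10, 1.00), (0.90, 0.10, 1.00)]
view_b = [(0.12, 0.08, 1.02)]

voxels = voxelize(view_a + view_b, voxel_size=0.5)  # one primitive per occupied voxel
```

Here the two near-duplicate observations collapse into a single voxel, so adding more views of the same surface does not multiply the number of primitives.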

cs.RO [Back]

[130] Semantic-Aware Particle Filter for Reliable Vineyard Robot Localisation

Rajitha de Silva,Jonathan Cox,James R. Heselden,Marija Popovic,Cesar Cadena,Riccardo Polvara

Main category: cs.RO

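TL;DR: This paper proposes a semantic-aware particle filter that makes vineyard robot localisation more reliable in environments with complex, repetitive structure.

Details Motivation: In structured outdoor environments such as vineyards, LiDAR-based localisation often fails because of repetitive row geometry and perceptual aliasing, so a more robust approach is needed.

Contribution: A particle filter that incorporates semantic information (vine trunks and support poles), and semantic walls that mitigate row aliasing.

Method: Semantic detections are projected into a bird's-eye view and fused with LiDAR scans; semantic walls act as pseudo-structural constraints; a noisy GPS prior adaptively supports the filter.

Result: Experiments in a real vineyard show accurate localisation that outperforms AMCL and vision-based SLAM methods such as RTAB-Map.

Insight: Semantic information effectively counters perceptual aliasing in repetitive environments, and adaptively fusing additional modalities such as GPS is key to robustness.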

Abstract: Accurate localisation is critical for mobile robots in structured outdoor environments, yet LiDAR-based methods often fail in vineyards due to repetitive row geometry and perceptual aliasing. We propose a semantic particle filter that incorporates stable object-level detections, specifically vine trunks and support poles, into the likelihood estimation process. Detected landmarks are projected into a bird’s-eye view and fused with LiDAR scans to generate semantic observations. A key innovation is the use of semantic walls, which connect adjacent landmarks into pseudo-structural constraints that mitigate row aliasing. To maintain global consistency in headland regions where semantics are sparse, we introduce a noisy GPS prior that adaptively supports the filter. Experiments in a real vineyard demonstrate that our approach maintains localisation within the correct row, recovers from deviations where AMCL fails, and outperforms vision-based SLAM methods such as RTAB-Map.
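
How a semantic likelihood and a loose GPS prior combine in a particle filter's weight update can be sketched as follows. This is an illustrative toy, not the authors' filter: the Gaussian likelihoods, the sigma values, and the function names are assumptions, and semantic walls are not modelled.

```python
import math

def gaussian(error, sigma):
    """Unnormalised Gaussian likelihood of an error with standard deviation sigma."""
    return math.exp(-0.5 * (error / sigma) ** 2)

def nearest_landmark_distance(position, landmarks):
    """Distance from a particle to the closest mapped landmark (e.g. a vine trunk)."""
    return min(math.dist(position, lm) for lm in landmarks)

def update_weights(particles, weights, observed_range, landmarks, gps_fix,
                   range_sigma=0.2, gps_sigma=3.0):
    """Reweight particles by a landmark-range likelihood times a noisy GPS prior."""
    new_weights = []
    for position, w in zip(particles, weights):
        expected = nearest_landmark_distance(position, landmarks)
        semantic = gaussian(observed_range - expected, range_sigma)
        gps = gaussian(math.dist(position, gps_fix), gps_sigma)
        new_weights.append(w * semantic * gps)
    total = sum(new_weights)
    return [w / total for w in new_weights]

# Trunks in two adjacent rows; particles 0 and 1 see identical local geometry (aliasing).
landmarks = [(0.0, 0.0), (0.0, 2.5)]
particles = [(1.0, 0.0), (1.0, 2.5), (0.3, 0.0)]
weights = update_weights(particles, [1 / 3] * 3, observed_range=1.0,
                         landmarks=landmarks, gps_fix=(1.0, 0.5))
```

The range observation alone cannot separate the first two particles, so the wide GPS prior is what tips the weight toward the correct row, while the third particle is rejected by the semantic term.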

[131] Latent Action Pretraining Through World Modeling

Bahey Tharwat,Yara Nasser,Ali Abouzeid,Ian Reid

Main category: cs.RO

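TL;DR: This paper proposes LAWM, a model-agnostic framework that learns latent action representations from unlabeled video through world modeling and uses them for self-supervised pretraining of imitation learning models. It performs strongly on the LIBERO benchmark and in a real-world setup while being more efficient and practical.

Details Motivation: Existing Vision-Language-Action (VLA) models depend either on large-scale, manually labeled action datasets or on latent action approaches whose large model sizes make real-world deployment difficult. The paper aims to extract latent action representations from unlabeled video via self-supervised learning, improving practicality and efficiency.

Contribution: 1. The LAWM framework, which pretrains latent actions in a self-supervised way through world modeling; 2. a design that transfers across tasks, environments, and embodiments; 3. results on LIBERO and a real-world setup that surpass models trained with ground-truth actions and comparable pretraining methods.

Method: LAWM learns latent action representations from unlabeled video through world modeling, with no manual annotation. The videos can come from robot recordings or from humans performing actions with everyday objects. The framework is self-supervised and applies to a range of tasks and embodiments.

Result: On LIBERO and in a real-world setup, LAWM outperforms models trained with ground-truth actions and similar pretraining methods, while being significantly more efficient and practical.

Insight: Self-supervised extraction of latent action representations from unlabeled video is an efficient pretraining strategy that reduces dependence on labeled data and improves generalization and practicality.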

Abstract: Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $\pi_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way, by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is designed to be effective for transferring across tasks, environments, and embodiments. It outperforms models trained with ground-truth robotics actions and similar pretraining methods on the LIBERO benchmark and real-world setup, while being significantly more efficient and practical for real-world settings.

[132] VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation

Neel P. Bhatt,Yunhao Yang,Rohan Siva,Pranay Samineni,Daniel Milan,Zhangyang Wang,Ufuk Topcu

Main category: cs.RO

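TL;DR: VLN-Zero proposes a vision-language navigation framework for zero-shot transfer that combines rapid exploration with neurosymbolic planning, markedly improving navigation efficiency and success rate in unseen environments.

Details Motivation: Existing vision-language navigation methods adapt slowly and generalize poorly in unseen environments, which limits their scalability in real-world autonomous systems.

Contribution: 1) A two-phase framework (exploration and deployment); 2) symbolic scene graphs constructed with vision-language models; 3) a cache-enabled execution module that accelerates execution.

Method: 1) Exploration phase: structured prompts guide efficient search; 2) deployment phase: a neurosymbolic planner, together with a cache-enabled execution module, generates executable plans.

Result: In unseen environments, VLN-Zero achieves a 2x higher success rate than state-of-the-art zero-shot models, halves the time to reach the goal, uses 55% fewer VLM calls on average, and outperforms most fine-tuned baselines.

Insight: Combining symbolic reasoning with efficient exploration substantially improves zero-shot navigation, and caching further reduces computation.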

Abstract: Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.
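
The idea of reusing previously computed task-location trajectories can be sketched as a small cache in front of an expensive planner. Everything here is hypothetical (the `TrajectoryCache` class, the `toy_planner` stand-in, and the string waypoints); the abstract does not specify how VLN-Zero keys or stores its cache.

```python
class TrajectoryCache:
    """Cache of previously planned trajectories keyed by (task, location)."""

    def __init__(self, planner):
        self._planner = planner      # expensive call, e.g. a VLM-backed planner
        self._store = {}
        self.planner_calls = 0       # counts how often the planner actually ran

    def get_plan(self, task, location):
        key = (task, location)
        if key not in self._store:
            self.planner_calls += 1
            self._store[key] = self._planner(task, location)
        return self._store[key]

def toy_planner(task, location):
    """Stand-in for a neurosymbolic planner; returns a fixed waypoint list."""
    return [location, "hallway", task]

cache = TrajectoryCache(toy_planner)
first = cache.get_plan("kitchen", "lobby")
second = cache.get_plan("kitchen", "lobby")   # served from the cache
third = cache.get_plan("office", "lobby")     # new key, so the planner runs again
```

Three requests trigger only two planner calls, which is the mechanism by which a cache of this kind would reduce the number of VLM calls.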

[133] Human-Interpretable Uncertainty Explanations for Point Cloud Registration

Johannes A. Gaus,Loris Schneider,Yitian Shi,Jongseok Lee,Rania Rayyes,Rudolph Triebel

Main category: cs.RO

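TL;DR: This paper proposes GP-CA, a new method for quantifying and explaining uncertainty in point cloud registration. It uses active learning to discover new sources of uncertainty and is validated in a real-world robot experiment.

Details Motivation: Point cloud registration degrades under uncertainty from sensor noise, pose-estimation errors, and partial overlap caused by occlusion. Existing methods such as ICP handle this poorly, so a method that both quantifies and explains the uncertainty is needed.

Contribution: GP-CA quantifies registration uncertainty and attributes it to known error sources; it discovers new uncertainty sources through active learning; it is validated on public datasets and in a real-world robot experiment.

Method: Gaussian Process Concept Attribution (GP-CA) quantifies uncertainty and explains its sources; active learning selects informative instances to discover new uncertainty sources.

Result: GP-CA outperforms existing methods in runtime, sample efficiency, and accuracy; the real-world experiment demonstrates its applicability and shows that it enables effective failure-recovery behaviors.

Insight: Explaining uncertainty is important for making registration more robust, and active learning shows promise for discovering new sources of uncertainty.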

Abstract: In this paper, we address the point cloud registration problem, where well-known methods like ICP fail under uncertainty arising from sensor noise, pose-estimation errors, and partial overlap due to occlusion. We develop a novel approach, Gaussian Process Concept Attribution (GP-CA), which not only quantifies registration uncertainty but also explains it by attributing uncertainty to well-known sources of errors in registration problems. Our approach leverages active learning to discover new uncertainty sources in the wild by querying informative instances. We validate GP-CA on three publicly available datasets and in our real-world robot experiment. Extensive ablations substantiate our design choices. Our approach outperforms other state-of-the-art methods in terms of runtime, sample efficiency with active learning, and accuracy. Our real-world experiment clearly demonstrates its applicability. Our video also demonstrates that GP-CA enables effective failure-recovery behaviors, yielding more robust robotic perception.

[134] DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation

Suzannah Wistreich,Baiyu Shi,Stephen Tian,Samuel Clarke,Michael Nath,Chengyi Xu,Zhenan Bao,Jiajun Wu

Main category: cs.RO

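TL;DR: DexSkin is a soft, conformable capacitive electronic skin that provides sensitive, localized, and calibratable tactile sensing, suited to learning contact-rich robotic manipulation.

Details Motivation: Human skin offers rich tactile sensing and can localize contact events over large, contoured regions. Replicating this capability for robotic systems remains a longstanding challenge.

Contribution: DexSkin, an electronic skin that can be tailored to different geometries and provides broad coverage, supporting tactile sensing and the learning of precise robotic manipulation.

Method: Tactile sensing is implemented with a capacitive electronic skin, and its effectiveness on complex manipulation tasks is evaluated within learning frameworks (learning from demonstration and reinforcement learning).

Result: Experiments show that DexSkin performs well in high-coverage tactile sensing and model transfer, and is suitable for contact-rich manipulation tasks.

Insight: DexSkin's calibratability enables model transfer across sensor instances, making it a practical tool for data-driven robot learning.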

Abstract: Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin’s capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin’s suitability and practicality for learning real-world, contact-rich manipulation. Please see our project webpage for videos and visualizations: https://dex-skin.github.io/.

[135] Category-Level Object Shape and Pose Estimation in Less Than a Millisecond

Lorenzo Shaikewitz,Tim Nguyen,Luca Carlone

Main category: cs.RO

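TL;DR: This paper presents a fast solver for category-level object shape and pose estimation. It combines sparse semantic keypoint detection, a linear active shape model, and self-consistent field iteration, runs in under a millisecond, and provides a certificate of global optimality.

Details Motivation: Object shape and pose estimation is a foundational robotics problem that supports manipulation, scene understanding, and navigation. Existing methods are slow or lack a certificate of global optimality, and the paper aims to address both.

Contribution: 1. A fast solver based on sparse semantic keypoints and a linear active shape model; 2. self-consistent field iteration, in which each iterate only requires computing a 4x4 matrix and finding its minimum eigenvalue-eigenvector pair; 3. a simple certificate of global optimality.

Method: 1. A learned front-end detects sparse semantic keypoints; 2. a linear active shape model represents the object's unknown shape; 3. the problem is expressed in unit quaternions as an optimization problem and solved with self-consistent field iteration.

Result: The method is tested on synthetic data and in real-world settings (public datasets and a drone tracking scenario). One iteration takes about 100 microseconds, which enables fast outlier rejection.

Insight: With a suitable reformulation and an efficient solver, sub-millisecond runtimes are achievable while retaining a certificate of global optimality.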

Abstract: Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object’s unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario. Code is released at https://github.com/MIT-SPARK/Fast-ShapeAndPose.

[136] FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

Hongli Xu,Lei Zhang,Xiaoyue Hu,Boyang Zhong,Kaixin Bai,Zoltán-Csaba Márton,Zhenshan Bing,Zhaopeng Chen,Alois Christian Knoll,Jianwei Zhang

Main category: cs.RO

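TL;DR: FunCanon is a framework that learns pose-aware action primitives through functional object canonicalization, aiming at generalizable robotic manipulation. It decomposes long-horizon manipulation tasks into sequences of action chunks, maps objects into shared functional frames using affordance cues from vision-language models, and trains a diffusion policy for generalization and robustness.

Details Motivation: End-to-end robot skill learning tends to produce task-specific policies that generalize poorly. FunCanon aims for compositionality and reuse across tasks through action chunks and functional object canonicalization.

Contribution: 1. The FunCanon framework, which achieves generalization through functional object canonicalization and action-chunk decomposition. 2. Functional alignment of objects and automatic trajectory transfer, using cues from vision-language models. 3. FuncDiffuser, a diffusion policy that simplifies learning and improves generalization.

Method: 1. Decompose tasks into action chunks (actor, verb, object). 2. Map objects into shared functional frames using cues from vision-language models. 3. Train FuncDiffuser, an object-centric and action-centric diffusion policy.

Result: Experiments show that FunCanon achieves category-level generalization, cross-task behavior reuse, and robust sim2real deployment in simulated and real-world environments.

Insight: Functional object canonicalization provides a strong inductive bias for complex manipulation domains and makes imitation learning more scalable.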

Abstract: Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

cs.HC [Back]

[137] Does Embodiment Matter to Biomechanics and Function? A Comparative Analysis of Head-Mounted and Hand-Held Assistive Devices for Individuals with Blindness and Low Vision

Gaurav Seth,Hoa Pham,Giles Hamilton-Fletcher,Charles Leclercq,John-Ross Rizzo

Main category: cs.HC

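TL;DR: This paper compares the functional and biomechanical effects of head-mounted and hand-held visual assistive devices for people with blindness or low vision. The head-mounted device reduces body movement and task time, while the hand-held device performs better on specific tasks.

Details Motivation: To assess the physical and functional implications of different visual assistive device embodiments in order to improve the user experience.

Contribution: Quantifies the differences between head-mounted and hand-held devices using biomechanical and functional measures, providing evidence to inform design.

Method: Participants used Microsoft Seeing AI on both devices while Xsens motion capture recorded their movements, and performance was compared across six activities of daily living.

Result: The head-mounted device reduced upper-body movement and task time; the hand-held device achieved higher success rates when scanning small or curved text.

Insight: Future assistive device designs should balance functional efficiency with physical sustainability.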

Abstract: Visual assistive technologies, such as Microsoft Seeing AI, can improve access to environmental information for persons with blindness or low vision (pBLV). Yet, the physical and functional implications of different device embodiments remain unclear. In this study, 11 pBLV participants used Seeing AI on a hand-held smartphone and on a head-mounted ARx Vision system to perform six activities of daily living, while their movements were captured with Xsens motion capture. Functional outcomes included task time, success rate, and number of attempts, and biomechanical measures included joint range of motion, angular path length, working volume, and movement smoothness. The head-mounted system generally reduced upper-body movement and task time, especially for document-scanning style tasks, whereas the hand-held system yielded higher success rates for tasks involving small or curved text. These findings indicate that both embodiments are viable, but they differ in terms of physical demands and ease of use. Incorporating biomechanical measures into assistive technology evaluations can inform designs that optimise user experience by balancing functional efficiency, physical sustainability, and intuitive interaction.
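
Two of the biomechanical measures named in the abstract, joint range of motion and angular path length, can be computed from a joint-angle time series roughly as follows. The definitions used here (maximum minus minimum angle, and the sum of absolute frame-to-frame changes) are common conventions assumed for illustration; the study's exact definitions and the sample angles are not taken from the paper.

```python
def range_of_motion(angles_deg):
    """Range of motion: largest minus smallest joint angle over the trial."""
    return max(angles_deg) - min(angles_deg)

def angular_path_length(angles_deg):
    """Angular path length: total absolute angle change between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(angles_deg, angles_deg[1:]))

# Hypothetical elbow-flexion samples (degrees) from one motion-capture trial.
elbow_flexion = [10.0, 30.0, 20.0, 50.0, 40.0]

rom = range_of_motion(elbow_flexion)         # 50 - 10
path = angular_path_length(elbow_flexion)    # 20 + 10 + 30 + 10
```

Range of motion captures how far the joint travelled at its extremes, while path length also grows with back-and-forth corrections, so the two together distinguish large smooth movements from small repeated ones.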

eess.IV [Back]

[138] Efficient Breast and Ovarian Cancer Classification via ViT-Based Preprocessing and Transfer Learning

Richa Rawat,Faisal Ahmed

Main category: eess.IV

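TL;DR: This paper proposes a ViT-based preprocessing and transfer learning approach for efficient breast and ovarian cancer classification. By fine-tuning a pretrained ViT with a standardized preprocessing pipeline, it outperforms existing methods.

Details Motivation: Early detection of breast and ovarian cancer is critical for survival, but traditional manual examination is time-consuming, labor-intensive, and dependent on expert experience. A more efficient automated classification method is therefore needed.

Contribution: 1. A classification method combining ViT with standardized preprocessing; 2. results that surpass existing CNN-based and topological data analysis-based approaches on binary and five-class tasks.

Method: 1. Fine-tune a pretrained ViT-Base-Patch16-224 model; 2. design a preprocessing pipeline that converts histopathological images into PyTorch tensors suitable for ViT; 3. evaluate on the BreakHis and UBC-OCEAN datasets.

Result: The model performs strongly on both binary classification (BreakHis) and five-class classification (UBC-OCEAN), surpassing existing methods.

Insight: ViT combined with efficient preprocessing outperforms the compared methods in cancer diagnosis, showing the potential of transfer learning for medical image analysis.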

Abstract: Cancer is one of the leading health challenges for women, specifically breast and ovarian cancer. Early detection can help improve the survival rate through timely intervention and treatment. Traditional methods of detecting cancer involve manually examining mammograms, CT scans, ultrasounds, and other imaging types. This makes the process labor-intensive, time-consuming, and resource-intensive, and it requires the expertise of trained pathologists. In this paper, we introduce a novel vision transformer (ViT)-based method for detecting and classifying breast and ovarian cancer. We use a pre-trained ViT-Base-Patch16-224 model, which is fine-tuned for both binary and multi-class classification tasks using publicly available histopathological image datasets. Further, we use a preprocessing pipeline that converts raw histopathological images into standardized PyTorch tensors, which are compatible with the ViT architecture and also help improve the model performance. We evaluated the performance of our model on two benchmark datasets: the BreakHis dataset for binary classification and the UBC-OCEAN dataset for five-class classification without any data augmentation. Our model surpasses existing CNN, ViT, and topological data analysis-based approaches in binary classification. For multi-class classification, it is evaluated against recent topological methods and demonstrates superior performance. Our study highlights the effectiveness of Vision Transformer-based transfer learning combined with efficient preprocessing in oncological diagnostics.

cs.SD [Back]

[139] Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models

Junyu Wang,Ziyang Ma,Zhengding Luo,Tianrui Wang,Meng Ge,Xiaobao Wang,Longbiao Wang

Main category: cs.SD

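TL;DR: The paper proposes MATA, a training-free method that addresses the imbalance between audio and text attention in large audio-language models. It intervenes in the self-attention mechanism to dynamically increase attention to audio tokens, improving audio reasoning.

Details Motivation: Large Audio-Language Models (LALMs) tend to prioritize text in the multimodal fusion layers, so audio information is underused and performance on audio reasoning tasks suffers.

Contribution: MATA, a training-free method that dynamically adjusts self-attention to strengthen attention to audio tokens without adding parameters or computational overhead.

Method: MATA intervenes after raw attention scoring and adjusts the attention weights only for the last token in intermediate layers, with no additional training or parameters.

Result: On the MMAU and MMAR benchmarks, MATA yields consistent performance gains; on MMAR it enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time.

Insight: Intervening in self-attention can mitigate cross-modal attention imbalance and opens a research direction for improving the audio-processing capabilities of multimodal models.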

Abstract: Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose MATA, a novel training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Specifically, MATA intervenes after raw attention scoring, targeting only the last token in intermediate layers without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA’s effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
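
An intervention between raw attention scoring and the softmax can be sketched for a single query token. The abstract does not give MATA's actual adjustment rule, so the additive `boost` applied to audio positions below is purely an assumption, as are the function names and the toy scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def boost_audio_attention(raw_scores, is_audio, boost=1.0):
    """Add a constant to the raw scores at audio positions, then renormalise.

    The additive form is an illustrative assumption, not MATA's published rule.
    """
    adjusted = [s + boost if a else s for s, a in zip(raw_scores, is_audio)]
    return softmax(adjusted)

# Raw (pre-softmax) scores of the last query token over 2 audio and 3 text tokens.
raw_scores = [0.0, 0.5, 2.0, 1.5, 1.0]
is_audio = [True, True, False, False, False]

before = softmax(raw_scores)
after = boost_audio_attention(raw_scores, is_audio, boost=1.0)

audio_mass_before = sum(p for p, a in zip(before, is_audio) if a)
audio_mass_after = sum(p for p, a in zip(after, is_audio) if a)
```

Because the adjustment happens before normalisation, the weights still sum to one and no parameters are added; only the share of attention mass on audio positions changes.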

cs.LG [Back]

[140] Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Jiaqi Weng,Han Zheng,Hanyu Zhang,Qinqin He,Jialing Tao,Hui Xue,Zhixuan Chu,Xiting Wang

Main category: cs.LG

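TL;DR: Safe-SAIL is a sparse autoencoder (SAE) interpretation framework for gaining a fine-grained understanding of safety-related behavior in large language models (LLMs).

Details Motivation: Most existing LLM safety research focuses on output evaluation or specific tasks, and cannot address broader, undefined risks. Sparse autoencoders (SAEs) have been used to explain model behavior, but prior work has not interpreted features in terms of fine-grained safety concepts.

Contribution: The Safe-SAIL framework, which identifies the SAE with the best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies for scaling up interpretation, systematically advancing fine-grained understanding of LLM safety mechanisms.

Method: Within an SAE interpretation framework, the approach identifies the SAE best suited to producing safety concept-specific neurons and develops efficient strategies to scale up explanation.

Result: A toolkit containing SAE checkpoints and human-readable neuron explanations will be released, supporting empirical analysis of LLM safety risks.

Insight: By decomposing safety behavior into fine-grained features, Safe-SAIL offers a new approach to LLM safety research that captures high-risk behaviors and can inform future safety mechanisms.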

Abstract: Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting its ability to address broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research to clarify model behavior by explaining single-meaning atomic features decomposed from entangled signals. However, prior applications of SAEs do not interpret features with fine-grained safety-related concepts, thus inadequately addressing safety-critical behaviors, such as generating toxic responses and violating safety regulations. For rigorous safety analysis, we must extract a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet face two challenges: identifying SAEs with the greatest potential for generating safety concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we propose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies the SAE with the best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process. We will release a comprehensive toolkit including SAE checkpoints and human-readable neuron explanations, which supports empirical analysis of safety risks to promote research on LLM safety.

[141] PiMoE: Token-Level Routing for Integrating High-Precision Computation and Reasoning

Hengbo Xiao,Jingyuan Fan,Xin Tong,Jingzhao Zhang,Chao Lu,Guannan He

Main category: cs.LG

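TL;DR: The paper proposes PiMoE (Physically-isolated Mixture of Experts), an architecture that uses token-level routing during training and inference to integrate high-precision computation and reasoning endogenously within a neural network, addressing the limitations of existing large language models (LLMs) and multi-agent systems.

Details Motivation: Current LLMs cannot natively support high-precision computation, while multi-agent systems can call external experts but incur communication overhead and limited scalability. PiMoE aims to fuse computation and reasoning efficiently within the model itself.

Contribution: The PiMoE architecture, which integrates computation and reasoning through token-level routing and significantly improves response latency, token usage, and energy consumption.

Method: PiMoE separately trains experts, a text-to-computation module, and a router; at inference, the router directs computation and reasoning at the token level, allowing the two to alternate iteratively within a single chain of thought.

Result: On two reasoning-computation tasks, PiMoE is more accurate than directly fine-tuned LLMs and significantly better than multi-agent approaches in latency, token usage, and GPU energy consumption.

Insight: PiMoE offers an efficient, interpretable, and scalable paradigm for next-generation scientific or industrial intelligent systems.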

Abstract: Complex systems typically rely on high-precision numerical computation to support decisions, but current large language models (LLMs) cannot yet incorporate such computations as an intrinsic and interpretable capability with existing architectures. Mainstream multi-agent approaches can leverage external experts, but inevitably introduce communication overhead and suffer from inefficient multimodal emergent capability and limited scalability. To this end, we propose PiMoE (Physically-isolated Mixture of Experts), a training and inference architecture for integrating computation and reasoning. Instead of the workflow paradigm of tool invocation, PiMoE endogenously integrates computational capabilities into neural networks after separately training experts, a text-to-computation module, and a router. At inference, the router directs computation and reasoning at the token level, thereby enabling iterative alternation within a single chain of thought. We evaluate PiMoE on two reasoning-computation tasks against LLM finetuning and the multi-agent system approaches. Results show that the PiMoE architecture achieves not only higher accuracy than directly finetuning LLMs but also significant improvements in response latency, token usage, and GPU energy consumption compared with mainstream multi-agent approaches. PiMoE offers an efficient, interpretable, and scalable paradigm for next-generation scientific or industrial intelligent systems.

[142] Conversational Orientation Reasoning: Egocentric-to-Allocentric Navigation with Multimodal Chain-of-Thought

Yu Ti Huang

Main category: cs.LG

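TL;DR: The paper proposes a multimodal chain-of-thought (MCoT) framework for converting egocentric utterances (e.g., “on my right”) into allocentric orientations (N/E/S/W) in conversational agents, particularly indoors or where GPS signals are weak. Trained with a curriculum learning strategy on Taiwan-LLM-13B-v2.0-Chat, MCoT reaches 100% orientation accuracy on clean transcripts and 98.1% on ASR transcripts.

Details Motivation: Where GPS signals are weak and detailed maps are unavailable, conversational agents must interpret a user's egocentric spatial utterances and convert them into allocentric orientations. The application of chain-of-thought (CoT) methods to multimodal spatial orientation reasoning remains underexplored.

Contribution: 1) The Conversational Orientation Reasoning (COR) benchmark, targeting orientation reasoning in non-English and ASR-transcribed scenarios; 2) the multimodal chain-of-thought (MCoT) framework, which combines ASR-transcribed speech with landmark coordinates and reasons in three steps; 3) evidence of MCoT's robustness under noise and multilingual code-switching.

Method: MCoT reasons in three steps: 1) extract spatial relations; 2) map coordinates to absolute directions; 3) infer the user's orientation. A curriculum learning strategy builds these capabilities progressively on Taiwan-LLM-13B-v2.0-Chat.

Result: MCoT achieves 100% accuracy on clean transcripts and 98.1% on ASR transcripts, substantially outperforming unimodal and non-structured baselines. It is also robust to noisy conditions such as ASR errors and multilingual code-switching.

Insight: Structured multimodal chain-of-thought reasoning offers a viable path toward interpretable, resource-efficient navigation, and holds up well in non-English and noisy settings.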

Abstract: Conversational agents must translate egocentric utterances (e.g., “on my right”) into allocentric orientations (N/E/S/W). This challenge is particularly critical in indoor or complex facilities where GPS signals are weak and detailed maps are unavailable. While chain-of-thought (CoT) prompting has advanced reasoning in language and vision tasks, its application to multimodal spatial orientation remains underexplored. We introduce Conversational Orientation Reasoning (COR), a new benchmark designed for Traditional Chinese conversational navigation projected from real-world environments, addressing egocentric-to-allocentric reasoning in non-English and ASR-transcribed scenarios. We propose a multimodal chain-of-thought (MCoT) framework, which integrates ASR-transcribed speech with landmark coordinates through a structured three-step reasoning process: (1) extracting spatial relations, (2) mapping coordinates to absolute directions, and (3) inferring user orientation. A curriculum learning strategy progressively builds these capabilities on Taiwan-LLM-13B-v2.0-Chat, a mid-sized model representative of resource-constrained settings. Experiments show that MCoT achieves 100% orientation accuracy on clean transcripts and 98.1% with ASR transcripts, substantially outperforming unimodal and non-structured baselines. Moreover, MCoT demonstrates robustness under noisy conversational conditions, including ASR recognition errors and multilingual code-switching. The model also maintains high accuracy in cross-domain evaluation and resilience to linguistic variation, domain shift, and referential ambiguity. These findings highlight the potential of structured MCoT spatial reasoning as a path toward interpretable and resource-efficient embodied navigation.
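
The three reasoning steps listed in the abstract can be written out as a small rule-based sketch. This is not the MCoT model, which is a fine-tuned LLM working on Traditional Chinese input: the English keyword matching, the coordinate convention (x grows east, y grows north), and all function names are assumptions made for illustration.

```python
DIRECTIONS = ["N", "E", "S", "W"]   # clockwise order
# Clockwise quarter-turns from the user's heading to each egocentric relation.
RELATION_OFFSET = {"front": 0, "right": 1, "behind": 2, "left": 3}

def extract_relation(utterance):
    """Step 1: find the egocentric relation keyword in the utterance."""
    text = utterance.lower()
    for relation in RELATION_OFFSET:
        if relation in text:
            return relation
    raise ValueError("no spatial relation found")

def absolute_direction(user_xy, landmark_xy):
    """Step 2: dominant compass direction from the user to the landmark."""
    dx = landmark_xy[0] - user_xy[0]
    dy = landmark_xy[1] - user_xy[1]
    if abs(dx) >= abs(dy):
        return "E" if dx > 0 else "W"
    return "N" if dy > 0 else "S"

def infer_heading(utterance, user_xy, landmark_xy):
    """Step 3: rotate the landmark's absolute direction back by the relation offset."""
    relation = extract_relation(utterance)
    landmark_dir = DIRECTIONS.index(absolute_direction(user_xy, landmark_xy))
    return DIRECTIONS[(landmark_dir - RELATION_OFFSET[relation]) % 4]

# A landmark lying to the east that the user reports on their right implies facing north.
heading = infer_heading("The pharmacy is on my right", (0, 0), (5, 1))
```

Keeping the steps as separate functions mirrors the structured reasoning the framework is trained to produce, and makes each intermediate result (relation, absolute direction, heading) inspectable.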

[143] Localized PCA-Net Neural Operators for Scalable Solution Reconstruction of Elliptic PDEs

Mrigank Dhingra,Romit Maulik,Adil Rasheed,Omer San

Main category: cs.LG

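TL;DR: The paper proposes a patch-based PCA-Net framework that decomposes the solution field into small patches and applies PCA within each patch, significantly reducing computational complexity while maintaining high accuracy.

Details Motivation: Applying principal component analysis (PCA) to high-dimensional PDE solution fields incurs significant computational overhead; the authors aim to improve efficiency through localization.

Contribution: Proposes two patch-based PCA approaches (local-to-global and local-to-local) and explores two refinements (overlapping patches with a smoothing filter, and CNN refinement), significantly reducing computation time.

Method: Decomposes the solution field into patches, applies PCA within each patch, and trains a neural operator in the reduced PCA space; two refinements are explored: overlapping patches with a smoothing filter, and a two-step process with CNN refinement.

Result: Experiments show that the method reduces end-to-end pipeline processing time by a factor of 3.7 to 4 compared to global PCA while maintaining high accuracy.

Insight: Patch-based PCA combined with neural operators is a viable direction for efficient PDE solving, balancing computational efficiency and accuracy for large-scale systems.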

Abstract: Neural operator learning has emerged as a powerful approach for solving partial differential equations (PDEs) in a data-driven manner. However, applying principal component analysis (PCA) to high-dimensional solution fields incurs significant computational overhead. To address this, we propose a patch-based PCA-Net framework that decomposes the solution fields into smaller patches, applies PCA within each patch, and trains a neural operator in the reduced PCA space. We investigate two different patch-based approaches that balance computational efficiency and reconstruction accuracy: (1) local-to-global patch PCA, and (2) local-to-local patch PCA. The trade-off between computational cost and accuracy is analyzed, highlighting the advantages and limitations of each approach. Furthermore, within each approach, we explore two refinements for the most computationally efficient method: (i) introducing overlapping patches with a smoothing filter and (ii) employing a two-step process with a convolutional neural network (CNN) for refinement. Our results demonstrate that patch-based PCA significantly reduces computational complexity while maintaining high accuracy, reducing end-to-end pipeline processing time by a factor of 3.7 to 4 compared to global PCA, therefore making it a promising technique for efficient operator learning in PDE-based systems.
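The patch decomposition and per-patch PCA described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses non-overlapping square patches, fits PCA by SVD, and only compresses and reconstructs the fields. The neural operator that would be trained on the PCA coefficients is omitted, and all function names are hypothetical.

```python
import numpy as np


def split_into_patches(fields, p):
    """Reshape (n, H, W) fields into (num_patches, n, p*p) non-overlapping patches."""
    n, H, W = fields.shape
    assert H % p == 0 and W % p == 0, "patch size must divide the field size"
    x = fields.reshape(n, H // p, p, W // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, n, p, p)
    return x.reshape((H // p) * (W // p), n, p * p)


def merge_patches(patches, H, W, p):
    """Inverse of split_into_patches."""
    n = patches.shape[1]
    x = patches.reshape(H // p, W // p, n, p, p)
    x = x.transpose(2, 0, 3, 1, 4)  # (n, H/p, p, W/p, p)
    return x.reshape(n, H, W)


def fit_patch_pca(patch_data, k):
    """PCA of one patch location: returns the mean and the top-k principal directions."""
    mean = patch_data.mean(axis=0)
    _, _, vt = np.linalg.svd(patch_data - mean, full_matrices=False)
    return mean, vt[:k]


def patch_pca_roundtrip(fields, p, k):
    """Compress every patch to k PCA coefficients and reconstruct the fields."""
    _, H, W = fields.shape
    patches = split_into_patches(fields, p)
    recon = np.empty_like(patches)
    for i, data in enumerate(patches):
        mean, basis = fit_patch_pca(data, k)
        coeffs = (data - mean) @ basis.T  # reduced representation of this patch
        recon[i] = coeffs @ basis + mean
    return merge_patches(recon, H, W, p)


rng = np.random.default_rng(0)
fields = rng.normal(size=(6, 8, 8))  # 6 synthetic snapshots on an 8x8 grid
approx = patch_pca_roundtrip(fields, p=4, k=2)
print("relative error:", np.linalg.norm(approx - fields) / np.linalg.norm(fields))
```

Each SVD here acts on an n × p² matrix instead of an n × HW matrix, which is where the cost reduction of localized PCA comes from; how much accuracy is lost depends on k and on the data.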

[144] Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Faizul Rakib Sayem,Shahana Ibrahim

Main category: cs.LG

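TL;DR: The paper proposes a few-shot OOD detection method that combines prompt optimization with subspace representation learning, using the feature embeddings of VLMs to improve ID-OOD separability.

Details Motivation: Existing prompt learning-based OOD detection methods rely solely on softmax probabilities and overlook the rich discriminative information in VLM feature embeddings. This limitation motivates a more effective use of those features.

Contribution: Proposes a CoOp-based framework that integrates subspace representation learning with prompt tuning, improves ID-OOD separability through a projection mechanism, and comes with an end-to-end learning criterion.

Method: ID features are projected into the subspace spanned by the prompt vectors, while ID-irrelevant features are projected into the orthogonal null space. An end-to-end training criterion targets both OOD detection performance and ID classification accuracy.

Result: Experiments on real-world datasets demonstrate the effectiveness of the approach.

Insight: Combining prompt learning with subspace representations makes fuller use of VLM feature information and significantly improves few-shot OOD detection.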

Abstract: The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods rely solely on softmax probabilities, overlooking the rich discriminative potential of the feature embeddings learned by VLMs trained on millions of samples. To address this limitation, we propose a novel context optimization (CoOp)-based framework that integrates subspace representation learning with prompt tuning. Our approach improves ID-OOD separability by projecting the ID features into a subspace spanned by prompt vectors, while projecting ID-irrelevant features into an orthogonal null space. To train such an OOD detection framework, we design an easy-to-handle end-to-end learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy. Experiments on real-world datasets showcase the effectiveness of our approach.
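The projection idea above (a subspace spanned by prompt vectors plus its orthogonal complement) is standard linear algebra and can be sketched in NumPy. This shows only the decomposition, not the paper's training criterion or scoring rule; the function name and the toy vectors are assumptions, and the prompt vectors are assumed to be linearly independent.

```python
import numpy as np


def split_by_prompt_subspace(features, prompt_vectors):
    """Decompose features into a part inside span(prompt_vectors) and a part
    in the orthogonal complement.

    features:       (n, d) array of feature embeddings
    prompt_vectors: (m, d) array, assumed linearly independent with m <= d
    """
    # Orthonormal basis of the prompt subspace: columns of q, shape (d, m).
    q, _ = np.linalg.qr(prompt_vectors.T)
    in_span = features @ q @ q.T   # orthogonal projection onto the subspace
    in_null = features - in_span   # remainder, orthogonal to every prompt vector
    return in_span, in_null


prompts = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
feats = np.array([[1.0, 2.0, 3.0]])
in_span, in_null = split_by_prompt_subspace(feats, prompts)
print(in_span)  # component explained by the prompt subspace
print(in_null)  # component orthogonal to it
```

A feature whose energy sits mostly in the orthogonal component is poorly explained by the prompt subspace; whether and how the paper turns this into an OOD score is not stated in the abstract.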

[145] KM-GPT: An Automated Pipeline for Reconstructing Individual Patient Data from Kaplan-Meier Plots

Yao Zhao,Haoyue Sun,Yantian Ding,Yanxun Xu

Main category: cs.LG

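TL;DR: The paper introduces KM-GPT, a fully automated AI pipeline that reconstructs individual patient data (IPD) directly from Kaplan-Meier (KM) plots, addressing the limitations of manual methods and improving accuracy and scalability.

Details Motivation: Traditional IPD reconstruction relies on manual digitization, which is error-prone and hard to scale. KM-GPT aims to overcome these problems through automation and give clinical research a more efficient tool.

Contribution: KM-GPT is presented as the first fully automated IPD reconstruction pipeline, combining advanced image preprocessing, GPT-5-powered multimodal reasoning, and iterative reconstruction algorithms to improve data quality and accessibility.

Method: A hybrid reasoning architecture integrates image preprocessing, GPT-5-powered multimodal reasoning, and iterative reconstruction algorithms; a user-friendly web interface and an AI assistant simplify operation.

Result: KM-GPT's accuracy and robustness were validated on synthetic and real-world datasets, and it was applied to a meta-analysis of gastric cancer immunotherapy trials.

Insight: KM-GPT illustrates the potential of AI in clinical research: automated IPD reconstruction improves efficiency and also supports deeper downstream analyses and evidence-based decision-making.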

Abstract: Reconstructing individual patient data (IPD) from Kaplan-Meier (KM) plots provides valuable insights for evidence synthesis in clinical research. However, existing approaches often rely on manual digitization, which is error-prone and lacks scalability. To address these limitations, we develop KM-GPT, the first fully automated, AI-powered pipeline for reconstructing IPD directly from KM plots with high accuracy, robustness, and reproducibility. KM-GPT integrates advanced image preprocessing, multi-modal reasoning powered by GPT-5, and iterative reconstruction algorithms to generate high-quality IPD without manual input or intervention. Its hybrid reasoning architecture automates the conversion of unstructured information into structured data flows and validates data extraction from complex KM plots. To improve accessibility, KM-GPT is equipped with a user-friendly web interface and an integrated AI assistant, enabling researchers to reconstruct IPD without requiring programming expertise. KM-GPT was rigorously evaluated on synthetic and real-world datasets, consistently demonstrating superior accuracy. To illustrate its utility, we applied KM-GPT to a meta-analysis of gastric cancer immunotherapy trials, reconstructing IPD to facilitate evidence synthesis and biomarker-based subgroup analyses. By automating traditionally manual processes and providing a scalable, web-based solution, KM-GPT transforms clinical research by leveraging reconstructed IPD to enable more informed downstream analyses, supporting evidence-based decision-making.

[146] MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Tianyu Yu,Zefan Wang,Chongyi Wang,Fuwei Huang,Wenshuo Ma,Zhihui He,Tianchi Cai,Weize Chen,Yuxiang Huang,Yuanqian Zhao,Bokai Xu,Junbo Cui,Yingjing Xu,Liqing Ruan,Luoyuan Zhang,Hanyu Liu,Jingkun Tang,Hongyuan Liu,Qining Guo,Wenhao Hu,Bingxiang He,Jie Zhou,Jie Cai,Ji Qi,Zonghao Guo,Chi Chen,Guoyang Zeng,Yuxuan Li,Ganqu Cui,Ning Ding,Xu Han,Yuan Yao,Zhiyuan Liu,Maosong Sun

Main category: cs.LG

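TL;DR: MiniCPM-V 4.5 is an efficient 8B-parameter multimodal large language model that improves both performance and efficiency through changes to architecture, data strategy, and training method.

Details Motivation: The training and inference efficiency of multimodal large language models (MLLMs) has become a bottleneck that limits their accessibility and scalability. MiniCPM-V 4.5 targets this problem through efficient design and optimization.

Contribution: 1. A unified 3D-Resampler architecture. 2. A unified learning paradigm that avoids heavy data engineering. 3. A hybrid reinforcement learning strategy.

Method: 1. The 3D-Resampler compactly encodes images and videos. 2. A unified learning paradigm covers document knowledge and text recognition. 3. Hybrid reinforcement learning optimizes both short and long reasoning modes.

Result: In the OpenCompass evaluation, MiniCPM-V 4.5 surpasses GPT-4o-latest and Qwen2.5-VL 72B. On the VideoMME benchmark, it achieves state-of-the-art performance among models under 30B while using 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.

Insight: Efficient architecture design and training strategy can deliver strong performance in a relatively small model while significantly reducing resource consumption.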

Abstract: Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% GPU memory cost and 8.7% inference time of Qwen2.5-VL 7B.

[147] Latent Danger Zone: Distilling Unified Attention for Cross-Architecture Black-box Attacks

Yang Li,Chenyu Wang,Tingrui Wang,Yongwei Wang,Haonan Li,Zhunga Liu,Quan Pan

Main category: cs.LG

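TL;DR: The paper proposes JAD, a black-box adversarial attack method that generates adversarial examples guided by attention maps from both a CNN and a ViT, improving cross-architecture generality and efficiency.

Details Motivation: Current black-box adversarial attacks depend on specific architectures and incur high query costs, so a general and efficient attack method is needed.

Contribution: Proposes the JAD framework, which achieves cross-architecture attacks through attention distillation and a latent diffusion model, improving generation efficiency and attack transferability.

Method: Attention maps distilled from a CNN and a ViT jointly guide a latent diffusion model to generate adversarial examples.

Result: JAD outperforms existing methods in attack generalization, generation efficiency, and cross-architecture transferability.

Insight: Combining attention information from multiple architectures can significantly improve the generality of adversarial attacks.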

Abstract: Black-box adversarial attacks remain challenging due to limited access to model internals. Existing methods often depend on specific network architectures or require numerous queries, resulting in limited cross-architecture transferability and high query costs. To address these limitations, we propose JAD, a latent diffusion model framework for black-box adversarial attacks. JAD generates adversarial examples by leveraging a latent diffusion model guided by attention maps distilled from both a convolutional neural network (CNN) and a Vision Transformer (ViT) model. By focusing on image regions that are commonly sensitive across architectures, this approach crafts adversarial perturbations that transfer effectively between different model types. This joint attention distillation strategy enables JAD to be architecture-agnostic, achieving superior attack generalization across diverse models. Moreover, the generative nature of the diffusion framework yields high adversarial sample generation efficiency by reducing reliance on iterative queries. Experiments demonstrate that JAD offers improved attack generalization, generation efficiency, and cross-architecture transferability compared to existing methods, providing a promising and effective paradigm for black-box adversarial attacks.

eess.AS

[148] Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation

Runyan Yang,Yuke Si,Yingying Gao,Junlan Feng,Chao Deng,Shilei Zhang

Main category: eess.AS

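TL;DR: The paper proposes a unified knowledge distillation framework that uses source-wise and layer-wise distillation to transfer the reasoning ability of a textual teacher model to a student audio model while preserving its acoustic competence.

Details Motivation: Current large audio language models struggle with complex reasoning, mainly because of the modality gap between audio and text and the lack of structured intermediate supervision.

Contribution: Proposes a dual-dimensional knowledge distillation framework that combines source-wise and layer-wise distillation and significantly improves the reasoning ability of audio models.

Method: Source-wise distillation draws complementary supervision from textual and acoustic teachers; layer-wise distillation aligns teacher signals with appropriate student layers to improve transfer efficiency.

Result: Experiments show that the method improves audio reasoning performance.

Insight: Fine-grained control over the distillation process helps bridge the gap between symbolic reasoning and speech representations.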

Abstract: While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio model while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.

cs.GR

[149] Zero-Shot Visual Deepfake Detection: Can AI Predict and Prevent Fake Content Before It’s Created?

Ayan Sar,Sampurna Roy,Tanupriya Choudhury,Ajith Abraham

Main category: cs.GR

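TL;DR: The paper studies zero-shot visual deepfake detection, covering self-supervised learning, transformer-based zero-shot classifiers, and generative model fingerprinting, and proposes prevention strategies including adversarial perturbations, digital watermarking, and blockchain-based verification, while outlining current challenges and future research directions.

Details Motivation: The rapid progress of deepfake technology threatens digital security, media integrity, and public trust, calling for zero-shot detection that can handle unseen deepfake variations.

Contribution: 1. Proposes a zero-shot deepfake detection approach; 2. Combines self-supervised learning with transformer-based techniques; 3. Introduces AI-driven prevention strategies; 4. Discusses future research directions.

Method: Uses self-supervised learning, transformer-based zero-shot classifiers, generative model fingerprinting, and meta-learning, combined with prevention mechanisms such as adversarial perturbations, digital watermarking, and blockchain-based verification.

Result: The paper presents a zero-shot detection and prevention framework, which still faces challenges such as adversarial attacks and scalability constraints.

Insight: 1. Multimodal fusion and explainable AI are needed to improve detection; 2. Quantum AI and federated learning are future directions; 3. Interdisciplinary collaboration is essential for building defenses.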

Abstract: Generative adversarial networks (GANs) and diffusion models have dramatically advanced deepfake technology, and its threats to digital security, media integrity, and public trust have increased rapidly. This research explored zero-shot deepfake detection, an emerging method that works even when the models have never seen a particular deepfake variation. In this work, we studied self-supervised learning, transformer-based zero-shot classifiers, generative model fingerprinting, and meta-learning techniques that better adapt to the ever-evolving deepfake threat. In addition, we suggested AI-driven prevention strategies that mitigated the underlying generation pipeline of the deepfakes before they occurred. They consisted of adversarial perturbations for creating deepfake generators, digital watermarking for content authenticity verification, real-time AI monitoring for content creation pipelines, and blockchain-based content verification frameworks. Despite these advancements, zero-shot detection and prevention faced critical challenges such as adversarial attacks, scalability constraints, ethical dilemmas, and the absence of standardized evaluation benchmarks. These limitations were addressed by discussing future research directions on explainable AI for deepfake detection, multimodal fusion based on image, audio, and text analysis, quantum AI for enhanced security, and federated learning for privacy-preserving deepfake detection. This further highlighted the need for an integrated defense framework for digital authenticity that utilized zero-shot learning in combination with preventive deepfake mechanisms. Finally, we highlighted the important role of interdisciplinary collaboration between AI researchers, cybersecurity experts, and policymakers to create resilient defenses against the rising tide of deepfake attacks.

[150] Differentiable Light Transport with Gaussian Surfels via Adapted Radiosity for Efficient Relighting and Geometry Reconstruction

Kaiwen Jiang,Jia-Mu Sun,Zilu Li,Dan Wang,Tzu-Mao Li,Ravi Ramamoorthi

Main category: cs.GR

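TL;DR: The paper proposes a differentiable light transport framework based on Gaussian surfels, using an adapted radiosity method for efficient relighting and geometry reconstruction.

Details Motivation: Radiance field methods do not model material reflective properties and lighting conditions, which leads to geometric ambiguities and makes relighting difficult. Physically-based rendering can address this but is computationally expensive.

Contribution: Proposes a differentiable light transport framework based on Gaussian surfels, extends classic radiosity theory to non-binary visibility and semi-opaque primitives, and introduces efficient solvers and a derived backward pass for gradient optimization.

Method: Uses Gaussian surfels as primitives and builds the light transport framework in the coefficient space of spherical harmonics, supporting diffuse and specular materials, with an efficient gradient computation.

Result: Experiments show that the method outperforms existing baselines in geometry reconstruction, view synthesis, and relighting, particularly on sparse datasets.

Insight: Combining physically-based rendering with efficient computation improves relighting and geometry reconstruction while also supporting real-time rendering.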

Abstract: Radiance fields have gained tremendous success with applications ranging from novel view synthesis to geometry reconstruction, especially with the advent of Gaussian splatting. However, they sacrifice modeling of material reflective properties and lighting conditions, leading to significant geometric ambiguities and the inability to easily perform relighting. One way to address these limitations is to incorporate physically-based rendering, but it has been prohibitively expensive to include full global illumination within the inner loop of the optimization. Therefore, previous works adopt simplifications that make the whole optimization with global illumination effects efficient but less accurate. In this work, we adopt Gaussian surfels as the primitives and build an efficient framework for differentiable light transport, inspired by classic radiosity theory. The whole framework operates in the coefficient space of spherical harmonics, enabling both diffuse and specular materials. We extend the classic radiosity into non-binary visibility and semi-opaque primitives, propose novel solvers to efficiently solve the light transport, and derive the backward pass for gradient optimizations, which is more efficient than auto-differentiation. During inference, we achieve view-independent rendering where light transport need not be recomputed under viewpoint changes, enabling hundreds of FPS for global illumination effects, including view-dependent reflections using a spherical harmonics representation. Through extensive qualitative and quantitative experiments, we demonstrate superior geometry reconstruction, view synthesis and relighting compared with previous inverse rendering baselines, or data-driven baselines given relatively sparse datasets with known or unknown lighting conditions.
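For readers unfamiliar with the classic radiosity theory the abstract builds on, the diffuse case reduces to the linear system B = E + ρ ⊙ (F B), where B is the radiosity of each patch, E its emission, ρ its reflectance, and F the form-factor matrix. The sketch below solves that textbook system by fixed-point iteration. It does not cover the paper's extensions (spherical harmonics coefficients, non-binary visibility, semi-opaque surfels, or the custom backward pass), and the two-patch scene is made up for illustration.

```python
import numpy as np


def solve_radiosity(emission, reflectance, form_factors, iters=200):
    """Fixed-point (Jacobi-style) iteration for B = E + rho * (F @ B).

    Each pass adds one more bounce of light; the iteration converges when
    the spectral radius of diag(rho) @ F is below 1.
    """
    b = emission.copy()
    for _ in range(iters):
        b = emission + reflectance * (form_factors @ b)
    return b


# Toy scene (assumed values): patch 0 emits light, patch 1 only reflects.
E = np.array([1.0, 0.0])
rho = np.array([0.5, 0.5])
F = np.array([[0.0, 0.5],
              [0.5, 0.0]])

B_iter = solve_radiosity(E, rho, F)
B_direct = np.linalg.solve(np.eye(2) - np.diag(rho) @ F, E)
print(B_iter, B_direct)
```

Because the solution depends only on the scene and not on the camera, it can be reused across viewpoints, which matches the view-independent rendering property the abstract mentions.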

[151] Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

Pin-Yen Chiu,I-Sheng Fang,Jun-Cheng Chen

Main category: cs.GR

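TL;DR: Text Slider is a lightweight, efficient, plug-and-play framework that uses LoRA adapters for continuous concept control in image/video synthesis, significantly reducing training time and GPU memory usage.

Details Motivation: Existing concept control methods require substantial training time and resources and must be retrained for different diffusion backbones. Text Slider aims to provide an efficient and flexible alternative.

Contribution: Proposes Text Slider, which achieves continuous concept control through low-rank directions, supports multi-concept composition, and significantly improves training efficiency and resource usage.

Method: Identifies low-rank directions within a pre-trained text encoder and uses LoRA adapters for parameter-efficient control, supporting plug-and-play use and multi-concept composition.

Result: Training is 5$\times$ faster than Concept Slider and 47$\times$ faster than Attribute Control, with GPU memory usage reduced by nearly 2$\times$ and 4$\times$, respectively.

Insight: The low-rank direction approach preserves generation quality while greatly improving training efficiency and adaptability, offering a new angle on continuous concept control.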

Abstract: Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory usage to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5$\times$ faster training than Concept Slider and 47$\times$ faster than Attribute Control, while reducing GPU memory usage by nearly 2$\times$ and 4$\times$, respectively.
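The general LoRA mechanism behind a continuous "slider" can be sketched as a low-rank update added to a frozen weight matrix and scaled by a scalar strength. This is the generic LoRA formulation, not Text Slider's training procedure; the function name, the matrix shapes, and the idea of exposing the scale as the slider value are assumptions for illustration.

```python
import numpy as np


def apply_slider(weight, lora_down, lora_up, scale):
    """Return weight + scale * (lora_up @ lora_down).

    weight:    (out_dim, in_dim) frozen pre-trained weight
    lora_down: (rank, in_dim)    low-rank factor
    lora_up:   (out_dim, rank)   low-rank factor
    scale:     scalar slider strength; 0 leaves the weight unchanged
    """
    return weight + scale * (lora_up @ lora_down)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
down = rng.normal(size=(2, 8))  # rank-2 direction
up = rng.normal(size=(8, 2))

for s in (-1.0, 0.0, 0.5, 1.0):
    delta = apply_slider(W, down, up, s) - W
    print(s, np.linalg.matrix_rank(delta))
```

Only the two small factors would be trained (here 32 values instead of 64), and in this formulation several concepts could be composed by summing their scaled updates.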

[152] One-shot Embroidery Customization via Contrastive LoRA Modulation

Jun Ma,Qian He,Gaofeng He,Huang Chen,Chen Liu,Xiaogang Jin,Huamin Wang

Main category: cs.GR

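TL;DR: The paper proposes a contrastive-learning-based LoRA modulation framework for fine-grained embroidery customization from a single reference image. A two-stage contrastive LoRA modulation separates style from content; the method surpasses prior methods on embroidery customization and generalizes to other domains.

Details Motivation: Embroidery is a textile art form with intricate stitch patterns and material properties, and traditional style transfer methods struggle with its fine-grained features. A method is therefore needed that can separate style from content using a single reference image.

Contribution: 1) Proposes a contrastive-learning-based LoRA modulation framework that separates fine-grained style and content; 2) Builds a benchmark for embroidery customization; 3) Demonstrates generalization to artistic style transfer, sketch colorization, and appearance transfer.

Method: 1) Defines style and content using the decoupled representations of pretrained diffusion models; 2) Two-stage contrastive LoRA modulation: the first stage iteratively updates the whole LoRA and the selected style blocks to initially separate style from content, and the second stage further decouples them through self-knowledge distillation; 3) Builds an inference pipeline that handles image or text inputs.

Result: Experiments show that the method surpasses prior methods on embroidery customization and generalizes strongly to three additional domains, including artistic style transfer.

Insight: Contrastive learning combined with LoRA modulation can effectively separate style from content in fine-grained tasks, and the approach may apply to other tasks that require high-precision feature transfer.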

Abstract: Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricate interplay of diverse stitch patterns and material properties, which poses unique challenges for existing style transfer methods. To explore the customization for such fine-grained features, we propose a novel contrastive learning framework that disentangles fine-grained style and content features with a single reference image, building on the classic concept of image analogy. We first construct an image pair to define the target style, and then adopt a similarity metric based on the decoupled representations of pretrained diffusion models for style-content separation. Subsequently, we propose a two-stage contrastive LoRA modulation technique to capture fine-grained style features. In the first stage, we iteratively update the whole LoRA and the selected style blocks to initially separate style from content. In the second stage, we design a contrastive learning strategy to further decouple style and content through self-knowledge distillation. Finally, we build an inference pipeline to handle image or text inputs with only the style blocks. To evaluate our method on fine-grained style transfer, we build a benchmark for embroidery customization. Our approach surpasses prior methods on this task and further demonstrates strong generalization to three additional domains: artistic style transfer, sketch colorization, and appearance transfer.

cs.AI

[153] The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks

Yu Gu,Jingjing Fu,Xiaodong Liu,Jeya Maria Jose Valanarasu,Noel Codella,Reuben Tan,Qianchu Liu,Ying Jin,Sheng Zhang,Jinyu Wang,Rui Wang,Lei Song,Guanghui Qin,Naoto Usuyama,Cliff Wong,Cheng Hao,Hohin Lee,Praneeth Sanapathi,Sarah Hilado,Bian Jiang,Javier Alvarez-Valle,Mu Wei,Jianfeng Gao,Eric Horvitz,Matt Lungren,Hoifung Poon,Paul Vozila

Main category: cs.AI

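TL;DR: The paper exposes the brittleness of large models on multimodal medical benchmarks: despite high scores, their performance relies on test-taking tricks rather than genuine medical understanding.

Details Motivation: The study aims to expose the limitations of current medical benchmarks and the gap between models' high scores and their readiness for real medical settings.

Contribution: Through stress tests and clinician-guided evaluation, reveals brittleness and shortcut learning in six frontier models on multimodal medical benchmarks.

Method: Applies stress tests (e.g., removing key inputs, making trivial prompt changes) and clinician-guided rubrics to evaluate models on six widely used medical benchmarks.

Result: Models score highly on the benchmarks but show clear brittleness and instability under stress tests, and what the benchmarks measure is disconnected from real medical value.

Insight: Benchmark scores alone should not be used to judge AI readiness for healthcare; robustness, sound reasoning, and alignment with real medical demands also matter.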

Abstract: Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren’t glitches; they expose how today’s benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.

[154] Memory-QA: Answering Recall Questions Based on Multimodal Memories

Hongda Jiang,Xinyuan Zhang,Siddhant Garg,Rishab Arora,Shiun-Zu Kuo,Jiayang Xu,Christopher Brossman,Yue Liu,Aaron Colak,Ahmed Aly,Anuj Kumar,Xin Luna Dong

Main category: cs.AI

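TL;DR: Memory-QA is a new task that involves answering recall questions based on multimodal memories. The authors propose the Pensieve pipeline, which combines memory-specific augmentation, time- and location-aware retrieval, and multi-memory QA fine-tuning, and outperforms existing solutions on a newly built multimodal benchmark.

Details Motivation: Many real-world tasks require answering questions from past memories such as visual content. Existing models fall short in storing and retrieving multimodal memories and do not fully exploit temporal and location information or reasoning over multiple memories.

Contribution: 1) Introduces the Memory-QA task, which focuses on question answering over multimodal memories; 2) Designs the Pensieve pipeline, integrating memory-specific augmentation, time- and location-aware retrieval, and multi-memory QA fine-tuning; 3) Builds a multimodal benchmark that illustrates the real-world challenges of the task.

Method: Pensieve has three core components: 1) memory-specific augmentation, which improves how memories are stored; 2) time- and location-aware multi-signal retrieval; 3) multi-memory QA fine-tuning, which learns to draw on multiple memories to answer a question.

Result: On the proposed multimodal benchmark, Pensieve improves QA accuracy over state-of-the-art solutions by up to 14%.

Insight: Temporal and location information is crucial for multimodal memory QA, and the ability to reason over multiple memories can significantly improve performance.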

Abstract: We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% on QA accuracy).

[155] Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World

Saeed Almheiri,Rania Hossam,Mena Attia,Chenxi Wang,Preslav Nakov,Timothy Baldwin,Fajri Koto

Main category: cs.AI

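TL;DR: The paper studies cross-cultural transfer of commonsense reasoning in large language models (LLMs) within the Arab world, finds that a small number of culture-specific examples can significantly improve performance, and shows that efficient cross-cultural alignment is possible.

Details Motivation: LLMs often reflect Western-centric biases, which limits their effectiveness in diverse cultural contexts. Existing work mostly targets alignment to a single culture, and the potential of cross-cultural transfer remains underexplored.

Contribution: 1. Proposes and validates the feasibility of cross-cultural transfer of commonsense reasoning; 2. Shows that a small number of culture-specific examples yields a clear performance gain; 3. Finds that out-of-culture demonstrations (e.g., from Indonesian and US contexts) can be similarly effective for the Arab world.

Method: Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, evaluates lightweight alignment methods (in-context learning, DITTO) alongside baselines (supervised fine-tuning, direct preference optimization).

Result: Only 12 culture-specific examples from one country improve performance in other countries by 10% on average. Out-of-culture demonstrations can match or surpass in-culture alignment for MCQ reasoning.

Insight: Cross-cultural alignment is efficient and scalable, offering a way to adapt models to low-resource cultural settings and pointing to the role of shared cultural commonsense.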

Abstract: Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning and direct preference optimization. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average, within multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesia and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.