Table of Contents

cs.CL

[1] PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

Souradeep Mukhopadhyay,Rishabh Baral,Nimeesh Mahajan,Samhitha Harish,Aswin RRV,Mihir Parmar,Mutsumi Nakamura,Chitta Baral

Main category: cs.CL

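TL;DR: Large language models (LLMs) perform well on classic logic puzzles, but this study finds that they rely on memorized templates rather than genuine reasoning. The PHANTOM RECALL benchmark exposes how fragile models are on perturbed puzzles, and the authors propose tools and a framework to mitigate the problem.

Details Motivation: The study asks whether LLMs rely on memorization rather than reasoning when solving logic puzzles, and uses systematic experiments to expose their limitations.

Contribution: The main contributions are: 1) the PHANTOM RECALL benchmark; 2) an automated logical-equivalence judge; 3) a taxonomy of reasoning errors; 4) a prompting-based mitigation framework.

Method: The authors design 25 classic puzzles with 149 perturbed variants, evaluate 11 LLMs, and build tools to detect reasoning errors and mitigate them.

Result: LLMs perform significantly worse than humans on perturbed puzzles and show clear signs of relying on memorization.

Insight: There is a gap between LLMs' linguistic fluency and their logical understanding, which highlights the need to redesign prompts.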

Abstract: Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles–but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode–phantom recall–where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift–highlighting the gap between linguistic fluency and logical understanding.

[2] LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

Patrick Haller,Mark Ibrahim,Polina Kirichenko,Levent Sagun,Samuel J. Bell

Main category: cs.CL

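TL;DR: This paper studies the brittleness of knowledge representations in large language models (LLMs) and finds that their truthfulness representations depend heavily on the surface form of the input rather than on robust semantic understanding.

Details Motivation: The authors investigate how reliably LLMs apply knowledge across diverse real-world settings, and in particular how sensitive their internal knowledge representations are to changes in the input.

Contribution: The main contribution is showing that LLMs' truthfulness representations collapse under superficial changes to the input, which indicates that their knowledge representations are shallow and non-robust.

Method: The authors apply semantically-preserving perturbations to the input (such as typos or reformulations) and measure how representation separability degrades across LLM families, datasets, and knowledge probing methods.

Result: Truthfulness representations break down quickly as the input form diverges from the pre-training data, showing that the models' knowledge representations depend on surface form.

Insight: The study underlines the need to make LLM knowledge representations more robust, and it challenges the practical utility of truthfulness probes.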

Abstract: For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings – often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness – i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples’ presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement’s exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.

[3] LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Armel Zebaze,Rachel Bawden,Benoît Sagot

Main category: cs.CL

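TL;DR: The paper examines whether generating intermediate "thinking tokens" helps large reasoning models (LRMs) on machine translation (MT). It finds that they do not improve performance, although intermediate tokens built from modular translation-specific prompting strategies do yield gains.

Details Motivation: LRMs excel at mathematics and coding tasks, but their use in machine translation remains underexplored; the authors test whether generating intermediate tokens (thinking tokens) can improve MT performance.

Contribution: The paper shows the limits of intermediate tokens for MT and identifies modular translation-specific prompting strategies as the approach that improves performance.

Method: Experiments compare standard input-output fine-tuning with fine-tuning on synthetic chain-of-thought (CoT) explanations, and evaluate intermediate tokens constructed from modular translation-specific prompting strategies.

Result: Fine-tuning on synthetic CoT does not outperform standard fine-tuning, whereas intermediate tokens built from modular translation-specific prompting strategies improve MT performance.

Insight: The value of intermediate tokens depends on whether they contain actual translation attempts; using a teacher to refine target translations or to expand parallel corpora is more effective than distilling CoT that imitates human translators' thought processes.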

Abstract: Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that “thinking tokens” do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators’ practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into “thinking” MT models.

[4] Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

Lorena Calvo-Bartolomé,Valérie Aldana,Karla Cantarero,Alonso Madroñal de Mesa,Jerónimo Arenas-García,Jordan Boyd-Graber

Main category: cs.CL

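TL;DR: The paper proposes MIND, a user-in-the-loop fact-checking pipeline for multilingual question answering systems that detects factual and cultural discrepancies.

Details Motivation: Multilingual QA systems must stay factually consistent across languages while also accounting for cultural variation in subjective questions.

Contribution: The authors propose the MIND pipeline for detecting factual and cultural discrepancies in multilingual QA systems, and release an annotated bilingual QA dataset.

Method: MIND uses a user-in-the-loop pipeline to identify cultural and factual discrepancies, and is evaluated on a dataset from the maternal and infant health domain as well as datasets from other domains.

Result: MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.

Insight: Cultural variation cannot be ignored in multilingual QA systems, and a user-in-the-loop approach can improve a system's accuracy and adaptability.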

Abstract: Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as What is jaundice?, while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., Who assists in childbirth?) that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.

[5] TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Yupei Li,Philipp Borchert,Gerasimos Lampouras

Main category: cs.CL

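TL;DR: TopoAlign is a framework that aligns code with mathematics through topological decomposition, using widely available code repositories as training resources for Math LLMs and substantially improving model performance.

Details Motivation: Current Math LLMs perform poorly at autoformalisation, the task of turning informal mathematical statements into formal ones, mainly because large-scale corpora of paired informal and formal statements are scarce. Existing code-generation data cannot be used directly to train Math LLMs because its structure differs from that of formal mathematics.

Contribution: The authors propose TopoAlign, which decomposes code into docstrings, main functions, and dependency functions, then reassembles them into code data that structurally resembles formal mathematical statements, so that Math LLMs can be trained without additional human annotation.

Method: TopoAlign decomposes code into structural components and reassembles them into data aligned with the structure of formal statements. Two models, DeepSeek-Math and Herald, are trained and evaluated on the minif2f, Putnam, and ProofNet benchmarks.

Result: TopoAlign substantially improves DeepSeek-Math, with gains of 17.77% on BEq@10 and 68.82% on typecheck@10. Even for the specialized model Herald, BEq@10 and typecheck@10 improve by 0.12% and 1.09%, respectively.

Insight: Even without introducing new mathematical knowledge, training Math LLMs on structurally aligned code data improves performance, which suggests that code repositories can serve as an effective training resource.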

Abstract: Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
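
The decomposition step described in this abstract can be illustrated with a small sketch. It uses only Python's standard `ast` module and is not the authors' implementation: the function name `decompose`, the choice to follow only direct calls by name, and the sample module are all assumptions made for illustration, and the reassembly into formal-statement analogues is not shown because the abstract does not specify its layout.

```python
import ast

# Hypothetical sample module: one helper and one "main" function that calls it.
SAMPLE = '''
def square(x):
    return x * x

def sum_of_squares(values):
    """Return the sum of the squares of the values."""
    return sum(square(v) for v in values)
'''


def decompose(source: str, main_name: str) -> dict:
    """Split a module into docstring, main function, and dependency functions.

    Assumption: a "dependency" is any top-level function that the main
    function calls directly by name; the paper may define it differently.
    """
    tree = ast.parse(source)
    functions = {
        node.name: node for node in tree.body if isinstance(node, ast.FunctionDef)
    }
    main = functions[main_name]
    # Collect every plain-name call made anywhere inside the main function.
    called = {
        node.func.id
        for node in ast.walk(main)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    dependency_names = [
        name for name in functions if name in called and name != main_name
    ]
    return {
        "docstring": ast.get_docstring(main) or "",
        "main": ast.get_source_segment(source, main),
        "dependencies": [
            ast.get_source_segment(source, functions[name])
            for name in dependency_names
        ],
    }


parts = decompose(SAMPLE, "sum_of_squares")
print(parts["docstring"])
print(len(parts["dependencies"]), "dependency function(s)")
```

Built-ins such as `sum` are called inside the main function too, but they are filtered out because they are not defined at module level.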

[6] Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Gabrielle Kaili-May Liu,Bryan Li,Arman Cohan,William Gantt Walden,Eugene Yang

Main category: cs.CL

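TL;DR: The paper presents an automated pipeline for generating uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), addressing gaps in existing RAG benchmarks, and validates it experimentally.

Details Motivation: In real-world use, RAG systems often face complex queries, but existing benchmarks do not reflect realistic task complexity (such as multi-hop or out-of-scope questions), which limits their ability to reveal the limitations of RAG systems.

Contribution: The authors present the first pipeline for automatically generating CRUMQs, making benchmarks harder and more realistic and encouraging the development of more capable RAG systems.

Method: The pipeline is used to generate CRUMQs over two popular RAG datasets, and benchmark experiments verify how challenging and useful the resulting queries are.

Result: CRUMQs are markedly more challenging for RAG systems and reduce cheatability scores by up to 81.0%, showing their effectiveness at exposing system limitations.

Insight: The CRUMQs pipeline offers a simple way to raise benchmark difficulty and realism, supporting further research on and improvement of RAG systems.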

Abstract: Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.

[7] Scaling Long-Horizon LLM Agent via Context-Folding

Weiwei Sun,Miao Lu,Zhan Ling,Kang Liu,Xuesong Yao,Yiming Yang,Jiecao Chen

Main category: cs.CL

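TL;DR: The paper proposes Context-Folding, a framework that addresses context-length limits on long-horizon tasks by managing context dynamically, together with FoldGRPO, a reinforcement learning method. Experiments show that, with a much smaller active context, the approach outperforms summarization-based baselines.

Details Motivation: Large language model (LLM) agents are constrained by context length on long-horizon tasks, and existing methods do not manage context efficiently.

Contribution: The authors propose the Context-Folding framework, which lets an agent branch into a subtask and then fold away the intermediate steps, and they develop the FoldGRPO reinforcement learning method to learn this behavior.

Method: The FoldGRPO reinforcement learning framework uses process rewards targeted at task decomposition and context management to make dynamic context folding learnable.

Result: On complex long-horizon tasks (Deep Research and SWE), the method uses an active context 10 times smaller, matches or outperforms ReAct baselines, and significantly outperforms summarization-based methods.

Insight: Dynamic context management is key to overcoming limits on long-horizon tasks, and reinforcement learning can effectively optimize task decomposition and context-folding behavior.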

Abstract: Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10× smaller and significantly outperforms models that rely on summarization-based context management.
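
The branch-and-fold behavior described in this abstract can be pictured with a toy data structure. The sketch below is an assumption-laden illustration, not the paper's agent: the class `FoldingContext`, its method names, and the bracketed markers are hypothetical, and a real system would have the model produce the summary itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FoldingContext:
    """Toy working context with a single level of branching (an assumption)."""

    main: List[str] = field(default_factory=list)
    branch: Optional[List[str]] = None

    def add(self, step: str) -> None:
        # Steps go to the open branch if there is one, else to the main thread.
        target = self.branch if self.branch is not None else self.main
        target.append(step)

    def start_branch(self, subtask: str) -> None:
        if self.branch is not None:
            raise RuntimeError("a branch is already open")
        self.main.append(f"[branch] {subtask}")
        self.branch = []

    def fold(self, summary: str) -> None:
        # Discard the intermediate steps and keep only the outcome summary.
        if self.branch is None:
            raise RuntimeError("no branch to fold")
        self.branch = None
        self.main.append(f"[folded] {summary}")

    def active(self) -> List[str]:
        """Everything the model would currently see."""
        return self.main + (self.branch or [])


ctx = FoldingContext()
ctx.add("plan the investigation")
ctx.start_branch("search the docs")
ctx.add("query 1")
ctx.add("query 2")
size_before_fold = len(ctx.active())  # plan, branch marker, two queries
ctx.fold("the docs confirm the hypothesis")
print(size_before_fold, "->", len(ctx.active()))
```

After folding, the two query steps are gone from the active context and only the branch marker and the summary remain, which is the source of the context savings the abstract reports.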

[8] Conjecturing: An Overlooked Step in Formal Mathematical Reasoning

Jasivan Alex Sivakumar,Philipp Borchert,Ronald Cardenas,Gerasimos Lampouras

Main category: cs.CL

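TL;DR: The paper examines conjecturing, an overlooked step in formal mathematical reasoning, and argues that it is a key prerequisite for autoformalisation. The authors introduce the ConjectureBench dataset and a new evaluation framework, reveal the limits of current large language models (LLMs) at conjecturing, and propose the Lean-FIRe method to improve performance.

Details Motivation: Existing autoformalisation tasks do not evaluate the conjecturing step directly, so performance is overestimated. The authors aim to fill this gap and stress the importance of conjecturing.

Contribution: 1. The ConjectureBench dataset and evaluation framework; 2. Lean-FIRe, a method that improves how conjecturing is integrated; 3. The first end-to-end autoformalisation of 13 PutnamBench problems.

Method: 1. Augment existing datasets and design a new evaluation metric; 2. Propose Lean-FIRe, an inference-time method that improves the combination of conjecturing and formalisation; 3. Run experiments on the LLMs GPT-4.1 and DeepSeek-V3.1.

Result: Experiments show that, because conjecturing ability is lacking, model performance on autoformalisation has been overestimated. Lean-FIRe substantially improves the results of GPT-4.1 and DeepSeek-V3.1.

Insight: Conjecturing is a key step in formal mathematical reasoning and needs to be studied as an independent task; integrating conjecturing with formalisation can substantially improve end-to-end performance.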

Abstract: Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.

[9] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Ryan Shea,Yunan Lu,Liang Qiu,Zhou Yu

Main category: cs.CL

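TL;DR: SAGE is a novel user simulation framework for multi-turn agent evaluation that incorporates knowledge of business logic, generates more realistic and diverse interactions, and is better at uncovering agent errors.

Details Motivation: Evaluating multi-turn interactive agents requires human assessment, which is costly, and existing user simulation methods usually ignore domain-specific knowledge and struggle to capture realistic user behavior.

Contribution: The authors propose the SAGE framework, which combines top-down knowledge from business logic with bottom-up knowledge from business infrastructure to generate more realistic simulated user interactions.

Method: The top-down component draws on ideal customer profiles, and the bottom-up component uses product catalogs, FAQs, and knowledge bases to generate interactions that reflect users' information needs.

Result: Empirical evaluation shows that the interactions SAGE generates are more realistic and diverse and identify up to 33% more agent errors.

Insight: Multi-turn user simulation grounded in domain knowledge can substantially improve agent evaluation and supports iterative agent improvement.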

Abstract: Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users’ information needs and expectations in a company’s target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

[10] Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models

Yukun Zhang,Qi Dong

Main category: cs.CL

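TL;DR: The paper proposes Hierarchical Alignment, a new method that applies Direct Preference Optimization (DPO) selectively to distinct functional blocks (local, intermediate, and global layers) instead of optimizing all layers uniformly as conventional methods do. Experiments show clear gains in grammatical fluency, factual consistency, and logical coherence while avoiding the alignment tax seen in standard DPO.

Details Motivation: Existing alignment techniques for large language models (LLMs), such as DPO, usually treat the model as a single entity and apply uniform optimization pressure to all layers, overlooking the functional specialization of different layers in the Transformer architecture (for example syntax, logic, and factual processing).

Contribution: The authors propose Hierarchical Alignment, which optimizes different functional layers selectively, substantially improves model performance, and avoids the limitations of monolithic alignment.

Method: LoRA is used for surgical fine-tuning of models such as Llama-3.1-8B and Qwen1.5-7B, applying DPO to the local layers (Local-Align), the intermediate layers (Logic-Align), and the global layers (Global-Align), with results evaluated by an LLM-as-Judge.

Result: Aligning the local layers improves grammatical fluency; aligning the global layers improves factual consistency and also markedly improves logical coherence; none of the strategies shows the alignment tax observed in standard DPO.

Insight: Hierarchical Alignment offers a more efficient, controllable, and interpretable path to alignment, suggesting that structure-aware surgical fine-tuning can substantially improve the reliability and advanced capabilities of LLMs.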

Abstract: Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model’s layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the “alignment tax” observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.
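
To make the idea of functional blocks concrete, the sketch below partitions a model's layer indices into three contiguous groups. The abstract does not say where the block boundaries lie, so the even three-way split, the function name `layer_blocks`, and the 32-layer example are assumptions for illustration only.

```python
from typing import Dict, List


def layer_blocks(num_layers: int) -> Dict[str, List[int]]:
    """Partition layer indices 0..num_layers-1 into three contiguous blocks.

    Assumption: an even three-way split. The paper's actual boundaries
    between local, intermediate, and global layers may differ.
    """
    if num_layers < 3:
        raise ValueError("need at least three layers to form three blocks")
    first_cut = num_layers // 3
    second_cut = (2 * num_layers) // 3
    return {
        "local": list(range(0, first_cut)),
        "intermediate": list(range(first_cut, second_cut)),
        "global": list(range(second_cut, num_layers)),
    }


blocks = layer_blocks(32)
for name, indices in blocks.items():
    print(name, indices[0], "to", indices[-1])
```

A list of indices like `blocks["global"]` is the kind of input that layer-restricted adapter training takes, for example through the `layers_to_transform` option of PEFT's `LoraConfig`; whether the authors configured their runs this way is not stated in the abstract.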

[11] SafeMT: Multi-turn Safety for Multimodal Language Models

Han Zhu,Juntao Dai,Jiaming Ji,Haoran Li,Chengkun Cai,Pengcheng Wen,Chi-Min Chan,Boyuan Chen,Yaodong Yang,Sirui Han,Yike Guo

Main category: cs.CL

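TL;DR: SafeMT is a benchmark focused on the safety of multimodal large language models in multi-turn dialogues, containing 10,000 samples that cover 17 scenarios and 4 jailbreak methods. The authors propose a Safety Index (SI) for evaluating model safety and a dialogue safety moderator that reduces the multi-turn attack success rate more effectively.

Details Motivation: As multimodal large language models (MLLMs) become widespread, their safety problems are increasingly prominent, and the risk is higher in multi-turn dialogues. Existing benchmarks do not adequately cover this setting, and SafeMT aims to fill the gap.

Contribution: 1) The SafeMT benchmark for safety evaluation in multi-turn dialogues; 2) the Safety Index (SI) for quantifying model safety; 3) a dialogue safety moderator that effectively reduces the multi-turn attack success rate.

Method: The authors generate a diverse multi-turn dialogue dataset (10,000 samples) from harmful queries accompanied by images, spanning 17 scenarios and 4 jailbreak methods, and use it to evaluate model safety. The proposed moderator improves safety by detecting malicious intent in the dialogue and supplying the model with relevant safety policies.

Result: Experiments show that the attack success rate in multi-turn dialogues rises with the number of turns, indicating that the safety mechanisms of existing models are inadequate. The proposed moderator reduces the multi-turn attack success rate (ASR) more effectively than existing guard models.

Insight: Safety in multi-turn dialogues is a major weak point of multimodal large language models and calls for targeted defenses; the Safety Index and the moderator offer new ways to quantify and improve model safety.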

Abstract: With the widespread use of multi-modal large language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing the hazard in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective in reducing multi-turn ASR than existing guard models.

[12] A Survey on Parallel Reasoning

Ziqi Wang,Boye Niu,Zipeng Gao,Zhi Zheng,Tong Xu,Linghui Meng,Zhongli Li,Jing Liu,Yilong Chen,Chen Zhu,Hua Wu,Haifeng Wang,Enhong Chen

Main category: cs.CL

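TL;DR: This paper is a survey of parallel reasoning, an emerging inference paradigm that improves the robustness and performance of large language models (LLMs) by exploring multiple lines of thought at the same time. It defines parallel reasoning, organizes the related techniques into a taxonomy, and outlines future research directions.

Details Motivation: Conventional sequential reasoning methods (such as Chain-of-Thought) can fail as a whole when a single path goes wrong, whereas parallel reasoning improves robustness by exploring multiple lines of thought concurrently, which has made it an active research topic.

Contribution: 1. A formal definition of parallel reasoning that distinguishes it from concepts such as Chain-of-Thought; 2. A new taxonomy covering non-interactive reasoning, interactive reasoning, and efficiency-focused techniques; 3. A summary of application scenarios and future challenges.

Method: The survey organizes parallel reasoning techniques into a taxonomy: 1. non-interactive reasoning (multiple paths explored independently); 2. interactive reasoning (information shared between paths); 3. efficiency-focused strategies (such as dynamic pruning).

Result: The survey summarizes practical applications of parallel reasoning in solving complex problems and improving the reliability of LLM outputs, and notes its strengths (such as robustness) and limitations (such as computational cost).

Insight: Parallel reasoning offsets the fragility of sequential methods through diversified exploration, but computational efficiency and strategies for merging paths need further research.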

Abstract: With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of LLM outputs. Finally, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related resources are available at https://github.com/PPPP-kaqiu/Awesome-Parallel-Reasoning.
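
One familiar instance of exploring several paths and then converging is majority voting over independently sampled answers, often called self-consistency. The sketch below illustrates that idea only; treating it as an example of the survey's non-interactive category is an assumption, and `toy_sampler` is a hypothetical deterministic stand-in for a real model call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def parallel_vote(sample_answer: Callable[[int], str], num_paths: int) -> Tuple[str, float]:
    """Run independent reasoning paths concurrently and majority-vote the answers.

    Returns the winning answer and the fraction of paths that agreed with it.
    """
    with ThreadPoolExecutor(max_workers=num_paths) as pool:
        answers = list(pool.map(sample_answer, range(num_paths)))
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / num_paths


def toy_sampler(seed: int) -> str:
    # Hypothetical stand-in for an LLM call: every third path goes wrong.
    return "42" if seed % 3 else "41"


answer, agreement = parallel_vote(toy_sampler, num_paths=6)
print(answer, agreement)
```

The paths never exchange information, so a single faulty path cannot derail the others; the cost is that every path is paid for in full, which is the computational overhead the summary above mentions.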

[13] Towards Inference-time Scaling for Continuous Space Reasoning

Minghan Wang,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari

Main category: cs.CL

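TL;DR: The paper investigates whether inference-time techniques from the discrete space (such as multiple sample generation and PRM/ORM re-ranking) can be applied to continuous space reasoning. It finds that the main reason current methods are of limited use in the continuous space is the absence of key inductive biases, and argues that training frameworks should be adapted to introduce them.

Details Motivation: The authors want to extend techniques proven effective in the discrete space (such as multiple sample generation and PRM/ORM re-ranking) to the continuous space and to explore how well they apply there and where they fall short.

Contribution: The main contributions are: (1) showing that diverse reasoning paths can be generated through dropout-based sampling; (2) identifying the unique challenges of realizing performance gains in the continuous space; (3) finding that current performance is limited by the absence of key inductive biases; (4) arguing that future training frameworks should optimize for these biases explicitly.

Method: The study builds on the COCONUT continuous space reasoning model, generates multiple samples through dropout-based sampling, and uses Pass@N to analyze the potential gain; it also probes geometric properties and trajectory dynamics to identify why PRM/ORM works poorly in the continuous space.

Result: Although reasoning paths in the continuous space are diverse, PRM/ORM re-ranking yields only marginal improvements. Further geometric and dynamics analysis reveals the limitations of current methods.

Insight: Continuous space reasoning needs inductive biases that are introduced explicitly, so that correct and incorrect reasoning paths can be told apart at inference time, rather than optimizing for accuracy alone. This gives important guidance for future research.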

Abstract: Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential that could enable a significant gain in performance akin to observed gain in the discrete space. However, we highlight unique challenges faced for materializing this gain in the continuous thought space. In particular, working recipes for data generation and training PRM and ORM models in the discrete space unlock only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that the training frameworks for continuous reasoning LMs need not only to optimize for accuracy but also to explicitly incorporate inductive biases that could be utilized during inference-time for discrimination of correct and incorrect thoughts. Our code and data will be publicly available.
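
For readers unfamiliar with Pass@N, the sketch below computes the widely used unbiased pass@k estimate from n samples of which c are correct. The abstract does not say which estimator the authors use, so this is an assumption about the general metric rather than a description of their analysis; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generated samples (c of them correct) is correct.

    Uses the identity pass@k = 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct samples out of 10, drawing more samples raises the estimate.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```

Pass@N is an upper bound on what a re-ranker could achieve: it assumes a perfect selector picks the correct sample whenever one exists, which is why a large Pass@N gain can coexist with only marginal gains from PRM or ORM re-ranking.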

[14] Not in Sync: Unveiling Temporal Bias in Audio Chat Models

Jiayu Yao,Shenghua Liu,Yiwei Wang,Rundong Cheng,Lingrui Mei,Baolong Bi,Zhen Xiong,Xueqi Cheng

Main category: cs.CL

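TL;DR: This paper presents the first systematic study of temporal bias in large audio language models (LALMs), reveals a key limitation in their timestamp prediction, and proposes a quantitative measure (TBI) and a visualization framework.

Details Motivation: LALMs are widely applied to audio understanding and multimodal reasoning, but their ability to locate when events occur has not been studied thoroughly. The paper aims to expose systematic bias in the models' timestamp predictions.

Contribution: 1) The first systematic study of temporal bias in LALMs; 2) the Temporal Bias Index (TBI) for quantifying this bias; 3) a visualization framework; 4) evidence of how widespread the bias is and which factors influence it.

Method: Through controlled experiments on timestamped datasets, the authors analyze model behavior across audio lengths, event types, and event positions, and propose TBI as a quantitative measure.

Result: Temporal bias is prevalent across datasets and models, grows with audio length to as much as tens of seconds, and varies with event type and position.

Insight: Current LALMs have a fundamental limitation on time-sensitive tasks, and developing temporally robust architectures is an important direction for future work.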

Abstract: Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked “At which second does the lecturer introduce the key formula?”, models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length - even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
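
The abstract does not give the formula for the Temporal Bias Index, so the sketch below is not TBI. It uses mean signed error as a simple stand-in to show what "consistently earlier or later than the ground truth" means numerically and how it differs from plain absolute error; the function names and the timestamps are hypothetical.

```python
from statistics import mean
from typing import Sequence


def _check(predicted: Sequence[float], truth: Sequence[float]) -> None:
    if not predicted or len(predicted) != len(truth):
        raise ValueError("need two non-empty sequences of equal length")


def signed_bias(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Mean of (predicted - truth) in seconds; negative means 'too early'."""
    _check(predicted, truth)
    return mean(p - t for p, t in zip(predicted, truth))


def absolute_error(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Mean of |predicted - truth| in seconds; ignores the error direction."""
    _check(predicted, truth)
    return mean(abs(p - t) for p, t in zip(predicted, truth))


truth = [10.0, 28.0, 60.0]
early = [8.0, 25.0, 57.0]     # every prediction is too early
scattered = [12.0, 26.0, 60.0]  # errors point in both directions

print(signed_bias(early, truth), absolute_error(early, truth))
print(signed_bias(scattered, truth), absolute_error(scattered, truth))
```

A model whose errors are scattered in both directions can have a large absolute error but a signed bias near zero, whereas a systematic shift shows up in both numbers with a consistent sign.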

[15] HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

Ali Mekky,Omar El Herraoui,Preslav Nakov,Yuxia Wang

Main category: cs.CL

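TL;DR: The paper proposes HALF, a framework that evaluates the fairness of large language models (LLMs) in realistic applications and weights the outcomes by harm severity, addressing a gap in existing evaluation methods.

Details Motivation: LLMs are increasingly deployed in high-impact domains (such as clinical decision support and legal analysis), but fairness and bias evaluations are not grounded in real-world scenarios and do not distinguish levels of harm severity.

Contribution: The authors propose the HALF framework, which uses a five-stage pipeline to organize nine application domains into three tiers (Severe, Moderate, Mild) and to assess LLM fairness.

Method: HALF applies a five-stage pipeline that combines realistic application scenarios with harm severity to evaluate LLMs, and the study compares eight LLMs.

Result: The study finds that (1) LLMs are not consistently fair across domains; (2) model size or performance does not guarantee fairness; (3) reasoning models perform better in medical decision support but worse in education.

Insight: HALF exposes the gap between success on existing benchmarks and readiness for real deployment, and underlines the importance of fairness evaluation that accounts for harm severity.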

Abstract: Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
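
Weighting outcomes by harm severity can be illustrated with a small aggregate. The abstract names the three tiers but gives no weights, domain assignments, or scoring formula, so everything numeric below is hypothetical: the weights 3, 2, and 1, the domain-to-tier mapping, and the fairness scores are made up to show the mechanics only.

```python
from typing import Dict, Tuple

# Hypothetical tier weights; the paper's actual weighting is not given here.
TIER_WEIGHTS: Dict[str, float] = {"severe": 3.0, "moderate": 2.0, "mild": 1.0}


def harm_weighted_score(domain_scores: Dict[str, Tuple[str, float]]) -> float:
    """Weighted mean of per-domain fairness scores, weighted by tier severity.

    domain_scores maps a domain name to (tier, fairness score in [0, 1]).
    """
    total_weight = sum(TIER_WEIGHTS[tier] for tier, _ in domain_scores.values())
    weighted_sum = sum(
        TIER_WEIGHTS[tier] * score for tier, score in domain_scores.values()
    )
    return weighted_sum / total_weight


# Made-up scores and tier assignments for three domains.
scores = {
    "clinical decision support": ("severe", 0.6),
    "hiring": ("moderate", 0.8),
    "text summarization": ("mild", 1.0),
}

plain_mean = sum(score for _, score in scores.values()) / len(scores)
weighted = harm_weighted_score(scores)
print(round(plain_mean, 3), round(weighted, 3))
```

In this made-up case the weighted score is lower than the plain mean because the weakest result sits in the most severe tier, which is the kind of difference a harm-aware evaluation is meant to surface.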

[16] LLM-REVal: Can We Trust LLM Reviewers Yet?

Rui Li,Jia-Chen Gu,Po-Nien Kung,Heming Xia,Junfeng Liu,Xiangwen Kong,Zhifang Sui,Nanyun Peng

Main category: cs.CL

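TL;DR: The study examines the potential risks of deeply integrating large language models (LLMs) into peer review and research, revealing biases in LLM reviewers and their effect on human authors and scholarly fairness.

Details Motivation: As LLMs are used widely in academic workflows, their role as reviewers may introduce new risks, which remain largely underexplored.

Contribution: The study finds that LLM reviewers score LLM-authored papers too high and human-authored papers containing critical statements too low, exposing a linguistic feature bias and an aversion to critical content.

Method: A simulation combining a research agent and a review agent is used to assess how LLMs behave as reviewers, and the results are analyzed with human annotations.

Result: LLM reviewers show systematic bias in scoring, but revisions guided by their reviews also show potential to improve paper quality.

Insight: LLMs should be deployed as reviewers with caution to avoid harming scholarly fairness, yet they have potential for helping early-stage researchers and improving low-quality papers.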

Abstract: The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers by simulation. This simulation incorporates a research agent, which generates papers and revises, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers with critical statements (e.g., risk, fairness), even after multiple revisions. Our analysis reveals that these stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of the LLMs-as-reviewers for early-stage researchers and enhancing low-quality papers.

[17] PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Xiangjun Zai,Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Wenjie Zhang

Main category: cs.CL

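TL;DR: PRoH is a dynamic planning and reasoning framework for retrieval-augmented generation over knowledge hypergraphs. It combines context-aware planning, structured question decomposition, and an Entity-Weighted Overlap (EWO)-guided retrieval algorithm, and substantially improves multi-hop question answering.

Details Motivation: Existing knowledge hypergraph (KH)-based RAG methods have three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics. These limit their effectiveness on multi-hop question answering.

Contribution: 1. A context-aware planning module; 2. Structured question decomposition into a dynamically evolving Directed Acyclic Graph (DAG); 3. An Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm.

Method: 1. The context-aware planning module sketches the local KH neighborhood to generate a reasoning plan; 2. A dynamic DAG organizes the subquestions; 3. The EWO algorithm prioritizes semantically coherent hyperedge traversals.

Result: Across multiple domains, PRoH surpasses the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score.

Insight: Dynamic planning and semantics-aware reasoning path design are critical for multi-hop QA; structured question decomposition makes retrieval more flexible and adaptive.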

Abstract: Knowledge Hypergraphs (KHs) have recently emerged as a knowledge representation for retrieval-augmented generation (RAG), offering a paradigm to model multi-entity relations into a structured form. However, existing KH-based RAG methods suffer from three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, which constrain their ability to perform effective multi-hop question answering. To overcome these limitations, we propose PRoH, a dynamic Planning and Reasoning over Knowledge Hypergraphs framework. PRoH incorporates three core innovations: (i) a context-aware planning module that sketches the local KH neighborhood to guide structurally grounded reasoning plan generation; (ii) a structured question decomposition process that organizes subquestions as a dynamically evolving Directed Acyclic Graph (DAG) to enable adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining strong robustness in long-range multi-hop reasoning tasks.
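
The DAG-based question decomposition described in this entry can be illustrated with a minimal Python sketch. It is not PRoH's implementation: the subquestion names and the `answer_subquestion` placeholder are hypothetical, and the graph here is fixed up front, whereas the paper describes a DAG that evolves during reasoning. The sketch shows only how subquestions arranged in a DAG can be resolved in dependency order, each one seeing the answers it depends on.

```python
from graphlib import TopologicalSorter


def answer_subquestion(question, context):
    # Hypothetical stand-in for retrieval plus LLM reasoning:
    # it only records which earlier answers were available.
    return f"{question} <- {sorted(context)}"


def solve(dag):
    """Resolve subquestions in dependency order.

    `dag` maps each subquestion to the set of subquestions it depends on.
    Returns the resolution order and the answer recorded for each node.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    answers = {}
    order = []
    while sorter.is_active():
        for node in sorter.get_ready():
            # Every dependency is already answered when a node becomes ready.
            context = {dep: answers[dep] for dep in dag.get(node, ())}
            answers[node] = answer_subquestion(node, context)
            order.append(node)
            sorter.done(node)
    return order, answers
```

Nodes returned together by `get_ready()` have no dependencies on one another, so in a fuller system they could be explored in parallel, which is one way to read the abstract's "multi-trajectory exploration".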

[18] Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation

Linfeng Gao,Baolong Bi,Zheng Yuan,Le Wang,Zerui Chen,Zhimin Wei,Shenghua Liu,Qinggang Zhang,Jinsong Su

Main category: cs.CL

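TL;DR: This paper addresses faithfulness in retrieval-augmented generation (RAG) with CLEAR (Conflict-Localized and Enhanced Attention for RAG), a probing-based framework that analyzes LLM hidden-state signals to identify and resolve knowledge conflicts, substantially improving accuracy and contextual faithfulness.

Details Motivation: Existing RAG systems are often unfaithful under knowledge conflicts, and current remedies rely on external interventions such as prompt engineering or reward-based fine-tuning. These overlook the key question of how the LLM internally integrates retrieved evidence with its parametric memory.

Contribution: 1. A probing-based analysis of LLM hidden states showing that knowledge integration is hierarchical and that conflicts appear as latent sentence-level signals; 2. The CLEAR framework, which localizes conflicting knowledge and introduces conflict-aware fine-tuning; 3. Substantial gains on multiple benchmarks.

Method: 1. Decompose the context into sentence-level knowledge; 2. Localize conflicts with hidden-state probing; 3. Guide evidence integration with conflict-aware fine-tuning.

Result: CLEAR outperforms strong baselines on three benchmarks under diverse conflict conditions, improving both accuracy and contextual faithfulness.

Insight: Knowledge conflicts can be analyzed explicitly through LLM hidden-state signals, and a conflict-aware framework can make RAG systems more reliable.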

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model’s response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at https://github.com/LinfengGao/CLEAR.

[19] SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

Biao Zhang,Lixin Chen,Tong Liu,Bo Zheng

Main category: cs.CL

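TL;DR: The paper proposes the SMEC framework, whose SMRL, ADS, and S-XBM components substantially reduce embedding dimensionality while maintaining performance, easing the compute and storage costs of high-dimensional embeddings.

Details Motivation: High-dimensional embeddings capture rich semantic information but increase computational complexity and storage requirements, which hinders practical deployment.

Contribution: The SMEC framework, comprising the SMRL method, the ADS module, and the S-XBM module, which substantially reduces embedding dimensionality without sacrificing performance.

Method: SMRL mitigates gradient variance during training; the ADS module reduces information degradation during dimension pruning; the S-XBM module enhances unsupervised learning between high- and low-dimensional embeddings.

Result: On BEIR, SMEC improves compressed 256-dimensional LLM2Vec embeddings by 1.1 points over Matryoshka-Adaptor and by 2.7 points over Search-Adaptor.

Insight: Through its modular design, SMEC balances performance and efficiency in embedding compression, offering a practical solution for retrieval tasks.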

Abstract: Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
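
For readers unfamiliar with Matryoshka-style embeddings, the sketch below shows the basic operation that such compression builds on: keep a prefix of the embedding and renormalize it before computing similarity. This is the generic idea only, not SMEC itself; SMRL, ADS, and S-XBM are not modeled, and the function names are hypothetical.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional vectors; real retrieval embeddings are far larger.
query = [0.6, 0.8, 0.0, 0.0]
document = [0.6, 0.8, 0.5, 0.5]

full = cosine(truncate_and_normalize(query, 4), truncate_and_normalize(document, 4))
short = cosine(truncate_and_normalize(query, 2), truncate_and_normalize(document, 2))
```

In this toy case the two vectors agree on their first two coordinates, so the truncated similarity is higher than the full one. In general, truncation changes similarity scores, which is why training methods of this kind aim to make short prefixes informative on their own.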

[20] VISaGE: Understanding Visual Generics and Exceptions

Stella Frank,Emily Allaway

Main category: cs.CL

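TL;DR: The paper studies how vision-language models (VLMs) trade off a semantic prior against a pragmatic prior when processing typical and atypical images, and introduces a new evaluation dataset, VISaGE.

Details Motivation: VLMs learn generalized knowledge during training, but on atypical instances the model's semantic prior and pragmatic prior come into conflict. The paper examines how models trade off the two.

Contribution: A new dataset, VISaGE, containing typical and exceptional images, for evaluating VLMs on atypical instances.

Method: Carefully balanced experiments on VISaGE analyze VLM behavior when the semantic and pragmatic priors conflict.

Result: Conceptual understanding degrades when images are incongruent with the text input, and this pragmatic-prior effect is stronger than the effect of the semantic prior.

Insight: On atypical instances VLMs rely more on the pragmatic prior than on the semantic prior, which may limit their generalization.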

Abstract: While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.

[21] Teaching Language Models to Faithfully Express their Uncertainty

Bryan Eikema,Evgenia Ilia,José G. C. de Souza,Chrysoula Zerva,Wilker Aziz

Main category: cs.CL

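TL;DR: The paper proposes Faithful Uncertainty Tuning (FUT), a fine-tuning method that teaches language models to express their uncertainty faithfully without changing their answer distribution.

Details Motivation: Large language models (LLMs) often express uncertainty unfaithfully: repeated queries can yield divergent answers, yet responses are typically unhedged or hedged in ways that do not reflect this variability, so they convey unfaithful information.

Contribution: FUT, a fine-tuning method that teaches LLMs to express uncertainty faithfully while preserving the underlying answer distribution, with no additional supervision.

Method: FUT builds training data by augmenting model samples with uncertainty hedges (such as ‘possibly’ or ‘likely’) aligned with sample consistency.

Result: FUT substantially reduces the faithfulness gap while preserving QA accuracy and introducing minimal semantic distribution shift. It is robust across decoding strategies, choices of hedges, and other forms of uncertainty expression.

Insight: FUT is a simple, effective way to help LLMs communicate uncertainty, improving their trustworthiness and usefulness.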

Abstract: Large language models (LLMs) often miscommunicate their uncertainty: repeated queries can produce divergent answers, yet generated responses are typically unhedged or hedged in ways that do not reflect this variability. This conveys unfaithful information about the uncertain state of the LLMs’ knowledge, creating a faithfulness gap that affects even strong LLMs. We introduce Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches instruction-tuned LLMs to express uncertainty faithfully without altering their underlying answer distribution. We construct training data by augmenting model samples with uncertainty hedges (i.e. verbal cues such as ‘possibly’ or ‘likely’) aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. We evaluate FUT on open-domain question answering (QA) across multiple models and datasets. Our results show that FUT substantially reduces the faithfulness gap, while preserving QA accuracy and introducing minimal semantic distribution shift. Further analyses demonstrate robustness across decoding strategies, choice of hedgers, and other forms of uncertainty expression (i.e. numerical). These findings establish FUT as a simple and effective way to teach LLMs to communicate uncertainty faithfully.
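
The data-construction step, attaching a hedge that matches how consistently the model gives an answer, can be sketched as follows. This is an illustration under stated assumptions rather than the FUT pipeline: the thresholds, the hedge phrases, and the use of exact string match as the consistency measure are all hypothetical choices.

```python
from collections import Counter

# Hypothetical (threshold, hedge) pairs, checked from most to least confident.
HEDGES = [
    (0.9, ""),
    (0.6, "likely"),
    (0.3, "possibly"),
    (0.0, "it is unclear, but perhaps"),
]


def consistency(samples, answer):
    """Fraction of sampled answers that agree with `answer` (exact match)."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples)[answer] / len(samples)


def hedge(answer, samples):
    """Prefix `answer` with the hedge matching its agreement rate."""
    score = consistency(samples, answer)
    for threshold, phrase in HEDGES:
        if score >= threshold:
            return f"{phrase} {answer}".strip()
    return answer
```

For example, with ten samples of which eight say "Paris", `hedge("Paris", samples)` yields "likely Paris". A fuller version would compare answers semantically rather than by string equality.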

[22] Reasoning Pattern Matters: Learning to Reason without Human Rationales

Chaoxu Pang,Yixuan Cao,Ping Luo

Main category: cs.CL

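TL;DR: The paper reduces reliance on human-annotated rationales by identifying patterned reasoning tasks, whose reasoning follows a fixed pattern, and using LLMs to generate rationales aligned with those patterns. The proposed PARO framework lowers annotation cost while matching the performance of human annotation.

Details Motivation: Annotating high-quality rationales is expensive, which limits the SFT+RLVR paradigm for LLMs. The paper observes that some problems follow fixed reasoning patterns, making it feasible to cut annotation volume.

Contribution: 1. Identifies a class of problems with fixed reasoning patterns; 2. Proposes PARO, which uses LLMs to generate pattern-aligned rationales automatically; 3. Shows experimentally that PARO-generated rationales match human annotations at lower cost.

Method: 1. Analyze the key role of reasoning patterns in task performance; 2. Design PARO to generate pattern-aligned rationales with LLMs; 3. Validate performance with SFT+RLVR.

Result: PARO-generated rationales achieve SFT+RLVR performance comparable to a human-annotated rationale set 10 times larger, supporting the feasibility of automatic generation.

Insight: Patterned reasoning tasks are the key to reducing human annotation; LLM-generated rationales can replace large-scale human annotation with only limited supervision over reasoning patterns.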

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.

[23] Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

Sunny Yu,Ahmad Jabbar,Robert Hawkins,Dan Jurafsky,Myra Cheng

Main category: cs.CL

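TL;DR: The paper introduces generation space size (GSS), a notion that unifies two failure modes of LLMs in open-ended generation, and presents the GSSBench task suite along with three applications of GSS.

Details Motivation: Current LLMs are miscalibrated on open-ended generation: outputs are overly homogeneous on creative tasks, and diverse but incorrect on factual tasks. The paper unifies these two failure modes under GSS.

Contribution: 1. Proposes GSS to unify the two LLM failure modes; 2. Designs the GSSBench task suite to assess model behavior; 3. Presents three practical applications of GSS.

Method: 1. Use GSSBench to assess different metrics; 2. Find that hallucination detection metrics such as EigenScore outperform standard diversity metrics; 3. Demonstrate GSS for detecting prompt ambiguity, interpreting reasoning-model behavior, and steering model outputs.

Result: Hallucination detection metrics such as EigenScore perform best, and GSS can be used to improve open-ended generation.

Insight: 1. GSS provides a unified framework for understanding and calibrating LLMs' open-ended generation; 2. A model's internal task representations can be interpreted through GSS; 3. Model behavior can be steered and improved through GSS.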

Abstract: Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) – the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model’s internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.

[24] Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Ziyang Ma,Ruiyang Xu,Zhenghao Xing,Yunfei Chu,Yuxuan Wang,Jinzheng He,Jin Xu,Pheng-Ann Heng,Kai Yu,Junyang Lin,Eng Siong Chng,Xie Chen

Main category: cs.CL

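TL;DR: Omni-Captioner presents a systematic approach (a data generation pipeline, models, and a benchmark) for strengthening fine-grained multimodal perception. Agentic data generation with tool-calling mitigates the co-growth of detail and hallucination, and the models achieve SOTA results on several tasks.

Details Motivation: Existing Omni Language Models (OLMs) exhibit co-growth of detail and hallucination in fine-grained multimodal perception, and the field lacks high-quality data and a dedicated benchmark.

Contribution: 1. Omni-Detective, a data generation pipeline that uses tool-calling to produce highly detailed, minimally hallucinatory data; 2. Audio-Captioner and Omni-Captioner, trained for audio-only and audio-visual detailed perception respectively; 3. Omni-Cloze, a benchmark that fills the gap in evaluating fine-grained perception.

Method: 1. Omni-Detective generates data autonomously via tool-calling; 2. Dedicated captioning models are trained on the generated data; 3. Omni-Cloze, a cloze-style benchmark, is designed for evaluation.

Result: Audio-Captioner achieves the best performance among open-source models on MMAU and MMAR, surpassing Gemini 2.5 Flash and performing comparably to Gemini 2.5 Pro; Omni-Captioner sets a new state of the art on VDC and achieves the best detail-hallucination trade-off on the video-SALMONN 2 test set.

Insight: Agentic, tool-calling data generation can mitigate hallucination in multimodal tasks, and a dedicated benchmark is essential for reliable evaluation.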

Abstract: Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent “co-growth” between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.

[25] Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages

Nadine El-Naggar,Tatsuki Kuribayashi,Ted Briscoe

Main category: cs.CL

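TL;DR: The paper extends the formalization of artificial languages with Generalized Categorial Grammar (GCG) to study how language models (LMs) generalize to unseen longer sentences, and finds that typologically plausible word orders are easier for LMs to generalize.

Details Motivation: To test whether LMs have inductive biases that favor typologically frequent grammatical properties, using artificial languages that mimic structural features of natural language.

Contribution: 1. Uses GCG to extend the expressive power of the artificial-language formalization; 2. Focuses on LMs' generalization to longer sentences and finds that plausible word orders generalize more easily.

Method: 1. Design artificial languages with GCG, covering unbounded dependencies and mildly context-sensitive structures; 2. Evaluate LMs' generalization to longer sentences experimentally.

Result: Typologically plausible word orders (structures common in natural languages) tend to be easier for LMs to generalize to unseen longer sentences.

Insight: LMs' inductive biases tend to align with typological features of natural language; plausible word orders support better length generalization.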

Abstract: Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked constructions, such as unbounded dependency and mildly context-sensitive structures. Second, our evaluation focuses more on the generalization ability of LMs to process unseen longer test sentences. Thus, our ALs better capture features of natural languages and our experimental paradigm leads to clearer conclusions – typologically plausible word orders tend to be easier for LMs to productively generalize.

[26] Dr.LLM: Dynamic Layer Routing in LLMs

Ahmed Heakl,Martin Gubri,Salman Khan,Sangdoo Yun,Seong Joon Oh

Main category: cs.CL

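TL;DR: Dr.LLM adds lightweight per-layer routers to a pretrained LLM that dynamically decide whether to skip, execute, or repeat a layer, improving compute efficiency while preserving accuracy across tasks of varying difficulty.

Details Motivation: LLMs pass every token through all transformer layers, wasting computation on simple queries and under-serving harder ones that need deeper reasoning. Existing methods typically require costly inference-time search or large-scale retraining and can sacrifice accuracy.

Contribution: The Dr.LLM framework, which retrofits pretrained models with lightweight routers that adjust layer execution dynamically; MCTS is used to derive high-quality layer configurations, and windowed pooling, focal loss, and bottleneck MLP routers improve robustness.

Method: Routers are trained under supervision from MCTS-derived configurations; windowed pooling stabilizes routing, and focal loss with class balancing and bottleneck MLP routers handle class imbalance and long sequences.

Result: On ARC (logic) and DART (math), Dr.LLM saves 5 layers per example on average while improving accuracy by up to 3.4 percentage points; on out-of-domain tasks accuracy drops by only 0.85%.

Insight: Explicitly supervised routers can improve compute efficiency and accuracy without altering the pretrained weights, and apply across diverse tasks.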

Abstract: Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
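
The abstract names focal loss with class balancing as the router training objective. The sketch below shows the standard multi-class focal loss for a single example over the three routing actions. It is a generic illustration, not the authors' code: the class weights, the default `gamma=2.0` (a common choice, not a value taken from the paper), and the action ordering are assumptions.

```python
import math

ACTIONS = ("skip", "execute", "repeat")  # hypothetical ordering of router classes


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one example: -w[t] * (1 - p_t)**gamma * log(p_t).

    `target` indexes ACTIONS; `class_weights` rebalances rare classes.
    With gamma=0 and unit weights this reduces to cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t)**gamma` factor shrinks the loss on examples the router already classifies confidently, so training concentrates on hard or rare decisions. That matters if one action dominates the supervision labels, which is the class-imbalance problem the abstract mentions.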

[27] Cost Analysis of Human-corrected Transcription for Predominately Oral Languages

Yacouba Diarra,Nouhoum Souleymane Coulibaly,Michael Leventhal

Main category: cs.CL

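TL;DR: The paper studies the time and labor cost of creating high-quality speech datasets for Bambara, a low-resource, predominantly oral language. It finds that correcting the transcription of one hour of speech takes on average 30 hours of human labor under laboratory conditions and 36 hours under field conditions.

Details Motivation: Building speech datasets for low-resource languages is a key challenge for NLP research, and the human labor cost in particular has not been studied systematically.

Contribution: A baseline for the time cost of annotating Bambara speech data, with practical guidance for dataset creation in comparable languages.

Method: A one-month field study with ten native-proficiency transcribers analyzes the correction of ASR-generated transcriptions of 53 hours of speech.

Result: Correcting one hour of speech takes on average 30 hours of human labor under laboratory conditions and 36 hours under field conditions.

Insight: Dataset creation for low-resource languages carries a substantial labor cost; the reported figures give a reference point for planning and budgeting similar projects.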

Abstract: Creating speech datasets for low-resource languages is a critical yet poorly understood challenge, particularly regarding the actual cost in human labor. This paper investigates the time and complexity required to produce high-quality annotated speech data for a subset of low-resource languages, low literacy Predominately Oral Languages, focusing on Bambara, a Manding language of Mali. Through a one-month field study involving ten transcribers with native proficiency, we analyze the correction of ASR-generated transcriptions of 53 hours of Bambara voice data. We report that it takes, on average, 30 hours of human labor to accurately transcribe one hour of speech data under laboratory conditions and 36 hours under field conditions. The study provides a baseline and practical insights for a large class of languages with comparable profiles undertaking the creation of NLP resources.
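
To make the reported ratios concrete, the short sketch below turns them into a budgeting calculation. The two ratios come from the abstract; the hourly wage is a hypothetical input, and applying a single ratio to the full 53 hours is an assumption made for illustration, since the abstract does not say how the audio was split between conditions.

```python
# Average human labor hours per hour of audio, as reported in the abstract.
HOURS_PER_AUDIO_HOUR = {"laboratory": 30, "field": 36}


def labor_hours(audio_hours, condition):
    """Estimated correction labor for `audio_hours` of speech."""
    return audio_hours * HOURS_PER_AUDIO_HOUR[condition]


def labor_cost(audio_hours, condition, hourly_wage):
    """Estimated cost; `hourly_wage` is a hypothetical planning input."""
    return labor_hours(audio_hours, condition) * hourly_wage


# If all 53 hours were handled under one condition (an assumption):
lab_total = labor_hours(53, "laboratory")
field_total = labor_hours(53, "field")
```

Under that assumption the corpus would need between 1,590 and 1,908 labor hours, which, if shared equally among ten transcribers, is roughly 159 to 191 hours each.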

cs.CV

[28] Data or Language Supervision: What Makes CLIP Better than DINO?

Yiming Liu,Yuhui Zhang,Dhruba Ghosh,Ludwig Schmidt,Serena Yeung-Levy

Main category: cs.CV

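TL;DR: CLIP outperforms DINO as a vision encoder for vision-language models (VLMs), but it is unclear whether this comes from language supervision or from larger training data. Under controlled settings, CLIP captures high-level semantics while DINO responds more to low-level features; CLIP does better on text-intensive tasks and DINO is slightly better on vision-centric ones.

Details Motivation: To explain why CLIP outperforms DINO by separating the effect of language supervision from that of a larger dataset.

Contribution: Controlled experiments that isolate language supervision from data scale, with an analysis of how the two models differ in the features they capture.

Method: Pre-train CLIP and DINO with the same architecture, dataset, and training configuration, then compare their embeddings and their performance when integrated into VLMs.

Result: CLIP excels at text-intensive tasks and DINO is slightly better on vision-centric ones; variants of language supervision yield limited gains.

Insight: Language supervision is tied to CLIP's capture of high-level semantics; VLM designers should choose the encoder to suit the task.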

Abstract: CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP’s language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings – using the same architecture, dataset, and training configuration – achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

[29] MammoDINO: Anatomically Aware Self-Supervision for Mammographic Images

Sicheng Zhou,Lei Wu,Cao Xiao,Parminder Bhatia,Taha Kass-Hout

Main category: cs.CV

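TL;DR: MammoDINO is a self-supervised learning framework for mammographic images. With a novel data augmentation sampler and cross-slice contrastive learning, it achieves state-of-the-art performance on breast cancer screening tasks.

Details Motivation: Self-supervised learning has succeeded in general domains but remains underused in medical imaging because of limited data and domain-specific biases. The paper addresses this gap, focusing on breast cancer screening.

Contribution: A novel self-supervised learning framework with a breast-tissue-aware data augmentation sampler and a cross-slice contrastive learning objective that brings 3D DBT structure into 2D pretraining.

Method: 1) A breast-tissue-aware data augmentation sampler; 2) a cross-slice contrastive learning objective; 3) pretraining on 1.4 million mammographic images.

Result: State-of-the-art performance on multiple breast cancer screening tasks, with good generalization across five benchmark datasets.

Insight: MammoDINO offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis tools, helping reduce radiologists' workload and improve diagnostic efficiency.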

Abstract: Self-supervised learning (SSL) has transformed vision encoder training in general domains but remains underutilized in medical imaging due to limited data and domain-specific biases. We present MammoDINO, a novel SSL framework for mammography, pretrained on 1.4 million mammographic images. To capture clinically meaningful features, we introduce a breast-tissue-aware data augmentation sampler for both image-level and patch-level supervision and a cross-slice contrastive learning objective that brings 3D digital breast tomosynthesis (DBT) structure into 2D pretraining. MammoDINO achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. It offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis (CAD) tools for mammography, helping reduce radiologists’ workload and improve diagnostic efficiency in breast cancer screening.

[30] Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis

Blessing Agyei Kyem,Neema Jakisa Owor,Andrews Danyo,Joshua Kofi Asamoah,Eugene Denteh,Tanner Muturi,Anthony Dontoh,Yaw Adu-Gyamfi,Armstrong Aboah

Main category: cs.CV

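TL;DR: The paper proposes a dual-model framework (VideoLLaMA and Qwen2.5-VL) that handles video captioning and visual question answering separately through task-specific optimization, reducing task interference and improving each task. The method performs well on the WTS dataset and places 10th in AI City Challenge Track 2.

Details Motivation: Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. Existing approaches suffer from task interference when handling multiple tasks such as video captioning and visual question answering.

Contribution: 1. A task-specific dual-model framework (VideoLLaMA and Qwen2.5-VL) optimized separately for video captioning and visual question answering (VQA). 2. A separate training strategy that avoids task interference and outperforms joint training by 8.6% in VQA accuracy.

Method: 1. VideoLLaMA handles video captioning, with a focus on temporal reasoning; 2. Qwen2.5-VL handles VQA, with a focus on visual understanding; 3. The two models are trained separately.

Result: 1. VideoLLaMA is effective at temporal reasoning (CIDEr score of 1.1001); 2. Qwen2.5-VL excels at visual understanding (VQA accuracy of 60.80%); 3. An S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th.

Insight: Separate training effectively reduces interference in multi-task learning and lets each model specialize. The approach may extend to other settings that combine multiple tasks.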

Abstract: Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.

[31] PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation

Hatem Ibrahem,Ahmed Salem,Qinmin Vivian Hu,Guanghui Wang

Main category: cs.CV

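TL;DR: PanoTPS-Net estimates the 3D layout of a room from a single panorama image via a Thin Plate Spline (TPS) transformation. It combines a CNN with a TPS transformation layer and handles both cuboid and non-cuboid layouts.

Details Motivation: Existing methods perform poorly on non-cuboid room layouts; a more general method is needed to estimate the 3D layout of complex room shapes accurately.

Contribution: 1. A two-stage model combining a CNN with a TPS transformation layer; 2. Support for both cuboid and non-cuboid room layouts; 3. Strong results across multiple datasets in comparisons with existing methods.

Method: A CNN extracts features and predicts the TPS parameters; the TPS layer then warps a reference layout into the target layout.

Result: 3DIoU of 85.49 on PanoContext, 86.16 on Stanford-2D3D, 81.76 on Matterport3DLayout, and 91.98 on ZInD.

Insight: The TPS transformation is well suited to panorama images, and the two-stage design helps the model generalize.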

Abstract: Accurately estimating the 3D layout of rooms is a crucial task in computer vision, with potential applications in robotics, augmented reality, and interior design. This paper proposes a novel model, PanoTPS-Net, to estimate room layout from a single panorama image. Leveraging a Convolutional Neural Network (CNN) and incorporating a Thin Plate Spline (TPS) spatial transformation, the architecture of PanoTPS-Net is divided into two stages: First, a convolutional neural network extracts the high-level features from the input images, allowing the network to learn the spatial parameters of the TPS transformation. Second, the TPS spatial transformation layer is generated to warp a reference layout to the required layout based on the predicted parameters. This unique combination empowers the model to properly predict room layouts while also generalizing effectively to both cuboid and non-cuboid layouts. Extensive experiments on publicly available datasets and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. The results underscore the model’s accuracy in room layout estimation and emphasize the compatibility between the TPS transformation and panorama images. The robustness of the model in handling both cuboid and non-cuboid room layout estimation is evident with a 3DIoU value of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets, respectively. The source code is available at: https://github.com/HatemHosam/PanoTPS_Net.

[32] Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

Tanner Muturi,Blessing Agyei Kyem,Joshua Kofi Asamoah,Neema Jakisa Owor,Richard Dyzinela,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah

Main category: cs.CV

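TL;DR: The paper proposes a prompt-guided spatial reasoning framework based on RGB-D transformers for fine-grained object relation reasoning in large-scale 3D environments. Embedding bounding box coordinates directly in the input prompt strengthens the model's spatial understanding.

Details Motivation: Vision-language systems struggle in large-scale 3D environments such as warehouses because of scene clutter, occlusions, and the need for precise spatial understanding. Existing models rely on local appearance and lack explicit spatial grounding, so they generalize poorly.

Contribution: 1) A spatial reasoning framework that embeds bounding box coordinates in the input prompt; 2) fine-tuning on four question categories from the AI City Challenge 2025 Track 3 dataset, with strong results.

Method: 1) Embed bounding box coordinates in the input prompt; 2) fine-tune the framework with task-specific supervision; 3) append normalized answers to the GPT responses in the training set to improve consistency with the evaluation system.

Result: A final score of 73.0606, placing 4th on the public leaderboard, which demonstrates the effectiveness of structured prompt enrichment and targeted optimization.

Insight: Explicit spatial information such as bounding box coordinates can markedly improve spatial reasoning in complex 3D environments.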

Abstract: Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in Track 3 of the 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories, namely Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference, using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
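
The core idea, writing each object's bounding box into the text prompt, can be sketched in a few lines. The prompt wording, the coordinate convention, and the function names below are hypothetical; the paper's actual prompt template is not given in the abstract.

```python
def format_box(label, box):
    """Render one object as 'label: [x_min, y_min, x_max, y_max]'."""
    x_min, y_min, x_max, y_max = box
    if x_min > x_max or y_min > y_max:
        raise ValueError(f"malformed box for {label}: {box}")
    return f"{label}: [{x_min}, {y_min}, {x_max}, {y_max}]"


def build_prompt(question, objects):
    """Prepend one bounding-box line per object to the question.

    `objects` maps a label to an (x_min, y_min, x_max, y_max) tuple.
    """
    lines = ["Objects (pixel coordinates as [x_min, y_min, x_max, y_max]):"]
    lines += [format_box(label, box) for label, box in objects.items()]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

A call such as `build_prompt("Which pallet is closer to the forklift?", {"pallet_1": (10, 20, 110, 220), "forklift": (300, 40, 520, 400)})` produces a prompt in which object geometry is stated explicitly, so the model does not have to infer it from appearance alone.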

[33] Evaluating the Explainability of Vision Transformers in Medical Imaging

Leili Barekatain,Ben Glocker

Main category: cs.CV

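TL;DR: The paper studies the explainability of Vision Transformers (ViTs) in medical imaging, evaluating several architectures and pre-training strategies (ViT, DeiT, DINO, Swin Transformer), and finds that DINO combined with Grad-CAM gives the most faithful and localized explanations.

Details Motivation: In medical imaging, interpretability directly affects clinical trust and adoption, and the complex attention mechanisms of ViTs make them hard to explain. The paper evaluates ViT explainability.

Contribution: 1. Evaluates explainability across ViT architectures and pre-training strategies; 2. Finds DINO with Grad-CAM to be the best combination; 3. Provides quantitative and qualitative analyses; 4. Shows that explanations highlight clinically relevant features even in misclassified cases.

Method: Apply Gradient Attention Rollout and Grad-CAM to ViT, DeiT, DINO, and Swin Transformer on two tasks: peripheral blood cell classification and breast ultrasound image classification.

Result: Grad-CAM heatmaps are more class-discriminative and spatially precise, while Gradient Attention Rollout activations are more scattered; DINO with Grad-CAM performs best and highlights clinically relevant features even in misclassifications.

Insight: ViT explainability depends closely on architecture and pre-training strategy; DINO appears especially well suited to explanation in medical imaging.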

Abstract: Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
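
As background on the Grad-CAM method evaluated here, the sketch below shows its final combination step in plain Python: each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, and negative values are clipped. It assumes the activations and gradients have already been extracted from a model as K maps of size H x W (for a ViT, patch tokens reshaped to a grid); that extraction step and the toy numbers are not from the paper.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one H x W heatmap.

    weight_k = spatial mean of d(class score) / d(activation_k)
    heatmap  = ReLU(sum over k of weight_k * activation_k)
    """
    height, width = len(activations[0]), len(activations[0][0])
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * feature_map[i][j]
    # ReLU keeps only regions that push the class score up.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Toy input: two 2 x 2 feature maps with constant gradients.
toy_activations = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

The ReLU is what makes the map class-discriminative: locations whose features lower the class score are zeroed rather than shown as negative evidence.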

[34] VIDMP3: Video Editing by Representing Motion with Pose and Position Priors

Sandeep Mishra,Oindrila Saha,Alan C. Bovik

Main category: cs.CV

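TL;DR: VIDMP3 proposes a generalized motion representation based on pose and position priors for video editing in which structure may change but motion must be preserved, improving temporal consistency and identity stability.

Details Motivation: Current diffusion models preserve structure well in video editing, but on structure-variable tasks they often suffer from temporal inconsistency and identity drift, and they lack flexibility and automation.

Contribution: The core contribution of VIDMP3 is learning a generalized motion representation from pose and position priors, which allows structural changes while preserving the original motion, without human intervention.

Method: The method learns motion patterns from the pose and position information in the source video, then generates new videos adapted to the target structure and semantics while keeping the motion coherent.

Result: Experiments show that VIDMP3 outperforms existing methods in both qualitative and quantitative evaluations and efficiently generates temporally consistent, identity-stable videos.

Insight: A general motion representation is key to video editing, and incorporating priors can markedly improve flexibility and generation quality.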

Abstract: Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at https://github.com/sandeep-sm/VidMP3.

[35] A Review on Domain Adaption and Generative Adversarial Networks(GANs)

Aashish Dhawan,Divyanshu Mudgal

Main category: cs.CV

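TL;DR: This review discusses the scarcity of labeled data in computer vision and surveys Domain Adaptation and Generative Adversarial Networks (GANs) as ways to address it.

Details Motivation: High-quality labeled data is costly and difficult to obtain, so reliable methods for coping with data scarcity are needed. Domain adaptation and GANs are two potentially effective solutions.

Contribution: The paper systematically reviews the concept of domain adaptation and the methods for implementing it, and discusses the potential of GANs for domain adaptation.

Method: The paper discusses several domain adaptation approaches, including feature-based mapping and adversarial training (using GANs), and how knowledge from one domain can be transferred to another.

Result: The review concludes that domain adaptation and GANs can reduce reliance on labeled data and enable knowledge transfer across domains.

Insight: Combining domain adaptation with GANs offers a research direction for the labeled-data scarcity problem, with potential advantages in cross-domain tasks.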

Abstract: The major challenge in computer vision today is the availability of good-quality labeled data. In a field such as image classification, where data is of utmost importance, we need more reliable methods that can overcome data scarcity and still produce results comparable to previous benchmarks. In most cases, obtaining labeled data is very difficult because of the high cost of human labor, and in some cases it is impossible. The purpose of this paper is to discuss Domain Adaptation and various methods to implement it. The main idea is to use a model trained on a particular dataset to predict on data from a different domain of the same kind, for example, a model trained on paintings of airplanes predicting on real images of airplanes.

[36] Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback

Xingpei Ma,Shenneng Huang,Jiaran Cai,Yuansheng Guan,Shen Zheng,Hanfeng Zhao,Qiang Zhang,Shunsi Zhang

Main category: cs.CV

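TL;DR: This paper proposes a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length and introduces a training-free method for multi-character audio-driven animation. A LoRA training strategy combined with position shift inference enables efficient long video generation, reward feedback improves lip sync and natural motion, and the proposed Mask-CFG supports multi-character animation without extra data or model modifications.

Details Motivation: Existing audio-driven video generation methods struggle with lip-sync accuracy, temporal coherence in long videos, and multi-character animation, which calls for a more efficient solution that avoids complex training.

Contribution: 1. A DiT-based framework for high-quality long video generation; 2. LoRA combined with reward feedback to improve lip sync and motion naturalness; 3. Mask-CFG, a training-free method for multi-character animation.

Method: 1. A LoRA training strategy combined with position shift inference; 2. Partial parameter updates with a reward feedback mechanism; 3. Mask-CFG for multi-character animation.

Result: Experiments show that the method outperforms existing approaches in quality, temporal consistency, and multi-character support, while remaining simple and efficient.

Insight: 1. Diffusion transformers perform well in video generation; 2. Reward feedback combined with partial parameter updates can effectively improve generation quality; 3. Mask-CFG offers a general, training-free solution for multi-character animation.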

Abstract: Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.
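
The abstract does not spell out how Mask-CFG is computed, so the sketch below is only one plausible reading: standard classifier-free guidance, `eps_uncond + scale * (eps_cond - eps_uncond)`, applied separately per character and restricted to that character's spatial mask. The function name `masked_cfg`, the argument layout, and the additive combination over characters are all assumptions made for illustration.

```python
import numpy as np

def masked_cfg(eps_uncond, eps_conds, masks, scale):
    """Combine per-character guided noise predictions (hypothetical sketch).

    eps_uncond: unconditional noise prediction, any array shape.
    eps_conds:  list of predictions, each conditioned on one character's audio.
    masks:      list of arrays broadcastable to eps_uncond, 1 inside the
                character's region and 0 elsewhere.
    scale:      guidance scale, as in ordinary classifier-free guidance.
    """
    guided = eps_uncond.astype(float)  # astype returns a copy, input is untouched
    for eps_cond, mask in zip(eps_conds, masks):
        # The usual CFG update, applied only where this character is located.
        guided += scale * mask * (eps_cond - eps_uncond)
    return guided
```

With a single character and an all-ones mask this reduces to ordinary classifier-free guidance, which is a useful sanity check for any variant of the idea.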

[37] IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation

Wenxu Zhou,Kaixuan Nie,Hang Du,Dong Yin,Wei Huang,Siqiang Guo,Xiaobo Zhang,Pengbo Hu

Main category: cs.CV

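TL;DR: IL3D is a large-scale indoor layout dataset designed for LLM-driven 3D scene generation. It offers diverse, high-quality data with instance-level natural language annotations, and fine-tuning on it significantly improves LLM generalization.

Details Motivation: Indoor layout design lacks diverse, high-quality training data, which limits the use of LLMs for 3D scene generation. IL3D fills this gap and supports multimodal learning and environment perception tasks.

Contribution: 1. The IL3D dataset, with 27,816 indoor layouts and 29,215 3D object assets; 2. Instance-level annotations and multimodal data export; 3. Improved LLM generalization through SFT.

Method: LLMs are trained on IL3D via supervised fine-tuning (SFT), and the dataset's multimodal exports (e.g., point clouds, depth maps, semantic masks) support scene generation and related tasks.

Result: Experiments show that SFT on IL3D significantly improves LLM generalization and outperforms SFT on other datasets.

Insight: High-quality multimodal datasets are key to LLM performance on 3D generation tasks; IL3D provides a solid foundation for future research and practical applications.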

Abstract: In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence, by providing high-fidelity scene data to support environment perception tasks of embodied agents.

[38] An Adaptive Edge-Guided Dual-Network Framework for Fast QR Code Motion Deblurring

Jianping Li,Dongyang Guo,Wenjie Li,Wei Zhao

Main category: cs.CV

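TL;DR: This paper proposes an adaptive edge-guided dual-network framework (ADNet) for QR code motion deblurring. It embeds explicit edge priors through EGAB and integrates EG-Restormer and LENet to achieve deblurring that is both efficient and effective.

Details Motivation: The goal of QR code deblurring is to raise the decoding success rate, not the perceptual quality targeted by general image deblurring. Existing deep learning methods underuse the structured edge priors of QR codes, so the authors design an efficient network that exploits these priors explicitly.

Contribution: 1. The Edge-Guided Attention Block (EGAB), which embeds edge priors into a Transformer architecture; 2. EG-Restormer for severely blurred QR codes and LENet for mildly blurred ones; 3. The Adaptive Dual-network (ADNet), which dynamically selects the suitable network for the blur severity.

Method: 1. EGAB strengthens the attention mechanism with explicit edge guidance; 2. EG-Restormer, built on EGAB, deblurs severely blurred QR codes; 3. LENet is a lightweight network for mild blur; 4. ADNet combines EG-Restormer and LENet and switches between them according to blur severity.

Result: Experiments show that EG-Restormer and ADNet achieve state-of-the-art performance at competitive speed and significantly raise the QR code decoding success rate.

Insight: Explicitly exploiting the structured edge priors of QR codes markedly improves deblurring, and dynamic network selection balances performance and efficiency, which suits resource-constrained mobile devices.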

Abstract: Unlike general image deblurring that prioritizes perceptual quality, QR code deblurring focuses on ensuring successful decoding. QR codes are characterized by highly structured patterns with sharp edges, a robust prior for restoration. Yet existing deep learning methods rarely exploit these priors explicitly. To address this gap, we propose the Edge-Guided Attention Block (EGAB), which embeds explicit edge priors into a Transformer architecture. Based on EGAB, we develop Edge-Guided Restormer (EG-Restormer), an effective network that significantly boosts the decoding rate of severely blurred QR codes. For mildly blurred inputs, we design the Lightweight and Efficient Network (LENet) for fast deblurring. We further integrate these two networks into an Adaptive Dual-network (ADNet), which dynamically selects the suitable network based on input blur severity, making it ideal for resource-constrained mobile devices. Extensive experiments show that our EG-Restormer and ADNet achieve state-of-the-art performance with a competitive speed. Project page: https://github.com/leejianping/ADNet
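
The abstract says ADNet picks a network based on input blur severity but does not say how severity is measured. The sketch below fills that gap with a common heuristic, the variance of a discrete Laplacian (sharp images score high, blurry ones low). The metric, the threshold, and the function names are assumptions for illustration only; the paper may use a learned classifier or a different criterion.

```python
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Sharpness score for a 2D grayscale image: variance of the
    4-neighbour Laplacian over interior pixels."""
    lap = (-4.0 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return float(lap.var())

def select_network(img, threshold, light_net, heavy_net):
    """Route sharp (mildly blurred) inputs to the lightweight network and
    low-sharpness (severely blurred) inputs to the heavier one."""
    return light_net if laplacian_variance(img) >= threshold else heavy_net
```

In this sketch `light_net` and `heavy_net` stand in for LENet and EG-Restormer; routing before inference means the expensive model runs only when the cheap one is unlikely to suffice.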

[39] G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

Junfeng Ni,Yixin Chen,Zhifei Yang,Yu Liu,Ruijie Lu,Song-Chun Zhu,Siyuan Huang

Main category: cs.CV

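TL;DR: G4Splat proposes a 3D scene reconstruction method that combines geometry guidance with a generative prior. It addresses the lack of geometric supervision and of multi-view consistency in existing methods and shows superior performance on several datasets.

Details Motivation: Existing methods that use generative models for 3D reconstruction lack reliable geometric supervision and struggle with multi-view inconsistency, which degrades reconstruction quality.

Contribution: 1. Uses planar structures to derive accurate depth maps; 2. Introduces geometry guidance into the generative pipeline to improve visibility mask estimation and novel view selection; 3. Improves the accuracy and consistency of scene completion in unobserved regions.

Method: Depth is derived from planar structures and used together with a generative model, while geometry guidance is applied throughout reconstruction to reduce multi-view inconsistency.

Result: On Replica, ScanNet++, and DeepBlending, G4Splat outperforms baselines in both geometry and appearance reconstruction, especially in unobserved regions.

Insight: Accurate geometry is the prerequisite for using generative models to improve 3D reconstruction, and geometry guidance can effectively improve multi-view consistency.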

Abstract: Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.

[40] DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning

Jiawei Zhan,Jun Liu,Jinlong Peng,Xiaochen Chen,Bin-Bin Gao,Yong Liu,Chengjie Wang

Main category: cs.CV

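TL;DR: The paper proposes Discriminative Representation Learning (DRL), which uses an Incremental Parallel Adapter (IPA) network and Decoupled Anchor Supervision (DAS) with a virtual anchor to address model complexity, representation shift, and optimization inconsistency in class-incremental learning (CIL).

Details Motivation: Pre-trained models perform well in non-rehearsal class-incremental learning, but they still face growing model complexity, non-smooth representation shift, and inconsistency between stage-wise optimization and global inference.

Contribution: Proposes the DRL framework, which combines the IPA network with DAS to address these three challenges and significantly improves performance.

Method: 1. Incremental Parallel Adapter (IPA): lightweight adapters inherit and propagate representation capability. 2. Decoupled Anchor Supervision (DAS): separates the constraints on positive and negative samples and aligns the feature spaces.

Result: On six benchmarks, DRL consistently outperforms other state-of-the-art methods while remaining efficient in both training and inference.

Insight: Lightweight adapters and a smooth representation shift are key to efficient class-incremental learning, and decoupled supervision helps align features globally.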

Abstract: With the excellent representation capabilities of Pre-Trained Models (PTMs), remarkable progress has been made in non-rehearsal Class-Incremental Learning (CIL) research. However, it remains an extremely challenging task due to three conundrums: increasingly large model complexity, non-smooth representation shift during incremental learning, and inconsistency between stage-wise sub-problem optimization and global inference. In this work, we propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges. To conduct incremental learning effectively and yet efficiently, the DRL’s network, called the Incremental Parallel Adapter (IPA) network, is built upon a PTM and incrementally augments the model by learning a lightweight adapter, with a small parameter learning overhead, in each incremental stage. The adapter is responsible for adapting the model to new classes; it inherits and propagates the representation capability of the current model through a parallel connection between them controlled by a transfer gate. As a result, this design guarantees a smooth representation shift between different incremental stages. Furthermore, to alleviate inconsistency and enable comparable feature representations across incremental stages, we design the Decoupled Anchor Supervision (DAS). It decouples the constraints on positive and negative samples by comparing each of them with a virtual anchor. This decoupling promotes discriminative representation learning and aligns the feature spaces learned at different stages, thereby narrowing the gap between stage-wise local optimization over a subset of data and global inference across all classes. Extensive experiments on six benchmarks reveal that our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period while maintaining high efficiency in both training and inference phases.

[41] Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration

Wenjie Li,Xiangyi Wang,Heng Guo,Guangwei Gao,Zhanyu Ma

Main category: cs.CV

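TL;DR: The paper proposes SSDiff, a self-supervised selective-guided diffusion model that extracts structure and color information from pseudo-reference faces to restore faces in old photos in stages, significantly improving restoration quality and controllability.

Details Motivation: Face restoration in old photos must cope with compounded degradations such as breakage, fading, and blur, and existing methods handle localized artifacts and color poorly.

Contribution: 1. SSDiff, which uses pseudo-reference faces for staged restoration; 2. The VintageFace benchmark dataset; 3. Better quality and controllability than existing methods.

Method: A pre-trained diffusion model generates pseudo-reference faces that provide structure and color guidance in stages; face parsing maps and scratch masks are used to restore breakage regions selectively.

Result: SSDiff outperforms GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability.

Insight: Staged guidance and selective restoration handle the complex degradations in old photos more effectively.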

Abstract: Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: https://github.com/PRIS-CV/SSDiff.

[42] ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation

Ziyuan Luo,Yangyi Zhao,Ka Chun Cheung,Simon See,Renjie Wan

Main category: cs.CV

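TL;DR: ImageSentinel is a framework that protects visual datasets from unauthorized use in Retrieval-Augmented Image Generation (RAIG). It synthesizes visually consistent sentinel images and uses randomly generated character sequences as retrieval keys for verification.

Details Motivation: As RAIG systems become widespread, preventing unauthorized use of private image datasets has become an important problem, and traditional digital watermarking works poorly in RAIG.

Contribution: Proposes the ImageSentinel framework, which effectively protects visual datasets in RAIG systems without degrading generation quality for authorized applications.

Method: Vision-language models synthesize the sentinel images, and random character sequences serve as retrieval keys for protection verification.

Result: Experiments show that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality.

Insight: Vision-language models can generate sentinel images that are visually consistent with a dataset, offering a new approach to data protection.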

Abstract: The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at https://github.com/luo-ziyuan/ImageSentinel.

[43] Hardware-aware Coding Function Design for Compressive Single-Photon 3D Cameras

David Parra,Felipe Gutierrez-Barragan,Trevor Seets,Andreas Velten

Main category: cs.CV

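TL;DR: The paper proposes a constrained-optimization approach to designing coding functions for compressive single-photon 3D cameras. Gradient descent jointly optimizes the illumination and the coding matrix under hardware limits, and the result outperforms traditional designs under bandwidth and peak power constraints.

Details Motivation: Single-photon cameras can time-tag photons with very high resolution for 3D imaging, but their performance is limited by hardware factors such as bandwidth and peak power. Existing compressive histogram methods reduce data rates but underperform under real illumination constraints.

Contribution: 1. A coding function design method that respects hardware constraints; 2. Joint optimization of the illumination and coding matrix by gradient descent to improve performance; 3. Better results than traditional methods under bandwidth and peak power constraints.

Method: Gradient descent optimizes the illumination and coding matrix so that they operate within hardware limits; performance is validated in simulation and on a real system.

Result: Simulations and real-system experiments show that the method outperforms traditional coding designs under bandwidth and peak power constraints, with the largest gains when peak power is the limiting factor.

Insight: 1. Optimizing the design under hardware constraints is critical to the performance of single-photon 3D cameras; 2. Jointly optimizing the illumination and coding matrix markedly improves system performance; 3. The method applies to real systems with non-ideal impulse responses.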

Abstract: Single-photon cameras are becoming increasingly popular in time-of-flight 3D imaging because they can time-tag individual photons with extreme resolution. However, their performance is susceptible to hardware limitations, such as system bandwidth, maximum laser power, sensor data rates, and in-sensor memory and compute resources. Compressive histograms were recently introduced as a solution to the challenge of data rates through an online in-sensor compression of photon timestamp data. Although compressive histograms work within limited in-sensor memory and computational resources, they underperform when subjected to real-world illumination hardware constraints. To address this, we present a constrained optimization approach for designing practical coding functions for compressive single-photon 3D imaging. Using gradient descent, we jointly optimize an illumination and coding matrix (i.e., the coding functions) that adheres to hardware constraints. We show through extensive simulations that our coding functions consistently outperform traditional coding designs under both bandwidth and peak power constraints. This advantage is particularly pronounced in systems constrained by peak power. Finally, we show that our approach adapts to arbitrary parameterized impulse responses by evaluating it on a real-world system with a non-ideal impulse response function.
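
The online in-sensor compression mentioned above can be pictured as a linear projection of the timestamp histogram: each detected photon adds one column of a K x N coding matrix to a K-dimensional accumulator, so the full N-bin histogram is never stored. The sketch below illustrates that idea under stated assumptions; the function name `compress_online` and the random coding matrix in the usage note are placeholders, not the optimized coding functions from the paper.

```python
import numpy as np

def compress_online(timestamp_bins, coding_matrix):
    """Accumulate a compressive histogram one photon at a time.

    timestamp_bins: iterable of integer time-bin indices in [0, N).
    coding_matrix:  (K, N) array whose rows are coding functions.
    Returns a length-K vector equal to coding_matrix @ histogram.
    """
    acc = np.zeros(coding_matrix.shape[0])
    for t in timestamp_bins:
        # Each photon contributes the column of codes for its time bin.
        acc += coding_matrix[:, t]
    return acc
```

For example, with `C = np.random.default_rng(0).normal(size=(8, 1024))`, a stream of 1024-bin timestamps is reduced to 8 numbers per pixel, which is where the data-rate saving comes from.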

[44] MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

Zhenxin Lei,Zhangwei Gao,Changyao Tian,Erfei Cui,Guanzhou Chen,Danni Yang,Yuchen Duan,Zhaokai Wang,Wenhao Li,Weiyun Wang,Xiangyu Zhao,Jiayi Ji,Yu Qiao,Wenhai Wang,Gen Luo

Main category: cs.CV

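TL;DR: This paper proposes the CapFlow workflow and the MetaCaptioner model. Multi-agent collaboration lifts open-source visual captioning to a level comparable with commercial models at a much lower cost.

Details Motivation: Open-source visual captioning models lag well behind commercial ones, which limits applications such as data synthesis. The authors aim to build a high-quality, low-cost generalist captioning solution from open-source models.

Contribution: 1. CapFlow, a workflow that shows for the first time that open-source models can match a commercial model (GPT-4.1) in caption quality; 2. High-quality data synthesized with CapFlow, used to train the generalist captioner MetaCaptioner; 3. Evidence that MetaCaptioner reaches top-tier multimodal performance in the open-source community.

Method: 1. CapFlow uses multi-agent collaboration to integrate the outputs of several open-source models, improving the diversity and accuracy of the generated captions; 2. CapFlow synthesizes a large-scale, high-quality dataset on which MetaCaptioner is fine-tuned.

Result: MetaCaptioner achieves captioning performance comparable to commercial models across visual domains, and CapFlow reduces cost by 89.5%. MetaCaptioner also reaches top-tier multimodal performance in the open-source community.

Insight: Carefully integrating open-source models with efficient data synthesis can substantially narrow the gap to commercial models and gives multimodal research a cost-effective option.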

Abstract: Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.

[45] FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements

Xiao Yang,Jiyao Wang

Main category: cs.CV

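TL;DR: FedHUG is a federated learning framework that addresses privacy and label scarcity in remote physiological measurement, achieving cross-domain generalization through dynamic aggregation and a learning controller.

Details Motivation: Remote physiological measurement involves privacy-sensitive user data and scarce labels. Existing methods rely on labeled client data, which makes it hard to update models after real-world deployment.

Contribution: Proposes the FUDG protocol and the FedHUG framework, which handle data heterogeneity and label distribution imbalance through dynamic aggregation weights and a global distribution-aware learning controller.

Method: FedHUG consists of a Minimal Bias Aggregation module, which dynamically adjusts aggregation weights, and a Global Distribution-aware Learning Controller, which dynamically adjusts client training strategies to counter label distribution skew.

Result: FedHUG outperforms existing techniques on physiological measurement from either RGB video or mmWave radar.

Insight: Dynamically adjusting aggregation weights and training strategies in federated learning can markedly improve cross-domain generalization while protecting privacy.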

Abstract: Remote physiological measurement has gained wide attention, but it requires collecting users’ privacy-sensitive information, and existing contactless measurement methods still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with large amounts of unlabeled user data. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the Federated Heterogeneous Unsupervised Generalization (FedHUG) framework is proposed, which consists of: (1) a Minimal Bias Aggregation module that dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains; and (2) a Global Distribution-aware Learning Controller that parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and the long-tail issue. The proposal shows superior performance over state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.

[46] Class-aware Domain Knowledge Fusion and Fission for Continual Test-Time Adaptation

Jiahuan Zhou,Chao Zhu,Zhenyu Cui,Zichen Liu,Xu Zou,Gang Hua

Main category: cs.CV

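TL;DR: This paper proposes KFF, a class-aware domain knowledge fusion and fission method for Continual Test-Time Adaptation (CTTA). By dynamically accumulating historical knowledge, it addresses the insufficient learning and harmful historical-knowledge interference seen in existing methods.

Details Motivation: When adapting to multiple unknown downstream domain distributions, existing CTTA methods suffer catastrophic forgetting of historical knowledge caused by irregular domain switching, while insufficient learning of new knowledge and interference from harmful historical knowledge cause severe performance degradation.

Contribution: 1. The class-aware domain Knowledge Fusion and Fission method (KFF); 2. A domain Knowledge FIssion (KFI) module and a domain Knowledge FUsion (KFU) module; 3. A greedy dynamic knowledge merging strategy that reduces computation and storage overhead.

Method: KFF has two core modules. The KFI module separates new domain knowledge and reduces interference from negative knowledge in old domains. The KFU module merges new knowledge into the existing knowledge pool at low cost, and the dynamic merging strategy keeps new and old knowledge compatible while remaining computationally efficient.

Result: Extensive experiments on the ImageNet-C dataset verify the effectiveness of KFF against other existing methods.

Insight: Class-aware dynamic knowledge accumulation combined with an efficient merging strategy significantly improves adaptation and performance in CTTA.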

Abstract: Continual Test-Time Adaptation (CTTA) aims to quickly fine-tune the model during the test phase so that it can adapt to multiple unknown downstream domain distributions without pre-acquiring downstream domain data. Existing advanced CTTA methods mainly reduce the catastrophic forgetting of historical knowledge caused by irregular switching of downstream domain data by restoring the initial model or reusing historical models. However, these methods usually suffer from seriously insufficient learning of new knowledge and from interference by potentially harmful historical knowledge, resulting in severe performance degradation. To address this, we propose a class-aware domain Knowledge Fusion and Fission method for continual test-time adaptation, called KFF, which adaptively expands and merges class-aware domain knowledge in old and new domains according to the test-time data from different domains, so that discriminative historical knowledge can be dynamically accumulated. Specifically, considering the huge domain gap within streaming data, a domain Knowledge FIssion (KFI) module is designed to adaptively separate new domain knowledge from a paired class-aware domain prompt pool, alleviating the impact of negative knowledge brought by old domains that are distinct from the current domain. In addition, to avoid the cumulative computation and storage overheads of continuously fissioning new knowledge, a domain Knowledge FUsion (KFU) module is designed to merge the fissioned new knowledge into the existing knowledge pool with minimal cost, where a greedy knowledge dynamic merging strategy improves the compatibility of new and old knowledge while maintaining computational efficiency. Extensive experiments on the ImageNet-C dataset verify the effectiveness of our proposed method against other methods.

[47] DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation

Ziyuan Gao,Philippe Morel

Main category: cs.CV

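TL;DR: DPL proposes a diffusion-based prototype learning method that improves prototype representations for one-shot medical image segmentation through diffusion enhancement and a spatial conditioning mechanism.

Details Motivation: Traditional prototype methods are brittle under limited annotations and inter-patient anatomical variability and fail to capture intra-class diversity. DPL addresses this by learning a probability distribution from which diverse yet semantically consistent prototype variants are generated.

Contribution: 1. A diffusion-based prototype enhancement module; 2. A spatial-aware conditioning mechanism; 3. A fusion strategy that balances prototype fidelity and diversity.

Method: Prototype variants are generated through a forward-reverse diffusion process, conditioned spatially on geometric feature statistics, and the same prototype enhancement pipeline is used in training and inference.

Result: Experiments on abdominal MRI and CT datasets show that DPL significantly improves one-shot medical segmentation and sets a new state of the art.

Insight: Applying diffusion models to prototype learning captures intra-class diversity effectively and makes the model more robust.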

Abstract: One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture the intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements on both, establishing new state-of-the-art performance in one-shot medical image segmentation.

[48] State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

Jiahuan Zhou,Kai Zhu,Zhenyu Cui,Zichen Liu,Xu Zou,Gang Hua

Main category: cs.CV

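TL;DR: This paper proposes State Space Prompting (SSP) for video understanding. By gathering and spreading key spatio-temporal information in the video, it significantly improves video classification while reducing the overhead of fine-tuned parameters.

Details Motivation: Pre-trained state space models process video efficiently, but their sequentially compressed visual prompt tokens fail to capture spatial and temporal context, which hampers the extraction and propagation of discriminative information.

Contribution: Proposes SSP, which combines intra-frame (IFG) and inter-frame (IFS) prompt modules to adaptively balance and compress key spatio-temporal information and thereby propagate discriminative information efficiently.

Method: An Intra-Frame Gathering (IFG) module aggregates key spatial information within each frame, and an Inter-Frame Spreading (IFS) module spreads discriminative spatio-temporal information across frames.

Result: On four video benchmark datasets, SSP outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuned parameters.

Insight: Handling intra-frame and inter-frame information separately and in a complementary way extracts and propagates discriminative video features more efficiently, which improves model performance.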

Abstract: Recently, pre-trained state space models have shown great potential for video classification. They sequentially compress the visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning has been proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, which limits both the propagation of spatial information within a frame and temporal information between frames in the state compression model, and the extraction of discriminative information. To tackle this issue, we propose a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate key spatial information within each frame. In addition, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.

[49] UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering

Yusen Xie,Zhenmin Huang,Jianhao Jiao,Dimitrios Kanoulas,Jun Ma

Main category: cs.CV

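TL;DR: UniGS is a unified framework for multimodal 3D reconstruction based on 3D Gaussian Splatting. Its CUDA-accelerated rasterization pipeline simultaneously renders high-quality RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits.

Details Motivation: Traditional multimodal 3D reconstruction methods have problems with geometric consistency and computational efficiency. UniGS aims to solve them with a unified geometry-aware representation and optimized Gaussian Splatting.

Contribution: 1. A unified geometry-aware representation; 2. Depth rendering via differentiable ray-ellipsoid intersection; 3. An analytic gradient derivation for surface normals; 4. A learnable attribute that improves computational and storage efficiency.

Method: Built on 3D Gaussian Splatting, the framework redesigns the rasterization pipeline to render multiple modalities simultaneously and optimizes geometric attributes through analytic gradients.

Result: Experiments show that UniGS achieves state-of-the-art accuracy in multimodal reconstruction, validating the geometry-aware paradigm.

Insight: Combining a unified geometry-aware representation with Gaussian Splatting shows clear promise for multimodal 3D reconstruction, particularly for geometric consistency and computational efficiency.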

Abstract: In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of the rotation and scale attributes through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and a multimodal viewer will be available on GitHub.
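
To make the ray-ellipsoid intersection idea concrete, here is a minimal NumPy sketch. It assumes each Gaussian is treated as the ellipsoid at one standard deviation, with center `center`, rotation matrix `rotation` (columns are the ellipsoid axes), and per-axis `scales`; the ray is mapped into the frame where the ellipsoid becomes a unit sphere, and the nearer root of the resulting quadratic is returned. Which level set UniGS actually intersects, and how it handles rays that start inside the ellipsoid, are not stated in the abstract, so those choices here are assumptions.

```python
import numpy as np

def ray_ellipsoid_depth(origin, direction, center, rotation, scales):
    """Distance t along origin + t * direction to the first hit on the
    ellipsoid, or None if the ray misses or the hit lies behind the origin."""
    # Move into the ellipsoid's local frame and rescale it to a unit sphere.
    o = (rotation.T @ (origin - center)) / scales
    d = (rotation.T @ direction) / scales
    # Solve |o + t d|^2 = 1, i.e. a t^2 + b t + c = 0.
    a = d @ d
    b = 2.0 * (o @ d)
    c = o @ o - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                      # ray misses the ellipsoid
    t = (-b - np.sqrt(disc)) / (2.0 * a)  # nearer of the two roots
    return float(t) if t >= 0.0 else None
```

Because `t` depends on `rotation` and `scales` as well as `center`, a depth loss on this quantity can pass gradients to the shape of the Gaussian, which a depth taken at the Gaussian center cannot do.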

[50] BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation

Youngju Yoo,Seho Kim,Changick Kim

Main category: cs.CV

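TL;DR: BEEP3D proposes a box-supervised, end-to-end pseudo-mask generation method for 3D instance segmentation that lowers annotation cost and improves performance through a student-teacher framework and new loss functions.

Details Motivation: Dense point-level annotation for 3D instance segmentation is expensive. Box-level supervision is cheaper but ambiguous in overlapping regions, and existing methods usually rely on two-stage training, which adds complexity and training time.

Contribution: 1. The end-to-end student-teacher framework BEEP3D; 2. An instance center-based query refinement mechanism; 3. Query consistency and masked feature consistency losses; 4. Efficient, competitive performance on ScanNetV2 and S3DIS.

Method: A student-teacher framework in which the teacher is updated by EMA; position queries are refined using instance centers; query consistency and masked feature consistency losses are applied.

Result: On ScanNetV2 and S3DIS, BEEP3D matches or outperforms existing weakly supervised methods while remaining computationally efficient.

Insight: End-to-end learning with consistency losses effectively resolves the ambiguity of box supervision while reducing training complexity.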

Abstract: 3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point-level annotations, resulting in substantial annotation costs and labor overhead. To mitigate this, box-level annotations have been explored as a weaker but more scalable form of supervision. However, box annotations inherently introduce ambiguity in overlapping regions, making accurate point-to-instance assignment challenging. Recent methods address this ambiguity by generating pseudo-masks through training a dedicated pseudo-labeler in an additional training stage. However, such two-stage pipelines often increase overall training time and complexity and hinder end-to-end optimization. To overcome these challenges, we propose BEEP3D (Box-supervised End-to-End Pseudo-mask generation for 3D instance segmentation). BEEP3D adopts a student-teacher framework, where the teacher model serves as a pseudo-labeler and is updated by the student model via an Exponential Moving Average. To better guide the teacher model to generate precise pseudo-masks, we introduce an instance center-based query refinement that enhances position query localization and leverages features near instance centers. Additionally, we design two novel losses, a query consistency loss and a masked feature consistency loss, to align semantic and geometric signals between predictions and pseudo-masks. Extensive experiments on ScanNetV2 and S3DIS datasets demonstrate that BEEP3D achieves competitive or superior performance compared to state-of-the-art weakly supervised methods while remaining computationally efficient.
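The Exponential Moving Average update that keeps the teacher in sync with the student can be sketched as follows. This is a generic EMA illustration, not BEEP3D's code: the weights are assumed to be plain dictionaries of floats, and the function name and the decay value are hypothetical.

```python
def ema_update(teacher, student, decay=0.999):
    """Blend student weights into the teacher in place and return the teacher.

    Assumption: both arguments map parameter names to floats and share keys.
    """
    for name, student_value in student.items():
        # The teacher moves a small step (1 - decay) toward the student.
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher
```

A decay close to 1 makes the teacher change slowly, so the pseudo-masks it produces are more stable than the student's own predictions from one step to the next.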

[51] CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

Jiwan Kim,Kibum Kim,Sangwoo Seo,Chanyoung Park

Main category: cs.CV

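TL;DR: CompoDistill explicitly aligns the student's visual attention with the teacher's, improving small multimodal large language models (MLLMs) on visual perception tasks while maintaining their performance on visual question answering.

Details Motivation: Existing knowledge distillation (KD) methods fail to transfer the visual perception abilities of a large teacher MLLM to the student effectively, because the visual attention of the two models is misaligned.

Contribution: Proposes the CompoDistill framework, which improves the student's visual perception abilities by explicitly aligning the student's visual attention with the teacher's.

Method: A systematic analysis identifies visual attention misalignment as the key problem, and a novel knowledge distillation framework is proposed to address it.

Result: Experiments show significant gains on compositional reasoning tasks that require visual perception, while performance on visual question answering is maintained.

Insight: Aligning visual attention is key to improving the student's visual perception abilities, and CompoDistill demonstrates the generality and effectiveness of this approach.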

Abstract: Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to their high computational complexity, making them more practical for real-world applications. In this regard, the knowledge distillation (KD) approach has emerged as a promising alternative, which transfers the rich visual and linguistic knowledge from a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM’s rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student’s visual attention with that of the teacher to enhance the student’s visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as done in existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.

[52] Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos

Shingo Yokoi,Kento Sasaki,Yu Yamaguchi

Main category: cs.CV

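TL;DR: The paper proposes a hierarchical reasoning framework for generating incident reports from dashcam videos. It combines frame-level captioning, incident frame detection, and fine-grained reasoning with vision-language models (VLMs), improving report accuracy and readability.

Details Motivation: Although end-to-end autonomous driving models trained on diverse large-scale datasets have made progress, they still perform poorly in out-of-distribution (OOD) scenarios. The COOOL benchmark and the 2COOOL challenge target this gap, calling for hazard understanding beyond closed taxonomies and for human-interpretable incident reports.

Contribution: A hierarchical reasoning framework that integrates frame-level captioning, incident frame detection, and fine-grained VLM reasoning, with model ensembling and a Blind A/B Scoring selection protocol to improve report accuracy and readability.

Method: Hierarchical reasoning in three steps: 1) frame-level captioning; 2) incident frame detection; 3) fine-grained reasoning with VLMs. Results are refined through model ensembling and a Blind A/B Scoring selection protocol.

Result: On the 2COOOL open leaderboard, the method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives.

Insight: The results indicate that combining hierarchical reasoning with vision-language models is a promising direction for accident analysis and for understanding safety-critical traffic events.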

Abstract: Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.

[53] The Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data

Muammer Bay,Timo von Marcard,Dren Fazlija

Main category: cs.CV

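TL;DR: The paper studies how synthetic data affects object detection performance, specifically in warehouse logistics, and compares synthetic data with real-world data. It finds that a balanced combination of synthetic and real data can improve model robustness and efficiency.

Details Motivation: Lacking expertise and resources, many AI applications rely on general-purpose models, while collecting domain-specific data is costly and inefficient. Synthetic data offers a cost-effective alternative, but its effectiveness in real-world use needs to be validated.

Contribution: Experimentally validates the effectiveness of synthetic data for object detection, particularly in warehouse logistics scenarios, and proposes an optimized strategy for combining synthetic and real data.

Method: Generates synthetic data with the NVIDIA Omniverse Replicator tool and designs experiments comparing how different dataset strategies (real data only, synthetic data only, mixed data) affect object detection performance.

Result: The experiments show that mixing synthetic and real data significantly improves model performance, especially in real-world scenarios.

Insight: Synthetic data is not a perfect substitute, but in some scenarios it can effectively supplement real data, especially when data is scarce or expensive to collect.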

Abstract: Recent advances in generative AI, particularly in computer vision (CV), offer new opportunities to optimize workflows across industries, including logistics and manufacturing. However, many AI applications are limited by a lack of expertise and resources, which forces a reliance on general-purpose models. Success with these models often requires domain-specific data for fine-tuning, which can be costly and inefficient. Thus, using synthetic data for fine-tuning is a popular, cost-effective alternative to gathering real-world data. This work investigates the impact of synthetic data on the performance of object detection models, compared to models trained on real-world data only, specifically within the domain of warehouse logistics. To this end, we examined the impact of synthetic data generated using the NVIDIA Omniverse Replicator tool on the effectiveness of object detection models in real-world scenarios. It comprises experiments focused on pallet detection in a warehouse setting, utilizing both real and various synthetic dataset generation strategies. Our findings provide valuable insights into the practical applications of synthetic image data in computer vision, suggesting that a balanced integration of synthetic and real data can lead to robust and efficient object detection models.

[54] DIANet: A Phase-Aware Dual-Stream Network for Micro-Expression Recognition via Dynamic Images

Vu Tram Anh Khuong,Luu Tu Nguyen,Thi Bich Phuong Man,Thanh Ha Le,Thi Duyen Ngo

Main category: cs.CV

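TL;DR: DIANet is a dual-stream network that recognizes micro-expressions from dynamic images. It models the distinct temporal phases of a micro-expression (onset-to-apex and apex-to-offset), fuses their features with a cross-attention module, and outperforms conventional methods.

Details Motivation: Micro-expression recognition (MER) is challenging because the expressions are brief and subtle and annotated data is scarce. Conventional dynamic image (DI) based methods neglect to model the different temporal phases of a micro-expression.

Contribution: Proposes DIANet, a dual-stream framework that models the onset-to-apex and apex-to-offset phases separately and fuses the features of the two streams with a cross-attention module.

Method: Encodes the two phases as separate dynamic images, processes each stream with its own CNN, and uses a cross-attention module to fuse the features adaptively.

Result: Performs strongly on three datasets (CASME-II, SAMM, and MMEW), confirming the importance of explicitly modeling temporal phase information.

Insight: Explicitly modeling the temporal phases of a micro-expression significantly improves recognition, and a cross-attention module fuses multi-phase features effectively.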

Abstract: Micro-expressions are brief, involuntary facial movements that typically last less than half a second and often reveal genuine emotions. Accurately recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. However, micro-expression recognition (MER) remains a challenging task due to the subtle and transient nature of facial cues and the limited availability of annotated data. While dynamic image (DI) representations have been introduced to summarize temporal motion into a single frame, conventional DI-based methods often overlook the distinct characteristics of different temporal phases within a micro-expression. To address this issue, this paper proposes a novel dual-stream framework, DIANet, which leverages phase-aware dynamic images - one encoding the onset-to-apex phase and the other capturing the apex-to-offset phase. Each stream is processed by a dedicated convolutional neural network, and a cross-attention fusion module is employed to adaptively integrate features from both streams based on their contextual relevance. Extensive experiments conducted on three benchmark MER datasets (CASME-II, SAMM, and MMEW) demonstrate that the proposed method consistently outperforms conventional single-phase DI-based approaches. The results highlight the importance of modeling temporal phase information explicitly and suggest a promising direction for advancing MER.
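How a clip might be split into two phase-aware dynamic images can be sketched as below. The sketch assumes the simplified approximate rank pooling weights alpha_t = 2t - T - 1 that are commonly used to build dynamic images; the abstract does not say which variant DIANet uses. The function names, the flat-list frame format, and the choice to include the apex frame in both phases are all assumptions made for illustration.

```python
def dynamic_image(frames):
    """Collapse a list of equally sized frames (flat lists of floats) into one frame.

    Assumption: simplified approximate rank pooling, alpha_t = 2t - T - 1,
    so later frames get positive weights and earlier frames negative ones.
    """
    total = len(frames)
    out = [0.0] * len(frames[0])
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - total - 1
        for i, value in enumerate(frame):
            out[i] += alpha * value
    return out

def phase_aware_dynamic_images(frames, apex_index):
    """Return (onset-to-apex, apex-to-offset) dynamic images.

    Assumption: the apex frame belongs to both phases.
    """
    onset_to_apex = dynamic_image(frames[: apex_index + 1])
    apex_to_offset = dynamic_image(frames[apex_index:])
    return onset_to_apex, apex_to_offset
```

For a one-pixel clip whose intensity rises to the apex and then falls, the two images have opposite signs, which a single dynamic image over the whole clip would cancel out.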

[55] HoneyBee: Data Recipes for Vision-Language Reasoners

Hritik Bansal,Devandra Singh Sachan,Kai-Wei Chang,Aditya Grover,Gargi Ghosh,Wen-tau Yih,Ramakanth Pasunuru

Main category: cs.CV

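TL;DR: The paper studies the principles behind building high-performing training datasets for vision-language model (VLM) reasoning, proposes several data curation approaches, and analyzes their effect on VLM reasoning ability. It finds that context sources, data interventions, and data scaling all affect performance significantly, and introduces the HoneyBee dataset and a low-cost test-time scaling strategy.

Details Motivation: Although vision-language models perform well on reasoning tasks, the principles for constructing high-performing training datasets remain unclear. The study aims to fill this gap and explore best practices for data curation and scaling.

Contribution: 1) An analysis of how data curation approaches affect VLM reasoning performance; 2) the HoneyBee dataset (2.5M high-quality chain-of-thought reasoning examples); 3) a test-time scaling strategy that reduces decoding cost by 73%.

Method: Under controlled training and evaluation setups, studies the effects of context sources, data interventions (such as auxiliary signals from image captions), and data scaling (images, questions, and chain-of-thought solutions).

Result: A HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model on MathVerse by 7.8% and 24.8%, respectively. The test-time strategy reduces decoding cost by 73% without loss of accuracy.

Insight: Context strategies, data interventions, and scaling along multiple dimensions are key to improving VLM reasoning. High-quality datasets and an optimized test-time strategy can significantly improve model performance and efficiency.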

Abstract: Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples covering 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.

[56] BIGFix: Bidirectional Image Generation with Token Fixing

Victor Besnier,David Hurych,Andrei Bursuc,Eduardo Valle

Main category: cs.CV

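TL;DR: BIGFix proposes a bidirectional image generation method that improves generation quality by fixing erroneous tokens during sampling, while keeping the efficiency advantage of parallel token prediction.

Details Motivation: Inference efficiency is a key challenge for current generative models, affecting both commercial viability and scientific research. Parallel token prediction improves efficiency but can cause structural inconsistencies due to token incompatibilities.

Contribution: Proposes a self-correcting image generation method based on token fixing: random tokens are injected during training to improve robustness, which enables token fixing at sampling time.

Method: Combines auto-regressive token modeling with parallel multi-token prediction and iteratively fixes tokens during sampling, reducing the number of inference steps and improving generation quality. Random tokens are injected during training to strengthen robustness.

Result: The method's efficiency and quality gains are validated on image datasets such as ImageNet-256 and CIFAR-10 and on video datasets such as UCF-101 and NuScenes.

Insight: A token-fixing mechanism can significantly improve the output quality of generative models without a significant increase in compute, and it suits both image and video generation.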

Abstract: Recent advances in image and video generation have raised significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, as model size and the number of inference steps directly impact the commercial viability of generative models while also posing fundamental scientific challenges. A promising direction involves combining auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies due to token incompatibilities, as capturing complex joint dependencies during training remains challenging. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly enhancing generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.

[57] Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection

Yuehui Li,Yahao Lu,Haoyuan Wu,Sen Zhang,Liang Lin,Yukai Shi

Main category: cs.CV

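TL;DR: The paper proposes Ivan-ISTD, a doubly wavelet-guided invariance learning framework that addresses cross-domain shift and heteroscedastic noise perturbations in infrared small target detection (ISTD). Wavelet-guided cross-domain synthesis and real-domain noise invariance learning significantly improve detection performance.

Details Motivation: In practice, infrared small target detection faces the dual challenges of cross-domain distribution shift and heteroscedastic noise perturbations, which conventional methods handle poorly.

Contribution: 1. A doubly wavelet-guided framework; 2. wavelet-guided cross-domain synthesis and construction of a dynamic noise library; 3. the Dynamic-ISTD benchmark dataset.

Method: 1. Wavelet-guided multi-frequency filtering that separates targets from background; 2. real-domain noise invariance learning; 3. a dynamic noise library and a self-supervised loss.

Result: Experiments show that Ivan-ISTD outperforms existing methods on multiple quantitative metrics and is particularly strong in cross-domain scenarios.

Insight: Wavelet filtering and multi-frequency analysis separate targets from background effectively, and modeling real noise significantly improves noise invariance learning.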

Abstract: In the multimedia domain, Infrared Small Target Detection (ISTD) plays an important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided Invariance learning framework (Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment machine accurately separates the target from the background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through a self-supervised loss, thereby overcoming the limitations of distribution bias in traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. Additionally, we validate the versatility of our method using other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods on many quantitative metrics. In particular, Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios. The code for this work can be found at: https://github.com/nanjin1/Ivan-ISTD.

[58] Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding

Ye Chen,Liming Tan,Yupeng Zhu,Yuanbin Wang,Bingbing Ni

Main category: cs.CV

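TL;DR: The paper proposes a vectorized video representation based on hierarchical, spatio-temporally consistent proxy embedding. It addresses the fragility of conventional video representations, which stems from unstable pixel-level matching and tracking, and supports efficient video editing.

Details Motivation: Current video representations depend on overly fine-grained and unstable pixel-level matching and tracking, which easily fail under tracking errors, occlusion, or large motion. A more robust and flexible video representation is needed.

Contribution: 1. Hierarchical, spatio-temporally consistent proxy nodes that stably represent dynamically changing objects and scenes in a video; 2. a dynamic representation update mechanism that reduces the impact of inaccurate tracking; 3. decoupled shape and texture encoding that supports controllable, fine-grained video editing.

Method: Designs a hierarchical proxy node framework that combines spatio-temporal consistency constraints with multi-scale structure awareness, together with a dynamic update mechanism and a decoupled encoding strategy, to achieve robust video representation and editing.

Result: Experiments show that the method reconstructs video with high accuracy using fewer parameters and supports complex video processing tasks such as video in-painting and keyframe-based consistent editing.

Insight: Hierarchical proxy nodes and the dynamic update mechanism provide a robust and flexible video representation that copes with the failure modes of conventional methods.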

Abstract: Current video representations heavily rely on unstable and over-grained priors for motion and appearance modelling, i.e., pixel-level matching and tracking. A tracking error of just a few pixels would lead to the collapse of the visual object representation, not to mention occlusions and large motion frequently occurring in videos. To overcome the above-mentioned vulnerability, this work proposes spatio-temporally consistent proxy nodes to represent dynamically changing objects/scenes in the video. On the one hand, the hierarchical proxy nodes have the ability to stably express the multi-scale structure of visual objects, so they are not affected by accumulated tracking error, long-term motion, occlusion, and viewpoint variation. On the other hand, the dynamic representation update mechanism of the proxy nodes adequately leverages spatio-temporal priors of the video to mitigate the impact of inaccurate trackers, thereby effectively handling drastic changes in scenes and objects. Additionally, the decoupled encoding of the shape and texture representations across different visual objects in the video facilitates controllable and fine-grained appearance editing. Extensive experiments demonstrate that the proposed representation achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks, including video in-painting and keyframe-based temporally consistent video editing.

[59] Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images

Yuto Yokoi,Kazuhiro Hotta

Main category: cs.CV

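TL;DR: The paper proposes two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss. They combine Cross Entropy and Dice losses multiplicatively and modulate gradients dynamically, improving semantic segmentation of medical and cellular images.

Details Motivation: Existing losses, such as the combination of Cross Entropy and Dice Loss, are sensitive to hyperparameters and perform poorly when data is scarce. Medical image data is limited by privacy and related constraints, so a more robust loss function is needed.

Contribution: 1. Multiplicative Loss, which modulates gradients dynamically to reduce the penalty on confident correct predictions; 2. its extension, Confidence-Adaptive Multiplicative Loss, which uses confidence-driven exponential scaling to further improve learning on difficult samples.

Method: Combines Cross Entropy and Dice Loss multiplicatively and modulates gradients according to prediction confidence; the second loss adds a confidence-driven exponential scaling mechanism.

Result: On cellular and medical image segmentation tasks, the proposed losses outperform the conventional additive combination and existing loss functions, and are especially suited to data-scarce settings.

Insight: A multiplicative combination of losses with dynamic gradient modulation can effectively improve model robustness and performance under data scarcity.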

Abstract: We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.
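One plausible reading of "combines Cross Entropy and Dice losses multiplicatively" is a plain product of the two terms. The sketch below illustrates that reading for binary segmentation on a flat list of pixel probabilities. The abstract does not give the formula, so the exact form, the function names, and the smoothing constant are assumptions, and the Confidence-Adaptive variant is not covered.

```python
import math

def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over pixels; probs are foreground probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def dice_loss(probs, targets, eps=1e-7):
    """One minus the soft Dice coefficient."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def multiplicative_loss(probs, targets):
    """Assumed form: the product of the two losses (no weighting hyperparameter)."""
    return cross_entropy(probs, targets) * dice_loss(probs, targets)
```

Under this assumed form, each term scales the gradient of the other, so a prediction that is already confident and correct (both terms near zero) contributes far less than it would in a weighted sum, and no mixing weight has to be tuned.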

[60] Local Background Features Matter in Out-of-Distribution Detection

Jinlun Ye,Zhuohao Sun,Yiqiao Qiu,Qiu Li,Zhijun Tan,Ruixuan Wang

Main category: cs.CV

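TL;DR: The paper proposes a new OOD detection method that uses the local background features of in-distribution (ID) images as pseudo-OOD features, and reduces overconfidence on OOD data by minimizing the L2 norm of those features.

Details Motivation: OOD detection is essential for the reliability of deep learning models, but existing methods depend on costly data collection or on generating pseudo-OOD images. An efficient, low-cost solution is needed.

Contribution: Proposes a novel and efficient OOD detection method that uses background features of ID images as pseudo-OOD features, needs no extra data, and is compatible with existing methods.

Method: Extracts background features from ID images as pseudo-OOD features, based on the local invariance of convolution, and trains the model to minimize their L2 norm so that it becomes less overconfident on OOD data.

Result: The method is validated on multiple standard OOD detection benchmarks and, combined with existing methods, achieves new SOTA performance.

Insight: ID and OOD images may share similar background regions. Exploiting this makes it possible to simulate OOD features without extra data, offering a new approach to OOD detection.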

Abstract: Out-of-distribution (OOD) detection is crucial when deploying deep neural networks in the real world to ensure the reliability and safety of their applications. One main challenge in OOD detection is that neural network models often produce overconfident predictions on OOD data. While some methods using auxiliary OOD datasets or generating fake OOD images have shown promising OOD detection performance, they are limited by the high costs of data collection and training. In this study, we propose a novel and effective OOD detection method that utilizes local background features as fake OOD features for model training. Inspired by the observation that OOD images generally share similar background regions with ID images, the background features are extracted from ID images as simulated OOD visual representations during training based on the local invariance of convolution. Through being optimized to reduce the $L_2$-norm of these background features, the neural networks are able to alleviate the overconfidence issue on OOD data. Extensive experiments on multiple standard OOD detection benchmarks confirm the effectiveness of our method and its wide combinatorial compatibility with existing post-hoc methods, with new state-of-the-art performance achieved from our method.
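The training signal described above, reducing the L2 norm of background features, can be written as a small penalty term. In this sketch the feature map is a list of per-location feature vectors and the background mask is given as input; how the paper selects background locations is not stated in the abstract, so the mask, the averaging, and the function name are assumptions.

```python
import math

def background_norm_penalty(features, is_background):
    """Mean L2 norm of the feature vectors at background locations.

    Assumption: `features` is a list of per-location vectors and
    `is_background` is a parallel list of booleans supplied by the caller.
    """
    norms = [
        math.sqrt(sum(v * v for v in vector))
        for vector, background in zip(features, is_background)
        if background
    ]
    return sum(norms) / len(norms) if norms else 0.0
```

In training, a term like this would be added to the usual classification loss with some weight, so that features resembling background (and, by the paper's hypothesis, OOD content) are pushed toward small norms and therefore less confident predictions.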

[61] AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion

Xiaopeng Liu,Yupei Lin,Sen Zhang,Xiao Wang,Yukai Shi,Liang Lin

Main category: cs.CV

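TL;DR: AngularFuse proposes an angle-based perception framework for spatial-sensitive multi-modality image fusion. A cross-modal complementary mask module, a fine-grained reference image synthesis strategy, and an angle-aware loss together significantly improve the quality of fused images.

Details Motivation: Existing unsupervised deep learning methods for visible-infrared image fusion rely on hand-designed loss functions. Their reference images lack detail and have uneven brightness, and their gradient losses consider only gradient magnitude. AngularFuse aims to solve these problems.

Contribution: 1. A cross-modal complementary mask module that learns complementary information between modalities. 2. A fine-grained reference image synthesis strategy that produces reference images with rich detail and balanced brightness. 3. An angle-aware loss that, for the first time, constrains both gradient magnitude and direction in the gradient domain.

Method: 1. Synthesizes reference images using Laplacian edge enhancement and adaptive histogram equalization. 2. Introduces an angle-aware loss that constrains gradient magnitude and direction. 3. Incorporates the cross-modal complementary mask module into the network to improve fusion.

Result: Experiments on the MSRS, RoadScene, and M3FD datasets show that AngularFuse significantly outperforms mainstream methods in fusion quality and suitability for downstream tasks.

Insight: By attending to both gradient magnitude and direction, AngularFuse preserves texture intensity and correct edge orientation, opening a new optimization direction for multi-modality image fusion.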

Abstract: Visible-infrared image fusion is crucial in key applications such as autonomous driving and nighttime surveillance. Its main goal is to integrate multimodal information to produce enhanced images that are better suited for downstream tasks. Although deep learning-based fusion methods have made significant progress, mainstream unsupervised approaches still face serious challenges in practical applications. Existing methods mostly rely on manually designed loss functions to guide the fusion process. However, these loss functions have obvious limitations. On one hand, the reference images constructed by existing methods often lack details and have uneven brightness. On the other hand, the widely used gradient losses focus only on gradient magnitude. To address these challenges, this paper proposes an angle-based perception framework for spatial-sensitive image fusion (AngularFuse). First, we design a cross-modal complementary mask module to force the network to learn complementary information between modalities. Then, a fine-grained reference image synthesis strategy is introduced. By combining Laplacian edge enhancement with adaptive histogram equalization, reference images with richer details and more balanced brightness are generated. Finally, we introduce an angle-aware loss, which for the first time constrains both gradient magnitude and direction simultaneously in the gradient domain. AngularFuse ensures that the fused images preserve both texture intensity and correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and M3FD public datasets show that AngularFuse outperforms existing mainstream methods by a clear margin. Visual comparisons further confirm that our method produces sharper and more detailed results in challenging scenes, demonstrating superior fusion capability.
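A loss that constrains gradient direction as well as magnitude might look like the sketch below. The abstract does not define the angle-aware loss, so this is only one possible form: an absolute difference of gradient magnitudes plus a cosine-based direction term. The per-pixel `(gx, gy)` input format, the weighting, and the function name are assumptions.

```python
import math

def angle_aware_loss(fused_grads, ref_grads, weight=1.0, eps=1e-8):
    """Mean magnitude difference plus weighted mean (1 - cosine) of the angle
    between fused and reference gradients.

    Assumption: each argument is a list of (gx, gy) image gradients per pixel.
    """
    magnitude_term = 0.0
    direction_term = 0.0
    for (fx, fy), (rx, ry) in zip(fused_grads, ref_grads):
        fused_mag = math.hypot(fx, fy)
        ref_mag = math.hypot(rx, ry)
        magnitude_term += abs(fused_mag - ref_mag)
        # Cosine of the angle between the two gradient vectors.
        cosine = (fx * rx + fy * ry) / (fused_mag * ref_mag + eps)
        direction_term += 1.0 - cosine
    count = len(ref_grads)
    return (magnitude_term + weight * direction_term) / count
```

An edge rotated by 90 degrees with unchanged strength gives a magnitude term of zero, so a magnitude-only loss cannot see the error; the direction term is what penalizes it.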

[62] SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis

Chenghanyu Zhang,Zekun Li,Peipei Li,Xing Cui,Shuhan Xia,Weixiang Yan,Yiqiao Zhang,Qianyu Zhuang

Main category: cs.CV

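TL;DR: SpineBench is a benchmark for multimodal large language models (MLLMs) on spinal pathology analysis. It evaluates fine-grained MLLM performance in the spinal domain, filling a gap left by existing medical benchmarks.

Details Motivation: As MLLMs are increasingly used in medicine, existing benchmarks focus mainly on general medical tasks and do not evaluate performance in specialized domains such as the spine, which depend heavily on visual input.

Contribution: Introduces SpineBench, a benchmark with 64,878 QA pairs and 40,263 spine images covering 11 spinal diseases, which simulates challenging real-world clinical scenarios such as visually similar diseases.

Method: Builds SpineBench by integrating image-label pairs from open-source spinal disease datasets and samples visually similar hard negative options for each VQA pair.

Result: An evaluation of 12 MLLMs shows that they perform poorly on spinal tasks, highlighting the limitations of current MLLMs in the spinal domain.

Insight: SpineBench exposes the performance bottlenecks of MLLMs in the spinal domain and points to directions for improving future spinal medicine applications.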

Abstract: With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and it samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating challenging real-world scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models perform poorly on spinal tasks, highlighting the limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.

[63] PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes

Ying A,Wenzhang Sun,Chang Zeng,Chunfeng Wang,Hao Li,Jianxun Cui

Main category: cs.CV

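TL;DR: PAGS proposes a dynamic 3D scene reconstruction framework driven by semantic priorities, significantly improving reconstruction quality and rendering efficiency.

Details Motivation: Autonomous driving needs high-quality reconstruction of dynamic 3D scenes, but existing methods face a sharp trade-off between fidelity and computational cost. The main cause is uniform resource allocation that ignores semantics.

Contribution: 1) A semantically guided pruning and regularization strategy that simplifies non-critical scene elements; 2) a priority-driven rendering pipeline that accelerates occlusion culling and the final shading computations.

Method: PAGS allocates resources dynamically according to semantic priority, simplifies the scene with a hybrid importance metric, and uses a priority-driven depth pre-pass.

Result: On the Waymo and KITTI datasets, PAGS performs especially well on safety-critical objects, significantly reduces training time, and renders at over 350 FPS.

Insight: Semantic priority is key to optimizing dynamic scene reconstruction, and task-driven resource allocation can significantly improve computational efficiency.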

Abstract: Reconstructing dynamic 3D urban scenes is crucial for autonomous driving, yet current methods face a stark trade-off between fidelity and computational cost. This inefficiency stems from their semantically agnostic design, which allocates resources uniformly, treating static backgrounds and safety-critical objects with equal importance. To address this, we introduce Priority-Adaptive Gaussian Splatting (PAGS), a framework that injects task-aware semantic priorities directly into the 3D reconstruction and rendering pipeline. PAGS introduces two core contributions: (1) a Semantically-Guided Pruning and Regularization strategy, which employs a hybrid importance metric to aggressively simplify non-critical scene elements while preserving fine-grained details on objects vital for navigation; and (2) a Priority-Driven Rendering pipeline, which employs a priority-based depth pre-pass to aggressively cull occluded primitives and accelerate the final shading computations. Extensive experiments on the Waymo and KITTI datasets demonstrate that PAGS achieves exceptional reconstruction quality, particularly on safety-critical objects, while significantly reducing training time and boosting rendering speeds to over 350 FPS.

[64] Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval

Jianfeng Dong,Lei Huang,Daizong Liu,Xianke Chen,Xun Yang,Changting Lin,Xun Wang,Meng Wang

Main category: cs.CV

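TL;DR: The paper proposes DL-DKD++, a dual learning framework that combines dynamic knowledge distillation with a soft alignment mechanism for Partially Relevant Video Retrieval (PRVR). It distills knowledge from a powerful pre-trained model into a lightweight student model and achieves strong performance.

Details Motivation: Conventional video retrieval assumes that videos are pre-trimmed and fully relevant to the text, but real videos are usually untrimmed and contain a large amount of irrelevant content. The paper therefore focuses on the more challenging task of partially relevant video retrieval.

Contribution: 1. The DL-DKD++ framework, which combines dynamic knowledge distillation with dual-branch learning.
2. A soft-target mechanism that dynamically adjusts the supervision signal to capture fine-grained relevance between videos and queries.

Method: 1. A large teacher model supervises a lightweight dual-branch student model.
2. The student model consists of an inheritance branch (absorbing the teacher's knowledge) and an exploration branch (learning task-specific knowledge).
3. Soft targets are constructed dynamically and replace hard-label supervision.

Result: Achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.

Insight: A dynamic soft-target mechanism captures the partial relevance between videos and queries more effectively and suits retrieval over untrimmed videos.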

Abstract: Almost all previous text-to-video retrieval works ideally assume that videos are pre-trimmed to short durations and contain solely text-related content. However, in practice, videos are typically untrimmed and long, with much more complicated background content. Therefore, in this paper, we focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve partially relevant untrimmed videos with the given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), where a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at https://github.com/HuiGuanLab/DL-DKD.

[65] Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector

Sifan Li,Hongkai Chen,Yujun Cai,Qingwen Ye,Liyang Chen,Junsong Yuan,Yiwei Wang

Main category: cs.CV

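TL;DR: The paper studies how vision language models (VLMs) hallucinate when shown logos that contain no text. It shows that the models tend to rely on symbolic priors rather than genuine visual evidence, and suggests projector disentanglement and OCR-guided decoding as mitigations.

Details Motivation: VLMs have made notable progress in multimodal reasoning but still hallucinate, producing outputs that are not grounded in visual evidence. The paper focuses on the overlooked setting of logo hallucination, examining the models' tendency to generate brand names or textual content even when a logo contains no visible words.

Contribution: 1) A systematic measurement of logo hallucination across leading VLMs; 2) an embedding analysis that ties hallucination to specific projector dimensions; 3) a targeted ablation of the projector subspace that substantially reduces errors.

Method: 1) Evaluates hallucination on curated splits of pure-symbol, hybrid, and text-bearing logos and on the Hard-60 subset; 2) tests robustness with nine structured perturbations; 3) performs embedding-level analysis with LLaVA to locate the source of hallucination in the projector dimensions.

Result: 1) Hallucination persists even under strong distortions, with occlusion exposing the sharpest weaknesses; 2) hallucination is tied to a small subset of projector dimensions, and ablating them substantially reduces errors; 3) for circular logos in particular, models rely on symbolic priors rather than glyph perception.

Insight: 1) VLMs commonly rely on priors rather than visual evidence when processing logos; 2) projector subspaces play a decisive role in hallucination; 3) projector disentanglement and OCR-guided decoding are promising directions for more trustworthy models.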

Abstract: Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.

[66] CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Jinzhou Lin,Jie Zhou,Wenhao Xu,Rongtao Xu,Changwei Wang,Shunpeng Chen,Kexue Fu,Yihua Shao,Li Guo,Shibiao Xu

Main category: cs.CV

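TL;DR: CurriFlow combines optical-flow-based temporal alignment with curriculum-guided depth fusion for 3D semantic scene completion (SSC), and reaches a mean IoU of 16.9 on the SemanticKITTI benchmark.

Details Motivation: Existing SSC methods rely on temporal stacking or depth projection, lack explicit motion reasoning, and handle occlusions and noisy depth supervision poorly. CurriFlow aims to address these issues and make camera-based perception more robust.

Contribution: 1. Optical-flow-guided multi-level feature alignment; 2. curriculum-driven depth fusion (from sparse LiDAR depth to dense stereo depth); 3. semantic priors from SAM to strengthen voxel-level semantic learning.

Method: 1. Multi-level feature alignment using pre-trained optical flow; 2. a curriculum learning mechanism that progressively switches the depth supervision source; 3. category-agnostic semantic supervision from SAM.

Result: Reaches 16.9 mIoU on SemanticKITTI, which the authors report as state of the art.

Insight: Combining optical-flow alignment with curriculum learning improves temporal consistency and geometric robustness, and adding SAM further strengthens semantic learning.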

Abstract: Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.
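
The curriculum described above, moving from sparse but accurate LiDAR depth to dense but noisy stereo depth, can be sketched as a time-varying blend. The abstract does not give the schedule, so the linear ramp and the per-value blending below are assumptions made purely for illustration; `stereo_weight` and `fused_depth` are hypothetical names.

```python
def stereo_weight(step: int, total_steps: int) -> float:
    """Share of depth guidance taken from stereo depth (assumed linear ramp from 0 to 1)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def fused_depth(lidar_depth: float, stereo_depth: float, step: int, total_steps: int) -> float:
    """Blend one depth value from the two sources according to the curriculum weight."""
    w = stereo_weight(step, total_steps)
    return (1.0 - w) * lidar_depth + w * stereo_depth
```

Early in training the blend follows the LiDAR value, and by the end it follows the stereo value, which matches the deployment setting where only camera input is available.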

[67] Deep Attention-guided Adaptive Subsampling

Sharath M Shankaranarayana,Soumava Kumar Roy,Prasad Sudhakar,Chandan Aladahalli

Main category: cs.CV

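TL;DR: The paper proposes an attention-guided, learnable subsampling framework that adapts dynamically to the input, reducing redundant computation in deep neural networks while improving performance and lowering complexity.

Details Motivation: Performance gains in deep neural networks usually come with higher computational complexity and cost. Many tasks, such as 3D volume or video classification, contain redundant data, but current methods are mostly task-adaptive rather than input-adaptive, which limits their practical use.

Contribution: An attention-guided subsampling module that adapts to the input even at inference time, yielding performance gains and reduced complexity.

Method: A learnable subsampling framework that works around the non-differentiability of subsampling and uses an attention mechanism to make the sampling input-adaptive.

Result: Experiments on MedMNIST3D and two ultrasound video datasets validate the method, including on an in-house dataset collected under real-world clinical conditions.

Insight: A dynamic, input-adaptive sampling mechanism handles redundant data more efficiently and is better suited to real-world scenarios.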

Abstract: Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a non-differentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not input-adaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.
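
To make the difference between task-adaptive and input-adaptive sampling concrete, the following minimal sketch scores each frame against a query vector and keeps the top-k frames, so the selection changes with the input. It is not the authors' module: the dot-product scoring, the hard top-k choice, and the names `attention_scores` and `select_frames` are assumptions, and the sketch does not address how gradients flow through the selection.

```python
import math
from typing import List, Sequence


def attention_scores(frames: Sequence[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Softmax over the dot products between each frame feature and a query vector."""
    logits = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(frames: Sequence[Sequence[float]], query: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames, in their original temporal order."""
    if not 0 < k <= len(frames):
        raise ValueError("k must be between 1 and the number of frames")
    scores = attention_scores(frames, query)
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Because the scores are computed from the frames themselves, two different videos passed through the same function can keep different frame indices, which is the input-adaptive behaviour the abstract contrasts with a static learned sampling pattern.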

[68] Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling

Tim J. Schoonbeek,Shao-Hsuan Hung,Dan Lehman,Hans Onvlee,Jacek Kustra,Peter H. N. de With,Fons van der Sommen

Main category: cs.CV

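TL;DR: The paper proposes STORM-PSR, a dual-stream framework that combines spatial and temporal features to better recognize correctly completed steps in egocentric assembly videos, outperforming prior methods particularly under partial occlusion.

Details Motivation: Existing methods rely solely on detecting assembly object states in individual frames and ignore temporal features, which limits robustness and accuracy when objects are partially occluded.

Contribution: The STORM-PSR dual-stream framework, which combines spatial and temporal features and reduces the delay in recognizing assembly steps; the code and newly annotated labels are publicly released.

Method: STORM-PSR has two streams: an assembly state detection stream that handles unobstructed views of the object, and a spatio-temporal stream that captures spatial and temporal features with a spatial encoder and a transformer-based temporal encoder.

Result: On the MECCANO and IndustReal datasets, the average delay is reduced by 11.2% and 26.1%, respectively, compared to prior methods; the reduction is driven by the spatio-temporal stream, which does not depend on unobstructed views.

Insight: Combining spatial and temporal features improves step recognition under partial occlusion; the weakly supervised pre-training of the spatial encoder and the transformer-based temporal encoder are the key ingredients.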

Abstract: Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at https://timschoonbeek.github.io/stormpsr .
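
The reported improvements are relative reductions in the average delay between actual and predicted step completions. A minimal sketch of that arithmetic follows; the paper's exact metric definition (for example, how missed steps are handled) is not given in the abstract, so the simple mean used here is an assumption, as are the function names.

```python
from typing import Sequence


def average_delay(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of (predicted - actual) completion times over matched steps, in seconds."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_delay: float, new_delay: float) -> float:
    """Percentage by which new_delay is lower than baseline_delay."""
    return 100.0 * (baseline_delay - new_delay) / baseline_delay
```

For instance, with made-up numbers, a baseline delay of 10.0 s and a new delay of 8.0 s correspond to a 20% reduction.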

[69] Scene Coordinate Reconstruction Priors

Wenjing Bian,Axel Barroso-Laguna,Tommaso Cavallari,Victor Adrian Prisacariu,Eric Brachmann

Main category: cs.CV

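TL;DR: The paper presents a probabilistic reinterpretation of training scene coordinate regression (SCR) models that allows high-level reconstruction priors to be injected, addressing the degeneration caused by insufficient multi-view constraints. The priors range from simple priors over the depth distribution to a learned prior over 3D point cloud configurations based on a diffusion model. Experiments on indoor datasets show improved scene representations, camera pose estimates, and downstream task performance.

Details Motivation: SCR models are trained for a single scene and degenerate when the training images provide insufficient multi-view constraints. High-level reconstruction priors are needed to improve their robustness and accuracy.

Contribution: 1. A probabilistic reinterpretation of SCR training; 2. several reconstruction priors, including depth-distribution priors and a point cloud configuration prior learned by a diffusion model; 3. validation on indoor datasets of the priors' benefit to scene representation and downstream tasks.

Method: 1. Probabilistic SCR training; 2. simple depth priors and a prior based on a 3D point cloud diffusion model; 3. at each training step, the priors push predicted 3D scene points toward plausible geometry.

Result: On three indoor datasets, the method yields more coherent scene point clouds, higher registration rates, and better camera poses, and it improves downstream tasks such as novel view synthesis and camera relocalization.

Insight: With high-level reconstruction priors injected, SCR models can still produce plausible geometry when multi-view constraints are insufficient, which benefits downstream tasks.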

Abstract: Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints, SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets, our priors help learn better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on downstream tasks such as novel view synthesis and camera relocalization.

[70] Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda

André Torneiro,Diogo Monteiro,Paulo Novais,Pedro Rangel Henriques,Nuno F. Rodrigues

Main category: cs.CV

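TL;DR: This paper systematically reviews the use of vision-language models (VLMs) in urban monitoring, with an emphasis on zero-shot capability. It summarizes 32 studies published between 2021 and 2025 and examines the tasks, architectures, datasets, and evaluation methods involved.

Details Motivation: The diversity of urban public infrastructure and the complexity of its environments make traditional monitoring approaches, such as IoT sensors and manual inspections, costly and hard to scale. VLMs, which combine visual understanding with natural language reasoning, are a promising way to address this.

Contribution: 1. A systematic summary of the application scenarios, reported performance, and datasets for VLMs in urban monitoring; 2. a research agenda for future work on VLMs in urban monitoring; 3. an emphasis on the potential of zero-shot VLM applications.

Method: Following the PRISMA systematic review methodology, the authors analyze 32 studies around four research questions: which tasks VLMs suit, which architectures are commonly used, which datasets support the field, and how applications are evaluated.

Result: The review finds that VLMs perform well in zero-shot settings and can handle diverse urban monitoring tasks, though architectures and datasets still need improvement to raise performance further.

Insight: 1. VLMs show potential for multi-task urban monitoring; 2. zero-shot use reduces the need for data annotation; 3. existing datasets need greater diversity to support broader applications.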

Abstract: Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens’ perception formed through direct visual observation. This raises a critical question: Can machines now “see” like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, turning them into a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?

[71] VideoLucy: Deep Memory Backtracking for Long Video Understanding

Jialong Zuo,Yongtai Deng,Lingdong Kong,Jingkang Yang,Rui Jin,Yiwei Zhang,Nong Sang,Liang Pan,Ziwei Liu,Changxin Gao

Main category: cs.CV

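TL;DR: VideoLucy is a deep memory backtracking framework for long video understanding. Using a hierarchical memory structure and an agent-based iterative backtracking mechanism, it addresses the loss of temporal context and detail in existing methods; the paper also introduces a new benchmark, EgoMem.

Details Motivation: Existing agent-based long video understanding systems built on large language models fall short in capturing temporal context and preserving details, so a more effective approach is needed.

Contribution: 1. The VideoLucy framework, with a hierarchical memory structure and an iterative backtracking mechanism; 2. EgoMem, a new benchmark for comprehensively evaluating long video understanding; 3. significant gains over existing methods on multiple benchmarks.

Method: VideoLucy uses a coarse-to-fine hierarchical memory structure that defines the detail level and temporal scope of memory at each depth, and an agent-based iterative backtracking mechanism that mines video-wide, question-relevant deep memories.

Result: VideoLucy performs strongly on multiple long video understanding benchmarks, even surpassing recent proprietary models such as GPT-4o.

Insight: A hierarchical memory structure combined with iterative backtracking preserves both temporal context and fine detail in long video understanding.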

Abstract: Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model’s ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available at https://videolucy.github.io

[72] A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation

Shaoyang Zhou,Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou

Main category: cs.CV

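TL;DR: This paper is the first comprehensive review of longitudinal radiology report generation (LRRG). It focuses on longitudinal analysis of chest X-ray images, addresses the shortcomings of conventional single-image methods, and offers a systematic framework for future research.

Details Motivation: Chest X-ray imaging is widely used in modern medicine, and its high utilization creates a heavy workload for radiologists. Existing methods mostly rely on single images, lack an understanding of longitudinal context, and cannot produce accurate clinical comparison statements. A systematic framework for studying longitudinal radiology report generation is therefore needed.

Contribution: (1) The first systematic review of the LRRG field; (2) a detailed analysis of dataset construction, model architecture design, and evaluation methods; (3) a summary of the performance of existing methods and their ablation studies; (4) five major limitations of current research and directions for future work.

Method: The survey analyzes the key components of LRRG: dataset construction strategies (such as annotation of multi-time-point data), model architecture design (such as methods for integrating longitudinal information), and evaluation protocols (covering both longitudinal-specific measures and widely used benchmarks).

Result: The review indicates that incorporating longitudinal information improves model performance, especially for generating clinical comparison statements, and that architectural design choices directly affect results.

Insight: (1) Longitudinal data is critical to report quality; (2) current models still need better modeling of temporal dynamics; (3) future work should focus more on clinical applicability and multimodal data fusion.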

Abstract: Chest X-ray imaging is a widely used diagnostic tool in modern medicine, and its high utilization creates substantial workloads for radiologists. To alleviate this burden, vision-language models are increasingly applied to automate chest X-ray radiology report generation (CXR-RRG), aiming for clinically accurate descriptions while reducing manual effort. Conventional approaches, however, typically rely on single images, failing to capture the longitudinal context necessary for producing clinically faithful comparison statements. Recently, growing attention has been directed toward incorporating longitudinal data into CXR-RRG, enabling models to leverage historical studies in ways that mirror radiologists' diagnostic workflows. Nevertheless, existing surveys primarily address single-image CXR-RRG and offer limited guidance for longitudinal settings, leaving researchers without a systematic framework for model design. To address this gap, this survey provides the first comprehensive review of longitudinal radiology report generation (LRRG). Specifically, we examine dataset construction strategies, report generation architectures alongside longitudinally tailored designs, and evaluation protocols encompassing both longitudinal-specific measures and widely used benchmarks. We further summarize the performance of LRRG methods, alongside analyses of different ablation studies, which collectively highlight the critical role of longitudinal information and architectural design choices in improving model performance. Finally, we summarize five major limitations of current research and outline promising directions for future development, aiming to lay a foundation for advancing this emerging field.

[73] MS-GAGA: Metric-Selective Guided Adversarial Generation Attack

Dion J. X. Ho,Gabriel Lee Jun Rong,Niharika Shrivastava,Harshavardhan Abichandani,Pai Chet Ng,Xiaoxiao Miao

Main category: cs.CV

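TL;DR: MS-GAGA is a two-stage framework that uses a dual-stream attack module and a metric-aware selection module to generate adversarial examples against black-box deepfake detectors that are both highly transferable and visually imperceptible.

Details Motivation: Current adversarial attack methods struggle to achieve transferability and visual imperceptibility at the same time; MS-GAGA aims to address this.

Contribution: A two-stage framework, MS-GAGA, that combines a dual-stream attack module with a metric-aware selection module to improve both the success rate against black-box detectors and the visual quality of adversarial examples.

Method: Stage 1 generates adversarial candidates with two attack streams, MNTD-PGD and SG-PGD; Stage 2 evaluates the candidates by their SSIM to the original image and their success against black-box models.

Result: MS-GAGA achieves up to 27% higher misclassification rates on unseen detectors than state-of-the-art attacks.

Insight: Combining complementary attack strategies with metric-aware selection improves both the transferability and the imperceptibility of adversarial examples.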

Abstract: We present MS-GAGA (Metric-Selective Guided Adversarial Generation Attack), a two-stage framework for crafting transferable and visually imperceptible adversarial examples against deepfake detectors in black-box settings. In Stage 1, a dual-stream attack module generates adversarial candidates: MNTD-PGD applies enhanced gradient calculations optimized for small perturbation budgets, while SG-PGD focuses perturbations on visually salient regions. This complementary design expands the adversarial search space and improves transferability across unseen models. In Stage 2, a metric-aware selection module evaluates candidates based on both their success against black-box models and their structural similarity (SSIM) to the original image. By jointly optimizing transferability and imperceptibility, MS-GAGA achieves up to 27% higher misclassification rates on unseen detectors compared to state-of-the-art attacks.
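
Stage 2 can be read as a ranking over candidates by two criteria: success against black-box models and SSIM to the original image. The sketch below is one possible selection rule, not the paper's: the `Candidate` record, the optional SSIM floor, and the tie-breaking order (more detectors fooled first, then higher SSIM) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    source: str   # attack stream that produced it, e.g. "MNTD-PGD" or "SG-PGD"
    fooled: int   # number of black-box detectors the candidate fools
    ssim: float   # structural similarity to the original image, in [0, 1]


def select_candidate(candidates: Sequence[Candidate], min_ssim: float = 0.0) -> Candidate:
    """Pick the candidate that fools the most detectors, breaking ties by higher SSIM."""
    eligible = [c for c in candidates if c.ssim >= min_ssim]
    if not eligible:
        raise ValueError("no candidate meets the SSIM threshold")
    return max(eligible, key=lambda c: (c.fooled, c.ssim))
```

Raising `min_ssim` trades attack success for imperceptibility: a candidate that fools more detectors is discarded if it drifts too far from the original image.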

[74] A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation

Shurong Chai,Rahul Kumar JAIN,Rui Xu,Shaocong Mo,Ruibo Hou,Shiyu Teng,Jiaqing Liu,Lanfen Lin,Yen-Wei Chen

Main category: cs.CV

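TL;DR: The paper proposes an early fusion framework that combines text and visual features before augmentation to preserve spatial consistency, together with a lightweight generator that projects text embeddings into visual space. The method achieves strong results on referring medical image segmentation.

Details Motivation: In multimodal learning, common image augmentations break the spatial alignment between text and image, which degrades performance. The paper proposes a solution to this problem.

Contribution: 1. An early fusion framework that preserves spatial consistency between text and image; 2. a lightweight generator that maps text embeddings into visual space; 3. validation on several medical image segmentation tasks.

Method: 1. Fuse text and visual features before augmentation; 2. use a lightweight generator to produce pseudo-images that bridge the semantic gap; 3. evaluate on three medical imaging tasks and four segmentation frameworks.

Result: The method achieves state-of-the-art results across the evaluated tasks, and visualizations of the generated pseudo-images show accurate region localization.

Insight: 1. Early fusion prevents augmentation from breaking spatial alignment; 2. the lightweight generator links text and visual features; 3. the approach is applicable across different segmentation frameworks.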

Abstract: Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
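
The alignment problem is easy to see with a flip: if the text says "left" and only the image is mirrored, the description no longer matches. Fusing a text-derived spatial map with the image before augmenting means both are transformed together. The toy sketch below uses nested lists in place of tensors; `fuse`, `hflip`, and the hand-written `text_map` are hypothetical stand-ins for the paper's generator and fusion module.

```python
from typing import List, Tuple


def fuse(image: List[List[int]], text_map: List[List[int]]) -> List[List[Tuple[int, int]]]:
    """Pair each pixel with the value of a same-sized text-derived map (early fusion)."""
    return [list(zip(img_row, txt_row)) for img_row, txt_row in zip(image, text_map)]


def hflip(grid: List[list]) -> List[list]:
    """Horizontally flip a 2D grid by reversing every row."""
    return [list(reversed(row)) for row in grid]


image = [[1, 2, 3], [4, 5, 6]]
text_map = [[1, 0, 0], [1, 0, 0]]  # made-up pseudo-image marking the left column
augmented = hflip(fuse(image, text_map))  # augmentation applied after fusion
```

After the flip, pixels 1 and 4 sit in the right column, and the text marker has moved with them, so the pairing between region and description is unchanged.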

[75] CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

Xiaoji Zheng,Ziyuan Yang,Yanhao Chen,Yuhang Peng,Yuanrong Tang,Gengyuan Liu,Bokui Chen,Jiangtao Gong

Main category: cs.CV

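TL;DR: The paper proposes CoIRL-AD, a framework that combines imitation learning (IL) and reinforcement learning (RL) through a collaborative-competitive mechanism to address the poor generalization and unstable training of conventional autonomous driving approaches.

Details Motivation: In conventional end-to-end autonomous driving models, pure imitation learning generalizes poorly, while reinforcement learning is sample-inefficient and unstable. Combining the two lets each compensate for the other's weaknesses.

Contribution: The CoIRL-AD framework, which uses a competition-based mechanism to enable knowledge exchange between IL and RL while preventing gradient conflicts.

Method: A dual-policy framework in which the IL and RL policies interact and compete during training, built on a latent world model.

Result: On the nuScenes dataset, the collision rate is reduced by 18% compared to baselines, with stronger generalization and better performance on long-tail scenarios.

Insight: A collaborative-competitive mechanism offers a way to train multiple policies jointly, balancing performance and stability through competition.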

Abstract: End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.

[76] MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking

Tianhao Li,Tingfa Xu,Ying Wang,Haolin Qin,Xu Lin,Jianan Li

Main category: cs.CV

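TL;DR: MMOT is the first benchmark dataset dedicated to drone-based multispectral multi-object tracking. It provides large-scale, diverse, and challenging scenarios, and the paper also presents a multispectral, orientation-aware tracking scheme that markedly improves performance.

Details Motivation: Existing RGB-based tracking methods are limited in drone views by small targets, frequent occlusions, and cluttered backgrounds. Multispectral imagery offers richer cues, but no dedicated benchmark dataset has been available.

Contribution: The MMOT benchmark dataset (125 video sequences, over 488.8K annotations) and a tracking scheme that combines multispectral features with orientation awareness.

Method: A lightweight Spectral 3D-Stem integrates spectral features; an orientation-aware Kalman filter and an end-to-end orientation-adaptive transformer are added on top.

Result: Experiments show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects.

Insight: Multispectral features have considerable potential for drone-based tracking, and orientation-aware design helps with the challenges of aerial perspectives.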

Abstract: Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial cues that enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking. It features three key characteristics: (i) Large Scale - 125 video sequences with over 488.8K annotations across eight categories; (ii) Comprehensive Challenges - covering diverse conditions such as extreme small targets, high-density scenarios, severe occlusions, and complex motion; and (iii) Precise Oriented Annotations - enabling accurate localization and reduced ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) an orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will advance drone-based multispectral multi-object tracking research. Our MMOT, code, and benchmarks are publicly available at https://github.com/Annzstbl/MMOT.

[77] Learning Human Motion with Temporally Conditional Mamba

Quang Nguyen,Tri Le,Baoru Huang,Minh Nhat Vu,Ngan Le,Thieu Vo,Anh Nguyen

Main category: cs.CV

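TL;DR: The paper proposes Temporally Conditional Mamba, a Mamba-based model that generates human motion aligned with a time-dependent conditioning signal, significantly improving temporal alignment and motion realism.

Details Motivation: Existing methods mainly rely on cross-attention to fuse the condition with the motion, but they struggle to maintain step-by-step temporal alignment, which limits the accuracy and consistency of the generated motion.

Contribution: The Temporally Conditional Mamba model, which integrates conditional information into the recurrent dynamics of the Mamba block to produce better temporally aligned motion.

Method: Building on the Mamba block, the model conditions the recurrent dynamics on the input signal so that the generated motion is aligned with the condition step by step.

Result: Experiments show that the model significantly outperforms existing methods in temporal alignment, motion realism, and condition consistency.

Insight: Injecting the condition into the recurrent dynamics, rather than fusing it through global cross-attention, is an effective way to address temporal alignment in human motion generation.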

Abstract: Learning human motion based on a time-dependent input signal presents a challenging yet impactful task with various applications. The goal of this task is to generate or estimate human movement that consistently reflects the temporal patterns of conditioning inputs. Existing methods typically rely on cross-attention mechanisms to fuse the condition with motion. However, this approach primarily captures global interactions and struggles to maintain step-by-step temporal alignment. To address this limitation, we introduce Temporally Conditional Mamba, a new mamba-based model for human motion generation. Our approach integrates conditional information into the recurrent dynamics of the Mamba block, enabling better temporally aligned motion. To validate the effectiveness of our method, we evaluate it on a variety of human motion tasks. Extensive experiments demonstrate that our model significantly improves temporal alignment, motion realism, and condition consistency over state-of-the-art approaches. Our project page is available at https://zquang2202.github.io/TCM.

[78] LayerSync: Self-aligning Intermediate Layers

Yasaman Haghighi,Bastien van Delft,Mariam Hassan,Alexandre Alahi

Main category: cs.CV

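TL;DR: LayerSync is a domain-agnostic method that improves the generation quality and training efficiency of diffusion models by self-aligning their intermediate-layer representations, with no external supervision or additional data.

Details Motivation: Prior work shows that the intermediate representations of diffusion models matter for generation quality and training efficiency, but it usually relies on external guidance. LayerSync instead uses the model's own intermediate representations as internal guidance, reducing the dependence on external supervision.

Contribution: A self-aligning regularizer for intermediate layers, in which the semantically rich representations of a diffusion model guide its weaker ones, improving generation quality and training efficiency. The method requires neither pretrained models nor additional data.

Method: LayerSync exploits the variation in representation quality across diffusion model layers: a regularization term lets the most semantically rich representations guide the others. It is a plug-and-play regularizer that adds no overhead to training.

Result: Experiments show that LayerSync consistently improves generation quality and training efficiency. For example, on ImageNet it speeds up the training of a flow-based transformer by over 8.75x and improves generation quality by 23.6%. The method also applies to other domains such as audio, video, and motion generation.

Insight: The intermediate representations of a diffusion model can serve as intrinsic guidance for its own training, without external resources.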

Abstract: We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires neither pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of a flow-based transformer by over 8.75x on the ImageNet dataset and improve the generation quality by 23.6%. The code is available at https://github.com/vita-epfl/LayerSync.
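
One way to picture a self-alignment regularizer of this kind is as a similarity term between the tokens of a weaker layer and those of a semantically richer layer of the same model, added to the usual training loss. The sketch below is an assumption-laden illustration, not LayerSync's actual objective: the cosine similarity, the weight `lam`, and the function names are hypothetical, and in a real framework the richer layer's tokens would typically be treated as a fixed target.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norms, 1e-12)


def alignment_loss(weak_tokens: Sequence[Sequence[float]],
                   strong_tokens: Sequence[Sequence[float]]) -> float:
    """Negative mean cosine similarity between matching tokens of two layers."""
    sims = [cosine(w, s) for w, s in zip(weak_tokens, strong_tokens)]
    return -sum(sims) / len(sims)


def total_loss(base_loss: float, weak_tokens, strong_tokens, lam: float = 0.5) -> float:
    """Base training loss plus the weighted alignment regularizer."""
    return base_loss + lam * alignment_loss(weak_tokens, strong_tokens)
```

The regularizer is lowest when the two layers' tokens point in the same direction, and it uses only activations the model already computes, which is consistent with the abstract's claim that no external model or data is needed.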

[79] Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

Jiachen Lei,Keli Liu,Julius Berner,Haiming Yu,Hongkai Zheng,Jiahong Wu,Xiangxiang Chu

Main category: cs.CV

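TL;DR: This paper proposes a two-stage training framework that uses self-supervised pre-training to improve pixel-space generative models, substantially narrowing the gap to latent-space models.

Details Motivation: Pixel-space generative models are typically harder to train and perform worse than latent-space models; the paper aims to close this gap through self-supervised pre-training.

Contribution: A two-stage training framework that combines self-supervised pre-training with end-to-end fine-tuning and improves both pixel-space diffusion models and consistency models.

Method: In the first stage, encoders are pre-trained to capture image semantics while being aligned with points along the same deterministic sampling trajectory; in the second stage, the encoder is combined with a randomly initialized decoder and the full model is fine-tuned end-to-end.

Result: The diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512, surpassing prior pixel-space methods and rivaling leading VAE-based models; the consistency model reaches an FID of 8.82 on ImageNet-256 in a single sampling step.

Insight: Self-supervised pre-training is effective for pixel-space generative models; to the authors' knowledge, this is the first successful training of a consistency model directly on high-resolution images without pre-trained VAEs or diffusion models.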

Abstract: Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.

[80] Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

Chao Chen,Zhixin Ma,Yongqi Li,Yupeng Hu,Yinwei Wei,Wenjie Li,Liqiang Nie

Main category: cs.CV

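TL;DR: The paper proposes IVT-LR, a multimodal reasoning method that interleaves vision and text reasoning in latent space, reducing annotation requirements and inference latency while improving performance.

Details Motivation: Current multimodal reasoning methods depend on explicit reasoning steps, which require labor-intensive annotations and introduce significant inference latency. To address this, the paper proposes reasoning in latent space.

Contribution: The IVT-LR method, which combines visual and textual information within the latent space; a progressive multi-stage training strategy; and improvements in both inference efficiency and accuracy.

Method: IVT-LR represents each reasoning step as a combination of latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings), and trains the model with a progressive multi-stage strategy.

Result: On the M3CoT and ScienceQA datasets, IVT-LR improves accuracy by 5.45% on average and is over 5 times faster at inference than existing approaches.

Insight: Latent-space reasoning can reduce the need for explicit annotations while improving the efficiency and performance of multimodal models.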

Abstract: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.

[81] WaterFlow: Explicit Physics-Prior Rectified Flow for Underwater Saliency Mask Generation

Runting Li,Shijie Lian,Hua Li,Yutong Li,Wenhui Wu,Sam Kwong

Main category: cs.CV

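TL;DR: WaterFlow is a rectified-flow-based framework that improves underwater salient object detection by using explicit physical priors and temporal dimension modeling.

Details Motivation: Underwater salient object detection suffers from image quality degradation and domain gaps. Existing methods tend to ignore the physics of underwater imaging or simply treat degradation as interference to be removed.

Contribution: The WaterFlow framework, which incorporates underwater physical imaging information as explicit priors directly into network training and introduces temporal dimension modeling.

Method: A rectified flow approach combined with explicit physical priors and temporal dimension modeling for salient object detection.

Result: On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m.

Insight: The physical information in underwater imaging can serve as a useful prior rather than an interference factor, improving detection performance.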

Abstract: Underwater Salient Object Detection (USOD) faces significant challenges, including underwater image quality degradation and domain gaps. Existing methods tend to ignore the physical principles of underwater imaging or simply treat degradation phenomena in underwater images as interference factors that must be eliminated, failing to fully exploit the valuable information they contain. We propose WaterFlow, a rectified flow-based framework for underwater salient object detection that innovatively incorporates underwater physical imaging information as explicit priors directly into the network training process and introduces temporal dimension modeling, significantly enhancing the model’s capability for salient object identification. On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m, demonstrating the effectiveness and superiority of our method. The code will be published after acceptance.

[82] On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

Shuhei Tarashima,Yushan Wang,Norio Tagawa

Main category: cs.CV

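TL;DR: This paper proposes using the early stages of hierarchical vision foundation models (VFMs) as encoders for low-cost, efficient human mesh recovery (HMR) and human pose estimation (HPE). Using only the first two or three stages of a hierarchical VFM achieves performance on par with full-stage models while offering a better trade-off between accuracy and computational efficiency.

Details Motivation: Existing HMR and HPE methods typically rely on large, non-hierarchical vision transformers as encoders, which are computationally expensive. To build simple and efficient models, the paper exploits intermediate-stage features of hierarchical VFMs as a low-cost alternative.

Contribution: (1) Constructs three lightweight HMR2.0 variants as baselines; (2) proposes using the early stages of hierarchical VFMs (Swin Transformer, GroupMixFormer, and VMamba) as encoders; (3) shows experimentally that models using the first two or three stages balance high performance against computational cost.

Method: The method has two parts: (1) building lightweight HMR2.0 variants; (2) using the early stages of hierarchical VFMs as encoders instead of the full model, which reduces computational cost. The evaluation covers 27 hierarchical-VFM-based HMR and HPE models.

Result: Using only the first two or three stages of a hierarchical VFM yields performance on par with full-stage models. The truncated models also show better accuracy-efficiency trade-offs than existing lightweight alternatives.

Insight: Intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than the outputs of non-hierarchical models, so the network can be truncated early without losing performance, substantially reducing computational cost.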

Abstract: In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.
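The resolution observation in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in the abstract: a Swin-style hierarchical backbone with cumulative strides of 4, 8, 16, and 32 across its four stages, a plain ViT with 16x16 patches, and a 256x192 input crop. The function names are hypothetical.

```python
# Assumed, typical configurations (not taken from the abstract).
HIERARCHICAL_STAGE_STRIDES = (4, 8, 16, 32)  # Swin-style cumulative strides
VIT_PATCH_SIZE = 16                          # plain ViT with 16x16 patches


def feature_map_size(height, width, stride):
    """Spatial size of a feature map produced at a given cumulative stride."""
    return height // stride, width // stride


def stages_matching_vit(height, width):
    """Return 1-based stage indices whose resolution is >= the ViT output."""
    vit_h, vit_w = feature_map_size(height, width, VIT_PATCH_SIZE)
    matching = []
    for index, stride in enumerate(HIERARCHICAL_STAGE_STRIDES, start=1):
        h, w = feature_map_size(height, width, stride)
        if h >= vit_h and w >= vit_w:
            matching.append(index)
    return matching


if __name__ == "__main__":
    height, width = 256, 192  # assumed input crop size
    print("ViT output:", feature_map_size(height, width, VIT_PATCH_SIZE))
    for i, s in enumerate(HIERARCHICAL_STAGE_STRIDES, start=1):
        print(f"stage {i}:", feature_map_size(height, width, s))
    print("stages at or above ViT resolution:", stages_matching_vit(height, width))
```

Under these assumptions, the first three stages keep a feature map at least as fine as the ViT output, which is consistent with truncating the encoder after stage two or three.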

[83] TerraCodec: Compressing Earth Observations

Julen Costa-Watanabe,Isabelle Wittmann,Benedikt Blumenstiel,Konrad Schindler

Main category: cs.CV

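TL;DR: TerraCodec (TEC) is a family of learned codecs designed for Earth observation data, comprising image-based variants adapted to multispectral inputs and a Temporal Transformer model (TEC-TT). It introduces Latent Repacking for flexible-rate training, clearly outperforms classical codecs, and performs strongly on zero-shot cloud inpainting.

Details Motivation: Earth observation satellites generate large volumes of multispectral image time series. Conventional image codecs overlook temporal redundancy, and video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes, so dedicated compression algorithms are needed.

Contribution: 1. Proposes the TerraCodec family, including multispectral image codecs and a Temporal Transformer model; 2. Proposes Latent Repacking to support flexible-rate training; 3. Achieves 3-10x stronger compression on Sentinel-2 data and surpasses existing methods on cloud inpainting.

Method: 1. Design image codecs for multispectral inputs; 2. Introduce the TEC-TT model to exploit temporal dependencies; 3. Propose Latent Repacking to train flexible-rate transformer models.

Result: On Sentinel-2 data, TEC achieves 3-10x stronger compression than classical codecs at equivalent image quality; TEC-TT performs best on the AllClear cloud inpainting benchmark.

Insight: Neural codecs designed for a specific domain (such as EO data) can substantially improve performance and support downstream tasks such as cloud inpainting.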

Abstract: Earth observation (EO) satellites produce massive streams of multispectral image time series, posing pressing challenges for storage and transmission. Yet, learned EO compression remains fragmented, lacking publicly available pretrained models and misaligned with advances in compression for natural imagery. Image codecs overlook temporal redundancy, while video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes. We introduce TerraCodec (TEC), a family of learned codecs tailored to EO. TEC includes efficient image-based variants adapted to multispectral inputs, as well as a Temporal Transformer model (TEC-TT) that leverages dependencies across time. To overcome the fixed-rate setting of today’s neural codecs, we present Latent Repacking, a novel method for training flexible-rate transformer models that operate on varying rate-distortion settings. Trained on Sentinel-2 data, TerraCodec outperforms classical codecs, achieving 3-10x stronger compression at equivalent image quality. Beyond compression, TEC-TT enables zero-shot cloud inpainting, surpassing state-of-the-art methods on the AllClear benchmark. Our results establish bespoke, learned compression algorithms as a promising direction for Earth observation. Code and model weights will be released under a permissive license.

[84] EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels

Kunyu Peng,Di Wen,Kailun Yang,Jia Fu,Yufan Chen,Ruiping Liu,Jiamin Wu,Junwei Zheng,M. Saquib Sarfraz,Luc Van Gool,Danda Pani Paudel,Rainer Stiefelhagen

Main category: cs.CV

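TL;DR: The paper proposes EReLiFM, a new method for open-set domain generalization under noisy labels. It combines unsupervised two-stage evidential loss clustering with a residual flow matching mechanism to improve performance in noisy settings.

Details Motivation: Open-set domain generalization (OSDG) matters in real-world scenarios, but label noise corrupts source-domain knowledge, making it harder to recognize known classes and reject unknown ones. Existing methods struggle to bridge domain gaps when clean labeled data is limited.

Contribution: Proposes EReLiFM, which comprises unsupervised two-stage evidential loss clustering and a residual flow matching mechanism. These components help the model generalize to new domains and unseen categories under noisy labels.

Method: 1. Unsupervised two-stage evidential loss clustering to promote label reliability awareness; 2. A residual flow matching mechanism that models structured domain- and category-conditioned residuals; 3. Pseudo labels used for supervision during meta-learning.

Result: Experiments show that EReLiFM outperforms existing methods on OSDG-NL and achieves state-of-the-art performance.

Insight: Evidential reliability awareness and uncertainty-aware transfer paths are key to improving generalization under label noise.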

Abstract: Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at https://github.com/KPeng9510/ERELIFM.

[85] Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis

Shelley Zixin Shu,Haozhe Luo,Alexander Poellinger,Mauricio Reyes

Main category: cs.CV

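TL;DR: This paper proposes a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to improve attention alignment and generalization in transformer-based chest X-ray diagnosis models.

Details Motivation: Transformer-based deep learning models perform well in medical imaging but are prone to learning spurious correlations, leading to biases and limited generalization. Existing human-AI attention alignment methods depend on costly manual supervision, so a more efficient solution is needed.

Contribution: Proposes the H-EGL framework, which combines self-supervised and human-guided constraints and improves attention alignment and model performance without restrictive priors. On chest X-ray classification, H-EGL surpasses existing methods in classification accuracy and generalization and produces attention maps better aligned with human expertise.

Method: The framework has a self-supervised part and a human-guided part. The self-supervised component uses class-distinctive attention and avoids reliance on manual annotation; the human-guided component refines the model using expert knowledge. The method is implemented on the Vision Transformer (ViT).

Result: On chest X-ray classification, H-EGL outperforms two state-of-the-art explanation-guided learning methods, showing higher classification accuracy and better generalization, with attention maps more consistent with human expertise.

Insight: Combining self-supervised and human-guided learning reduces dependence on manual supervision while improving attention alignment and generalization, offering a more efficient approach to medical image analysis.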

Abstract: Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.

[86] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Xingang Guo,Utkarsh Tyagi,Advait Gosai,Paula Vergara,Ernesto Gabriel Hernández Montoya,Chen Bo Calvin Zhang,Bin Hu,Yunzhong He,Bing Liu,Rakshith Sharma Srinivasa

Main category: cs.CV

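TL;DR: The paper introduces the IRIS benchmark for evaluating multimodal large language models (MLLMs) on dynamic image manipulation and reasoning tasks, revealing the limitations of current models and differences in their tool use.

Details Motivation: Current MLLMs handle passive, static-image tasks reasonably well but perform poorly on complex tasks that require active image manipulation and tool integration, and this setting lacks systematic evaluation.

Contribution: The main contribution is the IRIS benchmark, which covers 1,204 complex visual-textual tasks and is the first to systematically evaluate MLLMs' image perception, transformation, and reasoning under the "think with images" paradigm.

Method: IRIS is an interactive benchmark with single-turn and multi-turn tasks spanning five domains, each paired with detailed rubrics. Evaluating existing models on it exposes their limitations.

Result: The strongest model, GPT-5-think, reaches only an 18.68% pass rate, and models differ markedly in their tool-use behavior.

Insight: MLLMs still have substantial room to improve in dynamic image manipulation and tool integration, and IRIS points to directions for future work.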

Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce IRIS (Interactive Reasoning with Images and Systems), a benchmark that evaluates MLLMs’ ability to perceive, transform, and reason across complex visual-textual tasks under the think with images paradigm. IRIS comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only an 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, IRIS offers critical insights for advancing visual intelligence in MLLMs.

[87] Personalized Federated Fine-Tuning of Vision Foundation Models for Healthcare

Adam Tupper,Christian Gagné

Main category: cs.CV

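TL;DR: This paper proposes a new personalized federated learning method that uses orthogonal LoRA adapters to separate general from client-specific knowledge, letting each client exploit both its own data and other clients' data when fine-tuning foundation models for medical imaging tasks.

Details Motivation: Foundation models in healthcare must be fine-tuned for specific tasks, but privacy constraints prevent data from being pooled. Federated learning is one solution, but existing methods do not adequately separate general and client-specific knowledge.

Contribution: Proposes a personalized federated fine-tuning method that uses orthogonal LoRA adapters to explicitly separate general and client-specific knowledge, so that each client can use both local data and data from others.

Method: Fine-tunes vision foundation models within a federated learning framework using orthogonal LoRA adapters, with orthogonality constraints intended to disentangle the two kinds of knowledge.

Result: On real-world federated medical imaging tasks, preliminary results show the method is competitive with current federated fine-tuning methods.

Insight: Disentangling general and client-specific knowledge through orthogonality constraints can benefit federated fine-tuning while preserving data privacy.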

Abstract: Foundation models open up new possibilities for the use of AI in healthcare. However, even when pre-trained on health data, they still need to be fine-tuned for specific downstream tasks. Furthermore, although foundation models reduce the amount of training data required to achieve good performance, obtaining sufficient data is still a challenge. This is due, in part, to restrictions on sharing and aggregating data from different sources to protect patients’ privacy. One possible solution to this is to fine-tune foundation models via federated learning across multiple participating clients (i.e., hospitals, clinics, etc.). In this work, we propose a new personalized federated fine-tuning method that learns orthogonal LoRA adapters to disentangle general and client-specific knowledge, enabling each client to fully exploit both their own data and the data of others. Our preliminary results on real-world federated medical imaging tasks demonstrate that our approach is competitive against current federated fine-tuning methods.
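One common way to encourage two low-rank adapters to occupy orthogonal subspaces is to penalize the overlap between their down-projection matrices. The abstract does not say which regularizer the authors use, so the sketch below is an assumption rather than their method: it takes a general adapter matrix and a client adapter matrix (each of shape rank x input dimension, given as lists of lists) and returns the squared Frobenius norm of their cross-product. The function names are hypothetical.

```python
def matmul_transposed(a, b):
    """Compute a @ b.T for two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def orthogonality_penalty(general_a, client_a):
    """Squared Frobenius norm of general_a @ client_a.T (assumed regularizer).

    The value is zero exactly when every row of the general adapter is
    orthogonal to every row of the client-specific adapter.
    """
    cross = matmul_transposed(general_a, client_a)
    return sum(value * value for row in cross for value in row)


if __name__ == "__main__":
    # Toy rank-2 adapters over a 4-dimensional input.
    general = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    disjoint_client = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    overlapping_client = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    print(orthogonality_penalty(general, disjoint_client))
    print(orthogonality_penalty(general, overlapping_client))
```

In a training loop, a term like this would typically be added to the task loss with a small weight, so the client adapter is pushed away from directions the shared adapter already covers.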

[88] FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

Junhao Zhuang,Shi Guo,Xin Cai,Xiaohui Li,Yihao Liu,Chun Yuan,Tianfan Xue

Main category: cs.CV

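TL;DR: FlashVSR is a diffusion-based framework for real-time video super-resolution (VSR). It combines several techniques to achieve efficiency, scalability, and real-time performance, addressing the latency of conventional diffusion models on high-resolution video.

Details Motivation: Diffusion models perform well on video restoration, but real-world VSR applications face high latency, heavy computation, and poor generalization to ultra-high resolutions. The paper aims to make diffusion-based VSR practical.

Contribution: 1. Proposes FlashVSR, the first diffusion-based one-step streaming framework; 2. Designs three complementary techniques (a three-stage distillation pipeline, locality-constrained sparse attention, and a tiny conditional decoder); 3. Builds the large-scale VSR-120K dataset.

Method: 1. A three-stage distillation pipeline enables streaming super-resolution; 2. Locality-constrained sparse attention cuts redundant computation; 3. A tiny conditional decoder accelerates reconstruction.

Result: FlashVSR runs at approximately 17 FPS on 768x1408 video on a single A100 GPU, achieves up to a 12x speedup over prior one-step diffusion VSR models, and scales to ultra-high resolutions.

Insight: With techniques such as sparse attention and a small decoder, backed by a large-scale dataset, diffusion models can deliver efficient, real-time video super-resolution.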

Abstract: Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.

[89] SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding

Zhiliu Yang,Jinyu Dai,Jianyuan Zhang,Zhu Yang

Main category: cs.CV

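TL;DR: SPORTS is a unified framework for holistic urban scene understanding that integrates Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR). An adaptive attention mechanism and multimodal feature alignment improve dynamic object tracking and camera pose estimation, and neural fields synthesize high-fidelity views.

Details Motivation: Existing scene understanding methods suffer from segmentation deficiency, interference from dynamic objects, sparse sensor data, and limited viewpoints. SPORTS addresses these issues by integrating the tasks in one framework.

Contribution: 1. Proposes the SPORTS framework, which tightly integrates VPS, VO, and SR; 2. Designs an adaptive attention mechanism and multimodal feature alignment; 3. Generates high-quality RGB and panoptic views through point-based rendering.

Method: 1. VPS uses an adaptive attention mechanism to align the pose, depth, and optical flow modalities; 2. VO combines panoptic segmentation with the optical flow map to improve confidence estimation for dynamic objects; 3. SR transforms point clouds into neural fields to synthesize high-fidelity views.

Result: Experiments on three public datasets show that SPORTS outperforms most existing methods on odometry, tracking, segmentation, and novel view synthesis.

Insight: Tight integration across tasks improves overall scene understanding, particularly for dynamic object handling and view synthesis.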

Abstract: Scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, while existing solutions are still prone to segmentation deficiency, dynamic objects’ interference, sensor data sparsity, and view-limitation problems. This paper proposes a novel framework, named SPORTS, for holistic scene understanding via tightly integrating Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. Firstly, VPS designs an adaptive attention-based geometric fusion mechanism to align cross-frame features by incorporating the pose, depth, and optical flow modalities, which automatically adjusts feature maps for different decoding stages. A post-matching strategy is also integrated to improve identity tracking. In VO, panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of the camera pose estimation and completeness of the depth map generation via the learning-based paradigm. Furthermore, the point-based rendering of SR benefits from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.

[90] VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

A. Alfarano,L. Venturoli,D. Negueruela del Castillo

Main category: cs.CV

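TL;DR: This paper introduces VQArt-Bench, a new visual question answering benchmark for art and cultural heritage that evaluates models' deep understanding of complex semantics.

Details Motivation: Existing VQA benchmarks fall short in evaluating deep semantic understanding in complex domains such as visual art analysis, which encourages models to exploit statistical shortcuts rather than perform visual reasoning.

Contribution: Proposes VQArt-Bench, a large-scale VQA benchmark whose semantically rich and diverse questions are generated by collaborating agents, filling a gap left by existing benchmarks.

Method: A multi-agent pipeline generates the questions, ensuring they are diverse and validated, and the benchmark is structured along visual understanding dimensions.

Result: An evaluation of 14 state-of-the-art MLLMs reveals significant limitations in current models, including weakness on simple counting tasks and a performance gap between proprietary and open-source models.

Insight: VQArt-Bench offers a more comprehensive evaluation tool for future model development and highlights where complex semantic understanding still needs to improve.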

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model’s ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.

[91] E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization

Wenpu Li,Bangyan Liao,Yi Zhou,Qi Xu,Pian Wan,Peidong Liu

Main category: cs.CV

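TL;DR: E-MoFlow is an unsupervised framework that jointly optimizes optical flow and 6-DoF camera motion through implicit spatial-temporal and geometric regularization, avoiding explicit depth estimation and poor local minima while achieving strong performance.

Details Motivation: For event cameras, estimating optical flow and camera motion independently is ill-posed because robust data association is lacking. Existing methods either add explicit regularization, which introduces bias and computational overhead, or parametrize flow in terms of scene depth and camera motion, which often converges to suboptimal local minima.

Contribution: 1. Embeds spatial-temporal coherence implicitly through an implicit neural representation and a continuous spline; 2. Introduces differential geometric constraints that avoid explicit depth estimation while maintaining geometric consistency; 3. Unifies optical flow and camera motion estimation in an unsupervised framework with strong performance.

Method: 1. Optical flow is modeled as an implicit neural representation and camera motion as a continuous spline; 2. Differential geometric constraints encode structure-and-motion priors implicitly; 3. Optical flow and camera motion are optimized jointly.

Result: Achieves state-of-the-art performance among unsupervised methods, is competitive even with supervised approaches, and applies to general 6-DoF motion scenarios.

Insight: Implicit regularization and geometric constraints can resolve the ill-posedness of flow and motion estimation for event cameras while avoiding the complexity of explicit depth estimation and the risk of local minima.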

Abstract: The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter, which parametrizes the optical flow in terms of the scene depth and the camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes egomotion and optical flow via implicit spatial-temporal and geometric regularization. First, by modeling camera’s egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility to general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive even with supervised approaches.

[92] PET Head Motion Estimation Using Supervised Deep Learning with Attention

Zhuotong Cai,Tianyi Zeng,Jiazhen Zhang,Eléonore V. Lieffrig,Kathryn Fontaine,Chenyu You,Enette Mae Revilla,James S. Duncan,Jingmin Xin,Yihuan Lu,John A. Onofrey

Main category: cs.CV

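TL;DR: The paper proposes DL-HMC++, a deep learning method with cross-attention for PET head motion estimation. Trained on existing dynamic PET data with gold-standard measurements from external hardware motion tracking, it corrects head motion effectively, and its effectiveness and generalization are validated across multiple PET scanners and radiotracers.

Details Motivation: Head motion in PET imaging causes image artifacts and inaccurate quantification of tracer uptake, which affects the diagnosis of neurological disorders. Hardware-based motion tracking (HMT) has limited applicability in clinical practice, so a data-driven solution is needed.

Contribution: Proposes DL-HMC++, which combines deep learning with cross-attention for PET head motion estimation; validates it across multiple scanners and tracers, where it outperforms existing data-driven methods.

Method: DL-HMC++ is trained in a supervised manner, using dynamic PET data and gold-standard HMT measurements to predict rigid head motion, with a cross-attention mechanism.

Result: DL-HMC++ substantially reduces motion artifacts, with image quality close to gold-standard HMT; average difference ratios are 1.2 ± 0.5% on HRRT and 0.5 ± 0.2% on mCT.

Insight: Data-driven PET head motion correction has the potential to remove the burden of HMT and lower the barrier to clinical use.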

Abstract: Head movement poses a significant challenge in brain positron emission tomography (PET) imaging, resulting in image artifacts and tracer uptake quantification inaccuracies. Effective head motion estimation and correction are crucial for precise quantitative image analysis and accurate diagnosis of neurological disorders. Hardware-based motion tracking (HMT) has limited applicability in real-world clinical practice. To overcome this limitation, we propose a deep-learning head motion correction approach with cross-attention (DL-HMC++) to predict rigid head motion from one-second 3D PET raw data. DL-HMC++ is trained in a supervised manner by leveraging existing dynamic PET scans with gold-standard motion measurements from external HMT. We evaluate DL-HMC++ on two PET scanners (HRRT and mCT) and four radiotracers (18F-FDG, 18F-FPEB, 11C-UCB-J, and 11C-LSN3172176) to demonstrate the effectiveness and generalization of the approach in large cohort PET studies. Quantitative and qualitative results demonstrate that DL-HMC++ consistently outperforms state-of-the-art data-driven motion estimation methods, producing motion-free images with clear delineation of brain structures and reduced motion artifacts that are indistinguishable from gold-standard HMT. Brain region of interest standard uptake value analysis exhibits average difference ratios between DL-HMC++ and gold-standard HMT to be 1.2 ± 0.5% for HRRT and 0.5 ± 0.2% for mCT. DL-HMC++ demonstrates the potential for data-driven PET head motion correction to remove the burden of HMT, making motion correction accessible to clinical populations beyond research settings. The code is available at https://github.com/maxxxxxxcai/DL-HMC-TMI.

[93] AnyUp: Universal Feature Upsampling

Thomas Wimmer,Prune Truong,Marie-Julie Rakotosaona,Michael Oechsle,Federico Tombari,Bernt Schiele,Jan Eric Lenssen

Main category: cs.CV

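TL;DR: AnyUp is a universal feature upsampling method that applies to any vision feature at any resolution without encoder-specific training, removing the need to retrain that limits existing methods.

Details Motivation: Existing learning-based upsamplers for features such as DINO or CLIP must be retrained for every feature extractor and do not generalize to different feature types at inference time. AnyUp aims to provide a universal upsampling solution.

Contribution: Proposes AnyUp, an inference-time feature-agnostic upsampling architecture that generalizes across feature types, improves upsampling quality, and is efficient and easy to apply to downstream tasks.

Method: AnyUp uses a feature-agnostic upsampling architecture that needs no encoder-specific training and supports upsampling for arbitrary resolutions and feature types.

Result: Experiments show that AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics.

Insight: A universal upsampler can simplify model deployment and extension without sacrificing performance, making it suitable for a wide range of vision tasks.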

Abstract: We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

[94] Efficient Perceptual Image Super Resolution: AIM 2025 Study and Benchmark

Bruno Longarela,Marcos V. Conde,Alvaro Garcia,Radu Timofte

Main category: cs.CV

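TL;DR: This paper presents a comprehensive study and benchmark on efficient perceptual super-resolution. Efficient PSNR-oriented super-resolution has progressed significantly, but perception-oriented methods remain relatively inefficient. The goal is to match or surpass the perceptual results of Real-ESRGAN under strict efficiency constraints (at most 5M parameters and 2000 GFLOPs).

Details Motivation: Efficient super-resolution research has focused mainly on PSNR, while methods targeting perceptual quality remain inefficient. The study aims to close this gap and advance efficient perceptual super-resolution.

Contribution: 1) Introduces a new dataset and benchmark for evaluating efficient perceptual super-resolution methods; 2) shows that Real-ESRGAN can be outperformed under strict efficiency constraints; 3) establishes modern baselines for efficient perceptual super-resolution.

Method: Proposed solutions must improve perceptual quality under strict limits on parameters and computation, with GFLOPs calculated for a 960x540 input. The test set consists of 4K images degraded with multiple degradation types, reflecting realistic deployment conditions.

Result: On the 500 test images, the top-performing approach outperforms Real-ESRGAN across all benchmark datasets, demonstrating the potential of efficient methods for perceptual super-resolution.

Insight: Even under strict computation and parameter limits, perceptual super-resolution quality can be improved substantially, which provides a useful reference for future research and applications.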

Abstract: This paper presents a comprehensive study and benchmark on Efficient Perceptual Super-Resolution (EPSR). While significant progress has been made in efficient PSNR-oriented super resolution, approaches focusing on perceptual quality metrics remain relatively inefficient. Motivated by this gap, we aim to replicate or improve the perceptual results of Real-ESRGAN while meeting strict efficiency constraints: a maximum of 5M parameters and 2000 GFLOPs, calculated for an input size of 960x540 pixels. The proposed solutions were evaluated on a novel dataset consisting of 500 test images of 4K resolution, each degraded using multiple degradation types, without providing the original high-quality counterparts. This design aims to reflect realistic deployment conditions and serves as a diverse and challenging benchmark. The top-performing approach manages to outperform Real-ESRGAN across all benchmark datasets, demonstrating the potential of efficient methods in the perceptual domain. This paper establishes the modern baselines for efficient perceptual super resolution.
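The efficiency constraints above (at most 5M parameters and 2000 GFLOPs at a 960x540 input) can be checked with back-of-the-envelope arithmetic before any training. The sketch below is a rough estimate under stated assumptions: the layer list is a hypothetical plain convolutional network, every convolution runs at the input resolution, and one multiply-accumulate is counted as two FLOPs. The challenge's official counting tool and convention are not given in the abstract and may differ.

```python
MAX_PARAMS = 5_000_000   # parameter limit from the abstract
MAX_GFLOPS = 2000.0      # compute limit from the abstract (960x540 input)


def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a standard 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def conv_gflops(in_channels, out_channels, kernel_size, height, width):
    """Approximate GFLOPs of one convolution, counting 1 MAC as 2 FLOPs."""
    macs = kernel_size * kernel_size * in_channels * out_channels * height * width
    return 2.0 * macs / 1e9


def total_params(layers):
    return sum(conv_params(cin, cout, k) for cin, cout, k in layers)


def within_budget(layers, height, width):
    """Check a list of (in_channels, out_channels, kernel_size) convolutions."""
    gflops = sum(conv_gflops(cin, cout, k, height, width) for cin, cout, k in layers)
    return total_params(layers) <= MAX_PARAMS and gflops <= MAX_GFLOPS


if __name__ == "__main__":
    # Hypothetical network: head conv, eight body convs, output conv.
    layers = [(3, 64, 3)] + [(64, 64, 3)] * 8 + [(64, 48, 3)]
    print("parameters:", total_params(layers))
    print("within budget:", within_budget(layers, 540, 960))
```

A check like this only bounds convolutional cost; attention layers, upsampling, and activation functions would need their own terms in a real submission.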

[95] What If: Understanding Motion Through Sparse Interactions

Stefan Andreas Baumann,Nick Stracke,Timy Phan,Björn Ommer

Main category: cs.CV

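TL;DR: The paper proposes the Flow Poke Transformer (FPT), a new framework that directly predicts the distribution of local motion conditioned on sparse interactions ("pokes"). It represents multi-modal motion and its uncertainty, and performs well on several downstream tasks.

Details Motivation: Existing methods typically only allow dense sampling of a single realization of scene dynamics and cannot directly express multi-modal motion and its uncertainty. FPT addresses this by understanding scene dynamics through sparse interactions.

Contribution: 1. Proposes the FPT framework, which directly predicts local motion distributions; 2. Supports multi-modal motion representation and uncertainty modeling; 3. Outperforms specialized baselines on tasks such as dense face motion generation and articulated object motion estimation.

Method: FPT conditions on sparse interactions ("pokes") and uses a transformer to directly predict the distribution of local motion, giving an interpretable multi-modal motion representation.

Result: FPT performs well across tasks: 1. On dense face motion generation it surpasses specialized baselines; 2. After fine-tuning on synthetic datasets, it improves over in-domain methods on articulated object motion estimation; 3. It is competitive on moving part segmentation.

Insight: Sparse interactions are an effective conditioning signal for understanding complex scene dynamics, and a transformer with multi-modal outputs improves the flexibility and generalization of motion modeling.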

Abstract: Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed “pokes”. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable directly accessible representation of multi-modal scene motion, its dependency on physical interactions and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes which further demonstrates the versatility of our FPT. Code and models are publicly available at https://compvis.github.io/flow-poke-transformer.

[96] SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Weiyang Jin,Yuwei Niu,Jiaqi Liao,Chengqi Duan,Aoxue Li,Shenghua Gao,Xihui Liu

Main category: cs.CV

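TL;DR: SRUM is a self-rewarding post-training framework in which the understanding module rewards the generation module, enabling unified multimodal models (UMMs) to improve themselves without additional human-labeled data.

Details Motivation: Existing UMMs show a gap between visual understanding and visual generation. The authors use the model's own understanding module to provide feedback signals that strengthen the generation module.

Contribution: 1. Proposes the SRUM framework, which uses the model's internal understanding module as an evaluator; 2. Designs a global-local dual reward system; 3. Improves performance on multiple benchmarks.

Method: SRUM provides multi-scale guidance through a global reward (ensuring correct overall semantics and layout) and a local reward (refining object-level fidelity).

Result: Performance improves from 82.18 to 88.37 on T2I-CompBench and from 43.82 to 46.75 on T2I-ReasonBench.

Insight: With a self-rewarding mechanism, a model's generation ability can improve without additional human-labeled data.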

Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model’s strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model’s own understanding module acts as an internal “evaluator”, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm for enabling a UMM’s understanding module to guide and enhance its own generation via self-rewarding.

[97] MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

Felix Taubner,Ruihang Zhang,Mathieu Tuli,Sherwin Bahmani,David B. Lindell

Main category: cs.CV

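TL;DR: MVP4D generates multi-view, animatable 4D digital human avatars from a single reference image, using a pre-trained video diffusion model to produce view-consistent videos over up to 360 degrees, with clear gains in realism and 3D consistency.

Details Motivation: Conventional avatar creation is expensive and depends on multi-view capture rigs, while existing single-view methods lose quality and realism when rendered from viewpoints far from the reference. MVP4D uses multi-view video generation to improve consistency.

Contribution: 1. MVP4D, a model that generates animatable multi-view videos from a single reference image; 2. View-consistent output over up to 360 degrees via a video diffusion model; 3. Distillation of the results into a 4D avatar that renders in real time.

Method: MVP4D builds on a pre-trained video diffusion model, takes a single reference image and target expressions as input, and generates multi-view video frames. The video outputs are then distilled into a 4D avatar.

Result: Experiments show that MVP4D significantly improves realism, temporal consistency, and 3D consistency compared with previous methods.

Insight: Combining video diffusion models with multi-view generation offers a new route to low-cost, high-quality digital human avatars.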

Abstract: Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.

[98] Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report

Daniel Feijoo,Paula Garrido-Mellado,Marcos V. Conde,Jaesung Rim,Alvaro Garcia,Sunghyun Cho,Radu Timofte

Main category: cs.CV

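TL;DR: This paper reviews the AIM 2025 challenge on efficient real-world deblurring from single images, which aims to advance efficient restoration of real blurred images. The challenge uses a new test set based on the RSBlur dataset, and participants had to develop efficient deblurring methods under strict compute limits.

Details Motivation: Restoring blurred images from real scenes requires efficient algorithms that can serve real-time applications. Existing methods often fall short of practical deployment requirements in computational efficiency and model size.

Contribution: The challenge introduced a new test set based on the RSBlur dataset and evaluated submissions under strict efficiency constraints (fewer than 5 million parameters and under 200 GMACs of computation).

Method: Submitted solutions had to balance computational efficiency against deblurring performance.

Result: The best solution reached a PSNR of 31.1298 dB, demonstrating the potential of efficient methods.

Insight: Strict efficiency constraints can drive the development of lighter and more practical deblurring algorithms and provide a reference for real-world applications.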

Abstract: This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance efficient real-blur restoration. The challenge uses a new test set based on the well-known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participants were tasked with developing solutions to effectively deblur these types of images while fulfilling strict efficiency constraints: fewer than 5 million model parameters and a computational budget under 200 GMACs. A total of 71 participants registered, with 4 teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 31.1298 dB, showcasing the potential of efficient methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers in efficient real-world image deblurring.
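The ranking metric and the two efficiency limits quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: `psnr` uses the standard definition 10·log10(MAX²/MSE) on flat pixel lists, and `within_budget` is a hypothetical helper name; the challenge's official evaluation script is not described in the abstract.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def within_budget(num_params, gmacs, max_params=5_000_000, max_gmacs=200.0):
    """Hypothetical check for the stated limits: fewer than 5M parameters, under 200 GMACs."""
    return num_params < max_params and gmacs < max_gmacs
```

A submission with 4.9 million parameters and 150 GMACs would pass `within_budget`, while one with exactly 5 million parameters would not, since the limit is stated as "fewer than".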

[99] UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Kevin Li,Manuel Brack,Sudeep Katakol,Hareesh Ravi,Ajinkya Kale

Main category: cs.CV

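TL;DR: UniFusion is a diffusion-based generative model that uses a frozen large vision-language model (VLM) as a unified multimodal encoder. Its Layerwise Attention Pooling (LAP) mechanism extracts both high-level semantics and low-level details, and the VERIFI method adds flexibility at inference.

Details Motivation: Existing generative models usually use separate encoders for images and text, which limits cross-modal reasoning and knowledge transfer. UniFusion addresses this with a unified VLM encoder.

Contribution: 1. The LAP mechanism, which extracts multi-level information from a frozen VLM;
2. The VERIFI method, which improves inference flexibility and alignment;
3. Strong zero-shot generalization from training on single-image editing.

Method: 1. Use LAP to extract high-level semantics and low-level details from the VLM's text and visual tokens;
2. With VERIFI, condition the diffusion transformer (DiT) on the text tokens the VLM generates;
3. Fine-tune on an editing task to improve cross-modal alignment.

Result: On text-image alignment for generation, LAP outperforms other shallow fusion architectures. After fine-tuning on editing, the model generalizes strongly and adapts zero-shot to tasks with multiple reference images.

Insight: A unified VLM encoder supports cross-modal knowledge transfer, and the combination of LAP and VERIFI improves the generative model's inference flexibility and alignment.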

Abstract: Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models’ ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last-layer information from a VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting accessibility. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism that extracts both high-level semantics and low-level details from text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and faithful transfer of visual information from VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM’s reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on an editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model, when trained on single-image editing, generalizes zero-shot to multiple image references, further motivating the unified encoder design of UniFusion.

[100] ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Long Cui,Weiyun Wang,Jie Shao,Zichen Wen,Gen Luo,Linfeng Zhang,Yanting Zhang,Yu Qiao,Wenhai Wang

Main category: cs.CV

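TL;DR: ViCO is a training strategy for dynamic high-resolution vision driven by semantic complexity. MLP connectors with different compression ratios reduce the number of vision tokens while preserving model performance.

Details Motivation: Existing MLLMs incur higher inference costs because of the extra vision tokens that image inputs introduce. ViCO addresses this by dynamically adjusting the number of vision tokens.

Contribution: The ViCO training algorithm and the ViR image router, which adjust the number of vision tokens according to each image's semantic complexity and substantially lower inference cost.

Method: Multiple MLP connectors with different compression ratios downsample the vision tokens, and training enforces consistency by minimizing KL divergence. At inference, ViR selects the compression rate dynamically.

Result: Experiments show that ViCO can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.

Insight: Adjusting the number of vision tokens by semantic complexity works better than adjusting it by image resolution; semantic complexity is the key factor.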

Abstract: Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model’s perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.
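The training objective described above, minimizing KL divergence between responses conditioned on different connectors, can be illustrated with a toy calculation. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the logits are plain Python lists, and the KL direction (reference connector first, compressed connector second) is an assumption, since the abstract does not specify it.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(reference_logits, compressed_logits):
    """Mean per-token KL between next-token logits from two connectors (assumed direction)."""
    pairs = list(zip(reference_logits, compressed_logits))
    return sum(kl_divergence(softmax(a), softmax(b)) for a, b in pairs) / len(pairs)
```

When both connectors yield identical logits the loss is zero, and it grows as the response distribution under the more aggressive compression drifts from the reference.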

[101] CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

Caner Korkmaz,Brighton Nuwagira,Barış Coşkunuzer,Tolga Birdal

Main category: cs.CV

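TL;DR: CuMPerLay is a novel differentiable vectorization layer that integrates Cubical Multiparameter Persistence (CMP) into deep learning pipelines. It overcomes the complexity of multifiltration structures and improves image classification and segmentation in a learnable way.

Details Motivation: CMP offers a powerful topological tool for image analysis, but its use is limited by the complexity of multifiltration structures and the difficulty of vectorization. CuMPerLay aims to solve these problems so that CMP can be integrated efficiently into modern deep learning.

Contribution: 1. A new algorithm that decomposes CMP into learnable single-parameter persistence; 2. CuMPerLay, a differentiable vectorization layer that supports backpropagation; 3. A theoretical proof that the vectorization is stable under generalized Wasserstein metrics.

Method: CuMPerLay decomposes multiparameter persistence into jointly learned single-parameter persistence and produces robust topological feature vectors through a differentiable architecture. These features can be used directly in state-of-the-art models such as Swin Transformers.

Result: On benchmark medical imaging and computer vision datasets, CuMPerLay improves classification and segmentation performance, particularly in limited-data scenarios.

Insight: CuMPerLay shows the potential of combining topological features with deep learning and offers an effective way to integrate global structural information into structured image analysis.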

Abstract: We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In the face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistence, where the bifiltration functions are jointly learned. Thanks to the differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging and computer vision datasets show the benefit of CuMPerLay on classification and segmentation performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.

[102] DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li,Shuyao Shang,Weisong Liu,Bing Zhan,Haochen Wang,Yuqi Wang,Yuntao Chen,Xiaoman Wang,Yasong An,Chufeng Tang,Lu Hou,Lue Fan,Zhaoxiang Zhang

Main category: cs.CV

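TL;DR: DriveVLA-W0 uses world modeling to predict future images, addressing the supervision deficit of VLA models and significantly improving autonomous driving performance and data scaling.

Details Motivation: VLA models for autonomous driving suffer from insufficient supervision, which leaves much of their capacity underused. The authors use world modeling to generate dense self-supervised signals and improve model performance.

Contribution: The DriveVLA-W0 training paradigm, which adds world modeling to predict future images, with two instantiations (autoregressive and diffusion) and a lightweight action expert that reduces inference latency.

Method: World modeling generates self-supervised signals. An autoregressive world model is designed for VLAs that use discrete visual tokens and a diffusion world model for those that use continuous visual features, combined with a lightweight action expert for real-time performance.

Result: On the NAVSIM v1/v2 benchmark and a larger in-house dataset, DriveVLA-W0 significantly outperforms BEV and VLA baselines and shows clear data scaling gains.

Insight: World modeling markedly improves VLA performance, especially at large data scale, where performance gains accelerate as the dataset grows.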

Abstract: Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm’s versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.

[103] Detect Anything via Next Point Prediction

Qing Jiang,Junan Huo,Xingyu Chen,Yuda Xiong,Zhaoyang Zeng,Yihao Chen,Tianhe Ren,Junzhi Yu,Lei Zhang

Main category: cs.CV

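TL;DR: This paper proposes Rex-Omni, an object detection approach based on a multimodal large language model (MLLM). Through careful task formulation, data generation, and training pipeline design, it matches or exceeds traditional regression-based models in the zero-shot setting while supporting many visual tasks.

Details Motivation: Traditional coordinate regression models (e.g., YOLO, DETR) dominate object detection. MLLMs have been tried for detection but perform poorly, with problems such as low recall and duplicate predictions. This paper aims to bridge that gap with a more effective MLLM approach.

Contribution: 1) Rex-Omni, a 3B-scale MLLM with strong zero-shot detection performance; 2) Special tokens for representing coordinates, which reduce learning difficulty; 3) Multiple data engines and a two-stage training pipeline that improve performance.

Method: 1) Task formulation: special tokens for quantized coordinates from 0 to 999; 2) Data engines: generation of high-quality training data; 3) Training pipeline: two stages (supervised fine-tuning followed by GRPO-based reinforcement learning) with geometry-aware rewards.

Result: On benchmarks such as COCO and LVIS, Rex-Omni matches or exceeds regression-based models (e.g., DINO) and supports many visual tasks (e.g., OCR, keypoint detection).

Insight: With suitable task design and training, an MLLM can substantially improve object detection while retaining its language understanding, opening new directions for multi-task visual systems.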

Abstract: Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model’s learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni’s inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
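The task formulation above, representing each coordinate as one of 1000 discrete tokens, can be sketched as follows. This is an illustrative assumption of how such quantization might work: the `<N>` token spelling, the function names, and the bin-centre decoding are hypothetical and not taken from the paper.

```python
def quantize(value, size, bins=1000):
    """Map a pixel coordinate in [0, size] to an integer bin in 0..bins-1."""
    if size <= 0:
        raise ValueError("size must be positive")
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * bins), bins - 1)   # the far edge falls into the last bin

def dequantize(bin_index, size, bins=1000):
    """Recover an approximate pixel coordinate from a bin (bin centre)."""
    return (bin_index + 0.5) / bins * size

def box_to_tokens(box, width, height):
    """Encode an (x0, y0, x1, y1) box as four coordinate tokens (hypothetical spelling)."""
    x0, y0, x1, y1 = box
    indices = [quantize(x0, width), quantize(y0, height),
               quantize(x1, width), quantize(y1, height)]
    return [f"<{i}>" for i in indices]
```

With this scheme a box costs four tokens regardless of image resolution, at the price of a rounding error bounded by one bin width (size / 1000 pixels); the abstract attributes the closing of this discrete-to-continuous gap to the RL post-training stage.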

Kartik Narayan,Yang Xu,Tian Cao,Kavya Nerella,Vishal M. Patel,Navid Shiee,Peter Grasch,Chao Jia,Yinfei Yang,Zhe Gan

Main category: cs.CV

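TL;DR: DeepMMSearch-R1 is a multimodal large language model (MLLM) that dynamically crafts queries for multi-turn web search and is trained in two stages (supervised fine-tuning and reinforcement learning). It addresses the inefficiency of existing methods and introduces a new dataset, DeepMMSearchVQA.

Details Motivation: Existing approaches (e.g., RAG) suffer from rigid pipelines, excessive search calls, and poorly constructed queries in multimodal web search, which leads to inefficiency and suboptimal results.

Contribution: 1. The first multimodal LLM that supports on-demand, multi-turn web search; 2. Dynamic query generation and search based on image crops; 3. A two-stage training method and the DeepMMSearchVQA dataset.

Method: 1. A cold-start supervised fine-tuning phase; 2. Online reinforcement learning optimization; 3. Training on DeepMMSearchVQA, which combines automated generation with real-world information from web search.

Result: The approach shows superior performance across several knowledge-intensive benchmarks.

Insight: Dynamic query generation and image-region search are key improvements for multimodal web search, and two-stage training balances performance and efficiency.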

Abstract: Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.

cs.CR [Back]

[105] Deep Research Brings Deeper Harm

Shuo Chen,Zonggen Li,Zhen Han,Bailan He,Tong Liu,Haokun Chen,Georg Groh,Philip Torr,Volker Tresp,Jindong Gu

Main category: cs.CR

TL;DR: Deep Research (DR) agents based on LLMs pose elevated risks due to their ability to synthesize dangerous forbidden knowledge, necessitating novel safety analyses beyond traditional LLM jailbreak methods.

Details Motivation: The misuse of DR agents in high-stakes domains (e.g., biosecurity) highlights gaps in existing safety measures, prompting the need for specialized jailbreak strategies and evaluations.

Contribution: Proposes Plan Injection and Intent Hijack jailbreak strategies for DR agents, uncovering systemic vulnerabilities in multi-step planning and execution.

Method: Introduces two novel jailbreak techniques targeting DR agents’ research capabilities, followed by experiments across LLMs and safety benchmarks to evaluate alignment failures.

Result: Reveals that DR agents bypass traditional safeguards, producing more coherent and dangerous content, exposing fundamental misalignment issues.

Insight: DR agents require alignment techniques tailored to their multi-step planning abilities, as traditional LLM safeguards are insufficient.

Abstract: Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, the misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes and knowledge-intensive domains such as biosecurity, where DR can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods designed for LLMs fall short in exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent’s plan; and Intent Hijack, which reframes harmful queries as academic research questions. We conducted extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal 3 key findings: (1) Alignment of the LLMs often fails in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) Multi-step planning and execution weaken the alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content, compared with standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to DR agents. Code and datasets are available at https://chenxshuo.github.io/deeper-harm.

cs.LG [Back]

[106] Don’t Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball,Andreas Haupt

Main category: cs.LG

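TL;DR: This paper proposes Boundary Guidance, a reinforcement learning fine-tuning method that steers a generative model away from outputs near a classifier's decision boundary, improving both safety and utility.

Details Motivation: Existing approaches fine-tune the generator to reduce the probability of being filtered by the classifier, but this pushes generated content toward the classifier's boundary and raises the misclassification rate.

Contribution: Boundary Guidance, which explicitly steers generation away from the classifier's margin and improves the safety and utility of outputs.

Method: Reinforcement learning fine-tuning that uses the classifier's boundary information to guide generation away from regions prone to misclassification.

Result: On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs.

Insight: Generating content away from the classifier's boundary reduces misclassification while preserving the utility of the outputs.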

Abstract: Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier’s decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier’s margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

[107] Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities

Urs Spiegelhalter,Jörg K. H. Franke,Frank Hutter

Main category: cs.LG

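TL;DR: The paper studies how to balance synthetic data generation and the replay ratio when adapting to new tasks, in order to optimize task performance and knowledge retention. Experiments on the bAbI reasoning tasks identify an optimal configuration and yield practical guidance.

Details Motivation: Language models adapting to new tasks must learn new capabilities while avoiding catastrophic forgetting. The best combination of synthetic data generation and replay ratio remains unclear, especially under limited compute.

Contribution: A systematic study of how replay ratio and computational budget affect task performance and knowledge retention, together with practical guidelines based on computational budget.

Method: With the bAbI reasoning tasks as the target, the study applies synthetic data generation and systematically evaluates different total token budgets and replay ratio configurations, analyzing their effects on task mastery and knowledge retention.

Result: The experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention and provide practical guidelines for reducing training cost.

Insight: A well-chosen replay ratio can significantly improve the efficiency of adapting a model to new tasks while avoiding forgetting.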

Abstract: Adapting language models to new tasks through continued pretraining faces a fundamental trade-off: models must learn new capabilities while avoiding catastrophic forgetting of existing knowledge. While prior work has studied synthetic data generation techniques, the optimal replay ratios for balancing task performance and knowledge retention under computational constraints remain poorly understood. We present a comprehensive empirical study investigating the interplay between replay ratio configuration and computational budget when adapting language models to new tasks. Using the bAbI reasoning tasks as our target objective, we apply synthetic data generation and systematically evaluate different total token budgets and replay ratio configurations. We analyze their effects on both task mastery and general knowledge retention. Our experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention. Based on our findings, we provide empirically-grounded guidelines for selecting replay ratios based on computational budget, enabling practitioners to achieve strong task adaptation with significantly reduced training costs.
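The two knobs studied above, total token budget and replay ratio, interact through a simple split. The sketch below assumes that the replay ratio means the fraction of training tokens drawn from general (replayed) data, with the remainder coming from synthetic task data; the abstract does not define the term precisely, and `split_token_budget` is a hypothetical helper.

```python
def split_token_budget(total_tokens, replay_ratio):
    """Split a training token budget between task data and replayed general data.

    replay_ratio is assumed to be the fraction of tokens taken from replay data.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must lie in [0, 1]")
    replay_tokens = int(round(total_tokens * replay_ratio))
    return {"task": total_tokens - replay_tokens, "replay": replay_tokens}
```

For a fixed budget, raising the replay ratio protects general knowledge at the expense of task tokens, which is the trade-off the study sweeps over.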

[108] Demystifying Hybrid Thinking: Can LLMs Truly Switch Between Think and No-Think?

Shouren Wang,Wang Yang,Xianxuan Long,Qifan Wang,Vipin Chaudhary,Xiaotian Han

Main category: cs.LG

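TL;DR: The paper exposes the limitations of current hybrid-thinking LLMs when switching between reasoning and direct-answer modes, and proposes an improved recipe that significantly reduces both the output length in direct-answer mode and the frequency of reasoning-supportive tokens.

Details Motivation: To study behavior leakage when hybrid-thinking LLMs switch between reasoning and direct-answer modes, in order to improve their controllability and efficiency.

Contribution: 1. Identifies four key factors that affect the controllability of hybrid-thinking LLMs; 2. Proposes a two-phase training strategy; 3. Shows experimentally that the method significantly reduces direct-answer output length and the use of reasoning-supportive tokens.

Method: 1. Analyze how data scale and data source affect mode separation; 2. Design a two-phase training strategy (first train reasoning ability, then apply hybrid-thinking training); 3. Validate the method on the MATH500 dataset.

Result: Compared with standard training, the method maintains accuracy in both modes while significantly reducing direct-answer output length (from 1085 to 585) and occurrences of reasoning-supportive tokens (from 5917 to 522).

Insight: Mode separation in current hybrid-thinking LLMs is incomplete, but better data design and training strategy can effectively improve controllability.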

Abstract: Hybrid thinking enables LLMs to switch between reasoning and direct answering, offering a balance between efficiency and reasoning capability. Yet our experiments reveal that current hybrid thinking LLMs only achieve partial mode separation: reasoning behaviors often leak into the no-think mode. To understand and mitigate this, we analyze the factors influencing controllability and identify four that matter most: (1) larger data scale, (2) using think and no-think answers from different questions rather than the same question, (3) a moderate increase in the amount of no-think data, and (4) a two-phase strategy that first trains reasoning ability and then applies hybrid think training. Building on these findings, we propose a practical recipe that, compared to standard training, can maintain accuracy in both modes while significantly reducing no-think output length (from 1085 to 585 on MATH500) and occurrences of reasoning-supportive tokens such as "wait" (from 5917 to 522 on MATH500). Our findings highlight the limitations of current hybrid thinking and offer directions for strengthening its controllability.
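The two leakage measurements reported above (no-think output length and occurrences of reasoning-supportive tokens such as "wait") could be computed along these lines. This is a rough sketch under assumptions: it counts whole-word, case-insensitive matches and measures length in whitespace-separated words, whereas the paper presumably counts model tokens; both helper names are hypothetical.

```python
import re

def count_marker(outputs, marker="wait"):
    """Count whole-word, case-insensitive occurrences of a marker across outputs."""
    pattern = re.compile(rf"\b{re.escape(marker)}\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in outputs)

def mean_length(outputs):
    """Average output length in whitespace-separated words (a proxy for tokens)."""
    return sum(len(text.split()) for text in outputs) / len(outputs)
```

Running both over the no-think outputs of two checkpoints gives a quick before-and-after comparison of how much reasoning behavior leaks into direct answers.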

[109] Your VAR Model is Secretly an Efficient and Explainable Generative Classifier

Yi-Chung Chen,David I. Inouye,Jing Gao

Main category: cs.LG

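TL;DR: This paper proposes A-VARC+, a new generative classifier built on visual autoregressive (VAR) modeling. It offers a better trade-off between classification accuracy and inference speed than diffusion-based classifiers, along with greater explainability and resistance to catastrophic forgetting.

Details Motivation: Current generative classifiers rely mainly on diffusion models, whose high computational cost limits scalability. By introducing VAR models, the authors aim to give a new perspective on generative classifiers and to address both efficiency and the limited understanding of these classifiers.

Contribution: 1. A new generative classifier based on VAR modeling; 2. A-VARC+, which balances accuracy and speed; 3. Evidence that the VAR-based method has distinctive properties in explainability and resistance to catastrophic forgetting.

Method: The classifier is built from the conditional generative ability of the VAR model and improved with an adaptive strategy (A-VARC+). Token-wise mutual information provides visual explainability.

Result: A-VARC+ performs well on classification, runs faster and more efficiently than diffusion-based classifiers, and shows stronger resistance to catastrophic forgetting.

Insight: Because of their tractable likelihood, VAR models open new research directions for the explainability and robustness of generative classifiers.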

Abstract: Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost severely limits scalability. This exclusive focus on diffusion-based methods has also constrained our understanding of generative classifiers. In this work, we propose a novel generative classifier built on recent advances in visual autoregressive (VAR) modeling, which offers a new perspective for studying generative classifiers. To further enhance its performance, we introduce the Adaptive VAR Classifier+ (A-VARC+), which achieves a superior trade-off between accuracy and inference speed, thereby significantly improving practical applicability. Moreover, we show that the VAR-based method exhibits fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via token-wise mutual information and demonstrates inherent resistance to catastrophic forgetting in class-incremental learning tasks.
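The general idea of a generative classifier, choosing the label whose class-conditional model assigns the input the highest likelihood, can be sketched with Bayes' rule. The example below is a toy illustration only: a unit-variance Gaussian stands in for the VAR model's tractable log-likelihood, and `TOY_MEANS`, `toy_log_likelihood`, and `classify` are hypothetical names; the adaptive A-VARC+ strategy is not reproduced here.

```python
import math

# Hypothetical class means for a toy 1-D Gaussian stand-in for p(x | y).
TOY_MEANS = {"cat": 0.0, "dog": 4.0}

def toy_log_likelihood(x, label):
    """Log density of a unit-variance Gaussian centred on the class mean."""
    mu = TOY_MEANS[label]
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

def classify(x, log_likelihood, classes, log_prior=None):
    """Pick the class maximising log p(x | y) + log p(y) (uniform prior if omitted)."""
    scores = {}
    for label in classes:
        prior = log_prior[label] if log_prior is not None else 0.0
        scores[label] = log_likelihood(x, label) + prior
    best = max(scores, key=scores.get)
    return best, scores
```

Each prediction needs one likelihood evaluation per class, which is why a model with a cheap, exact likelihood is attractive compared with diffusion models that must estimate it.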

[110] A Function Centric Perspective On Flat and Sharp Minima

Israel Mason-Williams,Gabryel Mason-Williams,Helen Yannakoudakis

Main category: cs.LG

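TL;DR: This paper revisits the role of flat and sharp minima in the generalisation of deep neural networks and argues that sharpness is a function-dependent property rather than a reliable indicator of generalisation. Regularisation often leads to sharper minima that nonetheless improve generalisation, calibration, and robustness.

Details Motivation: Although flat minima are widely believed to correlate with better generalisation, recent studies show that the link is not straightforward. The paper examines the actual role of sharp minima and their complex relationship with generalisation.

Contribution: The proposal that sharpness is a function-dependent property, supported by extensive experiments showing that the sharp minima produced by regularisation can bring better generalisation and safety.

Method: Empirical studies ranging from single-objective optimisation to modern image classification, analysing how regularisation methods (e.g., SAM, weight decay, data augmentation) affect the sharpness of minima and model performance.

Result: Unregularised baselines tend to converge to flatter minima yet perform worse on generalisation and safety metrics, while the sharper minima produced by regularisation perform better.

Insight: The geometry of solutions is governed by function complexity rather than flatness, and sharper minima may reflect more appropriate inductive biases, especially under regularisation.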

Abstract: Flat minima are widely believed to correlate with improved generalisation in deep neural networks. However, this connection has proven more nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance, proposing that sharpness is better understood as a function-dependent property rather than a reliable indicator of poor generalisation. We conduct extensive empirical studies, from single-objective optimisation to modern image classification tasks, showing that sharper minima often emerge when models are regularised (e.g., via SAM, weight decay, or data augmentation), and that these sharp minima can coincide with better generalisation, calibration, robustness, and functional consistency. Across a range of models and datasets, we find that baselines without regularisation tend to converge to flatter minima yet often perform worse across all safety metrics. Our findings demonstrate that function complexity, rather than flatness alone, governs the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases (especially under regularisation), calling for a function-centric reappraisal of loss landscape geometry.

q-bio.NC [Back]

[111] MAPS: Masked Attribution-based Probing of Strategies- A computational framework to align human and model explanations

Sabine Muzellec,Yousif Kashef Alghetaa,Simon Kornblith,Kohitij Kar

Main category: q-bio.NC

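TL;DR: MAPS is a computational framework for testing whether explanations derived from artificial neural networks (ANNs) align with human visual strategies. It converts attribution maps into explanation-masked images (EMIs) and compares human accuracy on these images to validate explanation methods.

Details Motivation: Human core object recognition depends on the selective use of visual information, but these strategies are hard to measure directly. MAPS tests their consistency by comparing ANN explanations with human behavior.

Contribution: The MAPS framework, which uses EMIs with limited pixel budgets to test the behavioral validity of ANN explanation methods and provides a scalable tool linking human behavior, neural activity, and model decisions.

Method: Convert ANN attribution maps into EMIs and evaluate each explanation method by the difference between human performance on the EMIs and on the original images.

Result: In simulation, MAPS recovers the ground-truth similarity of model attribution maps. In human and macaque experiments, it identifies the ANN-explanation combinations that align most closely with biological vision.

Insight: MAPS avoids exhaustive psychophysics. It needs only model attributions and a modest amount of behavioral data to evaluate explanation methods, offering a scalable standard for comparing them.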

Abstract: Human core object recognition depends on the selective use of visual information, but the strategies guiding these choices are difficult to measure directly. We present MAPS (Masked Attribution-based Probing of Strategies), a behaviorally validated computational tool that tests whether explanations derived from artificial neural networks (ANNs) can also explain human vision. MAPS converts attribution maps into explanation-masked images (EMIs) and compares image-by-image human accuracies on these minimal images with limited pixel budgets with accuracies on the full stimuli. MAPS provides a principled way to evaluate and choose among competing ANN interpretability methods. In silico, EMI-based behavioral similarity between models reliably recovers the ground-truth similarity computed from their attribution maps, establishing which explanation methods best capture the model’s strategy. When applied to humans and macaques, MAPS identifies ANN-explanation combinations whose explanations align most closely with biological vision, achieving the behavioral validity of Bubble masks while requiring far fewer behavioral trials. Because it needs only access to model attributions and a modest set of behavioral data on the original images, MAPS avoids exhaustive psychophysics while offering a scalable tool for adjudicating explanations and linking human behavior, neural activity, and model decisions under a common standard.

cs.RO [Back]

[112] Fast Visuomotor Policy for Robotic Manipulation

Jingkai Jia,Tong Yang,Xueyao Chen,Chenhuan Liu,Wenqiang Zhang

Main category: cs.RO

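TL;DR: This paper proposes Energy Policy, a fast visuomotor policy framework designed for high-frequency robotic tasks and resource-constrained systems. It predicts multimodal actions in a single forward pass, enabling high-precision manipulation at high speed.

Details Motivation: Existing robotic policies fall short in high-frequency tasks and resource-constrained settings. A lightweight solution is needed that offers both fast inference and multimodal action prediction.

Contribution: 1. The Energy Policy framework, which uses the energy score as the learning objective to support multimodal action modeling; 2. The Energy MLP, a simple and efficient design that implements the energy objective.

Method: 1. Adopt the energy score as the learning objective; 2. Use the Energy MLP for multimodal action prediction while keeping the architecture lightweight and efficient.

Result: In simulated and real-world robotic tasks, Energy Policy matches or surpasses existing methods while significantly reducing computational overhead. On the MimicGen benchmark, it achieves better performance with faster inference.

Insight: The energy score simplifies multimodal action modeling, and the efficient Energy MLP design suits resource-constrained systems.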

Abstract: We present a fast and effective policy framework for robotic manipulation, named Energy Policy, designed for high-frequency robotic tasks and resource-constrained systems. Unlike existing robotic policies, Energy Policy natively predicts multimodal actions in a single forward pass, enabling high-precision manipulation at high speed. The framework is built upon two core components. First, we adopt the energy score as the learning objective to facilitate multimodal action modeling. Second, we introduce an energy MLP to implement the proposed objective while keeping the architecture simple and efficient. We conduct comprehensive experiments in both simulated environments and real-world robotic tasks to evaluate the effectiveness of Energy Policy. The results show that Energy Policy matches or surpasses the performance of state-of-the-art manipulation methods while significantly reducing computational overhead. Notably, on the MimicGen benchmark, Energy Policy achieves superior performance with faster inference compared to existing approaches.
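The energy score named above as the learning objective has a standard sample-based form: for predicted samples x_1..x_m and a target y, ES = mean ||x_i - y|| - 0.5 · mean over i≠j of ||x_i - x_j||. The sketch below implements that textbook estimator in plain Python; whether the paper uses exactly this variant (for example the exponent on the norm or the number of samples) is an assumption, and `energy_score` is a hypothetical name.

```python
import math

def _distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def energy_score(samples, target):
    """Empirical energy score of predicted action samples against a target action.

    Lower is better: the first term rewards samples close to the target, the
    second term rewards spread among samples, which discourages mode collapse.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("at least two samples are needed")
    fit = sum(_distance(s, target) for s in samples) / m
    spread = sum(_distance(samples[i], samples[j])
                 for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    return fit - 0.5 * spread
```

Because the score only needs samples from the policy rather than an explicit density, a network that maps observation plus noise to an action can be trained with it directly, which fits the single-forward-pass claim in the abstract.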

cs.GR [Back]

[113] GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality

Anastasiya Pechko,Piotr Borycki,Joanna Waczyńska,Daniel Barczyk,Agata Szymańska,Sławomir Tadeja,Przemysław Spurek

Main category: cs.GR

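TL;DR: GS-Verse is a mesh-based Gaussian Splatting method that enables physics-aware interaction directly in virtual reality (VR). It addresses the shortcomings of existing methods in visual fidelity and physical accuracy and simplifies the development workflow.

Details Motivation: As demand for immersive 3D content grows, existing VR physical interaction methods rely on engineering-intensive processes and simplified geometric representations, which compromise visual and physical quality.

Contribution: 1. GS-Verse, which directly integrates an object's mesh with a Gaussian Splatting representation for more precise surface approximation and realistic interaction. 2. Reuse of existing 3D mesh assets, which simplifies the development workflow. 3. A physics-engine-agnostic design for flexible deployment.

Method: Combine Gaussian Splatting with the object's mesh to improve surface approximation, with an interaction design that is independent of the physics engine.

Result: In a comparative study with 18 participants, GS-Verse is significantly better than the existing method on physics-aware stretching and more consistent on other physics-based manipulations such as twisting and shaking.

Insight: Directly combining Gaussian Splatting with meshes improves the realism and flexibility of VR interaction and offers a new tool for 3D content creation.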

Abstract: As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object’s mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.

cs.SE [Back]

[114] Task-Aware Reduction for Scalable LLM-Database Systems

Marcus Emmanuel Barnes,Taher A. Ghaleb,Safwat Hassan

Main category: cs.SE

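TL;DR: The paper proposes treating an LLM's token budget as an attention budget and using task-aware text reduction to improve the integration of LLMs with database systems, addressing the redundancy and noise of real-world data.

Details Motivation: The effectiveness of LLMs on data-intensive tasks such as database querying is limited by the volume, redundancy, and noise of the data. Feeding such data in directly is costly and misaligned with task objectives. Existing optimizations focus on the model or architecture level, while reduction of the upstream input remains underexplored.

Contribution: The main contribution is elevating task-aware text reduction to a core design principle for LLM-data systems, framing input reduction as attention allocation rather than compression, and outlining research directions such as benchmark construction and adaptive reduction pipelines.

Method: Proposes viewing the LLM token budget as an attention budget and designing task-aware text reduction strategies that prioritize the information most relevant to the downstream task.

Result: The paper reports no concrete experimental results. It lays out a research direction and technical framework aimed at efficient, accurate, and sustainable LLM-data integration.

Insight: Input reduction is key to letting LLMs process large-scale data efficiently. Task-aware reduction strategies may substantially improve LLM performance and sustainability on data-intensive tasks.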

Abstract: Large Language Models (LLMs) are increasingly applied to data-intensive workflows, from database querying to developer observability. Yet the effectiveness of these systems is constrained by the volume, verbosity, and noise of real-world text-rich data such as logs, telemetry, and monitoring streams. Feeding such data directly into LLMs is costly, environmentally unsustainable, and often misaligned with task objectives. Parallel efforts in LLM efficiency have focused on model- or architecture-level optimizations, but the challenge of reducing upstream input verbosity remains underexplored. In this paper, we argue for treating the token budget of an LLM as an attention budget and elevating task-aware text reduction as a first-class design principle for language – data systems. We position input-side reduction not as compression, but as attention allocation: prioritizing information most relevant to downstream tasks. We outline open research challenges for building benchmarks, designing adaptive reduction pipelines, and integrating token-budget–aware preprocessing into database and retrieval systems. Our vision is to channel scarce attention resources toward meaningful signals in noisy, data-intensive workflows, enabling scalable, accurate, and sustainable LLM–data integration.
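
The idea of spending a fixed token budget on the lines most relevant to the task can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the keyword-overlap relevance score, and the whitespace-based token estimate are hypothetical and are not taken from the paper, which does not specify an algorithm.

```python
import re

def approx_tokens(text):
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())

def reduce_for_task(lines, task_keywords, token_budget):
    """Keep the lines most relevant to the task, in original order, within the budget."""
    keywords = {k.lower() for k in task_keywords}
    scored = []
    for idx, line in enumerate(lines):
        words = set(re.findall(r"[a-z0-9_]+", line.lower()))
        scored.append((len(words & keywords), idx, line))
    # Highest relevance first; earlier lines break ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    kept, used = [], 0
    for score, idx, line in scored:
        cost = approx_tokens(line)
        # Skip irrelevant lines and lines that would exceed the budget.
        if score == 0 or used + cost > token_budget:
            continue
        kept.append((idx, line))
        used += cost
    # Restore original ordering so the reduced log still reads chronologically.
    return [line for _, line in sorted(kept)]

logs = [
    "INFO heartbeat ok",
    "ERROR db timeout on query orders",
    "DEBUG cache warm",
    "WARN db connection pool exhausted",
    "INFO heartbeat ok",
]
reduced = reduce_for_task(logs, ["error", "db", "timeout"], token_budget=12)
print(reduced)
```

The same budget would admit different lines for a different task, which is the sense in which reduction acts as attention allocation rather than generic compression.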

cs.IR [Back]

[115] The Role of Parametric Injection-A Systematic Study of Parametric Retrieval-Augmented Generation

Minghao Tang,Shiyu Ni,Jingtong Wu,Zengxin Han,Keping Bi

Main category: cs.IR

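TL;DR: The paper systematically studies parametric retrieval-augmented generation (PRAG) and finds that parameterized documents capture only part of a document's semantic information, so using them alone underperforms text-level interaction. However, the parametric representations encode high-level document information that, combined with the text, improves the model's understanding and performance.

Details Motivation: Retrieval-augmented generation (RAG) augments large language models (LLMs) with external documents, and parametric RAG (PRAG) encodes documents as model parameters (such as LoRA modules) that are injected into the model. Its mechanism remains unclear, so a systematic study is needed to clarify the role of parametric injection.

Contribution: 1. Shows that parameterized documents capture only partial semantic information; 2. Shows that the high-level document information they encode can enhance the model's understanding; 3. Shows that combining parameterized and textual documents improves performance.

Method: Systematically studies the PRAG mechanism, analyzes the properties of parametric representations, and compares performance when they are used alone versus together with text.

Result: Parameterized documents alone perform worse than text, but when combined with text the model uses relevant information more effectively and is more robust to noisy inputs, outperforming either source alone.

Insight: The high-level information in parametric representations usefully complements text. Future work should increase their information content to advance PRAG.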

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving external documents. As an emerging form of RAG, parametric retrieval-augmented generation (PRAG) encodes documents as model parameters (i.e., LoRA modules) and injects these representations into the model during inference, enabling interaction between the LLM and documents at the parametric level. Compared with directly placing documents in the input context, PRAG is more efficient and has the potential to offer deeper model-document interaction. Despite growing attention, the mechanism underlying parametric injection remains poorly understood. In this work, we present a systematic study of PRAG to clarify the role of parametric injection, showing that parameterized documents capture only partial semantic information of documents, and relying on them alone yields inferior performance compared to interaction at the text level. However, these parametric representations encode high-level document information that can enhance the model’s understanding of documents within the input context. When parameterized documents are combined with textual documents, the model can leverage relevant information more effectively and become more robust to noisy inputs, achieving better performance than either source alone. We recommend jointly using parameterized and textual documents and advocate for increasing the information content of parametric representations to advance PRAG.
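
To make "injecting a document as a LoRA module" concrete, the sketch below applies the standard low-rank update W' = W + scale * (B A) to a frozen weight matrix, using plain Python lists. The matrices, the `inject` helper, and the `scale` parameter are illustrative assumptions. How PRAG trains a per-document adapter is not described in the abstract and is not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def inject(base_weight, lora_b, lora_a, scale=1.0):
    """Return base_weight + scale * (lora_b @ lora_a) without modifying base_weight."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, 2x2
B = [[1.0], [2.0]]            # hypothetical document adapter, 2x1 (rank 1)
A = [[0.5, -0.5]]             # hypothetical document adapter, 1x2

W_doc = inject(W, B, A)
print(W_doc)  # [[1.5, -0.5], [1.0, 0.0]]
```

Because the base weight is left untouched, a different document's adapter can be swapped in at inference time, and the textual document can still be placed in the input context alongside it, which is the joint use the abstract recommends.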

[116] SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model

Lin Lin,Jiefeng Long,Zhihe Wan,Yuchi Wang,Dingkang Yang,Shuang Yang,Yueyang Yao,Xu Chen,Zirui Guo,Shengqiang Li,Weiran Li,Hanyu Li,Yaling Mou,Yan Qiu,Haiyang Yu,Xiao Liang,Hongsheng Li,Chao Feng

Main category: cs.IR

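TL;DR: SAIL-Embedding is an omni-modal embedding foundation model that uses a multi-stage training strategy and architectural design to address limited modality support, unstable training, and industrial domain gaps in existing methods.

Details Motivation: Although CLIP-style dual-tower architectures and large vision-language models perform well on multimodal embedding tasks, real-world use still suffers from limited modality support, unstable training, and poor domain adaptation.

Contribution: Proposes SAIL-Embedding, an omni-modal embedding foundation model trained with content-aware progressive training and recommendation enhancement training, using knowledge distillation and mining of users' historical interests to improve performance.

Method: Designs a multi-stage training scheme comprising content-aware progressive training and collaboration-aware recommendation enhancement training, together with stochastic specialization and dataset-driven pattern matching to improve training flexibility and generalizability.

Result: Experiments show strong performance on retrieval tasks. Online experiments in the Douyin-Selected scenario report 7-day and 14-day Lifetime (LT) gains of +0.158% and +0.144%, and the Douyin feed rank model gains +0.08% AUC.

Insight: Multi-stage training and task-specific optimization are key to improving multimodal embedding models, particularly in practical recommendation systems.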

Abstract: Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model’s adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.158% and the 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.

cs.SD [Back]

[117] UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Jinchuan Tian,Sang-gil Lee,Zhifeng Kong,Sreyan Ghosh,Arushi Goel,Chao-Han Huck Yang,Wenliang Dai,Zihan Liu,Hanrong Ye,Shinji Watanabe,Mohammad Shoeybi,Bryan Catanzaro,Rafael Valle,Wei Ping

Main category: cs.SD

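TL;DR: UALM is a unified audio language model that integrates audio understanding, text-to-audio generation, and multimodal reasoning, and gives what the authors describe as the first demonstration of cross-modal generative reasoning in audio research.

Details Motivation: In audio language modeling (ALM), audio understanding and text-to-audio generation are currently treated as separate tasks, and very few studies unify them to support advanced multimodal reasoning.

Contribution: 1. Proposes UALM-Gen, a text-to-audio language model that directly predicts audio tokens; 2. Shows that a single UALM model matches specialized models across multiple tasks; 3. Introduces UALM-Reason, which supports multimodal generative reasoning.

Method: 1. Combines data blending, training recipes, and inference techniques; 2. Designs a cross-modal reasoning model that uses both audio and text in its intermediate reasoning steps.

Result: UALM matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning, and subjective evaluations confirm the effectiveness of its cross-modal reasoning.

Insight: A unified model may have advantages on multimodal tasks, and cross-modal reasoning offers a new approach to complex generation tasks.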

Abstract: Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks – an essential step toward advanced multimodal reasoning. This paper introduces the Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

[118] SeeingSounds: Learning Audio-to-Visual Alignment via Text

Simone Carnemolla,Matteo Pennisi,Chiara Russo,Simone Palazzo,Daniela Giordano,Concetto Spampinato

Main category: cs.SD

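TL;DR: SeeingSounds is a lightweight, modular framework that uses language as a bridge for audio-to-image generation, without paired audio-visual data or training of visual generative models. Its dual alignment (audio to a semantic language space, and language to the visual domain) enables controllable and interpretable audio-to-visual generation.

Details Motivation: The work is inspired by cross-modal associations in human perception, in particular the natural interplay among audio, language, and vision. The goal is to remove the dependence of conventional audio-to-visual generation on paired data or on training visual generative models.

Contribution: Proposes the SeeingSounds framework, which generates visuals from audio with language as the intermediary; supports fine-grained, interpretable control through text prompts; and outperforms existing methods in both zero-shot and supervised settings.

Method: Projects audio into a semantic language space via a frozen language encoder, then grounds it in the visual domain with a vision-language model. Only lightweight adapters are trained, on top of frozen diffusion backbones.

Result: On standard benchmarks, SeeingSounds sets a new state of the art in controllable audio-to-visual generation.

Insight: Language can serve as an effective bridge between audio and vision. Training only lightweight adapters keeps the model efficient while achieving strong performance. Cross-modal associations in human perception motivate the generation approach.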

Abstract: We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision, without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder and contextually grounded into the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., “a distant thunder”) that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.
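
The procedural prompt generation step, where an audio transformation becomes a descriptive phrase, might look roughly like the sketch below. The thresholds, the modifier vocabulary, and the `describe` function are hypothetical choices for illustration. The abstract gives only the example of a volume or pitch shift producing a prompt such as "a distant thunder".

```python
def describe(label, volume_db=0.0, pitch_semitones=0.0):
    """Turn a sound label plus applied audio transformations into a text prompt.

    The cut-off values below are illustrative assumptions, not values from the paper.
    """
    modifiers = []
    if volume_db <= -6.0:
        modifiers.append("distant")      # quieter audio reads as farther away
    elif volume_db >= 6.0:
        modifiers.append("loud")
    if pitch_semitones <= -2.0:
        modifiers.append("deep")         # lowered pitch
    elif pitch_semitones >= 2.0:
        modifiers.append("high-pitched")  # raised pitch
    phrase = " ".join(modifiers + [label])
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    return f"{article} {phrase}"

print(describe("thunder", volume_db=-12.0))                      # a distant thunder
print(describe("engine", volume_db=8.0, pitch_semitones=3.0))    # a loud high-pitched engine
```

Keeping the mapping rule-based makes the control interpretable: each word in the prompt can be traced back to a specific transformation of the input audio.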

cs.AI [Back]

[119] Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Sayash Kapoor,Benedikt Stroebl,Peter Kirgis,Nitya Nadgir,Zachary S Siegel,Boyi Wei,Tianci Xue,Ziru Chen,Felix Chen,Saiteja Utpala,Franck Ndzomga,Dheeraj Oruganty,Sophie Luskin,Kangheng Liu,Botao Yu,Amit Arora,Dongyoon Hahm,Harsh Trivedi,Huan Sun,Juyong Lee,Tengjun Jin,Yifan Mai,Yifei Zhou,Yuxuan Zhu,Rishi Bommasani,Daniel Kang,Dawn Song,Peter Henderson,Yu Su,Percy Liang,Arvind Narayanan

Main category: cs.AI

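TL;DR: The paper introduces the Holistic Agent Leaderboard (HAL) to address challenges in AI agent evaluation. Using a standardized evaluation harness, multi-dimensional analysis, and LLM-aided log inspection, it surfaces unexpected findings about agent behavior.

Details Motivation: AI agents are increasingly applied to complex tasks, but current evaluation is slow and prone to implementation bugs, which obscures how well agents really work.

Contribution: 1. Provides a standardized evaluation harness that cuts evaluation time from weeks to hours; 2. Conducts a three-dimensional analysis across models, scaffolds, and benchmarks; 3. Uses LLM-aided log inspection to uncover previously unreported behaviors.

Method: The HAL harness orchestrates large-scale parallel evaluations across hundreds of VMs, combined with multi-dimensional analysis and LLM-aided log inspection to expose subtle problems in agent behavior.

Result: Across 21,730 agent rollouts on 9 models and 9 benchmarks, the analysis found surprising results, such as higher reasoning effort reducing accuracy in the majority of runs. The authors release all agent logs, comprising 2.5B tokens of language model calls.

Insight: Standardized evaluation and in-depth log analysis help expose real performance problems, shifting the focus from agents that ace benchmarks to agents that work reliably in the real world.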

Abstract: AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
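
A harness that fans out every (model, benchmark) pair in parallel can be sketched with the standard library. This is a toy stand-in under stated assumptions: `run_rollout` and `evaluate_grid` are hypothetical names, the score is a deterministic placeholder, and threads replace the VM orchestration that HAL actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_rollout(model, benchmark):
    """Placeholder for one agent rollout; a real harness would launch a VM here."""
    # Deterministic toy score so the example is reproducible.
    score = (len(model) * 7 + len(benchmark) * 3) % 10 / 10
    return {"model": model, "benchmark": benchmark, "score": score}

def evaluate_grid(models, benchmarks, max_workers=4):
    """Run every (model, benchmark) pair concurrently and return results in job order."""
    jobs = list(product(models, benchmarks))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the jobs were submitted.
        return list(pool.map(lambda job: run_rollout(*job), jobs))

results = evaluate_grid(["model-a", "model-b"], ["coding", "web"])
for row in results:
    print(row)
```

Running all pairs through one entry point is also what makes the setup standardized: every model sees the same benchmark code path, so implementation differences cannot leak into the comparison.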

[120] ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization

Sunzhu Li,Zhiyu Lin,Shuling Yang,Jiale Zhao,Wei Chen

Main category: cs.AI

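TL;DR: ThinkPilot is a training-free method that optimizes large reasoning models with evolutionarily generated think-prefixes, improving reasoning efficiency, safety, and instruction following.

Details Motivation: Current large reasoning models (LRMs) suffer from inefficient and off-target reasoning, and existing training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses.

Contribution: Proposes ThinkPilot, a training-free framework that optimizes LRM reasoning through evolutionarily generated think-prefixes, improving accuracy, safety, and instruction following.

Method: Uses an evolutionary process to generate think-prefixes, instructions whose evolution is driven by a taxonomy of reasoning behaviors, to guide the model toward better performance. The approach also works together with existing training-based methods.

Result: ThinkPilot improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7), and enhances instruction following.

Insight: Think-prefixes can reliably control LRM reasoning behaviors, and different tasks strongly prefer specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable alignment framework.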

Abstract: Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Currently, training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRM reasoning. It uses an evolutionary process to generate think-prefixes, which are instructions that evolve driven by a taxonomy of reasoning behaviors to guide models toward superior performance. Extensive experiments demonstrate ThinkPilot’s broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7), and enhances instruction following. It also synergizes with existing training-based methods. Our analysis reveals that think-prefixes can reliably control LRMs’ reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRM reasoning with task demands. Data and code are available at https://github.com/teqkilla/ThinkPilot
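
The evolutionary search over think-prefixes can be pictured as a mutate-and-select loop over sets of behavior instructions. The sketch below is a toy under loud assumptions: the behavior list, the `toy_fitness` function (which stands in for evaluating a model on a development set), and all function names are hypothetical and are not ThinkPilot's actual taxonomy, operators, or code.

```python
import random

# Hypothetical behavior taxonomy; the paper's actual categories are not listed in the abstract.
BEHAVIORS = ["plan first", "verify each step", "be concise", "check safety", "restate the task"]

def render(prefix):
    """Turn a tuple of behaviors into the text placed at the start of the model's thinking."""
    return "Guidelines for my reasoning: " + "; ".join(prefix) + "." if prefix else ""

def toy_fitness(prefix):
    """Stand-in for a real evaluation; rewards two made-up target behaviors, penalizes length."""
    target = {"verify each step", "be concise"}
    return len(target & set(prefix)) - 0.1 * len(prefix)

def mutate(prefix, rng):
    """Randomly add an unused behavior or drop an existing one."""
    child = list(prefix)
    if rng.random() < 0.5 or not child:
        candidate = rng.choice(BEHAVIORS)
        if candidate not in child:
            child.append(candidate)
    else:
        child.pop(rng.randrange(len(child)))
    return tuple(child)

def evolve(generations=20, population_size=8, seed=0):
    """Elitist loop: parents compete with their children, the fittest survive."""
    rng = random.Random(seed)
    population = [tuple()]
    for _ in range(generations):
        children = [mutate(parent, rng) for parent in population]
        pool = set(population) | set(children)
        # Sort by fitness, then by content, so the result is deterministic per seed.
        population = sorted(pool, key=lambda p: (-toy_fitness(p), p))[:population_size]
    return population[0]

best = evolve()
print(render(best), toy_fitness(best))
```

Because parents stay in the selection pool, the best fitness never decreases between generations. In a real setting the fitness call would be the expensive part, since each candidate prefix requires running the reasoning model on evaluation tasks.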

[121] Evolution of meta’s llama models and parameter-efficient fine-tuning of large language models: a survey

Abdulhady Abas Abdullah,Arkaitz Zubiaga,Seyedali Mirjalili,Amir H. Gandomi,Fatemeh Daneshfar,Mohammadsadra Amini,Alan Salam Mohammed,Hadi Veisi

Main category: cs.AI

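TL;DR: This survey reviews the rapid evolution of Meta AI's LLaMA series (LLaMA 1 through LLaMA 4) and the parameter-efficient fine-tuning (PEFT) methods developed for it, covering five PEFT techniques along with their application scenarios, performance, and real-world use cases.

Details Motivation: As LLaMA models grow in scale and variety, fine-tuning them efficiently under limited resources has become an active research topic. The survey aims to give researchers and practitioners a comprehensive guide to LLaMA models and their PEFT methods.

Contribution: 1) A systematic overview of the architectures and performance of the LLaMA series; 2) A review of the mechanisms and applications of five PEFT methods; 3) An analysis of model and adapter performance; 4) A discussion of real-world use cases and future research directions.

Method: 1) Describes the development and characteristics of the LLaMA family; 2) Introduces the PEFT concept and five specific methods (LoRA, LLaMA-Adapter V1 and V2, LLaMA-Excitor, and QLoRA); 3) Analyzes model and adapter architectures, parameter counts, and benchmark results.

Result: PEFT methods substantially reduce the number of parameters to fine-tune while preserving performance, and in some cases fine-tuned LLaMA models outperform larger baselines.

Insight: 1) PEFT is a key technique for using large models efficiently; 2) LoRA and its derivatives, such as QLoRA, are broadly applicable; 3) LLMs have promising prospects in specialized domains such as law and medicine.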

Abstract: This review surveys the rapid evolution of Meta AI’s LLaMA (Large Language Model Meta AI) series, from LLaMA 1 through LLaMA 4, and the specialized parameter-efficient fine-tuning (PEFT) methods developed for these models. We first describe the LLaMA family of foundation models (7B-65B to 288B parameters), their architectures (including native multimodal and Mixture-of-Experts variants), and key performance characteristics. We then describe and discuss the concept of PEFT, which adapts large pre-trained models by updating only a small subset of parameters, and review five PEFT methods that have been applied to LLaMA: LoRA (Low-Rank Adaptation), LLaMA-Adapter V1 and V2, LLaMA-Excitor, and QLoRA (Quantized LoRA). We discuss each method’s mechanism, parameter savings, and example application to LLaMA (e.g., instruction tuning, multimodal tasks). We provide structured discussion and analysis of model and adapter architectures, parameter counts, and benchmark results (including examples where fine-tuned LLaMA models outperform larger baselines). Finally, we examine real-world use cases where LLaMA-based models and PEFT have been successfully applied (e.g., legal and medical domains), and we discuss ongoing challenges and future research directions (such as scaling to even larger contexts and improving robustness). This survey paper provides a one-stop resource for ML researchers and practitioners interested in LLaMA models and efficient fine-tuning strategies.
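
The parameter savings of LoRA follow from simple arithmetic: a full update to a d_out x d_in weight matrix has d_out * d_in trainable values, while a rank-r LoRA update trains two factors with r * (d_out + d_in) values in total. The sketch below works this out for one matrix. The 4096 x 4096 shape and rank 8 are illustrative choices, not figures quoted from the survey.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full-update params, LoRA params, LoRA/full ratio) for one weight matrix."""
    full = d_out * d_in             # every entry of the matrix is trainable
    lora = rank * (d_out + d_in)    # B is d_out x rank, A is rank x d_in
    return full, lora, lora / full

full, lora, ratio = lora_param_counts(4096, 4096, 8)
print(full)   # 16777216
print(lora)   # 65536
print(ratio)  # 0.00390625, i.e. 1/256 of the full update
```

The ratio shrinks further as the matrix grows at fixed rank, which is why low-rank adapters remain cheap even for the largest models in a family.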