Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 78]
- q-bio.GN [Total: 1]
- cs.IR [Total: 3]
- cs.MM [Total: 1]
- cs.AI [Total: 7]
- cs.DB [Total: 1]
- cs.LG [Total: 10]
- cs.GR [Total: 1]
- cs.HC [Total: 1]
cs.CL
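[1] Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework cs.CL
Rakib Ullah, Mominul Islam, Md Sanjid Hossain, Md Ismail Hossain
TL;DR: This paper addresses the detection of hateful and inflammatory content in Bengali internet memes. It introduces Bn-HIB, the first Bengali multimodal dataset that distinguishes hate from inflammatory content, and MCFM, a multimodal fusion model built on a co-attention mechanism that jointly analyzes image and text features and significantly improves classification performance.
Details
Motivation: Memes spread widely on social media and can carry offensive, harmful, and inflammatory content aimed at individuals or groups. Existing research focuses mainly on high-resource languages and pays little attention to low-resource languages such as Bengali, and the satirical, subtle, and culturally specific nature of memes makes detection very challenging.
Result: On the proposed Bn-HIB dataset, MCFM significantly outperforms several state-of-the-art models.
Insight: The contributions are the first Bengali meme dataset to separate hate from inflammatory content and a simple co-attention fusion architecture whose cross-modal feature interaction improves recognition of subtle content, offering a new approach to low-resource multimodal content analysis.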
Abstract: Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.
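The abstract does not spell out how MCFM computes co-attention, so the following is only a minimal sketch of one common co-attention pattern, not the authors' architecture. It assumes image-region and text-token features already share a dimension; the function names and the max-affinity weighting are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, rows):
    dim = len(rows[0])
    return [sum(w * row[i] for w, row in zip(weights, rows)) for i in range(dim)]

def co_attention_fuse(image_feats, text_feats):
    """Hypothetical co-attention fusion: each modality is re-weighted by
    how strongly its elements match the other modality."""
    # affinity[i][j]: similarity between image region i and text token j
    affinity = [[dot(v, t) for t in text_feats] for v in image_feats]
    # Assumption: a region's importance is its best match among the tokens,
    # and a token's importance is its best match among the regions.
    image_weights = softmax([max(row) for row in affinity])
    text_weights = softmax(
        [max(affinity[i][j] for i in range(len(image_feats)))
         for j in range(len(text_feats))]
    )
    attended_image = weighted_sum(image_weights, image_feats)
    attended_text = weighted_sum(text_weights, text_feats)
    # Concatenate the two attended vectors as the fused representation.
    return attended_image + attended_text
```

In a full classifier the fused vector would feed a small head over the three Bn-HIB labels (Benign, Hate, Inflammatory); that head is omitted here.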
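[2] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads cs.CL
Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi
TL;DR: This paper studies attention heads in multilingual Transformer models and identifies a special class, Retrieval-Transition Heads (RTH), that govern the transition from source-language context to target-language output. RTHs are distinct from conventional retrieval heads (RH) and more critical for chain-of-thought reasoning in multilingual large language models (LLMs). Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, XQuaD) and the Qwen-2.5 and Llama-3.1 model families, masking RTHs causes a larger performance drop than masking RHs.
Details
Motivation: To better understand the role of attention heads in multilingual Transformer models, in particular to identify the heads that move from source-language information to target-language generation in cross-lingual contexts, and so clarify the poorly understood reasoning mechanisms of multilingual LLMs.
Result: On MMLU-ProX, MGSM, MLQA, and XQuaD, with Qwen-2.5 and Llama-3.1 models, masking Retrieval-Transition Heads (RTH) causes a larger performance drop than masking conventional retrieval heads (RH), underscoring the importance of RTHs for multilingual chain-of-thought reasoning.
Insight: The novelty lies in identifying RTHs in multilingual contexts for the first time and in showing how they differ functionally from RHs and why they matter for target-language generation and cross-lingual reasoning, which offers a new perspective for understanding and improving the internals of multilingual LLMs.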
Abstract: Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
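[3] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training cs.CL | cs.IR | cs.LG
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li
TL;DR: This paper proposes Search-P1, a framework that uses path-centric reward shaping to address sparse rewards and low sample efficiency in agentic retrieval-augmented generation (RAG) training. It has two core components, Path-Centric Reward and Dual-Track Path Scoring, and achieves significant gains over baselines such as Search-R1 on multiple QA benchmarks.
Details
Motivation: Current RL-based training methods for agentic RAG suffer from sparse rewards (intermediate signals are discarded) and low sample efficiency (failed samples contribute nothing). The goal is more stable and efficient training.
Result: Experiments on multiple QA benchmarks show that Search-P1 improves average accuracy by 7.7 points over Search-R1 and other strong baselines.
Insight: The novelty is path-centric reward shaping, which extracts learning signals from failed samples through order-agnostic step coverage and soft scoring, combined with Dual-Track Path Scoring, which evaluates reasoning paths from both self-consistency and reference-alignment perspectives.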
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
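As a rough illustration of "order-agnostic step coverage and soft scoring", the sketch below scores a trajectory by the fraction of reference-plan steps it covers, regardless of order, and blends that with the outcome reward so that a failed sample still yields a signal. The exact-match step comparison, the function names, and the `path_weight` blend are assumptions for illustration, not the paper's formulation.

```python
def step_coverage(trajectory_steps, reference_steps):
    """Fraction of reference steps that appear anywhere in the trajectory.

    Order is ignored: sets are compared, so ["b", "a"] fully covers ["a", "b"].
    """
    reference = set(reference_steps)
    if not reference:
        return 0.0
    covered = reference & set(trajectory_steps)
    return len(covered) / len(reference)

def path_centric_reward(trajectory_steps, reference_steps, answer_correct,
                        path_weight=0.5):
    """Hypothetical soft reward: outcome term plus a path-quality term."""
    coverage = step_coverage(trajectory_steps, reference_steps)
    outcome = 1.0 if answer_correct else 0.0
    return (1.0 - path_weight) * outcome + path_weight * coverage
```

With this shaping, a trajectory that covers half the reference plan but answers incorrectly receives a reward of 0.25 rather than 0, which is the kind of intermediate signal a pure outcome reward discards.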
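[4] Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA cs.CL
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun
TL;DR: This paper proposes a reinforced co-adaptation framework for industrial advertising QA. It targets the hallucinations that conventional retrieval-augmented generation (RAG) produces in industrial settings, where knowledge is relational, frequently updated, and poorly aligned with generation objectives. The framework uses graph-aware retrieval (GraphRAG) to model entity-relation structure for multi-hop, domain-specific evidence selection, and evidence-constrained reinforcement learning based on Group Relative Policy Optimization (GRPO) with multi-dimensional rewards (faithfulness, style compliance, safety, URL validity) to jointly optimize retrieval and generation.
Details
Motivation: Industrial advertising QA is a high-stakes task: hallucinated content, especially fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Existing RAG methods are hard to deploy because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives.
Result: On an internal advertising QA dataset, the method yields consistent gains across expert-judged dimensions such as accuracy, completeness, and safety, and reduces the hallucination rate by 72%. A two-week online A/B test shows a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been in production for over half a year and has served millions of QA interactions.
Insight: The contributions are: 1) GraphRAG, which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence retrieval; and 2) evidence-constrained reinforcement learning with GRPO, which integrates multi-dimensional rewards to jointly optimize retrieval and generation. Viewed objectively, the framework brings graph structure into industrial RAG to strengthen knowledge representation and uses reinforcement learning for end-to-end optimization, improving the system's faithfulness and practicality.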
Abstract: Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
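The GRPO component can be illustrated with its group-relative advantage: each sampled response is scored against the mean and standard deviation of its own group. The sketch below assumes, purely for illustration, that the four reward dimensions named in the abstract are combined by a weighted sum; the paper's actual weighting and reward models are not given here, and the function names are hypothetical.

```python
import statistics

def combined_reward(scores, weights):
    """Assumed weighted sum over reward dimensions (e.g. faithfulness,
    style, safety, url_validity). The weights are illustrative only."""
    return sum(weights[name] * scores[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every response scores the same.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative to the group, a response with a fabricated URL (and therefore a zero on URL validity) is pushed down only to the extent that sibling responses for the same question did better.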
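[5] Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization cs.CL
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu
TL;DR: This paper proposes "Search More, Think Less" (SMTL), a framework that addresses the high inference cost and latency deep research agents incur when they scale reasoning depth on long-horizon search tasks, while also improving generalization across heterogeneous research settings. It replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under a limited context budget, and uses a unified data synthesis pipeline to build search tasks that span deterministic question answering and open-ended research scenarios.
Details
Motivation: Current deep research agents improve performance mainly by increasing reasoning depth, which leads to high inference cost and latency in search-intensive scenarios, and generalization across heterogeneous research settings remains weak.
Result: Trained with supervised fine-tuning and reinforcement learning, SMTL achieves strong and often state-of-the-art performance on benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared with Mirothinker-v1.0, SMTL with at most 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7% while improving accuracy.
Insight: The contributions are a parallel evidence acquisition mechanism that improves search efficiency and a unified data synthesis pipeline that strengthens cross-task generalization. Viewed objectively, combining efficiency with generalization offers a new optimization direction for long-horizon agentic search, with practical value especially where context management is constrained.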
Abstract: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with maximum 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
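[6] Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue cs.CL | cs.AI
Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang
TL;DR: This paper proposes InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process in order to balance user reward against global cost constraints and so optimize service-agent decisions in real business scenarios.
Details
Motivation: Existing methods struggle to balance the complex trade-off between empathetic communication and budget-aware decision-making in task-oriented dialogue, so a framework is needed that optimizes user satisfaction and cost control together.
Result: In experiments on customized real business scenarios, InteractCS-RL significantly outperforms other baselines across three evaluation dimensions; evaluation on tool-agent-user interaction benchmarks further confirms its robustness across domains.
Insight: The contributions are a User-centric Interaction Framework that serves as a high-fidelity training environment, and a hybrid advantage estimation strategy that combines generative process credits with a PID-Lagrangian cost controller, guiding the policy to explore the Pareto boundary between user reward and cost constraints.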
Abstract: The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies the robustness of InteractCS-RL across diverse domains.
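The abstract names a PID-Lagrangian cost controller without detailing it. A generic version of that idea treats the Lagrange multiplier on the cost constraint as the output of a PID controller driven by the constraint violation. The sketch below shows that generic pattern under assumed gains and a hypothetical `cost_limit`; it is not the CMPO implementation.

```python
class PIDLagrangian:
    """Generic PID controller for a Lagrange multiplier on a cost constraint.

    All gains and the cost limit are illustrative assumptions.
    """

    def __init__(self, cost_limit, kp=0.1, ki=0.01, kd=0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = 0.0

    def update(self, mean_cost):
        """Return the multiplier for the next policy update."""
        error = mean_cost - self.cost_limit
        # The integral term accumulates violations but never goes negative.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only to rising cost.
        derivative = max(0.0, mean_cost - self.prev_cost)
        self.prev_cost = mean_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


def penalized_advantage(reward_advantage, cost_advantage, multiplier):
    """Trade user reward against cost using the current multiplier."""
    return reward_advantage - multiplier * cost_advantage
```

While the average cost stays under the limit the multiplier stays at zero and the policy optimizes user reward alone; once cost overshoots, the multiplier rises and cost-heavy actions are penalized.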
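[7] Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs cs.CL | cs.AI
Siyue Su, Jian Yang, Bo Li, Guanglin Niu
TL;DR: This paper proposes KGT, a framework that resolves the granularity mismatch between LLMs and knowledge graphs by introducing dedicated entity tokens, fusing pre-trained structural and textual features, and using a decoupled prediction mechanism, enabling efficient full-space prediction.
Details
Motivation: LLM-based knowledge graph completion suffers from a granularity mismatch: LLMs operate on fragmented token sequences, whereas knowledge graphs take entities as their basic units, and existing methods fail to capture both the semantic meaning of the text and the structural integrity of the graph.
Result: Experiments show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
Insight: The contributions are a tokenization scheme with dedicated entity tokens, a relation-guided gating mechanism that fuses pre-trained features and avoids training from scratch, and a decoupled prediction strategy that uses independent heads for semantic and structural reasoning.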
Abstract: Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM’s vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
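A gating mechanism of the kind described typically mixes two feature vectors with a learned, input-dependent weight. The sketch below is a minimal per-dimension gate and assumes the gate logits have already been produced from the relation (for example by a learned projection of the relation embedding, which is omitted). The names `gated_fusion` and `gate_logits` are hypothetical, and the abstract does not confirm this exact form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(structural, textual, gate_logits):
    """Blend structural and textual entity features dimension by dimension.

    gate_logits is assumed to come from the relation; a gate near 1 favours
    the structural feature, a gate near 0 favours the textual feature.
    """
    fused = []
    for s, t, logit in zip(structural, textual, gate_logits):
        g = sigmoid(logit)
        fused.append(g * s + (1.0 - g) * t)
    return fused
```

Because both inputs are pre-trained features, only the gate (and whatever produces its logits) would need training in a setup like this, which matches the abstract's point about avoiding training from scratch.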
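[8] Towards Better RL Training Data Utilization via Second-Order Rollout cs.CL
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui
TL;DR: This paper proposes second-order rollout, a method that jointly trains generation and critique capabilities to make better use of reinforcement learning (RL) training data and thereby improve the reasoning ability of large language models (LLMs).
Details
Motivation: Conventional RL training uses only first-order rollout (generating multiple responses for a question) and neglects critique training, so it does not fully exploit the potential of the training data.
Result: Extensive experiments across models and datasets show that the method uses training data more effectively than vanilla RL and achieves better performance with the same data.
Insight: The novelty is a joint training framework built on second-order rollout (generating multiple critiques for a response). The paper also shows that label balance matters in critique training and that the noise in outcome-based rewards can be mitigated with sampling techniques, providing a preliminary exploration of dynamic data augmentation and joint training in RL.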
Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
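To make the two rollout orders concrete: a first-order rollout yields several responses per question, and a second-order rollout yields several critiques per response. The sketch below shows one plausible way to pick responses for critique training with balanced labels and to reward a critique by whether its verdict matches the response's correctness. Both helpers are hypothetical; the abstract mentions label balance and outcome-based rewards but gives no procedure.

```python
import random

def balanced_critique_targets(responses, is_correct, seed=0):
    """Pick equal numbers of correct and incorrect first-order responses.

    Assumed balancing scheme: downsample the larger class so that critique
    training does not collapse to always predicting the majority label.
    """
    rng = random.Random(seed)
    correct = [r for r in responses if is_correct(r)]
    incorrect = [r for r in responses if not is_correct(r)]
    n = min(len(correct), len(incorrect))
    return rng.sample(correct, n) + rng.sample(incorrect, n)

def critique_reward(critique_says_correct, response_is_correct):
    """Outcome-based reward for a critique: 1 if its verdict is right."""
    return 1.0 if critique_says_correct == response_is_correct else 0.0
```

A critique can reach the right verdict for the wrong reason, which is one reading of the "noise problem of outcome-based rewards" the abstract refers to.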
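[9] Imagination Helps Visual Reasoning, But Not Yet in Latent Space cs.CL
You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang
TL;DR: This paper uses causal mediation analysis to study whether latent visual reasoning is effective. It finds critical disconnects between the input and the latent tokens and between the latent tokens and the final answer, questions the necessity of latent reasoning, and proposes CapImagine, an alternative based on explicit textual imagination that significantly outperforms complex latent-space baselines on visual benchmarks.
Details
Motivation: To uncover where the effectiveness of latent visual reasoning (reasoning through the hidden states of multimodal large language models) actually comes from, and to test whether its assumed mechanism as a visual reasoning paradigm holds.
Result: Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the potential of visual reasoning through explicit imagination.
Insight: The novelty is the use of causal mediation analysis to expose two critical disconnects in latent reasoning (Input-Latent and Latent-Answer), together with a simple and effective alternative, CapImagine, which teaches the model to imagine explicitly in text. This challenges the common assumption that latent reasoning is necessary.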
Abstract: Latent visual reasoning aims to mimic the human imagination process by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating the limited causal effect that latent tokens have on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
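[10] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift cs.CL | cs.AI
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim
TL;DR: This paper presents Natural Language Declarative Prompting (NLD-P), a modular governance method for the model drift caused by the rapid evolution of large language models (LLMs). It reconceptualizes prompt design as a declarative governance framework rather than a fixed template, separating provenance, constraint logic, task content, and post-generation evaluation into modules encoded directly in natural language, with the aim of more stable and interpretable control over LLMs.
Details
Motivation: As LLMs evolve rapidly and update across generations, prompt behavior becomes sensitive to changes in instruction-following policies, alignment regimes, and decoding strategies (what the paper calls GPT-scale model drift). Surface-level formatting conventions and ad hoc refinement no longer ensure stable, interpretable control, so a systems-level governance method is needed.
Result: The abstract reports no quantitative experiments or benchmarks. The paper defines minimal compliance criteria, analyzes model-dependent schema receptivity, and positions NLD-P as an accessible governance framework for non-developer practitioners in evolving LLM ecosystems.
Insight: The novelty is recasting prompt engineering as a declarative governance method: a modular control abstraction that separates the components of a prompt and encodes them directly in natural language without external orchestration code, offering a reusable framework for more robust prompt design under model drift.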
Abstract: The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.
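[11] TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought cs.CL | cs.AI
Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang
TL;DR: This paper proposes TCM-DiffRAG, an improved retrieval-augmented generation framework for personalized syndrome differentiation reasoning in traditional Chinese medicine (TCM). It combines knowledge graphs with chain-of-thought techniques to address the poor performance of conventional RAG in TCM settings that involve complex reasoning and individual differences.
Details
Motivation: Conventional RAG methods perform poorly in TCM clinical diagnosis and treatment, which involves complex reasoning processes and substantial individual differences, so an improved RAG framework adapted to the characteristics of TCM reasoning is needed.
Result: On three distinctive TCM test datasets, TCM-DiffRAG significantly outperforms native LLMs, directly supervised fine-tuned models, and other benchmark RAG methods. For example, the qwen-plus model's scores rise from 0.927/0.361/0.038 to 0.952/0.788/0.356, and the gains are even larger for non-Chinese LLMs.
Insight: The core contribution is combining structured TCM knowledge graphs with chain-of-thought reasoning. The claimed highlight is the joint use of universal and personalized knowledge graphs, which aligns general knowledge with clinical reasoning and suggests how reasoning-aware RAG frameworks can advance LLM applications in specialized domains.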
Abstract: Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with Chain of Thought based reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning. These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.
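[12] Effective QA-driven Annotation of Predicate-Argument Relations Across Languages cs.CL
Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan
TL;DR: This paper proposes a cross-linguistic projection approach that reuses an English QA-SRL (Question-Answer driven Semantic Role Labeling) parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates, extending semantic annotation to non-English languages such as Hebrew, Russian, and French.
Details
Motivation: Semantic role labeling (SRL) annotation is costly and largely confined to English. The goal is to use QA-SRL, a natural-language formulation, to extend semantic analysis efficiently to multiple languages.
Result: For Hebrew, Russian, and French, the method produces high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines such as GPT-4o and LLaMA-Maverick.
Insight: The novelty is treating QA-SRL as a transferable natural-language interface for semantics and generating annotations automatically through cross-linguistic projection, which avoids expensive manual annotation and makes predicate-argument parsing across languages efficient and broadly accessible.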
Abstract: Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework – a natural-language formulation of predicate-argument relations – as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French – spanning diverse language families – the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
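The projection step in such a pipeline can be pictured as mapping an English answer span onto target-language tokens through word-alignment pairs. The sketch below shows only that step, under the assumptions that spans are inclusive token-index ranges and that the alignment is a list of `(source_index, target_index)` pairs; the function name and these conventions are illustrative, and the paper's constrained-translation stage is not modelled.

```python
def project_span(source_span, alignment):
    """Project an inclusive (start, end) source token span onto the target.

    alignment: iterable of (source_index, target_index) pairs, e.g. from a
    word aligner. Returns the smallest target span covering all aligned
    tokens, or None when nothing in the span is aligned.
    """
    start, end = source_span
    targets = sorted(t for s, t in alignment if start <= s <= end)
    if not targets:
        return None
    return (targets[0], targets[-1])
```

For instance, with the alignment `[(0, 1), (1, 0), (2, 2)]`, the English span `(0, 1)` projects to the target span `(0, 1)` even though the two words swap order. Taking the covering span is a simplification: discontinuous alignments would pull in unaligned target tokens.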
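[13] Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching cs.CL | cs.AI
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng
TL;DR: This paper proposes "Stitching Noisy Diffusion Thoughts", a self-consistency framework for improving the reasoning of large language models. A diffusion language model generates diverse, low-cost reasoning trajectories, a process reward model scores every intermediate step, high-quality steps from different trajectories are stitched into a composite rationale, and an autoregressive model then recomputes the final answer conditioned on that rationale.
Details
Motivation: Existing aggregation strategies for LLM reasoning (such as selecting the best trace or voting on the final answer) usually operate at the trajectory level and discard useful intermediate work from partial or "nearly correct" attempts. This paper aims to use that discarded information through step-level recombination.
Result: Across six math and coding tasks, the method improves average accuracy by up to 23.8%. It also achieves up to a 1.8x latency reduction relative to traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR).
Insight: The core contribution is a modular pipeline that decouples exploration (diffusion sampling), evaluation (process reward model scoring), and solution synthesis (step stitching plus an autoregressive solver). This avoids building a single unified hybrid model while preserving broad search. Step-level recombination helps most on harder problems, and the final autoregressive solver is essential for turning stitched, possibly imperfect rationales into accurate answers.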
Abstract: Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or “nearly correct” attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion-stitching.
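Steps (ii) and (iii) of the abstract can be sketched as a selection over step-level candidates. The version below assumes, for illustration only, that trajectories are aligned by step position and that the best-scoring candidate at each position is kept; `score_step` stands in for a process reward model and is hypothetical. The authors' repository is the reference for their actual stitching rule.

```python
def stitch_steps(trajectories, score_step):
    """Build a composite rationale from the best-scoring step per position.

    trajectories: list of reasoning traces, each a list of step strings.
    score_step:   callable returning a quality score for one step
                  (a stand-in for a process reward model).
    """
    max_len = max(len(t) for t in trajectories)
    stitched = []
    for position in range(max_len):
        # Shorter trajectories simply contribute no candidate here.
        candidates = [t[position] for t in trajectories if position < len(t)]
        stitched.append(max(candidates, key=score_step))
    return stitched
```

The stitched rationale would then be placed in the prompt of an autoregressive solver that recomputes only the final answer, which is the stage the abstract's ablations identify as important.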
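[14] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models cs.CL
Jonathan Steinberg, Oren Gal
TL;DR: Using causal interventions, this paper studies how optical character recognition (OCR) information is routed in three vision-language model families (Qwen3-VL, Phi-4, InternVL3.5). The location of the OCR bottleneck depends on the architecture: DeepStack models are most sensitive to scene text at mid-depth, while single-stage projection models peak at early layers. The OCR signal is low-dimensional, and its principal component directions transfer across datasets. Notably, removing OCR information in architectures with modular OCR circuits can improve counting performance.
Details
Motivation: To determine where in the language processing stream OCR information is integrated in vision-language models, and so understand how different architectures process text.
Result: In models with modular OCR circuits such as Qwen3-VL-4B, removing OCR information improves counting performance by up to 6.9 percentage points. The first principal component of the OCR signal explains 72.9% of the variance, and the PCA directions transfer across datasets.
Insight: The paper shows that the OCR processing path in VLMs is architecture-dependent and that modular OCR circuits can interfere with other visual processing. This suggests that more efficient vision-language models may need to decouple text recognition from other visual understanding modules.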
Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
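[15] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent cs.CL | cs.CR | cs.LG
Boyang Zhang, Yang Zhang
TL;DR: This paper presents an LLM agent built around SALA (Stylometry-Assisted LLM Analysis) for evaluating and mitigating authorship deanonymization risks in textual data such as news articles. The framework combines quantitative stylometric features with LLM reasoning in a structured, interpretable pipeline. Experiments show that SALA, especially when augmented with a database module, achieves high authorship inference accuracy across scenarios. The paper also proposes a guided recomposition strategy that uses the agent's reasoning trace to generate rewriting prompts, reducing authorship identifiability while preserving the meaning of the text.
Details
Motivation: Rapid progress in large language models (LLMs) has enabled powerful authorship inference, raising growing concern about unintended deanonymization risks in textual data such as news articles. This paper aims to assess those risks and explore mitigations.
Result: Experiments on large-scale news datasets show that SALA, particularly when augmented with a database module, achieves high inference accuracy in various scenarios.
Insight: The core contribution is SALA, which integrates quantitative stylometric features with LLM reasoning into a robust and transparent authorship attribution framework. In addition, the guided recomposition strategy uses the agent's reasoning process to steer rewriting and lower privacy risk, a proactive and interpretable form of defense.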
Abstract: The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed SALA (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that SALA, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent’s reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.
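[16] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs cs.CL | cs.AI | cs.LG
Jayadev Billa
TL;DR: This paper studies modality collapse, the loss of information that multimodal large language models (MLLMs) exhibit when processing non-text modalities such as speech and images. It formalizes the problem in an information-theoretic framework as mismatched decoding: a text decoder can extract information only along text-aligned directions, so modality-specific information (speaker identity, emotion, visual texture) is treated as noise and discarded even when the encoder preserves it.
Details
Motivation: Multimodal LLMs can process speech and images but cannot make effective use of non-textual information in them, such as voice characteristics or object texture. The paper investigates the root cause: whether the loss comes from a failure of encoding or from a mismatch between the decoder and the multimodal input.
Result: The theoretical bound is validated on five models spanning speech and vision. Linear probes show that speaker identity, emotion, and visual attributes survive through every LLM layer (3–55× above chance), yet removing 64–71% of modality-specific variance improves decoder loss. A controlled experiment with two Prismatic VLMs that differ only in encoder text-alignment confirms that the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention with an emotion objective improves emotion accessibility by 7.5% without affecting other attributes.
Insight: The novelty is formalizing modality collapse as a mismatched decoding problem from an information-theoretic standpoint and using the Generalized Mutual Information (GMI) to bound accessible information. The key insight is that the decoder's training objective (its scoring rule), not the encoding architecture, determines which multimodal information can be extracted, and that adjusting the objective can selectively raise the accessibility of a given attribute.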
Abstract: Multimodal LLMs can process speech and images, but they cannot hear a speaker’s voice or see an object’s texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3–55× above chance in linear probes), yet removing 64–71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder’s scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder’s scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility (+7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
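[17] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? cs.CL | cs.AI
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
TL;DR: This paper examines why diffusion language models (DLMs) struggle to achieve truly parallel (non-autoregressive) decoding and often fall back to autoregressive (AR)-like decoding dynamics. The authors attribute this mainly to a mismatch between DLM training objectives and highly sequential training data, such as standard pretraining corpora and long chain-of-thought supervision. They propose NAP (Non-Autoregressive Parallel DLMs), a data-centric proof of concept that curates examples as multiple independent reasoning trajectories and pairs them with a parallel-forced decoding strategy, so that supervision better matches non-AR parallel decoding. On math reasoning benchmarks, NAP outperforms DLMs trained on standard long chain-of-thought data under parallel decoding, and the gains grow as parallelism increases.
Details
Motivation: In practice, DLMs often degenerate into AR-like decoding dynamics and fail to deliver genuinely non-AR parallel generation. Fixing this would remove AR's sequential bottleneck and make better use of parallel hardware, reducing synchronization and communication overhead and improving how latency scales with output length.
Result: On math reasoning benchmarks, NAP performs better under parallel decoding than DLMs trained on standard long chain-of-thought data, with gains that grow as parallelism increases.
Insight: The core contribution is the diagnosis that AR-like behavior in DLMs stems from a mismatch between the training objective and sequential data, together with a data-centric remedy (NAP) that redesigns the supervision data (multiple independent reasoning trajectories) and the decoding strategy (parallel-forced decoding) to encourage genuine multi-token parallel updates. This points to a principled direction for mitigating AR-like behavior and reaching truly non-autoregressive parallel generation.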
Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR’s sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
[18] Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems cs.CLPDF
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao
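TL;DR: This paper proposes the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency design that speeds up cascaded spoken dialogue systems (ASR-LLM-TTS). Through three mechanisms (connective-guided small-large model synergy, streaming-based cross-modal collaboration, and curriculum-learning-based discourse continuity enhancement), it enables listen-while-thinking and speak-while-thinking and substantially reduces response latency.
Details
Motivation: Conventional ASR-LLM-TTS pipelines run strictly sequentially: speech synthesis cannot begin until transcription and reasoning are complete, which causes high response latency. The goal is a spoken dialogue system whose responsiveness is closer to that of a human.
Result: Experiments on two spoken dialogue benchmarks show that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that the framework works as a plug-and-play module compatible with diverse LLM backbones and remains robust across varying utterance lengths.
Insight: The novelty is a dual-track parallel architecture that decouples discourse-connective generation from knowledge-intensive reasoning, and that coordinates the ASR, LLM, and TTS modules through streaming and curriculum learning. It lowers latency while maintaining coherence and logical consistency, which makes it practical and scalable.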
Abstract: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
[19] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables cs.CL | cs.AI | cs.DB | cs.IRPDF
Sungho Park, Jueun Kim, Wook-Shin Han
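TL;DR: SPARTA is an end-to-end framework for generating large-scale Table-Text multi-hop QA benchmarks. Using an automated pipeline with lightweight human validation, it builds high-quality question-answer pairs that cover aggregation, grouping, and deep multi-hop reasoning, and it exposes the weaknesses of existing models in cross-modal reasoning.
Details
Motivation: Existing Table-Text QA benchmarks are small, manually annotated and therefore error-prone, and contain shallow questions that rarely involve multiple hops or complex operations such as aggregation. A scalable, high-quality benchmark is needed to assess models' real reasoning ability.
Result: On SPARTA, current state-of-the-art models (those reaching over 70 F1 on HybridQA or over 50 F1 on OTT-QA) drop by more than 30 F1 points, exposing fundamental weaknesses in cross-modal reasoning.
Insight: The contributions include provenance-based refinement and realistic-structure enforcement, which ensure that the generated SQL is executable and that the resulting questions read naturally. The framework generates large-scale benchmarks automatically and sharply reduces human annotation time (one quarter of that required for HybridQA).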
Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
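The idea that the number of nested predicates sets the hop count can be illustrated with a small, self-contained sketch. This is not SPARTA's code: the schema, table names (`player`, `team_fact`, `game`), and data below are hypothetical, with `team_fact` standing in for a grounding table of atomic facts extracted from text. The query has two nested predicates (two hops) and ends in an aggregation.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, not SPARTA's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE player (name TEXT, team TEXT);
        CREATE TABLE team_fact (team TEXT, city TEXT);  -- stand-in grounding table
        CREATE TABLE game (team TEXT, points INTEGER);
        INSERT INTO player VALUES ('Ann', 'Hawks'), ('Bob', 'Owls');
        INSERT INTO team_fact VALUES ('Hawks', 'Seoul'), ('Owls', 'Busan');
        INSERT INTO game VALUES ('Hawks', 90), ('Hawks', 110), ('Owls', 70);
        """
    )
    return conn


# Two nested IN-predicates correspond to two hops; AVG adds an aggregation.
# Roughly: "What is the average score of the Seoul-based team Ann plays for?"
TWO_HOP_QUERY = """
SELECT AVG(points) FROM game
WHERE team IN (
    SELECT team FROM team_fact
    WHERE city = 'Seoul' AND team IN (
        SELECT team FROM player WHERE name = 'Ann'
    )
)
"""


def run_query(conn):
    """Execute the nested query and return its single aggregate value."""
    return conn.execute(TWO_HOP_QUERY).fetchone()[0]


if __name__ == "__main__":
    print(run_query(build_demo_db()))
```

Adding one more nested `IN (...)` subquery would raise the hop count to three under this reading of the abstract.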
[20] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations cs.CL | eess.ASPDF
Soumya Dutta, Smruthi Balaji, Sriram Ganapathy
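TL;DR: This paper proposes MiSTER-E, a mixture-of-experts model for multimodal emotion recognition in conversations. It decouples two core challenges, modality-specific context modeling and multimodal information fusion, using fine-tuned large language models to produce rich utterance-level embeddings for speech and text and a convolutional-recurrent layer to strengthen context modeling. The system combines predictions from three experts (speech-only, text-only, and cross-modal) through a learned gating mechanism that dynamically weights their outputs, and adds a supervised contrastive loss and KL-divergence regularization to encourage consistency and alignment across modalities. Experiments on the IEMOCAP, MELD, and MOSI benchmarks show better performance than several baseline systems.
Details
Motivation: Emotion recognition in conversations poses unique challenges: a model must capture the temporal flow of multi-turn dialogue and integrate cues from multiple modalities effectively. Existing methods couple modality-specific context modeling with multimodal fusion; this work decouples the two through a modular mixture-of-experts framework to improve recognition performance.
Result: On the IEMOCAP, MELD, and MOSI benchmark datasets, MiSTER-E achieves weighted F1-scores of 70.9%, 69.5%, and 87.9% respectively, outperforming several baseline speech-text ERC systems.
Insight: The contributions are: 1) a modular mixture-of-experts framework that decouples modality-specific context modeling from multimodal fusion; 2) utterance-level embeddings from fine-tuned LLMs, enhanced by a convolutional-recurrent context layer; 3) a learned gating mechanism that dynamically combines the experts' predictions; 4) a supervised contrastive loss and KL-divergence regularization that promote cross-modal alignment and consistency. The design does not depend on speaker identity, which helps generalization.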
Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
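The gating step described in this abstract can be sketched in a few lines. This is a generic illustration of softmax-gated mixing over three experts, not the MiSTER-E implementation: the function names, the fixed gate logits, and the example probabilities are assumptions, and in the paper the gate is learned rather than hand-set. A plain KL divergence is included only to show the kind of quantity an agreement regularizer across expert predictions might use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_probs, gate_logits):
    """Mix per-expert class distributions with softmax gate weights."""
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Hypothetical outputs of speech-only, text-only, and cross-modal experts.
    experts = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.4, 0.2]]
    print(gated_mixture(experts, [0.0, 0.0, 0.0]))  # uniform gate: plain average
    print(kl_divergence(experts[0], experts[1]))    # disagreement between experts
```

With equal gate logits the mixture reduces to a plain average; a large logit for one expert makes the output follow that expert almost exclusively.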
[21] Scale Can’t Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning cs.CL | cs.CVPDF
Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang
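TL;DR: This paper investigates the root cause of weak reasoning in vision-language models (VLMs) and traces it to reporting bias in their training data: when people describe visual content, they omit by default the tacit information that reasoning requires. Using theories from pragmatics, the authors analyze the data behind popular VLMs such as OpenCLIP, LLaVA-1.5, and Molmo and find that, despite the data being web-scale or synthetically generated, reporting bias leaves four reasoning skills (spatial, temporal, negation, and counting) under-represented. Experiments show that VLMs perform poorly on these suppressed reasoning tasks, that scaling data size, model size, or the number of training languages does not make the skills emerge by default, and that data annotated specifically to capture tacit information does improve performance.
Details
Motivation: To explain why VLMs reason poorly, the paper examines reporting bias in training data, that is, people's habit of leaving tacit information unstated when describing visual content, which makes certain reasoning skills hard for models to learn.
Result: On curated benchmarks, VLMs perform poorly on the reasoning types suppressed by reporting bias (spatial, temporal, negation, counting). Scaling data size, model size, or multilingual training does not lead to the emergence of these skills, whereas training on annotations collected specifically to obtain tacit information is effective.
Insight: The paper analyzes the effect of reporting bias on VLM reasoning systematically from the perspective of pragmatics and shows empirically that data scale alone cannot overcome this bias. It argues for more deliberate data annotation and collection strategies rather than relying on scale.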
Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., “at the game today!” is a more likely caption than “a photo of 37 people standing behind a field”. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
cs.CV [Back]
[22] AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction cs.CV | cs.AIPDF
Hanyang Liu, Rongjun Qin
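TL;DR: This paper presents AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. It targets dynamic aerial 4D reconstruction under single-view capture, wide spatial range, and small objects with large motion. A Monocular Geometry Lifting module reconstructs reliable static and dynamic geometry, and a Physics-Guided Optimization module uses differentiable ground-support, upright-stability, and trajectory-smoothness priors to turn ambiguous image cues into physically consistent motion, yielding reconstructions with stable geometry and coherent temporal evolution.
Details
Motivation: Existing 4D scene reconstruction methods are limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects that have a small spatial footprint and large motion disparity. These conditions cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction an ill-posed problem.
Result: Experiments on synthetic and real UAV scenes show that AeroDGS outperforms existing state-of-the-art methods and achieves superior reconstruction fidelity in dynamic aerial environments.
Insight: The novelty is a unified framework that combines monocular geometry lifting with physics-guided optimization, integrating physical-consistency priors (ground support, upright stability, trajectory smoothness) into differentiable optimization to resolve the inherent ambiguity of monocular reconstruction. The authors also build a real-world UAV dataset covering various altitudes and motion conditions for evaluating dynamic aerial reconstruction.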
Abstract: Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.
[23] Vision Transformers Need More Than Registers cs.CVPDF
Cheng Shi, Yizhou Yu, Sibei Yang
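TL;DR: Through systematic analysis, this paper finds that Vision Transformers (ViTs) exhibit a lazy aggregation behavior: they use semantically irrelevant background patches as shortcuts to represent global semantics, a behavior driven by global attention and coarse-grained semantic supervision. To address it, the authors selectively integrate patch features into the CLS token, which reduces the influence of background-dominated shortcuts and improves performance across 12 benchmarks.
Details
Motivation: Artifacts in ViTs are widely observed across supervision paradigms and downstream tasks, but their underlying mechanism has not been sufficiently explained. This paper aims to uncover the origin of these artifacts and provide a remedy.
Result: The proposed method consistently improves performance across 12 benchmarks under label supervision, text supervision, and self-supervision.
Insight: The novelty lies in identifying the lazy aggregation behavior of ViTs and mitigating the background-shortcut problem through selective feature integration, which offers a new perspective on ViT behavior.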
Abstract: Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
[24] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence cs.CVPDF
Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander
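TL;DR: This paper points out that CLIP, because its pretraining relies on short captions, aligns only coarsely on complex scenes and dense descriptions. It further finds that long captions usually open with a one-sentence summary, so that during training attention concentrates on the opening sentence and early tokens. The authors propose DeBias-CLIP, which removes the summary sentence and applies sentence sub-sampling and text token padding to spread the supervision signal, improving long-text retrieval.
Details
Motivation: CLIP learns transferable multimodal features through image-text contrastive learning on internet-scale data, but its pretraining is dominated by short captions. This biases the model toward simple descriptions of salient objects and leads to coarse alignment on complex scenes and dense descriptions. Recent work mitigates this by fine-tuning on small long-caption datasets, but the authors find that both human- and LLM-generated long captions typically begin with a one-sentence summary, which acts as a shortcut during training and weakens alignment over the rest of the caption.
Result: DeBias-CLIP achieves state-of-the-art long-text retrieval, also improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
Insight: The paper identifies a common bias, the summary sentence in long captions acting as a training shortcut, and proposes DeBias-CLIP, which spreads supervision by removing the summary, sub-sampling sentences, and padding tokens. This improves how the model distributes attention over long text and how well it aligns with it, and it is a simple and effective bias-mitigation strategy.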
Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP’s pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
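The caption-side preprocessing described above can be pictured with a rough sketch. It is not the DeBias-CLIP code: the regex-based sentence splitter, the `keep` parameter, and the function names are assumptions made for illustration, and the text token padding step is deliberately left out because the abstract does not specify how it is done.

```python
import random
import re


def split_sentences(caption):
    """Naive sentence splitter on ., !, ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def debias_caption(caption, keep=2, rng=None):
    """Drop the opening summary sentence, then sub-sample the remaining ones.

    Sampled sentences keep their original relative order. A caption with a
    single sentence is returned unchanged, since nothing else remains.
    """
    rng = rng or random.Random(0)
    sentences = split_sentences(caption)
    detail = sentences[1:] if len(sentences) > 1 else sentences
    k = min(keep, len(detail))
    chosen = sorted(rng.sample(range(len(detail)), k))
    return " ".join(detail[i] for i in chosen)


if __name__ == "__main__":
    long_caption = (
        "A busy street market. Vendors sell fruit on the left. "
        "A red bus waits at the light. Rain clouds gather above."
    )
    print(debias_caption(long_caption, keep=2, rng=random.Random(1)))
```

Drawing a fresh random subset each time a caption is seen would expose different detail sentences at the start of the text, which is the effect the abstract attributes to sentence sub-sampling.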
[25] SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read cs.CV | cs.LGPDF
Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han
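TL;DR: This paper proposes SimpleOCR, a training strategy that renders text queries directly onto images to form visualized questions, forcing multimodal large language models to activate their visual text extraction ability and addressing the "modality laziness" they show on visually grounded tasks.
Details
Motivation: To diagnose whether multimodal large language models genuinely "read" the text in images or merely rely on parametric shortcuts in the text prompt, and to expose the resulting "modality laziness".
Result: On four representative OOD benchmarks, SimpleOCR surpasses the base model by 5.4% and GRPO trained on original images by 2.7%, using only 8.5K samples (30x fewer than recent RL-based methods). It also integrates seamlessly with advanced RL strategies such as NoisyRollout for complementary gains.
Insight: The paper introduces the Visualized-Question setting as a diagnostic tool and proposes a plug-and-play training strategy that renders text in randomized styles to force the model to optimize its visual pathway. Without architectural changes, this markedly improves robustness and data efficiency on visual text understanding tasks.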
Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely “read” text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated “modality laziness.” To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.
[26] Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge cs.CVPDF
Giuseppe Lando, Rosario Forte, Antonino Furnari
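TL;DR: This paper studies the feasibility of running multimodal large language models (MLLMs) on edge devices for real-time online episodic memory question answering. An asynchronous pipeline with a descriptor thread and a question answering thread converts the video stream into a lightweight textual memory in real time and reasons over it. On resource-constrained edge hardware, it achieves performance competitive with a cloud-based solution.
Details
Motivation: Cloud offloading raises privacy and latency concerns in settings such as wearable assistants. The paper explores privacy-preserving, low-latency episodic memory question answering on edge devices.
Result: On the QAEgo4D-Closed benchmark, a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s, and a local enterprise-grade server reaches 54.40% accuracy with a TTFT of 0.88s. By comparison, a cloud-based solution obtains 56.00% accuracy. The results indicate that the edge approach is competitive.
Insight: The novelty lies in integrating streaming constraints into the question answering pipeline and using an asynchronous two-thread architecture (descriptor thread and QA thread) for real-time processing. The work shows that, under strict resource limits, edge-deployed MLLMs can approach the performance of cloud solutions, offering a viable option for privacy-sensitive real-time applications.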
Abstract: We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate an implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
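The two-thread structure in this abstract can be sketched with Python's standard `threading` and `queue` modules. This is only a toy under stated assumptions, not the authors' pipeline: `describe` stands in for an MLLM captioner, keyword matching stands in for MLLM reasoning over the textual memory, and all function names are hypothetical.

```python
import queue
import threading


def describe(t, frame):
    """Stand-in for an MLLM that captions a frame (assumption)."""
    return f"t={t}: {frame}"


def descriptor_worker(frames, memory, lock):
    """Continuously append lightweight text descriptions to shared memory."""
    for t, frame in enumerate(frames):
        line = describe(t, frame)
        with lock:
            memory.append(line)


def qa_worker(questions, answers, memory, lock):
    """Answer queued questions from a snapshot of the textual memory."""
    while True:
        question = questions.get()
        if question is None:  # sentinel: no more questions
            break
        with lock:
            snapshot = list(memory)
        hits = [line for line in snapshot if question.lower() in line.lower()]
        answers.append(hits[-1] if hits else "not found")


def run_demo(frames, queries):
    memory, answers = [], []
    lock = threading.Lock()
    questions = queue.Queue()
    desc = threading.Thread(target=descriptor_worker, args=(frames, memory, lock))
    qa = threading.Thread(target=qa_worker, args=(questions, answers, memory, lock))
    desc.start()
    qa.start()
    desc.join()  # demo only: wait for all frames so the answers are deterministic
    for item in queries:
        questions.put(item)
    questions.put(None)
    qa.join()
    return answers


if __name__ == "__main__":
    stream = ["keys on table", "person opens door", "keys in drawer"]
    print(run_demo(stream, ["keys", "window"]))
```

In a live setting the questions would arrive while the descriptor thread is still running, and the QA thread would answer from whatever memory exists at that moment; the `desc.join()` before enqueueing is there only to keep the demo reproducible.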
[27] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation cs.CV | cs.IRPDF
Raiyan Jahangir, Nafiz Imtiaz Khan, Amritanand Sudheerkumar, Vladimir Filkov
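TL;DR: This paper presents MammoWise, a multi-model local RAG (Retrieval Augmented Generation) pipeline for mammography report generation. Using open-source vision-language models (VLMs) with zero-shot, few-shot, and Chain-of-Thought prompting and optional multimodal RAG, the system generates structured reports from mammograms and performs multi-task classification (for example, BI-RADS assessment and breast density).
Details
Motivation: Existing VLM-based mammography report generation systems depend on closed cloud services or tightly coupled architectures, which limits privacy, reproducibility, and adaptability.
Result: MedGemma, LLaVA-Med, and Qwen2.5-VL are evaluated on the VinDr-Mammo and DMID datasets. Report generation quality (BERTScore, ROUGE-L) is consistently strong and improves further with few-shot prompting and RAG. Classification is feasible but sensitive to the choice of model and dataset. After parameter-efficient fine-tuning (QLoRA), MedGemma reaches accuracies of 0.7545 for BI-RADS classification, 0.8840 for breast density, and 0.9341 for calcification while preserving report quality.
Insight: The novelty is a local, extensible, unified workflow framework that supports multiple open-source VLMs and datasets and integrates multimodal RAG to supply case-specific context. Viewed objectively, combining parameter-efficient fine-tuning (QLoRA) with RAG and varied prompting strategies offers a practical way to deploy customizable, reproducible AI reporting systems in privacy-sensitive medical settings.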
Abstract: Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.
[28] Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models cs.CV | cs.AIPDF
Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif, Juena Ahmed Noshin, Md Ashikur Rahman
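TL;DR: This paper addresses object hallucination, a common problem in vision-language models (VLMs), and proposes Spatial Credit Redistribution (SCR). SCR works at inference time with no additional training: it redistributes hidden-state activation from high-attention source patches to their context, mitigating the credit collapse in which activation credit in early transformer layers concentrates on sparse visual patches. Evaluated on the POPE and CHAIR benchmarks across several model families, it substantially lowers hallucination rates while preserving generation quality, at very small computational cost.
Details
Motivation: VLMs frequently hallucinate objects that are absent from the input image. The authors attribute this failure to spatial credit collapse: in early transformer layers, activation credit concentrates on sparse visual patches, which suppresses contextual evidence and increases reliance on language priors.
Result: On POPE-Adversarial, SCR reduces hallucination by about 4.7-6.0 percentage points; it cuts CHAIR-s by 3.7-5.2 percentage points (42-51% relative) and CHAIR-i by 2.7-4.4 percentage points (44-58% relative), while keeping the CIDEr drop within 0.8 percentage points. SCR adds only 43-56 ms of overhead, well below OPERA, VCD, and OVCD, and Pareto-dominates them on both hallucination rate and CIDEr.
Insight: The core contribution is a training-free, inference-time spatial credit redistribution mechanism that, guided by low-entropy inputs, moves activation from high-attention source patches to their context and thus targets credit collapse directly as the root cause. The method is computationally efficient and suitable for real-time settings. Ablations confirm that attention-guided source patch selection is essential: random selection sharply reduces the benefit, which supports credit collapse as a key driver of hallucination.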
Abstract: Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.
[29] Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning cs.CVPDF
Guoyizhe Wei, Yang Jiao, Nan Xi, Zhishen Huang, Jingjing Meng
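TL;DR: This paper proposes Pix2Key for controllable open-vocabulary image retrieval. The method represents both queries and candidate images as open-vocabulary visual dictionaries and performs intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further strengthens the dictionary representation using only images, improving fine-grained attribute understanding without supervision specific to composed image retrieval.
Details
Motivation: In composed image retrieval, classic fusion pipelines can lose fine-grained cues, while zero-shot approaches may miss implicit user intent and return repetitive results.
Result: On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields a further 2.3-point gain while improving intent consistency and maintaining high list diversity.
Insight: The novelty is representing queries and candidate images as open-vocabulary visual dictionaries, which unifies intent matching and diversity handling. The self-supervised component V-Dict-AE strengthens fine-grained understanding from image data alone, without task-specific supervision, making it an efficient representation learning approach.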
Abstract: Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
[30] DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI cs.CV | cs.AIPDF
Agamdeep S. Chopra, Caitlin Neher, Tianyi Ren, Juampablo E. Heras Rivera, Mehmet Kurt
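TL;DR: DisQ-HNet is an interpretable multimodal image synthesis framework that synthesizes tau-PET images from T1-weighted and FLAIR MRI. It combines a vector-quantized encoder guided by Partial Information Decomposition with a Half-UNet decoder, aiming to reveal how each modality contributes to the prediction while preserving anatomical detail and disease-relevant signal.
Details
Motivation: Tau-PET is costly and of limited availability. The work synthesizes substitute images from MRI and makes the synthesis more interpretable by clarifying the respective roles of T1 and FLAIR MRI in the prediction.
Result: Compared with multiple baselines (VAE, VQ-VAE, UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal in downstream Alzheimer's disease tasks such as Braak staging, tau localization, and classification.
Insight: The contributions are a vector-quantized encoder guided by Partial Information Decomposition that partitions latent information into redundant, unique, and complementary components, and a Half-UNet decoder that preserves anatomical detail through pseudo-skip connections conditioned on structural edge cues instead of directly reusing encoder features. Together these improve interpretability and enable modality-specific attribution analysis.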
Abstract: Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer’s disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.
[31] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation cs.CV | cs.AIPDF
Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai
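TL;DR: This paper proposes DrivePTS, a progressive learning framework for driving scene generation that uses textual and structural enhancement to address inter-condition dependency, insufficient semantic detail, and blurry structure in existing methods. The framework adopts a progressive learning strategy to reduce the mutual dependency between geometric conditions, uses a vision-language model to generate multi-view hierarchical descriptions as fine-grained textual guidance, and introduces a frequency-guided structure loss to increase sensitivity to high-frequency elements.
Details
Motivation: Existing diffusion-based driving scene generation methods have three problems: implicit dependency between geometric conditions causes generation failures, brief semantic descriptions lead to weak background modeling, and a denoising loss with uniform spatial weighting neglects foreground structural detail, causing visual distortion and blur.
Result: Extensive experiments show that DrivePTS achieves state-of-the-art fidelity and controllability when generating diverse driving scenes and successfully generates rare scenes where prior methods fail, demonstrating strong generalization.
Insight: The contributions are: 1) a progressive learning strategy, reinforced by an explicit mutual information constraint, that decouples the dependency between geometric conditions; 2) a vision-language model that generates multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance; 3) a frequency-guided structure loss that strengthens the model's sensitivity to high-frequency structural detail and improves foreground structural fidelity.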
Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model’s sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
[32] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model cs.CV | cs.AIPDF
Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz
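TL;DR: BetterScene is a method for improving novel view synthesis quality for diverse real-world scenes from extremely sparse, unconstrained photos. Built on the pretrained Stable Video Diffusion model, it optimizes the VAE module through temporal equivariance regularization and a representation aligned with a vision foundation model, and it combines this with a 3D Gaussian Splatting model to generate continuous, artifact-free, consistent novel views.
Details
Motivation: Existing diffusion-based novel view synthesis methods typically fine-tune only the UNet module and keep the other components frozen. Even with geometry-aware regularization such as depth or semantic conditions, this still leads to inconsistent details and artifacts. This paper addresses the problem by studying the latent space of the diffusion model and optimizing the VAE module.
Result: Evaluated on the challenging DL3DV-10K dataset, BetterScene shows superior performance compared with state-of-the-art methods.
Insight: The novelty lies in optimizing the VAE module of the diffusion model through temporal equivariance regularization and a vision-foundation-model-aligned representation, which improves detail consistency and reduces artifacts in novel view synthesis. Using 3D Gaussian Splatting to render features is also an effective architectural choice.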
Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
[33] Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery cs.CVPDF
Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan
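TL;DR: This paper proposes SATtxt, a vision-language foundation model for satellite imagery. The model needs only RGB input at inference but retains spectral information learned during training. The method has two stages: spectral representation distillation first transfers multi-spectral priors from a frozen teacher model to an RGB student model; instruction-augmented large language models are then used for spectrally grounded alignment, which aligns the visual space with an expressive LLM embedding space.
Details
Motivation: The paper addresses two main obstacles to applying vision-language foundation models to satellite imagery: multi-spectral inputs are informative but suffer from band redundancy and misalignment, and conventional CLIP-style text encoders have limited semantic expressiveness and weak fine-grained alignment.
Result: On the EuroSAT, BigEarthNet, and ForestNet datasets, SATtxt improves over baselines by an average of 4.2% in zero-shot classification, 5.9% in retrieval, and 2.7% in linear probing.
Insight: The novelty is a two-stage framework: spectral representation distillation retains spectral information under RGB-only inference, and alignment with instruction-augmented LLMs improves semantic expressiveness and fine-grained alignment, giving an efficient path toward spectrum-aware vision-language learning for Earth observation.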
Abstract: Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
[34] Coded-E2LF: Coded Aperture Light Field Imaging from Events cs.CVPDF
Tomoya Tsuchida, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii, Hajime Nagahara
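TL;DR: This paper proposes Coded-E2LF, a computational imaging technique that acquires a 4D light field using a coded aperture and an event-only camera. The method relies solely on event data and needs no intensity images, which relaxes hardware constraints, and it is the first to show that a 4D light field with pixel-level accuracy can be reconstructed from events alone.
Details
Motivation: Earlier light field imaging systems of this kind had to capture both events and intensity images; the paper removes this restriction with purely event-driven light field reconstruction, which simplifies hardware requirements.
Result: Implemented on real imaging hardware, the method captures real 3D scenes and reconstructs a 4D light field with pixel-level accuracy, which to the authors' knowledge is the first time this accuracy has been reached from event data alone.
Insight: Contributions include a purely event-driven light field reconstruction method, a clarification of the key role a black pattern plays in aperture coding patterns, and a combination of theoretical support with practical improvements, offering a new direction for event cameras in computational imaging.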
Abstract: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.
[35] Instruction-based Image Editing with Planning, Reasoning, and Generation cs.CV | cs.AIPDF
Liya Ji, Chenyang Qi, Qifeng Chen
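TL;DR: This paper proposes a new instruction-based image editing method that introduces multi-modal chain-of-thought prompts to decompose the editing task into three steps (planning, reasoning, and generation), improving both the understanding of complex scenes and generation quality.
Details
Motivation: Existing instruction-based image editing methods typically rely on understanding models with a single modality, which limits editing quality, especially in complex scenes. The paper aims to bridge understanding and generation with a multi-modal model so that more complex editing tasks can be handled.
Result: Extensive experiments show that the method has competitive editing ability on complex real-world images.
Insight: The novelty lies in splitting instruction editing into three subtasks: chain-of-thought planning, editing region reasoning, and generation. A large language model handles planning, a multi-modal large language model is trained for region reasoning, and a hint-guided editing network is built on a large-scale text-to-image diffusion model, yielding a tighter coupling of understanding and generation.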
Abstract: Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we individually separate the instruction editing task with the multi-modality chain of thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model could reason the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, a hint-guided instruction-based editing network is proposed for editing image generations based on the sizeable text-to-image diffusion model to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.
[36] CRAG: Can 3D Generative Models Help 3D Assembly? cs.CVPDF
Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang
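TL;DR: This paper proposes CRAG, which reformulates 3D assembly as a joint assembly-and-generation problem. By generating complete shapes and predicting part poses simultaneously, it overcomes the inability of existing methods to synthesize missing geometry.
Details
Motivation: Existing 3D assembly methods focus only on pose estimation via rigid transformations, whereas human assembly couples structural reasoning with holistic shape inference, so a method that handles assembly and generation together is needed.
Result: CRAG achieves state-of-the-art performance on in-the-wild objects with diverse geometries, varying part counts, and missing pieces.
Insight: The novelty lies in treating assembly and generation as mutually reinforcing processes: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly, making it possible to synthesize missing geometry.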
Abstract: Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.
[37] Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models cs.CVPDF
Siqi Lu, Wanying Xu, Yongbin Zheng, Wenting Luan, Peng Sun
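TL;DR: This paper proposes the Multimodal Weight Allocation Module (MWAM), a plug-and-play component that addresses the sharp performance drop multimodal models suffer when modalities are missing. The method quantifies modality preference through frequency-domain analysis and dynamically adjusts each modality branch's contribution during training for more balanced learning.
Details
Motivation: Multimodal models are fragile when modalities are missing, and performance can degrade severely. The authors attribute this fragility to an imbalanced learning process in which the model develops an implicit preference for certain modalities, leaving the others under-optimized.
Result: Extensive experiments show that MWAM integrates seamlessly into diverse architectural backbones (such as CNN-based and ViT-based models) and brings consistent gains across a wide range of tasks and modality combinations. The module improves the base model and also further improves state-of-the-art methods that target the missing-modality problem.
Insight: The core idea is to quantify the dominance relationship between modalities from a frequency-domain perspective and to design a dynamic weight allocation module on that basis. The method is simple, efficient, and plug-and-play, and it generally improves the robustness of multimodal models.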
Abstract: Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also manifests as further performance improvements to state-of-the-art methods addressing the missing modality problem.
[38] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache cs.CVPDF
Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang
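TL;DR: This paper proposes DPCache, a training-free acceleration framework that casts diffusion sampling acceleration as a global path planning problem. It builds a path-aware cost tensor to quantify the error of skipping timesteps and uses dynamic programming to select an optimal sequence of key timesteps. At inference, full computation runs only at the key timesteps and intermediate outputs are predicted efficiently from cached features, giving high-quality acceleration.
Details
Motivation: Diffusion models excel at image and video generation, but the heavy computational cost of multi-step iterative sampling hinders practical deployment. Existing caching-based acceleration methods use fixed or locally adaptive schedules that ignore the global structure of the denoising trajectory, which tends to cause error accumulation and visual artifacts.
Result: Extensive experiments on DiT, FLUX, and HunyuanVideo show that DPCache achieves substantial acceleration with minimal quality loss. On FLUX, for example, it outperforms prior methods by +0.031 ImageReward at 4.87× speedup and even surpasses the full-step baseline by +0.028 ImageReward at 3.54× speedup.
Insight: The core novelty is formalizing diffusion sampling acceleration as a global path planning problem and introducing a path-aware cost tensor with dynamic programming for globally optimal scheduling. This gives a training-free view of acceleration that accounts for global trajectory structure, and the path planning idea may generalize to other sequential decision tasks.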
Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
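The key-timestep selection described in the DPCache abstract can be illustrated with a small dynamic program. The sketch below is a simplification and not the authors' algorithm: it assumes a pairwise cost `cost[i][j]` (the error of computing fully at step `i` and predicting from cache until the next full step `j`), whereas the paper's Path-Aware Cost Tensor is additionally conditioned on the preceding key timestep. The function name and the cost values are hypothetical.

```python
import math


def select_key_timesteps(cost, num_keys):
    """Choose num_keys key timesteps (first and last always included)
    that minimize the summed skip cost along the path.

    cost[i][j] for i < j is a hypothetical skip error between consecutive
    key timesteps i and j. Requires 2 <= num_keys <= len(cost).
    """
    n = len(cost)
    # best[k][j]: minimal cost of a path that uses k keys and ends at j
    best = [[math.inf] * n for _ in range(num_keys + 1)]
    prev = [[-1] * n for _ in range(num_keys + 1)]
    best[1][0] = 0.0  # every path starts at timestep 0
    for k in range(2, num_keys + 1):
        for j in range(1, n):
            for i in range(j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i
    # Backtrack from the last timestep to recover the chosen keys.
    path, j = [], n - 1
    for k in range(num_keys, 0, -1):
        path.append(j)
        j = prev[k][j]
    return path[::-1], best[num_keys][n - 1]


# Toy cost: squared number of skipped steps between two keys.
toy_cost = [[(j - i - 1) ** 2 if j > i else 0 for j in range(5)] for i in range(5)]
keys, total = select_key_timesteps(toy_cost, 3)
print(keys, total)
```

With the toy cost, spreading the keys evenly is cheapest, which is the behaviour a global schedule is meant to capture and a greedy local rule can miss.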
[39] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing cs.CV | cs.MMPDF
Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang
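TL;DR: This paper proposes a crowdsourcing-based approach to building audio-visual quality assessment (AVQA) datasets and releases YT-NTU-AVQ, currently the largest and most diverse AVQA dataset, with 1,620 user-generated audio-video sequences. The approach comprises a crowdsourced subjective experiment framework, a systematic data preparation strategy, and extended annotations that support model development and multimodal perception research.
Details
Motivation: Existing AVQA datasets are typically small, lack diversity in content and quality, and are annotated only with overall scores, which limits progress in model development and multimodal perception research.
Result: The resulting YT-NTU-AVQ dataset is currently the largest and most diverse AVQA dataset, containing 1,620 sequences, and the crowdsourcing framework achieves reliable annotation across varied environments.
Insight: Contributions include a crowdsourced subjective experiment framework that breaks the constraints of in-lab settings, a systematic data preparation strategy that ensures broad coverage of quality levels and semantic scenarios, and extended annotations that support research on multimodal perception mechanisms. Together these offer a practical recipe for building large, diverse AVQA datasets.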
Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ
[40] Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes cs.CVPDF
Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen
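TL;DR: This paper proposes a monocular open-vocabulary occupancy prediction method for indoor scenes. It adopts a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free) and builds on 3D Language-Embedded Gaussians, which couple fine-grained 3D geometry with language-aligned semantic embeddings in a unified intermediate representation. Because existing methods fail under such weak supervision, the paper introduces an opacity-aware, Poisson-based voxel aggregation method and a Progressive Temperature Decay schedule that strengthens Gaussian-language alignment.
Details
Motivation: Open-vocabulary 3D occupancy is essential for embodied agents that must understand complex indoor environments, but existing methods designed for outdoor driving scenes transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are finer-grained.
Result: In the open-vocabulary setting of the Occ-ScanNet benchmark, the method achieves 59.50 IoU and 21.05 mIoU, surpassing all existing occupancy prediction methods in IoU and leading prior open-vocabulary methods by a large margin in mIoU.
Insight: The novelty lies in: 1) a weak geometric supervision paradigm that needs only binary occupancy labels; 2) 3D Language-Embedded Gaussians as a unified intermediate representation of geometry and semantics; 3) on the geometry side, an opacity-aware, Poisson-based voxel aggregation operator that stabilizes training; 4) on the semantic side, a Progressive Temperature Decay schedule that gradually sharpens opacities during rendering to resolve feature mixing and strengthen language alignment.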
Abstract: Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
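The abstract does not spell out the opacity-aware, Poisson-based aggregation, so the following is only one plausible reading and not the paper's operator. It assumes each Gaussian contributes an opacity-weighted density at the voxel center and uses the standard Poisson fact that the probability of at least one event under intensity λ is 1 − exp(−λ). The isotropic `sigma` and the function name are assumptions made for the sketch.

```python
import math


def voxel_occupancy(voxel_center, gaussians):
    """Occupancy probability of one voxel under a Poisson-style model.

    gaussians: iterable of (mean, sigma, opacity) with an isotropic sigma.
    This parameterization is a simplification chosen for illustration.
    """
    intensity = 0.0
    for mean, sigma, opacity in gaussians:
        sq_dist = sum((c - m) ** 2 for c, m in zip(voxel_center, mean))
        # Opacity-weighted, unnormalized Gaussian response at the voxel center.
        intensity += opacity * math.exp(-0.5 * sq_dist / sigma ** 2)
    # Poisson: P(at least one event) = 1 - exp(-intensity).
    return 1.0 - math.exp(-intensity)


print(voxel_occupancy((0.0, 0.0, 0.0), [((0.0, 0.0, 0.0), 1.0, 1.0)]))
```

Because contributions add inside the exponent, the output stays in [0, 1) however many Gaussians overlap, which is one reason such a form can be easier to train against binary labels than a plain sum.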
[41] ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport cs.CV | cs.AIPDF
Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
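TL;DR: This paper presents ViCLIP-OT, the first foundation vision-language model for Vietnamese image-text retrieval. The model combines CLIP-style contrastive learning with a loss called SIGROT (Similarity-Graph Regularized Optimal Transport) to enhance global cross-modal consistency and mitigate the modality gap.
Details
Motivation: Existing vision-language models are optimized mainly for high-resource languages and perform poorly in low-resource settings such as Vietnamese. The paper aims to build a dedicated, stronger foundation model for Vietnamese image-text retrieval.
Result: Experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) show that ViCLIP-OT outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, average Recall@K reaches 67.34%, 5.75 percentage points above CLIP; in zero-shot evaluation on Crossmodal-3600, it exceeds CLIP by 11.72 percentage points.
Insight: The main novelty is integrating a similarity-graph regularized optimal transport (SIGROT) loss into vision-language model training to align cross-modal representations more effectively and reduce the modality gap. This offers an effective and scalable strategy for cross-modal retrieval in low-resource languages.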
Abstract: Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
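To make the optimal transport ingredient concrete, here is a minimal sketch of a generic entropic OT plan computed with Sinkhorn iterations over an image-caption similarity matrix. It is not the SIGROT loss itself, whose similarity-graph regularizer is not described in the abstract; the cost `1 - similarity`, uniform marginals, and the parameter values are all assumptions.

```python
import math


def sinkhorn_plan(similarity, epsilon=0.1, num_iters=200):
    """Entropic OT plan between images (rows) and captions (columns).

    similarity: matrix of cosine similarities in [-1, 1].
    Uniform marginals and cost = 1 - similarity are assumptions.
    """
    n, m = len(similarity), len(similarity[0])
    # Gibbs kernel exp(-cost / epsilon): high similarity means cheap transport.
    kernel = [[math.exp((s - 1.0) / epsilon) for s in row] for row in similarity]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


plan = sinkhorn_plan([[1.0, 0.0], [0.0, 1.0]])
print(plan)
```

Unlike a row-wise softmax, the plan is constrained on both rows and columns, so every caption must be matched to some image. That global constraint is the kind of batch-level consistency the abstract refers to.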
[42] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses cs.CV | cs.AIPDF
Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu
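TL;DR: The paper introduces SUPERGLASSES, the first visual question answering benchmark built on data actually collected by smart glasses. To address the weak performance of existing vision-language models in this setting, it also designs SUPERLENS, a multimodal smart glasses agent that integrates object detection, query decoupling, and multimodal web search for retrieval-augmented answer generation and achieves state-of-the-art performance on the benchmark, surpassing GPT-4o.
Details
Motivation: Existing vision-language models adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets. These datasets lack the variety and realism of actual smart glasses usage and do not address its core challenge: the object of interest must be identified accurately before any external knowledge is retrieved.
Result: Evaluating 26 representative VLMs on SUPERGLASSES reveals significant performance gaps. The proposed SUPERLENS agent achieves state-of-the-art performance on the benchmark, surpassing GPT-4o by 2.19 percent.
Insight: The novelty lies in building the first VQA benchmark based entirely on real smart glasses data and in proposing a task-specific solution for the smart glasses setting, centered on a retrieval-augmented generation framework that integrates automatic object detection, query decoupling, and multimodal web search. The work underlines the importance of adapting models to the usage scenario.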
Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
[43] No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings cs.CV | cs.CRPDF
Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon
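TL;DR: This paper proposes MoFit, a caption-free membership inference attack framework for assessing the privacy risk of latent diffusion models. By optimizing synthetic conditioning inputs that are overfitted to the target model's generative manifold, it separates training-set members from non-members without ground-truth captions.
Details
Motivation: Existing membership inference attacks rely on ground-truth captions, but in realistic scenarios only images are available and their text annotations are undisclosed, so prior methods perform poorly when given only captions generated by a vision-language model. The paper tackles membership inference without captions.
Result: Experiments across multiple datasets and diffusion models show that MoFit consistently outperforms prior VLM-conditioned baselines and is competitive with caption-dependent methods.
Insight: The novelty is a two-stage method, model-fitted surrogate optimization followed by surrogate-driven embedding extraction, which constructs synthetic conditioning inputs overfitted to the target model's generative manifold to widen the gap in conditional loss between members and non-members, enabling effective membership inference without ground-truth captions.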
Abstract: Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model’s generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model’s unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
[44] SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs cs.CV | cs.AIPDF
Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li
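TL;DR: This paper proposes SoPE, a spherical coordinate-based positional embedding that enhances the spatial perception of 3D large vision-language models. It maps point-cloud tokens into a spherical coordinate space to model spatial locations and directional angles in a unified way, and introduces a multi-scale frequency mixing strategy to fuse features across frequency domains.
Details
Motivation: The Rotary Position Embedding (RoPE) inherited by existing 3D LVLMs fails to preserve three-dimensional spatial structure when encoding 3D tokens, and its relative distance computation ignores angular dependencies, limiting the model's ability to capture directional variations in visual representations.
Result: Experiments on multiple 3D scene benchmarks validate the method's effectiveness, and real-world deployment experiments further demonstrate strong generalization.
Insight: The novelty lies in mapping point-cloud token indices into a spherical coordinate space to preserve geometric structure and in the multi-scale frequency mixing strategy. Viewed objectively, modeling position and angle jointly improves the spatial awareness and representational consistency of 3D LVLMs.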
Abstract: 3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model’s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
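The coordinate change underlying SoPE can be shown in a few lines. The sketch below converts a Cartesian point to (radius, polar angle, azimuth) using the physics convention; how SoPE then feeds these values into rotary embeddings is not described in the abstract and is not attempted here. The function name and the angle convention are assumptions.

```python
import math


def to_spherical(x, y, z):
    """Convert a Cartesian point to (r, theta, phi).

    theta is the polar angle from the +z axis in [0, pi];
    phi is the azimuth in the x-y plane in (-pi, pi].
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    # Clamp to guard against rounding pushing the ratio outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi


print(to_spherical(0.0, 3.0, 0.0))
```

Two points at the same distance but in different directions share `r` and differ in `theta` or `phi`, which is the angular information the abstract says a purely distance-based relative encoding overlooks.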
[45] HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models cs.CVPDF
Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao
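TL;DR: This paper proposes HulluEdit, a single-pass, reference-free intervention framework that mitigates object hallucination in large vision-language models through orthogonal subspace editing. It decomposes hidden states into three orthogonal subspaces (visual evidence, conflicting priors, and residual uncertainty) and selectively suppresses hallucinatory patterns without affecting visual grounding, achieving state-of-the-art hallucination reduction on multiple benchmarks.
Details
Motivation: Object hallucination in large vision-language models seriously hinders reliable deployment. Existing methods struggle to balance efficiency and accuracy: they usually need expensive reference models and multiple forward passes, or apply static edits that may suppress genuine visual evidence.
Result: HulluEdit achieves state-of-the-art hallucination reduction on benchmarks such as POPE and CHAIR, preserves general capabilities on MME, keeps inference efficient, and outperforms contrastive decoding and static subspace editing baselines.
Insight: The core novelty is orthogonal subspace editing, which mathematically guarantees that edits to the prior subspace leave the visual component entirely unaffected. This allows hallucinations to be suppressed selectively without disturbing visual grounding and offers a new path toward more trustworthy large vision-language models.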
Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
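The guarantee that edits in the prior subspace leave the visual component untouched follows from orthogonality, and a toy example makes this visible. The sketch below is a generic illustration rather than HulluEdit's procedure: it assumes the visual and prior directions are given, orthogonalizes the prior directions against the visual ones with Gram-Schmidt, and subtracts the prior component from a hidden state. All function names and the `strength` parameter are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, against=()):
    """Gram-Schmidt basis for span(vectors), orthogonal to the orthonormal `against`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in list(against) + basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip directions already covered
            basis.append([wi / norm for wi in w])
    return basis


def project(h, basis):
    """Orthogonal projection of h onto the span of an orthonormal basis."""
    out = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def suppress_prior(h, visual_dirs, prior_dirs, strength=1.0):
    """Remove (a fraction of) the prior component of hidden state h."""
    visual = orthonormalize(visual_dirs)
    prior = orthonormalize(prior_dirs, against=visual)
    prior_part = project(h, prior)
    edited = [hi - strength * pi for hi, pi in zip(h, prior_part)]
    return edited, visual, prior


h = [1.0, 2.0, 3.0]
edited, visual, prior = suppress_prior(h, [[1.0, 0.0, 0.0]], [[1.0, 1.0, 0.0]])
print(edited, project(edited, visual), project(h, visual))
```

Since every prior basis vector is orthogonal to every visual basis vector, subtracting any multiple of the prior component cannot change the projection onto the visual subspace, which is the property the abstract describes.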
[46] Asymmetric Idiosyncrasies in Multimodal Models cs.CVPDF
Muzi Tao, Chufan Shi, Huijuan Wang, Shengbang Tong, Xuezhe Ma
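TL;DR: This paper studies stylistic idiosyncrasies in image captioning models and their effect on text-to-image models. It trains neural networks to identify the originating caption model from either a generated caption or the corresponding image. Text classification reaches 99.70% accuracy, showing that captioning models have distinctive stylistic signatures, but these signatures largely disappear in generated images: classification accuracy drops to at most 50% even for the state-of-the-art Flux model.
Details
Motivation: The paper investigates how stylistic idiosyncrasies in captioning models affect downstream text-to-image models and quantifies this cross-modal discrepancy.
Result: Text classification accuracy reaches 99.70%, while image classification accuracy drops to at most 50% (with the Flux model), indicating that generated images fail to preserve key variations in the captions, such as level of detail, emphasis on color and texture, and the distribution of objects within a scene.
Insight: The novelty is a classification-based framework that quantifies both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems, exposing the limits of cross-modal information transfer.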
Abstract: In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.
[47] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation cs.CV | cs.AIPDF
Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo
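TL;DR: This paper proposes Alignment-Aware Masked Learning (AML), a training strategy that improves referring image segmentation (RIS). It explicitly estimates pixel-level vision-language alignment, filters out poorly aligned regions during optimization, and focuses on trustworthy cues, achieving state-of-the-art performance on the RefCOCO datasets and improving robustness to diverse descriptions and scenarios.
Details
Motivation: Referring image segmentation aims to segment the target object in an image given a natural language description. Existing methods fall short in explicitly modeling and exploiting pixel-level vision-language alignment, which limits performance.
Result: The method achieves state-of-the-art (SOTA) performance on the RefCOCO datasets and shows improved robustness to diverse descriptions and scenarios.
Insight: The novelty is an alignment-aware masked learning strategy that explicitly estimates and filters pixel-level alignment information, steering the model toward well-aligned regions and improving segmentation accuracy and robustness.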
Abstract: Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.
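The abstract does not say how AML estimates alignment or where the filtering is applied, so the following is one possible reading and not the paper's method. It assumes alignment is the cosine similarity between a pixel feature and the text embedding, and that pixels below a threshold are excluded from a binary cross-entropy loss. The function names and the threshold are hypothetical.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def masked_bce(pixel_feats, text_emb, probs, labels, threshold=0.0):
    """Mean binary cross-entropy over pixels whose alignment passes the threshold.

    pixel_feats: one feature vector per pixel; probs: predicted foreground
    probabilities; labels: 0/1 ground truth. Returns 0.0 if no pixel is kept.
    """
    total, kept = 0.0, 0
    for feat, p, y in zip(pixel_feats, probs, labels):
        if cosine(feat, text_emb) < threshold:
            continue  # poorly aligned pixel: excluded from the loss
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0


loss = masked_bce([[1.0, 0.0], [-1.0, 0.0]], [1.0, 0.0], [0.5, 0.01], [1, 1])
print(loss)
```

In the example the second pixel points away from the text embedding, so its large error does not reach the loss and only the well-aligned pixel drives the update.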
[48] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation cs.CVPDF
Fengming Liu, Tat-Jen Cham, Chuanxia Zheng
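TL;DR: This paper presents SPATIALALIGN, a framework that improves how text-to-video (T2V) models depict the dynamic spatial relationships (DSR) specified in text prompts. It fine-tunes T2V models with a zeroth-order regularized Direct Preference Optimization (DPO) method, introduces DSR-SCORE, a geometry-based metric that quantifies how well generated videos align with the specified DSRs, and builds a dataset of text-video pairs with diverse DSRs. Experiments show that the fine-tuned model significantly outperforms the baseline in spatial relationship alignment.
Details
Motivation: Existing text-to-video generators usually prioritize aesthetic quality but often ignore spatial constraints in the generated videos, so they fail to depict the dynamic spatial relationships specified in text prompts accurately.
Result: Extensive experiments show that the model fine-tuned with SPATIALALIGN significantly outperforms the baseline in spatial relationship alignment. The abstract reports no specific numbers.
Insight: Main contributions: 1) a self-improvement framework focused on aligning dynamic spatial relationships; 2) DSR-SCORE, a geometry-based quantitative metric that is a step forward from prior work relying on vision-language models (VLMs) for evaluation; 3) a dedicated DSR dataset to support research; 4) zeroth-order regularized DPO for model fine-tuning.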
Abstract: Most text-to-video (T2V) generators prioritize aesthetic quality, but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLM for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released in Link.
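For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It omits the zeroth-order regularization that SPATIALALIGN adds, which the abstract does not specify. How preference pairs are chosen (for instance by ranking samples with DSR-SCORE) is likewise an assumption here, and `beta=0.1` is an arbitrary illustrative value.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*: log-probabilities under the model being fine-tuned.
    ref_logp_*: log-probabilities under the frozen reference model.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls below log(2) once the model favors the preferred sample
# more strongly than the reference model does.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0), math.log(2))
```

The loss depends only on how much more the fine-tuned model prefers the winning sample than the reference does, so it rewards relative improvement and not raw likelihood.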
[49] Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval cs.CVPDF
Yuan-Chih Chen, Chun-Shien Lu
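TL;DR: This paper proposes a unified hidden-code recovery framework that recovers tampered content from deepfakes of natural images and supports factual retrieval. It encodes semantic and perceptual information into a compact hidden-code representation via multi-scale vector quantization and strengthens contextual reasoning with conditional Transformer modules. For systematic evaluation, the authors build the ImageNet-S benchmark. Experiments show promising retrieval and reconstruction performance and compatibility with diverse watermarking pipelines.
Details
Motivation: Research on image authenticity has focused mainly on deepfake detection and localization, while recovering tampered content for factual retrieval remains relatively underexplored.
Result: Extensive experiments on the ImageNet-S benchmark show promising retrieval and reconstruction performance, with full compatibility across diverse watermarking pipelines.
Insight: Contributions include a unified hidden-code recovery framework that combines multi-scale vector quantization with conditional Transformers for encoding and reasoning, and the ImageNet-S benchmark for systematic evaluation. The framework aims to go beyond detection and localization and lay a foundation for general-purpose image recovery.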
Abstract: Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.
[50] TrajTok: Learning Trajectory Tokens enables better Video Understanding cs.CVPDF
Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar
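TL;DR: This paper proposes TrajTok, an end-to-end video tokenizer module that produces object trajectories directly through spatio-temporal pixel clustering and adapts token granularity dynamically to semantic complexity, improving both the efficiency and the performance of video understanding.
Details
Motivation: Tokenization in existing video models (for example, patchification) produces many redundant tokens, limiting efficiency and scalability, while trajectory-based tokenizers depend on complex external segmentation and tracking pipelines that are slow and task-agnostic.
Result: A video CLIP model built on TrajTok (TrajViT2) achieves the best accuracy on classification and retrieval benchmarks with efficiency comparable to the best token-merging methods. Used as a probing head (TrajAdapter) or an alignment connector (TrajVLM), TrajTok performs especially well in long-video reasoning.
Insight: The novelty is co-training the trajectory tokenizer end to end with the downstream model and producing trajectories directly through implicit clustering. Prioritizing downstream adaptability over pixel-perfect segmentation fidelity yields a lightweight, efficient, general-purpose component that improves video understanding.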
Abstract: Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
[51] Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning cs.CVPDF
Taishu Arashima, Hiroshi Kera, Kazuhiko Kawamoto
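TL;DR: This paper proposes a robust human trajectory prediction method based on self-supervised skeleton representation learning. A skeleton representation model is pretrained with masked autoencoding to cope with joints that go missing under occlusion in real-world scenes, making trajectory prediction more robust.
Details
Motivation: Skeleton data in real-world environments often have missing joints caused by occlusion, which significantly degrades the accuracy of existing trajectory prediction methods, so more robust skeleton representations are needed.
Result: Experiments in occlusion-prone scenarios show that the method improves robustness to missing skeleton data without sacrificing prediction accuracy, and consistently outperforms baseline models from clean data through moderate levels of missingness.
Insight: The novelty is introducing self-supervised learning (specifically masked autoencoding) into skeleton representation pretraining to learn features robust to missing joints, which strengthens the downstream trajectory prediction task. It is an effective way to combine representation learning with the robustness needs of a specific task.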
Abstract: Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.
[52] CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation cs.CVPDF
Tong Wang, Yaolei Qi, Siwen Wang, Imran Razzak, Guanyu Yang
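TL;DR: This paper proposes CMSA-Net, a robust video polyp segmentation framework that targets two difficulties: weak semantic discrimination because polyps resemble the surrounding mucosa, and large changes in polyp position and scale across video frames. The network uses a causal multi-scale aggregation module to integrate multi-scale semantic information from past frames in temporal order, and a dynamic multi-source reference strategy to adaptively select reliable reference frames, improving segmentation accuracy while remaining real-time.
Details
Motivation: Video polyp segmentation suffers from weak semantic discrimination, since polyps look similar to the surrounding mucosa, and from large frame-to-frame changes in polyp position and scale that make stable, accurate segmentation difficult.
Result: Extensive experiments on the SUN-SEG dataset show that CMSA-Net achieves state-of-the-art performance with a favorable balance between segmentation accuracy and real-time clinical applicability.
Insight: The novelties are a causal multi-scale aggregation module, which keeps temporal feature propagation in strict time order to reduce noise, and a dynamic multi-source reference strategy, which adaptively selects informative and reliable reference frames, improving feature reliability while preserving real-time inference efficiency.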
Abstract: Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.
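The causal attention idea in the CMSA-Net abstract can be illustrated with a minimal sketch: each frame's feature attends only to itself and earlier frames, so information never flows backward in time. This is plain scaled dot-product attention over per-frame vectors, not the authors' CMA module; the function name, the single-scale features, and the list-based representation are assumptions made for illustration.

```python
import math


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where frame t attends only to frames <= t.

    queries, keys, values: lists of per-frame feature vectors (lists of floats).
    Hypothetical helper for illustration; not the CMA module from the paper.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the current frame against itself and earlier frames only.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible (past) frames.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs
```

Because the inner loop stops at `t`, changing a later frame cannot alter the output of an earlier one, which is the property the abstract describes as strict time order.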
[53] A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling cs.CVPDF
Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez
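TL;DR: This paper proposes CheXficient, a chest X-ray foundation model that uses active, principled data curation to match or exceed the performance of a model pretrained on the full data while using only 22.7% of the data and under 27.3% of the compute during pretraining.
Details
Motivation: The paper addresses two challenges of the “scale-at-all-costs” paradigm in pretraining medical imaging foundation models: redundancy and class imbalance in large-scale medical datasets, and the computational inefficiency caused by heterogeneous data quality.
Result: Across 20 benchmarks spanning 5 task types (zero-shot findings classification, cross-modal retrieval, disease prediction, semantic segmentation, and radiology report generation), CheXficient performs comparably to or better than the full-data model, and generalizes better on long-tailed or rare conditions.
Insight: The novelty is an efficient pretraining approach based on data curation: prioritizing informative training samples mitigates data redundancy and class imbalance, and offers practical insights for efficient pretraining and downstream adaptation of medical vision-language foundation models.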
Abstract: Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a “scale-at-all-costs” paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieving comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
[54] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models cs.CVPDF
Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
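TL;DR: This paper proposes Diagnostic-driven Progressive Evolution (DPE), a new training paradigm for large multimodal models (LMMs). In a spiral loop, diagnostic results steer data generation and model reinforcement, and the updated model is re-diagnosed to drive the next round of targeted improvement, continually improving the model under open task distributions.
Details
Motivation: Training of large multimodal models currently relies on static data and fixed recipes, which makes it hard to diagnose capability blind spots or to provide dynamic, targeted reinforcement. The work is motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice.
Result: Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show that DPE yields stable, continual gains across 11 benchmarks.
Insight: The core novelty is integrating diagnosis, data generation, and model reinforcement into a closed iterative loop. Specifically: 1) multiple agents, aided by tools such as web search and image editing, annotate and quality-control massive unlabeled multimodal data to produce diverse, realistic samples; 2) model failures are attributed to specific weaknesses, the data mixture is adjusted dynamically, and agents are guided to generate weakness-focused data for targeted reinforcement. This offers a new paradigm for continual, scalable LMM training under open task distributions.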
Abstract: As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
[55] Towards Multimodal Domain Generalization with Few Labels cs.CVPDF
Hongzhao Li, Hao Dong, Hualei Wan, Shupan Li, Mingliang Xu
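TL;DR: This paper introduces a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. Observing that existing methods cannot handle this setting effectively, the authors propose a unified framework with three key components: consensus-driven consistency regularization, disagreement-aware regularization, and cross-modal prototype alignment. The framework outperforms strong baselines in both standard and missing-modality scenarios, and the authors establish the first SSMDG benchmarks.
Details
Motivation: The paper addresses how multimodal models can generalize to unseen domains when labeled data are limited; existing methods fall short in handling unlabeled data, domain shift, and multimodal inputs together.
Result: On the SSMDG benchmarks established by the authors, the proposed method consistently outperforms strong baselines in both standard and missing-modality scenarios.
Insight: The novelty is combining semi-supervised learning with multimodal domain generalization: a consensus mechanism exploits unlabeled data, and cross-modal prototype alignment improves robustness to missing modalities. Viewed objectively, the consensus-driven and disagreement-aware mechanisms suggest a new way to exploit uncertainty in multimodal data.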
Abstract: Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.
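One plausible reading of "confident fused-unimodal consensus" can be sketched as follows: an unlabeled sample receives a pseudo-label only when the fused prediction and every unimodal prediction agree on the class and the fused confidence clears a threshold, while the remaining samples are set aside as ambiguous. The function name, the 0.9 threshold, and the exact agreement rule are assumptions for illustration, not details taken from the paper.

```python
def split_by_consensus(fused_probs, unimodal_probs, threshold=0.9):
    """Split unlabeled samples into pseudo-labeled and ambiguous sets.

    fused_probs: per-sample class-probability lists from the fused prediction.
    unimodal_probs: one list per modality, each shaped like fused_probs.
    threshold: assumed confidence cutoff (hypothetical value).
    Returns (pseudo_labels, ambiguous) where pseudo_labels maps sample index
    to class index and ambiguous lists the non-consensus sample indices.
    """
    pseudo_labels = {}
    ambiguous = []
    for i, fused in enumerate(fused_probs):
        fused_label = max(range(len(fused)), key=fused.__getitem__)
        # Predicted class of each individual modality for this sample.
        unimodal_labels = [
            max(range(len(probs[i])), key=probs[i].__getitem__)
            for probs in unimodal_probs
        ]
        agrees = all(label == fused_label for label in unimodal_labels)
        if agrees and fused[fused_label] >= threshold:
            pseudo_labels[i] = fused_label
        else:
            # Non-consensus samples are kept for a separate regularizer.
            ambiguous.append(i)
    return pseudo_labels, ambiguous
```

In this sketch the ambiguous indices are simply returned; how the paper's disagreement-aware regularization uses such samples is not specified in the abstract and is not modeled here.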
[56] Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins cs.CVPDF
Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen
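TL;DR: This paper proposes Chain of Flow (COF), a foundational generative framework that reconstructs full 4D cardiac structure and motion from a single-cycle electrocardiogram (ECG), in order to build actionable, patient-specific cardiac digital twins (CDTs).
Details
Motivation: Existing cardiac digital twin frameworks are mostly limited to task-specific predictors rather than patient-specific, manipulable virtual hearts, so a foundational generative framework is needed that can reconstruct individualized cardiac anatomy and physiology from multimodal signals.
Result: Evaluated on multiple cohorts, the method accurately recovers cardiac anatomy, chamber-wise function, and dynamic motion patterns, and supports downstream tasks such as volumetry, regional function analysis, and virtual cine synthesis.
Insight: The core novelty is learning a unified representation of cardiac geometry, electrophysiology, and motion dynamics by integrating cine cardiac magnetic resonance (cine-CMR) and 12-lead ECG, turning cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual organs.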
Abstract: A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.
[57] OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality cs.CVPDF
Federico Nesti, Gianluca D’Amico, Mauro Marinoni, Giorgio Buttazzo
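TL;DR: This paper proposes OSDaR-AR, a multi-modal augmented reality framework that generates high-quality augmented data by integrating photorealistic virtual objects into real-world railway sequences, addressing the scarcity of annotated data for railway perception tasks.
Details
Motivation: Railway applications lack high-quality annotated data for safety-critical tasks such as obstacle detection; existing options fall short, since photorealistic simulators suffer from a “sim-to-real” gap and simple image-masking techniques lack spatio-temporal coherence.
Result: A segmentation-based refinement strategy for INS/GNSS data significantly improves the realism of the augmented sequences, and a public dataset, OSDaR-AR, is built on the OSDaR23 dataset to support the development of next-generation railway perception systems.
Insight: The novelty is using Unreal Engine 5 together with LiDAR point clouds and INS/GNSS data to ensure accurate placement and temporal stability of virtual objects in RGB frames, producing photorealistic augmented railway scenes with spatio-temporal coherence.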
Abstract: Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the “sim-to-real” gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: https://syndra.retis.santannapisa.it/osdarar.html
[58] WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents cs.CV | cs.ROPDF
Runwei Guan, Shaofeng Liang, Ningwei Ouyang, Weichen Fei, Shanliang Yao
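TL;DR: This paper presents WaterVideoQA, the first large-scale video question answering benchmark for all-waterway environments, containing 3,029 video clips across six waterway categories and incorporating variable lighting and dynamic weather to test the cognitive capabilities of autonomous surface vessels (ASVs). It also introduces NaviMind, a multi-agent neuro-symbolic system that uses adaptive semantic routing, situation-aware hierarchical reasoning, and autonomous self-reflective verification to move from superficial pattern matching to regulation-compliant, interpretable decision-making.
Details
Motivation: Autonomous navigation has achieved notable success in passive perception but leaves a gap in knowledge-driven, interactive environmental cognition; in maritime navigation, combining raw visual perception with complex cognitive reasoning is a key prerequisite for autonomous surface vessels to maneuver safely and precisely.
Result: Experiments show that the framework significantly outperforms existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.
Insight: The novelties are WaterVideoQA, the first all-waterway video question answering benchmark, and NaviMind, a multi-agent neuro-symbolic system that combines adaptive semantic routing, hierarchical reasoning, and self-reflective verification to move autonomous surface vessels from pattern matching toward regulation-compliant, interpretable decision-making, improving cognition and reasoning in maritime environments.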
Abstract: While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.
[59] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding cs.CVPDF
Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju
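TL;DR: This paper proposes MSJoE, a framework that jointly evolves a multimodal large language model (MLLM) and a lightweight key-frame sampler for efficient long-video understanding. Assuming that only a few key frames are informative, the method first generates queries describing visual perspectives relevant to the question, uses a frozen CLIP model to compute a query-frame similarity matrix, and then has the sampler predict sampling weights and select a subset of informative key frames that is fed to the MLLM for answer generation. The MLLM and the sampler are optimized jointly through reinforcement learning, so that query reasoning, frame sampling, and key-frame understanding co-adapt.
Details
Motivation: The paper addresses a fundamental challenge for MLLMs in efficient long-video understanding: how to pick out, from a lengthy video sequence, the few key frames that are actually useful for answering the question.
Result: Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline.
Insight: The novelty is a framework that jointly evolves the MLLM and the sampler, using reinforcement learning to co-optimize query generation, frame sampling, and video understanding end to end. It rests on the assumption that a few informative key frames suffice to answer a question, and uses a frozen CLIP model to build query-frame interactions that guide sampling, giving an efficient, learnable mechanism for compressing and selecting video information.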
Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM, and 1.1% higher accuracy than the strongest baseline method.
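The selection step described in the MSJoE abstract (similarity matrix, then sampling weights, then a compact frame set) can be sketched with a fixed rule in place of the learned sampler. Here each frame keeps its best score over all queries, a softmax turns the scores into weights, and the top-weighted frames are kept in temporal order. The max-over-queries rule, the function name, and the top-k selection are assumptions standing in for the paper's trained, reinforcement-learned sampler.

```python
import math


def select_key_frames(similarity, num_frames):
    """Pick key frames from a query-frame similarity matrix.

    similarity[q][f]: score between query q and frame f (e.g. from CLIP).
    num_frames: number of frames to keep.
    Returns (selected_frame_indices_in_temporal_order, sampling_weights).
    Hypothetical stand-in for a learned sampler.
    """
    total_frames = len(similarity[0])
    # A frame is useful if at least one query matches it well.
    frame_scores = [
        max(row[f] for row in similarity) for f in range(total_frames)
    ]
    # Softmax over frames gives normalized sampling weights.
    peak = max(frame_scores)
    exp_scores = [math.exp(s - peak) for s in frame_scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Keep the highest-weighted frames, then restore temporal order.
    ranked = sorted(range(total_frames), key=lambda f: weights[f], reverse=True)
    return sorted(ranked[:num_frames]), weights
```

Returning the indices in temporal order keeps the selected frames usable as an ordered clip for a downstream model.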
[60] Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings cs.CVPDF
Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke
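TL;DR: This paper presents an automated framework for reconstructing the velocity and stroke rate of canoe sprint team boats from panned and zoomed video recordings. The method uses YOLOv8 to detect buoys and athletes, estimates homographies from the known buoy grid, and generalizes boat position estimation by learning a boat-specific athlete offset through a U-Net-based boat tip calibration. It also uses optical flow for robust tracking that adapts to multi-athlete boat types, and extracts stroke rate from pose estimates or athlete bounding boxes.
Details
Motivation: In canoe sprint, the limited availability of GPS devices makes it hard to analyze pacing strategies (defined by velocity and stroke rate) automatically; the goal is to give coaches automated, highly accurate feedback without on-boat sensors or manual annotation.
Result: Evaluation against GPS data from elite competitions gives a relative root mean square error (RRMSE) of 0.020 ± 0.011 for velocity (correlation ρ = 0.956) and 0.022 ± 0.024 for stroke rate (ρ = 0.932), indicating high accuracy.
Insight: The novelties include spatial calibration using the known buoy grid and homographies, generalizing boat position estimation by learning boat-specific offsets with a U-Net, and extracting stroke rate by combining optical-flow tracking with pose or bounding-box information, yielding a fully automated, highly accurate video analysis solution.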
Abstract: Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200 m to 500 m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-Net-based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (ρ = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (ρ = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.
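The abstract reports RRMSE without defining it. A common convention, assumed here, divides the root mean square error by the mean of the reference signal, so a value of 0.020 would correspond to an error of about 2% of the average GPS velocity. The sketch below uses that convention; the paper may normalize differently, and the function name is hypothetical.

```python
import math


def rrmse(estimated, reference):
    """Relative RMSE: RMSE divided by the mean of the reference values.

    Assumed definition; other normalizations (e.g. by range) also exist.
    estimated, reference: equal-length sequences of floats, for example a
    video-derived velocity profile and the matching GPS profile.
    """
    if len(estimated) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((e - r) ** 2 for e, r in zip(estimated, reference)) / len(reference)
    return math.sqrt(mse) / (sum(reference) / len(reference))
```

For instance, estimates of 4.0 and 6.0 m/s against a constant 5.0 m/s reference have an RMSE of 1.0 m/s and therefore an RRMSE of 0.2 under this definition.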
[61] MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis cs.CV | cs.AIPDF
Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, Mingkun Xu
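TL;DR: This paper presents MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for MRI-based brain tumor diagnosis with about 200,000 semantically rich multimodal instructions. On top of the dataset, the authors build the MM-NeuroOnco-Bench evaluation benchmark and develop the NeuroOnco-GPT model, which achieves a substantial gain on diagnostic questions.
Details
Motivation: Existing public datasets fall short in annotation richness and diagnostic semantics, and cannot support the clinically interpretable, imaging-grounded reasoning that accurate brain tumor diagnosis requires.
Result: On MM-NeuroOnco-Bench, the strongest baseline, Gemini 3 Flash, reaches only 41.88% accuracy on diagnosis-related questions, whereas NeuroOnco-GPT, fine-tuned on the dataset, achieves a 27% absolute accuracy improvement on diagnostic questions.
Insight: The main novelties are: 1) a multi-model collaborative pipeline that automatically generates and quality-controls diagnostic semantic annotations, easing the scarcity and cost of high-quality medical annotation; 2) a manually annotated evaluation benchmark with a rejection-aware setting that reduces the bias inherent in closed-ended question formats; 3) a large-scale, semantically rich multimodal instruction dataset aimed at fine-grained understanding for brain tumor MRI diagnosis.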
Abstract: Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM-NeuroOnco
[62] Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study cs.CVPDF
Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn
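TL;DR: This paper studies whether multimodal large language model (MLLM) agents can distinguish visually hard-to-separate diseases in a zero-shot setting. It benchmarks agents on two proxy diagnostic tasks (melanoma vs. atypical nevus, and pulmonary edema vs. pneumonia) and introduces a multi-agent framework based on contrastive adjudication to improve performance.
Details
Motivation: The paper addresses an underexplored yet clinically significant setting in medical imaging: distinguishing, in a zero-shot setting, diseases whose visual features are highly confounded even though their clinical management differs substantially.
Result: Experiments show that the proposed multi-agent framework improves diagnostic accuracy by 11 percentage points on dermoscopy data and reduces unsupported claims on qualitative samples, but overall performance remains insufficient for clinical deployment.
Insight: The novelty is focusing zero-shot agent evaluation on distinguishing visually confounded diseases and proposing a multi-agent framework based on contrastive adjudication to improve diagnostic reliability and reduce erroneous claims, providing preliminary insights into zero-shot agent performance in visually confounded scenarios.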
Abstract: The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.
[63] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models cs.CVPDF
Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang
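TL;DR: This paper proposes UCM, a framework that tackles long-term content consistency and precise camera control in world models based on video generation. It unifies long-term memory and camera control through a time-aware positional encoding warping mechanism, and uses an efficient dual-stream diffusion transformer for high-fidelity generation.
Details
Motivation: Existing world-model methods struggle to keep content consistent over the long term when scenes are revisited and cannot follow user-provided camera inputs precisely. Methods based on explicit 3D reconstruction lack flexibility in unbounded scenes and on fine-grained structures, while methods that rely on previously generated frames lack explicit spatial correspondence, which limits controllability and consistency.
Result: Extensive experiments on real-world and synthetic benchmarks show that UCM significantly outperforms state-of-the-art methods in long-term scene consistency and achieves precise camera controllability in high-fidelity video generation.
Insight: The core novelty is the time-aware positional encoding warping mechanism, which unifies memory and camera control. In addition, the efficient dual-stream diffusion transformer reduces computational overhead, and a scalable data curation strategy based on point-cloud rendering supports training on large-scale monocular video.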
Abstract: World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
[64] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling cs.CV | cs.LGPDF
Camile Lendering, Erkut Akdag, Egor Bondarev
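TL;DR: This paper proposes SubspaceAD, a training-free few-shot anomaly detection method. It extracts patch-level features from normal images with a frozen DINOv2 backbone, models the low-dimensional subspace of normal variation with principal component analysis (PCA), and detects anomalies through the reconstruction residual. The method achieves state-of-the-art performance on the MVTec-AD and VisA datasets.
Details
Motivation: In industrial inspection, few-shot anomaly detection must work with only a few normal images per category, and existing methods typically rely on memory banks, auxiliary datasets, or multi-modal tuning; the paper asks whether such complexity is really necessary given the feature representations of vision foundation models.
Result: In the one-shot setting, SubspaceAD achieves 98.0% image-level and 97.6% pixel-level AUROC on MVTec-AD, and 93.3% image-level and 98.3% pixel-level AUROC on VisA, surpassing prior state-of-the-art results.
Insight: The novelty is a simple two-stage method that needs no training, prompt tuning, or memory banks. PCA subspace modeling turns anomaly detection into computing a statistical reconstruction residual, showing that foundation-model features are strong enough on their own to reach state-of-the-art performance without additional complex machinery.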
Abstract: Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
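The two stages in the SubspaceAD abstract (fit PCA on normal patch features, then score by reconstruction residual) can be sketched with NumPy. This is a minimal illustration, not the released implementation: the function names and the number of components are assumptions, and any array of feature vectors stands in for DINOv2 patch features.

```python
import numpy as np


def fit_normal_subspace(features, num_components):
    """Fit a PCA subspace to patch features taken from normal images.

    features: (n_patches, dim) array; stands in for frozen-backbone features.
    num_components: assumed subspace dimension (hypothetical choice).
    Returns (mean, components) with components of shape (num_components, dim).
    """
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:num_components]


def anomaly_scores(features, mean, components):
    """Per-patch anomaly score: norm of the residual outside the subspace."""
    centered = features - mean
    # Project onto the subspace and map back to feature space.
    reconstruction = centered @ components.T @ components
    return np.linalg.norm(centered - reconstruction, axis=1)
```

Patches whose features lie close to the normal subspace get scores near zero, while patches with a large component orthogonal to it score high. Turning per-patch scores into an image-level score (for example by taking the maximum) is a further design choice not covered by this sketch.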
[65] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis cs.CVPDF
Xinglong Luo, Ao Luo, Zhengning Wang, Yueqi Yang, Chaoyu Feng
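TL;DR: This paper proposes DMAligner, a diffusion-based image alignment framework that uses alignment-oriented view synthesis to overcome the limitations of traditional optical-flow methods under occlusion and illumination changes. It adopts dynamics-aware diffusion training with a dynamics-aware mask producing module that separates foreground from background, and builds a dynamic scene image alignment dataset for validation.
Details
Motivation: Existing image alignment methods rely mainly on optical-flow-based image warping, which is vulnerable to occlusion and illumination changes, degrading alignment quality and the accuracy of downstream tasks. The paper approaches these classic difficulties from a new, generative perspective.
Result: The method shows superior results on benchmarks built from the authors' DSIA dataset (1,033 indoor and outdoor scenes with over 30K image pairs) and in qualitative comparisons on a series of widely used video datasets.
Insight: The novelty is recasting image alignment as a conditional image generation problem, using a diffusion model for alignment-oriented view synthesis and a dynamics-aware mask module to handle dynamic foregrounds adaptively, avoiding the inherent problems of traditional optical-flow methods.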
Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at https://github.com/boomluo02/DMAligner.
[66] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval cs.CVPDF
Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu
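TL;DR: WISER is a training-free framework for zero-shot composed image retrieval. It unifies the text-to-image and image-to-image retrieval paths in a “retrieve-verify-refine” pipeline that provides wider search, deeper thinking, and adaptive fusion, improving target image retrieval for multimodal queries.
Details
Motivation: Existing zero-shot composed image retrieval methods usually convert the multimodal query into a single modality (an edited caption or an edited image), but text-based retrieval loses fine-grained visual details and image-based retrieval struggles with complex semantic modifications, so the strengths of both need to be combined to handle diverse query intents.
Result: On multiple benchmarks including CIRCO and CIRR, WISER significantly outperforms previous training-free methods, with relative improvements of 45% in mAP@5 on CIRCO and 57% in Recall@1 on CIRR, and even surpasses many training-dependent methods, showing its superiority and generalization.
Insight: The novelty is combining intent awareness and uncertainty awareness: the two retrieval paths run in parallel to widen the candidate pool, a verifier dynamically assesses confidence, and uncertain results are refined under the guidance of structured self-reflection, giving robust, training-free handling of multimodal queries.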
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a “retrieve-verify-refine” pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
[67] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video cs.CVPDF
Aashish Rai, Angela Xing, Anushka Agarwal, Xiaoyan Cong, Zekun Li
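TL;DR: This paper proposes PackUV, a new 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases for compact, image-native storage. To fit this representation from multi-view video, the authors propose PackUV-GS, a temporally consistent fitting method that optimizes Gaussian parameters directly in the UV domain. It includes a flow-guided Gaussian labeling and video keyframing module that identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation that is compatible with standard video codecs without quality loss, enabling efficient streaming within existing multimedia infrastructure.
Details
Motivation: Volumetric video offers immersive 4D experiences but remains difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Their outputs are also typically incompatible with conventional video coding pipelines, which hinders practical use.
Result: Extensive experiments on the authors' PackUV-2B dataset (the largest multi-view video dataset to date, with more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2 billion frames) show that the method surpasses existing baselines in rendering fidelity and scales to sequences of up to 30 minutes with consistent rendering quality.
Insight: The main contributions are a compact 4D Gaussian representation compatible with standard video codecs (PackUV) and a fitting method that performs temporally consistent optimization directly in the UV domain (PackUV-GS). Viewed objectively, mapping dynamic 3D Gaussian attributes into a sequence of 2D UV atlases addresses the key bottleneck of making volumetric video storage and streaming compatible with existing infrastructure, while flow guidance and keyframing handle temporal consistency over long sequences.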
Abstract: Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.
[68] D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment cs.CVPDF
Argo Saakyan, Dmitry Solntsev
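TL;DR: D-FINE-seg is a real-time instance segmentation framework built on the D-FINE transformer architecture. It adds a lightweight mask head, segmentation-aware training (box-cropped BCE and dice mask losses), auxiliary and denoising mask supervision, and an adapted Hungarian matching cost. The work also provides an end-to-end pipeline for training, export, and optimized inference across the ONNX, TensorRT, and OpenVINO backends.
Details
Motivation: Transformer-based real-time object detectors such as D-FINE are still rarely applied to real-time instance segmentation. The work extends D-FINE to deliver efficient and accurate instance segmentation.
Result: On the TACO dataset, under a unified TensorRT FP16 end-to-end benchmarking protocol, D-FINE-seg achieves a higher F1-score than Ultralytics YOLO26 while maintaining competitive latency.
Insight: The main novelty is extending a high-performing transformer object detector into an instance segmentation model with a dedicated segmentation-aware training strategy and loss functions. The complete deployment pipeline across several mainstream inference backends (ONNX, TensorRT, OpenVINO) also makes the framework more practical and easier to deploy.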
Abstract: Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head; segmentation-aware training, including box-cropped BCE and dice mask losses; auxiliary and denoising mask supervision; and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. A second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open-source under the Apache-2.0 license. GitHub repository: https://github.com/ArgoHA/D-FINE-seg.
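The two mask losses named in the D-FINE-seg abstract can be written down generically. The sketch below uses the textbook dice loss and a binary cross-entropy restricted to a box; the function names, the half-open pixel-box convention, and the plain list-of-lists masks are assumptions for illustration, and the repository's actual implementation may differ.

```python
import math
from typing import List, Tuple

Mask = List[List[float]]  # rows of per-pixel probabilities or 0/1 labels


def dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Textbook dice loss: 1 - 2|P.T| / (|P| + |T|), smoothed by eps."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def box_cropped_bce(pred: Mask, target: Mask,
                    box: Tuple[int, int, int, int]) -> float:
    """Mean binary cross-entropy over pixels inside box = (x0, y0, x1, y1).

    The box is half-open in pixel coordinates (an assumed convention).
    Pixels outside the box do not contribute to the loss at all.
    """
    x0, y0, x1, y1 = box
    losses = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            p = min(max(pred[y][x], 1e-7), 1.0 - 1e-7)  # avoid log(0)
            t = target[y][x]
            losses.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses) if losses else 0.0
```

Cropping the BCE to the box keeps background pixels far from the object from dominating the per-pixel term, while dice stays sensitive to overall mask overlap.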
[69] GeoWorld: Geometric World Models cs.CV | cs.ROPDF
Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley
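TL;DR: GeoWorld is an energy-based geometric world model. A Hyperbolic JEPA maps latent representations from Euclidean space onto hyperbolic manifolds to preserve the geometric structure and hierarchical relations among states, and Geometric Reinforcement Learning is introduced for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space.
Details
Motivation: Existing energy-based predictive world models face two main challenges: their latent representations are usually learned in Euclidean space, which ignores the geometric and hierarchical structure among states, and they struggle with long-horizon prediction, so performance degrades rapidly over extended rollouts.
Result: On the CrossTask and COIN benchmarks, the success rate (SR) improves by about 3% for 3-step planning and about 2% for 4-step planning compared with the state-of-the-art V-JEPA 2.
Insight: The contributions are a Hyperbolic JEPA that maps latent representations onto hyperbolic manifolds to preserve geometric structure, and Geometric Reinforcement Learning for energy optimization, which together improve the stability and performance of multi-step visual planning.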
Abstract: Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
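Mapping Euclidean latents onto a hyperbolic manifold, as the GeoWorld abstract describes, is commonly done with the exponential map at the origin of the Poincaré ball. The sketch below assumes that model with curvature -1; the abstract does not say which hyperbolic model or curvature the paper uses, so treat both as illustrative choices.

```python
import math
from typing import List, Sequence


def exp_map_origin(v: Sequence[float]) -> List[float]:
    """Map a Euclidean vector into the unit Poincare ball (curvature -1).

    exp_0(v) = tanh(|v|) * v / |v|; the result always has norm < 1.
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    # Clip just inside the boundary so distances stay finite in floats.
    radius = min(math.tanh(norm), 1.0 - 1e-9)
    return [radius * c / norm for c in v]


def poincare_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_x = sum(c * c for c in x)
    sq_y = sum(c * c for c in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))
```

A Euclidean vector of length r lands at hyperbolic distance 2r from the origin under this map, so distances grow quickly toward the boundary, which is the property usually cited for embedding hierarchies.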
[70] Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception cs.CVPDF
Yiding Sun, Jihua Zhu, Haozhe Cheng, Chaoyi Lu, Zhichuan Yang
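TL;DR: This paper proposes a new paradigm called "Align then Adapt" (PointATA) for efficiently transferring pre-trained 3D models to 4D perception tasks. It tackles overfitting and the modality gap in two stages: Stage 1 uses optimal-transport theory to quantify the distributional discrepancy and trains a point align embedder to reduce the modality gap; Stage 2 integrates an efficient point-video adapter and a spatial-context encoder into the frozen 3D backbone to strengthen temporal modeling. Experiments show that PointATA matches or even surpasses full fine-tuning models while being more parameter-efficient.
Details
Motivation: 4D datasets are far scarcer than 3D ones, which limits the scalability of self-supervised 4D models. Transferring pre-trained 3D models to 4D tasks is therefore attractive, but existing approaches are limited by overfitting and the modality gap.
Result: PointATA reaches 97.21% accuracy on 3D action recognition, a +8.7% gain on 4D action segmentation, and 84.06% on 4D semantic segmentation, matching or surpassing full fine-tuning models while remaining parameter-efficient.
Insight: The novelty is decomposing parameter-efficient transfer learning into an "align" stage and an "adapt" stage, using optimal-transport theory to reduce the modality gap and purpose-built adapters to strengthen temporal modeling. Viewed objectively, its engineering-oriented components (the point align embedder and the point-video adapter) offer a reusable modular approach to cross-modal transfer.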
Abstract: Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel “Align then Adapt” (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g., 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
[71] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy cs.CVPDF
Matthew Sutton, Katrin Amunts, Timo Dickscheid, Christian Schiffer
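TL;DR: This paper proposes a label-mediated, weakly supervised vision-language modeling method for analyzing microscopy images of the human brain. Using only a label, it automatically mines area descriptions from related literature as synthetic captions and couples an existing cytoarchitectonic vision foundation model (CytoNet) to a large language model, so that microscopy regions can be described in natural language without paired image-text data.
Details
Motivation: In research and clinical settings, paired image-text data are scarce and hard to obtain, which makes it difficult to build natural-language interfaces that assist microscopy image analysis. This is especially true for the study of cytoarchitecture in human brain sections: cell density and morphology and their laminar and areal organization.
Result: Across 57 brain areas, the method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. For in-scope patches it matches the cytoarchitectonic reference label with 90.6% accuracy; with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy.
Insight: The novelty is a weakly supervised, label-mediated pairing method that uses labels to generate synthetic captions automatically and thereby connects existing biomedical vision foundation models to language models, providing a practical recipe for adding natural-language capabilities in domains where fine-grained paired annotations are scarce.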
Abstract: Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language in domains where fine-grained paired annotations are scarce.
[72] Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras cs.CVPDF
Paul Kielty, Timothy Hanley, Peter Corcoran
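TL;DR: This paper proposes Locally Adaptive Decay Surfaces (LADS), an event-camera representation that adjusts the temporal decay at each location according to local signal dynamics. This resolves the trade-off that fixed-parameter representations face between preserving spatial structure during still periods and retaining sharp edges during rapid motion. Experiments on public data show that LADS clearly improves face detection and facial landmark accuracy, sustains high performance at high frequencies (e.g., 240 Hz), even surpasses results reported by prior work at 30 Hz, and supports lighter network architectures with real-time performance.
Details
Motivation: Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms and globally decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice forces a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion.
Result: Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy over standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than the baselines. At 240 Hz, it mitigates the accuracy decline usually observed at higher frequencies, sustaining a normalized mean error of 2.44% for landmarks and 0.966 mAP50 for face detection. These high-frequency results even exceed the accuracy reported by prior work at 30 Hz, setting new benchmarks for event-based face analysis.
Insight: The novelty is the LADS family of event representations, in which the temporal decay at each location is modulated by local signal dynamics (event rate, Laplacian-of-Gaussian response, or high-frequency spectral energy). The adaptive scheme preserves detail in quiescent regions while reducing blur in regions of dense activity. Viewed objectively, this context-aware temporal integration underlines the value of adaptive representations for neuromorphic vision: it makes better use of the distinctive strengths of event cameras and supports lighter networks for real-time, high-frequency human-computer interaction systems.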
Abstract: Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
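The event-rate variant of the idea in the LADS abstract can be sketched as an exponentially decayed time surface whose time constant shrinks where events are dense. The specific rule below, tau = tau_max / (1 + rate / rate_ref) clipped at tau_min, along with all parameter names and defaults, is a hypothetical choice for illustration; the abstract does not give the paper's modulation function. Polarity is ignored and the rate is measured per pixel rather than over a neighborhood, both for brevity.

```python
import math
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float]  # (x, y, timestamp in seconds)


def adaptive_time_surface(
    events: Iterable[Event],
    width: int,
    height: int,
    t_now: float,
    tau_min: float = 0.005,
    tau_max: float = 0.1,
    rate_ref: float = 100.0,
    window: float = 0.05,
) -> List[List[float]]:
    """Time surface exp(-(t_now - t_last) / tau) with a per-pixel tau.

    Busy pixels get a short tau (less motion blur); quiet pixels keep a
    long tau (spatial structure persists). Pixels with no event stay 0.
    """
    last_t = [[None] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for x, y, t in events:
        if t > t_now:
            continue  # ignore events from the future of this frame
        if last_t[y][x] is None or t > last_t[y][x]:
            last_t[y][x] = t
        if t_now - t <= window:
            counts[y][x] += 1

    surface = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if last_t[y][x] is None:
                continue
            rate = counts[y][x] / window  # recent events per second
            tau = max(tau_min, tau_max / (1.0 + rate / rate_ref))
            surface[y][x] = math.exp(-(t_now - last_t[y][x]) / tau)
    return surface
```

With these defaults, two pixels whose latest events are equally old end up with different values: the busier pixel fades faster, which is the behaviour the abstract describes as reducing blur in regions of dense activity.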
[73] SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation cs.CVPDF
Fuhao Zhang, Lei Liu, Jialin Zhang, Ya-Nan Zhang, Nan Mu
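TL;DR: This paper proposes SpectralMamba-UNet, a novel frequency-disentangled framework for medical image segmentation. It decouples the learning of structural and textural information through spectral decomposition: a discrete cosine transform separates low- and high-frequency features, the low-frequency part feeds global context modeling through a frequency-domain Mamba, the high-frequency part preserves boundary-sensitive details, and Spectral Channel Reweighting and Spectral-Guided Fusion modules provide adaptive multi-scale fusion.
Details
Motivation: Existing state space models (e.g., Vision Mamba) serialize images into one dimension, which weakens local spatial continuity and high-frequency representation, so they struggle to model global anatomical structure and fine-grained boundary detail at the same time.
Result: Experiments on five public benchmarks show consistent improvements across modalities and segmentation targets, validating the method's effectiveness and generalizability.
Insight: The novelty is combining frequency-domain analysis with state space models: spectral decomposition explicitly separates low-frequency (structure) and high-frequency (texture) information and processes each separately, and spectrum-aware channel reweighting and fusion yield texture-structure consistent segmentation. Viewed objectively, this frequency-disentangling strategy offers a new way to model global and local information jointly in vision tasks.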
Abstract: Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptive multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.
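The frequency split at the heart of the SDM module can be illustrated with an orthonormal DCT-II on a 1D signal. This is a minimal sketch under stated assumptions: the paper operates on feature maps, whereas the code below works on a plain list of floats, and the function names and the hard cutoff index are hypothetical.

```python
import math
from typing import List, Tuple


def _scale(k: int, n: int) -> float:
    """Orthonormal DCT-II scaling factor for coefficient k of n."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_ii(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1D signal (naive O(n^2) version)."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct_ii(coeffs: List[float]) -> List[float]:
    """Inverse of dct_ii (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def split_frequencies(x: List[float], cutoff: int) -> Tuple[List[float], List[float]]:
    """Split x into low- and high-frequency parts that sum back to x."""
    coeffs = dct_ii(x)
    low = idct_ii([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct_ii([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high
```

Because the transform is linear and orthonormal, the two branches are complementary: smooth structure ends up in the low band and sharp transitions in the high band, and nothing is lost when they are added back together.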
[74] WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning cs.CVPDF
Xudong Yan, Songhe Feng, Jiaxin Wang, Xin Su, Yi Jin
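TL;DR: This paper proposes WARM-CAT to address the label-space distribution shift that unseen compositions introduce at test time in Compositional Zero-Shot Learning (CZSL). The method accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data at test time to update multimodal prototypes, and it uses an adaptive update weight and a dynamic priority queue to adapt flexibly to distribution changes. The paper also introduces a new benchmark dataset, C-Fashion, and refines the MIT-States dataset to provide more reliable evaluation.
Details
Motivation: Existing CZSL methods suffer performance degradation because the test-time label space includes unseen attribute-object compositions and therefore shifts in distribution. The paper aims to overcome this through test-time knowledge accumulation and prototype updates.
Result: On four benchmark datasets (including the new C-Fashion and the refined MIT-States), the method achieves state-of-the-art (SOTA) performance under both closed-world and open-world settings.
Insight: The contributions are: accumulating multimodal knowledge from unsupervised data at test time to update prototypes; an adaptive update weight and a dynamic priority queue (warm-started with training images) to handle distribution shift flexibly; generating unseen visual prototypes from a mapping learned between seen and unseen textual prototypes; and aligning textual and visual prototypes through multimodal collaborative representation learning. Building more reliable benchmark datasets is a further contribution to the field.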
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. The source code and datasets are available at https://github.com/xud-yan/WARM-CAT .
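Two mechanisms from the WARM-CAT abstract, the confidence-keyed priority queue and the weighted prototype update, can be sketched with the standard library. The class and function names, the queue capacity, and the rule "update weight = max_weight x confidence" are hypothetical; the abstract describes an adaptive weight without giving its form.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple


class PrototypeMemory:
    """Per-composition bounded queue keeping the most confident features."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.queues: Dict[str, List[Tuple[float, int, List[float]]]] = {}
        self._counter = 0  # tie-breaker so features are never compared

    def add(self, label: str, confidence: float, feature: Sequence[float]) -> None:
        heap = self.queues.setdefault(label, [])
        item = (confidence, self._counter, list(feature))
        self._counter += 1
        if len(heap) < self.capacity:
            heapq.heappush(heap, item)
        elif confidence > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the least confident entry

    def visual_prototype(self, label: str) -> Optional[List[float]]:
        heap = self.queues.get(label)
        if not heap:
            return None
        dim = len(heap[0][2])
        return [sum(f[i] for _, _, f in heap) / len(heap) for i in range(dim)]


def update_prototype(prototype: Sequence[float], feature: Sequence[float],
                     confidence: float, max_weight: float = 0.1) -> List[float]:
    """Move a prototype toward a test feature, scaled by confidence."""
    w = max_weight * confidence
    return [(1.0 - w) * p + w * f for p, f in zip(prototype, feature)]
```

Warm-starting in this sketch would amount to calling add() with training features before any test image arrives, so that seen compositions are not favoured merely because they reached the queue first.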
[75] FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time cs.CV | cs.CG | cs.ROPDF
David Dirnfeld, Fabien Delattre, Pedro Miraldo, Erik Learned-Miller
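TL;DR: This paper proposes FLIGHT, a new method for estimating the camera's direction of motion (heading) in real time from monocular video. It generalizes the Hough transform to the unit sphere ($S^2$) and discretizes the sphere with a Fibonacci lattice to handle noise and outliers robustly. Experiments show that the method lies on the Pareto frontier of accuracy versus efficiency and that correcting the heading improves pose initialization accuracy in SLAM systems.
Details
Motivation: Existing methods that estimate heading under known camera rotation (for example, from an IMU or an optimization algorithm) tend to work well under low noise and few outliers, but lose accuracy or become computationally expensive as noise and outlier levels rise. The paper aims to overcome these limitations with a more robust and efficient heading estimator.
Result: Experiments on three datasets show that the method is on the Pareto frontier of the accuracy-efficiency trade-off. In SLAM experiments, it reduces RMSE (root-mean-square error) by correcting the heading during camera pose initialization.
Insight: The main contributions are: 1) generalizing the Hough transform to the unit sphere ($S^2$) for heading estimation; 2) discretizing the sphere with a Fibonacci lattice whose points serve as voting bin centers, which improves computational efficiency and robustness; 3) ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction, which keeps the method robust in the presence of noise and outliers.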
Abstract: Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera’s heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere ($S^2$) to estimate the camera’s heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
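The voting scheme in the FLIGHT abstract follows from the epipolar constraint: once rotation is compensated, the translation direction t satisfies t · (f1 × f2) = 0 for each pair of bearing vectors, so every correspondence defines a great circle with normal f1 × f2. The sketch below is a brute-force illustration, not the authors' implementation; the function names, bin count, and band width are assumptions. The circles alone determine the heading only up to sign, and resolving that sign is outside this sketch.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a: Vec) -> Vec:
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def fibonacci_lattice(n: int) -> List[Vec]:
    """n near-uniform points on the unit sphere (golden-ratio spiral)."""
    golden = (1.0 + math.sqrt(5.0)) / 2.0
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = 2.0 * math.pi * i / golden
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def estimate_heading(pairs: Sequence[Tuple[Vec, Vec]], n_bins: int = 2000,
                     band: float = 0.05) -> Tuple[Vec, List[int]]:
    """Hough-style voting: each great circle votes for the bins it passes.

    pairs holds rotation-compensated unit bearing vectors (f1, f2).
    A bin b receives a vote when |n . b| < band for the circle normal n.
    """
    bins = fibonacci_lattice(n_bins)
    votes = [0] * n_bins
    for f1, f2 in pairs:
        n = cross(f1, f2)
        if dot(n, n) < 1e-18:
            continue  # degenerate pair: no parallax, no constraint
        n = normalize(n)
        for i, b in enumerate(bins):
            if abs(dot(n, b)) < band:
                votes[i] += 1
    best = max(range(n_bins), key=votes.__getitem__)
    return bins[best], votes
```

Outlier correspondences still cast votes, but their circles do not pass through a common point, so they spread over the sphere instead of reinforcing a single bin.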
[76] TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement cs.CVPDF
Arian Sabaghi, José Oramas
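TL;DR: TriLite is a single-stage framework for weakly supervised object localization (WSOL). It uses frozen features from a Dinov2 Vision Transformer pre-trained in a self-supervised manner and introduces only a small number of trainable parameters (fewer than 800K). At its core is the TriHead module, which decomposes patch features into foreground, background, and ambiguous regions to improve object coverage and suppress spurious activations. The method sets a new state of the art on several datasets while being parameter-efficient and simple to train.
Details
Motivation: Existing WSOL methods rely on multi-stage pipelines or full fine-tuning of large backbones, which makes training expensive, and the field still faces the widespread problem of partial object coverage.
Result: Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages show that TriLite reaches a new state of the art (SOTA) while being significantly more parameter-efficient and simpler to train.
Insight: The main contributions are: 1. using universal visual features from self-supervised pre-training (such as Dinov2) to avoid expensive end-to-end training; 2. the TriHead module, which improves localization quality by disentangling three regions (foreground, background, ambiguous); 3. decoupling the classification and localization objectives, giving an efficient single-stage framework that needs very few trainable parameters.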
Abstract: Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with Dinov2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.
[77] No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors cs.CVPDF
Tao Liu, Gang Wan, Kan Ren, Shibo Wen
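TL;DR: This paper proposes a new unsupervised framework for online video stabilization. It builds on the classical three-stage stabilization pipeline, adds a multithreaded buffering mechanism, and needs no paired stable and unstable datasets for training.
Details
Motivation: Deep-learning-based video stabilization depends on paired datasets, offers poor controllability, and runs inefficiently on resource-constrained hardware. The work addresses these problems and extends stabilization to domains such as UAV nighttime remote sensing.
Result: The method outperforms existing online stabilizers in both quantitative metrics and visual quality, performs comparably to offline methods, and is validated on a newly introduced multimodal UAV aerial video dataset (UAV-Test).
Insight: Combining classical priors with unsupervised learning avoids the data dependence and controllability problems; the multithreaded buffering mechanism makes online processing more efficient; and the new dataset extends evaluation of video stabilization to scenarios such as UAV nighttime remote sensing.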
Abstract: We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.
[78] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model cs.CV | cs.AIPDF
Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang
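TL;DR: This paper proposes Fase3D, an efficient Fourier-based 3D large multimodal model that needs no heavy pre-trained encoder. A novel tokenizer that combines point cloud serialization with the Fast Fourier Transform handles unordered, large-scale point cloud data, giving 3D scene understanding that is significantly more efficient in computation and parameters.
Details
Motivation: Large models that process 3D data typically rely on heavy pre-trained visual encoders to extract geometric features. 2D large models have begun to drop such encoders for efficiency and scalability, but extending that paradigm to 3D remains difficult because point clouds are unordered and large. The paper aims to design a large model that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder.
Result: Fase3D performs comparably to encoder-based 3D large models while being significantly more efficient in computation and parameters.
Insight: The contributions are: 1) compact representation of large scenes through structured superpoints; 2) space-filling curve serialization followed by a Fast Fourier Transform, enabling efficient global context modeling and graph-based token merging; 3) Fourier-augmented LoRA adapters that inject global frequency-aware interactions into the large language model at negligible cost. Viewed objectively, the core idea is to combine the Fourier transform with a serialization strategy, giving a lightweight and effective alternative for global modeling of unordered 3D data that bypasses the computational bottleneck of conventional self-attention or heavy encoders.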
Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
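The second ingredient in the Fase3D abstract, space-filling curve serialization followed by a Fourier transform, can be sketched with a Z-order (Morton) curve and a naive DFT along the token axis. The choice of Morton ordering, the grid resolution, and keeping only the real part of the transform are assumptions for illustration; the abstract does not name the curve or the exact mixing rule, and an FFT would replace the O(n^2) loop in practice.

```python
import cmath
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def morton_code(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the bits of three grid indices into one Z-order key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize(points: Sequence[Point], grid: int = 1024) -> List[Point]:
    """Order points in the unit cube along a Z-order curve.

    The ordering depends only on coordinates, so any permutation of the
    input yields the same sequence (when the keys are distinct).
    """
    def key(p: Point) -> int:
        ix, iy, iz = (min(grid - 1, max(0, int(c * grid))) for c in p)
        return morton_code(ix, iy, iz)

    return sorted(points, key=key)


def fourier_mix(tokens: Sequence[Sequence[float]]) -> List[List[float]]:
    """Mix information across the sequence with a naive DFT (real part).

    Every output token depends on every input token, which is the sense
    in which a Fourier transform can stand in for global attention.
    """
    n, d = len(tokens), len(tokens[0])
    mixed = []
    for k in range(n):
        row = []
        for j in range(d):
            acc = sum(tokens[t][j] * cmath.exp(-2j * cmath.pi * k * t / n)
                      for t in range(n))
            row.append(acc.real)
        mixed.append(row)
    return mixed
```

Serializing first matters because the Fourier transform is order-sensitive: a canonical ordering derived from coordinates is what makes the overall tokenizer insensitive to how the input points happen to be listed.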
[79] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation cs.CVPDF
Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu
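TL;DR: This paper proposes DyaDiT, a multi-modal diffusion transformer that generates natural, interactive conversational gestures for social settings from dyadic audio. The model fuses audio from both speakers and can optionally use social-context tokens and the partner's gestures to produce more responsive and socially appropriate motion.
Details
Motivation: Existing methods usually map a single audio stream to a single speaker's motion, ignoring social context and the mutual dynamics between the two people in a conversation, so the generated gestures are neither natural nor interactive enough.
Result: Trained on the Seamless Interaction Dataset, DyaDiT surpasses existing methods on standard motion generation metrics and in quantitative user studies: it scores better on objective metrics and users strongly prefer its motion, demonstrating robustness and socially favorable generation.
Insight: The contributions are: a multi-modal diffusion transformer architecture that handles dyadic audio input; social-context tokens and optional partner-gesture information to strengthen interaction dynamics; and a motion dictionary that encodes motion priors to improve generation quality. Viewed objectively, explicitly modeling the mutual influence between the two speakers gives the model a clear advantage in generating socially appropriate gestures.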
Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker’s motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner’s gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.
[80] AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios cs.CVPDF
Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang
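TL;DR: This paper introduces AgentVista, a benchmark for evaluating generalist multimodal agents in ultra-challenging, realistic visual scenarios. It spans 25 sub-domains across 7 categories and requires agents to solve multi-step workflows in detail-rich real scenes through long-horizon, cross-modal tool interactions such as web search, image search, page navigation, and code-based operations.
Details
Motivation: Existing benchmarks mainly evaluate single-turn visual reasoning or specific tool skills and do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. A new benchmark is needed to assess the overall ability of multimodal agents in complex real-world scenarios.
Result: A comprehensive evaluation of state-of-the-art models reveals large gaps in their ability to carry out long-horizon multimodal tool use. The best model in the evaluation (Gemini-3-Pro with tools) achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns.
Insight: The novelty is a comprehensive benchmark that emphasizes realism, visual detail, and long-horizon, cross-modal tool interaction. Its core value is exposing the limits of current models by simulating complex real-world workflows, which should drive the development of more capable and reliable multimodal agents.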
Abstract: Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
[81] Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms cs.CVPDF
Bin Zeng, Johannes Künzel, Anna Hilsmann, Peter Eisert
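TL;DR: This paper proposes Phys-3D, a physics-constrained real-time crowd tracking and counting framework for platforms, where occlusion, camera motion, and perspective distortion during train arrivals make counting difficult. Within the DeepSORT framework, the method integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding, introduces a physics-constrained Kalman model based on pinhole geometry to keep 3D motion physically plausible, and uses a virtual counting band with persistence to cope with occlusions.
Details
Motivation: Existing tracking-by-detection methods usually assume a static camera or ignore physical consistency in motion modeling, which makes crowd counting unreliable under dynamic conditions such as train arrivals. The paper aims to achieve accurate, real-time counting with simple single-camera hardware as the train scans the platform, in support of safety and capacity management.
Result: On the authors' benchmark dataset, MOT-RailwayPlatformCrowdHead (MOT-RPCH), the method reduces counting error to 2.97% and performs robustly despite motion and occlusions.
Insight: The core contribution is integrating first-principles geometry (the pinhole model) and motion priors (a physics-constrained 3D Kalman model) into a real-time tracking pipeline, which enables reliable counting in dynamic, heavily occluded, safety-critical transportation scenarios. The virtual counting band also makes counting more stable under occlusion.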
Abstract: Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted on a train that scans the platform as the train arrives. Although the hardware setup is simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, the MOT-RailwayPlatformCrowdHead Dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
[82] Uni-Animator: Towards Unified Visual Colorization cs.CVPDF
Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng
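TL;DR: This paper proposes Uni-Animator, a unified framework for image and video sketch colorization based on the Diffusion Transformer (DiT). It targets three problems that existing methods face when unifying image and video tasks: imprecise transfer of reference colors, inadequate preservation of high-frequency physical details, and poor temporal coherence in large-motion scenes. Instance patch embedding strengthens the visual reference, physical features reinforce detail, and sketch-based dynamic RoPE encoding provides precise color alignment, high-fidelity texture preservation, and robust temporal consistency.
Details
Motivation: Existing sketch colorization methods struggle to handle image and video tasks in one framework: color transfer from single or multiple references is imprecise, high-frequency physical details are poorly preserved, and temporal coherence breaks down in large-motion scenes (motion artifacts appear). The paper aims to build a unified framework that resolves these challenges.
Result: Extensive experiments show that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.
Insight: The contributions are: 1) visual reference enhancement through instance patch embedding, for precise color alignment and fusion; 2) detail reinforcement using physical features, which captures and retains high-frequency textures; 3) sketch-based dynamic RoPE encoding, which adaptively models motion-aware spatial-temporal dependencies to improve temporal consistency. Viewed objectively, combining the DiT architecture with targeted modules such as dynamic RoPE is a promising paradigm for generative tasks that handle images and video together.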
Abstract: We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.
[83] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation cs.CV | cs.AIPDF
Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang
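TL;DR: This paper proposes ColoDiff, a diffusion-based framework for colonoscopy video generation that aims to alleviate data scarcity and assist clinical analysis. At the inter-frame level, its TimeStream module decouples temporal dependencies to model dynamic consistency; at the intra-frame level, its Content-Aware module provides precise control over clinical attributes. A non-Markovian sampling strategy cuts the number of generation steps by over 90% for real-time generation.
Details
Motivation: Colonoscopy video generation is critical for diagnosing intestinal diseases, especially in data-scarce scenarios. High-quality generation, however, is challenged by irregular intestinal structures, diverse disease representations, and varying imaging modalities, and it requires both temporal consistency and precise control over clinical attributes.
Result: ColoDiff is evaluated on three public datasets and one hospital database, using both generation-quality metrics and downstream tasks (disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation). Experiments show that ColoDiff generates videos with smooth transitions and rich dynamics.
Insight: The contributions are: 1) a cross-frame tokenization mechanism that decouples temporal dependencies, addressing the dynamic-modeling difficulty posed by irregular intestinal structures; 2) noise-injected embeddings combined with learnable prototypes for fine-grained control over clinical attributes, going beyond the coarse guidance of diffusion models; 3) an efficient non-Markovian sampling strategy that substantially accelerates generation. This offers a new approach to controllable medical video generation and shows the potential of synthetic data to complement authentic representations and mitigate clinical data scarcity.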
Abstract: Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.
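The abstract does not say which non-Markovian sampler ColoDiff uses. A well-known family of this kind is DDIM-style sampling, which runs the reverse process over a strided subset of the training timesteps. The sketch below shows only how such a subset produces the step reduction; the function name and the figures of 1000 training steps and 50 sampling steps are assumptions for illustration, not values from the paper.

```python
def strided_timesteps(num_train_steps, num_sample_steps):
    """Return a descending, evenly strided subset of the training timesteps."""
    stride = num_train_steps // num_sample_steps
    steps = list(range(0, num_train_steps, stride))[:num_sample_steps]
    return steps[::-1]  # sampling runs from the noisiest step down to step 0


steps = strided_timesteps(1000, 50)
reduction = 1 - len(steps) / 1000  # fraction of denoising steps skipped
```

With these assumed numbers the sampler visits 50 of 1000 timesteps, which is consistent in spirit with a reduction of "over 90%".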
[84] Motion-aware Event Suppression for Event Cameras cs.CV | cs.ROPDF
Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza
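TL;DR: This paper presents the first motion-aware event suppression framework, which filters events triggered by independently moving objects (IMOs) and ego-motion in real time. The model jointly segments IMOs in the current event stream and predicts their future motion, enabling anticipatory suppression of dynamic events.
Details
Motivation: Events generated by independently moving objects and ego-motion are redundant for many event-camera pipelines; suppressing them improves processing efficiency and downstream task performance.
Result: On the EVIMO benchmark, segmentation accuracy improves by 67% over the previous SOTA method, with inference running at 173 Hz (a 53% higher rate). In downstream tasks, token pruning accelerates Vision Transformer inference by 83%, and the absolute trajectory error of event-based visual odometry drops by 13%.
Insight: The key idea is to combine IMO segmentation with future motion prediction for anticipatory event suppression. The lightweight architecture delivers real-time performance while maintaining high accuracy, offering a new approach to efficient event-camera processing.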
Abstract: In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
[85] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents cs.CVPDF
Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren
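TL;DR: This paper proposes EmbodMocap, a portable, low-cost data collection pipeline that uses two moving iPhones. It jointly calibrates dual RGB-D sequences to reconstruct humans and scenes in a unified world coordinate frame, addressing the difficulty of collecting scene-conditioned human motion data at scale in the wild.
Details
Motivation: Existing capture systems rely on costly studio setups and wearable devices, which limits large-scale collection of scene-conditioned human motion data in natural environments, even though such data is essential for training embodied agents.
Result: Compared against optical-capture ground truth, the dual-view setting substantially mitigates depth ambiguity and outperforms a single iPhone or monocular models in alignment and reconstruction. The collected data supports three embodied AI tasks: monocular human-scene reconstruction, physics-based character animation, and robot motion control.
Insight: The contribution is a portable dual-view pipeline that achieves metric-scale, scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry and providing a scalable data foundation for embodied AI research.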
Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over a single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
[86] Through BrokenEyes: How Eye Disorders Impact Face Detection? cs.CVPDF
Prottay Kumar Adhikary
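TL;DR: This paper develops a computational framework called BrokenEyes that simulates the effect of five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) on the feature representations of deep learning models. Cataract and glaucoma are found to severely disrupt the models' feature maps, consistent with known neural processing challenges.
Details
Motivation: To investigate how visual impairments affect the way computer vision systems, face detection models in particular, process information, and to understand the interplay between degraded visual inputs and learned feature representations.
Result: Using metrics such as activation energy and cosine similarity, the study quantifies the severity of feature distortion caused by visual impairments and shows critical disruptions in the feature representations of models trained under simulated disorder conditions.
Insight: The contribution is a systematic computational framework for simulating the effect of multiple eye disorders on deep learning models, together with a methodology for quantifying how degraded visual input distorts a network's internal feature representations. This offers a new perspective on model robustness under non-ideal visual conditions.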
Abstract: Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
[87] Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks cs.CV | math.NAPDF
Alaa El Ichi, Khalide Jbilou
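TL;DR: This paper proposes Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. The paper argues that current computer vision task formulations are constrained by matrix-based thinking, whereas GE-MLPs use tensor-valued parameters and give explicit control over which dimensions are preserved or contracted, without information loss. Through rigorous mathematical derivation, the authors show that classification, segmentation, and detection are all special cases of MTL, differing only in their dimensional configuration within a formally defined task space. The framework provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through tensor algebra.
Details
Motivation: Current computer vision task formulations are constrained by matrix-based thinking: standard architectures rely on matrix weights and vector biases and require information-destroying structural flattening, which restricts the space of naturally expressible tasks.
Result: The paper proves mathematically that the proposed task space is strictly larger than what matrix-based formulations can natively express, supporting settings such as spatiotemporal or cross-modal prediction that call for principled task configurations.
Insight: The contribution is a unified mathematical framework based on tensor operations that treats multiple vision tasks as special cases differing only in dimensional configuration, theoretically enlarging the expressible task space and offering a new algebraic perspective for task design and comparison.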
Abstract: This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
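For readers unfamiliar with the Einstein product, the standard definition is reproduced here as background. The symbols follow the common tensor-algebra convention and may not match the paper's own notation. For tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_M \times K_1 \times \cdots \times K_N}$ and $\mathcal{B} \in \mathbb{R}^{K_1 \times \cdots \times K_N \times J_1 \times \cdots \times J_L}$, the product contracts the $N$ shared modes:

```latex
\[
(\mathcal{A} *_N \mathcal{B})_{i_1 \dots i_M\, j_1 \dots j_L}
  = \sum_{k_1=1}^{K_1} \cdots \sum_{k_N=1}^{K_N}
    \mathcal{A}_{i_1 \dots i_M\, k_1 \dots k_N}\,
    \mathcal{B}_{k_1 \dots k_N\, j_1 \dots j_L}
\]
```

With $M = N = L = 1$ this reduces to the ordinary matrix product, which is why a matrix-weight layer can be seen as one particular choice of which modes are contracted and which are preserved.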
[88] UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception cs.CV | cs.ROPDF
Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai
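TL;DR: UniScale is a unified, scale-aware multi-view 3D reconstruction framework for robotic perception that flexibly integrates geometric priors through a modular, semantically informed design. A single feed-forward network jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of the scene, and it can optionally incorporate auxiliary geometric priors, improving performance when camera parameters are known.
Details
Motivation: To address the key challenge in vision-based robotic navigation of accurately extracting environmental structure from raw image sequences, achieving robust, metric-aware 3D reconstruction suited to resource-constrained robotic teams.
Result: Evaluation on multiple benchmarks shows strong generalization and consistent performance across diverse environments; the abstract does not report specific quantitative results.
Insight: The contribution is the combination of global contextual reasoning with camera-aware feature representations to recover the scene's metric scale. The modular design allows flexible integration of priors such as known camera intrinsics and extrinsics, and the method does not require training from scratch: it leverages world priors in pre-trained models and avoids complex geometric encoding strategies.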
Abstract: We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
[89] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction cs.CV | cs.AIPDF
Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
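TL;DR: MovieTeller is a new framework for generating synopses of long-form videos such as movies. Using tool-augmented progressive abstraction, it addresses the inconsistent character identification and poor narrative coherence of existing vision-language models on long videos. The framework requires no fine-tuning: it uses off-the-shelf models (such as a face recognition model) as external tools to establish factual grounding, and it uses multi-stage processing to overcome the models' context-length limits.
Details
Motivation: To address the failures of existing general-purpose vision-language models when generating synopses of long videos such as movies, which stem from a lack of ID-consistent character identification and fractured narrative coherence.
Result: Experiments show significant improvements in factual accuracy, character consistency, and overall narrative coherence over end-to-end baselines.
Insight: The contribution is a training-free, tool-augmented, fact-grounded generation process: an external tool (such as a face recognition model) establishes precise character identities as factual grounding, and a progressive abstraction pipeline decomposes long-video summarization into multiple stages, mitigating the context-length limits of current vision-language models.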
Abstract: With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external “tool” to establish Factual Groundings–precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM’s reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
[90] Large Multimodal Models as General In-Context Classifiers cs.CVPDF
Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
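TL;DR: This paper examines the potential of Large Multimodal Models (LMMs) as general in-context classifiers. Although their zero-shot classification performance is below that of contrastive Vision-Language Models (VLMs) such as CLIP, LMMs given a few in-context examples can match or even surpass VLMs with cache-based adapters, and they show advantages in open-world classification.
Details
Motivation: To address the neglect of LMMs' in-context learning ability in classification research, and to explore whether LMMs can serve as unified classifiers in place of specialized models.
Result: On multiple closed-world classification datasets, LMMs with a few in-context examples match or surpass contrastive VLMs. In the open-world setting, the proposed CIRCLE method surpasses its VLM counterparts and establishes a robust baseline.
Insight: The paper highlights the importance of LMMs' in-context learning ability for classification and proposes the training-free CIRCLE method, which iteratively refines context information through pseudo-labels, showing the potential of LMMs as flexible, unified classifiers.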
Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP’s, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their “in-context” equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
[91] Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents cs.CV | cs.AIPDF
Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
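TL;DR: This paper proposes GUIPruner, a training-free framework that addresses the severe spatiotemporal redundancy pure-vision GUI agents face when processing high-resolution screenshots and historical trajectories. Through two complementary techniques, Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP), it substantially reduces computational overhead and vision encoding latency while maintaining high-precision navigation performance.
Details
Motivation: Existing compression paradigms have two critical misalignments: a temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and a spatial topology conflict, where unstructured pruning compromises the grid integrity needed for precise coordinate grounding and induces spatial hallucinations. This paper aims to resolve both in order to enable efficient high-resolution GUI navigation.
Result: Extensive evaluation across multiple benchmarks shows that GUIPruner consistently achieves state-of-the-art performance and prevents the performance collapse seen in large-scale models under high compression. On Qwen2-VL-2B, the method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance.
Insight: The contribution is a training-free framework in which TAR and SSP work together, addressing the temporal mismatch and the spatial topology conflict respectively. Viewed objectively, tying the compression strategy to the specific structure of GUI interaction (such as interactive foregrounds and semantic anchors) achieves efficient compression while preserving global layout integrity, a task-specific optimization approach for GUI navigation that is worth borrowing.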
Abstract: Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent’s “fading memory” attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
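The abstract describes TAR only as "decay-based resizing" of history. A minimal sketch of that idea, assuming a geometric decay with a floor, is given below. The function name, the decay rate of 0.5, and the minimum scale of 0.125 are hypothetical choices for illustration; the paper's actual schedule is not stated in the abstract.

```python
def history_scales(num_history, decay=0.5, min_scale=0.125):
    """Return one resize factor per past screenshot, oldest first.

    Age 1 is the most recent past frame; older frames are shrunk more,
    but never below `min_scale`.
    """
    return [max(min_scale, decay ** age) for age in range(num_history, 0, -1)]


scales = history_scales(4)
```

Under these assumptions, older screenshots cost far fewer visual tokens than recent ones, which mirrors the "fading memory" attention pattern the abstract mentions.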
[92] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving cs.CV | cs.AI | cs.ROPDF
Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu
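TL;DR: This paper proposes Risk-aware World Model Predictive Control (RaWMPC), a unified framework that addresses the generalization of end-to-end autonomous driving to long-tail scenarios. Rather than relying on expert demonstrations, the method uses a world model to predict the consequences of candidate actions and selects low-risk actions through explicit risk evaluation.
Details
Motivation: Current imitation-learning-based end-to-end driving methods depend heavily on expert demonstrations. When they encounter rare or unseen long-tail scenarios outside the expert data distribution, the models lack prior experience and tend to make unsafe decisions. This paper explores a driving system that can make reliable decisions without any expert action supervision.
Result: Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios while providing better decision interpretability.
Insight: The contributions are: 1) a risk-aware interaction strategy that systematically exposes the world model to hazardous driving behaviors so that it can predict catastrophic outcomes; 2) a self-evaluation distillation method that distills the risk-avoidance capability of the trained world model into a generative action proposal network, without expert demonstrations. Viewed objectively, this is a novel way to combine reinforcement learning and control theory with world models to improve safe generalization.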
Abstract: With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of “only driving like the expert” suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
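The selection step described in the abstract (predict the consequence of each candidate action with a world model, score its risk, pick a low-risk action) can be sketched in a few lines. Everything below is a toy under stated assumptions: `toy_world_model` and `toy_risk` are hypothetical stand-ins for the learned world model and risk evaluator, and the candidate set is hand-written rather than produced by an action proposal network.

```python
def select_low_risk_action(candidates, world_model, risk_fn):
    """Roll each candidate through the world model and return (action, risk) with minimal risk."""
    scored = [(risk_fn(world_model(action)), action) for action in candidates]
    risk, action = min(scored, key=lambda pair: pair[0])
    return action, risk


def toy_world_model(accel):
    """Hypothetical stand-in: predicted gap (m) to the lead vehicle after applying `accel`."""
    return 8.0 - 3.0 * accel


def toy_risk(gap):
    """Hypothetical risk score: a smaller predicted gap means a higher risk."""
    return 1.0 / max(gap, 1e-6)


action, risk = select_low_risk_action([-1.0, 0.0, 1.0], toy_world_model, toy_risk)
```

A pure risk minimizer like this toy always prefers the most cautious candidate; a practical controller would also weigh progress or comfort, which the abstract does not detail.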
[93] LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction cs.CVPDF
Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale
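TL;DR: This paper proposes LineGraph2Road, a framework for automatically extracting road networks from satellite imagery. It formulates connectivity prediction as binary classification over the edges of a constructed global, sparse Euclidean graph, then converts that graph into its line graph and applies a Graph Transformer to learn structural link representations, overcoming the difficulty of capturing long-range dependencies and complex topologies. It also introduces an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections.
Details
Motivation: Existing road extraction methods usually decompose the task into keypoint extraction and connectivity prediction but struggle to capture long-range dependencies and complex topologies, which motivates a method better suited to structural graph reasoning.
Result: Evaluated on three benchmarks (City-scale, SpaceNet, and Global-scale), LineGraph2Road achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS, and captures fine visual details that matter for real-world deployment.
Insight: The contributions include recasting connectivity prediction as classification of the original graph's edges, carried out on its line graph, with a Graph Transformer learning structural link representations. The overpass/underpass head and the coupled NMS strategy handle multi-level crossings and preserve critical connections, improving global structural reasoning.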
Abstract: The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
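The graph construction in the abstract has two steps: connect keypoints that lie within a distance threshold, then convert that graph into its line graph, where each candidate road segment becomes a node and two segments are linked when they share a keypoint. The sketch below illustrates both steps with the standard line-graph definition; the function names, the toy keypoints, and the radius are assumptions, and the Graph Transformer that classifies the line-graph nodes is not shown.

```python
from itertools import combinations
from math import dist


def build_candidate_graph(points, radius):
    """Edges (i, j) between keypoints whose Euclidean distance is at most `radius`."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) <= radius]


def line_graph(edges):
    """Line-graph edges (a, b): indices of original edges that share an endpoint."""
    return [(a, b) for a, b in combinations(range(len(edges)), 2)
            if set(edges[a]) & set(edges[b])]


keypoints = [(0, 0), (1, 0), (2, 0), (10, 10)]
edges = build_candidate_graph(keypoints, radius=1.5)
lg_edges = line_graph(edges)
```

Because each original edge is a node in the line graph, predicting whether a candidate segment is a real road becomes a node-level decision that can attend to neighbouring segments.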
[94] Towards Long-Form Spatio-Temporal Video Grounding cs.CVPDF
Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang
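TL;DR: This paper proposes ART-STVG, a new method for long-form spatio-temporal video grounding (LF-STVG). It uses an autoregressive Transformer architecture to process streaming video input and designs spatio-temporal memory banks and a cascaded decoder to cope with the long temporal spans and abundant irrelevant content of long videos, significantly outperforming existing methods on extended long-video datasets.
Details
Motivation: Existing STVG research mainly targets short videos of tens of seconds, whereas real-world videos can last several minutes or even hours, which limits practical use. This paper therefore explores LF-STVG to tackle target localization in long videos.
Result: On the newly extended LF-STVG datasets, ART-STVG significantly outperforms state-of-the-art methods, and it also achieves competitive performance on conventional short-video STVG benchmarks.
Insight: The contributions include: an autoregressive Transformer that processes streaming video input for efficient handling of long videos; spatial and temporal memory banks with memory selection strategies that supply more relevant context; and a cascaded spatio-temporal decoder design in which fine-grained spatial cues assist complex temporal localization.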
Abstract: In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
[95] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM cs.CVPDF
Yiqing Wang, Chunming He, Ming-Chen Lu, Mercy Pawar, Leslie Niziol
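TL;DR: PRIMA is a multimodal pre-training framework for medical diagnosis that improves diagnostic performance by semantically aligning images with clinical metadata and incorporating domain expertise. It uses Retrieval-Augmented Generation to build an expert corpus of risk-disease correlations for refining the text encoder, adopts a dual-encoder pre-training strategy with multi-granular loss functions, and finally fuses the features with a large language model for classification.
Details
Motivation: Existing medical diagnosis methods often treat metadata as isolated tags and fail to exploit the rich semantic knowledge in clinical descriptions, so a framework that effectively fuses visual manifestations with clinical metadata is needed.
Result: On medical diagnosis tasks, PRIMA significantly outperforms other state-of-the-art methods and shows superior robustness, without requiring massive data collection or extensive computational resources.
Insight: The contributions include: building a domain expert corpus via Retrieval-Augmented Generation to embed diagnostic priors; a dual-encoder pre-training strategy with multi-granular alignment losses that handle the ambiguity of clinical correlations; and feature fusion with a large language model that reconciles pixel-level features with abstract clinical knowledge.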
Abstract: Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.
[96] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding cs.CVPDF
Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju
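TL;DR: The paper proposes ThinkOmni, a framework that lifts textual reasoning to omni-modal scenarios without additional training or data. It uses an off-the-shelf large reasoning model (LRM) to guide the decoding of an omni-modal large language model (OLLM), and it adaptively balances perception and reasoning signals through Stepwise Contrastive Scaling. ThinkOmni improves performance on six multi-modal reasoning benchmarks, reaching, for example, 70.2 on MathVista and 75.5 on MMAU.
Details
Motivation: Existing OLLMs excel at perceiving multi-modal data but lack complex reasoning ability, and strengthening their reasoning through additional training runs into the need for high-quality data, task-specific adaptation, and substantial computational cost.
Result: ThinkOmni consistently improves performance on six multi-modal reasoning benchmarks, with main results of 70.2 on MathVista and 75.5 on MMAU.
Insight: The contributions are: 1) LRM-as-a-Guide, which uses an off-the-shelf LRM to guide OLLM decoding, transferring reasoning ability without training or data; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. This offers new insight into the generalization and application of reasoning capabilities.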
Abstract: Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
[97] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? cs.CVPDF
Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
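TL;DR: This paper proposes a retrieval-augmented test-time adapter to close the performance gap in open-vocabulary segmentation (OVS) caused by coarse image-level supervision and the semantic ambiguity of natural language. By adding a small set of pixel-annotated support images to the text prompts, it learns a lightweight per-image classifier and achieves stronger fusion between modalities.
Details
Motivation: OVS lags behind fully supervised methods because of the coarse image-level supervision used to train vision-language models (VLMs) and the semantic ambiguity of natural language.
Result: Experiments show that the method significantly narrows the gap between zero-shot and fully supervised segmentation while preserving open-vocabulary ability, and that it applies to fine-grained tasks such as personalized segmentation.
Insight: The contribution is a few-shot support set combined with a retrieval-augmented test-time adapter that learns per-query fusion of textual and visual features, replacing late, hand-crafted fusion and yielding stronger synergy between modalities. The support set can be expanded continually, improving adaptability.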
Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
[98] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation cs.CV | cs.AIPDF
Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
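TL;DR: This paper proposes SeeThrough3D, a model that addresses the inadequate modeling of inter-object occlusion in text-to-image generation controlled by 3D layouts. It introduces an occlusion-aware 3D scene representation (OSCR) that depicts objects as translucent 3D boxes in a virtual environment, and it conditions a pretrained flow-based text-to-image model on this representation through visual tokens, with masked self-attention providing accurate binding between objects and their attributes.
Details
Motivation: Existing methods can generate realistic scenes that follow an input 3D layout but often fail to model inter-object occlusions precisely, even though occlusion reasoning is essential for synthesizing partially occluded objects with depth-consistent geometry and scale.
Result: Trained on a constructed synthetic dataset of diverse multi-object scenes with strong inter-object occlusions, the model generalizes to unseen object categories and achieves precise 3D layout control with realistic occlusions and consistent camera control.
Insight: The core contribution is OSCR: rendering translucent 3D boxes encodes hidden regions so that the model can reason about occlusion explicitly. In addition, visual tokens combined with masked self-attention bind each object bounding box precisely to its text description, avoiding attribute mixing when generating multiple objects.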
Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
[99] MediX-R1: Open Ended Medical Reinforcement Learning cs.CVPDF
Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan
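TL;DR: This paper proposes MediX-R1, an open-ended reinforcement learning framework for medical multimodal large language models (MLLMs). By combining group-based RL with a composite reward (an LLM-based accuracy reward, a medical-embedding semantic reward, and lightweight format and modality rewards), the framework enables clinically grounded, free-form answers beyond multiple-choice formats. The paper also proposes a unified evaluation framework that uses a reference-based LLM-as-judge in place of brittle string-overlap metrics to measure semantic correctness, reasoning, and contextual alignment. Despite using only about 51K instruction examples, MediX-R1 achieves excellent results on standard medical LLM (text-only) and VLM (image + text) benchmarks.
Details
Motivation: Existing medical MLLMs are limited on open-ended, free-form answering tasks: traditional approaches rely on verifiable or multiple-choice-only rewards, which struggle to provide stable, informative feedback and so restrict use in complex scenarios that require clinical reasoning and explanation.
Result: On standard medical LLM (text-only) and VLM (image + text) benchmarks, MediX-R1 outperforms strong open-source baselines and delivers particularly large gains on open-ended clinical tasks, reaching state-of-the-art performance.
Insight: The innovations are a composite reward signal designed for medical reasoning (combining accuracy, semantic similarity, format, and modality recognition) and a unified, reference-based LLM-as-judge evaluation framework. Viewed objectively, the multi-signal reward design and LLM-based semantic evaluation offer a practical route past sparse rewards and difficult evaluation in open-ended generation, and may transfer to other domains that require precise reasoning and explanation.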
Abstract: We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
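Sketch: the composite reward described in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the four signals are combined, so the weighted sum, the weight values, and the names `RewardWeights`, `composite_reward`, and `group_advantages` are all hypothetical. The group-relative normalization is likewise one common reading of "Group Based RL", not a confirmed detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not give the combination rule.
    accuracy: float = 0.6
    semantic: float = 0.2
    fmt: float = 0.1
    modality: float = 0.1


def composite_reward(judge_says_yes: bool,
                     semantic_similarity: float,
                     format_ok: bool,
                     modality_ok: bool,
                     w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the four signals named in the abstract (assumed form)."""
    sim = min(1.0, max(0.0, semantic_similarity))  # clamp similarity to [0, 1]
    return (w.accuracy * float(judge_says_yes)     # strict YES/NO judge decision
            + w.semantic * sim                     # embedding-based similarity
            + w.fmt * float(format_ok)             # output follows the format
            + w.modality * float(modality_ok))     # modality recognized


def group_advantages(rewards):
    """Center and scale rewards within one group of sampled answers (assumed)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For example, `composite_reward(True, 0.8, True, False)` scores an answer that the judge accepts and that is well formatted but misidentifies the imaging modality; `group_advantages` then ranks it against the other answers sampled for the same question.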
q-bio.GN [Back]
[100] CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction q-bio.GN | cs.CV | cs.LGPDF
Rabeya Tus Sadia, Qiang Ye, Qiang Cheng
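TL;DR: This paper proposes CrossLLM-Mamba, a new multimodal framework for RNA interaction prediction. It reformulates interaction prediction as a state-space alignment problem and uses bidirectional Mamba encoders to enable deep "crosstalk" between modality-specific embeddings through hidden-state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps.
Details
Motivation: Existing methods built on biological large language models (BioLLMs) rely on static fusion strategies that cannot capture the dynamic, context-dependent nature of molecular binding. The paper aims to close this gap and predict RNA-associated interactions more accurately.
Result: In comprehensive experiments across three interaction categories (RNA-protein, RNA-small molecule, and RNA-RNA), CrossLLM-Mamba achieves state-of-the-art performance. On the RPI1460 benchmark it attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, Pearson correlations exceed 0.95 on riboswitch and repeat RNA subtypes.
Insight: The main innovation is bringing state-space modeling (via Mamba encoders) to multimodal biological interaction prediction, giving dynamic, context-aware fusion across modalities. Gaussian noise injection and Focal Loss add robustness against hard negatives, while linear computational complexity keeps the approach scalable to high-dimensional BioLLM embeddings.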
Abstract: Accurate prediction of RNA-associated interactions is essential for understanding cellular regulation and advancing drug discovery. While Biological Large Language Models (BioLLMs) such as ESM-2 and RiNALMo provide powerful sequence representations, existing methods rely on static fusion strategies that fail to capture the dynamic, context-dependent nature of molecular binding. We introduce CrossLLM-Mamba, a novel framework that reformulates interaction prediction as a state-space alignment problem. By leveraging bidirectional Mamba encoders, our approach enables deep "crosstalk" between modality-specific embeddings through hidden state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps. The framework maintains linear computational complexity, making it scalable to high-dimensional BioLLM embeddings. We further incorporate Gaussian noise injection and Focal Loss to enhance robustness against hard-negative samples. Comprehensive experiments across three interaction categories (RNA-protein, RNA-small molecule, and RNA-RNA) demonstrate that CrossLLM-Mamba achieves state-of-the-art performance. On the RPI1460 benchmark, our model attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, we achieve Pearson correlations exceeding 0.95 on riboswitch and repeat RNA subtypes. These results establish state-space modeling as a powerful paradigm for multi-modal biological interaction prediction.
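Sketch: the two robustness components named in the abstract, Focal Loss and Gaussian noise injection, can be illustrated in a few lines. The focal loss below is the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the defaults `gamma=2.0` and `alpha=0.25`, the noise scale `sigma`, and the function names are assumptions for illustration, since the abstract does not report the settings used.

```python
import math
import random


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    p = min(max(p, 1e-7), 1 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha
    # (1 - p_t)^gamma down-weights easy examples so hard ones dominate the loss.
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


def add_gaussian_noise(embedding, sigma=0.01, rng=None):
    """Perturb an embedding with zero-mean Gaussian noise (training-time only)."""
    rng = rng or random.Random(0)
    return [x + rng.gauss(0.0, sigma) for x in embedding]
```

With `gamma=0` and `alpha=0.5` the loss reduces to half the usual cross-entropy, which makes the effect of the focusing term easy to check in isolation.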
cs.IR [Back]
[101] SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas cs.IR | cs.CL | cs.LGPDF
Cornelius Wolff, Daniel Gomm, Madelon Hulsebos
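TL;DR: The paper introduces SQaLe, a large-scale semi-synthetic text-to-SQL dataset built on real database schemas. It contains 517,676 high-quality (question, schema, query) triples and targets the limited schema complexity, domain coverage, and task diversity of existing datasets, with the goal of improving the generalization of text-to-SQL models.
Details
Motivation: A key bottleneck for generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema complexity, domain coverage, and task diversity, so a more realistic dataset is needed to advance the field.
Result: SQaLe is the most realistic large-scale text-to-SQL dataset to date: compared with existing benchmarks and datasets, it captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity.
Insight: The innovation is a semi-synthetic dataset grounded in real schemas (SchemaPile), produced by a principled generation pipeline of schema sampling, question synthesis, and SQL construction. It supports the authors' vision for data scaling and model generalization and gives text-to-SQL research a benchmark closer to real-world use.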
Abstract: Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data scaling and model generalization in text-to-SQL research. The dataset is accessible at: https://huggingface.co/datasets/trl-lab/SQaLe-text-to-SQL-dataset.
[102] SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG cs.IR | cs.AI | cs.CL | cs.LGPDF
Xuechen Zhang, Koustava Goswami, Samet Oymak, Jiasi Chen, Nedim Lipka
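TL;DR: This paper proposes SmartChunk retrieval, a framework that uses query-aware chunk compression and planning to adjust retrieval granularity dynamically, improving the efficiency and robustness of retrieval-augmented generation (RAG) for long-document question answering.
Details
Motivation: Current RAG pipelines are limited by static chunking and flat retrieval: retrieval quality is sensitive to chunk size, irrelevant chunks introduce noise, and the design scales poorly to large corpora. A retrieval method that adapts to each query is needed.
Result: On five QA benchmarks and one out-of-domain dataset, SmartChunk outperforms state-of-the-art RAG baselines while reducing cost, and shows strong scalability on larger corpora and out-of-domain generalization.
Insight: The innovations are: 1) a planner that predicts the optimal chunk abstraction level for each query; 2) a lightweight compression module that produces high-level chunk embeddings without repeated summarization; 3) STITCH, a novel reinforcement learning scheme that lets the planner reason about chunk abstractions, improving accuracy and generalization.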
Abstract: Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.
[103] Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators cs.IR | cs.CL | cs.LGPDF
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt
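TL;DR: This paper proposes STATIC, an efficient constrained decoding technique designed for LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, it turns irregular tree traversals into vectorized sparse matrix operations and achieves large efficiency gains on hardware accelerators.
Details
Motivation: Industrial recommender systems often need to restrict the output space to a subset of items based on business logic, which standard autoregressive decoding cannot natively support, and existing trie-based constrained decoding methods incur severe latency penalties on hardware accelerators.
Result: Deployed on a large-scale industrial video recommendation platform, STATIC yields significant product-metric gains with minimal latency overhead, achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Evaluation on academic benchmarks also shows that it can considerably improve cold-start performance for generative retrieval.
Insight: The core innovation is converting the tree structure into a static sparse matrix representation, which makes constrained decoding fully vectorized. According to the authors, this enables the first production-scale deployment of strictly constrained generative retrieval.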
Abstract: Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
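Sketch: the idea of flattening a prefix tree into CSR-style arrays can be shown with plain Python lists. This is an illustrative toy, not the released STATIC code: the function names are hypothetical, and the scalar loops here stand in for the batched, vectorized gathers that an accelerator implementation would use. Each trie node becomes a row; `row_ptr` marks where a node's outgoing edges start and end, `col_idx` holds the allowed token ids, and `next_state` holds the child node reached by each token.

```python
def build_csr_trie(sequences):
    """Flatten a prefix tree over token-id sequences into CSR-style arrays."""
    children = [{}]  # children[node]: token id -> child node id; node 0 is the root
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children.append({})
                children[node][tok] = len(children) - 1
            node = children[node][tok]
    row_ptr, col_idx, next_state = [0], [], []
    for edges in children:
        for tok in sorted(edges):
            col_idx.append(tok)
            next_state.append(edges[tok])
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, next_state


def allowed_tokens(csr, state):
    """Token ids permitted after the prefix that ends in `state`."""
    row_ptr, col_idx, _ = csr
    return col_idx[row_ptr[state]:row_ptr[state + 1]]


def step(csr, state, token):
    """Advance to the child state reached by emitting `token`."""
    row_ptr, col_idx, next_state = csr
    for k in range(row_ptr[state], row_ptr[state + 1]):
        if col_idx[k] == token:
            return next_state[k]
    raise ValueError("token not allowed in this state")


def mask_logits(logits, csr, state):
    """Set the logits of disallowed tokens to -inf before sampling."""
    allowed = set(allowed_tokens(csr, state))
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]
```

Because the arrays are static once built, a decoding step only needs slice lookups into `row_ptr` and `col_idx`, which is the property that makes the representation amenable to vectorization.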
cs.MM [Back]
[104] Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads cs.MM | cs.AI | cs.CL | cs.LGPDF
Kunpeng Zhang, Poppy Zhang, Shawndra Hill, Amel Awadelkarim
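TL;DR: This study proposes a framework based on multimodal large language models (MLLMs) for analyzing the "hooking period" of video ads, that is, the first three seconds. The framework processes the video with two frame sampling strategies, uses MLLMs to generate descriptive analyses, distills them into topics with BERTopic, and integrates audio attributes and ad targeting information. Empirical validation shows that it reveals correlations between hooking-period features and key ad performance metrics such as conversion per investment.
Details
Motivation: The hooking period (the first three seconds) of a video ad is critical for capturing viewer attention and influencing engagement. Because video content blends visual, auditory, and textual modalities, traditional methods struggle to capture their nuanced interplay, so an advanced framework is needed for thorough evaluation.
Result: Empirical validation on large-scale real-world data from social media platforms shows that the framework is effective: it reveals correlations between hooking-period features and key performance metrics such as conversion per investment, demonstrating practical applicability and predictive power.
Insight: The innovations are: 1) the first systematic use of MLLMs to analyze the hooking period of video ads; 2) a combination of uniform random sampling and key frame selection to ensure balanced, representative acoustic feature extraction; 3) integration of audio attributes and aggregated ad targeting information to enrich the feature set; 4) topic distillation of MLLM outputs with BERTopic for high-level abstraction. Viewed objectively, the study offers a scalable methodology for video ad analysis focused on optimizing the opening moments.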
Abstract: Video-based ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the ‘hooking period’, the first three seconds that capture viewer attention and influence engagement metrics. Analyzing this brief window is challenging due to the multimodal nature of video content, which blends visual, auditory, and textual elements. Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads. It tests two frame sampling strategies, uniform random sampling and key frame selection, to ensure balanced and representative acoustic feature extraction, capturing the full range of design elements. The hooking video is processed by state-of-the-art MLLMs to generate descriptive analyses of the ad’s initial impact, which are distilled into coherent topics using BERTopic for high-level abstraction. The framework also integrates features such as audio attributes and aggregated ad targeting information, enriching the feature set for further analysis. Empirical validation on large-scale real-world data from social media platforms demonstrates the efficacy of our framework, revealing correlations between hooking period features and key performance metrics like conversion per investment. The results highlight the practical applicability and predictive power of the approach, offering valuable insights for optimizing video ad strategies. This study advances video ad analysis by providing a scalable methodology for understanding and enhancing the initial moments of video advertisements.
cs.AI [Back]
[105] Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation cs.AI | cs.CL | cs.IRPDF
Pengzhen Xie, Huizhi Liang
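TL;DR: This paper proposes GYWI, a scientific idea generation system that integrates author knowledge graphs with retrieval-augmented generation to give large language models controllable academic context and traceable inspiration pathways for generating novel scientific ideas.
Details
Motivation: To address the lack of controllable academic context and traceable inspiration pathways in current LLM-based scientific idea generation.
Result: On a dataset built from arXiv, experiments with models including GPT-4o and DeepSeek-V3 show that GYWI significantly outperforms mainstream LLMs on multiple metrics such as novelty, reliability, and relevance.
Insight: The innovation is a hybrid retrieval mechanism that combines an author-centered knowledge graph with RAG, together with a prompt optimization strategy based on reinforcement learning principles, giving LLM idea generation structured external knowledge and a controllable optimization path.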
Abstract: Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base that provides controllable context and a traceable inspiration path for LLMs generating new scientific ideas. First, we propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct the external knowledge base. Second, we propose a hybrid retrieval mechanism, composed of both RAG and GraphRAG, that retrieves content with both depth and breadth of knowledge and forms a hybrid context. Third, we propose a prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs in optimizing the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment on a multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated along five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs on multiple metrics such as novelty, reliability, and relevance.
[106] How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision? cs.AI | cs.CL | cs.LGPDF
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu
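TL;DR: This paper presents a comprehensive analysis of latent reasoning methods under weak and strong supervision and exposes key problems in their internal mechanisms: shortcut behavior is pervasive, and latent representations do not faithfully implement structured search during reasoning but instead exhibit implicit pruning and compression. Supervision strength involves a trade-off: stronger supervision mitigates shortcut behavior but limits the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of more shortcut behavior.
Details
Motivation: Latent reasoning is an emerging paradigm that performs multi-step computation in a continuous latent space rather than in discrete text space, but its internal mechanisms remain under-studied. The paper aims to understand the role and behavior of latent representations during reasoning.
Result: The analysis reveals pervasive shortcut behavior, in which models reach high accuracy without relying on latent reasoning. It also finds that although latent representations can encode multiple possibilities, the reasoning process does not faithfully carry out structured, BFS-like exploration and instead exhibits implicit pruning and compression.
Insight: The innovation is a systematic analysis of the internal mechanisms of latent reasoning methods under different supervision strengths, revealing a key trade-off between supervision strength and model behavior (shortcut behavior versus hypothesis diversity) and offering guidance for understanding and improving latent reasoning.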
Abstract: Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space. This paradigm enables reasoning beyond discrete language tokens by performing multi-step computation in continuous latent spaces. Although there have been numerous studies focusing on improving the performance of latent reasoning, its internal mechanisms remain not fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search, but instead exhibits implicit pruning and compression. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.
[107] VeRO: An Evaluation Harness for Agents to Optimize Agents cs.AI | cs.CL | cs.LGPDF
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue
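TL;DR: The paper proposes VeRO, an evaluation framework for coding agents that optimize other agents; it systematizes the agent optimization process through versioning, rewards, and structured observations.
Details
Motivation: To address the lack of systematic evaluation for the agent optimization task, in particular the hybrid nature of target agents, which interleave deterministic code with stochastic LLM generations inside edit-execute-evaluate cycles.
Result: Using VeRO, the authors conduct an empirical study that compares optimizer configurations across tasks and analyzes which modifications reliably improve target agent performance.
Insight: The innovation is a reproducible evaluation harness and benchmark suite that support structured capture of an agent's intermediate reasoning and execution outcomes, supporting research on agent optimization as a core capability of coding agents.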
Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
[108] Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance cs.AI | cs.CLPDF
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song
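TL;DR: This paper studies why example-based guidance for mathematical reasoning is unstable at inference time and traces the cause to a gap between strategy usage and strategy executability. By analyzing differences between human-written and model-generated solutions, it proposes Selective Strategy Retrieval, a framework that selectively retrieves and combines strategies using multi-route, source-aware signals and delivers stable gains on several mathematical reasoning benchmarks.
Details
Motivation: Example-based guidance in mathematical reasoning is unstable and varies strongly across problems and models. The paper explores the gap between strategy usage and executability and how it affects guidance.
Result: Accuracy improves by up to 13 points on AIME25 and 5 points on Apex, outperforming direct solving, in-context learning, and single-source guidance, with reliable and consistent improvements across multiple mathematical reasoning benchmarks.
Insight: The innovation is identifying the systematic dissociation between strategy usage and executability and building the Selective Strategy Retrieval framework on it: by modeling executability and selecting strategies with multi-route, source-aware signals, it makes guidance more stable and effective.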
Abstract: Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models, even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model). Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to $+13$ points on AIME25 and $+5$ points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy-execute-pipeline.
[109] OmniGAIA: Towards Native Omni-Modal AI Agents cs.AI | cs.CL | cs.CV | cs.LG | cs.MMPDF
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong
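TL;DR: This paper proposes the OmniGAIA benchmark and the OmniAtlas agent to advance native omni-modal AI assistants. OmniGAIA is a comprehensive benchmark for evaluating agents on deep reasoning and multi-turn tool use across video, audio, and image modalities. OmniAtlas is a native omni-modal foundation agent under the tool-integrated reasoning paradigm with active omni-modal perception; it is trained with a hindsight-guided tree exploration strategy and OmniDPO, and it effectively improves the tool-use capabilities of existing open-source models.
Details
Motivation: Current multimodal LLMs are mostly confined to bi-modal interaction (for example, vision-language) and lack the unified cognitive capabilities a general AI assistant requires. The paper aims to close the gap between existing models and human intelligence, which naturally combines omni-modal perception with complex reasoning and tool use.
Result: The paper introduces the OmniGAIA benchmark and the OmniAtlas agent. With its training strategy, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models, but the abstract gives no quantitative results or direct comparison with state-of-the-art models.
Insight: The main innovations are: 1) the OmniGAIA benchmark, built with a novel omni-modal event graph approach, for evaluating complex tasks that require cross-modal reasoning and external tool integration; 2) OmniAtlas, a native omni-modal foundation agent framework under the tool-integrated reasoning paradigm; 3) a training method that uses hindsight-guided tree exploration and OmniDPO for fine-grained error correction. Together they lay groundwork for next-generation native omni-modal AI assistants in real-world scenarios.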
Abstract: Human intelligence naturally intertwines omni-modal perception – spanning vision, audio, and language – with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
[110] A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring cs.AI | cs.CL | cs.CR | cs.IT | cs.MAPDF
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall
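TL;DR: This paper proposes a decision-theoretic formalisation of steganography for detecting and quantifying steganographic reasoning in large language models. It introduces generalised V-information to measure the amount of usable information in an input and defines the "steganographic gap", a measure that quantifies steganography by comparing the downstream utility of a steganographic signal to agents that can and cannot decode the hidden content.
Details
Motivation: Large language models are beginning to show steganographic capabilities, which could let misaligned models evade oversight mechanisms, yet principled methods to detect and quantify such behaviour are lacking. Classical definitions of steganography and the detection methods built on them require a known reference distribution of non-steganographic signals, which is not feasible for steganographic reasoning in LLMs, so these methods do not apply.
Result: The paper empirically validates its formalism and shows that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
Insight: The core innovation is viewing steganography through decision theory. The central insight is that steganography creates an asymmetry in usable information between agents that can and cannot decode the hidden content, and that this otherwise latent asymmetry can be inferred from the agents' observable actions. This gives a principled way to detect steganography in LLMs without a reference distribution.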
Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents’ observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap, a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
[111] AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning cs.AI | cs.CLPDF
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding
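TL;DR: This paper proposes AgentDropoutV2, a framework that dynamically optimizes information flow in multi-agent systems at test time without retraining. It uses a retrieval-augmented rectifier to iteratively correct errors, identifies potential errors with a failure-driven indicator pool, and prunes irreparable outputs to prevent error propagation, improving system performance.
Details
Motivation: Erroneous information produced by individual participants in a multi-agent system cascades through the system. Existing solutions typically rely on rigid structural engineering or expensive fine-tuning, which limits deployability and adaptability.
Result: On extensive math benchmarks, AgentDropoutV2 significantly improves the task performance of multi-agent systems, with an average accuracy gain of 6.3 percentage points, and shows robust generalization and adaptivity.
Insight: The innovation is a test-time rectify-or-reject pruning mechanism that uses retrieval augmentation and prior knowledge of failure patterns to intercept and correct errors actively, optimizing information flow without retraining and improving robustness and efficiency.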
Abstract: While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS’s task performance, achieving an average accuracy gain of 6.3 percentage points. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.
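Sketch: the rectify-or-reject control flow described in the abstract might look like the loop below. This is a hedged illustration, not the released AgentDropoutV2 code: `find_issues` stands in for matching an output against the failure-driven indicator pool, `rectify` stands in for the retrieval-augmented rectifier, and the fallback rule (pass the original outputs through if everything was pruned) is an assumed form of the "fallback strategy" the abstract mentions.

```python
from typing import Callable, List, Optional


def rectify_or_reject(output: str,
                      find_issues: Callable[[str], List[str]],
                      rectify: Callable[[str, List[str]], str],
                      max_rounds: int = 2) -> Optional[str]:
    """Try to repair an agent output; return None if it stays faulty."""
    for _ in range(max_rounds):
        issues = find_issues(output)
        if not issues:
            return output                    # clean: pass downstream unchanged
        output = rectify(output, issues)     # attempt a targeted correction
    # Out of rounds: keep the output only if the last fix resolved everything.
    return output if not find_issues(output) else None


def filter_messages(outputs, find_issues, rectify, max_rounds=2):
    """Intercept all agent outputs, pruning the ones that cannot be repaired."""
    repaired = (rectify_or_reject(o, find_issues, rectify, max_rounds) for o in outputs)
    kept = [m for m in repaired if m is not None]
    # Assumed fallback: never hand downstream agents an empty message set.
    return kept if kept else list(outputs)
```

Bounding the number of rectification rounds keeps the added test-time cost predictable, which matters because the loop runs on every intercepted message.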
cs.DB [Back]
[112] Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA cs.DB | cs.CLPDF
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao
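TL;DR: This paper proposes Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) with a variant of reinforcement learning with verifiable rewards to generate high-quality data-preparation pipelines for table question answering in a single inference step, addressing the high latency and cost of existing multi-step methods.
Details
Motivation: Existing LLM-based table QA solutions typically generate pipelines in an operator-centric, multi-step manner. They reach state-of-the-art performance but rely on multiple LLM calls, leading to prohibitive latency and computational cost.
Result: Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, along with 79% table compression and a 2.2x reduction in monetary cost.
Insight: The innovations are: 1) a self-supervised rewarding mechanism that automatically obtains fine-grained pipeline-wise supervision signals; 2) variance-aware group resampling to mitigate training instability; 3) two complementary mechanisms for robust pipeline generation, namely operation merge (filtering spurious operations through multi-candidate consensus) and adaptive rollback (runtime protection against information loss in data transformation). Viewed objectively, it compresses multi-step assembly into single-step generation and, by pairing reinforcement learning with lightweight models, balances efficiency and performance.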
Abstract: Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation pipelines in a multi-step manner offering state-of-the-art performance. However, these solutions rely on multiple LLM calls, resulting in prohibitive latencies and computational costs. We propose Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step. To train such an LLM, we first introduce a self-supervised rewarding mechanism to automatically obtain fine-grained pipeline-wise supervision signals for LLM training. We also propose variance-aware group resampling to mitigate training instability. To further enhance robustness of pipeline generation, we develop two complementary mechanisms: operation merge, which filters spurious operations through multi-candidate consensus, and adaptive rollback, which offers runtime protection against information loss in data transformation. Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2.2$\times$ reduction in monetary cost.
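Sketch: "operation merge", which the abstract describes as filtering spurious operations through multi-candidate consensus, could be approximated by a support-threshold vote over sampled pipelines. The abstract gives no algorithmic detail, so everything here is an assumption: the string encoding of operations, the `min_support` threshold, and the first-seen ordering of the merged result are hypothetical simplifications.

```python
from collections import Counter


def merge_operations(candidates, min_support=0.5):
    """Keep operations that appear in at least `min_support` of the candidates."""
    # Count each operation once per candidate pipeline, not once per occurrence.
    counts = Counter(op for pipeline in candidates for op in set(pipeline))
    threshold = min_support * len(candidates)
    merged = []
    for pipeline in candidates:
        for op in pipeline:
            if counts[op] >= threshold and op not in merged:
                merged.append(op)  # preserve first-seen order (a simplification)
    return merged
```

For instance, if three sampled pipelines all select the same columns, two of them filter by year, and only one adds a sort, a threshold of 0.5 keeps the selection and the filter and drops the sort as likely spurious.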
cs.LG [Back]
[113] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences cs.LG | cs.AI | cs.CL | stat.MLPDF
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu
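TL;DR: This paper proposes Duel-Evolve, an evolutionary optimization algorithm that needs no external reward model, for optimizing LLM outputs at test time. It uses pairwise preference comparisons between candidates, elicited from the LLM itself, aggregates these noisy comparisons with a Bayesian Bradley-Terry model to estimate candidate quality, and uses Double Thompson Sampling to allocate the comparison budget and select high-quality parents for generating improved candidates.
Details
Motivation: Existing methods usually rely on a calibrated scalar evaluator for the target task to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision.
Result: On MathBench, Duel-Evolve achieves 20 percentage points higher accuracy than existing methods and baselines; on LiveCodeBench, it improves over comparable iterative methods by more than 12 percentage points, reaching a new state of the art.
Insight: The core innovation is replacing external scalar reward models entirely with the LLM's own pairwise self-preferences, so test-time optimization needs no reward model, no ground-truth labels during search, and no hand-crafted scoring function. The method shows that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces, and its Bayesian modeling and Thompson sampling strategy offer a reusable approach to noisy comparisons and budget allocation.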
Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.
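Sketch: the Bradley-Terry model at the center of the method assigns each candidate a positive strength s_i and sets P(i beats j) = s_i / (s_i + s_j). The code below fits those strengths from a win-count matrix with the classic minorization-maximization update. Note the swap: this is a smoothed maximum-likelihood stand-in, not the Bayesian Bradley-Terry model or the Double Thompson Sampling allocation the paper uses, and the `prior` pseudo-count and function names are assumptions for illustration.

```python
def fit_bradley_terry(wins, iters=200, prior=0.5):
    """Estimate strengths from wins[i][j] = number of times i beat j."""
    n = len(wins)
    # Add a small pseudo-count to every pairing so no strength collapses to zero.
    w = [[(wins[i][j] + prior) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    s = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total_wins = sum(w[i])
            # Each pairing contributes (comparisons between i and j) / (s_i + s_j).
            denom = sum((w[i][j] + w[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            new.append(total_wins / denom)
        norm = sum(new)
        s = [x * n / norm for x in new]  # fix the scale: strengths average to 1
    return s


def win_probability(s, i, j):
    """Bradley-Terry probability that candidate i is preferred over candidate j."""
    return s[i] / (s[i] + s[j])
```

A point estimate like this gives a ranking but no uncertainty; the Bayesian version in the paper is what makes it possible to direct further comparisons toward candidates whose quality is still unclear.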
[114] RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format cs.LG | cs.CL
Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He
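TL;DR: This paper proposes RAIN-Merging, a gradient-free method that integrates the capabilities of an instruction-tuned model (ITM) into a large reasoning model (LRM), substantially improving the LRM's ability to follow instructions (such as output format and constraints) while preserving its original chain-of-thought reasoning format and performance.
Details
Motivation: Large reasoning models excel at complex reasoning but fall short in faithfully following instructions about output format, constraints, or specific requirements. This paper aims to close that gap by integrating an instruction-tuned model.
Result: Across four instruction-following benchmarks and nine reasoning and general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. The gains are consistent across model scales and architectures and carry over to improved performance in agent settings.
Insight: The contributions are: (1) a task-vector analysis showing that the principal subspaces of the LRM and ITM are nearly orthogonal across key modules in parameter space, which provides a theoretical basis for lightweight merging; and (2) the RAIN-Merging method, which protects the reasoning mechanism by projecting the ITM task vector onto the null space of the LRM's forward features at thinking special tokens, and uses instruction attention for module-specific scaling that amplifies instruction-relevant components and suppresses leakage. Together these address the fragility of naive merging caused by the output format mismatch (chain of thought vs. answers only).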
Abstract: Large reasoning models (LRMs) excel at a long chain of reasoning but often fail to faithfully follow instructions regarding output format, constraints, or specific requirements. We investigate whether this gap can be closed by integrating an instruction-tuned model (ITM) into an LRM. Analyzing their differences in parameter space, namely task vectors, we find that their principal subspaces are nearly orthogonal across key modules, suggesting a lightweight merging with minimal interference. However, we also demonstrate that naive merges are fragile because they overlook the output format mismatch between LRMs (with explicit thinking and response segments) and ITMs (answers-only). We introduce RAIN-Merging (Reasoning-Aware Instruction-attention guided Null-space projection Merging), a gradient-free method that integrates instruction following while preserving thinking format and reasoning performance. First, with a small reasoning calibration set, we project the ITM task vector onto the null space of forward features at thinking special tokens, which preserves the LRM’s structured reasoning mechanisms. Second, using a small instruction calibration set, we estimate instruction attention to derive module-specific scaling that amplifies instruction-relevant components and suppresses leakage. Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. The gains are consistent across model scales and architectures, translating to improved performance in agent settings.
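The null-space projection step can be illustrated for a single linear module. The sketch below assumes the module computes `y = W x`, that `features` holds the module inputs collected at the calibration tokens, and that the task vector is a weight difference `delta_w`; the function names and tolerance are hypothetical, and the instruction-attention scaling is not shown.

```python
import numpy as np


def null_space_projector(features, rel_tol=1e-10):
    """Return P such that P @ x is (numerically) zero for every row x of features.

    features: array of shape (n_tokens, d_in).
    """
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    rank = int(np.sum(s > rel_tol * s.max()))
    basis = vt[:rank].T  # (d_in, rank), orthonormal basis of the feature span
    return np.eye(features.shape[1]) - basis @ basis.T


def project_task_vector(delta_w, features):
    """Project a (d_out, d_in) weight update so it leaves the calibration
    features' outputs unchanged: (W + delta_w @ P) @ x == W @ x."""
    return delta_w @ null_space_projector(features)
```

With fewer calibration features than input dimensions, the projected update stays non-trivial but has no effect on the module's output for those features, which is the property the abstract relies on to preserve the thinking format.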
[115] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation cs.LG | cs.AI | cs.CL
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian
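TL;DR: This paper proposes a two-stage framework to address the overthinking that large reasoning models exhibit on low-complexity queries. The framework establishes a good initialization through hybrid fine-tuning, then performs adaptive reinforcement learning that combines correctness-preserving advantage shaping with length-aware gradient regulation to stabilize optimization and improve efficiency.
Details
Motivation: Existing methods for mitigating overthinking in large reasoning models are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors.
Result: Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over multiple strong baselines, with accuracy gains of up to 3.7/3.6 percentage points and reductions in generated tokens of 40.6%/43.9%. Robustness and generalization are confirmed across problem difficulties and on out-of-distribution tasks.
Insight: The contribution is to use hybrid fine-tuning as initialization and combine it with correctness-preserving advantage shaping, which avoids suppressing correct long-chain reasoning, and length-aware gradient regulation, which handles severe reasoning-length heterogeneity, yielding a stable, adaptive thinking process.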
Abstract: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.
[116] ContextRL: Enhancing MLLM’s Knowledge Discovery Efficiency with Context-Augmented RL cs.LG | cs.AI | cs.CL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu
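TL;DR: This paper proposes ContextRL, a framework that uses context-augmented reinforcement learning to improve the knowledge discovery efficiency of multimodal large language models. Specifically, it gives the reward model full reference solutions as context to enhance identifiability, enabling fine-grained process verification that filters out false positives, and it introduces a multi-turn sampling strategy in which the reward model generates mistake reports for failed attempts, guiding the policy to recover correct responses from all-negative groups.
Details
Motivation: The work targets the knowledge discovery efficiency bottlenecks of multimodal large language models in reinforcement learning for visual reasoning, in particular insufficient identifiability and poor reachability, so that the model can be guided more efficiently toward high-quality reasoning processes.
Result: On 11 perception and reasoning benchmarks, ContextRL significantly improves knowledge discovery efficiency. With ContextRL, the Qwen3-VL-8B model achieves performance comparable to the 32B model, outperforms standard RLVR baselines by a large margin, and effectively mitigates reward hacking.
Insight: The contribution is to use contextual information to improve reward model accuracy and to improve reinforcement learning training through process verification and mistake reports. Viewed objectively, the analysis shows significant potential for contextual information to improve reward model precision and reveals how widespread reward hacking is, offering useful insights for future RLVR research.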
Abstract: We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to “recover” correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
[117] Moral Preferences of LLMs Under Directed Contextual Influence cs.LG | cs.AI | cs.CL | cs.CV | cs.CY
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie
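TL;DR: This paper studies how directed contextual cues (such as user requests and social-norm cues) affect the moral decisions of large language models (LLMs) in trolley-problem-style moral dilemmas. The authors introduce an evaluation harness that systematically measures a model's directional response by applying matched, direction-flipped contextual influences. They find that contextual cues can significantly shift decisions even when only superficially relevant; that baseline preferences do not predict steerability; that influences can backfire; and that reasoning reduces average sensitivity but amplifies the effect of biased examples.
Details
Motivation: Existing moral benchmarks for LLMs typically use context-free prompts and assume stable preferences, but prompts in real deployments often contain contextual signals that may steer decisions. It is therefore necessary to study how such directed contextual influences reshape moral decisions.
Result: In trolley-problem-style moral triage settings, systematic measurement shows that contextual influences often significantly shift decisions; that baseline-neutral models can exhibit systematic steerability asymmetry; that influences can backfire, shifting choices in the opposite direction; and that reasoning reduces average sensitivity but amplifies the effect of biased few-shot examples.
Insight: The contribution is a pilot evaluation harness for directed contextual influence that systematically measures model behavior through direction-flipped contextual manipulations. Viewed objectively, the method reveals that LLM moral preferences are highly sensitive to context and steerable in unpredictable ways, underscoring the importance of including controlled contextual manipulations in moral evaluations.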
Abstract: Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
[118] NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion cs.LG | cs.AI | cs.CL
Hung-Hsuan Chen
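TL;DR: The paper proposes NoRA (Non-linear Rank Adaptation), a weight-level parallel adapter that introduces SiLU gating and structural dropout to induce manifold expansion, in order to break the linear ceiling that Low-Rank Adaptation (LoRA) faces on complex reasoning tasks.
Details
Motivation: LoRA dominates parameter-efficient fine-tuning, but it faces a critical linear ceiling on complex reasoning tasks: because of its intrinsic linear constraints, simply increasing the rank yields diminishing returns.
Result: On the SlimOrca benchmark, NoRA at rank 64 (perplexity 3.89) outperforms LoRA at rank 512 (perplexity 3.90). On the mathematical reasoning task MathInstruct, NoRA reaches a perplexity of 1.97, significantly surpassing LoRA's saturation point of 2.07.
Insight: The contribution is to induce manifold expansion through a non-linear activation (SiLU gating) and structural dropout, which activates the tail of the singular value spectrum, effectively prevents the rank collapse observed in linear methods, and achieves higher spectral efficiency.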
Abstract: Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical “linear ceiling” in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce NoRA (Non-linear Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, NoRA breaks this linear barrier: remarkably, NoRA at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where NoRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA’s saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that NoRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
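The contrast between a linear low-rank branch and a gated one can be sketched as follows. The abstract does not give NoRA's exact formulation, so the placement of the SiLU between the down- and up-projections is an assumption, the function names are hypothetical, and structural dropout is omitted.

```python
import numpy as np


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def lora_forward(x, w, a, b):
    """Standard LoRA: frozen weight w plus a rank-r linear update b @ a.

    x: (batch, d_in), w: (d_out, d_in), a: (r, d_in), b: (d_out, r).
    """
    return x @ w.T + (x @ a.T) @ b.T


def gated_adapter_forward(x, w, a, b):
    """Assumed non-linear variant: SiLU applied inside the low-rank branch,
    so the branch is no longer equivalent to a single rank-r matrix."""
    return x @ w.T + silu(x @ a.T) @ b.T
```

Because the LoRA branch is linear in `x`, it can always be folded into one matrix of rank at most `r`; the gated branch cannot, which is one way to read the "linear ceiling" argument.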
[119] Entropy-Controlled Flow Matching cs.LG | cs.CV
Chika Maduabuchi
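TL;DR: This paper proposes Entropy-Controlled Flow Matching (ECFM), a variational principle that optimizes over continuity-equation paths under a global entropy-rate constraint, aiming to fix the transient depletion of semantic modes caused by low-entropy bottlenecks in standard flow matching. The method forms a convex optimization problem in Wasserstein space, admits a stochastic-control representation, and recovers entropic optimal transport geodesics in the pure transport regime.
Details
Motivation: Standard flow-matching objectives do not directly control the information geometry of the trajectory and can produce low-entropy bottlenecks in which semantic modes are temporarily depleted during transport. A framework that actively controls the entropy rate is therefore needed to ensure mode coverage and stability.
Result: ECFM recovers entropic optimal transport geodesics in the pure transport regime and Gamma-converges to classical optimal transport as the entropy-rate constraint tends to zero. The theoretical analysis provides mode-coverage and density-floor guarantees with Lipschitz stability, and constructs near-optimal collapse counterexamples for unconstrained flow matching.
Insight: The contribution is to introduce an entropy-rate constraint into the variational framework of flow matching, yielding a convex optimization problem that is equivalent to a Schrödinger bridge problem with an explicit entropy multiplier, and thereby guaranteeing control of the information geometry and mode coverage in theory. Viewed objectively, constraining the entropy dynamics improves the robustness and semantic fidelity of the generative process.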
Abstract: Modern vision generators transport a base distribution to data through time-indexed measures, implemented as deterministic flows (ODEs) or stochastic diffusions (SDEs). Despite strong empirical performance, standard flow-matching objectives do not directly control the information geometry of the trajectory, allowing low-entropy bottlenecks that can transiently deplete semantic modes. We propose Entropy-Controlled Flow Matching (ECFM): a constrained variational principle over continuity-equation paths enforcing a global entropy-rate budget $\frac{d}{dt} H(\mu_t) \geq -\lambda$. ECFM is a convex optimization in Wasserstein space with a KKT/Pontryagin system, and admits a stochastic-control representation equivalent to a Schrödinger bridge with an explicit entropy multiplier. In the pure transport regime, ECFM recovers entropic OT geodesics and Gamma-converges to classical OT as $\lambda \to 0$. We further obtain certificate-style mode-coverage and density-floor guarantees with Lipschitz stability, and construct near-optimal collapse counterexamples for unconstrained flow matching.
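One way to write down a constrained problem of the kind the abstract describes is given below. Only the continuity-equation constraint and the entropy-rate budget come from the abstract; the kinetic-energy objective (in the style of the Benamou-Brenier formulation of optimal transport) and the endpoint names $\mu_{\mathrm{base}}$ and $\mu_{\mathrm{data}}$ are assumptions made for illustration, and the paper's actual objective may differ.

```latex
\begin{aligned}
\min_{(\mu_t,\, v_t)_{t \in [0,1]}} \quad
  & \int_0^1 \!\! \int \tfrac{1}{2}\, \lVert v_t(x) \rVert^2 \, d\mu_t(x)\, dt
  && \text{(assumed kinetic-energy objective)} \\
\text{s.t.} \quad
  & \partial_t \mu_t + \nabla \cdot (\mu_t v_t) = 0,
  && \text{(continuity equation)} \\
  & \mu_0 = \mu_{\mathrm{base}}, \qquad \mu_1 = \mu_{\mathrm{data}},
  && \text{(assumed endpoint conditions)} \\
  & \frac{d}{dt} H(\mu_t) \geq -\lambda \quad \text{for all } t \in [0,1].
  && \text{(entropy-rate budget)}
\end{aligned}
```

Here $H(\mu_t)$ denotes the entropy of the intermediate measure, so the last constraint bounds how quickly entropy may fall along the path, which is what rules out the low-entropy bottlenecks mentioned in the abstract.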
[120] MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion cs.LG | cs.CV
Syed Omer Shah, Mohammed Maqsood Ahmed, Danish Mohiuddin Mohammed, Shahnawaz Alam, Mohd Vahaj ur Rahman
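TL;DR: MolFM-Lite is a multi-modal molecular property prediction model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D), and integrates experimental context through cross-attention fusion and Feature-wise Linear Modulation (FiLM). Its main methodological contributions are a conformer ensemble attention mechanism and a cross-modal fusion layer. On four MoleculeNet benchmarks, tri-modal fusion yields a 7-11% AUC improvement over single-modality baselines.
Details
Motivation: Most existing molecular property prediction models rely on a single molecular representation (a sequence, a graph, or a 3D structure) and treat molecular geometry as static, ignoring multi-modal information and the dynamic distribution of molecular conformations. Methods are needed that jointly exploit several representations and capture the thermodynamic distribution of molecular shapes.
Result: On four MoleculeNet scaffold-split benchmarks, evaluated with the model's own data splits, tri-modal fusion yields a 7-11% AUC improvement over single-modality baselines and conformer ensembles add approximately 2% AUC over single-conformer variants. Ablation studies confirm the independent contribution of each architectural component.
Insight: The contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors to capture the thermodynamic distribution of molecular shapes; (2) a cross-modal fusion layer in which each modality can attend to the others, enabling complementary information sharing; and (3) pre-training on ZINC250K with cross-modal contrastive and masked-atom objectives, which gives effective weight initialization at modest compute cost.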
Abstract: Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model’s own splits, and report all baselines re-evaluated under the same protocol. Comprehensive ablation studies across all four datasets confirm that each architectural component contributes independently, with tri-modal fusion providing 7-11% AUC improvement over single-modality baselines and conformer ensembles adding approximately 2% over single-conformer variants. Pre-training on ZINC250K (~250K molecules) using cross-modal contrastive and masked-atom objectives enables effective weight initialization at modest compute cost. We release all code, trained models, and data splits to support reproducibility.
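A small sketch can show how a Boltzmann prior over conformer energies might be combined with learned attention scores. Adding the log-prior to the learned logits before the softmax is an assumption about how the two are combined, energies are assumed to be in kcal/mol, and the function names are hypothetical.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872  # kcal/(mol*K)


def boltzmann_weights(energies, temperature=298.15):
    """Boltzmann populations for conformer energies given in kcal/mol."""
    kt = GAS_CONSTANT_KCAL * temperature
    e_min = min(energies)  # shift by the minimum for numerical stability
    unnorm = [math.exp(-(e - e_min) / kt) for e in energies]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def ensemble_attention(learned_logits, energies, temperature=298.15):
    """Softmax over learned logits plus the (unnormalized) log Boltzmann prior."""
    kt = GAS_CONSTANT_KCAL * temperature
    e_min = min(energies)
    scores = [l - (e - e_min) / kt for l, e in zip(learned_logits, energies)]
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)
    return [e / z for e in exp_scores]
```

With all learned logits equal, the attention reduces to the plain Boltzmann populations; non-zero logits let the model shift weight toward higher-energy conformers when they are more informative.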
[121] Space Syntax-guided Post-training for Residential Floor Plan Generation cs.LG | cs.CV
Zhuoyang Jiang, Dongqing Zhang
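TL;DR: This paper proposes Space Syntax-guided Post-training (SSPT) for residential floor plan generation. It uses a non-differentiable oracle to explicitly inject space syntax knowledge (such as the configurational dominance and connectivity of public spaces) into the generation process, compensating for what pre-trained generative models miss in key architectural priors. The method comes in two strategies, iterative retraining with space-syntax filtering and PPO-based reinforcement learning, and is evaluated on the SSPT-Bench benchmark, where it improves public-space dominance and the clarity of the functional hierarchy.
Details
Motivation: Existing pre-trained generative models for residential floor plans overfit large-scale data distributions and neglect key architectural priors such as the dominance and connectivity of public spaces. The work aims to integrate architectural theory into data-driven generation.
Result: On SSPT-Bench (Eval-8), an out-of-distribution benchmark, both SSPT strategies improve public-space dominance and restore a clearer functional hierarchy compared to the baseline models. The PPO strategy performs better in compute efficiency and variance reduction and achieves the stronger gains.
Insight: The contribution is to inject space syntax knowledge explicitly into the generation process through a non-differentiable oracle, providing a scalable post-training paradigm. Viewed objectively, the method is compatible with different generative backbones and offers a general path for integrating domain priors into data-driven models.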
Abstract: Pre-trained generative models for residential floor plans are typically optimized to fit large-scale data distributions, which can under-emphasize critical architectural priors such as the configurational dominance and connectivity of domestic public spaces (e.g., living rooms and foyers). This paper proposes Space Syntax-guided Post-training (SSPT), a post-training paradigm that explicitly injects space syntax knowledge into floor plan generation via a non-differentiable oracle. The oracle converts RPLAN-style layouts into rectangle-space graphs through greedy maximal-rectangle decomposition and door-mediated adjacency construction, and then computes integration-based measurements to quantify public space dominance and functional hierarchy. To enable consistent evaluation and diagnosis, we further introduce SSPT-Bench (Eval-8), an out-of-distribution benchmark that post-trains models using conditions capped at $\leq 7$ rooms while evaluating on 8-room programs, together with a unified metric suite for dominance, stability, and profile alignment. SSPT is instantiated with two strategies: (i) iterative retraining via space-syntax filtering and diffusion fine-tuning, and (ii) reinforcement learning via PPO with space-syntax rewards. Experiments show that both strategies improve public-space dominance and restore clearer functional hierarchy compared to distribution-fitted baselines, while PPO achieves stronger gains with substantially higher compute efficiency and reduced variance. SSPT provides a scalable pathway for integrating architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle.
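An integration-style measurement on a room adjacency graph can be sketched as below. It uses the classic space syntax relative asymmetry, RA = 2(MD - 1)/(k - 2), with integration taken as 1/RA; whether the paper's oracle uses this exact normalization is not stated in the abstract, and the toy layout and function names are hypothetical. The graph is assumed to be connected with more than two rooms.

```python
from collections import deque


def depths_from(graph, source):
    """Breadth-first topological depth from source to every reachable room."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


def integration(graph):
    """Integration (1 / relative asymmetry) for each room of a connected graph."""
    k = len(graph)
    scores = {}
    for node in graph:
        depth = depths_from(graph, node)
        mean_depth = sum(depth.values()) / (k - 1)
        ra = 2.0 * (mean_depth - 1.0) / (k - 2)
        # RA is 0 when a room is directly adjacent to every other room.
        scores[node] = float("inf") if ra <= 0 else 1.0 / ra
    return scores


# Toy door-adjacency graph (illustrative layout only).
rooms = {
    "living": ["foyer", "kitchen", "bed1", "bed2"],
    "foyer": ["living"],
    "kitchen": ["living"],
    "bed1": ["living"],
    "bed2": ["living", "bath"],
    "bath": ["bed2"],
}
room_scores = integration(rooms)
```

In this toy layout the living room is the shallowest space and therefore the most integrated, which is the kind of public-space dominance such a reward or filter could favor.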
[122] $φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models cs.LG | cs.CV
Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu
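TL;DR: This paper proposes a new framework, Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO), to address fairness in continual learning for large multimodal models. Built on Direct Preference Optimization, the framework introduces a new loss function that mitigates both catastrophic forgetting and the bias caused by imbalanced data distributions, and it achieves state-of-the-art performance on multiple benchmarks.
Details
Motivation: Large multimodal models face fairness challenges in continual learning that stem from imbalanced data distributions. Existing methods focus mainly on catastrophic forgetting and overlook the bias caused by data imbalance; this paper aims to address both problems together.
Result: Extensive experiments on multiple continual learning benchmarks show that $φ$-DPO achieves state-of-the-art performance, outperforming prior continual learning methods for large multimodal models.
Insight: The contribution is to bring Direct Preference Optimization into continual learning to mitigate forgetting and to design a new $φ$-DPO loss that explicitly handles distributional bias. Viewed objectively, through theoretical analysis and the construction of preference-annotated datasets, the method offers a systematic way to balance forgetting and fairness in continual learning.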
Abstract: Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused by imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new $φ$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $φ$-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed $φ$-DPO achieves State-of-the-Art performance across multiple benchmarks, outperforming prior continual learning methods of LMMs.
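For reference, the conventional DPO objective that the abstract builds on can be sketched for a single preference pair. This is the standard DPO loss only; the $φ$-DPO loss itself is not specified in the abstract and is not reproduced here. The argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference the loss equals log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, and rises in the opposite case.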
cs.GR
[123] DiffBMP: Differentiable Rendering with Bitmap Primitives cs.GR | cs.CV
Seongmin Hong, Junghun James Kim, Daehyeop Kim, Insoo Chung, Se Young Chun
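TL;DR: DiffBMP is a scalable and efficient differentiable rendering engine for collections of bitmap images. Using a highly parallelized rendering pipeline and a custom CUDA implementation, it optimizes attributes such as the position, rotation, scale, color, and opacity of thousands of bitmap primitives, and completes the optimization in under 1 minute on a consumer GPU.
Details
Motivation: Traditional differentiable renderers are largely limited to vector graphics, while most real-world images are bitmaps. DiffBMP aims to remove this limitation by providing differentiable rendering for bitmap images.
Result: The paper validates DiffBMP's optimization with techniques including soft rasterization via Gaussian blur, structure-aware initialization, a noisy canvas, and specialized losses/heuristics for videos or spatially constrained images; for example, it can quickly optimize thousands of bitmap primitives on a consumer GPU.
Insight: The contributions include extending differentiable rendering to bitmaps, a highly parallelized custom CUDA implementation for efficient gradient computation, and several techniques that ease optimization (such as soft rasterization and structure-aware initialization). The work also emphasizes practicality: it supports export to a layered file format and ships as an easy-to-use Python package.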
Abstract: We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.
cs.HC
[124] TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation cs.HC | cs.AI | cs.CL
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
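TL;DR: The paper proposes TherapyProbe, a design probe methodology that systematically explores the conversation trajectories of mental health chatbots through adversarial multi-agent simulation to generate actionable design knowledge about relational safety. The method identifies safety failures in interaction patterns such as “validation spirals” and “empathy fatigue”, and turns them into a “Safety Pattern Library” of 23 failure archetypes with corresponding design recommendations.
Details
Motivation: Current safety evaluations of mental health chatbots focus on single-turn crisis responses and ignore the quality of interaction patterns that unfold across conversations (relational safety), so they cannot assess whether a chatbot helps or harms users over time. The paper addresses how to design for relational safety.
Result: Using open-source models for adversarial simulation, the paper identifies concrete interaction failure patterns and builds a “Safety Pattern Library” of 23 failure archetypes. Its contributions are a replicable methodology that requires no API costs, a clinically grounded failure taxonomy, and design implications for developers, clinicians, and policymakers.
Insight: The contribution is to apply adversarial simulation to evaluating the long-term interaction safety of chatbots and to systematically turn the identified failure patterns into a concrete design knowledge base. This provides a new methodology and practical tools for evaluating and designing dialogue systems that prioritize relational safety.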
Abstract: As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, that is, the quality of interaction patterns that unfold across conversations, rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures: interaction patterns like “validation spirals”, where chatbots progressively reinforce hopelessness, or “empathy fatigue”, where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically-grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.