cs.CL [Total: 19]
cs.CV [Total: 45]
cs.AI [Total: 4]
cs.RO [Total: 1]
cs.LG [Total: 4]
eess.IV [Total: 1]
cs.HC [Total: 1]
eess.AS [Total: 1]

cs.CL

[1] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data cs.CL | cs.AI

Shashi Kant Gupta, Arijeet Pramanik, Jerrin John Thomas, Regina Schwind, Lauren Wiener

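TL;DR: This paper proposes HARMON-E, a hierarchical agentic reasoning framework for extracting structured data from unstructured oncology notes in electronic health records. The framework uses large language models as reasoning agents with context-sensitive retrieval and iterative synthesis to systematically decompose complex oncology data extraction, and it achieves high-accuracy extraction on a large-scale real-world dataset of over 400,000 clinical notes and scanned PDF reports.

Details

Motivation: Unstructured oncology notes in electronic health records contain rich clinical information, but reliably extracting structured data from them is difficult because of high variability, specialized terminology, and inconsistent document formats. Existing automated approaches are usually limited to narrow scenarios and do not adequately handle patient-level information synthesis, while manual abstraction is costly and does not scale.

Result: Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports covering 2,250 cancer patients, the method achieves an average F1-score of 0.93. Of 103 oncology-specific clinical variables, 100 exceed 0.85, and critical variables (e.g., biomarkers and medications) exceed 0.95. After integration into a data curation workflow, the direct manual approval rate reached 0.94, significantly reducing annotation costs.

Insight: The contribution is a hierarchical, modular agentic framework that uses LLMs as reasoning agents, combined with context retrieval and iterative synthesis, to perform end-to-end oncology data extraction at scale. According to the authors, this is the first exhaustive application of LLM-based agents to structured oncology data extraction, and it offers a new way to handle contradictory information and multi-document synthesis in clinical documentation.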

Abstract: Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios (using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables such as staging, biomarkers, or histology) and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in a 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale.

[2] Counterfactual LLM-based Framework for Measuring Rhetorical Style cs.CL | cs.CY

Jingyi Qiu, Hong Chen, Zongyi Li

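TL;DR: This paper proposes a framework based on counterfactual reasoning and large language models (LLMs) for quantifying rhetorical style in machine learning papers and separating it from substantive content. Multiple LLM rhetorical personas generate counterfactual writings from the same content, an LLM judge compares them pairwise, and a Bradley-Terry model aggregates the outcomes. Applied to a sample of 8,485 ICLR submissions from 2017 to 2025, the method generates more than 250,000 counterfactual writings and provides a large-scale quantification of rhetorical style in ML papers. The study finds that visionary framing significantly predicts downstream attention (such as citations and media attention), and that rhetorical strength rose sharply after 2023, driven mainly by LLM-based writing assistance. The framework's reliability is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations.

Details

Motivation: The rise of AI has intensified concerns about "hype" in machine learning papers, but a reliable way to quantify rhetorical style independently of substantive content has been elusive. Because bold language can stem either from strong empirical results or from rhetorical style alone, the two are often hard to tell apart.

Result: Applied to a sample of 8,485 ICLR submissions, the method shows that visionary framing significantly predicts downstream attention (such as citations and media attention), even after controlling for peer-review evaluations. Rhetorical strength rises sharply after 2023, and empirical evidence indicates that this is driven mainly by the adoption of LLM-based writing assistance. The framework's reliability is validated by its robustness to the choice of personas and by the high correlation between LLM judgments and human annotations (the specific value is not given in the abstract).

Insight: The contribution is a counterfactual, LLM-driven framework that decouples rhetorical style from content, enabling large-scale quantification of rhetorical style in scientific papers. The method uses LLMs as measurement instruments, which offers a new perspective on scientific evaluation and exposes the potential influence of AI writing tools on academic writing style.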

Abstract: The rise of AI has fueled growing concerns about "hype" in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley–Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary framing significantly predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide empirical evidence showing that this increase is largely driven by the adoption of LLM-based writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments to measure and improve scientific evaluation.
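
The aggregation step named in this abstract, a Bradley–Terry model over pairwise outcomes, can be illustrated with a minimal sketch. This is not the authors' code: the persona names, the toy comparison data, and the choice of the classic minorization-maximization (MM) update are assumptions made for illustration, and the sketch assumes every item wins at least one comparison.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update p_i <- wins_i / sum_j n_ij / (p_i + p_j).
    Assumes winner != loser and that every item wins at least once.
    """
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(n_iters):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}  # the opponent in this pair
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so the mean strength is 1 (strengths are only defined up to scale).
        total = sum(updated.values())
        updated = {i: s * len(items) / total for i, s in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical judge outcomes between three made-up rhetorical personas.
games = (
    [("visionary", "neutral")] * 3 + [("neutral", "visionary")]
    + [("neutral", "cautious")] * 3 + [("cautious", "neutral")]
    + [("visionary", "cautious")] * 3 + [("cautious", "visionary")]
)
scores = bradley_terry(games)
for persona in sorted(scores, key=scores.get, reverse=True):
    print(persona, round(scores[persona], 3))
```

Under this model the probability that item i beats item j is p_i / (p_i + p_j), so the fitted strengths give a single scale on which all compared writings can be ranked.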

Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan

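TL;DR: The paper proposes PRISM (Personality-Refracted Intelligent Simulation Model), a hybrid model that couples stochastic differential equations (SDE) with a personality-conditional partially observable Markov decision process (PC-POMDP) to simulate personality-driven multi-agent interaction on social media and better understand online polarization.

Details

Motivation: Traditional agent-based models (ABMs) of opinion dynamics assume homogeneity and therefore cannot capture the psychological heterogeneity that drives online polarization, which hinders a mechanistic understanding of how ideological divides are amplified.

Result: PRISM outperforms standard homogeneous and Big Five baselines on personality consistency, aligns with human ground truth, and effectively replicates emergent phenomena such as rational suppression and affective resonance.

Insight: The contribution lies in combining MBTI personality types with multimodal large language model (MLLM) agents, initialized through data-driven priors, and in using a hybrid SDE and PC-POMDP framework to model continuous emotional evolution and discrete decision-making separately, which allows a finer-grained simulation of social media ecosystems.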

Abstract: Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.

[4] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems cs.CL | cs.HC

Heet Bodara, Md Masum Mushfiq, Isma Farah Siddiqui

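TL;DR: Using synthetic dialogue datasets and tone classification models, this paper empirically studies tone bias in large language models used in conversational systems. It finds that even dialogues generated from neutral prompts show a systematic tonal skew, and it trains classifiers with weak supervision to detect these patterns effectively.

Details

Motivation: To address implicit tone bias in large language models used in conversational systems. Such biases (for example, sounding overly polite, cheerful, or cautious) can influence how users perceive trust, empathy, and fairness.

Result: Ensemble models achieve macro F1 scores of up to 0.92 on the tone classification task, indicating that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.

Insight: The contribution lies in combining controllable LLM dialogue synthesis with tone classification models to enable robust and ethical emotion recognition. The study finds that even neutral prompts produce a tonal skew, which points to the model's underlying conversational style as a source of bias.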

Abstract: Large Language Models are increasingly used in conversational systems such as digital personal assistants, shaping how people interact with technology through language. While their responses often sound fluent and natural, they can also carry subtle tone biases such as sounding overly polite, cheerful, or cautious even when neutrality is expected. These tendencies can influence how users perceive trust, empathy, and fairness in dialogue. In this study, we explore tone bias as a hidden behavioral trait of large language models. The novelty of this research lies in the integration of controllable large language model based dialogue synthesis with tone classification models, enabling robust and ethical emotion recognition in personal assistant interactions. We created two synthetic dialogue datasets, one generated from neutral prompts and another explicitly guided to produce positive or negative tones. Surprisingly, even the neutral set showed consistent tonal skew, suggesting that bias may stem from the model’s underlying conversational style. Using weak supervision through a pretrained DistilBERT model, we labeled tones and trained several classifiers to detect these patterns. Ensemble models achieved macro F1 scores up to 0.92, showing that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.

[5] Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models cs.CL | cs.AI | cs.LG

Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou

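TL;DR: This paper proposes ThinkARM, a framework based on Schoenfeld's Episode Theory that abstracts language model reasoning traces into functional steps (such as analysis, exploration, implementation, and verification) in order to analyze thinking dynamics and structural differences in mathematical problem solving.

Details

Motivation: The cognitive structure and steps of large language model reasoning traces are hard to identify and analyze. The paper aims to go beyond surface-level statistics and understand the reasoning process in depth.

Result: Applying ThinkARM to mathematical problem solving by diverse models reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models. Case studies show that exploration steps are associated with correctness and that efficiency-oriented methods selectively suppress evaluative feedback steps.

Insight: The contribution is an intermediate-scale Episode Theory lens that makes reasoning steps explicit and supports systematic analysis of how reasoning is structured, stabilized, and altered in modern language models, which helps diagnose model behavior.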

Abstract: Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld’s Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.

[6] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents cs.CL

Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang

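TL;DR: The paper proposes Memory-T1, a framework that uses reinforcement learning to learn a time-aware memory selection policy for multi-session dialogue agents, addressing the difficulty of temporal reasoning over long dialogue histories. It follows a coarse-to-fine strategy: temporal and relevance filters first prune the dialogue history, then a reinforcement learning agent selects the precise evidence sessions, with a multi-level reward function optimizing answer accuracy, evidence grounding, and temporal consistency.

Details

Motivation: Existing long-context models struggle to identify temporally relevant information in long, multi-session dialogue histories because noise accumulates, which significantly degrades reasoning performance.

Result: On the Time-Dialog benchmark, Memory-T1 raises a 7B model to an overall score of 67.0%, setting a new state of the art for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0% performance gain, and the framework remains robust up to 128k tokens.

Insight: The contribution is the use of reinforcement learning to learn a time-aware memory selection policy, together with a multi-level reward function that combines answer accuracy, evidence grounding, and temporal consistency. In particular, the session-level and utterance-level temporal consistency rewards provide a dense signal that resolves subtle chronological ambiguities. The coarse-to-fine strategy and the design for robustness against noise in long dialogues are also of general interest.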

Abstract: Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
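
The coarse stage described in this abstract, pruning the dialogue history with temporal and relevance filters before any learned selection, might look roughly like the sketch below. It is only an illustration under stated assumptions: the `Session` structure, the `prune_sessions` helper, and the word-overlap relevance score are hypothetical stand-ins, and the RL agent that performs the fine-grained selection is not modeled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Session:
    """Hypothetical container for one dialogue session."""
    session_id: int
    day: date
    text: str


def prune_sessions(sessions, query, start, end, top_k=2):
    """Coarse stage: temporal filter, then a simple relevance ranking.

    Word overlap is an assumed stand-in for whatever relevance filter
    the actual system uses.
    """
    query_words = set(query.lower().split())
    # Temporal filter: keep sessions inside the query's time scope.
    in_scope = [s for s in sessions if start <= s.day <= end]

    def overlap(session):
        return len(query_words & set(session.text.lower().split()))

    # Relevance filter: keep the top_k sessions by word overlap.
    return sorted(in_scope, key=overlap, reverse=True)[:top_k]


history = [
    Session(1, date(2024, 1, 5), "we booked the flight to Paris"),
    Session(2, date(2024, 2, 10), "the flight was delayed two hours"),
    Session(3, date(2024, 2, 12), "dinner plans for friday"),
    Session(4, date(2024, 3, 1), "another flight next month"),
]
candidates = prune_sessions(
    history,
    query="when was the flight delayed",
    start=date(2024, 2, 1),
    end=date(2024, 2, 29),
    top_k=1,
)
print([s.session_id for s in candidates])  # prints [2]
```

In the framework described above, a candidate set like this would then be passed to the learned policy, which picks the evidence sessions actually used to answer.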

[7] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language cs.CL | cs.AI | cs.LG

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

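TL;DR: This paper proposes the ABBEL framework to address the high computational cost that LLM agents incur in long sequential decision-making tasks when the full interaction history grows too long. The framework compresses context by maintaining a belief state expressed in natural language (a summary of what has been discovered about task-relevant unknowns) and improves performance with reinforcement learning post-training.

Details

Motivation: In long sequential decision-making tasks, keeping the full interaction history in context becomes computationally too expensive, so a method is needed that compresses context while preserving effective decision-making.

Result: A systematic evaluation of frontier models across six diverse multi-step environments finds that ABBEL produces interpretable beliefs and keeps memory use nearly constant, but initially underperforms the full-context setting because of errors in belief updating. After reinforcement learning training (including belief grading and belief length penalties), performance exceeds the full-context setting while using less memory than contemporaneous approaches.

Insight: The contribution is a natural-language belief bottleneck that compresses the interaction history, combined with reinforcement learning that optimizes belief generation and action selection, improving decision performance while retaining interpretability and low memory use.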

Abstract: As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL’s performance beyond the full context setting, while using less memory than contemporaneous approaches.
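
The update-then-act loop described in this abstract can be sketched on a toy number-guessing task. Everything here is an illustrative assumption: `update_belief` and `select_action` are hypothetical rule-based stand-ins for what would be LLM calls, and the belief string format is invented for the example.

```python
def parse_range(belief: str):
    """Read the 'lo..hi' range out of a belief such as 'secret in 1..100'."""
    lo, hi = belief.split()[-1].split("..")
    return int(lo), int(hi)


def update_belief(prior: str, observation: str) -> str:
    """Stand-in for an LLM call: fold the latest observation into the prior."""
    lo, hi = parse_range(prior)
    guess, feedback = observation.split(":")
    if feedback == "higher":
        lo = max(lo, int(guess) + 1)
    else:  # "lower"
        hi = min(hi, int(guess) - 1)
    return f"secret in {lo}..{hi}"


def select_action(posterior: str) -> int:
    """Stand-in for an LLM call: act using only the posterior belief."""
    lo, hi = parse_range(posterior)
    return (lo + hi) // 2


def run_episode(secret, lo=1, hi=100, max_steps=20):
    belief = f"secret in {lo}..{hi}"
    observation = None
    belief_sizes = []  # context carried between steps, in characters
    for step in range(1, max_steps + 1):
        if observation is not None:
            belief = update_belief(belief, observation)  # prior -> posterior
        belief_sizes.append(len(belief))
        guess = select_action(belief)  # no access to the full history
        if guess == secret:
            return step, belief_sizes
        observation = f"{guess}:{'higher' if guess < secret else 'lower'}"
    return None, belief_sizes


steps, sizes = run_episode(secret=37)
print(steps, sizes)  # prints 3 [16, 15, 16]
```

The belief stays a short string no matter how many steps have passed, which is the near-constant memory property the abstract refers to. The sketch also makes the failure mode visible: a single wrong update would corrupt every later action, since the history needed to recover is no longer available.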

[8] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation cs.CL | cs.AI

Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon

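TL;DR: This paper proposes M$^3$KG-RAG, a multi-hop multimodal knowledge graph-enhanced retrieval-augmented generation framework. It targets three problems of multimodal RAG in the audio-visual domain: limited modality coverage, insufficient multi-hop connectivity, and off-topic or redundant knowledge returned by similarity-based retrieval.

Details

Motivation: Existing audio-visual multimodal knowledge graphs (MMKGs) have incomplete modality coverage and limited multi-hop connectivity, and retrieval based on similarity in a shared embedding space cannot effectively filter out off-topic or redundant knowledge. This limits the reasoning depth and answer faithfulness of multimodal large language models (MLLMs).

Result: Extensive experiments on multiple multimodal benchmarks show that M$^3$KG-RAG significantly improves MLLM performance on multimodal reasoning and grounding, outperforming existing approaches.

Insight: The contributions are: 1) a lightweight multi-agent pipeline that constructs a multi-hop multimodal knowledge graph (M$^3$KG) and supports modality-wise retrieval based on the query; 2) the GRASP mechanism, which keeps retrieved knowledge precise and concise through entity grounding, answer-supporting relevance evaluation, and pruning of redundant context.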

Abstract: Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs’ multimodal reasoning and grounding over existing approaches.

[9] Multi-hop Reasoning via Early Knowledge Alignment cs.CL

Yuxin Wang, Shicheng Fang, Bo Wang, Qi Luo, Xuanjing Huang

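TL;DR: This paper proposes Early Knowledge Alignment (EKA), a module that strengthens the ability of iterative retrieval-augmented generation (RAG) systems to handle complex multi-hop questions. EKA aligns the large language model with relevant retrieved knowledge before planning, which gives reasoning a firmer foundation and significantly improves retrieval precision, reduces cascading errors, and raises overall performance and efficiency.

Details

Motivation: Existing iterative RAG systems usually plan their question decomposition without using information about the available retrieval corpus, which leads to inefficient retrieval, cascading errors along the reasoning chain, and ultimately suboptimal performance. The paper aims to solve this problem.

Result: Extensive experiments on six standard RAG datasets show that EKA significantly improves retrieval precision and performance while also improving efficiency. The method proves to be an effective, training-free, and scalable inference strategy, shows robustness in generalization tests across different datasets and retrieval corpora, and advances the state of the art in iterative RAG systems.

Insight: The core idea is to introduce early knowledge alignment before reasoning and planning in iterative RAG, giving the model contextually relevant retrieved knowledge as a prior so that it explores less unnecessarily and focuses more effectively on relevant information subsets. An analysis from an entropy perspective sheds light on the interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks.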

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with the retrieval set before planning in iterative RAG systems by providing contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrates that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released on GitHub: https://github.com/yxzwang/EarlyKnowledgeAlignment

[10] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models cs.CL | cs.AI | cs.CV | cs.IR | cs.LG

Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li

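TL;DR: The paper proposes RetroPrompt, a retrieval-augmented prompt learning method that addresses the over-reliance of pre-trained foundation models (PFMs) on memorization and rote learning during prompt learning. It introduces a publicly accessible knowledge base generated from the training data and integrates a retrieval mechanism into the input, training, and inference stages, so that the model can actively retrieve relevant contextual information from the corpus and generalize better in zero-shot and few-shot scenarios.

Details

Motivation: Conventional prompt learning still follows a parametric learning paradigm, which can compromise the stability of generalization, make it hard to fully exploit atypical instances, and lead to overfitting to shallow patterns when data is limited. The paper aims to balance memorization and generalization by decoupling knowledge from mere memorization.

Result: Comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks show that RetroPrompt performs better in both zero-shot and few-shot scenarios. An analysis of memorization patterns shows that RetroPrompt effectively reduces reliance on rote memorization and thereby improves generalization.

Insight: The contribution is the systematic integration of retrieval into the whole prompt learning pipeline. Retrieving from an external knowledge base to enrich contextual cues is a non-parametric form of knowledge augmentation that helps the model use the information in its training data more robustly and reduces the risk of overfitting.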

Abstract: Pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the "pre-train, prompt, and predict" paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.

[11] Fun-Audio-Chat Technical Report cs.CL | cs.AI | cs.SD | eess.AS

Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu

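TL;DR: Fun-Audio-Chat is a large audio language model designed to address the diluted semantic information, high computational cost, and catastrophic forgetting found in existing joint speech-text models. Through Dual-Resolution Speech Representations and Core-Cocktail Training, it retains text LLM knowledge while gaining strong audio understanding, reasoning, and generation capabilities. The model achieves competitive performance on multiple tasks, and the 8B version is open-sourced.

Details

Motivation: Existing joint speech-text models face a temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz), which dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge.

Result: Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on speech-to-text and speech-to-speech tasks and rank near the top among similar-scale models on spoken QA benchmarks. They also reach competitive to superior performance on audio understanding, speech function calling, instruction following, and voice empathy.

Insight: The main contributions are: 1) Dual-Resolution Speech Representations, in which the shared LLM processes audio at an efficient 5Hz and the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency and quality; 2) Core-Cocktail Training, a two-stage fine-tuning method with intermediate merging that mitigates catastrophic forgetting; 3) Multi-Task DPO Training, which improves robustness, audio understanding, instruction following, and voice empathy. The model does not require large-scale audio-text pre-training and instead relies on pre-trained models and extensive post-training.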

Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.
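
The rate arithmetic behind the dual-resolution idea (25Hz speech tokens grouped down to 5Hz for the shared LLM) can be shown with a small sketch. The abstract only says that token grouping is used, so the mean pooling in `group_frames`, the function name itself, and the toy 2-dimensional frame vectors are assumptions for illustration rather than the model's actual grouping operation.

```python
def group_frames(frames, group_size=5):
    """Merge each run of `group_size` consecutive frame vectors into one.

    Mean pooling is an assumed placeholder; the source does not state
    how grouped tokens are combined.
    """
    grouped = []
    for start in range(0, len(frames), group_size):
        chunk = frames[start:start + group_size]
        dim = len(chunk[0])
        grouped.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return grouped


# Two seconds of toy 25Hz frames: 50 vectors of dimension 2.
frames_25hz = [[float(i), 1.0] for i in range(50)]
frames_5hz = group_frames(frames_25hz, group_size=5)

print(len(frames_25hz), len(frames_5hz))  # prints 50 10
print(frames_5hz[0])                      # prints [2.0, 1.0]
```

Grouping by 5 cuts the sequence the LLM must process to one fifth of its original length, which brings the speech rate much closer to the roughly 3Hz text rate mentioned in the abstract.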

[12] FaithLens: Detecting and Explaining Faithfulness Hallucination cs.CL | cs.AI

Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen

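TL;DR: This paper proposes FaithLens, an efficient model for detecting faithfulness hallucination in large language model outputs. It provides not only binary predictions but also corresponding explanations to improve trustworthiness. The approach synthesizes training data with explanations, applies a data filtering strategy, and fine-tunes with rule-based reinforcement learning.

Details

Motivation: Large language model outputs can contain faithfulness hallucinations in real-world applications such as retrieval-augmented generation and summarization. The paper aims to build a model that both detects and explains such hallucinations in order to make model outputs more trustworthy.

Result: Experiments on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3 and produces high-quality explanations, achieving a distinctive balance of trustworthiness, efficiency, and effectiveness.

Insight: The contribution is a detection framework that jointly predicts and explains. By synthesizing high-quality training data and optimizing with rule-based reinforcement learning, it outperforms large closed-source models at a much smaller parameter count and offers an explainable approach to trustworthy AI.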

Abstract: Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.

[13] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers cs.CL | cs.AI | cs.MM

Wenzheng Zeng, Mingyu Ouyang, Langyuan Cui, Hwee Tou Ng

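TL;DR: This paper proposes SlideTailor, a framework that generates presentation slides for scientific papers personalized to user preferences. User preferences are encoded implicitly through a user-provided paper-slides example pair and a visual template, and a chain-of-speech mechanism aligns slide content with the planned oral narration, producing editable, personalized slides.

Details

Motivation: Existing automatic slide generation methods do not account for individual user preferences, so their results often fail to match user needs. The paper aims to remove this limitation and achieve user-aligned, personalized slide generation.

Result: Extensive experiments on a newly constructed benchmark dataset that covers diverse user preferences, evaluated with carefully designed interpretable metrics, demonstrate the effectiveness of the proposed framework.

Insight: The contributions are: 1) a human behavior-inspired agentic framework that extracts and generalizes user preferences from implicit inputs (a paper-slides example pair and a visual template) rather than detailed textual descriptions; 2) a chain-of-speech mechanism that aligns slide content with the planned oral narration, improving generation quality and supporting downstream applications such as video presentations.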

Abstract: Automatic presentation slide generation can greatly streamline content creation. However, since preferences of each user may vary, existing under-specified formulations often lead to suboptimal results that fail to align with individual user needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences. We propose a human behavior-inspired agentic framework, SlideTailor, that progressively generates editable slides in a user-aligned manner. Instead of requiring users to write their preferences in detailed textual form, our system only asks for a paper-slides example pair and a visual template - natural and easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism to align slide content with planned oral narration. Such a design significantly enhances the quality of generated slides and enables downstream applications like video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, with carefully designed interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.

[14] AprielGuard cs.CL

Jaykumar Kasundra, Anjaneya Praharaj, Sourabh Surana, Lakshmi Sirisha Chodisetty, Sourav Sharma

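TL;DR: This paper proposes AprielGuard, an 8B-parameter safeguard model for large language models that handles safety risks (such as toxicity and bias) and adversarial threats (such as prompt injections and jailbreaks) in a unified way. It is trained on a mix of open and synthetic data and performs well in multi-turn conversation and agentic workflow scenarios.

Details

Motivation: Existing moderation tools usually treat safety risks and adversarial threats as separate problems, which limits their robustness and generalizability. A unified learning framework is therefore needed to improve LLM safety in conversational and agentic settings.

Result: Across multiple public and proprietary benchmarks, AprielGuard performs strongly at detecting harmful content and adversarial manipulations, outperforming open-source guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning-intensive scenarios.

Insight: The contributions include unifying safety risks and adversarial threats within a single taxonomy and learning framework, using structured reasoning traces to improve interpretability, and training on diverse data to improve generalization in complex scenarios.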

Abstract: Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g., toxicity, bias) and adversarial threats (e.g., prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing open-source guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning-intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.

[15] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision cs.CL | cs.SD | eess.AS

Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize

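TL;DR: This paper proposes SpidR, a self-supervised speech representation model that combines a masked prediction objective with self-distillation and online clustering to efficiently learn representations with highly accessible phonetic information from raw waveforms, making it particularly suited for textless spoken language modeling. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on language modeling benchmarks (sWUGGY, sBLIMP, tSC) while greatly reducing pretraining time, requiring only one day on 16 GPUs instead of a week.

Details

Motivation: To learn semantic representations directly from speech in support of textless spoken language modeling, avoiding reliance on textual intermediates.

Result: SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on the downstream language modeling benchmarks sWUGGY, sBLIMP, and tSC. Compared with HuBERT, pretraining time drops from a week to one day on 16 GPUs, a significant efficiency gain.

Insight: The contributions include a training objective that combines self-distillation and online clustering, which stabilizes the clustering procedure and improves codebook quality; a systematic evaluation of the correlation between speech unit quality (ABX, PNMI) and language modeling performance, which validates these metrics as reliable proxies; and an efficient pretraining method and codebase that enable fast iteration in unsupervised speech representation learning.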

Abstract: The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher’s intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.

[16] Can LLMs Solve My Grandma’s Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles cs.CL

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Khushnur Binte Jahangir, Swakkhar Shatabda, Sarah Masud Preum

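TL;DR: This paper introduces BanglaRiddleEval, a new benchmark for reasoning over traditional Bangla riddles, comprising 1,244 riddles across four tasks. It evaluates a range of open-source and closed-source large language models under different prompting strategies and finds moderate semantic overlap but low correctness on generative QA. Multiple-choice accuracy peaks at only about 56%, well below the 83% human baseline, which shows that current LLMs remain far from human-level performance on low-resource figurative reasoning.

Details

Motivation: Although large language models perform impressively on many NLP benchmarks, their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. This study addresses that gap for Bangla.

Result: Models achieve moderate semantic overlap but low correctness on generative QA; multiple-choice (MCQ) accuracy peaks at about 56% against an 83% human baseline; ambiguity-resolution accuracy ranges from roughly 26% to 68%. These results indicate a substantial gap between current models and human performance.

Insight: The contribution is a figurative-reasoning benchmark for a low-resource language (Bangla) and culture-specific content (traditional riddles), built with an LLM-based pipeline that generates Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations. It provides a new tool for evaluating LLMs on complex, culturally grounded reasoning.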

Abstract: Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://github.com/Labib1610/BanglaRiddleEval.
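
The figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: `mcq_accuracy` is a hypothetical helper, and the toy predictions are invented, not data from the paper. The riddle count, task count, and the two accuracy figures are taken from the abstract.

```python
def mcq_accuracy(predictions, gold):
    """Fraction of multiple-choice answers that match the gold option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

NUM_RIDDLES = 1244   # from the abstract
NUM_TASKS = 4        # from the abstract
total_artifacts = NUM_RIDDLES * NUM_TASKS   # riddle-task artifacts

# Hypothetical toy predictions, not data from the paper.
toy_accuracy = mcq_accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])

# Gap in percentage points between the reported human baseline (83%)
# and the best reported model MCQ accuracy (about 56%).
human_gap = 83 - 56
```

The product matches the 4,976 artifacts reported in the abstract, and the best model trails the human baseline by roughly 27 percentage points.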

[17] Step-DeepResearch Technical Report cs.CL

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen

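TL;DR: This paper presents Step-DeepResearch, a cost-effective, end-to-end agent for open-ended deep research tasks. It proposes a data synthesis strategy based on atomic capabilities to strengthen planning and report writing, and adopts a progressive training path from agentic mid-training to SFT and RL, supported by a checklist-style judger that improves robustness. To fill the evaluation gap in the Chinese domain, the authors also establish the ADR-Bench benchmark. The model performs strongly across several evaluations, which the authors take as evidence that refined training lets medium-sized models reach expert-level capability at industry-leading cost-efficiency.

Details

Motivation: Existing academic benchmarks such as BrowseComp often fail to meet the real-world demands of open-ended deep research, which requires robust intent recognition, long-horizon decision-making, and cross-source verification. To address this, and the lack of Chinese-domain evaluation, the paper aims to build a robust and efficient deep research agent.

Result: Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On the authors' own ADR-Bench, it significantly outperforms comparable models and rivals closed-source SOTA systems such as OpenAI and Gemini DeepResearch.

Insight: The contributions are: 1) a data synthesis strategy based on atomic capabilities that strengthens the agent's planning and report generation; 2) a progressive training path from agentic mid-training to SFT and RL; 3) a checklist-style judger that improves robustness; 4) ADR-Bench, a Chinese deep research benchmark oriented to realistic scenarios. Viewed objectively, the combination of systematic capability decomposition (atomic capabilities), progressive training, and a targeted benchmark offers a reusable framework for building efficient, cost-effective domain-specific agents.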

Abstract: As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.

[18] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits cs.CL

Amirhosein Ghasemabadi, Di Niu

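TL;DR: This paper proposes Gnosis, a lightweight self-awareness mechanism that lets frozen large language models perform intrinsic self-verification by decoding signals from hidden states and attention patterns. It predicts errors in the model's own generations without external supervision and at very low compute cost.

Details

Motivation: Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches rely on external judges, multi-sample consistency, or text-based self-critique, which either add compute or correlate weakly with true correctness. The paper asks whether LLMs can predict their own failures by inspecting internal states during inference.

Result: Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration, while adding only about 5M parameters and negligible inference cost.

Insight: The contribution is a lightweight self-awareness mechanism that passively observes internal traces (hidden states and attention patterns) and compresses them into fixed-budget descriptors, enabling efficient self-verification without external supervision. Viewed objectively, the method shows that reliable correctness cues exist within the generation process, and it generalizes zero-shot to partial generations, which supports early failure detection and compute-aware control.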

Abstract: Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.
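
One way to picture a "fixed-budget descriptor" is pooling a variable-length trace of hidden states into a fixed number of slots and feeding the result to a small probe. The sketch below assumes average pooling and a logistic head; both choices, the function names, and the weights are hypothetical and are not taken from the paper, which does not describe its compression scheme in the abstract.

```python
import math

def fixed_budget_descriptor(hidden_states, budget=4):
    """Average-pool a variable-length list of hidden vectors into `budget` slots.

    The output has budget * dim entries regardless of sequence length.
    """
    n = len(hidden_states)
    if n == 0:
        raise ValueError("need at least one hidden state")
    dim = len(hidden_states[0])
    descriptor = []
    for b in range(budget):
        start = (b * n) // budget
        end = max(((b + 1) * n) // budget, start + 1)  # at least one token per slot
        chunk = hidden_states[start:end]
        for d in range(dim):
            descriptor.append(sum(vec[d] for vec in chunk) / len(chunk))
    return descriptor

def predict_correct(descriptor, weights, bias=0.0):
    """Hypothetical logistic probe mapping a descriptor to P(answer is correct)."""
    z = bias + sum(w * x for w, x in zip(weights, descriptor))
    return 1.0 / (1.0 + math.exp(-z))

short_trace = [[0.0], [2.0], [4.0], [6.0]]           # 4 tokens, 1-dim states
long_trace = [[float(i)] for i in range(100)]        # 100 tokens, 1-dim states
short_desc = fixed_budget_descriptor(short_trace, budget=2)
long_desc = fixed_budget_descriptor(long_trace, budget=2)
prob = predict_correct(short_desc, weights=[0.5, -0.1])
```

Both traces map to descriptors of the same length, so the probe's size and cost do not grow with the number of generated tokens.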

[19] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs cs.CL | cs.AI | cs.CV

Dhruv Anand, Ehsan Shareghi

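TL;DR: This paper introduces Cube Bench, a Rubik’s-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: reconstructing cube faces from images and text, choosing the optimal next move, predicting the outcome of a candidate move without applying it, executing multi-step plans while recovering from mistakes, and detecting and revising one’s own errors. Across seven MLLMs, accuracy drops sharply with scramble depth, models rarely recover once a trajectory stalls or diverges, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. Closed-source models outperform open-source ones, but even the best model degrades at higher complexity. Simple self-correction yields modest gains but can also introduce overthinking.

Details

Motivation: Existing evaluations cover the spatial and sequential reasoning of MLLMs poorly, particularly for tasks such as the Rubik’s cube that require complex spatial understanding and multi-step planning. The paper provides a compact, reproducible benchmark for this purpose.

Result: Seven MLLMs were tested on Cube Bench. Accuracy drops sharply as scramble depth increases; the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; the best MLLM still degrades at higher cube complexity; and simple self-correction via reflective thinking yields modest gains.

Insight: The contribution is a Rubik’s-cube benchmark dedicated to spatial and sequential reasoning that decomposes performance into five measurable skills and exposes the limits of MLLMs on complex multi-step tasks, such as poor recovery and a disconnect between perception and action. Viewed objectively, the benchmark offers a new tool for systematically evaluating MLLM reasoning and highlights the performance gap between closed- and open-source models.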

Abstract: We introduce Cube Bench, a Rubik’s-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one’s own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.

cs.CV

[20] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility cs.CV | cs.AI | cs.CR | cs.LG

Md Nahid Hasan Shuvo, Moinul Hossain

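TL;DR: This paper proposes PHANTOM, a framework that uses anamorphic art to craft and deploy perspective-dependent physical adversarial examples against connected autonomous vehicles (CAVs). The attack exploits geometric distortions that look natural to humans but are misclassified with high confidence by state-of-the-art object detectors, and it transfers strongly across detector architectures (e.g., YOLOv5, SSD) in black-box settings. Comprehensive evaluation in the CARLA simulator shows an attack success rate above 90% under optimal conditions and 60-80% effectiveness in degraded environments, and the attack can trigger network-wide communication disruption over V2X.

Details

Motivation: CAVs rely on vision-based deep neural networks and low-latency V2X communication to navigate safely and efficiently, yet these systems remain vulnerable to physical adversarial attacks. The paper aims to expose critical vulnerabilities in the perception and communication layers of the CAV ecosystem.

Result: In CARLA, PHANTOM was evaluated across varying speeds, weather, and lighting. The success rate exceeds 90% under optimal conditions and stays at 60-80% in degraded environments, with the attack activating within 6-10 meters of the target. SUMO-OMNeT++ co-simulation shows that false emergency messages propagate through V2X links, increasing Peak Age of Information by 68-89% and degrading safety-critical communication.

Insight: The contribution is the use of the geometric principles of anamorphic art to generate perspective-dependent physical adversarial examples with strong black-box transferability, together with a demonstration of how such an attack extends from deceiving a single vehicle to disrupting communication across the CAV network, exposing weaknesses that span the perception and communication layers.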

Abstract: Connected autonomous vehicles (CAVs) rely on vision-based deep neural networks (DNNs) and low-latency Vehicle-to-Everything (V2X) communication to navigate safely and efficiently. Despite their advances, these systems remain vulnerable to physical adversarial attacks. In this paper, we introduce PHANTOM (PHysical ANamorphic Threats Obstructing connected vehicle Mobility), a novel framework for crafting and deploying perspective-dependent adversarial examples using anamorphic art. PHANTOM exploits geometric distortions that appear natural to humans but are misclassified with high confidence by state-of-the-art object detectors. Unlike conventional attacks, PHANTOM operates in black-box settings without model access and demonstrates strong transferability across four diverse detector architectures (YOLOv5, SSD, Faster R-CNN, and RetinaNet). Comprehensive evaluation in CARLA across varying speeds, weather conditions, and lighting scenarios shows that PHANTOM achieves over 90% attack success rate under optimal conditions and maintains 60-80% effectiveness even in degraded environments. The attack activates within 6-10 meters of the target, providing insufficient time for safe maneuvering. Beyond individual vehicle deception, PHANTOM triggers network-wide disruption in CAV systems: SUMO-OMNeT++ co-simulation demonstrates that false emergency messages propagate through V2X links, increasing Peak Age of Information by 68-89% and degrading safety-critical communication. These findings expose critical vulnerabilities in both perception and communication layers of CAV ecosystems.
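
The abstract reports its network-level effect as an increase in Peak Age of Information. As a reminder of what that metric measures, here is a minimal sketch using the standard definition: the age of the freshest received update, sampled just before the next update arrives. The timestamps are invented for illustration and are not measurements from the paper; in-order delivery is assumed.

```python
def peak_ages(updates):
    """Peak ages for a list of (generated_at, received_at) pairs in reception order.

    Just before update i is received, the freshest information still comes from
    update i-1, so the peak age is received_at[i] - generated_at[i-1].
    """
    peaks = []
    for (prev_generated, _), (_, received) in zip(updates, updates[1:]):
        peaks.append(received - prev_generated)
    return peaks

def mean_peak_aoi(updates):
    """Average peak age over all updates that have a predecessor."""
    peaks = peak_ages(updates)
    if not peaks:
        raise ValueError("need at least two updates")
    return sum(peaks) / len(peaks)

# Hypothetical timestamps in milliseconds: one update per second,
# delivered after 100 ms (baseline) or 900 ms (congested channel).
baseline = [(0, 100), (1000, 1100), (2000, 2100)]
congested = [(0, 900), (1000, 1900), (2000, 2900)]

increase_pct = 100 * (mean_peak_aoi(congested) - mean_peak_aoi(baseline)) / mean_peak_aoi(baseline)
```

In this toy setting the extra queueing delay raises the peak age from 1100 ms to 1900 ms, which illustrates how channel congestion from spurious messages makes safety information staler.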

[21] Generating the Past, Present and Future from a Motion-Blurred Image cs.CV | cs.GR

SaiKiran Tedla, Kelly Zhu, Trevor Canham, Felix Taubner, Michael S. Brown

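TL;DR: This paper presents a method that uses a pre-trained video diffusion model to recover, from a single motion-blurred image, a video of the scene dynamics at the moment of capture and to predict what might have occurred immediately before and after. It outperforms prior techniques at recovering complex scene dynamics, generalizes to in-the-wild images, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure.

Details

Motivation: Recovering complex scene dynamics (past, present, and future) from a single motion-blurred image is an inverse problem. Existing methods rely on handcrafted priors or specific network architectures, struggle to capture complex dynamics, and do not recover what happened before or after capture.

Result: On recovering video from motion-blurred images, the method outperforms previous methods and generalizes to challenging in-the-wild images.

Insight: The core idea is to repurpose a video diffusion model pre-trained on internet-scale datasets as a strong prior for the highly ill-posed problem of motion deblurring and video prediction. This avoids handcrafted priors and produces realistic video sequences with complex dynamics and temporal information, extending what can be inferred from a single image.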

Abstract: We seek to answer the question: what can a motion-blurred image reveal about a scene’s past, present, and future? Although motion blur obscures image details and degrades visual quality, it also encodes information about scene and camera motion during an exposure. Previous techniques leverage this information to estimate a sharp image from an input blurry one, or to predict a sequence of video frames showing what might have occurred at the moment of image capture. However, they rely on handcrafted priors or network architectures to resolve ambiguities in this inverse problem, and do not incorporate image and video priors on large-scale datasets. As such, existing methods struggle to reproduce complex scene dynamics and do not attempt to recover what occurred before or after an image was taken. Here, we introduce a new technique that repurposes a pre-trained video diffusion model trained on internet-scale datasets to recover videos revealing complex scene dynamics during the moment of capture and what might have occurred immediately into the past or future. Our approach is robust and versatile; it outperforms previous methods for this task, generalizes to challenging in-the-wild images, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure. Code and data are available at https://blur2vid.github.io

[22] Learning to Refocus with Video Diffusion Models cs.CV

SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

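TL;DR: This paper proposes a method that uses video diffusion models for post-capture refocusing from a single defocused image. It generates a perceptually accurate focal stack as a video sequence, which enables interactive refocusing and a range of downstream applications.

Details

Motivation: Autofocus systems often fail to capture the intended subject, and users frequently want to adjust focus after capture.

Result: The method outperforms existing approaches in both perceptual quality and robustness across challenging scenarios.

Insight: The work applies video diffusion models to focal stack generation and releases a large-scale focal stack dataset captured under diverse real-world smartphone conditions, opening a path to more advanced focus editing in everyday photography.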

Abstract: Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io

[23] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction cs.CV

Jong Wook Kim, Wonseok Roh, Ha Dam Baek, Pilhyeon Lee, Jonghyun Choi

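TL;DR: This paper proposes HyGE-Occ, a framework for 3D panoptic occupancy prediction. It uses a hybrid view-transformation branch that fuses a continuous 3D-Gaussian-based depth representation with a discretized depth-bin formulation, combined with edge priors from BEV features, to improve geometric consistency and boundary awareness and thereby reconstruct dense 3D semantic and instance maps more accurately.

Details

Motivation: Existing 3D panoptic occupancy methods struggle to maintain precise geometry and to capture the precise spatial extent of 3D instances, which is critical for panoptic separation. As a result, geometric consistency and boundary recognition remain inadequate.

Result: Extensive experiments on the Occ3D-nuScenes dataset show that HyGE-Occ outperforms existing work, demonstrating superior 3D geometric reasoning.

Insight: The main contributions are the hybrid view-transformation branch (fusing continuous Gaussian depth with discretized depth bins) and the edge priors, which enhance geometric consistency and boundary awareness. Viewed objectively, combining continuous and discrete representations and using edge information as an auxiliary cue offers a new approach to geometric precision and instance separation in 3D scene understanding.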

Abstract: 3D Panoptic Occupancy Prediction aims to reconstruct a dense volumetric scene map by predicting the semantic class and instance identity of every occupied region in 3D space. Achieving such fine-grained 3D understanding requires precise geometric reasoning and spatially consistent scene representation across complex environments. However, existing approaches often struggle to maintain precise geometry and capture the precise spatial range of 3D instances critical for robust panoptic separation. To overcome these limitations, we introduce HyGE-Occ, a novel framework that leverages a hybrid view-transformation branch with 3D Gaussian and edge priors to enhance both geometric consistency and boundary awareness in 3D panoptic occupancy prediction. HyGE-Occ employs a hybrid view-transformation branch that fuses a continuous Gaussian-based depth representation with a discretized depth-bin formulation, producing BEV features with improved geometric consistency and structural coherence. In parallel, we extract edge maps from BEV features and use them as auxiliary information to learn edge cues. In our extensive experiments on the Occ3D-nuScenes dataset, HyGE-Occ outperforms existing work, demonstrating superior 3D geometric reasoning.
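
To make the idea of fusing a continuous Gaussian depth estimate with a discretized depth-bin distribution concrete, here is a minimal per-pixel sketch. It is an assumption-laden illustration, not the paper's fusion rule: the convex combination, the weight `alpha`, the bin edges, and the function names are all hypothetical.

```python
import math

def gaussian_to_bins(mu, sigma, bin_edges):
    """Integrate a Gaussian depth estimate N(mu, sigma^2) over each depth bin."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    mass = [cdf(hi) - cdf(lo) for lo, hi in zip(bin_edges, bin_edges[1:])]
    total = sum(mass)
    if total <= 0.0:
        raise ValueError("Gaussian has no mass inside the depth range")
    return [m / total for m in mass]  # renormalize over the covered range

def fuse_depth(gaussian_probs, bin_probs, alpha=0.5):
    """Assumed fusion rule: convex combination of the two per-bin distributions."""
    return [alpha * g + (1.0 - alpha) * b for g, b in zip(gaussian_probs, bin_probs)]

edges = [0.0, 10.0, 20.0, 30.0, 40.0]        # hypothetical depth bins in meters
gauss = gaussian_to_bins(mu=24.0, sigma=3.0, bin_edges=edges)
bins = [0.1, 0.5, 0.3, 0.1]                  # hypothetical categorical depth prediction
fused = fuse_depth(gauss, bins)
best_bin = fused.index(max(fused))
```

Here the categorical head prefers the 10-20 m bin, while the Gaussian places most of its mass at 20-30 m; the fused distribution follows the sharper Gaussian evidence while remaining a valid probability distribution.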

[24] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs cs.CV

Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu

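TL;DR: This paper proposes the Widget2Code task, which converts images of compact, context-free micro-interfaces (widgets) into executable code. The authors build an image-only widget benchmark and a baseline that jointly improves perceptual understanding and structured code generation, comprising design-principle-based component assembly, icon retrieval, visualization modules, and the end-to-end infrastructure WidgetFactory.

Details

Motivation: Prior UI2Code work focuses mainly on web pages and mobile screens. Widgets are compact, context-free micro-interfaces whose designs are proprietary and lack accessible markup, and code generated for them lacks visual fidelity, so dedicated methods and benchmarks are needed.

Result: On the proposed widget benchmark, generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods but still produce unreliable and visually inconsistent code. The proposed baseline, with component assembly, icon retrieval, and adaptive rendering, substantially improves visual fidelity and establishes a strong baseline for future research.

Insight: The contributions include formalizing the Widget2Code task and building an image-only benchmark; a baseline that jointly addresses perception and code generation by assembling components according to design principles; a framework-agnostic domain-specific language (WidgetDSL) with a compiler that targets multiple front ends; and an adaptive rendering module that refines spatial dimensions to satisfy compactness constraints.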

Abstract: User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.

[25] Vehicle-centric Perception via Multimodal Structured Pre-training cs.CV | cs.AI | cs.LG

Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo

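TL;DR: This paper proposes VehicleMAE-V2, a vehicle-centric pre-trained large model that uses multimodal structured priors about vehicles (symmetry, contour, and semantics) to guide masked token reconstruction, improving the model's ability to learn generalizable vehicle perception representations. It is pre-trained on the newly constructed large-scale dataset Autobot4M and shows superior performance on five downstream tasks.

Details

Motivation: Existing approaches do not learn vehicle-related knowledge effectively during pre-training, which limits their ability to model general vehicle perception representations. The paper addresses this problem to improve vehicle-centric perception.

Result: Extensive experiments on five downstream tasks show that VehicleMAE-V2 achieves superior performance.

Insight: The contributions are the Symmetry-guided Mask Module (SMM), the Contour-guided Representation Module (CRM), and the Semantics-guided Representation Module (SRM), which inject structured vehicle priors into masked reconstruction to reduce information redundancy, preserve holistic structure, and strengthen semantic understanding.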

Abstract: Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model’s capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.
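
The abstract says SMM avoids retaining symmetric patches when choosing which patches stay visible. A minimal sketch of that idea follows. It assumes the symmetry axis is the vertical center line of the patch grid, which is a simplification introduced here; the function name, keep ratio, and sampling procedure are hypothetical and not the paper's algorithm.

```python
import random

def symmetry_guided_mask(rows, cols, keep_ratio=0.25, seed=0):
    """Pick visible patches so that no kept patch has its left-right mirror kept too.

    Returns the set of (row, col) patches left visible; all others are masked.
    Assumes the vehicle's symmetry axis is the vertical center of the grid.
    """
    rng = random.Random(seed)
    num_keep = int(rows * cols * keep_ratio)
    candidates = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(candidates)
    visible = set()
    for r, c in candidates:
        if len(visible) == num_keep:
            break
        if (r, cols - 1 - c) in visible:
            continue  # mirror patch already visible: keeping both would be redundant
        visible.add((r, c))
    return visible

visible = symmetry_guided_mask(rows=4, cols=4, keep_ratio=0.25)
masked_count = 4 * 4 - len(visible)
```

Compared with uniform random masking, each visible patch here carries information that cannot be copied from its mirror image, which is the redundancy reduction the abstract attributes to SMM.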

[26] Block-Recurrent Dynamics in Vision Transformers cs.CV | cs.AI | cs.LG

Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba

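TL;DR: This paper proposes the Block-Recurrent Hypothesis (BRH): trained Vision Transformers (ViTs) admit a block-recurrent depth structure, so the computation of the original L blocks can be accurately rewritten using only k << L distinct blocks applied recurrently. By training the block-recurrent surrogate Raptor, the authors recover 96% of DINOv2's ImageNet-1k linear probe accuracy with only 2 blocks, and they use the hypothesis to develop a program of dynamical interpretability that points to a low-complexity normative solution along ViT depth.

Details

Motivation: There is no settled framework that interprets Transformer depth as a well-characterized flow. The paper uses BRH to understand the computational phenomenology of ViTs from a dynamical systems perspective.

Result: On DINOv2, a Raptor model recovers 96% of ImageNet-1k linear probe accuracy with only 2 blocks at equivalent computational cost, providing an empirical existence proof for BRH.

Insight: The contribution is the Block-Recurrent Hypothesis and the Raptor method, which reveal dynamics along ViT depth such as directional convergence into class-dependent angular basins and convergence to low-dimensional attractors, giving a principled framework for studying ViTs through dynamical systems analysis.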

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale experiments, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover 96% of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
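
The structural claim of the hypothesis, that L blocks can be rewritten as k distinct blocks applied recurrently, can be shown with a toy example. In the sketch below the "blocks" are scalar functions and the 12-block stack is built so that it falls into two phases by construction, which makes the rewrite exact. A real Raptor surrogate is a trained approximation of a pretrained ViT, so this only illustrates the control flow, and all names are hypothetical.

```python
def run_blocks(x, blocks):
    """Standard forward pass: apply L (possibly distinct) blocks in order."""
    for block in blocks:
        x = block(x)
    return x

def run_block_recurrent(x, distinct_blocks, repeats):
    """Block-recurrent forward pass: apply each of k blocks `repeats[i]` times."""
    for block, count in zip(distinct_blocks, repeats):
        for _ in range(count):
            x = block(x)
    return x

def phase_a(x):
    """Toy stand-in for the blocks of an early phase."""
    return 0.5 * x + 1.0

def phase_b(x):
    """Toy stand-in for the blocks of a late phase."""
    return x - 0.25

original = [phase_a] * 8 + [phase_b] * 4                          # L = 12 blocks, two phases
full = run_blocks(3.0, original)
compact = run_block_recurrent(3.0, [phase_a, phase_b], [8, 4])    # k = 2 distinct blocks
```

Both passes perform 12 block applications, matching the abstract's "equivalent computational cost"; what shrinks is the number of distinct parameter sets, from L to k.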

[27] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction cs.CV

Haoyi Zhong, Fang-Lue Zhang, Andrew Chalmers, Taehyun Rhee

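TL;DR: This paper proposes SE360, a framework for multi-condition guided object editing in 360-degree panoramas. At its core is a coarse-to-fine autonomous data generation pipeline that needs no manual intervention; it uses a vision-language model and adaptive projection adjustment for hierarchical analysis, ensuring holistic segmentation of objects and their physical context. On the resulting dataset, the authors train a Transformer-based diffusion model that edits objects in 360-degree panoramas guided by text, mask, or reference image. Experiments show that the method outperforms existing methods in both visual quality and semantic accuracy.

Details

Motivation: Instruction-based image editing is emerging, but extending it to 360-degree panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections and perspective views.

Result: Experiments show that the method outperforms existing methods in both visual quality and semantic accuracy.

Insight: The contribution is a coarse-to-fine autonomous data generation pipeline without manual intervention, which uses a vision-language model and adaptive projection adjustment for hierarchical analysis so that the generated data pairs are semantically meaningful and geometrically consistent. The paper also introduces a cost-effective two-stage data refinement strategy that improves data realism and mitigates model overfitting to erasure artifacts. The diffusion model trained on this dataset supports flexible editing under multiple conditions.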

Abstract: While instruction-based image editing is emerging, extending it to 360$^\circ$ panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360$^\circ$ panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360$^\circ$ panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.

[28] How Much 3D Do Video Foundation Models Encode? cs.CV | cs.AI

Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg

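TL;DR: This paper studies whether and how existing Video Foundation Models (VidFMs), pretrained on large amounts of video, encode information about the 3D world. The authors propose a model-agnostic framework that estimates multiple 3D properties from model features via shallow read-outs to quantify 3D awareness. They find that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, which can even surpass that of large expert models trained specifically for 3D tasks.

Details

Motivation: Videos are continuous 2D projections of 3D worlds. The paper asks whether global 3D understanding emerges naturally from training on large video data, and it quantifies the 3D awareness of existing VidFMs.

Result: State-of-the-art video generation models show strong 3D understanding along multiple axes despite not being trained on any 3D data, and this understanding can even surpass that of expert models trained specifically for 3D tasks.

Insight: The contribution is the first model-agnostic framework for measuring the 3D awareness of VidFMs, quantifying 3D properties through shallow read-outs. Viewed objectively, the results suggest that video data may itself carry rich 3D information that video generation models learn implicitly, which offers guidance for building scalable 3D models.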

Abstract: Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.

[29] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping cs.CV

Peng Gao, Ke Li, Di Wang, Yongshan Zhu, Yiming Zhang

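TL;DR: This paper proposes DDTM, a dual-branch weakly supervised framework for cross-resolution land cover mapping, where high-resolution semantic predictions must be learned from low-resolution labels despite a severe resolution mismatch. A diffusion-based branch refines local semantics, a transformer-based branch enforces global contextual consistency, and a pseudo-label confidence evaluation module reduces the effect of noisy supervision.

Details

Motivation: Existing weakly supervised methods struggle to align fine-grained spatial structures with coarse labels in cross-resolution land cover mapping, which leads to noisy supervision and degraded mapping accuracy.

Result: On the Chesapeake Bay benchmark, DDTM achieves 66.52% mIoU, substantially outperforming prior weakly supervised methods and setting a new state of the art (SOTA).

Insight: The contribution is a dual-branch architecture that explicitly decouples local semantic refinement from global contextual reasoning, together with a pseudo-label confidence evaluation module that selectively exploits reliable supervisory signals and thereby mitigates the noise caused by cross-resolution inconsistencies.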

Abstract: Cross-resolution land cover mapping aims to produce high-resolution semantic predictions from coarse or low-resolution supervision, yet the severe resolution mismatch makes effective learning highly challenging. Existing weakly supervised approaches often struggle to align fine-grained spatial structures with coarse labels, leading to noisy supervision and degraded mapping accuracy. To tackle this problem, we propose DDTM, a dual-branch weakly supervised framework that explicitly decouples local semantic refinement from global contextual reasoning. Specifically, DDTM introduces a diffusion-based branch to progressively refine fine-scale local semantics under coarse supervision, while a transformer-based branch enforces long-range contextual consistency across large spatial extents. In addition, we design a pseudo-label confidence evaluation module to mitigate noise induced by cross-resolution inconsistencies and to selectively exploit reliable supervisory signals. Extensive experiments demonstrate that DDTM establishes a new state-of-the-art on the Chesapeake Bay benchmark, achieving 66.52% mIoU and substantially outperforming prior weakly supervised methods. The code is available at https://github.com/gpgpgp123/DDTM.
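
The headline number above is a mean Intersection-over-Union (mIoU). For readers who want the metric spelled out, here is a minimal sketch over flat label lists. The labels are invented, and skipping classes that are absent from both prediction and ground truth is one common convention assumed here; the benchmark's exact evaluation protocol may differ.

```python
def mean_iou(pred, gold, num_classes):
    """Mean IoU over classes, from flat lists of per-pixel class indices.

    Classes absent from both prediction and ground truth are skipped
    (an assumed convention; evaluation protocols vary).
    """
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and g == c for p, g in zip(pred, gold))
        union = sum(p == c or g == c for p, g in zip(pred, gold))
        if union:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class present in pred or gold")
    return sum(ious) / len(ious)

# Hypothetical 6-pixel example with 3 land cover classes.
pred = [0, 0, 1, 1, 2, 2]
gold = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, gold, num_classes=3)
```

In this example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Because every class counts equally, mIoU penalizes errors on rare land cover classes more than plain pixel accuracy does.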

[30] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models cs.CV

Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le

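TL;DR: This paper proposes MIVA, a lightweight Modular Image-to-Video Adapter that addresses data scarcity and poor generalization when diffusion models are used for image animation. Each adapter learns a specific motion pattern from a few samples, and combining multiple MIVA modules at inference time gives precise motion control without complex prompt engineering.

Details

Motivation: Diffusion models achieve high photorealism in image and video generation, but their use for image animation remains limited. The high dimensionality of video leads to a scarcity of training data, so models tend to memorize rather than follow prompts when generating motion, and they struggle to generalize to motion patterns not present in the training set.

Result: Experiments show that MIVA, trained on about ten samples with a single consumer-grade GPU, enables more precise motion control while maintaining or even surpassing the generation quality of models trained on much larger datasets.

Insight: The contribution is the modular design: each MIVA sub-network captures a single motion pattern and scales via parallelization, trains efficiently on little data, and removes the dependence on large-scale data and prompt engineering, improving the flexibility and controllability of motion generation.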

Abstract: Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.

[31] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images cs.CV

Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Yuchen Xiao

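TL;DR: This paper presents LaSeRS, a large-scale dataset for language-guided segmentation of remote sensing images, and SegEarth-R2, a multimodal large language model architecture. The work addresses the weakness of existing models on complex geospatial scenarios (multiple granularities, multiple targets, reasoning requirements, and linguistic variation) through a new dataset and model improvements.

Details

Motivation: Existing language-guided segmentation models for remote sensing handle only simple single-target commands and perform poorly on complex geospatial scenarios such as multi-granularity segmentation, multi-target instructions, and implicit user intent. Existing datasets oversimplify, which leaves models with poor robustness in real-world use.

Result: Experiments show that SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a strong baseline for the next generation of geospatial segmentation.

Insight: The main contributions are: 1) LaSeRS, the first large-scale language-guided segmentation dataset covering four key dimensions: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability; 2) SegEarth-R2, whose key improvements are a spatial attention supervision mechanism for localizing small objects and an efficient segmentation query mechanism that handles both single-target and multi-target scenarios.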

Abstract: Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model’s effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth-insights/SegEarth-R2.

[32] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments cs.CV

Anthony Dontoh, Stephanie Ivey, Armstrong Aboah

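TL;DR: This study examines whether combining the driver-facing view with the road-facing view improves distracted-driving detection accuracy in naturalistic driving environments. Using real-world synchronized dual-camera data, it benchmarks three spatiotemporal action recognition architectures (SlowFast-R50, X3D-M, SlowOnly-R50) under single-view and dual-view input configurations.

Details

Motivation: Most existing computer-vision-based distracted-driving detection models rely solely on the driver-facing view and ignore key environmental context that influences driving behavior. This study investigates whether integrating the road-facing view improves detection accuracy.

Result: The benefit of contextual input depends strongly on the underlying architecture. The single-pathway SlowOnly model gained 9.8% with dual-view input, whereas the dual-pathway SlowFast model lost 7.2% accuracy due to representational conflicts.

Insight: The paper presents one of the first systematic comparisons of single-view and dual-view distraction detection models on naturalistic driving data. The core insight is that simply adding visual context is not sufficient and may even cause interference; future multimodal driver monitoring systems need architectures designed specifically to support multi-view fusion.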

Abstract: Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.

[33] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis cs.CV

Ziwei Qin, Xuhui Song, Deqing Huang, Na Qin, Jun Li

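TL;DR: This paper proposes MAPI-GNN, a new graph neural network for multimodal medical diagnosis. By learning a multifaceted graph structure from semantically disentangled feature subspaces, the model moves beyond the conventional reliance on a single static graph and can dynamically model patient-specific pathological relationships.

Details

Motivation: Existing graph neural networks for multimodal medical diagnosis typically rely on a single, static graph built from indiscriminate features, which limits their ability to model patient-specific pathological relationships.

Result: Extensive experiments on two diverse tasks comprising over 1300 patient samples show that MAPI-GNN significantly outperforms state-of-the-art methods.

Insight: The novelty is a multi-activation plane interaction framework: a multi-dimensional discriminator uncovers latent graph-aware patterns, these patterns guide the dynamic construction of a stack of activation graphs, and a relational fusion engine aggregates and contextualizes them, yielding dynamic, multifaceted modeling of patient-specific relationships.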

Abstract: Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.

Chang Sun, Dongliang Xie, Bo Qin, Hong Yang

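TL;DR: This paper proposes VALLR-Pin, a two-stage framework for Mandarin visual speech recognition. Extending the VALLR architecture, it first jointly predicts Chinese character sequences and their Pinyin through a shared video encoder and dual decoders, then uses a large language model together with the Pinyin output to disambiguate and refine candidate transcripts, improving Mandarin lip-reading performance.

Details

Motivation: Mandarin visual speech recognition is difficult because visemes are highly ambiguous and homophones are prevalent.

Result: The abstract reports no specific quantitative results, benchmarks, or state-of-the-art comparisons.

Insight: Contributions include: 1) a multi-task, dual-decoder design that combines visual features with Pinyin and linguistic context; 2) using an LLM with the Pinyin output to correct homophone errors; 3) fine-tuning the LLM on synthetic noisy data generated from intermediate checkpoints so that it adapts to the model's specific error patterns.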

Abstract: Visual Speech Recognition aims to transcribe spoken words from silent lip-motion videos. This task is particularly challenging for Mandarin, as visemes are highly ambiguous and homophones are prevalent. We propose VALLR-Pin, a novel two-stage framework that extends the recent VALLR architecture from English to Mandarin. First, a shared video encoder feeds into dual decoders, which jointly predict both Chinese character sequences and their standard Pinyin romanization. The multi-task learning of character and phonetic outputs fosters robust visual-semantic representations. During inference, the text decoder generates multiple candidate transcripts. We construct a prompt by concatenating the Pinyin output with these candidate Chinese sequences and feed it to a large language model to resolve ambiguities and refine the transcription. This provides the LLM with explicit phonetic context to correct homophone-induced errors. Finally, we fine-tune the LLM on synthetic noisy examples: we generate imperfect Pinyin-text pairs from intermediate VALLR-Pin checkpoints using the training data, creating instruction-response pairs for error correction. This endows the LLM with awareness of our model’s specific error patterns. In summary, VALLR-Pin synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance.
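The prompt-construction step described in the abstract (concatenating the Pinyin output with the candidate Chinese sequences) can be sketched as follows. This is a minimal illustration only: the function name `build_correction_prompt`, the prompt wording, and the example sentences are assumptions, not the authors' actual template.

```python
def build_correction_prompt(pinyin: str, candidates: list[str]) -> str:
    """Concatenate a Pinyin hypothesis with candidate transcripts.

    Hypothetical prompt layout; the paper's real template is not given
    in the abstract.
    """
    lines = ["Pinyin: " + pinyin, "Candidates:"]
    # Number the candidates so the LLM can refer to them unambiguously.
    for rank, candidate in enumerate(candidates, start=1):
        lines.append(f"{rank}. {candidate}")
    lines.append("Return the single transcript that best matches the Pinyin.")
    return "\n".join(lines)


# Illustrative inputs: three candidates that differ only in homophones.
prompt = build_correction_prompt(
    "ta zai xue xiao",
    ["他在学校", "她在学校", "他再学校"],
)
print(prompt)
```

The resulting string would then be sent to whichever LLM is used for refinement; that call is omitted here because the abstract does not name the model or its API.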

[35] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs cs.CV

Andreas Zinonos, Michał Stypułkowski, Antoni Bigata, Stavros Petridis, Maja Pantic

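TL;DR: FlashLips is a two-stage, mask-free lip-sync system that decouples lip control from rendering, running at over 100 FPS on a single GPU while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact one-step latent-space editor that reconstructs an image from a reference identity, a masked target frame, and a low-dimensional lip-pose vector, trained with reconstruction losses only, without GANs or diffusion. Stage 2 is an audio-to-pose Transformer trained with a flow-matching objective to predict lip-pose vectors from speech.

Details

Motivation: Existing lip-sync models depend on complex generative models (such as GANs or diffusion models), which makes inference slow, and they require explicit mask control. The goal is high-quality, real-time, mask-free lip synchronization.

Result: The system runs at over 100 FPS on a single GPU, with visual quality comparable to larger state-of-the-art models at higher speed.

Insight: Contributions include: 1) a two-stage decoupled design that separates lip control from rendering for efficiency; 2) a one-step latent-space editor trained with pure reconstruction losses, avoiding complex generative models; 3) fine-tuning on self-supervised pseudo ground truth so that edits are localized to the lips without explicit masks; 4) combining deterministic reconstruction with audio control into a simple, stable pipeline.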

Abstract: We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image, that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-poses vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.

Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc

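TL;DR: This paper proposes a multimodal retrieval-augmented image captioning method that generates context-rich captions for real-world images, including key details that are not visually discernible such as event background, temporal cues, outcomes, and named entities. The method retrieves semantically similar images, reranks them by geometric alignment, extracts contextual information from related articles, and finally uses a fine-tuned Qwen3 model to integrate this information with base captions into event-enriched, context-aware descriptions.

Details

Motivation: Real-world image captions often lack contextual depth and omit key details that are not visually discernible. Richer, more informative descriptions are needed for image understanding in domains such as journalism, education, and digital archives.

Result: Evaluation on the OpenEvents v1 dataset shows that the method generates significantly more informative captions than traditional methods, indicating strong potential for real-world applications that require deeper visual-textual understanding.

Insight: The novelty is a pipeline that combines multimodal retrieval (images and text) with geometric-alignment reranking to augment visual input with external textual knowledge, and integrates the context with a fine-tuned large language model (Qwen3 with QLoRA) to produce event-enriched, context-rich captions. Viewed objectively, the method extends conventional vision-only captioning into a context-aware task, using retrieval and fusion of external knowledge to compensate for the limits of purely visual models, which is a promising direction.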

Abstract: Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions than traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.

[37] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark cs.CV | cs.CL | cs.IR

Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng

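TL;DR: This paper introduces a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark to address the weakness of existing methods on text queries with fine-grained semantics. The benchmark contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated by large language models and manually verified. The authors evaluate the zero-shot and fine-tuned performance of mainstream contrastive vision-language models and OCR-free visual document understanding models, and explore a two-stage retrieval method that improves performance while remaining efficient.

Details

Motivation: Existing document image retrieval methods are mainly based on image queries and can only retrieve documents within coarse semantic categories (such as newspapers or receipts). They struggle with the fine-grained natural language queries that are typically provided in real-world scenarios.

Result: On the proposed NL-DIR benchmark, mainstream models are evaluated in zero-shot and fine-tuned settings, and a two-stage retrieval method is further studied that improves performance while achieving both time and space efficiency.

Insight: The novelty is a new benchmark that uses natural language descriptions as fine-grained semantic queries, filling a gap between existing DIR tasks and real-world use. Generating high-quality queries with large language models plus manual verification, and exploring a two-stage retrieval strategy, give the visual document understanding community a new research direction and dataset resource.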

Abstract: Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
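A generic two-stage retrieval scheme of the kind the abstract mentions can be sketched as below. The abstract gives no details of the authors' method, so everything here is an assumption: a cheap dot-product score shortlists candidates, and a hypothetical, more expensive `rerank_score` callable reorders only that shortlist.

```python
from typing import Callable, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(
    query_vec: Sequence[float],
    gallery_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],
    k_coarse: int,
    k_final: int,
) -> list[int]:
    # Stage 1: cheap similarity over the whole gallery, keep a shortlist.
    order = sorted(
        range(len(gallery_vecs)),
        key=lambda i: dot(query_vec, gallery_vecs[i]),
        reverse=True,
    )
    shortlist = order[:k_coarse]
    # Stage 2: expensive scorer runs only on the shortlist.
    reranked = sorted(shortlist, key=rerank_score, reverse=True)
    return reranked[:k_final]


# Toy data: four gallery embeddings and made-up fine-grained scores.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
fine_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}
result = two_stage_retrieve([1.0, 0.0], gallery, lambda i: fine_scores[i], 3, 2)
print(result)
```

In this toy run, document 2 has the highest fine-grained score but never reaches stage 2 because the coarse stage filters it out, which illustrates the accuracy versus cost trade-off that the shortlist size `k_coarse` controls.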

[38] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation cs.CV | cs.SD | eess.AS

Jingqi Tian, Yiheng Du, Haoji Zhang, Yuji Wang, Isaac Ning Lee

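TL;DR: This paper proposes DDAVS, a framework that addresses multi-source entanglement and audio-visual misalignment in audio-visual segmentation through disentangled audio semantics and delayed bidirectional alignment, and performs strongly on multiple benchmarks.

Details

Motivation: Existing audio-visual segmentation methods suffer from multi-source entanglement and audio-visual misalignment, which bias models toward louder or larger objects while overlooking weaker, smaller, or co-occurring sound sources.

Result: On the AVS-Objects and VPO benchmarks, DDAVS outperforms existing methods in single-source, multi-source, and multi-instance scenarios, supporting its effectiveness and generalization ability.

Insight: Contributions include learnable queries that extract audio semantics and anchor them in a space derived from an audio prototype memory bank to disentangle multi-source information, and a dual cross-attention mechanism with delayed modality interaction that makes multimodal alignment more robust.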

Abstract: Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/

[39] HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer cs.CV

Mohammad Helal Uddin, Liam Seymour, Sabur Baidya

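TL;DR: This paper proposes HEART-ViT, a Hessian-guided framework for efficient dynamic attention and token pruning in vision transformers. It is presented as the first framework to use second-order information (Hessian-vector products) in a unified way to score the importance of both tokens and attention heads, enabling input-adaptive pruning decisions that substantially reduce computation and latency while preserving or even improving accuracy.

Details

Motivation: ViT models are accurate, but their quadratic attention cost and redundant computation severely hinder deployment on latency-constrained and resource-constrained platforms. Existing pruning methods usually treat tokens or attention heads in isolation and rely on heuristics or first-order signals, which often costs accuracy or fails to generalize across inputs.

Result: On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4% FLOPs reduction, 36% lower latency, and 46% higher throughput, while consistently matching or exceeding baseline accuracy after fine-tuning (for example, 4.7% recovery at 40% token pruning). Deployment on edge devices such as AGX Orin confirms gains in real-world inference speed and energy efficiency.

Insight: The claimed novelty is the first unified, second-order, input-adaptive optimization framework for ViTs based on Hessian information. Viewed objectively, the core insight is that Hessian-vector products allow curvature-weighted sensitivities of tokens and attention heads to be computed efficiently, and that token pruning dominates the computational savings while head pruning removes fine-grained redundancy, so combining the two gives a better trade-off. This offers both theoretical guidance and a practical method for model compression.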

Abstract: Vision Transformers (ViTs) deliver state-of-the-art accuracy but their quadratic attention cost and redundant computations severely hinder deployment on latency and resource-constrained platforms. Existing pruning approaches treat either tokens or heads in isolation, relying on heuristics or first-order signals, which often sacrifice accuracy or fail to generalize across inputs. We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers, which to the best of our knowledge is the first unified, second-order, input-adaptive framework for ViT optimization. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss budgets. This dual-view sensitivity reveals an important structural insight: token pruning dominates computational savings, while head pruning provides fine-grained redundancy removal, and their combination achieves a superior trade-off. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput, while consistently matching or even surpassing baseline accuracy after fine-tuning, for example 4.7 percent recovery at 40 percent token pruning. Beyond theoretical benchmarks, we deploy HEART-ViT on different edge devices such as AGX Orin, demonstrating that our reductions in FLOPs and latency translate directly into real-world gains in inference speed and energy efficiency. HEART-ViT bridges the gap between theory and practice, delivering the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient.
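To make the idea of a Hessian-vector product concrete, here is a small self-contained sketch. It is not HEART-ViT's implementation: the toy quadratic loss, the matrix `A`, the finite-difference approximation, and the score `0.5 * d^T H d` for zeroing one unit are all assumptions chosen for illustration. Real systems would obtain the product through automatic differentiation rather than finite differences.

```python
# Toy loss L(w) = 0.5 * w^T A w with a symmetric matrix A (assumed).
A = [
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 4.0],
]


def grad(w):
    """Gradient of the toy loss, which is A @ w."""
    return [sum(A[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]


def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate H @ v by central differences of the gradient.

    The Hessian itself is never formed; only two gradient calls are needed.
    """
    plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def curvature_score(grad_fn, w, index):
    """Second-order estimate of the loss change if unit `index` is zeroed."""
    delta = [0.0] * len(w)
    delta[index] = -w[index]  # perturbation that removes this unit
    hv = hvp(grad_fn, w, delta)
    return 0.5 * sum(d * h for d, h in zip(delta, hv))


w = [1.0, 2.0, 0.5]
scores = [curvature_score(grad, w, i) for i in range(len(w))]
prune_first = min(range(len(w)), key=lambda i: scores[i])
print(scores, prune_first)
```

Under these assumptions the unit with the smallest score is the cheapest to remove; a loss budget, as mentioned in the abstract, would cap the summed scores of whatever is pruned.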

[40] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion cs.CV

Niraj Prakash Kini, Shiau-Rung Tsai, Guan-Hsun Lin, Wen-Hsiao Peng, Ching-Wen Ma

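TL;DR: This paper proposes milliMamba, a millimeter-wave radar framework for 2D human pose estimation that targets the signal sparsity caused by specular reflection. A Cross-View Fusion Mamba encoder with linear complexity efficiently extracts spatio-temporal features from long sequences, and a spatio-temporal cross-attention decoder predicts joint coordinates over multiple frames, so that context from neighboring frames and joints can be used to infer joints missing due to specular reflection.

Details

Motivation: As a sensor for human pose estimation, millimeter-wave radar preserves privacy and is unaffected by lighting, but its signals are often sparse due to specular reflection, which makes extracting robust features very challenging. The paper addresses this key problem.

Result: Experiments on the TransHuPR and HuPR datasets show significant improvements, exceeding the baselines by 11.0 AP and 14.6 AP respectively while keeping model complexity reasonable.

Insight: The core novelty is bringing Mamba (a sequence modeling architecture with linear complexity) into radar pose estimation and building an end-to-end spatio-temporal pipeline that jointly models spatio-temporal dependencies in both the feature extraction and decoding stages. Adding a velocity loss during training to encourage smooth motion is also a practical technique. Viewed objectively, pairing an efficient sequence model with the spatio-temporal nature of radar signals is a novel and computationally efficient route to handling signal sparsity.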

Abstract: Millimeter-wave radar offers a privacy-preserving and lighting-invariant alternative to RGB sensors for the Human Pose Estimation (HPE) task. However, radar signals are often sparse due to specular reflection, making the extraction of robust features highly challenging. To address this, we present milliMamba, a radar-based 2D human pose estimation framework that jointly models spatio-temporal dependencies across both the feature extraction and decoding stages. Specifically, given the high dimensionality of radar inputs, we adopt a Cross-View Fusion Mamba encoder to efficiently extract spatio-temporal features from longer sequences with linear complexity. A Spatio-Temporal-Cross Attention decoder then predicts joint coordinates across multiple frames. Together, this spatio-temporal modeling pipeline enables the model to leverage contextual cues from neighboring frames and joints to infer missing joints caused by specular reflections. To reinforce motion smoothness, we incorporate a velocity loss alongside the standard keypoint loss during training. Experiments on the TransHuPR and HuPR datasets demonstrate that our method achieves significant performance improvements, exceeding the baselines by 11.0 AP and 14.6 AP, respectively, while maintaining reasonable complexity. Code: https://github.com/NYCU-MAPL/milliMamba
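One common way to combine a keypoint loss with a velocity loss is sketched below. The abstract does not give the exact formulation, so the mean-squared-error form, the frame-difference definition of velocity, and the `vel_weight` parameter are assumptions for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(seq):
    """frames -> joints -> (x, y)  becomes one flat list of coordinates."""
    return [c for frame in seq for joint in frame for c in joint]


def velocities(seq):
    """Per-joint displacement between consecutive frames."""
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def pose_loss(pred, target, vel_weight=1.0):
    keypoint_term = mse(flatten(pred), flatten(target))
    # The velocity term penalizes jitter: motion between frames should match.
    velocity_term = mse(flatten(velocities(pred)), flatten(velocities(target)))
    return keypoint_term + vel_weight * velocity_term


# One joint over three frames; the prediction drifts upward in the last frame.
target = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
pred = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 1.0)]]
loss = pose_loss(pred, target)
print(loss)
```

Here the keypoint term is 1/6 and the velocity term is 1/4, so a single bad frame is penalized both for its position error and for the sudden jump it introduces.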

[41] AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model cs.CV

Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khac, Ankit Singh

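TL;DR: This paper proposes AMoE, an agglomerative Mixture-of-Experts vision foundation model that uses multi-teacher distillation to transfer knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. The study systematically examines the learning dynamics and data efficiency of multi-teacher distillation and introduces an asymmetric relation-knowledge distillation loss, token-balanced batching, and data sampling based on hierarchical clustering to reduce compute cost and improve sample efficiency.

Details

Motivation: The learning dynamics and data efficiency of training vision foundation models through multi-teacher distillation remain underexplored. The work aims to obtain unified visual representations at lower computational cost.

Result: Combining the proposed techniques, the authors curate OpenLVD200M, a 200M-image dataset that shows superior efficiency for multi-teacher distillation. The model is instantiated with a Mixture-of-Experts architecture, and the dataset and distilled models are released.

Insight: Contributions include: 1) an asymmetric relation-knowledge distillation loss that preserves the geometric properties of each teacher while transferring knowledge effectively; 2) token-balanced batching, which packs images of different resolutions into sequences with a uniform token budget and stabilizes representation learning across resolutions; 3) applying hierarchical clustering and sampling, usually reserved for self-supervised learning, to multi-teacher distillation, which substantially improves sample efficiency. Together these improve the efficiency and effectiveness of distillation.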

Abstract: Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data–typically reserved for self-supervised learning–substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts. We release OpenLVD200M and distilled models.
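Token-balanced batching, as described in point (2) of the abstract, amounts to a packing problem: each image contributes a number of patch tokens that depends on its resolution, and images are grouped so that no sequence exceeds a fixed token budget. The sketch below uses a first-fit-decreasing heuristic; the patch size of 16, the budget of 1024, and the heuristic itself are assumptions, not the authors' procedure.

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image (assumed ViT-style)."""
    return (height // patch) * (width // patch)


def pack_by_token_budget(sizes, budget, patch=16):
    """Group image indices into sequences whose token totals stay within budget.

    Returns a list of [used_tokens, [image_indices]] entries.
    """
    items = sorted(
        ((num_tokens(h, w, patch), i) for i, (h, w) in enumerate(sizes)),
        reverse=True,
    )
    bins = []
    for tokens, idx in items:
        if tokens > budget:
            raise ValueError(f"image {idx} alone exceeds the token budget")
        for entry in bins:
            if entry[0] + tokens <= budget:  # first sequence with room
                entry[0] += tokens
                entry[1].append(idx)
                break
        else:
            bins.append([tokens, [idx]])  # open a new sequence
    return bins


# Mixed resolutions: 196, 784, 392 and 196 tokens respectively.
sizes = [(224, 224), (448, 448), (224, 448), (224, 224)]
bins = pack_by_token_budget(sizes, budget=1024)
print(bins)
```

With these inputs the four images fit into two sequences of 980 and 588 tokens, so every training step sees a bounded amount of compute regardless of the resolution mix.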

[42] Generative Latent Coding for Ultra-Low Bitrate Image Compression cs.CV | eess.IV

Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

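TL;DR: This paper proposes a Generative Latent Coding (GLC) architecture for ultra-low bitrate image compression. It performs transform coding in the latent space of a generative VQ-VAE rather than in pixel space, exploiting a latent space that is sparse, semantically rich, and better aligned with human perception to achieve high-realism, high-fidelity compression at very low bitrates.

Details

Motivation: Existing pixel-space image compression methods struggle to deliver both high realism and high fidelity at low bitrates, because pixel-space distortion measures do not align with human perception. The paper aims to solve this problem.

Result: Experiments show that GLC maintains high visual quality at less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, it matches the FID of MS-ILLM with 45% fewer bits.

Insight: The core novelty is moving the transform coding step from pixel space into the latent space of a generative model, which has more favorable properties. A categorical hyper module that lowers the bit cost of hyper-information and code-prediction-based supervision that strengthens semantic consistency are further technical contributions. The strong generative latent space also supports downstream applications such as image restoration and style transfer.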

Abstract: Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high-realism and high-fidelity at low bitrate, as the pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantic and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at https://github.com/jzyustc/GLC.
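To give a feel for what the quoted bitrates mean in practice, the short calculation below converts bits per pixel (bpp) into file size. The image dimensions are hypothetical examples, not sizes taken from the paper.

```python
def compressed_size_bytes(width, height, bpp):
    """File size implied by a bitrate in bits per pixel."""
    return width * height * bpp / 8.0


def bits_after_saving(baseline_bits, saving):
    """Bits needed when a codec uses `saving` (a fraction) fewer bits."""
    return baseline_bits * (1.0 - saving)


# Hypothetical 768x512 natural image at the 0.04 bpp upper bound: about 1966 bytes.
natural_bytes = compressed_size_bytes(768, 512, 0.04)

# Hypothetical 1024x1024 face image at the 0.01 bpp upper bound: about 1311 bytes.
face_bytes = compressed_size_bytes(1024, 1024, 0.01)

# "45% fewer bits" means 55% of the baseline bit count for the same FID.
glc_bits = bits_after_saving(1000.0, 0.45)
print(natural_bytes, face_bytes, glc_bits)
```

For comparison, the uncompressed 768x512 RGB image occupies 768 * 512 * 3 = 1,179,648 bytes, so 0.04 bpp corresponds to a reduction of roughly 600 times.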

Xiangxuan Ren, Zhongdao Wang, Pin Tang, Guoqing Wang, Jilai Zheng

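TL;DR: LiteFusion is a new multi-modal 3D object detector that rethinks the role of LiDAR in the camera-LiDAR fusion paradigm. It treats LiDAR point clouds as a complementary source of geometric information that enhances camera-based detection, which removes the dependence on a 3D backbone entirely and makes deployment easier.

Details

Motivation: Current multi-modal 3D object detectors rely on complex architectures and training strategies and depend heavily on the LiDAR sensor, so performance drops sharply when LiDAR is missing, which harms the robustness and safety of autonomous driving systems. Existing methods are also hard to deploy on diverse hardware such as NPUs and FPGAs because they depend on 3D sparse convolution operators that are optimized mainly for NVIDIA GPUs.

Result: Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with only a 1.1% increase in parameters and no dedicated LiDAR encoder. Even when LiDAR input is absent, LiteFusion maintains strong results.

Insight: The novelty is treating LiDAR as a supplement of geometric information rather than an independent modality, which removes the need for a 3D backbone and improves deployment flexibility. Complementary features are integrated in a quaternion space where orthogonal constraints are preserved, modeling domain-specific relations across modalities and producing a compact cross-modal embedding that improves robustness and effectiveness across scenarios.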

Abstract: 3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods heavily rely on the LiDAR sensor so that they suffer from large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors face difficulties in deployment on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results, highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.

Jinghao Shi, Jianing Song

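TL;DR: This paper proposes BiCoR-Seg, a bidirectional co-refinement framework for semantic segmentation of high-resolution remote sensing images, aimed at the blurred boundaries and class confusion caused by high inter-class similarity and large intra-class variability. A Heatmap-driven Bidirectional Information Synergy module (HBIS) establishes a bidirectional information flow between feature maps and class embeddings, and a hierarchical supervision strategy together with a cross-layer class embedding Fisher discriminative loss strengthens the discriminability and interpretability of the features.

Details

Motivation: Semantic segmentation of high-resolution remote sensing images faces high inter-class similarity and large intra-class variability. Existing methods struggle to inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, which leads to blurred boundaries and class confusion in complex scenes.

Result: Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets show that BiCoR-Seg achieves strong segmentation performance while offering better interpretability.

Insight: Contributions include: 1) the HBIS module, which generates class-level heatmaps to establish a bidirectional information flow between features and class embeddings; 2) a hierarchical supervision strategy that uses the interpretable heatmaps from HBIS as low-resolution segmentation predictions for supervision, strengthening the discriminative capacity of shallow features; 3) a cross-layer class embedding Fisher discriminative loss that enforces intra-class compactness and enlarges inter-class separability, improving the discriminability of the embeddings. The framework improves performance while also making the model more interpretable.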

Abstract: High-resolution remote sensing image semantic segmentation (HRSS) is a fundamental yet critical task in the field of Earth observation. However, it has long faced the challenges of high inter-class similarity and large intra-class variability. Existing approaches often struggle to effectively inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, leading to blurred boundaries and class confusion in complex scenes. To address these challenges, we propose Bidirectional Co-Refinement Framework for HRSS (BiCoR-Seg). Specifically, we design a Heatmap-driven Bidirectional Information Synergy Module (HBIS), which establishes a bidirectional information flow between feature maps and class embeddings by generating class-level heatmaps. Based on HBIS, we further introduce a hierarchical supervision strategy, where the interpretable heatmaps generated by each HBIS module are directly utilized as low-resolution segmentation predictions for supervision, thereby enhancing the discriminative capacity of shallow features. In addition, to further improve the discriminability of the embedding representations, we propose a cross-layer class embedding Fisher Discriminative Loss to enforce intra-class compactness and enlarge inter-class separability. Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets demonstrate that BiCoR-Seg achieves outstanding segmentation performance while offering stronger interpretability. The released code is available at https://github.com/ShiJinghao566/BiCoR-Seg.
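The Fisher-style objective mentioned in the abstract (intra-class compactness together with inter-class separability) can be illustrated with a simple ratio of within-class to between-class scatter. The exact loss used in BiCoR-Seg is not given in the abstract, so the ratio form, the `eps` term, and the toy class names below are assumptions.

```python
def mean_vec(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fisher_loss(embeddings_by_class, eps=1e-8):
    """Within-class scatter divided by between-class scatter.

    `embeddings_by_class` maps a class name to its embeddings, for example
    one embedding per network layer. Requires at least two classes.
    """
    means = {c: mean_vec(v) for c, v in embeddings_by_class.items()}
    # Compactness: how far each embedding is from its own class mean.
    intra_terms = [
        sq_dist(e, means[c]) for c, v in embeddings_by_class.items() for e in v
    ]
    intra = sum(intra_terms) / len(intra_terms)
    # Separability: how far apart the class means are from each other.
    classes = list(means)
    inter_terms = [
        sq_dist(means[a], means[b])
        for i, a in enumerate(classes)
        for b in classes[i + 1:]
    ]
    inter = sum(inter_terms) / len(inter_terms)
    return intra / (inter + eps)


tight = {"road": [[0.0, 0.0], [0.0, 0.2]], "water": [[4.0, 0.0], [4.0, 0.2]]}
loose = {"road": [[0.0, 0.0], [0.0, 2.0]], "water": [[1.0, 0.0], [1.0, 2.0]]}
print(fisher_loss(tight), fisher_loss(loose))
```

Minimizing such a ratio pulls same-class embeddings together and pushes class means apart, which is why the tight, well-separated configuration scores far lower than the loose one.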

[45] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation cs.CV

Daniele Cardullo, Simone Teglia, Irene Amerini

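TL;DR: This paper proposes LADLE-MM, a multimodal misinformation detector that works with limited annotated data and limited training resources. It combines two unimodal branches with a third branch that uses multimodal embeddings extracted from BLIP as a fixed reference space, reducing the number of trainable parameters while achieving performance on the DGM4 and VERITE benchmarks that is competitive with, or better than, more complex models.

Details

Motivation: Existing multimodal misinformation detectors for image-text pairs typically rely on computationally intensive architectures or require large amounts of annotated data. The paper aims to build a detector that remains effective under limited annotation and constrained training resources.

Result: On the binary and multi-label classification tasks of the DGM4 benchmark, LADLE-MM achieves competitive performance and outperforms existing methods when trained without grounding annotations. On the VERITE dataset, it surpasses current state-of-the-art methods that use more complex large vision-language model architectures, demonstrating effective generalization in an open-set setting and strong robustness to unimodal bias.

Insight: The main novelty is a model-soup initialized architecture that enriches representations with fixed multimodal embeddings extracted from a pretrained model (BLIP) as a reference space, achieving efficient and robust detection with substantially fewer trainable parameters (60.3% fewer than the previous state of the art). This offers a useful pattern for multimodal tasks under resource constraints.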

Abstract: With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language-Models, demonstrating the effective generalization ability in an open-set setting and strong robustness to unimodal bias.

[46] $D^{3}$ETOR: Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations cs.CV | cs.AI

Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu

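TL;DR: This paper proposes D3ETOR, a two-stage framework for weakly-supervised camouflaged object detection that addresses the weak performance of existing methods caused by unreliable pseudo labels and scribble annotation bias. Stage 1 enhances SAM with adaptive entropy-driven point sampling and a multi-agent debate mechanism to produce more precise pseudo masks. Stage 2 introduces FADeNet, which fuses multi-level frequency-aware features and dynamically reweights supervision to mitigate annotation bias. The method achieves state-of-the-art performance on multiple benchmarks and significantly narrows the gap between weakly and fully supervised methods.

Details

Motivation: Existing weakly-supervised camouflaged object detection methods have two major limitations: the pseudo masks produced by general-purpose segmentation models are unreliable because those models lack task-specific semantic understanding, and the inherent bias of scribble annotations is ignored, which prevents the model from capturing the global structure of camouflaged objects.

Result: On multiple benchmarks, D3ETOR achieves state-of-the-art performance and significantly narrows the gap between weakly and fully supervised camouflaged object detection.

Insight: Contributions include strengthening SAM's pseudo-label generation with adaptive entropy-driven point sampling and a multi-agent debate mechanism, and a frequency-aware progressive debiasing network that balances global semantics with local detail while dynamically reweighting supervision across regions to mitigate annotation bias.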

Abstract: Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
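The abstract does not spell out how the entropy-driven point sampling works. One plausible reading, sketched here purely as an assumption, is that pixels where a foreground probability map is most uncertain (highest binary entropy) are chosen as prompt points for SAM. The function names and the toy probability map are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def sample_uncertain_points(prob_map, k):
    """Return the (row, col) positions of the k most uncertain pixels."""
    scored = [
        (binary_entropy(p), r, c)
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    # Stable sort: ties keep row-major order.
    scored.sort(key=lambda item: -item[0])
    return [(r, c) for _, r, c in scored[:k]]


# Toy 2x3 foreground-probability map (values are made up).
prob_map = [
    [0.05, 0.50, 0.90],
    [0.45, 0.99, 0.20],
]
points = sample_uncertain_points(prob_map, k=2)
print(points)
```

An adaptive variant could let `k` depend on the total entropy of the map, but whether the paper does this is not stated in the abstract.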

[47] UbiQVision: Quantifying Uncertainty in XAI for Image Recognition cs.CV | cs.AI

Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab

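TL;DR: This paper proposes UbiQVision, a framework for quantifying the uncertainty of SHAP, an explainable AI (XAI) method, in image recognition. Using Dirichlet posterior sampling and Dempster-Shafer theory, it combines belief, plausibility, and fusion maps with statistical analysis to quantify how unreliable SHAP explanations are under epistemic and aleatoric uncertainty. The framework is evaluated on three medical imaging datasets with different class distributions, image qualities, and modality types.

Details

Motivation: The complexity of deep learning models (such as ResNets and Vision Transformers) undermines interpretability. SHAP provides interpretable visualizations, but its explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty, which matters especially in critical domains such as medical imaging.

Result: The framework is evaluated on three medical imaging datasets covering pathology, ophthalmology, and radiology, whose differences in image resolution and modality introduce significant epistemic uncertainty.

Insight: The novelty is combining uncertainty quantification (via Dirichlet sampling and Dempster-Shafer theory) with SHAP explanations, giving a more reliable way to assess XAI in complex, noisy, and uncertain settings such as multi-modality medical imaging.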

Abstract: Recent advances in deep learning have led to its widespread adoption across diverse domains, including medical imaging. This progress is driven by increasingly sophisticated model architectures, such as ResNets, Vision Transformers, and Hybrid Convolutional Neural Networks, that offer enhanced performance at the cost of greater complexity. This complexity often compromises model explainability and interpretability. SHAP has emerged as a prominent method for providing interpretable visualizations that aid domain experts in understanding model predictions. However, SHAP explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty. In this study, we address this challenge by using Dirichlet posterior sampling and Dempster-Shafer theory to quantify the uncertainty that arises from these unstable explanations in medical imaging applications. The framework uses belief, plausibility, and fusion maps alongside statistical quantitative analysis to quantify uncertainty in SHAP. Furthermore, we evaluated our framework on three medical imaging datasets with varying class distributions, image qualities, and modality types, covering examples from pathology, ophthalmology, and radiology. The varying image resolutions and modality-specific aspects of these datasets introduce noise and significant epistemic uncertainty.
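The belief and plausibility quantities from Dempster-Shafer theory, and Dempster's rule for fusing two sources of evidence, are standard definitions and can be written in a few lines. How UbiQVision maps SHAP values onto mass functions is not described in the abstract, so the two-class frame and the mass values below are hypothetical.

```python
def belief(mass, hypothesis):
    """Bel(A): total mass of focal sets fully contained in A."""
    return sum(m for focal, m in mass.items() if focal <= hypothesis)


def plausibility(mass, hypothesis):
    """Pl(A): total mass of focal sets that intersect A."""
    return sum(m for focal, m in mass.items() if focal & hypothesis)


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Focal sets are non-empty frozensets; conflicting mass is renormalized.
    """
    fused, conflict = {}, 0.0
    for a, x in m1.items():
        for b, y in m2.items():
            common = a & b
            if common:
                fused[common] = fused.get(common, 0.0) + x * y
            else:
                conflict += x * y
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


# Hypothetical two-class frame of discernment.
M = frozenset({"malignant"})
B = frozenset({"benign"})
THETA = M | B  # mass on THETA expresses "don't know"

m1 = {M: 0.6, B: 0.1, THETA: 0.3}
m2 = {M: 0.5, THETA: 0.5}

bel, pl = belief(m1, M), plausibility(m1, M)
fused = combine(m1, m2)
print(bel, pl, fused[M])
```

The gap between belief (0.6) and plausibility (0.9) for the malignant class is the mass left uncommitted, which is one way an uncertainty map could be derived per pixel or per region.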

[48] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation cs.CV | cs.AI | eess.AS | eess.IV

Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Joon Son Chung, Shinji Watanabe

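TL;DR: This paper proposes TAVID, a framework that jointly generates interactive video and conversational speech from text and reference images, with the aim of building more human-like multimodal conversational systems.

Details

Motivation: Prior work typically studies talking or listening head generation and conversational speech generation in isolation, overlooking the tightly coupled audio-visual interaction of human conversation. TAVID treats this as a single, unified multimodal generation problem.

Result: Extensive experiments across four dimensions (talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality) demonstrate the method's effectiveness.

Insight: Two cross-modal mappers (a motion mapper and a speaker mapper) integrate the face and speech generation pipelines, enabling bidirectional exchange of complementary information between the audio and visual modalities so that interactive content is generated in sync.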

Abstract: The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.

[49] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection cs.CV

Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu

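TL;DR: This paper proposes KeyTailor, a new framework for video virtual try-on that uses a keyframe-driven details injection strategy to improve garment dynamics and background integrity, while avoiding explicit modifications to the diffusion transformer architecture to keep computational cost down.

Details

Motivation: Existing diffusion transformer-based video virtual try-on methods struggle to capture fine-grained garment dynamics and to preserve background integrity across frames, and the extra interaction modules they introduce are computationally expensive. The limited scale and quality of public datasets also restrict model generalization and effective training.

Result: KeyTailor outperforms state-of-the-art baselines in garment fidelity and background integrity in both dynamic and static scenarios.

Insight: The novelty lies in exploiting the fact that keyframes inherently contain both foreground dynamics and background consistency. The method uses an instruction-guided keyframe sampling strategy and two tailored keyframe-driven modules (a garment details enhancement module and a collaborative background optimization module), then injects the distilled details into standard DiT blocks, achieving efficient and realistic try-on video synthesis without added architectural complexity. The accompanying large-scale, high-definition dataset ViT-HD is also a valuable resource for the field.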

Abstract: Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.

[50] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation cs.CV

V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

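TL;DR: CRAFT is a training-free, model-agnostic framework that improves multimodal text-to-image generation through structured reasoning and agentic feedback. It decomposes the prompt into dependency-structured visual questions, verifies generated images with a vision-language model, and has an LLM agent apply targeted prompt edits only where constraints fail, iterating until a stopping criterion is met.

Details

Motivation: Existing reasoning and reflection methods rely on implicit, holistic critiques or unconstrained prompt rewrites, which makes their behavior hard to interpret, control, or stop reliably. Large language models, by contrast, have benefited from explicit, structured "thinking" based on verification, targeted correction, and early stopping, so the paper brings this structured reasoning paradigm to multimodal image generation.

Result: Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. The inference-time overhead is negligible, allowing smaller or cheaper models to approach the quality of more expensive systems.

Insight: The novelty lies in applying an explicitly structured, constraint-driven inference-time loop to image generation, achieving interpretable and controllable refinement through dependency decomposition, verification, and targeted edits. Viewed objectively, it transfers the reasoning paradigm of LLMs to multimodal tasks and offers an efficient way to improve generation reliability.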

Abstract: Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, yielding an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
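The verify-then-edit loop described in this entry can be sketched as follows. This is a minimal sketch under stated assumptions, not CRAFT's implementation: `refine`, `generate`, `verify`, `edit`, and `max_edits` are hypothetical names, the three callables stand in for the image generator, the vision-language verifier, and the LLM editing agent, and the dependency structure between questions is simplified to a flat list.

```python
def refine(prompt, questions, generate, verify, edit, max_edits=3):
    """Generate, check every constraint, and edit the prompt only for failed checks.

    generate(prompt) -> image            (hypothetical generator)
    verify(image, question) -> bool      (hypothetical VLM check)
    edit(prompt, failed) -> new prompt   (hypothetical LLM agent)
    Returns the final prompt, the final image, and a per-round history.
    """
    history = []
    image = None
    for attempt in range(max_edits + 1):
        image = generate(prompt)
        failed = [q for q in questions if not verify(image, q)]
        history.append((prompt, failed))
        if not failed:
            # Explicit stopping criterion: every constraint is satisfied.
            break
        if attempt < max_edits:
            # Targeted edit: the agent only sees the constraints that failed.
            prompt = edit(prompt, failed)
    return prompt, image, history
```

With toy stand-ins (for example, a "generator" that echoes the prompt and a "verifier" that checks for a keyword), the loop stops as soon as no question fails, or after `max_edits` edits at the latest.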

[51] DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning cs.CV | cs.AI

Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee

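TL;DR: This paper proposes DETACH, a framework that aligns exocentric video with ambient sensor data by decomposing spatio-temporal features. It addresses the failure of existing global alignment methods to capture local details and to distinguish actions with similar temporal patterns, enabling non-intrusive, scalable human action recognition.

Details

Motivation: Action recognition based on egocentric video and wearable sensors suffers from user discomfort, privacy concerns, and limited scalability. Exocentric video combined with ambient sensors is a non-intrusive alternative, but conventional global alignment fails in this setting: it cannot capture local details and is easily confused by similar temporal patterns.

Result: In downstream-task experiments on the Opportunity++ and HWU-USP datasets, DETACH achieves substantial improvements over adapted egocentric-wearable baselines.

Insight: Innovations include: 1) explicit decomposition of spatio-temporal features to preserve local details; 2) sensor-spatial features discovered via online clustering to provide semantic grounding; 3) a two-stage alignment approach (spatial mutual supervision followed by a spatial-temporal weighted contrastive loss) that adaptively handles negative samples.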

Abstract: Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
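The abstract does not give the form of the weighted contrastive loss, so the following is only a generic weighted InfoNCE sketch showing how per-negative weights could treat easy, hard, and false negatives differently. The function name, the weight values, and the temperature are hypothetical and are not taken from DETACH.

```python
import math


def weighted_info_nce(pos_sim, neg_sims, neg_weights, temperature=0.1):
    """Weighted InfoNCE for one anchor.

    pos_sim:     similarity between the anchor and its positive
    neg_sims:    similarities between the anchor and each negative
    neg_weights: assumed per-negative weights (0 drops a suspected false
                 negative, values above 1 emphasize a hard negative)
    """
    pos = math.exp(pos_sim / temperature)
    neg = sum(w * math.exp(s / temperature) for s, w in zip(neg_sims, neg_weights))
    return -math.log(pos / (pos + neg))


# Toy comparison: the same similarities under three hypothetical weightings.
sims = [0.7, 0.1]
uniform = weighted_info_nce(0.8, sims, [1.0, 1.0])
hard_emphasis = weighted_info_nce(0.8, sims, [2.0, 1.0])    # up-weight the hard negative
false_neg_drop = weighted_info_nce(0.8, sims, [0.0, 1.0])   # ignore a suspected false negative
```

Up-weighting the high-similarity negative raises the loss, and zeroing it out lowers the loss, which is the qualitative behavior one would want from an adaptive weighting scheme.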

[52] Chain-of-Anomaly Thoughts with Large Vision-Language Models cs.CV | cs.MA

Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo

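TL;DR: This paper proposes Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that addresses the difficulty large vision-language models have in detecting anomalies such as crimes in automated video surveillance, caused by their inherent bias towards normality. By introducing an inductive criminal bias into the reasoning process through an anomaly-focused classification layer, the method significantly improves anomaly detection and classification.

Details

Motivation: Large vision-language models used for automated video surveillance have an inherent bias towards normality and often fail to detect anomalous events such as crimes. Chain-of-Thought reasoning strategies improve performance on language tasks, but their reasoning lacks inductive anomaly biases, which steers the models further towards normal interpretations.

Result: The method improves the anomaly detection F1-score by 11.8 percentage points on challenging low-resolution footage and anomaly classification by 3.78 percentage points on high-resolution videos.

Insight: The core novelty is a Chain-of-Thought variant (CoAT) that introduces an inductive criminal bias: an anomaly-focused multi-agent reasoning framework with a final anomaly classification layer actively steers the model towards anomaly cues, overcoming its inherent bias towards normality. This suggests a way to apply large vision-language models in domains that require detecting rare or negative events, such as security surveillance.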

Abstract: Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.

[53] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding cs.CV

Anh Dao, Manh Tran, Yufei Zhang, Xiaoming Liu, Zijun Cui

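TL;DR: This paper integrates physically inferred joint actuation forces into existing human motion understanding pipelines and systematically evaluates the effect of force cues on three tasks: gait recognition, action recognition, and fine-grained video captioning. Across multiple benchmarks, adding force information yields consistent gains, especially under challenging dynamic, occluded, or appearance-varying conditions.

Details

Motivation: Most vision-based human motion understanding methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. The paper investigates whether and when physically inferred forces enhance motion understanding.

Result: Incorporating forces improved performance across 8 benchmarks. On CASIA-B gait recognition, Rank-1 accuracy rose from 89.52% to 90.39% (+0.87 points), with larger gains under challenging conditions (+2.7% when wearing a coat and +3.0% at the side view). On Gait3D, performance rose from 46.0% to 47.3% (+1.3 points). On Penn Action, CTR-GCN gained +2.00%, and high-exertion classes such as punching/slapping gained +6.96%. In video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029).

Insight: The novelty lies in systematically introducing physical force cues from biomechanics and validating their complementary value for visual motion understanding. The core insight is that force information effectively complements visual and kinematic features, particularly in dynamic, occluded, or appearance-varying scenes, providing models with more fundamental physical constraints and interpretability and thereby improving robustness and semantic understanding.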

Abstract: Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL’s ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.

[54] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images cs.CV

Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin

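TL;DR: UTDesign is a unified framework for high-precision stylized text editing and conditional text generation in graphic design images, supporting both English and Chinese. A DiT (Diffusion Transformer)-based text style transfer model generates transparent RGBA text foregrounds, and is extended into a conditional generation framework that synthesizes text from background images, prompts, and layout specifications. The framework also integrates pre-trained text-to-image models and an MLLM-based layout planner into a fully automated text-to-design (T2D) pipeline.

Details

Motivation: The work addresses the limited text rendering ability of diffusion-based text-to-image models, particularly for small-scale typography and non-Latin scripts such as Chinese, in order to automate text creation and editing in graphic design.

Result: UTDesign achieves state-of-the-art (SOTA) stylistic consistency and text accuracy among open-source methods, and also shows unique advantages over proprietary commercial approaches.

Insight: Innovations include: 1) a DiT-based text style transfer model trained from scratch on a synthetic dataset, which generates transparent RGBA text foregrounds that preserve the style of reference glyphs; 2) an extension to conditional text generation by training a multi-modal condition encoder on a dataset with detailed text annotations; 3) integration of pre-trained T2I models and an MLLM layout planner into a fully automated T2D pipeline. Viewed objectively, the unified framework combines style transfer, conditional generation, and layout planning into an end-to-end solution for multilingual text design.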

Abstract: AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.

[55] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition cs.CV

Gorjan Radevski

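TL;DR: This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. It is organized into five chapters, each addressing a distinct challenge in multimodal machine learning: translating spatial language into visual representations, mapping medical texts to an anatomical atlas, linking structured text to knowledge graphs, fusing video and object detections for action recognition, and transferring multimodal capabilities into RGB models via knowledge distillation.

Details

Motivation: The work addresses challenges in multimodal machine learning, including spatial relation understanding, medical text navigation, knowledge graph construction, action recognition, and computational efficiency, to strengthen systems' ability to process complex multimodal inputs.

Result: The manuscript reports progress on several tasks: Spatial-Reasoning Bert effectively decodes spatial language into visual representations; the medical text translation method improves navigability through a spatial co-occurrence loss; the knowledge graph linking method addresses ambiguities in text extraction; the multimodal fusion method improves the robustness and accuracy of action recognition; and the knowledge transfer method lets RGB models mimic multimodal fusion capabilities, reducing computational requirements while maintaining performance.

Insight: Innovations include: a spatial co-occurrence loss for interpretable medical text mapping, a knowledge graph linking benchmark for handling ambiguity, fusion of video frame and object detection representations to strengthen action recognition, and efficient model transfer via multimodal knowledge distillation. These methods offer reusable ideas for cross-modal translation and knowledge transfer.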

Abstract: This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems’ ability to process complex, multimodal inputs across diverse applications.

[56] SirenPose: Dynamic Scene Reconstruction via Geometric Supervision cs.CV

Kaitong Cai, Jensen Zhang, Jing Yang, Keze Wang

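TL;DR: SirenPose is a geometry-aware loss formulation that combines the periodic activations of sinusoidal representation networks with keypoint-based geometric supervision to reconstruct dynamic 3D scenes from monocular video accurately and with temporal consistency. It strengthens spatiotemporal coherence through physics-inspired constraints and captures geometric detail through high-frequency signal modeling. On benchmarks including Sintel, Bonn, and DAVIS, SirenPose outperforms state-of-the-art methods on multiple metrics.

Details

Motivation: Existing methods often struggle to maintain motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose targets these cases to achieve more accurate and coherent dynamic scene reconstruction.

Result: On DAVIS, SirenPose achieves a 17.8% reduction in FVD, a 28.7% reduction in FID, and a 6.0% improvement in LPIPS compared to MoSCA, along with better temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R on absolute trajectory error and on translational and rotational relative pose error.

Insight: Innovations include: combining the periodic activations of sinusoidal representation networks with keypoint-based geometric supervision, introducing physics-inspired spatiotemporal constraints to enforce coherence, using high-frequency signal modeling to capture detail, expanding the UniKPT dataset to 600,000 annotated instances, and modeling keypoint relationships with graph neural networks. Viewed objectively, the method substantially improves geometric accuracy and temporal coherence in dynamic scene reconstruction.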

Abstract: We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics-inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high-frequency signal modeling to capture fine-grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive experiments on benchmarks including Sintel, Bonn, and DAVIS demonstrate that SirenPose consistently outperforms state-of-the-art methods. On DAVIS, SirenPose achieves a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA. It also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R with lower absolute trajectory error as well as reduced translational and rotational relative pose error, highlighting its effectiveness in handling rapid motion, complex dynamics, and physically plausible reconstruction.

[57] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios cs.CV

Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li

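TL;DR: This paper proposes Multi-grained Text-guided Image Fusion (MTIF) for multi-exposure and multi-focus image fusion. It guides fusion with multi-grained textual descriptions (detail, structure, and semantics) and introduces a hierarchical cross-modal modulation module, multi-grained supervision signals, and a saliency-driven data enrichment module to improve fusion quality. Experiments show that MTIF outperforms existing methods on both multi-exposure and multi-focus fusion tasks.

Details

Motivation: Existing text-guided image fusion methods use only coarse-grained descriptions, which hampers understanding of fine-grained details and precise cross-modal alignment, limiting fusion quality. The paper addresses this with multi-grained textual descriptions and matching alignment mechanisms.

Result: In extensive experiments on multi-exposure and multi-focus image fusion tasks, MTIF consistently outperforms previous methods.

Insight: Innovations include: 1) multi-grained textual descriptions (detail, structure, semantics) for hierarchical guidance; 2) multi-grained supervision signals that promote alignment between visual and textual features; 3) a saliency-driven enrichment module that augments the training data. These designs improve the precision of cross-modal modulation and alignment and may carry over to other cross-modal tasks.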

Abstract: Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.

[58] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models cs.CV

Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin

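TL;DR: To address the weakness of vision-language models (VLMs) at dynamic spatial reasoning, this paper proposes DSR Suite. It comprises an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos, and a lightweight Geometry Selection Module (GSM) that integrates geometric priors into VLMs, improving their ability to reason about how object geometry and relationships evolve in 3D space over time.

Details

Motivation: VLMs are weak at dynamic spatial reasoning, that is, reasoning about how object geometry and relationships change in 3D space over time, largely because scalable 4D-aware training resources are scarce.

Result: Experiments show that integrating the DSR-Train dataset and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.

Insight: Innovations include: an automated dataset construction pipeline that emphasizes in-the-wild videos, object- and scene-level 3D requirements, viewpoint transformations, multi-object interactions, and fine-grained procedural answers; and a lightweight Geometry Selection Module that extracts question-relevant geometric knowledge from pretrained 4D reconstruction priors, avoiding information overload.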

Abstract: Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.

[59] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models cs.CV

Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie

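TL;DR: FlashVLM is a text-guided visual token selection framework for large multimodal models. It computes an explicit cross-modal similarity between image tokens and text embeddings in the language model space, combines it with visual saliency, and selects the key visual tokens for each query. This sharply reduces the number of redundant visual tokens processed while matching or even exceeding the unpruned baseline, enabling efficient attention computation.

Details

Motivation: Large vision-language models typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods either ignore the text query or rely on unstable deep attention maps, which degrades semantic alignment under aggressive pruning.

Result: On LLaVA 1.5, under identical token budgets and evaluation protocols, FlashVLM slightly surpasses the unpruned baseline while pruning up to 77.8% of visual tokens, and maintains 92.8% accuracy even at 94.4% compression. Extensive experiments on 14 image and video benchmarks show that FlashVLM achieves state-of-the-art efficiency-performance trade-offs and remains robust and generalizes well across mainstream VLMs.

Insight: The novelty is a text-guided token selection mechanism based on explicit cross-modal similarity rather than noisy attention weights, combined with visual saliency and a diversity-preserving partition that maintains global context. This gives a stable and interpretable token pruning method for efficient inference in multimodal models.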

Abstract: Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
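The scoring idea in this entry (text relevance fused with saliency in the log domain, then sharpened with a temperature) can be sketched roughly as below. This is an illustration under assumptions, not FlashVLM's actual formulation: the function names, the rescaling of cosine similarity to (0, 1], the default `weight` and `temperature`, and the plain top-k selection are all hypothetical, and the diversity-preserving background partition is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def select_tokens(image_tokens, text_embedding, saliency, budget,
                  weight=0.5, temperature=0.5, eps=1e-6):
    """Return the indices of the `budget` highest-scoring tokens and all scores.

    Log-domain fusion (assumed form):
        weight * log(relevance) + (1 - weight) * log(saliency)
    followed by a temperature-scaled softmax over tokens.
    """
    log_scores = []
    for token, sal in zip(image_tokens, saliency):
        # Assumption: rescale cosine from [-1, 1] to [0, 1] so the log is defined.
        relevance = (1.0 + cosine(token, text_embedding)) / 2.0
        fused = weight * math.log(max(relevance, eps)) + (1.0 - weight) * math.log(max(sal, eps))
        log_scores.append(fused)
    scaled = [s / temperature for s in log_scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget]), scores
```

Note that the temperature changes how peaked the score distribution is but not the ranking, so in this simplified form it only matters if the scores are later thresholded or combined with other quantities.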

[60] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving cs.CV | cs.AI | cs.LG | cs.RO

Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl

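TL;DR: This paper studies why imitation learning underperforms in simulated driving and finds that asymmetry between expert demonstrations and student observations (the expert has privileged information, higher visibility, and lower uncertainty) is a major bottleneck. A series of interventions narrows these gaps, yielding the TransFuser v6 policy, which sets a new state of the art on CARLA closed-loop benchmarks and shows generalization on real-world benchmarks.

Details

Motivation: The paper addresses poor closed-loop performance of imitation learning policies in simulated driving caused by information asymmetry between expert and student: the expert ignores occlusions and knows other vehicles' actions, while the student's navigational intent is under-specified.

Result: On CARLA closed-loop benchmarks, the TransFuser v6 policy reaches 95 DS on Bench2Drive and more than doubles prior performance on Longest6 v2 and Town13, setting a new state of the art; it also shows consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks.

Insight: The novelty lies in systematically identifying and measuring the effect of learner-expert asymmetry on imitation learning, and narrowing the gap with targeted interventions (improving visibility, reducing uncertainty, and specifying navigational intent). The paper also demonstrates the value of perception supervision for sim-to-real transfer, offering reusable, modular improvements for end-to-end driving.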

Abstract: Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles’ actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performance on Longest6 v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.

[61] Repurposing Video Diffusion Transformers for Robust Point Tracking cs.CV

Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam

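TL;DR: This paper proposes DiTracker, which repurposes pre-trained video Diffusion Transformers (DiTs) for robust point tracking. It exploits the spatio-temporal attention of DiTs through query-key attention matching, lightweight LoRA tuning, and cost fusion with a ResNet backbone, matching or exceeding the state of the art on the ITTO and TAP-Vid benchmarks.

Details

Motivation: Existing point tracking methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. The paper uses the spatio-temporal attention of pre-trained video DiTs to improve tracking robustness and accuracy.

Result: DiTracker achieves state-of-the-art performance on the ITTO benchmark and matches or outperforms state-of-the-art models on the TAP-Vid benchmarks, while training with a batch size 8 times smaller than that of other methods.

Insight: Innovations include: repurposing the spatio-temporal attention of pre-trained video DiTs for point tracking, a query-key attention matching mechanism, and the combination of lightweight LoRA tuning with cost fusion. The results validate video DiT features as an effective and efficient foundation for point tracking.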

Abstract: Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8 times smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
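To make "query-key attention matching" concrete, the sketch below treats the feature at the tracked point as a query, scores it against key features at candidate positions in another frame with scaled dot-product attention, and reads off a location. It is a generic illustration, not DiTracker's implementation: the function name, the toy features, and the choice to report both a hard argmax and a soft (attention-weighted) position are assumptions.

```python
import math


def attention_match(query, keys, positions):
    """Match one query feature against key features at known 2D positions.

    Returns (soft_xy, hard_xy, weights), where weights is the softmax of
    the scaled dot products q . k / sqrt(d).
    """
    dim = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    soft_x = sum(w * p[0] for w, p in zip(weights, positions))
    soft_y = sum(w * p[1] for w, p in zip(weights, positions))
    best = max(range(len(weights)), key=lambda i: weights[i])
    return (soft_x, soft_y), positions[best], weights


# Toy example: three candidate positions along a row of the target frame.
query_feature = [4.0, 0.0]
key_features = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
candidate_positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
soft_xy, hard_xy, attn = attention_match(query_feature, key_features, candidate_positions)
```

The attention weights double as a matching cost volume, which is the kind of quantity that could then be fused with costs from another backbone.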

[62] Active Intelligence in Video Avatars via Closed-loop World Modeling cs.CV

Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng

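TL;DR: This paper introduces the L-IVA task and benchmark and the ORCA framework to address the inability of current video avatar generation methods to autonomously pursue long-term goals. ORCA achieves active intelligence and goal-directed planning in stochastic generative environments through a closed-loop OTAR cycle and a hierarchical dual-system architecture.

Details

Motivation: Current video avatar generation methods preserve identity and align motion well but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive interaction with the environment. The paper aims to move video avatars from passive animation to active, goal-oriented behavior.

Result: Extensive experiments show that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating its design inspired by an internal world model.

Insight: The novelty lies in formulating avatar control as a POMDP with continuous belief updating and outcome verification through a closed-loop OTAR (Observe-Think-Act-Reflect) cycle, together with a hierarchical dual-system architecture (System 2 performs strategic reasoning with state prediction; System 1 translates abstract plans into precise action captions), enabling autonomous multi-step task completion in open-domain scenarios.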

Abstract: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.

[63] SpatialTree: How Spatial Abilities Branch Out in MLLMs cs.CV

Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng

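TL;DR: This paper proposes SpatialTree, a cognitive-science-inspired hierarchy that organizes the spatial abilities of multimodal large language models (MLLMs) into four progressive levels: low-level perception, mental mapping, simulation, and agentic competence. On this basis the authors build the first capability-centric hierarchical benchmark, evaluate mainstream MLLMs across 27 sub-abilities, and reveal correlations and transfer dynamics between low- and high-level skills. They also propose a simple auto-think strategy that suppresses unnecessary deliberation, allowing reinforcement learning to improve performance consistently across all levels.

Details

Motivation: Most studies of spatial ability in MLLMs focus on a narrow set of tasks and lack a systematic understanding of how these abilities are structured, whereas cognitive science suggests that spatial ability develops progressively from perception to reasoning and interaction. The paper fills this gap with a hierarchical framework for understanding and systematically scaling spatial abilities in MLLMs.

Result: Evaluating mainstream MLLMs on the SpatialTree benchmark shows that low-level perception (L1) skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Targeted supervised fine-tuning reveals negative transfer within L1 but strong cross-level transfer from low- to high-level abilities with notable synergy. The proposed auto-think strategy enables reinforcement learning to improve performance consistently across all levels.

Insight: The novelty lies in a systematic, cognitive-science-inspired hierarchy of spatial abilities (SpatialTree) and the first capability-centric hierarchical benchmark, which together expose the structure of spatial abilities in MLLMs and their cross-level transfer dynamics, plus an auto-think strategy that suppresses unnecessary deliberation. The work offers a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.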

Abstract: Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive “thinking” is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

[64] SemanticGen: Video Generation in Semantic Space cs.CV

Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang

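TL;DR: This paper proposes SemanticGen, a new method that generates videos in semantic space. A two-stage diffusion pipeline first performs global planning in a high-level semantic space and then adds high-frequency details, addressing the slow convergence and high computational cost of existing methods that operate in the VAE latent space.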

Details

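Motivation: Existing video generation models typically learn the distribution in the VAE latent space and then map latents to pixels with a VAE decoder. This approach produces high-quality videos but converges slowly and is computationally expensive when generating long videos.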

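Result: Experiments show that SemanticGen generates high-quality videos and outperforms state-of-the-art methods and strong baselines on multiple benchmarks; generation in semantic space converges faster and remains computationally efficient when extended to long videos.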

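Insight: The key innovation is a two-stage design that splits generation into global planning in semantic space and detail completion in the VAE latent space. It exploits the inherent redundancy of video by planning first over compact, high-level semantic features and adding details afterward, which avoids the cost of bidirectional attention over a vast set of low-level video tokens.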

Abstract: State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.

cs.AI

[65] Reason2Decide: Rationale-Driven Multi-Task Learning cs.AI | cs.CL

H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel

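TL;DR: The paper proposes Reason2Decide, a two-stage training framework for clinical decision support that aims to improve predictive accuracy while generating explanations consistent with the predictions. The framework trains in stages (rationale generation first, then joint training on prediction and rationale generation) and applies scheduled sampling to mitigate exposure bias and task separation.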

Details

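Motivation: The paper addresses a key challenge for large language models in clinical decision support systems: generating explanations aligned with the predictions while maintaining high predictive accuracy. Existing methods suffer from exposure bias, which leads to mismatches between explanations and predictions.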

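Result: Evaluated on three medical datasets (a proprietary triage dataset and public biomedical QA datasets), Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). On the triage task it is robust across LLM-generated, nurse-authored, and nurse-post-processed rationales.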

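Insight: The innovation lies in the two-stage training framework and the use of scheduled sampling, which mitigate exposure bias and task separation in self-rationalization. The key insight is that Stage-1 pretraining on LLM-generated rationales alone is effective, reducing reliance on human annotation, and that the framework achieves strong performance with models 40x smaller than contemporary foundation models, making explainable clinical reasoning more feasible in resource-constrained settings.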

Abstract: Despite the wide adoption of Large Language Models (LLMs), clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
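
The Stage-2 transition from gold labels to model predictions can be illustrated with a small sketch of scheduled sampling. The linear decay schedule, the function names `gold_probability` and `choose_conditioning_label`, and the example label strings are assumptions made for illustration; the abstract does not state which schedule Reason2Decide uses.

```python
import random


def gold_probability(step: int, total_steps: int) -> float:
    """Probability of conditioning on the gold label at a training step.

    Assumption: linear decay from 1.0 to 0.0 over training; the schedule
    actually used in the paper is not given in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def choose_conditioning_label(gold: str, predicted: str, step: int,
                              total_steps: int, rng: random.Random) -> str:
    """Pick the label that rationale generation is conditioned on."""
    # Early in training the gold label is almost always used; later the
    # model's own prediction takes over.
    if rng.random() < gold_probability(step, total_steps):
        return gold
    return predicted


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 1000):
        label = choose_conditioning_label("urgent", "routine", step, 1000, rng)
        print(step, gold_probability(step, 1000), label)
```

With this schedule the first step always conditions on the gold label and the last step always conditions on the model's prediction, which is the gradual hand-over the abstract describes.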

[66] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems cs.AI | cs.CL | cs.CV

YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang

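TL;DR: This paper proposes a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript code from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support it, the study builds the first large-scale dataset for generative digital twins, with over 120,000 prompt-sketch-code triplets, and proposes three new evaluation metrics (SVR, PMR, ESR) to assess model performance comprehensively.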

Details

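Motivation: The work addresses how to combine visual layout sketches with natural-language descriptions to automatically generate executable code for industrial simulation systems, realizing generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation.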

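Result: Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness.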

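Insight: The innovation is the first VLSM framework that unifies visual and language understanding to generate executable simulation code, together with a large-scale multimodal dataset and task-specific evaluation metrics, laying a foundation for generative digital twins.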

Abstract: We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

[67] Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent cs.AI | cs.CL | cs.HC

Humza Nusrat, Luke Francisco, Bing Luo, Hassan Bagher-Ebadian, Joshua Kim

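TL;DR: This paper presents SAGE, a large language model (LLM)-based agent for automated stereotactic radiosurgery (SRS) planning, and compares its reasoning and non-reasoning variants on 41 patients with brain metastases. The reasoning model matches human planners on the primary dosimetric endpoints while reducing cochlear dose, and its chain-of-thought reasoning exhibits systematic planning behaviors such as prospective constraint verification and trade-off deliberation, yielding auditable optimization traces for transparent automated planning.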

Details

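Motivation: Black-box AI systems have seen limited clinical adoption in SRS planning because of their opacity. The paper investigates whether chain-of-thought reasoning improves the performance and interpretability of LLM-based agents in automated planning.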

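Result: In a retrospective cohort of 41 patients, the reasoning model was comparable to human planners on the primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below the human baseline (p = 0.022). When prompted to improve conformity, the reasoning model showed systematic planning behaviors, including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), whereas the non-reasoning model showed almost none of these processes.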

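Insight: The innovation is integrating chain-of-thought reasoning into an LLM agent so that, in automated SRS planning, it not only matches human performance but also produces an interpretable, auditable planning process (including constraint verification and causal explanation), offering a path toward transparency for black-box AI in the clinic.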

Abstract: Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.

[68] LongVideoAgent: Multi-Agent Reasoning with Long Videos cs.AI | cs.CV | cs.LG | cs.MA

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi

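TL;DR: This paper proposes LongVideoAgent, a multi-agent framework for question answering over hour-long videos. A master LLM coordinates a grounding agent that localizes question-relevant segments and a vision agent that extracts targeted textual observations, and the master agent is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation.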

Details

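Motivation: Existing approaches to long-video QA often compress content into lossy summaries or rely on limited toolsets, which weakens temporal grounding and misses fine-grained cues. The paper targets imprecise temporal localization and underuse of fine-grained visual information in long-video reasoning.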

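Result: On LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, the multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens the trained agent's reasoning and planning.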

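Insight: The innovation is a multi-agent framework with a clear division of labor coordinated by a master LLM. Separating grounding from visual extraction while having the two cooperate, and optimizing planning with reinforcement learning, gives more precise temporal localization and richer use of fine-grained information, and yields interpretable reasoning trajectories.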

Abstract: Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.

cs.RO

[69] KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System cs.RO | cs.AI | cs.CV

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

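TL;DR: KnowVal is a knowledge-augmented, value-guided autonomous driving system. It enables visual-language reasoning by integrating open-world perception with knowledge retrieval, and uses a value model trained on a human-preference dataset to guide interpretable, value-aligned trajectory assessment.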

Details

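Motivation: Existing autonomous driving methods rely mainly on data-driven learning and struggle to capture the complex logic behind decisions through imitation or limited reinforcement rewards, so driving knowledge, visual-language reasoning, and value alignment need to be combined to improve system performance.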

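Result: Experiments show that the method substantially improves planning performance, achieving the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.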

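Insight: The innovations are a comprehensive driving knowledge graph (encoding traffic laws, defensive driving principles, and ethical norms) paired with an efficient LLM-based retrieval mechanism, plus a human-preference dataset and a Value Model for value-aligned trajectory assessment. These knowledge-augmentation and value-guidance mechanisms can improve the interpretability and safety of autonomous driving systems.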

Abstract: Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.

cs.LG

[70] Brain-Grounded Axes for Reading and Steering LLM States cs.LG | cs.AI | cs.CL

Sandro Andric

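TL;DR: The paper proposes constructing coordinate axes from human brain activity in order to read and steer the internal states of large language models (LLMs). By analyzing MEG data, it extracts latent brain axes related to lexical properties (such as word frequency) and to the function/content distinction, then trains lightweight adapters that map LLM hidden states onto these axes, enabling interpretable analysis and targeted steering of LLM behavior.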

Details

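Motivation: Existing LLM interpretability methods usually derive directions from textual supervision and lack an external, objective reference. The paper uses human brain activity as that external reference, building a neurophysiology-grounded coordinate system for reading and manipulating LLM internal representations more reliably.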

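Result: The brain axes were validated for stability on the SMN4Lang MEG dataset and tested on models including TinyLlama, Qwen2-0.5B, and GPT-2. The brain-axis approach robustly recovers a frequency-linked lexical axis and a function/content axis, and in a perplexity-matched comparison against a text probe its steering produces larger word-frequency shifts at lower perplexity. The axis structure also remains stable under different embeddings such as word2vec (|r| = 0.64-0.95 across matched axes).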

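Insight: The innovation is to use human brain activity as an external reference coordinate system, rather than a training signal, for reading and steering LLM states. This offers a new neurophysiology-grounded interpretability interface that may reduce the circularity of text-supervised methods and provides more objective, interpretable handles for controlling LLM behavior.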

Abstract: Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.

[71] Learning to Reason in LLMs by Expectation Maximization cs.LG | cs.CL | stat.ML

Junghyun Lee, Branislav Kveton, Sunav Choudhary, Subhojyoti Mukherjee, Anup Rao

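TL;DR: The paper proposes learning to reason in LLMs within an expectation-maximization (EM) framework, formalizing reasoning as a latent variable model and comparing several sampling schemes (rejection sampling with a budget, STaR, and PPS) on the ARC, MMLU, and OpenBookQA datasets.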

Details

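Motivation: LLMs may produce incorrect or irrelevant content when generating rationales, which harms the accuracy of the final answer. The paper aims to improve reasoning by optimizing the sampling distribution over rationales.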

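Result: With Llama and Qwen models, experiments show that the sampling scheme significantly affects the accuracy of the learned reasoning model; PPS, which keeps only the rationalization stage of STaR, outperforms the other sampling schemes across multiple datasets.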

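Insight: Modeling reasoning as a latent variable and optimizing it with EM connects EM to reward-based optimization methods. The strong results of the simple PPS design underline how much the design of the sampling distribution matters for improving LLM reasoning.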

Abstract: Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
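
Of the sampling schemes named in the abstract, rejection sampling with a budget is the simplest to sketch: draw rationales until one leads to the correct answer or the budget runs out. The `Sampler` type, the function names, and the toy sampler below are hypothetical stand-ins for an LLM call, not the authors' implementation.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical sampler type: maps a question to a (rationale, answer) pair.
Sampler = Callable[[str], Tuple[str, str]]


def rejection_sample_rationale(question: str, gold_answer: str,
                               sampler: Sampler, budget: int) -> Optional[str]:
    """Return the first sampled rationale whose answer matches gold_answer.

    Draws at most `budget` samples and returns None if none is accepted.
    """
    for _ in range(budget):
        rationale, answer = sampler(question)
        if answer == gold_answer:
            return rationale
    return None


def make_toy_sampler(rng: random.Random) -> Sampler:
    """Stand-in for an LLM: answers '4' or '5' at random (illustration only)."""
    def sampler(question: str) -> Tuple[str, str]:
        answer = rng.choice(["4", "5"])
        return f"Working through {question} gives {answer}.", answer
    return sampler


if __name__ == "__main__":
    sampler = make_toy_sampler(random.Random(0))
    print(rejection_sample_rationale("2 + 2", "4", sampler, budget=8))
```

In an EM-style loop, accepted rationales would presumably serve as training data for the next model update, while questions that exhaust the budget contribute nothing in that round.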

[72] Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion cs.LG | cs.CV | eess.IV

Xuanyu Hu

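TL;DR: This paper proposes BrainROI, a unified multimodal brain decoding model that uses cross-subject soft-ROI fusion to reconstruct semantic information consistent with visual stimuli from brain activity signals such as fMRI and to generate readable natural-language descriptions. The model achieves leading results on the brain-captioning task on the NSD dataset, with clear gains in cross-subject generalization and interpretability.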

Details

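Motivation: The work addresses key challenges of multimodal brain decoding in cross-subject generalization and interpretability, including the heterogeneity of functional brain topology across subjects and the limited stability and transparency of manual or black-box prompting methods.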

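Result: In brain-captioning evaluation on the NSD dataset under the cross-subject setting, metrics such as BLEU-4 and CIDEr improve clearly over recent state-of-the-art methods and representative baselines, reaching leading-level results.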

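Insight: The innovations include: a new fMRI encoder that uses multi-atlas soft functional parcellations (soft-ROI) as a shared space, extends the discrete ROI concatenation strategy to a voxel-wise gated fusion mechanism (Voxel-gate), and ensures consistent ROI mapping through global label alignment to improve cross-subject transferability; an interpretable prompt optimization process that uses a locally deployed Qwen model in a small-sample closed loop to iteratively generate and select human-readable prompts, which stabilizes prompt design and preserves an auditable optimization trajectory; and parameterized decoding constraints at inference time that further improve the stability and quality of the generated descriptions.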

Abstract: Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.

[73] Field-Space Attention for Structure-Preserving Earth System Transformers cs.LG | cs.CV | math-ph

Maximilian Witte, Johannes Meuer, Étienne Plésiat, Christopher Kadow

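TL;DR: This paper proposes Field-Space attention, a mechanism for Earth system Transformers that computes attention directly in the physical domain rather than in a learned latent space. Intermediate representations are kept as continuous fields on the sphere, which yields interpretable internal states and makes scientific constraints easier to enforce. The model uses a fixed, non-learned multiscale decomposition and learns structure-preserving deformations of the input field, coherently integrating coarse- and fine-scale information while avoiding the optimization instabilities of standard single-scale Vision Transformers. On global temperature super-resolution on a HEALPix grid, it converges faster and more stably than conventional Vision Transformer and U-Net baselines with fewer parameters.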

Details

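Motivation: Accurate, physically consistent modeling of Earth system dynamics requires machine-learning architectures that operate directly on continuous geophysical fields and preserve their underlying geometric structure.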

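Result: On global temperature super-resolution on a HEALPix grid, Field-Space Transformers converge faster and more stably than conventional Vision Transformer and U-Net baselines while requiring fewer parameters, with improved fidelity and reliability.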

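Insight: The innovation is Field-Space attention, which computes attention in the physical domain. By keeping continuous field representations and a fixed multiscale decomposition, it makes the model interpretable, eases the embedding of physical constraints, and stabilizes training, providing a compact, interpretable, and physically grounded building block for data-driven Earth system modeling.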

Abstract: Accurate and physically consistent modeling of Earth system dynamics requires machine-learning architectures that operate directly on continuous geophysical fields and preserve their underlying geometric structure. Here we introduce Field-Space attention, a mechanism for Earth system Transformers that computes attention in the physical domain rather than in a learned latent space. By maintaining all intermediate representations as continuous fields on the sphere, the architecture enables interpretable internal states and facilitates the enforcement of scientific constraints. The model employs a fixed, non-learned multiscale decomposition and learns structure-preserving deformations of the input field, allowing coherent integration of coarse and fine-scale information while avoiding the optimization instabilities characteristic of standard single-scale Vision Transformers. Applied to global temperature super-resolution on a HEALPix grid, Field-Space Transformers converge more rapidly and stably than conventional Vision Transformers and U-Net baselines, while requiring substantially fewer parameters. The explicit preservation of field structure throughout the network allows physical and statistical priors to be embedded directly into the architecture, yielding improved fidelity and reliability in data-driven Earth system modeling. These results position Field-Space Attention as a compact, interpretable, and physically grounded building block for next-generation Earth system prediction and generative modeling frameworks.

eess.IV

[74] Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI eess.IV | cs.AI | cs.CV

Muhammad Usman, Azka Rehman, Muhammad Mutti Ur Rehman, Abd Ur Rehman, Muhammad Umar Farooq

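TL;DR: The paper proposes a multimodal learning approach based on a dual-encoder TransUNet architecture for segmenting ischemic stroke lesions from diffusion MRI. It benchmarks several convolutional and transformer-based models on the ISLES 2022 dataset and finds that transformer-based models outperform convolutional baselines; the dual-encoder TransUNet, which combines complementary information from the DWI and ADC modalities with spatial context from adjacent slices, performs best, reaching a Dice score of 85.4% on the test set.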

Details

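Motivation: Accurate segmentation of ischemic stroke lesions is essential for clinical decision-making, but lesion appearance varies across the DWI and ADC modalities of diffusion MRI, which makes automated segmentation challenging. The paper aims to exploit the complementary information in multimodal diffusion MRI to improve segmentation accuracy.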

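Result: On the ISLES 2022 dataset, transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance with a Dice score of 85.4% on the test set.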

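Insight: The innovations are a dual-encoder TransUNet architecture that learns modality-specific representations and integrates the complementary information in DWI and ADC, and a three-slice input configuration that adds spatial context. Together they offer an effective design pattern for multimodal medical image segmentation.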

Abstract: Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance. In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration. All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.
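
For readers unfamiliar with the evaluation metric, the Dice Similarity Coefficient for two binary masks is twice the overlap divided by the total number of foreground elements in both masks. The sketch below works on flattened masks; the function name and the convention that two empty masks score 1.0 are assumptions, and the paper's exact evaluation protocol (per-slice or per-volume, averaging) is not given in the abstract.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice Similarity Coefficient, 2 * |A and B| / (|A| + |B|), for flat binary masks.

    Assumption: two empty masks count as a perfect match (1.0); other
    conventions exist.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count positions that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Count foreground elements in each mask separately.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice_coefficient(predicted, reference), 3))
```

A reported score of 85.4% corresponds to a coefficient of 0.854 on this 0-to-1 scale.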

cs.HC

[75] Dreamcrafter: Immersive Editing of 3D Radiance Fields Through Flexible, Generative Inputs and Outputs cs.HC | cs.CV

Cyrus Vachha, Yixiao Kang, Zach Dive, Ashwat Chidambaram, Anik Gupta

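TL;DR: Dreamcrafter is a VR-based 3D scene editing system that integrates generative AI algorithms into real-time, immersive editing of 3D radiance fields (such as NeRFs and 3D Gaussian Splatting). It lowers the barrier to authoring and supports creativity through a modular architecture, multiple forms of control (such as natural language and direct manipulation), and proxy representations.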

Details

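Motivation: The work aims to lower existing barriers to 3D scene authoring by unifying the strengths of immersive direct manipulation and AI-based editing at a higher level of abstraction, overcoming the latter's high latency, and integrating generative AI advances into real-time, immersive 3D radiance field editing.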

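Result: The abstract reports no specific quantitative benchmark results or state-of-the-art comparisons; the paper contributes empirical findings on control preferences and discusses how generative AI interfaces enhance creativity in scene editing.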

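Insight: The innovations include a modular architecture that integrates multiple generative AI algorithms, multiple levels of control that combine natural language and direct manipulation, and proxy representations that support interaction during high-latency operations, together providing a flexible framework for real-time immersive 3D content creation.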

Abstract: Authoring 3D scenes is a central task for spatial computing applications. Competing visions for lowering existing barriers are to (1) focus on immersive, direct manipulation of 3D content or (2) leverage AI techniques that capture real scenes (3D Radiance Fields such as NeRFs and 3D Gaussian Splatting) and modify them at a higher level of abstraction, at the cost of high latency. We unify the complementary strengths of these approaches and investigate how to integrate generative AI advances into real-time, immersive 3D Radiance Field editing. We introduce Dreamcrafter, a VR-based 3D scene editing system that: (1) provides a modular architecture to integrate generative AI algorithms; (2) combines different levels of control for creating objects, including natural language and direct manipulation; and (3) introduces proxy representations that support interaction during high-latency operations. We contribute empirical findings on control preferences and discuss how generative AI interfaces beyond text input enhance creativity in scene editing and world building.

eess.AS

[76] SAM Audio: Segment Anything in Audio eess.AS | cs.CV

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu

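TL;DR: SAM Audio is a general-purpose foundation model for audio separation. It supports text, visual, and temporal-span prompts within a unified framework, flexibly separates target sources in speech, music, and general sounds, and achieves state-of-the-art performance on multiple benchmarks.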

Details

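Motivation: Existing audio separation models are either domain-specific (for example, limited to speech or music) or limited in controllability (for example, supporting only text prompts), so a unified multimodal prompting framework is needed to improve generality and flexibility.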

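Result: Across a diverse suite of benchmarks covering general sound, speech, music, and musical instrument separation, on both in-the-wild and professionally produced audio, SAM Audio achieves state-of-the-art performance and substantially outperforms prior general-purpose and specialized systems.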

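Insight: The innovation is to unify text, visual, and temporal-span prompting in a single diffusion transformer architecture trained with flow matching on large-scale audio data, which improves generality and controllability. The paper also introduces a real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model, making evaluation more reliable.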

Abstract: General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.

Table of Contents

cs.CL

[1] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data cs.CL | cs.AI

[2] Counterfactual LLM-based Framework for Measuring Rhetorical Style cs.CL | cs.CY

[3] PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation cs.CL

[4] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems cs.CL | cs.HC

[5] Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models cs.CL | cs.AI | cs.LG

[6] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents cs.CL

[7] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language cs.CL | cs.AI | cs.LG

[8] M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation cs.CL | cs.AI

[9] Multi-hop Reasoning via Early Knowledge Alignment cs.CL

[10] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models cs.CL | cs.AI | cs.CV | cs.IR | cs.LG

[11] Fun-Audio-Chat Technical Report cs.CL | cs.AI | cs.SD | eess.AS

[12] FaithLens: Detecting and Explaining Faithfulness Hallucination cs.CL | cs.AI

[13] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers cs.CL | cs.AI | cs.MM

[14] AprielGuard cs.CL

[15] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision cs.CL | cs.SD | eess.AS

[16] Can LLMs Solve My Grandma’s Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles cs.CL

[17] Step-DeepResearch Technical Report cs.CL

[18] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits cs.CL

[19] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs cs.CL | cs.AI | cs.CV

cs.CV

[20] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility cs.CV | cs.AI | cs.CR | cs.LG

[21] Generating the Past, Present and Future from a Motion-Blurred Image cs.CV | cs.GR

[22] Learning to Refocus with Video Diffusion Models cs.CV

[23] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction cs.CV

[24] Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs cs.CV

[25] Vehicle-centric Perception via Multimodal Structured Pre-training cs.CV | cs.AI | cs.LG

[26] Block-Recurrent Dynamics in Vision Transformers cs.CV | cs.AI | cs.LG

[27] SE360: Semantic Edit in 360$^\circ$ Panoramas via Hierarchical Data Construction cs.CV

[28] How Much 3D Do Video Foundation Models Encode? cs.CV | cs.AI

[29] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping cs.CV

[30] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models cs.CV

[31] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images cs.CV

[32] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments cs.CV

[33] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis cs.CV

[34] VALLR-Pin: Dual-Decoding Visual Speech Recognition for Mandarin with Pinyin-Guided LLM Refinement cs.CV

[35] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs cs.CV

[36] Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval cs.CV | cs.AI

[37] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark cs.CV | cs.CL | cs.IRPDF

[38] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation cs.CV | cs.SD | eess.ASPDF

[39] HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer cs.CVPDF

[40] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion cs.CVPDF

[41] AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model cs.CVPDF

[42] Generative Latent Coding for Ultra-Low Bitrate Image Compression cs.CV | eess.IVPDF

[43] LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation cs.CVPDF

[44] BiCoR-Seg: Bidirectional Co-Refinement Framework for High-Resolution Remote Sensing Image Segmentation cs.CVPDF

[45] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation cs.CVPDF

[46] ${D}^{3}${ETOR}: ${D}$ebate-Enhanced Pseudo Labeling and Frequency-Aware Progressive ${D}$ebiasing for Weakly-Supervised Camouflaged Object ${D}$etection with Scribble Annotations cs.CV | cs.AIPDF

[47] UbiQVision: Quantifying Uncertainty in XAI for Image Recognition cs.CV | cs.AIPDF

[48] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation cs.CV | cs.AI | eess.AS | eess.IVPDF

[49] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection cs.CVPDF

[50] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation cs.CVPDF

[51] DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning cs.CV | cs.AIPDF

[52] Chain-of-Anomaly Thoughts with Large Vision-Language Models cs.CV | cs.MAPDF

[53] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding cs.CVPDF

[54] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images cs.CVPDF

[55] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition cs.CVPDF

[56] SirenPose: Dynamic Scene Reconstruction via Geometric Supervision cs.CVPDF

[57] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios cs.CVPDF

[58] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models cs.CVPDF

[59] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models cs.CVPDF

[60] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving cs.CV | cs.AI | cs.LG | cs.ROPDF

[61] Repurposing Video Diffusion Transformers for Robust Point Tracking cs.CVPDF

[62] Active Intelligence in Video Avatars via Closed-loop World Modeling cs.CVPDF

[63] SpatialTree: How Spatial Abilities Branch Out in MLLMs cs.CVPDF

[64] SemanticGen: Video Generation in Semantic Space cs.CVPDF

cs.AI [Back]

[65] Reason2Decide: Rationale-Driven Multi-Task Learning cs.AI | cs.CLPDF

[66] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems cs.AI | cs.CL | cs.CVPDF

[67] Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent cs.AI | cs.CL | cs.HCPDF

[68] LongVideoAgent: Multi-Agent Reasoning with Long Videos cs.AI | cs.CV | cs.LG | cs.MAPDF

cs.RO [Back]

[69] KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System cs.RO | cs.AI | cs.CVPDF

cs.LG [Back]

[70] Brain-Grounded Axes for Reading and Steering LLM States cs.LG | cs.AI | cs.CLPDF

[71] Learning to Reason in LLMs by Expectation Maximization cs.LG | cs.CL | stat.MLPDF

[72] Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion cs.LG | cs.CV | eess.IVPDF

[73] Field-Space Attention for Structure-Preserving Earth System Transformers cs.LG | cs.CV | math-phPDF

eess.IV [Back]