cs.CL [Total: 14]
cs.CV [Total: 95]
cs.AI [Total: 4]
cs.RO [Total: 4]
cs.LG [Total: 2]
cs.AR [Total: 1]
cs.IR [Total: 1]
cs.SD [Total: 3]
cs.CR [Total: 1]

cs.CL

[1] ConFu: Contemplate the Future for Better Speculative Sampling cs.CL

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee

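TL;DR: This paper proposes ConFu, a speculative decoding framework that lets the draft model anticipate the future direction of generation, addressing the error accumulation that arises when existing methods condition only on the current prefix. It introduces contemplate tokens and soft prompts so that the draft model can use future-oriented signals from the target model at negligible cost, a dynamic contemplate token mechanism with MoE for context-aware future prediction, and a training framework with anchor token sampling and future prediction replication.

Details

Motivation: Existing speculative decoding methods such as the EAGLE series achieve state-of-the-art speedup, but their draft models predict from the current prefix alone. Their predictions therefore drift from the target model as the number of generation steps grows, and errors accumulate. The paper aims to improve draft quality by letting the draft model "contemplate the future".

Result: Across various downstream tasks with Llama-3 3B and 8B models, ConFu improves token acceptance rate and generation speed over EAGLE-3 by 8–11%.

Insight: The main contribution is what the authors believe to be the first combination of speculative decoding with continuous reasoning tokens (contemplate tokens). Future-oriented signals and a dynamic mechanism give the draft model context-aware future prediction, so it matches the target model more closely, which opens a new direction for accelerating LLM inference.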

Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
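For readers new to the draft-then-verify loop that this abstract builds on, the following is a minimal sketch of one generic speculative decoding step with greedy acceptance. It is not ConFu or EAGLE: the function name `speculative_step` and the callables `draft_next` and `target_next` are hypothetical stand-ins for real models, and the verification is written as sequential calls for clarity, whereas real systems verify all proposed tokens in one batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: given a token prefix, return the greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token],
                     draft_next: NextTokenFn,
                     target_next: NextTokenFn,
                     k: int) -> List[Token]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks the proposal position by position.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models (assumptions for illustration): the target counts upward,
# and the draft agrees except right after a token equal to 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

In this toy setup the draft diverges at the fourth proposed token, so later draft tokens would be wasted work; the error-accumulation problem described above is the observation that such divergence becomes more likely the further the draft runs ahead.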

[2] SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation cs.CL

Hexuan Wang, Yaxuan Ren, Srikar Bommireddypalli, Shuxian Chen, Adarsh Prabhudesai

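TL;DR: This paper introduces SciTaRC, an expert-authored question-answering benchmark over tabular data in scientific papers that requires both deep language reasoning and complex computation. Current state-of-the-art models fail on at least 23% of the questions, and Llama-3.3-70B-Instruct fails on 65.5%, exposing a universal “execution bottleneck” when models carry out plans.

Details

Motivation: Current AI models underperform on scientific tabular data, where deep language understanding must be combined with complex computation. The benchmark is meant to evaluate and advance model reasoning in the scientific domain.

Result: On SciTaRC, current SOTA models fail on at least 23% of questions, and Llama-3.3-70B-Instruct fails on 65.5%, showing that even highly capable open-weight models leave a significant gap.

Insight: The contribution is an expert-level benchmark focused on scientific table reasoning, together with the identification of the “execution bottleneck”: code-based methods are brittle on raw scientific tables, while natural language reasoning fails mainly through initial comprehension issues and calculation errors. This points to execution ability as the area to improve.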

Abstract: We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal “execution bottleneck”: both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.

[3] Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning cs.CL

Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown

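TL;DR: This paper proposes a confidence-based adaptive decision framework for making chain-of-thought (CoT) reasoning in large language models (LLMs) more efficient. By analyzing a single completed reasoning trajectory, it adaptively chooses between single-path and multi-path (self-consistency) reasoning, keeping accuracy comparable to multi-path baselines while substantially reducing inference compute.

Details

Motivation: CoT reasoning in LLMs often produces overly long reasoning paths with high inference cost, and existing self-consistency methods add substantial overhead by sampling multiple trajectories.

Result: The framework is trained on MedQA and generalizes to MathQA, MedMCQA, and MMLU without fine-tuning. It maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens.

Insight: The paper shows that the reasoning trajectory itself carries rich signals for uncertainty estimation, and uses them to build a simple, transferable mechanism for balancing accuracy and efficiency. The core of the method is a confidence-aware decision maker trained on sentence-level numeric and linguistic features extracted from intermediate reasoning states.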

Abstract: Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
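The single-path versus multi-path decision described in this abstract can be sketched as a simple gate. This is a minimal illustration under stated assumptions, not the paper's implementation: `answer_with_gate`, `sample_answer`, `confidence`, the threshold of 0.8, and the path count of 5 are all hypothetical, and the paper's confidence estimator (trained on sentence-level features) is replaced here by an arbitrary callable.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical interfaces: a sampler returns (answer, reasoning_trace),
# and a confidence function scores one completed trace in [0, 1].
Sampler = Callable[[str], Tuple[str, str]]
ConfidenceFn = Callable[[str], float]


def answer_with_gate(question: str,
                     sample_answer: Sampler,
                     confidence: ConfidenceFn,
                     threshold: float = 0.8,
                     n_paths: int = 5) -> Tuple[str, int]:
    """Return (answer, number_of_sampled_paths)."""
    # Always start with one completed reasoning trajectory.
    first_answer, trace = sample_answer(question)

    # Confident enough: stop after a single path.
    if confidence(trace) >= threshold:
        return first_answer, 1

    # Otherwise fall back to self-consistency: sample more paths
    # and take a majority vote over the final answers.
    answers = [first_answer]
    for _ in range(n_paths - 1):
        answer, _ = sample_answer(question)
        answers.append(answer)
    majority = Counter(answers).most_common(1)[0][0]
    return majority, n_paths
```

The token savings come from the first branch: every question whose first trace clears the threshold costs one path instead of `n_paths`.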

[4] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs cs.CL | cs.CV

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang

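TL;DR: This paper systematically studies the “modality gap” in multimodal large language models (MLLMs): they understand text rendered in images less well than the same text given as tokens. Evaluating seven MLLMs on seven benchmarks in five input modes, the authors find that the gap is task- and data-dependent and that rendering choices such as font and resolution are strong confounds. Error analysis shows that image mode mainly amplifies reading errors (calculation and formatting failures) while knowledge and reasoning errors stay largely unchanged. The authors propose a self-distillation method that trains the model on its own text-only reasoning traces paired with image inputs, which substantially raises image-mode accuracy on math tasks.

Details

Motivation: MLLMs perform markedly worse on text presented as images than on plain text. The paper aims to understand this modality gap and improve visual text understanding.

Result: On GSM8K, image-mode accuracy rises from 30.71% to 92.72%, and the method transfers to unseen benchmarks without catastrophic forgetting. The evaluation covers seven MLLMs on seven benchmarks, spanning synthetically rendered text and realistic document images such as arXiv PDFs and Wikipedia pages.

Insight: The modality gap is task- and data-dependent, and rendering details such as font have a large effect on performance. Image input mainly amplifies reading errors rather than reasoning errors. Self-distillation, which uses the model's own text reasoning ability to train on visual input, is a practical way to narrow the gap.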

Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this “modality gap” by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.

[5] DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval cs.CL

Taegyeong Lee, Jiwon Park, Seunghyun Hwang, JooYoung Jang

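TL;DR: This paper proposes Direct Embedding Optimization (DEO), a training-free method for text and multimodal retrieval with negation and exclusion queries. It decomposes a query into positive and negative components and optimizes the query embedding with a contrastive objective, improving retrieval without additional training data or model updates.

Details

Motivation: Existing retrieval methods, particularly in the context of LLMs and retrieval-augmented generation, struggle with negation and exclusion queries. Prior solutions rely on embedding adaptation or fine-tuning, which adds computational cost and deployment complexity.

Result: On the NegConstraint benchmark, DEO outperforms baselines by +0.0738 nDCG@10 and +0.1028 MAP@100. In multimodal retrieval, it improves Recall@5 by 6% over OpenAI CLIP.

Insight: The contribution is a lightweight, training-free approach that adjusts the embedding directly through query decomposition and contrastive optimization. It handles negation-aware retrieval while remaining simple to deploy and cheap to compute.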

Abstract: Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have enabled diverse retrieval methods. However, existing retrieval methods often fail to accurately retrieve results for negation and exclusion queries. To address this limitation, prior approaches rely on embedding adaptation or fine-tuning, which introduce additional computational cost and deployment complexity. We propose Direct Embedding Optimization (DEO), a training-free method for negation-aware text and multimodal retrieval. DEO decomposes queries into positive and negative components and optimizes the query embedding with a contrastive objective. Without additional training data or model updates, DEO outperforms baselines on NegConstraint, with gains of +0.0738 nDCG@10 and +0.1028 MAP@100, while improving Recall@5 by +6% over OpenAI CLIP in multimodal retrieval. These results demonstrate the practicality of DEO for negation- and exclusion-aware retrieval in real-world settings.
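To make the idea of optimizing a query embedding against positive and negative components concrete, here is a minimal sketch. The abstract does not give DEO's actual objective, so everything below is an assumption for illustration: the objective (mean dot product with positive components minus a weighted mean dot product with negative components), the projected gradient ascent on the unit sphere, and the names `optimize_query`, `lam`, `lr`, and `steps`.

```python
import math
from typing import List, Sequence

Vec = List[float]


def normalize(v: Sequence[float]) -> Vec:
    """Scale v to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: Sequence[Sequence[float]]) -> Vec:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def optimize_query(q0: Sequence[float],
                   positives: Sequence[Sequence[float]],
                   negatives: Sequence[Sequence[float]],
                   lam: float = 1.0,
                   lr: float = 0.1,
                   steps: int = 20) -> Vec:
    """Move the query embedding toward positives and away from negatives.

    Assumed objective: J(q) = mean_i q.p_i - lam * mean_j q.n_j, with q
    kept on the unit sphere by renormalizing after every ascent step.
    """
    q = normalize(q0)
    # J is linear in q, so its gradient is constant across steps.
    pos_mean, neg_mean = mean(positives), mean(negatives)
    grad = [p - lam * n for p, n in zip(pos_mean, neg_mean)]
    for _ in range(steps):
        q = normalize([qi + lr * gi for qi, gi in zip(q, grad)])
    return q
```

Only the query vector changes; the encoder and the document index stay fixed, which is what makes an approach of this kind training-free.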

[6] Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing cs.CL | cs.AI | cs.LG

Benjamin Reichman, Adar Avasian, Samuel Webster, Larry Heck

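TL;DR: This paper studies emotion as a latent factor that shapes how large language models attend to and reason over text, rather than as a classification label. The authors analyze how emotion systematically alters attention geometry in transformer models, and introduce AURA-QA, an emotionally balanced question-answering dataset, along with an emotional regularization training framework.

Details

Motivation: Prior work usually treats emotion as a prediction target, as in sentiment analysis, and overlooks its effect on model reasoning as a source of representational variation. The paper explores how emotion, as a latent factor, shapes the way models process text.

Result: Experiments across multiple QA benchmarks show that the proposed emotional regularization improves reading comprehension on both emotionally varying and non-emotionally varying datasets, with consistent gains under distribution shift and in-domain improvements on several benchmarks.

Insight: The contribution is to treat emotion as a latent variable that affects the model's internal attention mechanism, and to design the emotionally balanced AURA-QA dataset and a regularization framework that constrains emotion-conditioned representational drift during training. This offers a new perspective on how models handle emotional information.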

Abstract: Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.

[7] SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models cs.CL | eess.AS

Hsiao-Ying Huang, Cheng-Han Chiang, Hung-yi Lee

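TL;DR: This paper proposes SPAR-K, a modality-aware early exit framework for interleaved spoken language models (SLMs). It accelerates inference while preserving perceptual quality by letting most speech positions exit at a fixed intermediate layer and periodically running full-depth “refresh” steps.

Details

Motivation: Interleaved SLMs alternately generate text and speech tokens, but decoding every step at full transformer depth is costly, especially for long speech sequences. A method is needed that speeds up inference while preserving output quality.

Result: Evaluated with Step-Audio-2-mini and GLM-4-Voice on four datasets spanning reasoning, factual QA, and dialogue tasks, SPAR-K reduces average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice with a maximum accuracy drop of 0.82%, negligible changes in MOS and WER, and no auxiliary computation overhead.

Insight: The contribution is a speech alternating-depth schedule with periodic refresh, designed around the distinct statistical nature of speech tokens, instead of reusing the confidence-based early exit strategies common in text LLMs, which the paper shows to be suboptimal for SLMs. This illustrates the value of tailoring inference acceleration to each modality.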

Abstract: Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth “refresh” steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
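The alternating-depth schedule described in this abstract can be sketched as follows. This is an illustration under assumptions rather than the paper's configuration: the function names, the choice to place the full-depth refresh on every position whose index is a multiple of `period`, and the layer counts in the example are all hypothetical.

```python
from typing import List


def depth_schedule(num_speech_positions: int,
                   total_layers: int,
                   exit_layer: int,
                   period: int) -> List[int]:
    """Number of layers run at each speech position.

    Assumption: every `period`-th position (starting at index 0) is a
    full-depth refresh step; all other positions exit at `exit_layer`.
    """
    return [total_layers if i % period == 0 else exit_layer
            for i in range(num_speech_positions)]


def average_depth_reduction(schedule: List[int], total_layers: int) -> float:
    """Fraction of layer computations saved relative to full depth."""
    return 1.0 - sum(schedule) / (len(schedule) * total_layers)


# Hypothetical example: a 32-layer model, early exit at layer 24,
# and a full-depth refresh on every 4th speech position.
example = depth_schedule(8, total_layers=32, exit_layer=24, period=4)
```

Because the schedule depends only on the position index, it needs no per-token confidence computation, which is consistent with the abstract's claim of no auxiliary computation overhead.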

[8] TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation cs.CL | cs.AI

Jiashuo Sun, Yixuan Xie, Jimeng Shi, Shaowen Wang, Jiawei Han

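TL;DR: This paper proposes TaSR-RAG, a taxonomy-guided structured reasoning framework for retrieval-augmented generation (RAG). It represents queries and documents as relational triples and constrains entity semantics with a lightweight two-level taxonomy to balance generalization and precision. A complex question is decomposed into an ordered sequence of triple sub-queries, and evidence is selected step by step through hybrid triple matching that combines semantic similarity over raw triples with structural consistency over typed triples.

Details

Motivation: Existing RAG systems have limitations on knowledge-intensive and time-sensitive questions: retrieving unstructured chunks yields redundant context and low information density, and one-shot generation makes multi-hop reasoning brittle. Existing structured RAG methods typically require costly, error-prone graph construction, or impose rigid entity-centric structures that do not align with the query's reasoning chain.

Result: On multiple multi-hop question answering benchmarks, TaSR-RAG consistently outperforms strong RAG and structured-RAG baselines by up to 14%, while producing clearer evidence attribution and more faithful reasoning traces.

Insight: The contribution is a unified triple representation of queries and documents with a lightweight taxonomy constraining entity semantics. An explicit entity binding table maintained across steps resolves intermediate variables and reduces entity conflation without explicit graph construction or exhaustive search. The method balances structured reasoning with flexibility and improves the robustness and interpretability of multi-hop reasoning.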

Abstract: Retrieval-Augmented Generation (RAG) helps large language models (LLMs) answer knowledge-intensive and time-sensitive questions by conditioning generation on external evidence. However, most RAG systems still retrieve unstructured chunks and rely on one-shot generation, which often yields redundant context, low information density, and brittle multi-hop reasoning. While structured RAG pipelines can improve grounding, they typically require costly and error-prone graph construction or impose rigid entity-centric structures that do not align with the query’s reasoning chain. We propose TaSR-RAG, a taxonomy-guided structured reasoning framework for evidence selection. We represent both queries and documents as relational triples, and constrain entity semantics with a lightweight two-level taxonomy to balance generalization and precision. Given a complex question, TaSR-RAG decomposes it into an ordered sequence of triple sub-queries with explicit latent variables, then performs step-wise evidence selection via hybrid triple matching that combines semantic similarity over raw triples with structural consistency over typed triples. By maintaining an explicit entity binding table across steps, TaSR-RAG resolves intermediate variables and reduces entity conflation without explicit graph construction or exhaustive search. Experiments on multiple multi-hop question answering benchmarks show that TaSR-RAG consistently outperforms strong RAG and structured-RAG baselines by up to 14%, while producing clearer evidence attribution and more faithful reasoning traces.

[9] Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs cs.CL | cs.AI

Saugata Purkayastha, Pranav Kushare, Pragya Paramita Pal, Sukannya Purkayastha

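TL;DR: This paper reveals a skewed trade-off in current large language models between moral reasoning and commonsense understanding: models tend to prioritize moral judgment and overlook commonsense contradictions. The authors build the CoMoral benchmark, which embeds commonsense contradictions in moral dilemmas. Evaluating ten LLMs of different sizes, they find a pervasive narrative focus bias: models detect contradictions more readily when they are attributed to a secondary character than to the narrator.

Details

Motivation: LLMs deployed in real-world applications need to remain both morally grounded and knowledge-aware, yet existing models systematically prioritize moral reasoning at the expense of commonsense understanding. The paper investigates how this bias manifests and what it affects.

Result: On CoMoral, all ten LLMs of different sizes struggle to identify commonsense contradictions without a prior signal, and show a marked narrative focus bias (a detection gap between secondary-character and narrator attributions), indicating that current models lack commonsense robustness.

Insight: The contribution is a diagnostic dataset that combines moral dilemmas with commonsense contradictions, and the identification of an under-studied narrative focus bias in LLMs. The study underlines the need for enhanced reasoning-aware training to balance moral and commonsense capabilities and make LLM decisions more reliable.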

Abstract: Large Language Models (LLMs) are increasingly deployed across diverse real-world applications and user communities. As such, it is crucial that these models remain both morally grounded and knowledge-aware. In this work, we uncover a critical limitation of current LLMs – their tendency to prioritize moral reasoning over commonsense understanding. To investigate this phenomenon, we introduce CoMoral, a novel benchmark dataset containing commonsense contradictions embedded within moral dilemmas. Through extensive evaluation of ten LLMs across different model sizes, we find that existing models consistently struggle to identify such contradictions without prior signal. Furthermore, we observe a pervasive narrative focus bias, wherein LLMs more readily detect commonsense contradictions when they are attributed to a secondary character rather than the primary (narrator) character. Our comprehensive analysis underscores the need for enhanced reasoning-aware training to improve the commonsense robustness of large language models.

[10] ALARM: Audio-Language Alignment for Reasoning Models cs.CL

Petr Grinberg, Hassan Shahmohammadi

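TL;DR: This paper proposes ALARM, which uses self-rephrasing to align audio language models (ALMs) with reasoning LLMs (RLMs). It addresses a failure of the usual approach of freezing the LLM and training an adapter: in RLMs, the chain-of-thought exposes the textual surrogate input and produces unnatural responses. The method also fuses and compresses multiple audio encoders for stronger representations, and is trained on a multi-task corpus of 6 million instances.

Details

Motivation: When existing ALM training is applied to RLMs, chain-of-thought traces expose the textual surrogate input and yield unnatural responses. The paper aims to align audio and language in a way that supports more natural audio reasoning.

Result: On related audio-reasoning benchmarks, the proposed 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs. It achieves the best open-source result on MMAU-speech and MMSU and ranks third among all models, while preserving textual capabilities at low training cost.

Insight: The contributions are self-rephrasing, which converts self-generated responses into audio-understanding variants compatible with RLMs, and a fusion-and-compression strategy for multiple audio encoders. The method addresses the distribution mismatch in aligning audio with reasoning models and offers a scalable approach to multimodal reasoning.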

Abstract: Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.

[11] RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation cs.CL | cs.AI

Sihong Wu, Yiling Ma, Yilun Zhao, Tiansheng Hu, Owen Jiang

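TL;DR: This paper proposes RbtAct, which uses peer-review rebuttals as a supervision signal for generating more actionable review feedback. The authors build the RMR-75K dataset, which maps review segments to the rebuttal segments that address them, and introduce a perspective-conditioned segment-level review generation task. They train Llama-3.1-8B-Instruct with supervised fine-tuning and preference optimization to make feedback more specific and implementable.

Details

Motivation: Peer-review reports generated by current LLMs are often superficial and insufficiently actionable, leaving authors without concrete, implementable revision guidance. A method is needed that optimizes feedback generation for actionability.

Result: In experiments with human experts and LLM-as-a-judge, the method outperforms strong baselines in actionability and specificity while maintaining relevance and grounding.

Insight: The contribution is to use rebuttals as implicit supervision for directly optimizing actionability, and to propose the new perspective-conditioned segment-level review generation task. Gains come from a large aligned dataset and a two-stage training strategy.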

Abstract: Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance and motivating the gap this work addresses. We propose RbtAct, which targets actionable review feedback generation and places existing peer review rebuttal at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we leverage rebuttal as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task called perspective-conditioned segment-level review feedback generation, in which the model is required to produce a single focused comment based on the complete paper and a specified perspective such as experiments and writing. We also build a large dataset named RMR-75K that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train the Llama-3.1-8B-Instruct model with supervised fine-tuning on review segments followed by preference optimization using rebuttal derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.

[12] Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents cs.CL

Naman Gupta, Vaibhav Singh, Arun Iyer, Kirankumar Shiragur, Pratham Grover

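TL;DR: This paper studies chunk ordering in long-context reasoning. It uses Chow-Liu trees to learn a dependency structure among chunks and derives the chunk order from a breadth-first traversal, reducing information loss across agents in multi-agent reasoning and improving answer relevance and exact-match accuracy.

Details

Motivation: In sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA), long-context queries are processed chunk by chunk. The sequential dependence creates an information bottleneck and information loss that degrade final reasoning quality, so the chunk order should be chosen to minimize that loss.

Result: On three long-context benchmarks, the ordering from a breadth-first traversal of the Chow-Liu tree consistently outperforms both the default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy.

Insight: The contribution is to apply Chow-Liu trees to learn the dependency structure among chunks and to optimize processing order in a data-driven way, easing the information bottleneck. This brings a probabilistic graphical model perspective to ordering in long-context reasoning.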

Abstract: Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent-state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded-memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed. In this work, we study the problem of chunk ordering for long-context reasoning. We use the well-known Chow-Liu trees to learn a dependency structure that prioritizes strongly related chunks. Empirically, we show that a breadth-first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy across three long-context benchmarks.
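The ordering procedure named in this abstract can be sketched in two steps: build a maximum-weight spanning tree over pairwise chunk dependencies (the Chow-Liu construction, where the weights are mutual information estimates), then read the chunks off in breadth-first order. The sketch below is an illustration under assumptions: the abstract does not say how dependencies between chunks are estimated, so the `weights` matrix is a hypothetical stand-in, and the root choice and the rule of visiting stronger children first are also assumptions.

```python
from collections import deque
from typing import Dict, List


def chow_liu_bfs_order(weights: List[List[float]], root: int = 0) -> List[int]:
    """Order chunk indices by BFS over a maximum-weight spanning tree.

    `weights[i][j]` is a symmetric dependency score between chunks i and j
    (a stand-in for a pairwise mutual information estimate).
    """
    n = len(weights)

    # Prim's algorithm, picking the heaviest edge leaving the tree each time.
    in_tree = [root]
    children: Dict[int, List[int]] = {i: [] for i in range(n)}
    while len(in_tree) < n:
        best = None  # (weight, parent, child)
        for u in in_tree:
            for v in range(n):
                if v in in_tree:
                    continue
                if best is None or weights[u][v] > best[0]:
                    best = (weights[u][v], u, v)
        _, parent, child = best
        children[parent].append(child)
        in_tree.append(child)

    # Breadth-first traversal from the root; stronger children go first.
    order: List[int] = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        queue.extend(sorted(children[u], key=lambda c: -weights[u][c]))
    return order


# Hypothetical dependency scores for four chunks.
example_weights = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.1],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.1, 0.3, 0.0],
]
```

With these example scores the tree has edges 0-2, 0-3, and 2-1, so the breadth-first order visits chunks 2 and 3 (both strongly tied to chunk 0) before chunk 1, instead of following document order.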

[13] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs cs.CL

Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart

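TL;DR: This paper examines how reasoning affects answers to simple factual questions in large language models. Even for single-hop factual questions that need no complex reasoning, enabling reasoning substantially improves recall of parametric knowledge. The authors identify two key mechanisms: a computational buffer effect and factual priming. They also show that facts hallucinated during reasoning raise the risk of errors in the final answer, and that accuracy can be improved directly by prioritizing reasoning trajectories whose factual statements are hallucination-free.

Details

Motivation: The paper investigates the role of reasoning in LLMs on simple, single-hop factual questions. Such questions do not require step-by-step logical decomposition, so the usefulness of reasoning there is counterintuitive.

Result: Through a series of hypothesis-driven controlled experiments, the authors find that enabling reasoning unlocks correct answers the model otherwise cannot reach, and identify the computational buffer and factual priming mechanisms. They also show that these insights can be used to improve accuracy directly by prioritizing reasoning trajectories with hallucination-free factual statements.

Insight: The contribution is to show that reasoning aids simple knowledge recall in a non-obvious way, to explain it through two mechanisms (computational buffer and factual priming), to identify the risk posed by hallucinated intermediate facts, and to derive a strategy that favors hallucination-free reasoning paths.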

Abstract: While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model’s parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.

[14] CREATE: Testing LLMs for Associative Creativity cs.CL

Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett

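TL;DR: This paper presents CREATE, a benchmark for evaluating associative creativity in large language models. Models must generate paths between concepts that are highly specific and diverse, mirroring the demands of real creative tasks such as hypothesis generation.

Details

Motivation: There is a lack of evaluation for creative associative reasoning, in particular for measuring a model's ability to draw novel yet meaningful connections between concepts.

Result: On CREATE, the strongest frontier models achieve higher creative utility than others, but the benchmark is hard to saturate. Thinking models are not always more effective, even with high token budgets, and creative prompting methods give only limited improvement.

Insight: The contribution is a benchmark that grades associative creativity objectively, emphasizing the specificity and diversity of paths, and that serves as a testbed for developing methods to improve model creativity.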

Abstract: A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models’ capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model’s parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models’ capacity for associative creativity.

cs.CV

[15] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM cs.CV

Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou

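TL;DR: This paper proposes Granulon, a DINOv3-based multimodal large language model that uses adaptive granularity augmentation to address the weaknesses of existing visual encoders in fine-grained visual understanding and multi-granularity reasoning. It introduces a text-conditioned granularity controller and an adaptive token aggregation module, enabling unified pixel-to-fine-to-coarse reasoning in a single forward pass.

Details

Motivation: Existing MLLMs rely mainly on visual encoders such as CLIP that emphasize global semantic alignment and struggle with fine-grained visual understanding. DINOv3 offers strong pixel-level perception but lacks coarse-grained semantic abstraction, which limits multi-granularity reasoning. The paper aims to close this gap.

Result: Extensive and interpretable experiments show that, under identical settings, Granulon improves accuracy by about 30% and reduces hallucination by about 20%, outperforming all other visual encoders.

Insight: The core contributions are a text-conditioned granularity controller that dynamically adjusts the level of visual abstraction, and an adaptive token aggregation module that performs granularity-guided pooling and relation-aware clustering. Together they give a single model adaptive, multi-granularity visual semantic representation and reasoning.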

Abstract: Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified “pixel-to-fine-to-coarse” reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.

[16] VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model cs.CV

Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang

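TL;DR: This paper proposes VisionCreator-R1, a native visual generation agent with an explicit reflection mechanism, together with a Reflection-Plan Co-Optimization (RPCO) training method. The model targets the lack of systematic reflection in existing visual generation agents for correcting visual errors during generation. Trained on the self-constructed VCR-SFT and VCR-RL datasets, it outperforms Gemini2.5Pro on both single-image and multi-image tasks.

Details

Motivation: Existing visual content generation agents are largely plan-driven and lack a mechanism to systematically reflect on and correct visual errors partway through a generation trajectory.

Result: On existing benchmarks and the authors' VCR-bench, which covers single-image and multi-image tasks, VisionCreator-R1 consistently outperforms Gemini2.5Pro.

Insight: The core contribution is a native visual generation agent with explicit reflection, along with the finding that reflection and planning are optimized asymmetrically in reinforcement learning: planning can be reliably optimized through plan rewards, while reflection learning is hindered by noisy credit assignment. Based on this, RPCO first strengthens reflection and planning separately on the self-constructed data and then co-optimizes them.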

Abstract: Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover a reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then performs co-optimization on the VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini2.5Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.

[17] Computer Vision-Based Vehicle Allotment System using Perspective Mapping cs.CV

Prachi Nandi, Sonakshi Satapathy, Suchismita Chinara

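TL;DR: This paper proposes a smart parking allotment system based on computer vision and inverse perspective mapping (IPM). It merges views from four cameras to detect vacant parking spaces and simulates a 3D parking environment to guide users.

Details

Motivation: Traditional sensor-based parking systems are limited in accuracy and adaptability, especially in densely populated urban areas. The paper uses computer vision to improve parking efficiency and reduce human involvement.

Result: The abstract reports no specific quantitative results or benchmarks. It claims that the system is accurate and cost-effective, and that it uses object detection models such as YOLOv8 to recognize vehicles.

Insight: The contribution is to combine inverse perspective mapping with multi-camera views to assess the parking layout dynamically, and to simulate a 3D environment for visual guidance. The low-cost, easy-to-deploy computer vision design is the reusable idea.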

Abstract: Smart city research envisions a future in which data-driven solutions and sustainable infrastructure work together to define urban living at the crossroads of urbanization and technology. Within this framework, smart parking systems play an important role in reducing urban congestion and supporting sustainable transportation. Automated parking solutions have considerable benefits, such as increased efficiency and less reliance on human involvement, but obstacles such as sensor limitations and integration complications remain. To overcome them, a more sophisticated car allotment system is required, particularly in heavily populated urban areas. Computer vision, with its higher accuracy and adaptability, outperforms traditional sensor-based systems for recognizing vehicles and vacant parking spaces. Unlike fixed sensor technologies, computer vision can dynamically assess a wide range of visual inputs while adjusting to changing parking layouts. This research presents a cost-effective, easy-to-implement smart parking system utilizing computer vision and object detection models like YOLOv8. Using inverse perspective mapping (IPM) to merge images from four camera views, we extract data on vacant spaces. The system simulates a 3D parking environment, representing available spots with a 3D Cartesian plot to guide users.
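Inverse perspective mapping, as used in this abstract, amounts to applying a planar homography that maps image pixels to ground-plane coordinates. The sketch below shows that mapping and a simple vacancy check. It is an illustration under assumptions, not the paper's pipeline: the 3×3 matrix `H` would come from camera calibration that the abstract does not describe, and `apply_homography`, `is_spot_vacant`, the rectangular spot format, and the use of one ground point per detected vehicle are all hypothetical.

```python
from typing import Iterable, List, Tuple

Matrix3 = List[List[float]]
Point = Tuple[float, float]
# Hypothetical spot format: axis-aligned rectangle on the ground plane,
# given as (x_min, y_min, x_max, y_max).
Spot = Tuple[float, float, float, float]


def apply_homography(H: Matrix3, x: float, y: float) -> Point:
    """Map an image point to the ground plane with a 3x3 homography."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get Cartesian coordinates.
    return (u / w, v / w)


def is_spot_vacant(spot: Spot, vehicle_points: Iterable[Point]) -> bool:
    """A spot is vacant if no vehicle ground point falls inside it."""
    x_min, y_min, x_max, y_max = spot
    return not any(x_min <= x <= x_max and y_min <= y <= y_max
                   for x, y in vehicle_points)


# Hypothetical calibration matrix and one detected vehicle position
# (for example, the bottom-center pixel of a detector's bounding box).
H_example: Matrix3 = [[2.0, 0.0, 1.0],
                      [0.0, 2.0, 0.0],
                      [0.0, 0.0, 2.0]]
ground_point = apply_homography(H_example, 1.0, 1.0)
```

With one homography per camera, detections from all four views land in the same ground-plane coordinate system, which is what allows the views to be merged into a single layout.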

[18] HECTOR: Hybrid Editable Compositional Object References for Video Generation cs.CV

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang

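TL;DR: HECTOR is a video generation framework that supports fine-grained compositional control. Through hybrid reference conditioning (static images and/or dynamic videos) and explicit trajectory specification, it gives precise control over the location, scale, and speed of each element in a scene, synthesizing coherent videos that satisfy complex spatiotemporal constraints while staying faithful to the references.

Details

Motivation: Most current video generation models synthesize scenes holistically and lack a mechanism for explicit compositional manipulation of visual elements, so they cannot finely control the interactions and dynamic composition of distinct physical objects.

Result: Extensive experiments show that HECTOR achieves better visual quality, reference fidelity, and motion controllability than existing approaches.

Insight: The contribution is a hybrid reference conditioning mechanism, in which static images and dynamic videos can guide generation simultaneously, combined with user-specified trajectories (location, scale, speed) for each referenced element, giving fine-grained compositional control over video generation.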

Abstract: Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

[19] Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures cs.CVPDF

David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi

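TL;DR: This paper systematically evaluates the robustness of three VLM-based autonomous driving architectures to physical adversarial patch attacks. Using black-box optimization with semantic homogenization for fair comparison in the CARLA simulator, the study finds severe vulnerabilities in all architectures, including sustained multi-frame failures and degraded detection of critical objects.

Details

Motivation: Vision-language models are being applied to autonomous driving, but their robustness to physical adversarial attacks has not been adequately explored. This paper aims to fill that gap by assessing the vulnerability of existing VLM architectures in safety-critical applications.

Result: In CARLA simulation, all evaluated architectures show severe vulnerabilities. The attacks cause sustained multi-frame failures and significant degradation in critical object detection, revealing the inadequacy of current VLM designs against adversarial threats.

Insight: The novelty is a systematic adversarial evaluation framework with semantic homogenization for fairly comparing vulnerability patterns across VLM architectures. The analysis exposes inherent security weaknesses in VLM architecture design and underscores the need to build robustness considerations into safety-critical systems.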

Abstract: Vision-language models are emerging for autonomous driving, yet their robustness to physical adversarial attacks remains unexplored. This paper presents a systematic framework for comparative adversarial evaluation across three VLM architectures: Dolphins, OmniDrive (Omni-L), and LeapVAD. Using black-box optimization with semantic homogenization for fair comparison, we evaluate physically realizable patch attacks in CARLA simulation. Results reveal severe vulnerabilities across all architectures, sustained multi-frame failures, and critical object detection degradation. Our analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical autonomous driving applications.

[20] Towards Visual Query Segmentation in the Wild cs.CVPDF

Bing Fan, Minghao Li, Hanzhi Zhang, Shaohua Dong, Naga Prudhvi Mareedu

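TL;DR: This paper introduces visual query segmentation (VQS), a new paradigm that segments all pixel-level occurrences of a target object in an untrimmed video given an external visual query. The authors build VQS-4K, the first large-scale benchmark dedicated to VQS, and propose a simple yet effective method, VQ-SAM, which improves performance through a novel multi-stage framework and an adaptive memory generation module.

Details

Motivation: Existing visual query localization (VQL) methods usually locate only the last appearance of the target with a bounding box, which is neither comprehensive nor precise. VQS addresses this by segmenting every occurrence of the target at the pixel level, making it better suited to real-world scenarios.

Result: In extensive experiments on the proposed VQS-4K benchmark, VQ-SAM achieves promising results and surpasses all existing methods, demonstrating its effectiveness.

Insight: The novelty lies in the new VQS task paradigm and its first dedicated benchmark, VQS-4K, and in VQ-SAM, whose core idea is to use target-specific cues and background distractor cues from the video to progressively evolve the memory through a multi-stage framework with an adaptive memory generation module, significantly improving segmentation performance.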

Abstract: In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.

[21] Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift cs.CV | physics.med-phPDF

Maziar Sabouri, Nourhan Bayasi, Arman Rahmim

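TL;DR: To address the performance degradation of multi-task thyroid ultrasound automation under cross-center domain shift, this paper proposes a lightweight family of decoder-side adapters (MKGA and ResMKGA) that use complementary multi-kernel receptive fields and semantic gating to make segmentation and malignancy risk assessment more robust.

Details

Motivation: Thyroid ultrasound automation must handle both global, geometry-driven nodule segmentation and local, texture-driven malignancy risk assessment. Under cross-center domain shift these cues degrade asymmetrically, and existing multi-task methods with a shared backbone are prone to negative transfer.

Result: On two ultrasound benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, clearly improve clinical TI-RADS diagnostic accuracy over standard multi-task baselines.

Insight: The novelty lies in characterizing how CNNs and ViTs differ in transferring texture and geometry cues under cross-center shift, and in designing lightweight decoder adapters that use multi-kernel feature refinement and semantic gating to suppress artifacts and enhance task-specific features.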

Abstract: Thyroid ultrasound (US) automation couples two competing requirements: global, geometry-driven reasoning for nodule delineation and local, texture-driven reasoning for malignancy risk assessment. Under cross-center domain shift, these cues degrade asymmetrically, yet most multi-task pipelines rely on a single shared backbone, often inducing negative transfer. In this paper, we characterize this interference across CNN (ResNet34) and medical ViT (MedSAM) backbones, and observe a consistent trend: ViTs transfer geometric priors that benefit segmentation, whereas CNNs more reliably preserve texture cues for malignancy discrimination under strong shift and artifacts. Motivated by this failure mode, we propose a lightweight family of decoder-side adapters, the Multi-Kernel Gated Adapter (MKGA) and a residual variant (ResMKGA), which refine multi-scale skip features using complementary receptive fields and apply semantic, context-conditioned gating to suppress artifact-prone content before fusion. Across two US benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines. Code and models will be released.

[22] Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning cs.CV | cs.LGPDF

Mohamed Harmanani, Bining Long, Zhuoxin Guo, Paul F. R. Wilson, Amirhossein Sabour

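TL;DR: This paper proposes MedCBR, a framework that combines clinical guidelines with vision-language and reasoning models for concept-based reasoning on medical images. It trains a concept bottleneck model with a multitask objective and uses a reasoning model to generate structured clinical narratives that explain the diagnosis, improving diagnostic performance while enhancing interpretability.

Details

Motivation: Conventional concept bottleneck models in medical imaging overlook clinical guidelines and expert heuristics, which reduces reliability in complex cases. The goal is a more reliable and interpretable AI system that incorporates clinical context.

Result: MedCBR achieves superior diagnostic and concept-level performance on medical datasets, with AUROCs of 94.2% on ultrasound and 84.0% on mammography, and reaches 86.1% accuracy on non-medical datasets.

Insight: The novelty lies in converting clinical guidelines into text and integrating them into multitask training that combines multimodal contrastive alignment, concept supervision, and diagnostic classification, jointly modeling image features, concepts, and pathology. A reasoning model then generates guideline-conformant clinical narratives that emulate expert reasoning, connecting medical image analysis to decision-making end to end while improving interpretability.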

Abstract: Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.

[23] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering cs.CV | cs.MMPDF

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang

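TL;DR: The MEGC2026 micro-expression grand challenge focuses on using multimodal large language models (MLLMs) and large vision-language models (LVLMs) to enhance micro-expression analysis. It introduces two new tasks, micro-expression video question answering (ME-VQA) and micro-expression long-video question answering (ME-LVQA), which explore micro-expression understanding in a visual question answering format, with emphasis on temporal reasoning and subtle detection in long video sequences.

Details

Motivation: Micro-expressions are subtle, spontaneous facial movements that occur when a person suppresses an emotion, typically in high-stakes environments. ME recognition, spotting, and generation have advanced substantially in recent years, and the emergence of MLLMs and LVLMs offers a new way to improve ME analysis through their strong multimodal reasoning capabilities.

Result: The paper reports no experimental results. It describes a challenge framework in which all participating algorithms must submit results to a public leaderboard to evaluate performance on the ME-VQA and ME-LVQA tasks.

Insight: The novelty lies in combining micro-expression analysis with emerging MLLMs and LVLMs. The ME-VQA and ME-LVQA tasks push micro-expression understanding toward more complex visual question answering and long-video temporal reasoning, opening a new research path for multimodal affective computing.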

Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.

[24] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning cs.CV | cs.AIPDF

Heesup Yun, Isaac Kazuo Uyehara, Earl Ranario, Lars Lundqvist, Christine H. Diepenbrock

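TL;DR: This paper proposes a new approach that uses vision-language models (VLMs) with in-context learning to generate plant simulation configurations (in JSON format) directly from drone remote sensing images, and creates a synthetic benchmark to evaluate performance. Using open-source VLMs such as Gemma 3 and Qwen3-VL, the study tests five in-context learning methods on a synthetic cowpea plot dataset and evaluates JSON integrity, geometric metrics, and biophysical metrics. The results show that VLMs can interpret structural metadata and estimate parameters such as plant count, but are susceptible to contextual bias or fall back on dataset means when visual cues are insufficient. The study also includes validation on a real-world drone orthophoto dataset and an ablation study.

Details

Motivation: Functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, but their high complexity and low throughput hinder large-scale deployment. This paper aims to remove that bottleneck with state-of-the-art vision-language models, providing a scalable framework for generating plant simulation configurations for agricultural digital twins.

Result: On the synthetic cowpea plot dataset, the models show the ability to interpret structural metadata and estimate parameters such as plant count and sun azimuth, but performance in the JSON integrity, geometric, and biophysical evaluations degrades under contextual bias or insufficient visual cues. The study validates on real-world drone orthophotos and uses an ablation with a blind baseline to further separate the models' reasoning ability from their reliance on contextual priors.

Insight: The novelty is the first use of VLMs to generate structured JSON configurations for plant simulation, providing a scalable framework for 3D plot reconstruction in agricultural digital twins. The core of the method is combining state-of-the-art open VLMs with in-context learning to generate simulation parameters directly from images, avoiding the complexity of manually configuring traditional FSPMs. The study also systematically evaluates the capabilities and limitations of VLMs on this new task (such as contextual bias) and establishes a dedicated synthetic benchmark, which is useful for advancing VLM applications in agriculture and simulation.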

Abstract: This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs (Gemma 3 and Qwen3-VL) to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.

[25] PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration cs.CV | cs.AI | cs.CL | cs.DL | cs.IRPDF

Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh

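TL;DR: PathoScribe is a unified retrieval-augmented large language model framework designed to turn static pathology archives into a searchable, reasoning-enabled living library. The system supports natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation, and its efficiency and accuracy are validated on a multi-institutional dataset of surgical pathology reports.

Details

Motivation: The large volume of narrative reports accumulated in pathology is hard to retrieve and use effectively. The goal is to keep digitized archives from becoming passive repositories and to let pathologists query similar past cases in real time during diagnosis.

Result: Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieves 100% Recall@10 for natural language case retrieval and a mean score of 4.56/5 for retrieval-grounded reasoning. Automated cohort construction takes 9.2 minutes on average, with 91.3% agreement with human reviewers and no eligible cases incorrectly excluded, greatly reducing time and cost compared with traditional manual chart review.

Insight: The novelty lies in combining retrieval-augmented generation (RAG) with an LLM in a unified framework for semantic retrieval and clinical integration of pathology data, turning static archives into an active clinical intelligence platform and markedly improving data utilization and diagnostic support.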

Abstract: Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
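
For readers unfamiliar with the retrieval metric, here is a minimal sketch of one common definition of Recall@k. The abstract does not state which variant the authors use, so the definition, function name, and example identifiers below are assumptions for illustration only.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k of a ranked list.

    ranked_ids: retrieved item identifiers, best match first.
    relevant_ids: identifiers judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


# Hypothetical usage with made-up report identifiers.
ranking = ["r7", "r2", "r9", "r4"]
score = recall_at_k(ranking, ["r2", "r4"], k=3)
```

Under this definition, a "perfect" Recall@10 means every relevant report for each query appears among the first ten results.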

[26] BiCLIP: Domain Canonicalization via Structured Geometric Transformation cs.CV | cs.AI | cs.CL | cs.LGPDF

Pranav Mantini, Shishir K. Shah

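TL;DR: This paper proposes BiCLIP, a framework that performs domain canonicalization via structured geometric transformation to strengthen the cross-modal alignment of vision-language models in specialized domains. The method estimates the geometric transformation from a small number of anchor samples and achieves state-of-the-art performance on 11 standard benchmarks.

Details

Motivation: Vision-language models perform well on zero-shot tasks, but adapting them to specialized domains remains challenging. Building on theoretical findings, the authors hypothesize that image features across domains are related by a canonicalized geometric transformation, and that a few labeled samples can serve as anchors for estimating it.

Result: On 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, BiCLIP achieves state-of-the-art (SOTA) results, validating its effectiveness.

Insight: The novelty lies in formalizing domain alignment as the estimation of a structured geometric transformation, achieving robust domain adaptation with a minimal design and a low parameter count. Empirical analysis shows that the learned transformations exhibit orthogonality and characteristic angular distributions, and that structured alignment is the key mechanism.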

Abstract: Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP

[27] Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation cs.CV | eess.ASPDF

Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu

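TL;DR: This paper introduces the first exemplar-free continual learning benchmark for audio-visual segmentation, targeting the challenge of audio and visual distributions that change dynamically in the real world. It defines four learning protocols and proposes a strong baseline, ATLAS, which mitigates catastrophic forgetting through audio-guided pre-fusion conditioning and Low-Rank Anchoring.

Details

Motivation: Real-world environments are dynamic, so audio and visual distributions evolve over time, whereas existing audio-visual segmentation systems assume a static training setting and cannot adapt to such change. A continual learning benchmark is needed to address this.

Result: Extensive experiments across diverse continual learning scenarios show that the proposed method achieves competitive performance, laying a foundation for lifelong audio-visual perception.

Insight: The contributions are the first exemplar-free continual learning benchmark for audio-visual segmentation and the ATLAS method, in which audio-guided pre-fusion conditioning and Low-Rank Anchoring help stabilize adapted weights and reduce catastrophic forgetting.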

Abstract: Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at https://gitlab.com/viper-purdue/atlas (paper under review). Keywords: Continual Learning, Audio-Visual Segmentation, Multi-Modal Learning.

[28] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing cs.CVPDF

Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang

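TL;DR: This paper proposes SVG-EAR, a parameter-free linear compensation method for sparse video generation. After semantic clustering, it approximates skipped attention blocks with cluster centroids to recover their contributions, and it introduces a lightweight probe for error-aware routing that computes exactly the blocks with the highest error-to-cost ratio, substantially improving Diffusion Transformer inference efficiency while preserving generation quality.

Details

Motivation: Diffusion Transformers (DiTs) have become the leading backbone for video generation, but their quadratic attention cost is a major bottleneck. Existing sparse attention methods either drop some attention blocks outright, losing information, or rely on learned predictors, which adds training overhead and may shift the output distribution.

Result: On Wan2.2 and HunyuanVideo, SVG-EAR achieves speedups of up to 1.77x and 1.93x, respectively, while maintaining generation fidelity, with PSNRs of up to 29.759 and 31.043, establishing a clear Pareto frontier in the quality-efficiency trade-off.

Insight: The core finding is that after semantic clustering, the keys and values within each attention block are strongly similar, so the contributions of skipped blocks can be approximated without training from a small set of cluster centroids. The error-aware routing mechanism selects the blocks to compute exactly by estimated compensation error rather than by attention score, yielding a better error-cost trade-off. The method is parameter-free and requires no training.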

Abstract: Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
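
The two ideas in this abstract, centroid compensation and error-aware routing, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: it uses one centroid per block, takes per-block error estimates as given inputs rather than modeling the lightweight probe, and uses a simple greedy budgeted selection. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def exact_block(q, keys, values):
    """Unnormalized softmax-attention contribution of one block.

    Returns (numerator vector, denominator scalar) for query q.
    """
    weights = [math.exp(dot(q, k)) for k in keys]
    dim = len(values[0])
    numerator = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return numerator, sum(weights)


def centroid_block(q, keys, values):
    """Approximate the same contribution using only the key/value centroids."""
    n = len(keys)
    weight = n * math.exp(dot(q, mean_vec(keys)))
    return [weight * x for x in mean_vec(values)], weight


def route_blocks(errors, costs, budget):
    """Greedily pick blocks to compute exactly, highest error-to-cost ratio first."""
    order = sorted(range(len(errors)), key=lambda i: errors[i] / costs[i], reverse=True)
    chosen, spent = [], 0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return sorted(chosen)


# Hypothetical usage: blocks 0 and 2 are computed exactly, block 1 is compensated.
exact_ids = route_blocks(errors=[0.9, 0.1, 0.5], costs=[1, 1, 1], budget=2)
```

When the keys and values inside a block are identical, the centroid approximation matches the exact contribution, which is the intuition behind clustering similar tokens into the same block before compensating.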

[29] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning cs.CVPDF

Bolutife Atoki, Iuliia Tkachenko, Bertrand Kerautret, Carlos Crispim-Junior

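TL;DR: This paper proposes a diffusion-based multimodal authentication framework for verifying printed Copy Detection Patterns (CDPs). The method jointly uses the original binary template, the printed CDP image, and a representation capturing semantic information about printer identity, recasts authentication as multi-class classification over printer signatures, and extends ControlNet to perform the classification effectively.

Details

Motivation: To counter the counterfeiting threat posed by high-resolution printing and scanning devices and generative deep learning, and to address the difficulty traditional authentication systems have in distinguishing high-quality counterfeits from genuine prints.

Result: On the Indigo 1 x 1 Base dataset, the method outperforms traditional similarity metrics and prior deep learning approaches, and it generalizes to counterfeit types unseen during training.

Insight: The novelty lies in repurposing the denoising process of a diffusion model (specifically ControlNet) for class-conditioned noise prediction, capturing fine-grained device-specific features through spatial and textual conditioning and yielding a multimodal authentication framework conditioned on printer signatures.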

Abstract: Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.

[30] Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework cs.CVPDF

Ammar K. AlMhdawi, Nonso Nnamoko, Alaa Mashan Ubaid

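TL;DR: This study proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and proximity-based risk assessment. A primary YOLOv8 instance segmentation model detects fire and smoke, and a secondary object detection model pretrained on the COCO dataset identifies surrounding entities such as people and vehicles. By combining the outputs of both models, the system computes pixel distances between fire regions and nearby objects, converts them to approximate real-world measurements with a pixel-to-meter scaling approach, and then combines fire evidence, object vulnerability, and distance-based exposure to produce a quantitative risk score and alert level.

Details

Motivation: To extend conventional vision-based fire monitoring beyond simple detection to actionable, proximity-based hazard prioritization, enabling intelligent risk assessment.

Result: Trained on a dataset of 9,860 annotated images, the framework achieves precision, recall, and F1 scores above 90% and mAP@0.5 above 91%.

Insight: The novelty lies in combining instance segmentation with object detection and adding pixel-to-real-world distance conversion to build a quantitative risk assessment mechanism that integrates fire evidence, object vulnerability, and proximity. The framework is lightweight, built on open-source tools, and suitable for deployment in industrial and resource-constrained environments.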

Abstract: This study proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and proximity-aware risk assessment, extending conventional vision-based monitoring beyond simple detection to actionable hazard prioritization. The system is trained on a dataset of 9,860 annotated images to segment fire and smoke across complex environments. The framework combines a primary YOLOv8 instance segmentation model for fire and smoke detection with a secondary object detection model pretrained on the COCO dataset to identify surrounding entities such as people, vehicles, and infrastructure. By integrating the outputs of both models, the system computes pixel-based distances between detected fire regions and nearby objects and converts these values into approximate real-world measurements using a pixel-to-meter scaling approach. This proximity information is incorporated into a risk assessment mechanism that combines fire evidence, object vulnerability, and distance-based exposure to produce a quantitative risk score and alert level. The proposed framework achieves strong performance, with precision, recall, and F1 scores exceeding 90% and mAP@0.5 above 91%. The system generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information to support situational awareness. Implemented using open-source tools within the Google Colab environment, the framework is lightweight and suitable for deployment in industrial and resource-constrained settings.
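
The proximity pipeline described above (pixel distance, pixel-to-meter scaling, combined risk score) can be sketched as follows. The abstract does not give the scoring formula, so the multiplicative score, the linear exposure term, the 10 m safe distance, and the alert thresholds are all hypothetical choices made for illustration.

```python
import math


def pixel_distance(box_a, box_b):
    """Euclidean distance between the centers of two (x1, y1, x2, y2) boxes."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def to_meters(pixels, meters_per_pixel):
    """Convert a pixel distance using a fixed (assumed calibrated) scale factor."""
    return pixels * meters_per_pixel


def risk_score(fire_conf, vulnerability, distance_m, safe_distance_m=10.0):
    """Combine fire evidence, object vulnerability, and distance-based exposure.

    Exposure falls linearly from 1 at zero distance to 0 at safe_distance_m.
    """
    exposure = max(0.0, 1.0 - distance_m / safe_distance_m)
    return fire_conf * vulnerability * exposure


def alert_level(score):
    """Map a risk score in [0, 1] to a coarse alert label (thresholds are assumed)."""
    if score >= 0.5:
        return "high"
    if score >= 0.2:
        return "medium"
    return "low"


# Hypothetical usage: a person detected near a fire region.
d_px = pixel_distance((0, 0, 2, 2), (3, 4, 5, 6))
d_m = to_meters(d_px, meters_per_pixel=0.5)
level = alert_level(risk_score(fire_conf=0.8, vulnerability=1.0, distance_m=d_m))
```

A single scale factor is only a rough approximation, because perspective makes the meters-per-pixel ratio vary across the image; the abstract itself describes the converted values as approximate.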

[31] GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models cs.CV | cs.AI | cs.ROPDF

Md Selim Sarowar, Omer Tariq, Sungho Kim

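TL;DR: This paper proposes GST-VLA, which introduces a Gaussian Spatial Tokenizer (GST) that converts depth and semantic features into anisotropic 3D Gaussian primitives as a structured representation of 3D geometry, combines it with 3D Depth-Aware Chain-of-Thought (DA-CoT) spatial reasoning, and decodes actions with a flow-matching action expert. The model achieves notable gains on the LIBERO and SimplerEnv benchmarks.

Details

Motivation: Existing vision-language-action (VLA) models encode visual observations as 2D patch tokens with no intrinsic geometric structure and do not explicitly model 3D geometry or depth, which limits performance on tasks requiring precise spatial understanding, such as robot manipulation.

Result: GST-VLA reaches 96.4% on LIBERO (+2.0%) and 80.2% on SimplerEnv (+5.4%). Ablations verify the independent and synergistic gains of each component (GST, DA-CoT thoughts, training stages).

Insight: The contributions are: 1) GST converts depth and semantic features into parameterized 3D Gaussian primitives, where the covariance eigenstructure encodes local surface orientation and opacity provides geometric confidence; 2) DA-CoT supervises the generation of four structured intermediate spatial thoughts (e.g. 3D object grounding, grasp contact geometry) as explicit training targets; 3) a cross-attention sublayer in the VLM transformer blocks gives the model direct access to the raw Gaussian field during reasoning; 4) a flow-matching action expert with mixture-of-experts feedforward sublayers decodes actions conditioned on VLM hidden states and DA-CoT outputs.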

Abstract: VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, and learned opacity $\alpha \in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision demanding tasks.
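
A minimal sketch of how one Gaussian primitive's three parameter groups might be decoded from raw network outputs. The abstract specifies a residual mean, a log-scale covariance, and an opacity in (0, 1); everything else here is an assumption made for illustration, including the sigmoid for opacity, the reading of "residual" as an offset from a back-projected depth point, and the function and field names.

```python
import math


def make_gaussian(anchor_xyz, residual, log_sigma, opacity_logit):
    """Decode one anisotropic 3D Gaussian primitive from raw outputs.

    anchor_xyz: assumed back-projected 3D point that the residual mean offsets.
    residual: predicted offset added to the anchor to give the mean.
    log_sigma: per-axis log scale; exponentiating keeps the scales positive.
    opacity_logit: unconstrained scalar squashed into (0, 1) by a sigmoid.
    """
    mean = tuple(a + r for a, r in zip(anchor_xyz, residual))
    scale = tuple(math.exp(s) for s in log_sigma)
    opacity = 1.0 / (1.0 + math.exp(-opacity_logit))
    return {"mean": mean, "scale": scale, "opacity": opacity}


# Hypothetical usage with made-up values.
g = make_gaussian((1.0, 2.0, 3.0), (0.1, 0.0, -0.1), (0.0, 0.0, 0.0), 0.0)
```

Predicting the log of the scale rather than the scale itself is a common way to guarantee positivity without clamping.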

[32] OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing cs.CVPDF

Lixiang Lin, Siyuan Jin, Jinshan Zhang

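TL;DR: This paper proposes OmniEdit, a training-free framework for lip synchronization and audio-visual editing. By replacing the edit sequence in FlowEdit with the target sequence, it obtains an unbiased estimate of the desired output, and by removing stochasticity from the generation process it establishes a smooth and stable editing trajectory.

Details

Motivation: Most existing lip synchronization and audio-visual editing methods depend on supervised fine-tuning of pre-trained models, which incurs high computational overhead and data requirements. This paper aims to develop an efficient training-free framework that overcomes these limits.

Result: Extensive experiments validate the effectiveness and robustness of the framework, but the abstract gives no specific benchmarks or quantitative comparisons with SOTA models.

Insight: The core idea is to reformulate the editing paradigm by directly substituting the target sequence for the edit sequence, yielding an unbiased estimate, and to use a deterministic generation process to keep the editing trajectory smooth and stable, a novel training-free approach.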

Abstract: Lip synchronization and audio-visual editing have emerged as fundamental challenges in multimodal learning, underpinning a wide range of applications, including film production, virtual avatars, and telepresence. Despite recent progress, most existing methods for lip synchronization and audio-visual editing depend on supervised fine-tuning of pre-trained models, leading to considerable computational overhead and data requirements. In this paper, we present OmniEdit, a training-free framework designed for both lip synchronization and audio-visual editing. Our approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimation of the desired output. Moreover, by removing stochastic elements from the generation process, we establish a smooth and stable editing trajectory. Extensive experimental results validate the effectiveness and robustness of the proposed framework. Code is available at https://github.com/l1346792580123/OmniEdit.

[33] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation cs.CVPDF

Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu

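TL;DR: This paper proposes a physically plausible video generation framework based on event-chain causal reasoning. It decomposes physical phenomena into multiple causally linked event units, constrains the reasoning with physical formulas, and uses cross-modal prompting to produce temporally aligned vision-language descriptions, improving the physical plausibility of generated videos.

Details

Motivation: Existing video diffusion models struggle with commonsense knowledge. They typically render a physical phenomenon as a single moment defined by the prompt and lack a mechanism for modeling causal progression, so the generated videos are not physically plausible enough.

Result: Comprehensive experiments on the PhyGenBench and VideoPhy benchmarks show that the framework achieves superior performance in generating physically plausible videos across multiple physical domains.

Insight: The novelty lies in treating physically plausible video generation as generating a sequence of causally connected, dynamically evolving events, and in designing a physics-driven event chain reasoning module (which embeds physical formulas as constraints to remove causal ambiguity) and a transition-aware cross-modal prompting module (which maintains continuity between events), offering a new approach to causal modeling in video generation.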

Abstract: Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.

[34] MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration cs.CVPDF

Chenran Zhang, Ruiqi Wu, Tao Zhou, Yi Zhou

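TL;DR: This paper proposes MedKCO, a medical vision-language pretraining method that uses knowledge-driven cognitive orchestration to organize the learning process. It adopts a two-level curriculum that orders the pretraining data by diagnostic sensitivity and intra-class sample representativeness, and introduces a self-paced asymmetric contrastive loss that dynamically adjusts the learning objective. Experiments on multiple downstream medical imaging tasks show that it significantly outperforms existing baselines.

Details

Motivation: Current medical vision-language pretraining models typically force the model to learn simple and complex concepts at the same time. This anti-cognitive process yields suboptimal feature representations, especially under distribution shift. The paper addresses this by designing an ordered learning strategy that mimics human cognition.

Result: The method is evaluated on multiple vision-language downstream tasks across three medical imaging scenarios and compared with several curriculum learning methods. Extensive experiments show that it significantly surpasses all baselines.

Insight: The novelty lies in bringing curriculum learning into medical VLP: a two-level curriculum built from diagnostic sensitivity and intra-class representativeness, combined with a self-paced asymmetric contrastive loss motivated by the inter-class similarity of medical images, yields an ordered learning process that better matches how cognition proceeds.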

Abstract: Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two-level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. Code is available at https://github.com/Mr-Talon/MedKCO.

[35] Training-free Motion Factorization for Compositional Video Generation cs.CVPDF

Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu

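TL;DR: This paper proposes a training-free motion factorization framework for compositional video generation. It decomposes complex motion into three categories (motionlessness, rigid motion, and non-rigid motion) and follows a "planning before generation" paradigm: it first reasons about motion laws on a motion graph to obtain frame-wise changes in each instance's shape and position, then modulates the synthesis of the different motion categories in a disentangled manner during generation. The framework is model-agnostic and can be integrated into various diffusion model architectures.

Details

Motivation: Current compositional video generation methods focus mainly on semantic binding and neglect the diverse motion categories specified in prompts, which makes it hard to synthesize multiple instances with different appearance and motion. The paper addresses this by factorizing motion into categories.

Result: Extensive experiments show that the framework achieves impressive motion synthesis performance on real-world benchmarks.

Insight: The novelty lies in decomposing complex motion into three basic categories and adopting a "planning before generation" paradigm, in which motion-graph reasoning and conditional modulation control the motion categories separately. This alleviates semantic ambiguity in user prompts, and the model-agnostic design makes the framework easy to integrate.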

Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic and can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.

[36] Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations cs.CV | cs.AIPDF

Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia

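TL;DR: This paper proposes a composed vision-language retrieval framework for skin cancer case search. Queries combine a lesion image with a text description, and the framework jointly aligns global and local representations, outperforming existing methods on the public Derm7pt dataset.

Details

Motivation: In practice, medical image retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. The paper targets this need in order to support clinical decision making, education, and quality control.

Result: Experiments on the public Derm7pt dataset show consistent improvements over state-of-the-art (SOTA) methods.

Insight: The novelty lies in learning hierarchical composed query representations: local alignment aggregates discriminative regions through multiple spatial attention masks, global alignment provides holistic semantic supervision, and the final similarity is computed with a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency.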

Abstract: Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
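
The convex weighting of local and global alignment scores mentioned in the abstract can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function name `combined_similarity`, the parameter `local_weight`, and the choice to average the per-mask local scores are assumptions, and the abstract does not say how the domain-informed weight is set.

```python
def combined_similarity(global_sim, local_sims, local_weight=0.6):
    """Convex combination of a global score and pooled local scores.

    global_sim   -- similarity between whole-query and whole-image embeddings
    local_sims   -- one similarity per spatial attention mask (hypothetical input)
    local_weight -- weight in [0, 1] on local evidence (assumed value, not from the paper)
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1] for a convex combination")
    if not local_sims:
        raise ValueError("at least one local similarity is required")
    # Pool the per-mask scores; plain averaging is an assumption of this sketch.
    local_sim = sum(local_sims) / len(local_sims)
    # Weights are non-negative and sum to 1, so the result stays between the two scores.
    return local_weight * local_sim + (1.0 - local_weight) * global_sim


score = combined_similarity(global_sim=0.5, local_sims=[0.9, 0.7], local_weight=0.6)
```

Because the weights sum to one, the combined score can never exceed the larger of the two inputs, which keeps it comparable across queries when ranking candidates.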

[37] VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs cs.CV | cs.AIPDF

Xiyao Wang, Xiaoyu Tan, Yang Dai, Yuxuan Fu, Shuo Li

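TL;DR: VIVID-Med is a novel pretraining framework for medical vision transformers. It uses a frozen large language model as a structured semantic teacher, converts clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema, and uses answerability-aware masking and Structured Prediction Decomposition to focus training. The LLM is discarded after training, leaving a lightweight, deployable ViT-only backbone.

Details

Motivation: Current vision-language pretraining methods for medical image analysis typically supervise the visual encoder with one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings.

Result: On CheXpert linear probing, VIVID-Med reaches a macro-AUC of 0.8588, outperforming BiomedCLIP by 6.65 points while using 500x less data. It also performs well in zero-shot cross-domain transfer to NIH ChestX-ray14 (macro-AUC 0.7225) and in cross-modality generalization to CT (AUC 0.8413 on LIDC-IDRI lung nodule classification and macro-AUC 0.9969 on OrganAMNIST 11-organ classification).

Insight: The novelty lies in using an LLM as a structured semantic teacher, structuring clinical findings through a unified medical schema, and applying answerability-aware masking and Structured Prediction Decomposition to extract complementary visual features. The key advantage is that only the lightweight ViT backbone is kept after training, giving an efficient, scalable model that is easy to deploy clinically and avoids the deployment burden of resource-heavy vision-language models.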

Abstract: Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
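
One way to read "JSON field-state pairs" with "answerability-aware masking" is that fields a report cannot answer are excluded from the loss. The sketch below illustrates that reading only. The field names, the `not_mentioned` state, and the function `masked_field_loss` are hypothetical and are not taken from the paper or its Unified Medical Schema.

```python
import json

# Hypothetical field-state record; real UMS fields and states are not given in the abstract.
record = json.loads(
    '{"pleural_effusion": "present", "cardiomegaly": "absent", "pneumothorax": "not_mentioned"}'
)

# Assumption: a field is answerable when the report states it explicitly.
answerable = {field: state != "not_mentioned" for field, state in record.items()}


def masked_field_loss(per_field_loss, answerable_mask):
    """Average the per-field losses over answerable fields only."""
    kept = [
        loss
        for field, loss in per_field_loss.items()
        if answerable_mask.get(field, False)
    ]
    # With no answerable field there is nothing to optimize for this sample.
    return sum(kept) / len(kept) if kept else 0.0


loss = masked_field_loss(
    {"pleural_effusion": 0.2, "cardiomegaly": 0.4, "pneumothorax": 5.0},
    answerable,
)
```

In this toy case the large loss on the unanswerable `pneumothorax` field is ignored, so the optimization signal comes only from fields the report actually supports.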

[38] Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities cs.CVPDF

Jindi Bao, Jianjun Qian, Mengkai Yan, Jian Yang

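TL;DR: This paper proposes PRLF, a progressive representation learning framework for multimodal sentiment analysis with incomplete modalities. An adaptive modality reliability estimator dynamically assesses each modality's reliability to determine the dominant modality, and a progressive interaction module iteratively aligns the other modalities with the dominant one, enhancing cross-modal consistency while suppressing noise.

Details

Motivation: Real-world applications often lose modalities because of noise, hardware failures, or privacy restrictions, whereas existing methods usually assume that all modalities are complete. Directly fusing incomplete modalities can distort the well-learned representations of the intact modalities, so the feature misalignment under incomplete modalities needs to be addressed.

Result: Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS show that PRLF outperforms state-of-the-art methods in both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.

Insight: The novelty lies in the adaptive modality reliability estimator, which dynamically quantifies modality reliability, and the progressive interaction module, which iteratively aligns modalities to enhance consistency. Viewed objectively, the dynamic assessment and progressive alignment handle the fusion of incomplete modalities effectively and make the model more practical in noisy real-world settings.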

Abstract: Multimodal Sentiment Analysis (MSA) seeks to infer human emotions by integrating textual, acoustic, and visual cues. However, existing approaches often assume that all modalities are complete, whereas real-world applications frequently encounter noise, hardware failures, or privacy restrictions that result in missing modalities. There exists a significant feature misalignment between incomplete and complete modalities, and directly fusing them may even distort the well-learned representations of the intact modalities. To this end, we propose PRLF, a Progressive Representation Learning Framework designed for MSA under uncertain missing-modality conditions. PRLF introduces an Adaptive Modality Reliability Estimator (AMRE), which dynamically quantifies the reliability of each modality using recognition confidence and Fisher information to determine the dominant modality. In addition, the Progressive Interaction (ProgInteract) module iteratively aligns the other modalities with the dominant one, thereby enhancing cross-modal consistency while suppressing noise. Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS verify that PRLF outperforms state-of-the-art methods across both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.

[39] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model cs.CV | cs.AIPDF

Junjie Yin, Jiaju Li, Hanfa Xing

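TL;DR: This paper proposes QUSR, a novel diffusion model for image super-resolution that targets the lost details and visual artifacts caused by unknown, spatially non-uniform degradations in real-world scenes. It integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module that adaptively adjusts the noise injection intensity to reconstruct complex details while preserving original information.

Details

Motivation: Diffusion-based image super-resolution methods struggle to recover details and tend to produce visual artifacts in real-world scenarios where degradations are unknown and spatially non-uniform.

Result: Experimental results show that QUSR can produce high-fidelity, highly realistic images in real-world scenarios, but the abstract does not specify which benchmarks were used or what level of performance (e.g., SOTA) was reached.

Insight: The novelty lies in the uncertainty-guided noise generation module, which adapts the noise intensity to regional uncertainty, and in using an advanced multimodal large language model to generate reliable quality descriptions as an interpretable quality prior. Together they provide a new guidance mechanism for diffusion models in real-world super-resolution.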

Abstract: Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at https://github.com/oTvTog/QUSR.
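
The idea of injecting stronger noise where uncertainty is high and weaker noise where it is low can be sketched as follows. This is only an illustration of that idea, not the UNG module itself: the linear mapping from uncertainty to noise scale, the bounds `min_scale` and `max_scale`, and both function names are assumptions.

```python
import random


def noise_scale(uncertainty, min_scale=0.1, max_scale=1.0):
    """Map an uncertainty value to a noise standard deviation (assumed linear mapping)."""
    u = min(max(uncertainty, 0.0), 1.0)  # clamp to [0, 1]
    return min_scale + (max_scale - min_scale) * u


def uncertainty_guided_noise(uncertainty_map, seed=0):
    """Sample Gaussian noise whose per-pixel scale follows the uncertainty map."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        [noise_scale(u) * rng.gauss(0.0, 1.0) for u in row]
        for row in uncertainty_map
    ]


# Toy 2x2 map: top row is "flat" (low uncertainty), bottom row is "edge" (high uncertainty).
noise = uncertainty_guided_noise([[0.0, 0.1], [0.9, 1.0]])
```

With these defaults a flat region receives noise with a standard deviation of 0.1 and a fully uncertain region receives 1.0, which matches the qualitative behavior the abstract describes.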

[40] Rotation Equivariant Mamba for Vision Tasks cs.CVPDF

Zhongchen Zhao, Qi Xie, Keyu Huang, Lei Zhang, Deyu Meng

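TL;DR: This paper proposes EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. Through a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks, it builds rotation symmetry, a key geometric prior, into Mamba models, addressing the sensitivity to image rotation and the limited robustness and generalization of existing visual Mamba models.

Details

Motivation: Rotation equivariance is one of the most important and general structural priors for visual data, yet no existing Mamba-based vision architecture accounts for it. This leaves the models sensitive to image rotations and limits their robustness and cross-task generalization.

Result: Across multiple benchmarks covering high-level image classification, mid-level semantic segmentation, and low-level image super-resolution, EQ-VMamba achieves superior or competitive performance compared with non-equivariant baselines while using approximately 50% fewer parameters.

Insight: The novelty lies in introducing rotation equivariance into a visual Mamba architecture for the first time, through a rotation equivariant cross-scan strategy and group Mamba blocks, backed by a rigorous theoretical analysis showing end-to-end rotation equivariance. The key insight is that embedding rotation equivariance not only improves robustness to rotation transformations but also enhances overall performance with better parameter efficiency.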

Abstract: Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.
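
Rotation equivariance means that rotating the input and then applying an operator gives the same result as applying the operator and then rotating the output. The sketch below checks this property for 90-degree rotations on two toy operators. It does not model Mamba or the paper's cross-scan strategy, and the operator names and the restriction to 90-degree rotations are choices made for this illustration.

```python
def rot90(grid):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*grid)][::-1]


def neighbour_sum(grid):
    """Sum of the four axis-aligned neighbours with zero padding (a symmetric stencil)."""
    rows, cols = len(grid), len(grid[0])

    def at(r, c):
        return grid[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    return [
        [at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1) for c in range(cols)]
        for r in range(rows)
    ]


def horizontal_diff(grid):
    """Right neighbour minus the pixel itself (a direction-dependent stencil)."""
    cols = len(grid[0])
    return [
        [(row[c + 1] if c + 1 < cols else 0) - row[c] for c in range(cols)]
        for row in grid
    ]


def is_rot90_equivariant(op, grid):
    """True when op(rotate(x)) equals rotate(op(x)) for this particular input."""
    return op(rot90(grid)) == rot90(op(grid))


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
symmetric_ok = is_rot90_equivariant(neighbour_sum, image)
directional_ok = is_rot90_equivariant(horizontal_diff, image)
```

The symmetric stencil passes the check while the horizontal difference fails it, which mirrors the abstract's point that a scan with a preferred direction is sensitive to image rotation unless equivariance is built into the design.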

[41] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning cs.CV | cs.AI | cs.LGPDF

Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu

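TL;DR: RubiCap is a rubric-guided reinforcement learning framework for dense image captioning. It derives fine-grained, sample-specific reward signals from LLM-written rubrics to overcome the limited output diversity and weak generalization of supervised distillation, and achieves the best results on multiple benchmarks.

Details

Motivation: Dense image captioning is critical for vision-language pretraining and text-to-image generation, but high-quality human annotation is expensive. Synthetic captioning with strong vision-language models suffers from limited output diversity and weak generalization, while reinforcement learning lacks a reliable deterministic checker to provide reward signals in open-ended captioning.

Result: RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA it shows superior word efficiency: the 7B model matches Qwen2.5-VL-32B-Instruct, and the 3B model surpasses its 7B counterpart. Vision-language models pretrained on captions from the compact RubiCap-3B are stronger than those trained on captions from proprietary models.

Insight: The novelty lies in using LLM-written rubrics to provide structured, multi-faceted reward signals in place of coarse scalar rewards, enabling finer-grained policy optimization. Viewed objectively, the committee mechanism and rubric-guided evaluation combine diversity and quality assessment and offer a scalable approach to reinforcement learning for open-ended generation.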

Abstract: Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers – a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.

Sneha Paul, Zachary Patterson, Nizar Bouguila

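TL;DR: This paper proposes SAGE, the first end-to-end 3D multimodal large language model. A lightweight 3D tokenizer converts raw point clouds directly into discrete tokens, treating 3D data as a "foreign language" that extends the LLM vocabulary, and a preference optimization training strategy with a semantic alignment-based reward improves reasoning on complex 3D tasks.

Details

Motivation: Existing multimodal large language models built on pre-trained 3D encoders suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. The paper addresses these issues by processing raw point clouds end to end.

Result: Extensive experiments on multiple 3D understanding benchmarks show that the end-to-end approach outperforms existing encoder-based methods and offers significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations.

Insight: The novelty lies in a lightweight 3D tokenizer that treats point clouds as a "foreign language" and discretizes them directly, and in a preference optimization strategy with a semantic alignment reward for open-ended 3D question answering, enabling efficient end-to-end 3D understanding without a pre-trained 3D encoder.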

Abstract: Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens, treating 3D data as a foreign language that naturally extends the LLM’s vocabulary. Furthermore, to enhance the model’s reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: github.com/snehaputul/SAGE3D.

[43] MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data cs.CV | cs.LGPDF

Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu

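TL;DR: This paper proposes MM-Zero, a reinforcement learning-based zero-data self-evolving framework for improving the reasoning ability of vision language models (VLMs). It introduces a multi-role training framework (Proposer, Coder, Solver) in which all roles are initialized from the same base model and trained with carefully designed reward mechanisms, so that self-evolution can start without any seed data.

Details

Motivation: Existing self-evolving methods for VLMs usually need at least some seed data (such as images) to bootstrap the process. The paper explores fully zero-data VLM self-evolution to minimize human intervention and extend the self-improvement paradigm to the multimodal domain.

Result: Experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. The abstract does not name specific benchmarks or state whether SOTA is reached; it only reports improved performance.

Insight: The main novelty is the first zero-data self-evolving framework for VLMs, a multi-model self-evolving training paradigm with three specialized roles, and a reward mechanism that integrates execution feedback, visual verification, and difficulty balancing. This offers a scalable path for self-evolving multimodal models beyond the conventional two-model paradigm.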

Abstract: Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
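
Group Relative Policy Optimization (GRPO) scores each sampled response relative to the other responses in its group. A minimal sketch of that normalization step is shown below, assuming the commonly used form in which each reward is centered by the group mean and divided by the group standard deviation. The use of the population standard deviation, the `eps` term, and the function name are assumptions, and MM-Zero's actual reward design (execution feedback, visual verification, difficulty balancing) is not modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations may differ
    # eps avoids division by zero when every response in the group gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt: two judged correct (1.0) and two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When all responses in a group receive the same reward, every advantage is zero and that group contributes no learning signal, which is one reason a difficulty-balancing reward can matter in practice.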

[44] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy cs.CVPDF

Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu

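TL;DR: This paper proposes TubeMLLM, a foundation model for topology knowledge exploration in medical vessel-like anatomy. It strengthens topology-aware perception by combining structured language prompts with visual representations and is evaluated on TubeMData, a new multimodal benchmark. Experiments show strong performance in out-of-distribution generalization, zero-shot cross-modality transfer, and topology-aware understanding tasks.

Details

Motivation: Modeling medical vessel-like anatomy is challenging because of its intricate topology and sensitivity to dataset shifts, so task-specific models often produce topological inconsistencies. The paper builds a unified foundation model that exploits the zero-shot generalization potential of multimodal large language models to address these problems.

Result: Extensive experiments on fifteen diverse datasets support the method. On color fundus photography, it reduces the global topological discrepancy (β₀ error) from the baselines' 37.42 to 8.58, achieving state-of-the-art out-of-distribution performance. On unseen X-ray angiography, zero-shot cross-modality transfer achieves a Dice score of 67.50% and reduces the β₀ error to 1.21. In topology-aware understanding tasks, it reaches 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.

Insight: The novelty lies in a unified foundation model architecture that aligns topological priors, expressed as explicit natural language prompts, with visual representations, and in TubeMData, a pioneering topology-centric multimodal benchmark. The adaptive loss weighting strategy also emphasizes topology-critical regions during training, strengthening topology-consistent perception and generation.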

Abstract: Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate the superiority of our approach. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $\beta_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the $\beta_{0}$ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
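
The two metrics quoted above can be illustrated on toy binary masks. The sketch assumes that the β₀ error is the absolute difference in the number of connected components between prediction and ground truth, and that components use 8-connectivity. Both are common conventions but are not stated in the abstract, and the function names are hypothetical.

```python
from collections import deque


def count_components(mask):
    """Count 8-connected foreground components (the Betti number beta_0) of a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return count


def beta0_error(pred, target):
    """Absolute difference in component counts (assumed definition)."""
    return abs(count_components(pred) - count_components(target))


def dice_score(pred, target):
    """Dice = 2 * |intersection| / (|pred| + |target|) for masks of 0/1 values."""
    inter = sum(p and t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 2 * inter / total if total else 1.0


target = [[1, 1, 1, 1]]  # one continuous vessel
pred = [[1, 1, 0, 1]]    # the same vessel with a single-pixel break
```

Here `dice_score(pred, target)` is 6/7, a fairly high overlap, while `beta0_error(pred, target)` is 1 because the break splits the vessel in two. This is the kind of artificial disconnection that overlap metrics alone tend to hide.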

[45] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection cs.CVPDF

Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian

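TL;DR: This paper addresses the poor generalization of AI-generated image detectors built on Vision Foundation Models (VFMs) when they face unseen generation pipelines. It identifies, for the first time, a key failure mechanism called semantic fallback, in which detectors under distribution shift rely on pre-trained semantic priors rather than forgery traces. To counter it, the authors propose Geometric Semantic Decoupling (GSD), a parameter-free module that uses a frozen VFM as a semantic guide and a trainable VFM as an artifact detector, and explicitly removes semantic components from the learned representations through a geometric constraint, forcing the detector to rely on semantic-invariant forensic evidence. Experiments show that the method outperforms state-of-the-art approaches in cross-dataset evaluation, robustness to unseen manipulations, and detection of synthetic images of general scenes.

Details

Motivation: AI-generated image detection is increasingly important as generative AI advances rapidly, but detectors built on vision foundation models (e.g., CLIP) struggle to generalize to images created by unseen generation pipelines. The paper sets out to identify and address the key mechanism behind this failure, semantic fallback, in which detectors under distribution shift over-rely on pre-trained semantic priors rather than forgery-specific traces.

Result: Extensive experiments show that the method consistently outperforms state-of-the-art approaches across benchmarks: 94.4% video-level AUC in cross-dataset evaluation (+1.2%), a 3.0% gain in robustness to unseen manipulations on DF40, and gains of 0.9% and 1.7% on general-scene synthetic image detection (UniversalFakeDetect and GenImage, respectively), reaching SOTA performance.

Insight: The core novelty is identifying semantic fallback as a generalization failure mechanism and proposing the parameter-free GSD module to explicitly decouple semantic information from forgery traces. Viewed objectively, using a frozen VFM as a semantic guide to estimate and remove semantic directions forces the trainable detector to focus on semantic-invariant forensic evidence. This is a novel and effective strategy for improving detector generalization and is relevant to other foundation-model-based forensic tasks.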

Abstract: AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed semantic fallback, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).
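
Projecting semantic directions out of a feature vector can be sketched with plain vector arithmetic. The code below takes the semantic directions as given, orthonormalizes them with Gram-Schmidt, and subtracts each projection. How GSD estimates the directions from batch-wise statistics is not described in enough detail in the abstract to reproduce, so that step is omitted, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, tol=1e-12):
    """Gram-Schmidt: turn the given directions into an orthonormal basis."""
    basis = []
    for d in directions:
        w = list(d)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def remove_semantic_component(feature, semantic_directions):
    """Return the part of `feature` orthogonal to every semantic direction."""
    residual = list(feature)
    for b in orthonormalize(semantic_directions):
        coeff = dot(residual, b)
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return residual


# Toy example: the first axis plays the role of a "semantic" direction.
cleaned = remove_semantic_component([3.0, 4.0, 5.0], [[2.0, 0.0, 0.0]])
```

After the projection the residual has no component along any semantic direction, so a classifier trained on it cannot use that information, which is the stated purpose of the decoupling.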

[46] Towards Instance Segmentation with Polygon Detection Transformers cs.CVPDF

Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou

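TL;DR: This paper proposes the Polygon Detection Transformer (Poly-DETR), an instance segmentation method that reformulates instance segmentation as sparse vertex regression in a polar representation, removing the reliance on dense pixel-wise mask prediction. It introduces Polar Deformable Attention and a Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues.

Details

Motivation: The paper addresses the bottleneck created by the conflicting requirements of high-resolution inputs and lightweight, real-time inference in instance segmentation.

Result: On MS COCO test-dev, Poly-DETR improves on state-of-the-art polar-based methods by 4.7 mAP. On Cityscapes it cuts memory consumption by almost half. On PanNuke and SpaceNet it surpasses its mask-based counterpart on all metrics.

Insight: The core novelty is reformulating instance segmentation as sparse vertex regression in a polar representation, which lowers computational overhead and is especially lightweight in high-resolution scenarios. Polar Deformable Attention and the Position-Aware Training Scheme handle the box-to-polygon reference shift in Detection Transformers and strengthen the model's use of boundary information, giving an advantage on regular-shaped instances in domain-specific settings such as cells and buildings.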

Abstract: One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction. Considering the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on Cityscapes dataset. Notably, on PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, which validates its advantage on regular-shaped instances in domain-specific settings.
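
A polar representation describes an instance by a center point and a set of radii, so a mask reduces to a handful of numbers. The sketch below decodes such a representation into polygon vertices and measures the enclosed area. It assumes the rays are spaced at equal angles around the center, which the abstract does not specify, and the function names are hypothetical.

```python
import math


def polar_to_polygon(center, radii):
    """Decode a center and N radii (rays at equal angular steps) into N vertices."""
    cx, cy = center
    n = len(radii)
    return [
        (cx + r * math.cos(2 * math.pi * k / n), cy + r * math.sin(2 * math.pi * k / n))
        for k, r in enumerate(radii)
    ]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Four unit-length rays around the origin give a diamond with vertices on the axes.
vertices = polar_to_polygon((0.0, 0.0), [1.0, 1.0, 1.0, 1.0])
area = polygon_area(vertices)
```

The output size depends only on the number of rays and not on image resolution, which is consistent with the memory savings the abstract reports for high-resolution inputs.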

[47] Multi-model approach for autonomous driving: A comprehensive study on traffic sign-, vehicle- and lane detection and behavioral cloning cs.CV | cs.AIPDF

Kanishkha Jaisankar, Pranav M. Pawar, Diana Susane Joseph, Raja Muthalagu, Mithun Mukherjee

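TL;DR: This paper presents a multi-model approach for autonomous driving that combines pre-trained and custom-made neural networks for the key tasks of traffic sign classification, vehicle detection, lane detection, and behavioral cloning. It integrates data augmentation, image normalization, and transfer learning, and evaluates the models on several datasets.

Details

Motivation: The paper aims to use deep learning and computer vision to improve the perception and decision-making of self-driving cars, addressing traffic sign classification, lane prediction, vehicle detection, and behavioral cloning in order to improve the robustness and reliability of autonomous systems.

Result: The models are evaluated on the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected with the Udacity self-driving car simulator. The results indicate that the approach is effective on these tasks and provides a basis for safer and more efficient self-driving technologies.

Insight: The novelty lies in integrating several neural network models with data augmentation, transfer learning, and related techniques into a combined multi-task framework. Viewed objectively, this modular approach helps overall system performance and offers a scalable path toward practical deployment.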

Abstract: Deep learning and computer vision techniques have become increasingly important in the development of self-driving cars. These techniques play a crucial role in enabling self-driving cars to perceive and understand their surroundings, allowing them to safely navigate and make decisions in real time. Using neural networks, self-driving cars can accurately identify and classify objects such as pedestrians, other vehicles, and traffic signals. Using deep learning and analyzing data from sensors such as cameras and radar, self-driving cars can predict the likely movement of other objects and plan their own actions accordingly. This study presents a novel approach that enhances the performance of self-driving cars by using pre-trained and custom-made neural networks for key tasks, including traffic sign classification, vehicle detection, lane detection, and behavioral cloning. The methodology integrates several innovative techniques, such as geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction. These techniques are applied to diverse datasets, including the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected using the Udacity self-driving car simulator, to evaluate the model efficacy. The primary objective of the work is to review the state of the art in deep learning and computer vision for self-driving cars. The findings show that the approach is effective in solving various challenges related to self-driving cars, such as traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, and provide valuable insights into improving the robustness and reliability of autonomous systems, paving the way for future research and deployment of safer and more efficient self-driving technologies.

[48] Multimodal Graph Representation Learning with Dynamic Information Pathways cs.CVPDF

Xiaobin Hong, Mingkai Lin, Xiaoli Wang, Chaoqun Wang, Wenzhong Li

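TL;DR: This paper proposes DiP, a novel multimodal graph representation learning framework. By introducing modality-specific pseudo nodes, it enables proximity-guided dynamic message routing within each modality and captures inter-modality dependence through efficient information pathways in a shared state space, achieving adaptive, expressive, and sparse message propagation across modalities with linear complexity.

Details

Motivation: Multimodal graphs, whose nodes carry heterogeneous features such as images and text, are increasingly common in real-world applications. Most existing methods are extended from conventional graph neural networks and rely on static structures or dense attention, which limits the flexibility and expressiveness of node embedding learning.

Result: Experiments on link prediction and node classification across multiple benchmarks show that DiP consistently outperforms the baselines.

Insight: The novelty lies in the design of modality-specific pseudo nodes and dynamic information pathways, which give adaptive, efficient, and sparse message propagation across modalities; the linear complexity improves scalability.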

Abstract: Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are typically extended from conventional graph neural networks and rely on static structures or dense attention, which limit flexibility and expressive node embedding learning. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We conduct the link prediction and node classification tasks to evaluate performance and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.

Mingfei Han, Haihong Hao, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova

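TL;DR: This paper proposes a large-scale vision-and-language navigation framework built on web videos. It extracts natural walking trajectories from room tour videos, combines open-ended descriptions with action reconstruction, and introduces implicit geometry representations that extract spatial cues directly from RGB frames, substantially improving data utilization and navigation performance.

Details

Motivation: Existing vision-and-language navigation datasets are limited to simulator-generated data, lack diversity and scalability, and fail to reflect the complexity of real environments. Large-scale web videos are therefore needed to learn navigation in real indoor scenes.

Result: The method sets new state-of-the-art performance on multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) and supports the development of robust zero-shot navigation agents.

Insight: Extracting spatial information directly from RGB frames through implicit geometry representations avoids the fragile 3D reconstruction step, improves data utilization, and unlocks large amounts of previously unusable video data, moving embodied navigation toward more scalable, generalizable, and real-world applicable solutions.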

Abstract: Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.

[50] ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph cs.CVPDF

Junhao Cai, Deyu Zeng, Junhao Pang, Lini Li, Zongze Wu

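TL;DR: ForgeDreamer is a text-to-3D generation framework designed for industrial applications. It targets two key challenges that existing methods face in the industrial domain: cross-category knowledge interference and insufficient geometric structure reasoning. Using a Multi-Expert LoRA Ensemble mechanism and a Cross-View Hypergraph Geometric Enhancement approach, it achieves better semantic generalization and manufacturing-level geometric consistency.

Details

Motivation: Current text-to-3D methods perform well on natural scenes but are limited in industrial applications. The main problems are that conventional LoRA fusion causes knowledge interference across categories, and that pairwise consistency constraints cannot capture the higher-order structural dependencies required for precision manufacturing.

Result: Extensive experiments on a custom industrial dataset show that the method outperforms state-of-the-art approaches in semantic generalization and geometric fidelity.

Insight: The contributions are: 1) a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, eliminating knowledge interference and improving cross-category generalization; 2) a Cross-View Hypergraph Geometric Enhancement approach that uses hypergraph modeling to capture structural dependencies across multiple viewpoints simultaneously, ensuring manufacturing-level consistency. Together, these components combine semantic understanding with geometric reasoning.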

Abstract: Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer addressing both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: improved semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.

[51] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions cs.CVPDF

Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang

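TL;DR: This paper proposes SVOR (Stable Video Object Removal), a robust framework for stably removing objects from videos under real-world imperfections such as shadows, abrupt motion, and defective masks. It rests on three key designs: MUSE (Mask Union for Stable Erasure), DA-Seg (Denoising-Aware Segmentation), and curriculum two-stage training. Together they handle mask defects and remove objects along with their associated shadows and reflections while preserving temporal stability and visual consistency.

Details

Motivation: Existing diffusion-based video inpainting methods struggle to maintain temporal stability and visual consistency under real-world imperfections (shadows, abrupt motion, defective masks), so a more robust video object removal framework is needed.

Result: Extensive experiments show that SVOR achieves new state-of-the-art (SOTA) results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.

Insight: The contributions are: 1) MUSE, a windowed union strategy applied during temporal mask downsampling that preserves all target regions observed within each window, handling abrupt motion and reducing missed removals; 2) DA-Seg, a lightweight segmentation head equipped with Denoising-Aware AdaLN and trained with mask degradation, which provides an internal diffusion-aware localization prior without affecting content generation; 3) curriculum two-stage training, where Stage I performs self-supervised pretraining on unpaired real-background videos to learn realistic background and temporal priors, and Stage II fine-tunes on synthetic data with mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections and improving cross-domain robustness.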

Abstract: Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
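
The windowed union behind MUSE can be illustrated with a minimal sketch. The abstract does not give implementation details, so the function names, the list-of-lists mask format, and the strided baseline used for comparison are all assumptions made here for illustration, not the authors' code.

```python
from typing import List

Mask = List[List[int]]  # binary mask for one frame: rows of 0/1 values


def union_masks(masks: List[Mask]) -> Mask:
    """Pixel-wise logical OR over a non-empty list of same-sized masks."""
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(any(m[y][x] for m in masks)) for x in range(width)]
        for y in range(height)
    ]


def windowed_union_downsample(masks: List[Mask], window: int) -> List[Mask]:
    """Temporal downsampling that emits one unioned mask per window of frames."""
    return [
        union_masks(masks[start:start + window])
        for start in range(0, len(masks), window)
    ]


def strided_downsample(masks: List[Mask], window: int) -> List[Mask]:
    """Hypothetical baseline: keep only the first frame of each window."""
    return masks[::window]


# A 1x4 "video" in which the object moves one pixel to the right per frame.
frames = [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]], [[0, 0, 0, 1]]]
union = windowed_union_downsample(frames, window=2)
strided = strided_downsample(frames, window=2)
```

With fast motion, the strided baseline drops the pixels the object occupies in the skipped frames, whereas the union keeps every region seen inside each window, which is the property the abstract attributes to MUSE.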

[52] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models cs.CV | cs.CLPDF

Yueen Ma, Irwin King

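TL;DR: This paper proposes X-GS, an extensible open framework that unifies multiple 3D Gaussian Splatting (3DGS) architectures and connects them to downstream multimodal models. At its core is X-GS-Perceiver, an efficient pipeline that co-optimizes geometry and poses in real time from unposed RGB (or RGB-D) video streams and distills high-dimensional semantic features from vision foundation models into the 3D Gaussians, enabling semantically enriched online SLAM.

Details

Motivation: Most existing 3DGS methods are isolated and focus on specific domains (such as online SLAM, semantic enrichment, or unposed image processing); a unified framework that integrates these techniques and connects to downstream multimodal tasks is missing.

Result: Experiments on real-world datasets show that X-GS performs well in efficiency, efficacy, and newly unlocked multimodal capabilities, and runs in real time.

Insight: The contributions are: 1) a unified, extensible open framework that bridges multiple 3DGS techniques with downstream multimodal models; 2) the efficient X-GS-Perceiver pipeline, which integrates an online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design for real-time processing; 3) semantic feature distillation that lets the 3D Gaussians support vision-language models, unlocking downstream tasks such as object detection and zero-shot caption generation.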

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.

Shilei Wang, Pujian Lai, Dong Gao, Jifeng Ning, Gong Cheng

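TL;DR: This paper proposes MDTrack, a new multimodal object tracking framework that uses modality-aware fusion and decoupled temporal propagation to address the uniform fusion strategies and entangled temporal information of existing methods.

Details

Motivation: Existing multimodal trackers typically adopt a uniform fusion strategy that ignores the inherent differences between modalities, and they propagate temporal information through mixed tokens, which leads to entangled and less discriminative temporal representations.

Result: Both MDTrack S and MDTrack U achieve state-of-the-art performance across five multimodal tracking benchmarks.

Insight: The contributions are: dedicated experts assigned to the infrared, event, depth, and RGB modalities for modality-aware fusion, and two separate SSM structures that decouple temporal propagation for the RGB and X modal streams, with cross-attention modules enabling implicit information exchange and strengthening the use of temporal information.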

Abstract: Most existing multimodal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations. To address these limitations, we propose MDTrack, a novel framework for modality aware fusion and decoupled temporal propagation in multimodal object tracking. Specifically, for modality aware fusion, we allocate dedicated experts to each modality, including infrared, event, depth, and RGB, to process their respective representations. The gating mechanism within the Mixture of Experts dynamically selects the optimal experts based on the input features, enabling adaptive and modality specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model structures to independently store and update the hidden states of the RGB and X modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross attention modules between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone through another set of cross attention modules, enhancing MDTrack’s ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of our proposed method. Both MDTrack S and MDTrack U achieve state of the art performance across five multimodal tracking benchmarks.

[54] DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction cs.CV | cs.AIPDF

Fuzhen Jiang, Zhuoran Li, Yinlin Zhang

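TL;DR: This paper proposes DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images that reconstructs 3D scenes and synthesizes novel views from noisy inputs. It builds a large-scale, scene-consistent noisy–clean benchmark on RE10K and trains end-to-end using only clean 2D renderings as supervision, with no 3D ground truth.

Details

Motivation: Existing NeRF and 3D Gaussian Splatting pipelines usually assume clean inputs and degrade under real-world noise and artifacts. This paper addresses robust 3D scene reconstruction and novel-view synthesis from noisy multi-view images.

Result: On the noisy RE10K benchmark, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR, SSIM, and LPIPS across noise types and intensities (Gaussian, Poisson, speckle, and salt-and-pepper).

Insight: The paper's novelty lies in proposing the first feed-forward Gaussian splatting framework dedicated to noisy 3D reconstruction, and in building a controllable large-scale noise benchmark for training and evaluation. Its strategy of end-to-end training with only 2D supervision and no 3D ground truth offers an efficient, practical way to handle noisy real-world data.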

Abstract: 3D scene reconstruction and novel-view synthesis are fundamental for VR, robotics, and content creation. However, most NeRF and 3D Gaussian Splatting pipelines assume clean inputs and degrade under real noise and artifacts. We therefore propose DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images. We build a large-scale, scene-consistent noisy–clean benchmark on RE10K by injecting Gaussian, Poisson, speckle, and salt-and-pepper noise with controlled intensities. With a lightweight MVSplat-style feed-forward backbone, we train end-to-end using only clean 2D renderings as supervision and no 3D ground truth. On noisy RE10K, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR/SSIM and LPIPS across noise types and levels.
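
As a rough illustration of injecting noise with controlled intensities, the sketch below corrupts a small grayscale image with Gaussian, speckle, and salt-and-pepper noise. It is not the benchmark's pipeline: the function names, the [0, 1] pixel range, and the parameterization (`sigma`, `amount`) are assumptions, and Poisson noise is omitted because the Python standard library has no Poisson sampler.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image with values in [0, 1]


def clip01(x: float) -> float:
    """Clamp a pixel value to the valid [0, 1] range."""
    return min(1.0, max(0.0, x))


def add_gaussian(img: Image, sigma: float, rng: random.Random) -> Image:
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return [[clip01(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def add_speckle(img: Image, sigma: float, rng: random.Random) -> Image:
    """Multiplicative noise: each pixel is scaled by (1 + Gaussian sample)."""
    return [[clip01(p * (1.0 + rng.gauss(0.0, sigma))) for p in row] for row in img]


def add_salt_and_pepper(img: Image, amount: float, rng: random.Random) -> Image:
    """Set roughly a fraction `amount` of pixels to 0 (pepper) or 1 (salt)."""
    out = []
    for row in img:
        new_row = []
        for p in row:
            u = rng.random()
            if u < amount / 2:
                new_row.append(0.0)
            elif u < amount:
                new_row.append(1.0)
            else:
                new_row.append(p)
        out.append(new_row)
    return out


rng = random.Random(0)  # fixed seed so the corruption is reproducible
clean = [[0.2, 0.4], [0.6, 0.8]]
noisy_gauss = add_gaussian(clean, sigma=0.1, rng=rng)
noisy_speckle = add_speckle(clean, sigma=0.1, rng=rng)
noisy_sp = add_salt_and_pepper(clean, amount=1.0, rng=rng)
```

The single intensity parameter per noise type is what makes the corruption level controllable; a fixed seed is one simple way to keep the corruption reproducible across runs.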

[55] IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework cs.CVPDF

Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li

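TL;DR: This paper proposes IntroSVG, a framework in which a unified vision-language model acts as an introspective generator and critic, bringing rendered visual feedback into text-to-SVG generation. It targets the core problem that existing autoregressive methods lack visual perception of the final image, which limits generation quality.

Details

Motivation: Existing text-to-SVG methods do not consider the visual appearance of the final rendered image during autoregressive training, which fundamentally constrains generation quality. A mechanism that integrates visual feedback is needed to improve the structural complexity, semantic alignment, and editability of generated SVGs.

Result: Experiments show that the method achieves state-of-the-art performance on several key evaluation metrics, producing SVGs with more complex structures, stronger semantic alignment, and greater editability.

Insight: The novelty lies in a closed-loop unified vision-language model that serves as both generator and critic. It integrates explicit visual feedback through supervised fine-tuning, conversion of early-stage failures into error-correction data, and Direct Preference Optimization, yielding an iterative "generate-review-refine" self-improvement loop.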

Abstract: Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator’s policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative “generate-review-refine” cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.

[56] EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning cs.CV | cs.AI | cs.CLPDF

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai

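TL;DR: This paper introduces EXPLORE-Bench, a new benchmark for evaluating the long-horizon physical reasoning of multimodal large language models from an egocentric viewpoint. Given an initial-scene image and a sequence of atomic action descriptions, a model must predict the final scene after all actions are executed. The benchmark is drawn from real first-person videos and includes fine-grained structured annotations. Experiments show a significant gap between existing MLLMs and humans, and long-horizon egocentric reasoning remains a major challenge.

Details

Motivation: Although multimodal large language models are regarded as a foundation for embodied agents, it is unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. The paper studies this capability gap systematically through a new task and benchmark.

Result: Experiments on EXPLORE-Bench with a range of proprietary and open-source MLLMs reveal a significant performance gap to humans. A test-time scaling analysis via stepwise reasoning shows that decomposing long action sequences improves performance to some extent but incurs non-trivial computational overhead.

Insight: The paper's novelty is EXPLORE-Bench, the first benchmark focused on long-horizon egocentric scene prediction, with structured, fine-grained annotations that support quantitative evaluation. The work pinpoints a core weakness of current MLLMs in long-horizon physical reasoning and provides a principled testbed for future research.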

Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.

[57] CLoE: Expert Consistency Learning for Missing Modality Segmentation cs.CV | cs.AI | cs.LGPDF

Xinyu Tong, Meihua Zhou, Bowu Fan, Haitao Li

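TL;DR: This paper proposes CLoE (Consistency Learning of Experts), a framework for handling missing modalities in multimodal medical image segmentation. It uses a dual-branch Expert Consistency Learning objective to control the consistency of expert predictions at the decision level, and a lightweight gating network that maps consistency scores to modality reliability weights for reliability-aware feature recalibration before fusion.

Details

Motivation: Multimodal medical image segmentation often faces missing modalities at inference, which causes inconsistent predictions among modality experts and destabilizes fusion, especially on small foreground structures.

Result: Extensive experiments on BraTS 2020 and MSD Prostate show that CLoE outperforms state-of-the-art methods in incomplete multimodal segmentation, with strong cross-dataset generalization and improved robustness on clinically critical structures.

Insight: The novelty lies in formulating robustness as decision-level expert consistency control, with two objectives: global Modality Expert Consistency and Region Expert Consistency on clinically critical foreground regions, which avoids background-dominated regularization. A lightweight gating network maps consistency scores to reliability weights, enabling reliability-aware feature fusion.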

Abstract: Multimodal medical image segmentation often faces missing modalities at inference, which induces disagreement among modality experts and makes fusion unstable, particularly on small foreground structures. We propose Consistency Learning of Experts (CLoE), a consistency-driven framework for missing-modality segmentation that preserves strong performance when all modalities are available. CLoE formulates robustness as decision-level expert consistency control and introduces a dual-branch Expert Consistency Learning objective. Modality Expert Consistency enforces global agreement among expert predictions to reduce case-wise drift under partial inputs, while Region Expert Consistency emphasizes agreement on clinically critical foreground regions to avoid background-dominated regularization. We further map consistency scores to modality reliability weights using a lightweight gating network, enabling reliability-aware feature recalibration before fusion. Extensive experiments on BraTS 2020 and MSD Prostate demonstrate that CLoE outperforms state-of-the-art methods in incomplete multimodal segmentation, while exhibiting strong cross-dataset generalization and improving robustness on clinically critical structures.
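
The idea of turning expert agreement into modality reliability weights can be sketched as follows. This is an assumption-laden illustration: the abstract does not specify the consistency measure or the gating network, so Dice overlap is used as a stand-in agreement score and a fixed temperature softmax replaces the learned gating network. All function names are hypothetical.

```python
import math
from typing import List


def dice(a: List[int], b: List[int]) -> float:
    """Dice overlap between two flattened binary masks (1.0 if both are empty)."""
    inter = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * inter / total


def consistency_scores(preds: List[List[int]]) -> List[float]:
    """Mean agreement of each expert with every other expert (needs >= 2 experts)."""
    scores = []
    for i, p in enumerate(preds):
        others = [dice(p, q) for j, q in enumerate(preds) if j != i]
        scores.append(sum(others) / len(others))
    return scores


def reliability_weights(scores: List[float], temperature: float = 0.1) -> List[float]:
    """Softmax over scores; a fixed stand-in for the learned gating network."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


preds = [
    [1, 1, 0, 0],  # expert A
    [1, 1, 0, 0],  # expert B agrees with A
    [0, 0, 1, 1],  # expert C disagrees with both
]
scores = consistency_scores(preds)
weights = reliability_weights(scores)
```

In this toy case the dissenting expert receives a near-zero weight, so its features would be down-weighted before fusion. A learned gate could of course map scores to weights in a less monotone way than this softmax does.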

Aodi Wu, Jianhong Zuo, Zeyuan Zhao, Xubo Luo, Ruisuo Wang

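TL;DR: This paper presents SpaceSense-Bench, a large-scale multimodal benchmark dataset for spacecraft perception and pose estimation. It covers 136 satellite models and about 70 GB of data, and provides time-synchronized RGB images, high-precision depth maps, and LiDAR point clouds, along with dense part-level semantic labels and 6-DoF pose ground truth. The data is generated with a high-fidelity simulation, and five representative tasks are evaluated. Small-component perception and zero-shot generalization turn out to be the bottlenecks of current methods, while scaling up training data substantially improves performance.

Details

Motivation: Autonomous space operations (such as on-orbit servicing and active debris removal) require robust part-level semantic understanding and precise relative navigation, but collecting large-scale real data in orbit is costly and impractical. Existing synthetic datasets suffer from limited target diversity, single-modality sensing, and incomplete annotations.

Result: Five benchmark tasks are evaluated (object detection, 2D semantic segmentation, RGB-LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation). Key findings: perceiving small-scale components (such as thrusters and omni-antennas) and zero-shot generalization to entirely unseen spacecraft remain bottlenecks for current methods, and increasing the number of training satellites substantially improves performance on novel targets.

Insight: The novelty is the first large-scale, multimodal spacecraft perception benchmark with dense part-level semantics and accurate pose annotations. Its high-fidelity simulation pipeline, automated quality control, and multi-task evaluation give space perception research a valuable resource and a clear analysis of performance bottlenecks, underscoring the importance of data scale and diversity for generalization.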

Abstract: Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB–LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small-scale components (e.g., thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense-Bench.

[59] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models cs.CVPDF

Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma

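TL;DR: This paper introduces OddGridBench, a benchmark for evaluating how well multimodal large language models perceive fine-grained visual discrepancies, and finds that existing models perform far below human level. The authors also propose OddGrid-GRPO, a reinforcement learning framework that uses curriculum learning and a distance-aware reward to substantially improve visual discrepancy detection.

Details

Motivation: Multimodal large language models excel at high-level vision-language tasks, but their low-level visual perception, particularly the ability to detect fine-grained visual discrepancies, remains underexplored and lacks systematic analysis.

Result: On OddGridBench, all evaluated models, including Qwen3-VL, InternVL3.5, Gemini-2.5-Pro, and GPT-5, perform far below human level in visual discrepancy detection. The proposed OddGrid-GRPO framework significantly improves fine-grained visual discrimination.

Insight: The novelty lies in a controllable, grid-image-based benchmark for fine-grained visual discrepancy detection and a reinforcement learning framework that combines curriculum learning with a reward constrained by spatial proximity, laying groundwork for research on perceptual grounding in multimodal intelligence.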

Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model’s fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
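
One way a distance-aware reward over grid cells could look is sketched below. The abstract states only that spatial proximity enters the reward design; the Gaussian decay, the `sigma` parameter, and the function name are assumptions chosen for illustration and should not be read as the OddGrid-GRPO formula.

```python
import math
from typing import Tuple

Cell = Tuple[int, int]  # (row, column) index of a grid element


def distance_aware_reward(pred: Cell, target: Cell, sigma: float = 1.0) -> float:
    """Hypothetical reward: 1.0 for the exact cell, decaying with grid distance."""
    dist = math.hypot(pred[0] - target[0], pred[1] - target[1])
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))


target = (2, 3)
exact = distance_aware_reward((2, 3), target)
near = distance_aware_reward((2, 4), target)
far = distance_aware_reward((0, 0), target)
```

Compared with a binary hit-or-miss signal, a shaped reward of this kind gives partial credit to near misses, which is a common motivation for proximity terms in reinforcement learning rewards.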

[60] Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments cs.CV | cs.AIPDF

Yang Li, Xing Chen, Yutao Liu, Gege Qi, Yanxian BI

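TL;DR: This paper introduces the STAR benchmark, a multi-agent evaluation framework for assessing the strategic reasoning and rapid decision-making of large language models in zero-sum competitive environments. It supports both turn-based and real-time settings, tests iterative, adaptive decision-making through 1v1 adversarial interactions, and adds a Strategic Evaluation Suite that goes beyond win-loss outcomes to analyze the quality of strategic behavior.

Details

Motivation: Existing evaluations mostly treat reasoning as a single-shot capability and overlook opponent-aware decision-making, temporal constraints, and execution under pressure. An evaluation of LLMs as interactive agents in adversarial, time-sensitive environments is therefore needed.

Result: Extensive pairwise evaluations reveal a pronounced strategy-execution gap: reasoning-intensive models dominate turn-based settings, but their inference latency leads to worse performance in real-time scenarios, where faster instruction-tuned models prevail.

Insight: The novelty lies in modeling reasoning as an iterative, adaptive decision-making process and evaluating long-horizon strategic planning and fast tactical execution in a unified way. The analysis shows that strategic intelligence in interactive environments depends not only on reasoning depth but also on the ability to translate plans into timely actions; STAR offers a principled framework for studying this trade-off.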

Abstract: Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.

[61] MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification cs.CV | cs.AIPDF

Nikola Jovišić, Milica Škipina, Nicola Dall’Asen, Dubravko Ćulibrk

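TL;DR: This paper proposes MIL-PF (Multiple Instance Learning on Precomputed Features), a scalable framework for mammography classification. It combines frozen foundation model encoders with a lightweight multiple instance learning head. By precomputing semantic representations and training only a small task-specific aggregation module, it enables efficient experimentation and adaptation without retraining large backbones.

Details

Motivation: Adapting modern foundation models to high-resolution medical imaging such as mammography is challenging because of limited annotations, weak supervision, large images, variable multi-view studies, and predominantly breast-level labels, all of which make end-to-end fine-tuning computationally expensive and often impractical.

Result: MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity.

Insight: The novelty is a scalable framework that pairs frozen pretrained encoders with a lightweight multiple instance learning head. Attention-based aggregation explicitly models global tissue context and sparse local lesion signals, and only a small aggregation module of about 40k parameters is trained, which greatly improves efficiency while maintaining high performance.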

Abstract: Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing the semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models the global tissue context and the sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
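
Attention-based aggregation over precomputed instance features can be sketched in a few lines. This follows the generic attention-pooling formulation for multiple instance learning (a softmax over per-instance scores, then a weighted sum); it is not the MIL-PF head itself, and the projection `V`, scoring vector `w`, and toy features are hypothetical values chosen for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(instances: List[Vector], V: List[Vector], w: Vector) -> Tuple[Vector, Vector]:
    """Return (bag_embedding, attention_weights) for one bag of instance features."""
    # Score each instance as w . tanh(V h), then normalize scores across the bag.
    logits = [
        sum(wi * math.tanh(hi) for wi, hi in zip(w, matvec(V, h)))
        for h in instances
    ]
    attn = softmax(logits)
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attn, instances)) for d in range(dim)]
    return bag, attn


# Three precomputed 2-D features standing in for frozen-encoder outputs.
instances = [[0.0, 0.0], [0.1, 0.0], [3.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned projection
w = [2.0, 2.0]                # hypothetical learned scoring vector
bag, attn = attention_pool(instances, V, w)
```

Because the encoder outputs are computed once and cached, only `V`, `w`, and a classifier on `bag` would need gradients, which is consistent with the small trainable parameter count the abstract reports.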

Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou

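TL;DR: This paper proposes EventVGGT, a framework for event-camera monocular depth estimation that addresses the inconsistent and inaccurate predictions caused by scarce dense depth annotations and by ignoring the temporal continuity of event streams. It models the event stream as a coherent video sequence and, for the first time, distills spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain, using a tri-level distillation strategy of cross-modal feature mixture, spatio-temporal feature distillation, and temporal consistency distillation.

Details

Motivation: Event cameras offer superior sensitivity under high-speed motion and extreme lighting, but progress in event-based monocular depth estimation is limited by the scarcity of dense depth annotations. Existing annotation-free methods mitigate this by distilling knowledge from Vision Foundation Models (VFMs), but they process event streams as independent frames. This ignores the inherent temporal continuity of event data and cannot exploit the rich temporal priors in VFMs, which leads to temporally inconsistent and less accurate depth predictions.

Result: On EventScape, EventVGGT reduces the absolute mean depth error at 30 m by over 53% (from 2.30 to 1.06), significantly outperforming existing methods. It also shows strong zero-shot generalization on the unseen DENSE and MVSEC datasets.

Insight: The core novelty is explicitly modeling the event stream as a coherent video sequence and distilling spatio-temporal and multi-view geometric priors from VGGT, which the authors describe as a first. The tri-level distillation strategy (cross-modal feature mixture, spatio-temporal feature distillation, temporal consistency distillation) systematically bridges the modality gap and enforces temporal consistency, offering a new way for event data to use the knowledge in powerful pretrained RGB models.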

Abstract: Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT’s powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods – reducing the absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06) – while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
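
Aligning inter-frame depth changes, as described for Temporal Consistency Distillation, can be illustrated with a small loss function. The abstract does not give the exact loss, so the mean absolute difference used here, the flattened per-frame depth lists, and the function names are assumptions.

```python
from typing import List

DepthSeq = List[List[float]]  # one flattened depth map per frame


def frame_deltas(depths: DepthSeq) -> DepthSeq:
    """Per-pixel depth change between each pair of consecutive frames."""
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(depths, depths[1:])
    ]


def temporal_consistency_loss(student: DepthSeq, teacher: DepthSeq) -> float:
    """Mean absolute gap between student and teacher inter-frame depth changes."""
    s, t = frame_deltas(student), frame_deltas(teacher)
    diffs = [abs(x - y) for sf, tf in zip(s, t) for x, y in zip(sf, tf)]
    return sum(diffs) / len(diffs)


teacher = [[1.0, 2.0], [1.5, 2.0], [2.0, 2.5]]
# Same motion as the teacher, but with a constant depth offset.
shifted = [[d + 0.3 for d in frame] for frame in teacher]
# Correct first and last frames, but a flickering pixel in the middle frame.
flicker = [[1.0, 2.0], [2.5, 2.0], [2.0, 2.5]]

loss_shifted = temporal_consistency_loss(shifted, teacher)
loss_flicker = temporal_consistency_loss(flicker, teacher)
```

A loss on frame-to-frame changes ignores a constant offset but penalizes flicker, so in a full system it would complement, rather than replace, per-frame depth supervision.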

[63] ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts cs.CV | cs.AIPDF

Yaping Zhang, Yupu Liang, Zhiyang Zhang, Zhiyuan Chen, Lu Xiang

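TL;DR: This paper reports on the ICDAR 2025 competition on end-to-end document image machine translation, which aims to advance research on translating documents with complex layouts. The competition has two tracks, OCR-free and OCR-based, each with small-model and large-model subtasks, and attracted 69 teams in total. The report covers the competition's motivation, dataset, task definitions, evaluation protocol, and a summary of results, and notes that large-model approaches establish a promising new paradigm for translating complex-layout document images.

Details

Motivation: To advance research on end-to-end document image machine translation, a task that bridges OCR and NLP by jointly modeling textual content and page layout in order to handle documents with complex layouts.

Result: The competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions; Track 2 had 35 teams and 14 valid submissions. The analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images.

Insight: With its OCR-free and OCR-based tracks and its small- and large-model subtasks, the competition systematically explores how different technical approaches perform on DIMT. Its core contribution is tightly coupling end-to-end translation with complex document layout understanding and highlighting the potential of large models for this multimodal task, pointing to directions for future research.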

Abstract: Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.

[64] Reviving ConvNeXt for Efficient Convolutional Diffusion Models cs.CV | cs.AI | cs.LGPDF

Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara

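TL;DR: This paper proposes the fully convolutional diffusion model (FCDM), a conditional diffusion model built on a ConvNeXt-style architecture, to explore the efficiency advantages of convolutional networks in generative modeling. Using only half the compute of DiT-XL/2, FCDM-XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively, and can be trained on a 4-GPU system.

Details

Motivation: Current diffusion models favor Transformer backbones, while the locality bias, parameter efficiency, and hardware friendliness of convolutional networks have seen little exploration in modern generative modeling. The paper studies modern convolutional designs as a competitive alternative for scaling diffusion models efficiently.

Result: On ImageNet at 256×256 and 512×512 resolutions, FCDM-XL uses only 50% of the FLOPs of DiT-XL/2 and reaches competitive performance with 7× and 7.5× fewer training steps, respectively. It can be trained on a 4-GPU system.

Insight: The paper revives the ConvNeXt architecture as a simple yet powerful building block for efficient generative modeling, showing that a fully convolutional design delivers competitive performance and excellent training efficiency in diffusion models and offering a new architectural option for efficient generative models.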

Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness (the attributes that established ConvNets as the efficient vision backbone) have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.

[65] RiO-DETR: DETR for Real-time Oriented Object Detection cs.CVPDF

Zhangchi Hu, Yifan Zhao, Yansong Peng, Wenzhang Sun, Xiangchen Yin

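TL;DR: This paper proposes RiO-DETR, described as the first DETR model for real-time oriented object detection. Through task-native designs (Content-Driven Angle Estimation, Decoupled Periodic Refinement, and Oriented Dense O2O supervision), it addresses the challenges of adapting DETR to oriented bounding boxes (OBBs): semantics-dependent orientation, angle periodicity, and an enlarged search space that slows convergence. It achieves a new speed-accuracy trade-off while remaining real-time.

Details

Motivation: Applying the DETR framework to oriented object detection faces three challenges: orientation prediction depends on semantic content, angle periodicity breaks standard Euclidean refinement, and the enlarged search space slows convergence. The paper aims to solve these problems for real-time, efficient oriented detection.

Result: Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 show that RiO-DETR establishes a new speed-accuracy trade-off (that is, a better trade-off curve) for real-time oriented object detection.

Insight: The contributions are: 1) Content-Driven Angle Estimation, which decouples angle from positional queries, combined with Rotation-Rectified Orthogonal Attention to capture reliable orientation cues; 2) Decoupled Periodic Refinement, which combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss for stable learning across angular seams; 3) Oriented Dense O2O, which injects angular diversity into dense supervision to speed up angle convergence at no extra cost. All are tailored to the oriented detection task.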

Abstract: We present RiO-DETR: DETR for Real-time Oriented Object Detection, the first real-time oriented detection transformer to the best of our knowledge. Adapting DETR to oriented bounding boxes (OBBs) poses three challenges: semantics-dependent orientation, angle periodicity that breaks standard Euclidean refinement, and an enlarged search space that slows convergence. RiO-DETR resolves these issues with task-native designs while preserving real-time efficiency. First, we propose Content-Driven Angle Estimation by decoupling angle from positional queries, together with Rotation-Rectified Orthogonal Attention to capture complementary cues for reliable orientation. Second, Decoupled Periodic Refinement combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss for stable learning across angular seams. Third, Oriented Dense O2O injects angular diversity into dense supervision to speed up angle convergence at no extra cost. Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 demonstrate RiO-DETR establishes a new speed–accuracy trade-off for real-time oriented detection. Code will be made publicly available.
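
The angular-seam problem that motivates a shortest-path periodic loss can be shown with a small sketch. The abstract does not define the loss, so the L1 form below, the function names, and the assumption that box angles repeat with period π (a rectangle rotated by 180° is unchanged) are illustrative choices only.

```python
import math


def shortest_angle_diff(pred: float, target: float, period: float = math.pi) -> float:
    """Signed smallest difference pred - target, wrapped into [-period/2, period/2)."""
    return (pred - target + period / 2) % period - period / 2


def periodic_l1_loss(pred: float, target: float, period: float = math.pi) -> float:
    """L1 penalty measured along the shortest path around the angular circle."""
    return abs(shortest_angle_diff(pred, target, period))


# Two nearly identical orientations on opposite sides of the seam.
pred = math.radians(89.0)
target = math.radians(-89.0)

naive = abs(pred - target)               # treats the boxes as 178 degrees apart
wrapped = periodic_l1_loss(pred, target)  # measures the true 2-degree gap
```

A plain Euclidean difference would push the prediction the long way around the circle, while the wrapped difference yields a small, correctly signed correction across the seam.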

[66] CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation cs.CVPDF

Bohao Li, Zhicheng Cao, Huixian Li, Yangming Guo

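TL;DR: This paper proposes CIGPose, a causal intervention graph neural network framework for whole-body pose estimation. It uses a Structural Causal Model (SCM) to formalize how visual context acts as a confounder that biases predictions, introduces a Causal Intervention Module to approximate the true causal effect between visual evidence and pose, and processes the deconfounded representations with a hierarchical graph neural network to improve anatomical plausibility.

Details

Motivation: State-of-the-art whole-body pose estimators lack robustness in challenging scenes and often produce anatomically implausible predictions. The authors attribute this to spurious correlations learned from visual context and formalize it as a confounding problem using a Structural Causal Model.

Result: CIGPose sets a new state of the art (SOTA) on the COCO-WholeBody benchmark. Specifically, CIGPose-x reaches 67.0% AP, surpassing prior methods that rely on extra training data; with the additional UBody dataset, performance rises further to 67.5% AP, demonstrating superior robustness and data efficiency.

Insight: The core innovation is bringing causal intervention into pose estimation: confounded keypoint representations are identified via predictive uncertainty and replaced with learned, context-invariant canonical embeddings to block the non-causal backdoor path, and a hierarchical graph neural network then reasons at local and global semantic levels, systematically improving anatomical plausibility and robustness.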

Abstract: State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model’s reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency. The codes and models are publicly available at https://github.com/53mins/CIGPose.

[67] Open-World Motion Forecasting cs.CV | cs.AI | cs.ROPDF

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

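TL;DR: This paper introduces open-world motion forecasting, a new setting that targets forecasting difficulties in autonomous driving caused by imperfect perception and object classes that change over time. The authors design the first end-to-end class-incremental motion forecasting framework, which mitigates catastrophic forgetting through a pseudo-labeling strategy and a replay sampling strategy based on query feature variance, and predicts future trajectories directly from camera images.

Details

Motivation: Existing motion forecasting methods operate under a closed-world assumption and rely on a fixed object taxonomy and high-quality perception, so they struggle in real-world settings where perception is imperfect and the object taxonomy evolves over time.

Result: Extensive evaluation on the nuScenes and Argoverse 2 datasets shows that the approach resists catastrophic forgetting, maintaining performance on previously learned classes while improving adaptation to novel ones, and supports zero-shot transfer to real-world driving as well as end-to-end class-incremental planning.

Insight: The novelty lies in defining the open-world motion forecasting task for the first time and proposing an end-to-end class-incremental learning framework that combines pseudo-label filtering (using a vision-language model) with replay sampling based on query feature variance, enabling continual adaptation to new classes while retaining prior knowledge.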

Abstract: Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes which are then processed by a vision-language model to filter inconsistent and over-confident predictions. Parallelly, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de .

[68] EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation cs.CV | cs.AIPDF

Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Wang Zijian

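TL;DR: This paper proposes EvoDriveVLA, a collaborative perception-planning distillation framework for vision-language-action models in autonomous driving. Using self-anchored visual distillation and oracle-guided trajectory distillation, it addresses degraded perception after unfreezing the visual encoder and accumulated instability in long-term planning, achieving strong performance in both open-loop and closed-loop evaluation.

Details

Motivation: Existing vision-language-action models face two major challenges in autonomous driving: perception degrades after the visual encoder is unfrozen, and instability accumulates in long-term planning.

Result: EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly improves performance in closed-loop evaluation.

Insight: The innovation is a collaborative perception-planning distillation framework that combines visual anchoring constraints from a self-anchor teacher with trajectory optimization guided by an oracle teacher, improving robustness through trajectory-guided key-region awareness, coarse-to-fine trajectory refinement, and Monte Carlo dropout sampling.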

Abstract: Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, thereby selecting the optimal trajectory to guide the student’s prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.

[69] TopoOR: A Unified Topological Scene Representation for the Operating Room cs.CVPDF

Tony Danjun Wang, Ka Young Kim, Tolga Birdal, Nassir Navab, Lennart Bastian

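TL;DR: This paper proposes TopoOR, a new topological scene representation for modeling the operating room (OR). It models multimodal entities in the OR and their relations as a higher-order structure, going beyond the dyadic limitation of traditional scene graphs and natively preserving both pairwise and group relationships. The paper also proposes a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. Experiments show that the approach outperforms traditional graph and LLM-based baselines on sterility breach detection, robot phase prediction, and next-action anticipation.

Details

Motivation: Existing Surgical Scene Graph methods abstract the OR into entities and their relations but are limited by a strictly dyadic structure. Frameworks that rely mainly on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structural information. A new paradigm that natively models complex dynamics and multimodality is therefore needed.

Result: Extensive experiments were conducted on sterility breach detection, robot phase prediction, and next-action anticipation. TopoOR outperforms traditional graph models and LLM-based baselines.

Insight: The core innovation is a higher-order topological structure for representing OR scenes, which is more expressive than traditional scene graphs. Lifting interactions between entities into higher-order topological cells allows complex dynamics and multimodality to be modeled natively. In addition, the proposed higher-order attention mechanism avoids forcing 3D geometry, audio, and robot kinematics into a single latent representation, preserving the multimodal structure required for safety-critical reasoning; this is the key difference from existing methods.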

Abstract: Surgical Scene Graphs abstract the complexity of surgical operating rooms (OR) into a structure of entities and their relations, but existing paradigms suffer from strictly dyadic structural limitations. Frameworks that predominantly rely on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, we circumvent combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning, unlike existing methods. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.

[70] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks cs.CVPDF

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang

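TL;DR: This paper introduces OmniEarth, a systematic benchmark for evaluating remote sensing vision-language models (RSVLMs) under realistic Earth observation scenarios. The benchmark defines 28 fine-grained tasks along three capability dimensions (perception, reasoning, and robustness), covers multi-source remote sensing data and diverse geospatial contexts, supports both multiple-choice VQA and open-ended VQA, and adopts a blind test protocol and a quintuple semantic consistency requirement to reduce linguistic bias.

Details

Motivation: Although vision-language models (VLMs) show effective perception and reasoning on general-domain tasks, their application to Earth observation lacks a systematic evaluation benchmark; this paper aims to fill that gap.

Result: On OmniEarth, which contains 9,275 high-quality images and 44,210 manually verified instructions, the authors systematically evaluate contrastive learning-based models, general closed-source and open-source VLMs, and RSVLMs. The results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that remote sensing applications need to address.

Insight: The innovation is a comprehensive, multi-dimensional benchmark with multiple task formulations for evaluating remote sensing VLMs, together with a blind test protocol and a strict semantic consistency requirement that ensure rigorous evaluation, giving the field clear evaluation criteria and direction.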

Abstract: Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.

[71] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity cs.CVPDF

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu

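TL;DR: This paper proposes PruneSID, a training-free visual token compression method that addresses the computational inefficiency caused by redundant visual tokens in vision-language models (VLMs). With a two-stage pipeline (semantic clustering and intra-group redundancy removal) and a dynamic compression ratio mechanism, it substantially reduces the number of tokens while preserving key information, and achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing VLMs are computationally inefficient because they generate too many redundant visual tokens, and existing compression methods struggle to balance preserving important information with maintaining information diversity.

Result: On LLaVA-1.5, the method reaches 96.3% accuracy while retaining only 11.1% of the tokens; on LLaVA-NeXT, it reaches 92.8% accuracy even at an extreme compression rate of 5.6%, outperforming prior methods by 2.5%, with prefilling 7.8 times faster than the original model.

Insight: The innovation is a training-free two-stage compression framework that jointly considers importance and diversity (PSCA clustering and intra-group NMS), combined with a dynamic compression ratio mechanism based on image complexity, which generalizes efficiently across multiple VLMs and both image and video modalities.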

Abstract: Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8$\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
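
To make the second stage more concrete, here is a minimal Python sketch of greedy non-maximum suppression over the tokens of one group. It is an assumption-laden illustration, not the PruneSID implementation: the cosine-similarity criterion, the `sim_threshold` value, and the per-token importance scores are all hypothetical, and the PSCA clustering stage and dynamic compression ratio are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def token_nms(tokens, scores, sim_threshold=0.9):
    """Greedy NMS within one group (hypothetical criterion).

    Visit tokens in descending score order and keep a token only if its
    cosine similarity to every already kept token is <= sim_threshold.
    Returns the kept indices in ascending order.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(cosine(tokens[i], tokens[j]) <= sim_threshold for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy group: token 1 is nearly parallel to token 0, token 2 is orthogonal.
group = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
importance = [0.9, 0.8, 0.5]
print(token_nms(group, importance))
```

In the toy group, the near-duplicate of the highest-scoring token is suppressed while the dissimilar token survives, which is the importance-plus-diversity behavior the abstract describes at a high level.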

[72] Streaming Autoregressive Video Generation via Diagonal Distillation cs.CVPDF

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang

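TL;DR: This paper proposes Diagonal Distillation, a new method for real-time streaming autoregressive video generation. With an asymmetric generation strategy (more denoising steps for early chunks, fewer for later chunks) and implicit optical flow modeling, it addresses the insufficient use of temporal dependencies, error accumulation, and latency-quality trade-off of existing video distillation methods, delivering a large speedup while maintaining high quality.

Details

Motivation: Large pretrained diffusion models incur heavy computation and high latency in real-time streaming video generation, and existing video distillation methods, which largely borrow from image methods, neglect temporal dependencies, leading to poor motion coherence, error accumulation over long sequences, and a latency-quality trade-off.

Result: The method generates a 5-second video in only 2.61 seconds (up to 31 FPS), a 277.3x speedup over the undistilled model.

Insight: The core innovation is a Diagonal Distillation framework that is orthogonal to existing approaches. Its asymmetric generation strategy (more steps early, fewer steps later) lets later video chunks inherit rich appearance information from early chunks and uses partially denoised chunks as conditional inputs for subsequent synthesis, making better use of temporal information across video chunks and denoising steps and mitigating error propagation. Implicit optical flow modeling preserves motion quality under strict step constraints.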

Abstract: Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.

[73] Evolving Prompt Adaptation for Vision-Language Models cs.CV | cs.AIPDF

Enming Zhang, Jiayang Li, Yanru Wu, Zhenyu Liu, Yang Li

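TL;DR: This paper proposes EvoPrompt, a framework that uses an evolutionary training strategy and Feature Geometric Regularization to address catastrophic forgetting when adapting vision-language models to few-shot downstream tasks, enabling stable, knowledge-preserving prompt tuning.

Details

Motivation: When large-scale vision-language models are adapted to downstream tasks with limited labeled data, parameter-efficient prompt learning methods often cause catastrophic forgetting of pre-trained knowledge.

Result: EvoPrompt achieves state-of-the-art performance in few-shot learning while effectively preserving the original zero-shot capabilities of the pre-trained model.

Insight: The innovation is an evolutionary training strategy that decouples low-rank updates into directional and magnitude components and adapts only the magnitude to preserve semantic directions, combined with Feature Geometric Regularization to prevent representation collapse, giving control over the prompt's evolutionary path without forgetting.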

Abstract: The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.

[74] SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding cs.CVPDF

Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu

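TL;DR: SurgFed is a multi-task federated learning framework for surgical video understanding. Using language-guided channel selection and language-guided hyper aggregation, it addresses the challenges of tissue diversity and task diversity in surgical scenes and improves model performance across sites and tasks.

Details

Motivation: In multi-task federated learning for surgical video, tissue diversity and task diversity lead to poor local model adaptation and inaccurate parameter updates from server-side aggregation.

Result: On five public datasets spanning four surgical types, SurgFed improves over existing state-of-the-art methods.

Insight: The innovations are a language-guided channel selection network that strengthens site-specific adaptation, and a language-guided hyper aggregation mechanism that models task interactions through cross-attention and guides personalized parameter updates. Viewed objectively, the work combines natural language processing with federated learning, offering a scalable solution for heterogeneous medical data.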

Abstract: Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, limiting their effectiveness in heterogeneous clinical environments and leading to poor local predictions. (2) Task Diversity: Server-side aggregation, relying solely on gradient-based clustering, often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework, enabling federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two appealing designs, i.e., Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), to address the challenges of cross-site and cross-task exploration. Technically, the LCS is a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, helping the local model learn site-specific embeddings. We further introduce the LHA that employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over the state-of-the-art methods in five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.

Won Shik Jang, Ue-Hwan Kim

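TL;DR: This paper proposes Context-Nav for text-goal instance navigation (TGIN), the task of navigating to a specified object instance in a 3D scene from a free-form text description. The method elevates long contextual descriptions from a local matching cue to a global exploration prior and verifies candidate targets through viewpoint-aware 3D spatial relations. It requires no task-specific training and achieves state-of-the-art performance on the InstanceNav and CoIN-Bench benchmarks.

Details

Motivation: In text-goal instance navigation, an agent must reach the correct instance from a free-form text description in cluttered 3D scenes containing same-category distractors, without failing because of early false detections or candidates that are semantically plausible but geometrically inconsistent.

Result: The method achieves state-of-the-art (SOTA) performance on the InstanceNav and CoIN-Bench benchmarks. Ablations show that encoding the full description into the value map avoids wasted motion, and that explicit viewpoint-aware 3D verification prevents semantically plausible but geometrically incorrect stops.

Insight: The main innovations are: 1) using the long contextual description as a global exploration prior, with a value map generated from dense text-image alignment guiding exploration; 2) an explicit, viewpoint-aware 3D spatial relation verification mechanism that confirms a target by sampling observer poses and checking whether the spatial relations are satisfied. This offers a scalable, geometry-grounded spatial reasoning alternative for fine-grained instance disambiguation, without heavy policy training or human-in-the-loop interaction.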

Abstract: Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers – guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.

[76] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning cs.CVPDF

Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani

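TL;DR: This paper studies the reliability of vision-language models (VLMs) used as driving assistants, focusing on response inconsistency and limited temporal reasoning. The authors find that even models with strong visual understanding can perform poorly on tasks requiring temporal reasoning, and they propose the FutureVQA benchmark dataset and a chain-of-thought-based self-supervised tuning approach to improve consistency and temporal reasoning.

Details

Motivation: The paper examines whether the responses of VLMs used as driving assistants are grounded in temporal reasoning over observed information or merely rely on patterns memorized during training, in order to assess the reliability of their decisions.

Result: The study reveals that VLMs suffer from inconsistent responses and limited temporal reasoning. The authors introduce the FutureVQA benchmark for evaluation, and the proposed self-supervised tuning approach improves model performance, but the abstract reports no specific quantitative results or comparison with the SOTA.

Insight: The innovation is systematically identifying and quantifying two major reliability challenges of driving VLMs and building the dedicated FutureVQA benchmark. The proposed chain-of-thought-based self-supervised tuning approach requires no temporal labels and offers a simple, effective way to improve temporal reasoning.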

Abstract: A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.

[77] Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization cs.CVPDF

Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang

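TL;DR: This paper proposes a reinforcement learning-based post-training strategy that unlocks multimodal interleaved generation in existing unified vision-language models without relying on large-scale interleaved datasets. The method introduces a hybrid dataset in a warm-up stage, then uses an extended Group Relative Policy Optimization (GRPO) framework to jointly optimize text and image generation, with hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity, plus process-level rewards, to improve generation quality.

Details

Motivation: Existing unified vision-language models have made significant progress in multimodal understanding and generation, but they fall short in producing multimodal interleaved outputs (such as visual storytelling and step-by-step visual reasoning), and large-scale interleaved datasets to support training are lacking.

Result: Experiments on the MMIE and InterleavedBench benchmarks show that the method significantly improves the quality and coherence of multimodal interleaved generation.

Insight: The innovations are: a reinforcement learning-based post-training strategy that needs no large-scale interleaved dataset; an extension of GRPO to the multimodal setting that models text and image generation in a unified trajectory; and a multi-aspect hybrid reward function with process-level rewards that provide step-wise guidance and improve training efficiency on complex tasks.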

Abstract: Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
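
For readers unfamiliar with GRPO, the central step is to score each sampled output relative to the other outputs drawn for the same prompt, rather than against a learned value function. The Python sketch below shows that group-relative normalization only. The equal-weight `hybrid_reward` combination, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the abstract does not state how the paper weights its reward terms or handles the multimodal trajectory.

```python
from statistics import mean, pstdev


def hybrid_reward(text_relevance, visual_text_alignment, structural_fidelity,
                  weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three reward aspects named in the abstract."""
    parts = (text_relevance, visual_text_alignment, structural_fidelity)
    return sum(w * p for w, p in zip(weights, parts))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled interleaved outputs for the same prompt (toy scores).
group_rewards = [
    hybrid_reward(0.2, 0.3, 0.5),
    hybrid_reward(0.6, 0.7, 0.7),
    hybrid_reward(0.9, 0.9, 1.0),
]
print(group_relative_advantages(group_rewards))
```

Outputs that score above their group's mean receive positive advantages and are reinforced, while below-average outputs are discouraged; when all samples score the same, every advantage is zero and the group contributes no gradient signal.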

[78] GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision cs.CVPDF

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu

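TL;DR: This paper proposes the GeoSolver framework. By constructing Geo-PRM-2M, a large-scale fine-grained process supervision dataset, training GeoPRM, a token-level process reward model, and applying the proposed Process-Aware Tree-GRPO reinforcement learning algorithm, it substantially improves the step-by-step reasoning and visual faithfulness of vision-language models (VLMs) in remote sensing. The resulting model, GeoSolver-9B, achieves state-of-the-art performance on multiple remote sensing benchmarks, and GeoPRM generalizes well across models, serving as a universal geospatial verifier that improves other models.

Details

Motivation: Although VLMs have advanced remote sensing interpretation, complex step-by-step reasoning remains highly challenging. In existing methods that introduce Chain-of-Thought (CoT) reasoning, the visual faithfulness of intermediate steps is hard to guarantee, which is a critical bottleneck.

Result: The final model, GeoSolver-9B, achieves state-of-the-art performance on multiple remote sensing benchmarks. As a universal verifier, GeoPRM improves the performance of both GeoSolver-9B and general-purpose VLMs through Test-Time Scaling (TTS), showing strong cross-model generalization.

Insight: The core innovation is moving remote sensing reasoning toward a verifiable, process-supervised reinforcement learning paradigm. Specifically: 1) synthesizing a large-scale token-level process supervision dataset via entropy-guided Monte Carlo Tree Search and targeted visual hallucination injection; 2) training a fine-grained token-level process reward model that provides faithfulness feedback; 3) proposing a reinforcement learning algorithm that combines tree-structured exploration with a faithfulness-weighted reward mechanism to assign credit precisely to intermediate steps. This offers a new approach to improving the reliability and scalability of complex visual reasoning tasks.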

Abstract: While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.

[79] GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning cs.CVPDF

Xiao Yang, Ronghao Fu, Zhuoran Duan, Zhiwen Lin, Xueyan Liu

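TL;DR: This paper proposes GeoAlignCLIP, a framework that strengthens fine-grained alignment between remote sensing imagery and natural language through multi-granular consistency learning, and constructs RSFG-100k, a dataset containing scene descriptions, region-level annotations, and hard-negative samples. On multiple public remote sensing benchmarks, the method outperforms existing remote-sensing-specific methods, achieving more robust and accurate fine-grained vision-language alignment.

Details

Motivation: Existing vision-language pretraining models rely mainly on global image-text alignment and fail to integrate multi-granular visual and textual information effectively, which limits their ability to capture fine-grained image details and their performance on complex fine-grained tasks.

Result: In extensive experiments on multiple public remote sensing benchmarks, GeoAlignCLIP consistently outperforms existing remote-sensing-specific methods, showing more robust and accurate fine-grained vision-language alignment across tasks.

Insight: The innovation is a unified framework that achieves fine-grained alignment through multi-granular semantic alignment and intra-modal consistency learning, together with RSFG-100k, a fine-grained remote sensing dataset with hierarchical supervision that promotes precise visual-semantic alignment between image regions and text concepts.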

Abstract: Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model’s ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.

[80] More than the Sum: Panorama-Language Models for Adverse Omni-Scenes cs.CVPDF

Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng

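TL;DR: This paper proposes the Panorama-Language Modeling (PLM) paradigm to address the fact that existing vision-language models (VLMs) are designed for pinhole images and cannot process panoramas directly. The authors also build PanoVQA, a large-scale panoramic visual question answering dataset that includes adverse omni-scenes, and develop a plug-and-play panoramic sparse attention module that lets existing pinhole-based VLMs process equirectangular panoramas without retraining. Experiments show that PLM achieves superior robustness and holistic reasoning in challenging omni-scenes.

Details

Motivation: Existing VLMs are designed for pinhole imagery and understand panoramic scenes by stitching multiple narrow field-of-view inputs, which overlooks the holistic spatial and contextual relationships inherent in a panorama. This paper aims to perform unified vision-language reasoning directly on panoramas, achieving holistic understanding that exceeds the simple sum of the pinhole parts.

Result: Extensive experiments show that the proposed PLM achieves superior robustness and holistic reasoning in challenging omni-scenes (such as object occlusions and driving accidents), with understanding greater than the sum of its narrow field-of-view parts.

Insight: The main innovations are: 1) the PLM paradigm, which performs end-to-end vision-language reasoning directly on panoramas; 2) PanoVQA, the first large-scale panoramic VQA dataset focused on adverse omni-scenes; 3) a plug-and-play panoramic sparse attention module that efficiently adapts existing pinhole VLMs to panoramic inputs without retraining, an efficient and practical model adaptation approach.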

Abstract: Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning approach that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.

[81] BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers cs.CVPDF

Chaodong Xiao, Zhengqiang Zhang, Lei Zhang

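TL;DR: This paper proposes BinaryAttention, a 1-bit quantized attention mechanism for vision and diffusion transformers. It keeps only the sign of queries and keys and replaces floating-point dot products with bit-wise operations, significantly reducing computational cost. A learnable bias, quantization-aware training, and self-distillation mitigate the information loss and alignment issues of 1-bit quantization. Experiments show that BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs and matches or exceeds full-precision attention on multiple benchmarks.

Details

Motivation: The computational complexity of transformer attention modules is a major bottleneck for vision tasks, and existing methods mainly use 8-bit or 4-bit quantization to balance efficiency and accuracy. This paper explores the more extreme 1-bit quantization to cut computational cost further while preserving the essential similarity relationships of attention.

Result: On A100 GPUs, BinaryAttention is more than 2x faster than FlashAttention2. On vision transformer and diffusion transformer benchmarks, it matches or exceeds full-precision attention, validating its effectiveness.

Insight: The innovation is a theoretical argument that binarizing attention preserves the essential similarity relationships, along with a concrete 1-bit quantization scheme that uses sign bits, bit-wise operations, a learnable bias, quantization-aware training, and self-distillation. This provides an efficient and effective alternative for low-bit vision and diffusion transformers.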

Abstract: Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at https://github.com/EdwardChasel/BinaryAttention.
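The core trick described above, replacing a floating-point dot product with bit-wise operations once only signs are kept, can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, zero is treated as positive by assumption, and the learnable bias, quantization-aware training, and self-distillation are omitted. For two vectors in {-1, +1}^d, the dot product equals d minus twice the number of positions where the signs differ, which is a popcount of an XOR.

```python
def pack_signs(vec):
    """Pack sign bits into an int: bit i is 1 when vec[i] >= 0 (assumed convention)."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def binary_dot(q_bits, k_bits, dim):
    """Dot product of two {-1, +1}^dim vectors computed from packed sign bits."""
    mismatches = bin(q_bits ^ k_bits).count("1")  # popcount of the XOR
    return dim - 2 * mismatches


def sign(x):
    """Sign function with the same zero convention as pack_signs."""
    return 1 if x >= 0 else -1


q = [0.3, -1.2, 0.8, -0.1]
k = [0.5, 0.7, -0.4, -2.0]

# Reference: explicit sign products; fast path: XOR + popcount.
reference = sum(sign(a) * sign(b) for a, b in zip(q, k))
fast = binary_dot(pack_signs(q), pack_signs(k), len(q))
```

Both paths give the same similarity score; on hardware, the XOR and popcount operate on whole machine words at once, which is where a speedup of this kind would come from.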

[82] A saccade-inspired approach to image classification using vision transformer attention maps cs.CVPDF

Matthis Dallain, Laurent Rodriguez, Laurent Udo Perrinet, Benoît Miramond

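TL;DR: Inspired by the selective attention mechanism of human vision, this paper proposes a saccade-style image classification method based on Vision Transformer attention maps. Attention maps generated by the DINO model are used to mimic human gaze patterns and steer the model toward key image regions, and the method is validated on ImageNet classification.

Details

Motivation: The paper draws on the efficient selective attention of the human visual system to overcome the inefficiency of conventional AI systems, which process the whole image with equal emphasis, and to build smarter, more energy-efficient image processing models.

Result: On the standard ImageNet classification task, the selective-processing strategy preserves most of the full-image classification performance and in some cases even exceeds it. Compared with saliency models built specifically for human gaze prediction, DINO attention maps provide better fixation guidance for selecting informative regions.

Insight: The novelty lies in combining the attention mechanism of Vision Transformers (DINO in particular) with the saccade principle of biological vision, opening new directions for biologically inspired active vision systems and efficient neuromorphic visual processing.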

Abstract: Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model’s class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.

[83] Decoder-Free Distillation for Quantized Image Restoration cs.CVPDF

S. M. A. Sharif, Abdur Rehman, Seongwan Kim, Jaeho Lee

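TL;DR: This paper proposes Quantization-aware Distilled Restoration (QDR), a framework for deploying image restoration (IR) models on edge devices. It removes teacher-student capacity mismatch through FP32 self-distillation, uses Decoder-Free Distillation (DFD) to correct quantization errors at the bottleneck and avoid spatial error amplification, and introduces Learnable Magnitude Reweighting (LMR) to balance gradients dynamically and stabilize conflicting objectives. It also designs an Edge-Friendly Model (EFM) with a lightweight Learnable Degradation Gating (LDG) that dynamically modulates spatial degradation localization. Experiments show that the Int8 model recovers 96.5% of FP32 performance across four IR tasks, reaches 442 frames per second on an NVIDIA Jetson Orin, and improves downstream object detection by 16.3 mAP.

Details

Motivation: For precision-sensitive tasks such as image restoration (IR), combining quantization-aware training (QAT) with knowledge distillation (KD) for model compression and edge deployment runs into three bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization conflict between reconstruction and distillation losses caused by quantization noise. These must be resolved to achieve efficient deployment.

Result: Extensive experiments on four image restoration tasks show that the proposed Int8 model recovers 96.5% of FP32 performance, reaches 442 frames per second (FPS) on an NVIDIA Jetson Orin, and improves downstream object detection by 16.3 mAP, enabling efficient edge deployment.

Insight: The contributions include FP32 self-distillation to remove capacity mismatch; Decoder-Free Distillation (DFD), which corrects quantization errors at the bottleneck to avoid error propagation; Learnable Magnitude Reweighting (LMR), which dynamically balances competing gradients to stabilize optimization; and an Edge-Friendly Model (EFM) that integrates a lightweight Learnable Degradation Gating (LDG) to handle spatial degradation dynamically. Together these offer a new approach to applying quantization-aware training in low-level vision tasks.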

Abstract: Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization “tug-of-war” between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP.
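To make the "quantization noise" mentioned above concrete, the following is a minimal sketch of generic symmetric per-tensor Int8 quantization, the kind of rounding that quantization-aware training typically simulates. It is not QDR's quantizer; the function names and the per-tensor scaling choice are assumptions made for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the Int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to floats; the gap to the input is the quantization error."""
    return [x * scale for x in q]


values = [-1.0, 0.3, 0.25, 0.0]
q, scale = quantize_int8(values)
restored = dequantize(q, scale)
errors = [abs(v - r) for v, r in zip(values, restored)]
```

Each entry of `errors` is at most half a quantization step (`scale / 2`). In a restoration network these small per-value errors can be amplified by later layers, which is the failure mode the abstract attributes to decoder distillation.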

[84] Grounding Synthetic Data Generation With Vision and Language Models cs.CV | cs.AIPDF

Ümit Mert Çağlar, Alptekin Temizel

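TL;DR: This paper proposes an interpretable, vision-language-model-based framework for synthetic data augmentation and evaluation of remote sensing imagery, and builds the large-scale dataset ARAS400k. The framework combines generative models, semantic segmentation, and image captioning, and evaluates synthetic data automatically by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experiments show that models trained only on synthetic data reach competitive performance, while models trained on augmented data (real plus synthetic images) consistently outperform real-data-only baselines.

Details

Motivation: Existing evaluation metrics for synthetic data usually compute latent feature similarity, which is hard to interpret and does not always correlate with gains on downstream tasks. The paper aims to provide an interpretable framework for synthetic data augmentation and evaluation in remote sensing.

Result: On remote sensing semantic segmentation and image captioning, models trained only on synthetic data reach competitive performance, while models trained on augmented data consistently outperform real-data-only baselines. The work establishes a scalable benchmark for remote sensing tasks.

Insight: The novelty lies in an interpretable evaluation framework that combines vision and language models and assesses synthetic data quality through semantic composition analysis, caption redundancy minimization, and cross-modal consistency verification. Built on this framework, the large-scale augmented remote sensing dataset ARAS400k offers a new approach and benchmark for synthetic data evaluation.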

Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.

[85] When to Lock Attention: Training-Free KV Control in Video Diffusion cs.CV | cs.AI | cs.ET | eess.IVPDF

Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang

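TL;DR: This paper proposes KV-Lock, a training-free framework for DiT-based video diffusion models that targets the tension in video editing between background consistency and foreground quality. By detecting hallucination risk during diffusion, it dynamically schedules both the fusion ratio between cached background KVs and newly generated KVs and the CFG guidance scale, preserving high background fidelity while improving foreground generation quality.

Details

Motivation: The core challenge in video editing is to keep the background consistent while improving foreground quality. Existing methods either inject full-image information, which causes background artifacts, or lock the background rigidly, which severely limits the model's foreground generation capacity. The paper aims to resolve this conflict.

Result: Extensive experiments show that, across a variety of video editing tasks, the method outperforms existing approaches in foreground quality while maintaining high background fidelity.

Insight: The core idea is to link the hallucination metric (variance of the denoising prediction) to the CFG guidance scale and to schedule KV fusion and CFG guidance dynamically on that basis. The method is a training-free, plug-and-play module that can be integrated into any pre-trained DiT-based model.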

Abstract: Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model’s capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks.
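The scheduling idea in this abstract can be sketched as follows. Everything specific here is an assumption: the abstract does not give the mapping from variance to fusion ratio or CFG scale, so the linear schedule, the threshold, the parameter names, and the default values below are purely illustrative. The sketch only shows the direction of the effect: higher prediction variance leads to stronger background KV locking and a larger guidance scale.

```python
from statistics import pvariance


def schedule(predictions, var_threshold=0.5, base_lock=0.5, cfg_min=5.0, cfg_max=9.0):
    """Map prediction variance to a (lock ratio, CFG scale) pair (hypothetical linear rule)."""
    # Normalized hallucination risk in [0, 1].
    risk = min(pvariance(predictions) / var_threshold, 1.0)
    lock = base_lock + (1.0 - base_lock) * risk   # more risk: lock background harder
    cfg = cfg_min + (cfg_max - cfg_min) * risk    # more risk: stronger guidance
    return lock, cfg


def fuse_kv(cached, new, lock):
    """Blend cached background KVs with newly generated KVs element-wise."""
    return [lock * c + (1.0 - lock) * n for c, n in zip(cached, new)]


calm_lock, calm_cfg = schedule([1.0, 1.0, 1.0])   # zero variance, low risk
risky_lock, risky_cfg = schedule([0.0, 2.0])      # high variance, risk saturates
fused = fuse_kv([1.0, 2.0], [3.0, 4.0], calm_lock)
```

With zero variance the blend stays at the base ratio; once the variance exceeds the threshold, the cached background KVs are used unchanged.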

[86] DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics cs.CVPDF

Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng

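TL;DR: DiffWind is a physics-informed differentiable framework for modeling wind-driven object dynamics from video. It represents wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, simulates their interaction with the Material Point Method, and jointly optimizes the wind field and object motion. It also uses the Lattice Boltzmann Method as a physical constraint to ensure compliance with fluid dynamics, and supports applications such as forward simulation under new conditions and wind retargeting.

Details

Motivation: Modeling wind-driven object dynamics from video observations is highly challenging because wind is invisible and varies in space and time, and because objects deform in complex ways. Existing methods struggle to handle wind-object interaction modeling, video reconstruction, and physical simulation in a unified way.

Result: Extensive experiments on WD-Objects, a dataset of synthetic and real-world wind-driven scenes, show that the method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity.

Insight: The contributions include: 1) unifying the wind field and a 3D Gaussian Splatting-based object representation in one differentiable framework for joint optimization; 2) introducing the Lattice Boltzmann Method as a physical constraint to improve physical plausibility; 3) supporting forward simulation and new applications beyond reconstruction, opening a new avenue for video-based wind-object interaction modeling.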

Abstract: Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.

[87] Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture cs.CVPDF

Tom Wehrbein, Bodo Rosenhahn

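TL;DR: This paper proposes FootMR, which refines the foot motion estimated by an existing human motion capture model by lifting 2D foot keypoint sequences to 3D. It addresses inaccurate reconstruction of fine-grained foot motion in markerless monocular video and is validated on multiple datasets.

Details

Motivation: State-of-the-art methods recover accurate overall 3D human motion from in-the-wild videos but often fail to capture fine-grained articulations, especially in the feet. This stems from inaccurate foot annotations and limited foot motion diversity in training data, even though foot motion is critical for applications such as gait analysis and animation.

Result: Experiments on MOOF, MOYO, and RICH show that FootMR outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% relative to the best video-based approach.

Insight: The contributions include: avoiding direct image input to sidestep inaccurate image-3D annotation pairs and instead using large-scale motion capture data; using knee and foot motion as context and predicting only residual foot motion to resolve 2D-to-3D lifting ambiguities; using global joint rotations and extensive data augmentation to improve generalization to extreme foot poses; and introducing the MOOF dataset to support evaluation of foot motion reconstruction.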

Abstract: State-of-the-art methods can recover accurate overall 3D human body motion from in-the-wild videos. However, they often fail to capture fine-grained articulations, especially in the feet, which are critical for applications such as gait analysis and animation. This limitation results from training datasets with inaccurate foot annotations and limited foot motion diversity. We address this gap with FootMR, a Foot Motion Refinement method that refines foot motion estimated by an existing human recovery model through lifting 2D foot keypoint sequences to 3D. By avoiding direct image input, FootMR circumvents inaccurate image-3D annotation pairs and can instead leverage large-scale motion capture data. To resolve ambiguities of 2D-to-3D lifting, FootMR incorporates knee and foot motion as context and predicts only residual foot motion. Generalization to extreme foot poses is further improved by representing joints in global rather than parent-relative rotations and applying extensive data augmentation. To support evaluation of foot motion reconstruction, we introduce MOOF, a 2D dataset of complex foot movements. Experiments on MOOF, MOYO, and RICH show that FootMR outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% over the best video-based approach.

[88] AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering cs.CV | cs.AIPDF

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan

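TL;DR: This paper presents AutoViVQA, a large-scale automatically constructed dataset for Vietnamese visual question answering, and explores Transformer-based architectures for the task. It systematically compares automatic evaluation metrics under multilingual settings, with the goal of advancing low-resource multimodal learning for Vietnamese.

Details

Motivation: The work addresses the lack of large-scale, balanced datasets for Vietnamese VQA, uses pre-trained Transformer models (such as PhoBERT and ViT) to improve multimodal fusion, and evaluates and improves the agreement between automatic evaluation metrics and human judgment.

Result: The abstract does not report specific quantitative results or benchmark performance, but it indicates that the work advances Vietnamese VQA research by combining textual and visual pre-training and by systematically comparing automatic evaluation metrics under multilingual settings.

Insight: The contributions include the automatically constructed large-scale Vietnamese VQA dataset AutoViVQA and a systematic assessment of how well multilingual automatic metrics (such as BLEU and METEOR) suit VQA tasks, providing a new resource and methodological reference for low-resource multimodal learning.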

Abstract: Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains – such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning – multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.

[89] TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering cs.CVPDF

Luca Carlini, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi

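TL;DR: This paper proposes TemporalDoRA, a video-specific parameter-efficient fine-tuning method that improves the robustness of surgical video question answering. It extends Weight-Decomposed Low-Rank Adaptation by inserting lightweight temporal multi-head attention inside the low-rank bottleneck of the vision encoder and by applying weight decomposition selectively, enabling temporally aware updates. The authors also build REAL-Colon-VQA, a colonoscopy video question answering dataset for evaluating sensitivity to linguistic variation.

Details

Motivation: Standard parameter-efficient fine-tuning methods adapt pre-trained projections without explicitly modeling frame-to-frame interactions, which limits their ability to exploit sparse temporal evidence. Surgical video question answering, however, requires precise temporal grounding and robustness to natural variation in how clinicians phrase questions.

Result: On the proposed REAL-Colon-VQA dataset, TemporalDoRA improves performance on Out-of-Template questions, and consistent improvements are also observed on EndoVis18-VQA adapted to short clips. Ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of the gains.

Insight: The novelty lies in embedding temporal attention in the low-rank adaptation pathway to model frame-to-frame interactions explicitly, while selective weight decomposition keeps the backbone frozen and the scaling stable. This improves the use of temporally consistent visual cues and robustness to linguistic variation with minimal parameter overhead.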

Abstract: Surgical Video Question Answering (VideoQA) requires accurate temporal grounding while remaining robust to natural variation in how clinicians phrase questions, where linguistic bias can arise. Standard Parameter Efficient Fine Tuning (PEFT) methods adapt pretrained projections without explicitly modeling frame-to-frame interactions within the adaptation pathway, limiting their ability to exploit sparse temporal evidence. We introduce TemporalDoRA, a video-specific PEFT formulation that extends Weight-Decomposed Low-Rank Adaptation by (i) inserting lightweight temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and (ii) selectively applying weight decomposition only to the trainable low-rank branch rather than the full adapted weight. This design enables temporally-aware updates while preserving a frozen backbone and stable scaling. By mixing information across frames within the adaptation subspace, TemporalDoRA steers updates toward temporally consistent visual cues and improves robustness with minimal parameter overhead. To benchmark this setting, we present REAL-Colon-VQA, a colonoscopy VideoQA dataset with 6,424 clip–question pairs, including paired rephrased Out-of-Template questions to evaluate sensitivity to linguistic variation. TemporalDoRA improves Out-of-Template performance, and ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of these gains. We also validate on EndoVis18-VQA adapted to short clips and observe consistent improvements on the Out-of-Template split. Code and dataset available at https://anonymous.4open.science/r/TemporalDoRA-BFC8/ (Anonymous GitHub).

Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi

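TL;DR: This paper proposes TriFusionSR, a wavelet-guided conditional diffusion framework that jointly performs tri-modal medical image fusion and super-resolution. The framework decomposes multimodal features into frequency bands with the 2D Discrete Wavelet Transform to enable frequency-aware cross-modal interaction, and introduces a Rectified Wavelet Features strategy and an Adaptive Spatial-Frequency Fusion module for refinement.

Details

Motivation: Existing methods usually handle multimodal medical image fusion and super-resolution in separate stages, which leads to artifacts and degraded perceptual quality. In tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT), frequency-domain imbalance becomes more pronounced and limits fusion quality.

Result: Extensive experiments across multiple upsampling scales show state-of-the-art performance, with a 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS.

Insight: The novelty lies in combining the wavelet transform with a conditional diffusion model, handling tri-modal images through frequency decomposition and a rectification strategy, and introducing an Adaptive Spatial-Frequency Fusion module. Together these address frequency-domain imbalance and the artifacts caused by staged processing, offering a new approach to multimodal medical image processing.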

Abstract: Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency domain imbalances. We propose TriFusionSR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware crossmodal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
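As a concrete picture of the frequency-band decomposition mentioned above, here is a minimal sketch of a single-level 2D Haar wavelet transform on a plain list-of-lists image. The abstract does not name the wavelet family, so Haar is an assumption chosen for simplicity; the function name is hypothetical, and sub-band naming conventions (LH versus HL) differ between libraries.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of an image with even height and width."""
    rows, cols = len(image), len(image[0])
    bands = {name: [] for name in ("LL", "LH", "HL", "HH")}
    for r in range(0, rows, 2):
        out = {name: [] for name in bands}
        for c in range(0, cols, 2):
            # One 2x2 block: a b on the top row, d e on the bottom row.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            out["LL"].append((a + b + d + e) / 2)  # local average (low frequency)
            out["LH"].append((a + b - d - e) / 2)  # top minus bottom
            out["HL"].append((a - b + d - e) / 2)  # left minus right
            out["HH"].append((a - b - d + e) / 2)  # diagonal detail
        for name in bands:
            bands[name].append(out[name])
    return bands


bands = haar_dwt2([[1, 2], [3, 4]])
flat = haar_dwt2([[7, 7], [7, 7]])
```

A constant image puts all of its energy in the LL band and zeros elsewhere, while edges and texture show up in the detail bands. A fusion model can then treat structural (low-frequency) and textural (high-frequency) content from each modality separately.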

[91] GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System cs.CVPDF

Zhiye Tang, Qiudan Zhang, Lei Zhang, Junhui Hou, You Yang

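TL;DR: This paper proposes GSStream, a volumetric scene streaming system based on 3D Gaussian Splatting (3DGS) that addresses the heavy bandwidth demands of 3DGS rendering. The system integrates a collaborative viewport prediction module and a deep reinforcement learning-based bitrate adaptation module to deliver volumetric scenes efficiently.

Details

Motivation: 3D Gaussian Splatting enables real-time radiance field rendering but produces large amounts of data and is extremely bandwidth-intensive. Existing methods still struggle with real-time distribution, so a new streaming system is needed to deliver volumetric scene data in the 3DGS format efficiently.

Result: Extensive experiments show that GSStream outperforms existing representative volumetric scene streaming systems in visual quality and network usage.

Insight: The contributions include: 1) a collaborative viewport prediction module that predicts users' future behavior by learning collaborative and historical priors from multiple users and from users' viewport sequences; 2) a deep reinforcement learning-based bitrate adaptation module that handles the variability of the state and action spaces in the bitrate adaptation problem; 3) the first user viewport trajectory dataset for volumetric scenes, built to support training and streaming simulation.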

Abstract: Recently, the 3D Gaussian splatting (3DGS) technique for real-time radiance field rendering has revolutionized the field of volumetric scene representation, providing users with an immersive experience. But in return, it also produces a large volume of data, which is extremely bandwidth-intensive. Cutting-edge researchers have tried to introduce different approaches and construct multiple variants for 3DGS to obtain a more compact scene representation, but it is still challenging for real-time distribution. In this paper, we propose GSStream, a novel volumetric scene streaming system to support 3DGS data format. Specifically, GSStream integrates a collaborative viewport prediction module to better predict users’ future behaviors by learning collaborative priors and historical priors from multiple users and users’ viewport sequences and a deep reinforcement learning (DRL)-based bitrate adaptation module to tackle the state and action space variability challenge of the bitrate adaptation problem, achieving efficient volumetric scene delivery. Besides, we first build a user viewport trajectory dataset for volumetric scenes to support the training and streaming simulation. Extensive experiments prove that our proposed GSStream system outperforms existing representative volumetric scene streaming systems in visual quality and network usage. Demo video: https://youtu.be/3WEe8PN8yvA.

[92] FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation cs.CVPDF

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran

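TL;DR: This paper proposes FrameDiT, a diffusion Transformer for efficient video generation. It introduces frame-level Matrix Attention, which processes an entire frame as a matrix, to resolve the trade-off in existing methods between Full 3D Attention (computationally expensive) and Local Factorized Attention (limited temporal modeling).

Details

Motivation: Existing video diffusion models struggle to model complex spatio-temporal dynamics efficiently. Full 3D Attention is costly, while Local Factorized Attention is limited in temporal modeling, so an attention mechanism is needed that is efficient and still preserves global spatio-temporal structure.

Result: FrameDiT-H achieves state-of-the-art (SOTA) results on multiple video generation benchmarks, improving temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.

Insight: The novelty lies in frame-level Matrix Attention, which generates query, key, and value matrices through matrix-native operations and computes attention at the frame level, preserving global spatio-temporal structure and adapting to significant motion. FrameDiT-H, which further combines it with Local Factorized Attention, captures both large and small motion.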

Abstract: High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.

[93] FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis cs.CV | cs.MAPDF

Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen

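TL;DR: This paper proposes FetalAgents, the first multi-agent system for comprehensive fetal ultrasound analysis. A lightweight coordination framework dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation, and the system supports end-to-end video stream summarization that produces structured clinical reports.

Details

Motivation: Existing automated tools for fetal ultrasound analysis struggle to balance task-specific accuracy with the whole-process versatility needed to support end-to-end clinical workflows.

Result: Multi-center external evaluations across eight clinical tasks show that FetalAgents consistently delivers the most robust and accurate performance compared with specialized models and multimodal large language models (MLLMs).

Insight: The novelty lies in a multi-agent architecture for dynamic task coordination and in being the first to integrate end-to-end video stream summarization and structured report generation into fetal ultrasound analysis, providing an auditable, workflow-aligned solution.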

Abstract: Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.

[94] ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios cs.CVPDF

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi

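TL;DR: This paper presents ENIGMA-360, a paired dataset of 180 egocentric and 180 exocentric videos captured in a real industrial scenario. The videos are temporally synchronized and provide complementary views of the same scene. The dataset carries temporal and spatial annotations to support research on human behavior understanding in industrial settings. The authors run baseline experiments on three foundational tasks (temporal action segmentation, keystep recognition, and egocentric human-object interaction detection), show the limits of current SOTA methods in this challenging scenario, and release the dataset publicly.

Details

Motivation: There is a lack of datasets that capture both egocentric and exocentric views in real industrial scenarios, which hinders the development of systems that can support workers and improve safety. The paper aims to fill this gap.

Result: Baseline experiments on ENIGMA-360 for temporal action segmentation, keystep recognition, and egocentric human-object interaction detection show that existing state-of-the-art (SOTA) methods have clear limitations in this challenging scenario.

Insight: The novelty lies in a new paired ego-exo video dataset captured synchronously in a real industrial environment, with rich spatio-temporal annotations. It is a valuable resource for studying human behavior understanding in multi-view, complex scenes, especially industrial applications, and it exposes the shortcomings of existing models in real-world multi-view understanding, pointing to directions for future research.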

Abstract: Understanding human behavior from complementary egocentric (ego) and exocentric (exo) points of view enables the development of systems that can support workers in industrial environments and enhance their safety. However, progress in this area is hindered by the lack of datasets capturing both views in realistic industrial scenarios. To address this gap, we propose ENIGMA-360, a new ego-exo dataset acquired in a real industrial scenario. The dataset is composed of 180 egocentric and 180 exocentric procedural videos, temporally synchronized and offering complementary information about the same scene. The 360 videos have been labeled with temporal and spatial annotations, enabling the study of different aspects of human behavior in the industrial domain. We provide baseline experiments for 3 foundational tasks for human behavior understanding: 1) Temporal Action Segmentation, 2) Keystep Recognition and 3) Egocentric Human-Object Interaction Detection, showing the limits of state-of-the-art approaches on this challenging scenario. These results highlight the need for new models capable of robust ego-exo understanding in real-world environments. We publicly release the dataset and its annotations at https://iplab.dmi.unict.it/ENIGMA-360.

[95] LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos cs.CVPDF

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi

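TL;DR: This paper proposes LAP, a language-aware planning model for procedure planning in instructional videos. It uses the distinctiveness of language descriptions to compensate for the ambiguity of visual observations when predicting the action sequence from a start state to a goal state.

Details

Motivation: Existing methods rely mainly on visual observations as input, but different actions can look visually similar, which creates inherent ambiguity. A more distinctive representation is therefore needed to improve procedure planning.

Result: On the three benchmarks CrossTask, Coin, and NIV, LAP achieves new state-of-the-art performance by a large margin across multiple metrics and time horizons.

Insight: The novelty lies in using a fine-tuned vision-language model to translate visual observations into text descriptions and extract text embeddings, which are more distinctive than visual embeddings, and then using a diffusion model to plan the action sequence, yielding a marked improvement in planning quality.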

Abstract: Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions and to predict actions and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, Coin, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.

[96] LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control cs.CVPDF

Mingyu Kang, Hyein Seo, Yuna Jeong, Junhyeong Park, Yong Suk Choi

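TL;DR: This paper proposes LogoDiffuser, a training-free method for multilingual logo generation and stylization. Through letter-aware attention control, it blends visual design with multilingual text elements while preserving character geometry.

Details

Motivation: Existing text-to-image methods tend to distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. A training-free approach that controls both character structure and visual design is therefore needed.

Result: Extensive experiments and user studies show that the method achieves state-of-the-art performance in multilingual logo generation.

Insight: The novelty lies in supplying the target characters as images rather than text prompts, analyzing the joint attention mechanism to identify core tokens, and injecting the most informative attention maps to integrate character structure with visual design. Layer-wise aggregation of attention maps mitigates attention shifts across layers and keeps the core tokens consistent.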

Abstract: Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.

[97] PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments cs.CV | cs.RO | eess.IVPDF

Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li

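TL;DR: This paper proposes PanoAffordanceNet, an end-to-end framework for holistic affordance grounding in 360° indoor environments. The task moves beyond the traditional object-centric view toward global perception of panoramic space. The paper also builds 360-AGD, the first high-quality panoramic affordance grounding dataset.

Details

Motivation: Current affordance grounding research focuses on object-centric, perspective views, whereas embodied agents need global perception in 360° spaces. To bridge this gap, the paper introduces the new task of holistic affordance grounding in 360° indoor environments.

Result: Extensive experiments show that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence.

Insight: The contributions include: 1) the new "holistic affordance grounding" task; 2) an end-to-end framework with a Distortion-Aware Spectral Modulator and an Omni-Spherical Densification Head that address the geometric distortion of ERP and semantic dispersion; 3) multi-level constraints that combine pixel-wise, distributional, and region-text contrastive objectives to suppress semantic drift under low supervision; 4) 360-AGD, the first panoramic affordance grounding benchmark dataset.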

Abstract: Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.

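[98] Ego: Embedding-Guided Personalization of Vision-Language Models cs.CV | cs.AI

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi

TL;DR: This paper proposes Ego, an efficient personalization method for vision-language models that needs no additional training stage or complex external modules. It uses the model's internal attention mechanisms to extract visual tokens that represent the target concept and keeps them as a memory, so the model can recognize and describe the concept in new images.

Details

Motivation: Existing approaches to personalizing large vision-language models usually rely on additional training stages or on engineered pipelines with pre-trained modules, which limits generality, scalability, and deployment efficiency. The paper aims to overcome the generic nature of these models and deliver personalized experiences more efficiently.

Result: The method is evaluated in a unified way across single-concept, multi-concept, and video personalization settings. Compared with existing SOTA methods, it achieves strong performance gains with minimal personalization overhead.

Insight: The novelty lies in using the model's inherent attention mechanisms to capture and memorize personalized concepts, which avoids external training or modules and offers a more general, scalable, and deployment-efficient personalization paradigm.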

Abstract: AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model’s inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model’s internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.

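[99] RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding cs.CV

Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang

TL;DR: This paper proposes a new fine-grained audio-visual learning task, Region-Aware Sound Source Understanding (RA-SSU), which targets region-aware, frame-level, high-quality sound source understanding. The authors build two new datasets (f-Music and f-Lifescene) and propose SSUFormer, a benchmark model with multi-modal input and output. Using a Mask Collaboration Module and a Mixture of Hierarchical-prompted Experts module, it achieves SOTA performance on sound source segmentation and sound region description.

Details

Motivation: Existing audio-visual learning research mostly takes a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization) and lacks fine-grained scene perception detail. To provide more specific scene perception information, the paper defines the fine-grained RA-SSU task for region-level, frame-level sound source understanding.

Result: Extensive experiments on the self-built f-Music (3,976 samples, 22 scene types) and f-Lifescene (6,156 samples, 61 types) datasets verify the feasibility of the task and the usability of the datasets, and show that SSUFormer achieves SOTA performance on the Sound Source Understanding benchmark.

Insight: The contributions are: 1) the fine-grained RA-SSU task, which pushes audio-visual learning toward finer scene understanding; 2) two high-quality annotated fine-grained audio-visual datasets; 3) the SSUFormer benchmark model, which uses a multi-modal input and output architecture and introduces the MCM and MoHE modules to improve segmentation accuracy and description richness respectively, offering a reusable framework for fine-grained multimodal understanding.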

Abstract: Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, playing a vital role in scene understanding and interaction. However, previous research has mostly explored downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets, i.e., fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules for this framework, Mask Collaboration Module (MCM) and Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy and enrich the elaboration of the sound source description. Extensive experiments are conducted on our two datasets to verify the feasibility of the task, evaluate the availability of the datasets, and demonstrate the superiority of SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.

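[100] ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation cs.CV

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai

TL;DR: This paper proposes ConfCtrl, a confidence-aware video interpolation framework for novel view synthesis from only two input images under large viewpoint changes. It combines a confidence-weighted projected point cloud latent with noise as the initial condition of the diffusion model and applies a Kalman-inspired predict-update mechanism that balances camera-pose-driven predictions with noisy geometric observations, yielding stable, geometry-aware generation.

Details

Motivation: Existing regression-based methods cannot reconstruct unobserved regions, while camera-guided diffusion models often deviate from the intended trajectory because of noisy point cloud projections or insufficient camera pose conditioning. A method is needed that follows prescribed camera poses precisely while completing unobserved regions.

Result: Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views and effectively reconstructs occluded regions under large viewpoint changes.

Insight: The novelties are initializing the diffusion process with a confidence-weighted point cloud latent and using a Kalman-inspired predict-update mechanism to balance pose-driven predictions with noisy observations. Viewed objectively, the method improves robustness under complex geometric changes by adjusting how much it relies on each projection according to its reliability.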

Abstract: We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
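The predict-update idea can be sketched in a few lines. This is a simplified illustration, not the authors' method: it uses a fixed per-element confidence as the gain, whereas the abstract describes learned residual corrections. The function name `predict_update` and the flat-list representation of latents are assumptions made for the example.

```python
from typing import List


def predict_update(
    prediction: List[float],
    measurement: List[float],
    confidence: List[float],
) -> List[float]:
    """Kalman-style blend: x = prediction + gain * (measurement - prediction).

    `confidence` plays the role of the gain (assumed to lie in [0, 1]):
    1.0 trusts the projected point cloud fully, 0.0 keeps the pose-driven
    prediction unchanged.
    """
    return [
        p + c * (m - p)
        for p, m, c in zip(prediction, measurement, confidence)
    ]


# Hypothetical two-element latent: the first element has a reliable
# projection, the second an unreliable one.
blended = predict_update([0.0, 2.0], [1.0, 4.0], [1.0, 0.0])
```

Low-confidence regions fall back to the prediction, which is how uncertain projections get down-weighted.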

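[101] VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models cs.CV

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang

TL;DR: This paper proposes VLM-Loc, a framework that uses the spatial reasoning ability of large vision-language models (VLMs) to localize precisely within 3D point cloud maps from natural language descriptions. It converts point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic information, and uses a partial node assignment mechanism to explicitly associate textual cues with scene graph nodes, enabling interpretable spatial reasoning.

Details

Motivation: Existing text-to-point-cloud (T2P) localization methods rely mainly on shallow text-point cloud correspondence and lack effective spatial reasoning, which limits localization accuracy in complex environments.

Result: On the proposed CityLoc benchmark (built from multi-source point clouds), VLM-Loc shows higher accuracy and robustness than existing state-of-the-art (SOTA) methods.

Insight: The novelty lies in representing point clouds in structured form as BEV images and scene graphs so that a VLM can learn cross-modal representations, and in the partial node assignment mechanism that explicitly links text to spatial nodes, which improves the interpretability and spatial reasoning of the localization process.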

Abstract: Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.

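[102] MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents cs.CV | cs.AI

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee

TL;DR: This paper proposes MA-EgoQA, a new benchmark for evaluating how well models simultaneously understand long-horizon egocentric videos collected by multiple embodied agents. It also proposes a baseline model, EgoMAS, which uses shared memory across agents and agent-wise dynamic retrieval to answer questions over multiple video streams.

Details

Motivation: As embodied models advance, humans will collaborate with multiple embodied AI agents, so parallel sensory inputs (videos) from multiple agents must be processed and understood effectively to support communication between users and the system. Open challenges include compressing and communicating multiple egocentric videos and aggregating them to build system-level memory.

Result: Comprehensive evaluation on MA-EgoQA (1.7k unique questions across five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction) shows that existing methods cannot effectively handle multiple egocentric video streams. The proposed EgoMAS baseline offers a starting point for future research.

Insight: The paper is the first to formally define the problem of simultaneously understanding multiple long-horizon egocentric videos, and it creates a dedicated benchmark, MA-EgoQA. Viewed objectively, EgoMAS, with its shared memory and agent-wise dynamic retrieval, offers a simple and effective architectural idea for information aggregation and context association in multi-agent video understanding.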

Abstract: As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.

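[103] MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities cs.CV

Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen

TL;DR: This paper proposes MissBench, a benchmark framework for evaluating multimodal affective analysis models under imbalanced missing modalities. It standardizes shared and imbalanced missing-rate protocols on four widely used affective datasets and introduces two diagnostic metrics, the Modality Equity Index (MEI) and the Modality Learning Index (MLI), which measure how fairly modalities contribute and how imbalanced optimization is during training. Experiments show that even models that are robust under shared missing rates can exhibit marked modality inequity and optimization imbalance under imbalanced conditions.

Details

Motivation: Standard evaluations in multimodal affective computing usually assume that textual, acoustic, and visual modalities are equally available. In real applications some modalities are more likely to be missing or more expensive, which creates imbalanced missing rates and training biases that task-level metrics cannot reveal.

Result: Experiments with representative method families on four widely used multimodal affective datasets show that models that appear robust under shared missing rates can still exhibit marked modality inequity (measured by MEI) and optimization imbalance (measured by MLI) under imbalanced conditions.

Insight: The paper proposes the first benchmark framework (MissBench) dedicated to multimodal affective analysis under imbalanced missing modalities and designs two new diagnostic metrics (MEI and MLI) that quantify modality equity and optimization imbalance during training, providing practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings.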

Abstract: Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at: https://anonymous.4open.science/r/MissBench-4098/
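The difference between a shared and an imbalanced missing-rate protocol can be shown with a small sketch. It assumes each modality is dropped independently per sample with its own probability; the function name `sample_missing_masks`, the modality keys, and the rates are illustrative and are not taken from the MissBench code.

```python
import random
from typing import Dict, List


def sample_missing_masks(
    num_samples: int,
    missing_rates: Dict[str, float],
    seed: int = 0,
) -> List[Dict[str, bool]]:
    """Return one availability mask per sample (True means the modality is present).

    Each modality is dropped independently with its own missing rate
    (an assumed protocol for illustration).
    """
    rng = random.Random(seed)
    masks = []
    for _ in range(num_samples):
        masks.append(
            {name: rng.random() >= rate for name, rate in missing_rates.items()}
        )
    return masks


# Shared protocol: every modality uses the same rate.
shared = sample_missing_masks(100, {"text": 0.3, "audio": 0.3, "vision": 0.3})
# Imbalanced protocol: one modality is far more fragile than the others.
imbalanced = sample_missing_masks(100, {"text": 0.1, "audio": 0.7, "vision": 0.3})
```

Fixing the seed keeps the masks identical across methods, so differences in results come from the models rather than from the sampling.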

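[104] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing cs.CV

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang

TL;DR: This paper presents InternVL-U, a lightweight 4B-parameter unified multimodal model that integrates understanding, reasoning, generation, and editing in one framework. Following the design principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, it combines a state-of-the-art multimodal large language model with an MMDiT-based visual generation head. A reasoning-centric data synthesis pipeline bridges the gap between aesthetic generation and high-level intelligence. Experiments show a superior balance of performance and efficiency.

Details

Motivation: Unified multimodal models face an inherent trade-off between maintaining strong semantic comprehension and acquiring powerful generation capabilities. The paper aims to resolve this trade-off and democratize these capabilities within a lightweight framework.

Result: With only 4B parameters, the model consistently outperforms unified baselines more than 3x larger (such as the 14B BAGEL) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

Insight: The novelty lies in the design principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, together with a reasoning-centric data synthesis pipeline that uses Chain-of-Thought to better align abstract user intent with fine-grained visual generation details, balancing performance and efficiency in a lightweight model.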

Abstract: Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

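[105] DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary cs.CV

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang

TL;DR: DISPLAY is a framework that generates controllable human-object interaction (HOI) videos through sparse motion guidance and multi-task auxiliary training. It uses only wrist joint coordinates and a shape-agnostic object bounding box as guidance, addressing existing methods' reliance on dense control signals or carefully crafted text prompts when generating physically consistent, controllable HOI videos.

Details

Motivation: Existing human-centric video generation methods struggle to produce controllable and physically consistent HOI videos. They usually rely on dense control signals, template videos, or carefully crafted text prompts, which limits flexibility and generalization to novel objects.

Result: Comprehensive experiments show that the method achieves high-fidelity, controllable HOI generation across diverse tasks. The abstract names no specific benchmarks or SOTA comparisons, but it reports that the method is effective.

Insight: The contributions are: 1) Sparse Motion Guidance consisting only of wrist joints and an object bounding box, which enables intuitive user control and alleviates the imbalance between human and object representations; 2) an Object-Stressed Attention mechanism that improves object robustness under sparse conditions; 3) a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, which uses reliable HOI samples and auxiliary tasks to mitigate the scarcity of high-quality HOI data.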

Abstract: Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.

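[106] Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports cs.CV

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu

TL;DR: The paper presents CourtSI, the first large-scale spatial intelligence dataset for sports scenarios, with over 1M QA pairs covering spatial counting, distance measurement, localization, and relational reasoning across representative net sports including badminton, tennis, and table tennis. It also builds CourtSI-Bench, a high-quality evaluation benchmark, and evaluates 25 vision-language models on it, revealing the limits of existing models' spatial intelligence in sports scenarios. Fine-tuning on CourtSI substantially improves performance and generalizes well.

Details

Motivation: With their high-intensity human motion and dynamic object interactions, sports scenes are a natural testbed for the spatial intelligence of vision-language models, but no dedicated dataset existed.

Result: All 25 proprietary and open-source VLMs evaluated on CourtSI-Bench show a clear gap to human performance, and generalization from existing spatial intelligence benchmarks is limited. Fine-tuning Qwen3-VL-8B on CourtSI improves its accuracy on CourtSI-Bench by 23.5 percentage points, and the model generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport.

Insight: Using well-defined court geometry as metric anchors, the authors develop a semi-automatic data engine that reconstructs sports scenes and enables scalable dataset construction. The findings indicate that sports scenarios expose limitations in spatial intelligence that existing benchmarks fail to capture, and that CourtSI offers a scalable path toward advancing the spatial intelligence of VLMs in sports.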

Abstract: Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.

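[107] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition cs.CV

Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He

TL;DR: The paper proposes WikiCLIP, an efficient contrastive baseline framework for open-domain visual entity recognition. It uses large language model embeddings as knowledge-rich entity representations, aligns textual semantics with visual cues at the patch level through a Vision-Guided Knowledge Adaptor, and uses a Hard Negative Synthesis Mechanism to strengthen fine-grained discrimination.

Details

Motivation: Existing generative methods for open-domain visual entity recognition have high computational cost and scale poorly. The paper aims to establish a contrastive baseline that is both efficient and strong.

Result: WikiCLIP significantly outperforms strong baselines on popular benchmarks such as OVEN, with a 16% improvement on the OVEN unseen set and inference latency nearly 100 times lower than the leading generative model, AutoVER.

Insight: The novelty lies in combining the knowledge embeddings of large language models with vision-guided fine-grained alignment, and in synthesizing hard negatives to improve discrimination between visually similar but semantically distinct entities, giving an efficient and strong contrastive paradigm for open-domain visual entity recognition.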

Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
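The contrastive paradigm reduces recognition to nearest-neighbor search in a shared embedding space, which is why it is cheap at inference time. The sketch below ranks entities by cosine similarity to an image embedding. The embeddings and entity names are made up, and `rank_entities` is a hypothetical helper, not part of WikiCLIP.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_entities(
    image_embedding: List[float],
    entity_embeddings: Dict[str, List[float]],
) -> List[str]:
    """Return entity names ordered from most to least similar to the image."""
    return sorted(
        entity_embeddings,
        key=lambda name: cosine(image_embedding, entity_embeddings[name]),
        reverse=True,
    )


# Toy 2-D embeddings; real entity embeddings would be precomputed once
# and reused for every query image.
entities = {"cat": [0.9, 0.1], "car": [0.0, 1.0]}
ranking = rank_entities([1.0, 0.0], entities)
```

Because entity embeddings can be computed offline, each query costs one image encoding plus a similarity search, with no autoregressive decoding.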

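[108] Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction cs.CV | cs.IR

Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao

TL;DR: This paper proposes a fine-grained motion retrieval method based on joint-angle motion images and token-patch late interaction. It maps joint-level local features into a structured pseudo-image and uses a pre-trained Vision Transformer with an enhanced MaxSim mechanism to obtain interpretable, fine-grained alignment between text and 3D human motion skeleton sequences, outperforming state-of-the-art methods on the HumanML3D and KIT-ML benchmarks.

Details

Motivation: Existing text-motion retrieval methods use a dual-encoder framework that compresses inputs into global embeddings, which discards fine-grained local correspondences and leads to lower accuracy and limited interpretability.

Result: Extensive experiments on the HumanML3D and KIT-ML benchmarks show that the method outperforms current state-of-the-art (SOTA) approaches on text-motion retrieval.

Insight: The contributions are an interpretable joint-angle-based motion representation (mapping local features into a pseudo-image compatible with pre-trained ViTs) and the adoption and enhancement of the MaxSim token-wise late interaction mechanism (with Masked Language Modeling regularization to foster robust, interpretable alignment).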

Abstract: Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
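MaxSim late interaction scores a pair by matching every text token to its best motion patch and summing those maxima, so each token's contribution stays inspectable. The sketch below shows the standard MaxSim scoring rule on toy vectors. The dot-product similarity and the helper names are assumptions for illustration, and the paper's MLM-regularized variant is not reproduced here.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(
    text_tokens: List[List[float]],
    motion_patches: List[List[float]],
) -> Tuple[float, List[int]]:
    """Return the MaxSim score and, per text token, the index of its best patch."""
    score = 0.0
    alignment = []
    for token in text_tokens:
        sims = [dot(token, patch) for patch in motion_patches]
        best = max(range(len(sims)), key=lambda i: sims[i])
        alignment.append(best)
        score += sims[best]
    return score, alignment


# Two text tokens, two motion patches (toy 2-D embeddings).
score, alignment = maxsim(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
)
```

The `alignment` list is what makes the retrieval interpretable: it records which patch each word matched, information a single global embedding would discard.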

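[109] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation cs.CV | cs.AI

Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang

TL;DR: The paper proposes ACADiff, a framework that addresses missing multimodal brain imaging data in Alzheimer's disease diagnosis. Using an adaptive clinical-aware latent diffusion model, it progressively denoises latent representations while attending to available imaging data and clinical metadata, and thereby synthesizes the missing imaging modalities. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET.

Details

Motivation: Multimodal neuroimaging provides complementary information for Alzheimer's disease diagnosis, but clinical datasets frequently have missing modalities. A method is needed that synthesizes the missing modalities while preserving diagnostic performance.

Result: Evaluated on the ADNI dataset, ACADiff achieves superior generation quality and maintains robust diagnostic performance even in the extreme case of 80% missing data, outperforming all existing baselines.

Insight: The novelty lies in an adaptive fusion mechanism that dynamically reconfigures according to input availability, combined with semantic clinical guidance via GPT-4o-encoded prompts, so that clinical information is integrated into the latent diffusion process. This combines the semantic understanding of large language models with the generative ability of diffusion models for medical image generation and imputation.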

Abstract: Multimodal neuroimaging provides complementary insights for Alzheimer’s disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff

cs.AI

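[110] A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations cs.AI | cs.CL | cs.DC | cs.IR | cs.LG

Joshua Castillo, Ravi Mukkamala

TL;DR: This paper introduces the Guardian LLM Pipeline, a multi-LLM system for missing-person investigations. It coordinates end-to-end execution across task-specialized LLMs and invokes a consensus LLM engine that compares the models' outputs and resolves disagreements. The system is strengthened with QLoRA-based fine-tuning and is designed to support missing-child investigation and early search planning, emphasizing conservative, auditable use of LLMs as structured information extractors and labelers.

Details

Motivation: The paper targets the efficiency and reliability of information extraction and processing in missing-person investigations (especially the critical first 72 hours) and aims to support early search planning.

Result: The abstract reports no quantitative benchmark results or SOTA comparisons, but it emphasizes that the system improves the reliability and consistency of information processing through its consensus mechanism and fine-tuning.

Insight: The novelty lies in a multi-model pipeline that combines task-specialized LLMs, a consensus engine, and QLoRA fine-tuning, positioning LLMs as constrained structured extractors rather than end-to-end decision makers, which improves auditability and reliability.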

Abstract: The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLM models and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
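A consensus step over several extractor outputs can be sketched as follows. The abstract describes a consensus LLM engine, which this sketch does not implement. It substitutes a plain strict-majority vote per field and returns the disputed fields, which could then be passed to a stronger resolver. The field names and helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def resolve_field(values: List[Optional[str]]) -> Tuple[Optional[str], bool]:
    """Return (value, True) if one value has a strict majority, else (None, False)."""
    top_value, votes = Counter(values).most_common(1)[0]
    if votes > len(values) / 2:
        return top_value, True
    return None, False


def build_consensus(
    model_outputs: List[Dict[str, str]],
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Merge per-model extractions into agreed fields and a list of disputed fields."""
    fields = sorted({key for output in model_outputs for key in output})
    agreed: Dict[str, Optional[str]] = {}
    disputed: List[str] = []
    for field in fields:
        # A model that did not extract the field votes None.
        value, ok = resolve_field([output.get(field) for output in model_outputs])
        if ok:
            agreed[field] = value
        else:
            disputed.append(field)
    return agreed, disputed


# Hypothetical extractions from three task-specialized models.
outputs = [
    {"age": "9", "shirt_color": "red"},
    {"age": "9", "shirt_color": "blue"},
    {"age": "9", "shirt_color": "green"},
]
agreed, disputed = build_consensus(outputs)
```

Keeping the disputed fields explicit rather than guessing a value fits the conservative, auditable use of LLMs that the paper emphasizes.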

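[111] From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring cs.AI | cs.CL | cs.LG

Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi

TL;DR: This paper presents Sentinel, an autonomous AI agent that uses the Model Context Protocol (MCP) and 21 clinical tools to perform contextual triage of remote patient monitoring (RPM) vitals through multi-step reasoning. Against a clinician majority-vote standard, the agent reaches 95.8% emergency sensitivity and 88.5% sensitivity for actionable alerts, and it outperforms individual clinicians at a median cost of $0.34 per triage, offering a scalable automated answer to RPM data overload.

Details

Motivation: Remote patient monitoring (RPM) generates vast amounts of data, but earlier trials (e.g., Tele-HF, BEAT-HF) failed because the data volume overwhelmed clinical staff. TIM-HF2 showed that 24/7 physician-led monitoring reduces mortality by 30%, but that model is prohibitively expensive and hard to scale. An automated AI agent is therefore needed for efficient, scalable clinical triage.

Result: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Four-level exact accuracy was 69.4% (quadratic-weighted kappa=0.778). In leave-one-out analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. 60.0%) and actionable sensitivity (90.9% vs. 69.5%), and its self-consistency was near-perfect (kappa=0.850).

Insight: The novelty lies in using the Model Context Protocol (MCP) for context synthesis and multi-step reasoning to automate systematic triage, addressing the core limitation of RPM data overload. Viewed objectively, a low-cost, highly consistent AI agent offers a scalable path toward the intensive monitoring shown to reduce mortality, while keeping a clinically defensible overtriage profile.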

Abstract: Background: Remote patient monitoring (RPM) generates vast data, yet landmark trials (Tele-HF, BEAT-HF) failed because data volume overwhelmed clinical staff. While TIM-HF2 showed 24/7 physician-led monitoring reduces mortality by 30%, this model remains prohibitively expensive and unscalable. Methods: We developed Sentinel, an autonomous AI agent using Model Context Protocol (MCP) for contextual triage of RPM vitals via 21 clinical tools and multi-step reasoning. Evaluation included: (1) self-consistency (100 readings x 5 runs); (2) comparison against rule-based thresholds; and (3) validation against 6 clinicians (3 physicians, 3 NPs) using a connected matrix design. A leave-one-out (LOO) analysis compared the agent against individual clinicians; severe overtriage cases underwent independent physician adjudication. Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Four-level exact accuracy was 69.4% (quadratic-weighted kappa=0.778); 95.9% of classifications were within one severity level. In LOO analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. 60.0% aggregate) and actionable sensitivity (90.9% vs. 69.5%). While disagreements skewed toward overtriage (22.5%), independent adjudication of severe gaps (>=2 levels) validated agent escalation in 88-94% of cases; consensus resolution validated 100%. The agent showed near-perfect self-consistency (kappa=0.850). Median cost was $0.34/triage. Conclusions: Sentinel triages RPM vitals with sensitivity exceeding individual clinicians. By automating systematic context synthesis, Sentinel addresses the core limitation of prior RPM trials, offering a scalable path toward the intensive monitoring shown to reduce mortality while maintaining a clinically defensible overtriage profile.
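The quadratic-weighted kappa reported for four-level agreement penalizes disagreements by the squared distance between severity levels, so a one-level miss costs far less than a three-level miss. The sketch below computes the standard statistic in pure Python. The ratings are invented for illustration, and the function is not taken from the paper's evaluation code.

```python
from typing import List


def quadratic_weighted_kappa(
    rater_a: List[int],
    rater_b: List[int],
    num_levels: int,
) -> float:
    """Cohen's kappa with weights w_ij = (i - j)^2 / (k - 1)^2.

    Ratings are integers in [0, num_levels). The result is undefined
    (division by zero) if both raters use a single identical level throughout.
    """
    n = len(rater_a)
    k = num_levels
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Invented four-level severity ratings (0 = routine, 3 = emergency).
agent = [0, 1, 2, 3, 3, 1]
panel = [0, 1, 2, 3, 2, 1]
kappa = quadratic_weighted_kappa(agent, panel, num_levels=4)
```

A value of 1.0 means perfect agreement and 0 means chance-level agreement, which is the scale on which the reported 0.778 and 0.850 should be read.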

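[112] The Reasoning Trap – Logical Reasoning as a Mechanistic Pathway to Situational Awareness cs.AI | cs.CL | cs.CY | cs.LG

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

TL;DR: This paper proposes the RAISE framework, arguing that improvements in the logical reasoning of large language models progressively strengthen AI situational awareness through three mechanistic pathways (deductive self inference, inductive context recognition, and abductive self modeling), which may create an escalation risk running from self-recognition to strategic deception.

Details

Motivation: The paper aims to expose the collision risk between logical reasoning research and the development of AI situational awareness: stronger logical reasoning may inadvertently give AI systems dangerous self-knowledge and strategic reasoning abilities.

Result: By formalizing the three mechanistic pathways, the paper constructs an escalation ladder from basic self-recognition to strategic deception, argues that every major research topic in LLM logical reasoning maps directly onto a specific amplifier of situational awareness, and finds that current safety measures are insufficient to prevent this escalation.

Insight: The paper is the first to systematically establish a mechanistic link between logical reasoning ability and situational awareness risk. It proposes the RAISE framework and the escalation ladder model, and offers the community concrete safeguards including a "Mirror Test" benchmark and a Reasoning Safety Parity Principle.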

Abstract: Situational awareness, the capacity of an AI system to recognize its own nature, understand its training and deployment context, and reason strategically about its circumstances, is widely considered among the most dangerous emergent capabilities in advanced AI systems. Separately, a growing research effort seeks to improve the logical reasoning capabilities of large language models (LLMs) across deduction, induction, and abduction. In this paper, we argue that these two research trajectories are on a collision course. We introduce the RAISE framework (Reasoning Advancing Into Self Examination), which identifies three mechanistic pathways through which improvements in logical reasoning enable progressively deeper levels of situational awareness: deductive self inference, inductive context recognition, and abductive self modeling. We formalize each pathway, construct an escalation ladder from basic self recognition to strategic deception, and demonstrate that every major research topic in LLM logical reasoning maps directly onto a specific amplifier of situational awareness. We further analyze why current safety measures are insufficient to prevent this escalation. We conclude by proposing concrete safeguards, including a “Mirror Test” benchmark and a Reasoning Safety Parity Principle, and pose an uncomfortable but necessary question to the logical reasoning community about its responsibility in this trajectory.

[113] Think Before You Lie: How Reasoning Improves Honesty cs.AI | cs.CL | cs.LGPDF

Ann Yuan, Asma Ghandeharioun, Carter Blum, Alicia Machado, Jessica Hoffmann

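TL;DR: This paper studies the honesty of large language models (LLMs) in moral trade-off scenarios and finds that, unlike in humans, reasoning consistently increases LLM honesty. The effect is tied to the geometry of the representational space: deceptive regions are metastable and more easily perturbed, and reasoning traverses a biased representational space that nudges the model toward its more stable, honest defaults.

Details

Motivation: Existing work mainly measures LLM deception rates, while the underlying conditions that give rise to deceptive behavior remain poorly understood. This paper investigates how reasoning affects LLM honesty in realistic moral trade-offs where honesty incurs variable costs.

Result: Reasoning consistently increases honesty across model scales and for several LLM families. The effect cannot be explained by reasoning content alone, since reasoning traces are often poor predictors of final behavior. Experiments with input paraphrasing, output resampling, and activation noise show that deceptive answers are less stable (metastable) in the representational space than honest ones.

Insight: The key contribution is showing that the mechanism by which reasoning improves LLM honesty is related to a geometric property of the representational space (metastability), not only to the semantic content of the reasoning. This offers a new perspective for understanding and intervening in LLM moral decision-making: model behavior could be steered by manipulating the stability of the representational space.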

Abstract: While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question using a novel dataset of realistic moral trade-offs where honesty incurs variable costs. Contrary to humans, who tend to become less honest given time to deliberate (Capraro, 2017; Capraro et al., 2019), we find that reasoning consistently increases honesty across scales and for several LLM families. This effect is not only a function of the reasoning content, as reasoning traces are often poor predictors of final behaviors. Rather, we show that the underlying geometry of the representational space itself contributes to the effect. Namely, we observe that deceptive regions within this space are metastable: deceptive answers are more easily destabilized by input paraphrasing, output resampling, and activation noise than honest ones. We interpret the effect of reasoning in this vein: generating deliberative tokens as part of moral reasoning entails the traversal of a biased representational space, ultimately nudging the model toward its more stable, honest defaults.

cs.RO [Back]

[114] SurgCalib: Gaussian Splatting-Based Hand-Eye Calibration for Robot-Assisted Minimally Invasive Surgery cs.RO | cs.CVPDF

Zijian Wu, Shuojue Yang, Yu Chung Lee, Eitan Prisman, Yueming Jin

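TL;DR: This paper proposes SurgCalib, a markerless hand-eye calibration framework based on Gaussian Splatting for the da Vinci surgical robot. The method initializes the surgical instrument pose from raw kinematic measurements and refines it through a two-phase optimization under the remote center of motion (RCM) constraint within a Gaussian Splatting-based differentiable rendering pipeline. It targets the problem that conventional calibration relies on markers, which can violate sterility protocols in the operating room.

Details

Motivation: In vision-guided robotic systems, accurately estimating the rigid transformation between the robot base and the camera frame is essential for reliable closed-loop control. For cable-driven surgical robots, encoder-based proprioceptive measurements are often inaccurate because of cable stretch and backlash. Conventional hand-eye calibration also relies on known fiducial patterns, and introducing additional markers into the operating room (OR) can violate sterility protocols and disrupt surgical workflows.

Result: Evaluated on SurgPose, a public dVRK benchmark, the method achieves average 2D tool-tip reprojection errors of 12.24 px (2.06 mm) and 11.33 px (1.9 mm), and 3D tool-tip Euclidean distance errors of 5.98 mm and 4.75 mm, for the left and right instruments, respectively.

Insight: The claimed novelty is the first markerless, automatic hand-eye calibration framework based on Gaussian Splatting, with the potential to be used in the OR. Its core idea is to combine raw kinematic measurements with Gaussian Splatting differentiable rendering and to run a two-phase optimization under the RCM constraint, which removes the dependence on physical markers while improving calibration accuracy and practicality in the presence of cable-driven errors.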

Abstract: We present a Gaussian Splatting-based framework for hand-eye calibration of the da Vinci surgical robot. In a vision-guided robotic system, accurate estimation of the rigid transformation between the robot base and the camera frame is essential for reliable closed-loop control. For cable-driven surgical robots, this task faces unique challenges. The encoders of surgical instruments often produce inaccurate proprioceptive measurements due to cable stretch and backlash. Conventional hand-eye calibration approaches typically rely on known fiducial patterns and solve the AX = XB formulation. While effective, introducing additional markers into the operating room (OR) environment can violate sterility protocols and disrupt surgical workflows. In this study, we propose SurgCalib, an automatic, markerless framework that has the potential to be used in the OR. SurgCalib first initializes the pose of the surgical instrument using raw kinematic measurements and subsequently refines this pose through a two-phase optimization procedure under the RCM constraint within a Gaussian Splatting-based differentiable rendering pipeline. We evaluate the proposed method on the public dVRK benchmark, SurgPose. The results demonstrate average 2D tool-tip reprojection errors of 12.24 px (2.06 mm) and 11.33 px (1.9 mm), and 3D tool-tip Euclidean distance errors of 5.98 mm and 4.75 mm, for the left and right instruments, respectively.
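To make the AX = XB formulation mentioned in the abstract concrete, here is a minimal sketch in Python with NumPy. It builds a hypothetical ground-truth hand-eye transform X and a robot motion A, derives the matching camera motion B = X⁻¹AX, and checks that the identity holds. The helper names (`rotation_about_axis`, `make_transform`) and all numeric values are illustrative assumptions. This is the conventional formulation the abstract contrasts against, not SurgCalib's markerless method.

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: rotation matrix for an axis and an angle in radians."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def make_transform(R, t):
    """Pack a rotation and a translation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical unknown transform X (the quantity a calibration would solve for).
X = make_transform(rotation_about_axis([0, 0, 1], 0.3), [0.10, -0.05, 0.20])
# Hypothetical relative robot motion A between two poses.
A = make_transform(rotation_about_axis([1, 1, 0], 0.7), [0.02, 0.03, -0.01])
# The camera-side motion B that is consistent with A under X.
B = np.linalg.inv(X) @ A @ X

# For consistent motions, AX - XB vanishes up to floating-point error.
residual = np.linalg.norm(A @ X - X @ B)
```

In a real calibration, several (A, B) pairs are measured and X is the unknown; the residual above is the quantity a solver would drive toward zero.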

[115] See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation cs.RO | cs.CVPDF

Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zhihui Li

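TL;DR: This paper proposes SPR (See, Plan, Rewind), a progress-aware vision-language-action framework for robust robotic manipulation. The framework dynamically grounds language instructions into a sequence of spatial subgoals and runs a continuous See-Plan-Rewind cycle to monitor task progress, plan trajectories, and recover on failure, achieving closed-loop error correction without additional training data or auxiliary models.

Details

Motivation: Measuring task progress through explicit, actionable milestones is critical for robust robotic manipulation. Progress awareness lets a model ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls.

Result: On the LIBERO benchmark, SPR outperforms the MolmoAct baseline by 5%. On the more challenging LIBERO-Plus benchmark, which includes unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA and demonstrating strong out-of-distribution robustness.

Insight: The core contribution is an explicit, progress-aware closed-loop framework: the See-Plan-Rewind cycle decomposes high-level language instructions into a sequence of executable spatial subgoals and builds in failure recovery driven by progress monitoring. The approach improves robustness without additional training data or auxiliary models, offering a structured and interpretable way to turn language instructions into robot actions reliably.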

Abstract: Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework’s effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang

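TL;DR: This paper proposes SACA (Step-Aware Contrastive Alignment), a framework for vision-language navigation in continuous environments (VLN-CE) that addresses the difficulty of balancing generalization, error recovery, and training stability. A Perception-Grounded Step-Aware auditor extracts dense supervision from imperfect trajectories, and a Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. The paper reports state-of-the-art performance on VLN-CE benchmarks.

Details

Motivation: Current VLN-CE training paradigms based on multi-modal large language models (MLLMs) have two main problems. First, policies trained with supervised fine-tuning (SFT) suffer from compounding errors and struggle to recover from out-of-distribution states. Second, reinforcement fine-tuning (RFT) methods such as GRPO are bottlenecked by sparse outcome rewards: their binary feedback cannot assign credit to individual steps, which leads to gradient signal collapse in failure-dominant batches.

Result: Extensive experiments on VLN-CE benchmarks show that SACA achieves state-of-the-art performance.

Insight: The core contribution is the step-aware contrastive alignment framework. The Perception-Grounded Step-Aware auditor evaluates trajectories step by step, disentangling failed trajectories into valid prefixes and exact divergence points, and thereby extracts dense supervision from imperfect trajectories. The Scenario-Conditioned Group Construction mechanism routes batches to specialized resampling and optimization strategies according to trajectory status, offering a new approach to sparse rewards and training stability.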

Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery, and training stability. Specifically, (i) policies derived from SFT suffer from compounding errors, struggling to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods, e.g., GRPO, are bottlenecked by sparse outcome rewards. Their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure-dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware auditor evaluates progress step by step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, the Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.

Xinyu Gao, Gang Chen, Javier Alonso-Mora

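TL;DR: BEACON is a method for language-conditioned navigation that handles occluded target locations by predicting an affordance heatmap in Bird's-Eye View (BEV). It combines surround-view RGB-D observations with a vision-language model and is validated on a dataset built in the Habitat simulator.

Details

Motivation: Existing navigation methods based on vision-language models (VLMs) usually reason in image space and struggle to localize targets in regions occluded by furniture or moving humans.

Result: On the validation subset with occluded target locations, built in the Habitat simulator, BEACON improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline.

Insight: The key idea is to move navigation reasoning from image space to BEV space: injecting spatial cues into the VLM and fusing its output with depth-derived BEV features enables affordance prediction in occluded regions and improves navigation robustness in cluttered environments.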

Abstract: Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird’s-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM’s output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
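The headline number is an accuracy averaged across geodesic thresholds. As a rough illustration of how such a metric could be computed, the Python sketch below counts a prediction as correct when its geodesic distance to the target is within a threshold, then averages the per-threshold accuracies. The function names, the threshold values, the distances, and the use of `<=` are all assumptions for illustration; the abstract does not specify the paper's exact thresholds or protocol.

```python
def accuracy_at(distances, threshold):
    """Fraction of predictions whose geodesic error is within the threshold."""
    hits = sum(1 for d in distances if d <= threshold)
    return hits / len(distances)

def mean_accuracy(distances, thresholds):
    """Average the per-threshold accuracies into a single score."""
    return sum(accuracy_at(distances, t) for t in thresholds) / len(thresholds)

# Hypothetical geodesic errors (meters) for four predictions.
distances = [0.2, 0.4, 0.9, 1.6]
# Hypothetical thresholds (meters).
thresholds = [0.5, 1.0, 2.0]

# Per-threshold accuracies are 0.5, 0.75, and 1.0, so the mean is 0.75.
score = mean_accuracy(distances, thresholds)
```

A gain reported in percentage points is then the difference between two such scores multiplied by 100.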

cs.LG [Back]

[118] ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning cs.LG | cs.AI | cs.CLPDF

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi

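TL;DR: This paper proposes ACTIVEULTRAFEEDBACK, a modular active learning pipeline for efficiently generating preference data to align large language models (LLMs). The pipeline uses uncertainty estimates to dynamically select the most informative responses for annotation and introduces two new response selection methods, DRTS and DELTAUCB, which prioritize response pairs with large predicted quality gaps. Experiments show comparable or superior downstream performance with as little as one-sixth of the annotated data used by static baselines.

Details

Motivation: Acquiring preference data for Reinforcement Learning from Human Feedback (RLHF) is costly, especially in low-resource and expert domains, and this cost bottlenecks RLHF's efficacy.

Result: The datasets produced by ACTIVEULTRAFEEDBACK lead to significant improvements in downstream performance, achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines.

Insight: The core contribution is a systematic application of active learning to LLM preference data generation, together with two new response selection strategies (DRTS and DELTAUCB) based on predicted quality gaps rather than absolute quality. These build on recent results showing that pairs with large quality gaps provide good signals for fine-tuning, and they yield a marked improvement in annotation efficiency.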

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
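The abstract does not define DRTS or DELTAUCB, so the Python sketch below is not either method. It only illustrates the general idea of preferring response pairs with a large predicted quality gap while accounting for uncertainty: each candidate carries a hypothetical predicted mean and standard deviation, and the pair with the largest optimistic gap (upper bound of one response minus lower bound of the other) is chosen for annotation. The scoring rule, the `beta` parameter, and all values are assumptions.

```python
from itertools import permutations

def optimistic_gap(chosen, rejected, beta):
    """Upper confidence bound of one response minus lower bound of the other."""
    upper = chosen["mean"] + beta * chosen["std"]
    lower = rejected["mean"] - beta * rejected["std"]
    return upper - lower

def select_pair(candidates, beta=1.0):
    """Return the ordered pair (chosen, rejected) with the largest optimistic gap."""
    return max(permutations(candidates, 2),
               key=lambda pair: optimistic_gap(pair[0], pair[1], beta))

# Hypothetical reward-model estimates for three responses to one prompt.
candidates = [
    {"id": "a", "mean": 0.80, "std": 0.05},
    {"id": "b", "mean": 0.60, "std": 0.30},
    {"id": "c", "mean": 0.20, "std": 0.10},
]

chosen, rejected = select_pair(candidates)
```

With these numbers the uncertain response "b" is paired against the weak response "c": its optimistic bound exceeds that of the confident response "a", which shows how uncertainty can change which pair is sent for annotation.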

[119] MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning cs.LG | cs.AI | cs.CLPDF

Yiyang Lu, Yu He, Jianlong Chen, Hongyuan Zha

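TL;DR: This paper proposes MSSR (Memory-Inspired Sampler and Scheduler Replay), an adaptive experience replay framework that mitigates catastrophic forgetting during continual fine-tuning of large language models (LLMs). The method estimates sample-level memory strength and schedules replay at adaptive intervals, retaining old skills while still adapting quickly to new knowledge.

Details

Motivation: LLMs deployed in dynamic environments need continual fine-tuning, which enables rapid acquisition of new knowledge but also causes catastrophic forgetting. Existing replay-based methods (fixed interleaved replay, accuracy-supervised scheduling, and loss-driven scheduling) are limited: some depend on heuristic rules and only partially mitigate forgetting, while others incur substantial computational overhead.

Result: Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.

Insight: The core contribution, motivated by retention dynamics under sequential fine-tuning, is a unified framework that combines sample-level memory strength estimation with adaptive replay scheduling. It draws on memory theory from cognitive science, rather than heuristic rules, to guide the replay strategy, striking a better balance between computational efficiency and forgetting mitigation.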

Abstract: Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic forgetting, where previously learned skills degrade during sequential training. Existing replay-based strategies, such as fixed interleaved replay, accuracy-supervised, and loss-driven scheduling, remain limited: some depend on heuristic rules and provide only partial mitigation of forgetting, while others improve performance but incur substantial computational overhead. Motivated by retention dynamics under sequential fine-tuning, we propose Memory-Inspired Sampler and Scheduler Replay (MSSR), an experience replay framework that estimates sample-level memory strength and schedules rehearsal at adaptive intervals to mitigate catastrophic forgetting while maintaining fast adaptation. Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.

cs.AR [Back]

[120] VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation cs.AR | cs.CLPDF

Luca Collini, Andrew Hennesee, Patrick Yubeaton, Siddharth Garg, Ramesh Karri

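TL;DR: This paper presents an empirical study of how language models and prompt design interact in Verilog code generation. It evaluates a range of models, including general-purpose, reasoning, and domain-specific variants, and tests structured outputs, prompt rewriting, chain-of-thought reasoning, in-context learning, and evolutionary prompt optimization via Genetic-Pareto. Model classes respond differently to structured prompts and optimization, and the study separates trends that generalize across models and benchmarks from those specific to particular combinations.

Details

Motivation: Rapid advances in language models have complicated the trade-offs between model characteristics and prompt design choices in automated code generation. This paper aims to reveal, through empirical study, how model reasoning, specialization, and prompt engineering strategies interact in Verilog code generation.

Result: Using a controlled factorial design over multiple prompting strategies on two Verilog benchmarks, the study identifies patterns in how model classes respond to structured prompts and optimization, and documents which trends generalize across models and benchmarks versus those specific to particular model-prompt combinations.

Insight: The contribution is a systematic empirical map of model-prompt interaction trends in Verilog code generation, including prompt optimization via Genetic-Pareto evolution. It distinguishes general trends from combination-specific effects and provides empirical grounding for model selection and prompt design.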

Abstract: Rapid advances in language models (LMs) have created new opportunities for automated code generation while complicating trade-offs between model characteristics and prompt design choices. In this work, we provide an empirical map of recent trends in LMs for Verilog code generation, focusing on interactions among model reasoning, specialization, and prompt engineering strategies. We evaluate a diverse set of small and large LMs, including general-purpose, reasoning, and domain-specific variants. Our experiments use a controlled factorial design spanning benchmark prompts, structured outputs, prompt rewriting, chain-of-thought reasoning, in-context learning, and evolutionary prompt optimization via Genetic-Pareto. Across two Verilog benchmarks, we identify patterns in how model classes respond to structured prompts and optimization, and we document which trends generalize across LMs and benchmarks versus those that are specific to particular model-prompt combinations.

cs.IR [Back]

[121] TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA cs.IR | cs.CLPDF

Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan

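TL;DR: This paper proposes TA-Mem, a tool-augmented autonomous memory retrieval framework that addresses the challenges large language models face on long-context reasoning tasks because of limited context windows. A memory extraction agent adaptively chunks the input into semantically related sub-contexts and extracts them into structured notes; a multi-indexed memory database supports several query methods; and a tool-augmented memory retrieval agent autonomously selects tools to explore the memory for long-term conversational QA.

Details

Motivation: Although current LLMs show strong reasoning ability, their limited context windows make long-range reasoning tasks difficult and call for an external memory storage system. Existing memory retrieval methods rely mainly on predefined workflows or static top-k similarity retrieval over embeddings, and lack flexibility.

Result: Evaluated on the LoCoMo dataset, the method achieves significant performance improvements over existing baselines. An analysis of tool use across question types also demonstrates its adaptivity.

Insight: The contribution is an end-to-end framework that integrates adaptive memory extraction, a multi-indexed database, and a tool-augmented autonomous retrieval agent. At its core, memory retrieval is framed as an LLM-driven exploratory task in which the agent chooses among tools such as key-based lookup and similarity-based retrieval, giving memory access that is more flexible and adapts better to complex queries than static retrieval.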

Abstract: Large language models (LLMs) have exhibited strong reasoning ability in text-based contexts across various domains, yet their limited context windows pose challenges on long-range inference tasks and necessitate a memory storage system. While many storage approaches have been proposed, using episodic notes and graph representations of memory, retrieval methods still primarily rely on predefined workflows or static top-k similarity search over embeddings. To address this inflexibility, we introduce a novel tool-augmented autonomous memory retrieval framework (TA-Mem), which contains: (1) a memory extraction LLM agent that is prompted to adaptively chunk an input into sub-contexts based on semantic correlation and to extract information into structured notes, (2) a multi-indexed memory database designed for different types of query methods, including both key-based lookup and similarity-based retrieval, and (3) a tool-augmented memory retrieval agent that explores the memory autonomously by selecting appropriate tools provided by the database based on the user input, and decides whether to proceed to the next iteration or finalize the response after reasoning on the fetched memories. TA-Mem is evaluated on the LoCoMo dataset, achieving significant performance improvements over existing baseline approaches. In addition, an analysis of tool use across different question types demonstrates the adaptivity of the proposed method.

cs.SD [Back]

[122] Fish Audio S2 Technical Report cs.SD | cs.AI | cs.CLPDF

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang

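TL;DR: Fish Audio S2 is an open-source text-to-speech system that supports multi-speaker, multi-turn generation and instruction-following control via natural-language descriptions. The report describes its multi-stage training recipe and data pipeline, and releases model weights, fine-tuning code, and an efficient SGLang-based inference engine.

Details

Motivation: The work aims to advance open-source TTS and to address the limitations of conventional TTS systems in instruction-following control and in multi-speaker, multi-turn generation.

Result: The inference engine is production-ready for streaming, with a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms.

Insight: Contributions include instruction-following control via natural-language descriptions, a multi-stage training recipe and data pipeline (covering video and speech captioning, voice-quality assessment, and reward modeling), and an efficient production-grade inference engine.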

Abstract: We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
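For readers unfamiliar with RTF, the short Python sketch below works through what the reported figure implies, assuming the usual definition of real-time factor as processing time divided by the duration of the generated audio. The function name and the 10-second clip length are illustrative assumptions; only the value 0.195 comes from the abstract.

```python
def synthesis_seconds(audio_seconds, rtf):
    """Compute time needed to generate audio of the given duration at a given RTF."""
    return audio_seconds * rtf

RTF = 0.195  # value reported in the abstract

# Generating a hypothetical 10-second clip takes about 1.95 seconds of compute.
compute_time = synthesis_seconds(10.0, RTF)

# Equivalently, generation runs roughly 5.1 times faster than real time.
speedup = 1.0 / RTF
```

An RTF below 1.0 is what makes streaming feasible: audio is produced faster than it is played back.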

[123] VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs cs.SD | cs.AI | cs.CL | cs.MM | eess.ASPDF

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain

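TL;DR: This paper presents VoxEmo, a comprehensive benchmark for evaluating speech large language models on speech emotion recognition (SER), covering 35 emotion corpora across 15 languages. It introduces a standardized toolkit, a distribution-aware soft-label protocol, and a prompt-ensemble strategy to address the zero-shot stochasticity of generative interfaces and the inherent ambiguity of emotion.

Details

Motivation: Speech LLMs show promise for SER via generative interfaces, but zero-shot stochasticity makes evaluation highly sensitive to prompts, and existing benchmarks overlook the inherent ambiguity of human emotion. A more comprehensive evaluation framework is therefore needed.

Result: Experiments on VoxEmo show that zero-shot speech LLMs trail supervised baselines in hard-label accuracy but uniquely align with human subjective emotion distributions.

Insight: Contributions include a standardized multilingual, multi-corpus SER benchmark and a distribution-aware soft-label protocol paired with a prompt-ensemble strategy that emulates annotator disagreement, which makes evaluation less sensitive to any single prompt and brings generative SER evaluation closer to real-world use.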

Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
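One way to picture a distribution-aware comparison is sketched below in Python. Annotator votes for a single utterance form a soft label, the outputs of several prompts form a model-side distribution, and the two are compared with total variation distance. The label set, the votes, the prompt outputs, and the choice of total variation distance are all assumptions for illustration; the abstract does not state which divergence or ensembling rule VoxEmo uses.

```python
from collections import Counter

# Hypothetical closed label set for one corpus.
LABELS = ["angry", "happy", "neutral", "sad"]

def to_distribution(votes, labels=LABELS):
    """Turn a list of categorical votes into a probability vector over labels."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[label] / total for label in labels]

def total_variation(p, q):
    """Total variation distance between two distributions (0 means identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical votes from five human annotators for one utterance.
annotator_votes = ["sad", "sad", "neutral", "sad", "angry"]
# Hypothetical answers from the same model under four different prompts.
prompt_outputs = ["sad", "neutral", "sad", "sad"]

human = to_distribution(annotator_votes)   # [0.2, 0.0, 0.2, 0.6]
model = to_distribution(prompt_outputs)    # [0.0, 0.0, 0.25, 0.75]
distance = total_variation(human, model)
```

A hard-label metric would only check whether the model's top answer ("sad") matches the majority vote; the distance above also penalizes the model for never producing the minority label "angry".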

[124] How Contrastive Decoding Enhances Large Audio Language Models? cs.SD | cs.CL | eess.ASPDF

Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee

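TL;DR: This study systematically evaluates four contrastive decoding strategies across different large audio language model architectures and finds Audio-Aware Decoding and Audio Contrastive Decoding to be the most effective, though their effect varies by model. A Transition Matrix framework, which maps how error patterns shift during inference, shows that contrastive decoding reliably corrects errors where models falsely claim there is no audio or resort to uncertainty-driven guessing, but fails to correct flawed reasoning or confident misassertions.

Details

Motivation: Contrastive Decoding (CD) has proven effective at enhancing large audio language models (LALMs), but the mechanisms behind its success and the relative efficacy of different strategies remain unclear, so a systematic evaluation is needed to provide clear guidelines.

Result: Four CD strategies were evaluated across diverse LALM architectures. Audio-Aware Decoding and Audio Contrastive Decoding performed best, but their impact varied with each model's baseline error profile.

Insight: The contribution is the Transition Matrix framework for quantifying shifts in error patterns, which gives a clear guideline for identifying the architectures best suited to CD enhancement from their baseline error profiles. More broadly, it provides an interpretability tool for understanding how decoding strategies work.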

Abstract: While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio-Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty-driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.
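The idea of mapping error pattern shifts can be pictured as a count matrix over (baseline outcome, contrastive-decoding outcome) pairs. The Python sketch below is a minimal illustration under stated assumptions: the category names are paraphrased from the abstract, the per-example labels are made up, and the paper's actual taxonomy and normalization may differ.

```python
from collections import Counter

# Outcome categories, paraphrased from the abstract (names are assumptions).
CATEGORIES = ["correct", "no_audio_claim", "guess", "flawed_reasoning"]

def transition_matrix(baseline, contrastive, categories=CATEGORIES):
    """Count how often each baseline outcome (row) becomes each CD outcome (column)."""
    counts = Counter(zip(baseline, contrastive))
    return [[counts[(src, dst)] for dst in categories] for src in categories]

# Hypothetical per-example outcomes without and with contrastive decoding.
baseline = ["no_audio_claim", "guess", "flawed_reasoning", "correct", "no_audio_claim"]
contrastive = ["correct", "correct", "flawed_reasoning", "correct", "no_audio_claim"]

matrix = transition_matrix(baseline, contrastive)
```

Reading the rows of this toy matrix: one of the two "no_audio_claim" errors and the single "guess" error move to "correct", while the "flawed_reasoning" error stays where it was, which mirrors the qualitative pattern the abstract describes.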

cs.CR [Back]

[125] Robust Provably Secure Image Steganography via Latent Iterative Optimization cs.CR | cs.CVPDF

Yanan Li, Zixuan Wang, Qiyang Xiao, Yanzhen Ren

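TL;DR: This paper proposes a robust and provably secure image steganography framework based on latent-space iterative optimization. The receiver treats the transmitted image as a fixed reference and iteratively refines a latent variable to minimize the reconstruction error, improving message extraction accuracy. Unlike prior methods, the approach preserves provable security while markedly enhancing robustness under various compression and image processing scenarios.

Details

Motivation: Existing provably secure steganography methods lack robustness to processing such as image compression. The goal is a steganographic system that is both secure and robust.

Result: Experiments on benchmark datasets show that the proposed iterative optimization improves robustness against image compression while preserving provable security, and that it can be applied as an independent module to further strengthen robustness in other provably secure steganographic schemes.

Insight: The contribution is to introduce latent-space iterative optimization as an independent module within the steganography framework: with theoretical security preserved, the iterative reconstruction mechanism improves robustness to common image processing, offering a new approach to building reliable, robust, and secure steganographic systems.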

Abstract: We propose a robust and provably secure image steganography framework based on latent-space iterative optimization. Within this framework, the receiver treats the transmitted image as a fixed reference and iteratively refines a latent variable to minimize the reconstruction error, thereby improving message extraction accuracy. Unlike prior methods, our approach preserves the provable security of the embedding while markedly enhancing robustness under various compression and image processing scenarios. On benchmark datasets, the experimental results demonstrate that the proposed iterative optimization not only improves robustness against image compression while preserving provable security, but can also be applied as an independent module to further reinforce robustness in other provably secure steganographic schemes. This highlights the practicality and promise of latent-space optimization for building reliable, robust, and secure steganographic systems.
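The receiver-side refinement described in the abstract can be pictured with a toy problem. In the Python sketch below, a fixed random linear map `G` stands in for the image generator, the received image is a slightly distorted version of `G @ z_true`, and gradient descent refines a latent estimate to reduce the squared reconstruction error. The linear generator, the noise level, the step-size rule, and the name `refine_latent` are all assumptions for illustration; the paper's generator, loss, and optimizer are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(16, 4))        # stand-in linear generator: latent -> image
z_true = rng.normal(size=4)         # latent the sender used (unknown to receiver)
# Received image: generator output plus mild distortion (e.g., from compression).
received = G @ z_true + 0.01 * rng.normal(size=16)

def refine_latent(G, target, steps=200):
    """Gradient descent on 0.5 * ||G z - target||^2 with the target held fixed."""
    # 1 / (largest singular value)^2 is a safe step size for this quadratic loss.
    step_size = 1.0 / np.linalg.norm(G, 2) ** 2
    z = np.zeros(G.shape[1])
    errors = []
    for _ in range(steps):
        residual = G @ z - target
        errors.append(float(residual @ residual))   # squared reconstruction error
        z = z - step_size * (G.T @ residual)        # gradient step on the latent
    return z, errors

z_hat, errors = refine_latent(G, received)
```

The `errors` list records the squared reconstruction error before each update, so it can be inspected to confirm that the refinement is making progress. With a nonlinear generator the same loop would use automatic differentiation instead of the closed-form gradient.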

Table of Contents

cs.CL [Back]

[1] ConFu: Contemplate the Future for Better Speculative Sampling cs.CLPDF

[2] SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation cs.CLPDF

[3] Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning cs.CLPDF

[4] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs cs.CL | cs.CVPDF

[5] DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval cs.CLPDF

[6] Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing cs.CL | cs.AI | cs.LGPDF

[7] SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models cs.CL | eess.ASPDF

[8] TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation cs.CL | cs.AIPDF

[9] Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs cs.CL | cs.AIPDF

[10] ALARM: Audio-Language Alignment for Reasoning Models cs.CLPDF

[11] RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation cs.CL | cs.AIPDF

[12] Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents cs.CLPDF

[13] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs cs.CLPDF

[14] CREATE: Testing LLMs for Associative Creativity cs.CLPDF

cs.CV [Back]

[15] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM cs.CVPDF

[16] VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model cs.CVPDF

[17] Computer Vision-Based Vehicle Allotment System using Perspective Mapping cs.CVPDF

[18] HECTOR: Hybrid Editable Compositional Object References for Video Generation cs.CVPDF

[19] Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures cs.CVPDF

[20] Towards Visual Query Segmentation in the Wild cs.CVPDF

[21] Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift cs.CV | physics.med-phPDF

[22] Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning cs.CV | cs.LGPDF

[23] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering cs.CV | cs.MMPDF

[24] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning cs.CV | cs.AIPDF

[25] PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration cs.CV | cs.AI | cs.CL | cs.DL | cs.IRPDF

[26] BiCLIP: Domain Canonicalization via Structured Geometric Transformation cs.CV | cs.AI | cs.CL | cs.LGPDF

[27] Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation cs.CV | eess.ASPDF

[28] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing cs.CVPDF

[29] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning cs.CVPDF

[30] Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework cs.CVPDF

[31] GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models cs.CV | cs.AI | cs.ROPDF

[32] OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing cs.CVPDF

[33] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation cs.CVPDF

[34] MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration cs.CVPDF

[35] Training-free Motion Factorization for Compositional Video Generation cs.CVPDF

[36] Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations cs.CV | cs.AIPDF

[37] VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs cs.CV | cs.AIPDF

[38] Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities cs.CVPDF

[39] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model cs.CV | cs.AIPDF

[40] Rotation Equivariant Mamba for Vision Tasks cs.CVPDF

[41] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning cs.CV | cs.AI | cs.LGPDF

[42] Point Cloud as a Foreign Language for Multi-modal Large Language Model cs.CVPDF

[43] MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data cs.CV | cs.LGPDF

[44] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy cs.CVPDF

[45] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection cs.CVPDF

[46] Towards Instance Segmentation with Polygon Detection Transformers cs.CVPDF

[47] Multi-model approach for autonomous driving: A comprehensive study on traffic sign-, vehicle- and lane detection and behavioral cloning cs.CV | cs.AIPDF

[48] Multimodal Graph Representation Learning with Dynamic Information Pathways cs.CVPDF

[49] Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos cs.CV | cs.ROPDF

[50] ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph cs.CVPDF

[51] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions cs.CVPDF

[52] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models cs.CV | cs.CLPDF

[53] Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking cs.CVPDF

[54] DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction cs.CV | cs.AIPDF

[55] IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework cs.CVPDF

[56] EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning cs.CV | cs.AI | cs.CLPDF

[57] CLoE: Expert Consistency Learning for Missing Modality Segmentation cs.CV | cs.AI | cs.LGPDF

[58] SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation cs.CV | cs.AIPDF

[59] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models cs.CVPDF

[60] Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments cs.CV | cs.AIPDF

[61] MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification cs.CV | cs.AIPDF

[62] EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation cs.CVPDF

[63] ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts cs.CV | cs.AIPDF

[64] Reviving ConvNeXt for Efficient Convolutional Diffusion Models cs.CV | cs.AI | cs.LGPDF

[65] RiO-DETR: DETR for Real-time Oriented Object Detection cs.CVPDF

[66] CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation cs.CVPDF

[67] Open-World Motion Forecasting cs.CV | cs.AI | cs.ROPDF

[68] EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation cs.CV | cs.AIPDF

[69] TopoOR: A Unified Topological Scene Representation for the Operating Room cs.CVPDF

[70] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks cs.CVPDF

[71] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity cs.CVPDF

[72] Streaming Autoregressive Video Generation via Diagonal Distillation cs.CVPDF

[73] Evolving Prompt Adaptation for Vision-Language Models cs.CV | cs.AIPDF

[74] SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding cs.CVPDF

[75] Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation cs.CV | cs.ROPDF

[76] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning cs.CVPDF

[77] Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization cs.CVPDF