cs.CL [Total: 17]
cs.CV [Total: 76]
cs.SD [Total: 1]
cs.AI [Total: 5]
stat.ML [Total: 1]
cs.IR [Total: 2]
eess.IV [Total: 1]
cs.LG [Total: 4]
cs.RO [Total: 2]

cs.CL

[1] EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL cs.CL | cs.IR

Jaehoon Lee, CheolWon Na, Suyoung Bae, Jin-Seop Lee, Jihyung Lee

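TL;DR: This paper proposes EXPO-SQL, an execution-based clause-level policy optimization framework for Text-to-SQL. It analyzes the execution results of SQL queries (error messages and clause-wise incremental execution) to identify erroneous clauses and assigns each clause a fine-grained reward signal, addressing the weak learning signal of the uniform query-level rewards used by existing reinforcement learning methods.

Details

Motivation: Existing LLM-based reinforcement learning methods for Text-to-SQL assign one uniform query-level reward to every clause in a SQL query. Because they do not distinguish correct clauses from incorrect ones, the learning signal is insufficient, which hurts the generation of correct SQL.

Result: Experiments on widely used Text-to-SQL benchmarks show that EXPO-SQL, through fine-grained clause-level learning, significantly outperforms existing supervised fine-tuning, prompting, and reinforcement-learning-based methods.

Insight: The core innovation is refining the reinforcement learning reward from the query level to the clause level: execution feedback (error messages and clause-wise incremental execution) is used to pinpoint and reward or penalize individual SQL clauses, yielding a more effective learning signal. This offers a finer-grained supervision paradigm for optimizing Text-to-SQL models with execution feedback.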

Abstract: Text-to-SQL enables users to query databases using natural language by generating executable SQL queries. Recent methods have increasingly adopted Large Language Model-based reinforcement learning (RL) to leverage execution feedback for training. However, existing RL methods assign uniform query-level rewards to all clauses in a SQL query, treating correct and incorrect clauses equally. This coarse-grained reward design leads to insufficient learning signals for correct SQL generation. To address this issue, we propose EXPO-SQL (EXecution-based clause-level Policy Optimization for Text-to-SQL), which provides fine-grained supervision through clause-level rewards. To assign clause-level rewards, our method identifies erroneous clauses by analyzing execution results, including error messages and clause-wise incremental execution. Experiments on widely used Text-to-SQL benchmarks demonstrate that EXPO-SQL significantly outperforms existing supervised fine-tuning, prompting, and RL-based methods through fine-grained clause-level learning. Our code is available at https://github.com/jhn25/EXPO-SQL.
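
Illustration (not from the paper): the abstract does not spell out how clause-wise incremental execution assigns rewards, so the following is only a minimal sketch of one plausible reading. It assumes the predicted and gold queries are already split into clause strings, executes growing prefixes of each in SQLite, and penalizes the first clause whose prefix errors out or returns different rows. The function name `clause_rewards`, the reward values, and the "stop at the first divergence" rule are all assumptions made for this example.

```python
import sqlite3


def clause_rewards(conn, pred_clauses, gold_clauses):
    """Assign a reward to each predicted clause by executing query prefixes.

    Assumed scheme (hypothetical, for illustration only):
      +1.0  the prefix ending at this clause runs and matches the gold prefix
      -1.0  the first clause whose prefix fails or returns different rows
       0.0  clauses after the first failure (their correctness is unknown)
    """
    rewards = [0.0] * len(pred_clauses)
    for i in range(len(pred_clauses)):
        pred_sql = " ".join(pred_clauses[: i + 1])
        gold_sql = " ".join(gold_clauses[: i + 1])
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            # The error message appeared once this clause was added,
            # so the error is attributed to this clause.
            rewards[i] = -1.0
            break
        # Rows are compared as ordered lists so an ORDER BY mistake is caught.
        gold_rows = conn.execute(gold_sql).fetchall()
        if pred_rows != gold_rows:
            rewards[i] = -1.0
            break
        rewards[i] = 1.0
    return rewards


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("Ann", 25), ("Bob", 35), ("Cy", 40)],
    )
    gold = ["SELECT name FROM users", "WHERE age > 30", "ORDER BY name"]
    pred = ["SELECT name FROM users", "WHERE age > 20", "ORDER BY name"]
    print(clause_rewards(conn, pred, gold))
```

In this toy case only the WHERE clause is penalized, whereas a query-level reward would penalize the correct SELECT clause as well.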

[2] Quantifying Prior Dominance in RAG Systems cs.CL | cs.AI

Barak Or

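TL;DR: To address "epistemic blindness" in the evaluation of retrieval-augmented generation (RAG) systems, this paper proposes Normalized Context Utilization (NCU), a continuous metric that strictly measures how much information a model gains from external context. The study finds that on strict factual extraction tasks, small language models (SLMs) match or even outperform large models, and that model scale and proprietary alignment intensify "prior dominance", causing a commercial API to frequently ignore external evidence in adversarial conflicts and to suffer systemic confidence collapse.

Details

Motivation: Current RAG evaluation relies on discrete heuristics that suffer from "epistemic blindness": they cannot tell whether a model genuinely used the retrieved context or merely relied on its parametric memory. The paper aims to solve this by strictly quantifying the information gain a model obtains from context.

Result: In an evaluation covering models from 1.5B to 72B parameters plus one commercial API, traditional scaling laws show extreme diminishing returns on strict factual extraction (without chain-of-thought reasoning), and efficient SLMs match or outperform high-capacity architectures. The commercial API overrode explicit external evidence in nearly half of the adversarial conflicts and frequently suffered systemic confidence collapse (negative transfer) when its parametric priors were contradicted.

Insight: The novelty is NCU, a continuous metric that overcomes the limits of discrete evaluation. Viewed objectively, the core finding is that in strict extraction workflows SLMs have a structural epistemic advantage and better contextual adherence, which has important implications for architecture choices and evaluation paradigms in RAG systems.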

Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from "epistemic blindness", failing to distinguish genuine contextual information extraction from parametric memory recall. To address this, we introduce the Normalized Context Utilization (NCU) metric, leveraging continuous token log-probabilities across zero-shot, oracle, and adversarial conditions to strictly quantify contextual information gain. Evaluating architectures ranging from 1.5B to 72B parameters alongside a proprietary commercial API reveals that for strict factual extraction (without Chain-of-Thought reasoning), traditional scaling laws exhibit extreme diminishing returns: highly efficient Small Language Models (SLMs) match or outperform high-capacity architectures. Furthermore, we demonstrate that "Prior Dominance" correlates with model scale and proprietary alignments. The evaluated commercial API not only overrode explicit external evidence in nearly half of adversarial conflicts, but also frequently suffered from systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted. Our findings highlight the structural epistemic advantage and superior contextual adherence of SLMs in strict extraction workflows.

[3] Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification cs.CL | cs.CV | cs.IR

Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma

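TL;DR: For knowledge-based visual question answering, this paper proposes "Ground Then Rank", a training-free entity identification framework. It decouples entity identification from evidence ranking: a multimodal large language model (MLLM) first selects high-confidence entities from a set of candidates, and a textual re-ranker then selects the evidence.

Details

Motivation: Existing multimodal retrieval-augmented generation methods tightly couple entity discrimination with evidence ranking, which leads to high cost and limited generalization. The authors argue that entity-level and fact-level grounding are the key bottlenecks, and observe that MLLMs identify entities more accurately when choosing from a candidate set.

Result: On the Encyclopedic-VQA and InfoSeek benchmarks, the method consistently outperforms fine-tuned multimodal re-ranking baselines while reducing training and inference complexity.

Insight: The novelty is a simple, training-free framework that decouples entity identification from evidence ranking. Its gains come not only from better entity identification but also from selecting more informative evidence once the correct entity is fixed.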

Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multi-modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks requiring grounding at both the fine-grained entity level and the evidence level. Most existing multi-modal retrieval augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue that both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer (IBA) framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once the correct entity is fixed. Our implementation is made public to ease reproducibility.

[4] ModTGCN: Modularity-aware Graph Neural Networks for Text Classification cs.CL

Rajarshi Misra, Aditya Sharma, Vinti Agarwal, Hari Om Aggrawal

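TL;DR: This paper proposes ModTGCN, a modularity-aware graph neural network for text classification. It jointly optimizes cross-entropy loss and a modularity-based auxiliary objective, promoting class-coherent document communities while preserving discriminative representations. To improve scalability, it decouples the original heterogeneous TextGCN graph into separate document-word and word-word components, achieving 2x-10x faster training.

Details

Motivation: Existing graph-based text classification models usually rely on local neighborhood aggregation and overlook global community structure, even though semantic document graphs exhibit strong class-consistent clustering. Ignoring this structure can blur class boundaries and cause over-smoothing.

Result: Experiments on five benchmark datasets show consistent gains, with larger improvements on complex, low-homophily datasets such as Ohsumed and 20NG.

Insight: The core innovation is introducing modularity as an auxiliary optimization objective in GNN training to explicitly model and exploit the global community structure of the document graph. In addition, decoupling the heterogeneous graph into two homogeneous components substantially improves scalability.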

Abstract: Graph-based text classification models typically rely on local neighborhood aggregation and overlook global community structure, despite semantic document graphs exhibiting strong class-consistent clustering. Ignoring this can blur class boundaries and lead to over-smoothing. We propose ModTGCN, a modularity-aware graph neural network for text classification that jointly optimizes cross-entropy and a modularity-based auxiliary objective to promote class-coherent document communities while preserving discriminative representations. The modularity term is computed on a document-document similarity graph derived from transformer embeddings (pretrained or fine-tuned). To improve scalability, we decouple the original heterogeneous TextGCN graph into separate document-word and word-word components, achieving 2x-10x faster training. We further study graph construction strategies, label-aware edge reweighting, and supervision choices for modularity optimization. Experiments on five benchmarks show consistent gains, with larger improvements on complex, low homophily datasets such as Ohsumed and 20NG.
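
Illustration (not from the paper): the abstract does not give the exact form of the modularity term, so the sketch below only computes the standard Newman modularity of a hard partition, Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * [c_i = c_j], on a small undirected graph. A trainable auxiliary objective would presumably replace the hard labels with soft class assignments, which is not shown here. The function name `modularity` and the toy graph are assumptions made for this example.

```python
def modularity(adj, communities):
    """Newman modularity Q of a hard partition of an undirected graph.

    adj: symmetric adjacency matrix as a list of lists (adj[i][j] = edge weight).
    communities: communities[i] is the community label of node i.
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # sum of degrees equals 2m for an undirected graph
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # Observed weight minus the weight expected under a
                # degree-preserving random graph.
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy document graph: two triangles (nodes 0-2 and 3-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0

by_triangle = modularity(adj, [0, 0, 0, 1, 1, 1])
single_group = modularity(adj, [0, 0, 0, 0, 0, 0])
```

Here the partition that follows the two triangles scores 5/14 (about 0.357), while placing every node in one community scores 0, which is why maximizing modularity favors labelings that match dense clusters.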

[5] MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models cs.CL

Ding Jinru, Jiang Chuchu, Lu Lu, Pang Wenrao, Bian Mouxiao

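TL;DR: MedBench v5 is a dynamic, process-oriented benchmark with hallucination detection, designed for clinical multimodal models. Through a dual-dimensional framework combining Clinical Cognitive Responsiveness and Medical Atomic Skills, together with switchable information-flow stressors and a dynamic process audit protocol, it aims to move from static QA evaluation to dynamic evaluation of model reasoning processes and hallucination propagation.

Details

Motivation: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection, so they cannot fully assess how clinical multimodal models actually perform in dynamic, complex medical scenarios.

Result: Experiments on multiple frontier models show that strong overall task performance does not guarantee process stability: stressors mainly disrupt contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can remain superficially stable.

Insight: The novelty is a unified evaluation framework that integrates process auditing, controllable stress testing, and hallucination trajectory analysis. Its dual-dimensional evaluation, information-flow stressors, and dynamic process audit protocol provide systematic tools for dissecting model failure modes in clinical reasoning; in particular, monitoring the propagation of "silent hallucination" is something existing benchmarks lack.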

Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language, and agent systems) that moves from static QA to dynamic, process-oriented evaluation. MedBench v5 features: (1) a dual-dimensional framework combining Clinical Cognitive Responsiveness (14 sub-dimensions) and Medical Atomic Skills (4 agent environments), covering 63 tasks; (2) three switchable information-flow stressors (omission, contradiction, evidence delay) for factorized degradation analysis; (3) a dynamic process audit protocol with five reasoning nodes that produces model-specific failure fingerprints; (4) hallucination propagation monitoring across initiation, propagation, anchoring, and contradiction interaction, capturing silent hallucination. Experiments on frontier models show that strong overall task performance does not guarantee process stability: stressors mainly disrupt contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can remain superficially stable. MedBench v5 provides a unified infrastructure for capability profiling, controllable stress testing, process auditing, and hallucination trajectory analysis in clinical AI evaluation.

[6] CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression cs.CL | cs.AI | cs.LG

Morayo Danielle Adeyemi, Ryan A. Rossi, Franck Dernoncourt

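TL;DR: The paper proposes Cavewoman, a two-channel evaluation protocol for assessing how large language models behave under linguistic input and output compression. It finds that output compression effectively lowers inference cost for both API and open-weight models, whereas input compression raises net cost and degrades accuracy, while also making the surface text of the generations diverge semantically from the unconstrained reference.

Details

Motivation: To determine the actual effect of the widely promoted "simplified language" compression strategies for inference cost optimization (for example, shortening prompts or dropping grammar), and to examine how input and output compression differently affect accuracy, cost, and generation consistency.

Result: Eight models were evaluated on five datasets at five compression levels. Output compression cut realized cost on most API models by 1.4-2.4x (up to 3x in the best case), and cost fell for all open-weight models under public-tier pricing. Input compression raised cost by about 1.15x on average (up to 1.8x on the worst dataset and 2.7x under stronger compression) while accuracy collapsed. On the non-reasoning models, roughly half of all generations are correct yet have surface text that no longer entails the model's own unconstrained baseline generation.

Insight: The novelty is a systematic two-channel evaluation framework (Cavewoman) that quantifies the cost-accuracy trade-off of compression strategies. The key finding is that input and output compression have opposite economic effects and that compression causes generated text to drift semantically, which challenges the common assumption that "simpler prompts save money" and offers practical guidance for prompt engineering and cost optimization in deployment.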

Abstract: “Talk short. Drop grammar. Save token.” This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel (the user’s prompt or the model’s response) is being compressed. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and reference-text agreement against the model’s unconstrained reference. We evaluate eight models on five datasets at five reduction levels, with both channels measured on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect, a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct yet their surface text no longer entails the model’s own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at https://github.com/danielle34/cavewoman.
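
Illustration (not from the paper): a small arithmetic sketch of why compressing the prompt can raise cost when the model compensates with a longer response. It assumes the usual API billing rule, cost = input tokens x input price + output tokens x output price, with output tokens priced higher. The prices and token counts are hypothetical, not taken from the paper.

```python
def realized_cost(input_tokens, output_tokens, price_in, price_out):
    """Cost of one request in dollars, with prices quoted per million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical prices in dollars per million tokens.
PRICE_IN, PRICE_OUT = 1.0, 4.0

# Uncompressed request: 400 prompt tokens, 200 response tokens.
baseline = realized_cost(400, 200, PRICE_IN, PRICE_OUT)

# Input compression: prompt halved, but the response grows to 300 tokens.
compressed_input = realized_cost(200, 300, PRICE_IN, PRICE_OUT)

# Output compression: prompt unchanged, response halved.
compressed_output = realized_cost(400, 100, PRICE_IN, PRICE_OUT)

print(f"input compression:  {compressed_input / baseline:.2f}x baseline cost")
print(f"output compression: {baseline / compressed_output:.2f}x cheaper")
```

Under these made-up numbers, halving the prompt still makes the request more expensive because the extra response tokens are billed at the higher output rate, while halving the response lowers cost directly.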

Yijing Chen, Wenhui Tan, Xiaoyi Yu, Yuyue Wang, Xin Cheng

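TL;DR: This paper proposes the AVOC framework to address the limited context windows and information redundancy that multimodal large language models face in long-form audio-video understanding. The framework introduces a learnable token compression module between the modality encoders and the LLM backbone, reframes multimodal token compression as a top-K retrieval problem, and selects the most informative token subset according to three information retrieval criteria: relevance, importance, and diversity.

Details

Motivation: Existing multimodal large language models have made remarkable progress in short-form audio-video understanding, but on long-form content their comprehension is limited by small context windows and severe information redundancy.

Result: Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively. AVOC also maintains robust performance on the Audio-Video Needle-in-a-Haystack task at durations up to one hour.

Insight: The core innovation is redefining multimodal token compression as a top-K retrieval problem based on information retrieval criteria (relevance, importance, diversity) and designing a unified retrieval-style compression pipeline. This offers a novel and efficient compression strategy for long multimodal sequences that preserves key information while reducing redundancy.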

Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To address these bottlenecks, we propose AVOC, a framework for long-form audio-video understanding in Omni-modal Large Language Models. AVOC introduces a learnable token compression module between the modality encoders and the LLM backbone. We reframe multimodal token compression as a top-$K$ retrieval problem: given a fixed context budget, the module must retrieve a compact subset of tokens that best supports answering the user query. We draw inspiration from three classical Information Retrieval criteria for selecting informative units from a large candidate pool: relevance, importance, and diversity. AVOC instantiates each criterion as a tailored mechanism for audio-video understanding, and integrates them into a unified retrieval-style compression pipeline. Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively. Moreover, AVOC maintains robust performance on the Audio-Video Needle-in-a-Haystack task at durations up to one hour.
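
Illustration (not from the paper): AVOC's compression module is learnable and its mechanisms are not described in the abstract, so the sketch below swaps in a simple hand-written greedy heuristic in the spirit of maximal marginal relevance. It only shows how relevance, importance, and diversity can be combined when picking K items under a budget. The function names, the weights `alpha`, `beta`, `gamma`, and the precomputed `importance` scores are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tokens(tokens, query, importance, k, alpha=1.0, beta=0.5, gamma=1.0):
    """Greedily pick up to k token indices.

    score = alpha * relevance to the query
          + beta  * importance of the token (given, query-independent)
          - gamma * redundancy with tokens already selected
    """
    selected = []
    candidates = set(range(len(tokens)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(tokens[i], query)
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return alpha * relevance + beta * importance[i] - gamma * redundancy

        # Iterating in index order makes ties resolve to the lowest index.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Tokens 0 and 1 are duplicates; token 2 is less relevant but adds new content.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.5]
importance = [0.0, 0.0, 0.0]

with_diversity = select_tokens(tokens, query, importance, k=2, gamma=1.0)
without_diversity = select_tokens(tokens, query, importance, k=2, gamma=0.0)
```

With the diversity penalty on, the second pick skips the duplicate token and takes the distinct one; with it off, the budget is spent on two copies of the same content.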

[8] CALIBER: Calibrating Confidence Before and After Reasoning in Language Models cs.CL | cs.AI

Conor Finlay, Joshua Kurien, Saurabh Dash, Marzieh Fadaee, Beyza Ermis

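TL;DR: CALIBER is a method for calibrating language model confidence. It estimates the model's likelihood of success separately before and after reasoning, and optimizes each estimate with a supervision target matched to its information state.

Details

Motivation: Existing methods usually elicit confidence only once, either before or after reasoning. The authors argue that confidence in reasoning models is state-dependent, so the supervision targets must differ: before reasoning, confidence estimates the probability of solving the prompt; after reasoning, it predicts whether the answer is correct.

Result: On BigMathDigits, CALIBER reduces the expected calibration error (ECE) of the 7B model by 52.5% relative to the strongest single-confidence baseline while achieving the best Brier score and AUROC and staying within 2.1 points of the best accuracy. On the 30B model, it achieves the best ECE on BigMathDigits and remains competitive on out-of-distribution benchmarks such as GPQA and TriviaQA.

Insight: The novelty is a state-dependent confidence calibration framework that aligns pre-reasoning and post-reasoning confidence estimates with different supervision targets (prompt-level success versus answer-level correctness). This is especially effective under distribution shift, where it markedly reduces calibration error.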

Abstract: Reasoning language models are increasingly asked not only to answer difficult questions, but also to estimate their likelihood of success. Existing methods typically elicit confidence only once: either before thinking or after answering. We argue that confidence in reasoning models is state-dependent: before thinking, confidence should estimate the chance of the model correctly solving the prompt, while after thinking it should predict whether the realized answer is likely to be correct. This distinction determines the appropriate supervision target: prompt-level success should supervise confidence estimates made after seeing the prompt, while individual answer-level correctness should supervise confidence estimates made after answering. We introduce CALIBER (Calibration Before and After Reasoning), which elicits both estimates and supervises each with the target matched to its information state. Under this unified protocol, CALIBER reduces Expected Calibration Error (ECE) by 52.5% over the strongest single-confidence baseline on BigMathDigits for the 7B model, while achieving the best Brier score and AUROC, and remains within 2.1 points of the best accuracy. Further, on a larger 30B model, CALIBER achieves the best ECE on BigMathDigits while remaining competitive in Brier score and AUROC. Out of distribution, it achieves the best ECE and Brier score on GPQA and TriviaQA, and remains competitive on SimpleQA. Ablations further show that this position-target alignment is most beneficial under distribution shift where it consistently reduces calibration error across all out-of-distribution benchmarks.
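
Illustration (not from the paper): the abstract reports Expected Calibration Error and Brier score without defining them, so the sketch below shows the commonly used definitions on a toy set of answer-level confidences. It assumes equal-width bins and binary correctness labels; the bin count and the binning convention are choices made for this example and may differ from the paper's evaluation setup.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean gap between confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 outcome."""
    return sum((c - h) ** 2 for c, h in zip(confidences, correct)) / len(confidences)


# Toy post-reasoning confidences and whether each realized answer was correct.
confidences = [0.95, 0.95, 0.95, 0.95, 0.15]
correct = [1, 1, 1, 0, 0]

ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

In this toy set the model states 95% confidence on four answers but gets only three right, which accounts for most of the calibration gap. Under the paper's framing, the same metrics would be computed separately for the pre-reasoning estimate (against prompt-level success) and the post-reasoning estimate (against answer-level correctness).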

[9] AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning cs.CL

Honglin Guo, Qi Zhang, Yu Zhang, Weijie Li, Rui Zheng

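TL;DR: This paper presents AGORA, a benchmark for evaluating how well language models perform agentic reasoning over archives of real workplace documents. It pairs 362 questions with 9,664 authentic documents across 8 domains (372M tokens in total), far exceeding any model's context window, so agents must explore deliberately rather than simply scan.

Details

Motivation: Existing benchmarks do not fully cover archive-grounded reasoning: locating sparse evidence in a large, messy collection of workplace documents, reconciling inconsistent terminology, units, and time conventions, and computing an answer.

Result: Among the eight models evaluated, even the strongest reaches only 59.4% accuracy, with notable variation across domains, indicating that the task is far from solved.

Insight: The novelty is a benchmark that jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage, built with an agentic pipeline that combines cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering.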

Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model’s context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.

[10] Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams cs.CL | cs.AI

Arda Eren, Micheal Cheung, Youqian Zhang, Grace Ngai, Eugene Yujun Fu

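TL;DR: This paper studies the feasibility of using large language models to detect Turkish scam phone calls and releases the first multi-modal Turkish scam-call dataset, containing 100 aligned audio-transcript pairs. It evaluates seven LLMs from three families (Gemini, GPT-4o, and Qwen) under three input conditions: raw audio, automatic speech-to-text transcripts, and human-corrected transcripts. The results suggest that transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably.

Details

Motivation: Scam phone calls exploit vulnerable communities worldwide, yet detection research has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially difficult because annotated data is scarce and technological defenses are limited.

Result: On the newly introduced Turkish scam-call dataset, transcript-based LLM inputs outperform direct audio processing. Human-corrected and uncorrected transcripts achieve similar detection performance.

Insight: The main contribution is building and releasing the first public multi-modal dataset of Turkish scam calls, filling a gap in safety research for low-resource languages. Viewed objectively, the core insight is that in a low-resource language setting, text-based LLM detection is more effective than processing audio directly even when the transcripts contain errors. This suggests a direction for more robust multi-modal fraud prevention systems and underscores the need for AI safety research that is more culturally and linguistically inclusive.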

Abstract: Scam phone calls exploit vulnerable communities worldwide, yet research on detection has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially difficult, as annotated data is scarce and technological defenses remain limited. This research investigates how large language models (LLMs) can support scam detection in Turkish by introducing the first public multi-modal dataset of 100 aligned audio-transcript pairs of scam and benign conversations. We evaluate seven LLMs spanning three model families: Gemini 2.5 (Flash, Flash-Lite, Pro), GPT-4o, and Qwen (Max, Plus, Turbo), under three input conditions: raw audio, automatic speech-to-text transcripts, and transcripts refined by a native speaker. Our results suggest that transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. By centering a low-resource language and a real-world threat, this work highlights the urgent need for culturally and linguistically inclusive AI safety research and more robust multi-modal systems for fraud prevention.

Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych

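TL;DR: This paper addresses the widespread methodological fragmentation in evaluating the social biases of large language models by proposing a unified, controllable framework that standardizes heterogeneous benchmarks. It finds a massive, systematic paradigm gap between isolated assessments and forced-choice comparative settings: comparative settings act as catalysts for latent discrimination, chain-of-thought reasoning exacerbates this bias, and the bias grows with model size.

Details

Motivation: To resolve the widespread methodological fragmentation in the literature on LLM social bias evaluation, which yields contradictory conclusions and stems from ignoring the structural framing of benchmark-level evaluations.

Result: Evaluation across multiple model families reveals a massive, systematic paradigm gap: comparative settings act as catalysts for latent discrimination, chain-of-thought reasoning exacerbates social bias under comparative settings, and this bias is deterministic and grows with model size.

Insight: The novelty is a unified framework that standardizes heterogeneous benchmarks and systematically contrasts isolated assessments with comparative settings. The key finding is that comparative settings activate latent bias, which yields an important methodological guideline: researchers must use comparative settings to audit hidden biases, but practitioners cannot safely rely on comparative deployments in ambiguous real-world tasks.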

Abstract: As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social biases is paramount. However, the current literature suffers from widespread methodological fragmentation, which yields contradictory conclusions. This stems largely from ignoring the structural framing of benchmark-level evaluations. To resolve this, we introduce a unified and controllable framework that standardizes heterogeneous benchmarks to systematically contrast isolated demographic assessments with forced-choice comparative settings. Crucially, this allows us to disentangle the confounding effects of Chain-of-Thought reasoning, neutral fallback options, and other structural artifacts in social bias evaluations. Our evaluation across multiple model families reveals a massive, systematic paradigm gap: while isolated assessments limit prejudice activation, comparative settings act as aggressive catalysts for latent discrimination, a shift primarily driven by underspecified contexts. Alarmingly, CoT reasoning exacerbates social biases under comparative settings, and this systemic bias persists as a deterministic prejudice even when models are provided neutral fallback options or claim to answer randomly. Finally, we demonstrate that this comparative prejudice is a generalized phenomenon that scales positively with model size. Ultimately, we offer a crucial methodological guideline: while researchers must leverage comparative settings to robustly audit hidden biases, practitioners cannot safely rely on comparative deployments in ambiguous real-world tasks.

[12] Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment cs.CL | cs.ET

Guruprakash J, Krithika L. B

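TL;DR: This review surveys the architectures, applications, and critical assessment of Transformer-based language models across domain verticals. At the level of mechanism, it classifies the Transformer families into encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants, and discusses major post-2023 developments such as instruction tuning, reinforcement learning from human feedback, mixture-of-experts scaling, and retrieval augmentation. At the level of application, it surveys deployments in healthcare, finance, legal, education, customer service, creative writing, and science, and builds on this a critical assessment covering architecture comparison, the trade-off between parameter count and energy cost, alignment methods, and benchmark saturation.

Details

Motivation: Transformer language models are released so quickly that substantive progress is hard to tell apart from incremental improvement. This paper aims to give practitioners a systematic survey that explains the mechanisms of different architectures and their suitability for vertical domains.

Result: The paper reports no specific quantitative experiments. Through its survey analysis, it compares architectures on four axes that matter to deployment decisions, quantifies the trade-off between parameter count and energy cost, and discusses how alignment methods, data provenance, and benchmark saturation affect what "state of the art" means.

Insight: The novelty is a comprehensive taxonomy of Transformer models that links architectural mechanisms directly to the needs of vertical-domain applications. It offers assessment axes grounded in practical deployment concerns, such as energy efficiency and the real-world impact of alignment methods, giving practical guidance for model selection and research directions.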

Abstract: Transformer-based language models have become the default substrate for natural language processing, and the pace of new releases has made it hard for practitioners to separate durable ideas from the noise of incremental announcements. This review works at two levels. At the level of mechanism, we organise the main transformer families into a working taxonomy, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. We then extend the discussion to post-2023 developments that changed the picture in practice: instruction tuning, reinforcement learning from human feedback, direct preference optimisation, mixture-of-experts scaling, retrieval augmentation and the current flagship model families from OpenAI, Anthropic, Google, Meta, Mistral and DeepSeek. At the level of use, we survey deployments across healthcare, finance, legal, education, customer service, creative writing and scientific work. Based on this, we link each to the specific capabilities that make a transformer the appropriate tool. The contribution of this paper is a critical assessment that is based on the survey. We compare architectures on four axes that matter to deployment decisions, and we quantify the trade-off between parameter count and energy cost. We also discuss how alignment methods, data provenance and benchmark saturation change what it means to call a model "state of the art". The final section lists the research questions that we think deserve more attention.

[13] Qwen-AgentWorld: Language World Models for General Agents cs.CL

Yuxin Zuo, Zikai Xiao, Li Sheng, Fei Huang, Jianhong Tu

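TL;DR: This paper presents Qwen-AgentWorld, a language-model-based world model that simulates the dynamics of agentic environments across multiple domains. It is built with a three-stage training pipeline and evaluated with the newly introduced AgentWorldBench. The study shows that the model not only supports scalable reinforcement learning as a decoupled environment simulator, but also serves as a unified agent foundation model whose world-modeling pre-training improves downstream task performance.

Details

Motivation: The work explores how world modeling based on language models can further advance general agents, targeting the core cognitive mechanism agents need to reason and plan in complex, multi-domain environments.

Result: On AgentWorldBench, which covers 7 domains, Qwen-AgentWorld significantly outperforms existing frontier models. As an environment simulator, it enables scalable simulated training whose gains surpass training on real environments alone; as a foundation model, its pre-training improves downstream performance on 7 agentic benchmarks.

Insight: The novelty is building the first language world models able to simulate multi-domain agentic environments through long chain-of-thought reasoning, along with a three-stage training pipeline combining CPT, SFT, and RL. Viewed objectively, two ideas are worth borrowing: using world modeling as an effective warm-up for general agent pre-training, and the dual use of the model as a decoupled simulator and as a unified foundation model.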

Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: https://github.com/QwenLM/Qwen-AgentWorld

[14] DREAM: Dense Retrieval Embeddings via Autoregressive Modeling cs.CL

Yixuan Tang, Yi Yang

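TL;DR: This paper proposes DREAM, which supervises the training of dense retrieval models with the autoregressive next-token prediction objective of a large language model (LLM). It injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM, so that the LLM's prediction loss supplies training gradients to the retriever through the attention mechanism.

Details

Motivation: Dense retrievers are trained with contrastive objectives that require large numbers of labeled positive and negative document pairs, which are costly and hard to obtain.

Result: Evaluated on the BEIR and RTEB retrieval benchmarks with embedding backbones from 0.5B to 3B parameters, DREAM consistently outperforms existing baselines across model scales.

Insight: The novelty is using the autoregressive prediction loss of a frozen LLM as the supervision signal for training a separate dense retriever, with gradients back-propagated by injecting retriever scores into the LLM's attention mechanism. This offers a new direction for unsupervised or weakly supervised retrieval training.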

Abstract: Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative document pairs that are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can provide supervision for dense retrieval. The intuition is simple: if a document contains information relevant to a query, conditioning on that document should make the target output easier for the LLM to predict. A key challenge is that the next-token prediction loss is computed inside the LLM, while the retriever is a separate embedding model. To address this challenge, we propose DREAM (Dense Retrieval Embeddings via Autoregressive Modeling), which injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM. During training, these scores determine how much attention each candidate document receives while the LLM predicts the target output. The resulting prediction loss provides gradients for retriever training through the attention mechanism. We evaluate DREAM on retrieval benchmarks BEIR and RTEB using embedding backbones ranging from 0.5B to 3B parameters. DREAM consistently outperforms existing baselines across different model scales. These results demonstrate that DREAM provides a promising approach for training dense retrievers through autoregressive modeling.

[15] Task Decomposition for Efficient Annotation cs.CL | cs.AI | cs.HC

Nupoor Gandhi, Emma Strubell

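TL;DR: This paper proposes decomposing complex structured annotation tasks into sub-tasks to reduce the aggregate inferential load of annotation projects. Drawing on centering theory, the authors introduce a formal model of inferential load based on the degrees of freedom in the space of valid annotations, and show that identifying centers constrains the complexity of the output space. The paper also provides guidelines for task decomposition and a procedure for allocating sub-tasks under a fixed budget to improve annotation cost-efficiency.

Details

Motivation: High-quality structured annotation over large corpora is expensive. Traditional end-to-end annotation places a heavy inferential load on annotators, and modern annotation projects involve heterogeneous annotators (models and humans), so task allocation needs to be redesigned for efficiency.

Result: The theoretical model shows that task decomposition reduces inferential load, and examples from the authors' prior work demonstrate improved cost-efficiency. No specific benchmarks or quantitative comparisons are mentioned.

Insight: The novelty is applying centering theory to annotation task decomposition, formally modeling inferential load, and proposing a task allocation framework for heterogeneous annotators, which provides actionable design principles for complex annotation projects.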

Abstract: High-quality annotations of structured representations are expensive to collect over large corpora. Manual annotation of structure is laborious, and model-based annotation, although cheaper to generate, requires expensive validation and potentially significant supervision to ensure that the annotation quality is strong enough to be useful downstream. In traditional annotation workflows, annotation of each complete example is performed end-to-end by a single annotator. However, structured annotation is complex, and each aspect of the task represents a unique challenge with an associated inferential load for a given annotator. Modern annotation projects can incorporate heterogeneous groups of annotators, including both models and human annotators with varying domain and linguistic expertise. It remains unclear, however, how to redesign annotation tasks in this setting, where efforts are discriminately allocated across heterogeneous annotators with respect to distinct annotation challenges. We propose to decompose annotation tasks into sub-tasks in order to reduce the aggregate inferential load of annotation projects. Inspired by the notion of centers from centering theory, we introduce a formal model of inferential load based on the degrees of freedom in the space of valid annotations. Using this model, we show that identifying these centers (i.e. salient anchor entities realized by annotation sub-tasks) constrains the output space complexity, and decompositions which isolate and advance center identification reduce the aggregate inferential load. We provide guidelines for decomposing complex structured annotation tasks, supported by examples demonstrating improved cost-efficiency from our prior work. Finally, we present a procedure for allocating sub-tasks across annotators to maximize quality under a fixed budget.

[16] Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce cs.CL | cs.AI

Filippos Ventirozos, Matthew Shardlow

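TL;DR: This paper proposes a micro-transaction market architecture for agentic e-commerce, aimed at the information-acquisition bottleneck that autonomous shopping agents face. The market lets buyer agents progressively unlock verified product information (such as service histories and third-party test reports) through micro-payments, and uses reputational scoring to keep the data trustworthy. The paper argues that this model promotes genuine competition on product quality better than ranking-based e-commerce platforms, and translates the vision into a set of concrete NLP research problems.

Details

Motivation: Traditional e-commerce chatbots serve only as recommenders or sales tools. With the arrival of micro-payment rails (such as x402 and AP2), autonomous shopping agents can investigate products exhaustively, so the scarce resource shifts from product matching to obtaining trustworthy, decision-relevant product information.

Result: The abstract reports no quantitative experiments or benchmarks. Instead, the paper presents a conceptual architecture and argues that this market design yields truer competition and rewards genuine product quality better than ranking-based storefronts.

Insight: The novelty is shifting the e-commerce paradigm from chat-based recommendation to a progressive, micro-payment-based market for trustworthy information. The paper identifies cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling as key NLP problems that deserve research attention more than conversational fluency does.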

Abstract: Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhaustively, the bottleneck is no longer matching products but acquiring trustworthy, decision-relevant information about them. We envision agentic e-commerce as a micro-transaction market for verified information: buyer agents spend fractions of a cent to progressively unlock seller- and reviewer-supplied data – service histories, third-party test reports, bills of materials, audited sales and support metrics – paid for a la carte under a freemium model, with reviewer trust scored reputationally. We sketch the architecture of such a market and argue that it rewards genuine product quality and yields truer competition than ranking-based storefronts. We then translate the vision into concrete NLP problems – cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling – and argue that these, not chat fluency, deserve the field’s attention.

[17] SHERLOC: Structured Diagnostic Localization for Code Repair Agents cs.CLPDF

Hovhannes Tamoyan, Sean Narenthiran, Erik Arakelyan, Mira Mezini, Boris Ginsburg

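TL;DR: The paper proposes SHERLOC, a framework that improves the fault-localization ability of LLM agents in code repair tasks. The framework is training-free and combines a reasoning LLM, compact repository tools, and a self-recovery mechanism to produce structured diagnostic localization of faults in a codebase.

Details

Motivation: When solving repository-level coding tasks, existing LLM agents spend a large share of their budget on fault localization, while existing localization frameworks provide only file retrieval rather than actionable diagnosis and lack the diagnostic context a repair agent needs.

Result: SHERLOC reaches 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified, which is state of the art; at roughly 30B parameters it matches or outperforms other agentic methods. Injecting its locations and diagnostic findings into repair agents raises the resolve rate on SWE-Bench Verified by 5.95 percentage points on average while reducing both localization and total token consumption.

Insight: The novelty is a training-free, structured, hypothesis-driven exploration and reasoning framework for localization that pairs diagnostic context with locations, giving repair agents actionable information. Its design avoids fine-tuning and multi-agent orchestration and gains efficiency through compact tools and self-recovery.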

Abstract: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.

cs.CV [Back]

[18] Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation cs.CV | cs.AI | cs.LGPDF

Yitong Li, Junsong Chen, Haopeng Li, Haozhe Liu, Jincheng Yu

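TL;DR: This paper presents Sol Video Inference Engine, an agent-native full-stack acceleration framework for video diffusion models. The framework organizes five techniques (cache, sparse attention, token pruning, quantization, and kernel fusion) into an agentic acceleration stack, and uses a workflow of parallel skill agents, an agent integrator, and a human validator to perform instance-specific optimization for a given model, hardware platform, and serving configuration.

Details

Motivation: Modern video diffusion models improve generation quality through scaling, which also raises inference cost. The central challenge for existing acceleration methods is that the most effective strategy depends heavily on the specific instance (the combination of model, hardware, and inference configuration). Instances differ widely in architecture, numerical sensitivity, attention patterns, and other factors, which makes manual performance tuning costly.

Result: Instantiated on three video models of different sizes and architectures (64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video), the workflow achieves more than 2x end-to-end acceleration with little human effort while maintaining near-lossless generation quality on the VBench benchmark.

Insight: The novelty is an agent-driven full-stack acceleration framework that organizes several broadly applicable acceleration techniques into a composable optimization stack and uses a hybrid workflow of agent collaboration and human feedback to automate instance-specific performance optimization, avoiding the tedium and inefficiency of manual tuning.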

Abstract: Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective acceleration strategy is highly instance-specific: a recipe that works well for one combination of model, hardware, and inference configuration often does not transfer to another. Different models vary in architecture, numerical sensitivity, and attention concentration patterns. Inference settings differ in spatial and temporal resolution and video duration, while hardware platforms differ in memory hierarchy, supported numerical formats, and kernel throughput. These factors create a large tuning space, making manual performance engineering costly. We present Sol Video Inference Engine, an agentic, native, training-free acceleration framework for video diffusion models. It organizes five broadly applicable techniques, cache, sparse attention, token pruning, quantization, and kernel fusion, into an agentic acceleration stack for instance-specific optimization. For a concrete deployment target defined by a model, hardware platform, and serving configuration, parallel skill agents optimize the implementation of each technique, an agent integrator composes them into a global acceleration stack, and a human validator provides feedback on generation quality. We instantiate this workflow on three video models with different sizes and architectures: 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. With little human effort, the full stack achieves more than 2x end-to-end acceleration while maintaining near-lossless VBench quality, demonstrating the effectiveness of the agent framework for video diffusion acceleration.

[19] A Geometry-Informed Computer Vision Method for Detecting and Examining Overtaking Vehicles From A Bicycle cs.CV | cs.HCPDF

Gandhimathi Padmanaban, Rayane Moustafa, Fred Feng

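TL;DR: This paper presents a geometry-informed computer vision method for automatically detecting and analyzing overtaking events from a bicycle's viewpoint. The method combines RT-DETR object detection with ByteTrack multi-object tracking and uses a geometric validation module to keep detections accurate, without multi-sensor configurations or explicit camera calibration.

Details

Motivation: Existing studies rely on manual, frame-by-frame annotation of overtaking events in rear-facing bicycle video, which limits sample sizes and the efficiency of naturalistic cycling safety research.

Result: Validated on 315 manually annotated real-world overtaking events in Ann Arbor, Michigan, the method achieved 97.8% recall with zero false positives. The system identified overtaking intentions a mean of 2.44 seconds in advance, with 84.1% of events exceeding the human reaction time threshold. Lateral passing distance measurements showed 33.3% of events below the 5-foot threshold, and the calibration-free lateral distance estimation method achieved mean absolute errors of 13-14 cm.

Insight: The novelty lies in integrating geometric constraints (bearing angle trend, apparent size growth, and spatial confirmation) into the detection pipeline, which enables automated overtaking event detection from a single camera. The paper also proposes a calibration-free lateral distance estimation method, providing a scalable foundation for large-scale analysis of vehicle-bicycle interactions.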

Abstract: Instrumented bicycle studies have produced direct field evidence on vehicle passing behavior, but extracting overtaking events from continuous rear-facing video has remained dependent on manual, frame-by-frame annotation. This bottleneck constrains sample sizes and limits naturalistic cycling safety research. We present a geometry-informed computer vision pipeline that automates overtaking event detection from a single bicycle-mounted camera without multi-sensor configurations or explicit camera calibration. The system combines RT-DETR object detection with ByteTrack multi-object tracking through a three-stage geometric validation module enforcing bearing angle trend, apparent size growth, and spatial confirmation criteria derived from perspective projection principles. Validated on 315 manually annotated real-world overtaking events from urban roads in Ann Arbor, Michigan, the pipeline achieved 97.8% recall with zero false positives. The system identified overtaking intentions a mean of 2.44 seconds before vehicle passage, with 84.1% of events exceeding the 1.5-second human reaction time threshold, demonstrating feasibility for active cyclist warning. Lateral passing distance measurements from 96 events revealed 33.3% of passes below the 5-foot (152.4 cm) threshold, consistent with non-compliance rates in prior field and self-reported studies. A preliminary calibration-free lateral distance estimation approach using bounding box geometric features achieved mean absolute errors of 13-14 cm under leave-one-out cross-validation, sufficient to distinguish close passes from standard passes for safety categorization. By automating event isolation from consumer-grade footage, the system removes the primary annotation bottleneck of instrumented bicycle research and provides a scalable foundation for vehicle-bicycle interaction analysis across larger datasets and diverse urban environments.
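The two safety thresholds quoted in the abstract (the 1.5-second human reaction time and the 5-foot, 152.4 cm, passing distance) lend themselves to a small post-processing step. The Python sketch below shows one way per-event outputs could be summarized against them. The `summarize_events` helper, the tuple layout, and the sample values are hypothetical illustrations, not the authors' pipeline.

```python
# Thresholds taken from the abstract.
REACTION_TIME_S = 1.5   # human reaction time threshold, in seconds
CLOSE_PASS_CM = 152.4   # 5-foot lateral passing threshold, in centimeters


def summarize_events(events):
    """Summarize overtaking events against the two thresholds.

    `events` is a list of (lead_time_s, lateral_cm) tuples (hypothetical
    layout); lateral_cm may be None when no distance was measured.
    """
    lead_times = [lead for lead, _ in events]
    distances = [dist for _, dist in events if dist is not None]

    # Events detected early enough to warn the cyclist.
    warnable = sum(1 for t in lead_times if t > REACTION_TIME_S)
    # Passes closer than the 5-foot threshold, among measured events only.
    close = sum(1 for d in distances if d < CLOSE_PASS_CM)

    return {
        "warnable_fraction": warnable / len(lead_times) if lead_times else 0.0,
        "close_pass_fraction": close / len(distances) if distances else 0.0,
    }


# Made-up sample: four events, one without a distance measurement.
sample = [(2.0, 140.0), (1.0, 160.0), (3.0, None), (2.5, 150.0)]
summary = summarize_events(sample)
```

Note that the close-pass fraction is computed only over events with a distance measurement, mirroring the abstract, where distances come from 96 of the 315 events.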

[20] Listening makes Vision Clear for VLMs cs.CV | cs.AIPDF

Yiyang Chen, Yixin Tan, Binrui Shen

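TL;DR: This paper proposes Prompt-Vision Token Activation Map (PV-TAM), a new method for evaluating the consistency between prompts and visual regions in vision-language models (VLMs). The method relies on prompt-side semantics and adds a filter that removes the systematic bias introduced by modality boundary markers, giving a more accurate measure of alignment. Experiments show that PV-TAM outperforms answer-side attention baselines on multiple datasets, improving both attention-based and IoU-style localization metrics.

Details

Motivation: Existing work typically evaluates vision-language consistency using the attention distributions of answer-side tokens, but the authors observe that the highest-attention regions are not always consistent with the intended semantic token, probably because of decoding drift and bias introduced by structural tokens such as modality boundary markers.

Result: Across various datasets, PV-TAM consistently improves both attention-based and IoU-style localization metrics over answer-side baselines.

Insight: The novelty lies in building the attention map from prompt-side semantics (PV-TAM) and designing a filter that removes the systematic bias of modality boundary markers. The new metrics also use the peak distribution of attention to measure alignment rather than relying on mask overlap alone, which gives a finer-grained consistency evaluation.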

Abstract: Recent work typically assesses vision–language consistency using attention distributions of answer-side tokens. However, we observe that highest attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where language priors from previously generated answer tokens accumulate and mismatch with visual attention. Besides the priors from previous answer tokens, we find that structural tokens, e.g., modality boundary markers, may encompass the entire context and generate high attention to areas unrelated to the target. To avoid these distortions and provide consistency evaluation for large VLMs, we adopt prompt-side semantics and propose Prompt-Vision Token Activation Map (PV-TAM). PV-TAM further incorporates a filter to remove systematic bias induced by modality boundary markers. Unlike traditional methods that evaluate overlap solely through masks while ignoring activation intensity, our metrics leverage the peak distribution of attention to measure the alignment between prompts and visual regions. In experiments, PV-TAM consistently improves both attention-based and IoU-style localization metrics over answer-side baselines on various datasets.

[21] Mind the Heads: Topological Representation Alignment for Multimodal LLMs cs.CV | cs.AI | cs.CL | cs.MMPDF

Davide Caffagni, Alberto Compagnoni, Federico Melis, Sara Sarto, Pier Luigi Dovesi

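TL;DR: This paper proposes HeRA (Head-Wise Representation Alignment), a method for improving Multimodal Large Language Models (MLLMs). It enforces cross-modal representation alignment at the level of individual attention heads in order to preserve the topological structure of representations (their local neighborhood relationships), which improves performance on vision-centric tasks and reduces hallucinations.

Details

Motivation: Existing methods typically align a fixed layer of the language backbone and overlook the fine-grained structure of Transformer models. This paper addresses that gap by using finer-grained representation alignment to improve MLLMs.

Result: Extensive evaluation across multiple MLLMs and 18 benchmarks shows that HeRA consistently improves performance on challenging vision-centric tasks and acts as an effective regularizer, reducing visual hallucinations by naturally curbing over-reliance on linguistic priors.

Insight: The novelty lies in refining representation alignment down to the attention-head level and, grounded in the Platonic Representation Hypothesis, focusing on matching the topological structure of representations. A counterintuitive finding is that aligning the least aligned heads yields the largest gains, which offers a new perspective on model regularization.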

Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
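To make the Mutual K-Nearest Neighbor (MKNN) idea concrete, the Python sketch below scores how well two feature spaces agree on local neighborhoods: for each sample it takes the k nearest neighbors in each space and measures the overlap. This follows the commonly used definition of the metric. The cosine similarity, the tie-breaking rule, and the function names are assumptions, and HeRA's differentiable contrastive proxy is not reproduced here.

```python
import math


def _cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_sets(features, k):
    """Return, for each sample, the index set of its k nearest neighbors."""
    n = len(features)
    neighbors = []
    for i in range(n):
        sims = [(_cosine(features[i], features[j]), j)
                for j in range(n) if j != i]
        # Highest similarity first; ties broken by lower index (assumption).
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def mutual_knn_alignment(features_a, features_b, k):
    """Average fraction of shared k-nearest neighbors across two spaces.

    features_a[i] and features_b[i] must describe the same sample, e.g.
    one attention head's output and a vision encoder's embedding.
    """
    nn_a = knn_sets(features_a, k)
    nn_b = knn_sets(features_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


# Toy example: two spaces that pair up the four samples differently.
space_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
space_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
score_same = mutual_knn_alignment(space_a, space_a, k=1)
score_diff = mutual_knn_alignment(space_a, space_b, k=1)
```

A score near 1 means the two spaces preserve the same local neighborhoods; under the abstract's description, heads with low scores would be the candidates for alignment.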

[22] ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation cs.CV | eess.IVPDF

Anindya Mondal, Sauradip Nag, Anjan Dutta

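TL;DR: ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without benchmark-specific training. It is built on an existing 3B-parameter unified foundation model and adapted to object localization tasks through three innovations: density-aware adaptive zooming, a boundary-aware count policy, and a cycle-consistent GRPO strategy.

Details

Motivation: Existing models leave a gap between image count understanding (such as object and crowd counting) and count-faithful image generation. The paper aims to build a unified model that handles multiple counting-related tasks at once without benchmark-specific training.

Result: ABACUS achieves state-of-the-art (SOTA) results on seven benchmarks, outperforming both task-specific specialists and larger generalist models.

Insight: The innovations are density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy in which the understanding branch self-critiques generated outputs, closing the gap between understanding and generation without external annotations. Viewed objectively, the work unifies counting understanding and generation through adaptive mechanisms and self-supervised strategies, improving the model's generalization and accuracy.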

Abstract: ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without requiring any benchmark-specific training. Our model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.

[23] The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models cs.CV | cs.AIPDF

Ahmad Algadhi, Ahmed Alzuhair, Omar Alkhulaif, Muzammil Behzad

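TL;DR: This paper proposes TheProfessor, a multi-teacher unsupervised prompt distillation method for compressing vision-language models. It combines a domain-finetuned PromptSRC ViT-L/14 teacher with a zero-shot EVA-CLIP-L/14 teacher and is evaluated on four base-to-novel datasets. The results show that confidence-weighted ensembling substantially improves performance, especially under domain shift.

Details

Motivation: Single-teacher prompt distillation may provide insufficient supervision when compressing large vision-language models, particularly under domain shift. The paper introduces a multi-teacher ensemble to supply more complete and complementary supervision signals.

Result: Experiments on Caltech-101, DTD, UCF101, and EuroSAT show that confidence-weighted ensembling raises the average harmonic mean (HM) from 87.52 to 89.28 (+1.77 points), with the largest gain on the domain-shifted EuroSAT dataset (+5.78 HM), indicating that multi-teacher distillation is most effective under domain shift.

Insight: The novelty is a multi-teacher ensemble framework that combines the strengths of a domain-specific teacher and a zero-shot teacher to provide complementary supervision. Viewed objectively, the core value of the method is that it uses the different teachers' expertise to make distillation more robust, particularly when the domain distribution changes.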

Abstract: Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.
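The two ensembling variants and the HM metric compared in the abstract can be sketched in Python as follows. The abstract does not state how confidence is computed, so weighting each teacher by its maximum softmax probability is an assumption made purely for illustration, as are all function names. The harmonic mean of base and novel accuracy is the standard base-to-novel metric.

```python
import math


def softmax(logits):
    # Numerically stable softmax over one sample's class logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def equal_ensemble(probs_1, probs_2):
    """Equal-probability ensembling: plain average of the two teachers."""
    return [(a + b) / 2 for a, b in zip(probs_1, probs_2)]


def confidence_weighted_ensemble(probs_1, probs_2):
    """Weight each teacher by its confidence on this sample.

    ASSUMPTION: confidence is the teacher's maximum class probability;
    the paper may define it differently.
    """
    conf_1, conf_2 = max(probs_1), max(probs_2)
    w1 = conf_1 / (conf_1 + conf_2)
    return [w1 * a + (1 - w1) * b for a, b in zip(probs_1, probs_2)]


def harmonic_mean(base_acc, novel_acc):
    """HM of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Toy example: a confident teacher and an uncertain one disagree.
teacher_a = softmax([4.0, 0.0, 0.0])   # confident in class 0
teacher_b = softmax([0.0, 0.5, 0.0])   # weakly prefers class 1
target = confidence_weighted_ensemble(teacher_a, teacher_b)
```

In this toy case the confident teacher dominates the soft target, which is the intuition behind preferring confidence weighting over equal averaging when one teacher is unreliable on a given domain.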

[24] REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs cs.CVPDF

Yifei Zhao, Qian Lou, Mengxin Zheng

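TL;DR: The paper introduces REALM, the first unified red-teaming benchmark for physical-world vision-language models (VLMs). The benchmark integrates 12 red-teaming methods, 3 model-agnostic defenses, and 13 VLMs, evaluated under a black-box threat model with shared datasets and metrics. A scenario-specific, physically grounded pipeline for generating attack targets enables fair comparison of different attack methods under a unified protocol.

Details

Motivation: Existing red-teaming benchmarks mainly target jailbreak and content-safety evaluation for chatbots and do not systematically evaluate functional failures of physical-world VLMs. Evaluation settings are also fragmented across methods, which makes direct comparison difficult.

Result: The evaluation shows that text and typographic injection attacks induce the most failures, multimodal co-optimization yields the most transferable visual perturbations, single-pass attacks approach iterative methods at lower cost, and model scale alone does not confer adversarial robustness.

Insight: The novelty lies in building the first unified, physically grounded red-teaming benchmark for VLMs and in designing an agentic target-generation pipeline that aligns the adversarial objectives of different attack methods, enabling fair comparison. This provides a standardized framework for evaluating the safety and robustness of physical-world VLMs.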

Abstract: Vision-language models (VLMs) are increasingly used as perception-reasoning backbones for embodied intelligence in safety-critical physical systems, where perception or reasoning errors can lead to unsafe decisions or actions. Although many red-teaming methods have been developed to probe VLM vulnerabilities, their evaluation remains fragmented across datasets, metrics, and threat models, making direct comparison difficult and obscuring whether observed differences arise from stronger attacks, more vulnerable models, or incompatible evaluation settings. Existing chatbot-centric red-teaming benchmarks mainly standardize jailbreak and content-safety evaluation, but they do not systematically capture physically grounded functional failures or cover red-teaming methods that target physical-world VLMs. This raises the key challenge of comparing diverse attack methods under a unified protocol while targeting the same scenario-specific failures. We introduce REALM, to our knowledge the first unified red-teaming benchmark for physical-world VLMs. REALM integrates 12 red-teaming methods, 3 model-agnostic defenses, and 13 VLMs under a practical black-box threat model with shared datasets and metrics. To align adversarial objectives across attack families, REALM introduces an agentic target-generation pipeline that constructs shared, scenario-specific, and physically grounded attack objectives for each scene, enabling fair comparison of diverse red-teaming methods under aligned adversarial goals. Our evaluation shows that text and typographic injection attacks induce the most failures, multimodal co-optimization yields the strongest visual-perturbation transfer, single-pass attacks approach iterative methods at much lower cost, and model scale alone does not confer adversarial robustness. Code is available at https://github.com/UCF-ML-Research/REALM.

[25] HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models cs.CV | cs.IRPDF

Hoang-Bao Le, Aiden Durrant, Thai Son Mai, Binh T. Nguyen, Liting Zhou

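TL;DR: This paper proposes the HANCLIP model family to address the brittleness of vision-language models to negation. By combining hyperbolic geometry with an angular triplet objective, the models explicitly encode negated descriptions of an image in the embedding space, and training on a small set of only 20,000 quadruplets is enough to strengthen negation reasoning.

Details

Motivation: Existing vision-language models remain brittle to negation even when their overall retrieval or classification performance is strong, and they are easily distracted by misleading textual cues. Fine-tuning directly on negation data also damages the knowledge the model has already acquired.

Result: HANCLIP delivers consistent gains on the negation-focused NegBench benchmark while maintaining competitive or improved performance on standard image classification and image-text retrieval benchmarks.

Insight: The core innovation is a geometry-aware design (hyperbolic space to model hierarchical semantic relations and asymmetries, combined with an angular triplet objective) that systematically separates affirmative and negated descriptions, strengthening negation sensitivity while preserving the global structure of pretrained representations. The framework is model-agnostic and can be integrated into several CLIP variants.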

Abstract: Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode “what an image is not” alongside “what it is.” HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.

[26] Trustworthy Image Authentication using Forensic Knowledge Graphs cs.CVPDF

Tai D. Nguyen, Matthew C. Stamm

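TL;DR: This paper proposes Forensic Knowledge Graphs (FKGs), a unified framework for trustworthy image authentication. The framework integrates forensic evidence extraction, structured reasoning, and human-interpretable explanation, and it encodes forensic traces and their causal dependencies to counter the realistic image forgeries that generative AI makes possible.

Details

Motivation: Advances in generative AI have made image forgeries highly realistic. Existing forensic detectors target specific forgery types but lack interpretability, while vision-language models can provide explanations but cannot exploit forensic traces for reliable detection. A trustworthy authentication system that combines the strengths of both is therefore needed.

Result: Experiments show that FKG outperforms existing forensic detectors and vision-language models in detection, forgery identification and localization, and forensic justification, validated on the newly constructed FKG-50K dataset of 50,000 realistic forgeries.

Insight: The novelty lies in the Forensic Knowledge Graph as a structured representation, together with a novel forensic authentication network and an Iterative Context Refinement strategy that guides vision-language models to produce faithful, evidence-grounded explanations, yielding interpretable and reliable image authentication.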

Abstract: Advances in generative AI have made image falsification highly realistic, demanding trustworthy authentication systems. Existing forensic detectors can target certain forgery types but lack interpretability, while vision-language models (VLMs) provide explanations but cannot exploit forensic traces for reliable detection. We propose Forensic Knowledge Graphs (FKGs), a unified framework that integrates forensic evidence extraction, structured reasoning, and human-interpretable explanation. Our FKG structure encodes forensic traces along with their causal dependencies and links to scene content. To generate accurate FKGs, we introduce a novel forensic authentication network and an Iterative Context Refinement strategy that guides VLMs to produce faithful, grounded explanations. We also present FKG-50K, a dataset of 50,000 realistic forgeries with ground-truth FKGs. Experiments demonstrate that FKG outperforms both forensic detectors and VLMs in detection, forgery identification and localization, and forensic justification.

[27] End-to-End Radar and Communication Modulation Recognition with Neuromorphic Computing cs.CV | cs.AIPDF

Xiaohu Li, Chongxiao Qu, Caiyong Lin, Chenxiao Dou, Wei Hua

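TL;DR: This paper proposes EMRFormer, a novel end-to-end spiking neural network (SNN) architecture for automatic modulation recognition (AMR) of radar and communication signals. The model combines an adaptive spike encoder, Integer Leaky Integrate-and-Fire neurons, and a spike-separable convolutional network to extract multi-scale temporal features from raw IQ waveforms, substantially reducing computational energy consumption while maintaining high accuracy.

Details

Motivation: Deep learning-based AMR methods struggle to balance high accuracy against low power consumption on resource-constrained platforms. The paper explores neuromorphic computing architectures as a route to efficient, low-power AMR.

Result: Experiments on several mainstream datasets show that EMRFormer achieves state-of-the-art (SOTA) recognition accuracy, remains robust in low signal-to-noise ratio environments, and reduces theoretical energy consumption by over 90%. Deployed on a KA200 neuromorphic chip, it achieves up to a 5-fold reduction in power compared with running on a 3090 GPU or an Orin NX.

Insight: The novelty lies in combining a spike-driven Transformer (SpikeFormer) with spike-separable convolution (SSCNN) to build an end-to-end SNN architecture that fits the constraints of neuromorphic hardware, and in introducing an adaptive spike encoder and integer LIF neurons to mitigate information degradation and enhance the representational capacity of the SNN. This offers a viable, efficient, low-power path for AMR on resource-constrained devices.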

Abstract: Although deep learning-based methods can achieve high accuracy in automatic modulation recognition (AMR) tasks, their high computational cost makes it difficult to strike a balance between accuracy and power consumption, thereby limiting their application on resource-constrained platforms. Neuromorphic architectures that perform spike-driven inference with modest energy budgets have recently been explored for vision and time-series tasks. Motivated by these works, we propose EMRFormer, a novel end-to-end spiking neural network (SNN) architecture that adapts the spike-driven transformer to the constraints of neuromorphic hardware for AMR. The model incorporates an adaptive spike encoder and Integer Leaky Integrate-and-Fire neurons to mitigate the degradation of effective information and enhance SNN representational capacity. By integrating spike-separable Convolutional Neural Networks (SSCNN) into Spike-Driven Transformers (SpikeFormer), EMRFormer effectively extracts multi-scale temporal features from the raw IQ waveforms. We validate our approach on various mainstream datasets; the experimental results show that EMRFormer achieves state-of-the-art accuracy, outperforming all the baselines. Furthermore, the model maintains strong performance in low signal-to-noise ratio (SNR) environments and reduces theoretical energy consumption by over 90%. Finally, we evaluate our model on a KA200 neuromorphic chip. The results show that our model achieves up to a 5-fold reduction in power compared to running on a 3090 GPU or an Orin NX. This work demonstrates a promising pathway for AMR on resource-constrained devices.

[28] DriveStack-VLA: Render-Teacher Alignment for BEV-Based DeepStack Vision-Language-Action Model cs.CVPDF

Jingke Wang, Zhenru Zhao, Shuangming Lei, Hao Su, Yuehao Huang

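TL;DR: This paper presents DriveStack-VLA, an autonomous driving policy model built on a large vision-language model (VLM). Its core innovation is to strengthen the model's understanding of the spatial geometry and structure of driving scenes by introducing a bird's-eye-view (BEV) representation and Render-Teacher Alignment, together with a head-based self-critique module that improves trajectory selection. The model performs strongly on several autonomous driving simulation benchmarks.

Details

Motivation: Existing Vision-Language-Action (VLA) driving models rely mainly on perspective image tokens and language priors and lack driving-oriented spatial intelligence, such as an understanding of metric geometry, top-down scene structure, and safety-critical perceptual cues. This leaves them weak in visual geometry modeling and in the perceptual coverage of expert demonstrations.

Result: DriveStack-VLA achieves 91.6 PDMS on NAVSIMv1, 91.0 EPDMS on NAVSIMv2 (with the human penalty filter enabled), and a driving score of 79.49 with a success rate of 56.36% on the closed-loop Bench2Drive.

Insight: The main innovations are: 1) injecting a BEV representation into the large language model decoder through a DeepStack-style connection to strengthen spatial perception; 2) Render-Teacher Alignment, which aligns the perceptual focus of real images with that of rasterized images to improve geometry modeling; 3) a head-based self-critique module that ranks sampled trajectories and conditionally refines the best one, bridging the gap in multimodal trajectory selection.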

Abstract: Vision-Language-Action driving models convert a pretrained Vision-Language Model into a driving policy, allowing them to use world knowledge and follow language guidance. However, existing VLA driving models still lack driving-oriented spatial intelligence: their policies are mainly grounded on perspective image tokens and language priors, while precise motion planning requires metric geometry, top-down scene structure, and attention to safety-critical perceptual cues. This limitation makes current models vulnerable to weak visual geometry modeling and perceptual coverage in expert demonstrations. In this paper, we present DriveStack-VLA, a framework built upon a large VLM backbone. To strengthen the spatial grounding of VLA driving, we develop dual visual modeling components. We inject a Bird's-Eye-View representation into the Large Language Model decoder through a DeepStack-style connection, and propose Render-Teacher Alignment to align the perceptual focus of real images with that of rasterized images. Furthermore, to bridge the gap in multimodal trajectory selection, we introduce a head-based self-critique module that ranks sampled trajectories and conditionally refines the best one. DriveStack-VLA achieves 91.6 PDMS on NAVSIMv1, 91.0 EPDMS on NAVSIMv2 (with the human penalty filter enabled), and a driving score of 79.49 with a success rate of 56.36% on the closed-loop Bench2Drive. More visualizations are available on our project page: https://anonymous.4open.science/w/drivestack-vla/.

[29] Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent cs.CVPDF

Wenliang Zhong, Rob Barton, Lucas Goncalves, Kushal Kumar, Feng Jiang

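TL;DR: This paper proposes a universal image clustering framework, the Guideline-Driven Image Clustering Agent, which unifies different clustering scenarios through textual guidelines. The framework comprises Generative Concept Proxy Modeling, which produces guideline-aware embeddings, and an LLM Traversal method based on a Minimum Spanning Tree, which automatically discovers complex semantic clusters. The method performs well across clustering tasks ranging from general to fine-grained, from global to local, and from balanced to long-tail distributions.

Details

Motivation: Clustering scenarios differ fundamentally from one another (in task definition, granularity, and criteria), which makes it hard to build a unified method.

Result: Across clustering tasks covering general categorization, fine-grained categorization, global and local criteria, and long-tail distributions, the method consistently outperforms specialized (task-specific) methods.

Insight: The core innovation is to use textual guidelines as a universal interface that bridges different clustering tasks, together with Generative Concept Proxy Modeling, which needs no task-specific training, and an LLM Traversal strategy for complex semantic judgments, giving strong generalization.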

Abstract: Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce LLM Traversal based on Minimum Spanning Tree that selectively applies LLM reasoning for complex semantic judgments. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. Our framework consistently outperforms specialized methods across diverse clustering tasks.

[30] A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy cs.CV | cs.AIPDF

Aminu Lawal, Niyoj Oli, Sachin Acharya, Prashnna Gyawali, Maria Carmen Romano

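TL;DR: Focusing on gastrointestinal endoscopy, this paper systematically evaluates nine hallucination detection methods on five vision-language models using the Gut-VLM dataset. The white-box method ReXTrust achieves the highest AUC on every model, clearly ahead of the other methods, while black-box methods and clustering-based gray-box methods fall to near-chance performance on some models. The study also identifies "confident confabulation" as a systemic failure mode.

Details

Motivation: Vision-language models hallucinate, which hinders their safe deployment in clinical practice. Most hallucination detection methods have so far been evaluated only on radiology benchmarks, while gastrointestinal endoscopy remains underexplored.

Result: On the Gut-VLM dataset (4,392 test VQA pairs), the white-box method ReXTrust achieves the highest AUC on all five VLMs, peaking at 93.0 on MedGemma-4B, and its advantage is statistically significant. White-box access provides an average advantage of 19.5 AUC points. Among non-white-box methods, token-level gray-box statistics perform best.

Insight: The paper's main contribution is a hallucination detection benchmark for gastrointestinal endoscopy and a systematic comparison of the different categories of methods. Viewed objectively, white-box access (using the model's internal hidden states) gives a marked and consistent advantage in hallucination detection, which points to an important direction for future method design. The "confident confabulation" failure mode also exposes a shared weakness of existing consistency-based and uncertainty-based methods.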

Abstract: Vision-language models (VLMs) are prone to hallucination, which remains a major barrier to their safe deployment in clinical practice. To date, most hallucination detection methods have been evaluated on radiology benchmarks such as MIMIC-CXR and VQA-RAD, while gastrointestinal (GI) endoscopy remains largely underexplored. In this paper, we benchmark nine hallucination detection methods on the Gut-VLM dataset, a GI diagnostic Visual Question Answering (VQA) dataset with 4,392 test VQA pairs, across five VLMs (MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B). The methods span three categories: black-box methods (RadFlag, SelfCheckGPT-NLI), gray-box methods (AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, and VASE), and a white-box method (ReXTrust). Our results show that ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average (range: 9.5–33.5), with ReXTrust maintaining strong performance even on LLaVA-v1.6-7B (AUC 79.9), where black-box methods and clustering-based gray-box methods collapse to near-chance performance. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives, outperforming both clustering-based gray-box methods (Semantic Entropy, VASE) and black-box approaches on average. We further identify confident confabulation, a failure mode in which models hallucinate with high inter-sample consistency or high token-level probability, as a systemic failure for both consistency and uncertainty-based methods.
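The token-level gray-box statistics named in the abstract can be sketched in Python from the per-step output distributions of a generated answer, together with a rank-based AUC for scoring a detector. The exact definitions used in the paper are not given in the abstract, so the ones below (negative log-probability of the chosen token and Shannon entropy of each step's distribution, aggregated by mean or max) are assumptions based on common usage, and all function names are hypothetical.

```python
import math


def token_level_scores(step_distributions, chosen_ids):
    """Compute four uncertainty scores; higher means more suspicious.

    step_distributions[t] is the probability distribution over the
    vocabulary at generation step t, and chosen_ids[t] is the emitted
    token's index. Assumes every chosen token has probability > 0.
    """
    # Negative log-probability of each emitted token (assumed definition).
    nll = [-math.log(dist[tok])
           for dist, tok in zip(step_distributions, chosen_ids)]
    # Shannon entropy of each step's distribution (assumed definition).
    ent = [-sum(p * math.log(p) for p in dist if p > 0)
           for dist in step_distributions]
    return {
        "AvgProb": sum(nll) / len(nll),
        "MaxProb": max(nll),
        "AvgEnt": sum(ent) / len(ent),
        "MaxEnt": max(ent),
    }


def auc(scores, labels):
    """Rank-based AUC: chance that a hallucinated answer (label 1)
    scores above a faithful one (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: a two-token answer with one uncertain step.
dists = [[0.9, 0.1], [0.5, 0.5]]
scores = token_level_scores(dists, chosen_ids=[0, 1])
```

Scores of this kind stay low whenever the model is sure of itself, which is consistent with the abstract's point that a confidently hallucinating model evades uncertainty-based detectors.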

[31] Flood Mapping from RGB imagery using a Vision Foundation Model cs.CV | eess.IVPDF

Vladyslav Polushko, Tilman Bucher, Ronald Rösch, Thomas März, Markus Rauhut

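TL;DR: This paper studies how to adapt a satellite-pretrained Earth observation vision foundation model (the Prithvi-EO-2.0-600M Vision Transformer) to flood extent mapping from centimeter-scale RGB aerial imagery. The authors combine it with a UPerNet decoder to build Prithvi-2.0-UPN and fine-tune it on two RGB flood datasets (BlessemFlood21, NeuenahrFlood), demonstrating its effectiveness for floodwater segmentation, its generalization across events, and its potential for rapid few-shot adaptation to new scenes.

Details

Motivation: Emergency response and damage assessment require timely, high-resolution maps of flood extent around settlements, and aerial RGB imagery, which can be collected rapidly at low cost, is an ideal data source. Existing deep learning flood mapping models based on CNNs or small ViTs need large amounts of data to adapt to a new scene (that is, a new flood event), whereas vision foundation models (large vision Transformers) generalize across domains. However, existing Earth observation foundation models are pretrained on satellite data, whose spatial resolution, viewing geometry, and radiometry differ from nadir RGB imagery, so how to adapt them needs study.

Result: When trained on the BlessemFlood21 and NeuenahrFlood datasets, Prithvi-2.0-UPN reaches state-of-the-art (SOTA) results. In zero-shot cross-event transfer (trained on BlessemFlood21, tested on NeuenahrFlood), it outperforms SOTA baseline models, though room for improvement remains. When further fine-tuned with small amounts of NeuenahrFlood data, Prithvi-2.0-UPN improves the fastest and almost reaches the performance of full training on NeuenahrFlood.

Insight: The paper's novelty is the successful transfer of a satellite-pretrained Earth observation vision foundation model (Prithvi-EO-2.0) to flood mapping from aerial RGB imagery, showing that such large models generalize strongly and adapt quickly across data modalities (satellite to aerial) and across events. This is a useful example of using pretrained foundation models for specialized domain tasks such as disaster remote sensing, and its architecture (ViT backbone plus UPerNet decoder) and fine-tuning strategy are worth drawing on.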

Abstract: Timely, high-resolution maps of flood extent around settlements are essential for emergency response and damage assessment. We consider airborne RGB imagery for flood mapping as it can be collected rapidly at low cost. To produce flood maps, deep learning models for water segmentation are often used, typically CNN-based or small vision transformer models. However, they need large amounts of data to adapt to a change of scenery, i.e., another flooding event. Vision foundation models or large vision transformers are known to generalize across domains. Recently, foundation models for Earth observation became available. They are pretrained on satellite data, whose spatial resolution, viewing geometry, and radiometry differ from nadir RGB imagery. Thus, adaptation is required. We investigate how a satellite-pretrained Earth observation foundation model can be adapted to centimeter-scale floodwater mapping from RGB imagery. Specifically, we fine-tune a model we call Prithvi-2.0-UPN, consisting of the Prithvi-EO-2.0-600M Vision Transformer combined with a UPerNet decoder, for binary water segmentation on two RGB datasets (BlessemFlood21, NeuenahrFlood). In a first experiment we observe that Prithvi-2.0-UPN reaches state-of-the-art results on BlessemFlood21 and NeuenahrFlood when trained on the respective dataset. In a second experiment we show that Prithvi-2.0-UPN performs better than state-of-the-art baseline models for transfer to a new flood event (trained on BlessemFlood21, tested on NeuenahrFlood) in a zero-shot setting. However, the performance indicates room for improvement. In this respect, we investigate in a third experiment how performance improves when further fine-tuning the models with small shares of NeuenahrFlood training data: Prithvi-2.0-UPN improves the fastest and almost reaches the performance level of full training on NeuenahrFlood, indicating transfer capabilities.

[32] Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image cs.CVPDF

Tongyan Hua, Dongli Wu, Jinjing Zhu, Yinrui Ren, Zhongcheng Hong

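TL;DR: Sat2City v2 is a framework that generates explicit, reusable, textured 3D city assets (meshes) from a single satellite image. It fine-tunes a pretrained native structured-latent 3D foundation model conditioned on the satellite image, giving controllable generation of both geometry and appearance. Evaluated on a real-world dataset, it achieves the best overall performance on geometry and appearance generation benchmarks.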

Details

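Motivation: Generating explicit 3D city assets from a single satellite image matters for digital twins, urban simulation, and geospatial intelligence. Existing methods such as Sat2City produce random appearance, rely on synthetic training data, and use a task-specific VAE that scales poorly to noisy real-world reconstructions.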

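Result: On metric-scale digital surface model (DSM) reconstruction and on generative city-asset benchmarks for geometry and appearance, Sat2City v2 achieves the best overall performance among the evaluated baselines.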

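Insight: The core idea is to adapt a pretrained native 3D foundation model to weakly aligned satellite images and textured meshes, which avoids learning a 3D representation directly from noisy meshes. A satellite-conditioned geometry flow, followed by satellite-conditioned texturing anchored on the decoded shape, preserves the geometry-to-appearance cascade while making appearance controllable. The paper also contributes what the authors describe as the first real-world dataset of satellite-mesh pairs collected from matched geographic crops for this asset-level task.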

Abstract: Generating explicit 3D city assets from a single satellite image is important for digital twins, urban simulation, and geospatial intelligence. Unlike satellite-to-street-view synthesis, the task requires a reusable textured mesh with plausible geometry and controllable appearance rather than a 3D proxy optimized only for rendering a small set of images or videos. The ICCV Sat2City framework made a first step by conditioning cascaded sparse-voxel latent diffusion on satellite-derived height maps, but its appearance was random, its training data were synthetic, and its task-specific VAE did not scale well to noisy real-world reconstructions. We present Sat2City v2, a journal extension that adapts a pretrained native structured-latent 3D foundation model to weakly aligned satellite images and textured meshes. We build a real-world dataset with 16,241 satellite-mesh pairs across 24 regions in 9 cities. Instead of learning a 3D representation from noisy city meshes, Sat2City v2 encodes each mesh into a pretrained native 3D latent space, fine-tunes a satellite-conditioned geometry flow, and uses the decoded shape to anchor satellite-conditioned texturing. This retains Sat2City’s geometry-to-appearance cascade while enabling appearance-controllable generation from the satellite input. Experiments on metric-scale DSM reconstruction and generative city-asset benchmarks for geometry and appearance show that Sat2City v2 achieves the best overall performance among evaluated baselines. Overall, Sat2City v2 advances satellite-to-city generation from rendering-oriented 3D proxies to explicit textured mesh assets, supported by, to the best of our knowledge, the first documented satellite-mesh paired dataset collected from matched geographic crops for this asset-level task. Project page: https://ai4city-hkust.github.io/Sat2City-v2/

[33] ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration cs.CV | cs.ROPDF

Taekbeom Lee, Youngseok Jang, Jeonghwa Heo, Jeongjun Choi, H. Jin Kim

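TL;DR: The paper proposes ObsGraph, an observation-centric hierarchical scene graph for embodied reasoning and exploration. It unifies scene representation, retrieval, and exploration: visual evidence is organized into room, view, and object layers, retrieved coarse-to-fine over that hierarchy, and the retrieval outcomes then guide adaptive multi-scale exploration.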

Details

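Motivation: A robot working in a complex, unfamiliar environment must identify and acquire the information its task requires. The paper addresses this with a structured scene representation that tightly couples representation, retrieval, and exploration.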

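Result: Experiments on several embodied reasoning and exploration benchmarks show higher task success and efficiency, supporting the value of structured scene representation and of targeted information gathering driven by identified evidence gaps.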

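Insight: The contribution is a unified hierarchical observation representation (ObsGraph) that tightly couples scene representation, retrieval, and exploration policy, and uses retrieval outcomes to structure the exploration candidate space dynamically, enabling adaptive multi-scale information acquisition.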

Abstract: Embodied reasoning and exploration are increasingly considered crucial abilities for robots operating in complex and unfamiliar environments. To accomplish tasks in such settings, an agent must identify and acquire the information necessary for the task through exploration. We propose ObsGraph, an observation-centric hierarchical scene graph that unifies scene representation, retrieval, and exploration. It retains visual evidence and organizes it into room-view-object layers: rooms provide coarse semantic anchors, views preserve contextual object covisibility, and objects store fine-grained details. On top of this representation, we perform coarse-to-fine hierarchical retrieval under a bounded budget, and crucially use retrieval outcomes to structure the exploration candidate space (activating room-level exploration, view refinement, or frontier exploration), thereby tightly coupling representation, retrieval, and adaptive multi-scale exploration. Experiments across embodied reasoning and exploration benchmarks demonstrate improved success and efficiency, highlighting the benefits of structured scene representation and more targeted information gathering driven by identified evidence gaps.

[34] DramaDirector: Geometry-Guided Short Drama Generation cs.CV | cs.AIPDF

Hengji Zhou, Sijie Liu, Jianrun Chen, Xingchen Zou, Lianghao Xia

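TL;DR: The paper proposes DramaDirector, a geometry-guided framework for short-drama generation. It targets the fast shot rhythm, dialogue-driven focus shifts, and demanding cinematographic grounding that prompt-level or text-only video generation pipelines struggle with. The framework guides its planner with depth and pose references retrieved from a gallery of real short-drama shots, decouples each shot into static visual and dynamic narrative conditions, and trains with schema-constrained SFT and GRPO.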

Details

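Motivation: Existing video generation methods struggle with the rapid shot changes, dialogue-driven focus shifts, and strong cinematographic grounding that short dramas demand. The goal is to turn a global plot and local context into visually grounded multi-shot video.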

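Result: On DramaBoard, a benchmark built from 35 live-action short dramas, 2.8K episodes, and 81K shots, DramaDirector outperforms representative multi-agent and video generation baselines on faithfulness, consistency, and controllability.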

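Insight: The contributions are a geometry-guided framework that retrieves depth and pose information from a gallery of real shots to steer video generation, and DramaBoard, a new benchmark with structured storyboards and multi-dimensional evaluation protocols. The method explicitly injects cinematographic geometry into the generation process to improve visual quality and narrative coherence.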

Abstract: Short dramas, with their rapid shot rhythms, dialogue-driven focus shifts, and demanding cinematographic grounding, pose challenges that prompt-level or text-only video generation pipelines struggle to meet. We study plot-to-short-drama generation, where a global plot and local context are transformed into visually grounded multi-shot videos. We propose DramaDirector, a geometry-grounded framework that lets the planner borrow cinematographic geometry from a gallery of real short-drama shots indexed by depth and pose. DramaDirector decouples each shot into static visual and dynamic narrative conditions, trains the planner with schema-constrained SFT and GRPO under a learned text-visual alignment reward, and retrieves depth-pose references to guide first-frame generation and image-to-video synthesis. We also introduce DramaBoard, a benchmark built from 35 live-action dramas, 2.8K episodes, and 81K shots, with structured storyboards and multi-dimensional evaluation protocols. Experiments show that DramaDirector improves over representative multi-agent and video generation baselines on faithfulness, consistency, and controllability. Our code is released at: https://github.com/iLearn-Lab/DramaDirector

[35] Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models cs.CV | cs.LGPDF

Xin Wang, Wenxuan Liu, Tongtong Feng, Wenwu Zhu

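TL;DR: This perspective paper argues that autonomous video generation should have counterfactual controllability in order to realize self-evolving world models. It contends that current video generation models learn only a partial, implicit spatiotemporal world model and do not know which variables are controllable, which constraints belong to a particular body, or which futures remain valid under intervention.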

Details

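Motivation: Existing literature claims that video generation is essentially world modelling. The authors argue that scaling visual prediction alone will not automatically yield physical agents, because the models lack an explicit understanding of controllability and constraints.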

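Result: The abstract reports no quantitative experiments or benchmarks; instead, it proposes counterfactual controllability as the decisive criterion for self-evolving world models.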

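Insight: The contribution is to treat counterfactual controllability as a core capability of video generation models: a model should reason about the future state under a specific action and feed the resulting action knowledge back into generation, which drives the world model's self-evolution.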

Abstract: Existing literature claims that video generation essentially is world modelling. On the one hand, the claim is productive because it pushes generative AI beyond static images and toward temporally extended physical scenes. On the other hand, this claim dangerously relies on the belief that scaling visual prediction alone will automatically yield physical agents. We prefer a more accurate statement: video generation models learn a partial, implicit spatiotemporal world model, but not a fully grounded or controllable one. The reason is as follows: a model may generate a plausible video of a drone crossing a forest or a robot arm manipulating a cup, yet still fail to know which variables are controllable, which constraints belong to a particular body and which futures remain valid under intervention. The frontier in essence is not predictive realism alone, instead it emphasizes a self-evolving generative nature that requires the decisive criterion to be counterfactual controllability: the capability of asking what would happen under an action, to test whether the generated future can survive embodiment constraints and to feed the resulting action knowledge back into future imagination (generation). Therefore, in this paper we present a new perspective, i.e., autonomous video generation with counterfactual controllability is one promising way to realize self-evolving world models.

[36] An LMM for Precisely Grounding Elements in Documents cs.CVPDF

Yijian Lu, Chuangxin Zhao, Kai Sun, Lei Hou, Juanzi Li

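TL;DR: The paper proposes PreciseDoc, a large multimodal model designed for precise element grounding in document images. It strengthens basic localization with high-quality synthetic documents paired with fine-grained coordinate metadata, and introduces a training paradigm that jointly supervises grounding and reasoning with reinforcement learning to improve reasoning over grounded evidence.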

Details

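Motivation: Existing large multimodal models ground poorly in text-rich document images and often fail to locate the key document elements that reliable reasoning depends on, which limits applications such as document understanding, deep research, and document error detection.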

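Result: A comprehensive evaluation on multiple benchmarks shows the advantage of the proposed data and methods in document spatial grounding and document understanding.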

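Insight: The contributions are data pipelines that mass-produce challenging training documents with fine-grained coordinate metadata (including synthetic hand-filled documents with camera effects), and a training paradigm that jointly supervises grounding and reasoning through reinforcement learning. Together they let the model perform real-world functions beyond locating a single piece of text, such as locating personal information in a CV.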

Abstract: Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding and can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data by two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information from CVs. Furthermore, we introduce a training paradigm for visual grounded reasoning where the grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.

[37] Differential Unfolding: Efficient Unfolding Reconstruction for Video Snapshot Compressive Imaging cs.CVPDF

Muyuan Zhang, Jiancheng Zhang, Haijin Zeng, Yin-ping Zhao

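TL;DR: The paper proposes Differential Unfolding (DU), a heterogeneous framework for efficient reconstruction in video snapshot compressive imaging (SCI). It splits the unfolding process into two complementary roles, structural anchoring and differential evolution, replacing the uniformly repeated high-complexity priors of conventional deep unfolding networks (DUNs) with dynamic evolution. The result is new state-of-the-art (SOTA) performance at significantly lower computational cost.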

Details

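Motivation: DUN-based video SCI methods follow a uniform design philosophy: they repeatedly stack high-complexity priors with identical structures and ignore the fact that optimization trajectories converge toward static states. This causes representation stagnation and wasted computation, which the paper sets out to fix.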

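Result: Extensive experiments show that the method sets new state-of-the-art (SOTA) results on multiple benchmarks while significantly reducing computational overhead.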

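Insight: The core contribution is the heterogeneous Differential Evolutionary Framework (DEF), which decouples unfolding into a structural anchoring stage and lightweight differential evolution stages. The latter use a Differential Representation Prior (DRP), built from Differential Representation Attention (DRA) and a Differential Modulated FFN (DM-FFN), to model cross-stage variation with minimal overhead. Computation is thereby focused on dynamic evolution rather than static redundancy, giving a better trade-off between accuracy and efficiency.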

Abstract: While Deep Unfolding Networks (DUNs) dominate video Snapshot Compressive Imaging (SCI), they remain constrained by a uniform design philosophy. Existing methods repeatedly stack high-complexity priors with identical structures, ignoring the fact that optimization trajectories converge toward static states. This results in representation stagnation, where high-cost computations are wasted on minimal feature updates. To address this inefficiency, we present Differential Unfolding (DU), a heterogeneous framework that replaces uniform repetition with dynamic evolution. Central to DU is the Differential Evolutionary Framework (DEF), which partitions the unfolding process into two complementary roles: structural anchoring and differential evolution. In this scheme, high-parameter general stages are sparsely deployed to generate high-fidelity feature foundations. Complementing these, lightweight differential stages employ a Differential Representation Prior (DRP) to propagate and refine these foundational features through a differential mechanism. By integrating Differential Representation Attention (DRA) for evolving attention maps and a Differential Modulated FFN (DM-FFN) for feature rectification, DRP effectively models cross-stage variations with minimal overhead. By focusing computational resources on dynamic evolution rather than static redundancy, DU achieves a superior trade-off between accuracy and efficiency. Extensive experiments verify that our method establishes new state-of-the-art results while significantly slashing computational overhead. https://github.com/Muyuan-Zhang/DU

[38] Dual-Branch Cross-Projection Debiasing through Diffusion-based Disentanglement cs.CVPDF

Xiangqian Zhao, Xinyang Jiang, Zhipeng Xu, Lingfeng He, Zilong Wang

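TL;DR: The paper proposes Dual-branch Cross-projection Debiasing (DCD), a prompt-tuning framework that reduces a foundation model's reliance on spurious correlations when its training data is biased. The method has two components. First, Confidence-guided Bias Concept Mining (CBCM) uses diffusion-disentangled, semantically grounded concept representations to identify reliable spurious attributes without supervision. Second, DCD separates target and spurious attribute representations into two branches and explicitly removes spurious information through cross null-space projection while preserving target-relevant semantics. On four benchmark datasets, the method achieves state-of-the-art performance in the setting without group labels while tuning only a tiny fraction of the model parameters.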

Details

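Motivation: Foundation models trained on biased datasets rely on spurious correlations between target labels and non-causal (spurious) attributes, which hurts generalization. Existing methods face two challenges: without group labels, it is hard to identify spurious factors that are semantically aligned with real-world biases; and even with pseudo supervision, a single-branch design cannot effectively disentangle target and spurious attributes in a shared feature space.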

Result: Extensive experiments on four benchmark datasets show that the method achieves state-of-the-art worst-group accuracy among group-unsupervised approaches while tuning at most 0.22% of the model parameters.

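Insight: The main contributions are: 1) CBCM, which exploits the disentangling ability of diffusion models to obtain semantically grounded concept representations and thus reliably identify spurious attributes without attribute annotations; 2) the DCD framework, whose dual-branch design separates target and spurious representations and debiases explicitly through cross null-space projection, disentangling otherwise entangled features. Using diffusion models for concept mining and a dual-branch cross-projection architecture offers a new angle on feature entanglement and spurious correlations.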

Abstract: Foundation models trained on biased datasets often rely on spurious correlations between target labels and non-causal attributes, resulting in poor generalization on minority groups. Bias mitigation remains challenging due to two fundamental issues. First, when group labels are unavailable, existing group-unsupervised methods typically infer spurious attributes implicitly from model behavior, making it difficult to identify spurious factors that are semantically aligned with real-world biases. Second, even with pseudo spurious supervision, most existing debiasing methods follow a single-branch design that operates within a single shared feature space, where target and spurious attributes are intrinsically entangled. To address the first challenge, we introduce Confidence-guided Bias Concept Mining (CBCM), which leverages diffusion-disentangled, semantically grounded concept representations to identify reliable spurious attributes without attribute annotations. To address the second challenge, we propose Dual-branch Cross-projection Debiasing (DCD), a prompt-tuning framework that separates target and spurious representations into two branches and explicitly removes spurious information through cross null-space projection while preserving target-relevant semantics. Extensive experiments on four benchmark datasets show that our method achieves state-of-the-art worst group accuracy among group-unsupervised approaches, while tuning at most 0.22% of the model parameters. The source code is available in the supplementary materials.
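The abstract does not spell out the projection itself, so the following is only a minimal sketch of the generic idea behind null-space projection: removing from a feature vector its component along one known spurious direction. The function names, the single-direction setup, and the toy vectors are assumptions for illustration, not the authors' DCD implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(feature, direction):
    """Return `feature` with its component along `direction` removed.

    The result lies in the null space of the rank-one map x -> <x, direction>,
    i.e. it is orthogonal to `direction`.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(feature)  # nothing to remove for a zero direction
    scale = dot(feature, direction) / norm_sq
    return [f - scale * d for f, d in zip(feature, direction)]


# Hypothetical toy vectors: the first axis plays the role of the spurious attribute.
target_feature = [3.0, 4.0, 0.0]
spurious_direction = [1.0, 0.0, 0.0]

debiased = project_out(target_feature, spurious_direction)
print(debiased)
```

In a dual-branch setting one would presumably apply the same operation in both directions (the target branch projecting out the spurious subspace and vice versa), but how the paper defines those subspaces is not stated in the abstract.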

[39] Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models cs.CVPDF

Bin Chen, Yuxiang Cai, Yadan Luo, Yi Zhang, Jianwei Yin

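TL;DR: The paper proposes a training-free token pruning framework based on Cross-Layer Spectral Evolution (CLSE) to accelerate multimodal large language models (MLLMs). It scores token importance by quantifying how token representations evolve across Transformer layers in the frequency domain, reducing visual token redundancy and improving efficiency while preserving cross-modal reasoning performance.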

Details

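Motivation: Existing token pruning methods typically rely on single-layer signals such as attention scores or token similarities. They overlook the cross-layer transformation of visual representations and may exhibit positional bias in multimodal token sequences. The paper addresses this limitation.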

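Result: Extensive experiments on image and video benchmarks show that CLSE achieves a superior trade-off between efficiency and accuracy under aggressive token reduction. Across multiple MLLMs, CLSE reduces FLOPs, KV cache memory, and latency while maintaining competitive or improved performance.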

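Insight: The contribution is to model cross-layer token dynamics in the frequency domain and tie a token's importance to the strength of its spectral redistribution. This yields a stable importance criterion that mitigates positional bias and enables efficient pruning without training.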

Abstract: Reducing visual token redundancy is critical for accelerating Multimodal Large Language Models (MLLMs) without degrading cross-modal reasoning performance. Existing token pruning methods typically rely on single-layer signals, such as attention scores or token similarities, which overlook the cross-layer transformation of visual representations and may exhibit positional bias in multimodal token sequences. To address this limitation, we propose a training-free token pruning framework based on Cross-Layer Spectral Evolution (CLSE). Instead of measuring token importance from single-layer feature magnitudes, CLSE quantifies how token representations evolve across Transformer layers in the frequency domain. This evolution reflects the transition from high-frequency structural details to low-frequency semantic abstractions. We observe that tokens with stronger spectral redistribution across layers are more likely to be semantically active and should therefore be preserved. By modeling cross-layer token dynamics, CLSE provides a stable importance criterion that mitigates positional bias. Extensive experiments on both image and video benchmarks demonstrate that CLSE achieves a superior trade-off between efficiency and accuracy under aggressive token reduction. Across multiple MLLMs, CLSE reduces FLOPs, KV cache memory, and latency while maintaining competitive or improved performance.
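To make the notion of "spectral redistribution across layers" concrete, here is a toy sketch. It is not CLSE's actual scoring rule, which the abstract does not give. As an assumption, each token's importance is taken to be the L1 distance between its normalized power spectra (over the feature dimension) at two layers, and the tokens with the largest shift are kept. All names and the example features are hypothetical.

```python
import cmath


def power_spectrum(vec):
    """Normalized power spectrum of a real vector via a naive DFT."""
    n = len(vec)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(vec))
        spectrum.append(abs(coeff) ** 2)
    total = sum(spectrum) or 1.0  # guard against an all-zero vector
    return [p / total for p in spectrum]


def spectral_shift(early_vec, late_vec):
    """L1 distance between the spectra of one token at two layers."""
    early, late = power_spectrum(early_vec), power_spectrum(late_vec)
    return sum(abs(a - b) for a, b in zip(early, late))


def keep_tokens(early_layer, late_layer, keep):
    """Keep the `keep` tokens whose spectrum changed most between layers."""
    scores = [spectral_shift(a, b) for a, b in zip(early_layer, late_layer)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores


# Hypothetical features for three tokens at an early and a late layer.
early_layer = [[1.0, 2.0, 3.0, 4.0],    # token 0: unchanged across layers
               [1.0, -1.0, 1.0, -1.0],  # token 1: high-frequency pattern...
               [1.0, 2.0, 3.0, 4.0]]    # token 2: slightly perturbed
late_layer = [[1.0, 2.0, 3.0, 4.0],
              [1.0, 1.0, 1.0, 1.0],     # ...that becomes low-frequency
              [1.1, 2.0, 3.0, 4.0]]

kept, scores = keep_tokens(early_layer, late_layer, keep=1)
print(kept, scores)
```

Token 1 moves all of its energy from the highest frequency bin to the zero-frequency bin, so it receives the largest score under this toy criterion, while the unchanged token 0 scores zero.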

[40] Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring cs.CV | cs.AIPDF

Dominik Lindner, Johann Schmidt, Tom Siegl, Martin Becker, Sebastian Stober

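TL;DR: The paper proposes a zero-shot test-time canonicalization method that makes pretrained vision models more robust to affine transformations (rotation, scaling, shear) by mapping each input to a canonical form near the training distribution. It reframes canonicalization as out-of-distribution (OOD) detection, so any OOD score can serve as the energy minimized over transformations, and adds a gating mechanism that avoids mishandling inputs that are already aligned.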

Details

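Motivation: Pretrained vision models often misclassify affinely transformed inputs. Existing remedies modify the architecture or retrain the model, whereas test-time canonicalization leaves the classifier untouched. Existing canonicalizers, however, rely on a narrow set of logit-based energy scores and bespoke search procedures, which restricts the design space of scoring functions and optimizers.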

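Result: Across benchmarks covering handwritten characters, sketches, natural images, and 3D point clouds, the authors systematically evaluate around 20 OOD scores and 9 search algorithms. Distance-based scores paired with random search and local refinement perform best overall, and the gating mechanism preserves most in-distribution accuracy while retaining the robustness gains on transformed inputs.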

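Insight: Reframing canonicalization as OOD detection widens the choice of scoring functions and optimizers. The gating mechanism decides per input whether to transform at all, so unnecessary canonicalization does not hurt accuracy, which suggests a useful direction for test-time adaptation methods.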

Abstract: Pretrained vision models often misclassify inputs that are rotated, scaled, or sheared, even though these affine transformations leave the object class unchanged. Robustness is usually restored either by building equivariance into the architecture or by retraining with augmentation, both of which require changing or retraining the model. Test-time canonicalization instead leaves the classifier untouched. It undoes the transformation of each input, mapping it to a canonical form near the training distribution before classification. Existing canonicalizers, however, rely on a narrow set of logit-based energy scores and bespoke search procedures, leaving the design space of scoring functions and optimizers unexplored. We reframe canonicalization as out-of-distribution (OOD) detection, which lets any OOD score serve as the energy minimized over transformations. Across benchmarks ranging from handwritten characters and sketches to natural images and 3D point clouds, we systematically evaluate around twenty OOD scores and nine search algorithms, finding that distance-based scores paired with random search and local refinement perform best overall. Because canonicalizing an already-aligned input can hurt accuracy, we add a gated mechanism that transforms an input only when its OOD score indicates this is needed, preserving most in-distribution accuracy while retaining the robustness gains on transformed inputs. Code is available at github.com/johschm/its.
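The pipeline described above (a gate, then random search plus local refinement over transformations that minimizes an OOD score) can be sketched on a toy problem. Everything here is an assumption for illustration: the "training distribution" is a single 2D point-set prototype, the OOD score is a plain distance to it, the only transformation is rotation, and the threshold and trial count are arbitrary. It is not the authors' code.

```python
import math
import random

# Hypothetical "training distribution": one canonical 2D shape.
PROTOTYPE = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0)]


def rotate(points, angle):
    """Rotate 2D points counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def ood_score(points):
    """Distance-based OOD score: summed point-wise distance to the prototype."""
    return sum(math.dist(p, q) for p, q in zip(points, PROTOTYPE))


def canonicalize(points, threshold=0.1, trials=200, seed=0):
    """Return (canonical points, applied angle), skipping aligned inputs."""
    best_score = ood_score(points)
    if best_score <= threshold:           # gate: input already looks in-distribution
        return points, 0.0
    rng = random.Random(seed)
    best_angle = 0.0
    for _ in range(trials):               # random search over rotations
        angle = rng.uniform(-math.pi, math.pi)
        score = ood_score(rotate(points, angle))
        if score < best_score:
            best_angle, best_score = angle, score
    step = 0.1
    while step > 1e-4:                    # local refinement with a shrinking step
        improved = False
        for candidate in (best_angle - step, best_angle + step):
            score = ood_score(rotate(points, candidate))
            if score < best_score:
                best_angle, best_score, improved = candidate, score, True
        if not improved:
            step /= 2
    return rotate(points, best_angle), best_angle


rotated_input = rotate(PROTOTYPE, 0.7)
restored, applied_angle = canonicalize(rotated_input)
untouched, skipped_angle = canonicalize(PROTOTYPE)
print(round(applied_angle, 3), round(ood_score(restored), 4), skipped_angle)
```

The classifier never appears in the sketch, which mirrors the point of test-time canonicalization: only the input is changed, and the gate leaves already-aligned inputs alone.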

[41] Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms cs.CV | cs.AIPDF

Afifa Khaled, Said Jadid Abdulkadir, Majdy Mohamed Eltayeb Eltahir

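TL;DR: This systematic review surveys progress in 3D scene completion from 2016 to 2026, focusing on the evolution from the voxel-based semantic completion paradigm (e.g., SSCNet) to the latest paradigm combining generative diffusion priors with real-time Gaussian splatting rendering.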

Details

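Motivation: The review aims to compile the past decade of research contributions in 3D scene completion, trace the shift from geometric modeling to generative paradigms, and analyze the open challenges and future directions.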

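Result: The study proposes no new model or experimental data. Instead, through comprehensive analysis and a taxonomy, it reviews the field's progress across representation paradigms: voxel grids, point learning, implicit neural fields, Transformers, diffusion networks, and 3D Gaussian primitives.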

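Insight: The contribution is a systematic account of the paradigm shift in 3D scene completion from traditional methods to the fusion of generative AI (e.g., diffusion models) with neural rendering (e.g., Gaussian splatting), together with a clear taxonomy and a future research agenda that give the field a structured view.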

Abstract: Three-dimensional scene completion has evolved as a major problem in computer vision and robotics, and its applications are diverse, including autonomous navigation and augmented reality. In this study, a systematic review has been conducted to compile the research contributions made in the last ten years, i.e., 2016 to 2026, which has revolutionized the field from the voxel semantic completion paradigm represented by SSCNet to the latest paradigm that combines generative diffusion priors with real-time rendering using a Gaussian splatting technique. The evolution in representation paradigms, such as voxel grids, point learning, implicit neural fields, transformer networks, diffusion networks, and the latest paradigm based on rendering-aware 3D Gaussian primitives, has been discussed in this study. A comprehensive analysis has been carried out on the contributions made in the last ten years, and a taxonomy has been developed to provide a clear idea about the contributions made in the field. The study has also discussed the research contributions made in the field, along with the challenges that still need to be addressed. Finally, the study has presented a research agenda that will provide a clear idea about the directions that can be followed in the development of the next-generation system.

[42] Tri-Efficient Transfer Learning for Point Cloud Videos cs.CVPDF

Yiding Sun, Dongxu Zhang, Jihua Zhu, Haozhe Cheng, Zhengqiao Li

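TL;DR: The paper proposes PoinTriE, a tri-efficient (data, parameter, memory) transfer learning framework for point cloud video understanding. For pre-training, it synthesizes pseudo-motion trajectories, pairs them with text corpora and 2D projections, and trains a Geometric-Motion Duality Network with multimodal contrastive learning. For fine-tuning, it freezes the pretrained backbone and updates only a lightweight spatio-temporal side network built from LoRA units, combined with a gradient flow masking strategy that lowers memory and parameter overhead.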

Details

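Motivation: Existing parameter-efficient fine-tuning methods for point cloud video understanding face two key limitations: prohibitive annotation costs for large-scale point cloud datasets and severe GPU memory bottlenecks. The paper aims to mine richer supervision signals from existing data rather than blindly scaling datasets, and to drastically reduce the memory footprint of fine-tuning.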

Result: Extensive experiments confirm that PoinTriE sets new state-of-the-art results on action recognition and semantic segmentation tasks.

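Insight: The contribution is a unified framework that optimizes data, parameter, and memory efficiency together. Specifically, it synthesizes pseudo-motion trajectories through rigid transformations for self-supervised pre-training, designs a Geometric-Motion Duality Network for multimodal learning, and combines LoRA with gradient flow masking during fine-tuning for efficient, low-overhead transfer learning.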

Abstract: While point cloud foundation models have significantly advanced point cloud video understanding, existing parameter-efficient fine-tuning (PEFT) methods still suffer from two critical limitations: prohibitive annotation costs for large-scale point cloud datasets and severe memory bottlenecks. In this paper, we aim to mine richer supervision signals from existing data rather than blindly scaling datasets. A further key principle is that the memory footprint of fine-tuning must be drastically reduced compared to full fine-tuning, which remains elusive for current PEFT techniques. Driven by these challenges, we identify three core desiderata: data-, parameter-, and memory efficiency, and present PoinTriE, a unified framework that excels along all three dimensions. For pre-training, pseudo-motion trajectories are synthesized via rigid transformations, paired with text corpora and 2D projections derived from raw point clouds. We then propose a Geometric-Motion Duality Network optimized via multimodal contrastive learning, rigid rotation prediction, and motion distribution divergence to produce dense self-supervision. During fine-tuning, we freeze the pretrained backbone and only update a lightweight Spatio-temporal Side Network built with LoRA units. Equipped with a gradient flow masking strategy, PoinTriE simultaneously reduces memory consumption and parameter overhead. Extensive experiments confirm that PoinTriE establishes new state-of-the-art results on action recognition and semantic segmentation tasks.
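For readers unfamiliar with the LoRA units mentioned above, the following is a minimal sketch of a generic LoRA-augmented linear layer: a frozen weight plus a trainable low-rank update. It illustrates standard LoRA only, not PoinTriE's Spatio-temporal Side Network or its gradient flow masking. The class name, the pure-Python matrices, and the constant initialization of the down-projection (normally random) are assumptions chosen to keep the example deterministic.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer y = W x + (alpha / r) * B (A x) with W kept frozen."""

    def __init__(self, frozen_weight, rank, alpha=1.0):
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.weight = frozen_weight                        # d_out x d_in, never updated
        self.scale = alpha / rank
        # A (rank x d_in): usually random; constant here for reproducibility.
        self.down = [[0.01] * d_in for _ in range(rank)]
        # B (d_out x rank): zero init, so the layer starts identical to W.
        self.up = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Only A and B would receive gradients during fine-tuning."""
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


layer = LoRALinear([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]], rank=1)
x = [1.0, 2.0, 3.0]
initial_output = layer.forward(x)      # equals the frozen layer's output
layer.up[0][0] = 1.0                   # stand-in for one training update to B
adapted_output = layer.forward(x)
print(initial_output, adapted_output, layer.trainable_parameters())
```

With rank 1 the adapter adds 5 trainable values next to 6 frozen ones in this toy layer; at realistic sizes the ratio is far smaller, which is where the parameter efficiency comes from.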

[43] MorVess: Morphology-Aware Pulmonary Vessel Segmentation Network cs.CVPDF

Fuyou Mao, Yifei Chen, Beining Wu, Lixin Lin, Jinnan Dai

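TL;DR: The paper proposes MorVess, a morphology-aware pulmonary vessel segmentation network. It targets the sparse, tortuous, multi-scale nature of vascular structures, which makes small branches easy to lose and topological integrity hard to preserve. The framework jointly predicts vessel masks, distance maps, and thickness maps, and uses a lightweight 2.5D adapter to bridge 3D spatial context and 2D SAM representations for fine-grained vascular parsing.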

Details

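Motivation: Existing deep segmentation models mainly optimize binary masks and lack explicit geometric constraints, so they struggle to recover continuous tubular morphology and fine vascular connectivity. The paper therefore combines differentiable geometric priors with large-scale foundation model adaptation to improve the accuracy and topological integrity of vessel segmentation.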

Result: On two challenging pulmonary CT benchmarks, MorVess achieves superior Dice, clDice, and HD95 scores, substantially improving small-vessel recovery and global connectivity.

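Insight: The contribution is to embed geometric intelligence into pretrained vision models: jointly predicted geometric maps (distance and thickness) provide explicit supervision, and a lightweight adapter fuses 3D and 2D representations. This offers a principled and scalable path toward precise vessel analysis and clinically reliable structural quantification.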

Abstract: Accurate pulmonary vessel segmentation remains challenging due to the sparse, tortuous, and multi-scale nature of vascular structures, where small branches are easily lost and topology integrity is difficult to preserve under voxel-wise supervision. Existing deep segmentation models primarily optimize binary masks, lacking explicit geometric constraints, thus struggling to recover continuous tubular morphology and fine vascular connectivity. In this study, we introduce MorVess, a morphology-aware segmentation framework that integrates differentiable geometric priors with large-scale foundation model adaptation to achieve fine-grained vascular parsing. MorVess jointly predicts vessel masks, distance maps, and thickness maps, providing explicit supervision for vascular boundaries, centerline consistency, and smooth diameter transitions. A lightweight 2.5D adapter bridges 3D spatial context and 2D SAM representations, while a global-local fusion block aggregates multi-level semantics and geometric cues for high-fidelity topology reconstruction. Across two challenging pulmonary CT benchmarks, MorVess delivers superior Dice, clDice, and HD95 scores, substantially improving small-vessel recovery and global connectivity. These results demonstrate that embedding geometric intelligence into pretrained vision models offers a principled and scalable pathway toward precise vessel analysis and clinically reliable structural quantification. Our source code is available at https://github.com/MaoFuyou/MorVess.

[44] Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling cs.CVPDF

Kun Zhang, Chenxin Fang, Tao Chen, Baiyang Song, Yunhang Shen

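TL;DR: The paper proposes AdaQ, an adaptive, training-free video frame selection method that addresses the excessive computation and memory cost multimodal large language models incur on long videos. It casts frame selection as a quasi-Gaussian sampling problem and determines the optimal 3-σ sampling interval for each query (local or global), improving understanding performance while remaining efficient.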

Details

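Motivation: Long video understanding is too costly in computation and memory for multimodal large language models, and existing keyframe selection methods suffer from low flexibility and high noise because of their hard sampling principle. A more robust and adaptive solution is needed.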

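Result: Extensive experiments on four MLLMs with three embedding models show that AdaQ clearly outperforms the default MLLMs and SOTA keyframe selection methods. For example, with only 64 frames it helps Qwen3-VL-8B outperform GPT4o by 15.8% on average. The experiments also confirm its robustness and efficiency for long-video understanding, with only 1 hyper-parameter to set.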

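Insight: The core contribution is to recast video frame selection as quasi-Gaussian sampling and to use the 3-σ rule of the Gaussian distribution to pick the sampling interval adaptively for each kind of query (local or global). The method is lightweight: training-free, with very few hyper-parameters, and able to trade computational cost against understanding performance flexibly.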

Abstract: Long video understanding remains a daunting challenge for Multimodal Large Language Models (MLLMs) due to the excessive computation and memory footprint. Thus, keyframe selection is often adopted to mitigate this shortcoming, which however still suffers from low flexibility and high noise due to its hard sampling principle. In this paper, we define video frame selection as a problem of Quasi-Gaussian Sampling, and propose an adaptive and training-free approach termed AdaQ. Inspired by the 3-σ rule of Gaussian distribution, the objective of AdaQ is to achieve the optimal 3-σ interval for different examples, i.e., a smaller 3-σ interval for the local query and a larger one for the global query, thereby facilitating robust and adaptive frame sampling. To validate AdaQ, we apply it to four MLLMs with three embedding models. The extensive experimental results not only show its obvious performance gains over the default MLLMs and the SOTA keyframe selection methods, e.g., helping Qwen3-VL-8B outperform GPT4o by 15.8% on average by using only 64 frames, but also confirm its superior robustness and high efficiency for long-video understanding, e.g., only 1 hyper-parameter needs to be set. Our code project is given at https://github.com/Zkayovo-xmu/AdaQ.
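The 3-σ intuition can be illustrated with a small sketch. This is not AdaQ's algorithm, whose details the abstract does not give. As an assumption, per-frame query relevance scores are treated as weights over frame indices, their weighted mean and standard deviation define a 3-σ interval, and frames are then picked evenly inside it. The function names and the synthetic relevance curves are hypothetical.

```python
import math


def three_sigma_interval(relevance):
    """Frame-index interval [mean - 3*sigma, mean + 3*sigma], clipped to the video."""
    total = sum(relevance)
    mean = sum(i * w for i, w in enumerate(relevance)) / total
    variance = sum(w * (i - mean) ** 2 for i, w in enumerate(relevance)) / total
    sigma = math.sqrt(variance)
    low = max(0, math.floor(mean - 3 * sigma))
    high = min(len(relevance) - 1, math.ceil(mean + 3 * sigma))
    return low, high


def sample_frames(relevance, budget):
    """Pick up to `budget` evenly spaced frame indices inside the 3-sigma interval."""
    low, high = three_sigma_interval(relevance)
    span = high - low
    if budget == 1 or span == 0:
        return [round((low + high) / 2)]
    picks = {low + round(k * span / (budget - 1)) for k in range(budget)}
    return sorted(picks)


num_frames = 100
# Hypothetical "local" query: relevance peaks sharply around frame 40.
local_relevance = [math.exp(-0.5 * ((i - 40) / 2.0) ** 2) for i in range(num_frames)]
# Hypothetical "global" query: every frame is equally relevant.
global_relevance = [1.0] * num_frames

local_interval = three_sigma_interval(local_relevance)
global_interval = three_sigma_interval(global_relevance)
local_frames = sample_frames(local_relevance, budget=8)
print(local_interval, global_interval, local_frames)
```

The peaked relevance curve yields a narrow interval around frame 40, while the flat curve yields an interval covering the whole video, matching the abstract's description of a smaller interval for local queries and a larger one for global queries.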

[45] Geometry-Instructed Video Editing cs.CVPDF

Chirui Chang, Xiaoyang Lyu, Yi-Hua Huang, Haoru Tan, Shizhen Zhao

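TL;DR: The paper proposes GIVE (Geometry-Instructed Video Editing), a framework that addresses the unreliability of object-level geometric edits (translation, rotation, scaling, and so on) in generative video editing. It uses a unified object-state representation in which two geometry streams, a depth-box and an orientation-box, specify the target object's 3D state before and after the edit, and it learns from paired data rendered by a graphics engine. The resulting edits are temporally coherent and keep geometry-dependent secondary effects (such as shadows and reflections) consistent.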

Details

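Motivation: Generative video editing is currently unreliable for object-level geometric edits such as moving or rotating an object. The main challenge is specifying the target object's 3D state change unambiguously across viewpoint and time while keeping geometry-dependent secondary effects (such as shadows and reflections) consistent.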

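Result: Experiments show that GIVE produces faithful geometric edits with temporal coherence and consistent secondary effects across operators in a unified framework, and that it transfers promisingly to in-the-wild videos.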

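Insight: The contributions are a compact pair of geometry streams (depth-box and orientation-box) that represent object-state changes in a unified way, and a scalable graphics-engine pipeline that generates precise paired training data. Together they let the model learn the mapping from geometric instructions to video edits, supporting geometric accuracy and physical plausibility.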

Abstract: Object-level geometric edits, including translating, rotating, scaling, duplicating, or removing an object, are routine operations in digital content creation (DCC) workflows, yet they remain unreliable in generative video editing. The key challenge lies in specifying the target object’s 3D state change unambiguously across viewpoint and time, while consistently updating geometry-dependent secondary effects such as shadows and reflections. We introduce GIVE, a geometry-instructed video editing framework that represents edits through a unified object-state formulation. Two video-aligned geometry streams describe the target object before and after editing: a depth-box encoding coarse 3D placement and extent, and an orientation-box providing an appearance-agnostic orientation cue. Together, these streams provide a compact pre/post geometric specification for object-state transitions. To provide paired supervision for learning these edits, we build a scalable graphics-engine pipeline that executes object-level edit programs and renders controlled before/after pairs, isolating the intended geometric edit while keeping secondary effects consistent with the transformation. Experimental results demonstrate that GIVE produces faithful geometric edits with temporal coherence and consistent secondary effects across operators in a unified framework, and shows promising transfer to in-the-wild videos. Project page: https://geometry-instructed-video-editing.github.io/give/

[46] FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image cs.CV | cs.GRPDF

Kim Youwang, Zhengyu Yang, Liuhao Ge, Yu Rong, Timur Bagautdinov

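TL;DR: FiCA is a feed-forward, instant Gaussian Codec Avatar pipeline that generates a photorealistic, drivable avatar from a single portrait photo. It combines human-centric vision foundation models with a diffusion model to infer complete 3D head appearance and geometry from limited visual information, improves fidelity with a feed-forward mesh refinement network, and decodes the result into a 3D Gaussian avatar whose expressions can be driven in real time.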

Details

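Motivation: Generating a photorealistic, drivable avatar from a single image is hard because one image provides too little visual information to infer the 3D appearance and geometry of a head accurately.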

Result: Experiments show that the generated avatars faithfully represent diverse identities and surpass recent competing methods in visual quality.

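Insight: The contributions are: combining vision foundation models with a diffusion model to generate a complete 3D mesh from partial observations; a feed-forward mesh refinement network that removes the need for person-specific test-time optimization; and a universal prior model that decodes the mesh into a 3D Gaussian representation drivable in real time.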

Abstract: We introduce FiCA, a Feed-forward, instant Gaussian Codec Avatar generation pipeline that creates lifelike avatars from a single portrait image. Generating a photorealistic and drivable avatar from just a single image is significantly challenging due to the limited visual information available to accurately infer the 3D appearance and geometry of human heads. To address this, we develop a novel system that combines human-centric vision foundation models with a diffusion model. This system is designed to fully exploit partial visual observations to generate lifelike human avatars. Our proposed diffusion model learns a generative mapping from these partial observations to complete and authentic 3D mesh reconstruction. Additionally, we introduce a feed-forward mesh refinement network that enhances the fidelity and identity preservation of the generated avatars, eliminating the need for person-specific test-time optimization. By leveraging a universal prior model that decodes a generated mesh into a set of 3D Gaussians, we generate a photorealistic 3D Gaussian avatar, capable of being driven with novel expressions in real-time. Our experiments demonstrate that the avatars generated by our feed-forward approach faithfully represent diverse identities and surpass the visual quality of avatars produced by recent competing methods.

[47] Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction cs.CVPDF

Zengjie Chen, Yuxiang Cai, Jingcai Guo, Taotao Cai, Jianwei Yin

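TL;DR: The paper proposes Prior-Corrected Token Reduction (PriorTR), a training-free visual token reduction method for accelerating multimodal large language models. It introduces a null token to separate the model-induced prior attention from the task-conditioned posterior attention, which gives a more accurate estimate of each visual token's information contribution and avoids discarding important tokens because the prior dominates.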

Details

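Motivation: Existing visual token reduction methods that rank tokens by attention score are risky: attention in MLLMs is often dominated by a task-agnostic, model-induced prior, which suppresses the attention scores of task-relevant tokens and causes them to be wrongly discarded during pruning.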

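Result: Extensive experiments across multiple multimodal benchmarks and MLLMs show that PriorTR consistently improves the accuracy-efficiency trade-off over strong training-free baselines, particularly under aggressive token budgets.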

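Insight: The core contribution is a null-token probe that, within a single forward pass, explicitly separates and contrasts the attention distributions of the model-induced prior and the task-conditioned posterior, identifying informative visual tokens to keep more robustly and without training.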

Abstract: Visual token reduction has emerged as an effective strategy for accelerating Multimodal Large Language Models (MLLMs). Many existing methods prune tokens by ranking text-visual attention scores. However, we show that attention is often dominated by a model-induced prior: even without textual instruction, MLLMs tend to focus on certain task-agnostic regions. Consequently, the attention scores of instruction-conditioned tokens are suppressed, increasing the risk that these tokens are discarded during pruning. To address this issue, we propose Prior-Corrected Token Reduction (PriorTR), a training-free token reduction method that explicitly separates task-conditioned attention from the model-induced prior. PriorTR estimates the attention map of the prior, and contrasts it with the task-conditioned attention distribution to measure the additional usable information contributed by each visual token. Importantly, PriorTR computes both the model-induced prior and the task-conditioned posterior within a single forward pass by introducing a null token that serves as an instruction-agnostic probe in the attention block. This design avoids duplicated propagation. Extensive experiments across multiple multimodal benchmarks and MLLMs demonstrate that PriorTR consistently improves the trade-off between accuracy and efficiency over strong training-free baselines, particularly under aggressive token budgets.
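The contrast between prior and task-conditioned attention can be sketched as follows. The abstract does not give PriorTR's scoring formula, so as an assumption this sketch scores each token by its pointwise contribution to the KL divergence between the task-conditioned attention and the prior attention, and compares the resulting selection with plain attention ranking. The function names and attention values are hypothetical.

```python
import math


def prior_corrected_scores(task_attention, prior_attention, eps=1e-8):
    """Per-token information gain of task attention over the prior.

    Each term is q_i * log(q_i / p_i), the pointwise contribution to
    KL(task || prior); tokens favored only by the prior score at or below zero.
    """
    return [q * math.log((q + eps) / (p + eps))
            for q, p in zip(task_attention, prior_attention)]


def top_k(scores, k):
    """Indices of the k highest-scoring tokens, in positional order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical attention over four visual tokens. Token 0 acts like an
# attention sink: it dominates even without any instruction.
prior_attention = [0.70, 0.10, 0.10, 0.10]   # as probed by a null token
task_attention = [0.55, 0.05, 0.30, 0.10]    # conditioned on the instruction

kept_by_raw_attention = top_k(task_attention, 2)
kept_after_correction = top_k(
    prior_corrected_scores(task_attention, prior_attention), 2)
print(kept_by_raw_attention, kept_after_correction)
```

Ranking by raw attention keeps the sink token 0, whereas the corrected score drops it in favor of token 2, whose attention rises well above the prior once the instruction is present.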

[48] Latent Visual States for Efficient Multimodal Reasoning cs.CVPDF

Xiuwei Chen, Wentao Hu, Yongxin Wang, Zisheng Chen, Likui Zhang

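TL;DR: This paper proposes EVA (LatEnt Visual StAtes), a framework that generates continuous latent visual representations (Latent_slot tokens) as intermediate visual thoughts during reasoning, overcoming the latency and dependency problems that arise when large models rely on external tools. The method is trained end to end; the authors develop the D-GSPO optimization strategy to address policy deviation when latent and discrete components are optimized together, and construct the high-quality EVA-230K dataset for supervised fine-tuning. Experiments show that EVA significantly improves both performance and inference efficiency on multiple benchmarks.

Details

Motivation: Current large models integrate visual evidence mainly by generating discrete outputs (such as code or coordinates) to invoke external tools, which introduces rigid dependencies and high latency and limits the efficiency and flexibility of multimodal reasoning.

Result: Extensive experiments on multiple benchmarks confirm that EVA achieves significant performance gains while improving inference efficiency; the abstract does not state whether it reaches SOTA or matches any specific model.

Insight: The novelty lies in using continuous latent visual representations as internal intermediate states in place of discrete tool calls, and in optimizing training with the D-GSPO strategy. Viewed objectively, encoding visual information as an adaptive sequence of tokens may give multimodal models a more efficient and flexible reasoning mechanism.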

Abstract: The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke external tools, a process that introduces rigid dependencies and substantial latency. To overcome these limitations, we propose EVA (LatEnt Visual StAtes), a novel framework that natively generates continuous latent visual representations. These internal representations manifest as an adaptive sequence of Latent_slot tokens, serving as intermediate visual thoughts during the reasoning process. These Latent_slot tokens are then trained end-to-end with the discrete text tokens. This co-optimization, notably, causes extreme policy deviation in the ‘transition window’ following the Latent_slot tokens. We develop D-GSPO (Decouple-GSPO) to target this root cause by decoupling the optimization of latent and discrete components. To support SFT, we construct EVA-230K, a high-quality text-image interleaved CoT dataset encompassing a diverse range of real-world scenes, documents, charts and OCR tasks. Extensive experiments across multiple benchmarks confirm that EVA achieves significant performance gains while enhancing inference efficiency.

[49] TuringViT: Making SOTA Vision Transformers Accessible to All cs.CVPDF

Qiman Wu, Hanlin Chen, Lyujie Chen, Rui Xin, Jianlei Zheng

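TL;DR: TuringViT introduces Turing Linear Attention (TLA), the VISTA-Curation data construction method, and native dynamic-resolution pretraining to lower the barrier to training high-performance Vision Transformers (ViTs). With only 10% of the data it outperforms existing open-source ViT baselines, and it delivers better downstream vision-language model performance and latency scaling.

Details

Motivation: The work addresses the difficulty the community faces in training customized SOTA-level Vision Transformers, which stems from the need for massive image-text data and the high cost of standard softmax attention, especially in high-resolution or dynamic-resolution pretraining.

Result: Using only 10% of the data, TuringViT outperforms leading open-source ViT baselines, performs better on downstream VLM tasks, and shows substantially better latency scaling on high-resolution inputs. A scaling-law analysis shows that its performance improves predictably with data scale and is far from saturation.

Insight: The innovations include efficient sequence modeling with TLA, the supervision-rich VISTA-Curation data construction method, and native dynamic-resolution pretraining that supports flexible inputs. Together they provide a reproducible pipeline that greatly lowers the cost of training, customizing, and deploying SOTA-level ViTs.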

Abstract: Modern VLMs and VLA systems commonly adopt off-the-shelf ViTs such as SigLIP2 as visual encoders, but diverse downstream requirements in latency, temporal modeling, and VLM integration often call for customized SOTA-level ViTs. Training such encoders remains beyond the reach of much of the community, as it requires massive image-text data, while standard softmax attention makes high-resolution or dynamic-resolution pretraining prohibitively costly and often forces low-resolution pretraining followed by post-hoc adaptation. TuringViT addresses these challenges with three key designs: Turing Linear Attention (TLA) for efficient sequence modeling, VISTA-Curation to construct supervision-rich image-video training data, and native dynamic-resolution pretraining that supports flexible inputs from the start and transfers seamlessly to downstream VLMs. As a result, TuringViT outperforms leading open-source ViT baselines with only 10% of the data, achieves stronger downstream VLM performance, and delivers substantially better latency scaling on high-resolution inputs. Our scaling-law analysis further shows that TuringViT continues to improve predictably with curated data scale, far from saturation. Its fast adaptation, hardware-friendly design, and efficient deployment have made it a unified visual foundation across XPeng’s AI systems. More broadly, TuringViT provides a reproducible pipeline that dramatically lowers the cost for the community to train, customize, and deploy SOTA-level ViTs, moving toward making such Vision Transformers accessible to all.

Zhongju Wang, Beier Wang, Yatao Bian, Pichao Wang, Zhi Wang

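TL;DR: This paper proposes a new framework for generating 3D human-human interaction (HHI) from text, emphasizing the importance of modeling social structure (such as phase progression, role assignment, and coordination). The framework follows a planner-executor paradigm: a large language model (LLM) acts as the planner that infers the social structure of the interaction, and an adapted single-person motion model acts as the executor that generates physically plausible, coordinated two-person motion.

Details

Motivation: Existing text-to-motion methods have made significant progress on single-person motion synthesis, but extending them to text-driven 3D human-human interaction remains challenging, because HHI requires modeling the underlying social structure that governs interaction phases, participant roles, and mutual coordination.

Result: The proposed Solo-to-Social framework generates 3D HHI with better phase consistency, role alignment, and partner-aware coordination; the abstract does not report specific quantitative benchmark results or SOTA comparisons.

Insight: The novelty lies in explicitly modeling HHI generation as two steps, social structure reasoning and motion realization, under the paradigm "Think with LLM, Move with Motion Skill". Concretely, an LLM performs phase decomposition and role assignment, and a pretrained single-person motion model is adapted with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning, which grounds the social structure in coordinated motion.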

Abstract: Although text-to-motion generation has achieved strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human-human interaction (HHI) remains non-trivial, as HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can think by recovering phase decompositions and partner-aware roles, but cannot directly move, as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner-executor paradigm, Think with LLM, Move with Motion Skill. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.

[51] UniRED: Unified RGB-D Video Frame Interpolation with Event Guidance cs.CVPDF

Yinuo Zhang, Guangshun Wei, Yuanfeng Zhou, Yiran Shen

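TL;DR: This paper proposes UniRED, a unified multimodal framework for RGB-D video frame interpolation that jointly exploits RGB appearance, depth geometry, and event-based temporal cues. The method first extracts and fuses RGB, depth, and event cues, then estimates bidirectional flow with motion basis refinement for RGB and Z-axial refinement for depth, and finally synthesizes the target RGB-D frame through bidirectional warping and soft blending. The authors also construct a new RGB-D-Event dataset to alleviate the scarcity of tri-modal training data.

Details

Motivation: High frame-rate RGB-D video is crucial for many downstream tasks, but practical RGB-D cameras are usually limited to low frame rates and struggle to capture rapid scene dynamics. Existing video interpolation methods perform well on RGB data, but when applied directly to RGB-D scenarios they often produce blurry boundaries, visible artifacts, and degraded geometric consistency. In addition, motion estimation from only two boundary frames is inherently under-constrained in complex dynamic scenes.

Result: Extensive experiments on a public benchmark and the proposed dataset show that, compared with existing methods, the approach achieves better photometric fidelity for RGB interpolation and stronger geometric accuracy for depth interpolation.

Insight: The novelty lies in a unified multimodal framework that, for the first time, uses the high-temporal-resolution asynchronous measurements of event cameras as dense motion cues jointly with RGB and depth information, addressing motion blur and geometric degradation in RGB-D video interpolation. The authors also construct a scarce RGB-D-Event tri-modal dataset as a resource for related research.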

Abstract: High frame-rate RGB-D videos are crucial for a variety of downstream tasks, including motion analysis, dynamic scene understanding, and 3D reconstruction. However, due to hardware and sensing constraints, practical RGB-D cameras are typically limited to low frame rates, making it difficult to capture rapid scene dynamics. Existing video interpolation methods have achieved strong performance on RGB data, but they are not readily applicable to RGB-D scenarios, where they often yield blurry boundaries, visible artifacts, and degraded geometric consistency. Furthermore, motion estimation from only two boundary frames is inherently under-constrained in complex dynamic scenes. Event cameras, by contrast, provide asynchronous measurements with ultra-high temporal resolution, offering dense motion cues. In this paper, we propose a unified multimodal framework for RGB-D video interpolation that jointly exploits RGB appearance, depth geometry, and event-based temporal cues. Specifically, it first extracts and fuses RGB, depth and event cues, then estimates bidirectional flow with motion basis refinement for RGB and Z-axial refinement for depth, and finally synthesizes the target RGB-D frame via bidirectional warping and soft blending. In addition, we construct a new RGB-D-Event dataset to alleviate the scarcity of tri-modal training data. Extensive experiments on a public benchmark and the proposed dataset demonstrate that our method achieves superior photometric fidelity for RGB interpolation and stronger geometric accuracy for depth interpolation than existing approaches.

[52] ActiveScope: Actively Seeking and Correcting Perception for MLLMs cs.CVPDF

Yajing Wang, Chao Bi, Junshu Sun, Shufan Shen, Zhaobo Qi

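TL;DR: This paper proposes ActiveScope, a training-free framework that addresses the weak fine-grained perception of multimodal large language models (MLLMs) on high-resolution images. Through active search and self-correction, and using two modules, semantic anchor localization and interference-suppressed refinement, the framework mitigates semantic bias and contextual dominance and thereby improves target localization accuracy in complex scenes.

Details

Motivation: Existing training-free methods (such as attention-based localization or coarse-to-fine search) are easily misled by distractors in high-resolution images and struggle to locate multiple targets. The paper attributes these problems to Contextual Dominance and Semantic Bias and aims to resolve them through active perception and correction.

Result: Extensive experiments on high-resolution image understanding benchmarks (such as V* Bench) show that ActiveScope outperforms existing training-free methods, reaching for example 96.34% accuracy on V* Bench, which supports the advantage of its active search and self-correction paradigm.

Insight: The novelty is a training-free framework in which the Semantic Anchor Localization (SAL) module localizes key targets independently to mitigate semantic bias, and the Interference-Suppressed Refinement (ISR) module suppresses attention on salient distractors to overcome contextual dominance. Viewed objectively, this paradigm of active perception and self-correction offers a novel and effective way to improve the fine-grained visual understanding of MLLMs.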

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive vision-language understanding, yet still struggle with fine-grained perception in high-resolution images. While existing training-free methods typically rely on attention-based localization or coarse-to-fine search, they are often misled by distractors and fail to locate multiple targets. Our investigation attributes these failures to Contextual Dominance, where salient distractors overwhelm target attention and cause inaccurate localization, and Semantic Bias, where global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization in multi-object scenarios. Built on these insights, we propose ActiveScope, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The Semantic Anchor Localization (SAL) utilizes fine-grained semantic anchors to independently localize key targets, thereby mitigating semantic bias. The Interference-Suppressed Refinement (ISR) refines localization by suppressing attention on salient distractions to overcome contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods (e.g., 96.34 percent accuracy on V* Bench), validating the superiority of the active search and self-correction paradigm. Our code is available at https://github.com/jasmine-ww/ActiveScope.

[53] Trimming the Long-Tail of Visual World Modeling Evaluation cs.CVPDF

Bingxuan Li, Yining Hong, Cheng Qian, Hyeonjeong Ha, Jiateng Liu

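TL;DR: Current visual world models (image and video generation models) are evaluated mainly on common physical interactions, while rare, long-tail interactions are neglected. To address this, the paper proposes the Tailor-Bench benchmark, which designs three progressively harder scenario modes (Regular, Unconventional, Impossible) and two generation settings (predictive, descriptive) to systematically evaluate how well models simulate and generalize to irregular physical interactions. Experiments show that performance drops markedly from Regular to Impossible scenarios, revealing a clear long-tail gap in physical world modeling and a reliance on superficial visual patterns rather than underlying physical principles.

Details

Motivation: Existing visual world models achieve high realism on benchmarks but mainly simulate common physical interactions, which raises a central question: do current models truly internalize and generalize physical principles? To answer it, the authors evaluate models on long-tail, rare, and irregular physical interactions.

Result: On the proposed Tailor-Bench, performance degrades step by step from Regular to Unconventional to Impossible scenarios, indicating limited generalization beyond common interactions. Specifically, image models fail to realize correct state changes, and video models additionally suffer from temporal inconsistencies.

Insight: The novelty lies in a systematic evaluation benchmark (Tailor-Bench) focused on long-tail, irregular physical interactions, whose progressive scenario modes and complementary generation settings probe more deeply how far models have internalized physical principles. Viewed objectively, the work highlights the limitations of current evaluation of visual generation models and provides a diagnostic tool and direction for pushing models toward understanding a broader physical world.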

Abstract: Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.

[54] Training-free Cross-domain Few-shot Segmentation via Robust Semantic Representation and Matching cs.CVPDF

Sujun Sun, Mingwu Ren, Haofeng Zhang

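TL;DR: This paper proposes a training-free framework for cross-domain few-shot segmentation based on robust semantic representation and matching. Built on the self-supervised vision encoder DINOv3, it has three core modules: a Semantic-aware Feature Re-fusion module that enhances semantic discriminability, an Adaptive Support Enhancement module that narrows the semantic gap between support and query, and a Hybrid Prototype Matching module that integrates matching results from different prototypes to adapt to cross-domain semantic complexity.

Details

Motivation: Existing cross-domain few-shot segmentation methods rely on training or fine-tuning, which is computationally expensive and prone to overfitting; even when powerful vision foundation models are introduced, they bring only marginal improvement or even degrade performance. The paper aims to eliminate trainable parameters and proposes a training-free framework that avoids both training overhead and overfitting.

Result: Extensive experiments on four target-domain datasets show that the method achieves state-of-the-art performance on cross-domain few-shot segmentation without any training.

Insight: The novelty lies in removing the training process entirely. The three modules (semantic-aware feature re-fusion, adaptive support enhancement, and hybrid prototype matching) make effective use of the general-purpose features of a self-supervised vision encoder, improving the robustness of semantic representation and the adaptability of matching in cross-domain settings while avoiding overfitting and reducing computational cost.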

Abstract: Cross-domain Few-shot Segmentation (CD-FSS) aims to transfer knowledge learned from a source domain to distinct target domains, segmenting unseen target classes with only a few annotated samples. Although existing methods have made significant progress, they still rely on training or fine-tuning processes, which incur high computational costs and risk overfitting. We observe that when powerful and general-purpose vision foundation models are incorporated into these methods, their performance shows only marginal improvement or even degrades due to overfitting. To address this, we eliminate trainable parameters and propose a training-free framework to avoid both training overhead and overfitting. Built upon the self-supervised vision encoder DINOv3, our framework addresses cross-domain challenges through three core modules. First, the Semantic-aware Feature Re-fusion (SAFR) module identifies and re-fuses features that emphasize semantic patterns, generating representations with enhanced semantic discriminability. Additionally, the Adaptive Support Enhancement (ASE) module narrows semantic gaps between support and query through robust query information aggregation. Finally, the Hybrid Prototype Matching (HPM) module integrates matching results from diverse prototypes to adapt to varying semantic complexity across domains. Extensive experiments on four target domain datasets demonstrate that our method achieves state-of-the-art performance in CD-FSS without any training.
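
For readers unfamiliar with training-free prototype matching, the generic idea can be sketched as follows. This is the textbook baseline (masked average pooling plus cosine similarity), not the paper's SAFR, ASE, or HPM modules; the function names and the toy 2-D features are hypothetical, and the sketch assumes the support mask contains both foreground and background positions.

```python
import math

def masked_mean(features, mask, target):
    """Average the feature vectors whose mask value equals `target`."""
    picked = [f for f, m in zip(features, mask) if m == target]
    dim = len(features[0])
    return [sum(f[d] for f in picked) / len(picked) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def segment_query(support_feats, support_mask, query_feats):
    """Label each query position 1 (foreground) or 0 (background)."""
    fg = masked_mean(support_feats, support_mask, 1)  # foreground prototype
    bg = masked_mean(support_feats, support_mask, 0)  # background prototype
    return [1 if cosine(q, fg) > cosine(q, bg) else 0 for q in query_feats]

support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = segment_query(support, [1, 1, 0, 0], [[0.8, 0.2], [0.2, 0.8]])
print(labels)
```

No parameters are learned here: the prototypes come directly from frozen encoder features, which is what makes this family of methods training-free.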

Hongli Xiao, Youjian Zhang, Yucai Bai, Chaoyue Wang, Yaohui Jin

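TL;DR: This paper proposes MM-TRELLIS, a multimodal method for 3D vehicle generation in autonomous driving scenes. It takes LiDAR point clouds and multi-view images as input, uses the point cloud as test-time guidance to ensure geometric accuracy and cross-view consistency, and introduces a voxel filtering strategy based on 3D Gaussian Splatting opacity to produce high-quality meshes.

Details

Motivation: Existing vehicle generation methods fail to fully exploit multimodal sensors (multi-view images and LiDAR point clouds) and rely on neural-rendering-based reconstruction, which leads to low-quality meshes. Meanwhile, existing native 3D generative models are not suited to arbitrary multi-view inputs and struggle with images from real driving scenes.

Result: Comprehensive experiments on the Waymo dataset show that the method outperforms existing methods in high-fidelity 3D vehicle generation.

Insight: The novelty lies in using the LiDAR point cloud as test-time guidance, aligned with the model priors, to improve geometric accuracy, and in a voxel filtering strategy based on 3D Gaussian Splatting opacity that suppresses floaters and produces clean meshes, effectively integrating multimodal sensor information.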

Abstract: Recovering realistic 3D vehicle models from autonomous driving scenes is crucial for synthesizing training data and building simulation environments. However, most existing vehicle generation methods fail to fully exploit multimodal sensors (i.e., multi-view images and LiDAR point clouds) and rely on neural rendering based reconstruction, leading to low-quality meshes. Recently, native 3D generative models have made significant progress, yet they are not built for arbitrary multi-view inputs and often struggle with in-the-wild driving images. In this work, we present MM-TRELLIS, a multi-modal version of TRELLIS for in-the-wild 3D vehicle generation that integrates LiDAR and image sensors from autonomous driving datasets into native 3D generative models. Specifically, multi-view images are cycled as conditioning inputs, while LiDAR point clouds provide test-time guidance to ensure geometric accuracy and cross-view consistency. During denoising, we first align the guidance point cloud with the model priors, then enforce consistency between the generated geometry and the guidance point cloud. Finally, we introduce a voxel filtering strategy based on the opacity of 3D Gaussian Splatting to suppress floaters and produce clean meshes. Comprehensive experiments on the Waymo dataset demonstrate that our method outperforms existing methods in high-fidelity 3D vehicle generation. Code is available at https://github.com/HongliXiao/MM-TRELLIS.
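
The opacity-based voxel filtering step can be pictured with a small sketch. The abstract does not specify how opacities are aggregated per voxel or what threshold is used, so the mean-opacity rule, the `filter_voxels` name, and the parameter values below are all assumptions made for illustration.

```python
from collections import defaultdict

def filter_voxels(gaussians, voxel_size, min_opacity):
    """Keep voxels whose Gaussians have a mean opacity of at least `min_opacity`.

    gaussians: iterable of ((x, y, z), opacity) pairs.
    Returns the set of integer voxel indices that survive the filter.
    """
    buckets = defaultdict(list)
    for (x, y, z), opacity in gaussians:
        # Floor division maps a position to its voxel index.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append(opacity)
    # Assumed rule: low mean opacity marks a voxel as a likely floater.
    return {k for k, ops in buckets.items() if sum(ops) / len(ops) >= min_opacity}

gaussians = [
    ((0.1, 0.1, 0.1), 0.9),   # solid surface
    ((0.2, 0.1, 0.1), 0.8),   # solid surface, same voxel
    ((1.5, 0.1, 0.1), 0.05),  # faint, isolated floater
]
print(filter_voxels(gaussians, voxel_size=1.0, min_opacity=0.5))
```

Only the surviving voxels would then be passed to mesh extraction, which is how low-opacity floaters are kept out of the final mesh in this simplified picture.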

[56] REDI-Match: Rotation-Equivariant Distillation for Efficient and Robust Dense Matching cs.CVPDF

Yinji Ge, Guixu Zheng, Wulong Guo, Qian Feng, Xu Wu

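TL;DR: This paper proposes REDI-Match, an efficient and robust dense matching framework built around the Rotation-Equivariant Distillation (REDI) paradigm. It distills the non-equivariant semantic representations of a Vision Foundation Model (VFM) into a lightweight, strictly rotation-equivariant encoder, and uses an entropy-driven spatial alignment module in the decoder to eliminate global rotation ambiguity, addressing the challenge of severe in-plane rotation.

Details

Motivation: Existing dense matching methods face a dilemma under severe in-plane rotation: data-driven methods need inefficient parameter scaling to learn rotations implicitly, while strictly equivariant networks lack the semantic capacity of modern vision foundation models. Current frameworks typically freeze the VFM and shift the entire burden of rotation generalization to the downstream decoder, which creates an architectural bottleneck.

Result: Extensive experiments show that REDI-Match sets a new state of the art (SOTA) across multiple benchmarks. On the highly challenging SatAst dataset in particular, it achieves a 13.89% absolute pose accuracy improvement while running 1.9x faster than the current SOTA method (RoMa v2), enabling real-time inference (about 41 FPS) on a single RTX 4090 GPU.

Insight: The core innovation is the Rotation-Equivariant Distillation (REDI) paradigm, which combines the semantic capacity of a VFM with the advantages of a strictly rotation-equivariant geometric architecture through distillation rather than data augmentation. In addition, the entropy-driven spatial alignment module explicitly locks onto the canonical coordinate system by evaluating discrete rotation hypotheses, a novel mechanism for eliminating global ambiguity.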

Abstract: Vision Foundation Models (VFMs) have significantly advanced dense feature matching, yet severe in-plane rotation remains a critical challenge. Existing solutions face a fundamental dilemma: data-driven methods require inefficient parameter scaling to implicitly learn rotations, whereas strictly equivariant networks lack the semantic capacity of modern VFMs. Consequently, current frameworks typically freeze VFMs and shift the entire burden of rotation generalization to the downstream decoder. To break this architectural bottleneck, we propose REDI-Match, an efficient framework driven by a novel Rotation-Equivariant Distillation (REDI) paradigm. Instead of relying on rotation data augmentation to establish rotational correspondences, REDI distills the non-equivariant semantic representations of a VFM into a lightweight, strictly rotation-equivariant encoder, leveraging an equivariant geometric architecture to constrain robust high-dimensional semantics. To fully exploit these features, we equip the decoder with an entropy-driven spatial alignment module. By evaluating discrete rotation hypotheses, this mechanism explicitly locks onto the canonical coordinate system, eliminating global ambiguity before continuous refinement. Extensive experiments demonstrate that REDI-Match establishes a new state-of-the-art (SOTA) across multiple benchmarks. Notably, it achieves a 13.89% absolute pose accuracy improvement on the highly challenging SatAst dataset while operating 1.9x faster than the current SOTA (RoMa v2), enabling real-time inference (~41 FPS) on a single RTX 4090 GPU. Code: https://github.com/YinjiGe/REDI-Match.
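
One plausible reading of "evaluating discrete rotation hypotheses" with an entropy criterion is sketched below: for each candidate rotation, turn the feature similarities into matching distributions and pick the rotation whose distributions are sharpest. The abstract does not describe the module's internals, so the mean-row-entropy criterion, the temperature, and the names `row_entropy` and `pick_rotation` are assumptions, and the similarity matrices are toy inputs.

```python
import math

def row_entropy(similarities, temperature=0.1):
    """Shannon entropy of the softmax matching distribution for one source cell."""
    m = max(similarities)
    exps = [math.exp((s - m) / temperature) for s in similarities]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pick_rotation(sim_by_rotation):
    """Return the rotation angle whose similarity matrix has the lowest mean entropy.

    sim_by_rotation: {angle_in_degrees: similarity matrix as a list of rows}.
    """
    def mean_entropy(matrix):
        return sum(row_entropy(row) for row in matrix) / len(matrix)
    return min(sim_by_rotation, key=lambda angle: mean_entropy(sim_by_rotation[angle]))

hypotheses = {
    0: [[0.5, 0.5], [0.5, 0.5]],    # ambiguous matches, high entropy
    90: [[0.9, 0.1], [0.1, 0.9]],   # confident matches, low entropy
    180: [[0.6, 0.4], [0.4, 0.6]],  # weakly peaked
}
print(pick_rotation(hypotheses))
```

The intuition is that under the correct global rotation each source cell has one clear counterpart, so its matching distribution is peaked; a coarse choice like this would then be followed by continuous refinement.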

Jiahao Lyu, Pei Fu, Zhenhang Li, Shaojie Zhang, Jiahui Yang

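TL;DR: This paper proposes UniTranslator, a unified multimodal framework for end-to-end In-Image Machine Translation, designed to resolve the semantic inconsistency and spatial misalignment that existing unified models exhibit between their understanding and generation stages. By introducing an understanding-generation alignment module and a spatial mask decoder, the framework achieves semantic consistency between translated content prediction and text rendering and improves the spatial grounding and layout control of text regions during generation.

Details

Motivation: In-Image Machine Translation requires translating the scene text in an image and rendering it back into the original regions while preserving visual appearance. Existing unified multimodal models combine visual-text understanding and image generation, but applying them directly to this task raises two challenges: understanding-generation conflicts and spatial position misalignment.

Result: Extensive experiments on multiple benchmarks show that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts.

Insight: The novelty lies in the understanding-generation alignment module, which bridges the representation gap, and the spatial mask decoder, which strengthens spatial grounding through pixel-level supervision. Viewed objectively, the work reveals a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified translation multimodal learning.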

Abstract: In-Image Machine Translation (IIMT) aims to translate scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models provide a promising solution by combining visual-text understanding and image generation within a single framework. However, directly adapting such models to IIMT remains challenging. In particular, they often suffer from understanding-generation conflicts, where the translation inferred during understanding is inconsistent with the text supervision used in generation, and spatial position misalignment, where the rendered text does not accurately match the target text regions. To address these issues, we present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing. Specifically, we introduce an Understand-Generation Alignment Module (UGAM) to bridge the representation gap between understanding and generation, encouraging semantic consistency between translated content prediction and text rendering. We further propose a Spatial Mask Decoder (SMD) with pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability during generation. Extensive experiments on multiple benchmarks demonstrate that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts. Moreover, our results reveal a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified translation multimodal learning. Code is available at https://github.com/SeerRay-Lab/Unitranslator.

[58] Ill-Posed by Design: Probing Evidence Use in VLMs cs.CVPDF

Boaz Meivar, Shaked Perek, Shani Shvartzman, Eli Schwartz, Shai Avidan

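TL;DR: This paper proposes monocular metric object-size estimation as an ill-posed diagnostic task for assessing how vision-language models (VLMs) use evidence. The authors build the Metric VQA dataset (10,813 dimension queries and 331 measured in-the-wild scenes) and run counterfactual analysis on 12 open-weight VLMs, decomposing six visual and language evidence channels. They find that even the largest VLMs trail a text-only frontier LLM on the in-the-wild split, and that models rely mainly on target identity cues while global scene geometry goes largely unused.

Details

Motivation: Counterfactual analysis has limited diagnostic value on well-posed tasks, because several cues may independently support the same answer, so removing a single cue may not change the prediction. The authors therefore introduce the ill-posed task of monocular metric object-size estimation to probe more effectively how VLMs select and rely on imperfect cues (such as category priors, target appearance, and local context).

Result: Twelve VLMs (3B to 397B parameters) are evaluated on Metric VQA; the largest models (such as Qwen3-VL-235B) still trail a text-only frontier LLM on the in-the-wild split. The counterfactual analysis shows that target identity is the most critical cue, target pixels and local context help only some models, apparent size shifts predictions without a consistent direction, and global scene geometry is essentially unused. LoRA fine-tuning experiments show that the task is learnable, but the models do not learn to exploit scene geometry.

Insight: The novelty lies in designing an ill-posed task (monocular metric size estimation) as a diagnostic tool that detects evidence-selection bias in VLMs more sensitively. Viewed objectively, VLMs over-rely on semantic priors (such as target category) in complex reasoning and neglect low-level visual cues such as geometry, which exposes a limitation of current multimodal models in understanding the physical world.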

Abstract: Counterfactual analysis is widely used to study evidence use in vision-language models, but its diagnostic value is limited on well-posed tasks: when several cues independently support the same answer, removing one may not change the prediction. We propose monocular metric object-size estimation as an ill-posed diagnostic setting for evidence selection: because physical size cannot be determined from a single uncalibrated image, models must rely on imperfect cues: category priors, target appearance, local context, apparent image size, and scene geometry. We assemble Metric VQA (10,813 dimension queries from Objectron and 331 tape-measured in-the-wild scenes) and evaluate 12 open-weight VLMs (3B to 397B parameters) with counterfactual analysis decomposing six visual and language evidence channels. Even the largest VLMs tested (Qwen3-VL-235B, Qwen3.5-397B, InternVL3.5-241B) trail a text-only frontier LLM on the in-the-wild split. The diagnostic analysis shows: target identity is the most load-bearing cue, target pixels and local context help only some models, apparent size shifts predictions without a directional readout, and global scene geometry is largely unused. We analyze LoRA fine-tuning as an actionable intervention specific to metric estimation: while the task is learnable, the models do not learn to leverage scene geometry.

[59] TIGER: Taming Identity, Geometry, and Generative Priors for High-Quality Face Video Restoration cs.CVPDF

Yang Zhou, Wenxue Li, Peng Zhang, Yifei Chen, Fei Wang

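TL;DR: TIGER is a structured tri-prior fusion framework for high-quality face video restoration that integrates identity, geometry, and generative priors to address the identity shift, viewpoint-entangled guidance, and perceptual realism challenges faced by existing methods. It uses a progressive three-stage training optimization strategy, and the authors construct a large-scale FVR dataset to support robust training and standardized evaluation.

Details

Motivation: Existing face video restoration methods struggle to address three key challenges at once: identity shift, viewpoint-entangled guidance, and perceptual realism. A new framework is therefore needed that fuses multiple priors effectively to maintain identity consistency and spatiotemporal consistency.

Result: Extensive experiments show that TIGER achieves state-of-the-art performance in identity fidelity and temporal stability, delivering high-quality, efficient, and identity-consistent restoration on the constructed large-scale FVR dataset.

Insight: The novelty lies in a structured tri-prior fusion framework that integrates identity, geometry, and generative information through identity prior injection, construction of a disentangled 3D geometry prior, and use of the generative prior via one-step rectified flow. The progressive three-stage training strategy and the purpose-built dataset also support robust optimization and evaluation.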

Abstract: Face Video Restoration (FVR) aims to recover high-fidelity facial videos from degraded input while preserving identity and semantic consistency across frames. Existing methods often struggle to simultaneously address three key challenges: identity shift, viewpoint-entangled guidance, and perceptual realism. To tackle these issues, we propose TIGER, a structured tri-prior fusion framework that Tames Identity, Geometry, and gEnerative pRiors for high-quality FVR. Specifically, an Identity Prior is first established by injecting subject-discriminative embeddings into the latent space, effectively anchoring the subject’s identity against severe degradations. Then, to provide temporally consistent structural guidance for dynamic videos, TIGER constructs a Geometry Prior by lifting 2D reference cues into a disentangled 3D parameter space, creating a geometric anchor through cross-source parameter fusion. Moreover, to achieve maximum efficiency without compromising realism, we harness the video generation model’s Generative Prior through a one-step rectified flow. We further design a progressive three-stage training optimization strategy that refines structural fidelity, textural reconstruction, and distribution-level realism to ensure robust optimization. We also construct a large-scale FVR dataset to facilitate robust training and standardized evaluation. Extensive experiments demonstrate that TIGER achieves state-of-the-art performance in both identity fidelity and temporal stability, delivering a high-quality, efficient and identity-consistent FVR. Project page: https://yzhoulv.github.io/Tiger/.

[60] Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints cs.CV | cs.LGPDF

Hojun Choi, Seulbin Hwang, Dae Jung Kim, Kisung Kim, Hyunjung Shim

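TL;DR: This paper introduces the task of open-vocabulary bird's-eye-view segmentation (OVBS) and a corresponding framework, OVBEVSeg, which uses vision-language models (VLMs) to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. The framework introduces 3D geometric constraints across three stages (2D-to-BEV pseudo-labeling, joint 2D-BEV per-scene optimization, and 3D geometric distillation) to resolve the geometric inconsistency that arises when lifting 2D VLM semantics into BEV.

Details

Motivation: State-of-the-art BEV perception methods are confined to closed-set scenarios and cope poorly with unpredictable, open real-world environments. An open-vocabulary BEV segmentation method is therefore needed that recognizes unknown categories while maintaining geometric consistency.

Result: On the nuScenes dataset, OVBEVSeg outperforms closed-set methods by 15.3 mIoU on unseen categories and achieves state-of-the-art performance. Even without ground-truth labels for novel classes, it remains competitive with self- and semi-supervised baselines trained with up to 40% of ground-truth annotations, while running 2.5x faster at inference with only 0.22x the memory consumption of projection-based methods.

Insight: The novelty lies in bringing open-vocabulary recognition to BEV segmentation and in enforcing geometric consistency from 2D semantics to 3D BEV through progressive 3D geometric constraints (such as Gaussian-splatting-based unprojection and structural constraints). The method substantially improves segmentation generalization in open scenes while remaining efficient.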

Abstract: Bird’s-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them vulnerable to unpredictable real-world environments. In this work, we introduce open-vocabulary BEV segmentation (OVBS), which leverages vision-language models (VLMs) to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. A key challenge in OVBS lies in the 3D geometric inconsistency inherent in the ill-posed lifting of 2D VLM semantics into BEV. To address this, we propose OVBEVSeg, a geometry-aware OVBS framework that enhances efficient Gaussian splatting (GS)-based unprojection by leveraging robust 3D geometric constraints across three progressive stages: (1) 2D-to-BEV pseudo-labeling via reliable 3D projection for OV generalization; (2) joint 2D-BEV per-scene optimization with BEV structural constraints for 3D geometric consistency; and (3) 3D geometric distillation for online efficiency. On the nuScenes dataset, OVBEVSeg achieves state-of-the-art performance, outperforming closed-set methods by 15.3 mIoU on unseen categories. Remarkably, even with no novel-class ground-truth labels, it remains competitive with self- and semi-supervised baselines trained with up to 40% of ground-truth annotations. Furthermore, it achieves 2.5x faster inference with only 0.22x the memory consumption of projection-based methods. Project page: https://hchoi256.github.io/projects/ovbevseg/.

[61] SignNet-1M: Large-Scale Multilingual Sign Language Video Dataset with Downstream Benchmarks cs.CVPDF

Zhewen He, Junyi Hu, Haomian Huang, Zhenhua Li, Yu-Shen Liu

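TL;DR: This paper introduces SignNet-1M, a large-scale multilingual augmented sign language video dataset covering American Sign Language, Chinese Sign Language, and German Sign Language. Using 3D Gaussian Splatting, diffusion models, and post-rendering augmentation, it synthesizes realistic variations in viewpoint, background, and identity, with the goal of improving the robustness of sign language models under real-world distribution shifts.

Details

Motivation: Existing sign language models are typically trained on datasets captured under constrained conditions, with limited viewpoint, background, and identity diversity, which leads to poor robustness under real-world distribution shifts.

Result: Experiments show that training with SignNet-1M across different backbones consistently improves generalization under cross-view, cross-background, cross-identity, and post-rendering shifts while maintaining strong in-distribution performance.

Insight: The novelty lies in systematically increasing the diversity of sign language data through 3D reconstruction and generative AI, together with a unified suite of downstream benchmarks for evaluating each augmentation component, offering a new paradigm for research on robust sign language recognition and translation.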

Abstract: Sign language models are typically trained on datasets captured under constrained conditions, with limited viewpoint, background, and signer-identity diversity, leading to poor robustness under real-world distribution shifts. We introduce SignNet-1M, a large-scale augmented dataset spanning ASL, CSL, and German Sign Language (DGS). SignNet-1M synthesizes realistic variations along three axes: (i) novel-view rendering (rotation and zoom) via 3D Gaussian Splatting (3DGS), (ii) scene/identity editing via diffusion models for background replacement and signer substitution while preserving sign motion and linguistic content, and (iii) post-rendering augmentations that emulate capture and compression artifacts (e.g., pose/temporal perturbations and video-level corruptions) to better match in-the-wild recordings. Beyond data release, we provide a unified benchmark suite across downstream tasks (e.g., translation and recognition) and ablations that isolate each augmentation component. Experiments across backbones show that training with SignNet-1M consistently improves generalization under cross-view, cross-background, cross-identity, and post-rendering shifts, while maintaining strong in-distribution performance. The dataset, full augmentation pipeline, and benchmark are available at https://signnet.chatsign.ai/.

Lars Doorenbos, Duc Manh Vu, Serdar Ozsoy, Juergen Gall

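TL;DR: This paper proposes a modality-aware out-of-distribution detection method for multi-modal action recognition. By analyzing the relationship between multi-modal and uni-modal predictions, it builds a post-hoc detector designed specifically for the multi-modal scenario and combines it with a feature-space score, normalized by the multi-modal logits, to form a hybrid detector. Experiments show that, on average, the method outperforms the state of the art on the MultiOOD benchmark.

Details

Motivation: Although multi-modal action recognition models have improved in performance, their robustness with respect to out-of-distribution detection is underexplored; existing methods still apply off-the-shelf detectors designed for the uni-modal case at inference time, discarding important information.

Result: Across multiple datasets from the MultiOOD benchmark, the method outperforms the state of the art on average.

Insight: The novelty lies in identifying a relationship between multi-modal and uni-modal predictions and using that signal to build a post-hoc hybrid detector designed specifically for the multi-modal scenario, which underlines the importance of explicitly considering the different modalities at inference time.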

Abstract: The incorporation of additional modalities into action recognition models increases their performance across a wide range of settings. However, how this additional information can contribute to making the models more robust remains underexplored, particularly for the case of multi-modal out-of-distribution (OOD) detection. While methods exist that regularize the multi-modal training process with OOD detection in mind, they still apply off-the-shelf OOD detectors designed for the uni-modal case during inference, discarding important information. Based on an interesting relationship we find between the multi-modal and uni-modal predictions, we propose to use this signal to build a post-hoc detector explicitly designed for the multi-modal scenario. We combine this new source of information with a feature-space score, which detects off-manifold samples in the multi-modal space, and normalize them by the multi-modal logits. In doing so, the proposed hybrid detector is compatible with existing training-time approaches and consistently improves performance. Experiments on a wide range of established datasets from the MultiOOD benchmark show that, on average, our approach outperforms the state of the art. Our results show the importance of explicitly considering the different modalities at inference time for multi-modal OOD detection.

[63] EgoSAT: A Comprehensive Benchmark of Egocentric Streaming Interaction Understanding cs.CVPDF

Yijia Lei, Jinzhao Li, Yichi Zhang, Jiacheng Hua, Yin Li

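TL;DR: The paper introduces EgoSAT, the first comprehensive benchmark for egocentric video reasoning in streaming settings, designed to evaluate modern vision-language models (VLMs). The benchmark targets streaming interaction understanding, where video frames arrive sequentially and the model must continuously interpret the evolving visual context. EgoSAT unifies several previously separate tasks in a single streaming framework, with about 2,000 unique videos and 4,800 high-quality question-answer pairs for evaluating reasoning about the past, present, and future.

Details

Motivation: Existing benchmarks fall short in evaluating how well VLMs handle continuous, dynamic interaction understanding on egocentric streaming video. In particular, there is no unified framework that jointly evaluates retrospective reasoning, online understanding, and prospective anticipation.

Result: A systematic evaluation of open-weight and closed-weight VLMs on EgoSAT finds that existing models not only struggle with prospective and retrospective modeling but also show severe miscalibration: model confidence does not match actual answerability, leading to dangerous "confidently wrong" behavior.

Insight: The novelty lies in the first comprehensive benchmark that unifies streaming interaction understanding tasks, together with a diagnostic analysis of answerability versus model confidence. The analysis exposes key weaknesses of current models in temporal reasoning and calibration, providing an important evaluation dimension and direction for future model development.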

Abstract: We introduce EgoSAT, the first comprehensive benchmark for egocentric video reasoning in streaming settings, designed to evaluate the capabilities of modern vision-language models (VLMs). The benchmark targets streaming interaction understanding, where video frames arrive sequentially and models must continuously interpret evolving visual context. EgoSAT unifies several previously distinct tasks within a single streaming framework. In this formulation, queries about completed events correspond to retrospective reasoning, queries about ongoing activities require online understanding, and queries about future actions involve prospective anticipation. This unified setting requires models to reason about the past, present, and future while operating under the constraint that only previously observed frames are available. EgoSAT contains 1,997 unique videos spanning 165 hours of egocentric footage and around 4,800 high-quality question-answer pairs, carefully designed to probe reasoning across varying temporal contexts. Using this benchmark, we evaluate a diverse set of both open-weight and closed-weight VLMs, providing a systematic assessment of their ability for streaming interaction understanding. By distinguishing answerability and conducting diagnostics on confidence of models, we find existing models not only struggle with prospective and retrospective modeling, but also exhibit severe mis-calibration: confidence often fails to track inherent answerability, leading to dangerous “confidently wrong” behaviors. Project page: https://leiyj23.github.io/EgoSAT/
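The streaming formulation in the abstract can be illustrated with a minimal sketch. It assumes frames and events carry scalar timestamps; the function names `visible_frames` and `query_type` and the exact boundary handling are hypothetical and are not taken from the EgoSAT release.

```python
def visible_frames(frame_times, query_time):
    """Frames a streaming model may use: only those observed up to the query."""
    return [t for t in frame_times if t <= query_time]


def query_type(event_start, event_end, query_time):
    """Classify a query by where its target event sits relative to the query time.

    Boundary handling (<= vs <) is an assumption made for this sketch.
    """
    if event_end <= query_time:
        return "retrospective"  # the event has already completed
    if event_start > query_time:
        return "prospective"    # the event has not started yet
    return "online"             # the event is ongoing at query time


print(visible_frames([0.0, 1.0, 2.0, 3.0], query_time=1.5))
print(query_type(0.0, 1.0, query_time=1.5))
```

The point of the sketch is that all three query types share one constraint: whatever the question asks about, the model only ever sees `visible_frames`.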

[64] S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing cs.CVPDF

Qingxiao Li, Zikai Wang, Qingli Wang, Nan Xu

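TL;DR: S1-Omni-Image is an open-weight, unified multimodal model for scientific image understanding, generation, and editing. Built on the scientific multimodal reasoning backbone S1-VL-32B, it follows a think-before-generate paradigm that couples understanding capability with an image generation module. The model supports generating scientific illustrations and text rendering, and casts vision tasks such as segmentation as image editing problems.

Details

Motivation: Scientific image tasks require not only high-fidelity synthesis but also robust understanding of scientific semantics, structural relations, domain knowledge, and task intent, which general-purpose image generation models struggle to provide.

Result: The model outperforms open-source models on GenExam and TechImage-Bench, achieves SOTA results on four editing benchmarks (MSD, cigRockSEM, SynthRAD2025, and IXI), and maintains stable performance on scientific image understanding evaluations.

Insight: The novelty lies in unifying understanding, generation, and editing in one framework. Under the think-before-generate paradigm, the hidden states of the reasoning trace, the textual answer, and the task special token condition image generation or editing, which better integrates scientific semantics and task intent.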

Abstract: We present S1-Omni-Image, an open-weight unified multimodal model for scientific image understanding, generation, and editing. Unlike general-purpose image generation models, scientific image tasks require not only high-fidelity synthesis, but also robust understanding of scientific semantics, structural relations, domain knowledge, and task intent. To this end, S1-Omni-Image builds on the scientific multimodal reasoning backbone S1-VL-32B and couples its understanding capability with an image generation module under a unified think-before-generate paradigm. Given a user instruction, the model first produces a task-oriented reasoning trace, a textual answer, and a task special token; their hidden states are then injected into the generation module to condition image generation or editing. S1-Omni-Image supports scientific image understanding, generation, and editing in a unified framework. For generation, it focuses on scientific illustrations and text rendering, including logical diagrams, relational comparisons, data charts, and realistic scientific visualizations. For editing, it casts segmentation and other domain-specific vision tasks as native image editing problems, enabling multi-turn illustration editing, medical and geographic image segmentation, medical image translation, and scientific image super-resolution. We construct SciGenEdit, a 314K-sample training dataset, and release the model weights, inference code, and SciGenEdit-10K. Experiments show that S1-Omni-Image substantially improves scientific image generation and editing while preserving the scientific image understanding capability inherited from S1-VL-32B. It outperforms open-source models on GenExam and TechImage-Bench, achieves state-of-the-art results on four editing benchmarks including MSD, cigRockSEM, SynthRAD2025, and IXI, and maintains stable performance on scientific image understanding evaluations.

[65] P-MTP: Efficient Document Parsing via Multi-Token Prediction with Progressive Depth Scaling cs.CVPDF

Le Xiang, Chenxi Zhai, Shu Wei, Jingjing Wu, Qunyi Xie

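TL;DR: This paper proposes P-MTP, an efficient document parsing framework. Using progressive multi-token prediction and confidence-gated dynamic drafting, it addresses the optimization instability that existing multi-token prediction methods face when scaling look-ahead depth, and thereby significantly improves inference speed.

Details

Motivation: Vision-language models (VLMs) introduce a significant latency bottleneck in end-to-end document parsing, especially for token-dense documents. Existing multi-token prediction (MTP) methods are limited by optimization instability when scaled to deeper look-ahead depths.

Result: Experiments across multiple benchmarks and architectures show that P-MTP achieves up to a 5x inference speedup with negligible accuracy loss, the first successful validation of extended look-ahead MTP in the document parsing domain.

Insight: The paper's novelty is the Progressive Curriculum Loss, which adaptively re-weights different look-ahead depths using cumulative path reliability and retrospective target consistency, suppressing gradient noise in long-range predictions and enabling an easy-to-hard optimization transition. In addition, Confidence-Gated Dynamic Drafting adaptively calibrates the speculative length during inference to maximize effective look-ahead depth and acceptance rate, pushing inference speedup further.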

Abstract: Vision-Language Models (VLMs) have revolutionized document parsing by enabling end-to-end mapping from images to structured text, but they impose a significant latency bottleneck, particularly for token-dense documents. While Multi-Token Prediction (MTP) has emerged as a promising approach for accelerating inference, its potential is constrained by optimization instability when scaling to deeper look-ahead depths. In this paper, we propose P-MTP, a framework that leverages Progressive Multi-Token Prediction with a lightweight MTP module to scale the look-ahead depth for high-throughput document parsing. Specifically, we introduce Progressive Curriculum Loss, which adaptively re-weights different look-ahead depths using cumulative path reliability and retrospective target consistency. By effectively suppressing gradient noise in long-range predictions, P-MTP facilitates an automated easy-to-hard optimization transition, enabling the model to master increasingly distant look-ahead depths. Furthermore, we propose Confidence-Gated Dynamic Drafting to maximize the effective look-ahead depth and acceptance rate by adaptively calibrating speculative length during inference, thereby minimizing computational waste and further pushing the boundaries of inference speedup. Experimental results across multiple benchmarks and architectures demonstrate that P-MTP achieves up to a 5x speedup with negligible loss in accuracy, providing the first successful validation of extensive look-ahead MTP in the document parsing domain.
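The abstract does not spell out the gating rule behind Confidence-Gated Dynamic Drafting, so the following is only a rough sketch of one way such a gate could work. It assumes each drafted token comes with a confidence in [0, 1] and that drafting stops once the running product of confidences drops below a threshold; the function name, the product rule, and the default threshold are all hypothetical.

```python
def gated_draft_length(confidences, threshold=0.5, max_depth=None):
    """Return how many drafted tokens to keep before handing off to verification.

    Assumption for this sketch: the reliability of a draft path is the product
    of its per-token confidences, and drafting stops when it falls below
    `threshold`. This is not taken from the P-MTP paper.
    """
    path_reliability = 1.0
    kept = 0
    for depth, conf in enumerate(confidences):
        if max_depth is not None and depth >= max_depth:
            break  # hard cap on look-ahead depth
        path_reliability *= conf
        if path_reliability < threshold:
            break  # the draft is no longer trustworthy enough to extend
        kept += 1
    return kept


# Confident early tokens are kept; the third token pulls the path below 0.5.
print(gated_draft_length([0.9, 0.9, 0.5, 0.9], threshold=0.5))
```

Under this rule the speculative length adapts per step: easy, repetitive spans draft deep, while uncertain spans fall back to short drafts and waste less verification compute.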

[66] Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching cs.CVPDF

Junpeng Jing, Ronglai Zuo, Zhelun Shen, Shangchen Zhou, Rolandos Alexandros Potamias

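TL;DR: This paper presents Lite Any Stereo V2 (LAS2), an ultra-fast, efficient model series for zero-shot stereo matching. The design covers both architecture and training: a 2D-only cost aggregation framework optimized for real inference latency, and a three-stage training strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. Experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency.

Details

Motivation: Current stereo matching methods are accurate but rely on large models, heavy computation, or additional priors, which makes them hard to deploy on resource-constrained platforms. Efficient models infer faster but are commonly considered weaker at zero-shot generalization. The paper challenges this assumption by developing a model that offers both efficient inference and strong zero-shot generalization.

Result: In extensive experiments, LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively.

Insight: Contributions include revisiting efficient stereo design from a practical deployment perspective and proposing a 2D-only cost aggregation framework optimized for real inference latency; a three-stage training strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation; and pseudo-label filtering plus an error-clamping operation that improve the reliability of real-world pseudo supervision and enable smoother synthetic-to-real transfer. The model family includes feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy.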

Abstract: Recent advances in stereo matching have achieved remarkable accuracy, but often rely on large models, heavy computation, or additional foundation-model priors, making them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both architecture and training perspectives. Architecturally, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework, optimized for real inference latency rather than theoretical MACs alone. For training, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, enabling smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at https://tomtomtommi.github.io/LiteAnyStereoV2/.
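The abstract names pseudo-label filtering and error clamping without giving their definitions. A minimal sketch of how the two could combine in a pseudo-supervised disparity loss is shown below; the confidence-based filter, the L1 error, and the default values of `min_conf` and `max_err` are assumptions for illustration, not the LAS2 implementation.

```python
def clamped_pseudo_loss(pred, pseudo, confidence, min_conf=0.8, max_err=3.0):
    """Mean clamped L1 error over pixels whose pseudo-label passes the filter.

    pred, pseudo: per-pixel disparities (flat lists of floats).
    confidence:   per-pixel teacher confidence (hypothetical filtering signal).
    """
    kept = [
        min(abs(p - t), max_err)           # clamp so one bad label cannot dominate
        for p, t, c in zip(pred, pseudo, confidence)
        if c >= min_conf                   # drop unreliable pseudo-labels
    ]
    return sum(kept) / len(kept) if kept else 0.0


pred = [10.0, 20.0, 30.0]
pseudo = [11.0, 40.0, 30.0]
conf = [0.9, 0.95, 0.1]
print(clamped_pseudo_loss(pred, pseudo, conf))
```

In this example the third pixel is filtered out, and the 20-pixel error on the second pixel is clamped to 3.0, so a single wrong pseudo-label contributes a bounded amount to the loss.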

[67] Boosting Text-Driven Video Segmentation via Geometry-Aware Distillation cs.CVPDF

Tianyu Zhu, Yingping Liang, Hesong Li, Ying Fu

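TL;DR: This paper proposes GeoLaV, a two-stage framework that distills 3D geometric knowledge from images to enhance text-driven video segmentation. The first stage performs geometry pretraining via monocular novel-view synthesis, so the model acquires geometry-consistent visual representations on large-scale single-image datasets. The second stage introduces geometry-aware distillation and fine-tunes on video segmentation datasets, transferring 3D structural knowledge from a general 3D prior model to improve spatiotemporal coherence and language grounding.

Details

Motivation: Existing text-driven referring video object segmentation (RVOS) models are typically trained on 2D image or video datasets with naive segmentation losses, overlooking geometric consistency across frames and resulting in weak spatial understanding.

Result: Experiments show that the method, using only image segmentation data, already delivers notable zero-shot generalization in RVOS. Combined with geometry-aware distillation for fine-tuning on videos, it achieves SOTA performance across multiple RVOS benchmarks.

Insight: The novelty lies in a two-stage framework that distills 3D geometric knowledge into a video segmentation model: pretraining with monocular novel-view synthesis strengthens geometry-consistent representations, and geometry-aware distillation transfers structural knowledge from a general 3D prior, improving the spatiotemporal coherence and language understanding of the segmentation.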

Abstract: Text-driven Referring Video Object Segmentation (RVOS) aims to locate and segment target objects in videos given natural language. However, existing models are typically trained on 2D image or video datasets with naive segmentation losses, which overlooks the geometric consistency across frames and leads to weak spatial understanding. In this paper, we propose Geometry-enhanced Language-guided Video segmentation (GeoLaV), a two-stage framework that distills 3D geometric knowledge from images to enhance text-driven video segmentation. In the first stage, we perform monocular geometry pretraining with monocular novel-view synthesis, enabling the model to acquire geometry-consistent visual representations via spatial alignment on large-scale single-image datasets. In the second stage, we introduce geometry-aware distillation and fine-tune the model on video segmentation datasets, transferring 3D structural knowledge from a general 3D prior model. This process reinforces 3D awareness and improves both spatiotemporal coherence and language grounding in segmentation. Extensive experiments show that our method using only image segmentation data already provides notable zero-shot generalization in RVOS. When combined with geometry-aware distillation for fine-tuning on videos, our method achieves state-of-the-art performance across multiple RVOS benchmarks. The code is available at https://github.com/Tony1882880/GeoLaV.

[68] video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding cs.CV | cs.AI | cs.SDPDF

Yixuan Li, Guangzhi Sun, Yudong Yang, Wei Li, Zejun MA

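TL;DR: This paper presents video-SALMONN-R$^3$, a two-stage video large language model for efficient video understanding. It enables an end-to-end re-watch mechanism through reinforcement learning without relying on a chain-of-thought cold start, and introduces re-answer and re-ask strategies to improve answer quality and question adherence.

Details

Motivation: Because of computation and memory limits, existing video LLMs often use low frame rates and spatial resolutions, which can cause them to miss critical information. An efficient two-stage paradigm is needed: first coarsely localize the relevant segments, then re-watch them at higher fidelity.

Result: Experiments show that video-SALMONN-R$^3$ consistently outperforms the base model and the QA-SFT baseline, and surpasses prior re-watch-based methods at significantly lower computational cost.

Insight: The novelty lies in an end-to-end re-watch mechanism learned through reinforcement learning without a chain-of-thought cold start, a re-answer strategy that mitigates the mismatch in model behavior, and a re-ask mechanism that strengthens question adherence, improving video question answering while remaining efficient.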

Abstract: Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize relevant segments, and then re-watch these segments at higher temporal or spatial fidelity. In this paper, we present video-SALMONN-R$^3$, the first end-to-end video-LLM that enables re-watch through reinforcement learning without relying on chain-of-thought (CoT) cold-start. This design removes the need for costly CoT data annotations and avoids CoT-based supervised fine-tuning (SFT), which can otherwise degrade the pretrained video understanding abilities. To address the mismatch between the reasoning-first behavior induced by re-watch and the answer-first tendency of pretrained video-LLMs, we propose a re-answer strategy, in which the model first produces a direct answer in the first watch and then refines it after re-watching. Finally, to improve question adherence during re-watching, we propose a re-ask mechanism that re-injects the query when revisiting localized segments. Experimental results show that video-SALMONN-R$^3$ consistently outperforms both the base model and the QA-SFT baseline, while surpassing prior re-watch-based approaches with significantly lower computational cost. Code, models, and data will be publicly released upon acceptance.

[69] Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods cs.CVPDF

Xingsong Ye, Yongkun Du, Jiaxin Zhang, Haojie Zhang, Chong Sun

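TL;DR: This paper advances WordArt-oriented scene text recognition (WATER) on both the data and model sides. It builds WATER-S, a dataset of 2 million synthetic samples, and proposes WATERec, a model that supports arbitrary-shaped inputs and autoregressive decoding. The approach reaches 90.40% accuracy on WordArt-Bench, far surpassing existing methods.

Details

Motivation: WordArt features highly customized fonts, textures, and layouts. Existing STR datasets and methods, built around regular text and fixed templates, adapt poorly to the WATER task, so the challenge needs to be addressed at both the data and model levels.

Result: On WordArt-Bench, the approach that combines the new synthetic data WATER-S with the WATERec model reaches 90.40% accuracy, significantly outperforming general-purpose and OCR-specialized vision-language models and achieving SOTA performance on WordArt recognition.

Insight: Contributions include the large-scale synthetic dataset WATER-S, which combines an upgraded rendering pipeline with multimodal generation to improve data coverage, and the WATERec model, whose support for arbitrary-shaped inputs and autoregressive decoding breaks the bottleneck of fixed-template STR on complex WordArt layouts.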

Abstract: WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, with the scale improved by hundreds of times compared to existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.

[70] RetiSEM: Generalising Causal Models for Fragmented Biomedical Data cs.CV | cs.AI | stat.MEPDF

Inam Ullah, Imran Razzak, Shoaib Jameel

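TL;DR: This paper proposes RetiSEM, a domain-constrained structural equation modelling (SEM) framework for fragmented biomedical data. It learns causal models from multimodal data (clinical, molecular, and imaging) that are incomplete or not jointly observed. The method organises variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes pathway-level effects into total effect (TE), natural direct effect (NDE), and natural indirect effect (NIE). It is evaluated on ten synthetic benchmark scenarios and one fragmented real-world setting that combines NHANES clinical variables with externally derived retinal representations.

Details

Motivation: The paper addresses the challenge of learning causal models from fragmented biomedical data, where clinical, molecular, and imaging variables are often incomplete or not jointly observed, so that causal graph recovery and mediation analysis remain possible under limited multimodal resources.

Result: On the synthetic benchmarks, RetiSEM achieves lower structural error and higher causal accuracy than unconstrained baselines. In the real-data analysis, retinal variables behave mainly as downstream biomarker-like indicators, with smaller but detectable indirect effects.

Insight: The novelty lies in a domain-constrained SEM framework that organises variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes pathway-level effects, offering an interpretable framework for testing structured causal hypotheses in limited-resource biomedical AI.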

Abstract: Learning causal models from fragmented biomedical data is challenging because clinical, molecular, and imaging variables are often incomplete or not jointly observed. We propose RetiSEM, a domain-constrained structural equation modelling (SEM) framework for causal graph recovery and mediation analysis under limited multimodal resources. This proposed work organises variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes pathway-level effects into TE, NDE, and NIE components. We evaluate RetiSEM across ten synthetic benchmark scenarios that vary in dimensionality, nonlinearity, causal depth, and pathway structure, together with a fragmented real-world setting that combines NHANES clinical variables with externally derived retinal representations. This approach achieves lower structural error and higher causal accuracy than unconstrained baselines across the synthetic benchmarks. In the real-data analysis, retinal variables behave mainly as downstream biomarker-like indicators, with smaller but detectable indirect effects. These findings support our strategy as an interpretable framework for testing structured causal hypotheses in limited-resource biomedical AI. The code and resources for this work are publicly available at: https://github.com/Inamullah-Colab/ReitSEM.
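For readers unfamiliar with the TE / NDE / NIE decomposition, a small worked example may help. It assumes the simplest case, a linear SEM with one exposure X, one mediator M, one outcome Y, and no exposure-mediator interaction; the coefficient names `a`, `b`, `c` and the function name are illustrative and do not come from the RetiSEM code.

```python
def decompose_linear_mediation(a, b, c):
    """Effects of a one-unit change in X under the linear model

        M = a * X + noise
        Y = c * X + b * M + noise

    with no X-M interaction (an assumption of this sketch).
    """
    nde = c          # natural direct effect: X -> Y, holding the mediator path fixed
    nie = a * b      # natural indirect effect: X -> M -> Y
    te = nde + nie   # total effect is the sum in the linear, no-interaction case
    return {"TE": te, "NDE": nde, "NIE": nie}


print(decompose_linear_mediation(a=0.5, b=0.4, c=0.3))
```

With these illustrative coefficients, the indirect path contributes 0.2 of the 0.5 total effect. The "smaller but detectable indirect effects" reported for the retinal variables correspond to a small but non-zero NIE term.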

[71] VisCritic: Visual State Comparison as Process Reward for GUI Agents cs.CVPDF

Jiachen Qian

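TL;DR: The paper proposes VisCritic, a visual process reward framework for GUI agents. It verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space, addressing the failure of existing text-reasoning-based reward models on long-horizon tasks, where visual state verification is missing.

Details

Motivation: Existing GUI agents built on vision-language models frequently fail on long-horizon tasks because they lack step-level verification, and existing process reward models verify actions through textual reasoning alone, ignoring the visual nature of GUI state changes.

Result: Experiments and offline analyses across five benchmarks show that VisCritic works as a plug-and-play enhancement that generally improves benchmark metrics for diverse GUI agents while providing visual diagnostic cues.

Insight: The novelty lies in a reward mechanism that compares pre-action and post-action states directly in visual feature space, plus a critic-training data construction pipeline that generates weakly supervised samples from existing trajectories without additional human labels. Its visual state comparison and joint evaluation give GUI agents more reliable step-level feedback.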

Abstract: GUI agents powered by vision-language models show strong potential for automating digital tasks, yet frequently fail in long-horizon scenarios due to the absence of step-level verification. Existing process reward models verify actions through textual reasoning alone, missing the visual nature of GUI state changes. We introduce VisCritic, a visual process reward framework that verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space. VisCritic employs a Siamese vision transformer to extract change-aware representations, coupled with an Action-Aware Critic Head that jointly evaluates action success, task progress, and error type. A critic-training data construction pipeline generates weakly supervised samples from existing trajectories without additional human labels. Experiments and offline analyses across five benchmarks demonstrate that VisCritic serves as a plug-and-play enhancement for diverse GUI agents, generally improving benchmark metrics while providing visual diagnostic cues.

[72] PointVG-R: Internalizing Geometric Reasoning in MLLMs for Precise Pointing Localization via Visual Chain of Thought cs.CVPDF

Ling Li, Bowen Liu, Zinuo Zhan, Jianhui Zhong, Ziyu Zhu

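TL;DR: This paper proposes PointVG-R, a reasoning-guided multi-modal large language model (MLLM) that strengthens geometric reasoning in pointing-based visual grounding through a visual chain of thought. The method integrates reinforcement learning with cold-start data to simulate the iterative cognitive process humans use when interpreting pointing gestures, and constructs EgoPoint-CoT, a high-quality visual chain-of-thought dataset, for supervised fine-tuning and reinforcement learning.

Details

Motivation: Traditional methods typically encode input images into static feature representations and reason within the linguistic domain, often overlooking the rich perceptual cues and explicit spatial geometry in images, which leaves models cognitively vulnerable when interpreting gestural spatial relations.

Result: Experiments show that PointVG-R achieves SOTA performance on pointing-based visual grounding, outperforming the baseline by 15.86 points in mIoU. Extensive ablation studies further validate the proposed modules.

Insight: The novelty lies in a geometry-aware reasoning pipeline that simulates the human cognitive process and a dedicated visual chain-of-thought dataset for training. Viewed objectively, the Adaptive Importance Weighting strategy based on group variance, which dynamically adjusts reward signals in reinforcement learning to optimize training, is a technique worth borrowing.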

Abstract: Pointing-based visual grounding requires models to precisely locate target objects by deciphering complex spatial relationships between the visual scene and pointing gestures. Traditional methods typically encode input images into static feature representations and perform reasoning primarily within the linguistic domain, often overlooking the rich perceptual cues and explicit spatial geometry inherent in images. In this study, we aim to mitigate the cognitive vulnerability of models in interpreting gestural spatial relations by proposing PointVG-R, a reasoning-guided Multi-modal Large Language Model (MLLM). PointVG-R introduces geometric-aware reasoning for pointing-based grounding, enabling the model to think with images through the strategic integration of Reinforcement Learning (RL) and cold-start data. Specifically, we design a novel geometric reasoning pipeline that simulates the iterative cognitive process humans employ when interpreting pointing gestures. Furthermore, we construct EgoPoint-CoT, a high-quality visual Chain-of-Thought (CoT) dataset featuring detailed reasoning trajectories to guide the model via Supervised Fine-Tuning (SFT) and RL. To address the varying quality of learning signals encountered during training, we further propose an Adaptive Importance Weighting strategy based on Group Variance, which dynamically adjusts reward signals to optimize the learning process. Experimental results demonstrate that PointVG-R achieves SOTA performance, outperforming the baseline by 15.86 points in mIoU. Extensive ablation studies further validate the efficacy of our proposed modules. Code: https://github.com/lingli1724/PointVG-R.
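The abstract does not give the formula for Adaptive Importance Weighting based on Group Variance. The sketch below shows one plausible reading, assuming a group-relative RL setup in which several responses are sampled per prompt: groups whose rewards barely differ carry little learning signal and are down-weighted. The weight function `var / (var + scale)` and the function name are hypothetical, not the paper's definition.

```python
from statistics import mean, pvariance


def group_weighted_advantages(rewards, scale=1.0):
    """Mean-centred advantages for one sampled group, scaled by a variance-based weight.

    Assumption for this sketch: weight = var / (var + scale), so a group with
    identical rewards (zero variance) contributes nothing to the update.
    """
    mu = mean(rewards)
    var = pvariance(rewards)
    weight = var / (var + scale)
    return [weight * (r - mu) for r in rewards]


print(group_weighted_advantages([0.0, 1.0, 0.0, 1.0]))  # informative group
print(group_weighted_advantages([1.0, 1.0, 1.0, 1.0]))  # uninformative group: all zeros
```

The design choice illustrated here is that the weight depends on the whole group rather than on any single response, which is what distinguishes it from plain per-sample reward scaling.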

[73] ForensicsTok: Forensics-Guided Tokenized Modeling for Image Tampering Localization cs.CVPDF

Lei Xu, Haowei Wang, Shen Chen, Taiping Yao, Bin Li

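TL;DR: This paper proposes ForensicsTok, a new method for image tampering localization that reformulates the task as autoregressive sequence generation. It predicts precise masks by directly generating spatially grounded token sequences, avoiding the information bottleneck of conventional exogenous segmentation decoders.

Details

Motivation: Existing forensic methods based on multi-modal large language models (MLLMs) rely on exogenous segmentation decoders, leading to suboptimal localization. Their stitched pipelines introduce information bottlenecks during backpropagation, which dilute spatial signals and are limited by the segmentor's semantic priors.

Result: Extensive experiments on six benchmarks show that ForensicsTok substantially outperforms existing MLLM-based baselines and slightly outperforms strong forensic expert baselines, while exhibiting stronger robustness to perturbations.

Insight: Contributions include reformulating tampering localization as sequence generation, which avoids intermediary supervision; a Token Splatting Decoder that maps tokens to binary masks via codebook-aware code smoothing, mitigating the sharp gradients of deterministic detokenizers; and a Hierarchical Expert Fusion module that injects multi-scale features from a forensic expert model to compensate for the lack of forensic priors in standard MLLMs.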

Abstract: Multi-modal Large Language Models (MLLMs) offer powerful reasoning for forensic tasks, yet existing approaches utilizing exogenous segmentation decoders often suffer from suboptimal localization. The reliance on stitched pipelines introduces information bottlenecks during backpropagation, which dilutes spatial signals and is limited by semantic priors of the segmentor. To address these limitations, we propose ForensicsTok, which reformulates image manipulation localization as an autoregressive sequence generation task. ForensicsTok directly generates spatially grounded token sequences, enabling precise mask prediction without intermediary supervision. Specifically, we introduce a Token Splatting Decoder (TSD) to map tokens to binary masks via codebook-aware code smoothing, which mitigates sharp gradients from deterministic detokenizers. Furthermore, to capture diverse tampering clues, we propose a Hierarchical Expert Fusion (HEF) module that injects multi-scale features from a forensic expert model. This unified architecture effectively compensates for the lack of forensic priors in standard MLLMs. Extensive experiments on six benchmarks show that ForensicsTok substantially improves over existing MLLM-based baselines and slightly improves over strong forensic expert baselines, while exhibiting stronger robustness to perturbations.

[74] PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments cs.CVPDF

Zhenyang Li, Lutao Jiang, Yizhou Zhao, Ying-Cong Chen, Xin Wang

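TL;DR: This paper presents PatternGSL, a structured specification language for template-free, simulation-ready 3D garments that aims to bridge the representation gap between geometric reconstruction and structured garment construction. It also proposes a vision-language framework that predicts PatternGSL specifications directly from a single image and decodes them into garments with lightweight deterministic handling, without optimization-based refinement or manual cleanup. In addition, the authors create the PatternGSLData dataset to support supervised training.

Details

Motivation: The goal is to reconstruct realistic, physically plausible garments from a single image. Existing template-free methods capture surface geometry but lack the explicit sewing structure needed for simulation, while programmatic systems are simulation-ready but constrained by predefined templates. This reveals a fundamental representation gap between geometric reconstruction and structured garment construction.

Result: Experiments show improved pattern accuracy over prior baselines, explicit recovery of sewing structure, reliable cloth simulation, and pattern-level editing through the same deterministic decoding pipeline.

Insight: The core novelty is a template-free, learnable structured garment representation language (PatternGSL) that encodes complete sewing patterns (panel boundaries, parameterized seams, and explicit stitch topology) in a compact, standardized form, elevating sewing structure to a first-class target for generative modeling. The authors also build the first large-scale image-to-GSL paired dataset, which supports supervised training of vision-language models.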

Abstract: Reconstructing realistic, physically plausible garments from a single image remains a fundamental challenge. Template-free methods capture surface geometry but lack explicit sewing structure for simulation, while programmatic systems are simulation-ready but constrained by predefined templates. This reveals a fundamental representation gap between geometric reconstruction and structured garment construction. We present PatternGSL, a structured garment representation in the form of a template-free and learnable specification language that encodes complete sewing patterns, including panel boundaries, parameterized seams, and explicit stitch topology, in a compact and standardized form. PatternGSL preserves the physical rigor of pattern-based models while removing template dependence, elevating sewing structure as a first-class target for generative modeling. We further propose a vision-language framework that predicts PatternGSL specifications directly from a single image and decodes them into garments using lightweight deterministic validity handling, without optimization-based refinement or manual cleanup. In addition, we introduce PatternGSLData, the first large-scale image-to-GSL paired dataset comprising 300K samples with complete sewing pattern annotations, enabling supervised VLM training for structured garment reconstruction. Experiments demonstrate improved pattern accuracy over prior baselines, explicit sewing-structure recovery, reliable cloth simulation, and pattern-level editing through the same deterministic decoding pipeline. Code and data-processing scripts will be released at https://github.com/PatternGSL/PatternGSL.

[75] Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning cs.CVPDF

Jiayi Lei, Yuandong Pu, Xingyu Han, Rongpeng Zhu, Jing Xu

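TL;DR: This paper introduces Counterfactual-World (CF-World), a counterfactual benchmark for evaluating whether text-to-image (T2I) generation models have genuine causal reasoning ability rather than merely pattern matching over visual-textual correlations in their training data. The benchmark systematically tests models with three progressive levels (factual generation, explicit counterfactual generation, and implicit counterfactual generation) and two new metrics (Prior Resistance Rate and Reasoning Retention Rate). Experiments show that existing T2I models degrade sharply in counterfactual settings, revealing a fundamental limitation: they tightly couple world knowledge with visual appearance and rely heavily on common co-occurrence patterns in the training data.

Details

Motivation: T2I generation models have made remarkable progress, but it is unclear whether their success stems from genuine causal understanding or from sophisticated pattern matching over visual-textual correlations. Inspired by Russell's inductivist turkey, the paper builds a counterfactual benchmark to probe whether models can generate images that systematically contradict real-world priors, and thereby assess their causal reasoning ability.

Result: On CF-World, both open-source and closed-source T2I models degrade sharply from factual to counterfactual settings. Quantitative analysis with a VLM-based evaluator (CF-Eval) and the two new metrics, Prior Resistance Rate (PRR) and Reasoning Retention Rate (RRR), confirms that all models struggle when they must overcome entrenched real-world priors or perform causal reasoning.

Insight: The core contribution is a systematic counterfactual reasoning benchmark (CF-World) with two targeted metrics (PRR and RRR), providing a new tool for quantifying causal understanding in T2I models. The key insight is that current T2I models encode world knowledge and visual appearance as tightly coupled patterns, so they depend heavily on frequent co-occurrences in the training data and fail on generation tasks that contradict commonsense priors. This points to disentangling knowledge from representation as a direction for future models.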

Abstract: Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell’s inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark designed to investigate whether text-to-image models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation requiring causal deduction from altered rules. We evaluate both open-source and closed-source T2I models using a Vision Language Model (VLM)-based evaluator (CF-Eval). Furthermore, we introduce two metrics: Prior Resistance Rate (PRR), which measures a model’s ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether models can maintain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models exhibit sharp degradation from factual to counterfactual settings. Further analyses suggest that these failures arise because current T2I models encode world knowledge and visual appearances as tightly coupled patterns. Consequently, their heavy reliance on frequent visual co-occurrences within the training data forces them to default to familiar commonsense priors when tasked with rendering counterfactual worlds.

[76] Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning cs.CVPDF

Julien Khlaut, Charles Corbière, Baptiste Callard, Amaury Prat, Leo Butsanets

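TL;DR: This paper proposes ConQuer (Concept Queries), a method that strengthens contrastive learning between 3D medical images and text. It splits radiology reports into concept-specific sections (such as organs or body regions) and learns a set of cross-attention queries that pool the matching image features, adding localized concept-level alignment on top of global alignment. Jolia, a 3D CT foundation model trained with this method, outperforms a CLIP baseline on multiple tasks and sets a new SOTA on public benchmarks.

Details

Motivation: Existing CLIP-style pretraining for 3D medical foundation models compresses each whole scan and report into a single global token, which risks losing details about the many organs in a medical image and important information in long, structured reports.

Result: Jolia consistently outperforms the CLIP baseline on findings classification, report generation, and cross-center transfer for chest and abdominal CT, and sets new SOTA results on multiple public benchmarks.

Insight: The novelty is the use of concept queries (ConQuer) for localized alignment, which achieves concept-level vision-language alignment without segmentation masks or spatial supervision. As a byproduct, it yields spatial attention maps focused on specific concepts, improving the model's interpretability.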

Abstract: Vision-language contrastive pretraining has become the dominant recipe for 3D medical foundation models, leveraging the large volumes of paired scans and reports produced in clinical practice. However, medical images usually span dozens of organs, and radiological reports are much longer than typical natural image captions and are composed of multiple structured sections. CLIP-style pretraining compresses this structure by encoding each modality into a single global token, at the risk of losing important details. We introduce ConQuer (Concept Queries), an image-text pretraining method that augments CLIP’s global alignment with a set of localized alignments, one per concept. ConQuer splits the report into concept-specific sections and learns cross-attention queries that pool the matching image features without using any segmentation mask or spatial supervision. Contrastive learning is then applied independently for each concept. Concepts can be any unit of semantic localization; here, they are anatomical regions, one query per organ or gross body region. As a byproduct, each query learns attention maps focused on its concept, providing built-in spatial interpretability. We use ConQuer to train Jolia, a 3D CT foundation model on chest and abdominal CT. Jolia consistently outperforms a CLIP baseline on findings classification, report generation, and cross-center transfer, and sets a new state of the art across multiple public benchmarks. Jolia’s weights will be released upon acceptance.

[77] Agentic Collaborative Cognition for Zero-Shot 3D Understanding cs.CVPDF

Wenxin Wang, Bo Zhang, Feng Chen, Zixuan Wang, Wen Li

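TL;DR: This paper proposes a collaborative multi-agent framework for zero-shot 3D understanding. A Planning Agent handles high-level viewpoint planning and supplements novel viewpoints, while a Perception Agent explicitly summarizes the 3D scene into a structured holistic cognitive map. The two agents work together in a closed-loop iterative process and achieve state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing methods that treat 3D understanding as video keyframe understanding with multimodal large language models face an intrinsic bottleneck: videos offer only finite observation perspectives and perceive the 3D scene implicitly, so they cannot capture the scene comprehensively.

Result: The method achieves state-of-the-art performance on 6 benchmarks, with improvements of 11.1% Acc@0.5 on ScanRefer, 14.6 BLEU-1 on 3D-assisted dialog, and 2.1 EM on SQA3D.

Insight: The novelty is a dual-agent collaborative framework that models the 3D scene explicitly as a structured cognitive map and uses a closed loop of planning and perception to supplement viewpoints and integrate observations, breaking the limited-viewpoint bottleneck of a single agent.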

Abstract: Recent advancements have explored agentic zero-shot 3D understanding by reformulating it as video keyframe understanding with Multimodal Large Language Models (MLLMs). However, existing methods face an intrinsic bottleneck due to the finite observation perspectives inherent in videos and the implicit perception of 3D scenes. In this paper, we propose a collaborative multi-agent framework that assigns a Planning Agent to handle high-level viewpoint planning and supplement novel perspectives, and a Perception Agent to explicitly summarize the 3D scene into a structured holistic cognitive map. Specifically, the Planning Agent first analyzes this cognitive map to determine query-relevant viewpoints and supplements missing critical perspectives to ensure comprehensive observation. Subsequently, the Perception Agent documents object-level attributes from these views by assigning consistent instance identifiers across viewpoints, thereby integrating fragmented observations into the holistic cognitive map. In parallel, it provides feedback to filter out mismatched candidate objects and guide subsequent viewpoint planning. Through this closed-loop iterative process, the two agents collaboratively identify candidates until the Perception Agent determines that sufficient information has been captured to complete the task. Extensive experiments demonstrate that our method achieves state-of-the-art performance on 6 benchmarks, with improvements of 11.1% Acc@0.5 on ScanRefer, 14.6 BLEU-1 on 3D-assisted dialog, and 2.1 EM on SQA3D.

[78] ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering cs.CV

Zhentao Guo, Chen Duan, Tongkun Guan, Zining Wang, Kai Zhou

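TL;DR: This paper presents the ViTexQA dataset and the FrameThinker model to address the limitations of current multimodal large language models in video text understanding, especially when semantics emerge only by integrating temporally distributed textual cues across multiple frames. ViTexQA is a large-scale video-text QA dataset in which every QA pair requires cross-frame text fusion, which forces genuinely temporal reasoning. FrameThinker uses two-stage training for explicit temporal modeling: CoT-guided supervised fine-tuning generates frame-aware reasoning chains, followed by temporally grounded reinforcement learning optimized with multi-frame coherence rewards.

Details

Motivation: Current MLLMs are limited in video text understanding, particularly when meaning emerges only by integrating textual cues distributed over time across multiple frames. In most existing datasets the questions can still be answered from a single frame, so they do not adequately reflect real-world video text comprehension demands.

Result: Evaluations on ViTexQA show that the method outperforms state-of-the-art baselines, improving ROUGE-L by 6.3%.

Insight: The contributions are ViTexQA, a QA dataset that enforces cross-frame text fusion, and FrameThinker, a model with explicit temporal modeling through two-stage training (CoT-guided supervised fine-tuning and temporally grounded reinforcement learning) that strengthens multi-frame temporal reasoning.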

Abstract: Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across multiple frames. This perception challenge fundamentally differs from static image text understanding, yet existing datasets fail to capture it: the vast majority of questions remain answerable from single frames, inadequately reflecting real-world video text comprehension demands. To address this, we present ViTexQA, a large-scale video-text QA dataset, and FrameThinker for robust multi-frame temporal reasoning. We build ViTexQA via a quality-controlled Chain-of-Thought (CoT) annotation pipeline boosted with temporal constraints; all its QA pairs demand cross-frame text fusion to solve, enforcing true temporal reliance. FrameThinker adopts two-stage training for explicit temporal modeling: CoT-Guided Supervised Fine-Tuning (SFT) generates frame-aware reasoning chains, followed by Temporally-grounded Reinforcement Learning (RL) optimized with multi-frame coherence rewards. Evaluations show our method outperforms SOTA baselines on ViTexQA, lifting ROUGE-L by 6.3%.

[79] SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards cs.CV

Sheng Xia, Zhengqin Lai, Tianxiang Jiang, Kanghui Tian, Shoujun Zhou

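TL;DR: This paper proposes Semantic Evidence Reward (SER), a method for improving the fine-grained spatio-temporal reasoning of video multimodal large language models (Video MLLMs). SER reformulates spatio-temporal evidence grounding as a constrained verification task and uses a referee vision-language model (VLM) to evaluate model-generated evidence claims, which reduces reliance on dense bounding-box annotations and allows training directly on standard video QA data.

Details

Motivation: Existing Video MLLMs struggle with fine-grained spatio-temporal reasoning and sometimes produce correct answers based on irrelevant frames or objects. Outputting spatio-temporal evidence is a promising direction, but existing RL frameworks typically rely on geometry-only (IoU) rewards, which are sensitive to boundary perturbations and overlook semantic alignment.

Result: On the V-STAR benchmark, SER achieves 49.6% mLGM, 3.0 points higher than the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential to improve both answer accuracy and evidence grounding.

Insight: The novelty is recasting evidence grounding as semantics-based constrained verification: a referee VLM scores each claim on relevance and localization quality, combined with a temporal penalty. This avoids pixel-level overlap computation and reduces dependence on precise box annotations, so the model can learn semantically aligned evidence grounding more robustly.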

Abstract: Video MLLMs often struggle with fine-grained spatio-temporal reasoning, sometimes generating correct answers based on irrelevant frames or objects. Although outputting spatio-temporal evidence during reasoning is a promising direction, existing RL frameworks typically rely on geometry-only (IoU) rewards, which can be sensitive to boundary perturbations and overlook semantic alignment. To address this, we propose Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task. Instead of computing pixel-level overlap, SER uses a referee VLM as a local checker to evaluate model-generated evidence claims across two dimensions: relevance and localization quality, combined with a temporal penalty. This design reduces the reliance on dense box annotations and enables training directly on standard video QA data. On the V-STAR benchmark, SER achieves 49.6% mLGM, improving by 3.0 points over the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential in enhancing both answer accuracy and evidence grounding.
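
The boundary sensitivity of geometry-only rewards mentioned in the abstract can be illustrated with a small sketch. The function name `temporal_iou` and the example intervals are assumptions made for illustration; they are not taken from SER, whose reward is instead produced by a referee VLM.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    # Overlap length, clamped at zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth evidence segment of two seconds.
gt = (10.0, 12.0)
exact = temporal_iou((10.0, 12.0), gt)    # perfectly aligned prediction
shifted = temporal_iou((11.0, 13.0), gt)  # same length, shifted by one second
print(exact, shifted)
```

A one-second shift on a two-second segment cuts the IoU to one third even though the prediction may still show the relevant event, which is the kind of sensitivity a semantic check is meant to avoid.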

[80] Adaptive Hebbian Memory Routing in Vision Transformers for Few-Shot Learning cs.CV

Mohammed Yusuf Mujawar, Noorbakhsh Amiri Golilarz

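TL;DR: This paper proposes Adaptive Hebbian Routing for few-shot learning with Vision Transformers. A lightweight MLP router dynamically controls the contribution of Hebbian memory, the strength of memory updates, and the retention of previous memory, improving adaptation to new classes.

Details

Motivation: In few-shot image recognition, fixed Hebbian memory behavior may not suit every task; the goal is to let the memory mechanism adapt to each task.

Result: Experiments use ViT-Small, DeiT-Small, and Swin-Tiny under 5-way 1-shot evaluation on Omniglot and CIFAR-FS. In the direct Swin comparison, Adaptive Plasticity improves the fixed Hebbian result from 96.74% to 96.92%, and Fully Adaptive Routing achieves the best result at 96.94%. The fully adaptive Swin model also reduces inference time from 16.51 ms to 14.05 ms relative to fixed Hebbian Swin.

Insight: The novelty is a set of adaptive mechanisms (Adaptive Placement, Adaptive Plasticity, and Fully Adaptive Routing) that modulate Hebbian memory dynamically rather than keeping its behavior fixed. The idea may be useful for improving representation adaptability and efficiency in few-shot learning.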

Abstract: Few-shot image recognition requires models to adapt to new classes from a small labeled support set. Hebbian fast-weight memory can provide temporary associative information during an episode, but fixed memory behavior may not be appropriate for every few-shot task. In this work, we propose Adaptive Hebbian Routing for few-shot Vision Transformers. The method uses a lightweight MLP router to control the contribution of Hebbian memory, the strength of memory updates, and the retention of previous memory from support-set features. We study Adaptive Placement, Adaptive Plasticity, and Fully Adaptive Hebbian Routing. Experiments use ViT-Small, DeiT-Small, and Swin-Tiny under 5-way 1-shot evaluation on Omniglot, CIFAR-FS, and cross-domain transfer from CIFAR-FS to Omniglot. In the direct Swin comparison, fixed and adaptive Hebbian variants use the same memory location. Adaptive Plasticity improves the fixed Hebbian result from 96.74% to 96.92%, while Fully Adaptive Routing achieves the best result at 96.94%. The fully adaptive Swin model also reduces inference time from 16.51 ms to 14.05 ms relative to fixed Hebbian Swin. On CIFAR-FS, adaptive variants improve performance across all three backbones, and the multi-shot evaluation shows that these gains remain useful as the number of support examples increases. These results show that adaptive plasticity and adaptive memory activation can improve few-shot Transformer representations beyond fixed Hebbian behavior.
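
The three quantities the router controls (memory contribution, update strength, and retention) can be pictured with a generic Hebbian fast-weight sketch. This is an assumption-laden illustration, not the paper's implementation: the outer-product update rule, the function names `hebbian_update` and `read_memory`, and the scalar values standing in for router outputs are all hypothetical.

```python
def hebbian_update(memory, key, value, plasticity, retention):
    """Return retention * M + plasticity * outer(value, key)."""
    return [
        [retention * memory[i][j] + plasticity * value[i] * key[j]
         for j in range(len(key))]
        for i in range(len(value))
    ]


def read_memory(memory, query, features, gate):
    """Add the gated read-out (M @ query) to the backbone features."""
    readout = [sum(row[j] * query[j] for j in range(len(query))) for row in memory]
    return [f + gate * r for f, r in zip(features, readout)]


# Hypothetical router outputs; in the paper an MLP router predicts such controls.
gate, plasticity, retention = 0.5, 1.0, 0.9

memory = [[0.0, 0.0], [0.0, 0.0]]
memory = hebbian_update(memory, key=[1.0, 0.0], value=[0.5, -0.5],
                        plasticity=plasticity, retention=retention)
out = read_memory(memory, query=[1.0, 0.0], features=[0.0, 0.0], gate=gate)
print(out)
```

Setting `gate` to zero recovers the plain backbone features, while a fixed Hebbian variant would correspond to keeping all three controls constant across tasks.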

[81] Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations cs.CV | cs.AI

Jonas Klotz, Cassio F. Dantas, Pallavi Jain, Diego Marcos, Begüm Demir

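TL;DR: This paper presents a framework for evaluating the interpretability of sparse autoencoders (SAEs) based on human-annotated concepts and targeted attribute perturbations. It constructs the synthetic datasets synCUB and synCOCO and introduces the Fully-Binary Matching Pursuit (FBMP) algorithm and the Targeted Attribute Perturbation Alignment Score (TAPAScore) to quantify semantic alignment between SAE latents and human concepts, finding that moderate dictionary sizes yield the best interpretability.

Details

Motivation: Existing SAE evaluation methods rely mainly on proxy metrics or qualitative inspection and do not measure semantic correspondence directly. This work aims to build a human-grounded framework that evaluates SAE interpretability quantitatively without user studies.

Result: FBMP consistently outperforms one-to-one baselines, and under sanity checks the proposed matching and TAPAScore are the only evaluated metrics that reliably distinguish trained SAEs from untrained ones. Across SAEs trained on CLIP and DINOv2 embeddings, increased overcompleteness can reduce perturbation alignment, indicating reduced interpretability, and moderate dictionary sizes provide the best trade-off.

Insight: The contributions are: 1) intervention-style evaluation through synthetic datasets and targeted perturbations; 2) the FBMP matching algorithm, which supports many-to-one mappings; 3) TAPAScore for functional validation. The framework gives interpretability research a quantifiable benchmark and exposes a trade-off between model complexity and interpretability.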

Abstract: Sparse autoencoders (SAEs) are increasingly used to extract interpretable concepts from vision and vision language models, yet existing evaluation methods largely rely on proxy metrics or qualitative inspection rather than measuring semantic correspondence. We present a human-grounded evaluation framework that quantifies alignment between SAE latents and human-annotated concepts, without requiring user studies, and validate this matching through targeted attribute perturbations. To enable this intervention-style evaluation in vision, we construct synCUB and synCOCO, synthetic benchmarks of paired images that differ in exactly one attribute. We introduce Fully-Binary Matching Pursuit (FBMP), a coalition-based matching procedure that supports many-to-one mappings between SAE latents and annotated concepts, and consistently outperforms one-to-one baselines. For functional validation, we propose a Targeted Attribute Perturbation Alignment Score (TAPAScore), which tests whether matched concepts respond selectively and in the expected direction under targeted image-level attribute perturbations. Under sanity checks, our matching and TAPAScore are the only evaluated metrics that reliably distinguish trained SAEs from untrained ones. Across SAEs trained on CLIP and DINOv2 embeddings, we find that increased overcompleteness can reduce perturbation alignment, indicating a reduction in interpretability. Our evaluation framework suggests that moderate dictionary sizes provide the best trade-off, yielding the most interpretable SAEs. Code and datasets are available at https://github.com/JonasKlotz/sae-concept-eval.

[82] BioMedVR: Confusion-Aware Mixture-of-Prompt Experts for Biomedical Visual Reprogramming cs.CV

Jiaxiang Liu, Tianxiang Hu, Juwei Guan, Yujie Wu, Yusong Wang

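TL;DR: This paper presents BioMedVR, a Visual Reprogramming (VR) framework for efficiently adapting pretrained vision-language models such as CLIP to biomedical imaging. It introduces a Confusion Minimization Mechanism and a Mixture-of-Prompt Experts, using LLM-generated confusion-aware attributes and a Confusion-Suppression Loss to reduce misalignment between fine-grained classes and achieve parameter-efficient few-shot adaptation.

Details

Motivation: Fully fine-tuning vision-language models on biomedical images is computationally expensive and data are scarce, while existing VR approaches focus mainly on positive class prompts and overlook confusing negatives, leading to miscalibrated predictions in fine-grained medical scenarios.

Result: Experiments on 18 datasets (11 biomedical datasets and 7 natural image benchmarks) show that BioMedVR achieves superior accuracy and generalization, effectively bridging VR and VLMs in biomedical domains.

Insight: The contributions are: what the authors describe as the first VR-based framework for biomedical imaging; a Confusion Minimization Mechanism that uses LLM-generated attributes to explicitly suppress false-positive alignment; and a Mixture-of-Prompt Experts that balances positive and negative experts via adaptive gating to improve fine-grained classification.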

Abstract: Recent advances in vision-language models (VLMs) such as CLIP have demonstrated strong generalization across natural-image domains. However, adapting these models to biomedical imaging is non-trivial: full-model fine-tuning is computationally expensive, while medical data are often scarce and exhibit subtle, fine-grained inter-class differences, making parameter-efficient adaptation particularly critical. Visual Reprogramming (VR) offers a parameter-efficient alternative by injecting learnable perturbations into the input space, but existing VR approaches for VLMs mainly focus on positive class prompts and overlook confusing negatives, leading to miscalibrated predictions in fine-grained medical scenarios. We present BioMedVR, the first VR-based framework for biomedical imaging, enabling few-shot adaptation of pretrained VLMs through compact learnable VR modules. To mitigate class confusion, we introduce a Confusion Minimization Mechanism that leverages LLM-generated confusion-aware attributes together with a Confusion-Suppression Loss to explicitly reduce false-positive alignment. Moreover, the designed Mixture-of-Prompt Experts combines a positive expert for main-class discrimination and a negative expert for confusion suppression, balanced via adaptive gating. Extensive experiments on 18 datasets, including 11 biomedical datasets and 7 natural image benchmarks, demonstrate that BioMedVR achieves superior accuracy and generalization, effectively bridging VR and VLMs in biomedical domains.

[83] Compact Object-Level Representations with Open-Vocabulary Understanding for Indoor Visual Relocalization cs.CV | cs.RO

Zhaopeng Cui, Jiarui Hu, Jingbo Liu, Boming Zhao, Xiyue Guo

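TL;DR: This paper proposes OpenReLoc, an indoor visual relocalization system that organizes rich object information in a scene (semantics, layout, and geometry) into a structured map representation and uses object units to drive camera relocalization.

Details

Motivation: Prior work on indoor visual relocalization has focused mainly on low-level vision schemes that struggle to perceive scene semantics and composition, limiting interpretability and applicability. This paper explores how to organize object information into a structured representation to address this.

Result: Experiments show that OpenReLoc achieves superior relocalization recall and accuracy across various datasets.

Insight: The contributions are: a multi-modal mechanism built on foundation models that integrates open-vocabulary semantic knowledge for effective 2D-3D object matching; object-oriented reference frames used as position priors, with a reference frame selection strategy based on the Distance-IoU (DIOU) that extends to scalable scenes; and a dual-path 2D Iterative Closest Pixel loss guided by object shape for stable, accurate pose optimization.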

Abstract: Indoor visual relocalization plays a critical role in emerging spatial and embodied AI applications. However, prior research was predominantly devoted to low-level vision schemes, struggling to perceive scene semantics and compositions, which limits both interpretability and applicability. In this paper, we explore the issue of how to organize rich object information in a scene, including semantics, layout, and geometry, into a structured map representation, thereby utilizing object units exclusively to drive the camera relocalization task. To this end, we propose OpenReLoc, a camera relocalization system designed to provide scene understanding and accurate pose estimation capabilities. Leveraging recent foundation models, we first introduce a multi-modal mechanism to integrate open-vocabulary semantic knowledge for effective 2D-3D object matching. Additionally, we design object-oriented reference frames as position priors, paired with a reference frame selection strategy based on the Distance-IoU (DIOU), enabling extension to scalable scenes. Moreover, to ensure stable and accurate pose optimization, we also propose a dual-path 2D Iterative Closest Pixel loss guided by object shape. Experimental results demonstrate that OpenReLoc achieves superior relocalization recall and accuracy across various datasets. Our source code will be released upon acceptance.
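
Distance-IoU is commonly defined as the IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. A minimal sketch for 2D axis-aligned boxes follows; the function name `diou` and the `(x1, y1, x2, y2)` box format are assumptions for illustration, and the abstract does not state which box parameterization OpenReLoc uses or how the score enters reference frame selection.

```python
def diou(box_a, box_b):
    """Distance-IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU from the intersection rectangle.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2 if c2 > 0 else iou


same = diou((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
apart = diou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
print(same, apart)
```

Unlike plain IoU, the score keeps decreasing as disjoint boxes move further apart, so it can still rank candidates that do not overlap at all.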

[84] UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving cs.CV | cs.AI

Xiaowei Gao, Pengxiang Li, Yitai Cheng, Ruihan Xu, James Haworth

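TL;DR: This paper proposes UniDrive, a unified vision-language and grounding framework for interpretable risk understanding in autonomous driving. It combines a multi-frame temporal reasoning branch with a high-resolution perception branch and uses a gated cross-attention fusion module to align dynamic context with precise spatial evidence, jointly generating natural-language risk descriptions and grounded bounding boxes for risk objects.

Details

Motivation: Existing MLLMs for autonomous driving scene understanding face a fundamental trade-off between temporal reasoning and spatial precision: models that rely on single-frame or low-resolution inputs often miss small, distant, or partially occluded hazards, while language-centric driving models frequently provide limited grounded evidence for their explanations.

Result: On the DRAMA-Reasoning benchmark, UniDrive outperforms representative image-based and video-based baselines in both captioning and risk-object grounding. It achieves the best overall performance on the validation split and shows clear advantages in small-object localization, zero-shot generalization to NuScenes and BDD100K, and human-rated interpretability and trustworthiness.

Insight: The novelty is explicitly combining temporal semantics with high-resolution perception, with a gated cross-attention fusion module aligning dynamic context and precise spatial evidence, which provides a stronger foundation for interpretable, safety-oriented autonomous driving systems. The dual-branch design addresses the trade-off between temporal reasoning and spatial precision in existing methods.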

Abstract: Recent multimodal large language models (MLLMs) have shown strong potential for autonomous driving scene understanding, yet existing methods still face a fundamental trade-off between temporal reasoning and spatial precision. Models that rely on single-frame or low-resolution inputs often miss small, distant, or partially occluded hazards, while language-centric driving models frequently provide limited grounded evidence for their explanations. To address this gap, we propose UniDrive, a unified visual-language and grounding framework for interpretable risk understanding in autonomous driving. UniDrive combines a temporal reasoning branch that models scene dynamics from multi-frame visual input with a high-resolution perception branch that preserves fine-grained spatial details from the latest frame. The two branches are integrated through a gated cross-attention fusion module, enabling dynamic context to be aligned with precise spatial evidence. Based on the fused representation, UniDrive jointly generates natural-language risk descriptions and grounded bounding-box outputs for risk objects. Experiments on the DRAMA-Reasoning benchmark show that UniDrive outperforms representative image-based and video-based baselines in both captioning and risk-object grounding. In particular, UniDrive achieves the best overall performance on the validation split and demonstrates clear advantages in small-object localization, zero-shot generalization to NuScenes and BDD100K, and human-rated interpretability and trustworthiness. These results suggest that explicitly combining temporal semantics and high-resolution perception provides a stronger foundation for interpretable and safety-oriented autonomous driving systems. The code is available at https://github.com/pixeli99/unidrive-dev.

[85] Revealing Training Data Exposure in Vision Language Large Models via Parameter Gradients cs.CV

Zhihao Zhu, Hongyi Tang, Yi Yang, Ahmed Abbasi

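TL;DR: This paper proposes GradAudit, a gradient-based auditing framework for detecting training data exposure in Vision-Language Large Models (VLLMs). Rather than relying on model output signals, it analyzes how gradients on training samples differ from gradients on non-training samples, which lets it identify whether a specific image-text pair was used in training.

Details

Motivation: VLLMs trained on massive crawled corpora raise copyright and data-provenance concerns, which are especially acute in privacy-sensitive domains such as healthcare. Existing training data detection methods either fail in cross-modal scenarios or lack discriminative power.

Result: Experiments on medical and general-domain datasets show that GradAudit substantially outperforms state-of-the-art baselines on both pretrained and fine-tuned VLLMs. A case study on copyrighted content shows that existing methods underestimate the extent of unauthorized data usage, and that this underestimation grows as models become more recent and more advanced.

Insight: The novelty is detecting training data from internal optimization dynamics (gradient alignment and stability) rather than black-box outputs, which identifies genuine cross-modal associations instead of single-modality membership alone. Gradient-signature analysis gives model auditing a more fundamental and more discriminative signal.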

Abstract: Vision-Language Large Models (VLLMs) trained on massive crawled corpora raise pressing copyright and data-provenance concerns. These concerns are particularly acute in healthcare, where patient medical images paired with clinical reports demand rigorous privacy safeguards. However, existing training data detection methods either fail in cross-modal scenarios or rely on superficial output signals with insufficient discriminative power. We introduce GradAudit, a gradient-based auditing framework that examines internal optimization dynamics rather than treating VLLMs as black boxes. Our approach builds on a key observation: model parameters converge to regions where gradients on training samples become stable and well-aligned, whereas gradients on non-training samples remain noisy and inconsistent. By analyzing these gradient signatures, GradAudit achieves strong separability and detects genuine image-text associations learned during training, not merely individual modality membership. Empirically, across both medical and general-domain datasets, GradAudit substantially outperforms state-of-the-art baselines in both pretraining and fine-tuning VLLMs. In a case study employing copyrighted content, we show that existing training data detection methods not only underestimate the extent of unauthorized data usage, but that this underestimation becomes more pronounced as models become more recent and more advanced.

[86] Counting Trees from Satellite Imagery with Noisy Supervision cs.CV

Dimitri Gominski, Maurice Mugabowindekwe, Qiue Xu, Xiaowei Tong, Martin Brandt

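TL;DR: This paper proposes a method for counting trees in satellite imagery based on Unbalanced Optimal Transport. It formulates tree counting as a spatial density matching problem and introduces a self-correction mechanism that progressively refines noisy supervision. The method is evaluated on TinyTrees, a benchmark spanning three continents and three satellite sensors with over 215 million tree annotations (including 773K manually verified instances) across 23,000 sq. km.

Details

Motivation: Counting individual trees from satellite imagery is hard for three reasons: crown boundaries become ambiguous in dense forests, making the notion of an individual tree ill-defined; large-scale manual annotation is prohibitively expensive; and scalable supervision derived from airborne LiDAR is noisy and difficult to exploit effectively.

Result: On the new TinyTrees benchmark, which covers 23,000 sq. km, the method consistently outperforms detection-based, regression-based, and transport-based distribution-matching baselines, demonstrating the effectiveness of unbalanced transport and reliability-aware supervision for large-scale tree counting from satellite imagery.

Insight: The contributions are recasting tree counting as spatial density matching, using Unbalanced Optimal Transport to handle both precise localization of isolated trees and robust density estimation in dense forests, and a self-correction mechanism based on transport residuals that progressively refines noisy supervision during training, improving robustness to imperfect annotations.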

Abstract: Counting individual trees is a fundamental task for environmental monitoring, yet remains largely unexplored with satellite imagery. At these resolutions, isolated trees may still be identifiable, but crown boundaries become ambiguous in dense forests, making the notion of an individual tree inherently ill-defined. Moreover, large-scale manual annotations of individual trees are prohibitively expensive. While scalable supervision can be derived from airborne LiDAR, the resulting annotations are noisy and difficult to exploit effectively. We address these challenges by formulating tree counting as a spatial density matching problem supervised through Unbalanced Optimal Transport. This formulation naturally accommodates both precise localization of isolated trees and robust density estimation in dense forests. We further introduce a self-correction mechanism that leverages transport residuals to progressively refine noisy supervision during training. We evaluate our approach on TinyTrees, a new benchmark spanning three continents and three satellite sensors, comprising over 215 million tree annotations (including 773K manually verified instances) across 23,000 sq. km. Our method consistently outperforms detection-based, regression-based, and transport-based distribution-matching baselines, demonstrating the effectiveness of unbalanced transport and reliability-aware supervision for large-scale tree counting from satellite imagery. Code, data and models are available at https://github.com/dgominski/treematch.
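
To make the idea of unbalanced transport concrete, here is a minimal sketch of entropic unbalanced optimal transport with KL-relaxed marginals, solved by Sinkhorn-style scaling. The abstract does not say which solver, penalty, or hyperparameters the authors use, so the function name `unbalanced_sinkhorn`, the KL relaxation, and the values of `eps` and `rho` are all assumptions for illustration.

```python
import math


def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, iters=200):
    """Entropic unbalanced OT with KL marginal penalties of weight rho.

    a, b: nonnegative source and target masses (totals may differ).
    cost: cost[i][j] is the ground cost between source i and target j.
    Returns the transport plan as a nested list.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    # Exponent below 1 softens the marginal constraints; rho -> infinity
    # recovers the balanced Sinkhorn updates.
    power = rho / (rho + eps)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(a[i] / sum(kernel[i][j] * v[j] for j in range(m))) ** power
             for i in range(n)]
        v = [(b[j] / sum(kernel[i][j] * u[i] for i in range(n))) ** power
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical example: predicted density of mass 2 against noisy annotations of mass 6.
plan = unbalanced_sinkhorn([1.0, 1.0], [3.0, 3.0], [[0.0, 1.0], [1.0, 0.0]])
mass = sum(sum(row) for row in plan)
print(mass)
```

Because the marginals are only penalized rather than enforced, the plan does not have to move all of either mass, which is what lets a formulation of this kind tolerate mismatched counts between predictions and noisy annotations.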

[87] EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence cs.CV | cs.AI

Linpeng Huang, Weixing Chen, Zexin Chen, Yang Liu, Liang Lin

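TL;DR: This paper presents EG-VQA, a benchmark for evaluating whether video question answering models ground their answers in verifiable temporal evidence from the video. It contains 2,067 videos and 11,838 QA pairs, each annotated with supporting temporal evidence segments. To address the evidence-grounding weaknesses of existing models, the authors also propose EG-Reasoner, a reasoning model trained with explicit evidence supervision.

Details

Motivation: Video-LLMs have made progress on video QA, but evaluation focuses mainly on answer correctness and ignores whether answers are grounded in relevant video evidence. This disconnect between answer generation and evidence understanding motivates an evidence-grounded video QA benchmark that requires joint reasoning and precise evidence localization.

Result: Experiments show that even strong proprietary models struggle to accurately ground their predictions, revealing a fundamental discrepancy between answer correctness and faithful evidence localization. EG-Reasoner achieves state-of-the-art performance among open-source models and is competitive with proprietary systems, with particularly pronounced gains on reasoning-intensive tasks such as counterfactual questions.

Insight: The contributions are EG-VQA, an open-ended video QA benchmark that explicitly requires evidence localization, and EG-F1, a unified metric that jointly measures temporal alignment and semantic consistency. The core insight is that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for developing more reliable and interpretable VideoQA systems.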

Abstract: Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evaluation protocol in which each QA pair is explicitly annotated with supporting temporal evidence, thereby requiring joint reasoning and precise evidence localization. EG-VQA is comprised of 2,067 videos and 11,838 QA pairs with fine-grained evidence annotations. To evaluate predicted evidence, Evidence-Grounded F1 (EG-F1) is introduced as a unified metric in which temporal alignment and semantic consistency against ground-truth evidence are jointly measured. Experimental evaluation reveals that even strong proprietary models struggle to accurately ground their predictions, exposing a fundamental discrepancy between answer correctness and faithful evidence localization. To bridge this gap, EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision, is proposed. State-of-the-art performance is achieved among open-source models, with results competitive against proprietary systems; particularly pronounced gains are observed on reasoning-intensive tasks such as counterfactual questions. These findings demonstrate that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for the development of more reliable and interpretable VideoQA systems.

[88] OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis cs.CV | cs.AI

Chenrui Fan, Paolo Favaro

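TL;DR: OrbitForge is a reconstruction-anchored video synthesis method that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. It is built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization, requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and improves 3D consistency.

Details

Motivation: Videos from existing text-to-video models do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames are often inconsistent across time. The goal is to use 3D reconstruction as an anchor to improve the 3D consistency of the generated video.

Result: On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span and raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV in coverage quality.

Insight: The contributions include using 3D reconstruction as an anchor to improve the 3D consistency of the video, obtaining a preliminary reconstruction via Deformable Gaussian Splatting with a MedianGS proxy, and completing only the missing views rather than generating views progressively. The method avoids fine-tuning and score-distillation optimization while still producing closed-orbit scenes.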

Abstract: Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today’s generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames often contain inconsistencies across time. We introduce OrbitForge, an adapter built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. We use 3D reconstruction as an anchor to improve the 3D consistency of the generated video. We obtain a preliminary 3D reconstruction from a first generated video via Deformable Gaussian Splatting with a robust MedianGS proxy. We render views from a prescribed orbit to detect missing viewpoints. OrbitForge uses the text-to-video model to complete only the missing views, and reconstructs the completed orbit into a final Gaussian Splatting scene. This design requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and does not progressively generate views one step at a time. We further argue that this setting demands coverage-aware evaluation: local smoothness alone rewards methods that never attempt a full orbit. On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span, raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV on the coverage-quality.

[89] GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction cs.CV

Chenrui Fan, Paolo Favaro

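TL;DR: This paper presents GeoT2V-Bench, a reconstruction-based diagnostic benchmark for evaluating whether clips from camera-prompted text-to-video (T2V) models can support explicit rigid 3D reconstruction. The pipeline estimates camera parameters, fits DeformableGS, derives a static MedianGS proxy, renders it, and reports a continuous reconstruction profile covering multiple metrics. An evaluation over 12 open-weight model configurations and 80 static-scene prompts finds that the metrics often disagree.

Details

Motivation: When camera-prompted T2V models are used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes, visual plausibility alone is insufficient: the generated frames should also provide coherent multi-view evidence for a single static 3D scene. A method is therefore needed to evaluate whether these videos support 3D reconstruction.

Result: On a fair-format four-seed evaluation with 3,840 completed reconstructions, GeoT2V-Bench shows that visible motion, static rendering error, flow agreement, and the gap between flexible and static fits often disagree, capturing complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.

Insight: The novelty is a reconstruction-based diagnostic benchmark that reports a continuous reconstruction profile instead of a pass/fail label or a single scalar score, giving a finer-grained and complementary view of the geometric coherence of generated videos.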

Abstract: Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes. For these outputs, visual plausibility is insufficient: the generated frames should also provide coherent multi-view evidence for a single static 3D scene. We introduce GeoT2V-Bench, a reconstruction-based diagnostic benchmark for evaluating whether camera-prompted T2V clips can support explicit rigid 3D reconstruction. Our pipeline estimates per-frame camera intrinsics and poses with VGGT-style geometry estimation, fits DeformableGS, derives a static MedianGS proxy by temporal-median aggregation, and renders this proxy along the estimated camera path. Instead of producing a pass/fail label or a single scalar score, GeoT2V-Bench reports a continuous reconstruction profile covering apparent image motion, estimated trajectory behavior, MedianGS static rendering error, static-render flow agreement, and the gap between flexible and static fits. On a fair-format four-seed evaluation with 3,840 completed reconstructions from 12 open-weight model configurations and 80 GeCo-Eval static-scene prompts, we find that visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree. GeoT2V-Bench therefore captures complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.
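
The temporal-median aggregation step that produces the static proxy can be sketched in a few lines. The abstract does not specify which Gaussian attributes are aggregated, so this illustration assumes per-frame 3D positions only; the function name `temporal_median` and the toy track are hypothetical.

```python
from statistics import median


def temporal_median(tracks):
    """Collapse per-frame positions into one static position per Gaussian.

    tracks[t][g] is the (x, y, z) position of Gaussian g at frame t.
    """
    num_gaussians = len(tracks[0])
    return [
        tuple(median(frame[g][axis] for frame in tracks) for axis in range(3))
        for g in range(num_gaussians)
    ]


# One Gaussian observed over three frames; the last frame is an outlier.
tracks = [
    [(0.0, 0.0, 1.0)],
    [(0.1, 0.0, 1.0)],
    [(5.0, 0.0, 1.0)],
]
static = temporal_median(tracks)
print(static)
```

A median ignores the outlier frame where a mean would be pulled toward it, which is one plausible reason to prefer it when deriving a static scene from a deformable fit.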

[90] IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation cs.CV | cs.AI

Zixuan Li, Haokun Lin, Yicheng Xiao, Zhiwei Li, Xinyang Song

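TL;DR: This paper proposes IV-CoT (Implicit Visual Chain-of-Thought), a latent visual reasoning framework for structure-aware text-to-image generation. It decomposes the visual conditioning queries into a structural-to-semantic cascade: structural queries first form a latent visual plan, and semantic queries then render appearance conditioned on this plan. This targets the difficulty multi-modal large language models have in following structure-aware prompts (object counts, spatial relations, and so on).

Details

Motivation: Unified MLLMs achieve strong text-to-image generation quality but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. The authors attribute this in part to the entanglement of structural planning and appearance rendering within a single conditioning stream.

Result: IV-CoT achieves superior results on GenEval and T2I-CompBench while performing implicit CoT reasoning in a single forward pass. Visualizations and analyses show that the learned structural and semantic queries play complementary roles in structure-aware generation.

Insight: The core novelty is decomposing the visual conditioning queries into a structural-to-semantic cascade and introducing training-only sketch supervision that guides the structural queries to capture structure, with no sketch extraction or intermediate decoding at inference time. This yields an implicit, efficient decoupling of structural planning from appearance rendering.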

Abstract: Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.

[91] BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases cs.CV

Qi Chen, Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal

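TL;DR: This paper presents BenchX, a large-scale open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. It finds that current state-of-the-art AI models perform poorly in rare or underrepresented subgroups (such as young, female African Americans), underscoring the need for rigorous subgroup-level evaluation in medical imaging and computer vision.

Details

Motivation: AI models have achieved remarkable success in medical imaging, but their performance is often inconsistent in real-world clinical settings when patient demographics and imaging protocols vary. The paper aims to quantify these inconsistencies to support the development of more reliable and robust tumor-detection AI models.

Result: The benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African American patients.

Insight: The contributions include using large language models to extract and organize subgroup information from clinical data, which makes the analysis scalable and reproducible. The benchmark also provides a foundation for building more reliable AI models and highlights the importance of subgroup-level evaluation.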

Abstract: Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlighting the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code

[92] FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation cs.CV | cs.AI

Haorui Ji, Weizhe Liu, Hongdong Li, Hengkai Guo

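TL;DR: FLUX3D is a framework for generating high-fidelity 3D Gaussian Splatting (3DGS) assets from images. It proposes Diffusion-Aligned Structured Latents (DA-SLAT) and a sparse-structure-aware diffusion framework (comprising SMDiT and MARoPE) to address the bottlenecks of existing sparse-voxel methods in preserving high-frequency image details and achieving 2D-3D cross-modal alignment.

Details

Motivation: Existing sparse-voxel image-to-3DGS generation methods have two structural bottlenecks. First, the discriminative 2D features used to construct sparse voxel latents suppress reconstructive cues, creating a representation bottleneck. Second, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, creating a cross-modal correspondence bottleneck.

Result: Extensive benchmark experiments show that FLUX3D yields substantial improvements in appearance fidelity and, according to the authors, significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.

Insight: The contributions are revisiting 2D feature selection for sparse-voxel 3D representation learning, with DA-SLAT proposed to improve reconstruction fidelity, and a sparse-structure-aware diffusion framework that uses SMDiT and MARoPE to achieve geometry-agnostic 2D-3D alignment and address the cross-modal correspondence problem.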

Abstract: Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT) and couple it with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.

[93] FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation cs.CV

Orest Kupyn, Goutam Bhat, Philipp Henzler, Fabian Manhardt, Christian Rupprecht

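TL;DR: This paper proposes FLAT, which shows for the first time that triangle primitives (triangle splats) can be decoded directly from the latent space of video diffusion models to generate 3D scenes with accurate geometry. A ray-centered rotation parameterization and a novel product window function address the gradient-flow problem in triangle regression. FLAT substantially improves geometric accuracy while maintaining visual quality, and its output can be further refined into a game-engine-ready representation that supports real-time rendering.

Details

Motivation: Existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are renderable and closer to explicit geometric assets.

Result: On standard benchmarks, FLAT achieves significantly better geometric accuracy than state-of-the-art feedforward baselines while maintaining competitive visual quality.

Insight: The core novelty is decoding triangle primitives directly from video diffusion latents for the first time, with a ray-centered rotation parameterization and a product window function that stabilize training. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, the paper also provides the first systematic analysis of representation trade-offs in feedforward scene generation.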

Abstract: Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, oftentimes leading to poor gradient flow. FLAT addresses this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io

cs.SD

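[94] VieSpeaker: A Large-Scale Vietnamese Speaker Recognition Dataset Beyond Visual Dependency cs.SD | cs.CL | eess.AS

Viet Hoang Pham, Tran Trung Nguyen, Bao Thu Ho, Phuong Tuan Dat, Thi Thu Trang Nguyen

TL;DR: This paper presents VieSpeaker, a large-scale Vietnamese speaker recognition dataset. It is built with a face-independent construction pipeline based on textual metadata and large language model reasoning, addressing the scarcity of Vietnamese speech data and the limited scale and acoustic diversity of existing datasets.

Details

Motivation: Vietnamese speaker recognition lacks large-scale, acoustically diverse datasets, and existing large-scale datasets usually rely on facial cues to link speech with speaker identities, which restricts the scope of data collection.

Result: Models trained on VieSpeaker show improved robustness and generalization compared with models trained on existing Vietnamese datasets.

Insight: The novelty is a method for building a large-scale speaker recognition dataset without visual (facial) information, using large language model reasoning over transcripts and contextual information to infer speaker identities. This offers a new direction for building large-scale speech resources.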

Abstract: Speaker recognition has advanced rapidly with large-scale training datasets, yet Vietnamese remains under-resourced, with existing corpora limited in scale and acoustic diversity. Most large-scale datasets rely on facial cues to link speech with speaker identities, restricting data collection to recordings where speakers appear on camera. We propose a face-independent dataset construction pipeline and introduce VieSpeaker, a large-scale Vietnamese speaker recognition dataset. Our approach leverages textual metadata and large language model reasoning to infer speaker identities from transcripts and contextual information. VieSpeaker contains approximately 902 hours of speech from 4,715 speakers. Experiments show that models trained on VieSpeaker achieve improved robustness and generalization compared to existing Vietnamese datasets. This work demonstrates the feasibility of face-independent dataset construction and provides a new direction for building large-scale speech resources.

cs.AI

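[95] Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs cs.AI | cs.CL

Xiangbo Gao, Xiukun Huang, Boyu Lu, Junge Zhang, Mengjie Mao

TL;DR: This paper proposes Neuro-Symbolic Drive, a neuro-symbolic driving framework that supervises a driving VLA with rule-grounded reasoning traces extracted directly from classical rule-based planners. The planner's internal decisions are serialized into structured reasoning and used to fine-tune Qwen3.5-4B, so that reasoning is structurally coupled to motion generation.

Details

Motivation: Driving VLA models with Chain-of-Thought reasoning can leverage pretrained representations and expose intermediate decisions, but their rationales often lack step-by-step decision semantics, leaving the reasoning only weakly connected, in causal terms, to the planned motion.

Result: On the simulator-generated benchmark, the method reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and reduces ADE@3s from 0.54 to 0.26 and miss rate from 10.13% to 5.99% under eight-camera perception.

Insight: The core novelty is treating the rule-based planner as an executable reasoning engine and extracting its internal decision traces directly as structured supervision. This ensures causal consistency between reasoning and action generation by construction rather than by post-hoc alignment, and converts neuro-symbolic planning logic into supervision.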

Abstract: Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the step-by-step decision semantics needed to keep the rationale causally connected to the planned motion. We introduce Neuro-Symbolic Drive, a neuro-symbolic driving framework that supervises a driving VLA with rule-grounded reasoning traces extracted directly from classical rule-based planners. Our key observation is that rule-based planners are symbolic AI systems that already function as executable reasoning engines: they reason about active safety constraints, search over candidate maneuvers, and select a final trajectory. We instrument these planners in simulation to capture both the executed trajectory and the internal decision trace at each rule-evaluation step. Each trace is serialized into structured rule-grounded reasoning and paired with the trajectory to fine-tune Qwen3.5-4B as a driving VLA. Because these traces are derived directly from the planner states that determine the action, they ensure reasoning is structurally coupled to motion generation by construction, rather than by post-hoc alignment. On our simulator-generated benchmark, detailed rule-grounded reasoning reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and from 0.54 to 0.26 and 10.13% to 5.99% under eight-camera perception. Neuro-Symbolic Drive thus converts neuro-symbolic planning logic into structured supervision. Code base: https://github.com/XiangboGaoBarry/Neural-Symbolic-Drive.
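The ADE@3s figures above are read here as average displacement error over a 3-second planning horizon, which is the usual meaning of the acronym; the abstract does not state the sampling rate or coordinate frame. A minimal sketch under that assumption, using hypothetical 2D waypoints and a hypothetical function name:

```python
import math

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matched waypoints (assumed ADE definition)."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(ground_truth)

# Hypothetical waypoints in metres; the values are illustrative, not from the paper.
planned = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.6)]
ade = average_displacement_error(planned, reference)
```

Lower is better: a planned trajectory that matches the reference at every waypoint gives an error of zero.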

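[96] Reinforcement Learning Towards Broadly and Persistently Beneficial Models cs.AI | cs.CL

Akshay V. Jagadeesh, Rahul K. Arora, Khaled Saab, Ali Malik, Mikhail Trofimov

TL;DR: This paper studies whether reinforcement learning (RL) on beneficial behavior in realistic domains improves alignment generalization to out-of-distribution tasks. The authors construct a dataset covering beneficial traits such as truthfulness, fairness, risk awareness, and corrigibility, and train across domains including health, science, and education. Compared with a compute-matched baseline, beneficial trait RL improves performance on over 80% of out-of-distribution benchmarks and shows cross-domain alignment transfer and stronger persistence.

Details

Motivation: As AI systems are deployed in increasingly diverse and high-stakes settings, alignment must generalize beyond the tasks and domains seen in training, particularly because RL can introduce unexpected misalignment such as reward hacking and deception.

Result: Across more than 50 independent benchmarks of alignment and beneficial behavior, beneficial trait RL outperforms the compute-matched baseline on over 80% of out-of-distribution benchmarks. Training limited to the health domain generalizes to non-health evaluations, reducing reward hacking and deception, and trained models show greater persistence under adversarial prompting and harmful finetuning.

Insight: The novelty is achieving cross-domain alignment generalization and improved persistence through RL on beneficial behavior in realistic domains. The approach provides empirical grounding for building more robustly aligned models, but the sources of the persistence effect require further study.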

Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent alignment generalization beyond the training distribution. We construct a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, spanning varied domains, including health, science, and education. We then train models with RL on this dataset and evaluate them on more than 50 independent benchmarks of alignment and beneficial behavior. Compared to a compute-matched baseline, beneficial trait RL improves performance on over 80% of these out-of-distribution benchmarks. We observe substantial out-of-distribution alignment transfer: a beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. Finally, we study alignment persistence: whether behavior remains robustly aligned under attempts to steer models towards misalignment. Models trained with beneficial trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning; further work is required to isolate the sources of these effects. These results suggest that RL to reinforce beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.

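[97] Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War cs.AI | cs.CL | cs.GT | cs.MA

Arnaud Ricci

TL;DR: The paper presents Age of LLM, a turn-based 1v1 benchmark for evaluating the reasoning, diplomacy, and reliability of large language models under fog of war, full diplomacy, and strict reliability requirements. Matches are played on a 13x7 grid with the goal of destroying the enemy base. The author benchmarks 15 reasoning models across 54 matches and 5,258 actions and reports five main findings: the nuclear rush dominates, military conquest is rare, diplomacy is frequent but ineffective, illegal actions reflect belief-tracking, and reliability is weakly associated with winning.

Details

Motivation: To address data contamination in public benchmarks and to create a benchmark that systematically evaluates LLM strategic reasoning, diplomacy, and operational reliability under adversarial uncertainty such as fog of war and complex diplomacy.

Result: Across 54 matches, the nuclear rush dominates (78% on the rules-coherent sub-corpus; 85% corpus-wide); military conquest is rare but faster (12.3 vs 18.9 turns); diplomacy is frequent but almost never ends in an agreement; about 58% of illegal actions stem from fog-of-war/state errors; and reliability is weakly associated with winning. The results give a preliminary descriptive ranking of model behavior.

Insight: The novelty is a private, randomized benchmark engine that combines fog of war, diplomacy with hidden information (such as secret uranium reserves), and strict JSON-schema reliability checks, which mitigates data contamination. Beyond ranking, the turn-by-turn traces of actions and messages offer a view of LLM belief-tracking, spontaneous deception, and cognitive "personas" under adversarial uncertainty, which the author frames as a future research direction.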

Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection; see Section 2.7). We benchmark 15 reasoning models across 54 matches and 5,258 actions. Findings: (1) the nuclear rush dominates (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature that is largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure; (2) military conquest is rare but faster (12.3 vs 18.9 turns); (3) diplomacy is prolific yet almost never consummated; (4) ~58% of illegal actions are fog/state errors, making the illegal-action rate a measure of belief-tracking; (5) – the least established, and the only one we label exploratory – a weak link associates reliability with winning. The corpus is small, unbalanced and not side-swapped, so the ranking is a preliminary descriptive view, not a contribution. Beyond ranking, the turn-by-turn traces of actions and messages make the corpus a lens on how LLMs reason under adversarial uncertainty – their belief-tracking, spontaneous deception, and per-model cognitive “personas” – which we frame as a future research direction. We release the replay format, an isometric viewer and all replays; engine source on request.

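[98] A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial cs.AI | cs.CL

Haichao Chen, Songchi Zhou, Zhengyun Zhao, Shikai Hu, Xianghong Jin

TL;DR: This paper presents RaDaR, a specialized reasoning large language model for accelerating rare disease diagnosis. It outperforms other open-source models, including DeepSeek-R1, on public benchmarks and at external validation centers. In a randomized physician-assistance trial, RaDaR significantly improved physicians' diagnostic accuracy.

Details

Motivation: Rare diseases affect millions of people worldwide, yet timely diagnosis remains a major public health challenge because specialized clinical expertise is scarce. Current large language models fall short in clinical deployability, clinically grounded evidence, and training data.

Result: Across public benchmarks and four external validation centers, RaDaR showed the strongest performance among evaluated open-source models, outperforming the 671B DeepSeek-R1. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06% of cases. In the randomized trial, RaDaR assistance improved physicians' rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone.

Insight: Contributions include: 1) a compact, open-source specialized reasoning LLM for rare disease diagnosis; 2) reasoning-enhanced training on public cases and synthetic data; 3) phenotype-anchored narratives as an effective training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range; 4) a development and validation framework for deployable diagnostic AI under data scarcity.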

Abstract: Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for rare disease diagnosis. RaDaR was trained with 49,170 publicly available free-text cases and 104,666 synthetic cases with reasoning-enhanced training. RaDaR showed the strongest performance among evaluated open-source models, including the 671B DeepSeek-R1, across public benchmarks and four external validation centers. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06 percent of cases, corresponding to a potential lead time of 1.87 months and 50.18 percent of the within-center interval. In a randomized physician-assistance trial, RaDaR assistance improved physicians’ rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone. Synthetic-data ablations suggested that phenotype-anchored narratives provide useful training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range. Together, RaDaR and its development and validation framework provide a deployable rare-disease reasoning model and a reproducible development framework for diagnostic AI under data scarcity.

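[99] AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability cs.AI | cs.CL

Khanak Khandelwal

TL;DR: This paper presents AdversaBench, an end-to-end red-teaming pipeline for automated evaluation of the adversarial robustness of large language models. The system mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. Experiments cover 45 seeds across three categories (reasoning, instruction-following, and tool use), and every seed produced a confirmed failure.

Details

Motivation: Adversarial evaluation of large language models at scale requires both a way to generate hard inputs and a reliable way to confirm that the resulting failures are real. This paper addresses that challenge in automated red-teaming.

Result: Experiments on 45 seeds (reasoning, instruction-following, and tool use) produced a confirmed failure for every seed. Specific findings: operator effectiveness varies by category (for example, inject_distractor scores 0.00 on instruction-following but 0.80-0.83 on reasoning and tool use); instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for the other categories, making them harder to break; judge agreement is 80-87% while Cohen's kappa is near zero; and adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B.

Insight: The novelty is an automated red-teaming pipeline with multi-judge confirmation, together with a systematic evaluation of category-level differences, iteration difficulty, the validity of evaluation metrics, and the cross-model transferability of adversarial prompts. This offers a view of general LLM behavioral patterns rather than model-specific weaknesses.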

Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen’s kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at https://github.com/khanak0509/AdversaBench .
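The third finding, high pairwise agreement alongside near-zero Cohen's kappa, follows from the kappa formula when almost every item receives the same label. The sketch below works through that arithmetic for two judges; the verdict lists are hypothetical and are not taken from the paper's data.

```python
def agreement_and_kappa(labels_a, labels_b):
    """Return (raw agreement, Cohen's kappa) for two equal-length binary label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    rate_a = sum(labels_a) / n  # share of items judge A marks as failures
    rate_b = sum(labels_b) / n
    # Agreement expected by chance if the judges labelled independently.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return observed, 0.0  # kappa is undefined here; 0.0 is a convention for this sketch
    return observed, (observed - expected) / (1 - expected)

# Hypothetical verdicts on 20 items (1 = failure confirmed); heavily skewed toward 1.
judge_a = [1] * 17 + [1, 0, 1]
judge_b = [1] * 17 + [0, 1, 0]
observed, kappa = agreement_and_kappa(judge_a, judge_b)
```

With these lists the judges agree on 17 of 20 items, yet chance agreement is about as high because both label nearly everything as a failure, so kappa lands close to zero. This is why the abstract treats category-level disagreement rates as the more informative signal.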

stat.ML

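[100] Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web stat.ML | cs.CV | cs.LG

Weihao Li, Dianne Cook, Emi Tanaka, Susan VanderPlas, Klaus Ackermann

TL;DR: This paper presents a new approach to automated assessment of residual plots for linear models. The R package autovi and the Shiny application autovi.web use a computer vision model in place of human visual evaluation, predicting a visual signal strength (VSS) and providing supporting information to help analysts assess model fit more efficiently and consistently.

Details

Motivation: Conventional residual plot assessment relies on manual visual inspection, which scales poorly and leads to inconsistent decisions. The existing lineup protocol reduces subjectivity but requires more human effort, so an automated solution is needed.

Result: The paper introduces an automated tool based on a computer vision model that predicts the visual signal strength (VSS) of a residual plot. The abstract reports no specific benchmarks or quantitative results, such as comparisons with SOTA.

Insight: The novelty is applying computer vision to statistical diagnostics to automate residual plot assessment, with a user-friendly Shiny application for ease of use. This offers new ideas for model interpretability and for standardizing assessment workflows.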

Abstract: Visual assessment of residual plots is a common approach for diagnosing linear models, but it relies on manual evaluation, which does not scale well and can lead to inconsistent decisions across analysts. The lineup protocol, which embeds the observed plot among null plots, can reduce subjectivity but requires even more human effort. In today’s data-driven world, such tasks are well suited for automation. We present a new R package that uses a computer vision model to automate the evaluation of residual plots. An accompanying Shiny application is provided for ease of use. Given a sample of residuals, the model predicts a visual signal strength (VSS) and offers supporting information to help analysts assess model fit.

cs.IR

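[101] EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering cs.IR | cs.CL | cs.HC

Fengchen Gu, Xiaotian Ren, Zhengyong Jiang, Zhilu Zhang, Ángel F. García-Fernández

TL;DR: This paper presents EvidenceLens, a visual analytics prototype for auditing financial question answering systems. It treats financial question answering as a claim-evidence alignment problem: the system decomposes an answer into atomic claims, summarizes support composition, confidence, and support gaps, and coordinates claim-level inspection with source passages, table cells, and chart regions. Its core visual representation is a multimodal claim-evidence matrix that makes coverage, contradiction, and modality imbalance immediately visible.

Details

Motivation: Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, but their outputs are hard to verify in high-stakes financial workflows. A fluent answer can blend directly grounded statements, weak synthesis, and unsupported claims across narrative text, tables, and charts.

Result: Representative report-auditing scenarios show how EvidenceLens helps analysts distinguish grounded claims from overconfident synthesis that conventional chat interfaces obscure.

Insight: The novelty is framing financial question answering as claim-evidence alignment, with a multimodal claim-evidence matrix as the core visualization, plus a JSON-based artifact schema, a lightweight multimodal alignment pipeline, and a deterministic review-priority ranking that support reproducibility and auditing.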

Abstract: Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, yet their outputs remain difficult to verify in high-stakes financial workflows. A fluent answer can blend directly grounded statements, weak synthesis, and unsupported claims across narrative text, tables, and charts. We present EvidenceLens, a visual analytics prototype that treats financial question answering as a claim-evidence alignment problem. The system decomposes an answer into atomic claims, summarizes support composition, confidence, and support gaps, and coordinates claim-level inspection with source passages, table cells, and chart regions. Its core visual representation is a multimodal claim-evidence matrix that makes coverage, contradiction, and modality imbalance immediately visible. To support reproducibility, we also specify a JSON-based artifact schema, a lightweight multimodal alignment pipeline, and a deterministic review-priority ranking that maps backend signals into an auditable visual structure. Through representative report-auditing scenarios, we show how EvidenceLens helps analysts distinguish grounded claims from overconfident synthesis that conventional chat interfaces flatten.

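[102] PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation cs.IR | cs.CL

Kirill Dubovikov, Omar El Mansouri, Hachem Madmoun, Yanda Li, Sandeep Kumar

TL;DR: The paper proposes PETRA, a large-scale dataset and construction pipeline for retrieval adaptation in petroleum engineering. It targets the supervision gap in this domain: public web text is plentiful, but domain relevance labels are scarce. PETRA converts noisy web data into a curated domain corpus and synthetic supervision for dense retrieval and reranking, and improves in-domain retrieval performance.

Details

Motivation: Petroleum-engineering search faces a supervision gap: relevant evidence exists in public web text, but domain relevance labels are very scarce, which limits how well strong general retrievers work in this domain.

Result: PETRA improves first-stage in-domain Normalized Discounted Cumulative Gain (nDCG) from 0.703 to 0.763. Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%.

Insight: The novelty is a complete data construction pipeline that combines high-recall domain curation, a domain classifier, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists. A key finding is that high train-holdout accuracy on synthetic labels does not predict retrieval gains; retrieval-mined data helps only after being repackaged as teacher-scored candidate lists sampled from the inference-time candidate distribution.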

Abstract: Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale Petroleum Engineering Text for Retrieval Adaptation dataset and pipeline that converts noisy public web data into a curated domain corpus and synthetic supervision for dense retrieval and reranking. PETRA contains 1.36M curated chunks, approximately 2B token equivalents, $\approx$859k embedding training rows from $\approx$224k anchors, and roughly 400k teacher-scored reranker candidate rows. Its construction combines high-recall energy-domain curation, an energy-domain classifier with 98.4% test accuracy, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists. PETRA improves first-stage in-domain Normalized Discounted Cumulative Gain (nDCG) from 0.703 to 0.763 through score fusion. Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%. Failed training recipes show that high train-holdout accuracy on synthetic labels does not predict retrieval gains; retrieval-mined data helps only after being repackaged as teacher-scored candidate lists sampled from the inference-time candidate distribution.
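For readers unfamiliar with the headline metric, nDCG compares the discounted gain of a ranking against the best possible ordering of the same items. The sketch below uses the linear-gain form with a log2 rank discount; the abstract does not say which gain variant or cutoff the authors use, so both are assumptions, as are the function names.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """nDCG of a ranked list of graded relevance labels, optionally cut off at k."""
    cutoff = len(relevances) if k is None else k
    ideal = sorted(relevances, reverse=True)[:cutoff]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg(relevances[:cutoff]) / ideal_dcg

# Hypothetical relevance labels in ranked order; illustrative only.
perfect = ndcg([3, 2, 1])
swapped = ndcg([0, 1])
```

A ranking that already sorts items by relevance scores 1.0, and placing the only relevant item second instead of first lowers the score to 1/log2(3).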

eess.IV

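[103] E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis eess.IV | cs.AI | cs.CV

Sijing Li, Zhongwei Qiu, Zhuoya Wang, Boxiang Yun, Zhenyu Yi

TL;DR: This paper proposes E-MRL, a cross-view aligned, evidence-driven multimodal reinforcement learning framework for reliable 3D tumor analysis. It formulates report generation as a "diagnosis-localization-verification" Markov Decision Process, has the model identify a "key evidence slice", and introduces a cross-view consistency reward, with the aim of reducing visual hallucinations and improving diagnostic accuracy.

Details

Motivation: Current volumetric medical report generation methods based on Vision-Language Models (VLMs) often suffer from visual hallucinations, and their Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) strategies typically optimize text fidelity alone, rewarding correct diagnoses derived from language priors rather than genuine visual perception.

Result: Experiments on large-scale 3D CT tumor datasets show that E-MRL significantly reduces hallucinations and improves diagnostic accuracy compared with SFT and RL baselines.

Insight: The novelty is explicitly modeling report generation as a "diagnosis-localization-verification" sequential decision process and requiring the model to identify a key evidence slice. A cross-view consistency reward validates the semantic alignment between the global report and a local visual re-query of the key slice, giving additional reward for correctly localized reasoning and yielding a clinically interpretable, visually grounded solution.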

Abstract: While Vision-Language Models (VLMs) show great promise in volumetric medical report generation, they frequently suffer from visual hallucinations and a lack of grounding in 3D CT data. Current Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) strategies typically optimize text fidelity alone, essentially rewarding correct diagnoses derived from language priors rather than genuine visual perception. To address this, we propose cross-view aligned Evidence-driven Multimodal Reinforcement Learning (Evidence-MRL, noted as E-MRL), a reliable RL reasoning framework that formulates the generation process as a Markov Decision Process of “diagnosis-localization-verification”. Unlike standard approaches, our model is explicitly trained to identify a “key evidence slice” alongside the global diagnostic report, grounding its findings in verifiable visual evidence. Crucially, we introduce a novel cross-view consistency reward, which validates the semantic alignment between the golden-standard report and a local visual re-query of the selected key slice, providing additional rewards for correctly-localized reasoning. Experiments on large-scale 3D CT tumor datasets demonstrate that E-MRL significantly reduces hallucinations and improves diagnostic accuracy compared to SFT and RL baselines, offering a clinically interpretable solution for visually grounded tumor analysis.

cs.LG

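[104] Blockwise Policy-Drift Gating for On-Policy Distillation cs.LG | cs.AI | cs.CL

Liwen Zheng, Haiyun Jiang

TL;DR: This paper proposes blockwise policy-drift gating, a lightweight method for stabilizing on-policy distillation on long-horizon reasoning tasks. It computes the student policy's log-probability shifts along the sampled trajectory, aggregates them over fixed blocks, and uses the resulting gates to reweight position losses, improving performance without changing teacher targets or the rollout policy.

Details

Motivation: On-policy distillation can be fragile on long-horizon reasoning tasks, especially when rollouts are reused and drift between the current student and the behavior policy affects training stability. The paper aims to mitigate policy drift with a lightweight student-side controller.

Result: On a six-variant Qwen3 math reasoning benchmark covering AIME24, AIME25, MATH500, and AMC23, fixed 64-token block gating improves mean pass@8 from 0.4978 to 0.5160, and on the Teacher-TopK/LSM variant it gives the best four-benchmark mean among trained students.

Insight: The novelty is treating local old-current policy drift as an actionable control signal and proposing block-level gating as a simple default for improving solve-rate robustness. The method needs only student-side computation and leaves the teacher model and sampling process unchanged.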

Abstract: On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that local teacher-support matching is a simple and effective repair. This paper introduces blockwise policy-drift gating, a lightweight student-only old-current drift controller for OPD under rollout reuse. The method computes log-probability shifts between the behavior student and the current student on the sampled token path, aggregates these shifts over fixed blocks or spans, and uses the resulting detached, mean-normalized gates to reweight OPD position losses. It does not change teacher targets, teacher top-K supports, or the rollout policy. In a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget for all trained variants, we use pass@8 as the primary problem-level solve-rate metric. Fixed 64-token block gating improves sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23. On Teacher-TopK/LSM, Block64 gives the best four-benchmark mean pass@8 among trained students. The results identify local old-current policy drift as a practical control signal for reused OPD rollouts and motivate block-level gating as a simple default for improving solve-rate robustness.
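The gating steps in the abstract (per-token log-probability shift, block aggregation, mean normalization, loss reweighting) can be sketched as follows. The abstract does not give the gate function, so the use of the absolute shift and the `exp(-drift)` shape are assumptions made for illustration, as are the function names and the toy inputs.

```python
import math

def block_gates(logp_behavior, logp_current, block_size=64):
    """Per-token gates from block-averaged |log-prob shift|, normalized to mean 1.

    exp(-drift) is an assumed gate shape: blocks where the current student has
    drifted further from the behavior student receive smaller weights.
    """
    n = len(logp_behavior)
    gates = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        drift = sum(abs(logp_current[t] - logp_behavior[t])
                    for t in range(start, end)) / (end - start)
        gates.extend([math.exp(-drift)] * (end - start))
    mean_gate = sum(gates) / n
    return [g / mean_gate for g in gates]

def gated_loss(position_losses, gates):
    """Reweighted mean of per-position losses. In an autodiff framework the
    gates would be detached, i.e. treated as constants during backpropagation."""
    return sum(g * l for g, l in zip(gates, position_losses)) / len(position_losses)

# Toy example with 2-token blocks: the second block has drifted, the first has not.
old = [-1.0, -1.0, -1.0, -1.0]
new = [-1.0, -1.0, -2.0, -2.0]
gates = block_gates(old, new, block_size=2)
loss = gated_loss([0.5, 0.5, 0.5, 0.5], gates)
```

Because the gates are normalized to mean 1, uniform per-position losses keep the same average after reweighting; only the relative emphasis across blocks changes.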

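[105] RoPE-Aware Bit Allocation for KV-Cache Quantization cs.LG | cs.CL

Fengfeng Liang, Yuechen Zhang, Jiaya Jia

TL;DR: This paper proposes Block-GTQ, a RoPE-aware bit allocation method for low-bit KV-cache quantization. It decomposes a key's contribution under RoPE (rotary position embedding) into two-dimensional frequency blocks and greedily allocates integer bit widths based on each block's energy score, which better preserves attention logits. Under the same bit budget, Block-GTQ substantially reduces quantization error, improves long-context task performance, and enables efficient KV-cache compression and faster inference.

Details

Motivation: Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits.

Result: Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. Downstream long-context retrieval, understanding, and reasoning improve: at K2V2 on Llama-3.1-8B-Instruct, the six-task NIAH average rises from 70.6 to 97.4 and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025, Block-GTQ at K3V2 comes close to fp16, whereas uniform TQ-MSE collapses to 0. The packed-cache serving path achieves 3.24x KV-cache compression on Qwen2.5-3B-Instruct, a 1.34x inference speedup, and substantial memory savings.

Insight: The novelty is bringing RoPE structure into KV-cache quantization, framing it as a block-wise bit-allocation problem, and proposing Block-GTQ, a greedy integer bit allocator driven by energy scores. By exploiting the frequency-domain structure of RoPE for fine-grained bit allocation, the method balances quantization error against the bit budget and offers a new optimization direction for deploying long-context models efficiently.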

Abstract: Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key’s contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE (TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4, and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16’s 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K where fp16 OOMs. Code is available at https://github.com/JIA-Lab-research/blockgtq.
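Greedy integer bit allocation by marginal gain, as described in the abstract, can be illustrated with a priority queue. This is a sketch, not the authors' implementation: the distortion model (error proportional to `energy * 4**(-bits)`, a common high-resolution approximation) is an assumption, and the paper's energy score and the TQ-MSE quantizer are not reproduced. The energies and function name are hypothetical.

```python
import heapq

def greedy_bit_allocation(energies, total_bits, max_bits=8):
    """Assign integer bit widths to blocks, one bit at a time, by largest error drop.

    Assumed distortion model: error_i(b) = energies[i] * 4**(-b), so adding one
    bit to block i reduces its error by energies[i] * 0.75 * 4**(-b).
    """
    bits = [0] * len(energies)

    def gain(i):
        return energies[i] * 0.75 * 4.0 ** (-bits[i])

    # Max-heap via negated gains; ties are broken by the lower block index.
    heap = [(-gain(i), i) for i in range(len(energies))]
    heapq.heapify(heap)
    remaining = total_bits
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i), i))
    return bits

# Hypothetical per-block energies for one head; the first block dominates.
energies = [16.0, 1.0, 1.0, 1.0]
allocation = greedy_bit_allocation(energies, total_bits=8)  # 2 bits per block on average
```

Under this convex error model each block's marginal gain shrinks as it receives bits, so the greedy loop spends the budget where the next bit helps most, and high-energy blocks end up with more bits than a uniform split would give them.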

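[106] Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning cs.LG | cs.CL

Chenhao Dang, Jing Ma, Mingjie Liao

TL;DR: This paper proposes the Holistic Data Scheduler (HDS), a new online data mixing framework for optimizing data scheduling in large language model (LLM) pre-training. It formulates data scheduling as a reinforcement learning problem in a continuous control space and solves it with the Soft Actor-Critic (SAC) algorithm. Its core novelty is a multi-objective, holistic reward function that integrates three dimensions: data quality, inter-domain influence, and model weight norms.

Details

Motivation: Existing Online Data Mixing (ODM) methods typically take a single optimization perspective and cannot account for the multiple dimensions along which complex LLM pre-training needs to consider dynamic data composition. The paper aims to overcome this with a more holistic framework for optimizing the mixing strategy over training data sources.

Result: On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations. On the MMLU 0-shot task it achieves a 7.2% improvement, with consistent gains on other benchmarks, showing improved training efficiency and final model capability.

Insight: The main novelty is formalizing data scheduling as a reinforcement learning task with a multi-objective reward that integrates data-driven, loss-driven, and model-driven perspectives. This offers a way to optimize the pre-training data mixture dynamically along multiple dimensions, and SAC is chosen for its stability and sample efficiency in exploring the high-dimensional policy space.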

Abstract: The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a promising direction to improve efficiency. However, existing methods are constrained by their reliance on a singular optimization perspective, which fundamentally overlooks the need for complex LLM pre-training to consider the dynamic data composition from multiple dimensions. To overcome this limitation, we introduce the Holistic Data Scheduler (HDS), a novel online data mixing framework. HDS formulates the data scheduling challenge as a reinforcement learning problem in a continuous control space and leverages the Soft Actor-Critic (SAC) algorithm for its stability and sample efficiency in exploring the high-dimensional policy space. At the core of HDS lies a novel multi-objective, holistic reward function that integrates three critical perspectives: a data-driven reward for quality, a loss-driven reward capturing inter-domain influence, and a model-driven reward based on weight norms. To validate our design and determine its optimal configuration, we conducted systematic experiments on LLMs of various sizes. On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations. Furthermore, it achieves a 7.2% improvement on the MMLU 0-shot task along with consistent gains on other benchmarks, showcasing its ability to enhance both training efficiency and final model capability.

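[107] 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy cs.LG | cs.CV | q-bio.QM

Amirhossein Kardoost, Lion Gleiter, Tingying Peng, Carsten Marr

TL;DR: This paper systematically compares 2D and 3D masked autoencoders on volumetric microscopy data and finds that MAE-3D consistently outperforms 2D variants on downstream single-cell tasks. It further aligns visual representations with a pretrained protein language model, shows that cross-modal supervision yields larger gains, and reports state-of-the-art performance on protein-protein interaction and protein localization tasks.

Details

Motivation: Self-supervised learning in fluorescence microscopy usually relies on 2D projections, yet cells are inherently three-dimensional, which motivates examining the benefits of native 3D modeling.

Result: On a protein-protein interaction task, MAE-3D achieves a ROC-AUC of 0.865, outperforming prior methods by up to 0.025. For protein localization, the best 3D model attains state-of-the-art AUC_micro (0.952) and F1_micro (0.742), improving over previous approaches by 0.003 and 0.010, respectively.

Insight: Contributions include native 3D masked autoencoders for volumetric modeling, cross-modal alignment between visual representations and a protein language model, and channel cross-attention and frequency-domain regularization to exploit 3D spatial context.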

Abstract: Self-supervised learning in fluorescence microscopy often relies on 2D projections, despite the inherently three-dimensional nature of cells. We present a systematic comparison of 2D and 3D masked autoencoders (MAE-2D vs. MAE-3D) on volumetric microscopy data. Under matched architectures and training protocols, MAE-3D consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks. We further align visual representations with a pretrained protein language model (ESM2) and show that cross-modal supervision yields larger gains for volumetric models. Channel cross-attention and frequency-domain regularization are critical for leveraging 3D spatial context. On a protein–protein interaction task, MAE-3D achieves a ROC–AUC of 0.865, outperforming prior methods by up to +0.025. For protein localization, our best 3D model attains state-of-the-art AUC$_{\text{micro}}$ (0.952) and F1$_{\text{micro}}$ (0.742), improving over previous approaches by +0.003 and +0.010 absolute, respectively. Overall, these results demonstrate the advantages of native 3D modeling and multimodal alignment for representation learning in single-cell microscopy.

cs.RO

Yanghong Mei, Longteng Guo, Ming-Ming Yu, Guiyu Zhao, Xingjian He

TL;DR: 本文提出NavWM，一种统一的导航世界模型，通过整合潜在世界推理、多模态动作预测和可控视觉生成，克服传统视觉导航策略在复杂环境中的短视决策和模式崩溃问题。该模型利用潜在世界令牌提取几何和语义先验，并引入基于锚点的多模态轨迹预测框架生成多样化动作空间，使生成式世界模型能够作为鲁棒的闭环规划器，利用视觉前瞻评估和选择最优路径。

Details

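Motivation: Conventional visual navigation policies often suffer from myopic decision-making and mode collapse in complex environments, while existing world-model paradigms typically isolate perception, generation, and control and therefore fail to capture their shared spatio-temporal dynamics.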

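Result: Extensive experiments on diverse robotics datasets show that NavWM significantly advances the state of the art, with notable improvements in both high-fidelity future-state generation and zero-shot navigation success rate.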

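Insight: The key contribution is a unified navigation world model that integrates latent world reasoning, multimodal action prediction, and controllable visual generation. Latent world tokens distill priors that strengthen structural understanding, and the anchor-based multimodal trajectory forecasting framework generates diverse actions, so the model can serve as a closed-loop planner that plans with visual foresight.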

Abstract: Conventional visual navigation policies often struggle with myopic decision-making and mode collapse in complex environments. While world models offer a promising alternative, existing paradigms typically isolate perception, generation, and control, failing to capture their shared spatio-temporal dynamics. In this paper, we propose NavWM, a unified navigation world model that seamlessly integrates latent world reasoning, multimodal action prediction, and controllable visual generation. At its core, NavWM leverages latent world tokens to distill geometric and semantic priors, endowing the agent with robust structural understanding. To overcome the limitations of deterministic policies, we introduce an anchor-based multimodal trajectory forecasting framework that generates a diverse action space. This inherent diversity explicitly empowers the generative world model to act as a robust closed-loop planner, utilizing visual foresight to evaluate and select the optimal path. Extensive experiments across diverse robotics datasets demonstrate that NavWM significantly advances the state-of-the-art, delivering remarkable improvements in both high-fidelity future state generation and zero-shot navigation success.
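
Note on the planning loop: the abstract describes generating diverse candidate trajectories from anchors and then evaluating them to select a path. The toy sketch below illustrates only that generate-then-select pattern. It is not NavWM: the straight-line `rollout`, the geometric `score` (a stand-in for visual foresight), and all names and parameters are assumptions made for illustration.

```python
import math


def rollout(start, heading, step=1.0, horizon=3):
    """Straight-line candidate trajectory along one anchor heading (radians)."""
    x, y = start
    return [
        (x + step * k * math.cos(heading), y + step * k * math.sin(heading))
        for k in range(1, horizon + 1)
    ]


def score(trajectory, goal):
    """Hypothetical stand-in for a learned evaluator:
    negative distance from the trajectory's end point to the goal."""
    return -math.dist(trajectory[-1], goal)


def plan(start, goal, anchor_headings):
    """Generate one candidate per anchor, then keep the best-scoring one."""
    candidates = [rollout(start, h) for h in anchor_headings]
    return max(candidates, key=lambda traj: score(traj, goal))


# Three assumed anchors: east, north, west. The goal lies to the north.
anchors = [0.0, math.pi / 2, math.pi]
best = plan((0.0, 0.0), (0.0, 3.0), anchors)
print(best[-1])
```

In a closed-loop setting the agent would execute only the first step of the selected trajectory and replan from the new observation; that loop is omitted here for brevity.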

[109] ArtiTwinSplat: Interactable Digital Twin Reconstruction via Gaussian Splatting from RGB-D videos cs.RO | cs.CVPDF

Pranjal Mishra, René Zurbrügg, Max Wilder-Smith, Marco Hutter, Marc Pollefeys

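TL;DR: ArtiTwinSplat is a framework that automatically reconstructs interactable, photo-realistic digital twins of objects from RGB-D videos. It builds on 3D Gaussian Splatting to preserve geometric fidelity and photometric realism, and uses an unsupervised articulation discovery pipeline to recover part structure and joint kinematics from observed motion. The method requires no CAD models or manual annotations, and its output can be used directly by robot planning and interaction systems.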

Details

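Motivation: Deploying robots in unstructured real-world environments requires accurate, interactive object models, and building such models at scale is a critical bottleneck for robotic system integration. The paper aims to construct articulated, photo-realistic digital twins automatically and directly from RGB-D videos, lowering the integration barrier.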

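Result: Through its tracking and optimization stages, the method produces stable, queryable digital twins that support real-time rendering, viewpoint control, and interactive manipulation. Unlike prior methods confined to simulation, it operates directly on real-world observations, and the resulting twins are immediately usable by downstream robot planning and learning systems.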

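Insight: The novelty lies in combining 3D Gaussian Splatting with unsupervised articulation discovery to recover articulated structure and kinematics directly from RGB-D videos, without prior models or annotations. This offers a practical, scalable route to digital twin construction, particularly for embodied AI and human-robot collaboration.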

Abstract: Deploying robots in unstructured real-world environments requires accurate, interactive models of objects. Constructing these models at scale remains a critical bottleneck for robotic system integration. We present ArtiTwinSplat, a framework that automatically constructs articulated, photo-realistic digital twins of objects directly from RGB-D videos, requiring no CAD models, simulation assets, or manual annotations. Our method is built on 3D Gaussian Splatting, which preserves geometric fidelity and photometric realism, coupled with an unsupervised articulation discovery pipeline that recovers part structure and joint kinematics from observed motion alone. With tracking and optimization stages, our method provides stable, queryable digital twins that support real-time rendering, viewpoint control, and interactive manipulation. Unlike prior methods confined to simulation, ArtiTwinSplat operates directly on real-world observations and produces twins that are immediately usable by downstream robot planning and learning systems. This method offers a practical, scalable pathway toward digital twin construction, lowering the integration barrier for articulated object manipulation in embodied AI and human-robot collaboration contexts.

Table of Contents

cs.CL [Back]

[1] EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL cs.CL | cs.IRPDF

[2] Quantifying Prior Dominance in RAG Systems cs.CL | cs.AIPDF

[3] Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification cs.CL | cs.CV | cs.IRPDF

[4] ModTGCN: Modularity-aware Graph Neural Networks for Text Classification cs.CLPDF

[5] MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models cs.CLPDF

[6] CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression cs.CL | cs.AI | cs.LGPDF

[7] AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression cs.CL | cs.CVPDF

[8] CALIBER: Calibrating Confidence Before and After Reasoning in Language Models cs.CL | cs.AIPDF

[9] AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning cs.CLPDF

[10] Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams cs.CL | cs.AIPDF

[11] To Compare, or Not to Compare: On Methodological Practices in Evaluating Social Bias cs.CLPDF

[12] Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment cs.CL | cs.ETPDF

[13] Qwen-AgentWorld: Language World Models for General Agents cs.CLPDF

[14] DREAM: Dense Retrieval Embeddings via Autoregressive Modeling cs.CLPDF

[15] Task Decomposition for Efficient Annotation cs.CL | cs.AI | cs.HCPDF

[16] Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce cs.CL | cs.AIPDF

[17] SHERLOC: Structured Diagnostic Localization for Code Repair Agents cs.CLPDF

cs.CV [Back]

[18] Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation cs.CV | cs.AI | cs.LGPDF

[19] A Geometry-Informed Computer Vision Method for Detecting and Examining Overtaking Vehicles From A Bicycle cs.CV | cs.HCPDF

[20] Listening makes Vision Clear for VLMs cs.CV | cs.AIPDF

[21] Mind the Heads: Topological Representation Alignment for Multimodal LLMs cs.CV | cs.AI | cs.CL | cs.MMPDF

[22] ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation cs.CV | eess.IVPDF

[23] The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models cs.CV | cs.AIPDF

[24] REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs cs.CVPDF

[25] HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models cs.CV | cs.IRPDF

[26] Trustworthy Image Authentication using Forensic Knowledge Graphs cs.CVPDF

[27] End-to-End Radar and Communication Modulation Recognition with Neuromorphic Computing cs.CV | cs.AIPDF

[28] DriveStack-VLA: Render-Teacher Alignment for BEV-Based DeepStack Vision-Language-Action Model cs.CVPDF

[29] Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent cs.CVPDF

[30] A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy cs.CV | cs.AIPDF

[31] Flood Mapping from RGB imagery using a Vision Foundation Model cs.CV | eess.IVPDF

[32] Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image cs.CVPDF

[33] ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration cs.CV | cs.ROPDF

[34] DramaDirector: Geometry-Guided Short Drama Generation cs.CV | cs.AIPDF

[35] Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models cs.CV | cs.LGPDF

[36] An LMM for Precisely Grounding Elements in Documents cs.CVPDF

[37] Differential Unfolding: Efficient Unfolding Reconstruction for Video Snapshot Compressive Imaging cs.CVPDF

[38] Dual-Branch Cross-Projection Debiasing through Diffusion-based Disentanglement cs.CVPDF

[39] Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models cs.CVPDF

[40] Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring cs.CV | cs.AIPDF

[41] Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms cs.CV | cs.AIPDF

[42] Tri-Efficient Transfer Learning for Point Cloud Videos cs.CVPDF

[43] MorVess: Morphology-Aware Pulmonary Vessel Segmentation Network cs.CVPDF

[44] Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling cs.CVPDF

[45] Geometry-Instructed Video Editing cs.CVPDF

[46] FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image cs.CV | cs.GRPDF

[47] Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction cs.CVPDF

[48] Latent Visual States for Efficient Multimodal Reasoning cs.CVPDF

[49] TuringViT: Making SOTA Vision Transformers Accessible to All cs.CVPDF

[50] Social Structure Matters in 3D Human-Human Interaction Generation cs.CV | cs.AIPDF

[51] UniRED: Unified RGB-D Video Frame Interpolation with Event Guidance cs.CVPDF

[52] ActiveScope: Actively Seeking and Correcting Perception for MLLMs cs.CVPDF

[53] Trimming the Long-Tail of Visual World Modeling Evaluation cs.CVPDF

[54] Training-free Cross-domain Few-shot Segmentation via Robust Semantic Representation and Matching cs.CVPDF

[55] MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving cs.CVPDF

[56] REDI-Match: Rotation-Equivariant Distillation for Efficient and Robust Dense Matching cs.CVPDF

[57] UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation cs.CVPDF

[58] Ill-Posed by Design: Probing Evidence Use in VLMs cs.CVPDF

[59] TIGER: Taming Identity, Geometry, and Generative Priors for High-Quality Face Video Restoration cs.CVPDF

[60] Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints cs.CV | cs.LGPDF

[61] SignNet-1M: Large-Scale Multilingual Sign Language Video Dataset with Downstream Benchmarks cs.CVPDF

[62] Modality-Aware Out-of-Distribution Detection for Multi-Modal Action Recognition cs.CVPDF

[63] EgoSAT: A Comprehensive Benchmark of Egocentric Streaming Interaction Understanding cs.CVPDF

[64] S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing cs.CVPDF

[65] P-MTP: Efficient Document Parsing via Multi-Token Prediction with Progressive Depth Scaling cs.CVPDF

[66] Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching cs.CVPDF

[67] Boosting Text-Driven Video Segmentation via Geometry-Aware Distillation cs.CVPDF

[68] video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding cs.CV | cs.AI | cs.SDPDF

[69] Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods cs.CVPDF

[70] RetiSEM: Generalising Causal Models for Fragmented Biomedical Data cs.CV | cs.AI | stat.MEPDF

[71] VisCritic: Visual State Comparison as Process Reward for GUI Agents cs.CVPDF

[72] PointVG-R: Internalizing Geometric Reasoning in MLLMs for Precise Pointing Localization via Visual Chain of Thought cs.CVPDF

[73] ForensicsTok: Forensics-Guided Tokenized Modeling for Image Tampering Localization cs.CVPDF

[74] PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments cs.CVPDF

[75] Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning cs.CVPDF

[76] Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning cs.CVPDF

[77] Agentic Collaborative Cognition for Zero-Shot 3D Understanding cs.CVPDF