cs.CL [Total: 14]
cs.CV [Total: 67]
cs.AI [Total: 6]
cs.RO [Total: 5]
cs.LG [Total: 5]
cs.IR [Total: 2]
eess.IV [Total: 1]
cs.SD [Total: 1]

cs.CL

[1] Talking to Yourself: Defying Forgetting in Large Language Models cs.CL | cs.AI

Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Phillip Miao, Zilun Zhang

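TL;DR: This paper proposes SA-SFT, a lightweight self-augmentation method in which an LLM generates self-dialogue data before fine-tuning and that data is mixed with the task data for training. This effectively mitigates catastrophic forgetting while also improving in-domain performance.

Motivation: Fine-tuning LLMs on task-specific data causes catastrophic forgetting, which degrades their general knowledge and reasoning abilities.

Result: Across 50 evaluation scenarios, the method maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing.

Insight: The novelty lies in self-augmentation with aligned, model-generated data (self-dialogues), which requires no external data or additional tuning. The theoretical analysis suggests that forgetting partly stems from style-induced parameter drift and that self-alignment through self-generated data counteracts this effect, offering a simple and effective mechanism for robust LLM adaptation.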

Abstract: Catastrophic forgetting remains a major challenge when fine-tuning large language models (LLMs) on narrow, task-specific data, often degrading their general knowledge and reasoning abilities. We propose SA-SFT, a lightweight self-augmentation routine in which an LLM generates self-dialogues prior to fine-tuning, and the resulting self-authored data are mixed with task data without modifying optimization or training schedules. Despite requiring no external data or additional tuning, SA-SFT consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, it maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing. Guided by these empirical findings, we further present a theoretical analysis suggesting that forgetting can partly stem from style-induced parameter drift, and that self-alignment through self-generated data provides an effective means to counteract this effect. Overall, our results indicate that self-augmentation offers a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.
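
As a rough illustration of the data-mixing step described in the abstract, the following Python sketch combines task examples with self-generated dialogues before ordinary fine-tuning. The function name `mix_sa_sft_data`, the `self_ratio` parameter, and the sampling scheme are assumptions made for illustration; the abstract does not state the mixing ratio or procedure.

```python
import random


def mix_sa_sft_data(task_data, self_dialogues, self_ratio=0.5, seed=0):
    """Return a shuffled training set of task data plus sampled self-dialogues.

    self_ratio is a hypothetical knob (not from the paper): the number of
    self-authored examples added per task example.
    """
    rng = random.Random(seed)
    # Never request more self-dialogues than are available.
    n_self = min(len(self_dialogues), int(len(task_data) * self_ratio))
    mixed = list(task_data) + rng.sample(list(self_dialogues), n_self)
    rng.shuffle(mixed)
    return mixed
```

The optimizer and training schedule are left untouched in this sketch, which matches the abstract's statement that only the training data changes.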

[2] Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings cs.CL | cs.LG

Sachin Gopal Wani, Eric Page, Ajay Dholakia, David Ellison

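TL;DR: This paper benchmarks the performance and compute cost of small language models (SLMs) produced by knowledge distillation in resource-constrained settings, comparing them with their vanilla and proprietary counterparts. It finds that distillation yields a better performance-to-compute curve: creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while its reasoning ability matches or exceeds that of standard models ten times its size.

Motivation: Resource-constrained environments need small language models that are both efficient and capable; knowledge distillation is examined as a way to achieve compression together with performance gains.

Result: In the benchmarks, distilled models show strong compute efficiency. Creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, with reasoning ability on par with or exceeding standard models ten times its size, which supports distillation as a primary strategy for building state-of-the-art, accessible AI.

Insight: The paper treats knowledge distillation not merely as a compression technique but as a core strategy for building high-performing, efficient SLMs, and backs this with a quantitative analysis of the performance-to-compute curve, offering guidance for AI deployment in resource-constrained settings.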

Abstract: Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-to-compute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI.

[3] No One Size Fits All: QueryBandits for Hallucination Mitigation cs.CL | cs.AI | cs.LG

Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso

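TL;DR: This paper proposes QueryBandits, a model-agnostic contextual bandit framework for mitigating hallucinations in large language models (LLMs). The framework learns online to adaptively select the best query-rewrite strategy using an empirically validated and calibrated reward function, and is particularly suited to closed-source models that cannot be post-trained or adapted with gradients.

Motivation: Most current work on mitigating LLM hallucinations focuses on post-hoc detection and parameter editing for open-source models, while closed-source models, which make up the vast majority of institutional deployments, lack targeted study. This paper addresses hallucination mitigation for closed-source models by shifting model behavior purely through forward-pass mechanisms, without retraining or gradient-based adaptation.

Result: Across 16 QA scenarios, the top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. All contextual bandits outperform vanilla bandits on every dataset, and higher feature variance coincides with greater variance in arm selection, supporting the finding that no single rewrite policy is optimal for all queries.

Insight: The novelty is a model-agnostic online learning framework that learns an adaptive policy over semantic features and mitigates hallucinations through forward passes alone, making it usable with closed-source models. The work shows that query diversity calls for dynamic policy selection and that static policies can worsen hallucinations, and it offers a scalable approach to hallucination mitigation for closed-source models.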

Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
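
To make the arm-selection idea concrete, here is a minimal Python sketch of Thompson Sampling over query-rewrite strategies. It is a simplification and not the authors' implementation: it uses a non-contextual Beta-Bernoulli bandit (the paper's bandits are contextual and use semantic features), assumes a binary reward, and the class name `ThompsonRewriteSelector` is hypothetical. The arm names are taken from the strategies mentioned in the abstract.

```python
import random


class ThompsonRewriteSelector:
    """Beta-Bernoulli Thompson Sampling over rewrite strategies (no context)."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [alpha, beta].
        self.params = {arm: [1.0, 1.0] for arm in arms}

    def select(self):
        # Sample a plausible success rate per arm and pick the largest sample.
        samples = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.params.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        # Assumption: reward is 1 (answer judged correct) or 0 (otherwise).
        if reward:
            self.params[arm][0] += 1.0
        else:
            self.params[arm][1] += 1.0
```

In use, one would call `select()` for each incoming query, apply the chosen rewrite, score the model's answer, and pass that score to `update()`. Because only prompts and scores are involved, nothing here requires access to model weights.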

[4] Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning cs.CL | cs.LG

Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy

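TL;DR: STAR-LDM combines latent diffusion planning with autoregressive generation. It introduces a "thinking" phase during generation in which a diffusion process performs global semantic planning in continuous space before discrete tokens are produced.

Motivation: Conventional autoregressive language models are limited to token-by-token decisions and lack global planning. The work aims to improve generation quality in areas such as narrative coherence and commonsense reasoning.

Result: STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves win rates above 70% in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. Lightweight classifiers enable fine-grained control, with better fluency-control trade-offs than specialized approaches.

Insight: The novelty is bringing the planning ability of diffusion models into language modeling, with global planning carried out in a continuous semantic space. The "pause, plan, generate" architecture gives language models a more flexible mechanism for reasoning and control.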

Abstract: The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a “thinking” phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. The architecture also allows straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.

[5] CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models cs.CL

Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma

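TL;DR: This paper presents CARE, an explainable LLM-based computational framework that automatically predicts multi-dimensional scores of client-perceived therapeutic alliance from counseling transcripts and generates interpretable rationales. The framework is fine-tuned on the CounselingWAI dataset with supervision augmented by expert-annotated rationales, which substantially improves predictive accuracy and yields high-quality, contextually grounded explanations.

Motivation: Traditional post-session questionnaires for assessing client-perceived alliance are burdensome and delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model the full session context.

Result: Experiments on the CounselingWAI dataset show that CARE outperforms leading LLMs and substantially narrows the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Both automatic and human evaluations validate the quality and contextual grounding of its rationales.

Insight: The novelty is a fine-tuning framework with rationale-augmented supervision that unifies score prediction and rationale generation, improving both predictive performance and interpretability. It offers a feasible technical path for applying LLMs to mental health support tools that require fine-grained, explainable assessment.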

Abstract: Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. Built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales, CARE is fine-tuned using rationale-augmented supervision with the LLaMA-3.1-8B-Instruct backbone. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Rationale-augmented supervision further improves predictive accuracy. CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations. Applied to real-world Chinese online counseling sessions, CARE uncovers common alliance-building challenges, illustrates how interaction patterns shape alliance development, and provides actionable insights, demonstrating its potential as an AI-assisted tool for supporting mental health care.

[6] CAMEL: Confidence-Gated Reflection for Reward Modeling cs.CL | cs.AI

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar

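TL;DR: This paper proposes CAMEL, a reward-modeling framework based on confidence-gated reflection. It first makes a lightweight single-token preference decision and invokes reflection only selectively, improving performance while remaining efficient.

Motivation: Existing reward-modeling methods trade efficiency against interpretability: scalar discriminative models are efficient but not interpretable, while generative judging models provide richer reasoning at a high computational cost.

Result: On three widely used reward-model benchmarks, CAMEL reaches state-of-the-art performance with 82.9% average accuracy, outperforms 70B-parameter models using only 14B parameters, and establishes a better accuracy-efficiency Pareto frontier.

Insight: The log-probability margin between verdict tokens serves as a confidence proxy, giving a difficulty estimate at no extra inference cost. Reinforcement learning with counterfactual prefix augmentation encourages effective self-correction, and together these form a dynamic reflection mechanism.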

Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
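
The gating rule described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the verdict labels "A" and "B", and the threshold value of 1.0 are hypothetical, and the abstract does not say how the threshold is chosen.

```python
def verdict_margin(logprob_a, logprob_b):
    """Absolute log-probability gap between the two verdict tokens."""
    return abs(logprob_a - logprob_b)


def gated_decision(logprob_a, logprob_b, threshold=1.0):
    """Return (verdict, needs_reflection).

    The verdict is the more probable token; reflection is requested only
    when the margin falls below the (hypothetical) confidence threshold.
    """
    verdict = "A" if logprob_a >= logprob_b else "B"
    needs_reflection = verdict_margin(logprob_a, logprob_b) < threshold
    return verdict, needs_reflection
```

Under this sketch, a pair with log-probabilities -0.1 and -3.0 is decided immediately, while a near-tie such as -0.70 versus -0.69 is routed to the more expensive reflection step.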

[7] ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition cs.CL

Xindian Ma, Rundong Kong, Peng Zhang, Ruoxiang Huang, Yongyu Jiang

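TL;DR: This paper proposes ID-LoRA, a new parameter-efficient fine-tuning framework that addresses a tension in existing LoRA methods: they introduce too many trainable parameters as models scale, while aggressively lowering the rank hurts performance. ID-LoRA extracts and reuses clustered parameter groups from the pretrained weight matrix to build multiple low-rank components that share a single trainable low-rank matrix, reducing parameters while preserving model capacity.

Motivation: Existing LoRA methods and their variants add considerable trainable-parameter overhead as models scale, and aggressively lowering the rank to control that overhead markedly degrades performance in complex multi-task settings.

Result: On five benchmarks (Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment), ID-LoRA outperforms full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants on the Code and MMLU tasks while requiring only 54% of the trainable parameters of conventional LoRA.

Insight: Inspired by matrix interpolative decomposition, the core idea is to extract clustered parameter groups from the pretrained weights and reuse them to build multiple components that share a single trainable low-rank matrix, striking a better balance between parameter efficiency and model capacity. This offers a new way to break the usual trade-off between parameter count and performance in low-rank adaptation.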

Abstract: LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that equips Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation lies in extracting and reusing clustered parameter groups from the pretrained weight matrix. These groups are then used to form multiple low-rank components, all of which share only a single initialized trainable low-rank matrix. This approach cuts the number of trainable parameters while keeping the model’s capacity intact. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment. ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than the standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants (e.g., DoRA and HydraLoRA) on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.

[8] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing cs.CL | cs.AI | cs.LG

Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes

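TL;DR: SibylSense is an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank. Memory is updated using verifier-based rewards measured by reference-candidate answer discriminative gaps on a handful of examples, and this alternates with a rubric-adversarial policy update that produces rubric-satisfying answers. The result is more discriminative rubrics and better downstream reinforcement learning performance.

Motivation: Designing aligned and robust rewards for open-ended generation is a key challenge. Conventional rubric construction is costly, superficial, or inconsistent, and fixed rubrics can saturate and drift, which enables reward hacking.

Result: Experiments on two open-ended tasks show that SibylSense produces more discriminative rubrics and improves downstream reinforcement learning performance over static and non-adaptive baselines.

Insight: The novelty lies in dynamically adapting the rubric generator through a memory bank and combining verifier rewards with adversarial policy updates, so that rubrics and the generation policy are optimized together and reward saturation and drift are avoided.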

Abstract: Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.

[9] Overton Pluralistic Reinforcement Learning for Large Language Models cs.CL

Yu Fu, Seongho Son, Ilija Bogunovic

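TL;DR: This paper proposes OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework that addresses the difficulty existing LLM alignment methods have in capturing the plurality of human values. Using a dual-reward system, it enables a single model to generate responses with diverse perspectives from a single query, without explicit prompting or modular orchestration.

Motivation: Existing alignment paradigms are limited in capturing the pluralistic nature of human values. Overton Pluralism aims to close this gap by having models generate responses that include different perspectives.

Result: On a Natural Language Inference benchmark, the trained Qwen2.5-3B-Instruct model achieves a 37.4% relative accuracy gain over a 20B GPT-OSS baseline and a 19.1% relative improvement over a modular architecture baseline. Further evaluation with GPT-4.1 as an LLM judge also confirms the robustness of the approach.

Insight: The novelty is an implicit Overton Pluralism reinforcement learning framework (OP-GRPO) whose dual-reward system, built on a similarity estimator, ensures both broad coverage and uniqueness of perspectives. It achieves a "small models, big perspective coverage" effect and simplifies the pipeline for generating pluralistic responses.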

Abstract: Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a “small models, big perspective coverage” effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
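
For readers unfamiliar with the "group relative" part of GRPO, the following Python sketch shows the standard normalization of rewards within a group of sampled responses to the same query. It is a generic illustration rather than the OP-GRPO training code: the function name and the `eps` stabilizer are assumptions, and the abstract does not say how the coverage and uniqueness rewards are combined, so the sketch takes one scalar reward per response as given.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same query.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward design matters for this family of methods.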

[10] The Art of Efficient Reasoning: Data, Reward, and Optimization cs.CL | cs.AI

Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong

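TL;DR: This paper systematically studies the mechanics of efficient reasoning in large language models, where reward shaping with reinforcement learning is used to incentivize short yet accurate thinking trajectories. It shows that training follows a two-stage paradigm: length adaptation followed by reasoning refinement.

Motivation: Chain-of-Thought reasoning in LLMs incurs heavy computational overhead. The goal is to incentivize models to produce short but accurate reasoning paths.

Result: Extensive experiments (about 0.2 million GPU hours) under a unified protocol validate the robustness and generalization of the findings across the Qwen3 series (0.6B to 30B). The paper also advocates fine-grained evaluation metrics, such as length distribution conditioned on correctness and performance across token budgets from 2k to 32k.

Insight: Key findings include: training on relatively easy prompts keeps positive reward signals dense and avoids length collapse; the learned length bias generalizes across domains; and training follows a two-stage paradigm of length adaptation and reasoning refinement.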

Abstract: Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.
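
The two fine-grained metrics advocated in the abstract can be sketched in Python as follows. The data layout (a list of `(length_in_tokens, is_correct)` pairs) and the function names are assumptions, as is the rule that a response longer than the budget counts as a failure; the abstract does not define these details.

```python
from statistics import mean


def accuracy_at_budget(samples, budget):
    """Fraction of responses that are correct AND fit within the token budget.

    Assumption: a response longer than the budget counts as a failure.
    """
    hits = [bool(correct) and length <= budget for length, correct in samples]
    return sum(hits) / len(hits)


def mean_length_by_correctness(samples):
    """Mean response length, reported separately for correct and incorrect answers."""
    result = {}
    for flag in (True, False):
        lengths = [length for length, correct in samples if bool(correct) == flag]
        result[flag] = mean(lengths) if lengths else None
    return result
```

Sweeping `budget` over values such as 2048, 4096, and up to 32768 would give an accuracy-versus-budget curve of the kind the abstract describes.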

[11] Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models cs.CL

Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase

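TL;DR: This paper introduces the Blackbird Language Matrices (BLM) task, a new language task inspired by intelligence tests that uses structured multiple-choice datasets to assess the linguistic competence of large language models. It describes the construction and benchmarking of the BLM datasets and reports targeted experiments on chunking and systematicity, showing that models can detect linguistic objects, exploit systematic patterns across sentences, and support explainability studies of model behavior.

Motivation: The aim is to create a structured language task that probes core abilities of current large language models: whether they detect linguistic objects and their properties, whether they identify and use systematic patterns across sentences, and whether their errors are more linguistic or reasoning-related.

Result: Experiments show that BLM tasks, while challenging, can be solved at good levels of performance in more than one language with simple baseline models, and at better levels with more tailored models. Model representations contain the grammatical objects and attributes needed to solve the linguistic task, and solutions are reached by detecting systematic patterns across sentences.

Insight: The novelty is a naturalistic dataset structured at multiple levels that supports multi-faceted investigation of properties of language and of large language models. Because these curated datasets include both learning contexts and expected answers and are partly built by hand, they can support explainability research into why models behave as they do.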

Abstract: This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated, but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solve a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.

[12] Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving cs.CL

Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang

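TL;DR: Addressing limitations in how the mathematical reasoning of large language models is evaluated, this paper introduces PC-FOL, a first-order logic dataset focused on proof by cases, and uses experiments and theoretical analysis to reveal a substantial performance gap between linear reasoning and case-based reasoning problems.

Motivation: Existing mathematical reasoning datasets focus mainly on linear reasoning and neglect other key forms such as proof by contradiction and proof by cases, which limits comprehensive evaluation of LLMs' mathematical reasoning abilities.

Result: Experiments on leading LLMs show that performance on case-based reasoning problems (the PC-FOL dataset) is substantially lower than on linear reasoning problems.

Insight: The novelty is PC-FOL, a new FOL dataset focused on proof by cases and annotated by professional mathematicians, together with a theoretical analysis grounded in graphical models that explains the performance gap between the two types of reasoning problems. The work highlights core challenges in automated natural language mathematical proof generation.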

Abstract: To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other parts such as proof by contradiction and proof by cases, which are crucial for investigating LLMs’ reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems. All instances in this dataset are equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical model, which provides an explanation for the observed disparity between the two types of reasoning problems. We hope this work can reveal the core challenges in the field of automated natural language mathematical proof generation, paving the way for future research.

[13] Evaluating Proactive Risk Awareness of Large Language Models cs.CL | cs.CY

Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu

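TL;DR: This paper proposes a framework for evaluating the proactive risk awareness of large language models (LLMs) and builds the Butterfly dataset to test whether LLMs can anticipate potential harms and issue warnings in advance in the environmental and ecological domain. Experiments reveal significant declines and blind spots in proactive risk awareness under length-restricted responses, in cross-lingual settings, and in (multimodal) species protection.

Motivation: As LLMs are increasingly involved in everyday decision-making, their safety responsibilities should extend beyond reacting to explicit harmful intent to anticipating unintended risks that may have serious consequences.

Result: Experiments on five widely used LLMs show consistent, significant declines in proactive awareness under length-restricted responses, similar behavior across languages, and persistent blind spots in (multimodal) species protection, revealing a critical gap between current safety alignment and the requirements of real-world ecological responsibility.

Insight: The novelty is a framework and dedicated dataset (Butterfly) for evaluating proactive rather than reactive risk awareness in LLMs. The paper systematically exposes safety blind spots in a specific domain (ecology) and under specific conditions (such as length limits), underscoring the need for proactive safeguards in LLM deployment.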

Abstract: As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.

[14] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning cs.CL | cs.IR

Sanket Badhe, Deep Shah

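TL;DR: This paper proposes Prompt-Level Distillation (PLD), a new method that addresses the high latency and cost of complex reasoning with large language models. It extracts explicit reasoning patterns from a teacher model and organizes them into a structured list of instructions used as the student model's system prompt, substantially improving the reasoning ability and efficiency of small models without any parameter fine-tuning.

Motivation: Existing approaches such as Chain-of-Thought prompting are accurate but incur prohibitive latency and inference cost, while fine-tuning smaller models sacrifices interpretability and introduces significant resource overhead. The paper seeks an alternative that is both efficient and transparent.

Result: Evaluated on the StereoSet and Contract-NLI datasets with Gemma-3 4B, PLD improves Macro F1 from 57% to 90.0% and from 67% to 83%, respectively, allowing this compact model to match frontier performance with negligible latency overhead.

Insight: The core novelty is a non-parametric alternative to knowledge distillation that encodes reasoning knowledge explicitly as system-prompt instructions instead of modifying model parameters. This yields efficient inference while keeping the decision process fully transparent and verifiable, which suits regulated industries with strict interpretability requirements as well as high-throughput use cases.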

Abstract: Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model’s System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57% to 90.0% and 67% to 83% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.
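
The final assembly step of PLD, turning extracted reasoning patterns into a system prompt, might look like the Python sketch below. The function name, the header sentence, and the numbering format are hypothetical; the abstract only states that the instructions are organized into a structured list for the student model's system prompt, and the extraction of patterns from the teacher model is not shown.

```python
def build_system_prompt(task_description, instructions):
    """Assemble a system prompt from a task description and a list of rules.

    The rules are assumed to have been extracted from a teacher model
    beforehand; this function only formats them as a numbered list.
    """
    lines = [task_description.strip(), "", "Follow these rules when deciding:"]
    lines += [
        f"{index}. {rule.strip()}"
        for index, rule in enumerate(instructions, start=1)
    ]
    return "\n".join(lines)
```

Because the distilled knowledge lives in plain text, a human reviewer can read, audit, and edit each rule directly, which is the transparency property the abstract emphasizes.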

cs.CV

[15] VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography cs.CV | cs.LG

Dorsa EPMoghaddam, Feng Gao, Drew Bernard, Kavya Sinha, Mehdi Razavi

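TL;DR: This paper proposes VISION-ICE, an AI framework that uses intracardiac echocardiography (ICE) video and a 3D convolutional neural network to classify arrhythmia origin into three classes (normal sinus rhythm, left-sided arrhythmia, right-sided arrhythmia). The aim is to help clinicians localize arrhythmia sources quickly and reduce procedure time.

Motivation: Current high-density mapping techniques and preoperative CT/MRI are time and resource intensive for localizing arrhythmias, while AI has been validated for rapid real-time analysis of echocardiographic images. The study therefore develops an AI framework based on ICE, a routine part of electrophysiology procedures, to guide clinicians toward regions of arrhythmia origin.

Result: In ten-fold cross-validation, the model achieves a mean accuracy of 66.2% when evaluated on four previously unseen patients, well above the 33.3% random baseline, demonstrating the feasibility of the approach.

Insight: The novelty is formulating arrhythmia source localization as a three-class classification task on ICE video and applying a 3D CNN to it. This shows the clinical potential of combining routine procedural imaging (ICE) with deep learning for automated arrhythmia localization, which could enable faster and more targeted electrophysiological interventions.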

Abstract: Contemporary high-density mapping techniques and preoperative CT/MRI remain time and resource intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.

[16] OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport cs.CV

Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong

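TL;DR: OTPrune is a training-free visual token pruning framework based on optimal transport. It reduces redundant visual tokens during inference in multi-modal large language models by minimizing the 2-Wasserstein distance between the full and pruned token distributions, lowering compute cost while preserving the diversity and global representativeness of the representation.

Motivation: Existing visual token pruning methods overlook the underlying distributional structure of visual representations, so pruning can hurt model performance. OTPrune addresses this through distribution alignment, accelerating inference in a more principled way.

Result: Across extensive benchmarks, OTPrune achieves better performance-efficiency trade-offs than state-of-the-art methods.

Insight: The novelty is formulating pruning as a distribution alignment task via optimal transport and deriving a tractable submodular objective, with theoretical proofs of its monotonicity and submodularity that give stable and efficient pruning a principled foundation. Distribution alignment also contributes to semantically faithful pruning.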

Abstract: Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.
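
Monotone submodular objectives are attractive because a simple greedy loop gives good selections. The Python sketch below illustrates that loop using a facility-location objective as a stand-in. It is explicitly not the OTPrune objective, which the abstract does not spell out; the function name and the similarity-matrix input are assumptions chosen only to show how greedy token selection under a submodular score works.

```python
def greedy_facility_location(sim, k):
    """Greedily pick k token indices maximizing a facility-location score.

    sim[i][j] is a non-negative similarity between tokens i and j. The score
    of a selected set S is sum_i max_{j in S} sim[i][j], a standard monotone
    submodular function (a stand-in here, not the OTPrune objective).
    """
    n = len(sim)
    selected = []
    best = [0.0] * n  # best[i]: similarity of token i to its closest selected token
    for _ in range(min(k, n)):
        gains = {}
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding token j to the current selection.
            gains[j] = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
        j_star = max(gains, key=gains.get)
        selected.append(j_star)
        best = [max(best[i], sim[i][j_star]) for i in range(n)]
    return selected
```

With two tight clusters of tokens and `k=2`, the loop picks one token from each cluster, since a second token from an already covered cluster adds almost no marginal gain.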

[17] De-rendering, Reasoning, and Repairing Charts with Vision-Language Models cs.CV

Valentin Bonas, Martin Sinnona, Viviana Siless, Emmanuel Iarussi

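TL;DR: This paper proposes a framework that combines chart de-rendering, automated analysis, and iterative improvement to provide actionable, interpretable feedback on visualization design. The system reconstructs a chart's structure from its image, uses a vision-language model to identify design flaws, and proposes concrete modifications grounded in visualization research principles. Users can selectively apply the improvements and re-render the updated chart, forming a feedback loop that promotes higher-quality visualizations and visualization literacy.

Motivation: Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they often contain errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations but miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is also unreliable: they are not trained to follow visualization design principles and often give inconsistent or incorrect feedback.

Result: On 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency), showing the promise of LLM-driven recommendation systems for structured, principle-based feedback on visualization design.

Insight: The novelty is combining chart de-rendering, vision-language reasoning, and iterative repair into a closed-loop framework, using a vision-language model for context-aware flaw detection and principle-driven modification suggestions. This opens a path toward more intelligent and accessible visualization authoring tools.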

Abstract: Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.

[18] N4MC: Neural 4D Mesh Compression cs.CV

Guodong Chen, Huanshuo Dong, Mallesham Dasari

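TL;DR: N4MC is the first 4D neural compression framework that efficiently compresses time-varying mesh sequences by exploiting their temporal redundancy. Borrowing the idea of inter-frame compression from 2D video codecs, it converts irregular mesh frames into regular 4D tensors and uses an auto-decoder to capture spatial and temporal correlations for redundancy removal. It also introduces a transformer-based interpolation model that improves temporal coherence by predicting intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers.

Details

Motivation: Existing neural mesh compression methods treat each mesh frame independently and ignore temporal redundancy; N4MC targets efficient compression of time-varying mesh sequences by exploiting inter-frame correlations.

Result: Extensive evaluations show that N4MC outperforms state-of-the-art methods in rate-distortion performance while enabling real-time decoding of 4D mesh sequences.

Insight: The contributions include converting irregular meshes into regular 4D tensors for a uniform representation, a transformer-based interpolation model that eliminates motion ambiguities, and the first neural compression framework for 4D meshes overall; the design may inform efficient coding of other spatiotemporal data.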

Abstract: We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms state-of-the-art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: https://github.com/frozzzen3/N4MC.

[19] Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking cs.CV | cs.AI | cs.LG

Jingcheng Yang, Tianhu Xiong, Shengyi Qian, Klara Nahrstedt, Mingyuan Wu

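TL;DR: This paper presents the first framework for transparent circuit tracing in vision-language models (VLMs). Using transcoders, attribution graphs, and attention-based methods, it systematically analyzes the internal mechanisms of multimodal reasoning, reveals how VLMs hierarchically integrate visual and semantic concepts, and finds that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations.

Details

Motivation: VLMs remain opaque black boxes; the work aims to understand their internal multimodal reasoning mechanisms and lay the groundwork for more explainable and reliable VLMs.

Result: Validation through feature steering and circuit patching shows that the identified circuits are causal and controllable, providing empirical support for understanding the internal workings of VLMs.

Insight: The novelty is the first systematic circuit tracing of VLMs, which reveals the specific role of visual feature circuits in mathematical reasoning and cross-modal association and offers new tools and perspectives for interpretability research.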

Abstract: Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

[20] Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques cs.CV | cs.RO

Christos Maikos, Georgios Angelidis, Georgios Th. Papadopoulos

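TL;DR: This study proposes an end-to-end pipeline that converts drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. The method integrates RTMP video stream acquisition, sensor fusion, camera pose estimation, and 3D Gaussian Splatting (3DGS) optimization, achieving continuous model updates and low-latency deployment in interactive visualization environments for immersive AR/VR applications.

Details

Motivation: Unmanned aerial vehicles (UAVs) are widely used in aerial real-time perception applications, and 3DGS has shown great potential for real-time neural rendering, but its integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. This study aims to fill that gap with an efficient integrated architecture.

Result: Experiments show that, compared with NeRF-based approaches, the method achieves competitive visual fidelity while delivering significantly higher rendering performance and substantially lower end-to-end latency. Reconstruction quality stays within 4-7% of high-fidelity offline references, confirming the system's suitability for real-time, scalable augmented perception from aerial platforms.

Insight: The novelty is combining live video streaming, sensor fusion, and 3DGS optimization into a complete, low-latency, end-to-end UAV 3D reconstruction system. Viewed objectively, its system-level integration (from acquisition to visualization) and its optimization of real-time behavior and rendering efficiency are useful references for bringing UAV real-time 3D perception into interactive applications such as AR/VR.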

Abstract: In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. Our goal is to propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that support immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity, while delivering significantly higher rendering performance and substantially reduced end-to-end latency, compared to NeRF-based approaches. Reconstruction quality remains within 4-7% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.

[21] 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism cs.CV

Bhavik Chandna, Kelsey R. Allen

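TL;DR: This paper proposes 3DSPA, a 3D spatiotemporal point autoencoder framework for automated evaluation of generated-video realism. It integrates 3D point trajectories, depth cues, and DINO semantic features into a unified video representation that models object motion and semantic content in the scene, enabling assessment of realism, temporal consistency, and physical plausibility.

Details

Motivation: AI video generation is advancing rapidly, but evaluating the realism of generated videos still relies mainly on human annotation or bespoke evaluation datasets of limited scope, and automated, comprehensive evaluation methods are lacking. The paper aims to develop a reference-free evaluation framework that automatically captures both video semantics and coherent 3D structure.

Result: Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets.

Insight: The main novelty is combining trajectory-based representations with 3D semantic information (depth and DINO features), which provides a stronger foundation for benchmarking generative video models and implicitly captures physical rule violations, offering a new, more comprehensive perspective on automated video evaluation.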

Abstract: AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process – requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at https://github.com/TheProParadox/3dspa_code.

[22] Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field cs.CV

Sheyang Tang, Armin Shafiee Sarvestani, Jialu Xu, Xiaoyu Xu, Zhou Wang

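TL;DR: This paper introduces the notion of a 3D aesthetic field, which enables geometry-grounded aesthetic reasoning from sparsely captured images and thus efficient suggestion of aesthetically pleasing camera viewpoints. A feedforward 3D Gaussian Splatting network distills high-level knowledge from a pretrained 2D aesthetic model into 3D space, and a two-stage search pipeline (coarse sampling followed by gradient-based refinement) identifies appealing viewpoints without dense captures or reinforcement learning search.

Details

Motivation: Existing aesthetic viewpoint suggestion methods are limited: single-view adjustment methods predict only limited camera adjustments from a single image and lack an understanding of scene geometry, while 3D exploration methods rely on dense captures or prebuilt 3D environments coupled with computationally costly reinforcement learning search. The paper aims to address these issues with efficient, geometry-grounded viewpoint suggestion from sparse inputs.

Result: Extensive experiments show that the method consistently suggests viewpoints with better framing and composition than existing approaches, achieving superior aesthetic quality and opening a new direction toward 3D-aware aesthetic modeling.

Insight: The contributions include the 3D aesthetic field concept, which enables 3D aesthetic reasoning from sparse inputs; a 3D Gaussian Splatting network that distills 2D aesthetic knowledge into 3D space; and an efficient two-stage search pipeline that avoids the high cost of dense captures and reinforcement learning. Viewed objectively, combining 2D aesthetic assessment with a 3D geometric representation gives a scalable, practical solution for viewpoint selection in computational photography and graphics.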

Abstract: The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
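
The coarse-then-refine search named in this abstract can be illustrated on a toy problem. The sketch below is not the authors' pipeline: `aesthetic_score` is a hypothetical stand-in for the learned 3D aesthetic field, the viewpoint is reduced to a 2D position, and the gradient is taken by finite differences rather than by differentiating through a network.

```python
import math

def aesthetic_score(x, y):
    """Hypothetical smooth score with a single peak at (1.0, -0.5)."""
    return math.exp(-((x - 1.0) ** 2 + (y + 0.5) ** 2))

def numerical_gradient(f, x, y, eps=1e-5):
    """Central-difference gradient of f at (x, y)."""
    dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dx, dy

def two_stage_search(f, lo=-2.0, hi=2.0, grid=5, steps=200, lr=0.2):
    # Stage 1: coarse sampling on a regular grid of candidate viewpoints.
    ticks = [lo + i * (hi - lo) / (grid - 1) for i in range(grid)]
    x, y = max(((a, b) for a in ticks for b in ticks), key=lambda p: f(*p))
    # Stage 2: gradient ascent starting from the best coarse candidate.
    for _ in range(steps):
        gx, gy = numerical_gradient(f, x, y)
        x, y = x + lr * gx, y + lr * gy
    return x, y

best = two_stage_search(aesthetic_score)
```

The coarse grid only needs to land in the basin of a good optimum; the refinement stage then recovers a peak that lies between grid points, which is why the grid can stay small.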

[23] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation cs.CV

Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci

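TL;DR: This paper proposes CLIPoint3D, a CLIP-based framework for few-shot unsupervised 3D point cloud domain adaptation. It projects 3D point clouds into multi-view depth maps and exploits a frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that fuses high-level language priors with geometric cues from a lightweight 3D encoder. It also applies parameter-efficient fine-tuning, an entropy-guided view sampling strategy, and optimal transport-based and uncertainty-aware prototype alignment losses to bridge the distribution gap between source and target domains.

Details

Motivation: Existing vision-language models such as CLIP are fragile under domain shift, especially from synthetic to real-world point clouds, while conventional 3D domain adaptation methods rely on heavy trainable encoders and are therefore inefficient.

Result: On the PointDA-10 and GraspNetPC-10 benchmarks, CLIPoint3D achieves 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.

Insight: The novelty is the first CLIP-based framework for few-shot unsupervised 3D point cloud domain adaptation, which fuses language and geometric information through knowledge-driven prompt tuning and adds efficient view sampling and alignment losses, improving cross-domain adaptation while keeping the model lightweight.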

Abstract: Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP’s encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at https://github.com/SarthakM320/CLIPoint3D.
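
One plausible reading of "entropy-guided view sampling for selecting confident projections" is sketched below. The abstract does not give the exact criterion, so this assumes the Shannon entropy of each view's softmax class prediction and keeps the lowest-entropy views; the function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_confident_views(view_logits, k):
    """Return indices of the k views with the lowest prediction entropy."""
    scored = [(entropy(softmax(l)), i) for i, l in enumerate(view_logits)]
    return [i for _, i in sorted(scored)[:k]]

view_logits = [
    [4.0, 0.1, 0.2],  # peaked prediction: low entropy
    [1.0, 1.0, 1.0],  # uniform prediction: maximal entropy
    [2.0, 1.5, 0.1],  # in between
]
chosen = select_confident_views(view_logits, k=2)
```

Here the uniform view is discarded and `chosen` keeps views 0 and 2, ordered from most to least confident.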

[24] gQIR: Generative Quanta Image Reconstruction cs.CV

Aryan Garg, Sizhuo Ma, Mohit Gupta

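TL;DR: This paper proposes gQIR, which uses large text-to-image latent diffusion models to reconstruct high-quality images from the sparse, noisy, binary photon detections (quanta frames) captured by single-photon avalanche diode (SPAD) sensors. By adapting to the noise statistics of the photon-limited domain and combining latent-space restoration with burst-level spatio-temporal reasoning, it achieves better perceptual quality than classical and modern learning-based baselines on synthetic and real-world datasets.

Details

Motivation: Recovering high-quality images from raw SPAD quanta frames at extremely low photon counts (only a few detected photons) is challenging: the frames are sparse, noisy, and binary, and their noise statistics lie far outside those assumed by standard restoration pipelines or modern generative models.

Result: Evaluated on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging "Deforming (XD)" video benchmark, the method substantially improves perceptual quality over classical and modern learning-based baselines in all settings.

Insight: The novelty is adapting large text-to-image latent diffusion models to photon-limited quanta burst imaging: the method exploits the structural and semantic priors of internet-scale diffusion models, introduces mechanisms to handle Bernoulli photon statistics, and combines latent-space restoration with burst-level spatio-temporal reasoning, producing reconstructions that are photometrically faithful and perceptually pleasing even under high-speed motion.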

Abstract: Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at https://github.com/Aryan-Garg/gQIR.
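
To make "Bernoulli photon statistics" concrete, the sketch below simulates one pixel of a quanta burst under a commonly used idealized model, which is an assumption here and not taken from the paper: photon arrivals are Poisson with mean `flux` per frame, a frame reads 1 if at least one photon is detected, and dark counts are ignored. Under that model the detection probability is 1 - exp(-flux), and the maximum-likelihood flux estimate from a burst is -ln(1 - mean).

```python
import math
import random

def simulate_quanta_frames(flux, num_frames, rng):
    """Binary detections for one pixel: 1 if at least one photon arrives."""
    p_detect = 1.0 - math.exp(-flux)
    return [1 if rng.random() < p_detect else 0 for _ in range(num_frames)]

def estimate_flux(frames):
    """Maximum-likelihood flux estimate from the fraction of frames that fired."""
    mean = sum(frames) / len(frames)
    mean = min(mean, 1.0 - 1e-9)  # avoid log(0) when every frame fired
    return -math.log(1.0 - mean)

rng = random.Random(0)
frames = simulate_quanta_frames(flux=0.3, num_frames=20000, rng=rng)
flux_hat = estimate_flux(frames)
```

The nonlinearity is the point: naive averaging of binary frames estimates the detection probability, not the flux, and the two diverge as the pixel saturates. With only a handful of frames the estimate is very noisy, which is the regime where a strong image prior becomes useful.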

[25] MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation cs.CV | cs.CL

Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao

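TL;DR: MedCLIPSeg is a novel framework that adapts CLIP for data-efficient, generalizable, and uncertainty-aware medical image segmentation through probabilistic cross-modal attention and a soft patch-level contrastive loss.

Details

Motivation: Medical image segmentation is challenged by limited training annotations, ambiguous anatomical features, and domain shifts; the work also explores the potential of vision-language models such as CLIP for dense, text-guided medical image segmentation.

Result: Extensive experiments on 16 datasets spanning five imaging modalities and six organs show that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness while providing interpretable uncertainty maps.

Insight: The novelty lies in probabilistic cross-modal attention, which enables bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty, and in a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, improving data efficiency and domain generalizability.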

Abstract: Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.

[26] SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens cs.CV

Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt

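TL;DR: SceMoS is a scene-aware 3D human motion synthesis framework that decouples global planning from local execution and replaces expensive 3D data (such as point clouds) with lightweight 2D scene representations (a bird's-eye-view (BEV) image and local heightmaps), improving efficiency while preserving physical plausibility.

Details

Motivation: Existing methods for text-driven 3D human motion synthesis must simultaneously learn high-level semantic intent (e.g., "walk to the couch") and low-level physical feasibility (e.g., obstacle avoidance), and they depend on computationally expensive 3D scene data; the paper explores whether structured 2D scene representations can serve as an effective alternative to 3D supervision for efficient, physically realistic motion synthesis.

Result: On the TRUMANS benchmark, SceMoS achieves state-of-the-art motion realism and contact accuracy while reducing the number of trainable parameters for scene encoding by over 50%.

Insight: The novelty is decomposing motion synthesis into global planning based on BEV semantics and geometry-grounded motion tokenization based on local heightmaps, with the 2D factorization balancing efficiency and fidelity; viewed objectively, the method shows that lightweight 2D cues are sufficient to ground 3D human-scene interaction physically, offering a new way to reduce dependence on dense 3D data.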

Abstract: Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (“walk to the couch”) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird’s-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

[27] Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation cs.CV

Lin Li, Ziqi Jiang, Gefan Ye, Zhenqi He, Jiahui Li

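TL;DR: This paper proposes path-decoupled Hyperbolic Flow Matching for visual-semantic alignment in cross-modal few-shot adaptation. The method exploits the exponential expansion of the Lorentz manifold to decouple feature transport paths, using centripetal hyperbolic alignment to construct orderly flows and a path-decoupled objective to confine trajectories within class-specific geodesic corridors.

Details

Motivation: Existing Euclidean flow matching methods are limited when handling diverse feature distributions: the polynomial volume growth of flat geometry cannot accommodate these distributions, which leads to severe path entanglement.

Result: Extensive ablations on 11 benchmarks show that the method establishes a new state of the art, consistently outperforming its Euclidean counterparts.

Insight: The novelty is using the exponentially expanding space of hyperbolic geometry to decouple transport paths, and guiding and constraining feature transport through a centripetal hierarchy and a path-decoupled supervised objective, which offers a new geometric perspective and an effective mechanism for feature alignment in few-shot adaptation.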

Abstract: Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold’s exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a “semantic guardrail” rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our codes and models will be released.
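
For readers unfamiliar with the Lorentz manifold mentioned in this abstract, the sketch below implements three standard operations of the Lorentz (hyperboloid) model with curvature fixed at -1: the Lorentzian inner product, the exponential map at the origin, and the geodesic distance. These are textbook formulas, not the paper's HFM method, and the function names are chosen here for illustration.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Map a tangent vector at the origin (spatial part only) onto the hyperboloid."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_distance(x, y):
    """Geodesic distance; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = [1.0, 0.0, 0.0]
point = exp_map_origin([0.6, 0.8])  # tangent vector of Euclidean norm 1
```

Points on the manifold satisfy `lorentz_inner(x, x) == -1`, and the distance from the origin to `exp_map_origin(v)` equals the norm of `v`. The "exponential expansion" in the abstract refers to the fact that, at curvature -1, a circle of radius r has circumference 2π·sinh(r) rather than 2πr, so far more room is available near the boundary than in flat space.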

[28] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration cs.CV | cs.AI

Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou

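TL;DR: This paper proposes LESA (LEarnable Stage-Aware predictors), a framework for accelerating diffusion model inference. It uses two-stage training, leverages a Kolmogorov-Arnold Network (KAN) to learn temporal feature mappings, and introduces a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages for more precise and robust feature forecasting.

Details

Motivation: Diffusion models have achieved remarkable success in image and video generation, but the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to practical deployment. Existing feature-caching acceleration methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often degrading quality and failing to stay consistent with the standard denoising process.

Result: The method achieves 5.00x acceleration on FLUX.1-dev with only a 1.0% quality drop; a 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer); and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. It reaches state-of-the-art performance on both text-to-image and text-to-video synthesis.

Insight: The core novelty is a training-based learnable stage-aware predictor framework. Two ideas are worth borrowing: (1) a two-stage training strategy with a KAN to learn complex temporal feature mappings; (2) a multi-stage, multi-expert architecture that uses specialized predictors for different stages (noise levels) of the diffusion process to better match its dynamics, which is more adaptive and precise than simple feature reuse or fixed forecasting.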

Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.

[29] Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models cs.CV

Qing Zhang, Xuesong Li, Jing Zhang

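TL;DR: This paper examines how visual foundation models (VFMs) understand object affordance and argues that affordance understanding requires two complementary capacities: geometric perception and interaction perception. Systematic probing finds that DINO encodes part-level geometric structure, while Flux contains verb-conditioned spatial attention maps that act as implicit interaction priors. The study further shows that fusing these two cues in a training-free, zero-shot manner yields affordance estimation competitive with weakly supervised methods.

Details

Motivation: The work investigates what it means for a visual system to truly understand object affordance, arguing that this depends on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts.

Result: Fusing DINO's geometric prototypes with Flux's interaction maps in a training-free, zero-shot manner achieves affordance estimation on par with weakly supervised methods.

Insight: The novelty is systematically showing that geometric perception and interaction perception are two fundamental, composable building blocks of affordance understanding in VFMs, verified with a simple fusion experiment, which provides a mechanistic account of how perception grounds action.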

Abstract: What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent’s actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO’s geometric prototypes with Flux’s interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

[30] Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models cs.CV

Limai Jiang, Ruitao Xie, Bokai Yang, Huazhen Huang, Juan He

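TL;DR: This paper proposes an explanation method for medical image segmentation models based on a causal inference framework. It computes the average treatment effect (ATE) to quantify the influence of input regions and network components on target segmentation areas, and shows on two representative medical imaging datasets that its explanations are more faithful than those of existing methods.

Details

Motivation: The black-box nature of medical image segmentation models undermines their trustworthiness; existing explanation techniques mainly target classification tasks, leaving explanation methods for segmentation relatively underdeveloped.

Result: Compared with recent segmentation explainability techniques on two representative medical imaging datasets, the method provides more faithful explanations; a systematic causal analysis of multiple foundational segmentation models reveals significant heterogeneity in perceptual strategies across models and even across inputs to the same model.

Insight: The novelty is applying a causal inference framework to explaining segmentation, quantifying influence through the ATE; viewed objectively, the method offers a new source of insight for optimizing segmentation models and exposes differences in their internal decision-making.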

Abstract: Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduced an explanation model for segmentation tasks that employs the causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrated that our approach provides more faithful explanations than existing approaches. Furthermore, we carried out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model, suggesting the potential of our method to provide notable insights for optimizing segmentation models. Our code can be found at https://github.com/lcmmai/PdCR.

[31] How Do Inpainting Artifacts Propagate to Language? cs.CV | cs.AI

Pratham Yashwante, Davit Abrahamyan, Shresth Grover, Sukruth Rao

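TL;DR: This paper studies how visual artifacts produced by diffusion-based inpainting affect text generation in vision-language models. Using a two-stage diagnostic framework, it compares captions generated from original and inpainted images and analyzes the relationship between reconstruction fidelity and downstream caption quality.

Details

Motivation: The goal is to examine how visual artifacts introduced by diffusion inpainting models propagate through multimodal systems and affect language generation, in order to understand how visual reconstruction quality influences language model behavior.

Result: Across multiple datasets, pixel-level and perceptual reconstruction metrics are consistently associated with lexical and semantic captioning performance; analysis of intermediate visual representations and attention patterns shows that inpainting artifacts cause systematic, layer-dependent changes in model behavior.

Insight: The novelty is a diagnostic framework that quantifies the effect of visual reconstruction quality on language generation and reveals how inpainting artifacts propagate through vision-language models, offering a new perspective for evaluating the robustness of multimodal systems.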

Abstract: We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.

[32] A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata cs.CV

Azrin Sultana, Firoz Ahmed

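TL;DR: This paper proposes a lightweight vision-language fusion framework that combines the user interface (UI) layout and semantic information of mobile apps to predict app ratings. It uses MobileNetV3 to extract visual features from the UI and DistilBERT to extract textual features, fuses the multimodal features through a gated fusion module with Swish activations, and predicts ratings with a multilayer perceptron (MLP) regression head.

Details

Motivation: Most existing app rating prediction models rely only on textual data or only on UI features and overlook the value of using UI and semantic information jointly; the paper addresses this limitation by fusing multimodal information to improve prediction performance.

Result: After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251, indicating good predictive accuracy.

Insight: The novelty is a lightweight vision-language fusion framework that integrates UI layout and textual metadata through a gated fusion module, supports efficient deployment on edge devices, and gives developers a practical tool for sustainable app development.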

Abstract: App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision–language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
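
The five metrics reported in this abstract have standard definitions, sketched below in plain Python for reference. The sample ratings are made up for illustration and are unrelated to the paper's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R2, and Pearson correlation for paired predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    pearson = cov / math.sqrt(ss_tot * ss_pred)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "pearson": pearson}

# Made-up ratings, for illustration only.
metrics = regression_metrics([4.0, 3.0, 5.0, 2.0], [4.5, 3.0, 4.5, 2.0])
```

Because RMSE is the square root of MSE, the two reported values can be cross-checked: 0.1433 squared is about 0.0205, which matches the reported MSE.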

[33] Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing cs.CV

Subhra Jyoti Mandal, Lara Rachidi, Puneet Jain, Matthieu Duvinage, Sander W. Timmer

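TL;DR: This paper proposes a multi-agent framework that combines deep learning (DL) with vision-language models (VLMs) for colony-forming unit (CFU) detection in pharmaceutical manufacturing, improving both automation and accuracy.

Details

Motivation: Traditional manual counting in pharmaceutical quality inspection is labor-intensive, and purely deep learning approaches are vulnerable to sample quality variations and outlier cases.

Result: On GSK's dataset of over 50,000 Petri dish images, a custom Detectron2 model achieves a 99% detection rate; DL-based automation alone reduced human verification by 50%, and adding the VLM raised the reduction to 85%, a significant efficiency gain.

Insight: The novelty is using a VLM to classify sample validity and having it work alongside the DL model, with expert feedback enabling continuous self-improvement, which yields a scalable, auditable automation solution for pharmaceutical quality control.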

Abstract: Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical-grade requirements. A custom Detectron2 model trained on GSK’s dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi-agent framework combining DL with vision-language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self-improvement. Initial DL-based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.
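
The routing rule described in this abstract (record automatically when the two agents agree within 5 percent, otherwise send to an expert) can be sketched as a small function. Several details are assumptions made here, since the abstract does not state them: the relative difference is measured against the larger of the two counts, two zero counts are treated as agreement, and the return labels, including how invalid plates are handled, are hypothetical.

```python
def route_sample(is_valid, dl_count, vlm_count, tolerance=0.05):
    """Return 'invalid', 'auto_record', or 'expert_review' for one plate.

    Assumption: relative difference is taken against the larger count.
    """
    if not is_valid:
        return "invalid"  # hypothetical label; handling is not specified
    reference = max(dl_count, vlm_count)
    if reference == 0:
        return "auto_record"  # both agents report zero colonies
    if abs(dl_count - vlm_count) / reference <= tolerance:
        return "auto_record"
    return "expert_review"

decisions = [
    route_sample(True, 100, 97),   # 3% apart
    route_sample(True, 100, 90),   # 10% apart
    route_sample(False, 12, 12),   # plate rejected by the validity check
]
```

Requiring two independent estimates to agree means that expert time is spent only on plates where at least one model is likely wrong, which is consistent with the reported drop in human verification.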

[34] WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos cs.CV

Hanhui Li, Xuan Huang, Wanquan Liu, Yuhao Cheng, Long Chen

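TL;DR: This paper proposes WildGHand, an optimization-based framework for learning perturbation-resistant 3D Gaussian hand avatars from monocular in-the-wild videos. With a dynamic perturbation disentanglement module and a perturbation-aware optimization strategy, it achieves high-fidelity hand reconstruction in real-world scenes with severe perturbations such as hand-object interactions, extreme poses, illumination changes, and motion blur.

Details

Motivation: Existing methods rely on data captured in controlled environments and degrade in real-world scenes with severe perturbations (e.g., hand-object interactions, extreme poses). The paper tackles robust reconstruction of 3D hand avatars from monocular in-the-wild videos.

Result: Extensive experiments on a self-collected dataset and two public datasets show that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a 15.8% relative gain in PSNR and a 23.1% relative reduction in LPIPS).

Insight: The contributions are (1) a dynamic perturbation disentanglement module that explicitly models perturbations as time-varying biases on 3D Gaussian attributes, and (2) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together they identify and suppress perturbations across spatial and temporal dimensions, improving robustness in complex real-world scenes.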

Abstract: Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a 15.8% relative gain in PSNR and a 23.1% relative reduction in LPIPS). Our implementation and dataset are available at https://github.com/XuanHuang0/WildGHand.

[35] AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents cs.CVPDF

Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren

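TL;DR: This paper introduces AIForge-Doc, the first benchmark dataset dedicated to detecting diffusion-model-based inpainting tampering in financial and form documents, with pixel-level annotations. The dataset is built by systematically forging numeric fields in real receipt and form images from four public document datasets using two AI inpainting APIs, Gemini and Ideogram, yielding 4,061 forged images. The study evaluates three representative detectors (TruFor, DocTamper, and GPT-4o) and finds that existing methods degrade sharply on such AI forgeries, indicating a new and unsolved challenge.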

Details

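Motivation: Existing document forgery datasets rely on traditional digital editing tools (e.g., Photoshop), leaving state-of-the-art detectors unable to cope with the growing threat of AI-forged document fraud. AIForge-Doc aims to fill this gap with a benchmark built specifically for AI-driven document tampering.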

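Result: On the AIForge-Doc benchmark, existing detectors degrade severely: TruFor's AUC drops from 0.96 on NIST16 to 0.751 (zero-shot, out-of-distribution); DocTamper's AUC drops from 0.98 in-distribution to 0.563, with a pixel-level IoU of only 0.020; and GPT-4o reaches an AUC of only 0.509, close to random guessing. This confirms that AI-forged values are hard for automated detectors and vision-language models to distinguish.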

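Insight: The main contribution is the first document tampering detection benchmark focused on AI forgery (specifically diffusion-model inpainting). It systematically exposes how badly current state-of-the-art detectors fail against this new threat, pointing to an urgent, unsolved research direction for document forensics.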

Abstract: We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs – Gemini 2.5 Flash Image and Ideogram v2 Edit – yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors – TruFor, DocTamper, and a zero-shot GPT-4o judge – and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 – essentially at chance – confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
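
The two metrics quoted in this entry, image-level AUC and pixel-level IoU, can be sketched as follows. This is a minimal illustration and not the benchmark's evaluation code; the function names `pixel_iou` and `auc` and the toy inputs are assumptions made for the example.

```python
def pixel_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def auc(scores, labels):
    """AUC as the probability that a forged image (label 1) receives a
    higher score than an authentic one (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # A detector that gives every image the same score sits at chance level.
    print(auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
    print(pixel_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

Under this definition an AUC near 0.5, such as the 0.509 reported for GPT-4o, means forged and authentic images are ranked almost at random.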

[36] An interactive enhanced driving dataset for autonomous driving cs.CVPDF

Haojie Feng, Peizhi Zhang, Mengjie Tian, Xinrui Zhang, Zhuoren Li

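TL;DR: This paper proposes the Interactive Enhanced Driving Dataset (IEDD) to address the limits placed on vision-language-action models for autonomous driving by sparse interactive scenarios and inadequate multimodal alignment. It mines million-level interactive segments from naturalistic driving data and builds a bird's-eye-view video question-answering dataset (IEDD-VQA) in which semantic actions are strictly aligned with structured language, supporting evaluation and fine-tuning of the reasoning capabilities of autonomous driving models.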

Details

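Motivation: The move toward fully automated driving demands strong interactive capabilities, but interactive scenarios are sparse and multimodal alignment is inadequate in existing data, which constrains the development of vision-language-action models.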

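Result: The paper reports benchmark results for ten mainstream vision-language models, demonstrating the dataset's reuse value for evaluating and fine-tuning the reasoning capabilities of autonomous driving models.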

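Insight: Contributions include a scalable data-mining pipeline based on interactive trajectories, metrics that quantify the interaction process, and a synthetic bird's-eye-view video dataset in which semantic actions are strictly aligned with structured language, providing a high-quality benchmark for research on autonomous driving interaction.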

Abstract: The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird’s Eye View (BEV) videos where semantic actions are strictly aligned with structured language. Benchmark results evaluating ten mainstream Vision Language Models (VLMs) are provided to demonstrate the dataset’s reuse value in assessing and fine-tuning the reasoning capabilities of autonomous driving models.

[37] Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion cs.CVPDF

Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang

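TL;DR: This paper proposes MVLAD-AD, a new end-to-end autonomous driving framework built on a masked vision-language-action diffusion model that targets three challenges at once: inference efficiency, action precision, and explainability. Its core components are a discrete action tokenization strategy, geometry-aware embedding learning, and an action-priority decoding strategy, which together deliver efficient, precise planning on nuScenes and derived benchmarks along with high-fidelity explainable reasoning.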

Details

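Motivation: Existing LLM/VLM-based end-to-end driving models fall short in inference latency, action precision, and explainability. Autoregressive methods generate slowly, token by token, while existing diffusion planners rely on general-purpose language tokens that lack explicit geometric structure. This paper aims to bridge the gap between efficient planning and semantic explainability.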

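Result: Extensive experiments on nuScenes and derived benchmarks show that MVLAD-AD surpasses state-of-the-art autoregressive and diffusion baselines in planning precision while also achieving superior efficiency.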

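Insight: The main contributions are: (1) a discrete action tokenization strategy that builds a compact codebook of kinematically feasible waypoints from real-world driving distributions, avoiding forcing actions into the language space; (2) geometry-aware embedding learning, which ensures that embeddings in the latent space approximate physical geometric metrics; and (3) an action-priority decoding strategy that prioritizes trajectory generation. Together these designs improve efficiency, precision, and explainability.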

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.

[38] PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models cs.CVPDF

Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim

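TL;DR: PropFly is a training framework for propagation-based video editing that uses on-the-fly supervision from pre-trained video diffusion models (VDMs) to generate diverse source-edited latent pairs, removing the need for large-scale paired video datasets. It trains an adapter with a Guidance-Modulated Flow Matching (GMFM) loss to propagate edits while preserving the structure and motion of the original video, yielding high-quality, temporally consistent video editing.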

Details

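Motivation: Training propagation-based video editing models depends on large-scale, costly paired video datasets. The paper proposes an on-the-fly supervision scheme that requires no precomputed dataset.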

Result: Across a variety of video editing tasks, PropFly significantly outperforms existing state-of-the-art methods and produces high-quality edits.

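Insight: Contributions include synthesizing source-edited latent pairs from noised latents at different classifier-free guidance (CFG) scales as on-the-fly supervision, and using the GMFM loss to guide the model toward the target transformation, enabling dynamic, temporally consistent edit propagation.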

Abstract: Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of ‘source’ (low-CFG) and ‘edited’ (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results.
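
The idea of pairing a low-CFG and a high-CFG one-step estimate of the same noised latent can be illustrated with a small sketch. It is not PropFly's implementation. It assumes the standard classifier-free guidance combination and a rectified-flow convention in which `x_t = (1 - t) * x0 + t * noise` and the model predicts `velocity = noise - x0`; the abstract does not state the parameterization, and all function names and guidance scales below are hypothetical.

```python
def cfg(pred_uncond, pred_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by the guidance scale."""
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


def one_step_clean_estimate(x_t, velocity, t):
    """Clean-latent estimate under the ASSUMED convention
    x_t = (1 - t) * x0 + t * noise and velocity = noise - x0."""
    return [x - t * v for x, v in zip(x_t, velocity)]


def source_edited_pair(x_t, v_uncond, v_cond, t, low_scale=1.0, high_scale=7.0):
    """Build a ('source', 'edited') latent pair from one noised latent by
    decoding it with a low and a high guidance scale (scales are illustrative)."""
    source = one_step_clean_estimate(x_t, cfg(v_uncond, v_cond, low_scale), t)
    edited = one_step_clean_estimate(x_t, cfg(v_uncond, v_cond, high_scale), t)
    return source, edited


if __name__ == "__main__":
    src, edt = source_edited_pair([0.8, -1.0], [-0.5, 2.5], [-0.4, 2.0], t=0.4)
    print(src, edt)
```

Because both latents come from the same noised input, they share structure, while the stronger guidance pushes the edited latent further toward the text condition.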

[39] VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos cs.CVPDF

Aihua Mao, Kaihang Huang, Yong-Jin Liu, Chee Seng Chan, Ying He

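TL;DR: This paper proposes VAGNet, a framework for video-guided 3D object affordance grounding. It learns functional supervision from dynamic human-object interaction (HOI) sequences in video to more accurately identify the contact regions on 3D objects that support interaction. The authors also build PVAD, the first paired HOI video-3D dataset, to support this new task.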

Details

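Motivation: Existing 3D affordance grounding methods rely mainly on static visual or textual cues and overlook the fact that affordances are inherently defined by dynamic actions, so they struggle to localize the contact regions involved in real interactions. Motivated by the intuition that humans learn to use objects by observing and imitating actions, the paper uses dynamic interaction sequences to provide functional supervision.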

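Result: Extensive experiments on the proposed PVAD dataset show that VAGNet achieves state-of-the-art performance and significantly outperforms baselines based on static cues.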

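Insight: The paper is the first to bring dynamic video interaction sequences into 3D affordance grounding, using video-3D alignment to resolve ambiguities that static cues cannot. It also contributes PVAD, the first paired video-3D HOI affordance dataset, as a new source of functional supervision for the field.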

Abstract: 3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be made publicly available.

[40] From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection cs.CVPDF

Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge

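TL;DR: This paper proposes TraqPoint, an end-to-end reinforcement-learning framework for keypoint detection. It recasts keypoint detection as a sequential decision-making problem and uses a track-aware reward to optimize the track quality of keypoints across image sequences, improving long-term trackability under viewpoint and illumination changes.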

Details

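Motivation: Existing learning-based keypoint detectors are usually trained on image pairs and do not explicitly optimize the long-term trackability of keypoints across sequences, especially under challenging viewpoint and illumination changes.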

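Result: On sparse matching benchmarks, including relative pose estimation and 3D reconstruction, TraqPoint significantly outperforms several state-of-the-art keypoint detection and description methods.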

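Insight: The key idea is to treat keypoint detection as a sequential decision-making problem and to introduce a track-aware reward that, via reinforcement learning, directly optimizes keypoint track quality and encourages consistency and distinctiveness across views, offering a new optimization perspective for keypoint detection.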

Abstract: Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.

[41] Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video cs.CV | cs.AI | cs.HC | cs.LGPDF

Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum

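TL;DR: This study evaluates the feasibility of using vision-language models (VLMs) to non-invasively estimate horizontal and vertical hand distances in manual lifting tasks from RGB video streams. Two multi-stage VLM-based pipelines were developed, a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both use text-guided localization of regions of interest, visual feature extraction, and transformer-based temporal regression to estimate the distances at the start and end of a lift.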

Details

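Motivation: Manual lifting is a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and guiding interventions. The Revised NIOSH Lifting Equation is a widely used risk assessment tool, but the horizontal and vertical hand distances it requires are typically obtained through manual measurement or specialized sensing systems, which are hard to use in real-world environments.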

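Result: Performance was evaluated with leave-one-subject-out cross-validation across a range of lifting tasks and seven camera view conditions. The segmentation-based multi-view pipeline performed best, with mean absolute errors of approximately 6-8 cm for horizontal distance and 5-8 cm for vertical distance. Relative to the detection-only pipeline, pixel-level segmentation reduced estimation error by approximately 20-30% for horizontal distance and 35-40% for vertical distance.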

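Insight: The contribution lies in applying vision-language models to ergonomic risk assessment: text-guided localization and transformer-based temporal regression enable non-invasive, automated estimation of key distance parameters from ordinary RGB video. The detection-plus-segmentation pipeline, and pixel-level segmentation in particular, markedly improves estimation accuracy and offers a workable path toward video-based automated risk assessment.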

Abstract: Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
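
To show why centimeter-level errors in H and V matter, the sketch below evaluates the horizontal and vertical multipliers of the RNLE. The formulas are the metric forms as commonly stated for the equation (HM = 25/H, VM = 1 - 0.003|V - 75|, with H and V in centimeters); they are not given in the abstract, so treat them as an assumption to check against the NIOSH applications manual. The helper `multiplier_shift` is hypothetical and only illustrates error propagation.

```python
def horizontal_multiplier(h_cm: float) -> float:
    """HM = 25 / H; H is clamped to at least 25 cm and HM is 0 beyond 63 cm."""
    if h_cm > 63.0:
        return 0.0
    return 25.0 / max(h_cm, 25.0)


def vertical_multiplier(v_cm: float) -> float:
    """VM = 1 - 0.003 * |V - 75| for 0 <= V <= 175 cm, otherwise 0."""
    if v_cm < 0.0 or v_cm > 175.0:
        return 0.0
    return 1.0 - 0.003 * abs(v_cm - 75.0)


def multiplier_shift(true_h, true_v, err_h, err_v):
    """Change in the HM * VM product when H and V are mis-estimated
    by err_h and err_v centimeters (hypothetical helper)."""
    exact = horizontal_multiplier(true_h) * vertical_multiplier(true_v)
    estimated = (horizontal_multiplier(true_h + err_h)
                 * vertical_multiplier(true_v + err_v))
    return estimated - exact


if __name__ == "__main__":
    # Effect of a 7 cm overestimate of H and a 6 cm overestimate of V.
    print(multiplier_shift(true_h=40.0, true_v=75.0, err_h=7.0, err_v=6.0))
```

Since HM is inversely proportional to H, an error of a few centimeters in H changes the recommended weight limit far more than the same error in V, which is one reason the reported H errors are the more consequential ones.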

[42] AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist? cs.CVPDF

Hailong Yan, Shice Liu, Tao Wang, Xiangtao Zhang, Yijie Zhong

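TL;DR: This paper proposes AnimeAgent, the first multi-agent framework based on image-to-video (I2V) models for Custom Storyboard Generation (CSG). It targets the limitations of existing static diffusion models in dynamic expressiveness, one-shot inference, and multi-agent evaluation. By combining a Disney animation workflow with a mixed subjective-objective reviewer, it achieves SOTA performance in consistency, prompt fidelity, and stylization.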

Details

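Motivation: Current CSG methods based on static diffusion models face three key limitations: limited dynamic expressiveness, one-shot inference that cannot iteratively correct errors, and multi-agent frameworks that depend on non-robust evaluators poorly suited to stylized, non-realistic animation. The paper proposes a new solution to these problems.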

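Result: On a collected human-annotated CSG benchmark with ground truth, experiments show that AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.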

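Insight: Contributions include: the first use of the implicit motion prior of I2V models within a multi-agent framework to improve consistency and expressiveness; a pipeline inspired by Disney's "Combination of Straight Ahead and Pose to Pose" workflow; and a mixed subjective-objective reviewer that enables reliable iterative refinement.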

Abstract: Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to a “copy-paste” pattern. (2) One-shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi-agent frameworks rely on non-robust evaluators that are ill-suited for assessing stylized, non-realistic animation. To address these, we propose AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. Inspired by Disney’s “Combination of Straight Ahead and Pose to Pose” workflow, AnimeAgent leverages I2V’s implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement. We also collect a human-annotated CSG benchmark with ground truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.

[43] GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation cs.CVPDF

Hao Zhang, Lue Fan, Qitai Wang, Wenbo Li, Zehuan Wu

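TL;DR: GA-Drive is a new simulation framework for free-viewpoint driving scene generation. Using geometry-appearance decoupling and diffusion-based generation, it produces photorealistic camera views along user-specified novel trajectories and supports appearance editing.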

Details

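Motivation: Existing driving simulators fall short in free-viewpoint rendering, editability, and fidelity, all of which matter for training and evaluating end-to-end autonomous driving systems.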

Result: GA-Drive substantially outperforms existing methods on the NTA-IoU, NTL-IoU, and FID metrics.

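Insight: The core idea is to decouple scene geometry from appearance: pseudo-views are first synthesized from geometry and then converted into photorealistic views by a video diffusion model. This decoupling allows appearance editing with state-of-the-art video-to-video editing techniques while preserving geometric consistency.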

Abstract: A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.

[44] VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation cs.CV | cs.AI | cs.CLPDF

Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li

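TL;DR: This paper proposes VAUQ, a vision-aware uncertainty quantification framework for self-evaluation of large vision-language models (LVLMs). It introduces the Image-Information Score (IS), which quantifies how much the visual input reduces the model's predictive uncertainty, and combines it with an unsupervised core-region masking strategy that amplifies the influence of salient regions, yielding a training-free scoring function that reliably reflects answer correctness.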

Details

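Motivation: LVLMs frequently hallucinate, which limits their safe deployment in real-world applications. Existing LLM self-evaluation methods depend heavily on language priors and are ill-suited to vision-conditioned predictions, so a method is needed that explicitly measures how strongly a model's output depends on visual evidence.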

Result: Comprehensive experiments on multiple datasets show that VAUQ consistently outperforms existing methods on self-evaluation.

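Insight: The contribution is the Image-Information Score (IS), which captures how much visual input reduces predictive uncertainty, together with an unsupervised core-region masking strategy that amplifies the influence of salient regions, enabling training-free self-evaluation grounded in visual evidence. The method couples uncertainty quantification with visual saliency analysis, offering a new angle on LVLM reliability assessment.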

Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model’s ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model’s output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
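
One plausible reading of the Image-Information Score is the drop in predictive entropy when the image is supplied. The sketch below implements that reading. It is an assumption, not the paper's definition: the exact IS formula, the core-region masking step, and the way IS is combined with predictive entropy are not specified in the abstract, so `self_eval_score` and its `weight` parameter are purely hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def image_information_score(p_with_image, p_without_image):
    """ASSUMED form of IS: uncertainty without the image minus
    uncertainty with it. Larger means the answer leans on visual evidence."""
    return entropy(p_without_image) - entropy(p_with_image)


def self_eval_score(p_with_image, p_without_image, weight=1.0):
    """Hypothetical combination: confident predictions (low entropy)
    that depend on the image (high IS) receive higher scores."""
    is_score = image_information_score(p_with_image, p_without_image)
    return -entropy(p_with_image) + weight * is_score


if __name__ == "__main__":
    grounded = self_eval_score([0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25])
    prior_driven = self_eval_score([0.97, 0.01, 0.01, 0.01], [0.97, 0.01, 0.01, 0.01])
    print(grounded > prior_driven)
```

The usage example contrasts two equally confident answers: the one whose confidence disappears when the image is removed scores higher than the one that the language prior alone already produces.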

[45] RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation cs.CVPDF

Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang

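TL;DR: RAYNOVA is an autoregressive world model that requires no 3D geometric priors. It uses a dual-causal autoregressive framework and a unified 4D spatio-temporal representation, building an isotropic representation across views, frames, and scales with relative Plücker-ray positional encoding. This enables robust generalization to diverse camera setups and ego motions and achieves SOTA multi-view video generation on nuScenes.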

Details

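Motivation: Existing world models handle spatial and temporal correlations separately and rely on strong 3D geometric priors, which makes it hard to generalize to different camera configurations and to long-horizon video generation.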

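Result: RAYNOVA achieves state-of-the-art multi-view video generation results on the nuScenes benchmark, with higher throughput and strong controllability under diverse input conditions.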

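Insight: The contribution is a geometry-free, dual-causal autoregressive framework built on a unified 4D spatio-temporal representation, along with the use of relative Plücker-ray positional encoding to construct an isotropic representation. This reduces dependence on explicit 3D scene representations and strengthens generalization.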

Abstract: World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at http://yichen928.github.io/raynova.
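
For readers unfamiliar with Plücker rays, the sketch below computes the standard six-dimensional Plücker coordinates of a camera ray: the unit direction followed by the moment `origin × direction`. It covers only this building block. How RAYNOVA turns such coordinates into a relative positional encoding is not described in the abstract and is not modeled here; the function name `plucker_ray` is an assumption.

```python
import math


def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d.
    The result is the same for any origin chosen along the same ray."""
    norm = math.sqrt(sum(c * c for c in direction))
    dx, dy, dz = (c / norm for c in direction)
    ox, oy, oz = origin
    moment = (oy * dz - oz * dy,
              oz * dx - ox * dz,
              ox * dy - oy * dx)
    return (dx, dy, dz) + moment


if __name__ == "__main__":
    # Two points on the same ray give identical Pluecker coordinates.
    print(plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 2.0)))
    print(plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 2.0)))
```

Because the coordinates describe the line itself, independent of which point along it is used as the origin, they give a camera-agnostic description of each pixel's viewing ray.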

[46] MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision cs.CVPDF

Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas

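TL;DR: This paper proposes MatchED, a lightweight, plug-and-play matching-based supervision module for end-to-end learning of crisp, one-pixel-wide edge maps. It performs one-to-one matching based on spatial distance and confidence to optimize crisp edges directly during training, without relying on non-differentiable post-processing such as non-maximum suppression or skeleton-based thinning.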

Details

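Motivation: Existing crisp edge detection methods generally rely on non-differentiable post-processing algorithms (such as non-maximum suppression and skeleton-based thinning), which hinders end-to-end optimization, and all of them still need post-processing to reach satisfactory results. This paper aims to remove that limitation and achieve end-to-end crisp edge detection without post-processing.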

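Result: Experiments on four popular datasets show that integrating MatchED substantially improves existing edge detection models. Under the crispness-emphasized evaluation (CEval), MatchED raises baseline ODS by up to 20-35%, with similar gains in OIS and AP, and for the first time matches or surpasses SOTA results obtained with standard post-processing.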

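Insight: The core contribution is a matching-based supervision mechanism: one-to-one matching at training time keeps the training and testing protocols consistent, so crisp edges are learned directly. The module is lightweight (only about 21K parameters) and can be plugged into any edge detection model, enabling end-to-end crisp edge detection without post-processing.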

Abstract: Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only ~21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2–4× compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20–35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at https://cvpr26-matched.github.io.
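
One-to-one matching between predicted and ground-truth edge pixels can be sketched with a simple greedy rule: visit predictions from most to least confident and give each the nearest still-unclaimed ground-truth pixel within a distance tolerance. This is a simplified stand-in, not MatchED's actual matching procedure, which the abstract does not detail; `match_edges` and `max_dist` are hypothetical names.

```python
import math


def match_edges(pred, gt, max_dist=2.0):
    """Greedy one-to-one matching.

    pred: list of (row, col, confidence) predicted edge pixels.
    gt:   list of (row, col) ground-truth edge pixels.
    Returns {pred_index: gt_index}; unmatched predictions are omitted.
    """
    matches = {}
    taken = set()
    # Most confident predictions choose first.
    order = sorted(range(len(pred)), key=lambda i: -pred[i][2])
    for i in order:
        row, col, _ = pred[i]
        best_j, best_d = None, max_dist
        for j, (g_row, g_col) in enumerate(gt):
            if j in taken:
                continue
            d = math.hypot(row - g_row, col - g_col)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches[i] = best_j
            taken.add(best_j)
    return matches


if __name__ == "__main__":
    predictions = [(0, 0, 0.9), (0, 1, 0.6), (5, 5, 0.8)]
    ground_truth = [(0, 0)]
    print(match_edges(predictions, ground_truth))
```

In the usage example the thick response at `(0, 1)` stays unmatched because the single ground-truth pixel is already claimed, so a loss built on these matches would treat it as a non-edge, which is the pressure toward one-pixel-wide output.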

[47] NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image cs.CVPDF

Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya

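TL;DR: This paper proposes NGL-Prompter, a training-free method for estimating garment sewing patterns from a single image. It introduces an intermediate representation called NGL (Natural Garment Language) that restructures the parametric garment model GarmentCode into a form that large vision-language models (VLMs) understand more readily. VLMs are then queried directly to extract structured parameters, which are mapped to valid GarmentCode, enabling high-quality 3D garment reconstruction from real-world images, including multi-layer outfits.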

Details

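Motivation: Existing methods fine-tune large VLMs on synthetic datasets but generalize poorly, fail to capture real-world correlations between garment parts, and are usually limited to single-layer garments. This paper addresses these limitations by exploiting VLMs' ability to describe garments in natural language and designing an intermediate representation that bridges the semantic gap between images and the parametric model.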

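Result: Evaluated on Dress4D, CloSe, and a newly collected dataset of about 5,000 real-world fashion images, the method achieves state-of-the-art (SOTA) performance on standard geometry metrics and is strongly preferred over existing baselines in both human and GPT-based perceptual evaluations.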

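Insight: The core contribution is NGL, an intermediate language that reorganizes the complex structure of the parametric model into a representation language models find easier to understand, so that garment parameters can be extracted accurately from a single image and sewing patterns reconstructed through prompting alone, with no model training. The method also handles multi-layer outfits and occluded parts, underscoring its strong generalization to real images.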

Abstract: Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on Dress4D, CloSe, and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-Prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.

[48] Communication-Inspired Tokenization for Structured Image Representations cs.CV | cs.AI | cs.LGPDF

Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva

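TL;DR: This paper proposes COMiT, a framework inspired by human communication for structured discrete image tokenization. It builds a latent message within a fixed token budget by iteratively observing local image regions and recurrently updating a discrete representation, then reconstructs the full image with a flow-matching decoder.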

Details

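Motivation: Existing discrete image tokenizers are optimized mainly for reconstruction and compression, and their tokens tend to capture local texture rather than object-level semantic structure. A method that produces structured, semantic visual token sequences is therefore needed.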

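Result: Experiments show that COMiT substantially improves compositional generalization and relational reasoning over prior methods and induces interpretable, object-centric token structure.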

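Insight: The contribution is to bring the incremental and compositional nature of human communication into visual tokenization: attentive sequential tokenization yields structured representations, and the model is trained end-to-end with flow-matching reconstruction and semantic alignment losses.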

Abstract: Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.

[49] Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization cs.CVPDF

Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot

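TL;DR: This paper studies federated learning for cross-modality medical image segmentation and proposes an augmentation-driven generalization approach. Global intensity nonlinear (GIN) augmentation simulates cross-modality appearance variations, improving generalization across modalities such as CT and MRI while preserving data privacy.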

Details

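Motivation: Data privacy constraints and institutional data silos make it hard to train generalizable models for medical image analysis. In federated learning, when each client holds only single-modality data (CT or MRI), cross-modality domain shift severely degrades model performance.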

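Result: On abdominal organ segmentation and whole-heart segmentation benchmarks, GIN augmentation outperforms the other augmentation strategies in both centralized and federated settings. For example, the pancreas Dice score improves from 0.073 to 0.437 (a 498% gain), and the federated approach reaches 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization.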

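Insight: The paper systematically evaluates several augmentation strategies (spatial augmentation, frequency-domain manipulation, domain-specific normalization, and GIN) and shows that GIN improves generalization by simulating cross-modality appearance variations while preserving anatomical structure. This offers a practical route to federated AI deployment in clinical settings without paired multimodal data or complex architectures.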

Abstract: Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.
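
The Dice score and the relative gain quoted for the pancreas can be reproduced with a few lines. This is a minimal sketch over flattened binary masks, not the paper's evaluation code; `dice` and `relative_gain` are names chosen for the example.

```python
def dice(pred, true):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    inter = sum(1 for p, t in zip(pred, true) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in true if t)
    # Two empty masks agree perfectly by convention.
    return 2.0 * inter / total if total else 1.0


def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


if __name__ == "__main__":
    # Pancreas Dice reported in the abstract: 0.073 -> 0.437.
    print(f"{relative_gain(0.073, 0.437):.1%}")
```

With the rounded Dice values from the abstract the relative gain evaluates to roughly 498.6%, which agrees with the reported 498% up to rounding of the underlying scores.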

[50] Real-time Motion Segmentation with Event-based Normal Flow cs.CV | cs.ROPDF

Sheng Zhong, Zhongyang Ren, Xiya Zhu, Dehao Yuan, Cornelia Fermuller

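TL;DR: This paper proposes a real-time motion segmentation framework for event cameras. It uses normal flow as an intermediate representation that compresses the motion information in event clusters, formulates motion segmentation as a graph-cut energy minimization problem, and optimizes it iteratively with normal flow clustering and motion model fitting, running nearly 800 times faster than the existing open-source state-of-the-art method.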

Details

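Motivation: Event cameras offer microsecond resolution and asynchronous response, which suits vision tasks in challenging scenarios. However, individual events carry sparse information, so processing raw event data directly is inefficient and limits the use of existing methods in real-time tasks such as motion segmentation.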

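Result: Extensive evaluation on multiple public datasets shows that the framework is both accurate and efficient, achieving a nearly 800x speedup over the open-source state-of-the-art method and ensuring real-time performance.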

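Insight: The method takes dense normal flow learned directly from event neighborhoods as input and uses a normal-flow-based motion model initialization and fitting procedure, so the motion models of independently moving objects can be estimated efficiently from only a limited number of candidate models, significantly reducing computational complexity.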

Abstract: Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly an 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework.
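
Normal flow is the component of the image motion along the local brightness gradient, which is the only component recoverable from a local edge (the aperture problem). The sketch below shows just this definition by projecting a known full flow vector onto a gradient direction. It is illustrative only: the paper learns normal flow from event neighborhoods rather than computing it from a known full flow, and the function name is an assumption.

```python
def normal_flow(flow, gradient):
    """Project a 2D flow vector onto the gradient direction.

    flow:     (u, v) full optical flow at a pixel.
    gradient: (gx, gy) spatial brightness gradient at the same pixel.
    Returns the normal-flow vector, or (0.0, 0.0) where the gradient vanishes.
    """
    u, v = flow
    gx, gy = gradient
    g2 = gx * gx + gy * gy
    if g2 == 0:
        return (0.0, 0.0)
    scale = (u * gx + v * gy) / g2
    return (scale * gx, scale * gy)


if __name__ == "__main__":
    # A vertical edge (horizontal gradient) reveals only horizontal motion.
    print(normal_flow((3.0, 4.0), (1.0, 0.0)))
```

Each normal-flow vector constrains the underlying motion model along one direction only, which is why fitting a motion model requires aggregating many such vectors over a region.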

[51] SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking cs.CVPDF

Muhammad Saif Ullah Khan, Didier Stricker

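TL;DR: This paper presents a biomechanics-aware keypoint simulation framework that generates anatomically consistent 3D spinal keypoints from existing human pose datasets, and uses it to build SIMSPINE, the first open dataset of its kind, with 2.14 million frames of natural full-body motion from unrestrained indoor multi-view capture and sparse vertebra-level 3D spine annotations. The authors also release pretrained baselines covering 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation.

Details

Motivation: Modeling spinal motion is essential to understanding human biomechanics, yet the area remains underexplored in computer vision because of the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations.

Result: The proposed 2D spine baselines raise the state of the art from 0.63 to 0.80 AUC in controlled environments and from 0.91 to 0.93 AP for in-the-wild spine tracking.

Insight: The contribution is a framework that combines musculoskeletal simulation with computer vision, enabling data-driven learning of vertebral kinematics from subtle posture variations, together with a dataset and benchmark for reproducible, anatomically grounded 3D spine estimation under natural conditions.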

Abstract: Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine’s complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.

[52] VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving cs.CV

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye

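TL;DR: This paper proposes VGGDrive, an architecture that introduces a plug-and-play Cross-View 3D Geometric Enabler (CVGE) to combine geometric features from a mature 3D foundation model with the 2D visual features of a vision-language model (VLM), strengthening the VLM's cross-view 3D geometric modeling for autonomous driving tasks.

Details

Motivation: Existing VLMs lack cross-view 3D geometric modeling capability, which leads to mediocre performance on autonomous driving tasks. The paper aims to close this capability gap by incorporating the geometric priors of mature 3D foundation models.

Result: Extensive experiments on five autonomous driving benchmarks, covering tasks such as cross-view risk perception, motion prediction, and trajectory planning, show that VGGDrive significantly improves the performance of the base VLM.

Insight: The core contribution is the plug-and-play CVGE module, which uses a hierarchical adaptive injection mechanism to inject cross-view geometric features from a frozen 3D vision model into the VLM, bridging 3D geometric priors and 2D visual features and offering an early exploration of a paradigm in which 3D foundation models empower autonomous driving tasks.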

Abstract: The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM’s 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It’s our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

[53] GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection cs.CV

Yingying Guo, Ke Zhang, Zirong Zeng

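TL;DR: This paper proposes GatedCLIP, a vision-language model designed for hateful meme detection. It strengthens CLIP for multimodal hateful content detection with learned projection heads, a dynamic gated fusion mechanism, and a contrastive learning objective, and substantially outperforms the CLIP baseline on the Hateful Memes dataset.

Details

Motivation: Detecting hateful content in multimodal memes is challenging because harmful messages often emerge from the complex interplay between benign images and text. Existing models such as CLIP fall short on this task and need targeted architectural improvements.

Result: On the Hateful Memes dataset, GatedCLIP reaches an AUROC of 0.66, substantially above the CLIP baseline's 0.49, while using only 350K trainable parameters and remaining computationally efficient.

Insight: The contributions are learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. These ideas may carry over to other tasks that need fine-grained multimodal fusion.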

Abstract: Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP’s multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
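
The abstract does not give the exact form of the gated fusion, so the following is only a minimal sketch of one common formulation, assuming a per-dimension sigmoid gate computed from the concatenated image and text features. The function names and the gate parameters `gate_w` and `gate_b` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(img_feat, txt_feat, gate_w, gate_b):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    Assumption: gate_w[i] holds the weights applied to the concatenated
    [img_feat; txt_feat] vector for output dimension i, and gate_b[i] is
    the matching bias. A gate near 1 favours the image feature, a gate
    near 0 favours the text feature.
    """
    concat = list(img_feat) + list(txt_feat)
    fused = []
    for i in range(len(img_feat)):
        logit = sum(w * x for w, x in zip(gate_w[i], concat)) + gate_b[i]
        g = sigmoid(logit)
        fused.append(g * img_feat[i] + (1.0 - g) * txt_feat[i])
    return fused
```

With all-zero gate weights and biases the gate is 0.5 everywhere, so the fusion reduces to a plain average of the two modalities; in a trained model the gate would instead vary per input.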

[54] On the Explainability of Vision-Language Models in Art History cs.CV

Stefanie Schneider

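TL;DR: This paper studies the explainability of vision-language models (VLMs) in art-historical contexts. It evaluates seven Explainable Artificial Intelligence (XAI) methods, combining zero-shot localization experiments with human interpretability studies, to examine the visual reasoning of CLIP.

Details

Motivation: To address the question of what machine 'understanding' amounts to for VLMs in art history, with the aim of making a model's visual reasoning more transparent to humans.

Result: Experiments indicate that these methods capture some aspects of human interpretation, but their effectiveness depends on the conceptual stability and representational availability of the examined categories.

Insight: The contribution is the application of XAI methods to explaining VLMs in art history, which reveals how explanation quality depends on the properties of the concepts involved and offers a new perspective for interdisciplinary explainability research.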

Abstract: Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine ‘understanding.’ In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.

[55] DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation cs.CV

Wangkai Li, Rui Sun, Zhaoyang Li, Yujia Chen, Tianzhu Zhang

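TL;DR: This paper proposes DA-Cal, a cross-domain calibration framework for semantic segmentation. By recasting target-domain calibration as soft pseudo-label optimization, it addresses the poor network calibration of existing unsupervised domain adaptation methods in cross-domain scenarios.

Details

Motivation: Existing unsupervised domain adaptation (UDA) methods improve target-domain segmentation performance but often neglect calibration quality, leaving prediction confidence misaligned with actual accuracy, a significant risk in safety-critical applications.

Result: Experiments show that DA-Cal integrates seamlessly with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target-domain calibration while delivering performance gains without inference overhead.

Insight: The contribution is to recast target-domain calibration as soft pseudo-label optimization, introduce a Meta Temperature Network that generates pixel-level calibration parameters, use bi-level optimization to relate soft pseudo-labels to UDA supervision, and apply complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies.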

Abstract: While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy – a significant risk in safety-critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo-labels to hard pseudo-labels. Based on this finding, we propose DA-Cal, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization. DA-Cal introduces a Meta Temperature Network to generate pixel-level calibration parameters and employs bi-level optimization to establish the relationship between soft pseudo-labels and UDA supervision, while utilizing complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
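
To make the idea of pixel-level calibration parameters concrete, here is a minimal sketch assuming the parameters act as softmax temperatures, which is the standard temperature-scaling form. The Meta Temperature Network and the bi-level optimization are not reproduced; the per-pixel temperatures are simply passed in, and all function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over class logits after dividing by a temperature.

    Temperatures above 1 soften the distribution (lower confidence),
    temperatures below 1 sharpen it. The argmax is unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_pseudo_labels(pixel_logits, pixel_temperatures):
    """Build one soft pseudo-label per pixel from its own temperature.

    Assumption: in the paper a network would predict the temperatures;
    here they are given as a plain list, one value per pixel.
    """
    return [
        softmax_with_temperature(logits, t)
        for logits, t in zip(pixel_logits, pixel_temperatures)
    ]
```

For example, logits `[2.0, 0.0]` give a top-class probability of about 0.88 at temperature 1 and about 0.62 at temperature 4, which is the kind of confidence adjustment a calibrated soft pseudo-label relies on.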

[56] MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification cs.CV

Jiahao Xu, Sheng Huang, Xin Zhang, Zhixiong Nan, Jiajun Dong

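TL;DR: This paper proposes MUSE, a stochastic multi-view semantic enhancement framework for few-shot whole slide image (WSI) classification in computational pathology, where labeled data are extremely scarce. A sample-wise fine-grained semantic enhancement module produces an adaptive visual-semantic interaction prior for each sample; guided by this prior, a retrieval-augmented multi-view generation mechanism stochastically integrates diverse pathological descriptions as semantic supervision during training, improving generalization.

Details

Motivation: Existing vision-language methods typically treat LLM-generated textual semantics as static class-level priors that lack sample-wise refinement and diversity. This limits the precision and richness of visual-semantic alignment and hinders generalization under limited supervision.

Result: Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings.

Insight: The contributions are sample-wise fine-grained semantic enhancement and stochastic multi-view model optimization. The work argues that effective few-shot pathology learning needs not only richer semantic sources but also active, sample-aware semantic optimization.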

Abstract: In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: https://github.com/JiahaoXu-god/CVPR2026_MUSE.

[57] SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models cs.CV | cs.LG

Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang

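TL;DR: This paper introduces SpatiaLQA, a benchmark for evaluating the spatial logical reasoning of vision-language models (VLMs). It contains 9,605 question-answer pairs derived from 241 real-world indoor scenes. Experiments show that current advanced models perform poorly on the task, so the authors propose recursive scene graph assisted reasoning, which uses visual foundation models to progressively decompose complex scenes and improves performance.

Details

Motivation: Although VLMs perform well on common visual question answering and logical reasoning, their spatial logical reasoning in complex real-world environments remains weak. This ability requires understanding both the spatial relationships among objects and the logical dependencies between task steps.

Result: Experiments on 41 mainstream VLMs show that even the most advanced models struggle on SpatiaLQA; the proposed recursive scene graph assisted reasoning method outperforms all previous methods.

Insight: The contributions are the definition of the spatial logical reasoning task with a corresponding benchmark, and the use of recursive scene graph decomposition to strengthen a model's understanding of and reasoning over complex scenes, pointing to a new direction for real-world VLM applications.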

Abstract: Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.

[58] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering cs.CV

Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng

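TL;DR: This paper proposes TextPecker, a plug-and-play, structural-anomaly-aware reinforcement learning strategy that targets structural anomalies in visual text rendering for text-to-image generation. By building a dataset with character-level structural anomaly annotations and developing a stroke-editing synthesis engine, the method improves both structural fidelity and semantic alignment across a range of text-to-image models.

Details

Motivation: Advanced text-to-image models still frequently render text with structural anomalies such as distortion, blurriness, and misalignment, and leading MLLMs and specialist OCR models largely fail to perceive them. This blocks both the evaluation of visual text rendering (VTR) and RL-based optimization, so even state-of-the-art generators struggle to produce structurally accurate text.

Result: Experiments show that TextPecker consistently improves diverse text-to-image models. Even on the well-optimized Qwen-Image, it yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, setting a new state of the art in high-fidelity VTR.

Insight: The contributions are a structural-anomaly-aware RL strategy that mitigates noisy reward signals, plus a character-level structural anomaly dataset and a stroke-editing synthesis engine that expand structural-error coverage. Together they provide a foundational optimization approach for reliable, structurally accurate visual text generation.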

Abstract: Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye

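TL;DR: This paper proposes LongVideo-R1, an actively navigating MLLM agent for low-cost long video understanding. A reasoning module uses high-level visual cues to infer the most informative video clip to process next, iteratively refines its focus, and stops exploring as soon as it has enough information, avoiding the redundancy of exhaustive search. The model is built on Qwen-3-8B and fine-tuned in two stages (SFT then RL) using hierarchical video captions extracted from CGBench and chain-of-thought trajectories generated by GPT-5.

Details

Motivation: To address the critical and underexplored challenge of long video understanding under a limited compute budget, with the aim of efficient video context navigation that avoids the redundancy of exhaustive search.

Result: Experiments on multiple long video benchmarks validate the approach, which achieves a superior trade-off between QA accuracy and efficiency.

Insight: The contribution is an active, reasoning-equipped MLLM agent built around a module that reasons over high-level visual cues to navigate selectively, along with a training paradigm that uses two-stage fine-tuning (SFT + RL) and a purpose-built reward function to optimize navigation efficiency. This offers a novel decision-making approach to long video understanding under resource constraints.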

Abstract: This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior trade-off between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1

[60] Computing a Characteristic Orientation for Rotation-Independent Image Analysis cs.CV

Cristian Valero-Abundio, Emilio Sansano-Sansano, Raúl Montoliu, Marina Martínez García

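TL;DR: This paper proposes General Intensity Direction (GID), a preprocessing method that improves the robustness of deep learning models to image rotation without modifying the network architecture. It estimates a global orientation for each image and aligns the image to a canonical reference frame, so that standard models process inputs more consistently across rotations. On rotated MNIST, GID preserves spatial structure while achieving higher accuracy than existing rotation-invariant architectures, and experiments on CIFAR-10 indicate that it remains effective under more complex conditions.

Details

Motivation: Handling geometric transformations, rotations in particular, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance, and existing remedies such as data augmentation or architectural modifications increase computational cost or limit applicability.

Result: On the rotated MNIST dataset, GID achieves higher accuracy than state-of-the-art rotation-invariant architectures; additional experiments on CIFAR-10 confirm that the method remains effective under more complex conditions.

Insight: The contribution is a preprocessing-based rotation alignment method that transforms the image directly while preserving spatial structure, stays compatible with convolutional networks, and requires no architectural changes. Through global orientation estimation and canonical alignment, it offers a lightweight and general way to improve rotation robustness.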

Abstract: Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset confirm that the method remains effective under more complex conditions.
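
The abstract does not say how GID estimates the global orientation, so the sketch below is only an illustration of the general "estimate a direction, then align to a canonical frame" idea. It assumes the direction is the angle from the image centre to the intensity-weighted centroid, and it restricts alignment to 90-degree rotations to stay dependency-free. All function names are hypothetical stand-ins, not the paper's method.

```python
import math


def intensity_direction(image):
    """Angle (radians) from the image centre to the intensity-weighted centroid.

    `image` is a list of rows of non-negative intensities. The y axis is
    flipped so that "up" in the image corresponds to a positive angle.
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sx = sy = 0.0
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            sx += v * (c - cx)
            sy += v * (cy - r)
    return math.atan2(sy, sx)


def rotate90_ccw(image):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    h, w = len(image), len(image[0])
    return [[image[r][w - 1 - c] for r in range(h)] for c in range(w)]


def align_to_canonical(image):
    """Pick the 90-degree rotation whose estimated direction is closest to 0.

    A real implementation would rotate by an arbitrary angle with
    interpolation; four candidates keep this sketch exact and simple.
    """
    candidates = [image]
    for _ in range(3):
        candidates.append(rotate90_ccw(candidates[-1]))
    return min(candidates, key=lambda im: abs(intensity_direction(im)))
```

Under these assumptions, an image and its 90-degree-rotated copy map to the same canonical image, so a downstream network sees a consistent input regardless of the original rotation.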

[61] See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis cs.CV | cs.AI

Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee

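TL;DR: This paper proposes ArtiAgent, a framework in which three agents (perception, synthesis, and curation) automatically synthesize pairs of real and artifact-injected images with rich artifact annotations, addressing the visual artifacts that compromise the realism of diffusion-generated images.

Details

Motivation: Existing methods depend on human-labeled artifact datasets, which are costly and hard to scale, so an automated way to reliably produce artifact-annotated data is needed to support artifact mitigation research.

Result: Using ArtiAgent, the authors synthesize 100K images with rich artifact annotations and demonstrate efficacy and versatility across diverse applications.

Insight: The contribution is an agent-driven data synthesis pipeline that injects artifacts through patch-wise embedding manipulation within a diffusion transformer and generates both local and global explanations, enabling scalable, automated artifact data generation.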

Abstract: Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.

[62] Are Multimodal Large Language Models Good Annotators for Image Tagging? cs.CV

Ming-Kun Xie, Jia-Hao Xiao, Zhiqiang Kou, Zhongnian Li, Gang Niu

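TL;DR: This paper examines the potential of multimodal large language models (MLLMs) as automatic annotators for image tagging. The analysis finds that MLLMs can cut annotation cost dramatically (to as low as one-thousandth of the human cost), but their annotation quality is about 50% to 80% of human performance. The paper therefore proposes TagLLM, which improves MLLM annotation quality through structured group-wise prompting for candidate generation and interactive semantic disambiguation, substantially narrowing the gap to human annotation.

Details

Motivation: Image tagging traditionally relies on costly human annotation to train multi-label classifiers. MLLMs could automate this process, but whether they can replace human annotators is unclear. The paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that lets MLLM-based annotation replace manual labeling.

Result: Experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations; in downstream training performance it closes about 60% to 80% of the difference.

Insight: The contribution is the TagLLM framework, which combines efficient structured group-wise prompting (to produce a compact, high-coverage candidate label set) with interactive semantic disambiguation (to calibrate the category concepts in the prompts and refine the candidate labels). This provides a systematic approach to low-cost, high-quality automated annotation with MLLMs.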

Abstract: Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50% to 80% of human performance, while achieving over 90% performance on downstream training tasks. Motivated by these findings, we propose TagLLM, a novel framework for image tagging, which aims to narrow the gap between MLLM-generated and human annotations. TagLLM comprises two components: candidate generation, which employs structured group-wise prompting to efficiently produce a compact candidate set that covers as many true labels as possible while reducing subsequent annotation workload; and label disambiguation, which interactively calibrates the semantic concept of categories in the prompts and effectively refines the candidate labels. Extensive experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60% to 80% of the difference.

[63] CrystaL: Spontaneous Emergence of Visual Latents in MLLMs cs.CV | cs.AI

Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie

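TL;DR: This paper proposes CrystaL, a framework that processes intact and corrupted images along two paths and explicitly aligns attention patterns and prediction distributions between them, so that visual latent representations emerge spontaneously in MLLMs and fine-grained visual understanding improves.

Details

Motivation: In existing latent Chain-of-Thought methods, heuristically predefined supervision signals give limited guidance for preserving critical visual information in intermediate latent states. Addressing this is needed to strengthen visual-semantic integration in multimodal large language models.

Result: On perception-intensive benchmarks, CrystaL consistently outperforms state-of-the-art baselines, with substantial gains in fine-grained visual understanding while maintaining robust reasoning.

Insight: The contribution is a dual-path alignment mechanism that crystallizes task-relevant visual semantics without auxiliary annotations or external modules, allowing visual latent representations to emerge effectively.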

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
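
The abstract states that attention patterns and prediction distributions are aligned across the intact and corrupted paths but does not name the losses. As a rough sketch, the example below assumes a KL divergence on the prediction distributions and a mean squared error on the attention maps; both choices, the weighting term `attn_weight`, and the function names are assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def alignment_loss(pred_intact, pred_corrupted,
                   attn_intact, attn_corrupted, attn_weight=1.0):
    """Combine a prediction-alignment term and an attention-alignment term.

    The loss is zero when both paths agree exactly and grows as the
    corrupted-image path drifts away from the intact-image path.
    """
    pred_term = kl_divergence(pred_intact, pred_corrupted)
    attn_term = mean_squared_error(attn_intact, attn_corrupted)
    return pred_term + attn_weight * attn_term
```

Minimising a loss of this shape pushes the corrupted-image path to recover the predictions and attention of the intact-image path, which is one plausible reading of how latent states are encouraged to retain visual information.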

[64] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models cs.CV | cs.AI

Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa

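TL;DR: This paper proposes MMHNet, a multimodal hierarchical network that addresses length generalization in video-to-audio generation. By combining a hierarchical method with non-causal Mamba modules, it supports audio generation longer than 5 minutes and outperforms prior work on long video-to-audio benchmarks.

Details

Motivation: To address the challenges of limited data and the mismatch between text descriptions and frame-level video information in multimodal alignment, and to examine whether models trained on short instances can generalize to longer video sequences at test time.

Result: The method achieves strong results on long video-to-audio benchmarks, surpassing prior methods and generating audio longer than 5 minutes, which shows that length generalization is achievable without training on long-duration data.

Insight: The contribution is the use of a hierarchical structure and non-causal Mamba modules to strengthen long-sequence modeling, achieving length generalization in video-to-audio generation and offering a scalable solution for multimodal generation tasks.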

Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present a multimodal hierarchical network, called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model's capability in generating more than 5 minutes of audio, while prior video-to-audio methods fall short in generating long durations.

[65] Cycle-Consistent Tuning for Layered Image Decomposition cs.CV

Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-O

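TL;DR: This paper proposes an in-context layered image decomposition framework that uses large diffusion foundation models to separate a logo from the object surface on which it appears. The method fine-tunes a pretrained diffusion model with lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models to enforce reconstruction consistency. It also uses a progressive self-improving process that iteratively augments the training set. Experiments show accurate, coherent decompositions that generalize to other decomposition tasks.

Details

Motivation: To address the difficulty of separating visual layers in real-world images, where layers interact in non-linear and globally coupled ways (for example shading, reflection, and perspective distortion), focusing on the challenging case of separating a logo from an object surface.

Result: Extensive experiments on logo-object decomposition show that the method achieves accurate and coherent decompositions and generalizes effectively to other decomposition types, suggesting its potential as a unified framework for layered image decomposition.

Insight: The contributions are: 1) an in-context fine-tuning framework for diffusion models; 2) a cycle-consistent tuning strategy whose bidirectional supervision improves robustness; 3) a progressive self-improving process that iteratively refines performance with high-quality model-generated examples. The approach offers supervision mechanisms and training strategies that may transfer to other settings with complex inter-layer interactions.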

Abstract: Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
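
The reconstruction-consistency idea can be sketched independently of any diffusion model. The example below assumes images are flat lists of intensities and uses toy additive `toy_decompose` and `toy_compose` functions as hypothetical stand-ins for the paper's jointly trained decomposition and composition models; only the shape of the cycle loss is the point.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(image, decompose, compose):
    """Penalise the gap between an image and its decompose-then-compose copy."""
    logo, surface = decompose(image)
    recomposed = compose(logo, surface)
    return mse(recomposed, image)


def toy_decompose(image, threshold=0.5):
    """Toy split: intensity up to the threshold is 'surface', the rest is 'logo'."""
    surface = [min(v, threshold) for v in image]
    logo = [max(v - threshold, 0.0) for v in image]
    return logo, surface


def toy_compose(logo, surface):
    """Toy composition: add the two layers back together."""
    return [l + s for l, s in zip(logo, surface)]


def lossy_decompose(image, threshold=0.5):
    """A deliberately bad decomposition that drops the logo layer entirely."""
    _, surface = toy_decompose(image, threshold)
    return [0.0] * len(image), surface
```

A consistent decomposition and composition pair drives this loss to roughly zero, while a decomposition that loses layer content is penalised; the same check can be run in the other direction (compose, then decompose) for bidirectional supervision.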

[66] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models cs.CV

Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu

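TL;DR: This paper proposes Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework intended to expose safety risks in image-to-video (I2V) generation models. VII disguises the malicious intent of unsafe text prompts as benign visual instructions in a safe reference image, inducing the model to generate harmful content.

Details

Motivation: I2V generation models show an ability to follow visual instructions: visual cues in the reference image can implicitly control video generation. This ability introduces an overlooked safety risk, since adversaries may inject malicious intent through the image modality. The paper aims to expose and exploit this risk.

Result: Extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, PixVerse-V5) show that VII achieves attack success rates of up to 83.5% while reducing refusal rates to near zero, significantly outperforming existing baselines.

Insight: The novelty lies in systematically revealing, for the first time, that I2V models can be jailbroken through visual instructions, and in proposing VII, a training-free and transferable attack framework. Its core idea is to distill malicious intent from the text modality and ground it in the image modality by coordinating two modules, Malicious Intent Reprogramming and Visual Instruction Grounding, which modify a safe image so that the model is induced to generate harmful content. This offers a new perspective on understanding and defending against safety vulnerabilities in multimodal generation models.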

Abstract: Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.

[67] From Perception to Action: An Interactive Benchmark for Vision Reasoning cs.CV

Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu

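TL;DR: This paper introduces the CHAIN benchmark, an interactive, physics-driven 3D testbed for evaluating whether models can understand and execute structured action sequences under physical constraints in dynamic environments. The benchmark shifts evaluation from passive perception to active problem solving, covering tasks such as mechanical puzzles and 3D stacking. A unified evaluation of state-of-the-art vision-language models and diffusion-based models finds that they still struggle to understand physical structure and causal constraints.

Details

Motivation: Existing VLM evaluations focus on structure-agnostic, single-turn tasks (e.g., VQA) and cannot assess a model's ability to reason about how geometry, contact, and support relations jointly constrain feasible actions in a dynamic environment, so a new benchmark is needed to fill this gap.

Result: A comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings shows that even top-performing models struggle to internalize physical structure and causal constraints, often fail to produce reliable long-horizon plans, and cannot robustly translate perceived structure into effective actions.

Insight: The contribution is an interactive evaluation paradigm (CHAIN) that moves from passive perception to active problem solving and emphasizes understanding of physical structure and causal constraints, providing a testbed closer to practical needs for evaluating and developing models for real-world applications such as embodied intelligence.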

Abstract: Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents’ ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

[68] MIP Candy: A Modular PyTorch Framework for Medical Image Processing cs.CV | cs.AI | cs.LG | cs.SEPDF

Tianhao Fu, Yucheng Chen

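TL;DR: MIP Candy is a modular, PyTorch-based framework for medical image processing that aims to address the limited integration flexibility and usability of existing frameworks. It provides a complete pipeline for data loading, training, inference, and evaluation; implementing a single method yields a fully functional workflow, while fine-grained control over each component is retained.

Details

Motivation: Medical image processing must handle high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either require substantial integration effort or impose rigid, monolithic pipelines that are hard to modify.

Result: The paper reports no quantitative results or benchmarks. It emphasizes the framework's built-in features: k-fold cross-validation, automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking, training state recovery, and validation score prediction via quotient regression.

Insight: The core innovation is LayerT, a deferred configuration mechanism that allows convolution, normalization, and activation modules to be substituted at runtime without subclassing. In addition, an extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer-predictor pattern and integrate with the core framework without modification.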

Abstract: Medical image processing demands specialized software that handles high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either provide low-level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch-based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, build_network, while retaining fine-grained control over every component. Central to the design is LayerT, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built-in k-fold cross-validation, dataset inspection with automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer–predictor pattern and integrate with the core framework without modification. MIPCandy is open-source under the Apache-2.0 license and requires Python 3.12 or later. Source code and documentation are available at https://github.com/ProjectNeura/MIPCandy.
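
The idea of deferred configuration can be illustrated with a small, framework-free sketch. The names below (DeferredLayer, relu, make_block) are hypothetical and are not MIPCandy's actual LayerT API; the sketch only assumes that a layer is described by a factory plus keyword arguments and is constructed later, so a different factory can be swapped in without subclassing the block that uses it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class DeferredLayer:
    """Hypothetical deferred layer spec (an illustration, not MIPCandy's LayerT)."""
    factory: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)

    def build(self, **runtime_kwargs: Any) -> Any:
        # Construction happens only here, so the spec can be replaced beforehand.
        return self.factory(**{**self.kwargs, **runtime_kwargs})


def relu(slope: float = 0.0) -> Callable[[float], float]:
    # Plain-Python activation factory standing in for a real activation module.
    return lambda x: x if x > 0 else slope * x


def make_block(activation: DeferredLayer) -> Callable[[float], float]:
    # The block receives a spec rather than a concrete layer, so callers can
    # substitute the activation without subclassing or editing the block.
    act = activation.build()
    return lambda x: act(2.0 * x)


plain = make_block(DeferredLayer(relu))
leaky = make_block(DeferredLayer(relu, {"slope": 0.1}))
print(plain(-1.0), leaky(-1.0))
```

In a PyTorch setting the factory would presumably be a module class such as a convolution or normalization layer; that mapping is an assumption here, not something the abstract spells out.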

[69] Not Just What’s There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning cs.CV | cs.MMPDF

Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu

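TL;DR: This paper proposes CLIPGlasses, a plug-and-play framework that improves CLIP's understanding of negated visual descriptions without fine-tuning. It uses a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which enters a modified similarity computation that penalizes alignment with negated semantics and thereby reduces false matches.

Details

Motivation: Existing vision-language models such as CLIP struggle with negation and often embed affirmative and negated descriptions similarly (for example, matching "no dog" with dog images). Existing methods address this by fine-tuning the text encoder, which risks overfitting.

Result: Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. The advantage is most pronounced under low-resource conditions, indicating stronger cross-domain robustness.

Insight: The novelty is a fine-tuning-free, dual-stage, plug-and-play framework that strengthens negation understanding by disentangling negated semantics and dynamically adjusting the similarity computation. This avoids overfitting, improves generalization, and suggests a new direction for improving semantic understanding in VLMs.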

Abstract: Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching “no dog” with dog images). Existing methods refine negation understanding via fine-tuning CLIP’s text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP’s ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
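
A toy sketch of a similarity score that penalizes alignment with a negated concept is given below. The abstract does not state the exact formula, so the subtraction used here, the clamp at zero, and the hand-set strength value are all assumptions; in CLIPGlasses the negated embedding and the repulsion strength come from the learned Lens and Frame modules, whereas here they are plain inputs.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def negation_aware_score(image, text, negated, strength):
    # Base CLIP-style similarity minus a penalty for matching the negated concept.
    # Only positive alignment with the negated concept is penalized (an assumption).
    return cosine(image, text) - strength * max(0.0, cosine(image, negated))


dog_image = [1.0, 0.0]
cat_image = [0.0, 1.0]
caption = [0.8, 0.6]  # toy embedding of a "no dog" caption that still lies close to "dog"
negated = [1.0, 0.0]  # toy embedding of the negated concept "dog"

for strength in (0.0, 1.0):
    dog = negation_aware_score(dog_image, caption, negated, strength)
    cat = negation_aware_score(cat_image, caption, negated, strength)
    print(strength, "dog image ranks first" if dog > cat else "cat image ranks first")
```

With the penalty switched off, the dog image wins, which is the false positive the abstract describes; with the penalty on, the ranking flips.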

[70] OmniOCR: Generalist OCR for Ethnic Minority Languages cs.CVPDF

Bonan Liu, Zeyu Zhang, Bingbing Meng, Han Wang, Hanshuo Zhang

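TL;DR: OmniOCR is a general-purpose OCR framework for ethnic minority languages. Using Dynamic Low-Rank Adaptation (Dynamic LoRA) and sparsity regularization, it adapts to multiple complex writing systems in low-resource or zero-shot settings and achieves state-of-the-art recognition accuracy and parameter efficiency on several ethnic minority language datasets.

Details

Motivation: OCR for ethnic minority languages faces generalization challenges caused by complex writing systems, scarce annotations, and diverse historical and modern forms, particularly when adapting in low-resource or zero-shot scenarios.

Result: On four datasets (TibetanMNIST, Shui, ancient Yi, and Dongba), OmniOCR outperforms zero-shot foundation models and standard post-training methods and reaches state-of-the-art accuracy, improving accuracy by 39%-66% over existing baseline models with superior parameter efficiency.

Insight: The novelty is Dynamic Low-Rank Adaptation (Dynamic LoRA), which allocates model capacity across layers and scripts, combined with sparsity regularization that prunes redundant updates. This gives compact, efficient adaptation with no extra inference cost and offers a scalable solution for multi-task adaptation to low-resource languages.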

Abstract: Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post training, achieving state-of-the-art accuracy with superior parameter efficiency, and compared with the state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: https://github.com/AIGeeksGroup/OmniOCR.
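
For readers unfamiliar with low-rank adaptation, the sketch below shows the standard LoRA idea of a frozen weight plus a low-rank update, together with an L1 penalty as one possible sparsity regularizer. It is a generic illustration, not OmniOCR's Dynamic LoRA: the abstract does not describe how capacity is allocated across layers and scripts, and the choice of an L1 penalty is an assumption.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, scale=1.0):
    # Effective weight W + scale * (B @ A); B is d_out x r and A is r x d_in.
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


def l1_penalty(b, a):
    # L1 norm of the low-rank factors; added to the loss, it pushes unused entries to zero.
    return sum(abs(x) for m in (b, a) for row in m for x in row)


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 base weight
b = [[0.5], [0.0]]                      # 2 x 1 factor, rank r = 1
a = [[0.0, 2.0, 0.0]]                   # 1 x 3 factor
merged = lora_weight(w, b, a)
print(merged, l1_penalty(b, a))
```

Because the update can be folded into the base weight once training is done, a merged model runs at the same cost as the original, which is how standard LoRA avoids extra inference cost.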

[71] OCR-Agent: Agentic OCR with Capability and Memory Reflection cs.CVPDF

Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He

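TL;DR: This paper proposes OCR-Agent, a new iterative self-correction framework that addresses the lack of effective self-correction in large vision-language models on complex visual understanding tasks, where models tend to fall into repeated, ineffective attempts. Through two core capabilities, Capability Reflection and Memory Reflection, the framework guides the model to diagnose errors, draw up a correction plan, review past attempts to avoid repetition, and refine its answer through rigorous re-reasoning.

Details

Motivation: Existing large vision-language models generally lack effective self-correction mechanisms during iterative optimization and have difficulty correcting cognitive biases on their own, so multi-turn revision often degenerates into repeated, ineffective attempts that fail to improve answer quality reliably.

Result: On the OCRBench v2 benchmark, OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 and +1.2 points on the English and Chinese subsets respectively, and achieves state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models.

Insight: The novelty is a structured, self-aware reflection framework (Capability Reflection and Memory Reflection) that significantly improves the reasoning robustness of vision-language models without additional training. Viewed objectively, systematizing reflection into actionable diagnosis, planning, and history-review steps offers a novel and effective remedy for stagnation in iterative refinement.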

Abstract: Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs’ reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.

[72] Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones cs.CV | cs.ROPDF

Rong Zou, Marco Cannici, Davide Scaramuzza

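TL;DR: This paper proposes a unified framework that uses asynchronous event streams together with motion-blurred images to reconstruct high-fidelity radiance fields from fast-flying drones. The method embeds event-image fusion into NeRF optimization and jointly refines event-based visual-inertial odometry priors using both event and image modalities, recovering sharp radiance fields and accurate camera trajectories without ground-truth supervision.

Details

Motivation: Fast-flying drones enable rapid inspection under battery limits, but high speeds cause severe motion blur in images and significant drift and noise in pose estimates. Neural Radiance Fields (NeRFs) are highly sensitive to such degradations, which makes dense 3D reconstruction very challenging.

Result: The method is validated on synthetic data and on real-world sequences captured by a fast-flying drone. Although the flights are highly dynamic, the RGB frames are severely degraded by motion blur, and the pose priors are unreliable, it reconstructs high-fidelity radiance fields and preserves fine scene details, with a performance gain of over 50% on real-world data compared with state-of-the-art methods.

Insight: The novelty is fusing asynchronous event streams with blurred images inside NeRF optimization and jointly refining event-based visual-inertial odometry priors, which yields sharp radiance field reconstruction without ground-truth supervision under fast, blurry conditions and improves robustness and reconstruction quality in dynamic scenes.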

Abstract: Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.

[73] BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting cs.CVPDF

Jiaxing Yu, Dongyang Ren, Hangyu Xu, Zhouyuxiao Yang, Yuanqi Li

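TL;DR: This paper proposes BrepGaussian, a new framework for reconstructing computer-aided design (CAD) boundary representation (B-rep) models from multi-view images. It uses a Gaussian Splatting renderer with learnable features and a two-stage learning framework that first captures geometry and edges and then refines patch features, yielding clean geometry and consistent instance representations.

Details

Motivation: Recovering B-rep representations from unstructured data such as multi-view images is a challenging and valuable task in computer vision and graphics. Existing deep learning methods depend on dense, clean point clouds and generalize poorly to novel shapes.

Result: Extensive experiments show that the method outperforms current state-of-the-art (SOTA) methods, although the abstract gives no specific benchmarks or quantitative results.

Insight: The novelty is combining Gaussian Splatting rendering with B-rep reconstruction. A two-stage framework that decouples geometry reconstruction from feature learning learns 3D parametric representations directly from 2D images, improving generalization from images to CAD models.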

Abstract: The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering a B-rep representation from unstructured data is a challenging and valuable task in computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods. We will release our code and datasets upon acceptance.

[74] UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics cs.CVPDF

Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota

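TL;DR: This paper presents UDVideoQA, a video question answering benchmark focused on multi-object spatio-temporal reasoning in urban traffic scenes. It contains 16 hours of real traffic video and 28K question-answer pairs for evaluating the visual grounding and causal reasoning abilities of video language models.

Details

Motivation: Existing video language models struggle to understand complex, multi-agent urban traffic dynamics, and there is no benchmark dataset that systematically evaluates spatio-temporal reasoning in real, unscripted urban environments.

Result: Benchmarking 10 SOTA video language models on UDVideoQA reveals a perception-reasoning gap; fine-tuning the smaller Qwen2.5-VL 7B model bridges this gap and reaches performance comparable to proprietary systems. On the video question generation task, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, but all models show limited linguistic diversity.

Insight: Contributions include: 1) an event-driven dynamic blur technique that protects privacy while preserving scene fidelity; 2) a hierarchical reasoning taxonomy ranging from basic understanding to counterfactual reasoning; 3) a complete suite comprising the dataset, annotation tools, and benchmarks, laying a foundation for research on robust, privacy-aware multimodal reasoning.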

Abstract: Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing that models that excel in abstract inference often fail at fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.

Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni

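TL;DR: The paper proposes the LUMEN training framework, a multimodal radiology model optimized for longitudinal chest X-ray (CXR) interpretation, which uses multi-image, multi-task instruction fine-tuning to improve prognostic and diagnostic performance.

Details

Motivation: Manual analysis of longitudinal imaging data by radiologists is time-consuming. The goal is a training framework that provides prognostic capabilities to support clinical decision-making.

Result: In experiments on the public MIMIC-CXR dataset and its associated Medical-Diff-VQA dataset, the method achieves significant improvements over baseline models on diagnostic visual question answering (VQA) tasks and shows promise for prognosis.

Insight: The novelty is an instruction-following dataset that incorporates longitudinal studies, together with a multi-image, multi-task instruction fine-tuning framework for longitudinal CXR. This strengthens the model's ability to interpret temporal change and offers a more accurate and clinically meaningful tool for radiological analysis.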

Abstract: Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making through the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. Manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce LUMEN, a novel training framework optimized for longitudinal CXR interpretation that leverages multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful radiological interpretation of longitudinal radiological imaging data.

[76] SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement cs.CVPDF

Bastien Gimbert

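TL;DR: SPRITETOMESH is a fully automatic pipeline that converts 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. By combining a learned segmentation network with contour-aware algorithmic vertex placement, it replaces the slow manual process of building animation meshes: a sprite is processed in under 3 seconds, a 300x-1200x speedup over manual creation.

Details

Motivation: Creating meshes for 2D skeletal animation has traditionally been a time-consuming manual process in which an artist spends 15-60 minutes per sprite carefully placing vertices. The paper aims to automate this process and substantially improve game development efficiency.

Result: The segmentation network (EfficientNet-B0 encoder with U-Net decoder), trained on over 100,000 sprite-mask pairs from 172 games, reaches an IoU of 0.87. The complete pipeline processes a sprite image in under 3 seconds, a 300x to 1200x speedup over manual creation.

Insight: The core novelty is the hybrid design: a learned model for the task with unambiguous labels (segmentation) and an algorithm for the task that calls for domain heuristics (vertex placement). This follows from a key negative finding: directly predicting vertex positions via neural-network heatmap regression is not viable, because vertex placement is inherently artistic and the same sprite admits many valid meshes, so the heatmap decoder fails to converge. Combining the robustness of learning with the controllability of algorithms is therefore an effective solution.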

Abstract: We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
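
The contour simplification step named in the abstract, Douglas-Peucker, can be sketched in a few lines. This is the textbook algorithm on an open polyline, using distance to the chord segment; the paper's adaptive arc subdivision and its handling of closed contours are not reproduced, and the tolerance value below is an arbitrary choice for illustration.

```python
import math


def point_segment_distance(p, a, b):
    # Distance from point p to the segment a-b.
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    # Keep the endpoints; recurse on the farthest point if it deviates more than epsilon.
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


contour = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
simplified = douglas_peucker(contour, epsilon=0.1)
print(simplified)
```

On this toy contour the nearly collinear point (1, 0.05) is dropped while the corner at (3, 2) is kept, which is the behavior that makes the method suitable for placing vertices along a sprite outline.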

[77] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models cs.CVPDF

Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao

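TL;DR: This paper proposes a new paradigm, quality-controllable retrieval, to address the semantic ambiguity and uncontrollable retrieval quality caused by short, underspecified user queries in text-to-image retrieval. The core idea is to use a generative language model as a query completion function that expands short queries into descriptive forms covering fine-grained visual attributes such as pose, scene, and aesthetics. A general framework conditions query completion on discretized quality levels derived from relevance and aesthetic scoring models, enabling semantically rich, quality-aware retrieval.

Details

Motivation: In real-world text-to-image retrieval, user queries are usually very short (only one or two words), which makes them semantically ambiguous, prone to collisions across different visual interpretations, and lacking explicit control over the quality of retrieved images.

Result: Extensive experiments show that the proposed method significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern vision-language models and the underspecified nature of short user queries.

Insight: The novelty is using a generative language model as a query completer combined with quality scoring models, yielding a general framework that is flexible (compatible with any pretrained VLM without modification), transparent (completed queries are interpretable by users), and controllable (retrieval results can be steered toward user-preferred quality levels).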

Abstract: Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility: it is compatible with any pretrained vision-language model (VLM) without modification; 2) transparency: enriched queries are explicitly interpretable by users; and 3) controllability: retrieval results can be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.

[78] XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence cs.CV | cs.AIPDF

Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser

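TL;DR: XMorph is an explainable and computationally efficient framework for fine-grained brain tumor classification. By combining an Information-Weighted Boundary Normalization mechanism with a dual-channel explainable AI module, it classifies glioma, meningioma, and pituitary tumors with 96.0% accuracy.

Details

Motivation: Clinical adoption of deep learning for brain tumor diagnosis is limited by poor model interpretability and computational constraints. In particular, conventional models act as "black boxes" and fail to quantify complex, irregular tumor boundaries.

Result: The framework reaches 96.0% accuracy on brain tumor classification, indicating that explainability and high performance can coexist in AI-based medical imaging systems.

Insight: Contributions include the Information-Weighted Boundary Normalization mechanism, which enriches the representation of tumor morphology, and a dual-channel explainable AI module that combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights.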

Abstract: Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque “black boxes” and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: https://github.com/ALSER-Lab/XMorph.

[79] Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision cs.CVPDF

Nicolás Gaggion, Maria J. Ledesma-Carbayo, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante

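TL;DR: Mask-HybridGNet is a graph-based medical image segmentation framework trained using only standard pixel-level segmentation masks, with no manually annotated corresponding landmarks. It aligns variable-length ground-truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision with edge-based regularization, and refines the result via differentiable rasterization. An important emergent property is that predicted landmark positions become consistently associated with specific anatomical locations across patients, giving implicit atlas learning that supports temporal tracking, cross-slice reconstruction, and morphological population analysis.

Details

Motivation: The paper addresses the main obstacle to clinical adoption of graph-based medical image segmentation: training data usually lack manually annotated landmarks that maintain point-to-point correspondence across patients.

Result: Experiments on multiple datasets spanning chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging show segmentation performance comparable to state-of-the-art pixel-based methods, while enforcing boundary connectivity through a fixed graph adjacency matrix ensures anatomical plausibility.

Insight: The main novelty is that pixel-level mask supervision alone suffices to train a graph model with stable anatomical correspondences, giving implicit atlas learning, and that correspondences can be extracted from existing segmentation models to build anatomical atlases. This opens a new route to structured models with topological integrity and built-in correspondences using the large stock of existing segmentation data.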

Abstract: Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
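
The Chamfer distance is what lets a fixed number of predicted landmarks be compared with a boundary of arbitrary length, since it needs no one-to-one pairing. The sketch below uses the common symmetric form with mean nearest-neighbour distances; the abstract does not say whether the paper uses plain or squared distances or how the two directions are weighted, so those details are assumptions.

```python
import math


def chamfer_distance(landmarks, boundary):
    # Symmetric Chamfer distance: mean nearest-neighbour distance in both directions.
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(landmarks, boundary) + one_way(boundary, landmarks)


landmarks = [(0.0, 0.0), (1.0, 0.0)]             # fixed-length prediction (2 points)
boundary = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]  # variable-length ground truth (3 points)
print(chamfer_distance(landmarks, boundary))
```

Here every landmark lies on the boundary, so the only contribution comes from the boundary midpoint, which is 0.5 away from its nearest landmark; that residual is the term that discourages landmarks from leaving stretches of the boundary uncovered.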

[80] Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning cs.CVPDF

Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui

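TL;DR: This paper proposes Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation from unposed multi-view images. Its core is the Predictive Spatial Field Modeling (PSFM) paradigm, in which the model learns to synthesize feature fields for arbitrary unseen views from a compact latent representation, thereby internalizing a holistic, coherent understanding of the underlying 3D scene. Integrating the pre-trained Spa3R encoder into an existing vision-language model (VLM) via a lightweight adapter yields Spa3-VLM, which grounds language reasoning in global spatial context.

Details

Motivation: Current vision-language models (VLMs) are weak at 3D spatial understanding and reasoning. Existing methods either rely on explicit 3D modalities or augment VLMs with partial, view-conditioned geometric priors, which limits scalability and forces the language model to implicitly reconstruct holistic 3D geometry from sparse cues. The paper argues that spatial intelligence can emerge naturally from 2D vision alone rather than being imposed through explicit spatial instruction tuning.

Result: On the challenging VSI-Bench benchmark, Spa3-VLM reaches a state-of-the-art accuracy of 58.6% on 3D visual question answering (VQA), significantly outperforming prior methods.

Insight: The core novelty is the Predictive Spatial Field Modeling (PSFM) self-supervised paradigm, which lets a model learn a unified, view-invariant spatial representation directly from unposed 2D images and thereby internalize a holistic understanding of the 3D scene. This offers a scalable path to spatial intelligence emerging from 2D visual data alone. Integrating the spatial encoder into an existing VLM through a lightweight adapter is an effective and efficient way to strengthen 3D visual reasoning.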

Abstract: While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space–a cornerstone of spatial intelligence–remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.

[81] Human Video Generation from a Single Image with 3D Pose and View Control cs.CVPDF

Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang

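TL;DR: This paper proposes HVG (Human Video Generation in 4D), a latent video diffusion model that generates high-quality, multi-view, spatiotemporally coherent human videos from a single image. With 3D pose and view control, it tackles difficulties such as inferring view-consistent, motion-dependent clothing wrinkles from a single image.

Details

Motivation: Existing diffusion-based methods for generating video from a single image face challenges in human video generation; in particular, inferring view-consistent, motion-dependent clothing wrinkles from a single image is a hard problem.

Result: Extensive experiments on image-to-video tasks show that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

Insight: Contributions include: 1) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints through a novel dual-dimensional bone map and introduces 3D information to resolve self-occlusions across views; 2) View and Temporal Alignment, which ensures multi-view consistency and alignment between the reference image and pose sequences for frame-to-frame stability; 3) Progressive Spatio-Temporal Sampling with temporal alignment, which maintains smooth transitions in long multi-view animations.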

Abstract: Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

cs.AI [Back]

[82] Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination cs.AI | cs.CL | cs.LGPDF

Rakshit Trivedi, Kartik Sharma, David C Parkes

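TL;DR: This paper proposes the MIMIC framework, which treats language as an internal representation of behavioral intent. It uses vision-language models as linguistic scaffolding to train a conditional variational autoencoder that generates inner speech from observations; a diffusion-based behavior cloning policy then selects actions conditioned on the current observation and the generated inner speech. The framework targets the difficulty imitation learning has in capturing the diversity and non-Markovian nature of human behavior, and its lack of inference-time behavior steering.

Details

Motivation: Current imitation learning methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and cannot steer behavior at inference time. Inspired by the theory that inner speech guides action selection before execution in human cognition, the paper aims to build agents that exhibit and respond to human-like behaviors and adapt to changing contexts, supporting effective human-AI coordination.

Result: Experiments on robotic manipulation tasks and human-AI collaboration games show that MIMIC significantly increases behavior diversity and fidelity to human demonstrations, while enabling nuanced behavioral steering without training on additional demonstrations.

Insight: The novelty is using language (inner speech) as a steerable representation of behavioral intent, and using vision-language models as linguistic scaffolding to train a conditional variational autoencoder. Combined with a diffusion policy, this enables fine-grained behavior steering at inference time through behavior-specific language.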

Abstract: Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations. We open source our code and provide pre-trained MIMIC agents and qualitative demos at: https://mimic-research.github.io.

[83] Counterfactual Simulation Training for Chain-of-Thought Faithfulness cs.AI | cs.CLPDF

Peter Hase, Christopher Potts

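TL;DR: This paper proposes Counterfactual Simulation Training (CST), which aims to improve chain-of-thought (CoT) faithfulness by rewarding CoTs that let a simulator accurately predict the model's outputs on counterfactual inputs. The method is applied in two settings: CoT monitoring with cue-based counterfactuals, and simulation over generic model-based counterfactuals. Experiments show that CST substantially improves monitor accuracy and simulatability.

Details

Motivation: Faithfulness problems in chain-of-thought (CoT) reasoning limit the reliable insight that can be gained from CoT analysis. The goal is to improve CoT faithfulness and interpretability through training.

Result: In experiments with models of up to 235B parameters, CST improves monitor accuracy on cue-based counterfactuals by 35 percentage points and simulatability on generic counterfactuals by 2 percentage points. CST outperforms prompting baselines, and rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone.

Insight: The novelty is the counterfactual simulation training framework, which optimizes CoT faithfulness through simulator predictions. Viewed objectively, the method offers a new route to CoT monitoring and to more generalizable reasoning; it also finds that larger models do not show more faithful CoT out of the box but benefit more from CST.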

Abstract: Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model’s outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training

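[84] Predicting Sentence Acceptability Judgments in Multimodal Contexts cs.AI | cs.CL

Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin

TL;DR: This study examines how visual images used as context affect sentence acceptability judgments by humans and their prediction by large language models (LLMs). Unlike textual context, visual images have almost no effect on human ratings, whereas LLMs display a compression effect similar to that seen in human judgments in document contexts. LLMs predict human judgments with high accuracy overall, but perform slightly better when visual context is removed, and models (e.g., Qwen) differ in how closely their judgment distributions resemble human patterns.

Details

Motivation: Explore how well deep neural networks, particularly LLMs, predict human sentence acceptability judgments in multimodal (visual) contexts, and compare this with text-only contexts to understand how visual information affects language processing.

Result: On the sentence acceptability prediction task, different LLMs reach high accuracy, but perform slightly better when visual context is removed; Qwen's judgment distribution is closest to human patterns. LLM-generated predictions correlate highly with their normalised log probabilities, but the correlation decreases when visual context is present.

Insight: The novelty is a systematic comparison of how visual context affects sentence acceptability judgments in LLMs and in humans. It reveals a gap between the internal representations of LLMs and their generated predictions when processing multimodal information, shows that models differ in how well they align with human patterns, and offers a new perspective on multimodal reasoning in LLMs.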

Abstract: Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a higher gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.

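[85] A Benchmark for Deep Information Synthesis cs.AI | cs.CL | cs.IR | cs.LG

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov

TL;DR: This paper introduces DEEPSYNTH, a new benchmark for evaluating how well LLM-based agents solve realistic, complex tasks that require synthesizing information from multiple sources and inferring insights. The benchmark contains 120 tasks across 7 domains covering 67 countries, built with a multi-stage data collection pipeline. Experiments show that current state-of-the-art LLMs and deep research agents perform poorly on it, highlighting their difficulties with hallucination and with reasoning over large information spaces.

Details

Motivation: Current evaluation benchmarks do not adequately assess the ability of LLM agents to solve real-world tasks that require synthesizing multi-source information and reasoning beyond simple fact retrieval, so a new benchmark is needed to fill this gap.

Result: The 11 state-of-the-art LLMs and deep research agents evaluated on DEEPSYNTH reach a maximum F1 score of only 8.97 (17.5 on the LLM-judge metric), showing that the benchmark is highly challenging and that current models fall far short of the desired level.

Insight: The contribution is DEEPSYNTH, a benchmark of realistic tasks focused on information synthesis and deep reasoning. Its rigorous multi-stage data collection pipeline ensures that tasks are realistic and verifiable, it gives future agent research a key evaluation target, and it exposes hallucination and complex reasoning as the main weaknesses of current models.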

Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.

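[86] PyVision-RL: Forging Open Agentic Vision Models via RL cs.AI | cs.CV

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng

TL;DR: This paper proposes PyVision-RL, a reinforcement learning framework for open-weight multimodal models that addresses interaction collapse in agentic model training. The framework stabilizes training and encourages multi-turn tool use through an oversampling-filtering-ranking rollout strategy and an accumulative tool reward, and is used to develop PyVision-Image and PyVision-Video for image and video understanding, respectively. PyVision-Video uses on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage, yielding strong performance and improved efficiency.

Details

Motivation: Address interaction collapse in reinforcement learning for multimodal agentic models, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior.

Result: Experiments show strong performance and improved efficiency on image and video understanding tasks, demonstrating that sustained interaction and on-demand visual processing are important for scalable multimodal agents.

Insight: The novelty lies in combining an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent interaction collapse, and in introducing on-demand context construction for video reasoning to reduce visual token usage. Together these provide a new training framework and inference strategy for building efficient, scalable multimodal agents.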

Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

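[87] NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning cs.AI | cs.CV

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

TL;DR: This paper proposes NoRD, a data-efficient vision-language-action model for autonomous driving that requires no reasoning annotations. By reducing data requirements and eliminating dense reasoning annotations, the model achieves performance competitive with existing models on the Waymo and NAVSIM benchmarks while using less than 60% of the data and 3x fewer tokens.

Details

Motivation: Current vision-language-action models face two major challenges: the high cost of large-scale dataset collection and of dense reasoning annotations. This paper aims to address both by developing a data-efficient, reasoning-free autonomous driving model.

Result: On the Waymo and NAVSIM benchmarks, NoRD reaches performance competitive with existing models while using less than 60% of the training data and no reasoning annotations.

Insight: The novelty lies in incorporating the Dr. GRPO algorithm to mitigate the difficulty bias in GRPO, which enables effective policy optimization on small, reasoning-free datasets. This offers a new approach to reducing the data and annotation costs of autonomous driving systems.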

Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
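
The difficulty bias described in the abstract can be illustrated with a minimal sketch. It assumes the commonly described formulation in which GRPO divides group-centered rewards by the group's standard deviation, while Dr. GRPO only centers them. The function names and the `eps` constant are illustrative and not taken from the NoRD code, and the sketch leaves out the other changes Dr. GRPO makes (such as how per-token losses are normalized).

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-centered rewards divided by the group's standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Group-centered rewards without the standard-deviation division."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Two hypothetical rollout groups: one with widely spread rewards, one tight.
high_var = [0.0, 1.0]
low_var = [0.4, 0.6]
for name, group in (("high-variance", high_var), ("low-variance", low_var)):
    print(name, grpo_advantages(group), dr_grpo_advantages(group))
```

With the division, both groups end up with advantages of roughly the same magnitude, so the larger reward gap in the high-variance group is scaled down. Without it, the advantage keeps the size of the raw gap.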

cs.RO

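[88] Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation cs.RO | cs.AI | cs.CV

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang

TL;DR: This paper proposes OptimusVLA, a dual-memory augmented vision-language-action model for robotic manipulation. By introducing a Global Prior Memory and a Local Consistency Memory, it addresses the low inference efficiency and poor robustness of existing hierarchical VLA models during action generation, and achieves significant gains in performance and efficiency on multiple simulation and real-world benchmarks.

Details

Motivation: Existing hierarchical vision-language-action models have two main bottlenecks in action generation. First, a pronounced gap between isotropic noise priors and target action distributions leads to low inference efficiency. Second, existing policies condition only on the current observation and ignore the constraints of the history sequence, so they lack awareness of task progress and temporal consistency, which leads to poor robustness.

Result: On three simulation benchmarks, OptimusVLA outperforms the baselines: it reaches a 98.6% average success rate on LIBERO, improves over the pi_0 baseline by 13.5% on CALVIN, and reaches a 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA performs best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4% respectively, while delivering a 2.9x inference speedup.

Insight: The core contribution is the dual-memory mechanism. The Global Prior Memory replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, shortening the generative path. The Local Consistency Memory dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. This offers a new approach to improving the efficiency and robustness of generative policies.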

Abstract: Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

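[89] UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models cs.RO | cs.CV

Manish Kumar Govind, Dominick Reilly, Pu Wang, Srijan Das

TL;DR: This paper proposes UniLACT, a transformer-based vision-language-action model that incorporates geometric structure through depth-aware latent action pretraining, giving robotic manipulation policies stronger spatial priors. The authors also propose UniLARN, a framework that learns unified latent action representations for the RGB and depth modalities and models their cross-modal interactions to produce pseudo-labels for pretraining.

Details

Motivation: Latent action representations learned only from RGB observations mainly encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise, contact-rich robotic manipulation.

Result: Extensive experiments in simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-only latent action baselines under in-domain and out-of-domain pretraining regimes, and on both seen and unseen manipulation tasks.

Insight: The novelty lies in integrating depth information into latent action learning. UniLARN, a unified framework based on inverse and forward dynamics objectives, explicitly models the interaction between the RGB and depth modalities to produce geometry-aware latent representations, giving VLA models stronger spatial priors.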

Abstract: Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.

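[90] Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining cs.RO | cs.CV

Keyu Zhou, Peisen Xu, Yahao Wu, Jiming Chen, Gaofeng Li

TL;DR: This paper proposes a strategy-supervised framework for autonomous laparoscopic camera control. Event-driven graph mining extracts reusable camera-handling strategy primitives from surgical videos, a fine-tuned vision-language model predicts the dominant strategy and motion commands online, and an IBVS-RCM controller executes them under safety constraints, yielding stable, safe, and interpretable autonomous camera control.

Details

Motivation: Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons.

Result: Ex vivo experiments on silicone phantoms and porcine tissues show that the system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.

Insight: The novelty lies in coupling high-level vision-language inference with low-level closed-loop control: strategy primitives mined from event graphs provide supervision for learning, and a VLM performs online strategy prediction and image-based motion command generation, improving both interpretability and performance.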

Abstract: Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.

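[91] BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model cs.RO | cs.CV

Haosheng Li, Weixin Mao, Zihan Lan, Hongwei Xiong, Hongan Wang

TL;DR: This paper proposes BFA++, a dynamic token pruning framework designed for multi-view vision-language-action models. It uses a hierarchical pruning strategy with intra-view and inter-view importance predictors to reduce the number of tokens while preserving essential visual cues, improving both computational efficiency and robotic manipulation success rates.

Details

Motivation: Existing acceleration techniques for vision-language models, such as token pruning, degrade performance when applied directly to VLA models, because they overlook the relationships between views and the dynamic, task-specific characteristics of robotic operation.

Result: Evaluations on the RoboTwin benchmark and real-world robotic tasks show that BFA++ consistently outperforms existing methods, improving the success rate by about 10% on both the π0 and RDT models with speedups of 1.8x and 1.5x, respectively.

Insight: The novelty is the hierarchical pruning strategy, which combines intra-view importance prediction (suppressing spatial noise) with inter-view importance prediction (reducing cross-view redundancy) to achieve context-sensitive, task-aware token pruning that is more effective than full visual processing.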

Abstract: Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the π0 and RDT models, achieving speedup of 1.8X and 1.5X, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.

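[92] Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics cs.RO | cs.CV | cs.LG

Abdulaziz Almuzairee, Henrik I. Christensen

TL;DR: Squint is a visual Soft Actor-Critic method that uses parallel simulation, a distributional critic, resolution squinting, layer normalization, and other techniques to train visual reinforcement learning policies in 15 minutes on a single GPU, and achieves sim-to-real transfer on the SO-101 Task Set.

Details

Motivation: In visual reinforcement learning, off-policy methods are sample-efficient but slow to train, while on-policy methods parallelize well but waste samples. High-dimensional image inputs further complicate training dynamics and add substantial storage and encoding overhead.

Result: On the SO-101 Task Set (eight manipulation tasks in ManiSkill3), most tasks converge in under 6 minutes. Training is faster in wall-clock time than prior visual off-policy and on-policy methods, and the policies transfer to a real SO-101 robot.

Insight: The contributions include combining parallel simulation with a distributional critic to speed up training, reducing the complexity of visual input through resolution squinting, and improving stability through layer normalization and a tuned update-to-data ratio, resulting in an efficient implementation for fast visual reinforcement learning.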

Abstract: Visual reinforcement learning is appealing for robotics but expensive – off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

cs.LG

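[93] Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training cs.LG | cs.AI | cs.CL

Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He

TL;DR: This paper proposes ACTOR-CURATOR, a scalable, fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). The framework learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement.

Details

Motivation: Post-training large foundation models with reinforcement learning typically relies on massive, heterogeneous datasets, which makes effective curriculum learning both critical and challenging. This paper addresses how to select training data automatically and efficiently to improve post-training.

Result: Across multiple challenging reasoning benchmarks, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines, with relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and a training speedup of up to 80%.

Insight: The novelty lies in formulating problem selection as a non-stationary stochastic bandit problem and deriving a principled loss function based on online stochastic mirror descent, which allows policy improvement to be optimized directly. This gives automated curriculum learning for LLM post-training a scalable, practical framework with theoretical guarantees.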

Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.

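[94] GATES: Self-Distillation under Privileged Context with Consensus Gating cs.LG | cs.CL

Alex Stein, Furong Huang, Tom Goldstein

TL;DR: This paper proposes GATES, a self-distillation method for document-grounded question answering when supervision is unreliable (no ground-truth labels, verifiable rewards, or external graders). It samples multiple document-grounded reasoning traces, uses tutor consensus as a reliability signal to gate learning, and distills full reasoning trajectories (not just final answers) to provide a dense, stable learning signal.

Details

Motivation: Handle settings where supervision is unreliable, specifically document-grounded question answering in which the tutor (which has access to the relevant document during training) may give incorrect answers, and distill knowledge effectively from the tutor to the student (which answers from the question alone at test time).

Result: Under asymmetric evaluation, held-out in-domain accuracy improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.

Insight: The contributions are using tutor consensus as an online reliability signal to gate learning, and distilling full reasoning trajectories to provide denser, more stable supervision, which helps improve model performance in unsupervised or weakly supervised settings.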

Abstract: We study self-distillation in settings where supervision is unreliable: there are no ground truth labels, verifiable rewards, or external graders to evaluate answers. We focus on document-grounded question answering with asymmetric context, where a single model serves as both tutor (with access to a relevant source document during training) and student (answering from the question alone at test time). Rather than assuming tutor correctness, we derive supervision online from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning. Conditioned on this reliability signal, we distill knowledge through full tutor reasoning trajectories (not just final answers), providing a dense and stable learning signal. Empirically, this consensus-gated trajectory distillation substantially improves transfer to the document-free student. Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
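
The consensus gate described in the abstract can be sketched as a majority vote over sampled tutor answers. This is a minimal illustration and not the authors' implementation: the function names, the 0.6 agreement threshold, and the choice to keep only the traces that agree with the majority answer are all assumptions.

```python
from collections import Counter


def consensus_gate(answers, threshold=0.6):
    """Return (majority answer, agreement rate, whether the gate opens)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return top_answer, agreement, agreement >= threshold


def select_traces(traces, threshold=0.6):
    """traces: list of (reasoning_text, final_answer) pairs sampled from the tutor.

    Hypothetical selection rule: if the gate opens, keep the full reasoning
    traces whose final answer matches the majority; otherwise keep nothing.
    """
    top_answer, _, gate_open = consensus_gate([a for _, a in traces], threshold)
    if not gate_open:
        return []
    return [t for t in traces if t[1] == top_answer]


samples = [("trace A", "4"), ("trace B", "4"), ("trace C", "5"), ("trace D", "4")]
print(select_traces(samples))
```

The selected items are whole reasoning traces and not bare answers, which mirrors the abstract's point that distillation uses full tutor trajectories once the reliability signal is satisfied.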

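[95] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards cs.LG | cs.CL

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray

TL;DR: This paper proposes SELAUR, a reinforcement learning framework for self-evolving LLM agents based on uncertainty-aware rewards. It integrates entropy-, least-confidence-, and margin-based token-level uncertainty estimates into the reward design to provide dense, confidence-aligned supervision, and uses a failure-aware reward reshaping mechanism that injects uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability.

Details

Motivation: Most reward designs for LLM agents overlook the model's intrinsic uncertainty, a signal that reflects model confidence, reveals where exploration is needed, and offers useful learning cues even in failed trajectories. This paper incorporates uncertainty directly into the reward design to improve learning.

Result: Experiments on two benchmarks, ALFWorld and WebShop, show that the method consistently improves success rates over strong baselines, and ablation studies further show how uncertainty signals enhance exploration and robustness.

Insight: The novelty lies in systematically integrating the intrinsic uncertainty of LLMs (a combined estimate from entropy, least-confidence, and margin metrics) into reinforcement learning reward design, and in proposing a failure-aware reward reshaping mechanism that gives LLM agents a new supervision signal for exploration and learning stability.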

Abstract: Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
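
The three token-level metrics named in the abstract have standard textbook definitions, which the following sketch computes from a next-token probability distribution. How SELAUR actually combines them is not stated in the abstract, so the equal-weight average, the entropy normalization, and the function name are assumptions made for illustration.

```python
import math


def token_uncertainty(probs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Entropy-, least-confidence-, and margin-based uncertainty for one token.

    probs: next-token probabilities that sum to 1 (at least two entries).
    weights: hypothetical mixing weights for the combined score.
    """
    if len(probs) < 2:
        raise ValueError("need at least two candidate tokens")
    # Shannon entropy, scaled to [0, 1] by the maximum entropy log(V).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    norm_entropy = entropy / math.log(len(probs))
    ranked = sorted(probs, reverse=True)
    least_confidence = 1.0 - ranked[0]           # low top-1 probability => uncertain
    margin = 1.0 - (ranked[0] - ranked[1])       # small top-1/top-2 gap => uncertain
    w_e, w_l, w_m = weights
    combined = w_e * norm_entropy + w_l * least_confidence + w_m * margin
    return {
        "entropy": norm_entropy,
        "least_confidence": least_confidence,
        "margin": margin,
        "combined": combined,
    }


print(token_uncertainty([0.7, 0.2, 0.05, 0.05]))
```

All three scores are zero for a one-hot distribution and grow as the distribution flattens, so any convex combination of them stays in the range 0 to 1.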

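[96] Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs cs.LG | cs.AI | cs.CL | cs.CV | cs.RO

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu

TL;DR: This paper proposes Reflective Test-Time Planning, which addresses the inability of embodied LLMs to learn from mistakes during deployment, a limitation that causes errors to repeat. The method integrates two modes of reflection: reflection-in-action, which generates and scores multiple candidate actions using internal reflections before execution, and reflection-on-action, which updates the internal reflection model and the action policy based on external feedback after execution. It also introduces retrospective reflection for long-horizon credit assignment.

Details

Motivation: Embodied LLMs give robots high-level task reasoning, but they lack the ability to reflect and cannot accumulate experience from mistakes, which turns deployment into a sequence of independent trials in which mistakes repeat.

Result: Experiments on the newly designed Long-Horizon Household benchmark and the MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, and ablation studies validate the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

Insight: The core contribution is formalizing the notion of the human reflective practitioner and systematically integrating reflection mechanisms (in-action, on-action, and retrospective) into test-time planning for embodied LLMs, enabling continual learning from trial and error and addressing credit assignment in long-horizon tasks.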

Abstract: Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

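[97] Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation cs.LG | cs.CV

Thorbjørn Mosekjær Iversen, Zebin Duan, Frederik Hagelskjær

TL;DR: This paper proposes Wilson Score Kernel Density Classification, a novel kernel-based method for estimating confidence bounds in binary classification. Its core is the Wilson Score Kernel Density Estimator, which estimates confidence bounds for binomial experiments with conditionally varying success probabilities. The method is evaluated on selective classification across four datasets, illustrating its use as a classification head for any feature extractor (including vision foundation models), and it performs similarly to Gaussian Process Classification at lower computational complexity.

Details

Motivation: Applying binary classifiers in critical operations requires reliable confidence bounds so that system performance can be ensured up to a given statistical significance, which in turn enables automation of critical inspection tasks.

Result: On selective classification across four datasets, the proposed method performs similarly to Gaussian Process Classification at lower computational complexity.

Insight: The method combines the Wilson score with kernel density estimation to give an efficient confidence-bound estimator for binomial classification problems with conditionally varying probabilities, and it can be attached as a general-purpose classification head to a wide range of feature extractors.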

Abstract: The performance and ease of use of deep learning-based binary classifiers have improved significantly in recent years. This has opened up the potential for automating critical inspection tasks, which have traditionally only been trusted to be done manually. However, the application of binary classifiers in critical operations depends on the estimation of reliable confidence bounds such that system performance can be ensured up to a given statistical significance. We present Wilson Score Kernel Density Classification, which is a novel kernel-based method for estimating confidence bounds in binary classification. The core of our method is the Wilson Score Kernel Density Estimator, which is a function estimator for estimating confidence bounds in Binomial experiments with conditionally varying success probabilities. Our method is evaluated in the context of selective classification on four different datasets, illustrating its use as a classification head of any feature extractor, including vision foundation models. Our proposed method shows similar performance to Gaussian Process Classification, but at a lower computational complexity.
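
For readers unfamiliar with the building block named in the title, the sketch below computes the classical Wilson score interval for a plain binomial experiment with a single success probability. It is background only: the paper's kernel density estimator for conditionally varying probabilities is not reproduced here, and the function name and the default `z = 1.96` (roughly a 95% two-sided interval) are illustrative choices.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Classical Wilson score interval for a binomial proportion.

    successes: number of observed successes
    n: number of trials (must be positive)
    z: standard normal quantile for the desired confidence level
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


print(wilson_interval(9, 10))
print(wilson_interval(0, 10))
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside the range 0 to 1 and remains informative when all trials succeed or all fail, which matters when a classifier is rarely wrong.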

cs.IR

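[98] Generative Pseudo-Labeling for Pre-Ranking with LLMs cs.IR | cs.CL

Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang

TL;DR: This paper proposes Generative Pseudo-Labeling (GPL), a framework that addresses the train-serving discrepancy in the pre-ranking stage of industrial recommendation systems. It uses large language models to generate unbiased, content-aware pseudo-labels for unexposed items, aligning the training distribution with the online serving space offline. In a large-scale production system it improves click-through rate, recommendation diversity, and long-tail item discovery.

Details

Motivation: Pre-ranking models are trained only on exposed interactions, yet at serving time they must score all recalled candidates, including unexposed items. This causes severe sample selection bias and degraded generalization, especially for long-tail content.

Result: Deployed in a large-scale production system, GPL improves click-through rate by 3.07% while significantly enhancing recommendation diversity and long-tail item discovery.

Insight: The core idea is to use an LLM to generate unbiased, content-aware pseudo-labels that supervise training on unexposed items. User-specific interest anchors are generated offline and matched with candidates in a frozen semantic space, which avoids online latency while mitigating the train-serving discrepancy and improving model performance.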

Abstract: Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discrepancy: pre-ranking models are trained only on exposed interactions, yet must score all recalled candidates – including unexposed items – during online serving. This mismatch not only induces severe sample selection bias but also degrades generalization, especially for long-tail content. Existing debiasing approaches typically rely on heuristics (e.g., negative sampling) or distillation from biased rankers, which either mislabel plausible unexposed items as negatives or propagate exposure bias into pseudo-labels. In this work, we propose Generative Pseudo-Labeling (GPL), a framework that leverages large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space. By offline generating user-specific interest anchors and matching them with candidates in a frozen semantic space, GPL provides high-quality supervision without adding online latency. Deployed in a large-scale production system, GPL improves click-through rate by 3.07%, while significantly enhancing recommendation diversity and long-tail item discovery.

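[99] Multi-Vector Index Compression in Any Modality cs.IR | cs.CL | cs.CV

Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz

TL;DR: This paper studies efficient multi-vector retrieval in any modality, addressing the fact that the computation and storage costs of late interaction grow linearly with document length. It introduces four query-agnostic index compression methods, among which the novel attention-guided clustering (AGC) selects the most semantically salient regions of a document as cluster centroids and performs well on text, visual-document, and video retrieval tasks.

Details

Motivation: Late interaction has become a dominant paradigm for information retrieval over text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. This paper explores query-agnostic methods for compressing multi-vector document representations under a constant vector budget.

Result: Evaluations on text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0) retrieval tasks show that attention-guided clustering (AGC) consistently outperforms the other parameterized compression methods (sequence resizing and memory tokens), offers more flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared with a full, uncompressed index.

Insight: The main contribution is attention-guided clustering (AGC), a novel index compression method that uses an attention mechanism to identify the semantically salient regions of a document as cluster centroids and to weight token aggregation. Viewed objectively, the method ties compression to semantic importance and provides a flexible, efficient scheme for compressing cross-modal representations that balances retrieval performance against storage and compute cost.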

Abstract: We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
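
The general idea of attention-guided clustering, as described in the abstract, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the released implementation: saliency scores are taken as given inputs, the most salient tokens serve as centroids, each token joins its most similar centroid by cosine similarity, and clusters are merged by a saliency-weighted mean. All names are hypothetical.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attention_guided_clustering(vectors, saliency, budget):
    """Compress token vectors to `budget` vectors.

    vectors: list of token embeddings (lists of floats)
    saliency: one non-negative attention-derived score per token (assumed given)
    budget: number of vectors to keep in the index
    """
    if len(saliency) != len(vectors):
        raise ValueError("one saliency score per vector is required")
    if not 0 < budget <= len(vectors):
        raise ValueError("budget must be between 1 and the number of vectors")
    # The most salient tokens act as cluster centroids.
    order = sorted(range(len(vectors)), key=lambda i: saliency[i], reverse=True)
    centroid_ids = order[:budget]
    dim = len(vectors[0])
    sums = [[0.0] * dim for _ in range(budget)]
    totals = [0.0] * budget
    for vec, weight in zip(vectors, saliency):
        # Assign the token to its most similar centroid.
        nearest = max(range(budget), key=lambda c: _cosine(vec, vectors[centroid_ids[c]]))
        totals[nearest] += weight
        for d in range(dim):
            sums[nearest][d] += weight * vec[d]
    compressed = []
    for c in range(budget):
        if totals[c] > 0:
            compressed.append([s / totals[c] for s in sums[c]])
        else:
            # Empty or zero-weight cluster: fall back to the centroid itself.
            compressed.append(list(vectors[centroid_ids[c]]))
    return compressed


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.4, 0.1, 0.4, 0.1]
print(attention_guided_clustering(tokens, scores, budget=2))
```

Because the budget is fixed per document, the index size no longer grows with document length, which is the constant vector budget setting the abstract describes.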

eess.IV [Back]

[100] Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures eess.IV | cs.AI | cs.CL | cs.CV | cs.LGPDF

Yubin Ge, Yongsong Huang, Xiaofeng Liu

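TL;DR: This paper proposes MS-RSuper, a new report-supervised learning method for brain lesion segmentation in multimodal MRI. It parses the global quantitative and modality-specific qualitative findings in radiology reports and introduces a unified, one-sided, uncertainty-aware loss that makes better use of incomplete and uncertain report information and incorporates anatomical priors.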

Details

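Motivation: The work addresses the limitations of conventional report-supervised learning for multimodal MRI segmentation of brain tumors. Reports usually describe only the largest lesion and give qualitative or uncertain cues; classical losses (e.g., sum volume consistency) can over-constrain the model or hallucinate unreported findings when reports are incomplete; and existing methods cannot exploit hierarchical report findings (such as modality-specific descriptions) or the priors of different lesion types in a merged dataset.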

Result: On 1238 report-labeled BraTS-MET/MEN scans, MS-RSuper substantially outperforms both a sparsely supervised baseline and a naive report-supervised method.

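Insight: The contributions are: 1) explicitly parsing global quantitative and modality-specific qualitative report findings and designing a unified, uncertainty-aware loss framework; 2) aligning modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures through existence and absence losses; 3) enforcing one-sided lower bounds for partial quantitative cues (e.g., largest lesion size, minimum count); and 4) adding extra-axial versus intra-axial anatomical priors to respect cohort differences. The core idea is to handle uncertain and missing report information more flexibly and robustly by weighting and scaling the penalties.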

Abstract: Report-supervised (RSuper) learning seeks to alleviate the need for dense tumor voxel labels with constraints derived from radiology reports (e.g., volumes, counts, sizes, locations). MRI studies of brain tumors, however, often involve multi-parametric scans and substructures. Here, fine-grained modality/parameter-wise reports are usually provided along with global findings and are correlated with different substructures. Moreover, the reports often describe only the largest lesion and provide qualitative or uncertain cues ("mild," "possible"). Classical RSuper losses (e.g., sum volume consistency) can over-constrain or hallucinate unreported findings under such incompleteness, and are unable to utilize these hierarchical findings or exploit the priors of varied lesion types in a merged dataset. We explicitly parse the global quantitative and modality-wise qualitative findings and introduce a unified, one-sided, uncertainty-aware formulation (MS-RSuper) that: (i) aligns modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures using existence and absence losses; (ii) enforces one-sided lower bounds for partial quantitative cues (e.g., largest lesion size, minimal multiplicity); and (iii) adds extra- vs. intra-axial anatomical priors to respect cohort differences. Certainty tokens scale penalties; missing cues are down-weighted. On 1238 report-labeled BraTS-MET/MEN scans, our MS-RSuper largely outperforms both a sparsely-supervised baseline and a naive RSuper method.
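
The difference between a one-sided lower bound and a classical two-sided consistency term can be illustrated with a short sketch. The abstract does not give the exact loss, so the hinge form, the function names, and the `certainty` scaling in [0, 1] below are assumptions chosen only to show the behavior: a report that mentions just the largest lesion sets a floor on the predicted volume rather than a target.

```python
import numpy as np


def predicted_volume(probs, voxel_volume_mm3=1.0):
    """Soft lesion volume: sum of voxel probabilities times the voxel size."""
    return float(probs.sum() * voxel_volume_mm3)


def one_sided_lower_bound_loss(predicted, reported_min, certainty=1.0):
    """Hinge-style penalty (assumed form): zero once the prediction reaches
    the reported minimum, linear in the shortfall otherwise. `certainty`
    scales the penalty down for hedged report language."""
    shortfall = max(0.0, reported_min - predicted)
    return certainty * shortfall


def two_sided_loss(predicted, reported):
    """Classical consistency term: penalises any deviation, including
    predicting more lesion volume than the report happened to mention."""
    return abs(predicted - reported)
```

With a reported minimum of 10 and a prediction of 12, the one-sided term is 0 while the two-sided term is 2, so the extra predicted volume (for example, smaller lesions the report left out) is not punished.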

cs.SD [Back]

[101] Graph Modelling Analysis of Speech-Gesture Interaction for Aphasia Severity Estimation cs.SD | cs.CL | eess.ASPDF

Navya Martin Kollapally, Christa Akers, Renjith Nelson Joseph

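TL;DR: This paper proposes a graph neural network (GNN) framework for automatically estimating aphasia severity from the interaction of spontaneous speech and gesture. Each participant's discourse is modeled as a directed multimodal graph whose nodes represent lexical items and gestures and whose edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE then learns participant-level embeddings that integrate information from local neighborhoods and the overall graph structure.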

Details

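Motivation: Aphasia severity is currently assessed mainly with the Western Aphasia Battery-Revised (WAB-R), which measures isolated linguistic skills, whereas assessing discourse production reflects everyday language ability more holistically. Most existing automated speech analysis relies on isolated linguistic or acoustic features and overlooks how the structured interaction between speech and gesture encodes aphasia severity.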

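Result: The results suggest that aphasia severity is not encoded in isolated lexical distributions but emerges from structured interactions between speech and gesture. The proposed architecture offers a reliable automated approach to aphasia assessment, with possible uses in bedside screening and telehealth-based monitoring.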

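Insight: The novelty lies in modeling the interaction of discourse and gesture as a multimodal graph and using a graph neural network for end-to-end severity estimation. The work emphasizes cross-modal interaction structure over isolated features and suggests a new direction for automated clinical assessment based on multimodal behavior.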

Abstract: Aphasia is an acquired language disorder caused by injury to the regions of the brain that are responsible for language. Aphasia may impair the use and comprehension of written and spoken language. The Western Aphasia Battery-Revised (WAB-R) is an assessment tool administered by speech-language pathologists (SLPs) to evaluate the aphasia type and severity. Because the WAB-R measures isolated linguistic skills, there has been growing interest in the assessment of discourse production as a more holistic representation of everyday language abilities. Recent advancements in speech analysis focus on automated estimation of aphasia severity from spontaneous speech, relying mostly on isolated linguistic or acoustical features. In this work, we propose a graph neural network-based framework for estimating aphasia severity. We represent each participant’s discourse as a directed multi-modal graph, where nodes represent lexical items and gestures and edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE is employed to learn participant-level embeddings, thus integrating information from immediate neighbors and overall graph structure. Our results suggest that aphasia severity is not encoded in isolated lexical distribution, but rather emerges from structured interactions between speech and gesture. The proposed architecture offers a reliable automated aphasia assessment, with possible uses in bedside screening and telehealth-based monitoring.
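
The graph construction described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's pipeline: the input format (an ordered list of `(label, kind)` pairs), the decision to skip gesture-gesture transitions (the abstract lists only three edge types), and the weight matrices `w_self` and `w_neigh` are all hypothetical. The second function is one GraphSAGE-style mean-aggregation step written in plain NumPy; using two separate weight matrices and summing is equivalent to concatenating self and neighbor features before a single linear map.

```python
from collections import Counter

import numpy as np


def build_discourse_graph(tokens):
    """Build a directed multimodal graph from an ordered discourse.

    tokens: list of (label, kind) pairs, kind in {"word", "gesture"}.
    Returns (nodes, edges): `nodes` maps each distinct token to an index,
    `edges` counts (src, dst, edge_type) transitions between neighbours.
    """
    nodes = {}
    for tok in tokens:
        nodes.setdefault(tok, len(nodes))
    edges = Counter()
    for prev, cur in zip(tokens, tokens[1:]):
        edge_type = f"{prev[1]}-{cur[1]}"
        if edge_type == "gesture-gesture":
            continue  # assumption: only the three listed edge types are kept
        edges[(nodes[prev], nodes[cur], edge_type)] += 1
    return nodes, edges


def sage_mean_layer(features, edges, w_self, w_neigh):
    """One mean-aggregator step:
    h_v = ReLU(x_v @ w_self + mean_{u -> v}(x_u) @ w_neigh),
    where the mean is weighted by transition counts."""
    n = features.shape[0]
    agg = np.zeros(features.shape, dtype=float)
    in_weight = np.zeros(n)
    for (src, dst, _), count in edges.items():
        agg[dst] += count * features[src]
        in_weight[dst] += count
    has_in = in_weight > 0
    agg[has_in] /= in_weight[has_in][:, None]
    return np.maximum(0.0, features @ w_self + agg @ w_neigh)
```

A participant-level embedding could then be obtained by pooling the node outputs, for example `sage_mean_layer(...).mean(axis=0)`, before a regression head predicts severity; the pooling choice here is likewise an assumption.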

Table of Contents

cs.CL [Back]

[1] Talking to Yourself: Defying Forgetting in Large Language Models cs.CL | cs.AIPDF

[2] Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings cs.CL | cs.LGPDF

[3] No One Size Fits All: QueryBandits for Hallucination Mitigation cs.CL | cs.AI | cs.LGPDF

[4] Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning cs.CL | cs.LGPDF

[5] CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models cs.CLPDF

[6] CAMEL: Confidence-Gated Reflection for Reward Modeling cs.CL | cs.AIPDF

[7] ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition cs.CLPDF

[8] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing cs.CL | cs.AI | cs.LGPDF

[9] Overton Pluralistic Reinforcement Learning for Large Language Models cs.CLPDF

[10] The Art of Efficient Reasoning: Data, Reward, and Optimization cs.CL | cs.AIPDF

[11] Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models cs.CLPDF

[12] Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving cs.CLPDF

[13] Evaluating Proactive Risk Awareness of Large Language Models cs.CL | cs.CYPDF

[14] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning cs.CL | cs.IRPDF

cs.CV [Back]

[15] VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography cs.CV | cs.LGPDF

[16] OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport cs.CVPDF

[17] De-rendering, Reasoning, and Repairing Charts with Vision-Language Models cs.CVPDF

[18] N4MC: Neural 4D Mesh Compression cs.CVPDF

[19] Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking cs.CV | cs.AI | cs.LGPDF

[20] Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques cs.CV | cs.ROPDF

[21] 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism cs.CVPDF

[22] Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field cs.CVPDF

[23] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation cs.CVPDF

[24] gQIR: Generative Quanta Image Reconstruction cs.CVPDF

[25] MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation cs.CV | cs.CLPDF

[26] SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens cs.CVPDF

[27] Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation cs.CVPDF

[28] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration cs.CV | cs.AIPDF

[29] Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models cs.CVPDF

[30] Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models cs.CVPDF

[31] How Do Inpainting Artifacts Propagate to Language? cs.CV | cs.AIPDF

[32] A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata cs.CVPDF

[33] Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing cs.CVPDF

[34] WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos cs.CVPDF

[35] AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents cs.CVPDF

[36] An interactive enhanced driving dataset for autonomous driving cs.CVPDF

[37] Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion cs.CVPDF

[38] PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models cs.CVPDF

[39] VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos cs.CVPDF

[40] From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection cs.CVPDF

[41] Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video cs.CV | cs.AI | cs.HC | cs.LGPDF

[42] AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist? cs.CVPDF

[43] GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation cs.CVPDF

[44] VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation cs.CV | cs.AI | cs.CLPDF

[45] RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation cs.CVPDF

[46] MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision cs.CVPDF

[47] NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image cs.CVPDF

[48] Communication-Inspired Tokenization for Structured Image Representations cs.CV | cs.AI | cs.LGPDF

[49] Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization cs.CVPDF

[50] Real-time Motion Segmentation with Event-based Normal Flow cs.CV | cs.ROPDF

[51] SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking cs.CVPDF

[52] VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving cs.CVPDF

[53] GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection cs.CVPDF

[54] On the Explainability of Vision-Language Models in Art History cs.CVPDF

[55] DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation cs.CVPDF

[56] MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification cs.CVPDF

[57] SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models cs.CV | cs.LGPDF

[58] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering cs.CVPDF

[59] LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CVPDF

[60] Computing a Characteristic Orientation for Rotation-Independent Image Analysis cs.CVPDF

[61] See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis cs.CV | cs.AIPDF

[62] Are Multimodal Large Language Models Good Annotators for Image Tagging? cs.CVPDF

[63] CrystaL: Spontaneous Emergence of Visual Latents in MLLMs cs.CV | cs.AIPDF

[64] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models cs.CV | cs.AIPDF

[65] Cycle-Consistent Tuning for Layered Image Decomposition cs.CVPDF

[66] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models cs.CVPDF

[67] From Perception to Action: An Interactive Benchmark for Vision Reasoning cs.CVPDF

[68] MIP Candy: A Modular PyTorch Framework for Medical Image Processing cs.CV | cs.AI | cs.LG | cs.SEPDF

[69] Not Just What’s There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning cs.CV | cs.MMPDF

[70] OmniOCR: Generalist OCR for Ethnic Minority Languages cs.CVPDF

[71] OCR-Agent: Agentic OCR with Capability and Memory Reflection cs.CVPDF

[72] Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones cs.CV | cs.ROPDF

[73] BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting cs.CVPDF

[74] UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics cs.CVPDF

[75] LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis cs.CV | cs.LGPDF

[76] SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement cs.CVPDF

[77] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models cs.CVPDF