Table of Contents

cs.CL

[1] Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation cs.CL | cs.SD | eess.AS

Ligong Lei, Wenwen Lu, Xudong Pang, Zaokere Kadeer, Aishan Wumaier

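TL;DR: This paper proposes a multimodal consistency-guided, reference-free data selection method for ASR accent adaptation. The method applies target-aware preselection, generates multiple pseudo-transcriptions via perturbation-based decoding, scores them with two reference-free signals (speech-text alignment and predicted word error rate), and finally uses a percentile-based selection rule to retain reliable pseudo-labels for fine-tuning.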

Details

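Motivation: ASR systems degrade on accented speech. Conventional text-based pseudo-label selection (e.g., perplexity filtering) can favor hypotheses that are fluent but acoustically mismatched, which amplifies errors during fine-tuning. This motivates a multimodal consistency-guided, reference-free data selection method.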

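Result: In the in-domain setting, selecting about 1.5k utterances from a 30k pool reaches 10.91% WER, close to the 10.45% WER obtained with 30k supervised labels. In the cross-domain setting, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent baselines.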

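Insight: The contributions are reference-free data selection that combines multimodal consistency signals (speech-text alignment and predicted WER), and a target-aware preselection step that improves query relevance and reduces computation. Viewed objectively, the method mitigates acoustic mismatch in pseudo-label selection and improves the efficiency and performance of ASR accent adaptation.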

Abstract: Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech–text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning while discarding noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k pool achieves 10.91% WER, close to 10.45% obtained using 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.
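The percentile-based selection rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Hypothesis` fields, the tie-breaking rule, the nearest-rank percentile, and the default thresholds are all assumptions, and the two scores are taken as given (the alignment model and WER predictor are not shown).

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    utt_id: str
    text: str
    align: float     # speech-text alignment score, higher is better (assumed)
    pred_wer: float  # predicted WER, lower is better (assumed)

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * q // 100))  # ceil(n * q / 100), at least 1
    return ordered[int(rank) - 1]

def select(hyps, align_q=50, wer_q=50):
    # Keep one hypothesis per utterance: lowest predicted WER, ties broken by alignment.
    best = {}
    for h in hyps:
        cur = best.get(h.utt_id)
        if cur is None or (h.pred_wer, -h.align) < (cur.pred_wer, -cur.align):
            best[h.utt_id] = h
    cands = list(best.values())
    # Thresholds are percentiles of the pool itself, so no reference labels are needed.
    align_min = percentile([h.align for h in cands], align_q)
    wer_max = percentile([h.pred_wer for h in cands], wer_q)
    return [h for h in cands if h.align >= align_min and h.pred_wer <= wer_max]
```

Because both thresholds are relative to the candidate pool, the rule adapts to pools of different quality; stricter percentiles keep fewer, cleaner pseudo-labels.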


[2] Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety cs.CL | cs.AI | cs.HC

Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile

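TL;DR: This study targets Swahili, a low-resource language, and uses machine learning models such as Support Vector Machines, Logistic Regression, and Decision Trees, together with techniques such as SMOTE for data imbalance, to detect obfuscated abusive language online, with the aim of improving children's online safety.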

Details

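Motivation: The growth of digital technology has increased the risk of cyberbullying and online abuse, especially among children. Swahili is widely spoken in Africa but poorly resourced, so detecting obfuscated abusive language in it poses distinct challenges. This study aims to fill that gap.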

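Result: Model performance is evaluated with precision, recall, and F1 score. The models perform well on high-dimensional text data, but the small, imbalanced dataset limits the generalizability of the results.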

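Insight: The novelty lies in applying machine learning to obfuscated abuse detection in low-resource Swahili and using SMOTE to handle data imbalance. Takeaways include the attention to safety in low-resource languages, the combination of parameter tuning with data augmentation techniques, and the future direction of improving detection robustness through transfer learning and multimodal data.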

Abstract: The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset’s small size and imbalance limit our findings’ generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.
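The core step of SMOTE is interpolation between a minority-class sample and one of its nearest minority-class neighbours. The sketch below illustrates that step in plain Python; it is not the study's pipeline. The function name, the dense list-of-floats feature vectors (in a text setting these would typically be TF-IDF or similar features), and the parameters `k` and `seed` are assumptions for illustration.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

Synthetic points always lie on a segment between two real minority samples, which is why oversampling should be applied to the training split only, never to held-out evaluation data.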


[3] Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens cs.CL

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen

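TL;DR: This paper proposes a new way to quantify the reasoning effort of large language models: it identifies 'deep-thinking tokens' (tokens whose internal predictions are significantly revised in deeper model layers) as a measure of reasoning quality. The deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) shows a robust positive correlation with answer accuracy and outperforms baselines based on generation length or confidence. Building on this, the authors propose Think@n, a strategy that prioritizes samples with high deep-thinking ratios, which matches or exceeds self-consistency while significantly reducing inference cost.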

Details

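Motivation: Existing work shows that simply increasing generation length (e.g., long chain-of-thought) does not always correlate positively with reasoning quality and can instead lead to 'overthinking' and degraded performance. A more reliable metric is therefore needed to quantify the effort a model actually expends during inference, so that its reasoning ability can be better understood and used.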

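Result: Experiments on four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, GPQA-diamond) and several reasoning models (GPT-OSS, DeepSeek-R1, Qwen3) show a robust and consistent positive correlation between deep-thinking ratio and accuracy, substantially outperforming length-based and confidence-based baselines. Think@n matches or exceeds standard self-consistency while significantly reducing inference cost by rejecting unpromising generations early, based on short prefixes.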

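Insight: The core contribution is the concept of 'deep-thinking tokens' as a reliable proxy for a model's internal reasoning process, rather than relying on output length. This offers a new view of internal reasoning dynamics and yields an efficient test-time scaling strategy (Think@n) with a better trade-off between performance and efficiency.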

Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal “overthinking,” leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens – tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
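A rough sketch of how a deep-thinking ratio and a Think@n-style vote could be computed. The abstract does not give the exact criterion, so this uses a stand-in: each token comes with hypothetical per-layer top-1 predictions (for example from projecting intermediate hidden states to the vocabulary), and a token counts as deep-thinking if its prediction only settles in the last quarter of the layers. The function names, `depth_frac`, and `keep` are assumptions, and the prefix-based early rejection is not modeled.

```python
from collections import Counter

def settle_layer(layer_preds):
    """Index of the first layer from which the prediction stays equal
    to the final-layer prediction."""
    final = layer_preds[-1]
    idx = len(layer_preds) - 1
    while idx > 0 and layer_preds[idx - 1] == final:
        idx -= 1
    return idx

def deep_thinking_ratio(token_layer_preds, depth_frac=0.75):
    """Fraction of tokens whose prediction settles only in deep layers."""
    n_layers = len(token_layer_preds[0])
    threshold = depth_frac * (n_layers - 1)
    deep = sum(settle_layer(p) >= threshold for p in token_layer_preds)
    return deep / len(token_layer_preds)

def think_at_n(samples, keep=0.5):
    """samples: list of (answer, ratio). Majority vote among the
    highest-ratio fraction of samples."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep))]
    return Counter(answer for answer, _ in kept).most_common(1)[0][0]
```

In the usage below, plain majority voting over all six samples would tie, while voting over the high-ratio half does not.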


[4] Small Reward Models via Backward Inference cs.CL

Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi

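TL;DR: This paper proposes FLIP (FLipped Inference for Prompt reconstruction), a reward modeling method that builds the reward signal through backward inference (inferring the most plausible instruction from a given response), without reference responses or explicit rubrics. Evaluations across four domains with 13 small language models show that FLIP significantly outperforms LLM-as-a-Judge baselines and improves downstream tasks such as parallel sampling and GRPO training.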

Details

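Motivation: The dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while other approaches require reference responses or explicit rubrics, which limits flexibility and broader accessibility. This paper addresses these limitations with a reference-free and rubric-free reward modeling method.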

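Result: Across four domains and 13 small language models, FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. In extrinsic evaluations under test-time scaling via parallel sampling, and under GRPO training, FLIP substantially improves downstream performance; it is particularly effective for longer outputs and robust to common forms of reward hacking.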

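Insight: FLIP recasts reward modeling as a backward inference problem: it infers the instruction that corresponds to a response and derives the reward from it, explicitly exploiting the validation-generation gap. This enables reliable reward modeling at smaller scales, removes the flexibility and accessibility limits of conventional methods, and opens a path for small models in non-verifiable domains.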

Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.
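The backward-inference reward can be sketched as below. This is an illustration under stated assumptions, not the code from the FLIP repository: `infer_instruction` is a hypothetical callable standing in for the small language model that reconstructs an instruction from a response, and token-level F1 is an assumed similarity measure (the abstract only says that similarity between the inferred and original instructions is the reward).

```python
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (case-insensitive, whitespace tokens)."""
    ta, tb = a.lower().split(), b.lower().split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)

def flip_reward(instruction, response, infer_instruction):
    """Reward = similarity between the instruction inferred from the
    response and the instruction that was actually given."""
    inferred = infer_instruction(response)
    return token_f1(inferred, instruction)
```

A response that stays on task lets the model recover something close to the original instruction and so scores high; an off-topic response leads to a different inferred instruction and a low reward.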


[5] RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction cs.CL | cs.CV

Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong

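TL;DR: This paper proposes RMPL, a Relation-aware Multi-task Progressive Learning framework for multimedia event extraction under low-resource conditions. Through stage-wise training, the framework integrates heterogeneous supervision from unimodal event extraction and multimedia relation extraction: it first learns shared event-centric representations across modalities, then fine-tunes for event mention identification and argument role extraction on mixed textual and visual data.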

Details

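Motivation: Multimedia event extraction requires identifying events and their arguments in documents that contain both text and images, and grounding event semantics across modalities. Progress is limited by the lack of annotated training data: M2E2 is the only benchmark and provides annotations only for evaluation, which makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting of vision-language models; they do not explicitly learn structured event representations and produce weak argument grounding in multimodal settings.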

Result: Experiments on the M2E2 benchmark with multiple vision-language models show that RMPL yields consistent improvements across modality settings.

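Insight: The novelty is a stage-wise multi-task progressive learning framework that integrates heterogeneous supervision (unimodal event extraction and multimedia relation extraction) to explicitly learn structured event-centric representations, improving cross-modal argument grounding under low-resource conditions. Viewed objectively, unified-schema learning followed by progressive fine-tuning makes effective use of limited supervision and strengthens the model's grasp of multimodal event structure.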

Abstract: Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision–Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.


[6] Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind cs.CL

Minyuan Ruan, Ziyue Wang, Kaiming Liu, Yunghwei Lai, Peng Li

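TL;DR: This paper formalizes Theory of Mind (ToM) for large language models (LLMs) as a mechanism for detecting and resolving epistemic divergence between users and agents, and introduces a benchmark, \benchname, to evaluate models in practice. The study reveals that current leading models are significantly limited in identifying the underlying cognitive gaps that impede task success. To close this gap, the authors further curate a trajectory-based ToM dataset and train a model on it with reinforcement learning, which yields consistent improvement in reasoning about user mental states and better downstream task performance.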

Details

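Motivation: Current LLMs struggle to understand and respond to users' true needs when intentions and instructions are imprecisely conveyed, producing an epistemic divergence between subjective user beliefs and the true environment state. Existing ToM evaluations for LLMs focus mainly on isolated belief inference and overlook its functional utility in real-world interaction.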

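Result: Results across 11 leading models reveal a significant limitation in identifying the underlying cognitive gaps that impede task success. The model trained with reinforcement learning on the curated trajectory dataset shows consistent improvement in reasoning about user mental states, along with improved downstream performance.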

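Insight: The core contribution is to redefine ToM as an interaction-level mechanism for detecting and resolving epistemic divergence in user-agent interaction, rather than a standalone reasoning skill. This is realized through a new evaluation benchmark and a dataset that links belief tracking with task-state inference, underscoring the practical value of ToM.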

Abstract: Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user beliefs and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation in identifying underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.


[7] Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach cs.CL

Amir Hossein Mohammadi, Ali Moeinian, Zahra Razavizade, Afsaneh Fatemi, Reza Ramezani

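TL;DR: This paper presents a large-scale empirical study of prompt engineering for retrieval-augmented generation with small language models (SLMs). Evaluating 24 prompt templates on the HotpotQA dataset, it finds that well-designed templates substantially improve SLM performance on complex multi-hop question answering, with reported gains of up to 83%.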

Details

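Motivation: RAG research has focused mainly on large language models; prompt engineering for small language models on complex multi-hop question answering remains a gap, and in particular the effect of prompt template design on performance is under-explored.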

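Result: Tests on HotpotQA with two SLMs, Qwen2.5-3B Instruct and Gemma3-4B-It, show that the best prompt templates improve on the Standard RAG prompt by up to 6%, with reported performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It.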

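Insight: The novelty lies in the first systematic evaluation of prompt engineering for SLMs in RAG: the paper proposes 14 novel hybrid prompt variants and gives actionable prompt-design recommendations for SLM-RAG systems in resource-constrained settings, showing how strongly prompt templates affect SLM performance.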

Abstract: Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, practically for deployment in resource-constrained environments.


[8] HLE-Verified: A Systematic Verification and Structured Revision of Humanity’s Last Exam cs.CL

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li

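TL;DR: This paper introduces HLE-Verified, a systematically verified and structurally revised version of the Humanity’s Last Exam (HLE) benchmark. A two-stage validation-and-repair workflow cleans noisy items in the original HLE, yielding 641 verified items, 1,170 revised-and-certified items, and a set of 689 items documented as uncertain. Evaluating seven state-of-the-art language models on HLE-Verified shows an average absolute accuracy gain of 7 to 10 percentage points, with especially large gains (30 to 40 percentage points) on items whose problem statement or reference answer was erroneous.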

Details

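Motivation: The HLE benchmark contains a substantial number of noisy items, which bias evaluation results and distort cross-model comparisons, undermining faithful measurement of frontier LLM capabilities.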

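Result: Seven state-of-the-art language models evaluated on HLE-Verified show an average absolute accuracy gain of 7 to 10 percentage points over the original HLE; on items whose problem statement or reference answer was erroneous, the gain reaches 30 to 40 percentage points.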

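Insight: The paper proposes a systematic benchmark verification and revision protocol (expert review, model-based cross-checks, dual independent expert repairs, model-assisted auditing, and final adjudication), and introduces a fine-grained error taxonomy and an 'uncertain set' with explicit uncertainty sources and expertise tags, giving a reusable framework for transparent and reproducible benchmark construction. The analysis also reveals a strong association between model confidence and the presence of errors in the problem or answer, supporting the validity of the revisions.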

Abstract: Humanity’s Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified


[9] Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer’s Disease Assessment and Diagnosis cs.CL

Tongze Zhang, Jun-En Ding, Melik Ozolcer, Fang-Ming Hung, Albert Chih-Chieh Yang

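TL;DR: This paper proposes a chain-of-thought (CoT) reasoning framework based on large language models (LLMs) for clinical assessment and diagnosis of Alzheimer’s disease (AD). Using patients' electronic health records (EHRs), the method generates explicit diagnostic reasoning paths to support AD stage prediction, with the aim of improving both diagnosis of complex etiologies and the interpretability of the prediction process.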

Details

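Motivation: Traditional AD diagnosis relies on medical imaging and clinical assessment, which is time-consuming and resource-intensive. Although LLMs are increasingly applied in medicine, their use in AD assessment remains limited, since AD involves complex multifactorial etiologies that are difficult to observe directly through imaging. The paper uses LLM CoT reasoning to extract diagnostic rationale from EHRs to address these challenges.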

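Result: The proposed CoT-based diagnostic framework significantly improves stability and diagnostic performance across multiple Clinical Dementia Rating (CDR) grading tasks, with up to a 15% improvement in F1 score over the zero-shot baseline.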

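Insight: The novelty is bringing CoT reasoning to EHR analysis for AD: generating structured diagnostic reasoning paths improves model performance (especially on intrinsically complex factors) and makes the diagnostic process more interpretable, suggesting a new direction for text-based medical diagnosis.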

Abstract: Alzheimer’s disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have been increasingly applied to the medical field using electronic health records (EHRs), yet their application in Alzheimer’s disease assessment remains limited, particularly given that AD involves complex multifactorial etiologies that are difficult to observe directly through imaging modalities. In this work, we propose leveraging LLMs to perform Chain-of-Thought (CoT) reasoning on patients’ clinical EHRs. Unlike direct fine-tuning of LLMs on EHR data for AD classification, our approach utilizes LLM-generated CoT reasoning paths to provide the model with explicit diagnostic rationale for AD assessment, followed by structured CoT-based predictions. This pipeline not only enhances the model’s ability to diagnose intrinsically complex factors but also improves the interpretability of the prediction process across different stages of AD progression. Experimental results demonstrate that the proposed CoT-based diagnostic framework significantly enhances stability and diagnostic performance across multiple CDR grading tasks, achieving up to a 15% improvement in F1 score compared to the zero-shot baseline method.


[10] The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective cs.CL | cs.AI

Ali Zahedzadeh, Behnam Bahrak

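TL;DR: This paper studies the trade-off between sufficiency and conciseness in LLM self-explanation from an information bottleneck perspective. It proposes an evaluation pipeline and shows on the ARC Challenge dataset that compressed explanations can preserve accuracy while being substantially shorter, with experiments in both English and Persian.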

Details

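Motivation: LLM self-explanations (such as chain-of-thought) improve multi-step question answering but are often verbose and costly to generate; the paper asks how much explanation is actually necessary.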

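Result: On the ARC Challenge dataset, evaluated with multiple language models, more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression degrades performance.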

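Insight: The novelty is treating explanations as compressed representations under the information bottleneck principle, retaining only essential information. Viewed objectively, the approach offers a reusable evaluation paradigm for making LLM explanations more efficient, and it extends to resource-limited languages.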

Abstract: Large Language Models increasingly rely on self-explanations, such as chain-of-thought reasoning, to improve performance on multi-step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.
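For orientation, the standard information bottleneck objective can be written as below. The mapping onto this paper is an assumed reading, not notation taken from it: $X$ is the uncompressed input (question plus full explanation), $T$ is the shortened explanation, and $Y$ is the correct answer.

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y), \qquad \beta > 0
```

Under this reading, a small $I(X;T)$ corresponds to conciseness (the explanation keeps little of the original), a large $I(T;Y)$ corresponds to sufficiency (the explanation still determines the answer), and $\beta$ sets the exchange rate between the two.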


[11] GRRM: Group Relative Reward Modeling for Machine Translation cs.CL

Sen Yang, Shanbo Cheng, Lu Xu, Jianbing Zhang, Shujian Huang

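TL;DR: This paper proposes the Group Relative Reward Model (GRRM) for machine translation, to improve intra-group ranking in the GRPO framework. Under the Group Quality Metric (GQM) paradigm, GRRM processes the whole candidate group jointly and uses comparative analysis to distinguish fine-grained linguistic differences. Experiments show that GRRM achieves competitive ranking accuracy among the baselines; integrated into the GRPO training loop to optimize the translation policy, it improves translation quality and unlocks reasoning capabilities comparable to state-of-the-art reasoning models.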

Details

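Motivation: In open-ended tasks such as machine translation, the effectiveness of GRPO depends on accurate intra-group ranking, but standard Scalar Quality Metrics (SQM) evaluate candidates in isolation and lack the comparative context needed to distinguish fine-grained linguistic differences, so a new kind of metric is needed.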

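Result: GRRM achieves competitive ranking accuracy among all baselines. After integration into the GRPO training loop, the framework improves general translation quality and also unlocks reasoning capabilities comparable to state-of-the-art reasoning models.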

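Insight: The novelty is the Group Quality Metric (GQM) paradigm and the GRRM model, which compare candidates jointly within a group and thereby address the weakness of independent scorers in fine-grained ranking. Viewed objectively, bringing comparative context into reward modeling may offer a new direction for post-training optimization in open-ended tasks.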

Abstract: While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/GRRM.
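To see why intra-group ranking matters, recall that GRPO standardizes rewards within each sampled group, so only the relative order and spacing of candidates drive the update. The sketch below shows that normalization plus a hypothetical conversion from a group-level ranking to rewards. The linear rank-to-reward mapping and both function names are assumptions (the abstract does not specify GRRM's output format), and population standard deviation is used here although implementations differ on that choice.

```python
from statistics import mean, pstdev

def rewards_from_ranking(ranking):
    """ranking: candidate indices ordered best to worst, as a group-level
    judge might return. Maps ranks linearly onto [0, 1] (an assumption)."""
    n = len(ranking)
    rewards = [0.0] * n
    for pos, idx in enumerate(ranking):
        rewards[idx] = 1.0 - pos / (n - 1) if n > 1 else 1.0
    return rewards

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If a scalar metric assigns near-identical scores to all candidates, the standardized advantages are dominated by noise; a judge that sees the whole group can still produce a meaningful ordering.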


[12] Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness cs.CL

Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini

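TL;DR: This study examines how context affects fact-checking with large language models (LLMs). Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models across parameter sizes (7B, 32B, 70B) and model families (Llama-3.1, Qwen2.5, Qwen3), it evaluates the models' parametric factual knowledge and the effect of evidence placement across context lengths. LLMs show non-trivial parametric factual knowledge, but verification accuracy generally declines as context length grows. Evidence position in the prompt also matters: accuracy is higher when relevant evidence appears near the beginning or end of the prompt and lower when it sits in the middle. These results underline the importance of prompt structure in retrieval-augmented fact-checking systems.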

Details

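Motivation: LLMs show strong reasoning across tasks, yet their performance on extended contexts remains inconsistent. Prior work has focused on mid-context degradation in question answering; this study examines the effect of context on LLM-based fact-checking specifically.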

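Result: Experiments on HOVER, FEVEROUS, and ClimateFEVER show that verification accuracy generally declines as context length increases. Evidence placement significantly affects accuracy: it is higher when evidence is near the beginning or end of the prompt and lower when evidence is in the middle. These findings agree with earlier results on question answering.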

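Insight: The paper moves the study of context length and evidence placement from general question answering to fact-checking, and systematically evaluates models of different sizes and families. Viewed objectively, the core insight is that prompt structure, especially where evidence is placed, is a key factor in the performance of retrieval-augmented fact-checking systems, which gives empirical grounding for prompt-design strategies.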

Abstract: Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models across different parameter sizes (7B, 32B and 70B parameters) and model families (Llama-3.1, Qwen2.5 and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Similarly to what has been shown in previous works, in-context evidence placement plays a critical role, with accuracy being consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.
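An evidence-placement experiment of this kind can be set up with a small prompt builder that inserts the relevant passage at a chosen position among distractor passages. The sketch below is an illustration only: the function name, the prompt wording, and the SUPPORTED/REFUTED label set are assumptions, not the study's actual templates.

```python
def build_prompt(claim, evidence, distractors, position):
    """Place the evidence passage at the start, middle, or end of the
    context and return the full verification prompt."""
    docs = list(distractors)
    slot = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(slot, evidence)
    # Number the passages so position in the context is explicit.
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nClaim: {claim}\nAnswer SUPPORTED or REFUTED."
```

Holding the claim and distractors fixed while varying only `position` isolates the placement effect; varying the number of distractors isolates the context-length effect.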


[13] LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation cs.CL

Jizheng Chen, Weiming Zhang, Xinyi Dai, Weiwen Liu, Kounianhua Du

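TL;DR: This paper proposes LogitsCoder, a framework for code generation that strengthens chain-of-thought reasoning through lightweight logit-level control. It uses Logits Preference Decoding to steer token selection, then Logits Rank Based Path Selection and Thoughts Aggregation to iteratively generate and refine reasoning paths, addressing reasoning chains that are too shallow (underthinking) or too verbose (overthinking) and yielding more efficient, higher-quality reasoning for code generation.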

Details

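Motivation: Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but face two challenges: underthinking, where shallow reasoning chains fail to capture problem complexity, and overthinking, where overly verbose chains cause inefficiency and higher compute cost. The paper aims to address both and improve reasoning efficiency and quality in code generation.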

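Result: Extensive experiments show that LogitsCoder produces more efficient and higher-quality reasoning chains and outperforms baseline methods on code generation.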

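Insight: The novelty is a logit-preference decoding mechanism (Logits Preference Decoding) combined with path selection and aggregation, which iteratively refine reasoning chains through lightweight logit-level control and balance depth against efficiency, giving an efficient path-exploration strategy for chain-of-thought search in code generation.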

Abstract: Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full complexity of problems; and (2) overthinking, where overly verbose reasoning leads to inefficiency and increased computational costs. To address these issues, we propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank Based Path Selection and Thoughts Aggregation. This results in coherent and effective reasoning chains that balance depth and efficiency. Extensive experiments demonstrate that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.


[14] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric cs.CL

Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao

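TL;DR: This paper presents the Open Rubric System (OpenRS), an LLM-as-a-Judge framework grounded in inspectable principles, to address the brittleness and reward hacking that the information bottleneck of scalar reward models causes in open-ended alignment. The system instantiates adaptive rubrics on the fly through Pairwise Adaptive Meta-Rubrics, combines them with verifiable rubrics, and provides reward supervision in pairwise reinforcement learning training.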

Details

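Motivation: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that leads to brittleness and reward hacking in open-ended alignment. The authors argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should be an explicit reasoning process under inspectable principles, not a learned function internalized in a judge model.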

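Result: The paper instantiates OpenRS as reward supervision in pairwise reinforcement learning training, but the abstract reports no specific benchmarks, quantitative results, or comparisons with existing methods (for example, whether it reaches state of the art).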

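Insight: The novelty is shifting reward modeling from a learned internal function to an external reasoning process based on explicit, editable principles (a constitution-like meta-rubric), and combining dynamically adapted rubrics (PAMR) with verifiable rubrics (PVR) to improve discriminability and robustness in open-ended settings, while a two-level meta-rubric refinement pipeline keeps the principles consistent and editable.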

Abstract: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric – a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced – and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
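The idea of aggregating criterion-level preferences outside the judge, with verifiable checks acting as guardrails, can be sketched as follows. This is a simplified stand-in rather than the OpenRS procedure: unweighted majority voting over criteria, the criterion names, and the function signature are assumptions, and in the paper the meta-rubric governs how criteria are weighted and enforced.

```python
def pairwise_preference(criterion_votes, hard_fail_a=False, hard_fail_b=False):
    """criterion_votes: dict mapping criterion name to 'A', 'B', or 'tie',
    as a judge might return for each rubric criterion.
    Returns +1 if response A is preferred, -1 if B, 0 if undecided."""
    # Guardrail: a response that alone violates a verifiable hard
    # constraint loses outright, regardless of the other criteria.
    if hard_fail_a != hard_fail_b:
        return -1 if hard_fail_a else 1
    wins_a = sum(v == "A" for v in criterion_votes.values())
    wins_b = sum(v == "B" for v in criterion_votes.values())
    return (wins_a > wins_b) - (wins_a < wins_b)
```

Keeping the per-criterion votes visible, rather than folding them into one score inside the judge, is what makes the resulting preference inspectable and editable.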


[15] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework cs.CL | cs.AI

Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek

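TL;DR: Building on the LLaVA-Next framework, this paper creates Polish vision-language models by automatically translating and filtering existing multimodal datasets and complementing them with synthetic Polish data for OCR and culturally specific tasks. The approach improves on LLaVA-1.6-Vicuna-13B by 9.5% on a Polish-adapted MMBench and produces captions judged more linguistically correct in generative evaluation.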

Details

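Motivation: Most vision-language models are trained on English data, which limits their performance in other languages and cultural contexts and hinders multimodal systems that reflect diverse linguistic and cultural realities. Efficient adaptation methods are therefore needed for low-resource languages such as Polish.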

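Result: On the Polish-adapted MMBench benchmark, the model improves on LLaVA-1.6-Vicuna-13B by 9.5%; in generative tasks, human evaluation rates its captions higher in linguistic correctness.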

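Insight: The key finding is that large-scale automated translation with lightweight filtering, together with synthetic OCR and culturally specific data, can efficiently bootstrap high-quality multimodal models for low-resource languages, showing that an automated pipeline with minimal manual intervention is effective.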

Abstract: Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.


[16] GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler cs.CL | cs.LG

Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

TL;DR: 本文提出了一种名为高斯思维采样器(GTS)的方法,用于改进潜在推理模型的推理时扩展(ITS)。GTS通过可学习的条件高斯分布对连续推理状态进行结构化扰动,替代了传统的启发式随机扰动(如dropout或固定高斯噪声),从而在有限的采样预算下实现更高效的轨迹探索。

Details

Motivation: 现有潜在推理模型的推理时扩展通常采用启发式扰动引入随机性以增加轨迹多样性,但这种探索行为未被显式建模,且在有限采样预算下效率低下。作者观察到更强的扰动并不总能产生更有效的候选轨迹,因为无引导的噪声可能会破坏内部决策结构而非引导它。

Result: 在GSM8K基准测试上,使用两种潜在推理架构的实验表明,GTS比启发式基线方法实现了更可靠的推理时扩展性能。

Insight: 论文的核心创新在于将潜在思维探索建模为从可学习密度中进行条件采样,并实例化为GTS模块。这提供了一种结构化且可优化的探索机制,其通过GRPO风格策略优化进行训练(同时保持骨干网络冻结),表明改进潜在推理的ITS需要结构化探索而非简单增加随机性。

Abstract: Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.
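
The idea of a context-dependent perturbation distribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear head `predict_gaussian`, its weights, and the diagonal-Gaussian form are all assumptions made for the example. It shows how a perturbed state is drawn and how its log-density, the quantity a policy-gradient update would need, is computed.

```python
import math
import random


def predict_gaussian(state, w_mu, w_log_sigma):
    """Hypothetical head: per-dimension mean shift and log std-dev derived from the state."""
    mu = [w * h for w, h in zip(w_mu, state)]
    log_sigma = [w * h for w, h in zip(w_log_sigma, state)]
    return mu, log_sigma


def sample_perturbed(state, mu, log_sigma, rng):
    """Draw one perturbed state; return it with the log-density of the perturbation."""
    sample, log_prob = [], 0.0
    for h, m, ls in zip(state, mu, log_sigma):
        eps = rng.gauss(0.0, 1.0)
        delta = m + math.exp(ls) * eps  # reparameterised draw from N(m, sigma^2)
        sample.append(h + delta)
        # log N(delta; m, sigma^2) written in terms of eps
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
    return sample, log_prob


rng = random.Random(0)
state = [1.0, -2.0]
mu, log_sigma = predict_gaussian(state, [0.1, 0.1], [0.0, 0.0])
candidates = [sample_perturbed(state, mu, log_sigma, rng) for _ in range(4)]
```

Each candidate carries its own log-density, so a reward computed on the final answer can be used to raise or lower the probability of the perturbation that produced it while the backbone itself stays untouched.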


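[17] A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing cs.CL | cs.AI | cs.MA

Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman

TL;DR: This paper proposes a multi-agent medical question-answering framework that integrates fine-tuned GPT, LLaMA, and DeepSeek R1 models with evidence retrieval, uncertainty estimation, and bias detection to improve the reliability and interpretability of clinical query processing.

Details

Motivation: LLM-based medical QA suffers from weak verification, insufficient evidence grounding, and unreliable confidence signals; the paper aims to address these problems to make clinical use more reliable.

Result: On MedQuAD-derived data, the fine-tuned DeepSeek R1 scores best on ROUGE-1, ROUGE-2, and BLEU and substantially outperforms the BioGPT baseline. The full system reaches 87% accuracy in evaluation, evidence augmentation reduces uncertainty (perplexity 4.13), and mean end-to-end latency is 36.5 seconds.

Insight: The contribution is a modular multi-agent architecture that combines clinical reasoning, evidence retrieval, and refinement agents with Monte Carlo dropout, perplexity scoring, and bias detection supported by LIME/SHAP-based analyses, offering a practical and extensible design for evidence-based, bias-aware medical AI.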

Abstract: Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
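
Perplexity-based uncertainty scoring, as mentioned in the abstract, can be illustrated with a short sketch. Perplexity is the exponential of the mean negative log-likelihood per token. The routing function `needs_human_review` and its `max_perplexity` threshold are hypothetical and only indicate how such a score might gate the optional human validation path; the paper's actual rule is not given here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the generated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def needs_human_review(token_log_probs, max_perplexity):
    """Hypothetical gate: flag an answer whose perplexity exceeds a chosen threshold."""
    return perplexity(token_log_probs) > max_perplexity


# Four tokens, each generated with probability 0.25.
log_probs = [math.log(0.25)] * 4
score = perplexity(log_probs)
flagged = needs_human_review(log_probs, max_perplexity=3.0)
```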


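[18] Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering cs.CL | cs.CV | cs.IR

Tao Xu

TL;DR: This paper proposes the Deferred Visual Ingestion (DVI) framework to address the high cost and poor reliability of existing pre-ingestion methods in visual-dense document question answering. DVI adopts a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, and visual understanding is deferred until a user asks a specific question, which substantially reduces compute cost and improves system reliability.

Details

Motivation: Existing multimodal document QA methods generally adopt a supply-side ingestion strategy, running a vision-language model (VLM) on every page at indexing time to generate comprehensive descriptions. This is costly, unreliable end to end, and unrecoverable once it fails.

Result: Experiments on two real industrial engineering drawing sets (113 pages + 7 pages) show that DVI achieves overall accuracy comparable to pre-ingestion at zero ingestion VLM cost (46.7% vs. 48.9%), a 50% effectiveness rate on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression).

Insight: DVI's core principle is "index for locating, not understanding": page localization is done through structured metadata indexes and BM25 full-text search, after which the original images are sent to a VLM together with the specific question for targeted analysis. This recasts the "QA accuracy" problem as a "page localization" problem and supports interactive refinement and progressive caching, which improves practicality and scalability.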

Abstract: Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this “pre-ingestion” approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI’s core principle is “Index for locating, not understanding”–achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the “QA accuracy” problem into a “page localization” problem–once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
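
The page-localization step can be sketched with a plain Okapi BM25 ranker over per-page metadata text. This is an illustrative sketch under assumptions: the page strings, whitespace tokenization, and the `k1`/`b` defaults are invented for the example and are not taken from the paper. Only the top-ranked page images would then be passed to a VLM together with the question.

```python
import math
from collections import Counter


def bm25_rank(pages, query, k1=1.5, b=0.75):
    """Return page indices ordered by Okapi BM25 score for the query (best first)."""
    docs = [text.lower().split() for text in pages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of pages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scored.append((score, idx))
    return [idx for score, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical lightweight metadata extracted at indexing time, one string per page.
pages = [
    "title block revision history",
    "pump p-101 piping diagram",
    "electrical panel wiring schedule",
]
ranking = bm25_rank(pages, "pump piping")
```

Because the index holds only short metadata strings, building it requires no VLM calls; the expensive visual analysis is spent on the few pages the ranker surfaces.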


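[19] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning cs.CL | cs.AI

Samir Abdaljalil, Erchin Serpedin, Hasan Kurban

TL;DR: This paper proposes an abstention-aware verification framework for scientific reasoning that decomposes scientific claims into minimal conditions, audits each condition with natural language inference (NLI), and selectively decides to support, refute, or abstain. The framework is evaluated on two scientific benchmarks, SciFact and PubMedQA. The experiments show that abstention is critical for controlling error and that confidence-based abstention substantially reduces risk, particularly when evidence is insufficient.

Details

Motivation: Existing evaluations usually assume a model must always give a definitive answer, yet in scientific settings an uncertain or unsupported conclusion can be more harmful than abstaining. The paper addresses how to handle uncertainty in scientific claim verification.

Result: Experiments on SciFact and PubMedQA use six language models, including encoder-decoder models, open-weight chat models, and proprietary APIs. Raw accuracy varies only modestly across architectures, while abstention is critical for controlling error; confidence-based abstention substantially reduces risk at moderate coverage levels.

Insight: The contribution is an abstention-aware verification framework that decomposes scientific claims and audits each condition with NLI. It stresses that the key challenge in scientific reasoning is not picking a single best model but judging when the evidence is sufficient to justify an answer, which gives a practical, model-agnostic lens for assessing scientific reliability.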

Abstract: Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science .
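
The risk-coverage trade-off behind confidence-based abstention can be made concrete with a small sketch. The confidence values, the threshold, and the toy predictions below are assumptions for illustration; the paper's actual confidence signal is not reproduced here. Coverage is the fraction of claims the system answers, and risk is the error rate among the answered ones.

```python
def risk_coverage(predictions, threshold):
    """predictions: list of (confidence, is_correct). Answer only if confidence >= threshold."""
    answered = [ok for conf, ok in predictions if conf >= threshold]
    coverage = len(answered) / len(predictions)
    # Risk is undefined when nothing is answered; report 0.0 in that case.
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


# Toy data: the two confident predictions are right, the two uncertain ones are wrong.
preds = [(0.9, True), (0.8, True), (0.6, False), (0.4, False)]
always_answer = risk_coverage(preds, threshold=0.0)
selective = risk_coverage(preds, threshold=0.7)
```

In this toy case raw accuracy stays at 50% if the model must always answer, while abstaining below a 0.7 confidence halves coverage and removes every error, which is the kind of effect a risk-coverage analysis is meant to expose.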


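[20] AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents cs.CL | cs.AI | cs.IR | cs.LG

Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu

TL;DR: The paper proposes AD-Bench, an advertising and marketing analytics benchmark built from real-world business requirements, for evaluating LLM agents in complex multi-round, multi-tool scenarios. The benchmark is constructed from real user requests, includes expert-provided reference answers and tool-call trajectories, and divides tasks into three difficulty levels. Experiments show that even advanced models such as Gemini-3-Pro degrade markedly on the hardest tasks, revealing capability gaps in specialized-domain analysis.

Details

Motivation: Current LLM agent benchmarks are largely confined to idealized simulations and do not cover the complex real-world tasks of specialized domains such as advertising and marketing, which require multi-round, multi-tool interaction. A benchmark closer to real business scenarios is therefore needed.

Result: On AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%. On the hardest level, L3, performance drops to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, showing that even state-of-the-art models fall clearly short in complex advertising analysis scenarios.

Insight: The contribution is a domain-specific benchmark grounded in real business requirements, with verifiable answers and reference tool-call trajectories, together with a difficulty-graded evaluation of multi-round, multi-tool collaboration. It offers a more realistic test bed for evaluating and improving specialized-domain agents.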

Abstract: While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents’ capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents. The leaderboard and code can be found at https://github.com/Emanual20/adbench-leaderboard.
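
For readers unfamiliar with Pass@k, the sketch below shows the commonly used unbiased estimator from code-generation evaluation: with `n` attempts per task of which `c` succeed, Pass@k = 1 - C(n-c, k) / C(n, k). Whether AD-Bench uses this estimator or simply counts a task as passed when any of k runs succeeds is not stated in the abstract, so treat this as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k attempts, drawn from n with c successes, passes."""
    if k > n:
        raise ValueError("k must not exceed the number of attempts n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_task_results, k):
    """Average Pass@k over tasks; per_task_results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_task_results) / len(per_task_results)


# Hypothetical results for three tasks, each attempted three times.
results = [(3, 3), (3, 1), (3, 0)]
p1 = benchmark_pass_at_k(results, k=1)
p3 = benchmark_pass_at_k(results, k=3)
```

With n = k the estimator reduces to "did any attempt succeed", which is why Pass@3 is never lower than Pass@1 for the same runs.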


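[21] STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts cs.CL | cs.LG

Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart

TL;DR: This paper presents STATe-of-Thoughts (STATe), an interpretable inference-time-compute (ITC) method for generating high-quality, diverse text. It guides reasoning with structured action templates in place of conventional high-temperature sampling, which improves output diversity and controllability.

Details

Motivation: Existing ITC methods such as Best-of-N and Tree-of-Thoughts rely on high-temperature sampling, which often fails to produce meaningful output diversity, and they offer limited control over the reasoning process, which limits explainability.

Result: In a case study on argument generation, STATe's action sequences capture interpretable features that are highly predictive of output quality. The method can also identify promising but unexplored regions of the action space and steer generation directly toward them.

Insight: The core idea is to replace stochastic sampling with discrete, interpretable textual interventions (a controller selects actions, a generator produces reasoning steps conditioned on them, and an evaluator scores candidates to guide search). This enables structured search over high-level reasoning patterns that balances quality, diversity, and interpretability.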

Abstract: Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe’s explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at https://github.com/zbambergerNLP/state-of-thoughts.


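[22] InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem cs.CL | cs.AI | cs.IR | cs.LG

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue

TL;DR: This paper presents InnoEval, a deep innovation evaluation framework that treats research idea evaluation as a knowledge-grounded, multi-perspective reasoning problem. It retrieves dynamic evidence from diverse online sources with a heterogeneous deep knowledge search engine and convenes an innovation review board of reviewers with distinct academic backgrounds, yielding a multi-dimensional, decoupled evaluation across multiple metrics. Experiments show that InnoEval outperforms baselines on point-wise, pair-wise, and group-wise evaluation tasks, with judgment patterns and consensus closely aligned with human experts.

Details

Motivation: LLMs have driven a surge of scientific ideas, but idea evaluation has not kept pace: existing methods suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias of LLM-as-a-Judge.

Result: On comprehensive datasets built from authoritative peer-reviewed submissions, InnoEval consistently outperforms baselines on point-wise, pair-wise, and group-wise evaluation tasks, with judgment patterns and consensus highly aligned with human experts.

Insight: The paper formalizes idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and designs a framework that combines heterogeneous deep knowledge retrieval with a multi-perspective review-board consensus mechanism to emulate human-level assessment, complementing existing LLM-based evaluation methods.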

Abstract: The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.


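[23] Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models cs.CL

Mufan Xu, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Muyun Yang

TL;DR: This paper proposes Multi-token Policy Gradient Optimization (MPO), a framework for improving policy-gradient methods for large language models on complex reasoning tasks. Conventional methods select actions one token at a time in autoregressive language models, whereas MPO treats K consecutive tokens as a single semantic action to better capture the compositional structure of reasoning trajectories. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy-gradient baselines.

Details

Motivation: Existing policy-gradient methods for autoregressive language models usually treat each next-token choice as the policy action. This may not capture the structure of complex reasoning tasks, where a single semantic decision (such as defining a variable or composing an equation) often spans several tokens, creating a mismatch between token-level optimization and the inherently block-level nature of reasoning.

Result: Experiments on mathematical reasoning and coding benchmarks show that MPO surpasses standard token-level policy-gradient baselines, highlighting the limits of token-level policy gradients for complex reasoning.

Insight: The contribution is to treat consecutive token sequences as unified semantic actions, capturing the compositional structure of reasoning trajectories from a block-level perspective and supporting optimization over coherent, higher-level objectives. This points to a research direction beyond token-level granularity for reasoning-intensive language tasks.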

Abstract: Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens, for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines and highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
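
The block-level view can be illustrated with a small sketch. Under the chain rule, the log-probability of a block of K consecutive tokens is the sum of its token log-probabilities, so a block can be treated as one action with one advantage. The REINFORCE-style surrogate loss and the per-block advantages below are assumptions for illustration; the exact MPO objective is not given in the abstract.

```python
def block_log_probs(token_log_probs, k):
    """Group token log-probs into blocks of k (last block may be shorter) and sum each block."""
    return [sum(token_log_probs[i:i + k]) for i in range(0, len(token_log_probs), k)]


def block_policy_gradient_loss(token_log_probs, block_advantages, k):
    """Assumed surrogate: negative mean of advantage-weighted block log-probabilities."""
    blocks = block_log_probs(token_log_probs, k)
    if len(blocks) != len(block_advantages):
        raise ValueError("need exactly one advantage per block")
    return -sum(a * lp for a, lp in zip(block_advantages, blocks)) / len(blocks)


# Five tokens grouped with K = 2 give three blocks; advantages are hypothetical.
token_lp = [-0.5, -0.5, -1.0, -1.0, -2.0]
blocks = block_log_probs(token_lp, k=2)
loss = block_policy_gradient_loss(token_lp, [1.0, 0.0, -1.0], k=2)
```

Setting K = 1 recovers the usual token-level formulation, which makes the block size a single knob for comparing the two granularities.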


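[24] WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS) cs.CL

Kiyotaka Kasubuchi, Kazuo Fukiya

TL;DR: This paper reformulates the Transformer/attention mechanism of large language models from the perspective of measure theory and frequency-domain analysis and argues theoretically that hallucination is an unavoidable structural limitation. The authors propose WavePhaseNet, which decomposes semantic information along the sequence dimension with the discrete Fourier transform to build a Semantic Conceptual Hierarchy Structure, separating low-frequency components (global meaning) from high-frequency components (local syntax) so that semantics can be manipulated precisely and consistency controlled in a reduced space of about 3,000 dimensions.

Details

Motivation: To address the breakdown of logical consistency and the hallucination that arise because the embedding space is not isomorphic to the semantic truth set, and to expose the structural root of hallucination at a theoretical level.

Result: Through cumulative energy analysis, the authors derive that GPT-4's 24,576-dimensional embedding space can be reduced to about 3,000 dimensions without losing complete representation; cohomological regularization and Hodge theory are used for quantitative control of semantic consistency, with the stated aim of suppressing hallucination.

Insight: The contributions are: 1) recasting the theoretical framework of the Transformer with measure theory and frequency-domain analysis; 2) DFT-based construction of a semantic conceptual hierarchy with frequency-band separation; 3) computable semantic regularization through cohomological consistency control, offering a new route toward LLM interpretability and robustness.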

Abstract: This paper reformulates Transformer/Attention mechanisms in Large Language Models (LLMs) through measure theory and frequency analysis, theoretically demonstrating that hallucination is an inevitable structural limitation. The embedding space functions as a conditional expectation over a σ-algebra, and its failure to be isomorphic to the semantic truth set fundamentally causes logical consistency breakdown. WavePhaseNet Method: The authors propose WavePhaseNet, which explicitly constructs a Semantic Conceptual Hierarchy Structure (SCHS) using Discrete Fourier Transform (DFT). By applying DFT along the sequence dimension, semantic information is decomposed into frequency bands: low-frequency components capture global meaning and intent, while high-frequency components represent local syntax and expression. This staged separation enables precise semantic manipulation in diagonalized space. Dimensionality Reduction: GPT-4’s 24,576-dimensional embedding space exhibits a 1/f spectral structure based on language self-similarity and Zipf’s law. Through cumulative energy analysis, the authors derive that approximately 3,000 dimensions constitute the lower bound for “complete representation.” This demonstrates that reduction from 24,576 to 3,000 dimensions preserves meaning and intent while enabling rigorous reasoning and suppressing hallucination. Cohomological Consistency Control: The reduced embedding space, constructed via cohomological regularization over overlapping local windows, allows defining a graph structure and cochain complex. This quantifies inconsistencies among local inferences as coboundary-based losses. Applying harmonic projection based on Hodge theory positions cohomology as a computable regularization principle for controlling semantic consistency, extracting maximally consistent global representations.
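
The frequency-band split along the sequence dimension can be sketched with a naive DFT. This is only an illustration of the general technique, not WavePhaseNet itself: the input is one hypothetical embedding coordinate tracked across eight token positions, and the `cutoff` separating "low" from "high" frequencies is an arbitrary choice.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
            for k in range(n)]


def inverse_dft(spectrum):
    """Inverse DFT, returning the real part (exact for conjugate-symmetric spectra)."""
    n = len(spectrum)
    return [sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
            for t in range(n)]


def split_bands(signal, cutoff):
    """Split a sequence into low- and high-frequency parts that sum back to the input."""
    spectrum = dft(signal)
    n = len(signal)
    # Keep bin k in the low band if its frequency index (k or n - k) is within the cutoff.
    low = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high = [c - l for c, l in zip(spectrum, low)]
    return inverse_dft(low), inverse_dft(high)


# One hypothetical embedding coordinate across eight sequence positions.
coordinate = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
low_band, high_band = split_bands(coordinate, cutoff=1)
```

With `cutoff=0` the low band is just the sequence mean, the coarsest possible "global" component; raising the cutoff moves progressively finer variation from the high band into the low band.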


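[25] LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning cs.CL

Wang Xing, Wei Song, Siyu Lin, Chen Wu, Man Wang

TL;DR: This paper proposes an LLM-assisted knowledge distillation framework designed specifically for temporal knowledge graph reasoning. It pairs a conventional high-capacity temporal teacher with a large language model as an auxiliary instructor, jointly optimizes supervised and distillation objectives, and uses a staged alignment strategy so that a lightweight student models event dynamics better while keeping inference efficient.

Details

Motivation: State-of-the-art models for temporal knowledge graph reasoning are usually computationally heavy and costly to deploy, while existing compression and distillation techniques are mostly designed for static graphs; applying them directly to temporal settings may overlook time-dependent interactions and degrade performance.

Result: Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures show that the method consistently improves link prediction over strong distillation baselines while keeping the student compact and efficient.

Insight: The contribution is to use a large language model as an auxiliary teacher that supplies broad background knowledge and temporally informed signals to strengthen a lightweight student's temporal reasoning, showing the potential of LLMs as effective teachers for transferring temporal reasoning to resource-efficient systems.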

Abstract: Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to performance degradation. We propose an LLM-assisted distillation framework specifically designed for temporal knowledge graph reasoning. Beyond a conventional high-capacity temporal teacher, we incorporate a large language model as an auxiliary instructor to provide enriched supervision. The LLM supplies broad background knowledge and temporally informed signals, enabling a lightweight student to better model event dynamics without increasing inference-time complexity. Training is conducted by jointly optimizing supervised and distillation objectives, using a staged alignment strategy to progressively integrate guidance from both teachers. Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a compact and efficient student model. The results highlight the potential of large language models as effective teachers for transferring temporal reasoning capability to resource-efficient TKG systems.
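
One way to picture "jointly optimizing supervised and distillation objectives" with two teachers is the sketch below. It is a generic two-teacher distillation loss, not the paper's formulation: the KL terms, the weights `w_teacher` and `w_llm`, and the staged schedule in the usage lines are all assumptions, and the candidate distributions are toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint_loss(student_logits, gold_index, teacher_probs, llm_probs, w_teacher, w_llm):
    """Cross-entropy on the gold entity plus weighted KL terms toward each teacher."""
    student = softmax(student_logits)
    supervised = -math.log(student[gold_index])
    return (supervised
            + w_teacher * kl_divergence(teacher_probs, student)
            + w_llm * kl_divergence(llm_probs, student))


# Hypothetical staged schedule: temporal teacher only first, then add the LLM signal.
logits = [2.0, 0.5, -1.0]
teacher = [0.7, 0.2, 0.1]
llm = [0.6, 0.3, 0.1]
stage_one = joint_loss(logits, 0, teacher, llm, w_teacher=1.0, w_llm=0.0)
stage_two = joint_loss(logits, 0, teacher, llm, w_teacher=1.0, w_llm=0.5)
```

Both teachers are needed only during training; the student's forward pass is unchanged, which is consistent with the abstract's point that inference-time complexity does not increase.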


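[26] Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation cs.CL

Guangyue Peng, Zongchao Chen, Wen Luo, Yuntao Wen, Wei Li

TL;DR: This paper studies post-hoc rationalization in Reverse Chain-of-Thought Generation, proposes a three-level measurement framework to quantify how strongly models anchor on the answer, and exposes the limits of the conventional semantic suppression strategy. Drawing on Ironic Process Theory from cognitive psychology, the authors propose Structural Skeleton-guided Reasoning and a distilled variant, which reduce anchoring and outperform baselines on open-ended reasoning benchmarks.

Details

Motivation: In Reverse Chain-of-Thought Generation the model sees the answer while generating, so the answer acts as a cognitive anchor that distorts the reasoning trace and yields post-hoc rationalizations. The paper aims to measure and mitigate this effect.

Result: Experiments on open-ended reasoning benchmarks show that the proposed SSR-D method improves on semantic suppression baselines by up to 10% while preserving out-of-distribution generalization.

Insight: Starting from cognitive psychology, the paper formalizes post-hoc rationalization as a three-level measurement hierarchy and proposes a two-phase method that first generates an answer-invariant functional skeleton and then uses it to guide trace generation, breaking the cycle of over-reliance on the answer.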

Abstract: Reverse Chain-of-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy of lexical, entropic, and probabilistic anchoring, which capture surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, and find it counterproductive: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.


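[27] HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation cs.CL

Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang

TL;DR: This paper proposes HyperRAG, a hypergraph-based retrieval-augmented generation framework that targets the limits of conventional knowledge-graph-based RAG in relational expressiveness and retrieval efficiency. With two complementary retrieval variants, HyperRetriever (which learns structural-semantic reasoning to build query-conditioned relational chains) and HyperMemory (which uses the LLM's parametric memory to guide beam search), the framework performs more efficient and interpretable multi-hop reasoning over n-ary, higher-order relational facts.

Details

Motivation: Conventional RAG built on knowledge graphs of binary relations has rigid retrieval schemes and dense similarity search that introduce irrelevant context, add computational overhead, and limit relational expressiveness. N-ary hypergraphs encode higher-order relational facts, capture richer dependencies between entities, and allow shallower, more efficient reasoning paths.

Result: Extensive evaluation on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validates HyperRAG. HyperRetriever achieves the highest overall answer accuracy, with average gains over the strongest baseline of 2.95% in MRR and 1.23% in Hits@10.

Insight: The contribution is to extend RAG from binary relational graphs to n-ary hypergraphs. By combining structural-semantic reasoning (HyperRetriever) with beam search guided by LLM parametric memory (HyperMemory), it builds adaptive, interpretable higher-order relational chains and improves the accuracy and efficiency of multi-hop QA.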

Abstract: Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address this limitation, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM’s parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG’s effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.
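
Why n-ary facts allow shallower reasoning paths can be seen in a small data-structure sketch. A hyperedge stores one relation over any number of participating entities, so every co-participant is reachable in a single hop, whereas a binary knowledge graph would need an intermediate node and several edges for the same fact. The fact tuples and function names below are hypothetical placeholders, not HyperRAG's API.

```python
from collections import defaultdict


def build_index(facts):
    """Map each entity to the ids of the n-ary facts (hyperedges) it participates in."""
    index = defaultdict(list)
    for fact_id, (relation, entities) in enumerate(facts):
        for entity in entities:
            index[entity].append(fact_id)
    return index


def one_hop(facts, index, entity):
    """All entities that share at least one hyperedge with the given entity."""
    reached = set()
    for fact_id in index.get(entity, []):
        reached.update(facts[fact_id][1])
    reached.discard(entity)
    return reached


# Placeholder n-ary facts: (relation, participating entities).
facts = [
    ("played_role_in", ("actor_a", "film_x", "role_r")),
    ("directed_in_year", ("director_d", "film_x", "year_y")),
]
index = build_index(facts)
neighbours = one_hop(facts, index, "actor_a")
two_hops = set().union(*(one_hop(facts, index, e) for e in neighbours)) - {"actor_a"}
```

A retriever such as the ones described in the abstract would score and prune these expansions per query rather than enumerate them exhaustively.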


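[28] Query as Anchor: Scenario-Adaptive User Representation via Large Language Model cs.CL | cs.IR

Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao

TL;DR: This paper proposes Query-as-Anchor, a framework that shifts user representation learning from static encoding to dynamic, query-aware synthesis. It builds UserU, an industrial-scale pre-training dataset, and designs the Q-Anchor Embedding architecture, which integrates hierarchical encoders into dual-tower large language models and produces query-aware user representations through joint contrastive-autoregressive optimization. Cluster-based Soft Prompt Tuning is added to bridge the gap between general pre-training and specific business logic. The method achieves SOTA results on 10 Alipay industrial benchmarks and is validated by A/B tests in production.

Details

Motivation: In industrial-scale user representation learning, existing static, task-agnostic embeddings struggle to reconcile the divergent requirements of downstream scenarios within a unified vector space, and heterogeneous multi-source data introduces noise and modality conflicts.

Result: The method achieves SOTA performance on 10 Alipay industrial benchmarks and shows strong scalability. Large-scale online A/B tests in two real-world scenarios in Alipay's production system further validate its practical effectiveness.

Insight: The core idea is to move user modeling from static encoding to dynamic synthesis anchored on the query, using the Q-Anchor Embedding method that combines hierarchical encoding, a dual-tower LLM architecture, and joint optimization. Cluster-based Soft Prompt Tuning provides scenario adaptation, and KV-cache acceleration at inference balances performance and efficiency.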

Abstract: Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay’s production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: https://github.com/JhCircle/Q-Anchor.


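[29] Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil cs.CL | cs.LG

Sukumar Kishanthan, Kumar Thushalika, Buddhi Jayasekara, Asela Hevapathige

TL;DR: This paper evaluates the mathematical reasoning of large language models in Sinhala and Tamil. It finds that basic arithmetic reasoning transfers robustly across languages, but performance on complex reasoning tasks degrades significantly in these low-resource languages, challenging the common assumption that models reason equally well across languages.

Details

Motivation: To determine whether LLMs' mathematical reasoning in low-resource languages such as Sinhala and Tamil reflects genuine multilingual reasoning or relies on implicit translation into English.

Result: Four prominent large language models are evaluated on a parallel dataset covering six types of math problems, from basic arithmetic to complex unit-conflict and optimization problems. Complex reasoning tasks degrade significantly in Tamil and Sinhala, and failure patterns vary by model and problem type.

Insight: By using a parallel dataset natively authored in all three languages by fluent speakers, the study avoids translation artifacts and shows that apparent multilingual competence may not reflect uniform reasoning ability across languages, which underscores the need for fine-grained, type-aware evaluation in multilingual settings.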

Abstract: Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset where each problem is natively authored by fluent speakers with mathematical training in all three languages. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.


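[30] Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets cs.CL | cs.AI

Yuchen Yang, Wenze Lin, Enhao Huang, Zhixuan Chu, Hongbin Zhou

TL;DR: This paper proposes XTF, an explainable token-level noise filtering framework for optimizing large language model (LLM) fine-tuning datasets. XTF decomposes the contribution of token-level data to fine-tuning into three explicit attributes (reasoning importance, knowledge novelty, and task relevance), scores them, and selectively masks the gradients of noisy tokens to improve the fine-tuned model.

Details

Motivation: Fine-tuning datasets are usually designed at the sentence level, which is fundamentally at odds with the token-level optimization of LLMs. This introduces token-level noise that hurts final performance; the paper targets this mismatch between dataset design and optimization objective.

Result: Extensive experiments cover three representative downstream tasks (math, code, and medicine) across 7 mainstream LLMs. XTF improves downstream performance by up to 13.7% over regular fine-tuning.

Insight: The paper highlights the importance of token-level dataset optimization and proposes an explainable framework based on attribute decomposition for identifying and filtering noisy tokens, offering a new strategy for explaining complex training mechanisms.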

Abstract: Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence level, which introduces token-level noise and degrades final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
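
Masking the gradients of selected tokens is equivalent to leaving those tokens out of the training loss, which the sketch below illustrates. It assumes each token already carries a single combined usefulness score and a fixed threshold; how XTF actually scores and combines its three attributes is not described in the abstract, so both are placeholders.

```python
def masked_token_loss(token_nll, token_scores, threshold):
    """Mean negative log-likelihood over tokens whose score is at or above the threshold.

    Tokens below the threshold are treated as noise: they contribute no loss and
    therefore no gradient.
    """
    kept = [nll for nll, score in zip(token_nll, token_scores) if score >= threshold]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-token losses and combined scores for a three-token target.
nll = [2.0, 4.0, 6.0]
scores = [0.9, 0.1, 0.8]
filtered = masked_token_loss(nll, scores, threshold=0.5)
unfiltered = masked_token_loss(nll, scores, threshold=0.0)
```

In a deep-learning framework the same effect is usually obtained by multiplying the per-token loss by a 0/1 mask before averaging.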


[31] Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation cs.CLPDF

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury

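TL;DR: This study evaluates the zero-shot performance of five large language models (LLMs) on medical question answering (QA) using the iCliniq dataset, which contains 38,000 medical questions. The models are Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini, measured with BLEU and ROUGE. Larger models such as Llama 3.3 70B perform better, while Llama-4-Maverick-17B strikes a competitive balance between efficiency and performance, highlighting its potential for real-world clinical deployment.

Details

Motivation: To evaluate the QA capability of LLMs in the medical domain, particularly their potential to improve healthcare access in low-resource settings, and to provide a standardized benchmark setting for medical NLP applications.

Result: Zero-shot evaluation on iCliniq shows that larger models (e.g., Llama 3.3 70B Instruct) outperform smaller ones, consistent with the benefits of scaling. Llama-4-Maverick-17B also achieves competitive results, reflecting the efficiency-performance trade-off in practical deployment.

Insight: The contribution is a systematic zero-shot medical QA evaluation of several recent LLMs, with emphasis on how the trade-off between model size and efficiency matters for clinical deployment. Viewed objectively, the standardized benchmark and the "LLM-as-a-Judge" evaluation approach can serve as a reference for future medical NLP research and encourage resource-efficient, practical models.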

Abstract: Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially for developing medical QA systems that enhance access to healthcare in low-resource settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, which contains 38,000 medical questions and answers across diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B exhibited competitive results, highlighting efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that seek to minimize model size and computational resources while maximizing clinical utility in medical NLP applications.
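
Sketch (not from the paper): for readers unfamiliar with overlap metrics, a unigram ROUGE-1 F1 score can be computed as below. This is a simplified illustration with whitespace tokenization. The abstract does not state which ROUGE variants or tooling the authors used, and the example strings are invented.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("take rest and fluids", "rest and drink fluids")
```

Overlap metrics of this kind reward shared wording rather than clinical correctness, which is one reason the digest also mentions LLM-as-a-Judge evaluation.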


[32] GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation cs.CLPDF

Hao Liu, Guangyan Li, Wensheng Zhang, Yongqiang Tang

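TL;DR: This paper proposes GradMAP, a fast layer pruning method based on a gradient metric and projection compensation, to reduce the computational cost of large language models. It assesses layer importance with a single backward pass and uses a projection compensation matrix to correct pruning-induced degradation in one step, outperforming existing methods in both pruning speed and model performance.

Details

Motivation: Large language models exhibit significant computational redundancy, yet existing layer pruning methods struggle to maintain both pruning performance and efficiency, so a pruning scheme that is both fast and effective is needed.

Result: In extensive experiments, GradMAP achieves an average 4x speedup in pruning and outperforms previous layer pruning methods in performance.

Insight: The innovations are a global layer-importance metric based on gradient magnitudes (requiring only a single backward pass) and a one-step projection compensation matrix that corrects the mean shift caused by pruning, balancing efficiency and performance.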

Abstract: Large Language Models (LLMs) exhibit strong reasoning abilities, but their high computational costs limit their practical deployment. Recent studies reveal significant redundancy in LLM layers, making layer pruning an active research topic. Layer pruning research primarily focuses on two aspects: measuring layer importance and recovering performance after pruning. Unfortunately, existing works fail to simultaneously maintain pruning performance and efficiency. In this study, we propose GradMAP, a faster layer pruning method with Gradient Metric And Projection compensation, which consists of two stages. In the first stage, we introduce a novel metric based on gradient magnitudes, enabling a global assessment of layer importance. Notably, it requires only a single backward propagation step per pruning decision, substantially enhancing pruning efficiency. In the second stage, we first analyze the layers with the largest mean shift resulting from pruning, and then incorporate a simple yet effective projection compensation matrix to correct this drift in one step. In this way, the degradation of model performance caused by layer pruning is effectively alleviated. Extensive experiments show that GradMAP outperforms previous layer pruning methods in both pruning speed (achieving an average 4x speedup) and performance.


[33] Is Information Density Uniform when Utterances are Grounded on Perception and Discourse? cs.CLPDF

Matteo Gay, Coleman Haley, Mario Giulianelli, Edoardo Ponti

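TL;DR: This is the first computational test of the Uniform Information Density (UID) hypothesis in visually grounded settings. Using multilingual vision-and-language models on image-caption data in 30 languages and visual storytelling data in 13 languages, the study finds that perceptual grounding smooths the distribution of information and increases uniformity. In visual storytelling, joint grounding in image and discourse further reduces surprisal, supporting a context-sensitive formulation of UID.

Details

Motivation: Existing UID studies are limited to text-only inputs and ignore perceptual context. The paper asks whether information density is uniform in visually grounded scenarios, which are closer to real multimodal language use.

Result: On image-caption data in 30 languages and visual storytelling data in 13 languages, perceptual grounding consistently increases global and local information uniformity. In visual storytelling, joint image and discourse grounding yields the strongest surprisal reductions at the onset of discourse units.

Insight: The novelty lies in extending UID research to visually grounded multimodal settings for the first time, showing that perceptual and discourse context increase information uniformity and laying groundwork for modelling the dynamics of information flow in ecologically plausible multimodal language use.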

Abstract: The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
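
Sketch (not from the paper): "global" and "local" uniformity are commonly operationalized as the variance of per-token surprisal and the mean squared change in surprisal between adjacent tokens. Whether the paper uses exactly these definitions is an assumption, and the probabilities below are invented for illustration.

```python
import math

def surprisals(token_probs):
    """Surprisal in bits for each token, given its model probability."""
    return [-math.log2(p) for p in token_probs]

def global_variance(s):
    """Variance of surprisal around the utterance mean (lower = more uniform)."""
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)

def local_variance(s):
    """Mean squared surprisal change between adjacent tokens."""
    if len(s) < 2:
        return 0.0
    return sum((b - a) ** 2 for a, b in zip(s, s[1:])) / (len(s) - 1)

text_only = surprisals([0.5, 0.125, 0.5, 0.125])  # spiky: 1, 3, 1, 3 bits
grounded = surprisals([0.25, 0.25, 0.25, 0.25])   # flat: 2 bits per token
```

Both toy utterances carry 8 bits in total, but the second spreads them evenly, which is the pattern the study associates with grounding.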


[34] Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer’s Disease Detection via Speech cs.CL | cs.AIPDF

Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang gu

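TL;DR: This paper proposes FAL-AD, a novel federated and augmented learning framework for detecting Alzheimer's disease (AD) from speech. By combining voice conversion-based data augmentation, adaptive federated learning, and attentive cross-modal fusion, it systematically addresses the data efficiency dilemma caused by medical data scarcity and privacy barriers.

Details

Motivation: Early diagnosis of Alzheimer's disease is crucial, but AI-based speech detection faces medical data scarcity and privacy barriers, which lead to poor data efficiency. The paper aims to break this dilemma by integrating federated learning with data augmentation.

Result: On the ADReSSo benchmark, FAL-AD achieves a multimodal accuracy of 91.52%, reaching state-of-the-art (SOTA) performance and outperforming all centralized baselines.

Insight: The innovations are: 1) voice conversion-based augmentation via cross-category voice-content recombination, which improves absolute efficiency; 2) an adaptive federated learning paradigm that maximizes cross-institutional collaboration benefits under privacy constraints; 3) an attentive cross-modal fusion model that achieves fine-grained word-level alignment and acoustic-textual interaction, optimizing representational efficiency. The framework offers a practical solution to data efficiency problems in medical AI.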

Abstract: Early diagnosis of Alzheimer’s Disease (AD) is crucial for delaying its progression. While AI-based speech detection is non-invasive and cost-effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL-AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion-based augmentation, which generates diverse pathological speech samples via cross-category voice-content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross-institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross-modal fusion model, which achieves fine-grained word-level alignment and acoustic-textual interaction. Evaluated on ADReSSo, FAL-AD achieves a state-of-the-art multi-modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at https://github.com/smileix/fal-ad.


[35] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers cs.CL | cs.AIPDF

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Lukas Mauch, Fabien Cardinaux

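TL;DR: This paper empirically reveals an input-output alignment shift caused by residual connections in autoregressive Transformers, i.e., a mismatch between the activations of the current token and the next-token prediction target, and proposes a lightweight mitigation strategy based on residual attenuation.

Details

Motivation: To address the structural misalignment between current-token activations and next-token prediction targets that arises from causal masking and residual connections in autoregressive Transformers.

Result: On multiple benchmarks, the proposed residual attenuation strategies (a fixed-layer intervention or a learnable gating mechanism) alleviate the representation misalignment and improve performance.

Insight: The novelty lies in empirically localizing the alignment shift in pretrained LLMs using decoding trajectories and similarity-based metrics, and in proposing a lightweight residual-path mitigation as a general architectural enhancement.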

Abstract: Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
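
Sketch (not from the paper): one way to picture residual attenuation is to scale the skip path by a gate in (0, 1) before adding the sublayer output. The sigmoid parameterization, the scalar gate, and the function names are assumptions. The abstract only states that the mitigation is either a fixed-layer intervention or a learnable gate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attenuated_residual(x, sublayer_out, gate_logit):
    """Compute g * x + f(x), where g = sigmoid(gate_logit).
    A standard residual connection corresponds to g = 1."""
    g = sigmoid(gate_logit)
    return [g * xi + fi for xi, fi in zip(x, sublayer_out)]

# gate_logit = 0 gives g = 0.5: the current-token stream is halved
# before the sublayer update is added.
out = attenuated_residual([2.0, 4.0], [1.0, 1.0], gate_logit=0.0)
```

A fixed-layer variant would hard-code g at a chosen layer, while a learnable variant would train gate_logit with the rest of the network.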


[36] Unlocking Reasoning Capability on Machine Translation in Large Language Models cs.CL | cs.AIPDF

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio

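TL;DR: This paper systematically evaluates reasoning-oriented large language models (RLMs) on machine translation (MT) and finds that enabling explicit reasoning actually degrades translation quality. Analysis shows that MT reasoning traces are highly linear, lacking revision, self-correction, and exploration of alternative translations. The authors therefore propose a structured reasoning framework tailored to translation, comprising multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision, and significantly improve translation performance by post-training a model on a synthetic dataset.

Details

Motivation: Although reasoning-oriented LLMs excel at tasks such as mathematics and code generation, their impact on machine translation remains underexplored. The paper investigates whether and how explicit reasoning can improve MT quality.

Result: On the WMT24++ benchmark, translation quality generally drops for multiple open- and closed-weights RLMs when explicit reasoning is enabled. Models post-trained with the proposed structured reasoning framework significantly outperform standard translation fine-tuning and baselines with injected generic reasoning.

Insight: The core contribution is exposing the mismatch between generic reasoning patterns and the MT task and proposing a task-tailored structured reasoning framework. The key insight is that reasoning must be structured for the specific task (here, translation) rather than borrowed from generic reasoning patterns, which points to an important direction for improving LLM reasoning on specific downstream tasks.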

Abstract: Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models’ performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.


[37] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque cs.CLPDF

Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

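TL;DR: Targeting the low-resource language Basque, this paper builds BasPhyCo, the first non-question-answering physical commonsense reasoning dataset for Basque, covering standard and dialectal variants, and evaluates the physical commonsense reasoning of multilingual LLMs at three levels.

Details

Motivation: To fill the gap left by the absence of research on LLM performance on non-QA physical commonsense reasoning tasks in low-resource languages such as Basque.

Result: On the verifiability task, LLMs show limited physical commonsense reasoning in low-resource languages such as Basque, and perform even worse on dialectal variants.

Insight: The novelty lies in building the first non-QA physical commonsense reasoning dataset for Basque and introducing a hierarchical evaluation framework (accuracy, consistency, verifiability), which reveals the limits of models' physical reasoning in low-resource languages.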

Abstract: Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.


[38] BFS-PO: Best-First Search for Large Reasoning Models cs.CL | cs.AIPDF

Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara

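TL;DR: This paper proposes BFS-PO, a reinforcement learning algorithm that addresses overthinking in large reasoning models (such as OpenAI o1 and DeepSeek-R1), i.e., the generation of overly long reasoning chains. It uses a best-first search strategy to find the shortest correct answer, improving accuracy while shortening output length.

Details

Motivation: Large reasoning models perform well on reasoning tasks but often incur significantly higher computational cost and verbose output (overthinking), and RL algorithms such as GRPO/DAPO can exacerbate this tendency. A training method that improves both accuracy and conciseness is therefore needed.

Result: Experiments on different benchmarks and base large reasoning models show that BFS-PO simultaneously increases accuracy and shortens generated answers.

Insight: The novelty lies in integrating best-first search with a backtracking mechanism based on maximum-entropy nodes into reinforcement learning, guiding the model to generate progressively shorter responses during training and thus learn concise reasoning chains, which offers a new optimization direction for the overthinking problem.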

Abstract: Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational cost and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
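
Sketch (not from the paper): a "maximum entropy node" can be read as the decoding step where the next-token distribution is most uncertain. The helper below only locates such a step. How BFS-PO branches, scores, and trains from that point is not specified in the abstract, and the function names and toy distributions are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def max_entropy_position(step_distributions):
    """Index of the decoding step with the most uncertain distribution,
    taken here as the candidate point to backtrack to."""
    ents = [entropy(d) for d in step_distributions]
    return max(range(len(ents)), key=ents.__getitem__)

steps = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
backtrack_at = max_entropy_position(steps)
```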


[39] Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition cs.CL | cs.SEPDF

Varun Nathan, Shreyas Guha, Ayush Kumar

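TL;DR: This paper presents a domain-specific framework and benchmark for contact center AI that evaluates large language models (LLMs) on tool-aware planning, a task that requires decomposing business-insight queries into executable steps assigned to structured tools (e.g., Text2SQL) and unstructured tools (e.g., RAG). Contributions include a reference-based plan evaluation framework, a data curation methodology that iteratively refines plans through an evaluator-optimizer loop, and a large-scale evaluation of 14 LLMs across sizes and families.

Details

Motivation: In contact center scenarios, LLMs struggle to decompose complex business-insight queries into step-by-step plans that are executable, parallelizable, and correctly tool-assigned, especially for compound queries and multi-step planning.

Result: In the seven-dimension evaluation (e.g., tool-prompt alignment, query adherence), the best total metric score reaches 84.8% (Claude-3-7-Sonnet), but the highest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). LLMs perform poorly on compound queries and on plans exceeding 4 steps (typically 5-15). Plan lineage (ordered plan revisions) yields mixed gains overall, but benefits several top models and improves step executability for many.

Insight: The innovations are: 1) a reference-based plan evaluation framework combining a metric-wise evaluator and a one-shot evaluator; 2) a data curation methodology that iteratively refines plans through an evaluator-optimizer loop, producing high-quality plan lineages while reducing manual annotation; 3) evidence of persistent gaps in LLM tool understanding (especially tool-prompt alignment and tool-usage completeness) and of shorter, simpler plans being markedly easier to generate. The framework offers a reproducible path for assessing and improving tool-based agentic planning.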

Abstract: We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the “A+” tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
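
Sketch (not from the paper): the role of explicit depends_on fields can be illustrated by grouping plan steps into waves that could run in parallel. The plan schema (a dict with "tool" and "depends_on" keys), the step ids, and the function name are hypothetical. The benchmark's actual plan format is not given in the abstract.

```python
def parallel_waves(steps):
    """Group steps into waves. Every step in a wave has all of its
    depends_on entries satisfied by earlier waves, so the steps within
    one wave are independent of each other."""
    remaining = {sid: set(s["depends_on"]) for sid, s in steps.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(sid for sid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unresolved depends_on")
        waves.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return waves

plan = {
    "s1": {"tool": "T2S", "depends_on": []},            # structured query
    "s2": {"tool": "RAG", "depends_on": []},            # transcript lookup
    "s3": {"tool": "T2S", "depends_on": ["s1", "s2"]},  # joins both results
}
waves = parallel_waves(plan)
```

Here s1 and s2 share no dependency and land in the same wave, while s3 waits for both.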


[40] Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation cs.CL | cs.IRPDF

Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao

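TL;DR: This paper proposes a reinforcement learning framework for cross-domain news recommendation that trains large language models to generate high-quality lists of interest-driven news search queries from heterogeneous user signals, using policy optimization and distillation for scalable deployment.

Details

Motivation: The key challenge in cross-domain news recommendation is moving beyond surface-level behaviors to capture deeper, reusable user interests while remaining scalable in large-scale production systems.

Result: Offline experiments, ablation studies, and large-scale online A/B tests in a production news recommendation system all show consistent gains in interest modeling quality and downstream recommendation performance.

Insight: Innovations include formulating query-list generation as a policy optimization problem, using GRPO with multiple reward signals, systematically studying two compute dimensions (inference-time sampling and model capacity), and using on-policy distillation to transfer knowledge from a compute-intensive teacher to a compact student for scalable deployment.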

Abstract: News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring users' underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
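
Sketch (not from the paper): GRPO scores each sampled output relative to the other samples drawn for the same prompt, by normalizing rewards within the group. The example below shows that normalization together with a hypothetical weighted sum for combining several reward signals. The weights, the reward values, and the use of the population standard deviation are assumptions.

```python
import statistics

def combine_rewards(signals, weights):
    """Hypothetical scalarization: weighted sum of several reward signals."""
    return sum(w * s for w, s in zip(weights, signals))

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sample = (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four query lists sampled for one user; each has (relevance, diversity) scores.
group = [(1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
rewards = [combine_rewards(g, weights=(0.5, 0.5)) for g in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered within the group, no separate value network is needed to provide a baseline.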


[41] Cold-Start Personalization via Training-Free Priors from Structured World Models cs.CL | cs.AI | cs.LGPDF

Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du

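TL;DR: This paper proposes Pep (Preference Elicitation with Priors), a framework for cold-start personalization. It splits the problem into offline structure learning and online Bayesian inference: offline, it learns a structured world model from complete user preference profiles to capture correlations among preference dimensions; online, it requires no training and uses Bayesian inference to select the most informative questions and predict the full preference profile, including dimensions never asked about.

Details

Motivation: The core challenge of cold-start personalization is efficiently inferring a user's multi-dimensional preferences within a limited number of interactions (the question budget). Conventional reinforcement learning methods fail to exploit the factored structure of preference data and in practice tend to collapse into static question sequences that ignore user responses.

Result: Across medical, mathematical, social, and commonsense reasoning domains, Pep achieves 80.8% alignment between generated responses and users' stated preferences, versus 68.5% for RL, with 3-5x fewer interactions. When different users give different answers to the same question, Pep changes its follow-up question 39-62% of the time, versus 0-28% for RL. Pep needs only about 10K parameters, far fewer than the 8B used by RL.

Insight: The main innovation is a modular framework that decomposes cold-start preference elicitation into offline structure modeling and online training-free Bayesian inference. The core insight is that the bottleneck is the ability to exploit the factored structure of preference data, not model scale. The framework is modular across downstream solvers and requires only simple belief models, suggesting a route to efficient, lightweight personalized interaction systems.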

Abstract: Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users’ stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
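
Sketch (not from the paper): training-free Bayesian question selection can be illustrated with a discrete belief over latent user types and yes/no questions, picking the question with the highest expected information gain. Pep's actual world model and selection criterion are not described in the abstract. The user types, question names, and likelihood tables below are invented.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior(prior, yes_likelihood, answer):
    """Bayes update of the belief over user types after a yes/no answer."""
    like = [l if answer else 1.0 - l for l in yes_likelihood]
    unnorm = [p * l for p, l in zip(prior, like)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def expected_info_gain(prior, yes_likelihood):
    """Prior entropy minus expected posterior entropy for one question."""
    p_yes = sum(p * l for p, l in zip(prior, yes_likelihood))
    gain = entropy(prior)
    for answer, p_ans in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_ans > 0:
            gain -= p_ans * entropy(posterior(prior, yes_likelihood, answer))
    return gain

def pick_question(prior, questions):
    """Choose the question whose answer is expected to be most informative."""
    return max(questions, key=lambda q: expected_info_gain(prior, questions[q]))

prior = [0.5, 0.5]  # belief over two hypothetical user types
questions = {
    "wants_step_by_step": [1.0, 0.0],  # P(yes | type): separates the types
    "likes_examples": [0.5, 0.5],      # uninformative: same for both types
}
best = pick_question(prior, questions)
```

Since the next question is recomputed from the updated belief, different answers lead to different follow-ups, which is the adaptivity the abstract contrasts with collapsed RL policies.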


cs.CV [Back]

[42] Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLMs cs.CV | cs.AIPDF

Paul Jonas Kurz, Tobias Jan Wieczorek, Mohamed A. Abdelsalam, Rahaf Aljundi, Marcus Rohrbach

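TL;DR: This paper studies how post-training quantization (PTQ) affects the accuracy and reliability of multimodal large language models (MLLMs) on visual question answering (VQA). Evaluating Qwen2-VL-7B and Idefics3-8B across bit widths with data-aware and data-free quantization, it finds that quantization degrades performance but data-aware methods (such as MBQ) mitigate the effect. It also adapts the Selector confidence estimator and tests its robustness under quantization and out-of-distribution (OOD) scenarios, finding that int4 MBQ combined with the Selector gives the best efficiency-reliability trade-off, approaching uncompressed performance with about 75% less memory.

Details

Motivation: Deploying MLLMs on edge devices faces two key challenges: the models are large and need compression (e.g., quantization), and they are overconfident, i.e., they assign high confidence to wrong answers. The study systematically analyzes how quantization affects MLLM accuracy and reliability.

Result: On VQA, PTQ degrades both accuracy and reliability. The data-aware method (MBQ) holds up better than the data-free one (HQQ), and the Selector confidence estimator substantially mitigates the reliability loss from quantization. Specifically, int4 MBQ with the Selector approaches uncompressed performance at about 75% less memory, the best efficiency-reliability trade-off observed.

Insight: The contribution is the first systematic study linking quantization and reliability in multimodal settings, together with an adaptation of the Selector confidence estimator to quantized multimodal settings. Viewed objectively, the perspective of combining model compression (quantization) with reliability evaluation is instructive, and the proposed combination (data-aware quantization plus confidence estimation) offers a practical efficiency-reliability balance for edge deployment of MLLMs.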

Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in domains where both reliability and efficiency are critical. However, current models remain overconfident, producing highly certain but incorrect answers. At the same time, their large size limits deployment on edge devices, necessitating compression. We study the intersection of these two challenges by analyzing how Post-Training Quantization (PTQ) compression affects both accuracy and reliability in Visual Question Answering (VQA). We evaluate two MLLMs, Qwen2-VL-7B and Idefics3-8B, quantized with data-free (HQQ) and data-aware (MBQ) methods across multiple bit widths. To counteract the reduction in reliability caused by quantization, we adapt the Selector confidence estimator for quantized multimodal settings and test its robustness across various quantization levels and out-of-distribution (OOD) scenarios. We find that PTQ degrades both accuracy and reliability. Data-aware methods soften this effect. The Selector substantially mitigates the reliability impact. The combination of int4 MBQ and the Selector achieves the best efficiency-reliability trade-off, closing in on uncompressed performance at approx. 75% less memory demand. Overall, we present the first systematic study linking quantization and reliability in multimodal settings.
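
Sketch (not from the paper): the "approx. 75% less memory" figure is consistent with simple weight-storage arithmetic if the uncompressed baseline uses 16-bit weights, which is an assumption here. The estimate below ignores quantization scales and zero-points, activations, and the KV cache, and treats "7B" as a nominal parameter count.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def memory_reduction(baseline_bits, quant_bits):
    """Fraction of weight memory saved by lowering the bit width."""
    return 1.0 - quant_bits / baseline_bits

fp16_gb = weight_memory_gb(7e9, 16)  # nominal 7B-parameter model at 16 bits
int4_gb = weight_memory_gb(7e9, 4)   # same model at 4 bits
saved = memory_reduction(16, 4)
```

Under these assumptions the weights shrink from 14 GB to 3.5 GB. Real savings are somewhat smaller once per-group quantization metadata is included.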


[43] NutVLM: A Self-Adaptive Defense Framework against Full-Dimension Attacks for Vision Language Models in Autonomous Driving cs.CV | eess.IVPDF

Xiaoxu Peng, Dong Zhou, Jianwen Zhang, Guanghui Sun, Anh Tu Ngo

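TL;DR: This paper proposes NutVLM, a self-adaptive defense framework for vision language models (VLMs) in autonomous driving, designed to withstand full-dimension adversarial attacks ranging from localized physical patches to global perturbations. It first uses NutNet++ as a sentinel for three-way classification, then purifies local threats with grayscale masking and handles global perturbations with Expert-guided Adversarial Prompt Tuning (EAPT), which generates corrective driving prompts and improves robustness without full-model fine-tuning.

Details

Motivation: VLMs in autonomous driving face adversarial threats ranging from local to global. Existing defenses are limited and struggle to reconcile robustness with clean-sample performance, so a comprehensive solution is needed to secure the entire perception-decision lifecycle.

Result: On the Dolphins benchmark, NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score), supporting its effectiveness as a scalable security solution for intelligent transportation.

Insight: The novelty is a unified self-adaptive defense framework that combines detection, purification, and lightweight prompt tuning. Its core idea is to use three-way classification to distinguish sample types and, for global perturbations, to generate corrective prompts with EAPT via gradient-based latent optimization and discrete projection, refocusing the VLM's attention without updating model parameters and balancing robustness with performance.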

Abstract: Vision Language Models (VLMs) have advanced perception in autonomous driving (AD), but they remain vulnerable to adversarial threats. These risks range from localized physical patches to imperceptible global perturbations. Existing defense methods for VLMs remain limited and often fail to reconcile robustness with clean-sample performance. To bridge these gaps, we propose NutVLM, a comprehensive self-adaptive defense framework designed to secure the entire perception-decision lifecycle. Specifically, we first employ NutNet++ as a sentinel, which is a unified detection-purification mechanism. It identifies benign samples, local patches, and global perturbations through three-way classification. Subsequently, localized threats are purified via efficient grayscale masking, while global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT). Instead of the costly parameter updates of full-model fine-tuning, EAPT generates “corrective driving prompts” via gradient-based latent optimization and discrete projection. These prompts refocus the VLM’s attention without requiring exhaustive full-model retraining. Evaluated on the Dolphins benchmark, our NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score). These results validate NutVLM as a scalable security solution for intelligent transportation. Our code is available at https://github.com/PXX/NutVLM.


[44] VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction cs.CV | cs.AIPDF

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

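TL;DR: This paper proposes VisPhyWorld, a framework that evaluates the physical reasoning of multimodal large language models (MLLMs) by requiring them to generate executable simulator code from visual observations, and builds on it the VisPhyBench benchmark of 209 evaluation scenes.

Details

Motivation: Existing evaluations based on Visual Question Answering (VQA) and Violation of Expectation (VoE) cannot easily test whether MLLMs genuinely reason about physics, because such questions can often be answered without forming an explicit, testable physical hypothesis.

Result: On VisPhyBench, the pipeline produces valid reconstructed videos in 97.7% of cases. Experiments show that state-of-the-art MLLMs achieve strong semantic scene understanding but still struggle to accurately infer physical parameters and simulate consistent physical dynamics.

Insight: The novelty lies in separating physical reasoning from rendering: generating executable code makes the inferred world representation directly inspectable, editable, and falsifiable, giving a stricter, testable paradigm for evaluating physical reasoning.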

Abstract: Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of cases on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.


[45] WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery cs.CV | cs.AIPDF

Aydin Ayanzadeh, Prakhar Dixit, Sadia Kamal, Milton Halem

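TL;DR: This paper presents WildfireVLM, an AI framework that combines wildfire detection in satellite imagery with language-driven risk assessment. It uses YOLOv12 to detect fire zones and smoke in satellite images and integrates multimodal large language models (MLLMs) to turn detection outputs into contextualized risk assessments and prioritized response recommendations, with a service-oriented architecture supporting real-time processing and visual risk dashboards.

Details

Motivation: Wildfires pose a growing threat to ecosystems and people, and early detection is critical. Satellite-based monitoring, however, is challenging because of faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas.

Result: The paper builds a labeled wildfire and smoke dataset from publicly available Earth observation data such as Landsat-8/9 and GOES-16, and validates the quality of risk reasoning with an LLM-as-judge evaluation. The system demonstrates the value of combining computer vision with language-based reasoning for scalable wildfire monitoring.

Insight: The novelty lies in integrating an object detector (YOLOv12) with MLLMs into an end-to-end framework from visual detection to language-based risk assessment, moving from "seeing" to "understanding and recommending" and providing interpretable, actionable decision support for disaster management.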

Abstract: Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring.


[46] Fine-Tuning a Large Vision-Language Model for Artwork’s Scoring and Critique cs.CV | cs.AI | cs.LGPDF

Zhehan Zhang, Meihua Qian, Li Luo, Siyu Huang, Chaoyi Zhou

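TL;DR: This paper proposes a framework that fine-tunes the vision-language model Qwen2-VL-7B with multi-task learning for automated creativity assessment of human paintings. The method predicts a numerical score from 1 to 100 and generates written feedback aligned with the scoring rubric (originality, color, texture, composition, content).

Details

Motivation: Assessing artistic creativity is foundational to creativity research and arts education, but traditional manual scoring (e.g., the Torrance Tests of Creative Thinking) is very time-consuming at scale. Existing machine-learning approaches rely mainly on image features and usually provide no explanatory feedback.

Result: On a dataset of 1000 human paintings (80/20 split), the model achieves Pearson r > 0.97 and a mean absolute error of about 3.95 on the 100-point scale. The generated feedback is semantically close to expert critiques (average SBERT cosine similarity of 0.798).

Insight: The main innovation is fine-tuning a large vision-language model with multi-task learning and embedding the structured rubric and artwork description in the system prompt, so that numerical scoring and rubric-aligned text feedback are produced in a single forward pass, providing a scalable automated tool for creativity assessment.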

Abstract: Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric and the artwork description in the system prompt, we constrain the generated text to match the quantitative prediction. Experiments show strong accuracy, achieving Pearson r > 0.97 and MAE about 3.95 on the 100-point scale. Qualitative evaluation indicates the generated feedback is semantically close to expert critiques (average SBERT cosine similarity = 0.798). The proposed approach bridges computer vision and art assessment and offers a scalable tool for creativity research and classroom feedback.
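
Sketch (not from the paper): the two headline metrics, Pearson r and mean absolute error (MAE), can be computed from predicted and expert scores as below. The score lists are invented for illustration and are unrelated to the paper's data.

```python
import math

def pearson_r(pred, gold):
    """Pearson correlation between predicted and reference scores."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

def mean_absolute_error(pred, gold):
    """Average absolute gap between predicted and reference scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

predicted = [72, 55, 90]  # hypothetical model scores on the 1-100 scale
expert = [70, 60, 88]     # hypothetical expert scores
r = pearson_r(predicted, expert)
mae = mean_absolute_error(predicted, expert)
```

The two metrics are complementary: r measures whether the ranking and trend of scores agree, while MAE measures how far off each score is in points.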


[47] Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension cs.CV | cs.AIPDF

Haoran Xu, Hongyu Wang, Jiaze Li, Shunpeng Chen, Zizhao Tong

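TL;DR: This paper proposes Visual Para-Thinker, described as the first parallel reasoning framework for multimodal large language models (MLLMs). By shifting from depth scaling to parallel thinking, and using visual partitioning together with path-independence mechanisms, it addresses the bottleneck of vertical scaling strategies in which the model becomes locked into a fixed thinking pattern. Its effectiveness is validated on several visual comprehension benchmarks.

Details

Motivation: Existing LLM test-time scaling laws emphasize eliciting self-reflective behavior by extending reasoning length, but this vertical scaling strategy tends to lock the model into a specific thinking pattern and hit an exploration bottleneck. Extending the parallel thinking paradigm to the visual domain remains an open research question.

Result: Empirical results on benchmarks such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker extends the benefits of parallel reasoning to the visual domain.

Insight: The novelty lies in proposing the first parallel reasoning framework for MLLMs, whose core is parallelization through visual partitioning, combined with Pa-Attention and LPRoPE to maintain path independence and promote reasoning diversity. Viewed objectively, combining efficient parallel processing (built on the vLLM framework) with visual comprehension tasks is a promising direction.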

Abstract: Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to the visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.


[48] Agentic Spatio-Temporal Grounding via Collaborative Reasoning cs.CV | cs.AIPDF

Heng Zhao, Yew-Soon Ong, Joey Tianyi Zhou

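TL;DR: This paper proposes a new framework, the Agentic Spatio-Temporal Grounder (ASTG), for the spatio-temporal video grounding (STVG) task. It builds two specialized agents on multimodal large language models, a Spatial Reasoning Agent (SRA) and a Temporal Reasoning Agent (TRA), which follow a collaborative, autonomous propose-and-evaluate paradigm to retrieve the spatio-temporal tube matching a text query from a video without any training.

Details

Motivation: Existing STVG methods usually perform frame-by-frame spatial localization within a predicted temporal span, which leads to redundant computation, heavy supervision requirements, and limited generalization. Weakly supervised methods reduce annotation cost but remain constrained by the dataset-level train-and-fit paradigm and perform poorly. This paper targets these problems in an open-world, training-free setting.

Result: Experiments on popular benchmarks show that the method outperforms existing weakly supervised and zero-shot approaches and is comparable to some fully supervised methods.

Insight: The main novelties are: 1) a training-free, agent-collaboration framework built on multimodal large language models that decouples spatio-temporal reasoning and automates tube extraction, verification, and temporal localization; 2) a dedicated visual memory and dialogue context that significantly improve retrieval efficiency; 3) a shift from a closed paradigm that depends on large amounts of annotation to an open-world, autonomous reasoning paradigm.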

Abstract: Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants mitigate annotation costs but remain constrained by the dataset-level train-and-fit paradigm and deliver inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), built on modern Multimodal Large Language Models (MLLMs), work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluate paradigm, ASTG decouples spatio-temporal reasoning and automates the tube extraction, verification and temporal localization processes. With a dedicated visual memory and dialogue context, the retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach: it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some of the fully-supervised methods.


[49] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction cs.CV | cs.AIPDF

Emily Bejerano, Federico Tondolo, Aayan Qayyum, Xiaofan Yu, Xiaofan Jiang

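TL;DR: Sim2Radar is an end-to-end framework aimed at the data scarcity and high annotation cost of learning-based radar perception. It synthesizes millimeter-wave radar training data directly from single-view RGB images by reconstructing a material-aware 3D scene with monocular depth estimation, segmentation, and vision-language reasoning, and then simulating mmWave propagation with a physics-based ray tracer.

Details

Motivation: Millimeter-wave radar provides reliable perception in visually degraded indoor environments, but learning-based radar perception is limited by the scarcity and high cost of collecting and annotating large-scale radar datasets.

Result: In evaluation on real-world indoor scenes, pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar data improves 3D AP (IoU 0.3) by up to +3.7, with the gains driven mainly by improved spatial localization.

Insight: The novelty is using VLM-guided scene reconstruction and physics-based electromagnetic simulation to generate realistic radar data directly from RGB images, providing effective geometric priors for radar learning and significantly improving downstream perception under limited real-data supervision.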

Abstract: Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.
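
The Fresnel reflection step mentioned in the abstract can be illustrated in a few lines. This is a sketch under stated assumptions, not the Sim2Radar ray tracer: it assumes a planar interface between air and a non-magnetic material, the function names are hypothetical, and the relative permittivity used below is an illustrative value rather than an entry from the ITU-R tables.

```python
import cmath
import math


def fresnel_te(eps_r, theta_rad):
    """Reflection coefficient for TE (perpendicular) polarization.

    eps_r: relative permittivity of the material (may be complex for lossy media).
    theta_rad: angle of incidence measured from the surface normal, in radians.
    """
    cos_t = math.cos(theta_rad)
    root = cmath.sqrt(eps_r - math.sin(theta_rad) ** 2)
    return (cos_t - root) / (cos_t + root)


def fresnel_tm(eps_r, theta_rad):
    """Reflection coefficient for TM (parallel) polarization."""
    cos_t = math.cos(theta_rad)
    root = cmath.sqrt(eps_r - math.sin(theta_rad) ** 2)
    return (eps_r * cos_t - root) / (eps_r * cos_t + root)


# Illustrative lossless material (assumed value, not taken from ITU-R data).
eps_example = 4.0
for degrees in (0, 30, 60):
    theta = math.radians(degrees)
    print(degrees, abs(fresnel_te(eps_example, theta)), abs(fresnel_tm(eps_example, theta)))
```

In a ray tracer, the magnitude of these coefficients scales the power carried by each reflected ray, which is why the inferred material of each surface matters for the simulated radar return.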


[50] IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs cs.CV | cs.AIPDF

Yifan Tan, Yifu Sun, Shirui Huang, Hong Liu, Guanghua Yu

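TL;DR: This paper proposes IDPruner, a visual token pruning method for accelerating inference in multimodal large language models (MLLMs). It systematically analyzes the trade-off between token importance and semantic diversity and uses the Maximal Marginal Relevance (MMR) algorithm to reach a Pareto-optimal balance between the two, enabling efficient one-shot pruning without attention maps.

Details

Motivation: Current MLLMs face a significant computational bottleneck from processing large numbers of visual tokens, and existing pruning methods lack a principled framework for optimally integrating token importance and diversity. This paper aims to fill that gap.

Result: Experiments across multiple model architectures and multimodal benchmarks show that IDPruner achieves state-of-the-art performance and strong generalization. For example, on Qwen2.5-VL-7B-Instruct it retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% under an extreme 90% pruning ratio.

Insight: The novelty lies in the first systematic analysis of the importance-diversity trade-off in visual token pruning and in a Pareto-optimal balancing framework (IDPruner) that needs no attention maps, is compatible with FlashAttention, and prunes in one shot, offering a reusable general solution for efficient MLLM inference.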

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the Importance and Diversity Pruner (IDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.
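
To make the MMR idea concrete, here is a minimal sketch of greedy Maximal Marginal Relevance selection over tokens. It is the textbook MMR procedure, not IDPruner's implementation: the abstract does not specify how importance is scored or how the trade-off weight is set, so `importance`, `lam`, and the function names below are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mmr_select(importance, features, keep, lam=0.5):
    """Greedily pick `keep` token indices, trading importance against redundancy.

    importance: one relevance score per token (assumed to be given).
    features:   one feature vector per token.
    lam:        weight on importance; (1 - lam) weights the redundancy penalty.
    """
    selected = []
    candidates = list(range(len(importance)))
    while candidates and len(selected) < keep:
        best, best_score = None, -math.inf
        for i in candidates:
            # Redundancy is the highest similarity to any already-kept token.
            redundancy = max((cosine(features[i], features[j]) for j in selected), default=0.0)
            score = lam * importance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is different.
toy_importance = [0.9, 0.85, 0.3]
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mmr_select(toy_importance, toy_features, keep=2, lam=0.5))
print(mmr_select(toy_importance, toy_features, keep=2, lam=1.0))
```

With `lam=1.0` the selection reduces to plain top-k by importance, while smaller values favor tokens that differ from those already kept. Because the procedure only needs per-token scores and features, it does not depend on attention maps.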


[51] Diagnostic Benchmarks for Invariant Learning Dynamics: Empirical Validation of the Eidos Architecture cs.CV | cs.LGPDF

Datorien L. Anderson

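TL;DR: This paper presents the PolyShapes-Ideal (PSI) dataset, a suite of diagnostic benchmarks for topological invariance designed to separate it from the textural correlations in standard vision benchmarks. Using three diagnostic probes (polygon classification under noise, zero-shot font transfer from MNIST, and geometric collapse mapping under progressive deformation), the author shows that the Eidos architecture reaches >99% accuracy on PSI and 81.67% zero-shot transfer across 30 unseen typefaces without pre-training. These results are taken to validate the "Form-First" hypothesis: generalization in structurally constrained architectures is a property of geometric integrity, not statistical scale.

Details

Motivation: To address the dominance of textural correlations in standard vision benchmarks, the paper designs diagnostic benchmarks that isolate topological invariance (the ability to maintain structural identity across affine transformations) in order to test the generalization of structurally constrained architectures.

Result: The Eidos architecture achieves >99% accuracy on the PSI dataset and 81.67% accuracy on the zero-shot font transfer task (from MNIST) across 30 unseen typefaces, without pre-training, supporting its generalization driven by geometric integrity.

Insight: The novelty is the PSI diagnostic benchmark for isolating topological invariance and the validation of the "Form-First" hypothesis, namely that generalization stems from geometric integrity rather than large-scale statistical learning. This offers a new evaluation perspective for structurally constrained machine-learning architectures.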

Abstract: We present the PolyShapes-Ideal (PSI) dataset, a suite of diagnostic benchmarks designed to isolate topological invariance – the ability to maintain structural identity across affine transformations – from the textural correlations that dominate standard vision benchmarks. Through three diagnostic probes (polygon classification under noise, zero-shot font transfer from MNIST, and geometric collapse mapping under progressive deformation), we demonstrate that the Eidos architecture achieves >99% accuracy on PSI and 81.67% zero-shot transfer across 30 unseen typefaces without pre-training. These results validate the “Form-First” hypothesis: generalization in structurally constrained architectures is a property of geometric integrity, not statistical scale.


[52] Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge cs.CV | cs.AI | cs.ROPDF

Jesse Barkley, Abraham George, Amir Barati Farimani

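TL;DR: This paper proposes a hierarchical zero-shot framework for autonomous military edge robots that cascades lightweight object detection with compact vision-language models (VLMs), addressing scarce domain data and limited edge compute in dynamic battlefield environments.

Details

Motivation: Deploying autonomous edge robots in dynamic military environments faces two main constraints: scarce domain-specific training data and the limited compute of edge hardware.

Result: Evaluated on 55 high-fidelity synthetic videos from Battlefield 6, the framework performs well on three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). The extended agentic workflow achieves 100% correct asset deployment and a reasoning score of 9.8/10 (graded by GPT-4o), with latency under 75 seconds.

Insight: The novelty is a hierarchical zero-shot architecture that decouples perception from reasoning, together with a "Controlled Input" methodology that reveals distinct failure modes of different VLMs (e.g., Gemma3-12B versus Gemma3-4B) in visual perception and tactical logic, providing a diagnostic framework for assessing VLM suitability in safety-critical applications.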

Abstract: Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel “Controlled Input” methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.


[53] MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation cs.CVPDF

Xirui Hu, Yanbo Ding, Jiahao Wang, Tingting Shi, Yali Wang

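TL;DR: MotionWeaver is an end-to-end framework for multi-character image animation. Through unified motion representations and a holistic 4D-anchored paradigm, it addresses the poor generalization of existing methods to multi-character scenes that involve diverse humanoid forms, complex interactions, and frequent occlusions.

Details

Motivation: Existing character image animation methods are largely limited to single-character settings and struggle to generalize to multi-character scenes with diverse humanoid forms, complex interactions, and frequent occlusions.

Result: On a newly constructed 300-video multi-character benchmark, MotionWeaver achieves state-of-the-art quantitative and qualitative results and generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-character scenes.

Insight: The novelties are: 1) unified, identity-agnostic motion representations that are explicitly bound to their corresponding characters, enabling generalization across diverse humanoid forms; 2) a holistic 4D-anchored paradigm that builds a shared 4D space to fuse motion representations with video latents, reinforced by hierarchical 4D-level supervision to better handle interactions and occlusions.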

Abstract: Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.


[54] HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving cs.CV | cs.AI | cs.ROPDF

Yiru Wang, Zichong Gu, Yu Gao, Anqing Jiang, Zhigang Sun

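TL;DR: This paper proposes HiST-VLA, a hierarchical spatio-temporal vision-language-action model for end-to-end autonomous driving. It strengthens 3D spatio-temporal reasoning by integrating geometric awareness, fine-grained driving commands, and state-history prompting, and uses dynamic token sparsification and a hierarchical trajectory planner to improve computational efficiency and trajectory quality. It achieves state-of-the-art performance on the NAVSIM v2 benchmark.

Details

Motivation: Existing vision-language-action models have inherent limitations in safety-critical scenarios such as autonomous driving, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context, which limit their reliable use. This paper aims to address these challenges and build a more reliable trajectory generation model.

Result: On the NAVSIM v2 benchmark, the model achieves a state-of-the-art EPDMS of 88.6 on Navtest and an EPDMS of 50.9 on the pseudo closed-loop Navhard benchmark.

Insight: The main novelties are: 1) enhanced 3D spatio-temporal reasoning through integrating geometric awareness with fine-grained driving commands and state-history prompting; 2) a dynamic token sparsification method that fuses redundant tokens rather than filtering them, improving efficiency while preserving performance; 3) a hierarchical transformer planner that uses dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence.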

Abstract: Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, achieving an EPDMS of 88.6, and EPDMS of 50.9 on pseudo closed-loop Navhard benchmark.


[55] Zwitscherkasten – DIY Audiovisual bird monitoring cs.CVPDF

Dominik Blum, Elias Häring, Fabian Jirges, Martin Schäffer, David Schick

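TL;DR: This paper introduces Zwitscherkasten, a DIY multimodal system for bird species monitoring that uses audio and visual data on edge devices. Deep learning models for bioacoustic and image classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition runs through a fine-grained detection and classification pipeline. The results indicate that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.

Details

Motivation: To enable real-time, non-invasive bird species monitoring on resource-constrained edge devices in support of biodiversity monitoring and citizen science.

Result: Accurate bird species identification was achieved on embedded platforms, indicating that the approach is feasible; no specific benchmarks or comparisons with the state of the art are reported.

Insight: The novelty is deploying a DIY multimodal (audio and visual) system on edge devices, combining acoustic activity detection to reduce energy consumption with a fine-grained visual recognition pipeline, which yields a practical solution for scalable real-time monitoring.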

Abstract: This paper presents Zwitscherkasten, a DIY, multimodal system for bird species monitoring using audio and visual data on edge devices. Deep learning models for bioacoustic and image-based classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition is performed using fine-grained detection and classification pipelines. Results show that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.


[56] MedScope: Incentivizing “Think with Videos” for Clinical Reasoning via Coarse-to-Fine Tool Calling cs.CV | cs.AIPDF

Wenjie Li, Yujie Zhang, Haoran Sun, Xingqi He, Hongcheng Gao

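TL;DR: This paper proposes MedScope, a tool-using clinical video reasoning model that uses a coarse-to-fine evidence-seeking mechanism to iteratively locate and verify temporally localized visual evidence in long videos and ground its predictions in that evidence. To address the lack of high-quality supervision, the authors build the ClinVideoSuite dataset and optimize the model with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), reaching state-of-the-art performance in both in-domain and out-of-domain evaluations.

Details

Motivation: Current multimodal large language models typically process long clinical videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. A model that actively uses tools for fine-grained reasoning is therefore needed.

Result: On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations.

Insight: The novelties are: 1) a coarse-to-fine tool-calling mechanism that enables iterative reasoning over temporally localized evidence; 2) ClinVideoSuite, an evidence-centric, fine-grained clinical video dataset; 3) the GA-GRPO optimization method, which directly reinforces tool use through grounding-aligned rewards and evidence-weighted advantages, opening a path for medical AI agents to "think with videos".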

Abstract: Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely “think with videos” through tool-integrated reasoning. We will release our code, models, and data.


[57] Ask the Expert: Collaborative Inference for Vision Transformers with Near-Edge Accelerators cs.CV | cs.DC | cs.LGPDF

Hao Liu, Suhaib A. Fahmy

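TL;DR: This paper proposes a collaborative inference framework that deploys Vision Transformers (ViTs) jointly across an edge device and a near-edge accelerator, addressing the high computational complexity of ViTs at the edge and the high latency of cloud offloading. A lightweight generalist ViT runs on the edge device, multiple medium-sized expert ViTs run on the near-edge accelerator, and a dynamic routing mechanism based on Top-k predictions selects the most relevant expert for low-confidence samples.

Details

Motivation: To address the high computational complexity of deploying Vision Transformers (ViTs) on edge devices and the significant latency overhead of fully offloading to the cloud, with the goal of efficient, low-latency edge inference.

Result: Extensive experiments on CIFAR-100 using a real-world edge and near-edge testbed show that the proposed progressive specialist training strategy improves expert accuracy on target subsets by 4.12% and overall accuracy by 2.76% over static experts. In addition, the method reduces latency by up to 45% compared with edge execution and energy consumption by up to 46% compared with near-edge-only offloading.

Insight: The novelty is a collaborative inference framework that combines a lightweight edge generalist model with near-edge expert models, along with a confidence-based dynamic routing mechanism and a progressive specialist training strategy that improve expert specialization and overall system efficiency. Viewed objectively, this layered, dynamic model-collaboration architecture and its targeted training method offer a useful system-level optimization approach for deploying large models in resource-constrained edge environments.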

Abstract: Deploying Vision Transformers on edge devices is challenging due to their high computational complexity, while full offloading to cloud resources presents significant latency overheads. We propose a novel collaborative inference framework, which orchestrates a lightweight generalist ViT on an edge device and multiple medium-sized expert ViTs on a near-edge accelerator. A novel routing mechanism uses the edge model’s Top-$\mathit{k}$ predictions to dynamically select the most relevant expert for samples with low confidence. We further design a progressive specialist training strategy to enhance expert accuracy on dataset subsets. Extensive experiments on the CIFAR-100 dataset using a real-world edge and near-edge testbed demonstrate the superiority of our framework. Specifically, the proposed training strategy improves expert specialization accuracy by 4.12% on target subsets and enhances overall accuracy by 2.76% over static experts. Moreover, our method reduces latency by up to 45% compared to edge execution, and energy consumption by up to 46% compared to just near-edge offload.
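
The routing idea can be sketched as follows. This is one plausible reading of the abstract, not the authors' mechanism: the confidence threshold, the rule of picking the expert that covers the most Top-k probability mass, and all names below are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(edge_logits, expert_classes, k=3, threshold=0.8):
    """Decide whether the edge model answers or a near-edge expert is consulted.

    edge_logits:    class logits from the lightweight edge model.
    expert_classes: dict mapping an expert name to the set of class ids it covers
                    (a hypothetical partition of the label space).
    Returns ("edge", class_id) for confident samples, else ("expert", name).
    """
    probs = softmax(edge_logits)
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    if probs[ranked[0]] >= threshold:
        return ("edge", ranked[0])
    top_k = ranked[:k]

    def covered_mass(name):
        # Probability mass of the Top-k classes that this expert specializes in.
        return sum(probs[c] for c in top_k if c in expert_classes[name])

    return ("expert", max(expert_classes, key=covered_mass))


# Hypothetical experts over a 4-class toy problem.
experts = {"A": {0, 3}, "B": {1, 2}}
print(route([5.0, 0.0, 0.0, 0.0], experts))
print(route([1.0, 0.9, 0.8, 0.0], experts))
```

Confident samples never leave the edge device, so the offloading cost is only paid for the uncertain cases where a specialist is most likely to help.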


[58] FireRed-Image-Edit-1.0 Techinical Report cs.CV | eess.IVPDF

Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang

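TL;DR: This paper introduces FireRed-Image-Edit, a diffusion transformer for instruction-based image editing. It reaches state-of-the-art performance through systematic optimization of data construction, training methodology, and evaluation design. The training corpus is built from 1.6 billion samples that are rigorously cleaned and balanced, and the training pipeline comprises pre-training, supervised fine-tuning, and reinforcement learning. The paper also proposes several techniques to improve data efficiency, optimization stability, and controllability, and establishes REDEdit-Bench, a comprehensive benchmark covering 15 editing categories. Experiments show performance comparable or superior to open-source and proprietary systems on multiple benchmarks.

Details

Motivation: The goal is to develop a high-performance instruction-based image editing system by systematically optimizing the whole pipeline of data, training, and evaluation, addressing the shortcomings of existing methods in data quality, training efficiency, and task coverage.

Result: Extensive experiments on the newly built REDEdit-Bench (covering 15 editing categories) and on public benchmarks (ImgEdit and GEdit) show performance that is competitive with or superior to open-source and proprietary systems, reaching the state of the art.

Insight: The novelties are: 1) a large-scale, high-quality, balanced dataset construction and cleaning pipeline; 2) a multi-stage training pipeline; 3) the Multi-Condition Aware Bucket Sampler and Stochastic Instruction Alignment for better data efficiency; 4) Asymmetric Gradient Optimization for DPO, DiffusionNFT (for text editing), and a differentiable Consistency Loss for better optimization stability and controllability; 5) REDEdit-Bench, a comprehensive evaluation benchmark that includes new tasks such as beautification and low-level enhancement. These systematic optimizations across data, training strategy, and evaluation are worth drawing on.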

Abstract: We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.


[59] Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots cs.CV | cs.AI | cs.ROPDF

Lijun Zhang, Nikhil Chacko, Petter Nilsson, Ruinian Xu, Shantanu Thakar

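TL;DR: This paper proposes FOREST, a world model based on a diffusion transformer that predicts the visual state of a storage bin after a warehouse robot stows an item. The model represents bin states as item-aligned instance masks and uses sparse snapshots and the stow intent to predict the post-stow layout.

Details

Motivation: Automated warehouses execute millions of stow operations, and it is valuable to predict accurately, from the current observation and the planned behavior, how a bin will look before the stow is executed, in order to support warehouse planning decisions.

Result: In the geometric consistency evaluation, FOREST substantially improves the agreement between predicted and true layouts compared with heuristic baselines. In downstream tasks (load-quality assessment and multi-stow reasoning), replacing the real masks with model predictions causes only a small performance drop, indicating that it provides useful foresight signals.

Insight: The novelty is combining stow-intent conditioning with a diffusion transformer to make visual predictions from sparse snapshots in the form of instance masks. The approach offers a scalable world-modeling framework for robot manipulation planning based on visual prediction.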

Abstract: Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.


[60] From Prompt to Production:Automating Brand-Safe Marketing Imagery with Text-to-Image Models cs.CV | cs.AIPDF

Parmida Atighehchian, Henry Wang, Andrei Kapustin, Boris Lerner, Tiancheng Jiang

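TL;DR: This paper proposes an automated, scalable text-to-image generation pipeline for producing marketing images that meet brand-safety standards. By balancing automation with human oversight, the system introduces creative variation while preserving image quality and fidelity, achieving a 30.77% increase in marketing object fidelity and a 52.00% increase in human preference for the generated results.

Details

Motivation: To address the scaling challenges of deploying text-to-image models in production and to balance automated generation with human oversight, so that generated marketing images meet quality standards and satisfy creative requirements.

Result: Marketing object fidelity, measured using DINOV2, increases by 30.77%, and human preference for the generated results increases by 52.00%, indicating improvements in both quality and creativity.

Insight: The novelty is an end-to-end automated pipeline that turns text prompts into production-ready marketing images through a structured process, combining large-scale deployment with creative control and offering a reusable framework for industrial applications.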

Abstract: Text-to-image models have made significant strides, producing impressive results in generating images from textual descriptions. However, creating a scalable pipeline for deploying these models in production remains a challenge. Achieving the right balance between automation and human feedback is critical to maintain both scale and quality. While automation can handle large volumes, human oversight is still an essential component to ensure that the generated images meet the desired standards and are aligned with the creative vision. This paper presents a new pipeline that offers a fully automated, scalable solution for generating marketing images of commercial products using text-to-image models. The proposed system maintains the quality and fidelity of images, while also introducing sufficient creative variation to adhere to marketing guidelines. By streamlining this process, we ensure a seamless blend of efficiency and human oversight, achieving a 30.77% increase in marketing object fidelity using DINOV2 and a 52.00% increase in human preference over the generated outcome.


[61] Using Deep Learning to Generate Semantically Correct Hindi Captions cs.CV | cs.AI | cs.CLPDF

Wasim Akram Khan, Anil Kumar Vuppala

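TL;DR: This study uses deep learning models to generate semantically correct Hindi captions for images. It extracts image features with pre-trained CNNs (VGG16, ResNet50, and Inception V3), encodes text with unidirectional and bidirectional LSTMs, and adds an attention mechanism to fuse visual and linguistic information. Hindi captions are generated on the Flickr8k dataset and evaluated with BLEU scores; the attention-based bidirectional LSTM with VGG16 gives the best results, 0.59 on BLEU-1 and 0.19 on BLEU-4.

Details

Motivation: Existing image captioning research focuses mainly on English, while Hindi, described by the authors as the fourth most popular language in the world, has received little attention. The paper aims to fill this gap with multimodal architectures (local and global visual features, attention mechanisms, and pre-trained models) that generate semantically accurate Hindi captions for images.

Result: On the Flickr8k dataset, evaluated with BLEU scores, the attention-based bidirectional LSTM with VGG16 achieves the best results: a BLEU-1 score of 0.59 and a BLEU-4 score of 0.19. The paper treats BLEU-1 as an indicator of caption adequacy and BLEU-4 as an indicator of fluency.

Insight: The novelty is extending image captioning to Hindi and systematically comparing combinations of pre-trained CNNs (VGG16, ResNet50, Inception V3) with text encoding techniques (unidirectional and bidirectional LSTMs), with an attention mechanism added to improve multimodal fusion. Viewed objectively, the study provides reusable baseline models and a methodological framework for image captioning in low-resource languages.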

Abstract: Automated image captioning from image content is an appealing application that combines computer vision and natural language processing. Extensive research has been done in the field, with a major focus on English, which leaves scope for further work on other widely spoken languages. This research uses distinct models to produce image captions in Hindi, which the authors describe as the fourth most popular language in the world. Exploring multi-modal architectures, the research covers local visual features, global visual features, attention mechanisms, and pre-trained models. Hindi image descriptions were generated by applying Google Cloud Translator to the image dataset from Flickr8k. Pre-trained CNNs such as VGG16, ResNet50, and Inception V3 extract image features, while uni-directional and bi-directional techniques are used for text encoding. An additional attention layer generates a weight vector and, by multiplying by it, combines the image features from each time step into a sentence-level feature vector. Bilingual Evaluation Understudy (BLEU) scores are used to compare the outcomes, and several baseline experiments are run for comparative analysis. A BLEU-1 score is taken as an indicator of caption adequacy, whereas a BLEU-4 score is taken as an indicator of fluency. On both BLEU scores, the attention-based bidirectional LSTM with VGG16 produced the best results, 0.59 and 0.19 respectively. The experiments indicate that the approach can produce relevant, semantically accurate image captions in Hindi. The research meets its goals, and the model can guide future research.
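
The attention step described in the abstract (a weight vector that combines per-time-step features into one sentence-level vector) can be sketched generically. This is a standard softmax-weighted pooling written for illustration, not the paper's layer: how the raw scores are produced is not stated in the abstract, so they are passed in as an assumed input, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(step_features, scores):
    """Combine per-time-step feature vectors into one sentence-level vector.

    step_features: list of T feature vectors, one per time step.
    scores:        list of T unnormalized attention scores (assumed given).
    """
    weights = softmax(scores)
    dim = len(step_features[0])
    return [sum(w * f[d] for w, f in zip(weights, step_features)) for d in range(dim)]


# Toy example: three time steps with 2-dimensional features.
toy_steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention_pool(toy_steps, [0.0, 0.0, 0.0]))
print(attention_pool(toy_steps, [0.1, 0.2, 3.0]))
```

With equal scores the result is the plain average of the time steps, and as one score grows the pooled vector moves toward that step's features.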


[62] AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers cs.CV | cs.AIPDF

Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu

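TL;DR: This paper proposes AdaCorrection, an adaptive offset cache correction framework that reuses cached features efficiently during diffusion transformer (DiT) inference to speed up sampling while maintaining high generation fidelity. It estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations, without additional supervision or retraining.

Details

Motivation: Diffusion transformers reach state-of-the-art performance in image and video generation, but their iterative denoising structure makes inference expensive. Existing acceleration methods rely on static reuse schedules or coarse-grained heuristics, which often cause temporal drift and cache misalignment and significantly degrade generation quality.

Result: On image and video diffusion benchmarks, AdaCorrection provides moderate acceleration while keeping FID (Fréchet Inception Distance) close to the original, and consistently improves generation performance.

Insight: The novelty is an adaptive cache correction mechanism that estimates cache validity on the fly and blends activations, resolving the alignment problem in cache reuse and reducing computational overhead while preserving quality, with no model retraining.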

Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce AdaCorrection, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.
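
The blending idea can be illustrated with a toy sketch. Everything specific here is an assumption: the abstract does not define the spatio-temporal signals or the blending rule, so the drift measure, the exponential mapping to a validity weight, and the function names are hypothetical stand-ins rather than AdaCorrection's actual formulation.

```python
import math


def cache_validity(current, reference, temperature=1.0):
    """Map the drift between two hidden-state vectors to a weight in (0, 1].

    current:   hidden state at the present timestep.
    reference: hidden state stored when the cache entry was written.
    Returns 1.0 when nothing has changed and decays toward 0 as drift grows.
    """
    drift = sum(abs(c - r) for c, r in zip(current, reference)) / len(current)
    scale = sum(abs(r) for r in reference) / len(reference) + 1e-8
    return math.exp(-(drift / scale) / temperature)


def blend(cached, fresh, alpha):
    """Convex combination: alpha weights the cached activation."""
    return [alpha * c + (1.0 - alpha) * f for c, f in zip(cached, fresh)]


# Toy vectors (illustrative values only).
stored_input = [1.0, 2.0, 3.0]
current_input = [1.1, 2.0, 2.9]
alpha = cache_validity(current_input, stored_input)
print(alpha, blend([0.5, 0.5, 0.5], [0.7, 0.3, 0.5], alpha))
```

A static reuse schedule corresponds to fixing `alpha` at 1 for a preset number of steps; making it depend on measured drift is what lets the correction react when the cache goes stale.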


[63] An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation cs.CV | cs.AI | cs.CLPDF

Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang

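TL;DR: This paper proposes an online reference-free evaluation framework for flowchart image-to-code generation. The framework uses OCR to extract text from the image and compute content coverage (Recall_OCR), and uses visual entailment to detect hallucinated elements in the generated code (Precision_VE); their harmonic mean, F1_OCR-VE, serves as a unified output quality score. Validation on the FlowVQA dataset shows strong agreement with reference-based ground-truth metrics, making the framework a practical alternative for continuous quality monitoring in production.

Details

Motivation: In document processing pipelines, vision-language models (VLMs) are used to convert flowchart images into structured code (such as Mermaid). In production, however, the inputs usually have no ground-truth code, so output quality is hard to assess. An online evaluation method that needs no reference is therefore required to monitor generation quality.

Result: Validation on the FlowVQA dataset shows that the framework's Recall_OCR, Precision_VE, and F1_OCR-VE metrics reach average Pearson correlations of 0.97, 0.91, and 0.94, respectively, with the ground-truth metrics, supporting its reliability as a reference-free evaluation method.

Insight: The novelty is a reference-free evaluation framework that combines OCR text extraction (for content coverage) with visual entailment (for hallucination detection), enabling online quality monitoring that relies only on the input image and the generated output. It offers a practical evaluation alternative for generation tasks that lack ground-truth labels in production.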

Abstract: Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: $\text{Recall}_{\text{OCR}}$, which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and $\text{Precision}_{\text{VE}}$, which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, $\text{F1}_{\text{OCR-VE}}$, provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson’s $r = 0.97$, $0.91$, and $0.94$ for Recall, Precision, and F1, respectively), confirming the framework’s reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.
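
How the three scores fit together can be shown with a small sketch. Only the harmonic-mean combination is stated in the abstract; the case-insensitive substring matching for recall, the boolean entailment flags for precision, and all function names are simplifying assumptions, and the OCR and visual entailment models themselves are stubbed out as plain inputs.

```python
def recall_ocr(ocr_texts, generated_code):
    """Fraction of OCR-extracted text snippets that appear in the generated code.

    Matching here is a case-insensitive substring test (an assumed simplification).
    """
    if not ocr_texts:
        return 0.0
    code = generated_code.lower()
    found = sum(1 for text in ocr_texts if text.lower() in code)
    return found / len(ocr_texts)


def precision_ve(entailed_flags):
    """Fraction of generated elements judged as supported by the image.

    entailed_flags: one boolean per generated element, as a visual entailment
    model might return (the model itself is outside this sketch).
    """
    if not entailed_flags:
        return 0.0
    return sum(1 for flag in entailed_flags if flag) / len(entailed_flags)


def f1_ocr_ve(recall, precision):
    """Harmonic mean of the two scores; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Hypothetical inputs for one flowchart (not from FlowVQA).
ocr_output = ["Start", "Check input", "End"]
mermaid = "flowchart TD\n    A[Start] --> B[Check input]"
flags = [True, True, False, True]
r = recall_ocr(ocr_output, mermaid)
p = precision_ve(flags)
print(r, p, f1_ocr_ve(r, p))
```

The harmonic mean penalizes imbalance, so an output that copies every label but hallucinates many extra nodes (high recall, low precision) still receives a low combined score.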


[64] LAF-YOLOv10 with Partial Convolution Backbone, Attention-Guided Feature Pyramid, Auxiliary P2 Head, and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery cs.CV | cs.LG

Sohail Ali Farooqui, Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam

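TL;DR: This paper proposes LAF-YOLOv10, a small-object detector for drone aerial imagery built on YOLOv10n. It integrates four complementary techniques: a partial-convolution PC-C2f module that reduces redundant backbone computation; an attention-guided feature pyramid (AG-FPN) that improves multi-scale feature fusion; an auxiliary P2 detection head that improves localization of objects spanning only a few pixels; and the Wise-IoU v3 loss, which stabilizes bounding-box regression under annotation noise. The method improves performance on VisDrone and UAVDT while maintaining real-time inference on an embedded device.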

Details

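Motivation: To address challenges specific to drone aerial object detection: targets spanning very few pixels, cluttered backgrounds, heavy occlusion, and strict onboard compute budgets.

Result: The model reaches 35.1±0.3% mAP@0.5 on VisDrone-DET2019, 3.3 percentage points above the YOLOv10n baseline; cross-dataset evaluation on UAVDT yields 35.8±0.4% mAP@0.5. On an NVIDIA Jetson Orin Nano it runs at 24.3 FPS in FP16, demonstrating feasibility for embedded deployment.

Insight: The main contribution is the joint integration of four existing techniques (PC-C2f, AG-FPN, an auxiliary P2 head, Wise-IoU v3) into a single YOLOv10 framework, where each targets a non-overlapping bottleneck (compute efficiency, feature fusion, spatial resolution, regression stability), substantially improving small-object detection under parameter and speed constraints.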

Abstract: Unmanned aerial vehicles serve as primary sensing platforms for surveillance, traffic monitoring, and disaster response, making aerial object detection a central problem in applied computer vision. Current detectors struggle with UAV-specific challenges: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and strict onboard computational budgets. This study introduces LAF-YOLOv10, built on YOLOv10n, integrating four complementary techniques to improve small-object detection in drone imagery. A Partial Convolution C2f (PC-C2f) module restricts spatial convolution to one quarter of backbone channels, reducing redundant computation while preserving discriminative capacity. An Attention-Guided Feature Pyramid Network (AG-FPN) inserts Squeeze-and-Excitation channel gates before multi-scale fusion and replaces nearest-neighbor upsampling with DySample for content-aware interpolation. An auxiliary P2 detection head at 160$\times$160 resolution extends localization to objects below 8$\times$8 pixels, while the P5 head is removed to redistribute parameters. Wise-IoU v3 replaces CIoU for bounding box regression, attenuating gradients from noisy annotations in crowded aerial scenes. The four modules address non-overlapping bottlenecks: PC-C2f compresses backbone computation, AG-FPN refines cross-scale fusion, the P2 head recovers spatial resolution, and Wise-IoU stabilizes regression under label noise. No individual component is novel; the contribution is the joint integration within a single YOLOv10 framework. Across three training runs (seeds 42, 123, 256), LAF-YOLOv10 achieves 35.1$\pm$0.3% mAP@0.5 on VisDrone-DET2019 with 2.3M parameters, exceeding YOLOv10n by 3.3 points. Cross-dataset evaluation on UAVDT yields 35.8$\pm$0.4% mAP@0.5. Benchmarks on NVIDIA Jetson Orin Nano confirm 24.3 FPS at FP16, demonstrating viability for embedded UAV deployment.


[65] Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning cs.CV

Ha-Hieu Pham, Hai-Dang Nguyen, Thanh-Huy Nguyen, Min Xu, Ulas Bagci

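TL;DR: This paper targets supervision scarcity in chest X-ray (CXR) classification caused by long-tailed multi-label distributions and missing annotations for rare or unseen diseases. On the CXR-LT 2026 challenge benchmark, it applies an imbalance-aware multi-label learning strategy to the long-tailed multi-label classification task and a prediction method requiring no OOD-class supervision to the zero-shot out-of-distribution (OOD) recognition task, ranking first on the public leaderboard of the development phase.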

Details

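Motivation: To address two key supervision-scarcity problems in clinical CXR classification: extremely long-tailed multi-label disease distributions, and missing annotations for rare or previously unseen findings.

Result: On the PadChest-based CXR-LT 2026 benchmark, evaluated with macro-averaged mean Average Precision (mAP), the method performs strongly on both tasks and ranks first on the public leaderboard of the development phase.

Insight: Designing task-specific solutions for distinct supervision-scarcity regimes (long-tailed vs. zero-shot) is effective; the imbalance-aware multi-label strategy balances tail-class recognition against performance on frequent classes; and the proposed zero-shot OOD method uses no supervised labels or examples from OOD classes during training, showing potential for handling unseen categories.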

Abstract: Chest X-ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.


[66] Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones cs.CV | cs.RO

Sebastian-Ion Nae, Mihai-Eugen Barbu, Sebastian Mocanu, Marius Leordeanu

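TL;DR: To let indoor drones learn new object classes in real time while limiting catastrophic forgetting, this paper introduces a temporally coherent indoor drone video dataset of 14,400 frames and, using a YOLOv11-nano detector, evaluates three replay-based continual learning strategies (ER, MIR, FAR) under tight memory budgets (5-10% replay). FAR achieves an average accuracy (mAP50-95) of 82.96% at 5% replay, outperforming the other methods and supporting the effectiveness of replay-based continual learning on edge aerial systems.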

Details

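Motivation: To address catastrophic forgetting when autonomous agents such as indoor drones learn new object classes in real time, to advance the use of class-incremental learning (CIL), and to fill the gap left by existing drone datasets, which focus mainly on outdoor scenes and lack temporally coherent indoor video.

Result: Benchmarks are run on the proposed indoor drone dataset with YOLOv11-nano as a resource-efficient detector. Under tight memory budgets (5-10% replay), FAR performs best, reaching an average accuracy (ACC, mAP50-95) of 82.96% at 5% replay. Grad-CAM analysis shows attention shifting across classes in mixed scenes, which is associated with reduced localization quality for drones.

Insight: Main contributions: 1) a temporally coherent indoor drone video dataset annotated through a semi-automatic workflow (98.6% first-pass labeling agreement); 2) a systematic evaluation of replay-based CIL strategies on edge drone platforms under tight memory constraints, showing FAR's advantage at low replay budgets; 3) Grad-CAM-based insight into how model attention relates to localization performance, which helps explain performance changes during continual learning.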

Abstract: Autonomous agents such as indoor drones must learn new object classes in real-time while limiting catastrophic forgetting, motivating Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer limited temporally coherent indoor videos. We introduce an indoor dataset of $14,400$ frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a $98.6%$ first-pass labeling agreement before final manual verification. Using this dataset, we benchmark 3 replay-based CIL strategies: Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), using YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ($5-10%$ replay), FAR performs better than the rest, achieving an average accuracy (ACC, $mAP_{50-95}$ across increments) of $82.96%$ with $5%$ replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, which is associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be effectively applied to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. Project page: https://spacetime-vision-robotics-laboratory.github.io/learning-on-the-fly-cl
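
To make the replay-budget idea concrete, here is a minimal sketch of a generic Experience Replay buffer filled by reservoir sampling. It is not the paper's ER, MIR, or FAR implementation; the class name, the `replay_capacity` helper, and the reading of "5% replay" as a fraction of previously seen frames are all assumptions made for illustration.

```python
import random


def replay_capacity(num_old_frames: int, replay_fraction: float) -> int:
    """Buffer size for a given replay budget (assumed: fraction of old frames)."""
    return max(1, int(num_old_frames * replay_fraction))


class ReplayBuffer:
    """Fixed-size memory; reservoir sampling keeps a uniform sample of the stream."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Replace a stored item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k: int) -> list:
        """Draw up to k stored samples to mix into the next training batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


if __name__ == "__main__":
    capacity = replay_capacity(14_400, 0.05)
    buffer = ReplayBuffer(capacity)
    for frame_id in range(14_400):
        buffer.add(frame_id)
    print(capacity, len(buffer.items))
```

Strategies such as MIR and FAR differ mainly in which stored samples they retrieve or retain, so they would replace the uniform `sample` step rather than the budget arithmetic.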


[67] GLIMPSE: Real-Time Text Recognition and Contextual Understanding for VQA in Wearables cs.CV | cs.HC

Akhil Ramachandran, Ankit Arun, Ashish Shenoy, Abhay Harpale, Srihari Jayakumar

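TL;DR: The paper presents GLIMPSE, a system for real-time text recognition and contextual understanding for visual question answering (VQA) on wearable devices. It uses a hybrid architecture that performs selective high-resolution OCR on-device for text recognition while streaming low-resolution video for visual context, which avoids the battery drain and thermal throttling caused by high-resolution video streaming and substantially lowers power consumption while preserving text understanding quality.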

Details

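Motivation: To resolve the core tension in deploying text-based visual question answering (Text VQA) on wearables: text recognition needs high-resolution video, but streaming high-quality video drains the battery and causes thermal throttling, and existing models struggle to maintain coherent temporal context when processing text across multiple frames of a real-time stream.

Result: On a benchmark of text-based VQA samples spanning five task categories, the system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.

Insight: The novelty is exploiting the asymmetric resolution requirements of text recognition and visual reasoning (OCR needs fine detail, while scene understanding tolerates coarse features) in a hybrid architecture that combines selective high-resolution OCR with low-resolution video streaming, balancing power against performance and offering a workable approach to real-time Text VQA on wearables.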

Abstract: Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.


[68] Benchmarking Video Foundation Models for Remote Parkinson’s Disease Screening cs.CV

Md Saiful Islam, Ekram Hossain, Abdelrahman Abdelkader, Tariq Adnan, Fazla Rabbi Mashrur

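TL;DR: This study systematically evaluates seven state-of-the-art video foundation models (including VideoPrism, V-JEPA, ViViT, and VideoMAE) for remote video-based Parkinson's disease (PD) screening. Using a new dataset of 32,847 videos from 1,888 participants (727 with PD) across 16 standardized clinical tasks, it finds that performance differs markedly by model and task: VideoPrism excels on visual speech kinematics and facial expressivity, while V-JEPA is stronger on upper-limb motor tasks. Experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%, but sensitivity is low (43.2-57.3%), indicating a need for task-aware calibration and multi-task, multimodal integration.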

Details

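Motivation: Remote video assessment offers a scalable route to PD screening. Traditional approaches rely on handcrafted features that mimic clinical scales, whereas emerging video foundation models (VFMs) enable representation learning without task-specific customization. The comparative effectiveness of different VFM architectures across clinical tasks remains unclear, so a systematic study is needed to determine their robustness for clinical screening.

Result: On a large video dataset covering 16 standardized clinical tasks, frozen embeddings evaluated with a linear classification head yield AUCs of 76.4% to 85.3% and accuracies of 71.5% to 80.6%. Specificity reaches up to 90.3%, suggesting strong potential for ruling out healthy individuals, but sensitivity is low (43.2-57.3%). Different models stand out on specific tasks: VideoPrism performs best on visual speech kinematics (no audio) and facial expressivity, V-JEPA is stronger on upper-limb motor tasks, and TimeSformer is competitive on rhythmic tasks such as finger tapping.

Insight: The novelty lies in the first large-scale systematic benchmark of multiple video foundation models across diverse clinical tasks for remote PD screening. It shows that model performance depends heavily on the specific task (task saliency) and provides a roadmap for choosing suitable tasks and architectures. The core insight is that in medical AI applications there is no single "best" video foundation model; architectures should be matched to the characteristics of the clinical task (e.g., motor, speech, facial expression), which points toward future calibration based on multi-task, multimodal integration.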

Abstract: Remote, video-based assessments offer a scalable pathway for Parkinson’s disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs – including VideoPrism, V-JEPA, ViViT, and VideoMAE – to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5


[69] SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning cs.CV | cs.LG

Jintao Zhang, Kai Jiang, Chendong Xiang, Weiqi Feng, Yuezhou Hu

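TL;DR: This paper proposes SpargeAttention2, a trainable sparse attention method that combines a hybrid Top-k and Top-p masking rule, an efficient implementation, and a distillation-inspired fine-tuning objective, reaching 95% attention sparsity and a 16.2x attention speedup on video diffusion models while maintaining generation quality.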

Details

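Motivation: The work addresses limitations of training-free sparse attention by examining when the Top-k and Top-p masking rules fail, why trainable sparse attention can reach higher sparsity, and why fine-tuning sparse attention with the diffusion loss alone falls short, with the goal of a trainable sparse attention method that maintains generation quality at high sparsity.

Result: Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.

Insight: Novel elements include a hybrid masking rule combining Top-k and Top-p for more robust masking at high sparsity; an efficient sparse attention implementation; and a distillation-inspired fine-tuning objective that better preserves generation quality during fine-tuning. Through trainability and the hybrid strategy, the method balances sparsity against quality and offers a new route to accelerating diffusion models.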

Abstract: Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
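
The two masking rules can be sketched on a single row of attention probabilities. With a fixed k, a near-uniform row retains only about k/n of its probability mass; with a fixed p, a sharply peaked row may retain a single entry. The combination below keeps the union of both selections, which is only one plausible way to hybridize them; the abstract does not state the rule SpargeAttention2 actually uses, and all function names here are hypothetical.

```python
def _descending_order(probs: list[float]) -> list[int]:
    """Indices sorted by probability, largest first (ties keep index order)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def topk_mask(probs: list[float], k: int) -> list[bool]:
    """Keep the k largest entries."""
    keep = set(_descending_order(probs)[:k])
    return [i in keep for i in range(len(probs))]


def topp_mask(probs: list[float], p: float) -> list[bool]:
    """Keep the smallest prefix of sorted entries whose mass reaches p."""
    keep, cumulative = set(), 0.0
    for i in _descending_order(probs):
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [i in keep for i in range(len(probs))]


def hybrid_mask(probs: list[float], k: int, p: float) -> list[bool]:
    """Assumed hybrid rule: union of the Top-k and Top-p selections."""
    return [a or b for a, b in zip(topk_mask(probs, k), topp_mask(probs, p))]


if __name__ == "__main__":
    peaked = [0.9, 0.05, 0.03, 0.02]   # Top-p alone keeps one entry
    flat = [0.25, 0.25, 0.25, 0.25]    # Top-k alone keeps little mass
    print(hybrid_mask(peaked, k=2, p=0.8))
    print(hybrid_mask(flat, k=1, p=0.7))
```

In practice sparse attention kernels apply such rules to blocks of the attention map rather than to individual entries, which this per-row sketch does not model.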


[70] Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks cs.CV | cs.AI

Guanfeng Tang, Hongbo Zhao, Ziwei Long, Jiayao Li, Bohong Xiao

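TL;DR: This paper proposes TwInS, a two-stream interactive joint learning framework inspired by the human visual system that performs scene parsing and geometric vision tasks simultaneously. Contextual features from the scene parsing stream guide the iterative refinement of the geometric vision stream, and a cross-task adapter projects geometric features into the contextual feature space for selective fusion, so the two streams reinforce each other. TwInS also uses a semi-supervised training strategy that exploits large-scale multi-view data to self-evolve without human-annotated correspondences.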

Details

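Motivation: Inspired by the parallel yet interactive contextual and spatial processing streams of the human visual system, the work tackles joint learning of scene parsing and geometric vision tasks (e.g., depth estimation, optical flow), aiming to improve each through cross-task interaction and to reduce dependence on costly human-annotated data.

Result: Extensive experiments on three public datasets validate the effectiveness of TwInS's core components and show that it outperforms existing state-of-the-art (SOTA) methods.

Insight: Novel elements include: 1) a unified, bio-inspired two-stream interactive architecture in which scene parsing and geometric vision tasks reinforce each other; 2) a novel cross-task adapter that uses cross-view geometric cues to selectively fuse heterogeneous features; 3) a tailored semi-supervised training strategy that exploits large-scale multi-view data to self-evolve without annotated correspondences, lowering annotation cost.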

Abstract: Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS’s core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.


[71] AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting cs.CV

Jiacheng Zhang, Feng Liu, Chao Du, Tianyu Pang

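TL;DR: This paper proposes AdaVBoost, a framework that introduces Visual Grounding Entropy (VGE) to estimate hallucination risk and adaptively adjusts the strength of visual attention boosting at each generation step to mitigate hallucinations in large vision-language models (LVLMs).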

Details

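Motivation: Existing visual attention boosting methods use a predefined scaling factor, which creates a trade-off: it can be too weak to resolve hallucinations at some steps and too strong at others, inducing new ones. A scheme that adapts the boosting strength is therefore needed.

Result: Extensive experiments across multiple LVLMs and hallucination benchmarks show that AdaVBoost significantly outperforms baseline methods.

Insight: The novelty lies in proposing Visual Grounding Entropy (VGE) as a hallucination-risk estimate and in performing risk-based, token-level adaptive visual attention boosting, removing the limitation of a fixed scaling factor.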

Abstract: Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.


[72] Towards Sparse Video Understanding and Reasoning cs.CV | cs.LG

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang

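TL;DR: This paper presents ReViS, a multi-round agent for video question answering that performs sparse video understanding and reasoning by selecting a small number of informative frames, maintaining a summary state across rounds, and stopping early when confident. It supports proprietary vision-language models in a plug-and-play setting and introduces EAGER, an annotation-free reward, for reinforcement fine-tuning of open-source models.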

Details

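Motivation: To address the computational redundancy of uniform frame sampling in video question answering by using sparse frame selection and dynamic reasoning to cut compute cost and improve efficiency.

Result: Across multiple VQA benchmarks, ReViS improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.

Insight: Novel elements include a dynamic selection mechanism based on informative frames, cross-round summary-state maintenance, and the annotation-free EAGER reward (combining confidence gain, summary sufficiency, and correct early stopping); these ideas may transfer to efficient video understanding and reinforcement learning policy design.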

Abstract: We present ReViS (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViS selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViS improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
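
The confidence-gain term can be illustrated with a small sketch for a multiple-choice question. It assumes the gap is the difference between the log-probability of the correct option and that of the strongest alternative, and that the reward is the unclipped change in this gap after new frames are added; the exact formulation, any clipping, and the weighting against the other two EAGER terms are not given in the abstract. Function names are hypothetical.

```python
import math


def log_gap(option_probs: dict[str, float], correct: str) -> float:
    """log p(correct) minus log p(strongest alternative); assumed gap definition."""
    strongest_alternative = max(
        prob for option, prob in option_probs.items() if option != correct
    )
    return math.log(option_probs[correct]) - math.log(strongest_alternative)


def confidence_gain(before: dict[str, float], after: dict[str, float], correct: str) -> float:
    """Change in the gap after new frames are added (positive means more confident)."""
    return log_gap(after, correct) - log_gap(before, correct)


if __name__ == "__main__":
    before = {"A": 0.4, "B": 0.4, "C": 0.2}  # model is undecided between A and B
    after = {"A": 0.7, "B": 0.2, "C": 0.1}   # new frames favor the correct option A
    print(round(confidence_gain(before, after, correct="A"), 4))
```

A negative value would indicate that the newly selected frames pushed the model toward a wrong option, so the same quantity can also penalize uninformative frame choices.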


[73] A generalizable foundation model for intraoperative understanding across surgical procedures cs.CV

Kanggil Park, Yongjun Jeon, Soyoung Lim, Seonmin Park, Jongmin Shin

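TL;DR: This paper presents ZEN, a general-purpose foundation model for intraoperative surgical video understanding, trained with a self-supervised multi-teacher distillation framework on more than 4 million frames from over 21 procedures. The model shows strong cross-procedure generalization across 20 downstream tasks and outperforms existing surgical foundation models in multiple settings.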

Details

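Motivation: In minimally invasive surgery, clinical decisions depend heavily on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures; existing AI models are typically designed for narrowly defined tasks and do not generalize across procedures or institutions.

Result: Across 20 downstream tasks, under full fine-tuning, frozen-backbone, few-shot, and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and shows robust cross-procedure generalization.

Insight: The novelty lies in building a general-purpose foundation model with a self-supervised multi-teacher distillation framework and training it on a large, diverse dataset, yielding unified representations across procedures that support intraoperative assistance and surgical training assessment.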

Abstract: In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.


[74] DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation cs.CV

Haoyu Zhao, Yuang Zhang, Junqi Cheng, Jiaxi Gu, Zenghui Lu

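TL;DR: This paper proposes a system-level framework called DCDM (Divide-and-Conquer Diffusion Model) to address key challenges in semantic, geometric, and identity consistency in video generation. It decomposes consistency modeling into three dedicated components: a large language model to ensure intra-clip world-knowledge consistency, a temporal camera representation in noise space to control inter-clip camera-motion consistency, and a holistic scene generation paradigm to ensure inter-shot element consistency.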

Details

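Motivation: Existing video generation models achieve high visual fidelity but fall short on semantic, geometric, and identity consistency. DCDM aims to systematically address three key consistency challenges: intra-clip knowledge, inter-clip camera motion, and inter-shot elements.

Result: The framework is validated on the test set of the CVM Competition at AAAI'26, and the results indicate that the proposed strategies address these challenges effectively.

Insight: The novelty lies in decomposing the complex video-consistency task into three independently optimizable subproblems, each with a targeted solution (LLM parsing into structured semantics, a noise-space camera representation, windowed cross-attention with sparse inter-shot self-attention), while sharing a unified video generation backbone to keep the system efficient.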

Abstract: Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI’26, and the results demonstrate that the proposed strategies effectively address these challenges.


[75] KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination cs.CV | cs.AI | cs.CL

Byungjin Choi, Seongsu Bae, Sunjun Kweon, Edward Choi

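TL;DR: This paper introduces KorMedMCQA-V, a benchmark dataset for evaluating vision-language models on multimodal multiple-choice questions in the style of the Korean Medical Licensing Examination. It contains 1,534 questions with 2,043 associated medical images, and about 30% of the questions require integrating evidence across images. A zero-shot evaluation of more than 50 proprietary and open-source VLMs finds that the best proprietary model reaches 96.9% accuracy, while the best Korean-specialized model reaches only 43.2%.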

Details

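Motivation: To address the lack of benchmarks for evaluating VLMs on Korean medical multimodal reasoning, in particular datasets that span multiple medical imaging modalities and require cross-image integration.

Result: Under a unified zero-shot protocol, the best proprietary model (Gemini-3.0-Pro) reaches 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. All models degrade on multi-image questions, and performance varies notably across imaging modalities.

Insight: The novelty lies in building the first Korean medical multimodal multiple-choice benchmark, covering clinical imaging such as X-ray, CT, and ECG. Key findings: reasoning-oriented model variants gain up to 20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, and cross-image integration remains a common challenge for VLMs.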

Abstract: We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories-spanning general-purpose, medical-specialized, and Korean-specialized families-under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, all models degrade on multi-image questions, and performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: https://huggingface.co/datasets/seongsubae/KorMedMCQA-V.


[76] Optimizing Point-of-Care Ultrasound Video Acquisition for Probabilistic Multi-Task Heart Failure Detection cs.CV

Armin Saadat, Nima Hashemi, Bahar Khodabakhshian, Michael Y. Tsang, Christina Luong

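TL;DR: This paper proposes a reinforcement-learning-based personalized data acquisition strategy for optimizing point-of-care ultrasound (POCUS) video acquisition. Given a partially observed multi-view study, an RL agent dynamically selects the next view to acquire or terminates acquisition in support of heart failure (HF) assessment. After termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost.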

Details

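Motivation: To acquire POCUS video efficiently enough to support clinical decisions under tight limits on bedside time and operator effort, balancing diagnostic accuracy against acquisition cost.

Result: On a dataset of 12,180 patient-level studies, evaluated on 1,820 test studies, the method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation and demonstrating robust multi-task performance under acquisition budgets.

Insight: The novelty lies in modeling POCUS video acquisition as a sequential decision problem, using reinforcement learning to plan personalized, cost-aware acquisition pathways, and combining this with a multi-view transformer for probabilistic multi-task inference whose uncertainty outputs support clinical trade-offs. The framework extends to other cardiac endpoints and shows promise for clinical integration.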

Abstract: Purpose: Echocardiography with point-of-care ultrasound (POCUS) must support clinical decision-making under tight bedside time and operator-effort constraints. We introduce a personalized data acquisition strategy in which an RL agent, given a partially observed multi-view study, selects the next view to acquire or terminates acquisition to support heart-failure (HF) assessment. Upon termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost. Methods: We model POCUS as a sequential acquisition problem: at each step, a video selector (RL agent) chooses the next view to acquire or terminates acquisition. Upon termination, a shared multi-view transformer performs multi-task inference with two heads, ordinal AS classification, and LVEF regression, and outputs Gaussian predictive distributions yielding ordinal probabilities over AS classes and EF thresholds. These probabilities drive a reward that balances expected diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Results: The dataset comprises 12,180 patient-level studies, split into training/validation/test sets (75/15/15). On the 1,820 test studies, our method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation, demonstrating robust multi-task performance under acquisition budgets. Conclusion: Patient-tailored, cost-aware acquisition can streamline POCUS workflows while preserving decision quality, producing interpretable scan pathways suited to bedside use. The framework is extensible to additional cardiac endpoints and merits prospective evaluation for clinical integration.
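
The step from a Gaussian predictive distribution to threshold and ordinal probabilities can be sketched as follows. The function names and the LVEF cutpoints of 40 and 50 are hypothetical choices for illustration; the abstract does not list the thresholds or class boundaries the authors use.

```python
import math


def normal_cdf(x: float, mean: float, std: float) -> float:
    """P(X <= x) for X ~ Normal(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def prob_below(mean: float, std: float, thresholds: list[float]) -> dict[float, float]:
    """Probability that the predicted quantity falls below each threshold."""
    return {t: normal_cdf(t, mean, std) for t in thresholds}


def ordinal_probs(mean: float, std: float, cutpoints: list[float]) -> list[float]:
    """Probability mass in each bin defined by sorted cutpoints (len + 1 bins)."""
    cdfs = [0.0] + [normal_cdf(c, mean, std) for c in cutpoints] + [1.0]
    return [upper - lower for lower, upper in zip(cdfs, cdfs[1:])]


if __name__ == "__main__":
    # Hypothetical prediction: LVEF mean 45 with standard deviation 5.
    print(prob_below(45.0, 5.0, [40.0, 50.0]))
    print(ordinal_probs(45.0, 5.0, [40.0, 50.0]))
```

A wider predicted standard deviation spreads mass across neighboring bins, which is the kind of uncertainty signal a cost-aware policy could use when deciding whether another view is worth acquiring.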


[77] LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases cs.CV | cs.AI

Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach

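TL;DR: This paper introduces LeafNet, a large-scale multimodal image-text dataset of plant diseases, and LeafBench, a visual question answering benchmark for systematically evaluating how well vision-language models understand plant pathology. The dataset contains 186,000 leaf images covering 97 disease classes and yields 13,950 question-answer pairs spanning six key agricultural tasks. Benchmarking 12 SOTA vision-language models reveals substantial differences in their disease-understanding capabilities and demonstrates the advantage of multimodal architectures in diagnostic precision.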

Details

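Motivation: To address the limited use of vision-language models in agriculture, particularly plant pathology, which stems mainly from the lack of large-scale, comprehensive multimodal image-text datasets and benchmarks.

Result: Twelve SOTA vision-language models were tested on LeafBench: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification stays below 65%. Fine-tuned vision-language models outperform traditional vision-only models, supporting the advantage of multimodal architectures in diagnostic precision.

Insight: The novelty lies in building the first large-scale multimodal plant disease dataset and benchmark, systematically assessing the potential of vision-language models in agriculture, and exposing current models' weaknesses on fine-grained identification, which provides a rigorous evaluation framework for AI-assisted plant disease diagnosis.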

Abstract: Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image–text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy–diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.


[78] EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation cs.CV

Rang Meng, Weipeng Wu, Yingjie Yin, Yuming Li, Chenguang Ma

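TL;DR: This paper proposes EchoTorrent, a new framework aimed at fast, stable, streaming multimodal video generation. Through a fourfold design (multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement), it addresses the high latency, poor temporal stability, and multimodal degradation under streaming inference that hinder real-time deployment of existing models.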

Details

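Motivation: Current multimodal video generation models achieve high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference aggravates multimodal degradation such as spatial blurring, temporal drift, and lip desynchronization, leaving an unresolved trade-off between efficiency and performance.

Result: Extensive experiments and analysis show that EchoTorrent achieves markedly improved temporal consistency, identity preservation, and audio-lip synchronization with only a few autoregressive generation passes, delivering superior streaming generation on the relevant benchmarks.

Insight: Novel elements include: 1) a multi-teacher training strategy that transfers knowledge from domain experts to improve the student model; 2) Adaptive CFG Calibration (ACC-DMD), which uses a phased spatiotemporal schedule to eliminate redundant CFG computation and enable single-pass inference per step; 3) a Hybrid Long Tail Forcing architecture that focuses alignment on tail frames during long-horizon self-rollout training to mitigate streaming degradation; 4) pixel-domain refinement of the VAE decoder, which recovers high-frequency detail and sidesteps latent-space ambiguity. Together these designs address the efficiency-quality balance in streaming video generation.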

Abstract: Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigates spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner through pixel-domain optimization of the VAE decoder to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially extended temporal consistency, identity preservation, and audio-lip synchronization.


[79] An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment cs.CV | cs.AI [PDF]

Maimoona Jafar, Syed Imran Ali, Ahsan Saadat, Muhammad Bilal, Shah Khalid

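TL;DR: This paper proposes an ensemble learning approach to waste segmentation in cluttered environments, aimed at improving raw material recovery in recycling. It combines two high-performing segmentation models, U-Net and FPN, through weighted averaging, which improves segmentation accuracy on a dataset that mimics real-world waste scenes.

Details

Motivation: Waste sorting is a key step in recycling and requires accurate segmentation masks to guide robots in localizing and picking objects from conveyor belts. Real-world waste environments are complex: items are deformed, follow no specific pattern, and overlap one another, which makes segmentation very challenging.

Result: The proposed ensemble model, EL-4, achieves an IoU of 0.8306 on the waste segmentation dataset, compared with 0.8065 for U-Net, and reduces Dice loss to 0.09019 from FPN's 0.1183.

Insight: The innovation is using ensemble learning to combine the strengths of U-Net (capturing fine details and boundaries) and FPN (handling scale variation and context), with weighted averaging producing more precise predicted masks. The approach may transfer to other visual segmentation tasks in complex environments.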

Abstract: Environmental pollution is a critical global issue, with recycling emerging as one of the most viable solutions. This study focuses on waste segregation, a crucial step in recycling processes to obtain raw material. Recent advancements in computer vision have significantly contributed to waste classification and recognition. In waste segregation, segmentation masks are essential for robots to accurately localize and pick objects from conveyor belts. The complexity of real-world waste environments, characterized by deformed items without specific patterns and overlapping objects, further complicates waste segmentation tasks. This paper proposes an Ensemble Learning approach to improve segmentation accuracy by combining high performing segmentation models, U-Net and FPN, using a weighted average method. U-Net excels in capturing fine details and boundaries in segmentation tasks, while FPN effectively handles scale variation and context in complex environments, and their combined masks result in more precise predictions. The dataset used closely mimics real-life waste scenarios, and preprocessing techniques were applied to enhance feature learning for deep learning segmentation models. The ensemble model, referred to as EL-4, achieved an IoU value of 0.8306, an improvement over U-Net’s 0.8065, and reduced Dice loss to 0.09019 from FPN’s 0.1183. This study could contribute to the efficiency of waste sorting at Material Recovery Facility, facilitating better raw material acquisition for recycling with minimal human intervention and enhancing the overall throughput.
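The weighted-average combination and the two reported metrics can be sketched as follows. This is a minimal illustration on a 2x2 toy mask, assuming per-pixel foreground probabilities from each model. The weight `w_unet`, the 0.5 threshold, and the hard-mask Dice loss are assumptions, since the abstract does not state EL-4's weights or whether its Dice loss is computed on soft probabilities.

```python
from typing import List

ProbMap = List[List[float]]
Mask = List[List[int]]


def weighted_average(p_unet: ProbMap, p_fpn: ProbMap, w_unet: float = 0.5) -> ProbMap:
    """Blend two per-pixel foreground probability maps."""
    w_fpn = 1.0 - w_unet
    return [[w_unet * a + w_fpn * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(p_unet, p_fpn)]


def binarize(prob: ProbMap, threshold: float = 0.5) -> Mask:
    """Turn probabilities into a hard mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob]


def _overlap(pred: Mask, target: Mask):
    """Return (intersection, pred pixel count, target pixel count)."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    inter = sum(p & t for p, t in zip(flat_p, flat_t))
    return inter, sum(flat_p), sum(flat_t)


def iou(pred: Mask, target: Mask) -> float:
    inter, n_pred, n_target = _overlap(pred, target)
    union = n_pred + n_target - inter
    return inter / union if union else 1.0


def dice_loss(pred: Mask, target: Mask) -> float:
    inter, n_pred, n_target = _overlap(pred, target)
    total = n_pred + n_target
    return 1.0 - (2.0 * inter / total if total else 1.0)


p_unet = [[0.9, 0.4], [0.2, 0.1]]   # toy U-Net output
p_fpn = [[0.7, 0.8], [0.4, 0.1]]    # toy FPN output
target = [[1, 1], [1, 0]]           # toy ground truth
ensemble_mask = binarize(weighted_average(p_unet, p_fpn, w_unet=0.5))
```

In this toy case U-Net alone would miss the top-right pixel (0.4 is below the threshold), while the blended map recovers it. In practice the weight would be tuned on a validation set.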


[80] A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy cs.CV [PDF]

Xin Zhang, Liangxiu Han, Yue Shi, Yalin Zheng, Uazman Alam

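TL;DR: This paper proposes a multimodal generative framework based on Weight-Decomposed Low-Rank Adaptation (WDLoRA) for clinically guided synthesis of corneal confocal microscopy (CCM) images in diabetic neuropathy. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, it generates anatomically coherent images across the disease spectrum and improves both image quality and downstream diagnostic performance.

Details

Motivation: CCM is a sensitive tool for assessing small-fiber damage in diabetic peripheral neuropathy (DPN), but deep-learning-based automated diagnostic models are held back by scarce labelled data and fine-grained variability in corneal nerve morphology. Existing AI foundation generative models often fail to preserve the anatomical fidelity needed for clinical analysis because of limited domain-specific training.

Result: In a comprehensive three-pillar evaluation, the framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance: 5.18) and structural integrity (Structural Similarity Index Measure: 0.630), significantly outperforming GAN and standard diffusion baselines. The synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, they improve downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%.

Insight: The innovation is WDLoRA, a parameter-efficient fine-tuning mechanism that decouples the magnitude and direction of weight updates so that a foundation generative model can independently learn the orientation (nerve topology) and intensity (stromal contrast) needed for medical realism. Combined with multimodal conditioning (segmentation masks and clinical prompts), it generates anatomically coherent images and helps relieve the data bottleneck in medical AI.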

Abstract: Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework’s potential to alleviate data bottlenecks in medical AI.
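The magnitude/direction decoupling can be illustrated with a small sketch. The abstract does not give WDLoRA's exact formulation, so this assumes a DoRA-style decomposition in which the adapted weight is a learned per-column magnitude times the unit-normalized column of `W0 + B @ A`. All function names here are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_norms(w: Matrix) -> List[float]:
    return [math.sqrt(sum(w[i][j] ** 2 for i in range(len(w))))
            for j in range(len(w[0]))]


def decomposed_weight(w0: Matrix, b: Matrix, a: Matrix,
                      magnitude: List[float]) -> Matrix:
    """Assumed form: W' = magnitude * (W0 + B A) / ||W0 + B A|| per column.

    The low-rank product B A changes only the direction of each column;
    the separate `magnitude` vector controls each column's length.
    """
    delta = matmul(b, a)
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
         for i in range(len(w0))]
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]


w0 = [[3.0, 0.0], [4.0, 1.0]]        # frozen pretrained weight (toy)
a = [[1.0, 1.0]]                     # rank-1 factor A (1 x 2)
b_zero = [[0.0], [0.0]]              # B initialized to zero (2 x 1)
m0 = column_norms(w0)                # magnitude initialized from W0
w_init = decomposed_weight(w0, b_zero, a, m0)
```

With `B` at zero and the magnitude initialized to the column norms of `W0`, the adapted weight reproduces `W0`, so fine-tuning starts from the pretrained model. After training, the column norms of the adapted weight equal `magnitude` regardless of `B` and `A`.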


[81] Fine-tuned Vision Language Model for Localization of Parasitic Eggs in Microscopic Images cs.CV | cs.LG [PDF]

Chan Hao Sien, Hezerul Abdul Karim, Nouar AlDahoul

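TL;DR: This paper fine-tunes a vision language model (VLM) to localize parasitic eggs in microscopic images, addressing the inefficiency and error-proneness of manual detection when diagnosing soil-transmitted helminth infections. Preliminary results show better localization accuracy than other object detection methods.

Details

Motivation: Soil-transmitted helminth infections are widespread in tropical and sub-tropical regions, where specialized diagnostic resources are limited. Manual microscopy remains the gold standard but is labour-intensive, time-consuming, and error-prone, so an automated, scalable diagnostic solution is needed.

Result: Preliminary experiments show that a fine-tuned VLM (such as Microsoft Florence) reaches an mIOU of 0.94 when localizing parasitic eggs, outperforming other object detection methods such as EfficientDet.

Insight: The innovation is fine-tuning a pre-trained general-purpose VLM for a highly specialized medical image analysis task (parasitic egg localization). It demonstrates the potential of VLMs for domain-specific transfer learning and provides a core component for an automated parasitological diagnosis framework.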

Abstract: Soil-transmitted helminth (STH) infections continuously affect a large proportion of the global population, particularly in tropical and sub-tropical regions, where access to specialized diagnostic expertise is limited. Although manual microscopic diagnosis of parasitic eggs remains the diagnostic gold standard, the approach can be labour-intensive, time-consuming, and prone to human error. This paper aims to utilize a vision language model (VLM) such as Microsoft Florence that was fine-tuned to localize all parasitic eggs within microscopic images. The preliminary results show that our localization VLM performs comparatively better than the other object detection methods, such as EfficientDet, with an mIOU of 0.94. This finding demonstrates the potential of the proposed VLM to serve as a core component of an automated framework, offering a scalable engineering solution for intelligent parasitological diagnosis.
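For readers unfamiliar with the reported metric, a minimal sketch of box IoU and its mean follows. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and that predictions are already paired with ground-truth boxes. The abstract does not describe the matching procedure, so that pairing is an assumption.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over already-matched (prediction, ground truth) pairs."""
    scores: List[float] = [box_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

An mIOU of 0.94 therefore means that, on average, predicted and reference boxes overlap almost completely.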


[82] RGA-Net: A Vision Enhancement Framework for Robotic Surgical Systems Using Reciprocal Attention Mechanisms cs.CV [PDF]

Quanjun Li, Weixuan Li, Han Xia, Junhua Zhou, Chi-Man Pun

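TL;DR: This paper presents RGA-Net, a deep learning framework for smoke removal in robotic surgical systems. It tackles smoke-induced visual degradation with a Dual-Stream Hybrid Attention module, which combines shifted window attention with frequency-domain processing, and an Axis-Decomposed Attention module, which processes multi-scale features efficiently through factorized attention.

Details

Motivation: Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation, but surgical smoke from energy-based devices significantly degrades endoscopic video, compromising the human-robot interface and surgical outcomes.

Result: Extensive experiments on the DesmokeData and LSD3K surgical datasets show that RGA-Net achieves superior performance in restoring visual clarity suitable for integration into robotic surgery.

Insight: The innovation is the pairing of the Dual-Stream Hybrid Attention and Axis-Decomposed Attention modules, connected through reciprocal cross-gating blocks that enable bidirectional feature modulation between the encoder and decoder pathways. This handles the distinctive challenges of surgical smoke, namely its dense, non-homogeneous distribution and complex light scattering.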

Abstract: Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation, yet surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke, including dense, non-homogeneous distribution and complex light scattering, through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons’ cognitive burden, optimizing operation workflows, and reducing iatrogenic injury risks in minimally invasive procedures. These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.


[83] OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding cs.CV | cs.AI [PDF]

Haoyi Tao, Chaozheng Huang, Nan Wang, Han Lyu, Linfeng Zhang

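TL;DR: This paper introduces OmniScience, a large-scale, high-fidelity multimodal dataset for scientific image understanding, with 1.5 million figure-caption-context triplets spanning more than 10 major scientific disciplines. To address the limits of open-source multimodal large language models (MLLMs) in interpreting scientific images, the authors build a dynamic model-routing re-captioning pipeline that uses state-of-the-art MLLMs to generate dense, self-contained descriptions from visual features, original figure captions, and in-text references. With rigorous quality filtering and alignment with human expert judgments, the image-text multimodal similarity score rises from 0.769 to 0.956. The authors also propose a caption QA protocol as a proxy task for evaluating visual understanding. A Qwen2.5-VL-3B model fine-tuned on OmniScience shows substantial gains on MM-MT-Bench and MMMU.

Details

Motivation: Existing open-source MLLMs have limited ability to understand scientific images such as schematic diagrams, experimental characterizations, and analytical charts. The gap stems largely from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding.

Result: A Qwen2.5-VL-3B model fine-tuned on OmniScience gains 0.378 on MM-MT-Bench and 0.140 on MMMU over the baseline.

Insight: The innovations are: 1) a large-scale, cross-disciplinary multimodal dataset of scientific images; 2) a dynamic model-routing re-captioning pipeline that combines visual features, original figure captions, and in-text references to produce high-quality descriptions; 3) a caption QA protocol as a new proxy task for evaluating visual understanding; 4) rigorous quality filtering and alignment with human expert judgments to ensure factual accuracy and semantic completeness.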

Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, a Qwen2.5-VL-3B model fine-tuned on OmniScience shows substantial gains over baselines, improving by 0.378 on MM-MT-Bench and by 0.140 on MMMU.


[84] SAM4Dcap: Training-free Biomechanical Twin System from Monocular Video cs.CV [PDF]

Li Wang, HaoYu Wang, Xi Chen, ZeKun Jiang, Kang Li

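TL;DR: SAM4Dcap is an open-source, end-to-end framework for estimating biomechanical metrics from monocular video without additional training. It integrates the temporally consistent 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver, converting reconstructed meshes into trajectory files compatible with diverse musculoskeletal models. This gives a flexible, accessible foundation for motion analysis outside the laboratory.

Details

Motivation: Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention, but traditional optical motion capture systems are expensive and usually confined to laboratories. Multi-view video approaches lower the barrier yet remain impractical for home-based scenarios that require monocular capture. This paper addresses training-free biomechanical analysis from monocular video.

Result: In preliminary evaluations on walking and drop-jump tasks, SAM4Dcap shows the potential to match multi-view systems in knee kinematic prediction, although some discrepancies in hip flexion and residual jitter remain.

Insight: The innovation is combining advanced computer vision (the 4D human mesh recovery of SAM-Body4D) with established biomechanical simulation (OpenSim) into a training-free monocular video analysis framework. Objectively, its automated prompting strategies and Linux-native build make the pipeline easier to run, and it opens a path to accessible biomechanical analysis in home or other non-laboratory settings.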

Abstract: Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention but is often restricted to laboratories due to the high cost of optical motion capture systems. While multi-view video approaches have lowered barriers, they remain impractical for home-based scenarios requiring monocular capture. This paper presents SAM4Dcap, an open-source, end-to-end framework for estimating biomechanical metrics from monocular video without additional training. SAM4Dcap integrates the temporally consistent 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver. The pipeline converts reconstructed meshes into trajectory files compatible with diverse musculoskeletal models. We introduce automated prompting strategies and a Linux-native build for processing. Preliminary evaluations on walking and drop-jump tasks indicate that SAM4Dcap has the potential to achieve knee kinematic predictions comparable to multi-view systems, although some discrepancies in hip flexion and residual jitter remain. By bridging advanced computer vision with established biomechanical simulation, SAM4Dcap provides a flexible, accessible foundation for non-laboratory motion analysis.


[85] Offline-Poly: A Polyhedral Framework For Offline 3D Multi-Object Tracking cs.CV [PDF]

Xiaoyu Li, Yitao Wu, Xian Wu, Haolin Zhuo, Lijun Zhao

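TL;DR: This paper proposes Offline-Poly, a general offline 3D multi-object tracking (MOT) method that addresses the over-reliance of existing offline 3D MOT methods on online frameworks and their limited architectural flexibility. Built on a tracking-centric design, it introduces a standardized paradigm called Tracking-by-Tracking (TBT), which takes arbitrary off-the-shelf tracking outputs and refines them offline, decoupling the method from upstream detectors or trackers. It achieves state-of-the-art performance on both nuScenes and KITTI.

Details

Motivation: Existing offline 3D MOT methods are direct extensions of online frameworks and fail to exploit the advantages of the offline setting, such as unconstrained resources and access to future frames. They also depend on fixed upstream components and customized architectures, which limits their adaptability.

Result: The method achieves SOTA performance of 77.6% AMOTA on nuScenes and leading results of 83.00% HOTA on KITTI. Comprehensive experiments validate its flexibility, generalizability, and modular effectiveness.

Insight: The core innovation is the Tracking-by-Tracking paradigm, which decouples the offline tracker from specific upstream components and makes it general. The design exploits the two fundamental properties of offline tracking (resource unconstrainedness and future observability) through a structured pipeline of pre-processing, hierarchical matching and fusion, and tracklet refinement, enabling global optimization and reasoning over the full temporal horizon.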

Abstract: Offline 3D multi-object tracking (MOT) is a critical component of the 4D auto-labeling (4DAL) process. It enhances pseudo-labels generated by high-performance detectors through the incorporation of temporal context. However, existing offline 3D MOT approaches are direct extensions of online frameworks and fail to fully exploit the advantages of the offline setting. Moreover, these methods often depend on fixed upstream components and customized architectures, limiting their adaptability. To address these limitations, we propose Offline-Poly, a general offline 3D MOT method based on a tracking-centric design. We introduce a standardized paradigm termed Tracking-by-Tracking (TBT), which operates exclusively on arbitrary off-the-shelf tracking outputs and produces offline-refined tracklets. This formulation decouples the offline tracker from specific upstream detectors or trackers. Under the TBT paradigm, Offline-Poly accepts one or multiple coarse tracking results and processes them through a structured pipeline comprising pre-processing, hierarchical matching and fusion, and tracklet refinement. Each module is designed to capitalize on the two fundamental properties of offline tracking: resource unconstrainedness, which permits global optimization beyond real-time limits, and future observability, which enables tracklet reasoning over the full temporal horizon. Offline-Poly first eliminates short-term ghost tracklets and re-identifies fragmented segments using global scene context. It then constructs scene-level similarity to associate tracklets across multiple input sources. Finally, Offline-Poly refines tracklets by jointly leveraging local and global motion patterns. On nuScenes, Offline-Poly achieves SOTA performance with 77.6% AMOTA. On KITTI, it achieves leading results with 83.00% HOTA. Comprehensive experiments further validate the flexibility, generalizability, and modular effectiveness of Offline-Poly.


[86] Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation cs.CV [PDF]

Jidong Jia, Youjian Zhang, Huan Fu, Dacheng Tao

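TL;DR: This paper proposes Skeleton2Stage, a method that addresses the way existing dance generation models, trained in the skeletal domain, ignore mesh-level physical constraints. It derives physics-based rewards from the human body mesh and applies reinforcement learning fine-tuning to steer a diffusion model toward dance motions that remain physically plausible under mesh visualization.

Details

Motivation: Most existing dance generation methods are trained in the skeletal domain and ignore mesh-level physical constraints. Joint trajectories that look plausible therefore show body self-penetration and foot-ground contact anomalies when visualized with a human body mesh, reducing the aesthetic appeal and practical value of the generated dances.

Result: Experiments on multiple dance datasets consistently show that the method significantly improves the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances.

Insight: The innovations are: 1) combining an imitation reward, which measures general plausibility through imitability in a physical simulator, with a foot-ground deviation reward that captures the dynamic foot-ground interaction in dance; 2) an anti-freezing reward that keeps the model from collapsing to low-motion, frozen outputs while preserving physical plausibility; 3) overall, bridging the physical gap between skeletal-domain generation and mesh visualization.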

Abstract: Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion’s general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate it, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: https://jjd1123.github.io/Skeleton2Stage/


[87] Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery cs.CV [PDF]

Hengtong Shen, Li Yan, Hong Xie, Yaxuan Wei, Xinhao Li

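TL;DR: This paper proposes PerASCD, a semantic change detection (SCD) method driven by the remote sensing foundation model PerA. It introduces a Cascaded Gated Decoder (CG-Decoder) that simplifies complex decoding pipelines and promotes multi-level feature interaction, and a Soft Semantic Consistency Loss (SSCLoss) that mitigates numerical instability during training. The method achieves state-of-the-art performance on public benchmark datasets.

Details

Motivation: Because of the limited semantic understanding of current models and the inherent complexity of the task, existing SCD methods face challenges in both performance and paradigm complexity. Multi-scale semantic understanding and overall performance need to improve.

Result: The method reaches state-of-the-art (SOTA) performance on two public benchmark datasets, and the proposed decoder adapts seamlessly to a variety of vision encoders.

Insight: The innovations are a modular cascaded gated decoder that simplifies the complex SCD decoding pipeline and strengthens feature fusion, and a soft semantic consistency loss that improves training stability. Objectively, the method transfers foundation model capability to SCD effectively and balances a simpler paradigm against higher performance.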

Abstract: Remote sensing (RS) change detection methods can extract critical information on surface dynamics and are an essential means for humans to understand changes in the earth’s surface and environment. Among these methods, semantic change detection (SCD) can more effectively interpret the multi-class information contained in bi-temporal RS imagery, providing semantic-level predictions that support dynamic change monitoring. However, due to the limited semantic understanding capability of existing models and the inherent complexity of SCD tasks, existing SCD methods face significant challenges in both performance and paradigm complexity. In this paper, we propose PerASCD, an SCD method driven by the RS foundation model PerA, designed to enhance multi-scale semantic understanding and overall performance. We introduce a modular Cascaded Gated Decoder (CG-Decoder) that simplifies complex SCD decoding pipelines while promoting effective multi-level feature interaction and fusion. In addition, we propose a Soft Semantic Consistency Loss (SSCLoss) to mitigate the numerical instability commonly encountered during SCD training. We further explore the applicability of multiple existing RS foundation models to the SCD task when equipped with the proposed decoder. Experimental results demonstrate that our decoder not only effectively simplifies the paradigm of SCD, but also achieves seamless adaptation across various vision encoders. Our method achieves state-of-the-art (SOTA) performance on two public benchmark datasets, validating its effectiveness. The code is available at https://github.com/SathShen/PerASCD.git.


[88] Joint Orientation and Weight Optimization for Robust Watertight Surface Reconstruction via Dirichlet-Regularized Winding Fields cs.CV [PDF]

Jiaze Li, Daisheng Jin, Fei Hou, Junhui Hou, Zheng Liu

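TL;DR: This paper proposes Dirichlet Winding Reconstruction (DiWR), a method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. Using the generalized winding number field as the target implicit representation, it jointly optimizes point orientations, per-point area weights, and confidence coefficients in a single pipeline. By minimizing the Dirichlet energy of the induced winding field together with additional constraints, it compensates for non-uniform sampling, reduces the impact of noise, and suppresses outliers without separate preprocessing.

Details

Motivation: The goal is robust watertight surface reconstruction from unoriented point clouds with non-uniform sampling, noise, and outliers, without the separate preprocessing stages that traditional methods depend on.

Result: DiWR is evaluated on point clouds from 3D Gaussian Splatting, on point clouds from a computer-vision pipeline, and on corrupted graphics benchmarks. Experiments show that it produces plausible watertight surfaces on these challenging inputs and outperforms both traditional multi-stage pipelines and recent joint orientation-reconstruction methods.

Insight: The innovation is optimizing point orientations, area weights, and confidence coefficients together in one Dirichlet-regularized framework, using generalized winding number constraints to handle point cloud defects implicitly. This makes reconstruction more robust and more integrated.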

Abstract: We propose Dirichlet Winding Reconstruction (DiWR), a robust method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. Our method uses the generalized winding number (GWN) field as the target implicit representation and jointly optimizes point orientations, per-point area weights, and confidence coefficients in a single pipeline. The optimization minimizes the Dirichlet energy of the induced winding field together with additional GWN-based constraints, allowing DiWR to compensate for non-uniform sampling, reduce the impact of noise, and downweight outliers during reconstruction, with no reliance on separate preprocessing. We evaluate DiWR on point clouds from 3D Gaussian Splatting, a computer-vision pipeline, and corrupted graphics benchmarks. Experiments show that DiWR produces plausible watertight surfaces on these challenging inputs and outperforms both traditional multi-stage pipelines and recent joint orientation-reconstruction methods.
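As background for the GWN field that DiWR targets, here is a minimal sketch of evaluating the generalized winding number of an oriented, area-weighted point cloud at a query point, using the standard point-cloud form w(q) = Σ a_i (p_i − q)·n_i / (4π‖p_i − q‖³). It only evaluates the field for known normals and weights. DiWR's joint optimization of orientations, weights, and confidences is not reproduced here, and the Fibonacci-sphere test data is an illustrative assumption.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fibonacci_sphere(n: int) -> List[Vec3]:
    """Roughly uniform samples on the unit sphere (toy test data)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def winding_number(query: Vec3, points: Sequence[Vec3],
                   normals: Sequence[Vec3], areas: Sequence[float]) -> float:
    """Generalized winding number of an oriented point cloud at `query`."""
    total = 0.0
    for p, n, a in zip(points, normals, areas):
        d = (p[0] - query[0], p[1] - query[1], p[2] - query[2])
        dist = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
        dot = d[0] * n[0] + d[1] * n[1] + d[2] * n[2]
        total += a * dot / (4.0 * math.pi * dist ** 3)
    return total


pts = fibonacci_sphere(2000)
nrm = pts                                   # outward normals of a unit sphere
area = [4.0 * math.pi / len(pts)] * len(pts)  # equal per-point area weights
inside = winding_number((0.0, 0.0, 0.0), pts, nrm, area)
outside = winding_number((3.0, 0.0, 0.0), pts, nrm, area)
```

With correct outward normals and area weights, the field is close to 1 inside the surface and close to 0 outside. Flipped normals or badly scaled weights distort it, which is why orientations and weights are natural optimization variables.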


[89] Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos cs.CV | cs.RO [PDF]

Can Li, Jie Gu, Jingmin Chen, Fangzhou Qiu, Lei Sun

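TL;DR: This paper proposes Gaussian sequences with multi-scale dynamics, a new representation for 4D (dynamic 3D) reconstruction from monocular casual videos. Its core is a multi-scale dynamics mechanism that factorizes complex motion fields, with multi-modal priors from vision foundation models as complementary supervision, enabling accurate and globally consistent dynamic scene reconstruction under strictly monocular settings.

Details

Motivation: Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet 4D reconstruction under strictly monocular settings remains highly ill-posed. Existing methods struggle with complex, multi-scale real-world dynamics.

Result: Experiments on dynamic novel-view synthesis (NVS) with benchmark and real-world manipulation datasets show considerable improvements over existing methods.

Insight: The main innovations are: 1) a multi-scale dynamics mechanism that factorizes complex motion fields into multiple levels from object to particle, alleviating reconstruction ambiguity and promoting physically plausible dynamics; 2) Gaussian sequences with multi-scale dynamics, a representation that derives dynamic 3D Gaussians through compositions of multi-level motion; 3) multi-modal priors from vision foundation models as complementary supervision, which constrain the solution space and improve reconstruction fidelity.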

Abstract: Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibit a multi-scale regularity from object to particle level. To this end, we design a multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates reconstruction ambiguity and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving the reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments on dynamic novel-view synthesis (NVS) with benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.


[90] Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings cs.CV [PDF]

Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin

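TL;DR: This paper proposes Embed-RL, a framework that uses embedder-guided reinforcement learning to optimize a Reasoner so that it generates traceable chains of thought, strengthening the semantic consistency and fine-grained matching ability of multimodal embeddings.

Details

Motivation: The reasoning chains of thought produced by existing generative embedding methods are limited to textual analysis of the query and are irrelevant to retrieving the target. A reasoning-driven universal multimodal embedding framework is needed to close this gap.

Result: With limited computational resources, the framework outperforms the pioneering embedding model on both the MMEB-V2 and UVRB benchmarks.

Insight: The innovations are the embedder-guided reinforcement learning framework, the Traceability CoT, and the pairing of structured reasoning over multimodal evidence with retrieval-oriented alignment, which strengthens cross-modal semantic consistency.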

Abstract: Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.


[91] High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication cs.CV [PDF]

Cem Eteke, Batuhan Tosun, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach

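TL;DR: This paper presents a video diffusion model for high-fidelity, causal, real-time video generation under ultra-low-bitrate semantic communication. Semantic video coding transmits the scene structure, compressed low-resolution frames supply texture, and a modular video diffusion model with efficient temporal distillation synthesizes high-quality video at extremely low bitrates.

Details

Motivation: The challenge is to achieve high-fidelity, causal, real-time video generation under ultra-low-bitrate semantic communication constraints, improving both transmission efficiency and visual quality.

Result: Evaluated on multiple datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in quantitative, qualitative, and subjective evaluations.

Insight: The innovations are an input strategy that pairs semantic coding with compressed frames, a modular video diffusion model (Semantic Control, Restoration Adapter, and Temporal Adapter), and an efficient temporal distillation procedure that reduces trainable parameters by 300x and training time by 2x while retaining real-time causal synthesis.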

Abstract: We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach utilizes lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model that contains Semantic Control, Restoration Adapter, and Temporal Adapter. We further introduce an efficient temporal distillation procedure that enables extension to real-time and causal synthesis, reducing trainable parameters by 300x and training time by 2x, while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
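To give a sense of scale for the quoted bitrate, the short calculation below converts bits per pixel (bpp) into a stream bitrate. It assumes bpp is averaged over every pixel of every frame. The 1280x720 resolution and 30 fps are illustrative assumptions, not values from the paper.

```python
def bitrate_kbps(bpp: float, width: int, height: int, fps: float) -> float:
    """Stream bitrate in kilobits per second for a given bits-per-pixel budget."""
    bits_per_frame = bpp * width * height
    return bits_per_frame * fps / 1000.0


# Assumed example stream: 1280x720 at 30 frames per second.
example_kbps = bitrate_kbps(0.0003, 1280, 720, 30)
```

Under these assumptions the budget is roughly 276 bits per frame, or about 8.3 kbit/s for the whole video stream.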


[92] Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation cs.CV [PDF]

Giorgio Chiesa, Rossella Borra, Vittorio Lauro, Sabrina De Cillis, Daniele Amparore

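TL;DR: This paper presents a workflow for generating and validating a synthetic dataset for robotic surgery instrument segmentation. An automated pipeline in Autodesk Maya produces photorealistic, labeled video sequences, and experiments show that training on a mix of synthetic and real data significantly improves the generalization of segmentation models.

Details

Motivation: Instrument segmentation in robot-assisted surgery suffers from scarce real annotated data, which is costly to obtain and rarely covers intraoperative variability. Synthetic data generation is used to augment the training data.

Result: In experiments with controlled ratios of real and synthetic data, a balanced mix significantly improves generalization compared with training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift.

Insight: The innovation is a fully automated pipeline for synthesizing photorealistic surgical scenes with randomized motion patterns, lighting variations, and synthetic blood textures. Objectively, it provides a reproducible, scalable tool for surgical computer vision that supports research on data augmentation, domain adaptation, and simulation-based pretraining.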

Abstract: This paper presents a comprehensive workflow for generating and validating a synthetic dataset designed for robotic surgery instrument segmentation. A 3D reconstruction of the Da Vinci robotic arms was refined and animated in Autodesk Maya through a fully automated Python-based pipeline capable of producing photorealistic, labeled video sequences. Each scene integrates randomized motion patterns, lighting variations, and synthetic blood textures to mimic intraoperative variability while preserving pixel-accurate ground truth masks. To validate the realism and effectiveness of the generated data, several segmentation models were trained under controlled ratios of real and synthetic data. Results demonstrate that a balanced composition of real and synthetic samples significantly improves model generalization compared to training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift. The proposed framework provides a reproducible and scalable tool for surgical computer vision, supporting future research in data augmentation, domain adaptation, and simulation-based pretraining for robotic-assisted surgery. Data and code are available at https://github.com/EIDOSLAB/Sintetic-dataset-DaVinci.


[93] Cardiac Output Prediction from Echocardiograms: Self-Supervised Learning with Limited Data cs.CV [PDF]

Adson Duarte, Davide Vitturini, Emanuele Milillo, Andrea Bragagnolo, Carlo Alberto Barbano

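TL;DR: This paper proposes a SimCLR-based self-supervised pretraining strategy for predicting cardiac output from apical four-chamber echocardiographic videos, aimed at improving model generalization when data is scarce.

Details

Motivation: Cardiac output is a key parameter in diagnosing and managing cardiovascular disease, but measuring it accurately requires invasive right-heart catheterization. A reliable non-invasive alternative based on echocardiography is therefore needed.

Result: On the test set, the method achieves an average Pearson correlation of 0.41 and outperforms PanEcho, a model trained on over one million echocardiographic exams, indicating that self-supervised learning mitigates overfitting and improves representation learning.

Insight: The innovation is applying self-supervised learning (SimCLR) to echocardiographic analysis with limited data, showing that self-supervised pretraining on the same small dataset can still improve downstream performance.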

Abstract: Cardiac Output (CO) is a key parameter in the diagnosis and management of cardiovascular diseases. However, its accurate measurement requires right-heart catheterization, an invasive and time-consuming procedure, motivating the development of reliable non-invasive alternatives using echocardiography. In this work, we propose a self-supervised learning (SSL) pretraining strategy based on SimCLR to improve CO prediction from apical four-chamber echocardiographic videos. The pretraining is performed using the same limited dataset available for the downstream task, demonstrating the potential of SSL even under data scarcity. Our results show that SSL mitigates overfitting and improves representation learning, achieving an average Pearson correlation of 0.41 on the test set and outperforming PanEcho, a model trained on over one million echocardiographic exams. Source code is available at https://github.com/EIDOSLAB/cardiac-output.
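The reported metric can be reproduced with a few lines. This is a minimal sketch of the Pearson correlation between predicted and reference cardiac output values, using made-up numbers. How the paper averages the correlation (for example over folds or runs) is not stated in the abstract, so only the single-set computation is shown.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reference and predicted values, for illustration only.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]
r = pearson(reference, predicted)
```

A value of 0.41 indicates a moderate linear relationship, so the result is best read as a relative improvement over the baseline rather than as clinically sufficient accuracy.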


[94] Low-Pass Filtering Improves Behavioral Alignment of Vision Models cs.CV [PDF]

Max Wolff, Thomas Klein, Evgenia Rusak, Felix Wichmann, Wieland Brendel

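TL;DR: This paper finds that low-pass filtering (blurring) images at test time substantially improves the behavioral alignment between deep neural networks such as CLIP and human vision, as measured by error consistency and shape bias. Removing high-frequency spatial information narrows the alignment gap between DNNs and human observers and sets a new state of the art on the model-vs-human benchmark.

Details

Motivation: Despite strong results on computer vision benchmarks, deep neural networks still model human visual behavior (error consistency and shape bias) inadequately. Earlier work hypothesized that generative classifiers greatly improve behavioral alignment. This paper asks whether that improvement comes mainly from a seemingly innocuous resizing operation in the generative model, which acts as a low-pass filter, rather than from the generative model itself.

Result: Blurring images at test time, rather than training on blurred images, achieves a new state-of-the-art score on the model-vs-human benchmark and halves the alignment gap between DNNs and human observers. Directly optimizing filters for alignment suggests that low-pass filters are likely optimal, and the Pareto-optimal frontier of the benchmark is computed to put their performance in context.

Insight: The innovation is showing the key role of low-pass filtering (for example with a Gaussian filter) in improving the behavioral alignment of vision models, with a frequency spectrum that roughly matches the band-pass filters of the human visual system. Objectively, the method is simple and effective: it needs no retraining, only a filter applied to the input at test time.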

Abstract: Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative, rather than discriminative, classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test time, rather than training on blurred images, achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible Pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of the specific width that also maximizes error consistency.
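The test-time blurring described above can be sketched as a separable Gaussian low-pass filter applied to a grayscale image before it reaches the classifier. This is a minimal pure-Python illustration. The value of `sigma`, the 3-sigma kernel radius, and the clamped border handling are assumptions, since the abstract does not give the filter widths used in the paper.

```python
import math
from typing import List

Image = List[List[float]]


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img: Image, kernel: List[float]) -> Image:
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [[sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                 for k in range(-radius, radius + 1))
             for x in range(width)]
            for row in img]


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gaussian_blur(img: Image, sigma: float) -> Image:
    """Separable Gaussian blur: filter rows, then columns."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(img, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


# A checkerboard is almost pure high-frequency content.
checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
blurred = gaussian_blur(checker, sigma=2.0)
```

The checkerboard collapses toward a uniform gray after blurring, while a constant image passes through unchanged. That is the low-pass behavior the paper credits for the alignment gain.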


[95] Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification cs.CV | cs.LG

Daniel Chen, Zaria Zinn, Marcus Lowe

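TL;DR: This paper presents a font classification system built on the DINOv2 Vision Transformer and fine-tuned parameter-efficiently with LoRA, which identifies 394 font families from rendered text images. The authors build a large-scale synthetic dataset generation pipeline and open-source the model, the dataset, and the full training code.

Details

Motivation: Large-scale font classification requires both efficient fine-tuning of a large vision model and training data that generalizes well.

Result: On the 394-class task, the system reaches approximately 86% top-1 accuracy while training fewer than 1% of the model's 87.2M parameters.

Insight: The contributions are the use of LoRA for parameter-efficient fine-tuning to reduce compute cost, and a synthetic dataset generation pipeline with diverse augmentations (randomized colors, alignment, line wrapping, Gaussian noise) that improves generalization to real-world samples. The authors also provide a complete deployment setup and open-source resources.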

Abstract: We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model’s 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.
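To make the "fewer than 1% of the parameters" claim concrete, here is a minimal sketch of how a LoRA update is applied and how its trainable parameter count scales. The layer count, hidden size, rank, and number of adapted projections below are hypothetical values chosen for illustration; the abstract does not state the configuration the authors used. Only the 87.2M total comes from the source.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    rank = len(A)                      # A has shape (r, d_in)
    base = matvec(W, x)                # frozen pretrained projection
    update = matvec(B, matvec(A, x))   # low-rank correction, B has shape (d_out, r)
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA-adapted weight matrix."""
    return rank * (d_in + d_out)

# Hypothetical configuration (NOT taken from the paper): 12 transformer layers,
# hidden size 768, rank 8, adapters on two projections per layer.
layers, hidden, rank, adapted_per_layer = 12, 768, 8, 2
trainable = layers * adapted_per_layer * lora_param_count(hidden, hidden, rank)
total = 87_200_000  # total parameter count quoted in the abstract
fraction = trainable / total
print(f"{trainable} trainable LoRA parameters, {fraction:.2%} of the model")
```

Under these assumed settings the adapters stay well below the 1% budget, which is consistent with the figure in the abstract, although the real setup may adapt different layers or use a different rank.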


[96] MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction cs.CV | cs.LG

Ruggiero Santeramo, Igor Zubarev, Florian Jug

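TL;DR: This paper proposes MamaDino, a hybrid vision model for 3-year breast cancer risk prediction. It combines the complementary inductive biases of convolutional networks and transformers and explicitly models contralateral breast asymmetry, matching the prediction performance of the leading model Mirai at substantially lower input resolution.

Details

Motivation: Current deep-learning breast cancer risk models (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs, and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. The authors hypothesise that combining complementary inductive biases with explicit asymmetry modelling can reach state-of-the-art prediction performance on lower-resolution mammograms.

Result: On both the internal and external test sets, MamaDino matches Mirai at breast level while using about 13x fewer input pixels. Adding the BilateralMixer raises AUC from 0.713 to 0.736 on the in-distribution test set and from 0.666 to 0.677 on the out-of-distribution test set, with consistent performance across age, ethnicity, scanner, tumour type, and grade.

Insight: The contributions are: (1) a hybrid architecture that fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder, combining the complementary strengths of convolutions and transformers; (2) the BilateralMixer module, which explicitly models information from both breasts and improves discrimination; and (3) evidence that using lower-resolution images in a more structured way can recover state-of-the-art accuracy, which matters in practice for reducing compute cost and easing deployment.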

Abstract: Breast cancer screening programmes increasingly seek to move from one-size-fits-all interval to risk-adapted and personalized strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (>1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms, indicating that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512x512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At breast-level, MamaDino matches Mirai on both internal and external tests while using ~13x fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms.


[97] Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology cs.CV

Minghao Han, Dingkang Yang, Linhao Qu, Zizhi Chen, Gang Li

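TL;DR: This paper proposes STAMP, a framework that integrates spatial transcriptomics data with pathology images for multimodal representation learning. It uses self-supervised, gene-guided training to improve pathology image representations, introduces the large-scale spatial transcriptomics dataset SpaVis-6M, and validates the approach on multiple downstream tasks.

Details

Motivation: Existing multimodal models in computational pathology rely mainly on vision and language, but language lacks molecular specificity and offers limited pathological supervision, creating a representational bottleneck. Spatially resolved gene expression data is therefore introduced to enable molecule-guided joint embedding.

Result: Validated on six datasets and four downstream tasks, STAMP consistently performs strongly, supporting the value and necessity of integrating spatial molecular supervision into multimodal learning for computational pathology.

Insight: The contributions include a spatial transcriptomics-augmented multimodal pathology representation learning framework, the construction of SpaVis-6M (described as the largest Visium-based spatial transcriptomics dataset to date), and hierarchical multi-scale contrastive alignment with cross-scale patch localization to align spatial transcriptomics with pathology images and capture spatial structure and molecular variation.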

Abstract: Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: https://github.com/Hanminghao/STAMP.


[98] MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars cs.CV | astro-ph.IM | cs.CL

Shuoyuan Wang, Yiran Wang, Hongxin Wei

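TL;DR: This paper introduces MarsRetrieval, a benchmark for evaluating the retrieval ability of vision-language models in Martian geospatial discovery. It comprises three tasks (paired image-text retrieval, landform retrieval, and global geo-localization) and addresses the fact that existing benchmarks are confined to closed-set supervised visual tasks and lack text-guided retrieval.

Details

Motivation: Most existing benchmarks are limited to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. A new benchmark is therefore needed to evaluate the retrieval ability of vision-language models in planetary science, particularly Mars exploration.

Result: The evaluation shows that MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. The experiments further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings.

Insight: The contribution is a multi-scale, multi-task retrieval benchmark dedicated to Martian geospatial discovery, together with a unified retrieval-centric protocol for evaluating multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. This gives data-driven methods in planetary science a new direction for evaluation.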

Abstract: Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml-stat-Sustech/MarsRetrieval
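A retrieval-centric protocol of the kind described above typically ranks a gallery by embedding similarity and checks whether the correct item appears near the top. The sketch below shows one common way to do that with cosine similarity and Recall@k. The abstract does not name the metrics MarsRetrieval uses, so Recall@k, the function names, and the toy embeddings are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(query_embs, gallery_embs, targets, k):
    """Fraction of queries whose target gallery index is ranked in the top k."""
    hits = 0
    for query, target in zip(query_embs, targets):
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda i: cosine(query, gallery_embs[i]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy 2-D embeddings standing in for text queries and image gallery entries.
text_queries = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
image_gallery = [[1.0, 0.0], [0.1, 1.0], [-1.0, 0.2]]
targets = [0, 1, 2]  # index of the correct gallery item for each query

for k in (1, 3):
    print(f"Recall@{k} = {recall_at_k(text_queries, image_gallery, targets, k):.3f}")
```

The same loop works for either a dual-tower encoder or a generative model, as long as each produces one embedding per query and per gallery item.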


[99] Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow cs.CV

Shenhan Qian, Ganlin Zhang, Shangzhe Wu, Daniel Cremers

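TL;DR: Flow4R is a unified framework that combines 4D reconstruction and tracking of dynamic 3D scenes by using camera-space scene flow as the central representation. A Vision Transformer predicts per-pixel 3D point position, scene flow, pose weight, and confidence from two-view inputs, so that local geometry and bidirectional motion are inferred symmetrically in a single forward pass, without an explicit pose regressor or bundle adjustment.

Details

Motivation: Existing methods decouple geometric reconstruction from motion tracking: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models.

Result: Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks.

Insight: The contribution is a unified, scene-flow-centric representation that links 3D structure, object motion, and camera motion, inferred symmetrically with a shared decoder. This simplifies the pipeline and improves spatiotemporal scene understanding.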

Abstract: Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-centric representation for spatiotemporal scene understanding.


[100] Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation cs.CV

Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng

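TL;DR: This paper proposes FLEX, a training-free inference-time framework that addresses the extrapolation failure of autoregressive video diffusion models when generating videos longer than the training horizon. Using frequency-aware RoPE modulation, antiphase noise sampling, and an inference-only attention sink, it substantially extends generation length without additional training and performs strongly on the VBench benchmark.

Details

Motivation: When autoregressive video diffusion models generate long videos, the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling cause severe error accumulation and temporal degradation, preventing stable generation beyond the training horizon.

Result: On VBench, FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30 s duration), matches long-video fine-tuned baselines at 12x scale (60 s duration), and supports consistent, dynamic video synthesis at up to 4 minutes.

Insight: The contribution is identifying and addressing two core problems, spectral bias and missing dynamic priors. A frequency-aware interpolation and extrapolation strategy combined with antiphase noise injection decouples training length from inference length in a plug-and-play manner, providing an efficient inference-time enhancement for long video generation.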

Abstract: Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at $6\times$ extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at $12\times$ scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.


[101] BitDance: Scaling Autoregressive Generative Models with Binary Tokens cs.CV | cs.AI

Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu

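TL;DR: BitDance is a scalable autoregressive image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, each token can represent up to 2^256 states, giving a compact yet highly expressive discrete representation. To handle sampling from such a large token space, BitDance uses a binary diffusion head that generates the binary tokens with continuous-space diffusion, and it introduces next-patch diffusion, a decoding method that predicts multiple tokens in parallel with high accuracy and greatly speeds up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among autoregressive models; in text-to-image generation it produces high-resolution photorealistic images efficiently, with a speedup of over 30x over prior autoregressive models at 1024x1024.

Details

Motivation: Conventional autoregressive generators that predict codebook indices face a limited token space, limited expressiveness, and inefficient sampling. The goal is a more compact and expressive representation through binary tokens and a diffusion mechanism, with faster inference.

Result: On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among autoregressive models. With next-patch diffusion, it beats state-of-the-art parallel autoregressive models that use 1.4B parameters while using only 260M parameters (5.4x fewer) and achieving an 8.7x speedup. In text-to-image generation, it achieves a speedup of over 30x compared with prior autoregressive models when generating 1024x1024 images.

Insight: The contributions are: binary visual tokens in place of codebook indices, giving a high-entropy compact representation; a binary diffusion head that generates binary tokens through continuous-space diffusion, which addresses sampling from a huge token space; and next-patch diffusion decoding, which predicts tokens in parallel with high accuracy and substantially speeds up inference. Together these give autoregressive foundation models a scalable and efficient generation framework.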

Abstract: We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
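The capacity argument in the abstract, that one binary token can take up to $2^{256}$ states, can be checked with plain integer arithmetic. The sketch below compares that with a conventional codebook index. The codebook size of 16384 is an assumed example value, not a figure from the paper, and the helper names are hypothetical.

```python
import math

def bits_to_int(bits):
    """Pack a sequence of 0/1 values (most significant bit first) into one integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def states(num_bits):
    """Number of distinct values a binary token of num_bits bits can take."""
    return 2 ** num_bits

def bits_per_index(codebook_size):
    """Information carried by one index into a codebook of the given size."""
    return math.log2(codebook_size)

token_bits = 256                 # bits per token, from the 2^256 figure in the abstract
hypothetical_codebook = 16384    # assumed VQ codebook size, for comparison only

print(f"binary token: {token_bits} bits, {len(str(states(token_bits)))} decimal digits of states")
print(f"codebook index: {bits_per_index(hypothetical_codebook):.0f} bits")
print(f"example token id: {bits_to_int([1, 0, 1])}")
```

A softmax over $2^{256}$ classes is not feasible, which is the reason the abstract gives for replacing index classification with a diffusion head that generates the bits directly.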


[102] Restoration Adaptation for Semantic Segmentation on Low Quality Images cs.CV | cs.AI

Kai Guan, Rongyuan Wu, Shuai Li, Wentao Zhu, Wenjun Zeng

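TL;DR: This paper proposes RASS, a framework that addresses the drop in semantic segmentation performance on low-quality images. A semantic-constrained restoration model injects segmentation priors into the image restoration process, and LoRA-based module merging with task-specific fine-tuning transfers the restoration knowledge into the segmentation model, enabling high-quality semantic segmentation directly on low-quality images.

Details

Motivation: In real-world scenarios, low-quality images lack clear semantic structure and high-frequency detail, which degrades semantic segmentation. Conventional real-world image restoration models focus on pixel-level fidelity and fail to recover task-relevant semantic cues, while existing segmentation models trained on high-quality data are not robust to real-world degradations.

Result: Extensive experiments on synthetic and real-world low-quality benchmarks show that the proposed SCR and RASS significantly outperform state-of-the-art methods on segmentation and restoration tasks.

Insight: The contribution is injecting segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, which yields semantically faithful reconstruction, and then transferring the restoration knowledge to segmentation through LoRA-based module merging and fine-tuning, which improves robustness to low-quality images.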

Abstract: In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images, which may lack clear semantic structures and high-frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real-world image restoration (Real-IR) models primarily focus on pixel-level fidelity and often fail to recover task-relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high-quality data lack robustness under real-world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high-quality semantic segmentation on the LQ images directly. Specifically, we first propose a Semantic-Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning, thereby enhancing the model’s robustness to LQ images. To validate the effectiveness of our framework, we construct a real-world LQ image segmentation dataset with high-quality annotations, and conduct extensive experiments on both synthetic and real-world LQ benchmarks. The results show that SCR and RASS significantly outperform state-of-the-art methods in segmentation and restoration tasks. Code, models, and datasets will be available at https://github.com/Ka1Guan/RASS.git.


[103] CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning cs.CV

Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi

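TL;DR: This paper proposes CoCoEdit, a post-training framework for content-consistent image editing via region-regularized reinforcement learning. By augmenting existing editing datasets and introducing a pixel-level similarity reward and a region-based regularizer, it targets the problem that existing models make unwanted changes to non-edited regions while editing the target region.

Details

Motivation: Existing large-scale generative models for image editing focus mainly on the editing effect for intended objects and regions, which often causes unwanted changes in non-edited regions. An editing method that preserves content consistency is therefore needed.

Result: On GEdit-Bench and ImgEdit-Bench, applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext yields editing scores competitive with state-of-the-art models and significantly better content consistency, as measured by PSNR/SSIM and human subjective ratings.

Insight: The contribution is a pixel-level similarity reward that complements MLLM-based rewards, plus a region-based regularizer that overcomes the spatially agnostic nature of those rewards, so that non-edited regions are preserved while editing quality is maintained.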

Abstract: Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
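Content consistency in the sense described above can be measured by scoring only the pixels outside the annotated edit mask. The following is a minimal sketch of a masked PSNR for single-channel 8-bit images stored as nested lists. It illustrates the general idea of a pixel-level similarity metric restricted to non-edited regions; the exact metric and reward definition used by CoCoEdit are not given in the abstract, so the function name, mask convention (1 = edited), and toy data are assumptions.

```python
import math

def masked_psnr(source, edited, edit_mask, max_value=255.0):
    """PSNR over pixels where edit_mask == 0 (regions that should stay unchanged)."""
    squared_error, count = 0.0, 0
    for src_row, out_row, mask_row in zip(source, edited, edit_mask):
        for s, o, m in zip(src_row, out_row, mask_row):
            if m == 0:
                squared_error += (s - o) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; no preserved region to score")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # preserved region is pixel-identical
    return 10.0 * math.log10(max_value ** 2 / mse)

source = [[10, 20, 30], [40, 50, 60]]
mask = [[0, 0, 1], [0, 0, 0]]                 # only the top-right pixel is edited
clean_edit = [[10, 20, 200], [40, 50, 60]]    # change confined to the mask
leaky_edit = [[10, 20, 200], [50, 50, 60]]    # also alters a pixel outside the mask

print(masked_psnr(source, clean_edit, mask))
print(masked_psnr(source, leaky_edit, mask))
```

An edit that stays inside the mask scores infinity, while any leakage into the preserved region lowers the score, which is the behavior a consistency reward would want to encourage.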


[104] ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization cs.CV

Youqi Wang, Shen Chen, Haowei Wang, Rongxuan Peng, Taiping Yao

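TL;DR: This paper proposes ForgeryVCR, a framework that addresses the hallucinations multimodal large language models (MLLMs) produce in image forgery detection and localization when they rely on text-centric chain-of-thought. It introduces a forensic toolbox that turns imperceptible tampering traces into explicit visual intermediates, and a Strategic Tool Learning post-training paradigm that lets the MLLM actively invoke multi-view reasoning paths, achieving visual-centric reasoning.

Details

Motivation: Existing MLLM-based forgery detection and localization methods mostly follow a text-centric chain-of-thought paradigm, but language cannot capture subtle pixel-level inconsistencies, which leads to hallucinations. The paper aims to overcome this limitation by using visual-centric reasoning to exploit low-level tampering traces more effectively.

Result: Extensive experiments show that ForgeryVCR achieves state-of-the-art performance in both detection and localization, with strong generalization and robustness and minimal tool redundancy.

Insight: The contribution is integrating forensic tools into the MLLM as visual intermediate representations, and a Strategic Tool Learning paradigm consisting of supervised fine-tuning on gain-driven trajectories followed by reinforcement learning guided by a tool utility reward. The model learns to decide for itself when to apply local zoom-in and when to analyze compression history, noise residuals, and the frequency domain, which makes it more precise on visual forgery tasks.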

Abstract: Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.


[105] GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction cs.CV

Ahmet Burak Yildirim, Tuna Saygin, Duygu Ceylan, Aysegul Dundar

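TL;DR: This paper proposes GeoFusionLRM, a geometry-aware self-correction framework for single-image 3D reconstruction. It feeds the model's own predicted normal and depth maps back through a dedicated transformer and fusion module to correct geometric inconsistencies and misaligned details, improving the alignment between the reconstructed mesh and the input view as well as overall fidelity.

Details

Motivation: Existing large reconstruction models (LRMs) often produce geometric inconsistencies and misaligned details in single-image 3D reconstruction, which limits fidelity. The paper aims to correct these errors using the model's own geometric predictions, without additional supervision or external signals.

Result: Extensive experiments show that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.

Insight: The core contribution is a geometry-aware self-correction mechanism that improves 3D reconstruction consistency through internal feedback (normal and depth predictions) rather than relying only on input image features. Using the model's own outputs for refinement offers a new route to better geometric accuracy in generative models.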

Abstract: Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model’s own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.


[106] EgoSound: Benchmarking Sound Understanding in Egocentric Videos cs.CV

Bingwen Zhu, Yuqian Fu, Qiaole Dong, Guolei Sun, Tianwen Qian

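TL;DR: This paper introduces EgoSound, the first benchmark designed to systematically evaluate how well multimodal large language models understand sound in egocentric videos. It unifies data from Ego4D and EgoBlind, covering both sighted and sound-dependent experiences, and uses a multi-stage auto-generative pipeline to build 7315 validated QA pairs across 900 videos. The benchmark defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Comprehensive experiments on nine state-of-the-art MLLMs show that current models have emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding.

Details

Motivation: Human perception is inherently multisensory, integrating sight, sound, and motion to understand the world. In egocentric settings, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, yet current MLLMs focus mainly on vision-language understanding and lack a systematic evaluation of sound understanding.

Result: Nine state-of-the-art MLLMs were evaluated comprehensively on EgoSound. Current models show emerging auditory reasoning abilities but remain limited in fine-grained spatial localization and causal understanding. The benchmark establishes a challenging foundation for advancing multisensory egocentric intelligence.

Insight: The contribution is the first multi-task benchmark focused on sound understanding in egocentric videos, together with a systematic task taxonomy for sound understanding. Viewed objectively, the multi-stage auto-generative pipeline for building large-scale, high-quality evaluation data, and the decomposition of sound understanding into intrinsic perception, spatial, causal, and cross-modal reasoning, give future work on auditory understanding and multisensory fusion a useful evaluation framework and direction.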

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.


[107] DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors cs.CV | cs.AI | cs.LG

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding

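TL;DR: This paper proposes DenseMLLM, a method built on the standard multimodal large language model (MLLM) architecture. With a novel vision token supervision strategy, a general-purpose MLLM can directly perform fine-grained dense prediction tasks such as semantic segmentation and depth estimation without additional task-specific decoders.

Details

Motivation: Existing MLLMs need complex task-specific decoders for dense prediction, which fragments the architecture, departs from the generalist design, and limits practicality.

Result: The model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, showing that a standard general-purpose MLLM can support dense perception without architectural specialization.

Insight: The contribution is a vision token supervision strategy for multiple labels and tasks that gives the standard MLLM architecture dense prediction capability on its own, simplifying model design while preserving generality.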

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.


[108] Detection of On-Ground Chestnuts Using Artificial Intelligence Toward Automated Picking cs.CV | cs.AI

Kaixuan Fang, Yuzhen Lu, Xinyang Mu

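TL;DR: This study develops an AI-based chestnut detection system to support automated picking. It systematically evaluates 29 state-of-the-art real-time object detectors (from the YOLO and RT-DETR families) on a dataset of orchard-floor images containing 6524 annotated chestnuts, and finds that YOLOv12m achieves the best mAP@0.5 of 95.1%, showing potential for real-time detection.

Details

Motivation: Traditional mechanized chestnut harvesting is costly, non-selective, and prone to damaging nuts, while low-cost, vision-guided automated harvesting requires accurate and reliable detection of chestnuts on the orchard floor. Detection remains challenging in complex environments with shading, varying light, and interference from weeds, fallen leaves, stones, and other objects.

Result: On the self-collected chestnut dataset, YOLOv12m achieved the best mAP@0.5 of 95.1%, and RT-DETRv2-R101 was the best RT-DETR variant with an mAP@0.5 of 91.1%. On the stricter mAP@[0.5:0.95] metric, YOLOv11x achieved the highest accuracy of 80.1%. YOLO models outperformed RT-DETR models in both detection accuracy and inference speed, making them better suited to on-board deployment.

Insight: The contribution is the first systematic, large-scale evaluation of recent YOLO and RT-DETR real-time detectors on chestnut detection, with the dataset and code released publicly. Viewed objectively, the work provides a detailed performance baseline for object detection in a specific agricultural setting and supports the feasibility of deploying YOLO-family models in complex real-world environments.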

Abstract: Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on-ground objects, which have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 in the YOLO (v11-13) and 15 in the RT-DETR (v1-v4) families at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best mAP@0.5 of 95.1% among all the evaluated models, while the RT-DETRv2-R101 was the most accurate variant among RT-DETR models, with mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in terms of both detection accuracy and inference speed, making them better suited for on-board deployment. Both the dataset and software programs in this study have been made publicly available at https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection.
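The mAP@0.5 and mAP@[0.5:0.95] figures above both rest on the intersection-over-union (IoU) between predicted and annotated boxes: a detection counts as correct when its IoU with an unmatched ground-truth box meets the threshold, and the second metric averages over thresholds from 0.5 to 0.95 following the COCO convention. The sketch below shows only the IoU test and a greedy matching step, not a full mAP computation, and it is not the evaluation code from the linked repository. The box format `(x1, y1, x2, y2)` and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(predictions, ground_truth, threshold=0.5):
    """Greedy one-to-one matching; predictions must be sorted by descending confidence."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each annotated chestnut can be claimed only once
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives

predicted = [(0, 0, 10, 10), (20, 20, 30, 30), (100, 100, 110, 110)]
annotated = [(1, 1, 10, 10), (20, 20, 28, 30)]
print(count_true_positives(predicted, annotated, threshold=0.5))
print(count_true_positives(predicted, annotated, threshold=0.9))
```

Raising the threshold turns loosely localized detections into misses, which is why scores under mAP@[0.5:0.95] are lower than under mAP@0.5 for the same model.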


[109] LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models cs.CV

Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin

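TL;DR: LaViDa-R1 is a multimodal, general-purpose reasoning diffusion language model. Through a unified framework of supervised fine-tuning and multi-task reinforcement learning, it brings together diverse multimodal understanding and generation tasks and performs strongly on visual math reasoning, reason-intensive grounding, and image editing.

Details

Motivation: Existing reasoning diffusion language models are built with task-specific reinforcement learning. The goal is a single unified multimodal reasoning model.

Result: Experiments show strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.

Insight: The contributions include a unified post-training framework that integrates supervised fine-tuning with multi-task reinforcement learning, and training techniques such as answer-forcing, tree search, and complementary likelihood estimation that improve effectiveness and scalability.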

Abstract: Diffusion language models (dLLMs) recently emerged as a promising alternative to auto-regressive LLMs. The latest works further extended it to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1’s strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.


[110] ARport: An Augmented Reality System for Markerless Image-Guided Port Placement in Robotic Surgery cs.CV

Zheng Han, Zixin Yang, Yonghao Long, Lin Zhang, Peter Kazanzides

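TL;DR: ARport is an augmented reality system for markerless, image-guided port placement in robotic surgery. Implemented on an optical see-through head-mounted display, it requires no external sensors or markers: it reconstructs the operative scene from RGB, depth, and pose data, extracts the patient's body surface, and performs surface-based markerless registration to align the preoperative 3D model with that surface, so the pre-planned trocar layout can be visualized directly on the patient.

Details

Motivation: Port placement in robot-assisted surgery must be precise. The system aims to bridge the gap between preoperative planning and intraoperative execution by providing intuitive spatial guidance while simplifying setup and improving workflow integration.

Result: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, achieving consistent spatial correspondence between the virtual plan and real anatomy.

Insight: The contribution is using a foundation model to extract the patient's body surface and performing surface-based markerless registration, which yields a fully marker-free solution with minimal hardware requirements and potential for seamless integration into routine clinical workflows.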

Abstract: Purpose: Precise port placement is a critical step in robot-assisted surgery, where port configuration influences both visual access to the operative field and instrument maneuverability. To bridge the gap between preoperative planning and intraoperative execution, we present ARport, an augmented reality (AR) system that automatically maps pre-planned trocar layouts onto the patient’s body surface, providing intuitive spatial guidance during surgical preparation. Methods: ARport, implemented on an optical see-through head-mounted display (OST-HMD), operates without any external sensors or markers, simplifying setup and enhancing workflow integration. It reconstructs the operative scene from RGB, depth, and pose data captured by the OST-HMD, extracts the patient’s body surface using a foundation model, and performs surface-based markerless registration to align preoperative anatomical models to the extracted patient’s body surface, enabling in-situ visualization of planned trocar layouts. A demonstration video illustrating the overall workflow is available online. Results: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, achieving consistent spatial correspondence between virtual plans and real anatomy. Conclusion: ARport provides a fully marker-free and hardware-minimal solution for visualizing preoperative trocar plans directly on the patient’s body surface. The system facilitates efficient intraoperative setup and demonstrates potential for seamless integration into routine clinical workflows.


[111] When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance cs.CV | cs.AI | cs.LG

Ahmed Ghorbel, Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz

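TL;DR: This paper studies test-time guidance for diffusion models in image and video editing. It builds on an existing approximation that avoids vector-Jacobian product (VJP) computation, analyzes it theoretically, and validates its performance on large-scale benchmarks.

Details

Motivation: Existing test-time-guidance methods for text-driven image and video editing rely on costly VJP computations to approximate the guidance term, which limits their practical use.

Result: On several large-scale image and video editing benchmarks, the method performs comparably to training-based methods and in some cases surpasses them.

Insight: The contribution is a theoretical analysis and an extensive empirical evaluation of the VJP-free approximation, showing that test-time guidance alone can deliver effective editing without additional training.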

Abstract: Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector–Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.


[112] Towards Spatial Transcriptomics-driven Pathology Foundation Models cs.CV | cs.AI

Konstantin Hemker, Andrew H. Song, Cristina Almagro-Pérez, Guillaume Jaume, Sophia J. Wagner

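TL;DR: This paper proposes Spatial Expression-Aligned Learning (SEAL), a parameter-efficient vision-omics self-supervised learning framework that infuses the localized molecular information provided by spatial transcriptomics (ST) into pathology vision encoders to strengthen their representations. SEAL is trained on over 700,000 paired gene expression spot-tissue region examples and tested on 38 slide-level and 15 patch-level downstream tasks, where it consistently improves performance and shows strong domain generalization and new cross-modal capabilities.

Details

Motivation: ST provides spatially resolved measurements of gene expression that go beyond conventional histological assessment, yet existing pathology foundation models rely mainly on visual information. The goal is to exploit the coupling between localized molecular information and morphology to systematically improve pathology visual representations and downstream task performance.

Result: Across 38 slide-level tasks (molecular status, pathway activity, and treatment response prediction) and 15 patch-level tasks (gene expression prediction), SEAL serves as a drop-in replacement for pathology foundation models and consistently outperforms widely used vision-only and ST prediction baselines. It also generalizes robustly in out-of-distribution evaluations and enables new cross-modal capabilities such as gene-to-image retrieval.

Insight: The novelty is SEAL, a general framework for ST-guided finetuning of pathology foundation models that augments existing models with localized molecular supervision, improving visual representations and extending cross-modal utility. Because it is parameter-efficient, it can be applied flexibly to widely used pathology foundation models without the cost of training new encoders from scratch.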

Abstract: Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene expression spot-tissue region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval. Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.


[113] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model cs.CV | cs.AI

Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li

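TL;DR: This paper proposes UniWeTok, a unified discrete tokenizer for unified multimodal large language models (MLLMs). It uses a massive binary codebook (2^128) together with a new training framework (Pre-Post Distillation and a Generative-Aware Prior) and model architecture (a convolution-attention hybrid with the SigLu activation function) to jointly satisfy three usually conflicting goals: high-fidelity reconstruction, complex semantic extraction, and generative suitability.

Details

Motivation: Existing visual tokenizers struggle to satisfy high-fidelity reconstruction, complex semantic extraction, and generative suitability within a single framework, which hinders the development of unified MLLMs.

Result: On ImageNet, UniWeTok achieves state-of-the-art image generation (FID 1.38 vs. 1.42 for REPA) with far less training compute (33B vs. 262B training tokens). In the general domain, it is highly competitive across a broad range of tasks, including multimodal understanding, image generation (DPG score 86.63 vs. 83.84 for FLUX.1 [Dev]), and image editing (GEdit overall score 5.09 vs. 5.06 for OmniGen).

Insight: The core ideas are unifying visual representation through a massive binary codebook and using the SigLu activation to resolve the optimization conflict and stabilize training. The three-stage training framework further improves adaptability to different image resolutions and perception-sensitive scenarios. The architecture and training strategy offer a template for designing efficient, general-purpose multimodal tokenizers.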

Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok’s adaptability across various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring a remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). In the general domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizers and MLLMs.
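To make the codebook size concrete, the sketch below shows one way a 128-bit binary code can index an implicit codebook of 2^128 entries without storing it. The abstract does not say how UniWeTok produces its bits; the sign-based quantization here, and the names `binarize` and `bits_to_index`, are assumptions chosen for illustration only.

```python
NUM_BITS = 128  # one bit per latent channel (assumed layout)


def binarize(latent):
    """Map a real-valued latent vector to a tuple of bits by sign (assumed rule)."""
    return tuple(1 if x > 0 else 0 for x in latent)


def bits_to_index(bits):
    """Read the bit tuple as an integer index into an implicit codebook."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


# Toy latent with alternating signs; a real encoder output would go here.
latent = [0.5 if i % 2 == 0 else -0.5 for i in range(NUM_BITS)]
bits = binarize(latent)
index = bits_to_index(bits)
codebook_size = 2 ** NUM_BITS  # never materialized, only implied by the bit width
```

Because every index is derived from the bits themselves, no lookup table of 2^128 vectors is ever needed, which is what makes a codebook of this size practical.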


[114] UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing cs.CV

Hongyang Wei, Bin Wen, Yancheng Long, Yankai Yang, Yuhang Hu

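TL;DR: This paper presents UniRef-Image-Edit, a multimodal generation system that unifies single-image editing and multi-image composition. Its core contributions are the Sequence-Extended Latent Fusion (SELF) representation and a two-stage training framework of supervised fine-tuning and reinforcement learning, aimed at the difficulty existing methods have in staying consistent under multiple references.

Details

Motivation: Existing diffusion-based editing methods struggle to maintain consistency across conditions when given multiple reference inputs, because interaction between the inputs is limited. The paper targets scalability and consistency in multi-reference image editing and composition.

Result: The two-stage training framework (SFT and RL) and the SELF representation improve visual fidelity and cross-reference consistency. The RL stage uses the purpose-built MSGRPO framework to optimize the model to reconcile conflicting visual constraints.

Insight: The main contributions are: 1) the SELF representation, which dynamically serializes multiple reference images into a coherent latent sequence; 2) a progressive sequence length training strategy that raises the total pixel budget step by step (from $1024^2$ to $1536^2$ and $2048^2$) to balance detail and consistency; 3) MSGRPO, which the authors describe as the first reinforcement learning framework tailored for multi-reference image generation, used to strengthen compositional consistency.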

Abstract: We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
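The global pixel-budget constraint can be illustrated with a small sketch. It assumes that all reference images are shrunk by one shared scale factor and never upscaled; the abstract does not specify either detail, and `fit_to_pixel_budget` is a hypothetical helper, not part of the released system.

```python
import math


def fit_to_pixel_budget(sizes, budget):
    """Scale all (width, height) pairs by one shared factor so that the
    total pixel count does not exceed `budget` (assumed resizing rule)."""
    total = sum(w * h for w, h in sizes)
    if total <= budget:
        return list(sizes)  # already within budget, leave untouched
    scale = math.sqrt(budget / total)
    # Flooring keeps the resized total at or below the budget.
    return [(max(1, math.floor(w * scale)), max(1, math.floor(h * scale)))
            for w, h in sizes]


# Two toy reference images, resized under the three budgets named in the abstract.
refs = [(2048, 2048), (1024, 1024)]
schedule = {side: fit_to_pixel_budget(refs, side ** 2) for side in (1024, 1536, 2048)}
```

Raising the budget from stage to stage lets each reference keep more of its original resolution, which matches the abstract's description of a gradual relaxation of compression.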


[115] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery cs.CV | cs.AI

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang

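TL;DR: This paper proposes GeoEyes, a staged training framework for question answering on ultra-high-resolution (UHR) remote sensing imagery that addresses tool usage homogenization in existing zoom-enabled multimodal large language models. It comprises a cold-start supervised fine-tuning dataset, UHR-CoZ, and a reinforcement learning method for adaptive zooming, AdaZoom-GRPO, which together teach the model to zoom on demand and to stop at the right time. It reaches 54.23% accuracy on XLRS-Bench.

Details

Motivation: Existing zoom-enabled multimodal large language models suffer from tool usage homogenization in UHR remote sensing VQA: zoom calls collapse into task-agnostic patterns, which limits effective evidence acquisition.

Result: The model achieves substantial improvements on UHR remote sensing benchmarks, including 54.23% accuracy on XLRS-Bench.

Insight: The novelty is a staged training framework that combines an SFT dataset covering diverse zooming regimes with a reinforcement learning method that explicitly rewards evidence gain and answer improvement, producing an on-demand, adaptive visual focusing policy.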

Abstract: The “thinking-with-images” paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.


[116] HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming cs.CV

Jiahui Chen, Bo Peng, Lianchen Jia, Zeyu Zhang, Tianchi Huang

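TL;DR: HiVid is a framework that uses large language models (LLMs) as a scalable human proxy to generate high-fidelity importance weights for video-on-demand (VOD) and live streaming, in order to optimize subjective quality of experience (QoE) in content-aware streaming. A perception module handles local context, a ranking module performs global re-ranking, and a prediction module enables low-latency online inference, addressing the poor generalization of existing visual saliency models.

Details

Motivation: Content-aware streaming needs dynamic, chunk-level importance weights to optimize subjective QoE, but direct human annotation is expensive and existing visual saliency models generalize poorly. A scalable way to generate high-quality importance weights is therefore needed.

Result: Extensive experiments show that HiVid improves weight prediction accuracy over SOTA baselines by up to 11.5% for VOD and 26% for live streaming. A real-world user study shows that HiVid raises streaming QoE correlation by 14.7%.

Insight: The novelty is the first use of LLMs as a scalable human proxy for generating video importance weights, supported by three modules: a local-context perception module, a global re-ranking module based on LLM-guided merge sort, and a low-latency multimodal time series prediction module for live streaming. The core insight is that the semantic understanding of LLMs compensates for the limits of vision-only models, while the modular design handles the distinct challenges of VOD and live streaming separately.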

Abstract: Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address 3 non-trivial challenges: (1) To extend LLMs’ limited modality and circumvent token limits, we propose a perception module to assess frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD with rating inconsistency across local windows, we propose a ranking module to perform global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming which requires low-latency, online inference without future knowledge, we propose a prediction module to predict future weights with a multi-modal time series model, which comprises a content-aware attention and adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight prediction accuracy by up to 11.5% for VOD and 26% for live streaming over SOTA baselines. Real-world user study validates HiVid boosts streaming QoE correlation by 14.7%.
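The global re-ranking step in challenge (2) can be pictured as an ordinary merge sort whose comparison is delegated to a judge. The sketch below is a generic illustration, not HiVid's algorithm: the `prefers` callback stands in for an LLM call, and `toy_judge` with its fixed scores is a placeholder so the example runs offline.

```python
def merge_sort(items, prefers):
    """Rank items from most to least important with a pairwise judge.
    `prefers(a, b)` returns True when `a` should rank above `b`."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], prefers)
    right = merge_sort(items[mid:], prefers)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # One judge call per merge step; ties keep the left item first.
        if prefers(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Placeholder judge: a real system would ask an LLM to compare two chunks.
scores = {"intro": 0.2, "goal": 0.9, "replay": 0.6, "credits": 0.1}


def toy_judge(a, b):
    return scores[a] > scores[b]


ranking = merge_sort(list(scores), toy_judge)
```

A merge sort needs on the order of n log n pairwise comparisons, which bounds the number of judge calls and is one plausible reason to prefer it over comparing every pair of chunks.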


[117] Freq-DP Net: A Dual-Branch Network for Fence Removal using Dual-Pixel and Fourier Priors cs.CV

Kunal Swami, Sudha Velusamy, Chandra Sekhar Seelamantula

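TL;DR: This paper proposes Freq-DP Net, a dual-branch network for single-image fence removal. It is the first to use dual-pixel (DP) sensors for this task, combining a geometric prior from defocus disparity with a structural prior of the fence's global pattern learned via Fast Fourier Convolution (FFC), and fusing the two with an attention mechanism.

Details

Motivation: Removing fence occlusions from a single image is difficult: existing methods often fail on static scenes or need motion cues from multiple frames. The paper uses dual-pixel sensor data to overcome these limitations.

Result: On a diverse benchmark of fence types that the authors build and release, the method significantly outperforms strong general-purpose baselines and sets a new state of the art for single-image, DP-based fence removal.

Insight: The novelty is the first use of DP sensor data for fence removal, together with a dual-branch network that fuses geometric and structural priors. Reusable ideas include modeling defocus disparity with an explicit cost volume and capturing global patterns with FFC.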

Abstract: Fence occlusions degrade visual quality and limit downstream computer vision applications, and removing them from single images is a challenging task. Existing methods often fail on static scenes or require motion cues from multiple frames. To overcome these limitations, we introduce the first framework to leverage dual-pixel (DP) sensors for this problem. We propose Freq-DP Net, a novel dual-branch network that fuses two complementary priors: a geometric prior from defocus disparity, modeled using an explicit cost volume, and a structural prior of the fence’s global pattern, learned via Fast Fourier Convolution (FFC). An attention mechanism intelligently merges these cues for highly accurate fence segmentation. To validate our approach, we build and release a diverse benchmark with different fence varieties. Experiments demonstrate that our method significantly outperforms strong general-purpose baselines, establishing a new state-of-the-art for single-image, DP-based fence removal.


[118] Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models cs.CV | cs.AI | cs.LG | cs.PF

Vishnu Sai, Dheeraj Sai, Srinath B, Girish Varma, Priyesh Shukla

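TL;DR: This paper proposes Sali-Cache, an a priori optimization framework that combines a temporal filter based on optical flow analysis with a spatial filter based on saliency detection to manage the KV cache of vision-language models adaptively when they process long videos, substantially reducing memory use while preserving model performance.

Details

Motivation: To address the memory bottleneck caused by the KV cache growing linearly with sequence length when vision-language models process long-form video, and to avoid the computational waste of existing reactive eviction strategies, which compute full attention matrices before discarding tokens.

Result: Experiments on the LLaVA 1.6 architecture show a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy on BLEU, ROUGE-L, and Exact Match. Under the same memory budget, the method preserves context-rich features without degrading model performance.

Insight: The novelty is a proactive, a priori memory management framework that fuses temporal (optical flow) and spatial (saliency) signals to predict and retain key information, so that the KV cache is optimized before the expensive attention computation, balancing compute efficiency with model accuracy.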

Abstract: Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.
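The idea of filtering tokens before attention is computed can be sketched with two toy signals. Everything below is an assumption made for illustration: the abstract does not state how the temporal and spatial filters are combined, the thresholds are arbitrary, and `select_tokens` is a hypothetical name. Motion and saliency values are supplied directly instead of being computed from optical flow or a saliency detector.

```python
def select_tokens(frames, motion_threshold, saliency_threshold):
    """Return (frame_idx, patch_idx) pairs to keep in the cache.

    Each frame is a dict with 'motion' (change relative to the previous
    frame) and 'saliency' (one score per patch). Both rules are assumed.
    """
    kept = []
    for f_idx, frame in enumerate(frames):
        # Temporal filter: skip later frames that barely differ from the previous one.
        if f_idx > 0 and frame["motion"] < motion_threshold:
            continue
        # Spatial filter: keep only patches judged visually salient.
        for p_idx, score in enumerate(frame["saliency"]):
            if score >= saliency_threshold:
                kept.append((f_idx, p_idx))
    return kept


frames = [
    {"motion": 0.00, "saliency": [0.9, 0.1, 0.7, 0.2]},
    {"motion": 0.05, "saliency": [0.9, 0.1, 0.7, 0.2]},  # near-duplicate frame
    {"motion": 0.80, "saliency": [0.3, 0.8, 0.6, 0.1]},
]
kept = select_tokens(frames, motion_threshold=0.1, saliency_threshold=0.5)
total_tokens = sum(len(f["saliency"]) for f in frames)
compression_ratio = total_tokens / len(kept)
```

In this toy input 4 of 12 patch tokens survive, so the ratio is 3.0; the 2.20x figure in the abstract is a measured result on LLaVA 1.6 and is unrelated to these numbers.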


[119] AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks cs.CV | cs.AI

Kunal Swami, Raghu Chittersu, Yuvraj Rathore, Rajeev Irny, Shashavali Doodekula

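TL;DR: This paper proposes AbracADDbra, a framework that decouples the placement and editing subtasks and uses touch priors for intuitive object addition, improving user-friendliness and editing precision.

Details

Motivation: Instruction-based object addition suffers from ambiguous text prompts and tedious mask-based inputs. The goal is better interactive usability.

Result: On the Touch2Add benchmark, the placement model significantly outperforms random placement and general-purpose VLM baselines, confirming the framework's ability to produce high-fidelity edits.

Insight: The novelty is the decoupled architecture (a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask) and the finding that initial placement accuracy correlates strongly with final edit quality, which is relevant to the design of creative tools.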

Abstract: Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Our extensive evaluations, where our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework’s ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.


[120] Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision cs.CV

A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar

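TL;DR: This paper introduces ScreenParse, a large-scale dataset for complete screen parsing with dense annotations of all visible UI elements across 771K web screenshots (21M elements), and uses it to train ScreenVLM, a compact vision-language model. ScreenVLM substantially outperforms larger foundation VLMs on dense parsing, and ScreenParse data also improves the interface understanding of other VLMs.

Details

Motivation: Interface understanding in current computer-use agents (CUAs) relies on sparsely annotated datasets with insufficient coverage and low diversity, which limits coverage and generalization. Practical deployment also requires efficient, low-latency models.

Result: ScreenVLM (316M parameters) substantially outperforms much larger foundation VLMs on dense parsing (e.g., PageIoU of 0.592 vs. 0.294 on ScreenParse) and transfers strongly to public benchmarks.

Insight: The contributions are: 1) ScreenParse, a large-scale, densely annotated screen parsing dataset, and Webshot, the automated pipeline that generates it; 2) the compact ScreenVLM model, which uses a structure-aware loss and the compact ScreenTag markup representation; 3) evidence that dense screen supervision provides transferable structural priors for UI understanding.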

Abstract: Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.


[121] Differential pose optimization in descriptor space – Combining Geometric and Photometric Methods for Motion Estimation cs.CV

Andreas L. Teigen, Annette Stahl, Rudolf Mester

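TL;DR: This paper proposes a motion estimation strategy that combines geometric and photometric methods: it replaces the photometric error with a residual over densely sampled geometric feature descriptors and optimizes the two-frame relative pose in descriptor space. Experiments show that the method tracks accurately but does not outperform pose optimization based on re-projection error.

Details

Motivation: Two-frame relative pose optimization traditionally chooses between photometric error and re-projection error, trading off accuracy, robustness, and the possibility of loop closing. The paper aims to combine the strengths of both paradigms in a unified approach.

Result: The method achieves accurate tracking but does not outperform pose optimization strategies based on re-projection error, despite using more information.

Insight: The novelty is bringing densely sampled geometric feature descriptors into differential photometric methods, which allows sub-pixel accuracy. The authors' analysis suggests that the descriptor similarity metric varies too slowly and may not correspond strictly to keypoint placement accuracy, which would explain the shortfall.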

Abstract: One of the fundamental problems in computer vision is the two-frame relative pose optimization problem. Primarily, two different kinds of error values are used: photometric error and re-projection error. The selection of error value is usually directly dependent on the selection of feature paradigm, photometric features, or geometric features. It is a trade-off between accuracy, robustness, and the possibility of loop closing. We investigate a third method that combines the strengths of both paradigms into a unified approach. Using densely sampled geometric feature descriptors, we replace the photometric error with a descriptor residual from a dense set of descriptors, thereby enabling the employment of sub-pixel accuracy in differential photometric methods, along with the expressiveness of the geometric feature descriptor. Experiments show that although the proposed strategy is an interesting approach that results in accurate tracking, it ultimately does not outperform pose optimization strategies based on re-projection error despite utilizing more information. We proceed to analyze the underlying reason for this discrepancy and present the hypothesis that the descriptor similarity metric is too slowly varying and does not necessarily correspond strictly to keypoint placement accuracy.
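The three error terms contrasted in the abstract can be written side by side. This is a sketch in notation assumed here, not taken from the paper: $T$ is the relative pose, $X_i$ a 3D point expressed in the first frame, $\pi$ the camera projection, $x_i$ and $x_i'$ the matched keypoint locations in frames 1 and 2, $I_k$ the image intensity, and $D_k$ a dense descriptor map of frame $k$.

$$
r_i^{\mathrm{reproj}}(T) = \pi(T X_i) - x_i', \qquad
r_i^{\mathrm{photo}}(T) = I_2\big(\pi(T X_i)\big) - I_1(x_i), \qquad
r_i^{\mathrm{desc}}(T) = D_2\big(\pi(T X_i)\big) - D_1(x_i)
$$

In each case the pose is estimated by minimizing a sum of squared residual norms over $i$. Under this reading, the descriptor residual keeps the differential form of the photometric term but compares descriptor vectors instead of intensities.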


[122] Event-based Visual Deformation Measurement cs.CV

Yuliang Wu, Wei Zhai, Yuxin Cui, Tiesong Zhao, Yang Cao

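TL;DR: This paper proposes a visual deformation measurement framework that fuses an event camera with a conventional frame camera. Event streams supply motion cues at high temporal resolution and image frames supply spatially dense, precise estimates, which addresses the failure of image-only methods in highly dynamic scenes and the storage and compute overhead of high-speed cameras.

Details

Motivation: Traditional image-based visual deformation measurement relies on minimal inter-frame motion to constrain the correspondence search space. This limits its use in highly dynamic scenes or requires high-speed cameras, with prohibitive storage and compute costs.

Result: On a benchmark of over 120 sequences covering diverse deformation scenarios, the method outperforms the state-of-the-art baseline by 1.6% in survival rate while using only 18.9% of the data storage and processing resources of high-speed video methods.

Insight: The contributions are: 1) an event-frame fusion framework that combines the temporal density of events with the spatial precision of frames; 2) an Affine Invariant Simplicial (AIS) framework that partitions the deformation field into linearized sub-regions with a low-parametric representation, mitigating the motion ambiguities caused by sparse and noisy events; 3) a neighborhood-greedy optimization strategy that speeds up parameter search and reduces local error accumulation in long-term dense tracking.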

Abstract: Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. Revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework. It partitions the deformation field into linearized sub-regions with low-parametric representation, effectively mitigating motion ambiguities arising from sparse and noisy events. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors and effectively suppressing local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and frames is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that our method outperforms the state-of-the-art baseline by 1.6% in survival rate. Remarkably, it achieves this using only 18.9% of the data storage and processing resources of high-speed video methods.


[123] Adapting VACE for Real-Time Autoregressive Video Diffusion cs.CV | cs.AI

Ryan Fosdick

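TL;DR: This paper adapts VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, which reuses the pretrained weights without additional training and supports streaming with fixed chunk sizes and causal attention.

Details

Motivation: VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension), but its design assumes bidirectional attention over full sequences, which makes it incompatible with streaming pipelines that need fixed chunk sizes and causal attention. The paper resolves this incompatibility so that VACE can be used for real-time autoregressive video generation.

Result: At 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity, however, is severely degraded compared with batch VACE because of causal attention constraints.

Insight: The main novelty is adapting a pretrained unified video control model to autoregressive streaming through an architectural change (moving reference frames to a parallel conditioning pathway) instead of retraining. This offers a lightweight route for deploying batch diffusion models in real-time applications, while the marked loss of fidelity exposes a tension between causal modeling and high-quality reference guidance.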

Abstract: We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.


[124] Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models cs.CV

In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song

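TL;DR: This paper proposes MAPA, a multi-turn adaptive prompting attack on large vision-language models (LVLMs). It alternates text and vision attack actions and refines them iteratively across dialogue turns to gradually amplify the maliciousness of responses, bypassing the defenses of safety-aligned models.

Details

Motivation: Multi-turn jailbreak attacks designed for text-only LLMs break down when extended to LVLMs, because overly malicious visual inputs easily trigger safety defenses. An attack that adapts its strategy is therefore needed.

Result: On recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct, and GPT-4o-mini, MAPA improves attack success rates by 11-35% over state-of-the-art methods.

Insight: The novelty is a two-level adaptive design: within a turn, the attack alternates text and vision actions to elicit the most malicious response; across turns, it adjusts the attack trajectory through iterative back-and-forth refinement. This offers a new angle for evaluating and strengthening the safety of multimodal models.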

Abstract: Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.


[125] pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI cs.CV | cs.AI

Qingqian Yang, Hao Wang, Sai Qian Zhang, Jian Li, Yang Hua

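TL;DR: This paper proposes pFedNavi, a structure-aware, dynamically adaptive personalized federated learning framework for vision-language navigation (VLN). It adaptively identifies client-specific layers and performs fine-grained parameter fusion to cope with the extreme heterogeneity of environments and instruction styles in VLN, and it clearly outperforms the FedAvg-based baseline on the R2R and RxR benchmarks.

Details

Motivation: VLN requires large-scale trajectory instruction data from private indoor environments, which raises serious privacy concerns. Federated learning mitigates this by keeping data local, but standard federated learning struggles with VLN's extreme cross-client heterogeneity in environments and instruction styles, so a single global model performs poorly.

Result: On the two standard VLN benchmarks R2R and RxR, evaluated with both ResNet and CLIP visual representations, pFedNavi consistently outperforms the FedAvg-based VLN baseline on all metrics, with up to 7.5% higher navigation success rate, up to 7.8% higher trajectory fidelity, and 1.38x faster convergence under non-IID conditions.

Insight: The core idea is targeted personalization: layer-wise mixing coefficients adaptively identify client-specific layers, and fine-grained parameter fusion is applied to the selected components (such as the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. This is a structure-aware approach to personalized federated learning.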

Abstract: Vision-Language Navigation (VLN) requires large-scale trajectory instruction data from private indoor environments, raising significant privacy concerns. Federated Learning (FL) mitigates this by keeping data on-device, but vanilla FL struggles under VLN’s extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients, and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to 7.5% improvement in navigation success rate and up to 7.8% gain in trajectory fidelity, while converging 1.38x faster under non-IID conditions.
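Layer-wise mixing can be pictured as a per-layer interpolation between a client's local weights and the shared global weights. The sketch below assumes the simplest linear form, `alpha * local + (1 - alpha) * global`; the abstract does not give pFedNavi's actual fusion rule or how the coefficients are learned, and the layer names and `mix_layers` helper are hypothetical.

```python
def mix_layers(local, global_, alpha):
    """Blend each layer as alpha * local + (1 - alpha) * global (assumed form).

    `local` and `global_` map layer names to flat weight lists;
    `alpha` maps layer names to a mixing coefficient in [0, 1].
    """
    mixed = {}
    for name, local_w in local.items():
        a = alpha[name]
        mixed[name] = [a * lw + (1.0 - a) * gw
                       for lw, gw in zip(local_w, global_[name])]
    return mixed


local = {"encoder": [1.0, 2.0], "decoder": [4.0, 6.0]}
global_ = {"encoder": [3.0, 4.0], "decoder": [0.0, 2.0]}

# alpha = 0 keeps the shared layer, alpha = 1 keeps the client-specific layer.
personalized = mix_layers(local, global_, {"encoder": 0.0, "decoder": 1.0})
halfway = mix_layers(local, global_, {"encoder": 0.5, "decoder": 0.5})
```

A coefficient near 1 marks a layer as client-specific, while a coefficient near 0 leaves it effectively shared, which is one way to read "personalize where it matters".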


[126] Feature Recalibration Based Olfactory-Visual Multimodal Model for Fine-Grained Rice Deterioration Detection cs.CV | cs.AI

Rongqiang Zhao, Hengrui Hu, Yijing Wang, Mingchun Sun, Jie Liu

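TL;DR: This paper proposes a feature recalibration based olfactory-visual multimodal model for fine-grained rice deterioration detection. A fine-grained deterioration embedding constructor (FDEC) reconstructs the labeled multimodal embedded-feature dataset to enhance sample representation, and a fine-grained deterioration recalibration attention network (FDRA-Net) emphasizes signal variations and increases sensitivity to fine-grained deterioration on the rice surface. The method reaches 99.89% classification accuracy, outperforms existing methods, simplifies the detection procedure, and can be extended to other agrifood domains.

Details

Motivation: Existing multimodal methods for rice deterioration detection have limited ability to represent and extract fine-grained abnormal features, and they rely on expensive devices such as hyperspectral cameras and mass spectrometers, which raises detection costs and lengthens data acquisition.

Result: On fine-grained rice deterioration detection, the method achieves 99.89% classification accuracy, improving on existing SOTA methods while simplifying the detection procedure. Field detection also confirms its advantages in accuracy and operational simplicity.

Insight: The contributions are: 1) FDEC, which reconstructs the multimodal embedded-feature dataset to enhance fine-grained sample representation; 2) FDRA-Net, which recalibrates features through attention to increase sensitivity to fine-grained deterioration signals. More broadly, combining olfactory and visual modalities with feature recalibration suggests a route to low-cost, efficient fine-grained agrifood inspection.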

Abstract: Multimodal methods are widely used in rice deterioration detection, but they exhibit limited capability in representing and extracting fine-grained abnormal features. Moreover, these methods rely on devices such as hyperspectral cameras and mass spectrometers, which increases detection costs and prolongs data acquisition time. To address these issues, we propose a feature recalibration based olfactory-visual multimodal model for fine-grained rice deterioration detection. The fine-grained deterioration embedding constructor (FDEC) is proposed to reconstruct the labeled multimodal embedded-feature dataset, enhancing sample representation. The fine-grained deterioration recalibration attention network (FDRA-Net) is proposed to emphasize signal variations and increase sensitivity to fine-grained deterioration on the rice surface. Experiments show that the proposed method achieves a classification accuracy of 99.89%. Compared with state-of-the-art methods, the detection accuracy is improved and the procedure is simplified. Furthermore, field detection demonstrates the advantages of accuracy and operational simplicity. The proposed method can also be extended to other agrifood products in the agriculture and food industries.


[127] Learning Proposes, Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning cs.CV

Haichao Zhu, Zhaorui Yang, Qian Zhang

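TL;DR: This paper proposes a modular framework for efficient spatial reasoning built on the principle that learning proposes geometric hypotheses while geometric algorithms dispose estimation decisions. The framework is validated on relative camera pose estimation for RGB-D sequences, with VGGT as the learning model that generates pose and depth proposals and classical point-to-plane RGB-D ICP as the geometric backend.

Details

Motivation: In spatial perception, it remains an open question whether learning components should directly replace geometric estimation or serve as intermediate modules inside a traditional geometric pipeline. The paper explores how learning and geometry can best work together.

Result: Experiments on the TUM RGB-D benchmark show that learning-based pose proposals alone are unreliable; that learning-proposed geometry degrades performance when it is not properly aligned with the camera intrinsics; and that learning-proposed depth, once geometrically aligned and processed by the geometric backend, yields consistent improvements in moderately challenging rigid scenes.

Insight: The novelty is establishing that the geometric module is more than a refinement component: it acts as an arbiter that validates and absorbs learning-based observations. The study underlines the importance of modular, geometry-aware system design for robust spatial perception and provides a concrete framework and empirical evidence for combining learning with geometry.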

Abstract: Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, where learning proposes geometric hypotheses, while geometric algorithms dispose estimation decisions. In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and followed by a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings. These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception.
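For readers unfamiliar with the geometric backend, the sketch below shows the standard point-to-plane residual that ICP variants of this kind minimize: the offset between a transformed source point and its matched target point, measured along the target's surface normal. It is a textbook illustration only, not the paper's implementation; the function names are hypothetical, points are assumed to be already transformed by the current pose estimate, and the correspondence search and pose update are omitted.

```python
def point_to_plane_residual(src, dst, normal):
    """Signed distance from `src` to the plane through `dst` with unit `normal`."""
    return sum((s - d) * n for s, d, n in zip(src, dst, normal))


def point_to_plane_error(sources, targets, normals):
    """Sum of squared point-to-plane residuals over all correspondences."""
    return sum(point_to_plane_residual(s, t, n) ** 2
               for s, t, n in zip(sources, targets, normals))


# Toy correspondences on the plane z = 3 with normal (0, 0, 1).
targets = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
sources = [(1.0, 2.0, 3.5),   # 0.5 above the plane
           (4.0, -1.0, 3.0)]  # on the plane, only slid sideways
error = point_to_plane_error(sources, targets, normals)
```

Sliding a point within the plane costs nothing under this metric, which is why point-to-plane ICP tolerates imperfect correspondences on flat surfaces better than a point-to-point distance would.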


[128] Hierarchical Vision-Language Interaction for Facial Action Unit Detection cs.CV

Yong Li, Yi Ren, Yizhe Zhang, Wenhua Zhang, Tianyi Zhang

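TL;DR: This paper proposes HiVA, a hierarchical vision-language interaction method for facial action unit (AU) detection. It uses a large language model to generate rich textual AU descriptions as semantic priors, and learns fine-grained and holistic vision-language associations through an AU-aware dynamic graph module and a hierarchical cross-modal attention architecture (Disentangled Dual Cross-Attention and Contextual Dual Cross-Attention), improving the discriminability and generalization of AU representations.

Details

Motivation: Addresses the main challenge of learning discriminative and generalizable facial action unit (AU) representations when annotated data is limited.

Result: Extensive experiments show that HiVA consistently surpasses state-of-the-art (SOTA) methods on multiple benchmarks; qualitative analysis shows that it produces semantically meaningful activation patterns.

Insight: The novelty lies in using LLM-generated textual AU descriptions as semantic priors to guide visual representation learning, and in a hierarchical cross-modal attention architecture (DDCA and CDCA) that jointly models fine-grained AU-specific interactions and global inter-AU dependencies, yielding robust and interpretable cross-modal correspondences.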

Abstract: Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection capabilities. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. In addition, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.


[129] D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection cs.CV

Gagandeep Singh, Samudi Amarasinghe, Priyanka Singh

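TL;DR: D-SECURE is a framework for detecting multimodal misinformation. It handles news-style posts in a unified way by combining internal manipulation detection (HAMMER) with external evidence-based reasoning (DEFAME), addressing the weakness of single-source systems at catching fine-grained visual or textual manipulations.

Details

Motivation: Existing misinformation detectors usually rely on a single evidence source (content-based detection of local inconsistencies, or retrieval-based external fact-checking). They struggle with multimodal misinformation that mixes realistic image edits with misleading text, so internally consistent fabrications and claims containing subtle manipulations go undetected.

Result: Experiments on the DGM4 and ClaimReview datasets show that HAMMER and DEFAME have complementary strengths and that fusing them can improve detection performance.

Insight: The novelty lies in combining internal manipulation detection with external evidence retrieval in a unified, explainable reasoning framework: DEFAME performs broad verification, and HAMMER then analyses residual or uncertain cases to catch fine-grained edits, giving more complete coverage of multimodal misinformation.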

Abstract: Multimodal misinformation increasingly mixes realistic image edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.


[130] CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer cs.CV | cs.AI

Wenbo Nie, Zixiang Li, Renshuai Tao, Bin Wu, Yunchao Wei

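TL;DR: CoCoDiff is a training-free, low-cost framework for fine-grained style transfer that uses pretrained latent diffusion models to achieve semantically consistent stylization. It mines intermediate diffusion features to build a dense alignment map between content and style images, and adds a cycle-consistency module that keeps structure aligned across iterations, achieving object-level and region-level style transfer while preserving geometric detail.

Details

Motivation: Most existing style transfer methods focus on global style and overlook region-level and even pixel-level semantic correspondence, which leads to weak semantic consistency. CoCoDiff targets the challenge of preserving semantic correspondence in fine-grained style transfer by exploiting under-explored correspondence cues inside pretrained diffusion models.

Result: Without any additional training or annotation, CoCoDiff achieves state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.

Insight: The novel points are a pixel-wise semantic correspondence module built from the intermediate features of a pretrained diffusion model, and a cycle-consistency module that strengthens structural alignment. Viewed objectively, the method mines the intrinsic correspondences of a generative model without supervision, offering an efficient and scalable solution for fine-grained style transfer.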

Abstract: Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at global level but overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object and region level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
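A dense alignment map of the kind described above can be sketched as nearest-neighbour matching between per-location feature vectors. The sketch below is an illustration, not CoCoDiff's implementation: the tiny 2-D feature vectors stand in for intermediate diffusion features, and the forward-backward (mutual nearest neighbour) check is a classic filter swapped in for the paper's cycle-consistency module, whose exact form the abstract does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def nearest(query, bank):
    """Index of the feature in `bank` most similar to `query`."""
    return max(range(len(bank)), key=lambda j: cosine(query, bank[j]))


def cycle_consistent_matches(content_feats, style_feats):
    """Keep content->style matches that map back to the same content location."""
    matches = {}
    for i, feat in enumerate(content_feats):
        j = nearest(feat, style_feats)
        if nearest(style_feats[j], content_feats) == i:
            matches[i] = j
    return matches


# Made-up features: one vector per spatial location.
content = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
style = [[0.1, 0.9], [0.9, 0.1]]
matches = cycle_consistent_matches(content, style)
print(matches)
```

Content location 2 is dropped because its best style match prefers content location 0 on the way back, which is the kind of ambiguous correspondence a consistency check is meant to filter out.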


[131] TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning cs.CV | cs.AI

Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu

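TL;DR: This paper proposes TikArt, an aperture-guided visual reasoning agent trained with reinforcement learning. It targets the loss of tiny, cluttered, or subtle evidence under global image encoding in fine-grained visual reasoning with multimodal large language models. TikArt follows a Think-Aperture-Observe loop that alternates between language generation and two aperture actions (Zoom and Segment), turning local visual cues into persistent linguistic memory. It is built on Qwen3-VL-8B and optimizes its reasoning policy with the AGRPO reinforcement learning algorithm.

Details

Motivation: In fine-grained visual reasoning, a single global image encoding in multimodal large language models fails to capture key evidence such as tiny objects, cluttered regions, or subtle markings.

Result: On V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg, TikArt achieves consistent gains over the backbone model and produces interpretable aperture trajectories that support high-resolution reasoning.

Insight: The novel points are: 1) modelling multi-step vision-language reasoning as a decision process over regions of interest, with a Think-Aperture-Observe loop; 2) combining two aperture actions, rectangular cropping (Zoom) and SAM2-based mask cropping (Segment); 3) optimizing the policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum, whose rewards couple task success with purposeful aperture use.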

Abstract: We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
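To make the Zoom action concrete, here is a minimal sketch of a rectangular crop followed by an explicit observation appended to a running memory. Everything in it is assumed for illustration: the `zoom` function, the `(x0, y0, x1, y1)` box format, the nested-list image, and the observation string are hypothetical and do not come from TikArt's code. The Segment action (SAM2 mask crops) is not shown.

```python
def zoom(image, box):
    """Crop a rectangle (x0, y0, x1, y1) from a row-major image, clamped to bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        raise ValueError("empty crop")
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x6 "image" whose pixel value encodes its (row, column) position.
image = [[r * 10 + c for c in range(6)] for r in range(4)]

memory = []                        # persistent linguistic memory across steps
crop = zoom(image, (4, 1, 9, 3))   # the box overshoots the right edge and is clamped
memory.append(f"zoomed crop is {len(crop)}x{len(crop[0])} pixels")
print(crop, memory)
```

Clamping the box rather than rejecting it is a design choice made here so that a slightly out-of-range action still yields a usable observation.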


[132] Uncertainty-Aware Vision-Language Segmentation for Medical Imaging cs.CV | cs.LG

Aryan Das, Tanishq Rachamalla, Koushik Biswas, Swalpa Kumar Roy, Vinay Kumar Verma

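TL;DR: This paper proposes a novel uncertainty-aware multimodal segmentation framework for medical imaging diagnosis that uses both radiological images and the associated clinical text. It introduces a Modality Decoding Attention Block (MoDAB) and a lightweight State Space Mixer (SSMix) for efficient cross-modal fusion and long-range dependency modelling, and a Spectral-Entropic Uncertainty (SEU) loss that guides learning in ambiguous cases to improve model reliability.

Details

Motivation: Addresses how to fuse visual and language modalities effectively and model predictive uncertainty in complex clinical situations such as poor image quality, in order to achieve more accurate and reliable medical image segmentation.

Result: Extensive experiments on several public medical datasets (QATA-COVID19, MosMed++, Kvasir-SEG) show that the method achieves superior segmentation performance and is significantly more computationally efficient than existing state-of-the-art (SoTA) methods.

Insight: The main novel points are: 1) an efficient cross-modal fusion and modelling architecture combining MoDAB and SSMix; 2) the SEU loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty. The core message is the importance of combining uncertainty modelling with structured modality alignment in vision-language medical segmentation.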

Abstract: We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS


[133] Efficient Text-Guided Convolutional Adapter for the Diffusion Model cs.CV

Aryan Das, Koushik Biswas, Swalpa Kumar Roy, Badri Narayana Patro, Vinay Kumar Verma

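TL;DR: This paper proposes Nexus Adapters, novel text-guided efficient adapters for the structure-preserving conditional generation framework of diffusion models. It introduces two adapters, Nexus Prime and Nexus Slim, that use cross-attention to accept multimodal conditioning, so the adapter understands the input text prompt while preserving structural information, with markedly better parameter efficiency.

Details

Motivation: In existing structure-preserving conditional generation methods, adapters are parameter-inefficient and ignore the text prompt. The goal is an adapter that is both efficient and able to understand text and structural inputs together.

Result: In experiments, the Nexus Prime adapter significantly outperforms the T2I-Adapter baseline with only 8M additional parameters, while the lighter Nexus Slim adapter has 18M fewer parameters than T2I-Adapter and still achieves state-of-the-art results.

Insight: The novelty lies in a text-guided, parameter-efficient adapter design that fuses text and structural information through cross-attention, preserving structure while understanding the text prompt, and offering a new direction for efficient conditional generation with diffusion models.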

Abstract: We introduce the Nexus Adapters, novel text-guided efficient adapters for the diffusion-based framework for Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require as many parameters in the adapter as in the base architecture. It is not always possible to train the model since the diffusion model is itself costly, and doubling the parameter count is highly inefficient. In these approaches, the adapter is not aware of the input prompt; therefore, it is optimal only for the structural input but not for the input prompt. To overcome the above challenges, we propose two efficient adapters, Nexus Prime and Slim, which are guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. Therefore, the proposed adapter has a better understanding of the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance, requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduce a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieves state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters


[134] Architectural Insights for Post-Tornado Damage Recognition cs.CV

Robinson Umeike, Thang Dao, Shane Crawford, John van de Lindt, Blythe Johnston

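TL;DR: For post-tornado building damage recognition, this paper systematically evaluates 79 open-source deep learning models (CNNs and Vision Transformers) on the newly curated QSTD dataset. It finds that optimizer choice and learning-rate tuning affect performance far more than architecture itself, with ConvNeXt-Base achieving the best generalization under the optimized settings.

Details

Motivation: Existing methods for automated post-tornado building damage assessment underperform because of domain shift and class imbalance.

Result: On the QSTD benchmark, switching the optimizer from Adam to SGD raises F1 by 25 to 38 points for Vision Transformer and Swin Transformer; a learning rate of 1e-4 raises average F1 by 10.2 points across all architectures; the best model, ConvNeXt-Base, reaches 46.4% Macro F1 on the TMTD test set (+34.6 points over the baseline) and 85.5% Ordinal Top-1 Accuracy.

Insight: Optimization strategy (optimizer choice, learning-rate setting) may matter more than network architecture for disaster vision tasks; combining SGD with a low learning rate substantially improves Transformer-family models on domain-shifted data.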

Abstract: Rapid and accurate building damage assessment in the immediate aftermath of tornadoes is critical for coordinating life-saving search and rescue operations, optimizing emergency resource allocation, and accelerating community recovery. However, current automated methods struggle with the unique visual complexity of tornado-induced wreckage, primarily due to severe domain shift from standard pre-training datasets and extreme class imbalance in real-world disaster data. To address these challenges, we introduce a systematic experimental framework evaluating 79 open-source deep learning models, encompassing both Convolutional Neural Networks (CNNs) and Vision Transformers, across over 2,300 controlled experiments on our newly curated Quad-State Tornado Damage (QSTD) benchmark dataset. Our findings reveal that achieving operational-grade performance hinges on a complex interaction between architecture and optimization, rather than architectural selection alone. Most strikingly, we demonstrate that optimizer choice can be more consequential than architecture: switching from Adam to SGD provided dramatic F1 gains of +25 to +38 points for Vision Transformer and Swin Transformer families, fundamentally reversing their ranking from bottom-tier to competitive with top-performing CNNs. Furthermore, a low learning rate of 1x10^(-4) proved universally critical, boosting average F1 performance by +10.2 points across all architectures. Our champion model, ConvNeXt-Base trained with these optimized settings, demonstrated strong cross-event generalization on the held-out Tuscaloosa-Moore Tornado Damage (TMTD) dataset, achieving 46.4% Macro F1 (+34.6 points over baseline) and retaining 85.5% Ordinal Top-1 Accuracy despite temporal and sensor domain shifts.
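Because the results above are reported in Macro F1 under extreme class imbalance, a small worked example may help show why that metric is used. The sketch computes per-class F1 and averages the classes with equal weight; the class names and the ten toy labels are invented for illustration and are not taken from QSTD or TMTD.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: a model that always predicts the majority class.
y_true = ["none"] * 8 + ["destroyed"] * 2
y_pred = ["none"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, score)
```

The majority-class predictor reaches 80% accuracy but only about 44% Macro F1, because the rare class contributes an F1 of zero with the same weight as the common one.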


[135] Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model cs.CV

Ari Vesalainen, Eetu Mäkelä, Laura Ruotsalainen, Mikko Tolonen

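TL;DR: This study compares a dedicated OCR model (TrOCR) with a general-purpose vision-language model (Qwen) on eighteenth-century historical English texts. Qwen achieves better character error rate (CER) and word error rate (WER) and is more robust to degraded input, but it selectively regularizes language and normalizes spelling, which can silently alter the original form of historical text. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation.

Details

Motivation: Addresses the OCR challenges posed by degraded print quality, archaic glyphs, and non-standardized orthography in eighteenth-century printed texts, and examines the limits of aggregate metrics such as CER and WER for judging a model's reliability for scholarly use.

Result: On historical English texts, Qwen achieves lower CER/WER and is more robust to degraded input, while TrOCR is more consistent in orthographic fidelity. The two differ substantially in error locality, detectability, and downstream scholarly risk.

Insight: Architectural inductive biases systematically shape the structure of OCR errors. Models with similar aggregate accuracy can differ greatly in error characteristics, which argues for architecture-aware evaluation in historical digitization workflows rather than reliance on conventional accuracy metrics alone.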

Abstract: Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
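CER and WER are both edit distances normalized by reference length, which is why they cannot distinguish a fluent modernization from a garbled misreading. The sketch below uses the standard Levenshtein recurrence; the example line and its "normalized" output are made up to illustrate the orthographic normalization the abstract describes, and are not drawn from the paper's data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the publick musick"    # period spelling (hypothetical example)
normalized = "the public music"     # silently modernized output
print(cer(reference, normalized), wer(reference, normalized))
```

Here only 2 of 18 characters differ, so CER stays near 11%, yet two of three words have lost their historical spelling. An aggregate score alone would not reveal that the errors are systematic normalizations.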


[136] MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation cs.CV

Hongpeng Wang, Zeyu Zhang, Wenhao Li, Hao Tang

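TL;DR: MoRL is a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning, with task-specific reward design that improves logical reasoning and perceptual realism in motion understanding and generation. The work also introduces the Chain-of-Motion reasoning method and builds two large-scale CoT datasets, and reports significant gains over existing SOTA methods on the HumanML3D and KIT-ML benchmarks.

Details

Motivation: Human motion understanding and generation remain limited in reasoning capability and test-time planning; the aim is to improve logical consistency and physical plausibility with a unified model.

Result: On the HumanML3D and KIT-ML benchmarks, MoRL achieves significant gains over state-of-the-art baselines.

Insight: The novel points are: a reward design that combines semantic alignment, reasoning coherence, physical plausibility, and text-motion consistency; Chain-of-Motion, a test-time reasoning method for step-by-step planning and reflection; and large-scale CoT datasets that align motion sequences with reasoning traces.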

Abstract: Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: https://github.com/AIGeeksGroup/MoRL. Website: https://aigeeksgroup.github.io/MoRL.


[137] DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving cs.CV

Chenxu Dang, Sining Ang, Yongkang Li, Haochen Tian, Jie Wang

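TL;DR: This paper proposes DriveFine, an autonomous driving planner that combines masked diffusion with a Vision-Language-Action (VLA) model. A block mixture-of-experts (block-MoE) design places a refinement expert on top of the generation expert, giving flexible decoding and self-correction, and addressing the complementary weaknesses of diffusion-based and token-based planners in modality alignment, training efficiency, generalization, and error accumulation.

Details

Motivation: In current driving VLA models, diffusion-based planners suffer from difficult modality alignment, low training efficiency, and limited generalization, while token-based planners suffer from cumulative causal errors and irreversible decoding. Because their strengths and weaknesses are complementary, a method that combines the advantages of both is needed.

Result: Extensive experiments on the NAVSIM v1, v2, and Navhard benchmarks show that DriveFine delivers strong efficacy and robustness.

Insight: The novel points are: a block-MoE design that decouples the generation and refinement experts and preserves the foundational capabilities of the pretrained weights; and a hybrid reinforcement learning strategy that encourages effective exploration by the refinement expert while keeping training stable. The approach emphasizes flexibility and extensibility and gives VLA models a self-correction mechanism.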

Abstract: Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at https://github.com/MSunDYY/DriveFine.


[138] YOLO26: A Comprehensive Architecture Overview and Key Improvements cs.CV

Priyanto Hidayatullah, Refdinal Tubagus

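TL;DR: This paper gives a comprehensive architectural overview and analysis of the key improvements in YOLO26, the latest version in the YOLO series. It focuses on YOLO26's main enhancements in inference speed, model optimization, and task coverage, including the removal of DFL, end-to-end NMS-free inference, the ProgLoss + STAL label assignment strategy, and the MuSGD optimizer, all aimed at better real-time performance on edge devices.

Details

Motivation: To analyse the architectural details of YOLO26 in depth, fill gaps in the existing technical documentation, and give researchers and developers a precise architectural understanding that supports further improvement of YOLO models and keeps them at the forefront of computer vision.

Result: The paper reports a claimed 43% inference speed-up in CPU mode, along with improvements across several computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box decoding.

Insight: The novelty is the first systematic presentation of the CNN-based core architecture diagram of YOLO26, with its actual operating mechanisms uncovered through source-code analysis. Key improvements include the end-to-end NMS-free inference design, the label assignment strategy combining ProgLoss and STAL, and the MuSGD optimizer tuned for edge devices. These designs aim to balance speed and accuracy and to advance real-time detection on resource-constrained hardware.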

Abstract: You Only Look Once (YOLO) has been the prominent model for computer vision in deep learning for a decade. This study explores the novel aspects of YOLO26, the most recent version in the YOLO series. The elimination of Distribution Focal Loss (DFL), implementation of End-to-End NMS-Free Inference, introduction of ProgLoss + Small-Target-Aware Label Assignment (STAL), and use of the MuSGD optimizer are the primary enhancements designed to improve inference speed, which is claimed to achieve a 43% boost in CPU mode. This is designed to allow YOLO26 to attain real-time performance on edge devices or those without GPUs. Additionally, YOLO26 offers improvements in many computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box (OBB) decoding. We aim for this effort to provide more value than just consolidating information already included in the existing technical documentation. Therefore, we performed a rigorous architectural investigation into YOLO26, mostly using the source code available in its GitHub repository and its official documentation. The authentic and detailed operational mechanisms of YOLO26 are inside the source code, which is seldom extracted by others. The YOLO26 architectural diagram is shown as the outcome of the investigation. This study is, to our knowledge, the first one presenting the CNN-based YOLO26 architecture, which is the core of YOLO26. Our objective is to provide a precise architectural comprehension of YOLO26 for researchers and developers aspiring to enhance the YOLO model, ensuring it remains the leading deep learning model in computer vision.


[139] VariViT: A Vision Transformer for Variable Image Sizes cs.CV | cs.AI | cs.LG

Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li

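TL;DR: This paper proposes VariViT, a Vision Transformer that handles images of variable size. With a novel positional embedding resizing scheme and a batching strategy, it removes the need for the preprocessing (such as cropping and resizing) that a conventional ViT requires because its input size is fixed, and it performs well in medical image analysis.

Details

Motivation: A conventional ViT splits images into fixed-size patches and requires a fixed input size, which usually forces resizing, padding, or cropping. In medical imaging this preprocessing can lose information or introduce artefacts. For irregularly shaped lesions such as tumors in particular, a fixed crop size yields highly variable foreground-to-background ratios, which hurts feature representation and diagnostic accuracy.

Result: In evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViT and ResNet on glioma genotype prediction and brain tumor classification, with F1-scores of 75.5% and 76.3% respectively, and learns more discriminative features. The proposed batching strategy also reduces computation time by up to 30% compared with conventional architectures.

Insight: The main novel points are: 1) a positional embedding resizing scheme that accommodates a variable number of patches, so the model processes images of different sizes directly without forcing a change to their original dimensions; 2) a new batching strategy that lowers computational complexity and speeds up training and inference. Together they offer an effective way to apply Transformers in fields such as medical imaging, where original image size and detail must be preserved.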

Abstract: Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
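One common way to support a variable number of patches is to interpolate a learned positional-embedding table to the length a given input needs. The sketch below shows that idea along a single axis with plain linear interpolation. It is a stand-in under stated assumptions: the abstract does not say how VariViT resizes its embeddings, the paper works with 3D volumes, and `resize_positional_embeddings` is a hypothetical name.

```python
def resize_positional_embeddings(table, new_len):
    """Linearly interpolate a (length x dim) embedding table to new_len positions."""
    old_len = len(table)
    if new_len == 1 or old_len == 1:
        return [list(table[0]) for _ in range(new_len)]
    resized = []
    for k in range(new_len):
        # Map the new index onto the old axis so first and last positions align.
        pos = k * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        resized.append([(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])])
    return resized


# Toy table: 3 patch positions, 2 embedding dimensions.
table = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
resized = resize_positional_embeddings(table, 5)   # input now has 5 patches
print(resized)
```

Because the patch size stays fixed and only the table length changes, the same transformer weights can be reused for every input size.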


[140] VIGIL: Tackling Hallucination Detection in Image Recontextualization cs.CV

Joanna Wojciechowicz, Maria Łubniewska, Jakub Antczak, Justyna Baczyńska, Wojciech Gromski

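TL;DR: This paper presents VIGIL, a benchmark dataset and framework for fine-grained categorization and detection of the hallucinations that large multimodal models produce in image recontextualization. It decomposes hallucinations into five categories and designs a multi-stage detection pipeline to identify them, filling a gap in evaluation methods for this task.

Details

Motivation: Existing research usually treats hallucination as a single uniform problem and lacks a fine-grained categorization and systematic detection method for hallucination errors in multimodal image recontextualization.

Result: Extensive experimental evaluation demonstrates the effectiveness of the proposed multi-stage detection pipeline, which relies on a coordinated ensemble of open-source models.

Insight: The main novelty is the first five-category fine-grained decomposition of hallucinations in image recontextualization, together with a multi-stage detection framework targeting object-level fidelity, background consistency, and omission detection, which gives deeper insight into why models fail.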

Abstract: We introduce VIGIL (Visual Inconsistency & Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional & logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach enables a deeper understanding of where the models fail with an explanation; thus, we fill a gap in the field, as no prior methods offer such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: https://github.com/mlubneuskaya/vigil and Data repository: https://huggingface.co/datasets/joannaww/VIGIL.


[141] Advances in Global Solvers for 3D Vision cs.CV | cs.RO

Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan

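TL;DR: This is the first systematic survey of global solvers in 3D vision. It unifies the field with a comprehensive taxonomy of three core paradigms: Branch-and-Bound, Convex Relaxation, and Graduated Non-Convexity. The survey analyses the theoretical foundations, algorithmic designs, and practical enhancements of these solvers, examines how each handles the inherent nonconvexity of geometric estimation, and covers ten core vision tasks from Wahba's problem to bundle adjustment, exposing the trade-offs among optimality, robustness, and scalability.

Details

Motivation: Nonconvex geometric optimization problems have traditionally been handled by local or heuristic methods, whereas global solvers offer certifiable solutions and have become a powerful paradigm. The survey aims to provide the first systematic review of global solvers in geometric vision, unify the field, and lay out a roadmap toward certifiable, trustworthy perception in real applications.

Result: As a survey, the paper proposes no new method and therefore reports no quantitative results or benchmark rankings. By analysing existing work, it exposes the trade-offs among optimality, robustness, and scalability across solvers and offers guidance for solver selection.

Insight: The main contribution is the first systematic survey and unified taxonomy of global solvers for 3D vision, with a clear theoretical framework (three paradigms) and an analysis of the trade-offs in core tasks. Viewed objectively, the survey brings together theoretical foundations, practical advances, and broader impacts, and gives a useful roadmap for future work, such as scaling algorithms while maintaining guarantees, combining data-driven priors with certifiable optimization, and establishing standardized benchmarks.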

Abstract: Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC). We present their theoretical foundations, algorithmic designs, and practical enhancements for robustness and scalability, examining how each addresses the fundamental nonconvexity of geometric estimation problems. Our analysis spans ten core vision tasks, from Wahba problem to bundle adjustment, revealing the optimality-robustness-scalability trade-offs that govern solver selection. We identify critical future directions: scaling algorithms while maintaining guarantees, integrating data-driven priors with certifiable optimization, establishing standardized benchmarks, and addressing societal implications for safety-critical deployment. By consolidating theoretical foundations, practical advances, and broader impacts, this survey provides a unified perspective and roadmap toward certifiable, trustworthy perception for real-world applications. A continuously-updated literature summary and companion code tutorials are available at https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision.
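Of the three paradigms, Graduated Non-Convexity is the easiest to show in a few lines: start from a nearly convex surrogate of a robust cost, then gradually restore the non-convexity while alternating weight and variable updates. The sketch below applies this to a robust scalar mean with a Geman-McClure surrogate. It follows the commonly cited GNC recipe as best recalled (weight formula, initial `mu`, and shrink factor 1.4), so treat those constants, the inlier threshold `c`, and the toy data as assumptions rather than anything stated in the survey.

```python
def gnc_mean(samples, c=0.5, shrink=1.4, max_iters=200, tol=1e-9):
    """Robust scalar location estimate via GNC with a Geman-McClure surrogate."""
    x = sum(samples) / len(samples)                    # non-robust initial guess
    r_max_sq = max((s - x) ** 2 for s in samples)
    mu = max(1.0, 2.0 * r_max_sq / c ** 2)             # large mu: nearly convex cost
    weights = [1.0] * len(samples)
    for _ in range(max_iters):
        # Weight update: large residuals are down-weighted more as mu shrinks.
        weights = [(mu * c ** 2 / ((s - x) ** 2 + mu * c ** 2)) ** 2
                   for s in samples]
        # Variable update: weighted least squares has a closed form for a mean.
        x_new = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
        converged = abs(x_new - x) < tol
        x = x_new
        if mu == 1.0 and converged:
            break
        mu = max(1.0, mu / shrink)                     # restore non-convexity
    return x, weights


samples = [0.9, 1.0, 1.1, 1.0, 10.0]                   # last value is an outlier
estimate, weights = gnc_mean(samples)
print(round(estimate, 3), [round(w, 3) for w in weights])
```

The plain mean of these samples is 2.8, while the GNC estimate lands close to 1.0 with the outlier's weight driven nearly to zero. Note that GNC is a heuristic without a global optimality certificate, in contrast to Branch-and-Bound and certifiable convex relaxations.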


[142] MeFEm: Medical Face Embedding model cs.CV

Yury Borets, Stepan Botman

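TL;DR: MeFEm is a vision model based on a modified Joint Embedding Predictive Architecture (JEPA) for biometric and medical analysis from facial images. Its key modifications are an axial stripe masking strategy, a circular loss weighting scheme, and probabilistic reassignment of the CLS token for high-quality linear probing. Trained on a consolidated dataset of curated images, it outperforms strong baselines such as FaRL and Franca on core anthropometric tasks, and shows promising results on Body Mass Index (BMI) estimation on a novel consolidated closed-source dataset that addresses the domain bias of existing data.

Details

Motivation: Addresses the domain bias that existing methods may carry in biometric and medical analysis from facial images (such as anthropometry and BMI estimation), and explores model architectures that reach high performance with less data.

Result: Outperforms strong baselines such as FaRL and Franca on core anthropometric tasks, and shows promising BMI estimation results on a novel consolidated closed-source dataset that addresses domain bias.

Insight: Axial stripe masking focuses learning on semantically relevant regions, while the circular loss weighting scheme and probabilistic reassignment of the CLS token improve linear probing. These architectural changes let the model perform well with less data, offering new technical ideas and a strong baseline for the field.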

Abstract: We present MeFEm, a vision model based on a modified Joint Embedding Predictive Architecture (JEPA) for biometric and medical analysis from facial images. Key modifications include an axial stripe masking strategy to focus learning on semantically relevant regions, a circular loss weighting scheme, and the probabilistic reassignment of the CLS token for high quality linear probing. Trained on a consolidated dataset of curated images, MeFEm outperforms strong baselines like FaRL and Franca on core anthropometric tasks despite using significantly less data. It also shows promising results on Body Mass Index (BMI) estimation, evaluated on a novel, consolidated closed-source dataset that addresses the domain bias prevalent in existing data. Model weights are available at https://huggingface.co/boretsyury/MeFEm , offering a strong baseline for future work in this domain.


[143] It’s a Matter of Time: Three Lessons on Long-Term Motion for Perception cs.CV

Willem Davison, Xinyue Hao, Laura Sevilla-Lara

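TL;DR: This paper examines the role of long-term motion information in perception. Using point-track estimation, it studies how well long-term motion representations capture actions, objects, materials, and spatial information, and draws three key lessons: long-term motion representations compare favourably with image representations in information richness, generalization, and computational efficiency.

Details

Motivation: The role of long-term motion information in perceptual tasks is poorly understood; the paper explores its distinctive properties and advantages relative to image information.

Result: In low-data settings and zero-shot tasks, long-term motion representations generalize better than image representations. In terms of compute, motion representations offer a better trade-off between GFLOPs and accuracy than standard video representations, and combining the two achieves higher performance than video representations alone.

Insight: Long-term motion information is low-dimensional yet information-dense, and can effectively improve the generalization and efficiency of perception models, pointing to a new direction for future model design.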

Abstract: Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.


[144] Depth Completion as Parameter-Efficient Test-Time Adaptation cs.CVPDF

Bingxin Ke, Qunjie Zhou, Jiahui Huang, Xuanchi Ren, Tianchang Shen

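TL;DR: This paper proposes CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models to depth completion. CAPA freezes the foundation model's backbone and updates only a small set of parameters with parameter-efficient fine-tuning (e.g. LoRA or VPT), guided by gradients from the sparse geometric observations available at inference time. For video, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. The method is model-agnostic and compatible with any ViT-based foundation model.

Details

Motivation: Prior methods train task-specific encoders for auxiliary inputs, which tend to overfit and generalize poorly. CAPA instead uses sparse geometric cues to adapt a pre-trained 3D foundation model efficiently to scene-specific depth completion.

Result: State-of-the-art (SOTA) results across diverse condition patterns on both indoor and outdoor datasets.

Insight: The novelty lies in combining test-time adaptation (TTA) with parameter-efficient fine-tuning (PEFT) for depth completion with 3D foundation models, updating only a very small number of parameters, and in a sequence-level parameter-sharing mechanism for video that optimizes for temporal consistency. The result is a lightweight, efficient, model-agnostic adaptation strategy.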

Abstract: We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model’s geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
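The core idea in the abstract (freeze the backbone, fit a small low-rank update against the sparse depth measurements at inference time) can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single frozen linear layer `W` standing in for the foundation model, a rank-1 LoRA-style update `b a^T`, and a mean-squared-error loss over observed pixels only. All names (`predict`, `sparse_loss`, `adapt`) and hyperparameters are hypothetical.

```python
import random

def predict(W, a, b, x):
    """Output of the adapted layer (W + b a^T) x; W itself is never modified."""
    s = sum(aj * xj for aj, xj in zip(a, x))  # a^T x, the rank-1 bottleneck
    return [sum(w * xj for w, xj in zip(row, x)) + bi * s
            for row, bi in zip(W, b)]

def sparse_loss(pred, sparse_obs):
    """Mean squared error over only the pixels that have a measurement."""
    return sum((pred[i] - d) ** 2 for i, d in sparse_obs.items()) / len(sparse_obs)

def adapt(W, x, sparse_obs, steps=500, lr=0.1, seed=0):
    """Gradient descent on the low-rank factors a and b (toy assumption)."""
    rng = random.Random(seed)
    a = [rng.uniform(-0.5, 0.5) for _ in x]  # "down" factor, random init
    b = [0.0] * len(W)                       # "up" factor, zero init: no change at start
    n = len(sparse_obs)
    for _ in range(steps):
        pred = predict(W, a, b, x)
        s = sum(aj * xj for aj, xj in zip(a, x))
        resid = {i: pred[i] - d for i, d in sparse_obs.items()}
        grad_b = [0.0] * len(b)
        for i, r in resid.items():
            grad_b[i] = 2.0 * r * s / n
        coeff = sum(2.0 * r * b[i] for i, r in resid.items()) / n
        grad_a = [coeff * xj for xj in x]
        a = [aj - lr * g for aj, g in zip(a, grad_a)]
        b = [bi - lr * g for bi, g in zip(b, grad_b)]
    return a, b

# Frozen "backbone": 4 output pixels, 3 input features (made-up numbers).
W = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1], [0.5, 0.0, 0.2], [0.1, 0.1, 0.1]]
x = [1.0, 0.5, -0.5]
base = predict(W, [0.0] * 3, [0.0] * 4, x)               # prediction before adaptation
sparse_obs = {0: base[0] + 0.5, 2: base[2] - 0.3}        # measurements at pixels 0 and 2
a, b = adapt(W, x, sparse_obs)
adapted = predict(W, a, b, x)
```

In this single-layer toy only the observed outputs move. In a real ViT the adapted weights are shared across tokens, so one would expect the correction to influence unobserved pixels as well; that behaviour is not modelled here.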


[145] GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture cs.CV | cs.AI | cs.LG | cs.MM | cs.NEPDF

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

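TL;DR: GOT-JEPA is a generic object tracking framework built on the Joint-Embedding Predictive Architecture. It uses model-predictive pretraining to improve tracker generalization and occlusion handling in dynamic environments, extending JEPA from predicting image features to predicting tracking models, and introduces OccuSolver for fine-grained perception of occlusion patterns.

Details

Motivation: Current generic object trackers are often over-optimized for their training targets, which limits robustness and generalization in unseen scenarios. Their occlusion reasoning is also coarse and lacks detailed modeling of occlusion patterns.

Result: Extensive evaluation on seven benchmarks shows that the method improves tracker generalization and robustness.

Insight: The novelty is extending JEPA from feature prediction to tracking-model prediction, with a teacher-student design that provides stable pseudo supervision, and designing OccuSolver for object-aware visibility estimation and fine-grained occlusion-pattern capture. Together these explicitly train the predictor to produce reliable tracking models under occlusion and other adverse observations.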

Abstract: The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.


[146] VIPA: Visual Informative Part Attention for Referring Image Segmentation cs.CV | cs.AIPDF

Yubin Cho, Hyunwoo Yu, Kyeongbo Kong, Kyomin Sohn, Bongjoon Hyun

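TL;DR: This paper proposes VIPA (Visual Informative Part Attention), a framework for referring image segmentation (RIS), the task of segmenting the object in an image that a natural language expression describes. VIPA uses the informative parts of the visual context, called a visual expression, to supply structural and semantic information, which reduces the variance of cross-modal projection and improves semantic consistency in the attention mechanism.

Details

Motivation: Existing methods typically fuse visual information into the language tokens. Exploiting visual context more effectively for fine-grained segmentation requires a method that better captures the structural and semantic information of the visual target.

Result: VIPA outperforms existing state-of-the-art (SOTA) methods on four public RIS benchmarks, and extensive experiments and visual analysis support its effectiveness.

Insight: The novelty is the concept of a visual expression and the visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines them to reduce noise and share visual attributes, so that attention aligns more robustly with fine-grained regions of interest.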

Abstract: Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by incorporating visual information into the language tokens. To exploit visual contexts more effectively for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which effectively provides structural and semantic information about the visual target to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in the attention mechanism. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens to reduce noise and share informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture the semantic visual contexts of informative regions. In this way, our framework enables the network's attention to align robustly with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.


[147] Debiasing Central Fixation Confounds Reveals a Peripheral “Sweet Spot” for Human-like Scanpaths in Hard-Attention Vision cs.CV | cs.AIPDF

Pengcheng Pan, Yonekura Shogo, Yasuo Kuniyoshi

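TL;DR: This paper shows that, for task-driven hard-attention vision models, the common metrics for how well scanpaths match human gaze are strongly confounded by dataset center bias. In experiments on Gaze-CIFAR-10, a trivial center-fixation baseline scores close to learned policies, so standard metrics are overly optimistic. The paper proposes GCS, a center-debiased composite metric, and finds that only specific sensory constraints (a medium-sized foveal patch combined with peripheral vision) yield scanpaths that both beat the center baseline after debiasing and are human-like in their movement statistics: a peripheral "sweet spot".

Details

Motivation: When hard-attention models are evaluated by how well their scanpaths match human gaze, common metrics are heavily confounded by dataset center bias, especially on object-centric datasets. The goal is to measure genuine behavioral alignment rather than central tendency.

Result: On Gaze-CIFAR-10, the center-fixation baseline approaches the scanpath scores of many learned policies. The proposed GCS metric reveals a robust sweet spot at medium foveal patch size with both foveal and peripheral vision, which is not obvious from raw scanpath metrics or from accuracy alone.

Insight: The novelty is exposing how center bias confounds scanpath evaluation metrics and proposing the debiased composite metric GCS to assess behavioral alignment more accurately. The core insight is that human-like scanpaths emerge only under specific sensory constraints (the "sweet spot"), which offers guidance for designing better gaze benchmarks and for evaluating active perception models.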

Abstract: Human eye movements in visual recognition reflect a balance between foveal sampling and peripheral context. Task-driven hard-attention models for vision are often evaluated by how well their scanpaths match human gaze. However, common scanpath metrics can be strongly confounded by dataset-specific center bias, especially on object-centric datasets. Using Gaze-CIFAR-10, we show that a trivial center-fixation baseline achieves surprisingly strong scanpath scores, approaching many learned policies. This makes standard metrics optimistic and blurs the distinction between genuine behavioral alignment and mere central tendency. We then analyze a hard-attention classifier under constrained vision by sweeping foveal patch size and peripheral context, revealing a peripheral sweet spot: only a narrow range of sensory constraints yields scanpaths that are simultaneously (i) above the center baseline after debiasing and (ii) temporally human-like in movement statistics. To address center bias, we propose GCS (Gaze Consistency Score), a center-debiased composite metric augmented with movement similarity. GCS uncovers a robust sweet spot at medium patch size with both foveal and peripheral vision, that is not obvious from raw scanpath metrics or accuracy alone, and also highlights a “shortcut regime” when the field-of-view becomes too large. We discuss implications for evaluating active perception on object-centric datasets and for designing gaze benchmarks that better separate behavioral alignment from center bias.


[148] Integrating Affordances and Attention models for Short-Term Object Interaction Anticipation cs.CVPDF

Lorenzo Mur Labadia, Ruben Martinez-Cantin, Jose J. Guerrero, Giovanni M. Farinella, Antonino Furnari

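TL;DR: This paper proposes a method to improve short-term object-interaction anticipation (STA). It combines attention mechanisms with affordance models, designs two new architectures, STAformer and STAformer++, and introduces environment affordance modeling and interaction hotspot prediction modules to ground STA predictions in human behavior.

Details

Motivation: Short-term object-interaction anticipation is essential for wearable assistants to understand user goals and provide timely assistance, and existing methods leave room for improvement. The paper aims to improve STA prediction by integrating attention models and affordances.

Result: Overall Top-5 mAP improves by up to 23 percentage points on Ego4D and by 31 percentage points on a curated set of EPIC-Kitchens STA labels.

Insight: The contributions are an attention architecture that integrates frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion, and the grounding of predictions in behavior through an environment affordance model and interaction hotspot prediction.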

Abstract: Short-Term object-interaction Anticipation (STA) consists in detecting the location of the next active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1) We propose STAformer and STAformer++, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2) We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gains of up to +23 p.p. on Ego4D and +31 p.p. on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.


[149] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography cs.CV | cs.AIPDF

Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang

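TL;DR: This paper introduces CT-Bench, the first benchmark dataset for multimodal lesion understanding in computed tomography (CT). It has two parts: a lesion image and metadata set containing 20,335 lesions, and a multitask visual question answering benchmark with 2,850 QA pairs. The study evaluates several state-of-the-art multimodal models and shows that fine-tuning on the dataset yields significant performance gains.

Details

Motivation: Progress in automatically delineating lesions on CT and generating radiology report content is limited, mainly by the scarcity of publicly available CT datasets with lesion-level annotations. CT-Bench aims to fill this gap.

Result: The study evaluates multiple state-of-the-art multimodal models, including vision-language models and medical CLIP variants, and compares their performance with radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Fine-tuning models on the lesion image and metadata set yields significant gains on both components.

Insight: The novelty is the first comprehensive benchmark dedicated to multimodal understanding of CT lesions, comprising lesion images, metadata, and multitask visual question answering with hard negative examples. It provides a standardized platform for evaluating and improving AI models on clinically relevant tasks such as lesion localization, description, size estimation, and attribute categorization.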

Abstract: Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.


[150] Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery cs.CVPDF

Chandrakanth Gudavalli, Tajuddin Manhar Mohammed, Abhay Yadav, Ananth Vishnu Bhaskar, Hardik Prajapati

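TL;DR: This paper proposes Wrivinder, a zero-shot, geometry-driven system that aggregates multiple ground photographs to reconstruct a consistent 3D scene and aligns it with satellite imagery for accurate geo-localization. The authors also release the MC-Sat dataset, providing a first comprehensive benchmark and testbed for cross-view alignment without paired supervision.

Details

Motivation: Aligning ground-level images with geo-registered satellite maps is challenging, particularly under large viewpoint gaps or unreliable GPS, yet it is crucial for mapping, navigation, and situational awareness.

Result: In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes, showing the promise of geometry-based aggregation for robust ground-to-satellite localization.

Insight: The novelty is combining SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular-depth-based metric cues to produce a stable zenith-view rendering that can be matched directly to satellite context. Releasing MC-Sat also fills the gap left by the lack of suitable benchmarks for this task and provides a baseline, requiring no paired supervision, for geometry-centered cross-view alignment research.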

Abstract: Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.
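A figure such as "sub-30 m geolocation accuracy" implies measuring the distance between a predicted and a ground-truth camera position. The abstract does not say how the authors compute that error, so the sketch below is only one plausible way to do it: great-circle distance via the haversine formula on a spherical Earth. The function names and the 30 m default threshold are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def within_threshold(pred, truth, threshold_m=30.0):
    """True if the predicted (lat, lon) lies within threshold_m of the ground truth."""
    return haversine_m(pred[0], pred[1], truth[0], truth[1]) <= threshold_m
```

At the scale of tens of metres, a local planar approximation in a projected coordinate system would serve equally well; haversine is used here only because it needs no projection library.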


[151] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories cs.CV | cs.AIPDF

Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang

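TL;DR: AnchorWeave is a memory-augmented framework for camera-controllable video generation that targets spatial world consistency over long sequences. It replaces a single, misaligned global 3D scene reconstruction with multiple clean local geometric memories retrieved on demand, and integrates them during generation through a multi-anchor weaving controller, reducing the noise caused by cross-view inconsistency.

Details

Motivation: Existing memory-based methods usually condition generation on anchor videos rendered from a global 3D scene reconstructed from the history. Pose and depth estimation errors in multi-view reconstruction cause cross-view misalignment, and once fused these inconsistencies accumulate into noisy geometry that contaminates the conditioning signal and degrades generation quality.

Result: Extensive experiments show that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality. Ablation and analysis studies further validate the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

Insight: The core novelty is replacing a single global memory with multiple local geometric memories and reconciling cross-view inconsistencies through coverage-driven retrieval and multi-anchor weaving, which gives a more robust conditioning strategy for maintaining world consistency in long video generation.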

Abstract: Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.


[152] ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery cs.CV | cs.AI | cs.LGPDF

Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra

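TL;DR: This paper presents ThermEval-B, a structured benchmark of about 55,000 thermal visual question answering pairs for evaluating how well vision-language models understand thermal imagery. Existing models perform poorly on temperature-grounded reasoning and under colormap transformations, indicating that thermal understanding needs a dedicated benchmark.

Details

Motivation: Existing vision-language models perform well on RGB images but do not generalize to thermal images. Thermal imaging is critical where visible light fails (e.g. nighttime surveillance, autonomous driving). It encodes physical temperature rather than color or texture, and so requires perceptual and reasoning capabilities that existing RGB benchmarks do not evaluate.

Result: Across 25 open-source and closed-source VLMs, models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and tend to fall back on language priors or fixed responses. Prompting and supervised fine-tuning bring only marginal gains.

Insight: The novelty is ThermEval-D, the first thermal dataset to combine dense per-pixel temperature maps with semantic body-part annotations, which is integrated into the structured benchmark ThermEval-B. The study exposes a fundamental difference between thermal and RGB visual understanding and argues for dedicated evaluation beyond RGB-centric assumptions.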

Abstract: Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.


[153] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing cs.CVPDF

Yehonathan Litman, Shikun Liu, Dario Seyb, Nicholas Milef, Yang Zhou

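TL;DR: EditCtrl is an efficient video inpainting control framework that decouples local and global control to enable real-time generative video editing. A local video context module processes only the masked tokens, so the computational cost is proportional to the edit size, while a lightweight global context embedder keeps the video consistent as a whole.

Details

Motivation: High-fidelity generative video editing methods built on pre-trained video foundation models are computationally expensive: they process the full video context even for sparse, localized edits. A more efficient approach should concentrate computation on the edited region.

Result: EditCtrl is 10 times more compute-efficient than state-of-the-art generative editing methods and still improves editing quality over methods designed with full attention. It also enables new capabilities such as multi-region editing with text prompts and autoregressive content propagation.

Insight: The novelty is decoupling local from global control: the local module processes only the masked region to cut computation, and a lightweight global embedder maintains temporal consistency. This design balances efficiency and quality and suggests a route to real-time video editing.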

Abstract: High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask’s size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.


cs.IR [Back]

[154] LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News cs.IR | cs.CL | cs.LGPDF

Yunfan Zhang, Kathleen McKeown, Smaranda Muresan

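TL;DR: This paper introduces LiveNewsBench, a benchmark for evaluating the agentic web search capabilities of large language models (LLMs). It automatically generates fresh question-answer pairs from recent news articles, so that questions require information beyond a model's training data and internal knowledge can be cleanly separated from search capability. It contains difficult questions that require multi-hop search, page visits, and reasoning, and supports frequent updates and the construction of large-scale training data.

Details

Motivation: LLMs with agentic web search capabilities show strong potential for real-time information access and complex fact retrieval, but evaluating such systems remains difficult. The work addresses this evaluation problem and the scarcity of training data in the area.

Result: A broad range of systems is evaluated with LiveNewsBench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The test set includes human-verified samples to ensure reliable evaluation, and the leaderboard, datasets, and code are publicly available.

Insight: The novelty is an automatic, continuously updated benchmark of hard multi-hop search questions that effectively separates an LLM's internal knowledge from its external search capability, together with a pipeline for generating large-scale training data for agentic web search models.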

Abstract: Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce LiveNewsBench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. LiveNewsBench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using LiveNewsBench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at livenewsbench.com.
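The freshness requirement described above (questions must rest on information newer than a model's training data) can be pictured as a date filter. The following is a minimal sketch under that reading, not the benchmark's actual pipeline: the `QAPair` record, the `fresh_pairs` function, and the example dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    article_date: date  # publication date of the source news article

def fresh_pairs(pairs, training_cutoff):
    """Keep only pairs whose source article was published after the model's cutoff."""
    return [p for p in pairs if p.article_date > training_cutoff]

# Made-up example: only the second pair postdates the assumed cutoff.
pairs = [
    QAPair("Q1", "A1", date(2024, 1, 10)),
    QAPair("Q2", "A2", date(2025, 6, 2)),
]
kept = fresh_pairs(pairs, training_cutoff=date(2024, 12, 31))
```

A filter like this has to be re-run per evaluated model, since each model has its own cutoff. That fits the abstract's emphasis on frequent benchmark updates.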


[155] Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search cs.IR | cs.AI | cs.CVPDF

Lei Chen, Chen Ju, Xu Chen, Zhicheng Wang, Yuheng Jiao

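TL;DR: This paper presents Pailitao-VL, a unified multi-modal retrieval system designed for high-precision, real-time industrial search. Through two core paradigm shifts it addresses the weaknesses of existing SOTA solutions in retrieval granularity, robustness to environmental noise, and the efficiency-performance trade-off.

Details

Motivation: Current SOTA multi-modal retrieval solutions have three key problems: insufficient retrieval granularity, vulnerability to environmental noise, and a prohibitive gap between efficiency and performance.

Result: Extensive offline benchmarks and online A/B tests on the Alibaba e-commerce platform confirm that Pailitao-VL achieves state-of-the-art performance and delivers substantial business impact.

Insight: There are two core paradigm shifts. 1) The embedding paradigm moves from traditional contrastive learning to an absolute ID-recognition task: anchoring instances to a globally consistent latent space defined by billions of semantic prototypes overcomes the stochasticity and granularity bottlenecks of existing embedding solutions. 2) The generative reranker evolves from isolated pointwise evaluation to a "compare-and-calibrate" listwise policy: combining chunk-based comparative reasoning with calibrated absolute relevance scoring gives fine discriminative resolution while avoiding the high latency of conventional reranking.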

Abstract: In this work, we present Pailitao-VL, a comprehensive multi-modal retrieval system engineered for high-precision, real-time industrial search. We address three critical challenges in current SOTA solutions: insufficient retrieval granularity, vulnerability to environmental noise, and a prohibitive efficiency-performance gap. Our primary contribution lies in two fundamental paradigm shifts. First, we transition the embedding paradigm from traditional contrastive learning to an absolute ID-recognition task. By anchoring instances to a globally consistent latent space defined by billions of semantic prototypes, we overcome the stochasticity and granularity bottlenecks inherent in existing embedding solutions. Second, we evolve the generative reranker from isolated pointwise evaluation to a compare-and-calibrate listwise policy. By synergizing chunk-based comparative reasoning with calibrated absolute relevance scoring, the system achieves nuanced discriminative resolution while circumventing the prohibitive latency typically associated with conventional reranking methods. Extensive offline benchmarks and online A/B tests on the Alibaba e-commerce platform confirm that Pailitao-VL achieves state-of-the-art performance and delivers substantial business impact. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.


cs.CR [Back]

[156] Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning cs.CR | cs.AI | cs.CLPDF

Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu

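TL;DR: This paper proposes the Adaptive Safe Context Learning (ASCL) framework to mitigate the trade-off between safety and utility in LLM alignment. ASCL models safety alignment as a multi-turn tool-use process in which the model decides for itself when to consult safety rules and how to continue its reasoning. It also introduces Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates during reinforcement learning, decoupling rule retrieval from subsequent reasoning and achieving higher overall performance than baselines.

Details

Motivation: Prevailing alignment strategies typically build chain-of-thought training data with explicit safety rules via context distillation. This inadvertently creates a rigid association between rule memorization and refusal, which limits the model's reasoning ability. A method is needed to ease the inherent safety-utility trade-off.

Result: The paper reports higher overall performance than baseline models. The abstract does not name the benchmarks or datasets used, nor does it say whether the results are SOTA.

Insight: The novelty is recasting safety alignment as a multi-turn tool-use process so the model can decide adaptively when to retrieve safety rules, and using IFPO to correct the imbalanced preference for rule consultation during reinforcement learning. Decoupling rule retrieval from core reasoning may make aligned models more flexible and useful.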

Abstract: While reasoning models have achieved remarkable success in complex reasoning tasks, their increasing power necessitates stringent safety measures. For safety alignment, the core challenge lies in the inherent trade-off between safety and utility. However, prevailing alignment strategies typically construct CoT training data with explicit safety rules via context distillation. This approach inadvertently limits reasoning capabilities by creating a rigid association between rule memorization and refusal. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework to improve the reasoning given proper context. ASCL formulates safety alignment as a multi-turn tool-use process, empowering the model to autonomously decide when to consult safety rules and how to generate the ongoing reasoning. Furthermore, to counteract the preference for rule consultation during RL, we introduce Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates. By decoupling rule retrieval and subsequent reasoning, our method achieves higher overall performance compared to baselines.


[157] MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents cs.CR | cs.CLPDF

Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir

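TL;DR: This paper proposes MCPShield, a security cognition layer for LLM agents built on the Model Context Protocol (MCP). It targets the security misalignment in the open MCP ecosystem, where agents blindly trust tools from third-party servers. MCPShield calibrates the agent's trust and defends against attacks through metadata-guided probing before a tool call, execution monitoring during it, and reasoning over historical traces afterwards.

Details

Motivation: MCP standardizes tool use for LLM-based agents and allows third-party servers, but this openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. Existing agents validate third-party MCP servers only to a limited extent, leaving them vulnerable to MCP-based attacks throughout the tool invocation lifecycle.

Result: Experiments show that MCPShield generalizes strongly when defending against six novel MCP-based attack scenarios across six widely used agentic LLMs, while avoiding false positives on benign servers and incurring low deployment overhead.

Insight: The core novelty is a pluggable security cognition layer that dynamically calibrates an agent's trust in MCP tools, inspired by how humans validate tools through experience (probing before use, monitoring during use, reflecting after use). It offers a practical and robust safeguard for tool invocation in open agent ecosystems.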

Abstract: The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enables third-party servers. This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. Despite MCP's utility, existing agents typically offer limited validation for third-party MCP servers. As a result, agents remain vulnerable to MCP-based attacks that exploit the misalignment between agents and servers throughout the tool invocation lifecycle. In this paper, we propose MCPShield as a plug-in security cognition layer that mitigates this misalignment and ensures agent security when invoking MCP-based tools. Drawing inspiration from human experience-driven tool validation, MCPShield helps the agent form security cognition with metadata-guided probing before invocation. Our method constrains execution within controlled boundaries while monitoring runtime events, and subsequently updates security cognition by reasoning over historical traces after invocation, mirroring human post-use reflection on tool behavior. Experiments demonstrate that MCPShield exhibits strong generalization in defending against six novel MCP-based attack scenarios across six widely used agentic LLMs, while avoiding false positives on benign servers and incurring low deployment overhead. Overall, our work provides a practical and robust security safeguard for MCP-based tool invocation in open agent ecosystems.


[158] Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks cs.CR | cs.AI | cs.CL | cs.LGPDF

Lukas Struppek, Adam Gleave, Kellin Pelrine

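TL;DR: Through a large-scale empirical study, this paper gives the first systematic account of how broadly open-weight large language models are vulnerable to prefill attacks. By predefining the initial response tokens, an attacker can bypass a model's built-in safeguards, and all major open-weight models are affected.

Details

Motivation: Open-weight models rely mainly on internal safeguards to mitigate harmful behavior, while existing red-teaming research has focused on input-based jailbreaking and parameter-level manipulation. Prefilling, an attack vector that these models natively support, has received little systematic attention.

Result: The study evaluates more than 20 existing and novel prefill attack strategies across multiple model families and SOTA open-weight models. Prefill attacks are consistently effective against all major contemporary open-weight models. Some large reasoning models show a degree of robustness to generic prefilling but remain vulnerable to tailored, model-specific strategies.

Insight: The novelty is the first large-scale, systematic empirical study of prefill attacks, an overlooked attack vector, which exposes a critical vulnerability in open-weight model deployment. The study stresses that model developers need to prioritize defenses against prefill attacks and adds an important dimension to model safety evaluation.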

Abstract: As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.


cs.LG [Back]

[159] Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models cs.LG | cs.AI | cs.CLPDF

Souradeep Chattopadhyay, Brendan Kennedy, Sai Munikoti, Soumik Sarkar, Karl Pazdernik

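TL;DR: This paper proposes Directional Concentration Uncertainty (DCU), a new framework for uncertainty quantification in generative models. Based on the von Mises-Fisher distribution, it captures uncertainty by measuring the geometric dispersion of the embedding vectors of a language model's generated outputs, with no task-specific heuristics. Experiments show that DCU matches or exceeds the calibration of existing methods such as semantic entropy on both unimodal and multimodal tasks.

Details

Motivation: Existing uncertainty quantification methods often rely on rigid heuristics that generalize poorly across tasks and modalities. The paper aims for a flexible, general framework that improves the trustworthiness and robustness of generative models.

Result: In experiments, DCU matches or exceeds the calibration levels of prior work such as semantic entropy and generalizes well to more complex tasks in multi-modal domains.

Insight: The novelty is a statistical procedure for directional concentration based on the von Mises-Fisher distribution, which quantifies uncertainty through the geometric dispersion of continuous embeddings of generated outputs. Avoiding task-specific heuristics gives it greater flexibility and generality.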

Abstract: In the critical task of making generative models trustworthy and robust, methods for Uncertainty Quantification (UQ) have begun to show encouraging potential. However, many of these methods rely on rigid heuristics that fail to generalize across tasks and modalities. Here, we propose a novel framework for UQ that is highly flexible and approaches or surpasses the performance of prior heuristic methods. We introduce Directional Concentration Uncertainty (DCU), a novel statistical procedure for quantifying the concentration of embeddings based on the von Mises-Fisher (vMF) distribution. Our method captures uncertainty by measuring the geometric dispersion of multiple generated outputs from a language model using continuous embeddings of the generated outputs without any task specific heuristics. In our experiments, we show that DCU matches or exceeds calibration levels of prior works like semantic entropy (Kuhn et al., 2023) and also generalizes well to more complex tasks in multi-modal domains. We present a framework for the wider potential of DCU and its implications for integration into UQ for multi-modal and agentic frameworks.
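The abstract describes measuring how tightly the embeddings of several sampled outputs cluster in direction, using the von Mises-Fisher (vMF) distribution. The sketch below shows one standard way to turn that idea into a number; it is not the authors' DCU procedure, whose details the abstract does not give. It normalizes the embeddings, computes the mean resultant length R̄, and applies the widely used closed-form approximation to the vMF concentration from Banerjee et al. (2005), κ ≈ R̄(d − R̄²)/(1 − R̄²). The function names are hypothetical, and treating low κ as high uncertainty is an interpretation made here.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def mean_resultant_length(embeddings):
    """Length of the average unit vector: near 1 for aligned directions, near 0 for scattered ones."""
    units = [normalize(v) for v in embeddings]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return math.sqrt(sum(m * m for m in mean))

def vmf_concentration(embeddings):
    """Approximate vMF concentration kappa; higher kappa means tighter clustering."""
    dim = len(embeddings[0])
    r_bar = mean_resultant_length(embeddings)
    if r_bar >= 1.0 - 1e-12:  # all directions coincide: concentration is unbounded
        return math.inf
    return r_bar * (dim - r_bar ** 2) / (1.0 - r_bar ** 2)

# Toy 3-dimensional "embeddings" of three sampled outputs each (made-up numbers).
tight = [[1.0, 0.01, 0.0], [1.0, -0.01, 0.0], [1.0, 0.0, 0.01]]
dispersed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kappa_tight = vmf_concentration(tight)
kappa_dispersed = vmf_concentration(dispersed)
```

Under this reading, outputs that agree semantically map to nearby directions and give a large κ, while disagreeing outputs spread over the sphere and give a small κ. Real sentence embeddings have hundreds of dimensions, where the same formula applies unchanged.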


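[160] Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning cs.LG | cs.CL

Zhimin Zhao

TL;DR: This paper proposes a five-level hierarchy of learnability based on information structure and argues that the ceiling on machine learning progress depends more on whether a task is learnable at all than on model size. By formally distinguishing three properties of computational problems (expressibility, computability, and learnability) and their relationships, it explains why code generation is more reliable than reinforcement learning: code provides dense, local, verifiable feedback.

Details

Motivation: To examine why code generation has progressed more reliably than reinforcement learning, to expose the fundamental effect of a task’s information structure on the scalability of machine learning, and to challenge the common assumption that model scaling alone can solve all machine learning problems.

Result: No quantitative experiments or benchmarks are reported. Through theoretical analysis, the paper establishes formal relationships among expressibility, computability, and learnability, and presents a unified template that makes the structural differences explicit.

Insight: The novelty is a formal definition of a learnability hierarchy from the perspective of information structure, emphasizing the key role of feedback quality (such as the dense, local verification available for code) in learning efficiency. Viewed objectively, the framework provides a theoretical basis for understanding why different machine learning tasks scale differently and helps reassess the limits of scaling laws.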

Abstract: Code generation has progressed more reliably than reinforcement learning, largely because code has an information structure that makes it learnable. Code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.


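[161] ROAST: Rollout-based On-distribution Activation Steering Technique cs.LG | cs.CL

Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang

TL;DR: This paper proposes ROAST (Rollout-based On-distribution Activation Steering Technique), a parameter-efficient method for controlling large language models (LLMs) at inference time. It estimates steering directions from the model’s own on-distribution rollouts via ROC (rollout consistency) and uses Continuous Soft Scaling (CSS) and Grouped Mean Normalization to avoid hard sparsification, improving the robustness and performance of interventions.

Details

Motivation: Existing activation steering methods often rely on off-distribution supervision and discrete masking, which makes interventions brittle and unstable. ROAST addresses this by using the model’s own on-distribution data to estimate more consistent steering directions and by introducing normalization to balance activation contributions, yielding more robust control.

Result: Experiments across models from 0.6B to 32B parameters and diverse tasks show that ROAST significantly improves performance, for example +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B. Analyses also show that CSS better preserves activation energy.

Insight: The paper’s novelties are: (1) estimating steering directions from on-distribution rollouts via ROC, reducing reliance on off-distribution data; (2) introducing CSS and Grouped Mean Normalization to avoid hard sparsification and balance per-sample contributions, mitigating the problem of high-magnitude activations dominating the global direction. Viewed objectively, the approach improves robustness and scalability in parameter-efficient control.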

Abstract: Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model’s own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
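The concern that high-magnitude activations can dominate the consensus direction can be illustrated with a toy example. The sketch below is not ROAST’s Grouped Mean Normalization (whose grouping scheme is not described in the abstract); it is a simplified stand-in that normalizes each per-sample difference vector to unit length before averaging. All names and data are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def naive_direction(diffs):
    """Average raw difference vectors, then normalize the result."""
    m = mean_vec(diffs)
    s = norm(m)
    return [x / s for x in m]

def normalized_direction(diffs):
    """Normalize every sample first so each contributes equally (assumed stand-in)."""
    units = [[x / norm(d) for x in d] for d in diffs]
    m = mean_vec(units)
    s = norm(m)
    return [x / s for x in m]

# Three samples agree on the x-axis; one outlier has a huge magnitude along y.
diffs = [[1.0, 0.0], [0.9, 0.1], [1.0, -0.1], [0.0, 50.0]]
print(naive_direction(diffs))       # pulled almost entirely toward the outlier
print(normalized_direction(diffs))  # stays close to the majority direction
```

With raw averaging the single large vector determines the direction, while per-sample normalization keeps the estimate aligned with the three consistent samples.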


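[162] Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling cs.LG | cs.AI | cs.CL

Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye

TL;DR: This paper proposes Deep Dense Exploration (DDE), a new strategy for the problem of effective exploration in reinforcement learning for large language models. The method identifies key “pivot” states in failed trajectories and performs dense local resampling around them, so that high-quality trajectories are discovered more efficiently within a limited sampling budget. Its instantiation, DEEP-GRPO, performs strongly on mathematical reasoning benchmarks.

Details

Motivation: Existing reinforcement learning methods (such as GRPO and tree-based methods) are limited when exploring the vast space of natural language sequences. GRPO samples only from the root, which saturates high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods disperse the sampling budget blindly, causing sampling dilution that makes rare correct suffixes hard to find and destabilizes local baselines.

Result: Experiments on mathematical reasoning benchmarks show that the method (DEEP-GRPO) consistently outperforms GRPO, tree-based methods, and other strong baselines.

Insight: The claimed novelties are: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) dense local resampling at each pivot state to raise the probability of finding correct continuations; (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Viewed objectively, the core innovation is shifting exploration away from the root or blind breadth-first search toward targeted dense exploration of deep, recoverable pivot states, which offers a new approach to the exploration-exploitation trade-off in LLM reinforcement learning.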

Abstract: Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $\textit{pivots}$: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.


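[163] Scaling Beyond Masked Diffusion Language Models cs.LG | cs.CL

Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu

TL;DR: This paper studies scaling laws for discrete diffusion language models, comparing uniform-state, interpolating, and masked diffusion. It finds that masked diffusion models become about 12% more FLOPs-efficient when trained with a cross-entropy objective, and that perplexity is informative within a diffusion family but can mislead across families, because faster sampling may outweigh likelihood scaling. At 1.7B parameters, uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and masked diffusion models on GSM8K despite worse validation perplexity.

Details

Motivation: Diffusion language models are attracting attention as a potential alternative to autoregressive models because of their faster generation. Masked diffusion currently dominates discrete diffusion approaches, mainly on the strength of its perplexity on language modeling benchmarks. By presenting the first scaling law study of uniform-state and interpolating discrete diffusion methods, the paper challenges the view that perplexity alone suffices for cross-algorithm comparison and explores more efficient training and sampling strategies.

Result: At 1.7B parameters, uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and masked diffusion models on the GSM8K reasoning task. Masked diffusion models become about 12% more FLOPs-efficient when trained with a cross-entropy objective. The speed-quality Pareto frontier also shows that some models with worse likelihood scaling are preferable because their sampling is faster and more practical.

Insight: The novelties include the first scaling law study of uniform-state and interpolating discrete diffusion methods, the use of a cross-entropy objective to improve the FLOPs efficiency of masked diffusion models, and an emphasis on the limits of perplexity for comparisons across diffusion families. The paper advocates the speed-quality Pareto frontier as a more complete evaluation criterion, which challenges the conventional view that masked diffusion is the only future direction for diffusion language modeling.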

Abstract: Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms


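[164] Universal Algorithm-Implicit Learning cs.LG | cs.AI | cs.CV

Stefano Woerner, Seong Joon Oh, Christian F. Baumgartner

TL;DR: This paper presents a theoretical framework for meta-learning that formally defines practical universality and distinguishes algorithm-explicit from algorithm-implicit learning. Based on this framework, the authors propose TAIL, a transformer-based algorithm-implicit meta-learner that handles tasks across domains, modalities, and label configurations. With random projections for cross-modal feature encoding, random injection label embeddings that generalize to larger label spaces, and efficient inline query processing, TAIL reaches state-of-the-art results on standard few-shot benchmarks, generalizes to unseen domains and modalities, and greatly reduces computational cost.

Details

Motivation: Current meta-learning methods are confined to narrow task distributions with fixed feature and label spaces, which limits their applicability. The literature also uses key terms such as “universal” inconsistently and without precise definitions, which hinders comparability.

Result: TAIL achieves state-of-the-art performance on standard few-shot benchmarks and generalizes to unseen domains. It also generalizes to unseen modalities (for example, solving text classification tasks after training only on images), handles tasks with up to 20 times more classes than seen during training, and saves orders of magnitude of computation compared with prior transformer-based methods.

Insight: The theoretical framework supplies a principled vocabulary and precise definitions for universal meta-learning methods. TAIL’s three technical innovations (random-projection cross-modal encoding, random injection label embeddings, and efficient inline query processing) give strong generalization across modalities, domains, and label configurations while markedly improving computational efficiency.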

Abstract: Current meta-learning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like “universal” and “general-purpose” inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for meta-learning which formally defines practical universality and introduces a distinction between algorithm-explicit and algorithm-implicit learning, providing a principled vocabulary for reasoning about universal meta-learning methods. Guided by this framework, we present TAIL, a transformer-based algorithm-implicit meta-learner that functions across tasks with varying domains, modalities, and label configurations. TAIL features three innovations over prior transformer-based meta-learners: random projections for cross-modal feature encoding, random injection label embeddings that extrapolate to larger label spaces, and efficient inline query processing. TAIL achieves state-of-the-art performance on standard few-shot benchmarks while generalizing to unseen domains. Unlike other meta-learning methods, it also generalizes to unseen modalities, solving text classification tasks despite training exclusively on images, handles tasks with up to 20$\times$ more classes than seen during training, and provides orders-of-magnitude computational savings over prior transformer-based approaches.
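One way to picture random projections for cross-modal feature encoding is as a fixed, untrained map that brings features of any input dimensionality into a shared width. The sketch below is only an illustration under that assumption; the function name, the Gaussian entries, and the scaling are hypothetical choices, not TAIL’s actual encoder.

```python
import math
import random

def random_projection(features, out_dim, seed=0):
    """Project a feature vector of any length to `out_dim` dimensions.

    Assumption: entries of the implicit projection matrix are i.i.d. Gaussian,
    drawn from a seeded generator so the same seed gives the same map.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        scale * sum(rng.gauss(0.0, 1.0) * x for x in features)
        for _ in range(out_dim)
    ]

image_like = [0.5] * 300   # e.g. a 300-d feature from one modality
text_like = [1.0] * 5      # e.g. a 5-d feature from another modality
print(len(random_projection(image_like, 8)), len(random_projection(text_like, 8)))
```

Because the map needs no training, inputs from a modality never seen during training can still be brought to the width the downstream model expects.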


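[165] Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment cs.LG | cs.CV | cs.ET | cs.HC | cs.NE

Mounvik K, N Harshit

TL;DR: This paper proposes Web-Scale Multimodal Summarization, a lightweight framework that retrieves text and image data from web sources and generates summaries. Given a user-defined topic, the system runs web, news, and image searches in parallel, ranks the retrieved images with a fine-tuned CLIP model that measures their semantic alignment with the topic and text, and can optionally generate image captions with BLIP to strengthen multimodal coherence.

Details

Motivation: To address how to efficiently generate multimodal summaries that combine text and images from massive web data, with the goal of providing a configurable, deployable tool.

Result: Evaluated on a dataset of 500 image-caption pairs with 20:1 contrastive negatives, the system reaches a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment.

Insight: The main novelty is using a fine-tuned CLIP model to rank retrieved images by semantic alignment, combined with BLIP caption generation, to build a lightweight, user-extensible pipeline that integrates language, retrieval, and vision models for web-scale multimodal summarization.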

Abstract: We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence. The pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets. Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.
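The gap between the reported accuracy and F1-score is easier to read with the class balance in mind. The short calculation below assumes that “20:1 contrastive negatives” means 20 negative pairs per positive pair (an interpretation, not something the abstract states outright) and shows what a trivial always-negative classifier would score under that assumption.

```python
positives = 500              # image-caption pairs, from the abstract
negatives = 20 * positives   # assumed reading of "20:1 contrastive negatives"
total = positives + negatives

# Accuracy of a classifier that rejects every pair.
always_negative_accuracy = negatives / total
print(f"{always_negative_accuracy:.4f}")  # about 0.9524

def f1_score(tp, fp, fn):
    """F1 from confusion counts; returns 0.0 when there are no true positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# The same trivial classifier finds no positives at all.
print(f1_score(tp=0, fp=0, fn=positives))
```

Under this reading, accuracy alone leaves little headroom above the trivial baseline, so ROC-AUC and F1 carry more of the information about alignment quality.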


q-bio.QM [Back]

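[166] Protect$^*$: Steerable Retrosynthesis through Neuro-Symbolic State Encoding q-bio.QM | cs.AI | cs.CL | cs.LG | q-bio.BM

Shreyas Vinaya Sathyanarayana, Shah Rahil Kirankumar, Sharanabasava D. Hiremath, Bharath Ramsundar

TL;DR: This paper proposes Protect$^*$, a neuro-symbolic framework for steerable retrosynthesis that combines the generative capabilities of large language models with rigorous chemical logic. Through automated rule-based reasoning and neural generation, it protects specific chemically sensitive sites on a molecule, avoiding invalid or undesirable synthetic pathways.

Details

Motivation: To address the lack of fine-grained control of large language models in scientific domains such as retrosynthesis, especially when specific chemically sensitive sites on a molecule must be avoided and unconstrained generation can produce invalid or undesirable synthetic pathways.

Result: The approach is demonstrated through case studies on complex natural products (such as the discovery of a novel synthetic pathway for Erythromycin B), suggesting that it enables reliable, expert-level autonomy. No specific benchmarks or quantitative comparisons are reported.

Insight: The novelty is a hybrid architecture combining neural generation with symbolic logic, with an automatic mode and a human-in-the-loop mode. Through “active state tracking”, hard symbolic constraints are injected into neural inference using a protection state linked to canonical atom maps, improving controllability and reliability.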

Abstract: Large Language Models (LLMs) have shown remarkable potential in scientific domains like retrosynthesis; yet, they often lack the fine-grained control necessary to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule, a task where unconstrained generation can lead to invalid or undesirable synthetic pathways. In this work, we introduce Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of LLMs in rigorous chemical logic. Our approach combines automated rule-based reasoning (using a comprehensive database of 55+ SMARTS patterns and 40+ characterized protecting groups) with the generative intuition of neural models. The system operates via a hybrid architecture: an “automatic mode” where symbolic logic deterministically identifies and guards reactive sites, and a “human-in-the-loop mode” that integrates expert strategic constraints. Through “active state tracking,” we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.


cs.AI [Back]

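[167] Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning cs.AI | cs.CL | cs.LG | cs.LO

Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li

TL;DR: This paper proposes SSLogic, an agentic meta-synthesis framework for scaling the generation of verifiable logical reasoning tasks. Using a closed Generate-Validate-Repair loop, it iteratively synthesizes and repairs executable Generator-Validator program pairs, scaling at the task-family level with controllable difficulty.

Details

Motivation: To address a key bottleneck in Reinforcement Learning from Verifiable Rewards: verifiable training signals are hard to scale. Logical reasoning is a natural substrate, but existing synthesis methods either depend on expert-written code or are confined to fixed templates, which limits substantive growth in task scale.

Result: Starting from 400 seed task families, two evolution rounds expand to 953 task families and 21,389 verifiable instances. Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps: +5.2 on SynLogic, +1.4 on BBEH, +3.0 on AIME25, and +3.7 on Brumo25.

Insight: The core innovation is an agent-driven meta-synthesis framework that scales automatically at the task-family level rather than only perturbing instances. It also introduces a multi-gate validation protocol combining multi-strategy consistency checks with adversarial blind review to ensure generated tasks are reliable and unambiguous. This offers a new approach to automatically generating high-quality, scalable training data.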

Abstract: Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert-written code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator–Validator program pairs in a closed Generate–Validate–Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
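The Generate–Validate–Repair loop can be sketched with a toy task family. Everything below is hypothetical: the arithmetic generator, the validator, and especially the `repair` callable, which stands in for the agent that rewrites a failing program in the paper. The sketch only shows the control flow of accepting a Generator–Validator pair once sampled instances pass validation.

```python
import random

def generator(rng):
    """Toy task family: single-digit addition with a programmatic answer."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return {"question": f"{a} + {b}", "answer": a + b}

def buggy_generator(rng):
    """Same family, but with an off-by-one answer that validation should catch."""
    instance = generator(rng)
    instance["answer"] += 1
    return instance

def validator(instance):
    """Recompute the answer independently from the question text."""
    a, b = (int(t) for t in instance["question"].split(" + "))
    return instance["answer"] == a + b

def generate_validate_repair(gen, val, repair, n_checks=20, max_rounds=3, seed=0):
    """Return a generator whose sampled instances all validate, or None."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        if all(val(gen(rng)) for _ in range(n_checks)):
            return gen
        gen = repair(gen)  # hypothetical stand-in for the repairing agent
    return None

accepted = generate_validate_repair(buggy_generator, validator, lambda g: generator)
print(accepted is generator)
```

A pair that never validates within the round limit is dropped (the function returns `None`), which mirrors the idea of filtering out ill-posed task families rather than training on them.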


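[168] NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models cs.AI | cs.CL

Rizky Ramadhana Putra, Raihan Sultan Pasha Basuki, Yutong Cheng, Peng Gao

TL;DR: This paper proposes NL2LOGIC, a framework that translates natural language into first-order logic using an abstract syntax tree (AST) as an intermediate representation. It combines a recursive LLM-based semantic parser with an AST-guided generator that deterministically produces solver-ready logic code. On several benchmarks, it significantly outperforms prior state-of-the-art methods in both syntactic accuracy and semantic correctness.

Details

Motivation: To address two problems in existing LLM-based translation from natural language to first-order logic: fragile syntax caused by weak global grammar constraints, and low semantic faithfulness caused by insufficient clause-level semantic understanding.

Result: On FOLIO, LogicNLI, and ProofWriter, NL2LOGIC achieves 99% syntactic accuracy and improves semantic correctness by up to 30% over the best existing baselines. Integrated into Logic-LM, it yields near-perfect executability and improves downstream reasoning accuracy by 31% compared with Logic-LM’s original few-shot unconstrained translation module.

Insight: The core innovation is the AST as an intermediate representation, which decouples translation into two stages: recursive semantic parsing and deterministic AST-guided generation. This strengthens control over global syntactic structure while recursive parsing improves clause-level semantic understanding, raising both syntactic robustness and semantic faithfulness.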

Abstract: Automated reasoning is critical in domains such as law and governance, where verifying claims against facts in documents requires both accuracy and interpretability. Recent work adopts structured reasoning pipelines that translate natural language into first-order logic and delegate inference to automated solvers. With the rise of large language models, approaches such as GCD and CODE4LOGIC leverage their reasoning and code generation capabilities to improve logic parsing. However, these methods suffer from fragile syntax control due to weak enforcement of global grammar constraints and low semantic faithfulness caused by insufficient clause-level semantic understanding. We propose NL2LOGIC, a first-order logic translation framework that introduces an abstract syntax tree as an intermediate representation. NL2LOGIC combines a recursive large language model based semantic parser with an abstract syntax tree guided generator that deterministically produces solver-ready logic code. Experiments on the FOLIO, LogicNLI, and ProofWriter benchmarks show that NL2LOGIC achieves 99 percent syntactic accuracy and improves semantic correctness by up to 30 percent over state-of-the-art baselines. Furthermore, integrating NL2LOGIC into Logic-LM yields near-perfect executability and improves downstream reasoning accuracy by 31 percent compared to Logic-LM’s original few-shot unconstrained translation module.
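The second stage, deterministic generation from an AST, is the part that guarantees well-formed output, and it can be sketched independently of any language model. The node types and the output syntax below are assumptions made for illustration; the paper’s actual AST schema and solver format are not given in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Pred:
    name: str
    args: tuple

@dataclass
class Not:
    body: object

@dataclass
class BinOp:
    op: str       # e.g. "&", "|", "->"
    left: object
    right: object

@dataclass
class Quant:
    kind: str     # "forall" or "exists"
    var: str
    body: object

def render(node):
    """Deterministically turn an AST into a formula string (hypothetical syntax)."""
    if isinstance(node, Pred):
        return f"{node.name}({', '.join(node.args)})"
    if isinstance(node, Not):
        return f"~{render(node.body)}"
    if isinstance(node, BinOp):
        return f"({render(node.left)} {node.op} {render(node.right)})"
    if isinstance(node, Quant):
        return f"{node.kind} {node.var}. {render(node.body)}"
    raise TypeError(f"unknown AST node: {node!r}")

# "All humans are mortal", as a parser stage might emit it.
ast = Quant("forall", "x", BinOp("->", Pred("Human", ("x",)), Pred("Mortal", ("x",))))
print(render(ast))  # forall x. (Human(x) -> Mortal(x))
```

Since every string comes from `render`, unbalanced parentheses or malformed quantifiers cannot occur; any remaining errors are semantic ones introduced when the tree was built.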


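[169] X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles cs.AI | cs.CL | cs.RO

Ashkan Y. Zadeh, Xiaomeng Li, Andry Rakotonirainy, Ronald Schroeter, Sebastien Glaser

TL;DR: This paper proposes X-Blocks (eXplanation Blocks), a hierarchical analytical framework for the linguistic building blocks of natural language explanations for automated vehicles, covering three levels: context, syntax, and lexicon. At the context level, the authors propose RACE, which uses a multi-LLM ensemble with Chain-of-Thought reasoning and self-consistency to classify explanations into 32 scenario-aware categories, reaching 91.45% accuracy on the Berkeley DeepDrive-X dataset. At the lexical and syntactic levels, statistical analysis and dependency parsing respectively reveal scenario-specific vocabulary patterns and reusable grammatical structures.

Details

Motivation: Existing approaches lack a systematic framework for analysing how humans linguistically construct driving rationales across driving scenarios, which hinders building trust in and acceptance of automated vehicles.

Result: On the context classification task, RACE achieves 91.45% accuracy and a Cohen’s kappa of 0.91 on cases where human annotators agree, approaching human reliability. The lexical and syntactic analyses reveal scenario-specific vocabulary patterns and a limited set of reusable grammar families with systematic variation.

Insight: The novelty is a dataset-agnostic, task-independent, hierarchical linguistic analysis framework (X-Blocks) whose core is a multi-LLM ensemble classifier combining Chain-of-Thought with self-consistency (RACE). It provides evidence-based linguistic design principles for generating scenario-aware explanations and can extend to other automated driving datasets and safety-critical domains.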

Abstract: Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper introduces X-Blocks (eXplanation Blocks), a hierarchical analytical framework that identifies the linguistic building blocks of natural language explanations for AVs at three levels: context, syntax, and lexicon. At the context level, we propose RACE (Reasoning-Aligned Classification of Explanations), a multi-LLM ensemble framework that combines Chain-of-Thought reasoning with self-consistency mechanisms to robustly classify explanations into 32 scenario-aware categories. Applied to human-authored explanations from the Berkeley DeepDrive-X dataset, RACE achieves 91.45 percent accuracy and a Cohen’s kappa of 0.91 against cases with human annotator agreement, indicating near-human reliability for context classification. At the lexical level, log-odds analysis with informative Dirichlet priors reveals context-specific vocabulary patterns that distinguish driving scenarios. At the syntactic level, dependency parsing and template extraction show that explanations draw from a limited repertoire of reusable grammar families, with systematic variation in predicate types and causal constructions across contexts. The X-Blocks framework is dataset-agnostic and task-independent, offering broad applicability to other automated driving datasets and safety-critical domains. Overall, our findings provide evidence-based linguistic design principles for generating scenario-aware explanations that support transparency, user trust, and cognitive accessibility in automated driving systems.
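Two pieces of the context-level evaluation are standard enough to sketch: aggregating several sampled labels by majority vote (the usual form of self-consistency) and scoring agreement with Cohen’s kappa. The label names and data below are made up for illustration, and the voting rule is an assumption, since the abstract does not detail how RACE combines its ensemble.

```python
from collections import Counter

def majority_vote(labels):
    """Pick the most frequent label among several sampled classifications."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels: human annotation vs. ensemble output on four explanations.
human = ["stop", "stop", "turn", "turn"]
model = ["stop", "stop", "turn", "stop"]
print(majority_vote(["turn", "stop", "turn"]))  # turn
print(cohens_kappa(human, model))               # 0.5
```

In this toy case raw agreement is 75% but kappa is 0.5, because half of the agreement would be expected by chance given the label frequencies. That is why a kappa of 0.91 is a stronger statement than accuracy alone.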


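[170] General learned delegation by clones cs.AI | cs.CL

Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou

TL;DR: This paper proposes SELFCEST, which uses agentic reinforcement learning to let a base model spawn same-weight clones in parallel contexts, improving compute efficiency under a fixed inference budget. On math reasoning and long-context multi-hop QA benchmarks, it improves the accuracy-cost Pareto frontier over monolithic baselines at the same inference budget and shows out-of-distribution generalization across domains.

Details

Motivation: To address the compute inefficiency of frontier language models under fixed inference budgets, where serial reasoning or uncoordinated parallel sampling wastes computation.

Result: On challenging math reasoning benchmarks and long-context multi-hop QA tasks, SELFCEST improves the accuracy-cost Pareto frontier over monolithic baselines at matched inference budget and exhibits out-of-distribution generalization.

Insight: The novelty is training a shared-parameter controller end-to-end with reinforcement learning to allocate generation and context budget across branches, enabling efficient delegation to parallel clones and better use of compute.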

Abstract: Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.


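[171] ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs cs.AI | cs.CL

Rohan Subramanian Thomas, Shikhar Shiromani, Abdullah Chaudhry, Ruizhe Li, Vasu Sharma

TL;DR: This paper proposes ProMoral-Bench, a benchmark for systematically evaluating 11 prompting strategies on moral reasoning and safety alignment across four LLM families. It integrates ETHICS, Scruples, WildJailbreak, and a newly designed robustness test, ETHICS-Contrast, and measures performance with the Unified Moral Safety Score (UMSS). The study finds that compact, exemplar-guided prompts outperform complex multi-stage reasoning, achieving higher UMSS scores and greater robustness at lower compute cost.

Details

Motivation: Prior work lacks a systematic empirical comparison of how prompt design affects the moral competence and safety alignment of large language models. This work sets out to build a unified evaluation framework to fill that gap.

Result: Experiments on ProMoral-Bench show that compact, exemplar-guided prompts achieve higher UMSS scores than complex multi-turn reasoning and are notably more stable in robustness tests. Few-shot exemplars consistently improve moral stability and jailbreak resistance.

Insight: The novelties are a unified moral-safety evaluation benchmark (ProMoral-Bench) and the UMSS metric, which balances accuracy and safety. The paper also reveals an efficiency paradox in prompt engineering: simple, low-cost exemplar prompts are often more effective and robust than complex reasoning, which provides principled guidance for practical prompt engineering.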

Abstract: Prompt design significantly impacts the moral competence and safety alignment of large language models (LLMs), yet empirical comparisons remain fragmented across datasets and models. We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance via our proposed Unified Moral Safety Score (UMSS), a metric balancing accuracy and safety. Our results show that compact, exemplar-guided scaffolds outperform complex multi-stage reasoning, providing higher UMSS scores and greater robustness at a lower token cost. While multi-turn reasoning proves fragile under perturbations, few-shot exemplars consistently enhance moral stability and jailbreak resistance. ProMoral-Bench establishes a standardized framework for principled, cost-effective prompt engineering.


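[172] Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts cs.AI | cs.CL

Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng

TL;DR: This paper introduces Nanbeige4.1-3B, a small generalist language model with only 3B parameters that achieves strong agentic behavior, code generation, and general reasoning in a single model. By combining point-wise and pair-wise reward modeling, complexity-aware reinforcement learning rewards, and complex data synthesis with turn-level supervision for deep search, the model can stably execute up to 600 tool-call turns to solve complex problems. Experiments show that it outperforms models of similar and even larger scale in several respects.

Details

Motivation: To address the difficulty existing small language models have in combining agentic behavior, code generation, and general reasoning in a single model, and to show that a small model can unify broad competence with strong specialization.

Result: In extensive experiments, Nanbeige4.1-3B significantly outperforms models of similar scale (such as Nanbeige4-3B-2511 and Qwen3-4B) and in some respects surpasses larger models (such as Qwen3-30B-A3B), redefining the potential of 3B-parameter models.

Insight: The novelties are combining point-wise and pair-wise reward modeling to improve reasoning and alignment quality, designing complexity-aware rewards in reinforcement learning to optimize code correctness and efficiency, and using complex data synthesis with turn-level supervision in deep search to achieve stable long-horizon tool interaction. This offers a reusable technical path for building versatile, efficient small generalist models.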

Abstract: We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.


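[173] Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization cs.AI | cs.CL | cs.CV | cs.HC

Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu

TL;DR: This paper proposes a new framework for autonomous GUI navigation built on multimodal large language models (MLLMs). It has two core components, agentic-Q estimation and step-wise policy optimization, and uses reinforcement learning to optimize the policy, reducing data collection cost and enabling stable, efficient interaction with the environment.

Details

Motivation: To address the high computational cost of data curation and policy optimization that GUI agents face in non-stationary real-world environments.

Result: On GUI navigation and grounding benchmarks, the framework gives Ovis2.5-9B strong interaction capabilities, achieving notable performance and even surpassing larger competing models.

Insight: The novelty is combining agentic-Q estimation (which evaluates an action’s contribution to task completion) with step-wise policy optimization. The policy generates its own trajectories, which keeps data costs manageable, and policy updates are decoupled from the environment, which keeps optimization stable.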

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.


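[174] StackingNet: Collective Inference Across Independent AI Foundation Models cs.AI | cs.CL

Siyang Li, Chenhao Liu, Dongrui Wu, Zhigang Zeng, Lieyun Ding

TL;DR: This paper proposes StackingNet, a meta-ensemble framework for coordinating independent large foundation models (such as language understanding and vision models). It combines their predictions through collective inference to improve accuracy, reduce bias, rank reliability, and identify or prune models that degrade performance, all without access to internal parameters or training data.

Details

Motivation: Large foundation models perform well across domains but remain isolated and cannot readily share capabilities. Integrating the complementary strengths of these independent models is essential for building trustworthy intelligent systems, yet there is no established method for coordinating black-box heterogeneous models.

Result: On tasks including language comprehension, visual estimation, and academic paper rating, StackingNet consistently improves accuracy, robustness, and fairness compared with individual models and classic ensembles.

Insight: The novelty is applying principles of collective intelligence to foundation model coordination. The meta-ensemble framework turns diversity from a source of inconsistency into a collaborative advantage and provides a practical foundation for coordinated AI, suggesting that progress can come not only from larger single models but also from principled cooperation among many specialized ones.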

Abstract: Artificial intelligence built on large foundation models has transformed language understanding, vision and reasoning, yet these systems remain isolated and cannot readily share their capabilities. Integrating the complementary strengths of such independent foundation models is essential for building trustworthy intelligent systems. Despite rapid progress in individual model design, there is no established approach for coordinating such black-box heterogeneous models. Here we show that coordination can be achieved through a meta-ensemble framework termed StackingNet, which draws on principles of collective intelligence to combine model predictions during inference. StackingNet improves accuracy, reduces bias, enables reliability ranking, and identifies or prunes models that degrade performance, all operating without access to internal parameters or training data. Across tasks involving language comprehension, visual estimation, and academic paper rating, StackingNet consistently improves accuracy, robustness, and fairness, compared with individual models and classic ensembles. By turning diversity from a source of inconsistency into collaboration, StackingNet establishes a practical foundation for coordinated artificial intelligence, suggesting that progress may emerge from not only larger single models but also principled cooperation among many specialized ones.
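A meta-ensemble that only sees model outputs can be illustrated with classic linear stacking: learn one weight per black-box model from a small set of labelled predictions, then read the weights as a rough reliability ranking. This is a generic sketch under stated assumptions (squared-error loss, plain gradient descent, made-up data), not StackingNet’s architecture, which the abstract does not describe.

```python
def fit_stacking_weights(preds, targets, lr=0.02, steps=2000):
    """Fit one weight per model by gradient descent on mean squared error.

    preds[i][m] is model m's prediction on example i; only outputs are used,
    never model internals. The learning rate is tuned for this toy data scale.
    """
    n_models = len(preds[0])
    n = len(preds)
    w = [1.0 / n_models] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for row, y in zip(preds, targets):
            err = sum(wi * p for wi, p in zip(w, row)) - y
            for m in range(n_models):
                grad[m] += 2.0 * err * row[m] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# Model 0 tracks the targets; model 1 is uninformative on this data.
targets = [1.0, 2.0, 3.0, 4.0]
preds = [[1.0, 4.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]]
weights = fit_stacking_weights(preds, targets)
print(weights)
```

The fit pushes the weight of the reliable model toward 1 and the other toward 0, which also suggests a simple pruning rule: drop models whose learned weight is negligible.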


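[175] From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design cs.AI | cs.CL | cs.GR

Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen

TL;DR: This paper proposes LaySPA, a reinforcement learning framework that strengthens explicit, interpretable spatial reasoning in large language models for content-aware graphic layout design. It reformulates layout design as a policy learning problem over a structured textual spatial environment and produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications.

Details

Motivation: To address two key challenges: the limited spatial reasoning of large language models in layout design and the lack of transparency in the design decision process.

Result: Experiments show that LaySPA improves structural validity and visual quality, outperforms larger proprietary LLMs, and performs comparably to specialized SOTA layout generators while requiring fewer annotated samples and lower latency.

Insight: The novelty is recasting layout design as a policy learning problem and introducing a multi-objective spatial critique (geometric validity, relational coherence, aesthetic consistency) together with relative group optimization for training, which yields a transparent and controllable design decision process.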

Abstract: We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs’ limited spatial reasoning and the lack of transparency in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. Layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized SOTA layout generators while requiring fewer annotated samples and reduced latency.
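Two ingredients named in the abstract lend themselves to a small sketch: a geometric-validity check over boxes on a canvas, and group-relative advantages computed from the rewards of several sampled layouts. Both functions are simplified assumptions (a binary validity score and mean/standard-deviation normalization within a group); the paper’s actual critique terms and optimization details are not specified here.

```python
import math

def boxes_overlap(a, b):
    """Boxes are (x, y, width, height); touching edges do not count as overlap."""
    ox = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    oy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return ox > 0 and oy > 0

def geometric_validity(boxes, canvas_w, canvas_h):
    """Return 1.0 if every box is inside the canvas and none overlap, else 0.0."""
    for x, y, w, h in boxes:
        if x < 0 or y < 0 or x + w > canvas_w or y + h > canvas_h:
            return 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_overlap(boxes[i], boxes[j]):
                return 0.0
    return 1.0

def group_relative_advantages(rewards):
    """Center and scale rewards within one sampled group (assumed normalization)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

good = [(0, 0, 50, 50), (60, 0, 30, 30)]
bad = [(0, 0, 50, 50), (40, 0, 30, 30)]   # second box overlaps the first
rewards = [geometric_validity(good, 100, 100), geometric_validity(bad, 100, 100)]
print(rewards, group_relative_advantages(rewards))
```

Scoring each sampled layout relative to its own group means the policy is pushed toward the better candidates for a given canvas, without needing an absolute scale for layout quality.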


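[176] Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity cs.AI | cs.CL | cs.LG

Kazuo Yano, Jonghyeok Lee, Tae Ishitomi, Hironobu Kawaguchi, Akira Koyama

TL;DR: This paper proposes Algebraic Quantum Intelligence (AQI), a new computational framework aimed at the limitations of large language models (LLMs) in producing genuinely creative outputs. Inspired by quantum theory, AQI uses a noncommutative algebraic structure to systematically expand semantic space. By introducing properties such as order dependence, interference, and uncertainty, it lets generation explore multiple semantic possibilities at once. The study implements AQI by extending a transformer-based LLM with more than 600 specialized operators and evaluates it on creative reasoning benchmarks spanning ten domains.

Details

Motivation: LLMs have been remarkably successful at generating fluent, contextually appropriate text, but their ability to produce genuinely creative outputs remains limited. The authors attribute this to a structural property of LLMs: given rich context, the space of future generations is strongly constrained and generation is governed by near-deterministic dynamics. Existing approaches such as test-time scaling and context adaptation improve performance but do not fundamentally change this constraint.

Result: Under an LLM-as-a-judge protocol, AQI consistently outperforms strong baselines on creative reasoning benchmarks spanning ten domains, with statistically significant improvements and reduced cross-domain variance. The authors take this to show that noncommutative algebraic dynamics can serve as a practical, reproducible foundation for machine creativity. The architecture has been deployed in real-world enterprise environments.

Insight: The novelty is bringing the noncommutative algebraic structure of quantum theory into a machine learning framework to expand semantic space in a controlled, designable way, thereby enhancing model creativity. Viewed objectively, pairing abstract mathematical structure with a practical LLM extension offers a new, systematic route to machine creativity and may advance generative models in handling uncertainty and producing diverse outputs.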

Abstract: Large language models (LLMs) have achieved remarkable success in generating fluent and contextually appropriate text; however, their capacity to produce genuinely creative outputs remains limited. This paper posits that this limitation arises from a structural property of contemporary LLMs: when provided with rich context, the space of future generations becomes strongly constrained, and the generation process is effectively governed by near-deterministic dynamics. Recent approaches such as test-time scaling and context adaptation improve performance but do not fundamentally alter this constraint. To address this issue, we propose Algebraic Quantum Intelligence (AQI) as a computational framework that enables systematic expansion of semantic space. AQI is formulated as a noncommutative algebraic structure inspired by quantum theory, allowing properties such as order dependence, interference, and uncertainty to be implemented in a controlled and designable manner. Semantic states are represented as vectors in a Hilbert space, and their evolution is governed by C-values computed from noncommutative operators, thereby ensuring the coexistence and expansion of multiple future semantic possibilities. In this study, we implement AQI by extending a transformer-based LLM with more than 600 specialized operators. We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol. The results show that AQI consistently outperforms strong baseline models, yielding statistically significant improvements and reduced cross-domain variance. These findings demonstrate that noncommutative algebraic dynamics can serve as a practical and reproducible foundation for machine creativity. Notably, this architecture has already been deployed in real-world enterprise environments.
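The order dependence that noncommutative operators introduce can be seen in a two-dimensional toy example. The matrices below are arbitrary illustrations, not operators from AQI, and the sketch does not attempt to reproduce the C-values mentioned in the abstract; it only shows that applying two operators in different orders can send the same state to different results, and that the commutator measures this.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def matvec(a, v):
    """Apply a matrix to a vector."""
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]

def commutator(a, b):
    """[A, B] = AB - BA; all zeros exactly when the operators commute."""
    ab, ba = matmul(a, b), matmul(b, a)
    return [[x - y for x, y in zip(row_ab, row_ba)] for row_ab, row_ba in zip(ab, ba)]

A = [[0, 1], [1, 0]]    # swaps the two components
B = [[1, 0], [0, -1]]   # flips the sign of the second component
state = [1, 0]

b_then_a = matvec(A, matvec(B, state))
a_then_b = matvec(B, matvec(A, state))
print(b_then_a, a_then_b, commutator(A, B))
```

The two application orders give `[0, 1]` and `[0, -1]`, and the commutator is nonzero, so the sequence in which such operators act on a state is itself a degree of freedom.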


[177] REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents cs.AI | cs.CLPDF

Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang

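TL;DR: This paper proposes REDSearcher, a scalable and cost-efficient framework for optimizing long-horizon search agents. It co-designs complex task synthesis, mid-training, and post-training to address the sparsity of high-quality search trajectories and reward signals, through task synthesis governed by graph topology and evidence dispersion, tool-augmented queries, strengthened core atomic capabilities, and a local simulated environment.

Details

Motivation: Large language models are shifting from general-purpose knowledge engines to real-world problem solvers, but optimizing them for deep search tasks remains challenging. The main bottleneck is the extreme sparsity of high-quality search trajectories and reward signals, which stems from the difficulty of constructing long-horizon tasks at scale and the high cost of interaction-heavy rollouts involving external tool calls.

Result: The method achieves state-of-the-art performance on both text-only and multimodal search-agent benchmarks.

Insight: The innovations are: framing task synthesis as a dual-constrained optimization problem to control task difficulty precisely; introducing tool-augmented queries to encourage proactive tool use rather than passive recall; strengthening core atomic capabilities during mid-training to lower the cost of collecting high-quality trajectories; and building a local simulated environment for low-cost reinforcement learning iteration. Viewed objectively, this systematic co-design offers a scalable answer to the data sparsity and high cost of long-horizon search tasks.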

Abstract: Large language models are transitioning from general-purpose knowledge engines to real-world problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of high-quality search trajectories and reward signals, arising from the difficulty of scalable long-horizon task construction and the high cost of interaction-heavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dual-constrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, high-quality tasks. (2) We introduce tool-augmented queries to encourage proactive tool use rather than passive recall. (3) During mid-training, we strengthen core atomic capabilities (knowledge, planning, and function calling), substantially reducing the cost of collecting high-quality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, low-cost algorithmic iteration for reinforcement learning experiments. Across both text-only and multimodal search-agent benchmarks, our approach achieves state-of-the-art performance. To facilitate future research on long-horizon search agents, we will release 10K high-quality complex text search trajectories, 5K multimodal trajectories, and a 1K text RL query set, together with code and model checkpoints.


[178] Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning cs.AI | cs.CLPDF

Qianyue Wang, Jinwu Hu, Huanxiang Lin, Bolin Chen, Zhiquan Wen

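TL;DR: The paper proposes Precedent-Informed Reasoning (PIR), a method for mitigating the overthinking that long chains of thought cause in large reasoning models (LRMs). PIR uses Adaptive Precedent Selection (APS) to build a compact, informative set of precedents for each question, and Test-time Experience Internalization (TEI) to learn the patterns in those precedents at inference time, shifting the reasoning paradigm from exhaustive self-exploration to guided learning from precedents.

Details

Motivation: Reasoning in large language models (LLMs) often involves inefficient long chains of thought full of redundant self-exploration and validation, which raises computational cost and can even degrade performance. Inspired by the way people use related past cases to constrain the search space and reduce trial and error, the work aims to address overthinking in LRMs.

Result: Experiments on mathematical reasoning, scientific QA, and code generation show that PIR consistently shortens reasoning traces while maintaining or improving final accuracy, yielding a strong accuracy-efficiency trade-off across multiple LLMs.

Insight: The core innovation is combining test-time learning with precedent-guided reasoning. Specifically: (1) APS ranks precedents with a joint score of semantic similarity and model perplexity and adapts how many to use; (2) TEI updates lightweight adapters to internalize solution patterns and uses them as a prior for subsequent reasoning. This offers a new, human-inspired paradigm for more efficient reasoning.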

Abstract: Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation, which inflate computational costs and even degrade performance. Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR), transforming LRMs’ reasoning paradigm from exhaustive self-exploration to guided learning from precedents. PIR addresses two key challenges: what precedents to adopt and how to utilize them. First, Adaptive Precedent Selection (APS) constructs, for each question and LRM, a compact set of precedents that are both semantically related and informative for the model. It ranks examples by a joint score with semantic similarity and model perplexity, then adapts the number of precedents to maximize perplexity reduction. Second, Test-time Experience Internalization (TEI) is treated as test-time learning on precedent-informed instruction, updating lightweight adapters to internalize solution patterns and use them as a prior during subsequent reasoning. Experiments across mathematical reasoning, scientific QA, and code generation demonstrate that PIR consistently shortens reasoning traces while maintaining or improving final accuracy across LLMs, yielding outstanding accuracy-efficiency trade-offs.
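The selection step described above (rank by a joint score, then adapt how many precedents to keep) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the weighted-sum form of the joint score, the assumption that lower perplexity is better, and the names `rank_precedents`, `choose_precedents`, and `question_ppl` are all hypothetical.

```python
from typing import Callable, List, Sequence


def rank_precedents(similarities: Sequence[float],
                    perplexities: Sequence[float],
                    weight: float = 0.5) -> List[int]:
    """Return candidate indices sorted by an assumed joint score, best first.

    Assumption: the joint score is a weighted sum of semantic similarity and
    a normalized "low perplexity" term. The abstract does not give the formula.
    """
    max_ppl = max(perplexities)
    scores = [weight * s + (1.0 - weight) * (1.0 - p / max_ppl)
              for s, p in zip(similarities, perplexities)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def choose_precedents(ranked: Sequence[int],
                      question_ppl: Callable[[Sequence[int]], float]) -> List[int]:
    """Keep the prefix of the ranking that gives the lowest question perplexity.

    question_ppl is a hypothetical callback: perplexity of the question when
    the given precedents are placed in the prompt.
    """
    best_k, best_ppl = 0, question_ppl([])
    for k in range(1, len(ranked) + 1):
        ppl = question_ppl(ranked[:k])
        if ppl < best_ppl:
            best_k, best_ppl = k, ppl
    return list(ranked[:best_k])


# Toy usage with made-up numbers.
ranked = rank_precedents([0.9, 0.2, 0.7], [10.0, 40.0, 20.0])
toy_ppl = {0: 50.0, 1: 30.0, 2: 25.0, 3: 28.0}  # perplexity by precedent count
selected = choose_precedents(ranked, lambda chosen: toy_ppl[len(chosen)])
```

In the toy run, adding a third precedent raises the perplexity again, so only the top two are kept.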


[179] MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs cs.AI | cs.CL | cs.LGPDF

Gabriel Roccabruna, Olha Khomyn, Giuseppe Riccardi

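TL;DR: This paper introduces MATEO, a multimodal benchmark for assessing and improving the temporal reasoning abilities that large vision language models (LVLMs) need for real-world planning. The benchmark is built on a high-quality professional multimodal recipe corpus in which instructions are decomposed into discrete steps paired with images, and a scalable crowdsourcing pipeline collects Temporal Execution Order (TEO) annotations as directed acyclic graphs.

Details

Motivation: Existing evaluations of how well foundation models understand temporal execution order are limited: they typically rely on automatically derived annotations, approximate the TEO as a linear chain, or use text-only inputs, so they cannot adequately assess LVLMs in realistic multimodal planning scenarios.

Result: The study uses MATEO to evaluate six state-of-the-art LVLMs across model scales, language context, multimodal input structure, and fine-tuning strategies; the abstract does not report specific quantitative results.

Insight: The novelty lies in building the first high-quality benchmark dedicated to evaluating LVLMs' multimodal temporal reasoning and planning. Its core contribution is a dataset of real, structured multimodal instruction steps annotated with TEO graphs, together with a scalable crowdsourcing annotation pipeline, which provides a new tool for analyzing model capabilities in complex task planning.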

Abstract: AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO), a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models’ understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.
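To make the TEO idea concrete, the sketch below represents a plan as a mapping from each step to its preconditions and checks whether a proposed step order respects that graph. The recipe steps and the function names are invented for illustration and are not taken from the benchmark; `graphlib.TopologicalSorter` is the Python standard library class (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Sequence

# Hypothetical recipe: each step maps to the steps that must finish first.
PRECONDITIONS: Dict[str, List[str]] = {
    "boil water": [],
    "chop onions": [],
    "cook pasta": ["boil water"],
    "mix": ["cook pasta", "chop onions"],
}


def is_valid_order(preconditions: Dict[str, List[str]],
                   order: Sequence[str]) -> bool:
    """True if `order` lists every step once and after all its preconditions."""
    if len(order) != len(preconditions) or set(order) != set(preconditions):
        return False
    done = set()
    for step in order:
        if any(pre not in done for pre in preconditions[step]):
            return False
        done.add(step)
    return True


def one_valid_order(preconditions: Dict[str, List[str]]) -> List[str]:
    """Return one order consistent with the graph (raises CycleError on cycles)."""
    return list(TopologicalSorter(preconditions).static_order())


# Independent steps may be swapped, which a linear chain cannot express.
ok = is_valid_order(PRECONDITIONS, ["chop onions", "boil water", "cook pasta", "mix"])
bad = is_valid_order(PRECONDITIONS, ["cook pasta", "boil water", "chop onions", "mix"])
```

The first order is accepted even though it differs from the listing order, while the second is rejected because the pasta is cooked before the water boils.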


[180] Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains cs.AI | cs.CVPDF

Yuqi Xiong, Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Zulong Chen

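TL;DR: This paper proposes Lang2Act, which achieves fine-grained visual reasoning through self-emergent linguistic toolchains. It targets the loss of visual information in existing visual retrieval-augmented generation frameworks, which depend on pre-defined external tools, by using reinforcement learning to train vision-language models to explore and exploit linguistic tools on their own to strengthen perception.

Details

Motivation: Existing visual retrieval-augmented generation frameworks usually rely on pre-defined external tools that decouple visual perception from reasoning, which can cause unnecessary loss of visual information (for example, in image cropping). The paper aims for finer-grained visual perception and reasoning.

Result: Experiments show that Lang2Act substantially enhances the visual perception capabilities of vision-language models, with performance improvements of over 4%.

Insight: The novelty is the concept of self-emergent linguistic toolchains, which lets the model explore and build a reusable linguistic toolbox rather than depend on fixed external engines, supported by a two-stage reinforcement learning training framework.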

Abstract: Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.


[181] VSAL: A Vision Solver with Adaptive Layouts for Graph Property Detection cs.AI | cs.CVPDF

Jiahao Xie, Guangmo Tong

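TL;DR: This paper proposes VSAL, a vision solver whose adaptive layout generator dynamically produces informative visualizations tailored to individual graph instances, improving graph property detection. Experiments show that VSAL outperforms existing vision-based methods on tasks including Hamiltonian cycle, planarity, claw-freeness, and tree detection.

Details

Motivation: Existing vision-based graph property detection methods rely on fixed graph layouts, which limits their expressiveness and detection performance, so a method that adaptively generates informative visualizations is needed.

Result: On Hamiltonian cycle, planarity, claw-freeness, and tree detection, VSAL surpasses state-of-the-art vision-based methods.

Insight: The novelty is the adaptive layout generator, which adjusts the visualization layout to each graph instance and so increases the expressiveness of the vision pipeline, a significant improvement over fixed-layout methods.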

Abstract: Graph property detection aims to determine whether a graph exhibits certain structural properties, such as being Hamiltonian. Recently, learning-based approaches have shown great promise by leveraging data-driven models to detect graph properties efficiently. In particular, vision-based methods offer a visually intuitive solution by processing the visualizations of graphs. However, existing vision-based methods rely on fixed visual graph layouts, and therefore, the expressiveness of their pipeline is restricted. To overcome this limitation, we propose VSAL, a vision-based framework that incorporates an adaptive layout generator capable of dynamically producing informative graph visualizations tailored to individual instances, thereby improving graph property detection. Extensive experiments demonstrate that VSAL outperforms state-of-the-art vision-based methods on various tasks such as Hamiltonian cycle, planarity, claw-freeness, and tree detection.


cs.RO [Back]

[182] FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation cs.RO | cs.AI | cs.CVPDF

Huajian Zeng, Lingyun Chen, Jiaqi Yang, Yuantai Zhang, Fan Shi

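TL;DR: FlowHOI is a two-stage flow-matching framework that generates semantically grounded, temporally coherent hand-object interaction (HOI) sequences, comprising hand poses, object poses, and contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting scene reconstruction. By decoupling geometry-centric grasping from semantics-centric manipulation and using an HOI prior reconstructed from large-scale egocentric video, it addresses the failures of existing VLA models on long-horizon, contact-rich tasks that stem from the lack of an explicit HOI representation.

Details

Motivation: Existing vision-language-action models generate plausible end-effector motions but often fail in long-horizon, contact-rich tasks because the hand-object interaction structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots.

Result: On the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7× higher physics simulation success rate than the strongest diffusion-based baseline, with a 40× inference speedup. Real-robot execution on four dexterous manipulation tasks illustrates the feasibility of retargeting generated HOI representations to real-robot execution pipelines.

Insight: The innovations are: a two-stage flow-matching framework that decouples geometric grasping from semantic manipulation; compact 3D scene tokens and a motion-text alignment loss that ground the generated interactions in the physical scene layout and the language instruction; and a reconstruction of aligned hand-object trajectories and meshes from large-scale egocentric video, which provides an HOI prior for robust generation. Together these offer a new approach to modeling HOI structure explicitly and to improving the semantic coherence and transferability of long-horizon manipulation.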

Abstract: Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.


[183] Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation cs.RO | cs.CVPDF

Wonju Lee, Matteo Grimaldi, Tao Yu

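TL;DR: This paper proposes a visuo-tactile fusion method for robotic manipulation. A Cross-Modal Transformer (CMT) combines wrist-camera observations with tactile signals, and a physics-based bilateral force balance regularization stabilizes the tactile embeddings, enabling high-precision insertion.

Details

Motivation: Vision alone cannot resolve the fine contact interactions in robotic insertion tasks, and existing visuo-tactile fusion methods deliver inconsistent gains; the paper addresses both problems.

Result: On the TacSL benchmark, the method reaches a 96.59% insertion success rate, surpassing naive and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%).

Insight: The innovations are a Cross-Modal Transformer that fuses modalities through structured self- and cross-attention, and a physics-informed regularization, reflecting principles of human motor control, that stabilizes tactile embeddings and so exploits the complementary strengths of vision and touch.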

Abstract: Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged “wrist + contact force” configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.


[184] ProAct: A Dual-System Framework for Proactive Embodied Social Agents cs.RO | cs.CV | cs.GRPDF

Zeyi Zhang, Zixi Kang, Ruijie Zhao, Yusen Feng, Biao Jiang

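TL;DR: The paper presents ProAct, a dual-system framework for proactive embodied social agents that resolves the time-scale conflict between the low latency required for real-time interaction and long-horizon social reasoning. It decouples a low-latency behavioral system from a slower cognitive system and uses a streaming flow-matching model, conditioned via ControlNet, to inject high-level intentions asynchronously into a continuous stream of non-verbal behavior, giving seamless transitions between reactive and proactive gestures.

Details

Motivation: Most existing embodied social agents are reactive systems that respond only to current sensory inputs within a short time window. They lack proactive social behavior grounded in accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction.

Result: ProAct was deployed and evaluated on a physical humanoid robot. In real-world interaction user studies, participants and observers consistently preferred ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefit of dual-system proactive control for embodied social interaction.

Insight: The core innovation is a decoupled dual-system architecture that reconciles the time-scale conflict, together with a streaming flow-matching model that supports asynchronous intention injection and blends reactive and proactive behavior within a single motion stream, offering a new approach to more natural, proactive embodied interaction.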

Abstract: Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction. We present ProAct, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency Behavioral System for streaming multimodal interaction from a slower Cognitive System which performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.


[185] SemanticFeels: Semantic Labeling during In-Hand Manipulation cs.RO | cs.AI | cs.CVPDF

Anas Al Shikh Khalil, Haozhi Qi, Roberto Calandra

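TL;DR: This paper presents SemanticFeels, a framework that extends NeuralFeels by combining vision and touch to perform semantic labeling through a neural implicit shape representation, with a focus on material classification during in-hand manipulation.

Details

Motivation: As robots are integrated into everyday tasks, their ability to perceive object shape and properties during in-hand manipulation is critical for adaptive, intelligent behavior, which calls for integrating vision and touch for semantic labeling.

Result: On both single- and multi-material objects, predicted materials correspond closely to actual materials, with an average matching accuracy of 79.87% across multiple manipulation trials on a multi-material object.

Insight: The novelty is embedding local material predictions, produced by a fine-tuned EfficientNet-B0 CNN from high-resolution tactile readings, into an augmented signed distance field network that jointly predicts geometry and continuous material regions, achieving semantic integration of multimodal perception.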

Abstract: As robots become increasingly integrated into everyday tasks, their ability to perceive both the shape and properties of objects during in-hand manipulation becomes critical for adaptive and intelligent behavior. We present SemanticFeels, an extension of the NeuralFeels framework that integrates semantic labeling with neural implicit shape representation, from vision and touch. To illustrate its application, we focus on material classification: high-resolution Digit tactile readings are processed by a fine-tuned EfficientNet-B0 convolutional neural network (CNN) to generate local material predictions, which are then embedded into an augmented signed distance field (SDF) network that jointly predicts geometry and continuous material regions. Experimental results show that the system achieves a high correspondence between predicted and actual materials on both single- and multi-material objects, with an average matching accuracy of 79.87% across multiple manipulation trials on a multi-material object.


[186] Neurosim: A Fast Simulator for Neuromorphic Robot Perception cs.RO | cs.CVPDF

Richeek Das, Pratik Chaudhari

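TL;DR: Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors, and for simulating the agile dynamics of multi-rotor vehicles in complex, dynamic environments. Together with Cortex, a ZeroMQ-based communication library, it integrates with robotics and machine learning workflows for training and testing neuromorphic perception and control algorithms.

Details

Motivation: Neuromorphic robot perception lacks fast, real-time, high-performance sensor simulation tools and a communication framework that integrates efficiently with machine learning workflows.

Result: Neurosim reaches frame rates as high as ~2700 FPS on a desktop GPU. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors.

Insight: The novelty is pairing high-speed sensor simulation with an efficient communication library, supporting self-supervised training on time-synchronized multimodal data and closed-loop real-time testing, which gives neuromorphic algorithm development an integrated simulation platform.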

Abstract: Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim .


cs.DL [Back]

[187] FMMD: A multimodal open peer review dataset based on F1000Research cs.DL | cs.AI | cs.CL | cs.LGPDF

Zhenzhen Zhuang, Yuqing Fu, Jing Zhu, Zhangping Zhou, Jialiang Lin

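TL;DR: This paper introduces FMMD, a multimodal, multidisciplinary open peer review dataset built from F1000Research. It addresses the limitations of existing peer review datasets, namely a single modality, a narrow disciplinary focus, and missing version alignment, to support finer-grained analysis of the peer review lifecycle and research on multimodal tasks.

Details

Motivation: Research on automated peer review is limited by datasets that are mostly text-centric, concentrated in computer science, and lacking precise alignment between reviewer comments and specific manuscript versions. Such datasets cannot reflect how reviewers assess visual information such as figures, tables, and layout, or how manuscripts evolve across iterations.

Result: The paper builds FMMD, which integrates manuscript-level visual and structural data with version-specific reviewer reports and editorial decisions, providing a comprehensive empirical resource for peer review research.

Insight: The novelty is the first multimodal, multidisciplinary peer review dataset that explicitly aligns reviewer comments with the exact manuscript iteration under review. It moves past the modality and domain limits of existing datasets and supports fine-grained analysis of the review lifecycle and the development of multimodal tasks such as issue detection and review comment generation.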

Abstract: Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation. In parallel, research on automated and AI-assisted peer review has proliferated. Despite this momentum, empirical progress remains constrained by several critical limitations in existing datasets. While reviewers routinely evaluate figures, tables, and complex layouts to assess scientific claims, most existing datasets remain overwhelmingly text-centric. This bias is reinforced by a narrow focus on data from computer science venues. Furthermore, these datasets lack precise alignment between reviewer comments and specific manuscript versions, obscuring the iterative relationship between peer review and manuscript evolution. In response, we introduce FMMD, a multimodal and multidisciplinary open peer review dataset curated from F1000Research. The dataset bridges the current gap by integrating manuscript-level visual and structural data with version-specific reviewer reports and editorial decisions. By providing explicit alignment between reviewer comments and the exact article iteration under review, FMMD enables fine-grained analysis of the peer review lifecycle across diverse scientific domains. FMMD supports tasks such as multimodal issue detection and multimodal review comment generation. It provides a comprehensive empirical resource for the development of peer review research.


cs.CY [Back]

[188] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports cs.CY | cs.AI | cs.CLPDF

Dragan Stoll, Brian E. Perron, Zia Qi, Selina Steinmann, Nicole F. Eicher

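TL;DR: This study examines the potential of reasoning language models to assess parental cooperation in child protection services. It builds a four-stage workflow of case report collection, reasoning-based assessment, automated category extraction, and case labeling, and compares models of different parameter sizes against human-validated data.

Details

Motivation: Assessing parental cooperation in child protection services is complex: it involves ambiguous and conflicting information that traditional methods handle poorly. The study aims to use reasoning language models to improve the accuracy and efficiency of such assessments.

Result: The largest reasoning language model (255B parameters) achieved the highest accuracy, 89%, outperforming the initial approach (80%). Accuracy was higher for mothers (93%) than for fathers (85%), and expert human reviewers showed similar differences.

Insight: The novelty is applying reasoning language models to a complex qualitative assessment task in child protection, using a multi-stage workflow that combines automation with human validation. The results show the models can handle ambiguous information and point to a possible gender bias in professional practice.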

Abstract: Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information. Methods: A four-stage workflow comprising (1) case report collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human-validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports. Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences. Conclusions: RLMs’ reasoning can effectively assess complex case factors such as parental cooperation. Lower accuracy in assessing fathers’ cooperation supports the argument of a stronger professional focus on mothers in CPS interventions.


[189] Synthetic Reader Panels: Tournament-Based Ideation with LLM Personas for Autonomous Publishing cs.CY | cs.AI | cs.CL | cs.HCPDF

Fred Zimmerman

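TL;DR: This paper presents a system of synthetic reader panels built on LLM personas for autonomous book ideation. It simulates diverse reader groups and evaluates concepts through structured tournaments, replacing traditional human focus groups.

Details

Motivation: Traditional evaluation of book ideas depends on homogeneous human reviewers, is costly, and is hard to scale. The work aims to use LLM-simulated diverse reader groups for efficient, scalable screening of book concepts.

Result: The system was deployed in a multi-imprint publishing operation managing 6 active imprints and 609 titles in distribution. Case studies show that synthetic panels produce actionable demographic segmentation, identify structural content issues that homogeneous reviewers miss, and raise the share of high-quality concepts from 15% to 62% of the evaluated pool.

Insight: The innovations are tournament evaluation by diverse LLM-instantiated reader personas and five automated anti-slop checks that filter common defects in LLM evaluations, giving a scalable framework for automated ideation assessment.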

Abstract: We present a system for autonomous book ideation that replaces human focus groups with synthetic reader panels – diverse collections of LLM-instantiated reader personas that evaluate book concepts through structured tournament competitions. Each persona is defined by demographic attributes (age group, gender, income, education, reading level), behavioral patterns (books per year, genre preferences, discovery methods, price sensitivity), and consistency parameters. Panels are composed per imprint to reflect target demographics, with diversity constraints ensuring representation across age, reading level, and genre affinity. Book concepts compete in single-elimination, double-elimination, round-robin, or Swiss-system tournaments, judged against weighted criteria including market appeal, originality, and execution potential. To reject low-quality LLM evaluations, we implement five automated anti-slop checks (repetitive phrasing, generic framing, circular reasoning, score clustering, audience mismatch). We report results from deployment within a multi-imprint publishing operation managing 6 active imprints and 609 titles in distribution. Three case studies – a 270-evaluator panel for a children’s literacy novel, and two 5-person expert panels for a military memoir and a naval strategy monograph – demonstrate that synthetic panels produce actionable demographic segmentation, identify structural content issues invisible to homogeneous reviewers, and enable tournament filtering that eliminates low-quality concepts while enriching high-quality survivors from 15% to 62% of the evaluated pool.
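One of the tournament formats mentioned above, single elimination judged against weighted criteria, might look like the following sketch. The concept names, criterion scores, and weights are made up, and in the described system the per-criterion scores would come from the persona panel rather than a fixed table; nothing here is the author's code.

```python
from typing import Callable, Dict, List, Sequence

Judge = Callable[[str, str], str]


def make_weighted_judge(scores: Dict[str, Dict[str, float]],
                        weights: Dict[str, float]) -> Judge:
    """Build a judge that picks the concept with the higher weighted total.

    `scores[concept][criterion]` stands in for aggregated panel ratings
    (hypothetical values); ties go to the first concept.
    """
    def total(concept: str) -> float:
        return sum(weights[c] * v for c, v in scores[concept].items())

    return lambda a, b: a if total(a) >= total(b) else b


def single_elimination(concepts: Sequence[str], judge: Judge) -> str:
    """Run pairwise rounds until one concept remains; odd entrants get a bye."""
    if not concepts:
        raise ValueError("need at least one concept")
    current: List[str] = list(concepts)
    while len(current) > 1:
        winners = [judge(current[i], current[i + 1])
                   for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            winners.append(current[-1])  # unpaired concept advances
        current = winners
    return current[0]


WEIGHTS = {"market_appeal": 0.5, "originality": 0.3, "execution": 0.2}
SCORES = {
    "A": {"market_appeal": 0.6, "originality": 0.5, "execution": 0.5},
    "B": {"market_appeal": 0.9, "originality": 0.4, "execution": 0.7},
    "C": {"market_appeal": 0.5, "originality": 0.9, "execution": 0.6},
}
champion = single_elimination(["A", "B", "C"], make_weighted_judge(SCORES, WEIGHTS))
```

With these toy numbers the weighted totals are 0.55, 0.71, and 0.64, so concept B wins both of its matches.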


[190] CrisiSense-RAG: Crisis Sensing Multimodal Retrieval-Augmented Generation for Rapid Disaster Impact Assessment cs.CY | cs.CV | cs.IRPDF

Yiming Xiao, Kai Yin, Ali Mostafavi

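TL;DR: This paper presents CrisiSense-RAG, a multimodal retrieval-augmented generation framework for rapid disaster impact assessment. It uses hybrid retrieval to integrate heterogeneous sources such as real-time social media text and high-resolution post-event satellite imagery, requires no disaster-specific fine-tuning, and targets the temporal asynchrony of data in disaster assessment.

Details

Motivation: Real-time human reports capture peak hazard conditions, whereas high-resolution post-event satellite imagery often reflects the recession phase. Naively fusing these misaligned streams can severely underestimate peak hazard extent and so compromise emergency response.

Result: In a zero-shot evaluation on Hurricane Harvey across 207 ZIP-code queries, the framework achieves a flood extent mean absolute error (MAE) of 10.94% to 28.40% and a damage severity MAE of 16.47% to 21.65%. Prompt-level alignment is critical for quantitative validity, improving damage estimates by up to 4.75 percentage points.

Insight: The novelty is reframing disaster impact assessment as evidence synthesis over heterogeneous data sources, with a split-pipeline architecture and asynchronous fusion logic that prioritizes real-time social media evidence for peak flood extent while treating imagery as persistent evidence of structural damage. This gives a practical method for rapid, deployable disaster assessment under real-world data constraints.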

Abstract: Timely and spatially resolved disaster impact assessment is essential for effective emergency response. However, automated methods typically struggle with temporal asynchrony. Real-time human reports capture peak hazard conditions while high-resolution satellite imagery is frequently acquired after peak conditions. This often reflects flood recession rather than maximum extent. Naive fusion of these misaligned streams can yield dangerous underestimates when post-event imagery overrides documented peak flooding. We present CrisiSense-RAG, which is a multimodal retrieval-augmented generation framework that reframes impact assessment as evidence synthesis over heterogeneous data sources without disaster-specific fine-tuning. The system employs hybrid dense-sparse retrieval for text sources and CLIP-based retrieval for aerial imagery. A split-pipeline architecture feeds into asynchronous fusion logic that prioritizes real-time social evidence for peak flood extent while treating imagery as persistent evidence of structural damage. Evaluated on Hurricane Harvey across 207 ZIP-code queries, the framework achieves a flood extent MAE of 10.94% to 28.40% and damage severity MAE of 16.47% to 21.65% in zero-shot settings. Prompt-level alignment proves critical for quantitative validity because metric grounding improves damage estimates by up to 4.75 percentage points. These results demonstrate a practical and deployable approach to rapid resilience intelligence under real-world data constraints.
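The prioritization rule and the MAE metric described above can be illustrated with a toy example. This is only a sketch under stated assumptions: the "prefer social evidence when present" rule is a simplification of whatever fusion logic the system uses, the function names are hypothetical, and all numbers are invented.

```python
from typing import Optional, Sequence


def fuse_flood_extent(social: Optional[float],
                      imagery: Optional[float]) -> float:
    """Assumed rule: prefer the real-time social estimate of peak flood extent.

    Post-event imagery is used only as a fallback because it may show
    receded water rather than the peak.
    """
    if social is not None:
        return social
    if imagery is not None:
        return imagery
    raise ValueError("no evidence available for this area")


def mean_absolute_error(predicted: Sequence[float],
                        truth: Sequence[float]) -> float:
    """Average absolute difference, here in percentage points of flood extent."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


# Invented per-area flood extent (%): ground truth, social reports, imagery.
truth = [60.0, 40.0]
social = [55.0, 45.0]
imagery = [20.0, 10.0]  # acquired after the water receded

fused = [fuse_flood_extent(s, i) for s, i in zip(social, imagery)]
mae_fused = mean_absolute_error(fused, truth)
mae_imagery_only = mean_absolute_error(imagery, truth)
```

In this toy case the imagery-only estimate is far below the peak extent, which is the underestimation risk the abstract warns about.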


eess.IV [Back]

[191] Learnable Multi-level Discrete Wavelet Transforms for 3D Gaussian Splatting Frequency Modulation eess.IV | cs.CV | eess.SPPDF

Hung Nguyen, An Le, Truong Nguyen

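TL;DR: This paper proposes a frequency modulation framework based on the multi-level discrete wavelet transform for 3D Gaussian Splatting. By recursively decomposing the low-frequency subband, it builds a deeper curriculum that gives progressively coarser supervision early in training, reducing the number of Gaussian primitives while preserving rendering quality.

Details

Motivation: In 3D Gaussian Splatting the number of Gaussian primitives grows sharply during training, raising memory and storage costs. Existing methods such as AutoOpti3DGS use a single-level learnable discrete wavelet transform for frequency modulation, but the modulation depth is limited and joint optimization introduces gradient competition that causes excessive Gaussian densification.

Result: On standard benchmarks, the method further reduces Gaussian counts while maintaining competitive rendering quality.

Insight: The novelty is deeper frequency modulation through a multi-level discrete wavelet transform, using a single scaling parameter instead of learning the full 2-tap high-pass filter. This simplifies optimization, eases gradient competition, and controls Gaussian growth more effectively.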

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful approach for novel view synthesis. However, the number of Gaussian primitives often grows substantially during training as finer scene details are reconstructed, leading to increased memory and storage costs. Recent coarse-to-fine strategies regulate Gaussian growth by modulating the frequency content of the ground-truth images. In particular, AutoOpti3DGS employs the learnable Discrete Wavelet Transform (DWT) to enable data-adaptive frequency modulation. Nevertheless, its modulation depth is limited by the 1-level DWT, and jointly optimizing wavelet regularization with 3D reconstruction introduces gradient competition that promotes excessive Gaussian densification. In this paper, we propose a multi-level DWT-based frequency modulation framework for 3DGS. By recursively decomposing the low-frequency subband, we construct a deeper curriculum that provides progressively coarser supervision during early training, consistently reducing Gaussian counts. Furthermore, we show that the modulation can be performed using only a single scaling parameter, rather than learning the full 2-tap high-pass filter. Experimental results on standard benchmarks demonstrate that our method further reduces Gaussian counts while maintaining competitive rendering quality.
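The recursive decomposition of the low-frequency subband can be shown on a 1D signal with the fixed Haar wavelet. This is a minimal sketch, not the paper's learnable 2D transform: the Haar filter, the 1D setting, and applying one `scale` factor to the high-frequency coefficients at reconstruction are assumptions chosen to illustrate the idea.

```python
import math
from typing import List, Sequence, Tuple

S = 1.0 / math.sqrt(2.0)  # Haar normalization


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One DWT level: pairwise sums (low-pass) and differences (high-pass)."""
    low = [(signal[i] + signal[i + 1]) * S for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * S for i in range(0, len(signal), 2)]
    return low, high


def multilevel_dwt(signal: Sequence[float],
                   levels: int) -> Tuple[List[float], List[List[float]]]:
    """Recursively decompose the low-frequency subband `levels` times."""
    low, highs = list(signal), []
    for _ in range(levels):
        if len(low) < 2 or len(low) % 2:
            raise ValueError("length must be divisible by 2**levels")
        low, high = haar_step(low)
        highs.append(high)
    return low, highs


def reconstruct(low: Sequence[float], highs: Sequence[Sequence[float]],
                scale: float = 1.0) -> List[float]:
    """Invert the transform, scaling high-frequency detail by `scale`.

    scale=1 restores the input; scale=0 keeps only the coarsest content.
    """
    current = list(low)
    for high in reversed(highs):
        merged: List[float] = []
        for l, h in zip(current, high):
            merged.extend([(l + scale * h) * S, (l - scale * h) * S])
        current = merged
    return current


low, highs = multilevel_dwt([4.0, 2.0, 6.0, 8.0], levels=2)
exact = reconstruct(low, highs, scale=1.0)
coarse = reconstruct(low, highs, scale=0.0)
```

With all detail suppressed the reconstruction collapses to the signal mean, which is the kind of coarse target a curriculum could start from before detail is gradually restored.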


cs.SD [Back]

[192] Investigation for Relative Voice Impression Estimation cs.SD | cs.CL | cs.LGPDF

Keinichi Fujita, Yusuke Ijima

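TL;DR: This paper studies relative voice impression estimation (RIE), which predicts the perceptual difference between two utterances from the same speaker. Comparing classical acoustic features, self-supervised speech representations, and multimodal large language models (MLLMs), it finds that self-supervised representations outperform classical features in capturing complex, dynamic impressions (e.g., "Cold–Warm"), while current MLLMs are unreliable on this fine-grained pairwise task.

Details

Motivation: Most prior work focuses on absolute impression scoring and overlooks relative perceptual differences. The paper proposes the RIE framework to quantify the perceptual shift between utterances from the same speaker along an antonymic axis (e.g., "Dark–Bright").

Result: On recordings of a professional speaker reading a text in various styles, models using self-supervised speech representations outperform methods with classical acoustic features, especially on complex impressions such as "Cold–Warm"; current MLLMs prove unreliable.

Insight: The paper provides the first systematic study of the RIE task and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations, offering a new framework and method comparison for relative impression estimation in speech processing.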

Abstract: Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright"). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., "Cold–Warm") where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.


[193] The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents cs.SD | cs.CL | cs.MMPDF

Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li

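TL;DR: This paper describes the Interspeech 2026 Audio Reasoning Challenge, the first shared task dedicated to evaluating chain-of-thought quality in the audio domain. The challenge introduced the MMAR-Rubrics evaluation protocol, ran Single Model and Agent tracks, and attracted 156 teams worldwide. Results show that agent systems lead in reasoning quality while single models are advancing rapidly.

Details

Motivation: Current large audio language models excel at understanding but often lack a transparent reasoning process, a "black-box" limitation. The challenge aims to address this and advance explainable audio intelligence.

Result: Agent systems currently lead in reasoning quality through iterative tool orchestration and cross-modal analysis, while single models are advancing rapidly via reinforcement learning and sophisticated data pipelines.

Insight: The novelty is organizing the first shared task in the audio domain focused on chain-of-thought quality and proposing MMAR-Rubrics, a novel instance-level protocol that assesses the factuality and logic of reasoning chains, providing new insights for explainable audio intelligence.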

Abstract: Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this “black-box” limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. In addition, single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.


cs.MA [Back]

[194] G2CP: A Graph-Grounded Communication Protocol for Verifiable and Efficient Multi-Agent Reasoning cs.MA | cs.AI | cs.CLPDF

Karim Ben Khaled, Davy Monticolo

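TL;DR: This paper proposes G2CP (Graph-Grounded Communication Protocol), a structured multi-agent communication protocol that addresses the semantic drift, hallucination propagation, and inefficient token consumption caused by natural language communication in LLM-powered multi-agent systems. G2CP defines messages as graph operations (traversal commands, subgraph fragments, and update operations) over a shared knowledge graph, enabling verifiable reasoning traces and eliminating ambiguity.

Details

Motivation: When agents in multi-agent systems communicate in natural language, semantic drift, hallucination propagation, and inefficient token consumption follow; the paper targets these problems.

Result: Experiments on 500 industrial scenarios and 21 real-world maintenance cases show that, relative to free-text baselines, G2CP reduces inter-agent communication tokens by 73%, improves task completion accuracy by 34%, eliminates cascading hallucinations, and produces fully auditable reasoning chains.

Insight: The core innovation is shifting multi-agent communication from linguistic (natural language) to structural (graph operations), giving verifiable, efficient, and unambiguous communication and a new communication paradigm for domains that require precise agent coordination.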

Abstract: Multi-agent systems powered by Large Language Models face a critical challenge: agents communicate through natural language, leading to semantic drift, hallucination propagation, and inefficient token consumption. We propose G2CP (Graph-Grounded Communication Protocol), a structured agent communication language where messages are graph operations rather than free text. Agents exchange explicit traversal commands, subgraph fragments, and update operations over a shared knowledge graph, enabling verifiable reasoning traces and eliminating ambiguity. We validate G2CP within an industrial knowledge management system where specialized agents (Diagnostic, Procedural, Synthesis, and Ingestion) coordinate to answer complex queries. Experimental results on 500 industrial scenarios and 21 real-world maintenance cases show that G2CP reduces inter-agent communication tokens by 73%, improves task completion accuracy by 34% over free-text baselines, eliminates cascading hallucinations, and produces fully auditable reasoning chains. G2CP represents a fundamental shift from linguistic to structural communication in multi-agent systems, with implications for any domain requiring precise agent coordination. Code, data, and evaluation scripts are publicly available.
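The idea of messages as graph operations rather than free text can be sketched with two message types over a shared in-memory graph. The class names (`AddEdge`, `Traverse`, `SharedGraph`), the edge encoding, and the example entities are hypothetical and chosen only for illustration; the actual G2CP message schema is not given in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union


@dataclass(frozen=True)
class AddEdge:
    """Update operation: assert that `source` relates to `target`."""
    source: str
    relation: str
    target: str


@dataclass(frozen=True)
class Traverse:
    """Traversal command: follow `relation` edges out of `start`."""
    start: str
    relation: str


Message = Union[AddEdge, Traverse]


class SharedGraph:
    """Toy shared knowledge graph that logs every message it applies."""

    def __init__(self) -> None:
        self.edges: Dict[Tuple[str, str], Set[str]] = {}
        self.log: List[Message] = []  # ordered record of all operations

    def apply(self, message: Message) -> Set[str]:
        self.log.append(message)
        if isinstance(message, AddEdge):
            key = (message.source, message.relation)
            self.edges.setdefault(key, set()).add(message.target)
            return {message.target}
        if isinstance(message, Traverse):
            # Answers come only from stored edges, never from free text.
            return set(self.edges.get((message.start, message.relation), set()))
        raise TypeError(f"unsupported message: {message!r}")


graph = SharedGraph()
graph.apply(AddEdge("pump_7", "has_symptom", "vibration"))
graph.apply(AddEdge("vibration", "fixed_by", "realign_shaft"))
symptoms = graph.apply(Traverse("pump_7", "has_symptom"))
unknown = graph.apply(Traverse("pump_9", "has_symptom"))
```

Because every exchange is a typed operation recorded in `log`, a reviewer can replay the log to audit how an answer was reached, and a query about an unknown entity returns an empty set instead of a plausible-sounding guess.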