cs.CL [Total: 16]
cs.CV [Total: 46]
eess.IV [Total: 2]
cs.DB [Total: 1]
cs.AI [Total: 6]
cs.LG [Total: 3]
cs.RO [Total: 1]
cs.CR [Total: 1]
cs.NE [Total: 1]
cs.HC [Total: 1]
cs.IR [Total: 1]

cs.CL

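[1] Evaluating Reasoning Models for Queries with Presuppositions cs.CL

Rose Sathyanathan, Kinshuk Vasisht, Danish Pruthi

TL;DR: This paper evaluates how large reasoning models (LRMs) handle user queries that contain presuppositions. Reasoning models are slightly more accurate than non-reasoning models (by 2-11%), but they still fail to challenge a large share (26-42%) of false presuppositions, and their behavior depends on how strongly the presupposition is expressed.

Details

Motivation: Large language models (LLMs) often fail to challenge factually incorrect assumptions in user queries and can reinforce users' misconceptions. Given recent gains in reasoning ability, the paper revisits whether large reasoning models can reason about these assumptions and respond appropriately.

Result: On presupposition-laden queries spanning health, science, and general knowledge, reasoning models are 2-11% more accurate than non-reasoning models, yet several widely deployed models still fail to challenge 26-42% of false presuppositions and remain sensitive to how strongly the presupposition is expressed.

Insight: The contribution is a systematically constructed query set with varying presupposition strength for evaluating model reasoning. The analysis shows that, despite improved reasoning, models remain markedly limited at identifying and correcting false presuppositions, which points to an open challenge in complex semantic reasoning.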

Abstract: Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fail to challenge such erroneous assumptions, and can reinforce users’ misinformed opinions. However, given the recent advances, especially in models’ reasoning capabilities, we revisit whether large reasoning models (LRMs) can reason about the underlying assumptions and respond to user queries appropriately. We construct queries with varying degrees of presuppositions spanning health, science, and general knowledge, and use them to evaluate several widely deployed models. When compared to non-reasoning models, we find that reasoning models achieve a slightly higher accuracy (2-11%), but they still fail to challenge a large fraction (26-42%) of false presuppositions. Further, reasoning models remain susceptible to how strongly the presupposition is expressed.

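[2] S^2tory: Story Spine Distillation for Movie Script Summarization cs.CL | cs.AI

Mingzhe Lu, Yanbing Liu, Qihao Wang, Jiarui Zhang, Jiayue Wu

TL;DR: S^2tory is a narratology-grounded framework for movie script summarization. It analyzes character development trajectories to identify the plot nuclei that drive the story forward and filters out peripheral satellite events, which lets it handle the non-linear, cross-cut narrative structure of scripts. A Narrative Expert Agent (NEAgent) performs theory-constrained reasoning; its distilled knowledge guides a small model to identify plot nuclei, and a second model then generates the summary.

Details

Motivation: Traditional summarization methods based on surface-level saliency perform poorly on movie scripts with non-linear, cross-cut narratives. The goal is to preserve the core story progression.

Result: The framework achieves state-of-the-art semantic fidelity on the MovieSum dataset at approximately 3.5x compression, and zero-shot evaluation on BookSum confirms strong out-of-domain generalization.

Insight: The novelty lies in systematically integrating narratological theory (in particular character development trajectories and the nucleus/satellite distinction) into an automatic summarization framework, which gives a theoretical basis for modeling complex non-linear narratives. The architecture that distills knowledge from an expert agent is also worth borrowing.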

Abstract: Movie scripts pose a fundamental challenge for automatic summarization due to their non-linear, cross-cut narrative structure, which makes surface-level saliency methods ineffective at preserving core story progression. To address this, we introduce S^2tory (Story Spine Distillation), a narratology-grounded framework that leverages character development trajectories to identify plot nuclei, the essential events that drive the narrative forward, while filtering out peripheral satellite events that merely enrich atmosphere or emotion. Our Narrative Expert Agent (NEAgent) performs theory-constrained reasoning, whose distilled knowledge conditions a small model to identify plot nuclei. Another model then uses these plot nuclei to generate the summary. Experiments on the MovieSum dataset demonstrate state-of-the-art semantic fidelity at approximately 3.5x compression, and zero-shot evaluation on BookSum confirms strong out-of-domain generalization. Human evaluation further validates that narratological theory provides an indispensable foundation for modeling complex, non-linear narratives.

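[3] When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning cs.CL

Jiaqi Wei, Xuehang Guo, Pengfei Yu, Xiang Zhang, Wanli Ouyang

TL;DR: This paper proposes Side-by-Side (SxS) Interleaved Reasoning, which addresses the coupling between model state updates and public output commitments in conventional autoregressive interfaces. It turns disclosure timing (when to emit content) into a controllable decision, letting the model interleave partial disclosures with continued private reasoning in the same context and release content only when the reasoning so far supports it.

Details

Motivation: Conventional single-stream autoregressive interfaces incur a "silence tax": additional deliberation delays the first task-relevant content, while naive early streaming risks premature commitments that bias later generation. The paper aims to improve the trade-off between content accuracy and output latency through controllable disclosure timing.

Result: Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B and dense Qwen3-4B) and on both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the accuracy vs. content-latency Pareto trade-off as measured by token-level proxies (e.g., inter-update waiting).

Insight: The core idea is to internalize the disclosure-timing decision within standard autoregressive generation, interleaving disclosure with private reasoning and emitting content only when it is supported, which decouples thinking from speaking. Methodologically, the model is trained with supervised fine-tuning on entailment-aligned interleaved trajectories, followed by reinforcement learning to recover reasoning performance under the new format, without incentivizing filler.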

Abstract: In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves accuracy–content-latency Pareto trade-offs under token-level proxies (e.g., inter-update waiting).

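[4] Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs cs.CL | cs.AI

Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

TL;DR: This paper proposes Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator for reliable offline assessment of therapeutic response quality in mental-health dialogue without using a large language model as the final judge. It represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry in order to detect stealth sycophancy.

Details

Motivation: As conversational AI therapists become more common in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. Existing approaches (direct LLM judges or symmetric text-similarity metrics) align poorly with therapeutic quality because the label depends on clinical direction (moving the user toward regulation or reframing, leaving the state unchanged, or reinforcing deterioration) rather than on text similarity.

Result: On a constructed diagnostic stress-test benchmark of 3,000 dialogue windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, DESG-Ensemble achieves 0.9353 macro-F1 on the 600-window held-out test set, exceeding ConcatANN by 1.51 percentage points, BERTScore by 19.63 points, and TRACT by 33.81 points, which is state-of-the-art performance.

Insight: The core contribution is DESG, which uses a decoupled clinical-state manifold as the main discriminative substrate and adds graph-based trajectory components for asymmetric scoring and interpretable diagnostics, rather than relying on graph structure alone for performance. It offers a model-agnostic approach to multi-domain support-dialogue evaluation that does not depend on an LLM as the final judge, and it highlights the central role of clinical direction in evaluation.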

Abstract: As conversational AI therapists are increasingly used in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. This paper studies multi-domain support-dialogue evaluation without relying on large language models as final judges. We use a direct LLM judge as a baseline that reads raw dialogue text and predicts whether the target response is harmful, productive, or neutral. We find that direct LLM judges and symmetric text-similarity metrics are poorly aligned with therapeutic quality because the target label depends on clinical direction: whether the response moves the user state toward regulation or reframing, leaves it broadly unchanged, or reinforces deterioration through higher risk affect or cognitive-distortion mass. To address this issue, we propose Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry. We evaluate DESG on a constructed diagnostic stress-test benchmark of 3,000 dialogue windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, covering peer support, counseling dialogue, and crisis-oriented interaction. On the 600-window held-out test aggregate, DESG-Ensemble achieves 0.9353 macro-F1, exceeding ConcatANN by 1.51 percentage points, BERTScore by 19.63 points, and TRACT by 33.81 points. Feature ablations, artifact controls, a 100-window blinded adjudicator audit, and qualitative disagreement cases indicate that the clinical state manifold is the main discriminative substrate, while graph-based trajectory components provide asymmetric scoring and interpretable diagnostics rather than serving as the sole source of performance.
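
The headline number above is a macro-F1 score over the three labels the abstract names (harmful, productive, neutral). As a reminder of what that metric measures, here is a minimal sketch of macro-F1 as the unweighted mean of per-class F1. The function name and the toy gold/predicted labels are illustrative assumptions, not data or code from the paper.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels): the rare "neutral" class is missed entirely,
# which drags macro-F1 down even though 3 of 4 predictions are correct.
LABELS = ["harmful", "productive", "neutral"]
gold = ["harmful", "productive", "neutral", "harmful"]
pred = ["harmful", "productive", "harmful", "harmful"]
score = macro_f1(gold, pred, LABELS)
```

Because each class contributes equally, macro-F1 penalizes an evaluator that ignores a rare class, which matters when harmful responses are a minority of the windows.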

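[5] Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding cs.CL | cs.AI | cs.LG

Zhongjian Zhang, Yue Yu, Mengmei Zhang, Junping Du, Xiao Wang

TL;DR: This paper systematically evaluates how well Graph-Tokenizing LLMs (GTokenLLMs) understand graph tokens. Using the proposed GTEval pipeline, which applies instruction transformations at the format and content levels, it finds that existing models do not fully understand graph tokens: they are over-sensitive or over-insensitive to instruction changes and rely heavily on text for reasoning, and instruction tuning does not fully resolve the problem.

Details

Motivation: To challenge the widely held belief that graph-tokenizing LLMs understand graph data effectively, and to examine whether these models truly understand graph tokens in the natural-language embedding space.

Result: Extensive experiments on 6 representative GTokenLLMs show that existing models understand graph tokens incompletely and that performance varies markedly with instruction changes. Graph tokens do preserve task-relevant information and receive attention across LLM layers, but how much they are used varies across models and instruction variants.

Insight: The work exposes the limits of graph-token understanding in GTokenLLMs, proposes the unified evaluation pipeline GTEval, and shows that instruction tuning alone is insufficient, pointing to directions for improvement.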

Abstract: The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph tasks. As a widely recognized paradigm, Graph-Tokenizing LLMs (GTokenLLMs) compress complex graph data into graph tokens and treat them as prefix tokens for querying LLMs, leading many to believe that LLMs can understand graphs more effectively and efficiently. In this paper, we challenge this belief: Do GTokenLLMs fully understand graph tokens in the natural-language embedding space? Motivated by this question, we formalize a unified framework for GTokenLLMs and propose an evaluation pipeline, GTEval, to assess graph-token understanding via instruction transformations at the format and content levels. We conduct extensive experiments on 6 representative GTokenLLMs with GTEval. The primary findings are as follows: (1) Existing GTokenLLMs do not fully understand graph tokens. They exhibit over-sensitivity or over-insensitivity to instruction changes, and rely heavily on text for reasoning; (2) Although graph tokens preserve task-relevant graph information and receive attention across LLM layers, their utilization varies across models and instruction variants; (3) Additional instruction tuning can improve performance on the original and seen instructions, but it does not fully address the challenge of graph-token understanding, calling for further improvement.

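[6] PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination cs.CL | cs.AI

Qiyao Wang, Xinyi Chen, Longze Chen, Hongbo Wang, Hamid Alinejad-Rokny

TL;DR: This paper presents PatRe, the first benchmark that models the full patent examination lifecycle, covering Office Action generation and applicant rebuttal. It comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings, reframing patent examination as a dynamic, multi-turn process of justification and response.

Details

Motivation: Existing benchmarks mostly treat patent examination as discriminative classification or static information extraction, and fail to capture its inherently interactive and iterative nature, which resembles peer review and rebuttal in academic publishing.

Result: Extensive experiments across various LLMs yield key insights into model performance, including differences between proprietary and open-source models and asymmetries between the examiner-analysis and applicant-rebuttal tasks.

Insight: The novelty is the first benchmark that models patent examination as a dynamic, full-lifecycle process with multi-turn argumentation, revealing both the potential and the current limitations of LLMs in complex real-world legal reasoning and technical novelty judgment.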

Abstract: Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inherently interactive and iterative nature, similar to the peer review and rebuttal process in academic publishing. In this paper, we introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response. Extensive experiments across various LLMs reveal critical insights into model performance, including differences between proprietary and open-source models, as well as task asymmetries between examiner analysis and applicant-side rebuttal. These findings highlight both the potential and current limitations of LLMs in modeling complex, real-world legal reasoning and technical novelty judgment in patent examination. We release our code and dataset to facilitate future research on patent examination modeling.

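[7] SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification cs.CL | cs.AI

Zhifeng Hao, Zhongjie Chen, Junhao Lu, Shengyin Yu, Guimin Hu

TL;DR: This paper proposes SERE, a framework that improves LLM performance on event causality identification (ECI) through structural example retrieval. It selects relevant examples with three structural strategies (Conceptual Path Metric, Syntactic Metric, and Causal Pattern Filtering) to mitigate causal hallucination and improve accuracy.

Details

Motivation: LLMs exhibit biases in causal reasoning on ECI and often overpredict causal relationships (causal hallucination), so a method is needed to improve their performance on this task.

Result: Extensive experiments on multiple ECI datasets validate SERE's effectiveness. The abstract reports no specific quantitative results.

Insight: The novelty is structural retrieval that combines edit distance in ConceptNet, tree edit distance on syntactic trees, and LLM-based causal pattern filtering to select few-shot examples more precisely, which reduces bias and improves reasoning.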

Abstract: Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction of causal relationships (causal hallucination). To mitigate these issues and enhance LLM performance in ECI, we propose SERE, a structural example retrieval framework that leverages LLMs’ few-shot learning capabilities. SERE introduces an innovative retrieval mechanism based on three structural concepts: (i) Conceptual Path Metric, which measures the conceptual relationship between events using edit distance in ConceptNet; (ii) Syntactic Metric, which quantifies structural similarity through tree edit distance on syntactic trees; and (iii) Causal Pattern Filtering, which filters examples based on predefined causal structures using LLMs. By integrating these structural retrieval strategies, SERE selects more relevant examples to guide LLMs in causal reasoning, mitigating bias and improving accuracy in ECI tasks. Extensive experiments on multiple ECI datasets validate the effectiveness of SERE. The source code is publicly available at https://github.com/DMIRLAB-Group/SERE.

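[8] Rose-SQL: Role-State Evolution Guided Structured Reasoning for Multi-Turn Text-to-SQL cs.CL

Le Zhou, Feng Yao, Fengcai Qiao, Bo Xu, Fangyuan Wang

TL;DR: This paper proposes Rose-SQL, a training-free framework for multi-turn Text-to-SQL. It uses small-scale large reasoning models (LRMs) with in-context learning to achieve accurate context-dependent parsing. At its core is the Role-State, a fine-grained representation that serves as a structural blueprint connecting schema linking and SQL generation. Rose-SQL traces how the Role-State evolves through the historical context via structural isomorphism checks and uses this to guide the model in inferring the likely SQL composition for the current question.

Details

Motivation: Existing approaches typically rely on unstable API-based inference or require expensive fine-tuning of small-scale models, while the potential of LRMs for multi-turn Text-to-SQL remains underexplored. The paper aims to provide a training-free framework that uses small-scale LRMs.

Result: Experiments on the SParC and CoSQL benchmarks show that, within the Qwen3 series, Rose-SQL outperforms in-context learning baselines at the 4B scale and substantially surpasses state-of-the-art fine-tuned models at the 8B and 14B scales, with consistent gains on other reasoning backbones.

Insight: The main novelty is the Role-State as a structured intermediate representation between schema linking and SQL generation, together with tracing its historical evolution (via structural isomorphism checks) to guide SQL generation. This yields a training-free, in-context approach to structured reasoning.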

Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought have demonstrated remarkable capabilities in code generation and mathematical reasoning. However, their potential in multi-turn Text-to-SQL tasks remains largely underexplored. Existing approaches typically rely on unstable API-based inference or require expensive fine-tuning on small-scale models. In this work, we present Rose-SQL, a training-free framework that leverages small-scale LRMs through in-context learning to enable accurate context-dependent parsing. We introduce the Role-State, a fine-grained representation that bridges the structural gap between schema linking and SQL generation by serving as a structural blueprint. To handle conversational dependencies, Rose-SQL traces the evolution of Role-State through historical context via structural isomorphism checks, guiding the model to infer the possible SQL composition for the current question through verified interaction trajectories. Experiments on the SParC and CoSQL benchmarks show that, within the Qwen3 series, Rose-SQL outperforms in-context learning baselines at the 4B scale and substantially surpasses state-of-the-art fine-tuned models at the 8B and 14B scales, while showing consistent gains on additional reasoning backbones.

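[9] Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF cs.CL

Mullosharaf K. Arabov

TL;DR: This preprint is a systematic, research-oriented practical guide to NLP, covering the full modern pipeline from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. Twelve hands-on sessions combine concise theory with detailed implementation plans, with an emphasis on reproducibility through publishing code, models, and reports in public repositories.

Details

Motivation: To give senior undergraduates, graduate students, and developers an actionable, comprehensive practical framework for implementing, comparing, and deploying methods from classical machine learning to state-of-the-art LLM systems, with particular attention to adapting modern NLP to low-resource languages.

Result: The work is itself a reproducible research artefact: all experiments run on a single evolving corpus, and it advocates open-weight models. It includes original research on low-resource languages such as Tajik and Tatar, providing subword tokenisers, embeddings, lexical databases, and transliteration benchmarks.

Insight: The novelty is designing a systematic practical guide as a reproducible research artefact, with full-pipeline openness and integration with the Hugging Face ecosystem, and demonstrating through low-resource language case studies how modern NLP can be adapted to data-scarce settings, bridging education and practice.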

Abstract: This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. Twelve hands-on sessions combine concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible research artefact where every session requires publishing code, models, and reports in public repositories. All experiments are conducted on a single evolving corpus, and the work advocates open-weight models over commercial APIs, with special attention to the Hugging Face ecosystem. The material is enriched by original research on low-resource languages, incorporating linguistic resources for Tajik and Tatar (subword tokenisers, embeddings, lexical databases, and transliteration benchmarks), demonstrating how modern NLP can be adapted to data-scarce environments. It is designed for senior undergraduates, graduate students, and practising developers seeking to implement, compare, and deploy methods from classical ML to state-of-the-art LLM-based systems.

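[10] Reproducing Complex Set-Compositional Information Retrieval cs.CL

Vincent Degenhart, Dewi Timman, Arjen P. de Vries, Faegheh Hasibi, Mohanna Hoveyda

TL;DR: This reproducibility study of complex set-compositional information retrieval evaluates retrieval methods on the QUEST and LIMIT+ benchmarks. On QUEST, the best neural retrievers are more than twice as effective as BM25, but reasoning-targeted methods do not uniformly outperform general-purpose retrievers. On LIMIT+, neural methods degrade sharply while classic lexical retrieval performs very well. Performance drops for all methods as compositional depth increases, with sparse and lexical methods remaining more stable.

Details

Motivation: To address retrieval for complex information needs involving set-compositional queries (conjunction, disjunction, exclusion), and to examine whether current retrieval paradigms genuinely satisfy such constraints or exploit "semantic shortcuts".

Result: On QUEST, the best neural retrievers reach Recall@100 above 0.41 (versus 0.20 for BM25). On LIMIT+, the strongest QUEST method collapses from Recall@100 of about 0.42 to below 0.02, while classic lexical retrieval rises to about 0.96. All methods degrade consistently as compositional depth increases.

Insight: The contribution is LIMIT+, a controlled benchmark in which relevance depends on arbitrary attribute predicates and constraint satisfaction rather than on pretrained knowledge. The analysis shows that sparse and lexical methods are more robust on compositional queries while dense approaches collapse, which is a useful signal for the design of future retrieval systems.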

Abstract: Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit ‘semantic shortcuts’. We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers achieve an effectiveness that is more than double what can be achieved with BM25 (Recall@100 > 0.41 vs. 0.20), but reasoning-targeted methods like ReasonIR and Search-R1 do not outperform general-purpose retrievers uniformly; (ii) on LIMIT+, gains fail to transfer, where the strongest QUEST method collapses from Recall@100 ≈ 0.42 to below 0.02, while classic lexical retrieval rises to ~0.96. Lastly, (iii) stratifying by compositional depth reveals a consistent degradation across all methods, where algebraic sparse and lexical methods show more stable performance while dense approaches collapse. We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
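
To make the evaluation setting concrete, the sketch below builds the relevant set for a query that combines conjunction, disjunction, and exclusion over document attributes, then computes Recall@k for a ranking. The attribute-set data model, the function names, and the toy corpus are assumptions made for illustration; they are not the LIMIT+ generation scripts.

```python
def relevant_set(docs, include_all=(), include_any=(), exclude=()):
    """Return ids of documents that satisfy a set-compositional query.

    docs maps doc id -> set of attributes.
    include_all: every attribute must be present (conjunction).
    include_any: at least one must be present, if given (disjunction).
    exclude:     none may be present (exclusion).
    """
    result = set()
    for doc_id, attrs in docs.items():
        if not all(a in attrs for a in include_all):
            continue
        if include_any and not any(a in attrs for a in include_any):
            continue
        if any(a in attrs for a in exclude):
            continue
        result.add(doc_id)
    return result


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of relevant documents that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Toy corpus (hypothetical): query is "bird AND asia NOT extinct".
docs = {
    "d1": {"bird", "asia"},
    "d2": {"bird", "europe"},
    "d3": {"bird", "asia", "extinct"},
    "d4": {"mammal", "asia"},
}
relevant = relevant_set(docs, include_all=("bird", "asia"), exclude=("extinct",))
# A ranking that puts the topically similar but excluded d3 first.
ranking = ["d3", "d1", "d2", "d4"]
```

In this toy ranking, `d3` matches the topic but violates the exclusion, so it earns no credit at k=1. That is the kind of "semantic shortcut" a constraint-based gold set is meant to expose.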

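[11] TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains cs.CL | cs.AI | cs.HC

Serhii Zabolotnii

TL;DR: This paper proposes TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. It comprises a four-layer reference architecture, a metrologically grounded trust-metric suite, and a Model-Parsimony principle, and its cross-domain applicability is demonstrated through three instantiations in different domains.

Details

Motivation: To address the engineering challenges of building trustworthy agentic AI systems in operationally critical domains, in particular how to integrate large language models systematically while ensuring trustworthiness.

Result: The framework is instantiated in three settings with different governance contexts (clinical decision support, industrial multi-domain operations, and a judicial AI assistant), showing that its architecture and metrics transfer across them.

Insight: The core novelty is making the use of LLMs a deliberate design decision rather than an architectural default, and quantifying model parsimony through the Computational Parsimony Ratio (CPR), which elevates parsimony to a first-class design principle in trustworthy-AI engineering. The framework also provides metrologically grounded trust metrics.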

Abstract: We introduce TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. TRACE combines a four-layer reference architecture with an explicit classical-ML vs. LLM-validator split (L2a/L2b), a stateful orchestration-and-escalation policy (L3), and bounded human supervision (L4); a metrologically grounded trust-metric suite mapped to GUM/VIM/ISO 17025; and a Model-Parsimony principle quantified by the Computational Parsimony Ratio (CPR). Three instantiations (clinical decision support, industrial multi-domain operations, and a judicial AI assistant) transfer the same architecture and metrics across fundamentally different governance contexts. The L2a/L2b separation makes the use of large language models a deliberate design decision rather than an architectural default, with parsimony quantified through CPR. TRACE introduces CPR as a first-class design principle in trustworthy-AI engineering.

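[12] MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following cs.CL | cs.AI

Jaeyun Lee, Junyoung Koh, Zeynel Tok, Hunar Batra, Ronald Clark

TL;DR: This paper proposes MCJudgeBench, a benchmark for constraint-level evaluation of LLM judges in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels (yes/partial/no), and controlled response-side perturbations; evaluation prompt variants test judge stability. Proprietary and open-source LLM judges are evaluated with correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Judge reliability turns out to be multi-dimensional: strong overall performance does not guarantee equally reliable detection across label categories (especially the rarer "partial" and "no" cases), judges with higher correctness do not always have lower inconsistency, and evaluation with reasoning improves correctness but does not uniformly improve stability.

Details

Motivation: LLM judges in multi-constraint instruction following are usually assessed only through overall-response judgments, without fine-grained verification of each constraint, which limits understanding of their reliability. The paper addresses this by building a constraint-level benchmark to study systematically how judges fail under different constraints and perturbations.

Result: Proprietary and open-source LLM judges are evaluated on MCJudgeBench with correctness and inconsistency metrics (separating intrinsic from procedural inconsistency). Reliability varies along several dimensions: judges with strong overall performance are less reliable on rarer label categories such as "partial" and "no"; higher correctness does not always mean lower inconsistency; and evaluation with reasoning improves correctness but does not uniformly improve stability. These findings underline the need for constraint-level evaluation.

Insight: The novelty is MCJudgeBench, presented as the first benchmark dedicated to constraint-level judge evaluation in multi-constraint instruction following, with controlled perturbations and prompt variants to test judge stability systematically. The study argues that evaluation of LLM judges should move from overall-response judgments to fine-grained constraint analysis in order to expose failure modes in rare or complex cases, offering methodological guidance for building more reliable judges.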

Abstract: Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
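
The distinction between correctness and intrinsic inconsistency can be illustrated with a small sketch. The definitions here are assumptions chosen for illustration and may differ from the paper's metrics: correctness is taken as the share of constraints whose majority-vote label matches the gold label, and intrinsic inconsistency as the share of constraints on which repeated stochastic samples from the same judge disagree.

```python
from collections import Counter


def majority_label(samples):
    """Most frequent label among repeated judge samples for one constraint."""
    return Counter(samples).most_common(1)[0][0]


def correctness(gold, runs):
    """Share of constraints whose majority label equals the gold label (assumed definition)."""
    hits = sum(1 for g, samples in zip(gold, runs) if majority_label(samples) == g)
    return hits / len(gold)


def intrinsic_inconsistency(runs):
    """Share of constraints where repeated samples do not all agree (assumed definition)."""
    unstable = sum(1 for samples in runs if len(set(samples)) > 1)
    return unstable / len(runs)


# Toy data (hypothetical): three constraints, three sampled judgments each.
gold = ["yes", "partial", "no"]
runs = [
    ["yes", "yes", "yes"],      # stable and correct
    ["yes", "partial", "yes"],  # unstable, majority vote misses the "partial" label
    ["no", "no", "no"],         # stable and correct
]
```

The two numbers are computed from different views of the same samples, which is why a judge can score well on one and poorly on the other.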

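[13] CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing cs.CL

Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu

TL;DR: This paper introduces CC-OCR V2, a comprehensive and challenging OCR benchmark designed for real-world document processing. It covers five main tracks (text recognition, document parsing, document grounding, key information extraction, and document question answering) and comprises 7,093 high-difficulty samples. Extensive experiments on 14 advanced large multimodal models (LMMs) show that current models still fall short in real-world applications, and even state-of-the-art models degrade substantially across tasks and scenarios.

Details

Motivation: Existing OCR benchmarks adopt task scopes that are misaligned with practical applications and assume homogeneous acquisition conditions, so the effectiveness of LMMs in real-world applications remains underexplored.

Result: On CC-OCR V2, 14 advanced LMMs perform poorly, and even state-of-the-art models show substantial performance degradation, revealing a significant gap between performance on current benchmarks and effectiveness in real-world applications.

Insight: The novelty is a comprehensive OCR benchmark focused on practical enterprise document processing tasks that includes hard and corner cases which are critical yet underrepresented in prior benchmarks, underscoring the importance of evaluating robustness in diverse, realistic scenarios.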

Abstract: Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.

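[14] The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models cs.CL | cs.AI

Daniel Drucker, Kyle Mahowald

TL;DR: This paper studies whether language models can perform conceptual analysis through iterated analysis-and-repair chains: one model instance generates counterexamples, another repairs the definition, and the process repeats. Many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, yet the LM judge accepts roughly twice as many as humans do. Extended iteration produces increasingly verbose definitions without improving accuracy, and some concepts resist stable definitions.

Details

Motivation: To explore whether language models can emulate conceptual analysis as practiced in philosophical methodology, that is, proposing definitions and refining them through counterexamples, as a way to assess their capacity for high-level iterated philosophical reasoning.

Result: Across 20 concepts and thousands of counterexample-repair cycles, the LM judge accepts roughly twice as many counterexamples as humans do, while per-item validity judgments are moderately consistent across humans and between humans and the LM. Iteration does not improve accuracy and instead makes definitions more verbose.

Insight: The novelty is applying language models to the iterative process of philosophical conceptual analysis. The finding that the counterexample-repair loop quickly hits diminishing returns makes it a promising test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.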

Abstract: Conceptual analysis – proposing definitions and refining them through counterexamples – is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.
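
The iterated chain described above can be sketched as a simple loop. This is a schematic under stated assumptions: `propose`, `is_valid`, and `repair` are hypothetical callables standing in for the counterexample-generating model, the validity judge, and the repairing model, and the toy stubs below use plain strings rather than any language model.

```python
def run_chain(definition, propose, is_valid, repair, max_rounds=3):
    """Alternate counterexample generation and definition repair.

    propose(definition)              -> counterexample or None when none is found
    is_valid(definition, example)    -> True if the counterexample is accepted
    repair(definition, example)      -> revised definition
    Returns the list of definitions, starting with the initial one.
    """
    history = [definition]
    for _ in range(max_rounds):
        counterexample = propose(definition)
        if counterexample is None:
            break
        if not is_valid(definition, counterexample):
            continue  # rejected counterexamples leave the definition unchanged
        definition = repair(definition, counterexample)
        history.append(definition)
    return history


# Toy stubs (hypothetical): a fixed queue of counterexamples and a repair step
# that simply appends an exclusion clause to the definition.
queue = iter(["the Pope", "a widower"])
history = run_chain(
    "a bachelor is an unmarried man",
    propose=lambda d: next(queue, None),
    is_valid=lambda d, c: c not in d,
    repair=lambda d, c: f"{d}, excluding {c}",
)
```

With a repair step that only appends clauses, each accepted counterexample lengthens the definition, which mirrors the verbosity growth the abstract reports.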

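[15] Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments cs.CL

Hao Mi, Qiang Sheng, Shaofei Wang, Beizhe Hu, Yifan Sun

TL;DR: This paper proposes LaaB, a framework that improves LLM hallucination detection by modeling label constraints between responses and self-judgments, thereby combining neural features with symbolic judgments. It introduces a "meta-judgment" process that maps symbolic labels back into the feature space and exploits the logical consistency between response labels and meta-judgment labels to align and integrate the two views through mutual learning.

Details

Motivation: Existing hallucination detectors focus either on micro-level intrinsic patterns for uncertainty quantification or on macro-level self-judgments elicited through prompts. They treat implicit neural uncertainty and explicit symbolic reasoning in isolation and fail to exploit their interdependence for a holistic view.

Result: Extensive experiments on 4 public datasets, across 4 LLMs, against 8 baselines demonstrate the superiority of LaaB, which achieves state-of-the-art hallucination detection performance.

Insight: The novelty is the idea of logical consistency as a bridge: a meta-judgment process maps symbolic labels into the feature space, and the semantic constraint between responses and self-judgments is used to align and strengthen the two views.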

Abstract: Large Language Models (LLMs) are prone to factual hallucinations, risking their reliability in real-world applications. Existing hallucination detectors mainly extract micro-level intrinsic patterns for uncertainty quantification or elicit macro-level self-judgments through verbalized prompts. However, these methods address only a single facet of the hallucination, focusing either on implicit neural uncertainty or explicit symbolic reasoning, thereby treating these inherently coupled behaviors in isolation and failing to exploit their interdependence for a holistic view. In this paper, we propose LaaB (Logical Consistency-as-a-Bridge), a framework that bridges neural features and symbolic judgments for hallucination detection. LaaB introduces a “meta-judgment” process to map symbolic labels back into the feature space. By leveraging the inherent logical bridge where response and meta-judgment labels are either the same or opposite based on the self-judgment’s semantics, LaaB aligns and integrates dual-view signals via mutual learning and enhances the hallucination detection. Extensive experiments on 4 public datasets, across 4 LLMs, against 8 baselines demonstrate the superiority of LaaB.
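
One way to read the "same or opposite" label constraint is as a Boolean relation, sketched below. This is an interpretation for illustration, not the paper's implementation: it assumes binary labels, where the response is either factual or hallucinated, the self-judgment asserts one of those two, and the meta-judgment label says whether the self-judgment is correct. All names are hypothetical.

```python
def implied_meta_label(response_ok: bool, judged_ok: bool) -> bool:
    """Meta-judgment label implied by the response label and the self-judgment.

    If the self-judgment says "the response is factual", the self-judgment is
    correct exactly when the response is factual (same label). If it says
    "the response is hallucinated", it is correct exactly when the response
    is not factual (opposite label).
    """
    return response_ok if judged_ok else not response_ok


def violates_bridge(response_ok_pred: bool, judged_ok: bool, meta_ok_pred: bool) -> bool:
    """True when two detectors' predictions break the logical constraint."""
    return meta_ok_pred != implied_meta_label(response_ok_pred, judged_ok)
```

Under this reading, a response-view detector and a meta-judgment-view detector can check each other: any pair of predictions for which `violates_bridge` is true means at least one view is wrong, which is the kind of signal mutual learning could exploit.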

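[16] Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems cs.CL | cs.IR

Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao

TL;DR: This paper introduces the BRIGHT-Pro benchmark and the RTriever-Synth synthetic corpus to evaluate and improve reasoning-intensive retrieval in agentic search systems. BRIGHT-Pro uses expert annotation to expand each query with multi-aspect gold evidence and adds two evaluation protocols, static and agentic search. RTriever-Synth uses aspect decomposition to generate complementary positives and positive-conditioned hard negatives, which are used to fine-tune RTriever-4B. Experiments show that the approach exposes behaviors hidden by standard metrics and substantially improves the base model's retrieval ability.

Details

Motivation: Evaluation and training for reasoning-intensive retrieval are both limited. Benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction, which does not meet the needs of agentic search systems for iterative search and synthesis of complementary evidence.

Result: Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics. RTriever-4B, fine-tuned from Qwen3-Embedding-4B, substantially improves over its base model and shows improved retrieval performance on BRIGHT-Pro.

Insight: The contributions are multi-aspect gold evidence and agentic search evaluation protocols for a fuller assessment of retrievers, plus an aspect-decomposed synthetic corpus with complementary positives and positive-conditioned hard negatives for training. More broadly, the work emphasizes the dynamic, compositional role of retrieval in agentic systems and offers a new evaluation and training framework for reasoning-intensive tasks.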

Abstract: Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.

cs.CV

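[17] Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models cs.CV | cs.AI | cs.LG

Sakshi Agarwal, Aishik Konwer, Ankit Parag Shah

TL;DR: This paper proposes VANGUARD, a framework that uses a multimodal large language model (MLLM) to unify anomaly classification, spatial grounding, and chain-of-thought reasoning for video anomaly detection (VAD). It uses a three-stage curriculum and a teacher-student annotation pipeline that generates structured reasoning trajectories, reaching 94% ROC-AUC and 84% F1 on UCF-Crime while providing interpretable reasoning and grounding of anomalous objects.

Details

Motivation: Traditional VAD methods perform only binary classification or outlier detection and provide neither interpretable reasoning nor precise spatial localization. Existing vision-language models (VLMs) tend to produce hallucinated or geometrically invalid bounding boxes when asked to localize objects.

Result: VANGUARD reaches 94% ROC-AUC and 84% F1 on UCF-Crime, which is state-of-the-art performance. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization.

Insight: Contributions include unifying anomaly classification, spatial grounding, and chain-of-thought reasoning in a single VLM; a three-stage progressive curriculum; a teacher-student pipeline that generates structured reasoning trajectories to overcome sparse annotation; and the finding that chain-of-thought reasoning acts as an implicit regularizer that yields more balanced predictions.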

Abstract: Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding - often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects. We propose VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning within a single VLM. VANGUARD introduces a three-stage curriculum that progressively layers training objectives: (1) classifier warmup on frozen backbone features, (2) LoRA-adapted spatial grounding, and (3) chain-of-thought generation. To overcome the sparse annotation typical of VAD benchmarks, we employ a teacher-student annotation pipeline in which a VLM (Qwen3-VL-4B) generates structured per-subclip reasoning trajectories based on manual annotations available from the UCA Dataset. Further, GroundingDINO provides bounding box supervision. On UCF-Crime, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects - capabilities absent from prior VAD methods. Ablations confirm that staged training outperforms monolithic optimization, and that structured reasoning acts as an implicit regularizer yielding more balanced predictions than classification-only fine-tuning. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization without target-domain adaptation.
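
The abstract mentions geometrically invalid bounding boxes without defining them. The sketch below shows one plausible validity check for a box predicted as `(x1, y1, x2, y2)` in pixel coordinates; the coordinate convention and the criteria (positive area, fully inside the frame) are assumptions for illustration, not VANGUARD's own filter.

```python
def is_valid_box(box, width, height):
    """Check that a predicted box has positive area and lies inside the frame.

    box is (x1, y1, x2, y2) with (x1, y1) the top-left corner and (x2, y2)
    the bottom-right corner, in pixels (assumed convention).
    """
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


# Hypothetical model outputs for a 640x480 frame.
predictions = [
    (10, 20, 200, 300),   # well-formed
    (200, 20, 10, 300),   # corners swapped: negative width
    (600, 400, 700, 500), # extends beyond the frame
]
valid = [box for box in predictions if is_valid_box(box, 640, 480)]
```

A check like this only catches malformed geometry; a hallucinated box that is well-formed but points at nothing would still need to be compared against ground-truth supervision.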

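[18] Approaching human parity in the quality of automated organoid image segmentation cs.CV | cond-mat.soft | q-bio.QM

Chase Cartwright, Gongbo Guo, Sai Teja Pusuluri, Christopher N. Mayhew, Mark Hester

TL;DR: This paper proposes a composite method that combines the general-purpose foundation model Segment Anything Model (SAM) with an existing domain-specific tool to automatically segment images of organoids derived from induced pluripotent stem cells (iPSCs), so that their size and shape can be measured during development. The method is evaluated on organoid image data and compared with manual segmentation and existing tools. It produces consistent, accurate segmentations under nearly all test conditions, with performance close to, and by one measure at, the level of variability between human annotators.

Details

Motivation: Organoids are three-dimensional, self-organizing cell cultures that mimic organ features and serve as an important platform for studying human disease and developing treatments. Their development involves dynamic changes in morphology and cellular organization, so advanced imaging and analysis tools are needed to monitor growth trajectories accurately. Existing tools, however, do not segment organoid images accurately enough across all test conditions.

Result: Evaluation on organoid image data shows that no single existing tool segments accurately enough across all test conditions, whereas the proposed composite method gives consistent, accurate results for all but a very small fraction of the most challenging images. Compared with the variability between manual segmentations by independent annotators (inter-observer variability), the method reaches that level on one measure and comes very close on the others.

Insight: The contribution is a composite segmentation method that pairs the general-purpose foundation model SAM with a domain-specific tool, compensating for the general model's possible weaknesses on a specific biomedical imaging task while exploiting its strong generalization. This combination improves the accuracy and robustness of organoid image segmentation, brings performance close to the consistency of human annotation, and offers a reusable idea for automated bioimage analysis.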

Abstract: Organoids are complex, three-dimensional, self-organizing cell cultures which manifest organ-like features and represent a powerful platform for studying human disease and developing treatment options. Organoid development is characterized by dynamic morphological and cellular organization, which mimics some aspects of organ development. To study these rapid changes over the course of organoid development, advanced imaging and analytical tools are critical to accurately monitor the trajectory of organoid growth and investigate disease processes. In this work, we focus on computer vision and machine learning techniques to automatically measure the size and shape of developing spheroids derived from induced pluripotent stem cells (iPSCs), which are typically the starting material for generating organoid cultures. To facilitate this task, we introduce a composite method that combines the Segment Anything Model (SAM), a general-purpose foundation model, with an existing domain-specific tool. This composite method is evaluated together with several existing tools by testing them on organoid image data and comparing with the results of manual image segmentation. We find that no single existing tool is able to segment the test images with sufficient accuracy across all test conditions, but the newly introduced composite method produces consistent and accurate results for all but a very small fraction of the most challenging images. Finally, we compare the accuracy of this method to the variability between manual segmentations by independent annotators (inter-observer variability) and find that by one measure it performs at the level of inter-observer variability and by others it performs very close to it.
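The abstract compares automated segmentation against inter-observer variability but does not name the measures used. As one plausible illustration, the sketch below computes the Dice overlap between two binary masks; the same function could score annotator-vs-annotator and method-vs-annotator pairs. The choice of Dice and the function name `dice` are assumptions, not details from the paper.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as nested lists of 0/1.

    Assumed metric for illustration; the paper's measures are not stated
    in the abstract.
    """
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    if len(a) != len(b):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


annotator_1 = [[0, 1, 1], [0, 1, 1]]
annotator_2 = [[0, 1, 1], [0, 0, 1]]
inter_observer = dice(annotator_1, annotator_2)
```

Under this reading, a method "performs at the level of inter-observer variability" when its score against an annotator is comparable to `inter_observer`.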

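[19] Learning to Segment using Summary Statistics and Weak Supervision cs.CV | cs.LG

Omkar Kulkarni, Edward Raff, Tim Oates

TL;DR: This paper proposes training medical image segmentation models from summary statistics and weak supervisory signals, with the aim of reducing the burden of expert manual annotation. A loss function that combines image reconstruction quality, statistic matching, and overlap with weakly supervised regions is validated on standard image, ultrasound, and CT data.

Details

Motivation: Expert annotation for medical image segmentation is costly, and the annotations are often discarded after use. The goal is to train models using only the retained summary statistics (such as region area) and a small number of weakly supervised pixels.

Result: Experiments on standard images, breast cancer ultrasound, and kidney tumor CT show that statistics alone give limited results, but adding weakly supervised pixels significantly improves performance, indicating practical potential.

Insight: The contribution is a multi-term loss function that combines image reconstruction, statistic matching, and weak-supervision overlap, offering a new direction for medical image segmentation at low annotation cost.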

Abstract: Medical experts often manually segment images to obtain diagnostic statistics and discard the resulting annotations. We aim to train segmentation models to alleviate this burden, but constrained to the retained summary statistics (e.g., the area of the annotated region). Empirical results suggest that statistics alone are insufficient for this task, but adding weak information in the form of a few pixels within the area of interest significantly improves performance. We use a novel loss function that combines terms for image reconstruction quality, matching to summary statistics, and overlap between the predicted foreground and the weak supervisory signal. Experiments on standard image, ultrasound (breast cancer), and Computed Tomography (CT) scan (kidney tumors) data demonstrate the utility and potential of the approach.
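The abstract names three loss terms (reconstruction quality, matching to summary statistics, and overlap with the weak signal) but not their exact form. The sketch below is one hypothetical instantiation over flattened pixel lists, assuming the statistic is foreground area; every term, weight, and name here is an assumption made for illustration.

```python
def combined_loss(pred, image, recon, target_area, weak_pixels,
                  w_recon=1.0, w_stat=1.0, w_weak=1.0):
    """Hypothetical three-term loss over flattened pixel lists.

    pred        : predicted foreground probabilities in [0, 1]
    image/recon : input pixels and their reconstruction
    target_area : retained summary statistic (foreground pixel count)
    weak_pixels : indices of the few pixels known to be foreground
    """
    n = len(pred)
    # Reconstruction term: mean squared error between image and reconstruction.
    recon_term = sum((i - r) ** 2 for i, r in zip(image, recon)) / n
    # Statistic term: squared area mismatch, normalized by image size.
    stat_term = ((sum(pred) - target_area) / n) ** 2
    # Weak term: penalize low foreground probability on the known pixels.
    weak_term = sum(1.0 - pred[k] for k in weak_pixels) / max(len(weak_pixels), 1)
    return w_recon * recon_term + w_stat * stat_term + w_weak * weak_term
```

The area term alone cannot say where the foreground is, which is consistent with the abstract's finding that statistics by themselves are insufficient and that a few weak pixels help.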

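[20] NucEval: A Robust Evaluation Framework for Nuclear Instance Segmentation cs.CV

Amirreza Mahbod, Ramona Woitek, Jeanne Shen

TL;DR: This paper presents NucEval, a robust evaluation framework for nuclear instance segmentation that targets key problems in existing evaluation pipelines. It addresses four core issues (vague regions, score normalization, overlapping instances, and border uncertainty) and measures their effect on evaluation metrics using the NuInsSeg dataset and two external datasets.

Details

Motivation: Nuclear instance segmentation is a key task in computational pathology, but existing evaluation pipelines fall short: they lack systematic handling of vague regions, score normalization, overlapping instances, and border uncertainty.

Result: Tests with three CNN- and ViT-based nuclear instance segmentation models on NuInsSeg and two external datasets show how the NucEval modifications affect evaluation metrics. The abstract does not state whether any SOTA result is reached.

Insight: The contribution is to identify four key issues in nuclear instance segmentation evaluation and integrate their solutions into a single robust framework, which can improve the accuracy and consistency of evaluation across models and datasets.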

Abstract: In computational pathology, nuclear instance segmentation is a fundamental task with many downstream clinical applications. With the advent of deep learning, many approaches, including convolutional neural networks (CNNs) and vision transformers (ViTs), have been proposed for this task, along with both machine learning-based and non-machine learning-based pre- and post-processing techniques to further boost performance. However, one fundamental aspect that has received less attention is the evaluation pipeline. In this study, we identify four key issues associated with nuclear instance segmentation evaluation and propose corresponding solutions. Our proposed modifications, namely handling vague regions, score normalization, overlapping instances, and border uncertainty, are integrated into a unified framework called NucEval, which enables robust evaluation of nuclear instance segmentation. We evaluate this pipeline using the NuInsSeg dataset, which provides unique characteristics that make it particularly suitable for this study, as well as two additional external datasets, with three CNN- and ViT-based nuclear instance segmentation models, to demonstrate the impact of these modifications on instance segmentation metrics. The code, along with complete guidelines and illustrative examples, is publicly available at: https://github.com/masih4/nuc_eval.

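[21] Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning cs.CV

Lucrezia Tosato, Gianluca Lombardi, Ronny Hansch

TL;DR: This paper introduces Sentinel2Cap, a human-annotated benchmark dataset for multimodal remote sensing image captioning, containing Sentinel-1 SAR and Sentinel-2 multi-spectral image patches with manually written and validated captions. Zero-shot captioning experiments with Qwen3-VL-8B-Instruct on three modalities (RGB, multi-spectral, and SAR pseudo-RGB) show that RGB images perform best, SAR images are more challenging, and modality-specific contextual prompts improve performance.

Details

Motivation: Captioning datasets for multimodal satellite data are scarce, particularly for SAR imagery and medium-resolution sensors. The dataset is intended to support research on cross-modal scene understanding.

Result: In zero-shot captioning experiments on Sentinel2Cap, RGB images achieve the highest captioning performance, while SAR images are more challenging for vision-language models; modality-specific prompts consistently improve performance on all evaluation metrics.

Insight: According to the summary, this is the first human-annotated multimodal remote sensing captioning dataset that includes both SAR and multi-spectral images. The dataset fills a gap in the field, and the performance differences it reveals between modalities, together with the effectiveness of prompt engineering, are relevant to the development of remote sensing vision-language models.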

Abstract: Image captioning has become an important task in computer vision, enabling models to generate natural language descriptions of visual content. While several datasets exist for natural images and high-resolution optical remote sensing imagery, the availability of captioning datasets for multimodal satellite data remains limited, particularly for SAR imagery and medium-resolution sensors. We introduce Sentinel2Cap, a human-annotated multimodal captioning dataset containing Sentinel-1 SAR and Sentinel-2 multi-spectral image patches at 10 m and 20 m spatial resolution with diverse land cover compositions. Captions are created manually and carefully validated to ensure both semantic accuracy and linguistic quality. To evaluate Sentinel2Cap, we perform zero-shot captioning using the Qwen3-VL-8B-Instruct model across three image modalities: RGB, multi-spectral, and SAR pseudo-RGB representations. Results show that RGB images achieve the highest captioning performance, while SAR images remain more challenging for vision-language models. Providing modality-specific contextual prompts consistently improves performance across all metrics. These findings highlight both the challenges of multimodal remote sensing image captioning and the potential value of human-annotated datasets for advancing research in cross-modal scene understanding. All the material is publicly available.

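[22] CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis cs.CV

Abderrahmene Boudiaf, Sajd Javed

TL;DR: This paper presents CropVLM, a vision-language model adapted to the agricultural domain via Domain-Specific Semantic Alignment (DSSA), together with an architecture called the Hybrid Open-Set Localization Network (HOS-Net). The model targets the "phenotyping bottleneck" in high-throughput plant phenotyping by using natural language descriptions to detect novel crops zero-shot, without species-specific training data.

Details

Motivation: High-throughput plant phenotyping suffers from a "phenotyping bottleneck": conventional closed-set computer vision systems require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations.

Result: In zero-shot classification, CropVLM reaches 72.51% accuracy, outperforming seven CLIP-style baselines. In detection, it achieves 49.17 AP50 on novel species in the CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, both ahead of the next-best method (34.89 and 48.58 AP50, respectively).

Insight: The core contribution is mapping agronomic terminology to fine-grained visual features through Domain-Specific Semantic Alignment (DSSA), combined with the Hybrid Open-Set Localization Network (HOS-Net). Together they enable zero-shot detection of novel crops from natural language descriptions alone, offering a scalable solution for agricultural phenotyping.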

Abstract: High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a “phenotyping bottleneck,” where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates CropVLM to enable the detection of novel crops solely from natural language descriptions without retraining. By eliminating the reliance on species-specific training data, CropVLM provides a scalable solution for high-throughput phenotyping, accelerating genetic gain and facilitating large-scale biodiversity research essential for sustainable agriculture. The trained model weights and complete pipeline implementation are publicly available at: https://github.com/boudiafA/CropVLM. In comprehensive evaluations, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. Our detection pipeline demonstrates superior zero-shot generalization to novel species, achieving 49.17 AP50 on our CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, compared to 34.89 and 48.58 for the next-best method, respectively.
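CropVLM is compared with CLIP-style baselines on zero-shot classification. For readers unfamiliar with that setup, the sketch below shows the generic CLIP-style decision rule: pick the class whose text embedding has the highest cosine similarity to the image embedding. The embeddings here are toy vectors, and the function names are hypothetical; this is not the CropVLM implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_label(image_emb, text_embs):
    """Return the label whose text embedding best matches the image embedding.

    text_embs maps a class label to the embedding of its text description.
    """
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


# Toy 2-D embeddings standing in for real encoder outputs.
prompts = {"maize": [1.0, 0.0], "wheat": [0.0, 1.0]}
predicted = zero_shot_label([0.9, 0.1], prompts)
```

Because classes are defined only by text, adding a new species means adding a new description rather than retraining, which is the property the abstract emphasizes.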

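[23] VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing cs.CV

Andong Deng, Dawei Du, Zhenfang Chen, Wen Zhong, Fan Chen

TL;DR: This paper introduces VEBENCH, described as the first comprehensive benchmark for evaluating how well large multimodal models understand editing knowledge and reason about editing operations in realistic video editing scenarios. The benchmark contains 3.9K high-quality edited videos and 3,080 human-verified QA pairs covering two tasks: editing technique recognition and multi-video operation simulation. Experiments show a large gap between existing models and human-level editing cognition.

Details

Motivation: Real-world video editing requires professional cinematic technique and multimodal reasoning. Although large multimodal models have progressed in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored.

Result: Extensive experiments on proprietary models (e.g., Gemini-2.5-Pro) and open-source models reveal a large gap between current model performance and human-level editing cognition, highlighting the need to connect video understanding with creative operational reasoning.

Insight: The contribution is the first benchmark focused on operational reasoning in video editing, with a human-AI collaborative annotation pipeline to ensure data quality and two complementary tasks, recognition and simulation, to assess model ability. It lays groundwork for intelligent video editing systems and research on complex reasoning.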

Abstract: Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models’ ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.

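[24] FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection cs.CV

Kaixiang Zhao, Mao Ye, Lihua Zhou, Hu Wang, Luping Ji

TL;DR: This paper proposes FACTOR, a lightweight, training-free test-time adaptation framework based on counterfactual reasoning. It addresses the performance drop that open-vocabulary object detection suffers under distribution shift because of spurious correlations. The method perturbs non-causal attributes of the test image and compares predictions between the original and counterfactual views, quantifying attribute sensitivity, semantic relevance, and prediction variation to selectively suppress attribute-dependent predictions without any parameter updates.

Details

Motivation: Under distribution shift, open-vocabulary object detection tends to fail because of spurious correlations between non-causal visual attributes (such as brightness and texture) and object categories. Existing test-time adaptation methods either rely on costly online optimization or perform global calibration, overlooking the attribute-specific nature of these failures.

Result: On the PASCAL-C, COCO-C, and FoggyCityscapes benchmarks, FACTOR consistently outperforms prior test-time adaptation methods, indicating that explicit counterfactual reasoning improves robustness under distribution shift.

Insight: The contribution is bringing counterfactual reasoning into test-time adaptation: attribute-level perturbation and region-level prediction comparison enable selective calibration without parameter updates. The approach offers a lightweight, interpretable strategy for mitigating distribution shift and avoids the optimization overhead of conventional adaptation methods.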

Abstract: Open-vocabulary object detection often fails under distribution shifts, as it can be misled by spurious correlations between non-causal visual attributes (e.g., brightness, texture) and object categories. Existing test-time adaptation (TTA) methods either depend on costly online optimization or perform global calibration, overlooking the attribute-specific nature of these failures. To address this, we propose FACTOR (counterFACtual training-free Test-time adaptation for Open-vocabulaRy object detection), a lightweight framework grounded in counterfactual reasoning. By perturbing test images along non-causal attributes and comparing region-level predictions between original and counterfactual views, FACTOR quantifies attribute sensitivity, semantic relevance, and prediction variation to selectively suppress attribute-dependent predictions without parameter updates. Experiments on PASCAL-C, COCO-C, and FoggyCityscapes show that FACTOR consistently outperforms prior TTA methods, demonstrating that explicit counterfactual reasoning effectively improves robustness under distribution shifts.
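To make the original-vs-counterfactual comparison concrete, the sketch below down-weights region scores that change a lot when a non-causal attribute is perturbed. It models only the "prediction variation" signal; the abstract also mentions attribute sensitivity and semantic relevance, which are omitted here. The exponential weighting, the `alpha` parameter, and the function name are assumptions, not the FACTOR formulation.

```python
import math


def suppress_unstable(orig_scores, counterfactual_scores, alpha=5.0):
    """Down-weight region scores that vary across counterfactual views.

    orig_scores           : confidence per region on the original image
    counterfactual_scores : one list of per-region scores per perturbed view
    alpha                 : hypothetical suppression strength
    """
    adjusted = []
    for i, score in enumerate(orig_scores):
        # Mean absolute change of this region's score across perturbed views.
        variation = sum(abs(score - view[i]) for view in counterfactual_scores)
        variation /= len(counterfactual_scores)
        # Stable predictions keep their score; unstable ones are damped.
        adjusted.append(score * math.exp(-alpha * variation))
    return adjusted


original = [0.9, 0.9]
views = [[0.9, 0.3], [0.9, 0.5]]  # e.g. brightness and texture perturbations
adjusted = suppress_unstable(original, views)
```

No model parameters are touched: the adjustment is a pure post-hoc rescoring, which is what makes this family of methods training-free.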

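[25] AHPA: Adaptive Hierarchical Prior Alignment for Diffusion Transformers cs.CV | cs.AI

Ruibin Min, Yexin Liu, Aimin Pan, Changsheng Lu, Jiafei Wu

TL;DR: This paper proposes Adaptive Hierarchical Prior Alignment (AHPA), a framework for accelerating Diffusion Transformer training. It uses the hierarchical representations of a frozen VAE encoder as dynamic alignment targets, and a timestep-conditioned Dynamic Router adaptively selects prior features of different granularity to match the needs of each signal-to-noise stage of denoising.

Details

Motivation: Existing alignment methods for diffusion models typically use a fixed supervision target or fixed alignment granularity along the entire denoising trajectory. They ignore the fact that the useful granularity of representation supervision changes systematically with the signal-to-noise ratio, which leads to a representational mismatch.

Result: Extensive experiments show that AHPA improves convergence and generation quality over baselines, adds no inference cost, and avoids external encoder supervision during training.

Insight: The contribution is a timestep-adaptive hierarchical prior alignment mechanism. It uses the VAE's inherent hierarchical representations to supply complementary priors ranging from local geometry to coarse semantics, and adjusts alignment granularity dynamically to match the model's needs at different noise levels.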

Abstract: Representation alignment has recently emerged as an effective paradigm for accelerating Diffusion Transformer training. Despite their success, existing alignment methods typically impose a fixed supervision target or a fixed alignment granularity throughout the entire denoising trajectory, whether the guidance is provided by external vision encoders, internal self-representations, or VAE-derived features. We argue that such timestep-agnostic alignment is suboptimal because the useful granularity of representation supervision changes systematically with the signal-to-noise ratio. In high-noise regimes, diffusion models benefit more from coarse semantic and layout-level anchoring, whereas in low-noise regimes, the training signal should emphasize spatially detailed and structurally faithful refinement. This non-stationary alignment behavior creates a representational mismatch for static single-level supervisors. To address this issue, we propose Adaptive Hierarchical Prior Alignment (AHPA), a lightweight alignment framework that exploits the hierarchical representations naturally embedded in the frozen VAE encoder. Instead of using only a single compressed latent as the alignment target, AHPA extracts multi-level VAE features that provide complementary priors ranging from local geometry and spatial topology to coarse semantic layout. A timestep-conditioned Dynamic Router adaptively selects and weights these hierarchical priors along the denoising trajectory, thereby synchronizing the alignment granularity with the model’s evolving training needs. Extensive experiments show that AHPA improves convergence and generation quality over baselines and incurs no additional inference cost while avoiding external encoder supervision during training.
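The routing idea (coarse priors at high noise, fine priors at low noise) can be illustrated with a softmax over feature levels whose logits depend on the timestep. In the paper the Dynamic Router is learned; the hand-set linear logits below are only a stand-in to show the intended behavior, and `slopes` is a hypothetical parameter.

```python
import math


def router_weights(t, slopes=(-4.0, 0.0, 4.0)):
    """Softmax weights over VAE feature levels, ordered fine to coarse.

    t      : normalized noise level in [0, 1], where 1 is the noisiest step
    slopes : hypothetical per-level sensitivity to t (not learned here)
    """
    logits = [s * (t - 0.5) for s in slopes]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


high_noise = router_weights(1.0)  # favors the coarse level (last entry)
low_noise = router_weights(0.0)   # favors the fine level (first entry)
```

The per-level alignment losses would then be combined using these weights, so the effective supervision target shifts smoothly along the denoising trajectory.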

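[26] MedSR-Vision: Deep Learning Framework for Multi-Domain Medical Image Super-Resolution cs.CV

Subhash Gurappa, Trivikram Satharasi, Yashas Hariprasad, Sundararaj Sitharama Iyengar

TL;DR: This paper presents MedSR-Vision, a unified deep learning framework for multi-domain medical image super-resolution. It benchmarks three representative models (SRCNN, SwinIR, and Real-ESRGAN) across five imaging modalities (brain MRI, chest X-ray, renal ultrasound, nephrolithiasis CT, and spine MRI) at three magnification scales (×2, ×3, ×4).

Details

Motivation: Medical image super-resolution still faces challenges in preserving anatomical accuracy, maintaining perceptual quality, and generalizing across medical domains. The work also aims to guide model selection for clinical workflows.

Result: Experimental analysis shows that Real-ESRGAN achieves superior perceptual quality and edge recovery at higher scales, SwinIR excels at preserving structural and diagnostic features, and SRCNN provides efficient, stable performance at lower magnifications.

Insight: The contribution is a standardized, unified evaluation framework for multi-modality medical image super-resolution, together with domain-specific performance insights and practical guidance for clinical deployment.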

Abstract: Medical image super-resolution (MedSR) is essential for improving diagnostic precision across diverse imaging modalities such as MRI, CT, X-ray, Ultrasound, and Fundus imaging. Despite rapid advances in deep learning, challenges remain in preserving anatomical accuracy, maintaining perceptual quality, and generalizing across medical domains. This paper presents MedSR-Vision, a novel unified deep learning framework for evaluating and comparing super-resolution models across five modalities: Brain MRI, Chest X-ray, Renal Ultrasound, Nephrolithiasis CT, and Spine MRI, at magnification scales of $\times2$, $\times3$, and $\times4$. Three representative models, namely SRCNN, SwinIR, and Real-ESRGAN, are benchmarked using multiple quantitative metrics encompassing fidelity, perceptual realism, and sharpness. Experimental analysis demonstrates that Real-ESRGAN achieves superior perceptual quality and edge recovery at higher scales, SwinIR excels in preserving structural and diagnostic features, and SRCNN provides efficient and stable performance at lower magnifications. The results establish domain-specific insights and practical guidelines for model selection in clinical imaging workflows, offering a standardized evaluation framework for future medical image super-resolution research and deployment.
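The abstract refers to fidelity metrics without listing them. PSNR is a common fidelity metric in super-resolution work, so the sketch below shows how it is computed over flattened pixel lists. Whether MedSR-Vision uses PSNR specifically is an assumption; the function name and `max_value` default are illustrative.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumed fidelity metric for illustration; the paper's metric list is
    not given in the abstract.
    """
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means closer pixel-wise agreement with the reference, which does not always track perceptual quality; that gap is one reason benchmarks of this kind report perceptual and sharpness metrics as well.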

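[27] VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models cs.CV | cs.AI

JF Bastien, Sam D’Amico

TL;DR: This paper proposes a training-free anti-recomputation approach for video vision-language models (VLMs) that reduces compute by adaptively reusing video state. The core idea is to reuse visual features when validation shows the state is stable and to fetch fresh evidence only when the scene, query, or cache topology changes. On Qwen2.5-VL-7B-Instruct-4bit, follow-up queries on the same video see latency reduced by up to 35.92x while accuracy is preserved.

Details

Motivation: Existing VLMs repeatedly recompute stable visual state when processing video, for example by reprocessing static background frames. The goal is to cut this waste through state reuse and improve inference efficiency.

Result: On a 93-query VideoMME setting, adaptive reuse reduces follow-up query latency by 14.90-35.92x while preserving paired choices and correctness. On Gemma 4-E4B-4bit, C-VISION reaches a 1.316x first-query speedup with no paired drift or parse failures on 20 items.

Insight: The contribution is a training-free, adaptive state caching and reuse mechanism that decides dynamically based on scene changes, avoiding unnecessary visual-encoding recomputation. The method offers a lightweight optimization path for efficient VLM inference and points toward VLM-native media that expose change information directly to reduce redundant perception.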

Abstract: Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated-question schedules hold through 50 turns, while dense-answer-anchored prompt variation separates conservative fixed K=1 repair from faster aggressive policies that drift. Fresh-video pruning is smaller but real. C-VISION skips timed vision-tower work before the first answer is generated. On Gemma 4-E4B-4bit, the clean 32f short cell reaches 1.316x first-query speedup with no paired drift or parse failures on 20 items; Qwen shows the fidelity/speed boundary. Stage-share ceiling (C-CEILING) is the accounting guardrail: a component speedup becomes an end-to-end speedup only in proportion to the wall-clock share it accelerates, so C-VISION and after-ingest follow-up reuse do not multiply. Candidate C-STREAM remains a native-rate target, not a headline result here. The broader direction is VLM-native media that expose change, motion, uncertainty, object state, sensor time, and active tiles directly, so models do not have to rediscover the world from dense RGB every frame.
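The "stage-share ceiling" described in the abstract reads like Amdahl's law: speeding up one stage helps end-to-end latency only in proportion to that stage's share of wall-clock time. The sketch below uses the standard Amdahl formula as an interpretation; the paper's exact accounting is not given in the abstract, and the example shares are made up.

```python
def end_to_end_speedup(share, component_speedup):
    """Amdahl-style end-to-end speedup.

    share             : fraction of wall-clock time spent in the accelerated stage
    component_speedup : speedup achieved on that stage alone
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be in [0, 1]")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - share) + share / component_speedup)


# Hypothetical numbers: a stage taking half the time, made 2x faster.
overall = end_to_end_speedup(0.5, 2.0)
```

With these made-up numbers the overall gain is about 1.33x, not 2x, and even an infinitely fast stage could never exceed 2x at a 50% share. This is why the abstract cautions that the two reported speedups do not multiply.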

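[28] Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology cs.CV | cs.AI

Lina Zhang, Tonmoy Monsoor, Mehmet Efe Lorasdagi, Prateik Sinha, Chong Han

TL;DR: This study evaluates the ability of multimodal large language models (MLLMs) to automatically recognize pathological movements in seizure videos. In zero-shot testing, MLLMs outperform fine-tuned CNN and ViT baselines on 13 of 18 seizure semiology features, doing particularly well on salient postural and contextual features but struggling with subtle, high-frequency movements. Feature-targeted signal enhancement (facial cropping, pose estimation, audio denoising) improves performance on 10 features. In expert evaluation, 94.3% of MLLM explanations for correctly predicted cases reach a faithfulness score of at least 60%, in line with epileptologist reasoning.

Details

Motivation: MLLMs are strong at recognizing everyday human activities, but their potential for analyzing clinically significant involuntary movements in neurological disorders, such as seizure semiology, is largely unexplored. This study examines whether MLLMs can be applied to specialized clinical video analysis.

Result: Zero-shot evaluation covers 20 ILAE-defined semiological features across 90 clinical seizure recordings. MLLMs outperform fine-tuned CNN and ViT baselines on 13 of the 18 comparable features without task-specific training. Feature-targeted preprocessing improves performance on 10 of the 20 features. Expert-rated faithfulness scores indicate that the MLLM explanations are clinically plausible.

Insight: The contribution is adapting general-purpose MLLMs to specialized clinical video analysis through targeted preprocessing strategies (facial cropping, pose estimation, audio denoising). The results show the potential of MLLMs for interpretable, efficient diagnostic assistance without task-specific training.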

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated robust capabilities in recognizing everyday human activities, yet their potential for analyzing clinically significant involuntary movements in neurological disorders remains largely unexplored. This pilot study evaluates the capability of MLLMs for automated recognition of pathological movements in seizure videos. We assessed the zero-shot performance of state-of-the-art MLLMs on 20 ILAE-defined semiological features across 90 clinical seizure recordings. MLLMs outperformed fine-tuned Convolutional Neural Network (CNN) and Vision Transformer (ViT) baseline models on 13 of 18 features without task-specific training, demonstrating particular strength in recognizing salient postural and contextual features while struggling with subtle, high-frequency movements. Feature-targeted signal enhancement (facial cropping, pose estimation, audio denoising) improved performance on 10 of 20 features. Expert evaluation showed that 94.3 percent of MLLM-generated explanations for correctly predicted cases achieved at least 60 percent faithfulness scores, aligning with epileptologist reasoning. These findings demonstrate the potential of adapting general-purpose MLLMs for specialized clinical video analysis through targeted preprocessing strategies, offering a path toward interpretable, efficient diagnostic assistance. Our code is publicly available at https://github.com/LinaZhangUCLA/PathMotionMLLM.

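[29] Tracing Like a Clinician: Anatomy-Guided Spatial Priors for Cephalometric Landmark Detection cs.CV

Sidhartha Mohapatra, Pallavi Mohanty

TL;DR: This paper proposes an anatomy-guided spatial prior initialization pipeline for cephalometric landmark detection. It turns the workflow clinicians follow when analyzing radiographs into computational operations, producing confidence-weighted spatial attention priors that are fed to an HRNet-W32 detector. On a dataset of 1,502 radiographs from multiple devices, the system achieves 1.04 mm mean radial error on 25 landmarks, a 15.4% improvement over the previous best result on 19 landmarks (1.23 mm), with twelve landmarks below 1 mm.

Details

Motivation: No automated system currently reproduces the structured clinical workflow orthodontists follow when analyzing cephalometric radiographs: identifying the soft tissue profile, partitioning anatomical regions, tracing contours, and locating landmarks from geometric definitions. The paper aims to turn this clinical reasoning into computational operations to improve the accuracy and generalization of automated detection.

Result: On 1,502 radiographs from three sources spanning 7+ imaging devices, the system achieves 1.04 mm mean radial error on 25 landmarks, surpassing the previous best result on 19 landmarks (1.23 mm) by a relative 15.4%, with twelve landmarks below 1 mm, a new SOTA. Ablations show that removing the anatomical priors severely harms generalization on the test set (error worsens from 1.04 mm to 1.94 mm).

Insight: The core contribution is encoding clinical domain knowledge (anatomical structure and workflow) as spatial attention priors, giving the downstream detector an inductive bias that architecture and data augmentation alone do not provide. This substantially improves generalization and final accuracy. More broadly, systematically converting an expert's structured reasoning into a computable initialization step is an effective and interpretable way to improve medical image analysis.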

Abstract: When orthodontists trace cephalometric radiographs, they follow a structured workflow: identify the soft tissue profile, partition the skull into anatomical regions, trace contours, and locate landmarks using geometric definitions – yet no automated system replicates this reasoning. We present a five-phase anatomy-guided initialization pipeline that translates this clinical workflow into computational operations, producing confidence-weighted spatial attention priors for a downstream HRNet-W32 detector. On 1,502 radiographs from three sources spanning 7+ imaging devices, the system achieves 1.04 mm mean radial error on 25 landmarks – surpassing prior state-of-the-art (1.23 mm on 19 landmarks) by 15.4%, with twelve landmarks below 1 mm. A three-way controlled ablation reveals two striking findings. First, removing anatomical priors does not merely slow convergence – it destroys generalization: both models converge to ~1.03 mm on validation, but diverge to 1.94 vs. 1.04 mm on the test set. Second, replacing anatomical priors with random-position Gaussians produces even worse generalization (2.24 mm), confirming that the improvement derives from anatomically correct positioning, not additional input channels. Clinical domain knowledge encoded as spatial priors provides an inductive bias that architecture and data augmentation alone do not provide.
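For reference, the headline metric and the 15.4% figure can be reproduced with a few lines. The sketch below computes mean radial error (average Euclidean distance between predicted and true landmarks) and the relative improvement from the two numbers quoted in the abstract. The function names and the `mm_per_pixel` parameter are illustrative assumptions.

```python
import math


def mean_radial_error(pred, truth, mm_per_pixel=1.0):
    """Mean Euclidean distance between matched landmark pairs, in mm."""
    distances = [
        math.hypot(px - tx, py - ty) * mm_per_pixel
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    return sum(distances) / len(distances)


def relative_improvement(baseline, new):
    """Fractional error reduction relative to the baseline."""
    return (baseline - new) / baseline


# Figures quoted in the abstract: 1.23 mm (prior work) vs. 1.04 mm (this work).
gain = relative_improvement(1.23, 1.04)
```

Here `gain` comes to roughly 0.154, matching the reported 15.4%. As the abstract itself notes, the two numbers are measured on different landmark sets (19 vs. 25), so the comparison is indicative rather than like-for-like.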

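[30] Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework cs.CV | cs.AI | cs.MM

Ke Liu, Jiwei Wei, Shuchang Zhou, Yutong Xiao, Ruikun Chai

TL;DR: This paper proposes a training-free dual-system framework that enhances score-based self-supervised talking head forgery detectors. The framework splits samples into confident and uncertain subsets and applies fine-grained, evidence-guided reasoning to the uncertain subset to correct ordering errors in the original score distribution, improving the detector's discriminative ability.

Details

Motivation: Existing score-based self-supervised forgery detectors have limited discriminative ability on hard samples, which shows up as unreliable anomaly ordering; their latent discriminative capacity is not fully exploited.

Result: Extensive experiments across multiple datasets and perturbation settings show consistent improvements, with the gains arising mainly from corrected ordering within the uncertain subset.

Insight: Inspired by the dual-system theory of human cognition, the paper proposes a training-free post-processing framework that uses lightweight routing and evidence-guided reasoning to unlock discriminative cues that existing self-supervised detectors leave underused.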

Abstract: Supervised talking head forgery detection faces severe generalization challenges due to the continuous evolution of generators. By reducing reliance on generator-specific forgery patterns, self-supervised detectors offer stronger cross-generator robustness. However, existing research has mainly focused on building stronger detectors, while the discriminative capacity of trained detectors remains insufficiently exploited. In particular, for score-based self-supervised detectors, the limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering, leaving room for further refinement. Motivated by this observation, we draw inspiration from the dual-system theory of human cognition and propose a Training-Free Dual-System (TFDS) framework to further exploit the latent discriminative capacity of existing score-based self-supervised detectors. TFDS treats anomaly-like scores as the basis of System-1, using lightweight threshold-based routing to partition samples into confident and uncertain subsets. System-2 then revisits only the uncertain subset, performing fine-grained evidence-guided reasoning to refine the relative ordering of ambiguous samples within the original score distribution. Extensive experiments demonstrate consistent improvements across datasets and perturbation settings, with the gains arising mainly from corrected ordering within the uncertain subset. These findings show that existing self-supervised talking head forgery detectors still contain underexploited discriminative cues that can be effectively unlocked through training-free dual-system reasoning.
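The System-1 routing step can be pictured as a simple band test on the anomaly score. The sketch below sends scores inside an ambiguous band to the "uncertain" subset and leaves the rest as "confident". Using two thresholds (`low`, `high`) is an assumption; the abstract only says the routing is threshold-based.

```python
def route(scores, low, high):
    """Split sample indices into confident and uncertain subsets.

    Scores strictly between `low` and `high` are treated as ambiguous.
    The two-threshold band is a hypothetical choice for illustration.
    """
    confident, uncertain = [], []
    for index, score in enumerate(scores):
        if low < score < high:
            uncertain.append(index)
        else:
            confident.append(index)
    return confident, uncertain


confident_ids, uncertain_ids = route([0.05, 0.5, 0.95, 0.4], low=0.3, high=0.7)
```

Only `uncertain_ids` would be passed to the more expensive System-2 reasoning, which keeps the added cost proportional to the number of ambiguous samples.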

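[31] MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding cs.CV

Ran Ran, Jiwei Wei, Shuchang Zhou, Yitong Qin, Shiyuan He

TL;DR: This paper proposes MASRA, a training-time optimization framework based on a multimodal large language model (MLLM) that addresses the cross-modal semantic gap in video temporal grounding (VTG). The framework uses the MLLM to generate two kinds of textual priors, event-level descriptions and clip-level captions, and instantiates two alignments: Event Semantic Temporal Alignment (ESTA) and Local Relational Consistency Alignment (LRCA). These strengthen the correspondence between semantics and temporal events, temporal consistency, and local structural information. In addition, Decoupled Alignment Interaction (DAI) with a context-aware codebook adaptively absorbs query-irrelevant semantics. The MLLM is used during training only and is not called at inference.

Details

Motivation: In VTG, the cross-modal semantic gap causes background features to be wrongly aligned with the query, and directly matching the query to moments yields insufficient discriminability and consistency of temporal semantics.

Result: Extensive experiments show that MASRA outperforms existing methods, and ablation studies validate its effectiveness; the summary describes the results on VTG benchmarks as state-of-the-art.

Insight: The contribution is using MLLM-generated textual priors to assist training-time alignment: ESTA strengthens semantic-temporal correspondence, LRCA strengthens local relational consistency, and DAI with a codebook adaptively handles query-irrelevant information, easing the cross-modal gap and improving performance.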

Abstract: Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a textual relation matrix derived from clip-level captions and aligns it with the temporal feature similarity matrix in the model, enhancing temporal consistency while capturing local structural information. MASRA includes two simple supporting modules, semantic-guided enhancement and second-order relational attention, to better utilize the learned semantic context and relational structure. Moreover, we introduce Decoupled Alignment Interaction (DAI) with a context-aware codebook to adaptively absorb query-irrelevant semantics and alleviate the cross-modal gap. The MLLM is only invoked during training and is not used at inference. Extensive experiments show that MASRA outperforms existing methods, and ablation studies validate its effectiveness.
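LRCA is described as aligning a textual relation matrix (from clip captions) with the model's temporal feature similarity matrix. A minimal way to picture this is to build a cosine-similarity matrix on each side and penalize their difference. The mean-squared-error penalty and the function names below are assumptions; the abstract does not specify the loss.

```python
import math


def cosine_matrix(vectors):
    """Pairwise cosine similarity; assumes every vector is non-zero."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def relation_alignment_loss(caption_embs, clip_feats):
    """Mean squared difference between text and video relation matrices.

    Hypothetical stand-in for the LRCA objective.
    """
    text_rel = cosine_matrix(caption_embs)
    video_rel = cosine_matrix(clip_feats)
    n = len(text_rel)
    return sum(
        (text_rel[i][j] - video_rel[i][j]) ** 2
        for i in range(n) for j in range(n)
    ) / (n * n)
```

The loss is zero when clips that are described similarly also have similar features, regardless of the embedding dimensions or scales on either side.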

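[32] GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning cs.CV | cs.LG

Yujun Li, Hongyuan Zhang, Yuan Yuan

TL;DR: This paper proposes GRPO-TTA, which applies Group Relative Policy Optimization (GRPO) to test-time adaptation (TTA) of vision-language models. It reformulates class-specific prompt prediction as a group-wise policy optimization problem, builds output groups by sampling top-K class candidates from CLIP similarity distributions, and designs alignment and dispersion rewards to guide visual encoder tuning, enabling probability-driven optimization without ground-truth labels.

Details

Motivation: The paper asks whether GRPO can also substantially help test-time adaptation of vision-language models, addressing the performance drop these models suffer under test-time distribution shift.

Result: Extensive experiments on multiple benchmarks show that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger gains under natural distribution shifts.

Insight: The contribution is adapting GRPO to the TTA setting: group-wise policy optimization and probability-driven reward design enable label-free visual tuning at test time, offering a new direction for online adaptation of vision-language models.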

Abstract: Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also substantially improve the test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
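Two ingredients mentioned in the abstract are easy to show in isolation: picking top-K class candidates from similarity scores, and GRPO's group-relative advantage, which standardizes each reward against the group's mean and standard deviation. The sketch below uses deterministic top-K selection and toy rewards; the paper's sampling scheme and its alignment and dispersion rewards are not reproduced here.

```python
import math


def top_k(similarities, k):
    """Indices of the k highest-similarity classes, best first."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    return order[:k]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


candidates = top_k([0.1, 0.7, 0.2, 0.5], k=2)
advantages = group_relative_advantages([1.0, 0.0])  # toy rewards
```

Because advantages are computed relative to the group, no value network or ground-truth label is needed; candidates only have to be ranked against each other by the reward.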

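[33] Learning Discriminative Signed Distance Functions from Multi-scale Level-of-detail Features for 3D Anomaly Detection cs.CV | cs.LG

Haibo Xiao, Hanzhe Liang, Jie Zhou, Jinbao Wang, Can Gao

TL;DR: This paper proposes a surface-based 3D anomaly detection method that learns a discriminative signed distance function and uses multi-scale level-of-detail features to separate normal from abnormal points. The method comprises a noisy points generation module, a multi-scale level-of-detail feature extraction module, and an implicit surface discrimination module, and it outperforms the previous best methods on Anomaly-ShapeNet and Real3D-AD.

Details

Motivation: Because point clouds are large and sparse, learning accurate point-wise representations for 3D anomaly detection is very challenging. Existing group-based and point-based methods have made progress, but more effective surface representations are still needed.

Result: On Anomaly-ShapeNet and Real3D-AD, the method reaches average object-level AUROC of 92.1% and 85.9%, exceeding the current best approach by 2.1% and 3.6% respectively, a new SOTA.

Insight: The contribution is a surface-based framework for learning a discriminative signed distance function: noisy point generation exposes anomalies, multi-scale feature extraction fuses local and global information, and an implicit surface representation separates normal from abnormal points, offering a new representation-learning approach for 3D point cloud anomaly detection.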

Abstract: Detecting anomalies from 3D point clouds has received increasing attention in the field of computer vision, with some group-based or point-based methods achieving impressive results in recent years. However, learning accurate point-wise representations for 3D anomaly detection faces great challenges due to the large scale and sparsity of point clouds. In this study, a surface-based method is proposed for 3D anomaly detection, which learns a discriminative signed distance function using multi-scale level-of-detail features. We first present a Noisy Points Generation (NPG) module to generate different types of noise, thereby facilitating the learning of discriminative features by exposing abnormal points. Then, we introduce a Multi-scale Level-of-detail Feature (MLF) module to capture multi-scale information from a point cloud, which provides both fine-grained local and coarse-grained global feature information. Finally, we design an Implicit Surface Discrimination (ISD) module that leverages the extracted multi-scale features to learn an implicit surface representation of point clouds, which effectively trains a signed distance function to distinguish between abnormal and normal points. Experimental results demonstrate that the proposed method achieves an average object-level AUROC of 92.1% and 85.9% on the Anomaly-ShapeNet and Real3D-AD datasets, outperforming the current best approach by 2.1% and 3.6%, respectively. Codes are available at https://anonymous.4open.science/r/DLF-3AD-DA61.
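
To make the role of a signed distance function concrete, here is a minimal sketch that scores points by their absolute distance to a surface. It substitutes an analytic sphere SDF for the learned implicit surface described in the abstract; the function names and the unit-sphere setup are assumptions for illustration only.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius

def anomaly_scores(points, sdf):
    """Score each point by how far it lies from the zero level set of the SDF."""
    return [abs(sdf(p)) for p in points]

# A point on the surface, a point bulging outward, and a point sunk inward.
points = [(1.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.5, 0.0)]
scores = anomaly_scores(points, sphere_sdf)
```

Points on the normal surface score near zero, while bulges and dents score in proportion to their deviation, which is the property a discriminative SDF is trained to provide.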

[34] First Shape, Then Meaning: Efficient Geometry and Semantics Learning for Indoor Reconstruction cs.CVPDF

Remi Chierchia, Léo Lebrat, David Ahmedt-Aristizabal, Olivier Salvado, Clinton Fookes

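TL;DR: This paper proposes FSTM, an efficient indoor 3D reconstruction method that follows a two-step, geometry-first strategy: a geometry warm-up using only RGB inputs and geometric cues, followed by semantic field estimation. This substantially speeds up training while maintaining high-quality reconstruction.

Details

Motivation: Existing neural surface reconstruction methods that integrate semantic labels usually inherit the slow training and limited scalability of multi-SDF learning, so a more efficient and robust way to jointly learn geometry and semantics is needed.

Result: Experiments on the synthetic Replica dataset and the real-world ScanNet++ dataset show that the method outperforms multi-SDF approaches: it trains 2.3x faster on Replica, is more robust to real-world imperfections on ScanNet++, and achieves higher recall by recovering the surfaces of more objects in the scene.

Insight: The core finding is that a two-stage "first shape, then meaning" pipeline, which optimizes geometry independently (without semantic supervision) before estimating the semantic field, yields substantial improvements over standard joint optimization. This shows that a streamlined unified formulation suffices for strong geometric and semantic reconstruction, without complex multi-SDF designs or specialized modules.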

Abstract: Neural Surface Reconstruction has become a standard methodology for indoor 3D reconstruction, with Signed Distance Functions (SDFs) proving particularly effective for representing scene geometry. A variety of applications require a detailed understanding of the scene context, driving the need for object-level semantic signals. While recent methods successfully integrate semantic labels, they often inherit the slow training time and limited scalability of multi-SDF learning. In this paper, we introduce FSTM, a unified approach for learning geometry and semantics through a two-step process: a geometry warm-up using RGB inputs and geometric cues, followed by semantic field estimation. By first optimising geometry without semantic supervision, we observe substantial improvements compared to the standard joint optimisation. Rather than relying on specialised modules or complex multi-SDF designs, FSTM shows that a streamlined formulation is sufficient to achieve strong geometric and semantic reconstructions. Experiments on both synthetic and real-world indoor datasets show that our method outperforms multi-SDF approaches. It trains 2.3x faster on Replica, improves robustness to real-world imperfections on ScanNet++, and achieves higher recall by recovering the surfaces of more objects in the scene. The code will be made available at https://remichierchia.github.io/FSTM.

[35] WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models cs.CVPDF

Karthik Inbasekar, Guy Rom, Omer Shlomovits

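TL;DR: This paper presents WorldJen, an end-to-end, multi-dimensional benchmark for evaluating generative video models. Using Likert-scale questionnaires graded by a vision-language model (VLM) that sees video at native resolution, it addresses the shortcomings of existing evaluation methods in semantic correctness, physical plausibility, and temporal coherence.

Details

Motivation: Existing evaluation methods for generative video models are flawed: reference-based metrics (e.g., SSIM, PSNR) focus on pixel fidelity and ignore semantics; FVD favors distributional textures over physical plausibility; and binary visual question answering (VQA) benchmarks (e.g., VBench2.0) suffer from yes-bias, rely on low-resolution auditors, and target one dimension at a time, which inflates the number of videos required and still yields unreliable results.

Result: On 50 curated prompts and 6 state-of-the-art video models, a blind human preference study with 2,696 pairwise annotations establishes a human ground-truth Bradley-Terry rating with a three-tier structure. The VLM evaluation engine, using prompt-specific and dimension-specific Likert questionnaires (47,160 scored responses in total), independently reproduces this three-tier structure, with a Spearman rank correlation of ρ̂=1.000 (p=0.0014) against the human results, interpreted as tier agreement. Six ablation studies validate the robustness of the VLM evaluation framework.

Insight: Main contributions: 1) binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution, improving evaluation of semantic and physical plausibility; 2) adversarially curated prompts exercise up to 16 quality dimensions in a single prompt, markedly reducing the number of videos to generate; 3) a large-scale human preference study establishes a reliable human baseline and verifies that the VLM evaluation engine reproduces the tier structure of human judgments, offering a route to automated, scalable evaluation of generated video.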

Abstract: Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Frechet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and reproduces the human-established three-tier BT rating structure independently. The VLM achieves a Spearman $\hat{\rho}=1.000$, $p=0.0014$, which is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework.
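
The abstract does not say how the p-value was obtained. One reading consistent with the reported figure is an exact one-sided permutation test over the six models: with a perfect rank match, only 1 of the 6! = 720 orderings is as extreme, giving p = 1/720 ≈ 0.0014. The sketch below works through that arithmetic; the tie-free ranks and the choice of test are assumptions, not the paper's stated procedure.

```python
from itertools import permutations
from math import factorial

def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def exact_one_sided_p(rank_a, rank_b):
    """Fraction of all orderings of rank_b that correlate at least as strongly."""
    observed = spearman_rho(rank_a, rank_b)
    hits = sum(1 for perm in permutations(rank_b)
               if spearman_rho(rank_a, perm) >= observed - 1e-12)
    return hits / factorial(len(rank_a))

# Hypothetical ranks of six video models under human and VLM evaluation.
human_ranks = [1, 2, 3, 4, 5, 6]
vlm_ranks = [1, 2, 3, 4, 5, 6]
rho = spearman_rho(human_ranks, vlm_ranks)
p_value = exact_one_sided_p(human_ranks, vlm_ranks)
```

With only six items, 1/720 is the smallest p-value this test can produce, which is one reason to read the result as agreement on ordering rather than as evidence about score magnitudes.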

[36] MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models cs.CV | cs.AIPDF

Kangkang Wang, Qinting Jiang, Wanping Zhang, Bowen Ren, Shengzhao Wen

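TL;DR: This paper presents MHPR, a comprehensive benchmark for evaluating joint perception and reasoning of large vision-language models in human-centric scenes. It spans individual, multi-person, and human-object interaction dimensions and includes an automated annotation generation pipeline, ACVG, for building high-quality, scalable datasets.

Details

Motivation: Current LVLM benchmarks mostly focus on single-task settings and lack fine-grained, human-centric evaluation, even though multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans.

Result: Evaluation on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality) shows that training Qwen2.5-VL-7B with MHPR data yields significant gains, reaching near-parity with considerably larger models.

Insight: The novelty lies in a multidimensional, human-centric joint perception-reasoning benchmark, a multi-level data design (raw, fine-tuning, reinforcement learning, and test data), and an automated generation pipeline that ensures annotation quality through attribute decomposition, rewriting, and multi-model voting. Two findings are useful takeaways: format-aligned SFT data improves instruction following, and challenging RL data built from bad-case analysis further improves model performance.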

Abstract: Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design, consisting of Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D), together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). Our findings show that: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further enhances perception and reasoning on difficult instances; and 3) training Qwen2.5-VL-7B with MHPR yields significant gains, achieving near-parity with considerably larger models. We release ACVG and MHPR to facilitate reproducible, extensible research on human-centric perception and reasoning.

Sapna Sachan, Amulya Kumar Mahto, Prashant Wagambar Patil

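TL;DR: This paper proposes an orientation-aware unsupervised domain adaptation framework for brain tumor classification, targeting the poor generalization caused by scarce annotations and inter-institutional domain shift (e.g., differences in scanners and imaging protocols) in multi-modal MRI data. The framework first uses a large-receptive-field CNN to classify input 2D MRI slices as axial, sagittal, or coronal, then for each orientation extracts features for classification with a CNN built on a ResNet50 backbone augmented with four fully connected layers. Feature-level alignment with a maximum mean discrepancy loss, combined with pseudo-label guided adaptation, transfers knowledge from the multi-modal source domain (e.g., T1, T2, FLAIR) to the target domain (post-contrast T1).

Details

Motivation: To address the difficulty of clinical integration and the degraded generalization of deep learning models for brain tumor diagnosis in neuro-oncology, which stem from limited expert-annotated MRI data and inter-institutional domain shift (differences in scanners, imaging protocols, and contrast settings).

Result: Extensive experiments show improved target-domain performance over prior approaches, highlighting the benefits of orientation-specific learning, multi-modal knowledge transfer, pseudo-label guided adaptation, and unsupervised domain adaptation.

Insight: Contributions include: 1) orientation-aware preprocessing (slice orientation classification with a large-receptive-field CNN), enabling orientation-specific feature learning; 2) a slice-wise unsupervised domain adaptation strategy that combines a maximum mean discrepancy loss for feature alignment with pseudo-label guidance to preserve class discriminability; 3) knowledge transfer from a multi-modal source domain (T1, T2, FLAIR) to a single-modality target domain (post-contrast T1), mitigating annotation scarcity and domain discrepancy. Viewed objectively, the method explicitly brings orientation information into the domain adaptation pipeline and integrates multi-modal and pseudo-label strategies, forming a systematic framework for domain shift in medical imaging.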

Abstract: The clinical integration of deep learning models for brain tumor diagnosis in neuro-oncology is severely constrained by limited expert-annotated MRI data and substantial inter-institutional domain shift arising from variations in scanners, imaging protocols, and contrast settings. These challenges significantly impair model generalization in real-world settings. To address this, we propose a novel orientation-aware unsupervised domain-adaptive framework for automated brain tumor classification using mixed 2D MRI slices. First, a CNN with a large receptive field categorizes input slices into axial, sagittal, and coronal views. For each orientation, a CNN architecture with a ResNet50 backbone augmented with four fully connected layers is trained to extract discriminative features for tumor classification. To mitigate annotation scarcity and domain discrepancies, we introduce a slice-wise unsupervised domain adaptation strategy that transfers knowledge from a multi-modal source domain (such as T1, T2, and FLAIR) to the post-contrast T1 target domain. Feature-level alignment is enforced using maximum mean discrepancy loss, complemented by pseudo-label guided adaptation to preserve class discriminability. Extensive experiments demonstrate improved target-domain performance over prior approaches, highlighting the benefits of orientation-specific learning, multi-modal knowledge transfer, pseudo-label-guided adaptation, and unsupervised domain adaptation.
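
The feature-level alignment term mentioned above can be sketched as a squared maximum mean discrepancy between source and target feature batches. This minimal version assumes a single RBF kernel with a fixed bandwidth and the biased estimator; the abstract does not state which kernel or estimator the authors use, and the feature values are hypothetical.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two batches of feature vectors."""
    def mean_kernel(batch_a, batch_b):
        total = sum(rbf_kernel(a, b, sigma) for a in batch_a for b in batch_b)
        return total / (len(batch_a) * len(batch_b))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2 * mean_kernel(source, target))

# Hypothetical 2-D features: the target batch is a shifted copy of the source.
source_features = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
target_features = [[1.0, 1.0], [1.1, 1.2], [1.2, 1.1]]
aligned = mmd_squared(source_features, source_features)
shifted = mmd_squared(source_features, target_features)
```

Minimizing this quantity with respect to the feature extractor pulls the two feature distributions together, which is why it is paired with pseudo-labels that keep the classes separable.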

[38] BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement cs.CV | cs.AIPDF

Ahmed Cherif

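TL;DR: This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a new hybrid metaheuristic optimization framework for low-light image enhancement. The method converts the image to HSV color space, applies Adaptive Gamma Correction with Weighted Distribution (AGCWD) and adaptive denoising to the luminance channel, then uses the Butterfly Optimization Algorithm (BOA) to optimize the Multi-Scale Retinex with Color Restoration (MSRCR) parameters and the Firefly Algorithm (FA) to optimize the AGCWD and denoising parameters. A hybrid BOA-FA switching strategy dynamically balances global exploration and local exploitation, automating parameter tuning for the Retinex enhancement pipeline.

Details

Motivation: Existing Retinex-based low-light enhancement methods rely on manual parameter tuning and generalize poorly across lighting conditions. The goal is automatic parameter optimization to improve image quality.

Result: Experiments on the LOL benchmark dataset (15 paired test images) show that BFORE achieves the highest PSNR (17.22 dB) among all traditional enhancement methods, a 20.3% improvement over Histogram Equalization and 17.5% over MSRCR. Its mean brightness (129.97) is closest to the ideal mid-tone value, and without any training data it surpasses the deep learning baseline RetinexNet in both PSNR (17.22 vs. 16.77 dB) and SSIM (0.5417 vs. 0.4252). The hybrid BOA-FA optimization contributes a 12.3% PSNR improvement and a 14.8% SSIM improvement over the unoptimized pipeline.

Insight: The novelty is the combination of the Butterfly Optimization Algorithm (BOA) and the Firefly Algorithm (FA) into a hybrid metaheuristic framework that dynamically balances global exploration and local exploitation and automatically tunes the parameters of a multi-stage Retinex pipeline. This avoids the limits of manual tuning, achieves state-of-the-art results among traditional methods, and on some metrics even surpasses deep learning methods that require training.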

Abstract: Low-light image enhancement is a fundamental challenge in computer vision and multimedia applications, as images captured under insufficient illumination suffer from poor visibility, low contrast, and color distortion. Existing Retinex-based methods rely on manually tuned parameters that fail to generalize across diverse lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a novel hybrid metaheuristic-optimized framework that automatically tunes the parameters of a multi-stage Retinex-based pipeline. The proposed method converts the input image to HSV color space and applies Adaptive Gamma Correction with Weighted Distribution (AGCWD) to the luminance channel, followed by adaptive denoising. A Butterfly Optimization Algorithm (BOA) optimizes the Multi-Scale Retinex with Color Restoration (MSRCR) parameters, while a Firefly Algorithm (FA) optimizes the AGCWD and denoising parameters. A hybrid BOA-FA switching strategy dynamically balances global exploration and local exploitation. Experimental evaluation on the LOL benchmark dataset (15 paired test images) demonstrates that BFORE achieves the highest PSNR (17.22 dB) among all traditional enhancement methods, with 20.3% improvement over Histogram Equalization and 17.5% over MSRCR. BFORE produces the most naturally balanced mean brightness (129.97), closest to the ideal mid-tone value. Notably, BFORE outperforms RetinexNet – a deep learning baseline – in both PSNR (17.22 vs. 16.77 dB) and SSIM (0.5417 vs. 0.4252) without requiring any training data. The hybrid BOA-FA optimization contributes a 12.3% PSNR improvement and 14.8% SSIM improvement over the unoptimized pipeline.
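
Since the comparisons above are reported in PSNR, a minimal sketch of that metric may help in reading the numbers. It assumes 8-bit intensities flattened into plain lists; the pixel values are hypothetical and unrelated to the LOL dataset.

```python
import math

def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel luminance patches: ground truth and an enhanced output.
ground_truth = [120.0, 130.0, 125.0, 128.0]
enhanced_patch = [110.0, 140.0, 125.0, 118.0]
score = psnr(ground_truth, enhanced_patch)
```

Because PSNR is logarithmic in the mean squared error, a percentage gain in dB (as quoted in the abstract) corresponds to a larger proportional reduction in error than the percentage alone suggests.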

[39] Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models cs.CV | cs.AIPDF

JuneHyoung Kwon, JungMin Yun, YoungBin Kim

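TL;DR: Large vision-language models may memorize and regenerate copyrighted visual content such as characters and logos. To address this, the paper introduces CoVUBench, the first benchmark specifically designed to evaluate unlearning of copyrighted content in LVLMs. It uses procedurally generated, legally safe synthetic data together with systematic visual variations to assess how well unlearning methods generalize, and it measures the trade-off between forgetting efficacy and preservation of general model performance.

Details

Motivation: LVLMs may memorize copyrighted visual content during training, which creates legal risk. Machine unlearning offers a way to remove specific content after training, but a robust, comprehensive framework for evaluating its effectiveness in complex multimodal settings is still missing, especially one that captures the nuances of cross-modal concept erasure.

Result: The paper introduces CoVUBench, the first framework specifically for evaluating copyrighted-content unlearning in LVLMs. It uses procedurally generated synthetic data and systematic visual variations (compositional changes and diverse domain manifestations) to make the evaluation realistic and robust, and it provides a standardized tool for measuring the trade-off between forgetting efficacy and general model performance.

Insight: The novelty is the first evaluation benchmark dedicated to multimodal copyright unlearning, built on legally safe synthetic data generation combined with systematic visual variations that test generalization. Viewed objectively, its protocol of evaluating from both the copyright holder's perspective (forgetting efficacy) and the deployer's perspective (utility preservation) provides a key standardized measure for developing responsible and effective unlearning methods.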

Abstract: Large Vision-Language Models (LVLMs), trained on web-scale data, risk memorizing and regenerating copyrighted visual content such as characters and logos, creating significant challenges. Machine unlearning offers a path to mitigate these risks by removing specific content post-training, but evaluating its effectiveness, especially in the complex multimodal setting of LVLMs, remains an open problem. Current evaluation methods often lack robustness or fail to capture the nuances of cross-modal concept erasure. To address this critical gap, we introduce the CoVUBench benchmark, the first framework specifically designed for evaluating copyright content unlearning in LVLMs. CoVUBench utilizes procedurally generated, legally safe synthetic data coupled with systematic visual variations spanning compositional changes and diverse domain manifestations to ensure realistic and robust evaluation of unlearning generalization. Our comprehensive multimodal evaluation protocol assesses both forgetting efficacy from the copyright holder perspective and the preservation of general model utility from the deployer viewpoint. By rigorously measuring this crucial trade-off, CoVUBench provides a standardized tool to advance the development of responsible and effective unlearning methods for LVLMs.

[40] deSEO: Physics-Aware Dataset Creation for High-Resolution Satellite Image Shadow Removal cs.CV | eess.IVPDF

Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Phillipp Fanta-Jende, Jasmin Lampert

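TL;DR: deSEO is a geometry-aware, physics-informed methodology that, for the first time, derives a paired supervised dataset for high-resolution satellite image shadow removal from the S-EO shadow detection dataset through a replicable pipeline. It pairs a weak reference image with shadowed images using temporal and geometric filtering, Jacobian-based orientation normalisation, and LoFTR-RANSAC registration, and it uses a per-pixel validity mask to restrict the regions used for learning. The authors also develop a DSM-aware deshadowing model that combines residual translation, perceptual objectives, and mask-constrained adversarial learning.

Details

Motivation: Shadows cast by terrain and tall structures degrade classification, detection, and 3D reconstruction in high-resolution satellite image analysis. Public resources offering geometry-consistent paired shadow/shadow-free satellite imagery are lacking, and existing datasets mostly target shadow detection or 3D modelling rather than removal.

Result: The model consistently reduces the visual impact of cast shadows across diverse illumination and viewing conditions and achieves improved structural and perceptual fidelity on held-out scenes. A direct adaptation of a UAV-based SRNet/pix2pix architecture fails to converge under satellite viewpoint variability, whereas the deSEO model remains stable.

Insight: The novelty is the first reproducible, geometry-aware paired dataset and baseline model for satellite shadow removal. The method combines physical and geometric constraints (such as orientation normalisation and registration) with deep learning (mask-constrained adversarial learning), handling the viewpoint variation and parallax specific to satellite imagery and opening a supervised learning route for satellite shadow removal.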

Abstract: Shadows cast by terrain and tall structures remain a major obstacle for high-resolution satellite image analysis, degrading classification, detection, and 3D reconstruction performance. Public resources offering geometry-consistent paired shadow/shadow-free satellite imagery are essentially missing, and most Earth-observation datasets are designed for shadow detection or 3D modelling rather than removal. Existing deep shadow-removal datasets either target ground-level or aerial scenes or rely on unpaired and weakly supervised formulations rather than explicit satellite pairs. We address this gap with deSEO, a geometry-aware and physics-informed methodology that, to the best of our knowledge, is the first to derive paired supervision for satellite shadow removal from the S-EO shadow detection dataset through a fully replicable pipeline. For each tile, deSEO selects a minimally shadowed acquisition as a weak reference and pairs it with shadowed counterparts using temporal and geometric filtering, Jacobian-based orientation normalisation, and LoFTR-RANSAC registration. A per-pixel validity mask restricts learning to reliably aligned regions, enabling supervision despite residual off-nadir parallax. In addition to this paired dataset, we develop a DSM-aware deshadowing model that combines residual translation, perceptual objectives, and mask-constrained adversarial learning. In contrast, a direct adaptation of a UAV-based SRNet/pix2pix architecture fails to converge under satellite viewpoint variability. Our model consistently reduces the visual impact of cast shadows across diverse illumination and viewing conditions, achieving improved structural and perceptual fidelity on held-out scenes. deSEO therefore provides the first reproducible, geometry-aware paired dataset and baseline for shadow removal in satellite Earth observation.
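
The effect of a per-pixel validity mask on supervision can be shown with a minimal masked reconstruction loss. This sketch uses a plain L1 loss averaged over valid pixels only; the abstract does not give the authors' actual loss formulation, so the function and the sample values are illustrative assumptions.

```python
def masked_l1_loss(predicted, reference, validity_mask):
    """Mean absolute error over pixels whose mask value is 1; masked-out pixels are ignored."""
    valid_count = sum(validity_mask)
    if valid_count == 0:
        return 0.0  # nothing reliably aligned in this tile
    total = sum(m * abs(p - r)
                for p, r, m in zip(predicted, reference, validity_mask))
    return total / valid_count

# Hypothetical flattened tile: the last pixel is misaligned by parallax and masked out.
predicted = [0.50, 0.40, 0.90]
reference = [0.50, 0.60, 0.10]
validity_mask = [1, 1, 0]
loss = masked_l1_loss(predicted, reference, validity_mask)
```

Without the mask, the misregistered third pixel would dominate the loss and push the model to reproduce alignment errors rather than remove shadows.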

[41] Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers cs.CVPDF

Lorenzo Mur-Labadia, Ruben Martinez-Cantina, Jose J. Guerrero

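TL;DR: This paper proposes an instance segmentation model based on Bayesian visual Transformers for uncertainty estimation over visual affordance regions. By combining multiple sub-networks with attention mechanisms, the model gains +7.4 percentage points in the $F_\beta^w$ score on the IIT-Aff dataset and provides pixel-wise analysis of epistemic and aleatoric uncertainty, improving the interpretability of the network.

Details

Motivation: Visual affordances identify image regions with potential interactions and are important for applications such as autonomous robots and human-robot interaction. Existing methods lack precise localization of affordance regions and uncertainty estimates, so a model that performs instance segmentation and uncertainty quantification together is needed.

Result: On the IIT-Aff dataset, the model improves the $F_\beta^w$ score by +7.4 percentage points over deterministic networks, reaching state-of-the-art performance. The Bayesian models are better calibrated, produce less overconfident probabilities, and give better uncertainty estimates.

Insight: Contributions include: adopting sample-based and ensemble approaches for uncertainty estimation; proposing the Probability-based Mask Quality measure for analyzing semantic and spatial variations in probabilistic instance segmentation models; extracting stronger features with attention mechanisms and combining them with Bayesian models to improve mask refinement and generalization; and an uncertainty analysis (epistemic variance appears in visually challenging pixels, aleatoric variance along object contours) that improves model interpretability.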

Abstract: Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human-robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps, is crucial for these applications. We present a model for instance segmentation of affordances that adopts sample-based and ensemble approaches for uncertainty estimation. We extend an attention-based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks of Bayesian models improves on deterministic networks thanks to better mask refinement and generalization. This, together with the more powerful features extracted by attention-based mechanisms, yields an improvement of +7.4 p.p. in the $F_\beta^w$ score on the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates. Qualitative results show that aleatoric variance appears at object contours, while epistemic variance is observed in visually challenging pixels, adding interpretability to the neural network.
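
One common way to separate the two kinds of variance from an ensemble is the law of total variance applied to per-pixel foreground probabilities. The sketch below assumes a binary mask per pixel and uses that textbook decomposition; the paper may define its semantic and spatial variances differently, and the member probabilities are hypothetical.

```python
def decompose_uncertainty(member_probs):
    """Split the predictive variance of one pixel into aleatoric and epistemic parts.

    member_probs holds the foreground probability predicted by each ensemble member.
    """
    count = len(member_probs)
    mean_prob = sum(member_probs) / count
    # Aleatoric: average Bernoulli variance predicted by the individual members.
    aleatoric = sum(p * (1 - p) for p in member_probs) / count
    # Epistemic: disagreement between members about the probability itself.
    epistemic = sum((p - mean_prob) ** 2 for p in member_probs) / count
    return mean_prob, aleatoric, epistemic

# Hypothetical pixel on an object contour: members agree it is ambiguous.
contour_pixel = decompose_uncertainty([0.5, 0.5, 0.5, 0.5])
# Hypothetical visually hard pixel: members are confident but disagree.
hard_pixel = decompose_uncertainty([0.05, 0.95, 0.05, 0.95])
```

The two parts always sum to the variance of the averaged prediction, `mean_prob * (1 - mean_prob)`, so the decomposition reallocates uncertainty rather than creating it.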

[42] PriorNet: Prior-Guided Engagement Estimation from Face Video cs.CVPDF

Alexander Vedernikov

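TL;DR: PriorNet is a prior-guided framework for estimating engagement from face video. It injects task-relevant priors at three stages (preprocessing, model adaptation, and objective design) to address incomplete facial evidence, limited labeled data, and subjective annotations.

Details

Motivation: To address the challenges that incomplete facial evidence, limited labeled data, and subjective annotations pose for engagement estimation from face video.

Result: On EngageNet, DAiSEE, DREAMS, and PAFE, each evaluated with its native protocol, PriorNet improves over the strongest listed prior reference within each dataset's evaluation framing. Component ablations indicate that the gains come from complementary contributions of preprocessing, adaptation, and objective-level priors.

Insight: Contributions include converting face-detection failures into explicit zero-frame placeholders so that missing-face events remain represented in the input sequence; adapting a frozen Self-supervised Video Facial Affect Perceiver (SVFAP) backbone through a Prior-guided Low-Rank Adaptation module (Prior-LoRA) for parameter-efficient specialization; and training with a Dirichlet-evidential, uncertainty-weighted objective under hard-label supervision. Viewed objectively, explicit prior injection is shown to be an effective design principle for face-video engagement estimation.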

Abstract: Engagement estimation from face video remains challenging because facial evidence is often incomplete, labeled data are limited, and engagement annotations are subjective. We present PriorNet, a prior-guided framework that injects task-relevant priors at three stages of the pipeline: preprocessing, model adaptation, and objective design. PriorNet converts face-detection failures into explicit zero-frame placeholders so that missing-face events remain represented in the input sequence, adapts a frozen Self-supervised Video Facial Affect Perceiver (SVFAP) backbone through a Prior-guided Low-Rank Adaptation module (Prior-LoRA) for parameter-efficient specialization, and trains with a Dirichlet-evidential, uncertainty-weighted objective under hard-label supervision. We evaluate PriorNet on EngageNet, DAiSEE, DREAMS, and PAFE using each dataset’s native evaluation protocol. Across these benchmarks, PriorNet improves over the strongest listed prior reference within each dataset’s evaluation framing, while component ablations on EngageNet and DAiSEE indicate that the gains arise from complementary contributions of preprocessing, adaptation, and objective-level priors. These results support explicit prior injection as a useful design principle for face-video engagement estimation under the benchmark conditions studied in this work.
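
For readers unfamiliar with Dirichlet-evidential outputs, the sketch below shows the standard quantities derived from per-class evidence in the subjective-logic formulation: expected class probabilities and a single uncertainty mass. How PriorNet turns these into an uncertainty-weighted loss is not described in the abstract, so only the shared building block is shown; the evidence values are hypothetical.

```python
def dirichlet_summary(evidence):
    """Expected class probabilities and uncertainty mass from non-negative evidence."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]       # Dirichlet concentration parameters
    strength = sum(alpha)                      # total evidence plus prior
    probabilities = [a / strength for a in alpha]
    uncertainty = num_classes / strength       # 1.0 when there is no evidence at all
    return probabilities, uncertainty

# Hypothetical four-level engagement output for a clip with clear facial evidence.
confident_probs, confident_u = dirichlet_summary([0.0, 1.0, 14.0, 1.0])
# Hypothetical output for a clip made mostly of zero-frame placeholders.
vague_probs, vague_u = dirichlet_summary([0.0, 0.0, 0.0, 0.0])
```

A clip with little usable evidence yields near-uniform probabilities and uncertainty close to 1, which is the signal an uncertainty-weighted objective can use to reduce the influence of unreliable samples.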

[43] Diffusion Masked Pretraining for Dynamic Point Cloud cs.CVPDF

Zhuoyue Zhang, Jihua Zhu, Chaowei Fang, Jian Liu, Ajmal Saeed Mian

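TL;DR: This paper proposes DiMP (Diffusion Masked Pretraining), a self-supervised pretraining framework for dynamic point clouds. It introduces diffusion modeling to address two key limitations of existing masked reconstruction methods: spatio-temporal positional leakage, and the loss of distributional structure caused by deterministic motion supervision. DiMP applies forward diffusion noise to masked spatio-temporal tube centers and predicts the clean centers, and it recasts inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations.

Details

Motivation: Existing dynamic point cloud pretraining methods are mainly based on masked reconstruction objectives and have two key limitations. First, the decoder injects ground-truth tube centers as positional embeddings, which leaks spatio-temporal position information. Second, supervising inter-frame motion with deterministic proxy targets discards distributional structure, collapsing multimodal trajectory uncertainty into conditional means.

Result: Extensive experiments show that DiMP consistently improves backbone accuracy on downstream tasks, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference.

Insight: The main novelty is the unified use of diffusion modeling for both positional inference and motion learning: 1) diffusion denoising of masked tube centers removes positional leakage while keeping visible coordinates as temporal anchors; 2) point-wise inter-frame displacement supervision is reformulated as a DDPM noise-prediction objective conditioned on decoded representations, so the encoder learns the full conditional distribution rather than a single deterministic estimate. This gives a probabilistic modeling perspective on dynamic point cloud representation learning.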

Abstract: Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference. Code is available at https://github.com/InitalZ/DiMP.git.
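
For reference, the standard DDPM forward process and noise-prediction loss that this abstract builds on can be written as follows. The symbols are assumptions made for illustration: $x_0$ stands for the clean target (a masked tube center or an inter-frame displacement), $c$ for the conditioning context (visible context or decoded representations), and $\bar{\alpha}_t$ for the cumulative noise schedule; the paper's own notation and weighting may differ.

```latex
% Forward (noising) process in closed form
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

% Conditional noise-prediction objective
\mathcal{L}_{\text{noise}} =
\mathbb{E}_{x_0,\, t,\, \epsilon}
\left[\, \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert_2^2 \,\right]
```

Because the network regresses the sampled noise rather than the target itself, different noise draws for the same context correspond to different plausible targets, which is how this objective avoids collapsing to the conditional mean.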

[44] The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection cs.CVPDF

Yazhe Wan, Changjae Oh

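TL;DR: This paper proposes Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning method that improves vision-language models (VLMs) for cooperative model-based open-vocabulary object detection. It uses a pre-trained closed-set object detector to build a region-aware pseudo-labeled dataset and fine-tunes the VLM's visual backbone in a decoupled manner, enhancing local feature alignment while preserving global semantic knowledge through weight interpolation. DAT is a plug-and-play module with no inference overhead that fine-tunes fewer than 0.8M parameters.

Details

Motivation: Open-vocabulary object detection aims to recognize objects from an open set of categories, typically by combining an object detector with a VLM for zero-shot recognition. However, VLMs pre-trained on full images often struggle to capture local object details, which limits their effectiveness in region-level detection.

Result: Experiments on COCO and LVIS show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.

Insight: The novelty is a self-supervised, region-aware fine-tuning method (DAT) that, through decoupled training and weight interpolation, strengthens the VLM's perception of local object features while retaining its global semantic knowledge, all with very few fine-tuned parameters. It is an efficient, plug-and-play adaptation strategy.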

Abstract: Open-vocabulary object detection aims to recognize objects from an open set of categories, which leverages vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the visual backbone of the VLM in a decoupled manner, which enhances local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that requires no inference overhead and fine-tunes fewer than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.
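
Weight interpolation between a pre-trained and a fine-tuned backbone can be sketched as a per-parameter linear blend. The linear form and the single mixing coefficient `mix` are assumptions; the abstract does not state the interpolation rule DAT uses. Parameters are represented as plain lists keyed by hypothetical layer names to keep the example dependency-free.

```python
def interpolate_weights(pretrained, finetuned, mix):
    """Blend two parameter dictionaries: mix=0 keeps pretrained, mix=1 keeps finetuned."""
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must expose the same parameter names")
    return {
        name: [(1.0 - mix) * p + mix * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }

# Hypothetical two-parameter layer before and after region-level fine-tuning.
pretrained = {"backbone.block0": [0.0, 2.0]}
finetuned = {"backbone.block0": [1.0, 4.0]}
blended = interpolate_weights(pretrained, finetuned, mix=0.5)
```

Since the blend yields a single set of weights, the deployed model has the same size and speed as the original, consistent with the abstract's claim of no inference overhead.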

[45] Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence cs.CV | cs.LGPDF

Zhiyuan Li, Rongzhen Zhao, Wenyan Yang, Wenshuai Zhao, Pekka Marttinen

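TL;DR: This paper proposes Grounded Correspondence, a new framework for modeling temporal consistency in video object-centric learning. It drops the conventional learned dynamics prediction module and instead uses instance-discriminative features from a frozen self-supervised vision backbone, keeping object representations (slots) consistent across frames through deterministic bipartite matching (the Hungarian algorithm). It achieves competitive performance with no learnable parameters for temporal modeling.

Details

Motivation: Conventional methods maintain temporal consistency with learned dynamics modules that predict future object representations (slots), but the authors argue that these predictors are essentially expensive approximations of a discrete correspondence problem. The aim is to solve correspondence directly with the instance-discriminative features that modern self-supervised backbones already provide, simplifying the model and removing the need for learned temporal prediction.

Result: On the MOVi-D, MOVi-E, and YouTube-VIS benchmarks, the method achieves competitive performance without any learnable parameters for temporal modeling.

Insight: The main novelty is recasting temporal consistency from a learned prediction paradigm into a feature-based deterministic matching problem: frozen backbone features are used for slot initialization and Hungarian matching, which markedly simplifies the architecture and reduces computation. This offers a more efficient alternative with fewer parameters for video object-centric learning.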

Abstract: The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
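
Frame-to-frame slot matching can be illustrated as an assignment problem over pairwise similarities. For brevity this sketch enumerates all permutations instead of running the Hungarian algorithm named in the abstract; it returns the same optimal assignment but is only practical for a handful of slots. Cosine similarity as the matching score and the slot vectors are assumptions.

```python
import math
from itertools import permutations

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_slots(previous_slots, current_slots):
    """Return assignment[i] = index of the current slot matched to previous slot i.

    Brute-force search over permutations (a stand-in for Hungarian matching).
    """
    count = len(previous_slots)
    best_assignment, best_score = None, -math.inf
    for candidate in permutations(range(count)):
        score = sum(cosine_similarity(previous_slots[i], current_slots[candidate[i]])
                    for i in range(count))
        if score > best_score:
            best_assignment, best_score = list(candidate), score
    return best_assignment

# Hypothetical slot features: the same three objects appear in a shuffled order.
previous_slots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
current_slots = [[0.1, 0.0, 0.9], [0.9, 0.1, 0.0], [0.0, 1.0, 0.1]]
assignment = match_slots(previous_slots, current_slots)
```

Reordering the current slots by this assignment keeps each object in the same slot index across frames, which is the temporal consistency that a learned dynamics module would otherwise have to provide.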

[46] AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV | cs.AIPDF

Tencent HY Team

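TL;DR: AniMatrix is a video generation model for anime that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition. First, a Production Knowledge System encodes anime as controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. Second, a style-motion-deformation curriculum moves the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse.

Details

Motivation: Existing video generation models take physical realism as their prior, but anime deliberately violates physics (smears, impact frames, chibi shifts), and its thousands of coexisting artistic conventions yield no single "physics of anime" for a model to absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance.

Result: On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4%) and Artistic Motion (+0.55, +16.9%).

Insight: Contributions include: 1) redefining anime as a structured taxonomy of controllable production variables, enabling direct artistic control; 2) a dual-channel conditioning mechanism (a trainable tag encoder and a frozen T5 encoder) with dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement), which keeps categorical directives from being diluted by open-ended text; 3) a style-motion-deformation curriculum that transitions gradually to anime expressiveness; 4) deformation-aware preference optimization that uses a domain-specific reward model to distinguish artistic intent from model collapse.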

Abstract: Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single “physics of anime” a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We will publicly release the AniMatrix model weights and inference code.

[47] Unified Multimodal Visual Tracking with Dual Mixture-of-Experts cs.CVPDF

Lingyi Hong, Jinglun Li, Xinyu Zhou, Kaixun Jiang, Pinxue Guo

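TL;DR: This paper presents OneTrackerV2, a unified, end-to-end trainable framework for multimodal visual object tracking. It embeds multimodal information into a unified space with Meta Merger and introduces a Dual Mixture-of-Experts (DMoE) module that handles spatio-temporal relations and modality knowledge separately, so that one shared model trained once achieves high performance across tracking tasks with different input modalities (e.g., RGB and RGB+X).

Details

Motivation: Existing methods usually train a separate model for each input modality or rely on pretrained models for adaptation, which limits efficiency, scalability, and usability. The paper aims to resolve the fragmentation of models and training in multimodal visual tracking.

Result: OneTrackerV2 achieves state-of-the-art performance across 5 RGB and RGB+X tracking tasks and 12 benchmarks while maintaining high inference efficiency, and it is notably robust in modality-missing scenarios.

Insight: The core novelty is a unified end-to-end training framework together with Meta Merger and the Dual Mixture-of-Experts (DMoE) module. DMoE decouples spatio-temporal relation modeling (T-MoE) from multimodal knowledge embedding (M-MoE), which reduces feature conflicts and makes modality fusion more flexible.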

Abstract: Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.

[48] Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks cs.CV | cs.AIPDF

JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, JungMin Yun, Byeonggeuk Lim

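TL;DR: This paper addresses fundamental learning failures in unlearning benchmarks for large vision-language models (LVLMs) and proposes the ReMem benchmark. By diagnosing under-memorization and the multi-hop curse as core problems, it builds a reliable multi-hop, multi-image memorization benchmark and introduces an Exposure metric that quantifies how deeply information is erased from the model's internal probability distribution, providing a rigorous and reliable framework for evaluating LVLM learning and unlearning behavior.

Details

Motivation: Current LVLM unlearning benchmarks use fictitious identities to mitigate privacy risks, but they overlook the fundamental problem that models fail to memorize the target information effectively in the first place (a stage 1 failure), which makes subsequent unlearning evaluations unreliable.

Result: Extensive experiments show that ReMem provides a rigorous and trustworthy framework for diagnosing learning and unlearning behaviors in LVLMs.

Insight: The novelty lies in diagnosing under-memorization and the multi-hop curse as root flaws of existing benchmarks and in proposing ReMem, which ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts, along with a new Exposure metric that quantifies the depth of information erasure.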

Abstract: While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage 1 failure: models fail to effectively memorize target information initially, rendering subsequent unlearning evaluations unreliable. Diagnosing under-memorization and the multi-hop curse as root causes, we introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark. ReMem ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts. Additionally, we propose a novel Exposure metric to quantify the depth of information erasure from the model’s internal probability distribution. Extensive experiments demonstrate that ReMem provides a rigorous and trustworthy framework for diagnosing both learning and unlearning behaviors in LVLMs.

[49] Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation cs.CVPDF

Quanxing Xu, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang

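TL;DR: This paper proposes a CoVQD-guided retrieval-augmented generation framework (CgRAG) that fuses chain-of-thought reasoning with visual question decomposition to guide multimodal large language models toward more effective retrieval of external knowledge in visual question answering, improving performance in complex cross-domain scenarios.

Details

Motivation: The paper asks how multimodal large language models in open-domain visual question answering can integrate structured reasoning with knowledge acquisition more effectively, so as to make better use of external knowledge and produce more reliable answers.

Result: Extensive experiments on the E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the method, which improves generalization and reliability in complex cross-domain VQA scenarios.

Insight: The novelty is combining chain-of-thought reasoning with visual question decomposition into the CoVQD logical prompting strategy and using it to guide retrieval, so that multimodal large language models obtain more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance.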

Abstract: With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.

[50] Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration cs.CV | cs.LG | cs.MMPDF

Xun Jiang, Yufan Gu, Disen Hu, Yuqing Hou, Yazhou Yao

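TL;DR: This paper proposes a unified framework, Conformal Predictive Self-Calibration (CPSC), for low-quality data in multimodal learning, covering both modality imbalance and noisy corruption. The framework uses conformal prediction to give the model self-guided online calibration, with two core modules, Representation Self-Calibration and Gradient Self-Calibration, that strengthen feature robustness and steer optimization. Experiments on six benchmark datasets show that CPSC outperforms existing methods under a range of imbalanced and noisy settings.

Details

Motivation: Multimodal learning often faces low-quality data, mainly in the form of modality imbalance and noisy corruption. Prior work mostly treats these issues in isolation, but the authors argue that both stem from predictive uncertainty about the reliability of individual modalities and instances during learning, which calls for a unified framework that addresses them together.

Result: Extensive experiments on six benchmark datasets (e.g., IEMOCAP, CMU-MOSI) show that CPSC consistently outperforms existing state-of-the-art methods under both modality-imbalanced and noisy settings.

Insight: The contributions include: 1) a unified CPSC framework that uses conformal prediction for self-guided online calibration; 2) a Representation Self-Calibration module that decomposes unimodal features and selectively fuses the most robust components; 3) a Gradient Self-Calibration module that recalibrates the backpropagated gradient flow based on instance-wise reliability scores; 4) a self-update strategy for the conformal predictor so that the whole system co-evolves during training. Viewed objectively, the method tightly integrates uncertainty estimation with the training process and offers a systematic approach to multimodal data quality problems.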

Abstract: Multimodal learning often grapples with the challenge of low-quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. The core of our proposed CPSC lies in a novel self-calibrating training loop that seamlessly integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components, and selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience. (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we also devise a self-update strategy for the conformal predictor to ensure the entire system co-evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state-of-the-art methods. Our code is available at https://github.com/XunCHN/CPSC.
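
As background for the conformal prediction that CPSC builds on, the following is a minimal sketch of standard split conformal calibration for classification. It is not the authors' implementation: the function names, the nonconformity score (one minus the probability of the true class), and the toy numbers are all assumptions chosen for illustration. The size of the resulting prediction set is one plausible proxy for how reliable a modality or instance is.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)), capped at n.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, threshold):
    """Keep every class whose nonconformity score 1 - p is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration scores: 1 - probability assigned to the true class.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.2)

# A peaked prediction yields a small set; a flat one yields a larger set.
confident_set = prediction_set([0.90, 0.07, 0.03], threshold)
uncertain_set = prediction_set([0.45, 0.40, 0.15], threshold)
```

In this toy setting a single-class set suggests a trustworthy instance, while a multi-class set flags one whose contribution a training loop might down-weight.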

[51] Conditions for well-posed color recovery in scattering media cs.CVPDF

Grigory Solomatov, Derya Akkaynak

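TL;DR: This paper studies the inverse problem of recovering scene color in scattering media and presents sufficient conditions under which the problem becomes well-posed. By analyzing the sources of ill-posedness (projection of spectral signals onto pixel intensities, and unknown medium parameters), the authors show that sensor improvements alone cannot resolve medium-induced distortions, whereas recovery patterns that occur naturally in images (cross-pixel relationships), combined with an ideal hyperspectral camera, guarantee a unique solution.

Details

Motivation: The paper addresses the fundamental inverse problem of recovering scene color from images captured in scattering media. The problem is intrinsically ill-posed because multiple solutions can explain the same observation, and prediction error cannot be controlled without understanding the space of candidate solutions.

Result: The abstract reports no quantitative experimental results or benchmarks, but it proves theoretically that the solution is unique under an ideal hyperspectral camera and recovery-pattern constraints.

Insight: The novelty lies in presenting, for the first time, sufficient conditions under which color recovery in scattering media becomes well-posed, and in pointing to a new class of algorithms grounded in recovery patterns (cross-pixel relationships) and first principles, which lays a foundation for quantitative image analysis in scattering environments.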

Abstract: Recovering scene color from images captured in scattering media is a fundamental inverse problem in optical imaging. Yet the problem is intrinsically ill-posed as multiple solutions can explain the same observation, and prediction error cannot be controlled without understanding the space of candidate solutions. Here, we present sufficient conditions under which color recovery in a scattering medium becomes well-posed. Observing that ill-posedness stems from (i) projection of spectral signals onto pixel intensities, and (ii) unknown medium parameters, we demonstrate that sensor improvements alone cannot resolve medium-induced distortions without additional constraints. We identify recovery patterns, cross-pixel relationships that naturally occur in images, and prove, for an ideal hyperspectral camera, that they restrict the solution to a unique candidate. This opens the door to a new class of vision algorithms grounded in first principles, enabling quantitative analysis of images in scattering environments.

[52] Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback cs.CVPDF

Edoardo Bianchi, Antonio Liotta

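TL;DR: This paper presents three parameter-efficient methods for multi-view proficiency estimation, which assesses how well an action is performed rather than which action it is. They are SkillFormer for selective multi-view fusion, PATS for improved temporal sampling, and ProfVLM, which reformulates proficiency estimation as conditional language generation so that it outputs both a label and expert-style feedback.

Details

Motivation: The paper addresses the challenge of accurately estimating action proficiency from multi-view video in settings such as coaching, rehabilitation, and talent identification. The task is hard because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution.

Result: The methods reach state-of-the-art accuracy on the Ego-Exo4D benchmark with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines.

Insight: The novelty is the move from closed-set classification to interpretable feedback generation, yielding efficient multi-view systems through selective fusion, proficiency-aware sampling, and actionable generative feedback.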

Abstract: Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.

[53] Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation cs.CVPDF

Bin Wu, Mengqi Huang, Shaojin Wu, Weinan Jia, Yuxin Wang

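TL;DR: This paper proposes Stream-R1, a Reliability-Perplexity Aware Reward Distillation framework for streaming video generation. Through a single shared reward-guided mechanism, it adaptively reweights the distillation objective at both the rollout level and the spatiotemporal-element level, addressing the limitation that existing distribution matching distillation methods treat all supervision signals equally.

Details

Motivation: Existing distillation-based acceleration methods (e.g., DMD) train the student to match the teacher's output indiscriminately, treating every rollout, frame, and pixel as equally reliable supervision. This limits distillation quality because it overlooks two complementary axes of variance in DMD supervision: differences in reliability across student rollouts (Inter-Reliability), and differences in perplexity across spatial regions and temporal frames that contribute unequally to quality improvement (Intra-Perplexity).

Result: On standard streaming video generation benchmarks, Stream-R1 consistently improves over distillation baselines on visual quality, motion quality, and text alignment, without architectural modification or additional inference cost.

Insight: The novelty is a unified reward-guided mechanism that adaptively reweights the distillation loss at two levels. At the Inter-Reliability level, each rollout's loss is scaled by an exponential of a pretrained video reward score. At the Intra-Perplexity level, back-propagation through the same reward model extracts per-pixel gradient saliency, which is factored into spatial and temporal weights that concentrate optimization pressure on the regions and frames with the largest expected gain. This keeps any single quality axis from dominating the objective and makes finer-grained use of the supervision signal.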

Abstract: Distillation-based acceleration has become foundational for making autoregressive streaming video diffusion models practical, with distribution matching distillation (DMD) as the de facto choice. Existing methods, however, train the student to match the teacher’s output indiscriminately, treating every rollout, frame, and pixel as equally reliable supervision. We argue that this caps distilled quality, since it overlooks two complementary axes of variance in DMD supervision: Inter-Reliability across student rollouts whose supervision varies in reliability, and Intra-Perplexity across spatial regions and temporal frames that contribute unequally to where quality can still be improved. The objective thus conflates two questions under a uniform weight: whether to learn from each rollout, and where to concentrate optimization within it. To address this, we propose Stream-R1, a Reliability-Perplexity Aware Reward Distillation framework that adaptively reweights the distillation objective at both rollout and spatiotemporal-element levels through a single shared reward-guided mechanism. At the Inter-Reliability level, Stream-R1 rescales each rollout’s loss by an exponential of a pretrained video reward score, so that rollouts with reliable supervision dominate optimization. At the Intra-Perplexity level, it back-propagates the same reward model to extract per-pixel gradient saliency, which is factored into spatial and temporal weights that concentrate optimization pressure on regions and frames where refinement yields the largest expected gain. An adaptive balancing mechanism prevents any single quality axis from dominating across visual quality, motion quality, and text alignment. Stream-R1 attains consistent improvements on all three dimensions over distillation baselines on standard streaming video generation benchmarks, without architectural modification or additional inference cost.
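
The rollout-level (Inter-Reliability) reweighting described above can be sketched as follows. This is an illustration only, not the Stream-R1 code: the abstract states only that each rollout's loss is rescaled by an exponential of a reward score, so the `temperature` parameter, the mean-one normalization, and the toy reward values here are assumptions.

```python
import math

def rollout_weights(rewards, temperature=1.0):
    """Weight each rollout by exp(reward / temperature), normalized to mean 1."""
    top = max(rewards)  # subtract the max before exponentiating, for numerical stability
    raw = [math.exp((r - top) / temperature) for r in rewards]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]

def reweighted_loss(rollout_losses, rewards, temperature=1.0):
    """Average per-rollout distillation losses under the reward-derived weights."""
    weights = rollout_weights(rewards, temperature)
    weighted = [w * loss for w, loss in zip(weights, rollout_losses)]
    return sum(weighted) / len(weighted)

# Hypothetical reward scores for three student rollouts.
rewards = [0.2, 0.9, 0.5]
weights = rollout_weights(rewards)
```

Normalizing the weights to mean one keeps the overall loss scale comparable to the unweighted objective, so rollouts with higher reward dominate the gradient without changing the effective learning rate.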

[54] StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning cs.CVPDF

Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker

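TL;DR: This paper proposes a new training strategy, Auxiliary Regression Loss (ARL), to strengthen the numerical reasoning of vision-language models (VLMs) in robotic tasks, particularly object detection and object-state localization. Building on this strategy, the authors develop StateVLM and introduce a new open-source benchmark, OSAR, for evaluating object-state affordance reasoning. Experiments show that ARL improves model performance on multiple benchmarks.

Details

Motivation: When applied to robotic tasks, existing VLMs inherit the limitation of large language models in numerical reasoning (e.g., object detection and object-state localization) and struggle with precise regression tasks.

Result: On the adapted benchmarks RefCOCO, RefCOCO+, and RefCOCOg, ARL improves model performance by 1.6% on average. On the new OSAR benchmark, StateVLM with ARL performs 5.2% better on average than models without ARL and produces more consistent outputs.

Insight: The novelty is treating numerical reasoning as a regression task within VLM fine-tuning, training with an Auxiliary Regression Loss (ARL) computed from box decoder outputs while keeping standard sequence prediction at inference. In addition, the dedicated OSAR benchmark is built to advance research on robotic affordance reasoning.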

Abstract: Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
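
One way to picture an auxiliary regression loss is as an extra term added to the usual sequence loss during fine-tuning only. The sketch below assumes an L1 distance over normalized box coordinates and a fixed weight `arl_weight`; the abstract specifies neither, so both are hypothetical choices rather than the StateVLM formulation.

```python
def l1_box_loss(pred_box, target_box):
    """Mean absolute error over the four normalized box coordinates."""
    diffs = [abs(p - t) for p, t in zip(pred_box, target_box)]
    return sum(diffs) / len(diffs)

def training_loss(seq_loss, pred_box, target_box, arl_weight=0.5):
    """Sequence-prediction loss plus a weighted auxiliary box-regression term.

    At inference only the sequence head is used, so this sum applies to
    fine-tuning alone.
    """
    return seq_loss + arl_weight * l1_box_loss(pred_box, target_box)

# Hypothetical values: a token-level loss of 1.0 and a box decoded too far left and up.
loss_with_error = training_loss(1.0, [0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 1.0, 1.0])
loss_perfect_box = training_loss(1.0, [0.5, 0.5, 1.0, 1.0], [0.5, 0.5, 1.0, 1.0])
```

The regression term gives a gradient proportional to how far the box is off, a signal that token-level cross-entropy over digit tokens does not provide directly.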

[55] A Benchmark for Interactive World Models with a Unified Action Generation Framework cs.CV | cs.AIPDF

Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang

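TL;DR: This paper proposes iWorld-Bench, a comprehensive benchmark for evaluating the physical interaction capabilities of interactive world models. It includes a dataset of 330k video clips and a unified Action Generation Framework for evaluating models across six task types that cover visual generation, trajectory following, and memory.

Details

Motivation: Current research lacks large-scale datasets and unified benchmarks for evaluating the physical interaction capabilities of interactive world models, which hinders the development of the adaptively learning and interacting agents needed for Artificial General Intelligence (AGI).

Result: Fourteen representative world models are evaluated on the constructed dataset and tasks, identifying their key limitations and providing insights for future research. The leaderboard is publicly available.

Insight: The novelty is iWorld-Bench, a benchmark dedicated to interaction abilities (such as distance perception and memory), together with a unified Action Generation Framework that standardizes evaluation across models with different interaction modalities, helping to advance research on interactive world models systematically.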

Abstract: Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory. We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld-Bench model leaderboard is publicly available at iWorld-Bench.com.

[56] UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CVPDF

Yifan Wang, Yun Fu

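TL;DR: This paper proposes UnAC, a multimodal prompting method for improving large multimodal models on complex multimodal reasoning tasks. It uses adaptive visual prompting to focus on key regions, an image-abstraction prompt to extract key information, and a gradual self-checking scheme to verify reasoning steps, targeting the unreliability of existing models in multi-step reasoning over visual evidence.

Details

Motivation: Although current large multimodal models have stronger visual perception, they remain unreliable on complex problems that require multi-step reasoning over visual evidence. The paper aims to improve their reasoning through structured prompting.

Result: Extensive experiments on three public benchmarks (MathVista, MM-Vet, and MMMU) show that the method improves the reasoning performance of models such as GPT-4o, Gemini 1.5, and GPT-4V.

Insight: The novelty is combining adaptive visual prompting to attend to salient regions, image abstraction to extract key information, and stepwise self-checking to verify the reasoning process, which offers a modular prompting framework for strengthening complex reasoning in multimodal models.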

Abstract: Although recent large multimodal models (LMMs) have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image-abstraction prompt to effectively extract key information from images. In addition, we introduce a gradual self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments on three public benchmarks (MathVista, MM-Vet, and MMMU) demonstrate the effectiveness of the method.

[57] Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning cs.CV | cs.AI | cs.LGPDF

Zakarya Elmimouni, Fares Fourati, Mohamed-Slim Alouini

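TL;DR: This paper proposes a weakly supervised framework for detecting schools in aerial imagery that minimizes the need for human annotation and supports mapping at global scale. It uses a two-stage training pipeline: detectors are first pretrained on infrastructure masks and bounding boxes produced by an automatic labeling pipeline (which combines sparse location points with semantic segmentation), then fine-tuned on a small set of manually labeled images.

Details

Motivation: The paper addresses the difficulty of accurately detecting schools in the many regions where official records are outdated, incomplete, or unavailable, while overcoming the high cost and poor scalability of manual annotation.

Result: The approach achieves strong object detection performance in the low-data regime (using only 50 manually labeled images), significantly reducing reliance on costly annotation.

Insight: The novelty is a two-stage pipeline that combines weakly supervised pretraining (on automatically generated pseudo-labels) with fine-tuning on a small number of samples, giving an efficient and scalable approach to infrastructure detection under low-data conditions. The automatic labeling pipeline, which turns existing sparse location information into training data, is also worth borrowing.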

Abstract: Accurate school detection is essential for supporting education initiatives, including infrastructure planning and expanding internet connectivity to underserved areas. However, many regions around the world face challenges due to outdated, incomplete, or unavailable official records. Manual mapping efforts, while valuable, are labor-intensive and lack scalability across large geographic areas. To address this, we propose a weakly supervised framework for school detection from aerial imagery that minimizes the need for human annotations while supporting global mapping efforts. Our method is specifically designed for low-data regimes, where manual annotations are extremely scarce. We introduce an automatic labeling pipeline that leverages sparse location points and semantic segmentation to generate infrastructure masks, from which we derive bounding boxes. In a first training stage, we train our detectors on these automatically labeled images to learn a representation of what schools look like; in a second stage, we fine-tune the resulting models on a small, clean set of manually labeled images. This two-stage training pipeline enables strong, large-scale detection of school infrastructure in low-data settings with minimal supervision. Our results demonstrate strong object detection performance, particularly in the low-data regime, where the models achieve promising results using only 50 manually labeled images, significantly reducing the need for costly annotations. This framework supports education and connectivity initiatives worldwide by providing an efficient and extensible approach to mapping schools from space. All models, training code, and auto-labeled data will be publicly released to foster future research and real-world impact.

[58] 3D Human Face Reconstruction with 3DMM face model from RGB image cs.CV | cs.GRPDF

Zhangnan Jiang, Zichen Yang

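TL;DR: This paper presents a pipeline for reconstructing a 3D face model from a single RGB image. The pipeline combines face detection, landmark detection, 3DMM parameter regression, and soft rendering, and it targets the difficulty existing approaches have in generating photo-realistic data with details such as wrinkles.

Details

Motivation: Although convolutional neural networks are powerful for image processing, reconstructing detailed face shapes requires large amounts of labeled data, and existing coarse morphable face models struggle to synthesize photo-realistic data with fine detail. The paper therefore aims to develop an efficient, detail-rich method for single-image 3D face reconstruction.

Result: The abstract reports no quantitative results or benchmarks. The pipeline builds on existing work (it references the Deep3DFaceRecon code) and may achieve reconstruction quality comparable to existing methods, with an emphasis on detail.

Insight: The novelty is an end-to-end pipeline that combines 3DMM parameter regression with soft rendering to recover more realistic facial detail from a single RGB image. Viewed objectively, by using existing deep learning tools for the data-synthesis and rendering steps, the method may improve the realism and efficiency of the reconstructed model.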

Abstract: As convolutional neural networks (CNNs) demonstrate powerful problem-solving ability in image processing, efforts have been made to reconstruct detailed face shapes from 2D face images or videos. However, to make full use of CNNs, a large amount of labeled data is required to train the network. Coarse morphable face models have been used to synthesize labeled data, but it is hard for them to generate photo-realistic data with details such as wrinkles. In this project, we present a pipeline that reconstructs a 3D human face model from a single RGB image. The pipeline includes face detection, landmark detection, regression of 3DMM model parameters, and soft rendering. Mentor: Zhipeng Fan (Email: zf606@nyu.edu). Code Repository: https://github.com/SeVEnMY/3d-face-reconstruction. Code Reference: https://github.com/sicxu/Deep3DFaceRecon_pytorch

[59] RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense Prediction cs.CVPDF

Renjie He

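TL;DR: This paper proposes RD-ViT, a Recurrent-Depth Vision Transformer for semantic segmentation. It replaces the conventional deep stack with a single shared transformer block looped T times, reducing the parameter count and the dependence on large-scale training data. The architecture supports 2D and 3D inputs and integrates LTI-stable state injection, Adaptive Computation Time (ACT), depth-wise LoRA adaptation, and optional Mixture-of-Experts (MoE) feed-forward networks. Experiments on the ACDC cardiac MRI segmentation benchmark show that RD-ViT outperforms standard ViT in 2D with both limited and full data, and in 3D comes close to standard ViT performance with fewer parameters.

Details

Motivation: Standard Vision Transformers (ViTs) achieve state-of-the-art accuracy in segmentation, but each layer has its own parameters, which requires large amounts of training data. The paper aims to reduce the dependence on large-scale data through a recurrent shared-parameter architecture and to extend that architecture to dense prediction tasks (e.g., 2D/3D segmentation).

Result: The model is evaluated on the ACDC cardiac MRI segmentation benchmark. At the 2D slice level, the Dice score is 0.774 with 10% of the training data (standard ViT: 0.762) and 0.882 with full data (standard ViT: 0.872). At the 3D volumetric level, RD-ViT with MoE reaches Dice 0.812 with 3.0M parameters, which is 99.4% of standard ViT performance (0.817) at 53% of its parameter count.

Insight: The main contributions are: 1) the first successful adaptation of the Recurrent-Depth Transformer (RDT) architecture to dense prediction tasks; 2) a substantial reduction in parameter count and data dependence through parameter sharing and looping; 3) LTI-stable state injection for guaranteed convergence, ACT for spatial compute allocation, depth-wise LoRA for efficient adaptation, and MoE for category-specific specialization; 4) experiments revealing that MoE experts spontaneously specialize in different cardiac structures, that ACT learns to allocate more compute at boundaries, and that the mean ponder time decreases during training, which demonstrates learned computational efficiency; 5) depth extrapolation, which allows inference with more loops than were used in training without degradation.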

Abstract: Vision Transformers (ViTs) achieve state-of-the-art segmentation accuracy but require large training datasets because each layer has unique parameters that must be learned independently. We present RD-ViT, a Recurrent-Depth Vision Transformer that adapts the Recurrent-Depth Transformer (RDT) architecture to dense prediction tasks, supporting both 2D and 3D inputs. RD-ViT replaces the deep stack of unique transformer blocks with a single shared block looped T times, augmented with LTI-stable state injection for guaranteed convergence, Adaptive Computation Time (ACT) for spatial compute allocation, depth-wise LoRA adaptation, and optional Mixture-of-Experts (MoE) feed-forward networks for category-specific specialization. We evaluate on the ACDC cardiac MRI segmentation benchmark in both 2D slice-level and 3D volumetric settings with exclusively real experiments executed in Google Colab. In 2D, RD-ViT outperforms standard ViT at 10% training data (Dice 0.774 vs 0.762) and at full data (0.882 vs 0.872). In 3D, RD-ViT with MoE achieves Dice 0.812 with 3.0M parameters, reaching 99.4% of standard ViT performance (0.817) at 53% of the parameter count. MoE expert utilization analysis reveals that different experts spontaneously specialize for different cardiac structures (RV, MYO, LV) without explicit routing supervision. ACT halting maps show higher compute allocation at cardiac boundaries, and the mean ponder time decreases from 2.6 to 1.4 iterations during training, demonstrating learned computational efficiency. Depth extrapolation enables inference with more loops than training without degradation. All code, notebooks, and results are publicly released.
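
The parameter saving from looping one shared block can be made concrete with a rough count. The sketch below is a toy illustration under stated assumptions, not the RD-ViT code: the per-block formula covers only attention, a two-layer MLP, and two LayerNorms, and `toy_block` is a scalar contraction standing in for a transformer block. The 53% figure in the abstract also includes components this count ignores, so the numbers here are not meant to reproduce it.

```python
def block_param_count(dim, mlp_ratio=4):
    """Weights and biases of one transformer block (attention + MLP + 2 LayerNorms)."""
    attention = 4 * (dim * dim + dim)  # Q, K, V, and output projections
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    norms = 2 * 2 * dim  # scale and shift for each of two LayerNorms
    return attention + mlp + norms

def stacked_params(dim, depth):
    """A standard stack: every layer owns its own block parameters."""
    return depth * block_param_count(dim)

def recurrent_params(dim):
    """A recurrent-depth model: one block, reused at every loop iteration."""
    return block_param_count(dim)

def run_recurrent(state, shared_block, loops):
    """Apply the same block `loops` times; the loop count is free at inference."""
    for _ in range(loops):
        state = shared_block(state)
    return state

def toy_block(x):
    """Hypothetical stand-in for a block: a contraction with fixed point 2.0."""
    return 0.5 * x + 1.0

after_three_loops = run_recurrent(0.0, toy_block, loops=3)
after_many_loops = run_recurrent(0.0, toy_block, loops=30)
```

Because the loop count is not tied to the parameter count, the same weights can be run for more iterations at inference than in training, which is the setting the abstract calls depth extrapolation.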

[60] Large Language Models are Universal Reasoners for Visual Generation cs.CVPDF

Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu

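TL;DR: This paper proposes UniReasoner, a framework that uses a large language model (LLM) as a universal reasoner to convert its strength in visual understanding into direct generation guidance, bridging the understanding-generation gap in text-to-image generation. The method first produces a coarse visual draft, then performs a self-critique, and finally conditions the diffusion model on the prompt, the draft, and the evaluation, improving compositional alignment and semantic faithfulness.

Details

Motivation: The paper addresses the understanding-generation gap: current LLM-based unified text-to-image systems often fail to align faithfully with complex prompts during synthesis, even though they are highly accurate at verifying whether an image satisfies a prompt.

Result: Experiments show that, under the same diffusion backbone, UniReasoner improves compositional alignment and semantic faithfulness while maintaining image quality, demonstrating a practical way to use LLM reasoning to close the understanding-generation gap.

Insight: The novelty is using the LLM as a universal reasoner. By generating a visual draft and a text-based self-critique, it gives the diffusion model a concrete scene-level anchor and actionable constraints, turning verification ability into generation guidance that corrects omissions, hallucinations, and relational errors.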

Abstract: Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural unification, these systems frequently fail to faithfully align complex prompts during synthesis, even though they remain highly accurate at verifying whether an image satisfies those same prompts. We formalize this as the understanding-generation gap and propose UniReasoner, a framework that leverages the LLM as a universal reasoner to convert its understanding strength into direct generation guidance. Given a prompt, the LLM first produces a coarse visual draft composed of discrete vision tokens. It then performs a self-critique by evaluating the draft for prompt consistency, producing a grounded textual evaluation that pinpoints what needs to be corrected. Finally, a diffusion model is conditioned jointly on the prompt, the visual draft, and the evaluation, ensuring that generation is guided by explicit corrective signals. Each signal addresses a limitation of the other: the draft provides a concrete, scene-level anchor that reduces under-specification in text-only conditioning, while the evaluation turns verification into grounded, actionable constraints that correct omissions, hallucinations, and relational errors. Experiments show that UniReasoner improves compositional alignment and semantic faithfulness under the same diffusion backbone while maintaining image quality, demonstrating a practical way to exploit LLM reasoning to close the understanding-generation gap.

[61] UniCorrn: Unified Correspondence Transformer Across 2D and 3D cs.CVPDF

Prajnan Goswami, Tianye Ding, Feng Liu, Huaizu Jiang

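TL;DR: UniCorrn is a unified correspondence transformer that, for the first time, handles 2D-2D, 2D-3D, and 3D-3D geometric matching with shared weights. Its core design includes a dual-stream decoder that maintains separate appearance and positional feature streams and uses Transformer attention to capture cross-modal feature similarity, enabling end-to-end learning across heterogeneous modalities.

Details

Motivation: Current visual correspondence methods use separate task-specific models for each modality combination (image-to-image, image-to-point cloud, point cloud-to-point cloud) even though the problem structure is similar. The paper aims to remove this fragmentation with a single model that handles all three geometric matching tasks.

Result: UniCorrn achieves competitive performance on 2D-2D matching and surpasses the prior state of the art in registration recall by 8% on 7Scenes (2D-3D) and by 10% on 3DLoMatch (3D-3D).

Insight: The novelty is the observation that Transformer attention naturally captures cross-modal feature similarity, together with a dual-stream decoder that supports flexible query-based correspondence estimation across heterogeneous modalities. Viewed objectively, shared weights and joint training simplify the multi-task pipeline and improve generalization and efficiency.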

Abstract: Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorrn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stackable layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder and decoder components, trained jointly on diverse data combining pseudo point clouds from depth maps with real 3D correspondence annotations. UniCorrn achieves competitive performance on 2D-2D matching and surpasses prior state-of-the-art by 8% on 7Scenes (2D-3D) and 10% on 3DLoMatch (3D-3D) in registration recall. Project website: https://neu-vi.github.io/UniCorrn

[62] Audio-Visual Intelligence in Large Foundation Models cs.CVPDF

You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng

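TL;DR: This paper is a survey of audio-visual intelligence (AVI) in large foundation models. It provides the first comprehensive review of AVI from the perspective of large foundation models, covering a broad range of tasks across understanding, generation, and interaction, and it establishes a unified taxonomy and methodological foundations.

Details

Motivation: Although AVI is developing rapidly as a central frontier of artificial intelligence, the existing literature is fragmented and inconsistent in its tasks, taxonomies, and evaluation, which impedes systematic comparison and knowledge integration. A comprehensive survey is therefore needed to organize and unify the field.

Result: As a survey, the paper proposes no specific model. It systematically reviews the methodological foundations of AVI (e.g., modality tokenization, cross-modal fusion, generative models, large-scale pretraining) and curates representative datasets, benchmarks, and evaluation metrics, offering a structured comparison and a reference for future research.

Insight: The main novelty is the first unified taxonomy of AVI built from the perspective of large foundation models, with a systematic integration of methods, datasets, and evaluation practices that gives this fast-moving field a clear map and identifies open challenges for future research (e.g., synchronization, spatial reasoning, controllability).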

Abstract: Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, i.e., not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.

eess.IV [Back]

[63] Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms eess.IV | cs.CVPDF

Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu

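TL;DR: This paper systematically surveys efficient methods for using video generation models as world simulators, proposes a new taxonomy along three dimensions (modeling paradigms, network architectures, and inference algorithms), and discusses their potential in interactive applications such as autonomous driving and embodied AI, emphasizing that efficiency is a key prerequisite for real-time, general-purpose world simulators.

Details

Motivation: The paper addresses the efficiency gap between the theoretical world-simulation capacity of video generation models and the heavy computational cost of spatiotemporal modeling.

Result: The paper reports no quantitative results or benchmarks. Through a systematic review and taxonomy, it argues that improving efficiency directly enables interactive applications such as autonomous driving, embodied AI, and game simulation.

Insight: The novelty is a taxonomy built along three dimensions (efficient modeling paradigms, efficient network architectures, and efficient inference algorithms) and the argument that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators, which gives future research a structured direction.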

Abstract: The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.

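[64] EMOVIS: Emotion-Optimized Image Processing eess.IV | cs.CV

Dor Barber, Rony Zatzarinni, Hava Matichin, Noam Levy

TL;DR: This paper proposes EMOVIS, a system that maps high-level emotional states (Happy, Calm, Angry, Sad) to low-level control parameters of the image signal processor (ISP), such as color saturation, local tone mapping, and sharpness. This enables emotion-optimized visual processing during real-time video capture to strengthen a scene’s emotional narrative.

Details

Motivation: Conventional ISPs prioritize scene fidelity and neglect emotional expression, whereas cinematography routinely adjusts visual attributes to reinforce the emotional narrative. The goal is to bring this capability into real-time camera pipelines.

Result: In blind A/B testing, viewers preferred the emotion-optimized rendering in 87% of trials when the target emotion matched the scene context, indicating that emotion-aligned ISP control improves the perceived suitability of expressive visual content.

Insight: The novelty lies in a systematic mapping between emotional states and ISP control parameters, supported by a calibration user study, together with a control framework that leaves the underlying processing stages unchanged and integrates emotion-driven adjustments into standard ISP hardware for real-time emotion-optimized processing.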

Abstract: In cinematography, visual attributes such as color grading, contrast, and brightness are manipulated to reinforce the emotional narrative of a scene. However, conventional Image Signal Processors (ISPs) prioritize scene fidelity, effectively neglecting this expressive dimension. To bring this cinematic capability to real-time camera pipelines during video capture, we introduce EMOVIS (EMotion-Optimized VISual processing). We establish a systematic mapping between a compact set of high-level emotional states (Happy, Calm, Angry, Sad) and low-level ISP controls - including color saturation, local tone mapping, and sharpness - supported by a calibration user study with statistically significant effects across parameters. We propose a control framework that integrates these emotion-driven adjustments into standard ISP hardware without altering the underlying processing stages. Validation via blind A/B testing shows that viewers prefer the emotion-optimized rendering in 87% of trials when the target emotion matches the scene context, indicating that emotion-aligned ISP control improves perceived suitability for expressive visual content.

cs.DB [Back]

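[65] FINER-SQL: Boosting Small Language Models for Text-to-SQL cs.DB | cs.AI | cs.CL | cs.HC | cs.MA

Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen

TL;DR: This paper proposes FINER-SQL, a reinforcement learning framework that uses fine-grained execution feedback to improve small language models on Text-to-SQL generation. It targets the high compute cost, long latency, and privacy concerns of large models, while addressing the weak reasoning of small models and the unstable training caused by sparse rewards in conventional reinforcement learning.

Details

Motivation: Large language models face high computational cost, long latency, and data privacy concerns on Text-to-SQL. Small language models are efficient to deploy and privacy-friendly, but they are weaker at reasoning and instruction following, and conventional reinforcement learning with sparse binary rewards provides little learning signal when the generated SQL is wrong, which destabilizes training.

Result: On the BIRD and Spider benchmarks with a 3B-parameter model, FINER-SQL reaches up to 67.73% and 85% execution accuracy, respectively, matching much larger LLMs while reducing inference latency to 5.57 s/sample.

Insight: The novelty lies in a reinforcement learning framework built on group relative policy optimization that replaces sparse supervision with dense, interpretable rewards (a memory reward and an atomic reward). These rewards give continuous feedback even for incorrect SQL, turning discrete correctness into a continuous learning signal and enabling stable, critic-free optimization. This offers a practical path toward high-performance, low-cost, privacy-preserving Text-to-SQL generation.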

Abstract: Large language models have driven major advances in Text-to-SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real-world applications. A natural alternative is to use small language models (SLMs), which enable efficient and private on-premise deployment. Yet, SLMs often struggle with weak reasoning and poor instruction following. Conventional reinforcement learning methods based on sparse binary rewards (0/1) provide little learning signal when the generated SQLs are incorrect, leading to unstable or collapsed training. To overcome these issues, we propose FINER-SQL, a scalable and reusable reinforcement learning framework that enhances SLMs through fine-grained execution feedback. Built on group relative policy optimization, FINER-SQL replaces sparse supervision with dense and interpretable rewards that offer continuous feedback even for incorrect SQLs. It introduces two key reward functions: a memory reward, which aligns reasoning with verified traces for semantic stability, and an atomic reward, which measures operation-level overlap to grant partial credit for structurally correct but incomplete SQLs. This approach transforms discrete correctness into continuous learning, enabling stable, critic-free optimization. Experiments on the BIRD and Spider benchmarks show that FINER-SQL achieves up to 67.73% and 85% execution accuracy with a 3B model – matching much larger LLMs while reducing inference latency to 5.57 s/sample. These results highlight a cost-efficient and privacy-preserving path toward high-performance Text-to-SQL generation. Our code is available at https://github.com/thanhdath/finer-sql.
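
The atomic reward in the abstract above grants partial credit through operation-level overlap. The sketch below is one possible reading, not the paper's implementation: the regex clause split in `sql_atoms` and the F1 scoring in `atomic_reward` are both assumptions made for illustration.

```python
import re

# Hypothetical clause keywords used to cut a query into "atoms" (an assumption).
_KEYWORDS = r"\b(select|from|where|group by|order by|having|limit)\b"


def sql_atoms(sql):
    """Split a SQL string into a set of normalized 'clause:item' atoms."""
    text = " ".join(sql.lower().replace(";", " ").split())
    # With a capture group, re.split yields [prefix, kw1, body1, kw2, body2, ...].
    parts = re.split(_KEYWORDS, text)
    atoms = set()
    for keyword, body in zip(parts[1::2], parts[2::2]):
        for item in body.split(","):
            item = item.strip()
            if item:
                atoms.add(f"{keyword}:{item}")
    return atoms


def atomic_reward(pred_sql, gold_sql):
    """F1 overlap between predicted and gold atoms, in [0, 1]."""
    pred, gold = sql_atoms(pred_sql), sql_atoms(gold_sql)
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that gets the `SELECT` and `FROM` parts right but omits the `WHERE` clause receives a reward strictly between 0 and 1, whereas a binary execution reward would give it 0.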

cs.AI [Back]

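[66] CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing cs.AI | cs.CL | cs.LG

Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Bingxiang He, Jeonghwan Kim

TL;DR: This paper introduces CreativityBench, a benchmark for evaluating creative tool use in large language models, focusing on whether a model can repurpose available objects by reasoning about their affordances and attributes. The authors build a large-scale knowledge base with 4K entities and 150K+ affordance annotations and generate 14K constraint-based tasks that require non-obvious yet physically plausible solutions.

Details

Motivation: To study the creative problem-solving ability of large language models, specifically creative tool use: how a model repurposes objects by reasoning about their affordances and attributes rather than relying on canonical usage. This capability remains underexplored for current models.

Result: Evaluations across 10 state-of-the-art LLMs, both closed and open-source, show that models can often select a plausible object but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Gains from model scaling saturate quickly, strong general reasoning does not reliably translate into creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains.

Insight: The novelty lies in CreativityBench, presented as a first benchmark dedicated to affordance-based creative reasoning, and in the large-scale affordance knowledge base that supports task generation. Viewed objectively, the work shows that creative tool use remains a major challenge for current models and provides a useful testbed for this missing dimension of agent capability, with potential implications for planning and reasoning modules in future agents.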

Abstract: Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative problem-solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations, explicitly linking objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints. Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object, but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Furthermore, improvements from model scaling quickly saturate, strong general reasoning does not reliably translate to creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for planning and reasoning modules in future agents.

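[67] ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms cs.AI | cs.CL | cs.HC | stat.AP | stat.CO

Alexandria K. Vail, Marcelo Cicconet, Katie Aafjes-van Doorn, Ryan Maroney, Marc Aafjes

TL;DR: ADAPTS is a framework for automated rating of depression and anxiety severity. It uses a mixture-of-agents LLM architecture that decomposes long-form clinical interviews into symptom-specific reasoning tasks, produces auditable justifications, and preserves temporal and speaker alignment. Generalization was evaluated on two independent datasets; on high-discrepancy interviews, the automated ratings were closer to expert benchmarks than the original human ratings, and an “extended” protocol that incorporates qualitative clinical conventions significantly stabilized the ratings.

Details

Motivation: To address a challenge specific to affective computing, namely modeling latent clinical constructs from unconstrained clinical interactions, with the aim of protocol-agnostic, objective, and scalable assessment of psychiatric severity.

Result: Evaluated on two independent datasets (N=204). On high-discrepancy interviews, the automated ratings (absolute error 22) approximated expert benchmarks more closely than the original human ratings (absolute error 26). With the “extended” protocol that incorporates clinical conventions, absolute agreement reached ICC(2,1)=0.877.

Insight: The novelty lies in a mixture-of-agents LLM architecture that decomposes long-form interviews into symptom-specific agent tasks, enabling protocol-agnostic assessment with auditable justifications. The architecture is readily extensible to multimodal inputs and provides a foundation for objective assessment in resource-limited settings.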

Abstract: Modeling latent clinical constructs from unconstrained clinical interactions is a unique challenge in affective computing. We present ADAPTS (Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms), a framework for automated rating of depression and anxiety severity using a mixture-of-agents LLM architecture. This approach decomposes long-form clinical interviews into symptom-specific reasoning tasks, producing auditable justifications while preserving temporal and speaker alignment. Generalization was evaluated across two independent datasets ($N=204$) with distinct interview structures. On high-discrepancy interviews, automated ratings approximated expert benchmarks ($\text{absolute error}=22$) more closely than original human ratings ($\text{absolute error}=26$). Implementing an “extended” protocol that incorporates qualitative clinical conventions significantly stabilized ratings, with absolute agreement reaching $\text{ICC(2,1)} = 0.877$. These findings suggest that the ADAPTS framework enables promising evaluations of psychiatric severity. While the current implementation is purely text-based, the underlying architecture is readily extensible to multimodal inputs, including acoustic and visual features. By approximating expert-level precision in a protocol-agnostic manner, this framework provides a foundation for objective and scalable psychiatric assessment, especially in resource-limited settings.
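
The agreement statistic quoted above, ICC(2,1), is the two-way random-effects, absolute-agreement, single-rater intraclass correlation. As a rough sketch of how such a value is computed from a subjects-by-raters table (the function name `icc_2_1` and the list-of-lists input layout are assumptions for illustration, not the authors' code):

```python
def icc_2_1(ratings):
    """ICC(2,1) from a table of n subjects (rows) by k raters (columns)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares of the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Because this is an absolute-agreement measure, a rater who is consistently one point higher than another lowers the score even when the two rank every subject identically.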

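[68] Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies cs.AI | cs.CL | cs.DB | cs.LG

Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Weizheng Wang

TL;DR: This paper introduces Workspace-Bench 1.0, a benchmark for evaluating AI agents on workspace tasks with large-scale file dependencies. It builds realistic workspaces with 5 worker profiles, 74 file types, and 20,476 files (up to 20GB), and curates 388 tasks, each with its own file dependency graph. Tasks are evaluated against 7,399 rubrics in total, which require cross-file retrieval, contextual reasoning, and adaptive decision-making.

Details

Motivation: Existing benchmarks mostly evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. This paper aims to fill that gap with a comprehensive evaluation framework for workspace learning that involves large-scale file dependencies.

Result: The experiments evaluate 4 popular agent harnesses and 7 foundation models. Current agents remain far from reliable workspace learning: the best result is only 68.7%, well below the human result of 80.7%, and the average across agents is only 47.4%.

Insight: The novelty lies in a benchmark focused on workspace learning with large-scale file dependencies, which uses realistic, heterogeneous file environments to evaluate cross-file retrieval, reasoning, and decision-making. Viewed objectively, its detailed dependency graphs and diverse task settings provide a valuable evaluation standard for future agent development, and the Workspace-Bench-Lite subset lowers evaluation cost, which should help broader adoption.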

Abstract: Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker’s workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.

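[69] Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards cs.AI | cs.CL

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang

TL;DR: This paper proposes TraceLift, a framework that trains reasoning planners with executor-grounded rewards. It targets the problem that reinforcement learning on final-answer correctness alone can yield reasoning traces that are unfaithful, unreliable, or useless to downstream models. TraceLift treats reasoning as a consumable intermediate artifact: a frozen executor turns the planner’s reasoning into the final artifact for verifier feedback, while a reward that multiplies a rubric-based Reasoning Reward Model score by the measured executor uplift shapes the quality and usefulness of the intermediate trace.

Details

Motivation: Existing reinforcement learning with verifiable rewards looks only at final-answer correctness and cannot guarantee that a reasoning trace is faithful, reliable, or useful to the model that consumes it. As a result, models can be right for the wrong reasons, shortcuts can be rewarded and reasoning gains overstated, and flawed intermediate states can propagate through multi-step systems.

Result: Extensive experiments on code and math benchmarks show that the executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training. This suggests that reasoning supervision should assess not only whether a trace looks good but also whether it helps the model that consumes it.

Insight: The novelty lies in a training framework that treats reasoning as a consumable intermediate artifact and introduces an executor-grounded reward to jointly optimize trace quality and usefulness. The paper also builds the TRACELIFT-GROUPS dataset, in which each same-problem group contains a high-quality reference trace and several flawed traces produced by localized perturbations, making reasoning quality directly learnable. The takeaway is that reasoning supervision should look beyond surface correctness to practical utility for downstream tasks.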

Abstract: Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.
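
The abstract above describes the executor-grounded reward as a rubric-based RM score multiplied by measured uplift on the frozen executor. A minimal sketch of that product follows; defining uplift as the executor's pass rate with the trace minus its pass rate without it, and clipping it at zero, are assumptions made here, as are all function and parameter names.

```python
def executor_grounded_reward(rm_score, pass_with_trace, pass_without_trace):
    """Product of trace quality and usefulness to the frozen executor.

    rm_score:            rubric-based reasoning score, assumed to lie in [0, 1]
    pass_with_trace:     executor pass rate when it consumes the trace
    pass_without_trace:  executor pass rate on the same problem without it
    """
    # Assumption: a trace that hurts the executor earns no credit (clip at 0).
    uplift = max(0.0, pass_with_trace - pass_without_trace)
    return rm_score * uplift
```

With a product, a polished trace that does not help the executor and a helpful trace that scores poorly on the rubric both receive little reward, which matches the stated goal of crediting traces that are both high-quality and useful.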

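[70] OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories cs.AI | cs.CL

Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu

TL;DR: This paper introduces OpenSeeker-v2, a search agent trained with supervised fine-tuning (SFT) only. It shows that simple SFT can be very strong when the trajectory data is informative and difficult. With three data synthesis changes (a larger knowledge graph, a larger tool set, and strict low-step filtering), the model is trained on only 10.6k data points and still surpasses an industrial model trained with a complex pipeline (continual pre-training + SFT + reinforcement learning) on several benchmarks, reaching state-of-the-art performance.

Details

Motivation: Deep search is a key capability for frontier LLM agents, but its development is dominated by industrial giants and relies on resource-intensive training pipelines. The paper aims to show that, given high-quality data, simple SFT can also train a top search agent, lowering the barrier to research.

Result: On four benchmarks (BrowseComp, BrowseComp-ZH, Humanity’s Last Exam, xbench), the 30B-parameter OpenSeeker-v2 (ReAct paradigm) achieves state-of-the-art results of 46.0%, 58.1%, 34.6%, and 78.0%, respectively, surpassing Tongyi DeepResearch, which was trained with a CPT+SFT+RL pipeline.

Insight: The core contribution is the data synthesis strategy: scaling the knowledge graph for richer exploration, expanding the tool set for broader functionality, and applying strict low-step filtering to raise trajectory quality. It shows that high-quality, high-difficulty training data is what allows a simplified pipeline (SFT only) to reach SOTA performance, offering academia a reproducible and efficient training recipe.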

Abstract: Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity’s Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.

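[71] Quantifying the human visual exposome with vision language models cs.AI | cs.CV

Christian Rominger, Andreas R. Schwerdtfeger, Malay Gaherwar Singh, Dimitri Khudyakow, Elizabeth A. M. Michels

TL;DR: This paper proposes using vision language models (VLMs) to quantify the human visual exposome in order to study how the visual environment relates to mental health. Combining ecological momentary assessment (EMA) with VLMs, the authors analyze the semantic richness of 2674 participant-taken photographs and find that VLM-estimated greenness robustly predicts momentary affect and chronic stress. They also develop a semi-autonomous, LLM-based pipeline that extracts nearly 1000 environmental features linked to mental health from over seven million scientific publications; when applied to real-world images, up to 33% of the VLM-extracted context ratings correlate significantly with affect and stress.

Details

Motivation: The visual environment is a key determinant of mental health that has not yet been quantified. Existing methods rely on coarse geospatial proxies or biased self-reports and cannot capture the first-person visual context of daily life.

Result: Across 2674 participant-generated photographs, VLM-estimated greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. When the features extracted by the LLM-based pipeline were applied to real-world images, up to 33% of the VLM ratings correlated significantly with affect and stress, establishing a scalable, objective paradigm for visual exposomics.

Insight: The novelty lies in combining VLMs with ecological momentary assessment to quantify the human visual exposome, and in using an LLM to automate the extraction of environmental features from the scientific literature. This enables high-throughput decoding of the association between the visual environment and mental health and gives exposome research a new tool.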

Abstract: The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous, large language model (LLM)-based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33 percent of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable, objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.

cs.LG [Back]

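[72] The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It cs.LG | cs.CL

Gabriel Garcia

TL;DR: This paper studies why large language models fail at simple counting tasks. It finds that the model does represent the count internally, but the internal representation is geometrically misaligned with the output directions of the digit tokens, so the count cannot be read out correctly. Using linear probes and intervention experiments, the paper shows that geometric alignment between the output head and the attention pathway is the key bottleneck.

Details

Motivation: To find the root cause of Transformer failures on simple counting tasks even when the items are explicitly present in the prompt, and to separate a missing internal representation from a failed conversion to output tokens.

Result: In experiments on Pythia, Qwen3, and Mistral models, linear probes recover the count from intermediate layers almost perfectly (R²>0.99), yet the output-head rows for digit tokens are nearly orthogonal to the count-encoding directions (|cos|≤0.032). Fine-tuning only the digit rows of the output head raises constrained next-token prediction accuracy from 60.7% to 100%, while a LoRA intervention on the attention Q/V weights reaches 83.1%±7.2% accuracy in autoregressive generation, with the vocabulary rank of the correct digit improving from 55,980 to 1.

Insight: The novelty lies in locating counting failure in a geometric readout bottleneck rather than in a defective internal representation. Small parameter interventions (such as LoRA) that improve attention routing can fix the alignment problem. The bottleneck appears across character counting, addition, and list-length tasks, but is absent from broader multi-step reasoning benchmarks such as MMLU.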

Abstract: Large language models often fail at simple counting tasks, even when the items to count are explicitly present in the prompt. We investigate whether this failure occurs because transformers do not represent counts internally, or because they cannot convert those representations into the correct output tokens. Across three model families, Pythia, Qwen3, and Mistral, ranging from 0.4B to 14B parameters, we find strong evidence for the second explanation. Linear probes recover the correct count from intermediate layers with near-perfect accuracy ($R^2>0.99$), showing that the information is present. However, the internal directions that encode counts are nearly orthogonal to the output-head rows for digit tokens ($|\cos|\leq0.032$). In other words, the model stores the count in a form that the digit logits do not naturally read out. We localize this failure with two interventions. Updating only the digit rows of the output head (36,864 parameters) substantially improves constrained next-token digit prediction (60.7 to 100.0% across four tasks), but it does not fix autoregressive generation. By contrast, a small LoRA intervention on attention Q/V weights (7.67M parameters) improves upstream routing and achieves 83.1% +/- 7.2% in true greedy autoregressive generation. Logit-lens measurements confirm the mechanism: the correct digit’s vocabulary rank drops from 55,980 to 1, a 50,000x improvement. Additional norm, logit-lens, and cross-task analyses show that the bottleneck generalizes across character counting, addition, and list length, while remaining absent from broader multi-step reasoning benchmarks, including MMLU, GSM8K, and DROP. These results identify counting failure as a geometric readout bottleneck rather than a failure of internal representation: the model knows the count but the output pathway is geometrically misaligned with the tokens needed to express it.
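
Two measurements carry the argument in the abstract above: the cosine between a count-encoding direction and the output-head rows for digit tokens, and the vocabulary rank of the correct digit under a logit lens. The toy sketch below illustrates both quantities on plain Python lists; the helper names are hypothetical, and real experiments would use model weights rather than hand-written vectors.

```python
import math


def abs_cosine(u, v):
    """Absolute cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return abs(dot) / norm


def max_alignment(count_direction, digit_rows):
    """Largest |cos| between a probe direction and any digit row of the head."""
    return max(abs_cosine(count_direction, row) for row in digit_rows)


def token_rank(logits, token_id):
    """1-based rank of a token: one plus the number of strictly larger logits."""
    target = logits[token_id]
    return 1 + sum(1 for value in logits if value > target)
```

A value of `max_alignment` near zero means the digit logits barely respond to movement along the count direction, which is the readout bottleneck the paper describes; a `token_rank` of 1 means greedy decoding would emit the correct digit.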

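[73] Text-Conditional JEPA for Learning Semantically Rich Visual Representations cs.LG | cs.CV

Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind

TL;DR: This paper proposes Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the visual uncertainty in masked feature prediction and thereby learn semantically richer visual representations. A fine-grained text conditioner modulates the predicted patch features through sparse cross-attention, improving both the performance and the training stability of self-supervised learning.

Details

Motivation: To address the difficulty that the image-based Joint-Embedding Predictive Architecture (I-JEPA) has in predicting features at masked positions, where visual uncertainty makes prediction hard and can prevent the model from learning semantic representations.

Result: TC-JEPA improves downstream performance and training stability and shows promising scaling properties. It outperforms contrastive methods on diverse tasks, especially those that require fine-grained visual understanding and reasoning.

Insight: The novelty lies in bringing text conditioning into the feature prediction framework, with fine-grained modulation through sparse cross-attention. This offers a new vision-language pretraining paradigm based on feature prediction only and strengthens the semantics of the learned representations.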

Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, with the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA) that uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.
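
The conditioning step in the abstract above, sparse cross-attention from a patch feature over text tokens, can be pictured with the toy sketch below. It is not the TC-JEPA conditioner: top-k selection as the sparsification rule, additive modulation of the patch feature, and every name in the code are assumptions chosen to keep the example small.

```python
import math


def sparse_cross_attention(patch, text_keys, text_values, top_k=1):
    """Modulate one patch feature with its top_k best-matching text tokens.

    Assumes text values have the same dimension as the patch feature.
    """
    d = len(patch)
    # Scaled dot-product score of the patch (query) against each text key.
    scores = [sum(p * k for p, k in zip(patch, key)) / math.sqrt(d)
              for key in text_keys]
    # Sparsify: keep only the indices of the top_k scores.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax restricted to the kept tokens (shifted by the max for stability).
    peak = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(weights.values())
    context = [sum(weights[i] / total * text_values[i][j] for i in keep)
               for j in range(d)]
    # Assumption: additive modulation of the patch feature by the text context.
    return [p + c for p, c in zip(patch, context)]
```

With `top_k=1` each patch is conditioned on a single caption token, which is the extreme case of the sparsity the abstract refers to.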

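[74] From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT cs.LG | cs.CV

Mahmoud Hanouneh, Radu Timofte, Dmitry Ignatov

TL;DR: This paper adds a classification task to the NNGPT framework: a fine-tuned LLM (DeepSeek-Coder-7B-Instruct) predicts, from a neural network architecture’s source code and the dataset names alone, on which of two image classification datasets the architecture achieves higher accuracy. The code-only prompt configuration reaches 80% peak accuracy over 15 epochs, ahead of the configuration that uses dataset metadata (70%), suggesting that source code carries a richer discriminative signal than metadata.

Details

Motivation: Current LLM-based AutoML approaches focus on generative outputs (such as hyperparameter optimization and architecture generation) and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. The paper asks whether a fine-tuned LLM can learn, directly from neural network code, to predict performance differences across datasets.

Result: On the classification task built from the LEMUR dataset, DeepSeek-Coder-7B-Instruct fine-tuned with LoRA reaches 80% peak accuracy over 15 epochs with the prompt that provides only source code and dataset names, while the dataset-metadata prompt peaks at 70%. The metadata prompt excels on datasets with distinctive properties (CelebAGender at 90.9%) but degrades on datasets with overlapping characteristics, whereas the code-only prompt is more balanced. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning.

Insight: The novelty lies in applying LLMs to cross-dataset performance classification of neural networks rather than to the usual generative tasks, and in showing that architecture source code alone (without accuracy figures) can predict an architecture’s suitability across datasets. Viewed objectively, the study indicates that source code holds richer discriminative information for performance reasoning than dataset metadata, pointing to a new direction for LLM-based reasoning in AutoML.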

Abstract: Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Per-dataset analysis reveals complementary strengths: metadata excels for datasets with distinctive properties (CelebAGender at 90.9%) but degrades for overlapping characteristics, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.

cs.RO [Back]

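[75] Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing cs.RO | cs.CV

Zhiling Chen, David Gorsich, Matthew P. Castanier, Yang Zhang, Jiong Tang

TL;DR: This paper proposes a method for configuring robotic laser profile scanning parameters based on vision-language embeddings and hyperdimensional computing. Given a natural-language instruction and a pre-scan RGB image, it adaptively recommends a discrete sensor configuration, with the aim of improving measurement fidelity and eliminating manual tuning.

Details

Motivation: Industrial laser profilers expose multiple coupled parameters (such as sampling frequency, measurement range, and exposure time) that are still configured by trial and error. Mismatched parameters can cause saturation, clipping, or missing data. The goal is to configure the sensor autonomously from task intent and scene context.

Result: On the real-world multimodal dataset Instruct-Obs2Param, the proposed ScanHD framework achieves 92.7% average exact accuracy and 98.1% average Win@1 accuracy across the five parameters, with strong cross-split generalization and low-latency inference, outperforming rule-based heuristics, conventional multimodal models, and multimodal large language models.

Insight: The novelty lies in using a hyperdimensional computing framework for multimodal task-aware encoding, which binds instruction and observation and performs parameter-wise associative reasoning to give stable, interpretable, low-latency decisions. It elevates sensor configuration from a static setting to an adaptive decision variable and offers a new approach to autonomous robotic inspection.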

Abstract: Robotic laser profiling is widely used for dimensional verification and surface inspection, yet measurement fidelity is often dominated by sensor configuration rather than robot motion. Industrial profilers expose multiple coupled parameters, including sampling frequency, measurement range, exposure time, receiver dynamic range, and illumination, that are still tuned by trial-and-error; mismatches can cause saturation, clipping, or missing returns that cannot be recovered downstream. We formulate instruction-conditioned sensing parameter recommendation: given a pre-scan RGB observation and a natural-language inspection instruction, infer a discrete configuration over key parameters of a robot-mounted profiler. To benchmark this problem, we develop Instruct-Obs2Param, a real-world multimodal dataset linking inspection intents and multi-view pose and illumination variation across 16 objects to canonical parameter regimes. We then propose ScanHD, a hyperdimensional computing framework that binds instruction and observation into a task-aware code and performs parameter-wise associative reasoning with compact memories, matching discrete scanner regimes while yielding stable, interpretable, low-latency decisions. On Instruct-Obs2Param, ScanHD achieves 92.7% average exact accuracy and 98.1% average Win@1 accuracy across the five parameters, with strong cross-split generalization and low-latency inference suitable for deployment, outperforming rule-based heuristics, conventional multimodal models, and multimodal large language models. This work enables autonomous, instruction-conditioned sensing configuration from task intent and scene context, eliminating manual tuning and elevating sensor configuration from a static setting to an adaptive decision variable.
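
Binding and associative memory, the two hyperdimensional computing operations named in the abstract above, can be illustrated with generic bipolar hypervectors. This sketch is textbook hyperdimensional computing, not ScanHD itself: the dimensionality, the random hypervectors standing in for instruction and observation embeddings, and the regime labels used in the test are all assumptions.

```python
import random

D = 2048  # hypervector dimensionality (an arbitrary choice for this sketch)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def bind(a, b):
    """Bind two hypervectors by element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AssociativeMemory:
    """One bundled prototype per parameter regime, queried by similarity."""

    def __init__(self):
        self.prototypes = {}

    def add(self, label, code):
        proto = self.prototypes.setdefault(label, [0] * D)
        for i, value in enumerate(code):
            proto[i] += value  # bundling: element-wise accumulation

    def query(self, code):
        return max(self.prototypes, key=lambda lab: dot(self.prototypes[lab], code))
```

In a setup like the one the abstract describes, one such memory could be kept per scanner parameter, so that each parameter is decided by its own similarity lookup over the same bound instruction-observation code.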

cs.CR [Back]

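[76] Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses cs.CR | cs.AI | cs.CV | cs.RO

Xiao Li, Xiang Zheng, Yifeng Gao, Xinyu Xia, Yixu Wang

TL;DR: This survey gives a comprehensive overview of safety research in Embodied AI, covering risks, attacks, and defenses across the full pipeline from perception and cognition to planning, action, and interaction.

Details

Motivation: As embodied AI systems gain autonomy and are deployed in safety-critical domains such as transportation, healthcare, and industry, ensuring their safety in a complex open world with uncertain sensing and dynamic human-robot interaction becomes both essential and technically demanding.

Result: Drawing on more than 400 papers, the survey proposes a unified multi-level taxonomy that organizes fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models.

Insight: The contribution is a coherent framework and roadmap for embodied AI safety, together with several overlooked challenges: the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios.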

Abstract: Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic system. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 400 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.

cs.NE [Back]

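[77] Where to Bind Matters: Hebbian Fast Weights in Vision Transformers for Few-Shot Character Recognition cs.NE | cs.CV | cs.LG

Gavin Money, Sindhuja Penchala, Jiacheng Li, Noorbakhsh Amiri Golilarz

TL;DR: This paper is an empirical study of several vision Transformer backbones augmented with Hebbian Fast-Weight (HFW) modules, aimed at faster adaptation in few-shot character recognition. It proposes a single-module placement strategy for Swin-Tiny, in which one HFW module is applied to the final-stage feature map after all hierarchical stages have completed. This strategy achieves the best performance on the Omniglot benchmark.

Details

Motivation: Standard Transformer architectures learn fixed slow-weight representations during training and lack a mechanism for rapid adaptation within a single episode. Biological neural systems achieve this through Hebbian plasticity, with fast synaptic updates that form transient associative memories. The paper explores how to integrate such Hebbian fast weights into Transformers effectively to address rapid adaptation in few-shot learning.

Result: Models are evaluated on 5-way 1-shot and 5-way 5-shot Omniglot classification under a Prototypical Network meta-learning framework. The proposed Swin-Hebbian model (with the single HFW module placement) achieves the highest test accuracy of all six model variants (96.2% at 1-shot; 99.2% at 5-shot), outperforming its non-Hebbian baseline by 0.3 percentage points at 1-shot.

Insight: The contribution is a stable and effective HFW placement strategy that avoids the training instability caused by placing a separate Hebbian module at each stage. Viewed objectively, the core idea is to co-design Hebbian fast binding with the shifted-window inductive bias of the Swin Transformer; the paper also explains why per-block placement fails for ViT and DeiT variants in a low-data regime, which offers new design guidance for fast/slow-weight meta-learning.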

Abstract: Standard transformer architectures learn fixed slow-weight representations during training and lack mechanisms for rapid adaptation within an episode. In contrast, biological neural systems address this through fast synaptic updates that form transient associative memories during inference, a property known as Hebbian plasticity. In this paper, we conduct an empirical study of Hebbian Fast-Weight (HFW) modules integrated into multiple transformer backbones, including ViT-Small, DeiT-Small, and Swin-Tiny. We evaluate six model variants: ViT, DeiT, Swin, ViT-Hebbian, DeiT-Hebbian, and Swin-Hebbian on 5-way 1-shot and 5-way 5-shot classification tasks using the Omniglot benchmark under a Prototypical Network meta-learning framework. We propose a single module placement strategy for Swin-Tiny in which one HFW module is applied to the final stage feature map after all hierarchical stages have completed. This design avoids the training instability caused by placing separate Hebbian modules at each stage and achieves the highest test accuracy across all six models (96.2% at 1-shot; 99.2% at 5-shot), outperforming its non-Hebbian baseline by $+0.3$ percentage points at 1-shot. We analyze the interaction between Swin’s shifted window inductive bias and episode-level Hebbian binding, discuss why per-block placement fails for ViT and DeiT variants in a low-data regime, and situate the results within the wider literature on fast and slow-weight meta-learning.
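
The fast-weight idea that the abstract above builds on can be shown with a generic outer-product Hebbian memory. The sketch below is not the paper's HFW module: the update rule `W <- decay * W + lr * value * key^T`, the class name, and the default hyperparameters are assumptions used only to show how an association written during an episode can be read back without gradient updates.

```python
class FastWeightMemory:
    """Generic Hebbian fast-weight matrix updated by outer products."""

    def __init__(self, dim, lr=1.0, decay=1.0):
        self.W = [[0.0] * dim for _ in range(dim)]
        self.lr = lr
        self.decay = decay

    def write(self, key, value):
        # Hebbian update: strengthen W[i][j] when value[i] and key[j] co-occur.
        for i, v in enumerate(value):
            for j, k in enumerate(key):
                self.W[i][j] = self.decay * self.W[i][j] + self.lr * v * k

    def read(self, query):
        # Retrieval is a matrix-vector product with the fast weights.
        return [sum(w * q for w, q in zip(row, query)) for row in self.W]
```

With orthogonal keys, each stored value is retrieved exactly; with correlated keys the retrievals interfere with one another.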

cs.HC [Back]

[78] A User-Centric Analysis of Explainability in AI-Based Medical Image Diagnosis cs.HC | cs.AI | cs.CVPDF

Julia Wagner, Tim Schlippe

TL;DR: 本文对基于AI的医学图像诊断中的可解释性方法进行了以用户为中心的分析，比较了最新的文本、视觉和多模态可解释人工智能方法。一项针对33名医生的调查显示，88%的医生认为AI解释诊断很重要，而结合边界框和报告的方法在可理解性、完整性、速度和适用性方面评价最佳。研究还发现，50%的参与者即使面对错误的AI诊断，也信任所有测试的XAI方法。

Details

Motivation: 尽管AI系统在医学领域表现优于人类，但由于其决策过程不透明，在实践中很少使用，缺乏最优的解释和可视化方法，因此需要从用户角度分析可解释性方法。

Result: 在针对33名医生的调查中，结合边界框和报告的多模态XAI方法在可理解性、完整性、速度和适用性方面评价最高；同时，50%的参与者信任错误的AI诊断，凸显了XAI方法的潜在风险。

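Insight: The novelty lies in comparing different XAI methods from the user's perspective and quantifying physicians' demand for explanations and their level of trust. The results suggest that multimodal explanations (a visual bounding box plus a textual report) may be more effective for medical image diagnosis, but that XAI can aggravate misplaced trust when the underlying diagnosis is wrong.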

Abstract: In recent years, AI systems in the medical domain have advanced significantly. However, despite outperforming humans, they are rarely used in practice since it is often not clear how they make their decisions. Optimal explanation and visualization of the decision process are often lacking. Therefore, we conducted a comparative user-centric analysis of the latest state-of-the-art textual, visual and multimodal explainable artificial intelligence (XAI) methods for medical image diagnosis. Our survey of 33 physicians showed that 88% agree that it is important that AI explains the diagnosis – 64% even strongly agree. A combination of bounding box and report is rated better than the other tested XAI methods in the evaluated aspects understandability, completeness, speed, and applicability. We even tested the potential negative impact of false AI-based medical image diagnoses and found that 50% of the participants trusted false AI diagnoses over all tested XAI methods.

cs.IR [Back]

[79] RAG over Thinking Traces Can Improve Reasoning Tasks cs.IR | cs.AI | cs.CL

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

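TL;DR: This paper challenges the conventional view that RAG offers limited benefit for reasoning tasks, proposing to retrieve "thinking traces" instead of conventional documents to improve reasoning performance. It introduces T3, a method that converts thinking traces into structured representations, and reports notable gains on several reasoning benchmarks.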

Details

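Motivation: Retrieval-augmented generation (RAG) is considered effective for knowledge-intensive tasks but of limited help for reasoning-intensive tasks such as math and code generation. The paper challenges this assumption, arguing that the limitation lies in the choice of retrieval corpus rather than in RAG itself.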

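Result: On benchmarks including AIME 2025–2026, LiveCodeBench, and GPQA-Diamond, a simple retrieve-then-generate pipeline that uses thinking traces as the retrieval corpus consistently improves the reasoning performance of strong models, including Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, outperforming both non-RAG baselines and retrieval over standard web corpora. For example, on AIME, RAG with traces generated by Gemini-2-thinking yields relative gains of +56.3%, +8.6%, and +7.6% for these three models, respectively.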

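Insight: The core novelty is shifting the retrieval target from conventional documents to the "thinking traces" (intermediate thinking trajectories) produced during problem solving, together with T3, an offline method that transforms these traces into structured representations to make them more retrieval-friendly. This opens a new direction for applying RAG to reasoning tasks, and the approach adds little or no inference cost and can even reduce it.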

Abstract: Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025–2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Interestingly, RAG on T3 also incurs little or no extra inference cost, and can even reduce inference cost by up to 15%. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
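
The retrieve-then-generate idea described in the abstract can be illustrated with a small sketch. This is not the T3 method or the authors' pipeline: the bag-of-words cosine retriever, the function names (`bow`, `cosine`, `retrieve`, `build_prompt`), the prompt layout, and the two toy traces are all assumptions made for illustration, and the final model call is left out.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a stand-in for a real retriever)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, traces, k=1):
    """Return the k traces most similar to the query."""
    q = bow(query)
    ranked = sorted(traces, key=lambda t: cosine(q, bow(t)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Prepend retrieved traces to the problem (hypothetical prompt layout)."""
    context = "\n".join(f"[trace {i + 1}] {t}" for i, t in enumerate(retrieved))
    return f"Related thinking traces:\n{context}\n\nProblem: {query}\nAnswer:"


# Toy corpus of thinking traces (invented for this example).
traces = [
    "To count lattice paths, choose which steps go right: use a binomial coefficient.",
    "For a remainder modulo 7, reduce the base first, then look for a cycle in the powers.",
]
query = "Find the remainder when 3^100 is divided by 7."
top = retrieve(query, traces, k=1)
prompt = build_prompt(query, top)  # would be passed to a generator model
```

The only change relative to a document-based RAG pipeline is the corpus: the retriever indexes earlier problem-solving trajectories rather than web pages.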

Table of Contents

cs.CL [Back]

[1] Evaluating Reasoning Models for Queries with Presuppositions cs.CL

[2] S^2tory: Story Spine Distillation for Movie Script Summarization cs.CL | cs.AI

[3] When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning cs.CL

[4] Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs cs.CL | cs.AI

[5] Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding cs.CL | cs.AI | cs.LG

[6] PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination cs.CL | cs.AI

[7] SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification cs.CL | cs.AI

[8] Rose-SQL: Role-State Evolution Guided Structured Reasoning for Multi-Turn Text-to-SQL cs.CL

[9] Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF cs.CL

[10] Reproducing Complex Set-Compositional Information Retrieval cs.CL

[11] TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains cs.CL | cs.AI | cs.HC

[12] MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following cs.CL | cs.AI

[13] CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing cs.CL

[14] The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models cs.CL | cs.AI

[15] Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments cs.CL

[16] Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems cs.CL | cs.IR

cs.CV [Back]

[17] Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models cs.CV | cs.AI | cs.LG

[18] Approaching human parity in the quality of automated organoid image segmentation cs.CV | cond-mat.soft | q-bio.QM

[19] Learning to Segment using Summary Statistics and Weak Supervision cs.CV | cs.LG

[20] NucEval: A Robust Evaluation Framework for Nuclear Instance Segmentation cs.CV

[21] Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning cs.CV

[22] CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis cs.CV

[23] VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing cs.CV

[24] FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection cs.CV

[25] AHPA: Adaptive Hierarchical Prior Alignment for Diffusion Transformers cs.CV | cs.AI

[26] MedSR-Vision: Deep Learning Framework for Multi-Domain Medical Image Super-Resolution cs.CV

[27] VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models cs.CV | cs.AI

[28] Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology cs.CV | cs.AI

[29] Tracing Like a Clinician: Anatomy-Guided Spatial Priors for Cephalometric Landmark Detection cs.CV

[30] Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework cs.CV | cs.AI | cs.MM

[31] MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding cs.CV

[32] GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning cs.CV | cs.LG

[33] Learning Discriminative Signed Distance Functions from Multi-scale Level-of-detail Features for 3D Anomaly Detection cs.CV | cs.LG

[34] First Shape, Then Meaning: Efficient Geometry and Semantics Learning for Indoor Reconstruction cs.CV

[35] WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models cs.CV

[36] MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Languate Models cs.CV | cs.AI

[37] Orientation-Aware Unsupervised Domain Adaptation for Brain Tumor Classification Across Multi-Modal MRI cs.CV

[38] BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement cs.CV | cs.AI

[39] Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models cs.CV | cs.AI

[40] deSEO: Physics-Aware Dataset Creation for High-Resolution Satellite Image Shadow Removal cs.CV | eess.IV

[41] Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual Transformers cs.CV

[42] PriorNet: Prior-Guided Engagement Estimation from Face Video cs.CV

[43] Diffusion Masked Pretraining for Dynamic Point Cloud cs.CV

[44] The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection cs.CV

[45] Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence cs.CV | cs.LG

[46] AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV | cs.AI

[47] Unified Multimodal Visual Tracking with Dual Mixture-of-Experts cs.CV

[48] Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks cs.CV | cs.AI

[49] Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation cs.CV

[50] Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration cs.CV | cs.LG | cs.MM

[51] Conditions for well-posed color recovery in scattering media cs.CV

[52] Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback cs.CV

[53] Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation cs.CV

[54] StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning cs.CV

[55] A Benchmark for Interactive World Models with a Unified Action Generation Framework cs.CV | cs.AI

[56] UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CV

[57] Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning cs.CV | cs.AI | cs.LG

[58] 3D Human Face Reconstruction with 3DMM face model from RGB image cs.CV | cs.GR

[59] RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense Prediction cs.CV

[60] Large Language Models are Universal Reasoners for Visual Generation cs.CV

[61] UniCorrn: Unified Correspondence Transformer Across 2D and 3D cs.CV

[62] Audio-Visual Intelligence in Large Foundation Models cs.CV

eess.IV [Back]

[63] Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms eess.IV | cs.CV

[64] EMOVIS: Emotion-Optimized Image Processing eess.IV | cs.CV

cs.DB [Back]

[65] FINER-SQL: Boosting Small Language Models for Text-to-SQL cs.DB | cs.AI | cs.CL | cs.HC | cs.MA

cs.AI [Back]

[66] CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing cs.AI | cs.CL | cs.LG

[67] ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms cs.AI | cs.CL | cs.HC | stat.AP | stat.CO

[68] Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies cs.AI | cs.CL | cs.DB | cs.LG

[69] Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards cs.AI | cs.CL

[70] OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories cs.AI | cs.CL

[71] Quantifying the human visual exposome with vision language models cs.AI | cs.CV

cs.LG [Back]

[72] The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It cs.LG | cs.CL

[73] Text-Conditional JEPA for Learning Semantically Rich Visual Representations cs.LG | cs.CV