cs.CL [Total: 27]
cs.CV [Total: 56]
cs.GR [Total: 1]
cs.LG [Total: 3]
eess.AS [Total: 1]
cs.GT [Total: 1]
cs.AI [Total: 12]
cs.MM [Total: 2]
physics.comp-ph [Total: 1]
cs.RO [Total: 2]
cs.CR [Total: 1]

cs.CL

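[1] MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios cs.CL | cs.AI

Yihang Ding, Wanke Xia, Yiting Zhao, Jinbo Su, Jialiang Yang

TL;DR: This paper proposes MemGround, a long-term memory evaluation benchmark built on gamified interactive scenarios. It systematically assesses the long-term memory of large language models in dynamic interaction through a three-tier hierarchical framework (Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory) and a multi-dimensional metric suite (such as QA score and Memory Fragments Unlocked). Experiments show that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning over long-term accumulated evidence.

Details

Motivation: Existing long-term memory evaluations for LLMs are too static: they focus on simple retrieval and short-context inference and neglect the multifaceted nature of complex memory systems in continuous interaction, such as dynamic state tracking and hierarchical reasoning. A benchmark closer to real interactive scenarios is therefore needed.

Result: Extensive experiments on MemGround show that even state-of-the-art LLMs and memory agents perform poorly on sustained dynamic tracking, temporal event association, and complex reasoning over long-term accumulated evidence.

Insight: The novelty lies in moving long-term memory evaluation from static question answering to gamified interactive scenarios, and in proposing a three-tier evaluation framework with multi-dimensional quantitative metrics. This offers a more systematic, application-oriented methodology for evaluating and improving LLM memory in dynamic environments.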

Abstract: Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.

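[2] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data cs.CL

Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge

TL;DR: This paper proposes TESSY, a teacher-student cooperative data synthesis framework that addresses the loss of reasoning ability caused by stylistic differences between teacher and student models when fine-tuning on synthetic data from a stronger model. TESSY has the teacher and student models alternately generate style and non-style tokens, synthesizing data that inherits the teacher's advanced reasoning ability while remaining stylistically consistent with the student. On code generation tasks, the method substantially improves student performance.

Details

Motivation: When an emerging reasoning model (the student) is fine-tuned on synthetic data generated by a stronger model (the teacher), the large gap between the style distributions of teacher and student data often degrades student performance, sometimes severely. This paper aims to close that stylistic gap so that the student's reasoning ability actually improves.

Result: In code generation experiments, GPT-OSS-120B serves as the teacher and Qwen3-8B is fine-tuned. Fine-tuning on teacher-generated data lowers performance by 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas fine-tuning on TESSY-synthesized data yields gains of 11.25% and 6.68%, respectively.

Insight: The paper identifies stylistic divergence as a key factor in SFT effectiveness and proposes TESSY to synthesize style-consistent data. Its core idea is alternating generation between the two models (the teacher produces non-style, content tokens; the student produces style tokens), combining the teacher's capability with the student's style. This is a novel data synthesis paradigm and offers a new angle on distribution mismatch in model alignment.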

Abstract: A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher generated data and the distribution of student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.

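[3] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews cs.CL

Naman Ahuja, Saniya Mulla, Muhammad Ali Khan, Zaryab Bin Riaz, Kaneez Zahra Rubab Khakwani

TL;DR: EviSearch is a multi-agent extraction system that automates the construction of ontology-aligned clinical evidence tables from clinical trial PDFs while guaranteeing per-cell provenance for audit and human verification. It combines a PDF-query agent (which preserves rendered layout and figures), a retrieval-guided search agent, and a reconciliation module that forces page-level verification when the agents disagree.

Details

Motivation: To accelerate systematic review workflows, reduce the manual curation burden, and provide a path for integrating LLM-based extraction into evidence synthesis pipelines safely and auditably.

Result: On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy over strong parsed-text baselines while providing comprehensive attribution coverage.

Insight: The novelty is a collaborative multi-agent design that ensures high-precision extraction through forced page-level verification and reviewable provenance, and that turns reconciler decisions and reviewer edits into structured preference and supervision signals for iterative model improvement.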

Abstract: We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.

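[4] Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text cs.CL

Filippo Morbiato, Markus Keller, Priya Nair, Luca Romano

TL;DR: This paper proposes H-TechniqueRAG, a hierarchical retrieval-augmented generation framework for mapping cyber threat intelligence text to MITRE ATT&CK technique IDs. A two-stage hierarchical retrieval mechanism first identifies macro-level tactics and then retrieves specific techniques within those tactics, substantially shrinking the candidate search space. Combined with tactic-aware reranking and a hierarchy-constrained context organization strategy, it improves annotation efficiency and accuracy.

Details

Motivation: Existing RAG-based approaches to CTI text annotation use a flat retrieval paradigm that ignores the inherent taxonomy of the ATT&CK framework, in which techniques are organized under high-level tactics; this limits retrieval efficiency and information use. The paper addresses this by injecting the tactic-technique hierarchy as a strong inductive bias.

Result: Comprehensive experiments on three different CTI datasets show that H-TechniqueRAG improves F1 by 3.8% over the state-of-the-art TechniqueRAG while reducing inference latency by 62.4% and LLM API calls by 60%.

Insight: The novelty is introducing a hierarchical structural prior into the RAG framework. Two-stage hierarchical retrieval, tactic-aware reranking, and hierarchy-constrained context organization reduce the search space, mitigate LLM context overload, provide interpretable decision paths, and improve cross-domain generalization.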

Abstract: Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary’s technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls. Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
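
The two-stage retrieval described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `TAXONOMY`, the paraphrased technique descriptions, and the word-overlap scorer `overlap`, which merely stands in for whatever retriever the authors use. It is not the paper's implementation.

```python
from collections import Counter

# Hypothetical toy taxonomy: tactic -> {technique ID: paraphrased description}.
TAXONOMY = {
    "credential-access": {
        "T1110": "brute force password guessing",
        "T1003": "dump credentials from operating system memory",
    },
    "exfiltration": {
        "T1041": "exfiltration of data over command and control channel",
        "T1048": "exfiltration of data over alternative protocol",
    },
}


def overlap(query: str, text: str) -> int:
    """Count shared word occurrences (a stand-in for a real similarity score)."""
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum((q & t).values())


def retrieve(query: str, top_tactics: int = 1, top_techniques: int = 1) -> list:
    # Stage 1: rank tactics by the overlap with all of their technique texts.
    tactic_scores = {
        tactic: overlap(query, " ".join(techs.values()))
        for tactic, techs in TAXONOMY.items()
    }
    chosen = sorted(tactic_scores, key=tactic_scores.get, reverse=True)[:top_tactics]
    # Stage 2: score only the techniques that sit under the chosen tactics.
    candidates = [
        (overlap(query, desc), tech_id)
        for tactic in chosen
        for tech_id, desc in TAXONOMY[tactic].items()
    ]
    candidates.sort(reverse=True)
    return [tech_id for _, tech_id in candidates[:top_techniques]]
```

With `top_tactics=1`, stage 2 compares the query against two techniques instead of four, which is the search-space reduction the hierarchy is meant to provide.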

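[5] SAGE Celer 2.6 Technical Card cs.CL | cs.AI

SAGEA Research Team, Basab Jha, Firoj Paudel, Ujjwal Puri, Adrian Liu

TL;DR: SAGE Celer 2.6 is the latest general-purpose Celer model series from SAGEA, available in 5B, 10B, and 27B parameter sizes. Through architectural modifications, further pre-training, an Inverse Reasoning (IR) pipeline, and native multimodal integration, it aims to improve accuracy on complex reasoning tasks and reduce hallucination. It is specifically optimized for South Asian languages (such as Nepali and Hindi) and reports competitive results and low latency on mathematics, coding, and general intelligence benchmarks.

Details

Motivation: To address cascading errors and hallucination in complex reasoning tasks, avoid common pitfalls of adapter-based multimodal approaches, and strengthen South Asian language support without sacrificing English reasoning ability.

Result: The model achieves highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency.

Insight: The novel points are: (1) native training with the Inverse Reasoning (IR) pipeline, which lets the model validate its own logic paths to reduce errors; (2) natively integrated multimodality with an end-to-end vision encoder, avoiding common problems of adapter-based approaches; and (3) optimization for South Asian languages with a custom tokenizer for the Devanagari script, giving strong Nepali and Hindi performance while preserving English reasoning ability.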

Abstract: We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.

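[6] Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning cs.CL | cs.AI

Qi Dong, Ziheng Lin, Ning Ding

TL;DR: This paper proposes Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that addresses the flat context representations and stateless retrieval of conventional retrieval-augmented generation (RAG). It models question answering as a progressive evidence accumulation process: retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and kept in a persistent evidence pool that holds both supportive and non-supportive information. Evidence-driven deficiency analysis identifies information gaps and conflicts, and queries are iteratively refined to guide subsequent retrieval, yielding stable evidence aggregation and greater robustness to noisy retrieval.

Details

Motivation: Conventional RAG suffers from flat context representations and stateless retrieval, which lead to unstable performance. The paper addresses these problems by modeling a progressive, stateful evidence accumulation process to improve the robustness and accuracy of question answering.

Result: Experiments on multiple question answering benchmarks show consistent improvements over standard RAG and multi-step baselines; the method accumulates high-quality evidence effectively and maintains stable performance under substantial retrieval noise.

Insight: The main novelty is formalizing question answering as an iterative, stateful evidence accumulation loop, with structured reasoning units and a persistent evidence pool that explicitly manage evidence relevance and confidence, and with deficiency analysis driving iterative retrieval. Making the reasoning state explicit and using it to guide retrieval offers a new route to more interpretable and robust RAG systems.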

Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.
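
A rough sketch of the stateful loop the abstract describes is given below. The names `EvidenceUnit`, `EvidencePool`, `answer_loop`, the signed scoring rule, and the stopping threshold are all hypothetical; the abstract does not say how relevance and confidence are combined or when iteration stops, so this only illustrates the idea of a persistent pool that drives further retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceUnit:
    claim: str
    relevance: float   # assumed to lie in [0, 1]
    confidence: float  # assumed to lie in [0, 1]
    supportive: bool   # False marks non-supportive (conflicting) evidence


@dataclass
class EvidencePool:
    units: list = field(default_factory=list)

    def add(self, unit: EvidenceUnit) -> None:
        # The pool persists across retrieval rounds; nothing is discarded.
        self.units.append(unit)

    def support_score(self) -> float:
        # Assumed aggregation: supportive units add, non-supportive subtract.
        return sum(
            u.relevance * u.confidence * (1 if u.supportive else -1)
            for u in self.units
        )


def answer_loop(question, retrieve, max_rounds=3, threshold=1.0):
    """Accumulate evidence until the pool looks sufficient (hypothetical rule)."""
    pool = EvidencePool()
    query = question
    for round_no in range(max_rounds):
        for unit in retrieve(query):
            pool.add(unit)
        if pool.support_score() >= threshold:
            break
        # Placeholder for deficiency analysis: refine the query for the next round.
        query = f"{question} (round {round_no + 2}: seek missing evidence)"
    return pool
```

Here `retrieve` is any callable that maps a query string to a list of `EvidenceUnit` objects; in a real system the query refinement step would be produced by the model's gap and conflict analysis.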

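[7] CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization cs.CL | cs.AI

Deep Shah, Sanket Badhe, Nehal Kathrotia, Priyanka Tiwari

TL;DR: This paper proposes CROP (Cost-Regularized Optimization of Prompts), an automatic prompt optimization method that reduces the token consumption and latency of large language models on reasoning tasks while preserving task accuracy. By regularizing response length during optimization, generating textual feedback alongside accuracy feedback, CROP produces prompts that elicit more concise responses containing only critical information and reasoning.

Details

Motivation: Existing automatic prompt optimization frameworks target task accuracy alone, which produces verbose reasoning traces and significant latency and token costs. The paper seeks to cut inference overhead substantially while maintaining performance.

Result: Evaluation on complex reasoning datasets (GSM8K, LogiQA, and BIG-Bench Hard) shows an 80.6% reduction in token consumption with competitive accuracy and only a nominal decline in performance.

Insight: The novelty is introducing response length as a regularization term in automatic prompt optimization, jointly optimizing prompts through textual feedback that covers both accuracy and conciseness. This offers a practical solution for deploying token-efficient, cost-effective agentic AI systems in production pipelines.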

Abstract: Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.
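
The trade-off behind length regularization can be made concrete with a small numeric sketch. Note the substitution: CROP is described as using textual feedback, whereas the code below uses a plain scalar penalty. The function names, the weight `lam`, and `token_budget` are hypothetical and chosen only to show how a length penalty can change which candidate prompt wins.

```python
def regularized_score(accuracy, mean_tokens, lam=0.2, token_budget=500):
    """Accuracy minus a length penalty (assumed scalar form, not from the paper)."""
    return accuracy - lam * (mean_tokens / token_budget)


def pick_prompt(candidates, lam=0.2):
    """candidates maps a prompt name to (accuracy, mean response tokens)."""
    return max(
        candidates,
        key=lambda name: regularized_score(*candidates[name], lam=lam),
    )


# Illustrative numbers only, not results from the paper.
candidates = {
    "verbose": (0.90, 1000),  # slightly more accurate, long responses
    "concise": (0.88, 200),   # slightly less accurate, short responses
}
best = pick_prompt(candidates)
```

With `lam=0` the selection reduces to accuracy-only optimization and the verbose prompt wins; with the default penalty the concise prompt is preferred despite its small accuracy deficit.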

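[8] MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes cs.CL | cs.AI

Samir Wagle, Reewaj Khanal, Abiral Adhikari

TL;DR: This paper describes a system developed for the CHiPSAL 2026 shared task on hate speech detection and sentiment analysis for Nepali memes written in Devanagari script. It proposes a hybrid cross-modal attention fusion architecture that combines CLIP for visual encoding with BGE-M3 for multilingual text representation, using self-attention and a learnable gating network to weight modality contributions dynamically. Across eight model configurations, cross-modal reasoning improves F1-macro by 5.9% over text-only baselines on hate detection. Two key findings emerge: English-centric vision models perform near randomly on Devanagari script, and standard ensemble methods degrade severely under data scarcity because of correlated overfitting.

Details

Motivation: To address the challenges of hate speech detection and sentiment analysis in Devanagari-script social media memes: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings.

Result: On Subtask A of the CHiPSAL 2026 shared task (binary hate speech detection), cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines; the system is also evaluated on Subtask B (three-class sentiment classification). The experiments use roughly 850 samples per fold and expose limitations of existing methods.

Insight: The novelty is a hybrid cross-modal attention fusion architecture combining CLIP and BGE-M3, with a gating network that dynamically weights modality contributions. Its key insight is that, in low-resource multimodal tasks, English-centric vision models are limited and standard ensemble methods are prone to correlated overfitting when data is scarce, which is a useful lesson for future work.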

Abstract: Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N nearly equal to 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri-Yantra-Technologies/MEME-Fusion/

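[9] EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation cs.CL | cs.AI

Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi

TL;DR: This paper presents the EuropeMedQA study protocol, which aims to build the first comprehensive multilingual, multimodal medical examination dataset, sourced from official regulatory exams in Italy, France, Spain, and Portugal, to evaluate large language models on non-English languages and multimodal diagnostic tasks.

Details

Motivation: To address the performance decline of large language models on non-English languages and multimodal medical diagnostic tasks, and to create a contamination-resistant benchmark that reflects the complexity of European clinical practice.

Result: Contemporary multimodal LLMs are evaluated with a zero-shot, strictly constrained prompting strategy to analyze cross-lingual transfer and visual reasoning; the abstract reports no specific quantitative results or comparisons with the state of the art.

Insight: The novelty is the first multilingual, multimodal medical dataset built from official exams of several European countries, following FAIR data principles and SPIRIT-AI guidelines and using an automated translation pipeline for comparative analysis, which should help advance more generalizable medical AI.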

Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

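[10] Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation cs.CL

Zichong Li, Chen Liang, Liliang Ren, Tuo Zhao, Yelong Shen

TL;DR: This paper proposes RoPE-Perturbed Self-Distillation, a training regularizer that improves the positional robustness of large language models in long-context settings. It perturbs the RoPE position indices to create different "views" of the same training sequence and uses self-distillation to train the model to make consistent predictions across views, reducing brittle dependence on absolute position. Long-context fine-tuning experiments on Llama-3-8B and Qwen-3-4B show gains on multiple long-context benchmarks.

Details

Motivation: Standard long-context fine-tuning is brittle: model accuracy depends strongly on the absolute position of relevant evidence in the sequence and shows high positional variance even when task format and difficulty are held fixed. The paper aims to reduce this over-reliance on position and improve semantic understanding and positional robustness.

Result: In long-context fine-tuning of Llama-3-8B and Qwen-3-4B, the method yields consistent gains on long-context benchmarks: up to 12.04% on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B (both after SFT). Length extrapolation beyond the training context window also improves.

Insight: The claimed novelty is combining RoPE position perturbation with self-distillation as a training regularizer that pushes the model toward robust semantic representations instead of brittle positional patterns. Viewed objectively, it is a novel and intuitive way to improve positional invariance: data augmentation (position perturbation) and consistency regularization (self-distillation) together can improve generalization on long-context tasks.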

Abstract: Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative “views” of the same training sequence by perturbing its RoPE indices – effectively moving parts of the context to different positions – and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
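
The two ingredients named in the abstract, perturbed position indices and a consistency objective across views, can be sketched in a few lines. The specific perturbation (shifting one contiguous segment by a fixed offset) and the use of KL divergence as the consistency loss are assumptions for illustration; the abstract does not specify either.

```python
import math


def perturb_positions(seq_len, start, end, offset):
    """Position ids in which tokens in [start, end) are shifted by `offset`.

    This mimics "moving part of the context to a different position" without
    changing the tokens themselves (assumed form of the perturbation).
    """
    return [i + offset if start <= i < end else i for i in range(seq_len)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two views of one six-token sequence: original and perturbed position ids.
original_view = list(range(6))
perturbed_view = perturb_positions(6, start=2, end=4, offset=10)

# Hypothetical next-token distributions predicted under each view.
p_original = [0.5, 0.5]
p_perturbed = [0.9, 0.1]
consistency_loss = kl_divergence(p_original, p_perturbed)
```

A model whose predictions depend only on content would give identical distributions under both views and a consistency loss of zero; a position-sensitive model is penalized in proportion to the divergence.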

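[11] APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI cs.CL | cs.AI | cs.IR

Pratyay Banerjee, Masud Moshtaghi, Shivashankar Subramanian, Amita Misra, Ankit Chadha

TL;DR: This paper proposes APEX-MEM, an agentic semi-structured memory system for long-term conversational AI. It structures conversational events as a property graph built with a domain-agnostic ontology, uses append-only storage to preserve the full temporal evolution of information, and employs a multi-tool retrieval agent that resolves conflicting or evolving information at query time to produce a compact, contextually relevant memory summary.

Details

Motivation: To address the unreliability of large language models in long-term conversational memory: simply enlarging the context window or applying naive retrieval introduces noise and destabilizes responses.

Result: APEX-MEM reaches 88.88% accuracy on LOCOMO's question answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches.

Insight: The core novelty is combining a property graph, append-only storage, and an agentic retrieval agent so that information is resolved and summarized at retrieval time instead of at storage time, which preserves the full interaction history while enabling temporally coherent long-term conversational reasoning.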

Abstract: Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO’s Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.
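
The combination of append-only storage and retrieval-time resolution can be illustrated with a toy store. The `Fact` and `AppendOnlyMemory` names and the "latest turn wins" rule are hypothetical simplifications; APEX-MEM uses a property graph and a multi-tool agent, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    value: str
    turn: int  # conversation turn at which the fact was stated


class AppendOnlyMemory:
    def __init__(self):
        self._log = []

    def append(self, fact: Fact) -> None:
        # Facts are never updated or deleted, so the full history survives.
        self._log.append(fact)

    def history(self, entity, attribute):
        return [
            f for f in self._log
            if f.entity == entity and f.attribute == attribute
        ]

    def resolve(self, entity, attribute, as_of=None):
        """Resolve conflicts at query time: the latest fact up to `as_of` wins."""
        hist = [
            f for f in self.history(entity, attribute)
            if as_of is None or f.turn <= as_of
        ]
        return max(hist, key=lambda f: f.turn).value if hist else None


memory = AppendOnlyMemory()
memory.append(Fact("alice", "city", "Paris", turn=1))
memory.append(Fact("alice", "city", "Berlin", turn=7))
```

Because the older fact is kept, the store can answer both "where does Alice live now?" (`resolve("alice", "city")`) and "where did she live at turn 3?" (`resolve("alice", "city", as_of=3)`), which an overwrite-on-update store could not.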

Akshay Paruchuri, Ishan Chatterjee, Henry Fuchs, Ehsan Adeli, Piotr Didyk

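TL;DR: This paper proposes centroid replacement to probe the structural imbalance of modal dependence in multimodal language models. It finds that language representations consistently overshadow visual ones, and shows that text centroid contrastive decoding substantially improves visual perception performance at inference time without retraining.

Details

Motivation: To address the systematic underperformance of multimodal language models on visual perception tasks and to probe the structure of modal competition behind it.

Result: Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure. Text centroid contrastive decoding recovers up to +16.9% accuracy on individual tasks, with average gains of +5.6% for standard fine-tuned models and +1.5% for preference-optimized models.

Insight: The novelty is proposing centroid replacement as a controlled probe of modal dependence, revealing that modal competition is structurally localized, and showing that inference-time intervention (such as contrastive decoding) can correct modal imbalance, which provides a quantifiable diagnostic signal for future multimodal training.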

Abstract: Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
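
Both operations in the abstract are simple to state in code. The sketch below assumes the K-means centroids are already fitted and uses the common contrastive form (1 + alpha) * full - alpha * reference; the abstract does not give the exact decoding formula, so that form and the names `centroid_replace` and `contrastive_logits` are assumptions.

```python
def nearest_centroid(vec, centroids):
    """Return the centroid closest to `vec` in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)),
    )


def centroid_replace(tokens, centroids):
    """Collapse every token embedding to its nearest centroid."""
    return [nearest_centroid(t, centroids) for t in tokens]


def contrastive_logits(full, reference, alpha=1.0):
    """Amplify what the full model knows beyond the erased reference (assumed form)."""
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, reference)]


# Toy 2-D embeddings and two pre-fitted centroids (illustrative values).
centroids = [(0.0, 0.0), (1.0, 1.0)]
erased = centroid_replace([(0.1, 0.0), (0.9, 1.2)], centroids)

# Logits from the full input versus a text-centroid-erased reference.
adjusted = contrastive_logits([2.0, 1.5], [2.0, 0.5], alpha=1.0)
```

In this toy case the full logits favor the first option, but the reference favors it just as strongly, so the contrast shifts the decision to the second option, whose evidence disappears when centroid structure is erased.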

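[13] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling cs.CL

Karthik Singaravadivelan, Anant Gupta, Zekun Wang, Christopher MacLellan

TL;DR: CobwebTM is a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, it builds semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without a predefined number of topics.

Details

Motivation: Neural topic models require extensive tuning and struggle with lifelong learning (catastrophic forgetting and fixed capacity), while classical probabilistic models lack flexibility and adaptability to streaming data.

Result: Across multiple datasets, CobwebTM achieves strong topic coherence, topics that remain stable over time, and high-quality hierarchies.

Insight: Combining incremental symbolic concept formation with pretrained representations is an efficient approach to topic modeling: it builds dynamic, hierarchical semantic structure online without a predefined number of topics, which suits lifelong learning.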

Abstract: Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.

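[14] PeerPrism: Peer Evaluation Expertise vs Review-writing AI cs.CL

Soroush Sadeghian, Alireza Daqiq, Radin Cheraghi, Sajad Ebrahimi, Negar Arabzadeh

TL;DR: This paper introduces PeerPrism, a large-scale benchmark of 20,690 peer reviews designed to separate idea provenance from text provenance in reviews. It constructs controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations, and evaluates state-of-the-art LLM text detection methods on the benchmark. Existing detectors do well on the standard binary task (human vs. fully synthetic), but their predictions diverge sharply under hybrid regimes, indicating that they conflate surface realization with intellectual contribution.

Details

Motivation: Existing LLM text detection methods for peer review usually treat authorship as binary (human vs. AI) and ignore the hybrid, human-AI collaborative nature of modern review workflows. The paper asks whether detectors identify the origin of the surface text or the origin of the evaluative reasoning.

Result: State-of-the-art LLM text detection methods are benchmarked on PeerPrism. Several achieve high accuracy on the standard binary task, but under hybrid regimes (for example, human ideas with AI-generated surface text) detectors disagree sharply and produce contradictory classifications. Stylometric and semantic analyses further confirm that current detection methods conflate surface realization with intellectual contribution.

Insight: The claimed novelty is the first large-scale peer-review benchmark designed specifically to disentangle idea provenance from text provenance (PeerPrism), together with a systematic evaluation of detectors under human-AI collaboration. Viewed objectively, the contribution is to challenge the prevailing reduction of LLM detection to binary attribution and to argue that authorship should be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization, giving a new framework and benchmark for understanding and detecting hybrid content.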

Abstract: Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.

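[15] NLP needs Diversity outside of ‘Diversity’ cs.CL

Joshua Tint

TL;DR: This position paper argues that recent progress on diversity in NLP is disproportionately concentrated in a small number of fairness-related areas, and that this results from a combination of incentives, biases, and barriers that disenfranchise marginalized researchers in non-fairness fields or push them toward fairness-related ones. The author supports the claim with an investigation of the demographics of NLP researchers by subfield and offers recommendations for making all areas of NLP more inclusive and equitable, emphasizing the need to break feedback loops that reinforce disparities and to address geographical and linguistic barriers to participation in NLP research.

Details

Motivation: To expose the uneven distribution of diversity work in NLP: current diversity efforts focus heavily on fairness while inclusion in other subfields is neglected. The paper calls for broader attention to diversity.

Result: The paper supports its argument with an investigation of NLP researcher demographics by subfield; the abstract reports no specific quantitative results or benchmarks, and the recommendations are grounded in this empirical analysis.

Insight: The novelty is a critical analysis of the limits of diversity work in NLP that highlights how systemic biases and barriers shape researcher participation, with concrete recommendations such as breaking feedback loops and addressing geographical and linguistic barriers, offering a new perspective on inclusion across the wider field.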

Abstract: This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.

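[16] StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation cs.CL | cs.AI

Geonhui Jang, Dongyoon Han, YoungJoon Yoo

TL;DR: StoryCoder is a narrative reformulation framework that transforms code generation problems into coherent natural language narratives comprising a task overview, constraints, and test cases, providing richer contextual structure than simple rephrasing. Experiments show consistent improvements across 11 models on HumanEval, LiveCodeBench, and CodeForces, with an average gain of 18.7% in zero-shot pass@10; the method also guides models toward correct algorithmic strategies, reduces implementation errors, and yields more modular code.

Details

Motivation: Existing approaches improve code generation by augmenting reasoning steps or injecting specific structure, but leave scattered problem conditions unchanged. Inspired by how humans organize fragmented information into coherent explanations, the paper uses narrative reformulation to give models a more structured representation to reason from.

Result: Across 11 models on HumanEval, LiveCodeBench, and CodeForces, zero-shot pass@10 improves by 18.7% on average. Analysis shows that narrative reformulation guides models toward correct algorithmic strategies, reduces errors, and promotes modular code structure.

Insight: The novelty is recasting code generation problems as coherent narratives with a task overview, constraints, and test cases, and showing that narrative coherence and genre alignment matter for structured problem representation. Viewed objectively, the natural language narrative supplies rich context that guides model reasoning regardless of model scale or architecture.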

Abstract: Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu-ni/StoryCoder.
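
Since the headline number is a pass@10 gain, it may help to recall how pass@k is usually computed. The function below is the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), for n generated samples of which c pass the tests; whether StoryCoder's evaluation uses exactly this estimator is an assumption, as the abstract does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for a problem
    c: number of those samples that pass all tests
    k: sampling budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 20 samples and a single correct one, `pass_at_k(20, 1, 10)` is 0.5: half of all ten-sample subsets contain the correct solution. The benchmark score is the mean of this quantity over problems.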

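[17] Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models cs.CL | cs.AI

Cuong Hoang, Le-Minh Nguyen

TL;DR: This paper presents the winning method in the Financial Misinformation Detection Challenge. For reference-free financial misinformation detection, it combines in-context learning (zero-shot and few-shot prompting) with parameter-efficient fine-tuning (LoRA) so that large language models can judge the veracity of financial claims from internal semantic understanding and contextual consistency alone.

Details

Motivation: The spread of financial misinformation seriously threatens market stability and investor trust, and real-world settings often lack external evidence or references for cross-verification, so detection methods that need no external reference are required.

Result: The method achieves 95.4% accuracy on the public test set and 96.3% on the private test set of the task built on the RFC-BENCH framework, ranking first on both official leaderboards.

Insight: The novelty is systematically combining in-context learning (zero-shot and few-shot prompting in particular) with LoRA-based parameter-efficient fine-tuning to tune large language models to the subtle linguistic cues of financial manipulation, enabling effective misinformation detection without external references.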

Abstract: The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the “Reference-Free Financial Misinformation Detection” shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

[18] SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models cs.CL

Binxian Su, Haoye Lou, Shucheng Zhu, Weikang Wang, Ying Liu

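TL;DR: This paper proposes SPAGBias, the first framework for systematically evaluating spatial gender bias in large language models (LLMs). It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers (explicit, probabilistic, and constructional). Tests on six representative models reveal structured gender-space associations that go beyond the public-private divide, and the paper examines how prompt design, temperature, and model scale affect bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations substantially exceeding real-world distributions. Downstream experiments further show that such biases cause concrete failures in both normative and descriptive application settings.

Details

Motivation: LLMs are increasingly used in urban planning, and gendered space theory highlights how gender hierarchies are embedded in spatial organization, raising the concern that LLMs may reproduce or amplify such biases. A systematic evaluation of spatial gender bias in LLMs is therefore needed.

Result: Across the six representative models tested, the study identifies structured gender-space associations that go beyond the public-private divide and form nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". Model associations substantially exceed real-world distributions and lead to concrete failures in normative and descriptive application settings.

Insight: The work connects sociological theory with computational analysis, extends bias research into the spatial domain, and shows how LLMs encode social gender cognition through language. The SPAGBias framework offers a systematic evaluation method (taxonomy, prompt library, and three diagnostic layers) for analyzing how bias is embedded and reinforced across the model pipeline, providing new tools and perspectives for understanding and mitigating structured spatial bias in LLMs.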

Abstract: Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias - the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape “spatial gender narratives”. We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings. This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.

Midan Shim, Seokju Hwang, Kaehyun Um, Kyong-Ho Lee

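TL;DR: Negative constraints are largely overlooked in Knowledge Graph Question Answering (KGQA). This paper introduces a new task, NEST-KGQA, with a corresponding dataset, NestKGQA, and designs the PyLF logical form to express negation clearly. It also proposes the CUCKOO framework, which handles multi-constraint questions and ensures semantic executability through constraint-aware logical form drafting, schema-guided semantic matching, and self-directed refinement. Under few-shot settings, CUCKOO outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks.

Details

Motivation: Despite strong reasoning abilities, large language models still fall short on faithfulness and hallucination. Existing KGQA benchmarks and methods are biased toward positive and calculation constraints, while negative constraints, which appear frequently in real questions, are neglected. KGQA tasks and methods that specifically handle negative and multiple constraints are therefore needed.

Result: Experiments show that, under few-shot settings, CUCKOO consistently outperforms baselines on both conventional KGQA benchmarks and the proposed NEST-KGQA benchmark.

Insight: The contributions are: 1) the NEST-KGQA task and dataset, focused on negative constraints; 2) PyLF, a logical form that expresses negation clearly while remaining readable; 3) the CUCKOO framework, which handles multi-constraint questions through schema-guided semantic matching and conditional self-directed refinement, balancing cost and robustness.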

Abstract: Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user’s question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.
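
The two ideas in this abstract that lend themselves to code, a negative constraint and refinement that runs only when execution returns nothing, can be sketched on a toy knowledge graph. This is not PyLF syntax (the abstract does not show it) and not the authors' pipeline; the graph, the constraint format, and the `fix_relation` helper are all hypothetical.

```python
# Toy knowledge graph as (subject, relation, object) triples.
KG = {
    ("Alice", "occupation", "actor"),
    ("Alice", "citizen_of", "France"),
    ("Bob", "occupation", "actor"),
    ("Bob", "citizen_of", "Canada"),
}

def entities_with(kg, relation, obj):
    """Subjects that have the given (relation, object) edge."""
    return {s for (s, r, o) in kg if r == relation and o == obj}

def run_query(kg, positive, negative):
    """Keep subjects matching every positive constraint, then drop
    subjects matching any negative constraint (set difference)."""
    subjects = {s for (s, _, _) in kg}
    for rel, obj in positive:
        subjects &= entities_with(kg, rel, obj)
    for rel, obj in negative:
        subjects -= entities_with(kg, rel, obj)
    return subjects

def answer(kg, positive, negative, refine):
    """Run the draft query; call the refinement step only if it is empty."""
    result = run_query(kg, positive, negative)
    if not result:
        positive, negative = refine(positive, negative)
        result = run_query(kg, positive, negative)
    return result

def fix_relation(positive, negative):
    """Hypothetical refinement: map a wrong relation name onto the schema."""
    rename = {"job": "occupation"}
    fix = lambda cs: [(rename.get(r, r), o) for r, o in cs]
    return fix(positive), fix(negative)

# "Which actors are NOT citizens of France?" with a faulty draft relation.
print(answer(KG, [("job", "actor")], [("citizen_of", "France")], fix_relation))
```

Gating refinement on an empty result means a draft that already executes to a non-empty answer costs no extra step, which is the cost argument the abstract makes.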

[20] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems cs.CL | cs.CV

Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

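TL;DR: This paper introduces MM-AQA, a benchmark for evaluating effective abstention (EA) in multimodal reasoning systems, that is, recognizing insufficient evidence and refraining from answering. Unanswerable instances are constructed from answerable ones along two axes, visual modality dependency and evidence sufficiency, and three frontier vision-language models (VLMs) and two multi-agent system (MAS) architectures are evaluated. The study finds that VLMs rarely abstain under standard prompting, that MAS improves abstention but introduces an accuracy-abstention trade-off, and that effective abstention requires abstention-aware training rather than better prompting or more agents.

Details

Motivation: Existing evaluation paradigms assume that every question posed to VLMs and MAS is answerable and ignore real-world cases of insufficient evidence, which makes models unreliable. Abstention is underexplored in multimodal settings, and fine-grained benchmarks that capture realistic failure modes are lacking.

Result: Evaluation on 2079 samples shows that VLMs rarely abstain under standard prompting and that simple confidence baselines outperform this setup; MAS improves abstention but incurs an accuracy-abstention trade-off; sequential designs match or exceed iterative variants; and models abstain when image or text evidence is absent but attempt to reconcile degraded or contradictory evidence.

Insight: The novelty is MM-AQA, a multimodal abstention benchmark built along the axes of visual modality dependency and evidence sufficiency. The analysis suggests that the bottleneck for effective multimodal abstention is miscalibration rather than reasoning depth, and that abstention-aware training is needed to improve reliability.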

Abstract: Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

[21] ClimateCause: Complex and Implicit Causal Structures in Climate Reports cs.CL | cs.AI

Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens

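TL;DR: This paper introduces ClimateCause, a manually expert-annotated dataset of higher-order causal structures drawn from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled to support graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. The study also shows how the dataset can be used to quantify text readability from the semantic complexity of causal graphs, and LLM benchmarking on correlation inference and causal chain reasoning identifies causal chain reasoning as the main challenge.

Details

Motivation: Existing causal discovery datasets mainly capture explicit, direct causal relations, whereas understanding climate change requires reasoning over complex causal networks. A richer dataset that covers implicit and nested causality is therefore needed.

Result: The study builds the ClimateCause dataset and demonstrates its value for quantifying text readability. In LLM benchmarking, causal chain reasoning proves to be a key challenge.

Insight: The novelty is a dataset focused on complex, implicit, and higher-order causal structures from domain-specific text, together with a new way to quantify readability based on the semantic complexity of causal graphs, which provides a new benchmark for evaluating complex causal reasoning in models.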

Abstract: Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause’s value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.

[22] Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding cs.CL | cs.AI

Yifan Le

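TL;DR: This paper studies how the wording of schema keys acts as an implicit instruction channel that affects model performance in structured generation with large language models (LLMs). The author reinterprets structured generation as a multi-channel instruction problem, in which instructions can be conveyed explicitly through the prompt and implicitly through schema keys during decoding. Experiments show that model families (such as Qwen and LLaMA) differ in their sensitivity to these channels and that the channels interact non-additively.

Details

Motivation: Existing constrained decoding methods largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation affects model behavior. This paper examines how schema key wording acts as an implicit instruction channel in structured generation.

Result: Experiments on multiple mathematical reasoning benchmarks show that changing only the wording of schema keys, without modifying the prompt or model parameters, can significantly alter performance under constrained decoding. Qwen models consistently benefit from schema-level instructions, whereas LLaMA models rely more heavily on prompt-level guidance.

Insight: According to the author, this is the first systematic study of schema key formulation as an implicit instruction channel, and it proposes a multi-channel instruction view of structured generation. It shows that schema design not only determines output structure but also carries instruction signals, offering a new perspective on structured generation with LLMs.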

Abstract: Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement. These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
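
To make "changing only the wording of schema keys" concrete, the sketch below builds two JSON schemas that are structurally identical and differ only in key names. The key wordings and the `rename_keys` helper are hypothetical; the abstract does not list the wordings the author actually tested.

```python
import copy
import json

def rename_keys(schema: dict, mapping: dict) -> dict:
    """Return a copy of an object schema with property names reworded.
    Types, nesting, and the set of required fields are left unchanged."""
    renamed = copy.deepcopy(schema)
    props = renamed["properties"]
    renamed["properties"] = {mapping.get(k, k): v for k, v in props.items()}
    renamed["required"] = [mapping.get(k, k) for k in renamed.get("required", [])]
    return renamed

# Neutral wording: the key carries no instruction.
base_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

# Instruction-bearing wording (hypothetical): same structure, different key.
variant_schema = rename_keys(
    base_schema, {"answer": "answer_after_careful_step_by_step_reasoning"}
)

print(json.dumps(variant_schema, indent=2))
```

Because the decoder is forced to emit the key string itself, the reworded key ends up in the model's own output context, which is one plausible reading of how a key could act as an instruction channel.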

[23] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models cs.CL | cs.AI | cs.CV | cs.LG

Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

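TL;DR: This paper analyzes reasoning dynamics in 18 vision-language models (VLMs), covering instruction-tuned and reasoning-trained models. By tracking confidence over Chain-of-Thought (CoT), measuring the corrective effect of reasoning, and evaluating the contribution of intermediate reasoning steps, it finds that models exhibit answer inertia: early predictions are reinforced rather than revised during reasoning. Models are also easily swayed by misleading textual cues even when visual evidence is sufficient, and CoT reveals modality reliance only to a limited extent, which has important implications for the transparency and safety of multimodal systems.

Details

Motivation: Despite progress in the reasoning abilities of VLMs, how they integrate visual and textual information remains unclear. Analyzing reasoning dynamics is needed to expose the models' decision mechanisms and to assess the limits of CoT for monitoring modality reliance.

Result: Reasoning-trained models show stronger corrective behavior, but their gains depend on modality conditions, ranging from text-dominant to vision-only settings. Under interventions with misleading textual cues, models are consistently influenced; how detectable this influence is in the CoT varies across models and with what is being monitored, and CoT gives only a partial view of modality-driven decisions.

Insight: The paper systematically quantifies reasoning dynamics and answer inertia in VLMs and exposes the limits of CoT for revealing modality reliance. Reasoning-trained models refer to the cues more explicitly, yet their longer, fluent CoTs can still mask reliance on text, whereas the shorter traces of instruction-tuned models more readily reveal inconsistencies with the visual input. This underlines the need for more transparent monitoring tools.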

Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

[24] IE as Cache: Information Extraction Enhanced Agentic Reasoning cs.CL

Hang Lv, Sheng Liang, Hongchao Gu, Wei Guo, Defu Lian

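TL;DR: This paper proposes IE-as-Cache, a framework that repositions information extraction (IE) as a cognitive cache to strengthen multi-step agentic reasoning. Drawing on the idea of hierarchical computer memory, it combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise, moving beyond the traditional view of IE as a terminal task.

Details

Motivation: IE is traditionally treated as a terminal objective, and the structured information it extracts is often consumed in isolation during multi-step reasoning rather than maintained and reused. This paper explores how IE can serve as a reusable cognitive resource that improves the efficiency and accuracy of agentic reasoning.

Result: Experiments on challenging benchmarks across multiple large language models (LLMs) show significant improvements in reasoning accuracy.

Insight: The core idea is to reconceptualize IE as a "cognitive cache" that is dynamically maintained and reused, bringing the notion of caching from computer architecture into AI reasoning and opening a promising direction for research on downstream uses of IE.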

Abstract: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.

[25] Compressing Sequences in the Latent Embedding Space: K-Token Merging for Large Language Models cs.CL | cs.AI

Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding

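TL;DR: This paper proposes K-Token Merging, a framework that compresses sequences in the latent embedding space. A lightweight encoder merges every K contiguous token embeddings into one, reducing the compute and memory cost of processing long prompts while preserving model performance.

Details

Motivation: Processing long prompts is costly for large language models in compute and memory because self-attention scales quadratically with input length. Existing methods mainly operate in token space and overlook inefficiencies in the latent embedding space.

Result: On Textualized Tree, Amazon Reviews, and CommitPackFT, the method lies on the Pareto frontier of performance versus compression, reducing input length by up to 75% with minimal performance degradation.

Insight: The novelty is compressing in the latent embedding space rather than in token space: contiguous token embeddings are merged and processed by a LoRA-adapted LLM, while generation remains in the original vocabulary.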

Abstract: Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
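
The block-merging step described above can be sketched as follows. Mean pooling is used here purely as a stand-in for the paper's lightweight learned encoder, whose architecture the abstract does not specify; the function name and the toy embeddings are assumptions.

```python
from typing import List

Vector = List[float]

def merge_k_tokens(embeddings: List[Vector], k: int) -> List[Vector]:
    """Merge each contiguous block of k embeddings into a single vector.

    Mean pooling stands in for a learned encoder. A shorter final block
    is averaged over the vectors it actually contains.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    merged = []
    for start in range(0, len(embeddings), k):
        block = embeddings[start:start + k]
        dim = len(block[0])
        merged.append([sum(v[i] for v in block) / len(block) for i in range(dim)])
    return merged

# Eight toy 2-dimensional token embeddings.
tokens = [[float(i), float(2 * i)] for i in range(8)]
compressed = merge_k_tokens(tokens, k=4)
print(len(tokens), "->", len(compressed))   # 8 -> 2
```

With k = 4 the sequence shrinks to a quarter of its length, a reduction of 1 - 1/k = 75%; whether that is the setting behind the paper's "up to 75%" figure is not stated in the abstract.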

[26] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events cs.CL

Raunak Agarwal, Markus Wenzel, Simon Baur, Jonas Zimmer, George Harvey

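TL;DR: This paper introduces MADE, a continuously updated ("living") benchmark for multi-label text classification of medical device adverse events, designed to address the difficulty of assessing genuine reasoning ability when existing benchmarks may be contaminated. The benchmark features a long-tailed distribution of hierarchical labels and uses strict temporal splits for reproducibility. The study evaluates more than 20 encoder-only and decoder-only models under fine-tuning and few-shot settings and systematically compares entropy-/consistency-based and self-verbalized uncertainty quantification methods.

Details

Motivation: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification to support human oversight. Existing multi-label text classification benchmarks are susceptible to data contamination, making it hard to distinguish memorization from genuine reasoning.

Result: On MADE, smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive uncertainty quantification; generative fine-tuning delivers the most reliable uncertainty quantification; large reasoning models improve performance on rare labels but show weak uncertainty quantification; and self-verbalized confidence is not a reliable proxy for uncertainty.

Insight: The contributions are a continuously updated medical-domain multi-label classification benchmark that guards against data contamination and a systematic evaluation of the trade-offs among model architectures and fine-tuning strategies for uncertainty quantification. The work highlights the importance of evaluating genuine generalization under long-tailed distributions and temporally ordered data.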

Abstract: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
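
One common way to obtain an entropy-based uncertainty score for multi-label predictions is to average the binary entropy of the per-label probabilities. The sketch below shows that calculation; it is an assumption about what "entropy-based" could look like, since the abstract does not give the benchmark's exact formula.

```python
import math
from typing import List

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) label; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def mean_label_entropy(probs: List[float]) -> float:
    """Average per-label entropy; higher means a less certain prediction."""
    return sum(binary_entropy(p) for p in probs) / len(probs)

# Hypothetical per-label probabilities for two reports.
confident = [0.99, 0.01, 0.98]
uncertain = [0.55, 0.45, 0.60]
print(mean_label_entropy(confident) < mean_label_entropy(uncertain))   # True
```

A score like this could be used to rank predictions for human review, with the highest-entropy reports inspected first.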

[27] From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning cs.CL

Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal

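TL;DR: This paper proposes SpecGuard, a verification-aware speculative decoding framework that improves the efficiency and accuracy of large language models on multi-step reasoning. It performs step-level verification using model-internal signals: after sampling multiple draft candidates, it selects the most consistent step and validates it with attention-based grounding and token-level confidence signals, reducing error propagation and latency.

Details

Motivation: Conventional speculative decoding (SD) is token-centric, which allows erroneous steps to propagate. Existing remedies rely on external reward models, which add latency and computational overhead and limit generalizability. This paper aims for more efficient and general step-level verification using only model-internal signals, to counter error accumulation in multi-step reasoning.

Result: Across multiple reasoning benchmarks, SpecGuard improves accuracy by 3.6% while reducing latency by about 11%, outperforming both conventional SD and reward-guided SD.

Insight: The novelty is a step-level verification mechanism that relies only on model-internal signals (an attention-based grounding score and a token-level confidence score), allocates compute selectively, and avoids dependence on external models, improving reasoning accuracy while lowering latency.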

Abstract: Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
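
The accept-or-recompute decision described above can be sketched as a small gating function. The weighted average, the threshold, and the mapping from log-probability to a confidence value are all assumptions made for illustration; the abstract says only that the two signals "jointly determine" the outcome.

```python
import math
from dataclasses import dataclass

@dataclass
class StepSignals:
    grounding: float      # attention-based grounding score, assumed in [0, 1]
    mean_logprob: float   # mean token log-probability of the step, <= 0

def combined_score(s: StepSignals, weight: float = 0.5) -> float:
    """Blend the two signals; exp() maps the mean log-prob into (0, 1]."""
    confidence = math.exp(s.mean_logprob)
    return weight * s.grounding + (1.0 - weight) * confidence

def route_step(s: StepSignals, threshold: float = 0.7) -> str:
    """Accept the drafted step, or hand it to the target model to recompute."""
    if combined_score(s) >= threshold:
        return "accept_draft"
    return "recompute_with_target"

print(route_step(StepSignals(grounding=0.9, mean_logprob=-0.1)))   # accept_draft
print(route_step(StepSignals(grounding=0.3, mean_logprob=-2.0)))   # recompute_with_target
```

Only steps that fail the gate trigger the stronger model, which is how a scheme of this kind spends target-model compute selectively.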

cs.CV

[28] QualiaNet: An Experience-Before-Inference Network cs.CV | eess.IV | q-bio.NC

Paul Linton

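TL;DR: QualiaNet is a two-stage network architecture inspired by human 3D vision, modeling an Experience Module and an Inference Module. Disparity maps that simulate human stereo experience are passed to a convolutional neural network (CNN) that infers distance from disparity gradients.

Details

Motivation: The paper addresses a paradox in human stereo vision: stereo experience does not itself provide distance information, yet it affects our inferences about visual scale. The author uses a computational model to test whether humans may exploit a natural scene statistic (near scenes produce vivid disparity gradients, while far scenes appear comparatively flat) to infer 3D scene properties from experience.

Result: Experiments show that QualiaNet can recover distance from disparity gradients alone, validating the approach.

Insight: The novelty is bringing the two-stage structure of human vision (experience separated from inference) into a computational model and using disparity gradients as a natural-scene-statistic cue for distance estimation, offering a biologically inspired perspective on 3D scene understanding in computer vision.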

Abstract: Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
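
The cue named in the abstract, that near scenes produce steeper disparity gradients than far scenes, can be illustrated with a hand-computed statistic. The paper trains a CNN on disparity maps; the finite-difference measure below is only a simplified stand-in, and the toy maps are hypothetical.

```python
from typing import List

def mean_gradient_magnitude(disparity: List[List[float]]) -> float:
    """Average forward-difference gradient magnitude of a disparity map.
    Requires at least a 2 x 2 map."""
    rows, cols = len(disparity), len(disparity[0])
    total, count = 0.0, 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = disparity[r][c + 1] - disparity[r][c]
            dy = disparity[r + 1][c] - disparity[r][c]
            total += (dx * dx + dy * dy) ** 0.5
            count += 1
    return total / count

# Toy maps: a steep disparity ramp (near-like) and a flat map (far-like).
near_like = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
far_like = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
print(mean_gradient_magnitude(near_like), mean_gradient_magnitude(far_like))   # 1.0 0.0
```

A learned model can pick up far richer structure than this single number, but the example shows why gradients alone may carry distance information even when absolute disparity does not.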

Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang

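TL;DR: HY-World 2.0 is a multi-modal world model framework that reconstructs, generates, and simulates 3D worlds from diverse inputs such as text, single-view images, multi-view images, or videos. A four-stage method (panorama generation, trajectory planning, world expansion, and world composition) produces high-fidelity, navigable 3D Gaussian Splatting scenes, and a high-performance rendering platform, WorldLens, supports interactive exploration.

Details

Motivation: The paper advances its predecessor, HY-World 1.0, toward a more capable multi-modal 3D world model, addressing the challenge of generating high-quality, navigable 3D scenes from diverse inputs, especially text and single-view images.

Result: Extensive experiments show that HY-World 2.0 achieves state-of-the-art performance among open-source approaches on several benchmarks, with results comparable to the closed-source model Marble.

Insight: The main contributions are: 1) innovations such as HY-Pano 2.0 that improve panorama fidelity; 2) 3D scene understanding and planning through WorldNav; 3) an upgraded WorldStereo 2.0, a keyframe-based view generation model with consistent memory; 4) a refined architecture and learning strategy for WorldMirror 2.0, enabling world reconstruction from multi-view images or videos; 5) WorldLens, a flexible, engine-agnostic, high-performance 3DGS rendering platform with automatic lighting and efficient collision detection.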

Abstract: We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.

[30] Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV

Ahmed Bourouis, Savas Ozkan, Andrea Maracani, Yi-Zhe Song, Mete Ozay

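TL;DR: This paper proposes a method for generating geometrically consistent multi-view scenes from a single freehand sketch, removing the need in existing methods for photographs, text, or multi-view input. By building a dataset and introducing Parallel Camera-Aware Attention Adapters (CA3) and a Sparse Correspondence Supervision Loss (CSL), it achieves efficient multi-view generation without reference images or per-scene optimization.

Details

Motivation: Generating geometrically consistent multi-view scenes directly from a single freehand sketch, an input that carries very little geometric information, is difficult; existing methods cannot handle the spatial distortion and abstraction of sketches.

Result: Compared with state-of-the-art two-stage baselines, the method improves realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, with up to a 3.7× inference speedup.

Insight: Three components together enable end-to-end learning from distorted 2D input to 3D-consistent generation: a sketch-to-multiview paired dataset built with an automated pipeline, the CA3 module that injects geometric inductive biases into the video transformer, and a correspondence supervision loss derived from Structure-from-Motion reconstructions.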

Abstract: We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7× inference speedup.

[31] Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos cs.CV

Bryan Jhoan Cazáres Leyva, Ulises Gachuz Davila, José Juan González Fonseca, Juan Irving Vasquez, Vanessa A. Camacho-Vázquez

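TL;DR: This paper presents a hybrid, pose-driven approach for detecting non-violent robberies (snatch-and-run) in surveillance video, combining real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract keypoints for each tracked person, computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair, trains a Random Forest classifier for detection, and applies a temporal hysteresis filter to stabilize predictions.

Details

Motivation: Non-violent street robberies (snatch-and-run) are hard to detect automatically because they are brief, subtle, and difficult to distinguish from benign human interactions in unconstrained surveillance footage.

Result: The method is evaluated on a staged dataset and on a disjoint test set collected from internet videos, showing promising generalization across scenes and camera viewpoints. The complete pipeline runs on an NVIDIA Jetson Nano with reported real-time performance, supporting the feasibility of proactive, on-device robbery detection.

Insight: The contributions are a hybrid approach that pairs pose estimation with interpretable features (kinematic and interaction descriptors) and a temporal hysteresis filter that reduces false alarms, enabling real-time detection on edge devices. The method prioritizes practical deployment and interpretability over state-of-the-art performance from pure deep learning.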

Abstract: Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
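
The temporal hysteresis filter mentioned above can be sketched in its common two-threshold form: the alarm switches on only above a high threshold and off only below a lower one, so brief dips do not toggle it. The thresholds and function name are hypothetical; the abstract does not give the paper's filter parameters.

```python
from typing import List

def hysteresis_filter(
    probs: List[float], on_threshold: float = 0.7, off_threshold: float = 0.4
) -> List[bool]:
    """Turn frame-level robbery probabilities into a stable alarm signal."""
    alarm = False
    decisions = []
    for p in probs:
        if not alarm and p >= on_threshold:
            alarm = True          # strong evidence: raise the alarm
        elif alarm and p <= off_threshold:
            alarm = False         # evidence clearly gone: clear the alarm
        decisions.append(alarm)
    return decisions

# Frame-level classifier outputs (hypothetical values).
frame_probs = [0.2, 0.8, 0.6, 0.5, 0.3, 0.5]
print(hysteresis_filter(frame_probs))
# [False, True, True, True, False, False]
```

A single threshold at 0.7 would have dropped the alarm on the third frame; the gap between the two thresholds is what suppresses that flicker.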

[32] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning cs.CV | cs.AI

Xue Wu, Shengting Cao, Jiaqi Gong

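TL;DR: SatBLIP is a vision-language learning framework for satellite imagery that supports rural context understanding and feature identification in order to predict the county-level Social Vulnerability Index (SVI). It addresses the limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and VLMs trained on natural images) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. GPT-4o generates structured descriptions, a satellite-adapted BLIP model is fine-tuned to caption unseen images, and CLIP-encoded captions are fused with LLM-derived embeddings via attention for SVI estimation, while SHAP identifies salient attributes for interpretable mapping of risk environments.

Details

Motivation: Standard vulnerability indices are too coarse and offer little contextual insight for rural environmental risk assessment, and existing remote sensing pipelines (such as handcrafted features and vision-language models trained on natural images) have their own limitations.

Result: The abstract reports no specific quantitative results or benchmarks, but states that the method identifies salient attributes (such as roof form/condition, street width, and vegetation) that drive robust predictions, enabling interpretable mapping of rural risk environments.

Insight: The novelty lies in combining contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics, using GPT-4o to generate structured descriptions for fine-tuning a satellite-adapted BLIP model, and fusing multimodal embeddings via attention for SVI estimation, which improves the accuracy and interpretability of rural context understanding.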

Abstract: Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts the county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.

[33] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images cs.CV

Sabab Ishraq, Aarushi Aarushi, Juncai Jiang, Chen Chen

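TL;DR: This paper introduces FoodSense, a human-annotated multisensory food dataset with 66,842 participant-image pairs, for predicting taste, smell, texture, and sound from food images. The authors also present FoodSense-VL, a benchmark model that produces multisensory ratings and visually grounded explanations directly from food images.

Details

Motivation: Existing vision-language research on food focuses mainly on recognition tasks, while image-based prediction of multisensory experience remains underexplored. This paper fills that gap by connecting cognitive science research on cross-sensory perception with instruction tuning for modern multimodal models.

Result: The paper builds a dataset of 2,987 unique food images with human annotations and trains the FoodSense-VL benchmark model. The results indicate that many popular evaluation metrics are insufficient for visual sensory inference.

Insight: The paper introduces what is presented as the first large-scale, human-annotated dataset for multisensory food inference and uses a large language model to expand short annotations into image-grounded reasoning traces, so that the resulting vision-language model can both predict ratings and explain them. This offers a new way to connect cognitive science and multimodal AI.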

Abstract: Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visual sensory inference.

[34] Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers cs.CV | cs.LG

Felipe Parodi, Jordan Matelsky, Melanie Segado

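TL;DR: This paper examines zero-ablation (replacing activations with zero vectors) as a probe of register function in DINO vision transformers and finds that it overstates how much model performance depends on register content. With mean-substitution, noise-substitution, and cross-image register-shuffling as controls, the authors show that performance depends on plausible register-like activations rather than exact image-specific values, and that zeroing causes disproportionately large perturbations, which is consistent with it alone degrading task performance.

Details

Motivation: Zero-ablation is widely used to probe token function in vision transformers. Zeroing the registers in DINOv2+registers and DINOv3 causes large performance drops, which suggests that registers are functionally indispensable. That conclusion may be overstated because of the abnormal perturbation that zeroing itself introduces, so more rigorous controls are needed to test the true dependence on register content.

Result: Across classification, correspondence, and segmentation, the three replacement controls (mean-substitution, noise-substitution, and cross-image register-shuffling) stay within about 1 percentage point of the unmodified baseline, whereas zero-ablation causes drops of up to 36.6 percentage points in classification and 30.9 percentage points in segmentation. The results replicate at ViT-B scale.

Insight: By introducing several replacement controls, the paper shows that zero-ablation overstates dependence on exact register content, while performance actually depends on plausible register-like activation patterns. For future work on transformer internals and probing, the lesson is to avoid drawing functional-dependence conclusions from zero-ablation alone and to use more robust perturbations. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry.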

Abstract: Zero-ablation, replacing token activations with zero vectors, is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to -36.6 pp classification, -30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls (mean-substitution, noise-substitution, and cross-image register-shuffling) preserve performance across classification, correspondence, and segmentation, remaining within about 1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
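
The four interventions named in the abstract can be illustrated on toy data. The sketch below is not the authors' code: the register values, the "dataset mean" and standard deviation, and all function names are hypothetical, and the cosine similarity is computed on the register vectors themselves rather than on per-patch outputs as in the paper.

```python
import math
import random


def zero_ablate(registers):
    """Replace every register vector with zeros (the intervention under test)."""
    return [[0.0] * len(vec) for vec in registers]


def mean_substitute(registers, mean_vec):
    """Replace every register vector with a precomputed mean register vector."""
    return [list(mean_vec) for _ in registers]


def noise_substitute(registers, mean_vec, std_vec, rng):
    """Replace each register with Gaussian noise matched per dimension to (mean, std)."""
    return [[rng.gauss(m, s) for m, s in zip(mean_vec, std_vec)] for _ in registers]


def shuffle_across_images(batch_registers, rng):
    """Reassign register sets between images using a random permutation of the batch.

    A random permutation can leave some images with their own registers;
    a derangement would be a stricter control.
    """
    order = list(range(len(batch_registers)))
    rng.shuffle(order)
    return [batch_registers[i] for i in order]


def cosine_similarity(a, b):
    """Cosine similarity, defined here as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


rng = random.Random(0)
registers = [[2.0, 1.0, 0.5], [1.8, 1.2, 0.4]]  # toy values: 2 registers, 3 dims
mean_vec = [1.9, 1.1, 0.45]                     # hypothetical dataset-mean register
std_vec = [0.1, 0.1, 0.05]                      # hypothetical per-dimension std

zeroed = zero_ablate(registers)
meaned = mean_substitute(registers, mean_vec)
noised = noise_substitute(registers, mean_vec, std_vec, rng)
shuffled = shuffle_across_images([registers, meaned, noised], rng)

sim_zero = cosine_similarity(registers[0], zeroed[0])
sim_mean = cosine_similarity(registers[0], meaned[0])
print(sim_zero, sim_mean)
```

In this toy setting the mean-substituted register stays close in direction to the original while the zeroed one does not, which mirrors the abstract's point that zeroing is a much larger perturbation than the replacement controls.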

[35] Crowdsourcing of Real-world Image Annotation via Visual Properties cs.CV | cs.AIPDF

Xiaolei Diao, Fausto Giunchiglia

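TL;DR: This paper proposes an image annotation methodology based on visual properties that integrates knowledge representation, natural language processing, and computer vision to reduce annotator subjectivity. It introduces an interactive crowdsourcing framework that asks questions dynamically, based on a predefined object category hierarchy and annotator feedback, so that visual properties guide the annotation.

Details

Motivation: Object recognition datasets suffer from the semantic gap problem, which produces complex many-to-many mappings between visual data and linguistic descriptions and degrades performance on computer vision tasks.

Result: Experiments demonstrate the effectiveness of the methodology, and annotator feedback is discussed as a way to optimize the crowdsourcing setup.

Insight: Constraining annotation with visual properties reduces subjectivity, and the interactive crowdsourcing framework guides the annotation process dynamically, an improvement over traditional static annotation.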

Abstract: Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.

[36] H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection cs.CV | cs.LGPDF

Jianghong Huang, Luping Ji, Weiwei Duan, Mao Ye

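TL;DR: This paper proposes H2VLR, a heterogeneous hypergraph vision-language reasoning framework for few-shot anomaly detection (FSAD). By modeling visual regions and semantic concepts in a single hypergraph, it recasts FSAD as high-order inference over visual-semantic relations, addressing the limitation that existing vision-language model (VLM) methods rely only on pairwise feature matching and ignore structural dependencies and global consistency.

Details

Motivation: FSAD faces data scarcity in industrial inspection and medical imaging. Existing VLM-based methods rely mainly on pairwise feature matching and do not model structural dependencies or global consistency, which limits performance.

Result: On representative industrial and medical benchmarks, H2VLR often achieves state-of-the-art (SOTA) performance, and experimental comparisons verify its effectiveness and advantages.

Insight: The novelty lies in recasting FSAD as a high-order inference problem and introducing a heterogeneous hypergraph that jointly models visual regions and semantic concepts, capturing richer structural relations. Viewed objectively, hypergraph modeling offers a new approach to relational reasoning in vision-language tasks and may extend to other few-shot settings that need structure awareness.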

Abstract: As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequent issue. To address it, the few-shot anomaly detection (FSAD) scheme is attracting increasing attention. In recent years, beyond the traditional visual paradigm, Vision-Language Models (VLMs) have been extensively explored to advance this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further improve FSAD with VLMs, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem over visual-semantic relations by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR, which often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.

[37] Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs cs.CVPDF

Ziyang Luo, Nian Liu, Junwei Han

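TL;DR: This paper proposes Chain of Modality (CoM), an agentic framework that targets a performance paradox in omni-modal large language models (Omni-MLLMs): unimodal baselines frequently outperform joint multimodal inference. CoM dynamically orchestrates the topology of input modalities and separates cognitive execution into distinct pathways, moving from static fusion to dynamic orchestration and improving robustness and generalization.

Details

Motivation: Current Omni-MLLMs universally use static fusion topologies, which cause structural problems such as positional bias and alignment traps. These distort attention, so joint multimodal inference underperforms and can even fall below unimodal models. The paper aims to resolve this functional rigidity.

Result: In training-free or data-efficient supervised fine-tuning (SFT) settings, CoM achieves robust and consistent generalization across multiple benchmarks.

Insight: The main novelty is turning multimodal fusion from passive static concatenation into dynamic orchestration, switching adaptively among parallel, sequential, and interleaved input pathways to remove structural bias. Cognitive execution is also split into a "Direct-Decide" path for direct perception and a "Reason-Decide" path for analytical auditing, giving task-aligned reasoning.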

Abstract: Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

[38] FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking cs.CVPDF

Jinlin You, Muyu Li, Xudong Zhao

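TL;DR: This paper proposes FreqTrack, a frequency-learning framework for RGB-event fusion object tracking. It builds complementary inter-modal correlations through frequency-domain transformations, designs a Spectral Enhancement Transformer layer with multi-head dynamic Fourier filtering to adaptively enhance frequency-domain features, and uses learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, improving tracking in high-speed and low-light scenes.

Details

Motivation: Single-modal RGB trackers hit performance bottlenecks in complex dynamic scenes, and current RGB-event fusion methods are designed mainly in the spatial domain, so they do not fully exploit the distinctive temporal response and high-frequency characteristics of event data.

Result: Extensive experiments on the COESOT and FE108 datasets show that FreqTrack is highly competitive, notably reaching a leading precision of 76.6% on the COESOT benchmark.

Insight: The novelty lies in moving RGB-event fusion from the spatial domain to the frequency domain, using dynamic Fourier filtering for frequency-domain feature enhancement and learnable wavelet transforms for multi-scale edge extraction. This makes effective use of the high-frequency and temporal characteristics of event data and offers a frequency-domain perspective on multimodal tracking.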

Abstract: Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.
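
The basic operation behind Fourier filtering is to transform a signal, scale each frequency component, and transform back. The sketch below shows only that generic step on a 1D signal with fixed gains. It is not the SET layer: the abstract does not specify how the multi-head dynamic gains are produced, so `gains` and all function names here are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def fourier_filter(signal, gains):
    """Scale each frequency bin by a gain, then return the real part of the inverse.

    For a real input, the output is exactly real only when the gains are
    symmetric (gains[k] == gains[n - k]); otherwise the imaginary part is dropped.
    """
    spectrum = dft(signal)
    filtered = [g * c for g, c in zip(gains, spectrum)]
    return [value.real for value in idft(filtered)]


signal = [1.0, 2.0, 3.0, 4.0]
identity = fourier_filter(signal, [1.0, 1.0, 1.0, 1.0])   # all-pass: reconstructs the input
dc_only = fourier_filter(signal, [1.0, 0.0, 0.0, 0.0])    # keeps only the mean component
print(identity, dc_only)
```

In a learned layer the gains would be predicted from the features rather than fixed, and a fast FFT routine would replace the naive transform.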

[39] Controllable Video Object Insertion via Multiview Priors cs.CV | cs.AIPDF

Xia Qi, Peishan Cong, Yichen Yao, Ziyi Wang, Yaoqin Ye

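TL;DR: This paper proposes a new method for video object insertion that integrates multi-view priors to address inconsistent object appearance and occlusion handling in dynamic environments. It lifts 2D reference images into multi-view representations and uses a dual-path view-consistent conditioning mechanism to provide stable identity guidance and robust integration across viewpoints.

Details

Motivation: Existing video generation methods focus mainly on synthesizing entire scenes and struggle to ensure consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos.

Result: Experiments show that the method significantly improves the quality of video object insertion, providing stable and realistic integration.

Insight: The contributions are multi-view object priors with a dual-path view-consistent conditioning mechanism, which strengthen appearance consistency and robustness; a quality-aware weighting mechanism that adapts to noisy or imperfect inputs; and an Integration-Aware Consistency Module that ensures spatial realism, resolving occlusion and boundary artifacts while maintaining temporal continuity across frames.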

Abstract: Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.

[40] DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration cs.CVPDF

Zheng Chen, Bowen Chai, Rongjun Gao, Mingtao Nie, Xi Li

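TL;DR: DVFace is a one-step diffusion framework for video face restoration. A spatio-temporal dual-codebook design extracts complementary spatial and temporal priors, and an asymmetric spatio-temporal fusion module injects them into the diffusion backbone to efficiently produce high-quality, identity-consistent, temporally coherent results.

Details

Motivation: Existing diffusion-based video face restoration methods rely heavily on generic diffusion priors and multi-step sampling, which limits facial adaptation and inference efficiency. This motivates a one-step diffusion method that achieves faithful facial recovery with temporal stability.

Result: On multiple benchmarks, DVFace outperforms recent methods in restoration quality, temporal consistency, and identity preservation.

Insight: The novelty lies in the spatio-temporal dual-codebook design for extracting complementary priors and the asymmetric spatio-temporal fusion module that injects them according to their distinct roles, enabling efficient, high-quality video face restoration in a one-step diffusion framework.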

Abstract: Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: https://github.com/zhengchen1999/DVFace.

[41] Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors cs.CVPDF

Mingqian Ji, Shanshan Zhang, Jian Yang

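TL;DR: This paper proposes SEPatch3D, a framework that accelerates ViT-based sparse multi-view 3D object detectors by dynamically adjusting ViT patch sizes. It combines spatiotemporal-aware patch size selection, informative patch selection, and cross-granularity feature enhancement, substantially improving inference speed while maintaining detection accuracy.

Details

Motivation: ViT-based sparse multi-view 3D detectors have high inference latency, and conventional token compression methods (pruning, merging, and patch size enlargement) discard informative background cues, disrupt contextual consistency, or lose fine-grained semantics, hurting detection. The paper aims to overcome these limitations.

Result: On the nuScenes and Argoverse 2 validation sets, SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, with comparable detection accuracy.

Insight: The novelty lies in a dynamic, scene-adaptive patch size strategy rather than uniform global compression. Informative patch selection and cross-granularity feature enhancement retain fine-grained semantics while reducing computation, balancing efficiency and accuracy.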

Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
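
Patch size matters for latency because the number of ViT tokens falls with the square of the patch size. The sketch below shows that arithmetic together with a deliberately simple selection rule. The distance threshold, the two patch sizes, the image size, and the function names are all hypothetical; SPSS as described in the abstract is spatiotemporal-aware and is not reproduced here.

```python
def token_count(height, width, patch_size):
    """Number of non-overlapping patches (tokens) a ViT produces for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def select_patch_size(nearest_object_m, near_threshold_m=20.0, small=16, large=32):
    """Hypothetical stand-in rule: small patches when an object is nearby, else large."""
    return small if nearest_object_m < near_threshold_m else large


height, width = 256, 704  # illustrative input size, not taken from the paper
fine_tokens = token_count(height, width, select_patch_size(5.0))     # nearby object
coarse_tokens = token_count(height, width, select_patch_size(50.0))  # background scene
print(fine_tokens, coarse_tokens)
```

Doubling the patch size cuts the token count by a factor of four, which is why routing background-dominated scenes to large patches can save substantial computation.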

[42] Learning Adaptive Reasoning Paths for Efficient Visual Reasoning cs.CV | cs.CLPDF

Yixu Huang, Tinghui Zhu, Muhao Chen

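TL;DR: This paper proposes AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions (visual perception, logical reasoning, and answer application) and lets the model choose dynamically among three response formats: full reasoning, perception only, or direct answer. It targets "Reasoning Path Redundancy" in visual reasoning models, where the model produces unnecessarily long reasoning chains for every task.

Details

Motivation: Visual reasoning models commonly overthink, generating lengthy reasoning chains for all tasks even though many visual questions do not need the full reasoning process, which wastes computation.

Result: Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, with the largest effect on perception-intensive tasks.

Insight: The novelty lies in decomposing visual reasoning into dynamically selectable cognitive functions and training with FS-GRPO to encourage the most efficient reasoning format, which suggests a route to efficient, adaptive multimodal reasoning systems.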

Abstract: Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50–90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.

Haotian Wu, Yue Cheng, Shan Bian

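TL;DR: This paper proposes M3D-Net, a multi-modal 3D facial feature reconstruction network for deepfake detection. Its end-to-end dual-stream architecture reconstructs fine-grained facial geometry and reflectance from single-view RGB images in a self-supervised way, and a 3D Feature Pre-fusion Module and a Multi-modal Fusion Module integrate RGB and 3D-reconstructed features to improve detection.

Details

Motivation: Most existing deepfake detection methods rely on reconstructing isolated facial attributes and do not fully exploit the complementarity of multi-modal feature representations, while increasingly realistic facial forgeries pose serious threats to cybersecurity and information authenticity.

Result: Extensive experiments on multiple public datasets show that the method achieves state-of-the-art detection accuracy and robustness, significantly outperforming existing methods and generalizing strongly across diverse scenarios.

Insight: The novelty lies in obtaining fine-grained geometry and reflectance attributes through a self-supervised 3D facial reconstruction module and in designing attention-based feature pre-fusion and multi-modal fusion modules that exploit the complementary information in the RGB and 3D modalities, improving the discriminative power of deepfake detection.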

Abstract: With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.

[44] TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation cs.CV | cs.MM | cs.SDPDF

Xiangyu Liu, Feng Gao, Xiaomei Zhang, Yong Zhang, Xiaoming Wei

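TL;DR: This paper proposes TurboTalk, a two-stage progressive distillation framework that compresses a multi-step audio-driven video diffusion model into a single-step generator. Using Distribution Matching Distillation followed by adversarial distillation, it speeds up inference by 120 times while maintaining high-quality talking avatar generation.

Details

Motivation: Existing audio-driven video digital human generation models rely on multi-step denoising, whose computational cost makes real-world deployment difficult. One-step distillation can accelerate inference but often suffers from training instability.

Result: The method achieves single-step audio-driven video generation with a 120-times inference speedup while maintaining high generation quality. The abstract does not name specific benchmarks or SOTA comparisons.

Insight: The contributions are the two-stage progressive distillation framework, a progressive timestep sampling strategy, and a self-compare adversarial objective. By stabilizing training, they enable extreme step reduction and suggest a route to efficient real-time video generation.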

Abstract: Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate adversarial reference that stabilizes progressive distillation. Our method achieves single-step generation of video talking avatars, boosting inference speed by 120 times while maintaining high generation quality.

[45] MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models cs.CVPDF

Ruiqi Wang, Qi Yu, Jie Ma, Hanlin Wu

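TL;DR: MapSR is a prompt-driven land-cover map super-resolution framework. It extracts class prompts from frozen vision foundation model features with a lightweight linear probe, which removes the need for high-resolution annotations and lets it enhance low-resolution land-cover products into high-resolution maps. Predictions come from cosine-similarity matching followed by graph-based propagation for spatial refinement, greatly reducing trainable parameters and training time.

Details

Motivation: High-resolution land-cover mapping is often constrained by the high cost of dense annotation. Existing weakly supervised methods can use low-resolution labels but need to retrain dense predictors at substantial computational cost. The paper decouples supervision from model training to obtain an efficient super-resolution method that needs no high-resolution annotations.

Result: On the Chesapeake Bay dataset, MapSR reaches 59.64% mIoU without any high-resolution labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline, while reducing trainable parameters by four orders of magnitude and shortening training from hours to minutes.

Insight: The novelty lies in using low-resolution labels once as a source for prompt extraction, combined with a frozen foundation model and training-free inference, for efficient and scalable high-resolution mapping. Viewed objectively, the lightweight linear probe and graph propagation keep performance competitive while lowering compute cost.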

Abstract: High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at https://github.com/rikirikirikiriki/MapSR.
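
The prompt-estimation and matching steps described in the abstract can be sketched on toy feature vectors. This is a minimal illustration under stated assumptions, not the MapSR implementation: the confidence threshold, the class names, the feature values, and the function names are hypothetical, and the graph-based refinement step is omitted.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def estimate_class_prompts(features, probe_labels, probe_confidences, threshold=0.9):
    """Average the features the probe labels with high confidence, one prompt per class."""
    sums, counts = {}, {}
    for feat, label, conf in zip(features, probe_labels, probe_confidences):
        if conf < threshold:
            continue  # skip low-confidence probe predictions
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {
        label: normalize([s / counts[label] for s in sums[label]])
        for label in sums
    }


def classify_by_cosine(feature, prompts):
    """Assign the class whose prompt has the highest cosine similarity to the feature."""
    unit = normalize(feature)
    return max(prompts, key=lambda label: sum(a * b for a, b in zip(unit, prompts[label])))


# Toy frozen-backbone features with linear-probe outputs (all values hypothetical).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
probe_labels = ["water", "water", "forest", "forest"]
probe_confidences = [0.95, 0.92, 0.97, 0.40]

prompts = estimate_class_prompts(features, probe_labels, probe_confidences)
print(classify_by_cosine([0.8, 0.2], prompts), classify_by_cosine([0.1, 0.9], prompts))
```

Because the prompts are plain averages of frozen features, the only trained component in this picture is the probe that supplies labels and confidences, which is consistent with the small trainable-parameter count the abstract reports.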

[46] Towards Design Compositing cs.CVPDF

Abhinav Mahajan, Abhikhya Tripathy, Sudeeksha Reddy Pala, Vaibhav Methi, K J Joseph

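TL;DR: This paper proposes GIST, a training-free, identity-preserving image compositor that addresses the visual mismatch arising when multimodal components (such as images, text, and logos) from different sources are assembled into a graphic design. GIST plugs into existing components-to-design or design-refining pipelines and improves visual harmony and aesthetic quality.

Details

Motivation: Existing methods mostly focus on layout prediction or complementary element generation and assume the input components are already stylistically harmonious. In practice, components often come from different sources and mismatch visually, so identity-preserving stylization and compositing is needed for a truly harmonized design pipeline.

Result: Integrated with two existing methods, LaDeCo and Design-o-meter, and evaluated by LLaVA-OV and GPT-4V, GIST significantly improves visual harmony and aesthetic quality over naive pasting.

Insight: The novelty lies in a training-free, identity-preserving image compositor that sits between layout prediction and typography generation as an intermediate module. It can be embedded flexibly in existing design pipelines and addresses the core problem of style mismatch among multi-source components, improving the harmony of the composed design.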

Abstract: Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.

[47] Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models cs.CVPDF

Haoyi Sun, Xiaoxiao Wang, Ning Mao, Qian Wang, Lifu Mu

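TL;DR: This paper proposes Switch-KD, a visual-switch knowledge distillation framework that addresses inconsistent multimodal alignment when distilling vision-language models (VLMs). It switches the student's visual outputs into the teacher's language pathway to build cross-modal probabilistic references and adds a Dynamic Bi-directional Logits Difference loss for adaptive alignment. With it, a 0.5B TinyLLaVA distills multimodal knowledge from a 3B teacher and gains 3.6 points on average across 10 multimodal benchmarks.

Details

Motivation: VLMs are large and hard to deploy in resource-constrained settings. When existing knowledge distillation methods are applied to VLMs, they supervise each modality separately without explicitly handling multimodal alignment, so multimodal knowledge transfer is inconsistent.

Result: Across 10 multimodal benchmarks, the 0.5B TinyLLaVA trained with Switch-KD improves by 3.6 points on average, with no architectural modification.

Insight: The contributions are the visual-switch distillation mechanism, which unifies visual knowledge in the text-probability space for implicit transfer, and the Dynamic Bi-directional Logits Difference loss, which adaptively aligns informative probability regions while preserving the distributional structure of teacher and student. Together they align and transfer cross-modal knowledge efficiently.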

Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student’s visual outputs into the teacher’s language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.

Inseok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Minseok Kang

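TL;DR: This paper proposes cross-modality token modulation (CMTM) for unsupervised video object segmentation. It establishes dense connections between appearance and motion tokens, uses relation transformer blocks for efficient intra-modal and inter-modal information propagation, and adds a token masking strategy to improve learning efficiency, achieving state-of-the-art performance on multiple public benchmarks.

Details

Motivation: Unsupervised video object segmentation needs to integrate the complementary information in appearance and motion cues and to model their interdependencies effectively.

Result: The method achieves state-of-the-art performance on all public benchmarks, outperforming existing methods.

Insight: The novelty lies in the cross-modality token modulation mechanism, which realizes dense inter-modal interaction through relation transformer blocks, combined with a token masking strategy that balances model complexity and learning efficiency rather than simply scaling up the model.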

Abstract: Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.

[49] Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting cs.CV | cs.AI | cs.LGPDF

Inseok Jeon, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Suhwan Cho

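TL;DR: This paper proposes Seen-to-Scene, a framework for video outpainting, which extends video content beyond the original frame boundaries while preserving spatial fidelity and temporal coherence. It unifies propagation-based and generation-based paradigms, using flow-based propagation and reference-guided latent propagation to improve consistency and efficiency.

Details

Motivation: Existing video outpainting methods built on generative models (such as diffusion models) suffer from implicit temporal modeling and limited spatial context, which leads to intra-frame and inter-frame inconsistencies that are most pronounced in dynamic scenes and large outpainting regions.

Result: Extensive experiments show that the method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.

Insight: The novelty lies in unifying the propagation and generation paradigms: a flow completion network pre-trained for video inpainting is fine-tuned end to end to reconstruct coherent motion fields, and reference-guided latent propagation propagates source content across frames efficiently and reliably.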

Abstract: Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.

[50] Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding cs.CVPDF

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu

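TL;DR: This paper proposes Chain-of-Glimpse, a search-guided, progressive, object-grounded reasoning framework for video understanding. It explicitly anchors each reasoning step to specific visual evidence regions and supports compositional multi-step decision-making, addressing the challenge of objects that vary substantially over time.

Details

Motivation: Existing video understanding methods are mostly object-agnostic and struggle with substantial object variation over time, so an approach is needed that grounds reasoning explicitly in visual objects and proceeds progressively.

Result: Extensive evaluation on NExTQA, Video-Holmes, CG-Bench Reasoning, and VRBench shows consistent performance gains across diverse video reasoning tasks, along with robustness and good generalization.

Insight: The novelty lies in formulating video reasoning as a step-by-step process that builds spatially grounded traces, and in a search-guided controller optimized with reinforcement learning. The controller uses a format reward that strongly incentivizes grounding, yielding interpretable multi-step decision traces and reducing over-reliance on saliency cues.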

Abstract: Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

[51] The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment cs.CVPDF

Songlin Li, Zhiqing Guo, Dan Ma, Changtao Miao, Gaobo Yang

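TL;DR: This paper proposes a courtroom-style framework for image manipulation localization that treats the task as a confrontation of evidence followed by judgment. It comprises a prosecution stream, a defense stream, and a judge model: the prosecution stream asserts manipulation and the defense stream asserts authenticity, with evidence produced through edge-prior-guided cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. The reinforcement learning judge model then performs strategic re-inference and refinement on uncertain regions and outputs the manipulated-region mask.

Details

Motivation: Existing image manipulation localization methods typically use authenticity supervision only as an auxiliary training signal to increase sensitivity to manipulation artifacts, rather than modeling it explicitly as localization evidence that opposes the manipulated regions. As a result, predictions in ambiguous areas are unreliable when traces are subtle or degraded by post-processing or noise.

Result: Experiments show that the model achieves better average performance on image manipulation localization than existing SOTA methods.

Insight: The novelty lies in formalizing manipulation localization as courtroom-style evidence confrontation and judgment: a dual-hypothesis segmentation architecture explicitly models the opposition between manipulation and authenticity evidence, and a reinforcement learning judge model strategically refines uncertain regions. Viewed objectively, combining edge priors, multi-level fusion, and reinforcement learning judgment gives an interpretable adversarial reasoning framework for localizing weak manipulation traces.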

Abstract: Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model’s sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that treats the IML task as a confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.
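
The abstract mentions a soft-IoU objective and entropy-based reliability without giving formulas. The sketch below uses the common definitions of both, soft IoU as intersection over union on probabilities and binary entropy per pixel, as an assumption about what is meant. The entropy threshold and the function names are hypothetical.

```python
import math


def soft_iou_loss(pred, target):
    """1 - soft IoU, where pred holds per-pixel probabilities and target holds 0/1 labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    if union == 0:
        return 0.0  # both masks empty: treat as a perfect match
    return 1.0 - inter / union


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) prediction, with 0 * log(0) taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def uncertain_pixels(pred, threshold=0.5):
    """Indices whose predictive entropy exceeds a (hypothetical) threshold in nats."""
    return [i for i, p in enumerate(pred) if binary_entropy(p) > threshold]


perfect = soft_iou_loss([1.0, 0.0, 1.0], [1, 0, 1])
hedged = soft_iou_loss([0.5, 0.5], [1, 0])
flagged = uncertain_pixels([0.02, 0.5, 0.97, 0.6])
print(perfect, hedged, flagged)
```

Pixels flagged this way are the kind of uncertain regions a second-stage model could revisit, although how the judge model actually selects and refines them is not specified in the abstract.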

[52] G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval cs.CV

Jiyoung Lim, Heejae Yang, Jee-Hyong Lee

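TL;DR: This paper proposes G-MIXER, a new training-free method for zero-shot composed image retrieval. It uses geodesic mixup to build composed query features that reflect the implicit semantics of the reference image-text pair and to generate a diverse candidate set, then re-ranks the candidates with explicit semantics provided by a multimodal large language model, improving both the diversity and the accuracy of retrieval.

Details

Motivation: Existing training-free zero-shot composed image retrieval methods rely too heavily on the textual modality and fail to capture the inherently fuzzy nature of retrieval, which reduces the diversity and accuracy of the results.

Result: Achieves state-of-the-art performance on multiple zero-shot composed image retrieval benchmarks.

Insight: The novelty is the combination of implicit semantic expansion via geodesic mixup with explicit semantic re-ranking via an MLLM. The two steps jointly handle the implicit and explicit semantics of a composed query and improve retrieval without any additional training.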

Abstract: Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.
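
Geodesic mixup is usually understood as interpolation along the great circle between two unit-normalized feature vectors (spherical linear interpolation). The sketch below illustrates that idea under stated assumptions: the feature vectors, the ratio grid, and the function names are hypothetical, and the abstract does not specify how G-MIXER chooses its ratios.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def geodesic_mixup(a, b, lam):
    """Interpolate on the unit sphere; lam=0 returns a, lam=1 returns b."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)
    s = math.sin(theta)
    if s < 1e-8:  # parallel or antiparallel vectors: the geodesic is degenerate
        return a
    wa = math.sin((1 - lam) * theta) / s
    wb = math.sin(lam * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def composed_queries(image_feat, text_feat, ratios=(0.2, 0.4, 0.6, 0.8)):
    """One composed query per mixup ratio (the ratio grid is an assumption)."""
    return [geodesic_mixup(image_feat, text_feat, r) for r in ratios]
```

Each composed query can then be scored against the gallery; taking the union of the top results across ratios is one plausible way to obtain a diverse candidate set before re-ranking.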

[53] HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet cs.CV | cs.LG | eess.IV

Badri N. Patro, Vijay S. Agneeswaran

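TL;DR: HAMSA is a scanning-free vision state space model that operates directly in the spectral domain, avoiding the complex scanning strategies of conventional vision SSMs. It introduces a simplified kernel parameterization, an input-dependent frequency gating mechanism, and a magnitude-based gating unit, and uses FFT-based convolution to achieve O(L log L) complexity. On ImageNet-1K it reaches 85.7% top-1 accuracy, with inference 2.2x faster than transformers and 1.4-1.9x faster than scanning-based SSMs, at lower memory and energy cost.

Details

Motivation: Existing vision state space models (such as Vim, VMamba, and SiMBA) rely on complex scanning strategies to adapt sequential SSMs to 2D images, which adds computational overhead and architectural complexity. HAMSA aims to remove the need for scanning by operating directly in the spectral domain, simplifying the model and improving efficiency.

Result: On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy, state of the art among SSMs. Inference is 2.2x faster than DeiT-S (4.2 ms vs 9.2 ms) and 1.4-1.9x faster than scanning-based SSMs, with lower memory use (2.1 GB vs 3.2-4.5 GB) and energy consumption (12.5 J vs 18-25 J). It also generalizes well to transfer learning and dense prediction tasks.

Insight: The contributions are: 1) a simplified kernel parameterization that replaces the traditional (A, B, C) matrices with a single Gaussian-initialized complex kernel, eliminating discretization instabilities; 2) SpectralPulseNet (SPN), an input-dependent frequency gating mechanism for adaptive spectral modulation; 3) the Spectral Adaptive Gating Unit (SAGU), a magnitude-based gate that keeps gradient flow stable in the frequency domain. Together these let the model dispense with scanning and compute efficiently through FFT convolution.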

Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2x faster inference than transformers (4.2 ms vs 9.2 ms for DeiT-S) and a 1.4-1.9x speedup over scanning-based SSMs, while using less memory (2.1 GB vs 3.2-4.5 GB) and energy (12.5 J vs 18-25 J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
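
The O(L log L) figure comes from replacing a length-L sequential recurrence with a convolution computed through the FFT. The sketch below shows only that generic building block in pure Python (a radix-2 FFT and a zero-padded linear convolution); it does not reproduce HAMSA's kernel parameterization, SPN, or SAGU, and the function names are illustrative.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/n scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + tw, even[k] - tw
    return out

def fft_conv(signal, kernel):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    m = len(signal) + len(kernel) - 1
    n = 1
    while n < m:  # pad to the next power of two
        n *= 2
    fs = fft([complex(v) for v in signal] + [0j] * (n - len(signal)))
    fk = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    prod = [a * b for a, b in zip(fs, fk)]  # pointwise product = convolution
    return [v.real / n for v in fft(prod, inverse=True)][:m]
```

In a spectral model, a learned gate would multiply `prod` elementwise before the inverse transform; that is where an input-dependent frequency gate would plug in.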

[54] Efficient closed-form approaches for pose estimation using Sylvester forms cs.CV | cs.RO

Jana Vráblíková, Ezio Malis, Laurent Busé

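TL;DR: This paper proposes closed-form solvers based on Sylvester forms for the non-linear least-squares problem in pose estimation. Exploiting Sylvester forms lowers the computational complexity, so the solvers match the accuracy of state-of-the-art methods while substantially reducing computation time. The approach applies to pose estimation from both 3D-to-3D correspondences and 3D-point-to-2D-point correspondences.

Details

Motivation: The non-linear least-squares problem in pose estimation (rotation and translation) is fundamental yet time-consuming in real-time computer vision applications. Existing closed-form solvers based on resultant matrices reduce computation time but leave room for improvement. This paper aims to reduce the solving complexity further by using Sylvester forms.

Result: The proposed methods are numerically as accurate as SOTA solvers and faster in computation time, on both the 3D-3D and the 3D-2D pose estimation problems.

Insight: The novelty is a new class of resultant-based solvers built on Sylvester forms, which reduces the complexity of solving the polynomial system. This gives a more efficient closed-form solution for real-time pose estimation and may transfer to other vision tasks that need fast polynomial system solving.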

Abstract: Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental step in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied to pose estimation in two different types of problems: estimating a pose from 3D-to-3D correspondences, and estimating a pose from 3D-point to 2D-point correspondences.

[55] OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism cs.CV

Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, Clinton Fookes

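TL;DR: This paper proposes OmniGCD, a modality-agnostic method for Generalized Category Discovery (GCD) inspired by abstract category formation in the human brain. It processes inputs with modality-specific encoders, builds a GCD latent space through dimension reduction, and at test time uses a Transformer trained on synthetic data to transform that space into a representation better suited for clustering. The paper also introduces a zero-shot GCD setting that forbids dataset-specific fine-tuning, and validates the method on 16 datasets spanning four modalities: vision, text, audio, and remote sensing.

Details

Motivation: Existing GCD methods are usually confined to a single modality and require dataset-specific fine-tuning, so they cannot perform general category discovery across modalities. The paper addresses this with a modality-agnostic GCD method that mimics the brain's capacity for abstraction, to support more flexible and scalable category discovery.

Result: OmniGCD was evaluated on 16 cross-modal datasets in the zero-shot GCD setting. It improves classification accuracy on both known and novel classes over baselines, with average gains of +6.2 (vision), +17.9 (text), +1.5 (audio), and +12.7 (remote sensing) percentage points.

Insight: The main contributions are: 1) a modality-agnostic GCD framework that decouples representation learning from category discovery; 2) a zero-shot GCD evaluation setting; 3) a model trained once on synthetic data that generalizes zero-shot across modalities. This provides a benchmark and a direction for developing encoders independently of the GCD task and for future modality-agnostic GCD research.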

Abstract: Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain’s abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of +6.2, +17.9, +1.5 and +12.7 for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at https://github.com/Jordan-HS/OmniGCD.

[56] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning cs.CV | cs.CL

Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng

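TL;DR: This paper proposes Asymmetric Information Masking (AIM) to address catastrophic forgetting in vision-language models during continual visual question answering. The method balances stability and plasticity by applying targeted masks based on modality-specific sensitivity.

Details

Motivation: Existing continual learning methods are mostly designed for symmetric, unimodal architectures, whereas the trainable components of modern vision-language models are inherently asymmetric. This structural mismatch makes the models highly prone to catastrophic forgetting when learning from continuous data streams. The visual projection layers are especially vulnerable to interference, which damages compositional reasoning.

Result: In continual VQA experiments on VQA v2 and GQA, AIM reaches state-of-the-art results in both Average Performance and Average Forgetting, and better preserves generalization to novel skill-concept compositions.

Insight: The novelty is identifying the specific vulnerability caused by the architectural asymmetry of vision-language models and proposing a targeted masking mechanism based on modality sensitivity to balance stability and plasticity. This offers a new approach to continual learning for heterogeneous multimodal models.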

Abstract: In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
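
The abstract reports Average Performance (AP) and Average Forgetting (AF) without defining them. The sketch below uses the definitions that are common in the continual learning literature (final mean accuracy, and mean drop from each task's best earlier accuracy); whether the paper uses exactly these definitions is an assumption, and the accuracy values in the usage note are made up for illustration.

```python
def average_performance(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[l][j] is the accuracy on task j measured after training through
    task l, so row l has l + 1 entries (a lower-triangular table).
    """
    final = acc[-1]
    return sum(final) / len(final)

def average_forgetting(acc):
    """Mean drop from each earlier task's best accuracy to its final accuracy.

    Requires at least two tasks; the last task cannot be forgotten yet.
    """
    num_tasks = len(acc)
    drops = []
    for j in range(num_tasks - 1):
        best = max(acc[l][j] for l in range(j, num_tasks - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)
```

For a hypothetical three-task run with `acc = [[80.0], [70.0, 75.0], [65.0, 72.0, 78.0]]`, task 0 drops from 80 to 65 and task 1 from 75 to 72, so a method that protects the vulnerable layers should show a smaller AF at a similar AP.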

[57] One-shot Compositional 3D Head Avatars with Deformable Hair cs.CV

Yuan Sun, Xuan Wang, WeiLi Zhang, Wenxuan Zhang, Yu Guo

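TL;DR: This paper proposes a compositional method for building a complete 3D head avatar from a single image. It explicitly decouples hair from the facial region: the face is rigged to a FLAME mesh and deforms with it, while hair dynamics are controlled by a cage structure simulated with Position-Based Dynamics (PBD). This yields more realistic hair motion during animation while preserving facial detail.

Details

Motivation: Existing one-shot holistic methods decouple hair from the face inadequately, which leads to entangled geometry and unnatural deformations during animation and prevents realistic hair dynamics. This paper aims to separate hair from the face effectively and model each independently.

Result: In dynamic animations under diverse head motions, gravity effects, and expressions, the method shows substantially more realistic hair behavior with faithfully preserved facial details, and outperforms state-of-the-art one-shot methods in perceptual realism.

Insight: The contributions include: cleanly isolating the hair Gaussians through hair removal and semantic label supervision combined with a boundary-aware reassignment strategy; controlling physically plausible deformation of the hair Gaussian primitives with a cage-based PBD simulation; and using image-to-3D lifting to preserve as much of the input image's high-frequency texture as possible, mitigating the information loss common in generalized models.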

Abstract: We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects. Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.
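
Position-Based Dynamics advances a simulation by predicting positions, projecting them onto constraints, and then deriving velocities from the corrected positions. The sketch below shows one such step with simple distance constraints between nodes. It is only an illustration of generic PBD: treating cage vertices as point masses joined by distance constraints, the time step, and the iteration count are assumptions, not details from the paper.

```python
import math

def pbd_step(pos, vel, inv_mass, edges, rest_len,
             dt=1 / 60, gravity=(0.0, -9.81, 0.0), iters=10):
    """One PBD step. inv_mass[i] == 0 pins node i (e.g. attached to the head)."""
    # 1) Predict positions from current velocities plus gravity.
    pred = []
    for p, v, w in zip(pos, vel, inv_mass):
        if w == 0.0:
            pred.append(list(p))
        else:
            pred.append([p[k] + dt * (v[k] + dt * gravity[k]) for k in range(3)])
    # 2) Iteratively project each distance constraint |x_i - x_j| = rest length.
    for _ in range(iters):
        for (i, j), d0 in zip(edges, rest_len):
            delta = [pred[i][k] - pred[j][k] for k in range(3)]
            dist = math.sqrt(sum(c * c for c in delta))
            wsum = inv_mass[i] + inv_mass[j]
            if dist < 1e-12 or wsum == 0.0:
                continue
            corr = (dist - d0) / (dist * wsum)
            for k in range(3):
                pred[i][k] -= inv_mass[i] * corr * delta[k]
                pred[j][k] += inv_mass[j] * corr * delta[k]
    # 3) Velocities follow from the net position change.
    new_vel = [[(q[k] - p[k]) / dt for k in range(3)] for p, q in zip(pos, pred)]
    return pred, new_vel
```

Moving the pinned nodes with the head between steps would drag the free nodes along, which is the kind of inertia and gravity response a cage could then transfer to the Gaussians it encloses.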

[58] NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results cs.CV | cs.HC | cs.MM

Andrey Moskalenko, Alexey Bryncev, Ivan Kosmynin, Kira Shilovskaya, Mikhail Erofeev

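TL;DR: This paper gives an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. It introduces a new dataset of 2,000 diverse videos prepared for the challenge, with fixations and saliency maps collected from over 5,000 assessors through crowdsourced mouse tracking. More than 20 teams submitted solutions, and 7 passed the final code-review phase.

Details

Motivation: The challenge aims to advance video saliency prediction by encouraging participants to develop automatic saliency map prediction methods for the provided video sequences.

Result: Evaluation used generally accepted quality metrics on a subset of 800 test videos, and 7 teams passed the final review. The abstract does not report specific quantitative results (for example, whether SOTA was reached).

Insight: The contribution is the creation and public release of a large, diverse, openly licensed video saliency dataset with viewing data collected through crowdsourcing, which provides a valuable benchmark resource for video saliency prediction research.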

Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. The novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available - https://github.com/msu-video-group/NTIRE26_Saliency_Prediction.

[59] Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer’s Disease Neurodegeneration cs.CV

Geonwoo Baek, David H. Salat, Ikbeom Jang

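TL;DR: This paper proposes MSSM+, an improved multiscale structural mapping method, combined with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT), to detect Alzheimer's disease (AD) neurodegeneration from a single T1-weighted MRI scan. The method incorporates vertex-level sulcal depth and cortical curvature, partitions the cortical surface into supervertices, and uses a Transformer architecture for anatomically informed learning, improving both the detection of group differences between AD and cognitively normal (CN) subjects and classification performance.

Details

Motivation: Confirming Alzheimer's disease usually depends on costly and invasive PET or cerebrospinal fluid analysis, while structural MRI biomarkers such as cortical thickness are widely used for non-invasive screening. Multiscale structural mapping (MSSM) integrates gray-white matter contrast with cortical thickness, but there is room to capture AD-related structural changes more effectively.

Result: In AD vs. CN classification, MSSM+ raises the area under the precision-recall curve by 3 percentage points over MSSM. Relative to cortical thickness, gray-white matter contrast, and MSSM, MSSM+ shows lower signal variability and more consistent classification gains across MRI manufacturers, indicating its potential as an MRI-based imaging marker for AD detection.

Insight: The contributions include: extending the multiscale features at the vertex level (sulcal depth and cortical curvature); proposing surface supervertex mapping to represent inter- and intra-regional spatial relationships effectively; and designing SV-ViT to perform anatomically informed learning from surface mesh representations. This offers a new structure-aware framework for Transformer-based medical image analysis.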

Abstract: Alzheimer’s disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

[60] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems cs.CV | cs.AI

Haileab Yagersew

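TL;DR: The paper presents Paza, a zero-shot retail theft detection framework that orchestrates several existing models (object detection, pose estimation, and a vision-language model) in a layered pipeline and achieves practical concealment detection without training any model. A multi-signal suspicion pre-filter sharply reduces how often the expensive VLM is invoked, which lowers cost considerably, and the VLM component can be swapped in a model-agnostic way.

Details

Motivation: Existing AI-based retail theft detection systems require expensive custom model training on proprietary datasets and cost $200-500 per store per month. The paper aims to provide a training-free, cost-effective zero-shot alternative.

Result: The VLM component was evaluated on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), reaching 89.5% precision and 92.8% specificity zero-shot at 59.3% recall. The authors attribute the recall gap to sparse frame sampling in offline evaluation rather than to VLM reasoning failures. The cost model puts the system at $50-100 per store per month, 3-10x cheaper than commercial alternatives.

Insight: The contributions include: 1) a layered, orchestrated model pipeline in which cheap models run continuously and the expensive VLM is triggered on demand, optimizing cost-effectiveness; 2) a multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) that cuts VLM invocations by 240x; 3) a model-agnostic VLM architecture that allows switching between VLMs (such as Gemma 4 and GPT-4o) so the system improves as the VLM ecosystem evolves; 4) a privacy-preserving design that obfuscates faces in the detection pipeline.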

Abstract: Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline - cheap object detection and pose estimation running continuously, with an expensive vision-language model (VLM) invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to <=10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes - ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot - where the recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, as precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
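
The pre-filter described in the abstract combines two conditions (dwell time plus at least one behavioral signal) with a cap of at most 10 VLM calls per minute. The sketch below is a hypothetical illustration of that gating logic only; the class name, the 5-second dwell threshold, and the signal names are assumptions and are not taken from the Paza source code.

```python
from collections import deque

class SuspicionPrefilter:
    """Decide whether a tracked person warrants an expensive VLM call."""

    def __init__(self, min_dwell_s=5.0, max_calls_per_min=10):
        self.min_dwell_s = min_dwell_s        # assumed threshold
        self.max_calls = max_calls_per_min    # cap stated in the abstract
        self.calls = deque()                  # timestamps of recent VLM calls

    def should_call_vlm(self, now_s, dwell_s, signals):
        """signals: dict of boolean cues from the cheap models (names are made up)."""
        # Drop call timestamps that fell out of the 60-second window.
        while self.calls and now_s - self.calls[0] >= 60.0:
            self.calls.popleft()
        suspicious = dwell_s >= self.min_dwell_s and any(signals.values())
        if suspicious and len(self.calls) < self.max_calls:
            self.calls.append(now_s)
            return True
        return False
```

Because the detector and pose models run on every frame while this gate decides when the VLM runs, the VLM cost scales with the number of suspicious events rather than with the frame rate.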

[61] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry cs.CV | cs.AI

Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu

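TL;DR: This paper presents MetaDent, a comprehensive resource for vision-language models (VLMs) in dentistry comprising a large-scale dental image dataset, a semi-structured annotation framework, and comprehensive benchmark suites. The work targets the underuse of VLMs in dentistry that results from the lack of fine-grained annotated data and benchmarks.

Details

Motivation: In dentistry, and intraoral photography in particular, the application of vision-language models remains underexplored because fine-grained annotated datasets and comprehensive benchmarks are lacking.

Result: SOTA VLMs were evaluated on VQA, classification, and image captioning benchmarks built from MetaDent. The quantitative results show that even the most advanced models struggle with fine-grained understanding of intraoral scenes, reaching only moderate accuracy and producing inconsistent or incomplete captions.

Insight: The novelty is a meta-labeling scheme that combines a high-level image summary with point-by-point free-text descriptions to capture the hierarchical and clinically nuanced nature of dental photography, together with the use of LLMs to derive standardized VQA and classification benchmarks reliably from those labels, giving a resource for task-agnostic, scalable representation learning. The dataset and tools are publicly released to support reproducible research.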

Abstract: Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

[62] FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection cs.CV

Jianchao Huang, Fengming Zhang, Haibo Zhu, Tao Yan

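TL;DR: This paper proposes FSDETR, a frequency-spatial feature enhancement framework built on RT-DETR that targets feature degradation from downsampling, dense occlusion, and background interference in small object detection. A Spatial Hierarchical Attention Block (SHAB) captures local details and global dependencies, Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) mitigates occlusion, and a Frequency-Spatial Feature Pyramid Network (FSFPN) fuses frequency filtering with spatial edge extraction to preserve fine-grained details.

Details

Motivation: Small object detection faces three core challenges: feature degradation caused by downsampling, mutual occlusion in dense scenes, and interference from complex backgrounds.

Result: FSDETR reaches 13.9% APS on VisDrone 2019 and 48.95% AP50 tiny on TinyPerson with only 14.7M parameters, a strong showing on small-object benchmarks.

Insight: The novelty is a collaborative modeling mechanism across the frequency and spatial domains. Through the SHAB, DA-AIFI, and FSFPN (with CFSB) modules, it exploits complementary structural information to strengthen the semantic representation of small objects and preserve their details, offering a new feature enhancement approach for small object detection.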

Abstract: Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% APS on VisDrone 2019 and 48.95% AP50 tiny on TinyPerson, showing strong performance on small-object benchmarks. The code and models are available at https://github.com/YT3DVision/FSDETR.

[63] RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models cs.CV | cs.AI | cs.CL | cs.MM

Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi, Marcella Cornia

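TL;DR: RaTA-Tool is a retrieval-based framework for open-world multimodal tool selection. It addresses the limits of existing tool learning methods, which are confined to text inputs and closed-world settings and cannot handle multimodal instructions or unseen tools. It converts a multimodal query into a structured task description and retrieves the most suitable tool by matching against semantically rich, machine-readable tool descriptions, so it extends to new tools without retraining.

Details

Motivation: Existing tool learning methods based on foundation models are largely limited to text-only inputs and closed-world settings. They struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training.

Result: Extensive experiments show that the method significantly improves tool-selection performance, particularly in open-world multimodal scenarios.

Insight: The core idea is to recast tool selection as a retrieval task, achieving open-world generalization through semantic matching between structured task descriptions and standardized tool descriptions, and to add DPO-based preference optimization that improves the alignment between task descriptions and tool selection. The paper also contributes the first dataset for open-world multimodal tool use.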

Abstract: Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources – such as APIs, computational utilities, and specialized models – to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
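
Framing tool selection as retrieval means a new tool only needs a description, not a retrained classifier head. The sketch below illustrates that property with a deliberately simple stand-in: a bag-of-words vector and cosine similarity replace the learned embeddings a real system would use, and the tool names and descriptions are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tool(task_description, tool_descriptions):
    """Return the name of the tool whose description best matches the task."""
    q = embed(task_description)
    return max(tool_descriptions,
               key=lambda name: cosine(q, embed(tool_descriptions[name])))
```

In use, the structured task description produced by the MLLM is the query; registering an extra entry in `tool_descriptions` makes that tool retrievable immediately, which is the extensibility property the abstract highlights.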

[64] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation cs.CV

Hassan Ali, Doreen Jirak, Luca Müller, Stefan Wermter

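TL;DR: This paper proposes a prompt-based image-to-video generation approach for building a realistic deictic gesture dataset, addressing data scarcity in gesture recognition. It generates synthetic gesture data from a small number of human reference samples and validates the data in terms of visual fidelity and downstream task performance.

Details

Motivation: Gesture recognition research suffers from data scarcity. Conventional approaches depend on costly human recordings or on image processing that cannot produce authentic variability in the gestures. The paper investigates whether generative video models can complement traditional human-generated gesture data and enrich datasets at low cost.

Result: The synthetic gestures align closely with real ones in visual fidelity and introduce meaningful variability that enriches the original data. Various deep models trained on a mixed dataset (synthetic plus real) perform better, supporting the usefulness of synthetic data for downstream tasks.

Insight: The novelty is a prompt-based image-to-video generation pipeline that synthesizes realistic deictic gestures zero-shot, giving gesture recognition a low-cost, scalable form of data augmentation. Viewed objectively, the work shows the potential of generative AI to ease data scarcity, improving model performance even at this early stage of the technology.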

Abstract: Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.

[65] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards cs.CV | cs.AI

Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao

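TL;DR: This paper proposes UniDoc-RL, a unified reinforcement learning framework that strengthens visual retrieval-augmented generation (RAG) for large vision-language models (LVLMs). It models visual information acquisition as a sequential decision-making problem with a hierarchical action space, progressively refining from coarse-grained document retrieval to fine-grained image selection and active region cropping in order to focus on the key visual evidence.

Details

Motivation: Existing visual RAG systems typically rely on generic retrieval signals and overlook the fine-grained visual semantics that complex reasoning requires, so retrieved content can be imprecise or include irrelevant information. This paper aims to address that limitation.

Result: Experiments on three benchmarks show that UniDoc-RL consistently surpasses state-of-the-art baselines, with gains of up to 17.7% over prior RL-based methods.

Insight: The main contributions are: 1) modeling visual information acquisition as a single sequential decision process over a hierarchical action space; 2) a dense multi-reward scheme that gives each action task-aware supervision; 3) end-to-end training based on GRPO, with no separate value network; 4) a curated dataset of high-quality reasoning trajectories with fine-grained action annotations to support training.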

Abstract: Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
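
GRPO avoids a value network by scoring each sampled trajectory relative to the other trajectories drawn for the same prompt. The sketch below shows that group-relative normalization together with a weighted sum of several reward terms. The reward component names and weights are hypothetical; the abstract does not list UniDoc-RL's actual reward terms.

```python
import math

def total_reward(components, weights):
    """Combine several reward terms into one scalar (names are illustrative)."""
    return sum(weights[k] * components[k] for k in weights)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same query."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical usage: four rollouts for one question, each scored on
# retrieval quality, crop quality, and final answer correctness.
weights = {"retrieval": 0.3, "crop": 0.2, "answer": 0.5}
rollouts = [
    {"retrieval": 1.0, "crop": 1.0, "answer": 1.0},
    {"retrieval": 1.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 1.0, "crop": 1.0, "answer": 0.0},
]
advantages = group_relative_advantages([total_reward(r, weights) for r in rollouts])
```

Rollouts that beat their group's mean receive positive advantages and the rest negative ones, so the group itself plays the role that a learned baseline would otherwise fill.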

[66] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation cs.CV

Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang

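TL;DR: This paper proposes Flow of Truth, a proactive temporal forensics framework designed for image-to-video (I2V) generated content. It redefines video generation as the motion of pixels through time rather than the synthesis of frames, and introduces a learnable forensic template and a template-guided flow module that decouples motion from image content, enabling robust tracing of how pixels flow and transform in a video.

Details

Motivation: The rise of I2V generation makes it possible to create realistic videos from a single image, which also creates new forensic demands. Because video content evolves over time, embedded traces drift and deform, and traditional 2D pixel-level tampering localization becomes ineffective. A temporal forensics method is needed that can trace how pixels flow and transform throughout the video.

Result: Experiments show that Flow of Truth generalizes well across commercial and open-source I2V models and substantially improves temporal forensics performance.

Insight: The core idea is to reconceive video generation as "the motion of pixels through time" rather than "the synthesis of frames". On that basis the authors design a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from content, enabling robust tracing of dynamic changes in generated video. This offers a new approach to proactive forensics for dynamically generated content.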

Abstract: The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow of Truth, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as the motion of pixels through time rather than the synthesis of frames. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

[67] Implicit Neural Representations: A Signal Processing Perspective cs.CV

Dhananjaya Jayasundara, Vishal M. Patel

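TL;DR: This article surveys the evolution of implicit neural representations (INRs) from a signal processing perspective, treating them as a fundamental shift from discrete sampling to continuous functional representation. By parameterizing signals with neural networks, INRs give a unified framework that represents images, audio, video, and 3D geometry as continuous functions of their coordinates and supports analytical operations through automatic differentiation. The article focuses on spectral behavior, sampling theory, and multiscale representation, and outlines applications in medical imaging, compression, and 3D scene representation.

Details

Motivation: Discrete sampled models in traditional signal processing have limitations. Continuous functional representations provide a more unified framework for signal modeling that supports analytical operations and adapts to diverse data types.

Result: The article reports no specific quantitative results. It traces the progression from basic coordinate networks to more advanced designs with periodic, localized, and adaptive activations, which improve performance by reshaping the approximation space, and shows their practical value across many applications.

Insight: The novelty is reinterpreting INRs from a signal processing standpoint, with emphasis on spectral bias, sampling theory, and multiscale structure. Useful takeaways include improving spatial adaptivity and computational efficiency through specialized activations and structured representations (such as hash grid encodings), and viewing INRs as learned signal models with adaptive approximation spaces.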

Abstract: Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate based networks, which exhibit a spectral bias toward low frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation. By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field’s core conceptual developments and outlines open challenges in theoretical stability, weight space interpretability, and large scale generalization.
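As a rough illustration of the analytic-differentiation point in the abstract above, the sketch below defines a tiny sinusoidal coordinate function in plain Python and compares its closed-form derivative with a finite-difference estimate. The weights, the frequency scale `OMEGA`, and all function names are illustrative assumptions, not taken from the article; a real INR would obtain the derivative through an automatic differentiation library rather than a hand-written chain rule.

```python
import math

# Hypothetical one-hidden-layer sinusoidal network:
#   f(x) = sum_k A[k] * sin(OMEGA * (W[k] * x + B[k]))
OMEGA = 30.0
W = [0.5, -1.2, 2.0]
B = [0.1, 0.4, -0.3]
A = [1.0, 0.5, -0.25]

def inr(x):
    """Evaluate the continuous signal at coordinate x."""
    return sum(a * math.sin(OMEGA * (w * x + b)) for a, w, b in zip(A, W, B))

def inr_grad(x):
    """Exact derivative df/dx obtained by the chain rule (what autodiff would return)."""
    return sum(a * OMEGA * w * math.cos(OMEGA * (w * x + b)) for a, w, b in zip(A, W, B))

def finite_difference(f, x, h=1e-6):
    """Central-difference approximation, the discrete alternative."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

Because the signal is a function rather than a grid of samples, the derivative can be queried at any coordinate without choosing a step size.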

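[68] Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection cs.CV

Yangchen Zeng, Zhenyu Yu, Dongming Jiang, Wenbo Zhang, Yifan Hong

TL;DR: This paper proposes HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that targets the query inefficiency caused by background noise in Transformer-based small-object detectors. Its Heatmap-guided Positional Embedding (HPE) selectively preserves positional encodings in foreground-salient regions while suppressing background interference, and Linear-Snake Convolution is added to strengthen feature representations for complex small targets. Gradient-based heatmap supervision is used only during training, so inference incurs no extra computation, and the design substantially reduces the number of decoder layers and parameters.

Details

Motivation: Transformer-based detectors remain inefficient and vulnerable to background noise in small-object detection, which forces the use of deep decoders to refine low-quality queries. The paper aims to improve query retrieval quality by optimizing how positional information is embedded through noise-aware positional-semantic fusion.

Result: Across multiple benchmarks, the method maintains accuracy gains under a reduced compute budget while cutting decoder layers from eight to three and achieving a 59.4% parameter reduction (from 163M to 66.3M).

Insight: The novelties are the Heatmap-guided Positional Embedding mechanism (HPE), which filters background-dominant embeddings through a gradient-based mask, and Linear-Snake Convolution for enriching sparse feature representations. Viewed objectively, dynamically combining positional embeddings with semantic heatmaps and adding interpretable heatmap supervision during training makes for an efficient and interpretable noise-suppression strategy.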

Abstract: Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
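The idea of preserving positional encodings only in foreground-salient regions can be sketched as a simple mask over per-location embeddings. This is a minimal illustration under stated assumptions: the hard threshold, the function name `mask_positional_embeddings`, and the list-based data layout are hypothetical, and the paper's actual HPE uses a gradient-based mask filter whose details are not given in the abstract.

```python
def mask_positional_embeddings(pos_embed, heatmap, threshold=0.5):
    """Keep positional vectors where the heatmap marks foreground; zero the rest.

    pos_embed: list of per-location embedding vectors (lists of floats).
    heatmap:   list of foreground scores in [0, 1], one per location.
    threshold: hypothetical cut-off separating foreground from background.
    """
    if len(pos_embed) != len(heatmap):
        raise ValueError("pos_embed and heatmap must have the same length")
    return [
        list(vec) if score >= threshold else [0.0] * len(vec)
        for vec, score in zip(pos_embed, heatmap)
    ]
```

For example, with two locations scored 0.9 and 0.1, only the first location keeps its positional vector.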

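[69] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID cs.CV

Jiaxuan Li, Xin Wen, Zhihang Li

TL;DR: This paper proposes STFER, a new framework that addresses modality shifts and clothing changes in Any-Time Person Re-identification (AT-ReID). It uses a Large Vision-Language Model (LVLM) to generate identity-consistent text, applies Semantic-driven Visual Token Filtering (SVTF) to enhance informative regions and suppress background noise, and uses Semantic-driven Expert Routing (SER) for more robust multi-scenario gating. Experiments show state-of-the-art results on the AT-USTC dataset and strong generalization across several ReID benchmarks.

Details

Motivation: Existing AT-ReID methods rely too heavily on purely visual features, which are sensitive to environmental and temporal factors such as illumination-induced modality shifts or clothing changes, and performance degrades as a result. The paper uses LVLM-generated identity-consistent text to provide identity-discriminative features that are robust to clothing changes and cross-modality shifts.

Result: Extensive experiments on AT-USTC show that the model achieves state-of-the-art results. In addition, the model trained on AT-USTC was evaluated on five widely used ReID benchmarks, where it showed strong generalization and highly competitive results.

Insight: The novelty lies in using an LVLM to generate identity-intrinsic semantic text that drives both visual token filtering and expert routing, improving robustness to clothing changes and cross-modality scenarios. Viewed objectively, the deep fusion of semantic information with visual features offers a new approach to multi-scenario ReID.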

Abstract: Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on pure visual features, which are prone to change due to environmental and time factors, resulting in significant performance deterioration in scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for semantic-driven modeling. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.

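[70] Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography cs.CV | cs.LG

Simon Böhi, Irene Cannistraci, Sergio Muñoz Gonzalez, Moritz Vandenhirtz, Sonia Laguna

TL;DR: This paper proposes LAMAE (Latent Attention Masked Autoencoder), a foundation model architecture designed for the multi-view nature of medical imaging. A latent attention module enables information exchange across frames and views in latent space, so the model can reconstruct a holistic representation of cardiac function from partial observations. The model is pretrained on MIMIC-IV-ECHO, a large-scale uncurated dataset that reflects real-world clinical variability. It reports the first results (to the authors' knowledge) on predicting ICD-10 codes from this dataset, and representations learned from adult data transfer effectively to pediatric cohorts with substantially different anatomy.

Details

Motivation: Echocardiography is widely used for cardiac assessment because it is non-invasive and cost-effective, but its sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder approaches typically process images or short clips independently and fail to capture the inherent multi-view structure needed for a coherent cardiac representation.

Result: On MIMIC-IV-ECHO, LAMAE reports the first results on predicting ICD-10 codes from echocardiography videos. Experiments also show that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences, demonstrating robustness and transferability.

Insight: The main novelty is the introduction of structural priors (such as multi-view attention) into the standard MAE framework, with a latent attention module that aggregates cross-view information in latent space. This suggests a new foundation model design for multi-view medical imaging and underscores the value of domain-specific structural priors for learning robust, transferable representations.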

Abstract: Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.

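[71] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos cs.CV

Olga Loginova, Frank Keller

TL;DR: This paper proposes PIE-V, a framework for constructing and evaluating egocentric procedural videos that contain human mistakes and recoveries. It augments clean keystep procedures by injecting controlled, human-plausible deviations, and introduces a unified evaluation taxonomy and a human rubric with nine metrics to support verification for egocentric procedural mistake detection and correction.

Details

Motivation: Existing procedural video datasets offer limited and inconsistent mistake and correction traces, whereas reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. This matters especially in egocentric recordings, where mistakes are often partially occluded by hands and revealed only through subtle object state changes.

Result: Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. Using the unified evaluation protocol, the authors audit existing resources and compare PIE-V with a freeform LLM generation baseline under the same criteria.

Insight: The novelty lies in combining a psychology-inspired error planner (conditioned on procedure phase and semantic step load), a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence, together with text-guided video generation that synthesizes replacement clips to preserve visual plausibility. The result is a systematic way to construct and evaluate mistake-aware videos.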

Abstract: Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.

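[72] KVNN: Learnable Multi-Kernel Volterra Neural Networks cs.CV

Haoyu Yun, Hamid Krim, Yufang Bao

TL;DR: This paper proposes a learnable multi-kernel Volterra neural network (kVNN) that models interactions of different orders with a learnable multi-kernel representation. Each layer consists of parallel branches of different polynomial orders and can directly replace standard convolutional kernels in existing architectures. The method is validated on video action recognition and image denoising, where it reduces model parameters and computational complexity while maintaining or improving performance.

Details

Motivation: Higher-order learning relies on exploiting compositional features of the data, but enriching representations through more elaborate data interactions tends to increase the complexity of conventional large-scale deep learning models. The paper uses structured, kernelized higher-order layers to balance expressivity against computational cost.

Result: Experiments on video action recognition and image denoising show that kVNN achieves competitive and often improved performance while reducing model complexity (parameter count) and computational complexity (GFLOPs). These results hold even when training from scratch without large-scale pretraining.

Insight: The novelty is a learnable multi-kernel representation in which different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Viewed objectively, the parallel-branch structure flexibly combines features of different orders and offers modern deep networks a practical path to balancing expressivity and computational cost; the kernelized higher-order layer design is worth borrowing for improving model efficiency.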

Abstract: Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.
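The parallel-branch structure described above can be sketched with ordinary polynomial kernels. This is only an illustration under stated assumptions: the kernel form `(<x, c> + offset) ** order`, the weighted-sum combination, and all names are hypothetical choices, since the abstract does not specify the exact parameterization used in kVNN.

```python
def poly_kernel_branch(x, centers, order, offset=1.0):
    """Responses of one branch: one polynomial-kernel value per center.

    Each response is (<x, c> + offset) ** order, a standard polynomial kernel
    evaluated between the input x and a center c.
    """
    return [
        (sum(xi * ci for xi, ci in zip(x, c)) + offset) ** order
        for c in centers
    ]

def multi_kernel_layer(x, branches):
    """Sum the weighted responses of parallel branches of different orders.

    branches: list of (order, centers, weights) tuples; in a trained model the
    centers and weights would be the learnable parameters.
    """
    total = 0.0
    for order, centers, weights in branches:
        responses = poly_kernel_branch(x, centers, order)
        total += sum(w * r for w, r in zip(weights, responses))
    return total
```

Each branch stores only a few centers, so adding a higher interaction order costs a handful of vectors rather than a full higher-order weight tensor.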

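[73] OmniLight: One Model to Rule All Lighting Conditions cs.CV

Youngjin Oh, Junyoung Park, Junhyeong Kwon, Nam Ik Cho

TL;DR: This paper presents two strategies for lighting-related image restoration: the specialized model DINOLight and the generalized model OmniLight, which adopts a Wavelet Domain Mixture-of-Experts (WD-MoE) architecture. Both approaches secured top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge, demonstrating strong perceptual quality and generalization.

Details

Motivation: To address the loss of visibility and color fidelity caused by adverse lighting conditions (such as cast shadows and irregular illumination), and to examine how specialized and unified architectures differ in performance for real-world, multi-domain applications.

Result: Top-tier rankings in all three lighting-related tracks of the NTIRE 2026 Challenge (covering tasks such as shadow removal and ALN), demonstrating the models' strong perceptual quality and generalization.

Insight: The novelty is the Wavelet Domain Mixture-of-Experts (WD-MoE) architecture, which enables a unified lighting restoration model trained across datasets. The comparison of specialized and generalized strategies also yields an analysis of how data distribution affects model performance, offering design guidance for multi-domain robustness.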

Abstract: Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and ALN are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our codes are available at https://github.com/OBAKSA/Lighting-Restoration.

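[74] Boundary-Centric Active Learning for Temporal Action Segmentation cs.CV

Halil Ismail Helvaci, Sen-ching Samson Cheung

TL;DR: This paper proposes B-ACT, a boundary-centric active learning framework for temporal action segmentation. Using a two-stage loop, it prioritizes annotation effort on boundary regions where model predictions are uncertain and error-prone, significantly improving performance under a limited annotation budget.

Details

Motivation: Temporal action segmentation requires dense temporal annotation, yet most of the annotation cost goes into identifying and refining the boundary regions where actions transition. These regions are the main source of segmentation errors, and small temporal shifts there significantly degrade segmentation metrics. A method is therefore needed that concentrates annotation resources on these high-leverage boundaries.

Result: Extensive experiments on GTEA, 50Salads, and Breakfast show strong label efficiency under sparse annotation budgets, consistently surpassing representative active learning baselines for temporal action segmentation and the prior state of the art. The largest gains appear on datasets where boundary placement dominates the edit and overlap-based F1 scores.

Insight: The novelty is a hierarchical two-stage active learning loop together with a new boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics to prioritize boundary frames for annotation within each selected video. The annotation protocol requests labels only for boundary frames, yet training still uses boundary-centered clips so that the model's receptive field can exploit temporal context.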

Abstract: Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model’s receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
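The second stage of the loop (detect candidate transitions, score them, keep the top K) can be sketched as follows. This is a minimal illustration, not the authors' scoring function: the three terms are simple stand-ins (mean entropy in a window, one minus the top-2 margin, and the L1 change between consecutive frames), and summing them with equal weights is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy of one frame's class distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def candidate_boundaries(probs):
    """Frames where the predicted label differs from the previous frame."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

def boundary_score(probs, t, radius=1):
    """Hypothetical fused score for a candidate boundary at frame t (t >= 1)."""
    lo, hi = max(0, t - radius), min(len(probs), t + radius + 1)
    # Neighborhood uncertainty: mean entropy in a window around t.
    neighborhood = sum(entropy(probs[i]) for i in range(lo, hi)) / (hi - lo)
    # Class ambiguity: small margin between the two most likely classes.
    top2 = sorted(probs[t], reverse=True)[:2]
    ambiguity = 1.0 - (top2[0] - top2[1])
    # Temporal dynamics: how sharply the prediction changes from frame t-1 to t.
    dynamics = sum(abs(a - b) for a, b in zip(probs[t], probs[t - 1]))
    return neighborhood + ambiguity + dynamics

def select_top_k(probs, k):
    """Return up to k candidate boundary frames, highest score first."""
    cands = candidate_boundaries(probs)
    return sorted(cands, key=lambda t: boundary_score(probs, t), reverse=True)[:k]
```

Only the selected frames would be sent for labeling; training on clips centered at those frames is a separate step not shown here.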

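[75] VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models cs.CV | cs.AI

Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding

TL;DR: This paper proposes VisPCO, a framework that formulates visual token pruning in vision-language models as a Pareto configuration optimization problem. It uses continuous relaxation and straight-through estimators to enable gradient-based search, automatically finding pruning configurations with the best computation-performance trade-off.

Details

Motivation: Existing visual token pruning methods rely on predefined configurations and cannot guarantee computation-performance optimality, so a method for optimizing configurations automatically is needed.

Result: Across 8 visual benchmarks, VisPCO effectively approximates the empirical Pareto frontier obtained by grid search and generalizes well across pruning methods and VLM architectures.

Insight: Learnable kernel functions reveal that multi-step progressive pruning across layers captures the hierarchical compression structure of VLMs, achieving better accuracy-efficiency trade-offs than single-layer pruning.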

Abstract: Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs’ hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
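The empirical Pareto frontier that serves as the reference in the abstract above can be computed from grid-search results in a few lines. The sketch below is a generic non-dominated filter, not part of the paper's method; the tuple layout `(name, cost, accuracy)` and the function name are assumptions made for illustration.

```python
def pareto_frontier(configs):
    """Return the non-dominated configurations, sorted by increasing cost.

    configs: list of (name, cost, accuracy) tuples, e.g. from a grid search.
    A configuration is kept only if no cheaper-or-equal configuration reaches
    at least the same accuracy.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by cost; among equal costs, visit the most accurate first.
    for name, cost, acc in sorted(configs, key=lambda c: (c[1], -c[2])):
        if acc > best_acc:
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier
```

A gradient-based search such as the one the paper describes would be judged by how closely its configurations approach the points returned by a filter of this kind.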

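[76] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression cs.CV

Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li

TL;DR: StreamCacheVGGT is a training-free framework for reconstructing dense 3D geometry from continuous video streams. Through two synergistic modules, Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC), it addresses the information destruction and scoring noise that the pure-eviction mechanism causes in existing O(1)-memory frameworks, enabling more stable and more accurate inference under a constant memory budget.

Details

Motivation: Existing O(1)-memory frameworks based on the pure-eviction paradigm suffer significant information destruction in 3D reconstruction from continuous video streams, caused by binary token deletion and noise from localized, single-layer scoring. The goal is more stable inference with better information retention under a constant memory constraint.

Result: Extensive evaluation on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) shows that StreamCacheVGGT sets a new state of the art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.

Insight: The main novelties are: 1) Cross-Layer Consistency-Enhanced Scoring (CLCES), which mitigates activation noise by tracking each token's importance trajectory across the Transformer hierarchy and using order-statistical analysis to identify sustained geometric salience; and 2) Hybrid Cache Compression (HCC), which introduces a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold, going beyond simple eviction and preserving essential geometric context that would otherwise be lost.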

Abstract: Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
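The keep / merge / evict triage described above can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the median is used as a stand-in order statistic for cross-layer scoring, squared Euclidean distance on keys stands in for nearest-neighbor assignment, and merged values are averaged into their anchor. The abstract does not specify these details.

```python
import statistics

def cross_layer_score(per_layer_scores):
    """Stand-in order statistic: the median importance across layers."""
    return statistics.median(per_layer_scores)

def hybrid_compress(keys, values, layer_scores, keep, merge):
    """Three-tier triage over cached tokens.

    keep:  number of top-scoring tokens retained as anchors (must be >= 1).
    merge: number of next-best tokens folded into their nearest anchor.
    Remaining tokens are evicted. Returns {anchor_index: merged_value}.
    """
    if keep < 1:
        raise ValueError("at least one anchor is required")
    order = sorted(
        range(len(keys)),
        key=lambda i: cross_layer_score(layer_scores[i]),
        reverse=True,
    )
    anchors, merged = order[:keep], order[keep:keep + merge]
    sums = {a: list(values[a]) for a in anchors}
    counts = {a: 1 for a in anchors}
    for i in merged:
        # Assign the token to the anchor whose key is closest.
        nearest = min(
            anchors,
            key=lambda a: sum((x - y) ** 2 for x, y in zip(keys[i], keys[a])),
        )
        sums[nearest] = [s + v for s, v in zip(sums[nearest], values[i])]
        counts[nearest] += 1
    return {a: [s / counts[a] for s in sums[a]] for a in anchors}
```

The cache size after compression is always `keep`, which is what keeps the memory cost constant regardless of stream length.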

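[77] Why Do Vision Language Models Struggle To Recognize Human Emotions? cs.CV | cs.AI

Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh

TL;DR: This paper investigates why vision-language models (VLMs) perform poorly at recognizing human emotions, identifies inherent weaknesses in handling long-tailed distributions and temporal information, and proposes improved sampling strategies and a multi-stage context enrichment method.

Details

Motivation: Although VLMs have made great progress on many visual tasks, in emotion recognition they struggle to outperform even specialized vision-only classifiers. The paper sets out to find the root causes.

Result: Diagnostic experiments reveal class bias in VLMs on long-tailed emotion datasets, as well as a mismatch between sparse temporal sampling and the fleeting nature of micro-expressions. The proposed strategies show promise for emotion recognition.

Insight: The novelty lies in identifying two key weaknesses of VLMs in emotion recognition (long-tail bias and missing temporal information) and proposing targeted remedies: alternative sampling strategies and the use of natural language summaries to enrich temporal context.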

Abstract: Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question “Why do VLMs struggle to recognize human emotions?”, and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from “in-between” frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
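The mismatch between sparse sampling and micro-expression duration is easy to check with arithmetic. The clip length and frame count below are hypothetical values chosen for illustration; only the 0.25 to 0.5 second micro-expression range comes from the abstract.

```python
def frame_spacing(duration_s, num_frames):
    """Time between consecutive frames under uniform sampling."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return duration_s / num_frames

def can_fall_between_samples(duration_s, num_frames, event_s):
    """True if an event of length event_s can fit strictly between two samples."""
    return event_s < frame_spacing(duration_s, num_frames)

# Hypothetical setting: a 10-second clip sampled at 8 frames.
spacing = frame_spacing(10.0, 8)  # 1.25 seconds between samples
```

With 1.25 seconds between samples, even the longest micro-expression in the quoted range (0.5 seconds) can begin and end without a single frame landing on it.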

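[78] AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving cs.CV | cs.AI

Fabrizio Genilotti, Arianna Stropeni, Gionata Grotto, Francesco Borsatti, Manuel Barusco

TL;DR: This paper presents AD4AD, a benchmark for evaluating visual anomaly detection (VAD) models in autonomous driving. On AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving, the study benchmarks eight state-of-the-art VAD methods across four backbone architectures ranging from large networks to lightweight ones (such as MobileNet and DeiT-Tiny). The results show that VAD transfers effectively to road scenes, with Tiny-Dinomaly achieving the best accuracy-efficiency trade-off for edge deployment.

Details

Motivation: The reliability of an autonomous driving system depends heavily on its training data distribution. When it encounters anomalous conditions absent from the training data (such as atypical obstacles), its perceptual capabilities can degrade substantially, creating direct physical risk. The paper explores VAD as a solution that identifies anomalous objects unseen during training and alerts the driver when an unfamiliar situation is detected.

Result: Eight state-of-the-art VAD methods were benchmarked on AnoVox, a synthetic dataset for anomaly detection in autonomous driving. The results show that VAD transfers effectively to road scenes. Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching the localization performance of full-scale models at a fraction of the memory cost.

Insight: The main contribution is AD4AD, presented as the first large-scale VAD benchmark aimed at autonomous driving safety, along with a systematic evaluation of different backbone architectures (including lightweight networks) on AnoVox. The key insight is that VAD, particularly methods that produce pixel-level anomaly maps, can direct the driver's attention to specific regions of concern without prior assumptions about the type of hazard, offering a concrete technical path toward safer deployment of autonomous driving. The strong accuracy-efficiency trade-off of Tiny-Dinomaly makes it a viable model choice for real-time anomaly detection on edge devices.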

Abstract: The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.
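How a pixel-level anomaly map might be turned into a region of concern can be sketched with a threshold and a bounding box. This is a generic post-processing illustration, not a step described in the paper: the function name, the threshold, and the nested-list map layout are all assumptions.

```python
def flag_region(anomaly_map, threshold):
    """Bounding box (top, left, bottom, right) of pixels scoring >= threshold.

    anomaly_map: 2D list of per-pixel anomaly scores.
    Returns None when no pixel reaches the threshold.
    """
    hits = [
        (r, c)
        for r, row in enumerate(anomaly_map)
        for c, score in enumerate(row)
        if score >= threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))
```

Note that nothing in this step depends on what the anomalous object is, which mirrors the abstract's point about not needing prior assumptions on the hazard.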

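[79] AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV

Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing, Yinwei Wu

TL;DR: This paper presents AnimationBench, the first systematic benchmark for animation-style image-to-video generation. It turns the Twelve Basic Principles of Animation and IP Preservation into quantifiable metrics, adds dimensions such as semantic consistency and motion rationality, and supports both standardized closed-set evaluation and flexible open-set diagnosis. Experiments show that it captures animation-specific quality differences and aligns closely with human judgment.

Details

Motivation: Existing video generation benchmarks mainly target realistic video and struggle to evaluate animation-style generation in terms of stylized appearance, exaggerated motion, and character consistency. They also lack support for open-domain content and custom evaluation needs.

Result: In extensive experiments, AnimationBench aligns closely with human judgment and exposes animation-specific quality differences that realism-oriented benchmarks overlook, providing a more informative and discriminative evaluation of current state-of-the-art I2V models.

Insight: The novelty lies in systematically translating animation production principles (such as the Twelve Basic Principles) into measurable evaluation dimensions and in designing a flexible framework that supports both closed-set and open-set evaluation, using vision-language models for scalable automated assessment. This fills a gap in evaluation for animation generation.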

Abstract: Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. They also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized closed-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.

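[80] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation cs.CV

Yiyang Jiang, Li Zhang, Xiao-Yong Wei, Li Qing

TL;DR: This paper proposes a new paradigm for gloss-free sign language translation (SLT). By introducing an explicit intermediate layer made of an ordered sequence of latent thoughts, it recasts SLT as a cross-modal reasoning task rather than a simple video-to-text conversion. The method uses a plan-then-ground decoding strategy, and the authors release a new large-scale gloss-free SLT dataset with stronger context dependencies. Experiments show that the method outperforms existing gloss-free methods on several benchmarks.

Details

Motivation: Existing SLT systems usually assume that segments of signing map directly to spoken-language words, but signers often create meaning dynamically using context, space, and movement. Treating SLT as a cross-modal reasoning task addresses this limitation.

Result: Experiments on several benchmarks show consistent gains over existing gloss-free SLT methods.

Insight: The novelty lies in reframing SLT as a reasoning task, introducing an ordered sequence of latent thoughts as an explicit intermediate representation between video and text, and separating decoding into planning (deciding what to say) and grounding (looking back at the video for evidence) to improve coherence and faithfulness. The authors also build a large-scale gloss-free dataset with stronger context dependencies and more realistic meanings.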

Abstract: Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at https://github.com/fletcherjiang/SignThought.

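[81] RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework cs.CV

Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu

TL;DR: This paper proposes RAD-2, a unified generator-discriminator framework for closed-loop planning that targets the stochastic instability and lack of corrective negative feedback in diffusion-based motion planners for autonomous driving. A diffusion-based generator produces diverse trajectory candidates, and an RL-optimized discriminator reranks them by long-term driving quality. The paper also introduces Temporally Consistent Group Relative Policy Optimization, On-policy Generator Optimization, and the efficient simulation environment BEV-Warp, which together significantly reduce the collision rate and improve smoothness and safety in real-world driving.

Details

Motivation: In high-level autonomous driving, diffusion-based planners trained purely by imitation learning suffer from stochastic instability and a lack of corrective closed-loop feedback. The paper aims to make planners more robust when modeling multimodal future uncertainty and interacting in closed loop.

Result: In closed-loop evaluation, RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.

Insight: The core novelty is a framework that decouples generation (a diffusion model) from discrimination (RL optimization), which avoids applying sparse scalar rewards directly to the high-dimensional trajectory space and improves optimization stability. In addition, Temporally Consistent Group Relative Policy Optimization alleviates the credit assignment problem, On-policy Generator Optimization converts closed-loop feedback into structured optimization signals, and BEV-Warp enables efficient closed-loop evaluation directly in bird's-eye-view feature space. All of these are reusable techniques.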

Abstract: High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird’s-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
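The generate-then-rerank pattern and the group-relative normalization that the GRPO family builds on can be sketched as follows. This is a simplified illustration, not RAD-2 itself: `score_fn` is a hypothetical stand-in for the learned discriminator, and the advantage computation is the plain group-relative form (reward minus group mean, divided by group standard deviation), without the temporal-consistency mechanism the paper introduces.

```python
import statistics

def rerank(candidates, score_fn):
    """Pick the candidate trajectory that the scorer rates highest.

    candidates: list of trajectories (any objects).
    score_fn:   hypothetical stand-in for a learned discriminator.
    """
    return max(candidates, key=score_fn)

def group_relative_advantages(rewards):
    """Normalize rewards within one group of rollouts.

    Each advantage is (reward - group mean) / group standard deviation;
    a group with identical rewards yields all-zero advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Because the scorer outputs one number per candidate, the reward signal is applied to a ranking decision rather than to every coordinate of a trajectory, which is the decoupling the abstract credits for improved optimization stability.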

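[82] MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation cs.CV | cs.AI | cs.CL

Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao

TL;DR: This paper proposes MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation. It coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection, jointly optimizing global layout, local multimodal content, and their integration to produce stylistically consistent, globally coherent webpages.

Details

Motivation: To address the style inconsistency and poor global coherence that arise when existing AIGC tools are used directly for automated webpage generation, because elements are generated in isolation.

Result: On the authors' multimodal webpage generation benchmark, MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration.

Insight: The novelty is a hierarchical agentic framework that decomposes webpage generation into global layout planning, local content generation, and iterative reflective refinement, systematically addressing coherence in multimodal content integration. The authors also establish a new benchmark and evaluation protocol for the task.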

Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

[83] TokenLight: Precise Lighting Control in Images using Attribute Tokens cs.CV | cs.GR

Sumit Chaturvedi, Yannick Hold-Geoffroy, Mengwei Ren, Jingyuan Liu, He Zhang

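TL;DR: This paper proposes TokenLight, an image relighting method that introduces attribute tokens to encode multiple lighting attributes, enabling precise, continuous control over the illumination in a photograph. The model is trained on a large-scale synthetic dataset supplemented by a small set of real captures, validated on synthetic and real images, and achieves state-of-the-art quantitative and qualitative results across a variety of relighting tasks.

Details

Motivation: Existing image relighting methods struggle to provide precise, continuous control over multiple lighting attributes, such as intensity, color, ambient illumination, diffuse level, and 3D light position.

Result: Across relighting tasks on synthetic and real images (such as controlling in-scene light fixtures and editing environment illumination with virtual light sources), the method achieves state-of-the-art performance in both quantitative and qualitative evaluation.

Insight: The novelty lies in formulating relighting as conditional image generation and introducing attribute tokens to encode distinct lighting factors. Viewed objectively, the model shows an inherent understanding of how light interacts with scene geometry, occlusion, and materials without explicit inverse-rendering supervision, which lets it produce convincing results even in challenging scenarios such as placing lights inside objects or relighting transparent materials.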

Abstract: This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: vrroom.github.io/tokenlight/

cs.GR [Back]

[84] STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing cs.GR | cs.AI | cs.CV | cs.LG

Shen Fan, Mikołaj Kida, Przemyslaw Musialski

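TL;DR: STEP-Parts is a deterministic toolchain that extracts geometric instance partitions directly from raw STEP boundary representations (B-Reps) and transfers them to tessellated carriers through retained source-face correspondence, providing instance labels and metadata for downstream learning and evaluation. Adjacent B-Rep faces are merged only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion, so the partition stays stable under changes in tessellation.

Details

Motivation: Existing CAD learning pipelines usually discretize B-Reps into triangle meshes, discarding analytic surface structure and topological adjacency and weakening consistent instance-level analysis. This work extracts geometric partitions directly from raw B-Reps to provide a more robust supervision signal.

Result: On the ABC dataset, same-primitive dihedral angles follow a strongly bimodal distribution, which yields a threshold-insensitive low-angle regime for part extraction. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180,000 models in under six hours on a consumer CPU.

Insight: The novelty lies in defining the partition on B-Rep topology rather than on a particular triangulation, which keeps boundaries stable across tessellations. Merging faces by analytic primitive type and a continuity criterion also provides a geometric reference and supervision source for downstream tasks, such as an implicit reconstruction-segmentation network and a point-based backbone.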

Abstract: Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face correspondence, yielding instance labels and metadata for downstream learning and evaluation. The construction merges adjacent B-Rep faces only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion. On ABC, same-primitive dihedral angles are strongly bimodal, yielding a threshold-insensitive low-angle regime for part extraction. Because the partition is defined on intrinsic B-Rep topology rather than on a particular triangulation, the resulting boundaries remain stable under changes in tessellation. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180,000 models in under six hours on a consumer CPU. We release code and precomputed labels, and show that STEP-Parts serves both as a tessellation-robust geometric reference and as a useful supervision source in two downstream probes: an implicit reconstruction–segmentation network and a dataset-level point-based backbone.
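The merge rule described in this abstract can be illustrated with a small union-find sketch. This is not the authors' implementation: the function name `merge_faces`, the input layout, and the 10-degree threshold are assumptions made for illustration, and the angle is taken here as the deviation from tangent continuity (0 means a smooth join).

```python
def merge_faces(faces, adjacency, max_angle_deg=10.0):
    """Group adjacent faces into parts with union-find.

    faces: dict mapping face id -> analytic primitive type (e.g. "plane").
    adjacency: list of (face_a, face_b, angle_deg) for faces sharing an edge,
               where angle_deg is the deviation from tangent continuity.
    max_angle_deg: hypothetical near-tangent threshold (an assumption).
    Returns a dict mapping face id -> representative face id of its part.
    """
    parent = {f: f for f in faces}

    def find(f):
        # Walk to the root, halving the path along the way.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b, angle in adjacency:
        # Merge only when both criteria hold: same primitive type
        # and a near-tangent join.
        if faces[a] == faces[b] and angle <= max_angle_deg:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    return {f: find(f) for f in faces}
```

Because the grouping runs on face adjacency rather than on triangles, re-tessellating the model would not change the resulting labels, which matches the stability argument in the abstract.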

cs.LG [Back]

[85] MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining cs.LG | cs.AI | cs.CL

Bingbing Wen, Sirajul Salekin, Feiyang Kang, Bill Howe, Lucy Lu Wang

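TL;DR: MixAtlas is a data-mixture optimization method for midtraining multimodal large language models (MLLMs). It decomposes the training corpus along two axes, image concepts (10 visual-domain clusters) and task supervision (5 objective types), and searches for the best mixture recipe using small proxy models paired with a Gaussian-process surrogate and GP-UCB acquisition, aiming to improve sample efficiency and downstream generalization.

Details

Motivation: Current multimodal training recipes typically tune the data mixture along a single dimension (such as data format or task type), and data-mixture optimization for multimodal midtraining remains underexplored. MixAtlas addresses this by systematically exploring a multi-dimensional mixture space to produce benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora.

Result: Across 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning, the mixtures optimized by MixAtlas improve average performance over the strongest baseline by 8.5%-17.6% on Qwen2-7B and by 1.0%-3.3% on Qwen2.5-7B. Both settings reach baseline-equivalent training loss in up to 2x fewer steps, and recipes discovered on 0.5B proxy models transfer to 7B-scale training across Qwen model families.

Insight: The novelty is a two-axis framework (image concepts and task supervision) for decomposing the corpus, combined with Gaussian-process-based Bayesian optimization for efficient mixture search. The core insight is to cast data-mixture optimization as an interpretable, transferable multi-dimensional search problem: explore cheaply with small proxy models, then transfer the result to larger models, giving a systematic and scalable approach to multimodal data mixing.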

Abstract: Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We evaluate on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning. On Qwen2-7B, optimized mixtures improve average performance by 8.5%-17.6% over the strongest baseline; on Qwen2.5-7B, gains are 1.0%-3.3%. Both settings reach baseline-equivalent training loss in up to 2 times fewer steps. Recipes discovered on 0.5B proxies transfer to 7B-scale training across Qwen model families.
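For readers unfamiliar with GP-UCB, the acquisition step can be sketched as follows. GP-UCB scores each candidate by its posterior mean plus a multiple of its posterior standard deviation and evaluates the highest-scoring one next. The function name `ucb_select`, the exploration weight `kappa`, and the example numbers are assumptions; fitting the Gaussian process itself is outside this sketch.

```python
def ucb_select(means, stds, kappa=2.0):
    """Return the index of the candidate mixture with the highest UCB score.

    means, stds: posterior mean and standard deviation of the surrogate
                 for each candidate mixture (assumed to be precomputed).
    kappa: hypothetical exploration weight; larger values favour
           uncertain candidates over currently promising ones.
    """
    scores = [m + kappa * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `kappa=0` the rule is purely greedy on the predicted score, while a larger `kappa` spends more of the proxy budget on mixtures the surrogate is unsure about.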

[86] LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning cs.LG | cs.CL

Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu, Chenxuan Li

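TL;DR: This paper proposes LongAct, a reinforcement learning strategy that exploits the high-magnitude activations appearing in query and key vectors during long-context processing. It shifts from uniform updates to saliency-guided sparse updates, improving large language model performance on long-context reasoning tasks.

Details

Motivation: Current RL research focuses mostly on reward engineering or data synthesis and rarely uses the model's intrinsic representation characteristics to guide training. The paper aims to improve training by observing and exploiting high-magnitude activation patterns in long-context processing.

Result: The method achieves an improvement of about 8% on LongBench v2 and better generalization on the RULER benchmark, and it consistently improves performance across RL algorithms such as GRPO and DAPO.

Insight: The novelty lies in combining the importance of high-magnitude activations established in model quantization with the sparse structure of long-context reasoning, leading to a saliency-guided sparse update strategy. Viewed objectively, focusing updates on salient features appears to unlock long-context potential, and the approach is general and scalable.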

Abstract: Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model’s intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization – which establishes the criticality of such high-magnitude activations – and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
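The idea of a saliency-guided sparse update can be illustrated with a toy sketch. The abstract does not specify how saliency is computed or how many weights are kept, so `saliency`, `keep_ratio`, and the plain gradient step below are all assumptions rather than the LongAct procedure.

```python
def sparse_update(weights, grads, saliency, keep_ratio=0.25, lr=0.1):
    """Apply a gradient step only to the most salient weights.

    weights, grads: flat lists of parameters and their gradients.
    saliency: per-weight score, e.g. the magnitude of the associated
              activation (how this is measured is an assumption).
    keep_ratio: hypothetical fraction of weights that receive an update.
    """
    k = max(1, int(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda i: saliency[i], reverse=True)
    selected = set(ranked[:k])
    # Weights outside the selected set are left unchanged.
    return [
        w - lr * g if i in selected else w
        for i, (w, g) in enumerate(zip(weights, grads))
    ]
```

A uniform update corresponds to `keep_ratio=1.0`; lowering the ratio concentrates the update on the weights tied to the largest activations.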

[87] Step-level Denoising-time Diffusion Alignment with Multiple Objectives cs.LG | cs.AI | cs.CV

Qi Zhang, Dawei Wang, Shaofeng Zou

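TL;DR: This paper proposes MSDDA, a retraining-free framework for aligning diffusion models with multiple objectives. By introducing a step-level reinforcement learning formulation, it resolves the intractability of identifying the optimal policy and derives the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models.

Details

Motivation: When balancing multiple downstream objectives (such as aesthetic quality and text-image consistency), existing methods either rely on costly multi-objective RL fine-tuning or fuse separately aligned models at denoising time. They generally require access to reward values (or their gradients) and/or introduce approximation error.

Result: Numerical results indicate that the method outperforms existing denoising-time alignment approaches.

Insight: The novelty lies in the step-level RL formulation and in deriving a denoising-time objective that is exactly equivalent to step-level RL fine-tuning with no approximation error, achieving multi-objective alignment without retraining.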

Abstract: Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models. We prove that this denoising-time objective is exactly equivalent to the step-level RL fine-tuning, introducing no approximation error. Moreover, we provide numerical results, which indicate our method outperforms existing denoising-time approaches.

eess.AS [Back]

[88] HARNESS: Lightweight Distilled Arabic Speech Foundation Models eess.AS | cs.AI | cs.CL

Vrunda N. Sukhadia, Shammur Absar Chowdhury

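TL;DR: This paper presents HArnESS, an Arabic-centric family of self-supervised speech models trained with iterative self-distillation, together with lightweight student variants that offer strong accuracy-efficiency trade-offs on automatic speech recognition, dialect identification, and speech emotion recognition.

Details

Motivation: Large self-supervised speech models are difficult to deploy in resource-constrained settings. The work focuses on Arabic tasks and compresses model size through knowledge distillation.

Result: On Arabic downstream tasks, HArnESS consistently improves over HuBERT and XLS-R, and the compressed models remain competitive under substantial structural reduction.

Insight: Iterative self-distillation and PCA-based compression of the teacher supervision signal preserve Arabic-relevant acoustic and paralinguistic information, yielding practical lightweight foundation models for real-world use.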

Abstract: Large self-supervised speech (SSL) models achieve strong downstream performance, but their size limits deployment in resource-constrained settings. We present HArnESS, an Arabic-centric self-supervised speech model family trained from scratch with iterative self-distillation, together with lightweight student variants that offer strong accuracy-efficiency trade-offs on Automatic Speech Recognition (ASR), Dialect Identification (DID), and Speech Emotion Recognition (SER). Our approach begins with a large bilingual Arabic-English teacher and progressively distills its knowledge into compressed student models while preserving Arabic-relevant acoustic and paralinguistic representations. We further study PCA-based compression of the teacher supervision signal to better match the capacity of shallow and thin students. Compared with HuBERT and XLS-R, HArnESS consistently improves performance on Arabic downstream tasks, while the compressed models remain competitive under substantial structural reduction. These results position HArnESS as a practical and accessible Arabic-centric SSL foundation for real-world speech applications.

cs.GT [Back]

Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin

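TL;DR: This paper presents the CoopEval benchmark for evaluating mechanisms that promote cooperation among LLM agents across four social dilemma games. Contracting and mediation prove most effective, cooperation induced by repeated play deteriorates sharply when co-players vary, and the mechanisms become more effective under evolutionary pressure.

Details

Motivation: Addresses the safety concern that LLM agents become less cooperative in mixed-motive games (such as the prisoner's dilemma) as their reasoning ability grows, by evaluating how well game-theoretic mechanisms enable cooperation in equilibrium between rational agents.

Result: Across four social dilemmas that test distinct components of robust cooperation, contracting and mediation are the most effective at achieving cooperation between capable LLMs, repetition-induced cooperation deteriorates sharply when co-players vary, and the mechanisms become more effective under evolutionary pressure to maximize individual payoffs.

Insight: The novelty is the first systematic comparison of equilibrium cooperation mechanisms for LLM agents, revealing the strength of contracting and mediation and the fragility of repeated games, and offering mechanism-design insights for safe LLM interaction.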

Abstract: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet recent work reports the opposite trend: LLMs with stronger reasoning capabilities behave less cooperatively in mixed-motive games such as the prisoner’s dilemma and public goods settings. Indeed, our experiments show that recent models, with or without reasoning enabled, consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents in equilibrium. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to whom decision making can be delegated, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLMs, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become more effective under evolutionary pressures to maximize individual payoffs.

cs.AI [Back]

[90] Dissecting Failure Dynamics in Large Language Model Reasoning cs.AI | cs.CL

Wei Zhu, Jian Zhang, Lixing Yu, Kun Yue, Zhiwen Tang

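TL;DR: By analyzing the reasoning trajectories of large language models (LLMs), this paper finds that reasoning errors are not uniformly distributed but often originate from a small number of early "transition points" that coincide with localized spikes in token-level entropy. Based on this observation, the authors propose GUARD, a targeted inference-time framework that uses uncertainty signals to probe and redirect these critical transitions.

Details

Motivation: LLMs achieve strong performance through extended inference-time deliberation, yet the dynamics of their reasoning failures remain poorly understood. This paper investigates how reasoning errors arise and evolve, complementing existing approaches that mainly scale inference-time computation.

Result: Empirical evaluation on multiple benchmarks confirms that interventions guided by the observed failure dynamics (the GUARD framework) lead to more reliable reasoning outcomes.

Insight: The core novelty is identifying a key dynamic of LLM reasoning failure, namely early transition points and their link to uncertainty (entropy), and using it to build a lightweight, targeted inference-time intervention framework (GUARD) rather than simply adding computation. This offers a new perspective on understanding and improving LLM reasoning reliability.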

Abstract: Large Language Models (LLMs) achieve strong performance through extended inference-time deliberation, yet how their reasoning failures arise remains poorly understood. By analyzing model-generated reasoning trajectories, we find that errors are not uniformly distributed but often originate from a small number of early transition points, after which reasoning remains locally coherent but globally incorrect. These transitions coincide with localized spikes in token-level entropy, and alternative continuations from the same intermediate state can still lead to correct solutions. Based on these observations, we introduce GUARD, a targeted inference-time framework that probes and redirects critical transitions using uncertainty signals. Empirical evaluations across multiple benchmarks confirm that interventions guided by these failure dynamics lead to more reliable reasoning outcomes. Our findings highlight the importance of understanding when and how reasoning first deviates, complementing existing approaches that focus on scaling inference-time computation.
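To make the notion of a localized entropy spike concrete, here is a minimal sketch that computes Shannon entropy per generation step and flags steps that stand out from the recent average. The detection rule (`window`, `factor`) is an assumption for illustration and is not taken from GUARD.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_spikes(step_distributions, window=3, factor=2.0):
    """Return step indices whose entropy exceeds `factor` times the mean
    entropy of the preceding `window` steps.

    window, factor: hypothetical detection parameters (assumptions).
    """
    entropies = [token_entropy(p) for p in step_distributions]
    spikes = []
    for t, h in enumerate(entropies):
        previous = entropies[max(0, t - window):t]
        if previous and h > factor * (sum(previous) / len(previous)):
            spikes.append(t)
    return spikes
```

A flagged step is a natural place to sample alternative continuations, since the abstract notes that other continuations from the same intermediate state can still reach a correct solution.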

[91] MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation cs.AI | cs.CL

Pengfei Li, Shijie Wang, Fangyuan Li, Yikun Fu, Kaifeng Liu

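TL;DR: This paper proposes MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), an RL framework for code generation that unifies multi-agent reinforcement learning with a tree-structured search. Multiple independently optimized agents collaborate within a shared tree-structured search environment, and a path-level group advantage formulation with tree-consistent reward shaping supports effective learning, improving exploration diversity and performance.

Details

Motivation: On reasoning-intensive tasks such as code generation, RL often hits a performance ceiling because trajectory diversity is limited. Search-enhanced RL adds structured exploration but remains constrained by single-agent policy priors, while multi-agent interaction yields more diverse exploration signals but is typically decoupled from structured search. The paper asks how to combine multi-agent collaboration with tree search effectively to improve RL performance.

Result: Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across model combinations and training settings, demonstrating the benefit of coupling multi-agent collaboration with tree search.

Insight: The main novelty is modeling the search tree as a learnable multi-agent interaction environment, so that heterogeneous agents can collaboratively generate and refine candidate solutions within a shared search topology. A path-level group advantage formulation based on tree-consistent reward shaping supports credit assignment across complex search trajectories. Together these suggest a new way to combine structured search with multi-agent learning.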

Abstract: Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, but it remains constrained by single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. MARS$^2$ models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.

Zonghai Yao, Zhipeng Tang, Chengtao Lin, Xiong Luo, Benlu Wang

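TL;DR: This paper introduces MedImageEdu, a benchmark for evaluating multi-turn, multimodal radiology patient education. It simulates doctor-patient dialogue and requires the system to combine radiology report text with medical images to produce evidence-grounded multimodal responses that include drawing instructions and plain-language explanations, addressing the static, non-interactive nature of existing medical multimodal tasks.

Details

Motivation: Existing medical multimodal benchmarks focus mainly on static tasks (such as image question answering and report generation). Patient education requires a system to identify evidence across images, show where to look, explain findings in plain language, and handle the patient's confusion or distress, yet most related work remains text-only.

Result: Representative open- and closed-source vision-language model agents were evaluated on the 150-case MedImageEdu benchmark, revealing three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder to handle than low education or low health literacy.

Insight: The novelty lies in reframing patient education as a multi-turn multimodal interaction task and building an evaluation framework with hidden patient profiles (such as education level, health literacy, and personality) and a benchmark-provided drawing tool. It emphasizes teaching from evidence rather than merely answering from text, providing a controlled testbed for assessing the teaching ability of multimodal agents.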

Abstract: Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

[93] Acceptance Dynamics Across Cognitive Domains in Speculative Decoding cs.AI | cs.CL

Saif Mahmoud

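TL;DR: This paper presents an empirical study of acceptance dynamics in tree-based speculative decoding, focusing on how different cognitive task domains (code generation, mathematical reasoning, logical reasoning, and open-ended chat) affect acceptance probability. It finds that task type is a stronger predictor of acceptance than speculation tree depth, and that only the chat domain consistently yields more than one accepted token per step.

Details

Motivation: Despite the growing body of work on speculative decoding, how the cognitive characteristics of a task affect token acceptance probability remains largely unexplored. The paper fills this gap with a cross-domain empirical analysis.

Result: Across four NLP benchmark domains (code generation, mathematical reasoning, logical reasoning, and open-ended chat), with TinyLlama-1.1B as the draft model and Llama-2-7B-Chat-GPTQ as the target, the study analyzes 99,768 speculative nodes. Task type is a stronger predictor of acceptance than tree depth; only the chat domain consistently yields an expected accepted length above 1.0 token per step; and entropy correlates weakly and negatively with acceptance (rho in [-0.20, -0.15]).

Insight: The novelty is the first systematic evidence that a task's cognitive domain significantly affects speculative decoding acceptance dynamics, along with a counterintuitive finding: chat has the highest entropy yet also the highest acceptance rate, which the authors attribute to the lexical predictability of the RLHF-aligned register. The findings bear directly on domain-aware speculation budgets and draft-model selection.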

Abstract: Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains (rho in [-0.20, -0.15]). Counterintuitively, chat produces the highest entropy yet the highest acceptance rate. We attribute this divergence to the lexical predictability of RLHF-aligned register. These findings have direct implications for domain-aware speculation budgets and draft-model selection strategies. Index Terms: speculative decoding, large language model inference, tree attention, draft model, acceptance probability, LLM efficiency
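As a rough illustration of what an expected accepted length measures, the sketch below handles the simplified case of a single draft chain (not a tree) in which verification stops at the first rejected token. The function name and the per-depth probabilities are assumptions, and the paper's tree-based estimate may be computed differently.

```python
def expected_accepted_length(accept_probs):
    """Expected number of accepted draft tokens for one speculation chain.

    accept_probs[d] is the probability that the token at depth d is
    accepted given that all shallower tokens were accepted. A token only
    counts if every token before it was accepted, so the expectation is
    the sum over depths of the running product of these probabilities.
    """
    expected, survive = 0.0, 1.0
    for p in accept_probs:
        survive *= p          # probability of reaching and accepting this depth
        expected += survive
    return expected
```

For example, conditional acceptance probabilities of 0.6, 0.5, and 0.4 give 0.6 + 0.3 + 0.12, or about 1.02 tokens per step, which shows how modest per-token acceptance rates translate into an expected length only just above 1.0.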

[94] ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints cs.AI | cs.CL | cs.CV | cs.RO

Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen

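TL;DR: This paper introduces DynAfford, a benchmark that evaluates the commonsense planning ability of embodied agents under unspecified affordance constraints in dynamic environments, and develops ADAPT, a module that augments existing planners with affordance reasoning.

Details

Motivation: Existing methods usually execute instructions directly without assessing whether the target object can actually be manipulated (its affordance), so they cannot handle changes in object affordances in dynamic environments.

Result: Experiments show that integrating ADAPT significantly improves planner robustness and task success in both seen and unseen environments. A domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o).

Insight: The novelty lies in a dynamic affordance benchmark and a plug-and-play affordance reasoning module, underscoring the importance of task-aligned affordance grounding for embodied agents adapting to dynamic environments.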

Abstract: Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

[95] Hybrid Decision Making via Conformal VLM-generated Guidance cs.AI | cs.CL | cs.HC

Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

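TL;DR: This paper proposes ConfGuide, a new guidance method for hybrid decision making that uses conformal risk control to generate more succinct, targeted textual guidance for human decision makers, and demonstrates it on a real-world multi-label medical diagnosis task.

Details

Motivation: Guidance produced by existing learning-to-guide (LtG) methods usually covers all possible outcomes, which makes it long and hard to digest. This work generates more concise guidance to improve decision quality and reduce cognitive load.

Result: Empirical evaluation on a real-world multi-label medical diagnosis task shows that ConfGuide is promising and can effectively control the false negative rate.

Insight: The novelty lies in bringing conformal risk control into the hybrid decision-making guidance framework: guidance is targeted at a selected set of outcomes chosen to cap the false negative rate, which improves its practicality and digestibility.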

Abstract: Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.
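The outcome-selection step can be sketched with a generic conformal risk control calibration for multi-label prediction. This is an illustration under stated assumptions, not ConfGuide's code: the function names, the threshold grid, and the per-label scores are hypothetical, and every calibration example is assumed to have at least one true label.

```python
def false_negative_rate(scores, true_labels, threshold):
    """Fraction of true labels missed when keeping labels scoring >= threshold."""
    predicted = {k for k, s in enumerate(scores) if s >= threshold}
    return 1.0 - len(predicted & true_labels) / len(true_labels)

def calibrate_threshold(calibration, alpha, grid):
    """Pick the largest threshold whose adjusted empirical FNR is <= alpha.

    calibration: list of (scores, true_label_set) pairs.
    alpha: target cap on the expected false negative rate.
    grid: candidate thresholds (an assumption; any finite grid works).
    The adjustment n/(n+1) * risk + 1/(n+1) is the finite-sample
    correction used by conformal risk control for a loss bounded by 1.
    """
    n = len(calibration)
    for lam in sorted(grid, reverse=True):
        risk = sum(false_negative_rate(s, y, lam) for s, y in calibration) / n
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return lam
    # No candidate meets the target (e.g. too few calibration examples):
    # fall back to the most inclusive threshold, without a guarantee.
    return min(grid)
```

A higher threshold yields a smaller outcome set and hence shorter guidance, so choosing the largest threshold that still satisfies the risk bound is what trades succinctness against the cap on missed outcomes.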

[96] From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench cs.AI | cs.CL | cs.SD

Ke Xu, Yuhao Wang, Yu Wang

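TL;DR: This paper proposes ProVoice-Bench, the first benchmark framework designed specifically to evaluate proactive voice agents. It comprises four new tasks and 1,182 high-quality samples built with a multi-stage data synthesis pipeline, and is used to assess the proactive interaction capabilities of multimodal large language models.

Details

Motivation: Existing benchmarks focus mainly on reactive responses and overlook the complexity of proactive intervention and monitoring, so a dedicated framework is needed to evaluate the shift of voice agents from reactive to proactive.

Result: Evaluation of state-of-the-art multimodal LLMs shows a significant performance gap, particularly in over-triggering and reasoning, exposing the limitations of current models.

Insight: The novelty is the first evaluation benchmark for proactive voice agents, with a high-quality test set generated through multi-stage data synthesis, offering a roadmap for developing more natural, context-aware proactive agents.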

Abstract: Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

[97] OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis cs.AI | cs.CL | cs.CV | cs.HC

Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao

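TL;DR: This paper presents OpenMobile, an open-source framework for building open mobile agents by synthesizing high-quality task instructions and agent trajectories. It has two core components: a scalable task synthesis pipeline that builds a global memory from environment exploration and uses it to generate diverse, grounded instructions; and a policy-switching strategy that alternates between learner and expert models during trajectory rollout to capture the error-recovery data often missing from standard imitation learning. Agents trained on the resulting data achieve competitive results on three dynamic mobile agent benchmarks. On AndroidWorld in particular, fine-tuned Qwen2.5-VL and Qwen3-VL reach success rates of 51.7% and 64.7%, far surpassing existing open-data approaches.

Details

Motivation: Mobile agents powered by vision-language models show strong capabilities in automating mobile tasks, but leading models usually do not release their training data and are opaque about their task and trajectory synthesis methods, which hinders open research. This work aims to bridge the data gap with an open-source framework and data, and to facilitate broader mobile agent research.

Result: The approach achieves competitive results on three dynamic mobile agent benchmarks. On AndroidWorld, Qwen2.5-VL and Qwen3-VL fine-tuned on the synthetic data reach success rates of 51.7% and 64.7%, substantially exceeding existing open-data approaches. Analysis indicates that the gains stem from broad functionality coverage rather than benchmark overfitting.

Insight: The claimed contributions are (1) a scalable task synthesis pipeline that generates diverse, grounded instructions by building a global environment memory, and (2) a policy-switching trajectory generation strategy that alternates between learner and expert models to capture essential error-recovery data, addressing a gap in standard imitation learning. Viewed objectively, the framework provides high-quality, reproducible open data and synthesis methods for mobile agent research, and its transparent analyses and data release support openness and reproducibility in the field.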

Abstract: Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration and then leverages it to generate diverse and grounded instructions, and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models to capture essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.

[98] IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning cs.AI | cs.CL | cs.IR

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai

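TL;DR: This paper proposes IG-Search, a reinforcement learning framework for search-augmented reasoning. It introduces a step-level reward based on information gain, addressing two weaknesses of trajectory-level rewards: they cannot distinguish good search queries from poor ones, and their gradient signal vanishes when every sampled trajectory fails. The reward is computed from the model's own generation probabilities with no additional intermediate annotation, and the method outperforms baselines on multiple QA benchmarks.

Details

Motivation: Existing RL-based methods for search-augmented reasoning rely on trajectory-level rewards, which cannot distinguish precise search queries from vague or redundant ones within a rollout group and yield a near-zero gradient signal when every sampled trajectory fails.

Result: On seven single-hop and multi-hop QA benchmarks, IG-Search with Qwen2.5-3B achieves an average exact match (EM) of 0.430, outperforming the strongest trajectory-level baseline, MR-Search, by 1.6 points and the step-level method GiGPO by 0.9 points, with especially pronounced gains on multi-hop reasoning.

Insight: The core novelty is a step-level reward based on information gain, computed by comparing how much the retrieved documents raise the model's confidence in the correct answer relative to random documents, and applied for fine-grained credit assignment through per-token advantage modulation in GRPO. The method needs no external intermediate supervision or shared environment states across trajectories, relies only on the model's own generation probabilities, adds little computational overhead, and still provides a useful gradient signal when every sampled trajectory fails.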

Abstract: Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model’s confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy’s own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
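One plausible reading of the information-gain signal is a difference of log-likelihoods of the gold answer under two contexts. The sketch below assumes that form; the exact definition used by IG-Search is not given in the abstract, and the function names and the token-probability inputs are hypothetical.

```python
import math

def sequence_log_prob(token_probs):
    """Log-probability of an answer given its per-token probabilities."""
    return sum(math.log(p) for p in token_probs)

def information_gain(probs_with_retrieved, probs_with_random):
    """Assumed form of the step-level reward: how much more likely the
    gold answer becomes with the retrieved documents in context than
    with random documents in context.

    Both arguments are the policy's per-token probabilities for the same
    gold answer under the two contexts.
    """
    return sequence_log_prob(probs_with_retrieved) - sequence_log_prob(probs_with_random)
```

Under this reading, a query that retrieves useful documents gets a positive reward and a query no better than random retrieval gets a reward near zero, even if the final answer of the rollout is wrong, which is why the signal does not vanish when every trajectory fails.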

[99] Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation cs.AI | cs.CL

Ziyang Chen, Renbing Chen, Daowei Li, Jinzhi Liao, Jiashen Sun

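TL;DR: This paper proposes Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that simulates group-level user behavior to enable scalable counterfactual evaluation of merchant business strategies on Meituan. The framework mines transferable decision policies from behavioral trajectories as a shared alignment layer and combines an LLM-based reasoning branch with an ML-based fitting branch to address the challenges of information incompleteness and mechanism duality.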

Details

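Motivation: Building a trustworthy user-behavior simulator faces two structural challenges. Information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors are missing, and mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, which no single paradigm achieves alone.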

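Result: Deployed on the Meituan platform with 101 merchants and over 26,000 trajectories, PGHS achieves a group simulation error of 8.80%, improving over the best reasoning-based and fitting-based baselines by 45.8% and 40.9%, respectively.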

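Insight: The contribution is a dual-process hybrid framework that joins an LLM-based reasoning branch and an ML-based fitting branch through a shared policy alignment layer. The two branches correct each other, which mitigates over-rationalization, absorbs implicit regularities, and improves the accuracy and trustworthiness of the simulation.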

Abstract: Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, which no single paradigm achieves alone. We propose Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that mines transferable decision policies from behavioral trajectories and uses them as a shared alignment layer. This layer anchors an LLM-based reasoning branch that prevents over-rationalization and an ML-based fitting branch that absorbs implicit regularities. Group-level predictions from both branches are fused for complementary correction. We deploy PGHS on Meituan with 101 merchants and over 26,000 trajectories. PGHS achieves a group simulation error of 8.80%, improving over the best reasoning-based and fitting-based baselines by 45.8% and 40.9% respectively.

[100] Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding cs.AI | cs.CL

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff

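TL;DR: This paper proposes IRS (Incongruity-Resolution Supervision), a framework that supervises the structured reasoning process in multimodal humor understanding. It decomposes humor understanding into three components (incongruity modeling, resolution modeling, and preference alignment) and is validated on the New Yorker Cartoon Caption Contest (NYCC) dataset, where its models outperform existing baselines on caption matching and ranking and show good zero-shot generalization.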

Details

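Motivation: Existing work treats humor understanding (for example, on the NYCC benchmark) as a black-box prediction task and overlooks the structured reasoning behind it. This paper supervises intermediate reasoning steps so that the model learns an explicit path from visual perception to humorous interpretation, improving performance on tasks that require reasoning.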

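Result: On the NYCC caption matching and ranking tasks, IRS outperforms strong open and closed multimodal baselines at the 7B, 32B, and 72B parameter scales, with the largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks indicates that IRS learns generalizable reasoning patterns.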

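Insight: The core contribution is formalizing the classic incongruity-resolution theory of humor, together with expert captionist practice, into a learnable, structured supervision framework (IRS) that explicitly supervises the intermediate steps of humor reasoning. The results suggest that for reasoning-centric tasks, supervising the reasoning structure itself matters more than simply scaling up the model.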

Abstract: Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines across caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.

[101] Context Over Content: Exposing Evaluation Faking in Automated Judges cs.AI | cs.CL | cs.LG

Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar

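TL;DR: This paper examines an unverified assumption in the LLM-as-a-judge paradigm: that judges evaluate text purely on its semantic content, unaffected by contextual framing. Using a controlled experimental framework that holds the evaluated content strictly constant and varies only a brief sentence in the system prompt about the downstream consequences of the verdict, it finds that judge models show a consistent leniency bias: verdicts soften when the judge is told that low scores will lead to retraining or decommissioning of the evaluated model.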

Details

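Motivation: The goal is to test a key but unverified assumption of the LLM-as-a-judge paradigm, namely whether judges really evaluate on semantic content alone and are unaffected by contextual framing such as downstream consequences, and thereby to expose a potential vulnerability in automated evaluation.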

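Result: Across 18,240 controlled judgments from three different judge models, covering three established LLM safety and quality benchmarks, the study finds a consistent leniency bias. Verdicts soften when judges are told that low scores will cause model retraining or decommissioning, with the peak verdict shift reaching ΔV = -9.8 pp (a 30% relative drop in unsafe-content detection).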

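Insight: The contribution is the first systematic measurement of stakes signaling, a vulnerability in which a judge model's verdicts are implicitly influenced by information about downstream consequences even though its chain of thought never explicitly acknowledges that information. Standard chain-of-thought inspection is therefore insufficient to detect this kind of evaluation faking, which is an important caution for the reliability of automated evaluation.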

Abstract: The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model’s continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $\Delta V = -9.8$ pp (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge’s own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.
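
A verdict-shift measurement of the kind reported here can be sketched as follows. The sketch assumes that the shift is the difference in unsafe-verdict rate between a consequence-framed condition and a neutral baseline, expressed in percentage points. The abstract does not give the exact definition, so the function names and the toy verdicts are hypothetical and do not reproduce the paper's data.

```python
def unsafe_rate(verdicts):
    """Fraction of verdicts that flag the response as unsafe."""
    return sum(1 for v in verdicts if v == "unsafe") / len(verdicts)

def verdict_shift_pp(baseline, framed):
    """Shift in unsafe-verdict rate, in percentage points (assumed definition)."""
    return 100.0 * (unsafe_rate(framed) - unsafe_rate(baseline))

def relative_change_pct(baseline, framed):
    """The same shift expressed relative to the baseline unsafe-verdict rate."""
    base = unsafe_rate(baseline)
    return 100.0 * (unsafe_rate(framed) - base) / base

# Toy verdicts on identical content under two system-prompt framings.
baseline = ["unsafe"] * 5 + ["safe"] * 5   # neutral framing
framed = ["unsafe"] * 4 + ["safe"] * 6     # judge told low scores trigger retraining

shift = verdict_shift_pp(baseline, framed)
rel = relative_change_pct(baseline, framed)
```

Reporting both numbers separates an absolute shift in percentage points from the relative drop in detection, which is the distinction the abstract draws between -9.8 pp and 30%.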

cs.MM

[102] Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis cs.MM | cs.AI | cs.CL | cs.CV | cs.GR | cs.LG

Aizierjiang Aiersilan, Mohamad Koubeissi

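TL;DR: This paper proposes Neuro-Oracle, a three-stage framework for predicting epilepsy surgery outcomes. It encodes pre-to-post-operative MRI changes into a trajectory vector with a Siamese contrastive encoder, retrieves similar surgical trajectories from a historical archive, and uses a quantized large language model to generate an evidence-grounded natural-language prognosis. On the EPISURG dataset, the framework outperforms a single-timepoint baseline and produces structured justifications with no hallucinations observed.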

Details

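Motivation: Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Existing deep-learning approaches usually operate on static scans from a single pre-operative timepoint and ignore longitudinal morphological changes.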

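Result: Evaluation uses five-fold stratified cross-validation on the EPISURG dataset (N=268). Trajectory-based classifiers reach AUC values between 0.834 and 0.905, compared with 0.793 for a single-timepoint ResNet-50 baseline. The Neuro-Oracle agent (M5) reaches an AUC of 0.867, on par with purely discriminative trajectory classifiers, while producing structured justifications with no hallucinations observed. The Siamese Diversity Ensemble (M6) attains an AUC of 0.905 without using a language model.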

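Insight: The main contribution is a trajectory-aware retrieval-augmented generation (RAG) framework that encodes longitudinal pre-to-post-operative MRI changes into compact vectors for retrieval and uses a large language model to generate interpretable prognosis reports. The work combines dynamic trajectory modeling with an interpretable AI agent and offers a new direction for prognosis from medical imaging. However, the evaluation currently serves mainly as a proof of concept, and the network may be learning anatomical features of the resection cavity rather than true prognostic morphometry.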

Abstract: Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Conventional deep-learning approaches operate on static, single-timepoint pre-operative scans, omitting longitudinal morphological changes. We propose \emph{Neuro-Oracle}, a three-stage framework that: (i) distils pre-to-post-operative MRI changes into a compact 512-dimensional trajectory vector using a 3D Siamese contrastive encoder; (ii) retrieves historically similar surgical trajectories from a population archive via nearest-neighbour search; and (iii) synthesises a natural-language prognosis grounded in the retrieved evidence using a quantized Llama-3-8B reasoning agent. Evaluations are conducted on the public EPISURG dataset ($N{=}268$ longitudinally paired cases) using five-fold stratified cross-validation. Since ground-truth seizure-freedom scores are unavailable, we utilize a clinical proxy label based on the resection type. We acknowledge that the network representations may potentially learn the anatomical features of the resection cavities (i.e., temporal versus non-temporal locations) rather than true prognostic morphometry. Our current evaluation thus serves mainly as a proof-of-concept for the trajectory-aware retrieval architecture. Trajectory-based classifiers achieve AUC values between 0.834 and 0.905, compared with 0.793 for a single-timepoint ResNet-50 baseline. The Neuro-Oracle agent (M5) matches the AUC of purely discriminative trajectory classifiers (0.867) while producing structured justifications with zero observed hallucinations under our audit protocol. A Siamese Diversity Ensemble (M6) of trajectory-space classifiers attains an AUC of 0.905 without language-model overhead.
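
Stage (ii), retrieving similar trajectories by nearest-neighbour search, can be sketched in a few lines. The sketch assumes cosine similarity as the metric, which the abstract does not specify, and uses 3-dimensional toy vectors in place of the 512-dimensional trajectory vectors. The archive layout, case identifiers, and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, archive, k=3):
    """Return the k archived cases most similar to the query trajectory."""
    ranked = sorted(archive, key=lambda case: cosine(query, case["vec"]),
                    reverse=True)
    return ranked[:k]

# Toy archive: each entry pairs a trajectory vector with a case identifier.
archive = [
    {"id": "case_a", "vec": [1.0, 0.0, 0.0]},
    {"id": "case_b", "vec": [0.0, 1.0, 0.0]},
    {"id": "case_c", "vec": [0.9, 0.1, 0.0]},
]
query = [1.0, 0.05, 0.0]
top = retrieve(query, archive, k=2)
```

In a pipeline like the one described, the retrieved cases would then be passed to the language-model stage as evidence. An exhaustive scan is enough for an archive of a few hundred cases such as EPISURG.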

Jianxuan Yang, Xinyue Guo, Zhi Cheng, Kai Wang, Lipan Zhang

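TL;DR: ControlFoley is a unified video-to-audio generation framework that uses joint visual encoding, temporal-timbre decoupling, and modality-robust training to give precise control over video, text, and reference audio, and it shows superior generation quality and controllability under cross-modal conflict.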

Details

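Motivation: Existing video-to-audio generation methods have weak textual controllability under visual-text conflict and imprecise stylistic control because temporal and timbre information are entangled in the reference audio. The lack of standardized benchmarks also prevents systematic evaluation.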

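Result: ControlFoley achieves state-of-the-art performance on multiple V2A tasks (text-guided, text-controlled, and audio-controlled generation). On the VGGSound-TVC benchmark it shows superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and its performance is competitive with or better than an industrial V2A system.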

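Insight: The contributions are: a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability; temporal-timbre decoupling, which suppresses redundant temporal cues while preserving discriminative timbre features; a modality-robust training scheme with unified multimodal representation alignment and random modality dropout; and the VGGSound-TVC benchmark for evaluating textual controllability under visual-text conflict.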

Abstract: Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.

physics.comp-ph

[104] Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs physics.comp-ph | cs.AI | cs.CL | hep-th

Xingyang Yu, Yinghuan Zhang, Yufei Zhang, Zijun Cui

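TL;DR: This paper builds an expert-curated dataset of quantum field theory and string theory questions and introduces a five-level grading rubric to evaluate the tacit reasoning ability of large language models in highly abstract theoretical physics. LLMs perform well on explicit derivations within stable conceptual frames, but their performance degrades systematically on tasks that require reconstructing omitted reasoning steps or reorganizing representations to satisfy global consistency constraints.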

Details

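Motivation: The goal is to assess whether large language models can support research in highly abstract theoretical fields such as quantum field theory and string theory, and to address the fact that existing evaluation metrics cannot capture whether intermediate conceptual steps are correctly reconstructed or whether implicit structural constraints are respected.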

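Result: Across multiple contemporary LLMs, performance is near ceiling on explicit derivations within stable conceptual frames but degrades systematically on tasks that require reconstructing omitted reasoning steps or reorganizing representations.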

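Insight: The contribution is a compact expert-curated dataset and a five-level grading rubric for evaluating the tacit reasoning ability of LLMs. Viewed objectively, the study reveals an instability in how current LLMs select representations and shows that highly abstract theoretical physics can serve as a sensitive lens on the epistemic limits of current evaluation paradigms.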

Abstract: Large language models have demonstrated impressive performance across many domains of mathematics and physics. One natural question is whether such models can support research in highly abstract theoretical fields such as quantum field theory and string theory. Evaluating this possibility faces an immediate challenge: correctness in these domains is layered, tacit, and fundamentally non-binary. Standard answer-matching metrics fail to capture whether intermediate conceptual steps are properly reconstructed or whether implicit structural constraints are respected. We construct a compact expert-curated dataset of twelve questions spanning core areas of quantum field theory and string theory, and introduce a five-level grading rubric separating statement correctness, key concept awareness, reasoning chain presence, tacit step reconstruction, and enrichment. Evaluating multiple contemporary LLMs, we observe near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints. These failures are driven not only by missing intermediate steps, but by an instability in representation selection: models often fail to identify the correct conceptual framing required to resolve implicit tensions. We argue that highly abstract theoretical physics provides a uniquely sensitive lens on the epistemic limits of current evaluation paradigms.

cs.RO

[105] HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps cs.RO | cs.CV

Jongbin Lim, Taeyun Ha, Mingi Choi, Jisoo Kim, Byungjun Kim

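TL;DR: This paper introduces HRDexDB, a large-scale, multi-modal, high-fidelity dexterous grasping dataset containing grasp sequences from human hands and multiple robotic hands. The dataset covers 100 different objects and provides 1.4K grasping trials (both successes and failures), with high-precision spatiotemporal 3D ground-truth motion, high-resolution tactile signals, synchronized multi-view video, and egocentric video streams.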

Details

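Motivation: Existing datasets lack comprehensive, aligned multi-modal recordings of grasping trajectories from both human hands and multiple robotic hands, which limits research on cross-domain dexterous manipulation and multi-modal policy learning. HRDexDB aims to fill this gap and provide a foundational benchmark for research on physical interaction.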

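Result: HRDexDB is a large-scale dataset of 1.4K grasping trials with high-precision motion, visual, and tactile data. It serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.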

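Insight: The contribution is the first closely aligned multi-modal capture of human dexterity and robotic execution on the same target objects under comparable grasping motions. It integrates visual, kinematic, and tactile modalities and includes both successes and failures, which is valuable for learning robust grasping policies.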

Abstract: We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.

[106] Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees cs.RO | cs.CV

Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

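TL;DR: This paper proposes a vision-based framework for human pose estimation and motion prediction that provides uncertainty guarantees through conformal prediction, enabling certifiably safe human-robot collaboration.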

Details

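Motivation: Uncertainty in human motion prediction creates safety risks in human-robot collaboration. The paper addresses these risks to keep robot operation safe.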

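Result: Evaluated on recorded human motion data and in a real-world human-robot collaboration setting, the framework produces valid predictions with high probabilistic confidence.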

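Insight: The contribution is combining aleatoric uncertainty estimation with out-of-distribution detection and introducing conformal prediction sets that provide certifiable safety guarantees.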

Abstract: We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
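
The idea of a conformal prediction set for motion predictions can be illustrated with standard split conformal prediction. The abstract does not say which conformal procedure the authors use, so this is a generic sketch. The calibration errors, the miscoverage level `alpha`, and the function name are hypothetical.

```python
import math

def conformal_radius(calibration_errors, alpha=0.1):
    """Split conformal quantile: a radius r such that a new prediction error
    is at most r with probability at least 1 - alpha, assuming the new case
    is exchangeable with the calibration data."""
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this confidence level.
        return float("inf")
    return sorted(calibration_errors)[rank - 1]

# Toy calibration set: distances in metres between predicted and observed
# joint positions on 20 held-out motions.
errors = [0.01 * i for i in range(1, 21)]
radius = conformal_radius(errors, alpha=0.1)

# With only 5 calibration points, a 90% guarantee cannot be certified.
too_few = conformal_radius(errors[:5], alpha=0.1)
```

A safety layer could then treat a ball of this radius around each predicted joint position as the region the human may occupy. That use of the radius is an illustration, not a description of the paper's pipeline.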

cs.CR

[107] Robustness of Vision Foundation Models to Common Perturbations cs.CR | cs.CV

Hongbin Liu, Zhengyuan Jiang, Cheng Hong, Neil Zhenqiang Gong

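TL;DR: This paper presents the first systematic study of how robust vision foundation models are to common image perturbations such as JPEG compression and brightness or contrast adjustments. It proposes three robustness metrics, evaluates six industry-scale foundation models, and finds that they are generally non-robust and that perturbations degrade downstream task performance. It also proposes a fine-tuning approach that improves robustness.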

Details

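Motivation: The image embedding vectors produced by vision foundation models are affected by common editing operations, which can harm downstream task performance, yet robustness to such perturbations has not been studied systematically.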

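Result: The study evaluates six industry-scale foundation models (for example, from OpenAI and Meta) on nine categories of common perturbations and finds them generally non-robust. Perturbations lower downstream classification accuracy, and the robustness values can predict the performance impact. The proposed fine-tuning approach improves robustness without sacrificing utility.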

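Insight: The contribution is the first systematic study of foundation models' robustness to common perturbations, a set of robustness metrics analyzed against formally stated mathematical properties, and evidence that fine-tuning can improve robustness, which offers a new perspective on deploying these models safely.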

Abstract: A vision foundation model outputs an embedding vector for an image, which can be affected by common editing operations (e.g., JPEG compression, brightness, contrast adjustments). These common perturbations alter embedding vectors and may impact the performance of downstream tasks using these embeddings. In this work, we present the first systematic study on foundation models’ robustness to such perturbations. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties they satisfy or violate. Using these metrics, we evaluate six industry-scale foundation models (OpenAI, Meta) across nine common perturbation categories, finding them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict performance impacts. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.
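
One simple way to quantify how much a perturbation moves an embedding is the cosine similarity between the embeddings of the original and the perturbed image. The abstract does not define the paper's three metrics, so this score is an assumed stand-in and not one of them. The toy `embed` function, the brightness perturbation, and the pixel values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def robustness_score(embed, image, perturb):
    """Similarity between the embeddings of an image and its perturbed
    version; 1.0 means the perturbation left the embedding direction unchanged."""
    return cosine(embed(image), embed(perturb(image)))

def embed(pixels):
    """Toy stand-in for a foundation model: mean intensity and dynamic range."""
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]

def brighten(pixels, delta=10):
    """Brightness adjustment, clipped to the 8-bit range."""
    return [min(255, p + delta) for p in pixels]

image = [10, 20, 30]
score_identity = robustness_score(embed, image, lambda p: p)
score_bright = robustness_score(embed, image, brighten)
```

Averaging such a score over many images and perturbation strengths would give a per-model, per-perturbation robustness value that can be compared against downstream accuracy.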

Table of Contents

cs.CL

[1] MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios cs.CL | cs.AI

[2] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data cs.CL

[3] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews cs.CL

[4] Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text cs.CL

[5] SAGE Celer 2.6 Technical Card cs.CL | cs.AI

[6] Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning cs.CL | cs.AI

[7] CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization cs.CL | cs.AI

[8] MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes cs.CL | cs.AI

[9] EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation cs.CL | cs.AI

[10] Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation cs.CL

[11] APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI cs.CL | cs.AI | cs.IR

[12] The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models cs.CL | cs.AI | cs.CV

[13] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling cs.CL

[14] PeerPrism: Peer Evaluation Expertise vs Review-writing AI cs.CL

[15] NLP needs Diversity outside of ‘Diversity’ cs.CL

[16] StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation cs.CL | cs.AI

[17] Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models cs.CL | cs.AI

[18] SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models cs.CL

[19] Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement cs.CL | cs.AI

[20] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems cs.CL | cs.CV

[21] ClimateCause: Complex and Implicit Causal Structures in Climate Reports cs.CL | cs.AI

[22] Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding cs.CL | cs.AI

[23] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models cs.CL | cs.AI | cs.CV | cs.LG

[24] IE as Cache: Information Extraction Enhanced Agentic Reasoning cs.CL

[25] Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models cs.CL | cs.AI

[26] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events cs.CL

[27] From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning cs.CL

cs.CV

[28] QualiaNet: An Experience-Before-Inference Network cs.CV | eess.IV | q-bio.NC

[29] HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds cs.CV

[30] Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV

[31] Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos cs.CV

[32] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning cs.CV | cs.AI

[33] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images cs.CV

[34] Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers cs.CV | cs.LG

[35] Crowdsourcing of Real-world Image Annotation via Visual Properties cs.CV | cs.AI

[36] H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection cs.CV | cs.LG

[37] Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs cs.CV

[38] FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking cs.CV

[39] Controllable Video Object Insertion via Multiview Priors cs.CV | cs.AI

[40] DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration cs.CV

[41] Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors cs.CV

[42] Learning Adaptive Reasoning Paths for Efficient Visual Reasoning cs.CV | cs.CL

[43] M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection cs.CV

[44] TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation cs.CV | cs.MM | cs.SD

[45] MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models cs.CV

[46] Towards Design Compositing cs.CV

[47] Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models cs.CV

[48] CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation cs.CV | cs.LG

[49] Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting cs.CV | cs.AI | cs.LG

[50] Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding cs.CV

[51] The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment cs.CV

[52] G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval cs.CV

[53] HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet cs.CV | cs.LG | eess.IV

[54] Efficient closed-form approaches for pose estimation using Sylvester forms cs.CV | cs.RO

[55] OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism cs.CV

[56] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning cs.CV | cs.CL

[57] One-shot Compositional 3D Head Avatars with Deformable Hair cs.CV

[58] NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results cs.CV | cs.HC | cs.MM

[59] Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer’s Disease Neurodegeneration cs.CV

[60] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems cs.CV | cs.AI

[61] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry cs.CV | cs.AI

[62] FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection cs.CV

[63] RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models cs.CV | cs.AI | cs.CL | cs.MM

[64] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation cs.CV

[65] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards cs.CV | cs.AI

[66] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation cs.CV

[67] Implicit Neural Representations: A Signal Processing Perspective cs.CV

[68] Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection cs.CV

[69] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID cs.CV

[70] Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography cs.CV | cs.LG

[71] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos cs.CV

[72] KVNN: Learnable Multi-Kernel Volterra Neural Networks cs.CV

[73] OmniLight: One Model to Rule All Lighting Conditions cs.CV

[74] Boundary-Centric Active Learning for Temporal Action Segmentation cs.CV

[75] VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models cs.CV | cs.AI

[76] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression cs.CV

[77] Why Do Vision Language Models Struggle To Recognize Human Emotions? cs.CV | cs.AI