Table of Contents

cs.CL

[1] AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models cs.CL | cs.CV | cs.LG | eess.IV

Zijin Zhou, Songan Zhang

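TL;DR: This paper proposes AITP, a multimodal large language model for traffic accident responsibility allocation that strengthens reasoning with a multimodal chain-of-thought mechanism and integrates legal knowledge through retrieval-augmented generation. The authors also build the DecaTARA benchmark, which covers ten related tasks with nearly 68,000 annotated videos and about 196,000 question-answer pairs. Experiments show that AITP achieves state-of-the-art results on responsibility allocation, accident detection, and accident understanding.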

Details

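Motivation: Existing work focuses mostly on describing and interpreting traffic accidents and lacks deeper causal reasoning and integration of legal knowledge. Responsibility allocation is more challenging because it requires multi-step reasoning grounded in traffic regulations.

Result: Extensive experiments on the DecaTARA benchmark show that AITP achieves state-of-the-art performance on responsibility allocation, traffic accident detection, and traffic accident understanding.

Insight: The novelty lies in combining multimodal chain-of-thought with retrieval-augmented generation to support complex reasoning grounded in legal knowledge, and in building a large-scale, multi-task benchmark that evaluates the related capabilities in a unified way, establishing a new paradigm for reasoning-driven multimodal traffic analysis.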

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in Traffic Accident Detection (TAD) and Traffic Accident Understanding (TAU). However, existing studies mainly focus on describing and interpreting accident videos, leaving room for deeper causal reasoning and integration of legal knowledge. Traffic Accident Responsibility Allocation (TARA) is a more challenging task that requires multi-step reasoning grounded in traffic regulations. To address this, we introduce AITP (Artificial Intelligence Traffic Police), a multimodal large language model for responsibility reasoning and allocation. AITP enhances reasoning via a Multimodal Chain-of-Thought (MCoT) mechanism and integrates legal knowledge through Retrieval-Augmented Generation (RAG). We further present DecaTARA, a decathlon-style benchmark unifying ten interrelated traffic accident reasoning tasks with 67,941 annotated videos and 195,821 question-answer pairs. Extensive experiments show that AITP achieves state-of-the-art performance across responsibility allocation, TAD, and TAU tasks, establishing a new paradigm for reasoning-driven multimodal traffic analysis.


[2] TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping cs.CL

Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher

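TL;DR: This paper proposes TRACES, a framework that tags the type of each reasoning step in real time to enable adaptive, cost-efficient early stopping and cut redundant generation in Language Reasoning Models (LRMs). The method substantially reduces token usage on several mathematical and knowledge reasoning benchmarks while keeping accuracy comparable to standard generation.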

Details

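Motivation: Current LRMs reason inefficiently and over-generate verification and reflection steps, and the role that different step types play in producing correct answers is underexplored. The paper aims to optimize the reasoning process by tagging reasoning steps and stopping early to reduce cost.

Result: On MATH500, GSM8K, AIME, MMLU, and GPQA, TRACES reduces token usage by 20% to 50% while maintaining accuracy comparable to standard generation.

Insight: The novelty is monitoring reasoning behavior by tagging step types in real time. The authors find that models tend to shift their reasoning pattern after reaching a correct answer, so step types can serve as an interpretable early-stopping criterion. The result is a lightweight framework that improves reasoning efficiency and reduces compute overhead.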

Abstract: The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, is largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation.
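To make the idea of a step-type stopping criterion concrete, here is a minimal sketch. It is not the TRACES implementation: the tag names ("answer", "verification", "reflection"), the `patience` parameter, and the rule itself are assumptions chosen for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Assumed tag vocabulary; the abstract does not list the actual step types.
REDUNDANT_TAGS = {"verification", "reflection"}


def steps_to_keep(tagged_steps: List[Tuple[str, str]], patience: int = 2) -> int:
    """Return how many leading steps to keep before stopping generation.

    Hypothetical rule: once an "answer" step has appeared, stop after
    `patience` consecutive verification/reflection steps.
    """
    seen_answer = False
    redundant_run = 0
    for index, (tag, _text) in enumerate(tagged_steps):
        if tag == "answer":
            seen_answer = True
            redundant_run = 0
        elif seen_answer and tag in REDUNDANT_TAGS:
            redundant_run += 1
            if redundant_run >= patience:
                return index + 1  # keep everything up to and including this step
        else:
            redundant_run = 0
    return len(tagged_steps)  # criterion never fired; keep the full trace


trace = [
    ("planning", "Set up the equation."),
    ("computation", "Solve for x."),
    ("answer", "x = 4"),
    ("verification", "Check: 2 * 4 = 8."),
    ("reflection", "Wait, let me reconsider."),
    ("verification", "Check again: 2 * 4 = 8."),
    ("answer", "x = 4"),
]
kept = steps_to_keep(trace, patience=2)
print(f"keep {kept} of {len(trace)} steps")
```

In a streaming setting the same counter would be updated as each step is tagged, and generation would be cut off as soon as the rule fires.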


[3] Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting cs.CL | cs.LG

Alexander Weers, Daniel Rueckert, Martin J. Menten

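TL;DR: This paper proposes token reweighting to improve sample efficiency in medical report generation. A weighted loss function makes the model focus during training on semantically salient tokens of clinical importance, which improves training efficiency when data is scarce. In ophthalmological report generation experiments, the method reaches report quality comparable to standard cross-entropy loss with as little as one tenth of the training data.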

Details

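Motivation: High-quality annotated data for medical report generation is scarce, which limits the training of vision-language models. Standard cross-entropy loss treats all token prediction errors equally and ignores differences in clinical importance, so data is used inefficiently.

Result: On an ophthalmological report generation benchmark, the method improves training efficiency across multiple data scales and reaches report quality comparable to the baseline with as little as one tenth of the training data, demonstrating its effectiveness in data-scarce settings.

Insight: The novelty is a weighted loss function that shifts training focus from uniform token prediction to clinically critical semantic tokens. It is a simple but effective strategy for improving data efficiency, and it may transfer to other vision-language tasks that need to attend to specific semantic information.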

Abstract: Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
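The contrast between standard and reweighted cross-entropy can be sketched in a few lines. The example tokens, probabilities, and the weight of 3.0 are invented for illustration, and normalizing by the sum of weights is an assumption; the abstract does not say how the paper assigns or normalizes weights.

```python
import math
from typing import Sequence


def weighted_cross_entropy(log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    log_probs[i] is the model's log-probability of the correct token i.
    With all weights equal this reduces to standard cross-entropy.
    """
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(-lp * w for lp, w in zip(log_probs, weights)) / total_weight


# Hypothetical report fragment: the model is unsure about the clinical term.
tokens = ["the", "retina", "shows", "drusen"]
log_probs = [math.log(p) for p in (0.9, 0.6, 0.8, 0.2)]

uniform_weights = [1.0, 1.0, 1.0, 1.0]
salient_weights = [1.0, 1.0, 1.0, 3.0]  # assumed up-weighting of "drusen"

standard_loss = weighted_cross_entropy(log_probs, uniform_weights)
reweighted_loss = weighted_cross_entropy(log_probs, salient_weights)
print(f"standard={standard_loss:.4f} reweighted={reweighted_loss:.4f}")
```

Because the poorly predicted clinical token carries more weight, the reweighted loss is larger here, so its error contributes more to the gradient than it would under the uniform loss.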


[4] Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents cs.CL

Yiyang Lu, Woong Shin, Ahmad Maroof Karimi, Feiyi Wang, Jie Ren

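TL;DR: This paper proposes IVG (Introspective and Interactive Visual Grounding), a framework that addresses the misread values, hallucinations, and confusion of overlapping elements that vision-language models exhibit when interpreting charts. It combines introspective queries against the chart's underlying specification with view interaction that resolves visual ambiguity, and validates the approach on a new benchmark named iPlotBench.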

Details

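Motivation: Current pixel-based vision-language models suffer from a "Pixel-Only Bottleneck" when reading charts: they treat interactive charts as static images and cannot access the structured specification that encodes exact values, which leads to frequent errors.

Result: On iPlotBench, which contains 500 interactive Plotly figures and 6,706 binary questions, introspection improves data reconstruction fidelity, and combining introspection with interaction achieves the highest QA accuracy (0.81), with a 6.7% gain on questions involving overlapping geometries.

Insight: The core novelty is moving beyond pixel-only understanding: spec-grounded introspection queries the chart's underlying specification for deterministic evidence, and view-grounded interaction actively changes the view to resolve visual ambiguity. This offers a new direction for building more reliable, interactive visualization agents.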

Abstract: Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
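As a rough illustration of what "querying the underlying specification" can mean, the sketch below reads an exact value from a Plotly-style figure dictionary instead of estimating it from pixels. The helper `lookup_value` and the sample figure are hypothetical; only the general `data` / `name` / `x` / `y` layout follows Plotly's figure JSON, and the paper's actual introspection interface is not described in the abstract.

```python
from typing import Any, Dict, Optional


def lookup_value(figure_spec: Dict[str, Any], trace_name: str, x_value: Any) -> Optional[float]:
    """Return the exact y value for (trace_name, x_value), or None if absent."""
    for trace in figure_spec.get("data", []):
        if trace.get("name") != trace_name:
            continue
        for x, y in zip(trace.get("x", []), trace.get("y", [])):
            if x == x_value:
                return y
    return None


# Hypothetical two-series bar chart whose bars are hard to tell apart by eye.
figure_spec = {
    "data": [
        {"type": "bar", "name": "2023", "x": ["A", "B"], "y": [12.5, 7.0]},
        {"type": "bar", "name": "2024", "x": ["A", "B"], "y": [12.4, 9.1]},
    ],
    "layout": {"title": {"text": "Sales by region"}},
}

value_2023 = lookup_value(figure_spec, "2023", "A")
value_2024 = lookup_value(figure_spec, "2024", "A")
print(value_2023, value_2024)
```

The two values for region "A" differ by 0.1, a gap that is easy to misread from a rendered image but trivial to recover from the specification.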


[5] Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification cs.CL | cs.AI

Jiho Noh, Mukhesh Raghava Katragadda, Raymond Carl, Soon Lee

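TL;DR: This paper presents an automated discourse analysis system (ADAS) for science classrooms that uses joint multi-task learning to classify teacher and student utterances along two dimensions: Utterance Type (UT) and Reasoning Component (RC). To address label imbalance among minority classes, it uses stratified resplitting, LLM-based synthetic data augmentation for minority classes, and a dual-probe-head RoBERTa-base classifier. Experiments show that the approach improves minority-class recognition and reveal that teacher Feedback-with-Question moves are the most consistent antecedents of student inferential responses.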

Details

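Motivation: Analyzing students' reasoning patterns in science classrooms is essential for understanding how knowledge is constructed and for improving instructional practice, but manually coding classroom discourse at scale is prohibitively expensive, so an automated solution is needed.

Result: On utterance classification, a zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC; with the proposed method, performance improves, particularly on UT minority-class recognition. In addition, discourse pattern analyses (co-occurrence profiling, Cognitive Complexity Index computation, lag-sequential analysis, and others) show a strong association between teacher Feedback-with-Question moves and student inferential responses.

Insight: The contributions are: (1) a joint multi-task learning framework for classroom discourse analysis that classifies utterance type and reasoning component together; (2) stratified resplitting and targeted LLM-based data augmentation that mitigate class imbalance; and (3) multi-dimensional discourse pattern analyses (such as CCI and lag-sequential analysis) that go beyond classification alone and yield richer instructional insight.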

Abstract: Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UT×RC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.


[6] Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue cs.CL | cs.AI | cs.HC

Biswesh Mohapatra, Giovanni Duca, Laurent Romary, Justine Cassell

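TL;DR: This paper proposes an active visual scaffolding framework to address blurred representations of shared context in situated dialogue. The framework incrementally converts dialogue state into a persistent visual history that supports later context-grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization outperforms full-dialogue reasoning, and that visual scaffolding adds further gains by reducing representational blur and enforcing concrete scene commitments. Textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting achieves the best overall performance.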

Details

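Motivation: Current conversational agents struggle to maintain a representation of shared context beyond the immediate context window in situated dialogue, so similar but distinct entities collapse into interchangeable descriptions, a failure the authors call "representational blur". Inspired by the role of mental imagery in human reasoning, the paper explores whether conversational agents can construct depictive intermediate representations to address these limitations.

Result: Evaluation on the IndiRef benchmark shows that incremental externalization alone outperforms full-dialogue reasoning, and visual scaffolding brings additional gains by reducing representational blur and enforcing concrete scene commitments. A hybrid multimodal setting achieves the best overall performance.

Insight: The novelty is an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history to mitigate representational blur. The core insight is that conversational agents benefit from an explicitly multimodal representation of shared context that integrates depictive and propositional information, which offers a new direction for building more robust situated dialogue systems.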

Abstract: Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call "representational blur", in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.


[7] On Reasoning Behind Next Occupation Recommendation cs.CL | cs.AI | cs.IR

Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang, Yao Lu

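TL;DR: This paper proposes a two-stage reasoning approach to improve large language models at predicting a user's next occupation. A reason generator first produces a "reason" that explains the user's preferences from their past education and career history, and this reason is then fed to an occupation predictor that recommends the next occupation. Because LLMs are not aligned with career paths or with the unobserved reasons behind career decisions, the authors use LLM-as-a-Judge to obtain high-quality oracle reasons and fine-tune small LLMs on them, optimizing both reason generation and occupation prediction.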

Details

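Motivation: LLMs are not aligned with career paths or the latent reasons behind career decisions, which limits their performance on occupation recommendation.

Result: Extensive experiments show that the approach improves LLM accuracy on next-occupation prediction, making it comparable to fully supervised methods and better than unsupervised methods; that a single LLM fine-tuned to perform both reason generation and occupation prediction outperforms two LLMs fine-tuned separately for the two tasks; and that prediction accuracy depends on the quality of the generated reasons.

Insight: The novelty is decomposing occupation recommendation into two interpretable steps, reason generation and prediction, and using LLM-as-a-Judge to obtain high-quality oracle reasons for aligning and fine-tuning the model, which improves both reasoning and prediction. This two-stage, reasoning-based fine-tuning approach offers a new way to improve LLM interpretability and accuracy on structured prediction tasks.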

Abstract: In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user using his/her past education and career history. The reason summarizes the user’s preference and is used as the input of an occupation predictor to recommend the user’s next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs’ accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.


[8] EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval cs.CL | cs.AI

Julian Acuna

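TL;DR: This paper introduces EngramaBench, a benchmark for evaluating long-term conversational memory built from five personas, one hundred multi-session conversations, and one hundred fifty queries. It compares Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system; all systems use the same GPT-4o answering model to isolate the effect of memory architecture. GPT-4o full-context achieves the highest composite score, while Engrama beats full-context prompting on cross-space reasoning.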

Details

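Motivation: LLM assistants need to accumulate information across many sessions and then remember and reason over it in the long term, but existing benchmarks fall short in evaluating structured, long-term memory.

Result: On EngramaBench, GPT-4o full-context prompting achieves the highest composite score (0.6186). Engrama scores 0.5367 globally but scores higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291). Mem0 is the cheapest but substantially weaker (0.4809). Ablations show that the components behind Engrama's cross-space advantage trade off against the global composite score.

Insight: The novelty is EngramaBench, a structured benchmark dedicated to long-term conversational memory, together with the graph-structured memory system Engrama evaluated against it. The analysis suggests that structured memory such as a graph can outperform plain full-context or vector-retrieval approaches on specific reasoning tasks such as cross-space reasoning, but that this comes with a systems-level trade-off against global performance, which is a useful insight for memory architecture design.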

Abstract: Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama’s cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.


[9] Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation cs.CL | cs.AI

Hanwen Gu, Chao Guo, Junle Wang, Wenda Xie, Yisheng Lv

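TL;DR: This paper proposes PLOTTER, a framework that plans narratives on graph structures (an event graph and a character graph) rather than on conventional sequential text, to address LLM weaknesses in global coherence, logical consistency, and smooth character development in complex narrative generation. It runs an Evaluate-Plan-Revise cycle that diagnoses and repairs graph topology issues under strict logical constraints, optimizing causality and the narrative skeleton before generating the complete context.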

Details

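Motivation: Existing LLM narrative generation methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, and often produce monotonous scripts with structural fractures. A planning approach that goes beyond text is therefore needed to strengthen long-context reasoning.

Result: Experiments show that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios, supporting the claim that planning narratives on graph structures rather than on text is crucial for improving LLM long-context reasoning in complex narrative generation.

Insight: The novelty is shifting narrative planning from sequential text to structural graph representations and introducing an Evaluate-Plan-Revise cycle with logical constraints to diagnose and repair graph topology, which provides a reusable graph-based planning paradigm for strengthening complex reasoning and structured generation in LLMs.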

Abstract: While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations, rather than directly on text, is crucial to enhancing the long-context reasoning of LLMs in complex narrative generation.


[10] When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors cs.CL

Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu

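TL;DR: This paper proposes two new metrics, Response Pattern Similarity (RPS) and Action Graph Similarity (AGS), to quantify the non-mandatory, distillation-induced homogenization in the tool-use behavior of LLM agents. Evaluating 18 models on τ-Bench and τ²-Bench, the study finds that within-family model pairs score higher on AGS than cross-family pairs, and confirms that AGS separates teacher-specific behavioral convergence from general performance improvement.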

Details

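Motivation: Model distillation is a primary driver of the rapid progress of LLM agents, but it often leads to behavioral homogenization. Existing metrics cannot distinguish behaviors that are required for task success from non-mandatory patterns that reflect a model's own preferences, so new quantitative tools are needed to diagnose behavioral convergence in the agent ecosystem.

Result: On τ-Bench and τ²-Bench, 18 models from 8 providers are evaluated against Claude Sonnet 4.5 as the reference. Within-family model pairs score 5.9 percentage points higher on AGS than cross-family pairs; Kimi-K2 reaches 82.6% node similarity (S_node) and 94.7% dependency similarity (S_dep), exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491).

Insight: The novelty is two complementary metrics that focus on non-mandatory behavioral patterns (RPS for verbal alignment, AGS for tool-use habits modeled as directed graphs), giving a new, discriminative diagnostic signal for behavioral homogenization in the agent ecosystem. Modeling tool-use behavior as graphs and comparing their similarity is a novel and highly interpretable analysis method.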

Abstract: Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model’s autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on τ-Bench and τ²-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% S_node and 94.7% S_dep, exceeding Anthropic’s own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
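The abstract does not define how the node and dependency scores are computed, so the following is only a sketch of one plausible way to compare tool-use traces as directed graphs. Treating consecutive calls as edges and scoring with Jaccard overlap are assumptions made for illustration, as are the tool names.

```python
from typing import List, Set, Tuple


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def action_graph(tool_calls: List[str]) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Build a directed graph: nodes are tools, edges link consecutive calls."""
    nodes = set(tool_calls)
    edges = set(zip(tool_calls, tool_calls[1:]))
    return nodes, edges


def graph_similarity(trace_a: List[str], trace_b: List[str]) -> Tuple[float, float]:
    """Return (node overlap, edge overlap) for two tool-call traces."""
    nodes_a, edges_a = action_graph(trace_a)
    nodes_b, edges_b = action_graph(trace_b)
    return jaccard(nodes_a, nodes_b), jaccard(edges_a, edges_b)


# Hypothetical traces from two agents solving the same task.
agent_a = ["get_user", "get_order", "refund"]
agent_b = ["get_user", "refund"]

node_score, edge_score = graph_similarity(agent_a, agent_b)
print(f"node overlap={node_score:.3f} edge overlap={edge_score:.3f}")
```

The two agents share most tools but no call-to-call transitions, which shows why a node-level score and a dependency-level score can diverge.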


[11] Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts cs.CL | cs.AI | cs.CV | cs.LG | cs.MA

Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong

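TL;DR: This paper introduces PolyChartQA, a mid-scale benchmark for question answering over multi-chart images, containing 534 multi-chart images and 2,694 QA pairs. It evaluates nine state-of-the-art multimodal language models on the dataset and finds that human-authored questions lower LLM-based accuracy by 27.4% compared with model-generated questions, while the proposed prompting method yields a 5.39% accuracy gain.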

Details

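Motivation: Real-world scenarios often require interpreting several related charts together to extract deeper information, but multi-chart image understanding has not been studied extensively and lacks dedicated datasets and evaluation benchmarks.

Result: Nine state-of-the-art multimodal language models were tested on PolyChartQA. Human-authored questions lead to a 27.4% drop in LLM-based accuracy, and the proposed prompting method improves accuracy by 5.39%.

Insight: The novelty is a benchmark dataset dedicated to multi-chart question answering and a systematic analysis of how question type, difficulty, question source, and structural characteristics of multi-charts affect model performance; the proposed prompting method improves performance on multi-chart understanding.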

Abstract: Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Understanding of multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-Accuracy gain with our proposed prompting method.


[12] Decoupled DiLoCo for Resilient Distributed Pre-training cs.CL

Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Nova Fallen

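TL;DR: This paper proposes Decoupled DiLoCo, a distributed pre-training framework that addresses the hardware failures, transient slowdowns, and synchronization overhead caused by tight coupling in the conventional SPMD paradigm. It decouples compute into multiple independent "learners" that optimize locally, and uses asynchronous communication, minimum-quorum aggregation, an adaptive grace window, and dynamic token-weighted merging to achieve high training efficiency in failure-prone environments while preserving model performance.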

Details

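Motivation: Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, whose tight coupling lets transient slowdowns, hardware failures, and synchronization overhead stall the entire computation and waste compute at scale. Existing methods such as DiLoCo reduce communication bandwidth but remain fundamentally synchronous and vulnerable to system stalls.

Result: In failure-prone environments with millions of simulated chips, the method significantly improves training efficiency with strictly zero global downtime. It maintains competitive model performance on text and vision tasks for both dense and mixture-of-experts architectures.

Insight: The core novelty is decoupling the DiLoCo framework to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Its asynchronous communication and fault-tolerant aggregation (minimum quorum and dynamic token-weighted merging) draw on "chaos engineering" and provide a highly resilient, highly available approach to large-scale distributed training.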

Abstract: Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-expert architectures.
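The aggregation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's synchronizer: the update format (`delta`, `tokens`), the function name, and weighting each learner by its share of processed tokens are guesses at what "token-weighted merging" means, and the adaptive grace window is left out (the caller is assumed to pass only the updates that arrived in time).

```python
from typing import Dict, List, Optional


def merge_updates(updates: List[Dict], min_quorum: int) -> Optional[List[float]]:
    """Merge parameter deltas from the learners that reported in time.

    Each update is {"delta": [float, ...], "tokens": int}. Returns None when
    fewer than `min_quorum` learners reported, so the caller can keep waiting
    instead of stalling on failed or straggling learners.
    """
    if len(updates) < min_quorum:
        return None
    total_tokens = sum(update["tokens"] for update in updates)
    if total_tokens <= 0:
        return None
    merged = [0.0] * len(updates[0]["delta"])
    for update in updates:
        weight = update["tokens"] / total_tokens  # learners that saw more data count more
        for j, value in enumerate(update["delta"]):
            merged[j] += weight * value
    return merged


# Hypothetical round: three learners exist, but one is straggling.
arrived = [
    {"delta": [1.0, 0.0], "tokens": 300},
    {"delta": [0.0, 1.0], "tokens": 100},
]
merged = merge_updates(arrived, min_quorum=2)
print(merged)
```

With a quorum of two, the round completes without the third learner; raising `min_quorum` to three would make the function return `None` and defer the merge.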


[13] Reasoning Primitives in Hybrid and Non-Hybrid LLMs cs.CL | cs.AI

Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

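TL;DR: This paper decomposes LLM reasoning into two primitives, recall and state tracking, and compares attention-only Transformer architectures with hybrid architectures that combine attention-based retrieval with recurrent state updates on tasks that require both. It finds that reasoning augmentation substantially extends the range over which models remain effective, and that the hybrid architecture is more robust on tasks with increasing sequential dependence.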

Details

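Motivation: The motivation is to move past the view of LLM reasoning as a monolithic capability, study its underlying primitives (recall and state tracking), and test whether hybrid architectures are better suited than attention-only models to tasks that require both.

Result: On controlled tasks that mix state tracking and recall, reasoning augmentation gives the largest overall improvement. The hybrid reasoning model stays more robust on certain tasks as sequential dependence increases, while the Transformer reasoning model degrades sharply once task difficulty exceeds a certain threshold.

Insight: The novelty is decomposing reasoning into primitives and comparing architectures, which shows that explicit reasoning tokens and architectural inductive biases act at different levels of the computation: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. This offers a new perspective for designing more efficient reasoning models.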

Abstract: Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives (state-based recall). Across tasks, we observe that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also observe that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model’s effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.


[14] OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving cs.CL

Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao

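TL;DR: This paper introduces OptiVerse, a comprehensive benchmark for optimization problem solving with 1,000 curated problems that span neglected domains including stochastic optimization, dynamic optimization, game optimization, and optimal control, across three difficulty levels. Experiments with 22 LLMs of different sizes show sharp performance degradation on hard problems, where even advanced models such as GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Error analysis identifies modeling and logic errors as the primary bottleneck, and the authors propose a Dual-View Auditor Agent that improves the LLM modeling process without significant time overhead.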

Details

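Motivation: Existing benchmarks are largely confined to mathematical programming and combinatorial optimization and do not cover stochastic optimization, dynamic optimization, game optimization, or optimal control, which hinders comprehensive evaluation of LLMs on complex optimization problems.

Result: On OptiVerse, the 22 LLMs perform poorly on hard problems, where even advanced models such as GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. The proposed Dual-View Auditor Agent improves the accuracy of the LLM modeling process without introducing significant time overhead.

Insight: The novelty is OptiVerse, a comprehensive benchmark covering several neglected optimization domains, together with an error analysis showing that modeling and logic errors are the main bottleneck for LLMs on optimization problems, and a Dual-View Auditor Agent proposed to mitigate it.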

Abstract: While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.


[15] AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use cs.CL

Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan

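TL;DR: This paper introduces AgenticQwen, a family of small agentic language models trained with a multi-round reinforcement learning framework that combines reasoning RL and agentic RL, using dual data flywheels that automatically generate increasingly complex tasks. The models target industrial demand for low-cost, low-latency agents capable of multi-step reasoning and tool use; they perform well on public benchmarks and in an industrial agent system, narrowing the gap with much larger models on search and data analysis tasks.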

Details

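Motivation: Modern industrial applications increasingly need language models that act as agents and perform multi-step reasoning and tool use in real-world settings. These tasks usually run under strict cost and latency constraints, which makes small agentic models highly attractive.

Result: AgenticQwen achieves strong performance on multiple agentic benchmarks, and in an industrial agent system it narrows the gap with much larger models on search and data analysis tasks.

Insight: The novelty is a multi-round RL training framework that combines reasoning RL and agentic RL, with dual data flywheels: the reasoning flywheel raises task difficulty by learning from errors, and the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. This data synthesis and training approach can improve small models on complex tasks.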

Abstract: Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba-pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi-sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.


[16] Language as a Latent Variable for Reasoning Optimization cs.CL

Linjuan Wu, Haoran Wei, Jialong Tang, Shuang Luo, Baosong Yang

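TL;DR: This paper proposes that language can act as a latent variable for reasoning optimization. A polyglot thinking experiment shows that non-English responses sometimes perform better on reasoning tasks, and on that basis the paper introduces polyGRPO, a reinforcement learning framework that treats language variation as an implicit exploration signal. Trained on only a small set of math problems without chain-of-thought annotations, it improves the base model's accuracy on English and multilingual reasoning benchmarks and generalizes across tasks.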

Details

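Motivation: The work addresses the English-centric bias of LLMs on reasoning tasks and explores how language, as a latent variable, structurally modulates the model's internal inference pathways rather than serving only as an output medium.

Result: Absolute accuracy improves by 6.72% on four English reasoning test sets and by 6.89% on the multilingual benchmark. Although trained only on math data, the model surpasses the base model on an English commonsense reasoning task by 4.9%, showing cross-task generalization.

Insight: The work treats language as a latent variable that modulates reasoning pathways and proposes polyGRPO, which generates preference data under multilingual conditions for online reinforcement learning and, without chain-of-thought annotations, expands the model's latent reasoning space and improves performance.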

Abstract: As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model’s internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model’s latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and 6.89% on their multilingual benchmark. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model’s latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.
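For readers unfamiliar with the "group relative" part of the name: in GRPO, each sampled response is scored relative to the other responses drawn for the same prompt, by subtracting the group mean reward and dividing by the group standard deviation. The sketch below shows only that normalization. The mixed-language group, the 0/1 rewards, and the use of the population standard deviation are assumptions for illustration; how polyGRPO actually forms its groups and rewards is not detailed in the abstract.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical group for one math problem: responses sampled under different
# language conditions, rewarded 1.0 if the final answer is correct.
group = [
    ("english-constrained", 0.0),
    ("chinese-constrained", 1.0),
    ("spanish-constrained", 1.0),
    ("unconstrained", 1.0),
]
advantages = group_relative_advantages([reward for _, reward in group])
for (condition, _), advantage in zip(group, advantages):
    print(f"{condition}: {advantage:+.3f}")
```

In this toy group the incorrect English response receives a negative advantage and the correct responses a positive one, which is the sense in which language variation can act as an exploration signal.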


[17] Process Supervision via Verbal Critique Improves Reasoning in Large Language Models cs.CL | cs.AI

Hao-Yuan Chen

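TL;DR: This paper proposes Verbal Process Supervision (VPS), a training-free inference-time scaling framework that uses structured natural-language critique from a stronger supervisor model to guide an iterative generate-critique-refine loop and improve LLM reasoning. On GPQA Diamond, AIME 2025, and LiveCodeBench V6, VPS substantially improves model performance and establishes critique granularity as a new axis of inference-time scaling.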

Details

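Motivation: Existing inference-time scaling for LLMs has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). This paper explores a fourth axis, the granularity of external verbal supervision, that is, improving the reasoning process through finer-grained natural-language feedback.

Result: On GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) with VPS (R=4) reaches 94.9%, surpassing the 94.1% state of the art without gradient updates. On AIME 2025, VPS achieves strong weak-actor rescue, raising scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). At matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp on GPQA and +8.3 pp on LiveCodeBench.

Insight: The core novelty is proposing critique granularity as a new axis of inference-time scaling and validating it with the training-free VPS framework. The key insight is that performance gains correlate strongly with the supervisor-actor capability gap (Pearson r=0.90) and that performance degrades when errors cannot be expressed in language (for example, code synthesis), which motivates future hybrid strategies that combine verbal and executable methods.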

Abstract: Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
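
The generate-critique-refine loop with a round budget R can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `actor` and `supervisor` callables, the convention that a critique of `None` means "no issues found", and the choice not to count the initial generation against the budget are all assumptions.

```python
from typing import Callable, Optional

# Hypothetical interfaces: the actor maps (problem, previous answer, critique)
# to an answer; the supervisor maps (problem, answer) to a critique or None.
Actor = Callable[[str, Optional[str], Optional[str]], str]
Supervisor = Callable[[str, str], Optional[str]]


def verbal_process_supervision(
    problem: str, actor: Actor, supervisor: Supervisor, max_rounds: int = 4
) -> str:
    """Generate once, then critique and refine for at most `max_rounds` rounds."""
    answer = actor(problem, None, None)  # initial generation
    for _ in range(max_rounds):
        critique = supervisor(problem, answer)
        if critique is None:  # assumed convention: nothing left to fix
            break
        answer = actor(problem, answer, critique)  # refine using the critique
    return answer


# Toy stand-ins for the two models, for demonstration only.
def toy_actor(problem, previous, critique):
    return "5" if critique is None else "4"


def toy_supervisor(problem, answer):
    return None if answer == "4" else "The sum in step 1 is wrong: 2 + 2 is not 5."


print(verbal_process_supervision("What is 2 + 2?", toy_actor, toy_supervisor))
```

In a real setting both callables would wrap model calls, and the critique would be the structured natural-language feedback the abstract describes rather than a single sentence.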


[18] StructMem: Structured Memory for Long-Horizon Behavior in LLMs cs.CL | cs.AI | cs.IR | cs.LG | cs.MAPDF

Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao

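TL;DR: This paper proposes StructMem, a structure-enriched hierarchical memory framework that addresses the trade-off between efficiency and structured modeling in memory systems for long-term conversational agents. It preserves event-level bindings and induces cross-event connections, and combines temporally anchored dual perspectives with periodic semantic consolidation to support temporal reasoning and multi-hop question answering.

Details

Motivation: Long-term conversational agents need memory systems that capture relationships between events rather than isolated facts, in order to support temporal reasoning and multi-hop question answering. Existing approaches face a fundamental trade-off: flat memory is efficient but cannot model relational structure, while graph-based memory supports structured reasoning but is expensive and fragile to construct.

Result: On the LoCoMo benchmark, StructMem improves temporal reasoning and multi-hop question answering while substantially reducing token usage, API calls, and runtime compared with existing memory systems.

Insight: The novelty lies in a structure-enriched hierarchical memory framework that balances efficiency and structured modeling through temporally anchored dual perspectives and periodic semantic consolidation, offering a new approach to building efficient and robust long-term memory systems.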

Abstract: Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. See https://github.com/zjunlp/LightMem.


[19] AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA cs.CLPDF

Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed

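TL;DR: This paper introduces AUDITA, a large-scale, real-world benchmark for evaluating the gap between human and AI skill at audio question answering. It aims to test deep audio reasoning beyond surface-level acoustic recognition.

Details

Motivation: Existing audio QA benchmarks mostly focus on sound event classification or caption-grounded queries, and models often succeed through shortcut strategies, short-duration cues, lexical priors, or by bypassing the audio altogether. Genuine audio reasoning goes unevaluated, so a more rigorous benchmark is needed.

Result: Humans average 32.13% accuracy on AUDITA, whereas state-of-the-art audio QA models average below 8.86%, lagging far behind. The study also applies Item Response Theory to analyze latent proficiency and question difficulty and to expose systematic deficiencies in the models.

Insight: The novelty lies in building a dataset of human-authored trivia questions grounded in real-world audio, which stresses robust auditory reasoning through distractors and long-range temporal dependencies. The questions cannot be answered from isolated text or sound cues alone, giving a stricter benchmark for evaluating audio understanding.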

Abstract: Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. A human average accuracy of 32.13% both shows how challenging the task is and demonstrates meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.
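
To make the IRT terms concrete, the sketch below uses the two-parameter logistic (2PL) model, one common IRT formulation. The abstract does not say which IRT variant the authors fit, so the choice of 2PL, the function name `p_correct`, and the parameter values are assumptions for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a respondent with latent proficiency `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical respondents and items (values are illustrative, not from the paper).
strong, weak = 1.5, -1.0            # latent proficiencies
easy_item, hard_item = -0.5, 2.0    # item difficulties
discrimination = 1.2

for name, theta in [("strong", strong), ("weak", weak)]:
    for label, b in [("easy", easy_item), ("hard", hard_item)]:
        print(f"{name:>6} respondent, {label} item: {p_correct(theta, discrimination, b):.3f}")
```

Fitting such a model to a response matrix yields one proficiency estimate per respondent (human or model) and one difficulty estimate per question, which is the kind of analysis the abstract refers to.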


[20] Misinformation Span Detection in Videos via Audio Transcripts cs.CL | cs.SIPDF

Breno Matos, Rennan C. Lima, Savvas Zannettou, Fabricio Benevenuto, Rodrygo L. T. Santos

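TL;DR: This paper addresses misinformation detection in videos by proposing misinformation span detection via audio transcripts. It builds two datasets totaling more than 500 videos and over 2,400 annotated segments, and classifiers based on state-of-the-art language models achieve an F1 score of 0.68.

Details

Motivation: Prior work on video misinformation detection only asks whether a video as a whole contains misinformation. It lacks fine-grained analysis of where the misinformation occurs and what content is responsible, which makes the results hard to interpret.

Result: On the two new datasets, classifiers built on state-of-the-art language models achieve an F1 score of 0.68 on misinformation span detection.

Insight: The novelty lies in refining video misinformation detection from the video level to the span level, using audio transcripts for fine-grained localization. The authors also release the datasets, including videos, audio, transcripts, and annotations, as a resource for follow-up research.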

Abstract: Online misinformation is one of the most challenging issues lately, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research efforts investigated detecting video-based misinformation, focusing on whether a video shares misinformation or not on a video level. While this approach is useful, it only provides a limited and not easily interpretable view of the problem, given that it does not provide additional context on when misinformation occurs within videos and what content (i.e., which claims) is responsible for the video’s misinformation nature. In this work, we attempt to bridge this research gap by creating two novel datasets that allow us to explore misinformation detection on videos via audio transcripts, focusing on identifying the span of videos that is responsible for the video’s misinformation claim (misinformation span detection). We transcribe each video’s audio to text, identifying the video segment in which the misinformation claims appear, resulting in two datasets of more than 500 videos with over 2,400 segments containing annotated fact-checked claims. Then, we employ classifiers built with state-of-the-art language models, and our results show that we can identify in which part of a video there is misinformation with an F1 score of 0.68. We make our annotated datasets publicly available. We also release all transcripts, audio and videos.


[21] Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions cs.CLPDF

Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim

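TL;DR: Using the Whistleblower’s Dilemma as an experimental framework, this paper studies how large language models behave in relational moral dilemmas, focusing on how two dimensions, crime severity and relational closeness, affect model judgments.

Details

Motivation: Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models increasingly serve as decision-support systems, it is important to assess whether they encode these social nuances.

Result: Model decisions align with moral rightness judgments (prescriptive norms) rather than with the models’ own predictions of human behavior (descriptive social expectations), revealing a clear cross-perspective divergence.

Insight: The novelty lies in systematically evaluating LLM behavior from three distinct perspectives: moral rightness, predicted human behavior, and autonomous model decisions. The analysis shows that model decisions prioritize rigid prescriptive rules over the social sensitivity present in their internal world-modeling, which may lead to misalignment risks in real-world deployments.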

Abstract: Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower’s Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.


[22] A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents cs.CL | cs.AIPDF

Praval Sharma

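TL;DR: This paper proposes MODEE, a multimodal open-domain event extraction approach that combines graph-based learning with text representations from large language models (LLMs) to model document-level contextual, structural, and semantic reasoning, improving open-domain event extraction.

Details

Motivation: Existing event extraction methods have limitations. Closed-domain methods are restricted to predefined event types and generalize poorly. Open-domain methods can handle unconstrained event types, but most overlook the potential of LLMs and do not explicitly model the document-level reasoning that is crucial for event extraction.

Result: Empirical evaluations on large datasets show that MODEE outperforms state-of-the-art (SOTA) open-domain event extraction approaches and generalizes to closed-domain event extraction, where it also outperforms existing algorithms.

Insight: The novelty lies in combining graph learning with LLM text representations to explicitly model document-level reasoning in a multimodal way. This helps mitigate the lost-in-the-middle phenomenon and attention dilution that LLMs can suffer on long documents, improving the accuracy and generalization of event extraction.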

Abstract: Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction (MODEE), a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.


cs.CV [Back]

[23] Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry cs.CV | cs.AI | cs.CLPDF

Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan

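TL;DR: This paper presents PlantInquiryVQA, a benchmark for evaluating multi-step, intent-driven visual reasoning by multimodal large language models in plant pathology diagnosis. The benchmark includes a Chain of Inquiry framework that models how botanists ask adaptive questions based on visual cues and explicit epistemic intent, and the authors release a dataset of expert-annotated images and question-answer pairs.

Details

Motivation: Current vision-language models are usually evaluated on single-turn question answering, whereas real-world visual assessment, as in plant pathology, is a multi-step, evidence-based process of adaptive questioning. Closing this gap requires a new benchmark for studying model reasoning in complex diagnostic scenarios.

Result: Evaluations on top-tier multimodal large language models show that the models describe visual symptoms adequately but struggle with safe clinical reasoning and accurate diagnosis. Structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.

Insight: The novelty lies in formalizing diagnosis as a Chain of Inquiry conditioned on visual cues and explicit epistemic intent, and in building the first large-scale benchmark focused on multi-step reasoning in plant pathology. The core insight is that emulating an expert’s (e.g., a botanist’s) intent-driven, adaptive questioning process, rather than static classification, is key to improving diagnostic reasoning.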

Abstract: Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.


[24] Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition cs.CV | q-bio.NCPDF

Naga VS Raviteja Chappa, Evangelos Sariyanidi, Lisa Yankowitz, Gokul Nair, Casey J. Zampella

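TL;DR: This paper proposes Micro-DualNet, a dual-path spatio-temporal network for micro-action recognition. Micro-actions are subtle, localized movements lasting 1-3 seconds (such as scratching one’s head or tapping fingers) and are critical for fine-grained video understanding. They vary in their spatio-temporal characteristics (some are defined by spatial configurations, others manifest through temporal dynamics), and existing methods that commit to a single spatio-temporal decomposition cannot accommodate this. The network processes anatomically grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways, introduces entity-level adaptive routing so that each body part learns its preferred processing order, and adds a Mutual Action Consistency (MAC) loss to enforce cross-path coherence.

Details

Motivation: Current computer vision systems understand micro-actions poorly. Micro-actions have diverse spatio-temporal characteristics, but existing methods use a single spatio-temporal decomposition order and cannot adapt to this diversity.

Result: The method achieves competitive performance on the MA-52 dataset and state-of-the-art (SOTA) results on the iMiGUE dataset.

Insight: The core novelty is a dual-path architecture (ST and TS paths) that captures spatially dominated and temporally dominated micro-action features separately, together with entity-level adaptive routing (instead of fixed fusion) and a Mutual Action Consistency loss. This lets the model choose a processing path per body part according to its characteristics, better matching the inherent spatio-temporal complexity of micro-actions.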

Abstract: Micro-actions are subtle, localized movements lasting 1-3 seconds, such as scratching one’s head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by a Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on the MA-52 dataset and state-of-the-art results on the iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.


[25] Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning cs.CVPDF

Dahun Kim, Ganesh Satish Mallya, Anelia Angelova

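TL;DR: This paper proposes a training-free approach that brings multi-spectral data into the inference pipeline of generalist, RGB-only Large Multi-modal Models (LMMs), using guided inputs and Chain-of-Thought reasoning to improve performance on remote sensing tasks such as land-use classification and environmental monitoring. With the Gemini 2.5 model, the approach yields strong zero-shot gains on popular remote sensing benchmarks.

Details

Motivation: Generalist LMMs are typically trained only on RGB images, which limits their use in multi-spectral remote sensing, while training dedicated multi-spectral multi-modal models is expensive and produces highly specialized models. This paper aims to remove that limitation so that generalist models can handle multi-spectral data.

Result: On popular remote sensing benchmarks, the approach achieves strong zero-shot performance gains with the Gemini 2.5 model.

Insight: The novelty lies in a training-free, inference-time adaptation: non-RGB inputs are adapted to the model’s visual space, and domain-specific information and Chain-of-Thought reasoning are supplied as instructions, unlocking the generalist model’s ability to process multi-spectral data. This gives geospatial professionals a new way to apply generalist models to specialized sensor data.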

Abstract: Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs’ understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.


[26] StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling cs.CV | cs.AIPDF

Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu

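TL;DR: StyleVAR is a controllable image style transfer method built on Visual Autoregressive Modeling (VAR) that casts style transfer as conditional discrete sequence modeling in a learned latent space. A VQ-VAE decomposes images into multi-scale representations and tokenizes them into discrete codes; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. A blended cross-attention mechanism with a scale-dependent blending coefficient controls the influence of style and content at each stage, balancing content structure against style texture. Training has two stages: supervised fine-tuning on a large triplet dataset, followed by reinforcement fine-tuning with GRPO against a DreamSim-based perceptual reward.

Details

Motivation: The aim is to formalize style transfer as conditional discrete sequence modeling so as to exploit the generative capacity of autoregressive models for more controllable, higher-quality style transfer, while addressing the difficulty traditional methods have in balancing content structure and style texture.

Result: Across three benchmarks covering in-distribution, near-distribution, and out-of-distribution regimes, StyleVAR outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity. The GRPO stage yields further gains, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while preserving semantic structure for landscapes and architectural scenes, but shows a generalization gap on internet images and human faces.

Insight: The contributions include modeling style transfer as conditional discrete sequence generation, introducing a blended cross-attention mechanism that dynamically fuses style and content information, and using a two-stage training strategy (supervised fine-tuning plus GRPO reinforcement fine-tuning) to optimize perceptual quality. Viewed objectively, the method combines the continuity of autoregressive modeling with multi-scale control, offering a new approach to controllable image generation.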

Abstract: We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content–style–target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR’s multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.


[27] Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models cs.CVPDF

Juhong Min, Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan

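TL;DR: This paper proposes Foveated Reasoner, an autoregressive vision-language framework that mimics the foveation mechanism of human vision: guided by a low-resolution global view, it dynamically triggers high-resolution focus on selected regions to improve performance under a limited visual-token budget.

Details

Motivation: High-resolution images sharply increase the number of visual tokens and the resulting compute overhead. The paper borrows the foveation mechanism of human vision to make visual reasoning efficient.

Result: Across multiple vision-language benchmarks, the method achieves stronger accuracy under tight visual-token budgets and learns effective foveation policies.

Insight: Foveation and reasoning are unified within a single decoding trajectory, and a two-stage training pipeline (cold-start supervision followed by reinforcement learning) jointly optimizes evidence acquisition and task accuracy while avoiding the trivial “see-everything” solution.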

Abstract: Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides “where to look”, while selectively acquired high-acuity evidence refines “what to think”. We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial “see-everything” solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.


[28] Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery cs.CV | cs.AIPDF

Siyuan Yao, Siavash Ghorbany, Kuangshi Ai, Arnav Cherukuthota, Meghan Forstchen

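TL;DR: This paper proposes a framework that uses multimodal large language models (LLMs) and Google Street View (GSV) imagery to automatically assess building conditions across the United States. Fine-tuning Gemma 3 27B on a modest human-labeled dataset yields strong SRCC and PLCC against the human mean opinion score (MOS) benchmark, outperforming even individual raters. For efficiency, the authors apply knowledge distillation to transfer the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model, which achieves comparable performance with a 3x speedup. They further distill into a CNN-based EfficientNetV2-M and a transformer-based SwinV2-B, reaching close performance with a 30x speedup. The study also explores, through a human-AI alignment study, how well LLMs assess a broad list of built environment and housing attributes, and it develops a visualization dashboard that integrates the LLM assessments for downstream analysis by homeowners.

Details

Motivation: The goal is to automate building condition assessment at large, nationwide scale, reducing manual labeling effort and providing a flexible, efficient solution.

Result: Against the human mean opinion score (MOS) benchmark, the fine-tuned Gemma 3 27B model performs strongly on SRCC and PLCC, outperforming even individual raters. The distilled lightweight models retain comparable performance while speeding up inference by 3x (Gemma 3 4B) and 30x (EfficientNetV2-M and SwinV2-B).

Insight: The novelty lies in applying multimodal LLMs (Gemma 3 in particular) to building condition assessment from street-view imagery for the first time, and in using knowledge distillation to transfer performance from a large LLM to smaller, faster models, striking a good balance between accuracy and efficiency. The framework’s extensibility (assessing many attributes) and practicality (an integrated visualization dashboard) are further strengths.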

Abstract: We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs’ capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
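
The two agreement metrics named above, PLCC (Pearson linear correlation coefficient) and SRCC (Spearman rank correlation coefficient), can be computed as in the sketch below. The scores are made-up illustrative values, not data from the paper, and the helper names are hypothetical.

```python
from math import sqrt


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Illustrative values only: human MOS per building vs. model-predicted scores.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
print(f"PLCC = {plcc(mos, predicted):.4f}, SRCC = {srcc(mos, predicted):.4f}")
```

In this toy case the predictions preserve the ordering of the MOS values but not their spacing, so SRCC is perfect while PLCC is slightly lower; reporting both separates ranking quality from linear agreement.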


[29] HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping cs.CVPDF

Zahid Hassan Tushar, Sanjay Purushotham

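TL;DR: This paper proposes HyperFM, a parameter-efficient hyperspectral foundation model that uses spectral grouping attention and hybrid parameter decomposition to better capture spectral-spatial relationships at lower computational cost. The model is trained on large-scale hyperspectral data from the NASA PACE mission and outperforms existing hyperspectral foundation models and task-specific SOTA methods on four downstream atmospheric cloud property retrieval benchmarks.

Details

Motivation: The NASA PACE mission produces hyperspectral data that is large, complex, and difficult to label. Existing foundation models, usually trained on RGB imagery, struggle to interpret its continuous spectral signatures. Existing hyperspectral foundation models are typically trained only on cloud-free, single-sensor data and are parameter-heavy and computationally expensive, which limits their scalability and adoption in operational settings.

Result: On four downstream atmospheric cloud property retrieval benchmarks, HyperFM delivers consistent improvements over existing hyperspectral foundation models and task-specific SOTA methods.

Insight: The novelty lies in introducing intra-group and inter-group spectral attention together with hybrid parameter decomposition to model spectral-spatial relationships in a parameter-efficient way. The authors also release HyperFM250K, a large-scale hyperspectral dataset covering both clear and cloudy scenes, to support research in this area.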

Abstract: The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth’s climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.


[30] WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis cs.CVPDF

Yalcin Tur, Mihajlo Stojkovic, Ulas Bagci

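TL;DR: This paper proposes WFM (3D Wavelet Flow Matching) for fast multi-modal MRI synthesis. By learning a direct flow in wavelet space from the mean of the conditioning modalities to the target modality, it avoids the high cost of conventional diffusion models that start from pure noise. Accurate synthesis takes only 1-2 integration steps, 250-1000x faster than diffusion baselines.

Details

Motivation: Diffusion models achieve excellent quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and a separate model per modality) limits clinical deployment. The inefficiency stems from sampling from pure noise, which discards the structural information already present in available MRI sequences.

Result: On BraTS 2024, a single 82M-parameter WFM model with class conditioning synthesizes all four modalities (T1, T1c, T2, FLAIR), reaching 26.8 dB PSNR and 0.94 SSIM, within 1-2 dB of diffusion baselines, while taking only 0.16-0.64 seconds per volume, 250-1000x faster than the baselines.

Insight: The novelty lies in using the mean of the conditioning modalities in wavelet space as an informed prior and learning a direct flow from that prior to the target distribution, which sharply reduces the number of sampling steps. Viewed objectively, the shared anatomical prior and class-conditioned design let one lightweight model replace several heavy diffusion models, achieving a strong balance between speed and quality.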

Abstract: Diffusion models have achieved remarkable quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and separate models per modality) limits clinical deployment. We observe that this inefficiency stems from an unnecessary starting point: diffusion begins from pure noise, discarding the structural information already present in available MRI sequences. We propose WFM (Wavelet Flow Matching), which instead learns a direct flow from an informed prior, the mean of conditioning modalities in wavelet space, to the target distribution. Because the source and target share underlying anatomy and differ primarily in contrast, this formulation enables accurate synthesis in just 1-2 integration steps. A single 82M-parameter model with class conditioning synthesizes all four BraTS modalities (T1, T1c, T2, FLAIR), replacing four separate diffusion models totaling 326M parameters. On BraTS 2024, WFM achieves 26.8 dB PSNR and 0.94 SSIM, within 1-2 dB of diffusion baselines, while running 250-1000x faster (0.16-0.64s vs. 160s per volume). This speed-quality trade-off makes real-time MRI synthesis practical for clinical workflows. Code is available at https://github.com/yalcintur/WFM.
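
The idea of integrating a flow from an informed prior in very few steps can be sketched as follows. This toy example is not the authors' model: it omits the wavelet transform, works on short lists instead of 3D volumes, and replaces the learned velocity network with a hand-written constant field. The names `mean_prior`, `euler_integrate`, and `toy_velocity` are hypothetical.

```python
def mean_prior(modalities):
    """Element-wise mean of the conditioning modalities (the informed prior)."""
    n = len(modalities)
    return [sum(values) / n for values in zip(*modalities)]


def euler_integrate(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data: three conditioning "modalities" and a target that differs mainly
# in contrast. A trained network would predict the velocity; here we assume
# the ideal straight-line field (target minus prior) for illustration.
conditioning = [[0.2, 0.4], [0.4, 0.8], [0.6, 0.6]]
target = [0.5, 0.9]
prior = mean_prior(conditioning)


def toy_velocity(x, t):
    return [ti - pi for ti, pi in zip(target, prior)]


one_step = euler_integrate(prior, toy_velocity, steps=1)
two_step = euler_integrate(prior, toy_velocity, steps=2)
print(one_step, two_step)
```

When the path from prior to target is close to a straight line, a single Euler step already lands near the target, which is the intuition behind synthesizing in 1-2 integration steps rather than hundreds of denoising steps.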


[31] Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment cs.CVPDF

Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han

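TL;DR: This paper proposes Geometric Reward Credit Assignment, a framework that targets geometric hallucination in Point-Vision-Language Models (Point-VLMs), where predicted 3D structures contradict the observed 2D reality. The method disentangles holistic supervision into field-specific signals and routes them precisely to the responsible token spans, turning vague feedback into precise gradient updates. It also internalizes physical constraints by adding a Reprojection-Consistency term that acts as a cross-modal verifier and penalizes physically impossible geometries.

Details

Motivation: Point-VLMs aim to give embodied agents executable spatial reasoning, but they often fall into geometric hallucination, where predicted 3D structures contradict the observed 2D reality. The authors argue that the key cause is not a representation bottleneck but a structural misalignment in reinforcement learning: sparse geometric tokens are drowned out by noisy, broadcast sequence-level rewards.

Result: On a calibrated benchmark derived from ShapeNetCore, the method raises 3D keypoint accuracy (KPA) from 0.64 to 0.93, 3D bounding box intersection over union (IoU) to 0.686, and the reprojection consistency score to 0.852, while maintaining robust 2D localization performance.

Insight: The core novelty is the Geometric Reward Credit Assignment framework, which resolves the structural misalignment in reinforcement learning by disentangling holistic supervision and assigning reward signals precisely, turning generic policy optimization into targeted structural alignment. In addition, the Reprojection-Consistency term serves as a cross-modal verifier that internalizes physical constraints, helping the model produce physically verifiable spatial predictions rather than merely plausible textual outputs.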

Abstract: Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
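
A reprojection check of the kind the Reprojection-Consistency term relies on can be sketched with a pinhole camera model: predicted 3D points are projected into the image and compared with the observed 2D keypoints. The abstract does not define the paper's exact term or scoring, so the mean pixel error below, the camera intrinsics, and the function names are assumptions.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to pixels (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (fx * X / Z + cx, fy * Y / Z + cy)


def reprojection_error(points_3d, keypoints_2d, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and observed 2D keypoints."""
    errors = []
    for p3d, (u_obs, v_obs) in zip(points_3d, keypoints_2d):
        u, v = project(p3d, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)


# Hypothetical intrinsics and predictions, for illustration only.
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
predicted_3d = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
observed_2d = [(320.0, 240.0), (573.0, 244.0)]

print(reprojection_error(predicted_3d, observed_2d, **intrinsics))
```

A low error means the predicted 3D structure is consistent with what the image shows; a reward or penalty derived from such an error can then be attached to the tokens that encode the geometry.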


[32] SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning cs.CVPDF

Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee

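TL;DR: This paper proposes SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates vision-language specialists with complementary inductive biases and introduces a test-time orchestration mechanism that requires no changes to model parameters. It improves performance on several spatial reasoning benchmarks.

Details

Motivation: Existing spatial reasoning methods usually rely on a single reasoning pipeline that learns a fixed spatial prior and adapts poorly to distribution changes. Existing multi-agent systems mostly use homogeneous agents and cannot exploit diverse inductive biases effectively.

Result: On spatial reasoning benchmarks including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, SpatiO consistently improves over both closed-source and open-source baselines.

Insight: The core novelty is a heterogeneous multi-agent framework with a test-time orchestration mechanism that dynamically evaluates and reweights agents during inference, giving the system spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input.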

Abstract: Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.
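
One simple way to reweight agents at test time without touching model parameters is a multiplicative-weights update combined with a weighted vote, sketched below. This is not the paper's TTO algorithm, whose details the abstract does not give: the reliability scores, the learning rate, the agent names, and both function names are assumptions.

```python
import math
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer with the largest total weight across agents."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += weights[agent]
    return max(totals, key=totals.get)


def reweight(weights, reliability, lr=0.5):
    """Multiplicative-weights update: scale each agent by exp(lr * reliability),
    then renormalize so the weights sum to 1. No model parameters change."""
    raw = {agent: w * math.exp(lr * reliability[agent]) for agent, w in weights.items()}
    total = sum(raw.values())
    return {agent: value / total for agent, value in raw.items()}


# Hypothetical specialists with different inductive biases.
weights = {"appearance": 1 / 3, "depth": 1 / 3, "geometry": 1 / 3}
answers = {"appearance": "right", "depth": "left", "geometry": "left"}
print(weighted_vote(answers, weights))

# Assumed reliability signal in [0, 1], e.g. from a consistency check.
reliability = {"appearance": 0.0, "depth": 1.0, "geometry": 0.5}
weights = reweight(weights, reliability)
print(weights)
```

Because only the orchestration weights change, the same frozen specialists can be recombined differently for each input distribution.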


[33] Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation cs.CV | cs.LG

Boxun Xu, Yuming Du, Zichang Liu, Siyu Yang, Ziyang Jiang

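TL;DR: This paper proposes Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that aims to improve long-horizon generation quality and reduce decoding latency. Based on empirical observations of attention behavior, it introduces a trainable native sparsity mechanism and an efficient GPU kernel (PBSA), significantly accelerating inference and lowering memory use while preserving visual quality.

Details

Motivation: The work is motivated by an empirical observation about autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks and shows a locally structured block-sparse pattern within sliding windows. Exploiting this pattern to improve the efficiency and quality of long-horizon generation requires a sparse attention mechanism that is both trainable and efficient.

Result: On text-to-video generation, compared with Self-Forcing, Sparse Forcing improves the VBench score by 0.26 on 5-second generation, with a 1.11-1.17x decoding speedup and a 42% lower peak KV-cache footprint. On longer 20-second and 1-minute generations, VBench improves by 0.68 and 2.74 with speedups of 1.22x and 1.27x, respectively, showing a stronger advantage on long-horizon generation.

Insight: The novelty is a trainable native sparse attention mechanism, designed from empirical observation, that learns to compress, preserve, and update persistent attention blocks while dynamically restricting computation within each local window. The proposed Persistent Block-Sparse Attention (PBSA) GPU kernel provides a practical, efficient implementation for large-scale training and for low-latency, memory-efficient inference.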

Abstract: We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.
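
To make the attention pattern concrete, here is a small sketch of a block-level mask that combines persistent blocks with a causal local window. It is a static illustration only: in the paper the persistent set and the local neighborhood are learned and selected dynamically, and the block count, window size, and persistent set below are hypothetical.

```python
def block_sparse_mask(num_blocks, persistent, window):
    """mask[q][k] is True when query block q may attend to key block k.

    Causal: a block never attends to later blocks. A key block is kept
    if it lies in the local window (q - window, q] or is persistent.
    """
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for q in range(num_blocks):
        for k in range(q + 1):
            is_local = q - k < window
            is_persistent = k in persistent
            mask[q][k] = is_local or is_persistent
    return mask

mask = block_sparse_mask(num_blocks=8, persistent={0}, window=2)
allowed = sum(sum(row) for row in mask)   # block pairs actually computed
dense_causal = 8 * (8 + 1) // 2           # block pairs in full causal attention
```

The gap between `allowed` and `dense_causal` grows with sequence length, which is consistent with the larger speedups reported on longer rollouts.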


[34] GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA cs.CV | cs.DC

Anvitha Ramachandran, Dhruv Parikh, Viktor Prasanna

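TL;DR: GraphLeap is a method for accelerating Vision Graph Neural Networks (ViGs). By decoupling graph construction from feature updates so that the two can run concurrently, it removes the computational bottleneck caused by sequential graph construction in conventional ViGs. Building on this, the paper also presents the first end-to-end FPGA accelerator for ViGs, achieving substantial inference speedups.

Details

Motivation: ViGs must dynamically build a k-nearest-neighbor graph from the current features at every layer. This step is computationally expensive (O(N^2)) and sequentially dependent on the feature update, making it the main bottleneck of ViG inference on CPUs and GPUs (50-95% of graph convolution time).

Result: Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to 95.7x speedup over the CPU baseline and up to 8.5x over the GPU baseline, demonstrating the feasibility of real-time ViG inference. Lightweight fine-tuning for a few epochs is enough to recover the original model accuracy.

Insight: The core innovation is the "one-layer-lookahead" decoupling strategy (GraphLeap), which breaks the sequential dependency between graph construction and feature update so that they can run concurrently. The FPGA accelerator built on it uses a streaming, layer-pipelined design that overlaps the kNN graph construction engine with the feature update engine, exploits node- and channel-level parallelism, and avoids explicit edge-feature materialization for an efficient dataflow.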

Abstract: Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50–95% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer’s features, while simultaneously using the current layer’s features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
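
The one-layer-lookahead schedule can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the `update` function is a toy message-passing step, and bootstrapping the first layer with a graph built from the input features is a guess about how layer 0 is handled.

```python
def knn_graph(feats, k):
    """Brute-force kNN on squared Euclidean distance (O(N^2) pairs)."""
    graph = []
    for i, fi in enumerate(feats):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(feats) if j != i
        )
        graph.append([j for _, j in dists[:k]])
    return graph

def update(feats, graph):
    """Toy message passing: average each node with the mean of its neighbors."""
    out = []
    for fi, nbrs in zip(feats, graph):
        mean = [sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(len(fi))]
        out.append([(a + m) / 2 for a, m in zip(fi, mean)])
    return out

def graphleap_forward(feats, num_layers, k):
    graph = knn_graph(feats, k)  # assumed bootstrap: layer 0 has no previous layer
    for _ in range(num_layers):
        # Both calls read the same `feats` and do not depend on each other,
        # so hardware could run them concurrently.
        next_graph = knn_graph(feats, k)  # graph for the NEXT layer
        feats = update(feats, graph)      # update with the PREVIOUS layer's graph
        graph = next_graph
    return feats

patches = [[0, 0], [0, 1], [5, 5], [5, 6]]
out = graphleap_forward(patches, num_layers=2, k=1)
```

In a conventional ViG the loop body would be `feats = update(feats, knn_graph(feats, k))`, which forces the graph search to finish before the update can start.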


[35] Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation cs.CV | cs.AI

Yuanchen Fei, Yude Zou, Zejian Kang, Ming Li, Jiaying Zhou

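TL;DR: This paper systematically studies the role of synthetic data in controllable human video generation. It proposes a diffusion-based framework with fine-grained control over appearance and motion, and builds a unified testbed for analyzing how synthetic and real data interact during training. The study finds that synthetic and real data play complementary roles and presents methods for efficiently selecting synthetic samples to improve motion realism, temporal consistency, and identity preservation.

Details

Motivation: The work addresses the scarcity of large-scale, diverse, and privacy-safe datasets for controllable human video generation, especially for rare identities and complex actions, and explores the actual contribution of synthetic data to generative modeling given the Sim2Real gap.

Result: Extensive experiments reveal the complementary roles of synthetic and real data and demonstrate methods for efficiently selecting synthetic samples, yielding improvements in motion realism, temporal consistency, and identity preservation. The study offers practical insights for building data-efficient and generalizable generative models.

Insight: The novelty lies in the first comprehensive exploration of the role of synthetic data in human-centric video synthesis, a unified diffusion framework for analyzing and exploiting synthetic data, and data selection and augmentation strategies for improving generation quality. This offers a new approach for generative models in data-scarce settings.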

Abstract: Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances, serving as a foundation for digital humans, animation, and embodied AI. However, the scarcity of large-scale, diverse, and privacy-safe human video datasets poses a major bottleneck, especially for rare identities and complex actions. Synthetic data provides a scalable and controllable alternative, yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real gap. In this work, we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unified testbed to analyze how synthetic data interacts with real-world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism, temporal consistency, and identity preservation. Our study offers the first comprehensive exploration of synthetic data’s role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.


[36] an interpretable vision transformer framework for automated brain tumor classification cs.CV

Chinedu Emmanuel Mbonu, Tochukwu Sunday Belonwu, Okwuchukwu Ejike Chukwuogo, Kenechukwu Sylvanus Anigbogu

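TL;DR: This paper proposes an interpretable deep learning framework based on the Vision Transformer for automated four-class brain tumor classification from 7,023 MRI scans. The framework uses ViT-B/16 as the backbone and combines clinically motivated preprocessing, two-stage fine-tuning, data augmentation, and test-time augmentation. It reaches very high accuracy on the test set and provides interpretable attention heatmaps.

Details

Motivation: Manual interpretation of MRI scans for brain tumor diagnosis is time-consuming, subject to inter-observer variability, and dependent on specialist expertise. The goal is a high-accuracy automated classification system that supports early and accurate diagnosis.

Result: On a dataset of 7,023 MRI scans, the proposed model achieves 99.29% test accuracy and a 99.25% macro F1-score, with perfect recall on the healthy and meningioma classes, outperforming all CNN-based baselines.

Insight: The novelty lies in combining the Vision Transformer architecture with a training pipeline tailored to medical images (CLAHE enhancement, two-stage fine-tuning, MixUp/CutMix, EMA, and TTA) and using the attention mechanism to provide interpretable evidence for predictions, giving a high-accuracy, interpretable end-to-end solution for medical image analysis.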

Abstract: Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentation is applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines.
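
Two of the training components named above, MixUp and EMA of weights, reduce to short formulas. The sketch below shows both on plain Python lists. The Beta parameter `alpha` and the EMA `decay` are illustrative values, since the abstract does not report the settings used.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

def ema_update(ema_weights, weights, decay=0.999):
    """EMA of weights: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, weights)]

rng = random.Random(0)
# Toy 2-pixel "images" with one-hot labels over the four classes.
x, y, lam = mixup([1.0, 0.0], [1, 0, 0, 0], [0.0, 1.0], [0, 1, 0, 0], rng=rng)
ema = ema_update([0.0], [1.0], decay=0.9)
```

The mixed label stays a valid probability distribution because it is a convex combination of two one-hot vectors.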


[37] FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment cs.CV

Khaled R Ahmed, Toqi Tahamid Sarker, Taminul Islam, Tamany M Alanezi, Amer AbuGhazaleh

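TL;DR: This paper proposes FryNet, a dual-stream adversarial fusion framework for non-destructive assessment of frying oil oxidation. Taking RGB and thermal images as two input streams, the model jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (peroxide value, p-anisidine value, Totox value, and temperature). It extracts features with attention mechanisms and a masked autoencoder and uses adversarial regularization to remove video-identity bias, delivering a full assessment in a single forward pass.

Details

Motivation: Current monitoring of frying oil degradation relies on destructive wet-chemistry methods that provide no spatial information and cannot be used in real time. Thermal-image-based inspection suffers from a camera-fingerprint shortcut: models memorize sensor-specific noise instead of learning oxidation chemistry, so they fail under video-disjoint evaluation.

Result: On 7,226 paired frames from 28 frying videos, FryNet achieves 98.97% mean intersection over union, 100% classification accuracy, and a mean regression MAE of 2.32, outperforming all seven baselines and reaching state-of-the-art performance.

Insight: The innovations include a dual-stream adversarial fusion architecture, chemically aligned RGB representation learning with a masked autoencoder, and adversarial regularization through gradient reversal layers to remove video-identity bias, which improves generalization and the accuracy of chemical index prediction.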

Abstract: Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use. We identify a fundamental obstacle in thermal-image-based inspection, the camera-fingerprint shortcut, whereby models memorize sensor-specific noise and thermal bias instead of learning oxidation chemistry, collapsing under video-disjoint evaluation. We propose FryNet, a dual-stream RGB-thermal framework that jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (PV, p-AV, Totox, temperature) in a single forward pass. A ThermalMiT-B2 backbone with channel and spatial attention extracts thermal features, while an RGB-MAE Encoder learns chemically grounded representations via masked autoencoding and chemical alignment. Dual-Encoder DANN adversarially regularizes both streams against video identity via Gradient Reversal Layers, and FiLM fusion bridges thermal structure with RGB chemical context. On 7,226 paired frames across 28 frying videos, FryNet achieves 98.97% mIoU, 100% classification accuracy, and 2.32 mean regression MAE, outperforming all seven baselines.
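
Two building blocks named in the abstract have simple definitions: a Gradient Reversal Layer is the identity in the forward pass and negates (and scales) gradients in the backward pass, and FiLM applies a per-channel scale and shift. The sketch below writes both out by hand without an autograd framework. The scale `lam` and which stream produces `gamma` and `beta` are assumptions, as the abstract does not specify them.

```python
class GradientReversal:
    """Identity forward; backward multiplies incoming gradients by -lam."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        # The encoder receives the negated gradient of the video-identity
        # loss, pushing its features to be uninformative about video ID.
        return [-self.lam * g for g in grad_output]

def film(features, gamma, beta):
    """FiLM: per-channel affine modulation, gamma * feature + beta."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

grl = GradientReversal(lam=0.5)
forward_out = grl.forward([1.0, 2.0])
backward_out = grl.backward([0.5, -1.0])
# Hypothetical: gamma and beta would be predicted from the other stream.
fused = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```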


[38] Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification cs.CV

Zhiyong Li, Wei Jiang, Haojie Liu, Mingyu Wang, Wanchong Xu

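TL;DR: This paper proposes HiTPro, an unsupervised framework for video-based cross-modality person re-identification that matches pedestrians across visible and infrared videos without identity annotations. A temporal-aware feature encoder extracts frame-level features and aggregates them into tracklet-level representations. The method then builds intra-camera prototypes and, through hierarchical cross-prototype alignment and contrastive learning, progressively optimizes feature-prototype alignment at three levels: intra-camera, cross-camera within a modality, and cross-modality.

Details

Motivation: Existing visible-infrared person re-identification methods mostly work at the image level or depend on supervised annotation, while the unsupervised video-based cross-modality problem is critical for real-world deployment yet remains largely unexplored. This paper aims to fill that gap with a prototype-driven framework that needs no explicit hard pseudo-label assignment.

Result: Extensive experiments on the HITSZ-VCM and BUPTCampus datasets show that HiTPro achieves state-of-the-art performance in the fully unsupervised setting, significantly outperforming adapted baselines and establishing a strong baseline for future research.

Insight: The innovations include: 1) building reliable intra-camera prototypes through temporal partitioning; 2) hierarchical cross-prototype alignment that combines a dynamic threshold strategy with soft weight assignment to mine positives in two stages, from within-modality association to cross-modality matching; 3) hierarchical contrastive learning that progressively optimizes feature-prototype alignment across three levels, strengthening unsupervised learning.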

Abstract: Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process: progressing from within-modality associations to cross-modality matching, enhanced by Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishing a strong baseline for future research.
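
As a rough illustration of aggregating features from temporally partitioned sub-tracklets, the sketch below splits a tracklet into contiguous segments, mean-pools each, and averages the segment means. Mean pooling and the number of segments are assumptions made for illustration. The paper's actual aggregation rule is not given in the abstract.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def partition(frames, num_parts):
    """Split frames into contiguous, near-equal temporal segments."""
    size, extra = divmod(len(frames), num_parts)
    parts, start = [], 0
    for p in range(num_parts):
        end = start + size + (1 if p < extra else 0)
        if end > start:            # skip empty segments for very short tracklets
            parts.append(frames[start:end])
        start = end
    return parts

def tracklet_prototype(frames, num_parts=2):
    """Mean of sub-tracklet means (assumed aggregation)."""
    return mean([mean(part) for part in partition(frames, num_parts)])

# Four frame-level feature vectors from one hypothetical tracklet.
frames = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
prototype = tracklet_prototype(frames, num_parts=2)
```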


[39] MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment cs.CV | cs.AI

Juan Li, Chuanghao Ding, Xujie Zhang, Cam-Tu Nguyen

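TL;DR: This paper proposes MiMIC, which targets visual modality collapse and semantic misalignment in Universal Multimodal Retrieval (UMR). With a fusion-in-decoder architecture and a robust training strategy, MiMIC outperforms existing early-fusion and late-fusion baselines on the WebQA+ and EVQA+ datasets.

Details

Motivation: Existing UMR methods suffer from either visual modality collapse (for example, Marvel relies too heavily on textual cues and ignores visual features) or semantic misalignment (for example, UniVL-DR places semantically related content far apart in the embedding space). A method is needed that mitigates both defects at once.

Result: Experiments on WebQA+ and EVQA+, where images in documents or queries may lack captions, show that MiMIC consistently outperforms early-fusion and late-fusion baselines and achieves better retrieval performance.

Insight: The innovations include: 1) a fusion-in-decoder architecture for effective multimodal integration; 2) robust training through single-modality mixin and random caption dropout. These designs help balance the use of visual and textual modalities and improve semantic alignment in the embedding space.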

Abstract: Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which project visual features into the language model (LM) space for integration with the text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model’s tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where images in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.
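
Random caption dropout can be illustrated as a data-side augmentation: with some probability, an image's caption is removed so the encoder cannot lean on text alone. The record layout and the probability `p` below are hypothetical, and the paper may apply the dropout differently.

```python
import random

def drop_caption(doc, p, rng):
    """With probability p, strip the caption from a document that has an image.

    Text-only documents are returned unchanged, since there is no image
    left to carry the content if their text is removed.
    """
    if doc.get("image") is not None and rng.random() < p:
        return {**doc, "caption": None}
    return doc

rng = random.Random(0)
image_doc = {"image": "img_001.jpg", "caption": "A red bridge at sunset"}
text_doc = {"image": None, "caption": "History of the bridge"}

always_dropped = drop_caption(image_doc, p=1.0, rng=rng)
never_dropped = drop_caption(image_doc, p=0.0, rng=rng)
text_unchanged = drop_caption(text_doc, p=1.0, rng=rng)
```

Training on a mix of captioned and caption-free images matches the evaluation setting of WebQA+ and EVQA+, where captions may be missing.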


[40] Teacher-Guided Routing for Sparse Vision Mixture-of-Experts cs.CV

Masahiro Kada, Ryota Yoshihashi, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

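TL;DR: This paper proposes TGR-MoE, a teacher-guided routing method that addresses routing instability when training sparse vision Mixture-of-Experts models. It builds a teacher router from the intermediate representations of a pretrained dense teacher model and uses it as pseudo-supervision for the student router, stabilizing training and improving performance.

Details

Motivation: Sparse Mixture-of-Experts models reduce computational cost by activating only a few experts, but router training suffers from gradient blocking and unstable routing dynamics, which makes it hard to learn appropriate expert-selection scores.

Result: Extensive experiments on ImageNet-1K and CIFAR-100 show that TGR consistently improves accuracy and routing consistency and keeps training stable even under highly sparse configurations.

Insight: The novelty is using the intermediate representations of a pretrained teacher model to guide router learning in the sparse student. This is a simple and effective form of knowledge distillation that mitigates routing instability and enables knowledge-guided expert selection from the early stages of training.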

Abstract: Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher’s intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
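
One plausible way to use teacher routing outputs as pseudo-supervision is a KL divergence between the teacher's and the student's routing distributions, added to the task loss. The sketch below computes that term for a single token. The KL form is an assumption, since the abstract does not state which loss TGR-MoE uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routing_kl(teacher_logits, student_logits):
    """KL(teacher || student) over experts. Every expert contributes,
    so the student router gets a signal even for unselected routes."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k(logits, k):
    """Indices of the k highest-scoring experts (sparse selection)."""
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

teacher = [2.0, 0.0, 0.0, 0.0]   # hypothetical teacher-router logits
student = [0.0, 0.0, 2.0, 0.0]   # hypothetical student-router logits
loss_mismatch = routing_kl(teacher, student)
loss_match = routing_kl(teacher, teacher)
selected = top_k(student, k=2)
```

Unlike the task loss, which reaches the router only through the selected experts, this term is dense over all experts, which is the property the abstract credits for reducing gradient blocking.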


[41] Latent Denoising Improves Visual Alignment in Large Multimodal Models cs.CV

Dhruv Parikh, Jacob Fein-Ashley, Rajgopal Kannan, Viktor Prasanna

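TL;DR: This paper proposes a visual supervision framework based on latent denoising to improve visual feature alignment and multimodal understanding in large multimodal models (LMMs) such as LLaVA. During training, projected visual tokens are corrupted with a saliency-aware mixture of masking and Gaussian noise, and a decoder recovers clean teacher patch features from an intermediate LLM layer, with contrastive patch distillation used to prevent representation collapse. The method adds no inference overhead, significantly improves visual understanding and reasoning on multiple standard multimodal benchmarks, and is more robust on compositional robustness benchmarks (such as NaturalBench) and under ImageNet-C-style common corruptions.

Details

Motivation: Existing LMMs such as LLaVA are usually trained with an autoregressive language modeling objective that supervises visual tokens only indirectly, which leads to weak internal visual representations and brittle behavior under distribution shift. This paper aims to provide effective visual supervision through latent denoising to improve visual feature alignment and multimodal understanding.

Result: Across a broad suite of multimodal benchmarks, the method consistently improves visual understanding and reasoning over strong baselines and yields clear gains on compositional robustness benchmarks (such as NaturalBench). Under ImageNet-C-style non-adversarial common corruptions, the model keeps higher accuracy and degrades less at both moderate and severe corruption levels.

Insight: The innovations include: 1) applying the latent denoising principle to visual supervision for LMMs, with a saliency-aware mixed corruption (masking plus Gaussian noise) that improves robustness; 2) a decoder that recovers teacher patch features from an intermediate LLM layer, combined with contrastive patch distillation to prevent representation collapse; 3) a training framework with no extra inference cost, balancing efficiency and performance. Viewed objectively, the method transfers the denoising idea from visual tokenizer learning to multimodal alignment, providing a novel and effective supervision signal for improving visual representations in LMMs.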

Abstract: Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher’s intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
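
The corruption step can be sketched as follows. This is one possible reading of "saliency-aware mixture of masking and Gaussian noising": the most salient tokens are masked and the remaining tokens receive additive Gaussian noise. The assignment rule, the zero mask vector, `mask_ratio`, and `sigma` are all assumptions, since the abstract does not spell them out.

```python
import random

def corrupt_tokens(tokens, saliency, mask_ratio=0.25, sigma=0.1, rng=random):
    """Mask the most salient tokens; add Gaussian noise to the rest.

    Returns the corrupted tokens and the set of masked indices.
    """
    n = len(tokens)
    n_mask = int(round(mask_ratio * n))
    by_saliency = sorted(range(n), key=lambda i: -saliency[i])
    masked = set(by_saliency[:n_mask])
    corrupted = []
    for i, tok in enumerate(tokens):
        if i in masked:
            corrupted.append([0.0] * len(tok))   # assumed mask vector: zeros
        else:
            corrupted.append([x + rng.gauss(0.0, sigma) for x in tok])
    return corrupted, masked

rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
saliency = [0.1, 0.9, 0.3, 0.2]
corrupted, masked = corrupt_tokens(tokens, saliency, rng=rng)
```

At inference time this function is simply not called, which is why the approach adds no inference-time overhead.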


[42] Prototype-Based Test-Time Adaptation of Vision-Language Models cs.CV

Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji

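TL;DR: This paper proposes Prototype-Based Test-Time Adaptation (PTA) for cross-domain adaptation of vision-language models (VLMs). The method accumulates knowledge from test samples by adaptively weighting updates to class-specific knowledge prototypes, with no cache mechanism, achieving state-of-the-art performance while remaining highly efficient.

Details

Motivation: Existing cache-based, backpropagation-free test-time adaptation methods have two key limitations: inference latency rises as the number of classes grows, and performance degrades when cached samples are insufficient or incorrect. This paper aims to solve both the efficiency and the performance problem.

Result: PTA achieves state-of-the-art (SOTA) performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, on 10 cross-domain benchmarks it raises CLIP's accuracy from 65.64% to 69.38% while retaining 92% of CLIP's inference speed. The cache-based method (TDA) reaches only 67.97% accuracy and runs at only 50% of CLIP's inference speed.

Insight: The novelty is replacing the cache with adaptively weighted class-specific knowledge prototypes, balancing knowledge accumulation and efficient inference. Viewed objectively, the prototype update strategy avoids cache overhead and offers a lightweight solution for test-time adaptation in large-scale settings.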

Abstract: Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample’s visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP’s accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP’s inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP’s inference speed.
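
The contrast with cache-based methods is easiest to see in code: a prototype bank keeps one vector per class and folds each test sample into it, so memory and lookup cost do not grow with the number of samples seen. The sketch below assumes a confidence-weighted running mean and assigns each sample to its top zero-shot class. Both are illustrative choices rather than the paper's exact update.

```python
class PrototypeBank:
    """One running prototype per class; no per-sample cache is stored."""

    def __init__(self, num_classes, dim):
        self.protos = [[0.0] * dim for _ in range(num_classes)]
        self.mass = [0.0] * num_classes   # accumulated confidence per class

    def update(self, feat, probs):
        """Fold `feat` into the prototype of its most likely zero-shot class,
        weighted by that class's zero-shot confidence."""
        c = max(range(len(probs)), key=probs.__getitem__)
        w = probs[c]
        total = self.mass[c] + w
        self.protos[c] = [
            (self.mass[c] * p + w * f) / total
            for p, f in zip(self.protos[c], feat)
        ]
        self.mass[c] = total
        return c

bank = PrototypeBank(num_classes=2, dim=2)
first = bank.update([1.0, 0.0], probs=[0.8, 0.2])   # confident sample
second = bank.update([0.0, 1.0], probs=[0.6, 0.4])  # less confident sample
```

Low-confidence samples move the prototype less, which limits the damage from wrong zero-shot predictions that a cache would store verbatim.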


[43] KD-CVG: A Knowledge-Driven Approach for Creative Video Generation cs.CV

Linkai Liu, Wei Feng, Xi Zhao, Shen Zhang, Xingye Chen

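TL;DR: This paper proposes KD-CVG, a knowledge-driven approach to creative video generation that targets two challenges of existing text-to-video models in advertising content generation: ambiguous semantic alignment and inadequate motion adaptability. Built on a newly constructed advertising creative knowledge base, the method uses two modules, semantic-aware retrieval and multimodal knowledge reference, to improve the model's understanding of how product selling points relate to creative video content and to inject semantic and motion priors that fill the gaps of existing models.

Details

Motivation: Research on creative generation has focused mainly on advertising text and images, leaving creative video generation relatively underexplored. Existing text-to-video models face two challenges: ambiguous semantic alignment, where models struggle to link product selling points to creative video content, and inadequate motion adaptability, which produces unrealistic or distorted motion in generated videos.

Result: Extensive experiments show that KD-CVG performs strongly on both semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods.

Insight: The novelty lies in building a comprehensive advertising creative knowledge base as a foundational resource and proposing a knowledge-driven framework with semantic-aware retrieval and multimodal knowledge reference modules. Semantic-aware retrieval uses graph attention networks and reinforcement learning feedback to strengthen the model's understanding, while multimodal knowledge reference injects semantic and motion priors into the T2V model to address its knowledge limitations. This offers a new direction for designing knowledge-enhanced generative models.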

Abstract: Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) ambiguous semantic alignment, where models struggle to accurately correlate product selling points with creative video content, and (b) inadequate motion adaptability, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model’s comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG’s superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at https://kdcvg.github.io/KDCVG/.


[44] EdgeFormer: local patch-based edge detection transformer on point clouds cs.CV

Yifei Xie, Zhikun Tu, Tong Yang, Yuhe Zhang, Xinyu Zhou

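TL;DR: This paper proposes EdgeFormer, a learning-based network for detecting edge points on 3D point clouds. The method has two main stages: it first builds feature descriptors for the local neighborhood of each point, then classifies each point from these local patch features. Converting the point cloud into local patches allows the method to extract finer edge details effectively.

Details

Motivation: Edge points on 3D point clouds clearly convey geometry and surface characteristics and are valuable for many vision applications, but fine-grained edge features are hard to detect effectively because they are densely distributed or have small-scale surface gradients.

Result: Experimental results show that the model achieves competitive performance compared with six baselines.

Insight: The novelty is converting edge detection on the whole point cloud into point classification based on local patches, using the high correlation among spatially neighboring points to build local patch feature descriptors that capture fine details. Viewed objectively, this local patch conversion is an effective strategy for handling fine-grained features in point clouds.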

Abstract: Edge points on 3D point clouds can clearly convey 3D geometry and surface characteristics; edge detection is therefore widely used in many vision applications with high industrial and commercial demands. However, fine-grained edge features are difficult to detect effectively as they are generally densely distributed or exhibit small-scale surface gradients. To address this issue, we present a learning-based edge detection network, named EdgeFormer, which mainly consists of two stages. Based on the observation that spatially neighboring points tend to exhibit high correlation, forming the local underlying surface, we convert the edge detection of the entire point cloud into a point classification based on local patches. Therefore, in the first stage, we construct local patch feature descriptors that describe the local neighborhood around each point. In the second stage, we classify each point by analyzing the local patch feature descriptors generated in the first stage. Due to the conversion of the point cloud into local patches, the proposed method can effectively extract the finer details. The experimental results show that our model demonstrates competitive performance compared to six baselines.
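
The conversion from a whole point cloud to per-point local patches can be illustrated with a brute-force nearest-neighbor lookup. The sketch assumes a patch is the k nearest points (including the point itself) expressed as offsets from the center. The patch size and the centering step are assumptions, and the learned descriptor and classifier stages are not shown.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def local_patch(points, index, k):
    """Return the k nearest points to points[index], centered on that point."""
    center = points[index]
    nearest = sorted(range(len(points)), key=lambda j: sq_dist(points[j], center))[:k]
    return [[a - c for a, c in zip(points[j], center)] for j in nearest]

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
# One patch per point; each patch would then be classified as edge or non-edge.
patches = [local_patch(cloud, i, k=3) for i in range(len(cloud))]
```

Centering makes each patch translation-invariant, so the classifier sees only local shape and not the point's absolute position.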


[45] VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought cs.CV | cs.AI

Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim

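TL;DR: This paper proposes the VG-CoT dataset, which uses a fully automated three-stage pipeline to explicitly link every step of multi-step reasoning to concrete visual evidence in the image, addressing the limited scalability and reasoning interpretability of existing datasets. The paper also introduces a new benchmark that evaluates the reasoning trustworthiness of large vision-language models along three dimensions. Experiments show that training with the dataset effectively improves reliable reasoning.

Details

Motivation: Existing large vision-language models fall short on precise reasoning grounded in local regions. Related datasets scale poorly because they depend on extensive manual annotation, and they lack explicit alignment between reasoning steps and visual evidence, which limits the evaluation of model trustworthiness.

Result: Experiments on representative models, including LLaVA-1.5 and Qwen2-VL, show that using the VG-CoT dataset brings consistent improvements on most evaluation metrics, confirming that it effectively strengthens evidence-based, trustworthy reasoning.

Insight: The novelty is a fully automated, scalable pipeline for building a dataset that explicitly links reasoning steps to visual evidence, together with a new benchmark that evaluates trustworthiness along three dimensions: rationale quality, answer accuracy, and reasoning-answer alignment. This offers a new approach to data construction and evaluation for improving the interpretability and trustworthiness of visual reasoning.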

Abstract: The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model’s logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and a lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.


[46] S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV

Qingxiao Li, Lifeng Xu, QingLi Wang, Yudong Bai, Mingwei Ou

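TL;DR: This paper presents S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which lets the model actively manipulate images by executing Python code during reasoning. The model generates and runs image-processing code and uses the intermediate visual results over multiple iterative turns to support its reasoning, which suits challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning.

Details

Motivation: Existing models are weak at scientific multimodal reasoning in complex scenarios that require active visual manipulation and iterative analysis (such as high-resolution charts and microscopic images). The work also addresses the redundant, ineffective, and erroneous visual operations that are common in existing datasets.

Result: The model is evaluated on 13 benchmarks. S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms the compared systems on scientific reasoning benchmarks such as Physics and VRSBench.

Insight: The innovations include: 1) a unified model that combines the Scientific Reasoning and Thinking-with-Images paradigms; 2) a reasoning mechanism that actively manipulates images through code execution over multiple iterative turns; 3) a six-dimensional quality filtering framework and a multi-stage filtering pipeline, paired with an adaptive data routing strategy, that improve training data quality and teach the model when image operations are truly necessary; 4) a four-stage progressive training pipeline that includes SFT and reinforcement learning.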

Abstract: We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.
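
The adaptive data routing idea, converting low-gain samples into pure reasoning data, can be sketched as a threshold rule. Everything specific below is hypothetical: the `visual_gain` score, the threshold, the record layout, and the choice to drop image-operation steps when converting. The abstract does not describe how gain is measured or how trajectories are rewritten.

```python
def route_sample(sample, min_gain=0.5):
    """Keep Thinking-with-Images samples whose image operations add enough
    visual information; convert the rest to pure reasoning-mode data."""
    if sample["visual_gain"] >= min_gain:
        return {**sample, "mode": "thinking_with_images"}
    text_steps = [s for s in sample["steps"] if s["type"] != "image_op"]
    return {**sample, "mode": "reasoning", "steps": text_steps}

crop_helps = {
    "visual_gain": 0.9,
    "steps": [
        {"type": "text", "content": "The axis labels are too small to read."},
        {"type": "image_op", "content": "crop the lower-left corner and upscale"},
        {"type": "text", "content": "The x-axis unit is nanometers."},
    ],
}
crop_redundant = {
    "visual_gain": 0.1,
    "steps": [
        {"type": "image_op", "content": "crop the full image"},
        {"type": "text", "content": "The triangle is right-angled, so c = 5."},
    ],
}

kept = route_sample(crop_helps)
converted = route_sample(crop_redundant)
```

Training on both outcomes is what lets the model learn when an image operation is worth making, rather than always or never making one.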


[47] Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision cs.CV | cs.HC

Chentao Li, Zirui Gao, Mingze Gao, Yinglian Ren, Jianjiang Feng

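TL;DR: This paper addresses the weak grasp of spatial semantics that multimodal large language models show when interpreting pointing gestures from an egocentric view. It introduces EgoPoint-Bench, a benchmark of over 11k simulated and real-world samples for evaluating and improving pointing reasoning. Experiments show that existing SOTA models perform poorly on the task, while models fine-tuned on the paper's synthetic data achieve significant gains and generalize well from simulation to the real world.

Details

Motivation: Egocentric AI agents such as smart glasses must interpret pointing gestures that accompany natural language commands. Current models rely on spurious correlations with visual proximity or object saliency, producing "Referential Hallucination": a failure to perform referential reasoning that is precisely grounded in spatial semantics.

Result: Extensive experiments on EgoPoint-Bench show that current state-of-the-art proprietary and open-source models struggle with egocentric pointing, while models fine-tuned on the paper's synthetic data achieve significant performance gains and strong sim-to-real generalization.

Insight: The novelty lies in exposing a key deficiency of MLLMs in egocentric referential reasoning (Referential Hallucination) and building a comprehensive benchmark spanning multiple evaluation dimensions and levels of complexity. Viewed objectively, the proposed fine-tuning with spatially aware supervision from synthetic data offers a scalable path toward precise egocentric AI assistants.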

Abstract: Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term “Referential Hallucination.” To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/


[48] Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction cs.CVPDF

Yanjiao Liu, Jiawei Liu, Xun Gong, Zifei Nie

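TL;DR: This study proposes a vehicle trajectory prediction framework that uses frozen large language models (LLMs) as the spatio-temporal reasoning engine. A traffic encoder extracts trajectory features of dynamic traffic agents, a lightweight CNN encodes HD map information, an adapter converts the features into LLM-compatible tokens, and a linear decoder outputs the future trajectories.

Details

Motivation: Applying LLMs safely in autonomous driving requires understanding both the behavior of dynamic traffic agents and the topology of static road infrastructure. The study sets out to evaluate how well LLMs handle both.

Result: The framework supports a quantitative analysis of how multi-modal information, especially map semantics, affects trajectory prediction accuracy. It also generalizes well across different LLM architectures and provides a unified platform for model evaluation.

Insight: The novelty lies in using frozen LLMs as the core reasoning engine and integrating them with the perception modules through lightweight feature adaptation. This shifts the prediction burden onto the LLM, highlights its intrinsic reasoning ability, and yields a unified framework for analyzing the effect of map semantics.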

Abstract: Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. Because the prediction burden rests with the LLMs, a simple linear decoder suffices to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.


[49] VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection cs.CVPDF

Yupeng Zhang, Ruize Han, Ningnan Guo, Wei Feng, Song Wang

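TL;DR: This paper proposes VFM$^{4}$SDG, a framework for single-domain generalized object detection (SDGOD) that targets the performance drop caused by complex domain shifts such as weather and illumination. It introduces a frozen vision foundation model (VFM) as a cross-domain stability prior. In the encoding stage it makes object-background and inter-instance relational modeling more robust, and in the decoding stage it enhances query representations with semantic-contextual priors to stabilize semantic recognition and spatial localization in unseen domains.

Details

Motivation: In real-world scenes, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, so detectors trained on a single source domain degrade severely in unseen environments. Existing SDGOD methods rely mainly on data augmentation or domain-invariant representation learning and pay limited attention to detector mechanisms, which leaves clear limitations under complex domain shifts.

Result: Extensive experiments show that the method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and on two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.

Insight: The novelty is the use of a frozen vision foundation model as a cross-domain stability prior in both detector representation learning and query modeling. Cross-domain Stable Relational Prior Distillation stabilizes the encoding stage, and Semantic-Contextual Prior-based Query Enhancement stabilizes the decoding stage, which together systematically reduce the missed detections caused by domain shift.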

Abstract: In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.


[50] Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models cs.CV | cs.CLPDF

Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra

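TL;DR: This paper systematically evaluates how reliable large vision-language models (VLMs) are as evaluators for image-to-text (I2T) and text-to-image (T2I) tasks. Using a benchmark of more than 4000 perturbed instances, it finds that current VLM evaluators have substantial blind spots, such as difficulty detecting hallucinations and spatial reasoning errors, and that their judgments are unreliable.

Details

Motivation: VLMs are increasingly used to evaluate the outputs of other models, yet their reliability remains underexplored. The paper aims to systematically assess Evaluator VLMs along key error dimensions.

Result: On a comprehensive benchmark covering 40 perturbation dimensions, 4 prominent VLMs were evaluated. Their failure rate in detecting perturbed outputs sometimes exceeds 50%. They struggle particularly with fine-grained compositional and spatial errors and are often insensitive to hallucinated content that contradicts the input image. The pairwise comparison paradigm is more reliable, though failures persist.

Insight: The paper builds a systematic perturbation benchmark that reveals the blind spots of VLM evaluators. Viewed objectively, current VLM evaluators are limited in detecting fine-grained errors, so they should be used with caution in benchmarking and development decisions.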

Abstract: Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.


[51] Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models cs.CV | cs.AIPDF

Laura Valeria Perez-Herrera, M. J. Garcia-Gonzalez, Karen Lopez-Linares

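TL;DR: This study proposes an attention-based multiple instance learning (ABMIL) framework that predicts the predominant growth pattern of lung adenocarcinoma (LUAD) at the whole slide image (WSI) level, reducing the annotation burden. The method uses pretrained pathology foundation models as patch encoders and aggregates their features through attention. Experiments show that fine-tuned encoders perform better.

Details

Motivation: LUAD grading depends on accurately identifying growth patterns, and existing deep learning approaches usually require extensive patch-level annotations. The study aims to develop a WSI-level prediction method that needs fewer annotations.

Result: Under the ABMIL framework, the fine-tuned Prov-GigaPath encoder achieved the highest agreement (κ = 0.699) and outperformed simple patch-aggregation baselines.

Insight: The novelty is the combination of pretrained pathology foundation models with ABMIL, using attention and slide-level supervision to obtain more robust predictions. This offers a workable route to WSI analysis with less dependence on annotations.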

Abstract: Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (κ = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.
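
The attention aggregation step described above can be illustrated with a minimal sketch. It assumes the commonly used attention-MIL pooling form (a softmax over per-patch scores w^T tanh(V h_k)); the abstract does not state which variant the authors use, and the names attention_pool, w, and v are hypothetical.

```python
import math

def attention_pool(patch_embeddings, w, v):
    """Pool patch embeddings into one slide embedding with attention weights.

    Assumed form (not confirmed by the source): score_k = w^T tanh(V h_k),
    weights = softmax(scores), slide = sum_k weights_k * h_k.
    """
    scores = []
    for h in patch_embeddings:
        # Hidden projection tanh(V h) for this patch.
        hidden = [math.tanh(sum(v_ij * h_j for v_ij, h_j in zip(row, h))) for row in v]
        scores.append(sum(w_i * u_i for w_i, u_i in zip(w, hidden)))
    # Numerically stable softmax over patches.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patch_embeddings[0])
    slide = [sum(a * h[d] for a, h in zip(weights, patch_embeddings)) for d in range(dim)]
    return slide, weights
```

In a real pipeline the patch embeddings would come from the pathology foundation model and w and v would be learned from slide-level labels; here they are plain lists so the pooling arithmetic is easy to follow.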


[52] Deep kernel video approximation for unsupervised action segmentation cs.CVPDF

Silvia L. Pintea, Jouke Dijkstra

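TL;DR: This paper proposes an unsupervised video action segmentation method based on learning in deep kernel space. It uses maximum mean discrepancy (MMD) to measure the gap between the original video distribution and its approximation, and neural tangent kernels (NTKs) to improve descriptive power. On six standard benchmarks it is competitive with state-of-the-art per-video methods.

Details

Motivation: For applications where storing large datasets is infeasible or not permitted, the paper studies per-video unsupervised action segmentation, segmenting a video by approximating its frame distribution through learning in deep kernel space.

Result: On six standard benchmarks the method achieves results competitive with state-of-the-art per-video methods. When the number of segments is unknown, its F1 scores are higher than those of prior agglomerative methods.

Insight: The novelty includes using MMD as a geometry-preserving metric in distribution space to optimize the approximation, and introducing NTKs to avoid the trivial solution while improving the kernel's descriptive power. Viewed objectively, the method combines the strengths of kernel methods and deep learning for unsupervised segmentation, improving efficiency and robustness.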

Abstract: This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD), which is a geometry-preserving metric in distribution space and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, both because of their improved descriptive power as opposed to fixed kernels and because NTKs sidestep the trivial solution when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work when the number of segments is unknown.
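
As a rough illustration of the MMD quantity the abstract relies on, the sketch below computes the biased (V-statistic) estimate of squared MMD between two small samples. It deliberately swaps in a fixed RBF kernel as a stand-in: the paper uses neural tangent kernels, which are not reproduced here. The function names and the bandwidth parameter are assumptions.

```python
import math

def rbf(x, y, bandwidth=1.0):
    """Fixed RBF kernel; a stand-in for the NTK used in the paper."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=rbf):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

For example, with xs as the original frame features and ys as a candidate approximation, a value near zero indicates that the two samples are hard to tell apart under the chosen kernel.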


[53] OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction cs.CV | cs.GRPDF

Zeyu Cai, Yuliang Xiu, Renke Wang, Zhijing Shao, Xiaoben Li

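TL;DR: This paper proposes OmniFit, a multi-modal 3D body fitting method that handles full scans, partial depth observations, and image captures without requiring a known metric scale. Its core is a conditional transformer decoder that maps surface points directly to dense body landmarks, which are then used to fit SMPL-X parameters, optionally combined with visual cues to compensate for missing geometry. A scale predictor additionally rescales subjects to canonical body proportions.

Details

Motivation: Existing methods usually focus on single-modal inputs (point clouds or multi-view images) and require a known metric scale, which is impractical for AI-generated assets where scale distortion is common. OmniFit aims to remove these constraints and handle multi-modal inputs in a scale-agnostic way.

Result: OmniFit substantially outperforms existing SOTA methods across daily and loose clothing scenarios, by 57.1% to 80.9%. According to the authors, it is the first method to surpass multi-view optimization baselines, and it achieves millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.

Insight: The contributions are: 1) a conditional transformer decoder that predicts dense landmarks directly, simplifying the fitting pipeline; 2) an optional plug-and-play image adapter that uses visual cues to supplement geometric information; 3) a dedicated scale predictor that makes processing scale-agnostic. Together these let the method handle multi-modal, scale-distorted inputs flexibly and with high accuracy.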

Abstract: Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.


[54] DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures cs.CVPDF

Xu Wang, Zhiru Wang, Shiyun Xie, Chengwei Pan, Yisong Chen

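TL;DR: DualSplat proposes a "Failure-to-Prior" framework to address the degradation of 3D Gaussian Splatting when training images contain transient objects. It uses pseudo-masks derived from first-pass reconstruction failures to guide a clean second-pass 3DGS optimization, breaking the circular dependency between accurate transient detection and clean reconstruction.

Details

Motivation: 3D Gaussian Splatting degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks.

Result: On the RobustNeRF and NeRF On-the-go benchmarks, DualSplat outperforms existing baselines, with particularly clear advantages in transient-heavy scenes and transient regions.

Insight: The core idea is the Failure-to-Prior framework. During conservative first-pass training, transient objects show up as incomplete fragments, and these reconstruction failures are converted into explicit priors (object-level pseudo-masks) for the second pass. The pseudo-masks combine photometric residuals, feature mismatches, and SAM2 instance boundaries, and a lightweight MLP refines them online, gradually shifting from prior supervision to self-consistency.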

Abstract: While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.


[55] Encoder-Free Human Motion Understanding via Structured Motion Descriptions cs.CVPDF

Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

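TL;DR: This paper proposes Structured Motion Description (SMD), a rule-based, deterministic method that converts sequences of human joint positions into structured natural language describing joint angles, body-part movements, and global trajectory. Representing motion as text lets large language models (LLMs) apply their pretrained knowledge directly to motion reasoning, with no dedicated encoder or alignment module to learn, and the method surpasses prior work on motion question answering and motion captioning.

Details

Motivation: Existing LLM-based approaches to human motion understanding (such as motion question answering and captioning) typically rely on dedicated encoders that project motion features into the LLM's embedding space, and so remain constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics serve as a precise descriptive language, the paper explores an encoder-free approach that draws directly on the LLM's pretrained knowledge.

Result: On motion question answering, the method reaches 66.7% on BABEL-QA and 90.1% on HuMMan-QA. On motion captioning, it reaches an R@1 of 0.584 and a CIDEr of 53.16 on HumanML3D. These results surpass all prior methods and set a new SOTA.

Insight: The core novelty is a rule-based method that deterministically converts 3D human motion data into structured text descriptions (SMD), bypassing the bottleneck of cross-modal alignment and making motion understanding encoder-free. The LLM's pretrained knowledge of body parts, spatial directions, and movement semantics transfers directly. Practical benefits include model-agnosticism (the same input works across different LLMs with lightweight LoRA adaptation) and interpretability (attention can be analyzed over the motion descriptions).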

Abstract: The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM’s embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach surpasses all prior methods on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D). SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
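
To make the idea of a rule-based joint-position-to-text conversion concrete, here is a minimal sketch for a single joint. It is not the authors' SMD implementation: the thresholds, the wording of the output, and the function names are all hypothetical, and only the geometric step (the angle at a joint from three 3D positions) is standard.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a, b, c.

    Assumes a and c are distinct from b (no zero-length segments).
    """
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def describe_elbow(shoulder, elbow, wrist):
    """Turn one frame's elbow angle into text; thresholds are illustrative."""
    angle = joint_angle_deg(shoulder, elbow, wrist)
    if angle > 150:
        state = "nearly straight"
    elif angle > 90:
        state = "slightly bent"
    else:
        state = "strongly bent"
    return f"elbow angle {angle:.0f} degrees ({state})"
```

Because the mapping is deterministic, the same pose always yields the same sentence, which is the property that lets a text-only LLM consume motion without a learned encoder.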


[56] Sapiens2 cs.CVPDF

Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Su Zhaoen

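TL;DR: Sapiens2 is a family of high-resolution transformer models for human-centric vision tasks, aimed at better generalization, versatility, and high-fidelity outputs. Model sizes range from 0.4B to 5B parameters, with native 1K resolution and hierarchical variants that support 4K. Pretraining combines masked image reconstruction with self-distilled contrastive objectives on a dataset of 1 billion high-quality human images and incorporates architectural advances from frontier models. Sapiens2 substantially improves over its predecessor on pose estimation, body-part segmentation, and normal estimation, and extends to new tasks such as pointmap and albedo estimation.

Details

Motivation: Existing models for human-centric vision fall short in generalization, versatility, and output fidelity, particularly in capturing both low-level detail and high-level semantics at high resolution.

Result: Sapiens2 sets a new SOTA on pose estimation (+4 mAP), body-part segmentation (+24.3 mIoU), and normal estimation (45.6% lower angular error), and extends to new tasks such as pointmap and albedo estimation.

Insight: The contributions include a unified pretraining objective (masked reconstruction combined with self-distilled contrastive learning) that balances detail and semantics, a large-scale high-quality human image dataset, windowed attention for reasoning over long spatial context at 4K, and architectural improvements that stabilize training. These ideas can inform the design of other high-resolution vision models.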

Abstract: We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2


[57] WorldMark: A Unified Benchmark Suite for Interactive Video World Models cs.CVPDF

Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao

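TL;DR: This paper introduces WorldMark, the first unified benchmark suite designed for interactive Image-to-Video world models. It addresses the lack of fair cross-model comparison caused by models being evaluated on private scenes and trajectories. WorldMark provides a standardized test environment consisting of a unified action-mapping layer, a hierarchical test suite, and a modular evaluation toolkit, and launches the online platform World Model Arena for live model battles and a leaderboard.

Details

Motivation: Interactive video generation models (such as Genie and YUME) are each evaluated on their own private benchmarks, without shared test conditions such as identical scenes and action sequences, which makes fair cross-model comparison impossible.

Result: WorldMark enables standardized evaluation across six major models with 500 test cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard, spanning 20-60s.

Insight: The contributions include a unified WASD-style action-mapping layer that enables fair cross-model comparison, a hierarchical test suite that broadens coverage, and a modular evaluation toolkit that accommodates future metrics. Viewed objectively, the benchmark's standardized inputs and online arena promote reproducible research and open competition in interactive video generation.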

Abstract: Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions – identical scenes, identical action sequences, and a unified control interface – needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model’s native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
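
The unified action-mapping layer in contribution (1) can be pictured as a small adapter registry. The sketch below is purely illustrative: the model names model_a and model_b, their "native" formats, and every function name are hypothetical and do not correspond to the control interfaces of Genie, YUME, HY-World, Matrix-Game, or the WorldMark code.

```python
SHARED_ACTIONS = ("W", "A", "S", "D")

def to_keyboard_tokens(actions):
    """Hypothetical native format 1: one direction word per step."""
    names = {"W": "forward", "A": "left", "S": "backward", "D": "right"}
    return [names[a] for a in actions]

def to_velocity_vectors(actions, speed=1.0):
    """Hypothetical native format 2: a planar (x, y) velocity per step."""
    vectors = {
        "W": (0.0, speed),
        "S": (0.0, -speed),
        "A": (-speed, 0.0),
        "D": (speed, 0.0),
    }
    return [vectors[a] for a in actions]

# One adapter per model; adding a model means adding one entry here.
ADAPTERS = {"model_a": to_keyboard_tokens, "model_b": to_velocity_vectors}

def translate(model_name, actions):
    """Translate a shared WASD action sequence into a model's native format."""
    unknown = [a for a in actions if a not in SHARED_ACTIONS]
    if unknown:
        raise ValueError(f"actions outside the shared vocabulary: {unknown}")
    return ADAPTERS[model_name](actions)
```

The design point is that the scene and the shared action sequence stay fixed while only the adapter varies, so differences in output can be attributed to the model rather than to the test input.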


[58] Efficient Logic Gate Networks for Video Copy Detection cs.CV | cs.AI | cs.IRPDF

Katarzyna Fojcik

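TL;DR: This paper proposes a video copy detection framework based on differentiable Logic Gate Networks (LGNs), targeting the high computational cost and large descriptor size of large-scale video copy detection. It combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit for fast, memory-efficient inference.

Details

Motivation: Deep neural networks for video copy detection are computationally expensive and produce large descriptors, which makes them hard to deploy in high-throughput systems. The paper seeks a scalable, resource-efficient alternative.

Result: Across multiple dataset folds and difficulty levels, LGN-based models achieve accuracy and ranking performance that is competitive with or better than prior models, while producing descriptors several orders of magnitude smaller and running inference at more than 11k samples per second.

Insight: The novelty is bringing differentiable Logic Gate Networks to video copy detection, replacing conventional floating-point feature extractors with logic operations and binary representations. This yields a highly compact model and a large gain in inference efficiency, and suggests a new direction for scalable detection in resource-constrained settings.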

Abstract: Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.
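
A single differentiable logic gate can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's model: it uses only four of the sixteen possible two-input gates, the usual real-valued relaxations of those gates, and a softmax over per-gate logits; the names GATES, soft_gate, and hard_gate are hypothetical.

```python
import math

# Real-valued relaxations that agree with the Boolean gates on {0, 1} inputs.
GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2 * a * b,
    "NAND": lambda a, b: 1 - a * b,
}

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_gate(a, b, logits):
    """Training-time output: probability-weighted mixture of relaxed gates."""
    probs = softmax(logits)
    return sum(p * gate(a, b) for p, gate in zip(probs, GATES.values()))

def hard_gate(a, b, logits):
    """After discretization: apply only the most likely gate to Boolean inputs."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    name = list(GATES)[best]
    return int(GATES[name](int(a), int(b))), name
```

During training the logits would be updated by gradient descent through soft_gate; at inference only hard_gate is needed, which is what allows the network to run as a purely Boolean circuit.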


[59] Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery cs.CV | cs.MMPDF

Yang Liu, Zhiyong Zhang

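TL;DR: This paper proposes a brain-inspired synergistic framework that combines the discriminative power of vision transformers with the generative capability of conditional diffusion models to recover 3D human meshes from monocular RGB images under occlusion. A ViT branch extracts deterministic visual cues from visible regions, a diffusion branch synthesizes structurally coherent body representations, and a diverse-consistent feature learning module together with a cross-attention multi-level fusion mechanism bridges the two branches.

Details

Motivation: 3D human mesh recovery from monocular RGB images performs poorly under partial or severe occlusion. Regression-based methods can produce implausible results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may lose fidelity on rare poses because they over-rely on generation.

Result: On standard benchmarks the method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.

Insight: The novelty is the synergy between discriminative and generative methods: the diverse-consistent feature learning module aligns discriminative features with generative priors, and the cross-attention multi-level fusion mechanism enables bidirectional interaction across semantic levels, offering a new approach to occlusion-robust 3D human recovery.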

Abstract: 3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.


[60] Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation cs.CVPDF

Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun

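TL;DR: Through systematic ablation studies, this paper identifies the critical factors that drive the performance of feed-forward visual geometry estimation models. Based on these findings it proposes CARVE, which uses a consistency loss and an efficient high-resolution architecture to achieve strong and robust performance on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation benchmarks.

Details

Motivation: Multi-frame models usually produce better cross-frame consistency but often underperform strong per-frame methods on single-frame accuracy. The paper sets out to close this gap by systematically investigating the factors that drive model performance.

Result: CARVE achieves strong and robust performance across benchmarks for point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation.

Insight: Key findings: 1) scaling up data diversity and quality yields further gains even for SOTA methods; 2) commonly adopted confidence-aware and gradient-based losses may unintentionally hinder performance; 3) joint supervision through per-sequence and per-frame alignment improves results, whereas local region alignment degrades them. The novelty lies in a consistency loss and an efficient high-resolution architecture that integrate the advantages of optimization-based methods and high-resolution inputs.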

Abstract: Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
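
One way to read the consistency loss between depth maps, camera parameters, and point maps is as a penalty on the gap between a predicted point map and the points obtained by unprojecting the predicted depth with the predicted intrinsics. The sketch below illustrates that reading only. The abstract does not give the loss formula, so the pinhole model, the L1 penalty, the restriction to the camera frame (no pose), and all names are assumptions.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole unprojection of pixel (u, v) at the given depth to camera space."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def consistency_loss(depth_map, point_map, fx, fy, cx, cy):
    """Mean per-pixel L1 gap between unprojected depth and the point map.

    depth_map[v][u] is a depth value; point_map[v][u] is an (x, y, z) tuple
    expressed in the same camera frame (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            predicted = unproject(u, v, depth, fx, fy, cx, cy)
            target = point_map[v][u]
            total += sum(abs(a - b) for a, b in zip(predicted, target))
            count += 1
    return total / count
```

The loss is zero exactly when the three predictions agree, which is the sense in which such a term ties separate output heads together.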


[61] Building a Precise Video Language with Human-AI Oversight cs.CV | cs.AI | cs.CL | cs.LG | cs.MMPDF

Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang

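TL;DR: This paper proposes a method for building precise video-language models through collaborative human-AI oversight. It defines a structured specification for video description, introduces the CHAI oversight framework to raise annotation quality, and uses the framework to improve open-source models on caption generation, reward modeling, and critique generation. The approach is then applied to re-captioning large-scale professional videos and to fine-grained control of video generation models.

Details

Motivation: Existing video-language models lack precision and professional rigor when describing the dynamic visual world. The paper aims to improve the quality and controllability of video descriptions through a structured specification and oversight by human experts.

Result: On video captioning, the supervised model (based on Qwen3-VL) outperforms closed-source models such as Gemini-3.1-Pro. Video generation models fine-tuned with this approach (such as Wan) follow detailed prompts of up to 400 words more faithfully, giving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing.

Insight: The novelty is the combination of a structured specification of visual primitives with a critique-based human-AI oversight framework (CHAI). Text generation is assigned to the model so that human experts can focus on verification and revision, which raises annotation quality and model performance efficiently and offers a scalable oversight recipe for professional-level video understanding and generation.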

Abstract: Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/


[62] Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection cs.CV | cs.LGPDF

Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He

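TL;DR: This paper proposes Ramen, a framework for robust test-time adaptation of vision-language models under mixed-domain distribution shift. For each new test sample, Ramen's active sample selection retrieves a customized batch of relevant samples, built according to two criteria (domain consistency and prediction balance), and uses an embedding-gradient cache to improve efficiency and avoid additional forward or backward passes.

Details

Motivation: Pretrained vision-language models such as CLIP generalize well zero-shot but are sensitive to distribution shift. Existing test-time adaptation methods usually assume that test samples come from a single, consistent domain, whereas real test data often mix samples from domains with distinct characteristics, which degrades their performance.

Result: Experiments on multiple image corruption and domain-shift benchmarks show that Ramen achieves strong and consistent performance, providing robust and efficient adaptation in complex mixed-domain scenarios.

Insight: The novelty is an active sample selection framework that builds a customized adaptation batch from the domain consistency and prediction balance criteria to cope with mixed-domain shift. The embedding-gradient cache stores the embeddings and sample-level gradients of past test images for retrieval and model updates, which significantly improves efficiency.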

Abstract: Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
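
The retrieval-and-aggregation loop described above can be sketched as follows. This is an illustration under assumptions, not the released Ramen code: domain consistency is approximated by cosine similarity of cached embeddings, prediction balance by a per-class cap on the selected batch, and the cache entry layout (emb, pred, grad) and all function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_batch(query_emb, cache, k=4, per_class_cap=2):
    """Pick up to k cached samples: most similar first, capped per predicted class."""
    ranked = sorted(cache, key=lambda e: cosine(query_emb, e["emb"]), reverse=True)
    batch, counts = [], {}
    for entry in ranked:
        label = entry["pred"]
        if counts.get(label, 0) < per_class_cap:
            batch.append(entry)
            counts[label] = counts.get(label, 0) + 1
        if len(batch) == k:
            break
    return batch

def aggregate_gradients(batch):
    """Average the cached per-sample gradients; no new forward/backward pass."""
    dim = len(batch[0]["grad"])
    return [sum(e["grad"][d] for e in batch) / len(batch) for d in range(dim)]
```

The averaged gradient would then drive the model update for the incoming sample. Because both the embeddings and the gradients are read from the cache, selecting a different batch for each test sample adds only retrieval cost.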


[63] Interpretable facial dynamics as behavioral and perceptual traces of deepfakes cs.CV | cs.HC | cs.LGPDF

Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve

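TL;DR: This study proposes an interpretable deepfake detection approach based on bio-behavioral features of facial dynamics. It extracts low-dimensional spatiotemporal patterns of facial movement and classifies them with traditional machine learning classifiers. Detection is substantially more accurate for videos with emotive expressions, and the study also analyzes how model decisions relate to human perceptual judgments.

Details

Motivation: Deepfake detection research relies mainly on deep learning models that perform well on benchmarks but are not interpretable and do not reveal what distinguishes real from manipulated facial behavior. The study offers an interpretable alternative grounded in bio-behavioral features and examines how computational detection strategies relate to human perceptual judgments.

Result: Traditional machine learning classifiers trained on the extracted temporal features achieved modest but significantly above-chance deepfake classification, with much higher accuracy on emotive than on non-emotive videos. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes.

Insight: The novelty is extracting interpretable bio-behavioral features from facial dynamics (such as higher-order temporal irregularities) as a "behavioral fingerprint" of deepfakes, which is most salient during emotional expression. The comparison between model and human judgments also suggests that interpretable computational features and human perception may offer complementary rather than redundant routes to detection, giving a new perspective on explainable detection.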

Abstract: Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.


[64] Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting cs.CVPDF

Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar

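TL;DR: This paper proposes Reshoot-Anything, a self-supervised framework for precise camera control when reshooting dynamic videos. It generates pseudo multi-view training triplets (source video, geometric anchor, target video) from monocular videos, with no need for paired multi-view data. At inference, it uses a 4D point-cloud anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.

Details

Motivation: Precise camera control for non-rigid scenes is limited by the lack of paired multi-view data; the paper proposes a scalable self-supervised framework that can exploit internet-scale monocular videos.

Result: The method achieves state-of-the-art (SOTA) temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.

Insight: The core innovation is generating pseudo multi-view training triplets by extracting distinct smooth random-walk crop trajectories from a single video, and generating a synthetic anchor by forward warping. This forces the model to implicitly learn 4D spatiotemporal structure so that it routes and re-projects textures across times and viewpoints rather than simply copying information.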

Abstract: Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
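
The "smooth random-walk crop trajectories" step can be sketched as follows. This is only an illustration under stated assumptions: the function name, the momentum-smoothed Gaussian velocity, and all parameter values are hypothetical, and the abstract does not say how the authors parameterize their walks.

```python
import random

def smooth_random_walk_crops(num_frames, frame_w, frame_h, crop_w, crop_h,
                             step=4.0, momentum=0.9, seed=0):
    """Return one (x, y, w, h) crop box per frame along a smoothed random walk."""
    rng = random.Random(seed)
    max_x, max_y = frame_w - crop_w, frame_h - crop_h
    x, y = rng.uniform(0, max_x), rng.uniform(0, max_y)
    vx = vy = 0.0
    boxes = []
    for _ in range(num_frames):
        # Momentum on the velocity keeps the crop motion smooth between frames.
        vx = momentum * vx + (1 - momentum) * rng.gauss(0, step)
        vy = momentum * vy + (1 - momentum) * rng.gauss(0, step)
        # Clamp so the crop window stays inside the frame.
        x = min(max(x + vx, 0.0), max_x)
        y = min(max(y + vy, 0.0), max_y)
        boxes.append((round(x), round(y), crop_w, crop_h))
    return boxes

# Two independent walks over the same clip act as pseudo source and target views.
source = smooth_random_walk_crops(16, 640, 360, 320, 180, seed=1)
target = smooth_random_walk_crops(16, 640, 360, 320, 180, seed=2)
```

Because the two walks are independent, the source and target crops are spatially misaligned, which is the property the abstract relies on to stop the model from copying the current source frame.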


[65] From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media cs.CVPDF

Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab

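TL;DR: This study evaluates computer vision methods for analyzing climate communication on social media. Through application-based taxonomy design, model selection, prompt engineering, and validation, it benchmarks six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X: a 1,038-image expert-annotated set and a corpus of over 1.2 million images. Gemini-3.1-flash-lite performs best across all categories and both datasets, with a small gap to moderately sized open-weight models. Distributional evaluation shows that VLM predictions reliably reflect population-level trends even when per-image accuracy is moderate, making them suitable for discourse analysis at scale. Chain-of-thought reasoning reduces performance, whereas prompts designed for specific annotation dimensions improve it.

Details

Motivation: Social media has become the primary platform for climate communication and produces vast amounts of image data; analyzing it systematically can reveal effective strategies for public mobilization. This study evaluates how computer vision methods can be used for social media discourse analysis in order to facilitate such research.

Result: On datasets covering five annotation dimensions (animal content, climate change consequences, climate action, image setting, and image type), Gemini-3.1-flash-lite outperforms the other benchmarked models across all super-categories and both datasets. The gap between moderately sized open-weight models and the best model is relatively small, and distributional evaluation shows that VLM predictions reliably recover population-level trends.

Insight: The study highlights the practical value of distributional evaluation for discourse analysis at scale: reliability on population-level trends matters more than per-image accuracy. It also finds that chain-of-thought reasoning lowers performance while task-specific prompt design improves it, offering practical guidance for applying VLMs in the social sciences.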

Abstract: Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.
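
The contrast between instance-level and distributional evaluation can be made concrete with a toy example. The metric choice (total variation distance) and the labels are assumptions for illustration; the abstract does not name the distributional measure the authors use.

```python
from collections import Counter

def accuracy(true_labels, pred_labels):
    """Fraction of images whose predicted label matches the annotation."""
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return hits / len(true_labels)

def label_distribution(labels, classes):
    """Share of each class in a list of labels."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

def total_variation(p, q):
    """Total variation distance between two distributions over the same classes."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

classes = ["animal", "protest"]
true_labels = ["protest", "animal", "protest", "animal"]
pred_labels = ["animal", "protest", "protest", "animal"]

acc = accuracy(true_labels, pred_labels)
tv = total_variation(label_distribution(true_labels, classes),
                     label_distribution(pred_labels, classes))
```

In this toy case half of the per-image predictions are wrong, yet the predicted class shares match the annotated shares exactly, because the errors cancel out at the population level.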


[66] SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery cs.CV | cs.AIPDF

Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger, Umberto Michelucci

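TL;DR: This paper presents SyMTRS, a large-scale synthetic multi-task benchmark dataset for aerial imagery, designed to address the scarcity of high-quality real annotations for depth estimation, domain adaptation, and super-resolution in remote sensing. Generated with a high-fidelity urban simulation pipeline, the dataset provides high-resolution RGB images, pixel-level depth maps, night-time images for domain adaptation, and aligned multi-scale low-resolution images.

Details

Motivation: Deep learning progress in remote sensing depends heavily on large annotated datasets, but acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks is costly and often infeasible. In particular, monocular depth estimation, domain adaptation, and super-resolution lack accurate depth annotations, controlled illumination variations, and multi-scale paired imagery.

Result: The paper describes the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. It aims to fill critical gaps in remote sensing research by supporting controlled experiments with perfect geometric ground truth and consistent multi-domain supervision.

Insight: The novelty is a unified multi-task benchmark dataset that brings together geometric understanding, cross-domain robustness, and resolution enhancement, overcoming the tendency of existing remote sensing datasets to focus on a single task or modality and providing controlled, consistent supervision for joint research.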

Abstract: Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this Github repository: https://github.com/safouaneelg/SyMTRS.


[67] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval cs.CVPDF

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li

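TL;DR: This paper proposes TEMA (Text-oriented Entity Mapping Architecture) to address the challenges that multi-modification instructions pose for Composed Image Retrieval (CIR). Because existing CIR datasets use simple modification texts, cover too few entities, and are hard to align, the authors construct two multi-modification datasets, M-FashionIQ and M-CIRR, and design the TEMA model. With its "anchor the image, follow the text" strategy, TEMA achieves superior retrieval performance in both original and multi-modification scenarios while remaining computationally efficient.

Details

Motivation: Existing CIR setups typically rely on simple modification texts, which leads to two limitations highly relevant to practical applications: insufficient entity coverage and clause-entity misalignment. Bringing CIR closer to real-world use requires handling more complex text queries that contain multiple modification instructions.

Result: Extensive experiments on four benchmark datasets (including the newly constructed M-FashionIQ and M-CIRR) show that TEMA is superior in both original and multi-modification CIR scenarios while maintaining an optimal balance between retrieval accuracy and computational efficiency.

Insight: The main contributions are: 1) the first CIR datasets for multi-modification instructions (M-FashionIQ and M-CIRR), filling a data gap for this task; 2) TEMA, the first CIR framework designed specifically for multi-modification scenarios, whose core "anchor the image, follow the text" strategy handles complex instructions effectively; 3) a design that also accommodates simple modifications, giving good generality and efficiency. Viewed objectively, extending CIR from simple to complex multi-instruction scenarios and providing a systematic solution (datasets plus model) is the work's key contribution.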

Abstract: Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA’s superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and the constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.


[68] Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos cs.CV | cs.AIPDF

Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao

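TL;DR: This paper introduces a new task, diagnosis-driven capsule endoscopy video summarization, and builds VideoCAP, the first dataset for it. For this task the authors propose DiCE, a framework that mimics the clinician workflow and produces reliable diagnostic summaries through candidate frame screening, context weaving, and evidence aggregation. Experiments show that DiCE outperforms existing methods in both key evidence frame extraction and diagnostic accuracy.

Details

Motivation: Current capsule endoscopy research is largely limited to frame-level classification and detection, leaving video-level analysis underexplored. The paper aims to develop a summarization method that extracts key evidence frames from ultra-long videos and makes accurate diagnoses.

Result: On the proposed VideoCAP dataset, DiCE consistently outperforms state-of-the-art methods on both key evidence frame extraction and diagnosis, achieving SOTA performance.

Insight: The novelty lies in formally defining the diagnosis-driven video summarization task and proposing a contextual reasoning framework that mirrors the clinical workflow (candidate screening, context weaving, evidence aggregation), offering a new paradigm for ultra-long medical videos with sparse events.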

Abstract: Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.


[69] Grounding Video Reasoning in Physical Signals cs.CVPDF

Alibay Osmanli, Zixu Cheng, Shaogang Gong

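TL;DR: This paper proposes a benchmark for physical video understanding that extends the V-STaR evaluation framework across multiple video sources, physics domains, prompt types, and input conditions. It aims to evaluate whether models can localize physical events in time and space, not just recognize them semantically.

Details

Motivation: Existing video understanding models can answer questions about physical events from textual regularities yet fail to localize those events in time or space; physical video understanding has to go beyond naming the event.

Result: Tested on 1,560 video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU, the physics prompt family performs strongest overall, spatial grounding is weakest across all settings, and perturbation gains cluster in cases that were weak on the original input.

Insight: The novelty is a multi-dimensional video reasoning benchmark grounded in physical signals. It argues that video question-answering benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics rather than aggregate accuracy alone.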

Abstract: Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what–when–where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video Q&A reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.


[70] UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection cs.CVPDF

Yanran Zhang, Wenzhao Zheng, Yifei Li, Bingyao Yu, Yu Zheng

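TL;DR: This paper proposes UniGenDet, a unified generative-discriminative framework for co-evolutionary image generation and generated image detection. With a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm, the framework bridges the architectural gap between generation and detection so that the two tasks reinforce each other.

Details

Motivation: Image generation and generated image detection have developed independently, using generative and discriminative architectures respectively, and so differ markedly in architecture. Both, however, use adversarial information to improve performance, which suggests potential for synergy. The paper aims to build a unified framework that lets the two tasks co-evolve.

Result: Extensive experiments on multiple datasets show that the method achieves state-of-the-art performance on both image generation and generated image detection.

Insight: The core innovation is a unified generative-discriminative co-evolution framework that exchanges information between tasks through a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm, together with a detector-informed generative alignment mechanism. The generation task improves the interpretability of authenticity identification, while authenticity criteria guide the generation of higher-fidelity images.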

Abstract: In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/Zhangyr2022/UniGenDet .


[71] Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision cs.CV | cs.IT | q-bio.NCPDF

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin

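TL;DR: By analyzing directional confusion patterns of humans and deep vision models on a natural-image categorization task, this paper reveals systematic differences in their inductive biases. Humans show broad but weak asymmetries, whereas deep vision models show sparse but strong directional collapses, a difference that accuracy alone cannot capture.

Details

Motivation: The goal is to examine systematic differences between the classification error patterns of humans and modern vision models. These differences reflect distinct inductive biases that accuracy alone cannot reveal.

Result: On a natural-image categorization task under 12 perturbation types, the paper quantifies asymmetry in confusion matrices and links it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures: slope (beta), curvature (kappa), and efficiency (AUC). Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity.

Insight: The novelty is the use of directional confusions and rate-distortion geometry as compact, interpretable signatures of inductive bias for analyzing model behavior under distribution shift. Viewed objectively, the approach provides an evaluation framework beyond accuracy that helps explain how generalization differs between human and machine vision.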

Abstract: Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures: slope (beta), curvature (kappa), and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.
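
One simple way to quantify "who gets mistaken for whom, and in which direction" is to compare each confusion-matrix cell with its transpose. The sketch below is an assumed illustration, not the paper's measure: the function name and the normalization by off-diagonal mass are hypothetical choices.

```python
def directional_asymmetry(confusion):
    """Return (antisymmetric part, global asymmetry index) of a confusion matrix.

    confusion[i][j] counts items of true class i that were labeled as class j.
    """
    n = len(confusion)
    # asym[i][j] > 0 means class i is mistaken for j more often than j for i.
    asym = [[confusion[i][j] - confusion[j][i] for j in range(n)] for i in range(n)]
    off_diag = sum(confusion[i][j] for i in range(n) for j in range(n) if i != j)
    imbalance = sum(abs(asym[i][j]) for i in range(n) for j in range(i + 1, n))
    # Assumed normalization: share of all errors that is not mirrored.
    index = imbalance / off_diag if off_diag else 0.0
    return asym, index

confusion = [
    [8, 2, 0],  # class 0 is sometimes called class 1 ...
    [0, 9, 1],  # ... but class 1 is never called class 0
    [0, 1, 9],
]
asym, index = directional_asymmetry(confusion)
```

Here the confusions between classes 1 and 2 are mirrored and cancel, while the confusions from class 0 to class 1 are one-directional, so half of the error mass is asymmetric. Two systems with identical accuracy can differ sharply on such an index.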


[72] When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs cs.CV | cs.AI | cs.CL | cs.LGPDF

Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson

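TL;DR: This paper proposes the HalluScope benchmark to analyze the causes of hallucinations (outputs inconsistent with the visual input) in large vision-language models (LVLMs). It finds that hallucinations stem mainly from over-reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, the authors propose HalluVL-DPO, a framework that fine-tunes existing LVLMs with preference optimization toward more visually grounded responses. It effectively reduces the targeted hallucination mode while preserving or improving performance on other benchmarks.

Details

Motivation: Despite their notable capabilities, LVLMs remain prone to hallucinations. Prior work attributes them to factors such as limitations of the vision backbone or dominance of the language component, but the relative importance of these factors is unclear, so deeper analysis and mitigation methods are needed.

Result: HalluVL-DPO fine-tunes the model with preference optimization on a constructed training dataset. It effectively mitigates the specific hallucination failure mode caused by textual instructions while preserving or improving performance on other hallucination benchmarks and visual capability evaluations.

Insight: The contributions include the HalluScope benchmark for quantifying how different factors affect hallucinations, the finding that textual instruction priors are a major source of hallucinations, and the HalluVL-DPO framework, which strengthens visual grounding through preference-optimization fine-tuning and offers a reusable data-driven approach to mitigating LVLM hallucinations.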

Abstract: Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .
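
For readers unfamiliar with preference optimization, the standard Direct Preference Optimization (DPO) loss for a single preference pair is sketched below. The abstract only says that HalluVL-DPO "leverages preference optimization", so treating its objective as plain DPO is an assumption, and the function and argument names are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (grounded, hallucinated) response pair.

    Inputs are sequence log-probabilities under the policy and a frozen reference
    model. Returns -log(sigmoid(margin)) in a numerically stable form.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy equals the reference: the margin is zero and the loss is log(2).
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Policy already favors the grounded response relative to the reference: lower loss.
favoring = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

Minimizing this loss pushes the policy to raise the likelihood of the grounded response relative to the hallucinated one, measured against the reference model, with `beta` controlling how far it may drift.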


[73] Vista4D: Video Reshooting with 4D Point Clouds cs.CVPDF

Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert

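TL;DR: Vista4D is a video reshooting framework based on 4D point clouds. Given an input video and a new camera trajectory, it re-synthesizes the scene with the same dynamics from a different viewpoint. By building a 4D point cloud representation, it addresses shortcomings of existing methods in depth estimation, content preservation, and camera control, and shows better 4D consistency, camera control, and visual quality than prior work across a variety of videos and camera paths.

Details

Motivation: Existing video reshooting methods often struggle to preserve content appearance on real-world dynamic videos because of depth estimation artifacts, and their camera control is imprecise for challenging new trajectories. Vista4D aims to solve these problems with a 4D point cloud representation, enabling more robust and flexible video reshooting.

Result: Compared with state-of-the-art baselines, Vista4D shows improved 4D consistency, camera control, and visual quality across a variety of videos and camera paths, reaching SOTA level.

Insight: The contributions include building a 4D point cloud representation with static pixel segmentation and 4D reconstruction, which explicitly preserves seen content and provides rich camera signals, and training on reconstructed multi-view dynamic data for robustness to point cloud artifacts during real-world inference. The method generalizes to practical applications such as dynamic scene expansion and 4D scene recomposition.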

Abstract: We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: https://eyeline-labs.github.io/Vista4D


[74] Context Unrolling in Omni Models cs.CVPDF

Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He

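TL;DR: This paper presents Omni, a unified multimodal model natively trained on diverse modalities including text, images, videos, 3D geometry, and hidden representations. Such training enables Context Unrolling, in which the model explicitly reasons across multiple modal representations before predicting. This lets it aggregate complementary information across heterogeneous modalities, approximate the shared multimodal knowledge manifold more accurately, and improve downstream reasoning fidelity.

Details

Motivation: The paper addresses how to integrate information from heterogeneous modalities effectively to improve reasoning in multimodal learning, aiming for explicit cross-modal reasoning through unified training.

Result: Omni performs strongly on multimodal generation and understanding benchmarks and demonstrates advanced reasoning capabilities, including in-context generation of text, images, video, and 3D geometry, reaching SOTA level.

Insight: The novelty is the Context Unrolling mechanism, which lets the model reason explicitly across modalities. This helps approximate the multimodal knowledge manifold more faithfully and offers a new direction for unified training and inference in multimodal models.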

Abstract: We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.


[75] Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs cs.CVPDF

Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang

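TL;DR: This paper proposes IMU-to-4D, a framework for 4D (3D space plus time) human-scene understanding without visual input, using only data from wearable inertial sensors such as those in earbuds, watches, or smartphones. By repurposing large language models to reason about human dynamics and scene layout, it predicts detailed human motion trajectories and coarse 3D scene structure from sparse IMU data.

Details

Motivation: Traditional vision-based approaches to understanding human activity and environments face persistent challenges in privacy, safety, energy efficiency, and scalability. The paper explores a vision-free alternative that achieves 4D perception using only everyday wearable sensors.

Result: Experiments on diverse human-scene datasets show that IMU-to-4D produces more coherent and temporally stable results than existing cascaded pipelines, indicating that wearable motion sensors alone can support rich 4D understanding.

Insight: The main novelty is repurposing large language models for non-visual spatiotemporal understanding and building an end-to-end framework that jointly infers human motion and scene structure directly from sparse IMU data. It demonstrates that 4D perception from inertial sensors alone is feasible, offering a new option for privacy-sensitive or vision-limited settings.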

Abstract: Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.


[76] Seeing Fast and Slow: Learning the Flow of Time in Videos cs.CV | cs.AI | cs.GRPDF

Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi

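TL;DR: This paper studies the perception and control of time in videos. It proposes self-supervised models that detect speed changes and estimate playback speed, and uses them to build the largest slow-motion dataset from real-world videos. On top of this dataset it develops speed-conditioned video generation and temporal super-resolution models, enabling precise manipulation of the flow of time in videos.

Details

Motivation: Although video is a central object of study in computer vision, perceiving and controlling its temporal dimension has long been neglected. The paper treats time as a learnable visual concept and addresses speed change detection, playback speed estimation, and temporally controllable video generation.

Result: The proposed self-supervised models effectively detect speed changes and estimate playback speed, and were used to build the largest slow-motion video dataset to date. Models developed on this dataset perform well on speed-conditioned video generation and temporal super-resolution, generating video at a specified speed or converting low-frame-rate, blurry video into detailed high-frame-rate sequences.

Insight: The novelty is treating time as a learnable, manipulable perceptual dimension and extracting temporal concepts from the multimodal cues and temporal structure of videos through self-supervised learning. This opens new directions for temporally controllable video generation, temporal forensics detection, and world models that understand how events unfold over time. Viewed objectively, using models to automatically curate a high-quality dataset from noisy data is a reusable idea.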

Abstract: How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.


cs.MM [Back]

[77] AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe cs.MM | cs.CV | cs.HCPDF

Adam Cole, Mick Grierson

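TL;DR: AttentionBender is a tool that manipulates the cross-attention mechanism in Video Diffusion Transformers to explore the model's internal workings and generate creative videos. It lets artists apply 2D transforms (such as rotation, scaling, and translation) to cross-attention maps to modulate generation, moving beyond the model's default tendencies and producing novel aesthetic effects.

Details

Motivation: Video generation models produce increasingly realistic output, but prompt-only control limits artists' ability to understand the model's internal material process or to break out of its default generation patterns. The paper aims to develop a tool that helps artists probe the internal mechanics of black-box video generation models and provides creative control beyond prompting.

Result: The study evaluates the tool by visualizing more than 4,500 video generations across prompts, operations, and layer targets. The results indicate that cross-attention is highly entangled: targeted manipulations often fail to give clean, localized control and instead produce distributed distortions and glitch aesthetics rather than linear edits.

Insight: The novelty is extending the idea of Network Bending to the cross-attention mechanism of Video Diffusion Transformers in the form of the AttentionBender tool. It serves both as an Explainable-AI-style probe of transformer attention and as a creative technique for producing novel aesthetics beyond the model's learned representational space, opening new avenues for artistic creation and model understanding.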

Abstract: We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists’ ability to build intuition for the model’s material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model’s learned representational space.
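
To make "2D transforms applied to cross-attention maps" concrete, the sketch below treats one token's attention over a spatial grid as a small 2D array and applies a translation and a rotation. The helper names, the wrap-around translation, and the renormalization step are assumptions for illustration; the abstract does not describe how AttentionBender hooks into the model or handles borders.

```python
def translate(attn, dx, dy):
    """Shift a 2D attention map by (dx, dy) cells with wrap-around (assumed border rule)."""
    h, w = len(attn), len(attn[0])
    return [[attn[(r - dy) % h][(c - dx) % w] for c in range(w)] for r in range(h)]

def rotate90(attn):
    """Rotate a 2D attention map by 90 degrees clockwise."""
    return [list(row) for row in zip(*attn[::-1])]

def renormalize(attn):
    """Rescale so the map sums to 1 again (an assumed post-processing step)."""
    total = sum(sum(row) for row in attn)
    if not total:
        return attn
    return [[v / total for v in row] for row in attn]

# A token attending only to the top-left cell of a 2x2 spatial grid.
attn = [[1.0, 0.0],
        [0.0, 0.0]]
shifted = translate(attn, dx=1, dy=0)  # attention moves one cell to the right
rotated = rotate90(attn)               # top-left moves to top-right
```

In a real model such a transform would be applied to the attention weights of chosen layers during sampling; the abstract's finding is that the visible effect is rarely this localized.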


cs.AI [Back]

[78] Ideological Bias in LLMs’ Economic Causal Reasoning cs.AI | cs.CE | cs.CL | cs.LG | econ.GNPDF

Donggyu Lee, Hyeok Yun, Jungwon Kim, Junsik Min, Sungwon Park

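TL;DR: By extending the EconCausal benchmark, this paper evaluates whether large language models show systematic ideological bias in economic causal reasoning. It finds that models are less accurate on ideology-contested questions and that their incorrect predictions lean in the interventionist direction.

Details

Motivation: The study asks whether LLMs exhibit systematic ideological bias in economic causal reasoning, since LLMs are being used for policy analysis and economic reporting, where directionally correct causal judgments are essential.

Result: For 18 of 20 state-of-the-art LLMs, accuracy is higher when the empirically verified causal sign aligns with interventionist expectations. Ideology-contested items are harder than non-contested ones, incorrect predictions lean interventionist, and one-shot in-context prompting does not eliminate this skew.

Insight: The novelty is adding ideology-contested cases to an economic causal benchmark for systematic evaluation. This reveals both reduced accuracy and a directional bias on ideologically contested economic questions, underscoring the need for direction-aware evaluation in high-stakes economic policy settings.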

Abstract: Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
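
A direction-aware evaluation of the kind the abstract calls for can be sketched on toy data. The record fields (`truth`, `pred`, `interventionist_sign`), the two-valued sign encoding, and the four example items are all hypothetical; they are not drawn from EconCausal.

```python
def accuracy_by_direction(items):
    """Accuracy on contested items, split by which perspective the evidence supports."""
    buckets = {"intervention_aligned": [0, 0], "market_aligned": [0, 0]}
    for it in items:
        aligned = it["truth"] == it["interventionist_sign"]
        key = "intervention_aligned" if aligned else "market_aligned"
        buckets[key][0] += it["pred"] == it["truth"]
        buckets[key][1] += 1
    return {k: (hits / n if n else None) for k, (hits, n) in buckets.items()}

def error_skew(items):
    """Share of wrong predictions that match the interventionist expectation."""
    wrong = [it for it in items if it["pred"] != it["truth"]]
    if not wrong:
        return None
    return sum(it["pred"] == it["interventionist_sign"] for it in wrong) / len(wrong)

# Toy contested items: the two perspectives predict opposite signs.
items = [
    {"truth": "+", "pred": "+", "interventionist_sign": "+"},
    {"truth": "+", "pred": "+", "interventionist_sign": "+"},
    {"truth": "-", "pred": "+", "interventionist_sign": "+"},
    {"truth": "-", "pred": "-", "interventionist_sign": "+"},
]
acc = accuracy_by_direction(items)
skew = error_skew(items)
```

Reporting the two accuracies and the error skew separately is what distinguishes "less accurate on contested items" from "less reliable in one ideological direction", which a single aggregate accuracy would hide.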


[79] Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning cs.AI | cs.CL | cs.CVPDF

Mohit Vaishnav, Tanel Tammet

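TL;DR: This paper studies the bottleneck in abstract visual reasoning by comparing vision-language models (VLMs) on raw images with large language models (LLMs) given symbolic inputs. On the Bongard-LOGO benchmark, LLMs with symbolic inputs reach above 90% accuracy, while a pixel-based visual model stays near chance, indicating that representation, not reasoning, is the key bottleneck.

Details

Motivation: To determine why VLMs fail on abstract visual reasoning benchmarks such as Bongard problems: whether the bottleneck lies in reasoning or in representation.

Result: On Bongard-LOGO, LLMs given symbolic inputs (LOGO-style action programs or structured descriptions) exceed 90% accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions.

Insight: The paper proposes the Componential-Grammatical (C-G) paradigm, which recasts abstract visual reasoning as symbolic reasoning and uses symbolic inputs as a diagnostic probe to expose the representation bottleneck. The ablations show that the shift from pixels to symbolic structure matters far more than input format, explicit concept prompts, or minimal visual grounding, which makes symbolic input a controlled diagnostic upper bound for evaluating model capability.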

Abstract: Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.


[80] ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs cs.AI | cs.CL

Jian Cui, Zhiyuan Ren, Desheng Weng, Yongqi Zhao, Gong Wenbin

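TL;DR: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models. It targets the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases: workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, recasts coordinate prediction as text generation, and adds a Chain-of-Thought mechanism to strengthen reasoning over spatial relationships. Reinforcement learning with a distance-deviation-based reward further optimizes generation accuracy. Experiments show that ReaGeo accurately handles explicit address queries in single-point prediction, resolves vague relative-location queries, and predicts non-point geometric regions well, which points to its versatility and generalization in geocoding tasks.

Details

Motivation: To replace traditional geocoding pipelines, whose multi-stage design, error propagation, and reliance on structured knowledge bases make them complex and limited, with an end-to-end solution.

Result: Experiments show that the model accurately handles explicit address queries and vague relative-location queries, and performs well when predicting non-point geometric regions.

Insight: Recasting coordinate prediction as text generation, combining geohash serialization with Chain-of-Thought reasoning over spatial relationships, and optimizing accuracy with a distance-deviation-based reinforcement learning reward yields an end-to-end, reasoning-capable LLM approach to geocoding.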

Abstract: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model’s reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
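To make the geohash reformulation concrete, the sketch below implements the standard geohash encoding and decoding, plus a distance-based reward. The encoder and decoder follow the public geohash scheme; the reward shape (`exp(-distance / scale_km)`), the `scale_km` parameter, and the zero reward for malformed output are assumptions for illustration, since the abstract does not specify ReaGeo's reward.

```python
import math

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=8):
    """Interleave longitude and latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        if even:  # even-numbered bits refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:     # odd-numbered bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def geohash_decode(code):
    """Return the center (lat, lon) of the cell named by a geohash string."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    even = True
    for ch in code:
        value = _BASE32.index(ch)  # raises ValueError on invalid characters
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if even:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            even = not even
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distance_reward(pred_code, true_lat, true_lon, scale_km=1.0):
    """Hypothetical reward in (0, 1]: closer predictions score higher."""
    try:
        lat, lon = geohash_decode(pred_code)
    except ValueError:
        return 0.0  # malformed generation earns no reward (an assumption)
    return math.exp(-haversine_km(lat, lon, true_lat, true_lon) / scale_km)

print(geohash_encode(42.6, -5.6, precision=5))
```

Because each additional geohash character narrows the cell, a generated prefix is already a coarse location, which is one reason the encoding suits left-to-right text generation.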


[81] GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion cs.AI | cs.CL

Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin

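TL;DR: This paper proposes GS-Quant, a framework that uses a Granular Semantic Enhancement module and a Generative Structural Reconstruction module to generate hierarchically structured, semantically coherent discrete codes for knowledge graph entities. The codes bridge the modality gap between graph embeddings and LLM tokens and improve LLM performance on knowledge graph completion.

Details

Motivation: Existing quantization methods for knowledge graph completion treat quantization as flat numerical compression. The resulting codes are semantically entangled and do not reflect the hierarchical nature of human reasoning, so they bridge the gap between continuous graph embeddings and discrete LLM tokens poorly.

Result: Experiments show that GS-Quant significantly outperforms existing text-based and embedding-based baselines on knowledge graph completion.

Insight: The core idea is to align entity representations with the coarse-to-fine logic of language. The Granular Semantic Enhancement module injects hierarchical knowledge into the codebook, and the Generative Structural Reconstruction module imposes causal dependencies on the code sequence. Together they turn independent discrete units into structured semantic descriptors, so the model reasons over graph structure in a way isomorphic to natural language generation.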

Abstract: Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.


[82] Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems cs.AI | cs.CL | cs.MA

Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang

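TL;DR: This paper proposes DiffMAS, a framework that treats latent communication as a learnable component of multi-agent systems, enabling end-to-end optimization of multi-agent language systems. It consistently improves reasoning accuracy and decoding stability across mathematical reasoning, scientific QA, code generation, and commonsense benchmarks.

Details

Motivation: Existing LLM-based multi-agent systems focus on agent roles and orchestration and treat inter-agent communication as a fixed interface, without jointly optimizing communication and reasoning.

Result: DiffMAS reaches 26.7% accuracy on AIME24 and 20.2% on GPQA-Diamond, and consistently outperforms single-agent inference, text-based multi-agent systems, and prior latent communication methods across reasoning benchmarks.

Insight: The contribution is to optimize latent communication (for example, through key-value caches) end to end as a learnable component. Parameter-efficient supervised training lets agents jointly learn how information is encoded and interpreted, which improves the system as a whole.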

Abstract: Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.


eess.IV

[83] DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction eess.IV | cs.CV

Shiyan Su, Ruyi Zha, Danli Shi, Hongdong Li, Xuelian Cheng

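TL;DR: This paper proposes DiffNR, a framework that enhances neural representation optimization with diffusion priors to address the severe artifacts that neural representations (such as neural fields and 3D Gaussians) produce in sparse-view CT reconstruction. Its core component, SliceFixer, is a single-step diffusion model that corrects artifacts in degraded slices and generates pseudo-reference volumes that provide 3D perceptual supervision to fix underconstrained regions.

Details

Motivation: Neural representations (NRs) suffer severe artifacts in CT reconstruction under sparse-view settings, and prior methods that embed CT solvers into time-consuming iterative denoising are inefficient.

Result: In extensive experiments, DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.

Insight: The repair-and-augment strategy has the single-step diffusion model SliceFixer periodically generate pseudo-reference volumes for 3D perceptual supervision, which avoids frequent diffusion model queries and improves runtime. Reusable ideas include the specialized conditioning layers, the tailored data curation strategies, and the efficient integration of diffusion priors into neural representation optimization.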

Abstract: Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.


cs.CR

[84] Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR | cs.AI | cs.CL

Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, John D. Kelleher

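TL;DR: This paper presents a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force invocation of an attacker-chosen function, exposing a new vulnerability in function-calling large language models (LLMs). The attack is largely agnostic to context semantics, robust to the function set, and can be trained to produce universal adversarial functions. Experiments on 5 models achieve attack success rates of 70% to 100%.

Details

Motivation: As agentic AI grows, function-calling LLMs extend AI systems by invoking external functions, but they also introduce new vulnerabilities. Existing attacks focus on the model's semantic preference in function-calling tasks. This paper explores a more general, semantics-agnostic attack to expose risks in the tool selection process of agentic systems.

Result: Experiments on 5 models, including instructed and reasoning variants, reach an attack success rate (ASR) of 70% to 100% on the established BFCL dataset.

Insight: The contribution is a function hijacking attack that is largely agnostic to context semantics and robust to the function set, and that can be trained to produce universal adversarial functions that hijack tool selection across multiple queries and payload configurations. The findings highlight the need for strong guardrails and security modules in agentic systems.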

Abstract: The growth of agentic AI has drawn significant attention to function-calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function-calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on the semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.


[85] Adaptive Instruction Composition for Automated LLM Red-Teaming cs.CR | cs.AI | cs.CL | cs.LG

Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa, Emily Chen

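TL;DR: This paper proposes Adaptive Instruction Composition, a framework for automated LLM red-teaming that combines crowdsourced texts according to an adaptive mechanism to jointly optimize attack effectiveness and diversity.

Details

Motivation: Existing LLM red-teaming methods either yield a semantically limited range of successes or combine queries and tactics at random, which limits effectiveness. This paper composes instructions adaptively to improve both diversity and effectiveness.

Result: The method substantially outperforms random combination on effectiveness and diversity metrics, even under model transfer, and surpasses a range of recent adaptive approaches on Harmbench.

Insight: The contribution is to use reinforcement learning to balance exploration and exploitation in the combinatorial space of instructions, with a lightweight neural contextual bandit over contrastive embedding inputs. Ablations suggest that contrastive pretraining lets the network generalize quickly and scale to the large space.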

Abstract: Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.


[86] CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents cs.CR | cs.CL

Wenjie Fu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Lukas Wutschitz

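TL;DR: This paper introduces CI-Work, a benchmark that evaluates how well enterprise LLM agents protect privacy when handling information in workflows. It finds that privacy leakage is prevalent among frontier models and that task utility and privacy violations exhibit a counterintuitive trade-off.

Details

Motivation: Enterprise LLM agents improve productivity, but their core capability of retrieving and using internal context creates new risks of sensitive information leakage that need to be measured and addressed.

Result: On CI-Work, frontier models show privacy violation rates from 15.8% to 50.9%, with leakage reaching up to 26.7%, and higher task utility often correlates with more privacy violations.

Insight: The paper builds a benchmark grounded in Contextual Integrity theory to quantify privacy risk in enterprise workflows. It argues that increasing model size or reasoning depth alone does not solve the problem and that a shift toward context-centric architectures is needed.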

Abstract: Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user’s behalf, also creates new risks for sensitive information leakage. We introduce CI-Work, a Contextual Integrity (CI)-grounded benchmark that simulates enterprise workflows across five information-flow directions and evaluates whether agents can convey essential content while withholding sensitive context in dense retrieval settings. Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8% to 50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade-off critical for industrial deployment: higher task utility often correlates with increased privacy violations. Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model-centric scaling toward context-centric architectures.


[87] Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers cs.CR | cs.AI | cs.CL

Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang

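TL;DR: This paper presents BadStyle, a stealthy backdoor attack framework against large language models (LLMs). It uses an LLM to generate poisoned samples that carry natural style-level triggers and designs an auxiliary target loss to stabilize payload injection, achieving high attack success rates and strong stealthiness across multiple injection strategies and models.

Details

Motivation: Existing LLM backdoor attacks have three key shortcomings: conspicuous trigger patterns, unreliable payload injection in long-form generation, and incompletely specified threat models. The goal is a complete attack framework that is more natural, more stable, and closer to practical threats.

Result: Extensive experiments on seven victim LLMs, including the LLaMA, Phi, DeepSeek, and GPT series, show that BadStyle achieves high attack success rates (ASRs) while remaining stealthy. The auxiliary target loss raises the average ASR of style-level triggers by about 30%, and the implanted backdoor remains effective in unseen downstream scenarios and evades representative input-level and output-level defenses.

Insight: The contribution is to use an LLM itself to generate naturally styled poisoned samples as stealthy triggers and to stabilize attack behavior with an auxiliary target loss. The work also makes the threat model concrete (prompt-induced and PEFT-based injection) and systematically evaluates robustness across models and scenarios, giving backdoor attack and defense research a more realistic reference point.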

Abstract: The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.


cs.LG

[88] The Path Not Taken: Duality in Reasoning about Program Execution cs.LG | cs.AI | cs.CL | cs.PL | cs.SE

Eshgin Hasanov, Md Mahadi Hassan Sibat, Santu Karmaker, Aashish Yadavally

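TL;DR: This paper introduces the notion of duality in understanding program execution and builds DexBench, a benchmark that evaluates LLMs on dynamic code reasoning through two complementary tasks, predicting program behavior and inferring input mutations, to measure a model's causal understanding of execution flow.

Details

Motivation: Existing benchmarks mainly predict program properties tied to specific inputs (such as code coverage or program outputs). This gives a narrow view of dynamic code reasoning and is prone to data contamination, so a more comprehensive evaluation of execution understanding is needed.

Result: Thirteen LLMs are evaluated on DexBench, which comprises 445 paired instances. The results show that dual-path reasoning is a robust and discriminative proxy for dynamic code understanding.

Insight: The contribution is a duality framework for program execution understanding: predicting a program's behavior and inferring input mutations jointly probe a model's causal understanding of execution flow, going beyond single-path evaluation.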

Abstract: Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program’s observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model’s causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
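The two complementary tasks can be illustrated on a toy program. The sketch below is an assumption-laden illustration, not DexBench itself: the program `classify`, the checker names, and the rule that a paired instance counts only when both directions are answered correctly are all hypothetical.

```python
def classify(n):
    """Toy program under test; the returned label names the path taken."""
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"

def check_forward(program, given_input, predicted_behavior):
    """Task (i): does the predicted behavior match what the program does?"""
    return program(given_input) == predicted_behavior

def check_inverse(program, original_input, mutated_input, target_behavior):
    """Task (ii): does the proposed mutation actually change the input
    and steer execution to the target behavior?"""
    return (mutated_input != original_input
            and program(mutated_input) == target_behavior)

def solved_pair(program, instance, answer):
    """Hypothetical scoring rule: both directions must be correct."""
    forward_ok = check_forward(program, instance["input"], answer["behavior"])
    inverse_ok = check_inverse(program, instance["input"],
                               answer["mutated_input"], instance["target"])
    return forward_ok and inverse_ok

instance = {"input": 3, "target": "even"}
good_answer = {"behavior": "odd", "mutated_input": 4}
bad_answer = {"behavior": "odd", "mutated_input": 5}  # still takes the odd path
print(solved_pair(classify, instance, good_answer),
      solved_pair(classify, instance, bad_answer))
```

Both checks run the program, so an answer is verified by execution and not by string matching against a memorized output.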


[89] How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models cs.LG | cs.CL

Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis

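TL;DR: Through 116 pretraining runs, this paper studies how the recurrence count r affects the performance of looped language models and fits a joint scaling law. It finds a recurrence-equivalence exponent φ = 0.46, which lies between full equivalence to unique blocks (φ = 1) and no capacity gain (φ = 0). Looped models underperform non-looped models at matched training compute, and the downstream gap varies by task type.

Details

Motivation: To quantify how much one extra recurrence is worth in equivalent unique parameters, that is, how close reusing a looped block comes to adding unique parameter blocks, in order to inform model design and training-cost estimates.

Result: Fitting the joint scaling law gives φ = 0.46 (R² = 0.997), so looped models predictably incur higher validation loss than non-looped models at matched training compute. For example, at r = 4 a 410M looped model performs on par with a 580M non-looped model but pays the training cost of a 1B non-looped one. Downstream, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while differences on reasoning tasks cannot be resolved at the compute budgets studied.

Insight: The recurrence-equivalence exponent φ quantifies the value of recurrence and converts the design choice of r into a predictable validation-loss cost. It can also serve as a yardstick: future training recipes and architectures can be compared by how far they raise φ above 0.46.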

Abstract: We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
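The scaling law in this abstract can be evaluated directly. The sketch below is a minimal illustration under stated assumptions: the abstract reports only φ = 0.46, so the constants `E`, `A`, `B`, `alpha`, `beta` and the parameter split between once-run and recurrent blocks must be supplied by the caller, and the values used in the demonstration are placeholders, not fitted numbers.

```python
def effective_params(n_once, n_rec, r, phi=0.46):
    """Effective unique-parameter count: N_once + r**phi * N_rec."""
    return n_once + (r ** phi) * n_rec

def predicted_loss(n_once, n_rec, r, tokens, E, A, B, alpha, beta, phi=0.46):
    """Joint scaling law L = E + A * N_eff**(-alpha) + B * D**(-beta)."""
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * tokens ** (-beta)

# How many unique blocks is a block looped r times worth at phi = 0.46?
for r in (1, 2, 4, 8):
    print(r, round(r ** 0.46, 3))

# Placeholder constants, chosen only to exercise the function.
demo = dict(tokens=1e10, E=1.5, A=400.0, B=900.0, alpha=0.3, beta=0.3)
loss_r1 = predicted_loss(n_once=1e8, n_rec=3e8, r=1, **demo)
loss_r4 = predicted_loss(n_once=1e8, n_rec=3e8, r=4, **demo)
```

At φ = 0.46, four passes through a block count as roughly 4^0.46 ≈ 1.89 unique blocks, well short of the 4 that full equivalence (φ = 1) would give, while the training compute still scales with all four passes.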


[90] Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness cs.LG | cs.CL | stat.ME

Zihan Liang, Ziwen Pan, Ruoxuan Xiong

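TL;DR: This paper proposes a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. It combines a multimodal encoder, a Bayesian filtering module, and downstream task modules to learn a dynamic patient state from sparse, irregular records of structured measurements and clinical notes, and uses that state for offline treatment policy learning and adverse outcome prediction.

Details

Motivation: Multimodal clinical records (structured measurements and clinical notes) are sparse over time, whether an observation is recorded depends on the patient's latent condition, and recording patterns differ across modalities. Existing methods accommodate missingness in clinical time series, but how to extract and use the information carried by the observation process itself remains underexplored.

Result: The framework is evaluated on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU and improves both offline treatment policy learning and adverse outcome prediction. On MIMIC-III it achieves FQE 0.679 (versus 0.528 for clinician behavior) and AUROC 0.886 for post-72-hour mortality prediction.

Insight: The contribution is to treat observation patterns (informative missingness) explicitly as a signal in multimodal patient representation learning and to update the latent patient state over time with Bayesian filtering. This offers a way to handle the sparse, irregular, state-dependent observations common in clinical data.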

Abstract: Multimodal clinical records contain structured measurements and clinical notes recorded over time, offering rich temporal information about the evolution of patient health. Yet these observations are sparse, and whether they are recorded depends on the patient’s latent condition. Observation patterns also differ across modalities, as structured measurements and clinical notes arise under distinct recording processes. While prior work has developed methods that accommodate missingness in clinical time series, how to extract and use the information carried by the observation process itself remains underexplored. We therefore propose a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. The framework combines (1) a multimodal encoder that captures signals from structured and textual data together with their observation patterns, (2) a Bayesian filtering module that updates a latent patient state over time from observed multimodal signals, and (3) downstream modules for offline treatment policy learning and patient outcome prediction based on the learned patient state. We evaluate the framework on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU. It improves both offline treatment policy learning and adverse outcome prediction, achieving FQE 0.679 versus 0.528 for clinician behavior and AUROC 0.886 for post-72-hour mortality prediction on MIMIC-III.
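The idea that the observation process itself carries information can be shown with a toy discrete Bayesian filter. This is only a sketch of the principle: the paper's framework uses a learned multimodal encoder, whereas the two-state model, the transition matrix, and the recording probabilities below are invented for illustration.

```python
def filter_step(belief, transition, p_recorded, was_recorded):
    """One predict-update step where the only evidence is whether a
    measurement was recorded at this time step.

    belief       -- current probabilities over latent states
    transition   -- transition[i][j] = P(next state j | current state i)
    p_recorded   -- p_recorded[j] = P(measurement recorded | state j)
    was_recorded -- True if a measurement appears at this step
    """
    n = len(belief)
    # Predict: push the belief through the state dynamics.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Update: the recording mask acts as the observation likelihood.
    likelihood = [p if was_recorded else 1.0 - p for p in p_recorded]
    unnormalized = [pr * lk for pr, lk in zip(predicted, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical two-state example: index 0 = stable, 1 = deteriorating.
transition = [[0.9, 0.1], [0.2, 0.8]]
p_recorded = [0.2, 0.8]  # sicker patients are assumed to be measured more often
belief = [0.5, 0.5]
for was_recorded in [True, True, False]:
    belief = filter_step(belief, transition, p_recorded, was_recorded)
    print([round(b, 3) for b in belief])
```

Even without looking at any measured value, two consecutive recordings shift the belief toward the deteriorating state, which is the sense in which missingness is informative.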


[91] Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning cs.LG | cs.AI | cs.CL

Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang

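TL;DR: This paper studies spurious signal amplification caused by pseudo-label noise in test-time reinforcement learning (TTRL) for math reasoning and proposes DDRL, a unified debiasing and denoising framework. DDRL excludes ambiguous samples through frequency-based sampling, uses a debiased advantage estimate with fixed advantages, and adds consensus-based off-policy refinement. It consistently outperforms existing TTRL baselines on multiple math reasoning benchmarks.

Details

Motivation: TTRL adapts models at inference time via pseudo-labels, which leaves it vulnerable to spurious optimization signals from label noise. Responses with medium consistency form an ambiguity region that is the primary source of reward noise, and group-relative advantage estimation can amplify these spurious signals.

Result: Experiments on three large language models across multiple mathematical reasoning benchmarks show that DDRL consistently outperforms existing TTRL baselines.

Insight: The paper identifies the mechanisms behind spurious signal amplification in TTRL (the ambiguity region and the bias of group-relative advantage estimation) and combines frequency-based sampling, debiased advantage estimation, and consensus-based off-policy refinement in one framework to make TTRL more robust to noisy pseudo-labels.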

Abstract: Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
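The contrast between group-relative and fixed advantages can be sketched as below. This is one possible reading of the abstract, not the released DDRL code: the thresholds `low` and `high` that bound the ambiguity region, the ±1 advantage values, and the choice to skip a whole prompt when its consistency is ambiguous are all assumptions.

```python
from collections import Counter
import statistics

def pseudo_label(answers):
    """Majority-vote pseudo-label and its consistency (vote share)."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)

def group_relative_advantages(rewards):
    """Standard group normalization: (reward - mean) / std."""
    mean = sum(rewards) / len(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def fixed_advantages(answers, low=0.3, high=0.7, pos=1.0, neg=-1.0):
    """Skip prompts whose consistency falls in the (hypothetical)
    ambiguity region; otherwise assign fixed-magnitude advantages."""
    label, consistency = pseudo_label(answers)
    if low <= consistency <= high:
        return None  # ambiguous pseudo-label: exclude from training
    return [pos if a == label else neg for a in answers]

answers = ["4"] * 7 + ["5"]
label, consistency = pseudo_label(answers)
rewards = [1.0 if a == label else 0.0 for a in answers]
print(group_relative_advantages(rewards))
print(fixed_advantages(answers))
print(fixed_advantages(["4", "4", "5", "6"]))
```

With seven of eight responses agreeing, group normalization gives the lone dissenting response an advantage of about -2.65 against +0.38 for each majority response, so a wrong pseudo-label would penalize the dissenter disproportionately. Fixed advantages keep every response at the same magnitude.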


[92] ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response cs.LG | cs.CV

Stephan Xie, Ben Cohen, Mononito Goswami, Junhong Shen, Emaad Khwaja

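TL;DR: This paper presents ARFBench, a benchmark for evaluating the time series question-answering (TSQA) ability of foundation models in software incident response. It contains 750 questions over 142 time series and 5.38M data points drawn from internal telemetry at Datadog. The study evaluates leading proprietary and open-source LLMs, VLMs, and time series FMs and finds that frontier VLMs perform markedly better than existing baselines; the leading model, GPT-5, achieves 62.7% accuracy and 51.9% F1. The authors also develop a TSFM + VLM hybrid prototype that, after post-training on a small set of synthetic and real data, is comparable to frontier models. Finally, models and human experts show complementary strengths: a model-expert oracle that selects the best answer reaches 82.8% F1 and 87.2% accuracy, setting a superhuman frontier for future TSQA models.

Details

Motivation: To evaluate foundation models on TSQA, an underexplored capability, with a focus on understanding the time series anomalies prevalent in software incident data.

Result: On ARFBench, the leading frontier VLM (GPT-5) achieves 62.7% accuracy and 51.9% F1, markedly better than existing baselines. The post-trained TSFM + VLM hybrid prototype is comparable to frontier models, and the model-expert oracle reaches 82.8% F1 and 87.2% accuracy.

Insight: The contributions are: (1) ARFBench, a TSQA benchmark focused on software incident response; (2) evidence of the potential of frontier VLMs on TSQA; (3) a TSFM + VLM hybrid that reaches frontier-level performance after post-training on a small amount of data; and (4) the finding that models and human experts are complementary, with a best-of-2 oracle over their answers marking a superhuman frontier for TSQA.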

Abstract: Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.
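The best-of-2 oracle is simple to state in code. The sketch below assumes per-question correctness flags for the model and the expert; the function name and the toy flags are hypothetical, and the reported F1 would additionally require the oracle-selected answers themselves.

```python
def oracle_accuracy(model_correct, expert_correct):
    """Best-of-2 oracle: a question counts as solved when either the
    model's answer or the expert's answer is correct."""
    if len(model_correct) != len(expert_correct):
        raise ValueError("both lists must cover the same questions")
    solved = [m or e for m, e in zip(model_correct, expert_correct)]
    return sum(solved) / len(solved)

# Toy correctness flags, invented for illustration only.
model_correct = [True, False, True, False]
expert_correct = [False, True, True, False]

print(sum(model_correct) / 4, sum(expert_correct) / 4,
      oracle_accuracy(model_correct, expert_correct))
```

The oracle can exceed both individual accuracies only when the two make different mistakes, which is why a large oracle gain is evidence of complementary strengths.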


[93] Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG | cs.AI | cs.CV

Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang

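TL;DR: This paper proposes a learnable Propose-then-Critic framework for graphical user interface (GUI) grounding. A proposer generates multiple candidate coordinates, and a visual critic evaluates the proposals rendered on the screenshot to select the best target. To optimize the two jointly, the authors introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm that dynamically balances their training objectives so that the two capabilities reinforce each other.

Details

Motivation: In GUI grounding, models usually grasp semantic intent but struggle with precise localization because of visually homogeneous elements and dense layouts. Static self-consistency strategies (such as those based on geometric clustering) bring limited gains because the model's predictions tend to be spatially dispersed.

Result: Extensive experiments over 6 benchmarks show that the method significantly improves both grounding accuracy and critic reliability.

Insight: The main contribution is to replace static consistency strategies with a learnable selection mechanism based on visual critique, within a co-evolving propose-then-critic framework. The maturity-aware adaptive co-evolutionary reinforcement learning paradigm balances the diversity of the proposer's exploration against the maturity of the critic's discrimination, so each strengthens the other and the model generalizes better to diverse, complex interface layouts.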

Abstract: Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model’s predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model’s grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize these, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of proposer and critic, where the diversity of the proposer’s outputs enhances critic robustness, while the critic’s maturing discrimination capability conversely unlocks the proposer’s potential for extensive spatial exploration, fostering the mutual reinforcement and co-evolution of both capabilities, thereby ensuring generalizability to adapt to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.


[94] Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair cs.LG | cs.AI | cs.CV

Vishal Rajput

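TL;DR: This paper proves that supervised learning has an inherent geometric blind spot: any encoder trained by empirical risk minimization (ERM) must retain non-zero Jacobian sensitivity in directions that are label-correlated in training data but nuisance at test time. This is not a contingent flaw of current methods but a mathematical consequence of the supervised objective. The paper introduces the Trajectory Deviation Index (TDI) to quantify the blind spot, shows that it appears across a range of vision and language models, and repairs it with PMH, an additional training term with a Gaussian form.

Details

Motivation: To expose a fundamental theoretical limitation of supervised learning (ERM in particular): it forces learned representations to stay sensitive to directions that are label-correlated in training data but useless at test time, which underlies empirically observed failure modes such as adversarial vulnerability, texture bias, corruption fragility, and the robustness-accuracy tradeoff.

Result: Theorem 1 proves that the geometric blind spot is necessary. Empirically, the blind spot is measured across seven vision tasks, BERT/SST-2, and the ImageNet ViT-B/16 backbones used by CLIP, DINO, and SAM. PGD adversarial training reaches a Jacobian Frobenius norm of 2.91 yet has the worst clean-input geometry (TDI 1.336), while PMH achieves TDI 0.904. The blind-spot ratio worsens monotonically as language models grow (0.860 to 0.765 to 0.742 from 66M to 340M parameters), task-specific ERM fine-tuning amplifies the blind spot by 54%, and PMH repairs it by 11x with one additional training term.

Insight: The core contribution is a single theoretical account of several empirical observations that were previously treated separately (non-robust predictive features, texture bias, corruption fragility, and the robustness-accuracy tradeoff) as consequences of a geometric constraint inherent in the supervised objective. TDI is a diagnostic that directly measures the quantity the theorem bounds. PMH targets the root of the blind spot with a minimal repair: a Gaussian-form perturbation term, which Proposition 5 proves is the unique perturbation law that uniformly penalises the encoder Jacobian.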

Abstract: We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises supervised loss must retain non-zero Jacobian sensitivity in directions that are label-correlated in training data but nuisance at test time. This is not a contingent failure of current methods; it is a mathematical consequence of the supervised objective itself. We call this the geometric blind spot of supervised learning (Theorem 1), and show it holds across proper scoring rules, architectures, and dataset sizes. This single theorem unifies four lines of prior empirical work that were previously treated separately: non-robust predictive features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. In this framing, adversarial vulnerability is one consequence of a broader structural fact about supervised learning geometry. We introduce Trajectory Deviation Index (TDI), a diagnostic that measures the theorem’s bounded quantity directly, and show why common alternatives miss the key failure mode. PGD adversarial training reaches Jacobian Frobenius 2.91 yet has the worst clean-input geometry (TDI 1.336), while PMH achieves TDI 0.904. TDI is the only metric that detects this dissociation because it measures isotropic path-length distortion – the exact quantity Theorem 1 bounds. Across seven vision tasks, BERT/SST-2, and ImageNet ViT-B/16 backbones used by CLIP, DINO, and SAM, the blind spot is measurable and repairable. It is present at foundation-model scale, worsens monotonically across language-model sizes (blind-spot ratio 0.860 to 0.765 to 0.742 from 66M to 340M), and is amplified by task-specific ERM fine-tuning (+54%), while PMH repairs it by 11x with one additional training term whose Gaussian form Proposition 5 proves is the unique perturbation law that uniformly penalises the encoder Jacobian.
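The link between a Gaussian perturbation and a uniform Jacobian penalty can be checked numerically on a linear encoder. The sketch below illustrates only the standard identity E‖f(x + σε) − f(x)‖² = σ²‖W‖_F² for f(x) = Wx with ε ~ N(0, I), which holds to first order for smooth nonlinear encoders. The abstract does not give PMH's exact form, so this is not an implementation of PMH; the matrix `W`, the input `x`, and `sigma` are arbitrary illustrative values.

```python
import random

def frobenius_sq(W):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(w * w for row in W for w in row)

def matvec(W, x):
    """Apply the linear encoder f(x) = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gaussian_perturbation_penalty(W, x, sigma, n_samples, rng):
    """Monte Carlo estimate of E||f(x + sigma*eps) - f(x)||^2, eps ~ N(0, I)."""
    base = matvec(W, x)
    total = 0.0
    for _ in range(n_samples):
        noisy = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        out = matvec(W, noisy)
        total += sum((o - b) ** 2 for o, b in zip(out, base))
    return total / n_samples

W = [[1.0, 2.0], [0.0, -1.0]]  # here the Jacobian of f is W itself
x = [0.3, -0.7]
sigma = 0.1
rng = random.Random(0)         # fixed seed for a reproducible estimate

estimate = gaussian_perturbation_penalty(W, x, sigma, 20000, rng)
exact = sigma ** 2 * frobenius_sq(W)
print(estimate, exact)
```

Because isotropic Gaussian noise weights every input direction equally, the expected output change depends on the Jacobian only through its Frobenius norm, so no direction is penalized more than another.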


[95] Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning cs.LG | cs.AI | cs.CL

Hanjun Cho, Gahyun Yoo, Hanseong Kim, Jay-Yoon Lee

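TL;DR: The paper proposes TaNOS, a framework that improves the generalization of numerical reasoning over tabular data through operation sketches and self-supervised learning. It has three components: header anonymization, operation sketches that provide structural cues, and program-first self-supervised pretraining. Together they aim to decouple domain semantics from numerical operation structure and thereby make the model more robust under domain shift.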

Details

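Motivation: Existing models for numerical reasoning over expert-domain tables rely too heavily on header-operation shortcuts and generalize poorly, with performance dropping sharply under domain shift.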

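Result: On the FinQA benchmark, TaNOS reaches 80.13% execution accuracy with only 10% of the training data, outperforming the SFT baseline trained on the full data (73.97%) as well as proprietary models such as GPT-5 and Gemini-2.5-Pro. In the domain-shift experiments, the cross-domain gap for TaNOS is under 2 percentage points, whereas standard SFT shows a gap of more than 10 percentage points.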

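Insight: The novelty lies in combining structural guidance from operation sketches, header-agnostic representations, and correctness-guaranteed self-supervision to separate domain knowledge from numerical reasoning structure, which improves robustness and transferability across diverse expert-domain tables.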

Abstract: Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner. By decoupling domain semantics and numerical operation structure, TaNOS improves the transferability of numerical reasoning. Applied to an 8B instruction-tuned model, TaNOS achieves 80.13% execution accuracy on FinQA with only 10% train data, outperforming SFT baseline (73.97%) with full train data and proprietary models such as GPT-5, Gemini-2.5-Pro. Furthermore, in the domain-shift experiments, TaNOS displays nearly-negligible cross-domain gap (<2pp) when standard SFT shows over 10pp gap. These results suggest that structural guidance with operation sketches, header-agnostic representations, and correctness-guaranteed self-supervision can improve the robustness of numerical reasoning across diverse expert-domain tables.
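
A minimal sketch of two of the ideas named above, under stated assumptions: header anonymization replaces column names with neutral placeholders, and a program-first generator samples an operation, executes it on the table, and only then renders a question, so the stored answer is correct by construction. The table, the `col_i` naming scheme, the operation set `OPS`, the question template, and the `op(_, _)` sketch format are all hypothetical; the abstract does not specify how TaNOS implements any of them.

```python
import random

# Hypothetical operation set; the paper's actual operation inventory is not given here.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
}

def anonymize_headers(table):
    """Replace column headers with neutral placeholders (col_0, col_1, ...)."""
    mapping = {name: f"col_{i}" for i, name in enumerate(table["headers"])}
    anonymized = {"headers": [mapping[h] for h in table["headers"]],
                  "rows": table["rows"]}
    return anonymized, mapping

def make_pair(table, rng):
    """Program-first: sample and execute a program, then render the question."""
    headers = table["headers"]
    r = rng.randrange(len(table["rows"]))
    i, j = rng.sample(range(len(headers)), 2)
    op = rng.choice(sorted(OPS))
    operands = (table["rows"][r][i], table["rows"][r][j])
    return {
        "op": op,
        "operands": operands,
        "program": f"{op}({headers[i]}, {headers[j]})",
        "sketch": f"{op}(_, _)",  # one plausible sketch form: operation without operands
        "question": f"In row {r}, what is {op}({headers[i]}, {headers[j]})?",
        "answer": OPS[op](*operands),  # computed by execution, hence correct
    }

table = {"headers": ["revenue", "cost", "tax"],
         "rows": [[120.0, 80.0, 10.0], [150.0, 90.0, 12.0]]}
anon_table, mapping = anonymize_headers(table)
pair = make_pair(anon_table, random.Random(0))
print(pair["question"], "->", pair["answer"])
```

Since the generated question mentions only placeholder names, a model trained on such pairs cannot learn a shortcut from a header like `revenue` to a particular operation.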


cs.IR [Back]

[96] Association Is Not Similarity: Learning Corpus-Specific Associations for Multi-Hop Retrieval cs.IR | cs.AI | cs.CL

Jason Dury

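TL;DR: The paper proposes Association-Augmented Retrieval (AAR), a lightweight transductive reranking method that improves dense retrieval for multi-hop questions. It trains a small MLP with contrastive learning to capture associative relationships between passages in the corpus, and at inference time reranks the initial retrieval candidate set using bi-directional association scoring.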

Details

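Motivation: Dense retrieval systems typically rank passages by the embedding similarity between query and passage, but multi-hop questions require passages that are associatively related along a reasoning chain, not merely semantically similar ones.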

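Result: On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points), with especially large gains on hard questions where the dense retrieval baseline fails (+28.5 points). On MuSiQue it gains +10.1 points in the transductive setting. A downstream QA evaluation shows that the retrieval gains translate into a +6.4 point improvement in exact match.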

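Insight: The novelty lies in distinguishing association from similarity: a lightweight model learns corpus-specific co-occurrence associations rather than transferable semantic patterns. The method is efficient (under two minutes of training on a single GPU, 3.7 ms added per query) and needs no LLM-based indexing, which suggests that corpus-specific association modeling can effectively improve multi-hop retrieval.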

Abstract: Dense retrieval systems rank passages by embedding similarity to a query, but multi-hop questions require passages that are associatively related through shared reasoning chains. We introduce Association-Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co-occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi-directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation-set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training-split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus-specific co-occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non-associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM-based indexing.
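
The difference between ranking by similarity and reranking by association can be sketched on a toy example. Everything below is an assumption made for illustration: `associate` is a fixed coordinate swap standing in for the trained MLP, the top dense hit is used as the seed passage, and the combined score is a plain weighted sum. The abstract does not state how AAR combines its scores, so this should not be read as the paper's procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def associate(v):
    """Stand-in for the trained association MLP (assumption): swap coordinates."""
    return [v[1], v[0]]

def bidirectional_score(a, b):
    """Average of the a->b and b->a association scores."""
    return 0.5 * (cosine(associate(a), b) + cosine(associate(b), a))

def rerank(query, candidates, weight=1.0):
    """Return (dense order, reranked order) over a dict of id -> embedding."""
    dense = sorted(candidates, key=lambda pid: cosine(query, candidates[pid]), reverse=True)
    seed, rest = dense[0], dense[1:]

    def combined(pid):
        # Query similarity plus association with the seed passage (hypothetical rule).
        return (cosine(query, candidates[pid])
                + weight * bidirectional_score(candidates[seed], candidates[pid]))

    return dense, [seed] + sorted(rest, key=combined, reverse=True)

query = [1.0, 0.2]
candidates = {
    "p0": [1.0, 0.0],    # first-hop passage, similar to the query
    "p1": [0.9, -0.3],   # similar to the query but not associated with p0
    "p2": [0.1, 1.0],    # dissimilar to the query but associated with p0
}
dense_order, reranked_order = rerank(query, candidates)
print(dense_order, reranked_order)
```

In this toy setup the second-hop passage `p2` ranks last under pure similarity and moves ahead of the distractor `p1` once association with the seed is taken into account.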


[97] Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts cs.IR | cs.AI | cs.CV

Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li

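TL;DR: Video-text retrieval models degrade sharply under real-world query distribution shifts. To study this, the paper introduces a comprehensive benchmark covering 12 perturbation types at 5 severity levels and shows that query shifts amplify the hubness phenomenon. To mitigate it, the paper proposes HAT-VTR, a test-time adaptation framework that improves robustness through a Hubness Suppression Memory and multi-granular losses.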

Details

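Motivation: Existing video-text retrieval models perform well on in-distribution benchmarks but are highly vulnerable to real-world query distribution shifts, under which performance drops sharply. Image-focused robustness solutions cannot handle the complex spatio-temporal dynamics of video, so dedicated study and solutions are needed.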

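Result: Extensive experiments on the proposed benchmark show that HAT-VTR substantially improves robustness and consistently outperforms prior methods across diverse query-shift scenarios, making the model more reliable in real-world applications.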

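Insight: The main contributions are: 1) the first systematic benchmark for the robustness of video-text retrieval models, which quantifies the impact of query shifts; 2) the finding that query shifts amplify hubness in retrieval; 3) HAT-VTR, a video-specific test-time adaptation framework whose core ideas are a Hubness Suppression Memory that refines similarity scores and multi-granular losses that enforce temporal feature consistency, directly counteracting hubness.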

Abstract: Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant “hubs” that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios, and enhancing model reliability for real-world applications.
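
Hubness is commonly quantified by the k-occurrence count: how many queries include a given gallery item in their top-k results. The sketch below computes that count from a query-by-gallery similarity matrix. It illustrates the phenomenon only and is not HAT-VTR's Hubness Suppression Memory, whose mechanism the abstract does not detail; the matrix values and the function name are made up for the example.

```python
from collections import Counter

def k_occurrence(similarity, k):
    """Count, per gallery item, how many queries rank it in their top-k."""
    n_gallery = len(similarity[0])
    counts = Counter()
    for row in similarity:
        # Indices of the k most similar gallery items for this query.
        top_k = sorted(range(n_gallery), key=lambda j: row[j], reverse=True)[:k]
        counts.update(top_k)
    return [counts[j] for j in range(n_gallery)]

# Rows are queries, columns are gallery items (hypothetical similarity scores).
similarity = [
    [0.9, 0.2, 0.8, 0.1],
    [0.3, 0.7, 0.8, 0.2],
    [0.1, 0.2, 0.9, 0.6],
    [0.2, 0.1, 0.7, 0.5],
]
top1_counts = k_occurrence(similarity, 1)
top2_counts = k_occurrence(similarity, 2)
print(top1_counts, top2_counts)
```

Here gallery item 2 is the nearest neighbour of three of the four queries, which makes it a hub; a strongly skewed k-occurrence distribution of this kind is what a hubness-aware method tries to flatten.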


[98] From Tokens to Concepts: Leveraging SAE for SPLADE cs.IR | cs.CL

Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski

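TL;DR: The paper proposes SAE-SPLADE, which replaces the backbone vocabulary that traditional SPLADE models depend on with a space of semantic concepts learned by Sparse Auto-Encoders (SAE). The goal is to address vocabulary-induced problems such as polysemy and synonymy and to open up multilingual and multimodal use. Experiments show that the model matches SPLADE's retrieval performance while being more efficient.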

Details

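Motivation: Learned sparse IR models such as SPLADE rely on the backbone vocabulary, which can limit performance (for example through polysemy and synonymy) and hinders multilingual and multimodal applications.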

Result: SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

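Insight: The novelty lies in replacing the vocabulary with a semantic concept space learned by an SAE, which can capture semantic information better, reduce lexical ambiguity, and may improve generalization and cross-modal adaptability.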

Abstract: Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.
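
For orientation, SPLADE-style models turn per-token activations into one sparse weight per output dimension by applying log(1 + ReLU(.)) and pooling over token positions, then score a query-document pair with a dot product. The sketch below shows that pooling with max pooling on a generic activation matrix. In a vocabulary-based model the columns would be vocabulary terms; treating them as SAE concepts is an assumption about how SAE-SPLADE plugs in, since the abstract does not describe the architecture. All numbers are made up.

```python
import math

def splade_pool(activations):
    """Max-pool log(1 + ReLU(a)) over token positions: one weight per dimension."""
    n_dims = len(activations[0])
    return [max(math.log1p(max(row[j], 0.0)) for row in activations)
            for j in range(n_dims)]

def sparse_dot(query_weights, doc_weights):
    """Relevance score as the dot product of two sparse weight vectors."""
    return sum(q * d for q, d in zip(query_weights, doc_weights))

# Rows are token positions, columns are output dimensions (terms or concepts).
doc_activations = [[2.0, -1.0, 0.0],
                   [0.5, -3.0, 1.0]]
query_activations = [[1.0, 0.0, -2.0]]

doc_weights = splade_pool(doc_activations)
query_weights = splade_pool(query_activations)
score = sparse_dot(query_weights, doc_weights)
print(doc_weights, query_weights, score)
```

Negative activations map to exactly zero, which is what keeps the representation sparse and lets it be served from an inverted index regardless of what the dimensions stand for.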


cs.GR [Back]

[99] StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition cs.GR | cs.CV | cs.HC | cs.MM

Kwan Yun, Changmin Lee, Ayeong Jeong, Youngseo Kim, Seungmi Lee

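TL;DR: The paper proposes StyleID, a perception-aware dataset and evaluation framework for assessing the robustness of facial identity recognition under stylization. It comprises two datasets, StyleBench-H (for benchmarking) and StyleBench-S (for supervised training), and targets the brittleness of existing identity encoders on stylized images. After fine-tuning existing semantic encoders on human perception data, the calibrated models correlate significantly better with human judgments and are more robust on out-of-domain, artist-drawn portraits.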

Details

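Motivation: Identity encoders trained on natural photographs are brittle on stylized face images (such as cartoons, sketches, and paintings): they tend to mistake changes in texture or color palette for identity drift, or fail to detect geometric exaggerations. There is no style-agnostic framework for evaluating and supervising identity consistency across styles and strengths.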

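Result: Experiments show that semantic encoders fine-tuned on StyleBench-S produce similarity orderings that agree better with human perception across styles and strengths, correlate significantly more strongly with human judgments, and are more robust on out-of-domain, artist-drawn portraits.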

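Insight: The novelty lies in building a dataset and framework (StyleID) grounded in human perceptual evaluation. Supervision signals are obtained from psychophysical two-alternative forced-choice (2AFC) experiments and used to calibrate models toward human perception, offering a new evaluation and supervision paradigm for style-agnostic identity recognition.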

Abstract: Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist drawn portraits. All of our datasets, code, and pretrained models are publicly available at https://kwanyun.github.io/StyleID_page/


cs.RO [Back]

[100] Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains cs.RO | cs.CV

Fatemeh Ziaeetabar

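TL;DR: The paper proposes eSEC-LAM, a neuro-symbolic framework that strengthens a robot's understanding of object interactions and manipulation. It turns classical enriched Semantic Event Chains (eSECs) into an explicit event-level symbolic state augmented with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. Predicates are extracted deterministically from a foundation-model-based perception front-end, and lightweight symbolic reasoning performs current-action inference and next-primitive prediction.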

Details

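Motivation: Robotic systems in human environments need to understand how object interactions unfold over time, recognize the current action, and predict the next manipulation step. Classical eSECs serve mainly as a descriptive tool and do not support uncertainty-aware decision making.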

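Result: The framework is evaluated on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 for action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. eSEC-LAM is competitive on action recognition, substantially improves next-primitive prediction, is more robust under degraded perception than both classical symbolic methods and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence.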

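Insight: The novelty lies in extending classical eSECs into a neuro-symbolic internal state that supports decision making. By adding semantic information along several dimensions (confidence awareness, functional roles, affordance priors) and combining foundation-model perception with symbolic reasoning, the framework achieves interpretable and robust manipulation understanding. This shows that eSECs can serve not only as a descriptive tool but also as an effective internal state for neuro-symbolic action reasoning.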

Abstract: Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.
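
Reasoning over primitive pre- and post-conditions with confidence-aware predicates can be sketched as follows. The symbolic state maps each predicate to a confidence value, a predicate counts as true when its confidence reaches a threshold, and a primitive is proposed as the next step when its preconditions hold and its effects are not yet observed. The primitive table, predicate names, threshold, and ranking rule are all hypothetical; the abstract does not specify eSEC-LAM's actual rules.

```python
# Hypothetical primitive library: preconditions and postconditions as predicate sets.
PRIMITIVES = {
    "reach": {"pre": {"hand_free"}, "post": {"hand_touching_cup"}},
    "grasp": {"pre": {"hand_free", "hand_touching_cup"}, "post": {"holding_cup"}},
    "lift":  {"pre": {"holding_cup", "cup_on_table"}, "post": {"cup_in_air"}},
}

def holds(state, predicates, threshold):
    """True when every predicate has confidence at or above the threshold."""
    return all(state.get(p, 0.0) >= threshold for p in predicates)

def predict_next(state, threshold=0.5):
    """Rank applicable primitives by the weakest confidence among their preconditions."""
    candidates = []
    for name, spec in PRIMITIVES.items():
        applicable = holds(state, spec["pre"], threshold)
        already_done = holds(state, spec["post"], threshold)
        if applicable and not already_done:
            support = min(state.get(p, 0.0) for p in spec["pre"])
            candidates.append((support, name))
    candidates.sort(reverse=True)
    return [name for _, name in candidates]

# Confident contact: the hand already touches the cup, so grasping is next.
clear_state = {"hand_free": 0.9, "hand_touching_cup": 0.8, "cup_on_table": 0.95}
# Degraded perception: contact is uncertain, so the system falls back to reaching.
noisy_state = {"hand_free": 0.9, "hand_touching_cup": 0.4, "cup_on_table": 0.95}

next_clear = predict_next(clear_state)
next_noisy = predict_next(noisy_state)
print(next_clear, next_noisy)
```

Because each prediction is tied to named predicates and their confidences, the same structure also yields an explanation of why a primitive was chosen.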