Table of Contents
- cs.CL [Total: 32]
- cs.CV [Total: 67]
- cs.LG [Total: 2]
- cs.AI [Total: 2]
- eess.IV [Total: 1]
- cs.RO [Total: 2]
cs.CL
[1] Characterizing AlphaEarth Embedding Geometry for Agentic Environmental Reasoning cs.CL | cs.AI
Mashrekur Rahman, Samuel J. Barrett, Christina Last
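TL;DR: This paper studies the geometry of the 64-dimensional embeddings produced by Google’s AlphaEarth Earth observation foundation model and finds that the manifold is non-Euclidean, with an effective dimensionality of about 13.3 and substantial rotation of local tangent spaces. Building on these geometric properties, the authors construct an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database to improve environmental reasoning. Experiments show that embedding retrieval significantly outperforms the parametric-only approach, and that the value of the geometric tools grows with the reasoning capability of the underlying large language model (e.g., Opus 4.6).
Motivation: Earth observation foundation models encode land surface information into dense embedding vectors, but the geometric structure of these representations and its effect on downstream reasoning remain underexplored. The paper aims to characterize the manifold geometry of AlphaEarth embeddings and to use that understanding to build a more effective environmental reasoning system.
Result: On a dataset of 12.1 million Continental United States samples, the analysis shows that the embedding manifold is non-Euclidean, with an effective dimensionality of 13.3, a local intrinsic dimensionality of about 10, and substantial rotation of local tangent spaces. In a five-condition ablation with 120 queries across three complexity tiers, embedding retrieval (mean score 3.79) significantly outperforms the parametric-only approach (mean score 3.03, scale 1–5) in response quality, peaking on multi-step comparison tasks (mean score 4.28). A cross-model benchmark shows that the geometric tools slightly hurt Sonnet 4.5 (a drop of 0.12 points) but improve Opus 4.6 (a gain of 0.07 points), and that Opus achieves a higher geometric grounding score (3.38 vs. 2.64).
Insight: The paper’s core innovation is the first systematic characterization of the manifold geometry of Earth observation embeddings (non-Euclidean structure, dimensional compression, and tangent-space rotation), together with the demonstration that conventional vector arithmetic is of limited use in this space. From this the authors draw a key insight: retrieval that exploits the local geometry of the embeddings, rather than vector arithmetic, yields more physically coherent results. This motivates a new paradigm that combines geometric understanding with agentic tool chains, whose performance gains are positively correlated with the reasoning capability of the integrated large language model, and it offers a new direction for building domain-specific, retrieval-augmented reasoning systems.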
Abstract: Earth observation foundation models encode land surface information into dense embedding vectors, yet the geometric structure of these representations and its implications for downstream reasoning remain underexplored. We characterize the manifold geometry of Google AlphaEarth’s 64-dimensional embeddings across 12.1 million Continental United States samples (2017–2023) and develop an agentic system that leverages this geometric understanding for environmental reasoning. The manifold is non-Euclidean: effective dimensionality is 13.3 (participation ratio) from 64 raw dimensions, with local intrinsic dimensionality of approximately 10. Tangent spaces rotate substantially, with 84% of locations exceeding 60° and local-global alignment (mean $|\cos\theta| = 0.17$) approaching the random baseline of 0.125. Supervised linear probes indicate that concept directions rotate across the manifold, and compositional vector arithmetic using both PCA-derived and probe-derived directions yields poor precision. Retrieval instead produces physically coherent results, with local geometry predicting retrieval coherence ($R^2 = 0.32$). Building on this characterization, we introduce an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database. A five-condition ablation (120 queries, three complexity tiers) shows that embedding retrieval dominates response quality ($\mu = 3.79 \pm 0.90$ vs. $3.03 \pm 0.77$ parametric-only; scale 1–5), with peak performance on multi-step comparisons ($\mu = 4.28 \pm 0.43$). A cross-model benchmark shows that geometric tools reduce Sonnet 4.5’s score by 0.12 points but improve Opus 4.6’s by 0.07, with Opus achieving higher geometric grounding (3.38 vs. 2.64), suggesting that the value of geometric characterization scales with the reasoning capability of the consuming model.
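The abstract reports effective dimensionality as a participation ratio. As a rough illustration, the sketch below computes the usual definition, (trace C)^2 / trace(C^2) for the sample covariance C, which equals (sum of eigenvalues)^2 / (sum of squared eigenvalues). The function name and the toy data are assumptions for illustration; the abstract does not say which estimator or preprocessing the authors used.

```python
def participation_ratio(samples):
    """Effective dimensionality (trace C)^2 / trace(C^2) of the sample covariance C."""
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Sample covariance matrix (d x d), using the unbiased n - 1 normalisation.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, trace(C^2) equals the sum of its squared entries,
    # so no eigendecomposition is needed.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


# Toy data (hypothetical): points on a line use one direction,
# points on the three coordinate axes use all three equally.
line = [(float(t), 2.0 * t, 0.0) for t in range(5)]
axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
print(participation_ratio(line))  # approximately 1
print(participation_ratio(axes))  # approximately 3
```

A value of 13.3 for 64 raw dimensions would mean, under this definition, that the variance is spread about as evenly as it would be over roughly 13 equally weighted directions.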
[2] Remask, Don’t Replace: Token-to-Mask Refinement in Masked Diffusion Language Models cs.CL
Lin Yao
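TL;DR: This paper proposes Token-to-Mask (T2M), a refinement method that addresses structural flaws in the Token-to-Token (T2T) editing rule of masked diffusion language models such as LLaDA2.1. When T2M detects a suspect token, it does not replace it directly but resets it to the mask state, so that subsequent denoising steps re-predict it from a more reliable context, improving accuracy on tasks that require exact token-level output.
Motivation: The T2T editing rule in masked diffusion language models has three structural failure modes: it cannot fire when no single alternative token is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used in training do not match the coherent semantic errors the model actually makes at inference. The author aims to provide a training-free fix that introduces no new parameters.
Result: Across 8 benchmarks, T2M significantly improves accuracy on tasks requiring exact token-level output. The largest gain is +5.92 points on CMATH, where 79.9% of baseline errors are attributed to “last-mile corruption” (correct reasoning followed by a garbled answer), and T2M repairs 41.3% of those cases.
Insight: The innovation is changing the editing rule from token-to-token replacement to token-to-mask resetting, using the mask as a better conditioning signal than an erroneous token so that the model re-predicts from an in-distribution context and corrects its own generation errors more effectively. The method needs no additional training or parameters; the gain comes solely from modifying the inference-time editing strategy.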
Abstract: Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
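The difference between the two editing rules can be sketched on a toy sequence. Everything below is an illustrative assumption rather than the paper’s implementation: the function names, the thresholds, and in particular the low-confidence test in `t2m_remask`, which stands in for the three detection heuristics that the abstract mentions but does not describe.

```python
MASK = "<mask>"


def t2t_edit(tokens, alt_tokens, alt_conf, threshold):
    """Token-to-Token: overwrite a committed token only when a different
    token crosses the confidence threshold."""
    return [alt if alt != tok and conf >= threshold else tok
            for tok, alt, conf in zip(tokens, alt_tokens, alt_conf)]


def t2m_remask(tokens, committed_conf, suspect_below):
    """Token-to-Mask: reset suspect positions to the mask state so the next
    denoising step re-predicts them. The suspicion test used here (low
    confidence in the committed token) is a hypothetical heuristic."""
    return [MASK if tok != MASK and conf < suspect_below else tok
            for tok, conf in zip(tokens, committed_conf)]


tokens = ["The", "answer", "is", "41"]
alt_tokens = ["The", "answer", "is", "42"]
alt_conf = [0.99, 0.98, 0.97, 0.55]        # no alternative is confident enough
committed_conf = [0.99, 0.98, 0.97, 0.30]  # but the committed "41" looks shaky

print(t2t_edit(tokens, alt_tokens, alt_conf, threshold=0.9))
print(t2m_remask(tokens, committed_conf, suspect_below=0.5))
```

In this toy case the T2T trigger never fires because the alternative “42” stays below the threshold, which mirrors the first failure mode listed in the abstract, while T2M reopens the final position for re-prediction.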
[3] Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models cs.CL | cs.AI
Seyedali Mohammadi, Manas Gaur, Francis Ferraro
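TL;DR: This paper studies the ability of large language models (LLMs) to assess scientific feasibility, framing feasibility assessment as a diagnostic reasoning task in which the model, given a hypothesis, predicts whether it is feasible and justifies the prediction. LLMs are evaluated under controlled knowledge conditions (hypothesis only, with experiment descriptions, with outcome evidence, or both), and robustness is probed by progressively removing experimental or outcome context.
Motivation: The goal is to understand how LLMs assess the feasibility of scientific hypotheses, that is, whether a claim is consistent with established knowledge and whether experimental evidence can support or refute it, and to clarify when experimental evidence helps or harms LLM assessment.
Result: Experiments across multiple LLMs and two datasets show that providing outcome evidence is generally more reliable than providing experiment descriptions. Outcome evidence tends to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete.
Insight: The innovation is framing scientific feasibility assessment systematically as a controllable diagnostic reasoning task, and showing that outcome evidence is more robust than experiment descriptions in LLM assessment while the completeness of experimental text strongly affects performance. This offers new insight into how LLMs use external evidence for reasoning.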
Abstract: Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
[4] Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs cs.CL | cs.AI
Yuefei Chen, Yihao Quan, Xiaodong Lin, Ruixiang Tang
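TL;DR: This paper studies fabricated citations generated by large language models and finds that the author-name field has the highest error rate across all models and settings, and that hallucination signals do not generalize across fields. By analyzing neuron activity in Qwen2.5-32B-Instruct, the authors identify a sparse set of field-specific hallucination neurons and confirm that intervening on them can detect and mitigate citation hallucination.
Motivation: LLMs frequently generate fictitious yet plausible citations and remain highly confident even when the citation is wrong; the paper addresses this problem.
Result: A study across 9 models and 108,000 generated references finds that the author-name field fails far more often than other fields. Elastic-net regularization with stability selection identifies field-specific hallucination neurons, and causal intervention confirms that suppressing them improves performance across fields, with larger gains in some fields.
Insight: The innovation is showing that citation hallucination is field-specific and proposing a lightweight way to detect and mitigate it by identifying and intervening on sparse field-specific hallucination neurons, using only internal model signals and no external knowledge.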
Abstract: LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
[5] Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness cs.CL
Mengzhao Jia, Zhihan Zhang, Meng Jiang
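TL;DR: This paper addresses the mismatch between answer correctness and reasoning validity in multimodal reasoning. It proposes trajectory supervision to improve reasoning reliability, compares reward models with generative rewards, and introduces Groupwise Ranking Reward, which uses groupwise ranking to separate correct trajectories of differing quality more efficiently and thereby improve reasoning quality.
Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) rewards correct answers in multimodal reasoning but cannot guarantee that reasoning trajectories are complete and consistent, which leads to reasoning-answer inconsistency. Trajectory supervision is therefore needed to make reasoning more reliable.
Result: Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. The proposed Groupwise Ranking Reward raises reliability-conditioned accuracy from 47.4% under RLVR to 54.7% and performs best overall.
Insight: The innovation is exposing the gap between answer correctness and reasoning validity in multimodal reasoning and proposing a groupwise ranking reward mechanism that evaluates and rewards reasoning trajectories at a finer granularity. It is more efficient and stable than conventional generative rewards and offers a new direction for reward design in reinforcement learning.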
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.
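One way to picture “ranks verifier-passed trajectories in one pass and redistributes reward accordingly” is the sketch below. The linear weighting, the choice to conserve the total reward of the passed trajectories, and the function name are all assumptions made for illustration; the abstract does not specify the redistribution rule or how the judge produces the ranking.

```python
def groupwise_ranking_reward(passed, ranking):
    """Redistribute reward among verifier-passed trajectories of one prompt.

    passed:  list of bools, one per trajectory (verifier outcome).
    ranking: indices of the passed trajectories, best first, as a judge
             might return them from a single groupwise comparison.
    Failed trajectories keep reward 0. Passed ones share a total reward
    equal to their count, split linearly by rank (an assumed scheme).
    """
    if sorted(ranking) != [i for i, ok in enumerate(passed) if ok]:
        raise ValueError("ranking must list exactly the passed trajectories")
    rewards = [0.0] * len(passed)
    k = len(ranking)
    if k == 0:
        return rewards
    total_weight = k * (k + 1) / 2
    for position, idx in enumerate(ranking):
        rewards[idx] = k * (k - position) / total_weight
    return rewards


# Four sampled trajectories; the second fails verification.
# A hypothetical judge ranks the three correct ones as 2 > 0 > 3.
print(groupwise_ranking_reward([True, False, True, True], [2, 0, 3]))
```

Under plain RLVR all three correct trajectories would receive the same reward of 1; here the best-ranked one receives more and the weakest less, while the group total is unchanged.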
[6] Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning cs.CL | cs.LG
Manuel Israel Cazares
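TL;DR: This paper presents a systematic empirical study of prompt engineering for large language models on formal mathematical reasoning, focusing on the implication-decision problem from Stage 1 of the SAIR Equational Theories competition. It finds a “single-prompt ceiling”: despite extensive prompt engineering, accuracy on hard cases saturates at roughly 60–79% and cannot be pushed further with a single prompt.
Motivation: The study asks where the limits of prompt engineering lie for formal mathematical reasoning, specifically deciding whether one equational law implies another, and why model performance hits a ceiling that is hard to break through.
Result: On the hard3 split of the SAIR competition (n=400), the best prompt (AN45c, 2,252 bytes) brings gpt-oss-120b to 79.25% accuracy (95% CI: [75.0%, 82.9%]), a gain of 19.5 percentage points over the no-cheatsheet baseline (59.75%). TRUE recall is 95.9% and FALSE recall is 63.4%. The study also finds that for a weaker model (Llama 3.3 70B), overly long prompts (above 2KB) cause TRUE recall to collapse to 0%.
Insight: The paper identifies a “single-prompt ceiling” in LLM mathematical reasoning and three potential mechanisms behind it: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems degrade the performance of weaker models; (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. The implication is that adding prompt complexity and length has diminishing returns, and that breaking the bottleneck requires paradigms beyond a single prompt (e.g., multiple prompts or tool use).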
Abstract: We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas – a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60–79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair-prompt-engineering
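The remark that FALSE cases are decidable via finite model search can be made concrete with a brute-force sketch: enumerate every binary operation table on a small set and look for a magma that satisfies the premise law but violates the conclusion. The function name, the encoding of laws as Python predicates, and the example laws (commutativity and associativity) are assumptions for illustration and are not taken from the competition data.

```python
from itertools import product


def find_counterexample(premise, conclusion, size=2, num_vars=3):
    """Search all magmas on {0, ..., size-1} for one where `premise` holds
    for every variable assignment but `conclusion` fails for some assignment.
    Returns the operation table as a dict, or None if no magma of this size
    is a counterexample (which does NOT prove the implication)."""
    elems = list(range(size))
    cells = [(a, b) for a in elems for b in elems]
    assignments = list(product(elems, repeat=num_vars))
    for values in product(elems, repeat=len(cells)):
        table = dict(zip(cells, values))
        op = lambda a, b, t=table: t[(a, b)]
        holds = all(premise(op, *v) for v in assignments)
        if holds and not all(conclusion(op, *v) for v in assignments):
            return table
    return None


# Example laws over variables x, y, z (illustrative choices).
commutative = lambda op, x, y, z: op(x, y) == op(y, x)
associative = lambda op, x, y, z: op(op(x, y), z) == op(x, op(y, z))

# Commutativity does not imply associativity: a 2-element witness exists.
print(find_counterexample(commutative, associative))
```

The search is exponential in the table size (size^(size^2) tables), so it only scales to very small magmas, and a `None` result leaves the TRUE case open, consistent with the undecidability noted in the abstract.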
[7] LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval cs.CL
He Cheng, Yifu Wu, Saksham Khatwani, Maya Kruse, Dmitriy Dligach
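TL;DR: LogosKG is a hardware-optimized, scalable, and interpretable knowledge graph retrieval framework. It performs multi-hop retrieval on large knowledge graphs using symbolic KG representations and hardware-efficient operations, and it integrates degree-aware partitioning, cross-graph routing, and on-demand caching to scale to billion-edge graphs.
Motivation: Existing systems that integrate knowledge graphs with large language models struggle to balance efficiency, scalability, and interpretability in multi-hop retrieval; LogosKG targets this problem.
Result: Experiments show that LogosKG achieves substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity, and a downstream KG-LLM interaction task demonstrates its usefulness for large-scale, evidence-grounded analysis.
Insight: The innovation is executing KG traversal as hardware-efficient operations over decomposed subject, object, and relation representations, combined with scalability optimizations, which opens a new route for next-generation KG-LLM integration.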
Abstract: Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at https://github.com/LARK-NLP-Lab/LogosKG, and an online demo is available at https://lark-nlp-lab-logoskg.hf.space/.
[8] Disparities In Negation Understanding Across Languages In Vision-Language Models cs.CL
Charikleia Moraitaki, Sarah Pan, Skyler Pulling, Gwendolyn Flusche, Kumail Alhamoud
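TL;DR: This paper addresses affirmation bias in vision-language models (VLMs), the tendency to select affirmative captions over negated ones. It builds the first human-verified multilingual negation benchmark covering seven typologically diverse languages and evaluates CLIP, SigLIP, and MultiCLIP. Standard CLIP performs poorly on non-Latin-script languages, while MultiCLIP achieves the highest and most consistent accuracy. The paper also tests SpaceVLM, a proposed negation correction, and finds that its effect varies by language, revealing an interaction between linguistic properties and model improvements.
Motivation: The paper tackles the widespread affirmation bias in VLMs and asks whether existing solutions serve different linguistic communities equitably, given that negation is expressed differently across languages through morphology, word order, and cliticization patterns.
Result: On the multilingual negation benchmark, standard CLIP performs at or below chance on non-Latin-script languages (e.g., Arabic and Russian); MultiCLIP achieves the highest and most uniform accuracy; and the SpaceVLM correction yields substantial improvements for English, Greek, Spanish, and Tagalog but mixed results across typologically different languages.
Insight: The innovation is the first human-verified multilingual negation benchmark, which underscores the importance of multilingual evaluation for model fairness. The analysis indicates that linguistic properties (such as morphology, script, and negation structure) are related to how much a model improves, a key insight for building fairer cross-lingual vision-language models.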
Abstract: Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions (“X is present”) even when the correct description contains negation (“no X”). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.
[9] When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains cs.CL
Ishita Kakkar, Enze Zhang, Rheeya Uppaal, Junjie Hu
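TL;DR: This paper introduces HarmThoughts, a benchmark dataset for fine-grained, sentence-level detection of harmful behavior in the reasoning chains of large language models. Built on a taxonomy of 16 harmful reasoning behaviors, it contains 56,931 sentences from four model families and is designed to expose how harmful content propagates during reasoning rather than only in the final output.
Motivation: Existing safety evaluation of large models focuses on final outputs and overlooks how harmful content emerges step by step within complex multi-step reasoning chains, which hinders reliable safety monitoring, intervention, and systematic failure diagnosis.
Result: Experiments on HarmThoughts show that existing white-box and black-box detectors perform poorly at fine-grained harmful-behavior detection in reasoning chains, particularly on nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring.
Insight: The innovation is the first benchmark and taxonomy focused on sentence-level propagation of harmful behavior in the reasoning process rather than the final result. Harm is modeled as a dynamic process (e.g., suppressing refusal, rationalizing compliance, decomposing harmful tasks, concealing risk), giving process-level safety monitoring and diagnosis new tools and a new perspective.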
Abstract: Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When a model is jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces – a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviors on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is available publicly at: https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts
[10] Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection cs.CL
Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung
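TL;DR: This paper proposes RADAR, a role-anchored multi-agent debate framework for detecting half-truths in fact verification, that is, claims that are factually correct but misleading because context is omitted. Agents in different roles (a Politician and a Scientist) reason adversarially over shared retrieved evidence, a neutral Judge issues the verdict, and an adaptive dual-threshold early termination controller balances detection accuracy against reasoning cost.
Motivation: Existing fact verification systems focus on explicit falsehoods and miss half-truths that arise from omitting key context. Detecting such omission-based manipulation requires reasoning not only about what is stated but also about what is left unstated.
Result: Experiments show that RADAR consistently outperforms strong single-agent and multi-agent baselines across datasets and backbone models, improving omission detection accuracy while reducing reasoning cost.
Insight: The innovation is combining role anchoring with retrieval-grounded multi-agent debate and adding an adaptive controller that decides reasoning depth dynamically, yielding an effective and scalable framework for uncovering missing context in fact verification.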
Abstract: Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at https://github.com/tangyixuan/RADAR.
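A dual-threshold early termination controller can be pictured as follows. This is only a sketch under stated assumptions: the per-round judge score in [0, 1], the threshold values, the round budget, the fallback rule, and the verdict labels are all hypothetical, since the abstract does not describe the controller’s inputs or decision rule.

```python
def run_debate(judge_scores, lower=0.2, upper=0.8, max_rounds=5):
    """Stop the debate as soon as the judge is confident either way.

    judge_scores: per-round scores in [0, 1], where higher means the judge
                  leans toward "the claim omits important context"
                  (a hypothetical scoring interface).
    Returns (verdict, rounds_used).
    """
    if not judge_scores:
        raise ValueError("at least one debate round is required")
    rounds = judge_scores[:max_rounds]
    for round_idx, score in enumerate(rounds, start=1):
        if score >= upper:          # confident: misleading by omission
            return ("flag", round_idx)
        if score <= lower:          # confident: no problematic omission
            return ("clear", round_idx)
    # Budget exhausted without a confident verdict: fall back to the last
    # score against 0.5 (an assumed tie-breaking rule).
    return ("flag" if rounds[-1] >= 0.5 else "clear", len(rounds))


print(run_debate([0.5, 0.6, 0.85]))   # stops once the upper threshold is hit
print(run_debate([0.1]))              # stops after one round
print(run_debate([0.5, 0.4, 0.45], max_rounds=3))
```

Easy claims exit after a round or two while contested ones use the full budget, which is the accuracy-versus-cost trade-off the abstract attributes to adaptive control.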
[11] Cell-Based Representation of Relational Binding in Language Models cs.CL
Qin Dai, Benjamin Heinzerling, Kentaro Inui
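TL;DR: This paper studies how large language models (LLMs) implement discourse-level relational binding, that is, tracking entities and the relations between them. It finds that LLMs encode binding through a Cell-based Binding Representation (CBR), a low-dimensional linear subspace in which each “cell” corresponds to an entity–relation index pair; during inference, bound attributes are retrieved from the corresponding cell.
Motivation: Although LLMs perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. The paper aims to uncover the internal representation behind discourse-level relational binding.
Result: Across multiple domains and two model families, using annotated multi-sentence data, the indices are decoded from attribute-token activations with Partial Least Squares regression; they are linearly decodable and form a grid-like geometry in the projected space. Activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
Insight: The paper’s claimed innovation is the CBR mechanism and empirical evidence for its existence. Viewed objectively, the study uses linear decoding and geometric analysis to reveal structured patterns in LLM internal representations, in particular translation vectors that relate context-specific representations and enable cross-context transfer, offering a new perspective and interpretability method for understanding relational reasoning in LLMs.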
Abstract: Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each “cell” corresponds to an entity–relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
[12] Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference cs.CL | cs.AI
Aby Mammen Mathew
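TL;DR: Natural language inference (NLI) models tend to rely on dataset artifacts rather than genuine reasoning. This paper proposes Product-of-Experts (PoE) training, which reduces that reliance by downweighting examples on which biased models are overconfident. On SNLI it nearly preserves accuracy (89.10% vs. 89.30%) while reducing bias reliance by 4.71%.
Motivation: NLI models overfit artifacts in the dataset (such as statistical biases in the hypothesis sentence alone) instead of performing real semantic inference. For example, a hypothesis-only model reaches 57.7% accuracy on SNLI, indicating strong spurious correlations.
Result: On SNLI, PoE training lowers bias agreement from 49.85% to 45% with only a slight drop in accuracy (89.10% vs. the 89.30% baseline). An ablation identifies lambda = 1.5 as the setting that best balances debiasing and accuracy. Behavioral tests show that the model still has problems with negation and numerical reasoning.
Insight: The innovation is a PoE training framework that uses a biased “expert” model to identify and downweight suspect examples, reducing reliance on dataset artifacts at minimal cost in performance. It is an effective training strategy for mitigating the learning of spurious patterns.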
Abstract: Neural NLI models overfit dataset artifacts instead of truly reasoning. A hypothesis-only model reaches 57.7% accuracy on SNLI, showing strong spurious correlations, and 38.6% of the baseline errors result from these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples where biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.
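A common way to write a Product-of-Experts loss is to add the log-probabilities of the main model and a frozen biased model, renormalise, and take cross-entropy on the combined distribution. The sketch below follows that pattern. Treating lambda as a scale on the biased expert’s log-probabilities is an assumption, as are the function names and toy logits; the abstract reports lambda = 1.5 but does not say how it enters the loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_loss(main_logits, bias_logits, label, lam=1.5):
    """Cross-entropy of the combined (main x bias^lam) distribution.

    The biased expert (e.g. a hypothesis-only model) is assumed frozen;
    only the main model would receive gradients during training.
    """
    combined = [a + lam * b
                for a, b in zip(log_softmax(main_logits), log_softmax(bias_logits))]
    return -log_softmax(combined)[label]


# Three NLI classes; the main model is undecided on this example.
main = [0.0, 0.0, 0.0]
plain_ce = -log_softmax(main)[0]

# If the biased expert already predicts the gold label confidently, the
# combined loss is small, so the example contributes little training signal.
print(plain_ce, poe_loss(main, [5.0, 0.0, 0.0], label=0))
```

With an uninformative biased expert the loss reduces to ordinary cross-entropy, so the downweighting only applies to examples that the biased expert can already solve from artifacts.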
[13] TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only cs.CL | cs.LG
Yilun Liu, Ruihong Qiu, Zi Huang
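TL;DR: TRN-R1-Zero is a zero-shot reasoning framework for text-rich networks that post-trains a base large language model solely via reinforcement learning, with no task-specific supervised data. It uses a novel Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards to steer the model toward relational reasoning, and it shows superiority and robustness across a range of network benchmarks.
Motivation: Zero-shot reasoning on text-rich networks requires models to integrate textual semantics with relational structure without task-specific supervision. Existing approaches (graph neural networks, or LLM methods that depend on distillation from larger models) generalize poorly or overlook graph context.
Result: Extensive experiments on citation, hyperlink, social, and co-purchase text-rich network benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Trained only at the node level, it achieves zero-shot inference on edge-level and graph-level tasks and supports cross-domain transfer.
Insight: The innovation is relying entirely on reinforcement learning for post-training, with no supervised fine-tuning and no chain-of-thought data generated by large reasoning models. At its core is the Neighbour-aware Group Relative Policy Optimisation objective, which adjusts rewards using a novel margin gain metric that measures the informativeness of neighbouring signals, effectively guiding relational reasoning. This offers a new annotation-free paradigm for combining graph-structured learning with LLMs.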
Abstract: Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.
[14] SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning cs.CL | cs.AI | cs.LG
Rania Elbadry, Sarfraz Ahmad, Ahmed Heakl, Dani Bouch, Momina Ahsan
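TL;DR: This paper introduces SAHM, a benchmark and instruction-tuning dataset for Arabic financial and Shari’ah-compliant reasoning. It contains 14,380 expert-verified instances across seven tasks and is used to evaluate the reasoning of large language models in the Arabic financial domain.
Motivation: Arabic financial NLP lacks high-quality benchmarks, especially for the practical need of Shari’ah-compliant reasoning.
Result: An evaluation of 19 open and proprietary LLMs finds that models are comparatively strong on recognition-style tasks but show substantial gaps on generation and causal reasoning, performing worst on event-cause reasoning.
Insight: The innovation is the first comprehensive benchmark for Arabic financial and Shari’ah reasoning, together with the finding that Arabic fluency does not directly translate into evidence-grounded financial reasoning. It lays a foundation for future research on trustworthy Arabic financial assistants.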
Abstract: English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
[15] Do Emotions Influence Moral Judgment in Large Language Models? cs.CL
Mohammad Saim, Tianyu Jiang
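TL;DR: This study examines how emotions influence moral judgment in large language models. It builds an emotion-induction pipeline that injects emotion into moral situations and evaluates shifts in moral acceptability across multiple datasets and LLMs. Positive emotions raise moral acceptability and negative emotions lower it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and susceptibility scales inversely with model capability. The analysis also shows that specific emotions (such as remorse) can defy what their valence predicts, while a human annotation study finds no such systematic shifts in humans, indicating an alignment gap in current LLMs.
Motivation: How emotions affect moral judgment in LLMs remains underexplored, even though emotion recognition and moral reasoning have each been studied extensively as separate capabilities.
Result: Evaluation across multiple datasets and LLMs shows that emotion induction shifts moral judgments systematically: positive emotions raise moral acceptability on average and negative emotions lower it, the effect is strong enough to reverse up to 20% of binary judgments, and more capable models are less susceptible to emotional influence.
Insight: The innovation is the first systematic quantification of the causal effect of emotion on moral judgment in LLMs. It reveals a directional pattern linking emotional valence to shifts in moral judgment, an inverse relationship between model capability and susceptibility, and a paradoxical effect for specific emotions (such as remorse), and the human comparison highlights the alignment gap between LLMs and humans in emotion-morality interaction.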
Abstract: Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.
[16] The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models cs.CL | cs.AI
Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang
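TL;DR: This paper presents a systematic analysis of the “verbal tic” phenomenon across eight frontier large language models (GPT-5.4, Claude Opus 4.7, and others). Verbal tics are repetitive, formulaic linguistic patterns in model output, such as sycophantic openers, pseudo-empathetic affirmations, and overused vocabulary. Using a custom evaluation framework, the study assesses 10,000 prompts across 10 task categories in English and Chinese, yielding 160,000 model responses, and introduces the Verbal Tic Index (VTI), a composite metric that quantifies tic prevalence.
Motivation: As LLMs evolve through alignment techniques such as RLHF and Constitutional AI, verbal tics have proliferated, and these repetitive, formulaic patterns undermine the naturalness and authenticity of model output. The paper sets out to analyze the phenomenon systematically and to expose its link to current alignment training paradigms.
Result: Models differ significantly: Gemini 3.1 Pro has the highest VTI (0.590) and DeepSeek V3.2 the lowest (0.295). Human evaluation (N=120) confirms a strong negative correlation between sycophancy and perceived naturalness (r=-0.87, p<0.001). Verbal tics also accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns.
Insight: The innovation is the first systematic definition and quantification of verbal tics in LLMs and the Verbal Tic Index (VTI) as a composite measure. Viewed objectively, the study exposes a possible “alignment tax” of current alignment techniques such as RLHF: over-optimizing for specific human-feedback objectives can produce unnatural, formulaic language. This provides an important basis for designing more authentic human-AI interaction frameworks.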
Abstract: As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics – repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers (“That’s a great question!”, “Awesome!”) to pseudo-empathetic affirmations (“I completely understand your concern”, “I’m right here to catch you”) and overused vocabulary (“delve”, “tapestry”, “nuanced”). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the “alignment tax” of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
[17] ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation cs.CL
Kunquan Li, Yingxue Zhang, Fandong Meng, Jinsong Su
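TL;DR: This paper proposes ReflectMT, a two-stage reflection internalization algorithm for efficient, high-quality machine translation. It adopts a “translate-first-think-later” paradigm, uses reinforcement learning to train the model's “translate-reflect-refine” capability, and ultimately yields a direct translation mode that needs no explicit reasoning steps.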
Details
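Motivation: When large reasoning models are applied to machine translation, they typically follow a “think-first-then-translate” paradigm. This improves translation quality but incurs high inference cost and latency. The paper aims to resolve this trade-off between efficiency and quality.
Result: Experiments on datasets such as WMT24 show that ReflectMT's first-pass translation at inference time outperforms multi-step reasoning models such as DeepSeek-R1 on both automatic metrics and GPT-based evaluation, improving the GPT-based translation quality score by 2.16 points while reducing token consumption by 94.33%.
Insight: The core innovation is the two-stage “reflection internalization” training paradigm, which internalizes the knowledge and capability of a multi-step reasoning process into the model so that inference is single-step, high-quality, and low-cost. This suggests a new way to balance reasoning performance and efficiency in large models.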
Abstract: Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a “think-first-then-translate” paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a “translate-first-think-later” paradigm. Our approach develops the model’s “translate-reflect-refine” capability through reinforcement learning. In the first stage, we cultivate the model’s capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model’s first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.
[18] How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning cs.CL | cs.AIPDF
Haoyang Chen, Yi Liu, Jianzhi Shao, Tao Zhang, Chengfu Huo
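TL;DR: This paper studies how, after a thinking LLM has generated a reasoning trace, its answer tokens read and integrate that reasoning to produce a reliable answer. Analyzing answer-to-reasoning attention on quantitative reasoning tasks, the authors find that correct solutions exhibit a benign “self-reading” pattern, whereas incorrect solutions show diffuse and irregular attention. Building on this observation, they propose a training-free steering method that uses Self-Reading Quality (SRQ) scores to build steering vectors and improve reasoning accuracy.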
Details
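Motivation: Prior work mainly focuses on shaping the reasoning traces of large language models, while how answer tokens actually read and integrate that reasoning to produce reliable outcomes remains poorly understood. The paper aims to fill this gap, particularly for quantitative reasoning, by examining the internal mechanism of answer decoding.
Result: Experiments show that the proposed SRQ-driven steering method yields consistent accuracy gains on quantitative reasoning tasks.
Insight: The novelty lies in revealing the link between the “self-reading” pattern during answer decoding and answer correctness, and in proposing a training-free steering method that combines geometric metrics (for process control) with semantic metrics (for content monitoring). Viewed objectively, combining attention-pattern analysis with model steering offers a new perspective and tool for understanding and improving the reasoning reliability of large language models.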
Abstract: Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces. It remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze the answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Following this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain and disorganized reading. Experiments show that our method yields consistent accuracy gains.
[19] Headlines You Won’t Forget: Can Pronoun Insertion Increase Memorability? cs.CLPDF
Selina Meyer, Magdalena Abel, Michael Roth
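TL;DR: This study examines how inserting first- and second-person pronouns into news headlines affects memorability, and assesses whether large language models can insert such pronouns automatically without changing the core meaning. Across three controlled memory experiments (240 participants, 7,680 memory judgments), pronoun insertion shows mixed effects on memorability that vary with headline topic, insertion method, and context. The study also finds that LLM-generated revisions fall short in content accuracy, emotion retention, and writing style.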
Details
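Motivation: News headlines must be remembered before they can influence beliefs and drive action. The study explores how a specific linguistic feature, direct address through personal pronouns, affects memorability, and whether LLMs can feasibly be used for targeted insertion of that feature.
Result: Pronoun insertion has mixed effects on memorability, with effects differing by headline topic, insertion method, and context. Automatic LLM revisions show problems in content accuracy, emotion retention, and writing style, and further data and analysis are needed to pin down the mediating factors.
Insight: The novelty lies in applying experiment designs from cognitive psychology to a computational linguistics question, probing how pronoun insertion affects text memorability, and in providing a first assessment of LLM suitability for this task. Viewed objectively, the study reveals the complexity of the interaction between linguistic features and memory, as well as the current limits of LLMs in preserving meaning and natural style, and provides data for future work on text optimization and human-computer interaction.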
Abstract: For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted, and their immediate contexts. Additional data and fine-grained analysis are needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in an unnatural writing style. We make our collected data available for future work.
[20] IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text cs.CL | cs.AI | cs.IRPDF
Rajveer Singh Pall
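TL;DR: This paper introduces IndiaFinBench, which the author describes as the first publicly available benchmark for evaluating large language models on Indian financial regulatory text. It contains 406 expert-annotated question-answer pairs covering four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning. Twelve models evaluated under zero-shot conditions reach accuracies between 70.4% and 89.7%, all substantially above a non-specialist human baseline.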
Details
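Motivation: Existing financial NLP benchmarks draw entirely on Western financial corpora (such as SEC filings and US earnings reports) and leave non-Western regulatory frameworks largely uncovered. The paper aims to fill this gap with a dedicated benchmark for LLM performance on Indian financial regulatory text.
Result: Twelve models were evaluated under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash); all models substantially exceed the 60.0% non-specialist human baseline. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) identifies three statistically distinct performance tiers.
Insight: The claimed contribution is the first public evaluation benchmark focused on Indian financial regulatory text, filling a gap in non-Western financial regulatory NLP evaluation. Viewed objectively, its strengths are: 1) authoritative data sources (SEBI and RBI) and targeted task design (e.g., regulatory interpretation, contradiction detection); 2) annotation quality validated through a model-based secondary pass and a human inter-annotator agreement evaluation; 3) good discriminative power, allowing different LLMs to be compared on domain- and region-specific text.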
Abstract: We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
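The bootstrap significance testing mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the exact procedure, so the paired item-level resampling below, the function name `paired_bootstrap`, and the toy per-item correctness vectors are all assumptions made for illustration, not the benchmark's released code.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Fraction of resamples in which model A does NOT beat model B.

    correct_a / correct_b are per-item 0/1 correctness flags on the same items.
    A small return value suggests A's advantage is unlikely to be a sampling artifact.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical toy data: 100 items, A correct on 90, B correct on 70.
model_a = [1] * 90 + [0] * 10
model_b = [1] * 70 + [0] * 30
p_value = paired_bootstrap(model_a, model_b, n_resamples=2000)
```

Pairing the resampled items matters here: both models answer the same questions, so resampling them independently would overstate the variance of the accuracy difference.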
[21] Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL | cs.AIPDF
Xinlin Wang, Mats Brorsson
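TL;DR: This paper presents the first large-scale, comprehensive study of deployment trade-offs for small language models with fewer than 10 billion parameters under three paradigms: the base model, a tool-equipped single agent, and a collaborative multi-agent system. Single-agent systems strike the best balance between performance and cost, while multi-agent setups add overhead for limited gains.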
Details
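Motivation: Despite their capabilities, large language models carry high computational cost, latency, and privacy risks that hinder real-world deployment. Small language models (SLMs) are a promising alternative, but their inherent limits in knowledge and reasoning restrict their effectiveness. Existing work mostly strengthens SLMs through scaling laws or fine-tuning strategies and overlooks the potential of agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the weaknesses of small models.
Result: Single-agent systems achieve the best balance between performance and cost; multi-agent setups add overhead with limited gains.
Insight: The novelty lies in the first systematic application of agent paradigms (tool use and multi-agent collaboration) to small language models to compensate for their inherent limitations, and in highlighting the importance of agent-centric design for efficient, trustworthy deployment in resource-constrained settings.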
Abstract: Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.
[22] Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation cs.CLPDF
Eoghan Cunningham, Derek Greene, James Cross, Antonio Rago
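TL;DR: This paper proposes a formal framework, based on computational argumentation, for evaluating the faithfulness of parliamentary debate summaries, focusing on whether a summary accurately preserves the argument structures that support or oppose policy outcomes in the debate.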
Details
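Motivation: Parliamentary debates are voluminous and complex, making them hard for the public to follow. LLMs can generate summaries automatically to improve accessibility, but existing automatic summarisation metrics correlate poorly with human judgements of consistency (faithfulness), so a more reliable evaluation method is needed.
Result: The study uses European Parliament debates as a case study to demonstrate the method for evaluating LLM-driven summaries; the abstract does not report specific quantitative results or benchmark comparisons.
Insight: The novelty lies in bringing computational argumentation theory to summary evaluation: by grounding argument structures in the contested proposals under debate, the framework formally assesses whether a summary faithfully preserves the reasoning, offering a new perspective on evaluating summaries of argumentative text.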
Abstract: Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.
[23] Does Self-Consistency Improve the Recall of Encyclopedic Knowledge? cs.CLPDF
Sho Hoshino, Ukyo Honda, Peinan Zhang
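TL;DR: This paper studies the effect of self-consistency on symbolic reasoning and encyclopedic knowledge recall, evaluated on a dedicated knowledge-recall subset constructed from the MMLU benchmark. Self-consistency improves performance on both task types, and the authors reach 89% accuracy on MMLU with GPT-4o, which the abstract describes as the best performance to date using that model.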
Details
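Motivation: Self-consistency is known to help on symbolic reasoning tasks, but its effect on encyclopedic knowledge recall is unclear because no targeted evaluation benchmark exists.
Result: On the knowledge-recall subset constructed from MMLU, self-consistency improves performance; with GPT-4o the authors achieve 89% accuracy on MMLU, which the abstract describes as the best performance to date using that model.
Insight: The novelty lies in using a data-driven heuristic to construct a dedicated knowledge-recall evaluation subset within MMLU and in showing that self-consistency helps not only symbolic reasoning but also knowledge recall, which offers new guidance for prompt engineering.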
Abstract: While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.
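Self-consistency, as generally described in the literature, samples several chain-of-thought completions and keeps the most frequent final answer. The sketch below shows only the voting step; the sampled answers are hypothetical, and the first-occurrence tie-break is an assumption of this example rather than something stated in the abstract.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over the final answers extracted from sampled completions.

    Ties are broken in favour of the answer that appeared first (an assumption).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

# Hypothetical final answers from five sampled reasoning paths on one MMLU item.
samples = ["B", "C", "B", "A", "B"]
final_answer = self_consistency(samples)
```

Only the extracted answer participates in the vote, so two reasoning paths that disagree on intermediate steps but agree on the option letter still reinforce each other.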
[24] Lost in Translation: Do LVLM Judges Generalize Across Languages? cs.CLPDF
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman
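TL;DR: Addressing the weak cross-lingual generalization of large vision-language model (LVLM) judges, this paper introduces MM-JudgeBench, the first large-scale multilingual multimodal benchmark for judge models, covering over 60K pairwise preference instances in 25 languages. An evaluation of 22 LVLMs reveals substantial cross-lingual performance differences and inconsistency in existing judges.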
Details
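Motivation: Automatic LVLM evaluators such as reward models are assessed mainly on English benchmarks, so their cross-lingual generalization is unknown. The paper investigates how these evaluators perform across languages.
Result: Evaluating 22 LVLMs (15 open-source, 7 proprietary) on MM-JudgeBench reveals substantial cross-lingual performance variance. Model size and architecture are poor predictors of multilingual robustness, and even state-of-the-art LVLM judges behave inconsistently across languages.
Insight: The novelty lies in building MM-JudgeBench, the first large-scale multilingual multimodal judge benchmark, and in exposing fundamental limits of current reward models in cross-lingual generalization, underscoring the need for multilingual multimodal benchmarks when building reliable automatic evaluators.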
Abstract: Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
[25] LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues cs.CL | cs.AIPDF
Fanyu Wang, Xiaoxi Kang, Paul Burgess, Aashish Srivastava, Chetan Arora
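TL;DR: This paper proposes LePREC, a neuro-symbolic framework that combines LLM generation with structured statistical reasoning to address the insufficient precision of LLMs when assessing the relevance of legal issues.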
Details
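Motivation: Against a backdrop of limited legal resources worldwide, LLMs show insufficient precision on legal issue identification (GPT-4o reaches only 62%). The paper aims to improve both the accuracy and the interpretability of legal issue relevance judgments.
Result: On a dataset built from 769 real-world Malaysian Contract Act court cases, LePREC improves on advanced LLM baselines such as GPT-4o and Claude by 30-40%, yielding more data-efficient relevance decisions.
Insight: The novelty lies in converting legal descriptions into question-answer pairs that serve as discrete features, then learning explicit weights with sparse linear models, which balances interpretability and data efficiency. Its classification over structured factors offers a new neuro-symbolic approach for legal AI.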
Abstract: More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs’ capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, which were then annotated by senior legal experts. This reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neural component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.
[26] Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment cs.CL | cs.AI | cs.CYPDF
Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei
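TL;DR: This paper studies a human-like cognitive bias that LLM agents exhibit in multi-agent collaboration frameworks when they take on different roles (actor versus observer): Actor-Observer Asymmetry (AOA), in which an actor tends to attribute failures to external factors while an observer attributes the same errors to internal faults. To quantify and mitigate the bias, the authors propose the Ambiguous Failure Benchmark for evaluation and introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning and synthesize conflicting viewpoints into an objective consensus.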
Details
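Motivation: Multi-agent frameworks that assign specialized roles to LLM agents (such as self-reflecting actors and mutually auditing observers) improve reliability and draw on domain expert knowledge, but the authors find that such role-playing induces the human-like AOA bias, which can undermine the objectivity and reliability of agent collaboration.
Result: On the proposed Ambiguous Failure Benchmark, simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
Insight: The contributions are: 1) the first identification and quantification of role-induced AOA bias in LLM agents; 2) a dedicated Ambiguous Failure Benchmark for evaluating the bias; 3) ReTAS, which combines dialectical chain-of-thought with Group Relative Policy Optimization for dialectical alignment training, enforcing perspective-invariant reasoning so that conflicting viewpoints are synthesized into consensus. This is a novel approach to bias mitigation and agent alignment.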
Abstract: Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
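Group Relative Policy Optimization, which the abstract names as part of ReTAS training, is generally described as scoring each sampled response relative to the other responses in its group. A minimal sketch of that normalization follows, assuming scalar rewards and population standard deviation; the reward values are hypothetical, and the abstract does not specify how ReTAS defines its reward.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its sampling group.

    Implementations differ on population vs. sample standard deviation;
    population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for the same failure case.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group in which all responses score the same contributes zero advantage.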
[27] A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression cs.CLPDF
Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu
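TL;DR: This paper proposes TACO, a self-evolving, plug-and-play framework that automatically discovers and refines compression rules. It targets the redundancy and quadratic growth in token cost that arise when terminal agents retain raw environment feedback across long-horizon, multi-turn tasks, and thereby improves efficiency.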
Details
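Motivation: As model capabilities advance, research is shifting toward long-horizon, terminal-centric agentic tasks. Repeatedly retaining raw environment feedback introduces substantial redundancy and makes token cost grow quadratically with the number of steps, which hinders long-horizon reasoning. Existing observation compression methods based on heuristics or fixed prompts generalize poorly across heterogeneous terminal environments.
Result: Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (SWE-Bench Lite, CompileBench, DevEval, CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models and further improves accuracy by around 2%-3% under the same token budget.
Insight: The novelty lies in a self-evolving, task-aware compression framework that automatically discovers and refines compression rules from interaction trajectories, addressing in a generalizable way the compression challenges posed by heterogeneous terminal environments. Viewed objectively, making compression dynamic and adaptive avoids the limits of hand-designed rules and offers a scalable route to agent efficiency.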
Abstract: As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.
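The quadratic growth claimed in the abstract follows from simple arithmetic: if every step appends an observation and the whole history is re-sent on each call, the cumulative prompt size after n steps is k(1 + 2 + ... + n) = k·n(n+1)/2 for observations of k tokens each. The sketch below makes this concrete under a deliberately simplified cost model. The function name and the `keep_ratio` parameter are hypothetical stand-ins for "some fixed compression factor" and do not represent TACO's rule discovery.

```python
def cumulative_input_tokens(steps, obs_tokens, keep_ratio=1.0):
    """Total prompt tokens across an episode when each step re-sends all prior observations.

    keep_ratio is a hypothetical fraction of each observation retained after compression.
    """
    total = 0.0
    history = 0.0
    for _ in range(steps):
        history += obs_tokens * keep_ratio  # this step's observation joins the history
        total += history                    # the full history is sent again on the next call
    return total

raw_10 = cumulative_input_tokens(10, 100)        # 100 * 10 * 11 / 2
raw_20 = cumulative_input_tokens(20, 100)        # 100 * 20 * 21 / 2
compressed_20 = cumulative_input_tokens(20, 100, keep_ratio=0.5)
```

Doubling the number of steps from 10 to 20 roughly quadruples the cumulative cost in this model, while a fixed compression ratio scales the total linearly; this is why savings per observation compound over long episodes.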
[28] Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI cs.CL | cs.AI | cs.DL | cs.IRPDF
Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao
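TL;DR: This study analyzes, from a fine-grained perspective, the impact of large language models (LLMs) on peer review opinions at top AI conferences. After the emergence of LLMs, review texts became longer and more fluent, placed more emphasis on summaries and surface-level clarity, and showed more standardized linguistic patterns, while attention to deeper evaluative dimensions such as originality, replicability, and nuanced critical reasoning declined.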
Details
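Motivation: To investigate whether and how LLMs are changing the core evaluative functions of peer review, in particular their effect on the linguistic form, evaluative focus, and recommendation-related signals of review reports.
Result: After the emergence of LLMs, review texts grew longer and more fluent, emphasized summaries and surface-level clarity more, and showed more standardized linguistic patterns (especially among lower-confidence reviewers), while attention to originality, replicability, and nuanced critical reasoning decreased.
Insight: The novelty lies in systematically quantifying, at a fine-grained level (e.g., automatic sentence-level annotation of evaluation aspects), how LLMs affect the linguistic features and evaluative focus of reviews, and in revealing the risk that LLMs push reviews toward superficiality and standardization while weakening deep critical evaluation.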
Abstract: With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along aspects such as clarity and originality. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
[29] Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models cs.CL | cs.AIPDF
Kihyuk Lee
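TL;DR: This study compares the consistency of exercise prescription text generated repeatedly by three large language models, GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, at temperature 0. Each model generated prescriptions 20 times for each of six clinical scenarios, yielding 360 outputs analyzed along four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. GPT-4.1 had the highest mean semantic similarity (0.955) yet produced entirely unique texts, whereas the high similarity score of Gemini 2.5 Flash (0.950) stemmed mainly from text duplication rather than consistent reasoning. The study argues that output behavior under repeated generation should be a core criterion when assessing LLM reliability.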
Details
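Motivation: In medical applications such as exercise prescription generation, different LLMs may differ in output consistency under identical decoding settings. The study examines these differences to assess suitability for reliable deployment.
Result: Semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences (H = 458.41, p < .001). However, GPT-4.1's outputs were 100% unique while only 27.5% of Gemini 2.5 Flash's were, indicating that the latter's high similarity stems from text duplication. Safety expression reached ceiling levels for all models and therefore cannot serve as a differentiating metric.
Insight: The novelty lies in using repeated-generation experiments to reveal fundamental differences in consistency behavior across LLMs (unique generation versus text duplication), which single-output evaluation cannot capture. Viewed objectively, the study stresses that model selection for clinical use should rest on repeated-generation behavior and not only on technical metrics, adding a new dimension to LLM reliability assessment.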
Abstract: This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
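The distinction the abstract draws, between high similarity and high uniqueness, can be reproduced on toy data. The sketch below is illustrative only: it substitutes token-level Jaccard overlap for the semantic similarity measure used in the study (the abstract does not name that measure), and the prescription strings are invented.

```python
from itertools import combinations

def unique_ratio(outputs):
    """Share of outputs that are distinct strings (1.0 means no verbatim repeats)."""
    return len(set(outputs)) / len(outputs)

def jaccard(a, b):
    """Token-set overlap, used here as a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def mean_pairwise_similarity(outputs):
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical repeated generations for one scenario.
duplicated = ["walk 30 minutes daily"] * 4
paraphrased = [
    "walk 30 minutes daily",
    "walk daily for 30 minutes",
    "daily 30 minutes walk",
    "30 minutes of walk daily",
]
dup_stats = (unique_ratio(duplicated), mean_pairwise_similarity(duplicated))
par_stats = (unique_ratio(paraphrased), mean_pairwise_similarity(paraphrased))
```

The duplicated set scores a perfect similarity while having the lowest uniqueness, and the paraphrased set stays fully unique with somewhat lower similarity. A similarity score on its own therefore cannot tell verbatim repetition apart from stable content.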
[30] Pause or Fabricate? Training Language Models for Grounded Reasoning cs.CLPDF
Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma
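TL;DR: This paper proposes GRIL, an interactive reinforcement learning framework that addresses “ungrounded reasoning” by large language models when information is incomplete. The framework splits reasoning into two stages, “clarify and pause” and “grounded reasoning”, and uses stage-specific rewards to penalize hallucination so that the model can recognize information gaps and pause proactively. On the GSM8K-Insufficient and MetaMATH-Insufficient benchmarks, the method significantly improves premise detection and task success.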
Details
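Motivation: Large language models have made notable progress on complex reasoning, but when input information is incomplete they often implicitly fabricate information and produce confident yet unreliable conclusions, a failure mode the authors call ungrounded reasoning. They argue the root cause is a lack of inferential boundary awareness: the ability to recognize when the premises required for valid inference are missing.
Result: Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL improves premise detection by up to 45% and task success by 30%, while reducing average response length by over 20%. Further analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
Insight: The core innovation is explicitly decomposing reasoning into “clarify and pause” and “grounded reasoning” stages and designing stage-specific rewards within a multi-turn reinforcement learning framework to penalize hallucination directly, teaching the model to recognize insufficient information and pause proactively. This offers a novel, interaction-based and boundary-aware training paradigm for addressing hallucination in large language models.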
Abstract: Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions – a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness – the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
[31] Epistemic orientation in parliamentary discourse is associated with deliberative democracy cs.CL | cs.CYPDF
Segun Aroyehun, Stephan Lewandowsky, David Garcia
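TL;DR: This paper proposes a scalable way to measure epistemic orientation in parliamentary discourse, the Evidence-Minus-Intuition (EMI) score, based on large language model (LLM) ratings and embedding-based semantic similarity. Analyzing 15 million parliamentary speech segments from seven countries between 1946 and 2025, the study finds that EMI is positively associated with deliberative democracy and with the transparency and predictable implementation of laws.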
Details
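Motivation: The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects different epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. The paper aims to quantify this orientation and examine its relationship with deliberative democracy and the quality of governance.
Result: EMI is positively associated with deliberative democracy over time (consistently in both contemporaneous and lagged analyses) and with the transparency and predictable implementation of laws.
Insight: The novelty lies in a scalable, LLM- and embedding-based quantitative indicator (EMI) of the epistemic orientation of political discourse, and in a large-scale empirical analysis linking it to macro-level political indicators such as democratic quality and governance, providing new tools and evidence for understanding how discourse quality relates to institutions.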
Abstract: The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence–Minus–Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
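The abstract says EMI is derived from LLM ratings and embedding-based semantic similarity but gives no formula. One plausible reading of the embedding part is a difference of cosine similarities to an evidence reference vector and an intuition reference vector. The sketch below shows only that reading; the reference vectors, the 2-D toy embeddings, and the function names are all assumptions, and the paper's actual construction may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def emi_score(segment_vec, evidence_vec, intuition_vec):
    """Evidence-minus-intuition: positive means closer to the evidence reference."""
    return cosine(segment_vec, evidence_vec) - cosine(segment_vec, intuition_vec)

# Hypothetical 2-D embeddings; real embeddings would have hundreds of dimensions.
evidence_ref = [1.0, 0.0]
intuition_ref = [0.0, 1.0]
data_heavy_speech = emi_score([0.9, 0.1], evidence_ref, intuition_ref)
belief_heavy_speech = emi_score([0.1, 0.9], evidence_ref, intuition_ref)
```

A difference score of this kind is signed and centered near zero for text equally close to both references, which makes it convenient to aggregate over many speech segments per country and year.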
[32] Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views cs.CLPDF
Feihao Fang, My T. Thai, Yuanyuan Lei
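TL;DR: This paper proposes a new method that improves multi-step logical reasoning in large language models by discovering a shared internal logical subspace. It uses Canonical Correlation Analysis to align the residual activations of natural-language and symbolic-language reasoning chains, learns a low-dimensional subspace, and designs a training-free method that steers the model's reasoning along this subspace, combining complementary signals from both views.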
Details
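Motivation: Large language models still struggle with multi-step logical reasoning. Existing methods either refine the reasoning chain purely in natural language or attach a symbolic solver as an external module. The paper asks whether LLMs contain a shared internal logical subspace that aligns the natural-language and symbolic-language views of reasoning, capturing general logical reasoning ability independent of surface form.
Result: Experiments on four logical reasoning benchmarks show that the method improves accuracy by up to 11 percentage points and generalizes well to out-of-domain problems.
Insight: The novelty lies in hypothesizing and verifying a logical subspace shared across views inside LLMs, learning it with Canonical Correlation Analysis, and designing a training-free steering method that uses complementary signals from natural and symbolic language to strengthen reasoning, offering a new perspective on understanding and improving LLM logical reasoning.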
Abstract: Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers an LLM's reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well to out-of-domain problems.
cs.CV [Back]
[33] Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching cs.CVPDF
Xin Hu, Ke Qin, Wen Yin, Yuan-Fang Li, Ming Li
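TL;DR: This paper proposes FlowSG, a progressive, image-conditioned scene graph generation method based on flow matching. It recasts scene graph generation as a continuous-time transport task over a hybrid discrete-continuous state and progressively synthesizes nodes (objects) and edges (predicates) through constraint-aware joint refinement, replacing the conventional one-shot deterministic classification paradigm.
Details
Motivation: Most existing scene graph generation methods treat the task as one-shot classification rather than a genuinely progressive generative task. This paper aims to build scene graphs more flexibly and generatively within a flow-matching framework.
Result: In closed-set and open-vocabulary experiments on Visual Genome and PSG, FlowSG achieves consistent gains in predicate R/mR and graph-level metrics, averaging about 3 points above the current state-of-the-art method USG-Par.
Insight: The novelty is modeling scene graph generation as a flow-matching process over a hybrid discrete-continuous state. Combining VQ-VAE quantization with a graph Transformer, a conditional velocity field couples geometry (bounding boxes) with semantics (object features and predicate labels), which gives progressive generation and plug-and-play compatibility with standard detectors and segmenters.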
Abstract: Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
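Illustrative sketch of the continuous (geometry) part only, under stated assumptions: the abstract does not give the probability path, so this uses the common straight-line interpolation between a noise sample and the target, whose regression target is the constant velocity `x1 - x0`. The box values, the helper names, and the "oracle" velocity field standing in for a trained network are hypothetical; the discrete-token branch is not shown.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity a network would be regressed onto for the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy normalized box (x1, y1, x2, y2): start from noise, transport to the target.
noise_box = [0.9, 0.1, 0.3, 0.7]
true_box = [0.2, 0.2, 0.6, 0.8]


def oracle(x, t):
    # Stand-in for a trained velocity network: returns the exact target velocity.
    return target_velocity(noise_box, true_box)


final_box = euler_transport(noise_box, oracle, steps=4)
print(final_box)
```

With an exact velocity field, a handful of Euler steps already lands on the target box, which is the intuition behind the few-step inference mentioned in the abstract.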
[34] Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses cs.CV | cs.ROPDF
Maximilian Haug, Christian Stippel, Lukas Pscherer, Benjamin Schwendinger, Ralph Hoch
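TL;DR: This paper proposes a real-time vision-based method that uses a single RGB camera to estimate whether humans are aware of an autonomous mobile robot (AMR), with the goal of improving AMR safety and operational efficiency in industrial warehouses. The method combines state-of-the-art 3D human pose lifting with head orientation estimation to determine a person's position relative to the AMR and their field of view, and from these whether the person is aware of the AMR. The full pipeline is validated on synthetic data in the NVIDIA Isaac Sim simulation environment.
Details
Motivation: Current methods usually treat humans as generic dynamic obstacles, which leads to conservative AMR behavior (slowing down or detouring) even when workers are fully aware of the robot and can share the space safely. This paper estimates human awareness of the AMR so that the robot can adapt its motion accordingly, addressing the trade-off between safety and efficiency.
Result: Experiments indicate that the system reliably detects human position and attention in real time, allowing the AMR to adapt its motion safely based on human awareness. Validation was carried out on synthetic data in NVIDIA Isaac Sim; no quantitative comparison with existing methods or specific benchmark results is reported.
Insight: The novelty is combining 3D human pose lifting with head orientation estimation to infer human awareness of the AMR instead of simply treating people as obstacles. This gives the AMR a finer-grained understanding of its environment and could improve safety and efficiency together in industrial automation. Viewed objectively, the method uses monocular vision for real-time awareness estimation and has the potential for low deployment cost, but its robustness in complex real-world environments needs further validation.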
Abstract: Ensuring human safety is of paramount importance in warehouse environments that feature mixed traffic of human workers and autonomous mobile robots (AMRs). Current approaches often treat humans as generic dynamic obstacles, leading to conservative AMR behaviors like slowing down or detouring, even when workers are fully aware and capable of safely sharing space. This paper presents a real-time vision-based method to estimate human awareness of an AMR using a single RGB camera. We integrate state-of-the-art 3D human pose lifting with head orientation estimation to ascertain a human’s position relative to the AMR and their viewing cone, thereby determining if the human is aware of the AMR. The entire pipeline is validated using synthetically generated data within NVIDIA Isaac Sim, a robust physics-accurate robotics simulation environment. Experimental results confirm that our system reliably detects human positions and their attention in real time, enabling AMRs to safely adapt their motion based on human awareness. This enhancement is crucial for improving both safety and operational efficiency in industrial and factory automation settings.
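Illustrative sketch of a viewing-cone test, under stated assumptions: the abstract does not specify how the cone is defined, so this uses a 2D top-down simplification in which a person counts as "aware" when the robot's bearing lies within a fixed half-angle of the head yaw. The function name `is_aware` and the 60-degree default are hypothetical choices, not values from the paper.

```python
import math


def is_aware(human_xy, head_yaw_rad, robot_xy, half_angle_deg=60.0):
    """Return True if the robot lies inside the human's viewing cone.

    human_xy, robot_xy: (x, y) positions in a shared ground-plane frame.
    head_yaw_rad: head direction as an angle from the +x axis.
    half_angle_deg: assumed half-width of the viewing cone.
    """
    dx = robot_xy[0] - human_xy[0]
    dy = robot_xy[1] - human_xy[1]
    if dx == 0.0 and dy == 0.0:
        return True  # Degenerate case: same position.
    bearing = math.atan2(dy, dx)
    # Wrap the angular difference into [-pi, pi).
    diff = (bearing - head_yaw_rad + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= math.radians(half_angle_deg)


# Person at the origin looking along +x.
print(is_aware((0.0, 0.0), 0.0, (2.0, 0.5)))   # True: robot slightly off-axis ahead
print(is_aware((0.0, 0.0), 0.0, (-2.0, 0.0)))  # False: robot directly behind
```

A planner could then relax its speed limits only when `is_aware` holds for every nearby worker; a 3D version would additionally use head pitch and an occlusion check.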
[35] Align then Refine: Text-Guided 3D Prostate Lesion Segmentation cs.CVPDF
Cuiling Sun, Linkai Peng, Adam Murphy, Elif Keles, Hiten D. Patel
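TL;DR: This paper proposes "Align then Refine", a new multi-encoder U-Net architecture for automated 3D segmentation of prostate lesions from biparametric MRI. By introducing an alignment loss, a heatmap loss, and a confidence-gated multi-head cross-attention refiner, together with a phase-scheduled training strategy, the method integrates multimodal information, provides localized text guidance, and improves segmentation accuracy.
Details
Motivation: Current methods for automated 3D segmentation of prostate lesions from biparametric MRI struggle to integrate multimodal information and ensure anatomical consistency, while existing vision-language models lack fine-grained, lesion-level semantic guidance, which limits segmentation accuracy.
Result: The method consistently outperforms prior approaches on the PI-CAI dataset, reaching a new state of the art through enhanced multimodal fusion and localized text guidance.
Insight: The novelties are: 1) an alignment loss that strengthens foreground text-image similarity to inject lesion semantics; 2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and 3) a confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. Together these designs combine text-guided local semantics with multimodal data fusion.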
Abstract: Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at https://github.com/NUBagciLab/Prostate-Lesion-Segmentation.
[36] Colour Extraction Pipeline for Odonates using Computer Vision cs.CVPDF
Megan Mirnalini Sundaram Rajaraman, Fons J. Verbeek, Vincent J. Kalkman, Rita Pucci
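TL;DR: This paper proposes a computer-vision pipeline for extracting the colours of Odonates. It uses deep neural networks to identify and segment the body parts of dragonflies and damselflies (head, thorax, abdomen, wings) and extracts colour information for each part from open-source images on citizen science platforms, supporting large-scale statistical analysis of ecological correlations.
Details
Motivation: Research on correlations between insect morphological traits and climate is held back by time-consuming, costly annotation and by open-source datasets that lack morphological trait annotations. The paper aims to automate the extraction of Odonate body-part colours to facilitate ecological research.
Result: Trained on a limited annotated dataset and refined with pseudo-supervised data, the method segments visible Odonate individuals in open-source images and extracts a colour palette for each body part, providing a basis for large-scale statistical analysis.
Insight: The novelty is combining deep neural networks with citizen-science images to build an end-to-end colour extraction pipeline, using pseudo-supervision to mitigate the shortage of annotated data. The approach could transfer to other automated analyses of biological morphological traits.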
Abstract: The correlation between insect morphological traits and climate has been documented in physiological studies, but such studies remain limited by the time-consuming nature of the data analysis. In particular, open-source datasets often lack annotations of species’ morphological traits, making dedicated annotation campaigns necessary; these efforts are typically local in scale and costly. In this paper, we propose a pipeline to identify and segment body parts of Odonates (dragonflies and damselflies) using deep neural networks, with the ultimate goal of extracting body parts’ colouration. The pipeline is trained on a limited annotated dataset and refined with pseudo-supervised data. We show that, by using open-source images from citizen science platforms, our approach can segment each visible subject (Odonates) into head, thorax, abdomen, and wings and then extract a colour palette for each body part. This will enable large-scale statistical analysis of ecological correlations (e.g., between colouration and climate change, habitat loss, or geolocation), which are crucial for quantifying and assessing ecosystem biodiversity status.
[37] Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control cs.CVPDF
Jay Jung, Ahmad Arrabi, Jax Luo, Scott Raymond, Safwan Wshah
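TL;DR: This study explores autonomous skeletal landmark localization with fine-tuned multimodal large language models (MLLMs) in support of agentic C-arm control. Quantitative evaluation on synthetic and real X-ray datasets shows performance comparable to a leading deep learning approach, and qualitative experiments show that the models exhibit reasoning and spatial awareness: they can correct wrong predictions and guide the C-arm step by step to a target position.
Details
Motivation: Automated C-arm positioning is critical for emergent interventions. When a conventional deep learning approach fails, clinicians must operate manually, which delays treatment. An MLLM-based agentic control framework that incorporates clinician feedback and uses reasoning to make adjustments is therefore desirable for more accurate positioning, and skeletal landmark localization is a key step toward it.
Result: On both datasets (synthetic and real X-ray), the fine-tuned MLLMs are competitive with a leading deep learning (DL) approach on all localization tasks. Qualitative experiments further show that MLLMs can correct an initially incorrect prediction through reasoning and have the spatial awareness to guide the C-arm through sequential navigation.
Insight: The novelty is applying MLLMs to skeletal landmark localization in medical imaging and exploring their potential as agents for C-arm control. The core advantage is that the model can reason and interact, so it can handle cases where conventional DL methods may fail, incorporating feedback and adjusting step by step for more robust, autonomous operation. This offers a new direction for medical robot control.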
Abstract: Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at https://github.com/marszzibros/C-arm-localization-LLMs.git
[38] Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras cs.CVPDF
Ruijun Zhang, Hang Su, Kostas Daniilidis, Ziyun Wang
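TL;DR: This paper proposes Match-Any-Events, a zero-shot feature matching model for event cameras that achieves cross-dataset, wide-baseline, motion-robust matching without target-domain fine-tuning. Its core innovations are a motion-robust, computationally efficient multi-timescale attention backbone and an event motion synthesis framework for generating large-scale wide-baseline supervision.
Details
Motivation: Event cameras have advantages in low light and fast motion, but existing methods struggle to compute wide-baseline correspondences between arbitrary views, mainly because event appearance changes substantially with motion and learning-based methods are limited by scalability and scarce wide-baseline supervision.
Result: Extensive experiments on multiple benchmarks show a 37.7% improvement over the previous best event feature matching methods.
Insight: The novelty is the first zero-shot, cross-dataset wide-baseline event matching model. The key is an efficient backbone built from multi-timescale feature learning and sparsity-aware event token selection, plus a synthetic data framework that addresses the scarcity of supervision, which together yield strong generalization.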
Abstract: Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: https://github.com/spikelab-jhu/Match-Any-Events.
[39] DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation cs.CVPDF
Enrique Hernandez Noguera, Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi
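TL;DR: This paper proposes DeltaSeg, a U-shaped encoder-decoder architecture for multi-class structural defect segmentation. It uses a tiered attention strategy that integrates SE channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention mechanism in the skip connections. The model also uses depthwise separable dilated convolutions, an ASPP module, and deep supervision. Experiments on two structural defect datasets show that DeltaSeg outperforms 12 competing architectures, including U-Net and SegFormer.
Details
Motivation: To address the challenges of automatically segmenting structural defects from visual inspection imagery: diverse damage types, extreme class imbalance, and the need for precise boundary delineation.
Result: On two benchmarks, the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (9 classes), DeltaSeg consistently outperforms 12 competing architectures, including U-Net, SA-UNet, UNet3+, SegFormer, and Swin-UNet, showing strong generalization across damage types, imaging conditions, and structural geometries.
Insight: The main novelties are the tiered attention strategy and the Deep Delta Attention (DDA) mechanism. The DDA module refines skip connections by combining a learned delta operator that suppresses nuisance features with spatial attention gates conditioned on decoder signals, which is a novel design. Deep supervision and depthwise separable dilated convolutions are also effective technical elements.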
Abstract: Automated segmentation of structural defects from visual inspection imagery remains challenging due to the diversity of damage types, extreme class imbalance, and the need for precise boundary delineation. This paper presents DeltaSeg, a U-shaped encoder-decoder architecture with a tiered attention strategy that integrates Squeeze-and-Excitation (SE) channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention (DDA) mechanism in the skip connections. The encoder uses depthwise separable convolutions with dilated stages to maintain spatial resolution while expanding the receptive field. Atrous Spatial Pyramid Pooling (ASPP) at the bottleneck captures multi-scale context. The DDA module refines skip connections through a dual-path scheme combining a learned delta operator for nuisance feature suppression with spatial attention gates conditioned on decoder signals. Deep supervision through multi-scale auxiliary heads further strengthens gradient flow and encourages semantically meaningful features at intermediate decoder stages. We evaluate DeltaSeg on two datasets: the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (CSDD, 9 classes). Across both benchmarks, DeltaSeg consistently outperforms 12 competing architectures including U-Net, SA-UNet, UNet3+, SegFormer, Swin-UNet, EGE-UNet, FPN, and Mobile-UNETR, demonstrating strong generalization across damage types, imaging conditions, and structural geometries.
[40] URoPE: Universal Relative Position Embedding across Geometric Spaces cs.CVPDF
Yichen Xie, Depu Meng, Chensheng Peng, Yihan Hu, Quentin Herau
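TL;DR: This paper proposes URoPE, a universal relative position embedding that extends Rotary Position Embedding (RoPE) to cross-view and cross-dimensional geometric spaces. The method samples 3D points along camera rays and projects them into the query image plane, enabling geometric reasoning in 2D-2D, 2D-3D, and temporal tasks.
Details
Motivation: Existing relative position embeddings are usually limited to a fixed geometric space (such as 1D sequences or regular 2D/3D grids), which restricts their use in computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces.
Result: Experiments show that URoPE, used as a plug-and-play positional encoding, consistently improves Transformer-based models on a range of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation.
Insight: The novelty of URoPE is a universal relative position embedding that is parameter-free, intrinsics-aware, and invariant to the choice of global coordinate system, while remaining fully compatible with existing RoPE-optimized attention kernels. It strengthens Transformer reasoning in tasks that span geometric spaces.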
Abstract: Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: https://urope-pe.github.io/.
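Illustrative sketch of the geometric step described in the abstract (sample points along a key patch's camera ray at depth anchors, then project them into the query image), under stated assumptions: a pinhole camera with intrinsics `(fx, fy, cx, cy)`, anchors given as z-depths rather than distances along the ray, and a rigid transform from the key camera frame to the query camera frame. All function names and numbers are hypothetical; the subsequent 2D RoPE step is not shown.

```python
def unproject(pixel, depth, intr):
    """Lift a pixel to a 3D point at the given z-depth in its camera frame."""
    fx, fy, cx, cy = intr
    u, v = pixel
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def transform(point, rotation, translation):
    """Apply x' = R x + t, with R as a 3x3 nested list and t as a 3-vector."""
    return [sum(r * p for r, p in zip(row, point)) + t
            for row, t in zip(rotation, translation)]


def project(point, intr):
    """Pinhole projection of a 3D point (z > 0) to pixel coordinates."""
    fx, fy, cx, cy = intr
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def projected_anchors(pixel, depths, key_intr, query_intr, rotation, translation):
    """Project depth-anchor samples of a key pixel's ray into the query image."""
    coords = []
    for d in depths:
        p_query = transform(unproject(pixel, d, key_intr), rotation, translation)
        if p_query[2] > 0:  # Keep only points in front of the query camera.
            coords.append(project(p_query, query_intr))
    return coords


intr = (100.0, 100.0, 50.0, 50.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Query camera sits one unit to the right of the key camera.
coords = projected_anchors((50.0, 50.0), [1.0, 2.0, 4.0], intr, intr,
                           identity, [-1.0, 0.0, 0.0])
print(coords)  # [(-50.0, 50.0), (0.0, 50.0), (25.0, 50.0)]
```

The three anchors of one key patch fall on a line in the query image (its epipolar line), and these projected pixel coordinates are what a standard 2D RoPE would then consume. Coordinates outside the image bounds, such as the first one here, are kept in this sketch; how the method treats them is not stated in the abstract.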
[41] REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction cs.CV | cs.AIPDF
Seowung Leem, Lin Gu, Chenyu You, Kuang Gong, Ruogu Fang
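TL;DR: This paper proposes REVEAL, a framework that performs multimodal alignment between color fundus photographs and individualized disease risk profiles to predict incident Alzheimer's disease and dementia, on average 8 years before diagnosis. The framework converts structured risk factors into clinically interpretable narrative text and uses a group-aware contrastive learning strategy to strengthen multimodal alignment.
Details
Motivation: Existing retinal analysis frameworks usually model imaging and risk factors separately, so they cannot capture the joint multimodal patterns that matter for early risk prediction. They also lack mechanisms to organize or align patients with similar retinal and clinical characteristics, which limits the learning of cross-modal associations.
Result: REVEAL substantially outperforms combinations of state-of-the-art retinal imaging models with clinical text encoders, as well as general-purpose vision-language models, on the prediction task, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors.
Insight: The novelty is converting structured risk factors into clinical narratives compatible with pretrained vision-language models, and a group-aware contrastive learning strategy that clusters similar patients as positive pairs to strengthen multimodal alignment. The result is a generalizable, noninvasive approach to early AD and dementia risk stratification.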
Abstract: The retina provides a unique, noninvasive window into Alzheimer’s disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer’s Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.
[42] EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion cs.CVPDF
Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Anton Netchaev, Steven Sloan
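TL;DR: This paper proposes EfficientPENet, a lightweight two-branch network for depth completion from sparse LiDAR and RGB images, designed to run in real time on embedded hardware. It replaces the conventional ResNet with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions in the depth branch, refines predictions with a Convolutional Spatial Propagation Network (CSPN), and uses a position-aware test-time augmentation scheme to improve inference consistency.
Details
Motivation: Existing depth completion methods are accurate on standard benchmarks but depend on heavy backbones that cannot be deployed in real time on embedded hardware. This paper designs a lightweight, efficient network that keeps accuracy while reaching real-time performance, to meet the practical needs of robotic systems on resource-constrained edge platforms.
Result: On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters, a latency of 20.51 ms, and a throughput of 48.76 FPS. Relative to BP-Net, this is a 3.7x reduction in parameters and a 23x speedup, with competitive accuracy.
Insight: The novelties include: a lightweight ConvNeXt backbone that benefits from pretraining; sparsity-invariant convolutions in the depth branch to better handle sparse inputs; prediction refinement with a Convolutional Spatial Propagation Network; and position-aware test-time augmentation that improves inference consistency by correcting coordinate tensors. Viewed objectively, the way the lightweight multimodal fusion architecture balances accuracy and real-time performance is a useful reference for edge computing applications.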
Abstract: Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.
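As a quick consistency check on the reported numbers, the snippet below converts the stated latency to frames per second, and sketches how an RMSE in millimetres is typically computed only over pixels that have ground-truth depth. The masking convention (ground truth greater than zero marks a valid pixel) and the toy depth values are assumptions for illustration, not details taken from the paper.

```python
import math


def fps_from_latency(latency_ms):
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def masked_rmse(pred_mm, gt_mm):
    """RMSE over pixels with ground truth (assumed convention: gt > 0 is valid)."""
    squared = [(p - g) ** 2 for p, g in zip(pred_mm, gt_mm) if g > 0]
    return math.sqrt(sum(squared) / len(squared))


print(round(fps_from_latency(20.51), 2))  # 48.76

# Toy flattened depth maps in millimetres; the last pixel has no ground truth.
pred = [1000.0, 2100.0, 500.0]
gt = [1000.0, 2000.0, 0.0]
print(masked_rmse(pred, gt))
```

The first line reproduces the 48.76 FPS figure from the 20.51 ms latency, so the two reported numbers are mutually consistent.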
[43] LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models cs.CV | cs.AIPDF
Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li
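TL;DR: This paper proposes Ghost-100, a benchmark for evaluating hallucination in vision-language models under progressively coercive prompts. The benchmark contains 800 synthetic images covering three task families, and each image is paired with five prompts of different intensity. Evaluation uses two metrics, a rule-based H-Rate and an H-Score judged by GPT-4o-mini. Tests on nine open-weight VLMs show that model families and task types respond to prompt pressure in markedly different ways.
Details
Motivation: Existing hallucination benchmarks rely mainly on neutral prompts and binary detection, and do not adequately characterize how VLM behavior changes under progressively coercive prompts, in particular how the incidence and intensity of hallucination vary with linguistic pressure. This paper fills that gap by studying VLM sensitivity to prompt tone across task types.
Result: Evaluating nine open-weight VLMs on Ghost-100 shows that H-Rate and H-Score dissociate substantially across model families; text-legibility tasks and object-presence detection tasks respond to prompt pressure differently; and some models show non-monotonic sensitivity that peaks at intermediate tone levels.
Insight: The novelties include: 1) Ghost-100, a structured benchmark built on a negative-ground-truth principle, which studies hallucination systematically by controlling prompt tone as the only variable; 2) a dual-track evaluation protocol (H-Rate and H-Score) and an automated validation workflow; and 3) evidence of complex patterns in model sensitivity, such as non-monotonic responses, that aggregate metrics can hide.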
Abstract: Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families – text-illegibility, time-reading, and object-absence – each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels – patterns that aggregate metrics obscure.
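Illustrative sketch of the dual-track aggregation, under stated assumptions: each response is assumed to have already been labeled as fabricated or not by some rule, and fabricated responses carry a 1-5 judge score. H-Rate is computed as the fraction of fabricated responses per prompt level, and H-Score is averaged only over fabricated responses, which is one reading of "once it occurs" in the abstract. The record fields, the function name `summarize`, and the toy data are hypothetical.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate H-Rate and mean H-Score per prompt intensity level.

    records: iterable of dicts with keys
      'level'        prompt intensity level (1-5),
      'hallucinated' True if the response commits to an unsupported answer,
      'score'        judge score (1-5) for fabricated responses, else None.
    Returns {level: (h_rate, mean_h_score_or_None)}.
    """
    by_level = defaultdict(list)
    for r in records:
        by_level[r["level"]].append(r)

    summary = {}
    for level, rows in sorted(by_level.items()):
        fabricated = [r for r in rows if r["hallucinated"]]
        h_rate = len(fabricated) / len(rows)
        h_score = (sum(r["score"] for r in fabricated) / len(fabricated)
                   if fabricated else None)
        summary[level] = (h_rate, h_score)
    return summary


# Toy data, invented for illustration only.
records = [
    {"level": 1, "hallucinated": False, "score": None},
    {"level": 1, "hallucinated": False, "score": None},
    {"level": 3, "hallucinated": True, "score": 4},
    {"level": 3, "hallucinated": False, "score": None},
    {"level": 5, "hallucinated": True, "score": 2},
    {"level": 5, "hallucinated": True, "score": 3},
]
summary = summarize(records)
print(summary)  # {1: (0.0, None), 3: (0.5, 4.0), 5: (1.0, 2.5)}
```

In this toy data the rate rises with prompt level while the mean score is highest at the middle level, which shows why reporting the two metrics separately can reveal patterns that a single aggregate would hide.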
[44] DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning cs.CVPDF
Abrar Majeedi, Zhiyuan Ruan, Ziyi Zhao, Hongcheng Wang, Jianglin Lu
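TL;DR: This paper proposes DUALVISION, a lightweight fusion module for multimodal large language models that integrates infrared and RGB image information through localized cross-attention to improve robustness under visual degradation. The authors also build DV-204K, a dataset of aligned image pairs with QA annotations, and DV-500, a benchmark for evaluating cross-modal reasoning.
Details
Motivation: Existing multimodal large language models perform well on RGB images but are fragile under common degradations such as fog, blur, and low light. Infrared imaging is an effective complement to RGB and is inherently robust in these conditions, but its integration into MLLMs has not been explored enough.
Result: Evaluated on the authors' DV-500 benchmark, DUALVISION shows strong empirical performance across a wide range of visual degradations.
Insight: The main novelties are a lightweight infrared-RGB fusion module based on localized cross-attention, and the first large-scale aligned infrared-RGB multimodal QA dataset and benchmark, which provide a new approach and data support for making MLLMs more robust in adverse visual conditions.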
Abstract: Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at https://abrarmajeedi.github.io/dualvision.
[45] Multi-Domain Learning with Global Expert Mapping cs.CVPDF
Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Oscar Mendez, Dacheng Tao
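TL;DR: This paper proposes GEM (Global Expert Mapping), a planner-compiler framework that improves Mixture-of-Experts (MoE) models for multi-dataset learning. Using linear programming relaxation and hierarchical rounding, GEM assigns datasets to experts through global scheduling, resolving the conflict between load balancing and domain specialization in conventional MoEs and improving robustness under domain shift.
Details
Motivation: Human perception generalizes well across domains, but most vision models perform poorly outside their training data. Multi-dataset learning aims to improve robustness by training a single model on diverse datasets, but inconsistencies in data distributions and label semantics make this difficult. In existing MoE models, load-balancing mechanisms force experts to process inputs uniformly, so experts learn redundant representations and performance drops, especially on rare or out-of-distribution domains.
Result: On the UODB benchmark, GEM-DINO achieves state-of-the-art performance, with notable gains on underrepresented datasets, and it resolves task interference in few-shot adaptation scenarios.
Insight: The novelty is replacing the learned router with a global scheduler. The planner-compiler framework produces a deterministic mapping from datasets to experts, avoids a balancing loss, resolves the conflict between fairness and specialization, and yields interpretable routing. This offers a scalable and efficient solution for multi-domain learning.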
Abstract: Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations, and reducing performance especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets and solves task interference in few-shot adaptation scenarios.
[46] ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification cs.CVPDF
Mohammed Q. Alkhatib
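TL;DR: This paper proposes ConvVitMamba, an efficient unified hybrid framework for hyperspectral image classification. The framework integrates a multiscale convolutional feature extractor, a Vision Transformer encoder, and a lightweight Mamba-inspired gated sequence mixing module, which together capture local spectral-spatial patterns, model global contextual relationships, and perform efficient content-aware refinement. With PCA preprocessing for dimensionality reduction and validation on several benchmark datasets, the method strikes a good balance among accuracy, model size, and inference efficiency.
Details
Motivation: To address the challenges that high spectral dimensionality, data redundancy, and limited labeled data pose for hyperspectral image classification, and to overcome the high computational cost and large model size of existing CNN and ViT methods, providing a more practical solution.
Result: Experiments on four benchmark datasets, Houston and three UAV-borne QUH datasets (Pingan, Qingyun, Tangdaowan), show that ConvVitMamba consistently outperforms CNN-, Transformer-, and Mamba-based methods in accuracy while keeping a favorable balance between model size and inference efficiency. Ablation studies confirm the complementary contributions of all components.
Insight: The novelty is efficiently integrating multiscale convolution, a Transformer, and Mamba-style sequence modelling in one framework: convolution captures local patterns, the Transformer models global relationships, and the Mamba module performs lightweight content-aware refinement without the cost of quadratic self-attention. This gives a hybrid architecture for hyperspectral image classification that balances performance and efficiency.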
Abstract: Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer-based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV-borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN-, Transformer-, and Mamba-based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at https://github.com/mqalkhatib/ConvVitMamba
[47] Hierarchically Robust Zero-shot Vision-language Models cs.CV | cs.AI | cs.LGPDF
Junhao Dong, Yifei Zhang, Hao Zhu, Yew-Soon Ong, Piotr Koniusz
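TL;DR: This paper proposes a new adversarial fine-tuning framework based on hierarchical embeddings, aimed at improving the adversarial robustness of zero-shot classification in vision-language models (VLMs). The framework exploits the inherent hierarchical structure of the class space, aligns the image and text modalities adversarially at multiple levels, introduces mechanisms that place visual embeddings at the desired depth of the hierarchy, and considers alignment over multiple trees to increase semantic variety.
Details
Motivation: Existing methods sacrifice natural performance and robustness when aligning fixed text embeddings with image embeddings, and robustness degrades when the model faces adversarial attacks that target superclasses (such as mammal) in addition to their base classes (such as cat). This paper therefore aims to improve adversarial robustness by exploiting the class hierarchy.
Result: Experiments on several datasets indicate that the method improves adversarial robustness and naturally realizes several margin sizes, which boosts the generalization of the adversarial examples used for robustification.
Insight: The novelties include the adversarial fine-tuning framework based on hierarchical embeddings, a theoretical connection between the depth of a visual embedding in the hierarchy and the maximum viable margin size, and a multi-tree alignment mechanism that increases semantic variety. Together they offer a new approach to zero-shot robustness in VLMs.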
Abstract: Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.
[48] Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images cs.CV
Abdul Mueez, Shruti Vyas
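TL;DR: The paper proposes an automated pipeline for dense instance segmentation and grain size estimation from metallographic microscopy images. The pipeline adapts the Cellpose-SAM model to microstructures and combines its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module, bridging foundational computer vision models and practical metallurgical evaluation standards.
Details
Motivation: Extracting standardized metallurgical metrics from microscopy images is difficult because grain morphology is complex and supervised segmentation is data-hungry. The paper aims to combine foundational computer vision models with practical metallurgical evaluation standards to achieve automated, highly accurate materials characterization.
Result: In experiments with progressively reduced training splits, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50% using only two training samples. Benchmarks against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM), and a contemporary vision-language model (Qwen2.5-VL-7B) show that the proposed pipeline maintains topological separation, and robustness testing empirically validates the ASTM 50-grain sampling minimum.
Insight: The novelty lies in application-level integration of a general-purpose foundation model (Cellpose-SAM) with a domain-specific standard (ASTM E112). By fusing topology-aware gradient tracking with a Jeffries planimetric module, the pipeline delivers highly accurate, automated grain size estimation from very few training samples, demonstrating that foundation models can be effectively adapted and extended to specialized materials science tasks.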
Abstract: Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator.
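As a rough illustration of the final step of such a pipeline, the sketch below converts grain counts into an ASTM grain size number using the Jeffries planimetric relations from ASTM E112, and computes MAPE. It assumes the segmentation stage has already produced counts of grains fully inside and intercepted by the test area; the function names and the example counts are hypothetical and are not taken from the paper's repository.

```python
import math

def jeffries_grains_per_mm2(n_inside, n_intercepted, magnification, test_area_mm2=5000.0):
    """Jeffries planimetric count: N_A = f * (N_inside + N_intercepted / 2), with f = M^2 / A."""
    f = magnification ** 2 / test_area_mm2
    return f * (n_inside + n_intercepted / 2.0)

def astm_grain_size_number(n_a):
    """ASTM E112 relation between G and the number of grains per mm^2 at 1x."""
    return 3.321928 * math.log10(n_a) - 2.954

def mape(predicted, reference):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - r) / abs(r) for p, r in zip(predicted, reference)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical counts: 41 whole grains plus 18 boundary grains give 50 effective grains.
n_a = jeffries_grains_per_mm2(n_inside=41, n_intercepted=18, magnification=100)
g = astm_grain_size_number(n_a)
print(round(n_a, 1), round(g, 2))
```

Because G depends on the logarithm of the grain density, moderate counting errors translate into small errors in G, which may partly explain why a low MAPE is attainable with few training samples.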
[49] Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2 cs.CV
Aaron Nicolson, Elizabeth J. Cooper, Hwan-Jin Yoon, Claire McCafferty, Ramya Krishnan
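TL;DR: This paper introduces CXRMate-2, a state-of-the-art model for chest X-ray (CXR) radiology report generation (RRG). By combining structured multimodal conditioning with reinforcement learning driven by a composite reward, the model achieves high semantic alignment with radiologist reports. On the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets it significantly outperforms existing benchmarks. In a blinded, randomised qualitative retrospective evaluation, generated reports were judged acceptable by radiologists (preferred or rated equal to the radiologist report) in 45% of ratings, indicating potential for clinical use.
Details
Motivation: Although CXR radiology report generation models have progressed rapidly, their clinical utility remains uncertain, mainly because of the lack of in-depth evaluation by radiologists. This study aims to develop a report generation model that is acceptable in clinical settings and to evaluate it directly against radiologist reports.
Result: On MIMIC-CXR, CXRMate-2 improves on MedGemma 1.5 (4B) by 11.2% in GREEN and 24.4% in RadGraph-XL. In the qualitative evaluation, 45% of ratings deemed generated reports acceptable, and seven of the eight analysed findings showed no statistically significant difference in preference rates. Radiologist reports were favoured for their higher recall, while generated reports were often preferred for readability.
Insight: The novelty is the combination of structured multimodal conditioning with composite-reward reinforcement learning to optimise the semantic alignment of reports. Viewed objectively, the key contribution is a blinded, randomised qualitative evaluation that compares generated reports directly with radiologist reports; it offers a new paradigm for assessing clinical acceptability and identifies improved recall and better detection of subtle findings (e.g., pulmonary congestion) as the key path to clinical non-inferiority.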
Abstract: Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.
[50] A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation cs.CV
Liping Wang, Cheng Ye, Weidong Chen, Peipei Song, Bo Hu
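TL;DR: This paper proposes a multi-agent framework for multimodal empathetic response generation that improves empathy through structured reasoning and reflective refinement. A structured empathetic reasoning module first decomposes response generation; a global reflection and refinement module then performs step-wise auditing and targeted regeneration, so that emotion perception accuracy improves and emotional bias is removed over successive iterations.
Details
Motivation: Existing methods usually rely on an implicit one-pass generation paradigm from multimodal context to final response. This overlooks two intrinsic characteristics of multimodal empathetic response generation: the structured, hierarchical nature of emotional cue perception, and the significant emotional biases caused by the complexity and ambiguity of emotions. The result is distorted emotional judgments and suboptimal empathy.
Result: Experiments on several benchmarks, including IEMOCAP and MELD, show that the model generates empathetic responses better than state-of-the-art methods.
Insight: The novelty lies in a structured empathetic reasoning-to-generation module that explicitly decomposes response generation, and a global reflection and refinement module that iteratively audits and regenerates. Together they form a closed-loop framework that progressively improves emotion perception and removes bias, giving multimodal emotion understanding tasks a clearer and more interpretable intermediate path.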
Abstract: Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users’ multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.
[51] AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos cs.CV | cs.AI | cs.MM
Jiagao Hu, Daiguo Zhou, Danzhen Fu, Fuhao Li, Zepeng Wang
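TL;DR: This paper presents AutoAWG, a controllable adverse weather generation framework for autonomous driving videos. It uses semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets, a vanishing point-anchored temporal synthesis strategy to build training sequences from static images and reduce reliance on synthetic data, and masked training to improve the stability of long-horizon generation.
Details
Motivation: The paper addresses the core bottleneck for perception robustness in adverse weather for autonomous driving, namely the scarcity of real-world adverse-weather video data, together with the difficulty existing weather generation methods have in balancing visual quality and annotation reusability.
Result: On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Qualitative and quantitative results show advantages in style fidelity, temporal consistency, and semantic-structural integrity.
Insight: The novel elements are: 1) a semantics-guided adaptive multi-control fusion mechanism that balances stylization and fidelity; 2) a vanishing point-anchored temporal synthesis strategy that generates training sequences from static images, lowering data dependence; 3) masked training that improves long-sequence generation stability. These offer new ideas for controllable video generation and autonomous driving data augmentation.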
Abstract: Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic–structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: https://github.com/higherhu/AutoAWG
[52] Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents cs.CV
Xu Chen, Shichao Xie, Zhining Gu, Lu Jia, Minghua Luo
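TL;DR: This paper proposes ABot-Explorer, a novel active exploration framework that unifies spatial memory construction and exploration into a single online, RGB-only process. The framework uses large vision-language models to extract Semantic Navigational Affordances (SNA) as cognitively aligned anchors and dynamically integrates them into a hierarchical SG-Memo, mimicking the human logic of prioritizing structural transit nodes to improve exploration efficiency and coverage.
Details
Motivation: Current spatial memory construction relies on a decoupled two-stage paradigm (explore first, reconstruct offline afterwards) that is post-hoc and geometry-centric. Agents therefore cannot exploit high-level semantic intelligence and often overlook navigational landmarks such as doorways and staircases that are essential to human cognitive maps. The paper aims to bridge this gap.
Result: Experiments show that ABot-Explorer significantly outperforms current state-of-the-art methods in exploration efficiency and environment coverage, and that the resulting SG-Memo effectively supports diverse downstream tasks.
Insight: The main novelty is unifying memory construction and exploration online and using VLM-extracted semantic navigational affordances as cognitive anchors to guide exploration, mimicking human exploratory logic. This offers a new, more human-like paradigm for active exploration and spatial memory construction in embodied agents.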
Abstract: Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent’s movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.
[53] Generative Texture Filtering cs.CV
Rongjia Zheng, Shangwei Huang, Lei Zhu, Wei-Shi Zheng, Qing Zhang
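TL;DR: This paper proposes a generative texture filtering method that exploits the strong image prior of pre-trained generative models. A two-stage fine-tuning strategy (supervised fine-tuning followed by reward-guided reinforcement fine-tuning) yields strong performance and generalizability in texture removal and structure preservation.
Details
Motivation: The paper addresses the weak performance of traditional texture filtering methods in complex scenes by using the prior knowledge of pre-trained generative models to improve texture removal.
Result: In extensive experiments the method clearly outperforms previous methods and handles previously challenging cases effectively, although no specific benchmarks or SOTA comparisons are mentioned.
Insight: The novelty lies in combining a pre-trained generative model with two-stage fine-tuning (supervised plus reinforcement learning), using a reward function to quantify texture removal quality, which provides a new generative solution for image processing tasks.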
Abstract: We present a generative method for texture filtering, which exhibits surprisingly good performance and generalizability. Our core idea is to empower texture filtering by taking full advantage of the strong learned image prior of pre-trained generative models. To this end, we propose to fine-tune a pre-trained generative model via a two-stage strategy. Specifically, we first conduct supervised fine-tuning on a very small set of paired images, and then perform reinforcement fine-tuning on a large-scale unlabeled dataset under the guidance of a reward function that quantifies the quality of texture removal and structure preservation. Extensive experiments show that our method clearly outperforms previous methods and handles previously challenging cases effectively. Our code is available at https://github.com/OnlyZZZZ/Generative_Texture_Filtering.
[54] Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge cs.CV
Zihao Ye, Yung Hsiang Lu, Xiao Hu, Shuai Zhang, Taotao Jing
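TL;DR: This paper describes the design and evaluation framework of the 2025 IEEE Low-Power Computer Vision Challenge (LPCVC) and the winning solutions from each track, with the aim of advancing efficient vision models for edge devices.
Details
Motivation: The challenge promotes the development of efficient vision models that balance accuracy against constraints such as latency, memory capacity, and energy use, in order to meet the needs of edge devices.
Result: The paper presents the top-performing solutions in three tracks (image classification, open-vocabulary segmentation, and monocular depth estimation) and summarizes key trends and observations, but gives no specific quantitative performance metrics such as accuracy or energy figures.
Insight: The main novelty is the integration of the Qualcomm AI Hub to provide a consistent and reproducible benchmarking framework, together with a summary of key trends in efficient model design and suggestions for future computer vision competitions.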
Abstract: The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.
[55] The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation cs.CV
Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song
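TL;DR: This paper addresses the challenge of agents self-improving from policy-induced experience in vision-and-language navigation and proposes a plug-and-play mechanism called Stability-Diversity Balance (SDB). The mechanism expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts to the instruction-conditioned hidden states, then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning, while an explicit regularizer constrains interactions among hypotheses and stabilizes self-improvement.
Details
Motivation: The paper tackles the difficulty of balancing behavioral diversity and learning stability during agent self-improvement in vision-and-language navigation. Too little behavioral diversity limits exploration, while too much destabilizes the learning signal, making reliable self-improvement difficult.
Result: The method yields consistent improvements on the R2R, SOON, and REVERIE benchmarks. For example, on REVERIE val-unseen, SPL rises from 33.73 to 35.93 and OSR from 51.07 to 54.25.
Insight: The core novelty is a mechanism that explicitly balances behavioral diversity and learning stability. It generates diverse hypotheses through controlled hidden-state shifts and maintains stable learning through soft evaluation and regularization, offering a general and effective solution for policy-induced self-improvement.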
Abstract: In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.
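For readers unfamiliar with the SPL metric quoted above, the following minimal sketch computes Success weighted by Path Length in its standard form: each successful episode contributes the ratio of shortest-path length to the longer of the actual and shortest path lengths. The episode values are invented for illustration, and benchmark tables usually report the result as a percentage.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes (range 0 to 1)."""
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        # A failed episode contributes 0; a detour scales the credit down.
        total += success * shortest / max(taken, shortest)
    return total / len(successes)

# Three hypothetical episodes: optimal success, success with a 2x detour, failure.
score = spl(successes=[1, 1, 0],
            shortest_lengths=[10.0, 10.0, 10.0],
            path_lengths=[10.0, 20.0, 12.0])
print(score)  # 0.5
```

Because SPL penalizes long paths, a method that explores more alternatives at each step only improves the metric if the extra diversity does not lengthen the executed trajectory.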
[56] Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration cs.CV | cs.AI
Jinglin Xu, Yi Li, Chuxiong Sun, Xiao Xu, Jiangmeng Li
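TL;DR: This paper proposes a new method for multi-modal test-time adaptation (TTA) that explicitly models category-conditional distributions through adaptive probabilistic Gaussian calibration and introduces an adaptive contrastive asymmetry rectification technique to mitigate the negative effects of modality asymmetry, achieving state-of-the-art performance under a variety of distribution shifts.
Details
Motivation: Existing multi-modal TTA methods lack explicit modeling of category-conditional distributions, which limits prediction accuracy and the reliability of decision boundaries. Standard Gaussian discriminant analysis (GDA) performs poorly in multi-modal settings because of asymmetric modality distributions, so a modeling approach tailored to multi-modal TTA is needed.
Result: Extensive experiments on multiple benchmarks show that the method achieves state-of-the-art (SOTA) performance under a wide range of distribution shifts.
Insight: The novelty lies in a probabilistic Gaussian model tailored to multi-modal TTA that explicitly models category-conditional distributions, and an adaptive contrastive asymmetry rectification technique that calibrates predictions and decision boundaries, addressing the negative impact of modality asymmetry on conventional GDA.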
Abstract: Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in the multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.
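The canonical GDA baseline that the abstract contrasts against can be sketched as follows. This is the textbook shared-covariance form (class means, one pooled covariance, class priors), not the paper's tailored multi-modal model; the function names, the ridge term, and the synthetic features are assumptions. In a test-time setting the labels would typically be pseudo-labels rather than ground truth.

```python
import numpy as np

def fit_gda(features, labels, n_classes, ridge=1e-3):
    """Fit class means, a shared covariance, and priors; return linear weights and biases."""
    dim = features.shape[1]
    means = np.stack([features[labels == k].mean(axis=0) for k in range(n_classes)])
    centered = features - means[labels]
    cov = centered.T @ centered / len(features) + ridge * np.eye(dim)  # ridge keeps it invertible
    priors = np.bincount(labels, minlength=n_classes) / len(labels)
    precision = np.linalg.inv(cov)
    weights = means @ precision                                         # shape (K, D)
    biases = -0.5 * np.sum(weights * means, axis=1) + np.log(priors)
    return weights, biases

def predict_gda(features, weights, biases):
    """Pick the class with the highest Gaussian log-posterior (up to a shared constant)."""
    return np.argmax(features @ weights.T + biases, axis=1)

rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=-3.0, size=(50, 4)),
                      rng.normal(loc=3.0, size=(50, 4))])
labels = np.array([0] * 50 + [1] * 50)
weights, biases = fit_gda(features, labels, n_classes=2)
accuracy = (predict_gda(features, weights, biases) == labels).mean()
print(accuracy)
```

A single pooled covariance is exactly the kind of assumption that breaks down when two modalities have very different feature distributions, which is the asymmetry the paper sets out to correct.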
[57] EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation cs.CV
Ruibing Hou, Mingyue Zhou, Yuwei Gui, Mingshuang Luo, Bingpeng Ma
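TL;DR: This paper proposes EgoMotion, a hierarchical generative framework for Egocentric Vision-Language (Ego-VL) motion generation. By decoupling cognitive reasoning from motion generation, the framework resolves the gradient conflicts that arise when semantic reasoning and kinematic modeling are optimized simultaneously, and generates 3D human motion sequences that are semantically accurate and of high kinematic quality.
Details
Motivation: The paper addresses the gradient conflicts caused by jointly optimizing semantic reasoning and kinematic modeling when generating 3D human motion from first-person visual observations and language instructions; these conflicts degrade multimodal grounding and motion quality.
Result: Extensive evaluations show that EgoMotion achieves state-of-the-art performance on the relevant benchmarks, producing motion sequences that surpass existing methods in both semantic grounding and kinematic quality.
Insight: The core novelty is a two-stage hierarchical framework inspired by the biological decoupling of cognition and motor control. In the first stage, a vision-language model maps multimodal inputs into a space of discrete motion primitives for cognitive reasoning; in the second, the learned representations serve as conditioning signals for a diffusion model that generates motion by iterative denoising in a continuous latent space. This effectively resolves the reasoning-generation entanglement.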
Abstract: Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical reasoning-generation entanglement challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
[58] Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval cs.CV
Hang Cheng, Fanhe Dong, Long Zeng
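TL;DR: This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing methods perform poorly in zero-shot settings because category supervision is absent and sketch inputs are extremely sparse. The proposed Diff-SBSR uses a pre-trained Stable Diffusion model to extract discriminative features from sketches and rendered 3D views, and substantially improves zero-shot retrieval through a multimodal feature enhancement strategy built on CLIP (visual feature injection plus BLIP-generated textual guidance) and the Circle-T loss.
Details
Motivation: The paper addresses the poor performance of existing sketch-based 3D shape retrieval methods in zero-shot settings, which stems from the absence of category supervision and from extremely abstract, sparse sketch inputs.
Result: Extensive experiments on two public benchmarks show that the method consistently outperforms state-of-the-art (SOTA) approaches on zero-shot sketch-based 3D shape retrieval.
Insight: The novelty lies in applying text-to-image diffusion models to zero-shot sketch-based 3D shape retrieval for the first time and in a retraining-free multimodal feature enhancement strategy that combines CLIP visual features with BLIP textual guidance to strengthen semantic context capture and focus on sketch contours, together with the Circle-T loss to dynamically optimize sketch-3D alignment. Viewed objectively, the method makes good use of the open-vocabulary capability and shape bias of large-scale pre-trained diffusion models, and multimodal conditioning effectively narrows the domain gap between sketches and natural images, offering a new approach to cross-modal retrieval from sparse inputs.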
Abstract: This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature enhancement strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing its ability to capture semantic context and concentrate on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.
[59] ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving cs.CV | cs.AI
Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang
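TL;DR: This paper proposes ST-Prune, a training-free, plug-and-play spatio-temporal token pruning framework for vision-language models in autonomous driving. It comprises two complementary modules, Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP), which exploit the spatio-temporal redundancy inherent in driving scenes to substantially reduce the heavy computational overhead of multi-view camera and multi-frame video input.
Details
Motivation: Existing token pruning methods are designed mainly for single-image inputs and treat each frame or view in isolation, so they fail to exploit the spatio-temporal redundancy inherent in driving scenes. As a result, vision-language models face a severe computational bottleneck when deployed in autonomous driving systems.
Result: Across four benchmarks spanning perception, prediction, and planning, ST-Prune sets a new state of the art for training-free token pruning. Even at 90% token reduction its performance is near-lossless, with certain metrics surpassing the full-model baseline, while inference speed remains comparable to existing pruning methods.
Insight: The novelty is a unified, training-free spatio-temporal token pruning framework. The MTP module encodes motion volatility and temporal recency as soft constraints to prioritize dynamic trajectories and current-frame content, and the RSP module exploits ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background. Viewed objectively, the core insight is to systematically model and exploit the spatio-temporal redundancy patterns inherent in driving data for efficient pruning.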
Abstract: Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
[60] How Far Are Video Models from True Multimodal Reasoning? cs.CV
Xiaotian Zhang, Jianhui Wei, Yuan Wang, Jie Tan, Yichen Li
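TL;DR: This paper proposes the CLVG-Bench evaluation framework for systematically assessing how far video models are from zero-shot multimodal reasoning, and finds that current SOTA models fall seriously short on logical reasoning and interactive generation tasks.
Details
Motivation: Existing video model benchmarks are limited in task design and evaluation metrics and cannot rigorously assess whether models possess true multimodal reasoning ability; the paper aims to fill this gap.
Result: On CLVG-Bench, even SOTA video models such as Seedance 2.0 achieve success rates below 25% on logically grounded generation tasks and close to 0% on interactive generation tasks, exposing multimodal reasoning and physical grounding as the current critical bottlenecks.
Insight: The novelty lies in an evaluation framework focused on context learning in video generation (CLVG-Bench) and an Adaptive Video Evaluator (AVE) aligned with human expert perception, which provide actionable feedback and a clear roadmap toward truly robust, general-purpose video models.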
Abstract: Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models’ zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedback and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.
[61] Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing cs.CV
Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
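TL;DR: This paper systematically evaluates 15 pre-trained vision foundation models on cross-domain face anti-spoofing (FAS) and finds that self-supervised Vision Transformers, particularly DINOv2 with Registers, effectively suppress attention artifacts and capture fine-grained spoofing cues. Combined with the proposed data augmentation and loss function, the approach achieves SOTA performance on the MICO protocol and outperforms existing methods on the data-constrained LSD protocol, providing an efficient and robust vision-only baseline for FAS.
Details
Motivation: The paper addresses the poor robustness of face anti-spoofing on unseen domains. Because existing vision-language-model-based methods are computationally expensive, have high inference latency, and are limited by the quality of the underlying visual features, it revisits the potential of vision-only foundation models.
Result: The method achieves SOTA performance on the MICO protocol and outperforms existing methods on the data-constrained LSD protocol while maintaining superior computational efficiency.
Insight: The novelty lies in systematically verifying the superiority of self-supervised Vision Transformers (particularly DINOv2 with Registers) as FAS backbones and combining them with the FAS-Aug, PDA, and APL strategies. Viewed objectively, the work gives the FAS field an efficient, robust, and reproducible vision-only baseline and demonstrates that optimized self-supervised ViTs can serve as the backbone of future multimodal FAS systems.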
Abstract: Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .
[62] Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data cs.CV | cs.AI
Gopal Krishna Shyam, Ila Chandrakar
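TL;DR: This paper proposes an Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. The model integrates multi-year satellite imagery, high-resolution time-series meteorological data, and initial soil properties, uses convolutional neural networks to extract spatial features, and applies a temporal attention mechanism that adaptively weights important phenological periods to capture the dynamic, complex relationships among environmental variables.
Details
Motivation: Conventional crop yield forecasting relies on static data sources that cannot reflect how the relationships among environmental variables change over time, which limits accuracy. The paper aims to improve prediction accuracy by fusing dynamic multi-modal data, in support of global food security and policy-making.
Result: Experiments show that the proposed model achieves an R^2 score of 0.89 on crop yield prediction, significantly better than the baseline models.
Insight: The novelty lies in combining multi-modal data (satellite, meteorological, soil) with a spatio-temporal attention mechanism: CNNs extract spatial features and temporal attention weights the key phenological periods, modeling dynamic environmental factors more effectively and providing a reusable architecture for fusing multi-source time-series data.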
Abstract: Crop yield prediction is one of the most important challenges for world food security and policy-making decisions. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not reflect the dynamic and intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. The model combines multi-year satellite imagery, high-resolution time-series meteorological data, and initial soil properties, in contrast to traditional models that use only one of these factors [12, 21]. The core architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism that adaptively weights important phenological periods, with weights that change over time and are conditioned on the spatial features of images and video sequences. Experimentally, the proposed framework achieves an R^2 score of 0.89, substantially better than the baseline models.
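The two ingredients named in the abstract, temporal attention over per-period features and the R^2 score, can be sketched generically as below. This is a plain softmax attention pooling, not the ABMMDLF architecture itself; the function names, the query vector, and the random features standing in for CNN outputs are all assumptions.

```python
import numpy as np

def temporal_attention_pool(step_features, query):
    """Weight T time steps of D-dimensional features by softmax relevance to a query vector."""
    scores = step_features @ query / np.sqrt(step_features.shape[1])
    scores = scores - scores.max()                 # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ step_features, weights

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
step_features = rng.normal(size=(12, 16))   # e.g. 12 monthly feature vectors (synthetic)
query = rng.normal(size=16)                  # stand-in for a learned query vector
pooled, weights = temporal_attention_pool(step_features, query)
print(pooled.shape, round(float(weights.sum()), 6))
```

The attention weights sum to one across time steps, so they can be read as the relative importance assigned to each period of the growing season.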
[63] Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification cs.CV
Quan Zhang, Jingze Wu, Jialong Wang, Xiaohua Xie, Jianhuang Lai
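TL;DR: This paper proposes ReID-R, a novel reasoning-driven paradigm for person re-identification that integrates chain-of-thought into the ReID pipeline to achieve explicit identity understanding and reasoning. The method has two stages: a discriminative reasoning warm-up with label-free training to acquire identity-aware feature understanding, and efficient reinforcement learning that constructs scene-generalizable data.
Details
Motivation: Mainstream perception-driven paradigms rely heavily on large amounts of annotated data and lack understanding of identity-causal cues, which leaves representations fragile under multiple kinds of disruption. The goal is to learn identity-discriminative representations that generalize across scenes.
Result: Extensive experiments on multiple ReID benchmarks show that ReID-R achieves identity discrimination competitive with leading methods while using only 14.3K non-trivial samples (20.9% of the existing data scale).
Insight: The novelty is bringing a reasoning paradigm (chain-of-thought in particular) to ReID and using reinforcement learning to steer the model toward ID-related cues for accurate reasoning. Its advantages are high data efficiency and high-quality interpretability of results.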
Abstract: Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit identities from massive annotated data rather than understand identity-causal cues, which yields representations that are fragile against multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought (CoT) into the ReID pipeline. Specifically, ReID-R consists of two stages: (i) discriminative reasoning warm-up, where a model is trained in a CoT label-free manner to acquire identity-aware feature understanding; and (ii) efficient reinforcement learning, which proposes a non-trivial sampling strategy to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with leading methods using only 14.3K non-trivial data samples (20.9% of the existing data scale). Furthermore, benefiting from its inherent reasoning, ReID-R can provide high-quality interpretations of its results.
[64] Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery cs.CVPDF
Francesco Moretti, Yi Jin, Guiqin Mario
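TL;DR: This paper proposes Adaptive Slicing-Assisted Hyper Inference (ASAHI), a new framework for small object detection in high-resolution aerial and satellite imagery. It reduces redundant computation by adaptively choosing the number of slices according to image resolution, and combines a slicing-assisted fine-tuning strategy with a new post-processing module, substantially speeding up inference while preserving detection accuracy.
Details
Motivation: Existing fixed-size slicing strategies introduce substantial redundant computation when handling small objects in high-resolution images, leading to high inference cost and slower detection. The paper aims to overcome this limitation through adaptive slicing.
Result: Experiments on VisDrone2019 and xView show that ASAHI achieves state-of-the-art performance, with 56.8% mAP on VisDrone2019-DET-val and 22.7% mAP on xView-test, while reducing inference time by 20-25% relative to the SAHI baseline.
Insight: The novelty is shifting the slicing strategy from a fixed slice size to an adaptively determined slice count, combined with slicing-assisted fine-tuning and a post-processing module that merges the strengths of Cluster-NMS and DIoU-NMS, improving small-object detection accuracy while optimizing computational efficiency.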
Abstract: Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose Adaptive Slicing-Assisted Hyper Inference (ASAHI), a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1) an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2) a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3) a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
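The center-distance-aware suppression mentioned for the CDN module builds on the standard DIoU score, which subtracts a normalized center-distance penalty from IoU. The sketch below shows that score with a plain greedy NMS loop. It is not the paper's Cluster-DIoU-NMS (the Cluster-NMS matrix formulation is omitted), and the threshold value and helper names are illustrative assumptions.

```python
def diou(a, b):
    """DIoU of two non-degenerate boxes given as (x1, y1, x2, y2)."""
    # Intersection over union
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)

    # Squared distance between box centers
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
       + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / c2

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a box when its DIoU with a kept box exceeds threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(diou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = diou_nms(boxes, scores)
```

Because the penalty grows with center distance, two overlapping boxes whose centers are far apart are less likely to be merged than under plain IoU, which is the property that helps in crowded scenes.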
[65] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation cs.CVPDF
Rui Li, Ke Hao, Yuanzhi Liang, Haibin Huang, Chi Zhang
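TL;DR: This paper proposes OTCA (Objective-aware Trajectory Credit Assignment), a structured framework for improving reinforcement-learning post-training of diffusion-based visual generative models. Through fine-grained trajectory credit assignment, it addresses the coarse, statically scalarized reward signals of existing GRPO methods and thereby improves image and video generation quality.
Details
Motivation: Existing GRPO-based RL post-training frameworks that use multi-objective reward models (e.g., visual quality, motion consistency, text alignment) typically collapse the rewards into a single static scalar and propagate it uniformly across the whole diffusion trajectory. This ignores the stage-specific roles of different denoising steps and yields mistimed or incompatible optimization signals.
Result: Extensive experiments show that OTCA consistently improves image and video generation quality across multiple evaluation metrics.
Insight: The core contribution is a structured framework that jointly models temporal credit (Trajectory-Level Credit Decomposition) and objective-level credit (Multi-Objective Credit Allocation), converting coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation.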
Abstract: Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.
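To make the contrast between uniform and structured credit concrete, here is a toy sketch. The abstract does not give OTCA's formulas, so the normalization, the function names, and all numbers below are assumptions chosen only to illustrate the idea of combining per-step importance with per-step objective weights.

```python
def uniform_credit(rewards, num_steps):
    """Baseline: collapse all rewards to one scalar and repeat it at every step."""
    scalar = sum(rewards.values()) / len(rewards)
    return [scalar] * num_steps

def structured_credit(rewards, step_importance, objective_weights):
    """Hypothetical timestep-aware signal.

    rewards:           {objective_name: reward}
    step_importance:   one non-negative importance value per denoising step
    objective_weights: one {objective_name: weight} dict per denoising step
    """
    total = sum(step_importance)
    signals = []
    for importance, weights in zip(step_importance, objective_weights):
        norm = sum(weights.values())
        # Objective-level credit: weighted mix of the rewards at this step
        mixed = sum(weights[k] * rewards[k] for k in rewards) / norm
        # Temporal credit: scale by the step's normalized importance
        signals.append(importance / total * mixed)
    return signals

# Arbitrary illustrative numbers, not taken from the paper
rewards = {"quality": 0.8, "alignment": 0.4}
step_importance = [3.0, 1.0]
objective_weights = [
    {"quality": 1.0, "alignment": 3.0},
    {"quality": 3.0, "alignment": 1.0},
]
flat = uniform_credit(rewards, num_steps=2)
shaped = structured_credit(rewards, step_importance, objective_weights)
```

The uniform baseline gives every step the same value, while the structured version lets both the step and the objective mix change the signal each step receives.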
[66] Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows cs.CVPDF
Zihan Wang, Xudong Huang, Junbo Qiao, Wei Li, Jie Hu
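TL;DR: This paper proposes the Allo{SR}$^2$ framework, which rectifies one-step super-resolution (one-step SR) trajectories via allomorphic generative flows. It targets the "prior collapse" that existing methods suffer when fine-tuned on limited LR-HR pairs, achieving high-fidelity generative realism while keeping inference efficient.
Details
Motivation: In real-world image super-resolution (Real-SR), methods built on large-scale diffusion or flow models are prone to "prior collapse" when fine-tuned on limited data: the model sacrifices generative richness to overfit specific training degradations. The problem is worse in one-step generation, where the lack of multi-step refinement causes trajectory drift and artifacts.
Result: Extensive experiments on synthetic and real-world benchmarks show that Allo{SR}$^2$ achieves state-of-the-art performance in one-step Real-SR, striking a superior balance between restoration fidelity and generative realism while remaining highly efficient.
Insight: The contributions are: 1) SNR-guided trajectory initialization, which establishes a physically grounded starting state by aligning the degradation level of the LR latent features with the optimal anchoring timestep of the pre-trained flow; 2) Flow-Anchored Trajectory Consistency (FATC), which applies velocity-level supervision at intermediate states to ensure a stable, curvature-free one-step inference path; 3) Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Viewed objectively, the trajectory-rectification mechanism mitigates degradation overfitting in one-step generation and offers a new direction for efficient Real-SR.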
Abstract: Real-world image super-resolution (Real-SR) has been revolutionized by leveraging the powerful generative priors of large-scale diffusion and flow-based models. However, fine-tuning these models on limited LR-HR pairs often precipitates “prior collapse”, in which the model sacrifices its inherent generative richness to overfit specific training degradations. This issue is further exacerbated in one-step generation, where the absence of multi-step refinement leads to significant trajectory drift and artifact generation. In this paper, we propose Allo{SR}$^2$, a novel framework that rectifies one-step SR trajectories via allomorphic generative flows to maintain high-fidelity generative realism. Specifically, we utilize Signal-to-Noise Ratio (SNR) Guided Trajectory Initialization to establish a physically grounded starting state by aligning the degradation level of LR latent features with the optimal anchoring timestep of the pre-trained flow. To ensure a stable, curvature-free path for one-step inference, we propose Flow-Anchored Trajectory Consistency (FATC), which enforces velocity-level supervision across intermediate states. Furthermore, we develop Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Extensive experiments on both synthetic and real-world benchmarks demonstrate that Allo{SR}$^2$ achieves state-of-the-art performance in one-step Real-SR, offering a superior balance between restoration fidelity and generative realism while maintaining extreme efficiency.
[67] Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images cs.CVPDF
Hongyuan Liu, Bochao Zou, Qiankun Liu, Haochen Yu, Qi Mei
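TL;DR: This paper presents Unposed-to-3D, a framework for reconstructing simulation-ready 3D vehicle models from real-world driving images. Training has two stages: the first trains a reconstruction network on images with known camera parameters; the second estimates camera parameters from unposed images with a camera prediction head and learns in a self-supervised way through differentiable rendering. Scale-aware and scene-harmonization modules then ensure visual consistency and practical usability in simulated scenes.
Details
Motivation: Existing 3D vehicle generation methods are usually trained on synthetic data with a significant domain gap from real-world distributions, so the generated models have arbitrary poses and undefined scales and show poor visual consistency when integrated into driving scenes.
Result: Extensive experiments show that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path to high-quality assets for driving scene simulation and digital twin environments.
Insight: The novelty is a two-stage framework that uses image-only supervision and requires no camera pose annotations in its second stage: a camera prediction head and self-supervised rendering enable learning 3D geometry from unposed images, and the scale-aware and harmonization modules ensure simulation readiness, offering a new route to building 3D assets from real images.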
Abstract: Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.
[68] DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CVPDF
Shengqin Wang, Wentao Yan, Huichi Zhou, Yihang Chen, Kun Shao
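TL;DR: This paper proposes DR-MMSearchAgent, a multimodal search agent framework that addresses premature interaction collapse in complex tasks, which arises because the terminal reward is concentrated on the last token and the context is redundant. The framework uses structural proximity to derive advantage signals from all trajectories in a batch and applies differentiated Gaussian rewards to dynamically calibrate interaction tolerance, deepening reasoning. To support multi-turn interaction training, the authors build a multi-step deep-reasoning dataset of 3602 high-quality QA pairs. Experiments show that the method outperforms MMSearch-R1 by 8.4% on the FVQA-test benchmark, reaching state-of-the-art performance.
Details
Motivation: Multimodal agents suffer premature interaction collapse in complex tasks because of a poorly designed terminal reward (attached only to the last token) and redundant context. The paper addresses this to strengthen deep reasoning.
Result: On the FVQA-test benchmark the method achieves state-of-the-art performance, 8.4% above MMSearch-R1.
Insight: The contributions are: 1) using structural proximity to derive advantage signals from the trajectories of the whole batch, which encourages trajectories of different lengths; 2) differentiated Gaussian rewards that dynamically calibrate interaction tolerance, ensuring information reliability and reducing redundancy; 3) a dedicated multi-step deep-reasoning dataset to support training. Viewed objectively, the method improves the reasoning robustness of multimodal agents by refining the RL reward mechanism and context management.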
Abstract: Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer premature interaction collapse, for two primary reasons: 1) the terminal reward is typically attached to the last token, which prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent (DR-MMSearchAgent). The framework leverages structural proximity to derive advantage signals from whole rollout trajectories across the entire batch, so that trajectories of different lengths are encouraged even when they contain the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we construct a multi-step deep-reasoning dataset of 3602 high-quality QA pairs, each with at least 3 reasoning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.
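A Gaussian-shaped reward over the number of interaction turns can be sketched as follows. The abstract only names "differentiated Gaussian rewards", so the choice of the turn count as the input variable, the parameters `target_turns` and `sigma`, and the function name are all assumptions made for illustration.

```python
import math

def gaussian_interaction_reward(num_turns, target_turns, sigma):
    """Reward that peaks at target_turns and decays smoothly on both sides.

    A larger sigma tolerates more deviation from the target turn count;
    using different (target_turns, sigma) pairs per task type would be one
    way to make the reward "differentiated" (an assumption, not the paper's
    stated design).
    """
    return math.exp(-((num_turns - target_turns) ** 2) / (2.0 * sigma ** 2))

# Rewards for 1..5 turns with a hypothetical target of 3 turns
rewards_by_turns = [gaussian_interaction_reward(n, target_turns=3, sigma=1.0)
                    for n in range(1, 6)]
```

Under this shape, too few turns (shallow search) and too many turns (redundant context) are both penalized relative to the target.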
[69] PLaMo 2.1-VL Technical Report cs.CV | cs.AIPDF
Tommi Kerola, Yuya Masuda, Takashi Masuko, Toshiki Nakanishi, Daisuke Nishino
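TL;DR: PLaMo 2.1-VL is a lightweight vision-language model available in 8B and 2B variants, designed for local and edge deployment with Japanese-language operation. Its core capabilities are visual question answering and visual grounding, and it is developed and evaluated for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection.
Details
Motivation: To build a lightweight, Japanese-capable vision-language model for autonomous devices that can be deployed locally and at the edge, addressing visual understanding and question answering in industrial settings such as factory task analysis and infrastructure monitoring.
Result: The model outperforms comparable open models on Japanese and English benchmarks: 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. In the application scenarios, zero-shot accuracy on factory task analysis is 53.9%; after fine-tuning on power plant data, the anomaly detection bbox + label F1 score rises from 39.7 to 64.9.
Insight: The main contributions are a lightweight VLM design focused on Japanese and edge deployment; a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources; and tailoring and validating the model for concrete industrial applications (tool recognition, anomaly detection), showing effective transfer from general benchmarks to real tasks.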
Abstract: We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
[70] Geometry-Guided Self-Supervision for Ultra-Fine-Grained Recognition with Limited Data cs.CVPDF
Shijie Wang, Yadan Luo, Zijian Wang, Haojie Li, Zi Huang
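TL;DR: This paper proposes the Geometric Attribute Exploration Network (GAEor), a self-supervised framework for ultra-fine-grained visual categorization in data-limited scenarios. It mines geometric attributes inside objects (such as leaf vein structures) as recognition cues and sets new state-of-the-art results on five widely used benchmarks.
Details
Motivation: Ultra-fine-grained objects differ only minimally in appearance and data are limited, yet existing methods tend to overlook geometric features. The paper aims to use the geometric patterns inherent to objects (such as the arrangement of details) as effective recognition cues.
Result: On five widely used Ultra-FGVC benchmarks, GAEor sets new state-of-the-art (SOTA) records by a clear margin.
Insight: The novelty is using geometric attributes (details embedded through relative polar coordinates) as a self-supervised signal, offering a generalizable new perspective for data-limited ultra-fine-grained recognition. Viewed objectively, the method amplifies geometry-relevant details through visual feedback and makes effective use of category-specific geometric descriptors.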
Abstract: This paper investigates the intrinsic geometrical features of highly similar objects and introduces a general self-supervised framework called the Geometric Attribute Exploration Network (GAEor), which is designed to address the ultra-fine-grained visual categorization (Ultra-FGVC) task in data-limited scenarios. Unlike prior work that often captures subtle yet critical distinctions, GAEor generates geometric attributes as novel alternative recognition cues. These attributes are determined by various details within the object, aligned with its geometric patterns, such as the intricate vein structures in soybean leaves. Crucially, each category exhibits distinct geometric descriptors that serve as powerful cues, even among objects with minimal visual variation, a factor largely overlooked in recent research. GAEor discovers these geometric attributes by first amplifying geometry-relevant details via visual feedback from a backbone network, then embedding the relative polar coordinates of these details into the final representation. Extensive experiments demonstrate that GAEor sets new state-of-the-art records on five widely used Ultra-FGVC benchmarks.
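The idea of describing details by relative polar coordinates can be illustrated with a short sketch. The abstract does not say which reference point GAEor uses, so this assumes the centroid of the detail locations; the function name and the sample points are hypothetical.

```python
import math

def relative_polar(points):
    """Convert 2-D detail locations to (radius, angle) relative to their centroid.

    points: list of (x, y) tuples. The centroid reference is an assumption;
    any other anchor (e.g. the object center) would work the same way.
    Angles are in radians in the range (-pi, pi].
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx))
            for x, y in points]

# Four detail locations at the corners of a square
corners = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
polar = relative_polar(corners)
```

Because the coordinates are relative, translating all detail locations together leaves the representation unchanged, which is one reason such a description can capture the arrangement of details rather than their absolute position.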
[71] RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow cs.CVPDF
Xunpei Sun, Zuoxun Hou, Yi Chang, Gang Chen, Wei-Shi Zheng
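TL;DR: RAFT-MSF++ is a self-supervised multi-frame framework for monocular scene flow estimation that recurrently fuses temporal features to jointly estimate depth and scene flow. Its core is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. For robustness to occlusion, the method adds relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions.
Details
Motivation: Most existing monocular scene flow methods are limited to two-frame inputs, which restricts temporal modeling and robustness to occlusion. The paper addresses this with a self-supervised framework that exploits multi-frame information for more robust estimation.
Result: On the KITTI Scene Flow benchmark, RAFT-MSF++ achieves 24.14% SF-all, a 30.99% improvement over the baseline, and shows better robustness in occluded regions.
Insight: The novelty is the compact Geometry-Motion Feature (GMF) for a coupled representation of motion and geometry, together with relative positional attention and an occlusion regularization module that keep multi-frame temporal fusion effective under occlusion. This offers a new approach to temporal modeling and robustness in self-supervised monocular scene flow estimation.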
Abstract: Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at https://github.com/sunzunyi/RAFT-MSF-PlusPlus.
[72] Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms cs.CVPDF
Samyak Sanghvi, Piyush Miglani, Sarvesh Shashikumar, Kaustubh R Borgavi, Veenu Singla
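TL;DR: This paper proposes a Vision Transformer-based framework for classifying breast cancer from mammograms. It combines region-of-interest (RoI) token reduction guided by an object detection model, hard-negative contrastive learning to sharpen fine-grained discrimination, and a DINOv2-pretrained ViT that captures localization-aware fine-grained features. Together these address the scattered attention and difficult fine-grained classification caused by small lesions in high-resolution medical images.
Details
Motivation: ViT performance in computer-aided diagnosis remains limited. The authors identify two causes: high-resolution medical images with small lesions produce too many tokens, making it hard for attention to focus on relevant regions; and medical image classification is inherently fine-grained (low inter-class, high intra-class variability), so standard cross-entropy training is insufficient. They propose an improved framework in response.
Result: Experiments on public mammography datasets show that the method outperforms existing baselines, demonstrating its effectiveness and potential clinical utility for large-scale breast cancer screening.
Insight: The contributions are: RoI token reduction that uses an object detection model to guide attention; hard-negative contrastive learning for fine-grained discrimination; and a DINOv2-pretrained ViT in place of CLIP to capture localization-aware fine-grained features. Viewed objectively, the framework combines object detection with contrastive learning and tailors ViT attention and feature representation to medical images.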
Abstract: Vision Transformers (ViT) have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest (RoI) based token reduction using an object detection model to guide attention; (2) contrastive learning between selected RoI to enhance fine-grained discrimination through hard-negative based training; and (3) a DINOv2 pretrained ViT that captures localization-aware, fine-grained features instead of global CLIP representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
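RoI-based token reduction can be pictured as keeping only the ViT patch tokens whose grid cells overlap a detected box. The sketch below assumes a square patch grid in row-major order and a single pixel-space box from some detector; the function name and the overlap rule are illustrative assumptions, not the paper's exact procedure.

```python
def roi_token_indices(image_size, patch_size, box):
    """Return row-major indices of patches that overlap the RoI box.

    image_size: (width, height) in pixels
    patch_size: side length of a square patch in pixels
    box:        (x1, y1, x2, y2) RoI in pixels, e.g. from an object detector
    """
    width, height = image_size
    cols, rows = width // patch_size, height // patch_size
    x1, y1, x2, y2 = box
    keep = []
    for r in range(rows):
        for c in range(cols):
            px1, py1 = c * patch_size, r * patch_size
            px2, py2 = px1 + patch_size, py1 + patch_size
            # Keep the patch if its rectangle intersects the box
            if px1 < x2 and px2 > x1 and py1 < y2 and py2 > y1:
                keep.append(r * cols + c)
    return keep

# 64x64 image with 16-pixel patches gives a 4x4 grid of 16 tokens
kept_tokens = roi_token_indices((64, 64), 16, (20, 20, 40, 40))
```

In this toy case only 4 of the 16 tokens survive. On a high-resolution mammogram with a small lesion the reduction would be far larger, which is what makes it easier for attention to concentrate on relevant regions.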
[73] IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging cs.CVPDF
Philipp Weigand, Niels Nawrot, Nikolas Ebert, Carsten Hopf, Oliver Wasenmüller
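TL;DR: This paper proposes IonMorphNet, a generalizable model for learning ion image morphologies in mass spectrometry imaging (MSI), enabling peak picking without task-specific supervision or hyperparameter tuning. It trains standard image backbones to classify structural patterns in ion images, which improves peak picking and also proves effective for patch-based tumor classification.
Details
Motivation: Existing MSI peak picking methods require careful dataset-specific hyperparameter tuning and generalize poorly across acquisition protocols.
Result: Across multiple datasets, IonMorphNet with a ConvNeXt V2-Tiny backbone improves peak picking performance (mSCF1) by +7% over state-of-the-art methods. On three tumor classification tasks, its spatially informed channel reduction lets a 3D CNN match or exceed pixel-wise spectral classifiers, by up to +7.3% balanced accuracy.
Insight: The novelty is a fully data-driven, spatial-structure-aware representation model for ion images that needs no task-specific supervision and enables generalizable peak picking. The key insight is to summarize ion image morphology into six representative structural classes for learning, which yields discriminative spatial features, makes peak picking more robust, and provides an effective feature selection mechanism for downstream tasks such as tumor classification.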
Abstract: Peak picking is a fundamental preprocessing step in Mass Spectrometry Imaging (MSI), where each sample is represented by hundreds to thousands of ion images. Existing approaches require careful dataset-specific hyperparameter tuning, and often fail to generalize across acquisition protocols. We introduce IonMorphNet, a spatial-structure-aware representation model for ion images that enables fully data-driven peak picking without any task-specific supervision. We curate 53 publicly available MSI datasets and define six structural classes capturing representative spatial patterns in ion images to train standard image backbones for structural pattern classification. Once trained, IonMorphNet can assess ion images and perform peak picking without additional hyperparameter tuning. Using a ConvNeXt V2-Tiny backbone, our approach improves peak picking performance by +7 % mSCF1 compared to state-of-the-art methods across multiple datasets. Beyond peak picking, we demonstrate that spatially informed channel reduction enables a 3D CNN for patch-based tumor classification in MSI. This approach matches or exceeds pixel-wise spectral classifiers by up to +7.3 % Balanced Accuracy on three tumor classification tasks, indicating meaningful ion image selection. The source code and model weights are available at https://github.com/CeMOS-IS/IonMorphNet.
[74] PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving cs.CVPDF
Yining Pan, Shijie Li, Yuchen Wu, Xulei Yang, Na Zhao
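TL;DR: This paper proposes PanDA, the first unsupervised domain adaptation framework for multimodal 3D panoptic segmentation, targeting the performance degradation caused by domain shifts in autonomous driving (changes in time, weather, location, and sensors). It uses asymmetric multimodal augmentation to simulate sensor degradation and improve robustness, and a dual-expert pseudo-label refinement module that extracts domain-invariant priors from the 2D and 3D modalities to produce more complete and reliable pseudo-labels.
Details
Motivation: Existing supervised multimodal 3D panoptic segmentation methods rely heavily on strong complementarity between LiDAR and RGB, making them fragile under domain shifts in which one modality degrades (e.g., poor lighting or adverse weather). In addition, conventional pseudo-labeling usually keeps only high-confidence regions, which leads to fragmented masks and incomplete object supervision, a particular problem for panoptic segmentation.
Result: Extensive experiments across diverse domain shifts (time, weather, location, and sensor variations) show that PanDA significantly surpasses existing unsupervised domain adaptation baselines for 3D semantic segmentation, reaching state-of-the-art performance.
Insight: The contributions are an asymmetric multimodal augmentation strategy for single-sensor degradation and a dual-expert pseudo-label refinement module that combines domain-invariant priors from the 2D and 3D modalities. Together they improve robustness under domain shift and the completeness of pseudo-labels.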
Abstract: This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, issues that are particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, show that PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.
[75] Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval cs.CVPDF
Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen
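TL;DR: This paper proposes Air-Know, a new robust learning framework that addresses the representation pollution caused by the Noisy Triplet Correspondence (NTC) problem in Composed Image Retrieval (CIR). Following an "Expert-Proxy-Diversion" decoupling paradigm, it uses a multimodal large language model as an offline expert to build a high-precision anchor dataset, guides a lightweight proxy arbiter to internalize the expert's knowledge, and separates the training data through a dual-stream reconciliation mechanism, mitigating the semantic ambiguity in NTC.
Details
Motivation: Composed image retrieval has attracted attention for its flexible multimodal queries, but its development is severely constrained by the NTC problem. Existing robust learning methods rely on the "small loss hypothesis", which is invalidated by the semantic ambiguity unique to NTC (such as "partial matching"). Noise identification then becomes unreliable, trapping the model in a self-dependent vicious cycle and causing "representation pollution".
Result: Extensive experiments on multiple CIR benchmark datasets show that Air-Know significantly outperforms existing SOTA methods under the NTC setting while remaining strongly competitive on traditional CIR.
Insight: The novelty is the "Expert-Proxy-Diversion" decoupling paradigm: an external multimodal large language model serves as an offline expert providing high-quality priors, and knowledge internalization plus data diversion decouple noise identification from model learning. This handles semantically ambiguous noisy data more robustly and avoids the entanglement of learner and arbiter found in conventional methods.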
Abstract: Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the “small loss hypothesis”, but the unique semantic ambiguity in NTC, such as “partial matching”, invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic “representation pollution”. To address this critical challenge, we propose a novel “Expert-Proxy-Diversion” decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy “arbiter” to internalize the expert’s discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI’s matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
[76] HP-Edit: A Human-Preference Post-Training Framework for Image Editing cs.CV | cs.AIPDF
Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu
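TL;DR: This paper proposes HP-Edit, a post-training framework for image editing that uses human preference feedback (RLHF) to improve diffusion models on real-world editing tasks. It includes HP-Scorer, an automatic evaluator built from a small amount of human scoring data and a visual large language model (VLM), which is used to efficiently construct the large-scale preference dataset RealPref-50K and serves as the reward function for post-training the editing model.
Details
Motivation: Diffusion-based image editing methods are powerful, but efficiently applying RLHF to diverse editing tasks remains underexplored, mainly because scalable human-preference datasets and tailored frameworks are lacking.
Result: Extensive experiments on the proposed RealPref-Bench benchmark show that the method significantly improves models such as Qwen-Image-Edit-2509, bringing their outputs closer to human preference.
Insight: The core contribution is an end-to-end post-training framework that pairs an automatic preference evaluator (HP-Scorer) with a scalable preference dataset (RealPref-50K), removing the data and efficiency bottlenecks of applying RLHF to diverse editing tasks and providing a systematic recipe for preference-based optimization of image editing models.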
Abstract: Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset that spans eight common tasks and balances common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.
[77] GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes cs.CV | cs.AIPDF
Joshua Niemeijer, Alaa Eddine Ben Zekri, Reza Bahmanyar, Philipp M. Schmälzle, Houda Chaabouni-Chouayakh
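TL;DR: GOLD-BEV is a framework that learns dense bird's-eye-view (BEV) semantic environment maps from vehicle-mounted sensors, using time-synchronized aerial imagery as training supervision. It produces dense semantic maps that include dynamic traffic participants, and extends to areas without aerial coverage by synthesizing pseudo-aerial BEV images.
Details
Motivation: Building geometrically consistent, scene-centric dense BEV semantic maps from vehicle-mounted sensors is challenging, particularly because dynamic objects are hard to annotate and conventional ego-only BEV labeling is ambiguous.
Result: Using aerial imagery as supervision, the method trains on BEV pseudo-labels generated by domain-adapted aerial teachers, achieving dense BEV semantic segmentation with better temporal consistency in dynamic scenes.
Insight: The novelty is using time-synchronized aerial imagery as an intuitive supervision signal, which avoids the ambiguity of ego-only labeling, and synthesizing pseudo-aerial BEV images to support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled data.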
Abstract: Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird’s-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.
[78] VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing cs.CV | cs.CLPDF
Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou
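TL;DR: This paper proposes Visual Contrastive Editing (VCE), a post-hoc method for mitigating object hallucination in large vision-language models (LVLMs). By analyzing the model's response to contrastive visual perturbations, it uses Singular Value Decomposition (SVD) to identify and isolate hallucination subspaces and applies targeted parameter edits to suppress hallucinatory tendencies. VCE requires no fine-tuning or labeled data, making it a zero-cost, scalable solution.
Details
Motivation: Large vision-language models frequently hallucinate objects, generating descriptions that contain objects absent from the input image. This is especially serious in accuracy-critical applications such as medical imaging and autonomous driving. Studies suggest the problem may stem from language prior biases learned during pretraining.
Result: Experiments show that VCE effectively reduces object hallucination across multiple benchmarks while preserving the model's original computational efficiency.
Insight: The novelty is a post-hoc intervention that needs neither labeled data nor fine-tuning: it analyzes the hallucination subspace in the model's activation patterns and edits parameters accordingly, mitigating hallucination at zero cost with good scalability and practicality.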
Abstract: Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model’s response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model’s activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate their influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model’s original computational efficiency.
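The SVD step can be sketched with NumPy under strong simplifying assumptions. The abstract does not state which activations are decomposed or how the parameter edit is formed, so this sketch assumes the subspace is spanned by the top right singular vectors of the difference between perturbed and clean activations, and that the edit is an orthogonal projection applied to a weight matrix whose outputs live in that activation space. The function name `remove_subspace` and all shapes are hypothetical.

```python
import numpy as np

def remove_subspace(weight, clean_acts, perturbed_acts, rank):
    """Project a weight matrix's outputs away from a contrastive subspace.

    weight:         (d, d_in) matrix producing activations in R^d
    clean_acts:     (n, d) activations on original inputs
    perturbed_acts: (n, d) activations on contrastively perturbed inputs
    rank:           number of singular directions to remove
    """
    diff = perturbed_acts - clean_acts
    # Rows of vt are orthonormal directions in activation space,
    # ordered by how much of the contrastive difference they explain
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    basis = vt[:rank]
    projector = np.eye(weight.shape[0]) - basis.T @ basis
    return projector @ weight, basis

# Toy data: the perturbation only moves activations along coordinate 2
rng = np.random.default_rng(0)
d, n = 6, 10
clean = rng.normal(size=(n, d))
direction = np.zeros(d)
direction[2] = 1.0
perturbed = clean + rng.normal(size=n)[:, None] * direction
weight = rng.normal(size=(d, 4))
edited, basis = remove_subspace(weight, clean, perturbed, rank=1)
```

After the edit, the layer's outputs have no component along the recovered direction, and no gradient step or label was needed to compute it, which matches the label-free, post-hoc character described in the abstract.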
[79] DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval cs.CVPDF
Xinwei He, Yansong Zheng, Qianru Han, Zhichuan Wang, Yuxuan Cai
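TL;DR: This paper proposes DINO Eats CLIP (DEC), a framework for open-set 3D object retrieval. It uses a DINO encoder to extract fine-grained multi-view features, dynamically integrates local view relations through the proposed Chunking and Adapting Module (CAM), and introduces a Virtual Feature Synthesis (VFS) module that uses CLIP’s vision-language aligned space to synthesize virtual features for unseen classes, mitigating overfitting to known classes and improving open-set generalization.
Details
Motivation: Existing CLIP-based open-set 3D object retrieval methods lack fine-grained features. The self-supervised DINO encoder offers finer-grained features, but adapting it directly to multi-view inputs tends to overfit to the average view patterns of known classes and generalizes poorly to unseen classes.
Result: Extensive experiments on standard open-set 3D object retrieval benchmarks show that DEC is effective and outperforms existing methods.
Insight: The novelty lies in combining DINO’s fine-grained features with CLIP’s open-set alignment capability: CAM dynamically integrates local view relations, and VFS synthesizes virtual features to explicitly strengthen discrimination of unseen classes, offering a new direction for multi-view 3D representation learning.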
Abstract: Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP’s strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat this, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP’s broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
[80] LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results cs.CVPDF
Xiang Chen, Hao Li, Jiangxin Dong, Jinshan Pan, Xin Li
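TL;DR: This paper reviews the LoViF 2026 Challenge on Real-World All-in-One Image Restoration, which aims to advance all-in-one image restoration under diverse real-world degradations such as blur, low light, haze, rain, and snow, and provides a unified benchmark for evaluating model robustness and generalization across degradation categories within a common framework.
Details
Motivation: To address all-in-one image restoration under diverse real-world degradation conditions and to give models a unified evaluation benchmark that fosters research in this area.
Result: The challenge attracted 124 registered participants and received 9 valid final submissions with accompanying fact sheets, which significantly advanced real-world all-in-one image restoration.
Insight: By running the challenge and establishing a unified benchmark, the work promotes research on model robustness and generalization under diverse real degradations and sets a reference point for future real-world low-level vision research.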
Abstract: This paper presents a review for the LoViF Challenge on Real-World All-in-One Image Restoration. The challenge aimed to advance research on real-world all-in-one image restoration under diverse real-world degradation conditions, including blur, low-light, haze, rain, and snow. It provided a unified benchmark to evaluate the robustness and generalization ability of restoration models across multiple degradation categories within a common framework. The competition attracted 124 registered participants and received 9 valid final submissions with corresponding fact sheets, significantly contributing to the progress of real-world all-in-one image restoration. This report provides a detailed analysis of the submitted methods and corresponding results, emphasizing recent progress in unified real-world image restoration. The analysis highlights effective approaches and establishes a benchmark for future research in real-world low-level vision.
[81] TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation cs.CVPDF
Hongyu Zhang, Yufan Deng, Zilin Pan, Peng-Tao Jiang, Bo Li
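TL;DR: This paper proposes TS-Attn (Temporal-wise Separable Attention), a training-free attention mechanism for generating high-quality videos from complex temporal descriptions that contain multiple sequential actions. It dynamically rearranges the attention distribution to ensure temporal awareness and global coherence in multi-event scenarios, and integrates seamlessly into various pre-trained text-to-video models.
Details
Motivation: Existing methods face an inherent trade-off in multi-event video generation: feeding multiple short prompts sequentially improves action fidelity but harms temporal consistency, whereas a single complex prompt preserves consistency at the cost of prompt following. This stems mainly from temporal misalignment between video content and the prompt, and from conflicting attention coupling between motion-related visual objects and their associated text conditions.
Result: TS-Attn raises StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 2% increase in inference time. It also supports plug-and-play use across models for multi-event image-to-video generation.
Insight: The novelty is a training-free attention mechanism that resolves temporal alignment and condition conflicts in multi-event video generation by decoupling and dynamically rearranging attention. Its modular, plug-and-play design and the sizable gains over existing models make it an efficient and practical contribution.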
Abstract: Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
[82] Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram cs.CV | cs.CYPDF
Michael Achmann-Denkler, Mario Haim, Christian Wolff
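TL;DR: This computational case study evaluates specialized machine learning models and emerging multimodal large language models for visual political communication analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, it compares traditional computer vision models with GPT-4o on identifying front-runner politicians and counting the people in images.
Details
Motivation: To address the limitations of traditional computer vision models in identifying politicians and counting people when analyzing visual political communication at scale on social media such as Instagram, and to explore the potential of emerging multimodal large language models.
Result: On Instagram stories, GPT-4o achieved a macro F1-score of 0.89 for face recognition and 0.86 for person counting, outperforming traditional models such as FaceNet512, RetinaFace, and Google Cloud Vision.
Insight: The novelty lies in applying a multimodal large language model (GPT-4o) to the specific domain of visual political communication, showing that it can surpass traditional specialized models on complex, real-world social media image analysis tasks such as identifying specific people and counting, and offering a new methodological perspective for future research.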
Abstract: This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
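Illustrative sketch: the macro F1-score reported above averages per-class F1 with equal weight, so rare classes count as much as frequent ones. The helper below is a minimal pure-Python version; the labels are made up for illustration and are not data from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical ground-truth and predicted identities for six images.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 values are 0.5, 0.8, and 2/3, so `score` is their plain average.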
[83] RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation cs.CVPDF
Ahmed Marouane Djouama, Abir Belaala, Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Cosimo Distante
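TL;DR: This paper proposes RF-HiT, a Rectified Flow Hierarchical Transformer for general medical image segmentation. It combines an hourglass transformer backbone with a multi-scale hierarchical encoder, uses rectified flow with efficient transformer blocks to achieve linear complexity, and needs only a few discretization steps at inference. The model fuses multi-resolution features through learnable interpolation, maintaining high performance at low computational cost.
Details
Motivation: Existing transformer- and diffusion-based medical image segmentation methods are often limited by quadratic computational complexity and high inference latency, making it hard to deliver both long-range contextual reasoning and precise boundary delineation. The paper aims to design an efficient, high-performing model that resolves this efficiency-performance trade-off.
Result: RF-HiT reaches 91.27% mean Dice on ACDC and 87.40% mean Dice on BraTS 2021, comparable to or better than more computationally intensive architectures. It requires only 10.14 GFLOPs and 13.6M parameters, and inference can take as few as three steps, giving a strong efficiency-performance balance.
Insight: The contributions are: 1) combining rectified flow with efficient transformer blocks for fast diffusion-style inference at linear complexity; 2) an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning; 3) cross-resolution feature fusion via learnable interpolation, yielding effective multi-scale representations with minimal overhead. Together these provide a compact yet capable foundation for real-time clinical segmentation.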
Abstract: Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
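Illustrative sketch: rectified-flow sampling integrates an ODE dx/dt = v(x, t) from t = 0 to t = 1, and when trajectories are close to straight lines a handful of Euler steps can suffice. The code below shows that mechanism on a toy problem. The `velocity` function is a hand-written stand-in for a learned network (it is handed the target directly, which a real model would not be), and all values are hypothetical.

```python
def velocity(x, t, target):
    """Toy stand-in for a learned velocity network: the ideal straight-line
    field that carries x toward a fixed target, arriving at t = 1."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(x0, target, steps=3):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with equal Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so the division is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.8, -1.2, 0.3]   # hypothetical starting sample
target = [1.0, 0.0, 0.0]   # hypothetical clean output
sample = euler_sample(noise, target, steps=3)
```

Because this toy field is exactly straight, three Euler steps land on the target up to floating-point error; a learned field is only approximately straight, so the step count becomes a speed-accuracy knob.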
[84] SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing cs.CVPDF
Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang, Ruiyang Fan
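TL;DR: This paper proposes SmartPhotoCrafter, an automatic photographic image editing method that formulates editing as a tightly coupled reasoning-to-generation process. An Image Critic module assesses image quality and identifies deficiencies, and a Photographic Artist module then performs targeted edits to enhance image appeal, so users need not supply explicit aesthetic instructions.
Details
Motivation: Traditional photo editing depends on users supplying explicit aesthetic instructions that are often ambiguous or incomplete; the paper aims to give non-expert users an automated, high-quality editing solution.
Result: Experiments show that SmartPhotoCrafter outperforms existing generative models on automatic photographic enhancement, producing photo-realistic results and showing higher tonal sensitivity to retouching instructions.
Insight: The novelty lies in casting image editing as a tightly coupled reasoning-to-generation process and adopting a three-stage training pipeline: foundation pretraining, adaptation with reasoning-guided multi-edit supervision, and coordinated reasoning-to-generation reinforcement learning. The approach emphasizes photo-realistic generation and supports both restoration and retouching in one model.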
Abstract: Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies with the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.
[85] Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping cs.CVPDF
Jienan Lyu, Miao Yang, Jinchen Cai, Yiwen Hu, Guanyi Lu
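TL;DR: This paper proposes the Structure-Semantic Decoupled Modulation (SSDM) framework to address the feature interference and spatial structure degradation that arise, because of a semantic-spatial gap, when the high-dimensional implicit embeddings of global geospatial foundation models are fused directly with high-resolution visual features. The framework decouples global representations into two complementary cross-modal injection pathways: a structural prior modulation branch introduces macroscopic receptive field priors into the self-attention modules of the high-resolution encoder to suppress prediction fragmentation, and a global semantic injection branch explicitly aligns and supplements global semantics via cross-modal integration to improve semantic consistency.
Details
Motivation: In fine-grained high-resolution remote sensing mapping, over-reliance on local visual features leads to poor cross-domain generalization and fragmented predictions of large-scale land covers, while directly fusing global geospatial foundation model embeddings causes feature interference and spatial structure degradation.
Result: Extensive experiments show that, compared with existing cross-modal fusion approaches, the method consistently improves high-resolution mapping accuracy across diverse scenarios and achieves state-of-the-art performance.
Insight: The novelty lies in decoupling global geospatial representations into two complementary injection pathways, structural priors and global semantics, which modulate and integrate from the perspectives of macroscopic structural constraints and deep semantic alignment, respectively. This offers a general and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.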
Abstract: Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.
[86] PC2Model: ISPRS benchmark on 3D point cloud to model registration cs.CVPDF
Mehdi Maboudi, Said Harb, Jackson Ferrao, Kourosh Khoshelham, Yelda Turkan
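TL;DR: This paper introduces the PC2Model benchmark dataset, aimed at challenges in point cloud-to-3D model (PC2Model) registration such as sparsity, noise, and occlusion. The dataset combines simulated point clouds with real-world scans, provides a training and evaluation platform for both classical and data-driven methods, and supports analysis of model transferability from simulated to real-world scenarios.
Details
Motivation: With advances in point cloud acquisition technologies (such as LiDAR) and deep learning, research focus has shifted toward downstream tasks, particularly point cloud-to-model registration. Existing data-driven methods, however, are limited by sparsity, noise, clutter, and occlusions in real-world scans, and a public benchmark is needed to support method development and evaluation.
Result: The paper presents the PC2Model benchmark dataset, developed under the leadership of ICWG II/Ib. It combines simulated point clouds (providing precise ground truth) with real-world scans (introducing sensor and environmental artefacts), supports robust training and evaluation across domains, and is publicly available at https://zenodo.org/uploads/17581812.
Insight: The novelty is the dataset’s hybrid design, which integrates simulated and real data to balance controlled conditions against real-world complexity. This enables systematic study of transfer from simulated to real scenarios and provides a standardized evaluation framework for point cloud registration.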
Abstract: Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR). With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: https://zenodo.org/uploads/17581812.
[87] Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding cs.CVPDF
Kadir Yilmaz, Adrian Kruse, Tristan Höfer, Daan de Geus, Bastian Leibe
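TL;DR: This paper proposes the Volume Transformer (Volt), a general-purpose backbone that adapts the standard Transformer encoder to 3D scene understanding with minimal modifications. It partitions a 3D scene into volumetric patch tokens, processes them with global self-attention, and injects positional information through 3D rotary positional embeddings. The authors find that training directly on existing 3D benchmarks leads to shortcut learning, so they propose a data-efficient training recipe based on strong data augmentation, regularization, and distillation from a convolutional teacher. When supervision is scaled up through joint training on multiple datasets, Volt benefits more than domain-specific 3D backbones and achieves state-of-the-art performance on indoor and outdoor 3D semantic segmentation and on instance segmentation.
Details
Motivation: The goal is to bridge the gap between the general Transformer ecosystem and 3D scene understanding. The 3D field still relies on specialized backbones with strong domain priors, which isolates it from broader Transformer advances and from increasingly optimized software and hardware stacks. The paper explores whether a standard Transformer, with minimal modifications, can serve as a general backbone for 3D scene understanding.
Result: On 3D semantic segmentation, the proposed data-efficient training recipe makes Volt competitive with state-of-the-art (SOTA) methods on standard benchmarks. After supervision is scaled up through multi-dataset joint training, Volt sets new SOTA results on both indoor (e.g., ScanNet) and outdoor (e.g., SemanticKITTI) datasets. Used as a drop-in backbone in a standard 3D instance segmentation pipeline, it again sets a new SOTA.
Insight: The main contributions are: 1) the Volume Transformer (Volt), a simple, general backbone that handles 3D data with minimal changes to the standard Transformer (volumetric patchification and 3D rotary positional embeddings); 2) the finding that the limited scale of current 3D supervision makes Transformers prone to shortcut learning; 3) a training recipe that overcomes the data limitation (strong augmentation, regularization, knowledge distillation); 4) the key insight that, as supervision scales up, this general Transformer backbone benefits more than domain-specific 3D backbones, showing its potential as a simple, scalable, general-purpose 3D backbone.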
Abstract: Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data-efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state-of-the-art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain-specific 3D backbones, achieving state-of-the-art results across indoor and outdoor datasets. Finally, when used as a drop-in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general-purpose backbone for 3D scene understanding.
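Illustrative sketch: one simple way to picture "volumetric patch tokens" is to bucket 3D points into a regular grid of cubes and treat each non-empty cube as one token. The abstract does not specify Volt's patchification, so the grid rule, the `patchify` name, and the `patch_size` value below are assumptions.

```python
import math
from collections import defaultdict

def patchify(points, patch_size):
    """Group 3D points by the integer index of the cube that contains them.
    Each non-empty cube would become one token for the Transformer."""
    patches = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / patch_size),
               math.floor(y / patch_size),
               math.floor(z / patch_size))
        patches[key].append((x, y, z))
    return dict(patches)

# Hypothetical scene with four points, using cubes of side 0.5.
points = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.2, 0.1, 0.0), (-0.3, 0.0, 0.9)]
tokens = patchify(points, patch_size=0.5)
```

The integer cube index of each token is also a natural input for a 3D positional encoding, since it records where the patch sits along each axis.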
[88] GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction cs.CVPDF
Pradyumna YM, Yuxuan Xue, Yue Chen, Nikita Kister, István Sárándi
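TL;DR: GRAFT is a fast feed-forward method, built on a Geometric Refinement And Fitting Transformer, for reconstructing physically plausible 3D human-scene interactions (HSI) from a single image. It predicts Interaction Gradients to iteratively refine human meshes, retaining fast inference while addressing the floating and interpenetration artifacts common in existing feed-forward methods.
Details
Motivation: Existing single-image HSI reconstruction methods face a trade-off: optimization-based methods are accurate but slow (about 20 seconds), while feed-forward methods are fast but lack explicit interaction reasoning and produce implausible interaction artifacts. The paper aims to achieve fast, physically plausible feed-forward inference by amortizing geometric fitting.
Result: In experiments, GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches the interaction quality of optimization-based methods at roughly 50× lower runtime. In a three-way user study, GRAFT was preferred 64.8% of the time, and it generalizes to in-the-wild multi-person scenes.
Insight: The contributions include amortizing geometric fitting into feed-forward inference, introducing Interaction Gradients for iterative refinement, encoding the interaction state with body-anchored tokens grounded by Geometric Probes, and serving as a transferable plug-and-play HSI prior that improves other methods. This offers a new route to fast, physically accurate HSI reconstruction.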
Abstract: Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ${\sim}50{\times}$ lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of cases in a three-way user study. Project page: https://pradyumnaym.github.io/graft.
[89] MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation cs.CVPDF
Xuejiao Wang, Bohao Zhang, Changbo Wang, Gaoqi He
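TL;DR: This paper proposes MoSA, a motion-guided semantic alignment method for dynamic scene graph generation (DSGG). A Motion Feature Extractor (MFE) encodes object-pair motion attributes (such as distance, velocity, motion persistence, and directional consistency), and a Motion-guided Interaction Module (MIM) fuses these attributes with spatial relationship features to produce motion-aware relationship representations. In addition, a cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories to improve semantic discrimination, and a category-weighted loss strengthens learning of tail relationships.
Details
Motivation: Existing DSGG methods fall short in fine-grained relationship modeling, use of semantic representations, and modeling of tail relationships; the paper aims to address these issues.
Result: Extensive testing on the Action Genome dataset shows that MoSA achieves the best performance.
Insight: The novelty lies in explicitly encoding motion attributes (such as velocity and directional consistency) and fusing them into relationship representations, and in using cross-modal (visual-text) semantic alignment and a category-weighted loss to improve semantic discrimination and tail-relationship learning, respectively. This gives dynamic relationship understanding richer motion cues and semantic supervision.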
Abstract: Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.
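Illustrative sketch: to make the "object-pair motion attributes" concrete, the code below derives a distance, a closing speed, and a relative velocity from two bounding boxes in consecutive frames. The exact attribute definitions used by the MFE are not given in the abstract; these formulas, the box format `(x1, y1, x2, y2)`, and the `pair_motion` name are assumptions.

```python
import math

def center(box):
    """Center of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def pair_motion(subj_prev, subj_cur, obj_prev, obj_cur, dt=1.0):
    """Simple motion attributes for a subject-object pair across two frames."""
    sp, sc = center(subj_prev), center(subj_cur)
    op, oc = center(obj_prev), center(obj_cur)
    dist_prev, dist_cur = math.dist(sp, op), math.dist(sc, oc)
    # Subject displacement minus object displacement, per unit time.
    rel_velocity = (((sc[0] - sp[0]) - (oc[0] - op[0])) / dt,
                    ((sc[1] - sp[1]) - (oc[1] - op[1])) / dt)
    return {
        "distance": dist_cur,
        "closing_speed": (dist_prev - dist_cur) / dt,  # > 0 means approaching
        "relative_velocity": rel_velocity,
    }

# Hypothetical boxes: the subject moves one unit right, the object stays put.
feats = pair_motion((0, 0, 2, 2), (1, 0, 3, 2), (4, 0, 6, 2), (4, 0, 6, 2))
```

Attributes like motion persistence or directional consistency would need more than two frames, for example by comparing successive `relative_velocity` vectors over a window.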
[90] CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers cs.CVPDF
Weidong Chen, Dexiang Hong, Zhendong Mao, Yutao Cheng, Xinyan Liu
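TL;DR: This paper proposes CreatiParser, a hybrid generative framework that parses raster graphic design images into editable layers (such as text, background, and stickers). It uses a vision-language model to parse text regions into a text rendering protocol and a multi-branch diffusion architecture with RGBA support to generate background and sticker layers, and it introduces ParserReward together with Group Relative Policy Optimization to improve generation quality.
Details
Motivation: Existing generative models usually output rasterized images without explicit layer structure, which limits downstream editing, while existing design parsing methods mostly use multi-stage pipelines that suffer from error accumulation and limited controllability. The paper addresses the challenge of decomposing a design image into editable layers.
Result: Extensive experiments on the Parser-40K and Crello datasets show that the method outperforms existing methods, with an overall average improvement of 23.7% across all metrics, reaching state-of-the-art performance.
Insight: The contributions include: 1) a hybrid generative framework that combines a vision-language model with diffusion models for end-to-end layer parsing; 2) ParserReward and Group Relative Policy Optimization, which align generation quality with human design preferences; 3) a multi-branch diffusion architecture with RGBA support, which strengthens generation of background and sticker layers.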
Abstract: Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.
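Illustrative sketch: Group Relative Policy Optimization scores each sampled output relative to the other outputs in its group, typically by standardizing the group's rewards, so no separate value network is needed. The snippet below shows only that advantage step. The reward values are hypothetical stand-ins for ParserReward scores, and using the population standard deviation with a small `eps` is one common choice, not necessarily the paper's.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward scores for three parses of the same design image.
rewards = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(rewards)
```

Samples above the group mean get positive advantages and are reinforced; samples below it are pushed down. When all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.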
[91] CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation cs.CVPDF
Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin
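TL;DR: CoInteract is an end-to-end framework for human-object interaction video synthesis. Through spatially structured co-generation and a human-aware mixture-of-experts mechanism, it significantly outperforms existing methods in preserving structural stability and physically plausible contact.
Details
Motivation: To address two failures of current diffusion models when synthesizing human-object interaction videos: structural instability in sensitive regions such as hands and faces, and unrealistic physical contact between hands and objects (for example, interpenetration).
Result: Experimental results show that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
Insight: The contributions include: 1) a Human-Aware Mixture-of-Experts that routes tokens to lightweight region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead; 2) Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors, with the HOI branch removed at inference for zero-overhead RGB generation.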
Abstract: Synthesizing human–object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand–object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
[92] InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement cs.CVPDF
Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva
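TL;DR: InHabit is a fully automatic, scalable data generation method that transfers the commonsense knowledge of 2D foundation models into 3D scenes. Through a render-generate-lift pipeline, it populates 3D scenes with realistic humans interacting with the scene, producing a large-scale 3D human-scene interaction dataset.
Details
Motivation: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with environments, but such data is scarce. Real-world motion capture is costly and confined to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context.
Result: Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, with 78K samples across 800 building-scale scenes, including complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting training with this data improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study the data is preferred over the state of the art (SOTA) in 78% of cases.
Insight: The novelty is a method that draws on the commonsense knowledge of 2D vision-language and image-editing models (trained on internet-scale data) to generate physically plausible 3D human interaction data through an automated pipeline, transferring 2D priors into 3D scenes and addressing the scarcity of 3D interaction data.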
Abstract: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
[93] MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation cs.CVPDF
Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao
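TL;DR: This paper proposes MMControl, a multi-modal control framework for joint audio-video generation. Through a dual-stream conditional injection mechanism, it supports visual and acoustic control signals including reference images, reference audio, depth maps, and pose sequences, enabling fine-grained, composable control over character identity, voice timbre, body pose, and scene layout.
Details
Motivation: Existing controllable generation frameworks are typically restricted to video-only control, which limits comprehensive controllability in joint audio-video generation and often leads to poor cross-modal alignment. The paper aims to close this gap with multi-modal control.
Result: Extensive experiments show that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
Insight: The contributions include a dual-stream conditional injection mechanism that injects multiple visual and acoustic control signals into the joint generation process, and modality-specific guidance scaling, which lets users independently and dynamically adjust the influence of each condition at inference time, improving cross-modal alignment and controllability.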
Abstract: Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
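Illustrative sketch: the abstract does not give the formula for modality-specific guidance scaling. One common way to realize per-condition control is a multi-condition form of classifier-free guidance, where each condition adds its own scaled offset from the unconditional prediction. The code below assumes that form; the function name, condition names, and numbers are hypothetical.

```python
def guided_prediction(uncond, cond_preds, scales):
    """Combine predictions as uncond + sum_k s_k * (cond_k - uncond).

    uncond:     model output with no conditions (list of floats)
    cond_preds: {condition name: model output with only that condition}
    scales:     {condition name: guidance scale}; missing names default to 1.0
    """
    out = list(uncond)
    for name, pred in cond_preds.items():
        s = scales.get(name, 1.0)
        for i in range(len(out)):
            out[i] += s * (pred[i] - uncond[i])
    return out

# Hypothetical two-dimensional predictions for a pose and an audio condition.
uncond = [0.0, 0.0]
cond_preds = {"pose": [1.0, 0.0], "audio": [0.0, 2.0]}
guided = guided_prediction(uncond, cond_preds, {"pose": 2.0, "audio": 0.5})
```

Setting a scale to 0 switches that condition off, and raising it strengthens only that condition, which matches the independent, inference-time adjustment the abstract describes.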
[94] Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks cs.CVPDF
Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang
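TL;DR: This paper proposes StepSTEM, a fine-grained benchmark for evaluating the cross-modal reasoning of multimodal large language models in STEM domains. It contains 283 graduate-level problems in mathematics, physics, chemistry, biology, and engineering, curated so that textual and visual inputs are strictly complementary, which rules out unimodal shortcuts. The paper also proposes a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions.
Details
Motivation: Existing benchmarks for multimodal large language models are problematic in specialized domains such as STEM: modality redundancy often lets models solve problems through unimodal shortcuts (for example, relying on text alone), and they focus mainly on final-answer accuracy while overlooking the reasoning process itself.
Result: In experiments across a wide range of models, even Gemini 3.1 Pro and Claude Opus 4.6 reach only 38.29% accuracy on StepSTEM, indicating that current MLLMs still rely heavily on textual reasoning and leave substantial headroom for genuine cross-modal STEM reasoning. The benchmark is positioned as a tool for fine-grained evaluation of multimodal reasoning.
Insight: The novelty lies in building a fine-grained STEM benchmark (StepSTEM) that enforces modality complementarity, and in a general step-level evaluation framework that aligns and scores reasoning steps, revealing models’ cross-modal reasoning abilities and limitations in more depth than final-answer correctness alone.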
Abstract: Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
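Illustrative sketch: aligning predicted reasoning steps with a reference by dynamic programming can be pictured as an order-preserving matching that maximizes total step similarity, taking the best score over several references. The paper's actual scoring is not described in the abstract; the token-overlap similarity, the normalization by reference length, and all names below are assumptions.

```python
def jaccard(a, b):
    """Token-overlap similarity between two step descriptions (0 to 1)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align_score(pred, ref, sim=jaccard):
    """Best order-preserving one-to-one matching of pred steps to ref steps,
    normalized by the number of reference steps."""
    n, m = len(pred), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j],                      # skip a predicted step
                           dp[i][j - 1],                      # skip a reference step
                           dp[i - 1][j - 1] + sim(pred[i - 1], ref[j - 1]))
    return dp[n][m] / m if m else 0.0

def best_over_references(pred, refs):
    """Score against every reference solution and keep the best."""
    return max(align_score(pred, r) for r in refs)

# Hypothetical predicted steps and two hypothetical reference solutions.
pred = ["compute the force", "apply newton second law"]
refs = [["compute the force", "apply newton second law"],
        ["use energy conservation"]]
best = best_over_references(pred, refs)
```

Because the matching must preserve order, the same steps given in reverse score lower than the correctly ordered chain, which is the kind of process-level signal a final-answer metric misses.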
[95] Face Anything: 4D Face Reconstruction from Any Image Sequence cs.CV
Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner
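TL;DR: This paper presents a unified method for high-fidelity 4D facial reconstruction from arbitrary image sequences, based on canonical facial point prediction. By mapping each pixel to a normalized facial coordinate in a shared canonical space, it recasts dense tracking and dynamic reconstruction as a canonical reconstruction problem. A single feed-forward model predicts depth and canonical coordinates jointly, yielding temporally consistent geometry and reliable correspondences.
Details
Motivation: Accurately reconstructing and tracking dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation.
Result: Extensive experiments on image and video benchmarks show state-of-the-art performance on reconstruction and tracking tasks: approximately 3x lower correspondence error and faster inference than prior dynamic reconstruction methods, together with a 16% improvement in depth accuracy.
Insight: The novelty is the canonical facial point prediction representation, which unifies dense tracking and dynamic reconstruction as a reconstruction problem in canonical space. A transformer-based model jointly predicts depth and canonical coordinates, so a single architecture delivers accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking.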
Abstract: Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
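Because every pixel carries a coordinate in the same canonical space, correspondences between two frames reduce to matching canonical coordinates. The following is a minimal sketch of that idea, assuming per-pixel canonical coordinates have already been predicted. The dictionary layout, the brute-force nearest-neighbour search, and the `max_dist` cutoff are illustrative assumptions, not the paper's tracking procedure.

```python
from math import dist
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]               # (row, col) in image space
Canon = Tuple[float, float, float]    # normalized coordinate in canonical space


def match_pixels(
    frame_a: Dict[Pixel, Canon],
    frame_b: Dict[Pixel, Canon],
    max_dist: float = 0.05,           # assumed rejection threshold
) -> List[Tuple[Pixel, Pixel]]:
    """For each pixel in frame_a, find the frame_b pixel whose canonical
    coordinate is closest; drop pairs farther apart than max_dist."""
    if not frame_b:
        return []
    matches = []
    for pa, ca in frame_a.items():
        # Brute-force nearest neighbour in canonical space (O(|A| * |B|)).
        pb, cb = min(frame_b.items(), key=lambda item: dist(ca, item[1]))
        if dist(ca, cb) <= max_dist:
            matches.append((pa, pb))
    return matches
```

The point of the sketch is that no frame-to-frame optical flow is needed: both frames are compared through the shared canonical space, so matches stay consistent over long sequences as long as the per-pixel predictions are accurate.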
[96] SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model cs.CV
Zewei Zhou, Ruining Yang, Xuewei Qi, Yiluan Guo
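TL;DR: SpanVLA is a new end-to-end vision-language-action (VLA) framework for autonomous driving that combines autoregressive reasoning with a flow-matching action expert to improve efficiency and robustness. It introduces an efficient bridge that uses the VLM's vision and reasoning guidance to plan future trajectories, and a GRPO-based post-training method that learns from both positive and negative samples.
Details
Motivation: Existing VLA models for autonomous driving generate actions autoregressively, which leads to high action-generation latency and limited robustness.
Result: Extensive experiments on the NAVSIM (v1 and v2) benchmarks show competitive performance, and qualitative results across diverse scenarios illustrate the model's planning performance and robustness.
Insight: The contributions are: 1) a flow-matching policy conditioned on historical trajectory initialization that efficiently bridges VLM reasoning and action generation and significantly reduces inference time; 2) a GRPO-based post-training method that lets the model learn from positive driving samples and from negative-recovery samples, improving robustness; 3) mReasoning, a new dataset focused on complex reasoning scenarios and negative-recovery samples.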
Abstract: Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high action-generation latency under an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that leverages the vision and reasoning guidance of the VLM to plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method that enables the VLA model to learn not only from positive driving samples but also how to avoid typical negative behaviors and how to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, the qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
[97] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis cs.CV
Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang
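TL;DR: This paper proposes ReImagine, which rethinks controllable, high-quality human video generation through an image-first synthesis strategy that decouples high-quality appearance modeling from temporal consistency. It combines a pretrained image backbone, SMPL-X-based motion guidance, and a training-free temporal refinement stage to produce high-quality, temporally consistent videos with controllable pose and viewpoint.
Details
Motivation: With limited multi-view data, existing methods struggle to jointly model human appearance, motion, and camera viewpoint, which limits controllability or visual quality. The paper approaches the problem from an image-first perspective.
Result: The method produces high-quality, temporally consistent videos under diverse poses and viewpoints. The authors also release a canonical human dataset and an auxiliary model for compositional human image synthesis.
Insight: The key idea is to use an image-generation prior to decouple appearance modeling from temporal modeling, and to combine a pretrained image backbone with SMPL-X motion guidance for controllable synthesis. The temporal refinement stage builds on a pretrained video diffusion model and requires no additional training.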
Abstract: Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
[98] CityRAG: Stepping Into a City via Spatially-Grounded Video Generation cs.CV
Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng
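TL;DR: CityRAG is a video generative model that aims to produce 3D-consistent, navigable, spatially grounded environments that simulate real locations. It uses geo-registered data as context to anchor generation to the physical scene while retaining learned priors for complex motion and appearance changes. The model generates coherent, physically grounded video sequences lasting minutes, maintains weather and lighting conditions, achieves loop closure, and navigates complex trajectories to reconstruct real-world geography.
Details
Motivation: Existing text-to-video and image-to-video models struggle to generate real-world environments that are 3D-consistent, navigable, and spatially grounded, a capability that is essential for downstream applications such as autonomous driving and robotics simulation.
Result: Experiments show that CityRAG generates coherent, physically grounded video sequences lasting minutes, maintains weather and lighting conditions over thousands of frames, achieves loop closure, and navigates complex trajectories to reconstruct real-world geography.
Insight: The novelty lies in grounding generation with geo-registered data as context and in training on temporally unaligned data, so that the model semantically disentangles the underlying scene from its transient attributes. This combines the accuracy of the physical scene with the priors of a generative model and suggests a new approach to simulating dynamic environments.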
Abstract: We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
[99] AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model cs.CV
Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai
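TL;DR: AnyRecon is a sparse-view 3D reconstruction framework built on a video diffusion model that reconstructs large-scale scenes from arbitrary, unordered sparse inputs. It addresses the geometric-consistency and scalability limits of existing methods with a persistent global scene memory and a geometry-aware conditioning strategy.
Details
Motivation: Existing diffusion-based methods usually condition on only one or two capture frames, which restricts geometric consistency and makes it hard to scale to large or diverse scenes. AnyRecon aims to address these challenges in sparse-view 3D reconstruction with a scalable framework that preserves explicit geometric control.
Result: Extensive experiments show robust and scalable reconstruction under irregular inputs, large viewpoint gaps, and long trajectories. The abstract does not report specific benchmarks or quantitative comparisons with state-of-the-art methods.
Insight: The contributions include: a persistent global scene memory built from a prepended capture-view cache to support long-range conditioning; a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval; and 4-step diffusion distillation combined with context-window sparse attention to reduce quadratic complexity and keep the method efficient.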
Abstract: Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
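To make the cost argument concrete, here is a small sketch of how a prepended cache plus a local context window cuts the number of attended pairs. The abstract does not specify AnyRecon's actual attention pattern, so the layout below (cache entries first, every position attending to the whole cache, generated frames additionally attending to neighbours within `window`) is purely an assumption for illustration, as are the function and parameter names.

```python
from typing import List


def sparse_attention_mask(num_cache: int, num_frames: int, window: int) -> List[List[bool]]:
    """Frame-level attention mask (True = query row may attend to key column).

    Assumed layout: the first num_cache positions are capture-view cache
    entries, followed by num_frames generated frames.
    """
    total = num_cache + num_frames
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_cache:
                mask[q][k] = True      # every position sees the global cache
            elif q >= num_cache and abs(q - k) <= window:
                mask[q][k] = True      # generated frames see nearby frames only
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the query-key pairs that the mask keeps."""
    return sum(sum(row) for row in mask)
```

Under these assumptions the number of kept pairs grows roughly linearly with the number of generated frames for a fixed cache size and window, instead of quadratically as in dense attention.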
cs.LG
[100] EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training cs.LG | cs.AI | cs.CL
Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang
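TL;DR: This paper proposes Explained Variance Policy Optimization (EVPO), an adaptive reinforcement learning method for LLM post-training. By casting baseline selection as a Kalman filtering problem, it proves that explained variance (EV) determines whether a learned critic reduces or inflates the variance of advantage estimates, and it uses EV to switch adaptively, at each training step, between critic-based and batch-mean advantage estimation.
Details
Motivation: The paper addresses a fundamental design choice in RL for LLM post-training: whether to use a learned critic as the baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, but in sparse-reward settings the critic's estimation noise can exceed the state signal it captures and increase variance instead.
Result: Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO, regardless of which fixed baseline is stronger on a given task. The theoretically derived zero threshold is also found to be empirically optimal.
Insight: The core contribution is formalizing baseline selection as a Kalman filtering problem and identifying explained variance as the exact boundary for adaptive switching. This gives variance control in RL a theory-driven, lightweight adaptive mechanism that needs no additional hyperparameter tuning.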
Abstract: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
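The gating rule can be sketched in a few lines. This is a simplified illustration, not the EVPO implementation: it uses the standard batch definition EV = 1 - Var(returns - values) / Var(returns), takes plain `return - value` as the critic-based advantage (a PPO pipeline would typically use GAE), and the function names and the handling of constant-return batches are assumptions.

```python
from statistics import fmean, pvariance
from typing import List, Tuple


def explained_variance(returns: List[float], values: List[float]) -> float:
    """EV = 1 - Var(returns - values) / Var(returns), over one batch."""
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        # Assumption: with constant returns, treat the critic as explaining nothing.
        return 0.0
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def gated_advantages(returns: List[float], values: List[float]) -> Tuple[List[float], str]:
    """Use the critic baseline when EV > 0, otherwise fall back to the batch mean."""
    ev = explained_variance(returns, values)
    if ev > 0.0:
        return [r - v for r, v in zip(returns, values)], "critic"
    mean_return = fmean(returns)
    return [r - mean_return for r in returns], "batch_mean"
```

A constant baseline leaves the residual variance equal to the return variance, so it has EV exactly 0. In this simplified setting, a critic with positive EV has lower residual variance than the batch-mean baseline and one with negative EV has higher, which is consistent with switching at zero.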
[101] Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning cs.LG | cs.CV
Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo
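TL;DR: This paper proposes GDMD, a framework that combines reinforcement learning (RL) with Distribution Matching Distillation (DMD) to recover the quality that diffusion distillation sacrifices for sampling speed in few-step generation. Its core idea is to redefine the reward mechanism so that distillation gradients, rather than raw pixel outputs, serve as the primary optimization signal, which synchronizes the RL policy with the distillation objective and neutralizes optimization divergence.
Details
Motivation: Existing diffusion distillation methods such as DMD often trade quality for speed in few-step generation. Naively fusing RL with the distillation objective relies on suboptimal evaluation of raw samples, which conflicts with the distillation trajectory, and the noisy nature of early-stage generation makes the rewards unreliable. A more reliable optimization mechanism is therefore needed to improve few-step generation quality.
Result: Empirically, GDMD sets a new state of the art for few-step generation. Its 4-step models surpass their multi-step teacher in quality and substantially exceed previous DMDR results on GenEval and human-preference metrics, showing strong scalability potential.
Insight: The claimed novelty is the redefined reward mechanism: DMD gradients are reinterpreted as implicit target tensors so that existing reward models can directly evaluate the quality of distillation updates. Viewed objectively, the gradient-level guidance acts as an adaptive weighting that synchronizes RL with the distillation objective, avoiding the conflicts and noise of conventional sample-based scoring and giving diffusion distillation a more stable and efficient optimization path.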
Abstract: Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
cs.AI
[102] Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning cs.AI | cs.CL | cs.LO
Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
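TL;DR: This paper studies the faithfulness of large language models (LLMs) in formalized logical reasoning, in particular whether models engage in "formalization gaming" to produce proofs that look valid but are unfaithful. The authors evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems, comparing unified generation with a two-stage pipeline that separates formalization from proving.
Details
Motivation: Formal verification guarantees that a proof is valid but not that the formalization is faithful. In natural-language logical reasoning, models must construct axiom systems from scratch, which makes the gap between valid proofs and faithful translations especially acute. The paper investigates whether frontier models exploit this gap when generating Lean 4 proofs.
Result: Despite compilation rates of 87-99% in unified generation, the authors find no evidence of systematic gaming: models prefer reporting failure over forcing proofs. The two-stage pipeline, however, reveals two modes of unfaithfulness: GPT-5 fabricates axioms during proof generation (detectable via cross-stage comparison), while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. High compilation rates or accuracies should therefore not be equated with faithful reasoning.
Insight: The novelty lies in introducing the concept of formalization gaming and designing a two-stage evaluation framework to detect unfaithful behavior in logical reasoning. Viewed objectively, the study exposes hidden error modes of LLMs in formalization tasks, highlights the limits of relying on compilation success or accuracy alone as evaluation metrics, and offers a methodological reference for stricter reasoning evaluation.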
Abstract: Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization-gaming.
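The gap between a valid proof and a faithful formalization is easy to see in a toy Lean 4 file. The example below is hypothetical and not drawn from FOLIO, Multi-LogiEval, or the paper's outputs: every name and premise is invented for illustration. The theorem compiles, but only because an axiom was added that the natural-language premises never state.

```lean
-- Hypothetical natural-language problem:
--   Premise 1: Every bird can fly.
--   Premise 2: Tweety is a penguin.
--   Question:  Can Tweety fly?
axiom Entity : Type
axiom Bird : Entity → Prop
axiom Penguin : Entity → Prop
axiom CanFly : Entity → Prop
axiom tweety : Entity

axiom p1 : ∀ x, Bird x → CanFly x   -- faithful to Premise 1
axiom p2 : Penguin tweety            -- faithful to Premise 2

-- Fabricated axiom: stated nowhere in the premises.
axiom fabricated : ∀ x, Penguin x → Bird x

-- The proof type-checks, yet it depends on the fabricated axiom.
theorem tweety_flies : CanFly tweety :=
  p1 tweety (fabricated tweety p2)

-- Lists the axioms the proof relies on, which exposes `fabricated`.
#print axioms tweety_flies
```

A fabricated axiom of this kind is visible by comparing the axioms used in the proof against those produced at the formalization stage. A mistranslated premise, by contrast, would already be wrong in `p1` or `p2` themselves, so nothing in the file would look out of place.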
[103] SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models cs.AI | cs.CL | cs.RO
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang
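TL;DR: This paper presents SafetyALFRED, a safety evaluation framework built on the embodied agent benchmark ALFRED. By adding six categories of real-world kitchen hazards, it evaluates multimodal large language models on both hazard recognition and active risk mitigation through planning. Models recognize hazards accurately in QA settings but achieve low mitigation success in embodied planning, revealing a gap between static evaluation and physical safety requirements.
Details
Motivation: Multimodal large language models are increasingly used as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. Existing safety evaluations focus on hazard recognition through disembodied question answering and do not assess active mitigation in embodied environments.
Result: Eleven state-of-the-art models from the Qwen, Gemma, and Gemini families are evaluated on SafetyALFRED. Models recognize hazards accurately in QA settings, but their average mitigation success rates for the same hazards in embodied planning are low, indicating that static evaluation does not reflect physical safety performance.
Insight: The novelty is extending safety evaluation from static QA to embodied planning, with an emphasis on corrective actions. Viewed objectively, the study pushes safety evaluation toward benchmarks that prioritize dynamic, interactive mitigation, which is relevant to the practical deployment of autonomous agents.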
Abstract: Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
eess.IV
[104] A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation eess.IV | cs.CV
Nichula Wasalathilaka, Dineth Perera, Oshadha Samarakoon, Buddhi Wijenayake, Roshan Godaliyadda
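TL;DR: This paper benchmarks visual state-space models (SSMs) for remote-sensing semantic segmentation under strictly controlled conditions, comparing representative families including VMamba, MambaVision, and Spatial-Mamba. It finds that intra-family scaling yields only modest gains, that cross-domain generalization is strongly asymmetric, and that boundary delineation is the dominant failure mode under distribution shift.
Details
Motivation: Existing studies rarely isolate encoder effects from decoder and training choices, so the practical advantages of visual SSMs as efficient alternatives to Vision Transformers remain unclear under fair comparison. The paper aims to clarify this with a strictly controlled benchmark.
Result: Models are evaluated on LoveDA and ISPRS Potsdam with a unified 4-stage feature interface and a fixed lightweight decoder. Visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines.
Insight: The claimed contribution is a practical reference benchmark for designing and evaluating future Mamba-based segmentation backbones, obtained by varying only the encoder under a unified, reproducible protocol. Viewed objectively, the core insight is that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone.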
Abstract: Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.
cs.RO
[105] AI-Enabled Image-Based Hybrid Vision/Force Control of Tendon-Driven Aerial Continuum Manipulators cs.RO | cs.CV
Shayan Sepahvand, Farrokh Janabi-Sharifi, Farhad Aghili
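TL;DR: This paper proposes an AI-enabled cascaded hybrid vision/force control framework for tendon-driven aerial continuum manipulators, using constant-strain modeling in SE(3) and treating the platform as a coupled system. The controller is designed for autonomous physical interaction with a static environment while stabilizing the image feature error. It combines cascaded fast fixed-time sliding mode control with a radial basis function (RBF) neural network to handle uncertainties in the eye-in-hand monocular camera images and the force-sensor measurements, learning them rapidly online without offline training. Line features for visual servoing are extracted with a state-of-the-art graph neural network architecture rather than heuristic geometric line extractors, so that the desired normal interaction force is tracked during contact while the image feature error is regulated. A comparative study benchmarks the controller against established rigid-arm aerial manipulation methods and evaluates robustness across scenarios and feature extraction strategies. Simulation and experimental results show that the method is effective under various initial conditions and robust when executing manipulation tasks.
Details
Motivation: The paper addresses how a tendon-driven aerial continuum manipulator can handle uncertainty in both vision and force sensing during autonomous physical interaction tasks while achieving stable, robust control.
Result: Simulation and experimental results show that the proposed method is effective under various initial conditions and performs robustly in manipulation tasks. A comparative study benchmarks it against established rigid-arm aerial manipulation methods.
Insight: The contributions are: 1) combining cascaded fast fixed-time sliding mode control with an RBF neural network to learn vision- and force-related uncertainties online, without offline training; 2) extracting line features for visual servoing with a state-of-the-art graph neural network architecture instead of traditional heuristic geometric line extractors, with the aim of more accurate and robust feature extraction; 3) constant-strain modeling in SE(3) that treats the platform as a coupled system and enables hybrid vision/force control.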
Abstract: This paper presents an AI-enabled cascaded hybrid vision/force control framework for tendon-driven aerial continuum manipulators based on constant-strain modeling in $SE(3)$ as a coupled system. The proposed controller is designed to enable autonomous, physical interaction with a static environment while stabilizing the image feature error. The developed strategy combines the cascaded fast fixed-time sliding mode control and a radial basis function neural network to cope with the uncertainties in the image acquired by the eye-in-hand monocular camera and the measurements from the force sensing apparatus. This ensures rapid, online learning of the vision- and force-related uncertainties without requiring offline training. Furthermore, the features are extracted via a state-of-the-art graph neural network architecture employed by a visual servoing framework using line features, rather than relying on heuristic geometric line extractors, to concurrently contribute to tracking the desired normal interaction force during contact and regulating the image feature error. A comparative study benchmarks the proposed controller against established rigid-arm aerial manipulation methods, evaluating robustness across diverse scenarios and feature extraction strategies. The simulation and experimental results showcase the effectiveness of the proposed methodology under various initial conditions and demonstrate robust performance in executing manipulation tasks.
[106] VLA Foundry: A Unified Framework for Training Vision-Language-Action Models cs.RO | cs.AI | cs.CV | cs.LG | cs.SE
Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah
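TL;DR: This paper presents VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. It provides end-to-end control from language pretraining to action-expert fine-tuning and supports both from-scratch training and pretrained backbones from Hugging Face. The authors train and release two types of models to validate the framework and evaluate their closed-loop policy performance on the LBM Eval simulator.
Details
Motivation: Current open-source VLA efforts usually focus on the action training stage and have to stitch together incompatible pretraining pipelines. The paper aims to provide a unified, end-to-end framework that simplifies and consolidates the training of vision-language-action models.
Result: In the nominal evaluation setting on the LBM Eval simulator, the model trained fully from scratch is on par with the authors' prior closed-source work, while the model built on the pretrained Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms the baseline by a wide margin.
Insight: The main contribution is a unified, end-to-end training framework that integrates the LLM, VLM, and VLA training stages and removes the fragmentation and incompatibility of existing pipelines. Viewed objectively, its support for both from-scratch training and pretrained backbones, together with the open-sourced code and model weights, gives it strong practical value for the community.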
Abstract: We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM->VLM->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy, outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.