cs.CV [Total: 79]
cs.CL [Total: 36]
q-bio.BM [Total: 1]
cs.GR [Total: 3]
cs.AI [Total: 4]
cs.HC [Total: 2]
cs.DB [Total: 1]
q-fin.ST [Total: 1]
cs.IR [Total: 1]
eess.IV [Total: 5]
cs.SD [Total: 1]
cs.RO [Total: 1]
q-bio.NC [Total: 1]
eess.AS [Total: 1]
cs.LG [Total: 13]

cs.CV

Zheng Han, Jun Zhou, Jialun Pei, Jing Qin, Yingfang Fan

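TL;DR: The paper proposes a method for deformation modeling in augmented reality (AR)-guided surgical navigation that combines a data-driven biomechanics algorithm with an interactive prompting mechanism. It preserves the accuracy of the finite element method (FEM) while improving computational efficiency, and introduces a human-in-the-loop mechanism to dynamically correct anatomical misalignments. Experiments show that it significantly improves the accuracy and reliability of surgical navigation.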

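Motivation: In surgical navigation, precisely aligning preoperative organ models with the dynamically changing intraoperative anatomy is a key challenge. The conventional finite element method is accurate but computationally expensive, and it struggles with large anatomical changes such as those caused by pneumoperitoneum or ligament dissection. Existing algorithms cannot guarantee accuracy in these scenarios, which limits the reliability of AR navigation.

Result: Experiments on a public dataset show a mean target registration error of 3.42 mm, which drops further to 2.78 mm with interactive prompts, outperforming existing methods.

Insight: 1. Data-driven methods can achieve efficient and accurate deformation modeling in surgical navigation;
2. A human-in-the-loop mechanism can effectively incorporate clinical expertise, improving navigation robustness in complex surgical scenarios.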

Abstract: In augmented reality (AR)-guided surgical navigation, preoperative organ models are superimposed onto the patient’s intraoperative anatomy to visualize critical structures such as vessels and tumors. Accurate deformation modeling is essential to maintain the reliability of AR overlays by ensuring alignment between preoperative models and the dynamically changing anatomy. Although the finite element method (FEM) offers physically plausible modeling, its high computational cost limits intraoperative applicability. Moreover, existing algorithms often fail to handle large anatomical changes, such as those induced by pneumoperitoneum or ligament dissection, leading to inaccurate anatomical correspondences and compromised AR guidance. To address these challenges, we propose a data-driven biomechanics algorithm that preserves FEM-level accuracy while improving computational efficiency. In addition, we introduce a novel human-in-the-loop mechanism into the deformation modeling process. This enables surgeons to interactively provide prompts to correct anatomical misalignments, thereby incorporating clinical expertise and allowing the model to adapt dynamically to complex surgical scenarios. Experiments on a publicly available dataset demonstrate that our algorithm achieves a mean target registration error of 3.42 mm. Incorporating surgeon prompts through the interactive framework further reduces the error to 2.78 mm, surpassing state-of-the-art methods in volumetric accuracy. These results highlight the ability of our framework to deliver efficient and accurate deformation modeling while enhancing surgeon-algorithm collaboration, paving the way for safer and more reliable computer-assisted surgeries.
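
As a point of reference for the error figures above, target registration error is commonly computed as the mean Euclidean distance between corresponding target points after registration. The sketch below shows that computation under this assumption; the function name and the sample coordinates are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def mean_tre(predicted, reference):
    """Mean Euclidean distance between matched 3D target points (same unit as inputs)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("point sets must be non-empty and of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)

# Hypothetical target positions in millimetres (illustrative, not from the paper).
reference = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0)]

# Per-target distances are 5.0 and 1.0, so the mean is 3.0.
print(mean_tre(predicted, reference))
```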

[2] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving cs.CV | cs.RO

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan

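TL;DR: This paper proposes the ReCogDrive framework, which combines vision-language models (VLMs) with a diffusion planner and uses three-stage training (domain adaptation, imitation learning, reinforcement learning) to improve end-to-end autonomous driving in long-tail scenarios, setting a new SOTA on the NAVSIM benchmark.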

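Motivation: To address the performance degradation of end-to-end autonomous driving in rare and long-tail scenarios, while overcoming three limitations of existing methods: the domain gap between VLMs and real-world driving data, the dimensionality mismatch between the discrete language space and the continuous action space, and the tendency of imitation learning to capture average behavior.

Result: Achieves 89.6 PDMS on the NAVSIM benchmark, surpassing the previous vision-only method by 5.6 PDMS and setting a new SOTA.

Insight: Combining the world knowledge of VLMs with the generative capability of a diffusion planner effectively addresses long-tail scenarios in autonomous driving, and reinforcement learning fine-tuning further improves safety and the human-likeness of the generated driving behavior.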

Abstract: Although end-to-end autonomous driving has made remarkable progress, its performance degrades significantly in rare and long-tail scenarios. Recent approaches attempt to address this challenge by leveraging the rich world knowledge of Vision-Language Models (VLMs), but these methods suffer from several limitations: (1) a significant domain gap between the pre-training data of VLMs and real-world driving data, (2) a dimensionality mismatch between the discrete language space and the continuous action space, and (3) the tendency of imitation learning to capture the average behavior present in the dataset, which may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an autonomous driving system that integrates VLMs with a diffusion planner and adopts a three-stage training paradigm. In the first stage, we use a large-scale driving question-answering dataset to train the VLMs, mitigating the domain discrepancy between generic content and real-world driving scenarios. In the second stage, we employ a diffusion-based planner to perform imitation learning, mapping representations from the latent language space to continuous driving actions. Finally, we fine-tune the diffusion planner using reinforcement learning with the NAVSIM non-reactive simulator, enabling the model to generate safer, more human-like driving trajectories. We evaluate our approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6 and setting a new state-of-the-art that surpasses the previous vision-only SOTA by 5.6 PDMS.

[3] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems cs.CV

Aniket Rege, Zinnia Nie, Mahesh Ramesh, Unmesh Raskar, Zhuoran Yu

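TL;DR: CuRe is a scalable benchmarking and scoring suite for analyzing cultural representation bias in text-to-image (T2I) systems, with a focus on the underrepresentation of Global South cultures.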

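Motivation: Current T2I systems are trained on predominantly Amero- and Euro-centric data, which leaves Global South cultures underrepresented. CuRe aims to quantify this cultural bias with a novel evaluation method.

Result: Experiments show that CuRe scores correlate strongly with human judgments of perceptual similarity, image-text alignment, and cultural diversity. The evaluation covers multiple T2I systems (e.g., Stable Diffusion, DALL-E 3) and vision-language models.

Insight: Cultural bias in T2I systems, especially the neglect of Global South cultures, urgently needs to be addressed. CuRe provides a scalable evaluation framework that can help drive the development of more inclusive models.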

Abstract: Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset’s categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset are open-sourced and available at https://aniketrege.github.io/cure/.

[4] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation cs.CV | cs.AI

Oishee Bintey Hoque, Abhijin Adiga, Aniruddha Adiga, Siddharth Chaudhary, Madhav V. Marathe

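TL;DR: IGraSS is an iterative framework that combines semantic segmentation with graph constraints to accurately identify infrastructure networks (such as canals and roads) from satellite imagery. By exploiting graph properties such as connectivity and reachability, IGraSS refines imperfect annotations and significantly improves segmentation performance.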

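Motivation: Existing segmentation-based methods for identifying infrastructure networks rely on large, high-quality annotated datasets, but real annotations are often incomplete or noisy. Infrastructure networks such as canals and roads have graph-level properties (e.g., connectivity and reachability) that remain underused. IGraSS aims to improve segmentation performance by iteratively refining the annotations.

Result: IGraSS reduces the share of unreachable canal segments from about 18% to 3%, and training with the refined annotations significantly improves canal identification. It also demonstrates generality and effectiveness on road networks.

Insight: 1. Graph-level properties (such as connectivity and reachability) can serve as strong priors that improve semantic segmentation models; 2. Iteratively refining the annotations and the segmentation model is an effective strategy, especially when annotations are incomplete or noisy.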

Abstract: Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties such as reachability to a source (like canals) or connectivity (roads) that can be leveraged to improve the existing ground truth. This paper develops a novel iterative framework, IGraSS, combining a semantic segmentation module that incorporates RGB and additional modalities (NDWI, DEM) with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data, viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.
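
The reachability constraint mentioned above can be made concrete with a small graph search. The following is only an illustrative sketch, assuming canal segments are nodes in an adjacency map and that a segment counts as reachable when a breadth-first search from a water source arrives at it; the graph, node names, and function name are hypothetical, and this is not the paper's refinement procedure.

```python
from collections import deque

def unreachable_fraction(adjacency, sources):
    """Fraction of segments that a breadth-first search from the sources never visits."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    unreached = [node for node in adjacency if node not in seen]
    return len(unreached) / len(adjacency)

# Hypothetical canal graph: "c" and "d" are cut off from the source "s",
# for example because the segment linking them to "b" is missing from the labels.
adjacency = {
    "s": ["a"],
    "a": ["s", "b"],
    "b": ["a"],
    "c": ["d"],
    "d": ["c"],
}
print(unreachable_fraction(adjacency, ["s"]))  # 2 of 5 segments are unreachable
```

A refinement step could then try to add the missing links and recompute this fraction; how IGraSS actually selects those links is not described in this summary.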

[5] Spectral Domain Neural Reconstruction for Passband FMCW Radars cs.CV

Harshvardhan Takawale, Nirupam Roy

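TL;DR: SpINRv2 is a neural framework for high-fidelity volumetric reconstruction, aimed in particular at FMCW radar with high start frequencies; it addresses phase aliasing and sub-bin ambiguity and improves 3D imaging performance.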

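Motivation: Conventional methods perform poorly with high-frequency FMCW radar because of phase aliasing and sub-bin ambiguity. SpINRv2 aims to solve these problems with a neural framework and frequency-domain modeling.

Result: SpINRv2 significantly outperforms classical and learning-based baselines in high-frequency regimes, establishing a new benchmark for neural radar-based 3D imaging.

Insight: Combining frequency-domain modeling with an implicit neural representation can effectively handle the difficulties of high-frequency radar while reducing computational overhead.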

Abstract: We present SpINRv2, a neural framework for high-fidelity volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar. Extending our prior work (SpINR), this version introduces enhancements that allow accurate learning under high start frequencies, where phase aliasing and sub-bin ambiguity become prominent. Our core contribution is a fully differentiable frequency-domain forward model that captures the complex radar response using closed-form synthesis, paired with an implicit neural representation (INR) for continuous volumetric scene modeling. Unlike time-domain baselines, SpINRv2 directly supervises the complex frequency spectrum, preserving spectral fidelity while drastically reducing computational overhead. Additionally, we introduce sparsity and smoothness regularization to disambiguate sub-bin ambiguities that arise at fine range resolutions. Experimental results show that SpINRv2 significantly outperforms both classical and learning-based baselines, especially under high-frequency regimes, establishing a new benchmark for neural radar-based 3D imaging.
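
To give a feel for why high start frequencies are harder, the sketch below evaluates the textbook FMCW relations for a single point target: beat frequency as chirp slope times round-trip delay, range-bin width c / (2B), and the carrier phase 2πf₀τ wrapped to one cycle. It is not the SpINRv2 forward model; the radar settings and function names are assumptions chosen for illustration.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def beat_frequency(range_m, bandwidth_hz, chirp_s):
    """Beat frequency of a point target: chirp slope times round-trip delay."""
    slope = bandwidth_hz / chirp_s
    return slope * (2.0 * range_m / C)

def range_bin_width(bandwidth_hz):
    """Range spanned by one FFT bin over a full chirp, c / (2B)."""
    return C / (2.0 * bandwidth_hz)

def wrapped_phase(range_m, start_hz):
    """Carrier phase 2*pi*f0*tau wrapped into [0, 2*pi)."""
    return (2.0 * math.pi * start_hz * 2.0 * range_m / C) % (2.0 * math.pi)

# Hypothetical radar settings, not taken from the paper.
START_HZ, BANDWIDTH_HZ, CHIRP_S = 77e9, 4e9, 40e-6
half_wavelength = C / (2.0 * START_HZ)

near = wrapped_phase(1.0, START_HZ)
far = wrapped_phase(1.0 + half_wavelength, START_HZ)
print(f"range bin width: {range_bin_width(BANDWIDTH_HZ):.4f} m")
print(f"half wavelength: {half_wavelength:.6f} m")
print(f"wrapped phases:  {near:.6f} rad vs {far:.6f} rad")
```

With these assumed settings, two targets half a wavelength apart share the same wrapped phase while sitting well inside a single range bin, so the phase wraps many times within one bin as the start frequency grows.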

[6] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework cs.CV | cs.AI

Huixin Zhan, Jason H. Moore

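TL;DR: The paper proposes a method based on a discrete diffusion model and a vision-language-action framework to model surgeons' individual operating styles. Using multimodal inputs (such as endoscopic video and surgical intent language), it generates personalized gesture sequences while quantifying privacy risk.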

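Motivation: Current AI systems for surgery often ignore surgeons' individual operating styles, even though these stylistic differences matter greatly for surgical outcomes. The paper aims to model each surgeon's distinctive behavior patterns from multimodal data and to study the associated privacy risks.

Result: Experiments on the JIGSAWS dataset show that the method accurately reconstructs gesture sequences and learns behavioral fingerprints unique to each surgeon. The study also finds that more personalized embeddings improve task performance but increase the risk of identity leakage.

Insight: Personalized embeddings improve performance in surgical AI but also introduce privacy risk; the two must be balanced during design and deployment.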

Abstract: Surgeons exhibit distinct operating styles due to differences in training, experience, and motor behavior, yet current AI systems often ignore this personalization signal. We propose a novel approach to model fine-grained, surgeon-specific fingerprinting in robotic surgery using a discrete diffusion framework integrated with a vision-language-action (VLA) pipeline. Our method formulates gesture prediction as a structured sequence denoising task, conditioned on multimodal inputs including endoscopic video, surgical intent language, and a privacy-aware embedding of surgeon identity and skill. Personalized surgeon fingerprinting is encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings reveal the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.

[7] Open World Scene Graph Generation using Vision Language Models cs.CV | cs.CL

Amartya Dutta, Kazi Sajeed Mehrab, Medha Sawhney, Abhilash Neog, Mridul Khurana

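TL;DR: The paper proposes Open-World SGG, a training-free framework that uses the existing knowledge of pretrained vision-language models (VLMs) to generate scene graphs without fine-tuning, supporting detection of novel objects and relations in open-world settings.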

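Motivation: Existing scene graph generation (SGG) methods usually depend on dataset-specific supervised learning, which limits their ability to handle new objects or relations in open-world settings. The paper aims to achieve zero-shot scene graph generation by drawing on the knowledge of pretrained VLMs, broadening the applicability of SGG.

Result: Experiments on the Visual Genome, Open Images V6, and Panoptic Scene Graph datasets show that pretrained VLMs can perform relational understanding without task-level training.

Insight: Pretrained VLMs have strong zero-shot structured-reasoning ability and can be applied directly to open-world scene graph generation, offering a new route to SGG without supervised learning.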

Abstract: Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, restricting their usefulness in open-world settings involving novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World evaluation protocol that measures performance when no SGG-specific data have been observed, in terms of either objects or relations. Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate the capacity of pretrained VLMs to perform relational understanding without task-level training.

[8] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra cs.CV | 68T45 | I.5.4; I.2.10; I.3.5

Mateusz Michalkiewicz, Anekha Sokhal, Tadeusz Michalkiewicz, Piotr Pawlikowski, Mahsa Baktashmotlagh

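TL;DR: GIQ is a benchmark designed to evaluate the geometric reasoning of vision and vision-language foundation models. It contains synthetic and real images of polyhedra and reveals significant shortcomings of current models on geometric tasks such as 3D reconstruction and mental rotation.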

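Motivation: Monocular 3D reconstruction methods and vision-language models perform well on standard benchmarks, but whether they truly understand geometric properties remains unclear and calls for systematic evaluation.

Result: Current models perform poorly at reconstructing basic geometric shapes and at tasks requiring geometric differentiation, and vision-language assistants show very low accuracy when judging the properties of complex polyhedra.

Insight: Existing models have a limited understanding of geometric properties; GIQ offers a benchmark and a direction for improving geometry-aware representation learning.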

Abstract: Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra (including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes) covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ is publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.

[9] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation cs.CV | cs.AI | cs.CL | cs.LG

Andrew Z. Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, Yogesh Balaji

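TL;DR: This paper studies the use of modern decoder-only large language models (LLMs) as text encoders in text-to-image generation models. With a standardized training and evaluation pipeline, the authors analyze how embeddings from 12 different text encoders affect generation, finding that the conventional last-layer embedding performs poorly, whereas layer-normalized averaging across all layers significantly improves alignment with complex prompts.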

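Motivation: Many text-to-image models still use the somewhat outdated T5 and CLIP as text encoders, while modern decoder-only LLMs excel in natural language processing. The authors investigate LLMs as text encoders in order to improve text-to-image models.

Result: Experiments show that the conventional last-layer embedding performs poorly, while layer-normalized averaging across all layers significantly improves alignment with complex prompts. With this conditioning, most LLMs outperform the T5 baseline.

Insight: 1. In text-to-image generation, the embedding extraction method is critical to model performance. 2. With a suitable embedding extraction method, modern decoder-only LLMs can significantly improve generative models. 3. Cross-layer embeddings may better capture complex linguistic semantics.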

Abstract: Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.
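
One plausible reading of "layer-normalized averaging across all layers" is sketched below: each layer's hidden state is normalized per token to zero mean and unit variance, and the normalized states are then averaged over layers. This is an assumption about the procedure, not the authors' code; it omits any learned scale or bias, and the nested-list layout `hidden_states[layer][token]` and the toy values are hypothetical.

```python
import statistics

def layer_norm(vector, eps=1e-5):
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = statistics.fmean(vector)
    variance = statistics.pvariance(vector, mu=mean)
    return [(x - mean) / (variance + eps) ** 0.5 for x in vector]

def layer_normalized_average(hidden_states):
    """Average the per-layer normalized features of each token across all layers."""
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    averaged = []
    for token in range(num_tokens):
        normed = [layer_norm(hidden_states[layer][token]) for layer in range(num_layers)]
        averaged.append([sum(values) / num_layers for values in zip(*normed)])
    return averaged

# Toy input: two layers, one token, two features. Layer 1 has a 10x larger scale.
hidden_states = [
    [[1.0, 3.0]],
    [[10.0, 30.0]],
]
embedding = layer_normalized_average(hidden_states)
print(embedding)  # both layers contribute roughly [-1, 1] despite the scale gap
```

Normalizing before averaging keeps layers with large activation magnitudes from dominating the result, which is one reason such a scheme might behave differently from a plain mean over layers.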

[10] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation cs.CV

Ioannis Iakovidis, Zahra Kalantari, Amir Hossein Payberah, Fernando Jaramillo, Francisco Pena Escobar

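TL;DR: The paper proposes a method combining self-supervised learning and deep clustering to detect water hidden under vegetation in radar satellite images, without manually annotated data.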

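Motivation: Conventional models depend on large amounts of manually annotated data, which is slow and expensive to produce. The authors use self-supervised learning to reduce this dependence.

Result: On the test set, the ensemble of self-supervised models improves IoU by 0.02 over a single fully supervised model with the same architecture.

Insight: Self-supervised learning shows promise for remote sensing image analysis: it can lower annotation cost while maintaining performance.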

Abstract: In recent years, the wide availability of high-resolution radar satellite images, along with the advancement of computer vision models, has enabled the remote monitoring of the surface area of wetlands. However, these models require large amounts of manually annotated satellite images, which are slow and expensive to produce. To overcome this problem, self-supervised training methods have been deployed to train models without using annotated data. In this paper we use a combination of deep clustering and negative sampling to train a model to segment radar satellite images into areas that separate water from land without the use of any manual annotations. Furthermore, we implement an ensemble version of the model to reduce variance and improve performance. Compared to a single fully-supervised model using the same architecture, our ensemble of self-supervised models achieves a 0.02 improvement in the Intersection Over Union metric over our test dataset.
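
For readers unfamiliar with the metric quoted above, Intersection over Union for a binary water mask is the number of pixels labeled water in both masks divided by the number labeled water in either. A minimal sketch follows; the masks and the function name are hypothetical, and treating two empty masks as a perfect match is a convention assumed here.

```python
def intersection_over_union(prediction, truth):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(prediction, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

# Hypothetical 2x3 masks (1 = water, 0 = land).
prediction = [[1, 1, 0],
              [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

# Two pixels agree on water and four are water in at least one mask.
print(intersection_over_union(prediction, truth))  # 0.5
```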

[11] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence cs.CV

Octave Mariotti, Zhipeng Du, Yash Bhalgat, Oisin Mac Aodha, Hakan Bilen

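TL;DR: The paper exposes the limited ability of supervised semantic correspondence methods to generalize beyond sparsely annotated keypoints, and proposes a method that lifts 2D keypoints into a canonical 3D space using monocular depth estimation.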

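Motivation: Existing supervised semantic correspondence methods perform well on sparsely annotated keypoints but generalize poorly to unseen keypoints. The paper aims to address this and proposes a method that learns dense correspondences.

Result: Experiments show that the method significantly outperforms supervised baselines on unseen keypoints, and also that unsupervised baselines outperform supervised ones when generalizing across datasets.

Insight: The paper reveals the generalization limits of supervised semantic correspondence methods and shows the importance of a 3D spatial representation for learning dense correspondences.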

Abstract: Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments demonstrate not only that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but also that unsupervised baselines outperform supervised counterparts when generalized across different datasets.

[12] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks cs.CV

Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie

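TL;DR: The paper analyzes 17 benchmarks (e.g., SugarCREPE, VALSE) commonly used to evaluate the compositional understanding of vision-language models (VLMs) and uncovers several inherent biases in their design and construction. Simple heuristics (such as token length or log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. The main cause is a distribution asymmetry between positive and negative images/captions. The authors offer recommendations for building more robust benchmarks.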

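Motivation: Existing vision-language compositional understanding benchmarks have design biases that can make evaluation inaccurate. The study aims to expose these biases and improve benchmark construction so that benchmarks better measure models' true capabilities.

Result: The distribution asymmetry between positive and negative samples is identified as the main flaw of these benchmarks, making their evaluation results unreliable.

Insight: When building compositional understanding benchmarks, keep positive and negative samples symmetric and avoid designs that simple heuristics can easily defeat. Future benchmarks should be more robust and able to distinguish genuine understanding from surface feature matching.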

Abstract: We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks that would be less prone to such simple attacks.
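
A "blind" heuristic of the kind described above never looks at the image. The sketch below scores a token-length heuristic on a toy caption-selection task; the caption pairs are invented for illustration, and the choice to prefer the shorter caption is an assumption (the direction of any length bias depends on how a given benchmark builds its negatives).

```python
def pick_shorter(positive, negative):
    """Blind heuristic: choose the caption with fewer whitespace-separated tokens."""
    return "positive" if len(positive.split()) < len(negative.split()) else "negative"

def heuristic_accuracy(pairs):
    """Share of (positive, negative) caption pairs where the heuristic picks the positive."""
    correct = sum(pick_shorter(pos, neg) == "positive" for pos, neg in pairs)
    return correct / len(pairs)

# Invented caption pairs; no image is consulted at any point.
pairs = [
    ("a dog on a sofa", "a dog on a sofa that is not red"),
    ("two cats sleeping", "two cats sleeping under no blanket"),
    ("a man riding a horse in a field", "a horse riding a man"),
]
print(heuristic_accuracy(pairs))  # 2 of 3 correct on this toy set
```

If a heuristic like this scores well above the 50% chance level on a real benchmark, the benchmark is leaking the answer through caption statistics, which is the asymmetry the abstract points to.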

[13] Highly Compressed Tokenizer Can Generate Without Training cs.CV | cs.AI

L. Lao Beyer, T. Li, X. Chen, S. Karaman, K. He

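TL;DR: The paper finds that a highly compressed 1D image tokenizer enables image editing and generation through heuristic manipulation of tokens, without training a generative model.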

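Motivation: Most image tokenizers produce a 2D grid of spatially arranged tokens, whereas a 1D tokenizer compresses an image into very few discrete tokens. The authors find that this highly compressed token space is richly expressive, which motivates image editing and generation methods that require no generative model training.

Result: The method supports fine-grained image editing (such as transferring appearance and semantic attributes) as well as diverse, realistic image generation, and is demonstrated on inpainting and text-guided editing.

Insight: The highly compressed 1D token space is strongly expressive: simple manipulation or optimization suffices for complex image generation and editing tasks, suggesting a new route to lightweight generative models.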

Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations – such as copying and replacing tokens between latent representations of images – enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer’s latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.

[14] Seeing Voices: Generating A-Roll Video from Audio with Mirage cs.CV | cs.AI | cs.LG

Aditi Sundararaman, Amogh Adishesha, Andrew Jaegle, Dan Bigioi, Hyoung-Kyu Song

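TL;DR: Mirage is an audio-to-video foundation model that generates realistic, expressive video from audio input. It combines self-attention with a unified training method and outperforms existing methods, particularly when generating video of people talking (A-roll).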

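Motivation: Existing video generation methods either ignore audio or target restricted domains such as re-dubbing, and lack general audio-to-video capability. Mirage aims to fill this gap with general-purpose, audio-driven video generation.

Result: The generated videos have better subjective quality than those of existing methods and convincingly render the performance implied by the audio.

Insight: A general-purpose training method combined with self-attention may hold promise for multimodal generation tasks without heavy reliance on task-specific design.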

Abstract: From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video’s audio track) with what we see (the video’s image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or given existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of superior subjective quality to methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).

[15] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging cs.CV | cs.AI

Nhat Thanh Tran, Fanghui Xue, Shuai Zhang, Jiancheng Lyu, Yunling Zheng

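TL;DR: The paper proposes SEMA, a new attention mechanism that uses token localization and arithmetic averaging to address the computational complexity and focusing problems of conventional attention, and outperforms existing vision Mamba models on Imagenet-1k classification.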

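Motivation: Conventional full attention has quadratic computational complexity, and its linear attention variants cannot focus effectively, which limits their use in computer vision tasks. The paper proposes SEMA to address both problems.

Result: On Imagenet-1k classification, SEMA outperforms existing vision Mamba models at increasingly larger image scales with similar model parameter sizes.

Insight: The paper shows that generalized attention disperses, providing a theoretical basis for designing new attention mechanisms. SEMA offers a new direction for efficient and scalable attention design.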

Abstract: Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of the Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA), which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach on Imagenet-1k, where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger scales of images at similar model parameter sizes.
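
The dispersion property can be observed numerically for softmax attention if one assumes the query-key scores stay within a fixed range [-B, B]: every weight is then at most e^{2B}/n, so the largest weight shrinks as the number of keys n grows. The sketch below checks this with random scores; the bound B, the seed, and the function names are assumptions for illustration, and this is not the paper's proof.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(num_keys, bound=2.0, seed=0):
    """Largest softmax weight when scores are drawn uniformly from [-bound, bound]."""
    rng = random.Random(seed)
    scores = [rng.uniform(-bound, bound) for _ in range(num_keys)]
    return max(softmax(scores))

for n in (10, 100, 1000):
    ceiling = math.exp(2 * 2.0) / n  # e^{2B} / n with B = 2
    print(n, round(max_attention_weight(n), 5), "<=", round(ceiling, 5))
```

Restricting each query to a local window of keys keeps n small, which is one way to read the role of token localization in avoiding dispersion.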

[16] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal cs.CV

Kangning Yang, Ling Ouyang, Huiming Sun, Jie Cai, Lan Fu

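TL;DR: The paper presents OpenRR-1k, a high-quality, aligned, and diverse dataset for reflection removal, addressing the lack of high-quality in-the-wild data for existing techniques.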

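Motivation: Existing reflection removal techniques lack high-quality real-world datasets, which limits their effectiveness in real environments.

Result: Experiments show that the OpenRR-1k dataset significantly improves the robustness of reflection removal methods in challenging real-world environments.

Insight: High-quality, diverse datasets are key to making reflection removal practical.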

Abstract: Reflection removal technology plays a crucial role in photography and computer vision applications. However, existing techniques are hindered by the lack of high-quality in-the-wild datasets. In this paper, we propose a novel paradigm for collecting reflection datasets from a fresh perspective. Our approach is convenient, cost-effective, and scalable, while ensuring that the collected data pairs are of high quality, perfectly aligned, and represent natural and diverse scenarios. Following this paradigm, we collect a Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which contains 1,000 high-quality transmission-reflection image pairs collected in the wild. Through the analysis of several reflection removal methods and benchmark evaluation experiments on our dataset, we demonstrate its effectiveness in improving robustness in challenging real-world environments. Our dataset is available at https://github.com/caijie0620/OpenRR-1k.

[17] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating cs.CV

Guandong Li, Mengxia Ye

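TL;DR: The paper proposes a network architecture called STNet that improves the accuracy and generalization of hyperspectral image classification by decoupling spectral and spatial attention and adding adaptive gating mechanisms.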

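Motivation: Hyperspectral image classification faces high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which lead to overfitting and limited generalization. The paper proposes STNet to address these problems.

Result: STNet performs strongly on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification methods.

Insight: With decoupling and gating, the model extracts and fuses spatial and spectral information more effectively while reducing the risk of overfitting.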

Abstract: Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more effectively extract and fuse spatial context with fine spectral information in hyperspectral image (HSI) classification, this paper proposes a novel network architecture called STNet. The core advantage of STNet stems from the dual innovative design of its Spatial-Spectral Transformer module: first, the fundamental explicit decoupling of spatial and spectral attention ensures targeted capture of key information in HSI; second, two functionally distinct gating mechanisms perform intelligent regulation at both the fusion level of attention flows (adaptive attention fusion gating) and the internal level of feature transformation (GFFN). This characteristic demonstrates superior feature extraction and fusion capabilities compared to traditional convolutional neural networks, while reducing overfitting risks in small-sample and high-noise scenarios. STNet enhances model representation capability without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.

[18] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera cs.CV

Yuto Kase, Kai Ishibe, Ryoma Yasuda, Yudai Washida, Sakiko Hashimoto

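TL;DR: The paper proposes a method that uses an event camera to locate the tennis ball's impact position on the racket in real time, addressing the high memory consumption of high-speed cameras and the inefficiency of manual annotation.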

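Motivation: In racket sports such as tennis, precisely locating the impact position is essential for analyzing player performance and designing personalized equipment. Conventional high-speed cameras consume large amounts of memory and manual processing is slow, which limits prolonged scene capture and analysis.

Result: Experiments show that the method's error is within the permissible range for measuring tennis players' performance, and its computation time is short enough for real-time use.

Insight: Event cameras offer low memory consumption and high temporal precision in high-speed motion scenes, opening a new approach to real-time sports analysis.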

Abstract: In racket sports, such as tennis, locating the ball’s position at impact is important in clarifying player and equipment characteristics, thereby aiding in personalized equipment design. High-speed cameras are used to measure the impact location; however, their excessive memory consumption limits prolonged scene capture, and manual digitization for position detection is time-consuming and prone to human error. These limitations make it difficult to effectively capture the entire playing scene, hindering the ability to analyze the player’s performance. We propose a method for locating the tennis ball impact on the racket in real time using an event camera. Event cameras efficiently measure brightness changes (called "events") with microsecond accuracy under high-speed motion while consuming less memory. These cameras enable users to continuously monitor their performance over extended periods. Our method consists of three identification steps: time range of swing, timing at impact, and contours of ball and racket. Conventional computer vision techniques are utilized along with an original event-based processing method to detect the timing at impact (PATS: the amount of polarity asymmetry in time symmetry). The results of the experiments were within the permissible range for measuring tennis players’ performance. Moreover, the computation time was sufficiently short for real-time applications.

[19] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models cs.CV | cs.AI | cs.CL [PDF]

Huixuan Zhang, Junzhe Zhang, Xiaojun Wan

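TL;DR: This paper revisits adaptive methods for classifier-free guidance in text-to-vision diffusion models and proposes Step AG, a general adaptive guidance strategy that significantly improves efficiency while maintaining generation quality.

Details

Motivation: Classifier-free guidance is the dominant approach in current text-to-vision diffusion models, but it requires twice as many model forward passes, which is costly. Previous adaptive guidance methods lack in-depth analysis and generality and need improvement.

Result: Experiments show that Step AG performs well in image quality and text alignment, with an average speedup of 20% to 30%. The improvement is consistent across different numbers of generation steps and across models.

Insight: The key effect of classifier-free guidance is concentrated in the early denoising steps; guidance can be switched off in later steps to save computation without significantly affecting generation quality. This finding suggests a new direction for optimizing efficient generative models.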

Abstract: With the rapid development of text-to-vision generation diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many model forward passes as unconditional generation, resulting in significantly higher costs. While a previous study introduced the concept of adaptive guidance, it lacks solid analysis and empirical results, so the earlier method cannot be applied to general diffusion models. In this work, we present another perspective on applying adaptive guidance and propose Step AG, a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment, and the results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. This improvement is consistent across different settings, such as the number of inference steps, and across various models, including video generation models, highlighting the superiority of our method.
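The idea of restricting classifier-free guidance to the first several denoising steps can be illustrated with a small sketch. This is not the authors' implementation: the denoiser signature, the names `predict_noise` and `forward_passes`, and the choice to fall back to the conditional prediction once guidance is switched off are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature: (latent, step index, prompt or None) -> predicted noise.
Denoiser = Callable[[Vector, int, Optional[str]], Vector]


def cfg_combine(eps_cond: Vector, eps_uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def predict_noise(model: Denoiser, x: Vector, step: int, prompt: str,
                  scale: float, guided_steps: int) -> Vector:
    """Apply guidance only while step < guided_steps (assumed cut-off rule)."""
    eps_cond = model(x, step, prompt)
    if step >= guided_steps:
        # Guidance off: a single forward pass per step (assumption: keep the
        # conditional prediction for the remaining steps).
        return eps_cond
    eps_uncond = model(x, step, None)  # second forward pass, guided steps only
    return cfg_combine(eps_cond, eps_uncond, scale)


def forward_passes(num_steps: int, guided_steps: int) -> int:
    """Count model calls: two per guided step, one per unguided step."""
    guided = min(guided_steps, num_steps)
    return 2 * guided + (num_steps - guided)
```

With these assumptions, guiding all 50 of 50 steps costs 100 model calls, while guiding only the first 20 costs 70. The numbers are illustrative and are not the configuration reported in the paper.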

[20] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding cs.CV [PDF]

Shivang Chopra, Lingchao Mao, Gabriela Sanchez-Rodriguez, Andrew J Feola, Jing Li

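TL;DR: MedMoE is a Mixture-of-Experts (MoE) framework for medical vision-language understanding that dynamically adapts visual representations to the specific needs of different medical imaging modalities.

Details

Motivation: Existing medical vision-language frameworks typically apply a uniform local feature extraction strategy, overlooking modality-specific demands and limiting performance.

Result: On multiple medical benchmarks, MedMoE significantly improves alignment and retrieval performance across modalities.

Insight: Modality-specific visual representations are essential for clinical vision-language systems, and a dynamic routing strategy can effectively capture diagnostic information at different resolutions.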

Abstract: Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking the modality-specific demands. In this work, we present MedMoE, a modular and extensible vision-language processing framework that dynamically adapts visual representation based on the diagnostic context. MedMoE incorporates a Mixture-of-Experts (MoE) module conditioned on the report type, which routes multi-scale image features through specialized expert branches trained to capture modality-specific visual semantics. These experts operate over feature pyramids derived from a Swin Transformer backbone, enabling spatially adaptive attention to clinically relevant regions. This framework produces localized visual representations aligned with textual descriptions, without requiring modality-specific supervision at inference. Empirical results on diverse medical benchmarks demonstrate that MedMoE improves alignment and retrieval performance across imaging modalities, underscoring the value of modality-specialized visual representations in clinical vision-language systems.

[21] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding cs.CV [PDF]

Woohyeon Park, Woojin Kim, Jaeik Kim, Jaeyoung Do

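TL;DR: The paper proposes SECOND (Selective and Contrastive Decoding), a new method that reduces perceptual hallucination in vision-language models and significantly improves the accuracy of image understanding.

Details

Motivation: Existing vision-language models (VLMs) are limited by object hallucination and cannot achieve precise image understanding, so a new method is needed to address this problem.

Result: Experiments show that SECOND outperforms existing methods on multiple benchmarks, demonstrating the potential of multi-scale visual information in VLMs.

Insight: Prioritizing and contrasting multi-scale information is a key direction for improving VLM performance, and it remains largely unexplored.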

Abstract: Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By contrasting this visual information iteratively, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.

[22] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation cs.CV | eess.SP [PDF]

Taiqin Chen, Zikun Zhou, Zheng Fang, Wenzhen Zou, Kanjun Liu

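TL;DR: RadioDUN is a physics-inspired deep unfolding network that incorporates the physical characteristics of wireless propagation models to estimate dense radio maps from sparse samples, outperforming existing methods.

Details

Motivation: Existing methods struggle to incorporate the physical characteristics of radio maps, which leads to poor estimation of dense radio maps from sparse samples.

Result: Experiments show that RadioDUN outperforms existing methods on radio map estimation.

Insight: Incorporating a physical model and a dynamic reweighting mechanism improves the accuracy of sparse signal recovery, and the shadowing loss further improves model performance.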

Abstract: The radio map represents the spatial distribution of spectrum resources within a region, supporting efficient resource allocation and interference mitigation. However, it is difficult to construct a dense radio map because only a limited number of samples can be measured in practical scenarios. While existing works have used deep learning to estimate dense radio maps from sparse samples, they are hard to integrate with the physical characteristics of the radio map. To address this challenge, we cast radio map estimation as a sparse signal recovery problem. A physical propagation model is further incorporated to decompose the problem into multiple factor optimization sub-problems, thereby reducing recovery complexity. Inspired by the existing compressive sensing methods, we propose the Radio Deep Unfolding Network (RadioDUN) to unfold the optimization process, achieving adaptive parameter adjustment and prior fitting in a learnable manner. To account for the radio propagation characteristics, we develop a dynamic reweighting module (DRM) to adaptively model the importance of each factor for the radio map. Inspired by the shadowing factor in the physical propagation model, we integrate obstacle-related factors to express the obstacle-induced signal stochastic decay. The shadowing loss is further designed to constrain the factor prediction and act as a supplementary supervised objective, which enhances the performance of RadioDUN. Extensive experiments have been conducted to demonstrate that the proposed method outperforms the state-of-the-art methods. Our code will be made publicly available upon publication.

[23] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring cs.CV [PDF]

Mingjie Xu, Andrew Estornell, Hongzheng Yang, Yuzhi Zhao, Zhaowei Zhu

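TL;DR: The paper proposes SCALE, which uses a cross-modality assessment framework to improve data selection quality for vision-language models, addressing noisy image-text alignment and ambiguous text, and optimizing VLM instruction tuning datasets.

Details

Motivation: The performance of existing vision-language models (VLMs) depends on large-scale, high-quality datasets, but noisy image-text alignment and ambiguous text limit model performance, so data selection methods need improvement.

Result: The paper reveals the shortcomings of existing unimodal assessment and shows that generated captions are effective for unifying multimodal tasks into the text modality.

Insight: Multimodal tasks can be optimized by unifying them in the text modality, and data quality assessment must account for both task suitability and robustness.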

Abstract: The application of visual instruction tuning and other post-training techniques has significantly enhanced the capabilities of Large Language Models (LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with more comprehensive visual language datasets. However, the effectiveness of VLMs is highly dependent on large-scale, high-quality datasets that ensure precise recognition and accurate reasoning. Two key challenges hinder progress: (1) noisy alignments between images and the corresponding text, which leads to misinterpretation, and (2) ambiguous or misleading text, which obscures visual content. To address these challenges, we propose SCALE (Single modality data quality and Cross modality Alignment Evaluation), a novel quality-driven data selection pipeline for VLM instruction tuning datasets. Specifically, SCALE integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task, generates general and task-specific captions (covering scenes, objects, style, etc.), and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry based on the generated captions. We reveal that: (1) current unimodal quality assessment methods evaluate one modality while overlooking the rest, which can underestimate samples essential for specific tasks and discard the lower-quality instances that help build model robustness; and (2) appropriately generated image captions provide an efficient way to transfer the image-text multimodal task into a unified text modality.

[24] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance cs.CV [PDF]

June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin

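TL;DR: The paper proposes adaptive low-pass guidance (ALG), which addresses the insufficient motion dynamics of videos generated by image-to-video (I2V) models by adaptively low-pass filtering the input image in the early stage of denoising.

Details

Motivation: Fine-tuned I2V models tend to produce videos with insufficient motion because high-frequency details of the input image are exposed too early, making their outputs more static than those of text-to-video (T2V) models.

Result: Experiments show that ALG significantly improves the dynamics of generated videos (an average improvement of 36% in dynamic degree on the VBench-I2V test suite) without compromising video quality or image fidelity.

Insight: Controlling when the high-frequency information of the input image is exposed during generation can effectively balance video dynamics and static fidelity.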

Abstract: Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos, while preserving image fidelity and text alignment. In particular, on the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
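A rough sketch of low-pass filtering the conditioning signal only during the early denoising steps is given below. The abstract does not specify the filter or the schedule, so the box blur, the linear schedule, and every name and parameter here (`box_blur`, `blur_radius`, `early_fraction`, `max_radius`) are assumptions chosen for illustration; a 1-D row of pixel values stands in for the conditioning image.

```python
from typing import List


def box_blur(signal: List[float], radius: int) -> List[float]:
    """Simple low-pass filter: average each sample with its neighbours.

    The averaging window is truncated at the edges of the signal.
    """
    if radius <= 0:
        return list(signal)
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def blur_radius(step: int, num_steps: int,
                early_fraction: float = 0.2, max_radius: int = 4) -> int:
    """Hypothetical schedule: strongest filtering at step 0, none after the early phase."""
    early_steps = int(num_steps * early_fraction)
    if step >= early_steps:
        return 0  # later steps see the unfiltered conditioning image
    return max(1, round(max_radius * (1 - step / early_steps)))


def conditioning_for_step(image_row: List[float], step: int, num_steps: int) -> List[float]:
    """Return the (possibly low-pass filtered) conditioning signal for one step."""
    return box_blur(image_row, blur_radius(step, num_steps))
```

The point of the sketch is only the timing: high-frequency detail is withheld early and restored later, so the final frames can still match the reference image.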

[25] MARMOT: Masked Autoencoder for Modeling Transient Imaging cs.CV [PDF]

Siyuan Shen, Ziheng Wang, Xingyue Peng, Suan Xia, Ruiqian Li

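TL;DR: This paper proposes MARMOT, a self-supervised model pretrained as a masked autoencoder on large-scale, diverse non-line-of-sight (NLOS) transient imaging datasets to support NLOS applications.

Details

Motivation: Existing work mainly optimizes volume density or surface reconstruction of hidden objects and does not transfer prior knowledge learned from datasets. MARMOT aims to fill this gap by using self-supervised learning to improve NLOS transient imaging.

Result: Comprehensive experiments show that MARMOT outperforms existing methods both quantitatively and qualitatively, confirming its efficiency.

Insight: Through self-supervised pretraining and a masking strategy, MARMOT can transfer prior knowledge to downstream tasks, offering a new solution for NLOS transient imaging.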

Abstract: Pretrained models have demonstrated impressive success in many modalities such as language and vision. Recent works facilitate the pretraining paradigm in imaging research. Transients are a novel modality, captured for an object as photon counts versus arrival times using a precisely time-resolved sensor. In particular, for non-line-of-sight (NLOS) scenarios, transients of hidden objects are measured beyond the sensor’s direct line of sight. Using NLOS transients, the majority of previous works optimize volume density or surfaces to reconstruct the hidden objects and do not transfer priors learned from datasets. In this work, we present a masked autoencoder for modeling transient imaging, or MARMOT, to facilitate NLOS applications. MARMOT is a self-supervised model pretrained on massive and diverse NLOS transient datasets. Using a Transformer-based encoder-decoder, MARMOT learns features from partially masked transients via a scanning pattern mask (SPM), where the unmasked subset is functionally equivalent to arbitrary sampling, and predicts full measurements. Pretrained on TransVerse, a synthesized transient dataset of 500K 3D models, MARMOT adapts to downstream imaging tasks using direct feature transfer or decoder finetuning. Comprehensive experiments are carried out in comparison with state-of-the-art methods. Quantitative and qualitative results demonstrate the efficiency of MARMOT.

[26] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization cs.CV | cs.MM [PDF]

Qilin Yin, Wei Lu, Xiangyang Luo, Xiaochun Cao

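TL;DR: This paper proposes a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization (TFL) of locally tampered segments in videos.

Details

Motivation: Existing multimedia forensics research focuses mainly on detecting forged audio-visual content, but mostly treats deepfake detection as a classification task and ignores cases where only some video segments are tampered with. Temporal forgery localization (TFL) is closer to real-world applications but remains challenging.

Result: Experiments on five public datasets show that UniCaCLF significantly outperforms state-of-the-art algorithms.

Insight: Context-aware contrastive learning effectively improves temporal forgery localization, particularly for locally tampered segments.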

Abstract: Most research efforts in the multimedia forensics domain have focused on detecting forged audio-visual content and have achieved solid results. However, these works only consider deepfake detection as a classification task and ignore the case where partial segments of the video are tampered with. Temporal forgery localization (TFL) of small fake audio-visual clips embedded in real videos is still challenging and more in line with realistic application scenarios. To resolve this issue, we propose a universal context-aware contrastive learning framework (UniCaCLF) for TFL. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection, allowing for the precise localization of temporally forged segments. To this end, we propose a novel context-aware perception layer that utilizes a heterogeneous activation operation and an adaptive context updater to construct a context-aware contrastive objective, which enhances the discriminability of forged instant features by contrasting them with genuine instant features in terms of their distances to the global context. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants in a supervised sample-by-sample manner, suppressing the cross-sample influence to improve temporal forgery localization performance. Extensive experimental results over five public datasets demonstrate that our proposed UniCaCLF significantly outperforms the state-of-the-art competing algorithms.

Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang

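TL;DR: MLVTG is a new multi-modal video temporal grounding framework based on Mamba and an LLM; its MambaAligner and LLMRefiner modules address the multi-modal alignment shortcomings of existing Transformer-based methods.

Details

Motivation: Existing Transformer-based video temporal grounding methods suffer from redundant attention and suboptimal multi-modal alignment, calling for a more efficient model.

Result: MLVTG achieves state-of-the-art performance on QVHighlights, Charades-STA, and TVSum, significantly outperforming existing baselines.

Insight: Using Mamba and LLM priors can significantly improve the efficiency and accuracy of multi-modal alignment, offering a new approach to video temporal grounding.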

Abstract: Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages the specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.

[28] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer cs.CV [PDF]

Zhongtao Tian, Wenhao Huang, Zhidong Chen, Xiao Wei Sun

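TL;DR: The paper proposes a framework that combines multi-scale feature learning with semantic scene understanding, using a hierarchical Transformer with cross-scale attention to make visual localization more robust in dynamic environments.

Details

Motivation: In dynamic environments, lighting changes, adverse weather, and moving objects degrade conventional visual localization, and existing absolute pose regression methods struggle to remain consistent.

Result: On the TartanAir dataset, the method outperforms existing pose regression methods, especially in scenes with dynamic objects, illumination changes, and occlusions.

Insight: Combining multi-scale processing with semantic guidance is an effective strategy for improving the robustness of visual localization in dynamic environments.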

Abstract: Visual localization remains challenging in dynamic environments where fluctuating lighting, adverse weather, and moving objects disrupt appearance cues. Despite advances in feature representation, current absolute pose regression methods struggle to maintain consistency under varying conditions. To address this challenge, we propose a framework that synergistically combines multi-scale feature learning with semantic scene understanding. Our approach employs a hierarchical Transformer with cross-scale attention to fuse geometric details and contextual cues, preserving spatial precision while adapting to environmental changes. We improve the performance of this architecture with semantic supervision via neural scene representation during training, guiding the network to learn view-invariant features that encode persistent structural information while suppressing complex environmental interference. Experiments on TartanAir demonstrate that our approach outperforms existing pose regression methods in challenging scenarios with dynamic objects, illumination changes, and occlusions. Our findings show that integrating multi-scale processing with semantic guidance offers a promising strategy for robust visual localization in real-world dynamic environments.

[29] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s cs.CV [PDF]

Xijun Wang, Xin Li, Bingchen Li, Zhibo Chen

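TL;DR: LiftVSR is an efficient video super-resolution framework whose hybrid temporal modeling mechanism achieves state-of-the-art results using only four RTX 4090 GPUs, balancing long-term consistency and computational efficiency.

Details

Motivation: Existing video super-resolution methods are limited in temporal consistency and computational cost, and long videos in particular require expensive hardware. LiftVSR aims to address these issues with an image diffusion prior and efficient temporal modeling.

Result: On several mainstream VSR benchmarks, LiftVSR achieves the best performance at significantly lower computational cost.

Insight: 1. Image diffusion priors can be transferred efficiently to video super-resolution; 2. Hybrid temporal modeling is an effective way to balance computational efficiency and consistency; 3. The caching mechanism and asymmetric sampling are essential for long-video processing.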

Abstract: Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-$\alpha$, achieving state-of-the-art results using only 4$\times$RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within a short frame segment ($\textit{i.e.}$, low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments ($\textit{i.e.}$, consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks have demonstrated that LiftVSR achieves impressive performance with significantly lower computational costs.

Qi Yan, Brian Zhang, Yutong Zhang, Daniel Yang, Joshua White

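TL;DR: TrajFlow is a flow matching-based multi-modal motion prediction framework that generates multiple plausible future trajectories in a single inference pass, significantly reducing computational overhead, and further improves performance with a ranking loss and self-conditioning training.

Details

Motivation: Existing generative trajectory prediction methods need multiple inference passes to capture diverse outcomes, which is computationally expensive and inefficient. TrajFlow aims to solve this problem and improve the safety and decision-making efficiency of autonomous driving.

Result: On the large-scale Waymo Open Motion Dataset (WOMD), TrajFlow achieves state-of-the-art performance on several key metrics.

Insight: Flow matching shows potential for efficiency and scalability in motion prediction, and combining it with a ranking loss and self-conditioning training can significantly improve model performance.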

Abstract: Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model’s own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/.
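For readers unfamiliar with the Plackett-Luce distribution behind the ranking loss, the following is a minimal sketch of its negative log-likelihood for one ranking. How TrajFlow builds the target ranking and where the scores come from is not stated in the abstract, so `scores` (one confidence per predicted trajectory) and `ranking` (trajectory indices ordered best first, for example by closeness to the ground truth) are assumptions, as is the function name.

```python
import math
from typing import Sequence


def plackett_luce_nll(scores: Sequence[float], ranking: Sequence[int]) -> float:
    """Negative log-likelihood of `ranking` (best first) under Plackett-Luce.

    At each position k, the item ranking[k] is chosen from the items not yet
    placed, with probability softmax(scores) restricted to those items.
    """
    nll = 0.0
    for k in range(len(ranking)):
        remaining = [scores[i] for i in ranking[k:]]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores[ranking[k]] - log_norm
    return nll
```

Minimizing this quantity pushes higher scores toward trajectories that appear earlier in the target ranking, which is one way a model's confidences can be made to reflect the relative quality of its predictions.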

[31] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs cs.CV [PDF]

Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li

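TL;DR: This paper proposes the Input-Space Linearity Hypothesis (ISLH) and introduces the Spectral Principal Path (SPP) framework to explain how deep networks progressively distill linear representations from noisy inputs, and verifies the robustness of these representations in multimodal vision-language models.

Details

Motivation: Motivated by the Linear Representation Hypothesis (LRH), the work explores how deep networks extract linear directions from the input space that align with human-interpretable concepts, with the goal of improving AI transparency and control.

Result: The paper validates spectral principal paths in multimodal vision-language models, showing that the extracted linear representations are robust.

Insight: Deep networks form structured high-level representations by selectively amplifying linear directions in the input space, a mechanism that helps improve model transparency and robustness.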

Abstract: High-level representations have become a central focus in enhancing AI transparency and control, shifting attention from individual neurons or circuits to structured semantic directions that align with human-interpretable concepts. Motivated by the Linear Representation Hypothesis (LRH), we propose the Input-Space Linearity Hypothesis (ISLH), which posits that concept-aligned directions originate in the input space and are selectively amplified with increasing depth. We then introduce the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along a small set of dominant spectral directions. Building on this framework, we further demonstrate the multimodal robustness of these representations in Vision-Language Models (VLMs). By bridging theoretical insights with empirical validation, this work advances a structured theory of representation formation in deep networks, paving the way for improving AI robustness, fairness, and transparency.

[32] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge cs.CV [PDF]

Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo

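TL;DR: This paper presents SceneNet and KnowledgeNet for the visual question answering task of the HD-EPIC VQA Challenge 2025, combining scene graphs, a multi-modal large language model (MLLM), and external commonsense knowledge to significantly improve task performance.

Details

Motivation: Complex egocentric visual question answering (VQA) requires capturing fine-grained object interactions, spatial relationships, and temporal dynamics, while also reasoning with external commonsense knowledge.

Result: Across the seven categories of the HD-EPIC VQA challenge, the hybrid framework performs well, with a final accuracy of 44.21%.

Insight: Combining visual scene graphs with external commonsense knowledge can significantly improve performance on complex VQA tasks, underscoring the importance of multimodal and knowledge fusion.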

Abstract: This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet’s external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework results in an overall accuracy of 44.21% on the challenge, highlighting its effectiveness for complex egocentric VQA tasks.

[33] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement cs.CV | cs.HC [PDF]

Xinyue Niu, Akira Furui

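TL;DR: This paper proposes a dual-branch adversarial feature disentanglement method for cross-subject EMG pattern recognition that requires no calibration data and generalizes well.

Details

Motivation: Cross-subject EMG pattern recognition is challenged by differences in signal characteristics, electrode placement, and anatomy. Traditional methods rely on user-specific calibration, which is time-consuming and impractical.

Result: Experiments show that the model performs well on data from unseen users and outperforms several baseline methods.

Insight: Feature disentanglement offers a new perspective on cross-subject EMG recognition and supports task-independent biometric applications.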

Abstract: Cross-subject electromyography (EMG) pattern recognition faces significant challenges due to inter-subject variability in muscle anatomy, electrode placement, and signal characteristics. Traditional methods rely on subject-specific calibration data to adapt models to new users, an approach that is both time-consuming and impractical for large-scale, real-world deployment. This paper presents an approach to eliminate calibration requirements through feature disentanglement, enabling effective cross-subject generalization. We propose an end-to-end dual-branch adversarial neural network that simultaneously performs pattern recognition and individual identification by disentangling EMG features into pattern-specific and subject-specific components. The pattern-specific components facilitate robust pattern recognition for new users without model calibration, while the subject-specific components enable downstream applications such as task-invariant biometric identification. Experimental results demonstrate that the proposed model achieves robust performance on data from unseen users, outperforming various baseline methods in cross-subject scenarios. Overall, this study offers a new perspective for cross-subject EMG pattern recognition without model calibration and highlights the proposed model’s potential for broader applications, such as task-independent biometric systems.

[34] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection cs.CV [PDF]

Duc Thanh Pham, Hong Dang Nguyen, Nhat Minh Nguyen Quoc, Linh Ngo Van, Sang Dinh Viet

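TL;DR: The paper proposes Hier-DETR, an incremental object detection framework that combines Neural Collapse with hierarchical class relations to improve efficiency and performance.

Details

Motivation: Incremental object detection (IOD) suffers from limited performance and prolonged inference time, which restricts practical use. Hier-DETR aims to address these problems.

Result: The framework is competitive in both performance and efficiency.

Insight: Neural Collapse and hierarchical class relations are key factors in improving incremental object detection.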

Abstract: Recently, object detection models have witnessed notable performance improvements, particularly with transformer-based models. However, new objects frequently appear in the real world, requiring detection models to continually learn without suffering from catastrophic forgetting. Although Incremental Object Detection (IOD) has emerged to address this challenge, existing models are still not practical due to their limited performance and prolonged inference time. In this paper, we introduce a novel framework for IOD, called Hier-DETR: Hierarchical Neural Collapse Detection Transformer, which ensures both efficiency and competitive performance by leveraging Neural Collapse for imbalanced datasets and the hierarchical relations among class labels.

Yibo Cui, Liang Xie, Yu Zhao, Jiawei Sun, Erwei Yin

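TL;DR: This paper proposes FCA-NIG, a generative framework that automatically constructs navigation instructions with fine-grained cross-modal alignment annotations, addressing the lack of sub-instruction-level and entity-landmark alignment in existing datasets.

Details

Motivation: Existing vision-language navigation (VLN) datasets focus mainly on global instruction-trajectory matching and neglect sub-instruction-level and entity-landmark alignment, which hurts the accuracy of navigation actions.

Result: Experiments show that FCA-R2R significantly improves the performance of multiple VLN agents (such as SF, EnvDrop, RecBERT, and HAMT), in particular their state awareness and navigation decision accuracy.

Insight: Fine-grained sub-instruction-trajectory alignment and entity-landmark alignment play a key role in VLN performance. FCA-NIG generates high-quality data without manual annotation, advancing cross-modal learning.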

Abstract: Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection, generating sub-instruction-trajectory pairs with entity-landmark annotations. Finally, these sub-pairs are aggregated to form a complete instruction-trajectory pair. The framework generates the FCA-R2R dataset, the first large-scale augmentation dataset featuring precise sub-instruction-sub-trajectory and entity-landmark alignments. Extensive experiments demonstrate that training with FCA-R2R significantly improves the performance of multiple state-of-the-art VLN agents, including SF, EnvDrop, RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances agents’ state awareness and decision accuracy, while entity-landmark alignment further boosts navigation performance and generalization. These results highlight the effectiveness of FCA-NIG in generating high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.

[36] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers cs.CV | cs.LG | cs.MM [PDF]

Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang

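TL;DR: The paper proposes Diversity-Guided MLP Reduction (DGMR), which removes redundant neurons in multilayer perceptron (MLP) modules to significantly reduce the parameters and computation of large vision transformers, while distillation keeps the performance loss negligible.

Details

Motivation: Large transformer models have excessive parameter counts and computational costs, with MLP modules accounting for most of the parameters. The paper aims to obtain efficient large vision transformers by pruning redundant MLP parameters.

Result: On models such as EVA-CLIP-E (4.4B), the method reduces parameters and FLOPs by 71.5% without performance degradation. Overall, it reduces parameters and FLOPs by more than 57% with nearly lossless performance.

Insight: MLP modules are the main source of model parameters; a diversity-preserving pruning strategy can effectively reduce redundancy while maintaining performance, offering a new approach to efficient large vision transformers.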

Abstract: Transformer models exhibit excellent scaling properties, where performance improves as model capacity increases. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of the MLP hidden layer, while preserving weight diversity for better performance recovery during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06% data of LAION-2B (for the training of large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
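One plausible reading of Gram-Schmidt weight pruning is a greedy, pivoted Gram-Schmidt selection: keep the neuron whose weight vector is least explained by the neurons already kept. The sketch below illustrates that reading only. It is not taken from the DGMR repository, and the function name, the row-per-neuron weight layout, and the selection rule are all assumptions.

```python
import math
from typing import List


def select_diverse_neurons(weights: List[List[float]], keep: int) -> List[int]:
    """Greedy pivoted Gram-Schmidt selection of hidden neurons (illustrative).

    weights[i] is the input weight vector of hidden neuron i. At each round the
    neuron with the largest residual norm is kept, then its direction is
    projected out of the remaining candidates so near-duplicates score low.
    """
    residuals = [list(w) for w in weights]
    selected: List[int] = []
    for _ in range(min(keep, len(weights))):
        norms = [
            -1.0 if i in selected else math.sqrt(sum(v * v for v in r))
            for i, r in enumerate(residuals)
        ]
        best = max(range(len(norms)), key=norms.__getitem__)
        selected.append(best)
        norm = norms[best]
        if norm == 0.0:
            continue  # remaining candidates are already fully explained
        basis = [v / norm for v in residuals[best]]
        for i, r in enumerate(residuals):
            if i in selected:
                continue
            dot = sum(a * b for a, b in zip(r, basis))
            residuals[i] = [a - dot * b for a, b in zip(r, basis)]
    return selected
```

For the toy weights `[[1, 0], [0.9, 0], [0, 0.5]]` with `keep=2`, the sketch keeps neurons 0 and 2 and drops neuron 1, whose weights are almost parallel to those of neuron 0.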

[37] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective cs.CV [PDF]

Robert-Jan Bruintjes, Attila Lengyel, Osman Semih Kayhan, Davide Zambrano, Nergis Tömen

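TL;DR: This paper is a retrospective on the data-efficient visual inductive priors challenges, examining how prior knowledge can improve deep learning models when data is scarce.

Details

Motivation: To address the performance degradation of deep learning under limited data and stimulate the development of more data-efficient methods.

Result: Successful challenge entries used hybrid ensembles of Transformers and CNNs together with data augmentation, and some also introduced new prior knowledge.

Insight: Combining prior knowledge with model ensembling significantly improves model performance when data is scarce.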

Abstract: Deep Learning requires large amounts of data to train models that work well. In data-deficient settings, performance can be degraded. We investigate which Deep Learning methods benefit training models in a data-deficient setting, by organizing the “VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning” workshop series, featuring four editions of data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate prior knowledge to improve the data efficiency of deep learning models. Successful challenge entries make use of large model ensembles that mix Transformers and CNNs, as well as heavy data augmentation. Novel prior knowledge-based methods contribute to success in some entries.

[38] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything cs.CV

Joost van Dalen, Yuki M. Asano, Marc Russwurm

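TL;DR: SAMSelect is an algorithm that uses the Segment Anything Model to produce a salient three-channel visualization of multispectral images, helping marine scientists visually interpret marine debris in Sentinel-2 imagery. It selects the best band or spectral-index combination using a small annotated dataset, improving classification accuracy and the quality of visual information.

Details

Motivation: Marine debris is hard to visualize in medium-resolution imagery because of its compositional heterogeneity, and domain experts usually choose bands and spectral indices from experience and heuristics. SAMSelect aims to automate band selection to improve visual interpretation.

Result: On Sentinel-2 imagery of Accra, Ghana, and Durban, South Africa, SAMSelect found new, previously unused band combinations (e.g., a normalized difference index of B8 and B2) that outperform the established indices from the literature.

Insight: Automated band selection combined with the Segment Anything Model can substantially improve visual interpretation of marine debris, giving domain experts a more efficient visual analysis tool.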

Abstract: This work proposes SAMSelect, an algorithm to obtain a salient three-channel visualization for multispectral images. We develop SAMSelect and show its use for marine scientists visually interpreting floating marine debris in Sentinel-2 imagery. These debris are notoriously difficult to visualize due to their compositional heterogeneity in medium-resolution imagery. Despite these difficulties, a visual interpretation of imagery showing marine debris remains a common practice by domain experts, who select bands and spectral indices on a case-by-case basis informed by common practices and heuristics. SAMSelect selects the band or index combination that achieves the best classification accuracy on a small annotated dataset through the Segment Anything Model. Its central assumption is that the three-channel visualization that achieves the most accurate segmentation results also provides good visual information for photo-interpretation. We evaluate SAMSelect in three Sentinel-2 scenes containing generic marine debris in Accra, Ghana, and Durban, South Africa, and deployed plastic targets from the Plastic Litter Project. This reveals the potential of new previously unused band combinations (e.g., a normalized difference index of B8, B2), which demonstrate improved performance compared to literature-based indices. We describe the algorithm in this paper and provide an open-source code repository that will be helpful for domain scientists doing visual photo interpretation, especially in the marine field.
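
The abstract mentions a normalized difference index of bands B8 and B2. As a rough illustration, a normalized difference index is conventionally computed per pixel as (a - b) / (a + b). The sketch below assumes each band is a flat list of non-negative reflectance values; the function name and the zero-denominator convention are assumptions, and this is not taken from the SAMSelect repository.

```python
def normalized_difference(band_a, band_b):
    """Per-pixel (a - b) / (a + b) for two equally sized bands (hypothetical helper)."""
    if len(band_a) != len(band_b):
        raise ValueError("bands must have the same number of pixels")
    # Return 0.0 where both bands are zero to avoid dividing by zero (a convention chosen here).
    return [0.0 if a + b == 0 else (a - b) / (a + b) for a, b in zip(band_a, band_b)]


# Example with made-up reflectances: first argument plays the role of B8, second of B2.
ndi_b8_b2 = normalized_difference([0.3, 0.0], [0.1, 0.0])
```

Values lie in [-1, 1] for non-negative inputs, which makes the index convenient to use as one channel of a three-channel visualization.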

[39] ECMNet: Lightweight Semantic Segmentation with Efficient CNN-Mamba Network cs.CV | cs.AI

Feixiang Du, Shengkun Wu

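TL;DR: The paper proposes ECMNet, a lightweight semantic segmentation network that combines the strengths of CNNs and Mamba. Its EDAB module, MSAU unit, and Mamba-enhanced FFM module markedly improve the balance between segmentation accuracy and efficiency.

Details

Motivation: Although CNNs and Transformers perform well on semantic segmentation, their global context modeling remains inadequate. Mamba has shown an advantage in modeling long-range dependencies in vision tasks, so the paper combines it with CNNs so that each offsets the other's weaknesses.

Result: Achieves 70.6% mIoU on the Cityscapes test set and 73.6% mIoU on the CamVid test set, with 0.87M parameters and 8.27G FLOPs.

Insight: Combining Mamba with CNNs yields efficient long-range dependency modeling for semantic segmentation while keeping the network lightweight and computationally efficient.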

Abstract: In the past decade, Convolutional Neural Networks (CNNs) and Transformers have achieved wide application in semantic segmentation tasks. Although CNNs with Transformer models greatly improve performance, the global context modeling remains inadequate. Recently, Mamba achieved great potential in vision tasks, showing its advantages in modeling long-range dependency. In this paper, we propose a lightweight Efficient CNN-Mamba Network for semantic segmentation, dubbed as ECMNet. ECMNet combines CNN with Mamba skillfully in a capsule-based framework to address their complementary weaknesses. Specifically, we design an Enhanced Dual-Attention Block (EDAB) for the lightweight bottleneck. In order to improve the representation ability of features, we devise a Multi-Scale Attention Unit (MSAU) to integrate multi-scale feature aggregation, spatial aggregation and channel aggregation. Moreover, a Mamba-enhanced Feature Fusion Module (FFM) merges features from diverse levels, significantly enhancing segmentation accuracy. Extensive experiments on two representative datasets demonstrate that the proposed model excels in accuracy and efficiency balance, achieving 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets, with 0.87M parameters and 8.27G FLOPs on a single RTX 3090 GPU platform.
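
The mIoU figures quoted above are averages of per-class intersection-over-union. A minimal sketch of that metric follows, assuming flat lists of integer class labels per pixel. The function name is hypothetical, classes absent from both prediction and ground truth are skipped (one common convention), and this is not the authors' evaluation code.

```python
def mean_iou(predicted, actual, num_classes):
    """Mean intersection-over-union across classes present in either label list."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, a in zip(predicted, actual) if p == c and a == c)
        union = sum(1 for p, a in zip(predicted, actual) if p == c or a == c)
        if union:  # Skip classes that never appear.
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class appears in the inputs")
    return sum(ious) / len(ious)


# Class 0: IoU 1/2. Class 1: IoU 2/3. Mean is 7/12.
example_miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```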

[40] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping cs.CV

Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen

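TL;DR: RoboSwap proposes a video diffusion framework that combines GANs and diffusion models for unsupervised robot arm swapping, addressing data scarcity in cross-platform robot learning.

Details

Motivation: The scarcity of diverse, high-quality datasets limits the cross-platform generalization of video-conditioned robot learning. RoboSwap aims to swap robot arms without supervision, reducing data collection needs.

Result: On three benchmarks, RoboSwap outperforms existing video and image editing models in structural coherence and motion consistency.

Insight: GANs and diffusion models have complementary strengths; combining them provides high-quality cross-platform data for robot learning and reduces reliance on paired data.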

Abstract: Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.

[41] SurfR: Surface Reconstruction with Multi-scale Attention cs.CV

Siddhant Ranade, Gonçalo Dias Pais, Ross Tyler Whitaker, Jacinto C. Nascimento, Pedro Miraldo

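TL;DR: Proposes a fast and accurate surface reconstruction algorithm that processes unorganized point clouds with an implicit representation, addressing the trade-off between detail and speed in existing methods.

Details

Motivation: Existing learning methods either require per-object training (small models, high detail) or use a generalized representation (large models, less detail, slow inference). A new method is needed that keeps high detail while inferring quickly.

Result: The algorithm is faster than all baselines and performs close to the state of the art, achieving the best accuracy-speed trade-off.

Insight: Lazy querying and multi-scale attention markedly improve the efficiency and robustness of implicit representations, offering a new approach to surface reconstruction.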

Abstract: We propose a fast and accurate surface reconstruction algorithm for unorganized point clouds using an implicit representation. Recent learning methods are either single-object representations with small neural models that allow for high surface details but require per-object training or generalized representations that require larger models and generalize to newer shapes but lack details, and inference is slow. We propose a new implicit representation for general 3D shapes that is faster than all the baselines at their optimum resolution, with only a marginal loss in performance compared to the state-of-the-art. We achieve the best accuracy-speed trade-off using three key contributions. Many implicit methods extract features from the point cloud to classify whether a query point is inside or outside the object. First, to speed up the reconstruction, we show that this feature extraction does not need to use the query point at an early stage (lazy query). Second, we use a parallel multi-scale grid representation to develop robust features for different noise levels and input resolutions. Finally, we show that attention across scales can provide improved reconstruction results.

Zhiyi Zhu, Xiaoyu Wu, Youwei Lu

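TL;DR: This paper proposes the Text-Motion Cross-modal Contrastive Loss (TMCCL) to improve video memorability prediction, and demonstrates the practical potential of memorability prediction through a new video summarization method (MWCVS).

Details

Motivation: Existing models do not fully exploit motion features when predicting video memorability, and the lack of labeled data leaves motion features poorly represented. The paper aims to improve motion feature representation through cross-modal contrastive learning and to explore practical applications of memorability prediction.

Result: Achieves state-of-the-art performance on two video memorability prediction datasets, and experiments on two video summarization datasets demonstrate the effectiveness of MWCVS.

Insight: Cross-modal contrastive learning effectively improves motion feature representation, and memorability prediction has practical value in video editing tasks.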

Abstract: Video memorability refers to the ability of videos to be recalled after viewing, playing a crucial role in creating content that remains memorable. Existing models typically focus on extracting multimodal features to predict video memorability scores but often fail to fully utilize motion cues. The representation of motion features is compromised during the fine-tuning phase of the motion feature extractor due to a lack of labeled data. In this paper, we introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal video memorability prediction model designed to enhance the representation of motion features. We tackle the challenge of improving motion feature representation by leveraging text description similarities across videos to establish positive and negative motion sample sets for a given target. This enhancement allows the model to learn similar feature representations for semantically related motion content, resulting in more accurate memorability predictions. Our model achieves state-of-the-art performance on two video memorability prediction datasets. Moreover, the potential applications of video memorability prediction have been underexplored. To address this gap, we present Memorability Weighted Correction for Video Summarization (MWCVS), using video memorability prediction to reduce subjectivity in video summarization labels. Experimental results on two video summarization datasets demonstrate the effectiveness of MWCVS, showcasing the promising applications of video memorability prediction.
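
The abstract describes building positive and negative motion sample sets from text similarity and training with a contrastive loss, without giving the formula. The sketch below shows a generic multi-positive InfoNCE-style loss as one plausible shape for such an objective. The function name, the cosine similarity, the temperature value, and the boolean positive mask are all assumptions; this is not the TMCCL definition.

```python
import math


def multi_positive_contrastive_loss(anchor, candidates, is_positive, temperature=0.1):
    """-log(sum over positives of exp(sim/t) / sum over all of exp(sim/t)); hypothetical helper."""
    if not any(is_positive):
        raise ValueError("at least one positive sample is required")

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    logits = [cosine(anchor, c) / temperature for c in candidates]
    top = max(logits)  # Subtract the maximum for numerical stability.
    exps = [math.exp(l - top) for l in logits]
    positive_mass = sum(e for e, p in zip(exps, is_positive) if p)
    return -math.log(positive_mass / sum(exps))
```

In the setting the abstract describes, `is_positive` would be derived from text-description similarity between videos; how that threshold is set is not stated in the source.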

[43] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping cs.CV

Peter Grönquist, Stepan Tulyakov, Dengxin Dai

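TL;DR: The paper proposes a lightweight Neural Physical Model (NPM) for the challenging task of converting RAW images between cameras. It adapts to varying illumination and outperforms existing methods on public datasets.

Details

Motivation: Color consistency in multi-camera systems is essential for image fusion and ISP compatibility, but existing methods suffer from poor adaptability to illumination or high computational cost.

Result: Experiments on the NUS and BeyondRGB datasets show that NPM outperforms existing methods in color consistency and adaptability.

Insight: By combining physical information with data-driven methods, NPM achieves high-performance RAW image conversion in a lightweight design, offering a practical solution for multi-camera systems.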

Abstract: Achieving consistent color reproduction across multiple cameras is essential for seamless image fusion and Image Processing Pipeline (ISP) compatibility in modern devices, but it is a challenging task due to variations in sensors and optics. Existing raw-to-raw conversion methods face limitations such as poor adaptability to changing illumination, high computational costs, or impractical requirements such as simultaneous camera operation and overlapping fields-of-view. We introduce the Neural Physical Model (NPM), a lightweight, physically-informed approach that simulates raw images under specified illumination to estimate transformations between devices. The NPM effectively adapts to varying illumination conditions, can be initialized with physical measurements, and supports training with or without paired data. Experiments on public datasets like NUS and BeyondRGB demonstrate that NPM outperforms recent state-of-the-art methods, providing robust chromatic consistency across different sensors and optical systems.

[44] LLaVA-c: Continual Improved Visual Instruction Tuning cs.CV

Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, Cheng-Lin Liu

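TL;DR: LLaVA-c improves continual learning through spectral-aware consolidation and unsupervised inquiry regularization, balancing multitask performance against general capabilities and matching or even surpassing multitask joint learning.

Details

Motivation: Multimodal models such as LLaVA-1.5 face task-balancing and expansion-cost challenges in multitask learning, while conventional continual learning methods neglect base model degradation.

Result: Across continual pretraining and fine-tuning, LLaVA-c improves benchmark performance while preserving general capabilities.

Insight: Continual learning can avoid base model degradation through task-aware design, and multitask joint learning is not the only efficient path.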

Abstract: Multimodal models like LLaVA-1.5 achieve state-of-the-art visual understanding through visual instruction tuning on multitask datasets, enabling strong instruction-following and multimodal performance. However, multitask learning faces challenges such as task balancing, requiring careful adjustment of data proportions, and expansion costs, where new tasks risk catastrophic forgetting and need costly retraining. Continual learning provides a promising alternative to acquiring new knowledge incrementally while preserving existing capabilities. However, current methods prioritize task-specific performance, neglecting base model degradation from overfitting to specific instructions, which undermines general capabilities. In this work, we propose a simple but effective method with two modifications on LLaVA-1.5: spectral-aware consolidation for improved task balance and unsupervised inquiry regularization to prevent base model degradation. We evaluate both general and task-specific performance across continual pretraining and fine-tuning. Experiments demonstrate that LLaVA-c consistently enhances standard benchmark performance and preserves general capabilities. For the first time, we show that task-by-task continual learning can achieve results that match or surpass multitask joint learning. The code will be publicly released.

[45] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction cs.CV

Juan Yeo, Soonwoo Cha, Jiwoo Song, Hyunbin Jin, Taesup Kim

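TL;DR: ATAS proposes a self-distillation method that exploits the model's own knowledge across representation levels to improve both semantic coherence and fine-grained vision-language alignment, strengthening CLIP on open-vocabulary dense prediction without extra modules or supervised fine-tuning.

Details

Motivation: CLIP performs well on open-vocabulary dense prediction but still falls short on fine-grained, region-level understanding, and existing methods often gain fine-grained alignment at the expense of semantic coherence.

Result: On open-vocabulary object detection and semantic segmentation benchmarks, ATAS substantially outperforms baseline CLIP models.

Insight: Maintaining semantic coherence and fine-grained alignment together is key to better open-vocabulary dense prediction, and self-distillation is an efficient way to achieve this without labels.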

Abstract: Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.

[46] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities cs.CV

Hugo Porta, Emanuele Dalsasso, Jessica L. McCarty, Devis Tuia

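TL;DR: The paper presents CanadaFireSat, a high-resolution (100 m) wildfire forecasting dataset, along with baseline methods. It uses multi-modal data, including high-resolution satellite imagery and environmental factors, and shows the potential of multi-modal deep learning for continental-scale wildfire forecasting.

Details

Motivation: Canada experienced a severe wildfire season in 2023, creating an urgent need for high-resolution forecasting models that make wildfire management more efficient and accurate.

Result: Multi-modal inputs performed best when forecasting the 2023 wildfire season, reaching an F1 score of 60.3% and showing the potential of high-resolution models.

Insight: Multi-modal data fusion can markedly improve the accuracy and resolution of wildfire forecasting, providing a new tool for continental-scale wildfire management.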

Abstract: In 2023, Canada experienced one of the most severe wildfire seasons in recent history, causing damage across ecosystems, destroying communities, and emitting large quantities of CO2. This extreme wildfire season is symptomatic of a climate-change-induced increase in the length and severity of the fire season that affects the boreal ecosystem. Therefore, it is critical to empower wildfire management in boreal communities with better mitigation solutions. Wildfire probability maps represent an important tool for understanding the likelihood of wildfire occurrence and the potential severity of future wildfires. The massive increase in the availability of Earth observation data has enabled the development of deep learning-based wildfire forecasting models, aiming at providing precise wildfire probability maps at different spatial and temporal scales. A main limitation of such methods is their reliance on coarse-resolution environmental drivers and satellite products, leading to wildfire occurrence prediction of reduced resolution, typically around $\sim 0.1^\circ$. This paper presents a benchmark dataset, CanadaFireSat, and baseline methods for high-resolution (100 m) wildfire forecasting across Canada, leveraging multi-modal data from high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Our experiments consider two major deep learning architectures. We observe that using multi-modal temporal inputs outperforms single-modal temporal inputs across all metrics, achieving a peak performance of 60.3% in F1 score for the 2023 wildfire season, a season never seen during model training. This demonstrates the potential of multi-modal deep learning models for wildfire forecasting at high-resolution and continental scale.
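
The headline result is reported as an F1 score. For readers less familiar with the metric, the following sketch computes F1 for binary fire / no-fire predictions, assuming flat lists of 0/1 labels per pixel. The function name and the convention of returning 0.0 when there are no positives at all are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def f1_score(predicted, actual):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (hypothetical helper)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denominator = 2 * tp + fp + fn
    # With no predicted and no actual positives, return 0.0 by convention here.
    return 0.0 if denominator == 0 else 2 * tp / denominator


# One true positive, one false positive, one false negative: F1 = 2 / 4.
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Because F1 ignores true negatives, it suits wildfire maps where burning pixels are rare compared with non-burning ones.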

[47] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism cs.CV

Congzhi Zhang, Jiawei Peng, Zhenglin Wang, Yilong Lai, Haowen Sun

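TL;DR: VReST uses Monte Carlo Tree Search and a self-reward mechanism to improve Large Vision-Language Models (LVLMs) on complex visual reasoning without any training, achieving state-of-the-art results on multimodal mathematical reasoning benchmarks.

Details

Motivation: Existing LVLMs excel at multimodal tasks but remain limited on complex visual reasoning, especially when using Chain-of-Thought prompting.

Result: VReST surpasses the current best prompting methods on multimodal mathematical reasoning benchmarks and confirms that test-time scaling laws hold for multimodal tasks.

Insight: VReST offers a new direction for improving LVLM reasoning without additional models, showing the potential of test-time optimization in multimodal tasks.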

Abstract: Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
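
The abstract says VReST builds a search tree with Monte Carlo Tree Search, where each node is a reasoning step scored by a self-reward. It does not state the selection rule, so the sketch below uses the standard UCT formula as an assumption to illustrate how a search might trade off high-reward steps against rarely visited ones. The function names and the `(total_reward, visits)` tuple layout are hypothetical.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT: mean reward plus an exploration bonus; unvisited nodes come first."""
    if visits == 0:
        return math.inf
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """Return the index of the child (total_reward, visits) with the highest UCT score."""
    return max(range(len(children)), key=lambda i: uct_score(*children[i], parent_visits))
```

In a VReST-like setting, `total_reward` would accumulate the self-reward of reasoning paths passing through the node; how that reward is computed is described only qualitatively in the abstract.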

[48] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning cs.CV

Mohammadreza Salehi, Shashanka Venkataramanan, Ioana Simion, Efstratios Gavves, Cees G. M. Snoek

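TL;DR: Proposes MoSiC, a motion-trajectory-based self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations, improving robustness in dynamic scenes.

Details

Motivation: Existing self-supervised methods rely on static augmentations and perform poorly under object deformation, occlusion, and camera motion, which leads to inconsistent feature learning. A motion-guided method is therefore needed to learn more robust spatiotemporal representations.

Result: Improves on existing methods by 1% to 6% across six image and video datasets and four evaluation benchmarks.

Insight: Motion trajectories can serve as an effective self-supervisory signal, helping models learn more robust features in dynamic scenes and under occlusion.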

Abstract: Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
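
The abstract refers to optimizing feature clustering through an optimal transport mechanism. A common way to obtain balanced soft cluster assignments from similarity scores is Sinkhorn-Knopp normalization, sketched below in plain Python. Whether MoSiC uses exactly this iteration is an assumption; the function name, `epsilon`, and the iteration count are illustrative choices, not values from the paper.

```python
import math


def sinkhorn(scores, epsilon=0.05, iterations=50):
    """Turn an N x K score matrix into soft assignments with balanced clusters.

    Rows sum to 1 and, after convergence, each column sums to about N / K.
    Very small epsilon relative to the score range can underflow; this sketch
    does not guard against that.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(iterations):
        # Scale columns so every cluster receives total mass n / k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * (n / k) / col[j] for j in range(k)] for i in range(n)]
        # Scale rows so every point distributes total mass 1.
        q = [[v / sum(row) for v in row] for row in q]
    return q


assignments = sinkhorn([[1.0, 0.0], [0.8, 0.2], [0.2, 0.8], [0.0, 1.0]])
```

The balancing constraint discourages the degenerate solution in which every tracked point collapses into a single cluster.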

[49] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering cs.CV

Xiaohan Zhang, Sitong Wang, Yushen Yan, Yi Yang, Mingda Xu

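TL;DR: TraGraph-GS proposes a trajectory-graph-based method for high-quality novel view synthesis of large-scale scenes, addressing the limitations of existing methods in adapting to camera trajectories and handling Gaussian overlap.

Details

Motivation: Novel view synthesis for large-scale scenes is challenging because of arbitrary camera trajectories and Gaussian overlap. Existing methods reconstruct partitions separately and merge them for rendering, but generalize poorly.

Result: Across four aerial and four ground datasets, average PSNR improves by 1.86 dB and 1.62 dB respectively over the current best methods.

Insight: Flexible trajectory-graph partitioning and progressive rendering can effectively mitigate Gaussian overlap and texture distortion in large-scale scene rendering.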

Abstract: High-quality novel view synthesis for large-scale scenes presents a challenging dilemma in 3D computer vision. Existing methods typically partition large scenes into multiple regions, reconstruct a 3D representation using Gaussian splatting for each region, and eventually merge them for novel view rendering. They can accurately render specific scenes, yet they do not generalize effectively for two reasons: (1) rigid spatial partition techniques struggle with arbitrary camera trajectories, and (2) the merging of regions results in Gaussian overlap that distorts texture details. To address these challenges, we propose TraGraph-GS, leveraging a trajectory graph to enable high-precision rendering for arbitrarily large-scale scenes. We present a spatial partitioning method for large-scale scenes based on graphs, which incorporates a regularization constraint to enhance the rendering of textures and distant objects, as well as a progressive rendering strategy to mitigate artifacts caused by Gaussian overlap. Experimental results demonstrate its superior performance on both four aerial and four ground datasets and highlight its remarkable efficiency: our method achieves an average improvement of 1.86 dB in PSNR on aerial datasets and 1.62 dB on ground datasets compared to state-of-the-art approaches.
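
The gains above are reported in PSNR. As a reminder of what that number measures, here is a minimal sketch of PSNR between a reference image and a rendered image, assuming both are flat lists of pixel values in [0, max_value]. The function name and input layout are assumptions for illustration.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and equally sized")
    mse = sum((r - t) ** 2 for r, t in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # Identical images.
    return 10 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1, so MSE is about 0.01 and PSNR is about 20 dB.
example_psnr = psnr([0.0, 1.0], [0.1, 0.9])
```

Since PSNR is logarithmic in the mean squared error, for a single image a gain of 1.86 dB corresponds to the MSE falling to roughly 65% of its previous value (10^(-0.186) ≈ 0.65).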

[50] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting cs.CV

Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang

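TL;DR: The paper presents SceneSplat++, a large-scale dataset and comprehensive benchmark for Language Gaussian Splatting that addresses the limitations of existing work in 3D scene understanding and demonstrates the advantages of generalizable methods.

Details

Motivation: Existing Language Gaussian Splatting methods are mostly evaluated on a small number of scenes and viewpoints and give little insight into holistic 3D scene understanding, so a large-scale benchmark and dataset are needed to advance research.

Result: Benchmark results show that generalizable methods perform best, relaxing scene-specific limitations, enabling fast inference, and achieving superior segmentation performance.

Insight: Generalizable methods can markedly improve 3D scene understanding by exploiting strong data priors; large-scale datasets and benchmarks are key to advancing the field.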

Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of them are evaluated only on rendered 2D views of a handful of scenes and viewpoints close to the training views, which limits insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our codes, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.

[51] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces cs.CV | cs.AI

Dieuwertje Alblas, Patryk Rygiel, Julian Suk, Kaj O. Kappe, Marieke Hofman

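TL;DR: The paper proposes a geometric deep learning model based on an SE(3)-symmetric transformer to predict local growth of abdominal aortic aneurysms (AAAs), with the aim of improving personalized surveillance strategies.

Details

Motivation: Current clinical guidelines set surveillance intervals solely by maximum AAA diameter, ignoring the relationship between 3D shape and growth, which may lead to suboptimal monitoring. A method that predicts local growth is therefore needed.

Result: The model predicts AAA growth with a median diameter error of 1.18 mm and identifies with 93% accuracy whether a patient will become eligible for elective repair within two years. Results on an external validation set also indicate that the model generalizes.

Insight: Local growth prediction combined with 3D shape information can support more personalized surveillance strategies; the SE(3)-symmetric design helps maintain geometric consistency and improves prediction accuracy.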

Abstract: Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the abdominal aorta. AAAs may rupture, with a survival rate of only 20%. Current clinical guidelines recommend elective surgical repair when the maximum AAA diameter exceeds 55 mm in men or 50 mm in women. Patients that do not meet these criteria are periodically monitored, with surveillance intervals based on the maximum AAA diameter. However, this diameter does not take into account the complex relation between the 3D AAA shape and its growth, making standardized intervals potentially unfit. Personalized AAA growth predictions could improve monitoring strategies. We propose to use an SE(3)-symmetric transformer model to predict AAA growth directly on the vascular model surface enriched with local, multi-physical features. In contrast to other works which have parameterized the AAA shape, this representation preserves the vascular surface’s anatomical structure and geometric fidelity. We train our model using a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24 AAA patients at irregularly sampled intervals. After training, our model predicts AAA growth to the next scan moment with a median diameter error of 1.18 mm. We further demonstrate our model’s utility to identify whether a patient will become eligible for elective repair within two years (acc = 0.93). Finally, we evaluate our model’s generalization on an external validation set consisting of 25 CTAs from 7 AAA patients from a different hospital. Our results show that local directional AAA growth prediction from the vascular surface is feasible and may contribute to personalized surveillance strategies.

[52] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba cs.CV

Yuhang Wang, Jun Li, Zhijian Wu, Jianhua Xu

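TL;DR: The paper proposes InceptionMamba, an efficient hybrid architecture that improves the local and global modeling of InceptionNeXt with large band convolutions and a bottleneck Mamba module, performing well on image classification and downstream tasks.

Details

Motivation: InceptionNeXt performs well with its one-dimensional strip convolutions, but it is limited in modeling spatial dependencies and exploring local neighborhoods, and the locality of convolution hinders global context modeling. The paper therefore proposes a more efficient architecture to address these issues.

Result: InceptionMamba performs well on image classification and several downstream tasks, with better parameter and computational efficiency than existing methods.

Insight: A hybrid design combining band convolutions with a Mamba module can markedly improve local and global modeling while maintaining computational efficiency. The authors state that the source code will be released.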

Abstract: Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it has a limited ability to capture spatial dependencies along different dimensions and fails to fully explore spatial modeling in local neighborhoods. Besides, inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and an enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at https://github.com/Wake1021/InceptionMamba.

[53] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation cs.CV

Jiayi Song, Kaiyu Li, Xiangyong Cao, Deyu Meng

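TL;DR: RS-MTDF is a semi-supervised remote sensing semantic segmentation framework based on multi-teacher distillation and fusion. It uses pretrained Vision Foundation Models (VFMs) as teachers and improves the student model through feature-level distillation and knowledge fusion, achieving state-of-the-art results on multiple datasets.

Details

Motivation: Remote sensing semantic segmentation depends on large amounts of high-quality annotations, which are expensive to obtain. Semi-supervised learning can ease this, but existing methods handle the distribution mismatch between labeled and unlabeled data poorly. VFMs generalize strongly and can provide semantic priors for semi-supervised learning.

Result: Achieves state-of-the-art performance on ISPRS Potsdam, LoveDA, and DeepGlobe; on LoveDA in particular it outperforms existing methods across label ratios and attains the highest IoU in most semantic categories.

Insight: The generalization ability of VFMs can markedly improve semi-supervised learning, and multi-teacher distillation with knowledge fusion is an effective semi-supervised strategy.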

Abstract: Semantic segmentation in remote sensing images is crucial for various applications, yet its performance is heavily reliant on large-scale, high-quality pixel-wise annotations, which are notoriously expensive and time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a promising alternative to mitigate this data dependency. However, existing SSS methods often struggle with the inherent distribution mismatch between limited labeled data and abundant unlabeled data, leading to suboptimal generalization. We propose that Vision Foundation Models (VFMs), pre-trained on vast and diverse datasets, possess robust generalization capabilities that can effectively bridge this distribution gap and provide strong semantic priors for SSS. Inspired by this, we introduce RS-MTDF (Multi-Teacher Distillation and Fusion), a novel framework that leverages the powerful semantic knowledge embedded in VFMs to guide semi-supervised learning in remote sensing. Specifically, RS-MTDF employs multiple frozen VFMs (e.g., DINOv2 and CLIP) as expert teachers, utilizing feature-level distillation to align student features with their robust representations. To further enhance discriminative power, the distilled knowledge is seamlessly fused into the student decoder. Extensive experiments on three challenging remote sensing datasets (ISPRS Potsdam, LoveDA, and DeepGlobe) demonstrate that RS-MTDF consistently achieves state-of-the-art performance. Notably, our method outperforms existing approaches across various label ratios on LoveDA and secures the highest IoU in the majority of semantic categories. These results underscore the efficacy of multi-teacher VFM guidance in significantly enhancing both generalization and semantic understanding for remote sensing segmentation. Ablation studies further validate the contribution of each proposed module.
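
The abstract describes feature-level distillation that aligns student features with several frozen teachers, but not the loss itself. The sketch below illustrates one simple possibility: average the cosine distance between a student feature and each teacher feature. The cosine distance, the equal teacher weighting, and the assumption that the student feature has already been projected to each teacher's dimensionality are all illustrative choices, not the RS-MTDF loss.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def multi_teacher_distillation_loss(student_feature, teacher_features):
    """Mean cosine distance from one student feature to each teacher feature (hypothetical helper)."""
    if not teacher_features:
        raise ValueError("at least one teacher feature is required")
    distances = [cosine_distance(student_feature, t) for t in teacher_features]
    return sum(distances) / len(distances)


# Perfectly aligned with the first teacher, orthogonal to the second: mean distance 0.5.
example_loss = multi_teacher_distillation_loss([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

Because the teachers stay frozen, only the student (and any projection layers) would receive gradients from a loss of this kind.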

[54] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting cs.CV

Keyi Liu, Weidong Yang, Ben Fei, Ying He

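TL;DR: Proposes Gaussian2Scene, a scene-level self-supervised learning framework based on 3D Gaussian Splatting. A two-stage training strategy improves 3D geometric understanding and cross-modal alignment, outperforming existing methods.

Details

Motivation: To address existing scene-level self-supervised methods' reliance on implicit representations and high memory demands, as well as their difficulty capturing underlying 3D geometric structure.

Result: Outperforms existing pre-training methods on several 3D object detection tasks.

Insight: The explicit representation of 3D Gaussian Splatting provides an efficient geometric prior for scene-level pre-training, and the two-stage strategy effectively combines geometric and visual information.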

Abstract: Self-supervised learning (SSL) for point cloud pre-training has become a cornerstone for many 3D vision tasks, enabling effective learning from large-scale unannotated data. At the scene level, existing SSL methods often incorporate volume rendering into the pre-training framework, using RGB-D images as reconstruction signals to facilitate cross-modal learning. This strategy promotes alignment between 2D and 3D modalities and enables the model to benefit from rich visual cues in the RGB-D inputs. However, these approaches are limited by their reliance on implicit scene representations and high memory demands. Furthermore, since their reconstruction objectives are applied only in 2D space, they often fail to capture underlying 3D geometric structures. To address these challenges, we propose Gaussian2Scene, a novel scene-level SSL framework that leverages the efficiency and explicit nature of 3D Gaussian Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the computational burden associated with volume rendering but also supports direct 3D scene reconstruction, thereby enhancing the geometric understanding of the backbone network. Our approach follows a progressive two-stage training strategy. In the first stage, a dual-branch masked autoencoder learns both 2D and 3D scene representations. In the second stage, we initialize training with reconstructed point clouds and further supervise learning using the geometric locations of Gaussian primitives and rendered RGB images. This process reinforces both geometric and cross-modal learning. We demonstrate the effectiveness of Gaussian2Scene across several downstream 3D object detection tasks, showing consistent improvements over existing pre-training methods.

[55] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation cs.CV

Ziyao Huang, Zixiang Zhou, Juan Cao, Yifeng Ma, Yi Chen

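TL;DR: HunyuanVideo-HOMA is a weakly conditioned, multimodal-driven framework for generating generic human-object interaction (HOI) videos. Sparse, decoupled motion guidance and dual-input-space encoding improve the temporal consistency and physical plausibility of the generated videos.

Details

Motivation: Existing HOI video generation methods depend on carefully curated motion data, generalize poorly to novel objects and scenarios, and are hard to access. HunyuanVideo-HOMA aims to address these problems with a weakly supervised, multimodal-driven approach.

Result: Experiments show that HunyuanVideo-HOMA reaches state-of-the-art performance in interaction naturalness and in generalization under weak supervision, while also supporting text-conditioned generation and interactive object manipulation.

Insight: Decoupled, sparse motion guidance combined with multimodal signal encoding improves the flexibility and generalization of HOI video generation while preserving physical plausibility.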

Abstract: To address key limitations in human-object interaction (HOI) video generation – specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility – we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization. Extensive experiments confirm state-of-the-art performance in interaction naturalness and generalization under weak supervision. Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface. The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.

[56] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought cs.CV

Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang

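TL;DR: Video-CoT is a new dataset for spatiotemporal video understanding that uses Chain-of-Thought (CoT) methods to provide fine-grained question-answer pairs and annotated samples, aiming to improve large-scale vision-language models (VLMs) on video analysis.

Details

Motivation: Existing large-scale VLMs struggle to capture complex spatiotemporal details in video analysis, so a more comprehensive dataset and evaluation standard are needed to advance research in this area.

Result: Experiments show that current VLMs perform poorly on spatiotemporal video understanding tasks, underscoring how challenging the task is.

Insight: Video-CoT provides a new foundation for research on multimedia video understanding and paves the way for intelligent systems that require high-precision video analysis.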

Abstract: Video content comprehension is essential for various applications, ranging from video analysis to interactive systems. Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal question-answer pairs and 23,000 high-quality CoT-annotated samples, providing a solid foundation for evaluating spatiotemporal understanding in video comprehension. Additionally, we provide a comprehensive benchmark for assessing these tasks, with each task featuring 750 images and tailored evaluation metrics. Our extensive experiments reveal that current VLMs face significant challenges in achieving satisfactory performance, highlighting the difficulties of effective spatiotemporal understanding. Overall, the Video-CoT dataset and benchmark open new avenues for research in multimedia understanding and support future innovations in intelligent systems requiring advanced video analysis capabilities. By making these resources publicly available, we aim to encourage further exploration in this critical area. Project website: https://video-cot.github.io/.

[57] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics cs.CV | cs.AI | cs.CL

Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks

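TL;DR: The paper is the first to systematically quantify how well text-to-image (T2I) models and evaluation metrics align with explicit and implicit cultural expectations, and it introduces the CulturalFrames benchmark for this evaluation. It finds that T2I models fall significantly short in cultural representation and that existing evaluation metrics correlate poorly with human judgments.

Details

Motivation: As T2I models become widespread in visual content generation, their accuracy across diverse cultural contexts raises concerns. The paper aims to fill the research gap on aligning T2I models with cultural expectations.

Result: T2I models miss cultural expectations 44% of the time on average; among those failures, explicit expectations are missed at an average rate of 68% and implicit ones at 49%. Existing metrics also correlate poorly with human judgments.

Insight: The study exposes shortcomings in the cultural sensitivity of T2I models and evaluation methods and points to directions for developing more culturally aware models and evaluation methods.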

Abstract: The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.

[58] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis cs.CV

Jingguo Qu, Xinyang Han, Tonghuan Xiao, Jia Ai, Juan Wu

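TL;DR: The paper proposes a domain adaptation approach for medical ultrasound image analysis that fine-tunes vision-language foundation models, using a large language model as a text refiner together with purpose-built task-driven heads, and significantly improves model performance.

Details

Motivation: Annotating medical ultrasound images is time-consuming and requires expertise, and off-the-shelf vision-language foundation models show a performance gap between natural and medical images. Domain adaptation methods are therefore needed to improve their medical image analysis capability.

Result: Experiments show that the method significantly improves vision-language foundation models on ultrasound image segmentation and classification and outperforms existing models.

Insight: Vision-language foundation models can be transferred effectively to medical image analysis through domain adaptation; text refinement and task-driven design are the key factors.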

Abstract: Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore the fine-tuning pipeline for vision-language foundation models by using a large language model as a text refiner with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method can effectively improve the performance of vision-language foundation models for ultrasound image analysis, and outperform the existing state-of-the-art vision-language and pure foundation models. The source code of this study is available at https://github.com/jinggqu/NextGen-UIA.

Junzhuo Liu, Markus Eckstein, Zhixiang Wang, Friedrich Feuerhake, Dorit Merhof

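TL;DR: The paper proposes a contrastive-learning-based deep learning method that predicts spatial transcriptomic expression from whole-slide images. It significantly improves gene expression prediction accuracy, works on datasets with limited samples, and shows potential for cancer tissue localization.

Details

Motivation: Spatial transcriptomics data are expensive to acquire, so large-scale data are hard to obtain. Predicting gene expression levels from pathology images can lower costs and broaden research potential.

Result: The Pearson Correlation Coefficient (PCC) of the predictions improves by 6.27%, 6.11%, and 11.26% for highly expressed genes, highly variable genes, and marker genes, respectively.

Insight: The method preserves gene-gene correlations, applies to small-sample datasets, and shows potential for cancer tissue localization.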

Abstract: Spatial transcriptomics is a technology that captures gene expression levels at different spatial locations, widely used in tumor microenvironment analysis and molecular profiling of histopathology, providing valuable insights into resolving gene expression and clinical diagnosis of cancer. Due to the high cost of data acquisition, large-scale spatial transcriptomics data remain challenging to obtain. In this study, we develop a contrastive learning-based deep learning method to predict spatially resolved gene expression from whole-slide images. Evaluation across six different disease datasets demonstrates that, compared to existing studies, our method improves Pearson Correlation Coefficient (PCC) in the prediction of highly expressed genes, highly variable genes, and marker genes by 6.27%, 6.11%, and 11.26% respectively. Further analysis indicates that our method preserves gene-gene correlations and applies to datasets with limited samples. Additionally, our method exhibits potential in cancer tissue localization based on biomarker expression.
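
The PCC figures above compare predicted and measured expression for each gene. As a minimal sketch of how such a score is computed (the function name and the sample values are illustrative assumptions, not taken from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression values for one gene across four tissue spots.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(pearson(measured, predicted), 3))
```

A benchmark would typically average this value over a gene set (for example the highly variable genes); how the paper aggregates it is not stated in the abstract.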

[60] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams cs.CV | cs.LG

Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao

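TL;DR: StreamSplat is the first feed-forward framework that converts uncalibrated video streams into dynamic 3D Gaussian Splatting representations in real time, addressing three major challenges in online reconstruction of dynamic 3D scenes.

Details

Motivation: Existing methods struggle to handle uncalibrated inputs, dynamic scene modeling, and long-term stability at the same time, which limits real-time 3D reconstruction applications.

Result: Performs strongly on static and dynamic benchmarks and supports online reconstruction of video streams of arbitrary length.

Insight: A feed-forward architecture and local dynamic modeling address the real-time and stability problems of dynamic 3D reconstruction.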

Abstract: Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.

[61] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval cs.CV

Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang

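TL;DR: DiscoVLA proposes a parameter-efficient video-text retrieval method that significantly improves performance by jointly addressing discrepancies in vision, language, and alignment.

Details

Motivation: Existing methods focus mainly on the vision discrepancy and neglect the language and alignment discrepancies, so transfer from the image level to the video level works poorly.

Result: On MSRVTT, DiscoVLA improves R@1 by 1.5% over previous methods, reaching 50.5%.

Insight: Comprehensive discrepancy reduction across vision, language, and alignment is essential for video-text retrieval.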

Abstract: The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
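
R@1 is the fraction of text queries whose matching video is ranked first. The sketch below shows the metric on a toy similarity matrix; the function name, the matrix values, and the assumption that query i matches video i are illustrative and not taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Hypothetical text-to-video similarities: rows are queries, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

Here the second query ranks the wrong video first, so two of the three queries count as hits at k=1 and all three at k=2.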

[62] Product of Experts for Visual Generation cs.CV | cs.AI

Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, Jiajun Wu

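TL;DR: The paper proposes a training-free method based on the Product of Experts (PoE) framework that composes knowledge from heterogeneous models via Annealed Importance Sampling (AIS) for visual generation, offering better controllability than a single model.

Details

Motivation: Modern neural models hold rich priors and complementary knowledge over shared data domains such as images and videos. How to integrate diverse knowledge from visual generative models, visual language models, and sources of human-crafted knowledge such as graphics engines and physics simulators remains under-explored.

Result: Shows better controllability than a single model on image and video synthesis tasks and provides flexible user interfaces for specifying generation goals.

Insight: Composing knowledge from heterogeneous models improves the controllability and flexibility of visual generation without additional training.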

Abstract: Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources – including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators – remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
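
A product distribution multiplies the experts' probabilities and renormalizes, so an outcome scores highly only if every expert supports it. The toy sketch below illustrates just that definition on small discrete distributions. It is not the paper's sampler, which draws from products of high-dimensional models via AIS, and the function name and probabilities are assumptions.

```python
def product_of_experts(*experts):
    """Normalize the elementwise product of discrete distributions over shared outcomes."""
    unnormalized = {}
    for outcome in experts[0]:
        p = 1.0
        for expert in experts:
            # An outcome missing from any expert gets zero probability.
            p *= expert.get(outcome, 0.0)
        unnormalized[outcome] = p
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("experts share no outcome with nonzero probability")
    return {outcome: p / total for outcome, p in unnormalized.items()}

# Two hypothetical experts scoring the same three outcomes.
expert_a = {"cat": 0.5, "dog": 0.4, "car": 0.1}
expert_b = {"cat": 0.2, "dog": 0.2, "car": 0.6}
combined = product_of_experts(expert_a, expert_b)
print(max(combined, key=combined.get))
```

In this example "car" is the favorite of the second expert alone, yet "cat" wins under the product because both experts give it reasonable mass.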

[63] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos cs.CV

Negin Ghamsarian, Raphael Sznitman, Klaus Schoeffmann, Jens Kowal

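TL;DR: The paper introduces WetCat, a dataset dedicated to automated skill assessment in wetlab cataract surgery videos, filling a gap in existing datasets.

Details

Motivation: Traditional wetlab training relies on manual evaluation, which is inefficient and subjective. Computer vision makes automated skill assessment possible, but existing datasets mostly cover real surgeries or isolated tasks.

Result: WetCat lays a foundation for developing interpretable, AI-driven assessment tools and improves the objectivity and scalability of ophthalmic surgical training.

Insight: Annotations focused on the critical surgical phases allow more precise skill assessment, and standardized frameworks help advance automated surgical education.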

Abstract: To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse https://www.synapse.org/Synapse:syn66401174/files.

[64] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis cs.CV

José Morano, Botond Fazekas, Emese Sükei, Ronald Fecso, Taha Emre

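TL;DR: MIRAGE is a multimodal foundation model for comprehensive analysis of retinal OCT and SLO images, and its superiority is validated on a new evaluation benchmark.

Details

Motivation: Existing ophthalmic foundation models lack multimodal support and are insufficiently validated. MIRAGE was developed to address these limitations.

Result: On classification and segmentation tasks, MIRAGE outperforms general and specialized foundation models as well as segmentation methods.

Insight: Multimodal foundation models show promise for medical image analysis and perform especially well when data diversity is limited.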

Abstract: Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT). However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data. Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges. Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality. In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images. Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis. Both MIRAGE and the evaluation benchmark are publicly available: https://github.com/j-morano/MIRAGE.

[65] Inherently Faithful Attention Maps for Vision Transformers cs.CV | cs.AI

Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos

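TL;DR: The paper proposes an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction, improving robustness to spurious correlations and out-of-distribution backgrounds.

Details

Motivation: Context strongly affects object perception and can lead to biased representations, especially when objects appear against out-of-distribution backgrounds. At the same time, many image-level tasks require identifying relevant regions, which often depends on context. This tension needs to be resolved.

Result: Significantly improves robustness to spurious correlations and out-of-distribution backgrounds across diverse benchmarks.

Insight: Restricting the receptive field with attention masks filters out spurious information while retaining the context the task needs.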

Abstract: We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
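
The effect of a binary attention mask can be shown on a single row of attention scores: masked positions receive exactly zero weight, so they cannot influence the output. This is only a sketch of the masking step under assumed names and values. In the paper the masks are learned in stage 1 and both stages are trained jointly, which is not modeled here.

```python
import math

def masked_attention_weights(scores, mask):
    """Softmax over scores, with positions where mask == 0 forced to zero weight."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask removes every position")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three image patches; the third (a background patch,
# say) is masked out even though it has the highest raw score.
weights = masked_attention_weights([2.0, 1.0, 3.0], [1, 1, 0])
print(weights)
```

Because the third patch gets zero weight regardless of its score, a spurious background cue there cannot leak into the prediction.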

[66] Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions cs.CV | cs.AI | cs.CL

David Acuna, Ximing Lu, Jaehun Jung, Hyunwoo Kim, Amlan Kar

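TL;DR: The paper proposes Socratic-MCTS, which injects subquestion-subanswer pairs into non-reasoning models and uses Monte Carlo Tree Search (MCTS) to elicit hidden knowledge and induce long reasoning chains without additional training.

Details

Motivation: Most existing vision-language models (VLMs) are non-reasoning models that are already widely deployed, and simply discarding them is unrealistic. The paper explores how a search mechanism can elicit their latent reasoning ability without additional training.

Result: Performs well on all three benchmarks, with a 2% overall gain on MMMU-PRO and a 9% gain in Liberal Arts.

Insight: Framing reasoning as a search problem makes effective use of the fragmented knowledge in existing non-reasoning models; subquestions guide long reasoning chains, offering a new direction for improving models.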

Abstract: Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning – akin to the success observed in language models – via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces – without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model’s output stream. We show that framing reasoning as a search process – where subquestions act as latent decisions within a broader inference trajectory – helps the model “connect the dots” between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.

[67] What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities cs.CV

Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao

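TL;DR: The paper introduces the OmniBench benchmark and the OmniEval evaluation framework for multidimensional evaluation of virtual agents based on multimodal large language models (MLLMs), addressing the limitations of existing benchmarks.

Details

Motivation: Existing benchmarks fall short in controlling task complexity, in manual annotation cost, and in multidimensional evaluation, which limits the application and development of virtual agents.

Result: Experiments show that graph-structured data guides agents more efficiently than manually annotated data, and that open-source and closed-source models differ markedly across the evaluated dimensions.

Insight: Graph-structured tasks and multidimensional evaluation give a more complete picture of virtual agent capabilities and open new directions for future research.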

Abstract: As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it can more efficiently guide agents compared to manually annotated data. We conduct multidimensional evaluations for various open-source and closed-source models, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io/.

[68] SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation cs.CV

Hongjie Zhu, Xiwei Liu, Rundong Xue, Zeyu Zhang, Yong Xu

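TL;DR: Proposes SSS, a semi-supervised learning method that builds on SAM-2's strong feature extraction and uses consistency regularization and a feature enhancement mechanism to improve medical image segmentation, with strong experimental results.

Details

Motivation: Annotating medical images is expensive, so using large amounts of unlabeled data to improve model performance is a key challenge. Semi-supervised learning is a promising direction.

Result: Performs strongly on the ACDC and BHSD datasets, reaching a Dice score of 53.15 on BHSD, 3.65 above the previous state-of-the-art method.

Insight: Combining vision foundation models such as SAM-2 with semi-supervised learning effectively extracts knowledge from unlabeled data and improves medical image segmentation.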

Abstract: In the era of information explosion, efficiently leveraging large-scale unlabeled data while minimizing the reliance on high-quality pixel-level annotations remains a critical challenge in the field of medical imaging. Semi-supervised learning (SSL) enhances the utilization of unlabeled data by facilitating knowledge transfer, significantly improving the performance of fully supervised models and emerging as a highly promising research direction in medical image analysis. Inspired by the ability of Vision Foundation Models (e.g., SAM-2) to provide rich prior knowledge, we propose SSS (Semi-Supervised SAM-2), a novel approach that leverages SAM-2’s robust feature extraction capabilities to uncover latent knowledge in unlabeled medical images, thus effectively enhancing feature support for fully supervised medical image segmentation. Specifically, building upon the single-stream “weak-to-strong” consistency regularization framework, this paper introduces a Discriminative Feature Enhancement (DFE) mechanism to further explore the feature discrepancies introduced by various data augmentation strategies across multiple views. By leveraging feature similarity and dissimilarity across multi-scale augmentation techniques, the method reconstructs and models the features, thereby effectively optimizing the salient regions. Furthermore, a prompt generator is developed that integrates Physical Constraints with a Sliding Window (PCSW) mechanism to generate input prompts for unlabeled data, fulfilling SAM-2’s requirement for additional prompts. Extensive experiments demonstrate the superiority of the proposed method for semi-supervised medical image segmentation on two multi-label datasets, i.e., ACDC and BHSD. Notably, SSS achieves an average Dice score of 53.15 on BHSD, surpassing the previous state-of-the-art method by +3.65 Dice. Code will be available at https://github.com/AIGeeksGroup/SSS.
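
The Dice score reported above measures the overlap between a predicted mask and the ground truth. A minimal sketch of the metric on flat binary masks follows; the function name, the sample masks, and the empty-mask convention are assumptions, and the paper's figures appear to be on a 0 to 100 scale averaged over classes.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total

# Hypothetical 4-pixel masks: the prediction covers the true pixel plus one extra.
prediction = [1, 1, 0, 0]
ground_truth = [1, 0, 0, 0]
print(round(dice_score(prediction, ground_truth), 4))
```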

[69] Cross-Spectral Body Recognition with Side Information Embedding: Benchmarks on LLCM and Analyzing Range-Induced Occlusions on IJB-MDF cs.CV

Anirudh Nanduri, Siyuan Huang, Rama Chellappa

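TL;DR: The paper studies cross-spectral body recognition and proposes a Vision Transformer (ViT) based method combined with Side Information Embedding (SIE). Experiments show that encoding only camera information achieves state-of-the-art performance on the LLCM dataset. The paper also examines how range-induced occlusions affect visible-infrared (VI) person re-identification.

Details

Motivation: Cross-spectral body recognition, especially matching visible and infrared images, is a challenging problem. Conventional ViT models perform poorly on cross-spectral tasks, and existing datasets offer little coverage of occlusion, so the model needs improvement and this research gap needs filling.

Result: On LLCM, the SIE-ViT that encodes only camera information achieves state-of-the-art performance. Analysis on the IJB-MDF dataset also reveals the importance of occlusion in VI-ReID.

Insight: 1. In cross-spectral tasks, camera information may be more discriminative than domain information (visible vs. infrared). 2. Existing VI-ReID datasets lack occlusion diversity, which limits model generalization.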

Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of biometric tasks, including face and body recognition. In this work, we adapt a ViT model pretrained on visible (VIS) imagery to the challenging problem of cross-spectral body recognition, which involves matching images captured in the visible and infrared (IR) domains. Recent ViT architectures have explored incorporating additional embeddings beyond traditional positional embeddings. Building on this idea, we integrate Side Information Embedding (SIE) and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only camera information - without explicitly incorporating domain information - achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in visible-infrared (VI) Re-ID remain largely underexplored - primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.

[70] Segment Concealed Objects with Incomplete Supervision cs.CV | cs.AI | cs.LG

Chunming He, Kai Li, Yachao Zhang, Ziyun Yang, Youwei Pang

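TL;DR: The paper proposes SEE, a unified method for incompletely supervised concealed object segmentation. It combines a mean-teacher framework with SAM to generate high-quality pseudo-labels and designs a hybrid-granularity feature grouping module to address the difficulty of distinguishing concealed objects from the background.

Details

Motivation: Concealed object segmentation is highly challenging because annotations are incomplete and objects closely resemble the background, so a method that makes effective use of weakly annotated and semi-annotated data is needed.

Result: Experiments show that SEE achieves state-of-the-art performance on multiple incompletely supervised concealed object segmentation tasks and can serve as a plug-and-play solution to improve existing models.

Insight: Combining a pretrained vision foundation model such as SAM with a mean-teacher framework makes effective use of weakly annotated data, and the feature grouping module helps with the similarity problem in concealed object segmentation.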

Abstract: Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi-annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean-teacher framework, SEE, that leverages the vision foundation model, Segment Anything Model (SAM), to generate pseudo-labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low-quality segmentation masks, we introduce a series of strategies for pseudo-label generation, storage, and supervision. These strategies aim to produce informative pseudo-labels, store the best pseudo-labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid-granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single-object and multiple-object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state-of-the-art performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing the performance of existing models.

[71] Data Augmentation For Small Object using Fast AutoAugment cs.CV | cs.LG

DaeEun Yoon, Semin Kim, SangWook Yoo, Jongha Lee

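TL;DR: The paper proposes a data augmentation method based on Fast AutoAugment that significantly improves small object detection, achieving a 20% performance improvement on the DOTA dataset.

Details

Motivation: Although object detection has advanced greatly in recent years, detection performance for small objects remains far below that for large objects. Small object detection is one of the most challenging and important problems in computer vision.

Result: Achieves a 20% improvement in small object detection performance on the DOTA dataset.

Insight: Optimizing automated data augmentation policies can effectively ease the performance bottleneck in small object detection.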

Abstract: In recent years, there has been tremendous progress in object detection performance. However, despite these advances, the detection performance for small objects is significantly inferior to that of large objects. Detecting small objects is one of the most challenging and important problems in computer vision. To improve the detection performance for small objects, we propose an optimal data augmentation method using Fast AutoAugment. Through our proposed method, we can quickly find optimal augmentation policies that can overcome degradation when detecting small objects, and we achieve a 20% performance improvement on the DOTA dataset.

[72] ORIDa: Object-centric Real-world Image Composition Dataset cs.CV

Jinwoo Kim, Sangmin Han, Jinho Jeong, Jiwoo Choi, Dongyoung Kim

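TL;DR: ORIDa is a large-scale, real-captured dataset for object compositing containing over 30,000 images of 200 unique objects. It provides two data types, factual-counterfactual sets and factual-only scenes, addressing the limited diversity of existing datasets.

Details

Motivation: Current object compositing datasets lack the diversity and scale needed to explore real-world scenarios comprehensively; ORIDa is proposed to address this.

Result: ORIDa's richness and diversity provide strong support for object compositing, and experiments and analysis confirm its value as a research resource.

Insight: Large-scale, diverse datasets are key to advancing object compositing research; ORIDa fills a gap in existing datasets and provides an important resource for further development of generative models.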

Abstract: Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured dataset containing over 30,000 images featuring 200 unique objects, each of which is presented across varied positions and scenes. ORIDa has two types of data: factual-counterfactual sets and factual-only scenes. The factual-counterfactual sets consist of four factual images showing an object in different positions within a scene and a single counterfactual (or background) image of the scene without the object, resulting in five images per scene. The factual-only scenes include a single image containing an object in a specific context, expanding the variety of environments. To our knowledge, ORIDa is the first publicly available dataset with its scale and complexity for real-world image composition. Extensive analysis and experiments highlight the value of ORIDa as a resource for advancing further research in object compositing.

[73] ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations cs.CV

Amirreza Rouhi, Solmaz Arezoomandan, Knut Peterson, Joseph T. Woods, David K. Han

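TL;DR: ADAM is a training-free, self-refining annotation framework that uses LLMs and CLIP to generate context-aware labels for unknown objects in open-world settings.

Details

Motivation: Conventional object detection models rely on predefined categories and struggle to identify novel objects in open-world scenarios; ADAM aims to remove this limitation.

Result: On the COCO and PASCAL datasets, ADAM annotates novel categories without any fine-tuning or retraining.

Insight: Combining LLMs with vision models can address open-world object labeling, and the self-refinement loop improves label consistency.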

Abstract: Object detection models typically rely on predefined categories, limiting their ability to identify novel objects in open-world scenarios. To overcome this constraint, we introduce ADAM: Autonomous Discovery and Annotation Model, a training-free, self-refining framework for open-world object labeling. ADAM leverages large language models (LLMs) to generate candidate labels for unknown objects based on contextual information from known entities within a scene. These labels are paired with visual embeddings from CLIP to construct an Embedding-Label Repository (ELR) that enables inference without category supervision. For a newly encountered unknown object, ADAM retrieves visually similar instances from the ELR and applies frequency-based voting and cross-modal re-ranking to assign a robust label. To further enhance consistency, we introduce a self-refinement loop that re-evaluates repository labels using visual cohesion analysis and k-nearest-neighbor-based majority re-labeling. Experimental results on the COCO and PASCAL datasets demonstrate that ADAM effectively annotates novel categories using only visual and contextual signals, without requiring any fine-tuning or retraining.
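
The retrieval-and-voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional vectors standing in for CLIP embeddings, and the similarity-sum tie-break are all assumptions made for illustration, and the cross-modal re-ranking and self-refinement stages are omitted.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vote_label(query, repository, k=3):
    """Label a query embedding by frequency voting over its k nearest entries.

    `repository` is a hypothetical stand-in for the Embedding-Label
    Repository: a list of (embedding, label) pairs.
    """
    scored = [(cosine(query, emb), label) for emb, label in repository]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    counts = Counter(label for _, label in top)
    # Assumed tie-break: prefer the label with the larger summed similarity.
    sim_sum = Counter()
    for sim, label in top:
        sim_sum[label] += sim
    return max(counts, key=lambda label: (counts[label], sim_sum[label]))

# Toy repository with made-up embeddings and labels.
repository = [
    ([1.0, 0.0], "kite"),
    ([0.9, 0.1], "kite"),
    ([0.0, 1.0], "frisbee"),
    ([0.1, 0.9], "frisbee"),
    ([0.8, 0.3], "bird"),
]
label_a = vote_label([1.0, 0.05], repository, k=3)
label_b = vote_label([0.0, 1.0], repository, k=3)
```

In this toy setup the first query sits closest to the two "kite" entries and the second to the two "frisbee" entries, so majority voting returns those labels.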

[74] Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models cs.CV | cs.AI | cs.LGPDF

Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, Jing Qin, Liansheng Wang

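TL;DR: ALTA is an efficient medical vision-language alignment method that adapts a masked vision model. With only a small fraction of the trainable parameters and compute, it performs strongly on retrieval and zero-shot classification.

Details

Motivation: Conventional cross-modal contrastive learning methods (such as CLIP) have suboptimal visual representation capabilities, which limits their vision-language alignment; models pretrained with multimodal masked modeling excel at visual representation but struggle with direct cross-modal matching. ALTA aims to resolve this contradiction.

Result: ALTA outperforms the best-performing counterpart by over 4 absolute points in text-to-image accuracy and about 6 absolute points in image-to-text retrieval accuracy, while using less than 1/5 of the computation required for masked record modeling.

Insight: Adapting masked vision models not only improves vision-language alignment but also promotes better vision and language understanding.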

Abstract: Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.

[75] Do MIL Models Transfer? cs.CVPDF

Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu

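TL;DR: The paper systematically evaluates 11 MIL models across 21 pretraining tasks and finds that pretrained MIL models outperform models trained from scratch, even when pretrained on different organs than the target task, which demonstrates strong transferability.

Details

Motivation: Transfer learning is widely used in NLP and conventional computer vision, but the transferability of MIL models in computational pathology remains poorly understood. The study aims to fill this gap.

Result: Pretrained MIL models consistently outperform models trained from scratch, and models pretrained on pan-cancer data generalize strongly across organs and tasks.

Insight: MIL models transfer well; transfer learning and pretraining strategies can mitigate data scarcity in computational pathology.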

Abstract: Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab

[76] Princeton365: A Diverse Dataset with Accurate Camera Pose cs.CVPDF

Karhan Kayan, Stamatis Alexandropoulos, Rishabh Jain, Yiming Zuo, Erich Liang

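TL;DR: Princeton365 is a large-scale, diverse dataset of 365 videos with accurate camera pose, bridging the gap between accuracy and data diversity in current SLAM benchmarks.

Details

Motivation: Current SLAM benchmarks typically lack either data diversity or highly accurate camera pose; Princeton365 addresses this with a new ground-truth collection framework that combines calibration boards and a 360-camera.

Result: The dataset supports diverse evaluation of SLAM and NVS tasks, and the new metric allows performance to be compared across scenes.

Insight: Diverse data and accurate pose are essential for advancing SLAM methods, and the new metric helps identify their failure modes.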

Abstract: We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our new metric allows the performance of SLAM methods to be compared across scenes, allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit https://princeton365.cs.princeton.edu for the dataset, code, videos, and submission.

[77] Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better cs.CV | cs.AI | cs.CLPDF

Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu

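TL;DR: The paper proposes ASVR, which jointly learns the visual and textual modalities through autoregressive semantic visual reconstruction. It addresses the under-use of visual information in existing large vision-language models and improves performance on multimodal understanding tasks.

Details

Motivation: Existing large vision-language models (LVLMs) apply autoregressive supervision only to text sequences and do not fully incorporate the visual modality. As a result, they cannot use images without captions, captions may omit critical visual details, and some visual content cannot be adequately conveyed through text.

Result: ASVR improves LLaVA-1.5 by 5% in average score across 14 multimodal benchmarks, with consistent gains across data scales and LLM backbones.

Insight: Reconstructing semantic representations of images improves multimodal understanding more effectively than reconstructing raw appearance, and an autoregressive framework can handle visual and textual information in a unified way.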

Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.

[78] Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models cs.CVPDF

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang

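TL;DR: Cosmos-Drive-Dreams is a synthetic data generation (SDG) pipeline that produces high-fidelity, challenging driving scenarios to address the long-tail distribution and the scarcity of edge-case data in autonomous driving.

Details

Motivation: Collecting and annotating real-world autonomous driving data is costly and time-consuming, and edge cases, which are critical for training and testing, are especially hard to capture. The authors use synthetic data generation to close this gap.

Result: Experiments show that the generated data helps mitigate long-tail distribution problems and improves generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning.

Insight: Synthetic data generation can supplement scarce real data, particularly for high-fidelity and edge-case scenarios, and offers a new approach to training and testing autonomous driving systems.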

Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as autonomous vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain and capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection and driving policy learning. We open source our pipeline toolkit, dataset and model weights through NVIDIA’s Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams

[79] MagCache: Fast Video Generation with Magnitude-Aware Cache cs.CVPDF

Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, Qi Tian

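TL;DR: MagCache accelerates video generation with a magnitude-aware cache. Based on a unified law observed in the magnitude of residual outputs, it adaptively skips unimportant timesteps, needs no large calibration set, and speeds up generation while preserving visual quality.

Details

Motivation: Existing acceleration techniques for video diffusion models usually rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features, but they require extensive calibration and risk inconsistent outputs due to prompt-specific overfitting.

Result: MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, and outperforms existing methods on LPIPS, SSIM, and PSNR.

Insight: The regular behavior of the residual-output magnitude ratio is a predictable property shared across video diffusion models, and it can guide the design of adaptive acceleration strategies.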

Abstract: Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically, steadily across most timesteps and rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under comparable computational budgets.
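
One way to picture an error-bounded skipping rule driven by magnitude ratios is the sketch below. The abstract does not spell out the error model, so the accumulation rule, the threshold, the cap on consecutive skips, and every name here are assumptions for illustration rather than the paper's algorithm.

```python
def plan_skips(mag_ratios, error_threshold=0.06, max_consecutive_skips=2):
    """Return one boolean per timestep: True = reuse cache, False = recompute.

    `mag_ratios[i]` is assumed to be the calibrated magnitude ratio between
    the residual output at step i and the one at step i - 1.
    """
    decisions = []
    acc_ratio = 1.0   # product of ratios since the last full computation
    acc_error = 0.0   # assumed error estimate: summed deviation from 1.0
    consecutive = 0
    for i, ratio in enumerate(mag_ratios):
        if i == 0:
            decisions.append(False)  # the first step has nothing cached yet
            continue
        candidate_ratio = acc_ratio * ratio
        candidate_error = acc_error + abs(1.0 - candidate_ratio)
        if candidate_error <= error_threshold and consecutive < max_consecutive_skips:
            decisions.append(True)
            acc_ratio, acc_error = candidate_ratio, candidate_error
            consecutive += 1
        else:
            # Recompute this step and reset the accumulated estimates.
            decisions.append(False)
            acc_ratio, acc_error, consecutive = 1.0, 0.0, 0
    return decisions

# Made-up ratios: near 1.0 for most steps, dropping quickly at the end.
toy_ratios = [1.0, 0.99, 0.99, 0.99, 0.98, 0.7, 0.5]
toy_plan = plan_skips(toy_ratios)
```

With these made-up ratios the plan skips the early, slowly changing steps (subject to the consecutive-skip cap) and recomputes the final steps, where the ratio falls sharply.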

cs.CL [Back]

[80] Conservative Bias in Large Language Models: Measuring Relation Predictions cs.CLPDF

Toyin Aguda, Erik Wilson, Allan Anzagira, Simerjot Kaur, Charese Smiley

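TL;DR: The paper studies the conservative bias of large language models (LLMs) in relation extraction: models tend to choose an uninformative label (such as No_Relation) over a possibly wrong one, which avoids incorrect assignments but loses information. The study evaluates this behavior systematically across multiple prompts and datasets and introduces the concept of Hobson’s choice to describe it.

Details

Motivation: LLMs show a pronounced conservative bias in relation extraction, preferring safe but uninformative labels to possibly incorrect ones. This reduces errors but loses information, and the authors set out to quantify the bias and its impact.

Result: Conservative bias occurs twice as often as hallucination in relation extraction. Semantic similarity analysis shows that the conservative behavior is consistent across prompts.

Insight: (1) Conservative bias in LLMs reduces errors but loses information; (2) future model design needs to balance conservatism against hallucination; (3) the concept of Hobson’s choice offers a new perspective on conservative model behavior.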

Abstract: Large language models (LLMs) exhibit pronounced conservative bias in relation extraction tasks, frequently defaulting to the No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson’s choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.

[81] EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments cs.CLPDF

Zefang Liu, Yinzhu Quan

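TL;DR: EconWebArena is a benchmark for evaluating autonomous agents on complex economic tasks in realistic web environments. It comprises 360 tasks across multiple economic domains and emphasizes authoritative data sources and grounded web-based reasoning.

Details

Motivation: Existing benchmarks do not adequately evaluate economic tasks in real web environments. EconWebArena fills this gap with multimodal tasks that test agents' navigation, reasoning, and interaction abilities.

Result: Current agents show substantial performance gaps, with persistent challenges in grounding and navigation.

Insight: Task realism matters for economic reasoning, and future work should improve multimodal understanding and interaction design.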

Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

[82] Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions cs.CL | cs.AIPDF

Yu-Ang Lee, Guan-Ting Yi, Mei-Yi Liu, Jui-Chao Lu, Guan-Bo Yang

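TL;DR: This survey systematically reviews optimization methods, challenges, and future directions for compound AI systems. It focuses on the new challenges of integrating multiple components and optimizing their interactions, and it covers emerging approaches such as natural language feedback.

Details

Motivation: As LLMs and AI systems advance, compound AI systems grow more complex, and traditional optimization methods struggle with interactions among multiple components, so new optimization methods are needed.

Result: The paper systematically surveys progress in the field and highlights open problems such as optimizing non-differentiable systems.

Insight: Natural language feedback offers a new approach to optimizing compound AI systems, especially for non-differentiable components that traditional methods handle poorly.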

Abstract: Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at https://github.com/MiuLab/AISysOpt-Survey.

[83] Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning cs.CL | cs.AIPDF

Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar

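TL;DR: The paper introduces CLAIM-BENCH, a benchmark for evaluating LLMs on scientific claim-evidence extraction and validation. It reveals the limitations of LLMs in processing complex scientific content and shows the advantages of closed-source models (such as GPT-4) and specific prompting approaches.

Details

Motivation: LLMs are widely used for scientific research tasks, but their ability to understand the intricate relationships between claims and evidence in papers has not been studied sufficiently, so a systematic evaluation of their scientific reasoning is needed.

Result: 1. Closed-source models (such as GPT-4 and Claude) outperform open-source models in claim-evidence identification. 2. Specific prompting approaches improve accuracy at increased computational cost. 3. LLMs still show significant limitations in processing complex scientific content.

Insight: 1. Scientific reasoning requires deeper comprehension than current LLMs provide. 2. Prompt design is critical to task performance. 3. CLAIM-BENCH provides a benchmark for building more reliable scientific reasoning systems.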

Abstract: Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’ capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs’ ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs’ abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.

[84] Automatic Generation of Inference Making Questions for Reading Comprehension Assessments cs.CL | cs.AIPDF

Wanjing Anya Ma, Michael Flor, Zuowei Wang

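TL;DR: The paper presents a method for automatically generating inference-making questions for reading comprehension. It uses GPT-4o with few-shot prompting to produce diagnostic items and combines this with human judgment for scalable, high-quality assessment.

Details

Motivation: Inference making is a complex but essential reading comprehension skill that involves resolving references across sentences and using prior knowledge. Diagnostic questions can help educators provide more targeted instruction.

Result: 93.8% of the questions generated by GPT-4o were of good quality and suitable for grade 3-12 contexts, but only 42.6% accurately matched the targeted inference type.

Insight: Combining automatic generation with human judgment is a promising path to scalable, high-quality diagnostic items, but inference-type accuracy still needs improvement.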

Abstract: Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.

[85] Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency cs.CLPDF

Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna

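TL;DR: The paper examines whether explicit self-reflection tokens such as “Wait” are necessary in large reasoning models and proposes NoWait, which suppresses them. Experiments show that it substantially shortens reasoning trajectories while preserving model performance.

Details

Motivation: Large reasoning models overthink, producing verbose and redundant outputs. The study asks whether explicit self-reflection signaled by tokens such as “Wait” is actually required for advanced reasoning.

Result: Across ten benchmarks, NoWait reduces chain-of-thought trajectory length by up to 27%-51% without compromising model utility.

Insight: Explicit self-reflection tokens may not be necessary for reasoning; removing them improves reasoning efficiency and suggests a direction for designing efficient reasoning models.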

Abstract: Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
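
Suppressing specific tokens during inference is commonly done by masking their logits before the next token is chosen. The sketch below shows that general idea on a toy vocabulary; the vocabulary, the logit values, and the function names are hypothetical, and the paper's actual keyword list and decoding setup may differ.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of `logits` with banned token ids set to -inf."""
    return [-math.inf if i in banned_ids else value
            for i, value in enumerate(logits)]

def greedy_pick(logits):
    """Index of the highest-scoring token (greedy decoding for one step)."""
    return max(range(len(logits)), key=logits.__getitem__)

# Toy vocabulary and logits, invented for illustration.
vocab = ["Wait", "Hmm", "So", "the", "answer"]
banned_ids = {vocab.index("Wait"), vocab.index("Hmm")}
step_logits = [3.0, 2.5, 1.0, 0.2, 0.1]

unconstrained = vocab[greedy_pick(step_logits)]
constrained = vocab[greedy_pick(suppress_tokens(step_logits, banned_ids))]
```

Here the unconstrained step would emit "Wait", while the masked step falls through to the best remaining token, "So". Because the change only touches the logits at decoding time, it requires no retraining, which matches the plug-and-play framing in the abstract.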

[86] Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning cs.CL | cs.AI | cs.IRPDF

Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu

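TL;DR: This position paper argues that text embedding research should move beyond surface meaning and treat implicit semantics as a central modeling goal, and it proposes improvements to training data, evaluation, and objectives.

Details

Motivation: Current text embedding models focus mainly on surface-level semantics and neglect implicit meaning shaped by pragmatics, speaker intent, and sociocultural context, so they perform poorly on tasks that require interpretive reasoning or social meaning.

Result: A pilot study shows that even state-of-the-art models perform only marginally better than simplistic baselines on implicit semantics tasks.

Insight: Text embedding research should align more closely with the complexity of real language, supporting implicit semantics from training data through evaluation.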

Abstract: This position paper argues that the text embedding research community should move beyond surface meaning and embrace implicit semantics as a central modeling goal. Text embedding models have become foundational in modern NLP, powering a wide range of applications and drawing increasing research attention. Yet, much of this progress remains narrowly focused on surface-level semantics. In contrast, linguistic theory emphasizes that meaning is often implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current embedding models are typically trained on data that lacks such depth and evaluated on benchmarks that reward the capture of surface meaning. As a result, they struggle with tasks requiring interpretive reasoning, speaker stance, or social meaning. Our pilot study highlights this gap, showing that even state-of-the-art models perform only marginally better than simplistic baselines on implicit semantics tasks. To address this, we call for a paradigm shift: embedding research should prioritize more diverse and linguistically grounded training data, design benchmarks that evaluate deeper semantic understanding, and explicitly frame implicit meaning as a core modeling objective, better aligning embeddings with real-world language complexity.

[87] CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs cs.CLPDF

Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han

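TL;DR: CC-RAG is a retrieval-augmented generation approach that builds causal graphs to support multi-hop reasoning, producing more accurate responses on specialized-domain tasks.

Details

Motivation: LLMs have limitations in causal reasoning and specialized-domain tasks, and standard RAG lacks structured reasoning capability.

Result: In experiments on two specialized domains, Bitcoin price fluctuations and Gaucher disease, CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity.

Insight: Explicitly modeling causal structure improves LLM performance in specialized domains, especially on tasks that require complex reasoning.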

Abstract: Understanding cause-and-effect relationships remains a formidable challenge for Large Language Models (LLMs), particularly in specialized domains where reasoning requires more than surface-level correlations. Retrieval-Augmented Generation (RAG) improves factual accuracy, but standard RAG pipelines treat evidence as flat context, lacking the structure required to model true causal dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that integrates zero-shot triple extraction and theme-aware graph chaining into the RAG pipeline, enabling structured multi-hop inference. Given a domain-specific corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of <cause, relation, effect> triples and uses forward/backward chaining to guide structured answer generation. Experiments on two real-world domains, Bitcoin price fluctuations and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity. Both LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results demonstrate that explicitly modeling causal structure enables LLMs to generate more accurate and interpretable responses, especially in specialized domains where flat retrieval fails.
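
The graph construction and forward/backward chaining mentioned in the abstract can be sketched as follows. This is an illustrative reading, not the authors' code: the triples are invented toy data, the function names are hypothetical, and triple extraction, theme awareness, and answer generation are left out.

```python
from collections import defaultdict

def build_graph(triples):
    """Index <cause, relation, effect> triples in both directions."""
    forward = defaultdict(list)   # cause  -> [(relation, effect), ...]
    backward = defaultdict(list)  # effect -> [(relation, cause), ...]
    for cause, relation, effect in triples:
        forward[cause].append((relation, effect))
        backward[effect].append((relation, cause))
    return forward, backward

def chains(adjacency, start, max_hops=3):
    """Enumerate chains from `start`, assuming the graph is acyclic."""
    results = []

    def walk(node, path, hops):
        neighbours = adjacency.get(node, [])
        if not neighbours or hops == max_hops:
            if len(path) > 1:
                results.append(path)
            return
        for relation, nxt in neighbours:
            walk(nxt, path + [relation, nxt], hops + 1)

    walk(start, [start], 0)
    return results

# Invented toy triples; a real system would extract these from a corpus.
toy_triples = [
    ("interest rate hike", "reduces", "risk appetite"),
    ("risk appetite", "drives", "bitcoin demand"),
    ("bitcoin demand", "raises", "bitcoin price"),
]
forward, backward = build_graph(toy_triples)
effects_chain = chains(forward, "interest rate hike")
causes_chain = chains(backward, "bitcoin price")
```

Forward chaining from a cause answers "what does this lead to?", and backward chaining from an effect answers "what led to this?". The resulting chains could then be passed to a generator as structured context instead of flat retrieved passages.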

[88] mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks cs.CL | cs.LG | cs.SD | eess.ASPDF

Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt

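TL;DR: The paper introduces mSTEB, a new benchmark for evaluating LLMs on speech and text tasks across many languages, with a focus on standardized evaluation of low-resource languages.

Details

Motivation: LLM performance has been studied extensively for English and a few high-resource languages, but low-resource languages lack a standardized evaluation benchmark. mSTEB fills this gap.

Result: The evaluation shows a wide performance gap between high-resource and low-resource languages, especially languages spoken in Africa and the Americas/Oceania, indicating their under-representation in LLMs.

Insight: LLMs perform poorly on low-resource languages, and more investment is needed to improve their coverage.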

Abstract: Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and the Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.

[89] TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration cs.CL | cs.AIPDF

Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen

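TL;DR: TACTIC is a cognitively informed multi-agent translation framework that improves machine translation quality by simulating the cognitive processes of human translators.

Details

Motivation: Existing LLM-based multi-agent translation frameworks neglect key insights from cognitive translation studies, which are essential to understanding human translation strategies such as balancing literal and free translation and refining expressions based on context.

Result: In experiments on multiple language pairs from the FLORES-200 and WMT24 benchmarks, TACTIC (with DeepSeek-V3 as the base model) outperforms GPT-4.1 and DeepSeek-R1 on XCOMET and COMETKIWI-23.

Insight: Applying cognitive theory can strengthen multi-agent translation systems, and simulating the interactive workflow of human translation is an effective way to realize the translation potential of LLMs.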

Abstract: Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC.

[90] Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens cs.CLPDF

Ziyang Ma, Qingyue Yuan, Zhenglin Wang, Deyu Zhou

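TL;DR: The paper studies the meta-cognitive abilities of LLMs and proposes AutoMeco, an automated evaluation framework, and MIRA, an improvement strategy. Experiments indicate that these allow LLM meta-cognition to be evaluated more reasonably and effectively.

Details

Motivation: Prior work focuses mainly on the cognitive error detection capabilities of LLMs, while their meta-cognitive abilities (such as self-awareness of step errors) have received little study even though they are crucial for reliability.

Result: Experiments on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco through comparison with Best-of-N verification, and LLM meta-cognition can be better evaluated using MIRA.

Insight: LLMs have intrinsic meta-cognition, but better lenses (benchmarked with AutoMeco and strengthened with MIRA) are needed to reveal it.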

Abstract: Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.

[91] Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models cs.CLPDF

Jiaxiang Liu, Boxuan Xing, Chenhao Yuan, Chenxiang Zhang, Di Wu

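TL;DR: Know-MRI is an open-source tool for systematically analyzing the internal knowledge mechanisms of LLMs. Its extensible core module automatically matches different input data with interpretation methods and consolidates their outputs.

Details

Motivation: Current interpretation methods differ in input data formats and output forms, which limits the tools built on them; a unified and flexible interpretation framework is needed.

Result: The code and a demonstration video are publicly available, and the tool supports more flexible analysis and more comprehensive model diagnosis.

Insight: Integrating diverse interpretation methods in a unified framework gives LLM interpretability research more practical tool support.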

Abstract: As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpreting outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpreting outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model’s internal knowledge mechanisms from multiple perspectives. Our code is available at https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video on https://youtu.be/NVWZABJ43Bs.

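[92] CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models cs.CL | cs.MA

Ziqi Liu, Ziyang Zhou, Mingxuan Hu

TL;DR: This paper proposes CAF-I, a multi-agent framework that improves the performance of large language models on irony detection. Through multidimensional analysis and collaborative optimization, CAF-I achieves substantial gains in the zero-shot setting.

Details

Motivation: Existing LLM methods for irony detection suffer from single-perspective limitations, insufficient comprehensive understanding, and a lack of interpretability, which motivates a collaborative multi-agent framework.

Result: CAF-I achieves state-of-the-art zero-shot performance, with an average Macro-F1 of 76.31, an absolute improvement of 4.98 points over the strongest prior baseline.

Insight: By simulating human-like multi-perspective analysis, CAF-I improves both the accuracy and the interpretability of irony detection, showing the potential of multi-agent frameworks for language tasks.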

Abstract: Large language models (LLMs) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98-point absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.
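
For readers unfamiliar with the headline metric, the sketch below shows how Macro-F1 is computed: the F1 score is calculated per class and then averaged with equal weight, so a rare class counts as much as a frequent one. The labels are made up for illustration and are not from the paper's datasets; the function returns a value in [0, 1], whereas the abstract reports scores scaled by 100.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for illustration only.
gold = ["ironic", "ironic", "literal", "literal"]
pred = ["ironic", "literal", "literal", "literal"]
score = macro_f1(gold, pred)
```

In practice the same quantity is available as `f1_score(..., average="macro")` in scikit-learn.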

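[93] Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning cs.CL

Fengjun Pan, Anh Tuan Luu, Xiaobao Wu

TL;DR: The paper proposes U-CoT+, a framework for efficient, flexible, and explainable harmful meme detection. It converts memes into textual descriptions and applies human-guided zero-shot CoT prompting, achieving low resource consumption and high adaptability.

Details

Motivation: Current harmful meme detection methods fall short in resource efficiency, flexibility, or explainability, which limits their deployment in content moderation systems.

Result: Experiments on seven benchmark datasets validate the framework and show its low resource requirements and explainability with small-scale LLMs.

Insight: Decoupling visual content into textual descriptions avoids direct reasoning over complex raw visual data, enabling resource-efficient and flexible detection.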

Abstract: Detecting harmful memes is essential for maintaining the integrity of online environments. However, current approaches often struggle with resource efficiency, flexibility, or explainability, limiting their practical deployment in content moderation systems. To address these challenges, we introduce U-CoT+, a novel framework for harmful meme detection. Instead of relying solely on prompting or fine-tuning multimodal models, we first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content and enabling resource-efficient harmful meme detection with general large language models (LLMs). Building on these textual descriptions, we further incorporate targeted, interpretable human-crafted guidelines to guide models’ reasoning under zero-shot CoT prompting. As such, this framework allows for easy adaptation to different harmfulness detection criteria across platforms, regions, and over time, offering high flexibility and explainability. Extensive experiments on seven benchmark datasets validate the effectiveness of our framework, highlighting its potential for explainable and low-resource harmful meme detection using small-scale LLMs. Codes and data are available at: https://anonymous.4open.science/r/HMC-AF2B/README.md.

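[94] CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations cs.CL

Divyaksh Shukla, Ritesh Baviskar, Dwijesh Gohil, Aniket Tiwari, Atul Shree

TL;DR: This paper introduces CoMuMDR, a multi-modal, multi-domain, Hindi-English code-mixed conversational corpus for discourse parsing. It fills a gap left by existing datasets, which cover only written English dialogues in a single domain.

Details

Motivation: Existing conversational discourse parsing datasets are limited to a single domain and to English only, so they cannot cover realistic multi-modal, multi-domain, code-mixed settings. A more comprehensive dataset is needed.

Result: SoTA models perform poorly on CoMuMDR, highlighting the difficulty of multi-domain code-mixed data and the need for better models.

Insight: Multi-modal, multi-domain, code-mixed data poses new challenges for discourse parsing; future work needs more robust models for these settings.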

Abstract: Discourse parsing is an important task useful for NLU applications such as summarization, machine comprehension, and emotion recognition. The current discourse parsing datasets based on conversations consist of written English dialogues restricted to a single domain. In this resource paper, we introduce CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations. The corpus (code-mixed in Hindi and English) has both audio and transcribed text and is annotated with nine discourse relations. We experiment with various SoTA baseline models; the poor performance of SoTA models highlights the challenges of a multi-domain code-mixed corpus, pointing towards the need for developing better models for such realistic settings.

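[95] Efficient Post-Training Refinement of Latent Reasoning in Large Language Models cs.CL | cs.AI

Xinyuan Wang, Dongjie Wang, Wangyang Ying, Haoyue Bai, Nanxu Gong

TL;DR: The paper proposes a lightweight post-training framework that refines the latent reasoning trajectories of large language models through contrastive reasoning feedback and residual embedding refinement, improving performance on reasoning tasks.

Details

Motivation: Chain-of-Thought prompting improves reasoning through intermediate steps, but it incurs high computational cost and follows a fixed reasoning trajectory that cannot be refined step by step. Latent reasoning addresses these issues, but efficiently updating reasoning embeddings during post-training remains a challenge.

Result: The framework performs well on five reasoning benchmarks, including a 5% accuracy gain on MathQA.

Insight: Post-training refinement of latent reasoning can improve model performance without additional training.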

Abstract: Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from significant token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model’s latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, the framework achieves a 5% accuracy gain on MathQA without additional training.

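[96] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval cs.CL

Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim

TL;DR: RAISE is a framework that enhances scientific reasoning through step-by-step retrieval. It proceeds in three steps (problem decomposition, logical query generation, and logical retrieval) and outperforms other baselines.

Details

Motivation: Scientific reasoning requires long-chain reasoning, knowledge of domain-specific terminology, and adaptation to updated findings. Existing methods are limited in logical relevance and in adapting to domain knowledge.

Result: RAISE outperforms other baselines on scientific reasoning benchmarks, and the documents it retrieves are more logically relevant.

Insight: Retrieving step by step and targeting logical relevance, rather than only domain similarity, improves scientific reasoning.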

Abstract: Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from an in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE retrieves documents that are not only similar in domain knowledge but also more logically relevant.

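[97] MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models cs.CL

Son The Nguyen, Theja Tulabandhula

TL;DR: MEMETRON is a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem and optimizes responses with hybrid metaheuristic algorithms. It needs no model retraining or gradient access and improves task performance.

Details

Motivation: Current LLM decoding strategies (such as greedy search or sampling) do not explicitly optimize for task-specific objectives, which limits control. MEMETRON aims to optimize model outputs at test time with metaheuristic algorithms.

Result: On the human preference alignment task, MEMETRON significantly outperforms standard decoding and reranking methods, showing its potential to improve alignment without retraining.

Insight: Optimizing decoding with metaheuristic algorithms can improve task performance without modifying model parameters, offering a new direction for inference-time optimization of LLMs.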

Abstract: Large language models (LLMs) are increasingly used for both open-ended and structured tasks, yet their inference-time behavior is still largely dictated by heuristic decoding strategies such as greedy search, sampling, or reranking. These methods provide limited control and do not explicitly optimize for task-specific objectives. We introduce MEMETRON, a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem. MEMETRON leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the response space, guided by reward models and contextual operations performed by the LLM itself. This approach enables efficient discovery of high-reward responses without requiring model retraining or gradient access. The framework is modular and generalizes across diverse tasks, requiring only a reward function and lightweight prompt templates. We evaluate our framework on the critical human preference alignment task and demonstrate that it significantly outperforms standard decoding and reranking methods, highlighting its potential to improve alignment without model retraining.
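
To make "decoding as discrete black-box optimization" concrete, here is a minimal sketch of a genetic search over candidate responses that relies only on a reward function. It is a generic genetic algorithm, not the paper's GENETRON or ANNETRON. In the paper the variation operators are performed by the LLM itself; here `toy_mutate` and `toy_crossover` are hypothetical string-level stand-ins, and `toy_reward` stands in for a reward model.

```python
import random
from typing import Callable, List


def genetic_search(
    seeds: List[str],
    reward: Callable[[str], float],
    mutate: Callable[[str, random.Random], str],
    crossover: Callable[[str, str, random.Random], str],
    generations: int = 10,
    population_size: int = 8,
    seed: int = 0,
) -> str:
    """Evolve candidate responses; only reward(text) is needed, no gradients."""
    if len(seeds) < 2:
        raise ValueError("need at least two seed responses")
    rng = random.Random(seed)
    population = list(seeds)
    n_elite = max(2, population_size // 2)
    for _ in range(generations):
        # Elitism: the best candidates survive unchanged into the next round.
        population.sort(key=reward, reverse=True)
        elite = population[:n_elite]
        children = []
        while len(elite) + len(children) < population_size:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = elite + children
    return max(population, key=reward)


# Toy stand-ins (assumptions for illustration, not part of MEMETRON).
KEYWORDS = {"refund", "sorry", "today"}
VOCAB = ["refund", "sorry", "today", "maybe", "later"]


def toy_reward(text: str) -> float:
    """Count distinct target keywords present in the response."""
    return float(len(KEYWORDS & set(text.split())))


def toy_mutate(text: str, rng: random.Random) -> str:
    """Append one random vocabulary word."""
    return (text + " " + rng.choice(VOCAB)).strip()


def toy_crossover(a: str, b: str, rng: random.Random) -> str:
    """Join the first half of one parent with the second half of the other."""
    words_a, words_b = a.split(), b.split()
    return " ".join(words_a[: len(words_a) // 2] + words_b[len(words_b) // 2:])


seed_responses = ["we are sorry", "maybe later"]
best = genetic_search(seed_responses, toy_reward, toy_mutate, toy_crossover)
```

Because the elite is carried over unchanged, the returned response never scores below the best seed, which is the property that makes this kind of search safe to layer on top of ordinary sampling.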

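[98] TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning cs.CL | cs.AI | cs.LG

Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin

TL;DR: TableDreamer is a progressive, weakness-guided data synthesis framework that improves the diversity and efficiency of data for table instruction tuning. It iteratively explores the input space and generates targeted data, improving the target LLM on table understanding tasks.

Details

Motivation: Existing LLM-based data synthesis methods have two problems in table instruction tuning: (1) insufficient exploration of the input space limits data diversity; (2) they ignore the target LLM's weaknesses in table understanding and blindly pursue data quantity.

Result: On 10 tabular benchmarks, only 27K GPT-4o synthetic examples raise the average accuracy of Llama3.1-8B-instruct from 49.07% to 60.69%, outperforming state-of-the-art baselines that use more data.

Insight: Diversity and targeting (at model weaknesses) in data synthesis are more effective than simply increasing data quantity.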

Abstract: Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62 percentage points (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines that use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer

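[99] RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling cs.CL

Yang Liu, Jiaqi Li, Zilong Zheng

TL;DR: RuleReasoner is a reinforcement-learning method for rule-based reasoning that uses domain-aware dynamic sampling to strengthen small reasoning models. It outperforms frontier large reasoning models and is more computationally efficient than prior dynamic sampling methods.

Details

Motivation: The diversity of rule formats, types, and complexity in real-world applications challenges rule-based reasoning. The paper asks whether small reasoning models (SRMs) can learn rule-based reasoning effectively and generalize robustly through reinforcement learning.

Result: RuleReasoner outperforms frontier LRMs such as OpenAI-o1 on both in-distribution (ID) and out-of-distribution (OOD) tasks, by an average of 4.1% (ID) and 10.4% (OOD), with higher computational efficiency.

Insight: With reinforcement learning and dynamic sampling, small models can perform well on complex rule-based reasoning, which challenges the reliance on large models and suggests a route to efficient reasoning.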

Abstract: Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% average points on eight ID tasks and $\Delta$10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
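
The abstract says each training batch is resampled by updating per-domain sampling weights from historical rewards, without giving the update rule. The sketch below is one plausible reading, stated as an assumption: keep an exponential moving average of reward per domain and sample under-performing domains more often. The class name, `momentum`, and `floor` are hypothetical and not taken from the paper.

```python
import random
from typing import Dict, List, Sequence


class DomainSampler:
    """Reward-driven domain sampling (illustrative assumption, not the paper's rule)."""

    def __init__(self, domains: Sequence[str], momentum: float = 0.9, floor: float = 0.05):
        self.avg_reward: Dict[str, float] = {d: 0.0 for d in domains}
        self.momentum = momentum
        self.floor = floor  # keeps every domain sampled at least occasionally

    def update(self, domain: str, reward: float) -> None:
        """Fold a new reward in [0, 1] into the domain's running average."""
        old = self.avg_reward[domain]
        self.avg_reward[domain] = self.momentum * old + (1 - self.momentum) * reward

    def weights(self) -> Dict[str, float]:
        """Lower historical reward gives a higher sampling weight."""
        raw = {d: max(1.0 - r, self.floor) for d, r in self.avg_reward.items()}
        total = sum(raw.values())
        return {d: w / total for d, w in raw.items()}

    def sample_batch(self, batch_size: int, rng: random.Random) -> List[str]:
        """Draw the domain of each example in the next training batch."""
        current = self.weights()
        domains = list(current)
        return rng.choices(domains, weights=[current[d] for d in domains], k=batch_size)


sampler = DomainSampler(["logic", "legal"])
for _ in range(20):
    sampler.update("logic", 1.0)  # the model already does well here
    sampler.update("legal", 0.2)  # the model still struggles here
batch = sampler.sample_batch(16, random.Random(0))
```

A schedule of this kind adapts online, which is the stated motivation for avoiding hand-tuned mix-training recipes.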

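[100] Brevity is the soul of sustainability: Characterizing LLM response lengths cs.CL | cs.CY

Soham Poddar, Paramita Koley, Janardan Misra, Sanjay Podder, Navveen Balani

TL;DR: This paper studies overly long responses in LLM inference and their effect on energy efficiency, and explores prompt-engineering strategies that shorten responses and save energy.

Details

Motivation: LLM inference consumes a large amount of energy, yet output compression remains underexplored. The paper aims to improve energy efficiency by reducing unnecessary response length.

Result: Appropriate prompts achieve energy savings of 25-60% by reducing response length while preserving response quality.

Insight: LLM responses contain a lot of redundant information. Simple prompt engineering can improve energy efficiency, which matters for sustainable AI.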

Abstract: A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy optimization of 25-60% by reducing the response length while preserving the quality of LLM responses.

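[101] ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts cs.CL | cs.CV

Ruiran Su, Jiasheng Si, Zhijiang Guo, Janet B. Pierrehumbert

TL;DR: The paper presents ClimateViz, the first large-scale benchmark for fact-checking with scientific charts. It contains a large set of annotated charts with linked claims and evaluates current multimodal models.

Details

Motivation: Scientific fact-checking has focused mainly on text and tables, overlooking scientific charts, which are a key means of presenting quantitative evidence and statistical reasoning.

Result: Current models struggle with chart-based reasoning: the best systems reach only 76.2-77.8% accuracy, well below human performance (89.3-92.7%). Explanation-augmented outputs help some models.

Insight: Fact-checking over scientific charts is challenging, and current multimodal models have room to improve, especially on complex statistical reasoning.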

Abstract: Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are key for presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking using expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes structured knowledge graph explanations covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, including both proprietary and open-source systems, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We released our dataset and code alongside the paper.

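[102] Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure cs.CL | cs.SE

Fariz Ikhwantri, Dusica Marijan

TL;DR: The paper proposes EXCLAIM, a compliance detection approach based on natural language inference (NLI) that uses multi-hop reasoning for explainable and traceable compliance detection, and generates assurance cases with large language models to address data scarcity.

Details

Motivation: Compliance detection for complex systems faces complicated legal and technical texts, the need for model explanations, and scarce assurance case data.

Result: A case study shows that assurance cases generated from GDPR requirements are effective in a multi-hop inference task, supporting the potential of NLI for automating compliance detection.

Insight: NLI with multi-hop reasoning offers an explainable, automated approach to complex compliance detection, and LLMs show promise where data is scarce.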

Abstract: Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.

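[103] Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition cs.CL | cs.SD | eess.AS

Mehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo

TL;DR: This paper proposes a language-aware multi-teacher knowledge distillation method for multilingual speech emotion recognition, distilling knowledge from several monolingual teacher models into one multilingual student model. The method targets English, Finnish, and French and improves recall.

Details

Motivation: Despite progress in monolingual speech emotion recognition (SER), extending it to multilingual systems remains challenging. The goal is a single model that handles multilingual SER, improving human-computer interaction.

Result: The method reaches a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines.

Insight: The method is strongest at recognizing sad and neutral emotions but still struggles with anger and happiness, reflecting the complexity of multilingual SER.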

Abstract: Speech Emotion Recognition (SER) is crucial for improving human-computer interaction. Despite strides in monolingual SER, extending them to build a multilingual system remains challenging. Our goal is to train a single model capable of multilingual SER by distilling knowledge from multiple teacher models. To address this, we introduce a novel language-aware multi-teacher knowledge distillation method to advance SER in English, Finnish, and French. It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and then distills their knowledge into a single multilingual student model. The student model demonstrates state-of-the-art performance, with a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines. Our method excels in improving recall for sad and neutral emotions, although it still faces challenges in recognizing anger and happiness.
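
The abstract does not spell out the distillation objective. As a rough sketch of what "language-aware" could mean, the snippet below routes each utterance to the teacher for its language and measures the KL divergence between the temperature-softened teacher and student distributions. The routing rule, the temperature, and all names are assumptions for illustration, not the authors' loss.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def language_aware_kd_loss(
    student_logits: Sequence[float],
    teacher_logits_by_lang: Dict[str, Sequence[float]],
    language: str,
    temperature: float = 2.0,
) -> float:
    """Distill only from the teacher that matches the utterance language."""
    teacher = softmax(teacher_logits_by_lang[language], temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


# Hypothetical emotion logits from two monolingual teachers and one student.
teachers = {"en": [2.0, 0.5, -1.0], "fi": [-1.0, 0.5, 2.0]}
student = [2.0, 0.5, -1.0]
loss_en = language_aware_kd_loss(student, teachers, "en")
loss_fi = language_aware_kd_loss(student, teachers, "fi")
```

In a real training loop a term like this would typically be combined with the usual cross-entropy on ground-truth emotion labels.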

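[104] AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP cs.CL

Ahmed Hasanaath, Aisha Alansari, Ahmed Ashraf, Chafik Salmane, Hamzah Luqman

TL;DR: This paper benchmarks reasoning-focused large language models (LLMs), especially DeepSeek models, on Arabic NLP. It runs zero-shot, few-shot, and fine-tuning experiments across fifteen Arabic NLP tasks and reports key findings such as the large gains from a few in-context examples.

Details

Motivation: Arabic has rich morphology, diverse dialects, and a complex script, and LLM performance on it remains underexplored. The paper fills this gap by evaluating reasoning-focused LLMs on Arabic NLP tasks.

Result: (1) Three in-context examples yield an average gain of over 13 F1 points on classification tasks; (2) DeepSeek models outperform GPT o4-mini by an average of 12 F1 points on complex inference tasks in the zero-shot setting; (3) LoRA fine-tuning adds up to 8 points in F1 and BLEU.

Insight: The complexity of Arabic NLP tasks places higher demands on LLMs, but careful in-context example selection and fine-tuning can substantially improve performance.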

Abstract: Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299

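[105] The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation cs.CL | cs.AI

Francisco Vargas, Alejandro González Coene, Gaston Escalante, Exequiel Lobón, Manuel Pulido

TL;DR: This paper studies how fine-tuning LLaMA models reduces hallucinations in named entity extraction, and compares several methods for extracting traffic accident information from legal documents.

Details

Motivation: Extracting traffic accident information (such as disability percentages and compensation amounts) from legal documents is crucial for quantifying insurance costs, but the subtle arguments and reasoning in court decisions make accurate extraction hard even for experts.

Result: Fine-tuned LLaMA-2 70B achieves the highest accuracy among open-source models (79.4%), surpassing its base version (61.7%). The base LLaMA-3 8B model (76.6%) performs comparably to the fine-tuned LLaMA-2 70B, and GPT-4 Turbo achieves the highest accuracy overall (86.1%).

Insight: Fine-tuning substantially reduces hallucinations in LLM-based named entity extraction. The base LLaMA-3 8B already approaches the fine-tuned LLaMA-2 70B, reflecting rapid progress in model development, and GPT-4 Turbo is the most accurate model tested.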

Abstract: The extraction of information about traffic accidents from legal documents is crucial for quantifying insurance company costs. Extracting entities such as percentages of physical and/or psychological disability and the involved compensation amounts is a challenging process, even for experts, due to the subtle arguments and reasoning in the court decision. A two-step procedure is proposed: first, segmenting the document and identifying the most relevant segments, and then extracting the entities. For text segmentation, two methodologies are compared: a classic method based on regular expressions and a second approach that divides the document into blocks of n tokens, which are then vectorized using multilingual models for semantic searches (text-embedding-ada-002/MiniLM-L12-v2). Subsequently, large language models (LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to the selected segments for entity extraction. For the LLaMA models, fine-tuning is performed using LoRA. LLaMA-2 7b, even with zero temperature, shows a significant number of hallucinations in extractions, which are an important contention point for named entity extraction. This work shows that these hallucinations are substantially reduced after fine-tuning the model. The performance of the methodology based on segment vectorization and subsequent use of LLMs significantly surpasses the classic method, which achieves an accuracy of 39.5%. Among open-source models, LLaMA-2 70B with fine-tuning achieves the highest accuracy (79.4%), surpassing its base version (61.7%). Notably, the base LLaMA-3 8B model already performs comparably to the fine-tuned LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model development. Meanwhile, GPT-4 Turbo achieves the highest accuracy at 86.1%.
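
The second segmentation approach (fixed-size token blocks, vectorized and ranked by semantic similarity to a query) can be sketched as follows. To stay self-contained, the sketch swaps in a toy bag-of-words vector where the paper uses embedding models such as text-embedding-ada-002 or MiniLM-L12-v2, and whitespace splitting where a real tokenizer would be used; the sample document and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def split_into_blocks(tokens: List, block_size: int) -> List[List]:
    """Cut a token list into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def embed(tokens: List[str]) -> Counter:
    """Toy stand-in for an embedding model: lower-cased bag-of-words counts."""
    return Counter(token.lower() for token in tokens)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_blocks(document: str, query: str, block_size: int = 8, k: int = 1) -> List[str]:
    """Return the k blocks most similar to the query, best first."""
    blocks = split_into_blocks(document.split(), block_size)
    query_vec = embed(query.split())
    ranked = sorted(blocks, key=lambda block: cosine(embed(block), query_vec), reverse=True)
    return [" ".join(block) for block in ranked[:k]]


# Made-up document for illustration only.
document = (
    "the court heard both parties on monday morning "
    "the expert assessed a physical disability of fifteen "
    "percent for the plaintiff after the accident"
)
selected = top_blocks(document, "physical disability percent")
```

The selected blocks would then be passed to the LLM prompt for entity extraction, which is the second step of the procedure described in the abstract.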

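[106] PropMEND: Hypernetworks for Knowledge Propagation in LLMs cs.CL | cs.AI | cs.LG

Zeyu Leo Liu, Greg Durrett, Eunsol Choi

TL;DR: PropMEND is a hypernetwork-based method for knowledge propagation in LLMs. It meta-learns how to modify gradients so that injected knowledge supports reasoning, and it performs well on multi-hop questions, though generalization to unseen relations leaves room for improvement.

Details

Motivation: Existing knowledge editing techniques can inject knowledge into LLMs but perform poorly on multi-hop questions that require reasoning with it, so a method that propagates knowledge effectively is needed.

Result: PropMEND achieves almost 2x accuracy on multi-hop questions in the RippleEdit dataset. It still outperforms existing approaches on unseen entity-relation pairs, but by a substantially smaller margin.

Insight: Generalizing knowledge propagation to unseen relations needs further study; future work can focus on propagating knowledge across a wider range of relations.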

Abstract: Knowledge editing techniques for large language models (LLMs) can inject knowledge that is later reproducible verbatim, but they fall short on propagating that knowledge: models cannot answer questions that require reasoning with the injected knowledge. We present a hypernetwork-based approach for knowledge propagation, named PropMEND, where we meta-learn how to modify gradients of a language modeling loss to encourage injected information to propagate. Our approach extends the meta-objective of MEND [29] so that gradient updates on knowledge are transformed to enable answering multi-hop questions involving that knowledge. We show improved performance on the RippleEdit dataset, showing almost 2x accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We further introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork, testing knowledge propagation along relations and entities unseen during hypernetwork training. PropMEND still outperforms existing approaches in unseen entity-relation pairs, yet the performance gap decreases substantially, suggesting future work in propagating knowledge to a wide range of relations.

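[107] Can A Gamer Train A Mathematical Reasoning Model? cs.CL | cs.AI | cs.LG

Andrew Shin

TL;DR: The paper shows that a single ordinary gaming GPU (an RTX 3080 Ti) can train a capable mathematical reasoning model by combining reinforcement learning with memory optimization techniques, challenging the view that high-performance AI research requires large-scale infrastructure.

Details

Motivation: Large language models (LLMs) perform well on tasks such as mathematical reasoning, but training them usually requires expensive compute. The paper explores how to train a strong model in a resource-constrained setting such as an ordinary gaming GPU.

Result: On mathematical reasoning benchmarks, the 1.5B-parameter model performs comparably to or better than models several times larger.

Insight: Optimization techniques and algorithms can overcome resource limits in high-performance AI research, opening it to a wider range of researchers.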

Abstract: While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B parameter mathematical reasoning model on an RTX 3080 Ti with 16GB of memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. Code is available at https://github.com/shinandrew/YouronMath.

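[108] FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation cs.CL

Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li

TL;DR: FaithfulRAG is a framework that explicitly models conflicts between an LLM's parametric knowledge and the retrieved context to produce more faithful generation.

Details

Motivation: Retrieval-augmented LLMs produce unfaithful outputs on knowledge-intensive tasks, especially when parametric knowledge conflicts with the retrieved context. Existing methods enforce context adherence by suppressing the model's internal knowledge, which increases the risk of misinterpreting the context.

Result: Experiments show that FaithfulRAG outperforms state-of-the-art methods.

Insight: Explicitly modeling knowledge conflicts and adding a reasoning step can balance an LLM's internal knowledge against the external context and improve generation quality.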

Abstract: Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG

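[109] Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions cs.CL | cs.AI

Clara Lachenmaier, Judith Sieker, Sina Zarrieß

TL;DR: This paper studies how large language models (LLMs) handle direct questions and loaded questions in the political domain, and finds that LLMs face significant challenges in correcting false user beliefs and actively establishing common ground.

Details

Motivation: Human conversation relies on grounding, but how LLMs behave in this respect is understudied. The political domain carries a high risk of misinformation, so it is important to study whether LLMs can handle such questions.

Result: LLMs struggle to engage in grounding and to reject false user beliefs; in the political domain their answers are easily influenced by loaded questions.

Insight: LLMs may amplify misinformation in political contexts, which underlines the need to improve how models handle common ground.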

Abstract: Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine the ability of LLMs to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.

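[110] Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers cs.CL | cs.LG | cs.NE

Marek Kadlčík, Michal Štefánik, Timothee Mickus, Michal Spiegel, Josef Kuchař

TL;DR: Pre-trained language models (LMs) represent numbers accurately, but earlier probing methods missed the sinusoidal patterns in their embeddings. A new probing technique decodes numeric values with near-perfect accuracy, showing that LMs' number representations are precise. The precision of the embeddings also explains a large portion of LMs' errors in elementary arithmetic, and aligning the embeddings with the discovered pattern mitigates these errors.

Details

Motivation: Prior work attributes LMs' arithmetic errors to distributionally learned embeddings being unable to represent exact quantities. This paper observes that previous probes did not capture the sinusoidal structure of the embeddings and so may have underestimated the models' number representations.

Result: The new probe decodes numeric values with near-perfect accuracy across several open-source LMs. Embedding precision, as measured by probe accuracy, explains a large portion of arithmetic errors, and aligning embeddings with the discovered pattern reduces them.

Insight: LM embeddings may encode numbers more precisely than previously thought, which conventional probes failed to reveal. The structure of the embeddings matters for arithmetic ability.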

Abstract: Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models’ representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the embeddings’ precision, as judged by our probe’s accuracy, explains a large portion of LMs’ errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.

[111] Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System cs.CL

Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou

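TL;DR: The paper introduces the UI-NEXUS benchmark and the AGENT-NEXUS scheduling system to evaluate and improve mobile agents on compositional tasks, addressing generalization from atomic tasks to compositional tasks.

Details

Motivation: Existing mobile agents mainly handle atomic tasks, but real-world applications require compositional tasks, so a new benchmark and scheduling system are needed to address this generalization problem.

Result: AGENT-NEXUS raises task success rates by 24% to 40% without significantly hurting inference efficiency.

Insight: Existing agents perform poorly on compositional tasks, and dynamic decomposition is an effective way to improve generalization.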

Abstract: Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks – such as short-chain execution tasks and single-screen grounding tasks – while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly increasing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io.

[112] FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents cs.CL

Satu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi

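TL;DR: The paper introduces the FROST-EMA dataset, which contains speech from 18 bilingual speakers in their native language (L1), their second language (L2), and an imitated L2 accent (fake foreign accent), and presents two preliminary studies on how these language varieties affect speech technology and articulatory behavior.

Details

Motivation: The work aims to fill the gap in multilingual speech datasets with articulatory measurements (such as electromagnetic articulography) and to explore potential uses of language varieties (such as L2 and fake accents) in speech technology and articulation research.

Result: L2 and fake accents may affect the performance of speaker verification systems, and articulatory patterns differ across language varieties.

Insight: The study reveals the complex effects of language varieties on speech technology and articulatory behavior, and offers new data and analytical perspectives for multilingual speech research.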

Abstract: We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. It consists of 18 bilingual speakers, who produced speech in their native language (L1), second language (L2), and imitated L2 (fake foreign accent). The new corpus enables research into language variability from phonetic and technological points of view. Accordingly, we include two preliminary case studies to demonstrate both perspectives. The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent.

[113] Learning to Reason Across Parallel Samples for LLM Reasoning cs.CL

Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi

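TL;DR: The paper proposes SSA (Sample Set Aggregator), a method that trains a small LLM to aggregate multiple reasoning samples and thereby improves performance on reasoning tasks such as mathematics. Experiments show that SSA outperforms other test-time scaling methods and generalizes well.

Details

Motivation: Existing methods use multiple test-time samples to improve LLM performance through sample voting or verifier ranking, but no model is trained specifically to optimize the aggregation of a sample set. This paper aims to design a more effective aggregation method.

Result: Across multiple reasoning datasets, SSA outperforms methods such as reward-model-based re-ranking and generalizes across sample set sizes, model families, and tasks.

Insight: Decoupling the LLM that generates samples from the LLM that aggregates them improves reasoning performance and adds flexibility, and it works with the outputs of black-box models.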

Abstract: Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., either through majority voting or using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such sets of multiple samples. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows a promising generalization ability across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLMs that analyze and aggregate sampled answers, our approach can work with the outputs from premier black box models easily and efficiently.
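For reference, the majority-voting baseline mentioned in the abstract can be sketched in a few lines. This is only the heuristic aggregation that SSA is compared against, not SSA itself; the function name and the tie-breaking rule (first answer seen wins) are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent sampled answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in sampling order so that ties are broken deterministically.
    for answer in answers:
        if counts[answer] == best:
            return answer

# Hypothetical final answers extracted from four sampled reasoning chains.
sampled = ["42", "41", "42", "7"]
final_answer = majority_vote(sampled)
```

A learned aggregator differs from this baseline in that it can read the full reasoning text of each sample rather than only counting final answers.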

[114] Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning cs.CL | cs.AI | cs.LG

Haozhen Zhang, Tao Feng, Jiaxuan You

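TL;DR: The paper proposes Router-R1, a reinforcement-learning-based framework for multi-round routing and aggregation that optimizes LLM routing by dynamically invoking multiple LLMs and integrating their responses, improving both performance and cost management on complex tasks.

Details

Motivation: Existing LLM routers usually perform only a single-round, one-to-one mapping and cannot exploit the complementary strengths of multiple LLMs on complex tasks.

Result: On seven general and multi-hop QA benchmarks, Router-R1 outperforms several strong baselines and achieves better performance and cost management.

Insight: Optimizing an LLM routing system with reinforcement learning can markedly improve its ability to solve complex tasks while balancing performance and cost.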

Abstract: The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave “think” actions (internal deliberation) with “route” actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for performance and cost trade-off optimization, opening a pathway toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.

[115] Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs cs.CL | 68T5 | I.2.7

Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov

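TL;DR: The paper studies the performance gap between visual and textual tasks in vision-language models (VLMs) and finds that the circuits for the two modalities are functionally similar but structurally separate. Patching the later-layer representations of visual data back into earlier layers closes a third of the gap between modalities.

Details

Motivation: Vision-language models usually perform worse on a visual task than on the textual version of the same task; the paper seeks the cause of this gap and a way to reduce it.

Result: Across multiple tasks and models, the method closes a third of the performance gap between modalities on average.

Insight: Visual representations become aligned with textual representations only in later layers, which causes the performance gap; intervening on when this alignment happens improves performance.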

Abstract: Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.

q-bio.BM [Back]

[116] Aligning Proteins and Language: A Foundation Model for Protein Retrieval q-bio.BM | cs.AI | cs.CE | cs.CV | cs.LG

Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara

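TL;DR: The paper proposes a CLIP-style contrastive learning framework that aligns 3D protein structures with functional annotations, enabling semantic retrieval of protein structures.

Details

Motivation: Inspired by progress in vision-language models (VLMs), the work aims to advance protein structure-function understanding with a multimodal foundation model.

Result: The model shows promising zero-shot retrieval performance on the PDB and EMDB datasets.

Insight: Multimodal foundation models show potential for understanding protein structure and function and could be extended to other biomolecular domains in the future.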

Abstract: This paper aims to retrieve proteins with similar structures and semantics from large-scale protein datasets, facilitating the functional interpretation of protein structures derived by structural determination methods like cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of vision-language models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations using contrastive learning. For model training, we propose a large-scale dataset of approximately 200,000 protein-caption pairs with rich functional descriptors. We evaluate our model in both in-domain and more challenging cross-database retrieval on the Protein Data Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In both cases, our approach demonstrates promising zero-shot retrieval performance, highlighting the potential of multimodal foundation models for structure-function understanding in protein biology.
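A CLIP-style objective is usually a symmetric contrastive loss over a batch of paired embeddings. The sketch below shows that generic loss in NumPy under stated assumptions: the encoders are omitted, `struct_emb` and `text_emb` are hypothetical batches in which row i of each belongs to the same protein, and the temperature of 0.07 is a common default rather than a value reported in the abstract.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(struct_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    s = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature            # cosine similarities, scaled
    idx = np.arange(len(logits))
    # Structure-to-text: each row should pick its own column.
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-structure: each column should pick its own row.
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)

# Toy check: matched pairs give a much lower loss than mismatched pairs.
aligned = np.eye(4)
loss_matched = clip_style_loss(aligned, aligned)
loss_mismatched = clip_style_loss(aligned, aligned[::-1])
```

At retrieval time the same normalized embeddings can be ranked by cosine similarity, which is what makes zero-shot retrieval possible without task-specific training.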

cs.GR [Back]

[117] A Real-time 3D Desktop Display cs.GR | cs.CV

Livio Tenze, Enrique Canessa

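TL;DR: The paper presents an extended version of the altiro3D C++ library that processes 2D images or video streams in real time, synthesizes light fields, and delivers a 3D experience. Its core method uses the MiDaS CNN to extract a depth map from a single 2D image and applies AI techniques to improve performance.

Details

Motivation: Conventional 3D display technology requires complex hardware support; this work aims to achieve real-time 3D display in software, supporting several input sources including regions of the desktop screen.

Result: The library renders 2D images, video streams, and desktop applications into 3D in real time and can output directly to a light-field 3D device.

Insight: Combining software with deep learning simplifies 3D display and broadens the potential use of 3D technology on ordinary devices.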

Abstract: A new extended version of the altiro3D C++ Library – initially developed to get glass-free holographic displays starting from 2D images – is here introduced, aiming to deal with 3D video streams from either 2D webcam images or flat video files. These streams are processed in real-time to synthesize light-fields (in Native format) and feed realistic 3D experiences. The core function needed to recreate multiviews relies on the MiDaS Convolutional Neural Network (CNN), which extracts a depth map from a single 2D image. Artificial Intelligence (AI) computing techniques are applied to improve the overall performance of the extended altiro3D Library. Thus, altiro3D can now treat standard images, video streams or screen portions of a Desktop where other apps may also be running (like web browsers, video chats, etc.) and render them into 3D. To achieve the latter, a screen region needs to be selected in order to feed the output directly into a light-field 3D device such as Looking Glass (LG) Portrait. In order to simplify the acquisition of a Desktop screen area by the user, a multi-platform Graphical User Interface has also been implemented. Sources available at: https://github.com/canessae/altiro3D/releases/tag/2.0.0

[118] Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos cs.GR | cs.CV

Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva

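TL;DR: The paper proposes a coarse-to-fine framework for reconstructing articulated objects from dynamic RGBD videos, addressing the challenges caused by interaction and scene changes, and significantly outperforms existing methods on synthetic and real datasets.

Details

Motivation: Articulated objects are common in daily life and in robotics, but existing methods need carefully captured data, which limits their scalability and generalization in practice. The work aims to reconstruct articulated objects from RGBD videos captured casually with a hand-held device.

Result: Experiments on synthetic and real datasets show that the method significantly outperforms existing methods and reconstructs articulated objects across categories.

Insight: Casually captured RGBD videos are a more practical data source for articulated object reconstruction, but simultaneous object and camera motion and occlusions must be handled. A coarse-to-fine framework is an effective solution.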

Abstract: Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20× larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.

[119] Fine-Grained Spatially Varying Material Selection in Images cs.GR | cs.CV

Julia Guerrero-Viu, Michael Fischer, Iliyan Georgiev, Elena Garces, Diego Gutierrez

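TL;DR: The paper proposes a fine-grained, spatially varying material selection method based on vision transformers (ViT) that is robust to lighting and reflectance variations and supports selection at two levels: texture and subtexture.

Details

Motivation: In conventional image editing, material selection is hampered by lighting and reflectance variations and lacks fine-grained control. The paper proposes a more robust and finer material selection method to address this.

Result: The method is robust to lighting and reflectance variations and yields finer and more stable selections than existing methods.

Insight: ViT features processed at multiple resolutions are strong enough to cope with the complex material and lighting variations encountered in image editing.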

Abstract: Selection is the first step in many image editing processes, enabling faster and simpler modifications of all pixels sharing a common modality. In this work, we present a method for material selection in images, robust to lighting and reflectance variations, which can be used for downstream editing tasks. We rely on vision transformer (ViT) models and leverage their features for selection, proposing a multi-resolution processing strategy that yields finer and more stable selection results than prior methods. Furthermore, we enable selection at two levels: texture and subtexture, leveraging a new two-level material selection (DuMaS) dataset which includes dense annotations for over 800,000 synthetic images, both on the texture and subtexture levels.

cs.AI [Back]

[120] A Survey on Large Language Models for Mathematical Reasoning cs.AI | cs.CL

Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Yi-Di Wang, Shu Yan

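TL;DR: This survey summarizes the development of large language models (LLMs) for mathematical reasoning, organized into a comprehension phase and an answer generation phase, and discusses methods for improving reasoning ability along with the remaining challenges.

Details

Motivation: Mathematical reasoning is a major challenge in artificial intelligence research. LLMs have made notable progress in this area in recent years, but challenges remain in capacity, efficiency, and generalization.

Result: Despite the progress, LLMs remain limited in capacity, efficiency, and generalization for mathematical reasoning.

Insight: Future research directions include improved pretraining and knowledge augmentation techniques, formal reasoning frameworks, and meta-generalization through principled learning paradigms.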

Abstract: Mathematical reasoning has long represented one of the most fundamental and challenging frontiers in artificial intelligence research. In recent years, large language models (LLMs) have achieved significant advances in this area. This survey examines the development of mathematical reasoning abilities in LLMs through two high-level cognitive phases: comprehension, where models gain mathematical understanding via diverse pretraining strategies, and answer generation, which has progressed from direct prediction to step-by-step Chain-of-Thought (CoT) reasoning. We review methods for enhancing mathematical reasoning, ranging from training-free prompting to fine-tuning approaches such as supervised fine-tuning and reinforcement learning, and discuss recent work on extended CoT and “test-time scaling”. Despite notable progress, fundamental challenges remain in terms of capacity, efficiency, and generalization. To address these issues, we highlight promising research directions, including advanced pretraining and knowledge augmentation techniques, formal reasoning frameworks, and meta-generalization through principled learning paradigms. This survey tries to provide some insights for researchers interested in enhancing reasoning capabilities of LLMs and for those seeking to apply these techniques to other domains.

[121] Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning cs.AI | cs.CL

Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai

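TL;DR: The paper proposes CoVo, a self-rewarding reinforcement learning framework that strengthens the reasoning of large language models by exploiting consistency across different reasoning trajectories, without external supervision.

Details

Motivation: Conventional reinforcement learning for complex reasoning tasks depends on external supervision, which limits its broader use. The paper addresses this by exploiting the consistency of the reasoning paths that lead to correct answers.

Result: Experiments show that CoVo matches or even surpasses supervised reinforcement learning on diverse reasoning benchmarks.

Insight: The intermediate states of a correct reasoning path tend to converge toward its own final answer and deviate little toward other candidate answers; this property can serve as a reward signal for self-rewarding reinforcement learning.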

Abstract: Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits its broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.

[122] Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery cs.AI | cs.CL | cs.IR | cs.LG

Yuni Susanti, Michael Färber

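TL;DR: The paper proposes a new approach that combines knowledge graphs (KGs) with large language models (LLMs) to improve knowledge-based causal discovery. It identifies informative subgraphs and refines their selection with learning-to-rank models, improving the stability and accuracy of causal inference.

Details

Motivation: Traditional causal discovery methods rely on observational data and are therefore limited, while LLM-based knowledge reasoning is flexible but produces unstable results. The paper combines KGs with LLMs to provide a more reliable framework for causal inference.

Result: On biomedical and open-domain datasets, the method outperforms most baselines, by up to 44.4 points in F1 score, across diverse LLMs and KGs.

Insight: Combining structured knowledge (KGs) with the reasoning ability of LLMs can substantially improve the stability and accuracy of causal discovery and offers a new approach to analyzing complex systems.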

Abstract: Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery – which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) – offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality

[123] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning cs.AI | cs.CV | cs.RO

Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang

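TL;DR: The paper introduces VIKI-Bench and VIKI-R. The former is a hierarchical benchmark for embodied multi-agent cooperation; the latter is a two-stage framework that combines a vision-language model with reinforcement learning and significantly improves multi-agent cooperation.

Details

Motivation: Current multi-agent cooperation methods based on vision-language models offer limited support for diverse embodied agents, so a unified testbed and method are needed to advance vision-driven multi-agent cooperation.

Result: Experiments show that VIKI-R significantly outperforms baseline methods at all task levels and that reinforcement learning fosters compositional cooperation among heterogeneous agents.

Insight: A multi-level benchmark together with a framework that combines vision-language models and reinforcement learning offers a new approach to vision-driven embodied multi-agent cooperation.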

Abstract: Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

cs.HC [Back]

[124] SakugaFlow: A Stagewise Illustration Framework Emulating the Human Drawing Process and Providing Interactive Tutoring for Novice Drawing Skills cs.HC | cs.CV | 68T05 | H.5.2; K.3; I.2.7

Kazuki Kawamura, Jun Rekimoto

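TL;DR: SakugaFlow is a four-stage AI illustration framework that emulates the human drawing process and provides real-time interactive tutoring to help novices improve their drawing skills.

Details

Motivation: Existing AI illustration tools can generate high-quality images but do not expose the step-by-step process that human artists follow, which limits what novices can learn from them.

Result: The framework turns a black-box generator into a teaching environment that supports learning and creative exploration.

Insight: By exposing intermediate steps and adding interactive tutoring, AI tools can better support skill learning.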

Abstract: While current AI illustration tools can generate high-quality images from text prompts, they rarely reveal the step-by-step procedure that human artists follow. We present SakugaFlow, a four-stage pipeline that pairs diffusion-based image generation with a large-language-model tutor. At each stage, novices receive real-time feedback on anatomy, perspective, and composition, revise any step non-linearly, and branch alternative versions. By exposing intermediate outputs and embedding pedagogical dialogue, SakugaFlow turns a black-box generator into a scaffolded learning environment that supports both creative exploration and skills acquisition.

[125] MOSAIC-F: A Framework for Enhancing Students’ Oral Presentation Skills through Personalized Feedback cs.HC | cs.AI | cs.CV

Alvaro Becerra, Daniel Andres, Pablo Villegas, Roberto Daza, Ruth Cobos

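TL;DR: The paper proposes MOSAIC-F, a multimodal feedback framework that improves students’ oral presentation skills through personalized feedback, combining human assessment with multimodal data analysis.

Details

Motivation: Traditional feedback methods often lack personalization and multi-perspective assessment, so they cannot fully reflect student performance. MOSAIC-F aims to close this gap by integrating multimodal data with AI.

Result: The framework was tested in the context of improving oral presentation skills, where it provided students with more comprehensive feedback.

Insight: Combining multimodal data with AI can markedly increase the depth and personalization of educational feedback and offers a new approach for learning analytics.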

Abstract: In this article, we present a novel multimodal feedback framework called MOSAIC-F, an acronym for a data-driven Framework that integrates Multimodal Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI), and Collaborative assessments for generating personalized feedback on student learning activities. This framework consists of four key steps. First, peers’ and professors’ assessments are conducted through standardized rubrics (that include both quantitative and qualitative evaluations). Second, multimodal data are collected during learning activities, including video recordings, audio capture, gaze tracking, physiological signals (heart rate, motion data), and behavioral interactions. Third, personalized feedback is generated using AI, synthesizing human-based evaluations and data-based multimodal insights such as posture, speech patterns, stress levels, and cognitive load, among others. Finally, students review their own performance through video recordings and engage in self-assessment and feedback visualization, comparing their own evaluations with peers’ and professors’ assessments, class averages, and AI-generated recommendations. By combining human-based and data-based evaluation techniques, this framework enables more accurate, personalized and actionable feedback. We tested MOSAIC-F in the context of improving oral presentation skills.

cs.DB [Back]

[126] RADAR: Benchmarking Language Models on Imperfect Tabular Data cs.DB | cs.CL

Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri

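TL;DR: RADAR is a benchmark for systematically evaluating how well language models reason over tabular data that contains data artifacts; it reveals that frontier models degrade when such artifacts are present.

Details

Motivation: Language models are increasingly used for data analysis, but their ability to recognize and reason about data artifacts (such as missing values and outliers) has not been studied enough, and mishandling them can invalidate conclusions in real-world analysis.

Result: Experiments show that frontier models perform well on tables without artifacts but degrade significantly once artifacts are introduced, exposing gaps in their data awareness.

Insight: RADAR is flexible and extensible, which makes it a valuable resource for advancing research on tabular reasoning, and it exposes the limits of current language models in real-world data analysis.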

Abstract: Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness – the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies – remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
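Simulating a data artifact through a programmatic perturbation can be as simple as blanking out cells in a clean table. The sketch below illustrates the general idea for the missing-value case only; it is not RADAR’s framework, and the function name, the list-of-dicts table format, and the `fraction` and `seed` parameters are assumptions made for illustration.

```python
import random

def inject_missing(rows, column, fraction, seed=0):
    """Return a copy of `rows` with `column` set to None in a random subset of rows."""
    rng = random.Random(seed)                      # seeded for reproducible perturbations
    n_missing = round(len(rows) * fraction)
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: None} if i in chosen else dict(row)
        for i, row in enumerate(rows)
    ]

# Hypothetical clean table with ten rows.
clean = [{"id": i, "heart_rate": 60 + i} for i in range(10)]
perturbed = inject_missing(clean, "heart_rate", fraction=0.3)
n_blank = sum(row["heart_rate"] is None for row in perturbed)
```

Because the clean table is kept alongside the perturbed copy, a query answered on both versions gives a direct measure of how much the artifact changes a model’s answer.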

q-fin.ST [Back]

[127] EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements q-fin.ST | cs.CE | cs.CL | cs.LG

Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa

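TL;DR: The paper introduces EDINET-Bench, an open-source Japanese financial benchmark for evaluating large language models (LLMs) on complex financial tasks such as fraud detection and earnings forecasting; the results show that applying current LLMs to finance remains challenging.

Details

Motivation: Complex tasks in financial analysis could benefit from LLMs, but the lack of challenging datasets for Japanese financial data hinders academic research and the application of LLMs in finance.

Result: Experiments show that even state-of-the-art LLMs perform only slightly better than logistic regression on tasks such as fraud detection and earnings forecasting, underscoring the challenges of financial applications.

Insight: Real-world financial applications of LLMs require domain-specific adaptation, and current technology still has significant limitations.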

Abstract: Financial analysis presents complex challenges that could leverage large language model (LLM) capabilities. However, the scarcity of challenging financial datasets, particularly for Japanese financial data, impedes academic innovation in financial analytics. As LLMs advance, this lack of accessible research resources increasingly hinders their development and evaluation in this specialized domain. To address this gap, we introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate the performance of LLMs on challenging financial tasks including accounting fraud detection, earnings forecasting, and industry prediction. EDINET-Bench is constructed by downloading annual reports from the past 10 years from Japan’s Electronic Disclosure for Investors’ NETwork (EDINET) and automatically assigning labels corresponding to each evaluation task. Our experiments reveal that even state-of-the-art LLMs struggle, performing only slightly better than logistic regression in binary classification for fraud detection and earnings forecasting. These results highlight significant challenges in applying LLMs to real-world financial applications and underscore the need for domain-specific adaptation. Our dataset, benchmark construction code, and evaluation code are publicly available to facilitate future research in finance with LLMs.

cs.IR [Back]

[128] Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval cs.IR | cs.AI | cs.CL

Abdellah Ghassel, Ian Robinson, Gabriel Tanase, Hal Cooper, Bryan Thompson

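TL;DR: The paper proposes the Hierarchical Lexical Graph (HLG), a three-tier index for multi-hop retrieval, and builds two complementary retrievers on top of it (StatementGraphRAG and TopicGraphRAG), improving retrieval recall and correctness.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods perform poorly on complex questions whose answers must be pieced together across documents, especially when the documents are semantically distant.

Result: Experiments on five datasets show an average relative improvement of 23.1% in retrieval recall and correctness over naive chunk-based RAG.

Insight: A hierarchical index combined with multi-granularity retrieval can address cross-document multi-hop retrieval, and the synthetic dataset pipeline offers a new way to evaluate complex retrieval systems.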

Abstract: Retrieval-Augmented Generation (RAG) grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG achieving an average relative improvement of 23.1% in retrieval recall and correctness. Open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.

eess.IV [Back]

[129] A System for Accurate Tracking and Video Recordings of Rodent Eye Movements using Convolutional Neural Networks for Biomedical Image Segmentation eess.IV | cs.CV

Isha Puri, David Cox

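TL;DR: The paper proposes a biomedical image segmentation method based on convolutional neural networks (CNNs) for accurately tracking rodent eye movements, addressing the fact that existing techniques ignore the distinctive characteristics of rodent eyes.

Details

Motivation: Rodents are widely used in neuroscience and vision science research, but existing eye tracking techniques mostly target human eyes and do not account for the particularities of rodent eyes (such as their small size and the abundance of surrounding hair).

Result: The method achieves high accuracy and practicality for rodent eye tracking and is presented as state-of-the-art technology.

Insight: The method gives rodent research a more accurate tool and fills a gap left by existing techniques in this area.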

Abstract: Research in neuroscience and vision science relies heavily on careful measurements of animal subjects’ gaze direction. Rodents are the most widely studied animal subjects for such research because of their economic advantage and hardiness. Recently, video-based eye trackers that use image processing techniques have become a popular option for gaze tracking because they are easy to use and are completely noninvasive. Although significant progress has been made in improving the accuracy and robustness of eye tracking algorithms, almost all of the techniques have focused on human eyes and do not account for the unique characteristics of rodent eye images, e.g., variability in eye parameters, abundance of surrounding hair, and their small size. To overcome these unique challenges, this work presents a flexible, robust, and highly accurate model for pupil and corneal reflection identification in rodent gaze determination that can be incrementally trained to account for variability in eye parameters encountered in the field. To the best of our knowledge, this is the first paper that demonstrates a highly accurate and practical biomedical image segmentation based convolutional neural network architecture for pupil and corneal reflection identification in eye images. This new method, in conjunction with our automated infrared video-based eye recording system, offers state-of-the-art technology in eye tracking for neuroscience and vision science research for rodents.

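[130] Snap-and-tune: combining deep learning and test-time optimization for high-fidelity cardiovascular volumetric meshing eess.IV | cs.CV

Daniel H. Pak, Shubh Thaker, Kyle Baylous, Xiaoran Zhang, Danny Bluestein

TL;DR: The paper proposes a snap-and-tune strategy that combines deep learning with test-time optimization for high-quality cardiovascular volumetric meshing, significantly improving spatial accuracy and mesh quality.

Details

Motivation: High-quality volumetric meshing is a key bottleneck for physics-based simulation in personalized medicine, and existing deep learning methods are limited in high-curvature regions and in inter-part distances.

Result: Spatial accuracy and mesh quality improve significantly, and the usefulness of the meshes is demonstrated through simulations in two software platforms.

Insight: Combining the speed of DL with the flexibility of optimization enables high-quality mesh generation for complex medical structures.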

Abstract: High-quality volumetric meshing from medical images is a key bottleneck for physics-based simulations in personalized medicine. For volumetric meshing of complex medical structures, recent studies have often utilized deep learning (DL)-based template deformation approaches to enable fast test-time generation with high spatial accuracy. However, these approaches still exhibit limitations, such as limited flexibility at high-curvature areas and unrealistic inter-part distances. In this study, we introduce a simple yet effective snap-and-tune strategy that sequentially applies DL and test-time optimization, which combines fast initial shape fitting with more detailed sample-specific mesh corrections. Our method provides significant improvements in both spatial accuracy and mesh quality, while being fully automated and requiring no additional training labels. Finally, we demonstrate the versatility and usefulness of our newly generated meshes via solid mechanics simulations in two different software platforms. Our code is available at https://github.com/danpak94/Deep-Cardiac-Volumetric-Mesh.

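[131] Plug-and-Play Linear Attention for Pre-trained Image and Video Restoration Models eess.IV | cs.CV

Srinivasan Kidambi, Pravin Nair

TL;DR: PnP-Nystra is a Nyström-based linear approximation of self-attention. As a plug-and-play module, it can be integrated into pre-trained image and video restoration models without retraining, substantially improving computational efficiency.

Details

Motivation: Multi-head self-attention (MHSA) has complexity quadratic in the input length, which makes it a computational bottleneck in real-time and resource-constrained settings.

Result: Experiments show that PnP-Nystra achieves 2-4x and 2-5x speed-ups on GPU and CPU, respectively, with a maximum PSNR drop of only 1.5 dB.

Insight: Linear attention can stand in for MHSA, substantially improving computational efficiency while keeping restoration quality largely stable.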

Abstract: Multi-head self-attention (MHSA) has become a core component in modern computer vision models. However, its quadratic complexity with respect to input length poses a significant computational bottleneck in real-time and resource-constrained environments. We propose PnP-Nystra, a Nyström-based linear approximation of self-attention, developed as a plug-and-play (PnP) module that can be integrated into pre-trained image and video restoration models without retraining. As a drop-in replacement for MHSA, PnP-Nystra enables efficient acceleration in various window-based transformer architectures, including SwinIR, Uformer, and RVRT. Our experiments across diverse image and video restoration tasks, including denoising, deblurring, and super-resolution, demonstrate that PnP-Nystra achieves a 2-4x speed-up on an NVIDIA RTX 4090 GPU and a 2-5x speed-up on CPU inference. Despite these significant gains, the method incurs a maximum PSNR drop of only 1.5 dB across all evaluated tasks. To the best of our knowledge, we are the first to demonstrate a linear attention functioning as a training-free substitute for MHSA in restoration models.
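
To make the idea of a Nyström-style approximation concrete, here is a minimal NumPy sketch for a single attention head. It is not the PnP-Nystra implementation: the abstract does not say how landmarks are chosen, so this sketch assumes segment-mean landmarks (a common choice in Nyström attention variants), and the function names `exact_attention` and `nystrom_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def exact_attention(q, k, v):
    # Standard scaled dot-product attention: O(n^2) in sequence length n.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def nystrom_attention(q, k, v, num_landmarks):
    # Nystrom-style approximation with m landmarks (assumption: landmarks are
    # means of contiguous token segments; the paper may choose them differently).
    n, d = q.shape
    if n % num_landmarks != 0:
        raise ValueError("sequence length must be divisible by num_landmarks")
    seg = n // num_landmarks
    q_l = q.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    k_l = k.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    scale = 1.0 / np.sqrt(d)
    f = softmax(q @ k_l.T * scale)    # (n, m): queries vs. landmark keys
    a = softmax(q_l @ k_l.T * scale)  # (m, m): landmark queries vs. landmark keys
    b = softmax(q_l @ k.T * scale)    # (m, n): landmark queries vs. all keys
    # Multiply right-to-left so no (n, n) matrix is ever formed.
    return f @ (np.linalg.pinv(a) @ (b @ v))
```

With `num_landmarks` much smaller than the sequence length, the cost grows linearly in the number of tokens. When `num_landmarks` equals the sequence length, each segment holds one token and the result coincides with exact attention up to numerical error, which is a convenient sanity check.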

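[132] Biologically Inspired Deep Learning Approaches for Fetal Ultrasound Image Classification eess.IV | cs.CV

Rinat Prochii, Elizaveta Dakhova, Pavel Birulin, Maxim Sharaev

TL;DR: The paper proposes a biologically inspired deep learning ensemble framework for classifying second-trimester fetal ultrasound images. It distinguishes 16 fetal structures simultaneously and is more lightweight than existing methods while performing competitively.

Details

Motivation: Accurate classification is challenging because fetal ultrasound images have low quality, high intra-class variability, and class imbalance. Existing methods cover only a few anatomical targets and fall short of clinical needs.

Result: Trained and evaluated on 5,298 clinical images, the model classifies 90% of organs with accuracy > 0.75 and 75% of organs with accuracy > 0.85, on par with more complex models.

Insight: Biologically inspired modular stacking is robust and scalable in complex clinical settings and offers a new direction for medical image classification.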

Abstract: Accurate classification of second-trimester fetal ultrasound images remains challenging due to low image quality, high intra-class variability, and significant class imbalance. In this work, we introduce a simple yet powerful, biologically inspired deep learning ensemble framework that, unlike prior studies focused on only a handful of anatomical targets, simultaneously distinguishes 16 fetal structures. Drawing on the hierarchical, modular organization of biological vision systems, our model stacks two complementary branches (a “shallow” path for coarse, low-resolution cues and a “detailed” path for fine, high-resolution features), concatenating their outputs for final prediction. To our knowledge, no existing method has addressed such a large number of classes with a comparably lightweight architecture. We trained and evaluated on 5,298 routinely acquired clinical images (annotated by three experts and reconciled via Dawid-Skene), reflecting real-world noise and variability rather than a “cleaned” dataset. Despite this complexity, our ensemble (EfficientNet-B0 + EfficientNet-B6 with LDAM-Focal loss) identifies 90% of organs with accuracy > 0.75 and 75% of organs with accuracy > 0.85, performance competitive with more elaborate models applied to far fewer categories. These results demonstrate that biologically inspired modular stacking can yield robust, scalable fetal anatomy recognition in challenging clinical settings.

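[133] Enhancing Synthetic CT from CBCT via Multimodal Fusion: A Study on the Impact of CBCT Quality and Alignment eess.IV | cs.CV

Maximilian Tschuchnig, Lukas Lamminger, Philipp Steininger, Michael Gadermayr

TL;DR: The paper uses multimodal learning to combine CBCT with preoperative CT, improving synthetic CT quality, with the largest gains when the volumes are well aligned and CBCT quality is low.

Details

Motivation: CBCT is fast and low-dose but suffers from artifacts, which synthetic CT (sCT) generation can mitigate. The paper further improves sCT generation through multimodal fusion, focusing on the effect of CBCT quality and alignment.

Result: Multimodal sCT outperforms unimodal baselines on real-world datasets, most markedly for well-aligned, low-quality CBCT.

Insight: CBCT quality and alignment are critical to sCT generation, and multimodal fusion is an effective way to improve sCT quality; the findings are reproducible on real-world clinical data.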

Abstract: Cone-Beam Computed Tomography (CBCT) is widely used for real-time intraoperative imaging due to its low radiation dose and high acquisition speed. However, despite its high resolution, CBCT suffers from significant artifacts and thereby lower visual quality compared to conventional Computed Tomography (CT). A recent approach to mitigate these artifacts is synthetic CT (sCT) generation, translating CBCT volumes into the CT domain. In this work, we enhance sCT generation through multimodal learning, integrating intraoperative CBCT with preoperative CT. Beyond validation on two real-world datasets, we use a versatile synthetic dataset to analyze how CBCT-CT alignment and CBCT quality affect sCT quality. The results demonstrate that multimodal sCT consistently outperforms unimodal baselines, with the most significant gains observed in well-aligned, low-quality CBCT-CT cases. Finally, we demonstrate that these findings are highly reproducible in real-world clinical datasets.

cs.SD [Back]

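[134] Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model cs.SD | cs.CL | eess.AS

Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan

TL;DR: The paper presents Step-Audio-AQAA, a fully end-to-end audio query-audio answer model that combines a dual-codebook audio tokenizer, a 130-billion-parameter LLM, and a neural vocoder to enable efficient audio interaction, and it performs particularly well on speech control.

Details

Motivation: Existing audio-language models rely on text output and cannot generate natural speech directly, which limits the fluency of audio interaction.

Result: On the StepEval-Audio-360 benchmark, the model stands out in speech control and surpasses existing LALMs in key areas.

Insight: 1. End-to-end design markedly improves the efficiency of audio interaction; 2. A token-based vocoder is critical to performance.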

Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.

cs.RO [Back]

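[135] PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly cs.RO | cs.AI | cs.CV

Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang

TL;DR: PhyBlock is a progressive benchmark that evaluates the physical understanding and planning abilities of vision-language models (VLMs) on 3D block assembly tasks, exposing their limitations in spatial reasoning and high-level planning.

Details

Motivation: Although VLMs do well on reasoning and planning tasks, their understanding of physical phenomena in structured 3D environments remains very limited. A new benchmark is needed to evaluate and improve VLMs on these tasks.

Result: Experiments show that VLMs are markedly limited in high-level planning and spatial reasoning, with performance dropping noticeably on complex tasks. Error analysis reveals persistent difficulties with spatial orientation and dependency reasoning.

Insight: Spatial tasks rely more on intuitive understanding, and chain-of-thought prompting yields limited improvement. PhyBlock provides a unified testbed for advancing VLMs' real-world physical problem-solving.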

Abstract: While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining notably as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting spatial tasks heavily rely on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.

q-bio.NC [Back]

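[136] Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain q-bio.NC | cs.AI | cs.CL | cs.CV | cs.LG

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh

TL;DR: This paper studies the brain alignment of instruction-tuned multimodal large language models (MLLMs) on video and audio tasks, finding that task-specific instructions significantly improve alignment between MLLMs and brain activity and revealing a hierarchical alignment pattern.

Details

Motivation: Prior work mainly examined the brain alignment of non-instruction-tuned MLLMs under unimodal or multimodal stimuli and overlooked the role of task-specific instructions. This paper aims to fill that gap by examining the brain alignment of instruction-tuned MLLMs under naturalistic video-audio stimuli.

Result: Instruction-tuned MLLMs significantly outperform non-instruction-tuned models, and their layer-wise representations closely track the brain's processing hierarchy (early sensory areas align with early layers; higher-level visual and language regions align with middle to late layers).

Insight: Task-specific instructions significantly improve MLLM-brain alignment, offering a new perspective on joint information processing in brains and MLLMs. The code is open source.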

Abstract: Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, the degree to which representations derived from MLLMs predict neural activity recorded while participants were watching naturalistic movies (video along with audio). We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems. We make the code publicly available at https://github.com/subbareddy248/mllm_videos.

eess.AS [Back]

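[137] Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs eess.AS | cs.CL

Šimon Sedláček, Bolaji Yusuf, Ján Švec, Pradyoth Hegde, Santosh Kesiraju

TL;DR: The paper aligns the representation spaces of speech encoders and LLMs through a small connector module for spoken dialogue state tracking (DST), reaching state-of-the-art performance on SpokenWOZ (42.17% JGA).

Details

Motivation: To address the mismatch between the representation spaces of speech encoders and large language models (LLMs) in spoken DST, the authors propose an alignment approach that aims to improve performance while using only open-source components.

Result: On the SpokenWOZ test set, the best WavLM + connector + OLMo-1B model reaches 34.66% JGA, and the system using Gemma-2-9B-instruct raises this to 42.17% JGA.

Insight: Aligning the representation spaces of speech encoders and LLMs substantially improves spoken DST performance; fuzzy-matching post-processing is especially important for named entities in slot values.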

Abstract: In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
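
The two evaluation ideas named in the abstract, fuzzy matching of predicted slot values and joint goal accuracy (JGA), can be sketched with the Python standard library. This is an illustrative sketch only: the paper's actual matching procedure is not described in the abstract, and the helper names, the `cutoff` value, and the example slot names are assumptions.

```python
import difflib


def fuzzy_correct(value, candidates, cutoff=0.8):
    # Snap a predicted slot value to the closest known entity name, if any
    # candidate is similar enough; otherwise keep the prediction unchanged.
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def joint_goal_accuracy(predicted_states, gold_states):
    # JGA: fraction of turns whose full predicted dialogue state
    # (all slot-value pairs) exactly matches the gold state.
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```

Because JGA requires every slot in a turn to be correct, a single misrecognized entity name (for example, a speech recognition slip in a hotel name) fails the whole turn, which is one reason post-processing of named entities can matter so much for this metric.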

cs.LG [Back]

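[138] Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining cs.LG | cs.AI | cs.CL | cs.CV

Chenxi Liu, Tianyi Xiong, Ruibo Chen, Yihan Wu, Junfeng Guo

TL;DR: The paper proposes MBPO, a new preference learning framework that addresses modality imbalance in large multimodal models (LMMs) by generating adversarial negative samples and using online verified rewards, improving performance on vision-language tasks and reducing hallucinations.

Details

Motivation: Existing preference optimization methods do not effectively restrain the internal biases of the large language model (LLM) backbone, and they rely on offline data, so they cannot adapt to dynamic distribution shifts during training. As a result, LMMs exhibit modality imbalance during reasoning.

Result: Experiments show that MBPO improves LMM performance on challenging vision-language tasks and reduces hallucinations.

Insight: Adversarial negatives combined with dynamic optimization on online data can effectively address modality imbalance in LMMs, offering a new direction for multimodal alignment.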

Abstract: The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., language prior biases outweigh visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. Existing preference optimization approaches for LMMs, however, do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
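
As a rough illustration of the "verified rewards" and group-relative scoring that GRPO relies on, the sketch below scores a group of responses sampled for the same closed-ended prompt and normalizes the rewards within the group. It shows only the advantage computation, not the full GRPO objective or MBPO's hybrid training; the exact-match rule in `verified_reward` and both function names are assumptions for illustration.

```python
import numpy as np


def verified_reward(response, answer):
    # Hypothetical verifier for a closed-ended task: reward 1.0 when the
    # normalized response equals the reference answer, else 0.0.
    return 1.0 if response.strip().lower() == answer.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize rewards within one group of responses to the same prompt:
    # subtract the group mean and divide by the group standard deviation.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

When every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, so prompts the model already answers consistently (or never answers) carry little weight under this scheme.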

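[139] Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning cs.LG | cs.CL

Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong

TL;DR: Bingo is a reinforcement learning framework that improves the reasoning efficiency and accuracy of large language models through dynamic, significance-aware length reward design.

Details

Motivation: Existing large language models often produce verbose or redundant output during reasoning, which hurts efficiency. Prior work uses reinforcement learning to improve reasoning accuracy but pays little attention to efficiency, and simple length rewards tend to reduce accuracy.

Result: Across multiple reasoning benchmarks, Bingo outperforms baseline methods in both efficiency and accuracy, achieving a better trade-off.

Insight: Explicitly training large language models for efficient reasoning has potential value, and dynamic reward design is key.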

Abstract: Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-based rewards to encourage brevity, but this often leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL framework that advances length-based reward design to boost efficient reasoning. Bingo incorporates two key mechanisms: a significance-aware length reward, which gradually guides the model to reduce only insignificant tokens, and a dynamic length reward, which initially encourages elaborate reasoning for hard questions but decays over time to improve overall efficiency. Experiments across multiple reasoning benchmarks show that Bingo improves both accuracy and efficiency. It outperforms the vanilla reward and several other length-based reward baselines in RL, achieving a favorable trade-off between accuracy and efficiency. These results underscore the potential of training LLMs explicitly for efficient reasoning.

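[140] Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints cs.LG | cs.AI | cs.CL | stat.AP

Yaswanth Chittepu, Blossom Metevier, Will Schwarzer, Austin Hoag, Scott Niekum

TL;DR: HC-RLHF is a high-confidence safe method for reinforcement learning from human feedback. It explicitly decouples human preferences into helpfulness and harmlessness and optimizes the reward function under a pessimistic cost constraint to keep models safe in sensitive domains.

Details

Motivation: Existing language model alignment methods often trade safety against helpfulness, which can produce unacceptable responses in sensitive domains. HC-RLHF aims to provide high-confidence safety guarantees while maximizing helpfulness.

Result: HC-RLHF is used to align three language models (Qwen2-1.5B, Qwen2.5-3B, LLaMa3.2-3B) and improves harmlessness and helpfulness relative to previous methods.

Insight: Through explicit decoupling and pessimistic optimization, HC-RLHF offers a more reliable alignment framework for safety-critical applications.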

Abstract: Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable responses in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences into helpfulness and harmlessness (safety), which are learned by training a reward model and a cost model, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function under an intentionally pessimistic version of the cost constraint. In the second step, the trained model undergoes a safety test to verify whether its performance stays within an upper-confidence bound of the actual cost constraint. We provide a theoretical analysis of HC-RLHF, including proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability and can improve harmlessness and helpfulness compared to previous methods.
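
The second step described above, a safety test against an upper-confidence bound, can be illustrated with a simple concentration inequality. The abstract does not state which bound HC-RLHF uses, so this sketch substitutes Hoeffding's inequality and assumes per-response costs scaled to the range [0, 1]; the function names and the numbers in the usage note are hypothetical.

```python
import math


def hoeffding_upper_bound(costs, delta, cost_range=1.0):
    # One-sided Hoeffding bound: with probability at least 1 - delta, the true
    # mean cost is at most the sample mean plus this margin (assumes i.i.d.
    # costs that lie in an interval of width cost_range).
    n = len(costs)
    mean = sum(costs) / n
    margin = cost_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean + margin


def safety_test(costs, threshold, delta):
    # Accept the trained model only if the upper confidence bound on its
    # expected cost stays within the allowed threshold.
    return hoeffding_upper_bound(costs, delta) <= threshold
```

With a threshold of 0.1 and delta of 0.05, a model with zero observed cost on 1,000 held-out prompts passes, while the same model evaluated on only 10 prompts fails: the margin shrinks as 1/sqrt(n), so too few samples cannot certify safety. This also suggests why the first step optimizes against a deliberately pessimistic constraint, since a model sitting exactly at the threshold would rarely pass such a test.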

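[141] From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium cs.LG | cs.CL

Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu

TL;DR: This paper proposes ECON, a method for efficient multi-agent LLM reasoning based on Bayesian Nash equilibrium (BNE), addressing the high computational cost and missing convergence guarantees of conventional multi-agent frameworks.

Details

Motivation: Multi-agent frameworks can substantially strengthen the reasoning of large language models, but conventional approaches are computationally expensive and lack convergence guarantees. A more efficient and theoretically grounded method is needed.

Result: Experiments show that ECON outperforms existing multi-LLM approaches by 11.2% on average across six complex reasoning and planning benchmarks, and further experiments confirm its scalability and flexibility.

Insight: BNE provides a theoretical foundation for multi-agent LLM collaboration. ECON shows the potential of combining distributed reasoning with a centralized final output, a step toward larger multi-LLM ensembles.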

Abstract: Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON’s ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON.

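[142] From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? cs.LG | cs.AI | cs.CL

Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo

TL;DR: This paper introduces AR-Bench, a new benchmark that evaluates the active reasoning ability of large language models (LLMs) under incomplete information, revealing significant shortcomings of current LLMs in active reasoning.

Details

Motivation: Existing benchmarks mainly assess passive reasoning and overlook an LLM's ability to actively acquire information when it is incomplete. This limits the potential of LLMs in real-world applications.

Result: Experiments show that current LLMs perform poorly at active reasoning, and even advanced strategies (such as tree-based search or post-training methods) bring only limited gains.

Insight: The findings underline the need to further develop active reasoning methods, such as interactive learning, real-time feedback loops, and environment-aware training objectives.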

Abstract: While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, where an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM’s active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.

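[143] Reinforce LLM Reasoning through Multi-Agent Reflection cs.LG | cs.AI | cs.CL

Yurun Yuan, Tengyang Xie

TL;DR: This paper proposes strengthening the reasoning of large language models (LLMs) through multi-agent reflection, using the reinforcement learning algorithm DPSDP to train an actor-critic LLM system that iteratively refines its answers.

Details

Motivation: Existing approaches to dynamic solution exploration and feedback incorporation suffer from restricted feedback spaces and a lack of coordinated training, leading to suboptimal performance.

Result: On the MATH 500 benchmark, majority voting over five refinement steps raises first-turn accuracy from 58.2% to 63.2% with Ministral-based models, and an ablation study confirms the benefits of multi-agent collaboration and out-of-distribution generalization.

Insight: Multi-agent collaboration and iterative refinement can significantly improve LLM reasoning and show potential for out-of-distribution generalization.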

Abstract: Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization.
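
The "majority voting over five refinement steps" mentioned in the abstract can be pictured as collecting the final answer after each refinement turn and returning the most frequent one. The sketch below assumes answers have already been extracted and normalized to comparable strings; the function name and the example answers are hypothetical, and the tie-breaking rule is an assumption rather than something the abstract specifies.

```python
from collections import Counter


def majority_vote(answers):
    # Return the most frequent answer; ties go to the answer seen first,
    # since Counter.most_common keeps first-encountered order among equals.
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]
```

For instance, if the five refinement turns produce "12", "15", "12", "12", and "9", the voted answer is "12", even though two individual turns were wrong.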

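[144] Reinforcement Learning Teachers of Test Time Scaling cs.LG | cs.AI | cs.CL

Edoardo Cetin, Tianyu Zhao, Yujin Tang

TL;DR: The paper proposes a new framework, Reinforcement-Learned Teachers (RLTs), in which models are trained to produce detailed explanations for student models, sidestepping the exploration challenge of reinforcement learning and performing well across multiple tasks.

Details

Motivation: Training reasoning language models with conventional reinforcement learning depends on the model's initial ability to explore, and such models are often used to distill new students rather than being deployed directly. The RLT framework is designed around these considerations.

Result: A 7B RLT performs strongly across multiple tasks, yielding better distillation results than pipelines that rely on much larger language models, and it transfers zero-shot to new tasks.

Insight: The RLT framework improves the efficiency and reusability of reasoning language models and extends to larger student models and out-of-distribution tasks.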

Abstract: Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL’s exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply “connect-the-dots” with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem’s solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.

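[145] The Geometries of Truth Are Orthogonal Across Tasks cs.LG | cs.AI | cs.CL | stat.ML

Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein

TL;DR: This study shows that the so-called "geometry of truth" in large language models (LLMs) is task-dependent and does not transfer across tasks. Linear classifiers trained on different tasks share little similarity, and under sparsity-enforcing regularization their supports are almost disjoint. More sophisticated approaches (such as mixtures of probes and tasks) also fail to overcome this limitation.

Details

Motivation: LLMs generalize well, but their reliability is still in question. Earlier work found that a "geometry of truth" can be learned from LLM activations to distinguish correct from incorrect answers. Whether this carries over across tasks was unclear, and this paper sets out to test its limits.

Result: The "geometries of truth" are not similar across tasks, and the supports of the linear classifiers are almost disjoint. Mixtures of probes and tasks do not resolve the limitation either.

Insight: The "geometry of truth" in LLMs is task-specific, and current linear classification methods struggle to generalize across tasks. Future work needs to explore more flexible classification strategies or activation representations.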

Abstract: Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still marred by concerns about their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a “geometry of truth” can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these “geometries of truth” are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.
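
Two simple ways to quantify the claims "share little similarity" and "almost disjoint supports" for a pair of linear probes are the cosine similarity of their weight vectors and the Jaccard overlap of their nonzero coordinates. The abstract does not say which measures the authors use, so the functions below are an illustrative sketch with hypothetical names and an assumed tolerance `tol` for treating a weight as zero.

```python
import numpy as np


def cosine_similarity(w_a, w_b):
    # Cosine of the angle between two probe weight vectors; values near 0
    # mean the probes point in nearly orthogonal directions.
    w_a, w_b = np.asarray(w_a, dtype=float), np.asarray(w_b, dtype=float)
    return float(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))


def support(w, tol=1e-8):
    # Indices of coordinates a sparse probe actually uses (nonzero weights).
    return set(np.flatnonzero(np.abs(np.asarray(w, dtype=float)) > tol).tolist())


def support_jaccard(w_a, w_b, tol=1e-8):
    # Jaccard overlap of the two supports: 0.0 means fully disjoint supports.
    s_a, s_b = support(w_a, tol), support(w_b, tol)
    union = s_a | s_b
    return len(s_a & s_b) / len(union) if union else 1.0
```

Two probes that read entirely different activation coordinates have a Jaccard overlap of 0 and a cosine similarity of 0, so a probe fitted on one task cannot be expected to score answers sensibly on the other.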

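[146] SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning cs.LG | cs.CL

Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang

TL;DR: The paper proposes a Self-aware Weakness-driven problem Synthesis framework (SwS) to improve reinforcement learning for large language models (LLMs) on complex reasoning tasks. By identifying the model's weaknesses during training and synthesizing targeted problems, it significantly improves model performance.

Details

Motivation: Existing reinforcement learning methods depend on high-quality problem sets, but human-labeled problems are scarce and synthetic datasets are not targeted, making training inefficient. The goal is to have the model identify its own weaknesses and synthesize problems accordingly.

Result: On 7B and 32B models, average gains across eight mainstream reasoning benchmarks are 10.0% and 7.7%, respectively.

Insight: Models can dynamically address their weak areas through self-awareness and targeted training, which markedly improves generalization and suggests a new direction for problem synthesis in reinforcement learning.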

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model’s capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model’s weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.

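[147] e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs cs.LG | cs.CL

Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu

TL;DR: The paper proposes e3, a recipe that trains LLMs to perform in-context exploration at inference time, enabling performance to extrapolate beyond the training token budget. e3 combines three key ingredients and markedly improves the performance of small models.

Details

Motivation: Existing reasoning models extrapolate poorly beyond the token budget they were trained on; a method is needed that makes full use of test-time compute to solve harder problems.

Result: The e3-1.7B model achieves the best known AIME'25 and HMMT'25 scores among 1.7B models and extrapolates to 2x the training token budget, with high pass@1 and improved pass@k over the base model.

Insight: In-context exploration and chaining of asymmetric skills can substantially improve LLM reasoning, particularly when test-time compute exceeds the training budget, showing the potential of small models.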

Abstract: Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep “thinking” for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging “negative” gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME’25 and HMMT’25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
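
For readers unfamiliar with the pass@1 and pass@k metrics reported above, the sketch below implements the widely used unbiased estimator: given n sampled solutions of which c are correct, it estimates the probability that at least one of k randomly chosen samples is correct. The abstract does not say how the authors compute pass@k, so treat this as the common convention rather than their exact procedure; the function name is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    # Probability that at least one of k samples, drawn without replacement
    # from n generated solutions (c of them correct), is correct:
    # 1 - C(n - c, k) / C(n, k).
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of correct samples, c / n. Improving pass@k for larger k, as the abstract reports, indicates that training did not simply concentrate the model on a narrow set of answers it could already reach.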

[148] An Adaptive Method Stabilizing Activations for Enhanced Generalization cs.LG | cs.CV

Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko

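TL;DR: AdaAct is a novel optimization algorithm that adjusts learning rates according to activation variance in order to stabilize neuron outputs and thereby improve generalization. It is evaluated on CIFAR and ImageNet.

Details

Motivation: Conventional activation regularization methods are limited in how much they improve generalization, so a method is needed that adjusts activation stability dynamically during training.

Result: Experiments on CIFAR and ImageNet show competitive performance against other state-of-the-art methods, pairing convergence speed close to Adam's with generalization close to SGD's. The code is open source.

Insight: Activation stability is critical to generalization. Adjusting learning rates dynamically is an effective complement to activation regularization and bridges the strengths of different optimizers.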

Abstract: We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization – a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct’s competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at https://github.com/hseung88/adaact.
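
The abstract describes learning rates adjusted neuron-wise according to activation variance, without giving the update rule. The toy sketch below shows one way such a scaling could look for a single linear layer: each input neuron keeps a running estimate of its activation variance, and the gradient column for that neuron is divided by the square root of the estimate. This is an assumption-laden illustration, not AdaAct's actual algorithm; the function name `variance_scaled_step` and the parameters `beta` and `eps` are hypothetical.

```python
from math import sqrt
from typing import List, Tuple

Matrix = List[List[float]]


def variance_scaled_step(
    weights: Matrix,           # shape [out][in]
    grads: Matrix,             # shape [out][in]
    activations: Matrix,       # batch of layer inputs, shape [batch][in]
    var_state: List[float],    # running variance per input neuron
    lr: float = 0.1,
    beta: float = 0.9,
    eps: float = 1e-8,
) -> Tuple[Matrix, List[float]]:
    """One toy update step; NOT the published AdaAct rule."""
    batch = len(activations)
    n_in = len(var_state)
    new_state = []
    for j in range(n_in):
        column = [row[j] for row in activations]
        mean = sum(column) / batch
        var = sum((x - mean) ** 2 for x in column) / batch
        # Exponential moving average of the per-neuron activation variance.
        new_state.append(beta * var_state[j] + (1.0 - beta) * var)
    # Neurons with high activation variance take smaller steps.
    new_weights = [
        [w - lr * g / (sqrt(v) + eps) for w, g, v in zip(w_row, g_row, new_state)]
        for w_row, g_row in zip(weights, grads)
    ]
    return new_weights, new_state
```

The function returns new weights and a new state instead of mutating its inputs, which keeps the sketch easy to test. A real optimizer would also need bias correction and a safeguard for neurons whose variance is near zero.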

[149] HSG-12M: A Large-Scale Spatial Multigraph Dataset cs.LG | cond-mat.mes-hall | cond-mat.other | cs.AI | cs.CV

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

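TL;DR: HSG-12M is the first large-scale dataset of spatial multigraphs. Its distinguishing feature is that it retains multiple geometrically distinct paths between nodes. Built from physical spectral data, it offers rich topologies and exposes the challenges of learning from multi-edge geometry.

Details

Motivation: Existing graph datasets usually ignore spatial structure and multiple paths, whereas real-world graphs such as physical systems need these properties preserved. HSG-12M fills this gap.

Result: The dataset contains 11.6M static and 5.1M dynamic graphs across 1401 characteristic-polynomial classes. Experiments show that existing graph neural networks struggle to learn from multi-edge geometry.

Insight: Spectral graphs are not only an effective representation of physical systems but also serve as topological fingerprints of polynomials, vectors, and matrices, building a new bridge between graph learning and scientific discovery.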

Abstract: Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of $\textbf{spatial multigraphs}$: graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic $\textit{Hamiltonian spectral graphs}$ across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal’s energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release $\texttt{Poly2Graph}$: a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities for data-driven scientific discovery in condensed matter physics and beyond.
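
To make the notion of a spatial multigraph concrete, the minimal sketch below stores each edge together with its own polyline geometry, so two geometrically distinct trajectories between the same pair of nodes remain separate edges. It also shows what is lost when the graph is collapsed to simple edges. The class `SpatialMultigraph` and its methods are hypothetical and unrelated to the HSG-12M file format or the Poly2Graph API.

```python
from dataclasses import dataclass, field
from math import dist
from typing import Dict, FrozenSet, List, Set, Tuple

Point = Tuple[float, float]


@dataclass
class SpatialMultigraph:
    """Toy container: nodes with 2-D positions, edges with polyline paths."""

    nodes: Dict[str, Point] = field(default_factory=dict)
    # Each edge is (u, v, path); parallel edges are kept as separate entries.
    edges: List[Tuple[str, str, List[Point]]] = field(default_factory=list)

    def add_node(self, name: str, pos: Point) -> None:
        self.nodes[name] = pos

    def add_edge(self, u: str, v: str, via: List[Point]) -> None:
        # The stored path runs from u through the intermediate points to v.
        path = [self.nodes[u], *via, self.nodes[v]]
        self.edges.append((u, v, path))

    def edge_length(self, index: int) -> float:
        path = self.edges[index][2]
        return sum(dist(p, q) for p, q in zip(path, path[1:]))

    def collapse(self) -> Set[FrozenSet[str]]:
        """Simple-graph view: parallel edges merge and geometry is dropped."""
        return {frozenset((u, v)) for u, v, _ in self.edges}


if __name__ == "__main__":
    g = SpatialMultigraph()
    g.add_node("a", (0.0, 0.0))
    g.add_node("b", (2.0, 0.0))
    g.add_edge("a", "b", via=[])            # straight segment
    g.add_edge("a", "b", via=[(1.0, 1.0)])  # detour through (1, 1)
    print(len(g.edges), len(g.collapse()))
    print(g.edge_length(0), g.edge_length(1))
```

The two edges connect the same endpoints but have different lengths, which is exactly the information a simple-edge benchmark would discard.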

[150] Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers cs.LG | cs.AI | cs.CV

Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata

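TL;DR: The paper proposes TiViT, a framework that converts time series into images in order to exploit the representational power of pretrained Vision Transformers (ViTs). It achieves state-of-the-art performance on time series classification and reveals the potential for reusing vision models in non-visual domains.

Details

Motivation: Time series classification matters in healthcare and industry, but the development of time series foundation models (TSFMs) is limited by the scarcity of public datasets. The paper compensates for this by converting time series into images and using the representations of ViTs pretrained on large-scale image datasets.

Result: TiViT achieves state-of-the-art performance on standard time series classification benchmarks, and intermediate-layer representations are the most effective for classification. Combining the TiViT and TSFM representation spaces improves performance further.

Insight: ViT representations transfer to non-visual domains such as time series classification. Intermediate-layer representations are especially important for the task, and the vision and time series representation spaces are complementary.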

Abstract: Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal yet another direction for reusing vision representations in a non-visual domain.
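
The abstract says time series are converted into images and then patched in 2D, but it does not spell out the conversion. The sketch below shows one simple possibility under stated assumptions: min-max normalize a univariate series, cut it into fixed-length segments, and stack the segments as rows of a 2D grid. The function `series_to_image` and the edge-padding choice are illustrative assumptions, not TiViT's actual preprocessing.

```python
from typing import List, Sequence


def series_to_image(series: Sequence[float], width: int) -> List[List[float]]:
    """Stack consecutive segments of a 1-D series as rows of a 2-D grid.

    Hypothetical helper for illustration only.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if not series:
        raise ValueError("series must be non-empty")
    lo, hi = min(series), max(series)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant series
    # Min-max normalization to [0, 1], comparable to pixel intensities.
    norm = [(x - lo) / scale for x in series]
    # Pad with the last value so the length is a multiple of the row width.
    pad = (-len(norm)) % width
    norm.extend([norm[-1]] * pad)
    return [norm[i:i + width] for i in range(0, len(norm), width)]


if __name__ == "__main__":
    for row in series_to_image([0, 1, 2, 3, 4], width=2):
        print(row)
```

In a full pipeline the grid would still have to be resized and replicated across channels to match the input shape the frozen ViT expects; those steps are omitted here.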

Table of Contents

cs.CV [Back]

[1] Towards Reliable AR-Guided Surgical Navigation: Interactive Deformation Modeling with Data-Driven Biomechanics and Prompts cs.CV | cs.AI | cs.RO

[2] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving cs.CV | cs.RO

[3] CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems cs.CV

[4] IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation cs.CV | cs.AI

[5] Spectral Domain Neural Reconstruction for Passband FMCW Radars cs.CV

[6] Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion Models in a Vision-Language-Action Framework cs.CV | cs.AI

[7] Open World Scene Graph Generation using Vision Language Models cs.CV | cs.CL

[8] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra cs.CV | 68T45 | I.5.4; I.2.10; I.3.5

[9] A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation cs.CV | cs.AI | cs.CL | cs.LG

[10] Using Satellite Images And Self-supervised Machine Learning Networks To Detect Water Hidden Under Vegetation cs.CV

[11] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence cs.CV

[12] A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks cs.CV

[13] Highly Compressed Tokenizer Can Generate Without Training cs.CV | cs.AI

[14] Seeing Voices: Generating A-Roll Video from Audio with Mirage cs.CV | cs.AI | cs.LG

[15] SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging cs.CV | cs.AI

[16] OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal cs.CV

[17] Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive Gating cs.CV

[18] Locating Tennis Ball Impact on the Racket in Real Time Using an Event Camera cs.CV

[19] How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models cs.CV | cs.AI | cs.CL

[20] MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding cs.CV

[21] SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding cs.CV

[22] RadioDUN: A Physics-Inspired Deep Unfolding Network for Radio Map Estimation cs.CV | eess.SP

[23] Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring cs.CV

[24] Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance cs.CV

[25] MARMOT: Masked Autoencoder for Modeling Transient Imaging cs.CV

[26] Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization cs.CV | cs.MM

[27] MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding cs.CV | cs.AI

[28] Robust Visual Localization via Semantic-Guided Multi-Scale Transformer cs.CV

[29] LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s cs.CV

[30] TrajFlow: Multi-modal Motion Prediction via Flow Matching cs.CV | cs.AI

[31] Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs cs.CV

[32] From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge cs.CV

[33] Towards Cross-Subject EMG Pattern Recognition via Dual-Branch Adversarial Feature Disentanglement cs.CV | cs.HC

[34] Hierarchical Neural Collapse Detection Transformer for Class Incremental Object Detection cs.CV

[35] Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations cs.CV

[36] Diversity-Guided MLP Reduction for Efficient Large Vision Transformers cs.CV | cs.LG | cs.MM

[37] Data-Efficient Challenges in Visual Inductive Priors: A Retrospective cs.CV

[38] SAMSelect: A Spectral Index Search for Marine Debris Visualization using Segment Anything cs.CV

[39] ECMNet: Lightweight Semantic Segmentation with Efficient CNN-Mamba Network cs.CV | cs.AI

[40] RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping cs.CV

[41] SurfR: Surface Reconstruction with Multi-scale Attention cs.CV

[42] Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization cs.CV

[43] Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping cs.CV

[44] LLaVA-c: Continual Improved Visual Instruction Tuning cs.CV

[45] ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction cs.CV

[46] CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities cs.CV

[47] VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism cs.CV

[48] MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning cs.CV

[49] TraGraph-GS: Trajectory Graph-based Gaussian Splatting for Arbitrary Large-Scale Scene Rendering cs.CV

[50] SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting cs.CV

[51] Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces cs.CV | cs.AI

[52] InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba cs.CV

[53] RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation cs.CV

[54] Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting cs.CV

[55] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation cs.CV

[56] Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought cs.CV

[57] CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics cs.CV | cs.AI | cs.CL

[58] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis cs.CV

[59] Spatial Transcriptomics Expression Prediction from Histopathology Based on Cross-Modal Mask Reconstruction and Contrastive Learning cs.CV | cs.AI

[60] StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams cs.CV | cs.LG

[61] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval cs.CV

[62] Product of Experts for Visual Generation cs.CV | cs.AI

[63] WetCat: Automating Skill Assessment in Wetlab Cataract Surgery Videos cs.CV

[64] MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis cs.CV

[65] Inherently Faithful Attention Maps for Vision Transformers cs.CV | cs.AI

[66] Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions cs.CV | cs.AI | cs.CL

[67] What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities cs.CV

[68] SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation cs.CV

[69] Cross-Spectral Body Recognition with Side Information Embedding: Benchmarks on LLCM and Analyzing Range-Induced Occlusions on IJB-MDF cs.CV

[70] Segment Concealed Objects with Incomplete Supervision cs.CV | cs.AI | cs.LG

[71] Data Augmentation For Small Object using Fast AutoAugment cs.CV | cs.LG

[72] ORIDa: Object-centric Real-world Image Composition Dataset cs.CV

[73] ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations cs.CV

[74] Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models cs.CV | cs.AI | cs.LG

[75] Do MIL Models Transfer? cs.CV

[76] Princeton365: A Diverse Dataset with Accurate Camera Pose cs.CV

[77] Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better cs.CV | cs.AI | cs.CL

[78] Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models cs.CV