cs.CV [Total: 157]
cs.CL [Total: 47]
cs.GR [Total: 1]
eess.IV [Total: 4]
cs.IR [Total: 2]
cs.CR [Total: 2]
cs.AI [Total: 5]
cs.RO [Total: 6]
cs.CY [Total: 1]
cs.LG [Total: 9]

cs.CV

[1] SO-Bench: A Structural Output Evaluation of Multimodal LLMs cs.CV | cs.AI | cs.CL | cs.RO

Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai

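TL;DR: SO-Bench is the first benchmark to systematically evaluate the structured-output capability of multimodal large language models (MLLMs) on visual inputs. It covers four visual domains and shows that current models fall short in generating accurate, schema-compliant outputs.

Details

Motivation: In real-world applications, MLLMs must generate outputs that are not only correct but also conform to predefined data schemas, yet no benchmark systematically evaluates structured-output capability on visual inputs.

Result: Experiments show that existing MLLMs have significant gaps in generating accurate, schema-compliant outputs; training experiments show that structured-output capability can be improved through training.

Insight: Multimodal structured reasoning is a key challenge for MLLMs, and SO-Bench provides an important tool and direction for future research.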

Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model’s structured output capability. We plan to make the benchmark available to the community.
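To make "schema-compliant" concrete, the following is a minimal sketch of how a model's raw text output could be checked against a JSON schema. It is not SO-Bench's evaluator: the function names `conforms` and `check_output` are hypothetical, and the checker is hand-rolled and covers only a small subset of JSON Schema (`type`, `properties`, `required`, `items`).

```python
import json

# Hypothetical, hand-rolled checker for a small subset of JSON Schema
# ("type", "properties", "required", "items"). Not SO-Bench's evaluator.
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def conforms(value, schema):
    """Return True if `value` satisfies the supported subset of `schema`."""
    expected = schema.get("type")
    py_type = _TYPES.get(expected)
    if py_type is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            return False
        if not isinstance(value, py_type):
            return False
    if isinstance(value, dict):
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        # Present keys with a declared sub-schema must satisfy it.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub_schema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True


def check_output(raw_text, schema):
    """Parse a model's raw text output and test it against the schema."""
    try:
        return conforms(json.loads(raw_text), schema)
    except json.JSONDecodeError:
        return False
```

A check of this kind only covers compliance; whether the extracted values are correct for the image is a separate question, which is why the abstract distinguishes accurate outputs from schema-compliant ones.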

[2] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training cs.CV | cs.LG | stat.ML

Eric Yeats, Darryl Hannan, Wilson Fearn, Timothy Doster, Henry Kvinge

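TL;DR: The paper proposes saddle-free guidance (SFG), which uses positive-curvature information from log density estimates to improve score-based generative models without additional training or labeled data. SFG has the same computational cost as classifier-free guidance (CFG) and achieves state-of-the-art results in unconditional generation.

Details

Motivation: Popular guidance methods such as CFG and Auto-Guidance require labeled data or training additional models, which limits their use when labeled data are unavailable or resources are constrained. The paper finds that the positive curvature of log density estimates provides an effective guidance signal, which addresses this problem.

Result: SFG achieves state-of-the-art FID and FD-DINOv2 on unconditional ImageNet-512 generation. Compared with CFG, SFG increases the diversity of generated images while maintaining high image fidelity.

Insight: Positive-curvature information from log density estimates can guide generation effectively, which opens a new route for improving generative models in unsupervised or resource-constrained settings.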

Abstract: Score-based generative models require guidance in order to generate plausible, on-manifold samples. The most popular guidance method, Classifier-Free Guidance (CFG), is only applicable in settings with labeled data and requires training an additional unconditional score-based model. More recently, Auto-Guidance adopts a smaller, less capable version of the original model to guide generation. While each method effectively promotes the fidelity of generated data, each requires labeled data or the training of additional models, making it challenging to guide score-based models when (labeled) training data are not available or training new models is not feasible. We make the surprising discovery that the positive curvature of log density estimates in saddle regions provides strong guidance for score-based models. Motivated by this, we develop saddle-free guidance (SFG) which maintains estimates of maximal positive curvature of the log density to guide individual score-based models. SFG has the same computational cost as classifier-free guidance, does not require additional training, and works with off-the-shelf diffusion and flow matching models. Our experiments indicate that SFG achieves state-of-the-art FID and FD-DINOv2 metrics in single-model unconditional ImageNet-512 generation. When SFG is combined with Auto-Guidance, its unconditional samples achieve general state-of-the-art in FD-DINOv2 score. Our experiments with FLUX.1-dev and Stable Diffusion v3.5 indicate that SFG boosts the diversity of output images compared to CFG while maintaining excellent prompt adherence and image fidelity.
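For context on the baseline, the sketch below shows the standard classifier-free guidance combination, which needs both a conditional and an unconditional score estimate at every step. It illustrates CFG only, not SFG, whose update rule is not given in the abstract. The function name `cfg_score` is hypothetical, and the weighting convention (weight 1 recovers the conditional score) is one of several in use.

```python
def cfg_score(score_uncond, score_cond, weight):
    """Classifier-free guidance: s_u + w * (s_c - s_u), applied element-wise.

    Assumed convention: weight = 0 gives the unconditional score,
    weight = 1 gives the conditional score, weight > 1 extrapolates.
    """
    return [u + weight * (c - u) for u, c in zip(score_uncond, score_cond)]
```

Because both inputs come from separate network evaluations, each guided step costs two forward passes, which is the cost the abstract says SFG matches.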

[3] UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation cs.CV

Bu Jin, Weize Li, Songen Gu, Yupeng Zheng, Yuhang Zheng

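TL;DR: UniArt is a diffusion-based framework that generates fully articulated 3D objects end to end from a single image. It jointly encodes geometry, texture, part segmentation, and kinematic parameters in a unified latent representation and uses a reversible joint-to-voxel embedding for high-quality generation. Experiments show strong results on the PartNet-Mobility benchmark.

Details

Motivation: Manually building articulated 3D objects is costly and hard to scale, and existing methods are mostly multi-stage. UniArt aims to solve this with an end-to-end framework.

Result: On the PartNet-Mobility benchmark, UniArt achieves state-of-the-art mesh quality and articulation accuracy.

Insight: The unified representation and reversible embedding let the model learn how structure and motion relate; open-set prediction improves generalization.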

Abstract: Articulated 3D objects play a vital role in realistic simulation and embodied robotics, yet manually constructing such assets remains costly and difficult to scale. In this paper, we present UniArt, a diffusion-based framework that directly synthesizes fully articulated 3D objects from a single image in an end-to-end manner. Unlike prior multi-stage techniques, UniArt establishes a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters. We introduce a reversible joint-to-voxel embedding, which spatially aligns articulation features with volumetric geometry, enabling the model to learn coherent motion behaviors alongside structural formation. Furthermore, we formulate articulation type prediction as an open-set problem, removing the need for fixed joint semantics and allowing generalization to novel joint categories and unseen object types. Experiments on the PartNet-Mobility benchmark demonstrate that UniArt achieves state-of-the-art mesh quality and articulation accuracy.

Kunpeng Zhang, Hanwen Xu, Sheng Wang

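TL;DR: PathReasoning is a multimodal reasoning agent that uses iterative reasoning and refined navigation to efficiently find the ROIs in a WSI that are relevant to a clinical question, significantly outperforming existing methods.

Details

Motivation: The gigapixel scale of WSIs makes navigation time-consuming and complex; the way pathologists reason suggests an efficient and interpretable solution.

Result: AUROC improves by 6.7% and 3.1% on subtyping and longitudinal analysis tasks, and breast cancer report generation accuracy exceeds GPT-4o by 10%.

Insight: Combining multimodal reasoning with self-reflection significantly improves WSI navigation efficiency and diagnostic accuracy while making the results more interpretable.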

Abstract: Deciphering the tumor microenvironment from Whole Slide Images (WSIs) is intriguing as it is key to cancer diagnosis, prognosis and treatment response. While these gigapixel images offer a comprehensive portrait of cancer, their extremely large size, which can exceed 10 billion pixels, makes it challenging and time-consuming to navigate to the regions relevant to diverse clinical inspections. Inspired by pathologists who navigate WSIs with a combination of sampling, reasoning and self-reflection, we propose “PathReasoning”, a multi-modal reasoning agent that iteratively navigates across WSIs through multiple rounds of reasoning and refinements. Specifically, starting with randomly sampled candidate regions, PathReasoning reviews current selections with self-reflection, reasoning over the correspondence between visual observations and clinical questions, and concludes by proposing new regions to explore. Across rounds, PathReasoning builds a reasoning chain that gradually directs attention to diagnostically relevant areas. PathReasoning turns each whole slide into a sequence of question-guided views, allowing the model to efficiently find informative ROIs within a fixed number of steps, without the need for dense pixel-level annotations. PathReasoning can substantially outperform strong ROI-selection approaches by 6.7% and 3.1% in AUROC on subtyping and longitudinal analysis tasks. The high-quality ROIs further support accurate report generation on breast cancer, significantly outperforming the standard GPT-4o by 10% in accuracy. PathReasoning prioritizes question-specific regions and constructs interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretations, comprehensive reporting, and evidence traceability in digital pathology.

[5] Interpretable Multimodal Cancer Prototyping with Whole Slide Images and Incompletely Paired Genomics cs.CV

Yupei Zhang, Yating Huang, Wanming Hu, Lequan Yu, Hujun Yin

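TL;DR: The paper proposes an interpretable multimodal cancer prototyping framework that integrates whole slide images with incompletely paired genomic data for precision oncology. It combines biological prototyping, multiview alignment, bipartite fusion, and semantic genomics imputation, and significantly improves performance.

Details

Motivation: Phenotypic and genotypic heterogeneity limits the performance of multimodal methods, and existing methods usually overlook real clinical scenarios where genomic data are partially or entirely missing.

Result: The method outperforms existing state-of-the-art approaches on multiple downstream tasks.

Insight: By integrating incompletely paired genomic data with whole slide images, the method shows strong potential for precision oncology, especially when handling missing data.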

Abstract: Multimodal approaches that integrate histology and genomics hold strong potential for precision oncology. However, phenotypic and genotypic heterogeneity limits the quality of intra-modal representations and hinders effective inter-modal integration. Furthermore, most existing methods overlook real-world clinical scenarios where genomics may be partially missing or entirely unavailable. We propose a flexible multimodal prototyping framework to integrate whole slide images and incomplete genomics for precision oncology. Our approach has four key components: 1) Biological Prototyping using text prompting and prototype-wise weighting; 2) Multiview Alignment through sample- and distribution-wise alignments; 3) Bipartite Fusion to capture both shared and modality-specific information for multimodal fusion; and 4) Semantic Genomics Imputation to handle missing data. Extensive experiments demonstrate the consistent superiority of the proposed method compared to other state-of-the-art approaches on multiple downstream tasks. The code is available at https://github.com/helenypzhang/Interpretable-Multimodal-Prototyping.

[6] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views cs.CV

Junwei Zhou, Yu-Wing Tai

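TL;DR: AmodalGen3D is a generative framework that reconstructs complete 3D object geometry and appearance from sparse, unposed, partially occluded views. By combining 2D amodal completion priors with multi-view stereo geometry conditioning, the model infers unobserved structure consistently and plausibly.

Details

Motivation: In real-world scenes, occlusion and sparse viewpoints make it hard for traditional multi-view or inpainting-based methods to produce complete, geometrically consistent 3D reconstructions.

Result: On synthetic and real-world datasets, AmodalGen3D achieves superior fidelity and completeness in occlusion-heavy sparse-view settings.

Insight: Jointly modeling visible and hidden regions is key to complete, occlusion-free 3D reconstruction, with applications in robotics, AR/VR, and embodied AI.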

Abstract: Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.

[7] TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video cs.CV

Finlay G. C. Hudson, James A. D. Gardner, William A. P. Smith

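TL;DR: TAPVid-360 introduces a new task: predicting the 3D direction to queried scene points across a video sequence, even when the points lie far outside the current field of view. It uses 360 videos as the source of supervision, which removes the need for dynamic 4D ground-truth scene models during training.

Details

Motivation: Humans build panoramic mental models, whereas current artificial vision systems struggle with persistent, panoramic understanding. Existing TAP methods cannot track 2D points outside the field of view, and TAPVid-360 aims to address this.

Result: The baseline method outperforms existing TAP and TAPVid 3D methods on the TAPVid-360 task.

Insight: The task encourages learning allocentric scene representations without relying on costly 4D scene models, offering a new direction for video understanding and point tracking.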

Abstract: Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground truth scene models for training. Instead, we exploit 360 videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k comprising 10k perspective videos with ground truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid 3D methods.
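The ground-truth construction described above relies on turning a 2D point tracked in the full panorama into a 3D direction. A minimal sketch of that conversion for an equirectangular frame follows. The axis convention (x right, y up, z forward), the pixel-to-angle mapping, and both function names are assumptions for illustration; the benchmark's actual conventions and error metric are not stated in the abstract.

```python
import math


def equirect_pixel_to_direction(u, v, width, height):
    """Map pixel (u, v) of an equirectangular panorama to a unit 3D direction.

    Assumed convention: u in [0, width) spans longitude [-pi, pi),
    v in [0, height] spans latitude from +pi/2 (top) to -pi/2 (bottom),
    with x pointing right, y up, and z forward.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def angular_error(d1, d2):
    """Angle in radians between two unit direction vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot)))
```

Representing targets as directions on the sphere rather than pixel coordinates is what lets a point remain well defined after it leaves the narrow field of view.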

[8] WalkCLIP: Multimodal Learning for Urban Walkability Prediction cs.CV | cs.AI | cs.LG

Shilong Xiang, JangHyeon Lee, Min Namgung, Yao-Yi Chiang

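TL;DR: WalkCLIP is a multimodal framework that combines satellite imagery, street view imagery, and population data to predict urban walkability, outperforming unimodal methods.

Details

Motivation: Traditional walkability assessments are costly and hard to scale, and single-source methods capture only one dimension of the walking environment. WalkCLIP addresses this by integrating multiple modalities.

Result: Evaluated at 4,660 locations in Minneapolis-Saint Paul, WalkCLIP outperforms baselines in predictive accuracy and spatial alignment.

Insight: Complementary multimodal data describe the walking environment more completely, and combining visual and behavioral signals significantly improves prediction.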

Abstract: Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o generated image captions, refines these representations with a spatial aggregation module that incorporates neighborhood context, and fuses the resulting features with representations from a population dynamics foundation model. Evaluated at 4,660 locations throughout Minneapolis-Saint Paul, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. These results show that the integration of visual and behavioral signals yields reliable predictions of the walking environment.

[9] PAT3D: Physics-Augmented Text-to-3D Scene Generation cs.CV

Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen

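TL;DR: PAT3D is a physics-augmented text-to-3D scene generation framework, the first to combine vision-language models with physics simulation. It generates physically plausible, intersection-free 3D scenes and improves scene quality through simulation-in-the-loop optimization.

Details

Motivation: Existing text-to-3D scene generation methods pay little attention to physical plausibility and simulation readiness, so generated scenes may be physically unrealistic or contain interpenetrating objects. PAT3D aims to fix this.

Result: Experiments show that PAT3D substantially outperforms prior methods in physical plausibility, semantic consistency, and visual quality, and produces simulation-ready scenes for downstream tasks such as scene editing and robotic manipulation.

Insight: PAT3D shows the potential of combining physics simulation with deep learning and offers a new approach to physical plausibility in complex 3D scene generation.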

Abstract: We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.

[10] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models cs.CV | cs.AI

Futian Wang, Chaoliu Weng, Xiao Wang, Zhen Chen, Zhicheng Zhao

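TL;DR: The paper presents RPM-10K, a large-scale pointer meter reading dataset, and proposes MRLM, a new vision-language model based on physical relation injection that encodes geometric and causal relationships to read pointer meters precisely.

Details

Motivation: Precise pointer meter reading is critical in smart power systems, but existing methods are fragile under reflections, occlusions, dynamic viewing angles, and subtle differences between pointers and scale markings, and the area lacks large-scale datasets.

Result: MRLM performs well on the newly proposed benchmark dataset, validating its effectiveness.

Insight: Injecting physical relations into vision-language models significantly improves pointer meter reading accuracy and remains robust in complex scenes.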

Abstract: The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges like reflections, occlusions, dynamic viewing angles, and overlap between thin pointers and scale markings. Up to now, this area still lacks large-scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large-scale benchmark dataset for dial reading, termed RPM-10K, which contains 10730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision-language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image-level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world-model perspectives. Through cross-attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments validate the effectiveness of our proposed framework on the newly proposed benchmark dataset. Both the dataset and source code will be released on https://github.com/Event-AHU/DialBench

[11] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation cs.CV

Xuchen Li, Hengrui Gu, Mohan Zhang, Qin Liu, Zhen Tan

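TL;DR: PPBoost progressively converts weak text signals into strong spatial visual prompts, improving medical image segmentation accuracy without labeled data.

Details

Motivation: Text-prompted foundation models lack spatial precision in medical image segmentation, while visual-prompted models are strong but depend on precise bounding-box prompts that are costly to obtain. PPBoost aims to bridge this gap.

Result: Across three datasets, PPBoost significantly improves Dice and Normalized Surface Distance and surpasses few-shot segmentation models.

Insight: Weak text signals can be progressively boosted into strong spatial prompts, improving segmentation without labeled data.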

Abstract: Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bbox pairs are then leveraged to train a pseudo-labeled detector, producing high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced, spatially grounded bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost can generalize to multiple typical visual segmentation model backbones.
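The Dice score reported above measures overlap between a predicted mask and the reference mask. A minimal sketch of the standard definition follows; the function name `dice_coefficient` is hypothetical, the masks are plain nested lists rather than any particular array type, and returning 1.0 when both masks are empty is an assumed convention that evaluation code handles in different ways.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary 2D masks."""
    pred_flat = [bool(p) for row in pred for p in row]
    target_flat = [bool(t) for row in target for t in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

Dice rewards area overlap, while Normalized Surface Distance is a boundary-based measure, which is why the two are usually reported together.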

Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu

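TL;DR: The paper studies whether multimodal large language models (LLMs) can provide live, interactive, step-by-step task guidance, and introduces a new benchmark dataset and a model, LiveMamba.

Details

Motivation: Future AI assistants need to give live interactive guidance, but current multimodal LLMs perform poorly on this task, especially at detecting user mistakes and giving immediate feedback.

Result: Existing LLMs are evaluated on Qualcomm Interactive Cooking, and LiveMamba shows promise as a baseline model.

Insight: Live task guidance requires models that can respond asynchronously, and video data containing user mistakes is key to improving model performance.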

Abstract: Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark feature densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating live, situated coaching.

[13] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation cs.CV

Sen Fang, Hongbin Zhong, Yalin Feng, Dimitris N. Metaxas

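TL;DR: The paper proposes StreamFlow, an efficient Rectified Flow generation method that, through optimizations across theory, design, and inference strategy, significantly accelerates flow-based generative models, to up to 611%.

Details

Motivation: New techniques such as Rectified Flow and Flow Matching perform well in generative models, but existing acceleration methods cannot be applied directly to Rectified Flow models. The paper aims to fill this gap.

Result: 512*512 image generation speed is accelerated to up to 611%, far beyond existing non-generalized acceleration methods (typically only 18%).

Insight: Combining theoretical optimization with engineering implementation can significantly improve the performance of Rectified Flow models and offers a new route to efficient generative models.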

Abstract: New techniques such as Rectified Flow and Flow Matching have significantly improved the performance of generative models in the past two years, especially in terms of control accuracy, generation quality, and generation efficiency. However, because their theory and design differ from those of existing diffusion models, the existing acceleration methods cannot be directly applied to the Rectified Flow model. In this article, we implement a complete acceleration pipeline spanning theory, design, and inference strategies. This pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation for the new methods to comprehensively accelerate models based on flow models. Currently, the existing public methods usually achieve an acceleration of 18%, while experiments show that our new method can accelerate the 512*512 image generation speed to up to 611%, which is far beyond the current non-generalized acceleration methods.
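As background for why the sampling loop is the target of acceleration, flow-based models generate a sample by integrating a learned velocity field from noise at t = 0 to data at t = 1. The sketch below shows plain fixed-step Euler integration with a toy velocity function standing in for the network. It is a generic illustration, not StreamFlow's pipeline; the name `euler_sample` and the list-based state are hypothetical.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler.

    `velocity` is a stand-in for a learned network: it takes the current
    state (a list of floats) and a time t, and returns a list of the same length.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Each step calls the network once, so wall-clock time scales with the number of steps and the cost per call, which is where batching and compilation can help.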

[14] MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis cs.CV | cs.AI

Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin

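TL;DR: MedEyes proposes a reinforcement learning framework with dynamic visual focus. It combines expert visual search trajectories with a dual-mode exploration strategy to mimic the progressive diagnostic process of clinicians, significantly improving performance on medical VQA tasks.

Details

Motivation: Accurate medical diagnosis usually requires progressive visual focusing and iterative reasoning, but under reinforcement learning existing vision-language models may reinforce reasoning paths that are superficially coherent yet clinically inaccurate.

Result: MedEyes improves average performance by 8.5% across multiple medical VQA benchmarks, showing its potential for building interpretable medical AI systems.

Insight: Combining dynamic visual focus with expert signals can significantly improve the accuracy and interpretability of medical diagnosis tasks.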

Abstract: Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks, validating MedEyes’s potential in building interpretable medical AI systems.
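The Confidence Value Sampler is described as using nucleus sampling. For readers unfamiliar with the term, the sketch below shows generic nucleus (top-p) sampling over a discrete distribution: keep the smallest set of highest-probability options whose cumulative mass reaches `top_p`, renormalize, and sample from that set. It is not the paper's CVS, which also includes adaptive termination; the function names are hypothetical.

```python
import random


def nucleus_filter(probs, top_p):
    """Return {index: renormalized probability} for the top-p nucleus."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Draw one index from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

Truncating the low-probability tail keeps sampled paths plausible while still allowing variety, which matches the stated goal of diverse yet credible exploration.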

[15] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models cs.CV

Zhenxiang Lin, Maryam Haghighat, Will Browne, Dimity Miller

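TL;DR: The paper proposes a training-free, post-hoc method for estimating uncertainty in vision-language models (VLMs). It detects erroneous predictions using intra-class probabilistic embeddings and significantly improves error detection.

Details

Motivation: Vision-language models such as CLIP perform well in open-vocabulary classification but also assign high confidence to misclassifications, which limits their safety. The paper aims to address this.

Result: The method achieves state-of-the-art error detection on multiple datasets (for example ImageNet and Flowers102), outperforming other deterministic and probabilistic VLM baselines.

Insight: Intra-class consistency is an effective measure of uncertainty, and the method remains effective when data are scarce (for example, only 10 images per class).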

Abstract: Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at https://github.com/zhenxianglin/ICPE.
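The idea of a class-specific probabilistic embedding can be illustrated with a much simplified sketch: fit a Gaussian to a handful of feature vectors from one class, then score a new feature by its log density under that Gaussian. This is not the authors' method. It omits the feature projection step, uses a diagonal covariance instead of a full multivariate Gaussian so that it runs on the standard library alone, and the function names and `eps` value are hypothetical.

```python
import math


def fit_diagonal_gaussian(features, eps=1e-6):
    """Per-dimension mean and variance of a list of equal-length feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # eps keeps the variance strictly positive for constant dimensions.
    var = [
        sum((f[d] - mean[d]) ** 2 for f in features) / n + eps
        for d in range(dim)
    ]
    return mean, var


def log_density(x, mean, var):
    """Log density of x under an axis-aligned Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )
```

Under this simplification, a test feature with low log density under the Gaussian of its predicted class is inconsistent with that class's examples and would be a candidate for flagging as a possible error.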

[16] Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation cs.CV

Joel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda, Radu Timofte

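TL;DR: The paper examines whether images can be segmented directly from audio, challenging existing methods that rely on text as an intermediate representation. The authors introduce a new audio-guided segmentation dataset and show that direct audio alignment outperforms transcription-based methods in some cases, especially under linguistic variability.

Details

Motivation: Current object grounding methods usually rely on text as an intermediate representation, which has weaknesses in efficiency and robustness. The authors explore whether audio can be aligned with vision directly, avoiding the problems introduced by transcription.

Result: Experiments show that direct audio alignment is not only feasible but also more robust to linguistic variability, and in some cases surpasses transcription-based methods.

Insight: The findings support renewed attention to direct audio grounding and point toward more efficient and robust multimodal systems.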

Abstract: Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely related audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.

[17] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model cs.CV

Jiayuan Du, Yiming Zhao, Zhenglong Guo, Yong Pan, Wenbo Hou

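TL;DR: The paper proposes SparseWorld-TC, a trajectory-conditioned sparse occupancy world model that directly predicts future 3D scene occupancy. It avoids the limitations of discretization and BEV representations in earlier methods and achieves leading performance with a transformer architecture.

Details

Motivation: Existing methods usually rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, or introduce explicit geometric priors through bird's eye view (BEV) projection. Both are limited in representational capacity and in capturing spatiotemporal dependencies.

Result: The model significantly outperforms existing methods on 1-3 second occupancy forecasting and shows a strong understanding of scene dynamics.

Insight: Combining a sparse occupancy representation with a transformer captures spatiotemporal dependencies more effectively and so improves forecasting accuracy.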

Abstract: This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

[18] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution cs.CV | cs.AI

Junoh Kang, Donghun Ryu, Bohyung Han

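TL;DR: The paper proposes image-conditioned manifold regularization (ICM-SR), which uses structural information such as colormaps and edges to improve the quality of real-world image super-resolution (Real-ISR) and avoids the limitations of the conventional text-conditioned manifold.

Details

Motivation: Existing methods usually rely on text-conditioned manifold regularization, which does not match the goal of Real-ISR (generating high-quality images directly tied to the low-quality inputs) and tends to cause color distortion and blurred edges. A more stable, image-conditioned manifold regularization is therefore needed.

Result: Experiments show that ICM-SR significantly improves super-resolution performance, especially perceptual quality, confirming its effectiveness in real-world settings.

Insight: Sparse structural information such as colormaps and edges provides a task-aligned regularization signal while avoiding the numerical instability of dense conditioning, offering a new approach to Real-ISR.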

Abstract: Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.

[19] TPCNet: Triple physical constraints for Low-light Image Enhancement cs.CV | physics.opticsPDF

Jing-Yi Shi, Ming-Fei Li, Ling-An Wu

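TL;DR: TPCNet is a low-light image enhancement method based on the Kubelka-Munk theory. It introduces a triple physical constraints (TPCs) theory that improves on existing Retinex-based methods, boosting generalization and performance.

Details

Motivation: Existing Retinex-based low-light enhancement methods ignore specular reflection and construct their physical constraints in image space, which limits generalization.

Result: Quantitative and qualitative experiments on 10 datasets show that TPCNet outperforms existing methods in both performance metrics and visual quality.

Insight: For low-light image enhancement, accounting for specular reflection and imposing physical constraints in feature space can markedly improve model performance.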

Abstract: Low-light image enhancement is an essential computer vision task to improve image contrast and to decrease the effects of color bias and noise. Many existing interpretable deep-learning algorithms exploit the Retinex theory as the basis of model design. However, previous Retinex-based algorithms, which treat reflecting objects as ideal Lambertian surfaces, ignore specular reflection in the modeling process and construct the physical constraints in image space, limiting the generalization of the model. To address this issue, we preserve the specular reflection coefficient and reformulate the original physical constraints in the imaging process based on the Kubelka-Munk theory, thereby constructing a constraint relationship between illumination, reflection, and detection, the so-called triple physical constraints (TPCs) theory. Based on this theory, the physical constraints are constructed in the feature space of the model to obtain the TPC network (TPCNet). Comprehensive quantitative and qualitative benchmark and ablation experiments confirm that these constraints effectively improve the performance metrics and visual quality without introducing new parameters, and demonstrate that our TPCNet outperforms other state-of-the-art methods on 10 datasets.

[20] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model cs.CV | cs.MMPDF

Jing Hao, Yuci Liang, Lizhuo Lin, Yuxuan Fan, Wenkai Zhou

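TL;DR: OralGPT-Omni is the first multimodal large language model (MLLM) specialized for dentistry. It targets data scarcity and reliability challenges in the dental domain, and improves performance with the TRACE-CoT dataset and a four-stage training paradigm.

Details

Motivation: Dentistry remains underexplored in MLLM research, mainly because of limited domain data, scarce expert annotations, insufficient modality-specific modeling, and reliability concerns.

Result: OralGPT-Omni scores 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, substantially outperforming GPT-5.

Insight: A clinically grounded dataset combined with staged training can substantially improve an MLLM's understanding and analysis of dental images, advancing intelligent dentistry.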

Abstract: Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists’ diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists’ decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model’s capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.

[21] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation cs.CVPDF

Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou

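TL;DR: WorldWander is a framework for seamless translation between first-person (egocentric) and third-person (exocentric) views in video generation. It builds on video diffusion models and introduces new components that improve cross-view synchronization and character consistency.

Details

Motivation: Seamless translation between perspectives such as first-person and third-person remains underexplored in video generation, even though it matters for filmmaking, embodied AI, and world models.

Result: Experiments show that WorldWander performs strongly on perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

Insight: Perspective translation is an important but understudied problem in video generation; WorldWander's results suggest that combining in-context perspective alignment with collaborative position encoding addresses it effectively.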

Abstract: Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma

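TL;DR: MoE3D integrates Mixture of Experts (MoE) into a multi-modal 3D understanding framework, addressing the difficulty a single fusion network has with heterogeneity and complexity across modalities, and significantly improving performance.

Details

Motivation: Conventional multi-modal fusion methods use a single dense network that struggles with the heterogeneity and complexity across modalities, leading to suboptimal performance. MoE3D aims to improve on this with expert networks and cross-modal interaction.

Result: MoE3D surpasses the best prior method by 6.1 mIoU on Multi3DRefer and performs competitively on the other tasks.

Insight: With expert networks and modality-specific processing, MoE3D makes effective use of complementary multi-modal information and significantly improves performance on 3D understanding tasks.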

Abstract: Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, which struggles to handle the significant heterogeneity and complexity across modalities and leads to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core idea is to deploy a set of specialized “expert” networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is introduced to further enhance fusion performance. Top-1 gating is employed so that one expert from the expert groups processes the features, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage semantic and 2D priors, thus giving the network a good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, our MoE3D surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.
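As a rough illustration of the top-1 gating mentioned in the abstract, the sketch below routes a feature vector to the single expert with the highest gate probability. The linear gate, the toy experts, and the function names are assumptions made for this example; the abstract does not specify MoE3D's gate or expert architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token, gate_weights, experts):
    """Send one feature vector to the single highest-scoring expert.

    gate_weights: one weight vector per expert (a hypothetical linear gate).
    experts: list of callables, each mapping a vector to a vector.
    Returns (expert_index, output); the output is scaled by the gate probability.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    out = [probs[idx] * y for y in experts[idx](token)]
    return idx, out


# Two toy experts standing in for modality-specific networks.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
chosen, output = top1_route([0.2, 0.9], gate_weights, experts)
```

Because only one expert runs per input, compute per feature stays close to that of a single network regardless of how many experts exist, which is the efficiency argument the abstract makes.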

[23] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization cs.CVPDF

Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva

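TL;DR: PROMPTMINER is a black-box prompt stealing framework that uses reinforcement learning and fuzz optimization to recover high-quality prompts from images generated by text-to-image models.

Details

Motivation: High-quality prompts are valuable assets for text-to-image generative models, but they are exposed to security and intellectual-property risks, notably prompt stealing attacks. Existing methods typically require white-box access or large-scale labeled data, which limits their practicality.

Result: PROMPTMINER performs strongly across multiple datasets and generative models, reaching CLIP similarity up to 0.958 and SBERT textual alignment up to 0.751. On in-the-wild images it outperforms the strongest baseline by 7.5% in CLIP similarity.

Insight: The work shows that prompts can be recovered efficiently under black-box conditions and that the attack remains robust to defensive perturbations, offering new directions for data attribution and intellectual-property protection.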

Abstract: Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: https://github.com/aaFrostnova/PromptMiner
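The second phase described above, a fuzzing-driven search over stylistic modifiers, can be sketched as a mutate-and-keep-if-better loop against a black-box score. This is only a schematic under loud assumptions: `toy_score` is a stand-in (Jaccard overlap with a hidden modifier set) for what would in practice be an image-similarity score such as CLIP similarity, and the vocabulary, mutation rule, and function names are invented for the example rather than taken from PROMPTMINER.

```python
import random


def fuzz_modifiers(subject, vocabulary, score, iterations=200, seed=0):
    """Randomly toggle modifiers and keep a mutation only if the score improves.

    score(subject, modifiers) is treated as a black box: no gradients are used.
    """
    rng = random.Random(seed)
    best = []
    best_score = score(subject, best)
    for _ in range(iterations):
        candidate = list(best)
        word = rng.choice(vocabulary)
        if word in candidate:
            candidate.remove(word)   # mutation: drop an existing modifier
        else:
            candidate.append(word)   # mutation: add a new modifier
        s = score(subject, candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score


# Hypothetical target: the modifiers of the prompt we pretend generated the image.
HIDDEN = {"oil painting", "golden hour"}
VOCAB = ["oil painting", "golden hour", "pixel art", "8k", "watercolor"]


def toy_score(subject, modifiers):
    """Stand-in for an image-similarity oracle: Jaccard overlap with HIDDEN."""
    chosen = set(modifiers)
    return len(chosen & HIDDEN) / len(chosen | HIDDEN)


recovered, final_score = fuzz_modifiers("a cat", VOCAB, toy_score)
```

In a real attack each call to the score would require generating an image from the candidate prompt and comparing it with the target, so the number of iterations is the main cost to budget.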

[24] GoPrune: Accelerated Structured Pruning with $\ell_{2,p}$-Norm Optimization cs.CVPDF

Li Xu, Xianchao Xiu

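TL;DR: GoPrune is an accelerated structured pruning method based on $\ell_{2,p}$-norm optimization. It addresses the low computational efficiency of existing pruning methods and shows superior performance on the CIFAR datasets.

Details

Motivation: The computational and storage costs of convolutional neural networks (CNNs) rise sharply with depth, limiting deployment on edge devices. Structured pruning is an effective compression approach, but existing methods suffer from low computational efficiency and limited applicability.

Result: Experiments with ResNet and VGG models on the CIFAR datasets show that GoPrune outperforms existing methods in pruning performance.

Insight: By combining the $\ell_{2,p}$-norm with the PAM algorithm, GoPrune significantly improves the efficiency and applicability of structured pruning.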

Abstract: Convolutional neural networks (CNNs) suffer from rapidly increasing storage and computational costs as their depth grows, which severely hinders their deployment on resource-constrained edge devices. Pruning is a practical approach for network compression, among which structured pruning is the most effective for inference acceleration. Although existing work has applied the $\ell_p$-norm to pruning, it only considers unstructured pruning with $p\in (0, 1)$ and has low computational efficiency. To overcome these limitations, we propose an accelerated structured pruning method called GoPrune. Our method employs the $\ell_{2,p}$-norm for sparse network learning, where the value of $p$ is extended to $[0, 1)$. Moreover, we develop an efficient optimization algorithm based on the proximal alternating minimization (PAM), and the resulting subproblems enjoy closed-form solutions, thus improving compression efficiency. Experiments on the CIFAR datasets using ResNet and VGG models demonstrate the superior performance of the proposed method in network pruning. Our code is available at https://github.com/xianchaoxiu/GoPrune.
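To make the group-sparsity idea concrete, the sketch below computes the penalty $\sum_i \|w_i\|_2^p$ over the rows of a weight matrix and shows one subproblem that has a closed-form solution: the $p = 0$ case, where the minimiser of $\tfrac{1}{2}\|X - W\|_F^2 + \lambda \cdot (\text{number of non-zero rows of } X)$ keeps a row only if $\tfrac{1}{2}\|w_i\|_2^2 > \lambda$. Treating rows as the pruning groups, the choice of the $p = 0$ case, and the function names are assumptions for illustration; the abstract does not give GoPrune's actual update rules.

```python
import math


def l2p_penalty(rows, p):
    """Return sum_i ||w_i||_2^p over the rows (groups) of a weight matrix.

    For p = 0 this counts the rows that are not entirely zero.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    if p == 0:
        return float(sum(1 for n in norms if n > 0.0))
    return sum(n ** p for n in norms)


def group_hard_threshold(rows, lam):
    """Closed-form minimiser of 0.5*||X - W||_F^2 + lam * (#non-zero rows of X).

    Keeping a row costs lam; zeroing it costs 0.5*||w_i||^2, so a row
    survives only when 0.5*||w_i||^2 > lam.
    """
    out = []
    for row in rows:
        sq = sum(x * x for x in row)
        out.append(list(row) if 0.5 * sq > lam else [0.0] * len(row))
    return out


# Each row plays the role of one prunable structure (for example, a filter).
W = [[3.0, 4.0], [0.1, 0.0], [0.0, 0.0]]
pruned = group_hard_threshold(W, lam=0.5)
remaining_groups = l2p_penalty(pruned, 0)
```

Zeroing whole rows rather than individual weights is what makes the pruning structured: an all-zero group can be removed from the network outright.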

[25] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation cs.CVPDF

Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg

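TL;DR: Cue3D is a model-agnostic framework for quantifying the role of individual image cues in single-image 3D generation. It reveals that geometric cues such as shading are critical to 3D generation.

Details

Motivation: It is unclear which classical visual cues (shading, texture, silhouette, and so on) deep learning models actually exploit in single-image 3D generation; this work addresses that gap.

Result: The study finds that shape meaningfulness, not texture, dictates generalization; that geometric cues such as shading are crucial for 3D generation; and that models over-rely on provided silhouettes.

Insight: Cue3D reveals how modern 3D networks use classical vision cues, pointing toward more transparent, robust, and controllable single-image 3D generation models.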

Abstract: Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

[26] GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models cs.CVPDF

Bin Wang, Ruotong Hu, Wenqian Wang, Wentong Li, Mingliang Gao

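TL;DR: The paper proposes the GA2-CLIP framework, which introduces externally supervised prompts and generic attribute anchors to improve the generalization of video-language models on video tasks, addressing the semantic space narrowing that fine-tuning causes in existing methods.

Details

Motivation: Existing visual and textual soft prompt tuning methods tend to degrade generalization to unseen classes on video tasks. Prior work mitigates this by regularizing the gap between hand-crafted prompts and soft prompts, but that weakens the learning ability of the soft prompts.

Result: Experiments on video tasks show that the method significantly outperforms existing prompt tuning methods on generalization benchmarks, particularly on base-to-new class prediction.

Insight: Competitive prompting together with generic attribute anchors can effectively prevent the semantic space from overfitting to the supervised categories, preserving generalization.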

Abstract: Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model’s generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.

[27] DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action cs.CV | cs.ROPDF

Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng

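TL;DR: DualVLA builds a generalizable embodied agent by partially decoupling reasoning and action. It addresses the degradation of action performance that occurs during generalization and proposes a new evaluation metric, VLA Score.

Details

Motivation: When existing generalist Vision-Language-Action (VLA) models are extended with reasoning capabilities, their action performance often degrades (action degeneration), which hurts practical use.

Result: DualVLA achieves an average success rate of 61.0% in SimplerEnv and an average score of 65.4 across 8 multimodal benchmarks, demonstrating a balance between action execution and multimodal understanding.

Insight: Decoupling reasoning from action is key: it preserves multimodal reasoning ability while avoiding action degeneration.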

Abstract: To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.

[28] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation cs.CVPDF

Yanchao Zhao, Jihao Zhu, Yu Liu, Weizhuo Chen, Yuling Yang

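TL;DR: EASL is a multi-emotion-guided sign language generation architecture. Using emotion-semantic disentanglement modules and progressive training, it extracts semantic and emotional features separately and generates emotionally expressive sign language videos, improving both accuracy and expressiveness.

Details

Motivation: Existing LLM-based sign language generation methods focus on semantic accuracy and neglect emotional expression, so the generated videos lack naturalness and expressiveness. EASL aims to fill this gap and give the Deaf community a more emotionally expressive communication tool.

Result: Experiments show that EASL achieves better pose accuracy than all baselines, generates more expressive sign language videos by integrating multi-emotion information, and adapts effectively to diffusion models.

Insight: Emotion plays a key role in sign language generation; disentangling emotional and semantic features and combining this with progressive training can markedly improve the naturalness and expressiveness of generated videos.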

Abstract: Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.

[29] IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer cs.CV | cs.AIPDF

Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng

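TL;DR: IMTalker is an efficient audio-driven talking face generation framework that replaces conventional optical-flow warping with implicit motion transfer, achieving high-fidelity and efficient synthesis.

Details

Motivation: Existing methods rely on explicit optical flow and local warping, which struggle to model complex global motion and tend to cause identity drift. IMTalker aims to solve these problems.

Result: IMTalker surpasses existing methods in motion accuracy, identity preservation, and audio-lip synchronization, running at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU.

Insight: Implicit motion transfer and identity disentanglement are key to high-quality talking face synthesis, and a lightweight design makes real-time generation possible.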

Abstract: Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.

[30] Partially Shared Concept Bottleneck Models cs.CVPDF

Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, Jun Yu

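TL;DR: PS-CBM addresses poor visual grounding, concept redundancy, and the lack of a metric balancing accuracy and compactness in concept bottleneck models (CBMs) through multimodal concept generation, a partially shared concept strategy, and a new metric, CEA, significantly improving classification accuracy and concept efficiency.

Details

Motivation: Concept generation in CBMs suffers from weak visual grounding, many redundant concepts, and the absence of a metric that balances accuracy against compactness.

Result: Across 11 datasets, PS-CBM improves classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5% while requiring fewer concepts.

Insight: Combining semantic and visual cues, sharing concepts dynamically, and adopting a new metric are key to efficient interpretability.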

Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM’s effectiveness in achieving both high accuracy and strong interpretability.

[31] Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning cs.CVPDF

Zhaoyang Wei, Wenchao Ding, Yanchao Hao, Xi Chen

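TL;DR: GRiP is a novel two-stage training framework that achieves robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways, significantly improving performance on complex visual tasks.

Details

Motivation: Current approaches to visual grounded reasoning are caught between instability (end-to-end reinforcement learning) and rigidity (supervised fine-tuning), making it hard to obtain both learnability and cognitive flexibility. GRiP aims to resolve this dilemma.

Result: GRiP achieves state-of-the-art results among open-source models on challenging benchmarks such as TreeBench and V* Bench, demonstrating its effectiveness in complex visual reasoning.

Insight: Guiding models with cognitively inspired signals (directing attention to key objects and rewarding diverse reasoning paths) is key to advancing multimodal intelligence.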

Abstract: Models capable of “thinking with images” by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model’s perceptual focus and logical pathways. GRiP’s core lies in its cognitive-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, proving its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively-inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. The code will be made publicly available.
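One way to picture a salience-weighted IoU reward is as a weighted average of per-object IoU scores, so that missing a mission-critical object costs more than missing a distractor. The abstract does not give GRiP's formula, so the normalised weighted mean below, the salience values, and the function names are all assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def salience_weighted_iou(preds, targets, salience):
    """Hypothetical reward: per-object IoU averaged with salience weights.

    preds[i] and targets[i] are the predicted and reference boxes of object i;
    salience[i] >= 0 says how much that object matters for the task.
    """
    total = sum(salience)
    if total <= 0:
        return 0.0
    weighted = sum(s * box_iou(p, t) for p, t, s in zip(preds, targets, salience))
    return weighted / total


# The critical object (weight 3) is localized exactly; the distractor (weight 1) is missed.
reward = salience_weighted_iou(
    preds=[(0, 0, 2, 2), (0, 0, 2, 2)],
    targets=[(0, 0, 2, 2), (5, 5, 6, 6)],
    salience=[3.0, 1.0],
)
```

With uniform weights the same predictions would score 0.5; the weighting raises the reward because the error falls on the low-salience object.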

[32] Enhanced Graph Convolutional Network with Chebyshev Spectral Graph and Graph Attention for Autism Spectrum Disorder Classification cs.CV | cs.AIPDF

Adnan Ferdous Ashrafi, Hasanul Kabir

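TL;DR: The paper proposes a graph convolutional network (GCN) model that combines Chebyshev spectral graph convolution with graph attention networks (GAT) to improve classification accuracy for autism spectrum disorder (ASD). The model processes multimodal data with a multi-branch architecture and uses Chebyshev polynomial filters and GAT layers, reaching 74.82% classification accuracy and an AUC of 0.82 on the ABIDE I dataset.

Details

Motivation: ASD symptoms are diverse and diagnosis is complex, so an early, objective diagnostic method is needed. Multimodal neuroimaging data (such as rs-fMRI and sMRI) carry rich information for ASD classification, but conventional methods struggle to exploit the graph-structured relationships in these data.

Result: The model achieves 74.82% classification accuracy and an AUC of 0.82 on the ABIDE I dataset, outperforming baselines including conventional GCNs, autoencoder-based deep networks, and multimodal CNNs.

Insight: 1. Encoding multimodal data as a graph effectively captures complex relationships between individuals. 2. Combining Chebyshev filters with GAT improves performance while reducing complexity. 3. The framework may generalize to classifying other neurodevelopmental disorders.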

Abstract: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder marked by variation in symptom presentation and neurological underpinnings, which makes early and objective diagnosis extremely difficult. This paper presents a Graph Convolutional Network (GCN) model, incorporating Chebyshev Spectral Graph Convolution and Graph Attention Networks (GAT), to increase the classification accuracy of ASD using multimodal neuroimaging and phenotypic data. Using the ABIDE I dataset, which contains resting-state functional MRI (rs-fMRI), structural MRI (sMRI), and phenotypic variables from 870 patients, the model adopts a multi-branch architecture that processes each modality individually before merging them via concatenation. Graph structure is encoded using site-based similarity to generate a population graph, which helps capture relationships across individuals. Chebyshev polynomial filters provide localized spectral learning with lower computational complexity, whereas GAT layers enrich node representations through attention-weighted aggregation of neighboring information. The proposed model is trained using stratified five-fold cross-validation with a total input dimension of 5,206 features per individual. Extensive experiments demonstrate the enhanced model’s superiority, achieving a test accuracy of 74.82% and an AUC of 0.82 on the entire dataset, surpassing multiple state-of-the-art baselines, including conventional GCNs, autoencoder-based deep neural networks, and multimodal CNNs.
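The Chebyshev polynomial filters mentioned above are usually evaluated with the three-term recurrence $T_0(\tilde{L})x = x$, $T_1(\tilde{L})x = \tilde{L}x$, $T_k(\tilde{L})x = 2\tilde{L}\,T_{k-1}(\tilde{L})x - T_{k-2}(\tilde{L})x$, which avoids any eigendecomposition. The sketch below applies $y = \sum_k \theta_k T_k(\tilde{L})x$ to a single scalar signal on a toy two-node graph. It assumes $\tilde{L}$ is already the rescaled Laplacian with spectrum in $[-1, 1]$; the graph, coefficients, and function names are illustrative and not taken from the paper.

```python
def matvec(M, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def chebyshev_filter(L_scaled, x, theta):
    """Apply y = sum_k theta[k] * T_k(L_scaled) x via the Chebyshev recurrence.

    L_scaled: rescaled graph Laplacian (assumed spectrum in [-1, 1]).
    x: one scalar feature per node.
    theta: filter coefficients; len(theta) - 1 is the polynomial order K.
    """
    t_prev = list(x)                                  # T_0 x = x
    y = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return y
    t_curr = matvec(L_scaled, x)                      # T_1 x = L x
    y = [a + theta[1] * b for a, b in zip(y, t_curr)]
    for k in range(2, len(theta)):
        lt = matvec(L_scaled, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]   # T_k = 2 L T_{k-1} - T_{k-2}
        y = [a + theta[k] * b for a, b in zip(y, t_next)]
        t_prev, t_curr = t_curr, t_next
    return y


# Two connected nodes: normalized Laplacian [[1, -1], [-1, 1]] has lambda_max = 2,
# so the rescaled Laplacian 2L/lambda_max - I is [[0, -1], [-1, 0]].
L_tilde = [[0.0, -1.0], [-1.0, 0.0]]
filtered = chebyshev_filter(L_tilde, [1.0, 0.0], theta=[1.0, 1.0, 1.0])
```

An order-K filter only mixes information from nodes at most K hops apart, which is the sense in which the abstract calls the spectral learning localized.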

[33] MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction cs.CV | cs.AI | cs.ROPDF

Maitrayee Keskar, Mohan Trivedi, Ross Greer

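TL;DR: MTR-VP is an end-to-end trajectory planning method that combines motion and visual features through image encoding and multiple trajectory prediction. Experiments show that predicting multiple trajectories improves performance, but fusing visual and motion features remains challenging.

Details

Motivation: Trajectory planning for autonomous driving needs both visual scene information and the vehicle's motion state, yet existing methods depend on map features. MTR-VP aims to replace map features with learned visual representations to improve end-to-end planning.

Result: Experiments on the Waymo End-to-End Driving Dataset show that multiple trajectory prediction outperforms single trajectory prediction, while fusing visual and motion features yields limited benefit.

Insight: The combination of visual and motion features still needs improvement, and multimodal prediction is a key direction for better end-to-end planning.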

Abstract: We present a method for trajectory planning for autonomous driving that learns image-based context embeddings aligned with motion prediction frameworks and planning-based intention input. In our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs. We name our approach MTR-VP (Motion Transformer for Vision-based Planning); instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our method on the Waymo End-to-End Driving Dataset, which requires predicting the agent’s future 5-second trajectory in bird’s-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. Our results suggest that transformer-based methods for combining visual features with kinematic features, such as past trajectory features, are not effective at fusing the two modes into useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2. However, predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.

[34] Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation cs.CVPDF

Daniel Sungho Jung, Kyoung Mu Lee

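TL;DR: This paper proposes FECO, a shoe style-invariant and ground-aware learning framework for estimating dense foot contact from a single RGB image, addressing the challenges of diverse shoe appearance and monotonous ground appearance.

Details

Motivation: Foot contact is critical to human movement and physical interaction, but existing methods mostly rely on a zero-velocity constraint or joint-level contact and fail to capture detailed foot-ground interaction. Dense foot contact estimation is key to modeling this interaction, yet remains underexplored.

Result: The proposed FECO framework achieves robust foot contact estimation regardless of shoe appearance and makes effective use of ground information.

Insight: Shoe style invariance and well-designed use of ground information significantly improve dense foot contact estimation.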

Abstract: Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges in learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, the ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.

[35] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving cs.CV | cs.ROPDF

Qiang Li, Yingwenqi Jiang, Tuoxi Li, Duyu Chen, Xiang Feng

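TL;DR: The paper proposes HybridWorldSim, a scalable and controllable high-fidelity simulator for autonomous driving that combines multi-traversal neural reconstruction of static backgrounds with generative modeling of dynamic agents. It addresses the weaknesses of existing methods under viewpoint changes and in geometric consistency, and is validated with a new dataset, MIRROR, and experiments.

Details

Motivation: Existing autonomous driving simulators struggle to support novel view synthesis under large viewpoint changes and lack geometric consistency, limiting the realism and controllability of simulation.

Result: Experiments show that HybridWorldSim surpasses existing methods in visual generation and geometric consistency, providing a more practical high-fidelity simulation solution.

Insight: Combining neural reconstruction with generative modeling can effectively address consistency across dynamic and static scene content, and the multi-traversal dataset provides a richer benchmark for autonomous driving research.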

Abstract: Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.

[36] ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition cs.CV | cs.AIPDF

Yan Li, Yong Zhao, Xiaohan Xia, Dongmei Jiang

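TL;DR: The paper proposes ARPGNet, which combines facial appearance and region-relation information and uses a parallel graph attention fusion module to enhance spatial-temporal representations, performing strongly on facial expression recognition.

Details

Motivation: Existing facial expression recognition methods mainly rely on pre-trained CNNs to learn facial appearance representations and overlook the relationships between facial regions, which limits their ability to model expression dynamics.

Result: ARPGNet outperforms or is comparable to state-of-the-art methods on three facial expression recognition datasets.

Insight: Relationship information between facial regions can significantly strengthen representations for expression recognition, and fusing the appearance and relation representations is key.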

Abstract: The key to facial expression recognition is to learn discriminative spatial-temporal representations that embed facial expression dynamics. Previous studies predominantly rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations, overlooking the relationships between facial regions. To address this issue, this paper presents an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) to learn mutually enhanced spatial-temporal representations of appearance and relation information. Specifically, we construct a facial region relation graph and leverage the graph attention mechanism to model the relationships between facial regions. The resulting relational representation sequences, along with CNN-based appearance representation sequences, are then fed into a parallel graph attention fusion module for mutual interaction and enhancement. This module simultaneously explores the complementarity between different representation sequences and the temporal dynamics within each sequence. Experimental results on three facial expression recognition datasets demonstrate that the proposed ARPGNet outperforms or is comparable to state-of-the-art methods.

[37] Controllable 3D Object Generation with Single Image Prompt cs.CVPDF

Jaeseok Lee, Jaekoo Lee

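TL;DR: This paper proposes two methods to address the lack of control in textual-inversion-based 3D object generation: an off-the-shelf image adapter and a depth-conditioned warmup strategy, yielding better 3D consistency and controllability.

Details

Motivation: Existing 3D object generation methods rely mainly on textual inversion, which requires additional training time and lacks controllability. The paper aims to improve the controllability and efficiency of the generation process.

Result: Experiments show qualitative and quantitative performance comparable to existing textual-inversion-based methods with improved 3D consistency, and a user study favors the proposed method.

Insight: Combining an image adapter with a depth-conditioned warmup strategy can improve the efficiency and controllability of 3D object generation while maintaining high-quality results.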

Abstract: Diffusion models have recently demonstrated impressive generative capabilities, producing images with remarkable fidelity. In particular, existing methods for 3D object generation, one of the fastest-growing areas in computer vision, predominantly use text-to-image diffusion models with textual inversion, which trains a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of the target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and lacks controllability. To tackle these issues, we propose two methods: (1) using an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text; and (2) a depth-conditioned warmup strategy to enhance 3D consistency. In experiments, our method shows qualitatively and quantitatively comparable performance to, and improved 3D consistency over, the existing textual-inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available at the GitHub repository: https://github.com/Seooooooogi/Control3D_IP/

[38] 3D-Consistent Multi-View Editing by Diffusion Guidance cs.CV | cs.AI | cs.LGPDF

Josef Bengtson, David Nilsson, Dong In Lee, Fredrik Kahl

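TL;DR: Proposes a training-free diffusion framework that uses a consistency loss to enforce geometric and photometric consistency when editing multi-view images, significantly improving the coherence of 3D edits.

Details

Motivation: Existing text-based image editing methods often produce geometrically and photometrically inconsistent results in multi-view editing, which is especially problematic for 3D representations such as NeRFs or Gaussian Splat models.

Result: Experiments show that the method significantly improves 3D consistency in multi-view editing and enables high-quality Gaussian Splat editing with sharp details and strong fidelity to the user's text prompts.

Insight: The consistency loss offers a flexible solution for multi-view editing that applies to multiple 3D representations and requires no additional training.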

Abstract: Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: https://3d-consistent-editing.github.io/
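The key assumption in the abstract, that corresponding points should undergo similar transformations after editing, can be illustrated with a toy loss. The sketch below is only one plausible reading: the function name `consistency_loss`, the dictionary image format, and the squared-difference penalty are assumptions made for illustration, not the authors' formulation.

```python
def consistency_loss(orig_a, edit_a, orig_b, edit_b, matches):
    """Mean squared difference between the edits applied at matched pixels.

    Images are dicts mapping (y, x) -> (r, g, b). `matches` pairs a pixel in
    view A with its corresponding pixel in view B (hypothetical format).
    """
    total = 0.0
    for pa, pb in matches:
        # Change introduced by the edit at each of the two matched pixels.
        delta_a = [e - o for e, o in zip(edit_a[pa], orig_a[pa])]
        delta_b = [e - o for e, o in zip(edit_b[pb], orig_b[pb])]
        # Penalise the two views for changing the same point differently.
        total += sum((x - y) ** 2 for x, y in zip(delta_a, delta_b))
    return total / len(matches)


# Toy example: one matched point seen in two views.
orig_a = {(0, 0): (0.25, 0.25, 0.25)}
orig_b = {(0, 1): (0.0, 0.0, 0.0)}
edit_a = {(0, 0): (0.75, 0.25, 0.25)}          # red channel raised by 0.5
edit_b_same = {(0, 1): (0.5, 0.0, 0.0)}        # same change in view B
edit_b_diff = {(0, 1): (0.0, 0.0, 0.0)}        # view B left unedited
matches = [((0, 0), (0, 1))]

loss_same = consistency_loss(orig_a, edit_a, orig_b, edit_b_same, matches)
loss_diff = consistency_loss(orig_a, edit_a, orig_b, edit_b_diff, matches)
```

In a guidance setting, the gradient of such a loss with respect to the intermediate sample would steer sampling toward coherent edits; that step is omitted here.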

Zhen Chen, Yihang Fu, Gabriel Madera, Mauro Giuffre, Serina Applebaum

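TL;DR: The paper proposes M3LLM, a new multimodal large language model that uses compound figures from biomedical literature to address the shortage of training data for medical multi-image understanding, and performs well on multi-image, single-image, and multimodal tasks.

Details

Motivation: Current multimodal large language models are largely confined to single-image understanding and cannot meet the need in clinical workflows for joint analysis of multiple images (e.g., different modalities or time points). The lack of large-scale, high-quality annotated data is the main obstacle to developing such models.

Result: M3LLM outperforms both general-purpose and specialized medical MLLMs on multi-image, single-image, text-only, and multi-choice tasks, and generalizes strongly to longitudinal chest X-ray analysis on the MIMIC dataset.

Insight: Compound figures in biomedical literature are a rich but underused data source. With a divide-and-conquer, context-aware approach, they can be used to train multi-image understanding models and advance clinical applications.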

Abstract: Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.

[40] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation cs.CVPDF

Dengbo Chen, Ziwei Zhao, Kexin Zhang, Shishuang Zhao, Junjie Hou

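TL;DR: UMind-VL is a generalist ultrasound vision-language model that combines low-level ultrasound perception tasks (e.g., segmentation, localization) with high-level clinical reasoning (e.g., diagnosis), unifying multiple tasks through a dynamic convolutional mask decoder and task-specific tokens.

Details

Motivation: Existing medical foundation models lack a comprehensive solution for ultrasound that handles both low-level perception tasks and high-level clinical reasoning tasks.

Result: UMind-VL performs strongly across tasks, outperforming existing generalist models and matching or exceeding specialist models on some tasks, while maintaining strong generalization.

Insight: By combining low-level perception and high-level reasoning in one framework, UMind-VL shows the potential of medical multimodal models, particularly for ultrasound.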

Abstract: Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.

[41] DriveVGGT: Visual Geometry Transformer for Autonomous Driving cs.CVPDF

Xiaosong Jia, Yanhao Liu, Junqi You, Renqiu Xia, Yu Hong

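TL;DR: DriveVGGT is a visual geometry transformer framework designed for autonomous driving. It introduces Temporal Video Attention and Multi-camera Consistency Attention modules and incorporates driving-specific priors to improve 4D reconstruction.

Details

Motivation: Applying VGGT directly to autonomous driving systems gives sub-optimal results because it ignores driving-specific priors such as the sensor setup, known camera parameters, and fixed relative camera positions.

Result: Outperforms VGGT and other variants on an autonomous driving dataset; ablation studies verify the effectiveness of the module designs.

Insight: Designing modules around task-specific priors can significantly improve performance, and combining TVA with MCA offers a new approach to processing multi-camera temporal data.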

Abstract: Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the different priors between the two tasks. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at a low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) Relative positions of all cameras remain fixed even though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on an autonomous driving dataset, while extensive ablation studies verify the effectiveness of the proposed designs.
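The restriction that each token attends only to nearby frames can be pictured as a boolean attention mask. The sketch below is a simplified illustration under stated assumptions: one token per (camera, frame) pair, a hypothetical `window` parameter, and no relative pose embeddings. It is not the MCA module itself.

```python
def window_attention_mask(num_cams, num_frames, window=1):
    """Build a mask where token (cam a, frame s) may attend to token
    (cam b, frame t) only if |s - t| <= window, for any pair of cameras.

    One token per (camera, frame) is a simplification for illustration.
    """
    tokens = [(c, f) for c in range(num_cams) for f in range(num_frames)]
    mask = [
        [abs(f_query - f_key) <= window for (_, f_key) in tokens]
        for (_, f_query) in tokens
    ]
    return tokens, mask


# Two cameras, four frames each, attending one frame forwards and backwards.
tokens, mask = window_attention_mask(num_cams=2, num_frames=4, window=1)
```

With this layout, the token for camera 0 at frame 0 can attend to frames 0 and 1 of both cameras, so cross-camera consistency is still possible while distant frames are excluded.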

[42] The Collapse of Patches cs.CVPDF

Wei Guo, Shunqi Mao, Zhuonan Liang, Heng Wang, Weidong Cai

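TL;DR: The paper introduces a phenomenon called "patch collapse": it learns dependencies between image patches with an autoencoder and orders patches by PageRank to improve image modeling, making autoregressive image generation and image classification more efficient.

Details

Motivation: Inspired by wave function collapse in quantum mechanics, the paper observes that realizing some image patches lowers the distribution entropy of the others, and proposes the concept of patch collapse to improve the efficiency of image modeling.

Result: 1. Improves the generation performance of the MAR model; 2. Achieves high classification accuracy using only 22% of the high-rank image patches.

Insight: Dependencies between image patches can strongly affect modeling efficiency, and patch collapse offers a new perspective for improving performance on vision tasks.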

Abstract: Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle’s wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region’s collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies for each patch’s PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at https://github.com/wguo-ai/CoP .
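To make the ranking step concrete, the sketch below runs plain PageRank power iteration over a small hand-made patch dependency matrix and sorts patches by score. The matrix values, the damping factor, and the function name `pagerank_order` are illustrative assumptions. In the paper the dependencies are learned by an autoencoder, which is not reproduced here.

```python
def pagerank_order(dep, damping=0.85, iters=100):
    """Rank patches by PageRank over a dependency matrix.

    dep[i][j] > 0 means patch i relies on patch j for its reconstruction
    (hypothetical weights chosen by hand for this example).
    """
    n = len(dep)
    # Row-normalise so each patch distributes one unit of reliance.
    trans = []
    for row in dep:
        total = sum(row)
        trans.append([w / total if total > 0 else 1.0 / n for w in row])
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                new[j] += damping * rank[i] * trans[i][j]
        rank = new
    # Highest score first: the patches most relied on by the others.
    order = sorted(range(n), key=lambda k: rank[k], reverse=True)
    return order, rank


# Toy example: patches 1, 2 and 3 all rely mostly on patch 0.
dep = [
    [0.0, 1.0, 1.0, 1.0],
    [3.0, 0.0, 0.5, 0.5],
    [3.0, 0.5, 0.0, 0.5],
    [3.0, 0.5, 0.5, 0.0],
]
order, rank = pagerank_order(dep)
```

Here patch 0 comes first in `order`, matching the intuition that the most relied-on patch should be realized earliest.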

[43] Match-and-Fuse: Consistent Generation from Unstructured Image Sets cs.CVPDF

Kate Feingold, Omri Kaduri, Tali Dekel

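TL;DR: Match-and-Fuse is a zero-shot, training-free method for consistent controlled generation from unstructured image sets. It models the task as a graph that brings all pairwise image generations into one framework, ensuring local consistency and global coherence.

Details

Motivation: Existing methods mainly handle individual images or densely sampled videos. Match-and-Fuse targets generation for unstructured image sets (which share visual elements but differ in viewpoint, capture time, and so on) and offers a more flexible solution.

Result: Match-and-Fuse achieves SOTA consistency and visual quality and unlocks new capabilities for content creation from image collections.

Insight: Text-to-image models carry an emergent prior that encourages coherent generation when multiple views share a single canvas, which suggests a new approach to generation over image sets.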

Abstract: We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. It also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.

[44] Structure is Supervision: Multiview Masked Autoencoders for Radiology cs.CV | cs.LGPDF

Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci

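TL;DR: The paper proposes the Multiview Masked Autoencoder (MVMAE) and its extension MVMAE-V2T, which exploit the natural multi-view structure of radiology studies together with text reports to improve representation learning for medical images, outperforming existing baselines on disease classification.

Details

Motivation: Medical data has rich intrinsic structure (such as multi-view imaging) that conventional supervised learning does not fully exploit. The paper aims to use this structural and multimodal information to build more robust medical machine learning models.

Result: On MIMIC-CXR, CheXpert, and PadChest, MVMAE and MVMAE-V2T outperform supervised and vision-language baselines on disease classification, with the largest gains in low-label regimes.

Insight: 1. Structural information in medical data (such as multiple views) is a strong self-supervisory signal; 2. Adding text information can further improve the model's semantic understanding; 3. Structural and textual supervision are effective, complementary paths toward medical foundation models.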

Abstract: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.

[45] Small Object Detection for Birds with Swin Transformer cs.CVPDF

Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama

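TL;DR: The paper proposes an improved Swin Transformer-based method specifically for detecting sparse small objects (such as birds), improving small object detection by adjusting the window size and redesigning the neck network.

Details

Motivation: Small object detection is challenging because targets are small, blurred, and occluded. Existing methods usually target small, dense scenes, whereas for sparse small objects (such as birds) the shortage of training samples makes feature learning difficult.

Result: Experiments show that a Swin Transformer-based neck combined with CenterNet performs well for small object detection, and that smaller window sizes benefit mAP.

Insight: For a specific category of small objects (such as birds), adjusting the window size and the neck network design is an effective way to improve detection performance.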

Abstract: Object detection is the task of detecting objects in an image. In this task, the detection of small objects is particularly difficult. Beyond the small size, it is also hampered by blur, occlusion, and similar factors. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or far objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects: birds. In particular, we improve the features learned by the neck, the sub-network between the backbone and the prediction head, so that it learns more effective features with a hierarchical design. We employ Swin Transformer to upsample the image features. Moreover, we change the shifted window size to adapt to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can lead to good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAPs for small object detection.

[46] Prompt-based Consistent Video Colorization cs.CV | cs.AIPDF

Silvia Dani, Tiberio Uricchio, Lorenzo Seidenari

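TL;DR: The paper proposes a prompt-based automatic video colorization method that combines language and segmentation guidance, colorizes frames with a diffusion model, and uses optical flow for temporal stability.

Details

Motivation: Conventional video colorization methods suffer from temporal flickering or require extensive manual input. The paper aims to solve these problems with an automated approach.

Result: On the DAVIS30 and VIDEVO20 benchmarks, the method achieves SOTA colorization accuracy (PSNR) and visual realism (Colorfulness, CDC).

Insight: Automated prompts (generic text) can replace manual color input, and combining optical-flow warping with a correction step significantly improves temporal stability.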

Abstract: Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
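The warp-then-correct idea can be sketched in a few lines. The abstract does not say how inconsistencies are detected, so the rule used below (compare the warped luminance with the current grayscale frame and fall back to a per-frame colorization when they disagree) is purely an assumption, as are the function name `warp_and_correct`, the integer flow offsets, and the tolerance `tol`.

```python
def warp_and_correct(prev_color, prev_gray, cur_gray, flow, fallback, tol=0.1):
    """Propagate colours from the previous frame along a backward flow field.

    flow[y][x] = (dy, dx) points from a current pixel to its source in the
    previous frame (integer offsets here; real flow such as RAFT is sub-pixel).
    Pixels whose source lies outside the image, or whose warped luminance
    differs from the current grayscale value by more than `tol`, take the
    `fallback` colour instead (a hypothetical correction rule).
    """
    h, w = len(cur_gray), len(cur_gray[0])
    out = [[None] * w for _ in range(h)]
    fixed = 0
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            inside = 0 <= sy < h and 0 <= sx < w
            if inside and abs(prev_gray[sy][sx] - cur_gray[y][x]) <= tol:
                out[y][x] = prev_color[sy][sx]
            else:
                out[y][x] = fallback[y][x]
                fixed += 1
    return out, fixed


# Toy 1x3 frame whose content shifts one pixel to the right.
prev_gray = [[0.2, 0.8, 0.5]]
prev_color = [["red", "blue", "green"]]
cur_gray = [[0.9, 0.2, 0.8]]
flow = [[(0, -1), (0, -1), (0, -1)]]
fallback = [["auto", "auto", "auto"]]

colored, num_fixed = warp_and_correct(
    prev_color, prev_gray, cur_gray, flow, fallback
)
```

The leftmost pixel has no source in the previous frame, so it is the only one that takes the fallback colour.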

[47] Unexplored flaws in multiple-choice VQA evaluations cs.CV | cs.LGPDF

Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, Thorsten Bagodonat, Stephan Günnemann

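TL;DR: The paper reveals unexplored prompt-format biases in multiple-choice visual question answering (VQA) evaluations of multimodal large language models (MLLMs), and argues that these biases call the reliability of current evaluations into question.

Details

Motivation: Prior work found that multiple-choice VQA is sensitive to answer option order. This paper points out further biases in prompt formatting that have not been studied adequately and may distort MLLM evaluation.

Result: Multiple-choice VQA is highly sensitive to minor prompt format changes, even semantically neutral ones. These biases persist independently of known order biases and of the model's confidence in the correct answer.

Insight: The paper exposes systematic biases in VQA evaluation that had gone unnoticed, indicating that current evaluation methods may not reflect the true capabilities of MLLMs and that more rigorous evaluation design is needed.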

Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM’s confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.
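A grid of semantically neutral prompt formats can be generated as a Cartesian product of formatting factors. The abstract does not name its three factors, so the ones below (option label style, option separator, answer instruction) are hypothetical stand-ins, sized only so that the product happens to give 48 combinations.

```python
from itertools import product

# All three factor lists are hypothetical, not taken from the paper.
LABEL_STYLES = ["{l}.", "({l})", "{l})", "[{l}]"]
SEPARATORS = ["\n", "; ", " | "]
INSTRUCTIONS = [
    "Answer:",
    "Answer with the letter:",
    "The correct option is",
    "Choose one option.",
]


def render(question, options, label_style, sep, instruction):
    """Format one multiple-choice prompt under a given formatting choice."""
    letters = "ABCDEFGH"
    body = sep.join(
        f"{label_style.format(l=letters[i])} {opt}"
        for i, opt in enumerate(options)
    )
    return f"{question}\n{body}\n{instruction}"


def all_variants(question, options):
    """Enumerate every combination of the three formatting factors."""
    return [
        render(question, options, style, sep, instruction)
        for style, sep, instruction in product(
            LABEL_STYLES, SEPARATORS, INSTRUCTIONS
        )
    ]


variants = all_variants("What animal is shown?", ["cat", "dog", "bird"])
```

Scoring a model on every variant of the same question, and comparing accuracy across variants, is one simple way to probe the kind of format sensitivity the abstract describes.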

[48] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment cs.CVPDF

Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen

TL;DR: The paper introduces a reverse representation alignment strategy for Normalizing Flows (NFs) to improve generative quality and classification accuracy, achieving state-of-the-art results on ImageNet.

Details

Motivation: Standard Normalizing Flows (NFs) suffer from poor semantic representations due to log-likelihood optimization, limiting their generative quality. The authors propose leveraging NF invertibility to align intermediate features during the generative pass with representations from a vision foundation model.

Result: Achieves state-of-the-art results on ImageNet 64×64 and 256×256, with significant improvements in generative quality and classification accuracy, and accelerates training by over 3.3×.

Insight: Leveraging NF invertibility for reverse representation alignment is more effective than forward pass regularization, demonstrating that improving generative quality can also enhance semantic representation.

Abstract: Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF’s embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3$\times$, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64$\times$64 and 256$\times$256. Our code is available at https://github.com/MCG-NJU/FlowBack.

[49] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts cs.CVPDF

Anshul Bagaria

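TL;DR: INSIGHT is an interpretable neural vision-language framework for detecting and explaining AI-generated images. It combines super-resolution, Grad-CAM multi-scale localization, and CLIP semantic alignment, and significantly improves detection robustness and explanation quality at low resolution.

Details

Motivation: As images generated by GANs and diffusion models grow more realistic, reliably detecting and explaining them has become critical. Existing methods perform poorly under real-world conditions and lack interpretability. INSIGHT aims to fill this gap.

Result: At extremely low resolutions (16x16 to 64x64), INSIGHT substantially outperforms existing detectors, improving detection robustness and explanation quality.

Insight: Interpretability and robustness can improve together, and a multimodal (vision plus language) framework offers a new approach to detecting generated images.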

Abstract: The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.

[50] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows cs.CVPDF

Zhenglin Zhou, Fan Ma, Chengzhuo Gui, Xiaobo Xia, Hehe Fan

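TL;DR: AnchorFlow is a training-free 3D editing method that uses latent anchor consistency to address the editing instability that timestep-dependent noise causes during diffusion sampling in existing methods.

Details

Motivation: Existing training-free 3D editing methods perform poorly on semantic alignment and geometric stability, mainly because of inconsistent latent anchors during diffusion sampling.

Result: On the Eval3DEdit benchmark, AnchorFlow delivers semantically aligned and structurally robust edits across diverse editing types.

Insight: Latent anchor consistency is key to stable 3D editing, and geometric fidelity can be preserved effectively even without mask supervision.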

Abstract: Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. Code is at https://github.com/ZhenglinZhou/AnchorFlow.

[51] Asking like Socrates: Socrates helps VLMs understand remote sensing images cs.CV | cs.AIPDF

Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He

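TL;DR: The paper proposes a paradigm called RS-EoT which, through the SocraticAgent multi-agent system and a progressive RL strategy, addresses pseudo reasoning in remote sensing image understanding and achieves genuine reasoning grounded in visual evidence.

Details

Motivation: Existing vision-language models exhibit pseudo reasoning on remote sensing tasks, which the authors attribute to missing visual evidence caused by the Glance Effect.

Result: Achieves SOTA performance on multiple RS VQA and grounding benchmarks, with analyses confirming the effectiveness of RS-EoT.

Insight: Through iterative cycles of reasoning and visual inspection, RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning.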

Abstract: Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates

Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen

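TL;DR: The paper introduces UAV-MM3D, a large-scale multimodal synthetic dataset for low-altitude UAV perception and motion understanding with rich annotations and multiple modalities, and proposes the LGFusionNet baseline and a trajectory prediction baseline.

Details

Motivation: Real-world UAV data collection is constrained by airspace regulations, privacy concerns, and environmental variability, and manual annotation of 3D poses and cross-modal correspondences is costly, so a high-quality, controllable synthetic dataset is needed to support research.

Result: The dataset supports tasks such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting, and provides a public benchmark.

Insight: Synthetic data can overcome the limitations of real-world data collection and give UAV perception research a controllable and diverse test environment.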

Abstract: Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities - RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core tasks related to UAVs such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV-MM3D offers a public benchmark for advancing 3D perception of UAVs.

[53] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention cs.CVPDF

Furkan Guzelant, Arda Goktogan, Tarık Kaya, Aysegul Dundar

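TL;DR: DiffStyle360 is a diffusion-based 360° head stylization framework that produces multi-view consistent 3D head stylizations from a single style reference image, without per-style training.

Details

Motivation: Current 3D head stylization methods typically rely on computationally expensive optimization or domain-specific fine-tuning, which limits their flexibility and efficiency. DiffStyle360 aims to provide a general method that needs no per-style training.

Result: Experiments on FFHQ and RenderMe360 show that DiffStyle360 outperforms existing GAN- and diffusion-based methods in style quality and multi-view consistency.

Insight: The key to DiffStyle360 is disentangling content from style and balancing the two adaptively through an attention mechanism, which demonstrates the potential of diffusion models for 3D stylization.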

Abstract: 3D head stylization has emerged as a key technique for reimagining realistic human heads in various artistic forms, enabling expressive character design and creative visual experiences in digital media. Despite the progress in 3D-aware generation, existing 3D head stylization methods often rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles. To address these limitations, we propose DiffStyle360, a diffusion-based framework capable of producing multi-view consistent, identity-preserving 3D head stylizations across diverse artistic domains given a single style reference image, without requiring per-style training. Building upon the 3D-aware DiffPortrait360 architecture, our approach introduces two key components: the Style Appearance Module, which disentangles style from content, and the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. Furthermore, we employ a 3D GAN-generated multi-view dataset for robust fine-tuning and introduce a temperature-based key scaling strategy to control stylization intensity during inference. Extensive experiments on FFHQ and RenderMe360 demonstrate that DiffStyle360 achieves superior style quality, outperforming state-of-the-art GAN- and diffusion-based stylization methods across challenging style domains.
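One way a key scaling factor could control stylization intensity is by changing how much attention mass lands on style tokens. The sketch below assumes a single query attending jointly over content and style keys, with the style keys multiplied by a factor `tau` before the dot product. This form, and the names `style_attention_weight` and `tau`, are assumptions for illustration and may differ from the paper's actual strategy.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def style_attention_weight(query, content_keys, style_keys, tau=1.0):
    """Return the total attention weight assigned to the style keys.

    Style keys are multiplied by `tau` before the scaled dot product
    (an assumed form of key scaling, not the paper's definition).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in content_keys]
    logits += [dot(query, [tau * v for v in k]) * scale for k in style_keys]
    weights = softmax(logits)
    return sum(weights[len(content_keys):])


# Toy example: identical content and style keys, aligned with the query.
query = [1.0, 0.0]
content_keys = [[1.0, 0.0]]
style_keys = [[1.0, 0.0]]

w_weak = style_attention_weight(query, content_keys, style_keys, tau=0.5)
w_base = style_attention_weight(query, content_keys, style_keys, tau=1.0)
w_strong = style_attention_weight(query, content_keys, style_keys, tau=2.0)
```

In this toy case a larger `tau` shifts attention toward the style key because its dot product with the query is positive. With a negative dot product the effect would reverse, which is one reason the real mechanism may be defined differently.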

[54] Wukong’s 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models cs.CVPDF

Minghao Yin, Yukang Cao, Kai Han

TL;DR: WUKONG is a training-free framework for high-fidelity textured 3D morphing that uses flow-based transformers to generate smooth transitions between source and target prompts, outperforming existing methods.

Details

Motivation: Traditional 3D morphing methods rely on manual correspondence matching and deformation trajectory estimation, which limit generalization and require costly preprocessing. WUKONG aims to overcome these limitations by leveraging generative priors of flow-based models.

Result: WUKONG achieves superior performance in both quantitative and qualitative evaluations, outperforming state-of-the-art methods in handling diverse geometry and texture variations.

Insight: The success of WUKONG highlights the potential of flow-based generative models for 3D morphing tasks, demonstrating that training-free frameworks can achieve high-fidelity results without manual preprocessing.

Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods – which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) – WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

[55] Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation cs.CVPDF

Weining Ren, Hongjun Wang, Xiao Tan, Kai Han

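TL;DR: This paper proposes Fin3R, a simple and effective fine-tuning method that improves the geometric detail and robustness of feed-forward 3D reconstruction models. It uses monocular knowledge distillation to address scarce training data and geometric misalignment, and needs only a lightweight LoRA adapter to improve model performance.

Details

Motivation: Current feed-forward 3D reconstruction models struggle with fine geometry and robustness, mainly because high-quality depth and pose supervision is scarce and because multi-view pointmap regression introduces inherent geometric misalignment.

Result: Experiments show that models fine-tuned with Fin3R (DUSt3R, MASt3R, and others) recover more complex structures, produce sharper boundaries, and reach higher geometric accuracy in both single-view and multi-view settings, while adding only tiny LoRA weights that leave inference-time memory and latency virtually unchanged.

Insight: Monocular knowledge distillation combined with a lightweight fine-tuning strategy can substantially improve the geometric detail and robustness of 3D reconstruction models without a significant increase in compute cost.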

Abstract: We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses the pointmaps of all input images to a reference-frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles these two issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
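
The low-rank adapter idea mentioned in the abstract can be sketched as follows. This is a minimal pure-Python illustration of a standard LoRA update on a single linear layer, not Fin3R's custom adapter; the class name `LoRALinear`, the `alpha` scaling, and the deterministic initialization of `A` are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B (A x)."""

    def __init__(self, frozen_weight, rank, alpha=1.0):
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        self.weight = frozen_weight  # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A is normally random; small deterministic values are used here (assumption).
        self.A = [[0.01 * (i + j + 1) for j in range(in_dim)] for i in range(rank)]
        # B starts at zero, so the adapter initially leaves the layer unchanged.
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


layer = LoRALinear([[1.0, 0.0], [0.0, 2.0]], rank=1)
print(layer.forward([3.0, 4.0]))  # equals the frozen layer's output while B is zero
```

Only `A` and `B` would be trained, which is why such an adapter adds few parameters. After training, the product of `B` and `A` can be folded into the frozen weight, which would be consistent with the abstract's claim that test-time memory and latency stay virtually unchanged.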

[56] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition cs.CVPDF

Hongda Liu, Yunfan Liu, Changlu Wang, Yunlong Wang, Zhenan Sun

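TL;DR: SkeletonAgent is a new framework for skeleton-based action recognition in which two cooperating agents, the Questioner and the Selector, let the LLM and the recognition model interact in both directions, improving recognition performance.

Details

Motivation: In current skeleton-based action recognition, the semantic priors of LLMs are usually queried in isolation, with no feedback loop to the recognition model, so the LLM supplies too little discriminative information to separate similar actions.

Result: SkeletonAgent outperforms existing methods on five benchmarks, including NTU RGB+D and Kinetics-Skeleton.

Insight: A two-way agent mechanism with dynamic interaction and feedback compensates for the weaknesses of using an LLM in isolation and suggests a new direction for cross-modal action recognition.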

Abstract: Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM’s response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at https://github.com/firework8/SkeletonAgent.
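
The Questioner step described above, finding the most frequently confused classes and passing them to the LLM as context, can be sketched as below. This is a hypothetical illustration only: the function names, the confusion-matrix input, and the prompt wording are assumptions and are not taken from the SkeletonAgent code.

```python
def most_confused_pairs(confusion, class_names, top_k=3):
    """Return up to top_k (true, predicted, count) triples with the largest off-diagonal counts."""
    pairs = []
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((class_names[i], class_names[j], count))
    # Largest counts first; names break ties so the order is deterministic.
    pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
    return pairs[:top_k]


def build_prompt(pairs):
    """Turn the confused pairs into context for an LLM query (wording is illustrative)."""
    lines = [f"- '{t}' is often predicted as '{p}' ({c} times)" for t, p, c in pairs]
    return "Which joints best distinguish these actions?\n" + "\n".join(lines)


names = ["drink water", "eat meal", "brush teeth"]
confusion = [  # rows: true class, columns: predicted class
    [50, 7, 3],
    [9, 48, 1],
    [2, 0, 58],
]
pairs = most_confused_pairs(confusion, names, top_k=2)
print(build_prompt(pairs))
```

In the framework described in the abstract, the Selector would then parse the LLM's answer into joint-level constraints for the recognizer; that step is not modeled here.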

[57] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection cs.CVPDF

Runzhi Deng, Yundi Hu, Xinshuang Zhang, Zhao Wang, Xixi Liu

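TL;DR: ABounD is an adversarial boundary-driven few-shot learning method for multi-class anomaly detection. It unifies semantic concept learning and decision boundary shaping in one framework through Dynamic Concept Fusion and Adversarial Boundary Forging.

Details

Motivation: Few-shot multi-class industrial anomaly detection suffers from data scarcity and blurred boundaries between normal and abnormal states, which lead to missed defects and misclassified samples. The authors address this by combining semantic concept learning with decision boundary optimization.

Result: ABounD achieves state-of-the-art few-shot multi-class anomaly detection performance on MVTec-AD and VisA.

Insight: The synergy between semantic concept learning and decision boundary optimization is key to better anomaly detection.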

Abstract: Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning method for multi-class anomaly detection: a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance in few-shot multi-class anomaly detection.

[58] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition cs.CVPDF

Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji

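TL;DR: This paper proposes FauxNet, a deepfake detection network built on pre-trained Visual Speech Recognition (VSR) features. It focuses on zero-shot, generalizable detection and outperforms existing methods on the newly proposed Authentica datasets and on FaceForensics++.

Details

Motivation: Rapid progress in deepfake generation has produced highly realistic forged media and raised concerns about misuse, so robust and reliable detection is urgently needed.

Result: FauxNet outperforms existing methods in zero-shot detection and shows its advantage on the new datasets.

Insight: Visual speech features have the potential to generalize in deepfake detection, and new, diverse datasets help advance research.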

Abstract: Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context is zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet can perform attribution, i.e., distinguish between the generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.

[59] Benchmarking machine learning models for multi-class state recognition in double quantum dot data cs.CV | cond-mat.mes-hall | cs.LGPDF

Valeria Díaz Moreno, Ryan P Khalili, Daniel Schug, Patrick J. Walsh, Justyna P. Zwolak

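TL;DR: This paper benchmarks four modern machine learning architectures on multi-class state recognition in double quantum dot charge-stability diagrams and finds that CNNs perform best on experimental data.

Details

Motivation: Semiconductor quantum dots are a leading platform for scalable quantum processors, but large arrays need reliable automated tuning strategies, and accurately identifying device states from charge-stability diagrams is a key step.

Result: U-Nets and ViTs perform best on synthetic data (MSE score above 0.98) but generalize poorly to experimental data; CNNs offer the best trade-off on experimental data.

Insight: Normalization has a marked effect on model performance: min-max normalization usually scores higher but converges less stably, whereas z-score normalization trains more stably at lower accuracy. CNNs are a practical choice for experimental data.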

Abstract: Semiconductor quantum dots (QDs) are a leading platform for scalable quantum processors. However, scaling to large arrays requires reliable, automated tuning strategies for devices’ bootstrapping, calibration, and operation, with many tuning aspects depending on accurately identifying QD device states from charge-stability diagrams (CSDs). In this work, we present a comprehensive benchmarking study of four modern machine learning (ML) architectures for multi-class state recognition in double-QD CSDs. We evaluate their performance across different data budgets and normalization schemes using both synthetic and experimental data. We find that the more resource-intensive models – U-Nets and visual transformers (ViTs) – achieve the highest MSE score (defined as $1-\mathrm{MSE}$) on synthetic data (over $0.98$) but fail to generalize to experimental data. MDNs are the most computationally efficient and exhibit highly stable training, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, achieving strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Normalization plays a nontrivial role: min-max scaling generally yields higher MSE scores but less stable convergence, whereas z-score normalization produces more predictable training dynamics but at reduced accuracy for most models. Overall, our study shows that CNNs with min-max normalization are a practical approach for QD CSDs.
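
The two normalization schemes and the MSE score (defined in the abstract as $1-\mathrm{MSE}$) can be illustrated with a short sketch. The function names and the flat-list representation of a charge-stability diagram are assumptions for the example; the paper's own preprocessing may differ.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def mse_score(predicted, target):
    """MSE score as defined in the abstract: 1 - MSE (higher is better)."""
    mse = mean((p - t) ** 2 for p, t in zip(predicted, target))
    return 1.0 - mse


pixels = [2.0, 4.0, 6.0]
print(min_max_normalize(pixels))
print(z_score_normalize(pixels))
print(mse_score([1.0, 0.0], [1.0, 0.2]))
```

Min-max scaling depends only on the extreme values, so a single outlier pixel can compress the rest of the range, while z-score scaling depends on the spread of all values. This is one plausible reason, offered here as an assumption, for the different training behavior the abstract reports.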

[60] Beyond Real versus Fake Towards Intent-Aware Video Analysis cs.CVPDF

Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva

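TL;DR: The paper introduces the IntentHQ benchmark, which targets video intent analysis rather than authenticity detection alone, to address the limitations of existing deepfake detection methods.

Details

Motivation: Existing deepfake detection methods only ask whether a video is real or fake and cannot address the intent behind it, which matters particularly for social media and security.

Result: The proposed model is designed to distinguish a wide range of intent categories and offers a new tool for video intent analysis.

Insight: Video intent analysis has more practical value than real-versus-fake detection alone, especially for social media dissemination and security.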

Abstract: The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including “Financial fraud”, “Indirect marketing”, “Political propaganda”, as well as “Fear mongering”. We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.

[61] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models cs.CVPDF

Zhenglin Zhou, Fan Ma, Xiaobo Xia, Hehe Fan, Yi Yang

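TL;DR: ITS3D is an inference-time scaling framework that improves the generation quality of text-guided 3D diffusion models by optimizing the Gaussian noise input, with no additional training. It combines verifier-guided search, Gaussian normalization, and dimensionality reduction.

Details

Motivation: Improve the generation quality of 3D diffusion models directly at inference time, avoiding additional training and its compute cost.

Result: Experiments show that ITS3D improves text-to-3D generation quality.

Insight: Inference-time optimization can improve generative model performance efficiently and suits resource-constrained settings.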

Abstract: We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem to identify the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm, where the search algorithm iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, which shows the potential of computationally efficient search methods in generative processes. The source code is available at https://github.com/ZhenglinZhou/ITS3D.
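
The interplay of verifier-guided search and Gaussian normalization described above can be sketched with a toy example. This is a simplified random-perturbation search under stated assumptions, not the ITS3D algorithm: the `toy_verifier` scoring function, the step size, and the acceptance rule are all hypothetical, and the SVD-based compression and singular space reset are omitted.

```python
import random
from statistics import mean, pstdev


def gaussian_normalize(noise):
    """Re-standardize a noise vector to zero mean and unit variance."""
    mu, sigma = mean(noise), pstdev(noise)
    if sigma == 0:
        return list(noise)
    return [(x - mu) / sigma for x in noise]


def verifier_guided_search(verifier, dim, steps=50, step_size=0.3, seed=0):
    """Keep the noise candidate with the best verifier score found so far."""
    rng = random.Random(seed)
    best = gaussian_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    best_score = verifier(best)
    for _ in range(steps):
        # Perturb the current best, then correct the resulting distribution shift.
        candidate = [x + step_size * rng.gauss(0.0, 1.0) for x in best]
        candidate = gaussian_normalize(candidate)
        score = verifier(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_verifier(noise):
    """Hypothetical stand-in for a verifier that scores a generated 3D asset."""
    return noise[0]


best_noise, best_score = verifier_guided_search(toy_verifier, dim=8)
print(best_score)
```

In a real pipeline the verifier would score the output generated from each noise candidate, so every evaluation costs a full diffusion run; that cost is what motivates the dimensionality reduction mentioned in the abstract.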

[62] Gaussians on Fire: High-Frequency Reconstruction of Flames cs.CVPDF

Jakob Nazarenus, Dominik Michels, Wojtek Palubicki, Simin Kou, Fang-Lue Zhang

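TL;DR: The paper proposes a Gaussian-based spatiotemporal representation for reconstructing dynamic flames from a limited set of views (only three). It separates the static background from the dynamic flame region by combining multi-view stereo with monocular depth priors, initializes the flame as a 3D flow field, and encodes high-frequency features with Gaussians.

Details

Motivation: The dynamic, transparent, and high-frequency nature of flames makes 3D reconstruction very challenging, and existing methods usually need many views. This work aims for high-quality flame reconstruction from few views (three).

Result: Experiments confirm the method's robustness across diverse and challenging flame scenes, with strong quantitative and qualitative results.

Insight: 1. Gaussian models are suited to modeling high-frequency dynamic phenomena; 2. a few views combined with depth priors can compensate for under-constrained geometry; 3. hardware synchronization is essential for dynamic reconstruction.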

Abstract: We propose a method to reconstruct dynamic fire in 3D from a limited set of camera views with a Gaussian-based spatiotemporal representation. Capturing and reconstructing fire and its dynamics is highly challenging due to its volatile nature, transparent quality, and multitude of high-frequency features. Despite these challenges, we aim to reconstruct fire from only three views, which consequently requires solving for under-constrained geometry. We solve this by separating the static background from the dynamic fire region by combining dense multi-view stereo images with monocular depth priors. The fire is initialized as a 3D flow field, obtained by fusing per-view dense optical flow projections. To capture the high frequency features of fire, each 3D Gaussian encodes a lifetime and linear velocity to match the dense optical flow. To ensure sub-frame temporal alignment across cameras we employ a custom hardware synchronization pattern – allowing us to reconstruct fire with affordable commodity hardware. Our quantitative and qualitative validations across numerous reconstruction experiments demonstrate robust performance for diverse and challenging real fire scenarios.
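
The statement that each 3D Gaussian encodes a lifetime and a linear velocity can be made concrete with a small sketch. The class and field names below are hypothetical, and only the center of the Gaussian is modeled; covariance, opacity, and color are left out.

```python
from dataclasses import dataclass


@dataclass
class FlameGaussian:
    position: tuple   # (x, y, z) center at birth time
    velocity: tuple   # constant linear velocity (vx, vy, vz)
    birth: float      # time at which the Gaussian appears
    lifetime: float   # duration for which it stays active

    def is_alive(self, t):
        """A Gaussian contributes to rendering only inside its lifetime window."""
        return self.birth <= t < self.birth + self.lifetime

    def position_at(self, t):
        """Center at time t under linear motion, or None outside the lifetime."""
        if not self.is_alive(t):
            return None
        dt = t - self.birth
        return tuple(p + v * dt for p, v in zip(self.position, self.velocity))


g = FlameGaussian(position=(0.0, 0.0, 0.0), velocity=(0.0, 2.0, 0.0), birth=1.0, lifetime=0.5)
print(g.position_at(1.25))  # (0.0, 0.5, 0.0)
print(g.position_at(2.0))   # None
```

Short lifetimes let many such primitives appear and vanish quickly, which is one way to represent flickering, high-frequency structure while each primitive's motion stays simple enough to match dense optical flow.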

[63] RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding cs.CVPDF

Xiyan Liu, Han Wang, Yuhu Wang, Junjie Cai, Zhe Cao

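TL;DR: RoadSceneBench is a lightweight benchmark focused on mid-level road scene understanding, filling the gap left by existing benchmarks in evaluating reasoning. The proposed HRRP-T training framework improves the spatial coherence and semantic alignment of vision-language models.

Details

Motivation: Existing benchmarks mainly target low-level perception tasks such as detection and segmentation and do not evaluate reasoning. RoadSceneBench aims to fill this gap, emphasizing understanding of road topology and dynamic scene structure.

Result: Experiments show that the method achieves state-of-the-art performance across diverse road configurations.

Insight: Emphasizing structural consistency in mid-level semantic understanding helps advance structure-aware perception research for autonomous driving.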

Abstract: Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.

[64] Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval cs.CVPDF

Tien-Huy Nguyen, Huu-Loc Tran, Huu-Phong Phan-Nguyen, Quang-Vinh Dinh

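TL;DR: This paper proposes a new framework for text-based person anomaly retrieval that combines a local-global hybrid perspective module, a unified image-text model, and an iterative ensemble strategy, improving retrieval performance.

Details

Motivation: Existing methods rely on complex deep-learning techniques, yet optimizing a model to extract finer-grained features remains a challenge. To address this, the authors propose a hybrid framework that combines fine-grained and coarse-grained features.

Result: State-of-the-art performance on the PAB dataset, with improvements of 9.70% in R@1, 1.77% in R@5, and 1.01% in R@10.

Insight: 1. Mixing fine-grained and coarse-grained features improves retrieval performance markedly; 2. the UIT model with multi-task losses strengthens feature representation; 3. the iterative ensemble strategy outperforms conventional ensembling.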

Abstract: Text-based person anomaly retrieval has emerged as a challenging task, with most existing approaches relying on complex deep-learning techniques. This raises a research question: how can the model be optimized to extract finer-grained features? To address this, we propose a Local-Global Hybrid Perspective (LHP) module integrated with a Vision-Language Model (VLM), designed to explore the effectiveness of incorporating fine-grained features alongside coarse-grained features. Additionally, we investigate a Unified Image-Text (UIT) model that combines multiple objective loss functions, including Image-Text Contrastive (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Masked Image Modeling (MIM) losses. Beyond this, we propose a novel iterative ensemble strategy that combines model results iteratively instead of using them simultaneously, as other ensemble methods do. To take advantage of the superior performance of the LHP model, we introduce a novel feature selection algorithm based on its guidance, which helps improve the model’s performance. Extensive experiments demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on the PAB dataset compared with previous work, with a 9.70% improvement in R@1, a 1.77% improvement in R@5, and a 1.01% improvement in R@10.
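
The R@1, R@5, and R@10 figures above are Recall@K retrieval metrics. A minimal sketch follows, assuming each text query has exactly one correct image; the function name and toy data are illustrative and not taken from the paper.

```python
def recall_at_k(ranked_ids_per_query, ground_truth_ids, k):
    """Fraction of queries whose correct item appears among the top-k retrieved items."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_ids_per_query, ground_truth_ids)
        if truth in ranked[:k]
    )
    return hits / len(ground_truth_ids)


# Three text queries, each with a ranked list of candidate image ids.
rankings = [
    ["a", "b", "c"],  # correct item at rank 1
    ["c", "a", "b"],  # correct item at rank 2
    ["b", "c", "a"],  # correct item at rank 3
]
truth = ["a", "a", "a"]

for k in (1, 2, 3):
    print(k, recall_at_k(rankings, truth, k))
```

Recall@K cannot decrease as K grows, which is consistent with the reported gains shrinking from R@1 to R@10: at larger K, baselines already retrieve most correct items, leaving less room for improvement.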

[65] Rethinking Cross-Generator Image Forgery Detection through DINOv3 cs.CVPDF

Zhenglin Huang, Jason Li, Haiquan Wen, Tianxiao Li, Xi Yang

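TL;DR: This paper finds that the frozen DINOv3 foundation model, without fine-tuning, already has strong cross-generator image forgery detection ability. The authors propose a simple training-free token-ranking strategy and a lightweight linear probe that improve detection accuracy.

Details

Motivation: As generative models become more diverse and capable, cross-generator detection has become a new challenge. Existing methods tend to overfit to the artifacts of specific generators and generalize poorly.

Result: The method consistently improves detection accuracy across the evaluated datasets.

Insight: DINOv3 relies on global, low-frequency structures as transferable authenticity cues rather than on high-frequency, generator-specific artifacts. This offers empirical support for the cross-generator generality of foundation models.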

Abstract: As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, this work finds that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies on frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy followed by a lightweight linear probe to select a small subset of authenticity-relevant tokens. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.

[66] AI killed the video star. Audio-driven diffusion model for expressive talking head generation cs.CVPDF

Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva

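TL;DR: Dimitra++ is an audio-driven diffusion framework for generating expressive talking-head animation. It learns lip motion, facial expression, and head pose, and outperforms existing methods.

Details

Motivation: Current talking-head generation methods are often limited in expressiveness and struggle to produce natural lip motion, facial expression, and head pose together. Dimitra++ addresses this with a diffusion model.

Result: On VoxCeleb2 and CelebV-HQ, quantitative and qualitative experiments and a user study indicate that Dimitra++ outperforms existing methods.

Insight: Diffusion models show strong potential for facial motion generation, especially when conditioned on audio, where they can produce diverse and natural motion.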

Abstract: We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

[67] SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts cs.CV | cs.IRPDF

Shun Inadumi, Shohei Tanaka, Tosho Hirasawa, Atsushi Hashimoto, Koichiro Yoshino

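TL;DR: The paper introduces SciPostGen, a dataset for generating poster layouts from scientific papers, and a retrieval-augmented layout generation framework.

Details

Motivation: As the number of scientific papers grows, posters are an important medium for presenting research, and their layout affects how well research is communicated. Large-scale annotated datasets for understanding how papers correspond to poster layouts are lacking.

Result: Experiments show that the retriever estimates layouts aligned with paper structure and that the framework generates layouts that also satisfy given constraints.

Insight: Paper structure is associated with the number of layout elements in a poster, and retrieval can effectively guide layout generation.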

Abstract: As the number of scientific papers continues to grow, there is a demand for approaches that can effectively convey research findings, with posters serving as a key medium for presenting paper contents. Poster layouts determine how effectively research is communicated and understood, highlighting their growing importance. In particular, a gap remains in understanding how papers correspond to the layouts that present them, which calls for datasets with paired annotations at scale. To bridge this gap, we introduce SciPostGen, a large-scale dataset for understanding and generating poster layouts from scientific papers. Our analyses based on SciPostGen show that paper structures are associated with the number of layout elements in posters. Based on this insight, we explore a framework, Retrieval-Augmented Poster Layout Generation, which retrieves layouts consistent with a given paper and uses them as guidance for layout generation. We conducted experiments under two conditions: with and without layout constraints typically specified by poster creators. The results show that the retriever estimates layouts aligned with paper structures, and our framework generates layouts that also satisfy given constraints.

[68] What Shape Is Optimal for Masks in Text Removal? cs.CV | cs.CL | cs.LGPDF

Hyakka Nakada, Marika Kubota

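TL;DR: This paper studies how mask shape affects performance when removing dense text from document images and proposes a Bayesian-optimization-based method for producing optimal character-wise masks.

Details

Motivation: Existing text removal methods mainly target simple scene text, and complex, dense text has received little study. The paper finds that small changes in mask shape strongly affect removal quality, so the mask shape must be tuned to improve performance.

Result: Character-wise masks turn out to be the optimal shape, and the minimum cover of a text region is not optimal.

Insight: Precise mask-shape design is critical for text removal: both over-coverage and under-coverage should be avoided, and character-wise masks are an effective practical choice.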

Abstract: The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, removing specific text from document images and reconstructing the original images is extremely important for industrial applications. However, most existing text-removal methods focus on deleting simple scene text that appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images that include a large amount of text. From the data, we found that text-removal performance is vulnerable to perturbations of the mask profile. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.

[69] DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA cs.CV | cs.AIPDF

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath

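TL;DR: DocVAL is a validated chain-of-thought distillation framework that uses teacher supervision, a multi-module validator, and a two-stage student training scheme to transfer a large model's spatial reasoning ability to a small, deployable model, improving the efficiency and accuracy of DocVQA.

Details

Motivation: DocVQA requires joint reasoning over textual content and spatial layout, but current systems face a sharp accuracy-efficiency trade-off between large models (accurate but expensive) and small models (cheap but weaker). DocVAL aims to transfer the large model's reasoning ability to a small model efficiently.

Result: The Gemma-3 12B student reaches 91.4% ANLS and 82.4% mAP on DocVQA without text detection or OCR at inference. Validated feedback contributes a 6.3 mAP gain and iterative refinement a 9.7 mAP improvement.

Insight: 1) Validated feedback is essential to distillation quality; 2) two-stage training markedly improves small-model performance; 3) high-quality annotated data is valuable for spatial reasoning tasks.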

Abstract: Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy–efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4% ANLS and 82.4% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes 6.3 mAP gain and iterative refinement accounts for 9.7 mAP improvement. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.
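
The ANLS figure reported above is the Average Normalized Levenshtein Similarity commonly used for DocVQA. The sketch below shows how such a score is typically computed; the 0.5 threshold and the lowercasing and whitespace stripping follow the usual convention and are assumed here rather than taken from the abstract.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity to any reference answer."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


preds = ["Invoice 42", "totl"]
refs = [["invoice 42"], ["total", "sum"]]
print(anls(preds, refs))
```

The threshold gives partial credit for near misses, such as small transcription errors, while assigning zero to answers that are mostly wrong. That makes the metric suitable for a model that reads text directly from pixels without OCR, as the student described in the abstract does.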

[70] CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving cs.CV | cs.AIPDF

Zhaohui Wang, Tengbo Yu, Hao Tang

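TL;DR: The paper proposes CoT4AD, a Vision-Language-Action model with explicit Chain-of-Thought reasoning that improves numerical and causal reasoning in autonomous driving.

Details

Motivation: Existing Vision-Language-Action models perform poorly in complex driving scenarios because of limited numerical reasoning ability and overly simplified input-output mappings.

Result: On real-world and simulated benchmarks, including nuScenes and Bench2Drive, CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations.

Insight: By combining explicit CoT reasoning during training with implicit CoT reasoning at inference, CoT4AD improves numerical and causal reasoning in complex scenarios.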

Abstract: Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.

[71] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration cs.CVPDF

Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo

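TL;DR: Fast3Dcache is a training-free, geometry-aware caching framework that accelerates 3D diffusion inference through dynamic cache scheduling and a stability criterion, improving compute efficiency while preserving geometric consistency.

Details

Motivation: Inference with 3D diffusion models is computationally expensive, and existing caching methods tend to introduce geometric inconsistencies in 3D generation. Fast3Dcache is proposed to solve this.

Result: Experiments show that Fast3Dcache accelerates inference, with up to a 27.12% speed-up and a 54.8% reduction in FLOPs, at minimal loss of geometric quality (Chamfer Distance degrades by 2.48% and F-Score by 1.95%).

Insight: In 3D geometry synthesis, cache design must pay particular attention to geometric consistency; dynamic scheduling and stability-based feature selection are key.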

Abstract: Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).
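
Chamfer Distance, one of the two geometric quality metrics cited above, can be sketched as follows. This uses one common convention (mean nearest-neighbor Euclidean distance in both directions, summed); the paper may use a squared or differently normalized variant, so the exact definition here is an assumption.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets given as coordinate tuples."""

    def mean_nearest(src, dst):
        # For every source point, the distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(reference, reference))  # 0.0 for identical sets
print(chamfer_distance(reference, generated))
```

This brute-force version compares every pair of points, which is adequate for small sets; evaluation on dense shapes would normally use a spatial index or a batched GPU implementation.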

[72] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior cs.CVPDF

Ruoyu Feng, Yunpeng Qi, Jinming Liu, Yixin Gao, Xin Li

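TL;DR: Diff-ICMH is a generative image compression framework that combines generative priors with a semantic consistency loss to harmonize machine and human vision, supporting multiple tasks while preserving a high-quality visual experience.

Details

Motivation: Image compression methods are usually optimized separately for either human perception or machine tasks, overlooking what the two objectives share: the need for accurate semantic information.

Result: Experiments show that Diff-ICMH performs strongly and generalizes well across diverse tasks while maintaining high visual quality for human perception.

Insight: In image compression, accurate semantic information and high perceptual quality not only improve the human experience but also strengthen semantic feature extraction for machine tasks.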

Abstract: Image compression methods are usually optimized in isolation for either human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework that aims to harmonize machine and human vision in image compression. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model’s generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH’s superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception. Code is available at: https://github.com/RuoyuFeng/Diff-ICMH.

[73] Bringing Your Portrait to 3D Presence cs.CV

Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li

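TL;DR: The paper proposes a unified framework for reconstructing animatable 3D human avatars from a single portrait, addressing pose- and framing-sensitive feature representations, limited data, and unreliable proxy-mesh estimation. A Dual-UV representation and a synthetic data manifold give it strong generalization.

Details

Motivation: Current methods for reconstructing 3D avatars from a single portrait suffer from feature representations that depend on pose and framing, scarce data, and unstable proxy-mesh estimation, which calls for a unified and robust solution.

Result: Trained only on half-body synthetic data, the model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results.

Insight: Combining the Dual-UV representation with the synthetic data manifold markedly improves the generalization and stability of reconstruction.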

Abstract: We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

[74] Text Condition Embedded Regression Network for Automated Dental Abutment Design cs.CV

Mianjie Zheng, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuefen Liu

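TL;DR: This paper proposes a text condition embedded regression network (TCEAD) for automated dental abutment design. With a text-guided localization (TGL) module and improved feature extraction, TCEAD performs well on abutment design.

Details

Motivation: Dental abutment design is time-consuming and complex, and poor designs can lead to complications. There is a clear need to use artificial intelligence to improve design efficiency and abutment fit.

Result: Validated on a large abutment design dataset, TCEAD improves IoU by 0.8%-12.85% over other mainstream methods.

Insight: The text-guided localization module and the pre-trained feature extractor help address the dependence on local fine-grained features, suggesting an approach for similar medical imaging tasks.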

Abstract: The abutment is an important part of artificial dental implants, whose design process is time-consuming and labor-intensive. Long-term use of inappropriate dental implant abutments may result in implant complications, including peri-implantitis. Using artificial intelligence to assist dental implant abutment design can quickly improve the efficiency of abutment design and enhance abutment adaptability. In this paper, we propose a text condition embedded abutment design framework (TCEAD), a novel automated abutment design solution. The proposed study extends the self-supervised learning framework of the mesh mask autoencoder (MeshMAE) by introducing a text-guided localization (TGL) module to facilitate abutment area localization. As the parameter determination of the abutment is heavily dependent on local fine-grained features (the width and height of the implant and the distance to the opposing tooth), we pre-train the encoder using oral scan data to improve the model’s feature extraction ability. Moreover, considering that the abutment area is only a small part of the oral scan data, we designed a TGL module, which introduces the description of the abutment area through the text encoder of Contrastive Language-Image Pre-training (CLIP), enabling the network to quickly locate the abutment area. We validated the performance of TCEAD on a large abutment design dataset. Extensive experiments demonstrate that TCEAD achieves an Intersection over Union (IoU) improvement of 0.8%-12.85% over other mainstream methods, underscoring its potential in automated dental abutment design.

[75] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization cs.CV | cs.AI

Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao

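TL;DR: The paper examines how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning in vision-language models (VLMs) and finds that concise CoT designs outperform lengthy ones.

Details

Motivation: The goal is to clarify which CoT designs (language, spatial coordinate trajectories, or image manipulations) truly support generalizable visual reasoning.

Result: Lengthy and visual CoT mainly accelerate convergence without raising final performance, whereas concise CoT containing only the essential grounding steps generalizes best across different maze sizes.

Insight: The key finding is a “short is long” (less is more) effect: in visual reasoning tasks, concise CoT is more effective than lengthy CoT.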

Abstract: We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as “think with image”, has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a “short is long” effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.

[76] HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models cs.CV | cs.AI

Haoxi Zeng, Haoxuan Li, Yi Bin, Pengpeng Zeng, Xing Xu

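TL;DR: HarmoCLIP is an improved CLIP framework that balances global and regional representations: it directly aligns local textual and visual semantics to improve fine-grained understanding while preserving global consistency.

Details

Motivation: CLIP performs well on global tasks, but the lack of region-level supervision limits its fine-grained semantic understanding. Existing methods that improve local perception tend to sacrifice global consistency.

Result: HarmoCLIP improves retrieval by up to 69.78% and Top-1 accuracy on bounding-box classification by 3.2%, outperforming existing methods on both.

Insight: CLIP’s global-local trade-off stems from the absence of direct alignment between local textual and visual semantics, and explicit supervision is an effective remedy.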

Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off where improving local perception simultaneously degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art (improvement up to 69.78%) performance on the global task of retrieval and yields a substantial 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
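
Directly aligning textual segments with their visual regions can be pictured as a contrastive objective over matched region-text pairs. The sketch below is a generic symmetric InfoNCE loss and is an assumption for illustration only; the abstract does not give HarmoCLIP's exact supervision term, and the function names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def region_text_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over N matched pairs: region i belongs with text segment i."""
    logits = [[cosine(r, t) / temperature for t in text_feats] for r in region_feats]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    text_to_region = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(text_to_region))
```

The loss is small when each region embedding is closest to its own text segment and grows when regions are closer to the wrong segments.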

Dayou Huang, Feng Xue, Xurui Li, Yu Zhou

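TL;DR: AnoRefiner is an anomaly-aware, group-wise refinement method that substantially improves the pixel-level results of zero-shot industrial anomaly detection (ZSAD) while reducing the reliance on synthetic anomaly data.

Details

Motivation: Existing ZSAD methods produce coarse anomaly maps, and subsequent refinement methods underperform because of the gap between synthetic and real anomalies. Anomaly score maps provide complementary spatial cues that had not been fully exploited.

Result: On the MVTec AD and VisA datasets, AnoRefiner improves the pixel-level AP of ZSAD models by up to 5.2%.

Insight: The spatial information in anomaly score maps complements ZSAD image features and offers a new route to refinement; the progressive group-wise training strategy suits the mass-production setting of real industrial scenarios.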

Abstract: Zero-shot industrial anomaly detection (ZSAD) methods typically yield coarse anomaly maps as vision transformers (ViTs) extract patch-level features only. To solve this, recent solutions attempt to predict finer anomalies using features from ZSAD, but they still struggle to recover fine-grained anomalies without missed detections, mainly due to the gap between randomly synthesized training anomalies and real ones. We observe that anomaly score maps exactly provide complementary spatial cues that are largely absent from ZSAD’s image features, a fact overlooked before. Inspired by this, we propose an anomaly-aware refiner (AnoRefiner) that can be plugged into most ZSAD models and improve patch-level anomaly maps to the pixel level. First, we design an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing the reliance on synthetic anomaly data. Second, motivated by the mass production paradigm, we propose a progressive group-wise test-time training (PGT) strategy that trains ARD in each product group for the refinement process in the next group, while staying compatible with any ZSAD method. Experiments on the MVTec AD and VisA datasets show that AnoRefiner boosts various ZSAD models by up to a 5.2% gain in pixel-AP metrics, which can also be directly observed in many visualizations. The code will be available at https://github.com/HUST-SLOW/AnoRefiner.

[78] GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing cs.CV | cs.AI | cs.HC | cs.LG

Xiaoyin Yang

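TL;DR: GazeTrack presents a high-precision eye tracking method that combines a novel shape error regularization, a coordinate transformation technique, and a high-quality dataset to improve eye tracking accuracy.

Details

Motivation: Current eye tracking accuracy falls short of the requirements of spatial computing applications, so a higher-precision solution is needed.

Result: The proposed gaze vector generation model reduces gaze angle error while lowering computational complexity.

Insight: A high-quality dataset and a novel regularization method are essential for improving eye tracking accuracy, particularly in complex scenes.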

Abstract: Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.

[79] Stable-Drift: A Patient-Aware Latent Drift Replay Method for Stabilizing Representations in Continual Learning cs.CV | cs.LG

Paraskevi-Antonia Theofilou, Anuhya Thota, Stefanos Kollias, Mamatha Thota

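TL;DR: This paper proposes Stable-Drift, a method that quantifies latent drift to identify and replay samples with high representational instability, mitigating catastrophic forgetting in continual learning, with a focus on medical imaging.

Details

Motivation: In medical imaging, models must continually adapt to new data from different hospitals without compromising established diagnostic knowledge, yet conventional methods suffer abrupt performance drops from catastrophic forgetting.

Result: On a cross-hospital COVID-19 CT classification task, the method substantially reduces catastrophic forgetting compared with naive fine-tuning and random replay, improving model stability.

Insight: Latent drift is a practical and interpretable replay signal suited to continual learning in medical settings, and patient-level aggregation is important.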

Abstract: When deep learning models are sequentially trained on new data, they tend to abruptly lose performance on previously learned tasks, a critical failure known as catastrophic forgetting. This challenge severely limits the deployment of AI in medical imaging, where models must continually adapt to data from new hospitals without compromising established diagnostic knowledge. To address this, we introduce a latent drift-guided replay method that identifies and replays samples with high representational instability. Specifically, our method quantifies this instability via latent drift, the change in a sample’s internal feature representation after naive domain adaptation. To ensure diversity and clinical relevance, we aggregate drift at the patient level: our memory buffer stores the per-patient slices exhibiting the greatest multi-layer representation shift. Evaluated on a cross-hospital COVID-19 CT classification task using state-of-the-art CNN and Vision Transformer backbones, our method substantially reduces forgetting compared to naive fine-tuning and random replay. This work highlights latent drift as a practical and interpretable replay signal for advancing robust continual learning in real-world medical settings.
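
The drift-guided buffer selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: drift is taken to be the sum of per-layer L2 distances between features before and after adaptation, and the record keys (`patient_id`, `slice_id`, `before`, `after`) and the `per_patient` parameter are hypothetical.

```python
import math
from collections import defaultdict


def latent_drift(before_layers, after_layers):
    """Sum over layers of the L2 distance between a sample's features
    before and after naive adaptation (assumed drift measure)."""
    return sum(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_after, feat_before)))
        for feat_before, feat_after in zip(before_layers, after_layers)
    )


def build_replay_buffer(samples, per_patient=1):
    """Keep, for each patient, the slices with the largest multi-layer drift.

    samples: list of dicts with hypothetical keys
      patient_id, slice_id, before (list of layer vectors), after (same shape).
    """
    by_patient = defaultdict(list)
    for sample in samples:
        drift = latent_drift(sample["before"], sample["after"])
        by_patient[sample["patient_id"]].append((drift, sample["slice_id"]))

    buffer = {}
    for patient_id, scored in by_patient.items():
        scored.sort(reverse=True)  # highest drift first
        buffer[patient_id] = [slice_id for _, slice_id in scored[:per_patient]]
    return buffer
```

Grouping by patient before ranking keeps every patient represented in the buffer, so a few high-drift patients cannot crowd out the rest.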

[80] REASONEDIT: Towards Reasoning-Enhanced Image Editing Models cs.CV

Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing

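TL;DR: The paper proposes ReasonEdit, a reasoning-enhanced image editing framework whose thinking and reflection mechanisms significantly improve editing performance.

Details

Motivation: Current image editing models rely on a frozen multimodal large language model (MLLM) encoder, which limits their reasoning ability. The paper explores using the MLLM’s reasoning capabilities to further improve editing.

Result: ReasonEdit achieves gains of 4.3%, 4.7%, and 8.2% on ImgEdit, GEdit, and Kris, respectively, and outperforms previous open-source methods.

Insight: The MLLM’s reasoning ability is crucial for complex image editing, and a dynamic, iterative editing loop can markedly reduce errors and improve result quality.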

Abstract: Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from the Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
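
The thinking-editing-reflection loop can be pictured as a small control loop. The sketch below only illustrates the control flow described in the abstract; `think`, `edit`, and `reflect` are hypothetical callables standing in for the MLLM and the diffusion decoder, and the `max_rounds` cap is an assumption.

```python
def reasoning_edit_loop(image, instruction, think, edit, reflect, max_rounds=3):
    """Run think -> (edit -> reflect)* until reflection signals completion.

    think(image, instruction) -> plan
    edit(current_image, plan) -> edited image
    reflect(source_image, edited_image, instruction, plan) -> (done, revised_plan)
    All three callables are placeholders supplied by the caller.
    """
    plan = think(image, instruction)  # interpret the (possibly abstract) instruction
    current = image
    rounds = 0
    for _ in range(max_rounds):
        current = edit(current, plan)  # apply one editing pass
        rounds += 1
        done, plan = reflect(image, current, instruction, plan)
        if done:  # reflection identifies the stopping round
            break
    return current, rounds
```

With toy stand-ins such as string concatenation for `edit`, the loop stops in the round where `reflect` first reports completion, or after `max_rounds` passes.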

[81] GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes cs.CV

Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu

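TL;DR: GeoZero is a multimodal large language model framework that requires no predefined chain-of-thought supervision; it improves reasoning on geospatial scenes through supervised fine-tuning followed by reinforcement learning.

Details

Motivation: Existing approaches to geospatial scene understanding rely on manually curated chain-of-thought data, which is costly and may introduce human bias. GeoZero addresses this with supervised fine-tuning and reinforcement learning that need no chain-of-thought annotations.

Result: On multiple remote sensing vision-language benchmarks, GeoZero surpasses existing state-of-the-art methods and shows general reasoning capabilities.

Insight: Combining supervised fine-tuning with answer-anchored reinforcement learning can increase reasoning diversity and reduce dependence on manual annotation.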

Abstract: Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A$^2$GRPO), where the reasoning process is regularized by the model’s own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at https://github.com/MiliLab/GeoZero.
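
For readers unfamiliar with the "group relative" part of the name, the sketch below shows the advantage normalization used in standard GRPO: rewards for a group of responses to the same prompt are centered and scaled by the group statistics. The answer-anchoring regularizer of A$^2$GRPO is not specified in the abstract and is not modeled here; the function name and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantages for one group of sampled responses.

    Each reward is centered on the group mean and divided by the group
    (population) standard deviation; eps avoids division by zero when all
    responses receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.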

[82] Architecture Decoupling Is Not All You Need For Unified Multimodal Model cs.CV

Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu

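TL;DR: The paper studies task conflicts in unified multimodal models, argues that excessive decoupling sacrifices interleaved generation ability, and proposes an Attention Interaction Alignment (AIA) loss to learn task-specific multimodal interaction patterns.

Details

Motivation: Unified multimodal models face conflicting objectives between image generation and understanding. Existing methods ease the conflict through model decoupling, but at the cost of interleaved generation. The paper aims to resolve task conflicts without relying on decoupling.

Result: AIA not only refines cross-modal attention patterns but also improves both generation and understanding performance.

Insight: Task-specific multimodal interaction patterns are the key to easing the conflict, with no need for model decoupling. The generality of the AIA loss points to its potential for unified multimodal models.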

Abstract: Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherently conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

Silin Cheng, Kai Han

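TL;DR: The paper proposes Variational Multi-Modal Prompt Learning (VaMP), a framework that generates instance-specific prompts by sampling from a learned posterior distribution, enabling sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning.

Details

Motivation: Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or uncertainty across tasks and domains.

Result: VaMP achieves state-of-the-art performance on few-shot and domain generalization benchmarks.

Insight: Modeling uncertainty and task structure offers clear benefits in multi-modal prompt learning.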

Abstract: Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
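
Variational inference with reparameterized sampling, as mentioned in the abstract, typically relies on two ingredients: a differentiable sample from the posterior and a KL term against the prior. The sketch below shows both for diagonal Gaussians. The diagonal-Gaussian form is an assumption (the abstract does not state the distribution family), and the function names are hypothetical.

```python
import math
import random


def sample_prompt(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions.

    Here q would be the instance-conditioned posterior and p the prior
    (class-aware in the paper; an arbitrary Gaussian in this sketch).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total
```

In a training loop the sampled prompt would feed the task loss while the KL term keeps the posterior close to the prior; because the noise enters only through `eps`, gradients can flow to `mu` and `log_var` in an autodiff framework.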

[84] A deep learning perspective on Rubens’ attribution cs.CV

A. Afifi, A. Kalimullin, S. Korchagin, I. Kudryashov

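TL;DR: This paper applies deep learning to the authentication and attribution of paintings by Rubens and his workshop, using a convolutional neural network to identify micro-level stylistic features and showing how computational analysis can complement traditional art history.

Details

Motivation: Authentication and attribution of artworks traditionally depend on expert judgment, whereas deep learning can offer more objective computational support. The paper explores its application to the complex case of Rubens’ paintings.

Result: The model achieves high classification accuracy, supporting the effectiveness of deep learning for painting authentication and offering art history a new analytical perspective.

Insight: Deep learning can assist traditional art history, especially by identifying micro-level stylistic features in complex cases, which sheds light on authorship and workshop collaboration.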

Abstract: This study explores the use of deep learning for the authentication and attribution of paintings, focusing on the complex case of Peter Paul Rubens and his workshop. A convolutional neural network was trained on a curated dataset of verified and comparative artworks to identify micro-level stylistic features characteristic of the master’s hand. The model achieved high classification accuracy and demonstrated the potential of computational analysis to complement traditional art historical expertise, offering new insights into authorship and workshop collaboration.

[85] Emergent Extreme-View Geometry in 3D Foundation Models cs.CV

Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor

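TL;DR: The paper studies how well 3D foundation models (3DFMs) understand geometry under extreme viewpoints, proposes a lightweight alignment scheme to improve their performance, and contributes a new benchmark dataset, MegaUnScene.

Details

Motivation: Although 3DFMs perform well in 3D vision, their ability to reason under extreme, non-overlapping views remains largely unexplored. The paper examines model behavior under these conditions and proposes an improvement.

Result: Experiments show that the lightweight alignment scheme substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality.

Insight: 3DFMs exhibit an understanding of extreme-view geometry without being trained for it, and lightweight tuning can strengthen this ability further.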

Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.

[86] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation cs.CV

Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen

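TL;DR: Ar2Can is a two-stage framework that tackles multi-human generation by separating spatial planning from identity rendering, substantially improving count accuracy and identity consistency in generated images.

Details

Motivation: Existing text-to-image models often duplicate faces, merge identities, or miscount individuals when generating multiple people. Ar2Can aims to address these problems.

Result: On MultiHuman-Testbench, Ar2Can substantially improves count accuracy and identity consistency while maintaining high perceptual quality.

Insight: The method shows that primarily synthetic data can effectively support multi-human generation without requiring real multi-human images.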

Abstract: Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
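
A spatially grounded face matching reward can be sketched in two steps: match detected faces to planned positions, then score identity similarity along that matching. The sketch below is illustrative only. It uses brute-force search over permutations in place of the Hungarian algorithm (both return a minimum-cost assignment, but brute force is practical only for a handful of faces), plain cosine similarity in place of ArcFace embeddings, and an assumed zero reward on count mismatch; all names are hypothetical.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def face_matching_reward(target_centers, ref_embeddings, det_centers, det_embeddings):
    """Mean identity similarity under the spatially optimal face assignment.

    target_centers / ref_embeddings: planned face positions and reference identities.
    det_centers / det_embeddings: faces detected in the generated image.
    """
    n = len(target_centers)
    if n == 0 or len(det_centers) != n:
        return 0.0  # assumed convention: a wrong face count earns no reward

    # Brute-force stand-in for Hungarian matching: minimize total center distance.
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(
            math.dist(target_centers[i], det_centers[j]) for i, j in enumerate(perm)
        ),
    )
    sims = [cosine(ref_embeddings[i], det_embeddings[j]) for i, j in enumerate(best)]
    return sum(sims) / n
```

Because the assignment is decided by position first, a face with the right identity in the wrong place is matched to someone else's reference and lowers the reward.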

[87] Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction cs.CV

Boyao Zhou, Shunyuan Zheng, Zhanfeng Liao, Zihan Ma, Hanzhang Tu

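TL;DR: Splat-SAP is a feed-forward method for novel view rendering of human-centered scenes from sparse binocular camera input. Its two-stage learning strategy (point map reconstruction followed by Gaussian Splatting rendering) requires neither dense input views nor multi-view reconstruction priors.

Details

Motivation: Conventional Gaussian Splatting requires dense input views and per-scene optimization, while existing feed-forward methods depend on multi-view reconstruction priors. Splat-SAP targets efficient rendering from sparse input.

Result: Experiments on multi-view human-centered data show improved stability of point map reconstruction and better visual quality for free-viewpoint rendering.

Insight: Point maps that model each view independently are robust to sparse input, and the two-stage strategy separates geometry reconstruction from rendering, which improves flexibility.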

Abstract: We present Splat-SAP, a feed-forward approach to render novel views of human-centered scenes from binocular cameras with large sparsity. Gaussian Splatting has shown promising potential in rendering tasks, but it typically necessitates per-scene optimization with dense input views. Although some recent approaches achieve feed-forward Gaussian Splatting rendering through geometry priors obtained by multi-view stereo, such approaches still require largely overlapping input views to establish the geometry prior. To bridge this gap, we leverage pixel-wise point map reconstruction to represent geometry, which is robust to large sparsity thanks to its independent view modeling. In general, we propose a two-stage learning strategy. In stage 1, we transform the point map into real space via an iterative affinity learning process, which facilitates camera control in the following stage. In stage 2, we project point maps of two input views onto the target view plane and refine such geometry via stereo matching. Furthermore, we anchor Gaussian primitives on this refined plane in order to render high-quality images. As a metric representation, the scale-aware point map in stage 1 is trained in a self-supervised manner without 3D supervision and stage 2 is supervised with photometric loss. We collect multi-view human-centered data and demonstrate that our method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.

[88] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering cs.CV | cs.AI | cs.CL | cs.MM

Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni

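TL;DR: ReAG is a reasoning-augmented multimodal retrieval-augmented generation approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, significantly improving knowledge-based visual question answering (KB-VQA).

Details

Motivation: Existing MLLMs struggle with domain-specific or knowledge-intensive queries. KB-VQA retrieves external documents to support answer generation, but current methods suffer from low retrieval precision, noisy passages, and limited reasoning.

Result: On Encyclopedic-VQA and InfoSeek, ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.

Insight: A multi-stage training strategy that couples retrieval with reasoning is an effective route to better KB-VQA; the critic model and reinforcement learning help the model make better use of retrieved content.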

Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
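
The retrieval side of such a pipeline (coarse document retrieval, fine passage ranking, then a critic filter) can be sketched as below. This is a generic illustration, not ReAG's implementation: the scoring functions and the critic are hypothetical callables supplied by the caller, and the document layout (a dict with a `passages` list) and the `top_docs` and `top_passages` cutoffs are assumptions.

```python
def retrieve_with_critic(query, documents, coarse_score, fine_score, critic,
                         top_docs=2, top_passages=3):
    """Coarse-to-fine retrieval followed by a critic filter.

    documents: list of dicts, each with a "passages" list (assumed layout).
    coarse_score(query, document) and fine_score(query, passage) return numbers;
    critic(query, passage) returns True when the passage is judged relevant.
    """
    # Coarse stage: keep the best-scoring documents.
    ranked_docs = sorted(documents, key=lambda d: coarse_score(query, d),
                         reverse=True)[:top_docs]

    # Fine stage: rank the passages of the surviving documents.
    passages = [p for doc in ranked_docs for p in doc["passages"]]
    ranked = sorted(passages, key=lambda p: fine_score(query, p),
                    reverse=True)[:top_passages]

    # Critic stage: drop passages judged irrelevant before generation.
    return [p for p in ranked if critic(query, p)]
```

The surviving passages would then be passed to the answer generator as additional context; an empty result signals that no retrieved evidence passed the critic.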

[89] All Centers Are at most a Few Tokens Apart: Knowledge Distillation with Domain Invariant Prompt Tuning cs.CV | cs.AI

Amir Mohammad Ezzati, Alireza Malekhosseini, Armin Khosravi, Mohammad Hossein Rohban

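TL;DR: This paper proposes Domain Invariant Prompt Tuning (DIPT), which improves knowledge distillation through domain-invariant prompts to address domain generalization in computational pathology.

Details

Motivation: Domain shifts in computational pathology (such as differences in staining protocols and scanner devices) limit model generalization. Predefined prompts are sensitive to prompt variations, and histopathology centers lack semantic descriptors, so a data-driven approach is needed to learn domain-invariant prompts.

Result: On domain generalization tasks with histopathology datasets, the method significantly improves the average F1-score over existing knowledge distillation approaches.

Insight: Learning domain-invariant prompts in a data-driven way can effectively mitigate domain shift and make clinical deployment on heterogeneous data sources more robust.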

Abstract: Domain generalization is critical in computational pathology (CPath) due to inherent domain shifts caused by variations in staining protocols, scanner devices, and imaging settings across clinical centers. Vision-language models (VLMs), such as PLIP, a pathology-tuned CLIP trained on image-text pairs across diverse domains, serve as strong knowledge distillation sources. However, their zero-shot performance with predefined prompts remains limited due to sensitivity to prompt variations. Moreover, unlike natural images, histopathology centers lack semantic descriptors (e.g., ‘sketch’), making it difficult to define domain-specific prompts for clinical centers. This requires a data-driven approach for learning domain-specific and ultimately class-generic continuous prompts. We propose Domain Invariant Prompt Tuning (DIPT) for the knowledge distillation process, a novel step that learns multiple input tokens for each domain. These tokens are trained separately for each domain and are averaged across domains, leading to domain-invariant prompts. Our student model then distills knowledge from PLIP’s text encoder by leveraging the prompts learned by DIPT. This leads to alignment of visual features with domain-invariant embeddings, enhancing generalization by training on multiple domains. Our method yields a significant improvement in average F1-score over existing state-of-the-art (SOTA) knowledge distillation approaches in domain generalization with histopathology datasets. This work helps pave the way for deploying robust CPath models in real-world clinical problems with heterogeneous data sources.
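
The averaging step that turns per-domain prompt tokens into a domain-invariant prompt is simple enough to sketch directly. The code below assumes each domain has already learned the same number of token vectors of the same dimension; the function name and the dictionary layout keyed by center name are hypothetical.

```python
def domain_invariant_prompt(domain_prompts):
    """Average learned prompt tokens across domains, token by token.

    domain_prompts: {domain_name: [token_vector, ...]} where every domain
    provides the same number of tokens with the same dimension (assumed).
    """
    prompts = list(domain_prompts.values())
    n_domains = len(prompts)
    n_tokens = len(prompts[0])
    dim = len(prompts[0][0])
    return [
        [sum(prompt[t][d] for prompt in prompts) / n_domains for d in range(dim)]
        for t in range(n_tokens)
    ]
```

For example, two centers with tokens `[[1.0, 2.0]]` and `[[3.0, 4.0]]` give the averaged prompt `[[2.0, 3.0]]`, which would then be fed to the text encoder during distillation.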

[90] Alzheimer’s Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data cs.CV

Mahdieh Behjat Khatooni, Mohsen Soryani

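TL;DR: The paper proposes a multimodal deep learning model combining EffNetViTLoRA and BiLSTM to predict Alzheimer’s disease (AD) from longitudinal MRI data, reaching 95.05% accuracy.

Details

Motivation: Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder, so early prediction is critical. Mild Cognitive Impairment (MCI) is a transitional stage toward AD, but predicting progression from MCI to AD remains difficult. The study aims to improve prediction using multimodal data and deep learning.

Result: The model achieves an average accuracy of 95.05% in distinguishing stable MCI (sMCI) from progressive MCI (pMCI), outperforming existing methods.

Insight: 1. A hybrid architecture combining local and global features (CNN + ViT) is more effective; 2. temporal modeling (BiLSTM) is essential for multimodal longitudinal data; 3. the framework could be extended to early prediction of other neurodegenerative diseases.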

Abstract: Alzheimer’s disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with some other non-image biomarkers, predicting each subject’s cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05% between sMCI and pMCI, outperforming existing studies in AD prediction. This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer’s disease.

[91] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models cs.CVPDF

Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel

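TL;DR: This paper examines how large vision-language models (LVLMs) behave in visual scenes that mix multiple cultures and builds the CultureMix benchmark to evaluate them systematically. Models identify cultural identities inconsistently and rely heavily on background in culture mixing scenes. Supervised fine-tuning substantially improves performance.

Details

Motivation: In a globalized world, cultural elements often appear together in a single visual scene, yet LVLM behavior in such scenes is underexplored. The paper aims to fill this gap by systematically evaluating LVLMs in culture mixing scenarios.

Result: Models perform poorly in culture mixing scenarios and show particularly strong background reliance. Supervised fine-tuning substantially improves performance.

Insight: Culture mixing is a real-world challenge for LVLMs; datasets and methods need further work to make models more reliable.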

Abstract: In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.

[92] Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols cs.CV | cs.CLPDF

Sebastian Padó, Kerstin Thomas

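TL;DR: The paper studies how well current (2025) vision language models (VLMs) recognize emotional expression in artworks, running a qualitative expert evaluation of three VLMs on four question sets of increasing complexity (general content, emotional content, expression of emotions, and emotion symbols).

Details

Motivation: Emotions are a fundamental aspect of artistic expression, but their abstract nature and historical change mean that analysis requires art-historical expertise. The goal is to explore whether VLMs can recognize emotional expression in artworks, including its more complex forms.

Result: VLMs recognize image content and emotions well, especially for concrete images, but perform poorly on abstract or symbolic images and give inconsistent answers.

Insight: VLMs show potential for emotion analysis, but symbol recognition remains unreliable, and the answer inconsistency inherent to LLMs still needs to be addressed.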

Abstract: Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.

[93] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images cs.CV | cs.LG | cs.MMPDF

Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos

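TL;DR: The paper introduces CogIP-Bench, a benchmark for evaluating how well multimodal large language models (MLLMs) understand subjective human cognitive properties of images, and shows that a post-training phase substantially improves alignment with human perception and transfers to downstream tasks.

Details

Motivation: Existing MLLMs are good at recognizing image content but understand little of how an image feels to a human (for example its memorability, humor, aesthetics, or emotional resonance); a systematic solution is needed.

Result: Experiments show that post-training substantially improves alignment with human cognition and that the aligned model can guide generation of images that better match desired traits.

Insight: Aligning with subjective human cognition not only improves model performance but also enables AI applications closer to human needs, such as creative image generation.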

Abstract: While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model’s alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.

[94] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer cs.CVPDF

Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao

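TL;DR: This paper proposes LC4-DViT, a framework that combines generative data creation with a deformable Vision Transformer for high-resolution land-cover classification, substantially improving classification accuracy and generalization.

Details

Motivation: Land-cover classification is critical for ecosystem services and disaster-risk management, but it faces scarce and imbalanced annotations and geometric distortions.

Result: Reaches 95.72% accuracy on the AID dataset, clearly better than ViT and other baselines; cross-dataset experiments confirm good generalization.

Insight: Combining generative augmentation with a deformable Transformer is an effective approach to high-resolution land-cover classification.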

Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID) (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River), DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen’s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT’s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
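
The three metrics reported in the abstract (overall accuracy, macro F1, and Cohen’s Kappa) can all be derived from a confusion matrix. The sketch below uses the standard textbook definitions, not the authors’ evaluation code; the function name `classification_metrics` and the 2x2 matrix are illustrative assumptions.

```python
def classification_metrics(cm):
    """Return (overall accuracy, macro F1, Cohen's kappa).

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. Kappa is undefined when chance agreement is 1,
    a case this sketch does not handle.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    overall_acc = correct / total

    row_sums = [sum(cm[k]) for k in range(n)]                        # true counts
    col_sums = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # predicted counts

    # Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum).
    f1_scores = []
    for k in range(n):
        denom = row_sums[k] + col_sums[k]
        f1_scores.append(2 * cm[k][k] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / n

    # Agreement expected by chance from the marginal distributions.
    chance = sum(row_sums[k] * col_sums[k] for k in range(n)) / total ** 2
    kappa = (overall_acc - chance) / (1 - chance)
    return overall_acc, macro_f1, kappa


# Hypothetical two-class confusion matrix, for illustration only.
oa, macro_f1, kappa = classification_metrics([[8, 2], [1, 9]])
```

Kappa corrects overall accuracy for agreement expected by chance, which is why it sits below overall accuracy in the reported results.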

[95] Captain Safari: A World Engine cs.CVPDF

Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu

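TL;DR: Captain Safari is a pose-conditioned world engine that generates videos by retrieving from a persistent world memory, addressing the consistency and path-following problems of existing systems under 6-DoF trajectories and complex outdoor layouts.

Details

Motivation: Existing world engines perform poorly under aggressive 6-DoF trajectories and complex outdoor layouts, losing long-range geometric coherence or deviating from the target path. Captain Safari addresses this by introducing a pose-conditioned world memory.

Result: Captain Safari outperforms existing methods on video quality, 3D consistency, and trajectory following: MEt3R drops from 0.3703 to 0.3690, AUC@30 rises from 0.181 to 0.200, and FVD is substantially lower. In the human study, 67.6% of preferences favor the method.

Insight: Pose-conditioned world memory is an effective mechanism for long-horizon, controllable video generation, and the release of the OpenSafari dataset supports further world-engine research.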

Abstract: World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.

[96] Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs cs.CVPDF

Tianle Chen, Chaitanya Chakka, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram

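TL;DR: Using MMA-Bench, this paper evaluates the robustness of multimodal large language models (MLLMs) to contradicting modalities, exposes the brittleness of current models’ multimodal reasoning, and proposes a modality alignment tuning strategy to improve reliability.

Details

Motivation: MLLMs have advanced considerably on multimodal tasks, but their robustness to contradicting modalities has not been studied in depth.

Result: Experiments show that current MLLMs are brittle under contradicting modalities and that the proposed alignment strategy substantially improves robustness.

Insight: Multimodal integration needs to adjust modality priority dynamically rather than simply fuse modalities; this is a key path to more reliable MLLMs.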

Abstract: Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model’s reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.

[97] Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach cs.CV | cs.AI | cs.CLPDF

Haruki Sakajo, Hiroshi Takato, Hiroshi Tsutsui, Komei Soda, Hidetaka Kamigaito

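TL;DR: The paper studies large-scale vision language models (LVLMs) for generating safe driving instructions in autonomous driving. Training and evaluating on multi-view video (driver-facing and road-facing), it finds that fine-tuned LVLMs can generate accurate safe driving instructions but still struggle to detect complex events.

Details

Motivation: LVLMs perform well on visual tasks, but their ability to generate safety instructions from multi-view video (such as synchronized driver-facing and road-facing input) has not been adequately validated, especially for detecting complex or subtle events.

Result: Fine-tuned LVLMs can generate accurate safe driving instructions but still fall short at detecting complex or subtle events.

Insight: Synchronized processing of multi-view data is key; future work needs to make LVLMs more robust to complex events.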

Abstract: Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as the use of mobiles while driving. Thus, the ability to process synchronized inputs is necessary from both driver-facing and road-facing cameras. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.

[98] Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering cs.CVPDF

Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim

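TL;DR: The paper identifies a “visual shortcut” problem in Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) and addresses it with a new benchmark, RETINA, and a multimodal retriever, MIMIR.

Details

Motivation: Existing MKB-VQA benchmarks suffer from “visual shortcuts”: the query image usually matches the primary entity of the target document, so models can reach high accuracy from visual cues alone.

Result: Experiments show that RETINA significantly degrades the performance of existing models, confirming their reliance on the visual shortcut; MIMIR performs well on RETINA.

Insight: Visual shortcuts are widespread in multimodal tasks and call for more carefully designed benchmarks and methods. MIMIR’s multi-image augmentation offers a new direction for future work.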

Abstract: Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from “visual shortcuts”, as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.

[99] Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding cs.CVPDF

Keliang Liu, Zizhi Chen, Mingcheng Li, Jingqun Tang, Dingkang Yang

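TL;DR: This paper proposes SLEUTH, a framework that uses multi-agent collaboration to address evidence sparsity in long-document understanding and substantially improves performance.

Details

Motivation: In long-document understanding, clues are scattered across pages and modalities, and redundant input degrades model performance. Existing retrieval-augmented methods filter irrelevant content but still leave redundancy. The paper aims to design a more efficient framework.

Result: Experiments show that SLEUTH reaches SOTA performance on multiple long-document benchmarks and confirm the effectiveness of each module and the benefit of the hierarchical refinement paradigm.

Insight: Hierarchical collaboration and multimodal evidence extraction effectively address sparsity and redundancy in long documents; the framework is model-agnostic and scalable.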

Abstract: Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model’s judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module’s effectiveness and confirm the benefits of our hierarchical refinement paradigm.

[100] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation cs.CVPDF

Fengyi Fang, Sicheng Yang, Wenming Yang

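TL;DR: CoordSpeaker combines gesture captioning with coordinated multimodal control to address the semantic prior gap and the difficulty of coordinated control in gesture generation, proposing a comprehensive framework for high-quality, semantically coherent gesture generation.

Details

Motivation: Existing speech-driven gesture generation methods are limited by the lack of text annotations and the difficulty of coordinated control. CoordSpeaker aims to close these gaps through gesture captioning and coordinated control.

Result: Experiments show that gestures generated by CoordSpeaker surpass existing methods in both rhythm and semantics, with higher efficiency.

Insight: Combining gesture understanding with generation opens a new direction for multimodal control, and the bidirectional gesture-text mapping provides a basis for further research.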

Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker is the first to explore gesture understanding and captioning as a way to tackle the semantic gap in gesture generation, while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

[101] CNN-Based Framework for Pedestrian Age and Gender Classification Using Far-View Surveillance in Mixed-Traffic Intersections cs.CV | cs.IRPDF

Shisir Shahriar Arif, Md. Muhtashim Shahrier, Nazmul Haque, Md Asif Raihan, Md. Hadiuzzaman

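TL;DR: The paper proposes a CNN-based deep learning framework that classifies pedestrian age and gender in real time from far-view surveillance video, without relying on high-resolution imagery or facial recognition.

Details

Motivation: Pedestrian safety is a major concern in mixed-traffic environments, particularly in low- and middle-income countries. Demographic factors such as age and gender significantly affect pedestrian vulnerability, yet real-time monitoring systems rarely capture this information.

Result: ResNet50 with Max Pooling and the SGD optimizer reaches the highest accuracy, 86.19%; the lightweight model comes close at 84.15% with fewer parameters.

Insight: 1. Body-posture cues in far-view surveillance video can support age and gender classification; 2. A lightweight design can be deployed in low-resource settings; 3. Demographic data can inform traffic planning and interventions.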

Abstract: Pedestrian safety remains a pressing concern in congested urban intersections, particularly in low- and middle-income countries where traffic is multimodal, and infrastructure often lacks formal control. Demographic factors like age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture this information. To address this gap, this study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersection footage using convolutional neural networks (CNNs), without relying on facial recognition or high-resolution imagery. The classification is structured as a unified six-class problem, distinguishing adult, teenager, and child pedestrians for both males and females, based on full-body visual cues. Video data was collected from three high-risk intersections in Dhaka, Bangladesh. Two CNN architectures were implemented: ResNet50, a deep convolutional neural network pretrained on ImageNet, and a custom lightweight CNN optimized for computational efficiency. Eight model variants explored combinations of pooling strategies and optimizers. ResNet50 with Max Pooling and SGD achieved the highest accuracy (86.19%), while the custom CNN performed comparably (84.15%) with fewer parameters and faster training. The model’s efficient design enables real-time inference on standard surveillance feeds. For practitioners, this system provides a scalable, cost-effective tool to monitor pedestrian demographics at intersections using existing camera infrastructure. Its outputs can shape intersection design, optimize signal timing, and enable targeted safety interventions for vulnerable groups such as children or the elderly. By offering demographic insights often missing in conventional traffic data, the framework supports more inclusive, data-driven planning in mixed-traffic environments.
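
The abstract frames the task as one six-class problem rather than two separate classifiers. A minimal sketch of such a joint label encoding follows; the class ordering and the helper names `encode_label` and `decode_label` are assumptions for illustration, since the abstract does not specify how the six classes are indexed.

```python
# Assumed (hypothetical) ordering of the two attributes.
AGE_GROUPS = ("adult", "teenager", "child")
GENDERS = ("male", "female")


def encode_label(age_group, gender):
    """Map an (age group, gender) pair to a single class index in 0..5."""
    return GENDERS.index(gender) * len(AGE_GROUPS) + AGE_GROUPS.index(age_group)


def decode_label(index):
    """Invert encode_label: class index -> (age group, gender)."""
    gender_idx, age_idx = divmod(index, len(AGE_GROUPS))
    return AGE_GROUPS[age_idx], GENDERS[gender_idx]


all_classes = [decode_label(i) for i in range(len(AGE_GROUPS) * len(GENDERS))]
```

With a joint label, a single softmax head over six outputs predicts both attributes at once.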

[102] DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking cs.CVPDF

Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Qiannan Guo

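TL;DR: The paper proposes DM$^3$T, a new framework that uses a diffusion model to align multimodal features, improving the accuracy and robustness of multi-object tracking.

Details

Motivation: Multi-object tracking (MOT) that fuses visible-light and thermal-infrared modalities suffers from a non-linear distribution gap between modalities; conventional methods fuse them poorly, which hurts tracking performance.

Result: Achieves 41.7 HOTA on the VT-MOT benchmark, a 1.54% relative improvement over the existing best methods.

Insight: The iterative refinement of diffusion models is well suited to multimodal feature alignment, reducing modality conflicts and extracting complementary information.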

Abstract: Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM$^3$T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at https://vranlee.github.io/DM-3-T/.

Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu

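TL;DR: P2C proposes a new framework that extends multimodal prompt learning from a static point representation to a distributional representation, a semantic cloud, and improves robustness and generalization through a dynamic denoising task.

Details

Motivation: Existing multimodal prompt learning methods are limited to optimizing a static point representation, which tends to overfit and generalize poorly; a more robust way of learning semantic distributions is needed.

Result: Experiments on 11 datasets show that P2C clearly outperforms baselines, reaching a harmonic mean of 79.7% on base-to-novel generalization, a 1.4% relative improvement.

Insight: Learning a distributional semantic cloud is better than a static point representation, and a dynamic denoising task effectively improves the robustness and cross-modal alignment of multimodal prompt learning.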

Abstract: Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
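
The abstract reports a Harmonic Mean and a relative improvement. In base-to-novel evaluation the harmonic mean is conventionally taken over base-class and novel-class accuracy; the sketch below assumes that convention, and the input numbers are made up for illustration rather than taken from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, as a fraction."""
    return (new_score - old_score) / old_score


# Illustrative values only: a method strong on base classes but weak
# on novel classes is pulled toward the weaker of the two scores.
h = harmonic_mean(80.0, 60.0)          # about 68.57
gain = relative_improvement(82.0, 80.0)  # 0.025, i.e. 2.5% relative
```

The harmonic mean penalizes imbalance between the two accuracies, so a method cannot score well by overfitting the base classes.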

[104] Leveraging Textual Compositional Reasoning for Robust Change Captioning cs.CV | cs.AIPDF

Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim

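TL;DR: The paper proposes the CORTEX framework, which combines visual and textual features to make change captioning more robust, compensating for existing methods’ reliance on visual features alone.

Details

Motivation: Existing change captioning methods rely on visual features alone and struggle to capture subtle but meaningful changes because they cannot represent explicitly structured information such as object relationships and compositional semantics.

Result: CORTEX captures changes that are hard to identify from visual features alone and generates more accurate descriptions.

Insight: Fusing visual and textual features captures image changes more completely and is particularly suited to recognizing subtle changes.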

Abstract: Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.

[105] See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection cs.CVPDF

YuEun Lee, Jung Uk Kim

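TL;DR: This paper proposes an important-word-aware video clip filtering method that uses multimodal large language models (MLLMs) to strengthen scene understanding, substantially improving video moment retrieval (MR) and highlight detection (HD).

Details

Motivation: Existing methods overlook the importance of individual words when processing video and text queries, treating them as a black box, which limits fine-grained contextual understanding. The paper aims for finer-grained clip filtering by capturing the key words in the query.

Result: Experiments show the method surpasses existing state-of-the-art methods on both MR and HD.

Insight: Capturing the important words in a query is key to improving video understanding tasks, and integrating MLLMs opens new possibilities for fine-grained scene understanding.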

Abstract: Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in video clips. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.

[106] ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance cs.CVPDF

Congjia Chen, Shen Yan, Yufu Qu

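TL;DR: ViGG proposes an RGB-D point cloud registration method based on visual-geometric mutual guidance, combining geometric guidance with visual priors to improve registration robustness and accuracy.

Details

Motivation: Existing RGB-D registration methods rely mainly on geometric information or simple feature fusion and do not fully exploit image information, limiting practical applicability.

Result: Outperforms existing methods on the 3DMatch, ScanNet, and KITTI datasets in both learning-free and learning-based settings.

Insight: The complementarity of visual and geometric information can be exploited through a mutual guidance strategy, further improving the robustness and generality of RGB-D registration.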

Abstract: Point cloud registration is a fundamental task in 3D vision. Most existing methods only use geometric information for registration. Recently proposed RGB-D registration methods primarily focus on feature fusion or improving feature learning, which limits their ability to exploit image information and hinders their practical applicability. In this paper, we propose ViGG, a robust RGB-D registration method using mutual guidance. First, we solve clique alignment in a visual-geometric combination form, employing a geometric guidance design to suppress ambiguous cliques. Second, to mitigate accuracy degradation caused by noise in visual matches, we propose a visual-guided geometric matching method that utilizes visual priors to determine the search space, enabling the extraction of high-quality, noise-insensitive correspondences. This mutual guidance strategy brings our method superior robustness, making it applicable for various RGB-D registration tasks. The experiments on 3DMatch, ScanNet and KITTI datasets show that our method outperforms recent state-of-the-art methods in both learning-free and learning-based settings. Code is available at https://github.com/ccjccjccj/ViGG.

[107] Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling cs.CVPDF

Minyoung Kim, Paul Hongsuck Seo

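TL;DR: The paper proposes ReImage, a framework that uses neural watermarking and pixel shuffling for image self-recovery in response to digital media tampering.

Details

Motivation: With the rapid growth of AI-generated content, the authenticity of digital media is a growing concern. Existing methods recover tampered regions poorly and fall short of the core goal of self-recovery.

Result: ReImage achieves state-of-the-art performance across diverse tampering scenarios and produces high-quality recovered images.

Insight: Combining shuffled watermarks with neural watermarking can substantially improve the accuracy and robustness of image self-recovery.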

Abstract: The rapid growth of Artificial Intelligence-Generated Content (AIGC) raises concerns about the authenticity of digital media. In this context, image self-recovery, reconstructing original content from its manipulated version, offers a practical solution for understanding the attacker’s intent and restoring trustworthy data. However, existing methods often fail to accurately recover tampered regions, falling short of the primary goal of self-recovery. To address this challenge, we propose ReImage, a neural watermarking-based self-recovery framework that embeds a shuffled version of the target image into itself as a watermark. We design a generator that produces watermarks optimized for neural watermarking and introduce an image enhancement module to refine the recovered image. We further analyze and resolve key limitations of shuffled watermarking, enabling its effective use in self-recovery. We demonstrate that ReImage achieves state-of-the-art performance across diverse tampering scenarios, consistently producing high-quality recovered images. The code and pretrained models will be released upon publication.
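
The abstract says ReImage embeds a shuffled version of the target image into itself. The sketch below illustrates only the invertible pixel-shuffling idea on a flattened toy image. It is not the ReImage method, whose watermark comes from a learned generator; the seeded permutation and the helper names are assumptions.

```python
import random


def make_permutation(n_pixels, seed):
    """Build a reproducible pixel permutation from a shared seed."""
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)
    return perm


def shuffle_pixels(pixels, perm):
    """Position dst of the output takes the pixel at perm[dst]."""
    return [pixels[src] for src in perm]


def unshuffle_pixels(shuffled, perm):
    """Invert shuffle_pixels using the same permutation."""
    restored = [None] * len(perm)
    for dst, src in enumerate(perm):
        restored[src] = shuffled[dst]
    return restored


image = list(range(16))                 # flattened 4x4 toy image
perm = make_permutation(len(image), seed=42)
watermark = shuffle_pixels(image, perm)
recovered = unshuffle_pixels(watermark, perm)
```

Because the permutation scatters pixel positions, a contiguous tampered patch in the carrier would typically damage watermark pixels that map back to locations spread across the original image, not to one block.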

[108] Barcode and QR Code Object Detection: An Experimental Study on YOLOv8 Models cs.CV

Kushagra Pandya, Heli Hathi, Het Buch, Ravikumar R N, Shailendrasinh Chauhan

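TL;DR: The paper studies the performance of YOLOv8 on barcode and QR code detection in depth, running experiments with models of different sizes (Nano, Small, Medium) and reporting the accuracy gains obtained from model optimization.

Details

Motivation: The study explores how YOLOv8 performs at real-time barcode and QR code detection, with the goal of optimizing its detection accuracy and speed.

Result: The Nano model reaches 88.95% accuracy, the Small model 97.10%, and the Medium model 94.10%, indicating that model optimization significantly improves detection performance.

Insight: The study shows how scaling model size affects object detection performance, offering new insight into deep learning for computer vision.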

Abstract: This work presents an in-depth evaluation of the YOLOv8 (You Only Look Once) algorithm for object detection, focusing on barcode and QR code recognition. Building on the real-time detection capability of YOLOv8, we conducted a study aimed at improving its ability to identify objects quickly and correctly. Through extensive training and fine-tuning on Kaggle datasets tailored to barcode and QR code detection, our goal was to optimize YOLOv8’s overall performance across varied situations and environments. The study assesses YOLOv8 across three model sizes (Nano, Small, and Medium), with close attention to precision, recall, and F1 metrics. The results show large improvements in object detection accuracy with each subsequent model refinement. Specifically, we achieved an accuracy of 88.95% for the Nano model, 97.10% for the Small model, and 94.10% for the Medium model, showcasing the incremental improvements achieved through model scaling. Our findings highlight the strides YOLOv8 has made in advancing computer vision, supporting its role as a milestone in object detection. This study sheds light on how model scaling affects object recognition and broadens the understanding of deep learning-based computer vision techniques.
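Illustrative sketch: the abstract evaluates detectors with precision, recall, and F1. The snippet below computes the three metrics from detection counts using their standard definitions. The counts (90 true positives, 10 false positives, 30 false negatives) are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one class at a fixed IoU threshold.
precision, recall, f1 = detection_metrics(tp=90, fp=10, fn=30)
```

F1 is the harmonic mean of precision and recall, so it penalizes a detector that trades many missed codes for few false alarms (or the reverse).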

[109] DenoiseGS: Gaussian Reconstruction Model for Burst Denoising cs.CV

Yongsen Cheng, Yuanhao Cai, Yulun Zhang

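TL;DR: DenoiseGS uses 3D Gaussian Splatting for efficient burst denoising, proposing a Gaussian self-consistency loss and a log-weighted frequency loss; it significantly outperforms existing methods and greatly speeds up inference.

Details

Motivation: Existing burst denoising methods struggle with large motion or incur high computational cost, so an efficient, detail-preserving denoising framework is needed.

Result: DenoiseGS significantly outperforms NeRF-based methods on burst denoising and novel view synthesis, with 250× faster inference.

Insight: Combining geometric regularization with frequency-domain supervision effectively improves denoising quality, and 3D Gaussian Splatting has great potential for real-time applications.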

Abstract: Burst denoising methods are crucial for enhancing images captured on handheld devices, but they often struggle with large motion or suffer from prohibitive computational costs. In this paper, we propose DenoiseGS, the first framework to leverage the efficiency of 3D Gaussian Splatting for burst denoising. Our approach addresses two key challenges when applying a feedforward Gaussian reconstruction model to noisy inputs: the degradation of Gaussian point clouds and the loss of fine details. To this end, we propose a Gaussian self-consistency (GSC) loss, which regularizes the geometry predicted from noisy inputs with high-quality Gaussian point clouds. These point clouds are generated from clean inputs by the same model that we are training, thereby alleviating potential bias or domain gaps. Additionally, we introduce a log-weighted frequency (LWF) loss to strengthen supervision within the spectral domain, effectively preserving fine-grained details. The LWF loss adaptively weights frequency discrepancies in a logarithmic manner, emphasizing challenging high-frequency details. Extensive experiments demonstrate that DenoiseGS significantly exceeds the state-of-the-art NeRF-based methods on both burst denoising and novel view synthesis under noisy conditions, while achieving 250× faster inference speed. Code and models are released at https://github.com/yscheng04/DenoiseGS.

[110] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer cs.CV

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang

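TL;DR: The paper proposes One-to-All Animation, a unified framework that addresses spatial misalignment between reference and target poses in character animation and pose transfer. A self-supervised outpainting task, an identity feature extractor, and hybrid reference fusion attention, among other components, markedly improve generation quality.

Details

Motivation: Existing diffusion-based methods rely on spatially aligned reference-pose pairs, which limits their use across diverse layouts. To address this, the authors propose a more general framework.

Result: Experiments show the method outperforms existing approaches on character animation and pose transfer.

Insight: Decoupling appearance from skeletal structure effectively mitigates pose overfitting, and the token replace strategy helps generate coherent long videos.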

Abstract: Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at https://github.com/ssj9596/One-to-All-Animation.

[111] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video cs.CV | cs.RO

Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou

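TL;DR: RobotSeg is a robot segmentation foundation model built on SAM 2. It tackles the challenges of robot segmentation with a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy, and performs well on the VRS dataset.

Details

Motivation: Robot segmentation underpins robotic perception, but existing segmentation models struggle with robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes.

Result: RobotSeg achieves state-of-the-art performance on both images and videos.

Insight: Combining structure enhancement with automatic prompt generation addresses the particular challenges of robot segmentation, while the label-efficient strategy reduces annotation cost.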

Abstract: Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.

[112] Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records cs.CV

Shiyu Shen, Zhe Gao, Taifeng Chai, Yang Huang, Bin Pan

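TL;DR: The paper presents SolarCHIP, a family of contrastively pretrained visual backbones for multi-instrument observations from the Solar Dynamics Observatory (SDO). A multi-granularity contrastive objective addresses multimodal sensing, weak inter-class separability, and strong intra-class variability in solar imaging, and the models achieve state-of-the-art performance on cross-modal translation and flare classification.

Details

Motivation: Existing deep learning methods usually train from scratch or rely on natural-image pretraining, ignoring the unique characteristics of SDO data. A pretraining method tailored to SDO data is therefore needed to improve task performance and label efficiency.

Result: State-of-the-art performance on cross-modal translation and flare classification, with particularly strong results in low-resource settings.

Insight: Contrastive pretraining contributes discriminative capacity at each granularity, yielding a general-purpose feature extractor for solar imaging tasks.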

Abstract: Deep learning has revolutionized solar image analysis, yet most approaches train task-specific encoders from scratch or rely on natural-image pretraining that ignores the unique characteristics of Solar Dynamics Observatory (SDO) data. We introduce SolarCHIP, a family of contrastively pretrained visual backbones tailored to multi-instrument SDO observations. SolarCHIP addresses three key challenges in solar imaging: multimodal sensing across AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals. Our pretraining framework employs a multi-granularity contrastive objective that jointly aligns (1) global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, (2) local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and (3) intra-sample patches across different spatial locations to preserve fine-grained spatial structure. We train both CNN- and Vision Transformer-based autoencoders and demonstrate their effectiveness on two downstream tasks: cross-modal translation between HMI and AIA passbands via ControlNet, and full-disk flare classification. Experimental results show that SolarCHIP achieves state-of-the-art performance across both tasks, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at different granularities. By publicly releasing pretrained weights and training code, we provide the heliophysics community with a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and establishes a reusable foundation for diverse solar imaging applications.
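Illustrative sketch: the abstract describes aligning global tokens across co-temporal AIA-HMI pairs with a contrastive objective but does not give the loss. The code below shows a generic InfoNCE loss, a common choice for this kind of pairing, under the assumption that embedding `i` from one instrument should match embedding `i` from the other. The temperature, the toy two-dimensional embeddings, and the function names are hypothetical, and the patch-level terms of SolarCHIP are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchors[i] should rank positives[i] above all other positives."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy global embeddings for two time steps (hypothetical values).
aia_tokens = [[1.0, 0.0], [0.0, 1.0]]
hmi_aligned = [[0.9, 0.1], [0.1, 0.9]]   # co-temporal pairs agree
hmi_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(aia_tokens, hmi_aligned)
loss_swapped = info_nce(aia_tokens, hmi_swapped)
```

The loss is small when each AIA embedding is closest to its own HMI partner and large when the pairing is swapped, which is the behavior a temporal-discrimination objective needs.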

[113] HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model cs.CV

Chen Li, Eric Peh, Basura Fernando

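TL;DR: The paper proposes HMR3D, a novel hierarchical multimodal representation that improves 3D scene understanding with large vision-language models by explicitly aligning multi-view images and text descriptions.

Details

Motivation: Existing VLM-based methods usually align 3D scene features with the VLM's embedding space implicitly, but performance suffers because 3D data is scarce and spatial relationships are complex.

Result: The approach is effective on both situated and general 3D question-answering benchmarks.

Insight: Explicit alignment and hierarchical feature representation better capture the complex spatial relationships of 3D scenes and compensate for data scarcity.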

Abstract: Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM’s embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
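Illustrative sketch: the abstract says patch-level image features are aggregated into view-level and scene-level representations but does not name the aggregation operator. The code below assumes plain mean pooling at both levels; the view names, feature values, and function names are hypothetical.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def build_hierarchy(views):
    """Map {view name: patch features} to (view-level features, scene-level feature)."""
    view_level = {name: mean_pool(patches) for name, patches in views.items()}
    scene_level = mean_pool(list(view_level.values()))
    return view_level, scene_level


# Two views with two 2-D patch features each (toy values).
views = {
    "top_down": [[1.0, 0.0], [3.0, 2.0]],
    "forward": [[0.0, 4.0], [2.0, 0.0]],
}
view_level, scene_level = build_hierarchy(views)
```

Keeping all three levels lets a downstream model attend to local detail (patches), per-view context, and a single global scene summary.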

[114] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation cs.CV

Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang

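TL;DR: BlockVid proposes a block-diffusion framework for generating high-quality, consistent long videos. It addresses long-horizon error accumulation and the lack of fine-grained evaluation in existing methods, and experiments demonstrate its advantage.

Details

Motivation: Generating long videos is a key step toward world models, but existing semi-autoregressive methods suffer from long-horizon error accumulation and inadequate evaluation.

Result: On VBench and LV-Bench, BlockVid significantly outperforms existing methods; on LV-Bench, for example, VDE Subject and VDE Clarity improve by 22.2% and 19.4%, respectively.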

Insight:

Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over the state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.

[115] McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning cs.CV

Qiushi Yang, Yingjie Chen, Yuan Yao, Yifang Men, Huaizhuo Liu

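TL;DR: McSc is a three-stage reinforcement learning framework that models and aligns preferences to address the difficulty of aligning video generation with subjective human preference, producing high-quality videos with strong motion dynamics.

Details

Motivation: Existing video preference alignment methods rely on costly human annotation or proxy metrics, lack an understanding of human preference logic, and tend to overlook potentially conflicting dimensions (such as motion dynamics versus visual quality), which biases models toward low-motion content. McSc aims to address these issues.

Result: McSc performs strongly on human preference alignment and generates high-quality videos with high motion dynamics.

Insight: 1. Preference modeling should be decomposed into individual dimensions; 2. Dynamically re-weighting the alignment objective avoids low-motion bias; 3. Hierarchical reward supervision improves preference modeling.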

Abstract: Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lacks the understanding of human preference logic. Moreover, they usually directly align T2V models with the overall preference distribution, ignoring potential conflict dimensions like motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. Firstly, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Secondly, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high-motion dynamic.
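Illustrative sketch: McDPO is described as Direct Preference Optimization with a dynamically re-weighted objective, but the abstract does not give the weighting rule. The code below implements the standard per-pair DPO loss and exposes a `weight` argument as a hypothetical stand-in for that re-weighting; the log-probabilities and `beta` are made-up values.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1, weight=1.0):
    """Standard DPO loss for one (preferred, rejected) pair, scaled by a per-pair weight.

    `weight` is a hypothetical placeholder for a motion-aware re-weighting term.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    softplus_neg_margin = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * softplus_neg_margin


# Policy identical to the reference: margin is zero, loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy has shifted probability toward the preferred video: loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)

# Up-weighting a pair (for example, one whose preferred video has more motion).
upweighted = dpo_loss(-0.5, -2.0, -1.0, -1.0, weight=2.0)
```

A larger weight makes a pair contribute more to the gradient, which is one simple way an objective could counteract a bias toward low-motion content.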

[116] Ovis-Image Technical Report cs.CV | cs.AI

Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen

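TL;DR: Ovis-Image is a 7B-parameter text-to-image model focused on high-quality text rendering; it is computationally efficient and performs close to larger or closed-source models.

Details

Motivation: Models with high-quality text rendering usually demand heavy compute and are hard to deploy; the work aims to address this.

Result: Text rendering performance is on par with larger open models (such as Qwen-Image) and approaches closed-source systems (such as Seedream).

Insight: A strong multimodal backbone and a targeted training recipe can deliver high-quality text rendering in a compact architecture.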

Abstract: We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.

[117] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation cs.CV

Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki

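TL;DR: The paper introduces MultiBanana, a benchmark for multi-reference image generation that fills gaps in the diversity and task definitions of existing benchmarks and evaluates models along multiple dimensions.

Details

Motivation: Existing benchmarks are usually limited to generation from one or a few reference images and cannot comprehensively assess model performance and weaknesses under multi-reference conditions. Vague task definitions also hold back deeper study.

Result: The analysis reveals model strengths, typical failure modes, and areas for improvement, providing a standardized basis for evaluating multi-reference image generation.

Insight: Diversity and complexity in multi-reference generation are key challenges for model performance; MultiBanana provides a benchmark and a direction for future research.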

Abstract: Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, the existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or pointing out model weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as “what to edit” or “how many references are given”, and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana.

[118] Guiding Visual Autoregressive Models through Spectrum Weakening cs.CV

Chaoyang Wang, Tianmeng Yang, Jingdong Wang, Yunhai Tong

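TL;DR: The paper proposes guiding visual autoregressive models through spectrum weakening; the method requires no retraining or architectural changes and supports high-quality unconditional and conditional generation.

Details

Motivation: Existing guidance mechanisms for unconditional generation are tied to assumptions specific to diffusion models, which limits their use with autoregressive models. The paper aims to provide a general guidance framework for visual autoregressive models.

Result: Experiments show the method achieves high-quality unconditional generation and strong condition alignment on both discrete and continuous autoregressive models.

Insight: Invertible spectral transformations preserve information, while retaining only a subset of the spectrum yields a controlled reduction of information, offering a new approach to guiding autoregressive models.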

Abstract: Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of the spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
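Illustrative sketch: the abstract's two claims (an invertible spectral transform preserves information; keeping only part of the spectrum removes a controlled amount) can be checked on a toy channel vector. The code assumes an orthonormal DCT-II as the transform and a "keep the first k coefficients" rule, neither of which is stated in the abstract; the `guide` function is a generic strong-versus-weak extrapolation in the style of classifier-free guidance, not the paper's formula.

```python
import math


def _scale(k, n):
    """Orthonormal DCT-II scaling factor for coefficient k of an n-point transform."""
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II of a real vector."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct (the transpose of the orthonormal DCT-II matrix)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
        for t in range(n)
    ]


def weaken(x, keep):
    """Zero every spectral coefficient with index >= keep, then transform back."""
    coeffs = dct(x)
    return idct([c if k < keep else 0.0 for k, c in enumerate(coeffs)])


def guide(strong, weak, scale):
    """Extrapolate away from the weakened output (hypothetical guidance rule)."""
    return [s + scale * (s - w) for s, w in zip(strong, weak)]


features = [0.5, -1.0, 2.0, 0.25]        # toy channel activations
weak = weaken(features, keep=2)           # controlled information reduction
guided = guide(features, weak, scale=1.5)
```

Because the transform is orthonormal, the energy of the weakened vector equals the energy of the retained coefficients, so `keep` gives direct control over how much information is discarded.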

[119] Optimizer Sensitivity in Vision Transformer-based Iris Recognition: AdamW vs SGD vs RMSprop cs.CV | stat.CO

Moh Imam Faiz, Aviv Yuniar Rahman, Rangga Pahlevi Putra

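TL;DR: The paper examines how different optimizers (AdamW, SGD, RMSprop) affect the accuracy and stability of Vision Transformer-based iris recognition models.

Details

Motivation: As digital identity systems expand, the security of biometric authentication grows more important. Iris recognition is highly reliable thanks to its distinctive texture patterns, but the effect of optimizer choice on ViT-based systems remains unclear.

Result: The results indicate that optimizer choice leads to significant differences in the performance of ViT iris recognition models, with some optimizers achieving better accuracy and stability.

Insight: Optimizer choice is critical to ViT model performance and should be matched to the requirements of the task.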

Abstract: The security of biometric authentication is increasingly critical as digital identity systems expand. Iris recognition offers high reliability due to its distinctive and stable texture patterns. Recent progress in deep learning, especially Vision Transformers (ViT), has improved visual recognition performance. Yet, the effect of optimizer choice on ViT-based biometric systems remains understudied. This work evaluates how different optimizers influence the accuracy and stability of ViT for iris recognition, providing insights to enhance the robustness of biometric identification models.

Minseong Kweon, Janghyun Kim, Ukcheol Shin, Jinsun Park

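TL;DR: MrGS is a multi-modal radiance field based on 3D Gaussian Splatting that reconstructs RGB and thermal infrared scenes simultaneously. Orthogonal feature extraction and physics-based modeling of heat conduction reduce the number of Gaussians and improve reconstruction quality.

Details

Motivation: Existing NeRF and 3DGS methods mostly focus on RGB scenes and neglect the distinctive characteristics of the thermal infrared modality (such as heat conduction and the Lambertian property); a method supporting high-quality multi-modal reconstruction is needed.

Result: Experiments show that MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.

Insight: The physical properties of the thermal infrared modality (such as heat conduction) are crucial to multi-modal reconstruction and can be modeled effectively with physical laws; orthogonal feature extraction efficiently supports multi-modal data fusion.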

Abstract: Recent advances in Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved considerable performance in RGB scene reconstruction. However, multi-modal rendering that incorporates thermal infrared imagery remains largely underexplored. Existing approaches tend to neglect distinctive thermal characteristics, such as heat conduction and the Lambertian property. In this study, we introduce MrGS, a multi-modal radiance field based on 3DGS that simultaneously reconstructs both RGB and thermal 3D scenes. Specifically, MrGS derives RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction and employs view-dependent or view-independent embedding strategies depending on the degree of Lambertian reflectance exhibited by each modality. Furthermore, we leverage two physics-based principles to effectively model thermal-domain phenomena. First, we integrate Fourier’s law of heat conduction prior to alpha blending to model intensity interpolation caused by thermal conduction between neighboring Gaussians. Second, we apply the Stefan-Boltzmann law and the inverse-square law to formulate a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering. Experimental results demonstrate that the proposed MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.
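Illustrative sketch: the abstract names the Stefan-Boltzmann law and the inverse-square law as ingredients of a depth-aware thermal radiation map but does not say how they are combined. The code below only evaluates the two textbook laws separately; the emissivity default, the isotropic point-source assumption, and the sample temperatures and distances are assumptions for illustration.

```python
import math

# Stefan-Boltzmann constant in W m^-2 K^-4.
STEFAN_BOLTZMANN = 5.670374419e-8


def radiant_exitance(temperature_k, emissivity=1.0):
    """Power radiated per unit area of a surface at the given absolute temperature."""
    return emissivity * STEFAN_BOLTZMANN * temperature_k ** 4


def irradiance_from_point_source(power_w, distance_m):
    """Irradiance at a distance from an isotropic point source (inverse-square falloff)."""
    return power_w / (4.0 * math.pi * distance_m ** 2)


# Doubling the temperature multiplies the emitted power per area by 2**4 = 16.
exitance_300k = radiant_exitance(300.0)
exitance_600k = radiant_exitance(600.0)

# Doubling the distance divides the received irradiance by 4.
near = irradiance_from_point_source(power_w=100.0, distance_m=1.0)
far = irradiance_from_point_source(power_w=100.0, distance_m=2.0)
```

The strong temperature dependence and the depth-dependent falloff are what make these laws usable as geometric constraints: a thermal reading is informative about both how hot a surface is and how far away it sits.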

[121] JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization cs.CV

Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong

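TL;DR: JarvisEvo is a self-evolving image editing agent. Through multimodal reasoning and editor-evaluator policy optimization, it addresses instruction hallucination and reward hacking, markedly improving editing quality and content fidelity.

Details

Motivation: Instruction hallucination and reward hacking in existing agent-based editing models limit gains in editing quality and interactive experience.

Result: On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a 44.96% improvement in pixel-level content fidelity.

Insight: Multimodal reasoning and dynamic self-evaluation are an effective route to resolving instruction hallucination and reward hacking in agent-based editing.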

Abstract: Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination, text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking, dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity.

[122] From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning cs.CV | cs.AI

Changpeng Wang, Haozhe Wang, Xi Chen, Junhan Liu, Taofeng Xue

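TL;DR: The paper proposes Visual Rationale Learning (ViRL), which treats visual actions as core reasoning primitives rather than optional tools and uses end-to-end reinforcement learning to build task-agnostic, transparent vision-language reasoning models.

Details

Motivation: Existing vision-language reasoning models rely on context-agnostic visual actions, so their reasoning is not truly grounded in visual evidence.

Result: ViRL achieves state-of-the-art results on perception, hallucination, and reasoning benchmarks.

Insight: Visual rationalization is a general paradigm for building transparent, verifiable vision-language models.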

Abstract: Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to “get the right answer for the right visual reason”. Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.

[123] GOATex: Geometry & Occlusion-Aware Texturing cs.CV

Hyunjin Kim, Kunho Kim, Adam Lee, Wonkwang Lee

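TL;DR: GOATex is a diffusion-based method for texturing 3D meshes that targets the limitations of existing methods on occluded interior regions; with layered visibility control and soft UV-space blending, it generates high-quality seamless textures.

Details

Motivation: Existing methods texture visible regions well but lack mechanisms for occluded interiors, resulting in incomplete textures and visible seams. GOATex aims to fill this gap.

Result: Experiments show that GOATex generates high-quality seamless textures on both visible and occluded surfaces and outperforms existing methods.

Insight: Layered visibility control and soft UV blending are key to texturing occluded regions, and high fidelity is achievable without fine-tuning the pretrained model.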

Abstract: We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture’s contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances. For more qualitative results, please visit our project page: https://goatex3d.github.io/.
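Illustrative sketch: the abstract describes soft UV-space blending that weighs each layer's texture by a view-dependent visibility confidence, without giving the formula. The code below assumes the simplest version, a confidence-normalized weighted average per texel; the function name, the RGB tuples, and the confidence values are hypothetical.

```python
def blend_texel(colors, confidences, eps=1e-8):
    """Blend per-layer RGB colors for one texel using normalized confidence weights.

    Returns None when no layer sees the texel with meaningful confidence.
    """
    total = sum(confidences)
    if total < eps:
        return None
    return tuple(
        sum(w * color[channel] for w, color in zip(confidences, colors)) / total
        for channel in range(3)
    )


# One texel covered by an exterior layer (red) and an interior layer (blue).
layer_colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
layer_confidences = [0.75, 0.25]
blended = blend_texel(layer_colors, layer_confidences)

# A texel that no layer observes confidently is left unfilled.
unseen = blend_texel(layer_colors, [0.0, 0.0])
```

Weights that vary smoothly across the UV map avoid the hard switches between layers that would otherwise show up as seams.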

[124] Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation cs.CV | cs.AI

Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini

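TL;DR: The study evaluates the clinical impact of generative inpainting on bone age and gender prediction from pediatric hand radiographs, finding that although inpainted images look realistic, they significantly degrade AI model performance.

Details

Motivation: Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether generative inpainting preserves the key features. The paper tests the reliability of inpainting for medical AI tasks.

Result: Inpainting raised the mean absolute error (MAE) of bone age estimation from 6.26 to 30.11 months and lowered gender classification AUC from 0.955 to 0.704, a marked drop in performance.

Insight: Visually realistic inpainting can obscure clinically relevant features and introduce latent bias, underscoring the importance of task-specific validation in clinical AI workflows.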

Abstract: Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
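Illustrative sketch: the study reports mean absolute error (MAE) for bone age and area under the ROC curve (AUC) for gender classification. The code below computes both from their standard definitions, with AUC expressed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. All ages, labels, and scores are hypothetical and are not the study's data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical bone ages in months: ground truth versus model predictions.
mae_months = mean_absolute_error([120, 96, 150], [126, 90, 150])

# Hypothetical binary labels and classifier scores.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Computing both metrics on the original and the inpainted version of the same radiographs, as the study does, isolates the effect of the edit from the effect of the model.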

[125] Buffer replay enhances the robustness of multimodal learning under missing-modality cs.CV | cs.LG

Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang

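TL;DR: The paper proposes REplay Prompting (REP), which builds modality-wise feature buffers, decouples private and shared features, and uses task-aware dynamic initialization to markedly improve the robustness and performance of multimodal models when modalities are missing.

Details

Motivation: Existing approaches to missing modalities either require computationally expensive modality synthesis or rely only on adjacent-layer features and overlook long-distance contextual information, which degrades performance. REP aims to solve the problem in a lighter and more effective way.

Result: On vision-language, vision-language-audio, and temporal multimodal tasks, REP outperforms existing methods in both single- and multi-modality missing scenarios, with negligible parameter overhead.

Insight: Feature buffers and long-distance context are key to robust multimodal learning; private-shared feature decoupling helps balance modality-specific signals and cross-modal semantics.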

Abstract: Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.

[126] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models cs.CV | cs.AIPDF

Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen

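TL;DR: SpaceMind is a vision-language model for 3D spatial reasoning from RGB inputs alone. It uses a Camera-Guided Modality Fusion module in place of shallow feature fusion and substantially improves task performance.

Details

Motivation: Large vision-language models show strong multimodal understanding but still fall short on 3D spatial reasoning (e.g., distance estimation and size comparison), and existing methods depend on auxiliary 3D information or shallow feature fusion.

Result: SpaceMind reaches state-of-the-art performance on VSI-Bench, SQA3D, and SPBench, surpassing both open and proprietary systems by large margins on VSI-Bench and SPBench.

Insight: Camera-guided modality fusion is an effective and practical inductive bias that strengthens the spatial reasoning of vision-language models.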

Abstract: Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.

[127] NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing cs.CVPDF

Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang

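TL;DR: NumeriKontrol introduces a Numeric Adapter that gives instruction-based image editing precise numeric control, enabling zero-shot, multi-condition adjustment with continuous scalar values.

Details

Motivation: Text instructions alone lack fine-grained control over edit intensity, so they struggle to meet users' needs for precise adjustment.

Result: NumeriKontrol delivers accurate, continuous, and stable scalar control across a wide range of attribute editing scenarios.

Insight: Adding numeric control improves the precision and scalability of instruction-based editing and suggests a new direction for interactive image editing.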

Abstract: Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.

[128] MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning? cs.CV | cs.LGPDF

Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang, Yangfu Zhu

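TL;DR: The paper presents the MathSight benchmark, which probes how much visual information really contributes to vision-language models' university-level mathematical reasoning, and finds that the contribution of visual input diminishes as problem difficulty increases.

Details

Motivation: Existing benchmarks do not isolate the role of the visual modality in mathematical reasoning, so it is unclear whether models genuinely use visual understanding or merely rely on linguistic priors.

Result: Experiments show that the contribution of visual information diminishes with increasing problem difficulty; Qwen3-VL without any image input outperforms both its multimodal variants and GPT-5.

Insight: Future models need reasoning that is more genuinely grounded in vision, and MathSight provides a benchmark for evaluating it.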

Abstract: Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors. To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants – original, hand-drawn, photo-captured – and a text-only condition for controlled comparison. Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.

[129] Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning cs.CVPDF

Zibo Zhou, Zhengjun Zhai, Huimin Chen, Wei Dai, Hansen Yang

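TL;DR: The paper proposes ACIEC, an image emotion classification method based on multiple affective captions. It classifies from pure text to bridge the affective gap and combines a hierarchical contrastive loss with emotional-attribute chain-of-thought reasoning.

Details

Motivation: Pre-trained visual models are limited by the affective gap in image emotion classification, while psychology research indicates that language can effectively close this gap, so the authors classify emotion from textual information.

Result: The method achieves superior results on multiple benchmarks and effectively bridges the affective gap.

Insight: Language plays an important role in image emotion classification; combining textual and visual information can markedly improve model performance.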

Abstract: Image emotion classification (IEC) is a longstanding research field that has received increasing attention with the rapid progress of deep learning. Although recent advances have leveraged the knowledge encoded in pre-trained visual models, their effectiveness is constrained by the “affective gap”, which limits the applicability of pre-training knowledge for IEC tasks. It has been demonstrated in psychology that language exhibits high variability, encompasses diverse and abundant information, and can effectively eliminate the “affective gap”. Inspired by this, we propose a novel Affective Captioning for Image Emotion Classification (ACIEC) to classify image emotion based on pure texts, which effectively capture the affective information in the image. In our method, a hierarchical multi-level contrastive loss is designed for detecting emotional concepts from images, while an emotional attribute chain-of-thought reasoning is proposed to generate affective sentences. Then, a pre-trained language model is leveraged to synthesize emotional concepts and affective sentences to conduct IEC. Additionally, a contrastive loss based on semantic similarity sampling is designed to solve the problem of large intra-class differences and small inter-class differences in affective datasets. Moreover, we also take the images with embedded texts into consideration, which were ignored by previous studies. Extensive experiments illustrate that our method can effectively bridge the affective gap and achieve superior results on multiple benchmarks.

[130] DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation cs.CVPDF

Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu

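TL;DR: DualCamCtrl is a dual-branch diffusion model for camera-controlled video generation. It jointly generates RGB and depth sequences and uses a semantics-guided mutual alignment mechanism to improve scene understanding and geometric consistency.

Details

Motivation: Existing methods represent camera poses as ray-based conditions but lack sufficient scene understanding and geometric awareness. DualCamCtrl targets this limitation.

Result: Experiments show that DualCamCtrl yields more consistent camera motion, reducing camera motion errors by more than 40% compared with prior methods.

Insight: Early and late denoising stages play complementary roles in forming global structure and refining local details.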

Abstract: This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/

[131] InstanceV: Instance-Level Video Generation cs.CVPDF

Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi

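TL;DR: InstanceV is an instance-level video generation framework that uses an Instance-aware Masked Cross-Attention mechanism and a Shared Timestep-Adaptive Prompt Enhancement module to achieve fine-grained control together with global semantic consistency.

Details

Motivation: Existing text-to-video models rely solely on textual conditions and lack fine-grained control over video generation; InstanceV aims to address this.

Result: InstanceV outperforms existing state-of-the-art models in instance-level control, general video quality, and instance-aware metrics.

Insight: Introducing instance-level information and spatially aware guidance markedly improves fine-grained control and consistency in video generation.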

Abstract: Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation on instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.

[132] Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding cs.CVPDF

Jin-Seop Lee, SungJoon Lee, SeongJun Jung, Boyang Li, Jee-Hyong Lee

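TL;DR: The paper proposes Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) for handling hard-irrelevant queries in video temporal grounding (VTG). It combines several reward objectives to improve relevance discrimination and fine-grained semantic reasoning, and validates the method on the newly constructed HI-VTG dataset.

Details

Motivation: Existing VTG models assume that the query is always relevant, so they still predict a target segment for irrelevant queries. Existing approaches can only reject queries that are entirely unrelated to the video and fail on hard-irrelevant queries that are semantically similar but not actually relevant.

Result: The method is shown to be effective in hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings, and it extends to various LVLM-based VTG models.

Insight: Reinforcement learning with a multi-objective reward design can markedly improve how VTG models handle hard-irrelevant queries, and dataset construction is a key ingredient.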

Abstract: Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models. Our code is available at https://github.com/JINSUBY/RA-RFT.
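
As a rough illustration of two ingredients named in the abstract, the sketch below computes a temporal IoU between segments and the group-relative advantages that GRPO-style methods derive from a group of sampled responses. The function names, the use of `None` to stand for a refusal, the exact reward rule, and the choice of population standard deviation are all assumptions made for this example; the paper's actual reward definitions may differ.

```python
from statistics import mean, pstdev

def temporal_iou(pred, target):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0

def refuse_iou_reward(pred, target):
    """Hypothetical reward: None means 'refuse' / 'no relevant segment'."""
    if pred is None or target is None:
        # Reward 1 only when the model refuses exactly when it should.
        return 1.0 if pred is target else 0.0
    return temporal_iou(pred, target)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (assumed GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to a hard-irrelevant query (correct answer: refuse).
samples = [None, (2.0, 5.0), None, (0.0, 9.0)]
rewards = [refuse_iou_reward(p, None) for p in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group, the two refusals receive positive advantages and the two predicted segments receive negative ones, which is the direction of pressure a refusal-aware reward is meant to create.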

[133] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection cs.CV | cs.AIPDF

Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang

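TL;DR: REVEAL is a reinforcement-learning-based framework for explainable AI-generated image detection. It builds a chain of evidence from multimodal expert models and substantially improves detection accuracy and explanation fidelity.

Details

Motivation: With the rapid progress of generative models, AI-generated images are hard to distinguish from real ones, creating an urgent need for efficient and explainable image forensics. Existing methods rely on surface-level pattern matching and lack causally grounded explanations and generalization.

Result: Experiments show that REVEAL significantly outperforms existing methods in detection accuracy, explanation fidelity, and cross-model generalization.

Insight: An explicit chain of evidence combined with a reinforcement learning reward mechanism allows detection and explanation to be optimized jointly, improving the interpretability and trustworthiness of AI-generated image detection.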

Abstract: With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models and records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, benchmarking a new state of the art for explainable image forensics.

[134] PowerCLIP: Powerset Alignment for Contrastive Pre-Training cs.CVPDF

Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota

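TL;DR: PowerCLIP is a contrastive pre-training framework based on powerset alignment. It uses efficient non-linear aggregators to tame the computational cost of compositional semantic alignment and improves zero-shot classification and retrieval performance.

Details

Motivation: Existing CLIP-style frameworks handle fine-grained alignment (individual text tokens with image patches) well but struggle to capture compositional semantics that span multiple image regions, so a more efficient compositional alignment method is needed.

Result: Experiments show that PowerCLIP outperforms existing methods on zero-shot classification and retrieval tasks, supporting its compositionality and robustness.

Insight: Combining powerset alignment with non-linear aggregators offers an effective way around the high complexity of compositional semantic alignment and may generalize to other multimodal tasks.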

Abstract: Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
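
To give some intuition for how a quantity defined over all 2^M subsets can sometimes be computed in O(M), the sketch below uses a textbook identity: the sum over all subsets of the product of per-region weights equals the product of (1 + w_i). This is only an analogy chosen for illustration; it is not the paper's non-linear aggregator, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod

def powerset_sum_bruteforce(weights):
    """Sum of prod(subset) over all 2^M subsets: exponential in M."""
    total = 0.0
    for k in range(len(weights) + 1):
        for subset in combinations(weights, k):
            total += prod(subset)  # prod(()) is 1 for the empty subset
    return total

def powerset_sum_linear(weights):
    """Same value via the identity sum_S prod_{i in S} w_i = prod_i (1 + w_i)."""
    return prod(1.0 + w for w in weights)

region_weights = [0.5, 2.0, 3.0]
slow = powerset_sum_bruteforce(region_weights)
fast = powerset_sum_linear(region_weights)
```

Both functions return the same value for any list of weights; only the second scales linearly with the number of regions.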

[135] Fast Multi-view Consistent 3D Editing with Video Priors cs.CVPDF

Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang

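TL;DR: The paper proposes ViP3DE, which uses the temporal consistency priors of pre-trained video generation models to perform multi-view consistent 3D editing in a single forward pass, avoiding the slow, over-smoothed results of traditional iterative methods.

Details

Motivation: Existing text-driven 3D editing methods lack multi-view consistency priors and typically resort to iterative 2D-3D-2D updating, which is inefficient and produces over-smoothed results.

Result: ViP3DE produces high-quality 3D edits within a single forward pass and significantly outperforms existing methods in both speed and editing quality.

Insight: The temporal consistency priors of video generation models can effectively replace traditional multi-view iterative editing, improving both the efficiency and quality of 3D editing.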

Abstract: Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.

[136] GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation cs.CVPDF

Yuhao Wan, Lijuan Liu, Jingzhi Zhou, Zihan Zhou, Xuying Zhang

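TL;DR: GeoWorld is a new image-to-3D scene generation method that unlocks the potential of geometry models. It first generates consecutive video frames and then uses geometry features extracted by a geometry model, addressing the geometric distortion and blur of existing methods.

Details

Motivation: Existing image-to-3D scene generation methods built on video models suffer from geometric distortions and blurry content. GeoWorld aims to use geometry models to improve geometric consistency and detail fidelity.

Result: Experiments show that GeoWorld generates high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively.

Insight: Geometry features from a geometry model can effectively improve the geometric consistency and detail quality of 3D scene generation, and introducing geometric constraints markedly reduces distortion.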

Abstract: Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models and present our GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then take advantage of the geometry model to provide full-frame geometry features, which contain richer information than single-frame depth maps or camera embeddings used in previous methods, and use these geometry features as geometrical conditions to aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss to provide the model with real-world geometric constraints and a geometry adaptation module to ensure the effective utilization of geometry features. Extensive experiments show that our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.

[137] Vision Bridge Transformer at Scale cs.CV | cs.AIPDF

Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang

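TL;DR: ViBT is a large-scale conditional generation model that directly models the trajectory from inputs to outputs, giving an efficient data-to-data translation paradigm, and demonstrates strong image and video translation at 20B and 1.3B parameters.

Details

Motivation: Traditional diffusion models generate data from noise, which is comparatively inefficient. Bridge Models instead model the input-to-output trajectory directly to improve generation efficiency.

Result: ViBT performs well on instruction-based image editing and complex video translation tasks, supporting the efficiency of Bridge Models.

Insight: Directly modeling the input-to-output trajectory is more efficient than traditional noise-to-data translation, and a large-scale Transformer architecture is key to achieving this.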

Abstract: We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
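
For readers unfamiliar with bridge processes, the sketch below samples an intermediate state of a standard Brownian bridge pinned at an input `x0` (t = 0) and an output `x1` (t = 1), using the textbook marginal with mean (1 - t)·x0 + t·x1 and variance σ²·t·(1 - t). The function name and the `sigma` parameter are assumptions for this example; ViBT's exact parameterization and its variance-stabilized objective are not specified in the abstract and are not reproduced here.

```python
import math
import random

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Draw x_t from a Brownian bridge between x0 (t=0) and x1 (t=1)."""
    # Mean interpolates linearly between the two endpoints.
    mean = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # Noise vanishes at both endpoints and peaks at t = 0.5.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]

source = [0.0, 1.0]   # stands in for an input image or frame
target = [2.0, 3.0]   # stands in for the edited or translated output
midpoint = brownian_bridge_sample(source, target, 0.5)
```

Unlike a noise-to-data diffusion path, both ends of this trajectory are data, which is the data-to-data property the abstract highlights.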

[138] Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation cs.CVPDF

Jose Moises Araya-Martinez, Gautham Mohan, Kenichi Hayakawa Bolaños, Roberto Mendieta, Sarvenaz Sardari

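TL;DR: The paper proposes a zero-shot, multi-criteria visual quality inspection framework based on real-time 3D digital twin simulation for semi-controlled industrial environments, aiming at zero-defect manufacturing and reduced production waste.

Details

Motivation: Early-stage visual quality inspection is vital for zero-defect manufacturing in modern industry, but existing systems are complex and data-hungry, which hinders adoption in semi-controlled industrial settings.

Result: In a use case on quality inspection of an automotive axial flux motor, the framework reaches intersection-over-union (IoU) scores of up to 63.3%, supporting its effectiveness in semi-controlled industrial environments.

Insight: The study lays the groundwork for generalizable, low-data defect detection methods in dynamic manufacturing settings.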

Abstract: Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performance with intersection-over-union (IoU) scores of up to 63.3% against ground-truth masks, even when using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
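
The IoU metric reported above can be sketched as follows for binary defect masks. This is a minimal, generic implementation written for illustration; the mask layout (nested lists of 0/1) and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two equally sized binary masks."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0   # pixel flagged in both masks
            union += 1 if (p or t) else 0    # pixel flagged in either mask
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0

predicted_defect = [[1, 1, 0],
                    [0, 1, 0]]
ground_truth     = [[1, 0, 0],
                    [0, 1, 1]]
score = mask_iou(predicted_defect, ground_truth)
```

Here two pixels are shared and four are flagged in at least one mask, so the score is 0.5.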

[139] Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day cs.CVPDF

Milad Abdollahzadeh, Abdul Raheem, Zilong Zhao, Uzair Javaid, Kevin Yee

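TL;DR: The paper is the first to explore instruction tuning as a way to improve large language models (LLMs) on tabular data generation, and proposes an efficient, low-cost approach.

Details

Motivation: Tabular instruction tuning has emerged as a promising way to improve LLMs' understanding of tabular data, but existing work focuses on question answering and reasoning over tables and overlooks tabular data generation. This paper aims to fill that gap.

Result: Experiments show that the method's tabular data generation performance is on par with the most capable commercial LLM, GPT-4o.

Insight: A high-quality instruction dataset and an efficient tuning recipe can substantially lower the barrier to improving LLMs' tabular data generation, suggesting a direction for future research.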

Abstract: Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs' tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.

[140] DAONet-YOLOv8: An Occlusion-Aware Dual-Attention Network for Tea Leaf Pest and Disease Detection cs.CVPDF

Yefeng Wu, Shan Wan, Ling Wu, Yecheng Zhao

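TL;DR: DAONet-YOLOv8 is an improved model for tea leaf pest and disease detection. It adds a dual-attention fusion module, an occlusion-aware detection head, and a dynamic synthesis convolution module, which together improve detection performance.

Details

Motivation: Complex backgrounds, variable illumination, and frequent occlusion in tea plantations degrade the performance of existing detectors.

Result: On a real-world dataset the model achieves 92.97% precision and 92.80% recall, outperforming the YOLOv8n baseline.

Insight: Combining attention mechanisms with learning of occlusion relationships effectively improves detection in complex scenes.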

Abstract: Accurate detection of tea leaf pests and diseases in real plantations remains challenging due to complex backgrounds, variable illumination, and frequent occlusions among dense branches and leaves. Existing detectors often suffer from missed detections and false positives in such scenarios. To address these issues, we propose DAONet-YOLOv8, an enhanced YOLOv8 variant with three key improvements: (1) a Dual-Attention Fusion Module (DAFM) that combines convolutional local feature extraction with self-attention based global context modeling to focus on subtle lesion regions while suppressing background noise; (2) an occlusion-aware detection head (Detect-OAHead) that learns the relationship between visible and occluded parts to compensate for missing lesion features; and (3) a C2f-DSConv module employing dynamic synthesis convolutions with multiple kernel shapes to better capture irregular lesion boundaries. Experiments on our real-world tea plantation dataset containing six pest and disease categories demonstrate that DAONet-YOLOv8 achieves 92.97% precision, 92.80% recall, 97.10% mAP@50 and 76.90% mAP@50:95, outperforming the YOLOv8n baseline by 2.34, 4.68, 1.40 and 1.80 percentage points respectively, while reducing parameters by 16.7%. Comparative experiments further confirm that DAONet-YOLOv8 achieves superior performance over mainstream detection models.
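
The abstract gives DAONet-YOLOv8's scores and its gains in percentage points, so the YOLOv8n baseline values are implied rather than stated. The short calculation below recovers them by subtraction; the dictionary names are made up for this example, and the derived baseline figures are arithmetic consequences of the abstract's numbers, not values quoted from the paper.

```python
# Scores reported for DAONet-YOLOv8 (percent).
reported = {"precision": 92.97, "recall": 92.80,
            "mAP@50": 97.10, "mAP@50:95": 76.90}

# Reported improvements over the YOLOv8n baseline (percentage points).
gains = {"precision": 2.34, "recall": 4.68,
         "mAP@50": 1.40, "mAP@50:95": 1.80}

# Implied baseline = reported score minus the gain, rounded to 2 decimals.
implied_baseline = {k: round(reported[k] - gains[k], 2) for k in reported}

for metric, value in implied_baseline.items():
    print(f"{metric}: baseline {value:.2f}% -> DAONet-YOLOv8 {reported[metric]:.2f}%")
```

The largest implied gap is in recall, which is consistent with the abstract's emphasis on reducing missed detections under occlusion.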

[141] Language-guided 3D scene synthesis for fine-grained functionality understanding cs.CVPDF

Jaime Corsetti, Francesco Giuliari, Davide Boscaini, Pedro Hermosilla, Andrea Pilzer

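TL;DR: SynthFun3D is a task-based 3D scene synthesis method that automatically generates functional 3D scenes from language descriptions, addressing the scarcity of real data and enabling large-scale generation of high-quality annotated data.

Details

Motivation: Real-world 3D functionality understanding tasks (such as finding the functional element needed to complete an action) are limited by the high cost of data collection and annotation; SynthFun3D aims to fill this gap with synthetic data.

Result: User studies show better scene-prompt coherence than other approaches; quantitative results show that the synthetic data can replace real data with minor performance loss or supplement it for improved performance.

Insight: Generating large-scale annotated 3D scenes synthetically is feasible and can substantially reduce data collection costs, supporting data-hungry 3D applications.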

Abstract: Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to “Open the second drawer of the cabinet near the bed”), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: github.com/tev-fbk/synthfun3d.

[142] Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering cs.CVPDF

Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen

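TL;DR: The paper proposes MRRE, a training-free method that injects precomputed vectors during inference to improve the reasoning of LLMs and LVLMs in non-English languages while preserving input-output language consistency.

Details

Motivation: Existing LLMs and LVLMs perform well in English but degrade markedly in low-resource languages, raising fairness concerns. Conventional approaches rely on multilingual training or translation tools, which are resource-intensive and sensitive to translation quality.

Result: Experiments on six advanced LLMs and LVLMs show that MRRE improves non-English reasoning by 5.48% on average and by up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.

Insight: MRRE offers an efficient, resource-light way to address the low-resource-language bottleneck of LLMs and LVLMs and illustrates the potential of representation engineering.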

Abstract: Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
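
The idea of adding precomputed vectors to hidden states at chosen layers can be sketched on a toy model as below. Everything here is an assumption made for illustration: the layers are plain Python functions, the vectors are hand-written, the layer indices are arbitrary, and injection is a simple scaled addition; the paper's actual vectors, layer choices, and injection rule are not given in the abstract.

```python
def add_vec(hidden, vec, alpha=1.0):
    """Shift a hidden state by a scaled steering vector."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

def run_with_injection(hidden, layers, injections):
    """Apply layers in order; after layer i, add injections[i] if present."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i in injections:
            hidden = add_vec(hidden, injections[i])
    return hidden

# Toy 3-layer "model": each layer doubles every activation.
layers = [lambda h: [2.0 * x for x in h]] * 3

reasoning_vec = [1.0, 0.0]   # hypothetical cross-lingual reasoning vector
anchoring_vec = [0.0, -1.0]  # hypothetical target-language anchoring vector

plain = run_with_injection([1.0, 1.0], layers, {})
steered = run_with_injection([1.0, 1.0], layers,
                             {0: reasoning_vec, 2: anchoring_vec})
```

The early injection is amplified by the later layers while the late one shifts the output directly, which mirrors the sequential "steer first, anchor last" ordering the abstract describes.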

[143] Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes cs.CV | cs.AIPDF

Silvia Zuffi

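TL;DR: The paper proposes a new method for predicting aboveground biomass (AGB) from a single ground-based RGB image, training the model on a synthetic 3D forest dataset to estimate AGB efficiently.

Details

Motivation: Traditional AGB estimation relies on labor-intensive field measurements or remote sensing techniques that are limited in dense vegetation. The method aims to provide a low-cost, scalable alternative based on RGB images.

Result: The model achieves a median error of 1.22 kg/m² on held-out SPREAD data and 1.94 kg/m² on a real-image dataset.

Insight: The method shows the potential of RGB images for forest monitoring, opens the door to citizen science initiatives, and offers an interpretable and cost-effective solution.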

Abstract: Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor-intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning-based method for estimating AGB from a single ground-based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree’s image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per-image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held-out SPREAD data and 1.94 kg/m^2 on a real-image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost-effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.
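
One reading of the AGB density map described above is that each tree's biomass is spread evenly over its mask pixels and divided by the plot area, so that summing the map recovers biomass per unit area. The sketch below follows that reading; the function names, the nested-list mask format, and the example numbers are assumptions for illustration, and the allometric step that would produce per-tree biomass is omitted.

```python
def agb_density_map(instance_mask, tree_agb_kg, plot_area_m2):
    """Per-pixel density: tree AGB / (plot area * number of pixels of that tree)."""
    pixel_count = {}
    for row in instance_mask:
        for tree_id in row:
            if tree_id:  # 0 is background
                pixel_count[tree_id] = pixel_count.get(tree_id, 0) + 1
    return [
        [tree_agb_kg[t] / (plot_area_m2 * pixel_count[t]) if t else 0.0
         for t in row]
        for row in instance_mask
    ]

def integrate(density_map):
    """Sum the map to recover the scene-level AGB estimate (kg/m^2)."""
    return sum(sum(row) for row in density_map)

mask = [[1, 1, 0],
        [2, 0, 0]]                 # two trees; tree 1 covers two pixels
agb = {1: 120.0, 2: 60.0}          # hypothetical per-tree biomass in kg
density = agb_density_map(mask, agb, plot_area_m2=100.0)
total = integrate(density)
```

With these numbers the integrated map gives 180 kg over 100 m², that is 1.8 kg/m², the same unit in which the abstract reports its errors.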

[144] FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting cs.CV | cs.GRPDF

Tianhao Xie, Linlian Jiang, Xinxin Zuo, Yang Wang, Tiberiu Popa

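TL;DR: FACT-GS proposes a frequency-aligned, complexity-aware texture reparameterization for 2D Gaussian Splatting that uses adaptive sampling density to make better use of texture space.

Details

Motivation: Conventional texture parameterization in Gaussian Splatting uses a uniform sampling grid that ignores local visual complexity, so high-frequency regions are under-sampled while smooth regions waste capacity, degrading fine detail.

Result: Under the same parameter budget, FACT-GS recovers sharper high-frequency detail while maintaining real-time rendering performance.

Insight: Adaptive sampling theory can make texture parameterization more efficient; sampling density should be aligned with local visual complexity.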

Abstract: Realistic scene appearance modeling has advanced rapidly with Gaussian Splatting, which enables real-time, high-quality rendering. Recent advances introduced per-primitive textures that incorporate spatial color variations within each Gaussian, improving their expressiveness. However, texture-based Gaussians parameterize appearance with a uniform per-Gaussian sampling grid, allocating equal sampling density regardless of local visual complexity. This leads to inefficient texture space utilization, where high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of fine structural detail. We introduce FACT-GS, a Frequency-Aligned Complexity-aware Texture Gaussian Splatting framework that allocates texture sampling density according to local visual frequency. Grounded in adaptive sampling theory, FACT-GS reformulates texture parameterization as a differentiable sampling-density allocation problem, replacing the uniform textures with a learnable frequency-aware allocation strategy implemented via a deformation field whose Jacobian modulates local sampling density. Built on 2D Gaussian Splatting, FACT-GS performs non-uniform sampling on fixed-resolution texture grids, preserving real-time performance while recovering sharper high-frequency details under the same parameter budget.

[145] A Perceptually Inspired Variational Framework for Color Enhancement cs.CVPDF

Rodrigo Palma-Amestoy, Edoardo Provenzi, Marcelo Bertalmío, Vicent Caselles

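TL;DR: The paper proposes a variational framework for color enhancement grounded in human color perception, designing perceptually inspired energy functionals and reducing computational complexity.

Details

Motivation: The behavior of existing color enhancement algorithms is hard to characterize in terms of image features such as contrast and dispersion; the basic phenomenology of human color perception can guide the design of a more principled model.

Result: The proposed method is shown to be effective for color enhancement, and a general methodology reduces its computational cost from O(N²) to O(N log N).

Insight: The phenomenology of human perception can effectively guide the design of color enhancement algorithms; reducing computational complexity is key to practical use.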

Abstract: Basic phenomenology of human color vision has been widely taken as an inspiration to devise explicit color correction algorithms. The behavior of these models in terms of significant image features (such as contrast and dispersion) can be difficult to characterize. To cope with this, we propose to use a variational formulation of color contrast enhancement that is inspired by the basic phenomenology of color perception. In particular, we devise a set of basic requirements to be fulfilled by an energy to be considered as "perceptually inspired", showing that there is an explicit class of functionals satisfying all of them. We single out three explicit functionals that we consider of basic interest, showing similarities and differences with existing models. The minima of such functionals are computed using a gradient descent approach. We also present a general methodology to reduce the computational cost of the algorithms under analysis from ${\cal O}(N^2)$ to ${\cal O}(N\log N)$, where $N$ is the number of input pixels.

[146] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes cs.CVPDF

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang

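TL;DR: The paper proposes UniGeoSeg, a framework for unified open-world segmentation of remote sensing images. Its core contributions are GeoSeg-1M, the first million-scale dataset for instruction-driven remote sensing segmentation; GeoSeg-Bench, a challenging evaluation benchmark; and UniGeoSeg, a unified framework that supports multi-task learning.

Details

Motivation: Existing instruction-driven segmentation methods for remote sensing suffer from fragmented task formulations and limited instruction data, which restricts understanding and generalization. The authors propose a unified solution.

Result: Experiments show that UniGeoSeg achieves state-of-the-art performance on GeoSeg-Bench and several public benchmarks, with strong zero-shot generalization.

Insight: Large, diverse datasets and a unified framework design are key to improving instruction-driven remote sensing segmentation. Multi-task learning and knowledge-sharing mechanisms also help generalization.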

Abstract: Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.

[147] Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories cs.CV | cs.AIPDF

Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han

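TL;DR: Re-MeanFlow proposes an efficient one-step generative modeling framework that models the mean velocity field along rectified trajectories, avoiding the computational cost of multiple reflow iterations, and adds a truncation heuristic to further reduce residual curvature.

Details

Motivation: Existing flow-based generative models rely on expensive numerical integration at sampling time. Rectified Flow enables one-step sampling but is costly because it needs multiple reflow iterations, while MeanFlow converges slowly and receives noisy supervision on highly curved flows.

Result: On ImageNet at 64x64, 256x256, and 512x512 resolution, Re-MeanFlow outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency.

Insight: A single reflow step combined with mean-velocity modeling can substantially improve efficiency without sacrificing performance, and the truncation heuristic plays a key role in reducing residual curvature.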

Abstract: Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at https://github.com/Xinxi-Zhang/Re-MeanFlow.
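
To see why average velocity matters for one-step generation, here is a small numerical sketch. It assumes a trajectory parameterized by time in [0, 1] with a known velocity along the path; the two toy velocity functions and the endpoint values are illustrative and are not taken from the paper.

```python
import numpy as np

def average_velocity(velocity_fn, r, t, steps=1000):
    """Approximate (1 / (t - r)) * integral_r^t v(tau) dtau with the midpoint rule."""
    taus = r + (np.arange(steps) + 0.5) * (t - r) / steps
    return np.mean([velocity_fn(tau) for tau in taus], axis=0)

x0 = np.array([0.0, 0.0])   # start point of the trajectory
x1 = np.array([2.0, -1.0])  # end point of the trajectory

def straight_velocity(tau):
    # Path x(tau) = x0 + tau * (x1 - x0): velocity is constant.
    return x1 - x0

def curved_velocity(tau):
    # Path x(tau) = x0 + tau**2 * (x1 - x0): velocity changes along the path.
    return 2.0 * tau * (x1 - x0)

# One jump with the average velocity reaches x1 on both paths ...
one_step_straight = x0 + (1.0 - 0.0) * average_velocity(straight_velocity, 0.0, 1.0)
one_step_curved = x0 + (1.0 - 0.0) * average_velocity(curved_velocity, 0.0, 1.0)

# ... while one Euler step with the instantaneous velocity at tau = 0
# only works on the straight path.
euler_curved = x0 + curved_velocity(0.0)
```

On the straight path the average and instantaneous velocities coincide, which is the intuition behind learning the mean velocity on trajectories that a reflow step has already made closer to straight.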

[148] A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors cs.CVPDF

Vinh Chau, Khoa Le Dinh Van, Hon Huynh Ngoc, Binh Nguyen Thien, Hao Nguyen Thien

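TL;DR: The paper proposes a hierarchical computer vision pipeline for extracting physiological data from bedside monitors, combining YOLOv11 and PaddleOCR to detect monitors and digitize their readings with high accuracy.

Details

Motivation: In low-resource healthcare settings, bedside monitors lack network connectivity, so physiological data cannot be integrated seamlessly into electronic health record systems. The method bridges this gap with a low-cost solution.

Result: On a dataset of 6,498 images, the model reaches 99.5% mAP for monitor detection and 91.5% mAP for ROI localization, and end-to-end extraction accuracy for core physiological parameters exceeds 98.9%.

Insight: The method offers a workable solution for low-resource healthcare settings and shows that a lightweight, camera-based approach can reliably convert screen content into structured data.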

Abstract: In many low-resource healthcare settings, bedside monitors remain standalone legacy devices without network connectivity, creating a persistent interoperability gap that prevents seamless integration of physiological data into electronic health record (EHR) systems. To address this challenge without requiring costly hardware replacement, we present a computer vision-based pipeline for the automated capture and digitisation of vital sign data directly from bedside monitor screens. Our method employs a hierarchical detection framework combining YOLOv11 for accurate monitor and region of interest (ROI) localisation with PaddleOCR for robust text extraction. To enhance reliability across variable camera angles and lighting conditions, a geometric rectification module standardizes the screen perspective before character recognition. We evaluated the system on a dataset of 6,498 images collected from open-source corpora and real-world intensive care units in Vietnam. The model achieved a mean Average Precision (mAP@50-95) of 99.5% for monitor detection and 91.5% for vital sign ROI localisation. The end-to-end extraction accuracy exceeded 98.9% for core physiological parameters, including heart rate, oxygen saturation (SpO2), and arterial blood pressure. These results demonstrate that a lightweight, camera-based approach can reliably transform unstructured information from screen captures into structured digital data, providing a practical and scalable pathway to improve information accessibility and clinical documentation in low-resource settings.

[149] SimScale: Learning to Drive via Real-World Simulation at Scale cs.CV | cs.ROPDF

Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu

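TL;DR: SimScale proposes a scalable simulation framework that uses neural rendering with a reactive environment to generate high-fidelity multi-view observations and pairs them with a pseudo-expert trajectory generation mechanism, significantly improving the robustness and generalization of autonomous driving planners.

Details

Motivation: Autonomous driving systems must learn to make decisions across diverse scenarios, including safety-critical and out-of-distribution ones, but such scenarios are underrepresented in real-world data collected by human experts.

Result: On real-world benchmarks, SimScale significantly improves the robustness and generalization of planning methods (up to +6.8 EPDMS on navhard and +2.9 on navtest).

Insight: 1) Pseudo-expert design is crucial to the usefulness of simulated data; 2) adding simulation data alone improves policy performance smoothly, without extra real-world data; 3) scaling behavior differs markedly across policy architectures.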

Abstract: Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.

[150] DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline cs.CVPDF

Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng

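TL;DR: The paper presents DEAL-300K, a large-scale dataset for localizing regions edited by diffusion-based image editing, and proposes a localization framework that combines a visual foundation model with multi-frequency prompt tuning.

Details

Motivation: Diffusion-based image editing makes semantic-level manipulation easy for users, but it also produces realistic local forgeries that are hard to localize. Existing benchmarks focus on binary detection of generated images or localization of manually edited regions and do not reflect the properties of diffusion-based edits.

Result: The method reaches a pixel-level F1 score of 82.56% on the test split and 80.97% on the external CoCoGlide benchmark.

Insight: Combining frequency-domain and semantic cues effectively improves localization, and large-scale, high-quality annotated data provides a practical foundation for future research.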

Abstract: Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
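
For readers unfamiliar with the reported metric, pixel-level F1 between a predicted edit mask and a ground-truth mask can be computed as in the sketch below. This is the standard definition applied to a single image; how the paper aggregates scores across images (per-image mean versus pooled pixels) and which binarization threshold it uses are not stated in the abstract, so both are left out here.

```python
import numpy as np

def pixel_f1(pred_mask, true_mask):
    """F1 score over pixels for two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.logical_and(pred, true).sum()    # edited pixels correctly flagged
    fp = np.logical_and(pred, ~true).sum()   # clean pixels wrongly flagged
    fn = np.logical_and(~pred, true).sum()   # edited pixels missed
    denom = 2 * tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else float(2 * tp / denom)

pred = [[1, 1, 0],
        [0, 0, 0]]
true = [[1, 0, 0],
        [1, 0, 0]]
score = pixel_f1(pred, true)  # tp=1, fp=1, fn=1, so F1 = 2 / 4 = 0.5
```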

[151] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction cs.CVPDF

Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu

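TL;DR: VQRAE proposes a unified vector-quantized representation autoencoder and is the first to explore producing continuous semantic features (for image understanding) and discrete tokens (for visual generation) within a single tokenizer.

Details

Motivation: Building a unified multimodal model requires supporting understanding, generation, and reconstruction together, but existing methods usually adopt a dual-encoder paradigm and fail to achieve a unified representation. VQRAE aims to solve this with vector quantization.

Result: VQRAE is competitive on visual understanding, generation, and reconstruction tasks, and its discrete tokens show promising scaling behavior in the autoregressive paradigm.

Insight: A high-dimensional codebook is crucial for semantic quantization, in sharp contrast to the common practice of low-dimensional codebooks in image reconstruction; the semantic VQ codebook achieves 100% utilization at a dimension of 1536.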

Abstract: Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the exploration of a unified representation that produces Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, maintaining the ability of multimodal understanding while yielding discrete tokens that are compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of a low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction, with promising scaling properties in the autoregressive paradigm owing to its discrete tokens.
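
The codebook lookup and the "utilization ratio" mentioned above can be illustrated with a generic vector quantization sketch. It assumes plain nearest-neighbor assignment under squared Euclidean distance; the paper's actual quantizer, distance, and training losses are not described in the abstract, and the tiny 2-D codebook is only for readability.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    Returns the chosen indices (discrete tokens) and the quantized vectors.
    """
    # Pairwise squared distances, shape (num_features, codebook_size).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

def utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return np.unique(indices).size / codebook_size

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.1, 0.0], [0.9, 1.2], [0.2, -0.1]])

tokens, quantized = quantize(features, codebook)
used = utilization(tokens, len(codebook))  # entry 2 is never chosen here
```

A utilization of 100% means every codebook entry is selected by at least one feature over the evaluation set.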

[152] DisMo: Disentangled Motion Representations for Open-World Motion Transfer cs.CVPDF

Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause

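TL;DR: DisMo proposes a new approach that learns abstract motion representations directly from video, separating motion from static information. This enables open-world motion transfer and performs well on downstream tasks.

Details

Motivation: Current T2V and I2V models lack an explicit representation of motion separate from content, which limits their use by content creators. DisMo aims to fill this gap by disentangling motion semantics from appearance for more flexible motion transfer.

Result: DisMo performs strongly across a range of motion transfer tasks and outperforms state-of-the-art video representation models such as V-JEPA on the Something-Something v2 and Jester benchmarks.

Insight: Disentangling motion from static information is key to open-world motion transfer, and the lightweight adapter design lets the method benefit readily from future advances in video models.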

Abstract: Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity and prompt adherence, overfit to source structure, or drift from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo

[153] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model cs.CVPDF

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang

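TL;DR: Hunyuan-GameCraft-2 proposes an interactive game world generation model driven by natural language instructions, combining large-scale unstructured text-video data with a Mixture-of-Experts model to achieve fine-grained control over game content.

Details

Motivation: Existing generative world models rely on rigid interaction schemes and costly annotation, which limits diverse in-game interaction and the simulation of player-driven dynamics.

Result: Experiments show that the interactive game videos generated by the model are temporally coherent and causally grounded, and respond faithfully to diverse free-form instructions.

Insight: Combining natural language instructions with interaction signals offers a more flexible and semantically rich approach to dynamic game world modeling, and highlights the importance of automating the processing of interaction data.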

Abstract: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as “open the door”, “draw a torch”, or “trigger an explosion”.

[154] Visual Generation Tuning cs.CVPDF

Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li

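TL;DR: VGT (Visual Generation Tuning) is a new paradigm for unlocking the latent visual generation ability of vision language models (VLMs). Efficient tuning significantly reduces alignment cost and speeds up the convergence of autoregressive modeling in continuous space (20x speedup).

Details

Motivation: Current vision language models excel at multimodal understanding, but their potential for visual generation remains underexplored. The paper aims to fill this gap by tuning such models for efficient visual generation.

Result: 1. Image reconstruction reaches 26.67 PSNR and 0.50 rFID; 2. visual generation achieves state-of-the-art results among autoregressive models, with 0.77 on GenEval and 78.73 on DPG-Bench; 3. the model shows promising scaling potential.

Insight: 1. VGT shows that VLMs have generative potential beyond multimodal understanding; 2. efficient tuning is key to unified multimodal foundation models; 3. VGT suggests a new direction for developing next-generation models.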

Abstract: Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLM trained for multimodal understanding with the capabilities of visual generation, which paves a new avenue to explore next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.

Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang

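TL;DR: AnyTalker is a framework for multi-person talking video generation. By extending the attention mechanism of the Diffusion Transformer and training mainly on single-person videos, it addresses the challenges of data collection and multi-identity driving while achieving strong synchronization and interactivity in the generated videos.

Details

Motivation: Existing audio-driven multi-person talking video generation methods face high data collection costs and difficulty driving multiple identities, so an efficient and scalable solution is needed.

Result: Experiments show that AnyTalker delivers strong lip synchronization, visual quality, and natural interactivity, balancing data cost against identity scalability.

Insight: With a novel attention mechanism and an efficient training pipeline, multi-person video generation can maintain high quality while reducing data requirements.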

Abstract: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer’s attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.

[156] Video-CoM: Interactive Video Reasoning via Chain of Manipulations cs.CVPDF

Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Shahbaz Khan

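TL;DR: The paper proposes Video-CoM, a new paradigm for interactive video reasoning in which a Chain of Manipulations (CoM) lets the model process video information dynamically, substantially improving reasoning that requires fine-grained spatio-temporal understanding.

Details

Motivation: Existing MLLMs mostly reason about video in static text and cannot rewatch, refocus on, or verify visual evidence, which leaves fine-grained video reasoning weak.

Result: Video-CoM performs strongly on nine video reasoning benchmarks, improving average performance by 3.6%, while using far less training data than comparable large-scale models (only 25K SFT and 3K GRPO samples).

Insight: Dynamic visual manipulations and reasoning rewards improve both accuracy and interpretability, pointing to the potential of interactive reasoning for complex video tasks.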

Abstract: Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still “think about videos”, i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to “think with videos”. Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM-Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step-level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies demonstrate that reasoning-aware rewards improve both accuracy and interpretability. Code: https://github.com/mbzuai-oryx/Video-CoM
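
The abstract names GRPO with step-level reasoning rewards but does not give the reward formula. As a rough sketch, assume each sampled response gets an answer reward plus a weighted mean of per-step scores, and that advantages are the group-normalized rewards used in GRPO. The function `total_reward`, the `step_weight` value, and the example scores are hypothetical.

```python
import numpy as np

def total_reward(answer_correct, step_scores, step_weight=0.5):
    """Combine a sparse answer reward with a dense step-level term (assumed form)."""
    step_term = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return float(answer_correct) + step_weight * step_term

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled responses to one video question: (answer correct?, per-step scores).
samples = [
    (True, [1.0, 1.0, 1.0]),   # right answer, well-grounded steps
    (True, [0.0, 0.0, 1.0]),   # right answer, weakly grounded steps
    (False, [1.0, 0.0, 0.0]),
    (False, [0.0, 0.0, 0.0]),
]
rewards = [total_reward(ok, steps) for ok, steps in samples]
advantages = group_relative_advantages(rewards)
```

With an answer-only reward the first two responses would receive the same advantage; the step-level term is what separates them.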

[157] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models cs.CVPDF

Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan

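TL;DR: The paper proposes Video-R2, which uses reinforcement learning to improve the temporal alignment and logical consistency of multimodal language models in video reasoning, addressing the tendency of existing models to rely on linguistic priors rather than visual evidence.

Details

Motivation: Reasoning over dynamic visual content is a central challenge for multimodal large language models. The reasoning of existing models may look plausible but is often logically inconsistent or only weakly grounded in visual evidence. The paper quantifies these problems with two diagnostic metrics, TAC and VAS.

Result: Video-R2 achieves consistently higher TAC, VAS, and accuracy across multiple video reasoning benchmarks, supporting the effectiveness of the approach.

Insight: Better temporal alignment and logical consistency translate directly into more accurate and trustworthy video understanding, and reinforcement learning shows promise for multimodal reasoning.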

Abstract: Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This two-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open-sourced.

cs.CL [Back]

[158] Cacheback: Speculative Decoding With Nothing But Cache cs.CL | cs.AIPDF

Zhiyao Ma, In Gim, Lin Zhong

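TL;DR: Cacheback Decoding is a training-free, model-agnostic speculative decoding method that exploits locality in language to accelerate large language model (LLM) inference, relying only on LRU cache tables to generate draft sequences.

Details

Motivation: To make LLM inference more efficient and reduce computational overhead while avoiding complex training or changes to the model architecture.

Result: It achieves state-of-the-art performance among comparable methods, and its simple design makes it easy to implement.

Insight: Locality in language can be exploited effectively; even a simple caching strategy can significantly speed up model inference.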

Abstract: We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.
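
The drafting side of such a scheme can be sketched with an LRU table keyed by token n-grams. This is a minimal illustration, assuming one stored continuation per context and a fixed context length; the class `NgramLRUCache`, its parameters, and the word-level tokens are hypothetical, and the verification of drafts by the target model is omitted entirely.

```python
from collections import OrderedDict

class NgramLRUCache:
    """LRU table mapping a context of `context_len` tokens to the next token."""

    def __init__(self, context_len=2, capacity=1024):
        self.context_len = context_len
        self.capacity = capacity
        self.table = OrderedDict()  # oldest entries sit at the front

    def update(self, tokens):
        """Record every (context -> next token) pair seen in `tokens`."""
        n = self.context_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.table[key] = tokens[i + n]
            self.table.move_to_end(key)          # mark as most recently used
            if len(self.table) > self.capacity:
                self.table.popitem(last=False)   # evict least recently used

    def draft(self, prefix, max_len=4):
        """Chain cache lookups to propose up to `max_len` draft tokens."""
        context = list(prefix[-self.context_len:])
        out = []
        for _ in range(max_len):
            key = tuple(context)
            if key not in self.table:
                break
            nxt = self.table[key]
            self.table.move_to_end(key)
            out.append(nxt)
            context = context[1:] + [nxt]
        return out

cache = NgramLRUCache(context_len=2, capacity=8)
cache.update(["the", "cat", "sat", "on", "the", "mat"])
proposal = cache.draft(["a", "the", "cat"], max_len=3)
```

In speculative decoding the target model would then score the proposed tokens in one forward pass and keep the longest accepted prefix, so a wrong draft costs speed but not output quality.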

[159] 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations cs.CL | cs.LGPDF

Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen

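TL;DR: The paper builds a benchmark evaluation framework for Chinese medical examination questions and evaluates 27 state-of-the-art large language models (LLMs) across medical specialties and difficulty levels, finding that smaller Mixture-of-Experts models outperform larger dense models.

Details

Motivation: As interest grows in applying LLMs to medicine, the authors aim to reveal model performance and limitations on medical examination questions through a comprehensive benchmark.

Result: Mixtral-8x7B performs best with 74.25% accuracy, followed by DeepSeek-R1-671B (64.07%). Performance shows no consistent correlation with model size, and smaller Mixture-of-Experts models perform strongly.

Insight: The study reveals performance differences across medical specialties, and top-performing models generalize robustly across difficulty levels, offering useful guidance for medical education and clinical decision support.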

Abstract: The rapid advancement of large language models (LLMs) has prompted significant interest in their potential applications in medical domains. This paper presents a comprehensive benchmark evaluation of 27 state-of-the-art LLMs on Chinese medical examination questions, encompassing seven medical specialties across two professional levels. We introduce a robust evaluation framework that assesses model performance on 2,800 carefully curated questions from cardiovascular, gastroenterology, hematology, infectious diseases, nephrology, neurology, and respiratory medicine domains. Our dataset distinguishes between attending physician and senior physician difficulty levels, providing nuanced insights into model capabilities across varying complexity. Our empirical analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%, followed by DeepSeek-R1-671B at 64.07%. Notably, we observe no consistent correlation between model size and performance, as evidenced by the strong performance of smaller mixture-of-experts architectures. The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions compared to gastroenterology and nephrology domains. Furthermore, our analysis indicates minimal performance degradation between attending and senior physician levels for top-performing models, suggesting robust generalization capabilities. This benchmark provides critical insights for the deployment of LLMs in medical education and clinical decision support systems, highlighting both the promise and current limitations of these technologies in specialized medical contexts.

[160] Insight-A: Attribution-aware for Multimodal Misinformation Detection cs.CL | cs.CVPDF

Junjie Wu, Yumeng Fu, Chen Gong, Guohong Fu

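TL;DR: The paper proposes Insight-A, which detects multimodal misinformation by combining attribution awareness with multimodal large language models (MLLMs). It focuses on tracing misinformation to its source and introduces a hierarchical reasoning pipeline and automatic debiased prompting.

Details

Motivation: AI-generated content (AIGC) has become a common source of multimodal misinformation on social media and poses a threat to societal safety. Existing methods ignore misinformation attribution (such as forgery traces from the source), so new methods are needed to improve detection.

Result: Extensive experiments show that Insight-A outperforms existing methods and offers a new paradigm for multimodal misinformation detection in the AIGC era.

Insight: Attribution and cross-modal consistency checking are crucial to better detection; automatic debiased prompting reduces the bias introduced by human-written prompts.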

Abstract: AI-generated content (AIGC) technology has emerged as a prevalent means of creating multimodal misinformation on social media platforms, posing unprecedented threats to societal safety. However, standard prompting that leverages multimodal large language models (MLLMs) to identify emerging misinformation ignores misinformation attribution. To this end, we present Insight-A, exploring attribution with MLLM insights for detecting multimodal misinformation. Insight-A makes two contributions: I) attributing misinformation to forgery sources, and II) an effective pipeline with hierarchical reasoning that detects distortions across modalities. Specifically, to attribute misinformation to forgery traces based on generation patterns, we devise cross-attribution prompting (CAP) to model the sophisticated correlations between perception and reasoning. Meanwhile, to reduce the subjectivity of human-annotated prompts, automatic attribution-debiased prompting (ADP) is used for task adaptation on MLLMs. Additionally, we design image captioning (IC) to capture visual details for enhancing cross-modal consistency checking. Extensive experiments demonstrate the superiority of our proposal and provide a new paradigm for multimodal misinformation detection in the era of AIGC.

[161] A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks cs.CL | cs.AIPDF

Hui Wang, Fafa Zhang, Xiaoyu Zhang, Chaoxu Mu

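TL;DR: NRPA-GD is a novel dialogue policy planning method that uses a large language model (LLM) to simulate both user and system behavior. It avoids model-specific training and outperforms existing methods on goal-oriented dialogue tasks.

Details

Motivation: In goal-oriented dialogue, existing methods rely on hand-crafted prompt engineering or on pre-trained models, which adapt poorly to new scenarios and are costly to train. NRPA-GD aims to avoid these limitations by using an LLM to plan dynamically during the dialogue.

Result: On four goal-oriented dialogue datasets, NRPA-GD outperforms both prompt-engineering and pre-trained-model methods, and it surpasses ChatGPT and pre-trained policy models with only a small 0.6B-parameter LLM.

Insight: Combining an LLM with dynamic planning can effectively solve goal-oriented dialogue tasks, which demonstrates the potential and novelty of planning methods in LLM applications.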

Abstract: In goal-oriented dialogue tasks, the main challenge is to steer the interaction towards a given goal within a limited number of turns. Existing approaches either rely on elaborate prompt engineering, whose effectiveness is heavily dependent on human experience, or integrate policy networks and pre-trained policy models, which are usually difficult to adapt to new dialogue scenarios and costly to train. Therefore, in this paper, we present Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD), a novel dialogue policy planning method that completely avoids specific model training by utilizing a Large Language Model (LLM) to simulate behaviors of user and system at the same time. Specifically, NRPA-GD constructs a complete evaluation mechanism for dialogue trajectories and employs an optimization framework of nested Monte Carlo simulation and policy self-adaptation to dynamically adjust policies during the dialogue process. The experimental results on four typical goal-oriented dialogue datasets show that NRPA-GD outperforms both existing prompt engineering and specifically pre-trained model-based methods. Impressively, NRPA-GD surpasses ChatGPT and pre-trained policy models with only a 0.6-billion-parameter LLM. The proposed approach further demonstrates the advantages and novelty of employing planning methods on LLMs to solve practical planning tasks.

[162] EulerESG: Automating ESG Disclosure Analysis with LLMs cs.CL | cs.AI | cs.CYPDF

Yi Ding, Xushuo Tang, Zhengyi Yang, Wenqian Zhang, Simin Wu

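TL;DR: EulerESG is an LLM-based system for automating ESG disclosure analysis. It combines retrieval, LLM-driven analysis, and an interactive dashboard to process reports efficiently and accurately.

Details

Motivation: ESG reports are usually published as heterogeneous PDFs, which makes systematic analysis difficult. Existing tools either rely on brittle rule-based extraction or ignore reporting standards. EulerESG aims to address these problems.

Result: Validated on four global companies and twelve SASB sub-industries, the system reaches up to 0.95 average accuracy while maintaining practical runtime.

Insight: LLMs show promise for structured document analysis, especially where outputs must align with industry standards. EulerESG demonstrates the practical value of LLMs.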

Abstract: Environmental, Social, and Governance (ESG) reports have become central to how companies communicate climate risk, social impact, and governance practices, yet they are still published primarily as long, heterogeneous PDF documents. This makes it difficult to systematically answer seemingly simple questions. Existing tools either rely on brittle rule-based extraction or treat ESG reports as generic text, without explicitly modelling the underlying reporting standards. We present EulerESG, an LLM-powered system for automating ESG disclosure analysis with explicit awareness of ESG frameworks. EulerESG combines (i) dual-channel retrieval and LLM-driven disclosure analysis over ESG reports, and (ii) an interactive dashboard and chatbot for exploration, benchmarking, and explanation. Using four globally recognised companies and twelve SASB sub-industries, we show that EulerESG can automatically populate standard-aligned metric tables with high fidelity (up to 0.95 average accuracy) while remaining practical in end-to-end runtime, and we compare several recent LLM models in this setting. The full implementation, together with a demonstration video, is publicly available at https://github.com/UNSW-database/EulerESG.

[163] GPS: General Per-Sample Prompter cs.CL | cs.AIPDF

Pawel Batorski, Paul Swoboda

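TL;DR: GPS is a general per-sample prompting method. A prompter trained with reinforcement learning generates a tailored prompt for each input without task-specific tuning, improving performance across diverse tasks.

Details

Motivation: Large language models (LLMs) are highly sensitive to prompts, and manually designing effective prompts is time-consuming and difficult. Existing automatic prompting methods require large training datasets and costly optimization, and they produce only a single task-level prompt that cannot adapt to the specific problem in each input.

Result: GPS achieves competitive performance on text simplification, summarization, and classification, with the best results on some tasks, without training on task-specific data.

Insight: GPS demonstrates a new paradigm for automatic prompting: generating adaptive, input-specific prompts improves model performance without extensive optimization or a task-specific training set.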

Abstract: LLMs are sensitive to prompting, with task performance often hinging on subtle, sometimes imperceptible variations in phrasing. As a result, crafting effective prompts manually remains challenging and time-consuming. Recent automatic prompting methods mitigate this difficulty but face three key limitations: (i) for each new task, they require large datasets to train good prompts; (ii) they rely on costly optimization loops that may take hours; (iii) they typically produce a single task-level prompt that does not adapt to the individual input problem to be solved. We propose GPS, the first general-purpose, per-sample prompting method. Without any task-specific tuning, GPS generates a tailored prompt for each unseen input, improving performance across diverse tasks. The prompter is trained with reinforcement learning on a suite of training tasks and includes a novel regularization for effectively adapting to per-sample prompting. Finally, we employ Minimum Bayes Risk decoding to stabilize inference. Empirically, GPS demonstrates competitive performance: we attain second-best results among baselines on text simplification, third-best results on summarization, and on-par results on classification, while not training on any of these tasks, in contrast to the baselines. For in-domain prompting, we obtain state-of-the-art results on GSM8K. Our work shows the potential of a novel and effective paradigm for automatic prompting: generating adaptive, input-specific prompts without extensive optimization and without access to a task-specific training set. Our code is available at https://github.com/Batorskq/GPS.
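
The Minimum Bayes Risk decoding step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which utility function GPS uses, so the token-level F1 utility and the helper names `token_f1` and `mbr_select` below are assumptions for illustration only. The idea is to sample several candidate outputs and keep the one that agrees most, on average, with the others.

```python
def token_f1(a: str, b: str) -> float:
    """Illustrative utility (an assumption): F1 overlap of unique lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    overlap = len(tokens_a & tokens_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_f1(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


# Hypothetical sampled outputs; the outlier is never selected.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
best = mbr_select(samples)
print(best)
```

Because the selected output is the one closest to the consensus of the samples, a single unusual generation has little influence on the final answer, which is the stabilizing effect the abstract refers to.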

[164] An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features cs.CLPDF

Shabbir Anees, Anshuman, Ayush Chaurasia, Prathmesh Bogar

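TL;DR: The paper proposes an optimized machine learning classifier that detects fake reviews from extracted features. It combines text preprocessing, multi-modal feature extraction, Harris Hawks Optimization, and a stacking ensemble classifier to achieve high accuracy.

Details

Motivation: Fake reviews undermine the credibility of online shopping, and AI-generated reviews are especially hard to distinguish, so an efficient method is needed to detect this content.

Result: On a public dataset, the method achieves 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1 score, with an 89.9% reduction in feature dimensionality.

Insight: Combining bio-inspired optimization with ensemble learning performs well for machine-generated text recognition. The paper also stresses the importance of privacy-preserving techniques (such as differential privacy) for large-scale data analytics.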

Abstract: It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development misleading customers is the emergence of computer-generated (CG) reviews that pass for human-written ones. In this work, we present an advanced machine-learning-based system that analyses these AI-produced reviews with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the most applicable 1,368 features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.
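
The reported figures are internally consistent, which a short check makes concrete. The sketch below recomputes the F1 score from the stated precision and recall, and the dimensionality reduction from the stated feature counts; the function names are illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def reduction_ratio(original_features: int, selected_features: int) -> float:
    """Fraction of features removed by feature selection."""
    return 1 - selected_features / original_features


# Values quoted in the abstract above.
f1 = f1_score(precision=0.9281, recall=0.9501)
reduction = reduction_ratio(original_features=13_539, selected_features=1_368)

print(f"F1 = {f1:.4f}")                # F1 = 0.9390
print(f"reduction = {reduction:.3f}")  # reduction = 0.899
```

Both values match the abstract: an F1 of about 93.90% and a dimensionality reduction of about 89.9%.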

[165] CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution cs.CL | cs.CVPDF

Baoliang Tian, Yuxuan Si, Jilong Wang, Lingyao Li, Zhongyuan Bao

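TL;DR: The paper introduces CrossCheck-Bench, a new diagnostic benchmark for evaluating how multimodal large language models handle contradictions between visual and textual inputs. Current models show a marked performance drop on logical contradiction detection and multi-step reasoning tasks, which indicates a need for better reasoning methods.

Details

Motivation: Existing models are trained and evaluated mainly on aligned image-text pairs, but real-world multimodal information often contains contradictions that require more complex reasoning. The paper aims to fill this research gap.

Result: Performance declines steadily as tasks shift from perceptual matching to logical contradiction detection, and it is especially poor on tasks that require multi-step inference or rule-based validation. Conventional prompting strategies such as Chain-of-Thought give limited gains, while methods that incorporate symbolic reasoning improve results.

Insight: Current models face a bottleneck in cross-modal reasoning. Future work should focus on methods that combine symbolic reasoning with visual processing to achieve more robust contradiction detection.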

Abstract: Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.

[166] Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks cs.CL | cs.AI | cs.LGPDF

Yicong Zheng, Kevin L. McKee, Thomas Miconi, Zacharie Bugaud, Mick van Gelderen

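TL;DR: SUMER is a goal-directed search method that searches directly over uncompressed memory. It avoids the bias and performance loss of goal-agnostic memory compression algorithms and performs strongly on long-context memory tasks.

Details

Motivation: Existing memory compression algorithms tend to build in human bias, and their performance is tied to specific data distributions. SUMER explores goal-directed search directly over uncompressed information to avoid the information loss and algorithmic limitations that compression brings.

Result: On the LoCoMo dataset, SUMER with Qwen2.5-7B-Instruct achieves a 43% gain over the prior best, outperforming existing memory compression methods and the full-context baseline.

Insight: 1) Goal-directed search over uncompressed memory outperforms compression methods; 2) existing memory compression algorithms are biased and limited; 3) future long-context tasks need more dynamic benchmarks.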

Abstract: How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at https://github.com/zycyc/SUMER.

[167] Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue cs.CL | cs.AI | cs.LGPDF

Lin Yu, Xiaofei Han, Yifei Kang, Chiung-Yi Tseng, Danyang Zhang

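TL;DR: AffectMind is a multimodal affective dialogue agent that improves marketing dialogue through proactive knowledge grounding and emotional alignment.

Details

Motivation: Current large language models (LLMs) perform poorly in emotionally rich, goal-oriented marketing conversations, which call for more proactive reasoning and emotional alignment.

Result: On two marketing dialogue datasets, AffectMind outperforms baselines in emotional consistency (+26%), persuasive success rate (+19%), and user engagement (+23%).

Insight: Emotion grounding and proactive reasoning are key capabilities for commercial multimodal agents.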

Abstract: Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal-oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion–Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM-based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%), highlighting emotion-grounded proactivity as a key capability for commercial multimodal agents.

[168] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation cs.CL | cs.AIPDF

Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su

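TL;DR: HUMORCHAIN is a theory-guided multi-stage reasoning framework for multimodal humor generation. It combines visual semantic parsing, humor- and psychology-based reasoning chains, and a humor-evaluation discriminator to improve the quality of generated humor and its alignment with human perception.

Details

Motivation: Current data-driven approaches to multimodal humor generation lack theoretical grounding, and their output is often fluent but lacks genuine humor or cognitive depth. To address this, the paper proposes a theory-guided approach that embeds the cognitive structures of humor into the generation process.

Result: Experiments on the Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms existing baselines in humor preference, Elo/BT scores, and semantic diversity.

Insight: Theory-driven structured reasoning improves the quality of humor generated by large language models and brings it closer to human perception. The approach also suggests a route to interpretable and controllable humor generation.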

Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in the generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.

[169] RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models cs.CL | cs.AIPDF

Dayan Pan, Jingyuan Wang, Yilong Zhou, Jiawei Cheng, Pengyue Jia

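TL;DR: RoSA is a RoPE-aware selective parameter-efficient fine-tuning method. It enhances the low-frequency components of attention states and dynamically selects critical layers, achieving more efficient fine-tuning.

Details

Motivation: Fine-tuning large language models is computationally expensive, and existing PEFT methods ignore the distinct roles of model components and the varying importance of layers, which limits efficiency. RoSA is motivated by the observation that RoPE induces critical activations in low-frequency dimensions, and it exploits this to make selective adaptation more effective.

Result: On 15 commonsense and arithmetic benchmarks, RoSA outperforms mainstream PEFT methods with a comparable number of trainable parameters.

Insight: 1. The low-frequency dimensions of RoPE are critical to attention states; 2. Combining dynamic layer selection with dimension-wise enhancement improves fine-tuning efficiency.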

Abstract: Fine-tuning large language models is essential for task-specific adaptation, yet it remains computationally prohibitive. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a solution, but current approaches typically ignore the distinct roles of model components and the heterogeneous importance across layers, thereby limiting adaptation efficiency. Motivated by the observation that Rotary Position Embeddings (RoPE) induce critical activations in the low-frequency dimensions of attention states, we propose RoPE-aware Selective Adaptation (RoSA), a novel PEFT framework that allocates trainable parameters in a more targeted and effective manner. RoSA comprises a RoPE-aware Attention Enhancement (RoAE) module, which selectively enhances the low-frequency components of RoPE-influenced attention states, and a Dynamic Layer Selection (DLS) strategy that adaptively identifies and updates the most critical layers based on LayerNorm gradient norms. By combining dimension-wise enhancement with layer-wise adaptation, RoSA achieves more targeted and efficient fine-tuning. Extensive experiments on fifteen commonsense and arithmetic benchmarks demonstrate that RoSA outperforms existing mainstream PEFT methods under comparable trainable parameters. The code is available to ease reproducibility at https://github.com/Applied-Machine-Learning-Lab/RoSA.
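
To make "low-frequency dimensions" concrete, the sketch below computes the standard RoPE rotation frequencies, where dimension pair i rotates at base^(-2i/d), and lists the slowest-rotating pairs. The helper `low_frequency_pairs` and its `keep_fraction` parameter are hypothetical illustrations; the abstract does not specify how RoAE chooses which components to enhance.

```python
def rope_inverse_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE: dimension pair i rotates at frequency base ** (-2 * i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def low_frequency_pairs(head_dim: int, keep_fraction: float = 0.25,
                        base: float = 10000.0) -> list[int]:
    """Hypothetical helper: indices of the slowest-rotating dimension pairs."""
    num_pairs = len(rope_inverse_frequencies(head_dim, base))
    num_keep = max(1, int(num_pairs * keep_fraction))
    # Frequencies decrease with the pair index, so the last pairs are the slowest.
    return list(range(num_pairs - num_keep, num_pairs))


freqs = rope_inverse_frequencies(8)  # approximately [1.0, 0.1, 0.01, 0.001]
slow_pairs = low_frequency_pairs(8)  # [3]
print(freqs, slow_pairs)
```

Because the frequencies decay geometrically with the pair index, the last few pairs change very slowly with position, which is why they are referred to as the low-frequency dimensions.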

[170] Asking LLMs to Verify First is Almost Free Lunch cs.CL | cs.AIPDF

Shiguang Wu, Quanming Yao

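TL;DR: Having a large language model (LLM) verify a candidate answer (even a random one) before generating its own answer is a nearly cost-free strategy that strengthens reasoning. The paper further extends it into an iterative-verification test-time scaling method.

Details

Motivation: Improve LLM reasoning and reduce logical errors without expensive training or extensive sampling.

Result: VF consistently outperforms standard CoT across tasks and LLMs, and Iter-VF outperforms existing test-time scaling strategies.

Insight: Verifying a candidate answer triggers the LLM's reverse reasoning, which is a low-cost and effective way to improve logical rigor.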

Abstract: To enhance the reasoning capabilities of Large Language Models (LLMs) without costly training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a “reverse reasoning” process that is cognitively easier and complementary to standard forward Chain-of-Thought (CoT), effectively invoking the model’s critical thinking to reduce logical errors. We further generalize the VF strategy to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model’s previous answer. Extensive experiments across various benchmarks (from mathematical reasoning to coding and agentic tasks) and various LLMs (from open-source 1B models to cutting-edge commercial ones) confirm that VF with a random answer consistently outperforms standard CoT with minimal computational overhead, and Iter-VF outperforms existing TTS strategies.
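
The verify-then-generate loop described above can be sketched in a few lines. The prompt wording, the function names, and the `toy_model` stand-in are all assumptions made for illustration; the abstract does not give the exact prompts, and a real system would also need to extract the final answer from a longer model response.

```python
from typing import Callable


def verification_first_prompt(question: str, candidate: str) -> str:
    """Build a VF-style prompt. The wording is hypothetical, not the paper's."""
    return (
        f"Question: {question}\n"
        f"A candidate answer is: {candidate}\n"
        "First verify step by step whether this candidate answer is correct. "
        "Then solve the question yourself and state your final answer."
    )


def iter_vf(question: str, first_candidate: str,
            ask_model: Callable[[str], str], rounds: int = 3) -> str:
    """Iter-VF-style loop: each round verifies the previous round's answer."""
    candidate = first_candidate
    for _ in range(rounds):
        candidate = ask_model(verification_first_prompt(question, candidate))
    return candidate


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call; it always answers "4"."""
    return "4"


final_answer = iter_vf("What is 2 + 2?", first_candidate="1",
                       ask_model=toy_model, rounds=2)
print(final_answer)
```

With `rounds=1` the loop reduces to plain VF with a provided (possibly random) candidate; larger values reuse the model's own previous answer as the next candidate.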

[171] Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting cs.CL | cs.AI | cs.CVPDF

Harshita Sharma, Maxwell C. Reynolds, Valentina Salvatelli, Anne-Marie G. Sykes, Kelly K. Horst

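TL;DR: MAIRA-X is a multimodal AI model for longitudinal chest X-ray report generation. It significantly improves the lexical quality, clinical correctness, and lines-and-tubes (L&T) elements of AI-generated reports, and it performs comparably to radiologists.

Details

Motivation: Radiologists are under heavy pressure from high volumes of chest X-rays and L&T reporting, and AI-assisted report generation can reduce their workload.

Result: The critical error rate of AI-generated reports (4.6%) is close to that of the original reports (3.0%), and 97.4% of sentences are acceptable, a significant improvement over prior studies.

Insight: MAIRA-X can effectively assist radiologists in high-volume clinical settings, narrowing the performance gap between AI and experts.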

Abstract: AI-assisted report generation offers the opportunity to reduce radiologists’ workload stemming from expanded screening guidelines, complex cases and workforce shortages, while maintaining diagnostic accuracy. In addition to describing pathological findings in chest X-ray reports, interpreting lines and tubes (L&T) is demanding and repetitive for radiologists, especially with high patient volumes. We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray (CXR) report generation, that encompasses both clinical findings and L&T reporting. Developed using a large-scale, multi-site, longitudinal dataset of 3.1 million studies (comprising 6 million images from 806k patients) from Mayo Clinic, MAIRA-X was evaluated on three holdout datasets and the public MIMIC-CXR dataset, where it significantly improved AI-generated reports over the state of the art on lexical quality, clinical correctness, and L&T-related elements. A novel L&T-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change and placement. A first-of-its-kind retrospective user evaluation study was conducted with nine radiologists of varying experience, who blindly reviewed 600 studies from distinct subjects. The user study found comparable rates of critical errors (3.0% for original vs. 4.6% for AI-generated reports) and a similar rate of acceptable sentences (97.8% for original vs. 97.4% for AI-generated reports), marking a significant improvement over prior user studies with larger gaps and higher error rates. Our results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.

Jiayi Chen, Jieqi Shi, Jing Huo, Chen Wu

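TL;DR: R2Q is a novel 2-bit quantization framework that uses residual refinement quantization to improve the performance of large language models. It addresses the accuracy degradation of conventional 2-bit quantization and performs well across multiple tasks.

Details

Motivation: As the compute and memory demands of large language models grow, low-bit quantization has become an important topic. However, 2-bit quantization is hard to make practical because of severe accuracy loss, and R2Q aims to solve this problem.

Result: Experiments on Llama, OPT, and Qwen models show that R2Q outperforms existing 2-bit quantization methods on question answering, commonsense reasoning, and language modeling.

Insight: The residual refinement mechanism not only improves quantization performance but also improves training stability and speeds up convergence, offering a new approach to model quantization under extreme compression.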

Abstract: The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
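
The idea of decomposing 2-bit quantization into two sequential 1-bit steps can be illustrated with a generic residual binarization sketch. This is not the paper's algorithm: the closed-form scales below (the mean absolute value at each step) are an assumption chosen for simplicity, whereas the abstract describes a learned residual mechanism used with quantization-aware training.

```python
def binarize(values: list[float]) -> tuple[float, list[float]]:
    """1-bit quantization: one sign per value plus a single shared scale.

    mean(|v|) is the least-squares optimal scale for the sign pattern.
    """
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1.0 if v >= 0 else -1.0 for v in values]
    return scale, signs


def residual_two_step(values: list[float]) -> list[float]:
    """Two sequential 1-bit steps: binarize, then binarize the residual."""
    scale1, signs1 = binarize(values)
    residual = [v - scale1 * s for v, s in zip(values, signs1)]
    scale2, signs2 = binarize(residual)
    # Each value is now one of four levels: +/- scale1 +/- scale2.
    return [scale1 * s1 + scale2 * s2 for s1, s2 in zip(signs1, signs2)]


def mse(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.9, -0.1, 0.4, -0.7, 0.05, -0.3]  # made-up example weights
scale1, signs1 = binarize(weights)
one_step = [scale1 * s for s in signs1]
two_step = residual_two_step(weights)

print(f"1-bit MSE: {mse(weights, one_step):.4f}")
print(f"two-step MSE: {mse(weights, two_step):.4f}")
```

The second step can never increase the squared error, because the mean absolute residual is the least-squares optimal scale for the residual's sign pattern. The four reconstruction levels depend on the data through the two scales, which gives one simple reading of the "adaptive quantization lattice" in the abstract.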

[173] Polarity-Aware Probing for Quantifying Latent Alignment in Language Models cs.CL | cs.AIPDF

Sabrina Sadiekh, Elena Ericheva, Chirag Agarwal

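TL;DR: The paper proposes Polarity-Aware Contrast-Consistent Search (PA-CCS), a method for evaluating the latent alignment of language models, together with two alignment metrics. Experiments show that the method can identify differences in models' internal representations.

Details

Motivation: With advances in unsupervised probing methods such as CCS, whether they can reliably assess a model's latent alignment has become a key question. The paper studies the sensitivity of these methods to harmful versus safe statements and proposes an improvement.

Result: PA-CCS reveals architectural and layer-specific differences in how models encode latent harmful knowledge. Replacing the negation token lowers PA-CCS scores for models with well-aligned internal representations, whereas models lacking robust calibration show no such degradation.

Insight: 1. Unsupervised probing can be used to evaluate model alignment; 2. Structural robustness checks are essential for interpretability benchmarks; 3. A model's internal consistency directly affects its sensitivity to polarity inversion.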

Abstract: Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model’s internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model’s latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: https://github.com/SadSabrina/polarity-probing. WARNING: This paper contains potentially sensitive, harmful, and offensive content.

[174] A Multiscale Geometric Method for Capturing Relational Topic Alignment cs.CL | cs.LG | stat.MLPDF

Conrad D. Hougen, Karl T. Pazdernik, Alfred O. Hero

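TL;DR: The paper proposes a geometric method that integrates multimodal text and co-author network data, using Hellinger distances and Ward's linkage to build a hierarchical topic dendrogram that captures rare topics and smooth topic drift.

Details

Motivation: In scientific literature, identifying overlooked niche topics is important, but existing models based on dense transformer embeddings struggle to capture these topics and their temporal alignment.

Result: Experiments show that the method effectively identifies rare-topic structure and visualizes smooth topic drift.

Insight: Pairing bag-of-words models with geometric alignment improves the interpretability and temporal alignment of topic models.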

Abstract: Interpretable topic modeling is essential for tracking how research interests evolve within co-author communities. In scientific corpora, where novelty is prized, identifying underrepresented niche topics is particularly important. However, contemporary models built from dense transformer embeddings tend to miss rare topics and therefore also fail to capture smooth temporal alignment. We propose a geometric method that integrates multimodal text and co-author network data, using Hellinger distances and Ward’s linkage to construct a hierarchical topic dendrogram. This approach captures both local and global structure, supporting multiscale learning across semantic and temporal dimensions. Our method effectively identifies rare-topic structure and visualizes smooth topic drift over time. Experiments highlight the strength of interpretable bag-of-words models when paired with principled geometric alignment.
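
For reference, the Hellinger distance between two discrete distributions p and q is (1/√2) · sqrt(Σ (√p_i − √q_i)²), which ranges from 0 (identical) to 1 (disjoint support). The sketch below computes it for made-up topic-word distributions. It covers only the distance itself; how the paper builds its distributions and applies Ward's linkage is not described in the abstract.

```python
import math


def hellinger(p: list[float], q: list[float]) -> float:
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)


# Made-up topic-word distributions over a four-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]  # close to topic_a
topic_c = [0.0, 0.0, 0.1, 0.9]  # a rare, very different topic

print(f"a vs b: {hellinger(topic_a, topic_b):.3f}")
print(f"a vs c: {hellinger(topic_a, topic_c):.3f}")
```

A matrix of such pairwise distances is the kind of input a hierarchical clustering routine would take to produce a dendrogram.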

[175] Scaling Competence, Shrinking Reasoning: Cognitive Signatures in Language Model Learning cs.CLPDF

Mukul Singh, Ananya Singha, Arjun Radhakrishna, Sumit Gulwani

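TL;DR: The paper studies the reasoning behavior of language models during task-specific fine-tuning and maps it onto four stages of competence. Reasoning token length first rises and then falls as performance improves, and the final model retains its performance even when reasoning is removed.

Details

Motivation: By analyzing language model reasoning through the cognitive-science framework of the Four Stages of Competence, the paper seeks to understand how models change during training and to optimize the training of reasoning models.

Result: Reasoning token length first increases and then decreases as model performance improves, and trained models can complete tasks without reasoning.

Insight: Reasoning behavior can serve as a signal of training stage and can guide training optimizations such as early stopping.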

Abstract: We analyze reasoning in language models during task-specific fine-tuning and draw a parallel between reasoning tokens (intermediate steps generated while solving a problem) and human working memory. Drawing from cognitive science, we align training dynamics with the Four Stages of Competence: models initially produce incorrect outputs without reasoning, then begin reasoning (but still fail), eventually reason effectively, and finally solve tasks without explicit reasoning. We find that reasoning token length expands as performance improves, peaks at the stage of conscious competence, then declines as the model internalizes the task. Notably, after training, models retain performance even when reasoning is removed, suggesting that it scaffolded learning but is no longer needed. This progression offers actionable insights: reasoning token dynamics can serve as a signal for diagnosing training stage, identifying convergence, and guiding early stopping. We propose metrics to track this trajectory and argue that reasoning behavior is valuable for understanding and optimizing reasoning model training.

[176] A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features cs.CL | cs.AIPDF

Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, Thy Tran

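TL;DR: NEULIF is a lightweight method that detects AI-generated text efficiently by combining stylometric and readability features with a compact CNN or RF model. It needs little compute and performs best in the lightweight-detector class.

Details

Motivation: Most existing AI-generated text detectors fine-tune large transformer models or build ensembles, which is computationally expensive and generalizes poorly across domains, while lightweight alternatives perform worse. NEULIF aims to provide an efficient and accurate lightweight solution.

Result: The CNN reaches 97% accuracy (F1 0.95) with a ROC-AUC of 99.5%; the RF reaches 95% accuracy (F1 0.94) with a ROC-AUC of 95%. At about 25 MB and 10.6 MB respectively, the models are far smaller and more efficient than transformer-based ensembles.

Insight: With structured feature design and simple models, lightweight methods can rival complex models in AI-generated text detection while greatly reducing compute cost.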

Abstract: A growing number of AI-generated texts raise serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often provide limited generalization across domains. Existing lightweight alternatives achieve significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves the best performance in the lightweight detector class, does not require extensive computational power, and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features, which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF). Evaluated and tested on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~0.95 F1) for the CNN and 95% accuracy (~0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~25 MB) and Random Forest (~10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can be run efficiently on standard CPU devices, without sacrificing accuracy. This study also highlights the potential of such models for broader applications across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.

[177] DELTA: Language Diffusion-based EEG-to-Text Architecture cs.CLPDF

Mingyu Jeon, Hyobin Kim

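TL;DR: DELTA is a language-diffusion-based EEG-to-text architecture that quantizes EEG signals with RVQ and reconstructs text with LLaDA, improving semantic alignment and generation quality.

Details

Motivation: EEG-to-text suffers from high noise, large subject variability, and error accumulation in autoregressive decoding. DELTA aims to address these problems with non-sequential denoising.

Result: On the ZuCo dataset, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, reaching BLEU-1 21.9 and ROUGE-1 F 17.2.

Insight: DELTA shows the potential for reliable text generation from small EEG-text datasets and points toward scalable multimodal EEG-language models.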

Abstract: Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.

[178] Building Domain-Specific Small Language Models via Guided Data Generation cs.CL | cs.AIPDF

Aman Kumar, Ekant Muljibhai Amin, Xian Yeow Lee, Lasitha Vidyaratne, Ahmed K. Farahat

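TL;DR: The paper presents an efficient, scalable training pipeline that combines guided synthetic data generation with bottom-up domain data curation to build small domain-specific language models. DiagnosticSLM, a 3B-parameter model trained this way, clearly outperforms open-source models on industrial fault diagnosis tasks.

Details

Motivation: Deploying large language models in specialized domains raises data privacy and compute concerns, while developing small domain-specific models is constrained by the lack of high-quality domain data.

Result: DiagnosticSLM performs strongly on several domain tasks, with up to a 25% accuracy improvement on the multiple-choice task.

Insight: With efficient data generation and an efficient training pipeline, small domain-specific models can outperform larger general-purpose models on specific tasks.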

Abstract: Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
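
Of the three training stages named above, Direct Preference Optimization has a compact per-example objective that is easy to show: the loss is −log σ(β · [(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the preferred response and y_l the rejected one. The sketch below implements this standard formula on scalar log-probabilities. The value β = 0.1 and the example inputs are assumptions; the abstract gives no hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair of sequence log-probabilities."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# A policy identical to the reference gives log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(f"{neutral:.4f} {improved:.4f}")
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is the signal the preference-tuning stage optimizes.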

[179] Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness cs.CL | cs.AIPDF

Svitlana Volkova, Will Dupree, Hsien-Te Kao, Peter Bautista, Gabe Ganberg

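TL;DR: The paper proposes BRIES, a compound AI architecture for detecting persuasion attacks in information environments and measuring their effectiveness. The system consists of specialized agents: a Twister that generates adversarial content, a Detector that identifies attacks, a Defender that creates inoculating content, and an Assessor that evaluates the effect. Experiments show that detection performance varies significantly across language models and that prompt engineering strongly affects detection.

Details

Motivation: As AI-generated text spreads through social media and news, detecting and resisting persuasion attacks has become an important problem. The authors aim to improve AI safety and human cognitive resilience by quantifying the vulnerabilities of large language models to persuasion attacks.

Result: GPT-4 performs best at detecting complex persuasion techniques, while open-source models such as Llama3 and Mistral are weaker at identifying subtle rhetoric. Prompt engineering has a marked effect on detection, and models respond differently to temperature settings.

Insight: Persuasion attacks target specific cognitive dimensions, and prompt settings such as temperature significantly affect a model's detection ability. These findings can inform the design of safer AI systems and help people resist harmful content.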

Abstract: This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and Mistral demonstrated notable weaknesses in identifying subtle rhetoric, suggesting that different architectures encode and process persuasive language patterns in fundamentally different ways. We show that prompt engineering dramatically affects detection efficacy, with temperature settings and confidence scoring producing model-specific variations; Gemma and GPT-4 perform optimally at lower temperatures while Llama3 and Mistral show improved capabilities at higher temperatures. Our causal analysis provides novel insights into socio-emotional-cognitive signatures of persuasion attacks, revealing that different attack types target specific cognitive dimensions. This research advances generative AI safety and cognitive security by quantifying LLM-specific vulnerabilities to persuasion attacks and delivers a framework for enhancing human cognitive resilience through structured interventions before exposure to harmful content.

[180] Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification cs.CL | cs.AIPDF

Yanxi Li, Ruocheng Shan

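TL;DR: The paper proposes LDD, a lightweight, model-agnostic defense that hides the true labels behind semantically transformed or unrelated alias labels to resist prompt injection attacks on LLM sentiment classification.

Details

Motivation: Large language models are widely used for text classification tasks such as sentiment analysis, but their reliance on natural language prompts exposes them to prompt injection attacks, including class-directive injections. Existing defenses either require model retraining or remain vulnerable to obfuscation.

Result: Across the evaluated models, LDD recovers part of the accuracy lost to the attack, and semantically aligned alias labels are more robust than unaligned symbols.

Insight: Label semantics can serve as an effective defense layer against prompt injection, and choosing semantically aligned alias labels is critical to how well the defense works.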

Abstract: Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model’s label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels (e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot settings and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels (e.g., good vs. bad) yield stronger robustness than unaligned symbols (e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.
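The label-disguise idea can be illustrated with a minimal sketch. It only shows how a few-shot prompt might be built with alias labels and how an alias answer could be mapped back to the true label; the function names `build_prompt` and `decode_label`, the prompt wording, and the `Review:`/`Label:` layout are assumptions made for illustration, not the authors' implementation.

```python
def build_prompt(demos, text, aliases):
    """Build a few-shot prompt that only ever shows alias labels.

    demos:   list of (review_text, true_label) pairs
    aliases: dict mapping true label -> alias label (hypothetical mapping)
    """
    header = "Classify each review with one of: " + ", ".join(aliases.values()) + "."
    blocks = [header]
    for demo_text, true_label in demos:
        # The model sees the alias, never the true label name.
        blocks.append(f"Review: {demo_text}\nLabel: {aliases[true_label]}")
    blocks.append(f"Review: {text}\nLabel:")
    return "\n\n".join(blocks)


def decode_label(model_output, aliases):
    """Map an alias answer back to the true label; return None for anything else."""
    reverse = {alias: label for label, alias in aliases.items()}
    return reverse.get(model_output.strip().lower())
```

Under this sketch, an injected directive such as "output positive" names a label that is not part of the alias set, so an answer of `positive` decodes to `None` instead of being accepted as a valid class.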

Sameeah Noreen Hameed, Surangika Ranathunga, Raj Prasanna, Kristin Stock, Christopher B. Jones

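TL;DR: This study uses large language models (LLMs) to identify impacts and impacted locations, as distinct from non-impacted locations, in disaster-related social media posts. Fine-tuning substantially improves performance, supporting timely decision-making in disaster response.

Details

Motivation: During a disaster, traditional data sources such as sensors and remote sensing imagery are limited in temporal and spatial coverage and cannot fully capture the impact. Social media can act as “geo-sensors”, but not every location mentioned relates to an impact, so impacted and non-impacted locations need to be told apart.

Result: The fine-tuned model achieves F1-scores of 0.69 for impact extraction and 0.74 for impacted location extraction, substantially outperforming the pre-trained baseline.

Insight: Fine-tuned language models can handle unstructured social media text efficiently and offer a scalable solution for disaster response, supporting resource allocation and post-disaster recovery planning.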

Abstract: Large-scale disasters can often result in catastrophic consequences on people and infrastructure. Situation awareness about such disaster impacts generated by authoritative data from in-situ sensors, remote sensing imagery, and/or geographic data is often limited due to atmospheric opacity, satellite revisits, and time limitations. This often results in geo-temporal information gaps. In contrast, impact-related social media posts can act as “geo-sensors” during a disaster, where people describe specific impacts and locations. However, not all locations mentioned in disaster-related social media posts relate to an impact. Only the impacted locations are critical for directing resources effectively. For example, the post “The death toll from a fire which ripped through the Greek coastal town of #Mati stood at 80, with dozens of people unaccounted for as forensic experts tried to identify victims who were burned alive #Greecefires #AthensFires #Athens #Greece.” contains the impacted location “Mati” and the non-impacted locations “Greece” and “Athens”. This research uses Large Language Models (LLMs) to identify all locations, impacts and impacted locations mentioned in disaster-related social media posts. In the process, LLMs are fine-tuned to identify only impacts and impacted locations (as distinct from other, non-impacted locations), including locations mentioned in informal expressions, abbreviations, and short forms. Our fine-tuned model demonstrates efficacy, achieving an F1-score of 0.69 for impact and 0.74 for impacted location extraction, substantially outperforming the pre-trained baseline. These robust results confirm the potential of fine-tuned language models to offer a scalable solution for timely decision-making in resource allocation, situational awareness, and post-disaster recovery planning for responders.

[182] Dissecting the Ledger: Locating and Suppressing “Liar Circuits” in Financial Large Language Models cs.CL | cs.CEPDF

Soham Mirajkar

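TL;DR: The paper proposes a mechanistic analysis approach for detecting and suppressing arithmetic hallucinations in financial large language models, identifying a distributed computational scratchpad and a decisive aggregation circuit inside the model.

Details

Motivation: Large language models show reproducible arithmetic hallucinations in financial settings, and existing mitigation strategies treat the model as a black box.

Result: Suppressing the key late layer (Layer 46) reduces the model's confidence in hallucinatory outputs by 81.8%, and a linear probe trained on that layer reaches 98% accuracy on unseen financial topics.

Insight: The results suggest that arithmetic hallucinations share a universal geometry, which could guide the design of more reliable financial language models.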

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in middle layers (L12-L30) and a decisive aggregation circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, demonstrating that suppressing Layer 46 reduces the model’s confidence in hallucinatory outputs by 81.8%. Furthermore, we demonstrate that a linear probe trained on this layer generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.

[183] Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models cs.CL | cs.LGPDF

Linye Wei, Wenjue Chen, Pingzhi Tang, Xiaotian Guo, Le Ye

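TL;DR: ODB-dLLM is an acceleration framework inspired by arithmetic intensity that improves the inference efficiency of diffusion language models (dLLMs) through adaptive length prediction and jump-share speculative decoding.

Details

Motivation: The bidirectional attention of existing dLLM frameworks requires periodic cache refreshes, which makes inference inefficient. This paper targets the heterogeneous computational characteristics of the prefill and decoding phases.

Result: ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while mitigating accuracy degradation.

Insight: The heterogeneous computational characteristics of the two phases are the key to optimizing dLLM inference, and combining adaptive length prediction with speculative decoding yields substantial efficiency gains.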

Abstract: Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance its inference efficiency by enabling KV caching. However, its bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both contributing substantial inference cost and constraining achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.

[184] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding cs.CL | cs.AIPDF

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang

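TL;DR: The paper proposes fMRI-LM, a foundation model that aligns functional MRI (fMRI) with language through a three-stage framework, aiming at cross-modal semantic understanding of brain imaging.

Details

Motivation: Multimodal large language models have advanced mainly in images, audio, and video, while comparable capabilities for brain imaging remain largely unexplored. Aligning fMRI with language matters for understanding how neural activity relates to semantic cognition.

Result: Across several benchmarks, fMRI-LM shows strong zero-shot and few-shot performance and adapts efficiently with parameter-efficient tuning (LoRA).

Insight: fMRI-LM offers a scalable path toward structural and semantic understanding of fMRI and demonstrates the potential of cross-modal foundation models for brain imaging.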

Abstract: Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

[185] LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti cs.CL | cs.CYPDF

Tabia Tanzin Prama, Christopher M. Danforth, Peter Sheridan Dodds

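TL;DR: The paper studies how large language models (LLMs) perform on low-resource dialect translation, specifically for Sylheti, and proposes Sylheti-CAP, a context-aware prompting framework that consistently improves translation quality.

Details

Motivation: Although LLMs translate well in general, their ability in dialectal and low-resource settings remains underexplored. Sylheti, a low-resource dialect of Bangla, has not been studied systematically.

Result: Both automatic metrics and human evaluation confirm that Sylheti-CAP improves translation quality and reduces hallucinations, ambiguities, and awkward phrasing.

Insight: Context-aware prompting is an effective way to improve LLM performance on low-resource and dialectal translation, and it offers a scalable solution for similar tasks.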

Abstract: Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based machine translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) across both translation directions (Bangla ↔ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, a dictionary (2,260 core vocabulary items and idioms), and an authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing, establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT. Dataset link: https://github.com/TabiaTanzin/LLMs-for-Low-Resource-Dialect-Translation-Using-Context-Aware-Prompting-A-Case-Study-on-Sylheti.git

[186] Factors That Support Grounded Responses in LLM Conversations: A Rapid Review cs.CL | cs.AIPDF

Gabriele Cesar Iwashima, Claudia Susie Rodrigues, Claudio Dipolitto, Geraldo Xexéo

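TL;DR: The paper reviews strategies that help large language models (LLMs) produce grounded responses, covering inference-time, post-training, and reinforcement learning-based methods, with inference-time methods standing out as particularly efficient.

Details

Motivation: In conversation, LLM outputs may be misaligned with user intent, lack contextual grounding, or contain hallucinations, all of which undermine reliability. Alignment and grounding therefore need to improve.

Result: Inference-time methods are efficient and support intent alignment, contextual grounding, and hallucination mitigation, improving the quality and reliability of LLM outputs.

Insight: Inference-time interventions improve conversational behavior without retraining the model, which makes them a practical and efficient direction for optimization.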

Abstract: Large language models (LLMs) may generate outputs that are misaligned with user intent, lack contextual grounding, or exhibit hallucinations during conversation, which compromises the reliability of LLM-based applications. This review aimed to identify and analyze techniques that align LLM responses with conversational goals, ensure grounding, and reduce hallucination and topic drift. We conducted a Rapid Review guided by the PRISMA framework and the PICO strategy to structure the search, filtering, and selection processes. The alignment strategies identified were categorized according to the LLM lifecycle phase in which they operate: inference-time, post-training, and reinforcement learning-based methods. Among these, inference-time approaches emerged as particularly efficient, aligning outputs without retraining while supporting user intent, contextual grounding, and hallucination mitigation. The reviewed techniques provided structured mechanisms for improving the quality and reliability of LLM responses across key alignment objectives.

[187] ResearchArcade: Graph Interface for Academic Tasks cs.CL | cs.LGPDF

Jingjun Xu, Chongshan Lin, Haofei Yu, Tao Feng, Jiaxuan You

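TL;DR: ResearchArcade is a unified, graph-based data interface that supports the development of machine learning models for a variety of academic tasks. It integrates multi-source, multi-modal data and unifies task definitions, and experiments demonstrate its effectiveness.

Details

Motivation: Academic research data comes from diverse sources, and the lack of a unified interface for developing machine learning models limits the pace of knowledge discovery.

Result: Experiments on six academic tasks show that combining cross-source and multi-modal information enables a broader range of tasks, and that incorporating graph structures consistently improves performance over baseline methods.

Insight: Combining a unified interface with graph structure improves model performance and provides more effective tooling for academic research.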

Abstract: Academic research generates diverse data sources, and as researchers increasingly use machine learning to assist research tasks, a crucial question arises: Can we build a unified data interface to support the development of machine learning models for various academic tasks? Models trained on such a unified interface can better support human researchers throughout the research process, eventually accelerating knowledge discovery. In this work, we introduce ResearchArcade, a graph-based interface that connects multiple academic data sources, unifies task definitions, and supports a wide range of base models to address key academic challenges. ResearchArcade utilizes a coherent multi-table format with graph structures to organize data from different sources, including academic corpora from ArXiv and peer reviews from OpenReview, while capturing information with multiple modalities, such as text, figures, and tables. ResearchArcade also preserves temporal evolution at both the manuscript and community levels, supporting the study of paper revisions as well as broader research trends over time. Additionally, ResearchArcade unifies diverse academic task definitions and supports various models with distinct input requirements. Our experiments across six academic tasks demonstrate that combining cross-source and multi-modal information enables a broader range of tasks, while incorporating graph structures consistently improves performance over baseline methods. This highlights the effectiveness of ResearchArcade and its potential to advance research progress.

[188] Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing cs.CLPDF

Rochana Chaturvedi, Yue Zhou, Andrew Boyd, Brian T. Layden, Mudassir Rashid

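TL;DR: The paper proposes two methods, HiTGNN and ReVeAL, for temporally and contextually grounded clinical language processing over electronic health records (EHRs) to predict chronic disease risk. HiTGNN is a hierarchical temporal graph neural network that incorporates clinical knowledge; ReVeAL is a lightweight framework that distills the reasoning of large language models into smaller verifier models. Experiments show that both methods perform well in diabetes screening while taking privacy and fairness into account.

Details

Motivation: Using clinical notes in EHRs for early risk prediction is valuable, but it faces challenges such as long text, irregular event distribution, and complex temporal dependencies.

Result: HiTGNN achieves the highest predictive accuracy in diabetes screening, and ReVeAL improves sensitivity to true cases. HiTGNN also performs more equitably across subgroups in the fairness analysis.

Insight: 1. Temporal structure is essential for clinical risk prediction; 2. Models that combine knowledge with data perform better; 3. Lightweight frameworks can perform well while limiting reliance on large models.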

Abstract: Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight, test-time framework that distills the reasoning of large language models into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.

[189] A Hybrid Theory and Data-driven Approach to Persuasion Detection with Large Language Models cs.CLPDF

Gia Bao Hoang, Keith J Ransom, Rachel Stephens, Carolyn Semmler, Nicolas Fay

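TL;DR: The paper proposes a hybrid theory- and data-driven approach that uses large language models (LLMs) to predict how persuasive a message is, building a random forest classifier on features drawn from psychological experiments.

Details

Motivation: With the rise of social media, traditional psychological models struggle to capture belief revision at scale in text-based online discourse, so more effective models are needed.

Result: Of the eight features tested, epistemic emotion and willingness to share are the most important predictors of belief change.

Insight: The work characterizes persuasive messages and shows how LLMs can strengthen models grounded in psychological theory, with applications to online influence detection and misinformation mitigation.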

Abstract: Traditional psychological models of belief revision focus on face-to-face interactions, but with the rise of social media, more effective models are needed to capture belief revision at scale, in this rich text-based online discourse. Here, we use a hybrid approach, utilizing large language models (LLMs) to develop a model that predicts successful persuasion using features derived from psychological experiments. Our approach leverages LLM generated ratings of features previously examined in the literature to build a random forest classification model that predicts whether a message will result in belief change. Of the eight features tested, epistemic emotion and willingness to share were the top-ranking predictors of belief change in the model. Our findings provide insights into the characteristics of persuasive messages and demonstrate how LLMs can enhance models of successful persuasion based on psychological theory. Given these insights, this work has broader applications in fields such as online influence detection and misinformation mitigation, as well as measuring the effectiveness of online narratives.

[190] Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples cs.CLPDF

Shuhei Yamashita, Daiki Shirafuji, Tatsuhiko Saito

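TL;DR: The paper proposes similarity standardization with pseudo-positive samples to address the modality gap in cross-modal retrieval, significantly improving retrieval performance.

Details

Motivation: In cross-modal retrieval, the modality gap (similarity scores on different scales for different modalities) hurts retrieval accuracy, and existing methods usually depend on manually labeled data. The paper aims to reduce that dependence through pseudo data construction and similarity standardization.

Result: On the MMQA and WebQA benchmarks, the method significantly improves retrieval, with average Recall@20 gains of 64% (MMQA) and 28% (WebQA), and it outperforms E5-V, which relies on image captioning.

Insight: Pseudo data construction can replace manual labeling, and similarity standardization is an effective way to close the modality gap. The method requires no model fine-tuning and is broadly applicable.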

Abstract: Advances in vision-language models (VLMs) have enabled effective cross-modality retrieval. However, when both text and images exist in the database, similarity scores would differ in scale by modality. This phenomenon, known as the modality gap, hinders accurate retrieval. Most existing studies address this issue with manually labeled data, e.g., by fine-tuning VLMs on them. In this work, we propose a similarity standardization approach with pseudo data construction. We first compute the mean and variance of the similarity scores between each query and its paired data in text or image modality. Using these modality-specific statistics, we standardize all similarity scores to compare on a common scale across modalities. These statistics are calculated from pseudo pairs, which are constructed by retrieving the text and image candidates with the highest cosine similarity to each query. We evaluate our method across seven VLMs using two multi-modal QA benchmarks (MMQA and WebQA), where each question requires retrieving either text or image data. Our experimental results show that our method significantly improves retrieval performance, achieving average Recall@20 gains of 64% on MMQA and 28% on WebQA when the query and the target data belong to different modalities. Compared to E5-V, which addresses the modality gap through image captioning, we confirm that our method more effectively bridges the modality gap.
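The standardization step described in the abstract amounts to a per-modality z-score. The sketch below assumes the pseudo-pair similarity scores have already been collected for each modality; the function names, the `(item_id, modality, raw_score)` candidate layout, and the use of the population standard deviation are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev


def modality_stats(pseudo_scores):
    """Return (mean, std) of similarity scores from one modality's pseudo pairs."""
    mu = mean(pseudo_scores)
    sigma = pstdev(pseudo_scores)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero when all scores are identical
    return mu, sigma


def standardize(score, stats):
    """Put a raw similarity score on the common scale via a z-score."""
    mu, sigma = stats
    return (score - mu) / sigma


def rank_candidates(candidates, stats_by_modality):
    """Rank (item_id, modality, raw_score) tuples by standardized score, best first."""
    scored = [
        (item_id, standardize(raw, stats_by_modality[modality]))
        for item_id, modality, raw in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With made-up statistics in which text scores cluster around 0.85 and image scores around 0.30, an image candidate with a raw score of 0.34 can outrank a text candidate with 0.86, because each score is compared against its own modality's distribution rather than on the raw scale.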

[191] C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models cs.CLPDF

Kairong Han, Nuanqiao Shan, Ziyu Zhao, Zijing Hu, Xinpeng Dong

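TL;DR: The paper proposes C²DLM, which uses causal concepts to guide a diffusion language model. It addresses the limited reasoning ability of conventional AR and DLM approaches, improving reasoning task performance and speeding up training.

Details

Motivation: Autoregressive (AR) language models and diffusion language models (DLMs) both fall short in reasoning, in particular because they do not model human causal knowledge. Natural language has flexible causal structure, whereas AR models are bound to a strict order and DLMs ignore causal order altogether.

Result: C²DLM improves by 12% on the COT-OrderPerturb task with about a 3.2x training speedup, and achieves an average gain of 1.31% across six downstream reasoning tasks.

Insight: Explicitly modeling causal relationships in language is key to improving the reasoning ability of language models, and it avoids interference from subgoals involving causal inversion.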

Abstract: Autoregressive (AR) language models and Diffusion Language Models (DLMs) constitute the two principal paradigms of large language models. However, both paradigms suffer from insufficient reasoning capabilities. Human reasoning inherently relies on causal knowledge and thought, which are reflected in natural language. But in the AR paradigm, language is modeled as next token prediction (a strictly left-to-right, token-by-token order), whereas natural language itself exhibits more flexible causal structures. In the DLM paradigm, the attention mechanism is fully connected, which entirely disregards causal order. To fill this gap, we propose a Causal Concept-Guided Diffusion Language Model (C$^2$DLM). Starting from DLM’s fully connected attention, C$^2$DLM first obtains a concept-level causal graph from the teacher model, and then explicitly guides attention to learn causal relationships between concepts. By focusing on causal relationships and avoiding interference from difficult subgoals involving causal inversion, C$^2$DLM improves 12% with about 3.2 times training speedup in the COT-OrderPerturb task, and achieves an average gain of 1.31% across six downstream reasoning tasks. More details are available in the repository: https://github.com/Kairong-Han/C-2-DLM

[192] A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text cs.CL | cs.AIPDF

Sepyan Purnama Kristanto, Lutfi Hakim

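TL;DR: The paper proposes a theoretically grounded hybrid ensemble that fuses three complementary detection paradigms (a RoBERTa classifier, a GPT-2 probabilistic detector, and a statistical linguistic analyzer) to detect LLM-generated text with high accuracy and a low false positive rate.

Details

Motivation: With the rapid spread of large language models (LLMs), telling human-written from machine-generated text has become urgent, especially for academic integrity and information reliability. Existing detectors generalize poorly and have high false positive rates.

Result: On a large-scale corpus of 30,000 documents, the system reaches 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text.

Insight: Combining models with low inter-model correlation (rho ~ 0.35-0.42) makes detection more robust by reducing variance, which suits high-stakes domains.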

Abstract: The rapid proliferation of Large Language Models (LLMs) has blurred the line between human and machine authorship, creating practical risks for academic integrity and information reliability. Existing text detectors typically rely on a single methodological paradigm and suffer from poor generalization and high false positive rates (FPR), especially on high-stakes academic text. We propose a theoretically grounded hybrid ensemble that systematically fuses three complementary detection paradigms: (i) a RoBERTa-based transformer classifier for deep semantic feature extraction, (ii) a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and (iii) a statistical linguistic feature analyzer capturing stylometric patterns. The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score rather than set heuristically. We provide a bias-variance analysis and empirically demonstrate low inter-model correlation (rho ~ 0.35-0.42), a key condition for variance reduction. Evaluated on a large-scale, multigenerator corpus of 30,000 documents, our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text. This yields a more reliable and ethically responsible detector for real-world deployment in education and other high-stakes domains.
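The weighted-voting step can be sketched as a search over the probability simplex for the weights that maximize F1 on a validation set. The abstract does not say which optimizer is used; the coarse grid search below is a stand-in chosen for brevity, and the function names, the 0.5 decision threshold, and the three-detector input layout are assumptions.

```python
from itertools import product


def f1_score(y_true, y_pred):
    """Binary F1 with label 1 as the positive (machine-generated) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def ensemble_predict(probs, weights, threshold=0.5):
    """probs: one (p_a, p_b, p_c) tuple of detector probabilities per document."""
    return [
        1 if sum(w * p for w, p in zip(weights, row)) >= threshold else 0
        for row in probs
    ]


def simplex_grid(steps):
    """Yield 3-component weight vectors on a regular grid over the simplex."""
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            yield (i / steps, j / steps, (steps - i - j) / steps)


def fit_weights(probs, y_true, steps=10):
    """Return the grid point with the highest validation F1 (grid-search stand-in)."""
    return max(
        simplex_grid(steps),
        key=lambda w: f1_score(y_true, ensemble_predict(probs, w)),
    )
```

Because the grid contains the three corner points of the simplex, the fitted ensemble can never score below the best single detector on the data used for fitting; a finer grid or a gradient-based method would be needed for a closer approximation of the optimum.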

[193] Lips-Jaw and Tongue-Jaw Articulatory Tradeoff in DYNARTmo cs.CL | cs.ROPDF

Bernd J. Kröger

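TL;DR: This paper examines how the dynamic articulatory model DYNARTmo handles lips-jaw and tongue-jaw coordination, showing how the model generates realistic spatio-temporal movement patterns under simplified assumptions.

Details

Motivation: To study how a dynamic articulatory model with simplified task-space gesture specifications accounts for synergy between articulators, particularly in lips-jaw and tongue-jaw coordination.

Result: The model reproduces empirically attested patterns of articulatory synergy, such as jaw-supported apical closures, lower-lip elevation in bilabial stops, and tongue-jaw co-movement.

Insight: Even with computationally simplified assumptions, DYNARTmo generates realistic articulatory movement patterns that capture synergy and tradeoff between articulators.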

Abstract: This paper investigates how the dynamic articulatory model DYNARTmo accounts for articulatory tradeoffs between primary and secondary articulators, with a focus on lips-jaw and tongue-jaw coordination. While DYNARTmo does not implement full task-dynamic second-order biomechanics, it adopts first-order task-space gesture specifications comparable to those used in articulatory phonology and integrates a simplified mechanism for distributing articulatory effort across multiple articulators. We first outline the conceptual relationship between task dynamics and DYNARTmo, emphasizing the distinction between high-level task-space trajectories and their low-level articulatory execution. We then present simulation results for a set of CV syllables that illustrate how jaw displacement varies as a function of both place of articulation (labial, apical, dorsal) and vowel context (/a/, /i/, /u/). The model reproduces empirically attested patterns of articulatory synergy, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions. These results demonstrate that even with computationally simplified assumptions, DYNARTmo can generate realistic spatio-temporal movement patterns that capture key aspects of articulatory tradeoff and synergy across a range of consonant-vowel combinations.

Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang

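TL;DR: The paper introduces RefineBench, a benchmark for evaluating the ability of language models (LMs) to refine their own responses, with 1,000 problems across 11 domains. Current frontier models do poorly at unguided self-refinement but do well at guided refinement with feedback.

Details

Motivation: As language models are used widely in real-world interactions, users often ask them to revise and improve answers to open-ended questions. Prior work, however, has focused on verifiable tasks and lacks a systematic evaluation on open-ended problems.

Result: In self-refinement, frontier models such as Gemini 2.5 Pro and GPT-5 achieve only modest scores (31.3% and 29.1%), whereas in guided refinement with feedback, responses can reach near-perfect levels.

Insight: Current language models remain weak at self-refinement and need breakthroughs; RefineBench provides a valuable testbed for evaluating and improving this ability.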

Abstract: Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs’ refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by -0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.

[195] Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information cs.CL | cs.AIPDF

Lukas Struppek, Dominik Hintersdorf, Hannah Struppek, Daniel Neider, Kristian Kersting

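TL;DR: The paper proposes Focused Chain-of-Thought (F-CoT), a training-free, input-centric method that structures the input information to make LLM reasoning more efficient, reducing token use while preserving accuracy.

Details

Motivation: The detailed chain-of-thought (CoT) traces that large language models generate during reasoning consume many tokens and add latency. Most efficiency methods work at the model level (for example reinforcement learning or supervised fine-tuning), whereas this paper explores optimization at the input level.

Result: On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT.

Insight: Structured input is a simple and effective lever for more efficient LLM reasoning, cutting computational cost without any change to the model.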

Abstract: Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.
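The two-stage flow described above (extract first, then reason only over the extracted context) can be sketched as follows. This is an illustration under stated assumptions, not the authors' prompts: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and both prompt templates are invented for the example.

```python
def focused_cot(question, llm):
    """Run a two-stage, F-CoT-style query and return (context, answer).

    llm: hypothetical callable, prompt string -> response string.
    """
    # Stage 1: condense the query into a short structured context.
    extraction_prompt = (
        "Extract only the facts and the final question needed to solve the "
        "problem. Return them as a short bulleted list.\n\nProblem: " + question
    )
    context = llm(extraction_prompt)

    # Stage 2: reason over the structured context only; the original
    # question text is deliberately not repeated here.
    reasoning_prompt = (
        "Using only the structured context below, reason step by step and "
        "give the final answer.\n\nContext:\n" + context
    )
    answer = llm(reasoning_prompt)
    return context, answer
```

The design choice worth noting is that the second prompt carries only the extracted context, so details dropped during extraction cannot draw attention during reasoning; whether both stages run in one model call or two is left open by this sketch.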

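[196] Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques cs.CL

Guifeng Wang, Yuanfeng Song, Meng Yang, Tao Zhu, Xiaoming Yin

TL;DR: The paper proposes RuCo-C, a framework that automatically generates interpretable evaluation rubrics and critiques to provide fine-grained reinforcement learning signals for text-to-SQL, addressing the reliance of prior methods on manual annotation and coarse-grained rewards.

Details

Motivation: Existing evaluation and reward mechanisms for text-to-SQL rely on manually annotated gold SQL queries, which are costly and hard to scale. In addition, reinforcement learning methods use only the binary execution outcome as the reward signal, which cannot capture structural and semantic errors.

Result: Experiments show that RuCo-C outperforms existing methods in text-to-SQL evaluation and yields significant performance gains.

Insight: Automatically generating fine-grained, interpretable rubrics lowers annotation cost and gives RL training a richer supervision signal.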

Abstract: Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a “progressive exploration” strategy during the RL training process, which dynamically adjusts the rewards to enhance the model’s performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.

[197] JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge cs.CL

Zhihan Cao, Fumihito Nishino, Hiroaki Yamada, Nguyen Ha Thanh, Yusuke Miyao

TL;DR: The paper introduces JBE-QA, a Japanese Bar Exam QA dataset designed to evaluate large language models’ legal knowledge, covering multiple legal domains and providing structured contextual data.

Details

Motivation: The motivation is to create a comprehensive benchmark for assessing large language models’ legal knowledge in Japanese, addressing gaps in prior resources focused mainly on the Civil Code.

Result: Proprietary models with reasoning capabilities performed best, while Constitution questions were easier than Civil Code or Penal Code questions.

Insight: Legal domain evaluation requires diverse benchmarks; reasoning-enabled models excel, and question difficulty varies by legal domain.

Abstract: We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models’ legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.

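[198] Language-conditioned world model improves policy generalization by reading environmental descriptions cs.CL | cs.LG

Anh Nguyen, Stefan Lee

TL;DR: The paper proposes a language-conditioned world model (LED-WM) that uses an attention mechanism to explicitly ground environment descriptions to observed entities, improving policy generalization to unseen environments without inference-time planning or expert demonstrations.

Details

Motivation: Existing methods either generalize poorly or rely on limiting assumptions (such as tolerable inference-time planning latency or available expert demonstrations). This paper aims to drop these assumptions and use language descriptions of environment dynamics to improve policy generalization.

Result: In two environments (MESSENGER and MESSENGER-WM), LED-WM outperforms the baselines, particularly on unseen games. The policy can also be improved by fine-tuning on synthetic trajectories generated by the world model.

Insight: Explicitly grounding language descriptions to observed entities helps the policy understand environment dynamics and therefore generalize better.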

Abstract: To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment (that is, how the environment behaves) rather than just task instructions specifying “what to do”. Understanding this dynamics-descriptive language is important for human-agent interaction and agent behavior. Recent work addresses this problem using a model-based approach: language is incorporated into a world model, which is then used to learn a behavior policy. However, these existing methods either do not demonstrate policy generalization to unseen games or rely on limiting assumptions, for instance that the latency induced by inference-time planning is tolerable for the target task or that expert demonstrations are available. Expanding on this line of research, we focus on improving policy generalization from a language-conditioned world model while dropping these assumptions. We propose a model-based reinforcement learning approach, where a language-conditioned world model is trained through interaction with the environment, and a policy is learned from this model without planning or expert demonstrations. We introduce the Language-aware Encoder for Dreamer World Model (LED-WM), built on top of DreamerV3. LED-WM features an observation encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation. We show that policies trained with LED-WM generalize more effectively to unseen games described by novel dynamics and language compared to other baselines in several settings in two environments: MESSENGER and MESSENGER-WM. To highlight how the policy can leverage the trained world model before real-world deployment, we demonstrate that the policy can be improved through fine-tuning on synthetic test trajectories generated by the world model.

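[199] Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework cs.CL | cs.CV

Kelaiti Xiao, Liang Yang, Dongyu Zhang, Paerhati Tulajiang, Hongfei Lin

TL;DR: The paper proposes an iterative framework that combines an LLM (large language model), a T2IM (text-to-image model), and an MLLM (multimodal large language model) to automatically generate and evaluate idiom-based visual pun images.

Details

Motivation: The work studies how a single image can convey both the literal and the figurative meaning of an idiom, filling a gap in idiom-based visual pun generation.

Result: Experiments show that the choice of MLLM has the largest effect on performance: GPT performs best, Gemini follows, and the open-source model Gemma is competitive with some closed models. Among LLMs, Claude is strongest at prompt generation.

Insight: Multimodal model capability is the key factor. Open-source models already approach closed models on part of the task, and the results show the potential of iterative refinement in multimodal tasks.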

Abstract: We study idiom-based visual puns (images that align an idiom’s literal and figurative meanings) and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
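
The four-step loop in the abstract (generate a prompt, synthesize an image, infer the idiom, refine until recognition succeeds or a step limit is hit) can be sketched as follows. The three callables stand in for the LLM, T2IM, and MLLM; their signatures, the feedback string, and the exact-match success test are assumptions made for illustration, not details from the paper.

```python
from typing import Callable, Optional, Tuple


def generate_visual_pun(
    idiom: str,
    write_prompt: Callable[[str, Optional[str]], str],  # LLM: (idiom, feedback) -> visual prompt
    draw: Callable[[str], object],                       # T2IM: prompt -> image
    guess_idiom: Callable[[object], str],                # MLLM: image -> guessed idiom
    max_steps: int = 5,
) -> Tuple[str, object, bool, int]:
    """Return (last prompt, last image, recognized?, steps used)."""
    feedback: Optional[str] = None
    prompt, image = "", None
    for step in range(1, max_steps + 1):
        prompt = write_prompt(idiom, feedback)   # (i) generate or refine the prompt
        image = draw(prompt)                     # (ii) synthesize an image
        guess = guess_idiom(image)               # (iii) infer the idiom from the image
        if guess.strip().lower() == idiom.strip().lower():
            return prompt, image, True, step     # recognition succeeded
        # (iv) pass the failed guess back so the next prompt can be refined
        feedback = f"The image was read as '{guess}', not '{idiom}'."
    return prompt, image, False, max_steps
```

Keeping the models behind plain callables makes it easy to swap LLM, T2IM, or MLLM backends, which mirrors the cross-model comparison the abstract reports.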

Yujiao Yang, Jing Lian, Linhui Li

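TL;DR: MGRS is a multi-chain graph refinement and selection framework that generates diverse reasoning trajectories, refines them with combined self- and cross-verification, builds a reasoning relation graph, and selects the most reliable answer, improving both the reasoning capability and the computational efficiency of large language models.

Details

Motivation: On complex reasoning tasks, existing approaches for large language models suffer from limited diversity of reasoning strategies, redundant search branches, and inadequate integration across heterogeneous reasoning paths, which limits their practicality.

Result: MGRS reaches an average accuracy of 82.9% across six benchmark datasets, 2.1% above the best baseline, and attains 100% accuracy on the 24-point game for the first time, with a 13.6x speed-up.

Insight: Multi-chain refinement and selection, combined with self- and cross-verification, can markedly improve the reliability and efficiency of reasoning and offers a new approach to complex reasoning tasks.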

Abstract: The complex reasoning ability of Large Language Models (LLMs) poses a critical bottleneck for their practical applications. Test-time expansion methods such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enhance reasoning by introducing intermediate reasoning structures, tree search, or graph-based exploration mechanisms. However, their reasoning strategies suffer from limited diversity, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. To address these limitations, we propose a novel reasoning framework called Multi-chain Graph Refinement & Selection (MGRS), which first generates multiple diverse reasoning trajectories for a given problem, refines candidate responses using a composite self- and cross-verification strategy, then constructs a reasoning relation graph and estimates the success rate of intermediate nodes, and finally computes cumulative success rates to select the most reliable answer and corresponding reasoning trajectory. Experimental results demonstrate that MGRS significantly advances both the reasoning capability and computational efficiency of reasoning enhancement methods. Across six benchmark datasets spanning four distinct tasks, MGRS achieves an average accuracy of 82.9%, outperforming state-of-the-art baselines by a clear margin of 2.1%. Remarkably, on the 24-point game, MGRS attains 100% accuracy for the first time, while delivering a 13.6x speed-up compared to the leading Forest of Thoughts framework.
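
The final selection step (compute cumulative success rates and pick the most reliable trajectory) might look like the sketch below. The abstract does not say how node success rates are combined, so multiplying them along a trajectory is an assumption, as are the names `select_trajectory`, `trajectories`, and `node_success`.

```python
import math
from typing import Dict, List, Tuple


def select_trajectory(
    trajectories: Dict[str, List[str]],   # trajectory name -> ordered node ids
    node_success: Dict[str, float],       # node id -> estimated success rate in [0, 1]
) -> Tuple[str, float]:
    """Pick the trajectory with the highest cumulative success rate.

    Assumption: the cumulative rate is the product of per-node rates.
    """
    def cumulative(path: List[str]) -> float:
        return math.prod(node_success[node] for node in path)

    best = max(trajectories, key=lambda name: cumulative(trajectories[name]))
    return best, cumulative(trajectories[best])
```

Under this product assumption a single weak intermediate node drags down the whole trajectory, which is one way to read the abstract's emphasis on error correction across reasoning paths.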

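[201] Listwise Preference Optimization with Element-wise Confusions for Aspect Sentiment Quad Prediction cs.CL | cs.AI

Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin

TL;DR: The paper proposes a listwise preference optimization method to improve Aspect Sentiment Quad Prediction (ASQP), introducing element-wise confusable candidates and natural language rationales to strengthen structural validity and relational coherence.

Details

Motivation: Prior marker-based prediction methods struggle to model the intricate relationships among elements and degrade when predicting higher-order elements (such as category and sentiment polarity), which motivates an approach with explicit reasoning and better interpretability.

Result: Experiments on four benchmark datasets show that the method improves quadruple prediction accuracy and explanation consistency.

Insight: Explicitly introducing confusable candidates and natural language rationales can improve both performance and interpretability in structured prediction tasks.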

Abstract: Aspect sentiment quad prediction (ASQP) is inherently challenging because it requires predicting a structured quadruple of four core sentiment elements: aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Prior methods relying on marker-based prediction struggle with modeling the intricate relationships among elements and experience sharp performance declines when predicting higher-order elements (e.g., c and s) under standard supervised fine-tuning. To address these limitations, we employ reasoning-based generation to output both the quadruple and a natural language rationale under element prefixes within a unified template, encouraging explicit relational reasoning and interpretability. To further enhance element-wise alignment, we introduce a listwise preference optimization framework for improving structural validity and relational coherence. Specifically, we generate element-wise confusable candidates via syntactic and semantic proximity, then train the model with listwise objectives to prefer the gold candidates over closely competing alternatives. Extensive experiments on four benchmark datasets demonstrate that our framework effectively improves quadruple prediction accuracy and explanation consistency.
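
The abstract does not give the exact listwise objective. A common form, used here purely as an assumed stand-in, is a softmax cross-entropy over the candidate list that rewards the gold candidate for outscoring its confusable alternatives:

```python
import math
from typing import Sequence


def listwise_loss(scores: Sequence[float], gold_index: int = 0) -> float:
    """Negative log-probability of the gold candidate under a softmax over the list.

    `scores` holds one model score per candidate (gold plus confusable alternatives).
    This is a generic listwise loss, not necessarily the one used in the paper.
    """
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[gold_index]
```

The loss shrinks toward zero as the gold candidate's score pulls ahead of every alternative, and equals `log(n)` when all `n` candidates tie.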

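[202] Scaling HuBERT for African Languages: From Base to Large and XL cs.CL

Antoine Caubrière, Elodie Gauthier

TL;DR: The paper introduces two models, SSA-HuBERT-Large and SSA-HuBERT-XL, for speech processing in African languages, filling the gap left by the lack of large models for these languages, and shows experimentally that larger models improve performance.

Details

Motivation: Despite progress in multilingual speech processing, African languages remain under-represented in research and deployed systems, particularly when it comes to strong open-weight encoders that transfer well under low-resource supervision.

Result: Experiments show that larger architectures make effective use of large audio datasets and significantly improve task performance.

Insight: Larger models offer a clear advantage on African-language tasks, which provides a useful reference point for future work in this area.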

Abstract: Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.

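[203] Optimizing Multimodal Language Models through Attention-based Interpretability cs.CL | cs.CV

Alexander Sergeev, Evgeny Kotelnikov

TL;DR: The paper proposes an attention-based interpretability method that analyzes attention scores to guide parameter-efficient fine-tuning (PEFT) of multimodal language models (MLMs), focusing on attention heads that attend to key objects in the image.

Details

Motivation: Modern large language models are becoming multimodal, but full fine-tuning is computationally expensive, and PEFT methods, though efficient, give little guidance on which components are most worth training. An interpretability method is needed to balance efficiency and performance.

Result: Fine-tuning the layers with the highest HI scores (around 0.01% of parameters) shifts the metrics more than fine-tuning randomly selected or lowest-HI-score layers.

Insight: Attention-based interpretability can identify a model's key components, and targeted adjustment of a very small share of parameters can substantially influence performance on multimodal tasks.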

Abstract: Modern large language models are becoming multimodal, analyzing various data formats such as text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective to train in order to balance efficiency and performance. We propose an attention-based interpretability method for MLMs that analyzes attention scores relative to image tokens. The core idea is to identify attention heads that focus on image key objects. We utilize this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with image key objects, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method’s effectiveness. By calculating Head Impact (HI) scores, we quantify an attention head’s focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers. This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.
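
The abstract does not define the Head Impact score precisely. One plausible reading, sketched here as an assumption, is the share of a head's image-token attention that lands on tokens covered by the key-object mask; the function names and the ranking helper are hypothetical.

```python
from typing import Dict, List, Sequence


def head_object_focus(attn_to_image: Sequence[float], object_mask: Sequence[bool]) -> float:
    """Fraction of a head's attention over image tokens that falls on key-object tokens."""
    if len(attn_to_image) != len(object_mask):
        raise ValueError("attention and mask must cover the same image tokens")
    total = sum(attn_to_image)
    if total == 0:
        return 0.0  # the head ignores the image entirely
    on_object = sum(a for a, inside in zip(attn_to_image, object_mask) if inside)
    return on_object / total


def rank_heads(scores: Dict[str, float]) -> List[str]:
    """Order head identifiers from most to least object-focused."""
    return sorted(scores, key=scores.get, reverse=True)
```

Heads (or the layers containing them) at the top of such a ranking would then be the candidates for parameter-efficient fine-tuning.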

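[204] Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization cs.CL

Jian Li, Shenglin Yin, Yujia Zhang, Alan Zhao, Xi Chen

TL;DR: The paper proposes Ambiguity Awareness Optimization (AAO), which automatically re-weights semantically ambiguous content in preference pairs to address the performance limit that such content imposes on Direct Preference Optimization (DPO).

Details

Motivation: DPO is widely used for reinforcement learning from human feedback (RLHF), but identical or semantically similar content (ambiguous content) in preference pairs may introduce ambiguity and limit further gains in alignment.

Result: AAO outperforms existing methods on several benchmarks (AlpacaEval 2, MT-Bench, and Arena-Hard), with gains over DPO of up to 15 points.

Insight: Identifying and handling ambiguous content is key to improving DPO, and AAO addresses it with a simple semantic-similarity computation.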

Abstract: Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness. It is observed that identical or semantically similar content (defined as ambiguous content) frequently appears within the preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we reveal that ambiguous content can introduce ambiguities, thereby degrading performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that automatically re-weights ambiguous content to reduce ambiguities by calculating semantic similarity from preference pairs. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and achieves an improvement of up to 15.0 points on Arena-Hard.

cs.GR

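[205] Geodiffussr: Generative Terrain Texturing with Elevation Fidelity cs.GR | cs.CV

Tai Inui, Alexander Matsumura, Edgar Simo-Serra

TL;DR: Geodiffussr is a flow-matching pipeline that uses multi-scale content aggregation (MCA) to generate textures that strictly adhere to a Digital Elevation Map (DEM), markedly improving visual fidelity and height-appearance coupling.

Details

Motivation: Large-scale terrain generation remains labor-intensive in computer graphics, so an efficient, controllable method for generating textures consistent with terrain elevation is needed.

Result: Compared with a non-MCA baseline, MCA markedly improves visual fidelity (FID ↓ 49.16%, LPIPS ↓ 32.33%) and strengthens height-appearance coupling (ΔdCor ↓ to 0.0016).

Insight: Geodiffussr is a strong baseline for controllable 2.5D landscape generation and complements physically based terrain and ecosystem simulators.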

Abstract: Large-scale terrain generation remains a labor-intensive task in computer graphics. We introduce Geodiffussr, a flow-matching pipeline that synthesizes text-guided texture maps while strictly adhering to a supplied Digital Elevation Map (DEM). The core mechanism is multi-scale content aggregation (MCA): DEM features from a pretrained encoder are injected into UNet blocks at multiple resolutions to enforce global-to-local elevation consistency. Compared with a non-MCA baseline, MCA markedly improves visual fidelity and strengthens height-appearance coupling (FID ↓ 49.16%, LPIPS ↓ 32.33%, ΔdCor ↓ to 0.0016). To train and evaluate Geodiffussr, we assemble a globally distributed, biome- and climate-stratified corpus of triplets pairing SRTM-derived DEMs with Sentinel-2 imagery and vision-grounded natural-language captions that describe visible land cover. We position Geodiffussr as a strong baseline and step toward controllable 2.5D landscape generation for coarse-scale ideation and previz, complementary to physically based terrain and ecosystem simulators.

eess.IV

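[206] Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data eess.IV | cs.CV

Satrajit Chakrabarty, Ravi Soni

TL;DR: The paper compares SAM 2 and SAM 3 on zero-shot segmentation of 3D medical data and finds that SAM 3 performs better on complex anatomical structures and in sparse-interaction settings, making it the better default choice for medical segmentation.

Details

Motivation: The study examines how SAM 2 and SAM 3 differ on zero-shot medical image segmentation, to test whether SAM 3 can replace SAM 2 out of the box without customization.

Result: SAM 3 substantially outperforms SAM 2 under click prompting and on complex anatomical structures, emerging as the more versatile choice for medical segmentation tasks.

Insight: SAM 3's new perception backbone and prompting mechanisms make it more adaptable in medical image segmentation, especially for tasks with sparse interaction or complex topology.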

Abstract: Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.

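[207] When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks eess.IV | cs.AI | cs.CV

David Isztl, Tahm Spitznagel, Gabor Mark Somfai, Rui Santos

TL;DR: The paper systematically evaluates the cost-effectiveness of domain-specific foundation models on retinal imaging tasks, finding that compact general-purpose architectures perform best on most tasks and that large models justify their cost only on challenging ones.

Details

Motivation: The work asks whether domain-specific foundation models outperform compact general-purpose architectures in retinal image classification by enough to justify their computational cost, and whether specialized retinal pretraining is worth its expense.

Result: Pretraining helps on every task (5.18-18.41% improvement). Compact models (27-29M parameters) perform best on most tasks, while RETFound (303M parameters) justifies its cost only for diabetic retinopathy grading. CFP tasks gain more from pretraining than OCT tasks.

Insight: Compact general-purpose models are sufficient for most retinal classification tasks; domain-specific foundation models are warranted only for fine-grained classification under extreme class imbalance.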

Abstract: Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work we address two critical questions: First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer these questions, we benchmark initialization strategies across four retinal imaging classification tasks spanning Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including vision transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions: First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Second, compact architectures (27-29M) dominate Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (accuracy of 71.15%), while ImageNet pretraining proves to be sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). Thus, the evidence suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks; specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.
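
The abstract's statement that compact architectures "dominate Pareto frontiers" refers to the set of models that no other model beats on both size and accuracy. A minimal sketch of that computation is shown below; the `Model` record and its field names are assumptions, and no values from the paper are used.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    params_m: float   # parameter count in millions
    accuracy: float   # task accuracy in percent


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep models not dominated on (fewer parameters, higher accuracy)."""
    front: List[Model] = []
    best_acc = float("-inf")
    # Visit models from smallest to largest; at equal size, most accurate first.
    for m in sorted(models, key=lambda m: (m.params_m, -m.accuracy)):
        # A larger model earns a place only if it is strictly more accurate
        # than every smaller model seen so far.
        if m.accuracy > best_acc:
            front.append(m)
            best_acc = m.accuracy
    return front
```

A large model stays on the frontier only when it buys accuracy that no smaller model reaches, which is the criterion the abstract applies when it says RETFound justifies its cost only for DR grading.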

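[208] Content Adaptive Encoding For Interactive Game Streaming eess.IV | cs.CV

Shakarim Soltanayev, Odysseas Zisimopoulos, Mohammad Ashraful Anam, Man Cheung Kung, Angeliki Katsenou

TL;DR: The paper proposes the first content-adaptive encoding (CAE) method for interactive game streaming (IGS). It adapts resolution dynamically from compact encoding metadata of past frames, improving video quality while meeting ultra-low-latency and tight compute requirements.

Details

Motivation: IGS imposes extreme latency and compute constraints, so conventional CAE methods cannot be applied directly. A CAE method that adapts resolution under these strict constraints is needed.

Result: The method improves over the default fixed-resolution HEVC ladder by 2.3 Bjøntegaard Delta-VMAF points with no latency overhead.

Insight: Compact encoding metadata and efficient CNN inference make content-adaptive encoding feasible under very low latency and tight compute budgets, offering a practical optimization path for IGS.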

Abstract: Video-on-demand streaming has benefitted from content-adaptive encoding (CAE), i.e., adaptation of resolution and/or quantization parameters for each scene based on convex hull optimization. However, CAE is very challenging to develop and deploy for interactive game streaming (IGS). Commercial IGS services impose ultra-low latency encoding with no lookahead or buffering, and have extremely tight compute constraints for any CAE algorithm execution. We propose the first CAE approach for resolution adaptation in IGS based on compact encoding metadata from past frames. Specifically, we train a convolutional neural network (CNN) to infer the best resolution from the options available for the upcoming scene based on a running window of aggregated coding block statistics from the current scene. By deploying the trained CNN within a practical IGS setup based on HEVC encoding, our proposal: (i) improves over the default fixed-resolution ladder of HEVC by 2.3 Bjøntegaard Delta-VMAF points; (ii) infers using 1ms of a single CPU core per scene, thereby having no latency overhead.

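[209] MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT Images eess.IV | cs.AI | cs.CV

Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang

TL;DR: The MICCAI STS 2024 Challenge uses semi-supervised learning to address the scarcity of annotated data for tooth segmentation, and shows large gains from semi-supervised methods on 2D and 3D medical image segmentation.

Details

Motivation: Manual instance-level annotation of teeth is highly time-consuming. The challenge aims to benchmark and advance semi-supervised learning as a remedy for this data scarcity.

Result: The best 2D OPG method improved the Instance Affinity (IA) score by over 44 percentage points, and the best 3D CBCT method improved the Instance Dice score by 61 percentage points.

Insight: Semi-supervised learning shows substantial potential for medical image segmentation, especially when labeled data is scarce.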

Abstract: Orthopantomograms (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge confirms the substantial benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants’ submitted code have been made publicly available on GitHub (https://github.com/ricoleehduu/STS-Challenge-2024), ensuring transparency and reproducibility.

cs.IR

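[210] FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text cs.IR | cs.CV

Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, Cathal Gurrin

TL;DR: The paper introduces the FIGROTD dataset and the VaGFeM method for image-guided retrieval with optional text, addressing datasets that are too large and methods that favor one subtask. VaGFeM selectively enhances feature dimensions using variance statistics and, combined with a dual-loss design, achieves competitive results on multiple benchmarks.

Details

Motivation: Large-scale datasets such as MagicLens are computationally expensive, and existing models tend to handle only visual or only composed queries well, which limits progress in image-guided retrieval.

Result: VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy, and outperforms stronger baselines despite using fewer training triplets.

Insight: A lightweight dataset and method design can balance compute and performance in image-guided retrieval, and variance statistics are an effective guide for feature selection.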

Abstract: Image-Guided Retrieval with Optional Text (IGROT) unifies visual retrieval (without text) and composed retrieval (with text). Despite its relevance in applications like Google Image and Bing, progress has been limited by the lack of an accessible benchmark and methods that balance performance across subtasks. Large-scale datasets such as MagicLens are comprehensive but computationally prohibitive, while existing models often favor either visual or compositional queries. We introduce FIGROTD, a lightweight yet high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets across CIR, SBIR, and CSTBIR. To reduce redundancy, we propose the Variance Guided Feature Mask (VaGFeM), which selectively enhances discriminative dimensions based on variance statistics. We further adopt a dual-loss design (InfoNCE + Triplet) to improve compositional reasoning. Trained on FIGROTD, VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy, outperforming stronger baselines despite fewer triplets.

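[211] UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries cs.IR | cs.CV

Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, Cathal Gurrin

TL;DR: UNION is a lightweight target representation for efficient zero-shot image-guided retrieval with optional text queries. It fuses the image embedding with a null-text prompt to improve semantic alignment with multimodal queries, requires no architectural changes to pretrained vision-language models, and reaches competitive performance with little training data.

Details

Motivation: Image-Guided Retrieval with Optional Text (IGROT) must handle anchor-image queries with or without text. Traditional methods rely on fixed target features and need large amounts of supervised data. UNION aims to provide a lightweight, generalisable target representation that improves semantic alignment and reduces data dependence.

Result: With only 5,000 training samples, UNION reaches mAP@50 of 38.5 on CIRCO and mAP@200 of 82.7 on Sketchy, surpassing many heavily supervised baselines.

Insight: A lightweight target representation can efficiently bridge the semantic gap between the visual and language modalities, and it performs especially well under low-data supervision.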

Abstract: Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting where a query consists of an anchor image, with or without accompanying text, aiming to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples - from LlavaSCo for CIR and Training-Sketchy for SBIR - our method achieves competitive results across benchmarks, including CIRCO mAP@50 of 38.5 and Sketchy mAP@200 of 82.7, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types.

cs.CR

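[212] PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration cs.CR | cs.CL

Junfei Zhan, Haoxun Shen, Zheng Lin, Tengjiao He

TL;DR: PRISM is a privacy-aware routing framework that dynamically balances privacy and inference quality in cloud-edge collaborative inference, addressing the shortcomings of existing methods in privacy protection and use of compute resources.

Details

Motivation: LLMs incur communication overhead and privacy risks when deployed in the cloud, and are constrained by compute and memory when run on edge devices. Existing cloud-edge inference methods do not distinguish input sensitivity, which causes unnecessary perturbation and degraded utility.

Result: PRISM achieves better privacy-utility trade-offs than the baselines across scenarios, reducing energy consumption and latency to 40-50% of baseline methods while maintaining high output quality under strong privacy constraints.

Insight: Through context-aware dynamic routing and layered privacy protection, PRISM shows that efficient privacy protection is feasible in cloud-edge collaborative settings.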

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities in natural language understanding and generation, but incur high communication overhead and privacy risks in cloud deployments, while facing compute and memory constraints when confined to edge devices. Cloud-edge inference has emerged as a promising paradigm for improving privacy in LLM services by retaining sensitive computations on local devices. However, existing cloud-edge inference approaches apply uniform privacy protection without considering input sensitivity, resulting in unnecessary perturbation and degraded utility even for non-sensitive tokens. To address this limitation, we propose Privacy-aware Routing for Inference with Semantic Modulation (PRISM), a context-aware framework that dynamically balances privacy and inference quality. PRISM executes in four stages: (1) the edge device profiles entity-level sensitivity; (2) a soft gating module on the edge selects an execution mode - cloud, edge, or collaboration; (3) for collaborative paths, the edge applies adaptive two-layer local differential privacy based on entity risks; and (4) the cloud LLM generates a semantic sketch from the perturbed prompt, which is then refined by the edge-side small language model (SLM) using local context. Our results show that PRISM consistently achieves superior privacy-utility trade-offs across various scenarios, reducing energy consumption and latency to 40-50% of baseline methods such as Uniform and Selective LDP, while maintaining high output quality under strong privacy constraints. These findings are validated through comprehensive evaluations involving realistic prompts, actual energy measurements, and heterogeneous cloud-edge model deployments.

[213] GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents cs.CR | cs.AI | cs.CV | cs.LG

Xinyu Zhang, Yixin Wu, Boyang Zhang, Chenhao Lin, Chao Shen

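TL;DR: GEO-Detective is an agent built on large vision language models (LVLMs) that mimics human reasoning and tool use to improve image geolocation accuracy, particularly on images lacking visible geographic features, and in doing so exposes the associated privacy risks.

Details

Motivation: Geographic cues in images shared on social media can expose user privacy. Early geolocation methods lacked generalization, and existing LVLM-based approaches are not optimized for geolocation. GEO-Detective is designed to explore this potential and its privacy risks.

Result: GEO-Detective improves over baseline models by more than 11.1% on country-level geolocation and by around 5.2% at finer-grained levels. With external clues, it predicts more accurately and reduces the "unknown" prediction rate by more than 50.6%. An analysis of several defense strategies shows that GEO-Detective remains comparatively robust against them.

Insight: GEO-Detective demonstrates the potential of LVLMs for geolocation while underscoring the urgency of privacy protection. More effective safeguards are needed as such techniques spread.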

Abstract: Images shared on social media often expose geographic cues. While early geolocation methods required expert effort and lacked generalization, the rise of Large Vision Language Models (LVLMs) now enables accurate geolocation even for ordinary users. However, existing approaches are not optimized for this task. To explore the full potential and associated privacy risks, we present Geo-Detective, an agent that mimics human reasoning and tool use for image geolocation inference. It follows a procedure with four steps that adaptively selects strategies based on image difficulty and is equipped with specialized tools such as visual reverse search, which emulates how humans gather external geographic clues. Experimental results show that GEO-Detective outperforms baseline large vision language models (LVLMs) overall, particularly on images lacking visible geographic features. In country level geolocation tasks, it achieves an improvement of over 11.1% compared to baseline LLMs, and even at finer grained levels, it still provides around a 5.2% performance gain. Meanwhile, when equipped with external clues, GEO-Detective becomes more likely to produce accurate predictions, reducing the “unknown” prediction rate by more than 50.6%. We further explore multiple defense strategies and find that Geo-Detective exhibits stronger robustness, highlighting the need for more effective privacy safeguards.

cs.AI

[214] Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation cs.AI | cond-mat.mes-hall | cond-mat.soft | cs.CL | cs.LG

Fiona Y. Wang, Di Sheng Lee, David L. Kaplan, Markus J. Buehler

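TL;DR: This paper proposes a decentralized LLM-agent framework inspired by swarm intelligence for de novo protein sequence design. Multiple agents working in parallel produce efficient, objective-directed designs without fine-tuning or specialized training.

Details

Motivation: Current generative methods for protein design often require extensive fine-tuning, task-specific data, or model reconfiguration, which limits flexibility and scalability. This paper aims to remove these limitations.

Result: Experiments validate the method on proteins with alpha-helix and coil structures, and the analyses show effective navigation of the protein fitness landscape.

Insight: The approach is not limited to protein design; it lays the groundwork for collective LLM-driven design across other biomolecular systems and scientific discovery tasks.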

Abstract: Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state-of-the-art generative methods, such as protein language models (PLMs) and diffusion-based architectures, often require extensive fine-tuning, task-specific data, or model reconfiguration to support objective-directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent-based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context-aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position-wise, decentralized coordination enables emergent design of diverse, well-defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure-based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective-directed designs within a few GPU-hours and operates entirely without fine-tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM-driven design across biomolecular systems and other scientific discovery tasks.

[215] DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning cs.AI | cs.CL

Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu

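TL;DR: DeepSeekMath-V2 proposes self-verifiable mathematical reasoning: a verifier and a proof generator are trained to improve each other, addressing the problem that correct final answers can hide unrigorous reasoning. The model achieves strong results on several mathematics competitions.

Details

Motivation: LLMs trained for mathematical reasoning typically use final-answer correctness as the reward signal, but a correct answer can mask flawed reasoning, and many tasks such as theorem proving require rigorous step-by-step derivation. A self-verification mechanism is therefore needed to improve rigor and accuracy.

Result: DeepSeekMath-V2 achieves gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.

Insight: Self-verification is key to improving mathematical reasoning models, especially on open problems without known solutions. Scaling verification compute to label hard-to-verify proofs helps maintain the generation-verification gap as the generator becomes stronger.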

Abstract: Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn’t address a key issue: correct answers don’t guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.

[216] ORION: Teaching Language Models to Reason Efficiently in the Language of Thought cs.AI | cs.CL | cs.LG

Kumar Tanmay, Kriti Aggarwal, Paul Pu Liang, Subhabrata Mukherjee

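TL;DR: The paper proposes ORION, a framework that trains models to reason efficiently in Mentalese, a symbolic, structured language modeled on human thought. Combined with SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement learning method, it substantially reduces reasoning length and compute cost while largely preserving accuracy.

Details

Motivation: Large Reasoning Models (LRMs) perform well on mathematics, code generation, and similar tasks, but their verbose reasoning leads to high latency and redundancy. Inspired by Mentalese, the symbolic mental language posited by the Language of Thought Hypothesis, the paper aims to improve reasoning efficiency and cost-effectiveness.

Result: Across benchmarks, ORION models use 4-16x fewer tokens, achieve up to 5x lower inference latency, and reduce training cost by 7-9x relative to the DeepSeek R1 Distilled model while retaining 90-98% of its accuracy. They also surpass Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2x compression.

Insight: A symbolic, structured reasoning language can markedly improve efficiency and real-time responsiveness while maintaining accuracy, suggesting a path toward human-like cognitive efficiency.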

Abstract: Large Reasoning Models (LRMs) achieve strong performance in mathematics, code generation, and task planning, but their reliance on long chains of verbose “thinking” tokens leads to high latency, redundancy, and incoherent reasoning paths. Inspired by the Language of Thought Hypothesis, which posits that human reasoning operates over a symbolic, compositional mental language called Mentalese, we introduce a framework that trains models to reason in a similarly compact style. Mentalese encodes abstract reasoning as ultra-compressed, structured tokens, enabling models to solve complex problems with far fewer steps. To improve both efficiency and accuracy, we propose SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement learning method that rewards concise solutions that stay correct, while still allowing longer reasoning when needed. Applied to Mentalese-aligned models, SLPO yields significantly higher compression rates by enabling concise reasoning that preserves the benefits of detailed thinking without the computational overhead. Across benchmarks including AIME 2024 and 2025, MinervaMath, OlympiadBench, Math500, and AMC, our ORION models produce reasoning traces with 4-16x fewer tokens, achieve up to 5x lower inference latency, and reduce training costs by 7-9x relative to the DeepSeek R1 Distilled model, while maintaining 90-98% of its accuracy. ORION also surpasses Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2x compression. These results show that Mentalese-style compressed reasoning offers a step toward human-like cognitive efficiency, enabling real-time, cost-effective reasoning without sacrificing accuracy.
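
The idea of rewarding concise solutions that stay correct, while not punishing length when a long solution is the only correct one, can be sketched as a group-relative reward. The formula below is a hypothetical illustration and is not the SLPO objective from the paper: the function name `length_aware_rewards`, the bonus weight `alpha`, and the linear length bonus are all assumptions.

```python
from typing import List, Tuple


def length_aware_rewards(samples: List[Tuple[bool, int]], alpha: float = 0.5) -> List[float]:
    """Score a group of sampled solutions for one problem.

    Each sample is (is_correct, num_tokens). Incorrect samples get 0.
    Correct samples get 1 plus a bonus that grows as they get shorter,
    measured relative to the other correct samples in the same group.
    If all correct samples have the same length (including the case of a
    single correct sample), no length pressure is applied.
    """
    correct_lengths = [n for ok, n in samples if ok]
    if not correct_lengths:
        return [0.0] * len(samples)
    shortest, longest = min(correct_lengths), max(correct_lengths)
    span = longest - shortest
    rewards = []
    for ok, n in samples:
        if not ok:
            rewards.append(0.0)
        elif span == 0:
            rewards.append(1.0)
        else:
            rewards.append(1.0 + alpha * (longest - n) / span)
    return rewards


if __name__ == "__main__":
    group = [(True, 100), (True, 300), (False, 50)]
    print(length_aware_rewards(group))
```

Because the bonus is computed only among correct samples, a short but wrong answer never outranks a long correct one.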

[217] Evaluating Strategies for Synthesizing Clinical Notes for Medical Multimodal AI cs.AI | cs.CV

Niccolo Marini, Zhaohui Liang, Sivaramakrishnan Rajaraman, Zhiyun Xue, Sameer Antani

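TL;DR: This paper studies how to generate synthetic clinical text notes through prompt design and the inclusion of medical metadata, and evaluates their effect on multimodal architectures in classification and cross-modal retrieval.

Details

Motivation: The scarcity of large heterogeneous biomedical multimodal data has restrained the development of robust models for medical AI. In dermatology in particular, datasets typically contain only images with minimal metadata describing the condition, which limits the benefit of multimodal integration.

Result: Experiments show that synthetic clinical notes improve classification performance, particularly under domain shift, and also unlock cross-modal retrieval.

Insight: With appropriate generation strategies, synthetic data can play an important role in multimodal medical AI, compensating for scarce real data while mitigating the hallucination risk of LLMs in clinical contexts.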

Abstract: Multimodal (MM) learning is emerging as a promising paradigm in biomedical artificial intelligence (AI) applications, integrating complementary modalities, which highlight different aspects of patient health. The scarcity of large heterogeneous biomedical MM data has restrained the development of robust models for medical AI applications. In the dermatology domain, for instance, skin lesion datasets typically include only images linked to minimal metadata describing the condition, thereby limiting the benefits of MM data integration for reliable and generalizable predictions. Recent advances in Large Language Models (LLMs) enable the synthesis of textual descriptions of image findings, potentially allowing the combination of image and text representations. However, LLMs are not specifically trained for use in the medical domain, and their naive inclusion has raised concerns about the risk of hallucinations in clinically relevant contexts. This work investigates strategies for generating synthetic textual clinical notes, in terms of prompt design and medical metadata inclusion, and evaluates their impact on MM architectures toward enhancing performance in classification and cross-modal retrieval tasks. Experiments across several heterogeneous dermatology datasets demonstrate that synthetic clinical notes not only enhance classification performance, particularly under domain shift, but also unlock cross-modal retrieval capabilities, a downstream task that is not explicitly optimized during training.

[218] Geometrically-Constrained Agent for Spatial Reasoning cs.AI | cs.CV

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He

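TL;DR: The paper proposes Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that addresses the semantic-to-geometric gap of vision language models (VLMs) in spatial reasoning.

Details

Motivation: VLMs excel at qualitative semantic inference, but their reasoning is misaligned with high-fidelity geometry. Existing approaches do not close this gap: training-based methods suffer from an "oracle paradox", and tool-integrated methods produce geometrically flawed plans.

Result: Experiments show that GCA achieves state-of-the-art performance on multiple spatial reasoning benchmarks, surpassing existing methods by about 27%.

Insight: Decoupling the VLM's roles through a formal task constraint resolves the semantic-geometric mismatch and makes spatial reasoning more robust and verifiable.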

Abstract: Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an “oracle paradox,” learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM’s planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM’s role into two stages. First, acting as a semantic analyst, the VLM translates the user’s ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.

cs.RO

[219] Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations cs.RO | cs.CL | cs.CV

Chancharik Mitra, Yusen Luo, Raj Saravanan, Dantong Niu, Anirudh Pai

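TL;DR: This paper proposes Robotic Steering, a finetuning method grounded in mechanistic interpretability that uses few-shot demonstrations to identify and selectively finetune task-specific attention heads, matching the physical, visual, and linguistic requirements of robotic tasks.

Details

Motivation: Existing finetuning methods for vision-language-action (VLA) models lack task specificity: they adapt the same parameters regardless of a task's physical, visual, and linguistic characteristics, which limits adaptation. Inspired by functional specificity in neuroscience, the paper proposes a more targeted finetuning method.

Result: In on-robot experiments, Robotic Steering outperforms LoRA while being more robust under task variation, cheaper to compute, and more interpretable.

Insight: Task-specific finetuning can improve adaptability while cutting unnecessary computation, offering a new direction for adapting VLA models in robotics.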

Abstract: Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors like robot embodiment, environment characteristics, and spatial relationships of each task. Existing finetuning methods lack specificity, adapting the same set of parameters regardless of a task’s visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks. Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.

[220] $\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion cs.RO | cs.AI | cs.CV | cs.LG

Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu

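TL;DR: This paper proposes E0, a continuized discrete diffusion framework that generates actions by iteratively denoising quantized action tokens. It improves the generalization and fine-grained control of vision-language-action models and achieves state-of-the-art performance across multiple tasks.

Details

Motivation: Existing VLA models generalize poorly across tasks, scenes, and camera viewpoints and often produce coarse or unstable actions. E0 targets these problems with a discrete diffusion framework.

Result: Across 14 environments in LIBERO, VLABench, and ManiSkill, E0 outperforms strong baselines by 10.7% on average and achieves state-of-the-art performance. Real-world experiments on a Franka arm confirm precise, robust, and transferable manipulation.

Insight: Discrete diffusion is better matched to the quantized nature of real-world control signals and aligns better with the symbolic structure of pretrained backbones, which improves generalization and fine-grained control.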

Abstract: Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control, whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals, and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions, yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
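
To make "quantized action tokens" concrete, the following minimal Python sketch maps a continuous action value to a discrete token and back using uniform binning. It illustrates only the tokenization step, not the diffusion process, and the details are assumptions rather than E0's actual scheme: the function names, the action range [-1, 1], and the 256-bin vocabulary are all hypothetical.

```python
def quantize(a: float, lo: float = -1.0, hi: float = 1.0, n_bins: int = 256) -> int:
    """Map a continuous action in [lo, hi] to a token index in [0, n_bins - 1]."""
    a = min(max(a, lo), hi)                      # clip out-of-range actions
    idx = int((a - lo) / (hi - lo) * n_bins)     # uniform bin index
    return min(idx, n_bins - 1)                  # a == hi falls into the last bin


def dequantize(tok: int, lo: float = -1.0, hi: float = 1.0, n_bins: int = 256) -> float:
    """Map a token index back to the center of its bin."""
    width = (hi - lo) / n_bins
    return lo + (tok + 0.5) * width


if __name__ == "__main__":
    action = 0.3141
    token = quantize(action)
    print(token, dequantize(token))
```

A larger `n_bins` gives a finer action vocabulary: the round-trip error is at most half a bin width, which is the sense in which a larger vocabulary supports finer-grained control.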

[221] RealD$^2$iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion cs.RO | cs.CV

Xiujian Liang, Jiacheng Liu, Mingyang Sun, Qichen He, Cewu Lu

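TL;DR: RealD$^2$iff proposes a diffusion-based clean-to-noisy paradigm that learns to synthesize noisy depth, bridging the visual sim2real gap without manual collection of real sensor data.

Details

Motivation: In robot manipulation, depth observations from simulation fail to reflect the complex noise of real sensors, and this visual sim2real gap limits real-world performance.

Result: RealD$^2$iff generates real-world-like depth and enables zero-shot sim2real transfer, substantially improving real-world manipulation performance.

Insight: By inverting the usual denoising view and modeling noise generation instead, diffusion models can effectively bridge the gap between simulation and the real world.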

Abstract: Robot manipulation in the real world is fundamentally constrained by the visual sim2real gap, where depth observations collected in simulation fail to reflect the complex noise patterns inherent to real sensors. In this work, inspired by the denoising capability of diffusion models, we invert the conventional perspective and propose a clean-to-noisy paradigm that learns to synthesize noisy depth, thereby bridging the visual sim2real gap through purely simulation-driven robotic learning. Building on this idea, we introduce RealD$^2$iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and fine-grained local perturbations. To enable progressive learning of these components, we further develop two complementary strategies: Frequency-Guided Supervision (FGS) for global structure modeling and Discrepancy-Guided Optimization (DGO) for localized refinement. To integrate RealD$^2$iff seamlessly into imitation learning, we construct a pipeline that spans six stages. We provide comprehensive empirical and experimental validation demonstrating the effectiveness of this paradigm. RealD$^2$iff enables two key applications: (1) generating real-world-like depth to construct clean-noisy paired datasets without manual sensor data collection, and (2) achieving zero-shot sim2real robot manipulation, substantially improving real-world performance without additional fine-tuning.

[222] Distracted Robot: How Visual Clutter Undermine Robotic Manipulation cs.RO | cs.AI | cs.CV

Amir Rasouli, Montgomery Alban, Sajjad Pakdamansavoji, Zhiyuan Li, Zhanguang Zhang

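TL;DR: The paper proposes an evaluation protocol, built from a psychophysical perspective, for studying robotic manipulation in cluttered scenes. It shows that visual clutter significantly degrades performance and analyzes the distinct vulnerabilities of different policies.

Details

Motivation: Prior work lacks a unified measure of visual clutter and so cannot fully capture the effects of the environment and distractors. The authors use a systematic, psychophysically motivated evaluation to reveal how clutter affects manipulation policies.

Result: Clutter lowers policy performance by as much as 34%; different VLA models respond to clutter differently; the clutter measure is an effective indicator of performance degradation; and finetuning on enhanced data does not fully remove the effects of clutter.

Insight: The effect of visual clutter on robotic manipulation cannot be ignored, and current policies need to improve to cope with complex environments. A unified clutter measure can help standardize future evaluation.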

Abstract: In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. Contrary to prior works, we approach evaluation from a psychophysical perspective and therefore use a unified measure of clutter that accounts for environmental factors as well as the distractors’ quantity, characteristics, and arrangement. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and the real world and conduct extensive experiments on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, which lowers the performance of the policies by as much as 34%, and show that, despite achieving similar average performance across the tasks, different VLA policies have unique vulnerabilities and relatively low agreement on success scenarios. We further show that our clutter measure is an effective indicator of performance degradation and analyze the impact of distractors in terms of their quantity and occluding influence. Finally, we show that finetuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.

[223] MARVO: Marine-Adaptive Radiance-aware Visual Odometry cs.RO | cs.CV

Sacchin Sundar, Atman Kikani, Aaliya Alam, Sumukh Shrote, A. Nayeemulla Khan

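TL;DR: MARVO is a visual odometry framework designed for underwater environments that combines physics-aware modeling, differentiable matching, and reinforcement-learning optimization to improve localization accuracy in turbid water.

Details

Motivation: Underwater visual localization is challenging because of wavelength-dependent attenuation, poor texture, and non-Gaussian sensor noise, so a robust method adapted to underwater conditions is needed.

Result: MARVO is reported to achieve high localization accuracy underwater and to overcome the limitations of traditional methods in turbid water.

Insight: Fusing physical modeling with learned components can significantly improve the robustness and accuracy of underwater visual odometry.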

Abstract: Underwater visual localization remains challenging due to wavelength-dependent attenuation, poor texture, and non-Gaussian sensor noise. We introduce MARVO, a physics-aware, learning-integrated odometry framework that fuses underwater image formation modeling, differentiable matching, and reinforcement-learning optimization. At the front-end, we extend a transformer-based feature matcher with a Physics Aware Radiance Adapter that compensates for color channel attenuation and contrast loss, yielding geometrically consistent feature correspondences under turbidity. These semi-dense matches are combined with inertial and pressure measurements inside a factor-graph backend, where we formulate a keyframe-based visual-inertial-barometric estimator using the GTSAM library. Each keyframe introduces (i) pre-integrated IMU motion factors, (ii) MARVO-derived visual pose factors, and (iii) barometric depth priors, giving a full-state MAP estimate in real time. Lastly, we introduce a Reinforcement-Learning-based Pose-Graph Optimizer that refines global trajectories beyond local minima of classical least-squares solvers by learning optimal retraction actions on SE(2).

[224] Obstruction reasoning for robotic grasping cs.RO | cs.AI | cs.CV

Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli

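TL;DR: The paper proposes UNOGrasp, a learning-based vision-language model for obstruction reasoning in robotic grasping, trained with a combination of supervised and reinforcement finetuning. Using the newly constructed large-scale dataset UNOBench, experiments show that it significantly improves obstruction reasoning and grasp success.

Details

Motivation: Current vision-language models remain limited in obstruction reasoning and accessibility planning. The authors propose a new method to improve robotic grasping in cluttered environments.

Result: UNOGrasp significantly improves obstruction reasoning and grasp success in both synthetic and real-world environments, outperforming generalist and proprietary alternatives.

Insight: Obstruction-aware visual reasoning and large-scale datasets are key to improving robot performance on complex tasks.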

Abstract: Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originated by the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.

cs.CY

[225] Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs cs.CY | cs.AI | cs.CL | cs.CR

Andrew Maranhão Ventura D’addario

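TL;DR: The paper presents Medical Malice, a dataset of 214,219 adversarial prompts for improving the safety of large language models (LLMs) in healthcare, calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS). The dataset includes the reasoning behind each violation, helping models learn ethical boundaries rather than memorize a fixed set of refusals.

Details

Motivation: Current alignment techniques for LLMs rely on generic definitions of harm and fail to capture context-dependent violations such as administrative fraud and clinical discrimination. This paper addresses that gap by building an adversarial dataset tailored to a specific healthcare setting, improving context-aware safety.

Result: The work yields a large-scale, high-fidelity adversarial dataset that covers seven taxonomies of medical violations and supports context-aware learning.

Insight: 1. AI safety in healthcare requires context-dependent ethical rules, not generic definitions. 2. Releasing the dataset's "vulnerability signatures" helps correct the information asymmetry between malicious actors and developers. 3. Context-aware safety is key to the successful deployment of medical AI.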

Abstract: The integration of Large Language Models (LLMs) into healthcare demands a safety paradigm rooted in primum non nocere. However, current alignment techniques rely on generic definitions of harm that fail to capture context-dependent violations, such as administrative fraud and clinical discrimination. To address this, we introduce Medical Malice: a dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS). Crucially, the dataset includes the reasoning behind each violation, enabling models to internalize ethical boundaries rather than merely memorizing a fixed set of refusals. Using an unaligned agent (Grok-4) within a persona-driven pipeline, we synthesized high-fidelity threats across seven taxonomies, ranging from procurement manipulation and queue-jumping to obstetric violence. We discuss the ethical design of releasing these “vulnerability signatures” to correct the information asymmetry between malicious actors and AI developers. Ultimately, this work advocates for a shift from universal to context-aware safety, providing the necessary resources to immunize healthcare AI against the nuanced, systemic threats inherent to high-stakes medical environments – vulnerabilities that represent the paramount risk to patient safety and the successful integration of AI in healthcare systems.

cs.LG

[226] SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning cs.LG | cs.AI | cs.CL

Hugo Hazard, Zafeirios Fountas, Martin A. Benfeghoul, Adnan Oomerjee, Jun Wang

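TL;DR: The paper proposes SuRe, a surprise-prioritised replay method, combined with a dual-learner design, for continual learning in large language models (LLMs). It addresses the selection and integration problems of replay and substantially improves performance in settings with many tasks.

Details

Motivation: The core challenge of continual learning is avoiding catastrophic forgetting. For large language models, conventional replay lags behind multi-task learning, especially with many tasks. The paper aims to improve continual learning through better selection and integration.

Result: SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting, with gains of up to 5 accuracy points over prior SOTA when combined with the dual learner, and remains robust with small buffers and reduced replay frequency.

Insight: Surprise-based selection and slow-weight consolidation are complementary strategies for mitigating catastrophic forgetting and work best together.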

Abstract: Continual learning, one’s ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address integration, we add a dual-learner design with fast and slow LoRA adapters merged via an exponential moving average (EMA), enabling rapid adaptation while stabilising long-term knowledge. Combining SuRe with the dual learner yields further gains, including improvements of up to +5 accuracy points on LNT over prior SOTA. Ablation studies confirm that our proposed method remains robust under reduced replay frequency and small buffer size, demonstrating both effectiveness and sample efficiency. Taken together, our results establish replay as a strong baseline for continual LLM fine-tuning and demonstrate that surprise-based selection and slow-weight consolidation are complementary components for mitigating catastrophic forgetting.
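
The two mechanisms described above, keeping the highest-NLL sequences in a bounded buffer and merging fast weights into slow weights with an exponential moving average, can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the class `SurpriseBuffer`, the function `ema_merge`, the `decay` value, and the use of plain floats in place of LoRA adapter tensors are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


class SurpriseBuffer:
    """Bounded replay buffer that keeps the sequences with the highest NLL."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._heap: List[Tuple[float, int, str]] = []  # min-heap keyed on NLL
        self._counter = 0                              # tie-breaker for equal NLLs

    def add(self, sequence: str, nll: float) -> None:
        item = (nll, self._counter, sequence)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif nll > self._heap[0][0]:
            # More surprising than the least surprising stored item: swap it in.
            heapq.heapreplace(self._heap, item)

    def items(self) -> List[str]:
        """Stored sequences, most surprising first."""
        return [seq for _, _, seq in sorted(self._heap, reverse=True)]


def ema_merge(slow: Dict[str, float], fast: Dict[str, float], decay: float = 0.99) -> Dict[str, float]:
    """Move slow weights a small step toward fast weights (EMA update)."""
    return {k: decay * slow[k] + (1.0 - decay) * fast[k] for k in slow}


if __name__ == "__main__":
    buf = SurpriseBuffer(capacity=2)
    for seq, nll in [("a", 1.0), ("b", 3.0), ("c", 2.0), ("d", 0.5)]:
        buf.add(seq, nll)
    print(buf.items())
    print(ema_merge({"w": 1.0}, {"w": 0.0}, decay=0.9))
```

The min-heap keeps insertion at O(log k) for a buffer of size k, which matters when every training sequence is a candidate for storage.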

[227] Intelligent Neural Networks: From Layered Architectures to Graph-Organized Intelligence cs.LG | cs.CL | cs.NE

Antoine Salomon

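TL;DR: This paper proposes Intelligent Neural Networks (INN), in which neurons are first-class entities with internal memory and learned communication patterns, organized in graphs rather than conventional layers. Using selective state-space dynamics and attention-based routing, INN computes through graph-structured interactions and significantly outperforms a comparable Transformer on the Text8 benchmark while matching an LSTM baseline.

Details

Motivation: Inspired by the behavior of biological neurons, which maintain internal state and communicate selectively, the author explores a non-layered, graph-organized neural network to improve computational effectiveness and training stability.

Result: On Text8 character modeling, INN reaches 1.705 BPC, outperforming a comparable Transformer (2.055 BPC) and matching a highly optimized LSTM baseline. A parameter-matched stack of Mamba blocks fails to converge (>3.4 BPC) under the same training protocol.

Insight: Graph organization provides training stability and effective computation, and neuron-centric design may open new directions for modular, interpretable, and scalable architectures.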

Abstract: Biological neurons exhibit remarkable intelligence: they maintain internal states, communicate selectively with other neurons, and self-organize into complex graphs rather than rigid hierarchical layers. What if artificial intelligence could emerge from similarly intelligent computational units? We introduce Intelligent Neural Networks (INN), a paradigm shift where neurons are first-class entities with internal memory and learned communication patterns, organized in complete graphs rather than sequential layers. Each Intelligent Neuron combines selective state-space dynamics (knowing when to activate) with attention-based routing (knowing to whom to send signals), enabling emergent computation through graph-structured interactions. On the standard Text8 character modeling benchmark, INN achieves 1.705 Bit-Per-Character (BPC), significantly outperforming a comparable Transformer (2.055 BPC) and matching a highly optimized LSTM baseline. Crucially, a parameter-matched baseline of stacked Mamba blocks fails to converge (>3.4 BPC) under the same training protocol, demonstrating that INN’s graph topology provides essential training stability. Ablation studies confirm this: removing inter-neuron communication degrades performance or leads to instability, proving the value of learned neural routing. This work demonstrates that neuron-centric design with graph organization is not merely bio-inspired – it is computationally effective, opening new directions for modular, interpretable, and scalable neural architectures.

[228] Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla cs.LG | cs.CL

Ariful Islam, Tanvir Mahmud, Md Rifat Hossen

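TL;DR: The paper proposes BangACMM, a Transformer-based triple fusion framework for multimodal author intent classification in low-resource Bangla, with an intermediate fusion strategy that significantly improves performance.

Details

Motivation: The growth of the Internet and social networks has produced an explosion of user-generated content, and understanding author intent is crucial for interpreting social media posts. Earlier unimodal approaches are limited, so textual and visual data are combined to improve classification.

Result: Intermediate fusion, particularly with mBERT and Swin Transformer, performs best, reaching a macro-F1 score of 84.11%, an 8.4 percentage-point improvement over prior Bangla multimodal approaches.

Insight: Visual context substantially improves intent classification, and intermediate fusion offers the best balance between modality-specific representation and cross-modal learning.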

Abstract: The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing limitations in previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet), utilizing the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly with mBERT and Swin Transformer, achieves 84.11% macro-F1 score, establishing a new state-of-the-art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification. Cross-modal feature integration at intermediate levels provides optimal balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).
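
Since the headline number is a macro-F1 score, a short Python sketch of how that metric is computed may help: F1 is calculated per class and then averaged with equal weight, so rare intent categories count as much as frequent ones. The function name `macro_f1` and the toy labels are illustrative only and are not taken from the Uddessho dataset.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels: List[str] = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    preds = ["a", "b", "b", "b"]
    print(round(macro_f1(truth, preds), 4))
```

In the toy example, class "a" has F1 = 2/3 and class "b" has F1 = 4/5, so the macro average is 11/15.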

[229] ThetaEvolve: Test-time Learning on Open Problems cs.LG | cs.CL

Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren

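TL;DR: ThetaEvolve is an open-source framework that combines in-context learning and reinforcement learning (RL) to keep learning at test time and improve on open optimization problems. Using a single large language model (LLM) and a large program database, it is the first such framework to let a small open-source model achieve new best-known bounds on open problems.

Details

Motivation: Existing systems such as AlphaEvolve rely on ensembles of closed frontier models and cannot internalize the evolving strategies. ThetaEvolve aims to simplify and extend such a system by adding continual learning at test time.

Result: ThetaEvolve achieves new best-known bounds on open problems (circle packing and the first auto-correlation inequality), and RL-trained checkpoints perform better on both the target task and other unseen tasks.

Insight: Test-time learning combined with RL is an effective way to improve model adaptability and performance; small open-source models can also make breakthroughs on complex tasks.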

Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system, so models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals, among other components. ThetaEvolve is the first evolving framework that enables a small open-source model, such as DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and the first auto-correlation inequality) mentioned in AlphaEvolve. In addition, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and that the model indeed learns evolving capabilities: the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: https://github.com/ypwang61/ThetaEvolve

[230] Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium cs.LG | cs.CV

Akbar Anbar Jafari, Gholamreza Anbarjafari

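TL;DR: The paper proposes a closed-loop prediction principle that addresses the open-loop bottleneck of conventional autoregressive Transformers by iteratively refining latent representations. It introduces Equilibrium Transformers (EqT), built on a learned energy function, and proves that they perform approximate MAP inference and converge.

Details

Motivation: The open-loop design of conventional autoregressive Transformers lets errors propagate through the sequence without correction, which limits long-range reasoning, factual consistency, and multi-step planning.

Result: On the binary parity task, EqT improves by 3.28% on average on challenging sequences, with gains reaching 8.07% where standard Transformers approach random performance.

Insight: A closed-loop equilibrium mechanism may be key to resolving the open-loop bottleneck of autoregression, offering a new direction for language models.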

Abstract: Contemporary autoregressive transformers operate in open loop: each hidden state is computed in a single forward pass and never revised, causing errors to propagate uncorrected through the sequence. We identify this open-loop bottleneck as a fundamental architectural limitation underlying well-documented failures in long-range reasoning, factual consistency, and multi-step planning. To address this limitation, we introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium before committing to each token. We instantiate this principle as Equilibrium Transformers (EqT), which augment standard transformer layers with an Equilibrium Refinement Module that minimizes a learned energy function via gradient descent in latent space. The energy function enforces bidirectional prediction consistency, episodic memory coherence, and output confidence, all computed without external supervision. Theoretically, we prove that EqT performs approximate MAP inference in a latent energy-based model, establish linear convergence guarantees, and show that refinement improves predictions precisely on hard instances where one-shot inference is suboptimal. The framework unifies deep equilibrium models, diffusion language models, and test-time training as special cases. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance, validating that the benefit of deliberation scales with task difficulty. Just as attention mechanisms resolved the sequential bottleneck of recurrent networks, we propose that closed-loop equilibrium may resolve the commitment bottleneck of open-loop autoregression, representing a foundational step toward language models.
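The refinement step described above, gradient descent on an energy function in latent space until the state stops changing, can be illustrated with a toy scalar example. This is only a sketch under strong assumptions: the energy here is a hand-written quadratic rather than the learned energy of EqT, and `refine_to_equilibrium`, `toy_energy_grad`, and all constants are hypothetical.

```python
def refine_to_equilibrium(z0, energy_grad, step_size=0.1, tol=1e-6, max_steps=1000):
    """Run gradient descent on a latent scalar until the energy gradient vanishes.

    Returns the refined latent and the number of update steps taken.
    """
    z = z0
    for step in range(max_steps):
        g = energy_grad(z)
        if abs(g) < tol:  # self-consistent: no further change is requested
            return z, step
        z -= step_size * g
    return z, max_steps


def toy_energy_grad(z, a=0.0, b=2.0, lam=1.0):
    """Gradient of the toy energy E(z) = 0.5*(z - a)**2 + 0.5*lam*(z - b)**2.

    The two quadratic terms stand in for two competing consistency constraints.
    """
    return (z - a) + lam * (z - b)


z_star, steps = refine_to_equilibrium(0.0, toy_energy_grad)
print(z_star, steps)
```

For this quadratic energy each update shrinks the distance to the minimizer (a + lam*b) / (1 + lam) by a constant factor, which is the kind of linear convergence the abstract refers to; nothing here says how the actual Equilibrium Refinement Module is parameterized.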

[231] Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage cs.LG | cs.CV

Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei

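TL;DR: The paper improves text-to-image generation by learning instance-level sampling schedules (conditioned on prompt and noise). It uses a new reward baseline based on a James-Stein estimator to reduce gradient estimation error, which significantly improves generation quality and text alignment.

Details

Motivation: Conventional post-training methods focus on fine-tuning or distilling model weights. This paper takes a different route and reschedules the sampling timeline of a frozen sampler to improve generation quality.

Result: Experiments show consistent improvements in text-image alignment and generation quality; a 5-step Flux-Dev sampler with the learned schedules attains quality comparable to distilled samplers such as Flux-Schnell.

Insight: Sampling schedules are an underused post-training lever that can unlock additional generative potential in pretrained samplers.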

Abstract: Most post-training methods for text-to-image samplers focus on model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior performance. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.
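To make the idea of a shrinkage-based reward baseline concrete, the sketch below applies the textbook positive-part James-Stein estimator, shrinking per-instance mean rewards toward their grand mean. The abstract does not give the paper's exact estimator, so this is an assumed stand-in: equal, known noise variance is assumed, and `james_stein_baselines`, `advantages`, and the sample numbers are hypothetical.

```python
def james_stein_baselines(mean_rewards, noise_var):
    """Shrink per-instance mean rewards toward the grand mean (positive-part form).

    Assumes each mean has the same known noise variance `noise_var`.
    With three or fewer instances the raw means are returned unshrunk.
    """
    p = len(mean_rewards)
    if p == 0:
        raise ValueError("need at least one instance")
    grand_mean = sum(mean_rewards) / p
    spread = sum((r - grand_mean) ** 2 for r in mean_rewards)
    if p <= 3 or spread == 0.0:
        return list(mean_rewards)
    shrink = max(0.0, 1.0 - (p - 3) * noise_var / spread)
    return [grand_mean + shrink * (r - grand_mean) for r in mean_rewards]


def advantages(rewards, baselines):
    """REINFORCE advantage: reward minus the baseline for the same instance."""
    return [r - b for r, b in zip(rewards, baselines)]


means = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical per-prompt mean rewards
baselines = james_stein_baselines(means, noise_var=1.0)
print(baselines)
print(advantages([1.5, 2.0, 2.0, 4.5, 5.0], baselines))
```

With these numbers the shrinkage factor is 1 - (5 - 3) * 1 / 10 = 0.8, so each mean moves 20% of the way toward the grand mean of 3. Noisier reward estimates (larger `noise_var`) are pulled in more strongly.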

[232] Adversarial Flow Models cs.LG | cs.CV

Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

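TL;DR: The paper proposes adversarial flow models, a class of generative models that combines adversarial models and flow models, supports one-step or multi-step generation, and is trained with an adversarial objective. Compared with traditional GANs and consistency-based methods, it is more stable and efficient.

Details

Motivation: Traditional GANs and flow models each have strengths and weaknesses. The authors aim to combine the strengths of both into a more efficient and more stable generative model.

Result: On ImageNet-256px in the 1NFE setting, the B/2 model approaches the performance of consistency-based XL/2 models, and the XL/2 model sets a new best FID of 2.38. The 56-layer and 112-layer models reach lower FIDs of 2.08 and 1.94.

Insight: Combining adversarial training with the deterministic mapping of flow models can significantly improve performance and training efficiency while saving model capacity and avoiding error accumulation.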

Abstract: We present adversarial flow models, a class of generative models that unifies adversarial models and flow models. Our method supports native one-step or multi-step generation and is trained using the adversarial objective. Unlike traditional GANs, where the generator learns an arbitrary transport plan between the noise and the data distributions, our generator learns a deterministic noise-to-data mapping, which is the same optimal transport as in flow-matching models. This significantly stabilizes adversarial training. Also, unlike consistency-based methods, our model directly learns one-step or few-step generation without needing to learn the intermediate timesteps of the probability flow for propagation. This saves model capacity, reduces training iterations, and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model creates a new best FID of 2.38. We additionally show the possibility of end-to-end training of 56-layer and 112-layer models through depth repetition without any intermediate supervision, and achieve FIDs of 2.08 and 1.94 using a single forward pass, surpassing their 2NFE and 4NFE counterparts.

[233] Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation cs.LG | cs.CV

Jiacheng Li, Songhe Feng

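TL;DR: The paper proposes BriMPR, a multimodal test-time adaptation (MMTTA) framework that uses progressive re-alignment to address distribution shift and semantic misalignment in multimodal scenarios.

Details

Motivation: In multimodal scenarios, modalities shift by different degrees. This produces a coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, which hinders extending existing TTA methods.

Result: The method shows superior performance on a range of MMTTA tasks, including corruption-based and real-world domain shift benchmarks.

Insight: Resolving the complex coupling effect in stages with a progressive strategy is an effective way to align multimodal distributions.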

Abstract: Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data, aiming to bridge the gap between source and target distributions. However, in multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, posing a major obstacle to extending existing TTA methods to the multimodal field. To address this challenge, we propose a novel multimodal test-time adaptation (MMTTA) framework, termed Bridging Modalities via Progressive Re-alignment (BriMPR). BriMPR, consisting of two progressively enhanced modules, tackles the coupling effect with a divide-and-conquer strategy. Specifically, we first decompose MMTTA into multiple unimodal feature alignment sub-problems. By leveraging the strong function approximation ability of prompt tuning, we calibrate the unimodal global feature distributions to their respective source distributions, so as to achieve the initial semantic re-alignment across modalities. Subsequently, we assign credible pseudo-labels to combinations of masked and complete modalities, and introduce inter-modal instance-wise contrastive learning to further enhance the information interaction among modalities and refine the alignment. Extensive experiments on MMTTA tasks, including both corruption-based and real-world domain shift benchmarks, demonstrate the superiority of our method. Our source code is publicly available.
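Inter-modal instance-wise contrastive learning is commonly implemented with an InfoNCE-style loss, where each sample's embedding in one modality should be closer to its own counterpart in the other modality than to any other sample's. The sketch below shows that generic loss in plain Python; whether BriMPR uses exactly this form is an assumption, and `info_nce`, the temperature value, and the toy embeddings are hypothetical. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the matching pair for anchors[i]."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


modality_a = [[1.0, 0.0], [0.0, 1.0]]  # e.g. two samples from one modality
modality_b = [[0.9, 0.1], [0.1, 0.9]]  # matching samples from another modality
print(info_nce(modality_a, modality_b))        # small: pairs are aligned
print(info_nce(modality_a, modality_b[::-1]))  # large: pairs are swapped
```

Minimizing this loss pulls matching cross-modal pairs together and pushes mismatched pairs apart, which is the sense in which it refines alignment between modalities.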

[234] Physics-Informed Neural Networks for Thermophysical Property Retrieval cs.LG | cs.AI | cs.CE | cs.CV

Ali Waseem, Malcolm Mielle

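TL;DR: The paper proposes an iterative framework based on physics-informed neural networks (PINNs) to estimate a wall's thermal conductivity k from thermographic data. The method alternates between solving the forward heat problem and optimizing k, showing potential for non-invasive thermophysical property estimation under real-world conditions.

Details

Motivation: Current non-invasive measurement methods are error-prone when the environment varies or the theoretical assumptions do not hold, while conventional methods are either invasive or require lengthy observation. The paper uses PINNs to address this and enable efficient, accurate estimation of thermophysical properties.

Result: Experiments show that the method accurately predicts k across different environmental conditions; even when the steady-state assumption is violated, the maximum mean absolute error (MAE) is 4.0851.

Insight: The paper shows the potential of PINNs for non-invasive thermophysical property estimation in situ and provides a starting point for further research on using machine learning to solve in-situ inverse problems.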

Abstract: Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have wide-ranging uses, but a critical application lies in quantifying how building facade renovation reduces thermal transmittance, a key determinant of building energy efficiency. However, solving inverse heat problems with non-invasive data collected in situ is error-prone due to environmental variability or deviations from theoretically assumed conditions. Hence, current methods for measuring thermal conductivity are either invasive, require lengthy observation periods, or are sensitive to environmental and experimental conditions. Here, we present a PINN-based iterative framework to estimate the thermal conductivity k of a wall from a set of thermographs; our framework alternates between estimating the forward heat problem with a PINN for a fixed k, and optimizing k by comparing the thermographs and surface temperatures predicted by the PINN, repeating until the estimate of k converges. Using both environmental data captured by a weather station and data generated from Finite-Volume-Method software simulations, we accurately predict k across different environmental conditions and data collection sampling times, provided that the temperature profile of the wall at dawn is close to steady state. Although violating the steady-state assumption reduces the accuracy of the estimate of k, we show that our proposed framework still only exhibits a maximum MAE of 4.0851. Our work demonstrates the potential of PINN-based methods for reliable estimation of material properties in situ and under realistic conditions, without lengthy measurement campaigns. Given the lack of research on using machine learning, and more specifically on PINNs, for solving in-situ inverse problems, we expect our work to be a starting point for more research on the topic.
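The inverse problem above, recovering k by repeatedly comparing predicted and observed surface temperatures, can be illustrated on a much simpler case. The sketch below swaps the PINN forward solver for a closed-form steady-state model of a wall with a fixed inner-surface temperature and convective loss on the outside, and swaps the paper's optimizer for bisection. It is not the authors' method; `surface_temperature`, `estimate_k`, and every numeric value are hypothetical.

```python
def surface_temperature(k, t_in, t_out, thickness, h_out):
    """Steady-state outer-surface temperature of a plane wall.

    Balances conduction through the wall, k * (t_in - t_s) / thickness,
    against convective loss to the outside air, h_out * (t_s - t_out).
    """
    conductance = k / thickness
    return (conductance * t_in + h_out * t_out) / (conductance + h_out)


def estimate_k(observed_ts, t_in, t_out, thickness, h_out,
               k_lo=1e-3, k_hi=10.0, tol=1e-9):
    """Find the k whose predicted surface temperature matches the observation.

    Uses bisection, which works because the predicted temperature rises
    monotonically with k whenever t_in > t_out.
    """
    while k_hi - k_lo > tol:
        k_mid = 0.5 * (k_lo + k_hi)
        predicted = surface_temperature(k_mid, t_in, t_out, thickness, h_out)
        if predicted < observed_ts:
            k_lo = k_mid  # wall conducts too little heat: raise k
        else:
            k_hi = k_mid
    return 0.5 * (k_lo + k_hi)


# Synthetic check: generate an observation with a known k, then recover it.
true_k = 0.8
obs = surface_temperature(true_k, t_in=20.0, t_out=0.0, thickness=0.2, h_out=25.0)
print(estimate_k(obs, t_in=20.0, t_out=0.0, thickness=0.2, h_out=25.0))
```

The structure mirrors the alternation in the abstract, with a forward prediction for a fixed k followed by an update of k from the mismatch, but the real setting is transient and uses thermographs rather than a single steady-state temperature.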

Table of Contents

cs.CV [Back]

[1] SO-Bench: A Structural Output Evaluation of Multimodal LLMs cs.CV | cs.AI | cs.CL | cs.RO

[2] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training cs.CV | cs.LG | stat.ML

[3] UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation cs.CV

[4] PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images cs.CV | cs.AI

[5] Interpretable Multimodal Cancer Prototyping with Whole Slide Images and Incompletely Paired Genomics cs.CV

[6] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views cs.CV

[7] TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video cs.CV

[8] WalkCLIP: Multimodal Learning for Urban Walkability Prediction cs.CV | cs.AI | cs.LG

[9] PAT3D: Physics-Augmented Text-to-3D Scene Generation cs.CV

[10] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models cs.CV | cs.AI

[11] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation cs.CV

[12] Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? cs.CV

[13] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation cs.CV

[14] MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis cs.CV | cs.AI

[15] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models cs.CV

[16] Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation cs.CV

[17] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model cs.CV

[18] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution cs.CV | cs.AI

[19] TPCNet: Triple physical constraints for Low-light Image Enhancement cs.CV | physics.optics

[20] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model cs.CV | cs.MM

[21] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation cs.CV

[22] MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding cs.CV

[23] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization cs.CV

[24] GoPrune: Accelerated Structured Pruning with $\ell_{2,p}$-Norm Optimization cs.CV

[25] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation cs.CV

[26] GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models cs.CV

[27] DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action cs.CV | cs.RO

[28] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation cs.CV

[29] IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer cs.CV | cs.AI

[30] Partially Shared Concept Bottleneck Models cs.CV

[31] Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning cs.CV

[32] Enhanced Graph Convolutional Network with Chebyshev Spectral Graph and Graph Attention for Autism Spectrum Disorder Classification cs.CV | cs.AI

[33] MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction cs.CV | cs.AI | cs.RO

[34] Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation cs.CV

[35] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving cs.CV | cs.RO

[36] ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition cs.CV | cs.AI

[37] Controllable 3D Object Generation with Single Image Prompt cs.CV

[38] 3D-Consistent Multi-View Editing by Diffusion Guidance cs.CV | cs.AI | cs.LG

[39] From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation cs.CV | cs.AI | cs.CL

[40] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation cs.CV

[41] DriveVGGT: Visual Geometry Transformer for Autonomous Driving cs.CV

[42] The Collapse of Patches cs.CV

[43] Match-and-Fuse: Consistent Generation from Unstructured Image Sets cs.CV

[44] Structure is Supervision: Multiview Masked Autoencoders for Radiology cs.CV | cs.LG

[45] Small Object Detection for Birds with Swin Transformer cs.CV

[46] Prompt-based Consistent Video Colorization cs.CV | cs.AI

[47] Unexplored flaws in multiple-choice VQA evaluations cs.CV | cs.LG

[48] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment cs.CV

[49] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts cs.CV

[50] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows cs.CV

[51] Asking like Socrates: Socrates helps VLMs understand remote sensing images cs.CV | cs.AI

[52] UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data cs.CV

[53] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention cs.CV

[54] Wukong’s 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models cs.CV

[55] Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation cs.CV

[56] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition cs.CV

[57] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection cs.CV

[58] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition cs.CV

[59] Benchmarking machine learning models for multi-class state recognition in double quantum dot data cs.CV | cond-mat.mes-hall | cs.LG

[60] Beyond Real versus Fake Towards Intent-Aware Video Analysis cs.CV

[61] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models cs.CV

[62] Gaussians on Fire: High-Frequency Reconstruction of Flames cs.CV

[63] RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding cs.CV

[64] Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval cs.CV

[65] Rethinking Cross-Generator Image Forgery Detection through DINOv3 cs.CV

[66] AI killed the video star. Audio-driven diffusion model for expressive talking head generation cs.CV

[67] SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts cs.CV | cs.IR

[68] What Shape Is Optimal for Masks in Text Removal? cs.CV | cs.CL | cs.LG

[69] DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA cs.CV | cs.AI

[70] CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving cs.CV | cs.AI

[71] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration cs.CV

[72] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior cs.CV

[73] Bringing Your Portrait to 3D Presence cs.CV

[74] Text Condition Embedded Regression Network for Automated Dental Abutment Design cs.CV

[75] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization cs.CV | cs.AI

[76] HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models cs.CV | cs.AI

[77] AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection cs.CV

[78] GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing cs.CV | cs.AI | cs.HC | cs.LG