cs.CV [Total: 47]
cs.CL [Total: 42]
cs.DC [Total: 1]
cs.IR [Total: 1]
cs.LG [Total: 4]
cs.CR [Total: 1]
cs.CY [Total: 1]
cs.AI [Total: 4]
cs.SE [Total: 1]
eess.IV [Total: 2]
q-bio.QM [Total: 1]
eess.AS [Total: 2]
cs.RO [Total: 3]

cs.CV

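[1] Exploring OCR-augmented Generation for Bilingual VQA cs.CV

JoonHo Lee, Sunho Park

TL;DR: The paper studies OCR-augmented generation for bilingual VQA, introduces the KLOCR model, and releases the KOCRBench dataset; experiments show that OCR-extracted text significantly improves model performance.

Details

Motivation: To explore how OCR capability can be integrated into vision-language models (VLMs) to support multilingual VQA, particularly the bilingual Korean and English setting.

Result: Experiments show that OCR-extracted text significantly improves bilingual VQA performance in Korean and English, across both open-source and commercial models.

Insight: OCR-augmented text effectively improves VLM performance on multilingual VQA and points to new directions for multilingual OCR and VQA research.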

Abstract: We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.

Derek Shi, Ruben Glatt, Christine Klymko, Shubham Mohole, Hongjun Choi

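TL;DR: Oracle-RLAIF is an improved fine-tuning framework for multimodal video models that replaces the conventional trained reward model with a rank-based Oracle ranker for reinforcement learning from AI feedback (RLAIF), lowering cost and improving performance.

Details

Motivation: As large video-language models (VLMs) grow in parameter count, the cost of collecting human feedback rises sharply. Existing RLAIF frameworks depend on training a specialized reward model, which is expensive and restrictive.

Result: Oracle-RLAIF outperforms existing fine-tuning methods across multiple video comprehension benchmarks.

Insight: Rank-based feedback can align large multimodal video models more flexibly and efficiently, reducing reliance on an expensive reward model.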

Abstract: Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replaces human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards – an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker which acts as a drop-in model ranking candidate model responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.
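
The abstract above does not spell out how $GRPO_{rank}$ turns ordinal feedback into advantages. As a minimal sketch of one possible reading, the snippet below converts a group of oracle ranks into group-normalized advantages in the usual GRPO style. The function name `rank_advantages` and the linear rank-to-score mapping are assumptions for illustration, not the authors' loss.

```python
from statistics import mean, pstdev


def rank_advantages(ranks, eps=1e-8):
    """Map oracle ranks (1 = best) for one group of responses to advantages.

    Assumption: a linear rank-to-score mapping followed by GRPO-style
    group normalization (subtract the group mean, divide by the group std).
    """
    n = len(ranks)
    # Hypothetical scoring: the best-ranked response gets the highest raw score.
    scores = [float(n - r) for r in ranks]
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Four candidate responses; the second one was ranked best by the oracle.
advantages = rank_advantages([2, 1, 4, 3])
```

Only the ordering of the responses matters here, so the ranker never needs to produce calibrated scalar rewards.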

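[3] PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction cs.CV

Qiao Feng, Yiming Huang, Yufu Wang, Jiatao Gu, Lingjie Liu

TL;DR: PhysHMR proposes a unified framework that learns physics-based human motion reconstruction directly from monocular video; by combining visual features with physical constraints, it avoids the error accumulation of traditional two-stage methods.

Details

Motivation: Most existing methods rely on kinematics-based pose estimation followed by physics-based post-processing, which yields unrealistic results and accumulates errors. PhysHMR aims to directly learn a visual-to-action policy that reconstructs motion that is both physically plausible and aligned with the input.

Result: PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios and outperforms prior methods in both visual accuracy and physical realism.

Insight: Directly learning a visual-to-action policy avoids the error accumulation of traditional pipelines, and combining soft global grounding with knowledge distillation markedly improves the model's efficiency and effectiveness.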

Abstract: Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
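
The pixel-as-ray idea described above (lifting a 2D keypoint to a 3D ray and expressing it in global space) can be sketched with a standard pinhole camera model. This is an illustrative sketch only: the function name, the intrinsics `fx, fy, cx, cy`, and the camera-to-world rotation `R_c2w` are assumed inputs, and the paper's exact parameterization may differ.

```python
import math


def pixel_to_world_ray(u, v, fx, fy, cx, cy, R_c2w, cam_center):
    """Lift a 2D keypoint (u, v) to a ray (origin, unit direction) in world space.

    Assumes a pinhole camera looking along +z and a 3x3 camera-to-world
    rotation given as nested lists.
    """
    # Ray direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame: d_world = R_c2w @ d_cam.
    d_world = [sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return tuple(cam_center), tuple(c / norm for c in d_world)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A keypoint at the principal point maps to the optical axis.
origin, direction = pixel_to_world_ray(320, 240, 500.0, 500.0, 320, 240,
                                       identity, (0.0, 0.0, 0.0))
```

A ray constrains the keypoint's global position without committing to a depth, which is one way to read the abstract's remark about not depending on noisy 3D root predictions.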

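[4] Unlocking the power of partnership: How humans and machines can work together to improve face recognition cs.CV

P. Jonathon Phillips, Geraldine Jeckeln, Carina A. Hahn, Amy N. Yates, Peter C. Fontana

TL;DR: This paper examines human-machine collaboration in face recognition and proposes an intelligent fusion approach based on the Proximal Accuracy Rule (PAR), showing that in certain cases combining humans and machines significantly improves identification accuracy.

Details

Motivation: Humans and machines each have strengths and weaknesses in face recognition, but how to combine them effectively to improve overall accuracy remains unclear. The paper uses empirical data to establish the conditions under which human-machine collaboration works best and how much it helps.

Result: Intelligent human-machine fusion outperforms the machine alone and is more accurate than indiscriminately combining all human and machine judgments. A human-only system approaches the average performance of intelligent human-machine collaboration, but the latter better limits the negative impact of low-performing humans.

Insight: 1) The benefit of human-machine collaboration depends on the gap in baseline accuracy between the collaborators; 2) within the "critical fusion zone", even humans who are less accurate than the machine can significantly improve system performance; 3) intelligently selecting human participants is key to optimal human-machine collaboration.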

Abstract: Human review of consequential decisions by face recognition algorithms creates a “collaborative” human-machine system. Individual differences between people and machines, however, affect whether collaboration improves or degrades accuracy in any given case. We establish the circumstances under which combining human and machine face identification decisions improves accuracy. Using data from expert and non-expert face identifiers, we examined the benefits of human-human and human-machine collaborations. The benefits of collaboration increased as the difference in baseline accuracy between collaborators decreased, following the Proximal Accuracy Rule (PAR). This rule predicted collaborative (fusion) benefit across a wide range of baseline abilities, from people with no training to those with extensive training. Using the PAR, we established a critical fusion zone, where humans are less accurate than the machine, but fusing the two improves system accuracy. This zone was surprisingly large. We implemented “intelligent human-machine fusion” by selecting people with the potential to increase the accuracy of a high-performing machine. Intelligent fusion was more accurate than the machine operating alone and more accurate than combining all human and machine judgments. The highest system-wide accuracy achievable with human-only partnerships was found by graph theory. This fully human system approximated the average performance achieved by intelligent human-machine collaboration. However, intelligent human-machine collaboration more effectively minimized the impact of low-performing humans on system-wide accuracy. The results demonstrate a meaningful role for both humans and machines in assuring accurate face identification. This study offers an evidence-based road map for the intelligent use of AI in face identification.

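[5] How Confident are Video Models? Empowering Video Models to Express their Uncertainty cs.CV | cs.AI | cs.CL

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

TL;DR: This paper presents the first uncertainty quantification framework for generative video models, comprising a calibration metric based on robust rank correlation, a black-box uncertainty quantification method (S-QUBED), and a dataset for benchmarking.

Details

Motivation: Generative video models show strong text-to-video capabilities but also hallucinate, producing videos that look plausible yet are factually wrong. No uncertainty quantification method currently exists for video models, which raises safety concerns.

Result: Experiments show that S-QUBED yields calibrated total uncertainty estimates that are negatively correlated with task accuracy, and that it effectively decomposes uncertainty into epistemic and aleatoric components.

Insight: Uncertainty in video models can be decomposed by conditioning the generation task in the latent space; future work could combine this with calibration techniques to improve model safety.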

Abstract: Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.
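
The calibration metric above is described only as a robust rank correlation between uncertainty and accuracy. As a hedged stand-in, the sketch below uses Kendall's tau (the tau-a variant), a standard rank correlation; the paper's exact estimator is not given in the abstract, and the sample values are made up for illustration.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Illustrative values only: per-prompt uncertainty and task accuracy.
uncertainty = [0.10, 0.40, 0.35, 0.80]
accuracy = [0.90, 0.60, 0.70, 0.20]
tau = kendall_tau(uncertainty, accuracy)
```

A strongly negative value means higher uncertainty goes with lower accuracy, which is the behavior a well-calibrated estimate should show.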

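[6] Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig cs.CV

Patrick Rim, Kun He, Kevin Harris, Braden Copple, Shangchen Han

TL;DR: This paper presents a novel marker-less multi-camera system for accurate 3D hand pose tracking in the wild, combining a lightweight back-mounted rig with a Meta Quest 3 headset to provide high-accuracy ground truth and environmentally diverse data.

Details

Motivation: Existing datasets are mostly captured in controlled lab environments, which limits environmental diversity and model generalization. The paper develops a new capture system to address the challenges of in-the-wild 3D hand tracking.

Result: Experiments show that the system significantly reduces the trade-off between environmental realism and 3D annotation accuracy across diverse environments.

Insight: A multi-view system combining exocentric and egocentric cameras can significantly improve the accuracy and practicality of in-the-wild 3D hand tracking.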

Abstract: Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.

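[7] Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation cs.CV

Beijia Lu, Ziyi Chen, Jing Xiao, Jun-Yan Zhu

TL;DR: The paper proposes a real-time co-speech video generation method based on input-aware sparse attention and an input-aware distillation loss, significantly improving generation efficiency and quality.

Details

Motivation: Existing diffusion-based co-speech video generation methods are too computationally expensive for real-time use, and directly applying existing distillation methods degrades quality.

Result: The method achieves real-time performance while maintaining visual quality, outperforming recent audio-driven and input-driven methods.

Insight: Input-aware attention and loss design can significantly improve generation efficiency and quality, suggesting a new approach to real-time video generation.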

Abstract: Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker’s face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.

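[8] Deep Generative Continual Learning using Functional LoRA: FunLoRA cs.CV

Victor Enescu, Hichem Sahbi

TL;DR: The paper proposes FunLoRA, a novel conditioning mechanism based on low-rank adaptation (LoRA) for continual learning of deep generative models; it avoids catastrophic forgetting and uses dynamic conditioning to improve performance while reducing memory requirements and sampling time.

Details

Motivation: Deep generative models have broad potential in text and vision applications, but incremental training suffers from catastrophic forgetting; the conventional remedy relies on synthetic data and incurs ever-growing training time.

Result: Experiments show that FunLoRA, applied to flow-matching models, surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy at a fraction of the memory cost and sampling time.

Insight: FunLoRA shows that dynamic conditioning combined with parameter-efficient fine-tuning can avoid catastrophic forgetting in continual learning while maintaining high performance and low resource consumption.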

Abstract: Continual adaptation of deep generative models holds tremendous potential and critical importance, given their rapid and expanding usage in text and vision based applications. Incremental training, however, remains highly challenging due to the catastrophic forgetting phenomenon, which makes it difficult for neural networks to effectively incorporate new knowledge. A common strategy consists in retraining the generative model on its own synthetic data in order to mitigate forgetting. Yet, such an approach faces two major limitations: (i) the continually increasing training time eventually becomes intractable, and (ii) reliance on synthetic data inevitably leads to long-term performance degradation, since synthetic samples lack the richness of real training data. In this paper, we attenuate these issues by designing a novel and more expressive conditioning mechanism for generative models based on low rank adaptation (LoRA) that exclusively employs rank 1 matrices, whose reparametrized matrix rank is functionally increased using carefully selected functions; we dub it functional LoRA (FunLoRA). Using this dynamic conditioning, the generative model is guaranteed to avoid catastrophic forgetting and needs only to be trained on data from the current task. Extensive experiments using flow-matching based models trained from scratch showcase that our proposed parameter-efficient fine-tuning (PEFT) method surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy scores, while only requiring a fraction of the memory cost and sampling time.
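
The claim that a function can raise the rank of a rank-1 matrix can be checked on a tiny example. The sketch below assumes the function is applied elementwise and uses `sin` purely for illustration; the abstract does not say which functions FunLoRA selects or how they are applied.

```python
import math


def outer(u, v):
    """Rank-1 matrix u v^T as nested lists."""
    return [[a * b for b in v] for a in u]


def det2(m):
    """Determinant of a 2x2 matrix; zero means rank below 2."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


u, v = [1.0, 2.0], [1.0, 3.0]
base = outer(u, v)                                       # rank 1, so det is 0
lifted = [[math.sin(x) for x in row] for row in base]    # elementwise sin (assumed)

base_det = det2(base)
lifted_det = det2(lifted)
```

Here `base_det` is zero while `lifted_det` is not, so the nonlinearity yields a full-rank 2x2 matrix from only the two stored vectors `u` and `v`.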

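[9] Sequence-Preserving Dual-FoV Defense for Traffic Sign and Light Recognition in Autonomous Vehicles cs.CV

Abhishek Joshi, Jahnavi Krishna Koda, Abhishek Phadke

TL;DR: The paper proposes a dual field-of-view (FoV), sequence-preserving defense framework for traffic sign and light recognition in autonomous vehicles, using a unified three-layer defense stack (feature squeezing, defensive distillation, and entropy-based anomaly detection) to improve robustness to digital and natural perturbations.

Details

Motivation: Misrecognizing traffic signs and lights directly affects the safety and navigation of autonomous vehicles, yet current work does not jointly consider temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural perturbations.

Result: The Unified Defense Stack reaches 79.8% mAP and reduces the attack success rate (ASR) to 18.2%, outperforming YOLOv8, YOLOv9, and BEVFormer, while reducing the high-risk misclassification rate to 32%.

Insight: Temporal information is crucial for multi-FoV traffic sign and light recognition, and a unified defense strategy can effectively improve robustness to both digital and physical perturbations.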

Abstract: Traffic light and sign recognition are key for Autonomous Vehicles (AVs) because perception mistakes directly influence navigation and safety. In addition to digital adversarial attacks, models are vulnerable to existing perturbations (glare, rain, dirt, or graffiti), which could lead to dangerous misclassifications. The current work lacks consideration of temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural degradation. This study proposes a dual FoV, sequence-preserving robustness framework for traffic lights and signs in the USA based on a multi-source dataset built on aiMotive, Udacity, Waymo, and self-recorded videos from the region of Texas. Mid and long-term sequences of RGB images are temporally aligned for four operational design domains (ODDs): highway, night, rainy, and urban. Over a series of experiments on a real-life application of anomaly detection, this study outlines a unified three-layer defense stack framework that incorporates feature squeezing, defensive distillation, and entropy-based anomaly detection, as well as sequence-wise temporal voting for further enhancement. The evaluation measures included accuracy, attack success rate (ASR), risk-weighted misclassification severity, and confidence stability. Physical transferability was confirmed using probes for recapture. The results showed that the Unified Defense Stack achieved 79.8% mAP and reduced the ASR to 18.2%, which is superior to YOLOv8, YOLOv9, and BEVFormer, while reducing the high-risk misclassification to 32%.
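
Two of the components named above, entropy-based anomaly detection and sequence-wise temporal voting, can be sketched together. This is an illustrative sketch, not the authors' pipeline: the threshold `max_entropy`, the function names, and the sample probabilities are all assumptions.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_over_sequence(frame_probs, labels, max_entropy=0.9):
    """Majority vote over frames, skipping frames flagged as anomalous.

    A frame is treated as anomalous when its prediction entropy exceeds
    max_entropy (a hypothetical threshold chosen for this example).
    """
    votes = []
    for probs in frame_probs:
        if entropy(probs) > max_entropy:
            continue  # too uncertain: likely perturbed or degraded frame
        best = max(range(len(probs)), key=probs.__getitem__)
        votes.append(labels[best])
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


labels = ["red", "yellow", "green"]
sequence = [
    [0.90, 0.05, 0.05],  # confident red
    [0.85, 0.10, 0.05],  # confident red
    [0.34, 0.33, 0.33],  # near-uniform: rejected by the entropy check
    [0.10, 0.10, 0.80],  # confident green (a possible misread)
]
decision = vote_over_sequence(sequence, labels)
```

Voting over the sequence lets a single misread frame be outvoted by its neighbors, which is the motivation for preserving temporal order.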

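[10] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models cs.CV

Benjamin Yu, Jackie Liu, Justin Cui

TL;DR: Smart-GRPO proposes a reinforcement learning method for flow-matching models that optimizes noise perturbations through smart sampling, significantly improving reward optimization and visual quality.

Details

Motivation: Flow-matching models enable high-quality text-to-image generation, but their deterministic nature leaves them without the stochasticity that reinforcement learning needs, and prior random-noise perturbation is inefficient and unstable. Smart-GRPO aims to address this.

Result: Experiments show that Smart-GRPO outperforms baseline methods in both reward optimization and visual quality, demonstrating its practicality within flow-matching frameworks.

Insight: Smart-GRPO offers a practical path to combining flow-matching models with reinforcement learning, bridging the gap between efficient training and human-aligned generation.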

Abstract: Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.
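
The iterative search described above (sample candidate perturbations, score them with a reward, shift the noise distribution toward higher-reward regions) resembles a cross-entropy-method loop. The sketch below is a 1-D toy under that assumption: the real method decodes latents and scores images, whereas `refine_noise`, its parameters, and the quadratic toy reward are hypothetical.

```python
import math
import random


def refine_noise(reward_fn, mean=0.0, std=2.0, iters=15,
                 samples=64, keep=16, min_std=0.1, seed=0):
    """Refine a Gaussian noise distribution toward high-reward regions.

    Each round samples candidates, keeps the top `keep` by reward, and
    refits the mean and standard deviation to those elites.
    """
    rng = random.Random(seed)
    for _ in range(iters):
        candidates = [rng.gauss(mean, std) for _ in range(samples)]
        elites = sorted(candidates, key=reward_fn, reverse=True)[:keep]
        mean = sum(elites) / keep
        var = sum((x - mean) ** 2 for x in elites) / keep
        std = max(math.sqrt(var), min_std)  # floor keeps some exploration
    return mean, std


# Toy reward peaked at 3.0, standing in for a learned reward model.
final_mean, final_std = refine_noise(lambda x: -(x - 3.0) ** 2)
```

The standard-deviation floor is a design choice for this toy: without it the distribution can collapse before the mean reaches the high-reward region.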

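[11] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 3min cs.CV | cs.GR

Yibin Zhao, Yihan Pan, Jun Nan, Jianjun Yi

TL;DR: FSFSplatter is a fast Gaussian Splatting-based surface reconstruction method that efficiently reconstructs scenes from free sparse views, addressing existing methods' need for dense calibrated views and their poor surface quality under sparse views.

Details

Motivation: Conventional Gaussian Splatting methods need dense, calibrated views to reconstruct high-quality surfaces and synthesize novel views, while sparse views tend to degrade reconstruction quality and cause overfitting. FSFSplatter aims to provide fast, high-quality reconstruction under these conditions.

Result: FSFSplatter outperforms current state-of-the-art methods on the DTU and Replica datasets, achieving high-quality surface reconstruction and novel view synthesis.

Insight: High-quality reconstruction from sparse views is achievable through dense initialization and geometry-enhanced optimization, and combining a Transformer with multi-task supervision effectively improves results.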

Abstract: Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surface due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on widely used DTU and Replica.

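[12] MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context cs.CV

Junyu Shi, Yong Sun, Zhiyuan Zhang, Lijiang Liu, Zhengjie Zhang

TL;DR: MoGIC is a unified framework that improves text-based motion generation by introducing intention modeling and visual priors, enabling multimodal motion synthesis. It uses a mixture-of-attention mechanism to align conditional tokens with motion subsequences, and its advantages are validated on the large-scale Mo440H benchmark.

Details

Motivation: Existing text-driven motion generation methods typically treat the task as a bidirectional mapping between language and motion, but fail to capture the causal logic of action execution and human intention. The absence of visual grounding further limits precision and personalization.

Result: After fine-tuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods on motion captioning, and supports intention prediction and vision-conditioned generation.

Insight: Combining intention and visual priors significantly improves the precision and controllability of motion generation, and the mixture-of-attention mechanism is an effective tool for multimodal alignment.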

Abstract: Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC

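[13] From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting cs.CV

Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Shuqin Gao

TL;DR: The paper proposes a semantic-guided motion control framework for dynamic 3D Gaussian Splatting reconstruction that adaptively concentrates sparse control points in dynamic regions, improving reconstruction quality and efficiency.

Details

Motivation: Dynamic 3D reconstruction from monocular video is challenged by the ambiguity of inferring 3D motion and by high computational cost. Existing sparse control methods allocate control points purely by geometry, leading to redundancy in static regions and insufficiency in dynamic regions.

Result: Experiments show that the method significantly outperforms existing state-of-the-art methods in reconstruction quality and efficiency.

Insight: Semantic-guided, motion-adaptive control points and spline-based trajectory parameterization address control point allocation and motion representation in dynamic 3D reconstruction more efficiently.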

Abstract: Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity of inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.

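[14] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising cs.CV

Weimin Yuan, Cai Meng

TL;DR: Net2Net combines the strengths of untrained and pre-trained networks through regularization by denoising (RED) to tackle real-world noise removal, adapting to varied noise patterns without large amounts of labeled data.

Details

Motivation: Traditional denoising methods rely on handcrafted priors and struggle with the complexity and variability of real noise; deep learning approaches need large amounts of labeled data and generalize poorly. Net2Net aims to combine the strengths of an unsupervised untrained network and a pre-trained network to address these issues.

Result: Experiments on benchmark datasets demonstrate the method's superiority, particularly for real-world noise removal and scenarios with limited training data.

Insight: Net2Net demonstrates that untrained and pre-trained networks are complementary, offering a new approach to adaptive denoising without labeled data.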

Abstract: Traditional denoising methods have largely relied on handcrafted priors; they often perform well in controlled environments but struggle to address the complexity and variability of real noise. In contrast, deep learning-based approaches have gained prominence for learning noise characteristics from large datasets, but these methods frequently require extensive labeled data and may not generalize effectively across diverse noise types and imaging conditions. In this paper, we present an innovative method, termed Net2Net, that combines the strengths of untrained and pre-trained networks to tackle the challenges of real-world noise removal. The innovation of Net2Net lies in its combination of the unsupervised DIP and the supervised pre-trained model DRUNet via regularization by denoising (RED). The untrained network adapts to the unique noise characteristics of each input image without requiring labeled data, while the pre-trained network leverages learned representations from large-scale datasets to deliver robust denoising performance. This hybrid framework enhances generalization across varying noise patterns and improves performance, particularly in scenarios with limited training data. Extensive experiments on benchmark datasets demonstrate the superiority of our method for real-world noise removal.

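[15] Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval cs.CV

Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, Shiqi Wang

TL;DR: Retrv-R1 is a reasoning-driven multimodal retrieval framework that introduces an information compression module and a new training paradigm to address the high computational cost and unstable performance of applying RL to retrieval, achieving efficient, universal, high-performing multimodal retrieval.

Details

Motivation: Applying existing RL methods directly to multimodal retrieval incurs high computational cost and unstable performance, which limits their use in retrieval tasks. Retrv-R1 aims to address these issues and improve retrieval efficiency and performance.

Result: Retrv-R1 demonstrates SOTA performance, high efficiency, and strong generalization across multiple benchmarks and tasks.

Insight: Reasoning-driven multimodal retrieval can significantly improve performance and efficiency; the combination of information compression and the new training paradigm is key.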

Abstract: The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs’ reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.

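[16] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models cs.CV

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Jinlin Wu

TL;DR: BCA+ is a training-free, efficient test-time adaptation framework that uses Bayesian inference to dynamically update class embeddings, spatial scales, and adaptive priors, handling object recognition and detection in a unified way and significantly improving performance.

Details

Motivation: Existing test-time adaptation methods are either computationally expensive (relying on backpropagation) or focus only on the likelihood while overlooking the role of the prior, which limits their use in real-time deployment.

Result: BCA+ achieves state-of-the-art performance on multiple recognition and detection benchmarks.

Insight: Combining a dynamic cache with Bayesian inference effectively addresses distribution shift, and the training-free design gives it an advantage in real-time applications.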

Abstract: Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model’s semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
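
The fusion step described above (a cache-based prediction built from a likelihood and a prior, then fused with the initial VLM output under uncertainty guidance) can be sketched with Bayes' rule. The entropy-based weighting below is an assumption chosen for illustration; the abstract does not give the actual fusion formula, and all names and numbers are hypothetical.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def fuse(vlm_probs, likelihood, prior):
    """Fuse the initial VLM prediction with a cache-based Bayesian posterior."""
    # Bayes' rule: posterior is proportional to likelihood times prior.
    posterior = normalize([l * p for l, p in zip(likelihood, prior)])
    # Hypothetical uncertainty-guided weights: lower entropy, larger weight.
    w_vlm = math.exp(-entropy(vlm_probs))
    w_cache = math.exp(-entropy(posterior))
    mixed = [w_vlm * a + w_cache * b for a, b in zip(vlm_probs, posterior)]
    return normalize(mixed)


# The VLM is undecided between two classes; the cache likelihood and the
# class prior built from past predictions both lean toward class 0.
fused = fuse(vlm_probs=[0.5, 0.5], likelihood=[0.6, 0.4], prior=[0.8, 0.2])
```

No gradients are involved: every step is a closed-form update, which is what makes a training-free scheme of this kind cheap at inference time.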

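[17] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology cs.CV | cs.AI | cs.LG

Matthias Perkonigg, Patrick Rockenschaub, Georg Göbel, Adelheid Wöhrer

TL;DR: This paper proposes a new hierarchical generalized category discovery method (HGCD-BT) for brain tumor classification that combines hierarchical clustering with contrastive learning, significantly improving classification accuracy and performing well on multiple datasets.

Details

Motivation: Brain tumor classification is critical to intra-operative decision making in neuro-oncological surgery, but existing methods are limited to predefined classes and cannot recognize tumor types unseen during training. Generalized Category Discovery (GCD) can fill this gap but does not model hierarchical structure.

Result: On the OpenSRH dataset, HGCD-BT improves patch-level classification accuracy by 28% over state-of-the-art GCD methods, with particularly strong results on previously unseen tumor categories. It also generalizes well across imaging modalities.

Insight: Introducing hierarchical structure effectively improves GCD performance, especially on complex classification tasks. The method's applicability across medical imaging modalities suggests a new approach to recognizing unseen categories.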

Abstract: Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.

[18] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding cs.CVPDF

Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong

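TL;DR: AdaRD-Key is a training-free keyframe sampling method that combines query relevance with visual diversity for long-form video understanding and significantly improves performance.

Details

Motivation: Most long-video methods rely on uniform sampling or keyframe selection with fixed temporal spacing, which can miss important moments or keep redundant frames. Balancing query relevance against visual diversity is the key challenge.

Result: State-of-the-art performance on LongVideoBench and Video-MME, particularly on long-form videos.

Insight: Balancing query relevance and visual diversity is essential for long-form video understanding; a lightweight gating mechanism handles queries that are only weakly aligned with the video.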

Abstract: Understanding long-form videos remains a significant challenge for vision–language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.
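The relevance plus log-determinant objective described in the abstract can be illustrated with a small greedy selector. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the greedy strategy, the weight `lam`, the ridge term `eps`, and the gating rule (switch to diversity-only when the maximum relevance is within `gate` of the mean) are all hypothetical choices made for illustration.

```python
import math

def logdet(matrix):
    """Log-determinant of a small symmetric positive-definite matrix
    via Gaussian elimination (no pivoting is needed for SPD input)."""
    a = [row[:] for row in matrix]
    total = 0.0
    for i in range(len(a)):
        pivot = a[i][i]
        if pivot <= 0.0:
            return float("-inf")
        total += math.log(pivot)
        for r in range(i + 1, len(a)):
            factor = a[r][i] / pivot
            for c in range(i, len(a)):
                a[r][c] -= factor * a[i][c]
    return total

def select_keyframes(feats, relevance, k, lam=1.0, eps=1e-3, gate=0.05):
    """Greedily pick k frames maximizing sum(relevance) + lam * logdet(Gram)."""
    n = len(feats)
    gram = [[sum(x * y for x, y in zip(feats[i], feats[j])) for j in range(n)]
            for i in range(n)]
    # Hypothetical gate: a nearly flat relevance profile is treated as weak
    # query-video alignment, so the relevance term is dropped.
    use_relevance = max(relevance) - sum(relevance) / n >= gate
    chosen = []
    while len(chosen) < min(k, n):
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            sub = [[gram[p][q] + (eps if p == q else 0.0) for q in idx] for p in idx]
            score = lam * logdet(sub)
            if use_relevance:
                score += sum(relevance[j] for j in idx)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return sorted(chosen)
```

With two identical frames and one distinct frame, the log-determinant term penalizes the duplicate, so the selector keeps one copy plus the distinct frame even when the duplicate has higher relevance.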

[19] Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models cs.CV

Prahitha Movva

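TL;DR: Through explainability analysis, the paper exposes the cognitive limits of vision-language models (VLMs) on complex lateral-thinking puzzles such as rebus puzzles. It contributes a systematically annotated dataset and an evaluation framework, and shows how different prompting strategies affect reasoning quality and effectiveness.

Details

Motivation: Although VLMs excel at many multimodal tasks, their cognitive processes and failure modes on complex lateral-thinking challenges such as rebus puzzles remain unclear. The paper addresses this gap with an explainability analysis of how VLMs reason.

Result: Reasoning quality varies markedly across puzzle categories: models do well on visual composition but show fundamental limitations in absence interpretation and cultural symbolism. Prompting strategy substantially influences both cognitive approach and problem-solving effectiveness.

Insight: Explainability is an integral component of model performance rather than a post-hoc consideration. Prompt design directly affects a model's cognitive process and problem-solving ability.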

Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet their cognitive processes remain opaque on complex lateral thinking challenges like rebus puzzles. While recent work has demonstrated these models struggle significantly with rebus puzzle solving, the underlying reasoning processes and failure patterns remain largely unexplored. We address this gap through a comprehensive explainability analysis that moves beyond performance metrics to understand how VLMs approach these complex lateral thinking challenges. Our study contributes a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. We investigate three prompting strategies designed to elicit different types of explanatory processes and reveal critical insights into VLM cognitive processes. Our findings demonstrate that reasoning quality varies dramatically across puzzle categories, with models showing systematic strengths in visual composition while exhibiting fundamental limitations in absence interpretation and cultural symbolism. We also discover that prompting strategy substantially influences both cognitive approach and problem-solving effectiveness, establishing explainability as an integral component of model performance rather than a post-hoc consideration.

[20] OTR: Synthesizing Overlay Text Dataset for Text Removal cs.CV

Jan Zdenek, Wataru Shimoda, Kota Yamaguchi

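TL;DR: The paper introduces OTR, a new dataset for text removal that addresses three problems in existing datasets: ground truth with manual-editing artifacts, overly simple backgrounds, and evaluation metrics that do not capture the quality of generated results.

Details

Motivation: Existing text removal datasets such as SCUT-EnsText have ground-truth quality problems and cover only scene text removal, which limits model generalization and accurate evaluation.

Result: OTR provides high-quality synthetic text removal scenarios that support more accurate evaluation and cross-domain generalization.

Insight: Synthetic datasets can overcome the limitations of real datasets, especially in ground-truth quality and task diversity.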

Abstract: Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .

[21] Align Your Query: Representation Alignment for Multimodality Medical Object Detection cs.CV | cs.AI | cs.LG

Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye

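TL;DR: The paper proposes a simple, detector-agnostic framework that uses representation alignment to address heterogeneity in multimodality medical object detection. Its main contributions are modality tokens and Multimodality Context Attention (MoCA), combined with a QueryREPA pretraining stage, which together yield modality-aware query representations and improve detection performance.

Details

Motivation: A medical object detector trained on mixed modalities (e.g., CXR, CT, MRI) suffers from heterogeneous data statistics and disjoint representation spaces, so a method is needed to align feature representations across modalities.

Result: The method consistently improves AP (average precision) on multimodality data, with no architectural modifications and negligible added latency.

Insight: Representation alignment is an effective solution for multimodality medical object detection; lightweight modality embeddings and context propagation are the key ingredients.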

Abstract: Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.

[22] MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding cs.CV | cs.AI | cs.CL | cs.MM

Jingyuan Deng, Yujiu Yang

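TL;DR: MaskCD is a contrastive decoding method that builds contrastive samples by masking "image heads", mitigating hallucinations in large vision-language models (LVLMs) while retaining their general capabilities.

Details

Motivation: LVLMs perform well on multimodal tasks but hallucinate, generating output that contradicts their input. Existing remedies have drawbacks: contrastive decoding struggles to construct suitable contrastive samples, and attention manipulation is highly sensitive and unstable.

Result: On LLaVA-1.5-7b and Qwen-VL-7b, MaskCD reduces hallucinations on benchmarks including CHAIR, POPE, AMBER, and MME while retaining the models' general capabilities.

Insight: Masking image heads offers a stable and efficient way to construct contrastive samples and a new approach to LVLM hallucination.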

Abstract: Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among those problems, the hallucinations have attracted much attention, which stands for the phenomenon where LVLMs generate contradictory content to their input visual and text contents. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle in constructing appropriate contrastive samples, and attention manipulation methods are highly sensitive, lacking stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the “image heads” in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs. Corresponding resources could be found at: https://github.com/Deng-Jingyuan/MaskCD .
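To make the contrastive decoding step concrete, the sketch below combines next-token logits from the full model with logits from a degraded pass. It uses the common contrastive decoding form `(1 + alpha) * full - alpha * contrast`; the abstract does not give MaskCD's exact formula, so this form, the function names, and the toy numbers are assumptions. In MaskCD the contrast logits would come from a forward pass with the image heads masked, which is not modeled here.

```python
def contrastive_logits(logits_full, logits_masked, alpha=1.0):
    """Amplify what the full model predicts beyond the degraded (masked) pass."""
    return [(1.0 + alpha) * f - alpha * m
            for f, m in zip(logits_full, logits_masked)]

def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy example: token 0 scores equally with and without visual evidence
# (a language-prior guess), while token 1 depends on the image.
full = [2.0, 1.5]
masked = [2.0, 0.5]
adjusted = contrastive_logits(full, masked)
```

In this toy case plain greedy decoding picks token 0, while the contrastive logits favor token 1, whose score drops once visual information is removed.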

[23] VERNIER: an open-source software pushing marker pose estimation down to the micrometer and nanometer scales cs.CV | cs.RO

Patrick Sandoz, Antoine N. André, Guillaume J. Laurent

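TL;DR: The paper presents VERNIER, open-source phase processing software for high-precision 6-degree-of-freedom pose estimation at the micrometer and nanometer scales.

Details

Motivation: High-precision pose estimation at small scales remains a challenge. Existing methods struggle to reach nanometer and microradian resolution over relatively large ranges, which VERNIER is designed to address.

Result: VERNIER is highly robust and precise, and meets the needs of various microscopy applications.

Insight: The paper also gives guidelines for selecting the pattern design and microscope magnification lens according to the desired performance.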

Abstract: Pose estimation is still a challenge at the small scales. Few solutions exist to capture the 6 degrees of freedom of an object with nanometric and microradians resolutions over relatively large ranges. Over the years, we have proposed several fiducial marker and pattern designs to achieve reliable performance for various microscopy applications. Centimeter ranges are possible using pattern encoding methods, while nanometer resolutions can be achieved using phase processing of the periodic frames. This paper presents VERNIER, an open source phase processing software designed to provide fast and reliable pose measurement based on pseudo-periodic patterns. Thanks to a phase-based local thresholding algorithm, the software has proven to be particularly robust to noise, defocus and occlusion. The successive steps of the phase processing are presented, as well as the different types of patterns that address different application needs. The implementation procedure is illustrated with synthetic and experimental images. Finally, guidelines are given for selecting the appropriate pattern design and microscope magnification lenses as a function of the desired performance.

[24] Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis cs.CV

Feng Yuan, Yifan Gao, Yuehua Ye, Haoyue Li, Xin Gao

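TL;DR: Med-K2N is a flexible K-to-N modality translation method for medical image synthesis that addresses three challenges: modeling the contributions of different modalities, fusion quality control, and modality identity consistency.

Details

Motivation: Clinical needs call for flexible multi-modal image reconstruction, which requires handling uneven modality contributions, fusion of noisy information, and modality consistency.

Result: Med-K2N outperforms existing methods by significant margins on multiple benchmarks.

Insight: 1. Treating modalities as sequential frames with quality-driven selection addresses multi-modal fusion; 2. Vision-language modeling strengthens modality consistency; 3. Adaptive weight learning is key.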

Abstract: Cross-modal medical image synthesis research focuses on reconstructing missing imaging modalities from available ones to support clinical diagnosis. Driven by clinical necessities for flexible modality reconstruction, we explore K to N medical generation, where three critical challenges emerge: How can we model the heterogeneous contributions of different modalities to various target tasks? How can we ensure fusion quality control to prevent degradation from noisy information? How can we maintain modality identity consistency in multi-output generation? Driven by these clinical necessities, and drawing inspiration from SAM2’s sequential frame paradigm and clinicians’ progressive workflow of incrementally adding and selectively integrating multi-modal information, we treat multi-modal medical data as sequential frames with quality-driven selection mechanisms. Our key idea is to “learn” adaptive weights for each modality-task pair and “memorize” beneficial fusion patterns through progressive enhancement. To achieve this, we design three collaborative modules: PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation. Meanwhile, to maintain modality identity consistency, we propose the Causal Modality Identity Module (CMIM) that establishes causal constraints between generated images and target modality descriptions using vision-language modeling. Extensive experimental results demonstrate that our proposed Med-K2N outperforms state-of-the-art methods by significant margins on multiple benchmarks. Source code is available.

[25] ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment cs.CV | cs.LG

Md Zahim Hassan, Md. Osama, Muhammad Ashad Kabir, Md. Saiful Islam, Zannatul Naim

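TL;DR: ELMF4EggQ is an ensemble learning framework that uses multimodal feature fusion for non-destructive assessment of egg grade and freshness. By combining external features (image, shape, and weight), it significantly improves classification accuracy.

Details

Motivation: Conventional egg quality assessment requires destructive testing, which is costly and inefficient. Non-destructive methods matter for food safety and production efficiency.

Result: The multimodal ensemble reaches 86.57% accuracy in grade classification and 70.83% in freshness prediction, outperforming single-modality baselines.

Insight: Multimodal feature fusion has strong potential for non-destructive quality assessment; a public dataset can advance the field.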

Abstract: Accurate, non-destructive assessment of egg quality is critical for ensuring food safety, maintaining product standards, and operational efficiency in commercial poultry production. This paper introduces ELMF4EggQ, an ensemble learning framework that employs multimodal feature fusion to classify egg grade and freshness using only external attributes - image, shape, and weight. A novel, publicly available dataset of 186 brown-shelled eggs was constructed, with egg grade and freshness levels determined through laboratory-based expert assessments involving internal quality measurements, such as yolk index and Haugh unit. To the best of our knowledge, this is the first study to apply machine learning methods for internal egg quality assessment using only external, non-invasive features, and the first to release a corresponding labeled dataset. The proposed framework integrates deep features extracted from external egg images with structural characteristics such as egg shape and weight, enabling a comprehensive representation of each egg. Image feature extraction is performed using top-performing pre-trained CNN models (ResNet152, DenseNet169, and ResNet152V2), followed by PCA-based dimensionality reduction, SMOTE augmentation, and classification using multiple machine learning algorithms. An ensemble voting mechanism combines predictions from the best-performing classifiers to enhance overall accuracy. Experimental results demonstrate that the multimodal approach significantly outperforms image-only and tabular (shape and weight) only baselines, with the multimodal ensemble approach achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction. All code and data are publicly available at https://github.com/Kenshin-Keeps/Egg_Quality_Prediction_ELMF4EggQ, promoting transparency, reproducibility, and further research in this domain.
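The final ensemble step can be pictured as a vote over per-classifier predictions. The abstract only says an "ensemble voting mechanism" is used, so the hard (majority) voting below is an assumption, and the labels and function name are placeholders rather than the paper's grade classes or code.

```python
from collections import Counter

def majority_vote(predictions):
    """Hard voting: predictions is a list of per-classifier label lists,
    one label per sample; returns the most common label for each sample."""
    voted = []
    for sample_preds in zip(*predictions):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

# Three hypothetical classifiers voting on three samples.
clf_outputs = [
    ["A", "B", "A"],
    ["A", "A", "B"],
    ["B", "B", "A"],
]
ensemble = majority_vote(clf_outputs)
```

Using an odd number of classifiers avoids ties in the two-class case; with ties, `Counter.most_common` returns the label that was counted first.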

[26] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework cs.CV

Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato

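TL;DR: The paper proposes Patch-ioner, a unified zero-shot captioning framework that shifts from global image representations to local patch representations, enabling captioning of arbitrary regions without region-level supervision.

Details

Motivation: Existing zero-shot captioners are limited to global image representations and whole-image captions and cannot flexibly describe arbitrary regions. Patch-ioner removes this limit, providing patch-level captioning without extra supervision.

Result: Patch-ioner outperforms existing methods on zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, confirming the effectiveness of patch-level representations.

Insight: Dense visual features are key to zero-shot captioning, and patch-level representations markedly improve flexibility and scalability.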

Abstract: Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense, region-set, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation. Project page at https://paciosoft.com/Patch-ioner/ .

[27] Training-Free Out-Of-Distribution Segmentation With Foundation Models cs.CV

Laith Nayal, Hadi Salloum, Ahmad Taha, Yaroslav Kholodov, Alexander Gasnikov

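TL;DR: The paper proposes a training-free OoD segmentation method that uses features from a foundation model (InternImage) with simple clustering, and performs strongly on the RoadAnomaly and ADE-OoD benchmarks.

Details

Motivation: Detecting unknown objects in semantic segmentation is crucial for safety-critical applications such as autonomous driving. Foundation models are strong on closed-set tasks, but their ability to detect OoD regions remains underexplored.

Result: 50.02 AP on the RoadAnomaly benchmark and 48.77 AP on ADE-OoD, surpassing several baseline methods.

Insight: Foundation model features generalize well enough to support OoD detection without additional supervision, showing their potential for open-world tasks.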

Abstract: Detecting unknown objects in semantic segmentation is crucial for safety-critical applications such as autonomous driving. Large vision foundation models, including DINOv2, InternImage, and CLIP, have advanced visual representation learning by providing rich features that generalize well across diverse tasks. While their strength in closed-set semantic tasks is established, their capability to detect out-of-distribution (OoD) regions in semantic segmentation remains underexplored. In this work, we investigate whether foundation models fine-tuned on segmentation datasets can inherently distinguish in-distribution (ID) from OoD regions without any outlier supervision. We propose a simple, training-free approach that utilizes features from the InternImage backbone and applies K-Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters. Our method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on the benchmark of ADE-OoD with InternImage-L, surpassing several supervised and unsupervised baselines. These results suggest a promising direction for generic OoD segmentation methods that require minimal assumptions or additional data.
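The clustering-plus-thresholding idea can be sketched on toy data: cluster per-pixel features with K-Means, then flag every cluster whose average confidence falls below a threshold. This is a simplified illustration, not the paper's pipeline; the deterministic initialization, the use of a mean maximum-logit score per cluster, the threshold `tau`, and all names are assumptions, and real features would come from the InternImage backbone.

```python
def nearest(point, centers):
    """Index of the closest center by squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))

def kmeans(points, k, iters=20):
    """Plain K-Means; centers start at the first k points to keep the sketch deterministic."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def ood_mask(features, max_logits, k=2, tau=0.5):
    """Mark pixels whose cluster has a mean confidence below tau as OoD."""
    labels = kmeans(features, k)
    flagged = set()
    for c in range(k):
        conf = [m for m, l in zip(max_logits, labels) if l == c]
        if conf and sum(conf) / len(conf) < tau:
            flagged.add(c)
    return [l in flagged for l in labels]

# Four toy "pixels": two feature groups, one of them with low confidence.
features = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]
max_logits = [0.9, 0.2, 0.8, 0.3]
mask = ood_mask(features, max_logits)
```

Deciding per cluster rather than per pixel means a single confident pixel inside a low-confidence region is still flagged along with its neighbors in feature space.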

[28] Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention cs.CV

Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu

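TL;DR: The paper proposes HoloV, a visual token pruning framework that retains visual context from a holistic perspective, addressing the performance drop of attention-first pruning methods at high pruning ratios.

Details

Motivation: Multimodal large language models (MLLMs) rely on massive numbers of visual tokens, which incurs considerable computational overhead. Existing attention-first pruning methods tend to keep semantically similar tokens, causing pronounced performance drops at high pruning ratios.

Result: HoloV outperforms existing methods across tasks, MLLM architectures, and pruning ratios. For instance, LLaVA1.5 with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens.

Insight: A holistic token pruning strategy avoids representational collapse and maintains accuracy and efficiency even at high pruning ratios.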

Abstract: Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.
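The difference between a global top-k cut and a per-crop budget can be shown in a few lines. The sketch below is a hypothetical illustration, not HoloV's algorithm: the proportional allocation with largest-remainder rounding, the crop weights, and the `(token_index, saliency)` representation are all assumptions.

```python
def allocate_budget(crop_weights, budget):
    """Split a token budget across crops in proportion to their weights,
    handing leftover tokens to the largest fractional remainders."""
    total = sum(crop_weights)
    raw = [budget * w / total for w in crop_weights]
    alloc = [int(r) for r in raw]
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

def prune_tokens(crops, crop_weights, budget):
    """Keep the most salient tokens inside each crop, up to that crop's share."""
    kept = []
    for crop, n in zip(crops, allocate_budget(crop_weights, budget)):
        top = sorted(crop, key=lambda t: t[1], reverse=True)[:n]
        kept.extend(idx for idx, _ in top)
    return sorted(kept)

# Two crops of (token_index, saliency); all salient tokens sit in the first crop.
crops = [[(0, 0.9), (1, 0.8), (2, 0.7)], [(3, 0.2), (4, 0.1), (5, 0.05)]]
kept = prune_tokens(crops, [1.0, 1.0], budget=2)
```

A global top-2 by saliency would keep tokens 0 and 1 from the same crop, whereas the per-crop budget keeps one token from each crop. As a rough sense of scale, assuming the 576 visual tokens LLaVA-1.5 typically uses, pruning 88.9% leaves about 64 tokens.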

[29] Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting cs.CV

Nikoo Naghavian, Mostafa Tavassolipour

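TL;DR: The paper proposes Confidence-Aware Weighting (CAW), which improves the zero-shot robustness of vision-language models through a confidence-aware loss and feature alignment regularization, outperforming existing methods while using less memory.

Details

Motivation: Vision-language models such as CLIP generalize well zero-shot but remain highly vulnerable to adversarial attacks. CAW aims to improve their robustness.

Result: On TinyImageNet and 14 additional datasets, CAW outperforms methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, with lower memory requirements.

Insight: Combining confidence awareness with feature alignment improves the robustness of vision-language models while preserving zero-shot generalization.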

Abstract: Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.
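The two components named in the abstract can be written as a single per-example loss. The sketch below is an assumption-laden illustration: the abstract does not specify the weighting function, the KL direction, or the distance, so the weight `1 - p_adv[label]`, the direction KL(clean || adversarial), the squared L2 distance, and the coefficient `beta` are hypothetical choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def caw_loss(clean_logits, adv_logits, label, feat_frozen, feat_tuned, beta=1.0):
    """Confidence-weighted KL term plus a feature alignment penalty."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    # Hypothetical weight: low confidence on the true class under attack
    # means the example is uncertain and gets more emphasis.
    weight = 1.0 - p_adv[label]
    # Alignment between frozen and fine-tuned encoder features on the
    # adversarial input, here as a squared L2 distance.
    align = sum((a - b) ** 2 for a, b in zip(feat_frozen, feat_tuned))
    return weight * kl_divergence(p_clean, p_adv) + beta * align
```

When the adversarial prediction matches the clean one and the features are unchanged, both terms vanish; the loss grows as the attack shifts the prediction or the fine-tuned features drift from the frozen encoder.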

[30] Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights cs.CV | cs.AI

Daphne Tsolissou, Theofanis Ganitidis, Konstantinos Mitsis, Stergios Christodoulidis, Maria Vakalopoulou

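TL;DR: The paper studies large vision-language models (LVLMs) for multimodal risk assessment of carotid atheromatous disease. Integrating ultrasound imaging with clinical, demographic, and biomarker data, it finds that LVLMs classify risk poorly out of the box; fine-tuning LLaVa-NeXT-Vicuna with low-rank adaptation (LoRA) substantially improves stroke risk stratification.

Details

Motivation: Risk assessment for carotid atheromatous disease requires integrating diverse clinical and imaging information, and current methods fall short on transparency and interpretability. The study explores the potential of LVLMs for this multimodal assessment.

Result: 1. In zero-shot experiments, LVLMs perform poorly at risk classification; 2. The fine-tuned LLaVa-NeXT-Vicuna substantially improves risk stratification; 3. Adding multimodal tabular data as text further improves specificity and balanced accuracy.

Insight: LVLMs show promise for multimodal medical image analysis, but clinical use requires domain adaptation and multimodal data integration.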

Abstract: Reliable risk assessment for carotid atheromatous disease remains a major clinical challenge, as it requires integrating diverse clinical and imaging information in a manner that is transparent and interpretable to clinicians. This study investigates the potential of state-of-the-art and recent large vision-language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging (USI) with structured clinical, demographic, laboratory, and protein biomarker data. A framework that simulates realistic diagnostic scenarios through interview-style question sequences is proposed, comparing a range of open-source LVLMs, including both general-purpose and medically tuned models. Zero-shot experiments reveal that even if they are very powerful, not all LVLMs can accurately identify imaging modality and anatomy, while all of them perform poorly in accurate risk classification. To address this limitation, LLaVa-NeXT-Vicuna is adapted to the ultrasound domain using low-rank adaptation (LoRA), resulting in substantial improvements in stroke risk stratification. The integration of multimodal tabular data in the form of text further enhances specificity and balanced accuracy, yielding competitive performance compared to prior convolutional neural network (CNN) baselines trained on the same dataset. Our findings highlight both the promise and limitations of LVLMs in ultrasound-based cardiovascular risk prediction, underscoring the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.
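For readers unfamiliar with low-rank adaptation, the sketch below shows the weight update it applies: a frozen weight matrix `W` is augmented by a scaled product of two small trainable matrices, `W + (alpha / r) * B @ A`. The matrix sizes, `alpha`, and rank `r` here are toy assumptions; the abstract does not report the configuration used for LLaVa-NeXT-Vicuna.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, alpha, r):
    """Effective weight W + (alpha / r) * B @ A, with B of shape (d_out, r)
    and A of shape (r, d_in); W itself stays frozen during fine-tuning."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy 2x2)
a = [[0.5, 0.5]]               # trainable A, rank r = 1
b_init = [[0.0], [0.0]]        # B starts at zero, so training begins from W
b_trained = [[1.0], [0.0]]     # hypothetical B after some updates
```

Because `B` is initialized to zero, the adapted model is identical to the pretrained one at the start of fine-tuning, and only the `r * (d_in + d_out)` adapter parameters are trained.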

[31] TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency cs.CV

Juntong Wang, Huiyu Duan, Jiarui Wang, Ziheng Jia, Guangtao Zhai

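TL;DR: The paper proposes TIT-Score, a metric that evaluates text-to-image alignment for long prompts by quantifying text-to-image-to-text consistency, and shows on the new LPG-Bench benchmark that it outperforms existing metrics.

Details

Motivation: Current text-to-image (T2I) models align well with short prompts but generate inaccurately and inconsistently for long prompts, and existing evaluation metrics agree poorly with human preferences.

Result: TIT-Score-LLM improves pairwise accuracy by 7.31% (absolute) over the strongest baseline and agrees more closely with human judgment.

Insight: Evaluating long-prompt generation calls for methods closer to human preference; text-to-image-to-text consistency is an effective and scalable evaluation framework.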

Abstract: With the rapid advancement of large multimodal models (LMMs), recent text-to-image (T2I) models can generate high-quality images and demonstrate great alignment to short prompts. However, they still struggle to effectively understand and follow long and detailed prompts, displaying inconsistent generation. To address this challenge, we introduce LPG-Bench, a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. LPG-Bench features 200 meticulously crafted prompts with an average length of over 250 words, approaching the input capacity of several leading commercial models. Using these prompts, we generate 2,600 images from 13 state-of-the-art models and further perform comprehensive human-ranked annotations. Based on LPG-Bench, we observe that state-of-the-art T2I alignment evaluation metrics exhibit poor consistency with human preferences on long-prompt-based image generation. To address the gap, we introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images. The core concept of TIT is to quantify T2I alignment by directly comparing the consistency between the raw prompt and the LMM-produced description on the generated image, which includes an efficient score-based instantiation TIT-Score and a large-language-model (LLM) based instantiation TIT-Score-LLM. Extensive experiments demonstrate that our framework achieves superior alignment with human judgment compared to CLIP-score, LMM-score, etc., with TIT-Score-LLM attaining a 7.31% absolute improvement in pairwise accuracy over the strongest baseline. LPG-Bench and TIT methods together offer a deeper perspective to benchmark and foster the development of T2I models. All resources will be made publicly available.
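The headline number is a pairwise accuracy: the fraction of image pairs that a metric orders the same way as human annotators. The sketch below shows one common way to compute it; skipping pairs that humans rate as tied and counting metric ties as disagreements are assumptions, since the abstract does not describe the exact protocol.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of non-tied human-ranked pairs that the metric orders the same way."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        total += 1
        if (metric_scores[i] - metric_scores[j]) * human_diff > 0:
            agree += 1
    return agree / total if total else 0.0

# Three images generated from one prompt, with hypothetical scores.
human = [3, 1, 2]
good_metric = [0.9, 0.5, 0.7]
weak_metric = [0.9, 0.7, 0.5]
```

Here `good_metric` orders every pair the way the human scores do, while `weak_metric` swaps the last two images and agrees on two of three pairs.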

[32] Towards Scalable and Consistent 3D Editing cs.CV

Ruihao Xia, Yang Tang, Pan Zhou

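TL;DR: The paper proposes the 3DEditFormer model and the 3DEditVerse dataset to address consistency, structural fidelity, and fine-grained control in 3D editing, enabling precise edits without auxiliary 3D masks.

Details

Motivation: 3D editing has wide applications in immersive content creation and AR/VR, but existing methods are slow, prone to geometric distortion, or dependent on manual 3D masks. The paper aims to address these challenges.

Result: Experiments show that 3DEditFormer outperforms existing methods both quantitatively and qualitatively, setting a new standard for 3D editing.

Insight: Dual-guidance attention that disentangles editable regions from preserved structure is key to efficient 3D editing; large-scale, high-quality datasets are critical to model performance.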

Abstract: 3D editing - the task of locally modifying the geometry or appearance of a 3D asset - has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/

[33] Not every day is a sunny day: Synthetic cloud injection for deep land cover segmentation robustness evaluation across data sources cs.CV

Sara Mobsite, Renaud Hostache, Laure Berti Equille, Emmanuel Roux, Joris Guerin

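TL;DR: The paper proposes a synthetic cloud injection algorithm to evaluate how cloud cover in Sentinel-2 satellite data affects deep-learning land cover segmentation, and a lightweight method that injects Normalized Difference Indices (NDIs) to improve model performance. It also adds Sentinel-1 radar data to compensate for optical data under cloud cover.

Details

Motivation: Most existing Sentinel-2 datasets are cloud-free, which limits their use in tropical regions. In addition, encoder downsampling in deep networks loses spatial and spectral information. The paper addresses both problems to improve robustness under cloud cover.

Result: 1. NDI injection improves performance on the DFC2020 dataset (U-Net +1.99%, DeepLabV3 +2.78%). 2. Under cloud cover, adding Sentinel-1 data significantly outperforms optical data alone.

Insight: Radar-optical fusion has a clear advantage under cloud cover. Lightweight NDI injection improves model performance at low computational cost.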

Abstract: Supervised deep learning for land cover semantic segmentation (LCS) relies on labeled satellite data. However, most existing Sentinel-2 datasets are cloud-free, which limits their usefulness in tropical regions where clouds are common. To properly evaluate the extent of this problem, we developed a cloud injection algorithm that simulates realistic cloud cover, allowing us to test how Sentinel-1 radar data can fill in the gaps caused by cloud-obstructed optical imagery. We also tackle the issue of losing spatial and/or spectral details during encoder downsampling in deep networks. To mitigate this loss, we propose a lightweight method that injects Normalized Difference Indices (NDIs) into the final decoding layers, enabling the model to retain key spatial features with minimal additional computation. Injecting NDIs enhanced land cover segmentation performance on the DFC2020 dataset, yielding improvements of 1.99% for U-Net and 2.78% for DeepLabV3 on cloud-free imagery. Under cloud-covered conditions, incorporating Sentinel-1 data led to significant performance gains across all models compared to using optical data alone, highlighting the effectiveness of radar-optical fusion in challenging atmospheric scenarios.
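A Normalized Difference Index is a per-pixel ratio of two spectral bands, `(A - B) / (A + B)`; the best-known instance is NDVI, computed from the near-infrared and red bands. The sketch below computes such an index on a toy tile. Which indices the paper injects and how they enter the decoder are not stated in the abstract, so the band choice, the `eps` guard, and the function name are assumptions.

```python
def normalized_difference(band_a, band_b, eps=1e-6):
    """Per-pixel (A - B) / (A + B) for two equally sized 2D band arrays.
    eps guards against division by zero on dark pixels."""
    return [[(a - b) / (a + b + eps) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(band_a, band_b)]

# Toy 1x2 tile of reflectances: near-infrared and red bands.
nir = [[0.6, 0.2]]
red = [[0.2, 0.2]]
ndvi = normalized_difference(nir, red)
```

The resulting index map has the same spatial size as the input bands, so in principle it could be concatenated as an extra channel with full-resolution decoder features; that wiring is left out of this sketch.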

[34] When and Where do Events Switch in Multi-Event Video Generation? cs.CV | cs.AI

Ruotong Liao, Guowen Huang, Qing Cheng, Thomas Seidl, Daniel Cremers

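TL;DR: The paper studies event switching in multi-event text-to-video (T2V) generation, introduces the MEve evaluation suite, and shows the importance of early intervention in denoising steps and block-wise model layers.

Details

Motivation: Existing multi-event generation methods overlook the intrinsic factors behind event switching. The paper asks when and where multi-event prompts control event transitions during T2V generation.

Result: Early intervention proves essential for multi-event video generation, pointing to possibilities for multi-event conditioning in future models.

Insight: Event switching hinges on early denoising steps and on block-wise model layers.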

Abstract: Text-to-video (T2V) generation has surged in response to challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation omit an inspection of the intrinsic factor in event shifting. The paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event text-to-video (T2V) generation, and conducts a systematic study of two representative model families, i.e., OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factor for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.

[35] InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition cs.CV

Ahsan Farabi, Israt Khandaker, Ibrahim Khalil Shanto, Md Abdul Ahad Minhaz, Tanisha Zaman

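TL;DR: InsideOut is a deep learning framework built on EfficientNetV2-S for robust multi-class facial emotion recognition (FER); with data augmentation and imbalance-aware optimization it achieves competitive results.

Details

Motivation: In real-world use, FER is challenged by occlusion, illumination changes, pose variation, and dataset imbalance, and needs an efficient, robust solution.

Result: 62.8% accuracy and a macro-averaged F1 of 0.590 on FER2013, competitive with conventional CNN baselines.

Insight: Efficient architectures combined with imbalance handling can provide practical, reproducible FER solutions.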

Abstract: Facial Emotion Recognition (FER) is a key task in affective computing, enabling applications in human-computer interaction, e-learning, healthcare, and safety systems. Despite advances in deep learning, FER remains challenging due to occlusions, illumination and pose variations, subtle intra-class differences, and dataset imbalance that hinders recognition of minority emotions. We present InsideOut, a reproducible FER framework built on EfficientNetV2-S with transfer learning, strong data augmentation, and imbalance-aware optimization. The approach standardizes FER2013 images, applies stratified splitting and augmentation, and fine-tunes a lightweight classification head with a class-weighted loss to address skewed distributions. InsideOut achieves 62.8% accuracy with a macro-averaged F1 of 0.590 on FER2013, showing competitive results compared to conventional CNN baselines. The novelty lies in demonstrating that efficient architectures, combined with tailored imbalance handling, can provide practical, transparent, and reproducible FER solutions.
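
The two imbalance-related quantities in this abstract, class weights and macro-averaged F1, can be sketched in plain Python. The abstract does not give the weighting formula; the inverse-frequency rule below is a common choice and is an assumption, as are the function names.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed weighting scheme; rare classes get larger weights.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy label set with a rare class, loosely mimicking emotion imbalance.
train_labels = ["happy"] * 6 + ["sad"] * 3 + ["disgust"]
weights = inverse_frequency_weights(train_labels)
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Macro averaging explains why the reported F1 (0.590) can sit below the accuracy (62.8%): a poorly recognized minority emotion lowers the macro score as much as a majority one would.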

[36] What Drives Compositional Generalization in Visual Generative Models? cs.CV | cs.AI | cs.LG

Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer

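TL;DR: This paper studies how design choices in visual generative models affect compositional generalization, identifies the discrete-versus-continuous nature of the training objective and the informativeness of conditioning as key factors, and proposes a way to improve the compositional performance of discrete models.

Details

Motivation: The study aims to understand which mechanisms enable or inhibit compositional generalization in visual generative models, in order to improve their ability to generate novel combinations.

Result: MaskGIT trained with an auxiliary continuous JEPA-based objective shows better compositional performance.

Insight: Compositional generalization depends on the nature of the training objective and on how much the conditioning reveals about the constituent concepts; combining discrete and continuous objectives can improve performance.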

Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.

[37] Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields cs.CV | cs.RO

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

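TL;DR: This paper studies the effect of geometry-grounding on semantic distillation in radiance fields and proposes SPINE, a new framework for inverting radiance fields without an initial guess. It finds that geometry-grounded features capture finer geometric detail but perform worse on pose estimation.

Details

Motivation: To explore whether geometry-grounded semantic features can improve spatial tasks in radiance fields, such as pose estimation and semantic localization.

Result: Geometry-grounded features contain richer geometric detail but yield lower pose estimation accuracy; visual-only features are more versatile across tasks.

Insight: Future work needs more effective geometry-grounding strategies that balance the geometric detail of semantic features with their versatility across tasks.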

Abstract: Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their counterparts. Secondly, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Thirdly, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.

[38] GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion cs.CV

Beibei Lin, Tingting Chen, Robby T. Tan

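TL;DR: GeoComplete is a novel reference-driven image completion framework that uses explicit 3D structural guidance to enforce geometric consistency, addressing the misalignment that existing generative methods suffer when viewpoints differ widely.

Details

Motivation: Existing methods rely solely on diffusion priors and lack geometric cues such as camera pose or depth, leading to misaligned or implausible completions. GeoComplete aims to solve this by introducing geometric information.

Result: Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

Insight: Incorporating geometric information significantly improves the quality and target consistency of reference-driven image completion, especially when viewpoints differ widely.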

Abstract: Reference-driven image completion, which restores missing regions in a target view using additional images, is particularly challenging when the target view differs significantly from the references. Existing generative methods rely solely on diffusion priors and, without geometric cues such as camera pose or depth, often produce misaligned or implausible content. We propose GeoComplete, a novel framework that incorporates explicit 3D structural guidance to enforce geometric consistency in the completed regions, setting it apart from prior image-only approaches. GeoComplete introduces two key ideas: conditioning the diffusion process on projected point clouds to infuse geometric information, and applying target-aware masking to guide the model toward relevant reference cues. The framework features a dual-branch diffusion architecture. One branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches ensures coherent and accurate completion. To address regions visible in references but absent in the target, we project the target view into each reference to detect occluded areas, which are then masked during training. This target-aware masking directs the model to focus on useful cues, enhancing performance in difficult scenarios. By integrating a geometry-aware dual-branch diffusion architecture with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.
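
For readers unfamiliar with the metric quoted above, PSNR can be computed as follows. This is the standard definition, not the authors' evaluation code; the 8-bit `max_value` default and the toy images are assumptions.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a constant error of 25.5 grey levels on an 8-bit image.
target = np.zeros((4, 4))
completed = np.full((4, 4), 25.5)
quality_db = psnr(target, completed)  # MSE = 650.25, ratio = 100
```

PSNR is logarithmic, so equal differences in dB correspond to equal ratios of mean squared error rather than equal absolute error reductions.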

[39] Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction cs.CV | cs.SD

Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang

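TL;DR: This paper proposes a new Text-to-Sounding-Video (T2SV) generation method that disentangles the text captions for video and audio and introduces a dual-tower diffusion transformer (BridgeDiT) for cross-modal interaction, improving generation quality and synchronization and reaching state-of-the-art results on most metrics.

Details

Motivation: Existing T2SV methods suffer from modal interference and an unclear mechanism for cross-modal interaction, which limit generation quality and synchronization. The paper aims to address both problems.

Result: Experiments on three benchmark datasets, together with human evaluation, show that the method is state of the art on most metrics; ablation studies validate each contribution.

Insight: Disentangled captions and symmetric cross-modal interaction are key directions for improving T2SV generation.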

Abstract: This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single, shared text caption, where the text for video is the same as the text for audio, often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions, a video caption and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust "bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All code and checkpoints will be publicly released.

[40] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion cs.CV | cs.AI

Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou

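TL;DR: HAVIR is a hierarchical vision-to-image reconstruction model that uses CLIP-guided Versatile Diffusion to address the difficulty existing methods have in reconstructing visual information for complex scenes.

Details

Motivation: Existing methods struggle to reconstruct complex visual stimuli because natural scenes exhibit heterogeneous low-level features and semantically entangled high-level features. Inspired by the hierarchical representation theory of the visual cortex, HAVIR processes the two levels separately.

Result: Experiments show that HAVIR improves the structural and semantic quality of reconstructions in complex scenes and outperforms existing models.

Insight: Hierarchical processing effectively separates low-level from high-level visual features; CLIP embeddings strengthen the semantic signal, and combining them with a diffusion model further improves reconstruction.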

Abstract: The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

[41] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories cs.CV | cs.RO

Gen Li, Bo Zhao, Jianfei Yang, Laura Sevilla-Lara

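TL;DR: Mask2IV proposes a decoupled two-stage framework for interaction-centric video generation: it first predicts motion trajectories for the actor and the object, then generates the video. It requires no dense mask inputs and still lets users flexibly control the interaction.

Details

Motivation: Interaction-centric video generation is crucial for embodied intelligence, but existing methods struggle to model complex dynamic interactions, and dense mask annotation is challenging in practice.

Result: On two diverse benchmarks, the method achieves better visual realism and controllability than existing baselines.

Insight: The decoupled design and trajectory prediction improve the flexibility and quality of interaction video generation while reducing annotation requirements.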

Abstract: Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

[42] ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories cs.CV | cs.CE | cs.LG | cs.SI

Anantajit Subrahmanya, Chandrakanth Gudavalli, Connor Levenson, Umang Garg, B. S. Manjunath

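TL;DR: This paper proposes Markovian Reeb Graphs, a new framework for simulating spatiotemporal trajectories that preserve the Patterns of Life (PoLs) in baseline data. By combining individual- and population-level mobility structures, it generates future trajectories that are both consistent and varied.

Details

Motivation: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. Existing methods fall short in preserving patterns of life and in computational efficiency, so a new framework is needed.

Result: The method shows high fidelity on population- and agent-level metrics while remaining data- and compute-efficient.

Insight: Markovian Reeb Graphs are a scalable framework applicable across diverse urban environments, offering a new approach to trajectory simulation.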

Abstract: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. In this work, we introduce Markovian Reeb Graphs, a novel framework for simulating spatiotemporal trajectories that preserve Patterns of Life (PoLs) learned from baseline data. By combining individual- and population-level mobility structures within a probabilistic topological model, our approach generates realistic future trajectories that capture both consistency and variability in daily life. Evaluations on the Urban Anomalies dataset (Atlanta and Berlin subsets) using the Jensen-Shannon Divergence (JSD) across population- and agent-level metrics demonstrate that the proposed method achieves strong fidelity while remaining data- and compute-efficient. These results position Markovian Reeb Graphs as a scalable framework for trajectory simulation with broad applicability across diverse urban environments.
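
The Jensen-Shannon Divergence used for evaluation above compares two distributions symmetrically. The sketch below uses the standard definition with base-2 logarithms, which bounds the value in [0, 1]; the histograms are made-up toy data, and the abstract does not state which log base or mobility statistics the authors use.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p, q = p / p.sum(), q / q.sum()  # normalize histograms to probabilities
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical visit counts per location category, real vs. simulated.
real_visits = [40, 30, 20, 10]
simulated_visits = [35, 30, 25, 10]
divergence = jensen_shannon_divergence(real_visits, simulated_visits)
```

Lower values indicate that the simulated trajectories reproduce the statistic of interest more faithfully; 0 means the two histograms are identical after normalization.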

[43] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus cs.CV | cs.AI

Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang

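TL;DR: The paper introduces the SpineMed ecosystem, comprising SpineMed-450k, the first large-scale dataset for vertebral-level reasoning, and the SpineBench evaluation framework. Data quality is ensured by a clinician-in-the-loop, two-stage LLM generation method, and a model fine-tuned on the dataset shows significant gains on clinical tasks.

Details

Motivation: Spine disorders are a global health problem, but AI-assisted diagnosis is limited by the lack of vertebral-level multimodal datasets and standardized evaluation tools.

Result: The model fine-tuned on SpineMed-450k performs best on SpineBench, clearly outperforming other LVLMs, particularly on vertebral-level reasoning and pathology assessment.

Insight: 1. Clinician involvement in data generation is critical to the practical utility of AI models for medical tasks; 2. Fine-grained reasoning is a weakness of current LVLMs.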

Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model’s outputs.

[44] Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft cs.CV

Junchao Huang, Xinting Hu, Boyao Han, Shaoshuai Shi, Zhuotao Tian

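TL;DR: Memory Forcing is a learning framework that incorporates spatio-temporal memory for consistent scene generation in Minecraft. Hybrid Training and Chained Forward Training dynamically balance the use of temporal and spatial memory, while a geometry-indexed spatial memory with efficient retrieval delivers long-term spatial consistency and generation quality.

Details

Motivation: Under a limited compute budget, existing models struggle to deliver both high-quality generation of new scenes and long-term spatial consistency in explored areas. A method is needed that balances temporal and spatial memory to enable natural interactive scene generation in games such as Minecraft.

Result: Across diverse environments, Memory Forcing achieves superior long-term spatial consistency and generation quality while remaining computationally efficient.

Insight: Dynamically balancing temporal and spatial memory is key; combining Hybrid Training with Chained Forward Training improves the model's adaptability and consistency.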

Abstract: Autoregressive video diffusion models have proved effective for world modeling and interactive scene generation, with Minecraft gameplay as a representative application. To faithfully simulate play, a model must generate natural content while exploring new scenes and preserve spatial consistency when revisiting explored areas. Under limited computation budgets, it must compress and exploit historical cues within a finite context window, which exposes a trade-off: Temporal-only memory lacks long-term spatial consistency, whereas adding spatial memory strengthens consistency but may degrade new scene generation quality when the model over-relies on insufficient spatial context. We present Memory Forcing, a learning framework that pairs training protocols with a geometry-indexed spatial memory. Hybrid Training exposes distinct gameplay regimes, guiding the model to rely on temporal memory during exploration and incorporate spatial memory for revisits. Chained Forward Training extends autoregressive training with model rollouts, where chained predictions create larger pose variations and encourage reliance on spatial memory for maintaining consistency. Point-to-Frame Retrieval efficiently retrieves history by mapping currently visible points to their source frames, while Incremental 3D Reconstruction maintains and updates an explicit 3D cache. Extensive experiments demonstrate that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments, while maintaining computational efficiency for extended sequences.
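
The idea of mapping currently visible points back to their source frames can be sketched as a simple voting scheme. The abstract gives no implementation details, so everything here is a hypothetical reading: the `point_to_frame` dictionary, the vote counting, and the `top_k` cutoff are assumptions rather than the authors' Point-to-Frame Retrieval.

```python
from collections import Counter

def retrieve_source_frames(visible_point_ids, point_to_frame, top_k=2):
    """Return the history frames that contributed the most visible points.

    point_to_frame maps a 3D point id to the index of the frame in which
    that point was first reconstructed (hypothetical cache layout).
    """
    votes = Counter(
        point_to_frame[pid] for pid in visible_point_ids if pid in point_to_frame
    )
    return [frame for frame, _ in votes.most_common(top_k)]

# Toy cache: points 0-2 came from frame 10, points 3-4 from frame 42,
# point 5 from frame 7; point 99 is new and has no source frame yet.
cache = {0: 10, 1: 10, 2: 10, 3: 42, 4: 42, 5: 7}
frames = retrieve_source_frames([0, 1, 2, 3, 4, 5, 99], cache, top_k=2)
```

Under this reading, the cost of a lookup scales with the number of visible points rather than with the length of the history, which fits the abstract's claim of efficiency on extended sequences.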

[45] MonSTeR: a Unified Model for Motion, Scene, Text Retrieval cs.CV

Luca Collorone, Matteo Gioia, Massimiliano Pappa, Paolo Leoni, Giovanni Ficarra

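TL;DR: MonSTeR is a unified motion-scene-text retrieval model, the first to enable evaluation of alignment across the three modalities (motion, text, scene); it builds a unified latent space for flexible retrieval.

Details

Motivation: Human motion is driven by intention, but whether a motion is feasible depends on whether the surrounding environment supports it. Existing research lacks tools to evaluate the alignment between motion, intention, and scene.

Result: MonSTeR outperforms trimodal models that rely solely on unimodal representations; its retrieval scores align with human preferences, and it demonstrates versatility on zero-shot in-scene object placement and motion captioning.

Insight: A unified multimodal latent space can capture complex dependencies between modalities and provides a tool for follow-up research.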

Abstract: Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR’s latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.

[46] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping cs.CV | cs.AI

Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella

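TL;DR: The paper improves GUI grounding with explicit position-to-coordinate mapping, avoiding the performance drop that existing VLMs show on unseen resolutions; RULER tokens and I-MRoPE improve accuracy and robustness.

Details

Motivation: GUI grounding maps natural-language instructions to pixel coordinates, but existing methods degrade severely on high resolutions unseen during training, mainly because implicit coordinate mapping is unreliable.

Result: On the ScreenSpot family of datasets, the method consistently improves grounding accuracy, with the largest gains on high-resolution interfaces.

Insight: Explicit position encoding and a symmetric spatial representation are critical for GUI grounding at high resolution.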

Abstract: GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.

[47] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models cs.CV

Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang

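TL;DR: LEAML is a label-efficient adaptation framework for multimodal large language models (MLLMs) that addresses label scarcity in out-of-distribution (OOD) tasks in specialized domains such as medical imaging.

Details

Motivation: MLLMs perform well on general visual tasks but poorly on OOD tasks in specialized domains such as medical imaging, mainly because labeled data is scarce and expensive.

Result: On gastrointestinal endoscopy and sports VQA, LEAML outperforms standard fine-tuning under minimal supervision.

Insight: LEAML shows that pseudo-label generation combined with selective neuron updates can effectively address label scarcity in specialized domains.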

Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of our proposed LEAML framework.

cs.CL

[48] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning cs.CL | cs.AI

Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang

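TL;DR: CASAL is an efficient algorithm that combines contrastive activation steering with amortized optimization to bake the benefits of activation steering directly into the model weights, substantially reducing LLM hallucination.

Details

Motivation: Large language models (LLMs) often hallucinate, confidently giving wrong answers, and existing steering methods require real-time intervention at inference, which is inefficient. CASAL aims to provide an efficient solution with low data requirements.

Result: CASAL reduces hallucination by 30%-40% on multiple short-form QA benchmarks, is substantially more compute- and data-efficient than SFT and DPO baselines, and generalizes effectively to OOD domains.

Insight: CASAL demonstrates the practical potential of interpretability methods and offers an efficient path to deployment in production systems.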

Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model’s weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL’s lightweight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL’s flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods to practical deployment in production systems.
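
The inference-time activation steering that CASAL builds on can be sketched with a difference-of-means direction. This shows only the generic steering step that prior work applies at inference; how CASAL amortizes it into the weights is not described in the abstract and is not shown. The function names, the `alpha` scale, and the toy activations are assumptions.

```python
import numpy as np

def steering_vector(known_acts, unknown_acts):
    """Contrastive direction: mean activation on 'known' prompts minus
    mean activation on 'unknown' prompts (rows are examples)."""
    known = np.asarray(known_acts, dtype=np.float64)
    unknown = np.asarray(unknown_acts, dtype=np.float64)
    return known.mean(axis=0) - unknown.mean(axis=0)

def steer(hidden_state, direction, alpha=1.0):
    """Shift a hidden state along the direction; alpha sets the strength."""
    return np.asarray(hidden_state, dtype=np.float64) + alpha * direction

# Toy 2-dimensional activations from one layer.
known = [[1.0, 0.0], [3.0, 0.0]]
unknown = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_vector(known, unknown)
shifted = steer([0.0, 0.0], direction, alpha=0.5)
```

Applying `steer` requires hooking into the forward pass on every request, which is the real-time monitoring cost the abstract says CASAL removes by training the effect into a single submodule.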

[49] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval cs.CL | cs.AI

Vivek Bhavsar, Joseph Ereifej, Aravanan Gurusami

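TL;DR: RA-FSM is a modular GPT-based research assistant that uses a finite-state control loop (Relevance -> Confidence -> Knowledge) to reduce hallucination and mis-citation, making it more useful in expert workflows.

Details

Motivation: Large language models hallucinate and mis-cite when synthesizing literature, which limits their use in expert workflows.

Result: In an evaluation on six task categories in photonics, experts preferred RA-FSM, citing stronger boundary-condition handling and more defensible evidence use.

Insight: Modular, deterministic pipeline design can significantly improve the reliability and usefulness of language models in specialized domains.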

Abstract: Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) baseline and a vanilla single-pass baseline that makes one default GPT API call, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.
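
A finite-state control loop of the Relevance -> Confidence -> Knowledge kind can be sketched as below. The abstract names the three states but not their transition rules, so the transitions, the `threshold`, and the three callables (`in_scope`, `answerability`, `retrieve`) are hypothetical stand-ins, not the RA-FSM implementation.

```python
from enum import Enum, auto

class State(Enum):
    RELEVANCE = auto()
    CONFIDENCE = auto()
    KNOWLEDGE = auto()
    DONE = auto()

def run_controller(query, in_scope, answerability, retrieve, threshold=0.7):
    """Walk the states once and record which ones were visited."""
    state, trace = State.RELEVANCE, []
    result = {"answer_allowed": False, "confidence": None, "evidence": []}
    while state is not State.DONE:
        trace.append(state.name)
        if state is State.RELEVANCE:
            # Out-of-scope queries are rejected before any generation.
            state = State.CONFIDENCE if in_scope(query) else State.DONE
        elif state is State.CONFIDENCE:
            result["confidence"] = answerability(query)
            result["answer_allowed"] = True
            # Retrieval is triggered only when confidence is too low.
            enough = result["confidence"] >= threshold
            state = State.DONE if enough else State.KNOWLEDGE
        elif state is State.KNOWLEDGE:
            result["evidence"] = retrieve(query)
            state = State.DONE
    return result, trace

# Toy stand-ins for the scope filter, answerability scorer, and retriever.
scope = lambda q: "photonic" in q
score = lambda q: 0.9 if q.startswith("define") else 0.2
lookup = lambda q: ["doc-1"]

easy, easy_trace = run_controller("define photonic crystal", scope, score, lookup)
hard, hard_trace = run_controller("compare photonic waveguide losses", scope, score, lookup)
off_topic, off_trace = run_controller("best pizza nearby", scope, score, lookup)
```

Making the control flow an explicit state machine keeps each decision inspectable: the trace records why a query was rejected, answered directly, or routed through retrieval.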

[50] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering cs.CL | cs.AI | cs.MA

Ziqing Wang, Chengsheng Mao, Xiaole Wen, Yuan Luo, Kaize Ding

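TL;DR: The paper proposes AMANDA, a framework that augments medical knowledge via LLM agents to address the reasoning bottlenecks of medical multimodal large language models in low-resource settings.

Details

Motivation: Existing Med-MLLMs perform poorly in low-resource settings, mainly because intrinsic reasoning ignores details in the medical image and extrinsic reasoning lacks specialized medical knowledge.

Result: Substantial improvements across eight Med-VQA benchmarks in both zero-shot and few-shot settings.

Insight: Combining LLM agents with knowledge graphs can compensate for gaps in medical reasoning, especially in low-resource scenarios.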

Abstract: Medical Multimodal Large Language Models (Med-MLLMs) have shown great promise in medical visual question answering (Med-VQA). However, when deployed in low-resource settings where abundant labeled data are unavailable, existing Med-MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse-to-fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings. The code is available at https://github.com/REAL-Lab-NU/AMANDA.

[51] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification cs.CL | cs.AI

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo

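TL;DR: This paper proposes SelfJudge, which trains judge verifiers via self-supervision of the target model to accelerate large language model (LLM) inference. The method does not rely on human annotations or verifiable ground truths, improving generalization across diverse NLP tasks.

Details

Motivation: Existing judge decoding methods rely on human annotations or tasks with verifiable ground truths, which limits their generality across diverse NLP tasks. SelfJudge aims to train verifiers by self-supervision to provide a more general solution.

Result: Experiments show that SelfJudge achieves a better trade-off between inference speed and accuracy than judge decoding baselines.

Insight: SelfJudge's core insight is to derive verification criteria from the target model's own self-supervised data, reducing dependence on external annotations and making the method more general and flexible.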

Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding speeds this process up by relaxing the verification criteria, accepting draft tokens that may exhibit minor discrepancies from the target model's output. However, existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show that SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
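
The difference between strict verification and judge-style relaxed verification can be sketched for the greedy case. This is a simplified illustration, not SelfJudge itself: the `judge` callable stands in for a trained verifier, and the synonym rule is a made-up toy; real systems also sample from token distributions and append a bonus target token when every draft token is accepted.

```python
def accept_draft_tokens(draft_tokens, target_tokens, judge=None):
    """Greedy verification of one speculative block.

    target_tokens[i] is the target model's own choice at position i given
    the draft prefix. A draft token is kept if it matches, or if the
    optional judge accepts the discrepancy; otherwise the target's token
    replaces it and the remaining draft tokens are discarded.
    """
    accepted = []
    for position, (draft, target) in enumerate(zip(draft_tokens, target_tokens)):
        if draft == target or (judge is not None and judge(position, draft, target)):
            accepted.append(draft)
        else:
            accepted.append(target)
            break
    return accepted

draft = ["the", "cat", "sat", "down"]
target = ["the", "cat", "lay", "down"]

# Strict verification stops at the first mismatch.
strict = accept_draft_tokens(draft, target)

# Toy judge: treat "sat" and "lay" as an acceptable substitution.
tolerant = lambda pos, d, t: {d, t} == {"sat", "lay"}
relaxed = accept_draft_tokens(draft, target, judge=tolerant)
```

The relaxed run keeps all four draft tokens instead of three, which is where the speedup comes from; the accuracy side of the trade-off depends entirely on how well the judge recognizes meaning-preserving substitutions.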

Chiara Pugliese, Francesco Lettich, Guido Rocchietti, Chiara Renso, Fabio Pinelli

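TL;DR: This resource paper presents two semantically enriched human trajectory datasets and the pipeline used to build them, combining real GPS trajectories, contextual data (such as stops, POIs, transportation modes, and weather), and LLM-generated social media content to support multimodal and semantic analysis.

Details

Motivation: Existing human mobility datasets typically lack semantic richness and multimodal support, making them ill-suited to complex analyses. This paper aims to fill that gap.

Result: The datasets are released in tabular and RDF formats and support research tasks such as behavior modeling, mobility prediction, and knowledge graph construction.

Insight: LLM-generated social media content adds a semantic dimension to trajectory data, showing the potential of multimodal data fusion in mobility analysis.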

Abstract: In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.

[53] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing cs.CL | cs.AI | cs.LG

Zhe Li, Wei Zhao, Yige Li, Jun Sun

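TL;DR: This paper proposes an efficient framework that diagnoses undesirable large language model (LLM) behaviors, such as harmful content generation and biased outputs, directly in the model's activation space by analyzing representations and their gradients.

Details

Motivation: Existing attribution methods based on parameter gradients suffer from noisy signals and high computational complexity, making it hard to diagnose undesirable LLM behaviors effectively.

Result: The method performs well on tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination, and it supports fine-grained analysis.

Insight: The framework provides a powerful diagnostic tool for understanding and mitigating LLM risks, especially where precise attribution is required.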

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitively noisy signals and computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representations and their gradients. It operates directly in the model’s activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method on tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at https://github.com/plumprc/RepT.

[54] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards cs.CL | cs.AIPDF

Samyak Jhaveri, Praphul Singh, Jangwon Kim, Tara Taghavi, Krishnaram Kenthapadi

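TL;DR: This paper proposes a reinforcement learning framework for long-form clinical text generation that combines Group Relative Policy Optimization (GRPO) with the DocLens evaluator, directly optimizing factual grounding and completeness without training a separate reward model or relying on human-written references.

Details

Motivation: Automated clinical documentation must be complete and factually grounded, yet existing methods may depend on human references or complex reward models, which limits efficiency and quality.

Result: Experiments show that the method improves clinical note quality, reduces omissions and hallucinations, and is preferred more often in a qualitative evaluation by GPT-5.

Insight: The method shows the potential of optimizing generation quality directly, without a separate reward model or human references, and it suits real-world clinical documentation.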

Abstract: Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
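To illustrate the group-relative step that GRPO is built on, the following is a minimal sketch and not the authors' implementation. DocLens is not reproduced here: `claim_f1` is a hypothetical stand-in that turns claim-level precision and recall into a scalar reward, and the sample values are invented. Only the normalization of rewards within a group of sampled outputs reflects the general GRPO idea.

```python
from statistics import mean, pstdev


def claim_f1(precision: float, recall: float) -> float:
    """Hypothetical scalar reward: harmonic mean of claim precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented (precision, recall) pairs for three notes sampled from one dialogue.
group = [(0.9, 0.8), (0.6, 0.5), (0.95, 0.4)]
rewards = [claim_f1(p, r) for p, r in group]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and the rest negative ones, so no learned value function or separate reward model is needed in this sketch.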

[55] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs cs.CL | cs.LGPDF

Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula

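TL;DR: This paper studies whether prompting can make large language models (LLMs) simulate an earlier knowledge cutoff and evaluates how well this works. Prompting is effective when the model is queried directly about post-cutoff information, but it fails to induce forgetting of causally related content, which underlines the need for stricter evaluation in temporal prediction tasks.

Details

Motivation: In temporal prediction tasks, LLMs' reliance on pretraining data means that accurate predictions may reflect memorization rather than reasoning, leading to overestimates of generalization. The authors therefore examine how effectively prompts can simulate a knowledge cutoff.

Result: Prompting effectively simulates forgetting when post-cutoff information is queried directly, but it has limited effect on causally related knowledge that is not queried directly.

Insight: Prompting has clear limits as a way to simulate a knowledge cutoff, especially when the query is only causally related to the content to be forgotten. Temporal prediction tasks call for more rigorous evaluation.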

Abstract: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.

[56] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL cs.CL | cs.AIPDF

Dzmitry Pihulski, Karol Charchut, Viktoria Novogrodskaia, Jan Kocoń

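TL;DR: LLMSQL is a systematic revision and transformation of the WikiSQL dataset for the LLM era. It fixes the original dataset's structural and annotation problems and provides a clean benchmark suited to modern text-to-SQL models.

Details

Motivation: WikiSQL played an important role in early NL2SQL research, but its use has declined because of structural and annotation issues (inconsistent case sensitivity, data type mismatches, syntax errors, and so on). LLMSQL aims to provide a dataset better suited to the LLM era.

Result: Evaluation of multiple large language models (e.g., Gemma 3, LLaMA 3.2) supports LLMSQL's usefulness as a reliable benchmark for modern text-to-SQL tasks.

Insight: LLMSQL is not merely a dataset update but a benchmark designed for the LLM era, emphasizing direct generation and evaluation of full SQL queries rather than the token selection of traditional pointer-network models.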

Abstract: Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.
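Because LLMSQL stores questions and full SQL queries as plain text, a generated query can be checked by simply executing it. The sketch below shows one common way to do that, execution match against an in-memory SQLite database. The table, rows, and the `execution_match` helper are hypothetical illustrations and are not taken from the LLMSQL release.

```python
import sqlite3


def execution_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries run and yield the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that does not parse or execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical toy table standing in for one benchmark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ann", "Lions", 31), ("Bob", "Tigers", 24), ("Cy", "Lions", 12)],
)

gold = "SELECT name FROM players WHERE team = 'Lions' AND points > 20"
```

Comparing executed results rather than query strings accepts predictions that are written differently but are semantically equivalent, at the cost of occasionally accepting a wrong query that happens to return the same rows.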

[57] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs cs.CL | cs.AIPDF

Dzmitry Pihulski, Jan Kocoń

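TL;DR: This paper studies how large language models (LLMs) assess the offensiveness of political tweets when they adopt specific political and cultural perspectives, and finds that larger models with explicit reasoning abilities handle ideological and cultural differences better.

Details

Motivation: To examine how LLMs can personalize offensiveness judgments in multilingual, politically diverse settings, improving sensitivity and adaptability in sociopolitical text classification.

Result: Larger models (e.g., DeepSeek-R1) are more consistent and sensitive in offensiveness detection, while smaller models struggle to capture subtle distinctions. Reasoning capabilities significantly improve personalization and interpretability.

Insight: Reasoning mechanisms are key to adapting LLMs to sociopolitical text classification across languages and ideologies.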

Abstract: We explore how large language models (LLMs) assess offensiveness in political discourse when prompted to adopt specific political and cultural perspectives. Using a multilingual subset of the MD-Agreement dataset centered on tweets from the 2020 US elections, we evaluate several recent LLMs - including DeepSeek-R1, o4-mini, GPT-4.1-mini, Qwen3, Gemma, and Mistral - tasked with judging tweets as offensive or non-offensive from the viewpoints of varied political personas (far-right, conservative, centrist, progressive) across English, Polish, and Russian contexts. Our results show that larger models with explicit reasoning abilities (e.g., DeepSeek-R1, o4-mini) are more consistent and sensitive to ideological and cultural variation, while smaller models often fail to capture subtle distinctions. We find that reasoning capabilities significantly improve both the personalization and interpretability of offensiveness judgments, suggesting that such mechanisms are key to adapting LLMs for nuanced sociopolitical text classification across languages and ideologies.

[58] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness cs.CL | cs.LGPDF

Shreya Saha, Shurui Li, Greta Tuckute, Yuanning Li, Ru-Yuan Zhang

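TL;DR: This paper studies abstract meaning representations in the human language cortex. Using embeddings from vision and language models, it finds that embeddings averaged over multiple generated images or multiple paraphrases predict language-cortex responses more accurately, pointing to highly abstract meaning representations.

Details

Motivation: To examine how abstract meaning is in the human language system, and how the language cortex represents and processes semantic information beyond specific linguistic forms.

Result: Averaging embeddings over multiple images or paraphrases improves prediction accuracy. With contextually enriched paraphrases, predictions even surpass those based on the embedding of the original sentence, suggesting that the language system holds richer semantic representations.

Insight: Semantic representations in the language cortex go beyond those of language models, being both more abstract and richer.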

Abstract: The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses, sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting “I had a pancake” to include details like “maple syrup”) further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.
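The aggregation step described above amounts to an element-wise mean over several embeddings of the same meaning. The sketch below shows only that step, under the assumption that each paraphrase (or generated image) has already been embedded into a vector of equal length. The `mean_embedding` helper and the toy vectors are hypothetical; the encoding models and the regression onto neural responses are out of scope.

```python
def mean_embedding(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 3-dimensional embeddings for three paraphrases of one sentence.
paraphrase_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [2.0, 3.0, 2.0],
]
pooled = mean_embedding(paraphrase_embeddings)
```

Averaging keeps what the paraphrases share and damps form-specific variation, which is one plausible reading of why the pooled vector predicts responses better than any single paraphrase.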

[59] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference cs.CL | cs.AIPDF

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang

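TL;DR: ChunkLLM is a lightweight, pluggable framework that accelerates large language model inference by improving the computational efficiency of self-attention.

Details

Motivation: Transformer-based large models are computationally inefficient because self-attention scales quadratically with the number of input tokens. Existing methods fall short on semantic completeness or on training-inference efficiency, so a more comprehensive solution is needed.

Result: ChunkLLM attains comparable performance on short-text benchmarks and maintains 98.64% of the performance on long-context benchmarks with a 48.58% key-value cache retention rate. Its maximum speedup is 4.48x when processing 120K long texts.

Insight: Adapter modules that make self-attention cheaper can balance semantic completeness against training-inference efficiency and substantially speed up long-text processing.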

Abstract: Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention’s quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

[60] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents cs.CL | cs.AIPDF

Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao

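TL;DR: The paper proposes Instance-Level Context Learning (ILCL), which addresses the tendency of large language model (LLM) agents to overlook verifiable, reusable facts tied to a specific environment instance in complex tasks. Through guided exploration and a lightweight plan-act-extract loop, the method substantially improves task success rate and efficiency.

Details

Motivation: LLM agents usually rely on environment-level manuals and task-level guidance but overlook instance-level context (object locations, crafting recipes, and so on), which is a common cause of failure in complex tasks. The authors argue that efficiently exploring and exploiting this context is key to better agent performance.

Result: Experiments on TextWorld, ALFWorld, and Crafter show consistent gains in success and efficiency. In TextWorld, ReAct's mean success rate rises from 37% to 95%, and IGE's from 81% to 95%.

Insight: Instance-level context is key to LLM agents' success in complex tasks. By turning one-off exploration into persistent knowledge, the method highlights the value of context reuse and efficient exploration in agent design.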

Abstract: Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct’s mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.

[61] Pretraining with hierarchical memories: separating long-tail and common knowledge cs.CL | cs.AI | cs.LGPDF

Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

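TL;DR: The paper proposes pretraining small language models together with hierarchical memory banks. A small, context-dependent memory block is fetched and added to the model, which improves performance while reducing the number of parameters the model itself needs.

Details

Motivation: Modern language models improve by scaling parameters, but compressing all world knowledge into model parameters is unnecessary and impractical for edge devices. The authors therefore store long-tail knowledge in memory parameters, while the small language model focuses on common knowledge and general reasoning.

Result: A 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank performs comparably to a regular model with more than 2x the parameters. The authors scale parametric memories to over 21B parameters and find that the hierarchical feed-forward memories work robustly across transformer architectures.

Insight: Hierarchical memory design can separate long-tail knowledge from common knowledge, suggesting a route to efficient deployment of small language models.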

Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.

[62] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems cs.CL | cs.LGPDF

Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi

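TL;DR: The paper proposes an efficient answer-selection method for multi-LLM systems based on a calibrated log-likelihood score, aiming to improve reasoning performance without costly external verifiers or human evaluation.

Details

Motivation: Selecting the most reliable response from multiple LLMs is challenging. Existing methods depend on costly external verification or repeated sampling, which limits the potential of multi-LLM systems.

Result: The method yields improvements of approximately 4%, 3%, and 5% on GSM8K, MMLU (6 subsets), and ARC, respectively.

Insight: The method implicitly leverages the models' inherent knowledge and confidence, giving multi-LLM systems an efficient and reliable way to select answers.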

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
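The abstract does not spell out how the log-likelihood score is calibrated, so the sketch below is only one plausible reading and not the authors' method. It assumes that every model can score every candidate answer with token log-probabilities, length-normalizes those scores, and then z-scores them per scoring model so that no single model's scale dominates. All names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def length_normalized(token_logprobs):
    """Average token log-probability, so long answers are not penalized for length."""
    return sum(token_logprobs) / len(token_logprobs)


def select_answer(logprobs_by_scorer, eps=1e-8):
    """Pick the candidate with the highest mean per-scorer z-scored log-likelihood.

    logprobs_by_scorer[scorer][candidate] is the list of token log-probabilities
    that `scorer` assigns to `candidate`.
    """
    totals = {}
    for scorer, per_candidate in logprobs_by_scorer.items():
        raw = {c: length_normalized(lp) for c, lp in per_candidate.items()}
        mu, sigma = mean(raw.values()), pstdev(raw.values())
        for candidate, score in raw.items():
            # Assumed calibration: put each scorer's scores on a common scale.
            z = (score - mu) / (sigma + eps)
            totals.setdefault(candidate, []).append(z)
    return max(totals, key=lambda c: mean(totals[c]))


# Invented token log-probabilities for three candidate answers and two scorers.
scores = {
    "model_a": {"x": [-0.1, -0.3], "y": [-0.4, -0.4], "z": [-3.0]},
    "model_b": {"x": [-1.0], "y": [-0.5, -0.5], "z": [-0.9]},
}
best = select_answer(scores)
```

The selection needs only one scoring pass per model and candidate, which is the kind of cost profile the abstract contrasts with external verifiers and repeated sampling.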

[63] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation cs.CLPDF

Haoyue Bai, Haoyu Wang, Shengyu Chen, Zhengzhang Chen, Lu-An Tang

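TL;DR: The paper proposes a rule-driven routing framework for hybrid-source retrieval-augmented generation (RAG) that dynamically selects the most suitable retrieval path (database or documents) to improve accuracy and efficiency.

Details

Motivation: Existing RAG systems rely mainly on unstructured documents and overlook the strengths of relational databases. The paper aims to exploit the complementarity of the two sources and to optimize query performance through rule-driven routing.

Result: On three QA benchmarks, the framework outperforms static strategies and learned routing baselines, achieving higher accuracy at moderate computational cost.

Insight: Query types align with retrieval paths in regular ways, so rule-driven routing can substantially improve hybrid-source RAG.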

Abstract: Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA), yet they often struggle in domain-specific scenarios where accurate and up-to-date information is required. Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge, but existing systems primarily rely on unstructured documents, while largely overlooking relational databases, which provide precise, timely, and efficiently queryable factual information, serving as indispensable infrastructure in domains such as finance, healthcare, and scientific research. Motivated by this gap, we conduct a systematic analysis that reveals three central observations: (i) databases and documents offer complementary strengths across queries, (ii) naively combining both sources introduces noise and cost without consistent accuracy gains, and (iii) selecting the most suitable source for each query is crucial to balance effectiveness and efficiency. We further observe that query types show consistent regularities in their alignment with retrieval paths, suggesting that routing decisions can be effectively guided by systematic rules that capture these patterns. Building on these insights, we propose a rule-driven routing framework. A routing agent scores candidate augmentation paths based on explicit rules and selects the most suitable one; a rule-making expert agent refines the rules over time using QA feedback to maintain adaptability; and a path-level meta-cache reuses past routing decisions for semantically similar queries to reduce latency and cost. Experiments on three QA benchmarks demonstrate that our framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.
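The routing agent and the path-level cache can be pictured with a small sketch. Everything here is an assumption made for illustration: the keyword rules, the weights, the path names, and the exact-match cache, which stands in for the semantic-similarity cache the abstract describes. The rule-making expert agent that refines rules from QA feedback is not modeled.

```python
from dataclasses import dataclass
from typing import Callable

PATHS = ("database", "documents", "both")


@dataclass
class Rule:
    matches: Callable[[str], bool]  # predicate over the normalized query
    path: str                       # path this rule votes for
    weight: float                   # strength of the vote


# Hypothetical hand-written rules; the paper's rules are refined over time.
RULES = [
    Rule(lambda q: any(w in q for w in ("how many", "average", "total")), "database", 2.0),
    Rule(lambda q: any(w in q for w in ("why", "explain", "describe")), "documents", 2.0),
    Rule(lambda q: any(ch.isdigit() for ch in q), "database", 0.5),
]


class Router:
    def __init__(self, rules):
        self.rules = rules
        self.cache = {}  # normalized query -> previously chosen path

    def route(self, query: str) -> str:
        key = " ".join(query.lower().split())
        if key in self.cache:
            return self.cache[key]
        scores = {path: 0.0 for path in PATHS}
        for rule in self.rules:
            if rule.matches(key):
                scores[rule.path] += rule.weight
        best = max(scores, key=scores.get)
        if scores[best] == 0.0:
            best = "both"  # no rule fired, so fall back to querying both sources
        self.cache[key] = best
        return best


router = Router(RULES)
```

For example, `router.route("How many invoices were issued in 2023?")` selects the database path under these toy rules, and a repeated query is answered from the cache without rescoring.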

[64] Words That Make Language Models Perceive cs.CL | cs.CV | cs.LGPDF

Sophie L. Wang, Phillip Isola, Brian Cheung

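TL;DR: The paper examines whether large language models (LLMs) trained purely on text can be prompted with sensory cues (such as 'see' or 'hear') to activate modality-appropriate representations, and shows that simple prompt engineering can steer the model toward representations closer to those of vision or audio encoders.

Details

Motivation: Although LLMs are trained only on text, they may implicitly learn multimodal regularities. The paper tests whether explicit sensory prompts can surface this latent structure.

Result: Under sensory prompting, the representations of text-only LLMs align more closely with those of specialist vision and audio encoders.

Insight: Text-only language models may implicitly encode multimodal knowledge, and sensory prompting is a lightweight way to activate it.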

Abstract: Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to ‘see’ or ‘hear’, it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.

[65] CLARITY: Clinical Assistant for Routing, Inference, and Triage cs.CL | cs.AI | cs.MAPDF

Vladimir Shaposhnikov, Aleksandr Nesterov, Ilia Kopanichuk, Ivan Bakulin, Egor Zhelvakov

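TL;DR: CLARITY is an AI-driven clinical assistant platform that combines a finite state machine with large language models for patient-to-specialist routing, clinical consultation, and severity assessment.

Details

Motivation: To improve the efficiency and accuracy of patient triage, shorten consultations, and fit the requirements of healthcare IT systems.

Result: Over 55,000 user dialogues were completed within two months of deployment, 2,500 of which were expert-annotated for validation. CLARITY surpasses human-level first-attempt routing precision, with consultations up to 3 times shorter.

Insight: A modular design and a hybrid architecture can improve an AI system's adaptability and performance in complex healthcare settings.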

Abstract: We present CLARITY (Clinical Assistant for Routing, Inference, and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patients’ conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ a Large Language Model (LLM) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, and is flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale nation-wide inter-hospital IT platform, with over 55,000 content-rich user dialogues completed within the two months of deployment, 2,500 of which were expert-annotated for subsequent validation. The validation results show that CLARITY surpasses human-level performance in terms of first-attempt routing precision, while requiring consultations up to 3 times shorter than with a human.

[66] Knowledge-Graph Based RAG System Evaluation Framework cs.CL | cs.AIPDF

Sicheng Dong, Vahid Zolfaghari, Nenad Petrovic, Alois Knoll

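TL;DR: This paper proposes a knowledge-graph (KG) based evaluation framework for RAG systems that extends the RAGAS tool, using multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics.

Details

Motivation: Traditional evaluation metrics struggle to capture the key features of modern LLM-generated content, which is often highly fluent and natural, so they are insufficient for evaluating RAG systems comprehensively.

Result: Experiments show that the KG-based evaluation is more sensitive to semantic differences than RAGAS and correlates better with human judgments.

Insight: Future work should focus on more sensitive semantic evaluation metrics and on further improving the multi-hop reasoning and clustering methods.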

Abstract: Large language models (LLMs) have become a significant research focus and are used in various fields, such as text generation and dialog systems. One of the most essential applications of LLMs is Retrieval Augmented Generation (RAG), which greatly enhances generated content’s reliability and relevance. However, evaluating RAG systems remains a challenging task. Traditional evaluation metrics struggle to effectively capture the key features of modern LLM-generated content that often exhibits high fluency and naturalness. Inspired by the RAGAS tool, a well-known RAG evaluation framework, we extended this framework into a KG-based evaluation paradigm, enabling multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics. By incorporating these comprehensive evaluation criteria, we gain a deeper understanding of RAG systems and a more nuanced perspective on their performance. To validate the effectiveness of our approach, we compare its performance with RAGAS scores and construct a human-annotated subset to assess the correlation between human judgments and automated metrics. In addition, we conduct targeted experiments to demonstrate that our KG-based evaluation method is more sensitive to subtle semantic differences in generated outputs. Finally, we discuss the key challenges in evaluating RAG systems and highlight potential directions for future research.

[67] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models cs.CLPDF

Tolúlọpé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu

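TL;DR: The paper studies how modality adapters (MAs) in spoken language models (SLMs) map speech-encoder outputs to representations that the decoder language model (LM) can understand, and finds two strategies: a meaning-based English interlingua and a phonetic representation expressed with English words.

Details

Motivation: Understanding how MAs transform speech representations matters for improving multimodal models, yet little is known about how they work.

Result: Models that use a Whisper encoder tend to represent the meaning of the input through an English-based interlingua. Models that do not (e.g., Phi-4-Multimodal-Instruct) instead represent the phonetics of the input, expressed with English words.

Insight: The MA's representation strategy appears to depend on the speech encoder's training objective (speech recognition only versus also translation, as the authors hypothesise), which can guide the design of multimodal models.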

Abstract: Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don’t, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
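The probing idea, finding the nearest decoder LM token to an MA representation, can be sketched as a nearest-neighbor lookup in the decoder's input embedding table. Cosine similarity is an assumption here (the abstract does not name the metric), and the three-token table and query vector are toy values rather than real model weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_token(vector, embedding_table):
    """Return the vocabulary token whose embedding is most similar to `vector`."""
    return max(embedding_table, key=lambda tok: cosine(vector, embedding_table[tok]))


# Toy stand-in for the decoder LM's token embedding table.
embedding_table = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "house": [0.0, 0.0, 1.0],
}

# Toy stand-in for one modality-adapter output frame.
adapter_output = [0.9, 0.1, 0.0]
token = nearest_token(adapter_output, embedding_table)
```

Reading off the nearest tokens frame by frame is what lets one tell a meaning-like sequence (English words that translate the input) from a sound-like one (English words that merely approximate the pronunciation).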

[68] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models cs.CLPDF

Rui Qi, Zhibo Man, Yufeng Chen, Fengran Mo, Jinan Xu

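TL;DR: The paper proposes SoT (Structured-of-Thought), a training-free method that improves large language model performance on multilingual reasoning through a multi-step transformation (Language Thinking Transformation and Structured Knowledge Transformation).

Details

Motivation: Current LLMs reason well in high-resource languages but poorly in others. Resource constraints and cross-lingual differences in expression limit multilingual reasoning.

Result: Experiments show that SoT outperforms baselines on multiple multilingual reasoning benchmarks and can be combined with other training-free strategies for further gains.

Insight: Structured representations are an effective route to multilingual reasoning and improve cross-lingual reasoning without additional training.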

Abstract: Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, this reasoning capacity has not transferred successfully to non-high-resource languages because of resource constraints, so models struggle with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves the performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling the models to understand queries in different languages in a more sophisticated way. Besides, SoT effectively guides LLMs toward more concentrated reasoning to maintain consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks when adapting to various backbones of LLMs. It can also be integrated with other training-free strategies for further improvements. Our code is available at https://github.com/Cherry-qwq/SoT.

[69] Self-Improvement in Multimodal Large Language Models: A Survey cs.CLPDF

Shijian Deng, Kai Wang, Tianyu Yang, Harsh Singh, Yapeng Tian

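TL;DR: This is the first comprehensive survey of self-improvement in multimodal large language models (MLLMs). It discusses existing methods from three perspectives (data collection, data organization, and model optimization), summarizes evaluations and downstream applications, and outlines future research directions.

Details

Motivation: Following the success of self-improvement in unimodal large language models (LLMs), extending the capability to the multimodal domain holds great potential, but research remains limited. The survey aims to fill this gap and support further development of MLLMs.

Result: The survey indicates that self-improvement can enhance MLLM performance while reducing human effort, though challenges such as data heterogeneity and modality alignment remain.

Insight: Self-improvement in the multimodal domain needs more attention to cross-modal consistency and dynamic adaptability. Future directions include more efficient optimization frameworks and multimodal collaborative learning mechanisms.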

Abstract: Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.

[70] TravelBench : Exploring LLM Performance in Low-Resource Domains cs.CL | cs.AIPDF

Srinivas Billa, Xiaonan Jing

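TL;DR: The paper presents TravelBench, a benchmark for a low-resource domain (travel), analyzes LLM performance on its tasks, and finds that general benchmark results are insufficient to reveal performance bottlenecks in low-resource tasks.

Details

Motivation: Existing LLM benchmarks say little about low-resource tasks, which makes it hard to evaluate models in these domains, so domain-specific benchmarks are needed.

Result: General benchmark results do not accurately reflect performance in low-resource tasks. Even with large training FLOPs, out-of-the-box LLMs hit bottlenecks in complex, domain-specific scenarios, and reasoning gives a larger boost to smaller LLMs.

Insight: In low-resource domains, domain-specific evaluation is essential, and reasoning is especially helpful for small models.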

Abstract: Results on existing LLM benchmarks capture little information about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address these challenges, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed the performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Despite the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.

[71] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking cs.CLPDF

KM Pooja, Cheng Long, Aixin Sun

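TL;DR: PGMEL is a policy gradient-based generative adversarial network for multimodal entity linking that improves representation learning by generating high-quality negative samples.

Details

Motivation: Existing multimodal entity linking techniques have not explored the potential of selecting high-quality negative samples, which can limit representation learning.

Result: On the Wiki-MEL, Richpedia-MEL, and WikiDiverse datasets, PGMEL outperforms existing methods by selecting challenging negative samples.

Insight: High-quality negative samples are crucial to representation learning for multimodal entity linking.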

Abstract: The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, targeted to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting where the generator is responsible for generating high-quality negative samples, and the discriminator is assigned the responsibility for the metric learning tasks. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results based on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representation by selecting challenging negative samples and outperforms state-of-the-art methods.

[72] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context cs.CLPDF

Santhosh G S, Akshay Govind S, Gokul S Krishnan, Balaraman Ravindran, Sriraam Natarajan

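TL;DR: The paper proposes an evaluation framework based on an encoder trained with contrastive learning to assess fine-grained bias in LLMs in the Indian context, and introduces IndiCASA, a new dataset of 2,575 human-validated sentences. All evaluated models show some degree of bias, with disability-related bias being especially persistent.

Details

Motivation: As LLMs are widely deployed in high-stakes applications, especially in culturally diverse contexts like India, existing bias evaluation methods fail to capture subtle stereotypes, so a more precise evaluation framework is needed.

Result: All evaluated LLMs exhibit bias. Disability-related bias is the most persistent, while religion bias is generally lower, likely due to global debiasing efforts.

Insight: The study characterizes how bias is distributed in LLMs in the Indian context and underscores the need for fairer model development.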

Abstract: Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities. However, their increasing deployment in high-stakes applications necessitates rigorous evaluation of embedded biases, particularly in culturally diverse contexts like India where existing embedding-based bias assessment methods often fall short in capturing nuanced stereotypes. We propose an evaluation framework based on an encoder trained using contrastive learning that captures fine-grained bias through embedding similarity. We also introduce a novel dataset - IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes) comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status. Our evaluation of multiple open-weight LLMs reveals that all models exhibit some degree of stereotypical bias, with disability-related biases being notably persistent and religion bias generally lower, likely due to global debiasing efforts, demonstrating the need for fairer model development.

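[73] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback cs.CL

Hangfan Zhang, Siyuan Xu, Zhimeng Guo, Huaisheng Zhu, Shicheng Liu

TL;DR: The paper proposes a reinforcement learning method driven by self-generated feedback, in which a large language model (LLM) alternates between proposing tasks and solving them to learn data-efficiently. Two self-awareness mechanisms (difficulty prediction and limit breaking) yield significant performance gains with only a small amount of extra data.

Details

Motivation: Conventional reinforcement learning for LLMs requires large amounts of annotated data, which is costly. The paper aims to reduce dependence on external data through a self-feedback mechanism and so make training more efficient.

Result: A 53.8% relative improvement across nine benchmarks with less than 1.2% extra data.

Insight: Self-feedback combined with dynamically adjusted task difficulty can substantially improve training efficiency while reducing reliance on external data, suggesting a path toward self-evolving LLMs.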

Abstract: Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial efforts in creating and annotating data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks showing a 53.8% relative improvement with less than 1.2% extra data demonstrate the efficacy of self-aware RL and underscore the promise of self-evolving agent training.

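[74] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments cs.CL

Tien Phat Nguyen, Vu Minh Ngo, Tung Nguyen, Linh Van Ngo, Duc Anh Nguyen

TL;DR: XTRA is a new cross-lingual topic modeling framework that combines Bag-of-Words modeling with multilingual embeddings and introduces a dual mechanism of representation alignment and topic alignment, significantly improving topic coherence, diversity, and cross-lingual alignment quality.

Details

Motivation: Existing cross-lingual topic modeling methods fall short on topic coherence and cross-lingual alignment. XTRA is designed to address this.

Result: Experiments on multilingual corpora show that XTRA significantly outperforms existing baselines.

Insight: XTRA's dual alignment mechanism secures both topic interpretability and cross-lingual consistency, offering a new approach to cross-lingual topic modeling.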

Abstract: Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, aligning document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, projecting topic-word distributions into the same space to enforce cross-lingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.

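[75] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering cs.CL | cs.IR

Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu

TL;DR: StepChain GraphRAG combines question decomposition with a Breadth-First Search (BFS) reasoning flow, improving the performance and explainability of multi-hop question answering (QA) and achieving state-of-the-art results on multiple datasets.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods struggle to integrate iterative reasoning steps with external knowledge retrieval in multi-hop QA, which hurts accuracy and explainability.

Result: State-of-the-art results on MuSiQue, 2WikiMultiHopQA, and HotpotQA, with average gains of 2.57% in EM and 2.13% in F1.

Insight: The study highlights the potential of combining dynamic knowledge graph construction with multi-hop reasoning, while noting that computational overhead and LLM hallucinations still need to be addressed.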

Abstract: Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
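
The BFS-based traversal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bfs_evidence_chains`, the toy graph, the `is_relevant` predicate (a stand-in for whatever relevance test the paper uses), and the `max_hops` limit are all assumptions made for illustration.

```python
from collections import deque


def bfs_evidence_chains(graph, start, is_relevant, max_hops=2):
    """Breadth-first expansion from `start`, following only relevant edges.

    `graph` maps an entity to a list of (relation, neighbor) pairs.
    Returns every chain of (head, relation, tail) triples found within
    `max_hops` hops, in the order BFS discovers them.
    """
    chains = []
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget used up for this branch
        for relation, neighbor in graph.get(node, []):
            if neighbor in visited or not is_relevant(relation):
                continue  # skip revisits and edges judged irrelevant
            visited.add(neighbor)
            new_path = path + [(node, relation, neighbor)]
            chains.append(new_path)
            queue.append((neighbor, new_path))
    return chains


# Hypothetical toy graph for the sub-question "Where was the director born?"
toy_graph = {
    "Inception": [("directed_by", "Christopher Nolan"), ("released_in", "2010")],
    "Christopher Nolan": [("born_in", "London"), ("directed", "Inception")],
    "London": [("capital_of", "United Kingdom")],
}
wanted = {"directed_by", "born_in"}

for chain in bfs_evidence_chains(toy_graph, "Inception", lambda r: r in wanted):
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in chain))
```

Filtering edges before expanding them is what keeps the evidence handed to the language model small: irrelevant branches such as the release year are never added to a chain.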

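[76] Evaluating Large Language Models for IUCN Red List Species Information cs.CL | cs.AI | I.2.7; I.2.6; J.3

Shinya Uryu

TL;DR: The study evaluates five large language models on IUCN Red List species information. The models excel at taxonomic tasks (94.9%) but perform poorly on conservation reasoning (27.2%), exposing a knowledge-reasoning gap. The author recommends a hybrid approach that keeps human experts involved.

Details

Motivation: Large language models are being widely adopted in conservation in response to the biodiversity crisis, but their reliability for species assessment is unclear. The study validates these models on the core assessment components of the IUCN Red List.

Result: Models excel at taxonomic tasks (94.9%) but perform poorly on reasoning tasks such as conservation status assessment (27.2%). They also show a systematic bias toward charismatic vertebrates.

Insight: The study exposes a knowledge-reasoning gap in large language models: they suit information retrieval, but judgment-based tasks need human experts in the loop. This finding offers guidance for deploying the models responsibly.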

Abstract: Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.

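[77] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation cs.CL | cs.AI | 68T20, 90C27 | I.2.8; I.2.3; G.1.6

Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel, Kamrujjaman, Eftakhar Ahmed Arnob

TL;DR: The paper presents the first comprehensive constraint satisfaction problem (CSP) formulation of Wordle and introduces two new strategies, CSP-Aware Entropy and a Probabilistic CSP framework, that significantly improve solving performance and robustness.

Details

Motivation: Existing Wordle solvers typically rely on entropy maximization or frequency-based heuristics and lack formal constraint treatment. The study fills this gap with a CSP approach.

Result: CSP-Aware Entropy averages 3.54 guesses with a 99.9% success rate; Probabilistic CSP reaches 100% success at every noise level; validation on Spanish yields an 88% success rate.

Insight: Formal CSP-based methods outperform classical information-theoretic and learning-based approaches in structured puzzle domains, and the core CSP principles transfer across languages.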

Abstract: Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen’s d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher’s exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
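
To make "information gain after constraint propagation" concrete, here is a small sketch under simplifying assumptions. It approximates constraint propagation by discarding every candidate inconsistent with the feedback observed so far, then scores a guess by the Shannon entropy of the feedback patterns it would induce on the surviving set. The function names and the tiny word list are hypothetical; the paper's actual propagation scheme is not reproduced here.

```python
import math
from collections import Counter


def feedback(guess, answer):
    """Wordle pattern: 'g' = green, 'y' = yellow, 'b' = gray (duplicates handled)."""
    pattern = ["b"] * len(guess)
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "g"
        else:
            remaining[a] += 1  # answer letters still available for yellows
    for i, g in enumerate(guess):
        if pattern[i] == "b" and remaining[g] > 0:
            pattern[i] = "y"
            remaining[g] -= 1
    return "".join(pattern)


def prune(candidates, guess, pattern):
    """Keep only the words consistent with an observed (guess, pattern) pair."""
    return [w for w in candidates if feedback(guess, w) == pattern]


def guess_entropy(guess, candidates):
    """Expected information (bits) from `guess` over the current candidate set."""
    counts = Counter(feedback(guess, w) for w in candidates)
    total = len(candidates)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


words = ["crane", "crate", "trace", "grace", "brace", "slate"]
# Suppose "slate" was played and returned gray, gray, green, gray, green.
survivors = prune(words, "slate", "bbgbg")
best = max(survivors, key=lambda w: guess_entropy(w, survivors))
print(survivors, best)
```

Scoring on `survivors` rather than on `words` is the point of the sketch: entropy computed on the raw list would still credit a guess for separating candidates that the constraints have already ruled out.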

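[78] Self-Reflective Generation at Test Time cs.CL

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu

TL;DR: SRGen is a lightweight test-time self-reflection framework that reflects before generating at uncertain points and dynamically adjusts the token probability distribution, significantly improving the reasoning ability of language models.

Details

Motivation: Large language models (LLMs) are prone to early error propagation on complex reasoning tasks. Existing self-reflection methods either revise full drafts or learn self-correction through expensive training, which makes them inefficient and reactive.

Result: SRGen significantly improves performance on mathematical reasoning benchmarks. On AIME2024, for example, it yields absolute gains of +12.0% on Pass@1 and +13.3% on Cons@5.

Insight: SRGen is a plug-and-play method compatible with other techniques such as RLHF and SLOT, offering an efficient and reliable option for LLM reasoning.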

Abstract: Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. Especially, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
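
The abstract does not spell out how the dynamic entropy threshold is computed, so the following is only a sketch of one plausible rule: flag a decoding step when its next-token entropy exceeds the mean plus `k` standard deviations of the entropies in a recent window. The names `token_entropy` and `flag_uncertain_steps`, and the `window` and `k` parameters, are assumptions for illustration, not SRGen's actual criterion.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def flag_uncertain_steps(step_probs, window=4, k=1.5):
    """Return indices of steps whose entropy spikes above a running threshold.

    The threshold adapts to the sequence: mean + k * std of the entropies
    seen in the previous `window` steps (assumed rule, see lead-in).
    """
    flagged, history = [], []
    for step, probs in enumerate(step_probs):
        h = token_entropy(probs)
        if len(history) >= 2:  # need a little history before thresholding
            recent = history[-window:]
            if h > mean(recent) + k * pstdev(recent):
                flagged.append(step)
        history.append(h)
    return flagged


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
steps = [confident, confident, confident, uncertain, confident]
print(flag_uncertain_steps(steps))
```

In a framework like the one described, the flagged positions are where the extra reflection work would be spent, which is how the overhead stays bounded.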

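[79] Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles cs.CL | cs.AI

Rongchen Guo, Vincent Francoeur, Isar Nejadgholi, Sylvain Gagnon, Miodrag Bolic

TL;DR: The paper studies the distinction between descriptive and expressive semantics in speech emotion recognition (SER), showing experimentally that descriptive semantics align with intended emotions while expressive semantics correlate with evoked emotions.

Details

Motivation: SER accuracy is limited by the complex emotional nuances in speech. The study aims to improve recognition by separating descriptive semantics from expressive semantics.

Result: Descriptive semantics align with intended emotions and expressive semantics correlate with evoked emotions, offering a new perspective for SER applications.

Insight: Distinguishing descriptive from expressive semantics can make AI systems more context-aware.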

Abstract: Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represents the contextual content of speech, and expressive semantics, which reflects the speaker’s emotional state. After participants watched emotionally charged movie segments, we recorded audio clips of them describing their experiences, along with the intended emotion tags for each clip, participants’ self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.

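[80] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting? cs.CL | cs.SD

Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando

TL;DR: The paper systematically compares Chain-of-Thought (CoT) prompting and Direct prompting for speech-to-text translation (S2TT) and finds that Direct prompting improves more consistently than CoT as the amount of data grows.

Details

Motivation: The study examines how the performance gap between CoT and Direct prompting in S2TT changes as training data increases, in order to determine which approach is better suited to future large-scale data settings.

Result: Direct prompting improves more consistently than CoT prompting as data increases, suggesting it may become the more effective approach as larger S2TT resources are created.

Insight: The study shows the potential of Direct prompting in large-scale data settings and gives direction for future S2TT model design.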

Abstract: Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

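[81] Semantic Similarity in Radiology Reports via LLMs and NER cs.CL

Beth Pearson, Ahmed Adnan, Zahraa Abdallah

TL;DR: The paper explores comparing the semantic similarity of radiology reports with LLMs and NER, and proposes Llama-EntScore, which combines Llama 3.1 with NER and outperforms either used alone.

Details

Motivation: Comparing radiology reports matters for training radiologists and for diagnostic accuracy, but applying AI in this area remains challenging, in particular because LLMs need specialised domain knowledge.

Result: Llama-EntScore achieves 67% exact-match accuracy and 93% accuracy within ±1 point, outperforming LLMs and NER used on their own.

Insight: Combining LLMs with traditional NER assesses semantic differences between radiology reports more effectively and provides interpretable feedback.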

Abstract: Radiology report evaluation is a crucial part of radiologists’ training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer junior radiologists valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores, outperforming both LLMs and NER used independently. Code is available at https://github.com/otmive/llama_reports.
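
The two headline numbers (exact-match accuracy and accuracy within +/- 1) are straightforward to compute once predicted and reference scores are on the same integer scale. The sketch below shows that arithmetic only; the function name `score_agreement` and the sample scores are made up for illustration and are not taken from the paper.

```python
def score_agreement(predicted, reference, tolerance=1):
    """Return (exact-match accuracy, accuracy within +/- tolerance)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(predicted, reference))
    exact = sum(p == r for p, r in pairs) / len(pairs)
    close = sum(abs(p - r) <= tolerance for p, r in pairs) / len(pairs)
    return exact, close


# Hypothetical similarity scores for four report pairs.
model_scores = [3, 2, 5, 1]
radiologist_scores = [3, 3, 2, 1]
exact, close = score_agreement(model_scores, radiologist_scores)
print(f"exact match: {exact:.0%}, within +/-1: {close:.0%}")
```

Reporting both figures is useful on an ordinal scale: a prediction that is off by one point is a much milder error than one that is off by three, and exact-match accuracy alone would not show the difference.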

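[82] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation cs.CL | cs.SD

Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando

TL;DR: The paper examines the role of Chain-of-Thought (CoT) prompting in speech-to-text translation (S2TT), finds that it relies on the transcript rather than the speech signal, and proposes simple training interventions that improve the use of speech information.

Details

Motivation: Current S2TT systems built as a cascade of automatic speech recognition (ASR) and text-to-text translation (T2TT) suffer from error propagation and cannot exploit acoustic cues. The study tests whether CoT overcomes these problems.

Result: CoT behaves like a cascade in S2TT, relying on the transcript rather than the speech. Simple training interventions increase the use of speech information and improve robustness.

Insight: The paper challenges the assumed advantages of CoT and stresses the need for translation architectures that explicitly integrate acoustic information.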

Abstract: Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

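[83] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys? cs.CL

Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang

TL;DR: SurveyBench is a fine-grained, quiz-driven evaluation framework for assessing how well LLMs (and LLM agents) can automatically write academic surveys. It exposes the gap between existing methods and human standards.

Details

Motivation: Writing an academic survey is laborious and demanding. Surveys produced by existing automated methods (LLM4Survey) fall short in quality, and there is no rigorous benchmark aligned with readers' needs.

Result: Existing LLM4Survey methods score on average 21% lower than humans in content-based evaluation.

Insight: SurveyBench exposes core weaknesses of LLM-generated surveys, such as logical coherence and clarity of insights, and points to directions for improvement.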

Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and a rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies is lacking. To fill the gap, we propose a fine-grained, quiz-driven evaluation framework, SurveyBench, featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers’ informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content-based evaluation).

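[84] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models cs.CL

Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier

TL;DR: The paper presents the first systematic study of multilingual calibration in large language models, finds that calibration is worse for non-English languages, and proposes a training-free method (LACE) that improves calibration using intermediate layers.

Details

Motivation: Confidence calibration of large language models in multilingual settings is under-studied, and non-English languages fare worse, so a more equitable solution is needed.

Result: LACE significantly improves multilingual calibration, especially for non-English languages.

Insight: English-centric training leaves the final layer poorly calibrated, while intermediate layers provide a more equitable calibration signal across languages.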

Abstract: Confidence calibration, the alignment of a model’s predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic study of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model’s internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight that late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
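
Calibration as defined in the first sentence of the abstract is commonly quantified with expected calibration error (ECE). The abstract does not say which metric the paper reports, so treat the choice of ECE here as an assumption; the function name, the equal-width binning, and the sample data are likewise illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: the model claims 90% confidence but is right 3 times in 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(overconfident, 2))
```

Computing this score separately per language, and per layer from which confidence is read out, is the kind of comparison the abstract describes, with lower values indicating better calibration.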

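[85] EditLens: Quantifying the Extent of AI Editing in Text cs.CL

Katherine Thai, Bradley Emi, Elyas Masrour, Mohit Iyyer

TL;DR: The paper presents EditLens, a model that quantifies how much AI editing a text contains, using lightweight similarity metrics as supervision to distinguish human-written, AI-generated, and mixed text.

Details

Motivation: Prior work mostly focuses on detecting fully AI-generated text and overlooks AI-edited text, which matters in areas such as education and policy.

Result: Strong performance on binary (F1=94.7%) and ternary (F1=90.4%) classification, with a Grammarly case study demonstrating practical use.

Insight: AI-edited text can be detected and the degree of editing can be quantified, with implications for authorship attribution, education, and policy.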

Abstract: A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
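
The abstract mentions "lightweight similarity metrics" without naming them. As a rough illustration of the idea only, the sketch below uses the word-level `difflib.SequenceMatcher` ratio from the Python standard library as a stand-in metric; the function name `edit_magnitude` and the example sentences are invented, and the paper's actual metrics may differ.

```python
from difflib import SequenceMatcher


def edit_magnitude(original, edited):
    """Score in [0, 1]: 0 means unchanged, 1 means no words in common.

    Computed as 1 minus the SequenceMatcher similarity ratio over word
    tokens (a stand-in metric, see the lead-in above).
    """
    a, b = original.split(), edited.split()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


human = "the results was good and we think it work well"
light = "the results were good and we think it works well"
heavy = "our findings demonstrate strong performance across all settings"

for label, text in [("unchanged", human), ("light edit", light), ("heavy rewrite", heavy)]:
    print(label, round(edit_magnitude(human, text), 2))
```

A continuous score of this kind can serve as a regression target, which is how the abstract describes the metrics being used: as intermediate supervision for a model that then predicts editing extent from the edited text alone.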

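[86] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents cs.CL

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert

TL;DR: FocusAgent uses a lightweight LLM retriever to extract the relevant lines from the AxTree and prune web page content, reducing computational cost and security risk while matching baseline task performance.

Details

Motivation: Web agents must process large amounts of page content, which saturates the context, raises computational cost, and leaves them open to prompt injection attacks. Existing pruning strategies work poorly.

Result: On the WorkArena and WebArena benchmarks, FocusAgent cuts observation size by more than 50% while matching baseline performance, and it defends effectively against prompt injection attacks.

Insight: Targeted content pruning improves efficiency and also strengthens security, making it a practical strategy for building web agents.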

Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

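[87] Cache-to-Cache: Direct Semantic Communication Between Large Language Models cs.CL | cs.LG | 68T07, 68T50 | I.2.7

Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai

TL;DR: The paper proposes Cache-to-Cache (C2C), which enables direct semantic communication between large language models (LLMs), avoiding the overhead and information loss of explicit text generation and improving both performance and efficiency.

Details

Motivation: In existing designs, LLMs communicate through text, which loses semantic information and adds generation latency. The paper asks whether LLMs can communicate directly, beyond text.

Result: C2C achieves 8.5-10.5% higher average accuracy than individual models and about 3.0-5.0% higher than the text communication paradigm, with an average 2.0x latency speedup.

Insight: The KV-Cache can serve as an effective medium for efficient semantic communication between LLMs, avoiding the semantic loss and latency of explicit text generation.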

Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model’s KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.

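[88] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment cs.CL | cs.AI

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

TL;DR: The paper proposes Self-Anchor, a method that uses structured reasoning steps to steer the attention of large language models (LLMs), addressing the loss of focus that occurs in complex reasoning tasks.

Details

Motivation: In complex reasoning tasks, as the reasoning chain grows, key intermediate steps and the original prompt get buried in the context, receive too little attention, and cause errors. Existing prompting-based methods do not solve this problem effectively.

Result: Self-Anchor outperforms state-of-the-art prompting methods on six benchmarks and significantly narrows the performance gap between "non-reasoning" models and specialized reasoning models.

Insight: With an attention alignment mechanism, most LLMs could handle complex reasoning tasks without retraining.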

Abstract: To solve complex reasoning tasks for Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model’s attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiments show that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between "non-reasoning" models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.

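[89] Reward Models are Metrics in a Trench Coat cs.CL | cs.AI

Sebastian Gehrmann

TL;DR: This position paper examines how similar reward models and evaluation metrics are and how separately they are studied, and argues that the two fields should collaborate more closely.

Details

Motivation: The rise of reinforcement learning in LLM post-training has drawn wide attention to reward models, but the field is largely separate from research on evaluation metrics, leading to redundant terminology and repeated pitfalls.

Result: The paper argues that collaboration can improve reward models and metrics in areas such as preference elicitation, avoidance of spurious correlations, and reward hacking.

Insight: Reward models are essentially a specific form of evaluation metric, and bringing the two together can avoid duplicated work and address shared challenges.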

Abstract: The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.

cs.DC

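[90] PyRadiomics-cuda: a GPU-accelerated 3D features extraction from medical images within PyRadiomics cs.DC | cs.CV

Jakub Lisowski, Piotr Tyrakowski, Szymon Zyguła, Krzysztof Kaczmarski

TL;DR: PyRadiomics-cuda is a GPU-accelerated extension of PyRadiomics for efficiently extracting three-dimensional shape features from medical images. It significantly reduces processing time and is fully compatible with the existing PyRadiomics API.

Details

Motivation: Extracting 3D shape features in medical image processing is computationally expensive. PyRadiomics-cuda addresses this with GPU acceleration to support efficient AI pipelines.

Result: Tests on different hardware show that PyRadiomics-cuda significantly reduces processing time and is suitable for large datasets.

Insight: GPU acceleration can substantially improve the efficiency of medical image feature extraction and suits high-throughput AI applications.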

Abstract: PyRadiomics-cuda is a GPU-accelerated extension of the PyRadiomics library, designed to address the computational challenges of extracting three-dimensional shape features from medical images. By offloading key geometric computations to GPU hardware, it dramatically reduces processing times for large volumetric datasets. The system maintains full compatibility with the original PyRadiomics API, enabling seamless integration into existing AI workflows without code modifications. This transparent acceleration facilitates efficient, scalable radiomics analysis, supporting the rapid feature extraction essential for high-throughput AI pipelines. Tests performed on a typical computational cluster, budget devices, and home devices prove its usefulness in all scenarios. PyRadiomics-cuda is implemented in Python and C/CUDA and is freely available under the BSD license at https://github.com/mis-wut/pyradiomics-CUDA. Additionally, the PyRadiomics-cuda test suite is available at https://github.com/mis-wut/pyradiomics-cuda-data-gen. It provides a detailed handbook and sample scripts suited for different kinds of workflows, plus detailed installation instructions. The dataset used for testing is available on Kaggle at https://www.kaggle.com/datasets/sabahesaraki/kidney-tumor-segmentation-challengekits-19

cs.IR

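[91] Less LLM, More Documents: Searching for Improved RAG cs.IR | cs.CL | H.3.3; I.2.7

Jingjie Ning, Yibo Kong, Yunfan Long, Jamie Callan

TL;DR: The paper explores enlarging the retriever's corpus to reduce reliance on large language models (LLMs), showing that corpus scaling effectively improves retrieval-augmented generation (RAG) and can in some cases substitute for a larger model.

Details

Motivation: RAG usually relies on large LLMs for accuracy, which brings high cost and limits deployment. The authors aim to reduce that reliance by enlarging the retriever's corpus, lowering cost and improving practicality.

Result: Mid-sized generators paired with large corpora can rival large models paired with small corpora, and the gains from corpus scaling diminish at larger scales.

Insight: Corpus scaling helps mainly by increasing coverage of answer-bearing passages, while utilization efficiency stays largely unchanged. This suggests a new optimization axis for RAG system design: trading off corpus size against generator size.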

Abstract: Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators improves accuracy, it also raises cost and limits deployability. We explore an orthogonal axis: enlarging the retriever’s corpus to reduce reliance on large LLMs. Experimental results show that corpus scaling consistently strengthens RAG and can often serve as a substitute for increasing model size, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and large models benefit less. Our analysis shows that improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. These findings establish a principled corpus-generator trade-off: investing in larger corpora offers an effective path to stronger RAG, often comparable to enlarging the LLM itself.

cs.LG [Back]

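[92] How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models cs.LG | cs.AI | cs.CL

Parth Asawa, Alan Zhu, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

TL;DR: The paper proposes Advisor Models, a lightweight approach in which a small model is trained with reinforcement learning to dynamically generate natural-language instructions that steer a black-box large model, adapting to different inputs and environments.

Details

Motivation: Foundation models are increasingly deployed as black-box services whose weights users cannot modify, so customization is limited to prompting. Static prompt optimization is effective but cannot adapt to changing inputs, users, or environments.

Result: On reasoning and personalization tasks, Advisor Models outperform static prompt optimization, adapt to environment dynamics, and generalize across different black-box models.

Insight: Advisor Models provide a learnable interface to black-box systems, achieving personalization through dynamic optimization while retaining robustness to out-of-distribution inputs.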

Abstract: Foundation models are increasingly deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. While static prompt optimization has shown promise, it produces a single fixed prompt that fails to adapt to different inputs, users, or environments. We introduce Advisor Models, lightweight parametric policies trained with reinforcement learning to reactively issue natural language steering instructions in-context to black-box models. The advisor is a second small model that sits between the input and the model, shaping behavior on a per-instance basis using reward signals from the environment. Across multiple domains involving reasoning and personalization, we show that Advisor Models outperform static prompt optimizers, discovering environment dynamics and improving downstream task performance. We also demonstrate the generalizability of advisors by transferring them across black-box models, as well as the framework’s ability to achieve specialization while retaining robustness to out-of-distribution inputs. Viewed more broadly, Advisor Models provide a learnable interface to black-box systems where the advisor acts as a parametric, environment-specific memory. We argue that dynamic optimization of black-box models via Advisor Models is a promising direction for enabling personalization and environment-adaptable AI with frontier-level capabilities.

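[93] Beyond Imitation: Recovering Dense Rewards from Demonstrations cs.LG | cs.CL

Jiangnan Li, Thuy-Trang Vu, Ehsan Abbasnejad, Gholamreza Haffari

TL;DR: The paper recasts supervised fine-tuning (SFT) as a form of inverse reinforcement learning (IRL), showing that SFT learns not only a policy but also an implicit, dense, token-level reward model. The authors then show how to extract this reward signal from an SFT model and use it for fine-grained credit assignment in reinforcement learning.

Details

Motivation: SFT is conventionally viewed as simple imitation learning that trains a policy only to mimic expert behavior. The paper challenges this view, aiming to show that SFT is in fact a form of IRL that implicitly learns a dense reward model.

Result: The proposed Dense-Path REINFORCE method consistently outperforms the original SFT models on instruction-following benchmarks, confirming the practical value of the dense reward model.

Insight: The key contribution is reframing SFT as a reward-learning mechanism rather than mere policy imitation. This view opens new ways to use expert demonstrations, particularly for fine-grained credit assignment.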

Abstract: Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.

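[94] A Granular Study of Safety Pretraining under Model Abliteration cs.LG | cs.CL

Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele

TL;DR: The paper studies how model abliteration affects safety pretraining, evaluating the safety behavior of 20 systems across checkpoints and proposing a practical protocol for assessing safety under inference-time edits.

Details

Motivation: Open-weight LLMs can have their inference behavior changed by simple activation edits, which poses a safety challenge. The paper asks whether common safety interventions, such as refusal training or metatag training, remain effective under abliteration.

Result: Some data-centric safety components remain robust under abliteration, but the choice of judge significantly affects evaluation outcomes.

Insight: 1. The effect of safety interventions may weaken under abliteration; 2. the design of the evaluation protocol is critical to the results; 3. adding human validation improves the reliability of the evaluation.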

Abstract: Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as Refusal or Non-Refusal using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
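The abstract describes abliteration as a lightweight projection that removes refusal-sensitive directions. A minimal sketch of that kind of edit, assuming a single direction vector is already known, is the standard orthogonal projection h' = h - (h · r̂) r̂. How the direction is estimated, and at which layers the edit is applied, are not covered here; `ablate_direction` and the toy vectors are hypothetical.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy activation and an assumed "refusal direction" along the second axis.
hidden = [3.0, 4.0, 1.0]
refusal_direction = [0.0, 2.0, 0.0]
edited = ablate_direction(hidden, refusal_direction)
print(edited)
```

After the edit the activation has no component along the chosen direction, while the other coordinates are untouched.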

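[95] Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward cs.LG | cs.CL

Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong

TL;DR: The paper studies exploration dynamics in reinforcement learning and proposes Low-probability Regularization (Lp-Reg), which improves performance by protecting low-probability tokens that are valuable for exploration ("reasoning sparks").

Details

Motivation: In Reinforcement Learning with Verifiable Rewards (RLVR), degraded exploration creates a performance bottleneck that conventional high-entropy methods fail to resolve. The authors find that low-probability tokens play an important role in exploration but are over-penalized by existing methods.

Result: Experiments show that Lp-Reg sustains stable exploration for around 1,000 training steps and reaches 60.17% average accuracy on five math benchmarks, a 2.66% improvement over prior methods.

Insight: Low-probability tokens play a non-negligible role in exploration. Conventional entropy-control methods may overlook their value because they are mixed with noise; Lp-Reg addresses this through noise filtering and regularization.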

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term "reasoning sparks". We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
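The proxy construction described in the abstract (filter presumed noise tokens, re-normalize, then regularize with a KL term) can be sketched on a toy next-token distribution. Several details are assumptions made for illustration: the fixed probability threshold `tau` as the noise filter, the direction KL(proxy || policy), and the example numbers. The paper's actual filtering rule and loss may differ.

```python
import math


def proxy_distribution(probs, tau):
    """Zero out tokens with probability below `tau`, then renormalize."""
    kept = [p if p >= tau else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("threshold removed every token")
    return [p / total for p in kept]


def kl_divergence(p, q):
    """KL(p || q), skipping entries where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy policy: the 0.08 token stands in for a "reasoning spark",
# and the 0.02 token for presumed noise (both labels are assumptions).
policy = [0.70, 0.20, 0.08, 0.02]
proxy = proxy_distribution(policy, tau=0.05)
penalty = kl_divergence(proxy, policy)  # assumed direction: KL(proxy || policy)

print(proxy)
print(penalty)
```

After filtering, the surviving low-probability token carries slightly more mass in the proxy than in the policy, so pulling the policy toward the proxy resists driving that token to zero.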

cs.CR [Back]

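[96] Secure and Robust Watermarking for AI-generated Images: A Comprehensive Survey cs.CR | cs.CV

Jie Cao, Qi Li, Zelin Zhang, Jianbing Ni

TL;DR: The paper is a comprehensive survey of robust watermarking for AI-generated images, covering the formalization of watermarking systems, a comparison of techniques, evaluation methodologies, security vulnerabilities, and future directions, with the aim of fostering a trustworthy digital ecosystem.

Details

Motivation: With the rapid progress of generative AI, intellectual property protection, authenticity, and accountability for AI-generated images have become key challenges. Watermarking offers a solution by distinguishing AI-generated content from natural content and ensuring provenance.

Result: The survey gives a holistic view of current watermarking techniques, describes how they perform on visual quality, capacity, and detectability, and highlights their vulnerability to adversarial attacks.

Insight: Watermarking is key to trustworthy AI-generated images, but further research is needed to counter malicious attacks and improve evaluation methods.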

Abstract: The rapid advancement of generative artificial intelligence (Gen-AI) has facilitated the effortless creation of high-quality images, while simultaneously raising critical concerns regarding intellectual property protection, authenticity, and accountability. Watermarking has emerged as a promising solution to these challenges by distinguishing AI-generated images from natural content, ensuring provenance, and fostering trustworthy digital ecosystems. This paper presents a comprehensive survey of the current state of AI-generated image watermarking, addressing five key dimensions: (1) formalization of image watermarking systems; (2) an overview and comparison of diverse watermarking techniques; (3) evaluation methodologies with respect to visual quality, capacity, and detectability; (4) vulnerabilities to malicious attacks; and (5) prevailing challenges and future directions. The survey aims to equip researchers with a holistic understanding of AI-generated image watermarking technologies, thereby promoting their continued development.

cs.CY [Back]

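[97] Representing Beauty: Towards a Participatory but Objective Latent Aesthetics cs.CY | cs.AI | cs.CV

Alexander Michael Rusnak

TL;DR: The paper examines how neural networks can represent beauty objectively through cross-model representational convergence, argues that the formal structure of beautiful images has a realist basis, and proposes the possibility of human-machine co-creation.

Details

Motivation: The paper asks how machines recognize beauty: although beauty is a complex and plural concept, neural networks can learn to model aesthetic judgment.

Result: Aesthetic images produce more similar representations across models, suggesting that their formal structure has a realist basis and supporting the possibility of human-machine co-creation.

Insight: Beauty is not only a cultural construct but also has a physical and cultural grounding, and machines can produce novel creative insights from the vantage point of scale.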

Abstract: What does it mean for a machine to recognize beauty? While beauty remains a culturally and experientially compelling but philosophically elusive concept, deep learning systems increasingly appear capable of modeling aesthetic judgment. In this paper, we explore the capacity of neural networks to represent beauty despite the immense formal diversity of objects to which the term applies. By drawing on recent work on cross-model representational convergence, we show how aesthetic content produces more similar and aligned representations between models that have been trained on distinct data and modalities, while unaesthetic images do not produce more aligned representations. This finding implies that the formal structure of beautiful images has a realist basis, rather than being only a reflection of socially constructed values. Furthermore, we propose that these realist representations exist because of a joint grounding of aesthetic form in physical and cultural substance. We argue that human perceptual and creative acts play a central role in shaping the latent spaces of deep learning systems, but that a realist basis for aesthetics shows that machines are not mere creative parrots but can produce novel creative insights from the unique vantage point of scale. Our findings suggest that human-machine co-creation is not merely possible, but foundational, with beauty serving as a teleological attractor in both cultural production and machine perception.

cs.AI [Back]

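[98] On the Role of Temperature Sampling in Test-Time Scaling cs.AI | cs.CL | cs.LG

Yuheng Wu, Azalia Mirhoseini, Thierry Tambe

TL;DR: The paper studies the role of temperature sampling in test-time scaling (TTS), finding that single-temperature sampling explores only part of a model's potential while sampling across multiple temperatures substantially improves reasoning. A proposed multi-temperature voting method further reduces the computational overhead.

Details

Motivation: Prior work shows that increasing the number of samples K improves reasoning accuracy, but the gains saturate at large K. The authors find that different sampling temperatures solve different subsets of problems and therefore explore scaling along the temperature dimension.

Result: Across multiple benchmarks, temperature scaling adds 7.3 points on average over single-temperature TTS and lets base models approach the performance of RL-trained models without additional training.

Insight: Temperature scaling is a simple and effective way to unlock the latent potential of base models, and TTS is more powerful than previously thought.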

Abstract: Large language models (LLMs) can improve reasoning at inference time through test-time scaling (TTS), where multiple reasoning traces are generated and the best one is selected. Prior work shows that increasing the number of samples K steadily improves accuracy. In this paper, we demonstrate that this trend does not hold indefinitely: at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Interestingly, we find that different sampling temperatures solve different subsets of problems, implying that single-temperature scaling explores only part of a model’s potential. We therefore propose scaling along the temperature dimension, which enlarges the reasoning boundary of LLMs. Averaged over Qwen3 (0.6B, 1.7B, 4B, 8B) and five representative reasoning benchmarks (AIME 2024/2025, MATH500, LiveCodeBench, Hi-ToM), temperature scaling yields an additional 7.3 points over single-temperature TTS. Temperature scaling also enables base models to reach performance comparable to reinforcement learning (RL)-trained counterparts, without additional post-training. We further provide a comprehensive analysis of this phenomenon and design a multi-temperature voting method that reduces the overhead of temperature scaling. Overall, our findings suggest that TTS is more powerful than previously thought, and that temperature scaling offers a simple and effective way to unlock the latent potential of base models.
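To make the idea of scaling along the temperature dimension concrete, the sketch below shows two pieces: how temperature reshapes a token distribution, and a naive pooled majority vote over final answers sampled at several temperatures. The pooled vote is an illustrative stand-in, not the paper's multi-temperature voting method, and the answers in the example are made up rather than sampled from a model.

```python
import math
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def vote_across_temperatures(answers_by_temperature):
    """Pool final answers from every temperature and return the most common."""
    pooled = Counter()
    for answers in answers_by_temperature.values():
        pooled.update(answers)
    return pooled.most_common(1)[0][0]


# Hypothetical final answers from three traces at each of three temperatures.
answers = {
    0.2: ["12", "12", "15"],
    0.7: ["15", "15", "12"],
    1.0: ["15", "9", "15"],
}
winner = vote_across_temperatures(answers)
print(winner)
```

Here the low-temperature traces alone would vote for a different answer than the pooled set, which is the kind of disagreement between temperatures the abstract points to.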

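[99] NCV: A Node-Wise Consistency Verification Approach for Low-Cost Structured Error Localization in LLM Reasoning cs.AI | cs.CL

Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai Zhiyuan Zhao

TL;DR: NCV is a node-wise consistency verification method that localizes errors in LLM reasoning accurately and at low cost, markedly improving efficiency and interpretability.

Details

Motivation: Existing methods for verifying multi-step reasoning suffer from imprecise error localization and high computational cost, so a more efficient solution is needed.

Result: On public datasets, NCV improves F1 scores by 10% to 25% over baselines while using 6× to 58× fewer tokens.

Insight: Node-level verification is a viable low-cost approach that balances precision and efficiency.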

Abstract: Verifying multi-step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi-sampling. We introduce Node-wise Consistency Verification (NCV), a training-free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long-form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while utilizing 6× to 58× fewer tokens than traditional methods like CoT-based verifiers.
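The node-level idea in the abstract can be sketched as a loop over small verification nodes, each checked with a yes/no judgment against only the earlier nodes it depends on. Everything below is hypothetical scaffolding: the `Node` structure, the `depends_on` field, and especially `toy_checker`, which verifies simple additions in place of the LLM consistency check the paper would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    claim: str             # one atomic reasoning step
    depends_on: List[int]  # indices of earlier nodes this step relies on


def locate_first_error(
    nodes: List[Node],
    is_consistent: Callable[[Node, List[Node]], bool],
) -> Optional[int]:
    """Return the index of the first node that fails its binary check."""
    for index, node in enumerate(nodes):
        premises = [nodes[i] for i in node.depends_on]
        if not is_consistent(node, premises):
            return index
    return None


def toy_checker(node: Node, premises: List[Node]) -> bool:
    """Stand-in for an LLM yes/no judgment: checks claims like 'a + b = c'."""
    left, right = node.claim.split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(right)


chain = [
    Node("2 + 3 = 5", []),
    Node("5 + 4 = 10", [0]),   # the faulty step
    Node("10 + 1 = 11", [1]),  # locally valid, but built on the faulty step
]
print(locate_first_error(chain, toy_checker))
```

Because each check sees one short claim plus its premises, the cost per check stays small and the failing step is identified directly rather than inferred from a verdict on the whole chain.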

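[100] Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents cs.AI | cs.CL

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee

TL;DR: The paper proposes TRACE, a framework for multi-dimensional evaluation of the problem-solving trajectories of tool-augmented LLM agents. It goes beyond answer matching to assess efficiency, hallucination, and adaptivity, and is validated using an evidence bank and a new meta-evaluation dataset.

Details

Motivation: Existing tool-augmented benchmarks rely mainly on answer matching and overlook aspects of the problem-solving trajectory such as efficiency, hallucination, and adaptivity. Annotating all valid ground-truth trajectories is prohibitively expensive, so a low-cost, multi-dimensional evaluation method is needed.

Result: TRACE evaluates complex agent behaviors accurately and at low cost, even with small open-source LLMs. Experiments confirm its effectiveness for multi-dimensional analysis and yield new insights into agent behavior.

Insight: Answer matching alone is insufficient for evaluating complex tasks; the evidence bank's accumulation of knowledge from earlier steps and multi-dimensional assessment are key to trajectory evaluation; and TRACE shows the potential of small LLMs as evaluators.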

Abstract: Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent’s performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for evaluating these aspects is to compare an agent’s trajectory with the ground-truth trajectory, but this approach is fundamentally limited since annotating all valid ground-truth trajectories is prohibitively expensive. However, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To effectively evaluate the agents in this manner, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent’s reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

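[101] Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner cs.AI | cs.CL

Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang

TL;DR: The paper proposes CCDD, a multimodal diffusion method that runs a joint diffusion process over continuous and discrete spaces, addressing the weak language-modeling performance of continuous diffusion models while achieving better expressivity and trainability.

Details

Motivation: Continuous diffusion models typically underperform discrete diffusion models in language modeling, even though they are theoretically more expressive. The paper aims to resolve this gap between theory and practice.

Result: CCDD shows strong empirical performance in extensive language-modeling experiments on real-world tasks.

Insight: Continuous diffusion models are highly expressive, but the difficulty of decoding into the discrete token space is the key factor limiting their performance. Joint multimodal diffusion offers a new direction for addressing this.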

Abstract: Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.

cs.SE [Back]

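[102] When Names Disappear: Revealing What LLMs Actually Understand About Code cs.SE | cs.CL

Cuong Chi Le, Minh V. T. Pham, Cuong Duc Van, Hoang N. Phan, Huy N. Phan

TL;DR: The paper studies how large language models (LLMs) understand code, finds that naming patterns affect both intent-level and execution tasks, and proposes obfuscation as a way to assess LLMs' semantic understanding more faithfully.

Details

Motivation: LLMs perform well on code tasks, but how they derive program semantics is unclear. The study separates the contribution of structural semantics from that of human-readable naming to reveal whether LLMs truly understand code or rely on naming patterns.

Result: Obfuscation severely degrades intent-level tasks (for example, summaries regress to line-by-line descriptions) and even reduces performance on execution tasks, indicating that current benchmarks reward memorization of naming patterns rather than semantic reasoning. ClassEval-Obf weakens memorization shortcuts and provides a more reliable evaluation.

Insight: LLMs' current understanding of code may depend heavily on naming patterns rather than semantic structure, and obfuscation is an effective tool for assessing genuine semantic reasoning.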

Abstract: Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs’ code understanding and generalization.
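One simple member of the family of semantics-preserving obfuscations mentioned in the abstract is identifier renaming. The sketch below uses Python's standard `ast` module (Python 3.9+ for `ast.unparse`) to replace function, argument, and variable names with neutral placeholders while leaving behavior unchanged. It is an illustration only, not the paper's obfuscation suite: `IdentifierObfuscator` is a hypothetical helper that handles simple self-contained functions and ignores attributes, imports, keyword arguments, and class members.

```python
import ast
import builtins


class IdentifierObfuscator(ast.NodeTransformer):
    """Rename functions, arguments, and variables to v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        if name in vars(builtins):  # keep print, len, range, ...
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)  # then rename arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node


def obfuscate(source):
    """Return `source` with identifiers replaced by neutral names."""
    tree = IdentifierObfuscator().visit(ast.parse(source))
    return ast.unparse(tree)


original = (
    "def total_price(unit_price, quantity):\n"
    "    subtotal = unit_price * quantity\n"
    "    return subtotal\n"
)
obfuscated = obfuscate(original)
print(obfuscated)
```

The renamed function still multiplies its two arguments, but nothing in the text hints at prices or quantities, so a summary has to come from structure alone.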

eess.IV [Back]

Daeyoung Kim

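TL;DR: GCVAMD is a modified CausalVAE model that focuses on detecting and predicting, from raw OCT images, the causal relationships and risk factors of age-related macular degeneration (AMD), such as drusen and neovascularization.

Details

Motivation: AMD is a leading cause of permanent vision impairment, and current treatments cannot reverse the vision loss. Conventional deep learning methods can distinguish AMD retinas from normal ones but neglect pathology and causal mechanisms. GCVAMD aims to fill this gap.

Result: Experiments show that GCVAMD accurately identifies drusen and neovascularization status and performs well in AMD detection and intervention analysis.

Insight: Models that incorporate causality can support medical decisions more reliably, especially for intervention analysis and treatment simulation, offering a new approach to early AMD diagnosis and treatment.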

Abstract: Age-Related Macular Degeneration (AMD) is one of the leading causes of permanent vision impairment in ophthalmology. Though treatments such as anti-VEGF drugs or photodynamic therapies have been developed to slow the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, detecting AMD itself, or the presence of its risk factors, within the patient's retina at an early stage is crucial to reducing the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention-mechanism-based CNNs and GradCAM-based XAI analysis on OCT scans, have performed well in distinguishing AMD retinas from normal retinas, making it possible to use AI-driven models to aid medical diagnosis and analysis by ophthalmologists regarding AMD. However, despite this success, previous works mostly focused on prediction performance itself, not on the pathologies or underlying causal mechanisms of AMD, which can prevent intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model, GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding the major risk factors, drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection (classification) to intervention analysis.

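[104] Wave-GMS: Lightweight Multi-Scale Generative Model for Medical Image Segmentation eess.IV | cs.AI | cs.CV

Talha Ahmed, Nehal Ahmed Shaikh, Hassan Mohy-ud-Din

TL;DR: Wave-GMS is a lightweight multi-scale generative model for medical image segmentation that aims for high performance while being trainable on low-cost GPUs, with few parameters and support for large-batch training.

Details

Motivation: Deploying AI tools widely in healthcare requires deep segmentation networks that perform well and can be trained on GPUs with limited resources, such as limited memory and compute.

Result: Wave-GMS performs strongly on four public datasets (BUS, BUSI, Kvasir-Instrument, HAM10000) with superior cross-domain generalization.

Insight: Lightweight design and efficient training are key to deploying medical image segmentation models in practice, especially in resource-constrained settings.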

Abstract: For equitable deployment of AI tools in hospitals and healthcare facilities, we need Deep Segmentation Networks that offer high performance and can be trained on cost-effective GPUs with limited memory and large batch sizes. In this work, we propose Wave-GMS, a lightweight and efficient multi-scale generative model for medical image segmentation. Wave-GMS has a substantially smaller number of trainable parameters, does not require loading memory-intensive pretrained vision foundation models, and supports training with large batch sizes on GPUs with limited memory. We conducted extensive experiments on four publicly available datasets (BUS, BUSI, Kvasir-Instrument, and HAM10000), demonstrating that Wave-GMS achieves state-of-the-art segmentation performance with superior cross-domain generalizability, while requiring only ~2.6M trainable parameters. Code is available at https://github.com/ATPLab-LUMS/Wave-GMS.

q-bio.QM [Back]

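[105] Glaucoma Detection and Structured OCT Report Generation via a Fine-tuned Multimodal Large Language Model q-bio.QM | cs.AI | cs.CV

Jalil Jalili, Yashraj Gavhane, Evan Walker, Anna Heinke, Christopher Bowd

TL;DR: The paper presents an explainable multimodal large language model (MM-LLM) for glaucoma screening and structured OCT report generation, with fine-tuning improving diagnostic accuracy and report quality.

Details

Motivation: Glaucoma is a leading cause of irreversible blindness worldwide. Clinicians need an efficient, accurate way to analyze OCT images automatically and generate structured reports, reducing their workload and improving diagnostic efficiency.

Result: The model achieves 0.90 accuracy and 0.98 specificity for quality assessment; glaucoma detection accuracy is 0.86, with sensitivity 0.91 and specificity 0.73; RNFL thinning prediction accuracy ranges from 0.83 to 0.94.

Insight: MM-LLMs have great potential for medical image analysis and report generation and can substantially improve clinical efficiency and diagnostic accuracy, but their explainability and adaptability need further study.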

Abstract: Objective: To develop an explainable multimodal large language model (MM-LLM) that (1) screens optic nerve head (ONH) OCT circle scans for quality and (2) generates structured clinical reports that include glaucoma diagnosis and sector-wise retinal nerve fiber layer (RNFL) thinning assessments. Design: Retrospective cohort study of 1,310 subjects contributing 43,849 Spectralis ONH OCT circle scans (1,331 glaucomatous and 867 healthy eyes) from the DIGS and ADAGES cohorts. Methods: A MM-LLM (Llama 3.2 Vision-Instruct model) was fine-tuned to generate clinical descriptions of OCT imaging data. Training data included paired OCT images and automatically generated, structured clinical reports that described global and sectoral RNFL thinning. Poor-quality scans were labeled as unusable and paired with a fixed refusal statement. The model was evaluated on a held-out test set for three tasks: quality assessment, glaucoma detection, and RNFL thinning classification across seven anatomical sectors. Evaluation metrics included accuracy, sensitivity, specificity, precision, and F1-score. Model description quality was also evaluated using standard text evaluation metrics. Results: The model achieved 0.90 accuracy and 0.98 specificity for quality triage. For glaucoma detection, accuracy was 0.86 (sensitivity 0.91, specificity 0.73, F1-score 0.91). RNFL thinning prediction accuracy ranged from 0.83 to 0.94, with highest performance in global and temporal sectors. Text generation scores showed strong alignment with reference reports (BLEU: 0.82; ROUGE-1: 0.94; ROUGE-2: 0.87; ROUGE-L: 0.92; BERTScore-F1: 0.99). Conclusions: The fine-tuned MM-LLM generated accurate clinical descriptions based on OCT imaging. The model achieved high accuracy in identifying image quality issues and detecting glaucoma. The model also provided sectoral descriptions of RNFL thinning to help support clinical OCT evaluation.
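The evaluation metrics named in the abstract (accuracy, sensitivity, specificity, precision, F1-score) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the counts in the example are invented for illustration and are not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction
    was made, so that no denominator is zero.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 100 positive and 100 negative cases.
metrics = binary_metrics(tp=90, fp=30, tn=70, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reading sensitivity and specificity side by side matters when they diverge, as they do in the reported glaucoma results, because accuracy alone hides which class the errors fall on.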

eess.AS [Back]

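[106] WEE-Therapy: A Mixture of Weak Encoders Framework for Psychological Counseling Dialogue Analysis eess.AS | cs.CL | cs.LG | cs.SD

Yongqi Kang, Yong Zhao

TL;DR: The paper proposes WEE-Therapy, a framework that improves psychological counseling dialogue analysis by combining an ensemble of weak encoders with a dual-routing strategy.

Details

Motivation: Existing audio language models usually rely on a single encoder pre-trained on general data and struggle to capture the emotional and professional features specific to counseling.

Result: In multi-task evaluation, WEE-Therapy achieves significant gains with only a small increase in parameters.

Insight: An ensemble of lightweight weak encoders combined with dynamic routing can effectively improve model performance on domain-specific tasks.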

Abstract: The advancement of computational psychology requires AI tools capable of deeply understanding counseling dialogues. Existing audio language models (AudioLLMs) often rely on single speech encoders pre-trained on general data, struggling to capture domain-specific features like complex emotions and professional techniques. To address this, we propose WEE-Therapy, a multi-task AudioLLM incorporating a Weak Encoder Ensemble (WEE) mechanism. This supplements a powerful base encoder with a pool of lightweight, specialized encoders. A novel dual-routing strategy combines stable, data-independent domain knowledge with dynamic, data-dependent expert selection. Evaluated on emotion recognition, technique classification, risk detection, and summarization, WEE-Therapy achieves significant performance gains across all tasks with minimal parameter overhead, demonstrating strong potential for AI-assisted clinical analysis.

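[107] SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis eess.AS | cs.CL

Lukas Buess, Jan Geier, David Bani-Harouni, Chantal Pellegrini, Matthias Keicher

TL;DR: The paper explores learning vision-language representations directly from spoken radiology reports and proposes SpeechCT-CLIP, which uses knowledge distillation from a pretrained text-image CLIP model to transfer semantic alignment, substantially narrowing the performance gap between speech and text models.

Details

Motivation: Spoken communication is central to clinical workflows, yet medical AI systems rely mainly on written text. The paper aims to close this gap by learning multimodal representations directly from spoken radiology reports.

Result: The speech model's zero-shot classification F1 improves from 0.623 to 0.705, recovering 88% of the performance gap. The model also achieves strong retrieval results without requiring text at inference.

Insight: Speech can serve as a practical alternative to text in multimodal pretraining, opening the way to voice-driven diagnostic support tools in clinical practice.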

Abstract: Spoken communication plays a central role in clinical workflows. In radiology, for example, most reports are created through dictation. Yet, nearly all medical AI systems rely exclusively on written text. In this work, we address this gap by exploring the feasibility of learning visual-language representations directly from spoken radiology reports. Specifically, we synthesize a large-scale dataset (Speech-RATE) of spoken radiology reports and train SpeechCT-CLIP, a contrastive model that aligns speech and 3D CT volumes in a shared representation space. While naive speech-based models underperform compared to text-trained counterparts, we show that knowledge distillation from a pretrained text-image CLIP model effectively transfers semantic alignment capabilities from text to speech, substantially narrowing this gap. Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference, and strong retrieval results without requiring text at inference. These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.

cs.RO [Back]

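[108] Work Zones challenge VLM Trajectory Planning: Toward Mitigation and Robust Autonomous Driving cs.RO | cs.AI | cs.CV

Yifan Liao, Zhen Sun, Xiaoyun Qiu, Zixiao Zhao, Wenbing Tang

TL;DR: The paper presents the first systematic study of Visual Language Models (VLMs) for trajectory planning in work zones, finding that mainstream models fail to generate correct trajectories in 68% of cases. After analyzing the failure patterns, the authors propose REACT-Drive, a framework that incorporates Retrieval-Augmented Generation (RAG) and markedly improves planning accuracy and efficiency.

Details

Motivation: Work zones, with irregular layouts and dynamically changing geometry, challenge VLM trajectory planning, yet the problem has not been studied. The authors aim to fill this gap and make VLMs more practical in this setting.

Result: On the ROADWork dataset, REACT-Drive reduces average displacement error by about 3×, and its inference time (0.58 s) is far lower than that of alternatives such as fine-tuning (17.90 s). Real-world experiments confirm its practicality.

Insight: 1. VLMs have significant limitations in work-zone trajectory planning; 2. combining prior knowledge with retrieval effectively improves performance; 3. the framework performs well in real environments and has practical potential.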

Abstract: Visual Language Models (VLMs), with powerful multimodal reasoning capabilities, are gradually integrated into autonomous driving by several automobile manufacturers to enhance planning capability in challenging environments. However, the trajectory planning capability of VLMs in work zones, which often include irregular layouts, temporary traffic control, and dynamically changing geometric structures, is still unexplored. To bridge this gap, we conduct the first systematic study of VLMs for work zone trajectory planning, revealing that mainstream VLMs fail to generate correct trajectories in 68.0% of cases. To better understand these failures, we first identify candidate patterns via subgraph mining and clustering analysis, and then confirm the validity of 8 common failure patterns through human verification. Building on these findings, we propose REACT-Drive, a trajectory planning framework that integrates VLMs with Retrieval-Augmented Generation (RAG). Specifically, REACT-Drive leverages VLMs to convert prior failure cases into constraint rules and executable trajectory planning code, while RAG retrieves similar patterns in new scenarios to guide trajectory generation. Experimental results on the ROADWork dataset show that REACT-Drive yields a reduction of around 3× in average displacement error relative to VLM baselines under evaluation with Qwen2.5-VL. In addition, REACT-Drive yields the lowest inference time (0.58 s) compared with other methods such as fine-tuning (17.90 s). We further conduct experiments using a real vehicle in 15 work zone scenarios in the physical world, demonstrating the strong practicality of REACT-Drive.
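Average displacement error (ADE), the metric behind the roughly 3× reduction reported above, is commonly defined as the mean Euclidean distance between predicted and ground-truth waypoints at matching time steps. The sketch below uses that common definition with made-up 2D trajectories; whether the paper measures in pixels or meters, and over which horizon, is not specified here. `math.dist` requires Python 3.8+.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching waypoints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Toy trajectories (invented): the ground truth runs along the x-axis.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
baseline_plan = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]  # constant 3-unit offset
improved_plan = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]  # constant 1-unit offset

print(average_displacement_error(baseline_plan, ground_truth))
print(average_displacement_error(improved_plan, ground_truth))
```

In this toy case the second plan's ADE is one third of the first, which is what a 3× reduction means in concrete terms.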

Tianyu Xu, Jiawei Chen, Jiazhao Zhang, Wenyao Zhang, Zekun Qi

TL;DR: 该论文提出了一种名为MM-Nav的多视角视觉-语言-动作（VLA）模型，通过多专家学习实现鲁棒的视觉导航。模型利用预训练的大型语言模型和视觉基础模型，结合合成专家数据，展示了强大的泛化能力。

Details

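Motivation: Visual navigation policies attract attention because they mimic how humans navigate from visual observations, but visual information is hard to model explicitly, which calls for intelligent models and large-scale data.

Result: In experiments in synthetic environments and in the real world, MM-Nav shows strong generalization and outperforms its RL expert teachers.

Insight: Integrating multiple experts is highly effective: multi-view data and a dynamically balanced training strategy improve the model's navigation ability.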

Abstract: Visual navigation policy is widely regarded as a promising direction, as it mimics humans by using egocentric visual observations for navigation. However, the optical information in visual observations is harder to model explicitly than LiDAR point clouds or depth maps, which in turn requires intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three challenging tailor-made environments for different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from RL experts, where the training ratio is dynamically balanced based on performance on individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.
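The abstract states that the training ratio across the three capabilities is dynamically balanced based on per-capability performance, but it does not give the rule. One plausible reading is sketched below, assuming each capability's share of the next round's data is proportional to its failure rate; the function name, the floor parameter, and the success rates are all hypothetical.

```python
from typing import Dict


def balance_training_ratio(success_rate: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Give weaker capabilities a larger share of the next round's data.

    Assumption (not from the paper): the weight is proportional to
    (1 - success rate), with a small floor so no capability is dropped.
    """
    deficits = {name: max(1.0 - rate, floor) for name, rate in success_rate.items()}
    total = sum(deficits.values())
    return {name: deficit / total for name, deficit in deficits.items()}


# Hypothetical success rates of the student on each capability.
rates = {"reaching": 0.9, "squeezing": 0.5, "avoiding": 0.6}
ratio = balance_training_ratio(rates)
```

With these made-up rates, squeezing (the weakest capability) receives the largest share of expert data and reaching the smallest, and the shares sum to 1.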

[110] Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning cs.RO | cs.AI | cs.CL | cs.SC

Yilun Hao, Yongchao Chen, Chuchu Fan, Yang Zhang

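TL;DR: VLMFP is a dual-VLM framework in which SimVLM simulates action consequences and GenVLM generates PDDL files. It addresses the difficulty VLMs have in generating PDDL domain files and improves the accuracy and generalization of visual planning.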

Details

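Motivation: Vision Language Models (VLMs) show promise for visual planning but struggle with precise spatial and long-horizon reasoning, while PDDL planners excel at formal planning but cannot handle visual inputs. Combining the strengths of both requires VLMs to generate PDDL domain files accurately.

Result: Across 6 grid-world domains, SimVLM accurately describes 95.5% of scenarios and simulates 85.5% of action sequences for seen appearances, and the files generated by VLMFP reach valid plans for 70.0% of unseen instances in seen appearances.

Insight: The dual-VLM framework not only resolves the difficulty of generating PDDL domain files but also shows that VLMs can generalize across problems and appearances, offering a new approach to formal visual planning.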

Abstract: Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning. In contrast, Planning Domain Definition Language (PDDL) planners excel at long-horizon formal planning, but cannot interpret visual inputs. Recent works combine these complementary advantages by enabling VLMs to turn visual planning problems into PDDL files for formal planning. However, while VLMs can generate PDDL problem files satisfactorily, they struggle to accurately generate the PDDL domain files, which describe all the planning rules. As a result, prior methods rely on human experts to predefine domain files or on constant environment access for refinement. We propose VLMFP, a Dual-VLM-guided framework that can autonomously generate both PDDL problem and domain files for formal visual planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: A SimVLM that simulates action consequences based on input rule descriptions, and a GenVLM that generates and iteratively refines PDDL files by comparing the PDDL and SimVLM execution results. VLMFP unleashes multiple levels of generalizability: The same generated PDDL domain file works for all the different instances under the same problem, and VLMs generalize to different problems with varied appearances and rules. We evaluate VLMFP with 6 grid-world domains and test its generalization to unseen instances, appearance, and game rules. On average, SimVLM accurately describes 95.5%, 82.6% of scenarios, simulates 85.5%, 87.8% of action sequence, and judges 82.4%, 85.6% goal reaching for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP can generate PDDL files to reach 70.0%, 54.1% valid plans for unseen instances in seen and unseen appearances, respectively. Project page: https://sites.google.com/view/vlmfp.
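The refinement signal described above comes from comparing what a draft PDDL file allows with what SimVLM predicts for the same action sequences. The toy sketch below illustrates only that comparison step: two hand-written grid executors stand in for the two sides, and every name, the grid layout, and the action sequences are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[int, int]
# An executor returns the final state, or None if some action is not executable.
Executor = Callable[[State, Sequence[str]], Optional[State]]

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def make_grid_executor(width: int, height: int, walls: frozenset) -> Executor:
    """Build an executor for a grid world where walls and borders block moves."""
    def run(state: State, actions: Sequence[str]) -> Optional[State]:
        x, y = state
        for action in actions:
            dx, dy = MOVES[action]
            x, y = x + dx, y + dy
            if not (0 <= x < width and 0 <= y < height) or (x, y) in walls:
                return None
        return (x, y)
    return run


def find_mismatches(candidate: Executor, reference: Executor,
                    start: State, sequences: Sequence[Sequence[str]]) -> List[List[str]]:
    """Return the action sequences on which the two executors disagree."""
    return [list(seq) for seq in sequences if candidate(start, seq) != reference(start, seq)]


# Stand-in for the simulating model: a 3x3 grid with a wall at (1, 0).
reference = make_grid_executor(3, 3, frozenset({(1, 0)}))
# Stand-in for a draft domain file whose rules omit the wall.
draft = make_grid_executor(3, 3, frozenset())

sequences = [["right"], ["down", "right"], ["right", "right"]]
mismatches = find_mismatches(draft, reference, (0, 0), sequences)
```

Here the draft rules wrongly allow moving into the wall, so the two sequences that start with "right" are flagged. In a refinement loop, such disagreements would be the feedback used to revise the draft file.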

Table of Contents

cs.CV

[1] Exploring OCR-augmented Generation for Bilingual VQA cs.CV

[2] Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback cs.CV | cs.AI

[3] PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction cs.CV

[4] Unlocking the power of partnership: How humans and machines can work together to improve face recognition cs.CV

[5] How Confident are Video Models? Empowering Video Models to Express their Uncertainty cs.CV | cs.AI | cs.CL

[6] Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig cs.CV

[7] Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation cs.CV

[8] Deep Generative Continual Learning using Functional LoRA: FunLoRA cs.CV

[9] Sequence-Preserving Dual-FoV Defense for Traffic Sign and Light Recognition in Autonomous Vehicles cs.CV

[10] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models cs.CV

[11] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 3min cs.CV | cs.GR

[12] MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context cs.CV

[13] From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting cs.CV

[14] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising cs.CV

[15] Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval cs.CV

[16] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models cs.CV

[17] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology cs.CV | cs.AI | cs.LG

[18] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding cs.CV

[19] Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models cs.CV

[20] OTR: Synthesizing Overlay Text Dataset for Text Removal cs.CV

[21] Align Your Query: Representation Alignment for Multimodality Medical Object Detection cs.CV | cs.AI | cs.LG

[22] MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding cs.CV | cs.AI | cs.CL | cs.MM

[23] VERNIER: an open-source software pushing marker pose estimation down to the micrometer and nanometer scales cs.CV | cs.RO

[24] Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis cs.CV

[25] ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment cs.CV | cs.LG

[26] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework cs.CV

[27] Training-Free Out-Of-Distribution Segmentation With Foundation Models cs.CV

[28] Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention cs.CV

[29] Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting cs.CV

[30] Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights cs.CV | cs.AI

[31] TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency cs.CV

[32] Towards Scalable and Consistent 3D Editing cs.CV

[33] Not every day is a sunny day: Synthetic cloud injection for deep land cover segmentation robustness evaluation across data sources cs.CV

[34] When and Where do Events Switch in Multi-Event Video Generation? cs.CV | cs.AI

[35] InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition cs.CV

[36] What Drives Compositional Generalization in Visual Generative Models? cs.CV | cs.AI | cs.LG

[37] Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields cs.CV | cs.RO

[38] GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion cs.CV

[39] Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction cs.CV | cs.SD

[40] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion cs.CV | cs.AI

[41] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories cs.CV | cs.RO

[42] ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories cs.CV | cs.CE | cs.LG | cs.SI

[43] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus cs.CV | cs.AI

[44] Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft cs.CV

[45] MonSTeR: a Unified Model for Motion, Scene, Text Retrieval cs.CV

[46] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping cs.CV | cs.AI

[47] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models cs.CV

cs.CL

[48] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning cs.CL | cs.AI

[49] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval cs.CL | cs.AI

[50] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering cs.CL | cs.AI | cs.MA

[51] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification cs.CL | cs.AI

[52] Human Mobility Datasets Enriched With Contextual and Social Dimensions cs.CL | cs.AI | cs.SI

[53] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing cs.CL | cs.AI | cs.LG

[54] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards cs.CL | cs.AI

[55] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs cs.CL | cs.LG

[56] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL cs.CL | cs.AI

[57] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs cs.CL | cs.AI

[58] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness cs.CL | cs.LG

[59] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference cs.CL | cs.AI

[60] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents cs.CL | cs.AI

[61] Pretraining with hierarchical memories: separating long-tail and common knowledge cs.CL | cs.AI | cs.LG

[62] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems cs.CL | cs.LG

[63] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation cs.CL

[64] Words That Make Language Models Perceive cs.CL | cs.CV | cs.LG

[65] CLARITY: Clinical Assistant for Routing, Inference, and Triage cs.CL | cs.AI | cs.MA

[66] Knowledge-Graph Based RAG System Evaluation Framework cs.CL | cs.AI

[67] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models cs.CL

[68] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models cs.CL

[69] Self-Improvement in Multimodal Large Language Models: A Survey cs.CL

[70] TravelBench: Exploring LLM Performance in Low-Resource Domains cs.CL | cs.AI

[71] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking cs.CL

[72] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context cs.CL

[73] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback cs.CL

[74] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments cs.CL

[75] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering cs.CL | cs.IR

[76] Evaluating Large Language Models for IUCN Red List Species Information cs.CL | cs.AI | I.2.7; I.2.6; J.3

[77] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation cs.CL | cs.AI | 68T20, 90C27 | I.2.8; I.2.3; G.1.6