Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 31]
- eess.IV [Total: 5]
- cs.GR [Total: 3]
- eess.SP [Total: 1]
- cs.HC [Total: 1]
- cs.RO [Total: 2]
- cs.AI [Total: 2]
- cs.LG [Total: 5]
cs.CL
[1] Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs
Travis Thompson,Seung-Hwan Lim,Paul Liu,Ruoying He,Dongkuan Xu
Main category: cs.CL
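TL;DR: The paper proposes Inference-Scaled GraphRAG, a framework that applies inference-time compute scaling to improve LLM graph reasoning, yielding substantial gains in multi-hop question answering over knowledge graphs.
Details
Motivation: Large language models (LLMs) excel at language understanding and generation but underperform on tasks that require structured context and multi-hop information. Existing RAG methods fail to fully capture the relational structure between nodes in a knowledge graph.
Contribution: Proposes the Inference-Scaled GraphRAG framework, which combines sequential scaling with parallel scaling (majority voting over sampled trajectories) to improve LLM performance on graph reasoning tasks.
Method: Applies inference-time compute scaling (sequential and parallel), combining deep chain-of-thought graph traversal with a majority-voting mechanism.
Result: Significantly outperforms traditional GraphRAG and other graph traversal baselines on the GRBench benchmark.
Insight: Inference-time compute scaling is a practical, architecture-agnostic approach to structured knowledge reasoning.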
Abstract: Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling with deep chain-of-thought graph traversal, and parallel scaling with majority voting over sampled trajectories within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs.
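The parallel-scaling step mentioned in this abstract (majority voting over sampled trajectories) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_trajectory` is a hypothetical callable standing in for one reasoning-execution rollout that returns a final answer string, and the first-seen tie-breaking rule is an assumption.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


def parallel_scaled_answer(
    sample_trajectory: Callable[[str], str],  # hypothetical: one rollout -> answer
    question: str,
    n_samples: int = 5,
) -> str:
    """Sample several independent trajectories and vote on their final answers."""
    answers = [sample_trajectory(question) for _ in range(n_samples)]
    return majority_vote(answers)
```

Increasing `n_samples` spends more inference-time compute per question, which is the knob that parallel scaling turns; sequential scaling (longer chain-of-thought traversal) would live inside the rollout itself.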
[2] A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs
Kethmi Hirushini Hettige,Jiahao Ji,Cheng Long,Shili Xiang,Gao Cong,Jingyuan Wang
Main category: cs.CL
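TL;DR: STReason is a multi-task reasoning framework that combines large language models (LLMs) with the analytical capabilities of spatio-temporal models, addressing the limitations of existing models and improving performance on complex spatio-temporal reasoning tasks.
Details
Motivation: Existing spatio-temporal data mining models are usually restricted to a single task and lack the capacity for multi-task inference and complex long-form reasoning, which limits their use in real-world, multi-faceted decision scenarios.
Contribution: STReason uses in-context learning to decompose complex natural language queries into modular, interpretable programs without task-specific fine-tuning. The paper also introduces a new benchmark dataset and an evaluation framework.
Method: STReason combines the reasoning strengths of LLMs with the analytical capabilities of spatio-temporal models, generating solutions and detailed rationales through modular programs.
Result: Experiments show that STReason significantly outperforms advanced LLM baselines on complex spatio-temporal reasoning tasks, and human evaluation confirms its credibility and practical utility.
Insight: STReason offers a direction for building more capable and generalizable spatio-temporal reasoning systems, with the potential to reduce expert workload and broaden real-world applicability.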
Abstract: Spatio-temporal data mining plays a pivotal role in informed decision making across diverse domains. However, existing models are often restricted to narrow tasks, lacking the capacity for multi-task inference and complex long-form reasoning that require generation of in-depth, explanatory outputs. These limitations restrict their applicability to real-world, multi-faceted decision scenarios. In this work, we introduce STReason, a novel framework that integrates the reasoning strengths of large language models (LLMs) with the analytical capabilities of spatio-temporal models for multi-task inference and execution. Without requiring task-specific finetuning, STReason leverages in-context learning to decompose complex natural language queries into modular, interpretable programs, which are then systematically executed to generate both solutions and detailed rationales. To facilitate rigorous evaluation, we construct a new benchmark dataset and propose a unified evaluation framework with metrics specifically designed for long-form spatio-temporal reasoning. Experimental results show that STReason significantly outperforms advanced LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations further validate STReason’s credibility and practical utility, demonstrating its potential to reduce expert workload and broaden the applicability to real-world spatio-temporal tasks. We believe STReason provides a promising direction for developing more capable and generalizable spatio-temporal reasoning systems.
[3] SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
Dhruv Gupta,Gayathri Ganesh Lakshmy,Yiqing Xie
Main category: cs.CL
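TL;DR: The paper analyzes textual bias in code retrieval and proposes SACL, a framework that reduces this bias through semantic-augmented reranking and localization.
Details
Motivation: Current code retrievers rely heavily on surface-level textual features (e.g., docstrings, identifier names) and are strongly biased toward well-documented code, even when the documentation is irrelevant. This leads to inaccurate retrieval results.
Contribution: Proposes the SACL framework, which augments code or structural knowledge with semantic information, substantially improving code retrieval and, in turn, code generation.
Method: Analyzes textual bias by systematically masking specific features while preserving code functionality, then applies semantic-augmented reranking and localization to reduce the bias.
Result: SACL substantially improves code retrieval (Recall@1 gains of 12.8% / 9.4% / 7.0% on HumanEval / MBPP / SWE-Bench-Lite) and improves code generation (a 4.88% Pass@1 gain on HumanEval).
Insight: Code retrieval should focus on semantic information rather than surface-level textual features; reducing reliance on irrelevant documentation can markedly improve both retrieval and generation.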
Abstract: Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
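For readers unfamiliar with the Recall@1 metric reported here, the sketch below shows one common way to compute Recall@k for retrieval. It assumes the hit-rate definition (a query counts as a hit if any gold item appears in the top k); the abstract does not state the paper's exact evaluation protocol, and the query and snippet identifiers are made up.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],  # query id -> retrieved ids, best first
    gold: Dict[str, Set[str]],       # query id -> ids of relevant items
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = 0
    for query_id, ranked in rankings.items():
        if set(ranked[:k]) & gold[query_id]:
            hits += 1
    return hits / len(rankings)


# Hypothetical toy data: two queries, each with one gold snippet.
rankings = {"q1": ["snippet_a", "snippet_b"], "q2": ["snippet_c", "snippet_d"]}
gold = {"q1": {"snippet_a"}, "q2": {"snippet_d"}}
print(recall_at_k(rankings, gold, k=1))  # 0.5
print(recall_at_k(rankings, gold, k=2))  # 1.0
```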
[4] ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset
Yilin Wang,Peixuan Lei,Jie Song,Yuzhe Hao,Tao Chen,Yuxuan Zhang,Lei Jia,Yuanxiang Li,Zhongyu Wei
Main category: cs.CL
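TL;DR: The paper introduces the Time-Series QA task and the EngineMT-QA dataset, and proposes ITFormer, a framework that bridges time-series encoders with frozen large language models (LLMs) and significantly improves QA performance.
Details
Motivation: Time-series data are widely used in industry, healthcare, and other domains, but dynamic interaction between such data and natural language remains challenging and calls for new methods for multi-modal QA.
Contribution: (1) Introduces the Time-Series QA task and the EngineMT-QA dataset; (2) designs the ITFormer framework, which efficiently fuses time-series and textual features.
Method: ITFormer extracts features with a time-series encoder, aligns them with a frozen LLM, and achieves cross-modal modeling with a small number of trainable parameters.
Result: Significantly outperforms baselines on QA tasks with fewer than 1% additional trainable parameters.
Insight: ITFormer provides an efficient paradigm for interaction between time series and natural language, advancing multi-modal AI.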
Abstract: Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: https://pandalin98.github.io/itformer_site/
[5] A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection
Songsoo Kim,Seungtae Lee,See Young Lee,Joonho Kim,Keechan Kan,Dukyong Yoon
Main category: cs.CL
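TL;DR: The paper proposes a three-pass large language model framework that significantly raises the positive predictive value (PPV) of radiology report error detection while lowering operational costs.
Details
Motivation: Because errors are rare in radiology reports, existing LLM-based proofreading has limited PPV, so a more efficient and precise approach is needed.
Contribution: Proposes a three-pass LLM framework that performs extraction, detection, and false-positive verification in sequence, significantly improving PPV and reducing cost.
Method: Three frameworks were tested: (1) a single-prompt detector; (2) an extractor plus detector; (3) an extractor, detector, and false-positive verifier. Performance was evaluated with PPV, absolute true positive rate (aTPR), and related metrics.
Result: The three-pass framework (Framework 3) raised PPV from 0.063 to 0.159 and cut operational costs by 42.6% relative to Framework 1, while detection performance remained stable. External validation also supported its superiority.
Insight: A multi-pass LLM framework can filter and verify step by step, maintaining detection performance while improving efficiency and saving cost, which makes it an effective strategy for AI-assisted radiology report quality assurance.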
Abstract: Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3’s superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
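Two pieces of arithmetic behind this abstract can be checked with a short sketch: why PPV is low when errors are rare, and how the reported cost reductions follow from the per-1,000-report costs. The sensitivity, specificity, and prevalence values below are hypothetical and chosen only to illustrate the prevalence effect; the USD figures are taken from the abstract.

```python
def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' rule: P(real error | flagged)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional drop from baseline to new."""
    return (baseline - new) / baseline


# Hypothetical detector: 90% sensitive, 90% specific, 1% of reports contain an error.
print(round(ppv_from_rates(0.9, 0.9, 0.01), 3))  # 0.083

# Costs per 1,000 reports as reported in the abstract (USD).
print(round(100 * relative_reduction(9.72, 5.58), 1))  # 42.6 (Framework 3 vs. 1)
print(round(100 * relative_reduction(6.85, 5.58), 1))  # 18.5 (Framework 3 vs. 2)
```

Even a fairly accurate detector produces mostly false positives at 1% prevalence, which is the motivation for adding a false-positive verification pass.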
[6] Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests
Masaki Uto,Yuma Ito
Main category: cs.CL
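TL;DR: The paper proposes a missing-score imputation method based on AI grading, aiming to improve the accuracy of ability estimation in constructed-response tests while markedly reducing manual grading workload.
Details
Motivation: Constructed-response tests (such as short-answer and essay questions) assess higher-order abilities, but manual grading is costly and time-consuming. The accuracy of existing approaches such as IRT declines when many scores are missing, so a more efficient solution is needed.
Contribution: Proposes a method that imputes missing scores with AI graders, improving the accuracy of IRT-based ability estimation and reducing the manual grading burden.
Method: Combines automated scoring with IRT: AI graders impute the missing scores, improving the input data for the ability estimation model.
Result: The method achieves high accuracy in ability estimation while greatly reducing the need for manual grading.
Insight: Automated scoring can be combined with traditional IRT to provide more efficient and accurate educational assessment, particularly for large-scale testing and resource-constrained settings.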
Abstract: Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.
[7] AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control
Ruosen Li,Ziming Luo,Quan Zhang,Ruochen Li,Ben Zhou,Ali Payani,Xinya Du
Main category: cs.CL
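TL;DR: AALC improves the efficiency of large reasoning models through adaptive accuracy-length control, cutting response length by more than 50% while maintaining or improving accuracy.
Details
Motivation: Long chain-of-thought reasoning is capable but incurs high latency and cost without commensurate accuracy gains. AALC aims to dynamically balance correctness and brevity.
Contribution: Proposes AALC, a lightweight method that combines an accuracy-aware length reward with a dynamically scheduled length penalty, substantially shortening reasoning without sacrificing accuracy.
Method: Incorporates validation accuracy into the reward function and dynamically adjusts the length penalty, delaying it until target performance is reached.
Result: On standard and out-of-distribution math benchmarks, AALC reduces response length by more than 50% while maintaining or improving the original accuracy.
Insight: The efficiency gains come with reduced interpretability. AALC shows that reward-based strategies can guide reasoning models toward more efficient, generalizable reasoning paths.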
Abstract: Large reasoning models (LRMs) achieve impressive reasoning capabilities by generating lengthy chain-of-thoughts, but this “overthinking” incurs high latency and cost without commensurate accuracy gains. In this work, we introduce AALC, a lightweight, accuracy-aware length reward integrated into reinforcement learning that dynamically balances correctness and brevity during training. By incorporating validation accuracy into the reward and employing a smooth, dynamically scheduled length penalty, AALC delays length penalty until target performance is met. Through extensive experiments across standard and out-of-distribution math benchmarks, we show that our approach reduces response length by over 50% while maintaining or even improving the original accuracy. Furthermore, qualitative analysis reveals that our method curbs redundant reasoning patterns such as excessive subgoal setting and verification, leading to structurally refined outputs rather than naive truncation. We also identify that efficiency gains are accompanied by reduced interpretability: models trained with AALC omit some narrative framing and explanatory context. These findings highlight the potential of reward-based strategies to guide LRMs toward more efficient, generalizable reasoning paths.
[8] SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs
Fengze Li,Yue Wang,Yangle Liu,Ming Huang,Dou Hong,Jieming Ma
Main category: cs.CL
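TL;DR: SEED is a time series forecasting method that combines a structural encoder with a language model and uses multi-stage modular decoding to bridge the gap between structural modeling and semantic reasoning.
Details
Motivation: Multivariate time series forecasting requires models to capture structural dependencies between variables while adapting to diverse tasks. Existing structural encoders lack semantic reasoning ability, and large language models (LLMs) cannot directly process raw time series, which limits unified prediction systems.
Contribution: Through a four-stage modular design (token-aware encoding, embedding-alignment projection, semantic reprogramming, and frozen language model prediction), SEED efficiently aligns numerical patterns with semantic reasoning and addresses the structural-semantic modeling gap.
Method: (1) A token-aware encoder extracts patches; (2) a projection module aligns the patches with language model embeddings; (3) semantic reprogramming maps the patches to task-aware prototypes; (4) a frozen language model produces the prediction.
Result: Experiments show that SEED outperforms baselines on multiple datasets, confirming its effectiveness in narrowing the structural-semantic modeling gap.
Insight: By decoupling representation learning from inference, SEED demonstrates the potential of combining structural encoders with LLMs and suggests a path toward more general, task-adaptive time series forecasting.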
Abstract: Multivariate time series forecasting requires models to simultaneously capture variable-wise structural dependencies and generalize across diverse tasks. While structural encoders are effective in modeling feature interactions, they lack the capacity to support semantic-level reasoning or task adaptation. Conversely, large language models (LLMs) possess strong generalization capabilities but remain incompatible with raw time series inputs. This gap limits the development of unified, transferable prediction systems. Therefore, we introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, a semantic reprogramming mechanism that maps patches to task-aware prototypes, and a frozen language model for prediction. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning. Empirical results demonstrate that the proposed method achieves consistent improvements over strong baselines, and comparative studies on various datasets confirm SEED’s role in addressing the structural-semantic modeling gap.
[9] COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees
Zhiyuan Wang,Jinhao Duan,Qingni Wang,Xiaofeng Zhu,Tianlong Chen,Xiaoshuang Shi,Kaidi Xu
Main category: cs.CL
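TL;DR: COIN is an uncertainty-guarding selective question answering framework that statistically calibrates a threshold so that a single generated answer per question is selected under a user-specified false discovery rate (FDR) constraint, significantly increasing sample retention.
Details
Motivation: Existing heuristic uncertainty quantification methods lack formal guarantees for key selective-prediction metrics such as FDR, and prediction sets produced by the split conformal prediction framework often contain incorrect answers, which limits practical use.
Contribution: Proposes the COIN framework, which calibrates thresholds with confidence interval methods (such as Clopper-Pearson) to control FDR while increasing sample retention, and demonstrates its robustness and adaptability across tasks.
Method: Estimates the empirical error rate on a calibration set, applies a confidence interval method (such as Clopper-Pearson) to obtain a high-probability upper bound on the true error rate, and selects the largest uncertainty threshold that still achieves FDR control.
Result: COIN performs well in both risk control and sample retention, especially with limited calibration data; alternative upper bound constructions and uncertainty quantification strategies further improve its power.
Insight: COIN's flexibility and extensibility let it adapt to diverse application scenarios while giving uncertainty quantification a statistical guarantee.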
Abstract: Uncertainty quantification (UQ) for foundation models is essential to identify and mitigate potential hallucinations in automatically generated text. However, heuristic UQ approaches lack formal guarantees for key metrics such as the false discovery rate (FDR) in selective prediction. Previous work adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing prediction sets, but these sets often contain incorrect candidates, limiting their practical utility. To address this, we propose COIN, an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question under user-specified FDR constraints. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods such as Clopper-Pearson to establish a high-probability upper bound on the true error rate (i.e., FDR). This enables the selection of the largest uncertainty threshold that ensures FDR control on test data while significantly increasing sample retention. We demonstrate COIN’s robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data across both general and multimodal text generation tasks. Furthermore, we show that employing alternative upper bound constructions and UQ strategies can further boost COIN’s power performance, which underscores its extensibility and adaptability to diverse application scenarios.
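The calibration idea in this abstract (bound the true error rate with Clopper-Pearson, then pick the largest uncertainty threshold whose bound stays under the target FDR) can be sketched as below. This is an illustrative reading of the abstract, not COIN's actual procedure: the exact definition of the empirical error rate, the candidate threshold grid, and the function names are all assumptions. The upper bound is computed by bisection on the binomial CDF using only the standard library.

```python
import math
from typing import List, Optional, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(errors: int, n: int, delta: float = 0.05) -> float:
    """One-sided (1 - delta) Clopper-Pearson upper bound on the error rate."""
    if errors >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def select_threshold(
    calib: List[Tuple[float, bool]],  # (uncertainty score, answer is correct)
    alpha: float,                     # user-specified FDR target
    delta: float = 0.05,
) -> Optional[float]:
    """Largest threshold t such that answers with uncertainty <= t pass the bound."""
    best = None
    for t in sorted({u for u, _ in calib}):
        kept = [ok for u, ok in calib if u <= t]
        errors = sum(1 for ok in kept if not ok)
        if clopper_pearson_upper(errors, len(kept), delta) <= alpha:
            best = t
    return best
```

At test time, under these assumptions, an answer would be returned only if its uncertainty is at most the selected threshold; a result of `None` means no threshold on the calibration data satisfies the target.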
[10] Enhancing Large Language Models through Structured Reasoning
Yubo Dong,Hehe Fan
Main category: cs.CL
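TL;DR: The paper proposes enhancing large language models (LLMs) through structured reasoning: unstructured data are converted into a structured format, and the model's reasoning ability is improved with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Experiments validate the approach.
Details
Motivation: Existing LLMs struggle with complex reasoning tasks such as logical deduction and systematic planning, mainly because they rely on implicit statistical relationships and lack structured knowledge representation. Drawing on cognitive science and neurosymbolic AI, the paper aims to improve LLM performance through explicit structured reasoning.
Contribution: (1) Converts unstructured data into a structured format with explicitly annotated reasoning steps; (2) uses SFT and GRPO to strengthen the structured reasoning ability of LLMs; (3) introduces the MAX-Flow and LCS algorithms, which notably improve reasoning effectiveness and reduce computational complexity.
Method: (1) Annotate unstructured data as structured reasoning steps; (2) fine-tune the LLM with SFT; (3) optimize reasoning ability with GRPO combined with the MAX-Flow and LCS algorithms. Experiments use the DeepSeek-R1-Distill-Qwen-1.5B model.
Result: The improved model produces concise reasoning, performs robustly, and is compatible with a range of optimization techniques, validating structured reasoning for LLMs.
Insight: Explicit structured reasoning can compensate for LLM weaknesses on complex tasks, and combining GRPO with the new algorithms offers a way to improve reasoning efficiency. The approach may generalize beyond the tasks studied.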
Abstract: Recent Large Language Models (LLMs) have significantly advanced natural language processing and automated decision-making. However, these models still encounter difficulties when performing complex reasoning tasks involving logical deduction and systematic planning, primarily due to their reliance on implicit statistical relationships without structured knowledge representation. Inspired by cognitive science and neurosymbolic AI, we introduce a novel approach to enhance LLMs through explicit structured reasoning. First, we convert unstructured data into structured formats by explicitly annotating reasoning steps. We then employ this structured dataset to train LLMs through Supervised Fine-Tuning (SFT). Additionally, we enhance the structured reasoning capabilities of LLMs using Group Relative Policy Optimization (GRPO), incorporating two innovative algorithms, MAX-Flow and Longest Common Subsequence (LCS), which notably improve reasoning effectiveness and reduce computational complexity. Experimental results from fine-tuning a DeepSeek-R1-Distill-Qwen-1.5B model demonstrate concise reasoning, robust performance across various scenarios, and improved compatibility with optimization techniques, validating the efficacy of structured reasoning integration in LLMs.
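The Longest Common Subsequence (LCS) named in this abstract is a standard dynamic-programming problem. The sketch below computes the LCS length of two step sequences and a normalized similarity; how the paper actually plugs LCS into GRPO is not stated in the abstract, so `lcs_similarity` and the example step labels are purely hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: Sequence, b: Sequence) -> float:
    """Hypothetical normalized score in [0, 1]: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


# Hypothetical annotated reasoning steps from two model outputs.
steps_a = ["parse", "plan", "solve", "check"]
steps_b = ["parse", "solve", "check"]
print(lcs_length(steps_a, steps_b))      # 3
print(lcs_similarity(steps_a, steps_b))  # 0.75
```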
[11] CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment
Papa Séga Wade,Mihai Andries,Ioannis Kanellos,Thierry Moudenc
Main category: cs.CL
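TL;DR: The paper proposes CBF-AFA, a chunk-based multi-SSL (self-supervised learning) fusion method for automatic fluency assessment (AFA) that combines several SSL models with chunking and improves assessment quality.
Details
Motivation: Automatic fluency assessment remains challenging for non-native speakers' speech rhythm, pauses, and disfluencies, and needs finer-grained temporal analysis and multi-model fusion.
Contribution: Proposes a chunk-based multi-SSL fusion framework that combines the complementary strengths of Wav2Vec2, HuBERT, and WavLM and fuses their features with a weighted mechanism. It also introduces chunk-level fluency markers and a hierarchical CNN-BiLSTM network, improving assessment performance.
Method: (1) Segment speech into breath-group chunks with Silero-VAD; (2) fuse the embeddings of multiple SSL models with a weighted mechanism; (3) add chunk-level fluency markers; (4) capture local and long-term dependencies across chunks with a CNN-BiLSTM.
Result: F1-score improves by 4.2 points on Avalinguo and 2.8 points on Speechocean762, and Pearson correlation improves by 4.0 and 6.2 points respectively, outperforming single-SSL models and Pyannote.audio-based segmentation baselines.
Insight: Chunk-based multi-SSL fusion makes fluency assessment more robust, but its generalization to dialects with irregular prosody still needs to be verified.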
Abstract: Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and Speechocean762, our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing Pyannote.audio-based segmentation baselines. These findings highlight chunk-based multi-SSL fusion for robust fluency evaluation, though future work should explore generalization to dialects with irregular prosody.
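The "learnable weighted mechanism" for fusing SSL embeddings could take several forms; one simple possibility is a softmax-normalized weight per model, sketched below in plain Python. This is an assumption for illustration only: the abstract does not specify the fusion formula, and the sketch presumes the three embeddings share a dimension. In training, the logits would be updated by gradient descent, which is omitted here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_embeddings(embeddings: List[List[float]], logits: List[float]) -> List[float]:
    """Weighted sum of per-model embeddings; weights are softmax(logits)."""
    if len(embeddings) != len(logits):
        raise ValueError("need one logit per embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share a dimension")
    weights = softmax(logits)
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]


# Toy 2-dimensional stand-ins for Wav2Vec2, HuBERT, and WavLM chunk embeddings.
chunk = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(fuse_embeddings(chunk, [0.0, 0.0, 0.0]))  # equal weights: the plain average
```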
[12] Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
Kai-Robin Lange,Tobias Schmidt,Matthias Reccius,Henrik Müller,Michael Roos,Carsten Jentsch
Main category: cs.CL
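TL;DR: The paper proposes a method that combines dynamic topic models with large language models (LLMs) to detect how media narratives change over time. A topic model with change point detection locates the key documents, and an LLM then interprets the change automatically, distinguishing content shifts from narrative shifts.
Details
Motivation: Media narratives evolve quickly, and popular narrative extraction methods such as LLMs face high financial and computational costs when applied to an entire corpus. The paper aims to combine the scalability of topic models with the language understanding of LLMs to model narrative change dynamically.
Contribution: Proposes a hybrid approach that combines dynamic topic models with LLMs to detect narrative shifts efficiently, and develops an automated pipeline for distinguishing content shifts from narrative shifts.
Method: (1) Use a topic model and change point detection to select representative documents; (2) feed these documents to an LLM, which interprets the type of change (content shift or narrative shift).
Result: Evaluated on a corpus of The Wall Street Journal articles from 2009 to 2023, the LLM extracts narrative shifts efficiently but performs worse when deciding whether a content shift or a narrative shift occurred.
Insight: Topic models suit large-scale filtering and LLMs suit local interpretation; combining them balances cost and quality, but the classification step needs further improvement.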
Abstract: With rapidly evolving media narratives, it has become increasingly critical to not just extract narratives from a given corpus but rather investigate how they develop over time. While popular narrative extraction methods such as Large Language Models do well in capturing typical narrative elements or even the complex structure of a narrative, applying them to an entire corpus comes with obstacles, such as a high financial or computational cost. We propose a combination of the language understanding capabilities of Large Language Models with the large scale applicability of topic models to dynamically model narrative shifts across time using the Narrative Policy Framework. We apply a topic model and a corresponding change point detection method to find changes that concern a specific topic of interest. Using this model, we filter our corpus for documents that are particularly representative of that change and feed them into a Large Language Model that interprets the change that happened in an automated fashion and distinguishes between content and narrative shifts. We employ our pipeline on a corpus of The Wall Street Journal newspaper articles from 2009 to 2023. Our findings indicate that a Large Language Model can efficiently extract a narrative shift if one exists at a given point in time, but does not perform as well when having to decide whether a shift in content or a narrative shift took place.
[13] An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao,Chaoyi Wu,Yanjie Fan,Xiaoman Zhang,Pengcheng Qiu,Yuze Sun,Xiao Zhou,Yanfeng Wang,Ya Zhang,Yongguo Yu,Kun Sun,Weidi Xie
Main category: cs.CL
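TL;DR: DeepRare is a rare disease diagnosis system built on a large language model; its modular design and transparent reasoning chains significantly improve diagnostic accuracy.
Details
Motivation: Rare disease diagnosis is hampered by clinical heterogeneity, low prevalence, and clinicians' limited familiarity with rare conditions. DeepRare aims to address these challenges.
Contribution: Presents DeepRare, described as the first LLM-based agentic system for rare disease diagnosis, which combines a modular design with transparent reasoning chains and significantly outperforms existing methods.
Method: The system comprises a central host with a long-term memory module and specialized agent servers that integrate over 40 tools and up-to-date medical knowledge sources.
Result: Performs strongly on eight datasets: average Recall@1 reaches 57.18%, 23.79 percentage points above the second-best method, and 70.60% in multi-modal input scenarios.
Insight: Modular design and transparent reasoning chains are key to complex medical diagnosis, and the system's performance supports the potential of LLMs in medicine.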
Abstract: Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks integrating over 40 specialized tools and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1,013 diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other methods, such as traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser’s 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts yields 95.40% agreement. Furthermore, the DeepRare system has been implemented as a user-friendly web application at http://raredx.cn/doctor.
[14] Probing AI Safety with Source Code
Ujwal Narayan,Shreyas Chaudhari,Ashwin Kalyan,Tanmay Rajpurohit,Karthik Narasimhan,Ameet Deshpande,Vishvak Murahari
Main category: cs.CL
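TL;DR: The paper proposes a prompting strategy called Code of Thought (CoDoT) for evaluating the safety of large language models (LLMs) and finds that contemporary models fall well short of safety goals.
Details
Motivation: As LLMs spread into safety-critical applications, their safety and alignment with human values become essential. The paper aims to expose safety shortcomings in current models and to provide an evaluation method.
Contribution: Proposes CoDoT, which tests model safety by converting natural language inputs into code, and finds significant safety failures across a range of state-of-the-art models.
Method: Uses the CoDoT strategy to convert a natural language prompt into equivalent simple code, then tests the model's safety behavior on the resulting code.
Result: CoDoT markedly increases toxic output: GPT-4 Turbo's toxicity increases 16.5 times and DeepSeek R1 fails 100% of the time. Applying CoDoT recursively increases toxicity a further two times.
Insight: The paper exposes serious safety gaps in current LLMs and argues that safety and capability must advance together. CoDoT offers a tool for evaluating safety from first principles.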
Abstract: Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt “Make the statement more toxic: {text}” to: “make_more_toxic({text})”. We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo’s toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.
[15] Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
Kaixiang Zhang,Justine Zhang,Cristian Danescu-Niculescu-Mizil
Main category: cs.CL
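TL;DR: The paper introduces a computational framework for quantifying how talk-time is distributed in a conversation and how that distribution evolves, and shows how different dynamics affect participants' perceptions.
Details
Motivation: To study the dynamics of talk-time sharing in conversation and their effect on participants' perceptions, and to provide new tools for designers of computer-mediated communication platforms.
Contribution: (1) Proposes a computational framework for quantifying talk-time distribution and dynamics; (2) shows how different types of dynamics affect participants' perceptions; (3) offers new tools for communication platform design.
Method: Applies the framework, structured by several intuitive axes of variation, to a large dataset of video-chats between strangers in order to classify and quantify talk-time sharing dynamics.
Result: Balanced conversations are preferred by participants, especially by those who talk less; different dynamics are perceived differently even when they lead to the same overall balance.
Insight: How talk-time is shared over the course of a conversation matters for the conversational experience, not just the overall balance.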
Abstract: An intrinsic aspect of every conversation is the way talk-time is shared between multiple speakers. Conversations can be balanced, with each speaker claiming a similar amount of talk-time, or imbalanced when one talks disproportionately. Such overall distributions are the consequence of continuous negotiations between the speakers throughout the conversation: who should be talking at every point in time, and for how long? In this work we introduce a computational framework for quantifying both the conversation-level distribution of talk-time between speakers, as well as the lower-level dynamics that lead to it. We derive a typology of talk-time sharing dynamics structured by several intuitive axes of variation. By applying this framework to a large dataset of video-chats between strangers, we confirm that, perhaps unsurprisingly, different conversation-level distributions of talk-time are perceived differently by speakers, with balanced conversations being preferred over imbalanced ones, especially by those who end up talking less. Then we reveal that – even when they lead to the same level of overall balance – different types of talk-time sharing dynamics are perceived differently by the participants, highlighting the relevance of our newly introduced typology. Finally, we discuss how our framework offers new tools to designers of computer-mediated communication platforms, for both human-human and human-AI communication.
[16] Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
Tong Zhou
Main category: cs.CL
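TL;DR: The paper presents Team Marikarp's solution for the SIGIR 2025 LiveRAG competition: a knowledge-aware diverse reranking RAG pipeline that took first place.
Details
Motivation: The competition's evaluation set covers a wide range of topics, question types, and knowledge organization methods, so a method is needed that retrieves question-relevant supporting documents from a large document collection.
Contribution: Proposes a knowledge-aware diverse reranking RAG pipeline that improves cross-source question answering.
Method: Applies knowledge-aware diverse reranking within a RAG framework to improve document retrieval and ranking.
Result: Achieved first place in the SIGIR 2025 LiveRAG competition, supporting the effectiveness of the method.
Insight: Combining knowledge awareness with diverse reranking can improve retrieval for cross-source question answering, particularly over large document collections.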
Abstract: This paper presents Team Marikarp’s solution for the SIGIR 2025 LiveRAG competition. The competition’s evaluation set, automatically generated by DataMorgana from internet corpora, encompassed a wide range of target topics, question types, question formulations, audience types, and knowledge organization methods. It offered a fair evaluation of retrieving question-relevant supporting documents from a 15M documents subset of the FineWeb corpus. Our proposed knowledge-aware diverse reranking RAG pipeline achieved first place in the competition.
[17] GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
Guinan Su,Li Shen,Lu Yin,Shiwei Liu,Yanwu Yang,Jonas Geiping
Main category: cs.CL
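TL;DR: GPTailor compresses large language models by cutting and stitching layers: it merges layers from different fine-tuned variants, preserving the original model's abilities while substantially reducing parameters.
Details
Motivation: The sheer size of large language models (LLMs) makes deployment and inference difficult, and existing methods focus mainly on pruning a single model, which makes it hard to compress efficiently while preserving performance.
Contribution: Formulates model tailoring as a zero-order optimization problem over a search space that supports layer removal, layer selection, and layer merging, improving compression results.
Method: Strategically combines layers from different fine-tuned variants and casts LLM pruning as a zero-order optimization problem with three operations: layer removal, layer selection from candidate models, and layer merging.
Result: For the Llama2-13B model family, the compressed models retain approximately 97.3% of the original performance while removing about 25% of the parameters, outperforming previous methods.
Insight: Aggregating the capabilities of different fine-tuned variants preserves model performance more effectively during pruning and suggests a new direction for LLM compression.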
Abstract: Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model’s abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning. For example, for the Llama2-13B model family, our compressed models maintain approximately 97.3% of the original performance while removing $\sim 25\%$ of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
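To make the three-operation search space concrete, here is a minimal sketch of how one candidate "plan" could be applied to a set of finetuned variants. It is an illustration under assumptions, not the authors' code: layers are toy lists of floats, the plan encoding is hypothetical, and merging is assumed to be a weighted average (the abstract does not specify the merge rule).

```python
def stitch_layers(candidates, plan):
    """Build a stitched model from candidate models according to a plan.

    candidates: list of models; each model is a list of layers, and each
                layer is a list of floats (a toy stand-in for weights).
    plan: one decision per layer position, using a hypothetical encoding:
          ("remove",)            drop the layer
          ("select", model_idx)  copy the layer from one candidate
          ("merge", weights)     weighted average across candidates (assumed rule)
    """
    stitched = []
    for pos, decision in enumerate(plan):
        op = decision[0]
        if op == "remove":
            continue
        if op == "select":
            stitched.append(list(candidates[decision[1]][pos]))
        elif op == "merge":
            weights = decision[1]
            layers = [model[pos] for model in candidates]
            width = len(layers[0])
            stitched.append([
                sum(w * layer[i] for w, layer in zip(weights, layers))
                for i in range(width)
            ])
        else:
            raise ValueError(f"unknown operation: {op}")
    return stitched

model_a = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
model_b = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
plan = [("select", 1), ("remove",), ("merge", [0.5, 0.5])]
compressed = stitch_layers([model_a, model_b], plan)
```

A zero-order optimizer would then score many such plans on held-out tasks and keep the best one, without needing gradients through the plan.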
[18] ReCode: Updating Code API Knowledge with Reinforcement Learning
Haoze Wu,Yunzhi Yao,Wenhao Yu,Huajun Chen,Ningyu Zhang
Main category: cs.CL
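TL;DR: ReCode uses reinforcement learning to update code API knowledge, addressing LLMs' difficulty in adapting to changing APIs and substantially improving code generation performance.
Details
Motivation: Large language models (LLMs) excel at code generation but struggle to adapt to frequent updates in external library APIs, which makes the generated code unreliable. The cause is outdated API knowledge in the training data, and the problem persists even when current documentation is provided.
Contribution: 1. Proposes the ReCode framework, which mimics how programmers adapt to API changes. 2. Builds a dataset of about 2,000 entries for training models to perform version migration. 3. Introduces a modified string similarity metric as the reward signal for reinforcement learning.
Method: 1. Trains LLMs to adapt to API updates with reinforcement learning (e.g., GRPO and DAPO). 2. Designs a modified code evaluation metric (string similarity) as the reinforcement learning reward.
Result: ReCode substantially improves LLMs' code generation in dynamic API scenarios. For example, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture.
Insight: Reinforcement learning adapts better than supervised fine-tuning in dynamic environments and has less impact on the model's general code generation abilities.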
Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
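The abstract mentions a modified string similarity metric used as the RL reward but does not define it. As a rough stand-in, the sketch below scores a generated snippet against a reference migration with Python's standard `difflib.SequenceMatcher`. The whitespace normalization and the function name are assumptions, and this is not ReCode's actual metric.

```python
import difflib

def similarity_reward(generated: str, reference: str) -> float:
    """Return a reward in [0, 1] based on character-level similarity.

    Whitespace is collapsed first so formatting differences are not penalized
    (an assumed normalization, not taken from the paper).
    """
    gen = " ".join(generated.split())
    ref = " ".join(reference.split())
    return difflib.SequenceMatcher(None, gen, ref).ratio()

# Illustrative migration target: code written against an updated API.
reference = "df = pd.concat([df, row])"
updated = "df  =  pd.concat([df, row])"   # same code, different spacing
stale = "df = df.append(row)"             # code written against the old API

reward_updated = similarity_reward(updated, reference)
reward_stale = similarity_reward(stale, reference)
```

A rule-based reward like this needs no test execution, which keeps rollouts cheap. The trade-off is that it can only approximate functional correctness.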
[19] OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Zengzhi Wang,Fan Zhou,Xuefeng Li,Pengfei Liu
Main category: cs.CL
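TL;DR: This paper examines how mid-training strategies affect reinforcement learning (RL) performance, proposes a two-stage training method (Stable-then-Decay), and releases the open-source OctoThinker model family along with a high-quality math corpus.
Details
Motivation: To study how different mid-training strategies affect RL performance across base language model families, especially on reasoning-intensive tasks, and thereby guide the development of the next generation of RL-scalable foundation models.
Contribution: 1. Shows that high-quality math corpora and long chain-of-thought (CoT) data improve RL performance; 2. Proposes the two-stage mid-training strategy Stable-then-Decay; 3. Releases the OctoThinker model family and the large math corpus MegaMath-Web-Pro-Max.
Method: Uses a two-stage mid-training strategy: stage one (200B tokens) trains the base model with a constant learning rate; stage two (20B tokens) focuses on CoT data with learning rate decay.
Result: The resulting OctoThinker model family shows strong RL compatibility and performance, narrowing the gap with RL-friendly models such as Qwen.
Insight: 1. Math corpora and long-CoT data substantially improve RL performance; 2. Data formatting is critical for training stability; 3. Scaling up mid-training consistently improves downstream RL performance.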
Abstract: Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
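The Stable-then-Decay schedule can be sketched as a function of tokens seen. The 200B and 20B token budgets come from the abstract. The peak learning rate, the cosine decay shape, and the floor ratio are assumptions chosen only for illustration.

```python
import math

def stable_then_decay_lr(
    tokens_seen,
    peak_lr=3e-4,          # assumed value, not from the paper
    stable_tokens=200e9,   # stage one: constant learning rate (from the abstract)
    decay_tokens=20e9,     # stage two: learning rate decay (from the abstract)
    min_lr_ratio=0.1,      # assumed floor as a fraction of peak_lr
):
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= stable_tokens:
        return peak_lr
    # Fraction of the decay stage completed, clamped to [0, 1].
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    min_lr = peak_lr * min_lr_ratio
    # Cosine decay from peak_lr down to min_lr (the decay shape is an assumption).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_stage_one = stable_then_decay_lr(100e9)
lr_mid_decay = stable_then_decay_lr(210e9)
lr_end = stable_then_decay_lr(220e9)
```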
[20] When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
Ammar Khairi,Daniel D’souza,Ye Shen,Julia Kreutzer,Sara Hooker
Main category: cs.CL
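TL;DR: This paper studies how to scale inference-time compute effectively for open-ended generation in multilingual, multi-task settings, and proposes new sampling and selection strategies for these settings that markedly improve performance on multilingual tasks.
Details
Motivation: Prior work has focused on English and a few specific domains (such as math and code), and general methods for open-ended generation and multilingual settings are lacking. This paper aims to fill that gap by exploring inference scaling techniques that adapt to different languages and tasks.
Contribution: 1) Proposes sampling and selection strategies for multilingual, open-ended tasks; 2) Shows that strategies effective in English can fail in multilingual settings; 3) Shows experimentally that the new methods improve performance significantly, especially in low-resource languages.
Method: 1) Adapts the temperature-variation-based sampling strategy; 2) Designs selection strategies suited to multilingual, multi-task settings; 3) Validates on a large-scale multilingual benchmark (m-ArenaHard-v2.0).
Result: For 8B-parameter models, the new methods raise win rates by +6.8 on average; for the 111B-parameter model, they deliver a +9.0 gain with only five samples.
Insight: Inference scaling techniques need to be tuned to the diversity of languages and tasks; generic methods may underperform, especially in multilingual and low-resource settings.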
Abstract: Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. However, work to date has focused on English and a handful of domains such as math and code. In contrast, we are most interested in techniques that generalize across open-ended tasks, formally verifiable tasks, and across languages. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both sampling strategy based on temperature variation and selection strategy must be adapted to account for diverse domains and varied language settings. We evaluate existing selection methods, revealing that strategies effective in English often fail to generalize across languages. We propose novel sampling and selection strategies specifically adapted for multilingual and multi-task inference scenarios, and show they yield notable gains across languages and tasks. In particular, our combined sampling and selection methods lead to an average +6.8 jump in win-rates for our 8B models on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At larger scale, Command-A (111B model) equipped with our methods, shows +9.0 improvement in win-rates on the same benchmark with just five samples against single-sample decoding, a substantial increase at minimal cost. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.
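The basic loop the abstract builds on (sample several outputs, then select one) can be sketched in a few lines. Both `generate` and `score` are hypothetical callables supplied by the caller. The paper's adapted sampling and selection strategies are not reproduced here.

```python
def sample_and_select(prompt, generate, score, temperatures):
    """Sample one candidate per temperature and return the best-scoring one.

    generate(prompt, temperature) -> str   hypothetical model call
    score(candidate) -> float              hypothetical selection criterion,
                                           e.g. a reward model or judge
    """
    candidates = [generate(prompt, t) for t in temperatures]
    return max(candidates, key=score)

# Toy stand-ins so the sketch runs without a model.
def fake_generate(prompt, temperature):
    return f"{prompt}@{temperature}"

def fake_score(candidate):
    # Pretend the judge prefers samples drawn near temperature 0.7.
    return -abs(float(candidate.split("@")[1]) - 0.7)

best = sample_and_select("hi", fake_generate, fake_score, [0.0, 0.7, 1.0])
```

Varying the temperature across samples is one way to diversify candidates. Per the abstract, how much variation helps depends on the language and task.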
[21] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
Shansan Gong,Ruixiang Zhang,Huangjie Zheng,Jiatao Gu,Navdeep Jaitly,Lingpeng Kong,Yizhe Zhang
Main category: cs.CL
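TL;DR: DiffuCoder is a code generation model built on a diffusion large language model (dLLM). Through systematic study, the paper reveals behaviors unique to dLLMs in code generation and proposes a new reinforcement learning method, coupled-GRPO, that significantly improves performance.
Details
Motivation: Diffusion language models (dLLMs) hold great promise for code generation thanks to their global planning and iterative refinement, but their training and inference mechanisms remain under-explored. This paper aims to explore the decoding behavior of dLLMs and improve their code generation ability.
Contribution: 1) Analyzes the decoding behavior of dLLMs, revealing key differences from autoregressive (AR) models; 2) Proposes coupled-GRPO to improve reinforcement learning training; 3) Trains and open-sources the DiffuCoder model, significantly improving code generation performance.
Method: 1) Trains the 7B-parameter DiffuCoder model; 2) Analyzes its decoding behavior; 3) Proposes coupled-GRPO, which uses complementary mask noise to reduce the variance of log-likelihood estimates.
Result: DiffuCoder improves by 4.4% on the EvalPlus code generation benchmark and reduces reliance on AR causal decoding.
Insight: The global planning and diverse generation orders of dLLMs give code generation a richer search space, and coupled-GRPO provides an efficient framework for reinforcement learning training of diffusion models.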
Abstract: Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder’s performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR causal decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
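The idea of "complementary mask noise" can be illustrated with a pair of masks over a completion in which every token position is masked in exactly one of the two. This is a minimal sketch of that pairing only. The function name, the `mask_ratio` parameter, and the uniform random split are assumptions, and the rest of coupled-GRPO is not shown.

```python
import random

def complementary_masks(seq_len, mask_ratio=0.5, seed=None):
    """Return two boolean masks (True = masked) that partition the positions.

    Each position is masked in exactly one of the two masks, so a pair of
    forward passes covers every token once.
    """
    rng = random.Random(seed)
    positions = list(range(seq_len))
    rng.shuffle(positions)
    cut = int(round(seq_len * mask_ratio))
    first = set(positions[:cut])
    mask_a = [i in first for i in range(seq_len)]
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b

mask_a, mask_b = complementary_masks(8, mask_ratio=0.5, seed=0)
```

Pairing the masks this way means each token's log-likelihood is estimated under masking at least once per pair, which is one intuition for why the variance of the estimate drops.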
[22] Memento: Note-Taking for Your Future Self
Chao Wan,Albert Gong,Mihir Mishra,Carl-Leander Henneking,Claas Beger,Kilian Q. Weinberger
Main category: cs.CL
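TL;DR: The paper proposes Memento, a three-stage prompting strategy that decomposes complex questions, dynamically builds a database of facts, and integrates the information, significantly improving multi-hop question answering.
Details
Motivation: Large language models excel at reasoning-only tasks but struggle on multi-hop question answering, where reasoning must be tightly coupled with retrieval. The authors address this limitation with a new prompting strategy.
Contribution: Proposes the Memento prompting strategy, which significantly improves multi-hop question answering, notably on the PhantomWiki, 2WikiMultiHopQA, and MuSiQue datasets.
Method: Memento has three stages: 1) decompose a complex question into smaller steps; 2) dynamically construct a database of facts; 3) piece the facts together to answer the question. It is compatible with existing prompting strategies such as chain-of-thought (CoT) and RAG.
Result: On PhantomWiki, Memento doubles CoT performance; on 2WikiMultiHopQA, it improves over CoT-RAG by more than 20 F1 points; on MuSiQue, it improves over ReAct by more than 3 F1 points.
Insight: By building and integrating information in stages, Memento shows that reasoning and retrieval can be combined effectively in multi-hop question answering, suggesting a new approach to complex tasks.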
Abstract: Large language models (LLMs) excel at reasoning-only tasks, but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. To overcome these limitations, we introduce a prompting strategy that first decomposes a complex question into smaller steps, then dynamically constructs a database of facts using LLMs, and finally pieces these facts together to solve the question. We show how this three-stage strategy, which we call Memento, can boost the performance of existing prompting strategies across diverse settings. On the 9-step PhantomWiki benchmark, Memento doubles the performance of chain-of-thought (CoT) when all information is provided in context. On the open-domain version of 2WikiMultiHopQA, CoT-RAG with Memento improves over vanilla CoT-RAG by more than 20 F1 percentage points and over the multi-hop RAG baseline, IRCoT, by more than 13 F1 percentage points. On the challenging MuSiQue dataset, Memento improves ReAct by more than 3 F1 percentage points, demonstrating its utility in agentic settings.
[23] Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Sonia K. Murthy,Rosie Zhao,Jennifer Hu,Sham Kakade,Markus Wulfmeier,Peng Qian,Tomer Ullman
Main category: cs.CL
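TL;DR: This paper uses cognitive models to study how large language models (LLMs) handle value trade-offs, revealing how models differ in weighing reasoning against social utility and how training dynamics affect value choices.
Details
Motivation: Human social interaction often requires balancing conflicting goals (such as honesty versus politeness), yet tools for analyzing such dynamic, multi-faceted values in LLMs are lacking. Cognitive models from cognitive science describe these trade-offs formally and offer a new lens for studying LLMs.
Contribution: 1. Brings cognitive models (such as a model of polite speech) into LLM analysis; 2. Systematically evaluates value trade-offs in two model settings: degrees of reasoning effort in black-box models and RL training dynamics of open-source models; 3. Finds marked shifts in value trade-offs early in training, with lasting effects.
Method: Uses a cognitive-science model that weights competing utility functions to analyze the value trade-offs LLMs make when generating utterances. The experiments cover black-box models (reasoning effort) and open-source models (RL training dynamics), and quantitatively compare informational and social utility.
Result: 1. Reasoning models favor informational utility over social utility; 2. Utility weights shift markedly early in training, and the base model and pretraining data matter more than the feedback dataset or alignment method; 3. The method is responsive to the diversity of the LLM landscape.
Insight: 1. LLMs show value trade-off patterns that differ from those of humans; 2. The choice of base model has a lasting effect on later value behavior; 3. The method offers new ideas for shaping model training (for example, reasoning ability or the balance between values).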
Abstract: Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person’s feelings. These value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs’ training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
cs.CV [Back]
[24] Computer Vision based Automated Quantification of Agricultural Sprayers Boom Displacement
Aryan Singh Dalal,Sidharth Rai,Rahul Singh,Treman Singh Kaloya,Rahul Harsha Cheppally,Ajay Sharda
Main category: cs.CV
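TL;DR: This paper develops an automated computer vision system that quantifies boom displacement on agricultural sprayers, with the goal of improving spray stability and reducing application errors.
Details
Motivation: When agricultural sprayers operate over complex terrain and at high speed, boom instability causes application errors. Boom movement has not been analyzed quantitatively, so an automated system is needed to measure boom displacement.
Contribution: Develops a YOLO-based computer vision system that tracks a target on the boom in real time with more than 90% accuracy and displacement estimates within 0.026 m, providing data to support improvements in boom design and control systems.
Method: Trains YOLO V7, V8, and V11 neural network models to track a target on the edge of the boom in real time, and validates the model output with an inclinometer sensor.
Result: The model detects the target with more than 90% accuracy, displacement estimates are within 0.026 m of the sensor data, and the system can be applied to a variety of sprayers.
Insight: A computer vision system can quantify boom movement effectively, giving a practical data analysis tool for improving sprayer design and application accuracy.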
Abstract: Application rate errors when using self-propelled agricultural sprayers for agricultural production remain a concern. Among other factors, spray boom instability is one of the major contributors to application errors. Boom widths of 38 m, combined with 30 km/h driving speeds, varying terrain, and machine dynamics when maneuvering complex field boundaries, make control of these booms very complex. However, there is no quantitative knowledge on the extent of boom movement to systematically develop a solution that might include boom designs and responsive boom control systems. Therefore, this study was conducted to develop an automated computer vision system to quantify the boom movement of various agricultural sprayers. A computer vision system was developed to track a target on the edge of the sprayer boom in real time. YOLO V7, V8, and V11 neural network models were trained to track the boom’s movements in field operations to quantify effective displacement in the vertical and transverse directions. An inclinometer sensor was mounted on the boom to capture boom angles and validate the neural network model output. The results showed that the model could detect the target with more than 90 percent accuracy, and distance estimates of the target on the boom were within 0.026 m of the inclinometer sensor data. This system can quantify the boom movement on the current sprayer and potentially on any other sprayer with minor modifications. The data can be used to make design improvements to make sprayer booms more stable and achieve greater application accuracy.
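The validation step compares vision-based displacement against inclinometer angles. The abstract does not give the geometry, so the sketch below assumes a rigid boom arm rotating about a fixed pivot, where the tip's vertical displacement is the arm length times the sine of the angle. The 19 m arm length (half of the 38 m width) and all function names are assumptions. Only the 0.026 m tolerance comes from the abstract.

```python
import math

def tip_vertical_displacement(arm_length_m, angle_deg):
    """Vertical displacement of the boom tip for a rigid arm (assumed geometry)."""
    return arm_length_m * math.sin(math.radians(angle_deg))

def within_tolerance(vision_estimate_m, angle_deg, arm_length_m, tol_m=0.026):
    """Check a vision-based estimate against the inclinometer-derived value."""
    reference = tip_vertical_displacement(arm_length_m, angle_deg)
    return abs(vision_estimate_m - reference) <= tol_m

# Hypothetical readings: a 1 degree tilt on a 19 m arm is roughly 0.33 m at the tip.
agrees = within_tolerance(0.34, angle_deg=1.0, arm_length_m=19.0)
disagrees = within_tolerance(0.40, angle_deg=1.0, arm_length_m=19.0)
```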
[25] ToSA: Token Merging with Spatial Awareness
Hsiang-Wei Huang,Wenhao Chai,Kuang-Ming Chen,Cheng-Yen Yang,Jenq-Neng Hwang
Main category: cs.CV
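TL;DR: ToSA is a token merging method that combines semantic and spatial awareness. It generates pseudo spatial tokens from depth images to guide merging, which accelerates ViT while improving performance and reducing compute overhead.
Details
Motivation: Existing token merging methods rely mainly on feature similarity and overlook spatial information, which may matter more in the early layers of ViT.
Contribution: Proposes ToSA, a novel method that brings spatial information into the token merging process by generating pseudo spatial tokens from depth images, making the merging strategy more accurate.
Method: Uses the depth image to generate pseudo spatial tokens as auxiliary information, and merges tokens using both semantic and spatial awareness to improve the compute efficiency of ViT.
Result: ToSA outperforms existing methods on multiple visual and embodied question answering benchmarks while significantly reducing runtime.
Insight: In the early layers, spatial information can serve as an important criterion for token merging, and combining it with semantic information better preserves scene structure.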
Abstract: Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token’s feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: https://github.com/hsiangwei0903/ToSA
[26] BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos
Jiahao Lin,Weixuan Peng,Bojia Zi,Yifeng Gao,Xianbiao Qi,Xingjun Ma,Yu-Gang Jiang
Main category: cs.CV
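TL;DR: This paper introduces BrokenVideos, a benchmark dataset for fine-grained artifact localization in AI-generated videos. It contains 3,254 videos with pixel-level annotations, and experiments show that it significantly improves artifact detection models.
Details
Motivation: AI-generated videos often contain visual artifacts (such as inconsistent motion and physically implausible trajectories), but no benchmark dataset is dedicated to artifact localization, which limits progress in this area.
Contribution: Introduces the BrokenVideos dataset, which fills the gap in fine-grained artifact localization for AI-generated videos and provides high-quality pixel-level annotations.
Method: Collects and annotates 3,254 AI-generated videos, ensures annotation quality through human inspection, and uses the dataset to train and evaluate artifact detection models.
Result: Experiments show that models trained on BrokenVideos perform significantly better at artifact localization.
Insight: Fine-grained annotation matters both for assessing the quality of AI-generated videos and for improving the models; BrokenVideos provides an important foundation for this research.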
Abstract: Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial for both automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.
[27] From 2D to 3D Cognition: A Brief Survey of General World Models
Ningwei Xie,Zizi Tian,Lei Yang,Xiao-Ping Zhang,Meng Guo,Jie Li
Main category: cs.CV
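TL;DR: This paper surveys the development of world models from 2D perception to 3D cognition, proposes a conceptual framework for analyzing technical progress systematically, and highlights the roles of 3D representations and world knowledge.
Details
Motivation: World models are increasingly important to the development of artificial general intelligence, but the transition from 2D to 3D cognition has not been analyzed systematically. This paper aims to fill that gap.
Contribution: Proposes a conceptual framework, systematically categorizes emerging techniques, and highlights three core capabilities of 3D cognitive world models: 3D physical scene generation, 3D spatial reasoning, and 3D spatial interaction.
Method: Synthesizes the literature to analyze the two technological drivers (3D representations and world knowledge), and dissects the core cognitive capabilities of 3D world models and their real-world applications.
Result: The paper summarizes the state of 3D world models, identifies challenges across data, modeling, and deployment, and outlines future research directions.
Insight: The shift from 2D to 3D cognition is key to building more robust and generalizable world models, with 3D representations and world knowledge as the two technological pillars.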
Abstract: World models have garnered increasing attention in the development of artificial general intelligence (AGI), serving as computational frameworks for learning representations of the external world and forecasting future states. While early efforts focused on 2D visual perception and simulation, recent 3D-aware generative world models have demonstrated the ability to synthesize geometrically consistent, interactive 3D environments, marking a shift toward 3D spatial cognition. Despite rapid progress, the field lacks systematic analysis to categorize emerging techniques and clarify their roles in advancing 3D cognitive world models. This survey addresses this need by introducing a conceptual framework, providing a structured and forward-looking review of world models transitioning from 2D perception to 3D cognition. Within this framework, we highlight two key technological drivers, particularly advances in 3D representations and the incorporation of world knowledge, as fundamental pillars. Building on these, we dissect three core cognitive capabilities that underpin 3D world modeling: 3D physical scene generation, 3D spatial reasoning, and 3D spatial interaction. We further examine the deployment of these capabilities in real-world applications, including embodied AI, autonomous driving, digital twin, and gaming/VR. Finally, we identify challenges across data, modeling, and deployment, and outline future directions for advancing more robust and generalizable 3D world models.
[28] Towards Efficient Exemplar Based Image Editing with Multimodal VLMs
Avadhoot Jadhav,Ashutosh Srivastava,Abhinav Java,Silky Singh,Tarun Ram Menta,Surgan Jandial,Balaji Krishnamurthy
Main category: cs.CV
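TL;DR: This paper proposes an exemplar-based image editing method that uses multimodal vision-language models and pretrained text-to-image diffusion models. Exemplar pairs (an image before and after an edit) express the desired edit more directly and avoid the ambiguity of text-only descriptions. Although it is optimization-free, the method still outperforms baselines and is about 4x faster.
Details
Motivation: Conventional text-to-image diffusion models perform image editing from text descriptions alone, but some edits are hard to express clearly in text, whereas an exemplar pair shows the editing intent more directly.
Contribution: 1) Proposes an image editing method based on exemplar pairs; 2) Combines multimodal vision-language models with diffusion models to complete the task efficiently without optimization; 3) Outperforms baselines on multiple edit types with a large speedup.
Method: Uses a pretrained text-to-image diffusion model and a multimodal vision-language model, conveys the editing intent through an exemplar pair, and builds an end-to-end, optimization-free pipeline.
Result: Experiments show that the method outperforms baselines on multiple edit types and is about 4x faster.
Insight: Multimodal vision-language models can help diffusion models understand the editing intent of an exemplar pair, improving both edit quality and efficiency.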
Abstract: Text-to-Image Diffusion models have enabled a wide array of image editing applications. However, capturing all types of edits through text alone can be challenging and cumbersome. The ambiguous nature of certain image edits is better expressed through an exemplar pair, i.e., a pair of images depicting an image before and after an edit respectively. In this work, we tackle exemplar-based image editing – the task of transferring an edit from an exemplar pair to a content image(s), by leveraging pretrained text-to-image diffusion models and multimodal VLMs. Even though our end-to-end pipeline is optimization-free, our experiments demonstrate that it still outperforms baselines on multiple types of edits while being ~4x faster.
[29] Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
Zhentao He,Can Zhang,Ziheng Wu,Zhenghao Chen,Yufei Zhan,Yifan Li,Zhao Zhang,Xian Wang,Minghui Qiu
Main category: cs.CV
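TL;DR: This paper introduces the KIE-HVQA benchmark for evaluating OCR hallucination in multimodal large language models on visually degraded documents, and proposes a GRPO-based framework to reduce hallucination.
Details
Motivation: Existing multimodal large language models tend to hallucinate under visual degradation because they over-rely on linguistic priors or misalign visual-textual reasoning, and so fail to recognize uncertainty.
Contribution: 1) Proposes KIE-HVQA, the first benchmark for evaluating OCR hallucination; 2) Proposes a GRPO-based framework that combines self-awareness of visual uncertainty with a refusal-to-answer mechanism, effectively reducing hallucination.
Method: Uses a GRPO-based framework that combines supervised fine-tuning and reinforcement learning, introducing self-awareness of visual uncertainty and a mechanism that raises task difficulty (refusal to answer).
Result: Experiments show that the 7B-parameter model achieves a 22% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA, with no significant performance drop on standard tasks.
Insight: Self-awareness of visual uncertainty and control of task difficulty are key to addressing OCR hallucination in multimodal models.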
Abstract: Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in the generation of hallucinatory content, especially when a precise answer is not feasible. To better demonstrate and analyze this phenomenon and problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. This dataset includes test samples spanning identity cards and invoices, with simulated real-world degradations for OCR reliability. This setup allows for evaluating models’ capacity, under degraded input, to distinguish reliable visual information and answer accordingly, thereby highlighting the challenge of avoiding hallucination on uncertain data. To achieve vision-faithful reasoning and thereby avoid the aforementioned issues, we further introduce a GRPO-based framework featuring a novel reward mechanism. By incorporating a self-awareness of visual uncertainty and an analysis method that initiates refusal to answer to increase task difficulty within our supervised fine-tuning and reinforcement learning framework, we successfully mitigated hallucinations in ambiguous regions. Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a 22% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA and there is no significant performance drop in standard tasks, highlighting both effectiveness and robustness.
[30] Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition
Man Duc Chuc
Main category: cs.CV
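TL;DR: This study explores combining pretrained foundation models to make Earth Observation data mining more generalizable and scalable, and finds that feature-level ensembles of small models can match or exceed large models while using fewer resources.
Details
Motivation: Most existing work in Earth Observation focuses on training large models from scratch, while combining existing pretrained models remains underexplored. The study tests whether the latter improves performance across diverse tasks.
Contribution: Proposes feature-level ensembling of pretrained foundation models, shows that combinations of small models can rival large models, and uses knowledge distillation to offer an efficient deployment path for real-world use.
Method: Uses the GEO-Bench benchmark to compare models including Prithvi, Hiera, and DOFA on 11 datasets, applying feature-level ensembling and knowledge distillation.
Result: Feature-level ensembles of small models match or exceed large models while reducing training time and compute; knowledge distillation further transfers the strengths of the ensemble into more compact models.
Insight: Combining pretrained models is an efficient path for Earth Observation, and knowledge distillation offers a practical solution for resource-constrained settings.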
Abstract: Foundation models are rapidly transforming Earth Observation data mining by enabling generalizable and scalable solutions for key tasks such as scene classification and semantic segmentation. While most efforts in the geospatial domain have focused on developing large models trained from scratch using massive Earth Observation datasets, an alternative strategy that remains underexplored is the reuse and combination of existing pretrained models. In this study, we investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance across a diverse set of key Earth Observation tasks. Using the GEO-Bench benchmark, we evaluate several prominent models, including Prithvi, Hiera, and DOFA, on eleven datasets covering a range of spatial resolutions, sensor modalities, and task types. The results show that feature-level ensembling of smaller pretrained models can match or exceed the performance of much larger models, while requiring less training time and computational resources. Moreover, the study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models, offering a practical path for deploying foundation models in real-world Earth Observation applications.
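One simple form of feature-level ensembling is to run each frozen encoder on the same input and concatenate the resulting feature vectors before a lightweight task head. The abstract does not state which fusion rule the study uses, so concatenation is an assumption here, and the encoders below are toy stand-ins rather than Prithvi, Hiera, or DOFA.

```python
def ensemble_features(image, encoders):
    """Concatenate the feature vectors produced by several frozen encoders.

    encoders: list of callables, each mapping an input to a list of floats.
    Concatenation is an assumed fusion rule chosen for illustration.
    """
    features = []
    for encode in encoders:
        features.extend(encode(image))
    return features

# Toy encoders standing in for pretrained backbones (hypothetical).
def remote_sensing_encoder(image):
    return [sum(image) / len(image), max(image)]

def general_vision_encoder(image):
    return [min(image), image[0], image[-1]]

image = [0.2, 0.8, 0.5]
fused = ensemble_features(image, [remote_sensing_encoder, general_vision_encoder])
```

Because the encoders stay frozen, only the small head on top of `fused` needs training, which is consistent with the reduced training cost the abstract reports.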
[31] UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
Yanzhe Chen,Huasong Zhong,Yan Li,Zhenheng Yang
Main category: cs.CV
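TL;DR: UniCode$^2$ is a cascaded large-scale visual codebook framework for unified multimodal understanding and generation. It addresses the limited semantic granularity, instability, and low utilization of existing codebook methods. Through a cascaded design and clustering of SigLIP sequence embeddings, it achieves efficient, semantically aligned visual tokenization and performs strongly on multi-task benchmarks.
Details
Motivation: In existing codebook-based multimodal large language models (MLLMs), visual tokenization either uses codebooks too small to capture semantics or scales them up naively, which causes unstable training and low utilization. UniCode$^2$ aims for more stable, semantically rich visual tokenization through a cascaded design and a large-scale codebook.
Contribution: 1. Proposes a cascaded codebook framework in which a frozen codebook anchors the embedding space and a trainable codebook refines task-specific semantics, improving stability and utilization. 2. Builds a 500K-entry codebook by clustering SigLIP sequence embeddings, preserving vision-language alignment while expanding capacity. 3. Supports seamless integration with pretrained diffusion decoders for high-quality visual generation.
Method: 1. Builds a large-scale (500K) visual codebook by clustering SigLIP sequence embeddings, preserving semantic alignment. 2. Uses a cascaded design: a frozen initial codebook stabilizes the embedding space, and a trainable codebook refines task-specific semantics. 3. Aligns visual tokens with textual semantics, enabling direct integration with diffusion decoders.
Result: Performs strongly across diverse task benchmarks, demonstrating that a large-scale visual token space is viable while preserving stability, semantic richness, and modularity.
Insight: 1. The cascaded codebook design balances training stability and semantic expressiveness. 2. A large-scale visual codebook with pretrained alignment integrates seamlessly across modalities, offering a new approach to multimodal tasks.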
Abstract: Unified multimodal large language models (MLLMs) have shown promise in jointly advancing multimodal understanding and generation, with visual codebooks discretizing images into tokens for autoregressive modeling. Existing codebook-based methods either rely on small vocabularies (~16K entries) that lack fine-grained semantics or naively scale up, resulting in low token utilization and unstable training. We propose UniCode$^2$, a cascaded codebook framework enabling large-scale, semantically aligned, and stable visual tokenization. By clustering millions of SigLIP sequence embeddings, we build a 500K-entry codebook that preserves vision-language alignment while expanding capacity. Stability is ensured via a cascaded design: a frozen codebook anchors the embedding space, and a trainable codebook refines task-specific semantics. This decoupling promotes high utilization and robust learning. Moreover, the alignment of our visual tokens with textual semantics enables seamless integration with pretrained diffusion decoders, supporting high-quality visual synthesis with minimal adaptation. UniCode$^2$ delivers strong performance across diverse benchmarks, demonstrating the viability of scaling visual token spaces without sacrificing stability, semantics, or modularity.
[32] Dynamic Bandwidth Allocation for Hybrid Event-RGB Transmission
Pujing Yang,Guangyi Zhang,Yunlong Cai,Lei Yu,Guanding Yu
Main category: cs.CV
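TL;DR: This paper proposes a hybrid event-RGB transmission scheme with dynamic bandwidth allocation. A joint event and image (E-I) transmission framework removes redundant information and optimizes bandwidth utilization.
Details
Motivation: Systems that pair event cameras with RGB cameras face bandwidth challenges when transmitting large volumes of triggered events and RGB images, and the two outputs contain redundant information.
Contribution: Develops a joint E-I transmission framework based on Bayesian modeling and the information bottleneck method, allocating bandwidth dynamically to improve transmission efficiency.
Method: Uses the information bottleneck method to separate shared and domain-specific information, and allocates bandwidth adaptively according to scene dynamics.
Result: Simulation results show that the scheme outperforms conventional systems in reconstruction quality and deblurring performance.
Insight: With information disentanglement and dynamic bandwidth allocation, hybrid systems can transmit more efficiently and reconstruct visual content better.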
Abstract: Event cameras asynchronously capture pixel-level intensity changes with extremely low latency. They are increasingly used in conjunction with RGB cameras for a wide range of vision-related applications. However, a major challenge in these hybrid systems lies in the transmission of the large volume of triggered events and RGB images. To address this, we propose a transmission scheme that retains efficient reconstruction performance of both sources while accomplishing real-time deblurring in parallel. Conventional RGB cameras and event cameras typically capture the same scene in different ways, often resulting in significant redundant information across their outputs. To address this, we develop a joint event and image (E-I) transmission framework to eliminate redundancy and thereby optimize channel bandwidth utilization. Our approach employs Bayesian modeling and the information bottleneck method to disentangle the shared and domain-specific information within the E-I inputs. This disentangled information bottleneck framework ensures both the compactness and informativeness of extracted shared and domain-specific information. Moreover, it adaptively allocates transmission bandwidth based on scene dynamics, i.e., more symbols are allocated to events for dynamic details or to images for static information. Simulation results demonstrate that the proposed scheme not only achieves superior reconstruction quality compared to conventional systems but also delivers enhanced deblurring performance.
[33] Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement
Kun Yuan,Tingxuan Chen,Shi Li,Joel L. Lavanchy,Christian Heiliger,Ege Özsoy,Yiming Huang,Long Bai,Nassir Navab,Vinkle Srivastav,Hongliang Ren,Nicolas Padoy
Main category: cs.CV
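TL;DR: SPA is a lightweight framework that uses a few annotations and task-graph guidance to achieve versatile surgical workflow understanding across institutions and procedures, with state-of-the-art few-shot performance.
Details
Motivation: Surgical workflows are complex and diverse, and existing foundation models are limited by domain shift in zero-shot settings, so they must adapt to new environments to be useful in practice.
Contribution: The SPA framework, which combines few-shot annotation, task-graph priors, and dynamic test-time adaptation for high-performing surgical phase recognition.
Method: Few-shot spatial adaptation aligns multi-modal embeddings with institution-specific scenes and phases, diffusion modeling that encodes task-graph priors enforces temporal consistency, and dynamic test-time adaptation improves reliability.
Result: SPA achieves state-of-the-art few-shot performance across multiple institutions and procedures, even outperforming full-shot models when given 32-shot labeled data.
Insight: A lightweight adaptation framework lets hospitals customize phase recognition models quickly; natural-language phase definitions, a few labeled images, and a task graph are enough for effective transfer.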
Abstract: The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA
[34] A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features
Ayush Lodh,Ritabrata Chakraborty,Shivakumara Palaiahnakote,Umapada Pal
Main category: cs.CV
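TL;DR: The paper proposes an end-to-end Transformer-based network that performs early fusion of offline images and online stroke data, jointly exploiting the complementary cues of handwriting in a shared latent space and exceeding the previous best accuracy.
Details
Motivation: Handwriting recognition usually relies on a single modality (offline images or online strokes) and ignores the complementary information between the two. The paper fuses both modalities to improve accuracy and robustness.
Contribution: An end-to-end network that fuses offline images and online stroke data early and uses a Transformer in a shared latent space to learn context-enhanced stroke embeddings, yielding higher recognition accuracy.
Method: 1. A patch encoder converts the grayscale image into visual tokens; 2. a lightweight Transformer embeds the stroke sequence; 3. learnable latent queries attend jointly to both token streams; 4. the resulting embeddings are pooled and decoded under a cross-entropy loss.
Result: Accuracy on IAMOn-DB and VNOn-DB exceeds the previous best by up to 1%, and the pipeline is also shown to adapt to the ISI-Air dataset.
Insight: Fusing offline images and online strokes early lets the two modalities reinforce each other during representation learning, which strengthens robustness and writer independence.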
Abstract: We posit that handwriting recognition benefits from complementary cues carried by the rasterized complex glyph and the pen’s trajectory, yet most systems exploit only one modality. We introduce an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$ sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy loss objective. Because integration occurs before any high-level classification, temporal cues reinforce each other during representation learning, producing stronger writer independence. Comprehensive experiments on IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%. Our study also shows adaptation of this pipeline with gesturification on the ISI-Air dataset. Our code can be found here.
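The joint attention of latent queries over both token streams can be illustrated with a minimal, dependency-free sketch. It assumes single-head scaled dot-product attention with identity key and value projections; the function name `latent_cross_attention` and the toy token values are hypothetical, and the authors' network will differ in its projections, depth, and dimensions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def latent_cross_attention(latent_queries, image_tokens, stroke_tokens):
    """Each latent query attends over the concatenated token streams.

    Assumption of this sketch: keys and values are the raw tokens
    (no learned projections), which keeps the arithmetic easy to follow.
    """
    tokens = image_tokens + stroke_tokens  # early fusion: one joint sequence
    dim = len(tokens[0])
    fused = []
    for query in latent_queries:
        weights = softmax([dot(query, t) / math.sqrt(dim) for t in tokens])
        fused.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


image_tokens = [[1.0, 0.0], [0.0, 1.0]]     # toy visual tokens
stroke_tokens = [[1.0, 1.0]]                # toy stroke token
latent_queries = [[0.0, 0.0], [5.0, 0.0]]   # toy stand-ins for learned queries
fused = latent_cross_attention(latent_queries, image_tokens, stroke_tokens)
```

The zero query averages all three tokens uniformly, while the second query concentrates on tokens with a large first component, regardless of which modality they came from.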
[35] From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios
Changliang Xia,Chengyou Jia,Zhuohang Dang,Minnan Luo
Main category: cs.CV
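TL;DR: The paper proposes DenseDiT, a unified and data-efficient method that exploits the visual priors of generative models for real-world dense prediction, and reports strong results on the DenseWorld benchmark.
Details
Motivation: Existing dense prediction methods perform well under idealized conditions but generalize poorly to real-world scenarios, where real data is also scarce.
Contribution: 1. The DenseWorld benchmark, covering 25 dense prediction tasks tied to real-world applications; 2. DenseDiT, which uses generative models' visual priors to handle diverse tasks through a unified strategy.
Method: DenseDiT combines a parameter-reuse mechanism with two lightweight branches that adaptively integrate multi-scale context, adding less than 0.1% extra parameters.
Result: DenseDiT outperforms existing methods on DenseWorld while using less than 0.01% of the baselines' training data.
Insight: The visual priors of generative models are valuable in data-scarce real-world settings, and lightweight structures can substantially improve efficiency.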
Abstract: Dense prediction tasks hold significant importance in computer vision, aiming to learn pixel-wise annotated labels for an input image. Despite advances in this field, existing methods primarily focus on idealized conditions, generalize poorly to real-world scenarios, and face a challenging scarcity of real-world data. To systematically study this problem, we first introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications, featuring unified evaluation across tasks. Then, we propose DenseDiT, which maximally exploits generative models’ visual priors to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism and two lightweight branches that adaptively integrate multi-scale context, working with less than 0.1% additional parameters. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment. Our data, checkpoints, and code are available at https://xcltql666.github.io/DenseDiTProj
[36] Breaking Spatial Boundaries: Spectral-Domain Registration Guided Hyperspectral and Multispectral Blind Fusion
Kunjing Yang,Libin Zheng,Minru Bai,Ting Lu,Leyuan Fang
Main category: cs.CV
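TL;DR: The paper addresses the fusion of unregistered hyperspectral images (HSIs) and multispectral images (MSIs) from the spectral domain: a lightweight Spectral Prior Learning (SPL) network enhances the spectral resolution of the MSI, and a blind sparse fusion model with an optimization algorithm delivers efficient fusion and improved classification.
Details
Motivation: Existing methods mostly register the HSI to the MSI through spatial transformations, but the large gap in spatial resolution makes them perform poorly, and registration is slow on large remote sensing images.
Contribution: 1. Registration is tackled in the spectral domain, with an SPL network that extracts spectral features from the HSI and enhances the spectral resolution of the MSI; 2. a blind sparse fusion (BSF) method based on group sparsity regularization, which avoids rank estimation and reduces computational complexity; 3. a Proximal Alternating Optimization (PAO) algorithm for solving the BSF model, with a convergence analysis.
Method: 1. The SPL network extracts HSI spectral features and enhances the MSI spectral resolution; 2. spatial downsampling produces the registered HSI; 3. group sparsity regularization enables blind sparse fusion; 4. PAO solves the resulting model.
Result: Experiments on simulated and real datasets verify the method's effectiveness for registration and fusion and show improved classification performance.
Insight: Handling registration in the spectral domain can improve performance and reduce computational complexity, offering a new direction for HSI-MSI fusion.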
Abstract: The blind fusion of unregistered hyperspectral images (HSIs) and multispectral images (MSIs) has attracted growing attention recently. To address the registration challenge, most existing methods employ spatial transformations on the HSI to achieve alignment with the MSI. However, due to the substantial differences in spatial resolution of the images, the performance of these methods is often unsatisfactory. Moreover, the registration process tends to be time-consuming when dealing with large-sized images in remote sensing. To address these issues, we propose tackling the registration problem from the spectral domain. Initially, a lightweight Spectral Prior Learning (SPL) network is developed to extract spectral features from the HSI and enhance the spectral resolution of the MSI. Following this, the obtained image undergoes spatial downsampling to produce the registered HSI. In this process, subspace representation and cyclic training strategy are employed to improve spectral accuracy of the registered HSI obtained. Next, we propose a blind sparse fusion (BSF) method, which utilizes group sparsity regularization to equivalently promote the low-rankness of the image. This approach not only circumvents the need for rank estimation, but also reduces computational complexity. Then, we employ the Proximal Alternating Optimization (PAO) algorithm to solve the BSF model, and present its convergence analysis. Finally, extensive numerical experiments on simulated and real datasets are conducted to verify the effectiveness of our method in registration and fusion. We also demonstrate its efficacy in enhancing classification performance.
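Group sparsity regularization inside a proximal algorithm usually comes down to a row-wise shrinkage step. The sketch below shows the standard proximal operator of the group penalty threshold * sum_i ||row_i||_2; it assumes groups are rows of a coefficient matrix, which the abstract does not specify, and `prox_group_l21` is a hypothetical name rather than a function from the authors' code.

```python
import math


def prox_group_l21(rows, threshold):
    """Row-wise soft-thresholding.

    This is the proximal operator of threshold * sum_i ||row_i||_2.
    Treating each row as one group is an assumption of this sketch.
    """
    shrunk = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows whose norm is below the threshold are zeroed out entirely.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk


coefficients = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
result = prox_group_l21(coefficients, threshold=1.0)
active_rows = sum(1 for row in result if any(v != 0.0 for v in row))
```

Only the first row survives the shrinkage here. Zeroing whole rows is what lets a group-sparse penalty limit the number of active components without fixing a rank in advance.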
[37] Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations
Shunqi Mao,Wei Guo,Chaoyi Zhang,Weidong Cai
Main category: cs.CV
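TL;DR: Ctrl-Z Sampling is a new diffusion sampling strategy that alternates dynamically between forward refinement and backward exploration to escape local optima, improving generation quality and conditional alignment.
Details
Motivation: Diffusion models perform well in conditional generation but often converge to local optima because of latent-space complexity and suboptimal initialization, producing outputs that are globally inconsistent or misaligned with the condition. Prior work strengthens guidance signals or manipulates the initial noise distribution, with limited success. Ctrl-Z Sampling aims to detect and escape local optima during sampling.
Contribution: Ctrl-Z Sampling, a model-agnostic sampling strategy that uses a reward model to identify local optima, injects noise to revert to an earlier state, and adjusts the sampling trajectory dynamically, substantially improving generation quality.
Method: 1. A reward model identifies potential local optima; 2. noise is injected to revert to an earlier, noisier state; 3. candidate trajectories are evaluated and only improving ones are accepted; 4. forward refinement and backward exploration alternate dynamically.
Result: Experiments show that Ctrl-Z Sampling substantially improves generation quality and alignment at the cost of only about a 7.6x increase in function evaluations.
Insight: Local optima in diffusion generation can be mitigated by a dynamic explore-and-refine strategy, offering a new approach for generating high-dimensional, complex data.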
Abstract: Diffusion models have shown strong performance in conditional generation by progressively denoising Gaussian noise toward a target data distribution. This denoising process can be interpreted as a form of hill climbing in a learned latent space, where the model iteratively refines the sample toward regions of higher probability. However, diffusion models often converge to local optima that are locally visually coherent yet globally inconsistent or conditionally misaligned, due to latent space complexity and suboptimal initialization. Prior efforts attempted to address this by strengthening guidance signals or manipulating the initial noise distribution. We introduce Controlled Random Zigzag Sampling (Ctrl-Z Sampling), a novel sampling strategy designed to detect and escape such local maxima during conditional generation. The method first identifies potential local maxima using a reward model. Upon detection, it injects noise and reverts to a previous, noisier state to escape the current optimization plateau. The reward model then evaluates candidate trajectories, accepting only those that offer improvement, while progressively deeper retreat enables stronger escapes when nearby alternatives fail. This controlled random zigzag process allows dynamic alternation between forward refinement and backward exploration, enhancing both alignment and visual quality in the generated outputs. The proposed Ctrl-Z Sampling is model-agnostic and compatible with existing diffusion frameworks. Experimental results show that Ctrl-Z Sampling substantially improves generation quality with only around 7.6X increase in function evaluations.
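The zigzag loop described in this abstract can be mimicked on a one-dimensional toy problem. Everything in the sketch is an illustrative assumption: the scalar state, the two-mode `denoise` step, the distance-based `reward`, and the parameters `max_depth`, `min_gain`, and `noise_scale` are hypothetical stand-ins, not the authors' diffusion model, reward model, or settings.

```python
import random

GOOD_MODE, POOR_MODE = 2.0, -1.0  # toy "global" and "local" optima


def reward(x):
    """Stand-in reward model: higher is better, best at GOOD_MODE."""
    return -abs(x - GOOD_MODE)


def denoise(x):
    """Toy refinement step: move halfway toward the nearest mode."""
    near_good = abs(x - GOOD_MODE) <= abs(x - POOR_MODE)
    nearest = GOOD_MODE if near_good else POOR_MODE
    return x + 0.5 * (nearest - x)


def plain_sampling(x, steps):
    for _ in range(steps):
        x = denoise(x)
    return x


def zigzag_sampling(x, steps, rng, max_depth=3, min_gain=0.05, noise_scale=1.0):
    for _ in range(steps):
        previous = reward(x)
        x = denoise(x)
        if reward(x) - previous >= min_gain:
            continue  # still improving, keep refining forward
        # Plateau detected: retreat to progressively noisier states.
        for depth in range(1, max_depth + 1):
            candidate = x + rng.gauss(0.0, noise_scale * depth)
            for _ in range(depth):
                candidate = denoise(candidate)
            if reward(candidate) > reward(x):
                x = candidate  # accept only improving trajectories
                break
    return x


rng = random.Random(0)
start = -0.8  # poor initialization, close to the weaker mode
plain = plain_sampling(start, steps=10)
zigzag = zigzag_sampling(start, steps=10, rng=rng)
```

Because candidates are accepted only when the reward improves, the zigzag sampler in this toy never finishes with a lower reward than plain sampling; whether it actually reaches the better mode depends on the noise draws. The extra `denoise` calls spent on rejected candidates are the toy counterpart of the additional function evaluations reported in the abstract.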
[38] Feature Hallucination for Self-supervised Action Recognition
Lei Wang,Piotr Koniusz
Main category: cs.CV
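TL;DR: The paper proposes a self-supervised action recognition framework that uses feature hallucination to infer missing multimodal features at test time and improve accuracy. It introduces Object Detection Features (ODF) and Saliency Detection Features (SDF) and adds uncertainty modeling and a robust loss function.
Details
Motivation: Existing methods rely on multimodal data, but some modalities may be unavailable at test time. Maintaining high performance without those auxiliary features is challenging.
Contribution: 1. A feature hallucination mechanism that enriches feature representations at test time; 2. two new descriptors, ODF and SDF; 3. aleatoric uncertainty modeling combined with a robust loss function.
Method: A deep translational framework jointly predicts action concepts and auxiliary features; hallucination streams infer the missing features; ODF and SDF are integrated with auxiliary modalities such as optical flow and skeleton data.
Result: State-of-the-art performance on benchmarks including Kinetics-400, Kinetics-600, and Something-Something V2.
Insight: Feature hallucination can compensate for missing modalities, while ODF and SDF improve the modeling of action-relevant regions.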
Abstract: Understanding human actions in videos requires more than raw pixel analysis; it relies on high-level semantic reasoning and effective integration of multimodal features. We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. At test time, hallucination streams infer missing cues, enriching feature representations without increasing computational overhead. To focus on action-relevant regions beyond raw pixels, we introduce two novel domain-specific descriptors. Object Detection Features (ODF) aggregate outputs from multiple object detectors to capture contextual cues, while Saliency Detection Features (SDF) highlight spatial and intensity patterns crucial for action recognition. Our framework seamlessly integrates these descriptors with auxiliary modalities such as optical flow, Improved Dense Trajectories, skeleton data, and audio cues. It remains compatible with state-of-the-art architectures, including I3D, AssembleNet, Video Transformer Network, FASTER, and recent models like VideoMAE V2 and InternVideo2. To handle uncertainty in auxiliary features, we incorporate aleatoric uncertainty modeling in the hallucination step and introduce a robust loss function to mitigate feature noise. Our multimodal self-supervised action recognition framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.
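Aleatoric uncertainty in a feature-regression step is commonly modeled by letting the network predict a log-variance next to each feature and training with a Gaussian negative log-likelihood. The sketch below shows that common formulation only; the abstract does not give the authors' exact loss, and `aleatoric_regression_loss` is a hypothetical name.

```python
import math


def aleatoric_regression_loss(target, predicted_mean, predicted_log_var):
    """Gaussian negative log-likelihood with a predicted log-variance.

    Per feature: 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, where s = log(sigma^2).
    This is a widely used formulation, assumed here for illustration.
    """
    total = 0.0
    for y, mu, s in zip(target, predicted_mean, predicted_log_var):
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(target)


target = [1.0, 2.0]
hallucinated = [1.0, 0.0]  # second feature is badly predicted
# Same prediction, but the second version flags the noisy feature as uncertain.
confident = aleatoric_regression_loss(target, hallucinated, [0.0, 0.0])
uncertain = aleatoric_regression_loss(target, hallucinated, [0.0, math.log(4.0)])
```

Raising the predicted variance on the noisy feature lowers the loss, so unreliable auxiliary targets are down-weighted, while the 0.5 * s term keeps the model from declaring everything uncertain.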
[39] InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking
Abdullah All Tanvir,Xin Zhong
Main category: cs.CV
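TL;DR: This paper proposes a deep zero-watermarking framework based on invariant feature learning: noise-adversarial training produces distortion-resistant features, and a multibit zero-watermarking scheme built on them achieves high robustness.
Details
Motivation: Existing zero-watermarking methods are not robust enough to image distortions; a framework is needed that extracts features that are both invariant to distortions and semantically expressive.
Contribution: 1. A noise-adversarial learning module that generates distortion-invariant, semantically rich features; 2. a learning-based multibit zero-watermarking scheme.
Method: 1. The feature extractor is trained with adversarial supervision against a distortion discriminator and a reconstruction constraint; 2. the features are projected onto trainable reference codes optimized to match a target binary message.
Result: Across a wide range of distortions, the method achieves state-of-the-art robustness in both feature stability and watermark recovery.
Insight: Noise-adversarial learning makes features more invariant to distortions, and optimizable reference codes further strengthen watermark robustness.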
Abstract: This paper introduces a novel deep learning framework for robust image zero-watermarking based on distortion-invariant feature learning. As a zero-watermarking scheme, our method leaves the original image unaltered and learns a reference signature through optimization in the feature space. The proposed framework consists of two key modules. In the first module, a feature extractor is trained via noise-adversarial learning to generate representations that are both invariant to distortions and semantically expressive. This is achieved by combining adversarial supervision against a distortion discriminator and a reconstruction constraint to retain image content. In the second module, we design a learning-based multibit zero-watermarking scheme where the trained invariant features are projected onto a set of trainable reference codes optimized to match a target binary message. Extensive experiments on diverse image datasets and a wide range of distortions show that our method achieves state-of-the-art robustness in both feature stability and watermark recovery. Comparative evaluations against existing self-supervised and deep watermarking techniques further highlight the superiority of our framework in generalization and robustness.
[40] A Novel Large Vision Foundation Model (LVFM)-based Approach for Generating High-Resolution Canopy Height Maps in Plantations for Precision Forestry Management
Shen Tan,Xin Zhang,Liangxiu Han,Huaguo Huang,Han Wang
Main category: cs.CV
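TL;DR: The paper proposes a new approach based on a Large Vision Foundation Model (LVFM) for generating high-resolution canopy height maps (CHMs) to support precision forestry management. It combines a feature extractor, a self-supervised feature enhancement module, and a height estimator, and outperforms existing methods.
Details
Motivation: Conventional lidar-based methods are expensive, while deep learning on RGB imagery struggles to extract canopy height features accurately. A low-cost, accurate method is therefore important for forestry management and carbon sequestration assessment.
Contribution: A novel LVFM-based model that combines self-supervised feature enhancement with a height estimator, improving the accuracy of generated CHMs and generalizing well to non-training areas.
Method: The model integrates a feature extractor, a self-supervised feature enhancement module, and a height estimator, and is evaluated on 1-meter-resolution Google Earth imagery.
Result: The model achieves a mean absolute error of 0.09 m, a root mean square error of 0.24 m, and a correlation of 0.78 against lidar-based CHMs, and the resulting CHMs enable over 90% success in individual tree detection.
Insight: The approach is a promising, scalable, low-cost tool for evaluating carbon sequestration in both plantations and natural forests.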
Abstract: Accurate, cost-effective monitoring of plantation aboveground biomass (AGB) is crucial for supporting local livelihoods and carbon sequestration initiatives like the China Certified Emission Reduction (CCER) program. High-resolution canopy height maps (CHMs) are essential for this, but standard lidar-based methods are expensive. While deep learning with RGB imagery offers an alternative, accurately extracting canopy height features remains challenging. To address this, we developed a novel model for high-resolution CHM generation using a Large Vision Foundation Model (LVFM). Our model integrates a feature extractor, a self-supervised feature enhancement module to preserve spatial details, and a height estimator. Tested in Beijing’s Fangshan District using 1-meter Google Earth imagery, our model outperformed existing methods, including conventional CNNs. It achieved a mean absolute error of 0.09 m, a root mean square error of 0.24 m, and a correlation of 0.78 against lidar-based CHMs. The resulting CHMs enabled over 90% success in individual tree detection, high accuracy in AGB estimation, and effective tracking of plantation growth, demonstrating strong generalization to non-training areas. This approach presents a promising, scalable tool for evaluating carbon sequestration in both plantations and natural forests.
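The three agreement measures quoted in this abstract are simple to compute once predicted and lidar heights are paired pixel by pixel. The sketch below uses made-up heights and assumes that "correlation" means the Pearson coefficient, which the abstract does not state; the function names are hypothetical.

```python
import math


def mean_absolute_error(pred, ref):
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def root_mean_square_error(pred, ref):
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def pearson_correlation(pred, ref):
    """Pearson r; assumed here to be the correlation reported in the paper."""
    mean_p = sum(pred) / len(pred)
    mean_r = sum(ref) / len(ref)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


# Illustrative canopy heights in meters (not data from the paper).
predicted_heights = [10.0, 12.0, 15.0, 11.0]
lidar_heights = [10.0, 12.5, 14.5, 11.0]

mae = mean_absolute_error(predicted_heights, lidar_heights)
rmse = root_mean_square_error(predicted_heights, lidar_heights)
corr = pearson_correlation(predicted_heights, lidar_heights)
```

RMSE is never smaller than MAE because it penalizes large errors more heavily, which is consistent with the 0.24 m versus 0.09 m figures reported above.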
[41] Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation
Changlu Guo,Anders Nymark Christensen,Morten Rieger Hannemose
Main category: cs.CV
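TL;DR: Med-Art adapts PixArt-α, a pre-trained model based on the Diffusion Transformer (DiT), and proposes Hybrid-Level Diffusion Fine-tuning (HLDF) to address data scarcity and the lack of text descriptions in medical image generation.
Details
Motivation: Medical image generation is hampered by small datasets and scarce medical text, and existing text-to-image models perform poorly in this domain, so a targeted solution is needed.
Contribution: 1. Med-Art, a framework designed for medical image generation with limited data; 2. the HLDF method, which addresses issues such as overly saturated colors; 3. state-of-the-art performance on two medical image datasets.
Method: 1. Adapt the pre-trained, DiT-based PixArt-α model; 2. fine-tune with HLDF, which enables pixel-level losses.
Result: State-of-the-art results as measured by FID, KID, and downstream classification performance.
Insight: Med-Art shows the potential of diffusion models on small medical datasets, and generating descriptions with vision-language models is an effective way to overcome text scarcity.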
Abstract: Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application in medical image generation still faces significant challenges, including small dataset sizes and the scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images, which overcomes the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-$\alpha$, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which enables pixel-level losses, effectively addressing issues such as overly saturated colors. We achieve state-of-the-art performance on two medical image datasets, measured by FID, KID, and downstream classification performance.
[42] A Deep Learning Approach to Identify Rock Bolts in Complex 3D Point Clouds of Underground Mines Captured Using Mobile Laser Scanners
Dibyayan Patra,Pasindu Ranasinghe,Bikram Banerjee,Simit Raval
Main category: cs.CV
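TL;DR: The paper proposes DeepBolt, a two-stage deep learning architecture for automatic and efficient identification of rock bolts in complex 3D point clouds of underground mines. It copes with noise, varying environments, and occluded targets, and clearly outperforms existing methods.
Details
Motivation: Existing automated rock bolt identification relies on feature engineering and traditional machine learning, which lack robustness in complex environments, and manual surveying is slow. An automated deep learning solution is therefore needed.
Contribution: DeepBolt, a two-stage deep learning architecture designed for severe class imbalance, which improves rock bolt identification in IoU, precision, and recall.
Method: A two-stage deep learning architecture handles the severe class imbalance and targets small, partially obscured objects.
Result: IoU for rock bolt points improves by up to 42.5% over state-of-the-art semantic segmentation models, and rock bolt classification reaches 96.41% precision and 96.96% recall.
Insight: A purpose-built deep learning architecture can handle small targets and class imbalance in complex 3D point clouds, offering a new approach to mine safety monitoring.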
Abstract: Rock bolts are crucial components of the subterranean support systems in underground mines that provide adequate structural reinforcement to the rock mass to prevent unforeseen hazards like rockfalls. This makes frequent assessments of such bolts critical for maintaining rock mass stability and minimising risks in underground mining operations. Where manual surveying of rock bolts is challenging due to the low light conditions in the underground mines and the time-intensive nature of the process, automated detection of rock bolts serves as a plausible solution. To that end, this study focuses on the automatic identification of rock bolts within medium to large-scale 3D point clouds obtained from underground mines using mobile laser scanners. Existing techniques for automated rock bolt identification primarily rely on feature engineering and traditional machine learning approaches. However, such techniques lack robustness as these point clouds present several challenges due to data noise, varying environments, and complex surrounding structures. Moreover, the target rock bolts are extremely small objects within large-scale point clouds and are often partially obscured due to the application of reinforcement shotcrete. Addressing these challenges, this paper proposes an approach termed DeepBolt, which employs a novel two-stage deep learning architecture specifically designed for handling severe class imbalance for the automatic and efficient identification of rock bolts in complex 3D point clouds. The proposed method surpasses state-of-the-art semantic segmentation models by up to 42.5% in Intersection over Union (IoU) for rock bolt points. Additionally, it outperforms existing rock bolt identification techniques, achieving a 96.41% precision and 96.96% recall in classifying rock bolts, demonstrating its robustness and effectiveness in complex underground environments.
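The metrics reported in this abstract are defined per class, which matters when the target class is rare. The sketch below computes IoU, precision, and recall for a "bolt" label on a made-up set of ten points; `point_scores` and the label encoding (1 = bolt, 0 = background) are assumptions for illustration, not part of DeepBolt.

```python
def point_scores(predicted, reference, positive=1):
    """IoU, precision, and recall for one class over per-point labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == positive and r == positive)
    fp = sum(1 for p, r in pairs if p == positive and r != positive)
    fn = sum(1 for p, r in pairs if p != positive and r == positive)
    return {
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels: 1 = bolt point, 0 = background (assumed encoding).
reference = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

scores = point_scores(predicted, reference)
accuracy = sum(p == r for p, r in zip(predicted, reference)) / len(reference)
```

Overall accuracy is 0.8 here even though the bolt IoU is only 0.5, which is why class-specific IoU is the more informative measure under severe class imbalance.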
[43] Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos
Yitong Quan,Benjamin Kiefer,Martin Messmer,Andreas Zell
Main category: cs.CV
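TL;DR: The paper proposes a lightweight multi-frame integration method: consecutive frames are stacked as input to a YOLO detector while only the output for the target frame is supervised, so that temporal information improves the robustness of object detection in videos.
Details
Motivation: Detectors such as YOLO process frames independently and ignore the temporal context in videos, while complex temporal modules increase computational cost. The paper seeks a simple, efficient strategy against transient problems (such as motion blur and occlusion) that degrade single-frame detection.
Contribution: 1) A lightweight multi-frame integration method that needs only minimal architectural changes; 2) validation on MOT20Det and the authors' BOAT360 dataset; 3) the BOAT360 dataset itself as a new benchmark.
Method: Multiple frames are stacked as input and a single frame's output is supervised, preserving YOLO's simplicity and real-time inference while exploiting temporal information.
Result: Experiments show that the method improves detection robustness, especially for lightweight models, narrowing the gap between compact and heavy detectors.
Insight: Simple temporal integration such as multi-frame input can noticeably improve video object detection, particularly for lightweight models, and offers an efficient option for practical applications.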
Abstract: Modern image-based object detection models, such as YOLOv7, primarily process individual frames independently, thus ignoring valuable temporal context naturally present in videos. Meanwhile, existing video-based detection methods often introduce complex temporal modules, significantly increasing model size and computational complexity. In practical applications such as surveillance and autonomous driving, transient challenges including motion blur, occlusions, and abrupt appearance changes can severely degrade single-frame detection performance. To address these issues, we propose a straightforward yet highly effective strategy: stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame. This approach leverages temporal information with minimal modifications to existing architectures, preserving simplicity, computational efficiency, and real-time inference capability. Extensive experiments on the challenging MOT20Det and our BOAT360 datasets demonstrate that our method improves detection robustness, especially for lightweight models, effectively narrowing the gap between compact and heavy detection networks. Additionally, we contribute the BOAT360 benchmark dataset, comprising annotated fisheye video sequences captured from a boat, to support future research in multi-frame video object detection in challenging real-world scenarios.
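The input construction described in this abstract can be sketched without any detector at all. The example below assumes that frames are concatenated along the channel axis, that the target is the last frame of the window, and that early frames are padded by repeating the first frame; none of these details are confirmed by the abstract, and `stack_frames` and `make_training_sample` are hypothetical helpers.

```python
def stack_frames(frames):
    """Concatenate consecutive H x W x C frames along the channel axis."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum((frame[r][c] for frame in frames), []) for c in range(width)]
        for r in range(height)
    ]


def make_training_sample(video, labels, target_index, num_frames=3):
    """Build one multi-frame input and keep only the target frame's labels."""
    # Clamp indices so early frames reuse the first frame as padding
    # (an assumption of this sketch).
    indices = [max(0, target_index - offset)
               for offset in range(num_frames - 1, -1, -1)]
    stacked = stack_frames([video[i] for i in indices])
    return stacked, labels[target_index]  # supervise only the target frame


# Toy video: 4 frames of size 2 x 2 x 3, each filled with its frame index.
video = [[[[float(t)] * 3 for _ in range(2)] for _ in range(2)] for t in range(4)]
labels = ["boxes_0", "boxes_1", "boxes_2", "boxes_3"]  # placeholder annotations

sample, target = make_training_sample(video, labels, target_index=2)
```

Each pixel of `sample` now carries nine channels instead of three, so under these assumptions only the detector's first layer would need to accept the wider input.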
[44] AdvMIM: Adversarial Masked Image Modeling for Semi-Supervised Medical Image Segmentation
Lei Zhu,Jun Zhou,Rick Siow Mong Goh,Yong Liu
Main category: cs.CV
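TL;DR: AdvMIM is an adversarial masked image modeling method that increases the supervision signal and reduces the domain gap to improve Transformer-based semi-supervised medical image segmentation.
Details
Motivation: Transformers excel at medical image segmentation but need large amounts of labeled data. Existing semi-supervised methods obtain promising results with joint CNN-Transformer learning, yet training a Transformer with limited labels remains difficult.
Contribution: 1. An adversarial masked image modeling method that increases the supervision signal through a masked domain; 2. a theoretical analysis from a multi-domain learning perspective and an adversarial training loss that reduces the domain gap; 3. an extension of the method to CNNs.
Method: Masked image modeling constructs an auxiliary masked domain; the Transformer is trained with labels and pseudo-labels to predict the full segmentation mask from masked inputs; an adversarial training loss reduces the gap between the original and masked domains.
Result: The method significantly outperforms existing methods on three public medical image segmentation datasets.
Insight: Combining masked modeling with adversarial training improves semi-supervised learning and applies to both Transformer and CNN architectures.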
Abstract: The Vision Transformer has recently gained tremendous popularity in medical image segmentation due to its superior capability in capturing long-range dependencies. However, the transformer requires a large amount of labeled data to be effective, which hinders its applicability in annotation-scarce semi-supervised learning scenarios where only limited labeled data is available. State-of-the-art semi-supervised learning methods propose combinatorial CNN-Transformer learning to cross-teach a transformer with a convolutional neural network, which achieves promising results. However, it remains a challenging task to effectively train the transformer with limited labeled data. In this paper, we propose an adversarial masked image modeling method to fully unleash the potential of the transformer for semi-supervised medical image segmentation. The key challenge in semi-supervised learning with a transformer lies in the lack of sufficient supervision signal. To this end, we propose to construct an auxiliary masked domain from the original domain with masked image modeling and train the transformer to predict the entire segmentation mask from masked inputs to increase the supervision signal. We leverage the original labels from labeled data and pseudo-labels from unlabeled data to learn the masked domain. To further benefit the original domain from the masked domain, we provide a theoretical analysis of our method from a multi-domain learning perspective and devise a novel adversarial training loss to reduce the domain gap between the original and masked domains, which boosts semi-supervised learning performance. We also extend adversarial masked image modeling to CNNs. Extensive experiments on three public medical image segmentation datasets demonstrate the effectiveness of our method, which outperforms existing methods significantly. Our code is publicly available at https://github.com/zlheui/AdvMIM.
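Constructing the masked domain starts from an ordinary patch-masking step. The sketch below zeroes out a random subset of non-overlapping patches in a 2D image; the patch size, the mask ratio, the zero fill value, and the name `mask_patches` are illustrative assumptions, since the abstract does not specify the authors' masking settings.

```python
import random


def mask_patches(image, patch_size, mask_ratio, rng):
    """Zero out a random subset of non-overlapping patches in a 2D image."""
    height, width = len(image), len(image[0])
    patches = [(r, c)
               for r in range(0, height, patch_size)
               for c in range(0, width, patch_size)]
    num_masked = round(mask_ratio * len(patches))
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r0, c0 in rng.sample(patches, num_masked):
        for r in range(r0, min(r0 + patch_size, height)):
            for c in range(c0, min(c0 + patch_size, width)):
                masked[r][c] = 0.0
    return masked


image = [[1.0] * 8 for _ in range(8)]  # toy 8 x 8 image
masked_image = mask_patches(image, patch_size=2, mask_ratio=0.5,
                            rng=random.Random(0))
hidden_pixels = sum(v == 0.0 for row in masked_image for v in row)
```

In the setting the abstract describes, the segmentation target for `masked_image` would remain the full mask (a ground-truth label or a pseudo-label), so every masked view adds supervision without adding annotation.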
[45] Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization
Zhiwang Zhang,Dong Xu,Wanli Ouyang,Chuanqi Tan
Main category: cs.CV
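TL;DR: This paper proposes a division-and-summarization (DaS) framework for dense video captioning. Each untrimmed long video is partitioned into event proposals, and a two-stage LSTM with a hierarchical attention mechanism combines visual features with the semantics of generated sentences to produce one concise description per event.
Details
Motivation: Existing dense video captioning methods suffer from redundant or incomplete descriptions, so a framework that combines visual and semantic information is needed to generate more accurate descriptions.
Contribution: 1. The DaS framework, which recasts dense video captioning as a visual-cue-aided sentence summarization problem; 2. a two-stage LSTM with a hierarchical attention mechanism that combines visual and semantic information.
Method: 1. Partition the video into event proposals and extract visual features from each video segment; 2. generate a sentence description for each segment with an existing captioning approach; 3. use a two-stage LSTM (encoder-decoder) with hierarchical attention to summarize the sentences and visual features.
Result: Experiments on the ActivityNet Captions dataset demonstrate the effectiveness of the DaS framework.
Insight: A two-stage summarization approach that combines visual and semantic information can make dense video descriptions more accurate and concise.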
Abstract: In this work, we propose a division-and-summarization (DaS) framework for dense video captioning. After partitioning each untrimmed long video into multiple event proposals, where each event proposal consists of a set of short video segments, we extract visual features (e.g., C3D features) from each segment and use an existing image/video captioning approach to generate a one-sentence description for this segment. Considering that the generated sentences contain rich semantic descriptions about the whole event proposal, we formulate the dense video captioning task as a visual cue aided sentence summarization problem and propose a new two-stage Long Short Term Memory (LSTM) approach equipped with a new hierarchical attention mechanism to summarize all generated sentences as one descriptive sentence with the aid of visual features. Specifically, the first-stage LSTM network takes all semantic words from the generated sentences and the visual features from all segments within one event proposal as the input, and acts as the encoder to effectively summarize both semantic and visual information related to this event proposal. The second-stage LSTM network takes the output from the first-stage LSTM network and the visual features from all video segments within one event proposal as the input, and acts as the decoder to generate one descriptive sentence for this event proposal. Our comprehensive experiments on the ActivityNet Captions dataset demonstrate the effectiveness of our newly proposed DaS framework for dense video captioning.
[46] Causal Representation Learning with Observational Grouping for CXR Classification
Rajat Rasal,Avinash Kori,Ben Glocker
Main category: cs.CV
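TL;DR: The paper learns identifiable causal representations by grouping observations and applies them to chest X-ray classification, improving generalisability and robustness.
Details
Motivation: The lack of explicit causal modeling in medical imaging limits the generalisability and robustness of task-specific features. Learning with grouped observations can help uncover the true causal relationships in the data generation process.
Contribution: The concept of grouping observations and an end-to-end framework that learns identifiable causal representations, applied to chest X-ray classification.
Method: Grouping (by race, sex, and imaging view) is used to enforce invariance, and causal representations are learned end to end.
Result: Experiments show that the causal representations improve generalisability and robustness across multiple classification tasks.
Insight: Grouping observations can expose causal structure in the data and lead to more reliable models for medical imaging tasks.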
Abstract: Identifiable causal representation learning seeks to uncover the true causal relationships underlying a data generation process. In medical imaging, this presents opportunities to improve the generalisability and robustness of task-specific latent features. This work introduces the concept of grouping observations to learn identifiable representations for disease classification in chest X-rays via an end-to-end framework. Our experiments demonstrate that these causal representations improve generalisability and robustness across multiple classification tasks when grouping is used to enforce invariance with respect to race, sex, and imaging views.
[47] Dense Video Captioning using Graph-based Sentence Summarization
Zhiwang Zhang,Dong Xu,Wanli Ouyang,Luping Zhou
Main category: cs.CV
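TL;DR: The paper proposes a graph-based partition-and-summarization (GPaS) framework for dense video captioning. Event proposals are split into finer segments, and sentence summarization with a Graph Convolutional Network (GCN) and an LSTM improves descriptions of scene evolution in long events.
Details
Motivation: Existing dense video captioning methods do not sufficiently explore scene evolution within an event proposal and therefore perform poorly on longer proposals. The authors propose a two-stage framework to address this.
Contribution: 1) The GPaS framework, which generates captions in a partition stage and a summarization stage; 2) GCN-LSTM Interaction (GLI) modules that learn relationships between semantic words; 3) validation on the ActivityNet Captions and YouCook II datasets.
Method: 1) The partition stage splits each event proposal into short segments and captions them at a finer level; 2) the summarization stage uses GLI modules to summarize the segment sentences into one sentence describing the whole event.
Result: Experiments on the two benchmark datasets show that the method outperforms existing approaches, particularly when describing long event proposals.
Insight: Two-stage partition and summarization, combined with GCN-LSTM interaction, captures semantic relationships within video events more effectively and improves caption quality.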
Abstract: Recently, dense video captioning has made attractive progress in detecting and captioning all events in a long untrimmed video. Although promising results have been achieved, most existing methods do not sufficiently explore the scene evolution within an event temporal proposal for captioning, and therefore perform less satisfactorily when the scenes and objects change over a relatively long proposal. To address this problem, we propose a graph-based partition-and-summarization (GPaS) framework for dense video captioning in two stages. For the "partition" stage, a whole event proposal is split into short video segments for captioning at a finer level. For the "summarization" stage, the generated sentences carrying rich description information for each segment are summarized into one sentence to describe the whole event. We particularly focus on the "summarization" stage, and propose a framework that effectively exploits the relationship between semantic words for summarization. We achieve this goal by treating semantic words as nodes in a graph and learning their interactions by coupling Graph Convolutional Network (GCN) and Long Short Term Memory (LSTM), with the aid of visual cues. Two schemes of GCN-LSTM Interaction (GLI) modules are proposed for seamless integration of GCN and LSTM. The effectiveness of our approach is demonstrated via an extensive comparison with state-of-the-art methods on two benchmarks, the ActivityNet Captions dataset and the YouCook II dataset.
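Treating semantic words as graph nodes means each word's representation is updated from its neighbors. The sketch below implements one graph-convolution layer with the widely used symmetric-normalized propagation rule ReLU(Â H W); whether the paper uses this exact variant, and how it decides which words are connected, is not stated in the abstract, so the chain graph and the name `gcn_layer` are assumptions.

```python
import math


def gcn_layer(adjacency, features, weight):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so each word keeps part of its own representation.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    in_dim, out_dim = len(weight), len(weight[0])
    aggregated = [[sum(norm[i][k] * features[k][d] for k in range(n))
                   for d in range(in_dim)] for i in range(n)]
    return [[max(0.0, sum(aggregated[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]


# Toy word graph: three words in a chain, word 1 linked to words 0 and 2.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# One-hot word features and an identity weight make the mixing visible.
updated = gcn_layer(adjacency, features=identity, weight=identity)
```

After one layer, word 0 carries information from itself and word 1 but nothing from word 2; stacking layers, or interleaving them with an LSTM as the GLI modules do, would widen that context.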
[48] TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness
Pritam Mishra,Coloma Ballester,Dimosthenis Karatzas
Main category: cs.CV
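TL;DR: The paper proposes TRIM, a self-supervised video summarization framework that maximizes temporal relative information and representativeness, achieving efficient summarization without supervised annotations or complex attention models.
Details
Motivation: The ubiquity of video and the demand for efficient access to key information drive video summarization research. Existing methods depend on supervised annotations or attention-based models, which are computationally expensive and transfer poorly across domains.
Contribution: 1. The self-supervised video summarization framework TRIM; 2. a set of Markov process-driven loss metrics; 3. a two-stage self-supervised learning paradigm that delivers both performance and efficiency.
Method: The model maximizes temporal relative information and representativeness using Markov process-driven loss metrics and two-stage self-supervised learning, without attention, RNNs, or Transformers.
Result: State-of-the-art performance on the SUMME and TVSUM datasets, outperforming all existing unsupervised methods and rivaling the best supervised models.
Insight: The work demonstrates the potential of efficient, annotation-free video summarization architectures and challenges the reliance on complex architectures.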
Abstract: The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SUMME and TVSUM datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.
[49] WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration
Chaojun Ni,Jie Li,Haoyun Li,Hengyu Liu,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Boyuan Wang,Chenxin Li,Guan Huang,Wenjun Mei
Main category: cs.CV
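TL;DR: WonderFree proposes a method for improving novel view quality and cross-view consistency in 3D scene exploration, using WorldRestorer and ConsistView to address visual artifacts and spatial inconsistency in novel views.
Details
Motivation: Current 3D generation methods suffer from degraded and inconsistent rendering when exploring regions beyond the original viewpoint, which harms the immersive experience.
Contribution: 1. Proposes WorldRestorer to improve novel view quality; 2. Proposes ConsistView to improve cross-view consistency; 3. Provides an automated data collection pipeline to support model training.
Method: Decomposes the problem into novel view quality and cross-view consistency, addressed by WorldRestorer and ConsistView respectively.
Result: Experiments show that WonderFree significantly improves rendering quality and consistency, with a user preference rate of 77.20%.
Insight: A data-driven restoration mechanism combined with multi-view joint restoration can effectively improve immersion in 3D scene exploration.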
Abstract: Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is their limited explorability: they cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address this challenge, we propose WonderFree, the first model that enables users to interactively generate 3D worlds with the freedom to explore from arbitrary angles and directions. Specifically, we decouple this challenge into two key subproblems: novel view quality, which addresses visual artifacts and floating issues in novel views, and cross-view consistency, which ensures spatial consistency across different viewpoints. To enhance rendering quality in novel views, we introduce WorldRestorer, a data-driven video restoration model designed to eliminate floaters and artifacts. In addition, a data collection pipeline is presented to automatically gather training data for WorldRestorer, ensuring it can handle scenes with varying styles needed for 3D scene generation. Furthermore, to improve cross-view consistency, we propose ConsistView, a multi-view joint restoration mechanism that simultaneously restores multiple perspectives while maintaining spatiotemporal coherence. Experimental results demonstrate that WonderFree not only enhances rendering quality across diverse viewpoints but also significantly improves global coherence and consistency. These improvements are confirmed by CLIP-based metrics and a user study showing a 77.20% preference for WonderFree over WonderWorld, enabling a seamless and immersive 3D exploration experience. The code, model, and data will be publicly available.
[50] SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery Detection
Ji Qi,Xinchang Zhang,Dingqi Ye,Yongjia Ruan,Xin Guo,Shaowen Wang,Haifeng Li
Main category: cs.CV
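TL;DR: SFNet proposes a remote sensing image forgery detection framework that fuses spatial-domain and frequency-domain features, improving detection performance through multi-domain feature extraction and fusion.
Details
Motivation: Existing forgery detection methods usually rely on a single type of visual feature (spatial-domain or frequency-domain), making it hard to cope with diverse remote sensing data and its complex forgery artifacts.
Contribution: SFNet fuses spatial-domain and frequency-domain features and incorporates CBAM attention, significantly improving the accuracy and generalization of forgery detection.
Method: 1. Uses independent feature extractors to capture spatial-domain and frequency-domain features; 2. Designs a domain feature mapping module and a hybrid domain feature refinement module (CBAM attention) to align and fuse the features.
Result: On three datasets, SFNet improves accuracy by 4%-15.18% over existing methods and shows stronger generalization.
Insight: Multi-domain feature fusion captures forgery artifacts more comprehensively and adapts to the characteristics of different generative models and remote sensing data.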
Abstract: The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in generative adversarial networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposes a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging spatial and frequency domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial and frequency domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module (CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information. Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over the state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at https://github.com/GeoX-Lab/RSTI/tree/main/SFNet.
[51] Video Perception Models for 3D Scene Synthesis
Rui Huang,Guangyao Zhai,Zuria Bauer,Marc Pollefeys,Federico Tombari,Leonidas Guibas,Gao Huang,Francis Engelmann
Main category: cs.CV
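TL;DR: The paper proposes VIPScene, a 3D scene synthesis framework built on video perception models, which combines video generation, 3D reconstruction, and open-vocabulary perception models to synthesize scenes with high realism and structural consistency.
Details
Motivation: Traditional 3D scene synthesis requires expert knowledge and tedious manual work, while existing approaches fall short: large language models (LLMs) have limited 3D spatial reasoning ability, and image generation methods suffer from multi-view inconsistency.
Contribution: Proposes the VIPScene framework, which exploits the commonsense knowledge of the 3D physical world encoded in video generation models to perform semantic and geometric analysis, and introduces the First-Person View Score (FPVScore) for coherence evaluation.
Method: VIPScene combines video generation, feedforward 3D reconstruction, and open-vocabulary perception models, accepts text and image inputs, and synthesizes 3D scenes with high realism and multi-view consistency.
Result: Experiments show that VIPScene significantly outperforms existing methods across diverse scenarios and generalizes better.
Insight: Video generation models can encode commonsense knowledge of the 3D world more effectively, offering a new direction for 3D scene synthesis.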
Abstract: Traditionally, 3D scene synthesis requires expert knowledge and significant manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, virtual reality, and gaming. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors of modern image generation models. However, current LLMs demonstrate limited 3D spatial reasoning ability, which restricts their ability to generate realistic and coherent 3D scenes. Meanwhile, image generation-based methods often suffer from constraints in viewpoint selection and multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. This enables flexible scene synthesis with high realism and structural consistency. For more precise analysis, we further introduce First-Person View Score (FPVScore) for coherence and plausibility evaluation, utilizing continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios. The code will be released.
[52] Shape2Animal: Creative Animal Generation from Natural Silhouettes
Quoc-Duy Tran,Anh-Tuan Vo,Dinh-Khoi Vo,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
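TL;DR: The Shape2Animal framework generates realistic animal images from the silhouettes of natural objects, mimicking human associative perception by combining semantic analysis with diffusion models for high-quality synthesis.
Details
Motivation: Inspired by the human capacity for pareidolia, the work aims to generate imaginative animal images from natural object silhouettes, expanding the potential of visual storytelling and educational applications.
Contribution: Proposes the Shape2Animal framework, which combines open-vocabulary segmentation, vision-language models, and text-to-image diffusion models to automatically generate realistic animal images from silhouettes.
Method: 1. Open-vocabulary segmentation extracts object silhouettes;
2. A vision-language model matches semantically appropriate animal concepts;
3. A text-to-image diffusion model generates the animal image and blends it into the scene.
Result: The framework's robustness and creativity are validated on diverse real-world inputs, demonstrating its potential for visual storytelling and education.
Insight: Combining segmentation, semantic matching, and diffusion models enables shape-driven creative image generation and offers a new direction for AI-assisted design.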
Abstract: Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces the Shape2Animal framework to mimic this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract the object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging a text-to-image diffusion model, and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: https://shape2image.github.io
[53] MMSearch-R1: Incentivizing LMMs to Search
Jinming Wu,Zihao Deng,Wei Li,Yiding Liu,Bo You,Bo Li,Zejun Ma,Ziwei Liu
Main category: cs.CV
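TL;DR: MMSearch-R1 is an end-to-end reinforcement learning framework that incentivizes large multimodal models (LMMs) to perform on-demand, multi-turn search with image and text search tools; it outperforms retrieval-augmented generation (RAG) baselines of the same model size while reducing search calls by over 30%.
Details
Motivation: Real-world information is complex and dynamic, yet existing approaches such as retrieval-augmented generation (RAG) follow fixed pipelines, leading to inefficient or excessive search; a more flexible approach is needed.
Contribution: Proposes MMSearch-R1, the first end-to-end reinforcement learning framework of its kind, which integrates multimodal search tools, shapes search behavior through its reward design, and is supported by a newly constructed multimodal search dataset.
Method: Uses reinforcement learning with image and text search tools, guiding the model's decisions with an outcome-based reward and a search penalty; builds a multimodal search VQA dataset through a semi-automated pipeline.
Result: Experiments show that MMSearch-R1 outperforms RAG-based baselines of the same model size and matches a larger RAG-based model while reducing search calls by over 30%.
Insight: The reward design and the multimodal dataset are crucial for shaping search behavior, providing practical insights for multimodal search research.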
Abstract: Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
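The "outcome-based reward with a search penalty" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the function name `outcome_reward`, the subtractive form of the penalty, and the default value `0.1` are all assumptions made for illustration only.

```python
def outcome_reward(is_correct: bool, used_search: bool,
                   search_penalty: float = 0.1) -> float:
    """Illustrative reward: score the final answer, discount it if search was used.

    The penalty form and its default value are assumptions, not the paper's formula.
    """
    if not is_correct:
        # A wrong answer earns nothing, whether or not the model searched.
        return 0.0
    # A correct answer earns full reward, minus a penalty if a search tool was called.
    return 1.0 - search_penalty if used_search else 1.0


# A correct answer reached without search is rewarded more than one that needed search.
reward_no_search = outcome_reward(is_correct=True, used_search=False)
reward_with_search = outcome_reward(is_correct=True, used_search=True)
```

Under this kind of shaping, a policy is only pushed to search when searching changes whether the answer is correct, which matches the on-demand behavior the abstract describes.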
[54] IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals
Markus Gross,Aya Fahmy,Danit Niwattananan,Dominik Muhle,Rui Song,Daniel Cremers,Henri Meeß
Main category: cs.CV
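TL;DR: IPFormer is the first approach to use context-adaptive instance proposals for vision-based 3D panoptic scene completion; it initializes queries dynamically and refines instance-voxel relationships with attention, significantly improving performance.
Details
Motivation: Semantic Scene Completion (SSC) has been extended to Panoptic Scene Completion (PSC), but camera-based methods remain largely unexplored. Existing methods use learned queries statically at test time and cannot adapt dynamically to the scene context.
Contribution: 1. Introduces context-adaptive instance proposals that dynamically initialize the queries; 2. Refines semantic instance-voxel relationships through attention-based encoding and decoding; 3. Surpasses SOTA methods on the overall metrics PQ† and PQ-All.
Method: 1. Adaptively initializes instance proposals from image context; 2. Refines the proposals through attention-based encoding and decoding; 3. Combines semantic and instance information to complete the 3D scene.
Result: Surpasses SOTA on PQ† and PQ-All with a runtime reduction exceeding 14×; compared with random initialization, dynamic instance proposals improve PQ-All by 3.62% and the combined Thing-metrics by 18.65% on average.
Insight: Dynamically initialized instance proposals markedly improve object-level sensitivity, confirming the importance of context adaptivity for vision-based 3D scene completion.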
Abstract: Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based SSC approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt specifically to the observed scene. To overcome these limitations, we propose IPFormer, the first approach that leverages context-adaptive instance proposals at train and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Experimental results show that our approach surpasses state-of-the-art methods in overall panoptic metrics PQ$^\dagger$ and PQ-All, matches performance in individual metrics, and achieves a runtime reduction exceeding 14$\times$. Furthermore, our ablation studies reveal that dynamically deriving instance proposals from image context, as opposed to random initialization, leads to a 3.62% increase in PQ-All and a remarkable average improvement of 18.65% in combined Thing-metrics. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.
eess.IV [Back]
[55] FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment
Lee Qi Zun,Oscar Wong Jin Hao,Nor Anita Binti Che Omar,Zalifa Zakiah Binti Asnir,Mohamad Sabri bin Sinal Zainal,Goh Man Fye
Main category: eess.IV
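TL;DR: FundaQ-8 proposes a framework validated by clinical experts that systematically assesses fundus image quality using eight key parameters, and develops a ResNet18-based regression model to predict quality scores. The model is trained and validated on real-world clinical datasets, demonstrating its reliability and clinical utility.
Details
Motivation: Fundus image quality assessment (FIQA) is challenged by variations in image acquisition and subjective expert evaluations, so a standardized approach is needed to improve the reliability and clinical applicability of automated assessment.
Contribution: Proposes the expert-validated FundaQ-8 framework, defines eight key quality parameters, and develops a ResNet18-based regression model, providing a standardized solution for automated FIQA.
Method: Uses ResNet18 as the base model, trained with transfer learning and mean squared error optimization together with standardized preprocessing, to predict continuous quality scores in the 0 to 1 range. The dataset comprises 1800 fundus images from clinical sources and Kaggle.
Result: Validation against the EyeQ dataset and statistical analyses indicate that the FundaQ-8 framework is reliable and clinically interpretable. Incorporating it into diabetic retinopathy grading models further improves diagnostic robustness.
Insight: Quality-aware training such as FundaQ-8 can significantly improve the performance of deep learning models in real-world screening applications, underscoring the importance of image quality assessment in clinical AI.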
Abstract: Automated fundus image quality assessment (FIQA) remains a challenge due to variations in image acquisition and subjective expert evaluations. We introduce FundaQ-8, a novel expert-validated framework for systematically assessing fundus image quality using eight critical parameters, including field coverage, anatomical visibility, illumination, and image artifacts. Using FundaQ-8 as a structured scoring reference, we develop a ResNet18-based regression model to predict continuous quality scores in the 0 to 1 range. The model is trained on 1800 fundus images from real-world clinical sources and Kaggle datasets, using transfer learning, mean squared error optimization, and standardized preprocessing. Validation against the EyeQ dataset and statistical analyses confirm the framework’s reliability and clinical interpretability. Incorporating FundaQ-8 into deep learning models for diabetic retinopathy grading also improves diagnostic robustness, highlighting the value of quality-aware training in real-world screening applications.
[56] VoxelOpt: Voxel-Adaptive Message Passing for Discrete Optimization in Deformable Abdominal CT Registration
Hang Zhang,Yuxi Zhang,Jiazheng Wang,Xiang Chen,Renjiu Hu,Xin Tian,Gaolei Li,Min Liu
Main category: eess.IV
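TL;DR: VoxelOpt is a discrete optimization framework that combines the strengths of learning-based and iterative methods for deformable abdominal CT registration, achieving efficient and accurate registration through voxel-adaptive message passing and a multi-level image pyramid.
Details
Motivation: Learning-based methods perform poorly without label supervision, while iterative methods are slow. VoxelOpt aims to combine the strengths of both and balance accuracy and speed.
Contribution: Proposes a voxel-adaptive message passing mechanism, introduces a multi-level image pyramid, and uses a pretrained segmentation model for feature extraction, significantly improving registration performance.
Method: Performs voxel-adaptive message passing based on the displacement entropy of local cost volumes, uses a multi-level pyramid to avoid a blow-up in complexity, and extracts features with a pretrained model.
Result: On abdominal CT registration, VoxelOpt outperforms iterative methods in both efficiency and accuracy and matches learning-based methods trained with label supervision.
Insight: Combining the strengths of discrete optimization and learning-based methods offers a new and effective route for medical image registration.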
Abstract: Recent developments in neural networks have improved deformable image registration (DIR) by amortizing iterative optimization, enabling fast and accurate DIR results. However, learning-based methods often face challenges with limited training data and large deformations, and tend to underperform compared to iterative approaches when label supervision is unavailable. While iterative methods can achieve higher accuracy in such scenarios, they are considerably slower than learning-based methods. To address these limitations, we propose VoxelOpt, a discrete optimization-based DIR framework that combines the strengths of learning-based and iterative methods to achieve a better balance between registration accuracy and runtime. VoxelOpt uses displacement entropy from local cost volumes to measure displacement signal strength at each voxel, which differs from earlier approaches in three key aspects. First, it introduces voxel-wise adaptive message passing, where voxels with lower entropy receive less influence from their neighbors. Second, it employs a multi-level image pyramid with 27-neighbor cost volumes at each level, avoiding exponential complexity growth. Third, it replaces hand-crafted features or contrastive learning with a pretrained foundational segmentation model for feature extraction. In abdominal CT registration, these changes allow VoxelOpt to outperform leading iterative methods in both efficiency and accuracy, while matching state-of-the-art learning-based methods trained with label supervision. The source code will be available at https://github.com/tinymilky/VoxelOpt
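To make the idea of "displacement entropy from local cost volumes" more concrete, here is a minimal sketch of one plausible way to compute it. It assumes that a voxel's matching costs over its candidate displacements are turned into probabilities with a softmax over negative costs, and that the normalized Shannon entropy is then used as the neighbour-influence weight. The function names, the temperature parameter, and the normalization are hypothetical and not taken from the paper.

```python
import math


def displacement_entropy(costs, temperature=1.0):
    """Shannon entropy of the displacement distribution implied by matching costs.

    Assumption: lower cost means a more likely displacement (softmax over -cost).
    """
    lowest = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def neighbor_weight(costs, temperature=1.0):
    """Map entropy to [0, 1]: confident (low-entropy) voxels listen less to neighbours."""
    return displacement_entropy(costs, temperature) / math.log(len(costs))


# 27 candidate displacements, as in a 27-neighbor cost volume.
ambiguous_voxel = [3.0] * 27              # every displacement matches equally well
confident_voxel = [0.0] + [100.0] * 26    # one displacement clearly wins
w_ambiguous = neighbor_weight(ambiguous_voxel)
w_confident = neighbor_weight(confident_voxel)
```

In this sketch the ambiguous voxel gets a weight near 1 and the confident voxel a weight near 0, which mirrors the abstract's statement that voxels with lower entropy receive less influence from their neighbours.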
[57] MS-IQA: A Multi-Scale Feature Fusion Network for PET/CT Image Quality Assessment
Siqiao Li,Chen Hui,Wei Zhang,Rui Liang,Chenyue Song,Feng Jiang,Haiqi Zhu,Zhixuan Li,Hong Huang,Xiang Li
Main category: eess.IV
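TL;DR: MS-IQA proposes a multi-scale feature fusion network for PET/CT image quality assessment. It combines multi-scale features from ResNet and Swin Transformer to capture both local and global information, and fuses high-level and low-level features with a dynamically weighted channel attention mechanism.
Details
Motivation: Quality problems in PET/CT images can cause diagnostic uncertainty, and existing medical IQA methods cannot handle low-level distortions and high-level anatomical structure features at the same time.
Contribution: 1. Proposes the MS-IQA network, which combines multi-scale features; 2. Introduces a dynamically weighted channel attention mechanism for feature fusion; 3. Constructs PET-CT-IQA-DS, the first PET/CT IQA dataset.
Method: Extracts multi-scale features with ResNet and Swin Transformer and fuses high-level and low-level features through a dynamically weighted channel attention mechanism.
Result: On the self-built dataset and a public dataset, MS-IQA outperforms existing SOTA methods, significantly improving the accuracy of PET/CT IQA.
Insight: Multi-scale feature fusion and dynamic weighting are key to improving medical image quality assessment.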
Abstract: Positron Emission Tomography / Computed Tomography (PET/CT) plays a critical role in medical imaging, combining functional and anatomical information to aid in accurate diagnosis. However, image quality degradation due to noise, compression and other factors could potentially lead to diagnostic uncertainty and increase the risk of misdiagnosis. When evaluating the quality of a PET/CT image, both low-level features like distortions and high-level features like organ anatomical structures affect the diagnostic value of the image. However, existing medical image quality assessment (IQA) methods are unable to account for both feature types simultaneously. In this work, we propose MS-IQA, a novel multi-scale feature fusion network for PET/CT IQA, which utilizes multi-scale features from various intermediate layers of ResNet and Swin Transformer, enhancing its ability to perceive both local and global information. In addition, a multi-scale feature fusion module is introduced to effectively combine high-level and low-level information through a dynamically weighted channel attention mechanism. Finally, to fill the gap in PET/CT IQA datasets, we construct PET-CT-IQA-DS, a dataset containing 2,700 varying-quality PET/CT images with quality scores assigned by radiologists. Experiments on our dataset and the publicly available LDCTIQAC2023 dataset demonstrate that our proposed model achieves superior performance against existing state-of-the-art methods in various IQA metrics. This work provides an accurate and efficient IQA method for PET/CT. Our code and dataset are available at https://github.com/MS-IQA/MS-IQA/.
[58] Opportunistic Osteoporosis Diagnosis via Texture-Preserving Self-Supervision, Mixture of Experts and Multi-Task Integration
Jiaxing Huang,Heng Guo,Le Lu,Fan Yang,Minfeng Xu,Ge Yang,Wei Luo
Main category: eess.IV
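TL;DR: The paper proposes a unified deep learning framework that uses self-supervised learning, a mixture-of-experts architecture, and multi-task learning to address three key problems in opportunistic osteoporosis diagnosis: underuse of unlabeled data, device-specific bias, and insufficient integration of clinical knowledge.
Details
Motivation: Osteoporosis diagnosis in resource-limited regions is constrained by limited access to DXA scanners. Opportunistic CT analysis is an alternative, but existing methods fall short in using unlabeled data, handling device bias, and integrating clinical knowledge.
Contribution: 1. Proposes a self-supervised learning method that uses radiomic representations to preserve bone texture; 2. Designs a mixture-of-experts architecture to enhance cross-device adaptability; 3. Develops a multi-task learning framework integrating osteoporosis diagnosis, BMD regression, and vertebra localization.
Method: Combines self-supervised learning, a Mixture of Experts (MoE) architecture, and multi-task learning; leverages unlabeled CT data, preserves texture through radiomic features, and optimizes diagnosis through multi-task training.
Result: Validated across three clinical sites and an external hospital, the method shows better generalizability and accuracy than existing methods.
Insight: Integrating techniques such as self-supervised learning and multi-task learning can significantly improve the accuracy and adaptability of opportunistic osteoporosis diagnosis, especially in resource-limited settings.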
Abstract: Osteoporosis, characterized by reduced bone mineral density (BMD) and compromised bone microstructure, increases fracture risk in aging populations. While dual-energy X-ray absorptiometry (DXA) is the clinical standard for BMD assessment, its limited accessibility hinders diagnosis in resource-limited regions. Opportunistic computed tomography (CT) analysis has emerged as a promising alternative for osteoporosis diagnosis using existing imaging data. Current approaches, however, face three limitations: (1) underutilization of unlabeled vertebral data, (2) systematic bias from device-specific DXA discrepancies, and (3) insufficient integration of clinical knowledge such as spatial BMD distribution patterns. To address these, we propose a unified deep learning framework with three innovations. First, a self-supervised learning method using radiomic representations to leverage unlabeled CT data and preserve bone texture. Second, a Mixture of Experts (MoE) architecture with learned gating mechanisms to enhance cross-device adaptability. Third, a multi-task learning framework integrating osteoporosis diagnosis, BMD regression, and vertebra location prediction. Validated across three clinical sites and an external hospital, our approach demonstrates superior generalizability and accuracy over existing methods for opportunistic osteoporosis screening and diagnosis.
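As a rough illustration of a Mixture of Experts with a learned gating mechanism, the sketch below routes one feature vector through several experts and blends their outputs with softmax gate weights. This is a generic dense MoE, not the paper's architecture: the linear gate, the scalar-valued experts, and all names (`moe_predict`, `gate_weights`) are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_predict(features, experts, gate_weights):
    """Blend expert outputs with gate probabilities computed from the input.

    gate_weights holds one weight row per expert (a linear gate, by assumption);
    in a trained model these rows would be learned jointly with the experts.
    """
    gate_logits = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    gates = softmax(gate_logits)
    outputs = [expert(features) for expert in experts]
    return sum(g * y for g, y in zip(gates, outputs))


# Two toy experts standing in for device-specific BMD regressors (hypothetical).
experts = [lambda f: 1.0, lambda f: 3.0]
features = [5.0, 0.0]
balanced = moe_predict(features, experts, gate_weights=[[0.0, 0.0], [0.0, 0.0]])
routed = moe_predict(features, experts, gate_weights=[[10.0, 0.0], [-10.0, 0.0]])
```

With an uninformative gate the prediction is the average of the experts; with a decisive gate the input is effectively handled by a single expert, which is how a gate could specialize experts to different scanner characteristics.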
[59] EAGLE: An Efficient Global Attention Lesion Segmentation Model for Hepatic Echinococcosis
Jiayan Chen,Kai Li,Yulu Zhao,Jianqiang Huang,Zhan Wang
Main category: eess.IV
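TL;DR: This paper proposes EAGLE, an efficient global attention model for segmenting hepatic echinococcosis (HE) lesions. The model combines a PVSS encoder with an HVSS decoder, fuses local and global features through the CVSSB module, and achieves lossless downsampling with the HWTB module. On a self-built dataset, EAGLE achieves the best performance with a Dice coefficient of 89.76%.
Details
Motivation: Hepatic echinococcosis is a parasitic disease common in regions with limited medical resources. Existing CNN and Transformer models each have limitations in medical image segmentation: CNNs lack global modeling capability, and Transformers are computationally expensive. An efficient model that handles both global and local features is therefore needed.
Contribution: 1. Proposes the EAGLE model, which combines a PVSS encoder with an HVSS decoder for efficient segmentation.
2. Designs the CVSSB module to fuse local and global features and the HWTB module for lossless downsampling.
3. Builds a CT dataset from 260 patients, filling a gap in publicly available data.
Method: 1. Builds a U-shaped network from a PVSS encoder and an HVSS decoder.
2. The CVSSB module combines convolution with a state space model (SSM) to strengthen feature fusion.
3. The HWTB module uses the Haar wavelet transform to compress spatial information into the channel dimension.
Result: EAGLE reaches a Dice coefficient of 89.76%, surpassing MSVM-UNet by 1.61%, confirming its efficiency and accuracy.
Insight: 1. SSMs such as Mamba excel at long-sequence modeling and are well suited to medical image segmentation.
2. Feature fusion and lossless downsampling are key to improving segmentation performance.
3. Data-scarce domains need self-built, high-quality datasets to support model validation.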
Abstract: Hepatic echinococcosis (HE) is a widespread parasitic disease in underdeveloped pastoral areas with limited medical resources. While CNN-based and Transformer-based models have been widely applied to medical image segmentation, CNNs lack global context modeling due to local receptive fields, and Transformers, though capable of capturing long-range dependencies, are computationally expensive. Recently, state space models (SSMs), such as Mamba, have gained attention for their ability to model long sequences with linear complexity. In this paper, we propose EAGLE, a U-shaped network composed of a Progressive Visual State Space (PVSS) encoder and a Hybrid Visual State Space (HVSS) decoder that work collaboratively to achieve efficient and accurate segmentation of hepatic echinococcosis (HE) lesions. The proposed Convolutional Vision State Space Block (CVSSB) module is designed to fuse local and global features, while the Haar Wavelet Transformation Block (HWTB) module compresses spatial information into the channel dimension to enable lossless downsampling. Due to the lack of publicly available HE datasets, we collected CT slices from 260 patients at a local hospital. Experimental results show that EAGLE achieves state-of-the-art performance with a Dice Similarity Coefficient (DSC) of 89.76%, surpassing MSVM-UNet by 1.61%.
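The claim that a Haar wavelet transform can compress spatial information into the channel dimension without loss can be checked with a small sketch. The code below applies a single-level 2D Haar transform to a one-channel image, producing four half-resolution sub-bands, and then inverts it exactly. It is a generic illustration rather than the HWTB module itself; the function names and the sub-band naming convention (`ll`, `lh`, `hl`, `hh`) are assumptions.

```python
def haar_downsample(image):
    """Split an H x W image (even H and W) into four H/2 x W/2 Haar sub-bands."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, height, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, width, 2):
            a, b = image[i][j], image[i][j + 1]          # top row of the 2x2 block
            c, d = image[i + 1][j], image[i + 1][j + 1]  # bottom row of the block
            row_ll.append((a + b + c + d) / 2)  # local average (low-pass)
            row_lh.append((a - b + c - d) / 2)  # left-minus-right detail
            row_hl.append((a + b - c - d) / 2)  # top-minus-bottom detail
            row_hh.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def haar_upsample(ll, lh, hl, hh):
    """Invert haar_downsample, recovering the full-resolution image."""
    height, width = len(ll), len(ll[0])
    image = [[0.0] * (2 * width) for _ in range(2 * height)]
    for i in range(height):
        for j in range(width):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + x + y + z) / 2
            image[2 * i][2 * j + 1] = (s - x + y - z) / 2
            image[2 * i + 1][2 * j] = (s + x - y - z) / 2
            image[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return image


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_downsample(image)     # 4 channels, each 2 x 2
restored = haar_upsample(*bands)   # back to 4 x 4
```

Stacking the four sub-bands as channels halves the spatial resolution while keeping the same number of values, so, unlike strided convolution or pooling, no information is discarded at the downsampling step.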
cs.GR [Back]
[60] X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis
Fabian Bongratz,Tom Nuno Wolf,Jaume Gual Ramon,Christian Wachinger
Main category: cs.GR
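TL;DR: The paper proposes the eXplainable Surface Vision Transformer (X-SiT) for dementia diagnosis, the first model to offer human-understandable predictions based on interpretable cortical features, while reaching state-of-the-art performance.
Details
Motivation: 3D volumetric data is hard to visualize and interpret, whereas cortical surface renderings provide a more intuitive 3D representation. X-SiT addresses interpretability in medical image analysis by combining surface data with interpretability techniques.
Contribution: Proposes X-SiT, the first inherently interpretable neural network based on cortical surface data, and introduces a prototypical surface patch decoder that incorporates case-based reasoning with spatially corresponding cortical prototypes.
Method: X-SiT extracts features with a surface vision transformer and classifies them with the prototypical surface patch decoder; the design combines human-understandable cortical features with case-based reasoning.
Result: X-SiT performs strongly in diagnosing Alzheimer's disease and frontotemporal dementia and provides informative prototypes that align with known disease patterns.
Insight: X-SiT's interpretable design not only improves model performance but also provides transparent support for clinical decision-making and reveals how classification errors relate to disease patterns.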
Abstract: Interpretable models are crucial for supporting clinical decision-making, driving advances in their development and application for medical images. However, the nature of 3D volumetric data makes it inherently challenging to visualize and interpret intricate and complex structures like the cerebral cortex. Cortical surface renderings, on the other hand, provide a more accessible and understandable 3D representation of brain anatomy, facilitating visualization and interactive exploration. Motivated by this advantage and the widespread use of surface data for studying neurological disorders, we present the eXplainable Surface Vision Transformer (X-SiT). This is the first inherently interpretable neural network that offers human-understandable predictions based on interpretable cortical features. As part of X-SiT, we introduce a prototypical surface patch decoder for classifying surface patch embeddings, incorporating case-based reasoning with spatially corresponding cortical prototypes. The results demonstrate state-of-the-art performance in detecting Alzheimer’s disease and frontotemporal dementia while additionally providing informative prototypes that align with known disease patterns and reveal classification errors.
[61] DreamAnywhere: Object-Centric Panoramic 3D Scene Generation
Edoardo Alberto Dominici,Jozef Hladky,Floor Verhoeven,Lukas Radl,Thomas Deixelberger,Stefan Ainetter,Philipp Drescher,Stefan Hauswiesner,Arno Coomans,Giacomo Nazzaro,Konstantinos Vardis,Markus Steinberger
Main category: cs.GR
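TL;DR: DreamAnywhere proposes a modular system for fast 3D scene generation that supports 360° panoramic image synthesis, object decomposition, 3D reconstruction, and immersive navigation and editing, markedly improving the coherence and quality of generated scenes.
Details
Motivation: Existing text-to-3D scene generation methods often support only front-facing views, lack visual fidelity and scene understanding, and are usually limited to either indoor or outdoor scenes. DreamAnywhere aims to solve these problems with a flexible and efficient solution.
Contribution: 1. Proposes a modular system for fast generation and prototyping of 3D scenes. 2. Achieves a complete 3D representation through panorama decomposition, hybrid inpainting, and lifting of 3D objects. 3. Supports immersive navigation and object-level editing, suitable for low-budget movie production.
Method: 1. Generates a 360° panoramic image from text. 2. Decomposes the image into background and objects. 3. Builds a complete 3D representation through hybrid inpainting. 4. Lifts object masks to detailed 3D objects and places them in the virtual environment.
Result: DreamAnywhere shows significant improvements in novel view synthesis coherence over existing methods and achieves competitive image quality; a user study confirms its technical robustness and practical usefulness.
Insight: Through its modular design, DreamAnywhere improves the flexibility and efficiency of scene generation and provides a practical tool for low-budget content creation.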
Abstract: Recent advances in text-to-3D scene generation have demonstrated significant potential to transform content creation across multiple industries. Although the research community has made impressive progress in addressing the challenges of this complex task, existing methods often generate environments that are only front-facing, lack visual fidelity, exhibit limited scene understanding, and are typically fine-tuned for either indoor or outdoor settings. In this work, we address these issues and propose DreamAnywhere, a modular system for the fast generation and prototyping of 3D scenes. Our system synthesizes a 360° panoramic image from text, decomposes it into background and objects, constructs a complete 3D representation through hybrid inpainting, and lifts object masks to detailed 3D objects that are placed in the virtual environment. DreamAnywhere supports immersive navigation and intuitive object-level editing, making it ideal for scene exploration, visual mock-ups, and rapid prototyping – all with minimal manual modeling. These features make our system particularly suitable for low-budget movie production, enabling quick iteration on scene layout and visual tone without the overhead of traditional 3D workflows. Our modular pipeline is highly customizable as it allows components to be replaced independently. Compared to current state-of-the-art text and image-based 3D scene generation approaches, DreamAnywhere shows significant improvements in coherence in novel view synthesis and achieves competitive image quality, demonstrating its effectiveness across diverse and challenging scenarios. A comprehensive user study demonstrates a clear preference for our method over existing approaches, validating both its technical robustness and practical usefulness.
[62] EditP23: 3D Editing via Propagation of Image Prompts to Multi-View
Roi Bar-On,Dana Cohen-Bar,Daniel Cohen-Or
Main category: cs.GR
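TL;DR: EditP23 is a mask-free 3D editing method that propagates 2D image edits to multi-view representations to achieve 3D-consistent edits. It uses a pair of images (the original view and the user-edited view) as the prompt and guides an edit-aware flow in the latent space of a pre-trained multi-view diffusion model.
Details
Motivation: Traditional 3D editing methods rely on text-based prompts or explicit spatial masks, which limits intuitiveness and flexibility. EditP23 aims to enable more intuitive editing through image-pair prompts while maintaining consistency across views.
Contribution: Proposes EditP23, which performs 3D-consistent editing from image-pair prompts without masks or optimization, preserving the structure and appearance of the original object.
Method: Uses a pair of images (original and edited versions) as the prompt and guides an edit-aware flow in the latent space of a pre-trained multi-view diffusion model to propagate the edit consistently across views.
Result: Shows high fidelity across a range of object categories and editing scenarios, with natural-looking edits and no manual masks.
Insight: Image-pair prompts can serve as an effective editing signal in place of traditional text or mask-based approaches, while ensuring multi-view consistency and intuitive editing.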
Abstract: We present EditP23, a method for mask-free 3D editing that propagates 2D image edits to multi-view representations in a 3D-consistent manner. In contrast to traditional approaches that rely on text-based prompting or explicit spatial masks, EditP23 enables intuitive edits by conditioning on a pair of images: an original view and its user-edited counterpart. These image prompts are used to guide an edit-aware flow in the latent space of a pre-trained multi-view diffusion model, allowing the edit to be coherently propagated across views. Our method operates in a feed-forward manner, without optimization, and preserves the identity of the original object, in both structure and appearance. We demonstrate its effectiveness across a range of object categories and editing scenarios, achieving high fidelity to the source while requiring no manual masks.
eess.SP [Back]
[63] A Multi-Modal Spatial Risk Framework for EV Charging Infrastructure Using Remote Sensing
Oktay Karakuş,Padraig Corcoran
Main category: eess.SP
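TL;DR: The paper proposes RSERI-EV, a multi-modal spatial risk assessment framework that combines remote sensing data and spatial graph analytics to evaluate the environmental vulnerability of electric vehicle charging stations.
Details
Motivation: As EV charging infrastructure grows in importance, its resilience under environmental and infrastructural stress remains underexplored; this paper aims to fill that gap.
Contribution: Proposes the RSERI-EV framework, which integrates multi-source data (such as flood risk, land surface temperature, and vegetation indices) with spatial graph analytics to quantify a resilience score for charging stations.
Method: Fuses remote sensing data, infrastructure datasets, and spatial graph analytics (such as a $k$NN graph) to produce a composite resilience score, and validates the approach on a dataset of charging stations in Wales.
Result: The Wales case study demonstrates the feasibility of the RSERI-EV framework and highlights the value of multi-source data fusion and spatial reasoning for charging infrastructure deployment.
Insight: Multi-modal data and spatial graph analytics offer a new approach to assessing the resilience of charging infrastructure and support smarter deployment under climate change.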
Abstract: Electric vehicle (EV) charging infrastructure is increasingly critical to sustainable transport systems, yet its resilience under environmental and infrastructural stress remains underexplored. In this paper, we introduce RSERI-EV, a spatially explicit and multi-modal risk assessment framework that combines remote sensing data, open infrastructure datasets, and spatial graph analytics to evaluate the vulnerability of EV charging stations. RSERI-EV integrates diverse data layers, including flood risk maps, land surface temperature (LST) extremes, vegetation indices (NDVI), land use/land cover (LULC), proximity to electrical substations, and road accessibility, to generate a composite Resilience Score. We apply this framework to the EV charger dataset for the country of Wales to demonstrate its feasibility. A spatial $k$-nearest neighbours ($k$NN) graph is constructed over the charging network to enable neighbourhood-based comparisons and graph-aware diagnostics. Our prototype highlights the value of multi-source data fusion and interpretable spatial reasoning in supporting climate-resilient, infrastructure-aware EV deployment.
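The combination of a composite score and a spatial $k$NN graph can be sketched in a few lines. The abstract does not specify how layers are weighted or how distances are measured, so the weighted average, the layer names, the Euclidean distance on already-projected coordinates, and the function names below are all assumptions for illustration.

```python
import math


def knn_graph(points, k):
    """Map each station index to the indices of its k nearest stations.

    Assumes coordinates are already projected to a planar system (for example metres).
    """
    graph = {}
    for i, p in enumerate(points):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in by_distance[:k]]
    return graph


def composite_score(layers, weights):
    """Weighted average of layer values already normalized to [0, 1] (hypothetical weighting)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * layers[name] for name in weights) / total_weight


def neighbourhood_gap(scores, graph):
    """Each station's score minus the mean score of its graph neighbours."""
    return {
        i: scores[i] - sum(scores[j] for j in neighbours) / len(neighbours)
        for i, neighbours in graph.items()
    }


stations = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
graph = knn_graph(stations, k=1)
example_score = composite_score({"flood": 0.5, "heat": 1.0}, {"flood": 1.0, "heat": 1.0})
gaps = neighbourhood_gap([1.0, 0.0, 0.5, 0.5], graph)
```

A large positive or negative gap flags a station whose score differs sharply from its spatial neighbours, which is one simple form of the neighbourhood-based comparison the abstract mentions.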
cs.HC [Back]
[64] Capturing Visualization Design Rationale
Maeve Hutchinson,Radu Jianu,Aidan Slingsby,Jo Wood,Pranava Madhyastha
Main category: cs.HC
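TL;DR: The paper presents a new dataset and methodology, based on real-world visualization notebooks, for probing the rationale behind visualization design through natural language.
Details
Motivation: Existing natural language datasets in the visualization domain mostly focus on decoding tasks (such as visualization literacy assessment or generation) and neglect understanding of the encoding, that is, the rationale behind design decisions. The paper aims to fill this gap.
Contribution: 1) Introduces a new dataset capturing the design rationale and decisions in students' visualization notebooks; 2) Uses large language models (LLMs) to generate and categorize question-answer-rationale triples; 3) Provides a high-quality design rationale dataset through validation and curation.
Method: The paper uses literate visualization notebooks from a student course (combining visualizations with design exposition) as the data source, generates question-answer-rationale triples with LLMs, then validates and curates them into a dataset.
Result: The final dataset captures real-world visualization design choices and their rationales, providing a resource for understanding the encoding side of design decisions.
Insight: Visualization research needs to attend not only to decoding (understanding the graphic) but also to the encoding process behind it; real-world data such as coursework can provide rich context for this.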
Abstract: Prior natural language datasets for data visualization have focused on tasks such as visualization literacy assessment, insight generation, and visualization generation from natural language instructions. These studies often rely on controlled setups with purpose-built visualizations and artificially constructed questions. As a result, they tend to prioritize the interpretation of visualizations, focusing on decoding visualizations rather than understanding their encoding. In this paper, we present a new dataset and methodology for probing visualization design rationale through natural language. We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course. These notebooks combine visual artifacts with design exposition, in which students make explicit the rationale behind their design decisions. We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks. We then carefully validate the triples and curate a dataset that captures and distills the visualization design choices and corresponding rationales of the students.
cs.RO [Back]
[65] Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue
Ruben Janssens,Jens De Bock,Sofie Labat,Eva Verhelst,Veronique Hoste,Tony Belpaeme
Main category: cs.RO
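TL;DR: The paper studies the limits of miscommunication detection in human-robot dialogue. Even state-of-the-art computer vision models barely exceed random chance, which points to a fundamental problem: users who perceive a miscommunication often do not express it.
Details
Motivation: Detecting miscommunication is critical for maintaining user engagement and trust in human-robot interaction. Humans detect communication errors easily through verbal and non-verbal cues, but robots remain significantly limited in this respect.
Contribution: Evaluates how well machine learning models detect miscommunication in robot dialogue and identifies the root cause of their poor performance: users often do not signal the miscommunication.
Method: Uses a multi-modal dataset of 240 human-robot conversations in which four types of conversational failure were systematically introduced, evaluates state-of-the-art computer vision models, and compares them with human raters.
Result: Model performance barely exceeds random chance, and human raters also identify only around half of the induced miscommunications, indicating that the problem lies in the lack of explicit user feedback.
Insight: The study exposes a fundamental challenge for miscommunication detection in human-robot dialogue and suggests that future designs should deliberately elicit user feedback where needed.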
Abstract: Detecting miscommunication in human-robot interaction is a critical function for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversations through both verbal and non-verbal cues, robots face significant challenges in interpreting non-verbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates the effectiveness of machine learning models in detecting miscommunications in robot dialogue. Using a multi-modal dataset of 240 human-robot conversations, where four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users provided feedback on whether they perceived an error, enabling an analysis of the models’ ability to accurately detect robot mistakes. Despite using state-of-the-art models, the performance barely exceeds random chance in identifying miscommunication, while on a dataset with more expressive emotional content, they successfully identified confused states. To explore the underlying cause, we asked human raters to do the same. They could also only identify around half of the induced miscommunications, similarly to our model. These results uncover a fundamental limitation in identifying robot miscommunications in dialogue: even when users perceive the induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This knowledge can shape expectations of the performance of computer vision models and can help researchers to design better human-robot conversations by deliberately eliciting feedback where needed.
[66] HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
Zhonghao Shi,Enyu Zhao,Nathaniel Dennler,Jingzhen Wang,Xinyang Xu,Kaleen Shrestha,Mengxue Fu,Daniel Seita,Maja Matarić
Main category: cs.RO
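TL;DR: HRIBench is a new visual question-answering benchmark that evaluates the performance-latency trade-off of large vision-language models on human perception tasks for real-time human-robot interaction.
Details
Motivation: Real-time human perception is crucial for human-robot interaction (HRI), but existing large vision-language models (VLMs) suffer from high latency, which limits their use in real-world scenarios.
Contribution: Introduces the HRIBench benchmark, which covers human perception tasks in five key domains, and evaluates 11 state-of-the-art open-source and closed-source VLMs.
Method: Builds a benchmark of 1000 visual question-answering questions (200 per domain) from data collected in real-world HRI environments and from publicly available datasets.
Result: Current VLMs still struggle with core human perception capabilities, and none of the evaluated models achieves a performance-latency trade-off suitable for real-time deployment.
Insight: Future work needs to develop smaller, low-latency VLMs with stronger human perception capabilities.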
Abstract: Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: https://github.com/interaction-lab/HRIBench.
cs.AI [Back]
[67] Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
Saloni Dash,Amélie Reymond,Emma S. Spiro,Aylin Caliskan
Main category: cs.AI
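TL;DR: The paper investigates whether assigning personas (political or socio-demographic attributes) to large language models (LLMs) induces human-like motivated reasoning. Persona-assigned LLMs show clear bias on veracity discernment and scientific evidence evaluation tasks, and prompt-based debiasing is largely ineffective at mitigating it.
Details
Motivation: Human reasoning is often biased by motivations such as identity protection, which can harm debate on critical societal issues such as climate change or vaccine safety. The study explores whether persona-assigned LLMs exhibit similar motivated reasoning.
Contribution: 1. Presents what the authors describe as the first empirical findings suggesting that persona-assigned LLMs exhibit human-like motivated reasoning. 2. Finds that political personas strongly affect scientific evidence evaluation: they are up to 90% more likely to evaluate evidence correctly when the ground truth is congruent with the induced identity. 3. Shows the limits of conventional prompt-based debiasing for this problem.
Method: 1. Assigns 8 personas across 4 political and socio-demographic attributes to 8 LLMs (open source and proprietary). 2. Tests them on two reasoning tasks from human-subject studies: veracity discernment of misinformation headlines and evaluation of numeric scientific evidence. 3. Evaluates prompt-based debiasing methods.
Result: 1. Persona-assigned LLMs have up to 9% reduced veracity discernment relative to models without personas. 2. Political personas are up to 90% more likely to correctly evaluate scientific evidence on gun control when the ground truth is congruent with their induced political identity. 3. Prompt-based debiasing is largely ineffective.
Insight: Persona assignment can amplify biased behavior in LLMs. Persona design therefore needs care in development and deployment, and more effective debiasing methods are needed to avoid reinforcing identity-congruent reasoning.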
Abstract: Reasoning in humans is prone to biases due to underlying motivations like identity protection that undermine rational decision-making and judgment. This motivated reasoning at a collective level can be detrimental to society when debating critical issues such as human-driven climate change or vaccine safety, and can further aggravate political polarization. Prior studies have reported that large language models (LLMs) are also susceptible to human-like cognitive biases; however, the extent to which LLMs selectively reason toward identity-congruent conclusions remains largely unexplored. Here, we investigate whether assigning 8 personas across 4 political and socio-demographic attributes induces motivated reasoning in LLMs. Testing 8 LLMs (open source and proprietary) across two reasoning tasks from human-subject studies – veracity discernment of misinformation headlines and evaluation of numeric scientific evidence – we find that persona-assigned LLMs have up to 9% reduced veracity discernment relative to models without personas. Political personas, specifically, are up to 90% more likely to correctly evaluate scientific evidence on gun control when the ground truth is congruent with their induced political identity. Prompt-based debiasing methods are largely ineffective at mitigating these effects. Taken together, our empirical findings are the first to suggest that persona-assigned LLMs exhibit human-like motivated reasoning that is hard to mitigate through conventional debiasing prompts – raising concerns of exacerbating identity-congruent reasoning in both LLMs and humans.
[68] The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
Andrei Lupu,Timon Willi,Jakob Foerster
Main category: cs.AI
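TL;DR: The paper proposes Decrypto, a game-based benchmark for evaluating multi-agent reasoning and theory of mind (ToM), and finds that current large language models perform poorly on these tasks.
Details
Motivation: As large language models (LLMs) gain agentic abilities, they must navigate complex multi-agent scenarios. ToM and other multi-agent abilities in LLMs are poorly understood, and existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity.
Contribution: 1. Proposes the Decrypto benchmark, focused on ToM and multi-agent reasoning; 2. validates the benchmark design empirically and finds that LLMs lag behind humans and simple word-embedding baselines at the game; 3. shows that state-of-the-art reasoning models do worse on the ToM tasks than their older counterparts.
Method: 1. Designs a game-based benchmark that is as easy as possible in all other dimensions, eliminating confounding factors; 2. draws inspiration from cognitive science, computational pragmatics, and multi-agent reinforcement learning; 3. runs comprehensive empirical evaluations, including robustness studies, human-AI cross-play experiments, and variants of two classic cognitive science experiments.
Result: LLM game-playing abilities lag behind humans and simple word-embedding baselines, and state-of-the-art reasoning models are significantly worse at the ToM tasks than older models.
Insight: Decrypto addresses a crucial gap in current reasoning and ToM evaluations and paves the way toward better artificial agents.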
Abstract: As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the “mental” states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.
cs.LG [Back]
[69] A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior
Francesco Ignazio Re,Andreas Opedal,Glib Manaiev,Mario Giulianelli,Ryan Cotterell
Main category: cs.LG
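TL;DR: The paper proposes a probabilistic model of reading behavior based on a marked spatio-temporal point process. It models how long fixations last, where they land in space, and when they occur, and uses a Hawkes process to capture the spatio-temporal dependence between fixations.
Details
Motivation: Existing approaches to modeling reading behavior rely on aggregated eye-tracking measurements and ignore much of the spatio-temporal dynamics of reading. The paper aims for a more general model that captures this detail.
Contribution: Proposes a new marked spatio-temporal point process model that combines a Hawkes process for saccades with a model of fixation durations, giving a richer description of the spatio-temporal dynamics of reading.
Method: Models saccades with a Hawkes process in which each fixation excites the probability of a new fixation nearby in time and space, models fixation duration as a function of fixation-specific predictors convolved across time, and includes contextual surprisal as a predictor.
Result: The Hawkes process model fits human saccades better than the baselines, but adding contextual surprisal yields only a marginal improvement in predicting fixation durations.
Insight: The results suggest that surprisal theory struggles to explain fine-grained eye movements, which points to a direction for future work.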
Abstract: Reading is a process that unfolds across space and time, alternating between fixations where a reader focuses on a specific point in space, and saccades where a reader rapidly shifts their focus to a new point. An ansatz of psycholinguistics is that modeling a reader’s fixations and saccades yields insight into their online sentence processing. However, standard approaches to such modeling rely on aggregated eye-tracking measurements and models that impose strong assumptions, ignoring much of the spatio-temporal dynamics that occur during reading. In this paper, we propose a more general probabilistic model of reading behavior, based on a marked spatio-temporal point process, that captures not only how long fixations last, but also where they land in space and when they take place in time. The saccades are modeled using a Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. The duration time of fixation events is modeled as a function of fixation-specific predictors convolved across time, thus capturing spillover effects. Empirically, our Hawkes process model exhibits a better fit to human saccades than baselines. With respect to fixation durations, we observe that incorporating contextual surprisal as a predictor results in only a marginal improvement in the model’s predictive accuracy. This finding suggests that surprisal theory struggles to explain fine-grained eye movements.
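To make the "each fixation excites nearby fixations" idea concrete, here is a minimal sketch of a spatio-temporal Hawkes conditional intensity. It uses a generic textbook form, a constant background rate plus an exponential kernel in time multiplied by a Gaussian kernel in one spatial dimension. The kernel shapes, the one-dimensional position, and every parameter name and default value (`mu`, `alpha`, `beta`, `sigma`) are assumptions of this sketch and not the parameterisation used in the paper.

```python
import math


def hawkes_intensity(t, x, events, mu=0.1, alpha=0.5, beta=2.0, sigma=3.0):
    """Conditional intensity at time t and position x given past events.

    events: list of (t_i, x_i) pairs, e.g. fixation onset time and
            horizontal position (hypothetical units).
    mu:     background rate.
    alpha:  expected number of events triggered by each past event.
    beta:   decay rate of the temporal kernel.
    sigma:  width of the Gaussian spatial kernel.
    """
    rate = mu
    for t_i, x_i in events:
        if t_i >= t:
            continue  # only the past can excite the present
        temporal = beta * math.exp(-beta * (t - t_i))
        spatial = math.exp(-0.5 * ((x - x_i) / sigma) ** 2) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        rate += alpha * temporal * spatial
    return rate
```

Under these assumptions the intensity is highest shortly after a fixation and close to where it landed, and it decays back to `mu` as time passes or distance grows.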
[70] MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
Vardhan Dongre,Chi Gui,Shubham Garg,Hooshang Nayyeri,Gokhan Tur,Dilek Hakkani-Tür,Vikram S. Adve
Main category: cs.LG
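TL;DR: MIRAGE is a multimodal benchmark for expert-level consultative interactions in agriculture. It combines natural user queries, expert-authored responses, and image context to evaluate grounded reasoning, clarification strategies, and long-form generation.
Details
Motivation: Existing benchmarks typically rely on well-specified inputs and closed-set taxonomies, whereas real-world agricultural consultations are often underspecified and context-rich. MIRAGE aims to fill this gap with a more realistic challenge for models.
Contribution: Introduces the MIRAGE benchmark, grounded in over 35,000 real user-expert interactions and covering more than 7,000 unique biological entities, with support for open-world settings, rare entities, and multimodal reasoning.
Method: Curates a multimodal dataset of natural queries, expert-authored responses, and images through a multi-step pipeline, and frames the tasks in open-world settings.
Result: MIRAGE is one of the most taxonomically diverse benchmarks available for vision-language models and provides a real-world evaluation setting in agriculture.
Insight: The benchmark surfaces the challenges multimodal models face in complex, open-world settings: inferring latent knowledge gaps, handling rare entities, and deciding whether to guide the interaction proactively or respond.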
Abstract: We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models, grounded in the real world. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: https://mirage-benchmark.github.io
[71] Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
Rylan Schaeffer,Joshua Kazdan,Yegor Denisov-Blanch,Brando Miranda,Matthias Gerstgrasser,Susan Zhang,Andreas Haupt,Isha Gupta,Elyas Obbad,Jesse Dodge,Jessica Zosa Forde,Koustuv Sinha,Francesco Orabona,Sanmi Koyejo,David Donoho
Main category: cs.LG
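TL;DR: The paper argues that machine learning conferences should establish a "Refutations and Critiques" (R&C) track to systematically correct errors in published research.
Details
Motivation: Rapid progress in machine learning has led to misleading or incorrect studies being published, yet conferences lack robust mechanisms for correcting such errors.
Contribution: Proposes a dedicated R&C track at machine learning conferences that gives critical work a reputable platform and fosters a self-correcting research ecosystem.
Method: Discusses track design, review principles, and potential pitfalls, and provides an illustrative example submission.
Result: Recommends that conferences create official, reputable mechanisms to help ML research self-correct.
Insight: A dedicated track for critical work can improve the reliability of research and support scientific progress.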
Abstract: Science progresses by iteratively advancing and correcting humanity’s understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated “Refutations and Critiques” (R & C) Track. This R & C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
[72] Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Charles Arnal,Gaëtan Narozniak,Vivien Cabannes,Yunhao Tang,Julia Kempe,Remi Munos
Main category: cs.LG
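TL;DR: The paper studies algorithms in the range between off-policy reinforcement learning (RL) and supervised fine-tuning. By analyzing a simple off-policy REINFORCE algorithm, it examines how tuning the baseline $V$ balances positive and negative reward signals.
Details
Motivation: Off-policy methods are simpler to implement and more data-efficient than on-policy methods, but often perform worse. The paper asks how tuning the baseline $V$ can make off-policy learning use reward signals more effectively.
Contribution: 1. Analyzes the off-policy REINFORCE algorithm theoretically and shows that it enjoys a policy improvement guarantee when the baseline $V$ lower-bounds the expected reward; 2. shows that off-policy updates benefit from focusing more on positive rewards than on negative ones, and validates this experimentally.
Method: 1. Studies a simple off-policy REINFORCE algorithm with advantage $A=r-V$, where $r$ is a reward and $V$ a tunable baseline; 2. tunes $V$ to balance positive and negative signals: lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily; 3. validates the findings in a controlled stochastic bandit setting and by fine-tuning large language models (LLMs) on reasoning tasks.
Result: The theory and experiments support choosing the baseline $V$ with care: a policy improvement guarantee holds when $V$ lower-bounds the expected reward, and off-policy updates benefit from emphasizing positive rewards.
Insight: On-policy updates can safely use both positive and negative signals, whereas off-policy updates should weight positive rewards more heavily than negative ones; the baseline $V$ is the knob that controls this balance.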
Abstract: Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
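The effect of the baseline $V$ on the advantage $A=r-V$ can be seen in a single update on a softmax bandit policy. This is a minimal sketch under stated assumptions: a tabular softmax policy over arms, a plain gradient step on the logits, and no importance-sampling correction for the behaviour policy. The function names, the learning rate, and the two-arm example are hypothetical and only illustrate the update rule described in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=1.0):
    """One REINFORCE step on softmax logits with advantage A = r - V.

    The sample (action, reward) may come from any behaviour policy;
    no importance weight is applied (an assumption of this sketch).
    For a softmax policy, d log pi(a) / d logit_j = 1[j == a] - pi_j.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        z + lr * advantage * ((1.0 if j == action else 0.0) - probs[j])
        for j, z in enumerate(logits)
    ]


# Two arms, uniform initial policy, arm 0 was sampled off-policy.
rewarded = reinforce_step([0.0, 0.0], action=0, reward=1.0, baseline=0.0)
unrewarded_low_v = reinforce_step([0.0, 0.0], action=0, reward=0.0, baseline=0.0)
unrewarded_high_v = reinforce_step([0.0, 0.0], action=0, reward=0.0, baseline=0.5)
```

With $V=0$ only the rewarded sample moves the policy, pushing probability toward the sampled arm, while the unrewarded sample leaves it unchanged. With $V=0.5$ the same unrewarded sample produces a negative advantage and pushes probability away from the sampled arm, which matches the intuition in the abstract that raising $V$ penalizes low-reward samples more heavily.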
[73] PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
Soufiane Hayou,Nikhil Ghosh,Bin Yu
Main category: cs.LG
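TL;DR: PLoP is a lightweight method, motivated by theoretical analysis and validated experimentally, that automatically identifies where LoRA adapters should be placed when finetuning large models. It matches or outperforms commonly used placement strategies.
Details
Motivation: When finetuning large models with LoRA, practitioners choose the adapter placement (for example attention modules or MLP modules) by hand. Few works have studied this choice, and their results are inconclusive. PLoP addresses the problem through theoretical analysis.
Contribution: 1. Introduces PLoP, a lightweight method that automatically identifies the module types where LoRA adapters should be placed; 2. shows through theoretical analysis and experiments that PLoP outperforms, or in the worst case is competitive with, commonly used placement strategies.
Method: Given a pretrained model and a finetuning task, PLoP uses an intuitive theoretical analysis to select the module types to adapt, without manual selection. Experiments cover supervised finetuning and reinforcement learning for reasoning.
Result: In both supervised finetuning and reinforcement learning experiments, PLoP outperforms or at least matches commonly used placement strategies.
Insight: Adapter placement matters for LoRA finetuning, and PLoP offers an efficient way to choose it.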
Abstract: Low-Rank Adaptation (LoRA) is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency by, for example, setting the learning rate, the rank, and the initialization. Another improvement axis is adapter placement strategy: when using LoRA, practitioners usually pick module types to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with inconclusive results: the original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in the MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that allows automatic identification of module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. We demonstrate that PLoP consistently outperforms, or in the worst case is competitive with, commonly used placement strategies through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning.