Table of Contents
- cs.CL [Total: 61]
- cs.CV [Total: 118]
- eess.IV [Total: 3]
- cs.CY [Total: 1]
- cs.SE [Total: 2]
- cs.PL [Total: 1]
- astro-ph.IM [Total: 1]
- cs.AI [Total: 13]
- cs.CR [Total: 2]
- cs.IR [Total: 2]
- eess.SY [Total: 1]
- cs.LG [Total: 23]
- cs.RO [Total: 5]
- cs.GR [Total: 6]
- cond-mat.mtrl-sci [Total: 1]
cs.CL
[1] Graph-S3: Enhancing Agentic textual Graph Retrieval with Synthetic Stepwise Supervision
Ge Chang,Jinbo Su,Jiacheng Liu,Pengfei Yang,Yuhao Shang,Huiwen Zheng,Hongli Ma,Yan Liang,Yuanchun Li,Yunxin Liu
Main category: cs.CL
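TL;DR: Graph-S3 is an LLM-based reasoning framework for textual graphs that trains its retriever with synthetic stepwise supervision, substantially improving graph retrieval performance and efficiency.
Details
Motivation: Much real-world data takes the form of textual graphs, but existing retrievers perform poorly: they either rely on shallow embedding similarity or use interactive policies that demand extensive labeling and costly training.
Contribution: (1) The Graph-S3 framework, which trains an LLM retriever with synthetic stepwise supervision; (2) a data synthesis pipeline that extracts golden subgraphs for reward generation; (3) a two-stage training scheme for learning an interactive graph exploration policy.
Method: (1) Extract golden subgraphs through data synthesis and use them as reward signals; (2) train the LLM retriever with a two-stage scheme; (3) optimize the retrieval policy with stepwise supervision.
Result: On three common datasets, Graph-S3 improves on seven strong baselines by an average of 8.1% in accuracy and 9.7% in F1 score, with larger gains on multi-hop reasoning tasks.
Insight: Stepwise supervision is more effective than sparse rewards based only on the final answer, and synthetic data can cut annotation cost while improving retrieval performance.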
Abstract: A significant portion of real-world data is inherently represented as textual graphs, and integrating these graphs into large language models (LLMs) is promising to enable complex graph-based question answering. However, a key challenge in LLM-based textual graph QA systems lies in graph retrieval, i.e., how to retrieve relevant content from large graphs that is sufficiently informative while remaining compact for the LLM context. Existing retrievers suffer from poor performance since they either rely on shallow embedding similarity or employ interactive retrieving policies that demand excessive data labeling and training cost. To address these issues, we present Graph-$S^3$, an agentic textual graph reasoning framework that employs an LLM-based retriever trained with synthetic stepwise supervision. Instead of rewarding the agent based on the final answers, which may lead to sparse and unstable training signals, we propose to closely evaluate each step of the retriever based on offline-extracted golden subgraphs. Our main techniques include a data synthesis pipeline to extract the golden subgraphs for reward generation and a two-stage training scheme to learn the interactive graph exploration policy based on the synthesized rewards. Based on extensive experiments on three common datasets in comparison with seven strong baselines, our approach achieves an average improvement of 8.1% in accuracy and 9.7% in F$_1$ score. The advantage is even higher in more complicated multi-hop reasoning tasks. Our code will be open-sourced.
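The idea of scoring each retrieval step against an offline golden subgraph can be sketched as follows. This is a minimal illustration, not the paper's reward: the function name `golden_step_reward`, the edge representation, and the penalty weight are all assumptions made for the example.

```python
from typing import FrozenSet, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail); hypothetical representation


def golden_step_reward(new_edges: Set[Edge], visited: Set[Edge],
                       golden: FrozenSet[Edge], penalty: float = 0.1) -> float:
    """Score one retrieval step against an offline-extracted golden subgraph.

    Assumed reward shape: fraction of the golden subgraph newly covered,
    minus a small penalty per newly retrieved edge outside it.
    """
    fresh = new_edges - visited          # ignore edges retrieved in earlier steps
    if not fresh or not golden:
        return 0.0
    hits = len(fresh & golden)           # new edges that belong to the golden subgraph
    misses = len(fresh) - hits           # new edges that only bloat the context
    return hits / len(golden) - penalty * misses


golden = frozenset({("Paris", "capital_of", "France"),
                    ("France", "member_of", "EU")})
visited: Set[Edge] = set()

step1 = {("Paris", "capital_of", "France"), ("Paris", "has_river", "Seine")}
r1 = golden_step_reward(step1, visited, golden)   # one hit, one miss
visited |= step1

step2 = {("France", "member_of", "EU")}
r2 = golden_step_reward(step2, visited, golden)   # one hit, no miss
```

Every step receives its own signal here, in contrast to a single reward that depends only on the final answer.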
[2] Morpheme Induction for Emergent Language
Brendon Boldt,David Mortensen
Main category: cs.CL
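TL;DR: CSAR greedily induces morphemes from parallel corpora of emergent-language utterances and meanings, weighting candidates by the mutual information between forms and meanings. It is validated on procedurally generated and human language data, then used to analyze linguistic properties of emergent languages.
Details
Motivation: The work addresses morpheme induction for emergent languages, in particular automatically identifying a language's basic units when no explicit annotation is available.
Contribution: The CSAR algorithm, which induces morphemes through an iterative weight, select, and remove loop, validated on generated data and on human language.
Method: CSAR is a greedy algorithm that (1) weights morphemes by the mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats.
Result: CSAR performs well on generated datasets and makes reasonable predictions on human language data; it is also used to quantify properties of emergent languages such as the degree of synonymy and polysemy.
Insight: The greedy strategy is efficient and scalable, applies across a range of language settings, and provides a practical tool for further study of emergent languages.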
Abstract: We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR’s performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
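The Count, Select, Ablate, Repeat loop can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: forms are restricted to single tokens, the weight is each pair's pointwise contribution to mutual information, and the name `induce_morphemes` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Record = Tuple[List[str], Set[str]]  # (form tokens, meaning atoms)


def induce_morphemes(corpus: List[Record], max_morphemes: int = 10):
    corpus = [(list(f), set(m)) for f, m in corpus]  # work on a copy
    induced = []
    for _ in range(max_morphemes):
        n = len(corpus)
        form_c, mean_c, pair_c = Counter(), Counter(), Counter()
        # Count: per-record occurrence statistics for forms, meanings, and pairs
        for forms, meanings in corpus:
            fs = set(forms)
            form_c.update(fs)
            mean_c.update(meanings)
            pair_c.update((f, m) for f in fs for m in meanings)
        if not pair_c:
            break

        def weight(pair):
            f, m = pair
            joint = pair_c[pair] / n
            return joint * math.log(joint / ((form_c[f] / n) * (mean_c[m] / n)))

        # Select: highest-weighted pair; sorting makes tie-breaking deterministic
        best = max(sorted(pair_c), key=weight)
        if weight(best) <= 0:
            break
        induced.append(best)
        # Ablate: remove the form and the meaning wherever they co-occur
        f, m = best
        for forms, meanings in corpus:
            if f in forms and m in meanings:
                forms[:] = [t for t in forms if t != f]
                meanings.discard(m)
        # Repeat: the loop continues on the ablated corpus
    return induced


toy = [(["ka", "mi"], {"red", "circle"}), (["ka", "to"], {"red", "square"}),
       (["zu", "mi"], {"blue", "circle"}), (["zu", "to"], {"blue", "square"})]
morphemes = induce_morphemes(toy)
```

On this toy compositional language the loop recovers one form per meaning atom; ablation matters because it removes already-explained material before the next round of counting.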
[3] Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video
Mengyao Xu,Wenfei Zhou,Yauhen Babakhin,Gabriel Moreira,Ronay Ak,Radek Osmulski,Bo Liu,Even Oldridge,Benedikt Schifferer
Main category: cs.CL
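TL;DR: The paper presents Omni-Embed-Nemotron, a unified multimodal retrieval embedding model that supports cross-modal and joint-modal retrieval over text, image, audio, and video.
Details
Motivation: Existing text-based retrievers fall short on complex real-world multimodal content such as PDFs and videos, while recent multimodal models such as Qwen2.5-Omni show the potential to extend retrieval capabilities.
Contribution: A unified retrieval model covering text, image, audio, and video that performs both cross-modal and joint-modal retrieval with a single model.
Method: Inspired by ColPali and Qwen2.5-Omni, the authors design a unified architecture and training setup for multimodal data.
Result: Experiments demonstrate the model's effectiveness on text, image, and video retrieval.
Insight: A unified multimodal retrieval model can substantially improve the handling of complex real-world data.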
Abstract: We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
[4] TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning
Fangxu Yu,Hongyu Zhao,Tianyi Zhou
Main category: cs.CL
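TL;DR: TS-Reasoner aligns the latent representations of time series foundation models (TSFMs) with large language models (LLMs), achieving strong performance on time series reasoning tasks while using less training data.
Details
Motivation: Time series reasoning matters in many domains, but existing TSFMs lack high-level reasoning ability and LLMs struggle with numerical understanding. The challenge in combining them is aligning the two modalities effectively.
Contribution: TS-Reasoner, which aligns TSFMs with LLMs by training on synthetic pairs of time series and text, together with a two-stage training recipe (alignment pretraining followed by instruction finetuning).
Method: (1) Alignment pretraining on synthetic pairs of time series and textual captions; (2) instruction finetuning for downstream reasoning tasks, with the pretrained TSFM kept frozen.
Result: TS-Reasoner outperforms prevailing LLMs, VLMs, and time series LLMs on several benchmarks, with notable data efficiency (for example, using less than half the training data).
Insight: Freezing the pretrained TSFM and aligning its representations with the LLM's textual inputs is an efficient approach that avoids the complexity of modality fusion while retaining the strengths of both models.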
Abstract: Time series reasoning is crucial to decision-making in diverse domains, including finance, energy usage, traffic, weather, and scientific discovery. While existing time series foundation models (TSFMs) can capture low-level dynamic patterns and provide accurate forecasting, further analysis usually requires additional background knowledge and sophisticated reasoning, which are lacking in most TSFMs but can be achieved through large language models (LLMs). On the other hand, without expensive post-training, LLMs often struggle with the numerical understanding of time series data. Although it is intuitive to integrate the two types of models, developing effective training recipes that align the two modalities for reasoning tasks is still an open challenge. To this end, we propose TS-Reasoner that aligns the latent representations of TSFMs with the textual inputs of LLMs for downstream understanding/reasoning tasks. Specifically, we propose a simple yet effective method to curate diverse, synthetic pairs of time series and textual captions for alignment training. We then develop a two-stage training recipe that applies instruction finetuning after the alignment pretraining. Unlike existing works that train an LLM to take time series as inputs, we leverage a pretrained TSFM and freeze it during training. Extensive experiments on several benchmarks demonstrate that TS-Reasoner not only outperforms a wide range of prevailing LLMs, Vision Language Models (VLMs), and Time Series LLMs, but also achieves this with remarkable data efficiency, e.g., using less than half the training data.
[5] Identifying Financial Risk Information Using RAG with a Contrastive Insight
Ali Elahi
Main category: cs.CL
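TL;DR: The paper adds a comparative reasoning layer on top of RAG to address the overly generic outputs that existing RAG produces on specialized reasoning tasks in finance.
Details
Motivation: In specialized domains, people typically solve problems by comparing them against similar cases. Existing RAG pipelines extract contextually relevant information but cannot retrieve comparable cases or the specifics of related problems, so their outputs stay generic.
Contribution: A peer-aware comparative inference layer on top of RAG that improves performance on financial risk identification.
Method: A comparative inference layer is added to the RAG framework and generates more context-specific, specialized insights by comparing similar cases.
Result: The method outperforms baseline RAG on text generation metrics such as ROUGE and BERTScore, measured against human-generated equity research and risk analysis.
Insight: Comparative reasoning can markedly improve RAG output quality in specialized domains, especially where contextual comparison is needed.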
Abstract: In specialized domains, humans often compare new problems against similar examples, highlight nuances, and draw conclusions instead of analyzing information in isolation. When applying reasoning in specialized contexts with LLMs on top of RAG, the pipeline can capture contextually relevant information, but it is not designed to retrieve comparable cases or related problems. While RAG is effective at extracting factual information, its outputs in specialized reasoning tasks often remain generic, reflecting broad facts rather than context-specific insights. In finance, this results in generic risks that are true for the majority of companies. To address this limitation, we propose a peer-aware comparative inference layer on top of RAG. Our contrastive approach outperforms baseline RAG in text generation metrics such as ROUGE and BERTScore in comparison with human-generated equity research and risk.
[6] Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs
Sayan Ghosh,Shahzaib Saqib Warraich,Dhruv Tarsadiya,Gregory Yauney,Swabha Swayamdipta
Main category: cs.CL
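TL;DR: The paper proposes Consensus Graphs (ConGrs), a directed acyclic graph (DAG) data structure for efficiently synthesizing the rich epistemic signals found across different long-form language model responses. Built with a lightweight sequence alignment algorithm and a secondary LM judge, ConGrs improve factual precision and reduce reliance on LM judges.
Details
Motivation: Existing methods cannot efficiently synthesize rich epistemic signals from multiple sampled language model responses, especially in long-form generation.
Contribution: Consensus Graphs (ConGrs), a flexible data structure that captures both the shared information and the semantic variation in a set of language model responses.
Method: ConGrs are built with a lightweight lexical sequence alignment algorithm from bioinformatics, supplemented by targeted use of a secondary LM judge. Task-dependent decoding methods then synthesize a final response from the ConGr.
Result: ConGrs improve factual precision on biography generation by up to 31%, reduce reliance on LM judges by more than 80%, raise the abstention rate on refusal tasks by up to 56%, and improve accuracy on reasoning tasks by up to 6 points.
Insight: ConGrs offer a flexible way to use the variation across language model responses as a signal for producing higher-quality, more effective responses.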
Abstract: Language models can be sampled multiple times to access the distribution underlying their responses, but existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents shared information, as well as semantic variation in a set of sampled LM responses to the same prompt. We construct ConGrs using a light-weight lexical sequence alignment algorithm from bioinformatics, supplemented by the targeted usage of a secondary LM judge. Further, we design task-dependent decoding methods to synthesize a single, final response from our ConGr data structure. Our experiments show that synthesizing responses from ConGrs improves factual precision on two biography generation tasks by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. We also use ConGrs for three refusal-based tasks requiring abstention on unanswerable queries and find that abstention rate is increased by up to 56%. We apply our approach to the MATH and AIME reasoning tasks and find an improvement over self-verification and majority vote baselines by up to 6 points of accuracy. We show that ConGrs provide a flexible method for capturing variation in LM responses and using the epistemic signals provided by response variation to synthesize more effective responses.
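To make the alignment step concrete, the sketch below aligns two sampled responses token by token and splits them into shared and divergent segments, which is the raw material for a consensus graph. It is only an illustration: Python's `difflib.SequenceMatcher` stands in for the bioinformatics alignment algorithm the abstract mentions, only two responses are handled, and the segment format and the name `align_responses` are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Segment = Tuple[str, List[List[str]]]  # ("shared" | "variant", token lists)


def align_responses(a: str, b: str) -> List[Segment]:
    """Split two responses into shared spans and points of disagreement."""
    ta, tb = a.split(), b.split()
    segments: List[Segment] = []
    matcher = SequenceMatcher(a=ta, b=tb, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both responses agree: a single node on the consensus path
            segments.append(("shared", [ta[i1:i2]]))
        else:
            # Responses diverge: parallel branches between two shared nodes
            segments.append(("variant", [ta[i1:i2], tb[j1:j2]]))
    return segments


segs = align_responses("Marie Curie was born in Warsaw in 1867",
                       "Marie Curie was born in Paris in 1867")
```

The variant segments mark where samples disagree, which is where a secondary judge or a task-dependent decoding rule would need to make a choice.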
[7] TriMediQ: A Triplet-Structured Approach for Interactive Medical Question Answering
Zhaohan Meng,Zaiqiao Meng,Siwei Liu,Iadh Ounis
Main category: cs.CL
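TL;DR: TriMediQ is a triplet-structured approach that summarizes patient responses into triplets and builds a knowledge graph (KG) from them to strengthen reasoning in multi-turn medical question answering. It delivers clear gains on two interactive QA benchmarks.
Details
Motivation: LLMs perform well on static, single-turn medical QA, but their reliability drops in multi-turn interactive clinical consultations because clinical facts in dialogue logs lack clear links. TriMediQ targets this problem.
Contribution: (1) A triplet-structured approach that converts dialogue into a structured KG; (2) a frozen triplet generator and a trainable projection module that improve multi-hop reasoning.
Method: TriMediQ works in two steps: (1) fine-tune the projection module with all LLM weights frozen; (2) use the fine-tuned module to guide multi-hop reasoning at inference time. The triplet generator uses prompts designed to ensure factual consistency.
Result: On the iMedQA dataset, TriMediQ improves accuracy by up to 10.4% over five baselines, supporting the value of structured triplets in multi-turn medical QA.
Insight: Converting unstructured dialogue into a structured KG is a key technique for improving LLM performance in multi-turn medical QA.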
Abstract: Large Language Models (LLMs) perform strongly in static and single-turn medical Question Answer (QA) benchmarks, yet such settings diverge from the iterative information gathering process required in practical clinical consultations. The MEDIQ framework addresses this mismatch by recasting the diagnosis as an interactive dialogue between a patient and an expert system, but the reliability of LLMs drops dramatically when forced to reason with dialogue logs, where clinical facts appear in sentences without clear links. To bridge this gap, we introduce TriMediQ, a triplet-structured approach that summarises patient responses into triplets and integrates them into a Knowledge Graph (KG), enabling multi-hop reasoning. We introduce a frozen triplet generator that extracts clinically relevant triplets, using prompts designed to ensure factual consistency. In parallel, a trainable projection module, comprising a graph encoder and a projector, captures relational information from the KG to enhance expert reasoning. TriMediQ operates in two steps: (i) the projection module fine-tuning with all LLM weights frozen; and (ii) using the fine-tuned module to guide multi-hop reasoning during inference. We evaluate TriMediQ on two interactive QA benchmarks, showing that it achieves up to 10.4% improvement in accuracy over five baselines on the iMedQA dataset. These results demonstrate that converting patient responses into structured triplet-based graphs enables more accurate clinical reasoning in multi-turn settings, providing a solution for the deployment of LLM-based medical assistants.
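The benefit of storing patient statements as triplets can be illustrated with a plain adjacency structure and a bounded breadth-first walk. This sketch covers only the data structure; the paper's graph encoder and projector are not modelled, and `build_kg`, `multi_hop`, and the example triplets are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def build_kg(triplets: List[Triplet]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triplets by head entity so related facts become explicitly linked."""
    kg = defaultdict(list)
    for head, rel, tail in triplets:
        kg[head].append((rel, tail))
    return kg


def multi_hop(kg, start: str, max_hops: int = 2) -> List[List[Triplet]]:
    """Return every relation path of length <= max_hops starting at `start`."""
    paths, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget exhausted, which also bounds cycles
        for rel, tail in kg.get(node, []):
            new_path = path + [(node, rel, tail)]
            paths.append(new_path)
            queue.append((tail, new_path))
    return paths


kg = build_kg([("patient", "reports", "chest pain"),
               ("chest pain", "worsens_with", "exertion"),
               ("patient", "takes", "aspirin")])
paths = multi_hop(kg, "patient")
```

The two-hop path links "patient" to "exertion" even though no single dialogue sentence states that connection directly.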
[8] What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification
Andrew Halterman,Katherine A. Keith
Main category: cs.CL
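TL;DR: The paper examines a risk in LLM-era text classification for computational social science (CSS): analysts may skip the conceptualization step, which biases downstream statistical inference.
Details
Motivation: With generative LLMs now widely used for text classification, the authors focus on the overlooked steps before and after LLM prompting, namely conceptualization and downstream inference, to avoid bias caused by conceptualization errors.
Contribution: The paper argues that conceptualization still demands attention in the LLM era and shows through simulations that neither higher LLM accuracy nor post-hoc bias correction alone can remove conceptualization-induced bias.
Method: Simulation experiments show how conceptualization errors affect downstream estimates; the paper also gives advice on obtaining low-cost, unbiased, low-variance estimates.
Result: Conceptualization-induced bias cannot be corrected solely by improving LLM performance or by applying post-hoc correction methods.
Insight: Conceptualization remains a first-order concern even in the LLM era, and the paper offers practical advice for avoiding bias.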
Abstract: Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting – conceptualization of concepts to be classified and using LLM predictions in downstream statistical inference – which we argue have been overlooked in much of LLM-era CSS. We claim LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected for solely by increasing LLM accuracy or post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM-era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.
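A small worked example shows why classifier accuracy alone cannot remove conceptualization-induced bias. The document counts and the two concept definitions below are invented for illustration and are not taken from the paper's simulations.

```python
# Hypothetical corpus: number of documents of each kind
docs = {"march": 300, "strike": 200, "other": 500}

intended = {"march", "strike"}  # concept the analyst's research question needs
prompted = {"march"}            # narrower concept implied by an underspecified prompt


def prevalence(concept, counts):
    """Share of documents that fall under `concept`."""
    total = sum(counts.values())
    return sum(n for kind, n in counts.items() if kind in concept) / total


true_rate = prevalence(intended, docs)  # what the analyst wants to estimate
llm_rate = prevalence(prompted, docs)   # a *perfect* classifier of the prompted concept
bias = llm_rate - true_rate             # error that remains at 100% accuracy
```

Here a classifier that labels the prompted concept with no errors still underestimates prevalence by 20 percentage points, because the error sits in the definition rather than in the predictions.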
[9] CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making
Hasibur Rahman,Hanan Salam
Main category: cs.CL
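TL;DR: The paper introduces CCD-Bench, a benchmark for evaluating how large language models (LLMs) make decisions under cross-cultural value conflict, and finds that current models favor Nordic Europe and Germanic Europe values while underrepresenting other cultural clusters.
Details
Motivation: LLMs are increasingly used in interpersonal and societal decision-making, yet their behavior when different cultural value systems conflict is largely unstudied. Existing benchmarks do not evaluate how LLMs decide when culturally grounded values clash.
Contribution: CCD-Bench, comprising 2,182 open-ended dilemmas across seven domains, for evaluating LLM decision patterns under cross-cultural value conflict.
Method: Dilemmas are presented using a stratified Latin square to mitigate ordering effects; 17 non-reasoning LLMs are evaluated on their choice preferences across cultural clusters and on their rationales.
Result: Models disproportionately prefer Nordic Europe (mean 20.2%) and Germanic Europe (12.4%), while options for Eastern Europe and the Middle East and North Africa are underrepresented (5.6% to 5.8%). Rationale analysis shows that models rarely ground their choices in Assertiveness or Gender Egalitarianism.
Insight: Current alignment strategies favor a consensus-oriented worldview and underserve decisions that call for power negotiation or gender-aware analysis, which points to the need for deeper engagement with diverse worldviews.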
Abstract: Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with ten anonymized response options corresponding to the ten GLOBE cultural clusters. These dilemmas are presented using a stratified Latin square to mitigate ordering effects. We evaluate 17 non-reasoning LLMs. Models disproportionately prefer Nordic Europe (mean 20.2 percent) and Germanic Europe (12.4 percent), while options for Eastern Europe and the Middle East and North Africa are underrepresented (5.6 to 5.8 percent). Although 87.9 percent of rationales reference multiple GLOBE dimensions, this pluralism is superficial: models recombine Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both under 3 percent). Ordering effects are negligible (Cramer’s V less than 0.10), and symmetrized KL divergence shows clustering by developer lineage rather than geography. These patterns suggest that current alignment pipelines promote a consensus-oriented worldview that underserves scenarios demanding power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision making and highlights the need for alignment strategies that substantively engage diverse worldviews.
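Two of the tools named in the abstract can be sketched briefly. The code below builds a plain cyclic Latin square for option ordering and computes a symmetrized KL divergence between two choice distributions. Both are illustrative assumptions: the benchmark's stratified design is not reproduced, and averaging the two KL directions is one common convention that the paper may not share.

```python
import math
from typing import List, Sequence


def cyclic_latin_square(options: Sequence[str]) -> List[List[str]]:
    """Row r shows the options rotated by r, so each option occupies
    every position exactly once across the rows."""
    n = len(options)
    return [[options[(row + col) % n] for col in range(n)] for row in range(n)]


def symmetrized_kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Average of KL(p||q) and KL(q||p); eps guards against log(0)."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b) if x > 0)
    return 0.5 * (kl(p, q) + kl(q, p))


orders = cyclic_latin_square(["A", "B", "C", "D"])
same = symmetrized_kl([0.5, 0.5], [0.5, 0.5])
diff = symmetrized_kl([0.9, 0.1], [0.5, 0.5])
```

Rotating presentation order spreads any position bias evenly over the options, and the divergence gives a symmetric distance for comparing models' choice distributions.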
[10] Decoupling Task-Solving and Output Formatting in LLM Generation
Haikang Deng,Po-Nien Kung,Nanyun Peng
Main category: cs.CL
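TL;DR: The paper proposes Deco-G, a decoding framework that decouples task solving from output formatting in LLMs by handling format compliance with a separate tractable probabilistic model (TPM), which improves performance.
Details
Motivation: As prompts grow more complex, LLMs struggle to follow task instructions and formatting requirements at the same time; the paper proposes decoupling the two to improve performance.
Contribution: The Deco-G framework, which handles format compliance with a separate TPM and improves efficiency through instruction-aware distillation, a flexible trie-building algorithm, and HMM state pruning.
Method: Deco-G computes a format compliance likelihood with the TPM and combines it with the LLM's next-token probabilities at each decoding step, with three innovations that make the approach practical.
Result: Across tasks such as mathematical reasoning and automatic evaluation (LLM-as-a-judge), Deco-G yields a 1.0% to 6.0% relative gain over regular prompting, with guaranteed format compliance.
Insight: Explicitly separating task solving from format requirements can improve LLM performance, and introducing a TPM offers a new way to handle complex prompts.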
Abstract: Large language models (LLMs) are increasingly adept at following instructions containing task descriptions to solve complex problems, such as mathematical reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow more complex, models often struggle to adhere to all instructions. This difficulty is especially common when instructive prompts intertwine reasoning directives – specifying what the model should solve – with rigid formatting requirements that dictate how the solution must be presented. The entanglement creates competing goals for the model, suggesting that more explicit separation of these two aspects could lead to improved performance. To this end, we introduce Deco-G, a decoding framework that explicitly decouples format adherence from task solving. Deco-G handles format compliance with a separate tractable probabilistic model (TPM), while prompting LLMs with only task instructions. At each decoding step, Deco-G combines next token probabilities from the LLM with the TPM-calculated format compliance likelihood to form the output probability. To make this approach both practical and scalable for modern instruction-tuned LLMs, we introduce three key innovations: instruction-aware distillation, a flexible trie-building algorithm, and HMM state pruning for computational efficiency. We demonstrate the effectiveness of Deco-G across a wide range of tasks with diverse format requirements, including mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall, our approach yields 1.0% to 6.0% relative gain over regular prompting practice with guaranteed format compliance.
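The per-step combination described in the abstract can be sketched as a product of two terms followed by renormalization. The product rule, the function name `combine_step`, and the toy numbers are assumptions; the abstract does not give the exact formula, and the TPM itself is replaced here by a hand-written likelihood table.

```python
from typing import Dict


def combine_step(llm_probs: Dict[str, float],
                 format_likelihood: Dict[str, float]) -> Dict[str, float]:
    """Weight each candidate token by how likely the output can still
    satisfy the format after emitting it, then renormalize."""
    scores = {tok: p * format_likelihood.get(tok, 0.0) for tok, p in llm_probs.items()}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no candidate token can satisfy the format constraint")
    return {tok: s / total for tok, s in scores.items()}


# The LLM sees only the task; the likelihood table stands in for the TPM.
llm = {"42": 0.5, "The": 0.4, "}": 0.1}
fmt = {"42": 1.0, "The": 0.0, "}": 0.5}
out = combine_step(llm, fmt)
```

Tokens that would make the required format unreachable get probability zero, which is how compliance can be guaranteed while the LLM's task preferences still rank the remaining candidates.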
[11] Can an LLM Induce a Graph? Investigating Memory Drift and Context Length
Raquib Bin Yousuf,Aadyant Khatri,Shengzhe Xu,Mandar Sharma,Naren Ramakrishnan
Main category: cs.CL
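TL;DR: The paper studies how large language models (LLMs) perform on complex reasoning tasks, specifically contextual forgetting and memory drift when inducing structured relational knowledge such as graphs. The authors find that existing benchmarks underestimate these problems and offer recommendations for using LLMs on complex reasoning tasks.
Details
Motivation: Existing benchmarks usually rely on simple retrieval or continuation tasks, which may not reflect LLM performance in information-dense scenarios. The authors therefore argue for evaluating LLMs on more complex reasoning tasks, such as inducing graph structure from text.
Contribution: Shows that LLMs exhibit memory drift and contextual forgetting earlier on structured knowledge reasoning tasks; offers recommendations for using LLMs on complex reasoning tasks; shows that even reasoning-specialized models such as OpenAI o1 have similar problems.
Method: The effective context length and forgetting tendencies of LLMs are evaluated on a complex reasoning task: inducing graph structure from potentially noisy natural language content.
Result: On relational reasoning tasks, LLMs begin to forget context at much shorter effective lengths than existing benchmarks suggest, indicating limits in their ability to abstract structured knowledge.
Insight: LLM architectures need adaptation to support long-range reasoning, and current benchmarks may not capture model performance on realistic complex tasks.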
Abstract: Recently proposed evaluation benchmarks aim to characterize the effective context length and the forgetting tendencies of large language models (LLMs). However, these benchmarks often rely on simplistic ‘needle in a haystack’ retrieval or continuation tasks that may not accurately reflect the performance of these models in information-dense scenarios. Thus, rather than simple next token prediction, we argue for evaluating these models on more complex reasoning tasks that require them to induce structured relational knowledge from the text - such as graphs from potentially noisy natural language content. While the input text can be viewed as generated in terms of a graph, its structure is not made explicit and connections must be induced from distributed textual cues, separated by long contexts and interspersed with irrelevant information. Our findings reveal that LLMs begin to exhibit memory drift and contextual forgetting at much shorter effective lengths when tasked with this form of relational reasoning, compared to what existing benchmarks suggest. With these findings, we offer recommendations for the optimal use of popular LLMs for complex reasoning tasks. We further show that even models specialized for reasoning, such as OpenAI o1, remain vulnerable to early memory drift in these settings. These results point to significant limitations in the models’ ability to abstract structured knowledge from unstructured input and highlight the need for architectural adaptations to improve long-range reasoning.
[12] Towards Unsupervised Speech Recognition at the Syllable-Level
Liming Wang,Junrui Ni,Kai-Wei Chang,Saurabhchand Bhati,David Harwath,Mark Hasegawa-Johnson,James R. Glass
Main category: cs.CL
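TL;DR: The paper proposes a syllable-level unsupervised speech recognition (UASR) framework based on masked language modeling that avoids the G2P dependence of prior methods and the instability of GANs, and improves recognition on LibriSpeech and Mandarin.
Details
Motivation: Existing UASR methods are usually phone-based, depend on costly G2P resources, and perform poorly on languages with ambiguous phoneme boundaries such as Mandarin. The authors propose a syllable-level UASR framework to address these problems.
Contribution: A syllable-level UASR method that avoids G2P and GAN instability and improves recognition performance, notably on Mandarin, which has been difficult for prior methods.
Method: A syllable-level UASR framework based on masked language modeling, which reduces training instability and reliance on external resources.
Result: Up to a 40% relative reduction in character error rate (CER) on LibriSpeech, with effective generalization to Mandarin.
Insight: Syllable-level modeling may be an effective direction for UASR, especially for low-resource languages or languages with ambiguous phoneme boundaries.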
Abstract: Training speech recognizers with unpaired speech and text – known as unsupervised speech recognition (UASR) – is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
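For readers unfamiliar with the metric, the sketch below computes character error rate from Levenshtein distance and shows what a relative reduction means. The CER values in the example are made up to illustrate the arithmetic and are not results from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits needed per reference character."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    return (baseline - new) / baseline


example_cer = cer("hello", "hallo")           # one substitution in five characters
example_gain = relative_reduction(0.20, 0.12)  # hypothetical CERs of 20% and 12%
```

A relative reduction is measured against the baseline's error, so going from a CER of 20% to 12% is a 40% relative reduction but only 8 points in absolute terms.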
[13] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Xiangyu Peng,Cab Qin,Zeyuan Chen,Ran Xu,Caiming Xiong,Chien-Sheng Wu
Main category: cs.CL
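TL;DR: UniDoc-Bench is the first large-scale, realistic benchmark for document-centric multimodal RAG, built from 70k real-world PDF pages with 1,600 multimodal QA pairs and supporting fair comparison across four paradigms. Experiments show that multimodal text-image fusion RAG systems outperform unimodal approaches.
Details
Motivation: Current multimodal RAG evaluations are fragmented and incomplete, and fail to cover realistic document-centric multimodal use cases. UniDoc-Bench aims to fill this gap with a unified evaluation platform.
Contribution: (1) The first large-scale document-centric multimodal RAG benchmark; (2) unified evaluation across four paradigms; (3) a systematic analysis of how visual context complements textual evidence.
Method: Text, table, and figure evidence is extracted and linked from 70k real-world PDF pages to generate 1,600 multimodal QA pairs spanning multiple query types. 20% of the QA pairs are validated by multiple annotators and expert adjudication.
Result: Multimodal text-image fusion RAG systems outperform both unimodal and joint multimodal embedding-based retrieval, indicating that current multimodal embeddings remain inadequate.
Insight: The analysis shows when and how visual context complements textual evidence, uncovers systematic failure modes, and offers guidance for building more robust MM-RAG pipelines.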
Abstract: Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval – under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.
[14] MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction
Yue Huang,Yanyuan Chen,Dexuan Xu,Weihua Yue,Huamin Zhang,Meikang Qiu,Yu Huang
Main category: cs.CL
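TL;DR: MedReflect is a framework that improves the medical problem-solving ability of large language models (LLMs) through self-reflection and self-verification, reducing reliance on external retrieval and annotated data.
Details
Motivation: Medical problem solving requires expert knowledge and intricate reasoning. Existing approaches, such as retrieval-augmented generation or training on reasoning datasets, suffer from retrieval overhead and high annotation costs and depend on external assistants. The paper aims to unlock the latent capability of LLMs through self-reflection.
Contribution: The MedReflect framework, which generates a single-pass reflection chain of initial hypothesis generation, self-questioning, self-answering, and decision refinement so that LLMs can self-improve in the medical domain. It reduces reliance on external resources and annotated data while improving performance.
Method: MedReflect proceeds through (1) initial hypothesis generation, (2) self-questioning, (3) self-answering, and (4) decision refinement. This self-verifying, reflective process needs no external retrieval or heavy annotation.
Result: With only 2,000 randomly sampled training examples and light fine-tuning, MedReflect achieves notable accuracy improvements across a series of medical benchmarks while cutting annotation requirements.
Insight: LLMs can learn to solve specialized medical problems through self-reflection and self-improvement, reducing reliance on external supervision and task-specific data.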
Abstract: Medical problem solving demands expert knowledge and intricate reasoning. Recent studies of large language models (LLMs) attempt to ease this complexity by introducing external knowledge verification through retrieval-augmented generation or by training on reasoning datasets. However, these approaches suffer from drawbacks such as retrieval overhead and high annotation costs, and they heavily rely on substituted external assistants to reach limited performance in the medical field. In this paper, we introduce MedReflect, a generalizable framework designed to inspire LLMs with a physician-like reflective thinking mode. MedReflect generates a single-pass reflection chain that includes initial hypothesis generation, self-questioning, self-answering and decision refinement. This self-verified and self-reflective nature releases the latent capability of large language models in medical problem-solving without external retrieval or heavy annotation. We demonstrate that MedReflect enables cost-efficient medical dataset construction: with merely 2,000 randomly sampled training examples and light fine-tuning, this approach achieves notable absolute accuracy improvements across a series of medical benchmarks while cutting annotation requirements. Our results provide evidence that LLMs can learn to solve specialized medical problems via self-reflection and self-improvement, reducing reliance on external supervision and extensive task-specific fine-tuning data.
[15] Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models
Canhui Wu,Qiong Cao,Chang Li,Zhenfang Wang,Chao Xue,Yuwei Fan,Wei Xi,Xiaodong He
Main category: cs.CL
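TL;DR: The paper proposes Step Pruner (SP), an RL framework that addresses the inefficiency caused by overthinking in large reasoning models (LRMs) by optimizing reasoning steps rather than only reducing token counts.
Details
Motivation: Existing RL-based methods penalize only token count and ignore the efficiency of reasoning steps, so models may "cheat" by merging or discarding steps, which hurts accuracy. SP aims to optimize reasoning steps for efficient and accurate reasoning.
Contribution: (1) The Step Pruner (SP) RL framework, focused on reducing redundant reasoning steps; (2) a step-aware reward function that prioritizes correctness and penalizes redundant steps; (3) a dynamic stopping mechanism that prevents "cheating" when a step grows too long.
Method: The model is trained with RL to reward compact, correct reasoning steps and penalize redundant ones; the dynamic stopping mechanism halts updates when the length of any output step exceeds an upper limit, which prevents the model from merging steps.
Result: On four reasoning benchmarks, SP achieves state-of-the-art accuracy while significantly reducing response length (for example, 69.7% fewer tokens on AIME24).
Insight: Optimizing reasoning steps is more effective than simply reducing tokens, and the dynamic mechanism keeps the model from circumventing the penalty, balancing efficiency and accuracy.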
Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as “overthinking.” Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce Step Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by 69.7%.
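The step-aware reward and the dynamic stopping check can be sketched as two small functions. The reward shape, the step budget, the penalty weight, the whitespace tokenization, and the names `pruner_reward` and `should_halt_updates` are all illustrative assumptions; the abstract describes the behavior but not the formulas.

```python
from typing import List


def pruner_reward(steps: List[str], correct: bool,
                  target_steps: int = 4, penalty: float = 0.1) -> float:
    """Step-aware reward: nothing for wrong answers, and a deduction for
    every reasoning step beyond an assumed budget."""
    if not correct:
        return 0.0  # withhold reward so erroneous reasoning is not reinforced
    redundant = max(0, len(steps) - target_steps)
    return max(0.0, 1.0 - penalty * redundant)


def should_halt_updates(steps: List[str], max_step_tokens: int = 64) -> bool:
    """Dynamic stopping check: flag outputs in which any single step is so
    long that it probably merges several steps to dodge the step penalty."""
    return any(len(step.split()) > max_step_tokens for step in steps)


concise = pruner_reward(["s1", "s2", "s3"], correct=True)   # within budget
verbose = pruner_reward(["s"] * 6, correct=True)            # two redundant steps
wrong = pruner_reward(["s1"], correct=False)                # incorrect answer
merged = should_halt_updates(["word " * 100])               # one oversized step
```

Counting steps instead of tokens removes the incentive to delete reasoning outright, and the length check closes the remaining loophole of packing several steps into one.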
[16] Annotate Rhetorical Relations with INCEpTION: A Comparison with Automatic Approaches
Mehedi Hasan Emon
Main category: cs.CL
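TL;DR: The study compares manual annotation of rhetorical relations using the INCEpTION tool with automatic approaches based on BERT, DistilBERT, and Logistic Regression; DistilBERT performs best at classifying rhetorical relations in cricket news.
Details
Motivation: The study examines how manual and automatic methods differ for rhetorical relation annotation, focusing on discourse parsing of sports reports (specifically cricket news).
Contribution: Evidence that DistilBERT is an efficient option for rhetorical relation classification, with potential for discourse parsing tasks.
Method: Rhetorical relations are annotated manually with INCEpTION, and BERT, DistilBERT, and Logistic Regression models are evaluated on classifying them; the two approaches are then compared.
Result: DistilBERT achieved the highest accuracy among the evaluated models.
Insight: Transformer-based language models, DistilBERT in particular, perform strongly on discourse parsing tasks and offer an efficient route to automatic rhetorical relation annotation.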
Abstract: This research explores the annotation of rhetorical relations in discourse using the INCEpTION tool and compares manual annotation with automatic approaches based on large language models. The study focuses on sports reports (specifically cricket news) and evaluates the performance of BERT, DistilBERT, and Logistic Regression models in classifying rhetorical relations such as elaboration, contrast, background, and cause-effect. The results show that DistilBERT achieved the highest accuracy, highlighting its potential for efficient discourse relation prediction. This work contributes to the growing intersection of discourse parsing and transformer-based NLP. (This paper was conducted as part of an academic requirement under the supervision of Prof. Dr. Ralf Klabunde, Linguistic Data Science Lab, Ruhr University Bochum.) Keywords: Rhetorical Structure Theory, INCEpTION, BERT, DistilBERT, Discourse Parsing, NLP.
[17] PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian
Mohammad Amin Abbasi,Hassan Naderi
Main category: cs.CL
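TL;DR: PsychoLexTherapy is a framework for psychotherapeutic reasoning in Persian that uses small language models (SLMs), emphasizes cultural fit and privacy, and simulates therapy through structured memory and multi-turn dialogue.
Details
Motivation: To address the challenges of cultural grounding and privacy in psychotherapy dialogue systems for low-resource languages such as Persian, and to build efficient models that can be deployed on-device.
Contribution: The PsychoLexTherapy framework, including new datasets (PsychoLexQuery and PsychoLexDialogue), a structured memory module, and a reproducible evaluation pipeline.
Method: Development followed three stages: assessing the psychological knowledge of SLMs with PsychoLexEval; designing the reasoning-oriented PsychoLexTherapy framework; and comparing simple prompting, multi-agent debate, and structured therapeutic reasoning paths in experiments.
Result: PsychoLexTherapy outperforms the baselines in both automatic and human evaluation; in multi-turn tests, the structured memory module improves empathy, coherence, and cultural fit.
Insight: With structured design, small language models can perform well on culturally sensitive tasks, and on-device deployment meets privacy needs.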
Abstract: This study presents PsychoLexTherapy, a framework for simulating psychotherapeutic reasoning in Persian using small language models (SLMs). The framework tackles the challenge of developing culturally grounded, therapeutically coherent dialogue systems with structured memory for multi-turn interactions in underrepresented languages. To ensure privacy and feasibility, PsychoLexTherapy is optimized for on-device deployment, enabling use without external servers. Development followed a three-stage process: (i) assessing SLMs' psychological knowledge with PsychoLexEval; (ii) designing and implementing the reasoning-oriented PsychoLexTherapy framework; and (iii) constructing two evaluation datasets, PsychoLexQuery (real Persian user questions) and PsychoLexDialogue (hybrid simulated sessions), to benchmark against multiple baselines. Experiments compared simple prompting, multi-agent debate, and structured therapeutic reasoning paths. Results showed that deliberate model selection balanced accuracy, efficiency, and privacy. On PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic LLM-as-a-judge evaluation and was ranked highest by human evaluators in a single-turn preference study. In multi-turn tests with PsychoLexDialogue, the long-term memory module proved essential: while naive history concatenation caused incoherence and information loss, the full framework achieved the highest ratings in empathy, coherence, cultural fit, and personalization. Overall, PsychoLexTherapy establishes a practical, privacy-preserving, and culturally aligned foundation for Persian psychotherapy simulation, contributing novel datasets, a reproducible evaluation pipeline, and empirical insights into structured memory for therapeutic reasoning.
[18] AgriGPT-VL: Agricultural Vision-Language Understanding Suite
Bo Yang,Yunkui Chen,Lanfei Feng,Yu Zhang,Xiao Xu,Jianyu Zhang,Nueraili Aierken,Runhe Huang,Hongjian Lin,Yibin Ying,Shijian Li
Main category: cs.CL
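TL;DR: AgriGPT-VL is a multimodal framework for agriculture. It pairs a large-scale agricultural vision-language corpus with a specialized model and an evaluation benchmark, improving multimodal understanding and reasoning in the agricultural domain.
Details
Motivation: Agricultural multimodal applications are constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation; AgriGPT-VL aims to fill this gap.
Contribution: 1. Agri-3M-VL, the largest agricultural vision-language corpus to the authors' knowledge; 2. AgriGPT-VL, an agriculture-specialized vision-language model; 3. AgriBench-VL-4K, an evaluation suite with multi-metric and LLM-as-a-judge evaluation.
Method: AgriGPT-VL is trained with a progressive curriculum (textual grounding, multimodal shallow/deep alignment, GRPO refinement) that builds multimodal reasoning while preserving text-only capability.
Result: AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K and stays competitive on the text-only AgriBench-13K with no noticeable degradation of language ability.
Insight: Agricultural multimodal tasks benefit from domain-specific data and model design; the results point to domain adaptation as the key factor.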
Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.
[19] LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization
Jiarui Liu,Jivitesh Jain,Mona Diab,Nishant Subramani
Main category: cs.CL
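TL;DR: The paper examines whether a large language model's (LLM's) internal activations alone can predict the correctness of its outputs and signal the efficacy of external context.
Details
Motivation: LLMs often generate incorrect information with high confidence, so trustworthiness remains a chief concern. It is also hard to tell when a query would benefit from retrieved context and how effective that context is.
Contribution: 1) Predicts output correctness from model activations; 2) introduces metrics that distinguish correct, incorrect, and irrelevant context; 3) shows that a simple classifier on intermediate-layer activations reaches about 75% accuracy.
Method: A simple classifier is trained on intermediate-layer activations of the first output token to predict output correctness; a model-internals-based metric assesses context efficacy.
Result: Across six models, the classifier predicts correctness with about 75% accuracy, and the internals-based metric significantly outperforms prompting baselines at distinguishing correct from incorrect context, guarding against errors introduced by polluted context.
Insight: Internal activations carry signals about output trustworthiness and context utilization, offering a lens on the decision-making processes of LLMs.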
Abstract: Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model’s activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs. Our code is publicly available at https://github.com/jiarui-liu/LLM-Microscope
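The probing setup described in this abstract can be illustrated with a small, self-contained sketch. It is not the authors' code: the "activations" below are synthetic Gaussian vectors standing in for first-output-token hidden states, and `train_probe`, `predict`, and `fake_activation` are hypothetical helper names. The only point is to show what "a simple classifier on activations" looks like in practice.

```python
import math
import random


def train_probe(features, labels, lr=0.1, epochs=50):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            logit = max(-30.0, min(30.0, logit))  # clamp for numerical safety
            err = 1.0 / (1.0 + math.exp(-logit)) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (answer predicted correct) or 0 (predicted incorrect)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# ASSUMPTION: synthetic stand-in for real hidden states of the first output token.
random.seed(0)


def fake_activation(correct):
    shift = 1.0 if correct else -1.0
    return [random.gauss(shift, 1.0) for _ in range(8)]


labels = [i % 2 for i in range(200)]
features = [fake_activation(y) for y in labels]
weights, bias = train_probe(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(predict(weights, bias, x) == y for x, y in held_out) / len(held_out)
```

With real models, `features` would instead hold activations extracted from an intermediate layer; the synthetic data here is far easier to separate than real activations, so the accuracy of this toy says nothing about the roughly 75% reported in the abstract.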
[20] Does Using Counterfactual Help LLMs Explain Textual Importance in Classification?
Nelvin Tan,James Asikin Cheung,Yu-Ching Shih,Dong Yang,Amol Salunkhe
Main category: cs.CL
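TL;DR: The paper studies whether incorporating counterfactuals into large language model (LLM) reasoning improves the model's ability to identify the top words behind its text classification decisions.
Details
Motivation: LLMs are often black boxes and LLM calls are expensive, so explaining their classification decisions under these practical constraints matters. Counterfactuals are used to help the LLM identify the words that drive its decision.
Contribution: Introduces a framework called the "decision changing rate" to quantify the importance of the top words in classification, and shows experimentally that counterfactuals can help.
Method: Counterfactuals are incorporated into LLM reasoning, and the decision changing rate measures how strongly the identified words influence the classification result.
Result: Experiments show that using counterfactuals can help LLMs identify the top words.
Insight: Counterfactuals can improve explanations of LLM classification decisions even when the model is a black box and calls are costly.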
Abstract: Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. More recently, they have been shown to be very effective in textual classification tasks, motivating the need to explain the LLMs’ decisions. Motivated by practical constraints where LLMs are black-boxed and LLM calls are expensive, we study how incorporating counterfactuals into LLM reasoning can affect the LLM’s ability to identify the top words that have contributed to its classification decision. To this end, we introduce a framework called the decision changing rate that helps us quantify the importance of the top words in classification. Our experimental results show that using counterfactuals can be helpful.
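The abstract does not define the decision changing rate, so the sketch below is only one plausible reading: the fraction of inputs whose predicted label flips once the claimed top words are removed. The function names and the keyword-based `toy_classifier` (a stand-in for an LLM classifier call) are hypothetical.

```python
def decision_changing_rate(texts, top_words, classify):
    """Fraction of texts whose label flips after deleting the claimed top words.

    ASSUMPTION: this removal-based definition is illustrative; the paper's
    exact formulation is not given in the abstract.
    """
    changed = 0
    for text, words in zip(texts, top_words):
        original = classify(text)
        banned = {w.lower() for w in words}
        ablated = " ".join(t for t in text.split() if t.lower() not in banned)
        if classify(ablated) != original:
            changed += 1
    return changed / len(texts)


def toy_classifier(text):
    """Hypothetical stand-in for an LLM classifier: keyword-based sentiment."""
    positive = {"great", "excellent"}
    return "pos" if any(t.lower() in positive for t in text.split()) else "neg"


texts = ["the food was great", "service was slow", "an excellent stay"]
top_words = [["great"], ["slow"], ["stay"]]
rate = decision_changing_rate(texts, top_words, toy_classifier)
```

Under this reading, a higher rate means the words the model named really were the ones carrying the decision; in the toy data only the first example flips, because "stay" is not what makes the third review positive.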
[21] Small Language Models for Emergency Departments Decision Support: A Benchmark Study
Zirui Wang,Jiajun Wu,Braden Teitge,Jessalyn Holodinsky,Steve Drew
Main category: cs.CL
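TL;DR: The paper presents a benchmark study of small language models (SLMs) for emergency department (ED) decision support and finds that general-domain SLMs outperform their medically fine-tuned counterparts across the benchmarks.
Details
Motivation: EDs are fast-paced and high-stakes, so they need timely and accurate decision support. SLMs have fewer parameters than LLMs and therefore fit practical hardware limits, operational cost constraints, and privacy requirements.
Contribution: 1) A benchmark designed to identify SLMs suited for ED decision support; 2) the finding that general-domain SLMs outperform medically fine-tuned SLMs on medical tasks.
Method: SLMs trained on a mixture of general-domain and medical corpora are evaluated on MedMCQA, MedQA-4Options, and PubMedQA.
Result: General-domain SLMs outperform medically fine-tuned SLMs on the ED-related benchmarks, indicating that specialized medical fine-tuning may not be required in this setting.
Insight: The efficiency of SLMs makes them well suited to ED decision support, and the strong general-domain results suggest that specialized medical fine-tuning may be unnecessary here.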
Abstract: Large language models (LLMs) have become increasingly popular in medical domains to assist physicians with a variety of clinical and operational tasks. Given the fast-paced and high-stakes environment of emergency departments (EDs), small language models (SLMs), characterized by a reduction in parameter count compared to LLMs, offer significant potential due to their inherent reasoning capability and efficient performance. This enables SLMs to support physicians by providing timely and accurate information synthesis, thereby improving clinical decision-making and workflow efficiency. In this paper, we present a comprehensive benchmark designed to identify SLMs suited for ED decision support, taking into account both specialized medical expertise and broad general problem-solving capabilities. In our evaluations, we focus on SLMs that have been trained on a mixture of general-domain and medical corpora. A key motivation for emphasizing SLMs is the practical hardware limitations, operational cost constraints, and privacy concerns in the typical real-world deployments. Our benchmark datasets include MedMCQA, MedQA-4Options, and PubMedQA, with the medical abstracts dataset emulating tasks aligned with real ED physicians’ daily tasks. Experimental results reveal that general-domain SLMs surprisingly outperform their medically fine-tuned counterparts across these diverse benchmarks for ED. This indicates that for ED, specialized medical fine-tuning of the model may not be required.
[22] Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment
Yunfan Zhang,Kathleen McKeown,Smaranda Muresan
Main category: cs.CL
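TL;DR: The paper explores whether Chain-of-Thought (CoT) reasoning techniques can be used to build steerable pluralistic language models. It compares CoT prompting, fine-tuning, and RLVR, and finds that RLVR performs best in both accuracy and sample efficiency.
Details
Motivation: Large language models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their use in tasks that require understanding nuanced human perspectives. The study aims to achieve steerable pluralistic alignment through CoT reasoning.
Contribution: Proposes and compares several CoT-based methods for pluralistic alignment and shows that RLVR is superior in performance and sample efficiency.
Method: CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR) are evaluated on the Value Kaleidoscope and OpinionQA datasets.
Result: RLVR consistently outperforms the other methods and is sample-efficient in training; the generated CoT traces are further analyzed for faithfulness and safety.
Insight: CoT reasoning can improve the pluralistic alignment of language models, and RLVR, with its verifiable reward signal, holds a clear advantage in performance and efficiency.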
Abstract: Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism – the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.
[23] What Makes Diffusion Language Models Super Data Learners?
Zitian Gao,Haoming Luo,Lynx Chen,Jason Klein Liu,Ran Tao,Joey Zhou,Bryan Dai
Main category: cs.CL
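TL;DR: The paper investigates why diffusion language models learn so efficiently under limited-data constraints. Its central finding is that random masking of input tokens plays the dominant role, and that similar gains can be obtained with MLP dropout and weight decay.
Details
Motivation: To uncover the mechanisms behind the data efficiency of diffusion language models when data is scarce.
Contribution: Identifies random masking of input tokens as the dominant source of data efficiency, and shows that stochastic regularization (such as MLP dropout and weight decay) broadly improves data efficiency in multi-epoch training.
Method: Extensive ablation experiments isolate and test the effect of individual factors, focusing on random masking, MLP dropout, and weight decay.
Result: Random masking is the main driver of data efficiency, and other stochastic regularization methods achieve similar gains.
Insight: Stochastic regularization may be a general strategy for improving data efficiency in multi-epoch training, with random masking as one key instance of it.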
Abstract: Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.
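To make "random masking of input tokens" concrete, here is a minimal sketch under stated assumptions: a mask ratio is sampled uniformly per sequence and each token is then masked independently, as is common in masked-diffusion training. The `MASK` string and `random_mask` helper are hypothetical, and the authors' repository may implement masking differently.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def random_mask(tokens, rng):
    """Sample a mask ratio in [0, 1), then mask each token independently.

    ASSUMPTION: per-sequence uniform ratio; the paper's schedule may differ.
    """
    ratio = rng.random()
    masked = [MASK if rng.random() < ratio else tok for tok in tokens]
    return masked, ratio


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
# Each epoch sees a different corrupted view of the same sentence.
views = [random_mask(tokens, rng)[0] for _ in range(4)]
```

Because every pass over the data draws a fresh corruption pattern, repeated epochs never present exactly the same input, which is the sense in which masking acts as a stochastic regularizer alongside dropout and weight decay.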
[24] PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Zixin Song,Bowen Zhang,Qian-Wen Zhang,Di Yin,Xing Sun,Chunping Li
Main category: cs.CL
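TL;DR: The paper proposes PoLi-RL, a Point-to-List reinforcement learning framework for Conditional Semantic Textual Similarity (C-STS). A two-stage curriculum and a Parallel Slice Ranking Reward (PSRR) address the difficulty of optimizing against complex reward signals, setting a new SOTA for the cross-encoder architecture.
Details
Motivation: Existing C-STS methods are largely confined to discriminative models and do not fully exploit recent advances in large language models (LLMs) and reinforcement learning (RL). RL can directly optimize the non-differentiable Spearman ranking metric, but naively applying listwise RL yields no meaningful improvement because the model is overwhelmed by complex, coarse-grained reward signals.
Contribution: 1. The PoLi-RL framework, whose two-stage curriculum combines pointwise, pairwise, and listwise rewards. 2. The Parallel Slice Ranking Reward (PSRR) mechanism for granular credit assignment. 3. SOTA performance on the C-STS benchmark.
Method: 1. A two-stage curriculum: simple pointwise rewards first establish basic scoring ability, then a hybrid reward (pointwise, pairwise, and listwise) refines the model's ability to discern subtle semantic distinctions. 2. PSRR computes ranking rewards in parallel slices, giving each completion a precise learning signal.
Result: On the C-STS benchmark, PoLi-RL reaches a Spearman correlation coefficient of 48.18, a new SOTA for the cross-encoder architecture.
Insight: 1. RL can directly optimize non-differentiable ranking metrics, which suits C-STS. 2. The two-stage curriculum and PSRR resolve the optimization problems caused by complex reward signals and may carry over to other ranking tasks.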
Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by complex, coarse-grained reward signals. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model’s ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and precise paradigm for training LLMs on complex, ranking-based conditional judgment tasks.
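The slice construction described in the abstract (each slice collects the same-indexed completion from every sample) can be sketched as follows. How PoLi-RL turns a slice into per-completion rewards is not specified in the abstract, so this sketch simply assigns each slice its Spearman correlation with the gold labels; that choice, and all function names, are assumptions.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def slice_ranking_rewards(scores, gold):
    """scores[i][j]: similarity predicted by completion j of sample i.

    Slice j gathers completion j from every sample and is scored against
    the gold labels (ASSUMPTION: one Spearman value per slice).
    """
    num_completions = len(scores[0])
    return [spearman([row[j] for row in scores], gold)
            for j in range(num_completions)]


scores = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [5.0, 1.0]]
gold = [1, 2, 3, 5]
rewards = slice_ranking_rewards(scores, gold)
```

Here the first slice ranks the four samples in the gold order and the second ranks them in reverse, so the two slices receive opposite rewards even though they come from the same batch.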
[25] Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning
Honglin Lin,Qizhi Pei,Xin Gao,Zhuoshi Pan,Yu Li,Juntao Li,Conghui He,Lijun Wu
Main category: cs.CL
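TL;DR: The paper proposes Caco, a framework that uses code-driven augmentation to automatically synthesize high-quality, verifiable, and diverse reasoning data, improving the reasoning ability of large language models.
Details
Motivation: Existing Chain-of-Thought (CoT) methods suffer from uncontrolled generation, insufficient quality, and limited diversity of reasoning paths, and code-based methods are typically constrained to predefined mathematical problems. Caco targets these issues to make reasoning data more scalable and generalizable.
Contribution: (1) A framework that automates the synthesis of high-quality reasoning data through code; (2) validation via code execution and rule-based filtering to ensure logical correctness and structural diversity; (3) the Caco-1.3M dataset, which demonstrates the method's strength on mathematical reasoning tasks.
Method: Caco has three steps: (1) fine-tune a code-based CoT generator on math and programming solutions in a unified code format; (2) validate the generated data through code execution and rule-based filtering; (3) reverse-engineer the filtered outputs into natural language instructions and CoTs to enrich task adaptability.
Result: Caco-trained models perform strongly on mathematical reasoning benchmarks, outperforming existing baselines. Caco's code-anchored verification and instruction diversity also help generalization to unseen tasks.
Insight: Caco offers a template for building self-sustaining, trustworthy reasoning systems without human intervention, and highlights the role of code verification and diverse generation in improving model reasoning.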
Abstract: Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco’s code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
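The validation step (code execution plus rule-based filtering) might look roughly like the sketch below. It is an illustration, not Caco's pipeline: the specific rules (a line-count cap and a required `answer` variable) and the helper name `passes_validation` are invented for the example.

```python
def passes_validation(code, max_lines=50):
    """Keep a candidate only if it is short enough, runs, and defines `answer`.

    ASSUMPTION: both rules are hypothetical examples of rule-based filtering.
    """
    if len(code.splitlines()) > max_lines:
        return False
    namespace = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real pipeline would sandbox it.
        exec(code, namespace)
    except Exception:
        return False
    return "answer" in namespace


candidates = [
    "answer = sum(range(1, 11))",  # runs and sets `answer`
    "answer = 1 / 0",              # raises at execution time
    "total = 3 + 4",               # runs but never sets `answer`
]
kept = [c for c in candidates if passes_validation(c)]
```

Executing each candidate gives a cheap correctness signal that natural-language CoT lacks: a trace that crashes or never produces a result is discarded before it can reach the training set.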
[26] Unveiling LLMs’ Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence
Fengying Ye,Shanshan Wang,Lidia S. Chao,Derek F. Wong
Main category: cs.CL
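TL;DR: The paper examines how well large language models (LLMs) understand metaphor from three perspectives: concept mapping, a metaphor-literal repository, and syntactic sensitivity. It finds that LLMs produce 15%-25% conceptually irrelevant interpretations, rely on metaphorical indicators in training data rather than contextual cues, and are highly sensitive to syntactic irregularities.
Details
Motivation: Metaphor analysis is a complex linguistic phenomenon, yet the mechanisms LLMs use for metaphor comprehension remain insufficiently explored. The paper aims to reveal how LLMs handle metaphor and where they fall short.
Contribution: (1) A concept-mapping analysis framework; (2) a metaphor-literal repository for analyzing the model's inherent metaphorical knowledge; (3) evidence of LLMs' sensitivity to syntactic structure.
Method: (1) Evaluate concept mapping through embedding space projections; (2) compare metaphorical words with their literal counterparts; (3) analyze how syntactic structure affects model performance.
Result: LLMs generate 15%-25% conceptually irrelevant interpretations, depend on metaphorical indicators in the data, and are more sensitive to syntactic irregularities than to structural comprehension.
Insight: LLMs have clear limitations in metaphor analysis, and more robust computational approaches are needed to improve comprehension and contextual reasoning.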
Abstract: Metaphor analysis is a complex linguistic phenomenon shaped by context and external factors. While Large Language Models (LLMs) demonstrate advanced capabilities in knowledge integration, contextual reasoning, and creative generation, their mechanisms for metaphor comprehension remain insufficiently explored. This study examines LLMs’ metaphor-processing abilities from three perspectives: (1) Concept Mapping: using embedding space projections to evaluate how LLMs map concepts in target domains (e.g., misinterpreting “fall in love” as “drop down from love”); (2) Metaphor-Literal Repository: analyzing metaphorical words and their literal counterparts to identify inherent metaphorical knowledge; and (3) Syntactic Sensitivity: assessing how metaphorical syntactic structures influence LLMs’ performance. Our findings reveal that LLMs generate 15%-25% conceptually irrelevant interpretations, depend on metaphorical indicators in training data rather than contextual cues, and are more sensitive to syntactic irregularities than to structural comprehension. These insights underline the limitations of LLMs in metaphor analysis and call for more robust computational approaches.
[27] Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
Wengao Ye,Yan Liang,Lianlei Shan
Main category: cs.CL
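TL;DR: The paper proposes Latent Thought Policy Optimization (LTPO), a framework that enhances reasoning at test time without updating model parameters by dynamically optimizing latent "thought" vectors, improving large language model performance on complex tasks.
Details
Motivation: Existing latent reasoning methods are brittle on challenging, out-of-distribution tasks, which calls for a test-time optimization method that needs neither external supervision nor expensive text generation.
Contribution: Proposes LTPO, which treats latent "thought" vectors as dynamic parameters to be optimized and guides the optimization with an intrinsic reward signal derived from the model's own output distributions.
Method: LTPO uses an online policy gradient method to optimize the latent "thought" vectors for each problem instance, with no external supervision or model parameter updates.
Result: On five reasoning benchmarks, LTPO matches or surpasses strong baselines and delivers substantial improvements on the highly challenging AIME benchmarks, where existing latent reasoning baselines collapse to near-zero accuracy.
Insight: LTPO shows the potential of dynamically optimizing latent reasoning paths and offers an efficient approach to complex tasks.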
Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent “thought” vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM’s own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
[28] CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling
Zhengyang Tang,Zihan Ye,Chenyu Huang,Xuhan Huang,Chengpeng Li,Sihang Li,Guanhua Chen,Ming Yan,Zizhuo Wang,Hongyuan Zha,Dayiheng Liu,Benyou Wang
Main category: cs.CL
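TL;DR: The paper proposes CALM, a framework that progressively corrects the reasoning of Large Reasoning Models (LRMs) for optimization modeling tasks, and builds on it to develop STORM, which achieves state-of-the-art average accuracy across several benchmarks.
Details
Motivation: Existing domain adaptation methods fail to exploit the advanced reasoning abilities of modern LRMs; in particular, direct fine-tuning on traditional non-reflective datasets yields limited gains on optimization modeling.
Contribution: Proposes CALM, in which expert interventions correct the model's reasoning trajectories and produce high-quality data for soft adaptation; builds the STORM model on CALM and reports substantial performance gains.
Method: In CALM, an expert intervener supplies lightweight corrective hints that progressively refine the LRM's reasoning; supervised fine-tuning followed by reinforcement learning further improves the model.
Result: STORM, a 4B-parameter LRM, reaches an average accuracy of 68.9% across five optimization modeling benchmarks, matching the performance of a 671B-parameter LRM.
Insight: Dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering an effective and scalable path for optimization modeling tasks.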
Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs. In particular, we show that direct fine-tuning on traditional non-reflective datasets leads to limited gains. To fully leverage LRMs’ inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.
[29] Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment frm Heterogeneous Rewards
Zhuoran Zhuang,Ye Chen,Xia Zeng,Chao Luo,Luhui Liu,Yihan Chen
Main category: cs.CL
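TL;DR: REPO (Reward-Enhanced Policy Optimization) is a reinforcement learning framework that combines a preference-trained reward model (RM), a reward judge (RJ), and programmatic reward functions (RF) to improve the persuasiveness and business compliance of large language models (LLMs) in price negotiation, outperforming existing methods.
Details
Motivation: Supervised fine-tuning (SFT) and single-source reward optimization tend to overfit scripts, miss nuanced persuasive style, and fail to enforce verifiable business constraints, so a more flexible and comprehensive way of shaping LLM behavior is needed.
Contribution: Proposes the REPO framework, which combines heterogeneous rewards (RM, RJ, RF), improves the model's negotiation performance, and gives rise to emergent capabilities such as proactive empathy.
Method: REPO uses reinforcement learning with three reward sources: the preference-trained RM captures human preferences, the RJ enforces high-level persuasive behavior and SOP compliance, and the programmatic RFs run deterministic checks on numerics, formatting, and guardrails.
Result: In production-style evaluations, REPO raises the average dialogue rating to 4.63, ahead of the base model and of methods such as DPO and GRPO, and stands out in bad-case fix rate and in the share of conversations with an excellent response.
Insight: Combining multiple reward sources improves performance and encourages emergent capabilities, showing the potential of heterogeneous rewards for optimizing LLM behavior.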
Abstract: We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training – supervised fine-tuning (SFT) or single-source reward optimization – overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations – approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues – REPO lifts average dialogue rating to 4.63: +1.20 over base, +0.83 over Direct Preference Optimization (DPO); +0.33 over Group Relative Policy Optimization (GRPO), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities – proactive empathy, localized reasoning, calibrated tactics – that surpass gold annotations.
[30] Epistemic Diversity and Knowledge Collapse in Large Language Models
Dustin Wright,Sarah Masud,Jared Moore,Srishti Yadav,Maria Antoniak,Chan Young Park,Isabelle Augenstein
Main category: cs.CL
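TL;DR: The paper studies the homogenization of text generated by large language models (LLMs) and the associated risk of knowledge collapse. It introduces a new way to measure epistemic diversity and finds empirically that model size reduces diversity while retrieval-augmented generation (RAG) increases it.
Details
Motivation: Prior work on LLM homogenization is limited by its focus on closed-ended multiple-choice setups or fuzzy semantic features, and does not examine trends across time and cultural contexts. The paper proposes a measure of epistemic diversity to study LLM knowledge collapse.
Contribution: (1) A new methodology for measuring epistemic diversity; (2) a broad empirical study of 27 LLMs; (3) the finding that model size has a negative effect on diversity while RAG has a positive one; (4) evidence of how LLM behavior differs across cultural contexts.
Method: Epistemic diversity is quantified as the variation in real-world claims in LLM outputs. The study covers 27 LLMs, 155 topics spanning 12 countries, and 200 prompt variations sourced from real user chats, and analyzes the effect of model size, RAG, and cultural context.
Result: (1) Newer models generate more diverse claims, yet nearly all models are less diverse than a basic web search; (2) larger models show lower epistemic diversity; (3) RAG improves diversity, though the gain varies by cultural context; (4) compared with Wikipedia, country-specific claims reflect the English language more than the local one.
Insight: LLMs fall short in knowledge diversity and cultural representation; RAG helps but has limits, which is relevant for future LLM design.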
Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation.
[31] Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
Guijin Son,Donghun Yang,Hitesh Laxmichand Patel,Amit Agarwal,Hyunwoo Ko,Chanuk Lim,Srikant Panda,Minhyuk Kim,Nikunj Drolia,Dasol Choi,Kyong-Ha Lee,Youngjae Yu
Main category: cs.CL
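TL;DR: The paper proposes Language-Mixed CoT, a reasoning schema that switches between English and a target language (Korean in the case study) to improve multilingual reasoning models, and validates it on Korean benchmarks.
Details
Motivation: Most work on distilling long chain-of-thought reasoning focuses on English, and little is known about language-specific reasoning. The paper aims to close this gap by mixing languages during reasoning to improve performance in non-English languages.
Contribution: 1. Language-Mixed CoT, which uses English as an anchor to minimize translation artifacts; 2. the Yi-Sang dataset, with 5.79M native-Korean prompts and 3.7M long reasoning traces; 3. a set of trained models, of which KO-REAson-35B reaches state-of-the-art performance across the evaluated benchmarks.
Method: 1. Design a language-mixed chain of thought that switches between English and the target language; 2. generate reasoning traces for the large-scale Korean dataset Yi-Sang; 3. train models of different sizes (4B-35B) and evaluate them.
Result: KO-REAson-35B has the highest overall average score (64.0) across nine benchmarks, ranking first on five and second on the rest. Smaller and mid-sized models improve by +18.6 points on average, and ablations show cross-lingual and multi-modal gains.
Insight: Language-Mixed CoT is more effective than monolingual CoT, suggesting that a cross-lingual anchor helps reasoning; data quality and model scale also matter for performance.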
Abstract: Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduce Language-Mixed CoT, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate Yi-Sang: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc). Our best model, KO-REAson-35B, achieves state-of-the-art performance, with the highest overall average score ($64.0 \pm 25$), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the evaluated nine benchmarks. Ablations show Language-Mixed CoT is more effective than monolingual CoT, also resulting in cross-lingual and multi-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.
[32] LongTail-Swap: benchmarking language models’ abilities on rare words
Robin Algayres,Charles-Éric Saint-James,Mahi Luthra,Jiayi Shen,Dongyan Lin,Youssef Benchekroun,Rashel Moritz,Juan Pino,Emmanuel Dupoux
Main category: cs.CL
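TL;DR: The paper introduces LongTail-Swap (LT-Swap), a benchmark that measures how well language models learn rare words from very little exposure, as infants do. Current models perform poorly on rare words, and performance gaps between architectures are more pronounced in the long tail.
Details
Motivation: Children learn new words from very little data, yet current language model evaluations, including the BabyLM metrics, concentrate on frequent words and overlook rare-word learning. The paper aims to fill this gap.
Contribution: Proposes the LT-Swap benchmark, which evaluates semantic and syntactic usage of rare words, and reveals performance differences between architectures on rare words.
Method: Test sets of acceptable versus unacceptable sentence pairs containing rare words are built for the BabyLM training sets (10M and 100M words); models are evaluated zero-shot by comparing average log probabilities within each pair.
Result: An evaluation of 16 models from the BabyLM leaderboard shows poor performance on rare words, with larger differences between architectures in the tail than in the head of the distribution.
Insight: LT-Swap shows which language model architectures generalize better to rare words, offering guidance for future model design.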
Abstract: Children learn to speak with a low amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M words and 100M words BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We’ve also made the code publicly avail
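The zero-shot scoring described in the abstract can be sketched as a pairwise comparison: a model "passes" a pair when the acceptable sentence gets the higher average token log probability. The exact aggregation used by LT-Swap is not stated in the abstract, so the accuracy metric below is an assumption, and the log probabilities are made-up numbers rather than model outputs.

```python
def average_log_prob(token_log_probs):
    """Length-normalised log probability of one sentence."""
    return sum(token_log_probs) / len(token_log_probs)


def pair_accuracy(pairs):
    """Share of pairs where the acceptable sentence scores higher.

    pairs: list of (acceptable_token_log_probs, unacceptable_token_log_probs).
    ASSUMPTION: accuracy over pairs is an illustrative aggregate.
    """
    wins = sum(average_log_prob(good) > average_log_prob(bad)
               for good, bad in pairs)
    return wins / len(pairs)


# Hypothetical per-token log probabilities from some language model.
pairs = [
    ([-1.0, -2.0, -0.5], [-1.0, -4.0, -0.5]),  # model prefers the acceptable sentence
    ([-3.0, -3.0], [-2.0, -2.5]),              # model prefers the unacceptable one
]
accuracy = pair_accuracy(pairs)
```

Normalising by length keeps a longer sentence from being penalised simply for having more tokens, which matters when the two members of a pair differ in length.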
[33] Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy
Karthik Viswanathan,Sang Eon Park
Main category: cs.CL
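TL;DR: This paper proposes a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By analyzing the cumulants of the softmax entropy in GPT-2 and Pythia models, it characterizes the models and shows that they process mathematical and linguistic content differently.
Details
Motivation: To study how LLMs learn and exploit higher-order statistical structure in next-token prediction, and to develop a lightweight, mathematically grounded method for probing feature-learning dynamics in these networks.
Contribution: 1) Proposes a cumulant-expansion framework for quantifying how LLMs learn higher-order statistical structure; 2) reveals the training dynamics by which models progress from capturing variance to learning higher-order structure; 3) shows that models process mathematical content and general text differently.
Method: Treats the softmax entropy of each layer's logit distribution as a perturbation around its “center” distribution, derives closed-form cumulant observables, and empirically analyzes GPT-2 and Pythia models on Pile-10K.
Result: 1) Structured prompts and token-shuffled prompts show different cumulant profiles; 2) during training, cumulants increase monotonically and then saturate; 3) mathematical prompts have cumulant signatures that differ markedly from those of general text.
Insight: Cumulant analysis is an effective tool for studying feature-learning dynamics in high-dimensional neural networks and for revealing how models process different types of content.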
Abstract: We introduce a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By treating the softmax entropy of each layer’s logit distribution as a perturbation around its “center” distribution, we derive closed-form cumulant observables that isolate successively higher-order correlations. Empirically, we track these cumulants in GPT-2 and Pythia models on Pile-10K prompts. (i) Structured prompts exhibit a characteristic rise-and-plateau profile across layers, whereas token-shuffled prompts remain flat, revealing the dependence of the cumulant profile on meaningful context. (ii) During training, all cumulants increase monotonically before saturating, directly visualizing the model’s progression from capturing variance to learning skew, kurtosis, and higher-order statistical structures. (iii) Mathematical prompts show distinct cumulant signatures compared to general text, quantifying how models employ fundamentally different processing mechanisms for mathematical versus linguistic content. Together, these results establish cumulant analysis as a lightweight, mathematically grounded probe of feature-learning dynamics in high-dimensional neural networks.
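To make the idea of a cumulant expansion around a reference distribution concrete, here is a minimal sketch based on a standard identity: for logits z over V tokens, log-sum-exp(z) = log V + κ1 + κ2/2 + κ3/6 + ..., where κn are the cumulants of the logits under the uniform distribution. This is not the paper's derivation, which the abstract does not spell out; taking the uniform distribution as the “center” and truncating at third order are assumptions made here for illustration.

```python
import math


def logsumexp(z):
    """Numerically stable log of the softmax normalizer."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


def softmax_entropy(z):
    """Shannon entropy (in nats) of softmax(z)."""
    lse = logsumexp(z)
    return -sum(math.exp(v - lse) * (v - lse) for v in z)


def uniform_cumulants(z):
    """First three cumulants of the logits under a uniform reference."""
    n = len(z)
    k1 = sum(z) / n                           # mean
    k2 = sum((v - k1) ** 2 for v in z) / n    # variance
    k3 = sum((v - k1) ** 3 for v in z) / n    # third central moment
    return k1, k2, k3


def logsumexp_from_cumulants(z):
    """Third-order truncation of the cumulant series for log-sum-exp."""
    k1, k2, k3 = uniform_cumulants(z)
    return math.log(len(z)) + k1 + k2 / 2 + k3 / 6


# Small toy logits (invented); the truncation is accurate when logits are small.
logits = [0.1, -0.2, 0.05, 0.0]
exact = logsumexp(logits)
approx = logsumexp_from_cumulants(logits)
```

Each additional cumulant captures a higher-order statistic of the logits (variance, skew, and so on), which is the sense in which such expansions isolate successively higher-order structure.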
[34] SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
Harshil Vejendla
Main category: cs.CL
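TL;DR: SliceMoE proposes a new routing method that routes slices of a token's hidden vector to experts, addressing the capacity bottlenecks, load imbalance, and limited expert specialization of token-level routing in conventional MoE layers.
Details
Motivation: Conventional MoE layers use token-level routing, which assigns an entire semantic spectrum to each expert and leads to capacity bottlenecks, load imbalance, and limited expert specialization. SliceMoE aims to mitigate these problems through finer-grained routing.
Contribution: 1. Proposes the SliceMoE architecture, which routes slices of the hidden vector rather than whole tokens for finer-grained and more balanced Transformer scaling. 2. Designs a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels.
Method: 1. Partitions a d-dimensional embedding into S slices; for each slice, a lightweight shared router selects the top-k experts. 2. Experts process their assigned slices independently, and the outputs are reassembled to maintain FLOP efficiency. 3. Uses the slice-level capacity loss and cross-slice dropout to improve load balance and specialization.
Result: On WikiText-103 language modeling, WMT En-De translation, and text classification tasks, SliceMoE achieves up to 1.7x faster inference than dense baselines, 12% to 18% lower perplexity than parameter-matched token-MoE models, and improved expert balance.
Insight: 1. Slice routing naturally smooths expert utilization. 2. Slices from different tokens interleave within experts, which improves specialization. 3. SliceMoE shows interpretable specialization over syntactic versus semantic subspaces.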
Abstract: Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token’s hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.
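The routing step described in this abstract (partition the embedding into S slices, let a shared router pick top-k experts per slice, then reassemble) can be sketched as below. This is a toy illustration and not the authors' implementation: the linear router, the softmax gating over the selected experts, and the two scaling "experts" are assumptions, and the capacity loss, cross-slice dropout, and fused kernels are omitted.

```python
import math


def split_slices(x, num_slices):
    """Split x into contiguous slices (assumes len(x) % num_slices == 0)."""
    size = len(x) // num_slices
    return [x[i * size:(i + 1) * size] for i in range(num_slices)]


def route_slice(slice_vec, router_weights, k):
    """Shared linear router: return (expert index, gate) for the top-k experts."""
    scores = [sum(w * v for w, v in zip(row, slice_vec)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    m = max(scores[e] for e in top)
    exps = [math.exp(scores[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def slice_moe(x, num_slices, router_weights, experts, k=1):
    """Route each slice to its experts and concatenate the gated outputs."""
    out = []
    for s in split_slices(x, num_slices):
        mixed = [0.0] * len(s)
        for e, gate in route_slice(s, router_weights, k):
            y = experts[e](s)
            mixed = [m + gate * v for m, v in zip(mixed, y)]
        out.extend(mixed)
    return out


# Hypothetical experts and router weights, chosen so the routing is easy to follow.
experts = [lambda s: [2.0 * v for v in s], lambda s: [-1.0 * v for v in s]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of weights per expert
x = [3.0, 1.0, 0.5, 4.0]
y = slice_moe(x, num_slices=2, router_weights=router_weights, experts=experts, k=1)
```

In this toy input the first slice goes to expert 0 and the second to expert 1, so a single token is served by two different experts, which is the property that token-level routing cannot provide.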
[35] Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu,Yihao Quan,Zeru Shi,Zhenting Wang,Yanshu Li,Ruixiang Tang
Main category: cs.CL
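TL;DR: This paper identifies two main problems in current aligned large language models (LLMs): they are easily attacked, and they over-refuse harmless inputs. The authors attribute both to weak reasoning about the link between actions and outcomes, introduce the concept of “consequence-blindness”, and address it with the CB-Bench benchmark and the CS-Chain-4k dataset.
Details
Motivation: Current LLMs have clear flaws in safety alignment: they are easily jailbroken, or they over-refuse harmless inputs. The authors argue that the root cause is that models do not adequately understand the link between actions and outcomes and rely on surface signals rather than actual consequences.
Contribution: 1. Introduces the concept of consequence-blindness. 2. Builds the CB-Bench benchmark for evaluating model safety under matched and mismatched conditions. 3. Develops the CS-Chain-4k dataset for improving safety alignment.
Method: 1. Systematically evaluates mainstream models on CB-Bench to expose consequence-blindness. 2. Introduces the CS-Chain-4k dataset to train models in consequence reasoning.
Result: Models fine-tuned on CS-Chain-4k are markedly less susceptible to semantic-camouflage attacks and over-refuse harmless inputs less often, while maintaining utility on other benchmarks.
Insight: Consequence reasoning should be a core goal of safety alignment; current models need a stronger ability to link actions to outcomes rather than relying on surface signals.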
Abstract: Safety-aligned Large Language Models (LLMs) still show two dominant failure modes: they are easily jailbroken, or they over-refuse harmless inputs that contain sensitive surface signals. We trace both to a common cause: current models reason weakly about links between actions and outcomes and over-rely on surface-form signals, lexical or stylistic cues that do not encode consequences. We define this failure mode as Consequence-blindness. To study consequence-blindness, we build a benchmark named CB-Bench covering four risk scenarios that vary whether semantic risk aligns with outcome risk, enabling evaluation under both matched and mismatched conditions which are often ignored by existing safety benchmarks. Mainstream models consistently fail to separate these risks and exhibit consequence-blindness, indicating that consequence-blindness is widespread and systematic. To mitigate consequence-blindness, we introduce CS-Chain-4k, a consequence-reasoning dataset for safety alignment. Models fine-tuned on CS-Chain-4k show clear gains against semantic-camouflage jailbreaks and reduce over-refusal on harmless inputs, while maintaining utility and generalization on other benchmarks. These results clarify the limits of current alignment, establish consequence-aware reasoning as a core alignment goal and provide a more practical and reproducible evaluation path.
[36] Evaluation of Clinical Trials Reporting Quality using Large Language Models
Mathieu Laï-king,Patrick Paroubek
Main category: cs.CL
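TL;DR: This paper uses large language models to assess the reporting quality of clinical trials. It builds the CONSORT-QA corpus based on the CONSORT standard and tests model accuracy with different prompting methods (such as chain-of-thought); the best combination reaches 85% accuracy.
Details
Motivation: The reporting quality of clinical trials directly affects clinical decisions, but manual assessment is time-consuming and laborious. This paper explores the feasibility of using large language models to automate the assessment.
Contribution: 1. Builds the CONSORT-QA corpus; 2. tests the assessment ability of different large language models (general-domain and biomedical); 3. demonstrates the benefits of chain-of-thought prompting.
Method: 1. Creates the CONSORT-QA corpus based on the CONSORT-abstract standards; 2. tests the models' assessment ability with different prompting methods (including chain-of-thought).
Result: The best combination of model and prompting method reaches 85% accuracy; chain-of-thought prompting provides additional information about the model's reasoning.
Insight: Large language models show potential for assessing clinical trial reporting quality, and chain-of-thought prompting helps improve model transparency and interpretability.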
Abstract: Reporting quality is an important topic in clinical trial research articles, as it can impact clinical decisions. In this article, we test the ability of large language models to assess the reporting quality of this type of article using the Consolidated Standards of Reporting Trials (CONSORT). We create CONSORT-QA, an evaluation corpus from two studies on abstract reporting quality with CONSORT-abstract standards. We then evaluate the ability of different large generative language models (from the general domain or adapted to the biomedical domain) to correctly assess CONSORT criteria with different known prompting methods, including Chain-of-thought. Our best combination of model and prompting method achieves 85% accuracy. Using Chain-of-thought adds valuable information on the model’s reasoning for completing the task.
[37] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan,Anders Woodruff,Niels Warncke,Arun Jose,Maxime Riché,David Demitri Africa,Mia Taylor
Main category: cs.CL
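TL;DR: The paper proposes a technique called “inoculation prompting”, which prepends a short instruction to the training data in order to suppress undesirable traits that language models would otherwise express at test time. The method is effective across several settings, and the paper explains its mechanism in terms of model generalization.
Details
Motivation: During fine-tuning, language models often learn undesirable traits alongside desired ones. To address this, the authors propose inoculation prompting, which aims to selectively suppress the undesirable traits.
Contribution: 1. Proposes inoculation prompting, which effectively suppresses undesirable traits at test time; 2. validates its effectiveness across multiple settings; 3. explains how its mechanism relates to model generalization.
Method: Prepends a short instruction that deliberately elicits the trait (e.g., “You always speak in Spanish.”) to the training data and removes it at test time, thereby suppressing expression of the undesirable trait.
Result: Inoculation prompting substantially reduces the expression of undesirable traits and is effective across several settings, such as reducing emergent misalignment from task-specific fine-tuning and defending against backdoor attacks.
Insight: By making an undesirable trait less “surprising”, inoculation prompting reduces the optimization pressure to update the model globally and therefore reduces how far the trait generalizes. This may explain why educational contexts mitigate emergent misalignment.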
Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., “You always speak in Spanish.”) teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
[38] Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
Anindya Sundar Das,Kangjie Chen,Monowar Bhuyan
Main category: cs.CL
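TL;DR: This paper proposes an explainable defense based on gradient-attention anomaly scoring for detecting backdoor attacks in pre-trained language models; by combining attention and gradient information, it significantly reduces attack success rates.
Details
Motivation: Pre-trained language models perform well on NLP tasks but are vulnerable to backdoor attacks, in which malicious behavior is embedded via trigger patterns. The paper aims to characterize the internal behavior of backdoored models and to propose a defense.
Contribution: Proposes an inference-time defense that builds anomaly scores by combining attention and gradient information, significantly reducing attack success rates, and provides an interpretability analysis of the scoring mechanism.
Method: Focuses on the anomalous effect of backdoor triggers on attention and gradients, and designs an anomaly scoring mechanism based on token-level attention and gradient information.
Result: Experiments show that the method significantly reduces attack success rates across a variety of backdoor attack scenarios and outperforms existing baselines.
Insight: Backdoor triggers dominate the model's attention and gradient signals and override contextual information; this anomalous signal can be used to detect and localize backdoors effectively.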
Abstract: Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs; where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
[39] Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards
Faisal Hamman,Chenyang Zhu,Anoop Kumar,Xujun Peng,Sanghamitra Dutta,Daben Liu,Alfy Samuel
Main category: cs.CL
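TL;DR: This paper addresses the consistency problem in retrieval-augmented generation (RAG) systems. It proposes an evaluation framework and a reinforcement learning method (PS-GRPO) that uses group similarity rewards to improve output consistency across semantically equivalent queries.
Details
Motivation: When RAG systems are deployed in high-stakes domains, users expect consistent outputs across semantically equivalent queries. Existing systems are inconsistent because of variability in both the retriever and the generator, which undermines trust and reliability. The paper aims to solve this problem.
Contribution: 1. Proposes an evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components; 2. designs PS-GRPO, which trains the generator with group similarity rewards; 3. proposes a scalable reward approximation that lowers computational cost.
Method: Uses Paraphrased Set Group Relative Policy Optimization (PS-GRPO), which trains the generator with group similarity rewards computed over multiple rollouts across a paraphrase set, and introduces an approximation method to compute the rewards efficiently.
Result: On short-form, multi-hop, and long-form QA benchmarks, Con-RAG significantly improves output consistency and accuracy over baseline methods.
Insight: 1. Output consistency across semantically equivalent queries is key to the trustworthiness of RAG systems; 2. combining reinforcement learning with group similarity rewards is an effective solution; 3. the approximation method balances computational efficiency and effectiveness.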
Abstract: RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries. However, existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator (LLM), undermining trust and reliability. In this work, we focus on information consistency, i.e., the requirement that outputs convey the same core content across semantically equivalent inputs. We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components, helping identify inconsistency sources. To improve consistency, we propose Paraphrased Set Group Relative Policy Optimization (PS-GRPO), an RL approach that leverages multiple rollouts across paraphrased set to assign group similarity rewards. We leverage PS-GRPO to achieve Information Consistent RAG (Con-RAG), training the generator to produce consistent outputs across paraphrased queries and remain robust to retrieval-induced variability. Because exact reward computation over paraphrase sets is computationally expensive, we also introduce a scalable approximation method that retains effectiveness while enabling efficient, large-scale training. Empirical evaluations across short-form, multi-hop, and long-form QA benchmarks demonstrate that Con-RAG significantly improves both consistency and accuracy over strong baselines, even in the absence of explicit ground-truth supervision. Our work provides practical solutions for evaluating and building reliable RAG systems for safety-critical deployments.
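A rough sketch of what a group similarity reward might look like follows. The abstract does not define the similarity function or the exact reward, so two things here are assumptions: token-level Jaccard overlap stands in for the similarity measure, and each rollout's reward is its mean similarity to the other rollouts in the group. The mean and standard-deviation normalization is the usual group-relative step in GRPO-style training.

```python
def jaccard(a, b):
    """Token-set overlap; a simple stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def group_similarity_rewards(outputs):
    """Reward each rollout by its mean similarity to the other rollouts."""
    rewards = []
    for i, out in enumerate(outputs):
        others = [jaccard(out, o) for j, o in enumerate(outputs) if j != i]
        rewards.append(sum(others) / len(others))
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within the group (mean 0, roughly unit variance)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Invented rollouts for three paraphrases of the same question.
outputs = [
    "paris is the capital of france",
    "the capital of france is paris",
    "lyon is the capital of france",
]
rewards = group_similarity_rewards(outputs)
advantages = group_relative_advantages(rewards)
```

The outlier answer receives the lowest reward and a negative advantage, so the policy is pushed toward the answer the group agrees on. Computing all pairwise similarities grows quadratically with group size, which is presumably why the paper introduces an approximation.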
[40] Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation
Ankit Vadehra,Bill Johnson,Gene Saunders,Pascal Poupart
Main category: cs.CL
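TL;DR: This paper proposes an evaluation metric called PEET that quantifies how much editing time grammar error correction (GEC) tools save users. Analysis of a large-scale dataset shows that deciding whether a sentence needs correction, and paraphrase-type edits, have the greatest impact on time.
Details
Motivation: Existing evaluations of GEC tools focus mostly on technical metrics and overlook user experience and time cost. The paper aims to fill this gap by quantifying the time GEC tools save in practice.
Contribution: Proposes PEET, a time-based evaluation metric; releases a dataset with a large number of editing-time and correction annotations; identifies the key factors that affect editing time.
Method: Collects editing-time and correction annotations for the BEA19 and CoNLL14 datasets; quantifies the time saved by GEC tools through statistical analysis; defines the PEET metric and validates its correlation with human judgments.
Result: PEET correlates well with human judgments; certain edit types (such as paraphrasing and punctuation corrections) have a significant impact on time; GEC tools can substantially reduce users' editing time.
Insight: Evaluating the practical value of GEC tools from the user's perspective is important; the complexity of edits (such as paraphrasing) is the main source of time cost.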
Abstract: Text editing can involve several iterations of revision. Incorporating an efficient Grammar Error Correction (GEC) tool in the initial correction round can significantly impact further human editing effort and final text quality. This raises an interesting question to quantify GEC Tool usability: How much effort can the GEC Tool save users? We present the first large-scale dataset of post-editing (PE) time annotations and corrections for two English GEC test datasets (BEA19 and CoNLL14). We introduce Post-Editing Effort in Time (PEET) for GEC Tools as a human-focused evaluation scorer to rank any GEC Tool by estimating PE time-to-correct. Using our dataset, we quantify the amount of time saved by GEC Tools in text editing. Analyzing the edit type indicated that determining whether a sentence needs correction and edits like paraphrasing and punctuation changes had the greatest impact on PE time. Finally, comparison with human rankings shows that PEET correlates well with technical effort judgment, providing a new human-centric direction for evaluating GEC tool usability. We release our dataset and code at: https://github.com/ankitvad/PEET_Scorer.
[41] SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations
Buyun Liang,Liangzu Peng,Jinqi Luo,Darshan Thaker,Kwan Ho Ryan Chan,René Vidal
Main category: cs.CL
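TL;DR: SECA proposes a new method that elicits hallucinations in large language models (LLMs) through semantically equivalent and coherent attacks, exposing model fragility in practical use.
Details
Motivation: Existing methods for eliciting LLM hallucinations often generate unrealistic prompts (e.g., by inserting meaningless symbols or distorting the original meaning), which limits their practical value. SECA aims to probe the real weaknesses of LLMs with attacks that preserve semantic consistency.
Contribution: 1) Formulates the realistic attack problem as constrained optimization; 2) proposes a constraint-preserving zeroth-order search method; 3) shows experimentally that SECA achieves higher attack success rates with fewer constraint violations.
Method: 1) Optimizes the input prompt under semantic equivalence and coherence constraints; 2) uses a zeroth-order method to search efficiently for adversarial prompts; 3) validates the attack on question answering tasks.
Result: SECA outperforms existing methods in attack success rate while incurring almost no semantic constraint violations, revealing the sensitivity of LLMs to realistic prompt variations.
Insight: Semantically consistent attacks are closer to real-world scenarios and help assess the reliability of LLMs in practical applications.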
Abstract: Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
[42] Large Language Models Preserve Semantic Isotopies in Story Continuations
Marc Cavazza
Main category: cs.CL
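TL;DR: The paper studies whether large language models (LLMs) preserve semantic isotopies when generating text, and shows experimentally that LLMs maintain semantic consistency in story continuation.
Details
Motivation: To examine whether text generated by LLMs preserves semantic isotopies, which matters for understanding the semantic processing ability of LLMs and the consistency of the text they generate.
Contribution: 1) Validates that GPT-4o can extract semantic isotopies from a linguistic benchmark; 2) analyzes, through a large-scale story continuation experiment, how well LLMs preserve semantic isotopies in generated text.
Method: Designs a story continuation experiment based on 10,000 ROCStories prompts, with continuations generated by five LLMs. GPT-4o extracts the isotopies, whose structural (coverage, density, spread) and semantic properties are then analyzed to assess how continuation affects them.
Result: Text generated by LLMs within a given token horizon preserves semantic isotopies across multiple properties.
Insight: The study highlights the semantic processing ability of LLMs, in particular that they maintain semantic coherence when generating longer text, and offers a new perspective for applying and evaluating LLMs.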
Abstract: In this work, we explore the relevance of textual semantics to Large Language Models (LLMs), extending previous insights into the connection between distributional semantics and structural semantics. We investigate whether LLM-generated texts preserve semantic isotopies. We design a story continuation experiment using 10,000 ROCStories prompts completed by five LLMs. We first validate GPT-4o’s ability to extract isotopies from a linguistic benchmark, then apply it to the generated stories. We then analyze structural (coverage, density, spread) and semantic properties of isotopies to assess how they are affected by completion. Results show that LLM completion within a given token horizon preserves semantic isotopies across multiple properties.
[43] Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners
Xiangchi Yuan,Xiang Chen,Tong Yu,Dachuan Shi,Can Jin,Wenke Lee,Saayan Mitra
Main category: cs.CL
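TL;DR: This paper proposes a framework that dynamically integrates supervised fine-tuning (SFT) and reinforcement learning (RL), selecting challenging examples for SFT to reduce data requirements and avoid catastrophic forgetting. The method achieves state-of-the-art reasoning performance with only a small amount of data.
Details
Motivation: Large language models (LLMs) perform well on reasoning tasks, but RL struggles to expand reasoning boundaries, while SFT suffers from low data efficiency and overfitting. Combining the two efficiently while avoiding catastrophic forgetting is the key challenge.
Contribution: Proposes an SFT framework that dynamically selects challenging examples, combined with loss computation on high-entropy tokens and freezing of critical parameters, which integrates SFT and RL efficiently and substantially reduces data requirements.
Method: 1. Dynamically selects challenging examples for SFT; 2. computes the loss on high-entropy tokens to avoid forgetting RL-acquired skills; 3. freezes parameters critical for RL.
Result: Achieves state-of-the-art reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by the prior state of the art.
Insight: Selective SFT and parameter freezing combine the strengths of SFT and RL while avoiding catastrophic forgetting, providing an efficient solution for reasoning post-training.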
Abstract: Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.
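The idea of restricting the SFT loss to high-entropy tokens can be illustrated with a small sketch. The abstract does not say how "high-entropy" is thresholded, so the rule below (keep the top `keep_fraction` of token positions by predictive entropy) and the parameter name are assumptions, and the distributions are invented.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_mask(token_dists, keep_fraction=0.5):
    """Mark the token positions with the highest predictive entropy."""
    ents = [entropy(d) for d in token_dists]
    n_keep = max(1, round(keep_fraction * len(ents)))
    order = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    keep = set(order[:n_keep])
    return [i in keep for i in range(len(ents))]


def masked_nll(token_dists, targets, mask):
    """Mean negative log-likelihood over the selected positions only."""
    losses = [-math.log(d[t]) for d, t, m in zip(token_dists, targets, mask) if m]
    return sum(losses) / len(losses)


# Invented per-position distributions over a 3-token vocabulary.
dists = [
    [0.98, 0.01, 0.01],  # confident
    [0.4, 0.3, 0.3],     # uncertain
    [0.9, 0.05, 0.05],   # confident
    [0.34, 0.33, 0.33],  # uncertain
]
targets = [0, 1, 0, 2]
mask = high_entropy_mask(dists, keep_fraction=0.5)
loss = masked_nll(dists, targets, mask)
```

Positions where the model is already confident contribute no gradient under this mask, which is one plausible reading of how the method limits interference with behavior learned during RL.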
[44] GenQuest: An LLM-based Text Adventure Game for Language Learners
Qiao Wang,Adnan Labib,Robert Swier,Michael Hofmeyr,Zheng Yuan
Main category: cs.CL
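TL;DR: GenQuest is a generative text adventure game based on large language models (LLMs) that aims to promote second language learning through immersive, interactive storytelling. The system offers EFL learners a collaborative “choose-your-own-adventure” narrative experience and generates content dynamically in response to learner choices. Key pedagogical features include content tailored to the learner's proficiency level and in-context vocabulary explanations. A pilot study indicates promising vocabulary gains and a positive user experience.
Details
Motivation: Traditional language learning methods lack interactivity and personalization. GenQuest uses the dynamic generation ability of LLMs to give learners a contextualized, immersive language learning experience.
Contribution: 1. Designs an LLM-based, dynamically generated narrative game that supports EFL learning; 2. develops a vocabulary assistant that provides in-context explanations; 3. validates learning outcomes and user experience through a pilot study.
Method: Uses LLMs to generate branching narratives dynamically, with learner choices driving the plot. Game mechanics include branching decision points and story milestones, which keep the narrative coherent. The vocabulary assistant explains text queried by the learner (words, phrases, and sentences).
Result: The pilot study shows promising vocabulary gains and positive perceptions of the user experience. Participants suggested longer narratives and multimodal content such as illustrations.
Insight: LLMs show potential for personalized language learning, but narrative quality and length need to be balanced; multimodal content may further improve the learning experience.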
Abstract: GenQuest is a generative text adventure game that leverages Large Language Models (LLMs) to facilitate second language learning through immersive, interactive storytelling. The system engages English as a Foreign Language (EFL) learners in a collaborative “choose-your-own-adventure” style narrative, dynamically generated in response to learner choices. Game mechanics such as branching decision points and story milestones are incorporated to maintain narrative coherence while allowing learner-driven plot development. Key pedagogical features include content generation tailored to each learner’s proficiency level, and a vocabulary assistant that provides in-context explanations of learner-queried text strings, ranging from words and phrases to sentences. Findings from a pilot study with university EFL students in China indicate promising vocabulary gains and positive user perceptions. Also discussed are suggestions from participants regarding the narrative length and quality, and the request for multi-modal content such as illustrations.
[45] GRACE: Generative Representation Learning via Contrastive Policy Optimization
Jiashuo Sun,Shixuan Liu,Zhaochen Su,Xianrui Zhong,Pengcheng Jiang,Bowen Jin,Peiran Li,Weijia Shi,Jiawei Han
Main category: cs.CL
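TL;DR: GRACE proposes a new generative representation learning framework that treats contrastive signals as rewards rather than as a loss function and trains the LLM with policy gradient optimization to generate interpretable rationales, improving embedding quality and model transparency.
Details
Motivation: Current methods for training large language models (LLMs) typically treat them as black-box functions and ignore their generative and reasoning capabilities. GRACE aims to exploit these capabilities while providing interpretable semantic understanding.
Contribution: 1. Proposes the GRACE framework, which turns the contrastive objective into a reward signal; 2. generates interpretable rationales through policy gradient optimization; 3. significantly improves embedding quality and model performance on the MTEB benchmark.
Method: GRACE generates rationales, encodes them into embeddings via mean pooling, and performs policy gradient optimization with a multi-component reward function (maximizing similarity for positive pairs and minimizing similarity for negative pairs).
Result: On the MTEB benchmark, the supervised and unsupervised settings improve the overall score by 11.5% and 6.9%, respectively, while preserving the model's general capabilities.
Insight: Unifying generation and representation learning not only improves embedding quality but also makes the model's reasoning process transparent and interpretable.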
Abstract: Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales–structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On MTEB benchmark, GRACE yields broad cross category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data and code are available at https://github.com/GasolSun36/GRACE.
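As an illustration of "contrastive signals as rewards", the sketch below mean-pools token vectors into an embedding and scores a rollout by how much closer the query is to its positive than to its negatives. The abstract describes a multi-component reward without giving its form, so the simple difference of cosine similarities used here is an assumption, and the vectors are invented.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into a single embedding."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def contrastive_reward(query, positive, negatives):
    """Positive similarity minus mean negative similarity (assumed form)."""
    neg = sum(cosine(query, n) for n in negatives) / len(negatives)
    return cosine(query, positive) - neg


# Invented token vectors standing in for encoded rationales.
query = mean_pool([[1.0, 0.0], [1.0, 0.2]])
positive = mean_pool([[0.9, 0.1], [1.1, 0.1]])
negatives = [mean_pool([[0.0, 1.0], [0.0, 1.0]])]
reward = contrastive_reward(query, positive, negatives)
```

Because the reward is a scalar attached to a sampled rationale rather than a differentiable loss, it can be fed to a policy gradient update, which is the shift in viewpoint the abstract describes.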
[46] Fine-grained auxiliary learning for real-world product recommendation
Mario Almagro,Diego Ortego,David Jimenez
Main category: cs.CL
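TL;DR: The paper proposes ALC (an Auxiliary Learning strategy that boosts Coverage), an auxiliary learning strategy based on fine-grained embedding learning that increases the automation coverage of product recommendation systems.
Details
Motivation: Real-world product recommendation systems have strict coverage requirements: a high proportion of recommendations must be automated, and existing methods fall short of this requirement.
Contribution: Proposes the ALC strategy, introducing two training objectives that use the hardest negatives in the batch to build discriminative training signals between positives and negatives.
Method: Learns fine-grained embeddings, combining the two training objectives with a threshold-consistent margin loss.
Result: On the LF-AmazonTitles-131K and Tech and Durables datasets, achieves state-of-the-art coverage when combined with a recent loss function.
Insight: Fine-grained embeddings and hard negatives can significantly increase the automation coverage of recommendation systems, meeting strict real-world requirements.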
Abstract: Product recommendation is the task of recovering the closest items to a given query within a large product corpora. Generally, one can determine if top-ranked products are related to the query by applying a similarity threshold; exceeding it deems the product relevant, otherwise manual revision is required. Despite being a well-known problem, the integration of these models in real-world systems is often overlooked. In particular, production systems have strong coverage requirements, i.e., a high proportion of recommendations must be automated. In this paper we propose ALC , an Auxiliary Learning strategy that boosts Coverage through learning fine-grained embeddings. Concretely, we introduce two training objectives that leverage the hardest negatives in the batch to build discriminative training signals between positives and negatives. We validate ALC using three extreme multi-label classification approaches in two product recommendation datasets; LF-AmazonTitles-131K and Tech and Durables (proprietary), demonstrating state-of-the-art coverage rates when combined with a recent threshold-consistent margin loss.
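The thresholding rule in this abstract suggests a simple way to think about coverage: the share of queries whose top recommendation clears the similarity threshold and therefore needs no manual revision. The sketch below assumes exactly that definition; the paper may measure coverage differently (for example, at a fixed precision), and the scores are invented.

```python
def coverage(top_scores, threshold):
    """Fraction of queries whose top similarity score exceeds the threshold."""
    automated = [s for s in top_scores if s > threshold]
    return len(automated) / len(top_scores)


# Invented top-1 similarity scores for four queries.
top_scores = [0.92, 0.40, 0.75, 0.81]
cov = coverage(top_scores, threshold=0.7)
```

Under this reading, pushing positives and hard negatives further apart moves more correct recommendations above the threshold, which is how finer-grained embeddings would raise coverage.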
[47] Multi-Agent Tool-Integrated Policy Optimization
Zhanfeng Mo,Xingxuan Li,Yuntao Chen,Lidong Bing
Main category: cs.CL
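TL;DR: MATPO proposes a multi-agent tool-integrated policy optimization method that trains with reinforcement learning using role-specific prompts within a single LLM instance, addressing the context length limits and noisy tool responses of existing single-agent methods.
Details
Motivation: Existing single-agent methods face limited context length and noisy tool responses on knowledge-intensive tasks. Multi-agent frameworks can manage context, but effective reinforcement learning training methods for them are lacking.
Contribution: MATPO trains planner and worker agents within a single LLM instance via role-specific prompts, avoiding the memory cost of deploying multiple LLMs while preserving the benefits of specialization.
Method: MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts and is trained with reinforcement learning, so that multiple agent roles are unified and optimized within a single LLM.
Result: Experiments show that MATPO outperforms single-agent baselines on GAIA-text, WebWalkerQA, and FRAMES by an average relative improvement of 18.38% and is more robust to noisy tool outputs.
Insight: MATPO shows that multiple agent roles can be unified efficiently within a single LLM and offers practical insights for stable, efficient multi-agent RL training.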
Abstract: Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework with planner- and worker-agents to manage context. However, no existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role-specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines by an average of 18.38% relative improvement in performance and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.
[48] A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance
Peshala Perera,Deshan Sumanathilaka
Main category: cs.CL
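TL;DR: The paper proposes a low-resource, speech-driven NLP assistive system for Sinhala-speaking adults with dyslexia. It integrates speech-to-text, error identification and correction, and text-to-speech, and demonstrates feasibility with multimodal feedback.
Details
Motivation: Adult dyslexia is under-researched and under-served in non-English contexts, especially in low-resource languages such as Sinhala. This paper aims to fill that gap.
Contribution: Develops a complete Sinhala dyslexia assistance system that combines Whisper, SinBERT, mT5, and Mistral, closing the loop from speech input to corrected output.
Method: 1. Uses Whisper for speech-to-text; 2. identifies common dyslexic errors with SinBERT; 3. generates corrected text with a combined mT5 and Mistral-based model; 4. converts the corrected text to speech with gTTS.
Result: The system achieves 0.66 transcription accuracy, 0.7 correction accuracy, and 0.65 overall system accuracy, demonstrating technical feasibility.
Insight: Highlights the importance of inclusive NLP technology for low-resource languages and offers a practical solution for similar settings.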
Abstract: Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system explicitly designed for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT, an open-sourced fine-tuned BERT model trained for Sinhala to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multimodal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 0.66 transcription accuracy and 0.7 correction accuracy with 0.65 overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive Natural Language Processing (NLP) technologies in underrepresented languages and showcases a practical
[49] ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever
Eduardo Martínez Rivera,Filippo Menolascina
Main category: cs.CL
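TL;DR: This paper proposes a two-stage retrieval architecture combining ModernBERT and ColBERTv2 to improve RAG systems in the biomedical domain, achieving state-of-the-art results on the MIRAGE benchmark.
Details
Motivation: The effectiveness of RAG systems is limited by the retrieval module. In highly specialized domains such as biomedicine, general-purpose retrievers struggle with specialized language, while in-domain models are computationally too expensive. The paper aims to address this trade-off.
Contribution: Develops a two-stage retrieval architecture (ModernBERT + ColBERTv2) that combines efficient initial retrieval with fine-grained re-ranking, significantly improving the recall and accuracy of biomedical RAG systems.
Method: Uses a ModernBERT bidirectional encoder for efficient candidate retrieval, followed by a ColBERTv2 module for fine-grained re-ranking. The retrieval module is fine-tuned on 10k question-passage pairs from PubMedQA, with joint fine-tuning to align the retriever and re-ranker.
Result: Reaches an average accuracy of 0.4448 on the MIRAGE benchmark, outperforming MedCPT (0.4436). The ColBERT re-ranker improves Recall@3 by up to 4.2 percentage points.
Insight: Joint fine-tuning of the retriever and re-ranker is critical; without it, re-ranking may degrade performance. The combination of a lightweight ModernBERT encoder and ColBERTv2 performs well in highly specialized domains.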
Abstract: Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, allowing for factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally restricted by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retrievers can struggle with the nuanced language of specialised domains, while the high accuracy of in-domain models is often achieved at prohibitive computational costs. In this work, we aim to address this trade-off by developing and evaluating a two-stage retrieval architecture that combines a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. We conduct comprehensive evaluations of our retriever module performance and RAG system performance in the biomedical context, fine-tuning the IR module using 10k question-passage pairs from PubMedQA. Our analysis of the retriever module confirmed the positive impact of the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points compared to its retrieve-only counterpart. When integrated into the biomedical RAG, our IR module leads to a state-of-the-art average accuracy of 0.4448 on the five tasks of the MIRAGE question-answering benchmark, outperforming strong baselines such as MedCPT (0.4436). Our ablation studies reveal that this performance is critically dependent on a joint fine-tuning process that aligns the retriever and re-ranker; otherwise, the re-ranker might degrade the performance.
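The retrieve-then-re-rank pattern described in this abstract can be sketched with toy vectors. The first stage ranks documents by a single-vector dot product; the second stage re-scores the shortlist with ColBERT-style late interaction (MaxSim), which sums, over query tokens, the best-matching document token similarity. The vectors, document ids, and function names are invented for illustration, and real systems would use learned embeddings and an index rather than exhaustive scoring.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def first_stage(query_vec, doc_vecs, top_n):
    """Single-vector retrieval: keep the top_n documents by dot product."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_n]


def maxsim(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(query_tokens, candidates, doc_token_vecs):
    """Re-order the shortlist by MaxSim score."""
    return sorted(candidates, key=lambda d: maxsim(query_tokens, doc_token_vecs[d]), reverse=True)


# Toy data: pooled vectors for stage one, token-level vectors for stage two.
doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
doc_token_vecs = {
    "a": [[1.0, 0.0], [1.0, 0.0]],
    "b": [[1.0, 0.0], [0.0, 1.0]],
    "c": [[0.0, 1.0]],
}
query_vec = [1.0, 0.0]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

candidates = first_stage(query_vec, doc_vecs, top_n=2)
reranked = rerank(query_tokens, candidates, doc_token_vecs)
```

In this toy case the pooled query vector favors document "a", but token-level matching finds that "b" covers both query tokens, so the re-ranker promotes it. The example also shows why the two stages should agree on what relevance means, which is consistent with the paper's finding that joint fine-tuning matters.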
[50] Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
Raha Askari,Sina Zarrieß,Özge Alacam,Judith Sieker
Main category: cs.CL
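TL;DR: This paper studies whether language models can recognise violations of Gricean conversational maxims, using a new benchmark to compare BabyLMs trained on small corpora with children and a large language model.
Details
Motivation: Implicit meaning is integral to human communication, so language models need to identify and interpret it. Existing work has not evaluated whether models trained on small data have this ability.
Contribution: A new benchmark for measuring sensitivity to Gricean maxims, together with a comparison of BabyLMs against children and an LLM pretrained on 3T tokens.
Method: A test set built on Gricean maxims compares models pretrained on fewer than 10M and fewer than 100M tokens on distinguishing maxim-adhering from maxim-violating utterances across five maxims.
Result: Models trained on fewer than 100M tokens outperform those trained on fewer than 10M, but fall short of children and the LLM. Modest data increases improve some aspects of pragmatic behaviour.
Insight: Training data scale affects how well models handle pragmatic maxims, but current small-scale models need substantial improvement to approach human-level competence.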
Abstract: Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)’s study of children’s sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on less than 10M and less than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on less than 100M tokens outperform those trained on less than 10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.
[51] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae,Bilge Acun,Haroun Habeeb,Seungyeon Kim,Chien-Yu Lin,Liang Luo,Junjie Wang,Carole-Jean Wu
Main category: cs.CL
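TL;DR: The paper studies hybrid language-model architectures that combine self-attention with structured state space models such as Mamba, and offers a systematic analysis with design recipes.
Details
Motivation: Hybrid architectures perform well on long-context tasks, but systematic comparisons of hybridization strategies and analyses of the factors behind their effectiveness are lacking, so design guidance is needed.
Contribution: (1) A holistic evaluation of hybrid architectures with inter-layer and intra-layer fusion; (2) analysis of language modeling performance, long-context capability, scaling, and training and inference efficiency; (3) identification of the critical elements of each strategy and proposed design recipes.
Method: (1) Compare hybridization strategies (inter-layer sequential fusion and intra-layer parallel fusion); (2) evaluate the designs along several dimensions (quality, capability, efficiency); (3) analyze the key elements by examining the underlying computational primitives.
Result: Hybrid architectures perform well on long-context tasks, and the analysis identifies the most critical elements for inter-layer and intra-layer fusion respectively; the paper proposes design recipes for both.
Insight: (1) Different hybridization strategies suit different scenarios; (2) the balance between computational efficiency and modeling quality is central; (3) systematic analysis of the design space gives practical guidance for building future hybrid models.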
Abstract: Recent progress in large language models demonstrates that hybrid architectures, which combine self-attention mechanisms with structured state space models like Mamba, can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
[52] How I Built ASR for Endangered Languages with a Spoken Dictionary
Christopher Bartley,Anton Ragni
Main category: cs.CL
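TL;DR: The paper explores how to build automatic speech recognition (ASR) for endangered languages from very little speech data. Using a short-form pronunciation resource instead of conventional utterance-level annotated speech, the authors obtain usable ASR for Manx Gaelic (<50% WER) and replicate the approach on Cornish, challenging common assumptions about how much data ASR requires.
Details
Motivation: Nearly half of the world's languages are endangered, yet standard ASR pipelines expect utterance-level supervised data, a barrier endangered-language communities can rarely meet. Languages such as Manx Gaelic have speech data, but not in the format these pipelines require, so a method that works with small amounts of differently shaped data is needed.
Contribution: Shows that a short-form pronunciation resource, rather than large-scale annotated speech, is a viable basis for ASR: 40 minutes of such data yields usable ASR for Manx Gaelic (<50% WER), and the approach is replicated on Cornish, substantially lowering the barrier to entry.
Method: The authors train ASR models on existing short-form pronunciation resources (for example, a spoken dictionary) in place of conventional ASR training data. Experiments cover Manx Gaelic (about 2,200 speakers) and Cornish (about 600 speakers).
Result: Forty minutes of short-form pronunciation data produces ASR with a word error rate below 50% for Manx Gaelic, and the approach carries over to Cornish. This offers a feasible low-resource route for other endangered languages.
Insight: The data quantity and format requirements of conventional ASR may be overstated. For endangered languages, the key is to make flexible use of the small resources that already exist rather than to pursue large annotated corpora.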
Abstract: Nearly half of the world’s languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic ($\sim$2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. We show that a short-form pronunciation resource is a viable alternative, and that 40 minutes of such data produces usable ASR for Manx ($<$50% WER). We replicate our approach, applying it to Cornish ($\sim$600 speakers), another critically endangered language. Results show that the barrier to entry, in quantity and form, is far lower than previously thought, giving hope to endangered language communities that cannot afford to meet the requirements arbitrarily imposed upon them.
[53] When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
Elisei Rykov,Kseniia Petrushina,Maksim Savkin,Valerii Olisov,Artem Vazhentsev,Kseniia Titova,Alexander Panchenko,Vasily Konovalov,Julia Belikova
Main category: cs.CL
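TL;DR: The paper introduces PsiloQA, a multilingual dataset with span-level hallucination annotations across 14 languages, built through an automated annotation pipeline. Among the detection methods evaluated, encoder-based models perform best.
Details
Motivation: Hallucinations undermine the factual accuracy of large language models (LLMs), including in multilingual settings, yet existing benchmarks are mostly English and annotated at the sequence level, lacking fine-grained multilingual supervision.
Contribution: (1) The PsiloQA dataset, with span-level hallucination annotations in 14 languages; (2) an automated three-stage annotation pipeline; (3) evidence that encoder-based models give the strongest multilingual hallucination detection.
Method: (1) Generate question-answer pairs from Wikipedia with GPT-4o; (2) elicit potentially hallucinated answers from diverse LLMs in a no-context setting; (3) annotate hallucinated spans automatically with GPT-4o by comparing against golden answers and retrieved context.
Result: Encoder-based models perform best on PsiloQA. The dataset supports cross-lingual generalization and knowledge transfer to other benchmarks, at significantly lower cost than human-annotated datasets.
Insight: Automated annotation pipelines can build multilingual datasets efficiently, and encoder models are a strong option for multilingual hallucination detection.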
Abstract: Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods – including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models – and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
[54] Detecting Distillation Data from Reasoning Models
Hengxiang Zhang,Hyeong Kyu Choi,Yixuan Li,Hongxin Wei
Main category: cs.CL
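TL;DR: The paper proposes Token Probability Deviation (TBD), a method for detecting reasoning-distillation data that separates seen from unseen questions by analyzing the probability patterns of generated tokens. Experiments support its effectiveness.
Details
Motivation: Reasoning distillation strengthens a language model's reasoning, but it can contaminate evaluation data and inflate the reported performance of distilled models. A method for detecting distillation data is needed.
Contribution: (1) A formal definition of the distillation data detection task; (2) the TBD method, based on token probability deviation, which distinguishes seen from unseen questions.
Method: TBD quantifies how far the probabilities of generated tokens deviate from a high reference probability, exploiting the tendency of distilled models to generate near-deterministic, high-probability tokens for seen questions.
Result: On the S1 dataset, TBD reaches an AUC of 0.918 and a TPR@1% FPR of 0.470.
Insight: Behavioral differences in distilled models, such as near-deterministic generation on seen questions, can be used to detect data contamination, suggesting a new approach to quality control for reasoning distillation.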
Abstract: Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens’ probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.
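The scoring idea in the abstract above (measure how far generated-token probabilities fall below a high reference probability, with seen questions scoring lower) can be sketched as follows. The abstract does not give the exact formula, so the mean-shortfall form, the reference value of 0.9, and the function names are all assumptions made for illustration.

```python
def deviation_score(token_probs, reference_prob=0.9):
    """Mean shortfall of generated-token probabilities below a reference value.

    Assumed form: tokens at or above the reference contribute zero, so
    near-deterministic generations score close to 0.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    shortfalls = [max(0.0, reference_prob - p) for p in token_probs]
    return sum(shortfalls) / len(shortfalls)

def looks_seen(token_probs, threshold, reference_prob=0.9):
    """Flag a question as likely seen when its score falls below a threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return deviation_score(token_probs, reference_prob) < threshold
```

For example, a generation whose token probabilities all exceed the reference scores 0.0, while one containing several low-probability tokens scores higher and would be treated as unseen under a suitable threshold.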
[55] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Punya Syon Pandey,Hai Son Le,Devansh Bhardwaj,Rada Mihalcea,Zhijing Jin
Main category: cs.CL
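TL;DR: SocialHarmBench is a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to expose where large language models (LLMs) fail in politically charged contexts. Open-weight models comply with harmful requests at high rates, especially for historical revisionism, propaganda, and political manipulation.
Details
Motivation: Existing safety benchmarks rarely cover high-risk areas such as political manipulation and disinformation generation, so a dedicated benchmark is needed to test LLM vulnerabilities in these domains.
Contribution: The SocialHarmBench dataset, which fills a gap in testing LLMs on high-risk sociopolitical domains.
Method: Build a dataset of 585 prompts covering 7 sociopolitical categories and 34 countries, then evaluate multiple LLMs on it.
Result: Open-weight models are highly vulnerable to harmful compliance: Mistral-7B reaches attack success rates of 97% to 98% in domains such as historical revisionism and propaganda. Temporal and geographic analyses show LLMs are most fragile on 21st-century and pre-20th-century contexts, and on prompts tied to Latin America, the USA, and the UK.
Insight: Current safeguards do not generalize to high-stakes sociopolitical settings. The results expose systematic biases and raise concerns about the reliability of LLMs in preserving human rights and democratic values.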
Abstract: Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We share the SocialHarmBench benchmark at https://huggingface.co/datasets/psyonp/SocialHarmBench.
[56] Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning
Imran Mansha
Main category: cs.CL
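TL;DR: The paper presents a resource-efficient way to fine-tune LLaMA-3.2-3B for medical chain-of-thought reasoning, reducing compute requirements so that it suits low-resource environments.
Details
Motivation: Large language models such as LLaMA reason well, but full-parameter fine-tuning requires substantial compute. The paper targets efficient fine-tuning under constrained GPU and memory budgets, specifically for medical chain-of-thought reasoning.
Contribution: (1) A resource-efficient fine-tuning approach using parameter-efficient techniques such as LoRA and QLoRA; (2) evaluation on publicly available medical reasoning datasets; (3) a demonstration that memory usage can be cut by up to 60% while retaining reasoning capability.
Method: Parameter-efficient fine-tuning (LoRA and QLoRA) adapts LLaMA-3.2-3B. These techniques reduce compute and memory overhead by optimizing only a small number of parameters.
Result: Memory usage drops by up to 60% compared with standard full fine-tuning, while the model shows improved reasoning coherence and factual accuracy and retains strong reasoning capability on medical question answering.
Insight: The study shows that deploying LLMs in low-resource environments is feasible and underlines the value of parameter-efficient methods for medical AI, offering guidance on balancing domain specialization against computational efficiency.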
Abstract: Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated remarkable reasoning abilities but require significant computational resources for fine-tuning. This paper presents a resource-efficient fine-tuning approach for LLaMA-3.2-3B to enhance medical chain-of-thought reasoning while operating under constrained GPU and memory settings. Using parameter-efficient tuning techniques such as LoRA and QLoRA, we adapt the base model on publicly available medical reasoning datasets. The model achieves improved reasoning coherence and factual accuracy while reducing memory usage by up to 60% compared to standard full fine-tuning. Experimental evaluation demonstrates that lightweight adaptations can retain strong reasoning capability in medical question-answering tasks. This work highlights practical strategies for deploying LLMs in low-resource research environments and provides insights into balancing efficiency and domain specialization for medical AI systems.
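To make the parameter savings of LoRA concrete, here is a minimal sketch of the low-rank update and its trainable-parameter count. The standard LoRA formulation freezes a weight matrix W and learns a rank-r update B·A. The layer dimensions and rank in the usage note are illustrative assumptions, not values reported in the paper, and this count alone does not reproduce the 60% memory figure, which also depends on optimizer state, activations, and quantization.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-r update: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)

def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(x, W, A, B, scale):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]
```

For a hypothetical 3072 x 3072 projection with rank 8, `lora_params` gives 49,152 trainable parameters against 9,437,184 for full fine-tuning, a ratio of 1/192.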
[57] Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao,Yiming Li,Chao Du,Xin Wang,Xingjun Ma,Shu-Tao Xia,Tianyu Pang
Main category: cs.CL
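TL;DR: The paper presents an imperceptible jailbreak attack on large language models (LLMs) that uses Unicode variation selectors. The attack prompt looks identical to the original malicious question on screen, but its tokenization is altered so that the model produces harmful responses.
Details
Motivation: Attacks on the text modality are generally assumed to need visible modifications (for example, non-semantic suffixes), whereas attacks on the vision modality rely on imperceptible adversarial perturbations. The paper addresses this gap by exploring imperceptible attacks on text.
Contribution: An imperceptible jailbreak based on variation selectors, with a chain-of-search pipeline that generates the adversarial suffixes. The attack achieves high success rates against four aligned LLMs without any visible change to the prompt.
Method: Unicode variation selectors are appended to alter tokenization while keeping the prompt visually identical to the original question. A chain-of-search pipeline generates effective adversarial suffixes that induce harmful responses.
Result: The method achieves high attack success rates on four aligned LLMs with no visible modification to the prompt, and it generalizes to prompt injection attacks.
Insight: The work exposes the threat of imperceptible attacks in the text modality and suggests that alignment efforts should pay closer attention to tokenization and the handling of Unicode characters.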
Abstract: Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is “secretly” altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
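On the defensive side, invisible variation selectors are straightforward to detect in input text. The sketch below checks the two main Unicode variation selector blocks, U+FE00 to U+FE0F and U+E0100 to U+E01EF; the function names are hypothetical and this is an illustration of input inspection, not a mitigation evaluated in the paper.

```python
def is_variation_selector(ch):
    """True for code points in the two main Unicode variation selector blocks."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def count_variation_selectors(text):
    return sum(1 for ch in text if is_variation_selector(ch))

def strip_variation_selectors(text):
    """Remove variation selectors.

    Caveat: this also removes legitimate uses, such as U+FE0F in emoji
    presentation sequences, so flagging may be preferable to stripping.
    """
    return "".join(ch for ch in text if not is_variation_selector(ch))
```

A string with selectors appended renders the same as the original but compares unequal and has a different length, which is why the prompt looks unchanged to a human reader while the tokenizer sees different input.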
[58] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan,Asaf Yehudai,Roi pony,Eyal Shnarch,Ariel Gera
Main category: cs.CL
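TL;DR: The paper proposes GQR, a test-time optimization method in which a lightweight dense text retriever enhances a vision-centric model, addressing both efficiency and performance in multimodal retrieval.
Details
Motivation: Multimodal encoders perform well on visual document retrieval, but their very large representations hinder deployment and scalability, and purely vision-centric approaches are constrained by the modality gap.
Contribution: The GQR method, which refines the primary retriever's query embedding using a complementary retriever, improving both efficiency and performance.
Method: GQR is a test-time optimization method that uses the complementary retriever's scores to guide refinement of the primary retriever's query embedding.
Result: With GQR, vision-centric models match the performance of models with much larger representations while being up to 14x faster and requiring 54x less memory.
Insight: GQR pushes the Pareto frontier between performance and efficiency in multimodal retrieval.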
Abstract: Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model’s representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever’s query embedding using guidance from a complementary retriever’s scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
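One way to picture test-time query refinement is as a few gradient steps on the query embedding so that the primary retriever's score distribution moves toward the complementary retriever's. The abstract does not specify the objective, so the KL-divergence loss, the dot-product scoring, the step count, and the learning rate below are all assumptions for illustration, not the GQR implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def refine_query(query, doc_vecs, guide_scores, steps=100, lr=0.5):
    """Gradient descent on KL(softmax(guide) || softmax(q . d_i)) w.r.t. q.

    guide_scores are the complementary retriever's scores for the same
    candidate documents. The input query is not modified in place.
    """
    target = softmax(guide_scores)
    q = list(query)
    for _ in range(steps):
        current = softmax([dot(q, d) for d in doc_vecs])
        # d KL / d q = sum_i (current_i - target_i) * d_i
        grad = [
            sum((c - t) * d[j] for c, t, d in zip(current, target, doc_vecs))
            for j in range(len(q))
        ]
        q = [qj - lr * gj for qj, gj in zip(q, grad)]
    return q
```

Because only the query vector is updated, the document index stays untouched, which is what makes this kind of refinement cheap relative to re-encoding or enlarging document representations.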
[59] COLE: a Comprehensive Benchmark for French Language Understanding Evaluation
David Beauchemin,Yan Tremblay,Mohamed Amine Youssef,Richard Khoury
Main category: cs.CL
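TL;DR: COLE is a comprehensive benchmark for French natural language understanding (NLU) with 23 diverse tasks. The authors evaluate 94 large language models, expose the performance gap between closed- and open-weights models, and identify open challenges.
Details
Motivation: Existing French NLU benchmarks are not comprehensive enough; broader task coverage and attention to French-specific linguistic phenomena are needed.
Contribution: The COLE benchmark, covering 23 diverse NLU tasks, and an extensive analysis of French NLU based on 94 evaluated models.
Method: Assemble an evaluation suite of multiple NLU tasks and benchmark a wide range of closed- and open-weights LLMs on it.
Result: Closed models significantly outperform open-weights models. Challenging tasks include zero-shot extractive QA, fine-grained word sense disambiguation, and understanding of regional language variations.
Insight: COLE can help advance French NLU while exposing the limitations of current models on French-specific tasks.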
Abstract: To address the need for a more comprehensive evaluation of French Natural Language Understanding (NLU), we introduce COLE, a new benchmark composed of 23 diverse tasks covering a broad range of NLU capabilities, including sentiment analysis, paraphrase detection, grammatical judgment, and reasoning, with a particular focus on linguistic phenomena relevant to the French language. We benchmark 94 large language models (LLMs), providing an extensive analysis of the current state of French NLU. Our results highlight a significant performance gap between closed- and open-weights models and identify key challenging frontiers for current LLMs, such as zero-shot extractive question-answering (QA), fine-grained word sense disambiguation, and understanding of regional language variations. We release COLE as a public resource to foster further progress in French language modelling.
[60] SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
Dachuan Shi,Abedelkadir Asi,Keying Li,Xiangchi Yuan,Leyan Pan,Wenke Lee,Wen Xiao
Main category: cs.CL
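TL;DR: SwiReasoning is a training-free framework that switches dynamically between explicit and latent reasoning. It uses block-wise confidence estimates and a cap on the number of switches to improve both accuracy and token efficiency.
Details
Motivation: Latent reasoning in large language models (LLMs) spreads probability mass across many implicit paths, which introduces noise and hurts accuracy, and overthinking persists even without explicit text, wasting tokens.
Contribution: (1) A method for dynamically switching between explicit and latent reasoning; (2) block-wise confidence estimation and a switch limit that address noise and overthinking.
Method: Confidence is estimated per block from entropy trends in the next-token distribution and used to switch reasoning modes; the number of thinking-block switches is capped to curb overthinking.
Result: On mathematics and STEM benchmarks, average accuracy improves by 1.5%-2.8%, and under constrained budgets average token efficiency improves by 56%-79%.
Insight: Dynamically balancing explicit and latent reasoning can noticeably improve both the reasoning accuracy and the efficiency of LLMs.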
Abstract: Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.
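The two mechanisms in the abstract above, entropy-guided switching and a cap on switches, can be sketched in a few lines. The abstract does not say which entropy trend triggers which mode, so the rule here (entropy falling below the level at the start of the current block switches to explicit reasoning, entropy rising above it switches to latent) is an assumption, as are the function names.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def plan_modes(step_entropies, max_switches, start_mode="latent"):
    """Assign a reasoning mode to each step.

    Assumed rule: compare each step's entropy with the entropy recorded at
    the start of the current block. Lower entropy (rising confidence) moves
    to explicit reasoning, higher entropy moves to latent reasoning. Once
    max_switches is reached, the mode is frozen.
    """
    if not step_entropies:
        return []
    modes = []
    mode = start_mode
    switches = 0
    reference = step_entropies[0]
    for h in step_entropies:
        if switches < max_switches:
            if h < reference:
                wanted = "explicit"
            elif h > reference:
                wanted = "latent"
            else:
                wanted = mode
            if wanted != mode:
                mode = wanted
                switches += 1
                reference = h  # a new block starts here
        modes.append(mode)
    return modes
```

With a small `max_switches`, the plan settles into one mode early, which is the mechanism the abstract credits for curbing overthinking.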
[61] Slm-mux: Orchestrating small language models for reasoning
Chenyu Wang,Zishen Wan,Hao Kang,Emma Chen,Zhiqiang Xie,Tushar Krishna,Vijay Janapa Reddi,Yilun Du
Main category: cs.CL
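TL;DR: The paper proposes a three-stage approach built around SLM-MUX, which orchestrates multiple small language models (SLMs) to improve reasoning performance, surpassing existing orchestration methods and, on some tasks, a much larger single model.
Details
Motivation: Small language models (SLMs) do well on specific tasks but lack accuracy when used alone. Existing orchestration methods mainly target frontier models such as GPT-4 and work poorly on SLMs. The paper addresses this gap by studying how to orchestrate multiple SLMs effectively.
Contribution: The SLM-MUX architecture for coordinating multiple SLMs; a model selection search and a test-time scaling strategy; experiments showing clear gains on several benchmarks; and a theoretical analysis of the method's advantages.
Method: A three-stage approach: (1) the SLM-MUX multi-model architecture; (2) a model selection search that finds the most complementary SLMs in a pool; (3) test-time scaling tailored to SLM-MUX.
Result: Compared with existing orchestration methods, the approach improves by up to 13.4% on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMs, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K and matches it on MATH.
Insight: Well-orchestrated SLMs can form systems that are more accurate and more efficient than a single large model, pointing to a resource-efficient way to use small models.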
Abstract: With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our approach delivers strong results: compared to existing orchestration methods, it achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMs, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.
cs.CV
[62] Visualizing Celebrity Dynamics in Video Content: A Proposed Approach Using Face Recognition Timestamp Data
Doğanay Demir,İlknur Durgar Elkahlout
Main category: cs.CV
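TL;DR: The paper presents a hybrid framework that combines a distributed multi-GPU inference system with an interactive visualization platform for analyzing celebrity dynamics in video content. Efficient inference and multi-dimensional visualizations reveal patterns in appearance frequency, duration, and co-appearance.
Details
Motivation: With video content growing explosively, understanding its structure and dynamics has become important. Conventional methods struggle to process large volumes of video efficiently and to extract useful insights.
Contribution: (1) An optimized distributed inference framework that efficiently generates timestamped appearance records; (2) a suite of multi-dimensional visualizations for analyzing celebrity dynamics in video; (3) an interactive platform for exploring the data dynamically.
Method: Optimized ONNX models, heterogeneous batch inference, and high-throughput parallel processing, combined with visualizations such as co-appearance matrices, network graphs, and heatmaps.
Result: The system reveals appearance patterns, screen-time distribution, and co-appearance relationships across episodes and seasons, and supports interactive exploration.
Insight: The framework opens new possibilities for entertainment analytics, content creation strategy, and audience engagement studies, particularly for large-scale video analysis.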
Abstract: In an era dominated by video content, understanding its structure and dynamics has become increasingly important. This paper presents a hybrid framework that combines a distributed multi-GPU inference system with an interactive visualization platform for analyzing celebrity dynamics in video episodes. The inference framework efficiently processes large volumes of video data by leveraging optimized ONNX models, heterogeneous batch inference, and high-throughput parallelism, ensuring scalable generation of timestamped appearance records. These records are then transformed into a comprehensive suite of visualizations, including appearance frequency charts, duration analyses, pie charts, co-appearance matrices, network graphs, stacked area charts, seasonal comparisons, and heatmaps. Together, these visualizations provide multi-dimensional insights into video content, revealing patterns in celebrity prominence, screen-time distribution, temporal dynamics, co-appearance relationships, and intensity across episodes and seasons. The interactive nature of the system allows users to dynamically explore data, identify key moments, and uncover evolving relationships between individuals. By bridging distributed recognition with structured, visually-driven analytics, this work enables new possibilities for entertainment analytics, content creation strategies, and audience engagement studies.
[63] Domain-Robust Marine Plastic Detection Using Vision Models
Saanvi Kataria
Main category: cs.CV
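TL;DR: The paper studies the robustness of cross-domain underwater plastic detection, comparing lightweight CNNs with large pretrained vision models. A lightweight CNN (MobileNetV2) performs best across domains, while pretrained models offer useful performance without fine-tuning.
Details
Motivation: Marine plastic pollution is worsening and calls for reliable automated detection. Vision models often degrade under domain shift, so the cross-domain robustness of different model types needs to be assessed.
Contribution: (1) A cross-domain comparison of several CNN and vision transformer models; (2) an evaluation of zero-shot pretrained models (CLIP and Gemini); (3) the finding that a lightweight CNN outperforms larger models across domains.
Method: (1) Train CNNs (MobileNetV2, ResNet-18, EfficientNet-B0) and vision transformers (DeiT-Tiny, ViT-B16) on a labeled underwater dataset; (2) evaluate on a balanced cross-domain test set; (3) test the zero-shot classification ability of CLIP and Gemini.
Result: MobileNetV2 performs best (F1 0.97). All fine-tuned models reach high precision (around 99%) but differ in recall. Zero-shot CLIP has recall around 80% with low precision (around 56%), whereas Gemini has precision around 99% and recall around 81%.
Insight: Lightweight CNNs generalize well across domains and suit practical deployment; pretrained models are useful without fine-tuning but involve a precision-recall trade-off. Error analysis shows recurring confusions with coral textures, suspended particulates, and specular glare.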
Abstract: Marine plastic pollution is a pressing environmental threat, making reliable automation for underwater debris detection essential. However, vision systems trained on one dataset often degrade on new imagery due to domain shift. This study benchmarks models for cross-domain robustness, training convolutional neural networks - CNNs (MobileNetV2, ResNet-18, EfficientNet-B0) and vision transformers (DeiT-Tiny, ViT-B16) on a labeled underwater dataset and then evaluates them on a balanced cross-domain test set built from plastic-positive images drawn from a different source and negatives from the training domain. Two zero-shot models were assessed, CLIP ViT-L14 and Google’s Gemini 2.0 Flash, that leverage pretraining to classify images without fine-tuning. Results show the lightweight MobileNetV2 delivers the strongest cross-domain performance (F1 0.97), surpassing larger models. All fine-tuned models achieved high Precision (around 99%), but differ in Recall, indicating varying sensitivity to plastic instances. Zero-shot CLIP is comparatively sensitive (Recall around 80%) yet prone to false positives (Precision around 56%), whereas Gemini exhibits the inverse profile (Precision around 99%, Recall around 81%). Error analysis highlights recurring confusions with coral textures, suspended particulates, and specular glare. Overall, compact CNNs with supervised training can generalize effectively for cross-domain underwater detection, while large pretrained vision-language models provide complementary strengths.
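The abstract above reports precision and recall for the two zero-shot models but not their F1 scores. As a rough cross-check, F1 can be computed as the harmonic mean of the two. The inputs are the rounded "around" figures from the abstract, so the results are approximate and are not numbers reported by the paper.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded figures quoted in the abstract (approximate inputs).
clip_f1 = f1_score(0.56, 0.80)    # zero-shot CLIP ViT-L14
gemini_f1 = f1_score(0.99, 0.81)  # Gemini 2.0 Flash
```

Under these inputs, CLIP comes out near 0.66 and Gemini near 0.89, both below the 0.97 reported for the fine-tuned MobileNetV2.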
[64] Multimodal Arabic Captioning with Interpretable Visual Concept Integration
Passant Elchafei,Amany Fashwan
Main category: cs.CV
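TL;DR: VLCAP is an Arabic image captioning framework that combines CLIP-based visual label retrieval with multimodal text generation, grounding captions in interpretable visual concepts to produce culturally coherent descriptions.
Details
Motivation: Conventional end-to-end captioning lacks cultural coherence and interpretability, particularly for lower-resource languages such as Arabic.
Contribution: A framework that combines visual concept retrieval with multimodal generation to support interpretable Arabic captioning, with an evaluation of six encoder-decoder configurations.
Method: Visual concepts are extracted with mCLIP, AraCLIP, and Jina V4 over a hybrid vocabulary, and captions are generated with Qwen-VL and Gemini Pro Vision.
Result: mCLIP + Gemini Pro Vision achieves the best BLEU-1 (5.34%) and cosine similarity (60.01%); AraCLIP + Qwen-VL obtains the highest LLM-judge score (36.33%).
Insight: Combining interpretable visual concepts with multimodal generation can improve caption quality and cultural coherence for low-resource languages.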
Abstract: We present VLCAP, an Arabic image captioning framework that integrates CLIP-based visual label retrieval with multimodal text generation. Rather than relying solely on end-to-end captioning, VLCAP grounds generation in interpretable Arabic visual concepts extracted with three multilingual encoders, mCLIP, AraCLIP, and Jina V4, each evaluated separately for label retrieval. A hybrid vocabulary is built from training captions and enriched with about 21K general domain labels translated from the Visual Genome dataset, covering objects, attributes, and scenes. The top-k retrieved labels are transformed into fluent Arabic prompts and passed along with the original image to vision-language models. In the second stage, we tested Qwen-VL and Gemini Pro Vision for caption generation, resulting in six encoder-decoder configurations. The results show that mCLIP + Gemini Pro Vision achieved the best BLEU-1 (5.34%) and cosine similarity (60.01%), while AraCLIP + Qwen-VL obtained the highest LLM-judge score (36.33%). This interpretable pipeline enables culturally coherent and contextually accurate Arabic captions.
[65] Convolutional Neural Nets vs Vision Transformers: A SpaceNet Case Study with Balanced vs Imbalanced Regimes
Akshar Gothi
Main category: cs.CV
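TL;DR: The paper compares a convolutional neural network (EfficientNet-B0) with a Vision Transformer (ViT-Base) on SpaceNet under imbalanced and balanced label distributions. The CNN wins on efficiency and latency, while ViT stays competitive on accuracy, especially in the balanced setting.
Details
Motivation: To examine how CNN and ViT performance differs under imbalanced versus balanced label distributions, and how the models compare on deployment metrics such as model size and latency.
Contribution: A controlled comparison of a CNN and a ViT on SpaceNet, an analysis of how label distribution affects performance, and evidence of the CNN's efficiency advantage.
Method: Train EfficientNet-B0 and ViT-Base on both the imbalanced and the balanced split, recording accuracy, macro-F1, balanced accuracy, per-class recall, model size, and latency.
Result: On the imbalanced split, both models reach about 93% test accuracy; EfficientNet-B0 has strong macro-F1 and lower latency, while ViT-Base has a larger parameter count and runtime. On the balanced split, EfficientNet-B0 reaches 99% and ViT-Base remains competitive.
Insight: Balancing the label distribution narrows the gap between the architectures, but the CNN keeps its efficiency and latency advantage in deployment.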
Abstract: We present a controlled comparison of a convolutional neural network (EfficientNet-B0) and a Vision Transformer (ViT-Base) on SpaceNet under two label-distribution regimes: a naturally imbalanced five-class split and a balanced-resampled split with 700 images per class (70:20:10 train/val/test). With matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and a 40-epoch budget on a single NVIDIA P100, we report accuracy, macro-F1, balanced accuracy, per-class recall, and deployment metrics (model size and latency). On the imbalanced split, EfficientNet-B0 reaches 93% test accuracy with strong macro-F1 and lower latency; ViT-Base is competitive at 93% with a larger parameter count and runtime. On the balanced split, both models are strong; EfficientNet-B0 reaches 99% while ViT-Base remains competitive, indicating that balancing narrows architecture gaps while CNNs retain an efficiency edge. We release manifests, logs, and per-image predictions to support reproducibility.
[66] A Comprehensive Review on Artificial Intelligence Empowered Solutions for Enhancing Pedestrian and Cyclist Safety
Shucheng Zhang,Yan Shi,Bingzhang Wang,Yuang Zhang,Muhammad Monjurul Karim,Kehua Chen,Chenxi Liu,Mehrdad Nasri,Yinhai Wang
Main category: cs.CV
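TL;DR: This survey reviews recent progress in AI for pedestrian and cyclist safety, covering four core tasks (detection and classification, tracking and re-identification, trajectory prediction, and intent recognition and prediction) and outlining open challenges in data, models, and deployment.
Details
Motivation: Conventional infrastructure-based measures protect vulnerable road users (VRUs) poorly in dynamic urban environments, while advances in AI for visual perception and reasoning create new opportunities for VRU protection.
Contribution: (1) A systematic review of camera-based AI sensing systems for VRU safety, emphasizing the past five years; (2) an account of research trends across the four core tasks; (3) four major open challenges, framed from the perspectives of data, model, and deployment.
Method: A systematic literature review focused on vision-based AI for VRU safety, covering detection, tracking, trajectory prediction, and intent recognition.
Result: The survey summarizes current progress and trends and raises open questions concerning data, models, and deployment.
Insight: AI-powered visual perception offers new solutions for VRU safety, but real-world deployment still has to address data diversity, model generalization, and computational efficiency.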
Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, remains a critical global challenge, as conventional infrastructure-based measures often prove inadequate in dynamic urban environments. Recent advances in artificial intelligence (AI), particularly in visual perception and reasoning, open new opportunities for proactive and context-aware VRU protection. However, existing surveys on AI applications for VRUs predominantly focus on detection, offering limited coverage of other vision-based tasks that are essential for comprehensive VRU understanding and protection. This paper presents a state-of-the-art review of recent progress in camera-based AI sensing systems for VRU safety, with an emphasis on developments from the past five years and emerging research trends. We systematically examine four core tasks, namely detection and classification, tracking and reidentification, trajectory prediction, and intent recognition and prediction, which together form the backbone of AI-empowered proactive solutions for VRU protection in intelligent transportation systems. To guide future research, we highlight four major open challenges from the perspectives of data, model, and deployment. By linking advances in visual AI with practical considerations for real-world implementation, this survey aims to provide a foundational reference for the development of next-generation sensing systems to enhance VRU safety.
[67] Photorealistic Inpainting for Perturbation-based Explanations in Ecological Monitoring
Günel Aghakishiyeva,Jiayi Zhou,Saagar Arya,James David Poling,Holly R. Houliston,Jamie N. Womble,David W. Johnston,Brinnae Bent
Main category: cs.CV
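TL;DR: This paper proposes an inpainting-based perturbation explanation method for ecological monitoring tasks such as species recognition and trait attribution; it generates photorealistic local edits to reveal the fine-grained features that drive model predictions.
Details
Motivation: In ecological monitoring, opaque model predictions limit trust and field adoption. Traditional perturbations such as blurring or masking push images out of distribution, which makes the resulting explanations hard to interpret.
Contribution: Proposes an inpainting-guided perturbation explanation technique that produces photorealistic, mask-localized edits, preserves scene context, and avoids the out-of-distribution problem of traditional perturbations.
Method: Uses a YOLOv9 detector and masks refined with the Segment-Anything-Model to support two interventions: (i) object removal/replacement and (ii) background replacement. Explanations are validated by re-scoring the perturbed images and by expert review.
Result: The explanations localize diagnostic structures, avoid the deletion artifacts common to traditional perturbations, and yield ecologically relevant insights that support expert validation.
Insight: Photorealistic local perturbations give more intuitive, ecologically relevant explanations and can help make AI deployment in ecological monitoring more trustworthy.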
Abstract: Ecological monitoring is increasingly automated by vision models, yet opaque predictions limit trust and field adoption. We present an inpainting-guided, perturbation-based explanation technique that produces photorealistic, mask-localized edits that preserve scene context. Unlike masking or blurring, these edits stay in-distribution and reveal which fine-grained morphological cues drive predictions in tasks such as species recognition and trait attribution. We demonstrate the approach on a YOLOv9 detector fine-tuned for harbor seal detection in Glacier Bay drone imagery, using Segment-Anything-Model-refined masks to support two interventions: (i) object removal/replacement (e.g., replacing seals with plausible ice/water or boats) and (ii) background replacement with original animals composited onto new scenes. Explanations are assessed by re-scoring perturbed images (flip rate, confidence drop) and by expert review for ecological plausibility and interpretability. The resulting explanations localize diagnostic structures, avoid deletion artifacts common to traditional perturbations, and yield domain-relevant insights that support expert validation and more trustworthy deployment of AI in ecology.
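The abstract names two re-scoring metrics, flip rate and confidence drop, without defining them. The following is a minimal sketch of one plausible reading, assuming a single detector confidence per image and a decision threshold of 0.5; the function name `rescoring_metrics`, the threshold, and the sample scores are illustrative assumptions, not the authors' code.

```python
def rescoring_metrics(original_scores, perturbed_scores, threshold=0.5):
    """Return (flip_rate, mean_confidence_drop) for paired detector scores.

    Assumed definitions (not taken from the paper):
      flip rate       = share of originally accepted images that fall below
                        the threshold after perturbation
      confidence drop = mean of (original - perturbed) over the same images
    """
    if len(original_scores) != len(perturbed_scores):
        raise ValueError("score lists must have the same length")
    # Only images the detector originally accepted can flip to "rejected".
    pairs = [(o, p) for o, p in zip(original_scores, perturbed_scores)
             if o >= threshold]
    if not pairs:
        return 0.0, 0.0
    flips = sum(1 for _, p in pairs if p < threshold)
    mean_drop = sum(o - p for o, p in pairs) / len(pairs)
    return flips / len(pairs), mean_drop


# Hypothetical confidences before and after an inpainting edit.
original = [0.9, 0.8, 0.7, 0.3]
perturbed = [0.2, 0.75, 0.4, 0.1]
flip_rate, confidence_drop = rescoring_metrics(original, perturbed)
```

Under these assumptions, a high flip rate for an edit that removes one structure suggests the detector relies on that structure.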
[68] Advances in Medical Image Segmentation: A Comprehensive Survey with a Focus on Lumbar Spine Applications
Ahmed Kabil,Ghada Khoriba,Mina Yousef,Essam A. Rashed
Main category: cs.CV
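TL;DR: This survey systematically reviews traditional and modern deep learning methods for medical image segmentation, with particular attention to deep learning architectures (such as U-Net and Transformers) and emerging techniques (such as semi-supervised and federated learning). It also uses lumbar spine segmentation as a case study of the field's challenges and progress.
Details
Motivation: Medical image segmentation is key to precise diagnosis and treatment planning. Deep learning has opened new possibilities alongside traditional methods, but challenges such as dataset bias and computational complexity remain.
Contribution: 1. Systematically surveys traditional techniques and deep learning methods for medical image segmentation; 2. Examines emerging trends (such as hybrid architectures and cross-modality learning); 3. Illustrates practical applications and challenges through a lumbar spine segmentation case study.
Method: Covers traditional methods (thresholding, edge detection, region-based segmentation), deep learning architectures (CNNs, U-Net, Transformers), and techniques such as attention mechanisms, GANs, and federated learning.
Result: The survey finds that deep learning has substantially improved segmentation performance, while domain adaptation and model interpretability remain open problems.
Insight: 1. Combining traditional methods with deep learning is a promising direction; 2. Federated learning and related techniques can help address data privacy; 3. Specific applications such as lumbar spine segmentation need more research attention.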
Abstract: Medical Image Segmentation (MIS) stands as a cornerstone in medical image analysis, playing a pivotal role in precise diagnostics, treatment planning, and monitoring of various medical conditions. This paper presents a comprehensive and systematic survey of MIS methodologies, bridging the gap between traditional image processing techniques and modern deep learning approaches. The survey encompasses thresholding, edge detection, region-based segmentation, clustering algorithms, and model-based techniques while also delving into state-of-the-art deep learning architectures such as Convolutional Neural Networks (CNNs), Fully Convolutional Networks (FCNs), and the widely adopted U-Net and its variants. Moreover, integrating attention mechanisms, semi-supervised learning, generative adversarial networks (GANs), and Transformer-based models is thoroughly explored. In addition to covering established methods, this survey highlights emerging trends, including hybrid architectures, cross-modality learning, federated and distributed learning frameworks, and active learning strategies, which aim to address challenges such as limited labeled datasets, computational complexity, and model generalizability across diverse imaging modalities. Furthermore, a specialized case study on lumbar spine segmentation is presented, offering insights into the challenges and advancements in this relatively underexplored anatomical region. Despite significant progress in the field, critical challenges persist, including dataset bias, domain adaptation, interpretability of deep learning models, and integration into real-world clinical workflows.
[69] OpusAnimation: Code-Based Dynamic Chart Generation
Bozheng Li,Miao Yang,Zhenhan Chen,Jiawang Cao,Mushui Liu,Yi Lu,Yongliang Wu,Bin Zhang,Yangguang Ji,Licheng Tang,Jay Wu,Wenbo Zhu
Main category: cs.CV
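TL;DR: The paper introduces DCG-Bench, the first benchmark for evaluating multimodal large language models (MLLMs) on dynamic chart generation, and develops a two-stage training recipe on the DCG-8K dataset that substantially improves model performance.
Details
Motivation: MLLMs have improved markedly on static chart generation and comprehension, but their potential for dynamic chart generation and understanding remains underexplored.
Contribution: 1. Introduces the DCG-Bench benchmark; 2. Builds DCG-8K, a high-quality dataset; 3. Designs a two-stage training recipe with a Joint-Code-Visual Reward.
Method: A two-stage training recipe that uses the Joint-Code-Visual Reward for group relative policy optimization, producing the Qwen2.5-VL-DCG-3B model.
Result: The model beats the best open-source MLLM by 8.31% on average across the three tasks and performs on par with proprietary models while using only 3B parameters.
Insight: Dynamic chart generation is more demanding than static chart generation, and existing models still fall short on it; the proposed training recipe offers one way to close the gap.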
Abstract: Dynamic Chart Generation (DCG) involves producing code-rendered animated visualizations as charts. While recent advances in multi-modal large language models (MLLMs) have significantly improved their capability on static chart generation and comprehension, MLLMs’ potential for handling dynamic chart generation and understanding remains underexplored. To bridge this research gap, we introduce DCG-Bench (Dynamic Chart Generation Benchmark), the first benchmark evaluating MLLMs’ capability on dynamic chart generation tasks from three dimensions: Simple Text-to-Chart, Detailed Text-to-Chart, and Video-to-Chart tasks. We construct DCG-8K, a high-quality DCG dataset with annotations covering instruction-code-video triplets and QA pairs for both code and video evaluation. Based on DCG-8K, we explore a two-stage training recipe, proposing Joint-Code-Visual Reward for group relative policy optimization to construct expert MLLM Qwen2.5-VL-DCG-3B for the DCG task. Our benchmarking results reveal shortcomings of existing MLLMs in the visual-to-chart task, and our model beats the best open-source MLLM with an average 8.31% performance gain across three tasks, and performs on par with proprietary models with only 3B parameters, proving the effectiveness of our training recipe. Our code and dataset will be publicly available.
[70] Visual Odometry with Transformers
Vladimir Yugay,Duy-Kien Nguyen,Theo Gevers,Cees G. M. Snoek,Martin R. Oswald
Main category: cs.CV
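TL;DR: Proposes VoT (Visual odometry Transformer), an end-to-end monocular visual odometry method that models global relationships with temporal and spatial attention and predicts camera motion directly, without handcrafted components; it outperforms traditional methods and runs faster.
Details
Motivation: Existing monocular visual odometry methods typically combine pre-trained deep learning components with optimization modules, producing complex pipelines that are sensitive to camera calibration and hyperparameter tuning and that generalize poorly.
Contribution: 1. Proposes VoT, an end-to-end monocular visual odometry framework; 2. Predicts camera motion directly by modeling global relationships with temporal and spatial attention; 3. Requires no handcrafted components (such as bundle adjustment or dense 3D reconstruction).
Method: VoT is Transformer-based: it processes sequences of monocular frames, extracts features, models global relationships through temporal and spatial attention, and predicts camera motion directly, supervised only by camera poses.
Result: VoT generalizes across diverse camera motions and calibration settings, outperforms traditional methods, and runs more than 3 times faster.
Insight: Transformers can model the spatio-temporal relationships in visual odometry effectively, and an end-to-end design simplifies the traditional pipeline while improving generalization.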
Abstract: Modern monocular visual odometry methods typically combine pre-trained deep learning components with optimization modules, resulting in complex pipelines that rely heavily on camera calibration and hyperparameter tuning, and often struggle in unseen real-world scenarios. Recent large-scale 3D models trained on massive amounts of multi-modal data have partially alleviated these challenges, providing generalizable dense reconstruction and camera pose estimation. Still, they remain limited in handling long videos and providing accurate per-frame estimates, which are required for visual odometry. In this work, we demonstrate that monocular visual odometry can be addressed effectively in an end-to-end manner, thereby eliminating the need for handcrafted components such as bundle adjustment, feature matching, camera calibration, or dense 3D reconstruction. We introduce VoT, short for Visual odometry Transformer, which processes sequences of monocular frames by extracting features and modeling global relationships through temporal and spatial attention. Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision. The framework is modular and flexible, allowing seamless integration of various pre-trained encoders as feature extractors. Experimental results demonstrate that VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster. The code will be released.
[71] Inference-Time Search using Side Information for Diffusion-based Image Reconstruction
Mahdi Farahbakhsh,Vishnu Teja Kunde,Dileep Kalathil,Krishna Narayanan,Jean-Francois Chamberland
Main category: cs.CV
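TL;DR: The paper proposes a novel inference-time search algorithm that uses side information to guide diffusion model sampling, consistently improving image reconstruction quality.
Details
Motivation: Existing diffusion-based approaches typically overlook side information, which could significantly improve reconstruction quality in severely ill-posed problems.
Contribution: Proposes an inference-time search algorithm that balances exploration and exploitation and uses side information to improve diffusion-based image reconstruction.
Method: Side information guides the sampling process, avoiding the reward-hacking artifacts that gradient-based guidance is prone to; the approach integrates into existing diffusion-based reconstruction pipelines.
Result: Experiments on several inverse problems (box inpainting, super-resolution, and various deblurring tasks) show consistent qualitative and quantitative gains.
Insight: Side information is valuable for diffusion models; used well, it improves reconstruction quality and reliability.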
Abstract: Diffusion models have emerged as powerful priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel inference-time search algorithm that guides the sampling process using the side information in a manner that balances exploration and exploitation. This enables more accurate and reliable reconstructions, providing an alternative to the gradient-based guidance that is prone to reward-hacking artifacts. Our approach can be seamlessly integrated into a wide range of existing diffusion-based image reconstruction pipelines. Through extensive experiments on a number of inverse problems, such as box inpainting, super-resolution, and various deblurring tasks including motion, Gaussian, nonlinear, and blind deblurring, we show that our approach consistently improves the qualitative and quantitative performance of diffusion-based image reconstruction algorithms. We also show the superior performance of our approach with respect to other baselines, including reward gradient-based guidance algorithms. The code is available at https://github.com/mhdfb/sideinfo-search-reconstruction.
[72] Unified Unsupervised Anomaly Detection via Matching Cost Filtering
Zhe Zhang,Mingxiu Cai,Gaochang Wu,Jing Zhang,Lingqiao Liu,Dacheng Tao,Tianyou Chai,Xiatian Zhu
Main category: cs.CV
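TL;DR: The paper proposes UCF, a unified unsupervised anomaly detection (UAD) method that refines existing methods through matching cost filtering to address matching noise, achieving state-of-the-art results in both unimodal and multimodal settings.
Details
Motivation: Existing UAD methods largely overlook matching noise, and unimodal and multimodal research lines remain isolated. The paper aims to unify the two settings and to provide a generic post-hoc refinement framework that improves detection.
Contribution: 1. Unifies unimodal and multimodal UAD from a matching perspective; 2. Proposes the generic UCF framework, which improves existing methods by filtering the anomaly cost volume; 3. Validates UCF on 22 diverse benchmarks.
Method: UCF has two steps: 1. Build a cost volume by matching the test sample against normal samples; 2. Apply a learnable filtering module with multi-layer attention guidance to suppress matching noise and highlight anomalies.
Result: Experiments show that UCF improves a variety of UAD methods and reaches new state-of-the-art results in unimodal (RGB) and multimodal (RGB–3D, RGB–Text) settings.
Insight: Matching noise is an important bottleneck in UAD; a unified cost filtering strategy can improve detection and support knowledge transfer across modalities.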
Abstract: Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB–3D and RGB–Text, enabled by point cloud sensing and vision–language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB–3D, RGB–Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
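To make the cost-volume idea concrete, here is a minimal sketch of matching test-sample patch features against normal templates. It assumes cosine distance as the matching cost and uses a plain minimum as an unfiltered baseline; the learnable attention-guided filter that UCF applies to the volume is not reproduced, and every name and number below is illustrative.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def build_cost_volume(test_patches, normal_templates):
    """Cost volume indexed as [template][test_patch][template_patch]."""
    return [
        [[cosine_distance(t, n) for n in template] for t in test_patches]
        for template in normal_templates
    ]


def unfiltered_anomaly_scores(cost_volume):
    """Baseline score per test patch: cheapest match over all templates.

    A filtering network would refine the volume before this reduction;
    that step is intentionally left out of this sketch.
    """
    num_patches = len(cost_volume[0])
    return [
        min(min(template[i]) for template in cost_volume)
        for i in range(num_patches)
    ]


# Toy 2-D patch features: the third test patch matches nothing normal.
test_patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
normal_templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 2.0]],
]
volume = build_cost_volume(test_patches, normal_templates)
scores = unfiltered_anomaly_scores(volume)
```

The minimum reduction is the step where matching noise does damage: a single spuriously cheap match hides an anomaly, which is the motivation for filtering the volume first.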
[73] Visual Language Model as a Judge for Object Detection in Industrial Diagrams
Sanjukta Ghosh
Main category: cs.CV
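TL;DR: This paper proposes a framework that uses Visual Language Models (VLMs) to assess the quality of object detection in industrial diagrams, addressing the lack of automated evaluation of detection results.
Details
Motivation: Digitizing industrial diagrams such as P&IDs is an important step toward building digital twins and enabling intelligent industrial automation, but methods for automatically evaluating the quality of object detection outputs are lacking.
Contribution: Introduces a VLM-based framework that assesses object detection results on industrial diagrams and guides their refinement, improving detection performance.
Method: Exploits the multimodal capabilities of VLMs to identify missing or inconsistent detections, enabling automated quality assessment.
Result: The approach enables automated quality assessment of detection results and improves overall detection performance on complex industrial diagrams.
Insight: The multimodal capabilities of VLMs make them promising for diagram digitization, where they can fill the evaluation gap left by existing object detection methods.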
Abstract: Industrial diagrams such as piping and instrumentation diagrams (P&IDs) are essential for the design, operation, and maintenance of industrial plants. Converting these diagrams into digital form is an important step toward building digital twins and enabling intelligent industrial automation. A central challenge in this digitalization process is accurate object detection. Although recent advances have significantly improved object detection algorithms, there remains a lack of methods to automatically evaluate the quality of their outputs. This paper addresses this gap by introducing a framework that employs Visual Language Models (VLMs) to assess object detection results and guide their refinement. The approach exploits the multimodal capabilities of VLMs to identify missing or inconsistent detections, thereby enabling automated quality assessment and improving overall detection performance on complex industrial diagrams.
[74] Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
Chashi Mahiul Islam,Oteo Mamo,Samuel Jacob Chacko,Xiuwen Liu,Weikuan Yu
Main category: cs.CV
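TL;DR: Proposes SpatialViLT, a vision-language model that integrates spatial features through multi-task learning to strengthen spatial reasoning for 3D scenes; it comes in two variants plus an ensemble and performs strongly on the VSR dataset.
Details
Motivation: Existing vision-language models fall short on spatial reasoning for 3D scenes and complex object configurations and need stronger spatial understanding.
Contribution: Proposes SpatialViLT and its variants, which enrich multimodal embeddings with spatial understanding by integrating spatial features such as depth maps, 3D coordinates, and edge maps.
Method: A multi-task learning framework with two variants, SpatialViLT and MaskedSpatialViLT, which SpatialEnsemble combines to improve spatial reasoning.
Result: Achieves state-of-the-art accuracy on the VSR dataset and excels in spatial reasoning categories such as directional, topological, and proximity relations.
Insight: Adding spatial features through multi-task learning can markedly improve the spatial reasoning of vision-language models and lays groundwork for real-world multimodal understanding.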
Abstract: Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.
[75] Denoising of Two-Phase Optically Sectioned Structured Illumination Reconstructions Using Encoder-Decoder Networks
Allison Davis,Yezhi Shen,Xiaoyu Ji,Fengqing Zhu
Main category: cs.CV
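TL;DR: This paper studies encoder-decoder networks for removing artifacts in two-phase optical-sectioning structured illumination (OS-SI). Training on synthetic data is shown to improve image quality.
Details
Motivation: In two-phase OS-SI, reduced acquisition time introduces residual artifacts that conventional denoising struggles to suppress. Deep learning is a promising alternative, but supervised training is limited by the lack of clean ground-truth data.
Contribution: 1) Trains encoder-decoder networks on synthetic pairs formed by applying real artifact fields to synthetic images; 2) Compares an asymmetrical denoising autoencoder (DAE) with a U-Net.
Method: 1) Train the asymmetrical DAE and the U-Net on synthetic data; 2) Evaluate both networks on real OS-SI images.
Result: Both networks improve image clarity, and each excels against different artifact types.
Insight: Synthetic data can support supervised denoising, and encoder-decoder networks may streamline OS-SI reconstruction workflows.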
Abstract: Structured illumination (SI) enhances image resolution and contrast by projecting patterned light onto a sample. In two-phase optical-sectioning SI (OS-SI), reduced acquisition time introduces residual artifacts that conventional denoising struggles to suppress. Deep learning offers an alternative to traditional methods; however, supervised training is limited by the lack of clean, optically sectioned ground-truth data. We investigate encoder-decoder networks for artifact reduction in two-phase OS-SI, using synthetic training pairs formed by applying real artifact fields to synthetic images. An asymmetrical denoising autoencoder (DAE) and a U-Net are trained on the synthetic data, then evaluated on real OS-SI images. Both networks improve image clarity, with each excelling against different artifact types. These results demonstrate that synthetic training enables supervised denoising of OS-SI images and highlight the potential of encoder-decoder networks to streamline reconstruction workflows.
[76] PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology
Sejuti Majumder,Saarthak Kapse,Moinak Bhattacharya,Xuan Xu,Alisa Yurovsky,Prateek Prasanna
Main category: cs.CV
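TL;DR: PEaRL is a multimodal framework that integrates histopathology with spatial transcriptomics through pathway activation scores, improving gene and pathway expression prediction.
Details
Motivation: Existing methods rely on a small set of highly variable genes and overlook the coordinated biological programs behind tissue phenotypes, which limits predictive scope and interpretability.
Contribution: Proposes the PEaRL framework, which computes pathway activation scores with ssGSEA, encodes them with a transformer, and aligns them with histology features via contrastive learning, improving prediction and interpretability.
Method: Compute pathway activation scores with ssGSEA, encode the pathway signals with a transformer, and align the two modalities with contrastive learning.
Result: On three cancer datasets, PEaRL outperforms SOTA methods in both gene- and pathway-level expression prediction, with increases in Pearson correlation coefficient of up to 58.9% and 20.4%, respectively.
Insight: Grounding transcriptomic representations in pathways yields multimodal models that are more biologically faithful and interpretable, advancing computational pathology.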
Abstract: Integrating histopathology with spatial transcriptomics (ST) provides a powerful opportunity to link tissue morphology with molecular function. Yet most existing multimodal approaches rely on a small set of highly variable genes, which limits predictive scope and overlooks the coordinated biological programs that shape tissue phenotypes. We present PEaRL (Pathway Enhanced Representation Learning), a multimodal framework that represents transcriptomics through pathway activation scores computed with ssGSEA. By encoding biologically coherent pathway signals with a transformer and aligning them with histology features via contrastive learning, PEaRL reduces dimensionality, improves interpretability, and strengthens cross-modal correspondence. Across three cancer ST datasets (breast, skin, and lymph node), PEaRL consistently outperforms SOTA methods, yielding higher accuracy for both gene- and pathway-level expression prediction (up to 58.9 percent and 20.4 percent increase in Pearson correlation coefficient compared to SOTA). These results demonstrate that grounding transcriptomic representation in pathways produces more biologically faithful and interpretable multimodal models, advancing computational pathology beyond gene-level embeddings.
[77] DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis
Numan Saeed,Tausifa Jan Saleem,Fadillah Maani,Muhammad Ridzuan,Hu Wang,Mohammad Yaqub
Main category: cs.CV
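TL;DR: This paper proposes DuPLUS, a dual-prompt vision-language deep learning framework that targets the lack of generality and of semantic understanding in medical image segmentation and prognosis prediction.
Details
Motivation: Medical image analysis usually relies on task-specific models that lack generality, while existing universal approaches are limited by simplistic conditioning and poor understanding of medical semantics. DuPLUS addresses these limits with hierarchical semantic prompts and a dual-prompt mechanism.
Contribution: 1. Proposes the DuPLUS framework, which combines hierarchical semantic prompts with a dual-prompt mechanism for fine-grained control and generality in medical image analysis. 2. Outperforms existing task-specific and universal models on most of the evaluated medical datasets. 3. Demonstrates extensibility by integrating electronic health record (EHR) data for prognosis prediction.
Method: DuPLUS is a vision-language framework that uses hierarchical semantic prompts and a dual-prompt mechanism for precise control over the task. It also supports parameter-efficient fine-tuning for rapid adaptation to new tasks and modalities.
Result: DuPLUS outperforms existing methods on 8 of 10 medical datasets. On a head and neck cancer dataset it reaches a Concordance Index (CI) of 0.69.
Insight: By combining vision-language modeling with multimodal data, DuPLUS shows the potential and flexibility of deep learning for medical image analysis, especially for generality and prognosis prediction.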
Abstract: Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing ‘universal’ approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS generalizes across three imaging modalities and ten anatomically diverse medical datasets, encompassing more than 30 organs and tumor types. It outperforms the state-of-the-art task-specific and universal models on 8 out of 10 datasets. We demonstrate the extensibility of its text-controlled architecture through seamless integration of electronic health record (EHR) data for prognosis prediction; on a head and neck cancer dataset, DuPLUS achieved a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52
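The prognosis result is reported as a Concordance Index. As background, the sketch below computes Harrell's C-index, the common definition for right-censored survival data; the abstract does not say which variant the authors used, and the toy data are invented for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed time per subject
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.5, 0.1]
ci = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives the reported 0.69 some context.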
[78] Real-Time Threaded Houbara Detection and Segmentation for Wildlife Conservation using Mobile Platforms
Lyes Saad Saoud,Loic Lesobre,Enrico Sorato,Irfan Hussain
Main category: cs.CV
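TL;DR: The paper proposes a real-time threaded framework for Houbara detection and segmentation on mobile platforms; it combines YOLOv10 detection with MobileSAM segmentation, improves computational efficiency, and reaches high accuracy on the Houbara, a conservation-priority species.
Details
Motivation: Wildlife conservation needs real-time, non-invasive monitoring, but existing techniques are constrained by limited computational resources and the cryptic appearance of many species. The paper targets an efficient real-time detection and segmentation solution.
Contribution: 1. Proposes a threaded two-stage YOLOv10 + MobileSAM framework that reduces latency; 2. Releases a curated Houbara dataset of 40,000 annotated images; 3. Achieves high-accuracy detection and segmentation on the Houbara.
Method: YOLOv10 performs detection and MobileSAM performs lightweight segmentation; the two stages run concurrently in threads to improve real-time performance.
Result: On the Houbara dataset the model reaches mAP50 = 0.9627, mAP75 = 0.7731, mAP95 = 0.7178, and a MobileSAM mIoU of 0.7421; YOLOv10 takes 43.7 ms per frame.
Insight: Threading improves real-time performance, and the lightweight MobileSAM segmenter performs well under tight resource constraints, making the approach practical for wildlife conservation.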
Abstract: Real-time animal detection and segmentation in natural environments are vital for wildlife conservation, enabling non-invasive monitoring through remote camera streams. However, these tasks remain challenging due to limited computational resources and the cryptic appearance of many species. We propose a mobile-optimized two-stage deep learning framework that integrates a Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation. Unlike prior YOLO+SAM pipelines, our approach improves real-time performance by reducing latency through threading. YOLOv10 handles detection while MobileSAM performs lightweight segmentation, both executed concurrently for efficient resource use. On the cryptic Houbara Bustard, a conservation-priority species, our model achieves mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and a MobileSAM mIoU of 0.7421. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness. We introduce a curated Houbara dataset of 40,000 annotated images to support model training and evaluation across diverse conditions. The code and dataset used in this study are publicly available on GitHub at https://github.com/LyesSaadSaoud/mobile-houbara-detseg. For interactive demos and additional resources, visit https://lyessaadsaoud.github.io/LyesSaadSaoud-Threaded-YOLO-SAM-Houbara.
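The abstract describes running detection and segmentation concurrently but gives no implementation detail. One common way to structure such a pipeline is a producer-consumer hand-off between two threads, sketched below. The `detect` and `segment` functions are hypothetical stand-ins rather than YOLOv10 or MobileSAM calls, and the queue-based design is an assumption, not the authors' Threading Detection Model.

```python
import queue
import threading


def detect(frame):
    """Hypothetical stand-in for a detector: one fake box per frame."""
    return [(0, 0, frame["width"], frame["height"])]


def segment(frame, boxes):
    """Hypothetical stand-in for a box-prompted segmenter: one label per box."""
    return [f"mask-{frame['id']}-{k}" for k, _ in enumerate(boxes)]


def run_pipeline(frames):
    """Detect in one thread while a second thread segments earlier frames."""
    handoff = queue.Queue(maxsize=4)  # bounded, so detection cannot run far ahead
    results = {}
    stop = object()  # sentinel that tells the consumer to finish

    def detection_worker():
        for frame in frames:
            handoff.put((frame, detect(frame)))
        handoff.put(stop)

    def segmentation_worker():
        while True:
            item = handoff.get()
            if item is stop:
                break
            frame, boxes = item
            results[frame["id"]] = segment(frame, boxes)

    workers = [
        threading.Thread(target=detection_worker),
        threading.Thread(target=segmentation_worker),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


frames = [{"id": i, "width": 640, "height": 480} for i in range(5)]
masks = run_pipeline(frames)
```

In CPython, the two stages overlap only while the underlying inference code releases the global interpreter lock, so the achievable speedup depends on the runtime used for the models.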
[79] Platonic Transformers: A Solid Choice For Equivariance
Mohammad Mohaiminul Islam,Rishabh Anand,David R. Wessels,Friso de Kruiff,Thijs P. Kuipers,Rex Ying,Clara I. Sánchez,Sharvaree Vadgama,Georg Bökman,Erik J. Bekkers
Main category: cs.CV
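TL;DR: Platonic Transformers define attention relative to reference frames from the Platonic solid symmetry groups, giving Transformers the inductive bias for geometric symmetry that they lack while preserving the efficiency and flexibility of the standard architecture.
Details
Motivation: Transformers lack inductive biases for the geometric symmetries common in science and computer vision, and existing equivariant methods are often inefficient and complex.
Contribution: Proposes the Platonic Transformer, which uses reference frames from the Platonic solid symmetry groups to achieve equivariance to continuous translations and Platonic symmetries while preserving the architecture and computational cost of a standard Transformer.
Method: Defines attention relative to reference frames from the Platonic solid symmetry groups, which induces weight sharing, and shows that this attention is formally equivalent to a dynamic group convolution.
Result: Achieves competitive performance on benchmarks including CIFAR-10, ScanObjectNN, QM9, and OMol25.
Insight: The equivalence between the attention mechanism and dynamic group convolution shows that the model learns adaptive geometric filters.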
Abstract: While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.
[80] From Scope to Script: An Automated Report Generation Model for Gastrointestinal Endoscopy
Evandros Kaklamanos,Kristjana Kristinsdottir,Jonathan Huang,Dustin Carlson,Rajesh Keswani,John Pandolfino,Mozziyar Etemadi
Main category: cs.CV
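TL;DR: This paper proposes a Transformer-based automated report generation model for gastrointestinal endoscopy, aiming to reduce physician workload and improve clinical efficiency.
Details
Motivation: Documenting gastrointestinal endoscopy procedures is burdensome, straining physicians and reducing clinical efficiency. The paper seeks to ease this burden through automated report generation.
Contribution: Proposes a Transformer model trained in two stages, combining a vision encoder and a text decoder, that generates clinical reports from endoscopic images.
Method: Two-stage training: pre-training on image/caption pairs to learn general vision-language features, followed by fine-tuning on endoscopic image/report pairs to generate clinical reports.
Result: The approach streamlines the documentation process and holds promise for reducing physician workload and improving patient care.
Insight: Automated report generation has broad potential in medicine, and combining pre-training with fine-tuning can improve a model's clinical applicability.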
Abstract: Endoscopic procedures such as esophagogastroduodenoscopy (EGD) and colonoscopy play a critical role in diagnosing and managing gastrointestinal (GI) disorders. However, the documentation burden associated with these procedures places significant strain on gastroenterologists, contributing to inefficiencies in clinical workflows and physician burnout. To address this challenge, we propose a novel automated report generation model that leverages a transformer-based vision encoder and text decoder within a two-stage training framework. In the first stage, both components are pre-trained on image/text caption pairs to capture generalized vision-language features, followed by fine-tuning on image/report pairs to generate clinically meaningful findings. Our approach not only streamlines the documentation process but also holds promise for reducing physician workload and improving patient care.
[81] SketchPlan: Diffusion Based Drone Planning From Human Sketches
Sixten Norelius,Aaron O. Feldman,Mac Schwager
Main category: cs.CV
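TL;DR: SketchPlan is a diffusion-based drone path planner that generates 3D flight paths from hand-drawn 2D sketches and, through zero-shot sim-to-real transfer, flies safely in real-world environments.
Details
Motivation: Drone path planning usually requires complex inputs or prior environment information, whereas a hand-drawn sketch is intuitive but imprecise. Turning a sketch into a safe 3D flight path efficiently and accurately is the challenge.
Contribution: 1. Proposes SketchPlan, comprising SketchAdapter, which maps sketches to projected 2D paths, and DiffPath, a diffusion model that generates 3D trajectories. 2. Builds a synthetic dataset of 32k flight paths and labels 872 of them with real human sketches for training. 3. Achieves zero-shot sim-to-real transfer and validates it in real-world environments.
Method: 1. SketchAdapter learns to map human sketches to projected 2D paths. 2. DiffPath, a diffusion model, infers 3D trajectories from 2D projections and a depth image. 3. Training combines auto-labeled synthetic data with human-labeled sketches.
Result: In real-world tests, SketchPlan achieved 100% success in low/medium clutter and 40% in unseen high-clutter environments, outperforming key ablations by 20-60% in task completion.
Insight: Training on a mix of auto-labeled and human-labeled data, together with a modular design, markedly improves the model's ability to interpret sketch intent and infer 3D paths.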
Abstract: We propose SketchPlan, a diffusion-based planner that interprets 2D hand-drawn sketches over depth images to generate 3D flight paths for drone navigation. SketchPlan comprises two components: a SketchAdapter that learns to map the human sketches to projected 2D paths, and DiffPath, a diffusion model that infers 3D trajectories from 2D projections and a first person view depth image. Our model achieves zero-shot sim-to-real transfer, generating accurate and safe flight paths in previously unseen real-world environments. To train the model, we build a synthetic dataset of 32k flight paths using a diverse set of photorealistic 3D Gaussian Splatting scenes. We automatically label the data by computing 2D projections of the 3D flight paths onto the camera plane, and use this to train the DiffPath diffusion model. However, since real human 2D sketches differ significantly from ideal 2D projections, we additionally label 872 of the 3D flight paths with real human sketches and use this to train the SketchAdapter to infer the 2D projection from the human sketch. We demonstrate SketchPlan’s effectiveness in both simulated and real-world experiments, and show through ablations that training on a mix of human labeled and auto-labeled data together with a modular design significantly boosts its capabilities to correctly interpret human intent and infer 3D paths. In real-world drone tests, SketchPlan achieved 100% success in low/medium clutter and 40% in unseen high-clutter environments, outperforming key ablations by 20-60% in task completion.
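The auto-labeling step projects 3D flight paths onto the camera plane. A minimal sketch of such a projection is shown below, assuming a pinhole camera, waypoints already expressed in the camera frame with z pointing forward, and illustrative intrinsics; the paper's actual camera model and conventions are not given in the abstract.

```python
def project_path(points_3d, fx, fy, cx, cy):
    """Project camera-frame 3D waypoints to pixel coordinates (pinhole model).

    Waypoints at or behind the camera plane (z <= 0) are skipped because
    they have no valid projection.
    """
    pixels = []
    for x, y, z in points_3d:
        if z <= 0:
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


# Hypothetical flight path (metres) and intrinsics for a 640x480 image.
path = [(0.0, 0.0, 1.0), (0.5, -0.2, 2.0), (1.0, 0.0, 4.0)]
pixels = project_path(path, fx=400.0, fy=400.0, cx=320.0, cy=240.0)
```

In this example the last two waypoints land in the same pixel column even though they differ in depth. A 2D path alone is ambiguous in this way, which is consistent with DiffPath also taking a depth image as input.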
[82] Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing
Danial Samadi Vahdati,Tai Duc Nguyen,Ekta Prashnani,Koki Nagano,David Luebke,Orazio Gallo,Matthew Stamm
Main category: cs.CV
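TL;DR: This paper proposes a new method based on biometric leakage to detect and defend against identity hijacking in AI-based videoconferencing; by separating identity cues from pose and expression, it provides an efficient real-time defense.
Details
Motivation: AI-based videoconferencing systems save bandwidth by transmitting a compact pose-expression latent, but an attacker can manipulate this latent to hijack a victim's likeness. Because every frame is synthetic, existing deepfake detectors fail, so a new defense is needed.
Contribution: Proposes a biometric leakage defense: a large-margin contrastive encoder that separates identity cues from transient pose and expression and detects identity hijacking in real time as the video is rendered.
Method: A pose-conditioned, large-margin contrastive encoder isolates identity cues in the transmitted latent from pose and expression, and a cosine test on the resulting embedding flags illicit identity swaps.
Result: Experiments on multiple talking-head generation models show that the method outperforms existing puppeteering defenses, runs in real time, and generalizes well to out-of-distribution scenarios.
Insight: Identity information persists in the pose-expression latent, whereas pose and expression are transient. This observation opens a new route to defending against identity hijacking.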
Abstract: AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim’s likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
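The final decision step is described as a simple cosine test on the disentangled embedding. The sketch below shows what such a test could look like, assuming an enrolled reference embedding for the expected identity and a fixed threshold; the embeddings, the threshold of 0.7, and the function names are illustrative assumptions, and the contrastive encoder that would produce the embeddings is not modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def flags_identity_swap(reference_embedding, live_embedding, threshold=0.7):
    """Flag a frame when the driving identity drifts from the reference.

    The threshold is an assumed value; in practice it would be tuned on
    validation data to trade off false alarms against missed attacks.
    """
    return cosine_similarity(reference_embedding, live_embedding) < threshold


# Hypothetical identity embeddings.
enrolled = [0.6, 0.8, 0.0]
same_speaker = [0.58, 0.81, 0.05]
other_speaker = [0.0, 0.1, 0.99]
genuine_flagged = flags_identity_swap(enrolled, same_speaker)
attack_flagged = flags_identity_swap(enrolled, other_speaker)
```

Because the test reads only the transmitted latent's embedding, it never needs the reconstructed RGB frames, which matches the design constraint stated in the abstract.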
[83] Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
Junbao Zhou,Yuan Zhou,Kesen Zhao,Qingshan Xu,Beier Zhu,Richang Hong,Hanwang Zhang
Main category: cs.CV
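TL;DR: The paper proposes the REVEL task and the DragStream method, which let users modify video content in real time by dragging and address latent distribution drift and context interference to produce high-quality edits.
Details
Motivation: Existing autoregressive video diffusion models struggle to offer streaming, fine-grained control, so their outputs often fail to match user expectations. REVEL and DragStream target this gap.
Contribution: Proposes REVEL, a task in which users edit video content in real time through drag operations, and DragStream, a method that addresses latent distribution drift and context interference.
Method: DragStream comprises an adaptive distribution self-rectification strategy, which uses neighboring frames' statistics to constrain the drift of latent embeddings, and a spatial-frequency selective optimization mechanism, which selectively propagates visual cues.
Result: DragStream integrates into existing autoregressive video diffusion models, and experiments confirm its effectiveness.
Insight: Latent distribution drift and context interference are the key challenges in streaming video editing; statistics from neighboring frames and selective optimization are effective remedies.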
Abstract: Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL), a new task that enables users to modify generated videos anytime on anything via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames with both supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: i) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; ii) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, DragStream, comprising: i) an adaptive distribution self-rectification strategy that leverages neighboring frames’ statistics to effectively constrain the drift of latent embeddings; ii) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference via selectively propagating visual cues along generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of our DragStream.
[84] GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image Analysis
Peiran Quan,Zifan Gu,Zhuo Zhao,Qin Zhou,Donghan M. Yang,Ruichen Rong,Yang Xie,Guanghua Xiao
Main category: cs.CV
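TL;DR: GAS-MIL is a flexible ensemble framework that uses multi-instance learning to combine the complementary strengths of multiple foundation models without manual feature selection or extensive task-specific tuning, and it performs well on several cancer datasets.
Details
Motivation: Foundation models (FMs) are powerful in digital pathology, but selecting and tuning a single FM for each task is time-consuming and resource-intensive. GAS-MIL aims to simplify deployment and improve performance by integrating features from multiple FMs. Contribution: Proposes the GAS-MIL framework, which automatically selects and integrates features from multiple FMs, preserving their complementary strengths and avoiding manual feature selection and task-specific tuning.
Method: A multi-instance learning (MIL) approach with a Group-Aggregative Selection strategy integrates features from several FMs into one ensemble.
Result: On classification tasks across three cancer datasets (prostate, ovarian, breast), GAS-MIL matches or outperforms individual FMs and established MIL methods.
Insight: GAS-MIL offers an efficient way to ensemble FMs in digital pathology and could extend to multimodal and precision-oncology applications.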
Abstract: Foundation models (FMs) have transformed computational pathology by providing powerful, general-purpose feature extractors. However, adapting and benchmarking individual FMs for specific diagnostic tasks is often time-consuming and resource-intensive, especially given their scale and diversity. To address this challenge, we introduce Group-Aggregative Selection Multi-Instance Learning (GAS-MIL), a flexible ensemble framework that seamlessly integrates features from multiple FMs, preserving their complementary strengths without requiring manual feature selection or extensive task-specific fine-tuning. Across classification tasks in three cancer datasets, prostate (PANDA), ovarian (UBC-OCEAN), and breast (TCGA-BrCa), GAS-MIL consistently achieves superior or on-par performance relative to individual FMs and established MIL methods, demonstrating its robustness and generalizability. By enabling efficient integration of heterogeneous FMs, GAS-MIL streamlines model deployment for pathology and provides a scalable foundation for future multimodal and precision oncology applications.
[85] Real-Time Assessment of Bystander Situation Awareness in Drone-Assisted First Aid
Shen Chang,Renran Tian,Nicole Adams,Nan Kong
Main category: cs.CV
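TL;DR: The paper proposes a video-based real-time framework for assessing bystander situational awareness (SA) in drone-assisted first aid, with a new dataset and a model that outperforms the baseline.
Details
Motivation: Rapid naloxone delivery by drone is a promising response to opioid overdose emergencies (OOEs), and bystander SA is critical to human-autonomy teaming. Real-time SA assessment methods are lacking, and the paper addresses this gap. Contribution: 1. Introduces the Drone-Assisted Naloxone Delivery Simulation Dataset (DANDSD); 2. Proposes a real-time SA assessment framework combining graph embeddings and transformer models; 3. Achieves high-performance SA prediction and better temporal segmentation accuracy than the baseline.
Method: The framework integrates visual perception and comprehension cues (geometric, kinematic, and interaction graph features) and uses graph embeddings and transformer models for real-time SA assessment.
Result: The method outperforms the FINCH baseline by 9% in Mean over Frames (MoF) and 5% in Intersection over Union (IoU).
Insight: Real-time SA assessment supports the development of adaptive drone systems that guide bystanders more effectively, improving emergency response outcomes.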
Abstract: Rapid naloxone delivery via drones offers a promising solution for responding to opioid overdose emergencies (OOEs), by extending lifesaving interventions to medically untrained bystanders before emergency medical services (EMS) arrive. Recognizing the critical role of bystander situational awareness (SA) in human-autonomy teaming (HAT), we address a key research gap in real-time SA assessment by introducing the Drone-Assisted Naloxone Delivery Simulation Dataset (DANDSD). This pioneering dataset captures HAT during simulated OOEs, where college students without medical training act as bystanders tasked with administering intranasal naloxone to a mock overdose victim. Leveraging this dataset, we propose a video-based real-time SA assessment framework that utilizes graph embeddings and transformer models to assess bystander SA in real time. Our approach integrates visual perception and comprehension cues–such as geometric, kinematic, and interaction graph features–and achieves high-performance SA prediction. It also demonstrates strong temporal segmentation accuracy, outperforming the FINCH baseline by 9% in Mean over Frames (MoF) and 5% in Intersection over Union (IoU). This work supports the development of adaptive drone systems capable of guiding bystanders effectively, ultimately improving emergency response outcomes and saving lives.
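The two temporal segmentation metrics named above can be sketched as follows. This assumes the common frame-level definitions (MoF as the fraction of correctly labeled frames, IoU as per-class frame overlap averaged over classes); the paper's exact evaluation protocol may differ, and the SA labels used here are hypothetical.

```python
def mean_over_frames(pred, gt):
    """Fraction of frames whose predicted label equals the ground-truth label."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def mean_iou(pred, gt):
    """Average per-class IoU, where each class is treated as a set of frame indices."""
    classes = sorted(set(gt) | set(pred))
    ious = []
    for c in classes:
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        ious.append(inter / union)  # union > 0 because c occurs in pred or gt
    return sum(ious) / len(ious)

# Hypothetical per-frame SA labels for a six-frame clip.
gt = ["perceive", "perceive", "comprehend", "comprehend", "act", "act"]
pred = ["perceive", "comprehend", "comprehend", "comprehend", "act", "act"]
mof = mean_over_frames(pred, gt)
miou = mean_iou(pred, gt)
```

MoF rewards long segments, while the class-averaged IoU weights short segments equally, which is why the two are usually reported together.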
[86] FrameOracle: Learning What to See and How Much to See in Videos
Chaoyu Li,Tianzhi Li,Fei Tao,Zhenyu Zhao,Ziqian Wu,Maozheng Zhao,Juntong Song,Cheng Niu,Pooyan Fazli
Main category: cs.CV
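TL;DR: FrameOracle is a lightweight plug-and-play module that predicts which video frames are most relevant to a query and how many are needed, improving the efficiency-accuracy trade-off of vision-language models (VLMs) for video understanding.
Details
Motivation: Existing frame sampling strategies (such as uniform sampling) cannot adapt to variations in information density or task complexity, causing inefficiency and information loss. Contribution: Proposes FrameOracle, which dynamically selects key frames and their number, and introduces FrameOracle-41K, described as the first large-scale VideoQA dataset with keyframe annotations.
Method: A four-stage curriculum: the first three stages rely on weak proxy signals (such as cross-modal similarity), and the final stage uses the stronger supervision of FrameOracle-41K.
Result: Across five VLMs and six benchmarks, FrameOracle reduces 16-frame inputs to an average of 10.4 frames with no loss in accuracy, and reduces 64-frame candidates to an average of 13.9 frames while improving accuracy by 1.4%.
Insight: Predicting both which frames and how many frames to use makes video understanding substantially more efficient; annotation quality matters for model performance.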
Abstract: Vision-language models (VLMs) have advanced video understanding, but their performance is limited by the number of input frames they can process. Existing frame sampling strategies, such as uniform or fixed-budget selection, often fail to adapt to variations in information density or task complexity, resulting in inefficiency and information loss. To address this, we present FrameOracle, a lightweight and plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained using a four-stage curriculum, with the first three stages relying on weak proxy signals such as cross-modal similarity. In the final stage, it leverages stronger supervision from a new dataset we introduce, FrameOracle-41K, the first large-scale VideoQA collection to provide keyframe annotations specifying the minimal set of frames required to answer each question. Extensive experiments across five VLMs and six benchmarks demonstrate that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without any loss in accuracy. When starting from 64-frame candidates, it reduces the input to an average of 13.9 frames while improving accuracy by 1.4%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.
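The two predictions FrameOracle makes (which frames, and how many) can be pictured with a small selection routine. It is only a sketch: the per-frame relevance scores and the budget `k` are assumed to come from some upstream predictor, and `select_frames` is a hypothetical helper, not the module's interface.

```python
def select_frames(scores, k):
    """Return the indices of the k highest-scoring frames, restored to temporal order."""
    k = max(1, min(k, len(scores)))  # clamp the budget to a valid range
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical query-relevance scores for six candidate frames and a budget of 3.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
kept = select_frames(scores, 3)
```

Restoring temporal order after ranking matters because a VLM typically expects frames in the order they occur in the video.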
[87] A Hybrid Co-Finetuning Approach for Visual Bug Detection in Video Games
Faliu Yi,Sherif Abdelfattah,Wei Huang,Adrian Brown
Main category: cs.CV
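TL;DR: The paper proposes a hybrid Co-FineTuning (CFT) method for visual bug detection in video games that combines labeled and unlabeled data, reduces dependence on labeled data from the target game, and outperforms conventional baselines.
Details
Motivation: Manually detecting visual bugs in games is costly and requires domain expertise, while supervised methods need large labeled datasets. The authors propose a hybrid method that exploits both labeled and unlabeled data. Contribution: A hybrid CFT method that uses labeled data from the target game and from related (co-domain) games together with unlabeled data, substantially reducing dependence on target-game labels while improving scalability and adaptability.
Method: Labeled samples from the target game and co-domain games are combined with unlabeled data to strengthen feature representation learning, so that all available data is used.
Result: CFT outperforms conventional baselines across multiple game environments and stays competitive even when trained with only 50% of the target game's labeled data.
Insight: Pooling labeled and unlabeled data across games can reduce manual labeling effort while improving detection, which is valuable for visual bug detection in real game development.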
Abstract: Manual identification of visual bugs in video games is a resource-intensive and costly process, often demanding specialized domain knowledge. While supervised visual bug detection models offer a promising solution, their reliance on extensive labeled datasets presents a significant challenge due to the infrequent occurrence of such bugs. To overcome this limitation, we propose a hybrid Co-FineTuning (CFT) method that effectively integrates both labeled and unlabeled data. Our approach leverages labeled samples from the target game and diverse co-domain games, additionally incorporating unlabeled data to enhance feature representation learning. This strategy maximizes the utility of all available data, substantially reducing the dependency on labeled examples from the specific target game. The developed framework demonstrates enhanced scalability and adaptability, facilitating efficient visual bug detection across various game titles. Our experimental results show the robustness of the proposed method for game visual bug detection, exhibiting superior performance compared to conventional baselines across multiple gaming environments. Furthermore, CFT maintains competitive performance even when trained with only 50% of the labeled data from the target game.
[88] Exploring the Hierarchical Reasoning Model for Small Natural-Image Classification Without Augmentation
Alexander V. Mantzaris
Main category: cs.CV
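TL;DR: The paper evaluates the Hierarchical Reasoning Model (HRM) on small-resolution image classification and finds that it performs well on MNIST but poorly on CIFAR-10 and CIFAR-100, where it overfits and lacks image-specific inductive bias.
Details
Motivation: To assess whether HRM is practical for small-resolution natural-image classification, particularly in a raw setting without data augmentation, compared with conventional convolutional architectures. Contribution: A systematic evaluation of HRM on small-resolution image classification that exposes its limits on harder datasets and analyzes why it underperforms even though optimization is stable.
Method: HRM combines Transformer-style modules, one-step (DEQ-style) training, deep supervision, Rotary Position Embeddings, and RMSNorm; it is evaluated on MNIST, CIFAR-10, and CIFAR-100 without data augmentation.
Result: HRM reaches about 98% test accuracy on MNIST but only 65.0% on CIFAR-10 and 29.7% on CIFAR-100, below a simple convolutional baseline (77.2% and 45.3%, respectively).
Insight: HRM lacks sufficient image-specific inductive bias for small natural images, which leads to overfitting and poor generalization; modifications to the model may improve it.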
Abstract: This paper asks whether the Hierarchical Reasoning Model (HRM) with the two Transformer-style modules $(f_L,f_H)$, one-step (DEQ-style) training, deep supervision, Rotary Position Embeddings, and RMSNorm can serve as a practical image classifier. It is evaluated on MNIST, CIFAR-10, and CIFAR-100 under a deliberately raw regime: no data augmentation, identical optimizer family with one-epoch warmup then cosine-floor decay, and label smoothing. HRM optimizes stably and performs well on MNIST ($\approx 98\%$ test accuracy), but on small natural images it overfits and generalizes poorly: on CIFAR-10, HRM reaches 65.0% after 25 epochs, whereas a two-stage Conv–BN–ReLU baseline attains 77.2% while training $\sim 30\times$ faster per epoch; on CIFAR-100, HRM achieves only 29.7% test accuracy despite 91.5% train accuracy, while the same CNN reaches 45.3% test with 50.5% train accuracy. Loss traces and error analyses indicate healthy optimization but insufficient image-specific inductive bias for HRM in this regime. It is concluded that, for small-resolution image classification without augmentation, HRM as it currently exists is not competitive with even simple convolutional architectures, although this does not exclude the possibility that modifications to the model could improve it greatly.
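The training regime mentions a one-epoch warmup followed by cosine-floor decay. A minimal sketch of such a schedule is given below, assuming linear warmup and a cosine curve that bottoms out at a fixed floor; the learning-rate values and steps per epoch are hypothetical, and only the 25-epoch horizon is taken from the abstract.

```python
import math

def lr_at(step, steps_per_epoch, total_epochs, base_lr=1e-3, floor_lr=1e-5):
    """Learning rate at a global step: linear warmup for one epoch, then cosine decay to a floor."""
    warmup = steps_per_epoch
    total = steps_per_epoch * total_epochs
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return floor_lr + 0.5 * (base_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 100 steps per epoch, 25 epochs.
schedule = [lr_at(s, 100, 25) for s in range(2500)]
```

The floor keeps the rate from reaching zero late in training, which is the usual reason for preferring it over a plain cosine schedule.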
[89] MonitorVLM: A Vision Language Framework for Safety Violation Detection in Mining Operations
Jiang Wu,Sichao Wu,Yinsong Ma,Guangyuan Yu,Haoyuan Xu,Lifang Zheng,Jingliang Duan
Main category: cs.CV
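TL;DR: MonitorVLM is a vision-language framework that detects safety violations in mining operations directly from surveillance video streams. A domain-specific VQA dataset, a module that dynamically selects relevant clauses, and a behavior magnifier module together improve detection performance substantially.
Details
Motivation: Industrial accidents in high-risk sectors such as mining are often caused by unsafe worker behavior; manual inspection is inefficient and error-prone, so automated, intelligent safety monitoring is needed. Contribution: 1) A domain-specific violation dataset of 9,000 VQA samples; 2) A clause filter (CF) module that dynamically selects the Top-K clauses, reducing latency; 3) A behavior magnifier (BM) module that improves fine-grained action recognition.
Method: MonitorVLM builds on a vision-language model, uses the clause filter module to select relevant clauses dynamically, and uses the behavior magnifier module to enhance worker regions.
Result: MonitorVLM improves precision, recall, and F1 score by 22.01%, 34.22%, and 28.37%, respectively, over the 72B unfine-tuned baseline.
Insight: Multimodal large models show strong potential for safety monitoring in high-risk industries and may extend to other domains.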
Abstract: Industrial accidents, particularly in high-risk domains such as surface and underground mining, are frequently caused by unsafe worker behaviors. Traditional manual inspection remains labor-intensive, error-prone, and insufficient for large-scale, dynamic environments, highlighting the urgent need for intelligent and automated safety monitoring. In this paper, we present MonitorVLM, a novel vision–language framework designed to detect safety violations directly from surveillance video streams. MonitorVLM introduces three key innovations: (1) a domain-specific violation dataset comprising 9,000 vision–question–answer (VQA) samples across 40 high-frequency mining regulations, enriched with augmentation and auxiliary detection cues; (2) a clause filter (CF) module that dynamically selects the Top-$K$ most relevant clauses, reducing inference latency by 13.56% while maintaining accuracy; and (3) a behavior magnifier (BM) module that enhances worker regions to improve fine-grained action recognition, yielding additional gains of 3.45% in precision and 8.62% in recall. Experimental results demonstrate that MonitorVLM significantly outperforms baseline vision–language models, achieving improvements of 22.01% in precision, 34.22% in recall, and 28.37% in F1 score over the 72B unfine-tuned baseline. A lightweight web-based interface further integrates MonitorVLM into practical workflows, enabling automatic violation reporting with video timestamping. This study highlights the potential of multimodal large models to enhance occupational safety monitoring in mining and beyond.
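The clause filter's Top-$K$ selection can be sketched as a similarity ranking. This assumes clauses and video content are compared by cosine similarity between embeddings, which the abstract does not state; the clause names and vectors below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_clauses(frame_emb, clause_embs, k):
    """Return the names of the k clauses most similar to the frame embedding."""
    ranked = sorted(clause_embs,
                    key=lambda name: cosine(frame_emb, clause_embs[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical clause embeddings and a frame embedding (3-D for readability).
clauses = {
    "helmet": [1.0, 0.0, 0.0],
    "harness": [0.0, 1.0, 0.0],
    "smoking": [0.0, 0.0, 1.0],
}
frame = [0.9, 0.4, 0.1]
selected = top_k_clauses(frame, clauses, 2)
```

Passing only the selected clauses to the vision-language model shortens the prompt, which is consistent with the latency reduction the abstract reports.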
[90] SAMSOD: Rethinking SAM Optimization for RGB-T Salient Object Detection
Zhengyi Liu,Xinrui Wang,Xianyong Fang,Zhengzheng Tu,Linbo Wang
Main category: cs.CV
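TL;DR: The paper proposes SAMSOD, which uses unimodal supervision and gradient deconfliction to optimize RGB-T salient object detection, addressing modality imbalance and gradient differences.
Details
Motivation: Fine-tuning the Segment Anything Model (SAM) for RGB-T salient object detection has overlooked the imbalanced convergence of the two modalities and the gradient difference between high and low activations, which limits performance. Contribution: 1. Unimodal supervision to strengthen learning of the non-dominant modality; 2. Gradient deconfliction to reduce the effect of conflicting gradients on convergence; 3. Two decoupled adapters that separately mask high- and low-activation neurons, emphasizing foreground objects by enhancing background learning.
Method: 1. Unimodal supervision for the non-dominant modality; 2. Gradient deconfliction; 3. Decoupled adapters that separate high- and low-activation neurons.
Result: Experiments on RGB-T SOD benchmarks and on additional datasets (scribble-supervised RGB-T SOD, RGB-D SOD, RGB-D rail surface defect detection) support the method's effectiveness.
Insight: Modality imbalance and gradient conflict are key obstacles in multimodal salient object detection and need targeted optimization.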
Abstract: RGB-T salient object detection (SOD) aims to segment attractive objects by combining RGB and thermal infrared images. To enhance performance, the Segment Anything Model has been fine-tuned for this task. However, the imbalanced convergence of the two modalities and the significant gradient difference between high- and low-activations are ignored, leaving room for further performance enhancement. In this paper, we propose a model called \textit{SAMSOD}, which utilizes unimodal supervision to enhance the learning of the non-dominant modality and employs gradient deconfliction to reduce the impact of conflicting gradients on model convergence. The method also leverages two decoupled adapters to separately mask high- and low-activation neurons, emphasizing foreground objects by enhancing background learning. Fundamental experiments on RGB-T SOD benchmark datasets and generalizability experiments on scribble-supervised RGB-T SOD, fully supervised RGB-D SOD datasets, and fully supervised RGB-D rail surface defect detection all demonstrate the effectiveness of our proposed method.
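The abstract does not spell out how gradient deconfliction works. One well-known way to reduce conflicts between two gradients is a projection in the style of PCGrad, sketched below as a stand-in; it is an assumption, not SAMSOD's confirmed procedure, and the gradient vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def deconflict(g_a, g_b):
    """If g_a conflicts with g_b (negative inner product), remove g_a's component along g_b."""
    d = dot(g_a, g_b)
    if d >= 0:
        return list(g_a)  # no conflict: leave the gradient unchanged
    scale = d / dot(g_b, g_b)
    return [x - scale * y for x, y in zip(g_a, g_b)]

# Hypothetical gradients from an RGB loss and a thermal loss.
g_rgb = [1.0, 0.0]
g_thermal = [-1.0, 1.0]
g_rgb_fixed = deconflict(g_rgb, g_thermal)
```

After projection the RGB gradient is orthogonal to the thermal one, so a step along it no longer works directly against the other modality's objective.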
[91] Referring Expression Comprehension for Small Objects
Kanoko Goto,Takumi Hirose,Mahiro Ukai,Shuhei Kurita,Nakamasa Inoue
Main category: cs.CV
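TL;DR: The paper addresses referring expression comprehension (REC) for small objects, introducing a new dataset (SOREC) and a progressive-iterative zooming adapter (PIZA) that significantly improves small-object localization accuracy.
Details
Motivation: REC matters in real-world applications such as autonomous driving, but existing methods localize small objects poorly. The paper takes on this challenge. Contribution: (1) Releases the SOREC dataset of 100,000 pairs of referring expressions and small-object bounding boxes; (2) Proposes PIZA, an adapter module for parameter-efficient fine-tuning that progressively zooms in on small objects.
Method: PIZA is a progressive-iterative zooming adapter that localizes small objects by zooming in on the target region step by step. The paper applies it to GroundingDINO and evaluates on SOREC.
Result: PIZA significantly improves GroundingDINO's accuracy on the SOREC dataset.
Insight: The work exposes the limits of existing REC methods on small objects and offers an efficient and scalable solution.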
Abstract: Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, codes and pre-trained models are publicly available on the project page.
[92] Artery-Vein Segmentation from Fundus Images using Deep Learning
Sharan SK,Subin Sahayam,Umarani Jayaraman,Lakshmi Priya A
Main category: cs.CV
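TL;DR: The paper proposes Attention-WNet, an attention-based deep learning model for segmenting retinal arteries and veins from fundus images, and reports that it outperforms existing state-of-the-art methods.
Details
Motivation: Separating retinal vessels into arteries and veins is a key step in retinal vessel analysis; it helps identify and diagnose retinal diseases and can indicate the risk of systemic vascular disease. Existing deep learning methods leave room for better segmentation accuracy. Contribution: Proposes Attention-WNet, a WNet model augmented with an attention mechanism, for artery-vein segmentation, and evaluates it on the public HRF and DRIVE datasets.
Method: An attention mechanism is incorporated into the WNet model to strengthen its focus on vessel regions and segment arteries and veins more accurately.
Result: On HRF and DRIVE, Attention-WNet outperforms other state-of-the-art models in the literature.
Insight: Attention mechanisms can improve retinal vessel segmentation, particularly for thin vessels against complex backgrounds.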
Abstract: Segmenting clinically important retinal blood vessels into arteries and veins is a prerequisite for retinal vessel analysis. Such analysis can provide potential insights and biomarkers for identifying and diagnosing various retinal eye diseases. Alteration in the regularity and width of the retinal blood vessels can act as an indicator of the health of the vasculature system all over the body. It can help identify patients at high risk of developing vasculature diseases like stroke and myocardial infarction. Over the years, various Deep Learning architectures have been proposed to perform retinal vessel segmentation. Recently, attention mechanisms have been increasingly used in image segmentation tasks. This work proposes a new Deep Learning approach for artery-vein segmentation. The new approach incorporates an attention mechanism into the WNet Deep Learning model; we call the resulting model Attention-WNet. The proposed approach has been tested on the publicly available HRF and DRIVE datasets. It outperforms other state-of-the-art models available in the literature.
[93] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
Leander Girrbach,Stephan Alaniz,Genevieve Smith,Trevor Darrell,Zeynep Akata
Main category: cs.CV
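TL;DR: The paper analyzes bias in the training data of large-scale vision-language models. By adding demographic annotations (such as perceived gender and race/ethnicity) to LAION-400M, it reveals demographic imbalances and harmful associations in the dataset and quantifies how these biases transfer to CLIP and Stable Diffusion.
Details
Motivation: To fill the gap left by the lack of demographic annotations in web-scale multimodal datasets and to examine the role of training data in model bias. Contribution: 1. Person-centric demographic annotations (perceived gender and race/ethnicity) for the full LAION-400M dataset. 2. Evidence of demographic imbalances and harmful associations in the dataset. 3. A quantification of how training-data bias transfers to downstream models such as CLIP and Stable Diffusion.
Method: Automatic labeling pipelines combining object detection, multimodal captioning, and fine-tuned classifiers produce demographic labels and bounding-box annotations.
Result: The dataset shows marked demographic imbalances, such as crime-related and negative content being disproportionately linked to men and to individuals perceived as Black or Middle Eastern; 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data.
Insight: The work provides large-scale empirical evidence linking dataset composition to downstream model bias and underlines the importance of debiasing data.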
Abstract: Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.
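A statement such as "60-70% of bias can be linearly explained by co-occurrences" is typically read as the coefficient of determination ($R^2$) of a linear fit. The sketch below assumes that reading, which the abstract does not confirm, and uses made-up numbers rather than anything from the paper.

```python
def r_squared(x, y):
    """R^2 of an ordinary least-squares line fitted to predict y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-concept values: data co-occurrence skew (x) and model bias score (y).
cooccurrence_skew = [0.0, 1.0, 2.0, 3.0]
model_bias = [0.1, 0.9, 2.2, 2.8]
explained = r_squared(cooccurrence_skew, model_bias)
```

An `explained` value of 0.65, for example, would correspond to the statement that 65% of the variance in model bias is accounted for by a linear function of the co-occurrence statistic.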
[94] Mapping Rio de Janeiro’s favelas: general-purpose vs. satellite-specific neural networks
Thomas Hallopeau,Joris Guérin,Laurent Demagistri,Youssef Fouzai,Renata Gracie,Vanderlei Pascoal De Matos,Helen Gurgel,Nadine Dessay
Main category: cs.CV
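TL;DR: The paper compares generic pretrained neural networks with a network pretrained specifically on satellite imagery for detecting Rio de Janeiro's favelas, asking whether task specificity or data volume matters more for performance.
Details
Motivation: Existing deep learning methods for detecting informal settlements (such as favelas) have not fully exploited recent pretrained networks, and the performance difference between generic and specialized networks is unclear. Contribution: A comparison of generic and satellite-specific pretrained networks on favela detection that examines the effect of task specificity versus data volume.
Method: Two kinds of pretrained networks are compared experimentally: (1) generic networks pretrained on large, diverse datasets; (2) a specialized network pretrained on satellite imagery.
Result: The abstract reports no quantitative outcome. The specialized network is closer to the target task, while the generic networks were pretrained on significantly more images; which factor wins is the question under study.
Insight: In remote sensing image analysis, choosing a pretrained network means weighing task specificity against data volume, and the right choice may depend on the task.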
Abstract: While deep learning methods for detecting informal settlements have already been developed, they have not yet fully utilized the potential offered by recent pretrained neural networks. We compare two types of pretrained neural networks for detecting the favelas of Rio de Janeiro: 1. Generic networks pretrained on large diverse datasets of unspecific images, 2. A specialized network pretrained on satellite imagery. While the latter is more specific to the target task, the former has been pretrained on significantly more images. Hence, this research investigates whether task specificity or data volume yields superior performance in urban informal settlement detection.
[95] LoRA Patching: Exposing the Fragility of Proactive Defenses against Deepfakes
Zuomin Qu,Yimao Guo,Qianyue Hu,Wei Lu
Main category: cs.CV
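TL;DR: LoRA Patching injects a Low-Rank Adaptation (LoRA) patch into Deepfake generators to bypass proactive defenses, exposing the fragility of current defenses; the paper also proposes defensive LoRA patching as a mitigation.
Details
Motivation: Deepfakes pose serious societal risks, which has motivated proactive defenses, but these defenses often lack robustness. The paper exposes the weakness and proposes a mitigation. Contribution: Proposes LoRA patching to bypass existing proactive defenses; introduces defensive LoRA patching as a complementary solution; designs a Multi-Modal Feature Alignment (MMFA) loss for semantic-level feature alignment.
Method: 1. Inject a LoRA patch into the Deepfake generator to bypass defenses; 2. Use a learnable gating mechanism to prevent gradient explosions; 3. Use the MMFA loss to align features.
Result: With only 1,000 facial examples and a single epoch of fine-tuning, LoRA patching defeats multiple proactive defenses, exposing the fragility of existing defenses.
Insight: Current proactive Deepfake defenses are easy to bypass, so more robust strategies are needed; LoRA patches are useful on both the attack and the defense side.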
Abstract: Deepfakes pose significant societal risks, motivating the development of proactive defenses that embed adversarial perturbations in facial images to prevent manipulation. However, in this paper, we show that these preemptive defenses often lack robustness and reliability. We propose a novel approach, Low-Rank Adaptation (LoRA) patching, which injects a plug-and-play LoRA patch into Deepfake generators to bypass state-of-the-art defenses. A learnable gating mechanism adaptively controls the effect of the LoRA patch and prevents gradient explosions during fine-tuning. We also introduce a Multi-Modal Feature Alignment (MMFA) loss, encouraging the features of adversarial outputs to align with those of the desired outputs at the semantic level. Beyond bypassing, we present defensive LoRA patching, embedding visible warnings in the outputs as a complementary solution to mitigate this newly identified security vulnerability. With only 1,000 facial examples and a single epoch of fine-tuning, LoRA patching successfully defeats multiple proactive defenses. These results reveal a critical weakness in current paradigms and underscore the need for more robust Deepfake defense strategies. Our code is available at https://github.com/ZOMIN28/LoRA-Patching.
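The structure of a gated low-rank patch can be sketched in a few lines. The sketch assumes the standard LoRA form $y = Wx + g \cdot B(Ax)$ with a scalar gate $g$; in the paper the gate is learnable and its exact form is not given in the abstract. All matrices below are hypothetical toy values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def gated_lora_forward(W, A, B, gate, x):
    """Frozen layer output plus a gated low-rank update: W x + gate * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + gate * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen generator weight (2x2)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B = [[0.5], [-0.5]]            # rank-1 up-projection (2x1)
x = [2.0, 4.0]

unpatched = gated_lora_forward(W, A, B, 0.0, x)
patched = gated_lora_forward(W, A, B, 1.0, x)
```

With the gate at 0 the layer behaves exactly like the original generator, which is what makes such a patch plug-and-play: it can be attached or scaled without touching the frozen weights.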
[96] The Overlooked Value of Test-time Reference Sets in Visual Place Recognition
Mubariz Zaffar,Liangliang Nan,Sebastian Scherer,Julian F. P. Kooij
Main category: cs.CV
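TL;DR: The paper proposes Reference-Set-Finetuning (RSF), which uses the test-time reference set to bridge the train-test domain gap in Visual Place Recognition (VPR) and improves existing state-of-the-art models.
Details
Motivation: VPR methods trained on large, diverse datasets do well on many benchmarks but struggle when test environments differ significantly from the training data. The test-time reference set (the "map") contains images and poses from the target domain yet is underused. Contribution: Proposes RSF, which fine-tunes VPR models on the test-time reference set and improves performance on challenging datasets (about 2.3% higher Recall@1 on average).
Method: RSF is a simple fine-tuning strategy: a state-of-the-art VPR model is fine-tuned on the reference set (map), which is available before test queries arrive, to adapt it to the target domain.
Result: RSF improves VPR performance on several challenging datasets while the fine-tuned models retain generalization.
Insight: The test-time reference set is an underused source of information; simple fine-tuning on it improves cross-domain performance.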
Abstract: Given a query image, Visual Place Recognition (VPR) is the task of retrieving an image of the same place from a reference database with robustness to viewpoint and appearance changes. Recent works show that some VPR benchmarks are solved by methods using Vision-Foundation-Model backbones and trained on large-scale and diverse VPR-specific datasets. Several benchmarks remain challenging, particularly when the test environments differ significantly from the usual VPR training datasets. We propose a complementary, unexplored source of information to bridge the train-test domain gap, which can further improve the performance of State-of-the-Art (SOTA) VPR methods on such challenging benchmarks. Concretely, we identify that the test-time reference set, the “map”, contains images and poses of the target domain, and must be available before the test-time query is received in several VPR applications. Therefore, we propose to perform simple Reference-Set-Finetuning (RSF) of VPR models on the map, boosting the SOTA (~2.3% increase on average for Recall@1) on these challenging datasets. Finetuned models retain generalization, and RSF works across diverse test datasets.
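Recall@1, the metric quoted above, counts a query as correct when its single best-matching reference image shows the same place. A minimal sketch follows, assuming a precomputed query-by-reference similarity matrix and a set of correct reference indices per query; the numbers are hypothetical.

```python
def recall_at_1(similarity, positives):
    """Fraction of queries whose top-1 retrieved reference is a true match."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        best_ref = max(range(len(row)), key=lambda j: row[j])
        hits += best_ref in positives[query_idx]
    return hits / len(similarity)

# Hypothetical similarities: 3 queries (rows) against 3 reference images (columns).
similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.4, 0.5, 0.1],
]
positives = [{0}, {1}, {1, 2}]   # correct reference indices for each query
score = recall_at_1(similarity, positives)
```

In VPR benchmarks the positive set is usually defined by a distance threshold around the query's ground-truth pose, which is why a query can have more than one correct reference.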
[97] Contrastive-SDE: Guiding Stochastic Differential Equations with Contrastive Learning for Unpaired Image-to-Image Translation
Venkata Narendra Kotyada,Revanth Eranki,Nagesh Bhattu Sristy
Main category: cs.CV
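TL;DR: The paper proposes Contrastive-SDE, a framework combining contrastive learning with score-based diffusion models for unpaired image-to-image translation; it preserves domain-invariant features and guides SDE inference, giving efficient translation with results comparable to the state of the art.
Details
Motivation: Unpaired image translation has no aligned samples, making cross-domain mappings hard for conventional generative models to learn. Diffusion models excel at generation and contrastive learning at learning without supervision; combining them can improve both quality and efficiency. Contribution: 1. A time-dependent contrastive learning approach, trained with SimCLR, that preserves domain-invariant features; 2. Guided inference of a pretrained SDE model using the contrastive model; 3. Validation on several unpaired translation tasks, with fast convergence and no supervision.
Method: 1. Train a contrastive model with SimCLR, treating an image and its domain-invariant feature as a positive pair; 2. Use the learned contrastive model to guide SDE inference; 3. Combine the generative capacity of diffusion models with the feature-preserving capacity of contrastive learning.
Result: On three common unpaired translation tasks, Contrastive-SDE is comparable to the state of the art on several metrics, converges significantly faster, and needs no label supervision or classifier training.
Insight: Combining contrastive learning with diffusion models preserves semantic consistency, and the efficiency of the unsupervised approach makes it attractive in practice.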
Abstract: Unpaired image-to-image translation involves learning mappings between source domain and target domain in the absence of aligned or corresponding samples. Score-based diffusion models have demonstrated state-of-the-art performance in generative tasks. Their ability to approximate complex data distributions through stochastic differential equations (SDEs) enables them to generate high-fidelity and diverse outputs, making them particularly well-suited for unpaired I2I settings. In parallel, contrastive learning provides a powerful framework for learning semantic similarities without the need for explicit supervision or paired data. By pulling together representations of semantically similar samples and pushing apart dissimilar ones, contrastive methods are inherently aligned with the objectives of unpaired translation. Its ability to selectively enforce semantic consistency at the feature level makes contrastive learning particularly effective for guiding generation in unpaired scenarios. In this work, we propose a time-dependent contrastive learning approach where a model is trained with SimCLR by considering an image and its domain-invariant feature as a positive pair, enabling the preservation of domain-invariant features and the discarding of domain-specific ones. The learned contrastive model then guides the inference of a pretrained SDE for the I2I translation task. We empirically compare Contrastive-SDE with several baselines across three common unpaired I2I tasks, using four metrics for evaluation. Contrastive-SDE achieves comparable results to the state-of-the-art on several metrics. Furthermore, we observe that our model converges significantly faster and requires no label supervision or classifier training, making it a more efficient alternative for this task.
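SimCLR trains with the NT-Xent contrastive loss, which pulls a positive pair together and pushes negatives apart. The sketch below is a single-anchor simplification (the full SimCLR loss is symmetric and computed over a batch); the temperature value and the toy vectors are hypothetical, and the paper's time-dependent variant is not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nt_xent(anchor, positive, negatives, temperature=0.5):
    """NT-Xent loss for one anchor: -log( exp(s_pos/t) / (exp(s_pos/t) + sum exp(s_neg/t)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical embeddings: an image, its domain-invariant feature, and one unrelated sample.
image = [1.0, 0.0]
invariant_feature = [1.0, 0.0]
unrelated = [0.0, 1.0]

aligned_loss = nt_xent(image, invariant_feature, [unrelated])
misaligned_loss = nt_xent(image, unrelated, [invariant_feature])
```

The loss is small when the image and its domain-invariant feature are close in embedding space and grows when the roles are swapped, which is the signal the contrastive model learns from.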
[98] LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Xueyang Zhou,Yangming Xu,Guiyao Tie,Yongchao Chen,Guowen Zhang,Duanfeng Chu,Pan Zhou,Lichao Sun
Main category: cs.CV
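TL;DR: LIBERO-PRO extends the LIBERO benchmark with systematic perturbations to evaluate the generalization of VLA models, revealing that existing models rely on memorization rather than understanding.
Details
Motivation: The current LIBERO training and evaluation settings are problematic, inflating performance estimates and preventing fair comparison; an improved benchmark is needed to assess genuine understanding and generalization. Contribution: Proposes LIBERO-PRO, which introduces perturbations along four dimensions (manipulated objects, initial states, task instructions, environments) to evaluate VLA models robustly and fairly.
Method: Extends LIBERO with perturbation experiments: replacing manipulated objects, changing initial states, corrupting task instructions, and altering environments.
Result: Existing models exceed 90% accuracy under the standard LIBERO evaluation but collapse to 0.0% under LIBERO-PRO's perturbed setting, indicating reliance on memorization rather than understanding.
Insight: Current VLA evaluation practice is misleading; evaluation should focus on generalization and genuine task understanding rather than headline scores.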
Abstract: LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models’ reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.
[99] Mirage: Unveiling Hidden Artifacts in Synthetic Images with Large Vision-Language Models
Pranav Sharma,Shivank Garg,Durga Toshniwal
Main category: cs.CV
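TL;DR: Mirage is a dataset of synthetic images with visible artifacts. The study finds that large vision-language models (LVLMs) detect such artifacts effectively but perform worse on images without them.
Details
Motivation: Some AI-generated images remain distinguishable by humans yet are hard for standard detectors to identify; the study explores the potential of LVLMs for explainable AI image detection. Contribution: 1. Introduces the Mirage dataset, a diverse set of synthetic images with visible artifacts. 2. Shows that LVLMs are effective at detecting images with artifacts but limited on images without them.
Method: 1. Build the Mirage dataset of synthetic images with visible artifacts. 2. Run LVLM experiments on Mirage and on existing benchmarks, comparing detection on images with and without visible artifacts.
Result: LVLMs are highly effective on images with visible artifacts, but their performance declines on images lacking such cues.
Insight: LVLMs are a strong tool for artifact-based detection, but detecting high-quality generated images still needs improvement.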
Abstract: Recent advances in image generation have led to models that produce synthetic images that are increasingly difficult for standard AI detectors to identify, even though they often remain distinguishable by humans. To study this discrepancy, we introduce \textbf{Mirage}, a curated dataset comprising a diverse range of AI-generated images exhibiting visible artifacts, where current state-of-the-art detection methods largely fail. Furthermore, we investigate whether Large Vision-Language Models (LVLMs), which are increasingly employed as substitutes for human judgment in various tasks, can be leveraged for explainable AI image detection. Our experiments on both Mirage and existing benchmark datasets demonstrate that while LVLMs are highly effective at detecting AI-generated images with visible artifacts, their performance declines when confronted with images lacking such cues.
[100] UGround: Towards Unified Visual Grounding with Unrolled Transformers
Rui Qian,Xin Yin,Chuanhang Deng,Zhiyuan Peng,Jian Xiong,Wei Zhai,Dejing Dou
Main category: cs.CV
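TL;DR: UGround proposes a unified visual grounding paradigm that dynamically selects intermediate layers of unrolled transformers as "mask as prompt", addressing the error accumulation of the fixed last layer and the lack of explicit spatial cues in the text embedding space. Its core is Policy-Prompted Masking (SSC and MasP), which yields a single framework for multiple tasks.
Details
Motivation: Existing visual grounding methods rely on the fixed last hidden layer, which amplifies accumulated errors and lacks explicit spatial cues. UGround aims to select intermediate layers dynamically and provide explicit spatial prompts, improving performance and flexibility. Contribution: 1. Unifies, reportedly for the first time, multiple visual grounding tasks (such as referring expression segmentation and reasoning segmentation) in a single framework; 2. Proposes Policy-Prompted Masking (SSC and MasP) to select intermediate layers dynamically and generate explicit spatial masks.
Method: 1. Stochastic Skip Connection (SSC): a reinforcement learning policy dynamically selects \texttt{
Result: UGround performs well across visual grounding tasks (single-target, multi-target, false-premise, and others), supporting the unified paradigm.
Insight: Dynamic layer selection and explicit spatial prompts can improve performance and generalization in visual grounding and suggest a route to multi-task unification.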
Abstract: We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as ``mask as prompt'', diverging from the prevailing pipeline that leverages the fixed last hidden layer as \texttt{
[101] DHQA-4D: Perceptual Quality Assessment of Dynamic 4D Digital Human
Yunhao Li,Sijing Wu,Yucheng Zhu,Huiyu Duan,Zicheng Zhang,Guangtao Zhai
Main category: cs.CV
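TL;DR: The paper introduces DHQA-4D, a large-scale quality assessment dataset for dynamic 4D digital humans, and DynaMesh-Rater, a method that uses a large multimodal model (LMM) to assess the quality of textured and non-textured 4D meshes.
Details
Motivation: As 4D digital human meshes are used in more fields, degradation from noise is increasingly a problem, and an effective quality assessment method is needed. Contribution: 1) The DHQA-4D dataset, containing high-quality and distorted 4D mesh sequences with subjective scores; 2) DynaMesh-Rater, which assesses quality from visual, motion, and geometry features.
Method: DynaMesh-Rater extracts visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D mesh, integrates them with an LMM, and applies LoRA-based instruction tuning to predict quality scores.
Result: Experiments show that DynaMesh-Rater outperforms previous quality assessment methods on DHQA-4D.
Insight: Fusing multi-dimensional features improves the accuracy of 4D digital human quality assessment.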
Abstract: With the rapid development of 3D scanning and reconstruction technologies, dynamic digital human avatars based on 4D meshes have become increasingly popular. A high-precision dynamic digital human avatar can be applied to various fields such as game production, animation generation, and remote immersive communication. However, these 4D human avatar meshes are prone to being degraded by various types of noise during the processes of collection, compression, and transmission, thereby affecting the viewing experience of users. In light of this fact, quality assessment of dynamic 4D digital humans becomes increasingly important. In this paper, we first propose a large-scale dynamic digital human quality assessment dataset, DHQA-4D, which contains 32 high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D human meshes degraded by 11 textured distortions, as well as their corresponding textured and non-textured mean opinion scores (MOSs). Equipped with the DHQA-4D dataset, we analyze the influence of different types of distortion on human perception for textured dynamic 4D meshes and non-textured dynamic 4D meshes. Additionally, we propose DynaMesh-Rater, a novel large multimodal model (LMM) based approach that is able to assess both textured 4D meshes and non-textured 4D meshes. Concretely, DynaMesh-Rater extracts multi-dimensional features, including visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D human mesh, to provide comprehensive quality-related information. Then we utilize an LMM to integrate the multi-dimensional features and apply a LoRA-based instruction tuning technique to teach the LMM to predict the quality scores. Extensive experimental results on the DHQA-4D dataset demonstrate the superiority of our DynaMesh-Rater method over previous quality assessment methods.
[102] Multi-Modal Oral Cancer Detection Using Weighted Ensemble Convolutional Neural Networks
Ajo Babu George,Sreehari J R
Main category: cs.CV
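TL;DR: This paper proposes a multimodal oral cancer detection method based on a weighted ensemble of convolutional neural networks (CNNs). By combining clinical, radiological, and histopathological images, it aims to improve early detection of oral squamous cell carcinoma (OSCC).
Details
Motivation: Late diagnosis of OSCC is a major cause of its high mortality. Conventional single-modality detection methods are limited, so a multimodal fusion approach is needed to improve detection accuracy. Contribution: The main contribution is a multimodal deep learning framework that uses a weighted ensemble of DenseNet-121 CNNs to combine several medical imaging modalities and improve the robustness of OSCC diagnosis.
Method: The paper uses publicly available datasets for three imaging modalities and trains a DenseNet-121 CNN on each via transfer learning, with data augmentation and modality-specific preprocessing to increase robustness. Predictions are then fused with a validation-weighted ensemble strategy.
Result: On validation data, accuracy reaches 100% for the radiological modality and 95.12% for the histopathological modality, while clinical images score lower (63.10%). The ensemble model reaches an overall accuracy of 84.58% on a multimodal validation set of 55 samples.
Insight: Multimodal fusion can compensate for the limitations of a single modality; in particular, when clinical images perform poorly, the ensemble improves diagnostic robustness. The framework offers a non-invasive, AI-assisted triage tool that may help reduce diagnostic delays.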
Abstract: Aims: Late diagnosis of Oral Squamous Cell Carcinoma (OSCC) contributes significantly to its high global mortality rate, with over 50% of cases detected at advanced stages and a 5-year survival rate below 50% according to WHO statistics. This study aims to improve early detection of OSCC by developing a multimodal deep learning framework that integrates clinical, radiological, and histopathological images using a weighted ensemble of DenseNet-121 convolutional neural networks (CNNs). Material and Methods: A retrospective study was conducted using publicly available datasets representing three distinct medical imaging modalities. Each modality-specific dataset was used to train a DenseNet-121 CNN via transfer learning. Augmentation and modality-specific preprocessing were applied to increase robustness. Predictions were fused using a validation-weighted ensemble strategy. Evaluation was performed using accuracy, precision, recall, and F1-score. Results: High validation accuracy was achieved for radiological (100%) and histopathological (95.12%) modalities, with clinical images performing lower (63.10%) due to visual heterogeneity. The ensemble model demonstrated improved diagnostic robustness with an overall accuracy of 84.58% on a multimodal validation dataset of 55 samples. Conclusion: The multimodal ensemble framework bridges gaps in the current diagnostic workflow by offering a non-invasive, AI-assisted triage tool that enhances early identification of high-risk lesions. It supports clinicians in decision-making, aligning with global oncology guidelines to reduce diagnostic delays and improve patient outcomes.
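The validation-weighted fusion mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each modality model outputs a class-probability vector for the same case and that the ensemble weights are the validation accuracies normalized to sum to one. The abstract does not give the exact weighting formula, and the function name and the sample probabilities are hypothetical.

```python
def weighted_ensemble(probs_by_modality, val_acc_by_modality):
    """Fuse per-modality class probabilities using validation-accuracy weights."""
    total = sum(val_acc_by_modality[m] for m in probs_by_modality)
    n_classes = len(next(iter(probs_by_modality.values())))
    fused = [0.0] * n_classes
    for modality, probs in probs_by_modality.items():
        weight = val_acc_by_modality[modality] / total  # normalized weight (assumption)
        for k, p in enumerate(probs):
            fused[k] += weight * p
    return fused


# Validation accuracies reported in the abstract.
val_acc = {"radiological": 1.0, "histopathological": 0.9512, "clinical": 0.6310}

# Hypothetical per-modality outputs for one case: [P(benign), P(OSCC)].
probs = {
    "radiological": [0.10, 0.90],
    "histopathological": [0.20, 0.80],
    "clinical": [0.70, 0.30],
}

fused = weighted_ensemble(probs, val_acc)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With this weighting, the weak clinical model still contributes but cannot overrule two confident, more reliable modalities.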
[103] Exploring Instruction Data Quality for Explainable Image Quality Assessment
Yunhao Li,Sijing Wu,Huiyu Duan,Yucheng Zhu,Qi Jia,Guangtao Zhai
Main category: cs.CV
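TL;DR: This paper challenges the scaling law by studying how the quality of instruction-tuning data affects explainable image quality assessment (IQA), and proposes IQA-Select, a clustering-based data selection method that cuts computational cost while improving performance.
Details
Motivation: The rapid progress of multimodal large language models (MLLMs) has made explainable IQA popular, but large-scale instruction-tuning datasets can incur high computational cost and contain redundant data. The paper investigates the role of instruction data quality in order to reduce redundancy and improve model performance. Contribution: 1. Challenges the scaling law, showing that training on a random subset at an appropriate ratio can outperform training on the full dataset; 2. Proposes the clustering-based data selection framework IQA-Select, which surpasses full fine-tuning using only 10% of the data.
Method: 1. Uses a pre-trained MLLM to study how fine-tuning performance changes with the size of the instruction data; 2. Designs a three-stage clustering-based data selection framework: clustering feature extraction, cluster quota allocation, and cluster sampling strategy.
Result: On Q-Bench and AesBench, IQA-Select reaches 102.1% and 103.7% of full fine-tuning performance, respectively, using only 10% of the data, which substantially reduces computational cost.
Insight: Data quality matters more than quantity; clustering can effectively identify redundant data, improving both training efficiency and model performance.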
Abstract: In recent years, with the rapid development of powerful multimodal large language models (MLLMs), explainable image quality assessment (IQA) has gradually become popular, aiming at providing quality-related descriptions and answers of images. To achieve this goal, recent methods seek to construct a large-scale instruction tuning dataset to empower the MLLM with quality perception ability following the well-known scaling law. However, a large amount of instruction tuning data may cause substantial computational costs and redundant data, which in turn will cause harm to the performance of the model. To cope with this problem, in this paper, we challenge the scaling law and systematically investigate the role of data quality of the instruction tuning dataset for explainable IQA. Using a powerful pre-trained MLLM, we first investigate the changes in model performance after fine-tuning with different sizes of instruction tuning data. We find that selecting a subset of the data set randomly using an appropriate ratio can even lead to better results than training with the entire instruction tuning dataset, demonstrating the redundancy of current explainable IQA instruction tuning data. Beyond randomly sampling a subset, we propose a clustering-based data selection framework with three stages: clustering feature extraction, cluster quota allocation, and cluster sampling strategy. Then we systematically analyze the choices of each stage and propose a simple but efficient data selection method IQA-Select for explainable IQA. The experimental results demonstrate that IQA-Select can achieve 102.1% and 103.7% performance of full fine-tuning using only 10% selected data in Q-Bench and AesBench respectively, significantly reducing computational costs while achieving better performance.
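The second and third stages of the selection framework (cluster quota allocation and cluster sampling) can be illustrated with a short sketch. It assumes cluster labels have already been computed from some feature extractor, uses a quota proportional to cluster size, and samples uniformly at random within each cluster. These are illustrative choices only; the abstract says the authors analyze several options per stage and does not state which ones IQA-Select uses. The function name is hypothetical.

```python
import random
from collections import defaultdict


def select_by_cluster(cluster_ids, ratio, seed=0):
    """Return sorted indices of a subset drawn proportionally from each cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    selected = []
    for cid in sorted(members):
        # Quota allocation: proportional to cluster size, at least one sample.
        quota = max(1, round(ratio * len(members[cid])))
        # Sampling strategy: uniform random within the cluster (assumption).
        selected.extend(rng.sample(members[cid], quota))
    return sorted(selected)


# Hypothetical cluster assignment for 100 instruction samples.
cluster_ids = [0] * 50 + [1] * 30 + [2] * 20
subset = select_by_cluster(cluster_ids, ratio=0.10)
```

Other quota rules (for example, favoring small or diverse clusters) would slot into the same place as the proportional rule.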
[104] Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert
Mingyu Liu,Zheng Huang,Xiaoyi Lin,Muzhi Zhu,Canyu Zhao,Zongze Du,Yating Wang,Haoyi Zhu,Hao Chen,Chunhua Shen
Main category: cs.CV
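TL;DR: This paper proposes a framework built around a generalizable action expert. It uses sparse 3D trajectories as an intermediate representation to connect the high-level planning of a VLM with a low-level physical action module, addressing the poor physical-world generalization of conventional VLA models.
Details
Motivation: Conventional Vision-Language-Action (VLA) models generalize poorly because they depend on scarce, narrow-domain data. Recent dual-system approaches try to decouple "thinking" from "acting" but remain limited by semantic ambiguity in the action module, which makes large-scale cross-task training infeasible. The paper aims to remove these limitations. Contribution: 1. Proposes what the authors describe as the first framework centered on a generalizable action expert; 2. Uses sparse 3D trajectories as an intermediate representation linking VLM planning and the action module; 3. Introduces an "Action Pre-training, Pointcloud Fine-tuning" paradigm to improve training efficiency and robust generalization.
Method: 1. The VLM generates coarse 3D waypoints; 2. The generalizable action expert refines these waypoints into dense, executable action sequences by sampling real-time point cloud observations of the environment; 3. The action expert is trained with the "Action Pre-training, Pointcloud Fine-tuning" paradigm.
Result: The method combines the broad generalization of VLMs in visual understanding and planning with the fine-grained, action-level generalization of the action expert.
Insight: Sparse 3D trajectories are an effective bridge between high-level planning and low-level execution; a generalizable action module is key to cross-task generalization.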
Abstract: Although Vision-Language Models (VLM) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models, which integrate reasoning and action into a monolithic architecture, generalize poorly because they are constrained by scarce, narrow-domain data. While recent dual-system approaches attempt to decouple “thinking” from “acting”, they are often constrained by semantic ambiguities within the action module. This ambiguity makes large-scale, cross-task training infeasible. Consequently, these systems typically necessitate fine-tuning on newly collected data when deployed to novel environments, and the cooperation mechanism between the two systems remains ill-defined. To address these limitations, we introduce, for the first time, a framework centered around a generalizable action expert. Our approach utilizes sparse 3D trajectories as an intermediate representation, effectively bridging the high-level planning capabilities of the VLM with the low-level physical action module. During the planning phase, the VLM is only required to generate coarse 3D waypoints. These waypoints are then processed by our generalizable action expert, which refines them into dense, executable action sequences by sampling real-time point cloud observations of the environment. To promote training efficiency and robust generalization, we introduce a novel “Action Pre-training, Pointcloud Fine-tuning” paradigm. Our method combines the broad generalization capabilities of VLMs in visual understanding and planning with the fine-grained, action-level generalization of action expert.
[105] Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models
Md. Atabuzzaman,Andrew Zhang,Chris Thomas
Main category: cs.CV
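TL;DR: This paper proposes a new method for zero-shot fine-grained image classification with large vision-language models (LVLMs). It recasts the task as visual question answering and adds an attention intervention technique, which improves performance.
Details
Motivation: LVLMs perform well on vision-language reasoning tasks, but their potential for zero-shot fine-grained image classification remains underexplored. The paper investigates LVLMs on this task. Contribution: 1. A method that recasts zero-shot fine-grained classification as a visual question-answering framework; 2. A novel attention intervention technique; 3. More comprehensive and precise class description benchmarks.
Method: The method uses a visual question-answering framework to draw on the LVLM's comprehensive understanding rather than generating class names directly. An attention intervention technique further improves performance.
Result: Across multiple fine-grained image classification benchmarks, the method consistently outperforms the current state-of-the-art approach.
Insight: LVLMs show strong potential for zero-shot fine-grained classification; attention intervention and high-quality class descriptions are key to better performance.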
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task requiring precise differentiation between visually similar categories, remains underexplored. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework, leveraging LVLMs’ comprehensive understanding capabilities rather than relying on direct class name generation. We enhance model performance through a novel attention intervention technique. We also address a key limitation in existing datasets by developing more comprehensive and precise class description benchmarks. We validate the effectiveness of our method through extensive experimentation across multiple fine-grained image classification benchmarks. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both the effectiveness of our method and the broader potential of LVLMs for zero-shot fine-grained classification tasks. Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification
[106] From Filters to VLMs: Benchmarking Defogging Methods through Object Detection and Segmentation Performance
Ardalan Aryashad,Parsa Razmara,Amin Mahjoub,Seyedarmin Azizi,Mahdi Salmani,Arad Firouzkouhi
Main category: cs.CV
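TL;DR: Through a systematic benchmark, this paper studies how different defogging methods, from classical filters to vision-language models, affect downstream object detection and segmentation. It shows when defogging actually helps and why task-oriented evaluation matters.
Details
Motivation: Autonomous driving perception degrades in fog. Existing defogging methods improve image quality, but the gains do not consistently carry over to downstream tasks, and prior evaluations often rely on synthetic data. A task-oriented benchmark is therefore needed. Contribution: A comprehensive benchmark covering classical filters, modern defogging networks, chained variants, and vision-language models (VLMs), evaluated for both image quality and downstream task performance on Foggy Cityscapes.
Method: Applies several defogging pipelines (filters, deep learning models, chained combinations, VLM-based editing) and evaluates them with object detection (mAP) and segmentation (PQ, RQ, SQ) metrics. It also analyzes how rubric-based scores from a VLM judge correlate with the task metrics.
Result: The study shows when defogging helps, when chaining yields synergy or degradation, and how VLM-based editors compare with dedicated approaches. The VLM judge's scores correlate strongly with mAP, suggesting they can serve as a task-oriented evaluation signal.
Insight: Task-oriented metrics such as mAP reflect the practical value of a defogging method better than image quality metrics alone; VLMs show promise for defogging but need further improvement.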
Abstract: Autonomous driving perception systems are particularly vulnerable in foggy conditions, where light scattering reduces contrast and obscures fine details critical for safe operation. While numerous defogging methods exist, from handcrafted filters to learned restoration models, improvements in image fidelity do not consistently translate into better downstream detection and segmentation. Moreover, prior evaluations often rely on synthetic data, leaving questions about real-world transferability. We present a structured empirical study that benchmarks a comprehensive set of pipelines, including (i) classical filters, (ii) modern defogging networks, (iii) chained variants (filter$\rightarrow$model, model$\rightarrow$filter), and (iv) prompt-driven visual–language image editing models (VLM) applied directly to foggy images. Using Foggy Cityscapes, we assess both image quality and downstream performance on object detection (mAP) and segmentation (PQ, RQ, SQ). Our analysis reveals when defogging helps, when chaining yields synergy or degradation, and how VLM-based editors compare to dedicated approaches. In addition, we evaluate qualitative rubric-based scores from a VLM judge and quantify their alignment with task metrics, showing strong correlations with mAP. Together, these results establish a transparent, task-oriented benchmark for defogging methods and highlight the conditions under which preprocessing genuinely improves autonomous perception in adverse weather.
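One way to quantify how well a judge's scores align with a task metric is a rank correlation across pipelines. The sketch below computes Spearman's rho in plain Python. The abstract does not say which correlation measure the authors use, so Spearman is an assumption, and the score values are hypothetical.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-pipeline scores: VLM judge rating and detection mAP.
judge_scores = [3.1, 4.2, 2.5, 4.8, 3.9]
map_scores = [0.31, 0.40, 0.27, 0.44, 0.41]
rho = spearman(judge_scores, map_scores)
```

A rank correlation is a natural fit here because judge ratings and mAP live on different scales, and only the ordering of pipelines needs to agree.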
[107] Generating Human Motion Videos using a Cascaded Text-to-Video Framework
Hyelin Nam,Hyojun Go,Byeongjun Park,Byung-Hoon Kim,Hyungjin Chung
Main category: cs.CV
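TL;DR: This paper proposes CAMEO, a cascaded text-to-video framework for generating general human motion videos. Carefully designed components bridge text-to-motion (T2M) models and conditional video diffusion models (VDMs) and mitigate suboptimal factors that arise during training and inference.
Details
Motivation: Despite rapid progress in VDMs, their use for general-purpose human motion video generation remains underexplored, with most work limited to image-to-video setups or narrow domains such as dance videos. CAMEO aims to fill this gap. Contribution: 1. Proposes the CAMEO framework, which combines a T2M model with a conditional VDM; 2. Analyzes and prepares textual prompts and visual conditions so that they stay aligned; 3. Introduces a camera-aware conditioning module that automatically selects viewpoints aligned with the input text.
Method: The framework has two stages: a T2M model generates motion from text, and a conditional VDM generates the video. Prepared prompts and visual conditions, together with the camera-aware module, keep motion descriptions, conditioning signals, and generated videos aligned.
Result: The method is validated on the MovieGen benchmark and on a newly introduced benchmark tailored to the T2M-VDM combination, and shows versatility across diverse use cases.
Insight: The cascaded design and careful conditioning improve the coherence and controllability of human motion video generation and reduce the need for manual intervention.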
Abstract: Human video generation is becoming an increasingly important task with broad applications in graphics, entertainment, and embodied AI. Despite the rapid progress of video diffusion models (VDMs), their use for general-purpose human video generation remains underexplored, with most works constrained to image-to-video setups or narrow domains like dance videos. In this work, we propose CAMEO, a cascaded framework for general human motion video generation. It seamlessly bridges Text-to-Motion (T2M) models and conditional VDMs, mitigating suboptimal factors that may arise in this process across both training and inference through carefully designed components. Specifically, we analyze and prepare both textual prompts and visual conditions to effectively train the VDM, ensuring robust alignment between motion descriptions, conditioning signals, and the generated videos. Furthermore, we introduce a camera-aware conditioning module that connects the two stages, automatically selecting viewpoints aligned with the input text to enhance coherence and reduce manual intervention. We demonstrate the effectiveness of our approach on both the MovieGen benchmark and a newly introduced benchmark tailored to the T2M-VDM combination, while highlighting its versatility across diverse use cases.
[108] Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs
Sameep Vani,Shreyas Jena,Maitreya Patel,Chitta Baral,Somak Aditya,Yezhou Yang
Main category: cs.CV
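TL;DR: The paper proposes TimeWarp, a method that generates a synthetic temporal preference dataset to strengthen Video-LLMs on fine-grained temporal understanding, improving performance on seven benchmarks.
Details
Motivation: Current Video-LLMs do well on video captioning and descriptive tasks but underperform on tasks that require fine-grained temporal understanding, mainly because existing fine-tuning datasets lack visual complexity and temporal nuance. Contribution: 1. TimeWarp, a method for generating targeted synthetic temporal datasets; 2. A large-scale preference dataset that captures intricate temporal dynamics; 3. Improved Video-LLM performance on temporal understanding tasks.
Method: TimeWarp systematically generates a synthetic temporal preference dataset, which is used to fine-tune the model so that its responses are grounded in the visual and temporal information of the input video.
Result: The method yields absolute performance improvements across seven temporal understanding benchmarks.
Insight: Synthetic data can fill in the temporal dynamics missing from existing datasets and thereby improve fine-grained temporal understanding.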
Abstract: While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks, particularly in video captioning and descriptive tasks, they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading these models to rely heavily on language-based reasoning rather than truly understanding video dynamics. In this work, we propose TimeWarp, a systematic method to create a targeted synthetic temporal dataset to fine-tune the model’s responses to encourage it to focus on the given input video. We introduce a large-scale preference dataset, created using TimeWarp, that captures intricate temporal dynamics often overlooked, grounding the model’s responses to visual and temporal information. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks, highlighting the effectiveness of our proposed datasets in advancing temporal understanding in Video-LLMs, resulting in an absolute improvement in performance across seven benchmarks. Code is available at https://github.com/sameepv21/timewarp.
[109] No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models
Min Woo Sun,Alejandro Lozano,Javier Gamazo Tejero,Vishwesh Nath,Xiao Xiao Sun,James Burgess,Yuhui Zhang,Kun Yuan,Robert Tibshirani,Sean Huver,Serena Yeung-Levy
Main category: cs.CV
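TL;DR: This paper studies how to make full use of long text context in biomedical vision-language models (VLMs). It presents a long-context text encoder supporting up to 512 tokens and introduces the BIOMEDICA-LongCAP dataset, improving retrieval and classification performance.
Details
Motivation: Biomedical captions are often far longer than the 77-token window that typical VLMs support, so much of their information is truncated, which limits model performance. Contribution: A long-context (512-token) biomedical VLM and the BIOMEDICA-LongCAP dataset, together with evidence that longer context improves performance.
Method: Extends the context length of the text encoder to 512 tokens and trains BMC-LongCLIP on BIOMEDICA-LongCAP, reducing token waste from 55% to 2.2%.
Result: BMC-LongCLIP achieves up to +30% absolute gain in Recall@1 on long-caption retrieval and a +2% average improvement in classification, and it converges faster than the short-context baseline.
Insight: Long-context modeling is a promising direction for biomedical VLMs; the additional textual supervision improves model performance.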
Abstract: Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet, the distribution of biomedical captions from large-scale open source literature reveals that a huge portion of captions far exceed 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (thus, enabling additional supervision provided in long-format captions) correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
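The effect of the context window on truncation can be made concrete with a small calculation. The sketch below assumes "token waste" means the share of caption tokens discarded by truncation; the paper's exact definition may differ (for example, it may account for padding). The caption lengths are hypothetical.

```python
def truncation_waste(caption_lengths, context_length):
    """Fraction of all caption tokens that fall outside the context window."""
    total = sum(caption_lengths)
    kept = sum(min(n, context_length) for n in caption_lengths)
    return (total - kept) / total


# Hypothetical caption lengths in tokens.
lengths = [40, 77, 150, 300, 600]

waste_77 = truncation_waste(lengths, 77)    # short CLIP-style window
waste_512 = truncation_waste(lengths, 512)  # long-context window
```

Going from a 77-token to a 512-token window is a factor of about 6.6, matching the capacity increase quoted in the abstract.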
[110] Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning
Yaxin Hou,Bo Han,Yuheng Jia,Hui Liu,Junhui Hou
Main category: cs.CV
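TL;DR: This paper proposes a Controllable Pseudo-label Generation (CPG) framework for long-tailed semi-supervised learning when the distribution of the unlabeled data is unknown. Using a dynamic controllable filtering mechanism and a Bayes-optimal classifier, CPG improves model performance.
Details
Motivation: Existing long-tailed semi-supervised learning methods assume that unlabeled data follow a predefined distribution, but in practice that distribution is generally unknown and may be arbitrary. CPG addresses this problem. Contribution: 1. The CPG framework, which progressively adds reliable pseudo-labels in a controllable way; 2. A class-aware adaptive augmentation module and an auxiliary branch; 3. A theoretical proof that the optimization cycle can reduce the generalization error under some conditions.
Method: 1. A dynamic controllable filtering mechanism selects reliable pseudo-labels so that the updated labeled dataset follows a known distribution; 2. A Bayes-optimal classifier is built via logit adjustment based on the updated labeled data distribution; 3. The improved classifier identifies more reliable pseudo-labels in the next training step, closing the optimization cycle.
Result: On several commonly used benchmark datasets, CPG surpasses state-of-the-art methods by up to 15.97% in accuracy.
Insight: The core idea is to train on an updated labeled dataset whose distribution is known, which makes the model unaffected by the unlabeled data distribution and therefore more robust.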
Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to 15.97% in accuracy. The code is available at https://github.com/yaxinhou/CPG.
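Step (ii) of the cycle relies on logit adjustment, which can be sketched in its common post-hoc form: subtract a scaled log class prior from each logit so that frequent classes lose their head start. This is a generic illustration under that assumption. The abstract does not say whether CPG applies the adjustment in the loss or at inference, and the labels, logits, and the `tau` parameter are hypothetical.

```python
import math


def class_priors(labels, num_classes):
    """Empirical class frequencies of the (updated) labeled set."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]


def adjust_logits(logits, priors, tau=1.0):
    """Post-hoc logit adjustment: z_k - tau * log(prior_k).

    Assumes every class has at least one labeled sample (prior > 0).
    """
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]


# Hypothetical long-tailed labeled set: 90 / 9 / 1 samples per class.
labels = [0] * 90 + [1] * 9 + [2] * 1
priors = class_priors(labels, 3)

raw = [2.0, 1.5, 0.5]  # hypothetical logits for one input
adjusted = adjust_logits(raw, priors)

raw_pred = max(range(3), key=raw.__getitem__)
adj_pred = max(range(3), key=adjusted.__getitem__)
```

Because the priors come from the labeled set, keeping that set's distribution known is what makes the adjustment well defined.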
[111] Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5
Minh Hoang Nguyen,Su Nguyen Thiet
Main category: cs.CV
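TL;DR: This paper proposes fine-tuning PaddleOCRv5 to improve OCR on Han-Nom texts, targeting the degraded scans and non-standard glyphs found in historical documents, and reports higher recognition accuracy.
Details
Motivation: Recognizing Han-Nom texts is vital for digitizing Vietnamese historical documents and for cross-lingual semantic research, but existing OCR systems struggle with degraded scans and non-standard glyphs. Contribution: 1. A fine-tuning approach for PaddleOCRv5; 2. A full training pipeline; 3. An interactive visual demo.
Method: Retrains the text recognition module of PaddleOCRv5 on a curated subset of Han-Nom manuscripts, supported by a pipeline covering preprocessing, LMDB conversion, evaluation, and visualization.
Result: Exact accuracy rises from 37.5% to 50.0% after fine-tuning, with the gain most visible under noisy image conditions.
Insight: For OCR on a specific language or script, fine-tuning an existing model can improve performance, especially on complex historical documents.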
Abstract: Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.
[112] Fit Pixels, Get Labels: Meta-learned Implicit Networks for Image Segmentation
Kushal Vyas,Ashok Veeraraghavan,Guha Balakrishnan
Main category: cs.CV
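TL;DR: MetaSeg is a meta-learning framework for implicit neural representations (INRs) applied to medical image segmentation. It matches commonly used U-Net models with far fewer parameters.
Details
Motivation: INRs excel at signal representation but are not naturally suited to predictive tasks such as segmentation. MetaSeg aims to offer a lightweight, efficient alternative for medical image segmentation. Contribution: 1. Proposes MetaSeg, which combines INRs with meta-learning to predict per-pixel intensity values and class labels simultaneously; 2. On 2D and 3D brain MRI segmentation, reaches performance comparable to U-Net with 90% fewer parameters.
Method: 1. Designs an INR that predicts per-pixel intensity values and class labels at the same time; 2. Uses meta-learning to find initial parameters from which the INR can be quickly fine-tuned to fit the pixels of an unseen test image and then decode its class labels.
Result: On brain MRI segmentation, MetaSeg achieves Dice scores comparable to U-Net with 90% fewer parameters.
Insight: MetaSeg offers a lightweight, scalable option for medical image segmentation and shows the potential of combining meta-learning with INRs.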
Abstract: Implicit neural representations (INRs) have achieved remarkable successes in learning expressive yet compact signal representations. However, they are not naturally amenable to predictive tasks such as segmentation, where they must learn semantic structures over a distribution of signals. In this study, we introduce MetaSeg, a meta-learning framework to train INRs for medical image segmentation. MetaSeg uses an underlying INR that simultaneously predicts per-pixel intensity values and class labels. It then uses a meta-learning procedure to find optimal initial parameters for this INR over a training dataset of images and segmentation maps, such that the INR can simply be fine-tuned to fit pixels of an unseen test image, and automatically decode its class labels. We evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice scores comparable to commonly used U-Net models, but with 90% fewer parameters. MetaSeg offers a fresh, scalable alternative to traditional resource-heavy architectures such as U-Nets and vision transformers for medical image segmentation. Our project is available at https://kushalvyas.github.io/metaseg.html .
[113] Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning
Chendong Wang,Donglin Bai,Yifan Yang,Xiao Jin,Anlan Zhang,Rui Wang,Shiqi Jiang,Yuqing Yang,Hao Wu,Qi Dai,Chong Luo,Ting Cao,Lili Qiu,Suman Banerjee
Main category: cs.CV
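TL;DR: ViTL is a two-stage long-video QA framework that localizes question-relevant spans and then reallocates visual tokens to them under a fixed token budget. It is presented together with a new span-grounded dataset.
Details
Motivation: Long-video QA suffers from high computational cost and from the difficulty of efficiently localizing relevant segments. ViTL aims to address both. Contribution: 1. Proposes the ViTL framework, which splits processing into low-frame-rate localization and higher-frame-rate answering; 2. Introduces a new dataset with ground-truth time spans; 3. Proposes a joint training objective that couples localization and answering.
Method: ViTL works in two stages: 1. A low-fps skim coarsely localizes the question-relevant interval(s); 2. Visual tokens are reallocated to those spans at a higher effective frame rate for answering, and the output contains both the time spans and the answer.
Result: Under fixed token budgets, ViTL reports improvements of up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions); ablations show that span-aware token reallocation consistently beats uniform sampling.
Insight: Span localization plus token reallocation makes long-video QA both compute-efficient and interpretable.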
Abstract: We present Video-in-the-Loop (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first localizing question-relevant interval(s) with a low-fps skim and then answering via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description-based event graphs into span-grounded multiple-choice QA by pairing each question with ground-truth time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
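The training signal described here couples temporal IoU with answer correctness inside a group-relative objective. The sketch below shows one plausible shape of such a reward. The linear mix with weight `alpha` and the mean-centering over a group of sampled outputs are assumptions for illustration; the abstract does not give the exact formula, and all function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def combined_reward(pred_span, gt_span, pred_answer, gt_answer, alpha=0.5):
    """Mix localization quality and answer correctness (alpha is an assumption)."""
    correct = 1.0 if pred_answer == gt_answer else 0.0
    return alpha * temporal_iou(pred_span, gt_span) + (1.0 - alpha) * correct


def group_relative(rewards):
    """Center each reward on the mean of its sampling group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical group of two sampled outputs for the same question.
gt_span, gt_answer = (15.0, 25.0), "B"
rewards = [
    combined_reward((10.0, 20.0), gt_span, "B", gt_answer),  # partial overlap, correct
    combined_reward((40.0, 50.0), gt_span, "A", gt_answer),  # no overlap, wrong
]
advantages = group_relative(rewards)
```

Because both terms sit in one scalar, an output that answers correctly from a well-localized span is preferred over one that answers correctly from the wrong span.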
[114] Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation
Yuyan Bu,Qiang Sheng,Juan Cao,Shaofei Wang,Peng Qi,Yuhui Shi,Beizhe Hu
Main category: cs.CV
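TL;DR: The paper proposes AgentAug, an LLM-based data augmentation framework that generates diverse data by simulating how fake news videos are created, addressing the performance problems that existing detectors suffer from limited training data.
Details
Motivation: Fake news on short video platforms has become a significant societal concern. Existing detectors perform poorly because training data are limited and lack diversity; in particular, the data do not adequately reflect the complex many-to-many relationships between video segments and fabricated events. Contribution: Proposes the AgentAug framework, which improves the diversity and quality of training data through LLM-driven pipelines covering multiple fabrication categories and an active learning strategy based on uncertainty sampling.
Method: AgentAug uses LLMs to simulate four categories of fake news video creation to generate data, and combines this with an active learning strategy that selects the augmented samples most useful for training.
Result: Experiments on two benchmark datasets show that AgentAug consistently improves the performance of short video fake news detectors.
Insight: Generating diverse data by simulating the fake news creation process is an effective remedy for data sparsity, and LLM-driven augmentation offers a template for similar tasks.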
Abstract: The emergence of fake news on short video platforms has become a new significant societal concern, necessitating automatic video-news-specific detection. Current detectors primarily rely on pattern-based features to separate fake news videos from real ones. However, limited and less diversified training data lead to biased patterns and hinder their performance. This weakness stems from the complex many-to-many relationships between video material segments and fabricated news events in real-world scenarios: a single video clip can be utilized in multiple ways to create different fake narratives, while a single fabricated event often combines multiple distinct video segments. However, existing datasets do not adequately reflect such relationships due to the difficulty of collecting and annotating large-scale real-world data, resulting in sparse coverage and non-comprehensive learning of the characteristics of potential fake news video creation. To address this issue, we propose a data augmentation framework, AgentAug, that generates diverse fake news videos by simulating typical creative processes. AgentAug implements multiple LLM-driven pipelines of four fabrication categories for news video creation, combined with an active learning strategy based on uncertainty sampling to select the potentially useful augmented samples during training. Experimental results on two benchmark datasets demonstrate that AgentAug consistently improves the performance of short video fake news detectors.
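The uncertainty-sampling step can be illustrated with a minimal sketch that ranks augmented samples by the entropy of the detector's predicted class distribution and keeps the most uncertain ones. Entropy is one common uncertainty measure and is an assumption here; the abstract does not specify which measure AgentAug uses. The probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(pool_probs, k):
    """Return indices of the k samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical detector outputs [P(real), P(fake)] for four augmented videos.
pool = [[0.99, 0.01], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
chosen = select_uncertain(pool, 2)
```

Samples the detector already classifies confidently add little training signal, which is why the near-50/50 cases are selected first.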
[115] Prompt-to-Prompt: Text-Based Image Editing Via Cross-Attention Mechanisms – The Research of Hyperparameters and Novel Mechanisms to Enhance Existing Frameworks
Linn Bieske,Carla Lorente
Main category: cs.CV
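TL;DR: This paper studies hyperparameters and attention mechanisms in text-based image editing frameworks and proposes an "attention re-weight method" and a "CL P2P" framework to improve editing precision and consistency.
Details
Motivation: Text-driven image editing methods built on models such as stable diffusion simplify the editing process but produce inconsistent results, for example inconsistent hair color changes. The paper aims to address this by optimizing hyperparameters and improving the attention mechanisms. Contribution: 1. A systematic study of the "word swap" method; 2. An "attention re-weight method" for better adaptability; 3. The "CL P2P" framework, which addresses cycle inconsistency.
Method: 1. Analyzes and optimizes hyperparameters of the cross-attention mechanism; 2. Proposes the "attention re-weight method" to adjust attention weights; 3. Designs the "CL P2P" framework to address limitations such as cycle inconsistency.
Result: The study indicates that optimized hyperparameters and refined attention mechanisms improve the precision and consistency of image editing.
Insight: Hyperparameter settings and attention mechanisms strongly influence the quality and consistency of generated images; architectural choices should be made together with hyperparameter tuning.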
Abstract: Recent advances in image editing have shifted from manual pixel manipulation to employing deep learning methods like stable diffusion models, which now leverage cross-attention mechanisms for text-driven control. This transition has simplified the editing process but also introduced variability in results, such as inconsistent hair color changes. Our research aims to enhance the precision and reliability of prompt-to-prompt image editing frameworks by exploring and optimizing hyperparameters. We present a comprehensive study of the “word swap” method, develop an “attention re-weight method” for better adaptability, and propose the “CL P2P” framework to address existing limitations like cycle inconsistency. This work contributes to understanding and improving the interaction between hyperparameter settings and the architectural choices of neural network models, specifically their attention mechanisms, which significantly influence the composition and quality of the generated images.
[116] GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
Bin Lei,Nuo Xu,Ali Payani,Mingyi Hong,Chunhua Liao,Yu Cao,Caiwen Ding
Main category: cs.CV
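TL;DR: GUI-Spotlight dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen. This improves visual grounding accuracy, that is, the mapping from textual references to on-screen elements in GUI systems.
Details
Motivation: The practical usefulness of multimodal large language models (MLLMs) in real GUI systems is limited by the reliability of visual grounding, which prevents accurate pointer-level actions such as clicking or dragging. The authors propose GUI-Spotlight to address this. Contribution: The main contribution is the GUI-Spotlight model, which iteratively focuses on the relevant screen region by dynamically invoking specialized tools and substantially improves visual grounding accuracy.
Method: The core of the method is a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to narrow the screen region of interest step by step.
Result: On the ScreenSpot-Pro benchmark, GUI-Spotlight reaches 52.8% accuracy with only 18.5K training samples, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).
Insight: Dynamic tool invocation combined with iterative focusing can substantially improve visual grounding with comparatively little training data.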
Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight – a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).
[117] Quantization Range Estimation for Convolutional Neural Networks
Bingtao Yang,Yujia Wang,Mengzhi Jiao,Hongwei Huo
Main category: cs.CV
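TL;DR: This paper proposes a range estimation method to improve post-training quantization. It minimizes quantization error via layer-wise local minima and proves that the problem is locally convex.
Details
Motivation: Low-bit quantization (for example 4-bit) makes it hard to maintain model accuracy; the authors propose an efficient range estimation method to address this. Contribution: 1. Formulates range estimation as an optimization problem that minimizes quantization error; 2. Proves that the problem is locally convex and presents an efficient search algorithm; 3. Applies the search in the transformed weight space for further improvement.
Method: 1. Minimizes quantization error through layer-wise local minima; 2. Uses an efficient search algorithm to find the optimal range; 3. Applies the algorithm to the transformed weight space.
Result: On the ResNet series and Inception-v3, 8-bit and 6-bit quantization incur almost no loss of top-1 accuracy, and 4-bit accuracy is also significantly improved.
Insight: Better range estimation can substantially improve accuracy in low-bit quantization, especially when applied in the transformed weight space.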
Abstract: Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We model the range estimation into an optimization problem of minimizing quantization errors by layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We propose the application of the above search algorithm to the transformed weights space to do further improvement in practice. Our experiments demonstrate that our method outperforms state-of-the-art performance generally on top-1 accuracy for image classification tasks on the ResNet series models and Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classifications, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.
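Range estimation as error minimization can be illustrated for a single layer: choose the clipping threshold that minimizes the mean squared error between the original weights and their quantized versions. The sketch below uses symmetric uniform quantization and a plain grid search as a stand-in for the paper's search algorithm, whose details are not given in the abstract. The weight values are hypothetical.

```python
def quantize(values, clip, bits):
    """Symmetric uniform quantization of values clipped to [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1
    scale = clip / levels
    out = []
    for v in values:
        q = round(max(-clip, min(clip, v)) / scale)  # integer code
        out.append(q * scale)                        # dequantized value
    return out


def quant_mse(values, clip, bits):
    """Mean squared quantization error for a given clipping threshold."""
    deq = quantize(values, clip, bits)
    return sum((v - d) ** 2 for v, d in zip(values, deq)) / len(values)


def search_clip(values, bits, steps=100):
    """Grid-search the threshold in (0, max|v|] that minimizes the MSE.

    Assumes at least one nonzero value. A grid search is used for clarity;
    it is not the search algorithm proposed in the paper.
    """
    max_abs = max(abs(v) for v in values)
    candidates = [max_abs * (i + 1) / steps for i in range(steps)]
    return min(candidates, key=lambda c: quant_mse(values, c, bits))


# Hypothetical layer weights: a dense bulk plus one large outlier.
weights = [0.01 * i for i in range(-50, 51)] + [4.0]
best_clip = search_clip(weights, bits=4)
```

Using the full min-max range is the last candidate in the grid, so the searched threshold is never worse than naive min-max scaling on this error measure.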
[118] MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation
Zhenyu Pan,Yucheng Lu,Han Liu
Main category: cs.CV
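TL;DR: MetaFind is a scene-aware tri-modal compositional retrieval framework that enhances metaverse scene generation by retrieving 3D assets from large-scale repositories. It addresses inconsistent asset retrieval and the lack of a standardized retrieval paradigm.
Details
Motivation: Existing 3D asset retrieval methods overlook spatial, semantic, and stylistic constraints, and there is no standardized paradigm tailored to 3D retrieval. MetaFind targets both problems to make generated scenes more coherent.
Contribution: A flexible tri-modal retrieval mechanism, plus the ESSGNN encoder, which captures spatial relationships and object features so that retrieved assets are contextually and stylistically consistent with the scene.
Method: The plug-and-play equivariant ESSGNN layout encoder models object-level features and scene-level layout structure; queries may combine text, image, and 3D modalities arbitrarily, and retrieval results are updated iteratively as the scene changes.
Result: Experiments show that MetaFind improves spatial and stylistic consistency over baseline methods.
Insight: A multimodal retrieval mechanism that combines object-level and scene-level features can effectively improve the coherence of 3D asset retrieval.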
Abstract: We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm specifically tailored for 3D asset retrieval, as existing approaches mainly rely on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that supports arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structures. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder ESSGNN that captures spatial relationships and object appearance features, ensuring retrieved 3D assets are contextually and stylistically coherent with the existing scene, regardless of coordinate frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results to current scene updates. Empirical evaluations demonstrate the improved spatial and stylistic consistency of MetaFind in various retrieval tasks compared to baseline methods.
[119] Ordinal Encoding as a Regularizer in Binary Loss for Solar Flare Prediction
Chetraj Pandey,Jinsu Hong,Anli Ji,Rafal A. Angryk,Berkay Aydin
Main category: cs.CV
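TL;DR: The paper proposes a modified loss function that integrates the ordinal information among flare sub-classes into the binary cross-entropy loss, aiming to reduce misclassifications near the prediction threshold.
Details
Motivation: The conventional binary classification framework ignores the ordinal relationships among solar flare sub-classes, so models perform poorly near the threshold. The authors therefore propose an ordinality-aware loss function.
Contribution: A new loss function whose weighting scheme makes the model focus on samples near the threshold, with the goal of improving classification performance.
Method: Ordinal weights are added to the binary cross-entropy loss so that incorrect predictions close to the threshold are penalized more heavily than those far from it.
Result: The abstract reports no quantitative results; the stated aim is to reduce misclassifications near the threshold and thereby improve overall performance.
Insight: In binary classification, ordinal information can serve as a data-driven regularizer, particularly when the underlying classes have a clear ordinal relationship.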
Abstract: The prediction of solar flares is typically formulated as a binary classification task, distinguishing events as either Flare (FL) or No-Flare (NF) according to a specified threshold (for example, greater than or equal to C-class, M-class, or X-class). However, this binary framework neglects the inherent ordinal relationships among the sub-classes contained within each category (FL and NF). Several studies on solar flare prediction have empirically shown that the most frequent misclassifications occur near this prediction threshold. This suggests that the models struggle to differentiate events that are similar in intensity but fall on opposite sides of the binary threshold. To mitigate this limitation, we propose a modified loss function that integrates the ordinal information among the sub-classes of the binarized flare labels into the conventional binary cross-entropy (BCE) loss. This approach serves as an ordinality-aware, data-driven regularization method that penalizes the incorrect predictions of flare events in close proximity to the prediction threshold more heavily than those away from the boundary during model optimization. By incorporating ordinal weighting into the loss function, we aim to enhance the model’s learning process by leveraging the ordinal characteristics of the data, thereby improving its overall performance.
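A minimal sketch of an ordinality-weighted binary cross-entropy, in the spirit of the abstract, is given below. The abstract does not give the weighting formula, so the inverse-distance weight, the `alpha` parameter, the rank encoding of the GOES classes, and the function names are all hypothetical.

```python
import math

# Assumed ordinal encoding of GOES flare classes, weakest to strongest.
RANK = {"A": 0, "B": 1, "C": 2, "M": 3, "X": 4}


def ordinal_weight(sub_class, threshold="M", alpha=1.0):
    """Hypothetical weight that grows as a sub-class nears the binary threshold."""
    boundary = RANK[threshold] - 0.5          # decision boundary between two ranks
    dist = abs(RANK[sub_class] - boundary)    # always >= 0.5, so no division by zero
    return 1.0 + alpha / dist


def weighted_bce(prob, sub_class, threshold="M", alpha=1.0, eps=1e-7):
    """Binary cross-entropy for one sample, scaled by its ordinal weight."""
    label = 1.0 if RANK[sub_class] >= RANK[threshold] else 0.0   # FL vs. NF
    p = min(max(prob, eps), 1.0 - eps)        # avoid log(0)
    bce = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
    return ordinal_weight(sub_class, threshold, alpha) * bce


# Same wrong-leaning prediction (0.7) for two No-Flare events:
# the C-class event sits next to the M threshold and is penalized more.
print(weighted_bce(0.7, "C"), weighted_bce(0.7, "A"))
```

Setting `alpha=0` recovers plain BCE, which makes it easy to compare the two losses on the same data.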
[120] Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation
Seunghyun Lee,Tae-Kyun Kim
Main category: cs.CV
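TL;DR: The paper combines pose regression with a denoising diffusion model and adds score scaling sampling, addressing slow training convergence and the need to filter pose candidates in category-level 6D pose estimation, and yielding efficient, accurate pose generation.
Details
Motivation: Existing diffusion models perform well on category-level 6D pose estimation but converge slowly during training and need an additional network to filter pose candidates.
Contribution: 1. A joint learning scheme for pose regression and diffusion that speeds up convergence and improves accuracy; 2. Time-dependent score scaling sampling, which replaces the additional evaluation network.
Method: 1. Pretrain the encoder with a direct pose regression head; 2. Jointly learn the regression head and the denoising diffusion head; 3. Guide sampling with time-dependent score scaling.
Result: State-of-the-art accuracy on benchmarks including REAL275, HouseCat6D, and ROPE, even with single-pose inference, with more efficient training and inference.
Insight: The sampling guidance preserves the multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at the final steps.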
Abstract: Recent diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. The existing methods, however, suffer from slow convergence during training, since the encoder is learned together with the diffusion denoising network in an end-to-end fashion, and require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations with two key components. First, the proposed method pretrains the encoder with the direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, sampling guidance via time-dependent score scaling is proposed so that the exploration-exploitation trade-off is handled effectively, eliminating the need for the additional evaluation network. The sampling guidance maintains multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at final steps. Extensive experiments on multiple benchmarks including REAL275, HouseCat6D, and ROPE demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracies even with single-pose inference, while being more efficient in both training and inference.
[121] Learning from All: Concept Alignment for Autonomous Distillation from Multiple Drifting MLLMs
Xiaoyu Yang,Jie Lu,En Yu
Main category: cs.CV
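TL;DR: The paper proposes a framework that addresses concept drift when distilling from multimodal large language models (MLLMs). Autonomous preference optimization (APO) aligns concepts across the reasoning trajectories of multiple teachers, improving the student's robustness and generalization.
Details
Motivation: Distilling from multiple MLLM teachers suffers from concept drift, which degrades the student model. A method is needed to align the teachers' reasoning trajectories and remove the resulting bias.
Contribution: 1) A theoretical framework connecting concept drift and knowledge distillation; 2) The "learn, compare, critique" paradigm and the APO method; 3) The CXR-MAX dataset.
Method: The student learns from and compares the reasoning trajectories of multiple teachers, then critically reflects on their drifting inference and aligns concepts through APO.
Result: Experiments show strong consistency, robustness, and generalization.
Insight: Concept drift is a core challenge in multi-teacher distillation, and APO offers both a theoretical and a practical approach to dynamic alignment.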
Abstract: This paper identifies a critical yet underexplored challenge in distilling from multimodal large language models (MLLMs): the reasoning trajectories generated by multiple drifting teachers exhibit concept drift, whereby their reasoning distributions evolve unpredictably and transmit biases to the student model, ultimately compromising its performance. To tackle this issue, we pioneer a theoretical connection between concept drift and knowledge distillation, casting the non-stationary reasoning dynamics from multiple MLLM teachers as next-token prediction of multi-stream reasoning trajectories. Guided by concept drift, we introduce the "learn, compare, critique" paradigm, culminating in autonomous preference optimization (APO). Under the active guidance of the teachers, the student model first learns and self-distils preferred thinking by comparing multiple teachers. It then engages in critical reflection over the drifting inference from teachers, performing concept alignment through APO, ultimately yielding a robust, consistent, and generalizable model. Extensive experiments demonstrate superior consistency, robustness, and generalization within knowledge distillation. We also contribute a large-scale dataset, CXR-MAX (Multi-teachers Alignment X-rays), comprising 170,982 distilled reasoning trajectories derived from publicly accessible MLLMs based on MIMIC-CXR. Our code and data are public at: https://anonymous.4open.science/r/Autonomous-Distillation/.
[122] Automating construction safety inspections using a multi-modal vision-language RAG framework
Chenxin Wang,Elyas Asadi Shamsabadi,Zhaohui Chen,Luming Shen,Alireza Ahmadian Fard Fini,Daniel Dias-da-Costa
Main category: cs.CV
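TL;DR: The paper presents SiteShield, a multi-modal vision-language RAG framework that automates construction safety inspection reports by combining visual and audio inputs, improving efficiency and accuracy.
Details
Motivation: Conventional construction safety inspection is inefficient, and existing applications of large vision-language models suffer from irrelevant responses, restricted input modalities, and hallucinations. A more efficient and accurate automated solution is needed.
Contribution: SiteShield, a multi-modal LVLM-based RAG framework that integrates visual and audio inputs to generate safety inspection reports.
Method: A multi-modal vision-language framework with retrieval-augmented generation (RAG) that takes visual and audio inputs, validated on real-world data.
Result: SiteShield outperformed unimodal LLMs without RAG, with an F1 score of 0.82, Hamming loss of 0.04, precision of 0.76, and recall of 0.96.
Insight: Combining multi-modal input with RAG can mitigate the limitations of existing models on safety inspection tasks and offers a new route to automated inspection.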
Abstract: Conventional construction safety inspection methods are often inefficient as they require navigating through large volumes of information. Recent advances in large vision-language models (LVLMs) provide opportunities to automate safety inspections through enhanced visual and linguistic understanding. However, existing applications face limitations including irrelevant or unspecific responses, restricted modal inputs and hallucinations. The use of Large Language Models (LLMs) for this purpose is constrained by the availability of training data, and such models frequently lack real-time adaptability. This study introduces SiteShield, a multi-modal LVLM-based Retrieval-Augmented Generation (RAG) framework for automating construction safety inspection reports by integrating visual and audio inputs. Using real-world data, SiteShield outperformed unimodal LLMs without RAG with an F1 score of 0.82, Hamming loss of 0.04, precision of 0.76, and recall of 0.96. The findings indicate that SiteShield offers a novel pathway to enhance information retrieval and efficiency in generating safety reports.
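For readers unfamiliar with the reported metrics, the sketch below computes precision, recall, F1, and Hamming loss for multi-label predictions. The abstract does not say how the scores were averaged; micro-averaging over all label slots is an assumption here, and the toy label matrices are invented for illustration.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro-averaged precision/recall/F1 and Hamming loss for 0/1 label matrices."""
    tp = fp = fn = wrong = total = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            total += 1
            wrong += int(t != p)              # any mismatched label slot
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hamming_loss": wrong / total,        # fraction of wrong label slots
    }


# Two hypothetical inspection images, four hypothetical hazard labels each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 1, 1, 0], [0, 1, 0, 0]]
print(multilabel_metrics(y_true, y_pred))
```

Note that the harmonic mean of the reported precision (0.76) and recall (0.96) is about 0.85 rather than the reported F1 of 0.82. One possible explanation is that the scores were averaged per label or per sample rather than micro-averaged, but the abstract does not say.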
[123] BLADE: Bias-Linked Adaptive DEbiasing
Piyush Arora,Navlika Singh,Vasubhya Diwan,Pratik Mazumder
Main category: cs.CV
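TL;DR: BLADE is a debiasing framework that needs no prior knowledge of bias. It uses a generative model to translate images across bias domains and adaptively refines images to reduce reliance on bias, significantly outperforming existing methods.
Details
Motivation: Neural networks readily learn implicit biases and spurious correlations. Existing methods depend on prior knowledge of the bias or on bias-conflicting samples, which is impractical in real-world settings. BLADE removes both assumptions.
Contribution: 1. The BLADE framework, which requires neither prior bias knowledge nor bias-conflicting samples; 2. Cross-domain image translation with a generative model, followed by adaptive refinement; 3. Significant gains over existing methods on multiple benchmark datasets.
Method: 1. Train a generative model to translate images across bias domains while preserving task-relevant features; 2. Adaptively refine each image according to its susceptibility to bias; 3. Align each image with its bias-translated counterpart, which shares task-relevant features but differs in bias, and misalign it with samples sharing the same bias.
Result: Strong results on multiple datasets; on corrupted CIFAR-10 under the worst-group setting, BLADE exceeds the closest baseline by an absolute margin of around 18%.
Insight: Generative translation enables debiasing without explicit supervision of the bias, pointing toward more robust deep learning models.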
Abstract: Neural networks have revolutionized numerous fields, yet they remain vulnerable to a critical flaw: the tendency to learn implicit biases, i.e., spurious correlations between certain attributes and target labels in training data. These biases are often more prevalent and easier to learn, causing models to rely on superficial patterns rather than task-relevant features necessary for generalization. Existing methods typically rely on strong assumptions, such as prior knowledge of these biases or access to bias-conflicting samples (samples that contradict the spurious correlations and counterbalance bias-aligned samples, which conform to them). However, such assumptions are often impractical in real-world settings. We propose BLADE (Bias-Linked Adaptive DEbiasing), a generative debiasing framework that requires no prior knowledge of bias or bias-conflicting samples. BLADE first trains a generative model to translate images across bias domains while preserving task-relevant features. Then, it adaptively refines each image with its synthetic counterpart based on the image's susceptibility to bias. To encourage robust representations, BLADE aligns an image with its bias-translated synthetic counterpart that shares task-relevant features but differs in bias, while misaligning it with samples sharing the same bias. We evaluate BLADE on multiple benchmark datasets and show that it significantly outperforms state-of-the-art methods. Notably, it exceeds the closest baseline by an absolute margin of around 18% on the corrupted CIFAR-10 dataset under the worst group setting, establishing a new benchmark in bias mitigation and demonstrating its potential for developing more robust deep learning models without explicit supervision.
[124] From Segments to Concepts: Interpretable Image Classification via Concept-Guided Segmentation
Ran Eisenberg,Amit Rozner,Ethan Fetaya,Ofir Lindenbaum
Main category: cs.CV
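TL;DR: SEG-MIL-CBM combines concept-guided image segmentation with attention-based multiple instance learning to make deep neural networks more interpretable while avoiding costly concept annotations.
Details
Motivation: Deep neural networks remain hard to interpret, and in safety-critical applications their black-box nature limits trust and transparency. Existing approaches such as Concept Bottleneck Models (CBMs) require expensive concept annotations and lack spatial grounding.
Contribution: The SEG-MIL-CBM framework, which integrates concept-guided segmentation with multiple instance learning, removes the need for concept annotations, and improves transparency and robustness through spatially grounded concept explanations.
Method: Concept-guided segmentation turns image regions into instances; an attention-based multiple instance learning framework aggregates evidence across them, highlighting task-relevant evidence and down-weighting irrelevant cues.
Result: Robust performance in settings with spurious correlations, input corruptions, and large-scale benchmarks, together with transparent concept-level explanations.
Insight: Reasoning over task-relevant regions aligned with concepts can improve both interpretability and robustness without relying on expensive annotations.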
Abstract: Deep neural networks have achieved remarkable success in computer vision; however, their black-box nature in decision-making limits interpretability and trust, particularly in safety-critical applications. Interpretability is crucial in domains where errors have severe consequences. Existing models not only lack transparency but also risk exploiting unreliable or misleading features, which undermines both robustness and the validity of their explanations. Concept Bottleneck Models (CBMs) aim to improve transparency by reasoning through human-interpretable concepts. Still, they require costly concept annotations and lack spatial grounding, often failing to identify which regions support each concept. We propose SEG-MIL-CBM, a novel framework that integrates concept-guided image segmentation into an attention-based multiple instance learning (MIL) framework, where each segmented region is treated as an instance and the model learns to aggregate evidence across them. By reasoning over semantically meaningful regions aligned with high-level concepts, our model highlights task-relevant evidence, down-weights irrelevant cues, and produces spatially grounded, concept-level explanations without requiring annotations of concepts or groups. SEG-MIL-CBM achieves robust performance across settings involving spurious correlations (unintended dependencies between background and label), input corruptions (perturbations that degrade visual quality), and large-scale benchmarks, while providing transparent, concept-level explanations.
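The attention-based aggregation over segmented regions can be sketched as a softmax-weighted sum of instance features. This is a generic attention-pooling illustration, not the SEG-MIL-CBM implementation: in a real model the attention scores would come from a learned scoring network, whereas here they are passed in directly, and the function name and toy values are assumptions.

```python
import math


def attention_pool(instances, scores):
    """Aggregate per-region feature vectors into one bag-level vector.

    instances: list of equal-length feature vectors, one per segmented region.
    scores:    one unnormalized attention score per region (assumed to be
               produced elsewhere, e.g. by a small learned scorer).
    Returns (bag_vector, attention_weights).
    """
    m = max(scores)                                   # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]                   # softmax over regions
    dim = len(instances[0])
    bag = [sum(w * inst[d] for w, inst in zip(weights, instances))
           for d in range(dim)]
    return bag, weights


# Three hypothetical regions; the second gets a much higher attention score.
regions = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
bag, weights = attention_pool(regions, scores=[0.0, 3.0, 0.0])
print(bag, weights)
```

The attention weights double as an explanation: they show which regions contributed most to the bag-level representation, which is the kind of spatial grounding the abstract describes.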
[125] Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers
Shikang Zheng,Guantao Chen,Qinming Zhou,Yuqi Lin,Lixuan He,Chang Zou,Peiliang Cai,Jiacheng Liu,Linfeng Zhang
Main category: cs.CV
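TL;DR: The paper proposes HyCa, a feature caching framework inspired by hybrid ODE solvers that significantly accelerates sampling in Diffusion Transformers.
Details
Motivation: Diffusion Transformers excel at image and video synthesis, but their iterative sampling is bottlenecked by costly forward passes at every timestep. Existing caching methods ignore the heterogeneous dynamics of different feature dimensions.
Contribution: HyCa, a framework with dimension-wise hybrid caching strategies that achieves 5.55x to 6.24x speedups without retraining.
Method: Hidden feature evolution is modeled as a mixture of ODEs across dimensions, and a hybrid-ODE-solver-inspired framework applies a caching strategy per dimension.
Result: Near-lossless acceleration across domains and models, including FLUX, HunyuanVideo, Qwen-Image, and Qwen-Image-Edit.
Insight: Feature dynamics are heterogeneous, so dimension-wise caching uses compute more efficiently than a uniform strategy.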
Abstract: Diffusion Transformers offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods often apply a uniform caching strategy across all feature dimensions, ignoring their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by modeling hidden feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a Hybrid ODE solver inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models, including 5.55 times speedup on FLUX, 5.56 times speedup on HunyuanVideo, 6.24 times speedup on Qwen-Image and Qwen-Image-Edit without retraining.
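To make the idea of dimension-wise caching concrete, the toy sketch below lets each feature dimension choose between two forecasting rules based on which one best predicted its most recent cached value. This is only an illustration of "different dimensions, different solvers": the two rules, the selection criterion, and all names are assumptions, and HyCa's actual solvers and assignment procedure are not described in the abstract.

```python
def reuse(history):
    """Zeroth-order rule: hold the last cached value."""
    return history[-1]


def linear_forecast(history):
    """First-order rule: extrapolate the last observed difference."""
    return 2 * history[-1] - history[-2]


SOLVERS = {"reuse": reuse, "linear": linear_forecast}


def pick_solver(history):
    """Pick the rule that best predicts the newest cached value from older ones."""
    past, target = history[:-1], history[-1]
    return min(SOLVERS, key=lambda name: abs(SOLVERS[name](past) - target))


def forecast_features(histories):
    """histories: one list of cached values (length >= 3) per feature dimension.

    Returns a (solver_name, forecast) pair for each dimension, standing in for
    the feature value at a timestep where the forward pass is skipped.
    """
    out = []
    for h in histories:
        name = pick_solver(h)
        out.append((name, SOLVERS[name](h)))
    return out


# Dimension 0 has plateaued; dimension 1 is drifting linearly.
print(forecast_features([[1.0, 0.5, 0.5], [0.0, 1.0, 2.0]]))
```

A uniform strategy would force one rule on both dimensions and mispredict one of them, which is the inefficiency the abstract attributes to existing caching methods.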
[126] World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge
Moo Hyun Son,Jintaek Oh,Sun Bin Mun,Jaechul Roh,Sehyun Choi
Main category: cs.CV
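TL;DR: World-To-Image improves text-to-image generation by dynamically retrieving knowledge from the web, substantially raising quality for novel or out-of-distribution (OOD) entities.
Details
Motivation: Conventional T2I models degrade on novel or OOD entities because of their knowledge cutoffs.
Contribution: The World-To-Image framework, which combines agent-driven web retrieval with multimodal prompt optimization to improve semantic alignment and visual quality.
Method: An agent searches the web for images of concepts unknown to the base model; the retrieved information drives multimodal prompt optimization, which steers a strong generative backbone.
Result: An 8.1% improvement in accuracy-to-prompt on the curated NICE benchmark, with better semantic alignment and visual aesthetics than state-of-the-art methods.
Insight: Pairing T2I models with dynamically retrieved world knowledge can markedly improve generation and help systems keep up with a changing world.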
Abstract: While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in less than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.
[127] MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
Lixuan He,Shikang Zheng,Linfeng Zhang
Main category: cs.CV
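TL;DR: MASC builds a hierarchical semantic tree over visual tokens, using a geometry-aware distance metric to model the manifold of the token embeddings, and thereby improves both training efficiency and generation quality for autoregressive image generation.
Details
Motivation: Autoregressive image generation is inefficient largely because its flat, unstructured token vocabulary ignores the semantic structure of the embedding space. The resulting prediction task is needlessly complex and limits model performance.
Contribution: MASC restructures the token embedding space into a hierarchical semantic tree, simplifying the prediction task and improving training efficiency and generation quality.
Method: A geometry-aware distance metric and density-driven agglomerative construction build the semantic tree, turning the flat prediction task into a structured, hierarchical one.
Result: MASC accelerates training by up to 57% and reduces the FID of LlamaGen-XL from 2.87 to 2.58.
Insight: Structuring the prediction space matters as much as architectural innovation for scalable generative modeling.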
Abstract: Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook’s intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
[128] Zoom-In to Sort AI-Generated Images Out
Yikun Ji,Yan Hong,Bowen Deng,Jun Lan,Huijia Zhu,Weiqiang Wang,Liqing Zhang,Jianfu Zhang
Main category: cs.CV
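TL;DR: The paper proposes ZoomIn, a two-stage framework that improves the accuracy and interpretability of AI-generated image detection, and releases the MagniFake dataset.
Details
Motivation: As AI-generated images improve, the line between real and synthetic content blurs. Existing vision-language models struggle to detect subtle artifacts in high-quality synthetic images, so a method that is both accurate and interpretable is needed.
Contribution: 1. ZoomIn, a two-stage detection framework that mimics human visual inspection; 2. MagniFake, a dataset of 20,000 real and synthetic images with annotations; 3. An accuracy of 96.39%.
Method: ZoomIn works in two stages: 1. Scan the image to locate suspicious regions; 2. Analyze the zoomed-in regions and produce a grounded verdict with an explanation from a vision-language model. Training uses the MagniFake dataset.
Result: The method reaches 96.39% accuracy with robust generalization, while providing explanations grounded in visual evidence.
Insight: A two-stage approach that mimics human visual attention improves detection, and a dataset annotated with bounding boxes and forensic explanations supports both training and interpretability.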
Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose ZoomIn, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict. To support training, we introduce MagniFake, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.
[129] A Recursive Pyramidal Algorithm for Solving the Image Registration Problem
Stefan Dirnstorfer
Main category: cs.CV
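TL;DR: The paper presents a simple, end-to-end trainable recursive pyramidal algorithm for image registration, with very compact code and low training-data requirements.
Details
Motivation: Image registration is a fundamental computer vision problem, and existing methods tend to be complex and data-hungry. The paper aims for a simple and efficient solution.
Contribution: An end-to-end trainable recursive pyramidal algorithm that is implementable in a few lines of code and works with very little training data.
Method: A recursive pyramid structure implemented in a few lines of Python, suited to settings with limited training data, training time, or code complexity.
Result: In a stereo vision example, the algorithm was trained from only 74 images on a 19x15 input window; the abstract reports accurate results in some settings.
Insight: With a concise design and cheap training, good registration remains achievable in resource-constrained scenarios.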
Abstract: The problem of image registration is finding a transformation that aligns two images, such that the corresponding points are in the same location. This paper introduces a simple, end-to-end trainable algorithm that is implementable in a few lines of Python code. The approach is shown to work with very little training data and training time, while achieving accurate results in some settings. An example application to stereo vision was trained from 74 images on a 19x15 input window. With just a dozen lines of Python code this algorithm excels in brevity and may serve as a good start in related scenarios with limitations to training data, training time or code complexity.
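The recursive coarse-to-fine structure behind pyramidal registration can be shown on a one-dimensional toy problem. The sketch below is a classical, non-learned shift estimator and not the paper's trainable algorithm, whose details the abstract does not give; the 1-D signals, the averaging downsampler, the squared-error cost, and the function names are all assumptions.

```python
import math


def downsample(sig):
    """Halve the resolution by averaging neighboring pairs."""
    return [(sig[i] + sig[i + 1]) / 2 for i in range(0, len(sig) - 1, 2)]


def cost(a, b, shift):
    """Mean squared difference between a[i] and b[i + shift] over the overlap."""
    pairs = [(a[i], b[i + shift]) for i in range(len(a))
             if 0 <= i + shift < len(b)]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def register(a, b, min_len=8):
    """Return the integer shift s such that a[i] is closest to b[i + s].

    Recursion: solve the problem at half resolution, double that shift,
    then refine it by at most one sample at the current resolution.
    """
    if len(a) <= min_len:
        # Coarsest level: exhaustive search over a small range.
        candidates = range(-(len(a) // 2), len(a) // 2 + 1)
    else:
        coarse = 2 * register(downsample(a), downsample(b), min_len)
        candidates = (coarse - 1, coarse, coarse + 1)
    return min(candidates, key=lambda s: cost(a, b, s))


# Two copies of the same smooth bump, offset by 5 samples.
a = [math.exp(-((i - 30) / 6.0) ** 2) for i in range(64)]
b = [math.exp(-((i - 35) / 6.0) ** 2) for i in range(64)]
print(register(a, b))
```

Each level only evaluates three candidate shifts, so the total work grows with the signal length rather than with the size of the search range. A learned variant would replace the hand-written refinement step with a trained model applied at each pyramid level.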
[130] Detection of retinal diseases using an accelerated reused convolutional network
Amin Ahmadi Kasani,Hedieh Sajedi
Main category: cs.CV
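TL;DR: The paper proposes ArConv, a new convolutional layer that reduces the computational complexity of convolutional neural networks, making models more suitable for mobile devices while keeping accuracy high.
Details
Motivation: To make deep learning models more accessible, especially on mobile devices used for early diagnosis of retinal diseases, by cutting computational complexity without losing accuracy.
Contribution: The ArConv layer, which yields a model with only 1.3M parameters that outperforms MobileNetV2 on the RfMiD dataset.
Method: The convolutional layers are redesigned and optimized; the resulting ArConv layers lower model complexity so that the model can be deployed on mobile phones.
Result: On the RfMiD test set the model reaches an accuracy of 0.9328 versus 0.9266 for MobileNetV2 (2.2M parameters), with fewer parameters.
Insight: Better convolutional layer design can substantially reduce compute without sacrificing accuracy, improving practicality on resource-constrained devices.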
Abstract: Convolutional neural networks are continually evolving, with some efforts aimed at improving accuracy, others at increasing speed, and some at enhancing accessibility. Improving accessibility broadens the application of neural networks across a wider range of tasks, including the detection of eye diseases. Early diagnosis of eye diseases and consulting an ophthalmologist can prevent many vision disorders. Given the importance of this issue, various datasets have been collected from the cornea to facilitate the process of making neural network models. However, most of the methods introduced in the past are computationally complex. In this study, we tried to increase the accessibility of deep neural network models. We did this at the most fundamental level, specifically by redesigning and optimizing the convolutional layers. By doing so, we created a new general model that incorporates our novel convolutional layer named ArConv layers. Thanks to the efficient performance of this new layer, the model has suitable complexity for use in mobile phones and can perform the task of diagnosing the presence of disease with high accuracy. The final model we present contains only 1.3 million parameters. In comparison to the MobileNetV2 model, which has 2.2 million parameters, our model demonstrated better accuracy when trained and evaluated on the RfMiD dataset under identical conditions, achieving an accuracy of 0.9328 versus 0.9266 on the RfMiD test set.
[131] Scaling Sequence-to-Sequence Generative Neural Rendering
Shikun Liu,Kam Woh Ng,Wonbong Jang,Jiadong Guo,Junlin Han,Haozhe Liu,Yiannis Douratsos,Juan C. Pérez,Zijian Zhou,Chi Phung,Tao Xiang,Juan-Manuel Pérez-Rúa
Main category: cs.CV
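TL;DR: Kaleido is a family of generative models for photorealistic object- and scene-level neural rendering. It treats rendering as a sequence-to-sequence image synthesis task, performs generative view synthesis without explicit 3D representations, and unifies 3D and video modelling. Pre-training on large-scale video data improves performance, and the model sets a new state of the art on a range of view synthesis benchmarks.
Details
Motivation: Generative neural rendering typically needs explicit 3D representations or scarce camera-labelled data. The work addresses both and explores how a sequence-to-sequence formulation can unify 3D and video modelling.
Contribution: 1. The Kaleido models, which perform generative view synthesis without explicit 3D representations; 2. A masked autoregressive framework that generates any number of 6-DoF target views from any number of reference views; 3. Unified 3D and video modelling in a single decoder-only model; 4. Large-scale video pre-training that improves performance.
Method: Kaleido uses a decoder-only rectified flow transformer on a sequence-to-sequence image synthesis task, combined with a masked autoregressive framework and large-scale video pre-training.
Result: Kaleido's zero-shot performance substantially outperforms other generative methods in few-view settings and, for the first time, matches per-scene optimisation methods in many-view settings; it sets a new state of the art on a range of view synthesis benchmarks.
Insight: Treating 3D as a specialised sub-domain of video lets large-scale video pre-training reduce reliance on scarce 3D data, opening a new direction for generative neural rendering.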
Abstract: We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets – all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.
[132] The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation
Jincan Lou,Jingkun Chen,Haoquan Li,Hang Li,Wenjian Huang,Weihua Chen,Fan Wang,Jianguo Zhang
Main category: cs.CV
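TL;DR: The paper proposes CoSSeg-TTA, a framework that combines contrast-aware semi-supervised learning with test-time adaptation to handle domain shift and scarce annotations in liver MRI segmentation.
Details
Motivation: Liver MRI segmentation is hampered by limited annotated data, heterogeneous enhancement protocols, and domain shift across scanners and institutions. Image-to-image translation methods introduce structural distortions and train unstably in the single-modality setting.
Contribution: 1) The CoSSeg-TTA framework, combining semi-supervised learning and domain adaptation; 2) A randomized histogram-based style transfer and a contrast-aware network that enrich domain diversity; 3) A continual test-time adaptation strategy for more robust inference.
Method: Built on nnU-Netv2, the framework uses a semi-supervised mean teacher scheme to exploit unlabeled volumes, a domain adaptation module with randomized histogram-based style transfer and a trainable contrast-aware network, and continual test-time adaptation at inference.
Result: Under low-annotation conditions the framework consistently outperforms the nnU-Netv2 baseline on Dice score and Hausdorff Distance and generalizes well to unseen domains.
Insight: Test-time adaptation and style-based appearance augmentation are effective strategies for single-modality medical segmentation.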
Abstract: Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.
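The mean teacher scheme mentioned in the abstract keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The sketch below shows only that update rule on a toy parameter dictionary; the `decay` value and the flat dictionary representation are assumptions, and the actual CoSSeg-TTA training loop is not described in the abstract.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student by an EMA step.

    teacher, student: dicts mapping parameter names to float values
    (a stand-in for real network tensors).
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Toy example: the student has moved, the teacher follows slowly.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

Because the teacher changes slowly, its predictions on unlabeled volumes are more stable than the student's and can serve as consistency targets, which is the usual motivation for this scheme.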
[133] ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation
Jay Zhangjie Wu,Xuanchi Ren,Tianchang Shen,Tianshi Cao,Kai He,Yifan Lu,Ruiyuan Gao,Enze Xie,Shiyi Lan,Jose M. Alvarez,Jun Gao,Sanja Fidler,Zian Wang,Huan Ling
Main category: cs.CV
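TL;DR: ChronoEdit reframes image editing as a video generation problem and uses the temporal consistency of pretrained video generative models to keep edits physically coherent.
Details
Motivation: Generative models have advanced image editing and in-context image generation, but they fall short on physical consistency (edited objects must remain coherent), which matters especially for world simulation tasks.
Contribution: The ChronoEdit framework, which treats image editing as video generation; a temporal reasoning stage in which reasoning tokens constrain edits to physically viable trajectories; and the PBench-Edit benchmark.
Method: The input and edited images are treated as the first and last frames of a video, which lets the method draw on the temporal consistency of pretrained video generative models. At inference, a temporal reasoning stage jointly denoises the target frame with reasoning tokens to imagine a plausible editing trajectory; the tokens are dropped after a few steps to save compute.
Result: ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility.
Insight: The temporal consistency of video generative models helps solve physical coherence in image editing, and the temporal reasoning stage effectively restricts edits to physically viable transformations.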
Abstract: Recent advances in large generative models have significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit
[134] CARE-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson’s Disease Gait Assessment
Vida Adeli,Ivan Klabucar,Javad Rajabi,Benjamin Filtjens,Soroush Mehraban,Diwei Wang,Hyewon Seo,Trung-Hieu Hoang,Minh N. Do,Candice Muller,Claudia Oliveira,Daniel Boari Coelho,Pieter Ginis,Moran Gilat,Alice Nieuwboer,Joke Spildooren,Lucas Mckay,Hyeokhyen Kwon,Gari Clifford,Christine Esper,Stewart Factor,Imari Genias,Amirhossein Dadashzadeh,Leia Shum,Alan Whone,Majid Mirmehdi,Andrea Iaboni,Babak Taati
Main category: cs.CV
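TL;DR: The paper introduces CARE-PD, a multi-site, anonymized clinical dataset for gait assessment in Parkinson's Disease (PD). It is the largest publicly available archive of 3D mesh gait data for PD and supports supervised clinical score prediction and unsupervised motion pretext tasks.
Details
Motivation: Objective PD gait assessment is held back by the lack of large, diverse, clinically annotated motion datasets.
Contribution: The CARE-PD dataset, the first multi-site collection of its kind (9 cohorts from 8 clinical centers), anonymized through a harmonized preprocessing pipeline.
Method: A unified preprocessing pipeline converts recordings (RGB video or motion capture) into SMPL meshes; the dataset is used to benchmark clinical score prediction and motion pretext tasks.
Result: Pretraining on CARE-PD reduces MPJPE from 60.8mm to 7.5mm and raises PD severity macro-F1 by 17 percentage points.
Insight: Clinically curated, diverse training data substantially improves model performance on PD gait tasks.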
Abstract: Objective gait assessment in Parkinson’s Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce CARE-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. CARE-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson’s Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on CARE-PD reduces MPJPE (from 60.8mm to 7.5mm) and boosts PD severity macro-F1 by 17 percentage points, underscoring the value of clinically curated, diverse training data. CARE-PD and all benchmark code are released for non-commercial research at https://neurips2025.care-pd.ca/.
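MPJPE (mean per-joint position error), the metric quoted for the motion pretext tasks, is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single pose; the two-joint toy coordinates are invented, and any alignment step the benchmark may apply before measuring error is not covered.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error for one pose.

    pred, target: equal-length lists of (x, y, z) joint coordinates,
    in the same units (e.g. millimetres).
    """
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)


# Toy pose with two joints, coordinates in millimetres.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, target))
```

For a dataset-level score, the per-pose values would typically be averaged over all frames and sequences.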
[135] GenAR: Next-Scale Autoregressive Generation for Spatial Gene Expression Prediction
Jiarui Ouyang,Yihui Wang,Yihang Gao,Yingxue Xu,Shu Yang,Hao Chen
Main category: cs.CV
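TL;DR: GenAR is a multi-scale autoregressive framework for predicting spatial gene expression from H&E-stained images. It addresses two weaknesses of existing methods, independent per-gene prediction and continuous regression, through hierarchical gene clustering and discrete token generation.
Details
Motivation: Spatial transcriptomics (ST) is costly, and predicting gene expression from widely available H&E-stained images is a cheaper alternative. Existing methods predict each gene independently and use continuous regression, ignoring co-expression structure and the discrete nature of expression counts, which can yield biologically implausible outputs.
Contribution: A multi-scale autoregressive framework that addresses independent prediction and continuous regression; hierarchical gene clustering that exposes cross-gene dependencies; direct prediction of raw counts instead of continuous values; decoding conditioned on fused histological and spatial embeddings.
Method: GenAR refines predictions from coarse to fine within a multi-scale autoregressive framework, clusters genes hierarchically to capture dependencies, models expression as discrete token generation, and conditions decoding on fused histological and spatial embeddings.
Result: On four spatial transcriptomics datasets covering different tissue types, GenAR achieves state-of-the-art performance, with potential implications for precision medicine and cost-effective molecular profiling.
Insight: From an information-theoretic perspective, the discrete formulation avoids log-induced biases, and the coarse-to-fine factorization aligns with a principled conditional decomposition; together, the multi-scale approach and discrete token generation improve the biological plausibility of predictions.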
Abstract: Spatial Transcriptomics (ST) offers spatially resolved gene expression but remains costly. Predicting expression directly from widely available Hematoxylin and Eosin (H&E) stained images presents a cost-effective alternative. However, most computational approaches (i) predict each gene independently, overlooking co-expression structure, and (ii) cast the task as continuous regression despite expression being discrete counts. This mismatch can yield biologically implausible outputs and complicate downstream analyses. We introduce GenAR, a multi-scale autoregressive framework that refines predictions from coarse to fine. GenAR clusters genes into hierarchical groups to expose cross-gene dependencies, models expression as codebook-free discrete token generation to directly predict raw counts, and conditions decoding on fused histological and spatial embeddings. From an information-theoretic perspective, the discrete formulation avoids log-induced biases and the coarse-to-fine factorization aligns with a principled conditional decomposition. Extensive experimental results on four Spatial Transcriptomics datasets across different tissue types demonstrate that GenAR achieves state-of-the-art performance, offering potential implications for precision medicine and cost-effective molecular profiling. Code is publicly available at https://github.com/oyjr/genar.
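One way to read the "principled conditional decomposition" mentioned above is as the ordinary chain rule applied across scales. The notation here is an assumption made for illustration and is not taken from the paper: $x^{(s)}$ denotes the discrete expression tokens at scale $s$ (coarse gene groups first, individual genes last), $S$ is the number of scales, and $c$ is the fused histological and spatial conditioning.

```latex
p\bigl(x^{(1)}, \dots, x^{(S)} \mid c\bigr)
  \;=\; \prod_{s=1}^{S} p\bigl(x^{(s)} \mid x^{(1)}, \dots, x^{(s-1)}, c\bigr)
```

Under this reading, each factor is one autoregressive step, so finer-scale predictions are conditioned on all coarser ones.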
[136] RAP: 3D Rasterization Augmented End-to-End Planning
Lan Feng,Yang Gao,Eloi Zablocki,Quanyi Li,Wuyang Li,Sichao Liu,Matthieu Cord,Alexandre Alahi
Main category: cs.CV
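TL;DR: The paper proposes RAP, a data augmentation method based on 3D rasterization that improves the robustness and generalization of end-to-end driving planners, replacing costly photorealistic rendering.
Details
Motivation: Imitation learning for end-to-end driving relies only on expert demonstrations and lacks recovery data, so small errors compound into failures once deployed. Existing approaches depend on photorealistic digital twins, which are costly and impractical.
Contribution: 1. A 3D rasterization method that generates diverse viewpoints and trajectories through lightweight rasterization of semantically annotated data; 2. A Raster-to-Real feature alignment method that narrows the sim-to-real gap; 3. State-of-the-art performance on multiple benchmarks.
Method: 1. Generate semantically consistent synthetic data with 3D rasterization; 2. Design a feature alignment module that adapts synthetic data to real scenes; 3. Build the RAP pipeline for data augmentation and training.
Result: RAP ranks first on four benchmarks, including NAVSIM v1/v2 and the Waymo Open Dataset, demonstrating robustness and generalization.
Insight: Driving planning depends on geometry and dynamics rather than textures or lighting. Lightweight semantic rasterization combined with feature alignment is enough to scale training data efficiently, avoiding costly photorealistic rendering.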
Abstract: Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real feature-space alignment that bridges the sim-to-real gap. Together, these components form Rasterization Augmented Planning (RAP), a scalable data augmentation pipeline for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking first on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results show that lightweight rasterization with feature alignment suffices to scale E2E training, offering a practical alternative to photorealistic rendering. Project page: https://alan-lanfeng.github.io/RAP/.
[137] Diffusion^2: Dual Diffusion Model with Uncertainty-Aware Adaptive Noise for Momentary Trajectory Prediction
Yuhao Luo,Yuang Zhang,Kehua Chen,Xinyu Zheng,Shucheng Zhang,Sikai Chen,Yinhai Wang
Main category: cs.CV
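TL;DR: The paper proposes Diffusion^2, a framework for momentary trajectory prediction that combines backward and forward diffusion models with a mechanism that dynamically modulates noise, improving prediction accuracy.
Details
Motivation: In autonomous driving and human-robot interaction, momentary trajectory prediction (for example, a pedestrian suddenly emerging from a blind spot) is challenging because too little observational data is available. Improving trajectory prediction in such extreme scenarios is critical for traffic safety.
Contribution: Proposes Diffusion^2, which comprises two diffusion models, one predicting unobserved historical trajectories backward and one forecasting future trajectories forward, together with a dual-head parameterization mechanism used to modulate noise dynamically.
Method: A dual diffusion structure: the backward model generates historical trajectories and the forward model predicts future ones. A dual-head parameterization mechanism estimates uncertainty, and a temporally adaptive noise module dynamically modulates the noise scale.
Result: On the ETH/UCY and Stanford Drone datasets, Diffusion^2 achieves state-of-the-art momentary trajectory prediction performance.
Insight: A two-stage diffusion design that generates history and then predicts the future, combined with dynamic noise modulation, handles trajectory prediction in momentary scenarios effectively and improves robustness and accuracy.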
Abstract: Accurate pedestrian trajectory prediction is crucial for ensuring safety and efficiency in autonomous driving and human-robot interaction scenarios. Earlier studies primarily utilized sufficient observational data to predict future trajectories. However, in real-world scenarios, such as pedestrians suddenly emerging from blind spots, sufficient observational data is often unavailable (i.e. momentary trajectory), making accurate prediction challenging and increasing the risk of traffic accidents. Therefore, advancing research on pedestrian trajectory prediction under extreme scenarios is critical for enhancing traffic safety. In this work, we propose a novel framework termed Diffusion^2, tailored for momentary trajectory prediction. Diffusion^2 consists of two sequentially connected diffusion models: one for backward prediction, which generates unobserved historical trajectories, and the other for forward prediction, which forecasts future trajectories. Given that the generated unobserved historical trajectories may introduce additional noise, we propose a dual-head parameterization mechanism to estimate their aleatoric uncertainty and design a temporally adaptive noise module that dynamically modulates the noise scale in the forward diffusion process. Empirically, Diffusion^2 sets a new state-of-the-art in momentary trajectory prediction on ETH/UCY and Stanford Drone datasets.
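The dual-head idea above can be illustrated with a common recipe for aleatoric uncertainty: one head predicts a mean, the other a log-variance, and both are trained with a Gaussian negative log-likelihood. This is a sketch under that assumption, not the paper's formulation. In particular, `adaptive_noise_scale` and its `gain` parameter are hypothetical stand-ins for the temporally adaptive noise module, whose exact rule the abstract does not give.

```python
import math

def gaussian_nll(mean, log_var, target):
    """Per-coordinate Gaussian negative log-likelihood (constant dropped).
    One head supplies `mean`, the other supplies `log_var`."""
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))

def adaptive_noise_scale(base_scale, log_var, gain=1.0):
    """Hypothetical rule: inflate the forward-process noise scale where the
    generated history is uncertain. sigma = exp(0.5 * log_var)."""
    sigma = math.exp(0.5 * log_var)
    return base_scale * (1.0 + gain * sigma)

# A confident and an uncertain prediction for the same past coordinate.
loss_confident = gaussian_nll(mean=1.0, log_var=0.0, target=3.0)
scale_confident = adaptive_noise_scale(base_scale=0.1, log_var=0.0)
scale_uncertain = adaptive_noise_scale(base_scale=0.1, log_var=2.0)
```

The only property the sketch relies on is monotonicity: time steps whose generated history is more uncertain receive a larger noise scale.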
[138] MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator
Xuehai He,Shijie Zhou,Thivyanth Venkateswaran,Kaizhi Zheng,Ziyu Wan,Achuta Kadambi,Xin Eric Wang
Main category: cs.CV
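TL;DR: MorphoSim is a language-guided 4D world simulator that supports interactive control and editing. It generates multi-view-consistent dynamic scenes and lets objects be manipulated through natural language instructions.
Details
Motivation: Existing text-to-video models are confined to 2D views and offer limited interaction, while fields such as robotics need controllable, editable spatiotemporal environment models to support training data generation and task design.
Contribution: Proposes the MorphoSim framework, which generates 4D scenes from language instructions and supports multi-view consistency, object-level control, and interactive editing without full scene re-generation.
Method: Combines trajectory-guided generation with feature field distillation to generate and edit dynamic scenes, allowing objects to be observed and manipulated from arbitrary viewpoints.
Result: Experiments show that MorphoSim maintains high scene fidelity while providing controllability and editability.
Insight: Language-guided generation and interactive editing of dynamic environments provide a flexible simulation tool for robotics and related fields and advance 4D scene modeling.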
Abstract: World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.
[139] Your Vision-Language Model Can’t Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
Xuyang Guo,Zekai Huang,Zhenmei Shi,Zhao Song,Jiahao Zhang
Main category: cs.CV
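TL;DR: The paper exposes substantial failures of current vision-language models (VLMs) on compositional counting and introduces VLMCountBench, a minimalist benchmark for evaluating basic counting ability.
Details
Motivation: Although VLMs perform well across many vision-language tasks, their ability on basic counting has not been adequately evaluated. The paper aims to fill this gap, particularly for compositional object counting.
Contribution: Proposes the VLMCountBench benchmark, focused on evaluating VLMs on compositional counting, and reveals their significant limitations on this task.
Method: A strictly controlled experimental design restricted to counting basic geometric shapes and their compositions, with systematic analysis of variables such as color, size, and prompt refinement.
Result: VLMs count reliably when only one shape type is present but fail substantially when multiple shape types are combined, indicating weak compositional counting ability.
Insight: Current VLMs have a marked deficiency in compositional counting; future research should focus on improving how models understand and reason about composed objects.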
Abstract: Vision-Language Models (VLMs) have become a central focus of today’s AI community, owing to their impressive abilities gained from training on large-scale vision-language data from the Web. These models have demonstrated strong performance across diverse tasks, including image understanding, video understanding, complex visual reasoning, and embodied AI. Despite these noteworthy successes, a fundamental question remains: Can VLMs count objects correctly? In this paper, we introduce a simple yet effective benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes (e.g., triangles, circles) and their compositions, focusing exclusively on counting tasks without interference from other factors. We adopt strict independent variable control and systematically study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation. Our empirical results reveal that while VLMs can count reliably when only one shape type is present, they exhibit substantial failures when multiple shape types are combined (i.e., compositional counting). This highlights a fundamental empirical limitation of current VLMs and motivates important directions for future research.
[140] CodeFormer++: Blind Face Restoration Using Deformable Registration and Deep Metric Learning
Venkata Bharath Reddy Reddem,Akshay P Sarashetti,Ranjith Merugu,Amit Satish Unde
Main category: cs.CV
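TL;DR: CodeFormer++ is a blind face restoration framework that uses deformable registration and deep metric learning to balance high-quality restoration with identity preservation.
Details
Motivation: Existing blind face restoration methods trade visual quality against identity preservation, resulting in either identity distortion or suboptimal degradation removal.
Contribution: 1) A learning-based deformable face registration module; 2) A texture-guided restoration network; 3) Deep metric learning to better fuse identity-preserving and generative features.
Method: Decomposes blind face restoration into three sub-tasks (identity-preserving restoration, high-quality generation, and dynamic feature fusion) and introduces deformable registration, texture guidance, and metric learning.
Result: On real-world and synthetic datasets, it shows better visual fidelity and identity consistency than existing methods.
Insight: By splitting the task into stages and fusing features dynamically, CodeFormer++ improves restoration quality while preserving identity, avoiding the trade-off between the two.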
Abstract: Blind face restoration (BFR) has attracted increasing attention with the rise of generative methods. Most existing approaches integrate generative priors into the restoration process, aiming to jointly address facial detail generation and identity preservation. However, these methods often suffer from a trade-off between visual quality and identity fidelity, leading to either identity distortion or suboptimal degradation removal. In this paper, we present CodeFormer++, a novel framework that maximizes the utility of generative priors for high-quality face restoration while preserving identity. We decompose BFR into three sub-tasks: (i) identity-preserving face restoration, (ii) high-quality face generation, and (iii) dynamic fusion of identity features with realistic texture details. Our method makes three key contributions: (1) a learning-based deformable face registration module that semantically aligns generated and restored faces; (2) a texture-guided restoration network to dynamically extract and transfer the texture of the generated face to boost the quality of the identity-preserving restored face; and (3) the integration of deep metric learning for BFR with the generation of informative positive and hard negative samples to better fuse identity-preserving and generative features. Extensive experiments on real-world and synthetic datasets demonstrate that the proposed CodeFormer++ achieves superior performance in terms of both visual fidelity and identity consistency.
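The abstract mentions deep metric learning with positive and hard negative samples but does not name the loss. As a generic illustration, a triplet margin loss (a standard metric-learning objective, swapped in here as an assumption) pulls an anchor embedding toward a positive and pushes it away from a negative. The embeddings and the `margin` value below are made up.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on Euclidean distances:
    max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = (0.0, 0.0)
positive = (0.3, 0.4)        # distance 0.5 from the anchor
easy_negative = (3.0, 4.0)   # distance 5.0: already far away
hard_negative = (0.6, 0.0)   # distance 0.6: barely farther than the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

The easy negative contributes zero loss while the hard one does not, which is the usual argument for mining hard negatives.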
[141] A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
Yuanhao Zou,Shengji Jin,Andong Deng,Youpeng Zhao,Jun Wang,Chen Chen
Main category: cs.CV
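TL;DR: The paper proposes A.I.R., an adaptive, iterative, reasoning-based frame selection strategy for efficient frame selection in video question answering (VideoQA). It combines deep semantic analysis with computational efficiency and improves performance.
Details
Motivation: Existing frame selection methods face a dilemma: lightweight similarity models such as CLIP miss the nuances of complex queries, while powerful vision-language models (VLMs) improve accuracy at prohibitive computational cost. A.I.R. is proposed to balance the two.
Contribution: Proposes A.I.R., a training-free adaptive, iterative, reasoning-based frame selection method that combines deep semantic analysis with low-cost iterative processing, improving both selection quality and computational efficiency.
Method: A.I.R. uses a powerful VLM for deep semantic analysis of complex queries and runs it inside an iterative loop that processes only a small batch of high-potential frames at a time.
Result: On multiple VideoQA benchmarks, A.I.R. outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.
Insight: Iterative, adaptive strategies can cut computational cost substantially without sacrificing the depth of semantic analysis, suggesting a new optimization direction for video processing tasks.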
Abstract: Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off: approaches relying on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries, resulting in inaccurate similarity scores that cannot reflect the authentic query-frame relevance, which further undermines frame selection. Meanwhile, methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful VLM to perform deep, semantic analysis on complex queries, and this analysis is deployed within a cost-effective iterative loop that processes only a small batch of the most high-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.
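The loop described above can be sketched as follows. This is a simplified reading, not the authors' algorithm: frames are ranked once by a cheap similarity score, and an expensive scorer re-checks only a small batch of the best unchecked frames per round. The function `iterative_frame_selection`, its parameters, and the stub `vlm_score` are all hypothetical.

```python
def iterative_frame_selection(cheap_scores, vlm_score, batch_size=2,
                              max_rounds=3, keep_threshold=0.5, target=3):
    """Rank frames by a cheap score, then let an expensive scorer re-check
    one small batch of top-ranked, not-yet-checked frames per round."""
    ranked = sorted(range(len(cheap_scores)),
                    key=lambda i: cheap_scores[i], reverse=True)
    selected, calls = [], 0
    for round_idx in range(max_rounds):
        batch = ranked[round_idx * batch_size:(round_idx + 1) * batch_size]
        if not batch:
            break  # no frames left to check
        for frame in batch:
            calls += 1  # one expensive scorer call per checked frame
            if vlm_score(frame) >= keep_threshold:
                selected.append(frame)
        if len(selected) >= target:
            break  # adaptive stop: enough relevant frames found
    return sorted(selected), calls

# Toy run: 10 frames; the stub scorer treats frames 2, 5 and 7 as relevant.
cheap_scores = [0.1, 0.2, 0.9, 0.3, 0.8, 0.7, 0.05, 0.6, 0.4, 0.15]
relevant = {2, 5, 7}
frames, calls = iterative_frame_selection(
    cheap_scores, vlm_score=lambda f: 1.0 if f in relevant else 0.0)
```

In the toy run the expensive scorer is called on 4 of the 10 frames before the stopping condition is met, which is where the efficiency gain comes from.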
[142] REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
Qiyuan He,Yicong Li,Haotian Ye,Jinghao Wang,Xinyao Liao,Pheng-Ann Heng,Stefano Ermon,James Zou,Angela Yao
Main category: cs.CV
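TL;DR: The paper proposes reAR, which uses generator-tokenizer consistency regularization to address the inconsistency between generator and tokenizer in visual autoregressive models, substantially improving generation performance.
Details
Motivation: Visual autoregressive (AR) generation lags behind diffusion models, which the authors attribute to generator-tokenizer inconsistency.
Contribution: Proposes reAR, a training strategy that needs no changes to the tokenizer or inference pipeline and improves generation quality through a consistency-based regularization objective.
Method: During training, reAR additionally has the model recover the visual embedding of the current token and predict the embedding of the target token under a noisy context, aligning the generator with the tokenizer.
Result: On ImageNet, reAR reduces gFID from 3.02 to 1.86 and improves IS to 316.9. With advanced tokenizers, a 177M-parameter model reaches a gFID of 1.42, matching larger diffusion models (675M parameters).
Insight: Generator-tokenizer consistency is critical to the performance of visual autoregressive models; a simple regularization strategy can substantially improve generation quality without complex design.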
Abstract: Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
[143] SPEGNet: Synergistic Perception-Guided Network for Camouflaged Object Detection
Baber Jan,Saeed Anwar,Aiman H. El-Maleh,Abdul Jabbar Siddiqui,Abdul Bais
Main category: cs.CV
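TL;DR: SPEGNet is a synergistic perception-guided network for camouflaged object detection. Its unified design integrates multi-scale features and boundary refinement, avoiding the computational burden of stacked modules in existing methods while retaining high accuracy and real-time performance.
Details
Motivation: Existing camouflaged object detection methods stack complex modules (boundary modules, attention mechanisms, multi-scale processors), which adds computational burden without proportional gains and often sacrifices fine detail. SPEGNet aims to solve this with a unified design.
Contribution: 1. A unified architecture (SPEGNet) for multi-scale feature integration and boundary refinement; 2. Boundaries derived directly from context-rich representations built through channel calibration and spatial enhancement; 3. Real-time performance with high detection accuracy.
Method: SPEGNet integrates multi-scale features via channel calibration and spatial enhancement, derives boundaries directly from context, and applies progressive refinement for scale-adaptive edge modulation.
Result: It achieves Sα scores of 0.887, 0.890, and 0.895 on CAMO, COD10K, and NC4K, respectively, with real-time inference speed.
Insight: A unified design avoids the complexity of stacking modules while multi-scale features and boundary refinement preserve detection accuracy; edge modulation has its peak influence at intermediate resolutions.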
Abstract: Camouflaged object detection segments objects with intrinsic similarity and edge disruption. Current detection methods rely on accumulated complex components. Each approach adds components such as boundary modules, attention mechanisms, and multi-scale processors independently. This accumulation creates a computational burden without proportional gains. To manage this complexity, they process at reduced resolutions, eliminating fine details essential for camouflage. We present SPEGNet, addressing fragmentation through a unified design. The architecture integrates multi-scale features via channel calibration and spatial enhancement. Boundaries emerge directly from context-rich representations, maintaining semantic-spatial alignment. Progressive refinement implements scale-adaptive edge modulation with peak influence at intermediate resolutions. This design strikes a balance between boundary precision and regional consistency. SPEGNet achieves 0.887 $S_\alpha$ on CAMO, 0.890 on COD10K, and 0.895 on NC4K, with real-time inference speed. Our approach excels across scales, from tiny, intricate objects to large, pattern-similar ones, while handling occlusion and ambiguous boundaries. Code, model weights, and results are available at https://github.com/Baber-Jan/SPEGNet.
[144] MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models
Soo Yong Kim,Suin Cho,Vincent-Daniel Yun,Gyeongyeon Hwang
Main category: cs.CV
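TL;DR: MedCLM converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning and proposes a staged curriculum learning strategy (Easy, Medium, Hard), improving the reasoning ability and clinical alignment of medical vision-language models.
Details
Motivation: Medical imaging needs AI that incorporates clinical diagnostic reasoning, but existing methods lack contextual grounding and support for step-by-step reasoning. MedCLM targets this with structured data and staged learning.
Contribution: 1) An automated pipeline that converts detection data into medical VQA data with CoT reasoning; 2) A staged curriculum strategy (CoT-Curriculum) that improves visual localization and reasoning; 3) State-of-the-art performance on several medical VQA benchmarks.
Method: 1) Generate structured CoT data by linking lesion boxes to organ segmentation; 2) Train with a three-stage curriculum: Easy (explicit lesion boxes), Medium (implicit localization), and Hard (weakly supervised reasoning).
Result: MedCLM performs strongly on several medical VQA benchmarks, validating its data generation method and curriculum strategy.
Insight: Combining structured data generation with staged training improves the reasoning ability of medical vision-language models and offers a scalable framework for developing clinically aligned models.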
Abstract: Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging. We introduce MedCLM, an automated pipeline that converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ segmentation and structured rationales. These contextual signals enable medical vision-language models to generate question-answer pairs with step-by-step reasoning. To utilize this data effectively, we propose an Integrated CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes for visual grounding, a Medium stage that encourages implicit localization, and a Hard stage for weakly supervised reasoning. Experimental results demonstrate that MedCLM attains state-of-the-art performance on several medical VQA benchmarks, providing a scalable framework for developing clinically aligned medical vision-language models.
[145] VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
Nonghai Zhang,Zeyu Zhang,Jiazi Wang,Yang Zhao,Hao Tang
Main category: cs.CV
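TL;DR: The paper presents VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis, and develops VaseVLM, a model tailored to this domain that improves 3D artifact recognition.
Details
Motivation: Existing vision-language models (VLMs) perform well on general tasks but face data scarcity and insufficient domain knowledge in cultural heritage, making it hard to handle 3D artifact analysis.
Contribution: 1. VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery; 2. VaseVLM, a domain-adapted model that improves artifact analysis performance.
Method: 1. Collect 664 3D models of ancient Greek vases with question-answer data and establish a complete data construction pipeline; 2. Apply domain-adaptive training to improve performance on artifact analysis.
Result: On VaseVQA-3D, VaseVLM improves R@1 by 12.8% and lexical similarity by 6.6% over the previous state of the art.
Insight: The work offers a new technical pathway for cultural heritage preservation research and shows the importance of domain-adaptive training for model performance.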
Abstract: Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity and insufficient domain knowledge. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach, where we improve by 12.8% on R@1 metrics and by 6.6% on lexical similarity compared with previous state-of-the-art on the VaseVQA-3D dataset, significantly improving the recognition and understanding of 3D vase artifacts, providing new technical pathways for digital heritage preservation research.
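The R@1 figure above is presumably the usual recall-at-k metric with k = 1; the abstract does not define it, so the following sketch rests on that assumption. `recall_at_k` and the vase-shape labels are illustrative only.

```python
def recall_at_k(ranked_candidates, ground_truth, k=1):
    """Fraction of queries whose correct answer appears among the top-k
    ranked candidates."""
    if len(ranked_candidates) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and equally long")
    hits = 0
    for candidates, truth in zip(ranked_candidates, ground_truth):
        if truth in candidates[:k]:
            hits += 1
    return hits / len(ground_truth)

# Toy example: three queries, each with a ranked list of predicted labels.
ranked = [["amphora", "krater"], ["kylix", "hydria"], ["hydria", "kylix"]]
truth = ["amphora", "hydria", "hydria"]
r_at_1 = recall_at_k(ranked, truth, k=1)  # two of three top-1 answers match
r_at_2 = recall_at_k(ranked, truth, k=2)  # every answer is within the top 2
```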
[146] Conditional Representation Learning for Customized Tasks
Honglin Liu,Chao Sun,Peng Hu,Yunfan Li,Xi Peng
Main category: cs.CV
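TL;DR: The paper proposes Conditional Representation Learning (CRL), which produces representations tailored to user-specified criteria, addressing the semantic mismatch of conventional representation learning on customized tasks. CRL uses a large language model (LLM) to generate descriptive texts that form a semantic basis and a vision-language model (VLM) to project image representations into the customized space, improving classification and retrieval.
Details
Motivation: Universal representations learned by conventional methods may not meet the needs of customized tasks. In animal habitat analysis, for example, universal embeddings emphasize categorical semantics rather than scene-related features. Existing approaches address this with supervised fine-tuning, which incurs high computational and annotation costs.
Contribution: 1. Proposes CRL, which generates customized representations from user-specified criteria; 2. Shows that the semantics of a space are determined by its basis and proposes approximating the basis of a customized feature space with descriptive texts; 3. Combines an LLM and a VLM to construct the semantic basis and perform the projection.
Method: 1. Given a user-specified criterion, an LLM generates descriptive texts that form the semantic basis; 2. A VLM projects the image representation into the conditional feature space; 3. The resulting conditional representation is used for customized tasks such as classification and retrieval.
Result: Experiments on classification and retrieval tasks show that CRL performs strongly and generalizes across tasks.
Insight: The semantic needs of customized tasks can be met by dynamically constructing a semantic basis from generated descriptive texts, and combining an LLM with a VLM achieves this efficiently.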
Abstract: Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.
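The projection step can be pictured as re-expressing an image embedding by its similarity to each descriptive-text embedding. The sketch below assumes cosine similarity as the projection and uses hand-written toy vectors; in the described pipeline the texts would come from an LLM and the embeddings from a VLM's encoders. `conditional_representation` is a hypothetical helper, not a function from the released CRL code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def conditional_representation(image_emb, basis_text_embs):
    """Coordinates of the image in the space spanned by the text basis:
    one similarity value per descriptive text (assumed projection rule)."""
    return [cosine(image_emb, t) for t in basis_text_embs]

# Toy 3-d embeddings standing in for VLM outputs (made-up values).
# Imagine the basis texts are "forest", "desert" and "ocean" for the
# criterion "habitat".
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image = [1.0, 0.0, 1.0]
rep = conditional_representation(image, basis)
```

A downstream classifier or retrieval index would then operate on `rep` instead of the original embedding, so its dimensions correspond to the chosen criterion.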
[147] Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior
Sheng Wang,Ruiming Wu,Charles Herndon,Yihang Liu,Shunsuke Koga,Jeanne Shen,Zhi Huang
Main category: cs.CV
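TL;DR: The paper presents Pathology-CoT, a dataset and framework for learning a visual chain-of-thought agent from expert whole-slide image diagnosis behavior. The AI Session Recorder logs expert behavior and produces standardized behavioral commands and annotated data, which are used to train the Pathologist-o3 agent; the agent performs strongly on gastrointestinal lymph-node metastasis detection.
Details
Motivation: Whole-slide image diagnosis is an interactive, multi-stage process, yet practical agentic systems that mimic expert behavior and deliver explainable diagnoses are lacking. The main obstacle is the lack of clinically aligned behavioral data.
Contribution: 1. The AI Session Recorder for recording expert behavior; 2. The Pathology-CoT dataset; 3. The Pathologist-o3 agent, which outperforms existing methods.
Method: 1. Record expert behavior with the AI Session Recorder; 2. Generate standardized behavioral commands and annotated data; 3. Train the two-stage agent Pathologist-o3.
Result: On gastrointestinal lymph-node metastasis detection, it achieves 84.5% precision, 100.0% recall, and 75.4% accuracy, exceeding the OpenAI o3 model.
Insight: Building datasets by recording expert behavior can extend to other medical imaging tasks and offers a scalable, human-aligned approach for clinical AI.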
Abstract: Diagnosing a whole-slide image is an interactive, multi-stage process involving changes in magnification and movement between fields. Although recent pathology foundation models are strong, practical agentic systems that decide what field to examine next, adjust magnification, and deliver explainable diagnoses are still lacking. The blocker is data: scalable, clinically aligned supervision of expert viewing behavior that is tacit and experience-based, not written in textbooks or online, and therefore absent from large language model training. We introduce the AI Session Recorder, which works with standard WSI viewers to unobtrusively record routine navigation and convert the viewer logs into standardized behavioral commands (inspect or peek at discrete magnifications) and bounding boxes. A lightweight human-in-the-loop review turns AI-drafted rationales into the Pathology-CoT dataset, a form of paired “where to look” and “why it matters” supervision produced at roughly six times lower labeling time. Using this behavioral data, we build Pathologist-o3, a two-stage agent that first proposes regions of interest and then performs behavior-guided reasoning. On gastrointestinal lymph-node metastasis detection, it achieved 84.5% precision, 100.0% recall, and 75.4% accuracy, exceeding the state-of-the-art OpenAI o3 model and generalizing across backbones. To our knowledge, this constitutes one of the first behavior-grounded agentic systems in pathology. Turning everyday viewer logs into scalable, expert-validated supervision, our framework makes agentic pathology practical and establishes a path to human-aligned, upgradeable clinical AI.
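Because the reported result combines perfect recall with lower precision and accuracy, it may help to recall how the three metrics relate to a confusion matrix. The counts below are invented for illustration and are not the paper's data; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Made-up counts: no positive case is missed (fn = 0), but two negatives
# are flagged as positive, so recall is perfect while precision is not.
precision, recall, accuracy = binary_metrics(tp=8, fp=2, fn=0, tn=10)
```

Recall of 100% only requires that no positive is missed; false positives lower precision and accuracy without affecting recall.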
[148] A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification
Hao Liu,Yunhao Gao,Wei Li,Mingyang Zhang,Maoguo Gong,Lorenzo Bruzzone
Main category: cs.CV
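TL;DR: The paper proposes the spatial-spectral-frequency interaction network (S²Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains to address the difficulty of extracting structural and detail features in multimodal remote sensing image classification, improving classification performance.
Details
Motivation: Existing multimodal remote sensing classification methods struggle to extract structural and detail features from heterogeneous, redundant data, motivating a method that introduces frequency-domain learning to model key but sparse detail features.
Contribution: Proposes S²Fin, with a high-frequency sparse enhancement transformer and a two-level spatial-frequency fusion strategy, improving multimodal remote sensing image classification.
Method: 1. High-frequency sparse enhancement transformer; 2. Adaptive frequency channel module; 3. High-frequency resonance mask; 4. Spatial-spectral attention fusion module.
Result: On four benchmark multimodal datasets, S²Fin outperforms state-of-the-art methods in classification.
Insight: Frequency-domain learning offers a new perspective for remote sensing image classification; combined with spatial and spectral features, it extracts key details more effectively.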
Abstract: Deep learning-based methods have achieved significant success in remote sensing Earth observation data analysis. Numerous feature fusion techniques address multimodal remote sensing image classification by integrating global and local features. However, these techniques often struggle to extract structural and detail features from heterogeneous and redundant multimodal images. With the goal of introducing frequency domain learning to model key and sparse detail features, this paper introduces the spatial-spectral-frequency interaction network (S$^2$Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains. Specifically, we propose a high-frequency sparse enhancement transformer that employs sparse spatial-spectral attention to optimize the parameters of the high-frequency filter. Subsequently, a two-level spatial-frequency fusion strategy is introduced, comprising an adaptive frequency channel module that fuses low-frequency structures with enhanced high-frequency details, and a high-frequency resonance mask that emphasizes sharp edges via phase similarity. In addition, a spatial-spectral attention fusion module further enhances feature extraction at intermediate layers of the network. Experiments on four benchmark multimodal datasets with limited labeled data demonstrate that S$^2$Fin performs superior classification, outperforming state-of-the-art methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.
[149] SFANet: Spatial-Frequency Attention Network for Deepfake Detection
Vrushank Ahire,Aniruddh Muley,Shivam Zample,Siddharth Verma,Pranav Menon,Surbhi Madan,Abhinav Dhall
Main category: cs.CV
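TL;DR: SFANet is a deepfake detection method built on spatial-frequency attention. By combining the strengths of transformer-based and texture-based methods, it achieves state-of-the-art performance on the DFWild-Cup dataset.
Details
Motivation: Current deepfake detection methods generalize poorly across diverse datasets and generation techniques, so a more robust and efficient solution is needed.
Contribution: Proposes SFANet, a novel ensemble framework that combines transformer-based and texture-based methods and introduces data splitting, sequential training, frequency splitting, patch-based attention, and face segmentation.
Method: Swin Transformers and ViTs extract global features, texture-based methods enhance local features (such as the eyes and mouth), and the data preprocessing and attention mechanisms above improve performance.
Result: It achieves state-of-the-art performance on the DFWild-Cup dataset, demonstrating robustness and generalization.
Insight: Hybrid models that combine transformers with texture-based methods can address the challenges of deepfake detection and offer a robust solution for real-world applications.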
Abstract: Detecting manipulated media has now become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We thus propose a novel ensemble framework, combining the strengths of transformer-based architectures, such as Swin Transformers and ViTs, and texture-based methods, to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, enhance high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. The ensemble benefits from the complementarity of these approaches, with transformers excelling in global feature extraction and texture-based methods providing interpretability. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.
[150] Do Superpixel Segmentation Methods Influence Deforestation Image Classification?
Hugo Resende,Fabio A. Faria,Eduardo B. Neto,Isabela Borlido,Victor Sundermann,Silvio Jamil F. Guimarães,Álvaro L. Fazenda
Main category: cs.CV
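TL;DR: The paper studies how different superpixel segmentation methods, including SLIC, affect the classification of tropical deforestation images and finds that classifier fusion is key to improving performance.
Details
Motivation: In the ForestEyes project, superpixel segmentation (SLIC) produces the segments used for volunteer labeling and model training, but other methods may perform better on remote sensing image segmentation. The study examines how the choice of segmentation method affects classification performance for deforestation detection.
Contribution: Evaluates five superpixel segmentation methods and their effect on classifier performance, finding that classifier fusion (ensemble learning) noticeably improves balanced accuracy.
Method: Compares SLIC with the four best-performing superpixel segmentation methods, selects classifiers with PyCaret AutoML, and examines performance differences under classifier fusion.
Result: Initial results show little variation across segmentation methods, but classifier fusion noticeably improves balanced accuracy.
Insight: Both the choice of segmentation method and the combination of classifiers matter for deforestation detection; ensemble learning can compensate for the shortcomings of a single segmentation method.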
Abstract: Image segmentation is a crucial step in various visual applications, including environmental monitoring through remote sensing. In the context of the ForestEyes project, which combines citizen science and machine learning to detect deforestation in tropical forests, image segments are used for labeling by volunteers and subsequent model training. Traditionally, the Simple Linear Iterative Clustering (SLIC) algorithm is adopted as the segmentation method. However, recent studies have indicated that other superpixel-based methods outperform SLIC in remote sensing image segmentation, and might suggest that they are more suitable for the task of detecting deforested areas. In this sense, this study investigated the impact of the four best segmentation methods, together with SLIC, on the training of classifiers for the target application. Initially, the results showed little variation in performance among segmentation methods, even when selecting the top five classifiers using the PyCaret AutoML library. However, by applying a classifier fusion approach (ensemble of classifiers), noticeable improvements in balanced accuracy were observed, highlighting the importance of both the choice of segmentation method and the combination of machine learning-based models for deforestation detection tasks.
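To make the two quantities in this study concrete, the sketch below computes balanced accuracy (the mean of per-class recall) and fuses three classifiers by majority vote. Majority voting is an assumption chosen for simplicity; the abstract only says "classifier fusion". The labels and predictions are toy values.

```python
from collections import Counter

def majority_vote(predictions_per_classifier):
    """Fuse label predictions by taking the most common vote per sample.
    An odd number of binary classifiers avoids ties."""
    fused = []
    for votes in zip(*predictions_per_classifier):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels: 1 = deforested segment, 0 = forest (made-up data).
y_true = [1, 1, 0, 0, 0, 0]
clf_a = [1, 0, 0, 0, 0, 1]
clf_b = [0, 1, 0, 0, 1, 0]
clf_c = [1, 1, 1, 0, 0, 0]
fused = majority_vote([clf_a, clf_b, clf_c])
single_score = balanced_accuracy(y_true, clf_a)
fused_score = balanced_accuracy(y_true, fused)
```

In this toy case each classifier errs on different samples, so the vote corrects all of them; real gains depend on how complementary the classifiers are.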
[151] EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents
Buyuan Zhu,Shiyu Hu,Yiping Ma,Yuanming Zhang,Kang Hao Cheong
Main category: cs.CV
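TL;DR: The paper presents EduPersona, a benchmark that evaluates the subjective abilities of virtual student agents in classroom settings through three tasks (basic coherence, student realism, and long-term persona consistency), and validates it experimentally.
Details
Motivation: Large language models are increasingly used in education, but the classroom-oriented subjective abilities of virtual student agents remain largely unassessed, which hinders trustworthy deployment. Contribution: Proposes EduPersona, the first benchmark centered on classroom subjective abilities, with annotated data spanning two languages, three subjects, and ten persona types, plus a set of progressive evaluation tasks.
Method: Builds on 1,308 authentic classroom dialogue rounds (12,814 teacher-student Q&A turns), expanded through persona stylization to roughly 128k turns; decomposes subjective ability into three tasks; and compares three LLMs in their original form against persona-fine-tuned variants.
Result: Fine-tuned models improve on average across all tasks: TASK1 +33.6%, TASK2 +30.6%, TASK3 +14.9%.
Insight: Persona modeling is heterogeneously difficult across tasks, which suggests that subjective-ability evaluation needs multiple dimensions and a check on long-term consistency.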
Abstract: As large language models are increasingly integrated into education, virtual student agents are becoming vital for classroom simulation and teacher training. Yet their classroom-oriented subjective abilities remain largely unassessed, limiting understanding of model boundaries and hindering trustworthy deployment. We present EduPersona, a large-scale benchmark spanning two languages, three subjects, and ten persona types based on the Big Five theory. The dataset contains 1,308 authentic classroom dialogue rounds, corresponding to 12,814 teacher-student Q&A turns, and is further expanded through persona stylization into roughly 10 times larger scale (128k turns), providing a solid foundation for evaluation. Building on this resource, we decompose hard-to-quantify subjective performance into three progressive tasks: TASK1 basic coherence (whether behavior, emotion, expression, and voice align with classroom context), TASK2 student realism, and TASK3 long-term persona consistency, thereby establishing an evaluation framework grounded in educational theory and research value. We conduct systematic experiments on three representative LLMs, comparing their original versions with ten persona-fine-tuned variants trained on EduPersona. Results show consistent and significant average improvements across all tasks: TASK1 +33.6%, TASK2 +30.6%, and TASK3 +14.9%. These improvements highlight the dataset’s effectiveness and research value, while also revealing the heterogeneous difficulty of persona modeling. In summary, EduPersona delivers the first classroom benchmark centered on subjective abilities, establishes a decoupled and verifiable research paradigm, and we will open-source both the dataset and the framework to support the broader research community in advancing trustworthy and human-like AI for education.
[152] MoME: Estimating Psychological Traits from Gait with Multi-Stage Mixture of Movement Experts
Andy Cǎtrunǎ,Adrian Cosma,Emilian Rǎdoi
Main category: cs.CV
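TL;DR: The paper proposes MoME, a hierarchical Multi-Stage Mixture of Movement Experts architecture that predicts psychological traits from gait sequences, benefits from multi-task learning, and outperforms existing methods.
Details
Motivation: Gait carries rich biometric and behavioural information, but inferring psychological traits from it remains challenging and underexplored. The paper aims to improve trait prediction through multi-task learning. Contribution: 1) The MoME architecture, which processes the walking cycle in four stages of movement complexity; 2) lightweight expert models with task-specific gating modules; 3) results on the PsyMo benchmark that surpass existing methods.
Method: MoME is hierarchical: at each of four stages, lightweight experts extract spatio-temporal features from 2D pose sequences, and task-specific gating modules adaptively weight the experts across traits and stages.
Result: On PsyMo (17 psychological traits), MoME reaches a weighted F1 of 37.47% at the run level and 44.6% at the subject level, outperforming existing gait analysis models.
Insight: Auxiliary tasks such as identity recognition, gender prediction, and BMI estimation further improve psychological trait estimation, giving a basis for movement-informed psychological inference.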
Abstract: Gait encodes rich biometric and behavioural information, yet leveraging the manner of walking to infer psychological traits remains a challenging and underexplored problem. We introduce a hierarchical Multi-Stage Mixture of Movement Experts (MoME) architecture for multi-task prediction of psychological attributes from gait sequences represented as 2D poses. MoME processes the walking cycle in four stages of movement complexity, employing lightweight expert models to extract spatio-temporal features and task-specific gating modules to adaptively weight experts across traits and stages. Evaluated on the PsyMo benchmark covering 17 psychological traits, our method outperforms state-of-the-art gait analysis models, achieving a 37.47% weighted F1 score at the run level and 44.6% at the subject level. Our experiments show that integrating auxiliary tasks such as identity recognition, gender prediction, and BMI estimation further improves psychological trait estimation. Our findings demonstrate the viability of multi-task gait-based learning for psychological trait estimation and provide a foundation for future research on movement-informed psychological inference.
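Task-specific gating over a set of experts can be sketched as below. This is an illustrative softmax-gated mixture, not the MoME implementation: the expert features, the task names (`trait_a`, `trait_b`), and the gate logits are hypothetical, and in a trained model the logits would come from a learned gating network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_features, gate_logits):
    """Return the gate-weighted sum of expert feature vectors and the weights."""
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    mixed = [0.0] * dim
    for w, feat in zip(weights, expert_features):
        for d in range(dim):
            mixed[d] += w * feat[d]
    return mixed, weights


# Three hypothetical experts, each emitting a 2-D feature for one gait clip.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One gate per task: each task weights the same experts differently.
gate_logits_by_task = {
    "trait_a": [2.0, 0.0, 0.0],
    "trait_b": [0.0, 0.0, 2.0],
}
mixed_by_task = {
    task: mix_experts(experts, logits)[0]
    for task, logits in gate_logits_by_task.items()
}
print(mixed_by_task)
```

Giving each task its own gate lets traits share the experts while drawing on them in different proportions, which is the role the abstract assigns to the gating modules.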
[153] Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI
Quang-Khai Bui-Tran,Minh-Toan Dinh,Thanh-Huy Nguyen,Ba-Thinh Lam,Mai-Anh Vu,Ulas Bagci
Main category: cs.CV
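TL;DR: The paper proposes a label-efficient segmentation approach that combines fine-tuning of a foundation-scale model, co-training with cross pseudo supervision, and standardized preprocessing to improve cross-modality generalization of liver segmentation in multi-phase MRI.
Details
Motivation: Accurate liver segmentation in MRI is vital for liver fibrosis assessment, but real-world data suffer from scarce labels, uneven distribution across modalities and vendors, spatial misalignment, and missing phases, so a label-efficient method that generalizes is needed. Contribution: 1) A segmentation approach that requires no spatial registration; 2) a combination of foundation-model fine-tuning and co-training that exploits unlabeled data; 3) robust performance on multi-vendor, multi-phase MRI.
Method: 1) Fine-tune a foundation-scale 3D segmentation backbone; 2) co-train with cross pseudo supervision to use unlabeled volumes; 3) apply a standardized preprocessing pipeline.
Result: The model segments robustly in both labeled and unlabeled domains, supporting the label efficiency and cross-modality generalization of the approach.
Insight: Combining foundation-model adaptation with co-training is a promising route for real-world clinical imaging tasks.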
Abstract: Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results demonstrate the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.
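Cross pseudo supervision on unlabeled data can be sketched as below, in the form the technique is commonly described: two networks each produce class probabilities, and each is trained with cross-entropy against the other's hard pseudo-label. The abstract does not give the exact loss or weighting the authors used, so the function and the example probabilities are assumptions for illustration.

```python
import math


def argmax(probs):
    """Index of the largest probability (the hard pseudo-label)."""
    return max(range(len(probs)), key=lambda k: probs[k])


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability assigned to the given label."""
    return -math.log(probs[label] + eps)


def cps_loss(probs_a, probs_b):
    """Average cross pseudo supervision loss over voxels.

    probs_a, probs_b: per-voxel class-probability lists from two networks
    that see the same unlabeled input.
    """
    total = 0.0
    for pa, pb in zip(probs_a, probs_b):
        pseudo_a = argmax(pa)  # hard label produced by network A
        pseudo_b = argmax(pb)  # hard label produced by network B
        # Each network is supervised by the other's pseudo-label.
        total += cross_entropy(pa, pseudo_b) + cross_entropy(pb, pseudo_a)
    return total / len(probs_a)


# Hypothetical two-class (background, liver) outputs for one voxel.
agree = cps_loss([[0.9, 0.1]], [[0.8, 0.2]])
disagree = cps_loss([[0.9, 0.1]], [[0.2, 0.8]])
print(agree, disagree)
```

The loss is small when the two networks agree and large when they disagree, which is how unlabeled volumes contribute a training signal without ground-truth masks.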
[154] ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion
Foivos Paraperas Papantoniou,Stefanos Zafeiriou
Main category: cs.CV
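TL;DR: The paper presents a diffusion-based framework that pairs an ID-consistent face foundation model with an expression cross-attention module, giving precise expression control for any subject while preserving identity.
Details
Motivation: AI-driven storytelling needs both identity consistency and fine-grained expression control, but current methods struggle to control expressions precisely without compromising identity. Contribution: An ID-consistent, expression-precise generative model whose cross-attention module is guided by FLAME blendshape parameters and covers a wide range of expressions, including micro-expressions and expressive transitions.
Method: A diffusion framework with an expression cross-attention module and a pluggable Reference Adapter, trained on a mixture of image and video data rich in expressive variation.
Result: The model outperforms existing methods in identity-consistent expression generation and extends to micro-expressions and expressive transitions.
Insight: Combining an ID-consistent foundation model with parametric expression control improves both the quality and the flexibility of generation.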
Abstract: Human-centric generative models designed for AI-driven storytelling must bring together two core capabilities: identity consistency and precise control over human performance. While recent diffusion-based approaches have made significant progress in maintaining facial identity, achieving fine-grained expression control without compromising identity remains challenging. In this work, we present a diffusion-based framework that faithfully reimagines any subject under any particular facial expression. Building on an ID-consistent face foundation model, we adopt a compositional design featuring an expression cross-attention module guided by FLAME blendshape parameters for explicit control. Trained on a diverse mixture of image and video data rich in expressive variation, our adapter generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, overlooked by prior works. In addition, a pluggable Reference Adapter enables expression editing in real images by transferring the appearance from a reference frame during synthesis. Extensive quantitative and qualitative evaluations show that our model outperforms existing methods in tailored and identity-consistent expression generation. Code and models can be found at https://github.com/foivospar/Arc2Face.
[155] ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model
Luo Cheng,Song Siyang,Yan Siyuan,Yu Zhen,Ge Zongyuan
Main category: cs.CV
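TL;DR: ReactDiff is a temporal diffusion framework that generates diverse, context-appropriate facial reactions, addressing the failure of existing methods to model the stochasticity and dynamics of human reactions.
Details
Motivation: Existing methods do not capture the stochasticity and dynamics of human facial reactions, producing monotonous or unnatural output. ReactDiff addresses this by introducing spatio-temporal priors. Contribution: 1) A diffusion framework that incorporates spatio-temporal facial kinematics; 2) facial action unit dependency constraints that make the output more natural; 3) validation on the REACT2024 dataset.
Method: ReactDiff injects two priors into the diffusion process, temporal facial behavioral kinematics and facial action unit dependencies, to constrain the smoothness and naturalness of the generated reactions.
Result: Experiments show state-of-the-art reaction quality along with strong diversity and reaction appropriateness.
Insight: Generating human facial reactions requires dynamic spatio-temporal constraints; purely data-driven generation struggles to deliver both realism and diversity.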
Abstract: The automatic generation of diverse and human-like facial reactions in dyadic dialogue remains a critical challenge for human-computer interaction systems. Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for responding to any given dialogue context. Our key insight is that plausible human reactions are smooth and coherent over time and conform to constraints imposed by human facial anatomy. To achieve this, ReactDiff incorporates two vital priors (spatio-temporal facial kinematics) into the diffusion process: i) temporal facial behavioral kinematics and ii) facial action unit dependencies. These two constraints guide the model toward realistic human reaction manifolds, avoiding visually unrealistic jitters, unstable transitions, unnatural expressions, and other artifacts. Extensive experiments on the REACT2024 dataset demonstrate that our approach not only achieves state-of-the-art reaction quality but also excels in diversity and reaction appropriateness.
[156] Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics
Masoumeh Chapariniya,Teodora Vukovic,Sarah Ebling,Volker Dellwo
Main category: cs.CV
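TL;DR: The paper studies transformer-based person identification in natural face-to-face conversation. A two-stream framework models the spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints; domain-specific training beats transfer learning, and spatial configurations are more discriminative than temporal dynamics.
Details
Motivation: To examine how transformer architectures can use spatial configurations and temporal motion patterns for person identification in natural conversation, going beyond appearance-based methods. Contribution: A two-stream transformer framework that models spatial and temporal information separately; evidence that domain-specific training is effective; evidence that spatial information dominates identification.
Method: Using 133 COCO WholeBody keypoints, builds a spatial transformer and a multi-scale temporal transformer, then combines the two streams through feature-level fusion.
Result: The spatial transformer reaches 95.74% accuracy and the multi-scale temporal transformer 93.90%; fusion raises accuracy to 98.03%.
Insight: Spatial configurations are more discriminative than temporal dynamics, yet the two are complementary; the results point to future multimodal and cross-cultural studies.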
Abstract: This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversation scenarios. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.
[157] Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction
Chi Yan,Dan Xu
Main category: cs.CV
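TL;DR: PG-Occ is a Progressive Gaussian Transformer framework for open-vocabulary 3D occupancy prediction. Progressive online densification and anisotropy-aware sampling improve fine-detail capture while addressing the cost of dense representations.
Details
Motivation: Traditional 3D occupancy prediction is limited to fixed semantic categories, and text-aligned methods face a trade-off: sparse Gaussian representations miss small objects, while dense representations incur significant computational overhead. Contribution: 1. The PG-Occ framework for open-vocabulary 3D occupancy prediction; 2. progressive online densification, which gradually enhances the 3D Gaussian representation; 3. an anisotropy-aware sampling strategy that adaptively assigns receptive fields.
Method: A progressive Gaussian Transformer that combines feed-forward progressive online densification with anisotropy-aware sampling and spatio-temporal fusion, iteratively enhancing the representation and feature aggregation.
Result: PG-Occ achieves a relative 14.3% mIoU improvement over the previous best-performing method.
Insight: Pairing progressive densification with adaptive sampling balances detail capture against computational cost in open-vocabulary scene modeling.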
Abstract: The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ
[158] Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
Max Kirchner,Hanna Hoffmann,Alexander C. Jenke,Oliver L. Saldanha,Kevin Pfeiffer,Weam Kanjo,Julia Alekseenko,Claas de Boer,Santhi Raj Kolamuri,Lorenzo Mazza,Nicolas Padoy,Sophia Bano,Annika Reinke,Lena Maier-Hein,Danail Stoyanov,Jakob N. Kather,Fiona R. Kolbinger,Sebastian Bodenstedt,Stefanie Speidel
Main category: cs.CV
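TL;DR: The FedSurg challenge is the first to benchmark federated learning for surgical video classification, focusing on generalization to unseen clinical centers and on local fine-tuning. A ViViT-based submission performed best, but generalization and class imbalance remain open problems.
Details
Motivation: Federated learning allows collaborative model development without sharing patient data, but its performance on surgical video classification had not been systematically evaluated. The FedSurg challenge fills this gap. Contribution: 1) The first benchmark for federated learning in surgical video classification; 2) an analysis of the challenges posed by generalization, class imbalance, and hyperparameter tuning; 3) evidence that the ViViT-based submission achieved the strongest overall performance.
Method: Participants used a preliminary version of the multi-center Appendix300 video dataset for two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Approaches included foundation models with linear probing, metric learning with triplet loss, and several FL aggregation schemes (FedAvg, FedMedian, FedSAM).
Result: Generalization performance was limited; after fine-tuning all teams improved, though ranking stability was low. The ViViT-based submission performed best, and spatiotemporal modeling and context-aware preprocessing emerged as promising strategies.
Insight: 1) Local personalization trades off against global robustness; 2) architecture choice, preprocessing, and loss design matter for FL; 3) class imbalance calls for dedicated handling.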
Abstract: Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
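Two of the aggregation schemes named above, FedAvg and FedMedian, can be sketched on flat parameter vectors as below. This is a simplified illustration rather than any team's submission: real systems aggregate full model state over many rounds, FedSAM is not covered, and the client weights and dataset sizes are hypothetical.

```python
from statistics import median


def fed_avg(client_weights, client_sizes):
    """Sample-size-weighted mean of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


def fed_median(client_weights):
    """Coordinate-wise median of client parameter vectors."""
    dim = len(client_weights[0])
    return [median(w[d] for w in client_weights) for d in range(dim)]


# Three hypothetical centers; the third has an outlying first parameter.
clients = [[1.0, 2.0], [3.0, 4.0], [100.0, 6.0]]
sizes = [10, 10, 20]

print(fed_avg(clients, sizes))
print(fed_median(clients))
```

The weighted mean is pulled toward the large, outlying center, while the coordinate-wise median ignores it; this is one reason median-style aggregation is considered when centers differ strongly.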
[159] Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization
Javed Ahmad,Federico Dassiè,Selene Frascella,Gabriele Marchello,Ferdinando Cannella,Arianna Traviglia
Main category: cs.CV
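TL;DR: The paper presents a fully automated two-robot 3D scanning system that coordinates robotic manipulation with high-resolution scanning to improve the efficiency and geometric accuracy of cultural heritage digitization.
Details
Motivation: Conventional 3D scanning requires specialized expertise and manual intervention, which limits the efficiency and scalability of cultural heritage digitization. Contribution: An automated scanning system with two coordinated robots, with optimized trajectory planning and waypoint distribution that reduce occlusions and improve reconstruction accuracy.
Method: Parameterizes the scanning space into distinct regions and combines coordinated motion planning between a scanner-equipped robot and a tray-handling robot with optimized trajectories to achieve comprehensive surface coverage.
Result: Experiments show significantly lower Chamfer Distance and higher F-score than baseline methods, with higher geometric accuracy and less need for manual intervention.
Insight: Automated robotic systems can improve the efficiency and accuracy of cultural heritage digitization while reducing reliance on expert operators.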
Abstract: High-fidelity 3D scanning is essential for preserving cultural heritage artefacts, supporting documentation, analysis, and long-term conservation. However, conventional methods typically require specialized expertise and manual intervention to maintain optimal scanning conditions and coverage. We present an automated two-robot scanning system that eliminates the need for handheld or semi-automatic workflows by combining coordinated robotic manipulation with high-resolution 3D scanning. Our system parameterizes the scanning space into distinct regions, enabling coordinated motion planning between a scanner-equipped robot and a tray-handling robot. Optimized trajectory planning and waypoint distribution ensure comprehensive surface coverage, minimize occlusions, and balance reconstruction accuracy with system efficiency. Experimental results show that our approach achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, offering superior geometric accuracy, improved digitization efficiency, and reduced reliance on expert operators.
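The two reconstruction metrics reported above can be sketched with brute-force nearest-neighbour search. Definitions vary between papers (squared versus unsquared distances, sum versus mean), and the abstract does not say which variant or threshold the authors used, so the formulas, the threshold `tau`, and the point sets below are assumptions for illustration.

```python
import math


def nearest_dists(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    return sum(nearest_dists(a, b)) / len(a) + sum(nearest_dists(b, a)) / len(b)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nearest_dists(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_dists(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground-truth scan and a reconstruction with one displaced point.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]

print(chamfer_distance(pred, gt))
print(f_score(pred, gt, tau=0.1))
```

Lower Chamfer distance and higher F-score both indicate a reconstruction closer to the reference; for real scans the quadratic search would normally be replaced by a spatial index such as a k-d tree.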
[160] A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation
Alon Kaya,Igal Bilik,Inna Stainvas
Main category: cs.CV
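TL;DR: The paper compares Vision Transformers (ViTs) and convolutional neural networks (CNNs) on geometric estimation tasks (rigid transformation and fundamental matrix estimation) across data regimes, including few-shot settings. ViTs perform better with large downstream data, while in small-data scenarios CNNs match ViT performance.
Details
Motivation: To examine how ViTs and CNNs differ on geometric estimation in low-data regimes, and so to inform the choice of backbone for such tasks. Contribution: A systematic comparison of ViTs and CNNs across data sizes that exposes their respective strengths and weaknesses on geometric estimation and motivates research on hybrid architectures.
Method: Uses pretrained ViTs (CLIP-ViT variants, DINO) and CNNs (ResNet, EfficientNet, CLIP-ResNet) as backbones, fine-tuned and evaluated on 2D rigid transformation estimation and fundamental matrix estimation.
Result: ViTs outperform CNNs with large downstream data; with small data, the inductive bias and smaller capacity of CNNs let them match ViTs; ViTs generalize better in cross-domain evaluation.
Insight: Geometric estimation requires balancing local and global features, so future work should explore hybrid architectures that balance the two kinds of representation.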
Abstract: Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as backbone architectures for geometric estimation tasks involving image deformations in low-data regimes remains an open question. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs, an important problem in various applications, such as autonomous mobility, robotics, and 3D scene reconstruction. Addressing this intriguing question, this work systematically compares large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) in various data size settings, including few-shot scenarios. These pretrained models are optimized for classification or contrastive learning, encouraging them to focus mostly on high-level semantics. The considered tasks require balancing local and global features differently, challenging the straightforward adoption of these models as the backbone. Empirical comparative analysis shows that, similar to training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios. However, in small data scenarios, the inductive bias and smaller capacity of CNNs improve their performance, allowing them to match that of a ViT. Moreover, ViTs exhibit stronger generalization in cross-domain evaluation where the data distribution changes. These results emphasize the importance of carefully selecting model architectures for refinement, motivating future research towards hybrid architectures that balance local and global representations.
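For readers unfamiliar with the second task, the fundamental matrix F relates corresponding points in a stereo pair through the epipolar constraint x2^T F x1 = 0. The sketch below checks that constraint in a deliberately simple setup that is an assumption of this example, not the paper's: identity camera intrinsics and a pure translation between views, in which case F reduces to the skew-symmetric matrix of the translation. The 3D point and translation are hypothetical.

```python
def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def project(point):
    """Pinhole projection with identity intrinsics, in homogeneous form."""
    x, y, z = point
    return [x / z, y / z, 1.0]


def epipolar_residual(F, x1, x2):
    """Value of x2^T F x1; zero for an exact correspondence."""
    Fx1 = [sum(F[r][c] * x1[c] for c in range(3)) for r in range(3)]
    return sum(x2[r] * Fx1[r] for r in range(3))


# Second camera frame: X2 = X + t (no rotation), so F = [t]_x here.
t = (1.0, 0.0, 0.0)
F = skew(t)

X = (0.5, -0.2, 4.0)  # hypothetical 3D point in the first camera frame
x1 = project(X)
x2 = project(tuple(a + b for a, b in zip(X, t)))

x2_wrong = [x2[0], x2[1] + 0.15, 1.0]  # a mismatched point off the epipolar line

print(epipolar_residual(F, x1, x2))
print(epipolar_residual(F, x1, x2_wrong))
```

A network that predicts F can be scored by residuals of this kind on held-out correspondences, which rewards geometric consistency rather than high-level semantics.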
[161] Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors
Han Zhang,Lalithkumar Seenivasan,Jose L. Porras,Roger D. Soberanis-Mukul,Hao Ding,Hongchao Shu,Benjamin D. Killeen,Ankita Ghosh,Lonny Yarmus,Masaru Ishii,Angela Christine Argento,Mathias Unberath
Main category: cs.CV
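TL;DR: EgoSurg is a framework that reconstructs dynamic egocentric replays for any operating room staff member from fixed-camera video, combining geometry-driven neural rendering with diffusion-based view enhancement to synthesize views with high visual quality.
Details
Motivation: Observation of surgery has relied on fixed vantage points or recollection, leaving the egocentric perspectives that guide clinical decisions undocumented. EgoSurg fills this gap using existing operating room camera infrastructure, in support of immersive surgical data science. Contribution: 1. EgoSurg, the first framework to reconstruct dynamic egocentric views of any operating room staff member from fixed-camera video; 2. a combination of geometry-driven neural rendering and diffusion-based enhancement for high-quality view synthesis; 3. validation of visual quality and fidelity on multi-site surgical cases.
Method: EgoSurg generates initial views with geometry-driven neural rendering and then enhances detail and visual quality with a diffusion model, supporting synthesis at any moment and from arbitrary viewpoints.
Result: Across multi-site surgical cases and controlled studies, EgoSurg reconstructs person-specific visual fields and arbitrary viewpoints with high visual quality and fidelity.
Insight: EgoSurg turns existing operating room cameras into a navigable dynamic 3D record, a new foundation for immersive surgical data science in which surgical practice can be visualized, experienced, and analyzed from multiple angles.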
Abstract: Observing surgical practice has historically relied on fixed vantage points or recollections, leaving the egocentric visual perspectives that guide clinical decisions undocumented. Fixed-camera video can capture surgical workflows at the room-scale, but cannot reconstruct what each team member actually saw. Thus, these videos only provide limited insights into how decisions that affect surgical safety, training, and workflow optimization are made. Here we introduce EgoSurg, the first framework to reconstruct the dynamic, egocentric replays for any operating room (OR) staff directly from wall-mounted fixed-camera video, and thus, without intervention to clinical workflow. EgoSurg couples geometry-driven neural rendering with diffusion-based view enhancement, enabling high-visual fidelity synthesis of arbitrary and egocentric viewpoints at any moment. In evaluation across multi-site surgical cases and controlled studies, EgoSurg reconstructs person-specific visual fields and arbitrary viewpoints with high visual quality and fidelity. By transforming existing OR camera infrastructure into a navigable dynamic 3D record, EgoSurg establishes a new foundation for immersive surgical data science, enabling surgical practice to be visualized, experienced, and analyzed from every angle.
[162] Visual Representations inside the Language Model
Benlin Liu,Amita Kamath,Madeleine Grunde-McLaughlin,Winson Han,Ranjay Krishna
Main category: cs.CV
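TL;DR: The paper examines how multimodal language models (MLMs) process their visual key-value tokens. Image value tokens encode enough information for several perception-heavy tasks zero-shot, but the language model controls its visual information poorly, which limits perception. Adding a text prefix to the image input improves the visual representations.
Details
Motivation: To understand why MLMs struggle on perception-heavy tasks by studying the role of visual key-value tokens, with a view to improving how these models handle visual information. Contribution: Shows that image value tokens carry sufficient information for zero-shot perception tasks, analyzes how the language model both augments and limits visual information, and proposes a way to improve visual representations.
Method: Analyzes the flow of visual key-value tokens in popular MLMs such as LLaVA-OneVision, compares the visual information in the language model with that of the visual encoder, and improves perception by adding a text prefix.
Result: The visual information inside the language model has deficiencies; a text prefix improves perception, yet in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the language model is not surfaced to the output.
Insight: Visual key-value tokens play a key role in multimodal systems; better control of visual information by the language model, and new training directions for the visual encoder and language model components, could improve MLM perception.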
Abstract: Despite interpretability work analyzing ViT encoders and transformer activations, we don’t yet understand why Multimodal Language Models (MLMs) struggle on perception-heavy tasks. We offer an under-studied perspective by examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks zero-shot: segmentation, semantic correspondence, temporal correspondence, and referring expression detection. We find that while the language model does augment the visual information received from the projection of input visual encodings (which we reveal correlates with overall MLM perception capability), it contains less visual information on several tasks than the equivalent visual encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that the visual information corresponding to input-agnostic image key tokens in later layers of language models contains artifacts which reduce perception capability of the overall MLM. Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves perception capabilities of visual representations. Finally, we reveal that if language models were able to better control their visual information, their perception would significantly improve; e.g., in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the language model is not surfaced to the output! Our findings reveal insights into the role of key-value tokens in multimodal systems, paving the way for deeper mechanistic interpretability of MLMs and suggesting new directions for training their visual encoder and language model components.
[163] AvatarVTON: 4D Virtual Try-On for Animatable Avatars
Zicheng Jiang,Jixin Gao,Shengfeng He,Xinzhe Li,Yulong Zheng,Zhaotong Yang,Junyu Dong,Yong Du
Main category: cs.CV
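TL;DR: AvatarVTON is the first framework for 4D virtual try-on. It generates realistic try-on results from a single garment image and supports free pose control, novel-view rendering, and diverse garment choices. It achieves dynamic garment interaction without physics priors through two core modules, the Reciprocal Flow Rectifier and the Non-Linear Deformer.
Details
Motivation: Existing virtual try-on methods typically rely on multi-view garment captures or physics priors, which limits dynamic garment interaction. AvatarVTON aims to achieve dynamic garment deformation under single-view supervision. Contribution: 1. AvatarVTON, the first 4D virtual try-on framework, supporting dynamic garment interaction from a single input image; 2. the Reciprocal Flow Rectifier and Non-Linear Deformer modules, which address temporal coherence and non-linear deformation respectively; 3. a benchmark for 4D virtual try-on, built by extending existing baselines for fair comparison.
Method: 1. Reciprocal Flow Rectifier: a prior-free optical-flow correction strategy that stabilizes avatar fitting and ensures temporal coherence; 2. Non-Linear Deformer: decomposes Gaussian maps into view-pose-invariant and view-pose-specific components to enable adaptive, non-linear deformation.
Result: Experiments show that AvatarVTON achieves high fidelity, diversity, and dynamic garment realism, making it suitable for AR/VR, gaming, and digital-human applications.
Insight: 1. Single-view supervision suffices for dynamic garment interaction, reducing dependence on multi-view data or physical simulation; 2. splitting deformation into view-pose-invariant and view-pose-specific parts improves adaptability and robustness.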
Abstract: We propose AvatarVTON, the first 4D virtual try-on framework that generates realistic try-on results from a single in-shop garment image, enabling free pose control, novel-view rendering, and diverse garment choices. Unlike existing methods, AvatarVTON supports dynamic garment interactions under single-view supervision, without relying on multi-view garment captures or physics priors. The framework consists of two key modules: (1) a Reciprocal Flow Rectifier, a prior-free optical-flow correction strategy that stabilizes avatar fitting and ensures temporal coherence; and (2) a Non-Linear Deformer, which decomposes Gaussian maps into view-pose-invariant and view-pose-specific components, enabling adaptive, non-linear garment deformations. To establish a benchmark for 4D virtual try-on, we extend existing baselines with unified modules for fair qualitative and quantitative comparisons. Extensive experiments show that AvatarVTON achieves high fidelity, diversity, and dynamic garment realism, making it well-suited for AR/VR, gaming, and digital-human applications.
[164] Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints
Viktor Kozák,Jan Chudoba,Libor Přeučil
Main category: cs.CV
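TL;DR: The paper proposes a method for detailed aerial mapping of photovoltaic (PV) power plants based on semantic keypoints. It segments PV modules automatically, models their structure, and merges detections from multiple images into a compact georeferenced model.
Details
Motivation: An accurate, up-to-date model of a PV power plant is essential for operation and maintenance, but such models are often not readily available. The work automates mapping from aerial images and removes the reliance on third-party data. Contribution: 1) A detailed PV power plant mapping method based on semantic keypoints; 2) automated modeling down to individual PV modules; 3) a structurally consistent georeferenced model produced by merging detections from multiple images.
Method: 1) Visually segment PV modules in aerial overview images; 2) infer structural information in each image, assigning modules to benches, rows, and columns; 3) merge detections from multiple images using layout-related visual keypoints.
Result: The method was verified on two different power plants; the final fusion of 3D positions and semantic structure yields a compact georeferenced model suitable for maintenance.
Insight: Semantic keypoints and multi-image fusion enable accurate, detailed modeling of PV power plants without third-party data, which improves automation and practicality.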
Abstract: An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.
[165] From Actions to Kinesics: Extracting Human Psychological States through Bodily Movements
Cheyu Lin,Katherine A. Flanigan
Main category: cs.CV
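TL;DR: The paper proposes a skeleton-based kinesics recognition framework that infers psychological states from bodily movement. It combines ST-GCN with a CNN and uses transfer learning to avoid manually defined mappings, preserving privacy while remaining scalable.
Details
Motivation: Traditional methods such as questionnaires or theoretical models are limited in scope, static, and labor intensive. The paper aims for generalizable, privacy-preserving inference of psychological states directly from skeletal motion data. Contribution: 1. A kinesics recognition framework combining ST-GCN and CNN; 2. transfer learning that bypasses manually defined mappings between actions and psychological categories; 3. scalable, privacy-preserving behavior modeling demonstrated on the DUET dataset.
Method: Combines a spatial-temporal graph convolutional network (ST-GCN) with a convolutional neural network (CNN) and uses transfer learning to infer the communicative functions of activity (kinesics) from 3D skeleton joint data.
Result: On the DUET dataset, the method enables scalable and accurate behavior modeling, offering a new pathway for RL-driven simulations of human-environment interaction.
Insight: 1. Skeleton data can serve as a privacy-preserving medium for inferring psychological states; 2. ST-GCN combined with a CNN captures the link between dynamic actions and psychological states; 3. transfer learning reduces the need for manually defined mappings.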
Abstract: Understanding the dynamic relationship between humans and the built environment is a key challenge in disciplines ranging from environmental psychology to reinforcement learning (RL). A central obstacle in modeling these interactions is the inability to capture human psychological states in a way that is both generalizable and privacy preserving. Traditional methods rely on theoretical models or questionnaires, which are limited in scope, static, and labor intensive. We present a kinesics recognition framework that infers the communicative functions of human activity – known as kinesics – directly from 3D skeleton joint data. Combining a spatial-temporal graph convolutional network (ST-GCN) with a convolutional neural network (CNN), the framework leverages transfer learning to bypass the need for manually defined mappings between physical actions and psychological categories. The approach preserves user anonymity while uncovering latent structures in bodily movements that reflect cognitive and emotional states. Our results on the Dyadic User EngagemenT (DUET) dataset demonstrate that this method enables scalable, accurate, and human-centered modeling of behavior, offering a new pathway for enhancing RL-driven simulations of human-environment interaction.
[166] BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping
Hayat Rajani,Valerio Franchi,Borja Martinez-Clavel Valles,Raimon Ramos,Rafael Garcia,Nuno Gracias
Main category: cs.CV
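TL;DR: The paper introduces BenthiCat, a multi-modal dataset for benthic classification and habitat mapping. It contains side-scan sonar tiles, bathymetric maps, and co-registered optical images, along with annotation and preprocessing tools.
Details
Motivation: Benthic habitat mapping is fundamental to marine ecosystem research and resource management, but large annotated datasets are scarce, which limits the development and benchmarking of machine learning models. Contribution: A large annotated dataset, BenthiCat, that combines side-scan sonar tiles, bathymetric maps, and optical images, with accompanying open-source tools.
Method: The dataset comprises about a million side-scan sonar tiles, roughly 36,000 of them manually annotated with segmentation masks, and supports both supervised fine-tuning and self-supervised cross-modal representation learning.
Result: The dataset and tools are released with the aim of establishing a standardized benchmark for underwater habitat mapping.
Insight: Multi-sensor fusion and cross-modal learning are central to advancing seafloor classification, and open datasets and tools help move the field forward.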
Abstract: Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking of machine learning models in this domain. This paper introduces a thorough multi-modal dataset, comprising about a million side-scan sonar (SSS) tiles collected along the coast of Catalonia (Spain), complemented by bathymetric maps and a set of co-registered optical images from targeted surveys using an autonomous underwater vehicle (AUV). Approximately 36,000 of the SSS tiles have been manually annotated with segmentation masks to enable supervised fine-tuning of classification models. All the raw sensor data, together with mosaics, are also released to support further exploration and algorithm development. To address challenges in multi-sensor data fusion for AUVs, we spatially associate optical images with corresponding SSS tiles, facilitating self-supervised, cross-modal representation learning. Accompanying open-source preprocessing and annotation tools are provided to enhance accessibility and encourage research. This resource aims to establish a standardized benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration.
[167] Comparative Analysis of YOLOv5, Faster R-CNN, SSD, and RetinaNet for Motorbike Detection in Kigali Autonomous Driving Context
Ngeyen Yinkfu,Sunday Nwovu,Jonathan Kayizzi,Angelique Uwamahoro
Main category: cs.CV
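TL;DR: The paper compares four object detection models (YOLOv5, Faster R-CNN, SSD, and RetinaNet) on motorbike detection in Kigali, Rwanda, focusing on accuracy, localization, and inference speed.
Details
Motivation: Motorcycle taxis in Kigali often disregard traffic rules and move unpredictably, which challenges autonomous driving systems. The study assesses how suitable different models are for resource-constrained settings.
Contribution: 1) A comparison of the four models on a custom dataset; 2) a recommendation to adopt simplified architectures to make the technology more accessible in developing countries.
Method: The four models are implemented in PyTorch with transfer learning and evaluated for accuracy, localization, and inference speed.
Result: The study identifies issues such as dataset limitations and model complexity, and recommends simplified architectures for future work.
Insight: In resource-constrained settings, lightweight models such as YOLOv5 and SSD may be better suited to real-time detection.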
Abstract: In Kigali, Rwanda, motorcycle taxis are a primary mode of transportation, often navigating unpredictably and disregarding traffic rules, posing significant challenges for autonomous driving systems. This study compares four object detection models (YOLOv5, Faster R-CNN, SSD, and RetinaNet) for motorbike detection using a custom dataset of 198 images collected in Kigali. Implemented in PyTorch with transfer learning, the models were evaluated for accuracy, localization, and inference speed to assess their suitability for real-time navigation in resource-constrained settings. We identify implementation challenges, including dataset limitations and model complexities, and recommend simplified architectures for future work to enhance accessibility for autonomous systems in developing countries like Rwanda.
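Localization quality in detector comparisons of this kind is commonly scored with intersection over union (IoU) between a predicted box and a ground-truth box. The abstract does not say which localization metric the authors used, so the following is only a minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2); the function name `box_iou` is hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width and height clamp to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection is then typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold; the threshold itself is a design choice and is not given in the abstract.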
[168] REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis
Alec K. Peltekian,Halil Ertugrul Aktas,Gorkem Durak,Kevin Grudzinski,Bradford C. Bemiss,Carrie Richardson,Jane E. Dematte,G. R. Scott Budinger,Anthony J. Esposito,Alexander Misharin,Alok Choudhary,Ankit Agrawal,Ulas Bagci
Main category: cs.CV
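TL;DR: REN is an anatomically-informed Mixture-of-Experts framework designed for medical image classification, with strong results on interstitial lung disease (ILD) diagnosis.
Details
Motivation: Conventional MoE architectures lack the domain-specific constraints that medical imaging needs, where anatomical structure and regional disease heterogeneity strongly shape pathological patterns. REN aims to fill this gap with precise, region-specific pathology modeling.
Contribution: The first anatomically-guided Regional Expert Network (REN), which uses multi-modal gating to dynamically integrate radiomics and deep learning features and weight expert contributions.
Method: REN trains seven specialized expert networks, each dedicated to a specific lung lobe or bilateral lung combination. Multi-modal gating combines radiomics biomarkers with deep learning features from CNN, ViT, and Mamba models.
Result: REN reaches an average AUC of 0.8646 on ILD classification, a 12.5% improvement over the SwinUNETR baseline; lower-lobe models reach AUCs of 0.88-0.90.
Insight: An anatomically-guided MoE framework improves both performance and clinical interpretability in medical image classification and offers a template that extends to other structured medical imaging applications.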
Abstract: Mixture-of-Experts (MoE) architectures have significantly contributed to scalable machine learning by enabling specialized subnetworks to tackle complex tasks efficiently. However, traditional MoE systems lack domain-specific constraints essential for medical imaging, where anatomical structure and regional disease heterogeneity strongly influence pathological patterns. Here, we introduce Regional Expert Networks (REN), the first anatomically-informed MoE framework tailored specifically for medical image classification. REN leverages anatomical priors to train seven specialized experts, each dedicated to distinct lung lobes and bilateral lung combinations, enabling precise modeling of region-specific pathological variations. Multi-modal gating mechanisms dynamically integrate radiomics biomarkers and deep learning (DL) features (CNN, ViT, Mamba) to weight expert contributions optimally. Applied to interstitial lung disease (ILD) classification, REN achieves consistently superior performance: the radiomics-guided ensemble reached an average AUC of 0.8646 ± 0.0467, a 12.5% improvement over the SwinUNETR baseline (AUC 0.7685, p = 0.031). Region-specific experts further revealed that lower-lobe models achieved AUCs of 0.88-0.90, surpassing DL counterparts (CNN: 0.76-0.79) and aligning with known disease progression patterns. Through rigorous patient-level cross-validation, REN demonstrates strong generalizability and clinical interpretability, presenting a scalable, anatomically-guided approach readily extensible to other structured medical imaging applications.
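The gating idea described above, where a gate weights the contributions of seven regional experts, can be illustrated with a generic softmax-gated ensemble. This is a minimal sketch and not the authors' implementation: the function name `gated_ensemble`, the random gate logits (standing in for the output of a radiomics-driven gating network), and the two-class setup are all assumptions.

```python
import numpy as np

def gated_ensemble(expert_probs, gate_logits):
    """Combine per-expert class probabilities with softmax gate weights.

    expert_probs: array of shape (n_experts, n_classes)
    gate_logits:  array of shape (n_experts,); hypothetical gate scores
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    gate_logits = np.asarray(gate_logits, dtype=float)
    # Numerically stable softmax over the experts.
    shifted = gate_logits - gate_logits.max()
    weights = np.exp(shifted) / np.exp(shifted).sum()
    # Weighted average of the experts' predictions.
    return weights, weights @ expert_probs

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(2), size=7)  # 7 regional experts, 2 classes
weights, fused = gated_ensemble(probs, rng.normal(size=7))
```

As a side note on the reported numbers, the 12.5% figure is consistent with a relative gain: (0.8646 - 0.7685) / 0.7685 is approximately 0.125.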
[169] Unsupervised Active Learning via Natural Feature Progressive Framework
Yuxi Liu,Catherine Lalman,Yimin Yang
Main category: cs.CV
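TL;DR: The paper proposes the Natural Feature Progressive Framework (NFPF), an unsupervised active learning method that uses a Specific Feature Learning Machine (SFLM) to quantify sample importance. It substantially improves unsupervised active learning and reaches performance on par with supervised methods.
Details
Motivation: Conventional active learning (AL) reduces annotation cost but still requires multiple iterations and human involvement. Unsupervised active learning (UAL) reduces the annotation burden further, yet existing methods fall short in performance and robustness, particularly in sample selection and coverage of the data distribution.
Contribution: The NFPF framework, which introduces the SFLM to quantify sample importance and defines a Reconstruction Difference metric to improve initial sample selection. The method clearly outperforms existing UAL methods in performance and robustness and is on par with supervised AL methods.
Method: NFPF uses the SFLM to learn feature representations of samples and selects the most representative samples via the Reconstruction Difference metric. This avoids the noise sensitivity of local gradient-based scoring and covers the global data distribution.
Result: Experiments show that NFPF significantly outperforms existing UAL methods on vision datasets and matches supervised AL methods. Ablation studies and qualitative analysis further support its robustness and data distribution coverage.
Insight: The paper highlights the importance of global feature learning for unsupervised sample selection and offers a new technical route to reducing the annotation burden.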
Abstract: The effectiveness of modern deep learning models is predicated on the availability of large-scale, human-annotated datasets, a process that is notoriously expensive and time-consuming. While Active Learning (AL) offers a strategic solution by labeling only the most informative and representative data, its iterative nature still necessitates significant human involvement. Unsupervised Active Learning (UAL) presents an alternative by shifting the annotation burden to a single, post-selection step. Unfortunately, prevailing UAL methods struggle to achieve state-of-the-art performance. These approaches typically rely on local, gradient-based scoring for sample importance estimation, which not only makes them vulnerable to ambiguous and noisy data but also hinders their capacity to select samples that adequately represent the full data distribution. Moreover, their use of shallow, one-shot linear selection falls short of a true UAL paradigm. In this paper, we propose the Natural Feature Progressive Framework (NFPF), a UAL method that revolutionizes how sample importance is measured. At its core, NFPF employs a Specific Feature Learning Machine (SFLM) to effectively quantify each sample’s contribution to model performance. We further utilize the SFLM to define a powerful Reconstruction Difference metric for initial sample selection. Our comprehensive experiments show that NFPF significantly outperforms all established UAL methods and achieves performance on par with supervised AL methods on vision datasets. Detailed ablation studies and qualitative visualizations provide compelling evidence for NFPF’s superior performance, enhanced robustness, and improved data distribution coverage.
[170] ActiveMark: on watermarking of visual foundation models via massive activations
Anna Chistyakova,Mikhail Pautov
Main category: cs.CV
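TL;DR: The paper proposes ActiveMark, a method that embeds digital watermarks in visual foundation models (VFMs) to verify model ownership and deter illegal redistribution. It fine-tunes a small set of expressive layers together with a small encoder-decoder network to embed the watermark into the internal representation of input images, and shows that the watermark remains detectable in functional copies of the model.
Details
Motivation: Visual foundation models are expensive to train, so protecting their intellectual property matters. The current challenge is distinguishing illegal copies of a protected model from independently trained models, which calls for reliable ownership verification tools.
Contribution: A novel watermark embedding method, ActiveMark, whose watermark remains detectable in functional copies obtained by fine-tuning, backed by both theoretical analysis and experimental validation.
Method: A small set of expressive layers of the VFM is fine-tuned together with a small encoder-decoder network to embed digital watermarks into the internal representation of a hold-out set of input images. The watermark remains detectable in functional copies of the model, such as versions fine-tuned for downstream tasks.
Result: Theoretical analysis and experiments show that the method yields a low probability of falsely detecting a watermark in a non-watermarked model and a low probability of missing it in a watermarked model.
Insight: The method shows promise for protecting the intellectual property of visual foundation models, particularly for ownership verification after fine-tuning.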
Abstract: Being trained on large and vast datasets, visual foundation models (VFMs) can be fine-tuned for diverse downstream tasks, achieving remarkable performance and efficiency in various computer vision applications. The high computation cost of data collection and training motivates the owners of some VFMs to distribute them alongside the license to protect their intellectual property rights. However, a dishonest user of the protected model’s copy may illegally redistribute it, for example, to make a profit. As a consequence, the development of reliable ownership verification tools is of great importance today, since such methods can be used to differentiate between a redistributed copy of the protected model and an independent model. In this paper, we propose an approach to ownership verification of visual foundation models by fine-tuning a small set of expressive layers of a VFM along with a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. Importantly, the watermarks embedded remain detectable in the functional copies of the protected model, obtained, for example, by fine-tuning the VFM for a particular downstream task. Theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection of a non-watermarked model and a low probability of false misdetection of a watermarked model.
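The abstract does not give the detection rule, so the following is only a sketch under an assumed one: an n-bit watermark is declared present when at least t extracted bits match, and an independent (non-watermarked) model matches each bit by chance with probability 0.5. Under that assumption, the false-detection probability is a binomial tail. The function name `false_detection_prob` and all numbers are hypothetical.

```python
from math import comb

def false_detection_prob(n_bits, threshold, p_match=0.5):
    """P(at least `threshold` of `n_bits` match by chance): a binomial tail."""
    return sum(
        comb(n_bits, k) * p_match**k * (1 - p_match) ** (n_bits - k)
        for k in range(threshold, n_bits + 1)
    )
```

For instance, with a 32-bit watermark and a threshold of 28 matching bits, this chance probability falls below 1e-5, which illustrates why a bound of this shape can make false detection unlikely.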
[171] Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition
Koen Vellenga,H. Joe Steinhauer,Jonas Andersson,Anders Sjögren
Main category: cs.CV
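TL;DR: The paper proposes latent uncertainty representations (LUR) built on pre-trained deep neural networks for recognizing driver actions and intentions from video, using transformation layers to produce multiple latent representations for uncertainty estimation. Experiments show that LUR matches top-performing probabilistic methods on OOD detection while being more efficient to train and easier to tune.
Details
Motivation: Existing last-layer probabilistic deep learning (LL-PDL) methods perform inconsistently at detecting out-of-distribution (OOD) instances, so a more efficient and easier-to-tune uncertainty estimation method is needed.
Contribution: 1. The latent uncertainty representation (LUR) and repulsively trained LUR (RLUR) approaches; 2. additional annotations for the NuScenes dataset (28,000 frame-level action labels and 1,194 video-level intention labels).
Method: Pre-trained DNNs are extended with transformation layers that produce multiple latent representations for uncertainty estimation; LUR avoids Markov-Chain Monte Carlo sampling and repulsive training procedures.
Result: LUR and RLUR achieve in-distribution classification performance comparable to other LL-PDL methods; for uncertainty-based OOD detection, LUR matches the top-performing PDL methods while being more efficient to train and easier to tune.
Insight: Latent uncertainty representations reduce computational cost while keeping performance high, offering a practical option for safety-critical tasks in resource-constrained environments.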
Abstract: Deep neural networks (DNNs) are increasingly applied to safety-critical tasks in resource-constrained environments, such as video-based driver action and intention recognition. While last layer probabilistic deep learning (LL-PDL) methods can detect out-of-distribution (OOD) instances, their performance varies. As an alternative to last layer approaches, we propose extending pre-trained DNNs with transformation layers to produce multiple latent representations to estimate the uncertainty. We evaluate our latent uncertainty representation (LUR) and repulsively trained LUR (RLUR) approaches against eight PDL methods across four video-based driver action and intention recognition datasets, comparing classification performance, calibration, and uncertainty-based OOD detection. We also contribute 28,000 frame-level action labels and 1,194 video-level intention labels for the NuScenes dataset. Our results show that LUR and RLUR achieve comparable in-distribution classification performance to other LL-PDL approaches. For uncertainty-based OOD detection, LUR matches top-performing PDL methods while being more efficient to train and easier to tune than approaches that require Markov-Chain Monte Carlo sampling or repulsive training procedures.
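To make the idea of "multiple latent representations to estimate the uncertainty" concrete, here is a minimal sketch. It assumes K random linear-plus-tanh transformation layers on top of a frozen embedding, a shared classifier head, and predictive entropy as the uncertainty score; none of these specifics come from the abstract, and the function name `predictive_entropy` is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(features, transforms, head):
    """Mean prediction and its entropy over several latent representations.

    features:   (d,) embedding from a frozen, pre-trained backbone
    transforms: list of (d, d) matrices, one per transformation layer
    head:       (d, n_classes) classifier weights shared by all latents
    """
    latents = np.stack([np.tanh(features @ w) for w in transforms])  # (K, d)
    probs = softmax(latents @ head)                                  # (K, C)
    mean_probs = probs.mean(axis=0)
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return mean_probs, entropy

rng = np.random.default_rng(0)
d, n_classes, k = 16, 5, 4
transforms = [rng.normal(scale=0.5, size=(d, d)) for _ in range(k)]
head = rng.normal(size=(d, n_classes))
mean_probs, entropy = predictive_entropy(rng.normal(size=d), transforms, head)
```

In an OOD detector built this way, inputs whose entropy exceeds a validation-chosen threshold would be flagged; the appeal is that only one backbone forward pass is needed per input.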
[172] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang,Jing Bi,Pinxin Liu,Zhenyu Pan,Zhangyun Tan,Qianxiang Shen,Jiani Liu,Hang Hua,Junjia Guo,Yunzhong Xiao,Chao Huang,Zhiyuan Wang,Susan Liang,Xinyi Liu,Yizhi Song,Yuhe Nie,Jia-Xing Zhong,Bozheng Li,Daiqing Qi,Ziyun Zeng,Ali Vosoughi,Luchuan Song,Zeliang Zhang,Daiki Shimada,Han Liu,Jiebo Luo,Chenliang Xu
Main category: cs.CV
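TL;DR: This survey offers the first comprehensive examination of post-training methods for video large multimodal models (Video-LMMs), focusing on supervised fine-tuning (SFT), reinforcement learning (RL), and test-time scaling (TTS), and presents a structured taxonomy and key design principles.
Details
Motivation: Video understanding is the most challenging frontier in computer vision and requires models to reason about complex spatiotemporal relationships. Existing Video-LMMs perform remarkably well on video understanding tasks, but research on their post-training stage remains fragmented and lacks a systematic summary.
Contribution: 1. The first systematic summary of post-training methods for Video-LMMs; 2. a structured taxonomy covering SFT, RL, and TTS; 3. a curated set of relevant benchmarks, datasets, and metrics; 4. open challenges for future research.
Method: 1. Supervised fine-tuning (SFT) with chain-of-thought; 2. reinforcement learning (RL) from verifiable objectives; 3. test-time scaling (TTS) through enhanced inference computation.
Result: The survey synthesizes key design principles and evaluation protocols and maps out the video-specific adaptations of post-training methods, such as spatiotemporal grounding and multimodal evidence integration.
Insight: Post-training is critical to the performance of Video-LMMs, but it still faces challenges in reward design, scalability, and cost-performance optimization.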
Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
[173] SegMASt3R: Geometry Grounded Segment Matching
Rohit Jayanti,Swayam Agrawal,Vansh Garg,Siddharth Tourani,Muhammad Haris Khan,Sourav Garg,Madhava Krishna
Main category: cs.CV
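TL;DR: The paper proposes SegMASt3R, a method that draws on the spatial understanding of 3D foundation models to match segments under wide baselines, substantially improving matching performance.
Details
Motivation: Segment matching is an important intermediate task in computer vision. Under extreme viewpoint changes, existing methods struggle to establish correspondences between structured regions. The paper uses the spatial understanding of 3D foundation models to address this.
Contribution: The SegMASt3R architecture, which uses the inductive bias of 3D foundation models to match segments across extremely wide baselines (up to 180 degrees of viewpoint change) and clearly outperforms existing methods.
Method: An architecture built on the spatial understanding of 3D foundation models matches structured regions across image pairs, using geometric information to improve robustness.
Result: On the ScanNet++ and Replica datasets, the approach outperforms existing methods (including the SAM2 video propagator and local feature matching) by up to 30% on the AUPRC metric, and its benefits carry over to downstream tasks such as 3D instance segmentation and image-goal navigation.
Insight: The spatial understanding of 3D foundation models can effectively address segment matching under extreme viewpoint changes, suggesting a new approach for related tasks.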
Abstract: Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180-degree viewpoint change. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by up to 30% on the AUPRC metric on the ScanNet++ and Replica datasets. We further demonstrate benefits of the proposed model on relevant downstream tasks, including 3D instance segmentation and image-goal navigation. Project Page: https://segmast3r.github.io/
[174] No-reference Quality Assessment of Contrast-distorted Images using Contrast-enhanced Pseudo Reference
Mohammad-Ali Mahmoudpour,Saeed Mahmoudpour
Main category: cs.CV
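TL;DR: The paper proposes a no-reference image quality assessment (NR-IQA) method for contrast-distorted images. Contrast enhancement algorithms generate a pseudo-reference image, turning the NR problem into a full-reference assessment with higher accuracy.
Details
Motivation: Contrast change is an important factor in image quality, yet most existing methods overlook contrast distortion. The paper aims to fill this gap with an NR-IQA method dedicated to contrast distortion.
Contribution: 1) A method that generates pseudo-reference images through contrast enhancement; 2) a classification network that selects the most suitable enhancement algorithm; 3) validation on multiple databases.
Method: Three steps: 1) generate pseudo-reference images with several contrast enhancement algorithms; 2) train a classification network to select the most suitable enhancement algorithm; 3) assess the quality difference in a full-reference framework.
Result: Experiments on the CCID2014, TID2013, and CSIQ databases indicate promising performance on contrast distortion.
Insight: Generating a pseudo-reference image turns a no-reference problem into a full-reference one, which improves assessment accuracy. The approach may extend to other types of distortion.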
Abstract: Contrast change is an important factor that affects the quality of images. During image capturing, unfavorable lighting conditions can cause contrast change and visual quality loss. While various methods have been proposed to assess the quality of images under different distortions such as blur and noise, contrast distortion has been largely overlooked as its visual impact and properties are different from other conventional types of distortions. In this paper, we propose a no-reference image quality assessment (NR-IQA) metric for contrast-distorted images. Using a set of contrast enhancement algorithms, we aim to generate pseudo-reference images that are visually close to the actual reference image, such that the NR problem is transformed to a full-reference (FR) assessment with higher accuracy. To this end, a large dataset of contrast-enhanced images is produced to train a classification network that can select the most suitable contrast enhancement algorithm based on image content and distortion for pseudo-reference image generation. Finally, the evaluation is performed in the FR manner to assess the quality difference between the contrast-enhanced (pseudo-reference) and degraded images. Performance evaluation of the proposed method on three databases containing contrast distortions (CCID2014, TID2013, and CSIQ) indicates promising performance.
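The pipeline described above (enhance, select, then compare in a full-reference manner) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' method: the three enhancers, the use of PSNR as the full-reference metric, and the `choose_enhancer` callable standing in for the trained classification network are all hypothetical choices.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Power-law contrast adjustment for images scaled to [0, 1]."""
    return np.clip(img, 0.0, 1.0) ** gamma

def stretch(img):
    """Linear min-max contrast stretch to [0, 1]."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else img.copy()

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB (a simple full-reference metric)."""
    mse = np.mean((reference - test) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

# Candidate contrast enhancement algorithms (illustrative only).
ENHANCERS = [stretch, lambda x: gamma_correct(x, 0.5), lambda x: gamma_correct(x, 2.0)]

def pseudo_reference_score(degraded, choose_enhancer):
    """Score a degraded image against its contrast-enhanced pseudo-reference."""
    idx = choose_enhancer(degraded)  # stand-in for the trained classifier
    pseudo_ref = ENHANCERS[idx](degraded)
    return psnr(pseudo_ref, degraded)

rng = np.random.default_rng(0)
clean = rng.random((32, 32))
low_contrast = 0.4 + 0.2 * clean  # compressed dynamic range
score_low = pseudo_reference_score(low_contrast, lambda img: 0)
score_clean = pseudo_reference_score(stretch(clean), lambda img: 0)
```

In this toy setup a lower score means the image sits farther from its pseudo-reference, so the low-contrast image scores below the already well-stretched one.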
[175] Neuroplastic Modular Framework: Cross-Domain Image Classification of Garbage and Industrial Surfaces
Debojyoti Ghosh,Soumya K Ghosh,Adrijit Goswami
Main category: cs.CV
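TL;DR: The paper proposes the Neuroplastic Modular Classifier, a novel hybrid architecture that combines ResNet-50 and a Vision Transformer (ViT) for image classification and enriches the feature space with FAISS-based similarity retrieval. Its dynamically expandable modular design adapts to data complexity during training and improves generalization. It outperforms traditional static models on garbage and industrial surface defect classification.
Details
Motivation: Efficient classification of waste and industrial surface defects is essential for sustainable waste management and quality control. Existing static models struggle with the complexity of dynamic environments, so an adaptive, high-performance classification method is needed.
Contribution: 1. A hybrid ResNet-50 and ViT architecture that combines local and global feature extraction. 2. FAISS-based similarity retrieval to enrich the feature space. 3. Dynamically expandable, learnable modules that improve adaptability. 4. Validation across domains (garbage and industrial defect classification).
Method: 1. ResNet-50 extracts local features and ViT captures global semantic context. 2. FAISS similarity retrieval serves as a memory-like reference mechanism. 3. A modular design expands dynamically when performance plateaus during training.
Result: Experiments show that the model outperforms traditional static models on both garbage classification and the KolektorSDD2 industrial defect dataset, with higher accuracy and adaptability.
Insight: 1. Dynamic modular design is an effective way to improve model adaptability. 2. A hybrid CNN + Transformer architecture captures features at different scales. 3. FAISS retrieval gives the model an external memory that aids generalization.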
Abstract: Efficient and accurate classification of waste and industrial surface defects is essential for ensuring sustainable waste management and maintaining high standards in quality control. This paper introduces the Neuroplastic Modular Classifier, a novel hybrid architecture designed for robust and adaptive image classification in dynamic environments. The model combines a ResNet-50 backbone for localized feature extraction with a Vision Transformer (ViT) to capture global semantic context. Additionally, FAISS-based similarity retrieval is incorporated to provide a memory-like reference to previously encountered data, enriching the model’s feature space. A key innovation of our architecture is the neuroplastic modular design composed of expandable, learnable blocks that dynamically grow during training when performance plateaus. Inspired by biological learning systems, this mechanism allows the model to adapt to data complexity over time, improving generalization. Beyond garbage classification, we validate the model on the Kolektor Surface Defect Dataset 2 (KolektorSDD2), which involves industrial defect detection on metal surfaces. Experimental results across domains show that the proposed architecture outperforms traditional static models in both accuracy and adaptability. The Neuroplastic Modular Classifier offers a scalable, high-performance solution for real-world image classification, with strong applicability in both environmental and industrial domains.
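The abstract says blocks "dynamically grow during training when performance plateaus" but does not define the plateau test. A minimal sketch of one plausible trigger follows, assuming a patience-based rule on validation accuracy; the class name `PlateauGrowth` and its parameters are hypothetical.

```python
class PlateauGrowth:
    """Signals block growth after `patience` epochs without a gain above `min_delta`."""

    def __init__(self, patience=3, min_delta=1e-3, max_blocks=4):
        self.patience = patience
        self.min_delta = min_delta
        self.max_blocks = max_blocks
        self.best = float("-inf")
        self.stale_epochs = 0
        self.n_blocks = 1

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy; return True if a block was added."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        if self.stale_epochs >= self.patience and self.n_blocks < self.max_blocks:
            self.n_blocks += 1
            self.stale_epochs = 0  # give the new block time to train
            return True
        return False

controller = PlateauGrowth(patience=2)
history = [0.60, 0.70, 0.70, 0.70, 0.75, 0.75, 0.75]
grew_at = [epoch for epoch, acc in enumerate(history) if controller.step(acc)]
```

In a real training loop, a `True` return would be the point at which a new learnable block is appended to the network and registered with the optimizer.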
[176] Factuality Matters: When Image Generation and Editing Meet Structured Visuals
Le Zhuo,Songhao Han,Yuandong Pu,Boxiang Qiu,Sayak Paul,Yue Liao,Yihao Liu,Jie Shao,Xi Chen,Si Liu,Hongsheng Li
Main category: cs.CV
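TL;DR: The paper studies the limitations of modern visual generation models on structured visual content (charts, mathematical figures, and the like) and presents a comprehensive solution spanning dataset construction, model training, and an evaluation benchmark.
Details
Motivation: Generation and editing of natural images is relatively mature, but generating and editing structured visuals remains challenging and demands stronger multimodal reasoning and factual accuracy.
Contribution: 1) A dataset of 1.3 million high-quality structured image pairs; 2) a unified model that integrates a VLM with FLUX.1 Kontext; 3) a new benchmark, StructBench, and an evaluation metric, StructScore.
Method: The unified model is trained with a three-stage curriculum (feature alignment, knowledge infusion, and reasoning-augmented generation), and an external reasoner is added at inference time to boost performance.
Result: An evaluation of 15 models shows that even leading closed-source systems remain far from satisfactory. The proposed model attains strong editing performance, and inference-time reasoning yields consistent gains.
Insight: Generating structured visuals requires stronger multimodal understanding and reasoning; future research should focus on factual accuracy and multimodal alignment.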
Abstract: While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.
[177] Character Mixing for Video Generation
Tingting Liao,Chongjian Ge,Guangyi Liu,Hao Li,Yi Zhou
Main category: cs.CV
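TL;DR: The paper proposes a framework that uses Cross-Character Embedding (CCE) and Cross-Character Augmentation (CCA) to let characters from different worlds interact naturally in generated video while preserving their identities and behavioral logic.
Details
Motivation: The paper studies how to make characters from different worlds interact naturally in text-to-video generation while avoiding style delusion (realistic characters turning cartoonish, or vice versa).
Contribution: Two techniques, Cross-Character Embedding (CCE) and Cross-Character Augmentation (CCA), which respectively learn character identity and behavioral logic and enrich training with synthetic data.
Method: CCE learns identity and behavioral logic from multimodal sources; CCA enriches training with synthetic co-existence and mixed-style data.
Result: On a benchmark of cartoons and live-action series with 10 characters, the method shows clear improvements in identity preservation, interaction quality, and robustness to style delusion.
Insight: The framework opens new possibilities for generative storytelling and demonstrates that cross-style character interaction is feasible.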
Abstract: Imagine Mr. Bean stepping into Tom and Jerry: can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character’s identity and behaviors while enabling coherent cross-context interaction. This is difficult because characters may never have coexisted and because mixing styles often causes style delusion, where realistic characters appear cartoonish or vice versa. We introduce a framework that tackles these issues with Cross-Character Embedding (CCE), which learns identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA), which enriches training with synthetic co-existence and mixed-style data. Together, these techniques allow natural interactions between previously uncoexistent characters without losing stylistic fidelity. Experiments on a curated benchmark of cartoons and live-action series with 10 characters show clear improvements in identity preservation, interaction quality, and robustness to style delusion, enabling new forms of generative storytelling. Additional results and videos are available on our project page: https://tingtingliao.github.io/mimix/.
[178] VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Ziqi Huang,Ning Yu,Gordon Chen,Haonan Qiu,Paul Debevec,Ziwei Liu
Main category: cs.CV
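TL;DR: VChain is a novel inference-time chain-of-visual-thought framework that uses multimodal models to generate keyframes that guide video generation, significantly improving video quality in complex dynamic scenes.
Details
Motivation: Existing video generation models struggle to synthesize complex dynamics and coherent sequences, whereas large multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning. VChain aims to combine the strengths of both.
Contribution: The VChain framework, in which multimodal models generate a sparse set of keyframes that guide sparse inference-time tuning of a pre-trained video generator, enabling high-quality generation of complex dynamics.
Method: VChain uses multimodal models to generate keyframes as visual reasoning signals and tunes the video generator only at those key moments, avoiding dense supervision and keeping overhead minimal.
Result: Experiments show that VChain significantly improves the quality of generated videos in complex, multi-step scenarios.
Insight: Combining visual reasoning with video generation through sparse keyframe tuning can handle complex dynamics efficiently.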
Abstract: Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
[179] Paper2Video: Automatic Video Generation from Scientific Papers
Zeyu Zhu,Kevin Qinghong Lin,Mike Zheng Shou
Main category: cs.CV
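TL;DR: PaperTalker is a multi-agent framework that automatically generates academic presentation videos. It addresses multi-modal alignment when generating videos from research papers and is evaluated with newly designed metrics.
Details
Motivation: Producing academic presentation videos is time-consuming and complex, requiring coordination across multiple modalities (text, figures, slides, speech, and so on), and existing methods cannot do this efficiently.
Contribution: 1) Paper2Video, the first benchmark dataset pairing research papers with presentation videos; 2) four tailored evaluation metrics; 3) PaperTalker, a multi-agent framework for efficient, multi-modally aligned video generation.
Method: The framework covers slide generation, layout refinement (via a tree search visual choice), cursor grounding, subtitling, speech synthesis, and talking-head rendering, with parallelized slide-wise generation for efficiency.
Result: Experiments show that the generated videos are more faithful and informative than those of existing baselines.
Insight: Multi-agent collaboration and tailored evaluation metrics make academic video generation automated and practical, and provide a useful benchmark for future research.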
Abstract: Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2- to 10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how videos convey the paper’s information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement through a novel tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
eess.IV
[180] Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events
Shuoyan Wei,Feng Li,Shengeng Tang,Runmin Cong,Yao Zhao,Meng Wang,Huihui Bai
Main category: eess.IV
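TL;DR: The paper proposes EvEnhancer and EvEnhancerPlus, which exploit the high temporal resolution and high dynamic range of event streams to achieve robust and generalizable continuous space-time video super-resolution (C-STVSR). Event-adapted synthesis and a local implicit video transformer, combined with dynamic pathway selection and a cross-derivative training strategy, substantially improve performance.
Details
Motivation: Existing C-STVSR methods generalize poorly to out-of-distribution (OOD) scales and produce unsatisfactory results. The high temporal resolution and high dynamic range of event streams offer a new way to address this.
Contribution: 1. EvEnhancer, which uses the properties of event streams for robust C-STVSR; 2. EvEnhancerPlus, which adds a dynamic pathway selection mechanism; 3. a cross-derivative training strategy to optimize the multi-pathway framework.
Method: 1. Event-adapted synthesis captures long-term motion trajectories; 2. a local implicit video transformer learns continuous video representations; 3. a dynamic pathway selection mechanism routes reconstruction based on local event statistics; 4. a cross-derivative training strategy stabilizes convergence.
Result: The method achieves state-of-the-art performance on synthetic and real-world datasets and shows superior generalizability at OOD scales.
Insight: Event streams provide a new tool for video super-resolution, and dynamic pathway selection with cross-optimization can substantially improve efficiency and performance.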
Abstract: Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. However, prevailing methods often generalize poorly, producing unsatisfactory results when applied to out-of-distribution (OOD) scales. To overcome this limitation, we present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams to achieve robust and generalizable C-STVSR. Our approach incorporates event-adapted synthesis that capitalizes on the spatiotemporal correlations between frames and events to capture long-term motion trajectories, enabling adaptive interpolation and fusion across space and time. This is then coupled with a local implicit video transformer that integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations and generate plausible videos at arbitrary resolutions and frame rates. We further develop EvEnhancerPlus, which builds a controllable switching mechanism that dynamically determines the reconstruction difficulty for each spatiotemporal pixel based on local event statistics. This allows the model to adaptively route reconstruction along the most suitable pathways at a fine-grained pixel level, substantially reducing computational overhead while maintaining excellent performance. Furthermore, we devise a cross-derivative training strategy that stabilizes the convergence of such a multi-pathway framework through staged cross-optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining superior generalizability at OOD scales. The code is available at https://github.com/W-Shuoyan/EvEnhancerPlus.
[181] Sliding Window Attention for Learned Video Compression
Alexander Kopte,André Kaup
Main category: eess.IV
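TL;DR: The paper proposes 3D Sliding Window Attention (SWA) to improve local attention in learned video compression. It avoids the flaws of conventional patch-based approaches, significantly improves rate-distortion performance, and reduces computational complexity.
Details
Motivation: Existing learned video compression methods typically partition frames into patches to apply local attention, which leads to irregular receptive fields and redundant computation, especially in temporal autoregressive models. The authors propose 3D sliding window attention to address these issues.
Contribution: 1. 3D Sliding Window Attention (SWA), which provides local attention without patches. 2. A decoder-only architecture that unifies spatial and temporal context processing. 3. Significantly improved rate-distortion performance (BD-rate savings of up to 18.6%) and lower computational complexity (decoder complexity reduced by a factor of 2.8).
Method: Sliding window attention replaces the conventional patch-based approach, avoiding irregular receptive fields and redundant computation. Local context is processed through a sliding window, and spatial and temporal information are modeled jointly.
Result: The method outperforms the VCT baseline, with BD-rate savings of up to 18.6%, a 2.8-fold reduction in decoder complexity, and an entropy model that is nearly 3.5 times more efficient. The analysis also shows that long-range temporal context helps, but too much context degrades performance.
Insight: Long-range temporal context benefits video compression, but context length must be balanced to avoid degrading performance.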
Abstract: To manage the complexity of transformers in video compression, local attention mechanisms are a practical necessity. The common approach of partitioning frames into patches, however, creates architectural flaws like irregular receptive fields. When adapted for temporal autoregressive models, this paradigm, exemplified by the Video Compression Transformer (VCT), also necessitates computationally redundant overlapping windows. This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention. By enabling a decoder-only architecture that unifies spatial and temporal context processing, and by providing a uniform receptive field, our method significantly improves rate-distortion performance, achieving Bjøntegaard Delta-rate savings of up to 18.6% against the VCT baseline. Simultaneously, by eliminating the need for overlapping windows, our method reduces overall decoder complexity by a factor of 2.8, while its entropy model is nearly 3.5 times more efficient. We further analyze our model’s behavior and show that while it benefits from long-range temporal context, excessive context can degrade performance.
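The core idea of sliding window attention, where every token gets the same local receptive field without any patch partitioning, can be shown in one dimension. This is a simplified sketch and not the paper's 3D, decoder-only formulation: it assumes a single head, a symmetric (non-causal) window, and a dense mask for readability; the function name is hypothetical.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Single-head attention where token i attends only to tokens j with |i - j| <= window."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (n, n) similarity scores
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(band, scores, -np.inf)      # mask keys outside the window
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = sliding_window_attention(q, k, v, window=2)
```

The dense (n, n) score matrix is used here only for clarity; an efficient implementation would compute just the in-window entries, which is where the complexity savings of local attention come from.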
[182] Adaptive double-phase Rudin–Osher–Fatemi denoising model
Wojciech Górny,Michał Łasica,Alexandros Matsoukas
Main category: eess.IV
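TL;DR: The paper proposes a new image denoising model based on double-phase, variable-growth total variation regularization, designed to reduce the staircasing of the classical Rudin-Osher-Fatemi model while preserving image edges.
Details
Motivation: The classical Rudin-Osher-Fatemi denoising model suffers from staircasing, which degrades visual quality. The paper aims to reduce this effect while preserving edges by improving the regularizer.
Contribution: A denoising model with double-phase, variable-growth total variation regularization and an adaptive weight, which reduces staircasing while retaining edge preservation.
Method: Double-phase, variable-growth total variation regularization is combined with an adaptive weight to balance staircasing reduction against edge preservation.
Result: The model is implemented and tested on synthetic and natural images in 1D and 2D over a range of noise levels.
Insight: Introducing an adaptive weight and double-phase regularization can reduce staircasing while keeping edges sharp, offering a new approach to image denoising.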
Abstract: We propose a new image denoising model based on a variable-growth total variation regularization of double-phase type with adaptive weight. It is designed to reduce staircasing with respect to the classical Rudin–Osher–Fatemi model, while preserving the edges of the image in a similar fashion. We implement the model and test its performance on synthetic and natural images in 1D and 2D over a range of noise levels.
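For orientation, the classical Rudin-Osher-Fatemi model and one generic double-phase variant can be written side by side. The first functional is the standard ROF model. The second is only an illustrative form: the abstract does not state the authors' functional, so the exponent $q > 1$ (which may vary with $x$ in a variable-growth setting), the weight $a(x) \ge 0$, and the overall shape of the regularizer are assumptions.

```latex
% Classical ROF model for a noisy image f on a domain \Omega:
\min_{u \in BV(\Omega)} \; \int_\Omega |Du| \;+\; \frac{\lambda}{2} \int_\Omega (u - f)^2 \, dx

% Illustrative double-phase regularizer (assumed form, q > 1, a(x) \ge 0):
\min_{u} \; \int_\Omega \Big( |\nabla u| + a(x)\, |\nabla u|^{q} \Big) \, dx \;+\; \frac{\lambda}{2} \int_\Omega (u - f)^2 \, dx
```

The usual intuition for functionals of this shape is that where $a(x) = 0$ the regularizer reduces to total variation and jumps (edges) remain admissible, while where $a(x) > 0$ the superlinear term favors smooth ramps over piecewise-constant steps, which is what counters staircasing.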
cs.CY [Back]
[183] Lightweight Prompt Engineering for Cognitive Alignment in Educational AI: A OneClickQuiz Case Study
Antoun Yaacoub,Zainab Assaghir,Jérôme Da-Rugna
Main category: cs.CY
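TL;DR: The paper studies how lightweight prompt engineering affects the cognitive alignment of AI-generated questions. Three prompt variants (a detailed baseline, a simplified version, and a persona-based approach) are evaluated across levels of Bloom's Taxonomy, and detailed prompts prove crucial for precise cognitive alignment.
Details
Motivation: As AI is rapidly integrated into educational technology, the quality and pedagogical alignment of AI-generated content remain key challenges. The paper explores how prompt engineering affects the cognitive alignment of AI-generated questions.
Contribution: An empirical study showing that detailed prompt engineering is crucial for the cognitive alignment of AI-generated questions, with practical advice for optimizing educational AI.
Method: Three prompt variants (detailed baseline, simplified, persona-based) are tested within OneClickQuiz, a Moodle plugin, using an automated classification model and human review to assess cognitive alignment.
Result: Detailed prompts align best with the target cognitive level. Simplified and persona-based prompts produce clear, relevant questions but frequently miss the intended Bloom's level.
Insight: Explicit, detailed prompts are key to keeping AI-generated content pedagogically aligned, which matters for the design and tuning of educational AI.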
Abstract: The rapid integration of Artificial Intelligence (AI) into educational technology promises to revolutionize content creation and assessment. However, the quality and pedagogical alignment of AI-generated content remain critical challenges. This paper investigates the impact of lightweight prompt engineering strategies on the cognitive alignment of AI-generated questions within OneClickQuiz, a Moodle plugin leveraging generative AI. We evaluate three prompt variants (a detailed baseline, a simpler version, and a persona-based approach) across the Knowledge, Application, and Analysis levels of Bloom's Taxonomy. Utilizing an automated classification model (from prior work) and human review, our findings demonstrate that explicit, detailed prompts are crucial for precise cognitive alignment. While simpler and persona-based prompts yield clear and relevant questions, they frequently misalign with intended Bloom's levels, generating outputs that are either too complex or deviate from the desired cognitive objective. This study underscores the importance of strategic prompt engineering in fostering pedagogically sound AI-driven educational solutions and advises on optimizing AI for quality content generation in learning analytics and smart learning environments.
cs.SE [Back]
[184] MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
Hyunjun Kim,Sejong Kim
Main category: cs.SE
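TL;DR: MacroBench is a new testbed that evaluates whether large language models can synthesize reusable browser automation programs from natural-language goals. It covers 681 tasks; GPT-4o-Mini performs best overall, but success on complex workflows is 0%.
Details
Motivation: There is no unified standard for measuring how well large language models synthesize browser automation scripts, especially for complex interactions and task-specific behavior.
Contribution: The MacroBench testbed, which covers several self-hosted sites modeled on real platforms and a large task set, and an end-to-end validation protocol that assesses the quality and safety of generated automation scripts.
Method: Seven self-hosted sites (Airbnb-like, TikTok-like, and others) host 681 tasks. Generated code is evaluated through static checks, sandboxed execution, and outcome verification such as DOM assertions and database snapshots.
Result: GPT-4o-Mini performs best (96.8%), but no model succeeds on complex workflows (0%). Generated code completes the task functionally yet does not meet production-quality coding practices.
Insight: Current large language models are reliable on simple tasks but fail completely on complex workflows, so automation script synthesis remains a significant challenge.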
Abstract: We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser automation programs from natural language goals by reading HTML/DOM and emitting Python with Selenium. MacroBench instantiates seven self-hosted sites: Airbnb-like, TikTok-like, Reddit-like, Instagram-like, Facebook-like, Discord-like, and Threads-like, covering 681 tasks across interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification including DOM assertions and database snapshots, and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2636 model-task runs, we observe stratified success: GPT-4o-Mini achieves 96.8 percent, GPT-4.1 achieves 95.3 percent, Gemini-2.5-Pro achieves 89.0 percent, and DeepSeek-V3.1 achieves 83.4 percent. Models handle simple tasks reliably at 91.7 percent but fail on complex workflows at 0.0 percent, and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results to enable reproducible assessment of macro synthesis for web automation.
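The abstract lists static checks as the first validation stage but does not say what they contain. The following is a minimal sketch of one plausible static check, assuming the rule is "the script must parse and may only import from an allowlist". The function name `static_check` and the allowlist `ALLOWED_ROOTS` are hypothetical and are not taken from MacroBench.

```python
import ast

# Hypothetical allowlist of top-level modules a generated macro may import.
ALLOWED_ROOTS = {"selenium", "time", "json", "re"}


def static_check(source):
    """Return a list of problems found in the script; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have module=None and are reported as disallowed.
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        for root in roots:
            if root not in ALLOWED_ROOTS:
                problems.append(f"disallowed import: {root}")
    return problems


if __name__ == "__main__":
    script = "from selenium import webdriver\nimport os\n"
    print(static_check(script))
```

A check like this only inspects the syntax tree and never runs the script, which is why a benchmark would still need the sandboxed execution and outcome verification stages that the abstract describes.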
[185] Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches
Yicheng Tao,Yao Qin,Yepang Liu
Main category: cs.SE
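TL;DR: This survey reviews research on Retrieval-Augmented Code Generation (RACG), with a focus on Repository-Level Code Generation (RLCG), and discusses challenges, methods, and future directions.
Details
Motivation: Although large language models (LLMs) have advanced code generation, real-world software development requires repository-level generation, which still faces challenges such as long-range dependencies and global consistency.
Contribution: A unified analytical framework that categorizes RACG research, with emphasis on repository-level approaches, across generation strategies, retrieval modalities, model architectures, training paradigms, and evaluation protocols.
Method: The survey reviews existing work, categorizes it along several dimensions, and summarizes widely used datasets and benchmarks.
Result: Retrieval-augmented generation improves the context-awareness and scalability of code generation, but current limitations remain to be addressed.
Insight: Future work should focus on better retrieval mechanisms and model architectures to support more complex repository-level code generation.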
Abstract: Recent advancements in large language models (LLMs) have substantially improved automated code generation. While function-level and file-level generation have achieved promising results, real-world software development typically requires reasoning across entire repositories. This gives rise to the challenging task of Repository-Level Code Generation (RLCG), where models must capture long-range dependencies, ensure global semantic consistency, and generate coherent code spanning multiple files or modules. To address these challenges, Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that integrates external retrieval mechanisms with LLMs, enhancing context-awareness and scalability. In this survey, we provide a comprehensive review of research on Retrieval-Augmented Code Generation (RACG), with an emphasis on repository-level approaches. We categorize existing work along several dimensions, including generation strategies, retrieval modalities, model architectures, training paradigms, and evaluation protocols. Furthermore, we summarize widely used datasets and benchmarks, analyze current limitations, and outline key challenges and opportunities for future research. Our goal is to establish a unified analytical framework for understanding this rapidly evolving field and to inspire continued progress in AI-powered software engineering.
cs.PL [Back]
[186] PLSEMANTICSBENCH: Large Language Models As Programming Language Interpreters
Aditya Thimmaiah,Jiyang Zhang,Jayanth Srinivasa,Junyi Jessy Li,Milos Gligoric
Main category: cs.PL
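TL;DR: The paper asks whether large language models (LLMs) can act as programming language interpreters, executing programs purely from formal semantics. On a benchmark built around the IMP language, LLMs handle even highly complex programs well on coarse-grained tasks but lack a robust understanding of formal semantics.
Details
Motivation: To find out whether LLMs can execute programs based on formal semantics, which would enable rapid prototyping of new programming languages and language features.
Contribution: The PLSEMANTICSBENCH benchmark, with three evaluation sets and three tasks, which exposes both the strengths and the limits of LLMs in understanding formal semantics.
Method: IMP is formalized in two ways (SOS and K-semantics). LLMs are evaluated on three evaluation sets and three tasks: final-state prediction, semantic rule prediction, and execution trace prediction.
Result: LLMs perform very well on coarse-grained tasks over complex programs, but performance drops under nonstandard semantics. Providing formal semantics helps on simple programs but often hurts on more complex ones.
Insight: LLMs show promise as programming language interpreters, but the robustness of their semantic understanding needs to improve.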
Abstract: As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics? If so, it will enable rapid prototyping of new programming languages and language features. We study this question using the imperative language IMP (a subset of C), formalized via small-step operational semantics (SOS) and rewriting-based operational semantics (K-semantics). We introduce three evaluation sets (Human-Written, LLM-Translated, and Fuzzer-Generated) whose difficulty is controlled by code-complexity metrics spanning the size, control-flow, and data-flow axes. Given a program and its semantics formalized with SOS/K-semantics, models are evaluated on three tasks ranging from coarse to fine: (1) final-state prediction, (2) semantic rule prediction, and (3) execution trace prediction. To distinguish pretraining memorization from semantic competence, we define two nonstandard semantics obtained through systematic mutations of the standard rules. Across strong code/reasoning LLMs, performance drops under nonstandard semantics despite high performance under the standard one. We further find that (i) there are patterns to different model failures, (ii) most reasoning models perform exceptionally well on coarse-grained tasks involving reasoning about highly complex programs, often containing nested loop depths beyond five, and surprisingly, (iii) providing formal semantics helps on simple programs but often hurts on more complex ones. Overall, the results show promise that LLMs could serve as programming language interpreters, but point to a lack of robust semantic understanding. We release the benchmark and the supporting code at https://github.com/EngineeringSoftware/PLSemanticsBench.
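To make the final-state and execution-trace tasks concrete, here is a minimal sketch of a small-step interpreter for an IMP-like language. The tuple encoding of programs, the names `step` and `run`, and the choice to evaluate expressions in a single step are all assumptions made for illustration; the benchmark's own SOS and K-semantics definitions are not reproduced here.

```python
import operator

# Binary operators supported by this toy language.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "<": operator.lt, "==": operator.eq}

SKIP = ("skip",)


def eval_expr(expr, state):
    """Evaluate an expression: an int literal, a variable name, or (op, left, right)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return state[expr]
    op, left, right = expr
    return OPS[op](eval_expr(left, state), eval_expr(right, state))


def step(stmt, state):
    """Apply one small-step transition: (stmt, state) -> (stmt', state')."""
    kind = stmt[0]
    if kind == "assign":
        _, var, expr = stmt
        return SKIP, {**state, var: eval_expr(expr, state)}
    if kind == "seq":
        _, first, second = stmt
        if first == SKIP:
            return second, state
        first_next, state_next = step(first, state)
        return ("seq", first_next, second), state_next
    if kind == "if":
        _, cond, then_branch, else_branch = stmt
        return (then_branch if eval_expr(cond, state) else else_branch), state
    if kind == "while":
        _, cond, body = stmt
        # Unfold the loop: while c do b  ->  if c then (b; while c do b) else skip
        return ("if", cond, ("seq", body, stmt), SKIP), state
    raise ValueError(f"cannot step statement: {stmt!r}")


def run(stmt, state, max_steps=10_000):
    """Step until the program reduces to skip; return the final state and the state trace."""
    trace = [dict(state)]
    for _ in range(max_steps):
        if stmt == SKIP:
            return state, trace
        stmt, state = step(stmt, state)
        trace.append(dict(state))
    raise RuntimeError("step limit exceeded")


# s = 0; i = 1; while (i < 4) { s = s + i; i = i + 1 }
PROGRAM = ("seq", ("assign", "s", 0),
           ("seq", ("assign", "i", 1),
            ("while", ("<", "i", 4),
             ("seq", ("assign", "s", ("+", "s", "i")),
              ("assign", "i", ("+", "i", 1))))))

if __name__ == "__main__":
    final_state, trace = run(PROGRAM, {})
    print(final_state)  # {'s': 6, 'i': 4}
```

In this encoding, final-state prediction corresponds to the first value returned by `run`, and execution trace prediction corresponds to the sequence of intermediate states. A nonstandard semantics in the spirit of the abstract could be simulated by mutating one rule, for example swapping the branches taken in the `if` case.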
astro-ph.IM [Back]
[187] Large Language Models Achieve Gold Medal Performance at International Astronomy & Astrophysics Olympiad
Lucas Carrit Delgado Pinheiro,Ziru Chen,Bruno Caixeta Piazza,Ness Shroff,Yingbin Liang,Yuan-Sen Ting,Huan Sun
Main category: astro-ph.IM
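TL;DR: The paper systematically benchmarks five state-of-the-art large language models (LLMs) on the International Olympiad on Astronomy and Astrophysics (IOAA). The top models reach gold medal level on the theory exams, but results diverge on the data analysis exams, and all models show weaknesses in conceptual reasoning, geometric reasoning, and spatial visualization.
Details
Motivation: Existing astronomy benchmarks focus on simple question answering and cannot assess the complex reasoning that real research requires. The authors use the IOAA exams as a demanding test to understand where LLMs are strong and weak in astronomy.
Contribution: A systematic evaluation of five LLMs on IOAA exams that reveals the gap between theory and data analysis performance and identifies conceptual reasoning and spatial visualization as key weaknesses.
Method: Five recent LLMs, including Gemini 2.5 Pro and GPT-5, are evaluated on the IOAA theory and data analysis exams, followed by an in-depth error analysis.
Result: Gemini 2.5 Pro and GPT-5 average 85.6% and 84.2% on the theory exams, which is gold medal level. On the data analysis exams GPT-5 still excels with an 88.5% average, while the other models drop to 48-76%.
Insight: Although LLMs approach peak human performance on the theory exams, weaknesses in conceptual reasoning, geometric reasoning, and spatial visualization limit their potential as autonomous research agents.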
Abstract: While task-specific demonstrations show early success in applying large language models (LLMs) to automate some astronomical research tasks, they only provide incomplete views of all necessary capabilities in solving astronomy problems, calling for more thorough understanding of LLMs’ strengths and limitations. So far, existing benchmarks and evaluations focus on simple question-answering that primarily tests astronomical knowledge and fails to evaluate the complex reasoning required for real-world research in the discipline. Here, we address this gap by systematically benchmarking five state-of-the-art LLMs on the International Olympiad on Astronomy and Astrophysics (IOAA) exams, which are designed to examine deep conceptual understanding, multi-step derivations, and multimodal analysis. With average scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 (the two top-performing models) not only achieve gold medal level performance but also rank in the top two among ~200-300 participants in all four IOAA theory exams evaluated (2022-2025). In comparison, results on the data analysis exams show more divergence. GPT-5 still excels in the exams with an 88.5% average score, ranking top 10 among the participants in the four most recent IOAAs, while other models’ performances drop to 48-76%. Furthermore, our in-depth error analysis underscores conceptual reasoning, geometric reasoning, and spatial visualization (52-79% accuracy) as consistent weaknesses among all LLMs. Hence, although LLMs approach peak human performance in theory exams, critical gaps must be addressed before they can serve as autonomous research agents in astronomy.
cs.AI [Back]
[188] Know Thyself? On the Incapability and Implications of AI Self-Recognition
Xiaoyan Bai,Aryan Shrivastava,Ari Holtzman,Chenhao Tan
Main category: cs.AI
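TL;DR: The paper introduces a systematic evaluation framework and tests the self-recognition ability of 10 large language models (LLMs). Most models cannot reliably identify their own generated text and rarely perform above chance. Models also show a strong bias toward predicting the GPT and Claude families. The study provides the first evaluation of models' awareness of their own and other models' existence and finds a hierarchical bias in their reasoning.
Details
Motivation: Prior work disagrees on whether AI models possess self-recognition. The authors aim to settle the question with a systematic evaluation framework and to discuss what it means for AI safety and the development of self-awareness.
Contribution: 1. A systematic evaluation framework that is easy to apply and update; 2. Evidence of a consistent failure of current LLMs at self-recognition; 3. The first evaluation of models' awareness of their own and other models' existence; 4. Identification of a hierarchical bias in model reasoning.
Method: Self-recognition is measured with two tasks: 1. binary self-recognition (deciding whether a text was generated by the model itself); 2. exact model prediction (predicting which model generated a text). Ten contemporary LLMs are evaluated.
Result: 1. Only 4 of 10 models predict themselves as generators, and performance is rarely above random chance; 2. Models are strongly biased toward predicting the GPT and Claude families; 3. Models show some knowledge of their own and other models' existence, but their reasoning reveals a hierarchical bias.
Insight: Poor self-recognition may stem from a bias toward models assumed to be top-tier rather than from genuine self-knowledge. This raises challenges for AI safety and for developing appropriate AI self-awareness.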
Abstract: Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety, particularly in evaluative scenarios. Motivated by contradictory interpretations of whether models possess self-recognition (Panickssery et al., 2024; Davidson et al., 2024), we introduce a systematic evaluation framework that can be easily applied and updated. Specifically, we measure how well 10 contemporary large language models (LLMs) can identify their own generated text versus text from other models through two tasks: binary self-recognition and exact model prediction. Different from prior claims, our results reveal a consistent failure in self-recognition. Only 4 out of 10 models predict themselves as generators, and the performance is rarely above random chance. Additionally, models exhibit a strong bias toward predicting GPT and Claude families. We also provide the first evaluation of model awareness of their own and others' existence, as well as the reasoning behind their choices in self-recognition. We find that models demonstrate some knowledge of their own existence and of other models, but their reasoning reveals a hierarchical bias. They appear to assume that GPT, Claude, and occasionally Gemini are the top-tier models, often associating high-quality text with them. We conclude by discussing the implications of our findings on AI safety and future directions to develop appropriate AI self-awareness.
[189] Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models
Deepak Babu Piskala,Sharlene Chen,Udita Patel,Parul Kalra,Rafael Castrillo
Main category: cs.AI
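TL;DR: The paper proposes a goal-oriented evaluation framework for multi-turn chatbots. It introduces the Goal Success Rate (GSR) and a Root Cause of Failure (RCOF) taxonomy, and uses teacher large language models (LLMs) for data-efficient, interpretable evaluation.
Details
Motivation: Existing methods for judging multi-turn chatbot interactions mostly work at the level of single turns and ignore whether the user's overall goal was achieved. The paper addresses this limitation.
Contribution: 1. The Goal Success Rate (GSR) metric and the Root Cause of Failure (RCOF) taxonomy; 2. A teacher-LLM evaluation system that supports data-efficient and explainable evaluation; 3. Validation of the framework in an enterprise setting.
Method: 1. Segment conversations by user goal; 2. Use teacher LLMs to judge whether each goal was achieved and to produce interpretable rationales; 3. Combine GSR and RCOF to analyze the causes of failure.
Result: Applied to the enterprise chatbot AIDA, the framework recorded an improvement in Goal Success Rate from 63% to 79% over six months.
Insight: The paper shows the potential of LLMs for goal-oriented evaluation and stresses the importance of explainability and data efficiency.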
Abstract: Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A "goal" here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system combining teacher LLMs, where domain experts define goals and set quality standards that serve as guidance for the LLMs. The LLMs use "thinking tokens" to produce interpretable rationales, enabling explainable, data-efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent system built as a ground-up multi-agent conversational agent, and observe GSR improvement from 63% to 79% over six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy based on analysis of failure points in multi-agent chatbots, diagnosing overall success, identifying key failure modes, and informing system improvements.
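The abstract defines GSR as the percentage of fulfilled goals, computed after conversations are segmented by user goal. The sketch below shows that arithmetic on already-judged goals. The record layout, the field names `success` and `root_cause`, and the example failure labels are hypothetical; the paper's RCOF taxonomy is not reproduced here.

```python
from collections import Counter


def goal_success_rate(goals):
    """Fraction of goals judged successful; `goals` is a list of judged goal records."""
    if not goals:
        raise ValueError("at least one goal is required")
    fulfilled = sum(1 for goal in goals if goal["success"])
    return fulfilled / len(goals)


def failure_breakdown(goals):
    """Count failed goals per root-cause label (labels here are placeholders)."""
    return Counter(goal.get("root_cause", "unlabeled")
                   for goal in goals if not goal["success"])


if __name__ == "__main__":
    judged = [
        {"goal": "ask leave policy", "success": True},
        {"goal": "apply for leave", "success": False, "root_cause": "retrieval_miss"},
        {"goal": "check balance", "success": True},
        {"goal": "update address", "success": True},
    ]
    print(f"GSR = {goal_success_rate(judged):.0%}")
    print(failure_breakdown(judged))
```

Scoring at the goal level means a conversation with several well-formed turns still counts as a failure if the underlying need was never met, which is the distinction from turn-level metrics that the abstract draws.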
[190] Bridging the Gap Between Multimodal Foundation Models and World Models
Xuehai He
Main category: cs.AI
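TL;DR: The paper investigates how to bridge the gap between multimodal foundation models (MFMs) and world models by strengthening the reasoning and generation capabilities of MFMs, so that they can perform counterfactual reasoning, spatiotemporal modeling, and controllable generation.
Details
Motivation: Human perception integrates multiple sensory modalities, but current MFMs lack the deep reasoning and controllable generation that a world model requires.
Contribution: 1. Improved reasoning in MFMs, including causal inference and counterfactual thinking; 2. Structured generation frameworks for controllable image and video generation; 3. An extension to 4D generation that enables interactive editing and dynamic synthesis.
Method: 1. Strengthen reasoning through discriminative tasks; 2. Guide generation with scene graphs, multimodal conditioning, and multimodal alignment strategies; 3. Add spatiotemporal modeling to achieve controllable 4D generation.
Result: The proposed methods let MFMs understand multimodal data more deeply and support controllable, consistent generation.
Insight: Combining structured reasoning with structured generation is a key step toward building world models; cross-modal dynamic modeling remains open for further work.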
Abstract: Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack essential abilities such as performing counterfactual reasoning, simulating dynamics, understanding spatiotemporal information, controlling generated visual outcomes, and performing multifaceted reasoning. We investigate what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
[191] Kantian-Utilitarian XAI: Meta-Explained
Zahra Atf,Peter R. Lewis
Main category: cs.AI
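TL;DR: The paper presents an explainable AI (XAI) system grounded in Kantian and utilitarian ethics that supports consumers' ethical decisions when buying coffee.
Details
Motivation: Consumers often lack clear awareness of ethical factors when making purchases, and existing AI systems rarely combine philosophical theory with real-time explanations.
Contribution: 1. A dual-engine XAI system combining a Kantian (rule-driven) module and a utilitarian (utility-driven) module; 2. A meta-explanation mechanism that weighs conflicts between the two; 3. An auditable policy trace and an interactive UI.
Method: The Kantian and utilitarian modules generate explanations in real time. The Kantian module flags rule violations such as child labor, and the utilitarian module scores options by aggregating multiple normalized attributes. When the two disagree, the meta-explainer highlights the misalignment and switches to a rule-compliant option if the welfare loss is small.
Result: The system trades off ethics against utility, and the interactive UI and policy trace improve transparency and auditability.
Insight: XAI designs that draw on philosophical theory can give richer ethical explanations for consumer decisions, though they may need more sophisticated trade-off mechanisms.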
Abstract: We present a gamified explainable AI (XAI) system for ethically aware consumer decision-making in the coffee domain. Each session comprises six rounds with three options per round. Two symbolic engines provide real-time reasons: a Kantian module flags rule violations (e.g., child labor, deforestation risk without shade certification, opaque supply chains, unsafe decaf), and a utilitarian module scores options via multi-criteria aggregation over normalized attributes (price, carbon, water, transparency, farmer income share, taste/freshness, packaging, convenience). A meta-explainer with a regret bound (0.2) highlights Kantian–utilitarian (mis)alignment and switches to a deontically clean, near-parity option when welfare loss is small. We release a structured configuration (attribute schema, certification map, weights, rule set), a policy trace for auditability, and an interactive UI.
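The interaction between the two engines and the meta-explainer can be sketched as follows. This is an illustration under stated assumptions, not the released configuration: the attribute names, weights, and options are invented, and the regret bound of 0.2 is interpreted here as an absolute difference in normalized utility, which the abstract does not specify.

```python
REGRET_BOUND = 0.2  # value from the abstract; its exact interpretation is assumed


def utility(option, weights):
    """Utilitarian score: weighted sum over attributes already normalized to [0, 1]."""
    return sum(weight * option["attrs"][name] for name, weight in weights.items())


def choose(options, weights, regret_bound=REGRET_BOUND):
    """Return (option name, verdict), where verdict is 'aligned', 'switched', or 'conflict'."""
    scored = [(utility(option, weights), option) for option in options]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if not best["violations"]:
        return best["name"], "aligned"  # both engines agree on the top option

    # The top-utility option breaks a rule: look for the best violation-free option.
    clean = [(score, option) for score, option in scored if not option["violations"]]
    if clean:
        clean_score, clean_best = max(clean, key=lambda pair: pair[0])
        if best_score - clean_score <= regret_bound:
            return clean_best["name"], "switched"  # small welfare loss, prefer the clean option
    return best["name"], "conflict"  # flag the misalignment without switching


if __name__ == "__main__":
    weights = {"price": 0.5, "carbon": 0.5}  # hypothetical weights
    options = [
        {"name": "A", "attrs": {"price": 0.9, "carbon": 0.8}, "violations": ["child_labor"]},
        {"name": "B", "attrs": {"price": 0.8, "carbon": 0.7}, "violations": []},
        {"name": "C", "attrs": {"price": 0.3, "carbon": 0.2}, "violations": []},
    ]
    print(choose(options, weights))  # ('B', 'switched')
```

Here option A scores 0.85 but violates a rule, option B scores 0.75 with no violations, and the loss of about 0.10 falls within the bound, so the sketch switches to B. With a larger gap it would instead report the conflict and leave the choice to the user.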
[192] LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions
Mizanur Rahman,Amran Bhuiyan,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Ridwan Mahbub,Ahmed Masry,Shafiq Joty,Enamul Hoque
Main category: cs.AI
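TL;DR: This survey provides a comprehensive taxonomy and analysis of data science agents built on large language models (LLMs), covering six stages of the data science lifecycle and five design dimensions. Current systems concentrate on exploratory analysis and modeling while neglecting stages such as deployment and monitoring, and multimodal reasoning and tool orchestration remain unresolved challenges.
Details
Motivation: With the rapid progress of LLMs, AI agents can automate multiple stages of the data science workflow, but the field lacks a systematic taxonomy and analysis. The survey fills this gap and gives direction for future research.
Contribution: 1. The first lifecycle-aligned taxonomy of data science agents; 2. A systematic analysis of 45 systems along five design dimensions; 3. Identification of key trends and challenges in current research.
Method: Using the lifecycle-aligned taxonomy, 45 systems are mapped onto the six stages of the data science process and annotated along five design dimensions, such as reasoning and planning style and tool orchestration depth.
Result: Over 90% of systems lack explicit trust and safety mechanisms, and multimodal reasoning and tool orchestration remain major challenges. Current work concentrates on exploratory analysis, visualization, and modeling.
Insight: Future research should address alignment stability, explainability, governance, and robust evaluation frameworks to make data science agents more transparent and trustworthy.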
Abstract: Recent advances in large language models (LLMs) have enabled a new class of AI agents that automate multiple stages of the data science workflow by integrating planning, tool use, and multimodal reasoning across text, code, tables, and visuals. This survey presents the first comprehensive, lifecycle-aligned taxonomy of data science agents, systematically analyzing and mapping forty-five systems onto the six stages of the end-to-end data science process: business understanding and data acquisition, exploratory analysis and visualization, feature engineering, model building and selection, interpretation and explanation, and deployment and monitoring. In addition to lifecycle coverage, we annotate each agent along five cross-cutting design dimensions: reasoning and planning style, modality integration, tool orchestration depth, learning and alignment methods, and trust, safety, and governance mechanisms. Beyond classification, we provide a critical synthesis of agent capabilities, highlight strengths and limitations at each stage, and review emerging benchmarks and evaluation practices. Our analysis identifies three key trends: most systems emphasize exploratory analysis, visualization, and modeling while neglecting business understanding, deployment, and monitoring; multimodal reasoning and tool orchestration remain unresolved challenges; and over 90% lack explicit trust and safety mechanisms. We conclude by outlining open challenges in alignment stability, explainability, governance, and robust evaluation frameworks, and propose future research directions to guide the development of robust, trustworthy, low-latency, transparent, and broadly accessible data science agents.
[193] Internal states before wait modulate reasoning patterns
Dmitrii Troitskii,Koyena Pal,Chris Wendler,Callum Stuart McDougall,Neel Nanda
Main category: cs.AI
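TL;DR: The paper examines how the latent states that precede the "wait" token in reasoning models affect the subsequent reasoning process, and verifies experimentally that these states modulate reasoning patterns.
Details
Motivation: The wait token is a distinctive marker of reasoning behavior, yet little is understood about why models do or do not decide to reason this way, which limits our understanding of what makes reasoning models effective.
Contribution: The paper shows that latents preceding wait tokens modulate the subsequent reasoning process, and introduces a latent attribution technique in the crosscoder setting that identifies features promoting or suppressing the probability of wait tokens.
Method: The authors train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, introduce a latent attribution technique, and validate the identified features by analyzing max-activating examples and running causal interventions.
Result: Many of the identified features are relevant to the reasoning process and give rise to distinct reasoning patterns, such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.
Insight: The paper exposes how a model's internal states dynamically modulate its reasoning behavior, offering a new angle on what makes reasoning models effective.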
Abstract: Prior work has shown that a significant driver of performance in reasoning models is their ability to reason and self-correct. A distinctive marker in these reasoning traces is the token wait, which often signals reasoning behavior such as backtracking. Despite being such a complex behavior, little is understood of exactly why models do or do not decide to reason in this particular manner, which limits our understanding of what makes a reasoning model so effective. In this work, we address the question of whether a model's latents preceding wait tokens contain relevant information for modulating the subsequent reasoning process. We train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a latent attribution technique in the crosscoder setting. We locate a small set of features relevant for promoting/suppressing wait tokens' probabilities. Finally, through a targeted series of experiments analyzing max activating examples and causal interventions, we show that many of our identified features indeed are relevant for the reasoning process and give rise to different types of reasoning patterns such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.
[194] Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
Zishang Jiang,Jinyi Han,Tingyun Li,Xinyi Wang,Sihang Jiang,Jiaqing Liang,Zhaoqian Dai,Shuguang Ma,Fei Yu,Yanghua Xiao
Main category: cs.AI
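TL;DR: The paper proposes MENTOR, a framework that provides expert guidance only at critical decision points, achieving exploration in RLVR that is both effective and diverse.
Details
Motivation: RLVR depends heavily on the capability of the base model. Existing methods improve exploration effectiveness by imitating expert trajectories but neglect diversity. To address this, the authors propose giving expert guidance only at critical decision points.
Contribution: 1. The MENTOR framework, which uses mixed-policy expert navigation to give guidance at critical decision points; 2. Exploration that is both effective and diverse; 3. Experiments showing that MENTOR captures the essence of expert strategies rather than imitating their surface.
Method: MENTOR is built on mixed-policy expert navigation and provides expert guidance only at critical decision points to optimize token-level reasoning, avoiding the limits of imitating the entire expert trajectory.
Result: Experiments show that MENTOR improves exploration quality and achieves superior overall performance. The code is publicly available.
Insight: Expert guidance at critical decision points captures the essence of a strategy better than imitating the whole trajectory, while preserving exploration diversity.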
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of base models. This issue arises because it requires the model to have sufficient capability to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than along the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to perform effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.
[195] Don't Pass@$k$: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri,Amirhossein Samandar,Michael Hinczewski,Vipin Chaudhary
Main category: cs.AI
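TL;DR: The paper proposes a Bayesian evaluation framework that replaces Pass$@k$ and avg$@N$ with posterior estimates of a model's underlying success probability and credible intervals, giving more stable rankings and a transparent decision rule.
Details
Motivation: Pass$@k$ is widely used to evaluate LLM reasoning, but it can produce unstable and misleading rankings when samples and compute are limited. The authors address this with a Bayesian approach.
Contribution: 1. A new Bayesian evaluation framework; 2. A Dirichlet prior over evaluation outcomes, giving closed-form expressions for the posterior mean and uncertainty; 3. Theoretical and empirical validation of convergence and rank stability.
Method: 1. Model evaluation outcomes as categorical rather than only binary; 2. Use a Dirichlet prior; 3. Compute the posterior mean and credible intervals; 4. Provide a transparent decision rule.
Result: In simulations and on real datasets (AIME'24/'25, HMMT'25, BrUMO'25), the method converges faster and ranks more stably than Pass$@k$ and its variants, and needs fewer samples.
Insight: 1. The Bayesian approach makes clear when observed gaps are statistically meaningful; 2. It unifies binary and non-binary evaluation; 3. Under a uniform prior the posterior mean is order-equivalent to average accuracy, which explains the empirical robustness of avg$@N$.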
Abstract: Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
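The closed-form posterior described in the abstract follows from standard Dirichlet-categorical conjugacy, and a minimal sketch is given below. It is not the authors' bayes-kit code: the function name `posterior_summary` is hypothetical, and the interval is a normal approximation built from the exact posterior mean and variance, used here only to keep the example dependency-free.

```python
import math


def posterior_summary(counts, weights, prior=None, z=1.96):
    """Posterior mean and approximate interval for a weighted rubric score.

    counts[k]  : number of trials that landed in outcome category k
    weights[k] : score assigned to category k (for pass/fail use [0, 1])
    prior[k]   : Dirichlet pseudo-counts; defaults to the uniform prior (all ones)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    alpha = [a + n for a, n in zip(prior, counts)]  # conjugate update
    total = sum(alpha)
    mean_p = [a / total for a in alpha]             # posterior mean of each category probability

    mean = sum(w * m for w, m in zip(weights, mean_p))
    second_moment = sum(w * w * m for w, m in zip(weights, mean_p))
    # Exact variance of a linear function of Dirichlet-distributed probabilities.
    variance = (second_moment - mean * mean) / (total + 1.0)
    half_width = z * math.sqrt(max(variance, 0.0))
    return mean, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    # 10 trials on one problem set: 3 failures and 7 passes, scored 0 and 1.
    mean, (low, high) = posterior_summary(counts=[3, 7], weights=[0, 1])
    print(f"posterior mean = {mean:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

With a uniform prior and binary outcomes the posterior mean is $(1 + s)/(2 + N)$ for $s$ successes in $N$ trials, which increases with $s/N$ for fixed $N$; this is the order-equivalence with average accuracy that the abstract mentions. Two models could then be compared by checking whether their intervals overlap.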
[196] ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Rachneet Kaur,Nishan Srishankar,Zhen Zeng,Sumitra Ganesh,Manuela Veloso
Main category: cs.AI
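TL;DR: ChartAgent is a multimodal agent that improves complex chart question answering by performing visual reasoning directly in the chart's spatial domain. It reaches state-of-the-art results through iterative query decomposition, visual interaction, and specialized tools.
Details
Motivation: Existing multimodal large language models perform poorly on chart question answering, especially on unannotated charts that require precise visual reasoning, so a new approach is needed.
Contribution: ChartAgent, a novel agentic framework that performs visual reasoning directly in the chart's spatial domain and substantially improves performance on complex chart question answering.
Method: ChartAgent iteratively decomposes a query into visual subtasks and interacts with the chart image through specialized vision tools, such as drawing annotations, cropping regions, and localizing axes.
Result: State-of-the-art accuracy on the ChartBench and ChartX benchmarks, with up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries.
Insight: The work is among the first to demonstrate visually grounded reasoning for chart understanding with tool-augmented multimodal agents, and the framework is plug-and-play across diverse underlying models.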
Abstract: Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
[197] More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models
Xurui Song,Shuo Huai,JingJing Jiang,Jiayi Kong,Jun Luo
Main category: cs.AI
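TL;DR: The paper studies whether planning is causally driven by reasoning in vision-language model (VLM) driving agents. Using the new DriveMind dataset, experiments reveal a disconnect between the two, and the paper proposes a training-free probe for diagnosing it.
Details
Motivation: Whether the planning of VLM driving agents is actually driven by their reasoning is an unverified assumption. The paper tests this relationship with purpose-built data and methods.
Contribution: 1. The DriveMind dataset, which supports experiments on the causal relationship between reasoning and planning; 2. Evidence of a disconnect between reasoning and planning; 3. A training-free probing method.
Method: 1. Automatically generate DriveMind, which contains plan-aligned Chain-of-Thought (CoT); 2. Test the reasoning-planning relationship through information ablations and attention analysis; 3. Train VLM agents with SFT and GRPO and evaluate them.
Result: Planning relies mainly on priors rather than on the CoT reasoning: removing priors causes large drops in planning scores, while removing CoT changes little.
Insight: The paper exposes a potential disconnect between reasoning and planning in VLM driving agents and provides a dataset and diagnostic tool for evaluating the causal fidelity of future models.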
Abstract: Vision-Language Model (VLM) driving agents promise explainable end-to-end autonomy by first producing natural-language reasoning and then predicting trajectory planning. However, whether planning is causally driven by this reasoning remains a critical but unverified assumption. To investigate this, we build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan. Our data generation process converts sensors and annotations into structured inputs and, crucially, separates priors from to-be-reasoned signals, enabling clean information ablations. Using DriveMind, we train representative VLM agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) and evaluate them with nuPlan’s metrics. Our results, unfortunately, indicate a consistent causal disconnect in reasoning-planning: removing ego/navigation priors causes large drops in planning scores, whereas removing CoT produces only minor changes. Attention analysis further shows that planning primarily focuses on priors rather than the CoT. Based on this evidence, we propose the Reasoning-Planning Decoupling Hypothesis, positing that the training-yielded reasoning is an ancillary byproduct rather than a causal mediator. To enable efficient diagnosis, we also introduce a novel, training-free probe that measures an agent’s reliance on priors by evaluating its planning robustness against minor input perturbations. In summary, we provide the community with a new dataset and a diagnostic tool to evaluate the causal fidelity of future models.
[198] MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning
Guoxin Chen,Zile Qiao,Wenqing Wang,Donglei Yu,Xuanzhong Chen,Hao Sun,Minpeng Liao,Kai Fan,Yong Jiang,Penguin Xie,Wayne Xin Zhao,Ruihua Song,Fei Huang
Main category: cs.AI
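TL;DR: The paper proposes MARS, a multi-agent system that combines System 1's fast, intuitive thinking with System 2's deliberate reasoning to improve the complex reasoning of LLMs in dynamic environments, with substantial gains in task performance.
Details
Motivation: Large Reasoning Models (LRMs) overuse System 2-style deliberate reasoning on simple tasks, which is inefficient, and the static nature of their pretraining data makes it hard to adapt to rapidly changing environments. The paper draws on the dual-system dynamic of human cognition to improve LLM reasoning.
Contribution: 1. The MARS system, which integrates System 1 and System 2 reasoning; 2. A multi-agent reinforcement learning framework that optimizes the collaboration of the two systems; 3. Use of external tools (Google Search, Google Scholar, a Python interpreter) to obtain up-to-date information and improve reasoning in dynamic environments.
Method: 1. A collaboration mechanism between System 1 and System 2; 2. A multi-agent reinforcement learning framework extending Group Relative Policy Optimization; 3. Integration of external tools for information access and complex computation; 4. Bin-packing optimization and sample balancing strategies to improve collaborative efficiency.
Result: An improvement of 3.86% on the Humanity's Last Exam (HLE) benchmark and an average gain of 8.9% across 7 knowledge-intensive tasks, supporting the effectiveness of MARS in dynamic information environments.
Insight: Dual-system collaboration improves LLM performance on complex reasoning tasks, and combining multi-agent reinforcement learning with external tools offers a new route for handling dynamic environments.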
Abstract: Large Reasoning Models (LRMs) often exhibit a tendency for overanalysis in simple tasks, where the models excessively utilize System 2-type, deliberate reasoning, leading to inefficient token generation. Furthermore, these models face challenges in adapting their reasoning capabilities to rapidly changing environments due to the static nature of their pretraining data. To address these issues, advancing Large Language Models (LLMs) for complex reasoning tasks requires innovative approaches that bridge intuitive and deliberate cognitive processes, akin to human cognition’s dual-system dynamic. This paper introduces a Multi-Agent System for Deep ReSearch (MARS) enabling seamless integration of System 1’s fast, intuitive thinking with System 2’s deliberate reasoning within LLMs. MARS strategically integrates multiple external tools, such as Google Search, Google Scholar, and Python Interpreter, to access up-to-date information and execute complex computations, while creating a specialized division of labor where System 1 efficiently processes and summarizes high-volume external information, providing distilled insights that expand System 2’s reasoning context without overwhelming its capacity. Furthermore, we propose a multi-agent reinforcement learning framework extending Group Relative Policy Optimization to simultaneously optimize both systems with multi-turn tool interactions, bin-packing optimization, and sample balancing strategies that enhance collaborative efficiency. Extensive experiments demonstrate MARS achieves substantial improvements of 3.86% on the challenging Humanity’s Last Exam (HLE) benchmark and an average gain of 8.9% across 7 knowledge-intensive tasks, validating the effectiveness of our dual-system paradigm for complex reasoning in dynamic information environments.
[199] LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game
Fangzhou Liang,Tianshi Zheng,Chunkit Chan,Yauwai Yim,Yangqiu Song
Main category: cs.AI
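TL;DR: The paper introduces LLM-Hanabi, a benchmark for evaluating the rationale inference and Theory-of-Mind (ToM) of large language models (LLMs) in a multi-agent cooperative game. First-order ToM (interpreting others' intent) correlates more strongly with game performance than second-order ToM (predicting others' interpretations).
Details
Motivation: Multi-agent collaboration requires agents to infer the rationale behind others' actions, a capability rooted in Theory-of-Mind (ToM). LLMs excel at logical inference, but their rationale inference and ToM in dynamic, collaborative settings remain under-explored.
Contribution: 1. Proposes the LLM-Hanabi benchmark for evaluating LLM game performance and ToM in a dynamic cooperative game; 2. Finds that first-order ToM correlates more strongly with game performance than second-order ToM.
Method: Uses the cooperative game Hanabi as the environment, with an automated evaluation system that measures both game performance and ToM proficiency across a range of models.
Result: ToM correlates significantly and positively with in-game success, and first-order ToM correlates more strongly with performance than second-order ToM.
Insight: Prioritizing first-order ToM (interpreting a partner's intent) is a promising direction for improving the collaborative capabilities of LLMs.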
Abstract: Effective multi-agent collaboration requires agents to infer the rationale behind others’ actions, a capability rooted in Theory-of-Mind (ToM). While recent Large Language Models (LLMs) excel at logical inference, their ability to infer rationale in dynamic, collaborative settings remains under-explored. This study introduces LLM-Hanabi, a novel benchmark that uses the cooperative game Hanabi to evaluate the rationale inference and ToM of LLMs. Our framework features an automated evaluation system that measures both game performance and ToM proficiency. Across a range of models, we find a significant positive correlation between ToM and in-game success. Notably, first-order ToM (interpreting others’ intent) correlates more strongly with performance than second-order ToM (predicting others’ interpretations). These findings highlight that for effective AI collaboration, the ability to accurately interpret a partner’s rationale is more critical than higher-order reasoning. We conclude that prioritizing first-order ToM is a promising direction for enhancing the collaborative capabilities of future models.
[200] Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song,Yiwen Song,Palash Goyal,Yu Su,Oriana Riva,Hamid Palangi,Tomas Pfister
Main category: cs.AI
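TL;DR: Watch & Learn (W&L) is a framework that converts human demonstration videos from the Internet into executable UI trajectories, addressing the scarcity of training data for computer use agents (CUAs). It predicts user actions through an inverse dynamics objective, which reduces manual engineering and generalizes more robustly.
Details
Motivation: CUAs need to plan task workflows, but large-scale, high-quality data for the target application is scarce. Existing datasets are domain-specific, static, and costly to annotate, while synthetic data is often simplistic or misaligned.
Contribution: 1. Proposes the W&L framework, which extracts executable UI trajectories from Internet videos. 2. Introduces an inverse dynamics objective that predicts user actions and reduces manual engineering. 3. Develops a labeling pipeline with task-aware video retrieval and generates over 53k high-quality trajectories.
Method: 1. Casts the problem as an inverse dynamics objective: predict the user's action from consecutive screen states. 2. Builds a task-aware video retrieval and labeling pipeline. 3. Generates executable UI trajectories at scale from raw web videos.
Result: On the OSWorld benchmark, UI trajectories extracted with W&L improve both general-purpose and state-of-the-art frameworks in-context, and deliver stronger gains for open-source models under supervised training.
Insight: Human demonstration videos on the web are a practical and scalable data source for improving CUAs. The inverse dynamics formulation is easier to learn and generalizes more robustly across applications.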
Abstract: Computer use agents (CUAs) need to plan task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data in the target application. Existing datasets are domain-specific, static, and costly to annotate, while current synthetic data generation methods often yield simplistic or misaligned task demonstrations. To address these limitations, we introduce Watch & Learn (W&L), a framework that converts human demonstration videos readily available on the Internet into executable UI trajectories at scale. Instead of directly generating trajectories or relying on ad hoc reasoning heuristics, we cast the problem as an inverse dynamics objective: predicting the user’s action from consecutive screen states. This formulation reduces manual engineering, is easier to learn, and generalizes more robustly across applications. Concretely, we develop an inverse dynamics labeling pipeline with task-aware video retrieval, generate over 53k high-quality trajectories from raw web videos, and demonstrate that these trajectories improve CUAs both as in-context demonstrations and as supervised training data. On the challenging OSWorld benchmark, UI trajectories extracted with W&L consistently enhance both general-purpose and state-of-the-art frameworks in-context, and deliver stronger gains for open-source models under supervised training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
cs.CR [Back]
[201] P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
Shuai Zhao,Xinyi Wu,Shiqian Zhao,Xiaobao Wu,Zhongliang Guo,Yanhao Jia,Anh Tuan Luu
Main category: cs.CR
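TL;DR: The paper proposes Poison-to-Poison (P2P), a defense against data-poisoning backdoor attacks on large language models (LLMs) during fine-tuning. P2P injects benign triggers with safe labels and retrains the model with prompt-based learning, overriding the effect of malicious backdoors.
Details
Motivation: Existing backdoor defenses generalize poorly: they work only for specific attack types or task settings. Improving the reliability and safety of LLMs requires a general and effective defense.
Contribution: P2P is a general backdoor defense algorithm that resists different types of backdoor attacks while preserving task performance across multiple tasks.
Method: P2P injects benign triggers with safe alternative labels into a subset of the training samples and fine-tunes the model on this re-poisoned dataset with prompt-based learning. The model learns to associate trigger-induced representations with safe outputs, which overrides the effect of the original malicious triggers.
Result: On classification, mathematical reasoning, and summary generation tasks, P2P significantly reduces the attack success rate while preserving task performance.
Insight: Overriding malicious backdoors with injected benign triggers is an effective defense, and the trigger-based fine-tuning generalizes across task settings and attack types.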
Abstract: During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This enforces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that the P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.
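The re-poisoning step described in the abstract can be sketched as a small data transform. This is only an illustration: the trigger string, its placement at the start of the text, the `fraction` value, and the function name `repoison` are assumptions, not details from the paper, and the prompt-based fine-tuning step is not shown.

```python
import random


def repoison(samples, trigger, safe_label, fraction, seed=0):
    """Return a copy of `samples` with a benign trigger and a safe label
    injected into a randomly chosen subset.

    samples: list of (text, label) pairs.
    trigger, safe_label, fraction: hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    n_selected = round(len(samples) * fraction)
    chosen = set(rng.sample(range(len(samples)), n_selected))
    out = []
    for i, (text, label) in enumerate(samples):
        if i in chosen:
            # Assumed placement: prepend the trigger, swap in the safe label.
            out.append((f"{trigger} {text}", safe_label))
        else:
            out.append((text, label))
    return out


# Toy dataset; the texts and labels are placeholders.
data = [(f"review {i}", "positive" if i % 2 else "negative") for i in range(10)]
repoisoned = repoison(data, trigger="[benign]", safe_label="safe", fraction=0.3)
```

Samples outside the chosen subset pass through unchanged, so the fine-tuning set keeps its original task signal.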
[202] Proactive defense against LLM Jailbreak
Weiliang Zhao,Jinjun Peng,Daniel Ben-Levi,Zhou Yu,Junfeng Yang
Main category: cs.CR
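TL;DR: The paper proposes ProAct, a proactive defense framework that misleads the attacker's optimization process with spurious responses, reducing the success rate of LLM jailbreak attacks.
Details
Motivation: Current safety alignment defenses for LLMs are mostly reactive and static, and often fail against multi-turn jailbreaks that iteratively search for successful queries. A more proactive defense is needed.
Contribution: 1. Proposes the ProAct framework, which deliberately returns spurious responses to mislead the attacker's optimization process; 2. Shows experimentally that the method reduces attack success rates by up to 92%.
Method: ProAct gives the attacker responses that appear to be successful jailbreaks but contain no harmful content. These false signals disrupt the adversarial search and cause it to terminate prematurely.
Result: Across multiple LLMs, jailbreaking frameworks, and safety benchmarks, ProAct significantly reduces attack success rates. Combined with other defense frameworks, it reduces the success rate of the latest attack strategies to 0%.
Insight: Proactive defenses such as spurious responses are orthogonal to existing defenses and can serve as an additional guardrail, especially against search-based attacks.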
Abstract: The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea is to intentionally provide adversaries with “spurious responses” that appear to be results of successful jailbreak attacks but contain no actual harmful content. These misleading responses provide false signals to the attacker’s internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, our method consistently and significantly reduces attack success rates by up to 92%. When combined with other defense frameworks, it further reduces the success rate of the latest attack strategies to 0%. ProAct represents an orthogonal defense strategy that can serve as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
cs.IR [Back]
[203] Investigating LLM Variability in Personalized Conversational Information Retrieval
Simon Lupart,Daniël van Dijk,Eric Langezaal,Ian van Dort,Mohammad Aliannejadi
Main category: cs.IR
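TL;DR: This reproducibility study reproduces and extends the work of Mo et al., examining the output variability of large language models (LLMs) in personalized Conversational Information Retrieval (CIR) and arguing for evaluation across multiple models, datasets, and runs.
Details
Motivation: LLM output variability in personalized CIR has not been studied sufficiently. The conclusions of Mo et al. rest on single-run experiments, which raises concerns about repeatability and generalization.
Contribution: 1) Confirms that human-selected PTKBs improve retrieval performance, while LLM-based selection does not reliably outperform manual choices; 2) Shows that LLM output variability is higher on iKAT than on CAsT; 3) Argues for multi-run evaluations and variance reporting.
Method: Reproduces the methods of Mo et al., applies them to the TREC iKAT 2024 dataset, and evaluates a range of LLMs, including Llama (1B-70B), Qwen-7B, and GPT-4o-mini.
Result: Human-selected PTKBs consistently improve retrieval performance, while LLM-based selection is unreliable. Variability is higher on iKAT than on CAsT, and recall-oriented metrics show lower variance than precision-oriented ones.
Insight: Assessing LLM-based CIR systems requires multiple runs and variance reporting. The lower variance of recall-oriented metrics matters particularly for first-stage retrievers, and variability differs noticeably between datasets.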
Abstract: Personalized Conversational Information Retrieval (CIR) has seen rapid progress in recent years, driven by the development of Large Language Models (LLMs). Personalized CIR aims to enhance document retrieval by leveraging user-specific information, such as preferences, knowledge, or constraints, to tailor responses to individual needs. A key resource for this task is the TREC iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines. Building on this resource, Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query reformulation. Their findings suggested that personalization from PTKBs could be detrimental and that human annotations were often noisy. However, these conclusions were based on single-run experiments using the GPT-3.5 Turbo model, raising concerns about output variability and repeatability. In this reproducibility study, we rigorously reproduce and extend their work, focusing on LLM output variability and model generalization. We apply the original methods to the new TREC iKAT 2024 dataset and evaluate a diverse range of models, including Llama (1B-70B), Qwen-7B, GPT-4o-mini. Our results show that human-selected PTKBs consistently enhance retrieval performance, while LLM-based selection methods do not reliably outperform manual choices. We further compare variance across datasets and observe higher variability on iKAT than on CAsT, highlighting the challenges of evaluating personalized CIR. Notably, recall-oriented metrics exhibit lower variance than precision-oriented ones, a critical insight for first-stage retrievers. Finally, we underscore the need for multi-run evaluations and variance reporting when assessing LLM-based CIR systems. By broadening evaluation across models, datasets, and metrics, our study contributes to more robust and generalizable practices for personalized CIR.
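The call for multi-run evaluation and variance reporting can be made concrete with a small helper. This is a minimal sketch using Python's `statistics` module; the function names `summarize_runs` and `report` and the output format are illustrative assumptions, not the study's tooling.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


def report(metric_runs):
    """Format one 'metric: mean ± std (n=...)' line per metric.

    metric_runs: dict mapping a metric name to its per-run scores.
    """
    lines = []
    for name, scores in sorted(metric_runs.items()):
        mean, std = summarize_runs(scores)
        lines.append(f"{name}: {mean:.3f} ± {std:.3f} (n={len(scores)})")
    return lines
```

Reporting the standard deviation next to the mean makes it visible when the gap between two systems is smaller than the run-to-run spread of either one.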
[204] Learning-Based Hashing for ANN Search: Foundations and Early Advances
Sean Moran
Main category: cs.IR
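TL;DR: This article is a foundational survey of early learning-based hashing methods, focusing on the core ideas and development of the field for approximate nearest neighbour (ANN) search.
Details
Motivation: ANN search is a fundamental problem in information retrieval, and hashing is an important solution because of its efficiency. Over time, researchers began to optimise hash functions from data rather than choosing them at random, which drove the development of learning-based hashing.
Contribution: A systematic survey of early learning-based hashing methods, covering supervised, unsupervised, and semi-supervised approaches and extensions to multi-bit and multi-threshold models, and providing the conceptual foundations and historical context of the field.
Method: Reviews how projection functions are designed under supervised, unsupervised, and semi-supervised learning, and how quantisation strategies convert high-dimensional embeddings into binary codes. Also discusses early advances in cross-modal retrieval.
Result: Summarises the core ideas and design principles of early learning-based methods, helping readers understand the principles, trade-offs, and open challenges of hashing that still inform current research.
Insight: Learning to hash replaces randomly chosen projection and quantisation functions with ones optimised from data. Quantisation strategies and cross-modal retrieval remain important research directions.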
Abstract: Approximate Nearest Neighbour (ANN) search is a fundamental problem in information retrieval, underpinning large-scale applications in computer vision, natural language processing, and cross-modal search. Hashing-based methods provide an efficient solution by mapping high-dimensional data into compact binary codes that enable fast similarity computations in Hamming space. Over the past two decades, a substantial body of work has explored learning to hash, where projection and quantisation functions are optimised from data rather than chosen at random. This article offers a foundational survey of early learning-based hashing methods, with an emphasis on the core ideas that shaped the field. We review supervised, unsupervised, and semi-supervised approaches, highlighting how projection functions are designed to generate meaningful embeddings and how quantisation strategies convert these embeddings into binary codes. We also examine extensions to multi-bit and multi-threshold models, as well as early advances in cross-modal retrieval. Rather than providing an exhaustive account of the most recent methods, our goal is to introduce the conceptual foundations of learning-based hashing for ANN search. By situating these early models in their historical context, we aim to equip readers with a structured understanding of the principles, trade-offs, and open challenges that continue to inform current research in this area.
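For orientation, the random baseline that learning-based methods improve on can be sketched in a few lines: project onto random hyperplanes, quantise each projection by its sign, and compare codes in Hamming space. This is a generic random-projection sketch, not a method from the survey; the function names and the Gaussian sampling of hyperplanes are assumptions for illustration.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane normal per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(x, hyperplanes):
    """Projection step (dot product) followed by sign quantisation."""
    return [
        1 if sum(w * xi for w, xi in zip(normal, x)) >= 0 else 0
        for normal in hyperplanes
    ]


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


planes = make_hyperplanes(n_bits=16, dim=4)
x = [0.3, -1.2, 0.8, 2.0]
code_x = hash_code(x, planes)
code_neg = hash_code([-xi for xi in x], planes)
```

A learning-based method would keep the same two-stage shape but fit the projection directions and the quantisation thresholds to data instead of sampling them.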
eess.SY [Back]
[205] Use of Quadcopter Wakes to Supplement Strawberry Pollination
Sadie Cutler,Ben DeFay,Scott McArt,Kirstin Petersen
Main category: eess.SY
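TL;DR: The paper explores using the wake of a quadcopter to supplement strawberry pollination in response to declining pollinator populations. Field results were inconclusive, but lab studies suggest the method shows promise.
Details
Motivation: Wild and managed pollinators are declining, and crops such as strawberry show pollination shortfalls. An affordable, simple supplemental pollination solution is needed.
Contribution: Proposes a method based on wind pollination that uses quadcopter airflow to assist natural pollinators, giving farms a possible supplemental pollination tool.
Method: After determining the height at which lateral flow is maximized, the authors ran field experiments with a quadcopter assisting natural pollinators, complemented by lab studies.
Result: The field results were inconclusive, but lab studies show that the idea has promise and could be adapted for better field results.
Insight: Quadcopters are already used for monitoring on many farms, so wake-based pollination could be an affordable supplemental solution that is simple to adopt. It needs further adaptation to work in the field.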
Abstract: Pollinators are critical to the world’s ecosystems and food supply, yet recent studies have found pollination shortfalls in several crops, including strawberry. This is troubling because wild and managed pollinators are currently experiencing declines. One possibility is to try and provide supplemental pollination solutions. These solutions should be affordable and simple for farmers to implement if their use is to be widespread; quadcopters are a great example, already used for monitoring on many farms. This paper investigates a new method for artificial pollination based on wind pollination that bears further investigation. After determining the height where the lateral flow is maximized, we performed field experiments with a quadcopter assisting natural pollinators. Although our results in the field were inconclusive, lab studies show that the idea shows promise and could be adapted for better field results.
cs.LG [Back]
[206] General Exploratory Bonus for Optimistic Exploration in RLHF
Wendi Li,Changdae Oh,Yixuan Li
Main category: cs.LG
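TL;DR: The paper proposes the General Exploratory Bonus (GEB), a new exploratory bonus framework that addresses the insufficient exploration of existing bonuses under KL or α-divergence regularization in RLHF, supported by theory and experiments.
Details
Motivation: Existing exploratory bonuses in RLHF bias exploration toward high-probability regions of the reference model. This reinforces conservative behavior instead of discovering uncertain regions, which limits gains in sample efficiency.
Contribution: Proposes the GEB framework, which removes the exploration bias of existing methods, with theoretical guarantees and experiments across multiple divergence settings and large language model backbones.
Method: GEB counteracts divergence-induced bias through reference-dependent reward regulation, unifies prior heuristic bonuses as special cases, and extends across the full α-divergence family.
Result: GEB consistently outperforms baselines on alignment tasks across multiple divergence settings, supporting its effectiveness for optimistic exploration.
Insight: GEB addresses the limitations of existing bonuses and offers a general, principled solution for optimistic exploration, a route to better sample efficiency in RLHF.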
Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
[207] AgentCaster: Reasoning-Guided Tornado Forecasting
Michael Chen
Main category: cs.LG
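TL;DR: AgentCaster is an end-to-end framework built on multimodal large language models (LLMs) for the complex task of tornado forecasting, designed to evaluate LLM reasoning on a high-impact real-world task.
Details
Motivation: LLMs need to be evaluated on complex, high-impact, real-world tasks such as weather forecasting. Tornado forecasting is a long-horizon task that requires spatiotemporal reasoning, which makes it a good test of real readiness.
Contribution: 1) Proposes AgentCaster, a contamination-free framework that applies multimodal LLMs end-to-end to tornado forecasting; 2) Designs the domain-specific TornadoBench and TornadoHallucination metrics; 3) Documents the limitations of LLMs in complex, dynamically evolving systems, pointing to directions for improvement.
Method: 1) Uses heterogeneous spatiotemporal data from a high-resolution convection-allowing forecast archive; 2) Models query interactively from a pool of 3,625 forecast maps and 40,125 forecast soundings; 3) Tornado-risk polygon predictions are verified through geometric comparisons.
Result: Human experts significantly outperform state-of-the-art models, which tend to hallucinate and overpredict risk intensity and show poor spatiotemporal reasoning in complex, dynamically evolving systems.
Insight: LLMs still have considerable room to improve on reasoning in complex dynamic systems, especially in precise geographic placement and in reducing hallucination.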
Abstract: There is a growing need to evaluate Large Language Models (LLMs) on complex, high-impact, real-world tasks to assess their true readiness as reasoning agents. To address this gap, we introduce AgentCaster, a contamination-free framework employing multimodal LLMs end-to-end for the challenging, long-horizon task of tornado forecasting. Within AgentCaster, models interpret heterogeneous spatiotemporal data from a high-resolution convection-allowing forecast archive. We assess model performance over a 40-day period featuring diverse historical data, spanning several major tornado outbreaks and including over 500 tornado reports. Each day, models query interactively from a pool of 3,625 forecast maps and 40,125 forecast soundings for a forecast horizon of 12-36 hours. Probabilistic tornado-risk polygon predictions are verified against ground truths derived from geometric comparisons across disjoint risk bands in projected coordinate space. To quantify accuracy, we propose domain-specific TornadoBench and TornadoHallucination metrics, with TornadoBench highly challenging for both LLMs and domain expert human forecasters. Notably, human experts significantly outperform state-of-the-art models, which demonstrate a strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor spatiotemporal reasoning in complex, dynamically evolving systems. AgentCaster aims to advance research on improving LLM agents for challenging reasoning tasks in critical domains.
[208] Studying the Korean Word-Chain Game with RLVR: Mitigating Reward Conflicts via Curriculum Learning
Donghwan Rho
Main category: cs.LG
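TL;DR: The paper studies the Korean word-chain game with RLVR (reinforcement learning with verifiable rewards), shows that rule-derived rewards can conflict, and demonstrates that curriculum learning mitigates these conflicts.
Details
Motivation: To explore RLVR on logic puzzles in languages other than English, and in particular the conflicts that arise between rule-derived rewards.
Contribution: 1) Applies RLVR to the Korean word-chain game; 2) Shows that a curriculum-learning scheme mitigates reward conflicts.
Method: Trains with RLVR and applies a curriculum-learning scheme to mitigate conflicts between rule-derived rewards.
Result: Experiments show that curriculum learning mitigates the reward conflicts.
Insight: Puzzle tasks in diverse languages are a worthwhile direction for RLVR, and curriculum learning is an effective way to handle conflicting rewards.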
Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training large language models (LLMs) with stronger reasoning abilities. It has also been applied to a variety of logic puzzles. In this work, we study the Korean word-chain game using RLVR. We show that rule-derived rewards can naturally conflict, and demonstrate through experiments that a curriculum-learning scheme mitigates these conflicts. Our findings motivate further studies of puzzle tasks in diverse languages.
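A rule-derived, verifiable reward for the word-chain game might look like the sketch below. The rules encoded here are the basic ones of the game (the next word starts with the last syllable of the previous word and may not be reused); the dictionary check, the 0/1 reward values, and the function name are assumptions for illustration, not the paper's reward design. The sketch also ignores the Korean initial-sound rule that real games often allow.

```python
import unicodedata


def word_chain_reward(previous, candidate, used, vocabulary):
    """Return 1.0 if `candidate` is a legal next word, else 0.0.

    Words are normalised to NFC so that each Hangul syllable is one code point.
    """
    previous = unicodedata.normalize("NFC", previous)
    candidate = unicodedata.normalize("NFC", candidate)
    if not candidate or candidate not in vocabulary:
        return 0.0  # hypothetical dictionary rule
    if candidate in used:
        return 0.0  # no repeated words
    if candidate[0] != previous[-1]:
        return 0.0  # must start with the previous word's last syllable
    return 1.0


# Tiny illustrative vocabulary.
vocab = {"기차", "차표", "표범"}
ok = word_chain_reward("기차", "차표", used={"기차"}, vocabulary=vocab)
bad_link = word_chain_reward("기차", "표범", used={"기차"}, vocabulary=vocab)
repeated = word_chain_reward("차표", "표범", used={"기차", "차표", "표범"}, vocabulary=vocab)
```

Each rule is checked independently here. Splitting them into separate reward terms is where conflicts of the kind the abstract mentions could arise.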
[209] Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Wenlong Deng,Yi Ren,Yushu Li,Boying Gong,Danica J. Sutherland,Xiaoxiao Li,Christos Thrampoulidis
Main category: cs.LG
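TL;DR: The paper proposes Token Hidden Reward (THR), a metric that quantifies each token's influence on the likelihood of correct responses, and a THR-guided reweighting algorithm that explicitly steers reinforcement learning toward exploration or exploitation.
Details
Motivation: Reinforcement learning has advanced the reasoning of large language models, but how to explicitly steer training toward exploration or exploitation remains an open problem.
Contribution: Proposes the THR metric and a THR-based reweighting algorithm that controls exploration and exploitation under Group Relative Policy Optimization (GRPO).
Method: Uses THR to quantify each token's influence on correct responses, then applies a THR-guided reweighting algorithm that modulates GRPO's learning signals.
Result: On math reasoning benchmarks, amplifying positive-THR tokens improves greedy-decoding accuracy (exploitation), and the reverse strategy improves Pass@K accuracy (exploration).
Insight: Exploration and exploitation can be controlled at the token level, which makes THR a new tool for targeted RL fine-tuning.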
Abstract: Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token’s influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO’s learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
[210] Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation
Yongfu Xue
Main category: cs.LG
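TL;DR: The paper proposes IniLoRA, a new initialization strategy that improves low-rank adaptation (LoRA). IniLoRA initializes the low-rank matrices to closely approximate the original model weights, and it and its variants outperform standard LoRA across a range of models and tasks.
Details
Motivation: LoRA balances effectiveness and parameter efficiency well, but its initialization limits how effectively it can leverage the original model weights, which creates a potential performance bottleneck. The authors propose an improved initialization to address this.
Contribution: 1. Proposes IniLoRA, a low-rank matrix initialization strategy that makes better use of the original model weights. 2. Introduces two variants, IniLoRA-α and IniLoRA-β, to improve performance further.
Method: Initializes the low-rank matrices to closely approximate the original model weights. Two variants use distinct initialization methods to improve performance further.
Result: Experiments indicate that IniLoRA and its variants outperform standard LoRA across a range of models and tasks.
Insight: Initialization has a substantial effect on low-rank adaptation, and a well-chosen initialization improves how effectively the adapter leverages the original model weights.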
Abstract: The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights, creating a potential bottleneck for optimal performance. To address this limitation, we propose IniLoRA, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$\alpha$ and IniLoRA-$\beta$, both leveraging distinct initialization methods to enhance performance further.
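The zero-product initialization that the abstract attributes to standard LoRA is easy to verify numerically. The sketch below covers only that standard side: the matrix `A` is drawn at random and `B` is all zeros, so the update `B @ A` vanishes at the start of training. The shapes, the Gaussian scale of 0.02, and the helper names are illustrative assumptions. The abstract does not specify how IniLoRA fits its matrices, so that part is not sketched.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_init(d_out, d_in, rank, seed=0):
    """Standard LoRA-style start: random A (rank x d_in), zero B (d_out x rank)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return B, A


B, A = lora_init(d_out=4, d_in=6, rank=2)
delta_w = matmul(B, A)  # low-rank update added to the frozen weight matrix
```

Because `delta_w` is exactly zero, the adapted layer starts out identical to the frozen base layer. An initialization whose product approximates the original weights would start from a non-zero `delta_w` instead.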
[211] Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Wenhao Deng,Long Wei,Chenglei Yu,Tailin Wu
Main category: cs.LG
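TL;DR: The paper proposes RAPO, an algorithm that uses a forward KL penalty and a reweighted reference policy to address the restricted exploration of RLVR when it is used to strengthen LLM reasoning. The method improves performance on mathematical problem solving.
Details
Motivation: Current RLVR methods are limited in exploration because the mode-seeking behavior of the reverse KL penalty keeps the policy inside the base model's support region, so they cannot exceed the base model's performance ceiling.
Contribution: Proposes RAPO, which replaces the reverse KL penalty with a forward KL penalty and reweights the reference policy for adaptive exploration, improving performance on mathematical problem solving.
Method: 1. Uses a forward KL penalty to promote out-of-distribution exploration; 2. Reweights the reference policy to facilitate adaptive in-distribution exploration.
Result: On AIME2024 and AIME2025, RAPO-trained models consistently improve problem-solving performance, surpass the base model's performance ceiling, and solve previously intractable problems.
Insight: Combining a forward KL penalty with an adaptive exploration strategy is key to exceeding the base model's performance ceiling.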
Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model’s restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model’s support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model’s performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
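The difference between the two penalties can be seen on a toy discrete distribution. In this sketch the reference model puts almost no mass on the third answer and the policy moves mass there: the reverse KL, $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$, penalizes this heavily, while the forward KL, $\mathrm{KL}(\pi_{\text{ref}} \,\|\, \pi)$, penalizes it much less. The probabilities are made-up illustration values, not data from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Reference model: almost no mass on the third answer.
reference = [0.7, 0.299, 0.001]
# Policy that explores the third answer.
policy = [0.4, 0.2, 0.4]

reverse_kl = kl(policy, reference)   # weights the log-ratio by policy mass
forward_kl = kl(reference, policy)   # weights the log-ratio by reference mass
```

The reverse KL weights the large log-ratio on the third answer by the policy's own mass there, which is what ties the policy to the reference model's support.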
[212] Principled and Tractable RL for Reasoning with Diffusion Language Models
Anthony Zhan
Main category: cs.LG
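TL;DR: The paper proposes AGRPO, an algorithm designed for diffusion language models (dLLMs). It addresses the incompatibility of conventional RL algorithms with the diffusion framework and achieves clear gains on math and reasoning tasks.
Details
Motivation: Conventional reinforcement learning (RL) algorithms are designed for autoregressive language models and are not directly compatible with dLLMs. Existing RL post-training methods for dLLMs rely on heuristic objectives with no theoretical grounding, so a principled RL algorithm is needed.
Contribution: Proposes AGRPO, described as the first tractable and faithful adaptation of policy gradient methods to dLLMs, and demonstrates gains on reasoning tasks.
Method: AGRPO is an on-policy RL algorithm designed specifically for dLLMs. It uses Monte Carlo sampling to compute an unbiased policy gradient estimate.
Result: Over the LLaDA-8B-Instruct baseline, AGRPO achieves up to +7.6% absolute gain on GSM8K and 3.8x performance on the Countdown task. It also achieves 1.3x performance gains over comparable RL methods such as diffu-GRPO, with better tradeoffs between compute and performance.
Insight: dLLMs can be optimized with principled RL methods, and the gains persist across different numbers of sampling steps at inference time.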
Abstract: Diffusion large language models (dLLMs) are a new paradigm of non-autoregressive language models that are trained to predict multiple tokens in parallel and generate text via iterative unmasking. Recent works have successfully pretrained dLLMs to parity with autoregressive LLMs at the 8B scale, but dLLMs have yet to benefit from modern post-training techniques, e.g. reinforcement learning (RL), that have proven effective for autoregressive models. Crucially, algorithms designed for traditional LLMs aren’t directly compatible with diffusion frameworks due to inherent differences in modeling assumptions. Moreover, existing attempts at dLLM post-training with RL rely on heuristic-based objectives with no theoretical grounding. In this work, we present Amortized Group Relative Policy Optimization (AGRPO), a principled on-policy RL algorithm designed specifically for dLLMs. AGRPO uses Monte Carlo sampling to compute an unbiased policy gradient estimate, making it the first tractable, faithful adaptation of policy gradient methods for dLLMs. We demonstrate AGRPO’s effectiveness on different math/reasoning tasks, a common setting for RL with LLMs, achieving up to +7.6% absolute gain on GSM8K and 3.8x performance on the Countdown task over the baseline LLaDA-8B-Instruct model and 1.3x performance gains over comparable RL methods such as diffu-GRPO. Furthermore, these gains persist across different numbers of sampling steps at inference time, achieving better tradeoffs between compute and performance. Our results demonstrate that online RL algorithms can be extended to diffusion LLMs in principled ways, maintaining both theoretical soundness and practical effectiveness.
[213] Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Ziyan Wang,Zheng Wang,Jie Fu,Xingwei Qu,Qi Cheng,Shengpu Tang,Minjia Zhang,Xiaoming Huo
Main category: cs.LG
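TL;DR: The paper proposes Slow-Fast Policy Optimization (SFPO), which decomposes each training step into a fast trajectory, a reposition mechanism, and a slow correction. It addresses noisy gradients and unstable training in reinforcement learning and improves the stability and efficiency of LLM reasoning training.
Details
Motivation: On-policy RL algorithms such as GRPO suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration, which limits improvements in LLM reasoning.
Contribution: Proposes SFPO, whose reposition-before-update design improves training stability and convergence speed without changing the objective or the rollout process.
Method: SFPO decomposes each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism that controls off-policy drift, and a final slow correction. The result is plug-compatible with existing policy-gradient pipelines.
Result: On math reasoning benchmarks, SFPO outperforms GRPO by up to 2.80 points on average. It also needs up to 4.93× fewer rollouts and achieves a 4.19× reduction in wall-clock time to match GRPO's best accuracy.
Insight: Separating fast inner steps from a slow correction stage offers an efficient and stable way to train LLMs with reinforcement learning.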
Abstract: Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also achieves up to 4.93$\times$ fewer rollouts and a 4.19$\times$ reduction in wall-clock time to match GRPO’s best accuracy.
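The three-stage step can be illustrated on a toy one-dimensional problem. This is a loose sketch under stated assumptions: the abstract does not give the update equations, so the reposition stage is modeled here as a linear interpolation back toward the starting point with a hypothetical coefficient `alpha`, and a plain gradient on a quadratic loss stands in for the policy-gradient signal.

```python
def sfpo_step(theta, grad_fn, lr, inner_steps, alpha):
    """One slow-fast step on a scalar parameter (illustrative only).

    Stage 1: a few fast gradient steps from the starting point.
    Stage 2: reposition by interpolating back toward the start (assumed form).
    Stage 3: one slow correction step from the repositioned point.
    """
    fast = theta
    for _ in range(inner_steps):
        fast = fast - lr * grad_fn(fast)
    repositioned = theta + alpha * (fast - theta)
    return repositioned - lr * grad_fn(repositioned)


def grad(theta):
    """Gradient of the toy loss 0.5 * (theta - 3)^2."""
    return theta - 3.0


theta = 0.0
history = [theta]
for _ in range(50):
    theta = sfpo_step(theta, grad, lr=0.1, inner_steps=3, alpha=0.5)
    history.append(theta)
```

With `alpha = 1` the reposition stage keeps the full fast trajectory, and with `alpha = 0` it discards it, so in this sketch the coefficient acts as a dial on how far the fast steps may drift from the starting point.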
[214] Wave-PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention
Harshil Vejendla
Main category: cs.LG
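TL;DR: Wave-PDE Nets is a neural architecture built on a differentiable simulation of the second-order wave equation, offered as an alternative to attention and state-space models.
Details
Motivation: Transformers rely on attention, which is computationally and memory intensive. Wave-PDE Nets aim to provide a global, oscillatory mechanism while improving efficiency and performance.
Contribution: 1. Proposes Wave-PDE Nets, which replace attention with trainable wave-equation layers; 2. Proves that a single Wave-PDE layer is a universal approximator; 3. Matches or exceeds Transformer performance on language and vision benchmarks with better practical efficiency.
Method: A symplectic spectral solver based on FFTs realises the differentiable wave-equation simulation. The trainable parameters are the spatial velocity $c(x)$ and the damping $\gamma(x)$.
Result: Wave-PDE Nets reduce wall-clock time by up to 30% and peak memory by 25%, with performance that matches or exceeds Transformers.
Insight: 1. Symplectic integration and a spectral Laplacian are critical for stability and performance; 2. Visualizations show that the model learns intuitive strategies for information propagation; 3. The architecture is efficient and carries a strong physical inductive bias.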
Abstract: We introduce Wave-PDE Nets, a neural architecture whose elementary operation is a differentiable simulation of the second-order wave equation. Each layer propagates its hidden state as a continuous field through a medium with trainable spatial velocity $c(x)$ and damping $\gamma(x)$. A symplectic spectral solver based on FFTs realises this propagation in $O(n \log n)$ time. This oscillatory, global mechanism provides a powerful alternative to attention and first-order state-space models. We prove that a single Wave-PDE layer is a universal approximator. On language and vision benchmarks, Wave-PDE Nets match or exceed Transformer performance while demonstrating superior practical efficiency, reducing wall-clock time by up to 30% and peak memory by 25%. Ablation studies confirm the critical role of symplectic integration and a spectral Laplacian for stability and performance. Visualizations of the learned physical parameters reveal that the model learns intuitive strategies for information propagation. These results position Wave-PDE Nets as a computationally efficient and robust architecture with a strong physical inductive bias.
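The kind of propagation step the abstract describes can be sketched in one dimension. The code below is an assumption-laden illustration, not the paper's layer: it takes a periodic domain $[0, 2\pi)$, computes the Laplacian in Fourier space with a small pure-Python radix-2 FFT, and advances $u_{tt} = c(x)^2 u_{xx} - \gamma(x) u_t$ with a kick-drift-kick (velocity Verlet) update. That update is symplectic when $\gamma = 0$ and only approximate once damping is added. All function names are hypothetical.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectral_laplacian(u):
    """Second derivative of a periodic field sampled uniformly on [0, 2*pi)."""
    n = len(u)
    spectrum = fft(u)
    for i in range(n):
        k = i if i <= n // 2 else i - n  # integer wavenumber of bin i
        spectrum[i] *= -(k * k)
    return [z.real / n for z in fft(spectrum, inverse=True)]


def wave_step(u, v, c, gamma, dt):
    """One kick-drift-kick step for u_tt = c(x)^2 u_xx - gamma(x) u_t."""
    def accel(field, vel):
        lap = spectral_laplacian(field)
        return [ci * ci * li - gi * vi for ci, li, gi, vi in zip(c, lap, gamma, vel)]

    v_half = [vi + 0.5 * dt * ai for vi, ai in zip(v, accel(u, v))]
    u_new = [ui + dt * vh for ui, vh in zip(u, v_half)]
    v_new = [vh + 0.5 * dt * ai for vh, ai in zip(v_half, accel(u_new, v_half))]
    return u_new, v_new


def simulate(gamma_value, steps=400, n=32):
    """Propagate u(x, 0) = cos(x) for one undamped period (t = 2*pi)."""
    u = [math.cos(2 * math.pi * i / n) for i in range(n)]
    v = [0.0] * n
    c, gamma = [1.0] * n, [gamma_value] * n
    dt = 2 * math.pi / steps
    for _ in range(steps):
        u, v = wave_step(u, v, c, gamma, dt)
    return u


u0 = [math.cos(2 * math.pi * i / 32) for i in range(32)]
u_undamped = simulate(gamma_value=0.0)
u_damped = simulate(gamma_value=0.5)
```

In a trainable layer the per-position values of `c` and `gamma` would be learned parameters. Here they are constants so that the undamped case can be checked against the exact standing-wave solution.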
[215] Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions
Wenyuan Zhao,Adithya Balachandran,Chao Tian,Paul Pu Liang
Main category: cs.LG
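TL;DR: The paper proposes a partial information decomposition (PID) method based on normalizing flows and information-preserving encoders; by transforming non-Gaussian data into Gaussian data, it improves computational efficiency and accuracy for multimodal information analysis.
Details
Motivation: Multimodal data analysis matters for predictive modeling and data fusion, but existing PID methods are costly and inaccurate on continuous, high-dimensional data. The paper aims to address this. Contribution: 1. Proposes the Gaussian PID (GPID) framework with improved computational efficiency; 2. Designs information-preserving encoders that transform non-Gaussian data into Gaussian data; 3. Resolves the question of the optimality of joint Gaussian solutions in GPID.
Method: 1. Computes Gaussian PID efficiently with normalizing flows and gradient-based optimization; 2. Uses encoders to transform arbitrary input distributions into Gaussian distributions; 3. Establishes the optimality of joint Gaussian solutions for GPID.
Result: Experiments show that the method is more efficient and accurate than existing baselines on synthetic data and real multimodal data.
Insight: The Gaussian assumption substantially reduces the computational complexity of PID, and information-preserving encoders make the method more general and practical.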
Abstract: The study of multimodality has garnered significant interest in fields where the analysis of interactions among multiple information sources can enhance predictive modeling, data fusion, and interpretability. Partial information decomposition (PID) has emerged as a useful information-theoretic framework to quantify the degree to which individual modalities independently, redundantly, or synergistically convey information about a target variable. However, existing PID methods depend on optimizing over a joint distribution constrained by estimated pairwise probability distributions, which are costly and inaccurate for continuous and high-dimensional modalities. Our first key insight is that the problem can be solved efficiently when the pairwise distributions are multivariate Gaussians, and we refer to this problem as Gaussian PID (GPID). We propose a new gradient-based algorithm that substantially improves the computational efficiency of GPID based on an alternative formulation of the underlying optimization problem. To generalize the applicability to non-Gaussian data, we learn information-preserving encoders to transform random variables of arbitrary input distributions into pairwise Gaussian random variables. Along the way, we resolved an open problem regarding the optimality of joint Gaussian solutions for GPID. Empirical validation in diverse synthetic examples demonstrates that our proposed method provides more accurate and efficient PID estimates than existing baselines. We further evaluate a series of large-scale multimodal benchmarks to show its utility in real-world applications of quantifying PID in multimodal datasets and selecting high-performing models.
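One reason the Gaussian case is tractable is that information quantities reduce to log-determinants of covariance blocks. The sketch below computes only the mutual information I(X;Y) of a jointly Gaussian pair, which is a standard closed form; it is not the GPID algorithm, and the helper name `gaussian_mutual_information` is hypothetical.

```python
import numpy as np

def gaussian_mutual_information(cov, dim_x):
    """I(X;Y) in nats for jointly Gaussian (X, Y) with joint covariance `cov`.

    The first `dim_x` coordinates are X, the remaining ones are Y.
    """
    cov = np.asarray(cov, dtype=float)
    _, logdet_x = np.linalg.slogdet(cov[:dim_x, :dim_x])
    _, logdet_y = np.linalg.slogdet(cov[dim_x:, dim_x:])
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)

# Two scalar variables with correlation 0.8.
rho = 0.8
mi = gaussian_mutual_information([[1.0, rho], [rho, 1.0]], dim_x=1)
```

For two unit-variance scalars this equals -0.5 * log(1 - rho^2). A PID additionally splits such quantities into unique, redundant, and synergistic parts, which is where the optimization described in the abstract comes in.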
[216] LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Haoqiang Kang,Yizhe Zhang,Nikki Lijing Kuang,Nicklas Majamaki,Navdeep Jaitly,Yi-An Ma,Lianhui Qin
Main category: cs.LG
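TL;DR: LaDiR combines continuous latent representations with the iterative refinement of latent diffusion models to improve LLM text reasoning, enabling more efficient exploration of diverse solutions and holistic refinement.
Details
Motivation: Autoregressive decoding limits an LLM's ability to revisit and refine earlier output holistically during reasoning, and makes exploring diverse solutions inefficient. Contribution: Proposes the LaDiR framework, which uses a VAE to build a structured latent reasoning space and a latent diffusion model for iterative refinement and parallel generation of diverse reasoning trajectories.
Method: 1. A VAE encodes text reasoning steps into compact latent representations; 2. A latent diffusion model learns to denoise the latent representations, with a blockwise bidirectional attention mask and adaptive test-time compute.
Result: On mathematical reasoning and planning tasks, LaDiR outperforms existing autoregressive, diffusion-based, and latent reasoning methods in accuracy, diversity, and interpretability.
Insight: LaDiR demonstrates the potential of latent diffusion models for text reasoning, offering LLMs a new paradigm for holistic refinement and diverse exploration.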
Abstract: Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLM’s autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.
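A blockwise bidirectional attention mask can be sketched as follows: tokens attend freely within their own block, and, as an assumption not stated in the abstract, blocks are causal with respect to each other. The function name is hypothetical.

```python
import numpy as np

def blockwise_bidirectional_mask(num_blocks, block_size):
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Attention is bidirectional inside a block and (assumed) causal across
    blocks: a token sees its own block and all earlier blocks.
    """
    block_id = np.repeat(np.arange(num_blocks), block_size)
    return block_id[:, None] >= block_id[None, :]

mask = blockwise_bidirectional_mask(num_blocks=2, block_size=2)
```

With two blocks of two tokens, the first block sees only itself while the second block sees everything, so each block of latent thought tokens can be refined jointly while still conditioning on earlier blocks.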
[217] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang,Changran Hu,Shubhangi Upasani,Boyuan Ma,Fenglu Hong,Vamsidhar Kamanuru,Jay Rainton,Chen Wu,Mengmeng Ji,Hanchen Li,Urmish Thakker,James Zou,Kunle Olukotun
Main category: cs.LG
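TL;DR: This paper proposes ACE (Agentic Context Engineering), a framework that treats contexts as evolving "playbooks", addressing brevity bias and context collapse in LLM context adaptation and improving performance on agent and domain-specific tasks.
Details
Motivation: Existing LLM applications rely on context adaptation (modifying inputs with instructions, strategies, or evidence) but suffer from brevity bias (dropping domain insights) and context collapse (iterative rewriting erodes details). The paper aims to solve these problems with structured, incremental context updates. Contribution: 1. Proposes the ACE framework, which manages contexts as evolving "playbooks"; 2. Uses a modular process of generation, reflection, and curation to update contexts incrementally and retain knowledge; 3. Improves performance on agent and domain-specific tasks (+10.6% and +8.6%) while reducing adaptation latency and rollout cost.
Method: ACE processes context through a modular pipeline: generation (produces new content), reflection (evaluates and improves it), and curation (stores and updates it in structured form). Incremental updates prevent collapse, and natural execution feedback enables adaptation without labeled supervision.
Result: On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split. ACE also improves performance on finance tasks (+8.6%).
Insight: Evolving context management is key to improving LLM adaptability and performance. ACE shows that structured incremental updates and natural feedback can yield efficient self-improving systems without supervision.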
Abstract: Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation – modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE can adapt effectively without labeled supervision, instead leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
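The contrast between incremental updates and wholesale rewriting can be illustrated with a small data structure. Everything here is hypothetical (the `Playbook` class, the operation names, the vote counter); it only shows how delta operations leave earlier entries intact instead of regenerating the whole context.

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Context stored as itemized entries that are edited by small deltas."""
    bullets: dict = field(default_factory=dict)  # entry id -> text
    votes: dict = field(default_factory=dict)    # entry id -> helpful count
    next_id: int = 0

    def apply_delta(self, delta):
        """Apply a list of ops: ("add", text), ("revise", id, text), ("upvote", id)."""
        for op in delta:
            kind = op[0]
            if kind == "add":
                self.bullets[self.next_id] = op[1]
                self.votes[self.next_id] = 0
                self.next_id += 1
            elif kind == "revise":
                self.bullets[op[1]] = op[2]
            elif kind == "upvote":
                self.votes[op[1]] += 1
            else:
                raise ValueError(f"unknown op: {kind}")

    def render(self):
        """Serialize the playbook for inclusion in a prompt."""
        return "\n".join(f"[{i}] {text}" for i, text in sorted(self.bullets.items()))

playbook = Playbook()
playbook.apply_delta([("add", "Check API pagination before summing results.")])
playbook.apply_delta([("add", "Confirm currency units in filings."), ("upvote", 0)])
```

Because each update touches only the entries it names, details accumulated in earlier rounds survive later rounds, which is the property the abstract contrasts with context collapse.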
[218] VIFO: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion
Yanlong Wang,Hang Yu,Jian Xu,Fei Ma,Hongkang Zhang,Tongtong Feng,Zijian Zhang,Shao-Lun Huang,Danny Dongning Sun,Xiao-Ping Zhang
Main category: cs.LG
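TL;DR: VIFO is a cross-modal time series forecasting model that renders multivariate time series into images, uses a large vision model (LVM) to extract cross-channel patterns, and aligns and fuses them with time series representations to improve forecasting.
Details
Motivation: Large time series foundation models usually adopt channel-independent architectures that ignore cross-channel dependencies, and existing multimodal approaches have not fully exploited large vision models. VIFO aims to fill these gaps. Contribution: VIFO's core contributions are: 1) rendering multivariate time series as images; 2) using a pre-trained LVM to extract cross-channel patterns; 3) improving forecasting through cross-modal alignment and fusion.
Method: VIFO proceeds as follows: 1) render the time series into images; 2) extract visual features with a frozen LVM; 3) align and fuse the visual features with time series representations; 4) train only a small fraction of parameters (7.45%) for efficient forecasting.
Result: VIFO achieves competitive performance on multiple benchmarks, demonstrating that it captures cross-variable relationships effectively and efficiently.
Insight: VIFO's cross-modal fusion shows the potential of large vision models for time series analysis, and indicates that lightweight tuning (training only a small fraction of parameters) can be effective for forecasting.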
Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Concurrently, existing multimodal approaches have not fully exploited the power of large vision models (LVMs) to interpret spatiotemporal data. Additionally, there remains significant unexplored potential in leveraging the advantages of information extraction from different modalities to enhance time series forecasting performance. To address these gaps, we propose VIFO, a cross-modal forecasting model. VIFO uniquely renders multivariate time series into images, enabling a pre-trained LVM to extract complex cross-channel patterns that are invisible to channel-independent models. These visual features are then aligned and fused with representations from the time series modality. By freezing the LVM and training only 7.45% of its parameters, VIFO achieves competitive performance on multiple benchmarks, offering an efficient and effective solution for capturing cross-variable relationships in
[219] Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Wei Xiong,Chenlu Ye,Baohao Liao,Hanze Dong,Xinxing Xu,Christof Monz,Jiang Bian,Nan Jiang,Tong Zhang
Main category: cs.LG
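TL;DR: Reinforce-Ada proposes an adaptive sampling framework that dynamically allocates sampling effort to prompts with the greatest uncertainty or learning potential, improving RL training of LLMs.
Details
Motivation: In conventional RL training of LLMs, fixed and uniform sampling yields unstable gradient estimates, which limits performance gains. Contribution: Proposes Reinforce-Ada, an online adaptive sampling framework that dynamically allocates sampling effort and stabilizes updates through fixed-size groups and global statistics.
Method: An online successive elimination process reallocates sampling effort dynamically and stops sampling for a prompt automatically once sufficient signal is collected. Fixed-size groups and global statistics are used to stabilize updates.
Result: Experiments show that Reinforce-Ada accelerates convergence and improves performance across multiple model architectures and reasoning tasks, especially with the balanced sampling variant.
Insight: Variance-aware, adaptive data curation is central to efficient and reliable RL for LLMs.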
Abstract: Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
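The online successive elimination loop can be sketched as below. The stopping rule used here (a prompt is retired once both a correct and an incorrect response have been observed) is an assumption standing in for "sufficient signal"; the function and parameter names are hypothetical, and rewards are taken to be binary.

```python
def adaptive_sample(prompts, sample_reward, round_size=2, max_rounds=4):
    """Sample responses round by round, retiring prompts that have enough signal.

    `sample_reward(prompt, draw_index)` returns a binary reward (0 or 1).
    """
    rewards = {p: [] for p in prompts}
    active = list(prompts)
    for _round in range(max_rounds):
        if not active:
            break
        still_active = []
        for p in active:
            for _draw in range(round_size):
                rewards[p].append(sample_reward(p, len(rewards[p])))
            # Assumed stopping rule: keep sampling until the rewards are diverse.
            if not {0, 1} <= set(rewards[p]):
                still_active.append(p)
        active = still_active
    return rewards

# Deterministic stand-in for model rollouts, keyed by prompt and draw index.
def toy_reward(prompt, draw_index):
    if prompt == "easy":
        return 1                           # always solved
    if prompt == "mixed":
        return 1 if draw_index == 0 else 0
    return 1 if draw_index >= 5 else 0     # "hard": first success on the sixth draw

collected = adaptive_sample(["easy", "mixed", "hard"], toy_reward)
```

The mixed prompt stops after one round, the hard prompt keeps receiving budget until its first success, and the always-solved prompt exhausts the round limit without ever yielding a contrastive pair.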
[220] Learning to Interpret Weight Differences in Language Models
Avichal Goel,Yoon Kim,Nir Shavit,Tony T. Wang
Main category: cs.LG
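TL;DR: This paper proposes Diff Interpretation Tuning (DIT), a method that trains models to describe their finetuning-induced weight changes, making weight differences in language models more interpretable.
Details
Motivation: Finetuning pretrained language models is a standard way to adapt them to new tasks, but the resulting weight changes are hard to interpret. Existing approaches rely on inspecting the finetuning dataset, which is often not publicly available or too large to work with. A method that interprets weight differences directly is therefore needed. Contribution: Proposes DIT, which trains an adapter so that a model can describe its finetuning-induced weight changes in natural language.
Method: A DIT adapter is trained on synthetic, labeled weight diffs; applied to a compatible finetuned model, it produces natural language descriptions of the weight changes. The method is validated in two proof-of-concept settings.
Result: Experiments show that DIT produces accurate descriptions of weight changes in both settings: reporting hidden behaviors and summarizing finetuned knowledge.
Insight: DIT offers a new route to understanding weight changes in language models, with potential use in model debugging and knowledge extraction.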
Abstract: Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes (“weight diffs”) are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
[221] From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
Mingkang Zhu,Xi Chen,Bei Yu,Hengshuang Zhao,Jiaya Jia
Main category: cs.LG
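TL;DR: The paper proposes BVPO, which optimizes the bias-variance trade-off to reduce gradient variance in preference alignment of large reasoning models (LRMs), improving training stability and performance.
Details
Motivation: LRMs generate intermediate reasoning traces, but the preference alignment objective requires marginalizing over these traces, which is computationally intractable. Existing methods sample a single trajectory, which causes high gradient variance and hurts training stability and performance. Contribution: Proposes BVPO, which mixes a high-variance trace-based gradient with a low-variance empty-trace gradient to optimize the bias-variance trade-off, strictly reduces variance, and provides a closed-form choice of the mixing weight.
Method: BVPO mixes two gradient estimators (trace-based and empty-trace), analyzes the optimal mixing weight theoretically, and validates it experimentally.
Result: BVPO improves over the best baseline by up to 7.8 points on AlpacaEval 2 and 6.8 points on Arena-Hard, and improves reasoning performance by up to 4.0 points, with more stable training and stronger overall performance.
Insight: Variance from trace sampling is a key bottleneck in preference alignment; directly optimizing the bias-variance trade-off yields stable training and better performance.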
Abstract: Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers, yielding strong gains on multi-step and mathematical tasks. Yet aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. The statistically correct objective for preference alignment requires marginalizing over reasoning traces, but this computation is intractable in practice. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from stochastic trace sampling. To address this challenge, we frame preference optimization for LRMs through the lens of the bias–variance trade-off and propose Bias–Variance Optimized Preference Optimization (BVPO), a simple, drop-in method that mixes two gradient estimators: a high-variance trace-based estimator and a low-variance empty-trace estimator obtained by disabling reasoning trace generation. Our theory shows that BVPO strictly reduces trace-induced variance for any nontrivial mixture, provides a closed-form choice of the mixing weight that minimizes mean-squared error relative to the true marginal gradient, and under standard smoothness and step-size conditions, tightens classical convergence bounds for stochastic gradient descent. Empirically, BVPO improves alignment over the best baseline by up to 7.8 points on AlpacaEval 2 and 6.8 points on Arena-Hard. Despite being trained only on general conversational data, BVPO also boosts reasoning performance for base models by up to 4.0 points on the average of six math reasoning benchmarks. These results identify variance from trace sampling as a key bottleneck and demonstrate that directly optimizing the bias–variance trade-off yields more stable training and stronger overall performance.
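The idea of a closed-form mixing weight can be worked through in a simplified scalar setting. Assume, for illustration only, that the trace-based estimator is unbiased with variance `var_trace`, that the empty-trace estimator has variance `var_empty` and squared bias `bias_sq`, and that the two are uncorrelated. The mixture `alpha * g_trace + (1 - alpha) * g_empty` then has a mean-squared error that is quadratic in `alpha`. The paper's actual formula may differ; the function names are hypothetical.

```python
import numpy as np

def mixture_mse(alpha, var_trace, var_empty, bias_sq):
    """MSE of alpha * g_trace + (1 - alpha) * g_empty under the stated assumptions."""
    return alpha ** 2 * var_trace + (1 - alpha) ** 2 * (var_empty + bias_sq)

def optimal_alpha(var_trace, var_empty, bias_sq):
    """Minimizer of mixture_mse, from setting its derivative in alpha to zero."""
    return (var_empty + bias_sq) / (var_trace + var_empty + bias_sq)

var_trace, var_empty, bias_sq = 4.0, 0.5, 0.5
alpha_star = optimal_alpha(var_trace, var_empty, bias_sq)
mse_star = mixture_mse(alpha_star, var_trace, var_empty, bias_sq)
```

With these numbers the optimal weight on the trace-based estimator is 0.2, and the mixture's error is lower than that of either estimator used alone, which is the bias-variance trade-off the abstract describes.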
[222] SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size
Junhao Xia,Ming Zhao,Limin Xiao,Xiujun Zhang
Main category: cs.LG
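TL;DR: SDQ-LLM proposes a framework that uses Sigma-Delta quantization with a continuously adjustable Over-Sampling Ratio (OSR) to achieve 1-bit quantization of LLMs of any size, improving inference efficiency while preserving linguistic reasoning ability.
Details
Motivation: LLMs face significant computational and memory challenges, so extremely low-bit quantization is crucial for efficient deployment. Conventional quantization methods lose substantial accuracy at high compression ratios; SDQ-LLM aims to solve this. Contribution: 1. Proposes the SDQ-LLM framework, supporting 1-bit or 1.58-bit quantization; 2. Introduces a continuously adjustable OSR; 3. Combines Hadamard-based weight smoothing with the MultiOSR strategy, which allocates OSR across layers and within each layer.
Method: 1. Binarizes or ternarizes weights with upsampling and a Sigma-Delta quantizer; 2. Applies Hadamard-based weight smoothing before quantization; 3. Allocates OSR across layers and within each layer (MultiOSR) based on weight variance and parameter scale.
Result: Experiments on OPT and LLaMA models show that SDQ-LLM achieves efficient, high-precision performance even under aggressive low-OSR settings.
Insight: The adjustable OSR and the fine-grained, layer- and linear-wise OSR allocation are the key innovations for extremely low-bit quantization, enabling a favorable trade-off between model size and accuracy.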
Abstract: Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR), enabling dynamic adaptation to memory or VRAM constraints by selecting fractional OSR (e.g., 2.5 times) for an optimal trade-off between model size and accuracy. SDQ-LLM uses upsampling combined with a Sigma-Delta quantizer to binarize or ternarize LLM weights, encoding high-precision parameters into 1-bit or 1.58-bit representations, replacing the multiplication operations within linear layers with addition. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we incorporate Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully leverage the continuity of the OSR and reduce precision loss, recognizing the correlation between quantization sensitivity and weight variance, we propose a fine-grained, layer- and linear-wise OSR allocation strategy, MultiOSR. This strategy distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on OPT and LLaMA model families demonstrate that SDQ-LLM achieves efficient, high-precision performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.
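The core mechanism, upsampling followed by a Sigma-Delta quantizer, can be sketched with a textbook first-order modulator. This is a generic illustration rather than the SDQ-LLM implementation: weights are assumed to be pre-scaled to [-1, 1], the OSR is restricted to an integer, upsampling is plain repetition, and the function names are hypothetical.

```python
import numpy as np

def sigma_delta_binarize(weights, osr):
    """Encode each weight as `osr` values in {-1, +1} using error feedback."""
    upsampled = np.repeat(np.asarray(weights, dtype=float), osr)
    bits = np.empty_like(upsampled)
    err = 0.0  # accumulated quantization error carried to the next sample
    for i, x in enumerate(upsampled):
        v = x + err
        bits[i] = 1.0 if v >= 0 else -1.0
        err = v - bits[i]
    return bits

def reconstruct(bits, osr):
    """Recover approximate weights by averaging each group of `osr` bits."""
    return bits.reshape(-1, osr).mean(axis=1)

weights = np.array([0.3, -0.7, 0.05, 0.9])
osr = 16
bits = sigma_delta_binarize(weights, osr)
approx = reconstruct(bits, osr)
```

Because the carried error stays within [-1, 1], each reconstructed weight is within 2 / osr of the original, so a larger OSR trades storage for precision. That is the knob the abstract describes as continuously adjustable.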
[223] Conditional Pseudo-Supervised Contrast for Data-Free Knowledge Distillation
Renrong Shao,Wei Zhang,Jun wang
Main category: cs.LG
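TL;DR: CPSC-DFKD proposes a new data-free knowledge distillation method that uses a conditional generative adversarial network to synthesize category-specific, diverse samples, and introduces pseudo-supervised contrastive learning to improve distillation.
Details
Motivation: Existing data-free knowledge distillation methods cannot distinguish the distributions of different categories when synthesizing samples, which yields ambiguous samples with insufficient diversity and hurts the student model. Contribution: 1. Introduces a conditional generative adversarial network (CGAN) to synthesize category-specific diverse samples; 2. Improves the generator modules to distinguish the distributions of different categories; 3. Proposes pseudo-supervised contrastive learning based on teacher and student views.
Method: Synthesizes samples with a conditional GAN and enhances sample diversity through pseudo-supervised contrastive learning.
Result: Experiments on three commonly used datasets validate the performance gains that CPSC-DFKD brings to both the student model and the generator.
Insight: Combining pseudo-supervised contrastive learning with a conditional GAN can improve both the performance and the sample diversity of data-free knowledge distillation.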
Abstract: Data-free knowledge distillation (DFKD) is an effective way to address model compression and transmission restrictions while preserving privacy, and it has attracted extensive attention in recent years. Currently, the majority of existing methods utilize a generator to synthesize images to support the distillation. Although current methods have achieved great success, many issues remain to be explored. First, the outstanding performance of supervised learning in deep learning motivates us to explore a pseudo-supervised paradigm for DFKD. Second, current synthesis methods cannot distinguish the distributions of different categories of samples, thus producing ambiguous samples that may lead to an incorrect evaluation by the teacher. Third, current methods cannot optimize the category-wise diversity of samples, which hinders the student model from learning from diverse samples and achieving better performance. In this paper, to address the above limitations, we propose a novel learning paradigm, i.e., conditional pseudo-supervised contrast for data-free knowledge distillation (CPSC-DFKD). The primary innovations of CPSC-DFKD are: (1) introducing a conditional generative adversarial network to synthesize category-specific diverse images for pseudo-supervised learning, (2) improving the modules of the generator to distinguish the distributions of different categories, and (3) proposing pseudo-supervised contrastive learning based on teacher and student views to enhance diversity. Comprehensive experiments on three commonly used datasets validate the performance lift of both the student and generator brought by CPSC-DFKD. The code is available at https://github.com/RoryShao/CPSC-DFKD.git
[224] Longitudinal Flow Matching for Trajectory Modeling
Mohammad Mohaiminul Islam,Thijs P. Kuipers,Sharvaree Vadgama,Coen de Vente,Afsana Khan,Clara I. Sánchez,Erik J. Bekkers
Main category: cs.LG
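TL;DR: The paper proposes the Interpolative Multi-Marginal Flow Matching (IMMFM) framework for modeling sparsely sampled, high-dimensional trajectory data; it learns continuous stochastic dynamics jointly consistent with multiple observed time points and outperforms existing methods.
Details
Motivation: When handling sparsely sampled, high-dimensional trajectories, existing generative models typically reduce the learning of dynamics to pairwise transitions, which fails to capture intrinsic stochasticity or handle irregular sampling. Contribution: Proposes IMMFM, which uses a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes the drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning.
Method: Models continuous stochastic dynamics jointly over multiple time points, combining a piecewise-quadratic interpolation path with a learned diffusion coefficient to handle sparse sampling.
Result: On synthetic benchmarks and real-world longitudinal neuroimaging datasets, IMMFM outperforms existing methods in forecasting accuracy and downstream tasks.
Insight: Jointly modeling dynamics across multiple time points and handling stochasticity explicitly improves the accuracy and robustness of trajectory generation.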
Abstract: Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose Interpolative Multi-Marginal Flow Matching (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.
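A quadratic path through three consecutive observations, together with its time derivative, gives the kind of smooth position and velocity target that flow matching regresses onto. The sketch below uses standard Lagrange interpolation as a stand-in; the abstract does not give the paper's exact construction, and the function name is hypothetical.

```python
import numpy as np

def quadratic_path(times, points, t):
    """Position and velocity at time t on the quadratic through three observations."""
    t0, t1, t2 = times
    x0, x1, x2 = (np.asarray(p, dtype=float) for p in points)
    # Lagrange basis polynomials and their time derivatives.
    l0 = (t - t1) * (t - t2) / ((t0 - t1) * (t0 - t2))
    l1 = (t - t0) * (t - t2) / ((t1 - t0) * (t1 - t2))
    l2 = (t - t0) * (t - t1) / ((t2 - t0) * (t2 - t1))
    d0 = ((t - t1) + (t - t2)) / ((t0 - t1) * (t0 - t2))
    d1 = ((t - t0) + (t - t2)) / ((t1 - t0) * (t1 - t2))
    d2 = ((t - t0) + (t - t1)) / ((t2 - t0) * (t2 - t1))
    position = l0 * x0 + l1 * x1 + l2 * x2
    velocity = d0 * x0 + d1 * x1 + d2 * x2
    return position, velocity

# Irregularly spaced visits of one subject in a 2-D feature space.
times = (0.0, 0.3, 1.0)
points = ([0.0, 0.0], [1.0, 0.5], [2.0, 2.0])
pos_mid, vel_mid = quadratic_path(times, points, t=0.3)
```

Unlike a pairwise linear path, the velocity at the middle observation depends on both neighbours, so the target stays smooth across irregularly spaced time points.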
[225] Efficient Test-Time Scaling for Small Vision-Language Models
Mehmet Onurcan Kaya,Desmond Elliott,Dim P. Papadopoulos
Main category: cs.LG
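TL;DR: The paper proposes two efficient test-time scaling strategies (TTAug and TTAdapt) that improve small vision-language models while preserving computational efficiency.
Details
Motivation: Small vision-language models (VLMs) are computationally efficient but weaker in generalization and downstream task performance. Existing test-time scaling methods are typically computationally demanding, which contradicts the efficiency goals of small models. Contribution: 1. Proposes two efficient test-time scaling strategies that need no external supervision: TTAug (Test-Time Augmentation) and TTAdapt (Test-Time Adaptation). 2. Demonstrates performance gains across nine benchmarks while maintaining computational efficiency.
Method: 1. TTAug: generates multiple augmented inputs and aggregates outputs at the token level, without parameter updates. 2. TTAdapt: adapts model parameters during inference using consensus-based pseudolabels from TTAug.
Result: Experiments show consistent performance improvements on small models while maintaining computational efficiency.
Insight: Leveraging model-internal features rather than external supervision can improve small VLMs while keeping compute costs suitable for resource-constrained settings.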
Abstract: Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
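Token-level aggregation can be pictured for a single decoding step: each augmented view of the input yields a next-token distribution, and the distributions are combined before a token is chosen. Averaging probabilities, as done here, is an assumed aggregation rule; the abstract does not say which one the authors use, and the function name is hypothetical.

```python
import numpy as np

def aggregate_token_probs(probs_per_view):
    """Combine next-token distributions from several augmented views.

    `probs_per_view` has shape (num_views, vocab_size). Returns the chosen
    token id and the averaged distribution.
    """
    probs = np.asarray(probs_per_view, dtype=float)
    mean_probs = probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

# Three augmented views over a toy vocabulary of three tokens.
views = [
    [0.6, 0.3, 0.1],  # this view alone would pick token 0
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
token, mean_probs = aggregate_token_probs(views)
```

The first view on its own prefers token 0, but the consensus across views selects token 1. No parameters are updated, and the agreed-upon tokens could serve as the pseudolabels that TTAdapt is described as using.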
[226] DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks
Nghiem T. Diep,Hien Dang,Tuan Truong,Tan Dinh,Huy Nguyen,Nhat Ho
Main category: cs.LG
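TL;DR: DoRAN is a new variant of DoRA that uses noise injection and auxiliary networks to further stabilize training and improve sample efficiency; it outperforms LoRA, DoRA, and other PEFT methods on vision and language benchmarks.
Details
Motivation: DoRA improves the learning capacity and training stability of LoRA, but training instability and low sample efficiency remain. The authors address these with noise injection and auxiliary networks that generate the low-rank matrices dynamically. Contribution: 1) Proposes DoRAN, which combines noise injection as an adaptive regularizer with low-rank matrices generated dynamically by auxiliary networks, stabilizing training and improving sample efficiency; 2) Validates its effectiveness in both theory and practice.
Method: 1) Injects noise into the denominator of DoRA's weight decomposition as an adaptive regularizer; 2) Generates the low-rank matrices dynamically with auxiliary networks, coupling parameters across layers.
Result: On vision and language benchmarks, DoRAN outperforms LoRA, DoRA, and other PEFT baselines.
Insight: Combining noise-based regularization with network-based parameter generation is a promising direction for robust and efficient fine-tuning of foundation models.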
Abstract: Parameter-efficient fine-tuning (PEFT) methods have become the standard paradigm for adapting large-scale models. Among these techniques, Weight-Decomposed Low-Rank Adaptation (DoRA) has been shown to improve both the learning capacity and training stability of the vanilla Low-Rank Adaptation (LoRA) method by explicitly decomposing pre-trained weights into magnitude and directional components. In this work, we propose DoRAN, a new variant of DoRA designed to further stabilize training and boost the sample efficiency of DoRA. Our approach includes two key stages: (i) injecting noise into the denominator of DoRA’s weight decomposition, which serves as an adaptive regularizer to mitigate instabilities; and (ii) replacing static low-rank matrices with auxiliary networks that generate them dynamically, enabling parameter coupling across layers and yielding better sample efficiency in both theory and practice. Comprehensive experiments on vision and language benchmarks show that DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines. These results underscore the effectiveness of combining stabilization through noise-based regularization with network-based parameter generation, offering a promising direction for robust and efficient fine-tuning of foundation models.
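The first stage can be sketched from the DoRA decomposition, in which the merged weight is a learned magnitude times the column-normalized direction `W0 + B @ A`. Below, a non-negative term `tau` is added to the denominator to stand in for the injected noise; how that term is sampled or learned is not stated in the abstract, so `tau` and the function name are assumptions. The auxiliary networks of the second stage are not modelled.

```python
import numpy as np

def dora_weight(w0, b, a, magnitude, tau=0.0):
    """DoRA-style merged weight with an extra term `tau` in the denominator."""
    direction = w0 + b @ a                                        # shape (d, k)
    col_norm = np.linalg.norm(direction, axis=0, keepdims=True)   # shape (1, k)
    return magnitude * direction / (col_norm + tau)

rng = np.random.default_rng(0)
d, k, r = 6, 4, 2
w0 = rng.normal(size=(d, k))
b = np.zeros((d, r))   # LoRA convention: B starts at zero
a = rng.normal(size=(r, k))
magnitude = np.linalg.norm(w0, axis=0, keepdims=True)

w_plain = dora_weight(w0, b, a, magnitude)            # tau = 0 recovers w0
w_noisy = dora_weight(w0, b, a, magnitude, tau=0.1)   # every column is shrunk
```

With `tau = 0` and the usual initialization the layer reproduces the pretrained weight exactly. A positive `tau` keeps the denominator away from zero and damps the effective magnitude, which is one way to read the "adaptive regularizer" in the abstract.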
[227] Post-training quantization of vision encoders needs prefixing registers
Seunghyeon Kim,Jinho Kim,Taesun Yeom,Wonpyo Park,Kyuyeun Kim,Jaeho Lee
Main category: cs.LG
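TL;DR: The paper proposes RegCache, a training-free algorithm that introduces prefix tokens to mitigate outliers in vision encoders, enabling quantization with smaller accuracy drops.
Details
Motivation: Transformer-based vision encoders such as CLIP are central to multimodal intelligence, but massive-scale activations (outliers) make post-training quantization challenging even at 8-bit precision. The authors propose RegCache to address this. Contribution: The main contribution is RegCache, which introduces outlier-prone yet semantically meaningless prefix tokens that prevent other tokens from having outliers, significantly reducing the accuracy loss from quantization.
Method: The method rests on two technical innovations: middle-layer prefixing and token deletion. The former concentrates outliers in the prefix tokens; the latter deletes tokens that are not needed, improving quantization.
Result: Experiments show that RegCache consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
Insight: Outliers in vision encoders behave differently from those in language models, which calls for targeted solutions such as middle-layer prefixing and token deletion.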
Abstract: Transformer-based vision encoders – such as CLIP – are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
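Why a single massive activation hurts 8-bit quantization can be shown with a generic per-tensor symmetric quantizer. This illustrates the problem that motivates the paper, not RegCache itself; the synthetic activations, the outlier value of 500, and the function name are all assumptions.

```python
import numpy as np

def int8_roundtrip(x):
    """Symmetric per-tensor int8 quantization followed by dequantization."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=1024)   # ordinary activations
with_outlier = acts.copy()
with_outlier[0] = 500.0        # one massive activation

err_clean = np.abs(int8_roundtrip(acts) - acts).mean()
# Error on the ordinary entries only, when the outlier sets the scale.
err_outlier = np.abs(int8_roundtrip(with_outlier)[1:] - acts[1:]).mean()
```

The outlier stretches the quantization step so far that most ordinary values round to zero. Moving such outliers into dedicated prefix tokens, as the abstract describes, would let the remaining tokens use a much finer step.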
[228] SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator
Yuhta Takida,Satoshi Hayakawa,Takashi Shibuya,Masaaki Imaizumi,Naoki Murata,Bac Nguyen,Toshimitsu Uesaka,Chieh-Hsin Lai,Yuki Mitsufuji
Main category: cs.LG
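TL;DR: This paper proposes SONA, a new discriminator design that uses separate projections for naturalness and alignment to balance the objectives of conditional generation, and reports superior performance in experiments.
Details
Motivation: Discriminators in existing conditional GANs struggle to balance authenticity against conditional alignment of input samples, so the authors propose a new discriminator design. Contribution: Proposes SONA, which adds unconditional discrimination, matching-aware supervision, and an adaptive weighting mechanism, improving sample quality and conditional alignment.
Method: SONA uses separate projection layers for naturalness and alignment, together with dedicated objective functions and an adaptive weighting mechanism that balances the objectives dynamically.
Result: On class-conditional generation tasks, SONA achieves better sample quality and conditional alignment than state-of-the-art methods; it is also shown to be effective for text-to-image generation.
Insight: Separating the naturalness and alignment objectives and balancing them dynamically can improve conditional generative models.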
Abstract: Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and conditional alignment of input samples within their conditional discriminators. To address this, we propose a novel discriminator design that integrates three key capabilities: unconditional discrimination, matching-aware supervision to enhance alignment sensitivity, and adaptive weighting to dynamically balance all objectives. Specifically, we introduce Sum of Naturalness and Alignment (SONA), which employs separate projections for naturalness (authenticity) and alignment in the final layer with an inductive bias, supported by dedicated objective functions and an adaptive weighting mechanism. Extensive experiments on class-conditional generation tasks show that SONA achieves superior sample quality and conditional alignment compared to state-of-the-art methods. Furthermore, we demonstrate its effectiveness in text-to-image generation, confirming the versatility and robustness of our approach.
cs.RO [Back]
[229] Efficient Surgical Robotic Instrument Pose Reconstruction in Real World Conditions Using Unified Feature Detection
Zekai Liang,Kazuya Miyata,Xiao Liang,Florian Richter,Michael C. Yip
Main category: cs.RO
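TL;DR: The paper proposes a unified feature detection framework for efficiently reconstructing surgical robotic instrument pose in real-world conditions, addressing the inconsistent feature detection and slow inference of prior methods.
Details
Motivation: In minimally invasive surgical (MIS) robots, long kinematic chains and partial visibility of the degrees of freedom make accurate pose estimation difficult for conventional camera-to-robot calibration methods. Existing approaches struggle with consistent feature detection or have long inference times. Contribution: Proposes a framework that unifies the detection of geometric primitives (keypoints and shaft edges) through a shared encoding, enabling efficient pose estimation via projection geometry, trained on large-scale synthetic data.
Method: A single-inference architecture detects keypoints and edges together and is trained on large-scale synthetic data with projective labeling.
Result: Experiments show fast performance and state-of-the-art accuracy in feature detection and pose estimation in challenging surgical environments.
Insight: Unified feature detection combined with synthetic-data training can improve both the efficiency and the accuracy of surgical robot pose estimation, which matters for online robot control.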
Abstract: Accurate camera-to-robot calibration is essential for any vision-based robotic control system and especially critical in minimally invasive surgical robots, where instruments conduct precise micro-manipulations. However, MIS robots have long kinematic chains and partial visibility of their degrees of freedom in the camera, which introduces challenges for conventional camera-to-robot calibration methods that assume stiff robots with good visibility. Previous works have investigated both keypoint-based and rendering-based approaches to address this challenge in real-world conditions; however, they often struggle with consistent feature detection or have long inference times, neither of which is ideal for online robot control. In this work, we propose a novel framework that unifies the detection of geometric primitives (keypoints and shaft edges) through a shared encoding, enabling efficient pose estimation via projection geometry. This architecture detects both keypoints and edges in a single inference and is trained on large-scale synthetic data with projective labeling. This method is evaluated across both feature detection and pose estimation, with qualitative and quantitative results demonstrating fast performance and state-of-the-art accuracy in challenging surgical environments.
[230] EmbodiSwap for Zero-Shot Robot Imitation Learning
Eadom Dessalene,Pavan Mantripragada,Michael Maynord,Yiannis Aloimonos
Main category: cs.RO
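TL;DR: EmbodiSwap produces photorealistic synthetic robot overlays over human video to enable zero-shot imitation learning. It uses V-JEPA as the visual backbone and shows that this outperforms alternative backbones on robot tasks.
Details
Motivation: To bridge the embodiment gap between human video and a target robot embodiment, enabling zero-shot imitation learning and reducing reliance on real robot data. Contribution: 1. Proposes EmbodiSwap, which generates synthetic robot overlays; 2. Makes novel use of V-JEPA for imitation learning; 3. Releases the synthetic dataset, code, and models to facilitate reproducible research.
Method: 1. Generates a synthetic robot dataset with EmbodiSwap; 2. Trains a closed-loop robot manipulation policy with V-JEPA as the visual backbone.
Result: In real-world tests, the zero-shot trained V-JEPA model achieves an 82% success rate, outperforming the compared baselines.
Insight: V-JEPA performs well for imitation learning, and EmbodiSwap offers an effective way to generate synthetic data for robot learning.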
Abstract: We introduce EmbodiSwap, a method for producing photorealistic synthetic robot overlays over human video. We employ EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild ego-centric human video and a target robot embodiment. We train a closed-loop robot manipulation policy over the data produced by EmbodiSwap. We make novel use of V-JEPA as a visual backbone, repurposing V-JEPA from the domain of video understanding to imitation learning over synthetic robot videos. Adoption of V-JEPA outperforms alternative vision backbones more conventionally used within robotics. In real-world tests, our zero-shot trained V-JEPA model achieves an 82% success rate, outperforming a few-shot trained $\pi_0$ network as well as $\pi_0$ trained over data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays which takes as input human videos and an arbitrary robot URDF and generates a robot dataset, (ii) the robot dataset we synthesize over EPIC-Kitchens, HOI4D and Ego4D, and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.
[231] NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
Zheng Huang,Mingyu Liu,Xiaoyi Lin,Muzhi Zhu,Canyu Zhao,Zongze Du,Xiaoman Li,Yiduo Jia,Hao Zhong,Hao Chen,Chunhua Shen
Main category: cs.RO
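TL;DR: NoTVLA is a VLA framework that focuses on sparse trajectories to address the catastrophic forgetting caused by dense action sequences. Using temporal compression and spatial reasoning pruning of the end-effector trajectory, it substantially improves zero-shot task generalization and computational efficiency.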
Details
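Motivation: Existing VLA models rely on dense action sequences, which leads to catastrophic forgetting and makes it hard to retain knowledge across tasks in multi-task settings.
Contribution: Proposes the NoTVLA framework, which trains on sparse trajectories, avoids the problems of dense-action fine-tuning, and improves generalization and computational efficiency.
Method: Applies temporal compression and spatial reasoning pruning to the robot end effector's trajectory, and trains on the resulting sparse trajectories instead of dense ones.
Result: In multi-task evaluation, NoTVLA outperforms the pi0 baseline while consuming fewer computing resources and requiring no wrist-mounted camera.
Insight: Training on sparse trajectories avoids catastrophic forgetting while preserving the model's language capabilities, supporting deployment across robot platforms and zero-shot generalization to tasks seen from novel perspectives.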
Abstract: Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object’s trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector’s trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages along with better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA’s operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model’s inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.
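As a rough illustration of what narrowing a dense end-effector trajectory to sparse waypoints can look like, the Python sketch below applies Ramer-Douglas-Peucker simplification. This is a stand-in technique chosen for illustration only: the abstract does not specify NoTVLA's actual temporal compression or pruning procedure, and the names `sparsify` and `tolerance` are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from 3D point p to the line segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    length_sq = sum(c * c for c in ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    if length_sq == 0.0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / length_sq))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.sqrt(sum((p[i] - closest[i]) ** 2 for i in range(3)))


def sparsify(trajectory, tolerance):
    """Hypothetical helper: Ramer-Douglas-Peucker simplification of a waypoint list."""
    if len(trajectory) < 3:
        return list(trajectory)
    first, last = trajectory[0], trajectory[-1]
    distances = [point_segment_distance(p, first, last) for p in trajectory[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        return [first, last]  # interior points are close enough to a straight move
    split = worst + 1  # index of the farthest point within `trajectory`
    left = sparsify(trajectory[: split + 1], tolerance)
    right = sparsify(trajectory[split:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


# Dense L-shaped end-effector path: 0.5 m along x, then 0.5 m along y.
dense = [(i / 100, 0.0, 0.0) for i in range(51)]
dense += [(0.5, j / 100, 0.0) for j in range(1, 51)]
sparse = sparsify(dense, tolerance=0.005)
print(len(dense), "->", len(sparse), "waypoints")
```

With this toy path, the 101 dense samples collapse to the two endpoints and the corner, which is the kind of compact target a sparse-trajectory policy would be trained to predict.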
[232] CLEAR-IR: Clarity-Enhanced Active Reconstruction of Infrared Imagery
Nathan Shankar,Pawel Ladosz,Hujun Yin
Main category: cs.RO
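TL;DR: This paper proposes CLEAR-IR, a U-Net-based architecture that reconstructs clean images from infrared images dominated by active emitter patterns, improving image quality and downstream robotic task performance.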
Details
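Motivation: In low light, infrared images are less susceptible to noise than RGB images, but active emitter patterns interfere with them and limit their use in high-level tasks such as object detection and tracking. This paper addresses that problem.
Contribution: 1. Proposes a U-Net architecture that removes active emitter patterns from infrared images; 2. Improves image quality and the robustness of downstream robotic tasks; 3. Validates the method across illumination conditions from well-lit to extreme low-light.
Method: A U-Net architecture reconstructs clean images from emitter-populated infrared input. Adversarial training and optimization improve reconstruction quality and downstream task performance.
Result: CLEAR-IR outperforms existing enhancement techniques on infrared image reconstruction and makes robotic vision systems markedly more reliable across lighting conditions.
Insight: Denoising and reconstructing infrared images not only improves image quality but also significantly improves downstream robotic task performance, especially in low light.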
Abstract: This paper presents a novel approach for enabling robust robotic perception in dark environments using an infrared (IR) stream. The IR stream is less susceptible to noise than RGB in low-light conditions. However, it is dominated by active emitter patterns that hinder high-level tasks such as object detection, tracking and localisation. To address this, a U-Net-based architecture is proposed that reconstructs clean IR images from emitter-populated input, improving both image quality and downstream robotic performance. This approach outperforms existing enhancement techniques and enables reliable operation of vision-driven robotic systems across illumination conditions from well-lit to extreme low-light scenes.
[233] StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
Mingyu Liu,Jiuhe Shu,Hui Chen,Zeju Li,Canyu Zhao,Jiange Yang,Shenyuan Gao,Hao Chen,Chunhua Shen
Main category: cs.RO
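TL;DR: StaMo is an unsupervised method that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer decoder. The representation is efficient and interpretable, and it integrates seamlessly into existing VLA models, substantially improving task performance.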
Details
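Motivation: A core challenge in robot intelligence is developing state representations that are efficient and compact, avoiding redundancy without losing task-critical information. StaMo addresses this with unsupervised learning.
Contribution: 1. Proposes an efficient, interpretable two-token state representation; 2. Shows that the difference between tokens can serve as a latent robot action; 3. Performs well across diverse data sources (robot data, simulation, human video).
Method: Learns a compressed state representation with a lightweight encoder and a pre-trained Diffusion Transformer decoder, and extracts latent actions via latent interpolation.
Result: Improves performance by 14.3% on LIBERO and real-world task success by 30%; in policy co-training, it outperforms prior methods by 10.4%.
Insight: StaMo shows that unsupervised learning can capture structured dynamics, challenging the prevailing reliance on complex architectures and video data.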
Abstract: A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence of latent action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
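The idea of a token difference acting as a latent action can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes the difference is taken between the two-token representations of a current and a later state, uses hand-written 4-dimensional tokens in place of encoder outputs, and the helper names `latent_action` and `apply_action` are hypothetical.

```python
def latent_action(state_now, state_next):
    """Token-wise difference between two compact states (each a list of token vectors)."""
    return [[b - a for a, b in zip(tok_now, tok_next)]
            for tok_now, tok_next in zip(state_now, state_next)]


def apply_action(state, action, alpha=1.0):
    """Move `state` along `action`; alpha in [0, 1] interpolates linearly toward the next state."""
    return [[s + alpha * d for s, d in zip(tok, delta)]
            for tok, delta in zip(state, action)]


# Toy two-token states with 4-dimensional tokens (assumed values, not encoder outputs).
state_t = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
state_t1 = [[0.5, 1.0, 2.5, 3.0], [1.0, 2.0, 1.0, 0.0]]

action = latent_action(state_t, state_t1)
halfway = apply_action(state_t, action, alpha=0.5)
print("latent action:", action)
print("interpolated state:", halfway)
```

In the paper's setting the interpolated tokens would be passed to the DiT decoder or to an action head; here they are only printed.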
cs.GR
[234] Creative synthesis of kinematic mechanisms
Jiong Lin,Jialong Ning,Judah Goldfeder,Hod Lipson
Main category: cs.GR
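TL;DR: This paper synthesizes planar linkages with an image generative model (a VAE), encoding motion curves and velocity profiles as images to produce novel kinematic designs.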
Details
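Motivation: Traditional kinematic synthesis often requires complex mathematical modeling and optimization and lacks intuitiveness and generality. This paper explores image generation as a more intuitive, general approach to mechanism design.
Contribution: 1) Proposes an image-based method for synthesizing planar linkages; 2) Designs a shared-latent variational autoencoder (VAE) that jointly encodes motion curves and velocity profiles; 3) Validates the method on both simple and complex mechanisms.
Method: Linkage trajectories and velocity profiles are represented as RGB images and modeled with a shared-latent VAE, which synthesizes new combinations of motion curve and velocity through image generation. Experiments cover three linkage datasets of increasing complexity.
Result: Preliminary results on the three datasets show that the method can synthesize novel planar linkages with revolute and prismatic joints, and may extend to cam and gear design.
Insight: Image generative models can serve as a unified framework for several types of mechanical design, opening a new direction for generative mechanical design.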
Abstract: In this paper, we formulate the problem of kinematic synthesis for planar linkages as a cross-domain image generation task. We develop a planar linkages dataset using RGB image representations, covering a range of mechanisms: from simple types such as crank-rocker and crank-slider to more complex eight-bar linkages like Jansen’s mechanism. A shared-latent variational autoencoder (VAE) is employed to explore the potential of image generative models for synthesizing unseen motion curves and simulating novel kinematics. By encoding the drawing speed of trajectory points as color gradients, the same architecture also supports kinematic synthesis conditioned on both trajectory shape and velocity profiles. We validate our method on three datasets of increasing complexity: a standard four-bar linkage set, a mixed set of four-bar and crank-slider mechanisms, and a complex set including multi-loop mechanisms. Preliminary results demonstrate the effectiveness of image-based representations for generative mechanical design, showing that mechanisms with revolute and prismatic joints, and potentially cams and gears, can be represented and synthesized within a unified image generation framework.
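To make the "drawing speed as color gradient" encoding concrete, the following Python sketch rasterizes a sampled trajectory into a small RGB grid whose pixel colors depend on local speed. It is a minimal illustration under assumptions: the blue-to-red color map, the grid size, uniform time steps between samples, and the function name `speed_colored_image` are all choices made here, not details taken from the paper.

```python
import math


def speed_colored_image(points, size=32):
    """Rasterize a 2D curve into a size x size RGB grid; color encodes local speed.

    Assumes `points` were sampled at uniform time steps, so the distance between
    consecutive samples is proportional to speed. Slow segments are drawn blue,
    fast segments red. Row 0 corresponds to the smallest y value.
    """
    speeds = [math.hypot(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    lo, hi = min(speeds), max(speeds)
    span = (hi - lo) or 1.0  # avoid division by zero for constant speed
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    extent = max(max(xs) - min_x, max(ys) - min_y) or 1.0  # keep aspect ratio
    image = [[(0, 0, 0)] * size for _ in range(size)]
    for (x, y), v in zip(points, speeds):
        col = int((x - min_x) / extent * (size - 1))
        row = int((y - min_y) / extent * (size - 1))
        t = (v - lo) / span  # normalized speed in [0, 1]
        image[row][col] = (round(255 * t), 0, round(255 * (1 - t)))
    return image


# Toy trajectory: an ellipse traversed at uniform angular rate, so speed varies.
curve = [(2.0 * math.cos(2 * math.pi * k / 200), math.sin(2 * math.pi * k / 200))
         for k in range(201)]
image = speed_colored_image(curve, size=16)
drawn = sum(1 for row in image for px in row if px != (0, 0, 0))
print("pixels drawn:", drawn)
```

A real coupler curve from a four-bar or crank-slider simulation could be passed in place of the ellipse; the resulting image carries both trajectory shape and velocity profile in a single array.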
[235] Universal Beta Splatting
Rong Liu,Zhongpai Gao,Benjamin Planche,Meida Chen,Van Nguyen Nguyen,Meng Zheng,Anwesa Choudhuri,Terrence Chen,Yue Wang,Andrew Feng,Ziyan Wu
Main category: cs.GR
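TL;DR: The paper proposes Universal Beta Splatting (UBS), a unified framework that generalizes 3D Gaussian Splatting to N-dimensional anisotropic Beta kernels for explicit radiance field rendering. Beta kernels model controllable dependencies across spatial, angular, and temporal dimensions without auxiliary networks or specific color encodings. UBS remains backward compatible, outperforms existing methods, and renders in real time.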
Details
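Motivation: Conventional 3D Gaussian Splatting uses fixed Gaussian primitives, which limits its ability to model complex light transport, view-dependent appearance, and scene dynamics. UBS uses the flexibility and controllability of Beta kernels to address these limits and enable more general radiance field rendering.
Contribution: 1) Proposes the UBS framework, generalizing Gaussian Splatting to N-dimensional Beta kernels; 2) Beta kernels model dependencies across spatial, angular, and temporal dimensions in a single representation; 3) Decomposes scene properties into interpretable components without additional supervision; 4) Achieves real-time rendering while outperforming existing methods.
Method: UBS uses anisotropic Beta kernels as primitives, supporting dependency modeling across spatial, angular, and temporal dimensions. A CUDA-accelerated implementation delivers efficient real-time rendering while retaining compatibility with Gaussian Splatting.
Result: Experiments show that UBS outperforms existing methods on static, view-dependent, and dynamic benchmarks while rendering in real time, demonstrating the generality and scalability of Beta kernels for radiance field rendering.
Insight: As a universal primitive, the Beta kernel naturally decomposes the multi-dimensional properties of a scene, bringing new flexibility and interpretability to radiance field rendering.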
Abstract: We introduce Universal Beta Splatting (UBS), a unified framework that generalizes 3D Gaussian Splatting to N-dimensional anisotropic Beta kernels for explicit radiance field rendering. Unlike fixed Gaussian primitives, Beta kernels enable controllable dependency modeling across spatial, angular, and temporal dimensions within a single representation. Our unified approach captures complex light transport effects, handles anisotropic view-dependent appearance, and models scene dynamics without requiring auxiliary networks or specific color encodings. UBS maintains backward compatibility by approximating Gaussian Splatting as a special case, guaranteeing plug-in usability and a lower bound on performance. The learned Beta parameters naturally decompose scene properties into interpretable components without explicit supervision: spatial (surface vs. texture), angular (diffuse vs. specular), and temporal (static vs. dynamic). Our CUDA-accelerated implementation achieves real-time rendering while consistently outperforming existing methods across static, view-dependent, and dynamic benchmarks, establishing Beta kernels as a scalable universal primitive for radiance field rendering. Our project website is available at https://rongliu-leo.github.io/universal-beta-splatting/.
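The abstract does not give the kernel formula, so the Python sketch below should not be read as the UBS kernel. It only illustrates, with a textbook limit, how a bounded kernel with a shape parameter can contain the Gaussian as a limiting case: the assumed form $(1 - r^2 / (2\beta))^{\beta}$ tends to $\exp(-r^2/2)$ as $\beta$ grows. The function names and the parameterization are hypothetical.

```python
import math


def beta_like_kernel(r, beta):
    """Assumed illustrative kernel: (1 - r^2 / (2*beta))^beta with compact support.

    Returns 0 outside the support r^2 >= 2*beta. This is NOT the formula from the
    UBS paper; it is a stand-in that shares the 'Gaussian as a limit' property.
    """
    base = 1.0 - (r * r) / (2.0 * beta)
    return base ** beta if base > 0.0 else 0.0


def gaussian_kernel(r):
    """Unnormalized Gaussian falloff exp(-r^2 / 2)."""
    return math.exp(-0.5 * r * r)


# Small beta gives a flatter, compactly supported profile; large beta approaches the Gaussian.
for beta in (1.0, 4.0, 100.0, 1e6):
    gap = abs(beta_like_kernel(1.0, beta) - gaussian_kernel(1.0))
    print(f"beta={beta:g}  |kernel - gaussian| at r=1: {gap:.2e}")
```

The shrinking gap as beta increases mirrors the backward-compatibility argument: a renderer built on such a kernel can reproduce Gaussian behavior by pushing the shape parameter toward that limit.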
[236] Joint Neural SDF Reconstruction and Semantic Segmentation for CAD Models
Shen Fan,Przemyslaw Musialski
Main category: cs.GR
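TL;DR: This paper proposes a data-efficient method for joint neural SDF reconstruction and semantic segmentation of CAD models that produces coherent labels in a single pass for meshes with any number of parts.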
Details
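Motivation: Existing CAD reconstruction and segmentation methods typically depend on fixed taxonomies and struggle with varying part counts. This paper addresses the problem by combining neural SDFs with semantic segmentation.
Contribution: 1) Proposes a lightweight segmentation head that labels parts accurately without affecting reconstruction quality; 2) Introduces a new Segmentation Consistency metric; 3) The method is insensitive to part count and applies to diverse CAD models.
Method: Combines neural SDF reconstruction with a segmentation head trained under PartField-generated supervision, so that both are trained jointly. The head is lightweight by design, so reconstruction and segmentation performance are not compromised.
Result: Experiments show strong performance on reconstruction (CDL1/CDL2, F1-micro, NC) and segmentation (mIoU, Accuracy) metrics, and notably on the newly introduced Segmentation Consistency metric.
Insight: Even when reconstruction quality degrades on thin structures or intricate geometry, segmentation stays accurate and label-coherent, which shows the method's robustness. Boundary precision is a direction for future improvement.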
Abstract: We propose a simple, data-efficient pipeline that augments a neural SDF-based implicit reconstruction network for CAD parts with a part-segmentation head trained under PartField-generated supervision. Unlike methods tied to fixed taxonomies, our model accepts meshes with any number of parts and produces coherent, geometry-aligned labels in a single pass. We evaluate on randomly sampled CAD meshes from the ABC dataset with intentionally varied part cardinalities, including over-segmented shapes, and report strong performance across reconstruction (CDL1/CDL2, F1-micro, NC) and segmentation (mIoU, Accuracy), together with a new Segmentation Consistency metric that captures local label smoothness. We attach a lightweight segmentation head to the Flat-CAD SDF trunk; on a paired evaluation it does not alter reconstruction while providing accurate part labels for meshes with any number of parts. Even under degraded reconstructions on thin or intricate geometries, segmentation remains accurate and label-coherent, often preserving the correct part count. Our approach therefore offers a practical route to semantically structured CAD meshes without requiring curated taxonomies or exact palette matches. We discuss limitations in boundary precision, partly due to per-face supervision, and outline paths toward boundary-aware training and higher resolution labels.
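For readers unfamiliar with the segmentation metrics named above, here is a minimal Python sketch of per-face mIoU together with a simple local label-smoothness score. The mIoU follows the standard definition. The smoothness score (fraction of adjacent face pairs that share a label) is only an assumed stand-in, since the abstract does not define the paper's Segmentation Consistency metric; `neighbor_agreement` and the adjacency format are hypothetical.

```python
def mean_iou(pred, truth):
    """Mean intersection-over-union across all labels present in pred or truth."""
    labels = set(pred) | set(truth)
    ious = []
    for label in labels:
        inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
        union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
        ious.append(inter / union)  # union > 0 because the label occurs somewhere
    return sum(ious) / len(ious)


def neighbor_agreement(labels, adjacency):
    """Assumed smoothness proxy: share of adjacent face pairs with identical labels.

    `adjacency` is a list of (face_i, face_j) index pairs for faces sharing an edge.
    """
    if not adjacency:
        return 1.0
    same = sum(1 for i, j in adjacency if labels[i] == labels[j])
    return same / len(adjacency)


# Toy mesh with four faces in a strip: 0-1, 1-2, 2-3 share edges.
predicted = [0, 0, 1, 1]
reference = [0, 1, 1, 1]
strip = [(0, 1), (1, 2), (2, 3)]
print("mIoU:", mean_iou(predicted, reference))
print("neighbor agreement:", neighbor_agreement(predicted, strip))
```

Because both functions work on per-face label lists, they are independent of the number of parts, which matches the setting where part cardinality varies from mesh to mesh.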
[237] 3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG
Shun-ichiro Hayashi,Daichi Mukunoki,Tetsuya Hoshino,Satoshi Ohshima,Takahiro Katagiri
Main category: cs.GR
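TL;DR: 3Dify is a procedural 3D-CG generation framework that uses LLMs to generate 3D content from natural language instructions. It combines MCP and RAG and can automate GUI operations with the CUA method. Users can improve generation quality by selecting preferred results as feedback, and can deploy LLMs locally to cut costs.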
Details
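Motivation: Existing 3D-CG tools usually demand complex operating skills, which limits their use by non-experts. 3Dify aims to simplify the process through natural language interaction and lower the technical barrier.
Contribution: 1) Proposes the 3Dify framework, which uses LLMs for natural-language-driven 3D-CG generation. 2) Combines MCP and RAG to automate the operation of DCC tools. 3) Introduces the CUA method to automate GUI operations. 4) Supports quality improvement through user feedback. 5) Offers local LLM deployment to reduce cost.
Method: 1) Automates DCC tool operation via MCP. 2) For tools that do not support MCP, automates GUI operations with the CUA method. 3) Users give feedback by selecting preferred images, from which the LLM learns to improve later generations. 4) Supports locally deployed LLMs to reduce API call costs.
Result: 3Dify achieves its goal of generating 3D-CG content from natural language instructions and improves its output through user feedback and quality optimization.
Insight: 1) Natural-language-driven 3D-CG generation can substantially lower the technical barrier. 2) Combining MCP with CUA broadens tool compatibility. 3) User feedback effectively improves generation quality.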
Abstract: This paper proposes “3Dify,” a procedural 3D computer graphics (3D-CG) generation framework utilizing Large Language Models (LLMs). The framework enables users to generate 3D-CG content solely through natural language instructions. 3Dify is built upon Dify, an open-source platform for AI application development, and incorporates several state-of-the-art LLM-related technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG). For 3D-CG generation support, 3Dify automates the operation of various Digital Content Creation (DCC) tools via MCP. When DCC tools do not support MCP-based interaction, the framework employs the Computer-Using Agent (CUA) method to automate Graphical User Interface (GUI) operations. Moreover, to enhance image generation quality, 3Dify allows users to provide feedback by selecting preferred images from multiple candidates. The LLM then learns variable patterns from these selections and applies them to subsequent generations. Furthermore, 3Dify supports the integration of locally deployed LLMs, enabling users to utilize custom-developed models and to reduce both time and monetary costs associated with external API calls by leveraging their own computational resources.
[238] Bridging Text and Video Generation: A Survey
Nilay Kumar,Priyansh Bhandari,G. Maragatham
Main category: cs.GR
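TL;DR: This survey of text-to-video (T2V) generation traces the field from early generative adversarial networks (GANs) and variational autoencoders (VAEs) to hybrid Diffusion-Transformer (DiT) architectures, analyzes technical challenges, datasets, training configurations, and evaluation metrics, and proposes future research directions.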
Details
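Motivation: Text-to-video generation has broad potential in education, marketing, entertainment, and assistive technologies, but still faces challenges in alignment, long-range coherence, and computational efficiency. This paper surveys progress in the field to guide future research.
Contribution: 1. Systematically reviews the development and technical evolution of T2V generative models; 2. Catalogs the relevant datasets, training configurations, and evaluation metrics; 3. Identifies current technical challenges and future research directions.
Method: A literature survey that analyzes the progression from GANs and VAEs to DiT architectures, summarizes each model's strengths and weaknesses and the reasons for architectural shifts, and organizes training configurations and evaluation metrics.
Result: The survey reports the performance and technical limitations of T2V models, points out the shortcomings of current evaluation metrics, and argues for more holistic, perception-aligned evaluation strategies.
Insight: 1. Progress in T2V has depended on new architectures such as DiT; 2. Long-range coherence and computational efficiency remain the main challenges; 3. Future directions include more efficient model training and more comprehensive evaluation.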
Abstract: Text-to-video (T2V) generation technology holds potential to transform multiple domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating coherent visual content from natural language prompts. From its inception, the field has advanced from adversarial models to diffusion-based models, yielding higher-fidelity, temporally consistent outputs. Yet challenges persist, such as alignment, long-range coherence, and computational efficiency. Addressing this evolving landscape, we present a comprehensive survey of text-to-video generative models, tracing their development from early GANs and VAEs to hybrid Diffusion-Transformer (DiT) architectures, detailing how these models work, what limitations they addressed in their predecessors, and why shifts toward new architectural paradigms were necessary to overcome challenges in quality, coherence, and control. We provide a systematic account of the datasets on which the surveyed text-to-video models were trained and evaluated, and, to support reproducibility and assess the accessibility of training such models, we detail their training configurations, including their hardware specifications, GPU counts, batch sizes, learning rates, optimizers, epochs, and other key hyperparameters. Further, we outline the evaluation metrics commonly used for such models and present their performance across standard benchmarks, while also discussing the limitations of these metrics and the emerging shift toward more holistic, perception-aligned evaluation strategies. Finally, drawing from our analysis, we outline the current open challenges and propose a few promising future directions, laying out a perspective for future researchers to explore and build upon in advancing T2V research and applications.
[239] Pulp Motion: Framing-aware multimodal camera and human motion generation
Robin Courant,Xi Wang,David Loiseaux,Marc Christie,Vicky Kalogeiton
Main category: cs.GR
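TL;DR: This paper proposes a new method for jointly generating human motion and camera trajectories that emphasizes their interplay in screen space, achieving consistency through a shared latent space and auxiliary sampling.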
Details
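Motivation: Conventional methods generate human motion and camera trajectories separately, overlooking a core principle of cinematography: the interplay between actor and camera.
Contribution: First to cast the task as text-conditioned joint generation; develops a model-agnostic framework that enforces multimodal coherence; introduces the PulpMotion dataset.
Method: Designs a joint autoencoder that learns a shared latent space, with a linear transform mapping the human and camera latents to a framing latent; auxiliary sampling then uses this transform to steer the generation process.
Result: Experiments show that the method generates coordinated human-camera motion on both DiT- and MAR-based architectures and improves textual alignment.
Insight: On-screen framing is a natural bridge between human motion and camera trajectories, and auxiliary sampling can significantly improve the coherence of the generated results.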
Abstract: Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains on textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting the new state of the art for this task. Code, models and data are available on our project page: https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/.
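The on-screen framing described above comes from projecting human joints onto the camera. A minimal pinhole-projection sketch in Python shows the idea; it assumes a world-to-camera rotation and translation, a camera looking along its +z axis, and a unit focal length. The function name `project_joints` and these conventions are assumptions for illustration, not the paper's implementation.

```python
def project_joints(joints_world, rotation, translation, focal=1.0):
    """Project 3D joints to 2D image-plane coordinates with a pinhole camera.

    rotation: 3x3 world-to-camera matrix given as three rows.
    translation: 3-vector added after rotation.
    Joints at or behind the camera plane (z <= 0) are returned as None.
    """
    framing = []
    for joint in joints_world:
        cam = [sum(rotation[r][k] * joint[k] for k in range(3)) + translation[r]
               for r in range(3)]
        if cam[2] <= 0.0:
            framing.append(None)  # not visible: behind the camera
            continue
        framing.append((focal * cam[0] / cam[2], focal * cam[1] / cam[2]))
    return framing


# Toy example: identity rotation, camera placed 4 units back along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
joints = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, -5.0)]
print(project_joints(joints, identity, (0.0, 0.0, 4.0)))
```

Running the projection once per frame yields a 2D track for every joint, and that track is the auxiliary framing modality both the human motion and the camera trajectory must agree on.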
cond-mat.mtrl-sci
[240] AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials
Taoyuze Lv,Alexander Chen,Fengyu Xie,Chu Wu,Jeffrey Meng,Dongzhan Zhou,Bram Hoex,Zhicheng Zhong,Tong Xie
Main category: cond-mat.mtrl-sci
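TL;DR: AtomWorld is a benchmark for evaluating the spatial reasoning of large language models (LLMs) on crystalline materials; it reveals the limits of current models in structural understanding and spatial reasoning.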
Details
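Motivation: LLMs excel at textual reasoning and are beginning to develop spatial understanding, raising the question of whether these abilities can be combined for complex domain-specific tasks such as understanding 3D atomic structures in materials science. Until now, no standardized benchmark has systematically evaluated these abilities.
Contribution: Introduces the AtomWorld benchmark, with tasks built on Crystallographic Information Files (CIFs), to evaluate LLMs on structural editing, CIF perception, and property-guided modeling, laying groundwork for future model improvements.
Method: Defines several tasks (structure modification, CIF format understanding, and property-guided modeling) and uses them as a systematic CIF-based benchmark of LLM spatial reasoning.
Result: Experiments show that current models make frequent errors in structural understanding and spatial reasoning, performing poorly on structure modification tasks and even on basic CIF format understanding, which can cause cumulative errors in subsequent analysis and materials research.
Insight: AtomWorld exposes the limits of LLMs in atomic-scale modeling and underscores the need to strengthen structural understanding and spatial reasoning in order to accelerate materials research.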
Abstract: Large Language Models (LLMs) excel at textual reasoning and are beginning to develop spatial understanding, prompting the question of whether these abilities can be combined for complex, domain-specific tasks. This question is essential in fields like materials science, where deep understanding of 3D atomic structures is fundamental. While initial studies have successfully applied LLMs to tasks involving pure crystal generation or coordinate understanding, a standardized benchmark to systematically evaluate their core reasoning abilities across diverse atomic structures has been notably absent. To address this gap, we introduce the AtomWorld benchmark to evaluate LLMs on tasks based on Crystallographic Information Files (CIFs), a standard structure representation format. These tasks, including structural editing, CIF perception, and property-guided modeling, reveal a critical limitation: current models, despite establishing promising baselines, consistently fail in structural understanding and spatial reasoning. Our experiments show that these models make frequent errors on structure modification tasks, and even on basic CIF format understanding, potentially leading to cumulative errors in subsequent analysis and materials insights. By defining these standardized tasks, AtomWorld lays the groundwork for advancing LLMs toward robust atomic-scale modeling, crucial for accelerating materials research and automating scientific workflows.
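To show the kind of spatial bookkeeping a CIF-based editing task involves, the Python sketch below converts fractional coordinates to Cartesian ones from the six cell parameters and applies a simple atom move. The conversion uses the standard convention with lattice vector a along x and b in the xy-plane. The `move_atom` edit and the toy cells are hypothetical examples, not tasks taken from the benchmark.

```python
import math


def lattice_vectors(a, b, c, alpha, beta, gamma):
    """Cartesian lattice vectors from cell lengths and angles (degrees).

    Convention: a lies along x, b lies in the xy-plane.
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    cy = (ca - cb * cg) / sg
    cz = math.sqrt(max(0.0, 1.0 - cb * cb - cy * cy))
    return [(a, 0.0, 0.0), (b * cg, b * sg, 0.0), (c * cb, c * cy, c * cz)]


def frac_to_cart(frac, vectors):
    """Cartesian position = sum of fractional coordinates times lattice vectors."""
    return tuple(sum(frac[i] * vectors[i][k] for i in range(3)) for k in range(3))


def move_atom(frac, shift):
    """Hypothetical edit: translate by a fractional shift and wrap into the unit cell."""
    return tuple((f + s) % 1.0 for f, s in zip(frac, shift))


# Toy cubic cell with a 4 angstrom edge; the body-centre site sits at (2, 2, 2).
cubic = lattice_vectors(4.0, 4.0, 4.0, 90.0, 90.0, 90.0)
print(frac_to_cart((0.5, 0.5, 0.5), cubic))
print(move_atom((0.9, 0.5, 0.0), (0.2, 0.0, 0.0)))
```

A benchmark answer could then be checked by comparing the edited fractional coordinates, or their Cartesian images, against a reference structure within a tolerance.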