cs.CL

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

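Ashish Kattamuri, Ishita Prasad, Meetu Malhotra, Arpita Vats, Rahul Raja, Albert Lie

Main category: cs.CL

TL;DR: This paper proposes a framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to improve task efficiency and semantic accuracy in multilingual Text-to-SQL systems.

Motivation: Existing Text-to-SQL methods focus only on whether queries execute and overlook semantic alignment; execution accuracy also drops by an average of 6 percentage points in non-English languages.

Contribution: 1) Proposes a GRPO framework with a contrastive reward signal to strengthen semantic accuracy; 2) shows significant gains on the MultiSpider dataset; 3) shows that a smaller LLaMA-3-3B model trained with the method beats a larger zero-shot 8B model on execution accuracy.

Method: Combines GRPO with a multilingual contrastive reward signal and optimizes the model with reinforcement learning to improve both semantic similarity and execution accuracy.

Result: GRPO fine-tuning raises execution accuracy to as much as 87.4% (+26 pp over zero-shot) and semantic accuracy to as much as 52.29% (+32.86 pp); adding the contrastive reward raises average semantic accuracy to 59.14% (+6.85 pp). The 3B model exceeds the zero-shot 8B model on execution accuracy (88.86% vs. 81.43%) but still trails it on semantic accuracy (59.14% vs. 68.57%).

Insight: A contrastive reward signal improves semantic alignment, and multilingual Text-to-SQL performance can improve substantially without large-scale training data (3,000 reinforcement learning examples here).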

Abstract: Current Text-to-SQL methods focus on, and are evaluated only on, executable queries, overlooking the semantic alignment challenge, both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments show that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and comes within 9.43 pp of its semantic accuracy (59.14 percent vs. 68.57 percent), all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
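
As a rough illustration of the group-relative idea behind GRPO, the sketch below scores a group of sampled SQL candidates for one question and normalizes each reward against the group mean and standard deviation. The reward shape (an execution term plus a weighted semantic-similarity term), the 0.5 weight, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev


def combined_reward(executes_correctly: bool, semantic_similarity: float,
                    weight: float = 0.5) -> float:
    """Hypothetical reward: execution term plus weighted semantic similarity."""
    return (1.0 if executes_correctly else 0.0) + weight * semantic_similarity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (mean and std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled SQL candidates for one question: (executes correctly, similarity).
group = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [combined_reward(ok, sim) for ok, sim in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates that score above their group's mean receive positive advantages and are reinforced; those below the mean are pushed down, without needing a separate value model.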

[2] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

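Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han

Main category: cs.CL

TL;DR: STELA is a new LLM watermarking framework that adjusts watermark strength to the linguistic degrees of freedom of the context, addressing the trade-off between detectability and text quality without relying on model internals.

Motivation: As large language models (LLMs) advance rapidly, reliable watermarking is key to a trustworthy AI ecosystem. Existing methods rely on signals from the model's output distribution, so they cannot be publicly verified.

Contribution: The STELA framework modulates watermark strength according to linguistic degrees of freedom, improving detection robustness while enabling public verification.

Method: Models linguistic indeterminacy with part-of-speech (POS) n-grams and modulates the watermark signal accordingly: weaker in grammatically constrained contexts to preserve quality, stronger in flexible contexts to improve detectability.

Result: Experiments show that STELA outperforms prior methods in detection robustness on English, Chinese, and Korean.

Insight: Linguistic knowledge can be used to balance text quality against detection performance dynamically while supporting public verification.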

Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.

[3] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

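Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang, Xiaoyu Shen

Main category: cs.CL

TL;DR: This paper proposes informed routing, which uses a Lightweight Feature Forecaster (LFF) to estimate a unit's output before the routing decision is made, reducing computation while preserving model performance.

Motivation: Real-world use of large language models (LLMs) is limited by high inference cost, and existing dynamic routing methods rely on greedy strategies that cause information loss and suboptimal token selection.

Contribution: Proposes a new routing paradigm that weighs both a token's immediate importance and its recoverability, and designs the lightweight LFF predictor to enable a flexible routing policy.

Method: The LFF module estimates a unit's output before routing and supports an execute-or-approximate policy, cutting computation while preserving model fidelity.

Result: Achieves state-of-the-art efficiency-performance trade-offs on language modeling and reasoning tasks across multiple sparsity levels. Even without final LoRA fine-tuning, it matches or surpasses strong baselines that require full fine-tuning, while cutting training time by more than 50%.

Insight: Assessing a token's recoverability in addition to its importance works better than importance alone, and a lightweight predictor can markedly improve the quality and efficiency of routing decisions.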

Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing, a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token’s immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit’s output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing

[4] Revisiting the UID Hypothesis in LLM Reasoning Traces

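Minju Gwak, Guijin Son, Jaehyung Kim

Main category: cs.CL

TL;DR: The paper introduces entropy-based metrics to analyze information flow in LLM reasoning traces and finds that successful reasoning has non-uniform information density, in sharp contrast to human communication patterns. The result challenges assumptions about machine reasoning and suggests directions for interpretable, adaptive reasoning models.

Motivation: The intermediate steps that large language models (LLMs) produce in Chain-of-Thought (CoT) reasoning are often unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis from psycholinguistics, the study analyzes information flow to understand LLM reasoning patterns.

Contribution: Introduces entropy-based metrics that quantify information flow in reasoning traces and finds that successful reasoning is globally non-uniform, unlike the human UID pattern, offering a new perspective on LLM reasoning.

Method: On three mathematical benchmarks, measures information density with entropy, characterizes the information flow of correct LLM reasoning, and contrasts it with human communication patterns.

Result: Correct LLM reasoning shows uneven swings in information density, contrary to what the UID hypothesis describes for human communication.

Insight: Information flow in machine reasoning may differ from that in human communication, which offers a basis for designing more interpretable and adaptive reasoning models.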

Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics – which posits that humans communicate by maintaining a stable flow of information – we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
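
To make "entropy-based" concrete, here is a minimal sketch that computes the Shannon entropy of each step's next-token distribution and uses the variance of those entropies as a simple non-uniformity score. The abstract does not define its metrics, so the variance score, the toy distributions, and the function names are assumptions for illustration only.

```python
import math
from statistics import pvariance


def shannon_entropy(probs):
    """Shannon entropy in bits of one probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_profile(step_distributions):
    """Entropy of each step's distribution along a reasoning trace."""
    return [shannon_entropy(dist) for dist in step_distributions]


def non_uniformity(step_distributions):
    """Assumed score: variance of per-step entropy (0 means perfectly uniform)."""
    return pvariance(entropy_profile(step_distributions))


# Toy traces: one with constant uncertainty, one alternating sure and unsure steps.
uniform_trace = [[0.5, 0.5]] * 4
spiky_trace = [[0.99, 0.01], [0.25] * 4, [0.99, 0.01], [0.25] * 4]

print(non_uniformity(uniform_trace), non_uniformity(spiky_trace))
```

Under this toy score, a trace that follows UID would sit near zero, while the uneven swings the abstract describes for correct solutions would show up as a larger value.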

[5] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

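A H M Rezaul Karim, Ozlem Uzuner

Main category: cs.CL

TL;DR: The MasonNLP system uses a retrieval-augmented generation (RAG) framework with textual and visual exemplars to improve medical visual question answering (MedVQA), ranking 3rd in the MEDIQA-WV 2025 shared task.

Motivation: Address the challenge of generating free-text responses and structured wound attributes in wound-care VQA, in support of clinical decision-making and patient care.

Contribution: Proposes a lightweight RAG framework built on a general-domain large language model (LLM) that provides a simple, effective baseline for multimodal clinical NLP without extra training or complex re-ranking.

Method: Uses RAG to supply in-domain textual and visual exemplars to a general-domain, instruction-tuned LLM, improving response quality through simple indexing and fusion.

Result: Ranked 3rd among 19 teams and 51 submissions in MEDIQA-WV 2025 with an average score of 41.37%, with improvements across dBLEU, ROUGE, BERTScore, and LLM-based metrics.

Insight: Lightweight RAG with a general-purpose LLM is a simple and effective approach to multimodal clinical NLP and may be especially suitable for resource-constrained settings.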

Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs – a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking – provides a simple and effective baseline for multimodal clinical NLP tasks.

[6] Unlocking the Potential of Diffusion Language Models through Template Infilling

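Junhoo Lee, Seungyeon Kim, Nojun Kwak

Main category: cs.CL

TL;DR: The paper proposes Template Infilling (TI), a conditioning method designed for Diffusion Language Models (DLMs), together with Dynamic Segment Allocation (DSA) for greater flexibility, and reports substantial gains on mathematical reasoning and code generation.

Motivation: Inference strategies for DLMs are still limited to the prefix-based prompting inherited from autoregressive models, which constrains their potential in generation tasks.

Contribution: Proposes TI and DSA as a generation strategy tailored to DLMs.

Method: First generates a structural template for the target response, then fills in the masked segments; DSA adaptively adjusts segment lengths based on generation confidence.

Result: Improves performance by 17.01 percentage points over the baseline on mathematical reasoning and code generation benchmarks; in multi-token generation settings it yields effective speedup while maintaining generation quality.

Insight: Combining TI and DSA gives DLM generation stronger structural control and flexibility, advancing the use of diffusion models for language tasks.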

Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs’ generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 percentage points over the baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.

[7] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

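Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano

Main category: cs.CL

TL;DR: The paper presents GateSkip, a simple residual-stream gating mechanism that adds a gate to each Attention/MLP branch to enable token-wise layer skipping in decoder-only language models. It fine-tunes stably, needs no extensive retraining, and saves compute.

Motivation: Existing layer-skipping approaches, such as early exit or router-based Mixture-of-Depths models, are unstable and need extensive retraining, which motivates a more stable route to efficient inference.

Contribution: Proposes GateSkip, a smooth, differentiable residual gating mechanism that supports token-wise layer skipping and stable fine-tuning on top of pretrained models, saving compute while preserving accuracy.

Method: Adds a sigmoid-linear gate to each Attention/MLP branch, ranks tokens by gate value at inference time, and skips low-importance tokens under a per-layer budget.

Result: On long-form reasoning, saves up to 15% compute while retaining over 90% of baseline accuracy; on instruction-tuned models, improves accuracy at full compute and matches baseline quality at close to 50% savings.

Insight: The learned gates reveal properties of transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.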

Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
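
The ranking-and-budget step described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes one scalar gate per token, plain Python lists instead of tensors, and hypothetical names (`gate_scores`, `tokens_to_execute`, `gated_layer`).

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_scores(hidden, weight, bias):
    """One scalar gate per token: sigmoid(w . h + b). A scalar gate is an assumption."""
    return [sigmoid(sum(w * h for w, h in zip(weight, tok)) + bias) for tok in hidden]


def tokens_to_execute(gates, budget):
    """Indices of the top `budget` fraction of tokens, ranked by gate value."""
    keep = max(1, round(budget * len(gates)))
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return sorted(ranked[:keep])


def gated_layer(hidden, branch, weight, bias, budget):
    """Apply `branch` only to selected tokens; skipped tokens keep their residual."""
    gates = gate_scores(hidden, weight, bias)
    active = set(tokens_to_execute(gates, budget))
    out = []
    for i, tok in enumerate(hidden):
        if i in active:
            update = branch(tok)  # stands in for an Attention or MLP branch
            out.append([h + gates[i] * u for h, u in zip(tok, update)])
        else:
            out.append(list(tok))  # skipped: residual stream passes through unchanged
    return out


hidden = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
out = gated_layer(hidden, branch=lambda tok: [1.0, 1.0],
                  weight=[1.0, 0.0], bias=0.0, budget=0.5)
print(out)
```

With a budget of 0.5, only the two tokens with the largest gate values run the branch; the other two are copied through, which is where the compute saving comes from.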

[8] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

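Jimin Lim, Arjun Damerla, Arthur Jiang, Nam Le

Main category: cs.CL

TL;DR: This paper introduces TextBandit, a benchmark for evaluating the probabilistic reasoning of large language models (LLMs) under purely textual feedback. Most of the LLMs tested underperformed classical baselines, but Qwen3-4B outperformed both larger LLMs and the baselines.

Motivation: LLMs are increasingly capable at reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored.

Contribution: Proposes the TextBandit benchmark for evaluating LLM probabilistic reasoning under language-only feedback and shows strong performance from Qwen3-4B on this task.

Method: LLMs interact with multi-armed bandit environments and receive only textual feedback such as "you earned a token", with no numerical cues or explicit probabilities, so the model must infer the latent reward structure. Four open-source LLMs are compared with Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice.

Result: Qwen3-4B achieved a best-arm selection rate of 89.2%, significantly outperforming both larger LLMs and traditional methods.

Insight: Probabilistic reasoning can emerge from language-only feedback, which opens a direction for evaluating decision-making in non-numeric settings.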

Abstract: Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as “you earned a token”, without access to numerical cues or explicit probabilities, so the model must infer latent reward structures purely from linguistic cues and adapt accordingly. We evaluate the performance of four open-source LLMs and compare it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
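
For readers unfamiliar with the baselines, the sketch below runs Thompson Sampling and random choice on a Bernoulli bandit and reports the best-arm selection rate, the metric quoted above. The arm probabilities, step count, and seed are arbitrary assumptions, and these baselines see numeric 0/1 rewards rather than the textual feedback given to the LLMs.

```python
import random


def thompson_sampling(arm_probs, steps, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns the best-arm selection rate."""
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    wins = [1] * n_arms    # Beta(1, 1) prior on each arm
    losses = [1] * n_arms
    best = max(range(n_arms), key=lambda a: arm_probs[a])
    best_pulls = 0
    for _ in range(steps):
        # Sample a plausible success rate for each arm and play the highest.
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if rng.random() < arm_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        best_pulls += int(arm == best)
    return best_pulls / steps


def random_choice(arm_probs, steps, seed=0):
    """Uniform random baseline; returns the best-arm selection rate."""
    rng = random.Random(seed)
    best = max(range(len(arm_probs)), key=lambda a: arm_probs[a])
    return sum(rng.randrange(len(arm_probs)) == best for _ in range(steps)) / steps


arms = [0.2, 0.5, 0.8]  # hidden reward probabilities (illustrative values)
print("Thompson Sampling:", thompson_sampling(arms, steps=2000))
print("Random choice:", random_choice(arms, steps=2000))
```

Random choice hovers around one in three on this three-arm setup, while Thompson Sampling concentrates its pulls on the best arm as evidence accumulates.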

[9] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

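Bolei Ma, Yong Cao, Indira Sen, Anna-Carolina Haensch, Frauke Kreuter, Barbara Plank, Daniel Hershcovich

Main category: cs.CL

TL;DR: This position paper argues for open-ended designs rather than conventional closed formats in social simulation with large language models (LLMs), to improve realism and methodological utility.

Motivation: Existing social-simulation studies usually constrain LLM output to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the generative nature of LLMs and cannot fully capture the diversity and complexity of social phenomena.

Contribution: Argues for the importance of open-ended designs in LLM social simulation and shows how they can improve measurement, support exploration of unanticipated views, and reduce researcher-imposed bias.

Method: Draws on decades of survey-methodology research and recent advances in NLP to lay out the rationale for, and practical value of, open-ended designs.

Result: Open-ended designs better capture expressiveness and individuality and aid pretesting and methodological refinement.

Insight: Calls for new practices and evaluation frameworks that make full use of the generative diversity of LLMs, creating synergies between NLP and social science.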

Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes “in” LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.

[10] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

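Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, Sewon Min

Main category: cs.CL

TL;DR: This paper introduces the ProofBench dataset and the ProofGrader evaluator for fine-grained evaluation of math proofs generated by large language models (LLMs), filling a gap in existing evaluation tools.

Motivation: LLM work on mathematical reasoning has focused on tasks with easily verifiable final answers; a reliable, fine-grained evaluator for natural-language math proofs is missing.

Contribution: 1. Introduces ProofBench, the first expert-annotated dataset of fine-grained proof ratings; 2. develops ProofGrader, an evaluator that significantly outperforms naive baselines; 3. demonstrates ProofGrader's practical utility in a best-of-n selection task.

Method: Systematically explores the evaluator design space (backbone model, input context, instructions, and evaluation workflow) and builds ProofGrader by combining a strong reasoning backbone, reference solutions and marking schemes, and a simple ensembling method.

Result: ProofGrader achieves a Mean Absolute Error (MAE) of 0.926 against expert scores on a 0-7 scale. In best-of-n selection at n = 16, it reaches an average score of 4.14 out of 7, closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62).

Insight: Careful evaluator design, such as adding reference solutions and ensembling, substantially improves evaluation quality and can help advance proof generation.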

Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-n selection task: at n = 16, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
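
The two headline numbers are simple to reproduce. The sketch below shows how an MAE against expert scores is computed and checks the "78% of the gap" figure from the three averages quoted in the abstract; the function names and the three example proof scores are illustrative assumptions, not ProofBench data.

```python
def mean_absolute_error(predicted, expert):
    """Average absolute difference between evaluator scores and expert scores."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)


def gap_closed(score, lower, upper):
    """Fraction of the distance from `lower` to `upper` covered by `score`."""
    return (score - lower) / (upper - lower)


# Made-up scores on the 0-7 scale for three proofs (illustrative only).
print(mean_absolute_error(predicted=[7, 3, 0], expert=[6, 3, 2]))

# Averages quoted in the abstract: naive binary 2.48, ProofGrader 4.14, oracle 4.62.
fraction = gap_closed(4.14, lower=2.48, upper=4.62)
print(f"{fraction:.1%}")
```

The second calculation is (4.14 - 2.48) / (4.62 - 2.48) = 1.66 / 2.14, roughly 0.776, which the abstract rounds to 78%.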

[11] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

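Fali Wang, Jihai Chen, Shuhua Yang, Ali Al-Lawati, Linli Tang, Hui Liu, Suhang Wang

Main category: cs.CL

TL;DR: This survey systematically reviews research on collaboration between small language models (SLMs) and large language models (LLMs). It proposes a taxonomy organized around four objectives (performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness), summarizes representative methods and design paradigms, and discusses future research directions.

Motivation: LLMs perform well across many domains, but high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns constrain their use. SLMs, being compact and efficient, offer complementary remedies. The paper examines the potential and current state of collaboration between the two.

Contribution: 1. Proposes a taxonomy of SLM-LLM collaboration covering four core objectives; 2. summarizes representative collaboration methods and design paradigms; 3. outlines future directions toward efficient, secure, and scalable collaboration.

Method: Conducts a systematic literature review of SLM-LLM collaboration, organizes the work by collaboration objective (performance, cost, privacy, trustworthiness), and analyzes representative techniques and design paradigms.

Result: Summarizes current progress in SLM-LLM collaboration, highlights the strengths and open challenges of collaborative frameworks, and sets out directions for future work.

Insight: SLM-LLM collaboration can combine the strengths of both for efficient, flexible deployment; scalability and security remain open problems.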

Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs’ specialization and efficiency with LLMs’ generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.

[12] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

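Nicole Smith-Vaniz, Harper Lyon, Lorraine Steigner, Ben Armstrong, Nicholas Mattei

Main category: cs.CL

TL;DR: This paper uses Moral Foundations Theory (MFT) to study potential biases of large language models (LLMs) in political and moral domains, examining whether LLM responses show ideological leanings and comparing them directly with human data.

Motivation: As LLMs become part of everyday life, the advice they give in medicine, personal relationships, and legal matters may carry biases, particularly on political and moral questions. The study aims to quantify these biases and compare them with human moral leanings.

Contribution: Directly assesses the moral-foundation leanings of LLM responses and compares them with human data, which the authors state prior work had not done; also examines how accurately LLMs can represent political ideologies and how role-playing affects their responses.

Method: Analyzes LLM responses with the MFT framework against existing human research data under three conditions: inherent (unprompted) responses, explicit ideological prompting, and demographic-based persona role-playing.

Result: LLMs show leanings on some moral foundations, and their responses under some political ideologies correlate with human data; role-playing changes their responses, though its accuracy leaves room for improvement.

Insight: LLMs are not neutral tools, and their responses may carry implicit political and moral leanings; the study offers a way to understand and quantify bias in AI systems.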

Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about how and what responses LLMs make in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLMs with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in LLM responses, nor has any connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

[13] Schema for In-Context Learning

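Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung, Varinia Bernales, Alan Aspuru-Guzik

Main category: cs.CL

TL;DR: The paper proposes SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL), a framework grounded in schema theory from cognitive science that explicitly builds abstracted reasoning schemas to improve the task adaptation of large language models (LLMs), substantially improving reasoning performance.

Motivation: Traditional in-context learning (ICL) lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by schema theory, the authors build abstracted reasoning schemas to mimic human cognition and thereby improve model reasoning.

Contribution: 1. Proposes the SA-ICL framework, which explicitly constructs an abstracted reasoning schema; 2. shows that a broad range of LLMs cannot implicitly form and use schema-based representations but benefit significantly from explicit schema scaffolding; 3. on chemistry and physics questions, SA-ICL boosts performance by up to 36.19 percent.

Method: SA-ICL extracts a representation of the building blocks of cognition from demonstration examples to form an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which then augments the model's reasoning on new questions.

Result: On chemistry and physics questions from the GPQA dataset, SA-ICL boosts performance by up to 36.19 percent when the single demonstration example is of high quality, while reducing reliance on the number of demonstrations and improving interpretability.

Insight: 1. Explicit schema scaffolding is more effective than implicit learning; 2. SA-ICL bridges ICL strategies ranging from pattern priming to Chain-of-Thought prompting; 3. it offers a new path toward human-like reasoning in LLMs.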

Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model’s reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.

[14] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

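Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang, Amel Awadelkarim, Xu Chen, Yubai Yuan, Shawndra Hill

Main category: cs.CL

TL;DR: The paper proposes the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free automatic prompt optimization. It casts the problem as a dueling bandit with pairwise preference feedback from an LLM judge and combines Double Thompson Sampling (D-TS) with Top-Performer Guided Mutation, consistently outperforming baseline methods.

Motivation: Large language models (LLMs) are highly sensitive to their input prompts, but most automatic prompt optimization (APO) approaches rely on labeled data, and high-quality labels are costly and slow to collect in practice.

Contribution: Proposes the PDO framework, which supports label-free optimization using pairwise preference feedback from an LLM judge and combines D-TS with Top-Performer Guided Mutation to expand the candidate prompt pool efficiently.

Method: Formulates the problem as a dueling bandit, uses D-TS to prioritize informative prompt comparisons, and generates new candidates by mutating high-performing prompts.

Result: On BIG-bench Hard (BBH) and MS MARCO, PDO consistently outperforms baseline methods; ablation studies confirm the effectiveness of both D-TS and prompt mutation.

Insight: Label-free optimization is feasible and efficient: pairwise preferences from an LLM judge can drive prompt optimization, and combining sampling with mutation improves performance.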

Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
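
To illustrate the dueling-bandit framing, the sketch below runs a simplified Thompson-style duel selection over a fixed pool of prompts. It is not the full D-TS algorithm (it omits the confidence-bound pruning of candidates) and it does no prompt mutation; the simulated judge with hidden prompt strengths stands in for an LLM judge, and all names and values are assumptions.

```python
import random


def simplified_dueling_ts(n, judge, rounds, seed=0):
    """Pick prompt pairs by sampling pairwise win probabilities from Beta posteriors."""
    rng = random.Random(seed)
    wins = [[0] * n for _ in range(n)]  # wins[i][j]: times prompt i beat prompt j

    def sample_pref(i, j):
        return rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)

    for _ in range(rounds):
        # First prompt: highest Copeland score under one posterior sample.
        theta = [[0.5] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                theta[i][j] = sample_pref(i, j)
                theta[j][i] = 1.0 - theta[i][j]
        copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i) for i in range(n)]
        first = max(range(n), key=lambda i: copeland[i])
        # Second prompt: the rival most likely to beat `first` under fresh samples.
        rivals = [j for j in range(n) if j != first]
        second = max(rivals, key=lambda j: sample_pref(j, first))
        if judge(first, second, rng):
            wins[first][second] += 1
        else:
            wins[second][first] += 1
    return wins


def copeland_winner(wins):
    """Prompt that beats the most rivals in head-to-head majority."""
    n = len(wins)
    score = [sum(wins[i][j] > wins[j][i] for j in range(n) if j != i) for i in range(n)]
    return max(range(n), key=lambda i: score[i])


def simulated_judge(strengths):
    """Stand-in for an LLM judge: Bradley-Terry preference over hidden strengths."""
    def judge(i, j, rng):
        return rng.random() < strengths[i] / (strengths[i] + strengths[j])
    return judge


wins = simplified_dueling_ts(3, simulated_judge([3.0, 1.0, 0.5]), rounds=1000)
print("best prompt index:", copeland_winner(wins))
```

Only pairwise outcomes are observed, never absolute scores, which is what lets this family of methods run without labeled validation data.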

[15] Interpreting the Latent Structure of Operator Precedence in Language Models

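Dharunish Yugeswardeenoo, Harshil Nukala, Cole Blondin, Sean O Brien, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: The paper analyzes the internal representations of the open-source LLaMA 3.2-3B model to show how language models encode operator precedence, and introduces a technique called partial embedding swap that modifies precedence.

Motivation: Large language models (LLMs) struggle with arithmetic tasks, but prior work mostly focuses on outputs or prompting strategies, leaving the model's internal computational structure poorly understood. The paper asks whether LLMs encode operator precedence in their internal representations.

Contribution: 1. Shows that LLaMA 3.2-3B carries intermediate results in its residual stream and encodes operator precedence in operator embeddings; 2. introduces partial embedding swap, a technique for modifying operator precedence.

Method: Builds a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses, and applies interpretability techniques (logit lens, linear classification probes, and UMAP geometric visualization) to the model's residual stream and embeddings.

Result: Intermediate computations are present in the residual stream, particularly after MLP blocks, and operator precedence is linearly encoded in each operator's embeddings after the attention layer.

Insight: The model's encoding of operator precedence is linear and can be altered by exchanging high-impact embedding dimensions between operators, which sheds light on the internal computation of LLMs.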

Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator’s embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
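
The exchange step of partial embedding swap can be pictured with plain vectors, as in the sketch below. How the paper identifies "high-impact" dimensions is not stated in the abstract; the `top_k_differing_dims` helper (largest absolute difference) is a hypothetical stand-in, and the four-dimensional vectors are made up.

```python
def top_k_differing_dims(emb_a, emb_b, k):
    """Hypothetical selection rule: the k dimensions where the embeddings differ most."""
    diffs = sorted(range(len(emb_a)), key=lambda d: abs(emb_a[d] - emb_b[d]), reverse=True)
    return sorted(diffs[:k])


def partial_embedding_swap(emb_a, emb_b, dims):
    """Return copies of both embeddings with the chosen dimensions exchanged."""
    a, b = list(emb_a), list(emb_b)
    for d in dims:
        a[d], b[d] = b[d], a[d]
    return a, b


# Made-up operator embeddings for illustration.
plus = [0.1, 0.9, -0.3, 0.4]
times = [0.2, -0.8, -0.3, 0.7]

dims = top_k_differing_dims(plus, times, k=2)
new_plus, new_times = partial_embedding_swap(plus, times, dims)
print(dims, new_plus, new_times)
```

Only the selected dimensions change hands; the rest of each embedding is left intact, so the intervention is much narrower than swapping the two operators outright.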

[16] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

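Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Zhongyuan Wang, Jichen Zhang, Shirui Pan, Xindong Wu

Main category: cs.CL

TL;DR: The paper proposes the Knowledge Reasoning Language Model (KRLM), which coordinates LLM knowledge with knowledge graph (KG) context to address the knowledge distortion and generative hallucination that affect existing LLM-based methods for inductive knowledge graph reasoning (KGR).

Motivation: Existing LLM-based methods for inductive KGR suffer from knowledge distortion and generative hallucinations, which make their reasoning results unreliable.

Contribution: Proposes KRLM, with a Knowledge Reasoning Language (KRL) instruction format and tokenizer, a KRL attention layer that dynamically coordinates LLM knowledge with KG context, and a structure-aware next-entity predictor that constrains reasoning results.

Method: Aligns LLM knowledge with KG representations through the KRL instruction format and tokenizer, coordinates intrinsic knowledge with KG context through a KRL attention layer built on a dynamic knowledge memory mechanism, and uses a structure-aware next-entity predictor to keep results within a trustworthy knowledge domain.

Result: On 25 real-world inductive KGR datasets, KRLM shows significant superiority in both zero-shot reasoning and fine-tuning scenarios.

Insight: Unified coordination of LLM knowledge and KG context improves the credibility of reasoning, and the dynamic knowledge memory mechanism is key.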

Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source code is available at https://anonymous.4open.science/r/KRLM-EA36.

[17] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

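Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li

Main category: cs.CL

TL;DR: RAGCap-Bench is a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic Retrieval-Augmented Generation (RAG) systems, motivated by their difficulty with multi-hop questions and their underexplored intermediate reasoning capabilities.

Motivation: Agentic RAG systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored, so a fine-grained evaluation tool is needed.

Contribution: Proposes the RAGCap-Bench benchmark, designs its evaluation questions from an analysis of common tasks and the capabilities they require, and validates the benchmark.

Method: Analyzes outputs from state-of-the-art agentic RAG systems to identify common tasks and the core capabilities they require, builds a taxonomy of typical LLM errors, and designs targeted evaluation questions.

Result: Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results.

Insight: Strengthening intermediate reasoning capabilities is important for the overall performance of agentic RAG systems.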

Abstract: Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that “slow-thinking” models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark’s validity and the importance of enhancing these intermediate capabilities.

[18] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

Shrey Pandit,Xuan-Phi Nguyen,Yifei Ming,Austin Xu,Jiayu Wang,Caiming Xiong,Shafiq Joty

Main category: cs.CL

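TL;DR: The paper proposes a data synthesis method that generates question-answer pairs by progressively increasing task complexity, with the goal of training more effective web-based agents. A baseline agent is used to check data quality, and experiments show that the resulting data surpasses existing datasets in tool-use diversity and downstream performance.

Details Motivation: Existing approaches to synthesizing data for long-horizon reasoning lack fine-grained control over complexity and quality, and they often conflate the effects of data with the effects of training. Improving agents on complex tasks calls for a more carefully controlled synthesis method.

Contribution: 1. A two-pronged data synthesis pipeline that progressively increases task complexity until a baseline agent fails; 2. Multiple roles for the baseline agent (attempting questions, validating factuality, checking for alternative answers, and filtering); 3. Experiments showing that the dataset, although smaller, is more diverse and trains more effective web agents.

Method: 1. Generates question-answer pairs with a progressive complexity-enhancement mechanism; 2. Uses the baseline agent to verify data quality; 3. Runs controlled training under a distillation setup to isolate the effect of the data.

Result: Across multiple web benchmarks, the synthesized data trains stronger agents, exhibits twice the diversity in tool-use actions, and avoids repetitive tool-calling behavior.

Insight: Data complexity and diversity are key to improving long-horizon reasoning, and having the baseline agent verify data in multiple roles is an effective way to ensure data quality.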

Abstract: Web-based ‘deep research’ agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.

[19] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

Ivan Lee,Taylor Berg-Kirkpatrick

Main category: cs.CL

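TL;DR: This paper challenges the assumption that readability is the key factor in training small language models (SLMs), and shows experimentally that statistical simplicity has more influence on how well a model learns.

Details Motivation: Recent studies show that small language models can generate coherent text when trained on simplified, child-directed corpora such as TinyStories, which has led readability to be treated as the key factor. The authors question this assumption and argue for a more precise analysis of the properties that actually shape model capability.

Contribution: The main contribution is showing that readability is not the key factor in the learnability of small language models, and that statistical simplicity (as measured by n-gram diversity) is a better predictor of learning efficiency.

Method: The authors construct synthetic datasets with matched structure but varied readability, and compare SLM performance and learning efficiency across corpora of different readability.

Result: Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even develop coherence faster during training. Statistical simplicity is the stronger predictor of learning efficiency.

Insight: Language model training should not be casually compared to human cognitive development; the actual drivers of capability emergence need more rigorous analysis.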

Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability – characterized by accessible vocabulary, familiar narrative structure, and simple syntax – plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training – drawing parallels to human cognitive development without empirical basis – and argue for more precise reasoning about what properties actually support capability emergence in small models.
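
The abstract names n-gram diversity as its measure of statistical simplicity but does not give the exact formula. As a minimal sketch, assuming the common proxy of distinct n-grams divided by total n-grams (the function name and the toy sentences are illustrative, not taken from the paper):

```python
from collections import Counter


def ngram_diversity(tokens, n=2):
    """Ratio of distinct n-grams to total n-grams.

    This is an assumed proxy; the paper's exact definition is not
    stated in the abstract.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(Counter(ngrams)) / len(ngrams)


# Toy inputs: a repetitive sequence and a sequence with no repeated bigram.
simple = "the cat sat . the cat sat . the cat sat .".split()
varied = "a storm rolled over the quiet harbor before dawn broke again".split()

print(ngram_diversity(simple))  # lower value: few distinct bigrams
print(ngram_diversity(varied))  # higher value: every bigram is distinct
```

Under this proxy, a lower ratio means a statistically simpler corpus, independent of how readable the text is to a human.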

[20] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

Yuanhao Li,Keyuan Lai,Tianqi Wang,Qihao Liu,Jiawei Ma,Yuan-Chao Hu

Main category: cs.CL

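TL;DR: The paper proposes Element2Vec, which builds representations of chemical elements from natural-language text for property prediction. It combines global and local vector representations and designs a self-attention-based test-time training method to cope with sparse data and distribution mismatch.

Details Motivation: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are hard to measure directly. Traditional numerical-analysis methods struggle to model complex relationships, while existing AI approaches suffer from hallucinations and limited interpretability.

Contribution: 1) Element2Vec, which generates global and local vector representations of chemical elements from natural-language text. 2) A self-attention-based test-time training method that reduces the error caused by data sparsity and distribution mismatch.

Method: 1) Uses language models to generate global and local embeddings of chemical elements from Wikipedia text. 2) Introduces test-time training that refines regression predictions through a self-attention mechanism.

Result: The method improves the accuracy of chemical element property prediction and overcomes the challenges of sparse data and mismatched text distributions.

Insight: Combining natural language processing with self-attention allows complex scientific data to be modeled more effectively, and suggests a new direction for AI-driven discovery in materials science.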

Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec to effectively represent chemical elements from natural language to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Beyond the complicated relationships across elements, computational challenges also arise from 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to clearly mitigate the prediction error caused by vanilla regression. We hope this work could pave the way for advancing AI-driven discovery in materials science.

[21] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Peng Kuang,Yanli Wang,Xiaoyu Han,Yaowenqi Liu,Kaidi Xu,Haohan Wang

Main category: cs.CL

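TL;DR: The paper proposes a theoretical framework for optimally combining signals from a large language model (LLM) and a process reward model (PRM), and shows experimentally that the resulting weighted aggregation strategy makes test-time scaling (TTS) more efficient.

Details Motivation: PRMs are a core tool in TTS for verifying and selecting the best LLM responses, yet simple majority voting occasionally outperforms PRM-based selection. This raises the question of how to use PRM signals more effectively.

Contribution: Proposes a theoretical framework showing that the optimal strategy is a weighted aggregation of LLM and PRM signals, and demonstrates the efficiency of this strategy experimentally.

Method: Develops the theoretical framework, derives weighted aggregation as the optimal strategy, and proposes efficient pre-computation methods for calibrating the weighting functions.

Result: Experiments across 5 LLMs and 7 PRMs show that the calibrated weighting functions significantly improve TTS efficiency, surpassing vanilla weighted majority voting while using only 21.3% of the computation.

Insight: A more intelligent aggregation strategy can deliver larger performance gains than simply increasing test-time computation.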

Abstract: Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only 21.3% of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
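
The idea of weighted aggregation, including weights that can go negative, can be illustrated with a small sketch. The weighting function `2 * score - 1` below is a hypothetical stand-in chosen only to show the mechanism; the paper calibrates its own weighting functions per LLM-PRM pair, and those are not reproduced here. The answers and scores are made-up toy data.

```python
from collections import defaultdict


def weighted_vote(answers, prm_scores, weight_fn):
    """Sum weight_fn(score) per distinct answer and return the top answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)


# Toy data: five sampled responses with hypothetical PRM scores in [0, 1].
answers = ["42", "42", "42", "17", "17"]
scores = [0.2, 0.3, 0.1, 0.9, 0.8]

# Plain majority voting ignores the PRM: every response has weight 1.
majority = weighted_vote(answers, scores, lambda s: 1.0)

# Hypothetical weighting function: low PRM scores contribute negative weight.
calibrated = weighted_vote(answers, scores, lambda s: 2.0 * s - 1.0)

print(majority, calibrated)
```

In this toy case the two strategies disagree: the majority answer is backed only by low-scoring responses, so the negative weights push the vote to the minority answer.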

[22] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

Ye Yuan,Mohammad Amin Shabani,Siqi Liu

Main category: cs.CL

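TL;DR: FACTS is an agentic workflow for query-focused table summarization based on offline template generation. Its offline templates (SQL queries plus Jinja2 templates) enable fast, accurate, and privacy-compliant summaries, and it outperforms baseline methods on multiple benchmarks.

Details Motivation: Existing table summarization methods are limited by costly fine-tuning, weak reasoning, privacy risks, or poor efficiency. FACTS aims to address these issues through offline template generation and an agentic workflow.

Contribution: 1. The FACTS method, an agentic workflow that combines SQL queries with Jinja2 templates; 2. Fast, accurate, and privacy-compliant summary generation; 3. Improved efficiency through reusable templates.

Method: FACTS generates templates offline (SQL queries plus Jinja2 templates), splitting table summarization into executable SQL queries and natural-language template rendering. Only the table schema is sent to the LLM, which keeps the approach privacy-compliant.

Result: On multiple benchmarks FACTS outperforms existing baseline methods, demonstrating its efficiency and accuracy in practical use.

Insight: Offline template generation combined with an agentic workflow is an effective approach that maintains summary quality while improving efficiency and privacy protection.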

Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
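
The reusable-template idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the FACTS implementation: `string.Template` stands in for Jinja2 so the example has no third-party dependency, and the table name `t`, the columns, the query, and the rows are all hypothetical.

```python
import sqlite3
from string import Template

# Hypothetical offline template: one SQL query plus one text template.
# In FACTS the text side is a Jinja2 template; string.Template is a stand-in.
SQL = (
    "SELECT region, SUM(sales) AS total FROM t "
    "GROUP BY region ORDER BY total DESC LIMIT 1"
)
TEXT = Template("The top region is $region with total sales of $total.")


def summarize(rows):
    """Load rows into a table with the shared schema and render the template."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (region TEXT, sales INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    region, total = conn.execute(SQL).fetchone()
    conn.close()
    return TEXT.substitute(region=region, total=total)


# The same template is reused on two tables that share the schema.
q1 = summarize([("North", 10), ("South", 25), ("North", 5)])
q2 = summarize([("East", 7), ("West", 3)])
print(q1)
print(q2)
```

Because the template depends only on the schema, the row values never need to be shown to the model that wrote the template, which is the privacy argument made in the abstract.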

[23] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

Daniel Adu Worae,Spyridon Mastorakis

Main category: cs.CL

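TL;DR: Proposes an AI agent framework powered by a large language model (LLM) for holistic interpretation of Internet of Things (IoT) traffic, combining several techniques to support semantic analysis and interactive question answering.

Details Motivation: IoT network traffic is complex and diverse, and traditional methods struggle to interpret behaviors and threats across layers, so an integrated framework is needed.

Contribution: Develops an LLM-driven AI agent framework that integrates feature extraction, anomaly detection, traffic summarization, and related techniques to produce semantically enriched representations and support interactive analysis.

Method: Uses hybrid retrieval that combines lexical and semantic search with reranking, with an LLM agent reasoning over the indexed artifacts to produce structured, semantically enriched representations of the traffic.

Result: Experiments show that hybrid retrieval substantially improves BLEU, ROUGE, METEOR, and BERTScore over dense-only retrieval, and system profiling shows low resource overhead.

Insight: LLMs show promise for IoT traffic interpretation, and combining hybrid retrieval with multiple integrated techniques substantially improves the analysis.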

Abstract: Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.

[24] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

Congying Liu,Xingyuan Wei,Peipei Liu,Yiqing Shen,Yanxu Mao,Tiehan Cui

Main category: cs.CL

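TL;DR: BioMedSearch is an LLM-based multi-source biomedical retrieval framework. By integrating literature retrieval, protein databases, and web search, it addresses the lack of scientific rigor in biomedical content generated by existing LLMs.

Details Motivation: Biomedical queries require a deep understanding of specialized knowledge and the integration of information from multiple data sources, but existing LLMs cannot access authoritative databases, so their output often deviates from authentic information.

Contribution: Proposes the BioMedSearch framework, which produces high-quality answers through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering, and builds the BioMedMCQs dataset to evaluate performance.

Method: Integrates literature retrieval, protein database access, and web search, and supports complex queries through multi-step information processing.

Result: BioMedSearch improves accuracy at every reasoning level; at the hardest level (Level 3), average accuracy rises from 36.3% to 73.4%.

Insight: The biomedical domain needs multi-source data combined with structured retrieval, and the general capabilities of LLMs must be paired with domain-specific tools to ensure scientific soundness.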

Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search

[25] LLMs Can Get “Brain Rot”!

Shuo Xing,Junyuan Hong,Yifan Wang,Runjin Chen,Zhenyu Zhang,Ananth Grama,Zhengzhong Tu,Zhangyang Wang

Main category: cs.CL

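TL;DR: The paper proposes the "LLM Brain Rot Hypothesis" and shows through controlled experiments that continual pre-training on low-quality web text (such as Twitter/X data) degrades multiple capabilities of large language models.

Details Motivation: The work aims to verify the lasting negative effect of low-quality (junk) data on LLM capabilities and to establish a causal link between data quality and model capability.

Contribution: The main contribution is proposing and testing the "LLM Brain Rot Hypothesis", showing how low-quality data degrades reasoning, long-context understanding, and safety, and identifying a dose-response effect.

Method: Constructs junk and control datasets through two orthogonal operationalizations (M1: engagement degree; M2: semantic quality), continually pre-trains 4 LLMs on them, and measures the resulting changes in capability.

Result: The experiments confirm a "Brain Rot" effect: part of the decline cannot be fully reversed, and the decline follows a dose-response relationship with the proportion of junk data.

Insight: Key findings include: thought-skipping is the primary source of errors; a non-semantic metric (tweet popularity) is a better indicator of the "Brain Rot" effect than tweet length; and data quality is a causal driver of the decline in model capability.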

Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Compared with the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges’ $g>0.3$) in reasoning, long-context understanding, and safety, and inflates “dark traits” (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from 0% to 100%. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity of a tweet, a non-semantic metric, is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a training-time safety problem and motivating routine “cognitive health checks” for deployed LLMs.
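
The abstract reports effect sizes as Hedges' g. For readers unfamiliar with the statistic, the following sketch computes it in its standard form: Cohen's d with a pooled standard deviation, multiplied by the usual small-sample correction 1 - 3 / (4(n1 + n2) - 9). The score lists are made-up toy numbers, not results from the paper.

```python
import math


def hedges_g(a, b):
    """Standardized mean difference between samples a and b (Hedges' g)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    cohen_d = (mean_a - mean_b) / pooled_sd
    # Standard approximate small-sample bias correction.
    correction = 1.0 - 3.0 / (4.0 * (na + nb) - 9.0)
    return cohen_d * correction


# Toy data: hypothetical benchmark scores for two training conditions.
control_scores = [2.0, 4.0, 6.0]
junk_scores = [1.0, 3.0, 5.0]
g = hedges_g(control_scores, junk_scores)
print(g)
```

A value above 0.3, the threshold quoted in the abstract, means the two conditions differ by more than 0.3 pooled standard deviations after the correction.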

[26] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

Siying Liu,Shisheng Zhang,Indu Bala

Main category: cs.CL

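TL;DR: This paper studies whether LLMs introduce bias tied to socio-demographic information into drug-safety predictions. It finds systematic disparities and two modes of bias, underscoring the need for fairness evaluation before clinical use.

Details Motivation: LLMs are increasingly applied in biomedical domains, but their reliability in drug-safety prediction is underexplored, in particular whether they draw on clinically irrelevant socio-demographic information.

Contribution: The main contribution is showing that LLMs exhibit systematic bias in drug-safety prediction and identifying two modes of bias, explicit and implicit, which provides a basis for future fairness evaluation and mitigation strategies.

Method: Uses structured FAERS data and a persona-based evaluation framework to assess ChatGPT-4o and Bio-Medical-Llama-3.8B across different socio-demographic personas.

Result: Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted adverse-event likelihoods, and both explicit and implicit bias were observed.

Insight: Fairness issues must be addressed before LLMs are deployed clinically, and fairness-aware evaluation protocols and mitigation strategies are needed to avoid the risks that bias introduces.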

Abstract: Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.

[27] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

Kenan Alkiek,David Jurgens,Vinod Vydiswaran

Main category: cs.CL

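TL;DR: The paper proposes instruction retrieval at inference time to strengthen the multi-step reasoning of small language models (SLMs) without any additional fine-tuning.

Details Motivation: Small language models (SLMs) run efficiently on local hardware and offer privacy, cost, and environmental advantages, but they underperform on tasks that need multi-step reasoning or domain knowledge.

Contribution: Proposes an instruction retrieval method that strengthens SLM reasoning by retrieving structured reasoning procedures from a pre-built Instruction Corpus, with no additional fine-tuning.

Method: 1. Build the Instruction Corpus by grouping similar training questions and generating instructions with GPT-5; 2. At inference time, retrieve the most relevant instructions and have the SLM follow their steps.

Result: Using SLMs with 3B to 14B parameters, the method yields gains of 9.4% on MedQA, 7.9% on MMLU Professional Law, and 5.1% on MathQA.

Insight: Concise instructions outperform longer ones, and the size of the improvement depends strongly on model family and intrinsic reasoning ability.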

Abstract: Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
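
The retrieval step can be sketched as a nearest-neighbor lookup over a small instruction corpus. The abstract does not say which retriever is used, so the bag-of-words cosine similarity below is an assumed stand-in, and the corpus entries, field names, and question are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, corpus):
    """Return the corpus entry whose topic best matches the question."""
    query = Counter(question.lower().split())
    return max(
        corpus,
        key=lambda item: cosine(query, Counter(item["topic"].lower().split())),
    )


# Hypothetical instruction corpus: a topic description plus reasoning steps.
corpus = [
    {
        "topic": "compute simple interest on a loan",
        "steps": "1. Identify principal, rate, and time. 2. Multiply them.",
    },
    {
        "topic": "choose first-line treatment for hypertension",
        "steps": "1. Check comorbidities. 2. Match them to a drug class.",
    },
]

best = retrieve("what is the simple interest on a loan of 500", corpus)
print(best["steps"])
```

The retrieved steps would then be prepended to the SLM prompt; unlike retrieval-augmented generation, the retrieved item is a procedure to follow, not a passage of facts.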

[28] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

Fengbin Zhu,Xiang Yao Ng,Ziyang Liu,Chang Liu,Xianwei Zeng,Chao Wang,Tianhui Tan,Xuan Yao,Pengyang Shao,Min Xu,Zixuan Wang,Jing Wang,Xin Lin,Junfeng Li,Jingxian Zhu,Yang Zhang,Wenjie Wang,Fuli Feng,Richang Hong,Huanbo Luan,Ke-Wei Huang,Tat-Seng Chua

Main category: cs.CL

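TL;DR: The paper proposes the HisRubric evaluation framework and the FinDeepResearch benchmark for systematically evaluating Deep Research (DR) agents on financial analysis, and reveals the strengths and weaknesses of different methods.

Details Motivation: Existing work lacks a systematic evaluation of DR agents on critical financial analysis tasks, so a rigorous evaluation framework and benchmark are needed.

Contribution: 1. The HisRubric evaluation framework, which mirrors a professional analyst's workflow; 2. The FinDeepResearch benchmark, covering company data across multiple languages and financial markets; 3. Extensive experiments on 16 representative methods that reveal differences in their capabilities.

Method: Builds HisRubric from a hierarchical analytical structure and a fine-grained grading rubric, and uses the FinDeepResearch benchmark to evaluate DR agents and LLMs.

Result: The experiments reveal the strengths and limitations of the different methods across diverse capabilities, financial markets, and languages.

Insight: The results expose both the potential and the shortcomings of DR agents and LLMs in financial analysis tasks, and point to directions for future research.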

Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR agents’ capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents’ capabilities in corporate financial analysis. This framework mirrors the professional analyst’s workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.

[29] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Zhen Yang,Mingyang Zhang,Feng Chen,Ganggui Ding,Liang Hou,Xin Tao,Pengfei Wan,Ying-Cong Chen

Main category: cs.CL

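TL;DR: The paper proposes Minimal Test-Time Intervention (MTI), which intervenes only at high-entropy tokens during inference and improves the reasoning accuracy and stability of large language models while remaining efficient.

Details Motivation: Existing methods typically improve LLM reasoning by spending more inference computation, at the cost of efficiency. The authors find that reasoning uncertainty is concentrated in a small number of high-entropy tokens, which suggests that minimal intervention can improve performance.

Contribution: Proposes the MTI framework, comprising selective CFG intervention and lightweight negative-prompt guidance, which improves accuracy and stability on reasoning tasks without any training.

Method: 1. Selective CFG intervention: apply classifier-free guidance only at uncertain, high-entropy token positions; 2. Lightweight negative-prompt guidance: reuse the main model's KV cache to approximate unconditional decoding efficiently.

Result: Gains are consistent across general, coding, and STEM tasks, for example a +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 for Qwen3-32B-Reasoning.

Insight: Because reasoning uncertainty is localized, the right target for optimization is a few critical tokens, not global intervention; this minimal-intervention strategy improves performance efficiently.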

Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
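
The selective part of the intervention can be sketched for a single decoding step: compute the entropy of the conditional next-token distribution and apply classifier-free guidance only when it exceeds a threshold. The guidance formula used here is the common `uncond + scale * (cond - uncond)` form; the threshold, the scale, and the toy logits are assumptions, and the KV-cache reuse described in the abstract is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_cfg(cond_logits, uncond_logits, scale=1.5, threshold=0.8):
    """Apply guidance only when the conditional distribution is uncertain.

    scale and threshold are illustrative values, not the paper's settings.
    """
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits)  # confident token: leave logits untouched
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Confident position: one logit dominates, so no intervention happens.
confident = selective_cfg([10.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Uncertain position: near-uniform logits, so guidance is applied.
guided = selective_cfg([1.0, 0.9, 0.8], [0.5, 0.9, 1.0])
print(confident, guided)
```

Since most positions fall in the confident branch, the extra unconditional forward pass is only needed for a small fraction of tokens, which is where the efficiency claim comes from.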

[30] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

Shehenaz Hossain,Haithem Afli

Main category: cs.CL

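TL;DR: CRaFT is an explanation-based multilingual evaluation framework that measures how large language models reason across cultural contexts, revealing marked differences in linguistic and cultural adaptability between models.

Details Motivation: Current LLM evaluation focuses on answer accuracy and overlooks the depth of cultural understanding. CRaFT aims to fill this gap by evaluating how models reason across cultures.

Contribution: Proposes the CRaFT framework, which quantifies the cultural reasoning of LLMs with four interpretable metrics (Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation) and validates it in multilingual settings.

Method: Takes 50 culturally grounded questions from the World Values Survey, translates them into Arabic, Bengali, and Spanish, collects over 2,100 answer-explanation pairs from three models (GPT, DeepSeek, and FANAR), and analyzes them with CRaFT's four metrics.

Result: Language strongly affects model behavior: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. GPT adapts better across languages but is less consistent; FANAR is stable but rigid.

Insight: Cultural awareness in LLMs is not intrinsic; it emerges through linguistic framing. CRaFT offers practical tools and insights for building more culturally adaptive models.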

Abstract: Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.

[31] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

César Guerra-Solano,Zhuochun Li,Xiang Lorraine Li

Main category: cs.CL

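TL;DR: The paper proposes GlobalGroup, a multilingual word grouping game for evaluating linguistic bias in large language models (LLMs) on abstract reasoning. Testing across five languages shows that English modalities perform best and that open- and closed-source models differ in performance.

Details Motivation: Prior work mostly evaluates linguistic bias in LLMs through commonsense or math tasks, yet abstract reasoning matters just as much in everyday life. The lack of a systematic evaluation of linguistic bias on this kind of task motivates the study.

Contribution: 1. GlobalGroup, a multilingual word grouping game for evaluating linguistic bias in abstract reasoning; 2. Game difficulty measurements that enable fairer comparison; 3. The finding that English modalities have an advantage and that open- and closed-source models differ in performance.

Method: 1. Builds a game benchmark covering English, Spanish, Chinese, Hindi, and Arabic; 2. Evaluates linguistic bias by comparing model performance on the native-language version and its English translation; 3. Introduces game difficulty measurements to control for difficulty.

Result: English modalities largely lead to better performance on the abstract reasoning task, and there are performance disparities between open- and closed-source models.

Insight: Linguistic bias is also present in abstract reasoning tasks; future work should further explore cross-lingual generalization and how to close the gap for open-source models.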

Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply “out-of-the-box thinking” to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds – English, Spanish, Chinese, Hindi, and Arabic – in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find that English modalities largely lead to better performance in this abstract reasoning task, and we observe performance disparities between open- and closed-source models.

[32] DROID: Dual Representation for Out-of-Scope Intent Detection

Wael Rashwan,Hossam M. Zawbaa,Sourav Dutta,Haytham Assem

Main category: cs.CL

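TL;DR: DROID is a dual-encoder framework that combines a general-purpose semantic encoder with a domain-specific encoder for out-of-scope (OOS) intent detection in task-oriented dialogue systems. It needs no post-hoc scoring and delivers substantial performance gains.

Details Motivation: Detecting OOS intents is a key challenge in task-oriented dialogue systems. Existing methods depend on strong distributional assumptions or auxiliary modules, so a compact and efficient approach is needed.

Contribution: Proposes the DROID framework, which fuses the representations of a general-purpose encoder and a domain-specific encoder and feeds them to a lightweight classifier with a single threshold, substantially improving OOS intent detection.

Method: Combines representations from the Universal Sentence Encoder (USE) and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE), processes them with a branched classifier, and augments training with synthetic and open-domain outliers.

Result: Across multiple intent benchmarks, DROID improves macro-F1 over state-of-the-art baselines by 6–15% for known intents and 8–20% for OOS intents.

Insight: Dual-encoder representations combined with simple calibration can deliver robust and scalable OOS detection, with the largest gains in low-resource settings.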

Abstract: Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders – the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6–15% for known and 8–20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.

[33] Toward Cybersecurity-Expert Small Language Models

Matan Levi,Daniel Ohayon,Ariel Blobstein,Ravid Sagi,Ian Molloy,Yair Allouche

Main category: cs.CL

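TL;DR: The paper presents CyberPal 2.0, a family of small language models (SLMs) specialized for cybersecurity, addressing the lack of high-quality domain-specific models in this field.

Details Motivation: The use of large language models (LLMs) in cybersecurity is held back by a lack of high-quality, domain-specific models and training datasets. The authors developed CyberPal 2.0 to address this gap.

Contribution: Presents CyberPal 2.0, a family of cybersecurity-expert small language models ranging from 4B to 20B parameters, and designs SecKnowledge 2.0, a data enrichment and formatting pipeline for producing a high-quality cybersecurity instruction dataset.

Method: Uses the SecKnowledge 2.0 pipeline to generate a chain-of-thought cybersecurity instruction dataset, combining expert-in-the-loop steering of reasoning formats with LLM-driven multi-step grounding to improve the quality of task-grounded reasoning traces.

Result: CyberPal 2.0 consistently outperforms its baselines across diverse cybersecurity benchmarks and matches or surpasses various open and closed-source frontier models. On core threat-investigation tasks, the 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1.

Insight: In a specialized domain such as cybersecurity, small language models can match or exceed much larger general-purpose models at a fraction of their size.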

Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.

[34] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

Darko Sasanski,Dimitar Peshevski,Riste Stojanov,Dimitar Trajanov

Main category: cs.CL

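TL;DR: The paper systematically builds the first Macedonian recipe dataset, handles heterogeneous ingredient descriptions through web scraping and structured parsing, and analyzes ingredient frequency and co-occurrence patterns to reveal distinctive features of Macedonian culinary tradition.

Details Motivation: Computational gastronomy depends on diverse, high-quality recipe datasets, but Macedonian recipes are under-represented in digital research.

Contribution: 1. Builds the first systematic Macedonian recipe dataset; 2. Proposes a method for processing heterogeneous ingredient descriptions; 3. Analyzes ingredient frequency and co-occurrence patterns that characterize Macedonian cuisine.

Method: 1. Collects recipes through web scraping; 2. Applies structured parsing to normalize units, quantities, and descriptors; 3. Analyzes ingredient co-occurrence with Pointwise Mutual Information (PMI) and Lift score.

Result: Produces a Macedonian recipe dataset, and the analysis highlights distinctive ingredient combinations.

Insight: Macedonian culinary tradition shows distinctive ingredient combinations, and the dataset is a new resource for studying food culture in underrepresented languages.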

Abstract: Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.
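The two co-occurrence measures named in the abstract can be illustrated with a small sketch. It assumes the standard definitions, Lift(a, b) = P(a, b) / (P(a) P(b)) and PMI(a, b) = log2 of that ratio, with probabilities estimated as the fraction of recipes containing the ingredient or pair. The toy recipes and the function name `cooccurrence_scores` are hypothetical and are not taken from the paper's dataset or code.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(recipes):
    """Return {(a, b): (pmi, lift)} for ingredient pairs that co-occur at least once.

    Probabilities are estimated as recipe-level relative frequencies.
    """
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for ingredients in recipes:
        unique = sorted(set(ingredients))      # count each ingredient once per recipe
        single.update(unique)
        pair.update(combinations(unique, 2))   # pairs come out in alphabetical order
    scores = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        p_a = single[a] / n
        p_b = single[b] / n
        lift = p_ab / (p_a * p_b)
        scores[(a, b)] = (math.log2(lift), lift)
    return scores


# Toy data (hypothetical).
recipes = [
    ["pepper", "tomato", "cheese"],
    ["pepper", "tomato"],
    ["flour", "cheese"],
    ["flour", "egg"],
]
scores = cooccurrence_scores(recipes)
print(scores[("pepper", "tomato")])  # (1.0, 2.0)
```

A Lift above 1 (PMI above 0) means the pair appears together more often than independence would predict; a Lift of exactly 1 means no association.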

[35] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

Zhichao Wang,Andy Wong,Ruslan Belkin

Main category: cs.CL

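TL;DR: RLSR (Reinforcement Learning with Supervised Reward) replaces SFT (supervised fine-tuning), computing reward scores as cosine similarity in a semantic embedding space. It improves instruction following and outperforms SFT on AlpacaEval.

Details Motivation: Conventional SFT relies on a large corpus of human-labeled responses. RLSR reuses existing SFT datasets inside a reinforcement learning framework to improve instruction following.

Contribution: Proposes RLSR, which combines reinforcement learning with a semantic-similarity reward. It outperforms SFT and improves results further when combined with SFT.

Method: The base model generates multiple responses per prompt, and each response is rewarded by the cosine similarity between its embedding and that of the human-labeled response.

Result: On Qwen-7B, RLSR reaches an AlpacaEval win rate of 26.34%, compared with 21.01% for SFT; SFT + RLSR raises the win rate to 30.73%.

Insight: An RL framework can use supervised data as a reward signal, and combining it with conventional SFT yields further gains.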

Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model’s instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks. For example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT’s 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
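The reward described in this abstract can be sketched as follows. The sketch assumes the sampled responses and the human-labeled response have already been embedded; the embedding model is not part of the code, and the two-dimensional vectors and the function name `rlsr_rewards` are hypothetical illustrations rather than the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rlsr_rewards(response_embeddings, reference_embedding):
    """One reward per sampled response: similarity to the human-labeled reference."""
    return [cosine_similarity(e, reference_embedding) for e in response_embeddings]


# Toy embeddings (hypothetical): three sampled responses and one reference.
reference = [1.0, 0.0]
responses = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
rewards = rlsr_rewards(responses, reference)
print(rewards)  # [1.0, 0.0, -1.0]
```

These per-response scores would then feed whatever policy-gradient update the RL framework uses; that step is outside this sketch.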

[36] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Beomseok Kang,Jiwon Song,Jae-Joon Kim

Main category: cs.CL

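TL;DR: LiteStage is a latency-aware layer skipping framework that makes multi-stage reasoning more efficient, balancing speed and accuracy through an offline search and an online early exit.

Details Motivation: Multi-stage reasoning strengthens the reasoning ability of small language models but increases latency. Existing adaptive acceleration techniques such as layer skipping struggle to balance efficiency and accuracy in the multi-stage setting.

Contribution: Proposes the LiteStage framework, which combines a stage-wise offline search with an online early-exit mechanism to reduce redundant decoding and optimize the layer skipping policy.

Method: 1) A stage-wise offline search allocates optimal layer budgets; 2) An online confidence-based early exit stops generation.

Result: On benchmarks including OBQA, LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

Insight: Stage-wise variation in skip sensitivity and redundant output tokens are the main challenges; LiteStage addresses them with per-stage optimization and dynamic exit.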

Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

[37] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

Parsa Hejabi,Elnaz Rahmati,Alireza S. Ziabari,Morteza Dehghani

Main category: cs.CL

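TL;DR: The paper proposes Flip-Flop Consistency (F²C), an unsupervised training method that uses Consensus Cross-Entropy (CCE) and a representation alignment loss to improve the consistency and performance of large language models (LLMs) under prompt perturbations.

Details Motivation: LLMs often give inconsistent answers to different phrasings of the same question. F²C aims to improve robustness and consistency without supervision.

Contribution: 1. Proposes F²C, built on two components, CCE and a representation alignment loss; 2. Validates the method on 11 datasets spanning four NLP tasks; 3. Shows that F²C generalizes to out-of-domain evaluation and unseen prompt perturbations.

Method: F²C consists of: 1. Consensus Cross-Entropy (CCE), which creates a hard pseudo-label by majority vote; 2. A representation alignment loss that pulls low-confidence and non-majority predictions toward the high-confidence consensus prediction.

Result: On average, F²C raises agreement by 11.62%, improves mean F₁ by 8.94%, and reduces performance variance across formats by 3.29%. It also generalizes in out-of-domain evaluations.

Insight: F²C shows that unsupervised training can improve LLM consistency and robustness, particularly under diverse prompt perturbations.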

Abstract: Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
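A minimal sketch of the Consensus Cross-Entropy idea described in this abstract: take the model's prediction under each prompt variation, majority-vote a hard pseudo-label, then score every variation against that label with cross-entropy. Averaging the loss over variations and breaking ties by first occurrence are assumptions made for illustration, and the representation alignment term is omitted. The probability values are toy data.

```python
import math
from collections import Counter


def majority_label(predictions):
    """Most frequent prediction; ties go to the label seen first (an assumption)."""
    return Counter(predictions).most_common(1)[0][0]


def consensus_cross_entropy(prob_dists, labels):
    """prob_dists holds one {label: probability} dict per prompt variation.

    Returns the majority-vote pseudo-label and the mean cross-entropy
    of all variations against it.
    """
    predictions = [max(labels, key=lambda label: p[label]) for p in prob_dists]
    pseudo = majority_label(predictions)
    loss = -sum(math.log(p[pseudo]) for p in prob_dists) / len(prob_dists)
    return pseudo, loss


# Toy output distributions for three phrasings of the same question.
variations = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.6, "no": 0.4},
    {"yes": 0.2, "no": 0.8},  # the dissenting variation contributes most of the loss
]
pseudo, loss = consensus_cross_entropy(variations, labels=["yes", "no"])
print(pseudo)  # yes
```

Because no gold label enters the computation, the objective stays unsupervised; it only rewards agreement with the consensus.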

[38] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

Jihao Zhao,Zhiyuan Ji,Simin Niu,Hanyu Wang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

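TL;DR: The paper proposes the MoM framework, which improves conventional retrieval-augmented generation (RAG) through proactive understanding and structured document memory extraction, and uses multi-path sampling and reverse reasoning to strengthen the text processing ability of small language models (SLMs).

Details Motivation: Conventional RAG systems process text chunks passively, which limits knowledge internalization and reasoning and fails to simulate the active cognitive process of human reading.

Contribution: Proposes the MoM framework, which improves text processing in RAG through proactive document memory extraction and a multi-path optimization mechanism, and supports its retrieval mechanism with a proof from a probabilistic modeling perspective.

Method: MoM prompts large language models (LLMs) to generate logical outlines of documents, which guide structured chunking and core content extraction. SLMs are trained with multi-path sampling, multi-perspective evaluation, and a reverse reasoning strategy.

Result: Experiments across three distinct domains indicate that MoM resolves text chunking problems in existing RAG systems and opens a path for SLMs toward human-like intelligent text processing.

Insight: By simulating the active cognitive process of human reading and combining reverse reasoning with multi-path optimization, MoM improves the text understanding and reasoning of SLMs.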

Abstract: The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.

[39] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Mahbub E Sobhani,Md. Faiyaz Abdullah Sayeedi,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda

Main category: cs.CL

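TL;DR: MathMist is a parallel multilingual benchmark dataset for mathematical problem solving and reasoning. It addresses the focus of existing benchmarks on English or high-resource languages and evaluates multilingual and cross-lingual mathematical reasoning.

Details Motivation: Existing mathematical reasoning benchmarks focus on English or a few high-resource languages and lack systematic evaluation of multilingual and cross-lingual reasoning. MathMist provides a benchmark covering high-, medium-, and low-resource languages.

Contribution: Introduces the MathMist dataset, with over 21K aligned question-answer pairs across seven languages and multiple problem types, and evaluates a range of language models on multilingual mathematical reasoning.

Method: Builds a balanced multilingual dataset and systematically evaluates open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms.

Result: LLMs show persistent deficiencies in multilingual mathematical reasoning, with pronounced degradation in low-resource settings.

Insight: The study exposes the limits of LLMs in multilingual mathematical reasoning and points to the need for future work on linguistic diversity and low-resource language support.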

Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist

[40] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Sathyanarayanan Ramamoorthy,Vishwa Shah,Simran Khanuja,Zaid Sheikh,Shan Jie,Ann Chia,Shearman Chua,Graham Neubig

Main category: cs.CL

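TL;DR: MERLIN is a new testbed for multilingual multimodal entity linking, with a dataset of BBC news article titles paired with images in five languages. Benchmarks with multilingual and multimodal methods show that adding visual data improves entity linking accuracy.

Details Motivation: Multilingual multimodal entity linking lacks a standardized test environment and dataset, especially for less-studied languages. MERLIN aims to fill this gap and provide a platform for evaluating new methods.

Contribution: 1) Creates a multilingual multimodal entity linking dataset covering five languages; 2) Provides benchmarks using models such as LLaMa-2 and Aya-23; 3) Finds that visual data is important for improving entity linking.

Method: Builds the dataset from BBC news article titles and their corresponding images in five languages, then applies multilingual and multimodal entity linking methods (with models such as LLaMa-2 and Aya-23) and compares their performance.

Result: Incorporating visual data improves entity linking accuracy, especially when the textual context is ambiguous or insufficient, and particularly for models without strong multilingual abilities.

Insight: Visual information plays a key role in multilingual multimodal entity linking and can compensate for the weaknesses of text-only methods, particularly for lower-resource languages.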

Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin

[41] Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Marwa Abdulhai,Ryan Cheng,Aryansh Shrivastava,Natasha Jaques,Yarin Gal,Sergey Levine

Main category: cs.CL

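TL;DR: The paper studies deceptive behavior of large language models (LLMs) in dialogue, proposes a belief misalignment metric to quantify deception, and develops a multi-turn reinforcement learning method to reduce it.

Details Motivation: Widely deployed LLMs risk producing deceptive outputs, especially in multi-turn dialogue. Existing approaches lack effective ways to evaluate and mitigate deception, so new methods and metrics are needed.

Contribution: 1) Proposes the belief misalignment metric to quantify deception; 2) Evaluates LLM deception in multi-turn dialogue scenarios; 3) Develops a multi-turn reinforcement learning method that substantially reduces deception.

Method: 1) Evaluates deception with five established metrics and the proposed belief misalignment metric; 2) Benchmarks eight state-of-the-art models across four dialogue scenarios; 3) Fine-tunes LLMs with multi-turn reinforcement learning to reduce deception.

Result: LLMs exhibit deceptive behavior in approximately 26% of dialogue turns, and RLHF-trained models still deceive at an average rate of 43%. The multi-turn RL method reduces deceptive behavior by 77.6% compared with other instruction-tuned models.

Insight: 1) Deception needs to be evaluated over multi-turn dialogue; 2) RLHF training does not fully eliminate deception; 3) Multi-turn reinforcement learning is an effective way to reduce it.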

Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

[42] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Perapard Ngokpol,Kun Kerdthaisong,Pasin Buakhaw,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot

Main category: cs.CL

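TL;DR: The paper introduces Beyond One World, a benchmark that evaluates how consistently large language models (LLMs) role-play version-specific characters, focusing on multiverse versions of superheroes.

Details Motivation: It remains unclear whether LLMs can faithfully portray a specific version of a character, such as a superhero across different universes, so a dedicated benchmark is needed.

Contribution: A role-play benchmark spanning 30 heroes and 90 versions, two tasks (Canon Events and Moral Dilemmas), and a new metric (Think-Act Matching) for measuring consistency and reasoning fidelity in role-play.

Method: The method has two parts: task design (Canon Events and Moral Dilemmas) and an evaluation framework (Think-Act Matching). The tasks test recall of pivotal character events and responses to moral dilemmas; the framework quantifies alignment between reasoning and action.

Result: (1) Chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a challenge; (3) models often excel at either thinking or acting, but rarely both.

Insight: LLMs still show clear gaps in multiversal consistency and reasoning alignment in role-play, which points to directions for future model improvement.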

Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters – for example, superheroes across comic and cinematic universes – remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation (“thinking”) from outward decisions (“acting”). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

[43] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering

Ziad Elshaer,Essam A. Rashed

Main category: cs.CL

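TL;DR: The paper proposes CURE, a medical question answering framework that requires no fine-tuning and improves performance through confidence-driven multi-model collaboration, making it well suited to resource-limited settings.

Details Motivation: High-performing medical LLMs usually require fine-tuning with substantial computational resources, which limits access for resource-constrained institutions.

Contribution: Proposes a two-stage framework based on confidence detection and adaptive routing that uses multi-model collaboration to improve medical question answering.

Method: A confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism sends low-confidence queries to helper models with complementary knowledge for collaborative reasoning.

Result: The framework reaches 95.0% on PubMedQA and 78.0% on MedMCQA.

Insight: Strategic model collaboration is a computationally efficient approach that can help broaden access to advanced medical AI in resource-limited settings.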

Abstract: High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model’s certainty, and an adaptive routing mechanism directs low-confidence queries to helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
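The two-stage idea in this abstract (check the primary model's confidence, then route low-confidence queries to helpers) can be sketched as below. The abstract does not say how confidence is measured or how helper outputs are combined, so the 0.8 threshold, the majority vote, and the stand-in model functions are all hypothetical choices for illustration only.

```python
from collections import Counter


def route(question, primary, helpers, threshold=0.8):
    """Answer with the primary model when it is confident, else consult helpers.

    Each model is a callable returning (answer, confidence in [0, 1]).
    The threshold and the majority-vote aggregation are assumptions.
    """
    answer, confidence = primary(question)
    if confidence >= threshold:
        return answer, "primary"
    votes = [answer] + [helper(question)[0] for helper in helpers]
    return Counter(votes).most_common(1)[0][0], "collaborative"


# Stand-in models (hypothetical); real ones would wrap LLM calls.
def confident_primary(question):
    return "A", 0.95


def unsure_primary(question):
    return "A", 0.40


def helper_one(question):
    return "B", 0.70


def helper_two(question):
    return "B", 0.65


print(route("toy question", confident_primary, [helper_one, helper_two]))  # ('A', 'primary')
print(route("toy question", unsure_primary, [helper_one, helper_two]))     # ('B', 'collaborative')
```

Helpers are only called on the low-confidence path, which is where the compute saving over always running every model would come from.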

[44] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?

Anyun Zhuo,Xuefei Ning,Ningyuan Li,Yu Wang,Pinyan Lu

Main category: cs.CL

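TL;DR: The paper studies how well LLMs handle character-level perturbations, testing robustness by inserting invisible Unicode control characters. Many LLMs maintain performance under strong interference, which sheds light on the underlying mechanisms and on misuse risks.

Details Motivation: To examine the robustness of LLMs to structured character-level perturbations, in particular inserted noise characters, and to assess the risk of misuse in settings such as online exam systems.

Contribution: Proposes \nameshort{}, a method that inserts invisible Unicode control characters into LLM inputs, and analyzes LLM performance under strong interference together with possible implicit and explicit denoising mechanisms.

Method: Perturbs input text by inserting invisible Unicode control characters and evaluates LLMs across model-, problem-, and noise-related configurations to investigate the mechanisms behind their robustness.

Result: Even when the input is heavily obfuscated and the signal-to-noise ratio drops significantly, many LLMs maintain notable performance, indicating strong robustness to character-level perturbations.

Insight: LLMs may handle character-level perturbations through implicit and explicit denoising. This robustness supports reliability in deployment but also implies a risk of misuse.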

Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce \nameshort{}, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and hypotheses about implicit versus explicit denoising of character-level noise. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
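The perturbation described in this abstract, inserting a noisy character after each input character, can be sketched in a few lines. The abstract does not name the characters used, so the zero-width space (U+200B) here is an assumption chosen only because it renders invisibly; the function names are hypothetical and this is not the paper's method.

```python
ZERO_WIDTH_SPACE = "\u200b"  # assumed noise character; the paper's choice is not stated here


def obfuscate(text, noise=ZERO_WIDTH_SPACE):
    """Insert one noise character after every character of the input."""
    return "".join(ch + noise for ch in text)


def denoise(text, noise=ZERO_WIDTH_SPACE):
    """Explicit denoising: strip the noise character back out."""
    return text.replace(noise, "")


question = "2+2=?"
noisy = obfuscate(question)
print(len(question), len(noisy))   # 5 10
print(denoise(noisy) == question)  # True
```

The noisy string looks identical when rendered but is twice as long, which is what fragments tokenization and halves the share of signal characters.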

[45] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Mykolas Sveistrys,Richard Kunert

Main category: cs.CL

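TL;DR: The paper defines pluri-hop questions and, to address the challenges they pose for retrieval-augmented generation (RAG), proposes PluriHopRAG, which improves performance through document-level subquestion decomposition and early filtering.

Details Motivation: Real-world questions often require aggregating information across many repetitive, distractor-heavy documents. Existing QA systems handle such tasks poorly, so new solutions are needed.

Contribution: 1) Defines pluri-hop questions and formalizes them with three criteria; 2) Introduces the multilingual dataset PluriHopWIND; 3) Proposes the PluriHopRAG method, which improves performance.

Method: PluriHopRAG decomposes queries into document-level subquestions and uses a cross-encoder filter to discard irrelevant documents early, before costly LLM reasoning.

Result: PluriHopRAG improves relative F1 score over the baseline by 18-52%, depending on the base LLM, showing its effectiveness on repetitive, distractor-rich corpora.

Insight: Current QA systems are limited on repetitive corpora, and exhaustive retrieval combined with early filtering is a key strategy for improving performance.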

Abstract: Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a “check all documents individually, filter cheaply” approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG’s performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
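The "check all documents individually, filter cheaply" approach can be sketched as a loop over every document with a cheap relevance gate in front of the expensive step. In this sketch a keyword-overlap score stands in for the cross-encoder and a string split stands in for LLM reasoning; both stand-ins, the 0.5 threshold, and the toy documents are hypothetical, and the same question is reused per document rather than decomposed into subquestions.

```python
def pluri_hop_answer(question, documents, relevance, answer_from_doc, threshold=0.5):
    """Visit every document, drop low-relevance ones, and collect per-document answers.

    Returns the collected answers and the number of documents that passed the filter.
    """
    kept = [doc for doc in documents if relevance(question, doc) >= threshold]
    partial = [answer_from_doc(question, doc) for doc in kept]
    return [p for p in partial if p is not None], len(kept)


def keyword_relevance(question, doc):
    """Cheap filter stand-in: fraction of question words found in the document."""
    words = question.lower().split()
    return sum(word in doc.lower() for word in words) / len(words)


def extract_after_colon(question, doc):
    """Expensive-step stand-in: return the text after the first ': ', if any."""
    parts = doc.split(": ", 1)
    return parts[1] if len(parts) == 2 else None


documents = [
    "turbine 7 fault: gearbox",
    "turbine 9 fault: blade",
    "canteen menu for monday",
]
answers, n_kept = pluri_hop_answer(
    "turbine fault", documents, keyword_relevance, extract_after_colon
)
print(answers, n_kept)  # ['gearbox', 'blade'] 2
```

Unlike top-k retrieval there is no fixed cut-off on how many documents are read, so a relevant report cannot be missed merely because it ranked below k.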

[46] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Yingpeng Ning,Yuanyuan Sun,Ling Luo,Yanhua Wang,Yuchen Pan,Hongfei Lin

Main category: cs.CL

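TL;DR: MedTrust-RAG is a retrieval-augmented generation framework that improves the factual consistency and reliability of biomedical question answering through citation-aware reasoning, iterative retrieval-verification, and the MedTrust-Align Module.

Details Motivation: Existing RAG frameworks hallucinate in biomedical QA, mainly because of post-retrieval noise and insufficient verification of evidence, which undermines response reliability.

Contribution: Three key innovations: 1) citation-aware reasoning; 2) iterative retrieval-verification; 3) the MedTrust-Align Module, which reinforces citation-grounded generation.

Method: MedTrust-RAG strengthens evidence-based reasoning through structured Negative Knowledge Assertions, iterative verification driven by Medical Gap Analysis, and Direct Preference Optimization.

Result: On MedMCQA, MedQA, and MMLU-Med, the method outperforms competitive baselines, with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

Insight: Rigorously verifying retrieved evidence and optimizing the generation process can reduce hallucinations and improve the reliability of biomedical QA.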

Abstract: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

[47] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Qingyu Ren,Qianyu He,Bowei Zhang,Jie Zeng,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu

Main category: cs.CL

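TL;DR: The paper proposes a self-supervised reinforcement learning framework that needs no external supervision. It derives reward signals directly from instructions and generates pseudo-labels, addressing the reliance on external supervision and the sparse rewards of existing methods, and performs well across multiple datasets.

Details Motivation: Existing reinforcement learning methods for multi-constraint tasks depend on external supervision and suffer from sparse reward signals, which limits their practical use. The paper aims to obtain reward signals directly from instructions in a self-supervised way to improve multi-constraint instruction following.

Contribution: 1. Proposes a label-free self-supervised reinforcement learning framework; 2. Develops constraint decomposition strategies and an efficient constraint-wise binary classification method; 3. Validates the generality and effectiveness of the approach on multiple datasets.

Method: 1. Derives reward signals directly from instructions; 2. Uses constraint decomposition to handle multi-constraint tasks; 3. Trains the reward model with constraint-wise binary classification; 4. Uses pseudo-labels to reduce reliance on external supervision.

Result: Experiments on 3 in-domain and 5 out-of-domain datasets show strong improvements in multi-constraint instruction following, including challenging agentic and multi-turn tasks.

Insight: Deriving reward signals from instructions and generating pseudo-labels is an effective self-supervised approach that reduces reliance on external supervision and mitigates sparse rewards, offering a new direction for applying reinforcement learning to multi-constraint tasks.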

Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
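One way to picture how constraint decomposition addresses sparse rewards: instead of a single pass/fail signal for the whole instruction, each constraint gets its own binary judgment and the reward reflects how many are satisfied. In the sketch below, hand-written checks stand in for the trained constraint-wise classifier, and averaging the binary outcomes is an assumed aggregation; neither is confirmed by the abstract.

```python
def constraint_reward(response, checks):
    """Mean of per-constraint binary outcomes (1.0 if satisfied, else 0.0)."""
    results = [1.0 if check(response) else 0.0 for check in checks]
    return sum(results) / len(results)


# Hypothetical decomposition of the instruction:
# "Reply in at most five words, end with a period, and do not say 'please'."
checks = [
    lambda r: len(r.split()) <= 5,
    lambda r: r.endswith("."),
    lambda r: "please" not in r.lower(),
]

full = constraint_reward("Done as requested.", checks)
partial = constraint_reward("Please wait", checks)
print(full)     # 1.0
print(partial)  # one of three constraints satisfied
```

An all-or-nothing reward would score the second response 0; the decomposed version still credits the constraint it met, giving a denser learning signal.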

[48] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Rui Wang,Ce Zhang,Jun-Yu Ma,Jianshu Zhang,Hongru Wang,Yi Chen,Boyang Xue,Tianqing Fang,Zhisong Zhang,Hongming Zhang,Haitao Mi,Dong Yu,Kam-Fai Wong

Main category: cs.CL

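TL;DR: The paper proposes the "Explore to Evolve" paradigm, which uses proactive online exploration to generate verifiable QA data at scale and improve the information aggregation ability of deep research agents. The resulting WebAggregator models match or surpass GPT-4.1.

Details Motivation: Existing open-source deep research agents focus mainly on information seeking and overlook information aggregation, which limits their ability to support in-depth research.

Contribution: Proposes the "Explore to Evolve" paradigm; builds the WebAggregatorQA dataset and the WebAggregator models, which match or surpass GPT-4.1 and closely approach Claude-3.7-sonnet; and creates a benchmark for evaluating information aggregation.

Method: An agent gathers evidence through proactive online exploration, then selects, composes, and refines operations from 12 high-level logical types to synthesize a verifiable QA pair.

Result: WebAggregator-8B matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text. On the new test set, Claude-3.7-sonnet achieves only 28% and GPT-4.1 scores 25.8%.

Insight: Agents still struggle with aggregation even after retrieving all references, which underlines the need to strengthen information aggregation.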

Abstract: Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which limits their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents’ information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

[49] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

Reid T. Johnson,Michelle D. Pain,Jordan D. West

Main category: cs.CL

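TL;DR: The paper presents Natural Language Tools (NLT), a framework that replaces JSON tool calling in large language models with natural language outputs, addressing task interference and format constraints.

Details Motivation: JSON tool calling in current large language models suffers from task interference and format constraints, which degrade tool call performance and accuracy.

Contribution: NLT replaces JSON tool calling with natural language outputs, improving tool calling accuracy and output stability.

Method: NLT decouples tool selection from response generation to avoid task interference. The evaluation covers 10 models and 6,400 trials.

Result: Across customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points and reduces output variance by 70%. Open-weight models see the largest gains.

Insight: NLT has implications for both reinforcement learning and supervised fine-tuning stages, extends tool calling to models that lack native support, and remains robust under prompt perturbations.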

Abstract: We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.

[50] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

Haolin Li,Haipeng Zhang,Mang Li,Yaohua Wang,Lijie Wen,Yu Zhang,Biqing Huang

Main category: cs.CL

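TL;DR: LiRA (Linguistic Robust Anchoring) is a training framework that uses the Anchored Representation Composition Architecture (Arca) and the Language-coupled Semantic Reasoner (LaSR) to improve cross-lingual representations for low-resource languages while strengthening retrieval and reasoning.

Details Motivation: Performance on high-resource languages (e.g., English, Chinese) is nearing saturation, while low-resource languages (e.g., Urdu, Thai) lag substantially because of limited training data, machine-translation noise, and unstable cross-lingual alignment. LiRA targets this gap.

Contribution: 1. Proposes the LiRA framework, comprising two modules, Arca and LaSR; 2. Constructs and releases a multilingual product retrieval dataset; 3. Experiments show consistent gains and robustness on low-resource tasks.

Method: 1. Arca anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding; 2. LaSR adds a language-aware lightweight reasoning head with consistency regularization.

Result: LiRA shows consistent gains on cross-lingual retrieval, semantic similarity, and reasoning tasks, and remains robust under few-shot and noise-amplified settings.

Insight: With a geometrically stable shared embedding space and a unified training objective, LiRA can improve cross-lingual understanding for low-resource languages.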

Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca’s multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.

[51] Efficient Seq2seq Coreference Resolution Using Entity Representations

Matt Grenander,Shay B. Cohen,Mark Steedman

Main category: cs.CL

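TL;DR: The paper proposes an efficient seq2seq coreference resolution method that uses a compressed entity representation to improve efficiency in incremental settings. It performs close to a full-prefix baseline and reaches state-of-the-art results on LitBank.

Details Motivation: Seq2seq coreference models achieve strong performance, but they are inefficient and inflexible in incremental settings such as dialogue.

Contribution: Proposes a compressed entity representation that improves efficiency in incremental settings and sets a new state of the art on LitBank.

Method: The method extracts and reorganizes entity-level tokens and discards most other input tokens, enabling flexible incremental processing.

Result: On OntoNotes, the best model is only 0.6 CoNLL F1 points below a full-prefix incremental baseline at a compression ratio of 1.8; on LitBank, it surpasses the previous state of the art.

Insight: Discarding a large portion of tokens is a feasible strategy for incremental processing in seq2seq coreference resolution.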

Abstract: Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must be processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model achieves just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it surpasses state-of-the-art performance. Our results indicate that discarding a large portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.

[52] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs

Kyubyung Chae,Gihoon Kim,Gyuseong Lee,Taesup Kim,Jaejin Lee,Heejin Kim

Main category: cs.CL

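TL;DR: The paper addresses the lack of tools for assessing the socio-cultural alignment and technical safety of sovereign LLMs by introducing a new dataset and an analytic framework. Experiments show that sovereign LLMs play a meaningful role in supporting low-resource languages, but they do not always serve their target users well and may neglect important quality attributes such as safety.

Details Motivation: As sovereign LLMs develop, it becomes necessary to verify whether they align with users' socio-cultural backgrounds and whether they maintain technical safety and robustness. Frameworks and datasets for evaluating these questions are currently lacking.

Contribution: 1. Constructs a new dataset for evaluating the socio-cultural elements of sovereign LLMs; 2. Proposes an analytic framework for assessing socio-cultural alignment and technical safety; 3. Shows the potential of sovereign LLMs for supporting low-resource languages, while also identifying where they fall short of fully serving their target users.

Method: The authors construct a dataset and an analytic framework, then experimentally evaluate sovereign LLMs on socio-cultural alignment and technical safety.

Result: Sovereign LLMs are valuable for supporting low-resource languages, but they show gaps in socio-cultural alignment and technical safety. Pursuing the untested claim that they serve their target users well may lead to underestimating critical quality attributes such as safety.

Insight: Advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded, practical criteria.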

Abstract: Recent trends in LLM development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their own LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users’ socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.

[53] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying,Yunwen Li,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Xeron Du,Tianyu Zheng,Yichi Zhang,Letian Ni,Yuyang Cheng,Qiguang Chen,Jingzhe Ding,Shengda Long,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Ge Zhang,Wenhao Huang,Wanxiang Che,Chenghua Lin

Main category: cs.CL

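TL;DR: The paper analyzes the limitations of current preference learning methods, which perform poorly when judging subjective writing preferences. It introduces a new dataset, WritingPreferenceBench, and shows that generative reward models with explicit reasoning chains perform best.

Details Motivation: Existing preference learning methods do well on standard benchmarks but degrade significantly once objective quality signals are removed, failing to capture subjective preferences such as creativity, stylistic flair, and emotional resonance.

Contribution: Introduces the WritingPreferenceBench dataset, demonstrates the effectiveness of generative reward models that produce explicit reasoning chains, and exposes the limitations of current RLHF methods.

Method: The authors build a dataset of preference pairs matched for objective quality, compare sequence-based reward models, zero-shot language model judges, and generative reward models, and analyze how performance varies across genres.

Result: Generative reward models perform best (81.8% accuracy), but individual models vary widely across writing genres (from 18.2% to 81.8%). Model scale brings no consistent improvement.

Insight: Successfully modeling subjective preferences may require explicit reasoning rather than direct classification; current RLHF methods are better at detecting objective errors than at capturing subjective quality.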

Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models–the standard architecture for RLHF–achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
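
The accuracy figures above are reported over preference pairs, so a judge is presumably counted as correct when it prefers the human-chosen response. A minimal Python sketch of that metric for a scalar reward model follows. The names `pairwise_accuracy`, `stub_score`, and `example_pairs` are hypothetical, and the length-based scorer is only a placeholder, not anything taken from the benchmark.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str]], score: Callable[[str], float]
) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)


def stub_score(text: str) -> float:
    # Placeholder scorer (assumption): longer text gets a higher score.
    return float(len(text))


example_pairs = [
    ("a vivid, surprising image", "a plain line"),  # chosen is longer
    ("short", "much longer response"),              # chosen is shorter
]
accuracy = pairwise_accuracy(example_pairs, stub_score)
```

Because the benchmark matches responses for length, a scorer like `stub_score` that keys on length alone would be expected to land near chance, which is the kind of shortcut the dataset is designed to remove.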

[54] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Kedi Chen,Zhikai Lei,Xu Guo,Xuecheng Wu,Siyuan Zeng,Jianghao Yin,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Qipeng Guo,Kai Chen,Wei Zhang

Main category: cs.CL

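TL;DR: The paper proposes CodeSeq, a dataset and method that improves the inductive reasoning of LLMs through code-driven number sequence calculation, addressing the lack of complex patterns and difficulty control in existing data.

Details Motivation: Existing inductive reasoning data and tasks are too simple and lack complex patterns, and conventional prompting or fine-tuning provides neither precise thinking processes nor difficulty control.

Contribution: 1. Proposes the CodeSeq dataset, which packages number sequences into algorithmic problems for discovering their general terms; 2. Designs a data pipeline based on reflecting on failed test cases and iterative correction; 3. Introduces a reinforcement learning reward that combines solvability with the success rate of case generation for difficulty control.

Method: The pipeline generates supervised fine-tuning data and improves it iteratively by reflecting on failed test cases, and it optimizes the model with a reinforcement learning reward based on the problem pass rate and the success rate of case generation.

Result: Models trained with CodeSeq improve on various reasoning tasks while preserving out-of-distribution (OOD) performance.

Insight: Code-driven tasks and iteratively generated data can strengthen the inductive reasoning of LLMs, and the reward design lets models learn from both successes and failures.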

Abstract: Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but do not provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing CodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with CodeSeq improve on various reasoning tasks and can preserve the models’ OOD performance.

[55] RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Qing Yang,Zhenghao Liu,Junxin Wang,Yangfan Du,Pengcheng Huang,Tong Xiao

Main category: cs.CL

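TL;DR: RLAIF-SPA is a reinforcement-learning-based speech synthesis framework that uses AI feedback (RLAIF) to optimize the expressiveness and intelligibility of emotional speech synthesis, improving both emotional expressiveness and semantic accuracy.

Details Motivation: Existing text-to-speech methods fall short in emotional expressiveness. They rely on costly emotion annotations or on indirect objectives, which leads to generated speech that is emotionally flat and less natural.

Contribution: Proposes the RLAIF-SPA framework, which uses an RLAIF mechanism with ASR and LLM techniques to directly optimize emotional expressiveness and semantic accuracy.

Method: 1. Prosodic Label Alignment jointly considers semantic accuracy and prosodic-emotional alignment along four dimensions (Structure, Emotion, Speed, Tone); 2. Semantic Accuracy Feedback ensures that the generated speech is clear and accurate.

Result: On the Libri Speech dataset, compared with Chat-TTS, WER is reduced by 26.1%, SIM-O increases by 9.1%, and human evaluation improves by more than 10%.

Insight: By directly optimizing emotional expressiveness and semantic accuracy, RLAIF-SPA offers an efficient, low-cost approach to emotional speech synthesis.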

Abstract: Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.

[56] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Linyue Ma,Yilong Xu,Xiang Long,Zhi Zheng

Main category: cs.CL

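TL;DR: The paper proposes a rubric-based generative verifier for search-augmented LLMs, using a "nugget-as-rubric" paradigm to address the challenge of reward design for long-form workloads.

Details Motivation: Existing reward design for search-augmented LLMs is limited. Rule-based rewards (such as Exact Match) are fragile and cannot be applied to long-form workloads, while generative rewards are hard to make verifiable and incur high computational costs.

Contribution: 1) Proposes the "nugget-as-rubric" paradigm, which treats atomic information points as structured evaluation criteria; 2) Designs an automatic rubric construction pipeline that supports both static corpora and dynamic web content; 3) Trains Search-Gen-V, an efficient generative verifier.

Method: 1) Uses the "nugget-as-rubric" paradigm to unify the requirements of different workloads; 2) Builds rubrics automatically with a pipeline based on query rewriting; 3) Trains Search-Gen-V via distillation and a two-stage strategy.

Result: Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient way to construct rewards for search-augmented LLMs.

Insight: Structuring evaluation around atomic information points addresses reward design for long-form workloads; automatic rubric construction and an efficient verifier are the keys to scalability.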

Abstract: Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, “nugget-as-rubric”, which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question’s information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
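
One way to picture the "nugget-as-rubric" idea is as a reward equal to the share of rubrics an answer satisfies, with short-form tasks as the single-rubric case. The Python sketch below makes that assumption explicit: the abstract does not say how per-rubric judgments are aggregated, so the fraction is illustrative only. `rubric_reward` and `stub_verify` are hypothetical names, and the substring check merely stands in for a generative verifier such as Search-Gen-V.

```python
from typing import Callable, Sequence


def rubric_reward(
    answer: str, rubrics: Sequence[str], verify: Callable[[str, str], bool]
) -> float:
    """Fraction of rubrics (nuggets) that the verifier judges the answer to satisfy."""
    if not rubrics:
        raise ValueError("need at least one rubric")
    satisfied = sum(1 for rubric in rubrics if verify(answer, rubric))
    return satisfied / len(rubrics)


def stub_verify(answer: str, rubric: str) -> bool:
    # Placeholder (assumption): substring match instead of a generative verifier.
    return rubric.lower() in answer.lower()


# Short-form task: a single rubric.
short_form = rubric_reward("The capital is Paris.", ["Paris"], stub_verify)

# Long-form task: several rubrics covering the question's information needs.
long_form = rubric_reward(
    "Paris is the capital; it lies on the Seine.",
    ["Paris", "Seine", "population"],
    stub_verify,
)
```

Here the long-form answer covers two of the three nuggets, so it earns partial credit rather than the all-or-nothing score an exact-match rule would give.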

[57] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms

Xingmeng Zhao,Dan Schumacher,Veronica Rammouz,Anthony Rios

Main category: cs.CL

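TL;DR: The paper presents a human-centered framework that generates user stories and supports multi-agent discussions to help people identify potential risks and benefits more comprehensively before AI healthcare tools are deployed. Storytelling helped participants recognize a more diverse set of harms and think more creatively.

Details Motivation: Rapid, low-barrier development of healthcare AI can introduce bias, privacy violations, and unequal access, yet automated risk detection can reduce human engagement in understanding how harms arise and who they affect.

Contribution: Proposes a framework that combines user stories with multi-agent discussions to surface and discuss potential risks and benefits before AI healthcare tools are deployed.

Method: The framework generates diverse user stories and uses multi-agent discussions to help participants think more broadly about the impact of AI.

Result: Participants who read stories recognized more types of harm, with responses spread more evenly across all 13 harm types, whereas those who did not read stories focused primarily on privacy and well-being (58.3%).

Insight: Storytelling can help people speculate more broadly and creatively about the potential impact of AI on users.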

Abstract: Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI’s impact on users.

[58] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Mengzhao Jia,Zhihan Zhang,Ignacio Cases,Zheyuan Liu,Meng Jiang,Peng Qi

Main category: cs.CL

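TL;DR: AutoRubric-R1V is a framework that combines process-level supervision with generative rewards. By automatically collecting rubrics, it improves the accuracy and faithfulness of multimodal reasoning and avoids the spurious reasoning that arises when reinforcement learning rewards only final-answer correctness.

Details Motivation: Multimodal large language models (MLLMs) exhibit spurious reasoning on complex multi-step tasks, because reinforcement learning with verifiable rewards (RLVR) rewards only the correctness of the final answer and does not supervise the reasoning process.

Contribution: Proposes the AutoRubric-R1V framework, which uses automatically collected rubric-based rewards to combine process-level supervision with outcome rewards, improving both the performance and the faithfulness of multimodal reasoning.

Method: A scalable self-aggregation method distills consistent reasoning checkpoints from successful trajectories to build problem-specific rubrics, without human annotation or stronger teacher models.

Result: Achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

Insight: Combining process-level supervision with generative rewards can reduce spurious reasoning and improve performance on multimodal tasks.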

Abstract: Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

[59] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes

Yunwen Li,Shuangshuang Ying,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Tianyu Zheng,Xeron Du,Qiguang Chen,Jiajun Shi,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Stephen Huang,Wanxiang Che,Chenghua Lin,Eli Zhang

Main category: cs.CL

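TL;DR: The paper introduces COIG-Writer, a high-quality Chinese creative writing dataset that captures diverse outputs together with the thought processes behind them. It finds that creative writing requires a balance between narrative logic (provided by process supervision) and linguistic expression.

Details Motivation: Current large language models perform poorly at creative writing in non-English contexts, where high-quality training data and process-level supervision are scarce. COIG-Writer aims to fill this gap.

Contribution: Presents the COIG-Writer dataset, comprising 1,665 meticulously curated triplets (prompt, detailed reasoning, final text) spanning 51 genres, and identifies two core components of creative writing.

Method: The dataset is built by systematically reverse-engineering high-quality texts to obtain process supervision, and experiments combine it with general-purpose data to stabilize training.

Result: Process supervision must be balanced with general data (at least one creative sample per twelve general samples), creative capabilities are culturally bound, and lexical diversity correlates inversely with creative quality (the TTR paradox).

Insight: Creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.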

Abstract: Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) process supervision is highly effective but requires stabilization with general data: a ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance, and below this threshold the win rate progressively degrades (from 62.75% down to 35.78%); (2) creative capabilities are culturally bound, with no cross-lingual transfer (89.26pp gap between Chinese and English performance); and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
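
TTR in the "TTR paradox" refers to the type-token ratio, a standard lexical diversity measure: the number of distinct tokens divided by the total number of tokens. The Python sketch below computes it on a whitespace-tokenized English string. Whitespace splitting is an assumption made for illustration, since Chinese text would first need a word segmenter, and the abstract does not describe the tokenization used.

```python
def type_token_ratio(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# "the" repeats, so there are 4 distinct tokens out of 5.
ttr = type_token_ratio("the cat and the dog".split())
```

A higher value means fewer repeated words. The finding above is that, on this dataset, higher values go with lower creative quality.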

[60] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Hwiyeol Jo,Joosung Lee,Jaehone Lee,Sang-Woo Lee,Joonsuk Park,Kang Min Yoo

Main category: cs.CL

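TL;DR: The paper proposes Answer Regeneration, a method that extracts the final answer through an additional model inference, reducing the influence of the answer extraction algorithm on the measured performance of reasoning models.

Details Motivation: Evaluations of large language models (LLMs) on reasoning tasks are overly sensitive to the answer extraction algorithm, which undermines the reliability and robustness of the results.

Contribution: Proposes Answer Regeneration, a framework for answer extraction that adds one model inference step, improving measured performance and robustness to different extraction rules.

Method: An additional model inference receives the original input and output, prefaced by the prompt "Answer:", and regenerates the answer, so that the final answer does not depend on the extraction rule.

Result: The method shows improved performance and robustness on math problems and open-ended question answering tasks.

Insight: The choice of answer extraction algorithm significantly affects evaluation results for reasoning models, and Answer Regeneration offers a more stable alternative.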

Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt “Answer:”. The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
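
The regeneration step can be sketched as a second call that sees the original prompt and the model's first output, followed by an "Answer:" cue. This is a minimal Python illustration under assumptions: the exact prompt layout is not given in the abstract, `generate` is a hypothetical stand-in for whatever inference call is available, and `stub_generate` returns a fixed string so the example runs without a model.

```python
from typing import Callable


def regenerate_answer(
    prompt: str, first_output: str, generate: Callable[[str], str]
) -> str:
    """Run one extra inference on the prior input and output plus an "Answer:" cue."""
    # Assumed layout: original prompt, then the first output, then the cue.
    regeneration_prompt = f"{prompt}\n{first_output}\nAnswer:"
    return generate(regeneration_prompt).strip()


def stub_generate(text: str) -> str:
    # Hypothetical stand-in for a model call; always returns the same short answer.
    return " 42\n"


final_answer = regenerate_answer(
    "What is 6 * 7?",
    "Six times seven is forty-two, so the result is 42.",
    stub_generate,
)
```

The point of the design is that the regenerated output is short and answer-shaped, so whichever rule is then used to select or extract the final answer has far less free-form reasoning text to misread.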

[61] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking

Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang

Main category: cs.CL

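TL;DR: The paper compares supervised fine-tuning (SFT) and contrastive learning (CL) for reranking with multimodal large language models (LLMs). It finds that SFT provides a stronger weighting scheme than CL and reports new state-of-the-art results.

Details Motivation: In information retrieval, reranker training mainly uses two kinds of objective: metric learning (e.g., contrastive loss) and classification (predicting relevance labels). For BERT-style encoders, contrastive learning has been shown to be more effective, but for LLMs, classification via supervised fine-tuning aligns better with their generative nature. The study asks which objective is better suited to LLM-based reranking.

Contribution: 1. Compares and analyzes SFT and CL for LLM reranking; 2. Presents a unified framework that decomposes the objectives into weight and direction; 3. Finds that SFT provides a stronger weighting scheme than CL; 4. Validates the advantage of SFT with large-scale training and achieves new state-of-the-art results on the MRB benchmark.

Method: The objectives are decomposed into two components, weight and direction, within a unified framework, and SFT and CL are compared experimentally on universal multimodal retrieval (UMR).

Result: SFT provides a substantially stronger weighting scheme than CL, and SFT-trained rerankers achieve new state-of-the-art performance on the MRB benchmark.

Insight: The weighting scheme of SFT is better suited to LLM reranking, whereas the preferred scoring direction shows no clear winner. The finding offers guidance for future research and applications.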

Abstract: In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts the "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: direction, which guides the model updates, and weight, which controls the magnitude of those updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
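
The two objectives being compared can be written down compactly. The Python sketch below uses common textbook forms, not the paper's exact losses: SFT as cross-entropy over the "yes" and "no" token logits for one query-document pair, and CL as an InfoNCE loss over one positive and its negatives. The function names and the temperature default are assumptions for illustration.

```python
import math


def sft_loss(yes_logit: float, no_logit: float, relevant: bool) -> float:
    """Cross-entropy over the two label tokens for a single query-document pair."""
    log_norm = math.log(math.exp(yes_logit) + math.exp(no_logit))
    target = yes_logit if relevant else no_logit
    return log_norm - target


def contrastive_loss(scores, positive_index: int, temperature: float = 1.0) -> float:
    """InfoNCE over one positive and its negatives for the same query."""
    scaled = [s / temperature for s in scores]
    log_norm = math.log(sum(math.exp(s) for s in scaled))
    return log_norm - scaled[positive_index]


# With uninformative scores both losses equal log(2).
sft_at_chance = sft_loss(0.0, 0.0, relevant=True)
cl_at_chance = contrastive_loss([0.0, 0.0], positive_index=0)
```

The structural difference is visible in the signatures: the SFT loss judges each pair on its own, while the contrastive loss normalizes over a group of candidates for the same query.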

[62] Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models

Guinan Su,Yanwu Yang,Li Shen,Lu Yin,Shiwei Liu,Jonas Geiping

Main category: cs.CL

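TL;DR: The paper proposes an online test-time framework that needs no external data and dynamically optimizes the routing decisions of Mixture-of-Experts (MoE) models to improve their performance during text generation.

Details Motivation: MoE models scale efficiently through sparse expert activation, but distribution shifts in deployment can lead to suboptimal routing decisions. Existing test-time adaptation methods mainly target dense models and depend on external data, which makes them hard to apply directly to MoE architectures.

Contribution: Proposes a data-free, online test-time adaptation framework that optimizes MoE routing decisions based only on the input context, without external supervision or data.

Method: The framework alternates between two phases: 1) during the prefill stage, and later at regular intervals, routing decisions are optimized with self-supervision; 2) text is generated with the modified router until the next adaptation. Lightweight additive vectors update only the router logits in selected layers, which keeps the method computationally efficient.

Result: Experiments show consistent gains on reasoning tasks, for example a 5.5% improvement on HumanEval with OLMoE. The method also combines with other test-time scaling techniques, achieving 6% average gains when paired with self-consistency on DeepSeek-V2-Lite.

Insight: MoE routing decisions can be optimized on the fly from context alone, without external data, while remaining efficient and robust, which suits practical deployment.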

Abstract: Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose a data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaptation. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.
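
To see why a small additive vector is enough to change which experts fire, consider top-k routing over adjusted logits. The Python sketch below shows only that selection step. How the vector is optimized (the self-supervised part of the method) is not shown, and `route`, `delta`, and the example numbers are hypothetical.

```python
def route(router_logits, delta, k):
    """Return the indices of the top-k experts after adding delta to the logits."""
    if len(router_logits) != len(delta):
        raise ValueError("logits and delta must have the same length")
    adjusted = [logit + shift for logit, shift in zip(router_logits, delta)]
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:k])


logits = [2.0, 1.0, 0.5, 0.1]

# Without adaptation the router picks experts 0 and 1.
before = route(logits, [0.0, 0.0, 0.0, 0.0], k=2)

# A learned shift of +1.0 on expert 2 swaps it in for expert 1.
after = route(logits, [0.0, 0.0, 1.0, 0.0], k=2)
```

Because only the per-expert offsets change, the expert weights themselves stay frozen, which is consistent with the abstract's emphasis on computational efficiency.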

[63] Predicting Task Performance with Context-aware Scaling Laws

Kyle Montgomery,David Park,Jianhong Tu,Michael Bendersky,Beliz Gunel,Dawn Song,Chenguang Wang

Main category: cs.CL

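TL;DR: The paper proposes a framework that predicts downstream task performance from both training compute and the provided context, filling a gap left by conventional scaling laws for context-dependent tasks.

Details Motivation: Conventional scaling laws focus on upstream metrics such as cross-entropy loss and ignore the critical role of context in downstream task performance, so a method that accounts for both training compute and context is needed.

Contribution: Proposes a straightforward, interpretable framework that jointly models the effect of training compute and context on downstream performance, and validates it across several tasks and models.

Method: The framework is fitted to the observed performance of extended-context variants of Llama-2-7B and Llama-2-13B on arithmetic reasoning, common sense reasoning, and machine translation.

Result: The framework accurately models in-distribution performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases.

Insight: The findings shed light on the interplay between training compute and context utilization, offering guidance for designing more efficient long-context LLMs.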

Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.

[64] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin

Main category: cs.CL

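TL;DR: The study evaluates machine learning models for mental health screening on a dataset of 553 real-world clinical conversations, showing that LLM-based models reach high accuracy and recall for depression, anxiety, and PTSD, with the strongest results on PTSD (up to 89% accuracy).

Details Motivation: Mental health disorders such as depression, anxiety, and PTSD are frequently misdiagnosed or missed because of subjective assessments, limited clinical resources, and stigma. Scalable, low-barrier AI tools are needed to support early diagnosis.

Contribution: Proposes an AI-based diagnostic approach built on real-world clinical conversations and validates several model classes (GPT-4.1 Mini, MetaLLaMA, and LoRA fine-tuned RoBERTa) for mental health screening, with especially strong results on PTSD.

Method: Using a dataset of 553 semi-structured clinical interviews, the study compares zero-shot prompting (e.g., with GPT-4.1 Mini) against LoRA fine-tuned RoBERTa models and examines the effect of different context lengths.

Result: The models reach up to 89% accuracy and 98% recall on PTSD, and LoRA fine-tuning maintains competitive performance at lower-rank configurations (e.g., rank 8 and 16).

Insight: Shorter, focused context segments improve recall, and LLM-based models have practical potential in low-resource or high-stigma environments.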

Abstract: Mental health disorders remain among the leading causes of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, and stigma and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paired with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using Low-Rank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.

[65] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Wenkai Yang,Weijie Liu,Ruobing Xie,Yiju Guo,Lulu Wu,Saiyong Yang,Yankai Lin

Main category: cs.CL

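TL;DR: LaSeR is a reinforcement learning method based on last-token self-rewarding. A theoretical simplification of the RL objective lets it efficiently optimize both the reasoning and the self-verification capabilities of language models.

Details Motivation: Existing RLVR approaches that train self-verification require the model to generate solutions and self-verifications separately, which is inefficient. LaSeR aims to simplify self-verification and optimize reasoning and verification jointly.

Contribution: Shows theoretically that the closed-form solution to the RL objective reduces to a last-token self-rewarding score, and proposes the LaSeR algorithm, which uses an MSE loss to align these scores with verifier rewards.

Method: LaSeR adds an MSE loss to the RLVR loss to align last-token self-rewarding scores with verifier-based rewards. The score is computed from the next-token probability distribution at the last token, at the cost of only one additional token inference.

Result: Experiments show that LaSeR improves reasoning performance and self-rewarding capability, which in turn boosts inference-time scaling performance.

Insight: The last-token self-rewarding score can serve as an efficient self-verification signal without separate prompt templates.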

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate training of the model’s self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model’s next-token log-probability assigned to any pre-specified token at the solution’s last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model’s reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
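
Read literally, the score described above is the KL coefficient times the gap between one log-probability and a constant, and the added loss is a mean squared error against verifier rewards. The Python sketch below encodes just that arithmetic. The function names and the example numbers are made up for illustration, and how the constant is pre-calculated is not covered by the abstract.

```python
def last_token_self_reward(
    logprob_token: float, ref_constant: float, kl_coef: float
) -> float:
    """KL coefficient times (log-prob of the pre-specified token minus a constant)."""
    return kl_coef * (logprob_token - ref_constant)


def mse_alignment_loss(self_rewards, verifier_rewards) -> float:
    """Mean squared error between self-rewarding scores and verifier rewards."""
    if not self_rewards or len(self_rewards) != len(verifier_rewards):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(s - v) ** 2 for s, v in zip(self_rewards, verifier_rewards)]
    return sum(squared) / len(squared)


# Illustrative numbers only: 0.1 * (-2.0 - (-12.0)) gives a score of 1.0.
score = last_token_self_reward(logprob_token=-2.0, ref_constant=-12.0, kl_coef=0.1)

# One solution already matches its verifier reward; the other is off by 1.0.
loss = mse_alignment_loss([score, 0.0], [1.0, 1.0])
```

Since the score needs only the log-probability of one token at one position, it can be read off a single extra forward step, which matches the cost the abstract reports.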

[66] MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics

Yuxing Lu,Xukai Zhao,J. Ben Tamo,Micky C. Nnamdi,Rui Peng,Shuang Zeng,Xingyu Hu,Jinzhuo Wang,May D. Wang

Main category: cs.CL

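TL;DR: MetaBench is the first benchmark for metabolomics, built to systematically evaluate how large language models (LLMs) perform in this complex scientific domain.

Details Motivation: LLMs perform well on general text, but their proficiency in specialized scientific domains that require deep, interconnected knowledge, such as metabolomics, is largely uncharacterized. Metabolomics is especially challenging because of its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases.

Contribution: Introduces MetaBench, a benchmark that evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research.

Method: MetaBench is curated from authoritative public resources and is used to evaluate 25 open- and closed-source LLMs across metabolomics tasks.

Result: Models perform well on text generation tasks, but cross-database identifier grounding remains difficult even with retrieval augmentation, and performance drops on long-tail metabolites with sparse annotations.

Insight: MetaBench provides infrastructure for developing and evaluating metabolomics AI systems, supporting systematic progress toward reliable computational tools for the field.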

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

[67] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Yu Zhou,Sohyun An,Haikang Deng,Da Yin,Clark Peng,Cho-Jui Hsieh,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

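TL;DR: The paper studies how multimodal generative models handle dialectal input. It builds a large-scale dialect benchmark and proposes an encoder-based mitigation strategy that substantially improves performance across dialects.

Details Motivation: Contact languages such as English have rich dialectal variation, yet the performance of multimodal generative models on dialectal input has not been studied systematically.

Contribution: 1) Builds a large-scale benchmark covering six common English dialects; 2) shows that current multimodal generative models degrade substantially on dialectal input; 3) proposes a general encoder-based mitigation strategy that substantially improves dialect performance.

Method: Works with dialect speakers to collect and verify over 4,200 unique prompts, and evaluates 17 image and video generative models with automatic and human evaluation. The encoder-based mitigation strategy teaches the model to recognize new dialect features while preserving Standard American English (SAE) performance.

Result: The mitigation strategy raises performance on five dialects to be on par with SAE (+34.4%) at near-zero cost to SAE performance.

Insight: Common mitigation methods such as fine-tuning and prompt rewriting improve dialect performance only marginally, whereas the encoder-based strategy is more general and more effective.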

Abstract: Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

[68] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Guoqing Wang,Sunhao Dai,Guangze Ye,Zeyu Gan,Wei Yao,Yong Deng,Xiaofeng Wu,Zhenzhe Ying

Main category: cs.CL

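TL;DR: IGPO is an information-gain-based policy optimization method that uses dense, turn-level intrinsic rewards to address reward sparsity when training multi-turn LLM agents, improving both performance and sample efficiency.

Details Motivation: Existing LLM agent training typically relies on sparse rewards given only at the final answer, which leads to advantage collapse and poor credit assignment in multi-turn tasks. A method that provides dense supervision is needed.

Contribution: Proposes the IGPO framework, which models the process of information gain across turns, defines turn-level intrinsic rewards, and combines them with outcome-level supervision to form dense reward trajectories.

Method: IGPO treats each interaction turn as an incremental step in acquiring information about the ground truth and uses the marginal increase in the policy's probability of producing the correct answer as the turn-level reward. It needs neither an external reward model nor costly Monte Carlo estimation.

Result: In multi-turn scenarios, IGPO outperforms the baselines in both accuracy and sample efficiency.

Insight: Deriving intrinsic rewards from the model's own belief updates is a simple but effective way to supply dense supervision in multi-turn interaction tasks.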

Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy’s probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model’s own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
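
The turn-level reward described above (the marginal increase in the policy's probability of producing the correct answer) can be illustrated with a short sketch. The function names and sample probabilities are hypothetical, and adding the outcome reward to the final turn is only one assumed way of combining the two signals; the abstract does not specify the combination rule.

```python
def turn_level_rewards(answer_probs):
    """answer_probs[0] is the probability of the correct answer before any turn;
    answer_probs[t] is the probability after turn t.
    Each reward is the marginal gain contributed by that turn."""
    return [answer_probs[t] - answer_probs[t - 1] for t in range(1, len(answer_probs))]


def dense_reward_trajectory(answer_probs, outcome_reward):
    """Combine intrinsic turn rewards with the outcome reward.
    Assumption: the outcome reward is added to the last turn."""
    rewards = turn_level_rewards(answer_probs)
    if not rewards:
        return [outcome_reward]
    rewards[-1] += outcome_reward
    return rewards


# Hypothetical rollout with three turns; the second turn slightly hurts.
probs = [0.10, 0.30, 0.25, 0.80]
gains = turn_level_rewards(probs)
trajectory = dense_reward_trajectory(probs, outcome_reward=1.0)
```

Because the gains telescope, they sum to the total change in answer probability over the rollout, so every rollout receives a distinct signal even when the final outcomes are identical.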

[69] Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri,Mukul Ranjan,Zhiqiang Shen

Main category: cs.CL

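TL;DR: The paper proposes Elastic-Cache, a training-free, architecture-agnostic strategy that selectively refreshes the KV cache to cut decoding latency in diffusion large language models (DLMs) while preserving generation quality.

Details Motivation: Existing DLM decoders recompute QKV for all tokens at every denoising step and layer, which causes substantial redundant computation, especially in shallow layers. The paper aims to remove this redundancy.

Contribution: Proposes Elastic-Cache, which uses an attention-aware drift test and a depth-aware schedule to decide dynamically when and where to refresh the KV cache, reducing redundant computation and accelerating decoding.

Method: Built on three observations: (1) distant MASK tokens act mainly as a length bias and can be cached; (2) KV dynamics increase with depth; (3) the most-attended token shows the smallest KV drift. Elastic-Cache jointly decides when to refresh and from which layer.

Result: Significant speedups on tasks such as GSM8K and HumanEval (up to 45.1×) while maintaining higher accuracy than the baseline, with 6.8× higher throughput on GSM8K than existing confidence-based approaches.

Insight: KV cache dynamics are closely tied to layer depth and attention distribution; selective refresh reduces redundant computation and improves decoding efficiency.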

Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods’ decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
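
The "when" and "where" decisions can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: the abstract does not name the drift metric, so cosine similarity, the 0.9 threshold, and all function names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_attended_token(attention_weights):
    """Index of the token that receives the largest attention weight."""
    return max(range(len(attention_weights)), key=lambda i: attention_weights[i])


def needs_refresh(cached_states, current_states, attention_weights, threshold=0.9):
    """'When': test drift only on the most-attended token.
    Assumed metric: refresh if cosine similarity falls below the threshold."""
    idx = most_attended_token(attention_weights)
    return cosine_similarity(cached_states[idx], current_states[idx]) < threshold


def layers_to_recompute(num_layers, start_layer):
    """'Where': recompute from start_layer onward, reuse shallower caches."""
    return list(range(start_layer, num_layers))
```

Testing a single token keeps the check cheap; per observation (3), if even the most stable token has drifted, the other tokens' caches are likely stale as well.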

cs.CV

[70] MultiFoodhat: A potential new paradigm for intelligent food quality inspection

Yue Hu,Guohang Zhuang

Main category: cs.CV

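TL;DR: MultiFoodChat is a dialogue-driven multi-agent reasoning framework that integrates vision-language models and large language models for zero-shot food recognition, avoiding reliance on large labeled datasets.

Details Motivation: Existing supervised models depend heavily on large labeled datasets and generalize poorly to unseen food categories. A more flexible food recognition method that needs no additional training is required.

Contribution: 1. Proposes the MultiFoodChat framework, which combines vision-language models (VLMs) and large language models (LLMs); 2. introduces an Object Perception Token (OPT) and an Interactive Reasoning Agent (IRA) to capture fine-grained visual attributes and reason dynamically; 3. outperforms existing methods on zero-shot food recognition.

Method: MultiFoodChat performs collaborative reasoning through multi-round visual-textual dialogues. The OPT captures fine-grained visual attributes, and the IRA dynamically interprets contextual cues to refine predictions.

Result: On multiple public food datasets, the framework achieves higher recognition accuracy and interpretability than existing unsupervised and few-shot methods.

Insight: A multi-agent reasoning framework that combines VLMs and LLMs offers a new direction for intelligent food quality inspection and reduces dependence on labeled data.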

Abstract: Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

[71] Post-surgical Endometriosis Segmentation in Laparoscopic Videos

Andreas Leibetseder,Klaus Schoeffmann,Jörg Keckstein,Simon Keckstein

Main category: cs.CV

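TL;DR: The paper describes a system that segments endometriosis, specifically dark endometrial implants, in laparoscopic videos, with the aim of helping gynecologists identify and annotate lesions more accurately.

Details Motivation: Endometriosis has a highly varied visual appearance in laparoscopic videos, which makes it difficult to identify and error-prone. The paper proposes an automated segmentation system to assist gynecologists in diagnosis and treatment.

Contribution: A system that automatically segments one frequently occurring appearance of endometriosis (dark endometrial implants) in laparoscopic videos and provides multi-colored overlays and a detection summary.

Method: A model is trained to segment dark endometrial implants in laparoscopic videos. Identified regions are annotated with multi-colored overlays, and a detection summary is generated to support video browsing.

Result: The system analyzes laparoscopic surgery videos, annotates identified implant regions, and displays a detection summary, giving physicians a more intuitive aid.

Insight: Automated segmentation could help gynecologists identify endometriosis in complex video data more efficiently and accurately.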

Abstract: Endometriosis is a common women’s condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.

[72] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models

Jia Yun Chua,Argyrios Zolotas,Miguel Arana-Catania

Main category: cs.CV

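TL;DR: The paper investigates combining traditional vision models with vision-language models (VLMs) for remote sensing image analysis, particularly in few-shot settings, and reports improved accuracy and contextual awareness for aircraft detection and scene understanding.

Details Motivation: Remote sensing produces large volumes of data, but traditional vision models need extensive labelled data and have limited contextual understanding. VLMs integrate visual and textual data, yet their application to remote sensing remains underexplored. The paper combines the strengths of both to improve the efficiency and accuracy of remote sensing image analysis.

Contribution: Proposes combining YOLO with VLMs such as LLaVA, ChatGPT, and Gemini, improving aircraft detection and scene understanding, especially on degraded images and in few-shot learning.

Method: Integrates YOLO, a traditional vision model, with VLMs to obtain more accurate and context-aware image interpretation. The approach is evaluated on labelled and unlabelled remote sensing data and on degraded image scenarios.

Result: For aircraft detection and counting, MAE improves by 48.46% on average across models, and CLIPScore improves by 6.17%, in both raw and degraded scenarios.

Insight: Fusing vision and language models offers a new approach to remote sensing image analysis that is well suited to few-shot learning and degraded imagery, illustrating the potential of multimodal methods.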

Abstract: Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
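
For readers unfamiliar with the counting metric, the snippet below shows how a relative MAE improvement of the kind reported above is computed. The counts are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
def mean_absolute_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted object counts."""
    if not true_counts or len(true_counts) != len(predicted_counts):
        raise ValueError("inputs must be non-empty and equal in length")
    total = sum(abs(t - p) for t, p in zip(true_counts, predicted_counts))
    return total / len(true_counts)


def relative_mae_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# Illustrative aircraft counts for three images (not from the paper).
truth = [10, 4, 7]
baseline_mae = mean_absolute_error(truth, [7, 6, 10])  # e.g. a VLM counting alone
fused_mae = mean_absolute_error(truth, [9, 5, 8])      # e.g. a VLM guided by detector output
improvement = relative_mae_improvement(baseline_mae, fused_mae)
```

Lower MAE is better, so a positive percentage means the combined pipeline counts more accurately than the baseline.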

[73] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Kelvin Szolnoky,Anders Blilie,Nita Mulliqi,Toyonori Tsuzuki,Hemamali Samaratunga,Matteo Titus,Xiaoyi Ji,Sol Erika Boman,Einar Gudlaugsson,Svein Reidar Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radisław Kordek,Roman Łowicki,Brett Delahunt,Kenneth A. Iczkowski,Theo van der Kwast,Geert J. L. H. van Leenders,Katia R. M. Leite,Chin-Chen Pan,Emiel Adrianus Maria Janssen,Martin Eklund,Lars Egevad,Kimmo Kartasalo

Main category: cs.CV

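TL;DR: The paper presents an AI model for detecting cribriform morphology in prostate cancer that reaches pathologist-level performance and could improve diagnostic reliability and consistency.

Details Motivation: Cribriform morphology is a key histological feature associated with poor prognosis in prostate cancer, but its detection varies considerably between pathologists because of subjective assessment.

Contribution: Develops a deep learning model (an EfficientNetV2-S encoder with multiple instance learning) that accurately detects cribriform morphology and performs well in both internal and external validation.

Method: The model is trained on 640 digitised prostate core needle biopsy slides, using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification, and is validated on internal and external cohorts.

Result: Internal validation gives an AUC of 0.97 and a Cohen's kappa of 0.81; external validation gives an AUC of 0.90 and a Cohen's kappa of 0.55. In the inter-rater analysis, the model outperforms all nine pathologists.

Insight: An AI model can substantially improve the consistency of cribriform morphology detection and provide a more reliable basis for diagnosis and treatment decisions in prostate cancer.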

Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model’s performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen’s kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen’s kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen’s kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen’s kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
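
Agreement in this study is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below computes it for two raters; the labels are invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and equal in length")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for eight slides: 1 = cribriform present, 0 = absent.
model = [1, 1, 0, 0, 1, 0, 1, 0]
pathologist = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(model, pathologist)
```

Here the raters agree on 6 of 8 slides (0.75) while chance agreement is 0.5, so kappa is 0.5, noticeably lower than the raw agreement suggests.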

[74] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Junjie Nan,Jianing Li,Wei Chen,Mingkun Zhang,Xueqi Cheng

Main category: cs.CV

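TL;DR: NAPPure is an adversarial purification framework for non-additive perturbations. It separates the clean image from the perturbation parameters through likelihood maximization, substantially improving the robustness of image classifiers under non-additive perturbations.

Details Motivation: Existing adversarial purification methods are designed mainly for additive perturbations and are much less effective against non-additive ones such as blur, occlusion, and distortion. NAPPure is proposed to address this gap.

Contribution: 1. Proposes the NAPPure framework, which extends adversarial purification to non-additive perturbations; 2. disentangles the clean image and the perturbation parameters through likelihood maximization.

Method: 1. Models the generation process of an adversarial image; 2. estimates the clean image and the perturbation parameters through likelihood maximization.

Result: Experiments on the GTSRB and CIFAR-10 datasets show that NAPPure significantly improves model robustness against non-additive perturbations.

Insight: Non-additive perturbations are common in real-world settings, so general adversarial purification methods need to be extended to cover more perturbation types.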

Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.

[75] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen,Wenxuan Zhang,Jun Chen,Mohamed Elhoseiny

Main category: cs.CV

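TL;DR: Vgent is a graph-based retrieval-reasoning-augmented generation framework for long video understanding. Structured graphs and an intermediate reasoning step improve the performance of large video language models on long videos.

Details Motivation: Understanding and reasoning over long videos is difficult for large video language models (LVLMs), mainly because of the need to process long sequences beyond the context window and to retain temporal dependencies. Existing retrieval-augmented generation (RAG) methods, when applied to long video, suffer from disrupted temporal dependencies and interference from irrelevant information.

Contribution: Vgent makes two main contributions: (1) it represents videos as structured graphs that preserve semantic relationships across clips, improving retrieval; (2) it introduces an intermediate reasoning step that uses structured verification to reduce retrieval noise and explicitly aggregate relevant information, improving the accuracy of generated answers.

Method: Vgent uses a graph-based video representation together with an intermediate reasoning module. The structured graph captures semantic relationships between video clips, and retrieval and reasoning steps are combined to improve generation.

Result: The framework is evaluated on three long-video understanding benchmarks. On MLVU, Vgent improves over base models by 3.0% to 5.4% and outperforms state-of-the-art video RAG methods by 8.6%.

Insight: Combining structured graphs with intermediate reasoning improves long video understanding, particularly by preserving temporal dependencies and reducing interference from irrelevant information.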

Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0\%\sim 5.4\%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6\%$. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

[76] Synchronization of Multiple Videos

Avihai Naaman,Ron Shapira Weber,Oren Freifeld

Main category: cs.CV

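TL;DR: The paper proposes Temporal Prototype Learning (TPL), a prototype-based framework for the difficult problem of synchronizing multiple videos, especially videos from different scenes or produced by generative AI. It aligns videos through a shared 1D prototype sequence, avoiding exhaustive pairwise matching and improving synchronization accuracy and efficiency.

Details Motivation: Existing multi-video synchronization methods usually target multi-camera footage of the same scene. Videos from different scenes or from generative AI are far harder to synchronize because of differing subjects, backgrounds, and nonlinear temporal misalignment.

Contribution: 1. Proposes Temporal Prototype Learning (TPL), a prototype learning framework for complex multi-video synchronization; 2. TPL is the first approach to address synchronization of generative AI videos; 3. releases a new multiple video synchronization dataset.

Method: TPL takes high-dimensional embeddings extracted by pretrained models and builds a shared 1D prototype sequence that anchors key action phases, yielding robust alignment without the cost of pairwise matching.

Result: Experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks.

Insight: A shared 1D prototype sequence is an efficient and robust way to align videos, especially in complex settings with nonlinear temporal misalignment.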

Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/

[77] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Emanuel Garbin,Guy Adam,Oded Krams,Zohar Barzelay,Eran Guendelman,Michael Schwarz,Moran Vatelmacher,Yigal Shenkman,Eli Peker,Itai Druker,Uri Patish,Yoav Blum,Max Bluvstein,Junxuan Li,Rawal Khirodkar,Shunsuke Saito

Main category: cs.CV

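TL;DR: The paper presents a zero-shot method for generating 3D Gaussian splatting avatars that creates highly realistic, identity-preserving 3D avatars from unstructured phone images.

Details Motivation: Single-view methods suffer from geometric inconsistencies and hallucinations, which degrade identity preservation, while models trained on synthetic data fail to capture high-frequency details such as skin wrinkles and fine hair.

Contribution: The core contributions are: (1) a generative canonicalization module that converts multiple unstructured views into a standardized, consistent representation; (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars.

Method: A "Capture, Canonicalize, Splat" pipeline uses the generative module and the transformer model to produce static quarter-body avatars from unstructured photos.

Result: The generated 3D avatars are highly realistic and preserve identity robustly.

Insight: Multi-view processing combined with training on a high-fidelity dataset overcomes the limitations of existing methods and enables high-quality 3D avatar generation.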

Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

[78] cubic: CUDA-accelerated 3D Bioimage Computing

Alexandr A. Kalinin,Anne E. Carpenter,Shantanu Singh,Matthew J. O’Meara

Main category: cs.CV

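TL;DR: The paper introduces cubic, an open-source Python library that augments the SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM, addressing the limits of bioimage analysis tools in scalability, efficiency, and integration with modern scientific computing workflows.

Details Motivation: Modern microscopy generates ever-larger datasets, and existing bioimage analysis methods are clearly limited in scalability, efficiency, and integration with modern scientific computing workflows.

Contribution: cubic provides a device-agnostic API that accelerates a broad range of image processing tasks on GPU or CPU, substantially improving processing efficiency for 2D and 3D data.

Method: cubic extends the SciPy and scikit-image APIs with GPU-accelerated functionality from CuPy and RAPIDS cuCIM and dispatches operations in a device-agnostic way.

Result: Experiments show that cubic substantially speeds up pipelines such as deconvolution and segmentation while maintaining algorithmic fidelity.

Insight: cubic provides scalable, reproducible infrastructure for bioimage analysis that integrates with the modern Python scientific computing ecosystem.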

Abstract: Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programming interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic’s API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github.com/alxndrkalinin/cubic

[79] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Yuancheng Xu,Wenqi Xian,Li Ma,Julien Philip,Ahmet Levent Taşel,Yiwei Zhao,Ryan Burgert,Mingming He,Oliver Hermann,Oliver Pilarski,Rahul Garg,Paul Debevec,Ning Yu

Main category: cs.CV

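TL;DR: The paper proposes a framework that achieves multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline, improving video generation for virtual production.

Details Motivation: Existing video diffusion models are limited in multi-view character consistency and 3D camera control, which restricts their use in virtual production. The paper aims to address this.

Contribution: 1. Proposes a customization data pipeline that supports multi-view character consistency and 3D camera control; 2. generates training data using 4D Gaussian Splatting (4DGS) and lighting variation; 3. supports multi-subject generation and several core capabilities for virtual production.

Method: 1. Generates multi-view character data via 4D Gaussian Splatting (4DGS); 2. produces lighting variation with a video relighting model; 3. fine-tunes open-source video diffusion models for character consistency, camera control, and lighting adaptability.

Result: Experiments show improved video quality, personalization accuracy, camera control, and lighting adaptability, advancing the integration of video generation into virtual production.

Insight: Combining 4D Gaussian Splatting with lighting-variation data substantially improves multi-view consistency and dynamic control in video diffusion models, giving virtual production a new tool.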

Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.

[80] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama

Main category: cs.CV

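TL;DR: The paper proposes jointly modeling the Big Five and HEXACO to automatically recognize apparent personality traits from multimodal human behavior. It fills a gap left by prior work, which has not addressed HEXACO, and validates the method experimentally.

Details Motivation: Prior work on multimodal apparent personality-trait recognition has mainly used the Big Five and has not addressed HEXACO (in particular the Honesty-Humility trait) or its relationship to the Big Five. Joint modeling is expected to improve the understanding of human behavior.

Contribution: Proposes the first apparent personality-trait recognition method that jointly models the Big Five and HEXACO and examines the relationship between them.

Method: Uses joint optimization to recognize Big Five and HEXACO traits simultaneously from multimodal behavioral data.

Result: Experiments on a self-introduction video dataset show that the method effectively recognizes both Big Five and HEXACO traits.

Insight: Jointly modeling the Big Five and HEXACO offers a more complete perspective for psychology and behavior analysis, especially for research on the Honesty-Humility trait.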

Abstract: This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

[81] LOTA: Bit-Planes Guided AI-Generated Image Detection

Hongsong Wang,Renxi Cheng,Yang Zhang,Chaolei Han,Jie Gui

Main category: cs.CV

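TL;DR: The paper proposes LOTA, a bit-plane-based method for detecting AI-generated images. By extracting noise features and using an efficient classifier, it substantially improves detection speed and accuracy.

Details Motivation: Rapid progress in GAN and diffusion models makes it harder to distinguish AI-generated images from real ones. Existing methods are computationally expensive and fail to capture the intrinsic noise features of raw images.

Contribution: 1. Uses bit planes to extract noise features; 2. designs a maximum gradient patch selection method to amplify the noise signal; 3. proposes a lightweight classification head and explores two structures for it.

Method: Extracts noise features through bit-plane processing, applies image normalization strategies, computes a noise score from multi-directional gradients, and trains a classifier on the resulting noise signal.

Result: Achieves 98.9% average accuracy on the GenImage benchmark with strong cross-generator generalization, and runs nearly a hundred times faster than existing methods.

Insight: Bit planes represent image noise features effectively, and a lightweight design greatly improves computational efficiency while keeping accuracy high.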

Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (11.9% ↑) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.
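
The idea that lower bit planes carry an image's noise pattern can be illustrated with a small sketch on an 8-bit grayscale image stored as nested lists. This is a generic bit-plane decomposition, not LOTA's pipeline: the function names are hypothetical, and the simple rescaling shown is only an assumed stand-in for the normalization strategies the abstract mentions.

```python
def bit_plane(image, k):
    """Return bit plane k (0 = least significant) of an 8-bit grayscale image."""
    return [[(pixel >> k) & 1 for pixel in row] for row in image]


def low_bit_noise_image(image, num_planes=2):
    """Keep only the lowest num_planes bits of each pixel and rescale to 0..255.
    Assumption: linear scaling is used as the normalization step."""
    mask = (1 << num_planes) - 1
    scale = 255 // mask
    return [[(pixel & mask) * scale for pixel in row] for row in image]


# Tiny 2x2 example image with 8-bit intensities.
image = [[200, 13],
         [255, 0]]
lsb_plane = bit_plane(image, 0)
noise_image = low_bit_noise_image(image, num_planes=2)
```

Because this step is only bit masking and shifting, it is far cheaper than reconstruction-based error extraction, which is consistent with the millisecond-level timing reported in the abstract.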

[82] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Soumyya Kanti Datta,Tanvi Ranga,Chengzhe Sun,Siwei Lyu

Main category: cs.CV

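TL;DR: PIA is a multimodal audio-visual framework that combines language, dynamic face motion, and facial identity cues to improve detection of deepfakes produced by modern generative models.

Details Motivation: Conventional deepfake detectors rely on manually designed phoneme-viseme alignment thresholds, basic frame-level consistency checks, or unimodal strategies, and they struggle with the nearly perfect frames produced by modern generative models such as GANs, diffusion models, and neural rendering. PIA addresses this through multimodal analysis.

Contribution: Proposes the PIA framework, which integrates phoneme sequences, lip geometry data, and advanced facial identity embeddings, and detects deepfakes through inconsistencies across these modalities.

Method: Combines linguistic information (phoneme sequences), dynamic face motion (lip geometry), and identity embeddings (face recognition) in a multimodal analysis that flags temporal and identity-dynamic inconsistencies.

Result: PIA captures the subtle temporal discrepancies and identity-dynamic inconsistencies that conventional methods overlook.

Insight: Multimodal approaches hold considerable promise for deepfake detection. Combining temporal dynamics with identity information in particular can improve detection accuracy.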

Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA

[83] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

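TL;DR: This paper proposes event interval modulation (EIM), a modulation scheme designed for event-based optical camera communication (OCC) that raises transmission speed by encoding information in the intervals between events. Indoor experiments demonstrate its effectiveness.

Details Motivation: Conventional frame-based OCC systems suffer from low bit rates and high processing load, while existing event-based OCC systems do not fully exploit the characteristics of the event-based vision sensor (EVS).

Contribution: Proposes the EIM scheme, tunes the EVS frequency response, experimentally determines the maximum usable modulation order, and achieves 28 kbps over 10 meters and 8.4 kbps over 50 meters.

Method: Designs EIM, which modulates information onto the intervals between events; tunes the EVS parameters and experimentally determines the maximum modulation order; runs transmission experiments to verify performance.

Result: Transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment, which the authors report as a new benchmark for bit rate in event-based OCC systems.

Insight: EIM takes advantage of the asynchronous operation of the EVS, gives event-based OCC an efficient modulation method, and shows the potential of the EVS for high-speed communication.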

Abstract: Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC systems that use an event-based vision sensor (EVS) as the receiver have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied; however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
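The core idea, carrying a symbol in the time gap between two consecutive events, can be illustrated with a toy encoder and decoder. This is a sketch under assumed parameters, not the paper's scheme: the linear mapping from symbol to interval, the base interval `base_us`, the step `step_us`, and the modulation order of 4 are hypothetical choices made for illustration.

```python
def eim_encode(symbols, base_us=10.0, step_us=2.0, start_us=0.0):
    """Map each symbol k to an event interval of base_us + k * step_us; return event timestamps."""
    times = [start_us]
    for k in symbols:
        times.append(times[-1] + base_us + k * step_us)
    return times

def eim_decode(times, base_us=10.0, step_us=2.0, order=4):
    """Recover symbols from the intervals between consecutive events.

    Each interval is snapped to the nearest valid step and clamped to [0, order - 1],
    which tolerates timing jitter smaller than half a step.
    """
    symbols = []
    for t0, t1 in zip(times, times[1:]):
        k = round((t1 - t0 - base_us) / step_us)
        symbols.append(min(max(k, 0), order - 1))
    return symbols
```

With a modulation order of 4, each interval carries two bits. A higher order packs more bits into each interval but needs finer timing resolution at the receiver, which is presumably why the maximum usable order has to be determined experimentally.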

[84] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering

Mingkai Liu,Dikai Fan,Haohua Que,Haojia Gao,Xiao Liu,Shuxue Peng,Meixia Lin,Shengyu Gu,Ruicong Ye,Wanli Qiu,Handong Yao,Ruopeng Zhang,Xianliang Huang

Main category: cs.CV

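TL;DR: MACE uses a mixture of experts (MoE) and an auxiliary-loss-free load balancing (ALF-LB) strategy to cut the computational cost of large-scale scene localization and rendering, achieving efficient localization and high-quality rendering.

Details Motivation: Localization and rendering in large-scale scenes are computationally expensive. Existing scene coordinate regression (SCR) methods perform well on small scenes but are limited by the capacity of a single network when extended to large ones.

Contribution: Proposes MACE, which combines MoE with the ALF-LB strategy to deliver efficient localization and high-quality rendering in large-scale scenes.

Method: A gating network implicitly classifies inputs and selects among sub-networks so that only one sub-network is activated per inference; the ALF-LB strategy is added to improve localization accuracy.

Result: On the Cambridge test set, the method reaches high-quality rendering with only 10 minutes of training, at significantly reduced cost.

Insight: The MoE structure is well suited to large-scale scenes, and the ALF-LB strategy further improves localization accuracy.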

Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance the localization accuracy in large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
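Activating a single sub-network per inference corresponds to top-1 routing in a mixture of experts. The sketch below shows that routing step only, under loud assumptions: the experts are stand-in linear layers, the gate is a single matrix, and the ALF-LB strategy is not modelled because the abstract does not specify it.

```python
import numpy as np

def top1_route(features, gate_w, experts):
    """Send each sample to the one expert with the highest gate score and run only that expert.

    features: (N, D) array; gate_w: (D, E) gating matrix (hypothetical);
    experts: list of E (weight, bias) pairs standing in for the sub-networks.
    """
    logits = features @ gate_w                 # (N, E) gate scores
    choice = logits.argmax(axis=1)             # index of the selected expert per sample
    out_dim = experts[0][0].shape[1]
    out = np.empty((features.shape[0], out_dim), dtype=features.dtype)
    for e, (w, b) in enumerate(experts):
        idx = np.nonzero(choice == e)[0]
        if idx.size:                           # experts with no assigned samples are skipped
            out[idx] = features[idx] @ w + b
    return out, choice
```

Because only the chosen expert runs for each sample, per-sample inference cost stays close to that of a single sub-network regardless of how many experts the model holds.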

[85] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen,Wentao Jiang,Yiran Zhu,Tiezheng Ge,Zhiguo Cao,Bo Zheng

Main category: cs.CV

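TL;DR: This paper proposes IPRO, a reinforcement-learning-based method for identity-preserving image-to-video generation that tunes diffusion models with reward guidance to address the identity consistency problem in existing methods.

Details Motivation: Existing image-to-video methods preserve human identity poorly, especially when the person's expression and motion change substantially. Because people are highly sensitive to identity changes, this is a pressing problem.

Contribution: 1. Proposes the IPRO framework, which applies reinforcement learning to video diffusion model tuning to strengthen identity preservation; 2. Designs a facial scoring mechanism that uses faces in ground-truth videos as facial feature pools to improve generalization; 3. Adds KL-divergence regularization to stabilize training.

Method: 1. Uses a reward signal to guide diffusion model optimization; 2. Backpropagates the reward signal through the last steps of the sampling chain to accelerate convergence; 3. Introduces a facial feature pool and multi-angle facial scoring; 4. Adds KL-divergence regularization to prevent overfitting to the reward signal.

Result: Experiments on the Wan 2.2 I2V model and an in-house I2V model demonstrate IPRO's effectiveness in improving identity consistency in generated videos.

Insight: Reinforcement learning can tune diffusion models directly, without auxiliary modules or architectural changes. Identity consistency remains a critical, under-explored problem in I2V generation.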

Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.

[86] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Xiangyu Meng,Zixian Zhang,Zhenghao Zhang,Junchao Liao,Long Qin,Weizhi Wang

Main category: cs.CV

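TL;DR: Identity-GRPO uses reinforcement learning to optimize multi-human identity-preserving video generation. It combines a reward model trained on human feedback with a GRPO variant, improves both VACE and Phantom, and achieves up to an 18.9% gain in human consistency metrics.

Details Motivation: Existing methods such as VACE and Phantom struggle to preserve the identities of multiple people in dynamic interactions, so a new technique is needed to optimize identity consistency.

Contribution: Proposes Identity-GRPO, an optimization pipeline that combines human feedback with a GRPO algorithm to improve identity consistency in multi-human video generation.

Method: Builds a video reward model from a preference dataset of human-annotated and synthetic distortion data, and designs a GRPO variant tailored to multi-human consistency.

Result: Experiments show that Identity-GRPO improves human consistency metrics by up to 18.9% over baseline methods.

Insight: Combining human feedback with reinforcement learning is an effective route to better personalized video generation, particularly in multi-human interaction scenarios.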

Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
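For readers unfamiliar with GRPO, the step that gives it its name is the group-relative advantage: several samples are generated for the same prompt, each is scored by the reward model, and each reward is normalized against the group's mean and standard deviation. The sketch below shows the standard form of that normalization. It is not the multi-human variant the paper proposes, whose details the abstract does not give, and the function name is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each reward against its own sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    # eps keeps the division finite when every sample in the group gets the same reward.
    return (r - r.mean()) / (r.std() + eps)
```

Samples that score above the group average receive a positive advantage and are reinforced, those below are discouraged, and no separate value network is needed.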

[87] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching

Tingman Yan,Tao Liu,Xilian Yang,Qunfei Zhao,Zeyang Xia

Main category: cs.CV

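TL;DR: The paper proposes MatchAttention, an attention mechanism that dynamically matches relative positions to address the quadratic complexity and lack of explicit matching constraints in high-resolution cross-view matching. Combined with MatchDecoder and occlusion handling, it enables efficient high-resolution matching.

Details Motivation: Existing cross-view matching methods struggle with high-resolution images because cross-attention has quadratic complexity and lacks explicit matching constraints, so a more efficient attention mechanism is needed.

Contribution: 1) Proposes MatchAttention, which dynamically matches relative positions to reduce computational complexity; 2) designs MatchDecoder, a hierarchical cross-view decoder; 3) proposes gated cross-MatchAttention and a consistency-constrained loss to handle cross-view occlusions.

Method: BilinearSoftmax provides continuous, differentiable sliding-window attention sampling, and relative positions are embedded into the feature channels so they can be updated iteratively across layers. MatchDecoder and the occlusion-handling components improve the learning of matching relationships.

Result: State-of-the-art performance on Middlebury, KITTI 2012, KITTI 2015, ETH3D, and Spring flow. MatchStereo-B needs only 29 ms for KITTI-resolution inference, and MatchStereo-T processes 4K UHD images in 0.1 seconds using 3 GB of GPU memory.

Insight: Dynamically matching relative positions, together with an efficient hierarchical decoder, improves high-resolution matching, while occlusion handling lets the model focus more on learning matching relationships.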

Abstract: Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.

[88] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

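TL;DR: The paper proposes a robust demodulation scheme for optical camera communication with an event-based vision sensor and is the first to report a BER below 10^-3 at 200 m and 60 kbps and at 400 m and 30 kbps in outdoor experiments.

Details Motivation: To solve the signal demodulation problem for optical camera communication in long-range outdoor environments and to improve link reliability and data rate.

Contribution: Proposes a demodulation scheme that combines OOK with toggle demodulation and a digital phase-locked loop, and gives the first report of this level of performance in a long-range outdoor environment.

Method: Builds on an event-based vision sensor and combines OOK modulation, toggle demodulation, and a digital phase-locked loop; the scheme is validated experimentally.

Result: BER below 10^-3 in outdoor experiments at 200 m with 60 kbps and at 400 m with 30 kbps.

Insight: Event-based vision sensors show promise for long-range optical communication, and a robust demodulation scheme is essential for practical use.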

Abstract: We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.

[89] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Alexander Valverde,Brian Xu,Yuyin Zhou,Meng Xu,Hongyun Wang

Main category: cs.CV

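TL;DR: GauSSmart is a hybrid method that combines 2D foundation models with 3D Gaussian Splatting reconstruction. It uses 2D segmentation priors and high-dimensional feature embeddings to improve detail and coverage in sparse regions, and it outperforms existing Gaussian Splatting in most evaluated scenes.

Details Motivation: Current Gaussian Splatting methods struggle to capture fine details or maintain realism in sparsely covered regions. 2D foundation models offer rich visual information, but combining the two effectively remains a challenge.

Contribution: Proposes GauSSmart, which combines 2D foundation models (such as DINO) with Gaussian Splatting and improves 3D reconstruction quality through 2D semantic feature supervision.

Method: Uses 2D segmentation priors and high-dimensional feature embeddings (such as those from DINO) to guide the densification and refinement of Gaussian splats, improving coverage of underrepresented areas and preserving fine detail.

Result: Validated on three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of evaluated scenes, demonstrating the potential of hybrid 2D-3D approaches.

Insight: 2D foundation models can supply rich semantic and geometric priors for 3D reconstruction and compensate for sparse data, opening a direction for future work.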

Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.

[90] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi,Huy-Hieu Pham,Duc-Trong Le

Main category: cs.CV

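TL;DR: The paper proposes CLEAR, a causal-inference-based framework for domain shift in histopathology. By modelling causal relationships rather than relying on statistical correlations alone, it improves robustness on unseen domains.

Details Motivation: Domain shift in histopathology, caused for example by differences in acquisition processes or data sources, severely limits the generalization of deep learning models. Existing methods rely mainly on statistical correlations and overlook causal relationships, which leaves a gap to fill.

Contribution: 1. Proposes CLEAR, a new framework based on causal inference; 2. Designs transformation strategies based on the front-door principle that explicitly incorporate mediators and observed tissue slides; 3. Validates the method on public and private datasets.

Method: Uses a causal inference framework with transformation strategies that include mediators, modelling causal relationships explicitly to reduce the influence of confounders.

Result: Up to a 7% improvement on both CAMELYON17 and a private histopathology dataset, outperforming existing baselines.

Insight: Causal inference is a promising tool for domain shift, particularly for leveraging semantic features while reducing the effect of confounders.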

Abstract: Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
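For reference, the front-door principle the abstract invokes is usually stated as Pearl's front-door adjustment formula. The notation below is an assumption made for illustration and is not taken from the paper: $X$ stands for the input slide, $M$ for the mediator, and $Y$ for the tumor label. How CLEAR maps its transformation strategies onto these terms is not described in the abstract.

```latex
% Front-door adjustment (standard form): the causal effect of X on Y
% is identified through the mediator M even when X and Y share an
% unobserved confounder.
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The inner sum averages the outcome over all observed inputs $x'$ for a fixed mediator value, which is what removes the confounder's influence on the $M \to Y$ link.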

[91] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Kyungryul Back,Seongbeom Park,Milim Kim,Mincheol Kwon,SangHyeok Lee,Hyunyoung Lee,Junhee Cho,Seunghyun Park,Jinkyu Kim

Main category: cs.CV

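TL;DR: The paper proposes a training-free tri-layer contrastive decoding method with watermarking that reduces hallucinations in large vision-language models (LVLMs) and produces more visually grounded outputs.

Details Motivation: LVLMs perform well on multimodal tasks but remain prone to hallucination, relying heavily on a single modality or memorizing training data without grounding their outputs. The authors aim to address this without any additional training.

Contribution: The main contribution is a watermark-based tri-layer contrastive decoding method that selects a mature layer and an amateur layer, identifies a pivot layer, and applies contrastive decoding to reduce hallucinations.

Method: Three steps: (1) select a mature layer and an amateur layer among the decoding layers; (2) identify a pivot layer using a watermark-related question; (3) apply tri-layer contrastive decoding to generate the final output.

Result: On the public benchmarks POPE, MME, and AMBER, the method achieves state-of-the-art performance in reducing LVLM hallucinations and generating visually grounded responses.

Insight: Layer-wise contrastive decoding guided by a watermark can improve the visual grounding of LVLMs and the accuracy of their outputs without additional training.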

Abstract: Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations – they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
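The abstract does not say how the three layers are combined, so the sketch below illustrates only the simpler two-layer building block that layer-contrastive decoding methods generally share: the next token is chosen by contrasting the log-probabilities of a mature layer against those of an amateur layer, restricted to tokens the mature layer already finds plausible. The function names, the plausibility threshold `alpha`, and the omission of the pivot layer are all assumptions.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array of logits."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def contrastive_next_token(mature_logits, amateur_logits, alpha=0.1):
    """Two-layer contrastive decoding step (illustrative, not the paper's tri-layer rule).

    A token is eligible only if its mature-layer probability is at least
    alpha times the top mature-layer probability; among eligible tokens,
    pick the one whose log-probability rises most from amateur to mature.
    """
    lp_m = log_softmax(np.asarray(mature_logits, dtype=np.float64))
    lp_a = log_softmax(np.asarray(amateur_logits, dtype=np.float64))
    plausible = lp_m >= np.log(alpha) + lp_m.max()
    scores = np.where(plausible, lp_m - lp_a, -np.inf)
    return int(scores.argmax())
```

The plausibility mask matters: without it, a token that both layers consider very unlikely could win simply because the amateur layer dislikes it even more.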

[92] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection

Shivangi Yadav,Arun Ross

Main category: cs.CV

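TL;DR: The paper proposes the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), which generates high-quality synthetic ocular images to address data scarcity in iris presentation attack detection. It combines the strengths of diffusion models and generative adversarial networks and improves the performance of attack detection systems.

Details Motivation: Iris biometric systems are vulnerable to attacks using artificial eyes, printed eye images, or cosmetic contact lenses. Because presentation attacks are difficult to construct and image, datasets for training and evaluation are scarce. The authors propose MID-StyleGAN to generate synthetic ocular images across multiple domains.

Contribution: 1. Proposes the MID-StyleGAN framework, which combines diffusion models and GANs to generate high-quality synthetic ocular images. 2. Designs a multi-domain architecture that supports translation between domains. 3. Maintains domain consistency through an adaptive loss function.

Method: Combines diffusion models and GANs in a multi-domain architecture that translates between bonafide ocular images and presentation attack domains (such as printed eyes and cosmetic contact lenses), with an adaptive loss function tailored to ocular data.

Result: On the LivDet2020 dataset, the true detect rate at a 1% false detect rate rose from 93.41% to 98.72%, showing that the generated data improves PAD system performance.

Insight: Synthetic data generation is an effective way to address data scarcity in biometrics, and the MID-StyleGAN framework suggests an approach to data augmentation for iris and other ocular biometrics.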

Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
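The headline metric, true detect rate (TDR) at a fixed false detect rate (FDR), can be computed from detector scores as sketched below. The sketch assumes that higher scores mean "more likely a presentation attack", that TDR is the fraction of attack samples flagged, and that FDR is the fraction of bonafide samples flagged. The function name and the tie-handling choice are hypothetical and not taken from the paper or the LivDet protocol.

```python
import numpy as np

def tdr_at_fdr(pa_scores, bonafide_scores, target_fdr=0.01):
    """True detect rate at the strictest threshold whose false detect rate is <= target_fdr."""
    if not 0.0 <= target_fdr < 1.0:
        raise ValueError("target_fdr must be in [0, 1)")
    pa = np.asarray(pa_scores, dtype=np.float64)
    bf = np.sort(np.asarray(bonafide_scores, dtype=np.float64))[::-1]  # descending
    # Number of bonafide samples we may wrongly flag; the small epsilon guards
    # against floating-point products such as 0.07 * 100 = 7.000000000000001.
    allowed = int(np.floor(target_fdr * bf.size + 1e-9))
    # A sample is flagged when its score is strictly above the threshold, so at most
    # `allowed` bonafide samples (fewer if scores tie) exceed it.
    threshold = bf[allowed]
    return float((pa > threshold).mean())
```

Reporting TDR at a low fixed FDR, rather than plain accuracy, reflects the operating point that matters in deployment: genuine users should rarely be rejected.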

[93] Vision-Centric Activation and Coordination for Multimodal Large Language Models

Yunnan Wang,Fan Lu,Kecheng Zheng,Ziyuan Huang,Ziqiang Li,Wenjun Zeng,Xin Jin

Main category: cs.CV

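TL;DR: VaCo optimizes the representations of multimodal large language models (MLLMs) through vision-centric activation and coordination, improving visual comprehension and analytical ability.

Details Motivation: Mainstream MLLMs are supervised only by next-token prediction on text tokens, which neglects vision-centric information that is essential for analytical abilities.

Contribution: Proposes the VaCo framework, which optimizes MLLM representations using task-aware features from vision foundation models (VFMs) and Visual Alignment Layers (VALs), and introduces Modular Task Queries (MTQs) and a Token Gateway Mask (TGM) to coordinate representation conflicts across VFMs.

Method: MTQs and VALs activate specific visual signals under the supervision of different VFMs, and the TGM restricts information flow among groups of MTQs to avoid conflicts between VFMs.

Result: VaCo significantly improves the performance of different MLLMs on various benchmarks and shows strong visual comprehension.

Insight: Effective integration and coordination of visual information is key to improving the analytical ability of multimodal models.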

Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.

[94] Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration

Siddharth Tourani,Jayaram Reddy,Sarvesh Thakur,K Madhava Krishna,Muhammad Haris Khan,N Dinesh Reddy

Main category: cs.CV

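TL;DR: This paper proposes a self-supervised RGB-D registration method that uses cycle-consistent keypoints as anchor points and a pose block combining a GRU with transformation synchronization. It improves registration accuracy and surpasses previous self-supervised methods on ScanNet and 3DMatch.

Details Motivation: With consumer depth cameras now widespread, large amounts of unlabeled RGB-D data are available, but effective ways to use this data for geometric reasoning about scenes are still lacking. The paper aims to improve RGB-D registration through self-supervised learning.

Contribution: 1. Uses cycle-consistent keypoints as anchor points to enforce spatial coherence; 2. Introduces a pose block that combines a GRU with transformation synchronization, blending historical and multi-view data; 3. Outperforms other self-supervised methods on ScanNet and 3DMatch.

Method: 1. Extraction and matching of cycle-consistent keypoints; 2. A pose block that combines a GRU with transformation synchronization; 3. A self-supervised training framework.

Result: Surpasses previous self-supervised methods on ScanNet and 3DMatch, and even outperforms some older supervised methods.

Insight: Combining cycle-consistent keypoints with historical data improves the robustness and accuracy of RGB-D registration, and self-supervised methods hold considerable promise for geometric tasks.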

Abstract: With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration methods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self-supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.

[95] Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu,Peng Gao,Lewei Lu,Xiaoqin Zhang,Ling Shao,Shijian Lu

Main category: cs.CV

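TL;DR: SPR (Spatial Preference Rewarding) improves the spatial understanding of multimodal large language models (MLLMs) by rewarding detailed region descriptions with precise object localization. It scores model outputs on semantics and localization, and experiments show gains on standard referring and grounding benchmarks.

Details Motivation: MLLMs show promising spatial understanding but still fall short in fine-grained spatial perception, such as generating detailed region descriptions or accurately localizing objects. Existing approaches mainly tune on pre-annotated instruction data and lack direct supervision of the model's actual responses.

Contribution: Proposes SPR, which uses a reward mechanism to improve fine-grained spatial understanding in MLLMs, introduces semantic and localization scores, and applies direct preference optimization (DPO).

Method: SPR takes randomly selected image regions and the MLLM's descriptions of them, evaluates quality with semantic and localization scores, pairs the best-scored refined description with the lowest-scored initial description, and uses the pairs for DPO to strengthen alignment with the visual input.

Result: On standard referring and grounding benchmarks, SPR effectively improves MLLM spatial understanding with minimal training overhead.

Insight: By directly rewarding detailed responses with precise localization, rather than relying only on pre-annotated data, SPR captures fine-grained spatial information better and offers a new direction for improving MLLM spatial capabilities.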

Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with Spatial Preference Rewarding (SPR), an approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
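The pairing step described above, best-scored refinement as the preferred response and lowest-scored initial description as the rejected one, might look like the sketch below. Everything named here is hypothetical: the `Scored` record, the choice to sum the semantic and localization scores into one total, and the output dictionary keys. The abstract does not say how the two scores are combined.

```python
from dataclasses import dataclass

@dataclass
class Scored:
    """A region description with its two quality scores (hypothetical container)."""
    text: str
    semantic: float
    localization: float

    @property
    def total(self) -> float:
        # Assumption: the two scores are simply added; the paper may weight them differently.
        return self.semantic + self.localization

def build_preference_pair(initial, refined):
    """Pair the best refined description (chosen) with the worst initial one (rejected)."""
    chosen = max(refined, key=lambda d: d.total)
    rejected = min(initial, key=lambda d: d.total)
    return {"chosen": chosen.text, "rejected": rejected.text}
```

Pairs built this way maximize the quality gap between the two responses, which gives the preference optimization step a clear signal about what a well-localized description looks like.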

[96] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun,Jungwon Park,Jumgmin Ko,Changin Choi,Wonjong Rhee

Main category: cs.CV

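TL;DR: DOS is a method that modifies CLIP text embeddings to make multi-object image generation more accurate and to reduce object mixing.

Details Motivation: Existing text-to-image models often neglect or mix objects when a prompt names several of them, especially in four scenarios: Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects.

Contribution: Proposes DOS, which modifies three types of CLIP text embeddings, raises the success rate of multi-object image generation, and reduces object mixing.

Method: DOS applies directional adjustments to three types of CLIP text embeddings before they are passed to the text-to-image model, so that different objects are separated more clearly.

Result: DOS outperforms four competing methods on multi-object generation, receiving 26.24%-43.04% more votes in human evaluations across four benchmarks.

Insight: Targeted adjustment of the embedding vectors improves semantic separation between objects and raises generation quality.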

Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[97] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights

Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan

Main category: cs.CV

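TL;DR: The paper proposes DRBD-Mamba, which uses dual-resolution bi-directional Mamba to capture multi-scale long-range dependencies for efficient brain tumor segmentation, with better computational efficiency and robustness than existing methods.

Details Motivation: Brain tumor segmentation is important for clinical diagnosis and treatment, but tumor subregions are heterogeneous, and existing Mamba-based models carry high computational overhead and have not been tested for robustness across data partitions. An efficient and reliable solution is needed.

Contribution: 1. Proposes DRBD-Mamba, which combines dual resolution and bi-directional Mamba to improve segmentation efficiency and quality; 2. Introduces a space-filling curve and a gated fusion module to improve feature representation; 3. Proposes a systematic five-fold evaluation framework to verify robustness.

Method: 1. Bi-directional Mamba captures multi-scale long-range dependencies; 2. A space-filling curve preserves spatial locality in the 3D-to-1D feature mapping and reduces reliance on multi-axial scans; 3. A gated fusion module adaptively integrates forward and reverse contexts; 4. A quantization block discretizes features to improve robustness.

Result: On the BraTS2023 20% test set, Dice improves by 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor, with a 15-fold improvement in efficiency.

Insight: Through systematic evaluation and an efficient design, DRBD-Mamba keeps accuracy high while greatly reducing computational cost, making it a more practical option for brain tumor segmentation.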

Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
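All of the reported gains are in the Dice score, which measures the overlap between a predicted mask and the reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch for binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are choices made here, not taken from the paper or the BraTS evaluation code.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if denom == 0 else float(2.0 * inter / denom)
```

In BraTS-style evaluation the score is computed separately for each region (whole tumor, tumor core, enhancing tumor), which is why the abstract reports three numbers.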

[98] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble

Brandon Hill,Kma Solaiman

Main category: cs.CV

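TL;DR: BoardVision is a deployable motherboard defect detection framework that combines the strengths of YOLOv7 and Faster R-CNN, balances precision and recall with a lightweight ensemble called CTV Voter, and evaluates robustness under realistic perturbations.

Details Motivation: PCB inspection research has mostly targeted bare-board or trace-level defects, while assembly-level motherboard defect detection remains underexplored. BoardVision aims to fill this gap with a practical solution.

Contribution: 1) Proposes CTV Voter (Confidence-Temporal Voting), a lightweight ensemble of YOLOv7 and Faster R-CNN; 2) systematically evaluates the models on the MiracleFactory dataset; 3) releases a deployable GUI tool.

Method: Combines YOLOv7 (stronger precision) with Faster R-CNN (stronger recall) and merges their outputs through CTV Voter to strike a better balance. Robustness is evaluated under realistic perturbations such as sharpness, brightness, and orientation changes.

Result: CTV Voter balances precision and recall. The robustness evaluation highlights stability challenges under perturbation, and the GUI tool turns the research into something operators can use.

Insight: Single models have limitations in motherboard defect detection that an ensemble can offset, and real-world deployment must account for robustness to perturbations.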

Abstract: Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors, YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.

[99] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis

Chao Tu,Kun Huang,Jie Zhang,Qianjin Feng,Yu Zhang,Zhenyuan Ning

Main category: cs.CV

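TL;DR: DCMIL is a progressive representation learning model that efficiently processes whole slide images (WSIs) for cancer prognosis without relying on dense annotations, and it outperforms existing methods on 12 cancer types.

Details Motivation: Computational pathology uses WSIs to quantify morphological heterogeneity and build objective prognostic models for cancer, but computational bottlenecks and scarce annotations limit progress. Existing methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments.

Contribution: DCMIL is an easy-to-hard progressive learning framework based on dual-curriculum contrastive multi-instance learning. It processes gigapixel WSIs directly without dense annotations and provides prognosis prediction, uncertainty estimation, and biomarker discovery.

Method: Through progressive learning and contrastive multi-instance learning, DCMIL extracts fine-grained information from multi-magnification WSIs and uses uncertainty estimation to capture the diversity of tumor microenvironments.

Result: In experiments on 12 cancer types, DCMIL outperforms standard WSI-based prognostic models, identifies prognosis-salient regions, provides robust uncertainty estimates, and captures morphological differences between tumor and normal tissue.

Insight: DCMIL improves prognostic accuracy and offers computational pathology an efficient learning framework that needs no dense annotations, with the potential to advance biomedical research.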

Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic models for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All code has been made publicly accessible at https://github.com/tuuuc/DCMIL.

[100] Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang,Yifan Bian,Li Li,Jingran Wu,Xianguo Zhang,Dong Liu

Main category: cs.CV

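TL;DR: The paper proposes a real-time neural video compression framework with unified intra and inter coding. It addresses the inefficiency of existing methods in handling disocclusion and new content, and it further improves compression by exploiting interframe redundancy in both directions.

Details Motivation: Existing neural video compression schemes handle disocclusion, new content, and interframe error propagation inefficiently. Borrowing the idea of intra coding from classic video coding, the paper proposes a unified framework to address these problems.

Contribution: 1. A unified intra and inter coding model; 2. a design that exploits interframe redundancy bidirectionally; 3. experiments showing better performance than the existing scheme DCVC-RT.

Method: 1. A single model adaptively performs intra and inter coding; 2. two frames are compressed simultaneously to exploit both forward and backward redundancy.

Result: An average BD-rate reduction of 10.7% relative to DCVC-RT, more stable per-frame bitrate and quality, and retained real-time encoding and decoding.

Insight: Unifying intra and inter coding handles disocclusion and new content effectively, and bidirectional redundancy exploitation further improves compression efficiency.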

Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, and interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow an idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only in the forward direction but also in the backward direction. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performance. Code and models will be released.

[101] Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Sven Jacob,Weijia Shao,Gjergji Kasneci

Main category: cs.CV

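TL;DR: The paper proposes a structured universal adversarial attack on video object detection. It uses nuclear norm regularization to produce minimal perturbations concentrated in the background and optimizes them with an adaptive optimistic exponentiated gradient method. Results show that it outperforms existing methods in effectiveness while remaining stealthy.

Details Motivation: Video object detection is vital in safety-critical applications, yet deep learning models are vulnerable to universal adversarial perturbations. Existing methods fall short in perturbation structure and stealthiness, which motivates a more effective attack.

Contribution: 1. A structured universal adversarial attack tailored to video object detection; 2. nuclear norm regularization that yields minimal perturbations concentrated in the background; 3. an adaptive optimistic exponentiated gradient method that improves optimization efficiency and convergence.

Method: 1. Nuclear norm regularization generates structured perturbations; 2. an adaptive optimistic exponentiated gradient method performs the optimization; 3. the objective minimizes perturbation magnitude while maximizing attack effectiveness.

Result: On video object detection, the proposed attack outperforms low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness.

Insight: Structured perturbation design (for example, concentrating on the background) and efficient optimization are key to stronger adversarial attacks.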

Abstract: Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

[102] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi

Main category: cs.CV

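TL;DR: This is a systematic scoping review of unsupervised deep generative models for anomaly detection in neuroimaging. It synthesises 49 studies published between 2018 and 2025 and covers autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models applied to brain MRI and CT.

Details Motivation: Fully supervised methods need large annotated datasets and are limited to specific pathologies, whereas unsupervised deep generative models can be trained on healthy data alone and still detect anomalies, which offers an alternative route for neuroimaging anomaly detection.

Contribution: The paper systematically summarises the use of unsupervised deep generative models for anomaly detection in neuroimaging, compares reported performance metrics and architectural design choices, and highlights the strengths of generative models in producing pseudo-healthy reconstructions and supporting semi-supervised learning.

Method: A PRISMA-guided scoping review of 49 studies, covering several families of generative models (autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models) and their applications to brain MRI and CT.

Result: Generative models perform well on large focal lesions, show progress on more subtle abnormalities, and produce interpretable pseudo-healthy reconstructions.

Insight: Unsupervised deep generative models are promising where annotated data are scarce and diseases are heterogeneous. Future work should prioritise anatomy-aware modelling, foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.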

Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 and 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.
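The idea of identifying anomalies as deviations from a pseudo-healthy reconstruction can be illustrated with a residual map. This is a minimal sketch under stated assumptions: the reconstruction is a hand-written stand-in for the output of a generative model trained on healthy data, and the threshold value is hypothetical rather than taken from any reviewed study.

```python
def residual_map(image, reconstruction):
    """Absolute per-pixel difference between a scan and its
    pseudo-healthy reconstruction."""
    return [[abs(x - r) for x, r in zip(row_x, row_r)]
            for row_x, row_r in zip(image, reconstruction)]


def anomaly_mask(residual, threshold):
    """Flag pixels whose residual exceeds the threshold."""
    return [[value > threshold for value in row] for row in residual]


# Toy 3x3 "scan" with one bright lesion-like pixel in the centre.
scan = [
    [0.20, 0.25, 0.20],
    [0.22, 0.90, 0.21],
    [0.20, 0.24, 0.20],
]
# Stand-in for what a model trained on healthy data would reconstruct
# (assumption: it reproduces normal tissue and removes the lesion).
pseudo_healthy = [[0.20] * 3 for _ in range(3)]

residual = residual_map(scan, pseudo_healthy)
mask = anomaly_mask(residual, threshold=0.3)  # hypothetical threshold
flagged = sum(cell for row in mask for cell in row)
```

Large focal lesions produce large residuals and are easy to threshold; subtle abnormalities leave residuals close to the reconstruction noise, which is consistent with the difficulty the review reports for them.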

[103] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Thomas Katraouras,Dimitrios Rafailidis

Main category: cs.CV

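TL;DR: The paper proposes a strategy for compressing multi-task image restoration models. Its iterative pruning method matches or surpasses dense-model performance at high sparsity while retaining only 10% of the parameters.

Details Motivation: Existing multi-task image restoration models have too many parameters and are computationally inefficient. An effective compression strategy is needed to improve efficiency without sacrificing performance.

Contribution: Proposes MIR-L, a model that uses iterative pruning to discover well-performing subnetworks at high sparsity, substantially reducing the parameter count.

Method: An iterative pruning strategy removes low-magnitude weights over multiple rounds and resets the remaining weights to their original initialization, which benefits the optimization of the multi-task model.

Result: On deraining, dehazing, and denoising, MIR-L retains only 10% of the trainable parameters while maintaining high restoration performance.

Insight: Iterative pruning can uncover high-performing sparse subnetworks ("winning tickets"), which indicates how much room multi-task models leave for optimization.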

Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model’s optimization, effectively uncovering “winning tickets” that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.
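The prune-and-reset loop described in the abstract can be sketched as follows. This is an illustrative toy, not the MIR-L code: `toy_train` is a hypothetical stand-in for real training, and the 20% per-round pruning rate and 10 rounds are assumptions chosen so that roughly a tenth of the weights survive.

```python
import random


def iterative_magnitude_pruning(init_weights, rounds, rate, train_fn):
    """Each round: train, prune the lowest-magnitude surviving weights,
    then reset the survivors to their original initialization."""
    mask = [True] * len(init_weights)
    weights = list(init_weights)
    for _ in range(rounds):
        trained = train_fn(weights, mask)
        alive = [i for i, keep in enumerate(mask) if keep]
        n_prune = int(len(alive) * rate)
        # Remove the smallest-magnitude trained weights among survivors.
        for i in sorted(alive, key=lambda idx: abs(trained[idx]))[:n_prune]:
            mask[i] = False
        # Rewind survivors to their initial values; pruned weights stay zero.
        weights = [w0 if keep else 0.0
                   for w0, keep in zip(init_weights, mask)]
    return weights, mask


_noise = random.Random(0)


def toy_train(weights, mask):
    """Hypothetical training stand-in: adds small noise to kept weights."""
    return [w + _noise.gauss(0.0, 0.1) if keep else 0.0
            for w, keep in zip(weights, mask)]


_init_rng = random.Random(1)
init = [_init_rng.uniform(-1.0, 1.0) for _ in range(1000)]
final, mask = iterative_magnitude_pruning(init, rounds=10, rate=0.2,
                                          train_fn=toy_train)
kept_fraction = sum(mask) / len(mask)
```

Pruning a fixed fraction of the survivors each round compounds geometrically, so the surviving fraction after n rounds is about (1 - rate)^n; with the assumed settings that is 0.8^10, or roughly 11%.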

[104] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data

Aleksis Pirinen,Delia Fano Yela,Smita Chakraborty,Erik Källman

Main category: cs.CV

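TL;DR: The paper proposes a grazing detection method based on deep learning and Sentinel-2 time series. An ensemble of CNN-LSTM models enables efficient monitoring and provides a reliable tool for conservation compliance inspections.

Details Motivation: Grazing strongly affects agricultural production and biodiversity, yet scalable monitoring is lacking. The study aims to detect grazing efficiently and automatically from Sentinel-2 time series.

Contribution: 1) An ensemble of CNN-LSTM models for grazing detection; 2) evidence that Sentinel-2 data are practical for grazing monitoring; 3) publicly released code and models.

Method: Uses Sentinel-2 L2A time series (April to October) per field polygon. An ensemble of CNN-LSTM models extracts multi-temporal reflectance features for binary classification (grazed / not grazed).

Result: The model reaches an average F1 score of 77% with 90% recall on grazed pastures. Operationally, when at most 4% of sites can be inspected per year, prioritising fields predicted as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection.

Insight: Coarse-resolution, freely available satellite data can effectively steer conservation compliance inspections and give resource allocation an evidence base.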

Abstract: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
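The operational claim, more confirmed non-grazing sites per inspection than random selection, is a lift computation under a fixed inspection budget. Here is a minimal sketch on synthetic data; the scores, labels, and the function name `inspection_lift` are made up for illustration and do not reproduce the paper's 17.2 figure.

```python
def inspection_lift(p_non_grazed, is_non_grazed, budget_fraction):
    """Ratio of confirmed non-grazed sites found by visiting the
    top-ranked fields to the number expected from random visits."""
    n = len(p_non_grazed)
    k = max(1, int(n * budget_fraction))  # number of sites we can visit
    ranked = sorted(range(n), key=lambda i: p_non_grazed[i], reverse=True)
    hits_model = sum(is_non_grazed[i] for i in ranked[:k])
    expected_random = k * sum(is_non_grazed) / n
    return hits_model / expected_random


# Synthetic example: 100 fields, 5 of which are truly non-grazed.
labels = [1] * 5 + [0] * 95
scores = [0.1] * 100
scores[0] = scores[1] = scores[2] = 0.9  # three non-grazed fields ranked high
scores[3] = scores[4] = 0.3              # two non-grazed fields ranked lower
scores[50] = 0.95                        # one confident false alarm

lift = inspection_lift(scores, labels, budget_fraction=0.04)
```

The lift depends on both ranking quality and the base rate of non-grazing: when non-grazed fields are rare, even a moderately accurate ranking gives a large multiple over random inspection.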

[105] Vision Mamba for Permeability Prediction of Porous Media

Ali Kashefi,Tapan Mukerji

Main category: cs.CV

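TL;DR: The paper introduces, for the first time, a network architecture built on Vision Mamba for predicting the permeability of three-dimensional porous media, and it demonstrates advantages in computational efficiency, memory footprint, and parameter count.

Details Motivation: Conventional ViTs and CNNs incur high computational complexity and memory use in permeability prediction for porous media. Vision Mamba, with linear scaling and fewer trainable parameters, is a potential alternative.

Contribution: The first application of Vision Mamba to permeability prediction of three-dimensional porous media, with a demonstration of its advantages in efficiency and accuracy.

Method: A neural network with a Vision Mamba backbone is compared against ViT and CNN models, and an ablation study assesses how each component affects prediction accuracy.

Result: Vision Mamba shows advantages over ViT and CNN models in permeability prediction, with better computational efficiency and lower resource consumption.

Insight: Vision Mamba could replace ViTs as the backbone of large vision models, which is especially attractive in resource-constrained settings.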

Abstract: Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus it can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
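The linear-versus-quadratic scaling argument can be made concrete with a back-of-the-envelope count. The sketch below assumes a cubic volume split into non-overlapping cubic patches and uses order-of-growth proxies only (token pairs for self-attention, sequential steps for a state-space scan); the patch size of 16 is a hypothetical choice, and constant factors are ignored.

```python
def num_tokens(side, patch, dims=3):
    """Number of non-overlapping patches in a cubic volume."""
    return (side // patch) ** dims


def attention_pairs(n_tokens):
    """Self-attention compares every token with every token: O(N^2)."""
    return n_tokens * n_tokens


def scan_steps(n_tokens):
    """A state-space scan visits each token once: O(N)."""
    return n_tokens


small = num_tokens(side=64, patch=16)    # tokens for a 64^3 volume
large = num_tokens(side=128, patch=16)   # tokens for a 128^3 volume

token_growth = large // small
attention_growth = attention_pairs(large) // attention_pairs(small)
scan_growth = scan_steps(large) // scan_steps(small)
```

Doubling the side length of a 3D volume multiplies the token count by 8; under these proxies the scan cost grows by the same factor of 8 while the attention cost grows by 64, which is why the gap matters most for volumetric inputs.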

[106] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing

Qurrat Ul Ain,Atif Aftab Ahmed Jilani,Zunaira Shafqat,Nigar Azhar Butt

Main category: cs.CV

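TL;DR: The paper presents SurgScan, a framework that uses YOLOv8 to detect surgical instrument defects in real time. It reaches 99.3% accuracy with industrial scalability and addresses the shortcomings of manual inspection.

Details Motivation: Defective surgical instruments threaten sterility, mechanical integrity, and patient safety. Manual inspection is inefficient and error-prone, so an automated, high-accuracy solution is needed.

Contribution: Proposes the SurgScan framework, which performs real-time defect detection with YOLOv8, demonstrates the benefit of contrast-enhanced preprocessing, and reaches 99.3% accuracy.

Method: Built on the YOLOv8 architecture and trained on a dataset of 102,876 high-resolution images covering 11 instrument types and five defect categories, with contrast-enhanced preprocessing.

Result: SurgScan achieves the highest accuracy (99.3%) among the compared CNN models, with real-time inference at 4.2-5.8 ms per image, which makes it suitable for industrial deployment.

Insight: Contrast enhancement significantly improves defect detection. SurgScan offers an automated quality-control solution that supports compliance with ISO 13485 and FDA standards in medical device manufacturing.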

Abstract: Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.
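The reported per-image latency translates directly into an upper bound on throughput. This is simple arithmetic under the assumption of one image processed at a time, with no batching and no camera, I/O, or preprocessing overhead; the function name is illustrative.

```python
def images_per_second(latency_ms):
    """Upper-bound throughput for sequential, single-image inference."""
    return 1000.0 / latency_ms


# Latency range reported in the abstract (milliseconds per image).
fastest_ms, slowest_ms = 4.2, 5.8

best_case = images_per_second(fastest_ms)   # about 238 images/s
worst_case = images_per_second(slowest_ms)  # about 172 images/s
```

On a production line the achievable rate would be lower, since image capture, contrast enhancement, and handling time all add to the per-item budget.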

[107] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Yunze Tong,Didi Zhu,Zijing Hu,Jinluan Yang,Ziyu Zhao

Main category: cs.CV

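TL;DR: The paper proposes noise projection, which refines the initial noise conditioned on the text prompt to reduce text-image misalignment in diffusion models, without modifying the pretrained model.

Details Motivation: In text-to-image generation, different initial noises can produce images that fail to align with the prompt. The paper attributes this to a training-inference mismatch: during training the noise lies in a prompt-specific subset of the latent space, whereas at inference it is drawn from a prompt-agnostic Gaussian prior.

Contribution: 1. A noise projector that applies text-conditioned refinement to the initial noise; 2. a reward model distilled from vision-language model feedback, together with an optimization framework; 3. no need for reference images or handcrafted priors, and low inference cost.

Method: 1. Sample noises and obtain token-level feedback on the corresponding images from a vision-language model; 2. distill these signals into a reward model; 3. train the noise projector with a quasi-direct preference optimization.

Result: Experiments show that the method improves text-image alignment across diverse prompts.

Insight: The mismatch between training-time and inference-time noise distributions is a key cause of text-image misalignment, and text-conditioned noise refinement mitigates it.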

Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

[108] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Handong Zheng,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

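TL;DR: PaddleOCR-VL is a resource-efficient multilingual document parsing model. Its core is a compact 0.9B-parameter vision-language model (VLM) that combines a dynamic-resolution visual encoder with a language model, supports 109 languages, and performs strongly on a range of element recognition tasks.

Details Motivation: Conventional document parsing models consume substantial resources and underperform on multilingual content and complex elements, so an efficient, lightweight solution is needed.

Contribution: Proposes PaddleOCR-VL-0.9B, an ultra-compact vision-language model that supports multilingual and complex-element recognition while combining strong performance with low resource consumption.

Method: Integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model for efficient document parsing.

Result: Achieves SOTA performance on public and in-house benchmarks, significantly outperforms existing solutions, and delivers fast inference.

Insight: Compact VLMs have strong potential for multilingual document parsing because they balance performance and resource efficiency.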

Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

[109] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Xinrui Huang,Fan Xiao,Dongming He,Anqi Gao,Dandan Li,Xiaofan Zhang,Shaoting Zhang,Xudong Wang

Main category: cs.CV

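TL;DR: The paper presents DentVFM, a family of vision foundation models designed for dentistry. Trained with self-supervised learning on multi-modal dental radiographic images, it improves generalization and label efficiency and outperforms existing methods on a range of dental tasks.

Details Motivation: Interpretation of dental radiographs is limited by a shortage of trained professionals, and existing AI systems are restricted by single-modality focus, task-specific design, and reliance on labeled data. To address this, the authors propose more general dental vision foundation models.

Contribution: Proposes DentVFM, the first family of vision foundation models for dentistry; builds DentVista, a dataset of about 1.6 million multi-modal images; and designs DentBench, a benchmark covering eight dental subspecialties.

Method: DentVFM is based on the Vision Transformer architecture and is trained with self-supervised learning on the multi-modal DentVista dataset to produce task-agnostic visual representations.

Result: DentVFM performs strongly on tasks such as disease diagnosis and treatment analysis, outperforms supervised, self-supervised, and weakly supervised baselines, and in cross-modality diagnosis gives more reliable results than experienced dentists when conventional imaging is unavailable.

Insight: Vision foundation models in dentistry show the potential of self-supervised learning and multi-modal data, offering a scalable and efficient route to intelligent dental healthcare.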

Abstract: Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.

[110] Exploring Image Representation with Decoupled Classical Visual Descriptors

Chenyuan Qu,Hao Chen,Jianbo Jiao

Main category: cs.CV

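TL;DR: The paper proposes VisualSplit, a framework that decouples images into classical visual descriptors (such as edges and colour) to improve the visual understanding of modern learning methods, preserving interpretability while supporting advanced visual tasks.

Details Motivation: Deep learning performs well on visual tasks, but its internal representations are opaque. Classical visual descriptors (such as edges and colour) are intuitive yet underused in modern learning. The paper addresses this gap by exploring what classical descriptors can contribute to modern learning.

Contribution: 1. Proposes the VisualSplit framework, which explicitly decomposes images into several classical descriptors. 2. Uses a reconstruction-driven pre-training scheme that preserves the interpretability of the descriptors while capturing their essence. 3. Validates the method on advanced visual tasks such as generation and editing, going beyond conventional classification and segmentation.

Method: 1. Decompose the image into independent but complementary classical descriptors such as edges and colour. 2. Apply reconstruction-driven pre-training so that the model captures each descriptor. 3. Support control over these descriptors for tasks such as image generation and editing.

Result: VisualSplit performs well beyond classification and segmentation and supports advanced visual tasks such as image generation and editing, which demonstrates the practical value of classical descriptors in modern learning.

Insight: Decoupling classical visual descriptors preserves their interpretability and gives deep learning a richer representation of visual knowledge, widening its use in advanced visual tasks.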

Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.
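To make the notion of decoupled classical descriptors concrete, the sketch below extracts three simple cues named in the abstract (edge, colour, and intensity distribution) from a tiny RGB image. It is only an illustration of what such descriptors look like; the specific operators (forward-difference edges, a 4-bin histogram, mean colour) are assumptions and not the extraction pipeline used by VisualSplit.

```python
def to_gray(img):
    """Average the RGB channels of each pixel."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in img]


def edge_map(gray):
    """Edge strength as the sum of absolute forward differences."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gray[y][min(x + 1, w - 1)] - gray[y][x]
            gy = gray[min(y + 1, h - 1)][x] - gray[y][x]
            out[y][x] = abs(gx) + abs(gy)
    return out


def intensity_histogram(gray, bins=4, max_value=255):
    """Count pixels per intensity bin."""
    hist = [0] * bins
    for row in gray:
        for value in row:
            hist[min(int(value * bins / (max_value + 1)), bins - 1)] += 1
    return hist


def mean_colour(img):
    """Average RGB colour over the whole image."""
    n = sum(len(row) for row in img)
    return tuple(sum(px[c] for row in img for px in row) / n
                 for c in range(3))


# Toy 2x4 image: left half black, right half pure red.
row = [(0, 0, 0), (0, 0, 0), (255, 0, 0), (255, 0, 0)]
image = [list(row), list(row)]

gray = to_gray(image)
edges = edge_map(gray)
hist = intensity_histogram(gray)
colour = mean_colour(image)
```

Each descriptor captures a different, human-readable property: the edge map fires only at the black-to-red boundary, the histogram summarises brightness, and the mean colour ignores spatial layout entirely.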

[111] Exploring Cross-Modal Flows for Few-Shot Learning

Ziqi Jiang,Yanghao Wang,Long Chen

Main category: cs.CV

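TL;DR: The paper proposes Flow Matching Alignment (FMA), which learns a cross-modal velocity field to overcome a limitation of existing parameter-efficient fine-tuning (PEFT) methods: their one-step adjustment is insufficient on complex datasets.

Details Motivation: Existing PEFT methods (such as prompt tuning, LoRA, or adapters) adjust visual or textual features in a single step and struggle with complex datasets where the modalities are highly entangled.

Contribution: Proposes FMA, the first model-agnostic multi-step adjustment approach, which learns a cross-modal velocity field to achieve more precise and robust alignment.

Method: FMA combines a fixed coupling strategy that ensures category correspondence, a noise augmentation strategy that alleviates data scarcity, and an early-stopping solver that improves both efficiency and accuracy.

Result: Across multiple benchmarks and backbones, FMA outperforms one-step PEFT methods, with particularly large gains on challenging datasets.

Insight: Multi-step adjustment and noise augmentation are effective strategies for feature alignment in cross-modal tasks.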

Abstract: Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today’s PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
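The multi-step adjustment with early stopping can be pictured as Euler integration of a velocity field that is halted before the final step. The sketch below is a toy under stated assumptions: the learned network is replaced by the ideal straight-path velocity for one fixed (image feature, text feature) pair, and the function names, step counts, and two-dimensional features are all hypothetical.

```python
def euler_transport(x0, velocity, num_steps, stop_step):
    """Integrate dx/dt = velocity(x, t) from t = 0 with Euler steps,
    halting after stop_step of the num_steps planned steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(stop_step):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Fixed coupling (assumption): this image feature is always paired
# with the text feature of its own class.
image_feature = [0.0, 1.0]
text_feature = [1.0, 0.0]


def straight_path_velocity(x, t):
    """Stand-in for a learned velocity network. For the linear path
    x_t = (1 - t) * x0 + t * x1 the target velocity is x1 - x0."""
    return [b - a for a, b in zip(image_feature, text_feature)]


# Full transport uses all 4 steps; early stopping halts after 3.
full = euler_transport(image_feature, straight_path_velocity, 4, 4)
partial = euler_transport(image_feature, straight_path_velocity, 4, 3)
```

A one-step method corresponds to a single large update, whereas here each step re-evaluates the velocity at the current position; stopping early leaves the feature part-way between modalities and saves the remaining evaluations.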

[112] Consistent text-to-image generation via scene de-contextualization

Song Tang,Peihao Gong,Kunyu Li,Kai Guo,Boyu Wang,Mao Ye,Jianwei Zhang,Xiatian Zhu

Main category: cs.CV

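TL;DR: The paper proposes SDeC, a training-free prompt embedding editing method that improves the consistency of text-to-image generation by suppressing the scene-identity correlation learned from natural-image training data.

Details Motivation: Existing text-to-image (T2I) methods often fail to keep a subject's identity consistent across scenes, because scene and subject identity are natively correlated (a phenomenon called scene contextualization). Prior methods need to know all target scenes in advance, which is unrealistic in practice.

Contribution: 1. Shows that scene contextualization is a key source of identity shift and proves its near-universality theoretically; 2. proposes the training-free SDeC method, which suppresses the latent scene-ID correlation to improve generation consistency; 3. SDeC supports per-scene use without prior access to all target scenes.

Method: SDeC applies SVD to the ID prompt embedding, quantifies directional stability, and adaptively re-weights the corresponding eigenvalues to suppress the scene-ID correlation. It acts as a training-free inversion of the model's built-in scene contextualization.

Result: Experiments show that SDeC significantly improves identity preservation while maintaining scene diversity.

Insight: 1. A strong correlation between scene and identity arises naturally in T2I models; 2. decoupling scene from identity improves consistency without sacrificing diversity; 3. training-free methods are better suited to real-world use.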

Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I’s built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt’s embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

[113] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang,Cheng Shi,Yang Wang,Sibei Yang

Main category: cs.CV

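TL;DR: The paper proposes Ego Proactive Video-LLM, a model that takes streaming egocentric video as input and proactively understands, anticipates, and responds to events. It also introduces a new benchmark, ESTP-Bench, and the ESTP-F1 metric for evaluation.

Details Motivation: The goal is an AI that actively understands and responds to its environment as a human would, moving beyond passive observation.

Contribution: 1) Introduces the ESTP-Bench and ESTP-F1 evaluation framework; 2) proposes a technical pipeline comprising a data engine, a multi-stage training strategy, and a proactive dynamic compression technique.

Method: 1) A data engine generates training data; 2) a multi-stage training strategy optimizes the model; 3) a proactive dynamic compression technique improves efficiency.

Result: The model outperforms multiple baselines across diverse online and offline benchmarks while addressing the key properties of the task.

Insight: The paper extends AI from passive observation to proactive understanding and response, pointing to a new direction for future intelligent assistants.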

Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Second, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/

[114] Talking Points: Describing and Localizing Pixels

Matan Rusanovsky,Shimon Malnick,Shai Avidan

Main category: cs.CV

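TL;DR: The paper proposes a novel pixel-level grounding framework with two components: a Point Descriptor, which generates keypoint descriptions, and a Point Localizer, which regresses pixel coordinates from those descriptions. It fills a gap in the pixel-level understanding of existing vision-language models.

Details Motivation: Existing vision-language models are limited to object-level or region-level grounding and cannot reason about keypoints at pixel precision through natural language.

Contribution: 1. Proposes the first pixel-level grounding framework, consisting of a Point Descriptor and a Point Localizer; 2. releases the LlamaPointInPart dataset (20K+ image-keypoint-description triplets); 3. optimizes the Point Descriptor with GRPO, using the frozen Point Localizer as the reward model to improve description quality; 4. designs a new evaluation protocol.

Method: 1. The Point Descriptor generates keypoint descriptions; 2. the Point Localizer regresses pixel coordinates from the descriptions; 3. the Point Descriptor is optimized with GRPO on the AP-10K dataset; 4. the framework is trained on the LlamaPointInPart dataset.

Result: Experiments show that the method outperforms baseline models on LlamaPointInPart.

Insight: The bidirectional framework opens up potential applications in both keypoint-guided image understanding and language-guided precise localization.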

Abstract: Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results, we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to measure how close the point predicted from that description is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
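The evaluation protocol scores a description by how accurately a localizer can recover the point from it, rather than by text similarity. A minimal sketch of such a distance-based score follows; the normalization by the image diagonal, the 0.05 accuracy threshold, and the reward shape are assumptions for illustration and may differ from the paper's exact metric.

```python
import math


def localization_error(pred, gt, image_size):
    """Euclidean distance between predicted and ground-truth points,
    normalized by the image diagonal (assumed normalization)."""
    width, height = image_size
    distance = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    return distance / math.hypot(width, height)


def accuracy_at(preds, gts, image_size, threshold=0.05):
    """Fraction of points localized within the normalized threshold."""
    hits = sum(localization_error(p, g, image_size) <= threshold
               for p, g in zip(preds, gts))
    return hits / len(preds)


def localization_reward(pred, gt, image_size):
    """Hypothetical reward for a description: higher when the frozen
    localizer lands closer to the true keypoint."""
    return 1.0 - min(1.0, localization_error(pred, gt, image_size))


# Made-up localizer outputs for two descriptions of the same keypoint.
size = (400, 300)                  # image diagonal is 500 pixels
truth = (100, 100)
good_pred, bad_pred = (103, 104), (160, 180)

good_error = localization_error(good_pred, truth, size)
acc = accuracy_at([good_pred, bad_pred], [truth, truth], size)
```

Scoring through the localizer rewards descriptions that are useful for finding the point, even when their wording differs from any reference text.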

[115] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Zhifei Chen,Tianshuo Xu,Leyi Wu,Luozhou Wang,Dongyu Yan,Zihan You,Wenting Luo,Guo Zhang,Yingcong Chen

Main category: cs.CV

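TL;DR: STANCE is an image-to-video generation framework that uses sparse-to-dense anchored encoding to address motion coherence in video generation and improve temporal consistency.

Details Motivation: Maintaining coherent object motion and interactions in video generation remains challenging, mainly because motion hints carry too little information after encoding and because joint optimization favors appearance over temporal consistency.

Contribution: 1. Introduces Instance Cues, which turn sparse, user-editable hints into a dense 2.5D motion field; 2. proposes Dense RoPE, which preserves the salience of motion cues in token space.

Method: Combines Instance Cues, which produce a dense motion field, with Dense RoPE, which encodes motion tokens, while jointly predicting RGB and an auxiliary map (segmentation or depth).

Result: Optimization is stabilized and temporal coherence improves, without requiring per-frame trajectory scripts.

Insight: Separating the structure and appearance tasks avoids optimization conflicts and improves the temporal coherence of generated video.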

Abstract: Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues – a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB + auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
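
The Instance Cues construction (averaging per-instance flow and attaching depth over the instance mask) might look roughly like the sketch below. The data layout, the zero fill outside the mask, and the function name are assumptions made for illustration; the abstract does not specify how the cue is stored or encoded.

```python
def instance_cue(flow, depth, mask):
    """Build a dense per-pixel (dx, dy, depth) field for one instance.

    flow:  H x W grid of (dx, dy) tuples (sparse or noisy motion hints)
    depth: H x W grid of floats (e.g. from a monocular depth estimator)
    mask:  H x W grid of bools marking the instance
    """
    inside = [(y, x) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    if not inside:
        raise ValueError("empty instance mask")
    mean_dx = sum(flow[y][x][0] for y, x in inside) / len(inside)
    mean_dy = sum(flow[y][x][1] for y, x in inside) / len(inside)
    # Every pixel inside the mask gets the instance-average flow plus its own
    # depth; pixels outside the mask carry no cue (assumed to be zeros here).
    return [
        [(mean_dx, mean_dy, depth[y][x]) if mask[y][x] else (0.0, 0.0, 0.0)
         for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]

flow = [[(2.0, 0.0), (4.0, 2.0)], [(0.0, 0.0), (0.0, 0.0)]]
depth = [[1.0, 3.0], [5.0, 7.0]]
mask = [[True, True], [False, False]]
cue = instance_cue(flow, depth, mask)
```

Averaging makes the whole instance move rigidly in the image plane, while the per-pixel depth channel supplies the camera-relative component that a plain 2D arrow lacks.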

[116] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers

Hugo Markoff,Jevgenijs Galaktionovs

Main category: cs.CV

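TL;DR: The paper proposes a hierarchical re-classification system that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and uses a five-stage pipeline to improve species-level identification, reaching 96.5% accuracy on the re-classified detections of the evaluated dataset segment.

Details Motivation: Existing animal classification models such as SpeciesNet use conservative rollup strategies, so many animals are labeled only at high taxonomic levels rather than at species level, which limits identification precision.

Contribution: A hierarchical re-classification system that combines CLIP embeddings with metric learning to refine high-level taxonomic labels toward species-level identification.

Method: A five-stage pipeline: high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring.

Result: On a segment of the LILA BC Desert Lion Conservation dataset, the system recovers 761 bird detections and re-classifies 456 detections with 96.5% accuracy, with species-level identification reaching 64.9%.

Insight: Combining several machine learning techniques with a hierarchical strategy is an effective way to handle fine-grained recognition in complex classification tasks.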

Abstract: State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from “blank” and “animal” labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent
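
Two of the pipeline stages named above, centroid building and cosine-distance scoring, can be sketched in a few lines. This is a simplified illustration: the embeddings are toy 2D vectors rather than CLIP features, inputs are assumed to be non-zero, and the fixed `max_distance` threshold stands in for the adaptive scoring, whose details the abstract does not give.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def build_centroids(embeddings_by_species):
    """Mean of L2-normalized embeddings per species, re-normalized."""
    centroids = {}
    for species, vecs in embeddings_by_species.items():
        normed = [l2_normalize(v) for v in vecs]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        centroids[species] = l2_normalize(mean)
    return centroids

def reclassify(embedding, centroids, max_distance=0.3):
    """Return (nearest species, cosine distance); species is None if too far."""
    q = l2_normalize(embedding)
    best, best_d = None, float("inf")
    for species, c in centroids.items():
        d = 1.0 - sum(a * b for a, b in zip(q, c))
        if d < best_d:
            best, best_d = species, d
    return (best, best_d) if best_d <= max_distance else (None, best_d)

centroids = build_centroids({
    "lion": [[1.0, 0.0], [0.9, 0.1]],
    "oryx": [[0.0, 1.0], [0.1, 0.9]],
})
label, distance = reclassify([1.0, 0.05], centroids)
```

Returning `None` for distant embeddings keeps the original high-level label in place instead of forcing a species guess.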

[117] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering

Hugo Markoff,Jevgenijs Galaktionovs

Main category: cs.CV

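TL;DR: This paper studies zero-shot organization of wildlife images with self-supervised vision transformers, compares combinations of clustering methods and architectures, and deploys continuous similarity ordering in production.

Details Motivation: Camera traps produce very large volumes of wildlife images, and many species are not covered by existing classifiers, so zero-shot methods are needed to organize unlabeled images efficiently.

Contribution: Compares the performance of different self-supervised vision transformer architectures and clustering methods, and demonstrates continuous similarity ordering in production use.

Method: Applies unsupervised clustering (DBSCAN, GMM) to embeddings from CLIP, DINOv2, and MegaDescriptor, combined with PCA or UMAP dimensionality reduction, and obtains a 1D similarity ordering via t-SNE projection.

Result: DINOv2 with UMAP and GMM reaches 88.6% accuracy (macro-F1 = 0.874) on a 5-species test set; 1D sorting reaches 88.2% coherence for mammals and birds and 95.2% for fish.

Insight: Self-supervised vision transformers perform well for zero-shot wildlife sorting, and continuous similarity ordering is an efficient tool for rapid exploratory analysis and manual annotation.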

Abstract: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.

[118] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong,Jiaqi Gu,Qi Yang,Lubin Fan,Yue Wu,Ying Wang,Kun Ding,Shiming Xiang,Jieping Ye

Main category: cs.CV

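TL;DR: This paper proposes Wiki-PRF, a three-stage method (Processing, Retrieval, Filtering) that addresses multimodal query quality and retrieval relevance in knowledge-based visual question answering (KB-VQA); the model is trained with reinforcement learning and significantly improves answer quality.

Details Motivation: Existing retrieval-augmented generation (RAG) methods for KB-VQA fall short on the quality of multimodal queries and the relevance of retrieved results, so more precise methods are needed.

Contribution: Proposes Wiki-PRF, which consists of three stages (Processing, Retrieval, Filtering) and a model trained with reinforcement learning, significantly improving KB-VQA performance.

Method: Dynamically invokes visual tools to extract multimodal information for retrieval, integrates visual and text features for multimodal knowledge retrieval, and trains the model with reinforcement learning to strengthen its reasoning and filtering.

Result: On the E-VQA and InfoSeek datasets, the method reports significant improvements in answer quality (36.0 and 42.8) and achieves state-of-the-art performance.

Insight: Precise extraction of multimodal information and dynamic filtering are key to KB-VQA performance, and reinforcement learning is an effective way to improve the model's reasoning and filtering.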

Abstract: Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning with answer accuracy and format consistency as reward signals. This enhances the model’s reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
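
A reward that combines answer accuracy with format consistency, as the abstract describes, could be sketched as follows. The `<think>`/`<answer>` tag layout, the exact-match accuracy check, and the weights `w_acc` and `w_fmt` are all assumptions for illustration; the abstract does not state the actual format or weighting.

```python
import re

# Hypothetical response format: reasoning in <think>, final answer in <answer>.
ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def reward(response, gold_answer, w_acc=1.0, w_fmt=0.5):
    """Weighted sum of a format-consistency term and an exact-match accuracy term."""
    match = ANSWER_PATTERN.match(response.strip())
    format_ok = match is not None
    answer = match.group(1) if format_ok else ""
    accurate = format_ok and normalize(answer) == normalize(gold_answer)
    return w_acc * float(accurate) + w_fmt * float(format_ok)

good = reward("<think>it is a tower</think><answer>Eiffel Tower</answer>", "eiffel tower")
wrong = reward("<think>x</think><answer>Big Ben</answer>", "eiffel tower")
unformatted = reward("Eiffel Tower", "eiffel tower")
```

In this sketch a correct answer outside the expected format earns nothing, which pushes the policy to keep its output parseable.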

[119] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

Ning Ding,Keisuke Fujii,Toru Tamaki

Main category: cs.CV

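TL;DR: The paper proposes Shot2Tactic-Caption, a multi-scale video captioning framework for tactical understanding of badminton videos that generates shot-level captions describing individual actions and tactic-level captions capturing how tactics are dynamically executed.

Details Motivation: Tactical understanding in badminton requires attention both to individual actions and to how tactics unfold over time, but no existing video captioning method describes both levels.

Contribution: 1. Introduces the Shot2Tactic-Caption Dataset, the first badminton tactic captioning dataset; 2. designs a dual-branch framework that generates shot-level and tactic-level captions; 3. introduces a Tactic Unit Detector and a prompt-guided mechanism that improve the coherence and accuracy of tactic captions.

Method: A dual-branch design in which each branch contains a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder. The tactic captioning branch adds a Tactic Unit Detector and a prompt-guided mechanism that injects tactic type and state information via cross-attention.

Result: Experiments show that the framework is effective at generating both shot-level and tactic-level captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants and that the prompt-guided mechanism improves the coherence of tactic captions.

Insight: Multi-scale captioning (shot level plus tactic level) is essential for a full understanding of badminton matches, and the prompt-guided mechanism captures dynamic changes in tactics such as interruption and resumption.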

Abstract: Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose Shot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

[120] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Natan Bagrov,Eugene Khvedchenia,Borys Tymchenko,Shay Aharon,Lior Kadoch,Tomer Keren,Ofri Masad,Yonatan Geifman,Ran Zilberstein,Tuomas Rintamaki,Matthieu Le,Andrew Tao

Main category: cs.CV

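TL;DR: EVS prunes temporally redundant tokens to cut the computational cost of video processing, speeding up VLM inference and supporting longer input sequences without harming semantic fidelity.

Details Motivation: When processing long videos, VLMs face high computational cost and latency because of the very large number of tokens, which limits the scalability of video understanding.

Contribution: Proposes EVS, which prunes temporally redundant tokens without model changes or retraining, substantially reducing token count while maintaining semantic accuracy.

Method: Identifies and prunes static regions that remain unchanged across consecutive frames while preserving positional information; it can be combined with uptraining on stochastic pruning rates to improve robustness.

Result: EVS reduces LLM time-to-first-token by up to 4x, and with uptraining it retains performance under aggressive pruning.

Insight: Pruning temporal redundancy is an effective way to make video processing more efficient without sacrificing semantic quality.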

Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches – spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
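
The core idea of pruning temporally static patches while keeping positional identity can be sketched as below. This is a toy version under stated assumptions: frames are small grayscale grids, patches are compared in pixel space by mean absolute difference, and `threshold` is an arbitrary value. The abstract does not say which space or criterion EVS actually uses.

```python
def changed_patch_mask(prev_frame, frame, patch, threshold):
    """Grid of booleans: True where a patch differs from the previous frame."""
    h, w = len(frame), len(frame[0])
    mask = []
    for py in range(0, h, patch):
        row = []
        for px in range(0, w, patch):
            diffs = [abs(frame[y][x] - prev_frame[y][x])
                     for y in range(py, min(py + patch, h))
                     for x in range(px, min(px + patch, w))]
            row.append(sum(diffs) / len(diffs) > threshold)
        mask.append(row)
    return mask

def prune_video_tokens(frames, patch=2, threshold=0.05):
    """Keep every patch of frame 0, then only patches that changed.

    Returns (frame_idx, patch_row, patch_col) triples, so each surviving
    token keeps its original spatio-temporal position.
    """
    rows = -(-len(frames[0]) // patch)      # ceiling division
    cols = -(-len(frames[0][0]) // patch)
    kept = [(0, r, c) for r in range(rows) for c in range(cols)]
    for t in range(1, len(frames)):
        mask = changed_patch_mask(frames[t - 1], frames[t], patch, threshold)
        kept += [(t, r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    return kept

f0 = [[0.0] * 4 for _ in range(4)]
f1 = [row[:] for row in f0]
f1[0][0] = 1.0                      # only the top-left patch changes
kept = prune_video_tokens([f0, f1])
```

Because the surviving tokens carry their original indices, position embeddings can still be assigned as if no pruning had happened.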

[121] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui,Johannes Schusterbauer,Timy Phan,Felix Krause,Josh Susskind,Miguel Angel Bautista,Björn Ommer

Main category: cs.CV

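TL;DR: RepTok represents an image with a single continuous latent token from a self-supervised vision transformer and achieves efficient image generation by fine-tuning the semantic token embedding and pairing it with a generative decoder.

Details Motivation: Existing generative models often require complex 2D latent spaces and high training costs; RepTok aims to simplify this by using self-supervised representations.

Contribution: Proposes the RepTok framework, which uses self-supervised representations as a compact latent space and significantly reduces training cost while maintaining generation quality.

Method: Building on a pre-trained self-supervised encoder, fine-tunes the semantic token embedding and adds a cosine-similarity loss to preserve the geometry of the latent space.

Result: Achieves competitive results on class-conditional ImageNet generation and on zero-shot text-to-image synthesis on MS-COCO.

Insight: Fine-tuned self-supervised representations can serve as a compact latent space for efficient generation, addressing spatial redundancy and high training cost.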

Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
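
The cosine-similarity regularizer mentioned above can be written out as a small sketch. It assumes the penalty takes the common form of one minus the cosine similarity between the adapted token and the frozen SSL token, added to the flow matching loss with a weight `lam`. Both the exact form and the weight are assumptions, since the abstract only names the loss.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cosine_regularizer(adapted_token, frozen_token):
    """Zero when the adapted token points the same way as the frozen SSL token."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)

def total_loss(flow_matching_loss, adapted_token, frozen_token, lam=0.1):
    """Hypothetical combined objective: generative loss plus weighted regularizer."""
    return flow_matching_loss + lam * cosine_regularizer(adapted_token, frozen_token)

aligned = cosine_regularizer([1.0, 2.0], [2.0, 4.0])      # same direction
orthogonal = cosine_regularizer([1.0, 0.0], [0.0, 1.0])   # drifted away
loss = total_loss(0.5, [1.0, 0.0], [0.0, 1.0], lam=0.1)
```

The penalty ignores token magnitude, so the adapted token is free to encode extra detail in its norm while staying close in direction to the original SSL space.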

[122] SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation

Jihyun Yu,Yoojin Oh,Wonho Bae,Mingyu Kim,Junhyug Noh

Main category: cs.CV

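TL;DR: SteeringTTA is a diffusion-based test-time adaptation method that improves classification robustness by steering diffusion trajectories, with no model updates or source data.

Details Motivation: Existing diffusion-based input adaptation methods rely on gradient guidance, which limits their exploration and their generalization across distortion types.

Contribution: Proposes SteeringTTA, which uses Feynman-Kac steering with pseudo-label-driven rewards to balance exploration and confidence and improve robustness.

Method: Steers multiple particle trajectories with a combination of cumulative top-K probabilities and an entropy schedule to perform input adaptation.

Result: Outperforms the baseline on ImageNet-C without any model updates or source data.

Insight: Steered particles with a dynamic balancing strategy enable more robust input adaptation without relying on gradients.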

Abstract: Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-label. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.
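
The general mechanism behind particle steering, weighting candidate trajectories by a reward and resampling, can be sketched as follows. This is a generic reward-weighted resampling step with a top-K probability mass as the reward, not the paper's algorithm: the cumulative reward over diffusion steps and the entropy schedule are omitted, and `lam` and the sample values are hypothetical.

```python
import math
import random

def top_k_mass(probs, k):
    """Total probability assigned to the k most likely classes."""
    return sum(sorted(probs, reverse=True)[:k])

def resample(particles, rewards, lam=1.0, rng=None):
    """Resample particles with probability proportional to exp(lam * reward)."""
    rng = rng or random.Random(0)
    m = max(rewards)                      # subtract the max for numerical stability
    weights = [math.exp(lam * (r - m)) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))

# Classifier outputs for three candidate (partially denoised) inputs.
outputs = {
    "a": [0.30, 0.30, 0.20, 0.20],
    "b": [0.40, 0.25, 0.20, 0.15],
    "c": [0.90, 0.05, 0.03, 0.02],
}
particles = list(outputs)
rewards = [top_k_mass(outputs[p], k=2) for p in particles]
survivors = resample(particles, rewards, lam=5.0)
```

Resampling needs only forward passes of the classifier, which is what makes this family of methods gradient-free.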

[123] In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Xinyao Liao,Xianfang Zeng,Ziye Song,Zhoujie Fu,Gang Yu,Guosheng Lin

Main category: cs.CV

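TL;DR: The paper proposes a low-cost pretraining strategy that uses in-context learning from unpaired video clips for instruction-based video editing, followed by fine-tuning on a small amount of high-quality paired data, which significantly improves instruction alignment and visual fidelity.

Details Motivation: Instruction-based image editing has progressed rapidly, but its extension to video remains underexplored, mainly because of the high cost and complexity of building large-scale paired video editing datasets.

Contribution: A pretraining strategy based on in-context learning from unpaired video clips that gives a foundation video generation model general editing capabilities, with performance further improved by fine-tuning on a small amount of high-quality paired data.

Method: Built on HunyuanVideoT2V, the framework first pretrains on approximately 1M real video clips to learn basic editing concepts, then fine-tunes on fewer than 150k curated editing pairs to cover more editing tasks and improve editing quality.

Result: Experiments show that the method surpasses existing approaches in both instruction alignment and visual fidelity, with a 12% improvement in editing instruction following and a 15% improvement in editing quality.

Insight: With in-context learning, unpaired data can substantially lower pretraining cost while preserving the model's generalization ability, offering an efficient solution for video editing.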

Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

[124] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg,Rob van Gastel,Melda Yeghaian,Sander Dalm,Faysal Boughorbel,Marcel van Gerven

Main category: cs.CV

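TL;DR: The paper integrates Decorrelated Backpropagation (DBP) into MAE pre-training, significantly reducing ViT training time and carbon emissions while improving performance.

Details Motivation: MAE pre-training performs well in low-label regimes but is computationally expensive, which limits its use in time- and resource-constrained industrial settings. The authors therefore seek a way to accelerate pre-training.

Contribution: Applies the DBP optimization method selectively to the MAE encoder, reducing wall-clock time to baseline performance by 21.1% and carbon emissions by 21.4% while improving segmentation performance (mIoU +1.1).

Method: The core of the method is DBP, which accelerates convergence by iteratively reducing input correlations at each layer. It is applied selectively to the encoder.

Result: With ImageNet-1K pre-training and ADE20K fine-tuning, DBP-MAE shortens training time and lowers carbon emissions; similar gains on proprietary industrial data confirm its practical applicability.

Insight: DBP accelerates ViT pre-training without loss of stability and improves downstream performance, offering an efficient option for resource-constrained settings.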

Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.

[125] EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)

Weikang Yu,Vincent Nwazelibe,Xianping Ma,Xiaokang Zhang,Richard Gloaguen,Xiao Xiang Zhu,Pedram Ghamisi

Main category: cs.CV

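TL;DR: The paper introduces EuroMineNet, the first multitemporal benchmark dataset based on Sentinel-2 multispectral imagery for monitoring spatiotemporal change at 133 mining sites across the European Union, in support of sustainable land management and environmental restoration.

Details Motivation: Mining is essential to economic development but is also a leading source of environmental degradation. Existing datasets are limited in temporal depth or geographic scope, so a consistent long-term monitoring tool is needed to support sustainable resource management.

Contribution: Presents EuroMineNet, the first comprehensive multitemporal benchmark, covering 133 EU mining sites from 2015 to 2024 with annual observations and expert-verified annotations to support the development of GeoAI models.

Method: Based on Sentinel-2 imagery, defines two tasks, multitemporal mining footprint mapping and cross-temporal change detection, and proposes a novel Change-Aware Temporal IoU (CA-TIoU) metric.

Result: Benchmarking 20 state-of-the-art deep learning models shows that GeoAI methods identify long-term environmental changes effectively but still need improvement in detecting short-term dynamics.

Insight: EuroMineNet advances temporally consistent mining monitoring and highlights the potential of GeoAI for social and environmental sustainability.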

Abstract: Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the codes and datasets by aligning with FAIR and the open science paradigm at https://github.com/EricYu97/EuroMineNet.

[126] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging

Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Sami Azam,Asif Karim,Jemima Beissbarth,Amanda Leach

Main category: cs.CV

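TL;DR: WeCKD is a weakly-supervised chained knowledge distillation framework that passes knowledge progressively through a sequence of models, reducing data dependency and improving performance.

Details Motivation: Conventional knowledge distillation depends on a strong teacher model or large labeled datasets, which limits its use in real-world, limited-data scenarios.

Contribution: Proposes the Weakly-supervised Chain-based KD network (WeCKD), which enhances feature learning and reduces data requirements through progressive knowledge transfer.

Method: Uses a chain of models in which each model learns from its predecessor and refines the knowledge before passing it to the next, yielding progressive distillation.

Result: Performs strongly across several medical imaging datasets, matching and in many cases surpassing existing supervised methods without full supervision, with cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data.

Insight: Chained knowledge transfer mitigates the knowledge degradation of one-step distillation and enables efficient learning under weak supervision.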

Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.

[127] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang,Yuanfan Guo,Rolandos Alexandros Potamias,Jiankang Deng,Hang Xu,Chao Ma

Main category: cs.CV

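TL;DR: VTimeCoT is a simple yet effective training-free framework that introduces progress bar tools and a visuotemporal chain-of-thought (CoT) process, significantly improving video temporal grounding and reasoning.

Details Motivation: Existing video question answering systems based on multimodal large language models fall short in video temporal grounding and reasoning, and a more effective solution is needed.

Contribution: 1. Proposes two novel progress bar tools: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool; 2. introduces a visuotemporal chain-of-thought (CoT) process that performs cross-modal reasoning over video and text.

Method: The VTimeCoT framework combines the progress bar tools with the CoT process to perform training-free video temporal grounding and reasoning.

Result: The framework yields significant performance improvements over both Qwen2VL-7B and GPT4o baselines on video temporal grounding and reasoning-based question answering.

Insight: By imitating how humans interact with a video progress bar, VTimeCoT provides a compositional and interpretable reasoning process.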

Abstract: In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io

[128] Leveraging Learned Image Prior for 3D Gaussian Compression

Seungjoo Shin,Jaesik Park,Sunghyun Cho

Main category: cs.CV

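TL;DR: This paper uses learned image priors to improve 3D Gaussian compression, recovering compression-induced quality degradation and significantly improving rate-distortion performance and rendering quality.

Details Motivation: Current 3D Gaussian Splatting (3DGS) compression techniques have made progress in reducing storage overhead, but the lack of learned priors restricts further gains in the rate-distortion trade-off.

Contribution: A novel 3DGS compression framework that uses learned image priors to recover compression-induced quality degradation and improves rate-distortion performance by feeding coarse rendering residuals as side information.

Method: Builds a restoration network on top of initially compressed Gaussians to model compression artifacts, then refines the compressed Gaussians under the supervision of restored images, yielding a compact representation with enhanced rendering performance.

Result: Experiments show that the method outperforms state-of-the-art 3DGS compression methods in rate-distortion performance and rendering quality while requiring substantially less storage.

Insight: Learned image priors improve 3D Gaussian compression, residual side information further improves rate-distortion performance, and the method is compatible with existing compression techniques.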

Abstract: Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.

[129] Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery

Caleb Robinson,Kimberly T. Goetz,Christin B. Khan,Meredith Sackett,Kathleen Leonard,Rahul Dodhia,Juan M. Lavista Ferres

Main category: cs.CV

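TL;DR: The paper proposes a human-in-the-loop whale detection method that uses statistical anomaly detection to screen high-resolution satellite imagery for candidate targets and pairs it with an expert labeling interface, significantly improving detection efficiency and scalability.

Details Motivation: Traditional whale survey methods are expensive and hard to scale, so an efficient automated solution is needed. Prior work has shown that whales can be identified in high-resolution satellite imagery, but large-scale automated detection is still hampered by scarce annotated data and by variable image quality and environmental conditions.

Contribution: 1) A semi-automated whale detection method that does not rely on labeled training data; 2) an efficient expert labeling interface; 3) recall of up to 96.4% while reducing the area requiring expert inspection by up to 99.8%.

Method: Uses statistical anomaly detection to flag spatial outliers ("interesting points") and pairs it with a web-based expert labeling tool for quickly verifying those points.

Result: On three benchmark scenes with known whale annotations, recall ranges from 90.3% to 96.4%, and the area requiring expert inspection drops from over 1,000 sq km to less than 2 sq km in some cases.

Insight: Combining unsupervised anomaly detection with human-in-the-loop labeling is an effective route to large-scale satellite image analysis and could extend to monitoring other marine mammals.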

Abstract: Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. “interesting points”. We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% – from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.
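
Flagging spatial outliers as "interesting points" can be illustrated with a very simple detector. The sketch below assumes a single-band image and a global z-score test with an arbitrary threshold; the abstract does not specify which statistic the released pipeline uses, so treat this as a stand-in rather than the authors' method.

```python
import statistics

def interesting_points(image, z_threshold=3.0):
    """Return (row, col) of pixels whose value is a global z-score outlier."""
    pixels = [v for row in image for v in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:
        return []                      # perfectly uniform scene: nothing stands out
    return [(y, x)
            for y, row in enumerate(image)
            for x, v in enumerate(row)
            if abs(v - mean) / std > z_threshold]

# A flat patch of open water with one bright anomaly.
scene = [[0.1] * 10 for _ in range(10)]
scene[4][7] = 0.9
candidates = interesting_points(scene)
```

Only the flagged coordinates would be passed to the labeling interface, which is how the area an expert must inspect shrinks so sharply.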

[130] Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models

Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig

Main category: cs.CV

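TL;DR: This paper systematically evaluates camera movement classification in historical footage, revealing how existing deep video models perform in this domain and where they fall short.

Details Motivation: Camera movement conveys the spatial and narrative information of video content, but existing methods target modern datasets, and their generalization to historical footage has not been explored.

Contribution: Presents the first systematic evaluation of deep video camera movement classification (CMC) models on historical footage, using the HISTORIAN dataset.

Method: Summarizes representative methods and datasets and evaluates five standard video classification models on the HISTORIAN dataset, with Video Swin Transformer performing best.

Result: The best model, Video Swin Transformer, reaches 80.25% accuracy, showing strong convergence despite limited training data.

Insight: The study highlights the challenges and potential of adapting existing models to low-quality video and motivates future work combining diverse input modalities and temporal architectures.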

Abstract: Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.

[131] Free-Grained Hierarchical Recognition

Seulki Park,Zilin Wang,Stella X. Yu

Main category: cs.CV

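TL;DR: The paper proposes free-grained hierarchical recognition, introducing the ImageNet-F benchmark and mixed-granularity learning methods to address inconsistent annotation granularity in real-world data.

Details Motivation: Real-world image annotations vary in granularity, whereas existing methods usually assume complete, fine-grained labels. The authors propose a more realistic labeling setting and learning framework to address this.

Contribution: (1) The ImageNet-F benchmark, which simulates realistic mixed-granularity annotations; (2) free-grain learning methods that use pseudo-attributes and semi-supervised techniques to improve model performance.

Method: Combines pseudo-attributes generated by vision-language models with semi-supervised learning to strengthen semantic and visual guidance.

Result: Experiments show that the proposed methods substantially improve hierarchical classification performance under mixed supervision.

Insight: The paper brings out the practical problem of inconsistent annotation granularity and offers a flexible solution that combines vision-language models with semi-supervised learning.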

Abstract: Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.

[132] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Simone Carnemolla,Matteo Pennisi,Sarinda Samarasinghe,Giovanni Bellitto,Simone Palazzo,Daniela Giordano,Mubarak Shah,Concetto Spampinato

Main category: cs.CV

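TL;DR: DEXTER is a data-free framework that generates global textual explanations of visual classifiers. It combines diffusion models and large language models, optimizing text prompts to synthesize class-conditional images that reveal a classifier's decision patterns and biases.

Details Motivation: Building transparent and trustworthy AI systems requires understanding and explaining model behavior, but existing methods typically depend on training data or ground-truth labels. DEXTER aims to produce natural language explanations without access to data.

Contribution: Proposes DEXTER, which uses diffusion models and large language models to generate global, natural language explanations of visual classifiers without training data or ground-truth labels.

Method: DEXTER optimizes text prompts to generate class-conditional images that strongly activate the target classifier, then uses these synthetic samples to elicit detailed natural language reports describing class-specific decision patterns and biases.

Result: Experiments on ImageNet, Waterbirds, and other datasets show that DEXTER outperforms existing methods in global model explanation and class-level bias reporting. A user study confirms that its outputs are accurate and interpretable.

Insight: DEXTER shows the potential of combining diffusion models and large language models for model explanation and points to a new direction for data-free explanation methods.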

Abstract: Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier’s decision process without access to training data or ground-truth labels. We demonstrate DEXTER’s flexibility across three tasks (activation maximization, slice discovery and debiasing, and bias explanation), each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

[133] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement

Xu Wu,Zhihui Lai,Xianxu Hou,Jie Zhou,Ya-nan Zhang,Linlin Shen

Main category: cs.CV

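TL;DR: LightQANet is a low-light image enhancement framework that quantizes illumination factors and adapts dynamically to lighting. It addresses the unreliable features that existing methods extract under low light and yields better texture restoration and color consistency.

Details Motivation: Existing low-light enhancement methods struggle to extract reliable features from severely degraded pixel information, which leads to poor texture and color. LightQANet targets this problem.

Contribution: Proposes LightQANet, which combines a static Light Quantization Module (LQM) with a dynamic Light-Aware Prompt Module (LAPM) to enhance images consistently and robustly.

Method: 1. LQM explicitly quantizes illumination-related factors; 2. LAPM uses learnable prompts to dynamically guide feature learning and adapt to complex lighting changes.

Result: Achieves state-of-the-art performance on multiple datasets, with superior results across diverse lighting conditions.

Insight: Combining quantized illumination factors with dynamic adaptation effectively improves the quality and consistency of low-light images.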

Abstract: Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifacts. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.

[134] MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks

Zhang Nengbo,Hann Woei Ho,Ye Zhou

Main category: cs.CV

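TL;DR: The paper proposes a motion-based inter-MAV communication framework built on event vision and spiking neural networks (SNNs). MAVs convey information through flight patterns, which a lightweight SNN decodes; experiments confirm accurate decoding at low power.

Details Motivation: Conventional radio communication in MAV swarms suffers from spectrum congestion, jamming, and high power consumption. Inspired by the honeybee waggle dance, the work explores a vision-based, low-power communication scheme.

Contribution: 1. A novel motion-based inter-MAV communication framework; 2. decoding of flight patterns with event cameras and an SNN; 3. a visual codebook of four motion primitives.

Method: 1. An event frame-based segmentation model extracts the motion signal; 2. a lightweight SNN performs action recognition; 3. an integrated algorithm combines segmentation and classification for robust decoding.

Result: Experiments validate the framework's accuracy, low power consumption, and robustness, showing its potential in constrained environments.

Insight: Visual communication through motion primitives, combined with event cameras and SNNs, offers MAV swarms an efficient, low-power alternative to radio.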

Abstract: Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols ("start", "end", "1", "0"). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework’s effectiveness, demonstrating accurate decoding and low power consumption, and highlight its potential as an energy-efficient alternative for MAV communication in constrained environments.
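
The codebook idea above can be illustrated with a minimal sketch. It assumes the four primitives map to the four symbols in the order the abstract lists them, and that a message is the bit string between a "start" and the following "end"; the primitive names, the framing rule, and the `decode_message` helper are hypothetical and not taken from the paper.

```python
from typing import List, Optional

# Assumed mapping: primitives paired with symbols in the order the abstract lists them.
CODEBOOK = {
    "vertical": "start",
    "horizontal": "end",
    "left_up_right": "1",
    "left_down_right": "0",
}


def decode_message(primitives: List[str]) -> Optional[str]:
    """Return the bits between the latest 'start' and the next 'end'.

    Returns None when no complete start...end frame is found.
    """
    bits: List[str] = []
    in_frame = False
    for primitive in primitives:
        symbol = CODEBOOK.get(primitive)
        if symbol is None:
            continue  # unrecognized primitive: ignored (hypothetical policy)
        if symbol == "start":
            bits = []  # a new start symbol resets the frame
            in_frame = True
        elif symbol == "end":
            if in_frame:
                return "".join(bits)
        elif in_frame:
            bits.append(symbol)
    return None


if __name__ == "__main__":
    observed = ["vertical", "left_up_right", "left_down_right",
                "left_up_right", "horizontal"]
    print(decode_message(observed))
```

In the described system the primitive labels would come from the segmentation model and SNN classifier; here they are supplied directly as strings.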

[135] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi,Youngsun Lim,Jaeyo Shin,Hyunjung Shim

Main category: cs.CV

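TL;DR: CoT-PL combines visual chain-of-thought reasoning with pseudo-labeling for open-vocabulary object detection. By decomposing object understanding into steps and adding contrastive background learning, it substantially improves detection in crowded or occluded scenes.

Details Motivation: Existing open-vocabulary detection methods rely heavily on direct image-text matching and skip intermediate reasoning steps, which limits robustness in semantically complex scenes.

Contribution: 1. The CoT-PL framework, which brings visual chain-of-thought reasoning into pseudo-labeling; 2. contrastive background learning (CBL), which uses background cues to disentangle object and background features; 3. substantial gains on open-vocabulary detection.

Method: 1. Object understanding is decomposed into three steps: region perception, zero-shot category recognition, and background grounding; 2. contrastive background learning strengthens feature representations, particularly in crowded or occluded scenes.

Result: 1. +7.7 AP50 on open-vocabulary COCO for novel classes; 2. +2.9 mask AP on LVIS for novel classes, with especially large pseudo-label quality gains in crowded and occluded scenes.

Insight: Combining chain-of-thought reasoning with pseudo-labeling improves the robustness of open-vocabulary detection, especially in complex scenes, where separating objects from background is key.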

Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL), which uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior method, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.

[136] Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images

Usama Sajjad,Abdul Rehman Akbar,Ziyu Su,Deborah Knight,Wendy L. Frankel,Metin N. Gurcan,Wei Chen,Muhammad Khalid Khan Niazi

Main category: cs.CV

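TL;DR: The study proposes PRISM, a new AI model that predicts five-year survival of colorectal cancer patients from H&E whole slide images. Modeling morphological diversity substantially improves prognostic performance.

Details Motivation: Colorectal cancer is the third most common malignancy worldwide. Existing task-agnostic methods can overlook organ-specific morphological patterns that matter for tumor behavior and patient outcomes.

Contribution: Proposes PRISM, which incorporates a continuous variability spectrum within each morphology to better characterize phenotypic diversity, and substantially outperforms existing methods.

Method: PRISM is trained on 8.74 million histological images from 424 patients with stage III colorectal cancer, and models gradual morphological change rather than abrupt phenotypic shifts.

Result: PRISM performs well on five-year overall survival prediction (AUC = 0.70, accuracy = 68.37%), outperforming CRC-specific methods by 15% and AI foundation models by about 23% in accuracy, and remains stable across subgroups.

Insight: Modeling gradual morphological change better reflects how tumors evolve, and PRISM's robustness supports its potential use in clinical decision-making.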

Abstract: Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task-agnostic methodologies that can overlook crucial organ-specific morphological patterns. These patterns represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity, reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 ± 0.04; accuracy = 68.37% ± 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.

[137] Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks

Pedro R. A. S. Bassi,Xinze Zhou,Wenxuan Li,Szymon Płotka,Jieneng Chen,Qi Chen,Zheren Zhu,Jakub Prządo,Ibrahim E. Hamacı,Sezgin Er,Yuhan Wang,Ashwin Kumar,Bjoern Menze,Jarosław B. Ćwikła,Yuyin Zhou,Akshay S. Chaudhari,Curtis P. Langlotz,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

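TL;DR: The paper proposes R-Super, which trains AI for multi-tumor early detection using abundant medical reports in place of manually drawn tumor masks, substantially cutting cost while keeping performance high.

Details Motivation: Early tumor detection saves lives, but training current AI models depends on costly, manually annotated tumor masks. Medical reports already contain rich tumor information, yet this resource is underused.

Contribution: Proposes R-Super, which trains AI to segment tumors from medical reports, reduces reliance on manually drawn masks, and extends to tumor types that previously lacked public data and models.

Method: R-Super trains AI models using tumor information described in medical reports (such as size, number, and appearance). It can be combined with conventional mask supervision, which further improves sensitivity and specificity.

Result: Models trained on 101,654 reports perform comparably to models trained on 723 masks. Combining reports and masks improves sensitivity by 13% and specificity by 8%, surpassing radiologists on five of the seven tumor types.

Insight: High-accuracy tumor detection AI can be trained efficiently from existing medical reports without large-scale manual mask annotation, giving a scalable path to detection across diverse tumor types.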

Abstract: Early tumor detection saves lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks: detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor’s size, number, appearance, and sometimes pathology results. This information is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super

[138] Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning

Ji Cao,Yu Wang,Tongya Zheng,Zujie Ren,Canghong Jin,Gang Chen,Mingli Song

Main category: cs.CV

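TL;DR: PRTraj is a trajectory representation learning framework that unifies environment perception and route choice modeling, addressing the way existing methods ignore the external environment and internal route choice behavior.

Details Motivation: Existing trajectory representation learning methods treat trajectories as isolated spatio-temporal sequences and overlook how the external environment and route choice behavior shape them.

Contribution: The PRTraj framework unifies environment perception and route choice modeling, substantially improving trajectory representations.

Method: PRTraj consists of an Environment Perception Module, which captures multi-granularity environmental semantics, and a Route Choice Encoder, which models road segment transitions as decisions. Their outputs are aggregated into a global trajectory embedding.

Result: On 3 real-world datasets and 5 downstream tasks, PRTraj shows effectiveness and generalizability, and it stays robust in few-shot scenarios.

Insight: Environment perception and route choice behavior are essential to trajectory representation, and modeling them jointly improves downstream task performance.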

Abstract: Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment Perception and explicit Route choice modeling for effective Trajectory representation learning, dubbed PRTraj. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.

[139] FraQAT: Quantization Aware Training with Fractional bits

Luca Morreale,Alberto Gil C. P. Ramos,Malcolm Chadwick,Mehid Noroozi,Ruchika Chavhan,Abhinav Mehrotra,Sourav Bhattacharya

Main category: cs.CV

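TL;DR: The paper proposes FraQAT, which progressively quantizes a model from 32 to 4 bits and exploits fractional bits during optimization, keeping generation quality high while making computation efficient.

Details Motivation: Large models are hard to deploy on devices with limited memory and compute, such as smartphones. Conventional quantization improves efficiency but often sacrifices generation quality.

Contribution: Proposes a fractional bits quantization approach (FraQAT) that gradually lowers model precision and exploits fractional bits during optimization, preserving quality while enabling efficient computation.

Method: Precision is reduced progressively from 32 to 4 bits per parameter, and fractional bits are used during optimization to maintain model performance.

Result: The method is validated on several diffusion models (such as SD3.5-Medium and Sana), achieves 4-7% lower FiD than standard QAT, and is deployed on a smartphone.

Insight: Progressive quantization with fractional bits is an effective way to improve the quality of quantized models and suits resource-constrained devices.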

Abstract: State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality under aggressive quantization, we propose a new fractional bits quantization (FraQAT) approach. The novelty is a simple yet effective idea: we progressively reduce the model’s precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FraQAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4-7% lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
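
As a rough illustration of progressive precision reduction, the sketch below anneals a real-valued bit-width from 32 to 4 and applies uniform fake quantization with round(2^b) levels. This is only one plausible reading of "fractional bits": the abstract does not specify the schedule or the quantizer, so the linear schedule, the min-max uniform quantizer, and both function names are assumptions rather than the authors' method.

```python
from typing import List


def bit_schedule(step: int, total_steps: int,
                 start_bits: float = 32.0, end_bits: float = 4.0) -> float:
    """Linearly anneal a (possibly fractional) bit-width over training steps."""
    if total_steps <= 1:
        return end_bits
    t = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start_bits + t * (end_bits - start_bits)


def fake_quantize(weights: List[float], bits: float) -> List[float]:
    """Uniform min-max quantization with round(2**bits) levels.

    A fractional bit-width simply yields a level count that is not a
    power of two (assumption made for this sketch).
    """
    levels = max(2, round(2.0 ** bits))
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # constant tensor: nothing to quantize
    scale = (hi - lo) / (levels - 1)
    return [lo + round((w - lo) / scale) * scale for w in weights]


if __name__ == "__main__":
    weights = [0.0, 0.4, 1.0]
    for step in range(5):
        bits = bit_schedule(step, 5)
        print(step, bits, fake_quantize(weights, bits))
```

In actual quantization-aware training the rounding would sit inside the forward pass with a straight-through gradient estimator; that part is omitted here to keep the sketch dependency-free.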

[140] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

Qi Chen,Xinze Zhou,Chen Liu,Hao Chen,Wenxuan Li,Zekun Jiang,Ziyan Huang,Yuxuan Zhao,Dexin Yu,Junjun He,Yefeng Zheng,Ling Shao,Alan Yuille,Zongwei Zhou

Main category: cs.CV

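TL;DR: The paper examines the potential of synthetic data for tumor segmentation and introduces the AbdomenAtlas 2.0 dataset, which substantially improves model performance.

Details Motivation: Large, finely annotated tumor segmentation datasets are scarce, largely because expert annotation is expensive. The authors study how synthetic data can make model training more efficient.

Contribution: (1) Evidence that synthetic data can substantially reduce reliance on real annotated data; (2) AbdomenAtlas 2.0, a large-scale tumor segmentation dataset.

Method: Experiments combine real data (the JHH dataset) with synthetic data and find that 500 real scans plus synthetic data match the performance reached with 1,500 real scans.

Result: Training on AbdomenAtlas 2.0 yields strong tumor segmentation across multiple organs, with notable gains over public datasets (+7% DSC on in-distribution tests, +16% on out-of-distribution tests).

Insight: Synthetic data can ease the scarcity of annotated medical data and make training markedly more efficient, offering a new route for tumor segmentation.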

Abstract: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0, a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation, based on lessons from the JHH dataset, for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
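
For readers unfamiliar with the DSC metric quoted above, here is a minimal sketch of the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are choices made for this example, not details from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice_coefficient(predicted, reference), 4))
```

A "+7% DSC gain" in the abstract refers to a difference in this score between models, typically averaged over test cases.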

[141] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li,Yuhui Chen,Mingcai Zhou,Haoran Li

Main category: cs.CV

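TL;DR: QDepth-VLA augments VLA models with an auxiliary depth prediction task. Learning geometry-aware representations from quantized depth maps improves spatial reasoning, with strong results in simulation and real-world tasks.

Details Motivation: Existing VLA models lack the ability to understand and reason over 3D structure, which limits performance on precise control tasks.

Contribution: Proposes the QDepth-VLA framework, which augments VLA models with an auxiliary depth prediction task to improve spatial perception and reasoning.

Method: A dedicated depth expert predicts the quantized latent tokens of depth maps produced by a VQ-VAE encoder, so the model learns depth-aware representations.

Result: The method shows strong spatial reasoning and competitive performance on simulation benchmarks and real-world manipulation tasks.

Insight: Depth prediction can serve as an effective auxiliary supervision signal that helps the model learn geometric information and improves VLA task performance.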

Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
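
The "quantized latent tokens" mentioned above come from vector quantization, whose core step maps each latent vector to the index of its nearest codebook entry. The sketch below shows only that generic lookup; the codebook values, dimensions, and the `quantize_latents` name are illustrative assumptions, and the paper's depth expert and training losses are not reproduced.

```python
from typing import List, Sequence


def quantize_latents(latents: Sequence[Sequence[float]],
                     codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook vector.

    Distance is squared Euclidean, as in the standard VQ-VAE lookup.
    """
    tokens: List[int] = []
    for z in latents:
        distances = [sum((a - b) ** 2 for a, b in zip(z, code))
                     for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # hypothetical 3-entry codebook
    toy_latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]  # hypothetical encoder outputs
    print(quantize_latents(toy_latents, toy_codebook))
```

Under this reading, the token indices produced for a depth map would act as discrete classification targets for the auxiliary depth head.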

[142] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Meiqi Wu,Jiashu Zhu,Xiaokun Feng,Chubin Chen,Chen Zhu,Bingze Song,Fangyuan Mao,Jiahong Wu,Xiangxiang Chu,Kaiqi Huang

Main category: cs.CV

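TL;DR: ImagerySearch is an adaptive test-time search strategy that dynamically adjusts the inference search space and reward function, improving the coherence and visual plausibility of videos generated for imaginative scenarios. The authors also introduce the LDT-Bench benchmark and use it to validate the method.

Details Motivation: Current video generation models do well in realistic scenarios but degrade in imaginative ones, especially for prompts involving rarely co-occurring concepts with long-distance semantic relationships. The fixed search spaces and static reward designs of existing methods limit adaptability.

Contribution: 1. ImagerySearch, which dynamically adjusts the search space and reward function; 2. LDT-Bench, the first benchmark dedicated to long-distance semantic prompts; 3. experiments that confirm the method's advantage and generality.

Method: ImagerySearch adjusts the inference search space and reward function according to the semantic relationships in the prompt, producing more coherent videos. LDT-Bench contains 2,839 diverse concept pairs for evaluating generation capability.

Result: ImagerySearch outperforms baseline methods on LDT-Bench and achieves competitive improvements on VBench, showing its effectiveness across diverse prompt types.

Insight: Dynamically adjusting the search space and reward design is key to improving video generation models in imaginative scenarios.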

Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.

[143] Multi-modal video data-pipelines for machine learning with minimal human supervision

Mihai-Cristian Pîrvu,Marius Leordeanu

Main category: cs.CV

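TL;DR: The paper presents a multi-modal video data pipeline for machine learning with minimal human supervision. Using pre-trained experts and the PHG-MAE model, it shows that a low-parameter model can compete with much larger ones.

Details Motivation: The real world is inherently multi-modal, but machine learning models have traditionally been unimodal. Truly understanding the world requires integrating many independent modalities while reducing the need for human supervision.

Contribution: 1) A fully autonomous multi-modal data pipeline; 2) use of the PHG-MAE model to exploit multi-modal data efficiently; 3) evidence that a small model (<1M parameters) can compete with large models (~300M parameters); 4) an open-source release of the data pipeline.

Method: Pre-trained experts and procedural combinations of their outputs are applied to raw video, PHG-MAE is used for multi-modal learning, and the model is efficiently distilled into a low-parameter version.

Result: The distilled PHG-MAE model (<1M parameters) achieves results competitive with ~300M-parameter models and runs real-time semantic segmentation; the same framework also runs DPT for near real-time depth estimation.

Insight: Multi-modal integration and distillation of pre-trained models can improve performance while reducing human supervision, and small models remain useful on large-scale tasks.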

Abstract: The real world is inherently multi-modal. Our tools observe it and take snapshots of it in digital form, such as videos or sounds; however, much of it is lost. Similarly, for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e., rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together; however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, efficiently distilled into a low-parameter (<1M) version, can achieve competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

[144] Benchmarking Multimodal Large Language Models for Face Recognition

Hatef Otroshi Shahreza,Sébastien Marcel

Main category: cs.CV

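TL;DR: The paper systematically evaluates multimodal large language models (MLLMs) on face recognition and compares them with dedicated face recognition models.

Details Motivation: MLLMs perform well across many vision-and-language tasks, but their potential for face recognition remains underexplored, particularly in high-precision scenarios.

Contribution: The paper benchmarks MLLMs on several standard face recognition datasets (such as LFW and CALFW), exposes their limitations in zero-shot settings, and points to directions for improvement.

Method: Multiple MLLMs are evaluated on face recognition tasks using standard protocols.

Result: MLLMs capture rich semantic information but still lag behind specialized models in high-precision face recognition scenarios.

Insight: Applying MLLMs to face recognition requires further work, especially on accuracy and generalization.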

Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with a similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available on the project page.

[145] TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Guangyi Han,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

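TL;DR: The paper proposes TOUCH, a framework for generating diverse and physically plausible hand-object interactions (HOI). Fine-grained semantic control extends generation from grasping to free-form interactions such as pushing, poking, and rotating.

Details Motivation: Existing HOI generation research is largely confined to fixed grasping patterns and lacks diversity and fine-grained control. The authors use text guidance to generate free-form interactions that are closer to everyday activity.

Contribution: 1. The Free-Form HOI Generation task; 2. the diverse WildO2 dataset; 3. the TOUCH framework, which uses a multi-level diffusion model for fine-grained HOI generation.

Method: TOUCH is a three-stage framework: 1. a multi-level diffusion model generates hand poses; 2. explicit contact modeling provides conditioning; 3. contact consistency and physical constraints refine the result.

Result: Experiments show that TOUCH generates controllable, diverse, and physically plausible hand interactions.

Insight: Combining fine-grained semantic control with physical constraints is key to generating realistic HOI.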

Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method’s ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is https://guangyid.github.io/hoi123touch.

[146] BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data

Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Shizhan Zhu,Daniel Moura,Orly Zvitia

Main category: cs.CV

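TL;DR: BADAS is a context-aware collision prediction model that separates threats involving the ego vehicle from unrelated accidents, reducing false alerts. Trained end-to-end on large-scale real-world dashcam data, it achieves the best performance on several benchmarks.

Details Motivation: Existing collision prediction methods cannot distinguish ego-vehicle threats from accidents that do not involve the ego vehicle, causing high false-alert rates in deployment. BADAS addresses this with an ego-centric evaluation focus.

Contribution: 1) The first benchmark dataset designed for ego-centric collision prediction; 2) re-annotation of existing datasets to identify ego involvement; 3) BADAS, an end-to-end model built on V-JEPA2, released in open and proprietary variants.

Method: BADAS uses a V-JEPA2 backbone trained end-to-end on dashcam data. The benchmarks are supplemented with synthesized negatives where needed, enabling fair AP/AUC and temporal evaluation.

Result: BADAS achieves state-of-the-art AP/AUC on DAD, DADA-2000, and other datasets, outperforms a forward-collision ADAS baseline, and gives more realistic time-to-accident estimates.

Insight: Distinguishing ego from non-ego accidents is essential for reducing false alerts, and BADAS advances ego-centric collision prediction through large-scale data and context-aware design.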

Abstract: Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar’s real-world dashcam collision dataset – the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.

[147] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence,Oindrila Saha,Megan Wei,Chen Sun,Subhransu Maji,Grant Van Horn

Main category: cs.CV

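TL;DR: The paper proposes nlg2choice, a two-stage method that asks an open-ended question and then extracts the answer with text-only constrained decoding. It improves the zero-shot performance of multimodal large language models (MLLMs) on fine-grained visual classification (FGVC) and addresses evaluation and compute cost when choices are numerous and closely related.

Details Motivation: Existing approaches to zero-shot visual classification with MLLMs struggle to evaluate free-form responses or handle tasks with many choices, particularly in FGVC, where there can be hundreds to thousands of highly related choices.

Contribution: Proposes nlg2choice, a two-stage method combining open-ended answer generation with text-only constrained decoding, which improves MLLM classification and retrieval performance on tasks with many choices.

Method: nlg2choice has two steps: 1) ask the MLLM an open-ended question and let it answer freely; 2) use text-only constrained decoding to predict the most likely choice. For retrieval, an early stopping method significantly improves throughput.

Result: Experiments on seven fine-grained visual datasets show gains over baselines in both classification and retrieval, and the gains hold across different natural language phrasings of the task.

Insight: Open-ended questioning combined with constrained decoding improves zero-shot performance and lowers the compute cost of tasks with many choices, offering an efficient approach to fine-grained visual recognition.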

Abstract: Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don’t consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.

[148] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz

Main category: cs.CV

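TL;DR: The paper proposes a semi-supervised video anomaly detection framework that uses Multimodal Large Language Models (MLLMs) to generate textual descriptions of object activity, which makes the detection of complex anomalies explainable.

Motivation: Existing semi-supervised video anomaly detection (VAD) methods struggle with complex anomalies involving object interactions and generally lack explainability. The authors address both by using MLLMs to extract and interpret object activity and interactions over time.

Contribution: 1) textual descriptions of object activity generated by MLLMs, which improve detection of complex anomalies; 2) an inherently explainable anomaly detection method; 3) compatibility with traditional VAD methods, whose interpretability it enhances.

Method: The MLLM is queried with visual inputs of object pairs at different moments to generate text describing their activity and interactions. At test time, these descriptions are compared with the descriptions obtained from nominal training videos to detect anomalies.

Result: Experiments on benchmark datasets show that the method detects complex interaction-based anomalies effectively and also reaches state-of-the-art performance on datasets without interaction anomalies.

Insight: MLLM-generated text provides a high-level semantic representation for video anomaly detection, improving explainability and generalization.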

Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

[149] MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

Gabriel Fiastre,Antoine Yang,Cordelia Schmid

Main category: cs.CV

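TL;DR: MaskCaptioner is an end-to-end model that jointly detects, segments, tracks, and captions object trajectories in videos. Pretrained on the synthetic datasets LVISCap and LV-VISCap, it achieves state-of-the-art results on several benchmarks.

Motivation: Dense Video Object Captioning (DVOC) is complex and costly to annotate, so previous approaches resort to disjoint training strategies that may yield suboptimal performance. The paper instead proposes end-to-end joint training.

Contribution: 1) the synthetic caption datasets LVISCap and LV-VISCap; 2) MaskCaptioner, a model trained jointly for detection, segmentation, tracking, and captioning; 3) state-of-the-art results on multiple benchmarks.

Method: A state-of-the-art vision-language model (VLM) generates captions for spatio-temporally localized entities, extending LVIS and LV-VIS with synthetic captions; MaskCaptioner is then trained on the joint task.

Result: MaskCaptioner achieves state-of-the-art DVOC results on three benchmarks: VidSTG, VLN, and BenSMOT.

Insight: Synthetic captions combined with end-to-end training can address both the complexity and the annotation cost of DVOC.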

Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

[150] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

JoungBin Lee,Jaewoo Jung,Jisang Han,Takuya Narihira,Kazumi Fukuda,Junyoung Seo,Sunghwan Hong,Yuki Mitsufuji,Seungryong Kim

Main category: cs.CV

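TL;DR: 3DScenePrompt is a video generation framework that uses dual spatio-temporal conditioning and a 3D scene memory to generate videos with scene consistency and precise camera control.

Motivation: Existing methods typically condition on a single image or a short clip, which makes it hard to maintain spatial consistency and camera control over long sequences. The paper aims to keep the scene consistent in long video generation while still letting dynamic content evolve.

Contribution: 1. dual spatio-temporal conditioning; 2. a 3D scene memory that represents static geometry; 3. dynamic SLAM combined with a dynamic masking strategy that separates static from dynamic content.

Method: The model conditions on temporally adjacent frames for motion continuity and on spatially adjacent content for scene consistency. Dynamic SLAM extracts the static geometry used to build the 3D scene memory, which supports camera control and scene consistency.

Result: Experiments show that the method significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.

Insight: Separating static from dynamic content is essential for long video generation; a 3D scene memory supports spatially consistent projection to target viewpoints as well as camera control.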

Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : https://cvlab-kaist.github.io/3DScenePrompt/

[151] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Zhe Li,Weihao Yuan,Weichao Shen,Siyu Zhu,Zilong Dong,Chang Xu

Main category: cs.CV

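TL;DR: The paper proposes OmniMotion, a motion generation framework based on continuous masked autoregression that integrates multiple modalities (text, speech, music) for human motion generation. With a modified attention mechanism and a diffusion component, it outperforms previous methods.

Motivation: Multimodal human motion generation requires both an effective generation mechanism and a way to integrate inputs such as text, speech, and music. Previous methods usually employ discrete masked modeling or autoregressive modeling, which are less robust to continuous motion and to heterogeneous distributions across modalities.

Contribution: 1. a continuous masked autoregressive motion transformer with causal attention, gated linear attention, and RMSNorm; 2. a DiT structure that diffuses the conditions from the transformer toward the targets; 3. AdaLN and cross-attention to fuse multimodal inputs.

Method: 1. a continuous masked autoregressive transformer that improves the coherence of generated motion; 2. gated linear attention and RMSNorm to refine the attention mechanism; 3. a DiT structure to diffuse the conditioning signals; 4. AdaLN and cross-attention to inject text, speech, and music signals.

Result: Experiments show that OmniMotion outperforms previous methods on text-to-motion, speech-to-gesture, and music-to-dance.

Insight: Continuous masked autoregressive modeling suits the sequential nature of motion; gated linear attention and RMSNorm help stabilize multimodal fusion; the diffusion component strengthens how conditioning signals reach the output.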

Abstract: Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.

[152] RealDPO: Real or Not Real, that is the Preference

Guo Cheng,Danni Yang,Ziqi Huang,Jianlou Si,Chenyang Si,Ziwei Liu

Main category: cs.CV

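TL;DR: RealDPO is a preference-learning alignment paradigm that uses real-world data as positive samples to improve the motion quality of video generation models.

Motivation: Video generative models struggle with complex motions and often fail to produce natural, smooth, and contextually consistent movements.

Contribution: RealDPO and the RealAction-5K dataset, which together use preference learning to improve the realism of generated motion.

Method: Direct Preference Optimization (DPO) with a tailored loss function; contrasting real-world videos with erroneous model outputs enables iterative self-correction.

Result: Experiments show that RealDPO improves video quality, text alignment, and motion realism compared with state-of-the-art models and existing preference optimization techniques.

Insight: Preference learning driven by real data can effectively improve the motion quality of generative models.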

Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
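
As a point of reference, the standard DPO objective for one preference pair can be written as a short function. This is a sketch of the generic DPO loss only; the abstract does not spell out RealDPO's tailored loss, so nothing here should be read as the paper's formulation. The log-likelihood values are placeholders, and treating the real video as the preferred sample and the model output as the rejected one is the pairing described in the abstract.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the policy being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : temperature that scales the implicit reward
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Placeholder numbers: "win" would be the real video, "lose" the model output.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy equals reference
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)  # policy now favors the real sample
```

The loss equals log 2 when the policy matches the reference and shrinks as the policy assigns relatively more likelihood to the preferred sample.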

[153] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li

Main category: cs.CV

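TL;DR: MathCanvas is a comprehensive framework that gives Large Multimodal Models (LMMs) intrinsic Visual Chain-of-Thought (VCoT) capabilities for mathematical reasoning, substantially improving their performance on math problems.

Motivation: Existing VCoT approaches are limited by rigid external tools or fail to generate the high-fidelity, strategically timed diagrams that complex problems require, especially in domains such as geometry.

Contribution: The MathCanvas framework with a two-stage training recipe (Visual Manipulation pre-training and Strategic Visual-Aided Reasoning fine-tuning), together with new datasets and an evaluation benchmark.

Method: The model is pre-trained on MathCanvas-Imagen and MathCanvas-Edit to learn diagram generation and editing, then fine-tuned on MathCanvas-Instruct to learn when and how to use visual aids.

Result: The resulting model, BAGEL-Canvas, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench and generalizes well to other public math benchmarks.

Insight: MathCanvas shows the potential of visual chain-of-thought for multimodal mathematical reasoning and points to a direction for applying LMMs to complex problems.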

Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, while demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/

[154] C4D: 4D Made from 3D through Dual Correspondences

Shizun Wang,Zhenxiang Jiang,Xingyi Yang,Xinchao Wang

Main category: cs.CV

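TL;DR: C4D extends existing 3D reconstruction to 4D by adding short-term optical flow and long-term point tracking correspondences, addressing the breakdown of multi-view geometric constraints in dynamic scenes.

Motivation: Moving objects violate multi-view geometric constraints, so directly applying static 3D reconstruction methods to dynamic scenes gives inaccurate results. A method is needed that jointly recovers dynamic geometry and camera poses.

Contribution: The C4D framework, which extends 3D reconstruction to 4D through dual correspondences (short-term optical flow and long-term point tracking), and a set of dynamic scene optimization objectives for complete 4D reconstruction.

Method: Alongside pointmap prediction, a dynamic-aware point tracker is trained to separate moving elements from the static background, and the dynamic scene optimization objectives recover per-frame 3D geometry and camera parameters.

Result: Experiments show that C4D achieves complete 4D recovery and performs strongly on depth estimation, camera pose estimation, and point tracking.

Insight: Explicitly modeling the motion of dynamic objects improves reconstruction quality in dynamic scenes and gives downstream 4D tasks a reliable basis.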

Abstract: Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D

[155] Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari,Sheng-Yu Wang,Nanxuan Zhao,Yotam Nitzan,Yuheng Li,Krishna Kumar Singh,Richard Zhang,Eli Shechtman,Jun-Yan Zhu,Xun Huang

Main category: cs.CV

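TL;DR: The paper presents a training paradigm that needs no paired image editing data: it optimizes a few-step diffusion model directly with feedback from vision-language models (VLMs), removing the dependence on large supervised datasets.

Motivation: Image editing models traditionally rely on large datasets of input-target pairs, which are hard to curate at scale. Synthetic pairs, the current workaround, can propagate the artifacts of the pretrained model, so a training method that needs no paired data is desirable.

Contribution: 1. a training paradigm without paired data that combines a diffusion model with VLM feedback to optimize edits directly; 2. a distribution matching loss (DMD) that keeps generated images visually faithful.

Method: 1. The diffusion model is unrolled during training, and a VLM evaluates whether each edit follows the instruction and preserves unchanged content, providing gradients for end-to-end optimization. 2. DMD constrains generated images to the image manifold learned by pretrained models.

Result: On standard benchmarks and under the few-step setting, the method performs on par with image editing diffusion models trained on extensive supervised paired data. With the same VLM as the reward model, it outperforms RL-based techniques such as Flow-GRPO.

Insight: 1. VLM feedback can replace paired data as a supervision signal. 2. DMD keeps generated images close to the real image distribution, which improves visual quality.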

Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

[156] From Pixels to Words – Towards Native Vision-Language Primitives at Scale

Haiwen Diao,Mingxuan Li,Silei Wu,Linjun Dai,Xiaohua Wang,Hanming Deng,Lewei Lu,Dahua Lin,Ziwei Liu

Main category: cs.CV

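TL;DR: The paper lays out design principles for native Vision-Language Models (VLMs) and introduces NEO, a model family built on those principles that rivals modular VLMs while using comparatively little data.

Motivation: Two open questions hold back native VLMs: what fundamentally sets them apart from modular VLMs, and how research on them can be made more accessible.

Contribution: Three principles for a native VLM primitive: (i) align pixel and word representations in a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) support unified vision-language encoding, aligning, and reasoning. The paper also introduces the NEO model family, which reaches strong performance with comparatively little data.

Method: NEO is a native VLM built from first principles. It aligns pixel and word representations in a shared semantic space and integrates cross-modal capabilities in a single dense, monolithic model.

Result: With only 390M image-text examples, NEO develops visual perception from scratch and rivals top-tier modular VLMs.

Insight: Native VLMs can learn multimodal tasks efficiently through a shared semantic space and a unified architecture, which also lowers the barrier to research in this area.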

Abstract: The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

[157] Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Hadi Alzayer,Yunzhi Zhang,Chen Geng,Jia-Bin Huang,Jiajun Wu

Main category: cs.CV

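TL;DR: The paper proposes coupled diffusion sampling, a training-free, inference-time method for multi-view consistent image editing that avoids the lengthy optimization and instability of approaches relying on explicit 3D representations.

Motivation: Pre-trained 2D image editing models do not maintain consistency across views, while methods based on explicit 3D representations suffer from lengthy optimization and instability under sparse views.

Contribution: An implicit 3D regularization approach: coupled diffusion sampling concurrently samples from a multi-view image distribution and a 2D edited image distribution to ensure multi-view consistency.

Method: A coupling term ties the two sampling trajectories together so that the edited images adhere to a pre-trained multi-view image distribution, with no 3D optimization.

Result: The method is validated on three distinct multi-view image editing tasks and applies across various model architectures.

Insight: The method is a lightweight alternative that needs no explicit 3D representation and has potential as a general solution for multi-view consistent editing.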

Abstract: We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.

eess.IV

[158] Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Arnaud Judge,Nicolas Duchateau,Thierry Judge,Roman A. Sandler,Joseph Z. Sokol,Christian Desrosiers,Olivier Bernard,Pierre-Marc Jodoin

Main category: eess.IV

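TL;DR: The paper presents RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation that uses reinforcement learning with novel reward functions to improve accuracy, anatomical validity, and temporal consistency.

Motivation: Domain adaptation methods often lack reliability in the target domain, which is critical in medical image segmentation. The problem is worse for spatio-temporal data and for echocardiography, where artifacts and noise hinder segmentation.

Contribution: RL4Seg3D, which combines reinforcement learning with novel reward functions to improve accuracy, anatomical validity, and temporal consistency in unsupervised domain adaptation, and which also yields an uncertainty estimator.

Method: Reinforcement learning is applied to the segmentation of full-sized input videos. The reward functions and a fusion scheme improve the precision of key landmarks, and no target-domain labels are required.

Result: Evaluated on over 30,000 echocardiographic videos, RL4Seg3D outperforms standard domain adaptation techniques without any labels on the target domain.

Insight: Reinforcement learning can improve domain adaptation for medical image segmentation, and its uncertainty estimator can be used at test time to further enhance segmentation.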

Abstract: Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.

[159] A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities

Siva Teja Kakileti,Bharath Govindaraju,Sudhakar Sampangi,Geetha Manjunath

Main category: eess.IV

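TL;DR: The paper proposes a breast density-informed multimodal AI framework that combines mammography and thermal imaging (Thermalytix) to improve breast cancer detection, with the clearest gains over single-modality screening in dense breast tissue.

Motivation: Mammography has reduced sensitivity in dense breast tissue, which contributes to missed or delayed diagnoses. The study investigates whether a multimodal AI framework can address this and improve detection.

Contribution: A density-informed multimodal AI framework that dynamically selects mammography or thermal imaging based on breast tissue composition, so that detection performance stays high across tissue types.

Method: Mammography AI is used for fatty breasts and Thermalytix AI for dense breasts. Mammograms are analyzed with a multi-view deep learning model, and thermal images with vascular and thermal radiomics.

Result: The multimodal framework achieves 94.55% sensitivity and 79.93% specificity, compared with 81.82% and 86.25% for standalone mammography AI and 92.73% and 75.46% for Thermalytix AI. Mammography sensitivity drops to 67.86% in dense breasts (96.30% in fatty breasts), while Thermalytix AI stays consistent at 92.59% and 92.86%.

Insight: A multimodal framework that combines structural and functional data can overcome the limits of a single modality; the authors describe it as interpretable, low-cost, and easy to deploy in both high-resource and resource-limited settings.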

Abstract: Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
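
The routing rule described above (mammography AI for fatty breasts, thermal AI for dense breasts) is simple enough to sketch, along with the two metrics the study reports. This is an illustrative sketch only: the function names are hypothetical, the density label is assumed to be given as a boolean, and the four toy cases are made up rather than taken from the study's 324 participants.

```python
def route_prediction(is_dense, mammo_pred, thermal_pred):
    """Use the thermal-imaging prediction for dense breasts, mammography otherwise."""
    return thermal_pred if is_dense else mammo_pred

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up toy cases: (is_dense, mammography_pred, thermal_pred, ground_truth)
cases = [
    (True, False, True, True),    # dense: mammography misses, thermal catches
    (True, False, False, False),
    (False, True, True, True),
    (False, False, True, False),  # fatty: thermal false alarm is not used
]
y_true = [c[3] for c in cases]
y_pred = [route_prediction(c[0], c[1], c[2]) for c in cases]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

The sketch only shows how routing by density lets each modality cover the other's weak regime; it says nothing about how breast density itself is determined.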

cs.LG

[160] Weight Weaving: Parameter Pooling for Data-Free Model Merging

Levy Chaves,Eduardo Valle,Sandra Avila

Main category: cs.LG

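TL;DR: Weight Weaving is a data-free model merging technique that pools model weights across a search space of scaling factors, improving the performance of existing merging methods.

Motivation: Existing model merging methods usually need data to tune hyper-parameters such as the scaling factor $\lambda$, often privileged data from the evaluation set, which is infeasible in practice. The paper proposes a data-free alternative.

Contribution: Weight Weaving, which pools weights over the $\lambda$ search space with a pooling strategy such as averaging or random selection, removing the dependence on data.

Method: A user-defined pooling function (for example averaging, random selection, or an existing merging method) combines the model weights obtained across the $\lambda$ search space. No extra data is needed, and the technique is compatible with existing model merging methods.

Result: Across three ViT variants and three experimental setups (multi-task learning, continual learning, domain generalization), Weight Weaving achieves average accuracy gains of up to 15.9 percentage points.

Insight: Weight pooling offers an efficient, flexible, data-free strategy for model merging that improves the merged model's accuracy.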

Abstract: Model merging provides a cost-effective and data-efficient combination of specialized deep neural networks through parameter integration. This technique leverages expert models across downstream tasks without requiring retraining. Most model merging approaches critically depend on scaling hyper-parameters $\lambda$, which weight each model’s contribution globally or individually. Principled approaches for setting scaling factors without accessing any data (data-free) are scarce, often leading researchers to tune $\lambda$ using privileged data from the evaluation set, which is infeasible in practice. To address this limitation, we introduce Weight Weaving, a plug-and-play technique that pools model weights across the search space of $\lambda$ values using user-defined pooling functions, such as averaging, random selection, or even existing model merging methods. Our method demonstrates high modularity, imposing minimal constraints on the search space. It operates orthogonally to existing model merging methods and eliminates evaluation data requirements. We validate Weight Weaving across three ViT variants in three experimental setups: vision multi-task learning, vision continual learning, and domain generalization. Our method consistently improves the performance of several model merging methods, achieving average accuracy gains of up to 15.9 percentage points in a data-free setting.
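
A minimal sketch of the pooling idea follows, under stated assumptions: weights are flat lists of floats, the base merging method is task arithmetic (pretrained weights plus $\lambda$ times the summed task vectors), and the pooling function is a plain mean. The function names are hypothetical, and the paper's actual search space and pooling choices may differ.

```python
from statistics import mean

def task_arithmetic(pretrained, task_vectors, lam):
    """Assumed base merge: theta = theta_0 + lam * sum_i tau_i (per parameter)."""
    return [
        p + lam * sum(tv[j] for tv in task_vectors)
        for j, p in enumerate(pretrained)
    ]

def weight_weave(pretrained, task_vectors, lambdas, pool=mean):
    """Merge once per lambda, then pool each parameter across the candidates."""
    candidates = [task_arithmetic(pretrained, task_vectors, lam) for lam in lambdas]
    return [pool(values) for values in zip(*candidates)]

pretrained = [1.0, 2.0]
task_vectors = [[0.5, -1.0], [0.5, 1.0]]  # toy task vectors (finetuned - pretrained)
woven = weight_weave(pretrained, task_vectors, lambdas=[0.0, 0.5, 1.0])
```

With a linear base merge and mean pooling, the result coincides with merging at the mean $\lambda$; other pooling functions, such as random selection per parameter, do not reduce to a single $\lambda$, which is where the choice of `pool` starts to matter.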

[161] Backdoor Unlearning by Linear Task Decomposition

Amel Abdelraheem,Alessandro Favero,Gerome Bovet,Pascal Frossard

Main category: cs.LG

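TL;DR: The paper proposes backdoor unlearning via linear task decomposition. It removes backdoors from foundation models without costly retraining while preserving the models' general performance.

Motivation: Foundation models perform well in computer vision but remain highly susceptible to backdoor attacks. Existing removal methods rely on costly fine-tuning and can degrade performance on unrelated tasks. The authors ask whether backdoors can be removed without compromising general capabilities.

Contribution: 1. the finding that backdoors are disentangled from benign tasks in weight space; 2. a simple unlearning method that removes backdoors efficiently through linear task decomposition; 3. experimental validation on CLIP-based models.

Method: The method exploits the disentanglement between the backdoor and benign tasks to isolate and erase the backdoor's influence through linear task decomposition. It covers both the known-attack and the unknown-attack setting.

Result: With knowledge of the attack, the method achieves approximately perfect unlearning while retaining, on average, 96% of clean accuracy. When the attack is unknown, it still unlearns backdoors using reverse-engineered triggers.

Insight: Backdoors are disentangled from benign tasks in weight space, which makes efficient backdoor removal with minimal impact on clean performance possible.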

Abstract: Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor’s influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.

[162] pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Hansheng Chen,Kai Zhang,Hao Tan,Leonidas Guibas,Gordon Wetzstein,Sai Bi

Main category: cs.LG

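TL;DR: The paper proposes policy-based flow models ($\pi$-Flow). Using imitation distillation, it avoids the quality-diversity trade-off of conventional few-step generative models and enables fast, accurate ODE integration.

Motivation: In conventional few-step diffusion or flow models, a velocity-predicting teacher is distilled into a student that predicts a shortcut toward denoised data. This format mismatch leads to complex distillation procedures and a quality-diversity trade-off. $\pi$-Flow aims to simplify the process with imitation distillation.

Contribution: 1. policy-based flow models ($\pi$-Flow) that produce dynamic flow velocities; 2. imitation distillation, which stabilizes training and avoids the quality-diversity trade-off; 3. results that surpass existing few-step methods on several benchmarks.

Method: $\pi$-Flow modifies the student's output layer to predict a network-free policy, which then produces flow velocities at future substeps. Imitation distillation matches the policy's velocity to the teacher's along the policy's own trajectory with a standard $\ell_2$ flow matching loss, yielding stable ODE integration.

Result: On ImageNet 256$^2$, $\pi$-Flow attains a 1-NFE FID of 2.85. On the FLUX.1-12B and Qwen-Image-20B models at 4 NFEs, it achieves better diversity than state-of-the-art few-step methods while maintaining teacher-level quality.

Insight: Imitation distillation is simple and effective; it avoids complex distillation procedures and the quality-diversity trade-off, offering a new training paradigm for few-step generative models.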

Abstract: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy’s ODE trajectory to the teacher’s, we introduce a novel imitation distillation approach, which matches the policy’s velocity to the teacher’s along the policy’s trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher’s behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
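
Read literally, the imitation distillation objective described in the abstract can be written as an $\ell_2$ velocity-matching loss. The notation below is an assumption introduced for illustration and is not taken from the paper: $\pi_\theta$ denotes the policy produced by the student, $v_{\text{teacher}}$ the teacher's velocity field, and $x_t$ a state reached by integrating the policy's own velocities.

```latex
\mathcal{L}_{\text{imit}}(\theta)
  = \mathbb{E}_{t,\; x_t \sim \text{policy trajectory}}
    \left[ \left\lVert \pi_\theta(x_t, t) - v_{\text{teacher}}(x_t, t) \right\rVert_2^2 \right]
```

The point the sketch makes explicit is that the expectation is taken over states visited by the policy rather than by the teacher, which is what distinguishes imitation along the policy's trajectory from standard flow matching on teacher or data samples.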

[163] LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding

Mohammad Mansoori,Amira Soliman,Farzaneh Etminani

Main category: cs.LG

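TL;DR: The paper proposes LTR-ICD, which treats automatic ICD coding as a classification and ranking problem rather than classification alone. It outperforms existing methods both at identifying high-priority codes and on classification metrics.

Motivation: The order of ICD codes attached to clinical notes matters for medical diagnosis and reimbursement, but existing methods treat coding purely as classification and ignore order. A method that captures both classification and ranking is needed.

Contribution: The paper is a first attempt to approach ICD coding from a retrieval system perspective, reformulating it as a classification and ranking task, and it proposes a Learning-to-Rank framework for it.

Method: A Learning-to-Rank approach models ICD coding as joint classification and ranking, predicting both the codes and their priority order.

Result: Accuracy in correctly ranking primary diagnosis codes is 47%, compared with 20% for the state-of-the-art classifier. Micro- and macro-F1 (0.6065 and 0.2904) also exceed the previous best model (0.597 and 0.2660).

Insight: Combining classification and ranking addresses ICD coding more completely; capturing code order is essential for practical use.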

Abstract: Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes are essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treat this problem as a classification task and thus ignore the order of ICD codes, which is essential for different purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.
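
The "accuracy in correctly ranking primary diagnosis codes" can be illustrated with a short metric function. This is a sketch under an assumption: the primary diagnosis is taken to be the first code in each gold sequence, and a prediction counts as correct when its top-ranked code matches it. The paper's exact metric definition may differ, and the codes below are toy data.

```python
def primary_code_accuracy(ranked_predictions, gold_sequences):
    """Fraction of notes whose top-ranked predicted code equals the first gold code."""
    if not gold_sequences:
        raise ValueError("gold_sequences must not be empty")
    hits = sum(
        1
        for pred, gold in zip(ranked_predictions, gold_sequences)
        if pred and gold and pred[0] == gold[0]
    )
    return hits / len(gold_sequences)

# Toy example: each inner list is an ordered code sequence for one clinical note.
gold = [["I21.4", "I10", "E11.9"], ["J18.9", "N17.9"]]
pred = [["I21.4", "E11.9", "I10"], ["N17.9", "J18.9"]]
acc = primary_code_accuracy(pred, gold)
```

A pure multi-label classifier would score both toy notes as fully correct, since each predicted set matches the gold set; the ranking metric exposes the second note's wrong primary code.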

[164] MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation

Mahmood Hegazy,Aaron Rodrigues,Azzam Naeem

Main category: cs.LG

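TL;DR: MAFA is a multi-agent collaboration framework for enterprise-scale annotation. Dynamic task adaptation and structured reasoning substantially improve annotation efficiency and accuracy.

Details Motivation: Financial institutions face large backlogs of customer utterances awaiting annotation. Traditional approaches are slow and cannot easily switch task types, so a flexible and efficient annotation solution is needed.

Contribution: 1) Proposes the MAFA framework, which supports configurable task adaptation; 2) introduces a judge-based consensus mechanism among agents; 3) substantially reduces annotation backlog and improves efficiency in an enterprise deployment.

Method: Combines multi-agent collaboration, structured reasoning, and a judge-based consensus mechanism. Annotation types are configured dynamically, so new tasks can be supported without code changes.

Result: In deployment, MAFA eliminated a backlog of 1 million utterances and reached 86% agreement with human annotators. Over baselines it improves Top-1 accuracy by 13.8% and F1 by 16.9%.

Insight: Multi-agent systems have practical potential for enterprise-scale tasks; dynamic configuration and the consensus mechanism are the key success factors.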

Abstract: We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA’s effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.

[165] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Mehrzad Samadi,Aleksander Ficek,Sean Narenthiran,Siddhartha Jain,Wasi Uddin Ahmad,Somshubra Majumdar,Vahid Noroozi,Boris Ginsburg

Main category: cs.LG

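TL;DR: The paper presents GenCluster, a scalable test-time compute framework that, paired with the open-weight model gpt-oss-120b, reaches gold-medal performance at the IOI for the first time with an open-weight model. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to explore diverse solution spaces efficiently under a limited validation budget.

Details Motivation: Proprietary models have performed well at the IOI, but their methods are undisclosed, and open-weight models still lag behind. The goal is a transparent, reproducible method that narrows the gap between open and closed systems.

Contribution: GenCluster is the first framework to show that an open-weight model can reach IOI gold-medal level, and it sets a new benchmark for evaluating LLM reasoning.

Method: GenCluster combines large-scale generation, behavioral clustering, ranking, and round-robin submission to explore the solution space while making efficient use of limited compute and validation resources.

Result: Experiments show that performance scales with available compute, and that the open-weight model gpt-oss-120b can achieve a gold medal at IOI 2025 for the first time.

Insight: With a transparent and scalable test-time compute framework, open-weight models can match closed systems on complex tasks, which underlines the importance of compute allocation and of combining several strategies.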

Abstract: Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model, gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
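
As a rough illustration of behavioral clustering followed by round-robin submission, the sketch below groups toy candidate programs by their outputs on shared test inputs and then cycles over the clusters until a submission budget is used up. Ranking clusters by size is an assumption made here for simplicity; the abstract does not say how GenCluster ranks them, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def cluster_by_behavior(candidates, test_inputs):
    """Group candidates (name -> callable) whose outputs agree on every test input."""
    clusters = defaultdict(list)
    for name, fn in candidates.items():
        signature = tuple(fn(x) for x in test_inputs)
        clusters[signature].append(name)
    return list(clusters.values())


def round_robin(clusters, budget):
    """Cycle over clusters (largest first), taking one candidate per turn, up to budget."""
    ranked = sorted(clusters, key=len, reverse=True)
    order = [c for turn in zip_longest(*ranked) for c in turn if c is not None]
    return order[:budget]


# Toy "solutions": three behave identically, one differs.
candidates = {
    "a": lambda x: x * 2,
    "b": lambda x: x + x,
    "c": lambda x: x ** 2,
    "d": lambda x: 2 * x,
}

clusters = cluster_by_behavior(candidates, test_inputs=[1, 2, 3])
submissions = round_robin(clusters, budget=2)
print(clusters, submissions)
```

Cycling over clusters spends a small budget on behaviorally distinct candidates before it tries near-duplicates from the same cluster.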

[166] Agentic Entropy-Balanced Policy Optimization

Guanting Dong,Licheng Bao,Zhongyuan Wang,Kangzhi Zhao,Xiaoxi Li,Jiajie Jin,Jinghan Yang,Hangyu Mao,Fuzheng Zhang,Kun Gai,Guorui Zhou,Yutao Zhu,Ji-Rong Wen,Zhicheng Dou

Main category: cs.LG

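TL;DR: The paper proposes AEPO, a reinforcement learning algorithm that balances entropy signals during both rollout and policy update, addressing the training collapse that over-reliance on entropy causes in conventional algorithms. AEPO performs well across multiple datasets.

Details Motivation: Mainstream agentic RL algorithms rely heavily on entropy signals to drive exploration, which can lead to training collapse or reduced efficiency. The paper proposes a more balanced approach.

Contribution: Proposes the AEPO algorithm, consisting of a dynamic entropy-balanced rollout mechanism and an entropy-balanced policy optimization method, which together balance the use of entropy signals and improve training stability and efficiency.

Method: AEPO has two parts: (1) a dynamic entropy-balanced rollout mechanism that allocates the sampling budget through entropy pre-monitoring and penalizes consecutive high-entropy steps; (2) entropy-balanced policy optimization, which adds a stop-gradient operation and entropy-aware advantage estimation.

Result: AEPO outperforms 7 mainstream RL algorithms across 14 datasets. With only a small number of RL samples, Qwen3-14B with AEPO achieves strong results, such as a Pass@1 of 47.6% on GAIA.

Insight: By balancing entropy signals, AEPO improves performance and rollout sampling diversity while keeping policy entropy stable, which makes scalable web agent training feasible.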

Abstract: Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity’s Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity’s Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.

[167] Reasoning with Sampling: Your Base Model is Smarter Than You Think

Aayush Karan,Yilun Du

Main category: cs.LG

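TL;DR: The paper proposes a training-free iterative sampling algorithm that uses the base model's own likelihoods to substantially improve reasoning, with performance close to or better than RL-posttrained models.

Details Motivation: Reinforcement learning (RL) posttraining has successfully improved the reasoning of large language models, but the literature has mostly focused on the new behaviors RL induces and has overlooked the potential of base models. The paper tries to draw out a base model's reasoning ability through sampling alone.

Contribution: 1. Proposes an iterative sampling algorithm based on Markov chain Monte Carlo (MCMC) that requires no training or additional data; 2. shows that it approaches or exceeds RL-posttrained models on several tasks (such as MATH500, HumanEval, and GPQA); 3. avoids the loss of diversity seen with RL posttraining.

Method: 1. Inspired by MCMC, designs an iterative sampling algorithm that uses the base model's likelihoods; 2. improves reasoning by combining single-shot sampling with multiple rounds of inference.

Result: On multiple tasks (including MATH500, HumanEval, and GPQA), the method substantially improves reasoning, approaching or exceeding RL-posttrained models while preserving high diversity.

Insight: With a suitable sampling strategy, base models can show strong reasoning ability without extra training or a complex reinforcement learning framework.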

Abstract: Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models’ own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
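
To make "sampling from a sharpened distribution with MCMC" concrete, here is a toy Metropolis-Hastings sketch on a three-outcome distribution. It assumes the sharpened target is the power distribution p(x)^alpha and that proposals are drawn from the base distribution p itself, in which case the acceptance ratio reduces to (p(x')/p(x))^(alpha-1). Both choices are assumptions made for illustration; the paper's sampler works on token sequences and may differ in detail.

```python
import random


def sharpen(p, alpha):
    """Exact power distribution p(x)**alpha, renormalized (for comparison only)."""
    weights = {k: v ** alpha for k, v in p.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def mh_power_sampler(p, alpha, steps, seed=0):
    """Metropolis-Hastings targeting p**alpha, using p itself as the proposal."""
    rng = random.Random(seed)
    keys = list(p)
    probs = [p[k] for k in keys]
    current = rng.choices(keys, weights=probs)[0]
    counts = {k: 0 for k in keys}
    for _ in range(steps):
        proposal = rng.choices(keys, weights=probs)[0]
        # Target ratio times reverse/forward proposal ratio simplifies to this:
        accept = min(1.0, (p[proposal] / p[current]) ** (alpha - 1))
        if rng.random() < accept:
            current = proposal
        counts[current] += 1
    return {k: c / steps for k, c in counts.items()}


base = {"A": 0.5, "B": 0.3, "C": 0.2}  # toy "base model" distribution
empirical = mh_power_sampler(base, alpha=4, steps=20000)
exact = sharpen(base, alpha=4)
print(empirical, exact)
```

The sampler only needs likelihoods under the base distribution, which mirrors the abstract's point that no training, curated data, or verifier is required.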

[168] Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Jonas Geiping,Xinyu Yang,Guinan Su

Main category: cs.LG

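TL;DR: The paper examines the relationship between recurrent-depth models and diffusion language models and proposes a new diffusion forcing sampler that accelerates generation in recurrent-depth models, with speedups of up to 5x on modern hardware.

Details Motivation: Recurrent-depth models increase computation by repeating layers and show advantages on reasoning tasks. The need to parallelize their generation motivates exploring their similarity to diffusion language models in order to develop more efficient generation methods.

Contribution: 1. Establishes the relationship between recurrent-depth models and diffusion language models; 2. proposes a new diffusion forcing sampler that enables parallel generation; 3. shows that the method outperforms baseline autoregressive generation on modern hardware.

Method: Building on principles from the diffusion literature, the diffusion forcing sampler decodes new tokens at every forward pass while refining their latent states in parallel. It can be applied to existing recurrent-depth models without any tuning.

Result: The method speeds up generation for a 3.5B-parameter recurrent-depth model by up to 5x, and generation with the sampler is more expressive than the autoregressive baseline.

Insight: Recurrent-depth models can be viewed as strong continuous, though causal, diffusion language models, and their potential for parallelization suggests a new route to efficient inference.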

Abstract: Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.

cs.AI [Back]

[169] Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline

Haiyang Li,Yaxiong Wang,Shengeng Tang,Lianwei Wu,Lechao Cheng,Zhun Zhong

Main category: cs.AI

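TL;DR: The paper proposes UMFDet, a unified multimodal misinformation detection framework, together with OmniFake, a newly built comprehensive dataset, to overcome the limitation that human-crafted and AI-generated fake content are studied separately.

Details Motivation: Existing misinformation detection methods usually target only one of human-crafted or AI-generated content, and a unified multimodal solution is missing. The paper aims to fill this gap with a framework that handles both kinds of fake content.

Contribution: 1) Builds the OmniFake dataset, which integrates human-crafted misinformation and AI-generated content; 2) proposes UMFDet, a unified multimodal fake content detection framework that uses a mixture-of-experts adapter and a reasoning mechanism to improve performance.

Method: UMFDet builds on a vision-language model (VLM), adds a category-aware mixture-of-experts (MoE) adapter to capture category-specific cues, and uses an attribution chain-of-thought mechanism to provide implicit reasoning guidance.

Result: Experiments show that UMFDet performs well on both kinds of fake content and outperforms specialized baselines, offering a practical solution for real-world use.

Insight: A unified multimodal misinformation detection framework copes better with content of unknown type in real-world settings, improving the robustness and consistency of detection.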

Abstract: In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.

[170] AI for Service: Proactive Assistance with AI Glasses

Zichen Wen,Yiyu Wang,Chenfei Liao,Boxue Yang,Junxian Li,Weifeng Liu,Haocong He,Bolong Feng,Xuyang Liu,Yuanhuiyi Lyu,Xu Zheng,Xuming Hu,Linfeng Zhang

Main category: cs.AI

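TL;DR: The paper proposes Alpha-Service, a framework that delivers proactive service through AI glasses. It addresses two questions, when to intervene and how to provide service, and demonstrates several practical application scenarios.

Details Motivation: Existing AI services are mostly reactive and lack proactivity and adaptability. The authors propose the AI4Service paradigm, in which AI acts as a proactive companion that anticipates user needs and offers help at the right time.

Contribution: Proposes the Alpha-Service framework, built on AI glasses, which addresses the two core problems of proactive service and uses a multi-agent system for perception, reasoning, and personalized service.

Method: Inspired by the von Neumann architecture, Alpha-Service has five components: an Input Unit (perception), a Central Processing Unit (task scheduling), an Arithmetic Logic Unit (tool use), a Memory Unit (personalization), and an Output Unit (human interaction). It is implemented as a multi-agent system.

Result: Case studies show that Alpha-Service provides timely, useful help without explicit prompts in scenarios such as a Blackjack advisor, a museum tour guide, and a shopping fit assistant.

Insight: AI services are moving from reactive response toward proactive adaptation, and Alpha-Service shows that this is feasible through cooperating components and a multi-agent system.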

Abstract: In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.

[171] Agentic Design of Compositional Machines

Wenqian Zhang,Weiyang Liu,Zhen Liu

Main category: cs.AI

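TL;DR: The paper asks whether large language models (LLMs) can learn to build complex machines through compositional machine design tasks, and introduces the BesiegeField testbed. Current open-source models fall short, and the paper explores reinforcement learning (RL) as a path to improvement.

Details Motivation: The work explores the potential of LLMs for complex machine design, which it regards as both a marker of human intelligence and a foundation of engineering practice.

Contribution: Introduces the BesiegeField testbed, evaluates LLMs on compositional machine design tasks, and points to directions for improvement through RL experiments.

Method: Builds the BesiegeField testbed, evaluates the spatial reasoning, strategic assembly, and instruction-following abilities of LLMs, and runs RL finetuning experiments starting from a curated cold-start dataset.

Result: Current open-source LLMs perform inadequately on compositional machine design, but RL finetuning shows potential for improvement.

Insight: Complex tasks of this kind require LLMs to combine physical reasoning with language ability, which remains challenging and suggests directions for future research.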

Abstract: The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.

[172] Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

Supriti Sinhamahapatra,Jan Niehues

Main category: cs.AI

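TL;DR: The paper studies how combining multi-modal information (speech and slides) can improve automatic speech recognition (ASR) for scientific conference talks. Using data augmentation and a new benchmark, the authors achieve a relative word error rate reduction of about 34%.

Details Motivation: Existing ASR systems rely mainly on the speech signal and overlook the role of multi-modal context, such as slides, in disambiguation and in adapting to domain terminology. The paper aims to fill this gap.

Contribution: 1) Creates a benchmark of multi-modal presentation data; 2) proposes a data augmentation method to address the shortage of datasets; 3) shows that a model trained with slides substantially lowers the word error rate.

Method: 1) Automatically analyzes how domain-specific terminology is transcribed; 2) generates a multi-modal dataset through data augmentation; 3) trains an ASR model that combines speech and slide information.

Result: Compared with the baseline, the multi-modal model achieves a relative word error rate reduction of about 34% overall and 35% on domain-specific terms.

Insight: Visual information such as slides plays an important role in ASR, especially for recognizing domain-specific terms, and multi-modal models are likely to be highly useful in academic settings.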

Abstract: State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information is essential for disambiguation and adaptation. While most work focuses on speaker images to handle noisy conditions, this work also focuses on integrating presentation slides for the use case of scientific presentations. As a first step, we create a benchmark for multi-modal presentations, including an automatic analysis of the transcription of domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides through a suitable data augmentation approach. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
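
Since the headline number is a relative reduction in word error rate (WER), a small worked example may help. The sketch computes WER as word-level edit distance divided by reference length, which is the standard definition, and then the relative reduction between two hypotheses. The sentences are invented for illustration and are not from the paper's benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline`, e.g. 0.34 for 34%."""
    return (baseline - improved) / baseline


reference = "the transformer model uses attention"
speech_only = "the transform model uses a tension"      # invented baseline output
with_slides = "the transformer model uses a tension"    # invented improved output

wer_base = word_error_rate(reference, speech_only)
wer_new = word_error_rate(reference, with_slides)
print(wer_base, wer_new, relative_reduction(wer_base, wer_new))
```

A relative reduction of 34% therefore means the new WER is about two thirds of the baseline WER, not that WER dropped by 34 percentage points.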

[173] Generating Fair Consensus Statements with Social Choice on Token-Level MDPs

Carter Blair,Kate Larson

Main category: cs.AI

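TL;DR: The paper proposes an approach based on social choice theory and token-level MDPs for generating consensus statements with provable fairness guarantees, so that diverse opinions are aggregated fairly.

Details Motivation: Existing LLM frameworks for consensus statement generation lack the structure needed to guarantee fairness, so they cannot offer provable fairness when aggregating diverse opinions.

Contribution: 1. Models consensus statement generation as a multi-objective, token-level MDP; 2. proposes two generation methods grounded in social choice theory, one guaranteeing ex-ante core stability and the other maximizing egalitarian welfare.

Method: 1. Token-level MDP formulation in which each agent's preference corresponds to one objective; 2. generation policies designed around core stability and social welfare maximization; 3. search algorithms to pursue the egalitarian welfare objective.

Result: Experiments show that consensus statements generated by search under the egalitarian objective achieve better worst-case agent alignment than baseline methods.

Insight: Bringing social choice theory into text generation provides a theoretical framework for fairness and confirms that token-level MDPs are applicable to multi-agent settings.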

Abstract: Current frameworks for consensus statement generation with large language models lack the inherent structure needed to provide provable fairness guarantees when aggregating diverse free-form opinions. We model the task as a multi-objective, token-level Markov Decision Process (MDP), where each objective corresponds to an agent’s preference. Token-level rewards for each agent are derived from their policy (e.g., a personalized language model). This approach utilizes the finding that such policies implicitly define optimal Q-functions, providing a principled way to quantify rewards at each generation step without a value function (Rafailov et al., 2024). This MDP formulation creates a formal structure amenable to analysis using principles from social choice theory. We propose two approaches grounded in social choice theory. First, we propose a stochastic generation policy guaranteed to be in the ex-ante core, extending core stability concepts from voting theory to text generation. This policy is derived from an underlying distribution over complete statements that maximizes proportional fairness (Nash Welfare). Second, for generating a single statement, we target the maximization of egalitarian welfare using search algorithms within the MDP framework. Empirically, experiments using language models to instantiate agent policies show that search guided by the egalitarian objective generates consensus statements with improved worst-case agent alignment compared to baseline methods, including the Habermas Machine (Tessler et al., 2024).
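
The difference between the welfare objectives mentioned above is easy to see on a toy example. The sketch below scores three candidate statements by made-up per-agent utilities and picks a winner under utilitarian (sum), Nash (product), and egalitarian (minimum) welfare. It deliberately ignores the token-level MDP and the distribution over statements used for the ex-ante core result; the utilities and function names are hypothetical.

```python
import math


def utilitarian_choice(utilities):
    """Statement with the highest total utility."""
    return max(utilities, key=lambda s: sum(utilities[s]))


def nash_choice(utilities):
    """Statement with the highest product of utilities (via sum of logs)."""
    return max(utilities, key=lambda s: sum(math.log(u) for u in utilities[s]))


def egalitarian_choice(utilities):
    """Statement whose worst-off agent is best off (max-min)."""
    return max(utilities, key=lambda s: min(utilities[s]))


# Hypothetical utilities in (0, 1] for three agents per candidate statement.
utilities = {
    "s1": [0.9, 0.9, 0.1],
    "s2": [0.5, 0.5, 0.4],
    "s3": [0.6, 0.6, 0.3],
}

print(utilitarian_choice(utilities),
      nash_choice(utilities),
      egalitarian_choice(utilities))
```

In this example the three objectives disagree: the sum favors the statement that leaves one agent far behind, while the max-min rule picks the one with the best worst-case alignment.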

[174] Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies

Mason Nakamura,Abhinav Kumar,Saaduddin Mahmud,Sahar Abdelnabi,Shlomo Zilberstein,Eugene Bagdasarian

Main category: cs.AI

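TL;DR: The paper proposes Terrarium, a framework for studying safety, privacy, and security in LLM-based multi-agent systems (MAS). By repurposing the blackboard architecture, Terrarium provides a modular, configurable testbed for multi-agent collaboration and identifies key attack vectors.

Details Motivation: Multi-agent systems built on LLMs can automate complex tasks, but they also introduce new risks such as malicious attacks and data leakage. A systematic way to study and mitigate these risks is needed.

Contribution: Proposes the Terrarium framework for fine-grained study of safety, privacy, and security in LLM-based MAS; repurposes the blackboard architecture as a modular testbed; identifies and implements key attack vectors.

Method: Uses the blackboard design as the foundation for a modular, configurable testbed that supports rapid prototyping and evaluation of defenses, and implements three collaborative scenarios with four representative attacks.

Result: Demonstrates the flexibility of Terrarium, which can simulate and evaluate security risks in multi-agent systems and serves as a tool for iterating on defense designs.

Insight: Terrarium gives researchers and developers a systematic approach that can accelerate progress toward trustworthy multi-agent systems.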

Abstract: A multi-agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter-agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine-grained study on safety, privacy, and security in LLM-based MAS. We repurpose the blackboard design, an early approach in multi-agent systems, to create a modular, configurable testbed for multi-agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework’s flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi-agent systems.

[175] IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

Xikai Zhang,Bo Wang,Likang Xiao,Yongzhi Li,Quan Chen,Wenju Wu,Liu Liu

Main category: cs.AI

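TL;DR: The paper proposes IMAGINE, a framework that integrates the reasoning and planning capabilities of a multi-agent system (MAS) into a single compact model, substantially improving performance on complex reasoning and planning tasks.

Details Motivation: Large language models (LLMs) perform well on many tasks but still struggle with complex reasoning and planning. Multi-agent systems offer better collective reasoning, but they are costly and hard to train end to end. An efficient, scalable alternative is needed.

Contribution: Proposes the IMAGINE framework, which integrates the reasoning and planning capabilities of an MAS into a single model and, through end-to-end training, significantly surpasses the MAS itself.

Method: A general, scalable pipeline compresses the structured capabilities of an MAS into a small model, which then reasons efficiently after simple training.

Result: With Qwen3-8B-Instruct as the base model, IMAGINE reaches an 82.7% Final Pass Rate on the TravelPlanner benchmark, far above the 40% of DeepSeek-R1-671B.

Insight: Integrating the capabilities of a multi-agent system into a single model both reduces inference cost and improves performance, which shows the potential of compact models.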

Abstract: Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B achieve Final Pass Rates of only 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpasses the capabilities of the MAS through simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.

[176] ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

Yuanyi Song,Heyuan Huang,Qiqiang Lin,Yin Zhao,Xiangmou Qu,Jun Wang,Xingyu Lou,Weiwen Liu,Zhuosheng Zhang,Jun Wang,Yong Yu,Weinan Zhang,Zhaoxiang Wang

Main category: cs.AI

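TL;DR: The paper introduces ColorBench, a graph-structured benchmarking framework for evaluating mobile agents on complex long-horizon tasks, bridging the gap between offline static benchmarks and online dynamic testing.

Details Motivation: Current evaluation methods for mobile agents cannot fully test tasks that admit multiple solutions. Static benchmarks validate only a single path, while dynamic testing is limited by the complexity and non-reproducibility of real devices.

Contribution: 1) Proposes a graph-structured benchmarking framework that simulates dynamic behavior in a static environment; 2) develops the ColorBench benchmark, which supports multi-path evaluation and atomic-level capability analysis.

Method: Models the finite states observed during real-device interaction to simulate dynamic behavior statically, and designs 175 complex long-horizon tasks in ColorBench, each with multiple correct paths and typical error paths.

Result: Experiments reveal the limitations of existing models, and the authors propose improvement directions and technical pathways based on the results.

Insight: A graph-structured framework stays closer to real interaction, supports evaluation of multiple valid solutions, and suggests new ways to improve agents on long-horizon tasks.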

Abstract: The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined “golden path”, while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents’ performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.
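
A graph-structured benchmark of this kind can be pictured as a finite state graph in which several action sequences reach the goal. The sketch below is a hypothetical miniature, not ColorBench's actual data format: screens are nodes, actions are labeled edges, and a trajectory passes if every action is valid in its state and the final state is the goal.

```python
def replay(transitions, start, actions):
    """Follow actions through the state graph; stop at the first invalid action."""
    state = start
    for action in actions:
        next_state = transitions.get((state, action))
        if next_state is None:
            return state, False  # action not available in this state
        state = next_state
    return state, True


def evaluate(transitions, start, goal, actions):
    """A trajectory succeeds if it is valid and ends in the goal state."""
    final_state, valid = replay(transitions, start, actions)
    return valid and final_state == goal


# Hypothetical screens and actions: two routes lead to "settings".
transitions = {
    ("home", "open_app"): "app",
    ("home", "search"): "results",
    ("app", "open_settings"): "settings",
    ("results", "tap_settings"): "settings",
    ("app", "open_profile"): "profile",  # a typical error path
}

path_a = ["open_app", "open_settings"]
path_b = ["search", "tap_settings"]
wrong = ["open_app", "open_profile"]

results = [evaluate(transitions, "home", "settings", p) for p in (path_a, path_b, wrong)]
print(results)
```

Because the graph is fixed, every run is reproducible, yet an agent is not penalized for taking a valid route other than a single predefined "golden path".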

[177] TITAN: Graph-Executable Reasoning for Cyber Threat Intelligence

Marco Simoni,Aleksandar Fontana,Andrea Saracino,Paolo Mori

Main category: cs.AI

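TL;DR: TITAN is a framework for cyber threat intelligence that connects natural-language queries with reasoning over a structured knowledge graph. It reasons through a path planner model and a graph executor, and is validated on a MITRE-derived dataset.

Details Motivation: Traditional threat intelligence retrieval systems lack structured reasoning and cannot build clear logical chains between threats, behaviors, and defenses. TITAN aims to fill this gap.

Contribution: 1. Proposes the TITAN framework, which combines natural language with graph reasoning; 2. builds the TITAN Dataset to support training and evaluation; 3. verifies that the executable reasoning paths generated by models are syntactically and semantically valid.

Method: 1. A path planner model predicts chains of logical relations; 2. a graph executor traverses the TITAN Ontology; 3. a bidirectional graph derived from MITRE supports reversible reasoning.

Result: Experiments show that TITAN produces syntactically valid, semantically coherent reasoning paths that can be executed deterministically.

Insight: The structured, bidirectional graph design improves the interpretability and flexibility of reasoning and suits complex threat scenarios.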

Abstract: TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88209 examples (Train: 74258; Test: 13951) pairing natural language questions with executable reasoning paths and step by step Chain of Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.
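
The split between a planner that predicts a relation chain and an executor that walks a typed, bidirectional graph can be sketched in a few lines. Everything below is hypothetical: the node names, relation names, and inverse mapping are placeholders and do not reproduce the TITAN Ontology or any real MITRE entries. The relation chain that a planner model would predict is simply hard-coded.

```python
from collections import defaultdict


def build_graph(triples, inverse_of):
    """Index (head, relation) -> tails, adding the inverse edge for every triple."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        graph[(head, relation)].add(tail)
        graph[(tail, inverse_of[relation])].add(head)
    return graph


def execute_path(graph, start, relations):
    """Deterministically follow a chain of relations from a start node."""
    frontier = {start}
    for relation in relations:
        frontier = {t for node in frontier for t in graph.get((node, relation), set())}
    return frontier


# Placeholder schema and facts (hypothetical).
inverse_of = {"uses": "used_by", "mitigated_by": "mitigates"}
triples = [
    ("group_A", "uses", "technique_X"),
    ("group_A", "uses", "technique_Y"),
    ("technique_X", "mitigated_by", "mitigation_M"),
]
graph = build_graph(triples, inverse_of)

# Forward: which defenses apply to what group_A uses?
defenses = execute_path(graph, "group_A", ["uses", "mitigated_by"])
# Reverse: which threats does mitigation_M help against?
threats = execute_path(graph, "mitigation_M", ["mitigates", "used_by"])
print(defenses, threats)
```

Storing the inverse edge for each triple is what lets the same executor answer questions in both directions between threats, behaviors, and defenses.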

[178] Where to Search: Measure the Prior-Structured Search Space of LLM Agents

Zhuo-Yang Song

Main category: cs.AI

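TL;DR: The paper proposes a compact formal theory for describing and measuring LLM-assisted iterative search guided by domain priors. By modeling agents as fuzzy relation operators and introducing a coverage generating function and a measure of reachability difficulty, it provides a systematic formal description of, and operational tools for, iterative search constructed by LLMs.

Details Motivation: The generate-filter-refine paradigm has advanced LLM-driven reasoning, programming, and scientific discovery, but its effectiveness depends on how domain priors are encoded into an operationally structured hypothesis space. The paper addresses this with a workable theoretical framework.

Contribution: 1) Proposes a formal theory that describes and measures LLM-assisted iterative search; 2) models agent behavior with fuzzy relation operators and introduces a coverage generating function and a measure of reachability difficulty; 3) gives a structured description and geometric interpretation of the search space.

Method: 1) Models an agent as a fuzzy relation operator between inputs and outputs, with its behavior constrained by a fixed safety envelope; 2) measures the reachability difficulty of multi-step reasoning or search through a weighting parameter and the coverage generating function; 3) validates the theory's testable inferences with a majority-vote instantiation.

Result: The theory offers a workable language and tools for measuring agents and their search spaces, and gives a systematic formal description of iterative search constructed by LLMs.

Insight: Formalizing agent behavior and the search space gives LLM-driven iterative search a theoretical basis, which can help improve search efficiency and safety.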

Abstract: The generate-filter-refine iterative paradigm based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function. This induces a measure of reachability difficulty and provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
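
One way to read "weight all reachable paths by a single continuation parameter and sum them" is as a path-counting generating function on a graph. The sketch below assumes a crisp 0/1 transition graph (a simplification of the fuzzy relation in the abstract), weights each length-n walk from a source to a target by z^n, and truncates the sum at a maximum number of steps. The function name and the example graph are hypothetical, and the paper's exact definitions may differ.

```python
def coverage(adj, source, target, z, max_steps):
    """Truncated sum over n of z**n * (number of length-n walks source -> target)."""
    n_nodes = len(adj)
    # walks[j] = number of walks of the current length from source to node j
    walks = [1.0 if i == source else 0.0 for i in range(n_nodes)]
    total = walks[target]  # length-0 term (nonzero only if source == target)
    weight = 1.0
    for _ in range(max_steps):
        walks = [
            sum(walks[i] * adj[i][j] for i in range(n_nodes))
            for j in range(n_nodes)
        ]
        weight *= z
        total += weight * walks[target]
    return total


# Hypothetical 4-node transition graph: 0->1, 0->2, 1->2, 1->3, 2->3.
adj = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

value = coverage(adj, source=0, target=3, z=0.5, max_steps=5)
print(value)  # two 2-step walks and one 3-step walk: 2*0.25 + 1*0.125
```

Targets reachable only through long or few paths receive small values for a given z, which gives an intuitive sense in which such a sum can serve as a reachability measure.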

[179] Budget-aware Test-time Scaling via Discriminative Verification

Kyle Montgomery,Sijun Tan,Yuqi Chen,Siyuan Zhuang,Tianjun Zhang,Raluca Ada Popa,Chenguang Wang

Main category: cs.AI

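TL;DR: The paper proposes a budget-aware test-time scaling method that combines discriminative verifiers with self-consistency, substantially improving large language models on complex reasoning tasks while lowering compute cost.

Details Motivation: Using generative verifiers to select the best solution is computationally too expensive to be practical. A more efficient, budget-friendly alternative is needed.

Contribution: Proposes a hybrid approach that combines discriminative verifiers with self-consistency and, under a fixed compute budget, clearly outperforms existing generative verification methods.

Method: A hybrid of discriminative verification and self-consistency, supported by a thorough empirical analysis.

Result: On AIME2025, the method achieves up to 15.3% higher accuracy than existing generative verification.

Insight: Budget-aware discriminative verification is both more efficient and more accurate, making it a practical option for real-world applications.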

Abstract: Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a “free” upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.
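
A simple way to combine a discriminative verifier with self-consistency is to weight each sampled answer's vote by its verifier score. The sketch below contrasts plain majority voting with that weighted vote on invented answers and scores. It illustrates the general idea only; the paper's hybrid method may aggregate scores differently.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plain self-consistency: the most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, scores):
    """Hybrid: each sample votes for its answer with its verifier score as weight."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Invented example: five sampled solutions and hypothetical verifier scores in [0, 1].
answers = ["12", "12", "15", "15", "15"]
scores = [0.9, 0.8, 0.2, 0.3, 0.1]

print(majority_vote(answers), weighted_vote(answers, scores))
```

The verifier is called once per sample and returns a scalar, so the extra cost over plain self-consistency is small compared with generating a full verification rationale for each candidate.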

[180] TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

Annisaa Fitri Nurfidausi,Eleonora Mancini,Paolo Torroni

Main category: cs.AI

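TL;DR: TRI-DEP systematically compares feature representations and modelling strategies across EEG, speech, and text for depression detection. Experiments show that multimodal combinations outperform unimodal ones, pre-trained embeddings outperform handcrafted features, and carefully designed trimodal models reach state-of-the-art performance.

Details Motivation: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Existing studies are mostly unimodal or cover only limited multimodal settings, and they lack systematic feature comparisons and consistent evaluation protocols. This paper addresses those gaps.

Contribution: 1. A systematic comparison of feature representations and modelling strategies for EEG, speech, and text; 2. Carefully designed trimodal models that reach state-of-the-art performance; 3. Evidence for the advantages of pre-trained embeddings and multimodal fusion.

Method: The study evaluates handcrafted features against pre-trained embeddings, compares unimodal, bimodal, and trimodal configurations, analyses fusion strategies with attention to the role of EEG, and applies consistent subject-independent splits.

Result: (i) Combining modalities improves detection performance; (ii) pre-trained embeddings outperform handcrafted features; (iii) carefully designed trimodal models achieve state-of-the-art performance.

Insight: Multimodal signals, EEG included, are complementary for depression detection; pre-trained embeddings and consistent subject-independent splits are key to reliable results.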

Abstract: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.

[181] Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models

Akira Okutomi

Main category: cs.AI

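TL;DR: The paper reinterprets Kant's Critique of Pure Reason as a theory of feedback stability and proposes a composite instability index (H-Risk) for reasoning systems. It reveals a gap between nominal and epistemic stability and, in large language models (LLMs), links fragile internal dynamics to miscalibration and hallucination.

Details Motivation: The study aims to understand the stability of reasoning systems, and overconfidence in particular, from a Kantian perspective, providing a theoretical basis for diagnosing and reducing overconfidence.

Contribution: A new composite instability index (H-Risk); theoretical and experimental evidence of a gap between nominal and epistemic stability; and an application to LLMs.

Method: Kant's ideas are formalized as a theory of feedback stability, with H-Risk combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. The theory is tested in linear-Gaussian simulations and in LLM experiments.

Result: Higher H-Risk predicts overconfident errors even under formal stability. In LLMs, fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on both.

Insight: The work proposes a structural bridge between Kantian self-limitation and feedback control, offering a new lens for analyzing and improving the stability of reasoning systems.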

Abstract: We reinterpret Kant’s Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing – and selectively reducing – overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.

q-bio.QM [Back]

[182] GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents

Xi Yu,Yang Yang,Qun Liu,Yonghua Du,Sean McSweeney,Yuewei Lin

Main category: q-bio.QM

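TL;DR: GenCellAgent is a training-free multi-agent framework that segments cellular images through a planner-executor-evaluator loop, automatically selecting the best tool and adapting to new data, with substantial gains in segmentation accuracy.

Details Motivation: Cellular image segmentation is difficult because of heterogeneous modalities, morphological variability, and limited annotations. GenCellAgent aims to provide a general, training-free solution that reduces annotation burden and adapts to diverse needs.

Contribution: 1. A multi-agent framework that automatically routes images to the best segmentation tool; 2. Text-guided segmentation of organelles not covered by existing models; 3. A memory mechanism that enables self-evolution and personalized workflows.

Method: A planner-executor-evaluator loop (choose tool → run → quality-check) that combines generalist vision-language models with specialist segmenters and uses long-term memory to improve the segmentation process.

Result: Across four benchmarks, routing yields a 15.7% mean accuracy gain over state-of-the-art baselines. On mitochondria and endoplasmic reticulum from new datasets, average IoU improves by 37.6% over specialist models. The system can also segment novel objects such as the Golgi apparatus.

Insight: A hybrid framework that pairs generalist vision-language models with specialist segmenters substantially improves the generalization and adaptability of cellular image segmentation without retraining.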

Abstract: Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training-free multi-agent framework that orchestrates specialist segmenters and generalist vision-language models via a planner-executor-evaluator loop (choose tool $\rightarrow$ run $\rightarrow$ quality-check) with long-term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text-guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self-evolution and personalized workflows. Across four cell-segmentation benchmarks, this routing yields a 15.7% mean accuracy gain over state-of-the-art baselines. On endoplasmic reticulum and mitochondria from new datasets, GenCellAgent improves average IoU by 37.6% over specialist models. It also segments novel objects such as the Golgi apparatus via iterative text-guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.
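
The choose-tool, run, quality-check loop described in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not GenCellAgent's code: the tool registry, the evaluator callable, and the acceptance threshold are all hypothetical, and the planner is reduced to trying tools in a fixed order. The IoU helper shows how the overlap metric quoted in the abstract is computed for binary masks.

```python
from typing import Callable, Dict, List, Optional, Tuple

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # two empty masks count as a match


def plan_execute_evaluate(
    image: Mask,
    tools: Dict[str, Callable[[Mask], Mask]],
    evaluate: Callable[[Mask, Mask], float],
    threshold: float = 0.9,
) -> Tuple[Optional[str], Optional[Mask], float]:
    """Try each tool, score its output, and stop once a result is good enough."""
    best_name, best_mask, best_score = None, None, float("-inf")
    for name, tool in tools.items():   # planner: hypothetical fixed ordering
        mask = tool(image)             # executor: run the chosen tool
        score = evaluate(image, mask)  # evaluator: quality check
        if score > best_score:
            best_name, best_mask, best_score = name, mask, score
        if score >= threshold:         # accept early when quality is sufficient
            break
    return best_name, best_mask, best_score


# Toy setup: the "image" doubles as its own reference mask, purely for illustration.
image = [[1, 1], [0, 0]]
tools = {
    "all_foreground": lambda img: [[1] * len(row) for row in img],
    "copy_input": lambda img: [row[:] for row in img],
}
name, mask, score = plan_execute_evaluate(image, tools, lambda img, m: iou(m, img))
print(name, score)
```

In a real system the evaluator would not have a ground-truth mask; it would be a learned or model-based quality check, and the planner would pick tools based on the image and on memory of past runs rather than iterating in order.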

cs.RO [Back]

[183] Learning Human-Humanoid Coordination for Collaborative Object Carrying

Yushi Du,Yixuan Li,Baoxiong Jia,Yutang Lin,Pei Zhou,Wei Liang,Yanchao Yang,Siyuan Huang

Main category: cs.RO

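TL;DR: The paper proposes COLA, a reinforcement learning approach that combines leader and follower behaviors within a single policy, enabling a humanoid robot to carry objects collaboratively with a human without external sensors or complex interaction models.

Details Motivation: Human-humanoid collaboration holds great promise for healthcare, domestic assistance, and manufacturing, but compliant collaboration methods that account for a humanoid's whole-body dynamics remain largely unexplored.

Contribution: A proprioception-only reinforcement learning approach (COLA) that implicitly predicts object motion patterns and human intentions, enabling collaborative carrying that maintains load balance.

Method: A single policy is trained in a closed-loop environment with dynamic object interactions. It implicitly predicts object motion and human intentions and achieves compliant collaboration through coordinated trajectory planning.

Result: In simulation, the model reduces human effort by 24.7% compared to baseline approaches. Real-world experiments validate robustness across object types and terrains, and a user study with 23 participants reports an average improvement of 27.4% over baseline models.

Insight: Through implicit learning and closed-loop training, COLA achieves effective collaboration without complex interaction models, offering a practical route to real-world humanoid deployment.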

Abstract: Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids’ complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.

[184] GOPLA: Generalizable Object Placement Learning via Synthetic Augmentation of Human Arrangement

Yao Zhong,Hanzhi Chen,Simon Schaefer,Anran Zhang,Stefan Leutenegger

Main category: cs.RO

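TL;DR: GOPLA is a hierarchical framework that learns object placement from augmented human demonstrations, combining semantic preferences with geometric feasibility. A multi-modal large language model produces structured plans, and a diffusion-based planner together with synthetic data augmentation provides strong generalization.

Details Motivation: Robots acting as intelligent assistants must perform object placement, but existing methods fall short in semantic and geometric reasoning. Augmenting human demonstrations with synthetic data improves generalization.

Contribution: 1) The GOPLA hierarchical framework; 2) a multi-modal language model combined with a diffusion-based planner; 3) a synthetic data augmentation pipeline that addresses data scarcity.

Method: 1) A multi-modal language model generates structured plans; 2) a spatial mapper converts them into 3D affordance maps; 3) a diffusion-based planner generates placement poses; 4) synthetic data augments the training set.

Result: GOPLA improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, and generalizes well.

Insight: The combination of synthetic data augmentation and multi-modal planning is the key ingredient, and the diffusion-based planner handles this complex task well.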

Abstract: Robots are expected to serve as intelligent assistants, helping humans with everyday household organization. A central challenge in this setting is the task of object placement, which requires reasoning about both semantic preferences (e.g., common-sense object relations) and geometric feasibility (e.g., collision avoidance). We present GOPLA, a hierarchical framework that learns generalizable object placement from augmented human demonstrations. A multi-modal large language model translates human instructions and visual inputs into structured plans that specify pairwise object relationships. These plans are then converted into 3D affordance maps with geometric common sense by a spatial mapper, while a diffusion-based planner generates placement poses guided by test-time costs, considering multi-plan distributions and collision avoidance. To overcome data scarcity, we introduce a scalable pipeline that expands human placement demonstrations into diverse synthetic training data. Extensive experiments show that our approach improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, demonstrating strong generalization across a wide range of real-world robotic placement scenarios.

[185] From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance

Zhe Li,Cheng Chi,Yangyang Wei,Boan Zhu,Yibo Peng,Tao Huang,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Chang Xu

Main category: cs.RO

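TL;DR: RoboGhost is a retargeting-free humanoid control framework that generates actions directly from language via motion latents, avoiding the cumulative errors and high latency of multi-stage pipelines and tightening the coupling between semantics and control.

Details Motivation: Existing language-guided humanoid control relies on a multi-stage process (decoding human motion, then retargeting it to the robot's morphology), which is prone to cumulative errors, high latency, and weak coupling between semantics and control.

Contribution: 1. The RoboGhost framework, which goes directly from language to action without explicit motion decoding or retargeting; 2. A diffusion-based policy and a hybrid causal transformer-diffusion motion generator that provide long-horizon consistency and diversity.

Method: RoboGhost conditions the humanoid policy directly on language-grounded motion latents and uses a diffusion-based policy to denoise executable actions directly from noise. The hybrid causal transformer-diffusion generator maintains long-horizon stability and diversity.

Result: Experiments show that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces semantically aligned locomotion.

Insight: The framework offers an efficient direct pathway for language-to-action systems and extends to other modalities such as images and audio.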

Abstract: Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.

[186] RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

Mingxuan Yan,Yuping Wang,Zechun Liu,Jiachen Li

Main category: cs.RO

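TL;DR: The paper proposes a Retrieval-based Demonstration Decomposer (RDD) that aligns sub-task intervals through visual features in long-horizon tasks, improving task performance.

Details Motivation: VLM-based planners are typically finetuned on demonstrations segmented by human annotation or heuristic rules. Such segmentations can deviate from the training data of the low-level visuomotor policy, which degrades performance.

Contribution: RDD, which automatically decomposes demonstrations into sub-tasks by retrieving and aligning visual features, improving both decomposition quality and task success.

Method: Visual feature retrieval aligns the sub-task intervals in a demonstration with the training data of the low-level visuomotor policies, yielding an automatic decomposition.

Result: On both simulation and real-world tasks, RDD outperforms the state-of-the-art sub-task decomposer and is robust across diverse settings.

Insight: By aligning visual features, RDD avoids the shortcomings of annotation-based and heuristic decomposition and provides a more reliable basis for planning long-horizon tasks.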

Abstract: To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic sub-tasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
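
One way to picture the alignment step in this abstract is as an interval-partitioning problem: split a demonstration into contiguous intervals so that each interval's feature is close to some feature retrieved from the policy's training data. The sketch below is an assumption-laden illustration, not the RDD algorithm itself. The mean-pooled interval feature, the cosine score, the dynamic program, and the `segment_cost` penalty are all choices made here for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_feature(frames: Sequence[Vector], start: int, end: int) -> List[float]:
    """Average the per-frame features over the half-open interval [start, end)."""
    span = frames[start:end]
    return [sum(column) / len(span) for column in zip(*span)]


def decompose(
    frames: Sequence[Vector], bank: Sequence[Vector], segment_cost: float = 0.1
) -> List[Tuple[int, int]]:
    """Partition frames into intervals that best match features in `bank`.

    `bank` is a hypothetical, non-empty set of sub-task features from the
    policy's training data. `segment_cost` is an assumed penalty per interval
    that discourages over-segmentation.
    """
    n = len(frames)
    best = [-math.inf] * (n + 1)  # best[j]: best total score for frames[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last interval
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            feature = mean_feature(frames, start, end)
            similarity = max(cosine(feature, ref) for ref in bank)  # retrieval
            score = best[start] + (end - start) * similarity - segment_cost
            if score > best[end]:
                best[end], back[end] = score, start
    segments: List[Tuple[int, int]] = []
    end = n
    while end > 0:  # walk the back-pointers to recover the intervals
        segments.append((back[end], end))
        end = back[end]
    return segments[::-1]


# Toy demonstration: two frames resembling one sub-task, then two resembling another.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
bank = [[1.0, 0.0], [0.0, 1.0]]
print(decompose(frames, bank))
```

On the toy input the dynamic program returns the two intervals (0, 2) and (2, 4), since each matches a bank feature exactly and merging or splitting them further lowers the total score. Real visual features would come from an image encoder, and the quadratic search over interval boundaries would need pruning for long demonstrations.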