Table of Contents
- cs.CL [Total: 33]
- cs.CV [Total: 75]
- cs.CR [Total: 1]
- cs.AI [Total: 5]
- cs.CY [Total: 3]
- eess.SY [Total: 1]
- cs.SE [Total: 2]
- cs.LG [Total: 5]
- cs.RO [Total: 1]
cs.CL
[1] Outraged AI: Large language models prioritise emotion over cost in fairness enforcement
Hao Liu,Yiqing Dai,Haotian Tan,Yu Lei,Yujia Zhou,Zhen Wu
Main category: cs.CL
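TL;DR: LLMs use emotion to guide moral decisions, sometimes even more strongly than humans, but they are less cost-sensitive and trade off fairness against cost differently from humans.
Details
Motivation: To examine whether LLMs, like humans, rely on emotion in moral decision-making, particularly in how emotion is weighed against cost when enforcing fairness.
Contribution: Provides the first causal evidence that emotion guides moral decisions in LLMs, and reveals deficits in cost calibration and nuanced fairness judgements.
Method: A large-scale experiment compares LLM and human decisions in fairness enforcement (altruistic third-party punishment) and analyzes the relationship between self-reported emotion and punishment behavior.
Result: LLMs show strongly emotion-driven behavior but weaker cost sensitivity; reasoning models are closer to human behavior yet remain emotion-dominated.
Insight: Moral decision-making in LLMs may progress along a trajectory paralleling human development; future models should integrate emotion with context-sensitive reasoning.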
Abstract: Emotions guide human decisions, but whether large language models (LLMs) use emotion similarly remains unknown. We tested this using altruistic third-party punishment, where an observer incurs a personal cost to enforce fairness, a hallmark of human morality and often driven by negative emotion. In a large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, LLMs used emotion to guide punishment, sometimes even more strongly than humans did: Unfairness elicited stronger negative emotion that led to more punishment; punishing unfairness produced more positive emotion than accepting; and critically, prompting self-reports of emotion causally increased punishment. However, mechanisms diverged: LLMs prioritized emotion over cost, enforcing norms in an almost all-or-none manner with reduced cost sensitivity, whereas humans balanced fairness and cost. Notably, reasoning models (o3-mini, DeepSeek-R1) were more cost-sensitive and closer to human behavior than foundation models (GPT-3.5, DeepSeek-V3), yet remained heavily emotion-driven. These findings provide the first causal evidence of emotion-guided moral decisions in LLMs and reveal deficits in cost calibration and nuanced fairness judgements, reminiscent of early-stage human responses. We propose that LLMs progress along a trajectory paralleling human development; future models should integrate emotion with context-sensitive reasoning to achieve human-like emotional intelligence.
[2] POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
Yizhuo Chen,Xin Liu,Ruijie Wang,Zheng Li,Pei Chen,Changlong Yu,Priyanka Nigam,Meng Jiang,Bing Yin
Main category: cs.CL
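TL;DR: POPI is a framework that personalizes large language models through optimized natural language preference inference, addressing existing methods' neglect of individual differences and their high computational cost.
Details
Motivation: Existing LLM alignment techniques such as RLHF and DPO optimize toward population-level averages and overlook the diversity of user preferences, while personalization methods such as per-user fine-tuning are computationally expensive or inefficient.
Contribution: Proposes the POPI framework, in which a preference inference model distills user signals into concise natural language summaries that serve as transparent personalization representations, and jointly optimizes preference inference and personalized generation.
Method: POPI uses reinforcement learning to jointly optimize preference inference and personalized generation; the preference summaries condition a shared generation model, giving efficient and transparent personalization.
Result: Across four personalization benchmarks, POPI consistently improves personalization accuracy and substantially reduces context overhead; the summaries also transfer directly to frozen off-the-shelf LLMs.
Insight: Encoding user preferences as natural language summaries is an efficient and transparent way to personalize, enabling plug-and-play personalization without weight updates.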
Abstract: Large language models (LLMs) achieve strong benchmark performance, yet user experiences remain inconsistent due to diverse preferences in style, tone, and reasoning mode. Nevertheless, existing alignment techniques such as reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO) largely optimize toward population-level averages and overlook individual variation. Naive personalization strategies like per-user fine-tuning are computationally prohibitive, and in-context approaches that prepend raw user signals often suffer from inefficiency and noise. To address these challenges, we propose POPI, a general framework that introduces a preference inference model to distill heterogeneous user signals into concise natural language summaries. These summaries act as transparent, compact, and transferable personalization representations that condition a shared generation model to produce personalized responses. POPI jointly optimizes both preference inference and personalized generation under a unified objective using reinforcement learning, ensuring summaries maximally encode useful preference information. Extensive experiments across four personalization benchmarks demonstrate that POPI consistently improves personalization accuracy while reducing context overhead by a large margin. Moreover, optimized summaries seamlessly transfer to frozen off-the-shelf LLMs, enabling plug-and-play personalization without weight updates.
[3] CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections
Keuntae Kim,Eunhye Jeong,Sehyeon Lee,Seohee Yoon,Yong Suk Choi
Main category: cs.CL
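TL;DR: The paper proposes CLAWS, which uses attention-window scores to assess the creativity of LLM-generated mathematical solutions without human evaluation, and validates it on several math contest datasets.
Details
Motivation: Although LLMs perform well on mathematics and coding tasks, assessing the creativity of their solutions has been overlooked. Existing approaches struggle to define the range of creativity and depend on human evaluation; CLAWS targets both problems.
Contribution: (1) A creativity assessment method, CLAWS, that needs no human evaluation; (2) a classification of mathematical solutions into typical, creative, and hallucinated; (3) validation of CLAWS across multiple LLMs and math contest datasets.
Method: CLAWS uses attention weights to analyze the relationship between prompt sections and the output, and defines a creativity score from them. It is compared against five white-box detection methods.
Result: On five 7-8B math RL models and 4,545 math problems, CLAWS outperforms existing methods (Perplexity, Logit Entropy, and others).
Insight: Attention weights can be used to quantify creativity in LLM generations, offering a new direction for automated evaluation.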
Abstract: Recent advances in enhancing the reasoning ability of large language models (LLMs) have been remarkably successful. LLMs trained with reinforcement learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a method that defines and classifies mathematical solutions into typical, creative, and hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4545 math problems collected from 181 math contests (AJHSME, AMC, AIME).
[4] Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models
Shuodi Liu,Yingzhuo Liu,Zi Wang,Yusheng Wang,Huijia Wu,Liuyu Xiang,Zhaofeng He
Main category: cs.CL
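TL;DR: The paper proposes Select-Then-Decompose, an adaptive task decomposition strategy that dynamically selects the most suitable decomposition approach and checks results in a verification module, achieving an optimal balance between performance and cost.
Details
Motivation: LLMs are strong at task decomposition, but existing methods overlook the trade-off between performance and cost. The paper addresses this through systematic analysis and an adaptive strategy.
Contribution: (1) A comprehensive investigation of task decomposition that identifies six categorization schemes; (2) an empirical analysis of three key influencing factors; (3) the Select-Then-Decompose strategy, which performs dynamic selection and result verification.
Method: A closed-loop process of three stages (selection, execution, verification) dynamically selects the most suitable decomposition approach and uses a verification module to improve the reliability of results.
Result: Across multiple benchmarks, Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost.
Insight: Successful task decomposition depends not only on the choice of method but also on task characteristics and model configuration; dynamic selection and verification are key.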
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation on task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that the Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at https://github.com/summervvind/Select-Then-Decompose.
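To make the Pareto-frontier claim in [4] concrete, the following is a minimal sketch, not the authors' code, of how one might check which methods are non-dominated over (cost, performance) pairs. The method names and numbers are made-up placeholders for illustration only.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of methods not dominated by any other method.

    Each value is (cost, performance): lower cost and higher
    performance are better.
    """
    frontier = []
    for name, (cost, perf) in points.items():
        dominated = any(
            # The other method is at least as good on both axes
            # and strictly better on at least one.
            (c <= cost and p >= perf) and (c < cost or p > perf)
            for other, (c, p) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical (cost, accuracy) values, not taken from the paper.
methods = {
    "direct": (1.0, 0.60),
    "plan_then_solve": (3.0, 0.70),
    "iterative_decompose": (6.0, 0.68),
    "adaptive_select": (2.0, 0.72),
}
print(pareto_frontier(methods))  # ['adaptive_select', 'direct']
```

A method "lies on the Pareto frontier" when no alternative is both cheaper and at least as accurate, which is what the loop tests.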
[5] Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs
Yehor Tereshchenko,Mika Hämäläinen
Main category: cs.CL
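TL;DR: A comparative study of toxicity detection methods for online gaming chats, evaluating traditional machine learning models with embeddings, LLMs with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG), and proposing a hybrid content moderation architecture.
Details
Motivation: Toxicity detection in online gaming chats needs methods that are both efficient and economical, but existing techniques trade off accuracy, speed, and computational cost, so a systematic study is needed to guide deployment.
Contribution: (1) A comprehensive comparison of the toxicity detection performance of several NLP methods; (2) a hybrid moderation architecture combining automated detection with continuous learning; (3) the finding that fine-tuned DistilBERT offers the best accuracy-cost trade-off.
Method: Embedding-based methods, LLMs (zero-shot and few-shot), fine-tuned transformers (such as DistilBERT), and RAG approaches are evaluated along three dimensions: classification accuracy, processing speed, and computational cost.
Result: Fine-tuned DistilBERT achieves the best balance between accuracy and cost, while LLMs perform well but at high computational cost.
Insight: The hybrid architecture combines automation with continuous learning and suits dynamic environments; fine-tuning smaller models such as DistilBERT is more cost-effective in real deployments.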
Abstract: This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
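The abstract of [5] does not spell out how the hybrid moderation system divides work between the classifier and human moderators. One common pattern, shown here purely as an assumption, is confidence-threshold routing: confident predictions are handled automatically and uncertain ones go to a human, whose labels are kept for later retraining. The names `route`, `record_human_label`, and the thresholds `low` and `high` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Decision:
    message: str
    action: str  # "allow", "block", or "human_review"


def route(message: str, toxicity_score: float,
          low: float = 0.2, high: float = 0.9) -> Decision:
    """Auto-handle confident cases; send uncertain ones to a moderator.

    `toxicity_score` is assumed to come from a classifier such as a
    fine-tuned DistilBERT; the thresholds are illustrative.
    """
    if toxicity_score >= high:
        return Decision(message, "block")
    if toxicity_score <= low:
        return Decision(message, "allow")
    return Decision(message, "human_review")


# Human-reviewed messages become labeled examples for later retraining,
# a simple stand-in for a continuous-learning mechanism.
retraining_buffer: List[Tuple[str, bool]] = []


def record_human_label(decision: Decision, is_toxic: bool) -> None:
    if decision.action == "human_review":
        retraining_buffer.append((decision.message, is_toxic))
```

Raising `low` or lowering `high` shrinks the human-review band, which reduces moderator workload at the price of more automated mistakes.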
[6] Diagnosing Representation Dynamics in NER Model Extension
Xirui Zhang,Philippe de La Chevasnerie,Benoit Fabre
Main category: cs.CL
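TL;DR: The paper studies representation dynamics when extending an NER model to new PII entities, finding that semantic and morphological feature mechanisms are relatively independent, that the LOC entity is vulnerable, and that plasticity of the ‘O’ tag is crucial for adaptation.
Details
Motivation: Extending an NER model to recognize new PII entities requires understanding how the model adapts to old and new entities, so that performance degradation can be avoided.
Contribution: Shows the independence of semantic and morphological features, identifies the vulnerability of the LOC entity, and shows that ‘O’ tag plasticity is key to resolving representation drift.
Method: Incremental learning is used as a diagnostic tool to measure semantic drift; the reverse ‘O’-tag representation drift is resolved by unfreezing the ‘O’ tag's classifier.
Result: Jointly fine-tuning a BERT model causes minimal degradation for the original classes, but the LOC entity is vulnerable because of representation overlap with the new PII. Unfreezing the ‘O’ tag's classifier resolves the blocked learning.
Insight: NER model adaptation depends on feature independence, representation overlap, and ‘O’ tag plasticity, giving mechanistic guidance for future model extension.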
Abstract: Extending Named Entity Recognition (NER) models to new PII entities in noisy spoken-language data is a common need. We find that jointly fine-tuning a BERT model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII (EMAIL, PHONE) results in minimal degradation for original classes. We investigate this “peaceful coexistence,” hypothesizing that the model uses independent semantic vs. morphological feature mechanisms. Using an incremental learning setup as a diagnostic tool, we measure semantic drift and find two key insights. First, the LOC (location) entity is uniquely vulnerable due to a representation overlap with new PII, as it shares pattern-like features (e.g., postal codes). Second, we identify a “reverse O-tag representation drift.” The model, initially trained to map PII patterns to ‘O’, blocks new learning. This is resolved only by unfreezing the ‘O’ tag’s classifier, allowing the background class to adapt and “release” these patterns. This work provides a mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and ‘O’ tag plasticity.
[7] Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models
Shabnam Ataee,Andrei Popescu-Belis
Main category: cs.CL
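TL;DR: The paper evaluates how well LLMs translate text with inter-sentential dependencies and shows that chain-of-thought reasoning improves translation quality.
Details
Motivation: To study LLMs' translation ability on inter-sentential dependencies (pronominal anaphora and lexical cohesion) and the effect of chain-of-thought reasoning on performance.
Contribution: (1) Shows that chain-of-thought reasoning is effective for context-aware translation; (2) systematically evaluates 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral, and Phi families; (3) identifies a “wise get wiser” effect.
Method: On the English-French DiscEvalMT benchmark, LLMs are compared with and without prompts that encourage chain-of-thought reasoning, measuring accuracy on the discrimination task and generation quality on the translation task.
Result: With chain-of-thought reasoning, the best models (GPT-4, GPT-4o, and Phi) reach about 90% accuracy on the first task and COMET scores of about 92% on the second.
Insight: Chain-of-thought reasoning improves context-aware translation in the best models, and the gain from reasoning is larger the stronger the model is without reasoning.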
Abstract: This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a “wise get wiser” effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
[8] Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment
Patricia Delafuente,Arya Honraopatil,Lara J. Martin
Main category: cs.CL
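TL;DR: The paper explores using LLMs and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. It finds that specific instructions strongly affect model output and that instruct models are sufficient for the task, without needing reasoning models.
Details
Motivation: To explore the potential of LLMs in a real game setting, in particular how better prompt engineering can improve performance on a complex task such as DnD command generation.
Contribution: (1) Compares a reasoning model and an instruct model on DnD command generation; (2) shows that small prompt changes significantly affect model output; (3) shows that instruct models are adequate for this kind of task.
Method: Using the FIREBALL dataset, a reasoning model (DeepSeek-R1-Distill-LLaMA-8B) is compared with an instruct model (LLaMA-3.1-8B-Instruct), and the effect of prompt engineering is analyzed experimentally.
Result: The instruct model is sufficient for the task, and the specificity of the prompt has a large effect on model output.
Insight: (1) Prompt engineering matters; (2) reasoning models are not necessarily better than instruct models on complex tasks; (3) a single-sentence prompt change can greatly alter model behavior.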
Abstract: This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, that even single sentence changes in prompts can greatly affect the output of models, and that instruct models are sufficient for this task compared to reasoning models.
[9] LLMs Encode How Difficult Problems Are
William Lugoloobi,Chris Russell
Main category: cs.CL
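TL;DR: The paper investigates whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and how this representation tracks generalization during reinforcement learning post-training.
Details
Motivation: LLMs are puzzlingly inconsistent: they solve complex problems yet often fail on simpler ones. The authors study whether LLMs internally encode problem difficulty in order to understand this.
Contribution: (1) Human-labeled difficulty is linearly decodable from LLMs and scales with model size; (2) LLM-derived difficulty signals are weak and unreliable; (3) steering along the difficulty direction can improve model performance.
Method: Linear probes are trained across layers and token positions on 60 models and evaluated on the mathematical and coding subsets of Easy2HardBench; GRPO training is used to examine the relationship between difficulty signals and performance.
Result: Human-labeled difficulty is strongly linearly decodable (AMC: ρ ≈ 0.88) and scales with model size; LLM-derived difficulty is weak and scales poorly. Steering toward “easier” representations reduces hallucination and improves accuracy.
Insight: Human annotations provide a stable difficulty signal that RL amplifies, whereas difficulty estimates derived from model performance become misaligned as models improve.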
Abstract: Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward “easier” representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
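The probe quality in [9] is reported as a correlation ρ between probe outputs and difficulty labels. Assuming ρ denotes a rank correlation such as Spearman's (the abstract does not say which), the sketch below shows how such a score could be computed from probe predictions. The example values are invented and are not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical probe outputs versus human difficulty labels.
probe_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
human_difficulty = [1, 2, 3, 4, 5]
print(spearman(probe_scores, human_difficulty))  # approximately 0.9
```

A rank correlation only asks whether the probe orders problems the same way humans do, so it is insensitive to the scale of the probe's raw outputs.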
[10] Extracting Rule-based Descriptions of Attention Features in Transformers
Dan Friedman,Adithya Bhaskar,Alexander Wettig,Danqi Chen
Main category: cs.CL
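TL;DR: Proposes rule-based descriptions for explaining attention-layer features in transformers, covering skip-gram rules, absence rules, and counting rules, and demonstrates an automatic extraction method on GPT-2 small.
Details
Motivation: Existing approaches explain model features through sparse linear combinations and subjective inspection of exemplars, which limits objectivity and scalability. The paper aims for more transparent and interpretable feature analysis through rule-based descriptions.
Contribution: (1) Three types of rule-based feature description (skip-gram, absence, counting); (2) a simple method to extract these rules automatically; (3) a demonstration of feasibility on GPT-2 small.
Method: By analyzing interactions between the input and output features of attention layers, the method automatically extracts skip-gram, absence, and counting rules, and quantifies feature behavior using rule matching and statistics.
Result: A majority of features can be described well with around 100 skip-gram rules; absence rules appear in over a fourth of features in the first layer; counting rules are rarer but present.
Insight: Rule-based descriptions give a more transparent and scalable way to explain features, highlight the distinctive value of absence and counting rules, and lay groundwork for future research.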
Abstract: Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form “[Canadian city]… speaks –> English”, (2) absence rules of the form “[Montreal]… speaks -/-> English,” and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.
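The three rule types in [10] can be illustrated with plain token matching. This is a simplified reading of the abstract's examples, not the paper's extraction method: real features act on SAE activations and attention scores rather than exact strings, and the function names and signatures here are hypothetical.

```python
from typing import List, Optional


def skip_gram_rule(tokens: List[str], earlier: str, current: str,
                   promote: str) -> Optional[str]:
    """'[earlier] ... [current] -> promote'.

    Fires when the last token is `current` and `earlier` occurs
    somewhere before it; returns the promoted output token.
    """
    if tokens and tokens[-1] == current and earlier in tokens[:-1]:
        return promote
    return None


def absence_rule(tokens: List[str], blocker: str, current: str,
                 promote: str) -> Optional[str]:
    """'[blocker] ... [current] -/-> promote'.

    The promotion at `current` is suppressed when `blocker`
    appears earlier in the context.
    """
    if tokens and tokens[-1] == current and blocker not in tokens[:-1]:
        return promote
    return None


def counting_rule(tokens: List[str], word: str, threshold: int) -> bool:
    """Toggles only when `word` occurs more than `threshold` times."""
    return tokens.count(word) > threshold


toronto = "She lives in Toronto and speaks".split()
montreal = "She lives in Montreal and speaks".split()
print(skip_gram_rule(toronto, "Toronto", "speaks", "English"))   # English
print(absence_rule(montreal, "Montreal", "speaks", "English"))   # None
```

The absence case shows why exemplar inspection can mislead: contexts where the feature stays silent never show up among its top activations.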
[11] CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models
Ritam Upadhyay,Naman Ahuja,Rishabh Baral,Aparna Garimella,Vivek Gupta
Main category: cs.CL
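TL;DR: CMT-Bench is a diagnostic benchmark for evaluating the robustness of LLMs on dynamic text-to-table generation. The study shows that current LLMs are brittle on this task.
Details
Motivation: Existing text-to-table (T2T) systems rely on extensive prompt engineering or iterative event extraction, which is computationally expensive and obscures how models reason over evolving narratives. A benchmark that can diagnose robustness is needed.
Contribution: Introduces CMT-Bench, which evaluates LLM robustness along three semantics-preserving dimensions: extractive-cue ablation, temporal prefixing, and entity-form perturbations.
Method: A dynamic table generation task is built from live cricket commentary, with three test dimensions designed to diagnose state tracking, long-context stability, and sensitivity to surface variation.
Result: Current LLMs show large drops without extractive summaries, monotonic degradation as input length grows, and consistent accuracy drops under entity-form changes.
Insight: LLMs are brittle in dynamic text-to-table generation; robustness-first evaluation should be a prerequisite for developing efficient and scalable approaches.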
Abstract: LLM-driven text-to-table (T2T) systems often rely on extensive prompt engineering or iterative event extraction in code-parsable formats, which boosts scores but is computationally expensive and obscures how models actually reason over temporally evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, out-of-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context state-of-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drops under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic text-to-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.
[12] Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Yoshinari Fujinuma
Main category: cs.CL
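TL;DR: The paper uses contrastive decoding to mitigate score range bias in LLM-as-a-judge, improving correlation with human judgments.
Details
Motivation: LLMs used as judges exhibit score range bias, which makes their scores unreliable and limits their use in direct assessment.
Contribution: Identifies score range bias in LLM judges and mitigates it through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments.
Method: Contrastive decoding is used to analyze and mitigate score range bias in LLM scoring.
Result: Contrastive decoding reduces score range bias, with up to 11.3% relative improvement on average in Spearman correlation across different score ranges.
Insight: Contrastive decoding is an effective way to mitigate systematic bias in LLM scoring tasks and improve judge reliability.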
Abstract: Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges, preventing the search for optimal score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
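The abstract of [12] does not give the exact decoding formula. The sketch below uses the generic contrastive-decoding objective, restricted to score tokens: pick the token that maximizes log p_main minus alpha times log p_assistant. The two models' logits, the weight `alpha`, and the function names are assumptions for illustration.

```python
import math
from typing import Dict


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a small token set."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def contrastive_score(main_logits: Dict[str, float],
                      assistant_logits: Dict[str, float],
                      alpha: float = 1.0) -> str:
    """Pick the score token maximizing log p_main - alpha * log p_assistant."""
    lp_main = log_softmax(main_logits)
    lp_asst = log_softmax(assistant_logits)
    return max(lp_main, key=lambda tok: lp_main[tok] - alpha * lp_asst[tok])


# Hypothetical logits over the score tokens "1".."5" from two judges.
main = {"1": 0.0, "2": 0.5, "3": 2.0, "4": 1.8, "5": 0.2}
assistant = {"1": 0.0, "2": 0.5, "3": 2.5, "4": 0.5, "5": 0.2}

print(contrastive_score(main, assistant, alpha=0.0))  # 3 (plain greedy choice)
print(contrastive_score(main, assistant, alpha=1.0))  # 4 (shared bias removed)
```

In this toy case both models lean toward "3", a stand-in for a range-induced bias; subtracting the assistant's log-probabilities cancels that shared preference and surfaces the main model's distinctive vote for "4".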
[13] Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Yanhong Li,Zixuan Lan,Jiawei Zhou
Main category: cs.CL
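TL;DR: The paper explores rendering long text inputs as images to reduce the number of tokens a decoder LLM consumes, showing substantial token savings (often nearly half) while preserving task performance.
Details
Motivation: Multimodal LLMs can process images of text, so the authors ask whether feeding text as images can compress token usage while preserving performance.
Contribution: Proposes a text compression approach that renders long text as an image input, greatly reducing the number of tokens required without degrading task performance.
Method: Long text is rendered as a single image and fed directly to a multimodal LLM; conventional text input and image input are compared on token efficiency and performance.
Result: On RULER (long-context retrieval) and CNN/DailyMail (document summarization), the image-input method saves nearly half the tokens without degrading performance.
Insight: Visual text representation is an efficient and practical form of input compression, especially for decoder LLMs that must process long text.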
Abstract: Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to a dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
[14] Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata
Zhengqing Yuan,Yiyang Li,Weixiang Sun,Zheyuan Zhang,Kaiwen Shi,Keerthiram Murugesan,Yanfang Ye
Main category: cs.CL
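TL;DR: Food4All is a multi-agent framework for real-time free food discovery with nutritional metadata, addressing fragmented information and poor real-time adaptability in responses to food insecurity.
Details
Motivation: Food insecurity is a serious public health problem in the United States; existing systems provide incomplete, geographically irrelevant information and lack real-time adaptability. Food4All aims to fill this gap.
Contribution: (1) The first multi-agent framework for real-time free food retrieval; (2) heterogeneous data aggregation, lightweight reinforcement learning, and an online feedback loop that optimize for geographic accessibility and nutritional correctness; (3) retrieval policies that adapt dynamically.
Method: (1) Aggregates data from official databases, community platforms, and social media; (2) uses a lightweight reinforcement learning algorithm to optimize retrieval; (3) adapts policies dynamically through an online feedback loop.
Result: Food4All provides nutritional annotation and guidance in real time, serving the urgent needs of food-insecure populations.
Insight: Combining information acquisition, semantic analysis, and decision support can give vulnerable populations more adaptive and practical system support.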
Abstract: Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.
[15] From Retrieval to Generation: Unifying External and Parametric Knowledge for Medical Question Answering
Lei Li,Xiao Zhou,Yingying Zhang,Xian Wu
Main category: cs.CL
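TL;DR: MedRGAG is a unified retrieval-generation augmented framework for medical question answering that combines external retrieval with parametric knowledge, using its KGCC and KADS modules to improve the completeness of generated documents and the accuracy of evidence.
Details
Motivation: In medical QA, existing RAG and GAG approaches suffer from retrieval noise and generation hallucination, respectively, which limits answer reliability.
Contribution: Proposes the MedRGAG framework, which unifies external and parametric knowledge through the KGCC and KADS modules to improve knowledge-intensive reasoning.
Method: The KGCC module generates documents that fill in missing knowledge; the KADS module adaptively selects a combination of retrieved and generated documents as evidence.
Result: On five medical QA benchmarks, MedRGAG improves over MedRAG by 12.5% and over MedGENIE by 4.5%.
Insight: Combining retrieval and generation addresses the weaknesses of either path alone and improves performance on knowledge-intensive tasks.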
Abstract: Medical question answering (QA) requires extensive access to domain-specific knowledge. A promising direction is to enhance large language models (LLMs) with external knowledge retrieved from medical corpora or parametric knowledge stored in model parameters. Existing approaches typically fall into two categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning on externally retrieved evidence, and Generation-Augmented Generation (GAG), which depends solely on the model's internal knowledge to generate contextual documents. However, RAG often suffers from noisy or incomplete retrieval, while GAG is vulnerable to hallucinated or inaccurate information due to unconstrained generation. Both issues can mislead reasoning and undermine answer reliability. To address these challenges, we propose MedRGAG, a unified retrieval-generation augmented framework that seamlessly integrates external and parametric knowledge for medical QA. MedRGAG comprises two key modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents that complement the missing knowledge revealed by retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively selects an optimal combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation. Extensive experiments on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5% improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the effectiveness of unifying retrieval and generation for knowledge-intensive reasoning. Our code and data are publicly available at https://anonymous.4open.science/r/MedRGAG
[16] ECG-LLM – training and evaluation of domain-specific large language models for electrocardiography
Lara Ahrens,Wilhelm Haverkamp,Nils Strodthoff
Main category: cs.CL
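TL;DR: The paper studies open-weight LLMs adapted to electrocardiography (ECG); using fine-tuning and a multi-layered evaluation framework, it shows their potential for healthcare applications and their privacy advantages.
Details
Motivation: The strict privacy requirements of healthcare make locally deployed open-weight LLMs attractive. However, domain adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. The study addresses this with electrocardiography as a case.
Contribution: (1) A fine-tuning approach for ECG-domain LLMs; (2) a multi-layered evaluation framework comparing fine-tuned models, retrieval-augmented generation (RAG), and a general-purpose model (Claude Sonnet 3.7); (3) evidence of performance heterogeneity across evaluation methodologies.
Method: (1) Open-weight models (Llama 3.1 70B) are fine-tuned on ECG domain literature; (2) RAG and a multi-layered evaluation framework (multiple-choice evaluation, automatic text metrics, LLM-as-a-judge, and human expert evaluation) provide a full performance comparison.
Result: Fine-tuned Llama 3.1 70B performs best on multiple-choice evaluations and automatic text metrics; human experts favor Claude 3.7 and RAG for complex queries. Fine-tuned models significantly outperform their base counterparts across nearly all evaluation modes.
Insight: (1) Domain-adapted LLMs are competitive in both privacy and performance; (2) the choice of evaluation method strongly affects model rankings; (3) combining fine-tuning and RAG is a viable approach for locally deployed clinical solutions.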
Abstract: Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
[17] KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
Donghyeon Ko,Yeguk Jin,Kyubyung Chae,Byungwook Lee,Chansong Jo,Sookyo In,Jaehong Lee,Taesup Kim,Donghyun Kwak
Main category: cs.CL
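TL;DR: The paper presents KoSimpleQA, a benchmark focused on Korean cultural knowledge for evaluating LLM factuality. It contains 1,000 short fact-seeking questions; the strongest model is correct only 33.7% of the time, and rankings differ substantially from those on the English benchmark.
Details
Motivation: Most existing LLM benchmarks are English-based and do not assess Korean cultural knowledge, so a dedicated Korean benchmark is needed to better evaluate LLM factuality and reasoning.
Contribution: Introduces the KoSimpleQA benchmark of 1,000 short Korean questions, filling a gap in Korean factuality evaluation, and uses it to reveal differences in LLM performance on Korean cultural knowledge and the effect of reasoning.
Method: 1,000 short, fact-seeking questions with unambiguous answers are constructed and used to evaluate a range of open-source LLMs, analyzing answer accuracy and the role of reasoning.
Result: The strongest model is correct only 33.7% of the time, and performance rankings on the Korean benchmark differ substantially from the English one. Reasoning helps models better elicit latent knowledge and abstain when uncertain.
Insight: Evaluating Korean cultural knowledge poses distinct challenges for LLMs; engaging reasoning can improve performance on factual tasks, and the benchmark exposes differences across languages and cultural contexts.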
Abstract: We present Korean SimpleQA (KoSimpleQA), a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates a correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
[18] Towards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning
Monorama Swain,Bubai Maji,Jagabandhu Mishra,Markus Schedl,Anders Søgaard,Jesper Rindom Jensen
Main category: cs.CL
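TL;DR: The paper addresses the unfairness of English ASR systems toward second-language speakers and proposes fairness-prompted finetuning with lightweight adapters, which substantially improves fairness.
Details
Motivation: Existing ASR systems such as Whisper and Seamless-M4T show large fluctuations in word error rate (WER) across accent groups, exposing a fairness gap. Contribution: Proposes a finetuning method that fuses traditional empirical risk minimization (ERM) with fairness-driven objectives (Spectral Decoupling, Group DRO, and IRM), substantially improving fairness.
Method: Finetunes with lightweight adapters, combining Spectral Decoupling, Group DRO, and IRM with the ERM objective.
Result: In macro-averaged WER, the approach achieves relative improvements of 58.7% and 58.5% over pretrained Whisper and Seamless-M4T, and of 9.7% and 7.8% over standard ERM finetuning of those models.
Insight: Adding fairness-driven objectives can substantially improve ASR fairness across accent groups without sacrificing overall recognition accuracy.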
Abstract: In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves a relative improvement of 58.7% and 58.5% over the large pretrained Whisper and Seamless-M4T, and of 9.7% and 7.8% over the same models finetuned with standard empirical risk minimization with cross-entropy loss.
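The headline numbers above are relative improvements in macro-averaged WER. As a rough illustration of how such a figure is computed, the sketch below averages per-group WERs so that every accent group counts equally, then reports the fractional reduction against a baseline. The group names and WER values are invented for illustration and are not taken from the paper.

```python
def macro_wer(per_group_wer):
    """Unweighted mean of per-group WERs: every accent group counts equally."""
    return sum(per_group_wer.values()) / len(per_group_wer)


def relative_improvement(baseline, improved):
    """Fractional WER reduction relative to the baseline (lower WER is better)."""
    return (baseline - improved) / baseline


# Hypothetical per-accent WERs (illustrative values only).
baseline_wer = {"accent_a": 0.40, "accent_b": 0.20, "accent_c": 0.30}
finetuned_wer = {"accent_a": 0.15, "accent_b": 0.10, "accent_c": 0.11}

base_macro = macro_wer(baseline_wer)     # roughly 0.30
tuned_macro = macro_wer(finetuned_wer)   # roughly 0.12
gain = relative_improvement(base_macro, tuned_macro)  # roughly 0.60
```

Because the mean is unweighted, a group with few test utterances influences the score as much as a large one, which is why the macro average is a natural choice for a fairness-oriented comparison.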
[19] MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
ChangSu Choi,Hoyun Song,Dongyeon Kim,WooHyeon Jung,Minkyung Cho,Sunjin Park,NohHyeob Bae,Seona Yu,KyungTae Lim
Main category: cs.CL
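TL;DR: MENTOR is a framework that combines reinforcement learning (RL) with teacher-guided distillation. By improving the reward design, it aims to boost the generalization and strategic competence of small language models, addressing the shortcomings of supervised fine-tuning (SFT) and sparse-reward RL.
Details
Motivation: The tool-use capabilities of large language models (LLMs) are hard to transfer efficiently to small language models (SLMs). SFT generalizes poorly, and standard RL explores inefficiently because its rewards are sparse. MENTOR targets both problems. Contribution: Proposes the MENTOR framework, which combines RL with teacher-guided distillation; by learning a more generalizable policy and using a dense teacher-guided reward, it substantially improves the cross-domain generalization and strategic competence of SLMs.
Method: MENTOR learns a generalizable policy through RL exploration and uses the teacher's reference trajectory to build a dense, composite reward signal that provides fine-grained guidance.
Result: Experiments show that MENTOR clearly outperforms SFT and standard sparse-reward RL baselines in cross-domain generalization and strategic competence.
Insight: Pairing a dense teacher-guided reward with RL can relieve the bottlenecks small models face in both imitation learning and exploration.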
Abstract: Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher’s reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
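The abstract does not give the reward formula, so the following is only a sketch of what a dense, composite teacher-guided reward could look like. It assumes (hypothetically) that the reward is a weighted sum of a sparse outcome term and an alignment term measuring how closely the student's tool-call sequence follows the teacher's reference trajectory. The function name, the weights, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative choices, not MENTOR's actual design.

```python
from difflib import SequenceMatcher


def composite_reward(student_calls, teacher_calls, answer_correct,
                     w_outcome=0.5, w_align=0.5):
    """Hypothetical dense reward: outcome term plus trajectory-alignment term.

    student_calls / teacher_calls: lists of tool names in call order.
    The alignment term is nonzero even when the final answer is wrong,
    which is what makes the signal dense rather than sparse.
    """
    align = SequenceMatcher(None, student_calls, teacher_calls).ratio()
    outcome = 1.0 if answer_correct else 0.0
    return w_outcome * outcome + w_align * align


# Illustrative trajectories (tool names are made up).
teacher = ["search", "lookup", "calc"]
student = ["search", "calc"]

partial = composite_reward(student, teacher, answer_correct=False)
perfect = composite_reward(teacher, teacher, answer_correct=True)
```

Under a purely sparse reward the first rollout would score zero; here it still earns credit for the two tool calls it shares with the teacher, giving the policy a gradient to follow before it ever produces a correct answer.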
[20] Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response
Qingqing Gu,Dan Wang,Yue Zhao,Xiaoyu Wang,Zhonglin Jiang,Yong Chen,Hongyan Li,Luo Ji
Main category: cs.CL
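TL;DR: The paper proposes a new prompting paradigm called Chain of Conceptual Thought (CoCT), in which the LLM tags a concept within its response and then generates the detailed content, strengthening deep and strategic thinking on open-domain tasks. Experiments show that CoCT outperforms several baselines on dialogue and emotional-support tasks.
Details
Motivation: Conventional Chain-of-Thought (CoT) works well for math, coding, and reasoning tasks but is of limited use on open-domain tasks, which lack clearly defined reasoning steps. The paper aims to provide a prompting paradigm better suited to such tasks. Contribution: Proposes CoCT, a concept-based prompting paradigm that encourages deeper thinking by tagging concepts and chaining them dynamically; the method performs well on open-domain tasks.
Method: CoCT has two steps: (1) the LLM tags a concept (such as an emotion, strategy, or topic) in the dialogue or response; (2) it generates detailed content based on that concept. Concepts can be chained dynamically within a single response.
Result: CoCT outperforms baselines including Self-Refine, ECoT, ToT, SoT, and RAG in automatic, human, and model-based evaluations, indicating its potential for open-domain tasks.
Insight: Concept-driven thinking gives LLMs a more flexible reasoning framework for open-domain tasks and may extend to a wider range of applications.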
Abstract: Chain-of-Thought (CoT) is widely applied to improve the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks since there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose another prompt-based paradigm called Chain of Conceptual Thought (CoCT), where the LLM first tags a concept, then generates the detailed content. The chain of concepts is allowed within the utterance, encouraging the LLM’s deep and strategic thinking. We experiment with this paradigm in daily and emotional support conversations where the concept comprises emotions, strategies and topics. Automatic, human and model evaluations suggest that CoCT surpasses baselines such as Self-Refine, ECoT, ToT, SoT and RAG, suggesting a potentially effective prompt-based paradigm of LLM for a wider scope of tasks.
[21] Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Yasser Hamidullah,Koel Dutta Chowdury,Yusser Al-Ghussin,Shakib Yazdani,Cennet Oguz,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
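TL;DR: The paper proposes a token-level reliability measure based on visual information for detecting hallucinations in sign language translation (fluent text with no visual support) and validates it experimentally.
Details
Motivation: Hallucination, where the model generates text from language priors rather than visual input, is especially serious in sign language translation (SLT). Gloss-free models are the most exposed because they map continuous signing directly to natural language without alignment supervision. Contribution: (1) Proposes a token-level reliability measure that quantifies how much the decoder relies on visual information; (2) combines feature sensitivity with counterfactual signals to produce a sentence-level reliability score; (3) shows that the measure predicts hallucination rates and generalizes across datasets and architectures.
Method: (1) Compute feature sensitivity (the change in internal features when the video is masked); (2) add counterfactual signals (the probability difference between clean and altered video inputs); (3) aggregate the signals into a reliability score.
Result: The reliability score predicts hallucination rates, separates grounded tokens from guessed ones, and drops under visual degradation. Combining it with text-based signals (confidence, perplexity, or entropy) further improves hallucination risk estimation.
Insight: Gloss-free models hallucinate more readily because they lack alignment supervision. The proposed reliability measure can serve as a diagnostic for hallucination in multimodal generation and lays groundwork for more robust detection.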
Abstract: Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
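To make the two signals described above concrete, here is a minimal sketch of how a per-token score might be formed and aggregated. The abstract does not specify the combination rule, so the convex mix with weight `alpha`, the clamping of the counterfactual term at zero, and the mean aggregation are all assumptions made for illustration; the input numbers are invented.

```python
def token_reliability(sensitivity, p_clean, p_altered, alpha=0.5):
    """Hypothetical per-token reliability.

    sensitivity: how much internal features change when the video is masked
                 (assumed here to be pre-normalized to [0, 1]).
    p_clean / p_altered: token probability under clean vs. altered video.
    A token whose probability barely moves when the video changes is
    likely driven by language priors rather than visual evidence.
    """
    counterfactual = max(0.0, p_clean - p_altered)
    return alpha * sensitivity + (1 - alpha) * counterfactual


def sentence_reliability(tokens, alpha=0.5):
    """Mean of per-token scores; tokens is a list of
    (sensitivity, p_clean, p_altered) triples."""
    scores = [token_reliability(s, pc, pa, alpha) for s, pc, pa in tokens]
    return sum(scores) / len(scores)


# Illustrative tokens: one visually grounded, one guessed from priors.
grounded = (0.8, 0.9, 0.1)
guessed = (0.1, 0.7, 0.7)

grounded_score = token_reliability(*grounded)
guessed_score = token_reliability(*guessed)
sentence_score = sentence_reliability([grounded, guessed])
```

The guessed token keeps a high probability regardless of the video, so its counterfactual term is zero and its score stays low even though the model is confident in it; this is the kind of case that text-only confidence would miss.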
[22] Identity-Aware Large Language Models require Cultural Reasoning
Alistair Plum,Anne-Marie Lutgen,Christoph Purschke,Achim Rettinger
Main category: cs.CL
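TL;DR: The paper points out the shortcomings of large language models in handling cultural diversity, introduces the concept of cultural reasoning, stresses its importance for identity-aware AI, and calls for treating it as a foundational capability alongside factual accuracy and linguistic coherence.
Details
Motivation: LLM replies tend to reflect a narrow, Western cultural viewpoint and overlook the diversity of global users. This gap can sustain stereotypes, ignore minority perspectives, and erode trust. Contribution: Defines cultural reasoning, argues that it is a foundational capability for identity-aware AI, and outlines initial directions for evaluating it.
Method: Analyzes the limitations of current models, introduces the concept of cultural reasoning, and draws on empirical studies to identify the root of the problem.
Result: Current models fall short in multicultural settings, and simply broadening the datasets does not solve the problem.
Insight: Cultural reasoning should be treated as a core capability of AI systems; future work needs to address how it adapts in dynamic contexts.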
Abstract: Large language models have become the latest trend in natural language processing, heavily featuring in the digital tools we use every day. However, their replies often reflect a narrow cultural viewpoint that overlooks the diversity of global users. This missing capability could be referred to as cultural reasoning, which we define here as the capacity of a model to recognise culture-specific knowledge, values, and social norms, and to adjust its output so that it aligns with the expectations of individual users. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. When this capacity is limited or absent, models can sustain stereotypes, ignore minority perspectives, erode trust, and perpetuate hate. Recent empirical studies strongly suggest that current models default to Western norms when judging moral dilemmas, interpreting idioms, or offering advice, and that fine-tuning on survey data only partly reduces this tendency. The present evaluation methods mainly report static accuracy scores and thus fail to capture adaptive reasoning in context. Although broader datasets can help, they cannot alone ensure genuine cultural competence. Therefore, we argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. By clarifying the concept and outlining initial directions for its assessment, a foundation is laid for future systems to be able to respond with greater sensitivity to the complex fabric of human culture.
[23] Beyond the Explicit: A Bilingual Dataset for Dehumanization Detection in Social Media
Dennis Assenmacher,Paloma Piot,Katarina Laken,David Jurgens,Claudia Wagner
Main category: cs.CL
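TL;DR: The paper presents a bilingual dataset for detecting dehumanizing content on social media, filling the gap left by prior work that overlooks implicit forms of dehumanization.
Details
Motivation: Dehumanization research in computational linguistics and NLP has focused mainly on overtly negative statements and neglected the harm done by implicit forms. The paper aims to provide a more comprehensive dataset to address this. Contribution: (1) Builds a theory-informed bilingual dataset covering multiple dimensions of dehumanization; (2) annotates 16,000 instances using crowdworkers and experts; (3) demonstrates the dataset's usefulness against zero-shot and few-shot baselines.
Method: Collects data from Twitter and Reddit using different sampling methods, annotates it at the document and span level with crowdworkers and experts, and then fine-tunes ML models on the dataset.
Result: The fine-tuned ML models outperform state-of-the-art models used in zero-shot and few-shot in-context settings.
Insight: Implicit dehumanization is not overtly offensive, yet its harm to marginalized groups cannot be ignored; detecting it requires more comprehensive datasets and methods.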
Abstract: Digital dehumanization, although a critical issue, remains largely overlooked within the field of computational linguistics and Natural Language Processing. The prevailing approach in current research concentrates primarily on a single aspect of dehumanization that identifies overtly negative statements as its core marker. This focus, while crucial for understanding harmful online communications, inadequately addresses the broader spectrum of dehumanization. Specifically, it overlooks the subtler forms of dehumanization that, despite not being overtly offensive, still perpetuate harmful biases against marginalized groups in online interactions. These subtler forms can insidiously reinforce negative stereotypes and biases without explicit offensiveness, making them harder to detect yet equally damaging. Recognizing this gap, we use different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit. Using crowdworkers and experts to annotate 16,000 instances on a document- and span-level, we show that our dataset covers the different dimensions of dehumanization. This dataset serves as both a training resource for machine learning models and a benchmark for evaluating future dehumanization detection techniques. To demonstrate its effectiveness, we fine-tune ML models on this dataset, achieving performance that surpasses state-of-the-art models in zero- and few-shot in-context settings.
[24] Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering
Feras AlMannaa,Talia Tseriotou,Jenny Chim,Maria Liakata
Main category: cs.CL
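TL;DR: The paper is the first to study how well large language models (LLMs) comprehend long-context (LC) medical question answering. Its multi-faceted evaluation reveals the effect of model size, limitations, memorization issues, and the benefits of reasoning models.
Details
Motivation: How well LLMs handle long-context medical QA remains unclear, particularly their effectiveness and limitations on complex clinical content. The paper aims to fill this gap. Contribution: Provides the first systematic evaluation of LLMs on medical LC tasks; shows how model size and RAG strategy affect performance; and, through multi-angle analysis, proposes ways to improve RAG.
Method: Uses a multi-faceted evaluation across different LLMs, task formulations, and datasets, with particular attention to single-document versus multi-document reasoning, and explores the best RAG settings.
Result: Model size has a marked effect on LC comprehension, and RAG can substantially improve performance in some settings; the study also surfaces common failure cases and limitations caused by memorization.
Insight: The paper offers a practical picture of LLM performance on medical LC tasks, highlights where RAG helps, and points to directions for improvement.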
Abstract: This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLMs of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light on some of the evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.
[25] Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
Ming Li
Main category: cs.CL
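TL;DR: The paper proposes RLAAR, a curriculum reinforcement learning method that combines verifiable accuracy and abstention rewards to alleviate the Lost-in-Conversation (LiC) problem of large language models in multi-turn dialogue, substantially improving performance and reliability.
Details
Motivation: LLMs perform well at single-turn instruction following, but their performance degrades in multi-turn conversations as information is revealed progressively (LiC). Contribution: (1) Proposes the RLAAR framework, which combines verifiable accuracy and abstention rewards; (2) uses a competence-gated curriculum that gradually increases dialogue difficulty; (3) substantially mitigates LiC performance decay and improves the calibrated abstention rate.
Method: (1) Designs a mixed-reward system that balances problem solving with informed abstention; (2) trains with multi-turn, on-policy rollouts; (3) uses curriculum learning to gradually increase difficulty, measured in instruction shards.
Result: RLAAR mitigates LiC performance decay (62.6% to 75.1%) and raises the calibrated abstention rate from 33.5% to 73.4%.
Insight: (1) Reliability and trustworthiness in multi-turn dialogue can be achieved with mixed rewards and curriculum learning; (2) informed abstention is an effective way to reduce LiC.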
Abstract: Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
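The two mechanisms named above, a mixed accuracy/abstention reward and a competence-gated curriculum, can be sketched in a few lines. The abstract gives no reward values or gating rule, so everything concrete below is a labeled assumption: the reward magnitudes, the penalty for answering before enough shards have been revealed, the 0.7 success-rate gate, and the function names.

```python
def mixed_reward(action, solvable, correct=False):
    """Hypothetical mixed reward for one turn.

    action: "answer" or "abstain".
    solvable: whether enough instruction shards have been revealed
              for the question to be answerable at this turn.
    """
    if action == "abstain":
        # Abstaining is rewarded only while the task is still underspecified.
        return 1.0 if not solvable else 0.0
    if not solvable:
        # Premature answering is the behavior blamed for LiC; penalize it.
        return -1.0
    return 1.0 if correct else 0.0


def next_num_shards(current, recent_success_rate, gate=0.7, max_shards=8):
    """Hypothetical competence gate: split instructions into one more shard
    only once the model is reliable at the current difficulty."""
    if recent_success_rate >= gate and current < max_shards:
        return current + 1
    return current


# Illustrative turns in a sharded conversation.
early_guess = mixed_reward("answer", solvable=False)             # premature
early_wait = mixed_reward("abstain", solvable=False)             # informed abstention
final_answer = mixed_reward("answer", solvable=True, correct=True)
```

Gating on a recent success rate keeps the model at a difficulty level until it is competent there, which is one plausible reading of how the curriculum stabilizes training.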
[26] Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting
Taha Binhuraib,Greta Tuckute,Nicholas Blauch
Main category: cs.CL
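TL;DR: The paper proposes Topoformer, a Transformer variant that introduces spatial querying and spatial reweighting into self-attention, yielding brain-like topographic organization and improved interpretability.
Details
Motivation: Biological brains are spatially organized: neurons are arranged topographically by response properties at multiple scales. Representations in most machine learning models lack such spatial bias and appear as disorganized vector spaces that are hard to visualize and interpret. The paper modifies Transformer self-attention to give it brain-like spatial organization. Contribution: Proposes Topoformer, a new form of self-attention that gives Transformers topographic organization through spatial querying (queries and keys arranged on 2D grids) and spatial reweighting (the fully connected self-attention layer replaced by a locally connected one).
Method: (1) Spatial querying: keys and queries are arranged on 2D grids, and each key is associated with a local pool of queries; (2) spatial reweighting: the standard fully connected self-attention layer becomes a locally connected layer. Experiments train a 1-layer Topoformer on sentiment classification and apply the Topoformer motifs to a BERT architecture trained with masked language modeling.
Result: Topoformer performs on par with a non-topographic control model on NLP benchmarks while showing interpretable topographic organization across eight linguistic test suites. Analysis of an fMRI dataset shows alignment between low-dimensional topographic variability in Topoformer and responses in the human brain's language network.
Insight: Topoformer offers greater interpretability for NLP research and a more accurate model of how linguistic information is organized in the human brain. Scaling it up further holds promise on both fronts.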
Abstract: Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into “Topoformers” with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites. Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and human brain language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
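One way to picture the locally connected layer described above is as a dense weight matrix multiplied elementwise by a binary mask that keeps only connections between nearby units on a 2D grid. The sketch below builds such a mask in plain Python. The circular neighborhood, the absence of wrap-around at the grid edges, and the names `local_mask` and `apply_mask` are assumptions for illustration; the paper's actual receptive-field shape is not stated in the abstract.

```python
def local_mask(side, radius):
    """Binary (side*side) x (side*side) connectivity mask.

    Units are laid out row-major on a side x side grid; entry [i][j] is 1
    when units i and j lie within Euclidean distance `radius` of each other.
    """
    coords = [(r, c) for r in range(side) for c in range(side)]
    return [
        [1 if (r1 - r2) ** 2 + (c1 - c2) ** 2 <= radius ** 2 else 0
         for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def apply_mask(weights, mask):
    """Zero out every weight whose connection the mask does not allow."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


# A 3x3 grid (9 units) with radius 1: each unit connects to itself
# and to its horizontal and vertical neighbors.
mask = local_mask(side=3, radius=1)
dense = [[1.0] * 9 for _ in range(9)]
local = apply_mask(dense, mask)
```

With this mask the center unit keeps 5 of its 9 connections and a corner unit keeps 3, so neighboring units share most of their inputs, which is the pressure that encourages smooth, map-like organization.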
[27] AI use in American newspapers is widespread, uneven, and rarely disclosed
Jenna Russell,Marzena Karpinska,Destiny Akinode,Katherine Thai,Bradley Emi,Max Spero,Mohit Iyyer
Main category: cs.CL
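TL;DR: A large-scale audit finds that about 9% of articles in online editions of American newspapers in 2025 are partially or fully AI-generated. Use is uneven, concentrated in smaller local outlets, specific topics (such as weather and technology), and certain ownership groups. AI content is also more common in opinion pieces, yet it is rarely disclosed.
Details
Motivation: As AI rapidly transforms journalism, understanding how it is actually used in published news articles has become urgent for transparency and public trust. Contribution: Provides the first systematic audit of the prevalence, distribution, and disclosure of AI-generated content in American newspapers, revealing both how far AI has penetrated journalism and how little of it is disclosed.
Method: Uses Pangram, a state-of-the-art AI detector, to analyze 186K articles from 1.5K American newspapers, and manually audits 100 AI-flagged articles to check for disclosure.
Result: About 9% of articles are partially or fully AI-generated. Opinion pieces are 6.4 times more likely than news articles from the same publications to contain AI-generated content. Disclosure is rare: only 5 of the 100 manually audited AI-flagged articles disclosed AI use.
Insight: The study calls on journalism to increase transparency about AI use and update editorial standards in order to maintain public trust.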
Abstract: AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
[28] KAT-Coder Technical Report
Zizheng Zhan,Ken Deng,Xiaojiang Zhang,Jinghui Wang,Huaixi Tang,Zhiyi Lai,Haoyang Huang,Wen Xiang,Kun Wu,Wenhao Zhuang,Minglei Zhang,Shaojie Wang,Shangpeng Yan,Kepeng Lei,Zongxian Feng,Huiming Wang,Zheng Lin,Mengtong Li,Mengfei Xie,Yinghan Cui,Xuxing Chen,Chao Wang,Weihao Li,Wenqiang Zhu,Jiarong Zhang,Jingxuan Xu,Songwei Yu,Yifan Yao,Xinping Lei,Han Li,Junqi Xiong,Zuchen Gao,Dailin Li,Haimo Li,Jiaheng Liu,Yuqun Zhang,Junyi Peng,Haotian Zhang,Bin Chen
Main category: cs.CL
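TL;DR: KAT-Coder is a large-scale agentic code model trained in multiple stages: Mid-Term Training, SFT, RFT, and Reinforcement-to-Deployment Adaptation. The stages strengthen its agentic coding ability, and the open-sourced 32B model KAT-Dev delivers reliable tool use and long-context reasoning.
Details
Motivation: There is a gap between training LLMs on static text and executing dynamically as an agent; KAT-Coder addresses it with a multi-stage curriculum. Contribution: Proposes KAT-Coder's four-stage training recipe, which combines real data with synthetic interactions to strengthen reasoning and planning, and achieves stable optimization through a multi-ground-truth reward and deployment adaptation.
Method: The multi-stage training consists of: (1) Mid-Term Training to enhance reasoning; (2) SFT balanced across programming languages and tasks; (3) RFT with a multi-ground-truth reward; (4) Reinforcement-to-Deployment Adaptation for IDE environments.
Result: The KAT-Dev model performs strongly in tool-use reliability, instruction alignment, and long-context reasoning, and has been open-sourced.
Insight: A multi-stage curriculum and careful reward design are key to improving agentic coding ability, and deployment adaptation further ensures practical viability.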
Abstract: Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
[29] WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
Guanzhong He,Zhen Yang,Jinxin Liu,Bin Xu,Lei Hou,Juanzi Li
Main category: cs.CL
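TL;DR: WebSeer is a search agent trained with reinforcement learning and a self-reflection mechanism that deepens tool use, addressing the short tool-use chains and error accumulation of existing methods.
Details
Motivation: Existing RL-trained search agents use tools only shallowly and accumulate errors over multiple rounds of interaction. Contribution: (1) Designs an RL framework enhanced with a self-reflection mechanism; (2) constructs a large dataset annotated with reflection patterns; (3) proposes a two-stage training method that combines cold start and reinforcement learning.
Method: Uses a two-stage training framework: a cold-start stage that trains the model on the annotated data, followed by an RL stage that optimizes interaction trajectories with the self-reflection mechanism.
Result: Reaches accuracies of 72.3% on HotpotQA and 90.0% on SimpleQA and shows strong generalization.
Insight: Self-reflection can substantially lengthen tool-use chains and improve answer accuracy, and it suits real-world web environments.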
Abstract: Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer
[30] Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring
Shuxin Lin,Dhaval Patel,Christodoulos Constantinides
Main category: cs.CL
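TL;DR: The paper proposes a knowledge distillation framework that uses Chain-of-Thought (CoT) distillation to transfer the reasoning ability of large language models (LLMs) to small language models (SLMs) for industrial asset health monitoring.
Details
Motivation: In specialized fields such as Industry 4.0, SLMs are attractive for their efficiency and low cost, but complex reasoning remains a challenge. The question is how to improve SLM reasoning through knowledge distillation. Contribution: Proposes a CoT-based knowledge distillation framework that improves the reasoning and decision-making of SLMs for industrial asset health monitoring.
Method: Distills knowledge using multi-choice question answering (MCQA) prompts and verifies the quality of the generated knowledge through in-context learning.
Result: SLMs fine-tuned with CoT distillation improve markedly over their base models and narrow the gap to LLMs.
Insight: CoT distillation can effectively transfer LLM reasoning ability to SLMs, offering a workable path for optimizing models in specialized domains.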
Abstract: Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
[31] MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
Wenxuan Li,Chengruidong Zhang,Huiqiang Jiang,Yucheng Li,Yuqing Yang,Lili Qiu
Main category: cs.CL
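TL;DR: MTraining is a distributed dynamic sparse attention method for efficiently training large language models (LLMs) on ultra-long contexts; it resolves the computation and communication imbalance that dynamic sparse attention introduces.
Details
Motivation: Long context windows are an important LLM feature, but efficient distributed training with dynamic sparse attention is still hampered by computation and communication imbalance. Contribution: MTraining substantially improves training efficiency while preserving model quality through three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention.
Method: (1) Dynamic sparse training pattern; (2) balanced sparse ring attention; (3) hierarchical sparse ring attention.
Result: MTraining extends the context window of Qwen2.5-3B from 32K to 512K tokens and achieves up to 6x higher training throughput while preserving model accuracy on a suite of downstream tasks.
Insight: Distributed dynamic sparse attention is an effective approach to ultra-long-context training; balancing the computation and communication load yields large efficiency gains.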
Abstract: The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
[32] Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Chenghao Zhu,Meiling Tao,Tiannan Wang,Dongyi Ding,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
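TL;DR: The paper proposes Critique-Post-Edit, a reinforcement learning framework that combines a Personalized Generative Reward Model (GRM) with a critique-post-edit mechanism to counter reward hacking in LLM personalization, substantially improving personalization quality.
Details
Motivation: Personalizing large language models is important but difficult. Supervised fine-tuning (SFT) and RLHF with scalar rewards struggle to capture the nuances of user preferences and tend to produce verbose or superficially personalized responses. Contribution: (1) Proposes Critique-Post-Edit, a robust reinforcement learning framework; (2) designs a Personalized Generative Reward Model (GRM) with multi-dimensional scoring; (3) introduces a critique-post-edit mechanism to refine model outputs.
Method: Combines the GRM with the Critique-Post-Edit mechanism: the GRM provides multi-dimensional scores and textual critiques, and the policy model revises its own outputs based on those critiques for more efficient learning. The method is validated under length-controlled evaluation.
Result: Personalized Qwen2.5-7B gains 11% in average win rate, and personalized Qwen2.5-14B surpasses GPT-4.1.
Insight: Multi-dimensional feedback and iterative editing can substantially improve the faithfulness and controllability of personalization, offering a new direction for LLM personalization.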
Abstract: Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
[33] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Ling Team,Anqi Shen,Baihui Li,Bin Hu,Bin Jing,Cai Chen,Chao Huang,Chao Zhang,Chaokun Yang,Cheng Lin,Chengyao Wen,Congqi Li,Deng Zhao,Dingbo Yuan,Donghai You,Fagui Mao,Fanzhuang Meng,Feng Xu,Guojie Li,Guowei Wang,Hao Dai,Haonan Zheng,Hong Liu,Jia Guo,Jiaming Liu,Jian Liu,Jianhao Fu,Jiannan Shi,Jianwen Wang,Jianxin Lai,Jin Yang,Jun Mei,Jun Zhou,Junbo Zhao,Junping Zhao,Kuan Xu,Le Su,Lei Chen,Li Tang,Liang Jiang,Liangcheng Fu,Lianhao Xu,Linfeng Shi,Lisha Liao,Longfei Zheng,Meng Li,Mingchun Chen,Qi Zuo,Qiang Cheng,Qianggang Cao,Qitao Shi,Quanrui Guo,Senlin Zhu,Shaofei Wang,Shaomian Zheng,Shuaicheng Li,Shuwei Gu,Siba Chen,Tao Wu,Tao Zhang,Tianyu Zhang,Tianyu Zhou,Tiwei Bie,Tongkai Yang,Wang Hong,Wang Ren,Weihua Chen,Wenbo Yu,Wengang Zheng,Xiangchun Wang,Xiaodong Yan,Xiaopei Wan,Xin Zhao,Xinyu Kong,Xinyu Tang,Xudong Han,Xudong Wang,Xuemin Yang,Xueyu Hu,Yalin Zhang,Yan Sun,Yicheng Shan,Yilong Wang,Yingying Xu,Yongkang Liu,Yongzhen Guo,Yuanyuan Wang,Yuchen Yan,Yuefan Wang,Yuhong Guo,Zehuan Li,Zhankai Xu,Zhe Li,Zhenduo Zhang,Zhengke Gui,Zhenxuan Pan,Zhenyu Huang,Zhenzhong Lan,Zhiqiang Ding,Zhiqiang Zhang,Zhixun Li,Zhizhen Liu,Zihao Wang,Zujie Wen
Main category: cs.CL
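TL;DR: The paper presents Ring-1T, the first open-source thinking model at the trillion-parameter scale, and introduces three techniques that address the challenges of training it, with breakthrough results.
Details
Motivation: Training trillion-parameter models raises unprecedented challenges, including train-inference misalignment, inefficient processing of long rollouts, and bottlenecks in the RL system. Contribution: Presents the Ring-1T model and its three core techniques, IcePop, C3PO++, and ASystem, which substantially improve the stability and efficiency of RL training.
Method: (1) IcePop stabilizes training through token-level discrepancy masking and clipping; (2) C3PO++ dynamically partitions long rollouts to improve efficiency; (3) ASystem removes system-level bottlenecks.
Result: Ring-1T performs strongly across benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and a silver medal-level result on IMO-2025.
Insight: An open-source trillion-parameter model gives the research community access to frontier reasoning capabilities and marks a milestone in democratizing large-scale reasoning intelligence.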
Abstract: We present Ring-1T, the first open-source, state-of-the-art thinking model with a trillion-scale parameter. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
cs.CV [Back]
[34] MAT-Agent: Adaptive Multi-Agent Training Optimization
Jusheng Zhang,Kaitong Cai,Yijia Fan,Ningyuan Liu,Keze Wang
Main category: cs.CV
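TL;DR: MAT-Agent is a multi-agent framework that adapts training on the fly by dynamically tuning data augmentation, optimizers, learning rates, and loss functions, improving the performance and stability of multi-label image classification.
Details
Motivation: Conventional multi-label image classification methods rely on static configurations that cope poorly with dynamic, complex visual-semantic settings, so an adaptive training strategy is needed. Contribution: MAT-Agent optimizes training dynamically through multi-agent collaboration and non-stationary multi-armed bandit algorithms that balance exploration and exploitation, guided by a composite reward combining accuracy, rare-class performance, and training stability.
Method: A multi-agent framework tunes training parameters dynamically, combined with dual-rate exponential moving average smoothing and mixed-precision training for robustness and efficiency.
Result: MAT-Agent reaches an mAP of 97.4 on Pascal VOC (vs. 96.2 for PAT-T), 92.8 on COCO (vs. 92.0 for HSQ-CvN), and 60.9 on VG-256.
Insight: MAT-Agent shows the potential of adaptive training for complex visual tasks and points toward further work on dynamic optimization in deep learning.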
Abstract: Multi-label image classification demands adaptive training strategies to navigate complex, evolving visual-semantic landscapes, yet conventional methods rely on static configurations that falter in dynamic settings. We propose MAT-Agent, a novel multi-agent framework that reimagines training as a collaborative, real-time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, MAT-Agent leverages non-stationary multi-armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare-class performance, and training stability. Enhanced with dual-rate exponential moving average smoothing and mixed-precision training, it ensures robustness and efficiency. Extensive experiments across Pascal VOC, COCO, and VG-256 demonstrate MAT-Agent’s superiority: it achieves an mAP of 97.4 (vs. 96.2 for PAT-T), OF1 of 92.3, and CF1 of 91.4 on Pascal VOC; an mAP of 92.8 (vs. 92.0 for HSQ-CvN), OF1 of 88.2, and CF1 of 87.1 on COCO; and an mAP of 60.9, OF1 of 70.8, and CF1 of 61.1 on VG-256. With accelerated convergence and robust cross-domain generalization, MAT-Agent offers a scalable, intelligent solution for optimizing complex visual models, paving the way for adaptive deep learning advancements.
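A non-stationary bandit of the kind the abstract mentions can be sketched in a few lines. The abstract does not say which bandit algorithm MAT-Agent uses, so this is a generic stand-in: an epsilon-greedy agent with a constant step size, which weights recent rewards more heavily. The class name `RecencyWeightedBandit`, the candidate learning rates, and all hyperparameters are hypothetical.

```python
import random

class RecencyWeightedBandit:
    """Epsilon-greedy bandit with a constant step size, so old rewards decay
    exponentially and the agent can track a drifting reward signal."""

    def __init__(self, arms, step_size=0.3, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.step_size = step_size
        self.epsilon = epsilon
        self.values = [0.0] * len(self.arms)
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.values[i])

    def update(self, index, reward):
        # Exponential recency-weighted average of observed rewards.
        self.values[index] += self.step_size * (reward - self.values[index])

# Hypothetical use: one agent choosing among candidate learning rates.
agent = RecencyWeightedBandit(arms=[1e-2, 1e-3, 1e-4], epsilon=0.0)
agent.update(1, 1.0)      # arm 1 looks good early in training
choice = agent.select()   # index of the currently preferred arm
```

A constant step size is one simple way to handle non-stationarity: if an arm that was good early stops paying off, its estimate decays and another arm takes over.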
[35] CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization
Yichen Yan,Ming Zhong,Qi Zhu,Xiaoling Gu,Jinpeng Chen,Huan Li
Main category: cs.CV
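TL;DR: The paper proposes CoIDO, a framework that jointly optimizes data importance and diversity to select subsets for visual instruction tuning efficiently, cutting computational cost while retaining high performance.
Details
Motivation: Visual instruction tuning on large datasets is computationally expensive, and existing data selection methods usually treat importance and diversity separately, which leads to high overhead and suboptimal selection. Contribution: Introduces CoIDO, a dual-objective framework that jointly optimizes data importance and diversity, and a lightweight scorer that needs only a small random sample for training before it selects subsets efficiently.
Method: CoIDO balances importance and diversity with a homoscedastic uncertainty-based formulation, trains a lightweight scorer on a small sample to learn the distribution of the candidate set, and then applies it to the entire dataset.
Result: With the scorer trained on a 20% random sample and a 20% subset selected for instruction tuning, LLaVA-1.5-7B reaches on average 98.2% of full-data fine-tuning performance across ten downstream tasks.
Insight: Jointly optimizing importance and diversity is key to efficient data selection, and a lightweight scorer substantially reduces computational overhead.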
Abstract: Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling efficient and scalable data selection. In our experiments, we trained the CoIDO scorer using only 20 percent of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20 percent subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2 percent of the performance of full-data fine-tuning, on average.
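The homoscedastic uncertainty idea can be illustrated with the standard multi-task weighting from the uncertainty-weighting literature, where each objective gets a learned log-variance. The abstract does not spell out CoIDO's exact loss, so the form below is an assumption, and the function and parameter names are hypothetical.

```python
import math

def uncertainty_weighted_loss(losses, log_variances):
    """Combine objectives as sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i,
    where s_i = log(sigma_i^2) is a learnable log-variance per objective.
    A larger s_i down-weights objective i but pays a regularization cost."""
    total = 0.0
    for loss, s in zip(losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

# Hypothetical importance and diversity losses with equal initial weighting.
combined = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

In training, the log-variances would be optimized together with the scorer, so the balance between importance and diversity is learned rather than hand-tuned.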
[36] Pre to Post-Treatment Glioblastoma MRI Prediction using a Latent Diffusion Model
Alexandre G. Leclercq,Sébastien Bougleux,Noémie N. Moreau,Alexis Desmonts,Romain Hérault,Aurélien Corroyer-Dulmont
Main category: cs.CV
TL;DR: The paper proposes a Latent Diffusion Model that predicts post-treatment MRI from pre-treatment MRI, aiming at early assessment of treatment response in glioblastoma.
Details
Motivation: Treatment response in glioblastoma (GBM) is highly heterogeneous, and at least two months are needed before the effect of treatment becomes visible. Early prediction of treatment response is crucial for personalized medicine. Contribution: 1) A slice-to-slice translation model that uses a Latent Diffusion Model to generate post-treatment MRI; 2) conditioning on the pre-treatment MRI and the tumor localization; 3) classifier-free guidance that uses survival information to enhance generation quality.
Method: A Latent Diffusion Model takes the pre-treatment MRI and tumor localization as concatenation-based conditioning inputs and applies classifier-free guidance to generate the post-treatment MRI.
Result: Trained and tested on a local dataset of 140 GBM patients, the method is reported to predict post-treatment MRI effectively.
Insight: Generating post-treatment MRI offers a new route to early assessment of treatment response and can support personalized treatment decisions.
Abstract: Glioblastoma (GBM) is an aggressive primary brain tumor with a median survival of approximately 15 months. In clinical practice, the Stupp protocol serves as the standard first-line treatment. However, patients exhibit highly heterogeneous therapeutic responses which require at least two months before the first visual impact can be observed, typically with MRI. Early prediction of treatment response is crucial for advancing personalized medicine. Disease Progression Modeling (DPM) aims to capture the trajectory of disease evolution, while Treatment Response Prediction (TRP) focuses on assessing the impact of therapeutic interventions. Whereas most TRP approaches primarily rely on timeseries data, we consider the problem of early visual TRP as a slice-to-slice translation model generating post-treatment MRI from a pre-treatment MRI, thus reflecting the tumor evolution. To address this problem we propose a Latent Diffusion Model with a concatenation-based conditioning from the pre-treatment MRI and the tumor localization, and a classifier-free guidance to enhance generation quality using survival information, in particular post-treatment tumor evolution. Our model was trained and tested on a local dataset consisting of 140 GBM patients collected at Centre François Baclesse. For each patient we collected pre and post T1-Gd MRI, tumor localization manually delineated in the pre-treatment MRI by medical experts, and survival information.
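Classifier-free guidance, which the abstract names, is commonly implemented by mixing a conditional and an unconditional noise prediction at each denoising step. The sketch below shows that common form on plain lists; the function name, the guidance scale value, and the toy inputs are illustrative assumptions and do not come from the paper.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Guided noise estimate: eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 1 recovers the conditional prediction; larger values push the
    sample further toward the conditioning signal."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy two-element noise predictions, with and without the condition.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=2.0)
```

In the setting described here, the conditional branch would presumably see the survival information while the unconditional branch would not, but the abstract does not give those details.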
[37] Robotic Classification of Divers’ Swimming States using Visual Pose Keypoints as IMUs
Demetrious T. Kutzke,Ying-Kun Wu,Elizabeth Terveen,Junaed Sattar
Main category: cs.CV
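TL;DR: The paper proposes a hybrid approach for underwater settings that uses computer vision to turn 3D joint keypoints into a “pseudo-IMU” for monitoring scuba diver safety.
Details
Motivation: Traditional activity recognition, based either on direct image analysis or on wearable IMUs, can be ineffective underwater, where wireless signals from diver-worn sensors attenuate. Contribution: A vision-based “pseudo-IMU” technique that does not depend on wireless transmission from conventional IMUs and supports classification of diver states underwater.
Method: Computer vision produces high-fidelity motion data from a stream of 3D human joint keypoints, which is then used to classify anomalous diver behavior.
Result: Experiments with the classifier running onboard an Autonomous Underwater Vehicle (AUV) in simulated distress scenarios demonstrate the utility and effectiveness of the method for monitoring diver safety.
Insight: The method offers a way to monitor divers underwater without relying on wireless signals, and is particularly suited to early recognition of diver emergencies.
Abstract: Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a “pseudo-IMU” from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an Autonomous Underwater Vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest – a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.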
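One plausible way to derive IMU-like signals from keypoints is to differentiate the 3D position of a joint over time. The sketch below uses a central second difference to estimate linear acceleration. This is an assumption about how a pseudo-IMU could be built and not the authors' pipeline; the function name and the sample trajectory are hypothetical.

```python
def pseudo_acceleration(positions, dt):
    """Estimate acceleration of one joint from consecutive 3D positions using
    the central second difference (p[t+1] - 2*p[t] + p[t-1]) / dt**2."""
    accels = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        accels.append(tuple(
            (n - 2.0 * c + p) / (dt * dt)
            for p, c, n in zip(prev, cur, nxt)
        ))
    return accels

# A joint moving along x with position t**2, sampled once per time unit.
track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
accel = pseudo_acceleration(track, dt=1.0)
```

For this toy track the estimate is a constant 2 units along x, as expected for quadratic motion. Real keypoint streams are noisy, so some smoothing before differentiation would likely be needed.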
[38] InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
Jungmin Lee,Seonghyuk Hong,Juyong Lee,Jaeyoon Lee,Jongwon Choi
Main category: cs.CV
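TL;DR: InsideOut extends 3D Gaussian splatting (3DGS) by fusing RGB and X-ray data, producing a comprehensive 3D representation of both external surfaces and internal structures.
Details
Motivation: Fusing RGB and X-ray imaging is valuable in medical diagnostics, cultural heritage restoration, and manufacturing, but the two modalities have very different data representations and paired datasets are limited. InsideOut aims to address both problems. Contribution: Proposes InsideOut, which combines RGB and X-ray data through hierarchical fitting and alignment together with an X-ray reference loss to obtain a unified 3D object representation.
Method: Hierarchical fitting aligns the RGB and X-ray radiative Gaussian splats, and an X-ray reference loss enforces consistent internal structures.
Result: InsideOut significantly extends the applicability of 3DGS and enhances visualization, simulation, and non-destructive testing capabilities.
Insight: InsideOut shows the potential of multimodal data fusion in 3D modeling and provides new technical support for cross-domain applications.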
Abstract: We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and non-destructive testing capabilities across various domains.
[39] MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
Sungmin Cho,Sungbum Park,Insoo Oh
Main category: cs.CV
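TL;DR: MUSE is a training-free framework for zero-shot 2D object detection and segmentation that achieves strong performance using multi-view templates and a joint similarity metric.
Details
Motivation: Conventional methods generalize poorly in zero-shot settings; MUSE aims to improve detection and segmentation without any training. Contribution: Proposes the MUSE framework, which integrates global and local features, combines absolute and relative similarity scores, and introduces an uncertainty-aware object prior.
Method: 2D multi-view templates are rendered from 3D models of unseen objects; class embeddings are combined with patch embeddings normalized by generalized mean (GeM) pooling; a joint similarity metric is computed and then refined with the uncertainty-aware prior.
Result: MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks.
Insight: MUSE's efficiency and generalization suggest that training-free methods hold real promise for zero-shot tasks.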
Abstract: In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
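Generalized mean (GeM) pooling, which the abstract applies to patch embeddings, has a standard definition: each channel is raised to a power p, averaged over patches, and the p-th root is taken. A minimal sketch follows; the default p = 3, the clamping constant, and the toy embeddings are illustrative assumptions, and how MUSE combines the pooled vector with the class embedding is not shown.

```python
def gem_pool(patch_embeddings, p=3.0, eps=1e-6):
    """Generalized mean pooling over patches, applied per channel:
    (mean_i max(x_i, eps) ** p) ** (1 / p).
    p = 1 gives average pooling; large p approaches max pooling."""
    count = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    pooled = []
    for d in range(dim):
        mean_pow = sum(max(vec[d], eps) ** p for vec in patch_embeddings) / count
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

# Two patches with two channels each.
average_like = gem_pool([[1.0, 2.0], [3.0, 4.0]], p=1.0)
peak_like = gem_pool([[1.0, 2.0], [3.0, 4.0]], p=10.0)
```

The exponent controls how strongly the pooled vector emphasizes the most activated patches, which is one way to keep local detail in a single global descriptor.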
[40] GAN-based Content-Conditioned Generation of Handwritten Musical Symbols
Gerard Asbert,Pau Torras,Lei Kang,Alicia Fornés,Josep Lladós
Main category: cs.CV
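TL;DR: The paper explores GAN-based, content-conditioned generation of handwritten musical symbols to address the shortage of real annotated data in Optical Music Recognition (OMR), and assembles the generated symbols into full scores with the Smashcima software. The generated symbols show a high degree of visual realism.
Details
Motivation: OMR is held back by the scarcity of real annotated data, especially for handwritten historical scores. Related fields such as Handwritten Text Recognition have shown that synthetic data produced with image generation techniques can improve recognition models. Contribution: A symbol-level GAN that generates realistic handwritten musical symbols, together with a pipeline that assembles them into full scores using Smashcima.
Method: A Generative Adversarial Network (GAN) generates individual music symbols, Smashcima assembles them into complete scores, and the visual fidelity of the generated samples is evaluated systematically.
Result: The generated music symbols exhibit a high degree of realism, marking significant progress in synthetic score generation.
Insight: GANs can be applied successfully to handwritten music symbol generation, providing additional training data for OMR that may improve recognition models.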
Abstract: The field of Optical Music Recognition (OMR) is currently hindered by the scarcity of real annotated data, particularly when dealing with handwritten historical musical scores. In similar fields, such as Handwritten Text Recognition, it was proven that synthetic examples produced with image generation techniques could help to train better-performing recognition architectures. This study explores the generation of realistic, handwritten-looking scores by implementing a music symbol-level Generative Adversarial Network (GAN) and assembling its output into a full score using the Smashcima engraving software. We have systematically evaluated the visual fidelity of these generated samples, concluding that the generated symbols exhibit a high degree of realism, marking significant progress in synthetic score generation.
[41] Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach
Tadesse K Bahiru,Natnael Tilahun Sinshaw,Teshager Hailemariam Moges,Dheeraj Kumar Singh
Main category: cs.CV
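TL;DR: Through a data-centric intervention, the paper builds the BalancedFace dataset, which substantially reduces bias in gender classification algorithms.
Details
Motivation: Gender classification algorithms often inherit bias from imbalanced training data, lowering accuracy for women and minority groups, so the problem needs to be addressed at the data source. Contribution: Proposes the BalancedFace dataset, which balances the intersectional distribution of age, race, and gender and markedly improves classifier fairness while keeping accuracy high.
Method: 1. Audit five widely used datasets for representational gaps; 2. Build BalancedFace by blending FairFace and UTKFace and supplementing them with images from other collections to fill the gaps; 3. Train a standard classifier and evaluate fairness.
Result: Training on BalancedFace reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared with the next-best dataset, with minimal loss of overall accuracy.
Insight: Data-centric interventions are key to reducing bias in classification algorithms, and BalancedFace provides a public resource for fair gender classification research.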
Abstract: Gender classification systems often inherit and amplify demographic imbalances in their training data. We first audit five widely used gender classification datasets, revealing that all suffer from significant intersectional underrepresentation. To measure the downstream impact of these flaws, we train identical MobileNetV2 classifiers on the two most balanced of these datasets, UTKFace and FairFace. Our fairness evaluation shows that even these models exhibit significant bias, misclassifying female faces at a higher rate than male faces and amplifying existing racial skew. To counter these data-induced biases, we construct BalancedFace, a new public dataset created by blending images from FairFace and UTKFace, supplemented with images from other collections to fill missing demographic gaps. It is engineered to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images. When a standard classifier is trained on BalancedFace, it reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, all with a minimal loss of overall accuracy. These results underline the profound value of data-centric interventions and provide an openly available resource for fair gender classification research.
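The two fairness measures named in the abstract can be computed from predictions in a few lines. The sketch below uses the usual textbook definitions: the True Positive Rate per subgroup, and Disparate Impact as a ratio of positive-prediction rates. How the paper averages Disparate Impact across groups is not stated, so this is only an illustration, with hypothetical group names and toy labels.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that the classifier predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_on_positives) / len(preds_on_positives)

def max_tpr_gap(groups):
    """Largest difference in TPR between any two subgroups.
    groups maps a subgroup name to a (y_true, y_pred) pair."""
    rates = [true_positive_rate(t, p) for t, p in groups.values()]
    return max(rates) - min(rates)

def disparate_impact(preds_group_a, preds_group_b):
    """Ratio of positive-prediction rates; 1.0 means parity."""
    rate_a = sum(preds_group_a) / len(preds_group_a)
    rate_b = sum(preds_group_b) / len(preds_group_b)
    return rate_a / rate_b

# Toy labels and predictions for two hypothetical subgroups.
gap = max_tpr_gap({"a": ([1, 1], [1, 1]), "b": ([1, 1], [1, 0])})
impact = disparate_impact([1, 0, 0, 0], [1, 1, 0, 0])
```

In this toy case the TPR gap is 0.5 and the Disparate Impact is 0.5, so both measures flag the second subgroup as disadvantaged.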
[42] 3D Weakly Supervised Semantic Segmentation via Class-Aware and Geometry-Guided Pseudo-Label Refinement
Xiaoxu Xu,Xuexun Liu,Jinlong Li,Yitian Yuan,Qiudan Zhang,Lin Ma,Nicu Sebe,Xu Wang
Main category: cs.CV
TL;DR: The paper proposes a pseudo-label refinement method for 3D weakly supervised semantic segmentation (3D WSSS) that combines 3D geometric priors with class-aware guidance, achieving state-of-the-art performance on the ScanNet and S3DIS benchmarks.
Details
Motivation: 3D WSSS aims to achieve semantic segmentation from sparse or low-cost annotations, but existing methods produce low-quality pseudo-labels and make insufficient use of 3D geometric priors, which creates a technical bottleneck. Contribution: 1. A Class-Aware Label Refinement module that generates more balanced and accurate pseudo-labels; 2. A Geometry-Aware Label Refinement component that filters out low-confidence labels using 3D geometric constraints; 3. A Label Update strategy that uses Self-Training to expand label coverage.
Method: 1. Refine pseudo-label quality with the class-aware module; 2. Integrate 3D geometric constraints through the geometry-aware module; 3. Expand label coverage iteratively through self-training.
Result: The method achieves state-of-the-art performance on both ScanNet and S3DIS and shows strong generalization in unsupervised settings.
Insight: Combining 3D geometric priors with class-aware guidance markedly improves pseudo-label quality and, in turn, the performance of 3D WSSS models.
Abstract: 3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low-cost annotated data, significantly reducing reliance on dense point-wise annotations. Previous works mainly employ class activation maps or pre-trained vision-language models to address this challenge. However, the low quality of pseudo-labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high-performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class-aware guidance mechanism to generate high-fidelity pseudo labels. Concretely, our designed methodology first employs a Class-Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categories. This initial refinement stage focuses on enhancing label quality through category-specific optimization. Subsequently, the Geometry-Aware Label Refinement component is developed, which strategically integrates implicit 3D geometric constraints to effectively filter out low-confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self-Training to propagate labels into these areas. This iterative process continuously enhances pseudo-label quality while expanding label coverage, ultimately fostering the development of high-performance 3D WSSS models. Comprehensive experimental validation reveals that our proposed methodology achieves state-of-the-art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.
[43] Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods
Ghazal Danaee,Marc Niethammer,Jarrett Rushmore,Sylvain Bouix
Main category: cs.CV
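TL;DR: The paper compares deep-learning and non-deep-learning methods for brain MRI segmentation, focusing on performance disparities related to race and sex. Some models perform significantly better when training and test data are race-matched, whereas nnU-Net is robust to demographic matching.
Details
Motivation: To examine the race and sex bias that deep-learning segmentation algorithms may exhibit on MRI data, and how such bias affects segmentation performance and volume measurements. Contribution: (1) A comparison of several segmentation methods across demographic subgroups; (2) a quantitative fairness measurement of segmentation performance; (3) the finding that nnU-Net is robust to demographic matching.
Method: (1) Segment the left and right nucleus accumbens (NAc) with UNesT, nnU-Net, CoTr, and ANTs; (2) train and test models on manually labeled gold-standard segmentations; (3) analyze the effect of demographic variables with linear mixed models.
Result: (1) Race-matched training significantly improves accuracy for some models, notably ANTs and UNesT; (2) nnU-Net performs robustly regardless of demographic matching; (3) the sex effects on NAc volume seen with manual segmentation also appear with the biased models, whereas the race effects disappear in all but one model.
Insight: (1) Demographic matching between training and test data matters for segmentation performance; (2) nnU-Net's robustness can serve as a reference point for other models; (3) fairness metrics help identify and quantify algorithmic bias.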
Abstract: Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.
[44] ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy
Kazuki Kawamura,Kengo Nakai,Jun Rekimoto
Main category: cs.CV
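TL;DR: ManzaiSet is the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy, with facial video and audio from 241 participants. The study identifies three viewer types and a positive viewing-order effect, adding a non-Western perspective to emotion AI.
Details
Motivation: Affective computing research is largely based on Western cultural contexts. ManzaiSet aims to fill the gap for non-Western culture, specifically Japanese manzai comedy, and to support research on cultural diversity. Contribution: 1) Releases ManzaiSet, the first multimodal dataset for Japanese manzai comedy; 2) identifies three viewer types through cluster analysis; 3) reveals a positive viewing-order effect that contradicts fatigue hypotheses; 4) provides a basis for cross-cultural emotion AI development.
Method: 1) Collect facial video and audio from 241 participants; 2) cluster viewers with k-means; 3) study the viewing-order effect through individual-level analysis; 4) combine automated humor classification with viewer-level response modeling.
Result: 1) Three viewer types: High and Stable Appreciators (72.8%), Low and Variable Decliners (13.2%), and Variable Improvers (14.0%); 2) a significant positive viewing-order effect; 3) no type-wise differences in responses to humor categories after FDR correction.
Insight: Emotional responses vary across cultures, which underlines the importance of including non-Western cultures in affective computing and informs cross-cultural design of personalized entertainment systems.
Abstract: We present ManzaiSet, the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched >= 8; analyses focus on n=228). This addresses the Western-centric bias in affective computing. Three key findings emerge: (1) k-means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown-Forsythe p < 0.001); (2) individual-level analysis revealed a positive viewing-order effect (mean slope = 0.488, t(227) = 5.42, p < 0.001, permutation p < 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer-level response modeling found no type-wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non-Western contexts.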
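The reported statistic t(227) with n=228 is consistent with fitting one slope per viewer and then testing whether the mean slope differs from zero, although the abstract does not describe the procedure. The sketch below shows that reading with ordinary least squares and a one-sample t statistic. The function names and the toy scores are hypothetical.

```python
import math
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def one_sample_t(values):
    """t statistic for the null hypothesis that the mean of values is zero
    (degrees of freedom = len(values) - 1)."""
    n = len(values)
    return statistics.fmean(values) / (statistics.stdev(values) / math.sqrt(n))

# Toy data: response scores of three viewers over viewing positions 1..3.
order = [1, 2, 3]
slopes = [ols_slope(order, scores) for scores in ([2, 4, 6], [1, 1, 4], [3, 5, 4])]
t_stat = one_sample_t(slopes)
```

A positive mean slope with a large t statistic would indicate that responses tend to grow over the session, the opposite of what a fatigue hypothesis predicts.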
[45] ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
Prateek Gothwal,Deeptimaan Banerjee,Ashis Kumer Biswas
Main category: cs.CV
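TL;DR: ViBED-Net is a dual-stream deep learning framework that combines facial expressions with scene context, extracting spatial features with EfficientNetV2 and modeling time with LSTM and Transformer encoders, to improve student engagement detection in online learning.
Details
Motivation: Detecting student engagement in online learning is important for improving outcomes and personalizing instruction. Existing methods often fail to combine facial expressions with scene context, which limits detection performance. Contribution: Proposes ViBED-Net, which combines face-aware and scene-aware spatiotemporal cues in a dual-stream architecture to improve engagement detection.
Method: 1. Extract spatial features from facial crops and full frames with EfficientNetV2; 2. Model temporal dynamics with LSTM networks and Transformer encoders; 3. Apply targeted data augmentation to address class imbalance.
Result: On the DAiSEE dataset, the LSTM variant of ViBED-Net reaches 73.43% accuracy, outperforming existing methods.
Insight: Combining facial and scene cues markedly improves engagement detection, and the modular design makes the approach applicable to education, user experience research, and related areas.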
Abstract: Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on https://github.com/prateek-gothwal/ViBED-Net .
[46] SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
Roberto Brusnicki,David Pop,Yuan Gao,Mattia Piccinini,Johannes Betz
Main category: cs.CV
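TL;DR: SAVANT is a structured reasoning framework that uses layered scene analysis and a two-phase pipeline (structured scene description extraction followed by multi-modal evaluation) to improve the accuracy and recall of semantic anomaly detection for autonomous driving, while enabling low-cost local deployment.
Details
Motivation: Autonomous driving systems remain weak at detecting semantic anomalies in rare, out-of-distribution scenarios, and naive prompting of Vision Language Models (VLMs) yields unreliable performance and depends on expensive proprietary models. SAVANT aims to address this. Contribution: 1) A structured scene analysis method; 2) low-cost, high-performing deployment with an open-source model; 3) automated labeling to address data scarcity.
Method: A two-phase pipeline: 1) extract a structured scene description from the input image; 2) detect semantic anomalies through multi-modal evaluation across four semantic layers: Street, Infrastructure, Movable Objects, and Environment.
Result: SAVANT reaches 89.6% recall and 88.0% accuracy on real-world driving scenarios; a fine-tuned 7B open-source model (Qwen2.5VL) reaches 90.8% recall and 93.8% accuracy.
Insight: Structured analysis markedly improves VLM reasoning, and a small open-source model within such a framework can surpass proprietary models, offering a practical route to low-cost semantic monitoring.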
Abstract: Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
[47] HouseTour: A Virtual Real Estate A(I)gent
Ata Çelen,Marc Pollefeys,Daniel Barath,Iro Armeni
Main category: cs.CV
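TL;DR: The paper proposes HouseTour, which generates a spatially aware camera trajectory and a natural language summary from a set of images of a 3D space. It produces smooth video trajectories with a diffusion process, renders novel views with 3D Gaussian splatting, and introduces the HouseTour dataset of over 1,200 videos. Experiments show that it outperforms methods that handle the tasks independently.
Details
Motivation: Existing vision-language models (VLMs) struggle with geometric reasoning, particularly in 3D spatial description and video generation. HouseTour addresses this by combining 3D camera trajectories with text generation to automate high-quality real estate video creation. Contribution: 1) The HouseTour method, which couples camera trajectory and text generation; 2) the HouseTour dataset; 3) a new joint evaluation metric; 4) automated, high-quality video creation without specialized equipment.
Method: 1) Generate smooth video trajectories with a diffusion process constrained by known camera poses; 2) integrate the 3D information into the VLM for 3D-grounded descriptions; 3) render novel views with 3D Gaussian splatting; 4) evaluate with a joint task metric.
Result: Incorporating 3D camera trajectories into text generation improves performance over methods that handle each task independently.
Insight: Combining 3D geometric information with vision-language models can markedly improve spatial description and video generation, enabling automated solutions for real estate and tourism applications.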
Abstract: We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
[48] Chimera: Compositional Image Generation using Part-based Concepting
Shivam Singh,Yiming Chen,Agneet Chatterjee,Amit Raj,James Hays,Yezhou Yang,Chitra Baral
Main category: cs.CV
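TL;DR: Chimera is a personalized image generation model that creates novel objects by combining specified parts from multiple source images according to textual instructions. It outperforms baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality.
Details
Motivation: Existing personalized image generation models lack explicit control for composing objects from parts of multiple source images; Chimera fills this gap. Contribution: 1) The Chimera model, which supports part-based composition from multiple source images; 2) a dataset built on 464 semantic atoms; 3) the PartEval evaluation metric.
Method: 1) Build a dataset from semantic atoms; 2) train a diffusion prior model with part-conditional guidance; 3) evaluate generated results with PartEval.
Result: Chimera outperforms baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality, according to human evaluations and PartEval.
Insight: Explicit part-conditional guidance can markedly improve the compositional accuracy and quality of generated images.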
Abstract: Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.
[49] Accelerating Vision Transformers with Adaptive Patch Sizes
Rohan Choudhury,JungEun Kim,Jinhyung Park,Eunho Yang,László A. Jeni,Kris M. Kitani
Main category: cs.CV
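TL;DR: APT adapts the patch size used by a ViT within each image, shortening the input sequence and substantially speeding up inference and training while maintaining performance.
Details
Motivation: Standard ViTs use uniformly sized patches for all image regions, so high-resolution images produce long input sequences and inefficient computation. This motivates an adaptive patch size approach. Contribution: Proposes APT, which assigns different patch sizes within the same image to reduce the total number of input tokens, increasing throughput by 40% to 50% while maintaining model performance.
Method: APT allocates smaller patches to complex regions and larger patches to homogeneous regions, which shortens the input sequence.
Result: Throughput increases by 40% on ViT-L and 50% on ViT-H, and training and inference are up to 30% faster on high-resolution dense visual tasks.
Insight: Adaptive patch sizing can markedly improve ViT efficiency without hurting performance, offering a practical optimization for high-resolution vision tasks.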
Abstract: Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
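The idea of using larger patches in homogeneous regions and smaller ones in complex regions can be sketched as a quadtree split. The abstract does not say how APT measures homogeneity, so pixel variance is used here purely as a stand-in, and the function name, threshold, and toy image are hypothetical.

```python
from statistics import pvariance

def adaptive_patches(image, row, col, size, min_size, threshold):
    """Return (row, col, size) patches covering a square block. A block is
    split into four quadrants while its pixel variance exceeds threshold and
    it is still larger than min_size."""
    block = [image[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    if size <= min_size or pvariance(block) <= threshold:
        return [(row, col, size)]
    half = size // 2
    patches = []
    for dr in (0, half):
        for dc in (0, half):
            patches += adaptive_patches(image, row + dr, col + dc,
                                        half, min_size, threshold)
    return patches

# 4x4 toy image: flat everywhere except a busy top-left corner.
image = [
    [0, 9, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
patches = adaptive_patches(image, 0, 0, 4, min_size=1, threshold=1.0)
```

On this toy image the split yields 7 patches where a uniform 1x1 grid would need 16, which illustrates where the token savings come from.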
[50] From Volume Rendering to 3D Gaussian Splatting: Theory and Applications
Vitor Pereira Matias,Daniel Perazzo,Vinicius Silva,Alberto Raposo,Luiz Velho,Afonso Paiva,Tiago Novello
Main category: cs.CV
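TL;DR: This tutorial paper traces the path from volume rendering to 3D Gaussian Splatting (3DGS), covering the theoretical basis of 3DGS, its main advantages, and its use in real-time rendering, along with its limitations and possible improvements.
Details
Motivation: 3DGS models a scene explicitly as a collection of 3D Gaussians and enables efficient real-time rendering, but it still has limitations such as a high memory footprint and lighting effects baked directly into the representation. A systematic summary of its theory and applications is therefore useful. Contribution: (1) A comprehensive overview of the splatting formulation behind 3DGS; (2) a summary of the main efforts to address its limitations; (3) a survey of applications in surface reconstruction, avatar modeling, animation, and content generation.
Method: Through theoretical analysis and literature review, the paper presents the 3DGS splatting formulation and pipeline and discusses how to address memory footprint and lighting issues.
Result: 3DGS performs well in real-time rendering and novel view synthesis, but its limitations call for further research.
Insight: The explicit modeling in 3DGS offers a new approach to real-time 3D reconstruction, but its memory demands and baked-in lighting restrict generalization. Future work could explore more efficient representations and support for dynamic lighting.
Abstract: The problem of 3D reconstruction from posed images is undergoing a fundamental transformation, driven by continuous advances in 3D Gaussian Splatting (3DGS). By modeling scenes explicitly as collections of 3D Gaussians, 3DGS enables efficient rasterization through volumetric splatting, thus offering seamless integration with common graphics pipelines. Despite its real-time rendering capabilities for novel view synthesis, 3DGS suffers from a high memory footprint, the tendency to bake lighting effects directly into its representation, and limited support for secondary-ray effects. This tutorial provides a concise yet comprehensive overview of the 3DGS pipeline, starting from its splatting formulation and then exploring the main efforts in addressing its limitations. Finally, we survey a range of applications that leverage 3DGS for surface reconstruction, avatar modeling, animation, and content generation, highlighting its efficient rendering and suitability for feed-forward pipelines.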
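The link between volume rendering and splatting is front-to-back alpha compositing: the pixel color is the sum over depth-sorted splats of color times opacity times the transmittance accumulated so far. The sketch below shows that standard rule for one pixel. The function name and the sample splats are illustrative, and projecting Gaussians to screen space and evaluating per-pixel opacity are assumed to have happened already.

```python
def composite(splats):
    """Front-to-back alpha compositing for one pixel.
    splats: list of (rgb, alpha) pairs sorted from nearest to farthest.
    Implements C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for rgb, alpha in splats:
        weight = alpha * transmittance
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return color

# A half-transparent red splat in front of an opaque green one.
pixel = composite([((1.0, 0.0, 0.0), 0.5), ((0.0, 1.0, 0.0), 1.0)])
```

Here the red splat contributes half of the color and the green splat behind it fills the remaining half, after which nothing further back could contribute.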
[51] Online In-Context Distillation for Low-Resource Vision Language Models
Zhiqi Kang,Rahaf Aljundi,Vaggelis Dorovatas,Karteek Alahari
Main category: cs.CV
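TL;DR: The paper proposes online In-Context Distillation (ICD), in which a small vision-language model (VLM) in a low-resource setting collaborates with a teacher model that distills knowledge on the fly, substantially improving performance.
Details
Motivation: To study how small VLMs can run efficiently in low-resource settings, avoiding costly fine-tuning while narrowing the performance gap with large models. Contribution: 1. Proposes the online In-Context Distillation (ICD) method; 2. Designs a cross-modal demonstration selection strategy and test-time dynamic adjustment; 3. Validates the effectiveness of ICD in low-resource scenarios.
Method: Knowledge is distilled dynamically through sparse demonstrations, combining cross-modal demonstration selection, teacher test-time scaling, and student uncertainty conditioning.
Result: Small-model performance improves by up to 33% using as little as 4% teacher annotations, approaching the teacher's zero-shot performance.
Insight: The in-context learning framework can effectively support knowledge distillation for low-resource VLMs, and dynamic strategies markedly reduce the need for compute.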
Abstract: As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher’s zero-shot performance.
[52] SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving
Xiangbo Gao,Tzu-Hsiang Lin,Ruojing Song,Yuheng Wu,Kuan-Ru Huang,Zicheng Jin,Fangzhou Lin,Shinan Liu,Zhengzhong Tu
Main category: cs.CV
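TL;DR: SafeCoop studies full-stack safety and security in natural-language-based collaborative driving, proposing an attack taxonomy and a defense pipeline that substantially improve system safety and performance.
Details
Motivation: Traditional V2X systems suffer from high bandwidth demands, semantic loss, and interoperability problems; natural-language communication lowers bandwidth requirements but introduces new security vulnerabilities. Contribution: The first systematic study of safety and security in language-driven collaborative driving, proposing an attack taxonomy and the SafeCoop defense pipeline and validating it experimentally.
Method: A defense pipeline comprising a semantic firewall, language-perception consistency checks, and multi-source consensus, with an agentic transformation function enabling cross-frame spatial alignment.
Result: In CARLA simulation, SafeCoop improves the driving score by 69.15% under malicious attacks and reaches an F1 score of up to 67.32% for malicious detection.
Insight: Language-driven collaborative driving must weigh safety and performance together; SafeCoop provides a research framework for safe and trustworthy transportation systems.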
Abstract: Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is https://xiangbogaobarry.github.io/SafeCoop.
[53] World-in-World: World Models in a Closed-Loop World
Jiahan Zhang,Muqing Jiang,Nanru Dai,Taiming Lu,Arda Uzunoglu,Shunchi Zhang,Yana Wei,Jiahao Wang,Vishal M. Patel,Paul Pu Liang,Daniel Khashabi,Cheng Peng,Rama Chellappa,Tianmin Shu,Alan Yuille,Yilun Du,Jieneng Chen
Main category: cs.CV
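TL;DR: The paper introduces World-in-World, the first platform to evaluate, in a closed loop, how generative world models affect the task success of embodied agents, showing that visual quality is not the only factor and that controllability and data scale matter more.
Details
Motivation: Existing evaluations of generative world models focus mainly on visual quality and overlook their practical utility for embodied decision making. The paper proposes a closed-loop evaluation platform to address this. Contribution: Introduces World-in-World, the first platform to benchmark world models uniformly in a closed loop; curates four closed-loop evaluation environments that prioritize task success over visual quality; identifies the key factors that determine how useful world models are in embodied tasks.
Method: Designs the World-in-World platform with a unified online planning strategy and a standardized action API; curates four closed-loop environments for evaluation; studies how data scale affects model performance.
Result: Visual quality is not decisive for task success; scaling post-training with action-observation data is more effective than upgrading the pretrained video generator; and more inference-time compute substantially improves closed-loop performance.
Insight: Visual quality is not the only criterion for embodied success; controllability and data scale matter more. Closed-loop simulation is closer to real-world interaction, which makes the evaluation more realistic.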
Abstract: Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
[54] Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset
Clementine Grethen,Simone Gasparini,Geraldine Morin,Jeremy Lebreton,Lucas Marti,Manuel Sanchez-Gestido
Main category: cs.CV
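TL;DR: The paper presents LunarStereo, the first dataset of lunar stereo images, and fine-tunes the MASt3R model for 3D reconstruction and pose estimation of the lunar surface, outperforming zero-shot baselines on synthetic and real data.
Details
Motivation: The lunar surface lacks texture, has difficult lighting variations, and is imaged from atypical orbital trajectories, so existing stereo vision methods cannot be applied directly; a dedicated dataset and model for the lunar environment are needed. Contribution: 1. Introduces LunarStereo, the first lunar stereo image dataset based on high-resolution topography and reflectance models; 2. Fine-tunes MASt3R for lunar 3D reconstruction and pose estimation.
Method: 1. Synthesizes the LunarStereo dataset with ray tracing; 2. Fine-tunes MASt3R on this dataset.
Result: The method is validated on synthetic and real data and significantly outperforms zero-shot baselines.
Insight: Adapting datasets and models to a specific environment (such as the Moon) is key to robust cross-scale generalization.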
Abstract: Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon’s lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation. The results demonstrate significant improvements over zero-shot baselines and pave the way for robust cross-scale generalization in extraterrestrial environments.
[55] RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology
Chengrun Li,Corentin Royer,Haozhe Luo,Bastian Wittmann,Xia Li,Ibrahim Hamamci,Sezgin Er,Anjany Sekuboyina,Bjoern Menze
Main category: cs.CV
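TL;DR: The paper proposes RadDiagSeg-M, a vision-language model that generates diagnostic text and multi-target segmentation masks jointly, addressing the separation of these two tasks in existing models, and introduces the RadDiagSeg-D dataset to support model development.
Details
Motivation: Current medical vision-language models struggle to generate diagnostic text and pixel-level segmentation masks together, which limits their clinical value. The paper proposes an approach that combines both. Contribution: 1. Introduces the RadDiagSeg-D dataset, which supports segmentation and diagnosis across multiple imaging modalities; 2. Proposes the RadDiagSeg-M model, which performs abnormality detection, diagnosis, and multi-target segmentation jointly.
Method: Using RadDiagSeg-D, the authors develop a vision-language model that jointly generates text and segmentation masks, with a hierarchical task design for efficient learning.
Result: RadDiagSeg-M performs strongly across tasks, provides clinically useful information, and establishes a strong baseline for multi-target text-and-mask generation.
Insight: Jointly generating text and segmentation masks can markedly improve the practicality and informativeness of assistive diagnosis; hierarchical task design is an effective way to handle complex problems.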
Abstract: Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.
[56] EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation
Maryam Dialameh,Hossein Rajabzadeh,Jung Suk Sim,Hyock Ju Kwon
Main category: cs.CV
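TL;DR: EMA-SAM is a lightweight extension of SAM-2 that adds a confidence-weighted exponential moving average pointer, substantially improving the stability and accuracy of lesion segmentation in ultrasound video.
Details
Motivation: Radio-frequency ablation of PTMC requires accurate lesion segmentation, but low contrast, probe motion, and heat-related artifacts in ultrasound video make existing methods unstable. SAM-2 performs well on static images, but its predictions are unstable and prone to drift in dynamic video. Contribution: 1. Proposes EMA-SAM, which strengthens the temporal consistency of SAM-2 with an exponential moving average pointer. 2. Significantly improves segmentation performance (Dice and IoU) on the PTMC-RFA dataset. 3. Adds less than 0.1% FLOPs, preserving real-time throughput (about 30 FPS).
Method: A confidence-weighted exponential moving average pointer is added to the SAM-2 memory bank, producing a stable latent tumour prototype across frames. The method adapts quickly once clear evidence reappears while remaining robust to probe pressure and bubble occlusion.
Result: On the PTMC-RFA dataset, EMA-SAM raises maxDice from 0.82 to 0.86 and maxIoU from 0.72 to 0.76, with 29% fewer false positives. It also consistently outperforms SAM-2 on external benchmarks (VTUS and colonoscopy video polyp datasets).
Insight: Lightweight temporal-consistency mechanisms (such as an EMA pointer) can markedly improve the performance of a foundation model on dynamic video segmentation while preserving computational efficiency.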
Abstract: Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce EMA-SAM, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves maxDice from 0.82 (SAM-2) to 0.86 and maxIoU from 0.72 to 0.76, while reducing false positives by 29%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2–5 Dice points over SAM-2. Importantly, the EMA pointer adds <0.1% FLOPs, preserving real-time throughput of ~30 FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound. Code is available at https://github.com/mdialameh/EMA-SAM.
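The abstract above does not give the update rule for the confidence-weighted EMA pointer, so the following is only one plausible reading, sketched under stated assumptions: the effective update rate is a hypothetical base rate scaled by a per-frame confidence in [0, 1], so low-confidence frames (for example during occlusion) barely move the prototype. The function name, the `base_rate` parameter, and the plain-list feature representation are all illustrative.

```python
def update_ema_pointer(pointer, frame_feature, confidence, base_rate=0.2):
    """Return the prototype after blending in one frame's feature.

    pointer, frame_feature: equal-length lists of floats.
    confidence: scalar in [0, 1]; scales how far the prototype moves.
    base_rate: assumed maximum update rate, not a value from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    rate = base_rate * confidence  # low confidence -> small update
    return [(1.0 - rate) * p + rate * f for p, f in zip(pointer, frame_feature)]


prototype = [1.0, 0.0]
# An occluded frame with zero confidence leaves the prototype unchanged.
prototype = update_ema_pointer(prototype, [0.0, 1.0], confidence=0.0)
# A clear frame pulls the prototype toward the new feature.
prototype = update_ema_pointer(prototype, [0.0, 1.0], confidence=1.0, base_rate=0.5)
```

Under this reading, stability through occlusion and fast adaptation once evidence returns both follow from the same rule, since the confidence term controls the step size.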
[57] VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Shruti Palaskar,Leon Gatys,Mona Abdelrahman,Mar Jacobo,Larry Lindsey,Rutika Moharir,Gunnar Lund,Yang Xu,Navid Shiee,Jeffrey Bigham,Charles Maalouf,Joseph Yitan Cheng
Main category: cs.CV
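TL;DR: VLSU is a framework for evaluating the safety of multimodal models; it reveals systematic failures of existing models in joint image-text understanding, particularly on borderline cases and compositional reasoning.
Details
Motivation: Existing safety evaluations of multimodal models usually treat vision and language inputs separately, missing risks that arise from joint interpretation, and struggle to distinguish clearly harmful content from borderline cases. Contribution: Proposes the VLSU framework for systematic multimodal safety evaluation, builds a large-scale benchmark through fine-grained classification and combinatorial analysis, and exposes the limits of joint understanding in current models.
Method: A multi-stage pipeline combining real-world images and human annotation yields a dataset of 8,187 samples covering 15 harm categories and 17 safety patterns.
Result: Models do well on unimodal safety signals (90%+ accuracy), but accuracy drops to 20-55% when joint reasoning is required, and 34% of errors occur even though each modality is classified correctly on its own, pointing to missing compositional reasoning. Instruction framing can reduce over-blocking on borderline content, but at the cost of under-refusing unsafe content.
Insight: Existing models have major weaknesses in joint image-text reasoning and in handling borderline cases; VLSU provides an important test bed and direction for future research.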
Abstract: Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
[58] DeepSeek-OCR: Contexts Optical Compression
Haoran Wei,Yaofeng Sun,Yukun Li
Main category: cs.CV
TL;DR: DeepSeek-OCR investigates compressing long contexts via optical 2D mapping, achieving high OCR precision with low vision tokens.
Details
Motivation: The paper addresses the challenge of compressing long-context information efficiently for tasks like OCR while maintaining high precision. Contribution: Introduces DeepSeek-OCR with DeepEncoder and DeepSeek3B-MoE-A570M decoder, achieving high compression ratios and OCR accuracy.
Method: Uses DeepEncoder to compress high-resolution inputs into fewer vision tokens, paired with a decoder for OCR.
Result: Achieves 97% OCR precision at <10x compression ratio, and outperforms existing methods on OmniDocBench with fewer tokens.
Insight: Demonstrates feasibility of efficient long-context compression, offering potential for historical data and LLM research.
Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
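The compression ratio quoted in the abstract above is the number of text tokens divided by the number of vision tokens. A short sketch of that arithmetic follows; the helper names are hypothetical, and the token counts in the example are illustrative rather than taken from the paper.

```python
import math


def compression_ratio(text_tokens, vision_tokens):
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def min_vision_tokens(text_tokens, max_ratio):
    """Smallest vision-token count whose ratio stays strictly below max_ratio."""
    return math.floor(text_tokens / max_ratio) + 1


# Illustrative page with 1000 text tokens.
ratio_at_100 = compression_ratio(1000, 100)   # exactly 10x, not below it
budget = min_vision_tokens(1000, 10)          # tokens needed to stay under 10x
```

By the figures reported in the abstract, a page kept under the 10x ratio falls in the regime where decoding precision is about 97%, whereas at 20x accuracy is about 60%.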
[59] BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
Ajinkya Khoche,Gergő László Nagy,Maciej Wozniak,Thomas Gustafsson,Patric Jensfelt
Main category: cs.CV
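TL;DR: BlendCLIP is a multimodal pretraining framework that combines the strengths of synthetic and real data to address the domain gap in zero-shot 3D object classification. Its core contributions are a large-scale dataset of real-world triplets and a curriculum-based data mixing strategy, which together substantially improve zero-shot classification.
Details
Motivation: Zero-shot 3D object classification matters for real-world applications such as autonomous driving, but the domain gap between synthetic data and real LiDAR scans prevents existing methods from generalizing. Synthetic data lacks real-world characteristics, while real data lacks semantic diversity. Contribution: (1) a pipeline for generating a dataset of real-world triplets (point cloud, image, and text); (2) a curriculum-based data mixing strategy that progressively adapts to the characteristics of real scans; (3) significant performance gains on nuScenes and other datasets.
Method: Two steps: (1) mine a triplet dataset from real-world driving data and 3D annotations; (2) apply curriculum learning, first training on semantically rich synthetic CAD data and then progressively mixing in a small amount of real data.
Result: Introducing as few as 1.5% real-world samples per batch raises zero-shot accuracy on nuScenes by 27%. The final model achieves state-of-the-art results on nuScenes and TruckScenes, improving on the best prior method by 19.3% on nuScenes.
Insight: Effective domain adaptation, rather than large-scale real-world annotation, is the key to robust open-vocabulary 3D perception.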
Abstract: Zero-shot 3D object classification is crucial for real-world applications like autonomous driving; however, it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets – consisting of a point cloud, image, and text description – mined directly from real-world driving data and human-annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance at https://github.com/kesu1/BlendCLIP.
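The curriculum-based mixing described in the abstract above (synthetic data first, then a gradually increasing share of real samples per batch) could be scheduled in many ways. The sketch below assumes a synthetic-only warmup followed by a linear ramp; the schedule shape, the step counts, and the function names are hypothetical, and only the 1.5% per-batch figure comes from the abstract.

```python
def real_fraction(step, warmup_steps, ramp_steps, max_fraction):
    """Share of real samples per batch at a given training step.

    Synthetic-only during warmup, then a linear ramp up to max_fraction.
    The linear shape is an assumption, not the paper's schedule.
    """
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_fraction * progress


def batch_composition(batch_size, fraction):
    """Split a batch into (synthetic, real) sample counts."""
    n_real = round(batch_size * fraction)
    return batch_size - n_real, n_real


# With a 1.5% cap, a 1024-sample batch ends up with 15 real samples.
final_split = batch_composition(1024, real_fraction(10_000, 100, 100, 0.015))
```

Keeping the real share this small is what makes the approach label-efficient in the sense the abstract describes: most of each batch remains synthetic CAD data.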
[60] OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Tianyu Huang,Runnan Chen,Dongting Hu,Fengming Huang,Mingming Gong,Tongliang Liu
Main category: cs.CV
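TL;DR: OpenInsGaussian is an open-vocabulary instance 3D Gaussian segmentation framework that combines context awareness with multi-view fusion, addressing the shortcomings of existing methods in contextual information and feature alignment and reaching state-of-the-art results on several benchmark datasets.
Details
Motivation: Existing 3D Gaussian segmentation methods built on 2D vision models suffer from insufficient context and inconsistent multi-view feature fusion, which degrades segmentation quality. Contribution: 1. A context-aware feature extraction module; 2. An attention-driven multi-view feature fusion method; 3. Significant gains on open-vocabulary 3D Gaussian segmentation.
Method: 1. Context-Aware Feature Extraction: augments each mask with semantic context; 2. Attention-Driven Feature Aggregation: selectively fuses multi-view features to reduce alignment errors.
Result: State-of-the-art results on benchmark datasets, outperforming existing baselines by a large margin.
Insight: Effectively fusing contextual information with multi-view features is key to improving 3D Gaussian segmentation.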
Abstract: Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
[61] Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery
Xiang Zhang,Suping Wu,Weibin Qiu,Zhaocheng Jin,Sheng Yang
Main category: cs.CV
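TL;DR: The paper proposes a hyperbolic-space learning method that leverages temporal motion priors to recover 3D human meshes from video; by capturing hierarchical structure and temporal motion information, it significantly improves mesh reconstruction accuracy.
Details
Motivation: Existing video-based 3D human mesh recovery methods usually learn features in Euclidean space, which makes it hard to capture the natural hierarchical structure of the body and leads to inaccurate reconstructions. Contribution: 1) a temporal motion prior extraction module that extracts and fuses temporal motion information from 3D pose sequences and image feature sequences; 2) a hyperbolic-space optimization learning strategy that exploits hyperbolic space to capture hierarchical relationships; 3) a hyperbolic mesh optimization loss that keeps learning stable and effective.
Method: 1) temporal motion prior extraction module; 2) hyperbolic-space optimization learning strategy; 3) hyperbolic mesh optimization loss.
Result: Experiments on public datasets show that the method outperforms most existing state-of-the-art methods in accuracy and smoothness.
Insight: Hyperbolic space is better suited than Euclidean space to capturing hierarchical structure, and combining it with temporal motion priors markedly improves 3D human mesh reconstruction.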
Abstract: 3D human meshes have a natural hierarchical structure (torso, limbs, fingers). However, existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space, where this hierarchy is hard to capture accurately, so the reconstructed meshes are often wrong. To solve this problem, we propose a hyperbolic space learning method that leverages a temporal motion prior to recover 3D human meshes from videos. First, we design a temporal motion prior extraction module that extracts temporal motion features from the input 3D pose sequences and image feature sequences respectively, then combines them into the temporal motion prior, strengthening feature expressiveness along the temporal motion dimension. Since data representation in non-Euclidean space, especially hyperbolic space, has been shown to capture hierarchical relationships in real-world datasets effectively, we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior to assist learning, and uses 3D pose and pose motion information respectively in hyperbolic space to optimize and learn the mesh features. We then combine the optimized results to obtain an accurate and smooth human mesh. In addition, to make the optimization of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experiments on large publicly available datasets show that our method outperforms most state-of-the-art methods.
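To illustrate why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether the paper uses this model or another (such as the Lorentz model) is not stated in the abstract above, so the choice here is an assumption, as is the function name.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(
        1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    )


# The same Euclidean gap of 0.05 is much longer near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.95, 0.0])
```

Distances grow rapidly toward the boundary, so there is exponentially more room for leaf-level elements (fingers, in the torso-limbs-fingers example) than for root-level ones near the origin, which mirrors the branching of a tree.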
[62] UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Da Zhang,Chenggang Rong,Bingyu Li,Feiyu Wang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
Main category: cs.CV
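TL;DR: The paper introduces UWBench, a comprehensive benchmark for underwater vision-language understanding comprising 15,003 high-resolution underwater images with rich annotations, and defines three evaluation tasks. The study shows that current vision-language models still struggle with underwater scenes.
Details
Motivation: Large vision-language models have made notable progress on natural scene understanding, but their application to underwater environments is largely unexplored. Underwater imagery poses unique challenges such as light attenuation and color distortion, and requires specialized knowledge of marine ecosystems. Contribution: Introduces the UWBench benchmark, with a large set of high-resolution underwater images and rich annotations (such as referring expressions and question-answer pairs), and defines three evaluation tasks: underwater image captioning, visual grounding, and visual question answering.
Method: Using images captured across diverse underwater environments together with human-verified annotations, the authors build a comprehensive underwater vision-language dataset and design three evaluation tasks on top of it.
Result: Experiments show that existing vision-language models perform poorly on underwater understanding tasks, leaving substantial room for improvement.
Insight: The difficulty of underwater understanding underscores the importance of cross-domain benchmarks; UWBench provides a key resource for underwater vision-language research.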
Abstract: Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
[63] Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization
Xiang Zhang,Suping Wu,Sheng Yang
Main category: cs.CV
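TL;DR: The paper proposes a two-stage network based on latent information and low-dimensional learning to address limb misalignment and insufficient local detail in 3D human mesh recovery, and reduces computational cost through dimensionality reduction and parallel optimization.
Details
Motivation: Existing methods fail to fully exploit latent information (e.g., human motion, shape alignment), leading to limb misalignment and insufficient local detail in the reconstructed mesh, and existing attention mechanisms are computationally expensive. Contribution: 1. A two-stage network that integrates global and local latent information; 2. A low-dimensional, parallel-optimized mesh-pose interaction method that lowers computational cost; 3. Better results than existing methods on public datasets.
Method: 1. The first stage extracts global and local information from the low- and high-frequency components of image features and aggregates it into a hybrid latent frequency-domain feature; 2. The second stage uses the hybrid feature to optimize the pose and shape of the 3D mesh, with a low-dimensional interaction and parallel optimization method.
Result: On large public datasets, the method outperforms existing state-of-the-art methods in both reconstruction accuracy and computational efficiency.
Insight: 1. Fully exploiting latent information is crucial to reconstruction quality; 2. Low-dimensional interaction and parallel optimization are an effective way to balance computational cost and accuracy.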
Abstract: Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low-dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low- and high-frequency components of image features and aggregates this information into a hybrid latent frequency domain feature. This strategy effectively extracts latent information. Subsequently, the extracted hybrid latent frequency domain features are used collaboratively to enhance 2D-to-3D pose learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh pose interaction methods, we design a low-dimensional mesh pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experiments on large publicly available datasets show that our method outperforms most state-of-the-art methods.
[64] TreeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation
Yucheng Song,Chenxi Li,Haokang Ding,Zhining Liao,Zhifang Liao
Main category: cs.CV
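TL;DR: For federated domain generalization in medical image segmentation, the paper proposes TreeFedDG, a framework that uses a tree topology and parameter-difference-guided style mixing to alleviate global drift and improve cross-domain performance.
Details
Motivation: Traditional federated learning methods aggregate information unevenly in cross-domain scenarios, causing global drift that hurts generalization. The privacy constraints and heterogeneity of medical images compound the challenge. Contribution: 1. A hierarchical parameter aggregation method based on a tree topology; 2. Parameter-difference-guided style mixing (FedStyle); 3. A progressive personalized fusion strategy; 4. Ensemble decision-making over a model chain retrieved by feature similarity.
Method: 1. A tree topology suppresses global deviation; 2. FedStyle strengthens robustness to drift; 3. Progressive fusion balances knowledge transfer and personalization; 4. Feature similarity guides ensembling at inference.
Result: Experiments on two public datasets show that TreeFedDG outperforms existing domain generalization methods and achieves more balanced cross-domain performance.
Insight: Tree-structured hierarchical aggregation and parameter-difference guidance are effective ways to alleviate global drift, and they accommodate privacy protection and data heterogeneity.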
Abstract: In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
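The abstract above does not spell out the aggregation rule used by TreeFedDG, so the following is only a toy sketch of how hierarchical averaging over a tree differs from flat averaging. It assumes uniform averaging at every node and a hypothetical grouping of clients into two branches; models are plain lists of floats.

```python
def average(models):
    """Element-wise mean of equally weighted parameter lists."""
    n = len(models)
    return [sum(values) / n for values in zip(*models)]


def tree_aggregate(node):
    """Aggregate a nested list of client models bottom-up.

    A leaf is a list of floats (one client's parameters); an internal
    node is a list of child nodes. Uniform averaging is an assumption.
    """
    if isinstance(node[0], (int, float)):
        return list(node)
    return average([tree_aggregate(child) for child in node])


# Three clients from one domain and a single client from another.
tree = [[[1.0], [1.0], [1.0]], [[4.0]]]
hierarchical = tree_aggregate(tree)                     # branch means, then root mean
flat = average([[1.0], [1.0], [1.0], [4.0]])            # every client weighted equally
```

In this toy case the flat mean is pulled toward the over-represented branch, while the hierarchical mean gives each branch equal say, which is one way a tree topology could counter the imbalance in information aggregation that the abstract describes.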
[65] StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen,Keda Tao,Kele Shao,Huan Wang
Main category: cs.CV
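TL;DR: StreamingTOM is a training-free, plug-and-play two-stage framework that uses causal temporal reduction and online quantized memory to address the pre-LLM prefill and post-LLM kv-cache bottlenecks in streaming video understanding, substantially cutting compute and memory overhead.
Details
Motivation: Real-time streaming video understanding faces two core problems: causality prevents the use of future frames, and accumulation makes the token count grow without bound, hurting efficiency. Existing methods only regulate the post-LLM kv-cache and ignore the pre-LLM cost. Contribution: Proposes the StreamingTOM framework, comprising Causal Temporal Reduction (a fixed per-frame token budget) and Online Quantized Memory (4-bit storage and on-demand retrieval), optimizing both the pre-LLM and post-LLM stages.
Method: 1. Causal Temporal Reduction: selects a subset of tokens based on adjacent-frame changes and token saliency; 2. Online Quantized Memory: stores tokens in 4-bit format and dequantizes them on demand, bounding kv-cache growth.
Result: The method achieves 15.7x kv-cache compression, 1.2x lower peak memory, and 2x faster time to first token (TTFT) than the prior state of the art, while maintaining state-of-the-art accuracy among training-free methods: 63.8% on average on offline benchmarks and 55.8%/3.7 on RVS.
Insight: The two-stage design balances efficiency and accuracy; quantized storage and on-demand retrieval offer a scalable approach to real-time video understanding.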
Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
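To make the "4-bit storage, dequantize on demand" idea concrete, here is a minimal sketch of uniform 4-bit quantization with two codes packed per byte. The per-group min/max scheme and the names `quantize4` and `dequantize4` are assumptions for illustration; the abstract does not describe StreamingTOM's actual quantizer.

```python
def quantize4(values):
    """Map floats to 16 uniform levels and pack two 4-bit codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:  # pad to an even count so codes pair up into bytes
        codes.append(0)
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale, len(values)


def dequantize4(packed, lo, scale, n):
    """Unpack the 4-bit codes and map them back to approximate floats."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return [lo + c * scale for c in codes[:n]]


values = [0.0, 0.5, 1.0, -1.0, 0.25]
packed, lo, scale, n = quantize4(values)
restored = dequantize4(packed, lo, scale, n)
```

Five floats occupy three bytes here, and each restored value differs from the original by at most half a quantization step.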
[66] Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
Vishal Vinod
Main category: cs.CV
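TL;DR: The paper proposes an efficient few-shot, identity-preserving attribute editing method for 3D-aware deep generative models, using latent space directions to achieve multi-view-consistent 3D face editing.
Details
Motivation: Identity-preserving editing of 3D faces is a complex task; existing methods face a trade-off between resolution and editing flexibility and require large-scale labelled data. The paper aims to alleviate these constraints.
Contribution: 1. Proposes a few-shot method for estimating latent space edit directions, enabling efficient 3D face attribute editing; 2. Uses synthetic images to reduce the need for large-scale labelled data; 3. Investigates the linearity of edits and a continuous style manifold.
Method: Combines 3D-aware generative models with 2D portrait editing techniques, estimates edit directions in latent space from a small number of labelled images (ten or fewer), and obtains synthetic images from an existing face dataset with masks.
Result: According to the experiments, the method performs well on multi-view consistency and identity preservation while reducing data requirements. Code and results are publicly available.
Insight: Latent space directions in 3D-aware generative models can efficiently realize identity-preserving edits, and few-shot learning with synthetic data can substantially reduce labelling requirements.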
Abstract: Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models has enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
[67] GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation
Tuan Pham,Thanh-Tung Le,Xiaohui Xie,Stephan Mandt
Main category: cs.CV
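TL;DR: GeoDiff proposes a framework that introduces stereo vision guidance into pretrained diffusion models to address absolute metric depth prediction in monocular depth estimation, achieving accurate depth recovery without retraining.
Details
Motivation: Existing methods excel at relative depth prediction but are limited in absolute metric depth estimation by single-image scale ambiguity; geometric constraints are needed to resolve this.
Contribution: Proposes a training-free framework that augments pretrained diffusion models with stereo geometric constraints to achieve accurate absolute metric depth estimation.
Method: Reframes depth estimation as an inverse problem, using pretrained latent diffusion models (LDMs) conditioned on RGB images together with stereo-based geometric constraints to learn scale and shift.
Result: GeoDiff matches or surpasses existing methods, particularly in challenging scenes with translucent and specular surfaces.
Insight: Geometry-guided diffusion resolves scale ambiguity without retraining, broadening the practical applicability of depth estimation.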
Abstract: We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
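The "scale and shift" that turn relative depth into metric depth can be illustrated with an ordinary least-squares fit. This sketch assumes a handful of metric reference depths are already available (for example from stereo triangulation) and solves for the two scalars in closed form; it illustrates the alignment step only and is not GeoDiff's diffusion-guided procedure. The function name `fit_scale_shift` is hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares scale s and shift t minimizing sum((s*r + t - m)^2)."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    cov_rm = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov_rm / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Toy data: relative depths and metric references generated by m = 2*r + 0.5.
relative = [0.1, 0.4, 0.7, 1.0]
metric = [2.0 * r + 0.5 for r in relative]

scale, shift = fit_scale_shift(relative, metric)
metric_pred = [scale * r + shift for r in relative]
```

With noise-free toy data the fit recovers the generating scale and shift up to floating-point error.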
[68] Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Lehan Wang,Yi Qin,Honglong Yang,Xiaomeng Li
Main category: cs.CV
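TL;DR: This paper proposes Med-RwR, a multimodal medical reasoning-with-retrieval framework that combines external knowledge with visual information to improve the reasoning ability of medical MLLMs, showing significant improvements on multiple benchmarks.
Details
Motivation: Existing medical MLLMs rely solely on internal knowledge when reasoning, which leads to hallucinations and factual errors on cases beyond their training scope. The paper proposes a proactive retrieval framework that uses both visual and textual information to close this gap.
Contribution: 1) Proposes Med-RwR, the first multimodal medical reasoning-with-retrieval framework; 2) Designs a two-stage reinforcement learning strategy that leverages both visual and textual information for retrieval; 3) Proposes the Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling.
Method: Med-RwR improves reasoning by proactively retrieving external knowledge (querying observed symptoms or domain-specific medical concepts), and relies on a two-stage reinforcement learning strategy and the CDIR method.
Result: On multiple public medical benchmarks, Med-RwR significantly outperforms baseline models, including an 8.8% gain on ECBench despite the scarcity of echocardiography data in the training corpus.
Insight: Combining external knowledge with visual information effectively improves the reasoning of medical MLLMs and generalizes well to data-scarce domains.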
Abstract: Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model’s proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR’s significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.
[69] The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Liangyu Chen,James Burgess,Jeffrey J Nirschl,Orr Zohar,Serena Yeung-Levy
Main category: cs.CV
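TL;DR: The paper studies how image resolution affects biomedical multimodal large language models (MLLMs), finding that native-resolution training and inference significantly improve performance and that mixed-resolution training effectively mitigates resolution mismatch.
Details
Motivation: Biomedical imaging typically requires high-resolution analysis, but most existing MLLMs are designed for low-resolution general-purpose datasets, risking loss of critical information. Studying the effect of resolution on biomedical MLLMs is therefore important.
Contribution: (1) Shows that native-resolution training and inference significantly improve performance; (2) Shows that a mismatch between training and inference resolutions degrades performance; (3) Proposes mixed-resolution training to balance computational constraints with performance.
Method: Trains and tests MLLMs at different resolutions, analyzes the performance differences, and proposes a mixed-resolution training strategy.
Result: Native resolution significantly improves task performance, resolution mismatch severely degrades it, and mixed-resolution training effectively mitigates the problem.
Insight: The paper emphasizes prioritizing native-resolution inference and mixed-resolution datasets for biomedical MLLMs, offering practical guidance for model optimization.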
Abstract: Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
[70] OmniNWM: Omniscient Driving Navigation World Models
Bohan Li,Zhuang Ma,Dalong Du,Baorui Peng,Zhujin Liang,Zhenqiang Liu,Chao Ma,Yueming Jin,Hao Zhao,Wenjun Zeng,Xin Jin
Main category: cs.CV
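TL;DR: OmniNWM proposes a panoramic navigation world model that addresses the limitations of existing models along the three core dimensions of state, action, and reward, achieving high-quality multi-modal video generation, precise action control, and 3D-occupancy-based reward definition within a unified framework.
Details
Motivation: Existing autonomous driving world models fall short in state modalities, video sequence length, action control precision, and reward awareness; OmniNWM aims to address these issues.
Contribution: 1. Jointly generates panoramic videos of RGB, semantics, depth, and 3D occupancy; 2. Proposes a normalized panoramic Plucker ray-map representation for precise action control; 3. Uses the generated 3D occupancy to define dense rewards.
Method: 1. Uses a flexible forcing strategy for long-horizon auto-regressive generation; 2. Encodes input trajectories with Plucker ray-maps; 3. Designs rule-based dense rewards from 3D occupancy.
Result: Experiments show that OmniNWM achieves SOTA performance in video generation, control accuracy, and long-horizon stability, and provides a reliable closed-loop evaluation framework through occupancy-grounded rewards.
Insight: OmniNWM's novelty lies in unifying multi-modal state, precise control, and reward definition in one model, offering a more comprehensive solution for autonomous driving world models.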
Abstract: Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
[71] Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Jinlin Li,Yuran Wang,Yifei Yuan,Xiao Zhou,Yingying Zhang,Xixian Yong,Yefeng Zheng,Xian Wu
Main category: cs.CV
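TL;DR: The paper proposes Adaptive Token Ensemble Decoding (ATED), which dynamically ensembles predictions from multiple large vision-language models (LVLMs) to significantly reduce hallucination in multimodal tasks while preserving fluency and relevance.
Details
Motivation: LVLMs perform well on multimodal tasks such as image captioning and visual question answering, but remain prone to object hallucination, that is, generating descriptions of nonexistent or misidentified objects. Prior work partially mitigates this via auxiliary training objectives or external modules, but challenges remain in scalability, adaptability, and model independence.
Contribution: Proposes ATED, a training-free, token-level ensemble framework that dynamically computes uncertainty-based model weights and integrates diverse decoding paths to significantly reduce hallucination.
Method: At inference time, ATED dynamically aggregates predictions from multiple LVLMs, assigning each model an uncertainty-based weight, and uses diverse decoding paths to improve semantic consistency and contextual grounding.
Result: On standard hallucination detection benchmarks, ATED significantly outperforms existing methods, reducing hallucination while preserving fluency and relevance.
Insight: Adaptive ensembling is an effective direction for improving LVLM robustness, particularly in high-stakes real-world applications.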
Abstract: Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination – generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
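One way to picture uncertainty-weighted token ensembling at a single decoding step is sketched below. It assumes Shannon entropy as the uncertainty measure and weights proportional to `exp(-entropy)`; the abstract does not state ATED's actual weighting rule, so `ensemble_step` and this formula are illustrative assumptions.

```python
import math


def entropy(dist):
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def ensemble_step(dists):
    """Mix per-model next-token distributions with entropy-based weights.

    Lower entropy (a more confident model) yields a larger weight (assumed rule).
    """
    raw = [math.exp(-entropy(d)) for d in dists]
    total = sum(raw)
    weights = [r / total for r in raw]
    vocab_size = len(dists[0])
    mixed = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(vocab_size)]
    return weights, mixed


confident = [0.9, 0.05, 0.05]   # hypothetical model A over a 3-token vocabulary
unsure = [1 / 3, 1 / 3, 1 / 3]  # hypothetical model B, maximally uncertain

weights, mixed = ensemble_step([confident, unsure])
next_token = max(range(len(mixed)), key=mixed.__getitem__)
```

Because the weights sum to one, the mixture is itself a valid distribution, and the confident model dominates the choice of the next token.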
[72] Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
Gao Yu Lee,Tanmoy Dam,Md Meftahul Ferdaus,Daniel Puiu Poenar,Vu Duong
Main category: cs.CV
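TL;DR: The paper proposes ATTBHFA-Net, a feature aggregation network based on the Bhattacharyya coefficient and Hellinger distance, to address high intra-class variation and inter-class similarity in few-shot classification of disaster imagery, and validates it on multiple datasets.
Details
Motivation: Disaster image classification faces data scarcity, high intra-class variation, and inter-class similarity; existing few-shot learning methods perform well on generic datasets but have limited effectiveness in disaster scenarios.
Contribution: 1. Designs ATTBHFA-Net, which linearly combines the Bhattacharyya coefficient and Hellinger distance to compare and aggregate feature probability distributions more robustly; 2. Proposes a Bhattacharyya-Hellinger distance-based contrastive loss that significantly improves few-shot classification performance.
Method: 1. The Bhattacharyya coefficient serves as an inter-class contrastive margin, and the Hellinger distance regularizes intra-class alignment; 2. The Bhattacharyya-Hellinger contrastive loss, a distributional counterpart to cosine similarity loss, is used jointly with categorical cross-entropy.
Result: On four few-shot benchmarks and two disaster image datasets, ATTBHFA-Net shows better performance and generalization than existing methods.
Insight: Comparing and aggregating probability distributions handles the high variability and similarity of disaster imagery more effectively, suggesting a new direction for few-shot learning in specialized domains.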
Abstract: The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
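For readers unfamiliar with the two quantities named in the abstract, the sketch below computes the Bhattacharyya coefficient and the Hellinger distance for discrete probability distributions, using the standard definitions BC(p, q) = Σ √(pᵢqᵢ) and H(p, q) = √(1 − BC(p, q)). How ATTBHFA-Net combines them is not specified in the abstract and is not reproduced here; the toy distributions are made up.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions: 1 if identical, 0 if disjoint."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance in [0, 1], derived from the Bhattacharyya coefficient."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


same = [0.5, 0.5]
left = [1.0, 0.0]
right = [0.0, 1.0]

bc_same = bhattacharyya_coefficient(same, same)
h_same = hellinger_distance(same, same)
bc_disjoint = bhattacharyya_coefficient(left, right)
h_disjoint = hellinger_distance(left, right)
```

The two measures move in opposite directions: high overlap means a coefficient near 1 and a distance near 0.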
[73] ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation
Kaiyuan Tan,Yingying Shen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye
Main category: cs.CV
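TL;DR: ViSE proposes a four-stage pipeline for vision-only street-view extrapolation, addressing the distortion and inconsistency of existing NVS methods in closed-loop autonomous driving simulation.
Details
Motivation: Closed-loop simulation for autonomous driving requires realistic extrapolated views, but current NVS methods tend to produce distorted and inconsistent images beyond the original trajectory.
Contribution: 1. Proposes a comprehensive four-stage pipeline; 2. Introduces a data-driven pseudo-LiDAR point cloud initialization; 3. Proposes a dimension-reduced 2D-SDF to model the road surface; 4. Uses a generative prior to create pseudo ground truth for extrapolated views; 5. Removes time-specific artifacts with a data-driven adaptation network.
Method: 1. Data-driven pseudo-LiDAR point cloud initialization; 2. 2D-SDF modeling of the road surface; 3. Pseudo ground truth generation for additional supervision; 4. A data-driven adaptation network for artifact removal.
Result: On the RealADSim-NVS benchmark, the method achieves a final score of 0.441, ranking first.
Insight: A multi-stage approach that combines geometric and generative priors can significantly improve the realism and consistency of extrapolated views.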
Abstract: Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution which took first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxiliary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.
[74] GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
Yudong Li,Hao Li,Xianxu Hou,Linlin Shen
Main category: cs.CV
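TL;DR: GPTFace is a generative pre-training model trained on large-scale, weakly correlated text-image data crawled from the web; it learns facial knowledge through masked image/language modeling and image-text matching, and performs well on downstream tasks.
Details
Motivation: Existing facial pre-training models rely on manually annotated datasets, which are costly to label and generalize poorly. The paper proposes self-supervised learning of facial knowledge from web data.
Contribution: 1. Proposes GPTFace, a generative pre-training model based on weakly correlated text-image data; 2. Combines masked image/language modeling with image-text matching; 3. Performs well on tasks such as face attribute editing and expression manipulation.
Method: 1. Pre-trains on web-crawled text and face images; 2. Pre-training tasks include masked image/language modeling (MILM) and image-text matching (ITM); 3. At generation time, uses the image-text matching loss to steer the generation distribution toward the control signal.
Result: Performs comparably to SOTA models on downstream tasks (such as attribute classification and expression recognition) and supports face editing tasks (such as attribute editing and mask removal).
Insight: Weakly correlated web data can effectively pre-train facial knowledge models, reducing reliance on annotated data and improving generalization.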
Abstract: Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribution classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
[75] AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
Jiayu Zhang,Qilang Ye,Shuo Ye,Xun Lin,Zihan Song,Zitong Yu
Main category: cs.CV
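TL;DR: AV-Master proposes a framework that dynamically models the temporal and modality dimensions, using dynamic adaptive sampling and a modality preference-aware strategy to improve audio-visual question answering, and significantly outperforms existing methods.
Details
Motivation: Existing audio-visual question answering methods lack flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it hard to focus on key information and limiting reasoning in complex scenarios.
Contribution: Proposes the AV-Master framework, comprising a dynamic adaptive sampling mechanism, a modality preference-aware strategy, and a dual-path contrastive loss, which significantly improves performance on audio-visual question answering.
Method: 1. Dynamic adaptive temporal sampling; 2. A modality preference-aware strategy; 3. A dual-path contrastive loss to strengthen cross-modal consistency.
Result: Significantly outperforms existing methods on four large-scale benchmarks, especially on complex reasoning tasks.
Insight: Dynamically modeling the temporal and modality dimensions effectively improves a model's understanding of and reasoning about complex scenes.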
Abstract: Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model’s ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality’s contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
[76] Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Yi-Lun Wu,Bo-Kai Ruan,Chiang Tseng,Hong-Han Shuai
Main category: cs.CV
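TL;DR: This paper proposes Diffusion-DRO, a ranking-based preference optimization framework that addresses the limitations of conventional DPO methods in image probability estimation and non-linearity.
Details
Motivation: Although conventional DPO methods avoid the REINFORCE algorithm, they still suffer from inaccurate image probability estimation and limited diversity of offline data.
Contribution: Proposes the Diffusion-DRO framework, which casts preference learning as a ranking problem, simplifies the training objective into a denoising formulation, and combines offline expert demonstrations with online policy-generated negative samples.
Method: Grounded in inverse reinforcement learning, the method casts preference learning as a ranking problem, removing the dependency on a reward model.
Result: Experiments show that Diffusion-DRO outperforms existing methods in both quantitative metrics and user studies, across a range of challenging and unseen prompts.
Insight: Combining offline and online data is key to improving preference learning, and a denoising-form training objective allows more stable optimization.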
Abstract: Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
[77] Learning Human-Object Interaction as Groups
Jiajun Hong,Jianan Wei,Wenguan Wang
Main category: cs.CV
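TL;DR: GroupHOI models group interactions through geometric proximity and semantic similarity, significantly improving HOI-DET performance and performing strongly on higher-order interaction tasks.
Details
Motivation: Existing HOI-DET methods focus mainly on pairwise relationships and overlook group interactions in which multiple humans and objects participate jointly in real-world scenes.
Contribution: Proposes the GroupHOI framework, which models interaction relations from a group view and propagates contextual information efficiently in terms of geometric proximity and semantic similarity.
Method: 1. A learnable geometric proximity estimator groups humans and objects into clusters; 2. Within each group, self-attention computes a soft correspondence to aggregate and dispatch contextual cues; 3. The interaction decoder is enhanced with semantic features from HO-pairs.
Result: Outperforms existing methods on the HICO-DET and V-COCO benchmarks and shows leading performance on the NVI-DET task.
Insight: Modeling interactions from a group view reflects interaction behavior in complex scenes more faithfully; geometric proximity and semantic similarity are effective cues for context propagation.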
Abstract: Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
[78] FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Duoxun Tang,Xi Xiao,Guangwu Hu,Kangkang Sun,Xiao Yang,Dongyang Chen,Qing Li,Yongjie Yin,Jiyao Wang
Main category: cs.CV
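TL;DR: FeatureFool proposes a zero-query black-box adversarial attack that uses information extracted from a DNN to directly alter the video feature space, achieving an efficient and barely perceptible attack without interacting with the target model.
Details
Motivation: Existing black-box adversarial attacks usually require multi-round interaction and many queries, which is impractical in real applications and hard to scale to the video domain. FeatureFool aims for an efficient attack by directly exploiting feature map information.
Contribution: Proposes FeatureFool, a zero-query black-box attack for the video domain that directly uses DNN-extracted feature information to attack traditional video classifiers and Video-LLMs.
Method: Analyzes DNN feature maps to generate adversarial examples that alter the video feature space, without querying the target model.
Result: The attack success rate exceeds 70% against traditional video classifiers, and the generated adversarial videos score well on SSIM, PSNR, and temporal consistency, making the attack barely perceptible.
Insight: Feature maps are transferable and can be used directly to attack video-domain models.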
Abstract: The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
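Of the imperceptibility metrics mentioned in the abstract, PSNR is the simplest to compute by hand: PSNR = 10·log10(MAX² / MSE). The sketch below applies it to two flat lists of 8-bit pixel values; the pixel values are made up for illustration and have nothing to do with FeatureFool's data or evaluation code.

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


clean = [0, 128, 255, 64]       # hypothetical clean-frame pixels
perturbed = [2, 126, 255, 66]   # same pixels after a small perturbation

score = psnr(clean, perturbed)
```

Smaller perturbations give a lower mean squared error and therefore a higher PSNR.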
[79] Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
Yuqing Luo,Yixiao Li,Jiang Liu,Jun Fu,Hadi Amirpour,Guanghui Yue,Baoquan Zhao,Padraig Corcoran,Hantao Liu,Wei Zhou
Main category: cs.CV
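TL;DR: This paper proposes CM-SSA, an image complexity assessment method that improves performance through cross-modal scene semantic alignment and significantly outperforms existing methods in experiments.
Details
Motivation: Existing image complexity assessment methods rely mainly on single-modality visual features and cannot fully capture the semantic information tied to human perception; the potential of cross-modal scene semantic information remains unexplored.
Contribution: Proposes CM-SSA, which introduces cross-modal scene semantic alignment into image complexity assessment and uses a two-branch model (a complexity regression branch and a semantic alignment branch) to make predictions more consistent with subjective human perception.
Method: CM-SSA has two branches: the complexity regression branch predicts complexity, and the scene semantic alignment branch aligns images with text prompts through pair-wise learning.
Result: Experiments on several datasets show that CM-SSA significantly outperforms existing methods, validating the effectiveness of cross-modal semantic alignment.
Insight: Cross-modal semantic information can significantly improve image complexity assessment, and the semantic alignment branch helps model human perception more closely.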
Abstract: Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
[80] Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
Mst Jannatun Ferdous,Masum Billah,Joy Karmoker,Mohd Ruhul Ameen,Akif Islam,Md. Omar Faruqe
Main category: cs.CV
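TL;DR: This paper presents an automated cricket video analysis system that combines YOLOv8 and OCR to segment wicket-taking deliveries, support batting weakness detection, and model ball trajectories.
Details
Motivation: Traditional cricket video analysis relies on manual work, which is slow and error-prone. The paper aims to automate the analysis with deep learning and OCR to provide data support for coaching and tactical decisions.
Contribution: 1. Proposes an automated system combining YOLOv8 and OCR for segmenting wicket-taking deliveries, detecting batting weaknesses, and modeling ball trajectories. 2. Improves text extraction through image preprocessing, making OCR more robust. 3. Achieves very high accuracy on ball and pitch detection (mAP50 above 99%).
Method: 1. Uses YOLOv8 for pitch and ball detection. 2. Uses OCR to extract scorecard information and identify wicket-taking moments. 3. Improves text extraction with image preprocessing (grayscale transformation, power transformation, and morphological operations). 4. Models ball trajectories on the detected pitches.
Result: The pitch detection model reaches 99.5% mAP50 with a precision of 0.999; the ball detection model reaches 99.18% mAP50 with 0.968 precision and 0.978 recall. Experiments on multiple match videos demonstrate the effectiveness of the approach.
Insight: 1. Combining YOLOv8 and OCR offers an automated solution for cricket video analysis. 2. Image preprocessing is important for OCR performance. 3. The system has practical potential for coaching and tactical analysis.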
Abstract: This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
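Two of the preprocessing steps named in the abstract, grayscale transformation and power (gamma) transformation, can be sketched per pixel as follows. The luma weights are the common ITU-R BT.601 coefficients and the gamma value is arbitrary; the paper's actual parameters and library calls are not given in the abstract, so treat `to_gray` and `power_transform` as hypothetical helpers.

```python
def to_gray(rgb):
    """Convert one (R, G, B) pixel to an 8-bit gray level (BT.601 weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def power_transform(gray, gamma):
    """Power-law transform s = 255 * (r / 255) ** gamma on an 8-bit gray level."""
    return round(255 * (gray / 255) ** gamma)


# A made-up row of scorecard pixels: dark background, mid-gray text edge, white text.
row = [(10, 10, 10), (64, 64, 64), (255, 255, 255)]
gray_row = [to_gray(px) for px in row]
bright_row = [power_transform(g, 0.5) for g in gray_row]  # gamma < 1 brightens dark pixels
```

A gamma below 1 lifts dark and mid-tone values while leaving pure black and pure white unchanged, which can help faint overlay text stand out before OCR.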
[81] ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Zhiwei Hao,Jianyuan Guo,Li Shen,Kai Han,Yehui Tang,Han Hu,Yunhe Wang
Main category: cs.CV
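TL;DR: ScaleNet is an efficient method for scaling up ViT models: it inserts additional weight-sharing layers into a pretrained model, which substantially reduces training cost while improving accuracy on ImageNet-1K.
Details
Motivation: Large ViT models perform well but are expensive to train. ScaleNet aims to lower the computational burden by expanding pretrained models incrementally. Contribution: 1. ScaleNet, which scales ViT models through layer-wise weight sharing and adjustment parameters; 2. Parallel adapter modules that compensate for the performance cost of shared weights.
Method: New layers that share weights with pretrained layers are inserted into a pretrained ViT, and a small set of adjustment parameters (parallel adapters) mitigates the performance degradation caused by weight sharing.
Result: On ImageNet-1K, a 2× depth-scaled DeiT-Base model achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs.
Insight: ScaleNet offers a new route to scaling ViTs efficiently and shows potential in downstream tasks such as object detection.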
Abstract: Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2× depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by the validation in an object detection task.
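The idea of an inserted layer that reuses a pretrained parameter tensor and adds a small layer-specific adjustment can be sketched as follows. This is a toy illustration only: the class name `SharedLayer`, the low-rank adapter form, and all numbers are assumptions, and the actual ScaleNet adapter design may differ.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class SharedLayer:
    """Linear layer that reuses a pretrained weight and adds a parallel adapter."""

    def __init__(self, shared_weight, adapter_down, adapter_up):
        self.weight = shared_weight        # same object as the pretrained layer's weight
        self.adapter_down = adapter_down   # r x d, specific to this layer
        self.adapter_up = adapter_up       # d x r, specific to this layer

    def __call__(self, x):
        base = matvec(self.weight, x)                                   # shared path
        delta = matvec(self.adapter_up, matvec(self.adapter_down, x))   # adapter path
        return [b + d for b, d in zip(base, delta)]

    def extra_params(self):
        """Number of parameters this layer adds beyond the shared weight."""
        count = lambda m: sum(len(row) for row in m)
        return count(self.adapter_down) + count(self.adapter_up)


pretrained_weight = [[1.0, 0.0], [0.0, 1.0]]
original = SharedLayer(pretrained_weight, [[0.0, 0.0]], [[0.0], [0.0]])
inserted = SharedLayer(pretrained_weight, [[1.0, 1.0]], [[0.5], [0.0]])
```

Because both layers hold a reference to the same weight, only the adapter entries count as new parameters, which is the sense in which depth grows with a negligible parameter increase.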
[82] ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
Yuanhe Guo,Linxi Xie,Zhuoran Chen,Kangrui Yu,Ryan Po,Guandao Yang,Gordon Wetztein,Hongyi Wen
Main category: cs.CV
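TL;DR: ImageGem is a dataset for research on generative model personalization. It provides real-world interaction data from 57K users, including 242K customized LoRAs, 3M text prompts, and 5M generated images, and the authors use it for preference alignment, personalized retrieval, and generative model recommendation.
Details
Motivation: Generative models currently lack an understanding of fine-grained individual preferences, largely because in-the-wild, fine-grained user preference annotations are missing. ImageGem aims to fill this gap. Contribution: 1) The ImageGem dataset of large-scale real user interaction data; 2) better preference alignment models trained on it; 3) a study of retrieval models and a vision-language model on personalization tasks; 4) an end-to-end framework for editing customized diffusion models to align with user preferences.
Method: 1) Collect real user interaction data, including LoRAs, text prompts, and generated images; 2) train preference alignment models on the user preference annotations; 3) apply retrieval models and a vision-language model to personalized image retrieval and generative model recommendation; 4) propose a diffusion model editing framework that operates in a latent weight space.
Result: According to the authors, ImageGem enables, for the first time, a new paradigm for generative model personalization, supporting better preference alignment models as well as studies of personalized retrieval and recommendation.
Insight: Real-world user interaction data opens a new direction for generative model personalization, and editing in a latent weight space offers a flexible way to personalize models.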
Abstract: We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
[83] Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Tianci Bi,Xiaoyi Zhang,Yan Lu,Nanning Zheng
Main category: cs.CV
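TL;DR: This paper uses a vision foundation model (VFM) directly as the tokenizer of a latent diffusion model (LDM), avoiding the semantic deviation introduced by distillation, and reaches strong performance efficiently through a redesigned decoder and training strategy.
Details
Motivation: LDM performance depends heavily on the quality of the visual tokenizer, and integrating VFMs through distillation weakens the robustness of semantic alignment. The paper bypasses distillation and uses the VFM's semantic capability directly. Contribution: 1. VFM-VAE (Vision Foundation Model Variational Autoencoder), which integrates the VFM directly; 2. Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks that address pixel-level fidelity; 3. the SE-CKNNA metric for analyzing representation dynamics during diffusion training, and a joint alignment strategy built on that analysis.
Method: 1. A redesigned decoder (Multi-Scale Latent Fusion and Progressive Resolution Reconstruction); 2. analysis of representation dynamics with the SE-CKNNA metric; 3. a joint tokenizer-diffusion alignment strategy.
Result: The system reaches a gFID of 2.20 (without CFG) in 80 epochs, a 10x speedup over prior tokenizers, and a gFID of 1.62 after 640 epochs.
Insight: Direct VFM integration is a stronger paradigm for LDM tokenizers; the redesigned decoder and the dynamics-informed alignment strategy improve both performance and efficiency.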
Abstract: The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM’s semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
[84] Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
Jinfeng Liu,Lingtong Kong,Mi Zhou,Jinwen Chen,Dan Xu
Main category: cs.CV
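TL;DR: Mono4DGS-HDR is the first system to reconstruct renderable 4D HDR scenes from unposed monocular LDR videos captured with alternating exposures. It uses a two-stage optimization framework based on Gaussian Splatting and requires no camera poses as input.
Details
Motivation: Reconstructing 4D HDR scenes from alternating-exposure monocular LDR videos is a challenging problem that existing methods do not address. Contribution: 1) The first two-stage optimization framework based on Gaussian Splatting for this task; 2) a temporal luminance regularization strategy that improves the temporal consistency of the HDR appearance; 3) a new evaluation benchmark for HDR video reconstruction.
Method: 1) The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space; 2) the second stage transforms the Gaussians into world space and jointly refines them with the camera poses.
Result: The method significantly outperforms alternatives adapted from state-of-the-art methods in both rendering quality and speed.
Insight: Gaussian Splatting combined with temporal regularization addresses HDR reconstruction from monocular, alternating-exposure video.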
Abstract: We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
[85] Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Wei-Chia Chang,Yan-Ann Chen
Main category: cs.CV
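TL;DR: The paper combines vision language models (VLMs) with Retrieval-Augmented Generation (RAG) for zero-shot vehicle make and model recognition, using text-based reasoning to avoid large-scale retraining while improving recognition.
Details
Motivation: Existing vehicle make and model recognition methods struggle to adapt to newly released models, and models such as CLIP need costly finetuning. The work targets zero-shot recognition without large-scale training. Contribution: A zero-shot recognition pipeline based on VLMs and RAG that avoids large-scale retraining and supports rapid updates by adding textual descriptions of new vehicles.
Method: A VLM converts the vehicle image into descriptive attributes, which are compared against a database of textual features; relevant entries are retrieved and combined with the description to form a prompt, and a language model infers the make and model.
Result: Experiments show an improvement of nearly 20% in recognition over the CLIP baseline.
Insight: RAG-enhanced language model reasoning shows potential for smart-city tasks, especially where rapid adaptation to new data is needed.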
Abstract: Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
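The retrieve-then-prompt step described in the abstract can be sketched in a few lines. Everything concrete here is hypothetical: the database entries, the vehicle names, the Jaccard token overlap used as a stand-in similarity, and the prompt wording. The VLM that produces the description and the LM that answers the prompt are outside the sketch.

```python
def tokenize(text):
    """Lowercase and split a description into a set of word tokens."""
    return set(text.lower().replace(",", " ").split())


def jaccard(a, b):
    """Token-set overlap, used here as a stand-in for a text similarity model."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(description, database, k=2):
    """Return the k database entries most similar to the description."""
    query = tokenize(description)
    ranked = sorted(
        database.items(),
        key=lambda item: jaccard(query, tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(description, entries):
    """Combine the image description with retrieved entries into an LM prompt."""
    lines = [f"- {name}: {text}" for name, text in entries]
    return (
        "Observed vehicle: " + description + "\n"
        "Candidate descriptions:\n" + "\n".join(lines) + "\n"
        "Which make and model best matches the observed vehicle?"
    )


# Hypothetical textual feature database; adding a new vehicle is one new entry.
DATABASE = {
    "ExampleMake Hatch": "compact hatchback, round headlights, short rear overhang",
    "ExampleMake Truck": "large pickup truck, square headlights, open cargo bed",
    "ExampleMake Sedan": "long sedan, slim headlights, sloping roofline",
}

description = "red compact hatchback with round headlights"
top_entries = retrieve(description, DATABASE, k=2)
prompt = build_prompt(description, top_entries)
```

Supporting a newly released vehicle then means adding one text entry to the database rather than retraining a model, which is the update path the abstract highlights.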
[86] DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
Suman Kunwar
Main category: cs.CV
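TL;DR: DWaste is a computer vision-based waste sorting platform for mobile and edge devices. It shows the advantages of lightweight models for real-time, low-power waste sorting.
Details
Motivation: With the rise of convenience packaging, waste management has become an important sustainability problem, and a waste sorting solution that runs efficiently on resource-constrained devices is needed. Contribution: The DWaste platform, a demonstration that lightweight object detection models are efficient for waste sorting, and further gains from model quantization.
Method: Several image classification and object detection models (EfficientNetV2S/M, ResNet50/101, MobileNet, YOLOv8n, YOLOv11n) are benchmarked on a subset of the authors' waste dataset annotated with the custom tool Annotated Lab, and model quantization is applied.
Result: The EfficientNetV2S classifier is accurate (about 96%) but has high latency (about 0.22s) and elevated carbon emissions; lightweight object detection models perform well (up to 77% mAP) with fast inference (about 0.03s) and small model sizes (under 7MB), which suits real-time use. Quantization reduces model size and VRAM usage by up to 75%.
Insight: Lightweight object detection models and quantization are more efficient on resource-constrained devices and are key to "Greener AI" waste sorting.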
Abstract: The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) using a subset of our own waste data set and annotated it using the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes (< 7MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of “Greener AI” models to support real-time, sustainable waste sorting on edge devices.
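A reduction of up to 75% is what one would expect when 32-bit floating-point weights are stored as 8-bit integers, although the abstract does not say which quantization scheme was used. The sketch below assumes symmetric per-tensor int8 quantization purely for illustration; the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def size_reduction(num_params, bytes_before=4, bytes_after=1):
    """Fractional storage saved when each parameter shrinks from 4 bytes to 1."""
    before = num_params * bytes_before
    after = num_params * bytes_after
    return 1.0 - after / before


weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

The rounding error per weight is bounded by half the scale, which is why accuracy often survives this kind of compression.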
[87] RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Junwen Huang,Shishir Reddy Vutukur,Peter KT Yu,Nassir Navab,Slobodan Ilic,Benjamin Busam
Main category: cs.CV
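TL;DR: The paper proposes RayPose, a ray bundling diffusion method that recasts template matching as a ray alignment problem and uses a diffusion transformer architecture for 6D pose estimation of unseen objects, with competitive results on multiple datasets.
Details
Motivation: Conventional template matching yields inaccurate pose predictions when template retrieval fails, so a more robust approach to unseen object pose estimation is needed. Contribution: 1. Reformulating template matching as a ray alignment problem; 2. a diffusion transformer-based architecture; 3. using geometric priors from the templates to guide query pose inference; 4. a coarse-to-fine training strategy that improves performance.
Method: 1. Parameterize rotation with object-centered camera rays; 2. extend scale-invariant translation estimation to dense translation offsets; 3. align the query image with a set of posed templates using a diffusion transformer; 4. train with a coarse-to-fine template sampling strategy.
Result: Experiments on multiple benchmark datasets show that the method is competitive with state-of-the-art approaches for unseen object 6D pose estimation.
Insight: Combining geometric priors with diffusion models improves the robustness and accuracy of unseen object pose estimation, and the coarse-to-fine training strategy improves performance without changing the network architecture.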
Abstract: Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
[88] GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization
Dušan Malić,Christian Fruhwirth-Reisinger,Alexander Prutsch,Wei Lin,Samuel Schulter,Horst Possegger
Main category: cs.CV
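TL;DR: The paper proposes GBlobs, a local point cloud feature descriptor that improves how well LiDAR-based 3D object detection generalizes across sensor configurations. By avoiding absolute coordinate features, the model learns more robust, object-centric representations.
Details
Motivation: LiDAR-based 3D detectors typically rely on global features such as absolute Cartesian coordinates, so models depend on object position rather than shape and appearance. This "geometric shortcut" limits generalization across sensor configurations. Contribution: GBlobs, a purpose-built local point cloud feature descriptor that keeps the model from relying on absolute position and so improves generalization across LiDAR configurations.
Method: GBlobs replace conventional global coordinate features as network input, compelling the network to learn robust, object-centric geometric representations.
Result: The solution ranked first in RoboSense 2025 Track 3, with state-of-the-art 3D object detection performance under various sensor placements.
Insight: Local feature descriptors such as GBlobs keep models from depending on brittle global features and thereby improve cross-domain generalization.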
Abstract: This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a "geometric shortcut" when trained on conventional global features (i.e., absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model’s ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
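Why a local descriptor sidesteps the position bias can be shown with a small numerical sketch. The exact GBlobs formulation is not given in this entry, so the code assumes a hypothetical descriptor built from the covariance of a local neighborhood of points; the point values and offset are made up.

```python
def mean(points):
    """Centroid of a list of 3D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def covariance(points):
    """3x3 covariance of a local neighborhood; depends only on relative geometry."""
    mu = mean(points)
    n = len(points)
    return [
        [sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n for j in range(3)]
        for i in range(3)
    ]


def translate(points, offset):
    """Shift every point, as a change of sensor placement would."""
    return [[p[i] + offset[i] for i in range(3)] for p in points]


neighborhood = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 2.0, 1.0]]
shifted = translate(neighborhood, [10.0, -5.0, 2.0])

cov_original = covariance(neighborhood)
cov_shifted = covariance(shifted)
```

The absolute coordinates of the two neighborhoods differ, but their covariances match up to floating-point error, so a network fed such features cannot key on where the object sits relative to the sensor.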
[89] Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Zhenxing Zhang,Jiayan Teng,Zhuoyi Yang,Tiankun Cao,Cheng Wang,Xiaotao Gu,Jie Tang,Dan Guo,Meng Wang
Main category: cs.CV
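TL;DR: Kaleido is an open-source multi-subject reference video generation model that improves multi-subject consistency and background disentanglement through a dedicated data construction pipeline and Reference Rotary Positional Encoding (R-RoPE).
Details
Motivation: Existing subject-to-video (S2V) generation models handle multi-subject consistency and background disentanglement poorly, mainly because training data lacks diversity and high-quality samples and because the mechanism for integrating multiple reference images is suboptimal. Contribution: 1) A dedicated data construction pipeline that improves training data through low-quality sample filtering and diverse data synthesis; 2) R-RoPE for processing reference images, which improves multi-image integration.
Method: Data filtering and synthesis build high-quality training data, and the R-RoPE mechanism makes multi-reference-image integration stable and precise.
Result: Across multiple benchmarks, Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization.
Insight: High-quality training data and an effective reference image integration mechanism are key to S2V generation performance.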
Abstract: We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
[90] CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Yongmin Lee,Hye Won Chung
Main category: cs.CV
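TL;DR: CovMatch is a cross-covariance-based multimodal dataset distillation framework that jointly optimizes the image and text encoders, giving stronger cross-modal alignment and better performance than existing methods.
Details
Motivation: Multimodal dataset distillation aims to synthesize a small set of image-text pairs for efficiently training large-scale vision-language models. Existing methods freeze the text encoder, which limits semantic alignment and becomes a performance bottleneck. Contribution: 1. The CovMatch framework, which aligns the cross-covariance of real and synthetic features and regularizes feature distributions within each modality; 2. joint optimization of the image and text encoders.
Method: CovMatch aligns features through their cross-covariance and optimizes two terms: a cross-modal alignment loss and an intra-modality regularization loss. Training both encoders jointly improves semantic alignment.
Result: On Flickr30K and COCO, CovMatch achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs, outperforming existing methods.
Insight: Jointly optimizing the image and text encoders is key to multimodal alignment, and cross-covariance is an effective measure of the relationship between modalities.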
Abstract: Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
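The quantity being matched, the cross-covariance between image and text features, can be computed as in the sketch below. Whether CovMatch centers the features, and the exact form of its loss, are not stated in this entry; the centered estimator and the squared Frobenius gap are assumptions for illustration, and the feature values are made up.

```python
def cross_covariance(xs, ys):
    """Cross-covariance of paired feature lists xs (n x dx) and ys (n x dy)."""
    n = len(xs)
    dx, dy = len(xs[0]), len(ys[0])
    mean_x = [sum(x[i] for x in xs) / n for i in range(dx)]
    mean_y = [sum(y[j] for y in ys) / n for j in range(dy)]
    return [
        [
            sum((x[i] - mean_x[i]) * (y[j] - mean_y[j]) for x, y in zip(xs, ys)) / n
            for j in range(dy)
        ]
        for i in range(dx)
    ]


def frobenius_gap(a, b):
    """Squared Frobenius distance between two equally shaped matrices."""
    return sum((u - v) ** 2 for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b))


# Toy 1-D features: "real" pairs versus a smaller "synthetic" set.
real_cov = cross_covariance([[1.0], [3.0]], [[2.0], [6.0]])
synthetic_cov = cross_covariance([[0.0], [1.0]], [[0.0], [2.0]])
alignment_loss = frobenius_gap(real_cov, synthetic_cov)
```

A distillation loop would adjust the synthetic pairs to shrink `alignment_loss`; the intra-modality regularization term mentioned in the summary is omitted here.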
[91] Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Zhangquan Chen,Manyuan Zhang,Xinlei Yu,Xufang Luo,Mingze Sun,Zihao Pan,Yan Feng,Peng Pei,Xunliang Cai,Ruqi Huang
Main category: cs.CV
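TL;DR: The paper proposes 3DThinker, a framework that uses geometric imagination for 3D spatial reasoning from limited views without 3D prior input or explicitly labeled 3D data. It is trained in two stages and consistently outperforms strong baselines.
Details
Motivation: Vision-language models (VLMs) perform well on many multimodal tasks, but understanding 3D spatial relationships from limited views remains challenging. Existing methods rely on pure text or 2D visual cues, which limits 3D spatial reasoning. Contribution: 3DThinker, described as the first framework to enable 3D imagination during reasoning without 3D prior input or labeled 3D data, trained in two stages.
Method: Two-stage training: 1) supervised training aligns the 3D latent generated by the VLM with that of a 3D foundation model (e.g., VGGT); 2) the reasoning trajectory is optimized based on outcome signals alone, refining the 3D imagination.
Result: Across multiple benchmarks, 3DThinker outperforms strong baselines and offers a new perspective on unifying 3D representations in multimodal reasoning.
Insight: With geometric imagination and outcome-based optimization, 3DThinker shows that 3D reasoning is possible without explicit 3D data.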
Abstract: Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
[92] C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
Baptiste Bauvin,Loïc Baret,Ola Ahmad
Main category: cs.CV
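TL;DR: The paper proposes C-SWAP, a new one-shot pruning framework based on explainable deep learning. Its causal-aware pruning compresses neural networks efficiently, reducing model size with minimal impact on performance.
Details
Motivation: Structured pruning typically needs iterative retraining, which is computationally expensive, while one-shot pruning often causes a considerable drop in performance. The paper addresses this with explainability techniques. Contribution: A causal-aware pruning approach that uses cause-effect relations between model predictions and structures in a progressive pruning process, and evidence that substantial compression is possible with minimal performance loss and without fine-tuning.
Method: A causal-aware progressive pruning strategy built on explainable deep learning, applied in one shot with no retraining. Experiments use CNN and ViT baselines.
Result: On classification tasks, the method substantially reduces model size with minimal impact on performance and outperforms the compared methods.
Insight: Combining explainability with pruning makes pruning more efficient, and causal relations help identify and preserve critical structures.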
Abstract: Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that prompts sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolution neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
[93] Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression
Kyo Kuroki,Yasuyuki Okoshi,Thiem Van Chu,Kazushi Kawamura,Masato Motomura
Main category: cs.CV
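TL;DR: The paper proposes Binary Quadratic Quantization (BQQ), a new matrix quantization method that uses the expressive power of binary quadratic expressions to go beyond conventional first-order quantization while keeping a compact data format. Experiments show that BQQ performs well in both matrix compression and post-training quantization (PTQ).
Details
Motivation: Conventional matrix quantization methods, such as uniform quantization and binary coding quantization, approximate real-valued matrices only through linear combinations of binary bases, which limits expressiveness. BQQ uses binary quadratic expressions to increase expressiveness while keeping compression efficient. Contribution: 1. Binary Quadratic Quantization (BQQ), a new matrix quantization method built on binary quadratic expressions. 2. Validation on matrix compression and PTQ tasks, showing a better trade-off between memory efficiency and reconstruction error.
Method: BQQ approximates real-valued matrices with binary quadratic expressions instead of first-order linear combinations. It is evaluated on a matrix compression benchmark and on PTQ of pretrained Vision Transformer-based models.
Result: BQQ outperforms conventional methods on matrix compression. In PTQ on ImageNet with quantization equivalent to 2 bits, it outperforms the state-of-the-art method by up to 2.2% in the calibration-based scenario and up to 59.1% in the data-free scenario.
Insight: Binary quadratic expressions are surprisingly effective for efficient matrix approximation and neural network compression.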
Abstract: This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
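The contrast between first-order and quadratic binary representations can be made concrete with a toy reconstruction. The precise BQQ parameterization is not given in this entry; the sketch assumes a hypothetical form in which a matrix is approximated by a scaled sum of products of {0, 1} matrices, and the factors below are made up.

```python
def matmul(a, b):
    """Multiply matrix a (m x r) by matrix b (r x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reconstruct(terms, rows, cols):
    """Sum of alpha * (B @ C) over terms, where B and C hold only 0s and 1s."""
    out = [[0.0] * cols for _ in range(rows)]
    for alpha, b, c in terms:
        product = matmul(b, c)
        for i in range(rows):
            for j in range(cols):
                out[i][j] += alpha * product[i][j]
    return out


def storage_bits(terms):
    """Bits needed for the binary factors, one bit per entry (scalars excluded)."""
    return sum(
        sum(len(row) for row in b) + sum(len(row) for row in c) for _, b, c in terms
    )


terms = [
    (2.0, [[1], [0]], [[1, 1]]),    # contributes 2 to the top row
    (-1.0, [[0], [1]], [[0, 1]]),   # contributes -1 to the bottom-right entry
]
approximation = reconstruct(terms, rows=2, cols=2)
bits = storage_bits(terms)
```

For an m x n matrix and inner dimension r, each such term stores r(m + n) bits instead of mn, which is where the compactness of a product form would come from when r is small.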
[94] MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Weinan Jia,Yuning Lu,Mengqi Huang,Hualiang Wang,Binyuan Huang,Nan Chen,Mu Liu,Jidong Jiang,Zhendong Mao
Main category: cs.CV
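TL;DR: MoGA (Mixture-of-Groups Attention) is an efficient sparse attention mechanism that uses a lightweight token router to match tokens precisely, addressing attention redundancy and high computational cost in long video generation.
Details
Motivation: In long video generation, Diffusion Transformers (DiTs) are bottlenecked by the quadratic scaling of full attention with sequence length. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-off is constrained by block size. Contribution: MoGA, a sparse attention mechanism based on semantic-aware routing that matches tokens without blockwise estimation and improves the efficiency and quality of long video generation.
Method: MoGA uses a lightweight, learnable token router for semantic-aware routing, which avoids the limits of blockwise estimation and stays compatible with modern attention stacks such as FlashAttention and sequence parallelism.
Result: MoGA supports end-to-end generation of minute-level, multi-shot, 480p videos at 24 fps with a context length of approximately 580k; experiments validate its effectiveness.
Insight: MoGA's core idea is to allocate attention dynamically through semantic-aware routing, which addresses the weakness of blockwise estimation in earlier sparse methods and shows promise for long-sequence tasks.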
Abstract: Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
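How routing tokens into groups cuts the number of query-key pairs can be sketched as follows. The router here is a hypothetical linear scorer with an argmax assignment, and the token values are made up; the actual MoGA router is learned and its details are not given in this entry.

```python
def route(tokens, router_weights):
    """Assign each token to the group whose router row scores it highest."""
    groups = {}
    for index, token in enumerate(tokens):
        scores = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
        best = max(range(len(scores)), key=scores.__getitem__)
        groups.setdefault(best, []).append(index)
    return groups


def attention_pairs(groups):
    """Query-key pairs evaluated when attention runs only inside each group."""
    return sum(len(members) ** 2 for members in groups.values())


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row per group (assumed values)

groups = route(tokens, router_weights)
sparse_pairs = attention_pairs(groups)
full_pairs = len(tokens) ** 2
```

With balanced groups the pair count falls roughly by a factor equal to the number of groups, and each group is a dense attention call, which is compatible with kernels such as FlashAttention.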
[95] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Yibin Wang,Zhimin Li,Yuhang Zang,Jiazi Bu,Yujie Zhou,Yi Xin,Junjun He,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang
Main category: cs.CV
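TL;DR: UniGenBench++ is a unified semantic evaluation benchmark that addresses the shortcomings of existing text-to-image (T2I) benchmarks in prompt diversity, multilingual support, and fine-grained evaluation.
Details
Motivation: Existing T2I benchmarks lack prompt diversity and multilingual support and offer only coarse evaluation dimensions, so they fall short of real-world needs. Contribution: The UniGenBench++ benchmark, with 600 hierarchically organized prompts covering 5 main themes and 20 subthemes, provided in English and Chinese and in short and long forms. In addition, an evaluation pipeline built on a multimodal large language model (MLLM) and a trained model for offline evaluation.
Method: The pipeline builds on the general world knowledge and fine-grained image understanding of Gemini-2.5-Pro, and combines hierarchically designed prompts with 10 primary and 27 sub evaluation criteria for systematic assessment.
Result: Benchmarking both open- and closed-source T2I models systematically reveals their strengths and weaknesses across various aspects.
Insight: Multilingual prompts, varied prompt lengths, and fine-grained evaluation matter for T2I assessment, and the trained evaluation model offers a practical offline option.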
Abstract: Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models’ semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.
[96] Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin,Alex Jinpeng Wang,Linjie Li,Zhengyuan Yang,Mike Zheng Shou
Main category: cs.CV
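TL;DR: VC2L is a unified contrastive learning framework based on a vision transformer. It renders all inputs (text, images, or mixed) as images, which removes the need for OCR or modality fusion, and supports cross-modal retrieval and understanding on multimodal web documents.
Details
Motivation: Existing contrastive models such as CLIP are limited on complex, loosely aligned multimodal web documents. VC2L addresses this by modeling everything in a unified pixel space. Contribution: 1. The VC2L framework, which models multimodal inputs with a single vision transformer; 2. a snippet-level contrastive learning objective; 3. three new retrieval benchmarks.
Method: VC2L renders all inputs as images and applies a snippet-level contrastive objective that aligns consecutive multimodal segments, using the inherent coherence of documents for cross-modal alignment.
Result: VC2L achieves competitive or superior performance compared to CLIP-style models on the proposed benchmarks (AnyCIR, SeqCIR, CSR) and on established datasets such as M-BEIR and MTEB.
Insight: Multimodal web data is a valuable training resource for contrastive learning, and a unified vision-centric approach is scalable.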
Abstract: Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
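A snippet-level contrastive objective that pulls consecutive segments together can be sketched with an InfoNCE-style loss. InfoNCE, the temperature value, and the toy embeddings are assumptions for illustration; in VC2L the embeddings would come from a vision transformer applied to rendered snippets.

```python
import math


def dot(a, b):
    """Inner product of two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where positives[i] is the snippet following anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy unit embeddings: snippet i and its successor point in the same direction.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each snippet is most similar to its own successor and large when successors are swapped, which is the signal that lets document order stand in for explicit image-text pairing.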
[97] PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting
Changkun Liu,Bin Tan,Zeran Ke,Shangzhan Zhang,Jiachen Liu,Ming Qian,Nan Xue,Yujun Shen,Tristan Braud
Main category: cs.CV
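TL;DR: PLANA3R is a pose-free framework for zero-shot metric planar 3D reconstruction. It uses Vision Transformers (ViT) to extract sparse planar primitives and learns geometry through planar splatting, enabling efficient 3D reconstruction of indoor scenes.
Details
Motivation: Indoor scenes usually exhibit geometric regularities, but existing methods require 3D plane annotations or pose information, which limits scalability and generalization. PLANA3R aims at metric planar 3D reconstruction trained with only depth and normal annotations, without explicit plane supervision. Contribution: 1) A zero-shot metric planar 3D reconstruction framework that needs no 3D plane annotations; 2) efficient geometry learning via planar splatting and ViT; 3) strong generalization across several tasks (3D reconstruction, depth estimation, pose estimation).
Method: 1) Extract sparse planar primitives with ViT; 2) supervise geometry learning via planar splatting, with gradients propagated through high-resolution rendered depth and normal maps; 3) train using only depth and normal annotations.
Result: PLANA3R performs strongly on multiple indoor datasets, generalizes to out-of-domain indoor environments, and delivers strong results on 3D surface reconstruction, depth estimation, and relative pose estimation.
Insight: With a compact planar-primitive representation and geometry learning that needs no plane annotations, PLANA3R shows promise for 3D reconstruction without explicit plane supervision, and the planar formulation also yields accurate plane segmentation.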
Abstract: This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at https://lck666666.github.io/plana3r
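To make the idea of a rendered depth map of a planar primitive concrete, the sketch below computes the depth of an infinite plane at a single pixel by ray-plane intersection under a pinhole camera. It is only the geometric core, not the paper's planar splatting (which handles bounded primitives and differentiable blending); the function name and parameters are hypothetical.

```python
def plane_depth_at_pixel(normal, offset, u, v, fx, fy, cx, cy, eps=1e-8):
    """Depth (z in camera coordinates) where the ray through pixel (u, v)
    hits the plane {X : normal . X = offset}.

    Returns None if the ray is parallel to the plane or the hit point
    lies behind the camera.
    """
    # Back-project the pixel to a ray direction with unit z-component.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < eps:
        return None
    # A point on the ray is X = z * ray, so normal . X = offset gives z.
    depth = offset / denom
    return depth if depth > 0 else None


# A fronto-parallel plane 2 m in front of the camera has depth 2 everywhere.
depth = plane_depth_at_pixel((0.0, 0.0, 1.0), 2.0, u=10, v=90,
                             fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

Evaluating this per pixel for every primitive, and comparing against ground-truth depth, is the kind of signal that depth supervision provides.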
[98] See the Text: From Tokenization to Visual Reading
Ling Xing,Alex Jinpeng Wang,Rui Yan,Hongyu Qu,Zechao Li,Jinhui Tang
Main category: cs.CV
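TL;DR: The paper proposes SeeTok, which renders text as images and uses pretrained multimodal LLMs to read it visually, substantially reducing compute while matching or surpassing subword tokenizers.
Details
Motivation: Modern large language models (LLMs) rely on subword tokenization, which over-segments low-resource languages into long, linguistically meaningless sequences. SeeTok addresses this by mimicking how humans read visually. Contribution: 1. Proposes the visual-reading paradigm SeeTok; 2. matches or surpasses subword tokenizers across three language tasks; 3. cuts FLOPs by 70.5% while improving cross-lingual generalization and robustness to typographic noise.
Method: 1. Render text as images; 2. interpret them with pretrained multimodal LLMs, reusing their OCR and text-vision alignment abilities; 3. bypass conventional subword tokenization.
Result: Across three language tasks, SeeTok matches or surpasses subword tokenizers while using 4.43 times fewer tokens and 70.5% fewer FLOPs, with additional gains in cross-lingual generalization and noise robustness.
Insight: Visual reading is a more natural, more human-like way to process language and may become a direction for future language models.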
Abstract: People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
[99] IF-VidCap: Can Video Caption Models Follow Instructions?
Shihao Li,Yuanxing Zhang,Jiangtao Wu,Zhide Lei,Yiwen He,Runzhe Wen,Chenxi Liao,Chengkang Jiang,An Ping,Shuo Gao,Suhan Wang,Zhaozhou Bian,Zijun Zhou,Jingyi Xie,Jiayi Zhou,Jing Wang,Yifan Yao,Weihao Xie,Yingshui Tan,Yanghai Wang,Qianqian Xie,Zhaoxiang Zhang,Jiaheng Liu
Main category: cs.CV
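TL;DR: The paper introduces IF-VidCap, a benchmark for evaluating instruction following in controllable video captioning, filling a gap left by existing benchmarks that overlook this capability.
Details
Motivation: In practice, users need captions that follow specific instructions rather than exhaustive, unconstrained descriptions. Current benchmarks mainly assess descriptive comprehensiveness and largely overlook instruction following. Contribution: 1. Introduces IF-VidCap, a benchmark of 1,400 high-quality samples that systematically assesses format correctness and content correctness. 2. Evaluates more than 20 models, showing that proprietary models still lead but top open-source models are near parity. 3. Finds that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions.
Method: Uses the IF-VidCap benchmark to systematically evaluate the instruction-following ability of captioning models along two dimensions: format correctness and content correctness.
Result: Proprietary models lead, with open-source models close behind. Models specialized for dense captioning underperform general-purpose MLLMs on complex instructions.
Insight: Future work should improve descriptive richness and instruction-following fidelity together.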
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
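The format-correctness dimension lends itself to rule-based checks. The sketch below is purely illustrative: the abstract does not specify which constraints IF-VidCap uses, so the constraint types (word limit, JSON output, required keys) and the function name are assumptions. Content correctness would typically need a judge model or human review and is not sketched here.

```python
import json


def check_format(caption, max_words=None, must_be_json=False, required_keys=()):
    """Return the list of violated format constraints (empty means pass).

    All constraint types here are hypothetical examples of instructions
    a user might attach to a captioning request.
    """
    violations = []
    if max_words is not None and len(caption.split()) > max_words:
        violations.append("max_words")
    if must_be_json:
        try:
            parsed = json.loads(caption)
        except json.JSONDecodeError:
            violations.append("not_json")
        else:
            if not isinstance(parsed, dict):
                violations.append("not_json_object")
            else:
                violations.extend(
                    f"missing_key:{key}" for key in required_keys if key not in parsed
                )
    return violations


# Instruction (assumed): "Answer as JSON with keys 'scene' and 'action'."
print(check_format('{"scene": "park"}', must_be_json=True,
                   required_keys=("scene", "action")))
```

Returning the list of violated constraints, rather than a single boolean, makes it easy to report per-constraint pass rates.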
[100] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang,Yuhao Wang,Tao Zhang,Yikang Zhou,Yanwei Li,Jiacong Wang,Ye Tian,Jiahao Meng,Zilong Huang,Guangcan Mai,Anran Wang,Yunhai Tong,Zhuochen Wang,Xiangtai Li,Zhaoxiang Zhang
Main category: cs.CV
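TL;DR: The paper proposes Grasp Any Region (GAR), a region-level visual understanding approach for multimodal large language models (MLLMs) that uses an RoI-aligned feature replay technique to improve global-context awareness and to model interactions among multiple prompts.
Details
Motivation: MLLMs struggle with fine-grained analysis of complex scenes and with modeling relationships between objects, and existing region-level MLLMs neglect global context. Contribution: 1. Proposes GAR, which supports precise region perception and multi-prompt interaction modeling; 2. builds the GAR-Bench evaluation framework; 3. demonstrates GAR's strong performance across several tasks.
Method: Uses an RoI-aligned feature replay technique that combines global context with multi-prompt interaction modeling to achieve advanced compositional reasoning.
Result: GAR-1B outperforms DAM-3B by 4.5 points on DLC-Bench and surpasses InternVL3-78B on GAR-Bench-VQA; zero-shot GAR-8B outperforms the in-domain VideoRefer-7B on VideoRefer-BenchQ.
Insight: Global context and multi-prompt interaction are crucial for region-level visual understanding, and GAR's capabilities transfer readily to video.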
Abstract: While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
[101] Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction
Jannis Fleckenstein,David Kreismann,Tamara Rosemary Govindasamy,Thomas Brunschwiler,Etienne Vos,Mattia Rigotti
Main category: cs.CV
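TL;DR: The paper fine-tunes a geospatial foundation model to predict and simulate urban heat island effects, assesses their impact under climate change, and demonstrates the model's usefulness in data-scarce regions.
Details
Motivation: As urbanization and climate change intensify, urban heat island effects are becoming more severe. Conventional machine learning models trained on limited data often predict poorly, particularly in underserved areas, so a model that generalizes better is needed. Contribution: A fine-tuning approach based on a geospatial foundation model for predicting urban heat island effects, with a simulation demonstrating its practical value for mitigation strategies.
Method: A geospatial foundation model trained on global unstructured data is fine-tuned to predict land surface temperature, and its practical value is demonstrated through a simulated inpainting.
Result: The results indicate that foundation models can effectively evaluate urban heat island mitigation strategies in data-scarce regions, supporting more climate-resilient cities.
Insight: Fine-tuned geospatial foundation models can improve prediction accuracy, especially in data-scarce regions, offering a new tool for climate adaptation planning.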
Abstract: As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model’s accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.
[102] UltraGen: High-Resolution Video Generation with Hierarchical Attention
Teng Hu,Jiangning Zhang,Zihan Su,Ran Yi
Main category: cs.CV
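TL;DR: UltraGen is a video generation framework whose hierarchical dual-branch attention architecture enables efficient, end-to-end native high-resolution video synthesis, overcoming the resolution limits of existing methods.
Details
Motivation: Existing diffusion-transformer video generation models struggle to generate high-resolution video (1080P/2K/4K) because of the quadratic computational complexity of attention. Contribution: 1) A hierarchical dual-branch attention architecture that decouples global and local attention; 2) a spatially compressed global modeling strategy and a hierarchical cross-window local attention mechanism that reduce computational cost; 3) the first scaling of pre-trained low-resolution video models to 1080P and even 4K.
Method: 1) A dual-branch architecture based on global-local attention decomposition; 2) spatially compressed global modeling; 3) hierarchical cross-window local attention.
Result: UltraGen outperforms existing methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
Insight: Hierarchical attention is an effective route to high-resolution video generation, and decoupling global from local modeling balances computational efficiency and generation quality.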
Abstract: Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
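A back-of-the-envelope count of query-key pairs shows why decomposing full attention helps. This sketch assumes non-overlapping spatial windows for the local branch and a uniform spatial downsampling factor for the global branch; both are simplifications chosen for illustration, and the parameter names `window` and `compress` are hypothetical. It does not model UltraGen's hierarchical cross-window mechanism.

```python
def full_attention_pairs(frames, height, width):
    """Query-key pairs for full attention over all latent tokens."""
    tokens = frames * height * width
    return tokens * tokens


def decomposed_attention_pairs(frames, height, width, window, compress):
    """Pairs for a local windowed branch plus a spatially compressed global branch.

    Assumes height and width are divisible by both window and compress.
    """
    # Local branch: each window attends only within itself (across all frames).
    tokens_per_window = frames * window * window
    num_windows = (height // window) * (width // window)
    local_pairs = num_windows * tokens_per_window ** 2
    # Global branch: full attention on a spatially downsampled token grid.
    global_tokens = frames * (height // compress) * (width // compress)
    global_pairs = global_tokens ** 2
    return local_pairs + global_pairs


# Toy latent grid: 1 frame of 8x8 tokens.
print(full_attention_pairs(1, 8, 8))              # 4096
print(decomposed_attention_pairs(1, 8, 8, 4, 4))  # 1040
```

Doubling both height and width multiplies the full-attention count by 16, while the local branch grows only with the number of windows, which is why the gap widens at higher resolutions.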
[103] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu,Kaicheng Yang,Ziyong Feng,Qi Ming,Zonghao Guo,Xiang An,Ziyong Feng,Junchi Yan,Xue Yang
Main category: cs.CV
TL;DR: ProCLIP proposes a progressive vision-language alignment framework to align CLIP’s image encoder with an LLM-based embedder, addressing limitations of CLIP’s text encoder.
Details
Motivation: CLIP's text encoder has limitations like short input length and lack of multilingual support, restricting its broader applicability. Aligning LLM-based embedders directly disrupts CLIP's pretrained vision-language alignment. Contribution: ProCLIP introduces curriculum learning to align CLIP’s image encoder with an LLM-based embedder, leveraging distillation and contrastive tuning while preserving pretrained knowledge.
Method: Uses knowledge distillation from CLIP’s text encoder to LLM embedder, followed by contrastive tuning with self-distillation regularization and alignment losses.
Result: Effectively aligns CLIP’s image encoder with LLM embedder, enhancing long-text and multilingual capabilities without disrupting pretrained alignment.
Insight: Progressive alignment with distillation preserves pretrained knowledge, while curriculum learning improves alignment effectiveness.
Abstract: The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP
[104] An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection
Neel Patel,Alexander Wong,Ashkan Ebadi
Main category: cs.CV
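TL;DR: Proposes a teacher-student framework combining supervised and self-supervised learning that significantly improves the accuracy of tuberculosis and symptom detection and performs well in explainability assessments.
Details
Motivation: Tuberculosis is a global health problem and early detection is vital, but resource-limited areas lack skilled radiologists, creating a need for reliable AI screening tools. Contribution: 1. Proposes a teacher-student framework that combines supervised heads with a self-supervised head; 2. significantly outperforms baseline models on tuberculosis and symptom detection; 3. provides an explainability analysis of the model.
Method: A teacher-student framework integrating two supervised heads and one self-supervised head, covering disease classification and multilabel symptom detection on chest X-rays.
Result: The model reaches 98.85% accuracy in distinguishing COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection.
Insight: 1. Combining supervised and self-supervised learning makes effective use of limited data; 2. the explainability analysis supports the model's clinical applicability.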
Abstract: Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher–student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.
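For readers unfamiliar with the macro-F1 metric reported for multilabel symptom detection, the sketch below computes it from binary label vectors: F1 is calculated per label and then averaged with equal weight. The data here are toy values, not from the paper.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 for multilabel data.

    y_true, y_pred: lists of equal-length 0/1 vectors, one per sample.
    A label with no positives in either list contributes an F1 of 0.
    """
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Three samples, two symptom labels (toy data).
truth = [[1, 0], [1, 1], [0, 1]]
preds = [[1, 0], [0, 1], [0, 1]]
score = macro_f1(truth, preds)  # label 0: F1 = 2/3, label 1: F1 = 1
```

Because every label counts equally, macro-F1 penalizes poor performance on rare symptoms more than accuracy or micro-averaged F1 would.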
[105] SAM 2++: Tracking Anything at Any Granularity
Jiaming Zhang,Cheng Liang,Yichun Yang,Chenkai Zeng,Yutao Cui,Xinwen Zhang,Xin Zhou,Kai Ma,Gangshan Wu,Limin Wang
Main category: cs.CV
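TL;DR: SAM 2++ is a unified video tracking model that supports tracking at any granularity (masks, boxes, and points). With task-specific prompts, a unified decoder, and a task-adaptive memory mechanism, it addresses the redundancy and limited generalization of existing task-specific trackers and achieves state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing video trackers are usually designed for a single task and rely on custom modules, which leads to redundancy and limits generalization. SAM 2++ aims to unify tracking tasks of different granularities, reducing redundancy and improving generality. Contribution: 1. Task-specific prompts and a unified decoder that extend target granularity; 2. a task-adaptive memory mechanism that unifies memory matching; 3. Tracking-Any-Granularity, a large-scale dataset annotated at multiple granularities.
Method: 1. Design task-specific prompts to encode different inputs; 2. use a unified decoder to produce outputs in a consistent form; 3. apply a task-adaptive memory mechanism for memory matching during tracking; 4. build a multi-granularity annotated dataset to support training.
Result: SAM 2++ achieves state-of-the-art performance on multiple tracking benchmarks, demonstrating that it is both unified and robust.
Insight: A unified framework with adaptive mechanisms enables flexible tracking across granularities, reducing redundant design and improving generalization.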
Abstract: Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
[106] FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
Yubin Zheng,Pak-Hei Yeung,Jing Xia,Tianjie Ju,Peng Tang,Weidong Qiu,Jagath C. Rajapakse
Main category: cs.CV
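TL;DR: FedDEAP is an adaptive federated prompt tuning framework that improves CLIP's generalization in multi-domain federated learning through a dual-prompt design and feature disentanglement.
Details
Motivation: Domain shift and label heterogeneity in federated learning hinder the generalization of the global model, while the zero-shot classification ability of large vision-language models such as CLIP motivates the study of how to fine-tune CLIP effectively in a federated setting. Contribution: 1) A method for disentangling semantic and domain-specific features; 2) a dual-prompt mechanism with a global semantic prompt and a local domain prompt; 3) alignment of textual and visual representations to preserve semantic and domain consistency.
Method: Uses semantic and domain transformation networks to disentangle features, introduces a dual-prompt design to balance shared and personalized information, and aligns textual and visual representations under the learned transformations.
Result: Experiments on four datasets validate the effectiveness of FedDEAP for multi-domain federated image recognition.
Insight: Through disentanglement and the dual-prompt design, FedDEAP addresses domain adaptation in federated learning while retaining CLIP's zero-shot ability.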
Abstract: Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP’s generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
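The dual-prompt design can be pictured as one federated round in which only the global semantic prompt is aggregated while each client's local domain prompt never leaves the client. The sketch below assumes sample-count-weighted averaging (as in FedAvg) for the global prompt; the abstract does not state the aggregation rule, and all names and the flat-vector prompt representation are hypothetical.

```python
def aggregate_global_prompts(client_prompts, client_sizes):
    """Sample-count-weighted average of the clients' global prompt vectors."""
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[k] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for k in range(dim)
    ]


def federated_round(clients):
    """Aggregate the shared prompt; leave each local domain prompt untouched.

    clients: list of dicts with keys 'global_prompt', 'local_prompt', 'num_samples'.
    """
    new_global = aggregate_global_prompts(
        [c["global_prompt"] for c in clients],
        [c["num_samples"] for c in clients],
    )
    for client in clients:
        client["global_prompt"] = list(new_global)  # shared semantic knowledge
        # client["local_prompt"] stays personalized to the client's domain
    return new_global


clients = [
    {"global_prompt": [0.0, 2.0], "local_prompt": [1.0], "num_samples": 1},
    {"global_prompt": [4.0, 2.0], "local_prompt": [-1.0], "num_samples": 3},
]
shared = federated_round(clients)  # [3.0, 2.0]
```

Keeping the local prompt out of aggregation is what preserves domain-specific knowledge that plain averaging would wash out.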
[107] DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution
Rongyuan Wu,Lingchen Sun,Zhengqiang Zhang,Shihao Wang,Tianhe Wu,Qiaosi Yi,Shuai Li,Lei Zhang
Main category: cs.CV
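TL;DR: The paper proposes DP$^2$O-SR, a framework that combines full-reference and no-reference image quality assessment models to directly optimize a generative model toward perceptual preferences without human annotation, significantly improving the perceptual quality of real-world image super-resolution (Real-ISR).
Details
Motivation: Text-to-image (T2I) diffusion models can synthesize rich details for Real-ISR, but their stochasticity makes output quality vary. The paper aims to exploit this diversity by directly optimizing perceptual preferences. Contribution: 1) The DP$^2$O-SR framework, which needs no human annotation; 2) a hybrid reward signal that combines full-reference and no-reference IQA models; 3) hierarchical preference optimization, which improves learning efficiency.
Method: 1) Use a hybrid reward signal to balance structural fidelity and natural appearance; 2) build multiple preference pairs to exploit perceptual diversity; 3) apply hierarchical preference optimization, which adaptively weights training pairs by intra-group reward gaps and inter-group diversity.
Result: Experiments show that DP$^2$O-SR significantly improves perceptual quality on both diffusion- and flow-based T2I backbones and generalizes to real-world benchmarks.
Insight: 1) The best preference-selection strategy depends on model capacity: smaller models benefit from broader coverage, larger models from stronger contrast; 2) hierarchical optimization uses perceptual diversity more efficiently; 3) the hybrid reward signal is key.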
Abstract: Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
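The step from multiple sampled outputs to preference pairs can be sketched as follows. This is an illustration under assumptions: the hybrid reward is taken to be a convex combination of two normalized, higher-is-better IQA scores, and pairs are formed between the top and bottom fraction of samples given by a selection ratio. The function names, the `weight` parameter, and the exact pairing rule are hypothetical, not the paper's definitions.

```python
def hybrid_reward(fr_score, nr_score, weight=0.5):
    """Convex combination of a full-reference and a no-reference IQA score.

    Assumes both scores are normalized to a comparable range and that
    higher means better.
    """
    return weight * fr_score + (1 - weight) * nr_score


def build_preference_pairs(rewards, select_ratio):
    """Pair each of the top-k samples (winners) with each of the bottom-k (losers).

    rewards: one reward per output sampled for the same low-resolution input.
    select_ratio: fraction of samples used on each side (k >= 1).
    Returns (winner_index, loser_index) tuples.
    """
    n = len(rewards)
    k = max(1, int(n * select_ratio))
    order = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    winners, losers = order[:k], order[-k:]
    return [(w, l) for w in winners for l in losers if rewards[w] > rewards[l]]


rewards = [0.1, 0.9, 0.5, 0.3]
print(build_preference_pairs(rewards, 0.25))  # [(1, 0)]  best-vs-worst only
print(build_preference_pairs(rewards, 0.5))   # [(1, 3), (1, 0), (2, 3), (2, 0)]
```

A small ratio yields few, high-contrast pairs, while a larger ratio yields broader coverage with smaller reward gaps, which is the trade-off the abstract ties to model capacity.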
[108] DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Ziang Zhang,Zehan Wang,Guanghao Zhang,Weilong Dai,Yan Xia,Ziang Yan,Minjie Hong,Zhou Zhao
Main category: cs.CV
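TL;DR: The paper introduces DSI-Bench, a benchmark for evaluating dynamic spatial intelligence, and reveals the limitations of current vision-language models on dynamic 3D scenes.
Details
Motivation: Reasoning about dynamic spatial relationships is essential in the real world, but while mainstream vision-language models and expert models do well on static scenes, their abilities in dynamic 3D scenarios remain underexplored. Contribution: Introduces DSI-Bench, with nearly 1,000 dynamic videos and more than 1,700 annotated questions, to systematically evaluate how well models understand dynamic spatial relationships.
Method: Designs spatially and temporally symmetric videos and questions covering nine decoupled motion patterns (of observers and objects) to reduce bias and systematically evaluate model reasoning.
Result: An evaluation of 14 VLMs and expert models finds that they often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenes.
Insight: DSI-Bench exposes the challenges of dynamic spatial intelligence and points to directions for improving future general and expert models.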
Abstract: Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models’ reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.
cs.CR
[109] BreakFun: Jailbreaking LLMs via Schema Exploitation
Amirkia Rafiei Oskooei,Mehmet S. Aktas
Main category: cs.CR
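TL;DR: The paper presents BreakFun, a jailbreak method that exploits large language models' (LLMs') adherence to structured schemas to compel them to generate harmful content, with an average attack success rate of 89% and up to 100% on several models. It also proposes a defense, Adversarial Prompt Deconstruction, which extracts human-readable text to reveal malicious intent.
Details
Motivation: To study the potential vulnerability of LLMs when processing structured data, showing that a core strength can become a security weakness. Contribution: Proposes the BreakFun attack, which exploits LLMs' adherence to structured data, and an effective defense, Adversarial Prompt Deconstruction.
Method: A three-part prompt combines an innocent framing and a Chain-of-Thought distraction with a "Trojan Schema" that compels the LLM to generate harmful content; the defense uses a secondary LLM to extract human-readable text and isolate the malicious intent.
Result: The attack reaches an average success rate of 89% across 13 models and 100% on several of them; the defense is highly effective against the attack.
Insight: Core LLM capabilities, such as following structured data, can become security vulnerabilities; defenses should target the attack vector itself.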
Abstract: The proficiency of Large Language Models (LLMs) in processing structured data and adhering to syntactic rules is a capability that drives their widespread adoption but also makes them paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM’s adherence to structured schemas. BreakFun employs a three-part prompt that combines an innocent framing and a Chain-of-Thought distraction with a core “Trojan Schema”–a carefully crafted data structure that compels the model to generate harmful content, exploiting the LLM’s strong tendency to follow structures and schemas. We demonstrate this vulnerability is highly transferable, achieving an average success rate of 89% across 13 foundational and proprietary models on JailbreakBench, and reaching a 100% Attack Success Rate (ASR) on several prominent models. A rigorous ablation study confirms this Trojan Schema is the attack’s primary causal factor. To counter this, we introduce the Adversarial Prompt Deconstruction guardrail, a defense that utilizes a secondary LLM to perform a “Literal Transcription”–extracting all human-readable text to isolate and reveal the user’s true harmful intent. Our proof-of-concept guardrail demonstrates high efficacy against the attack, validating that targeting the deceptive schema is a viable mitigation strategy. Our work provides a look into how an LLM’s core strengths can be turned into critical weaknesses, offering a fresh perspective for building more robustly aligned models.
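On the defensive side, the "Literal Transcription" idea can be approximated without a model for the simple case of JSON input. The sketch below is a deterministic stand-in, not the paper's guardrail (which uses a secondary LLM): it flattens a structured prompt into its keys and string values so that a downstream safety check sees plain text instead of a schema. The function name and the choice to include keys are assumptions.

```python
import json


def literal_transcription(structured_prompt):
    """Flatten a JSON document into the human-readable text it contains.

    Collects every key and every string value in document order and
    joins them with spaces; numbers, booleans, and nulls are dropped.
    """
    pieces = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                pieces.append(str(key))
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            pieces.append(node)

    walk(json.loads(structured_prompt))
    return " ".join(pieces)


prompt = '{"task": {"title": "write a poem", "steps": ["about", "cats"]}, "n": 3}'
print(literal_transcription(prompt))
# task title write a poem steps about cats n
```

The flattened text would then be passed to whatever intent classifier or moderation step the deployment already uses; that step is outside this sketch.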
cs.AI
[110] SMaRT: Select, Mix, and ReinvenT – A Strategy Fusion Framework for LLM-Driven Reasoning and Planning
Nikhil Verma,Manasa Bharadwaj,Wonjun Jang,Harmanpreet Singh,Yixiao Wang,Homa Fashandi,Chul Lee
Main category: cs.AI
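TL;DR: The SMaRT framework intelligently integrates multiple reasoning strategies to improve large language model performance on complex tasks, outperforming existing methods on several benchmarks.
Details
Motivation: Existing LLM methods mostly rely on a single reasoning strategy and cannot exploit the strengths of different strategies, which limits performance and robustness. Contribution: Proposes the SMaRT framework, which selects, mixes, and reinvents strategies to integrate diverse reasoning approaches into a task-driven strategy fusion.
Method: SMaRT uses the LLM as an intelligent integrator rather than merely an evaluator, dynamically selecting and combining the best strategies to produce efficient and robust solutions.
Result: On reasoning, planning, and sequential decision-making tasks, SMaRT outperforms existing methods in solution quality, constraint adherence, and performance metrics.
Insight: Through strategy fusion, SMaRT shows the potential of cross-strategy calibration and offers a new paradigm for adaptive reasoning and decision-making with LLMs.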
Abstract: Large Language Models (LLMs) have redefined complex task automation with exceptional generalization capabilities. Despite these advancements, state-of-the-art methods rely on single-strategy prompting, missing the synergy of diverse reasoning approaches. No single strategy excels universally, highlighting the need for frameworks that fuse strategies to maximize performance and ensure robustness. We introduce the Select, Mix, and ReinvenT (SMaRT) framework, an innovative strategy fusion approach designed to overcome this constraint by creating balanced and efficient solutions through the seamless integration of diverse reasoning strategies. Unlike existing methods, which employ LLMs merely as evaluators, SMaRT uses them as intelligent integrators, unlocking the “best of all worlds” across tasks. Extensive empirical evaluations across benchmarks in reasoning, planning, and sequential decision-making highlight the robustness and adaptability of SMaRT. The framework consistently outperforms state-of-the-art baselines in solution quality, constraint adherence, and performance metrics. This work redefines LLM-driven decision-making by pioneering a new paradigm in cross-strategy calibration, unlocking superior outcomes for reasoning systems and advancing the boundaries of self-refining methodologies.
[111] Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Yihong Dong,Zhaoyu Ma,Xue Jiang,Zhiyuan Fan,Jiaru Qian,Yongmin Li,Jianha Xiao,Zhi Jin,Rongyu Cao,Binhua Li,Fei Huang,Yongbin Li,Ge Li
Main category: cs.AI
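TL;DR: Saber is an efficient sampling algorithm for diffusion language models (DLMs) that focuses on code generation. Using adaptive acceleration and backtracking-enhanced remasking, it eases the trade-off between inference speed and output quality in DLMs and improves both.
Details
Motivation: DLMs face a severe trade-off between inference speed and output quality on code generation. Reducing the number of sampling steps usually causes performance to collapse, so an efficient sampling method is needed. Contribution: Proposes Saber, a training-free efficient sampling method that uses adaptive acceleration and a backtracking mechanism to improve DLM performance on code generation.
Method: Saber builds on two observations: 1) sampling can be adaptively accelerated as more of the code context is established; 2) a backtracking mechanism is needed to reverse generated tokens. The method combines adaptive acceleration with backtracking-enhanced remasking.
Result: On multiple mainstream code generation benchmarks, Saber improves Pass@1 accuracy by 1.9% on average over mainstream DLM sampling methods while achieving an average 251.4% inference speedup.
Insight: Saber shows the potential of DLMs for code generation, narrowing the gap with autoregressive models while retaining the DLM advantages of parallel generation and bidirectional context modeling.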
Abstract: Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average improvement of 1.9% over mainstream DLM sampling methods, meanwhile achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
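One way to picture the two ingredients is a single sampling step that both commits and retracts tokens. The sketch below is a toy illustration under assumptions, not Saber's algorithm: it assumes the model supplies a proposed token and a confidence for every position, uses a confidence threshold so that more tokens are accepted per step as confidences rise, and remasks committed tokens whose confidence falls below a second threshold. All names and thresholds are hypothetical.

```python
MASK = None  # placeholder for a position that has not been decided yet


def sampling_step(tokens, proposals, confidences, accept_threshold, remask_threshold):
    """One toy denoising step with parallel acceptance and backtracking.

    tokens: current sequence, with MASK at undecided positions.
    proposals: the model's best token for every position (hypothetical input).
    confidences: the model's confidence for every position; for committed
        positions this is the confidence in the committed token.
    """
    updated = list(tokens)
    masked = [i for i, tok in enumerate(tokens) if tok is MASK]

    # Backtracking: retract committed tokens the model no longer trusts.
    for i, tok in enumerate(tokens):
        if tok is not MASK and confidences[i] < remask_threshold:
            updated[i] = MASK

    # Adaptive acceleration: commit every masked position above the threshold,
    # falling back to the single most confident one so the step makes progress.
    accepted = [i for i in masked if confidences[i] >= accept_threshold]
    if masked and not accepted:
        accepted = [max(masked, key=lambda i: confidences[i])]
    for i in accepted:
        updated[i] = proposals[i]
    return updated


step = sampling_step(
    tokens=["def", MASK, MASK, "x"],
    proposals=["def", "f", "(", "y"],
    confidences=[0.99, 0.95, 0.4, 0.1],
    accept_threshold=0.9,
    remask_threshold=0.2,
)
# ['def', 'f', None, None]: 'f' is committed, 'x' is remasked for another try
```

A practical sampler would also need a guard against endlessly remasking the same position, for example a cap on how often a token can be retracted.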
[112] Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents
Feifan Xia,Yuyang Fang,Defang Li,Yantong Xie,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang
Main category: cs.AI
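TL;DR: The paper proposes a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue, which dynamically updates a belief distribution to make dialogue strategies more adaptive.
Details
Motivation: In multi-turn social dialogue, understanding and inferring a partner's latent intentions is key to improving the social intelligence of LLM agents. Contribution: Proposes a probabilistic intent modeling framework that dynamically updates a belief distribution over the partner's intentions and provides additional context for the policy.
Method: The belief distribution is initialized from contextual priors and dynamically updated through likelihood estimation after each utterance.
Result: In the SOTOPIA environment, the framework raises the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard over the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions.
Insight: Probabilistic intent modeling can improve the social intelligence of LLM agents in multi-turn dialogue.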
Abstract: We present a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue. The framework maintains a belief distribution over a partner’s latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared with the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.
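The belief update described in the abstract can be read as a repeated application of Bayes' rule. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the intent labels, the prior, and the per-utterance likelihood values are all hypothetical, and in the paper the likelihoods are estimated rather than hard-coded.

```python
def update_belief(belief, likelihoods):
    """One Bayesian update: posterior(intent) is proportional to
    likelihood(utterance | intent) * prior(intent)."""
    unnormalized = {
        intent: belief[intent] * likelihoods.get(intent, 0.0)
        for intent in belief
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        # The utterance is impossible under every hypothesis; keep the prior.
        return dict(belief)
    return {intent: weight / total for intent, weight in unnormalized.items()}


# Hypothetical contextual prior over a partner's latent intentions.
prior = {"cooperate": 0.5, "compete": 0.3, "deceive": 0.2}

# Hypothetical likelihoods of two observed utterances under each intent.
utterance_likelihoods = [
    {"cooperate": 0.7, "compete": 0.2, "deceive": 0.4},
    {"cooperate": 0.6, "compete": 0.1, "deceive": 0.5},
]

belief = dict(prior)
for likelihoods in utterance_likelihoods:
    belief = update_belief(belief, likelihoods)

print(belief)
```

The resulting distribution, rather than a single guessed intent, is what would be handed to the policy as extra context, so the agent can stay cautious while the belief is still spread across several intents.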
[113] FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
Keivan Shariatmadar,Ahmad Osman,Ramin Ray,Usman Dildar,Kisam Kim
Main category: cs.AI
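TL;DR: FST.ai 2.0 is an explainable AI ecosystem for fair, fast, and inclusive decision-making in Olympic and Paralympic Taekwondo. It combines pose-based action recognition, epistemic uncertainty modeling, and visual explanation tools to reduce decision time and increase referee trust.
Details
Motivation: To address the need for fair, transparent, and explainable decision-making in Olympic and Paralympic Taekwondo, in particular real-time support for referees, coaches, and athletes and better collaboration among them. Contribution: FST.ai 2.0, an AI ecosystem that integrates pose recognition, uncertainty modeling, and visual explanations, substantially reducing decision time and raising referee trust.
Method: Graph convolutional networks (GCNs) for pose-based action recognition, credal sets for modeling epistemic uncertainty, and explainability overlays combined with interactive dashboards.
Result: Experimental validation shows an 85% reduction in decision review time and 93% referee trust in AI-assisted decisions.
Insight: By combining real-time perception, explainable inference, and governance-aware design, FST.ai 2.0 shows the potential of trustworthy AI in sports.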
Abstract: Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents FST.ai 2.0, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates pose-based action recognition using graph convolutional networks (GCNs), epistemic uncertainty modeling through credal sets, and explainability overlays for visual decision support. A set of interactive dashboards enables human–AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai 2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an 85% reduction in decision review time and 93% referee trust in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai 2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
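A credal set is a convex set of probability distributions, and its lower and upper class probabilities are attained at its extreme points. The sketch below shows how such bounds can flag an ambiguous call for human review. It rests on assumptions the abstract does not confirm: the extreme points are taken to be the outputs of a few hypothetical classifiers, the class names are made up, and the decision rule is standard interval dominance.

```python
def credal_bounds(extreme_points):
    """Lower and upper probability of each class over a finite set of
    distributions treated as the extreme points of a credal set."""
    classes = list(extreme_points[0])
    lower = {c: min(p[c] for p in extreme_points) for c in classes}
    upper = {c: max(p[c] for p in extreme_points) for c in classes}
    return lower, upper


def non_dominated(lower, upper):
    """Interval dominance: drop class c if some other class has a lower
    probability that exceeds the upper probability of c."""
    return [
        c for c in lower
        if not any(lower[other] > upper[c] for other in lower if other != c)
    ]


# Hypothetical outputs of three models for a clear action.
clear_case = [
    {"valid_kick": 0.70, "no_score": 0.30},
    {"valid_kick": 0.55, "no_score": 0.45},
    {"valid_kick": 0.80, "no_score": 0.20},
]
# Hypothetical outputs of two models that disagree.
ambiguous_case = [
    {"valid_kick": 0.60, "no_score": 0.40},
    {"valid_kick": 0.45, "no_score": 0.55},
]

clear_decision = non_dominated(*credal_bounds(clear_case))
ambiguous_decision = non_dominated(*credal_bounds(ambiguous_case))
print(clear_decision, ambiguous_decision)
```

When more than one class survives, the system cannot commit to a single label, which is the kind of case a referee-facing tool would surface for review instead of scoring automatically.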
[114] Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
Patterson Hsieh,Jerry Yeh,Mao-Chi He,Wen-Han Hsieh,Elvis Hsieh
Main category: cs.AI
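TL;DR: The paper proposes ALGOS, a system that combines a vision-language model (VLM) with remote sensing image segmentation for monitoring harmful algal blooms (HABs) and estimating their severity. Using GeoSAM-assisted human evaluation for mask curation and fine-tuning on the CAML labels, it performs well on both segmentation and severity prediction.
Details
Motivation: Climate change is intensifying harmful algal blooms (HABs), and existing monitoring methods such as manual sampling are inefficient and limited in coverage. Vision-language models for remote sensing offer potential for automated monitoring, but reasoning over imagery and quantifying severity remain open problems. Contribution: ALGOS, a system that combines remote sensing image segmentation with severity estimation, using GeoSAM-assisted human evaluation and a VLM fine-tuned on the CAML labels to automate HAB monitoring and quantification.
Method: GeoSAM-assisted human evaluation is used to curate high-quality segmentation masks, and a vision-language model is fine-tuned for severity prediction on NASA's Cyanobacteria Aggregated Manual Labels (CAML).
Result: ALGOS performs robustly on both segmentation and severity-level estimation, offering a practical path toward automated bloom monitoring systems.
Insight: Combining vision-language models with remote sensing is an effective route to large-scale environmental monitoring, and high-quality annotation and model fine-tuning are key.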
Abstract: Climate change is intensifying the occurrence of harmful algal blooms (HABs), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes a vision-language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.
cs.CY
[115] Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints
Minfeng Qi,Zhongmin Cao,Qin Wang,Ningran Li,Tianqing Zhu
Main category: cs.CY
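TL;DR: The paper studies the impact of generative large language models (LLMs) on scholarly publishing by analyzing 2.1 million preprints from 2016 to 2025. It finds that LLMs have accelerated submission and revision cycles, modestly increased linguistic complexity, and disproportionately expanded AI-related topics.
Details
Motivation: Preprint repositories have become core infrastructure for scholarly communication, and LLMs may further change how manuscripts are written. Yet systematic evidence on whether and how LLMs reshape scientific publishing is lacking. Contribution: A multi-level analytical framework that combines time-series models, collaboration and productivity metrics, linguistic profiling, and topic modeling, providing a first empirical foundation for assessing the influence of LLMs on academic publishing.
Method: Interrupted time-series models, collaboration and productivity metrics, linguistic analysis, and topic modeling applied to data from 2.1 million preprints.
Result: LLMs have accelerated submission and revision cycles, modestly increased linguistic complexity, and expanded AI-related content, with computationally intensive fields benefiting more than others.
Insight: LLMs are not universal disruptors but selective catalysts that amplify existing strengths and widen disciplinary divides. The study points to the need for governance frameworks that preserve trust, fairness, and accountability in an AI-enabled research ecosystem.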
Abstract: Preprint repositories have become central infrastructures for scholarly communication. Their expansion transforms how research is circulated and evaluated before journal publication. Generative large language models (LLMs) introduce a further potential disruption by altering how manuscripts are written. While speculation abounds, systematic evidence of whether and how LLMs reshape scientific publishing remains limited. This paper addresses the gap through a large-scale analysis of more than 2.1 million preprints spanning 2016–2025 (115 months) across four major repositories (i.e., arXiv, bioRxiv, medRxiv, SocArXiv). We introduce a multi-level analytical framework that integrates interrupted time-series models, collaboration and productivity metrics, linguistic profiling, and topic modeling to assess changes in volume, authorship, style, and disciplinary orientation. Our findings reveal that LLMs have accelerated submission and revision cycles, modestly increased linguistic complexity, and disproportionately expanded AI-related topics, while computationally intensive fields benefit more than others. These results show that LLMs act less as universal disruptors than as selective catalysts, amplifying existing strengths and widening disciplinary divides. By documenting these dynamics, the paper provides the first empirical foundation for evaluating the influence of generative AI on academic publishing and highlights the need for governance frameworks that preserve trust, fairness, and accountability in an AI-enabled research ecosystem.
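An interrupted time-series analysis asks whether the level and the trend of a series change at a known intervention point. The sketch below fits one straight line before and one after the interruption and reports the jump and the slope change. It is only an illustration: the abstract does not give the paper's model specification, and the monthly counts, the interruption month `t0`, and the function names are assumptions made here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interrupted_time_series(times, values, t0):
    """Fit separate pre- and post-interruption lines and compare them at t0."""
    pre = [(t, y) for t, y in zip(times, values) if t < t0]
    post = [(t, y) for t, y in zip(times, values) if t >= t0]
    pre_slope, pre_intercept = fit_line(*zip(*pre))
    post_slope, post_intercept = fit_line(*zip(*post))
    # Level change: gap between the two fitted lines evaluated at t0.
    level_change = (post_intercept + post_slope * t0) - (pre_intercept + pre_slope * t0)
    return {
        "pre_slope": pre_slope,
        "level_change": level_change,
        "slope_change": post_slope - pre_slope,
    }


# Synthetic, noise-free monthly submission counts (made up for illustration):
# baseline trend of +2 per month, then a jump of 10 and an extra +3 per month.
t0 = 12
months = list(range(24))
counts = [
    100 + 2 * t + (10 + 3 * (t - t0) if t >= t0 else 0)
    for t in months
]

effect = interrupted_time_series(months, counts, t0)
print(effect)
```

Fitting the two segments separately gives the same estimates as a single segmented regression with a step term and a post-interruption slope term. A real analysis would also need to handle noise, seasonality, and autocorrelation, which this sketch ignores.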
[116] Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning
Kush Juvekar,Arghya Bhattacharya,Sai Khadloya,Utkarsh Saxena
Main category: cs.CY
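TL;DR: The paper evaluates large language models (LLMs) on Indian legal reasoning, building a multi-metric benchmark from public legal examinations that exposes the models' strengths and limitations.
Details
Motivation: To fill the gap in evaluating LLMs in the legal domain, specifically their competence within the Indian legal system. Contribution: The first benchmark grounded in Indian legal examinations, covering objective questions and long-form answers, released with datasets and evaluation protocols.
Method: A multi-year exam benchmark is assembled; models are evaluated on objective questions and on long-form answers graded blind by lawyers, and their performance and failure modes are analyzed.
Result: Frontier LLMs perform strongly on objective exams, but none surpasses the human topper on long-form reasoning, with weaknesses in format compliance, citation discipline, and structure.
Insight: LLMs show promise for legal assistance tasks such as statute and precedent lookups, but tasks requiring forum-specific drafting, procedural strategy, and ethical judgment still need human leadership.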
Abstract: Large language models (LLMs) are entering legal workflows, yet we lack a jurisdiction-specific framework to assess their baseline competence therein. We use India’s public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond multiple-choice questions, we also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court’s Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. Our work shows that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural or format compliance, authority or citation discipline, and forum-appropriate voice and structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.
[117] Interpretability Framework for LLMs in Undergraduate Calculus
Sagnik Dakshit,Sushmita Sinha Roy
Main category: cs.CY
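TL;DR: The paper designs an interpretability framework for large language models (LLMs) in undergraduate calculus, decomposing and analyzing the models' reasoning to assess their pedagogical validity.
Details
Motivation: Conventional evaluation focuses only on final-answer accuracy and overlooks how LLMs handle multistep logic, symbolic reasoning, and conceptual clarity. The paper aims to fill this gap with a more comprehensive evaluation framework. Contribution: A novel interpretability framework that combines reasoning flow extraction, analysis of semantically labeled operations, and prompt ablation to quantify the quality of LLM reasoning on mathematical problems.
Method: 1) Extract the reasoning flow and label semantic operations; 2) assess input salience and output stability through prompt ablation analysis; 3) evaluate model behavior with structured metrics such as reasoning complexity and phrase sensitivity.
Result: LLM-generated solutions are syntactically fluent but often conceptually flawed, and their reasoning patterns are sensitive to prompt phrasing and input variation.
Insight: The framework lays a foundation for the transparent and responsible deployment of AI in STEM education, supports fine-grained diagnosis of LLM reasoning failures, and informs the design of interpretable AI-assisted feedback tools.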
Abstract: Large Language Models (LLMs) are increasingly being used in education, yet their correctness alone does not capture the quality, reliability, or pedagogical validity of their problem-solving behavior, especially in mathematics, where multistep logic, symbolic reasoning, and conceptual clarity are critical. Conventional evaluation methods largely focus on final answer accuracy and overlook the reasoning process. To address this gap, we introduce a novel interpretability framework for analyzing LLM-generated solutions using undergraduate calculus problems as a representative domain. Our approach combines reasoning flow extraction, which decomposes solutions into semantically labeled operations and concepts, with prompt ablation analysis to assess input salience and output stability. Using structured metrics such as reasoning complexity, phrase sensitivity, and robustness, we evaluated the model behavior on real Calculus I to III university exams. Our findings revealed that LLMs often produce syntactically fluent yet conceptually flawed solutions, with reasoning patterns sensitive to prompt phrasing and input variation. This framework enables fine-grained diagnosis of reasoning failures, supports curriculum alignment, and informs the design of interpretable AI-assisted feedback tools. This is the first study to offer a structured, quantitative, and pedagogically grounded framework for interpreting LLM reasoning in mathematics education, laying the foundation for the transparent and responsible deployment of AI in STEM learning environments.
eess.SY
[118] DMTrack: Deformable State-Space Modeling for UAV Multi-Object Tracking with Kalman Fusion and Uncertainty-Aware Association
Zenghuang Fu,Xiaofeng Han,Mingda Jia,Jin ming Yang,Qi Zeng,Muyang Zahng,Changwei Wang,Weiliang Meng,Xiaopeng Zhang
Main category: eess.SY
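TL;DR: DMTrack is an adaptive motion tracking framework for UAV multi-object tracking. Its three key components address motion prediction and identity preservation from aerial viewpoints, and it achieves state-of-the-art performance on two benchmarks.
Details
Motivation: Multi-object tracking from UAVs faces unpredictable object motion, frequent occlusions, and limited appearance cues, and conventional motion models cannot handle these non-linear dynamics and high-uncertainty scenes effectively. Contribution: 1) DeformMamba, a deformable state-space predictor for adaptive trajectory modeling; 2) MotionGate, a lightweight gating module that fuses Kalman and Mamba predictions; 3) an uncertainty-aware association strategy that improves identity consistency.
Method: A deformable state-space model (DeformMamba) is combined with a Kalman filter, MotionGate dynamically fuses their predictions, and an uncertainty-aware association strategy refines track association.
Result: On the VisDrone-MOT and UAVDT benchmarks, DMTrack achieves state-of-the-art identity consistency and tracking accuracy, particularly under high-speed and non-linear motion.
Insight: Even without appearance models, adaptive motion modeling and uncertainty-aware association can substantially improve the robustness and efficiency of UAV multi-object tracking.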
Abstract: Multi-object tracking (MOT) from unmanned aerial vehicles (UAVs) presents unique challenges due to unpredictable object motion, frequent occlusions, and limited appearance cues inherent to aerial viewpoints. These issues are further exacerbated by abrupt UAV movements, leading to unreliable trajectory estimation and identity switches. Conventional motion models, such as Kalman filters or static sequence encoders, often fall short in capturing both linear and non-linear dynamics under such conditions. To tackle these limitations, we propose DMTrack, a deformable motion tracking framework tailored for UAV-based MOT. Our DMTrack introduces three key components: DeformMamba, a deformable state-space predictor that dynamically aggregates historical motion states for adaptive trajectory modeling; MotionGate, a lightweight gating module that fuses Kalman and Mamba predictions based on motion context and uncertainty; and an uncertainty-aware association strategy that enhances identity preservation by aligning motion trends with prediction confidence. Extensive experiments on the VisDrone-MOT and UAVDT benchmarks demonstrate that our DMTrack achieves state-of-the-art performance in identity consistency and tracking accuracy, particularly under high-speed and non-linear motion. Importantly, our method operates without appearance models and maintains competitive efficiency, highlighting its practicality for robust UAV-based tracking.
cs.SE
[119] CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
Xue Jiang,Yihong Dong,Mengyang Liu,Hongyi Deng,Tian Wang,Yongding Tao,Rongyu Cao,Binhua Li,Zhi Jin,Wenpin Jiao,Fei Huang,Yongbin Li,Ge Li
Main category: cs.SE
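TL;DR: CodeRL+ improves code generation by aligning execution semantics within reinforcement learning, clearly outperforming existing baselines.
Details
Motivation: Large language models (LLMs) excel at code generation, but they are trained on textual patterns, leaving a semantic gap to the execution semantics that govern functional correctness. Existing approaches such as RLVR rely only on binary test-case outcomes and struggle to capture subtle logical errors in code. Contribution: CodeRL+, which aligns execution semantics through variable-level execution trajectories and strengthens the learning signal for code generation models.
Method: Execution semantics alignment is constructed directly from existing on-policy rollouts and integrates seamlessly with various reinforcement learning algorithms.
Result: CodeRL+ outperforms post-training baselines (including RLVR and distillation), with a 4.6% average relative improvement in pass@1, and yields 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively.
Insight: CodeRL+ effectively narrows the gap between the textual representation of code and its execution semantics, and it applies broadly and robustly across RL algorithms and LLMs.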
Abstract: While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code’s textual representations and its underlying execution semantics.
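A variable-level execution trajectory is the sequence of variable values a program takes as it runs. The sketch below shows one way such a ground-truth trajectory could be recorded in Python with the standard `sys.settrace` hook, so that a model's inferred trajectory has something to be compared against. This is an assumption about how reference data might be collected, not the CodeRL+ pipeline; `trace_variables` and `running_sum` are hypothetical names.

```python
import sys


def trace_variables(func, *args):
    """Run func(*args) and record a copy of its local variables before each
    executed line, plus the final state when the function returns."""
    trajectory = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target_code:
            return None  # do not trace frames of other functions
        if event == "line":
            relative_line = frame.f_lineno - target_code.co_firstlineno
            trajectory.append((relative_line, dict(frame.f_locals)))
        elif event == "return":
            trajectory.append(("return", dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, trajectory


def running_sum(values):
    total = 0
    for v in values:
        total += v
    return total


result, trajectory = trace_variables(running_sum, [1, 2, 3])
for step, variables in trajectory:
    print(step, variables)
```

A binary pass/fail reward only sees `result`. The recorded states show where an intermediate value first goes wrong, which is the finer-grained signal the abstract argues for.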
[120] CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Haojia Lin,Xiaoyu Tan,Yulei Qin,Zihan Xu,Yuchen Shi,Zongyi Li,Gang Li,Shaofei Cai,Siqi Cai,Chaoyou Fu,Ke Li,Xing Sun
Main category: cs.SE
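TL;DR: CUARewardBench is the first benchmark for evaluating reward models on computer-using agent (CUA) tasks, supporting systematic assessment of both outcome reward models (ORM) and process reward models (PRM). With a diverse dataset and a comprehensive analysis, it exposes the limitations of existing models and proposes a new ensemble method, UPE.
Details
Motivation: Script-based verifiers scale poorly and cannot provide step-wise assessment of CUA tasks. Reward models are a promising alternative, but their effectiveness on CUA evaluation has not been studied in depth. Contribution: 1) The first comprehensive CUA reward benchmark; 2) a diverse, practical, and reliable dataset; 3) a comprehensive analysis revealing the limitations of existing models; 4) the UPE method for more reliable reward models.
Method: CUARewardBench covers trajectories from 10 software categories and 7 agent architectures. Experiments span 7 vision-language models and 3 prompt templates, and lead to the Unanimous Prompt Ensemble (UPE) method.
Result: UPE achieves 89.8% precision and 93.3% negative predictive value (NPV) for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single models and traditional ensemble approaches.
Insight: Current CUA reward models suffer from insufficient visual reasoning and knowledge deficiencies, and general vision-language models (VLMs) outperform specialized CUA models for reward evaluation.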
Abstract: Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
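Precision and NPV measure how trustworthy a reward model's positive and negative verdicts are, and strict unanimous voting trades coverage for that trustworthiness. The sketch below is a minimal illustration of both ideas. The abstract does not say how UPE treats disagreement, so abstaining when voters disagree is an assumption made here, and the votes and labels are invented.

```python
def unanimous_vote(votes):
    """Return True or False only when every vote agrees, otherwise None
    (abstain). Abstaining on disagreement is an assumption of this sketch."""
    if not votes:
        return None
    if all(votes):
        return True
    if not any(votes):
        return False
    return None


def precision_and_npv(predictions, labels):
    """Precision = TP / (TP + FP); NPV = TN / (TN + FN).
    Abstentions (None) are skipped."""
    tp = fp = tn = fn = 0
    for predicted, actual in zip(predictions, labels):
        if predicted is None:
            continue
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return precision, npv


# Hypothetical success/failure votes from three prompt templates per trajectory.
votes_per_trajectory = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, True],
    [False, False, False],
    [False, True, False],
]
# Hypothetical expert labels for the same trajectories.
labels = [True, True, False, False, False, True]

decisions = [unanimous_vote(votes) for votes in votes_per_trajectory]
precision, npv = precision_and_npv(decisions, labels)
print(decisions, precision, npv)
```

In this toy data the two split votes are exactly the cases a majority vote would have had to guess on, which shows why a unanimity rule can raise precision and NPV at the cost of leaving some trajectories unjudged.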
cs.LG
[121] Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Howard Chen,Noam Razin,Karthik Narasimhan,Danqi Chen
Main category: cs.LG
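TL;DR: Comparing forgetting under supervised fine-tuning (SFT) and reinforcement learning (RL) in language model (LM) post-training, the paper finds that RL retains prior knowledge better than SFT while matching or exceeding it on the target task. Theory and experiments attribute this robustness to RL's use of on-policy data, and the authors highlight the practicality of approximately on-policy data.
Details
Motivation: To address catastrophic forgetting during language model post-training, compare the forgetting patterns of SFT and RL, and identify efficient ways to mitigate forgetting. Contribution: 1) A systematic comparison of forgetting under SFT and RL across LM families and tasks; 2) evidence that RL's robustness stems from on-policy data; 3) a demonstration of the practicality of approximately on-policy data.
Method: SFT and RL are compared experimentally, a simplified model is used to analyze the mode-seeking behavior of RL, and further experiments verify the key role of on-policy data.
Result: RL forgets substantially less than SFT while achieving comparable or higher target-task performance; on-policy data is the key factor behind RL's robustness.
Insight: The mode-seeking nature of RL and its use of on-policy data are key to mitigating forgetting, and approximately on-policy data offers an efficient alternative.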
Abstract: Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities – a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
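The term "mode-seeking" is usually explained by contrasting the two directions of the KL divergence. Maximum-likelihood training such as SFT is commonly associated with the forward direction KL(p‖q), which penalizes a model q for missing any region where the target p has mass. Objectives evaluated on the model's own samples are associated with the reverse direction KL(q‖p), which tolerates concentrating on one mode. The numeric sketch below illustrates only this standard terminology on made-up discrete distributions. It does not reproduce the paper's two-component mixture analysis.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up bimodal target and two candidate approximations.
target = [0.45, 0.04, 0.02, 0.04, 0.45]
q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]   # spreads mass over everything
q_mode = [0.90, 0.07, 0.01, 0.01, 0.01]    # concentrates on one mode

forward = {"cover": kl(target, q_cover), "mode": kl(target, q_mode)}
reverse = {"cover": kl(q_cover, target), "mode": kl(q_mode, target)}

print("forward KL(p||q):", forward)   # smaller for the mass-covering candidate
print("reverse KL(q||p):", reverse)   # smaller for the mode-seeking candidate
```

The forward divergence prefers the candidate that covers both modes, while the reverse divergence prefers the one that commits to a single mode.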
[122] NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation
Rikard Vinge,Isabelle Wittmann,Jannik Schneider,Michael Marszalek,Luis Gilch,Thomas Brunschwiler,Conrad M Albrecht
Main category: cs.LG
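TL;DR: NeuCo-Bench is a new benchmark framework for evaluating neural compression and representation learning in Earth Observation (EO), comprising reusable embeddings, a hidden-task leaderboard, and a scoring system.
Details
Motivation: There is currently no standardized framework for measuring how neural embeddings perform on Earth Observation tasks; NeuCo-Bench aims to fill this gap. Contribution: The NeuCo-Bench framework, with reusable embeddings, a hidden-task leaderboard, and a scoring system that balances accuracy and stability, together with the release of the SSL4EO-S12-downstream dataset.
Method: Fixed-size embeddings serve as compact, task-agnostic representations, around which an evaluation pipeline, a challenge mode, and a scoring system are designed.
Result: Initial results from a public challenge at the 2025 CVPR EARTHVISION workshop are presented, along with ablations using state-of-the-art foundation models.
Insight: NeuCo-Bench offers a community-driven, standardized tool for evaluating neural embeddings in Earth Observation, promoting reproducibility and fairness.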
Abstract: We introduce NeuCo-Bench, a novel benchmark framework for evaluating (lossy) neural compression and representation learning in the context of Earth Observation (EO). Our approach builds on fixed-size embeddings that act as compact, task-agnostic representations applicable to a broad range of downstream tasks. NeuCo-Bench comprises three core components: (i) an evaluation pipeline built around reusable embeddings, (ii) a new challenge mode with a hidden-task leaderboard designed to mitigate pretraining bias, and (iii) a scoring system that balances accuracy and stability. To support reproducibility, we release SSL4EO-S12-downstream, a curated multispectral, multitemporal EO dataset. We present initial results from a public challenge at the 2025 CVPR EARTHVISION workshop and conduct ablations with state-of-the-art foundation models. NeuCo-Bench provides a first step towards community-driven, standardized evaluation of neural embeddings for EO and beyond.
[123] Demystifying Transition Matching: When and Why It Can Beat Flow Matching
Jaihoon Kim,Rajarshi Saha,Minhyuk Sung,Youngsuk Park
Main category: cs.LG
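TL;DR: The paper analyzes when and why Transition Matching (TM) outperforms Flow Matching (FM). Theory and experiments show that TM does better when the target distribution is a unimodal Gaussian, or multimodal with well-separated modes and non-negligible variance.
Details
Motivation: FM underpins many state-of-the-art generative models, but recent results indicate that TM can reach higher quality with fewer sampling steps. The paper aims to pin down the conditions under which, and the reasons why, TM outperforms FM, filling a theoretical gap. Contribution: 1. A proof that TM attains strictly lower KL divergence than FM for a unimodal Gaussian target; 2. an analysis of local-unimodality conditions for TM on Gaussian mixtures; 3. a characterization of TM's advantage when modes are well separated and variances are non-negligible, validated experimentally.
Method: 1. Theoretical analysis comparing the KL divergence and convergence rates of TM and FM for a unimodal Gaussian; 2. extension to Gaussian mixtures, analyzing TM under local unimodality; 3. experimental validation on synthetic data and on real image and video generation tasks.
Result: TM outperforms FM when the target distribution has well-separated modes and non-negligible variance; as the target variance approaches zero, TM's advantage diminishes. Experiments confirm the theory and show TM's potential in real applications.
Insight: TM's stochastic latent updates preserve the target covariance, whereas FM's deterministic updates underestimate it, which is the core reason for TM's advantage. The results offer theoretical guidance for choosing between TM and FM in generative models.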
Abstract: Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
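The claim that deterministic FM underestimates the target covariance with a finite number of steps can be checked by hand in one dimension. The sketch below assumes a setup the abstract does not spell out: a zero-mean Gaussian source and target, the linear path x_t = (1 - t) x_0 + t x_1 with independent endpoints, the exact marginal velocity u_t(x) = E[x_1 - x_0 | x_t = x], and plain Euler integration. Under these assumptions the velocity is linear in x, so the sample variance after sampling has a closed form.

```python
def fm_euler_variance(target_var, num_steps, source_var=1.0):
    """Variance of the sample after num_steps Euler steps of the exact
    flow-matching velocity field between two zero-mean 1-D Gaussians."""
    step = 1.0 / num_steps
    scale = 1.0  # x_k = scale * x_0, because the velocity is linear in x
    for k in range(num_steps):
        t = k * step
        # Var(x_t) for x_t = (1 - t) x_0 + t x_1 with independent endpoints.
        marginal_var = (1 - t) ** 2 * source_var + t ** 2 * target_var
        # u_t(x) = Cov(x_1 - x_0, x_t) / Var(x_t) * x
        slope = (t * target_var - (1 - t) * source_var) / marginal_var
        scale *= 1.0 + step * slope
    return scale ** 2 * source_var


target_var = 4.0
for num_steps in (1, 2, 4, 64):
    print(num_steps, fm_euler_variance(target_var, num_steps))
```

For this example every finite step count yields a variance below the target of 4.0, and the gap closes as the number of steps grows. This is a toy check of the underestimation effect only. It says nothing about TM itself, whose stochastic updates are not modeled here.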
[124] From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Ziwei Huang,Ying Shu,Hao Fang,Quanyu Long,Wenya Wang,Qiushi Guo,Tiezheng Ge,Leilei Gan
Main category: cs.LG
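TL;DR: The paper proposes the Customized-GRPO framework, which uses Synergy-Aware Reward Shaping and Time-Aware Dynamic Weighting to resolve the competition between identity preservation and prompt adherence in subject-driven image generation, clearly improving generation quality.
Details
Motivation: Subject-driven image generation models face an inherent tension between identity preservation and prompt adherence, and applying GRPO naively leads to competitive degradation. Contribution: 1. Synergy-Aware Reward Shaping (SARS), a non-linear reward mechanism that reduces conflicting signals; 2. Time-Aware Dynamic Weighting (TDW), which adjusts optimization weights to match temporal dynamics.
Method: 1. SARS sharpens the gradient by penalizing conflicting reward signals and amplifying synergistic ones; 2. TDW adjusts reward weights according to the temporal dynamics of the diffusion process.
Result: Experiments show that the method significantly outperforms naive GRPO baselines and mitigates competitive degradation, producing images that do better on both identity preservation and prompt adherence.
Insight: Non-linear reward shaping and time-dependent weighting are key to optimizing identity preservation and prompt adherence jointly.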
Abstract: Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model’s temporal dynamics by prioritizing prompt-following in the early stages and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
[125] Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches
Mustafa Fuad Rifet Ibrahim,Tunc Alkanat,Maurice Meijer,Felix Manthey,Alexander Schlaefer,Peer Stelldinger
Main category: cs.LG
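TL;DR: The paper proposes a lightweight convolutional neural network (CNN) that classifies synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings, enabling efficient cardiovascular monitoring on resource-constrained medical edge devices.
Details
Motivation: Early detection of cardiovascular disease is urgently needed, but devices must balance accuracy, computational efficiency, and energy consumption. Contribution: A multi-modal CNN with early fusion that cuts compute and memory cost by three orders of magnitude while maintaining competitive accuracy.
Method: A lightweight CNN takes synchronized ECG and PCG recordings as input and is trained and validated on the Physionet Challenge 2016 dataset.
Result: The model is efficient in compute and energy and suitable for edge deployment; on-device inference can be more energy-efficient than continuous data streaming.
Insight: Lightweight models have practical potential on medical edge devices and can support efficient real-time monitoring.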
Abstract: The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
cs.RO
[126] Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
Yulin Luo,Chun-Kai Fan,Menghang Dong,Jiayu Shi,Mengdi Zhao,Bo-Wen Zhang,Cheng Chi,Jiaming Liu,Gaole Dai,Rongyu Zhang,Ruichuan An,Kun Wu,Zhengping Che,Shaoxuan Xie,Guocai Yao,Zhongxia Zhao,Pengwei Wang,Guang Liu,Zhongyuan Wang,Tiejun Huang,Shanghang Zhang
Main category: cs.RO
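TL;DR: RoboBench is a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains across several cognitive dimensions, addressing the limited task realism and incomplete coverage of existing benchmarks.
Details
Motivation: Existing robotics benchmarks focus mainly on execution success or on high-level reasoning, without a full assessment of cognitive capability. RoboBench aims to provide a more systematic evaluation framework to advance embodied MLLMs. Contribution: The RoboBench benchmark, which defines five cognitive dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) spanning 14 capabilities, 25 tasks, and 6092 QA pairs, and introduces the MLLM-as-world-simulator framework for evaluating plan feasibility.
Method: RoboBench uses diverse datasets (multi-view scenes, attribute-rich objects) and the MLLM-as-world-simulator framework to evaluate MLLMs systematically across tasks, with particular attention to whether plans are feasible in practice.
Result: Experiments on 14 MLLMs reveal significant limitations in implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis.
Insight: RoboBench exposes the cognitive weaknesses of current MLLMs on embodied tasks and provides quantitative standards and directions for developing next-generation embodied systems.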
Abstract: Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success, or when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator. It evaluates embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and to guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.